Publishing documentation with Markdown, LaTeX, Pandoc and GitHub Actions
- Author: Simon Waight (Mastodon: @simonwaight)
If you've been following me for a while you'll know I'm a huge fan of markdown, GitHub and simplifying the life of developers. In this post I am going to show you how you can use the tools you already know and love, along with some long-standing open source magic, to build a documentation publishing workflow. All without needing to install Word or use Windows, even if you want to publish a Word document!
While co-editing capabilities in Microsoft Word enable easy team editing of documents, there may be times when being constrained to editing in Word (or its web-based version) is too restrictive.
Foundations
In order to get the most out of this approach there are a few items to get in order first:
- A good text editor that supports Markdown and LaTeX (I'm using Visual Studio Code with the Markdown All in One and LaTeX Workshop extensions installed).
- Diagramming extensions such as the excellent draw.io one for Visual Studio Code.
- A GitHub repository that includes:
  - One or more Markdown files, named in the order in which you want them combined and published.
  - A GitHub Action that uses an approach similar to the sample I will cover later in this post.
As a boost to your writing, also look to use an AI assistant such as GitHub Copilot, which can help increase your writing productivity, particularly for documentation that will include code snippets or diagrams.
The final piece of the puzzle is Pandoc, an excellent open source document converter that can take markdown and LaTeX as input and produce (amongst other formats) PDF and Word documents.
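As a rough sketch of what that looks like on the command line, the function below wraps two Pandoc calls that turn a single Markdown file into a PDF and a Word document. The function name `convert_doc` and the file names are made up for illustration. It assumes `pandoc` is on your PATH and that a LaTeX engine such as `xelatex` is installed for the PDF step.

```shell
# Hypothetical helper: convert one Markdown file to PDF and Word.
# Assumes pandoc is on PATH; PDF output also needs a LaTeX engine (xelatex here).
convert_doc() {
  input="$1"            # e.g. 01-intro-chapter.md
  base="${input%.md}"   # strip the .md extension to derive the output names
  pandoc "$input" --pdf-engine=xelatex -o "$base.pdf"
  pandoc "$input" -o "$base.docx"
}
```

Calling `convert_doc 01-intro-chapter.md` would then write `01-intro-chapter.pdf` and `01-intro-chapter.docx` next to the source file. Pandoc picks the output format from the file extension given to `-o`.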
Defining output formats
The output formatting for the process covered in this blog post uses the open source typesetting system LaTeX, which has been around since the 1980s. LaTeX uses a series of commands to control many aspects of output documents. I'm not going to dig into the specifics here as LaTeX is quite complex and can be used to produce very rich documents, but suffice it to say, it gives you a huge amount of flexibility in controlling output formatting!
The beauty of our markdown-based approach that uses Pandoc is we can include LaTeX snippets as externally referenced files, inline in the Pandoc CLI or embedded in our markdown files. You can see a sample file in the section on including images below, and the CLI command in the GitHub Action in the sample repository.
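To make the "embedded in our markdown files" route concrete, here is a small sketch that writes a chapter containing a raw LaTeX `\newpage` command. The function name `write_sample_chapter` and the chapter text are invented for this example. The assumption is that Pandoc's default Markdown reader passes raw LaTeX through to LaTeX and PDF output.

```shell
# Hypothetical demo: write a Markdown chapter that embeds a raw LaTeX command.
write_sample_chapter() {
  cat > "$1" <<'EOF'
# Chapter one

Some introductory text.

\newpage

## A section that starts on a fresh page
EOF
}
```

The other two routes show up as Pandoc arguments: `-H formatting/format.tex` pulls in an external LaTeX file, while `-V geometry:a4paper` sets a value inline on the CLI.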
Content layout of your repository
There are a few ways to approach this, but whichever you choose, you need to be able to list the input markdown files in the order in which you want them published.
This might mean having a directory or markdown file per chapter, but you must ensure ordering is maintained. So, for example, you might have the following using folders:
.
├── 01-chapter
│   └── content.md
└── 02-chapter
    └── content.md
Alternatively you could use just files in the root of the repository, though you will have to exclude your README if you create one and don't want it included in your print edition!
.
├── 01-intro-chapter.md
├── 02-my-secret-of-success.md
└── 03-turning-ideas-into-blogs.md
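A minimal sketch of how that ordering can be picked up in a shell script follows. It relies on shell globs expanding in sorted order, so the numeric prefixes decide the sequence. The function name `list_chapters` is hypothetical, and the README exclusion assumes the file is called `README.md`.

```shell
# Hypothetical helper: print the chapter files of a directory in publishing order.
# Shell globs expand in sorted order, so prefixes like 01-, 02- set the sequence.
list_chapters() {
  dir="${1:-.}"
  for f in "$dir"/*.md; do
    [ -e "$f" ] || continue        # no matches: the glob is left unexpanded
    case "${f##*/}" in
      README.md) continue ;;       # keep the README out of the print edition
    esac
    printf '%s\n' "$f"
  done
}
```

Zero-padding the prefixes matters here: `10-` sorts before `2-`, but after `02-`. For the folder-based layout, the equivalent glob would be `*/content.md`.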
Including images
If you have images you'd like to include you can drop them into subfolders in the same location as the markdown files and then reference them as you normally do in standard markdown.
There are a few items to watch out for though:
- Images float by default when publishing with LaTeX via pandoc. This means images show up wherever there is space in the document, which may not be where you want them to be!
- Images can be oddly sized, especially if you're publishing images that you might have created to display on the web.
- Sub-folders are not automatically handled by pandoc.
So how do we fix these items?
Image float can be managed by defining a LaTeX format file and including the following snippet which ensures images are rendered inline.
% Control Image Float so they retain their position
\usepackage{float}
\let\origfigure\figure
\let\endorigfigure\endfigure
\renewenvironment{figure}[1][2] {
    \expandafter\origfigure\expandafter[H]
} {
    \endorigfigure
}
Image sizing requires you to add a Pandoc attribute to your markdown images. The below value of {height=55%} is the value I'm using for all my images.
![Sample Image.](media/my-file.png 'Sample image.'){height=55%}
In order to fix the final issue we need to make sure we include the following argument in our pandoc CLI call.
-f markdown+rebase_relative_paths
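Here is a short sketch of that argument in a full call, using the folder-per-chapter layout from earlier. The function name `build_book` is made up for this example. With the `rebase_relative_paths` extension enabled, Pandoc rewrites a relative image path such as `media/my-file.png` based on the location of the Markdown file that contains it, rather than the directory Pandoc was started from.

```shell
# Hypothetical helper: combine chapters that keep their images in their own subfolders.
# Usage: build_book OUTPUT FILE...
build_book() {
  output="$1"; shift
  pandoc -f markdown+rebase_relative_paths "$@" -o "$output"
}
```

For example, `build_book book.pdf 01-chapter/content.md 02-chapter/content.md` run from the repository root should resolve `media/my-file.png` inside `01-chapter/content.md` to `01-chapter/media/my-file.png`.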
Automating with GitHub Actions
To wrap everything up, we can use a GitHub Actions workflow to automate the production of our output document when files are committed.
Pandoc provides a series of Docker images that include all the tooling required to use it, which simplifies GitHub Actions substantially. Pandoc recommend using a specific version of their container releases in order to avoid the breaking changes you might hit if you simply use latest.
The workflow definition is quite straightforward and consists of:
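The same pinned image can also be tried locally before wiring it into a workflow. The sketch below assumes Docker is installed and mirrors the usage pattern from the Pandoc Docker documentation, where the current directory is mounted at `/data`. The function name `pandoc_in_docker` is hypothetical.

```shell
# Hypothetical helper: run a pinned Pandoc container against the current directory.
# The image tag is fixed (3.1) rather than "latest" to avoid surprise upgrades.
pandoc_in_docker() {
  docker run --rm \
    --volume "$(pwd):/data" \
    --user "$(id -u):$(id -g)" \
    pandoc/latex:3.1 "$@"
}
```

Anything after the function name is handed straight to Pandoc inside the container, for example `pandoc_in_docker README.md -o outfile.pdf`.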
- Checking out your source files.
- Enumerating all your markdown files into a list to use as an input to pandoc (this is where file naming is important for ordering).
- Invoking Pandoc via a container image.
- Uploading the resulting output as an artifact of the workflow run.
name: Create Document
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  workflow_dispatch:
jobs:
  create_file_with_pandoc:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Checkout and list files
        id: files_list
        run: |
          mkdir output # create output dir
          # Append (>>) rather than overwrite, so other step outputs are preserved
          echo "files=$(printf '"%s" ' content/*.md)" >> "$GITHUB_OUTPUT"
      - uses: docker://pandoc/latex:3.1
        with:
          args: ${{ steps.files_list.outputs.files }} --pdf-engine=xelatex -H formatting/format.tex -f markdown+rebase_relative_paths -o output/sample.pdf --toc -V lof -V "title:Sample posts from blog.siliconvalve.com" -V "author:Simon Waight" -V "date:\today" -V "geometry:a4paper" -V toc-depth:2 -V secnumdepth:3 --number-sections
      - uses: actions/upload-artifact@v4
        with:
          name: output
          path: output
The Pandoc command line should be fairly easy to unpick using the Pandoc manual, but here's a summary of what it's doing:
- Take all the input markdown files.
- Use the xelatex PDF engine.
- Use the format.tex file for formatting.
- Use relative paths for images.
- Create the output as a PDF (change the extension to docx to get a Word document!).
- Have a table of contents (TOC).
- Pass a series of variables to the LaTeX template via -V arguments in the format name:value.
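Laid out with one option per line, the same call is a little easier to read. This is only a local sketch of the workflow's arguments, wrapped in a hypothetical function named `build_sample_pdf`. It assumes `pandoc` and `xelatex` are installed and that the `output` directory already exists.

```shell
# Hypothetical local equivalent of the workflow's Pandoc call.
# Usage: build_sample_pdf content/*.md
# -H adds format.tex to the LaTeX header; --toc and -V lof add a table of
# contents and a list of figures; the remaining -V values set template variables.
build_sample_pdf() {
  pandoc "$@" \
    --pdf-engine=xelatex \
    -H formatting/format.tex \
    -f markdown+rebase_relative_paths \
    -o output/sample.pdf \
    --toc --number-sections \
    -V lof \
    -V "title:Sample posts from blog.siliconvalve.com" \
    -V "author:Simon Waight" \
    -V 'date:\today' \
    -V geometry:a4paper \
    -V toc-depth:2 \
    -V secnumdepth:3
}
```

Swapping `output/sample.pdf` for `output/sample.docx` should produce a Word document instead. The LaTeX-specific options are not relevant to that format.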
If you want to see the PDF produced by the above workflow for the sample repository you can download it.
Once the GitHub Action has run you'll find the artifact attached to the workflow run, ready for you to download and review.
What tips and tricks do you have to help produce documents this way? Leave comments below.
Until my next post... Happy Days!
😎
P.S. - you can even customise any Word document you create by building a reference docx file that uses the styles you need in your output. How cool is that?!
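For the reference docx idea, a two-step sketch might look like the following. Both function names are hypothetical, while `--print-default-data-file` and `--reference-doc` are the Pandoc options documented for this purpose. The assumption is that you restyle the exported file in Word or LibreOffice between the two steps.

```shell
# Hypothetical helpers for styling Word output with a reference document.

# Step 1: export Pandoc's default reference.docx so its styles can be edited.
make_reference_docx() {
  pandoc -o "${1:-custom-reference.docx}" --print-default-data-file reference.docx
}

# Step 2: build the Word document using the restyled reference file.
# Usage: build_docx REFERENCE OUTPUT FILE...
build_docx() {
  reference="$1"; output="$2"; shift 2
  pandoc "$@" --reference-doc="$reference" -o "$output"
}
```

A typical run would be `make_reference_docx`, then editing the styles in `custom-reference.docx`, then `build_docx custom-reference.docx output/sample.docx content/*.md`.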