Publishing documentation with Markdown, LaTeX, Pandoc and GitHub Actions

If you've been following me for a while you'll know I'm a huge fan of Markdown, GitHub and simplifying the life of developers. In this post I'm going to show you how to use the tools you already know and love, along with some open source magic that has a long history, to build a documentation publishing workflow. All without needing to install Word or use Windows, even if you want to publish a Word document!

While co-editing capabilities in Microsoft Word enable easy team editing of documents, there may be times when being constrained to editing in Word (or its web-based version) is too restrictive.

Foundations

In order to get the most out of this approach there are a few items to get in order first:

  • A good text editor that supports Markdown and LaTeX. (I'm using Visual Studio Code with the Markdown All in One and LaTeX Workshop extensions installed.)
  • Diagramming extensions such as the excellent draw.io one for Visual Studio Code.
  • A GitHub repository that includes:
    • One or more Markdown files, named so that they sort in the order you want them combined and published.
    • A GitHub Action that uses an approach similar to the sample I will cover later in this post.

As a boost to your writing, consider using an AI assistant such as GitHub Copilot, which can help increase your writing productivity, particularly for documentation that will include code snippets or diagrams.

The final piece of the puzzle is Pandoc, an excellent open source document converter that can take Markdown and LaTeX as input and produce (amongst other formats) PDF and Word documents.

Defining output formats

The output formatting for the process covered in this blog post uses the open source typesetting system LaTeX, which has been around since the 1980s. LaTeX uses a series of commands to control many aspects of output documents. I'm not going to dig into the specifics here as LaTeX is quite complex and can be used to produce very rich documents, but suffice to say it gives you a huge amount of flexibility in controlling output formatting!

The beauty of our markdown-based approach that uses Pandoc is we can include LaTeX snippets as externally referenced files, inline in the Pandoc CLI or embedded in our markdown files. You can see a sample file in the section on including images below, and the CLI command in the GitHub Action in the sample repository.
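
As a rough sketch of those three options, the shell snippet below sets up a throwaway example. The demo/ folder, its file names and the build_demo function are hypothetical and not taken from the sample repository; the Pandoc options themselves (-H and -V) are the ones used in the workflow later in this post.

```shell
# Hypothetical demo: every file and function name here is illustrative.
mkdir -p demo

# Option 1: an externally referenced file holding LaTeX preamble commands.
cat > demo/format.tex <<'EOF'
\usepackage{float}
EOF

# Option 3: raw LaTeX embedded in the markdown itself. Pandoc's markdown
# reader passes commands such as \newpage through to LaTeX output.
cat > demo/chapter.md <<'EOF'
# First section

Some text.

\newpage

# Second section
EOF

# Wrapped in a function so nothing runs until you call build_demo
# (it needs pandoc and a LaTeX installation).
build_demo() {
  # -H pulls the external file into the document header (option 1);
  # -V sets a template variable inline on the command line (option 2).
  pandoc demo/chapter.md -H demo/format.tex -V "geometry:a4paper" -o demo/out.pdf
}
```

Keeping the preamble in its own file tends to be the easiest of the three to maintain once the formatting grows beyond a line or two.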

Content layout of your repository

There are a few different ways to approach this, but whichever you choose, you need to be able to list the input markdown files in the order in which you want them published.

This might mean having a directory or markdown file per chapter, but you must ensure ordering is maintained. So, for example, you might have the following using folders:

.
├── 01-chapter
│   └── content.md
└── 02-chapter
    └── content.md

Alternatively you could use just files in the root of the repository, though you will have to exclude your README if you create one and don't want it included in your print edition!

.
├── 01-intro-chapter.md
├── 02-my-secret-of-success.md
└── 03-turning-ideas-into-blogs.md
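
Either layout can be turned into an ordered input list with a short shell function. This is a minimal sketch: list_chapters is a hypothetical helper (it is not part of the sample repository) and it assumes zero-padded numeric prefixes such as 01- and 02-, so that a plain sort matches chapter order.

```shell
# Hypothetical helper: print the markdown files under a directory, one per
# line, in publishing order. The README is skipped so it stays out of the
# print edition.
list_chapters() {
  # -iname matches README.md in any letter case; LC_ALL=C makes the sort
  # order identical on every machine.
  find "${1:-.}" -type f -name '*.md' ! -iname 'README.md' | LC_ALL=C sort
}

# Example, run from the root of the folder-based layout above:
#   list_chapters .
# prints:
#   ./01-chapter/content.md
#   ./02-chapter/content.md
```

Note that this sketch picks up every .md file below the starting directory, so point it at a content folder if the repository holds other markdown files.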

Including images

If you have images you'd like to include you can drop them into subfolders in the same location as the markdown files and then reference them as you normally do in standard markdown.

There are a few items to watch out for though:

  1. Images float by default when publishing with LaTeX via pandoc. This means images show up wherever there is space in the document, which may not be where you want them to be!
  2. Images can be oddly sized, especially if you're publishing images that you might have created to display on the web.
  3. Image paths in sub-folders are not handled automatically by pandoc: relative paths are resolved against the directory pandoc runs in, not against the markdown file that references them.

So how do we fix these items?

Image float can be managed by defining a LaTeX format file and including the following snippet, which ensures images are rendered in place rather than floated.

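% Control image float so figures retain their position
\usepackage{float}
% Keep handles to the original figure environment
\let\origfigure\figure
\let\endorigfigure\endfigure
% Redefine figure to ignore any placement option and always use [H] ("here")
\renewenvironment{figure}[1][2] {
    \expandafter\origfigure\expandafter[H]
} {
    \endorigfigure
}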

Image sizing requires you to add an attribute to your markdown images (this is Pandoc's attribute syntax rather than LaTeX). The {height=55%} value below is the one I'm using for all my images.

![Sample Image.](media/my-file.png 'Sample image.'){height=55%}

In order to fix the final issue we need to make sure we include the following argument in our pandoc CLI call.

-f markdown+rebase_relative_paths

Automating with GitHub Actions

To wrap everything up, we can use a GitHub Actions workflow to automate the production of our output document when files are committed.

Pandoc provides a series of Docker images that include all the tooling required to use it, which simplifies the workflow substantially. Pandoc recommends pinning a specific version of their container releases to avoid the breaking changes you risk if you simply use latest.

The workflow definition is quite straightforward and consists of:

  1. Checking out your source files.
  2. Enumerating all your markdown files into a list to use as an input to pandoc (this is where file naming is important for ordering).
  3. Invoking Pandoc via a container image.
  4. Uploading the resulting output as an artifact of the workflow run.

name: Create Document

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  workflow_dispatch:

jobs:
  create_file_with_pandoc:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Create output folder and list files
        id: files_list
        run: |
          mkdir -p output # create the output directory
          # Quote each path and expose the list as a step output
          echo "files=$(printf '"%s" ' content/*.md)" >> "$GITHUB_OUTPUT"
      - uses: docker://pandoc/latex:3.1
        with:
          args: ${{ steps.files_list.outputs.files }} --pdf-engine=xelatex -H formatting/format.tex -f markdown+rebase_relative_paths -o output/sample.pdf --toc -V lof -V "title:Sample posts from blog.siliconvalve.com" -V "author:Simon Waight" -V "date:\today" -V "geometry:a4paper" -V toc-depth:2 -V secnumdepth:3 --number-sections
      - uses: actions/upload-artifact@v4
        with:
          name: output
          path: output
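
To see what the file-listing step hands to Pandoc, you can simulate it locally. This is only a sketch: on a real runner GitHub sets GITHUB_OUTPUT for you, so the temporary file and the two sample file names below are stand-ins I've made up for illustration.

```shell
# Stand-ins for the runner environment (hypothetical paths and file names).
GITHUB_OUTPUT="$(mktemp)"
workdir="$(mktemp -d)"
mkdir -p "$workdir/content"
touch "$workdir/content/02-second.md" "$workdir/content/01-first.md"

# The same printf idiom as the workflow's file-listing step, run from the
# stand-in repository root. The glob expands in sorted order and printf
# wraps each path in double quotes.
( cd "$workdir" && echo "files=$(printf '"%s" ' content/*.md)" >> "$GITHUB_OUTPUT" )

cat "$GITHUB_OUTPUT"
# prints: files="content/01-first.md" "content/02-second.md"
```

The step output is a single line of quoted paths in glob order, which is why the numeric prefixes on the file names matter.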

The Pandoc command line should be fairly easy to unpick using the Pandoc manual, but here's a summary of what it's doing:

  1. Take all the input markdown files
  2. Use the xelatex PDF engine
  3. Use the format.tex file for formatting
  4. Resolve relative image paths against the folder of the markdown file that references them.
  5. Create the output as a PDF (change the extension to docx to get a Word document!)
  6. Include a table of contents (TOC) and numbered sections.
  7. Set template variables (title, author, date, page geometry and so on) via -V arguments in the form name:value.
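
If you'd like to try the same build outside GitHub Actions, the summary above can be sketched as a small shell function. build_doc and the PANDOC override are hypothetical names of my own; the sketch assumes pandoc is installed locally and that the content/ and formatting/ folders match the workflow. The LaTeX-specific options are left out for docx on the assumption that they only matter for PDF output.

```shell
# Hypothetical local build mirroring the workflow's pandoc call.
# PANDOC defaults to the pandoc binary on your PATH; setting PANDOC=echo
# prints the command instead of running it.
build_doc() {
  fmt="${1:-pdf}"   # output format: pdf or docx
  mkdir -p output

  # Options shared by both formats, held in the positional parameters.
  set -- -f markdown+rebase_relative_paths --toc --number-sections \
         -o "output/sample.$fmt"

  # The PDF engine and the LaTeX header file are only added for PDF output.
  if [ "$fmt" = "pdf" ]; then
    set -- "$@" --pdf-engine=xelatex -H formatting/format.tex
  fi

  "${PANDOC:-pandoc}" content/*.md "$@"
}
```

Calling build_doc produces output/sample.pdf and build_doc docx produces output/sample.docx; the -V variables from the workflow can be appended in the same way.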

If you want to see the PDF produced by the above workflow for the sample repository you can download it.

Once the GitHub Actions workflow has run you'll find the artifact attached to the run, ready to download and review.

Output document as artifact on GitHub Action run!

What tips and tricks do you have to help produce documents this way? Leave comments below.

Until my next post... Happy Days!

😎

P.S. - you can even customise any Word document you create by building a reference docx file that uses the styles you need in your output. How cool is that?!
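
A minimal sketch of that reference docx approach, using two options from the Pandoc manual (--print-default-data-file and --reference-doc). The function names and the custom-reference.docx file name are my own placeholders, and the content/ folder is assumed to match the workflow above.

```shell
# Hypothetical helpers; PANDOC defaults to the pandoc binary on your PATH.

# Step 1: write Pandoc's built-in reference document to disk so that its
# styles can be edited in Word or LibreOffice.
make_reference_doc() {
  "${PANDOC:-pandoc}" -o "${1:-custom-reference.docx}" \
    --print-default-data-file reference.docx
}

# Step 2: build the Word output, taking styles from the edited reference file.
build_styled_docx() {
  mkdir -p output
  "${PANDOC:-pandoc}" content/*.md -f markdown+rebase_relative_paths \
    --reference-doc="${1:-custom-reference.docx}" -o output/sample.docx
}
```

Run make_reference_doc once, adjust the styles in the generated file, commit it to the repository and then use build_styled_docx (or the equivalent arguments in the workflow) for each build.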