Document Conversion Utility


A document conversion utility converts a document from one encoding format to another. Depending on the formats involved (and the complexity of the document), the conversion may not be lossless. For example, converting a Microsoft Word document to Markdown loses most formatting directives, such as font size, color, and margins.

In DocOps, document conversion utilities are fundamental to automation pipelines, especially for converting documents to the final render format, which is typically PDF or HTML.

Pandoc

Pandoc is often called “the Swiss Army knife” of document conversion. Pandoc implements a modular architecture based on a universal document AST (abstract syntax tree) and a series of readers and writers around it. It converts to and from a very wide range of legacy and contemporary document encoding formats.

The Pandoc binary is easy to invoke from DocOps automation pipelines and provides a wealth of flags to control its behavior. For example, the following command converts a Markdown document to PDF with a table of contents:

pandoc source-document.md -o rendered-document.pdf --toc
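
In an automation pipeline, the same invocation is often wrapped in a script so that a failed conversion stops the build. The following is a minimal sketch of that idea, not a prescribed approach; the function names build_pandoc_command and convert are hypothetical, and only the pandoc arguments come from the command above.

```python
import shlex
import subprocess
from pathlib import Path


def build_pandoc_command(source: Path, output: Path, toc: bool = True) -> list:
    """Return the argument list for the pandoc invocation shown above."""
    command = ["pandoc", str(source), "-o", str(output)]
    if toc:
        command.append("--toc")  # same flag as in the command above
    return command


def convert(source: Path, output: Path, toc: bool = True) -> None:
    """Run pandoc; raises CalledProcessError if pandoc exits with an error."""
    subprocess.run(build_pandoc_command(source, output, toc), check=True)


# Print the command rather than running it, so the sketch works without pandoc.
print(shlex.join(build_pandoc_command(Path("source-document.md"),
                                      Path("rendered-document.pdf"))))
```

Calling convert(...) from a pipeline step would then run the conversion and fail the step if Pandoc reports an error.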

Pandoc also allows writing custom filters using Lua, Python or any other language (by presenting a JSON-based AST to the programmer). The more adventurous DocOps engineer can also use Haskell to interact with Pandoc’s native APIs directly to modify existing readers and writers or create new ones.
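
To illustrate the JSON-based AST mentioned above, here is a minimal sketch of a filter written with only the Python standard library. It upper-cases every Str element in a document. The function names and the hand-written sample document are assumptions made for illustration, and the API version number in the sample is only a placeholder.

```python
import io
import json


def upper_case_strings(node):
    """Return a copy of a Pandoc JSON AST fragment with every Str upper-cased."""
    if isinstance(node, dict):
        if node.get("t") == "Str":
            return {"t": "Str", "c": node["c"].upper()}
        return {key: upper_case_strings(value) for key, value in node.items()}
    if isinstance(node, list):
        return [upper_case_strings(item) for item in node]
    return node  # numbers, booleans and plain strings pass through unchanged


def run_filter(in_stream, out_stream):
    """Read a JSON AST from in_stream and write the transformed AST to out_stream."""
    json.dump(upper_case_strings(json.load(in_stream)), out_stream)


# Hand-written sample for illustration: one paragraph reading "hello world".
sample = {
    "pandoc-api-version": [1, 23, 1],  # placeholder version
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "hello"}, {"t": "Space"}, {"t": "Str", "c": "world"},
    ]}],
}
result = io.StringIO()
run_filter(io.StringIO(json.dumps(sample)), result)
print(result.getvalue())
```

In a real filter script, the demonstration at the bottom would be replaced by run_filter(sys.stdin, sys.stdout), and the script would be passed to Pandoc with its --filter option.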

Input Document Encoding Formats

Document encoding formats that Pandoc converts from:

Output Document Encoding Formats

Document encoding formats that Pandoc converts to:

Additional Document Encoding Format-Centric Utilities

Input-Centric

Input-centric utilities are focused on the format they convert from rather than to.

AsciiDoc

Asciidoctor is the official Ruby-based AsciiDoc utility, which converts to the following formats:

DocBook

DocBook’s official conversion approach is a set of XSLT stylesheets rather than a language-specific library or command-line tool. The stylesheets convert DocBook XML files to common formats such as:

  • HTML (and XHTML)
  • Linux/Unix Manual Pages
  • PDF
  • EPUB (2 and 3)
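
Because the stylesheets are plain XSLT, any XSLT processor can apply them. The sketch below assumes the xsltproc command-line processor (not mentioned above) and a stylesheet path that varies between systems; both are assumptions, as are the function and file names.

```python
import shlex
import subprocess

# Assumption: install location of the DocBook XSL stylesheets; adjust per system.
HTML_STYLESHEET = "/usr/share/xml/docbook/stylesheet/docbook-xsl/html/docbook.xsl"


def build_xsltproc_command(docbook_file, output_file, stylesheet=HTML_STYLESHEET):
    """Return the xsltproc arguments that apply a stylesheet to a DocBook file."""
    return ["xsltproc", "--output", output_file, stylesheet, docbook_file]


def docbook_to_html(docbook_file, output_file, stylesheet=HTML_STYLESHEET):
    """Run xsltproc; raises CalledProcessError if the transformation fails."""
    subprocess.run(build_xsltproc_command(docbook_file, output_file, stylesheet),
                   check=True)


# Print the command rather than running it, so the sketch works without xsltproc.
print(shlex.join(build_xsltproc_command("book.xml", "book.html")))
```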

Markdown

Microsoft Word

OpenDocument

soffice is the official command-line binary for launching LibreOffice. It can also run headless to convert OpenDocument files (and Microsoft Word files) to:

reST

Docutils is the official Python-based set of utilities to convert reST (reStructuredText) to:

Output-Centric

Ebooks (EPUB, MOBI, AZW3)

  • Calibre, like Pandoc, is considered a bit of a Swiss Army knife, but for ebook conversion. Its primary target output formats are:
  • Kindle Previewer is Amazon’s official tool to generate Kindle books using the latest AZW3 format. Unless books are authored directly using Kindle’s proprietary format (AZW3), the typical authoring workflow involves converting from Microsoft Word or EPUB.

PDF

Please note that Pandoc and most input-centric tools are capable of generating PDF documents. These are additional tools whose sole goal is the generation of PDF documents:

  • pdflatex and xelatex are command-line tools that Pandoc uses as backends. They focus on producing articles, documents, and books in PDF format from a LaTeX-based intermediate document format.
  • PDFtk Server is a utility to manipulate existing PDF files. It supports various functions such as merge, split, rotate, compress images, add watermarks, etc.
  • Prince is a commercial utility to produce professional-looking PDF documents from HTML.
  • wkhtmltopdf converts HTML to PDF. The tool offers several configuration options to make the resulting documents look more native and disguise their HTML origin.
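
Some of these tools combine naturally in a pipeline. As a sketch under stated assumptions, the following builds the commands to render several HTML files with wkhtmltopdf and then merge the results with PDFtk's cat operation. The function name and the file names are hypothetical, and the commands are only printed, not executed.

```python
import shlex
from pathlib import Path


def build_render_commands(html_files, merged_pdf):
    """Return commands that render each HTML file to PDF and then merge the PDFs."""
    pdf_files = [str(Path(name).with_suffix(".pdf")) for name in html_files]
    # One wkhtmltopdf call per input: wkhtmltopdf <input.html> <output.pdf>
    commands = [["wkhtmltopdf", html, pdf]
                for html, pdf in zip(html_files, pdf_files)]
    # pdftk <inputs...> cat output <merged.pdf> concatenates the PDFs in order.
    commands.append(["pdftk", *pdf_files, "cat", "output", merged_pdf])
    return commands


for command in build_render_commands(["intro.html", "body.html"], "book.pdf"):
    print(shlex.join(command))
```

Each command list could be passed to subprocess.run(..., check=True) so that a failed rendering step stops the pipeline before the merge.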

Markdown From Multiple Formats

MarkItDown is a Python-based tool that converts a variety of document encoding formats, including PDF, PowerPoint, Word, Excel, HTML, CSV, JSON, XML, and raster images, into Markdown documents. It can optionally use LLM technology, for example to describe images.


© 2022-2024 Ernesto Garbarino | Contact me at ernesto@garba.org