A document conversion utility converts a document from one encoding format to another. Depending on the formats involved (and the complexity of the document), the conversion may not be lossless. For example, converting a Microsoft Word document to Markdown loses most formatting directives such as font size, color, margins, and so on.
In DocOps, document conversion utilities are fundamental to automation pipelines, especially for converting documents to the final render format, which is typically PDF or HTML.
Pandoc is often called “the Swiss Army knife” of document conversion. Pandoc implements a modular architecture based on a universal document AST (abstract syntax tree), with a series of readers and writers around it: each reader parses an input format into the AST, and each writer renders the AST into an output format. It converts from and to nearly every legacy and contemporary document encoding format in common use.
The Pandoc binary is easy to invoke from DocOps automation pipelines and provides a wealth of flags to control its behavior. For example, the following command converts a Markdown document to PDF with a table of contents:
pandoc source-document.md -o rendered-document.pdf --toc
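In a pipeline, the same invocation is often wrapped in a small script. The following is a minimal sketch, assuming Python is the pipeline's glue language. The helper names build_pandoc_command and convert are hypothetical; only the -o and --toc flags come from the command above.

```python
import subprocess


def build_pandoc_command(source, target, toc=False):
    """Return the argument list for a Pandoc conversion (hypothetical helper)."""
    command = ["pandoc", source, "-o", target]
    if toc:
        command.append("--toc")  # same flag as in the command above
    return command


def convert(source, target, toc=False):
    """Run Pandoc; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(source, target, toc), check=True)


if __name__ == "__main__":
    # Only prints the command, so Pandoc does not need to be installed to try this.
    print(" ".join(build_pandoc_command(
        "source-document.md", "rendered-document.pdf", toc=True)))
```

Keeping the command construction separate from its execution makes the wrapper easy to check in a pipeline's own tests without invoking Pandoc. Calling convert requires the pandoc binary on the PATH and, for PDF output, a PDF engine that Pandoc can use.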
Pandoc also allows writing custom filters that transform the AST between the reader and the writer. Lua filters run in the Lua interpreter embedded in Pandoc, while filters written in Python or any other language receive a JSON serialization of the AST and return a modified one. The more adventurous DocOps engineer can also use Haskell to interact with Pandoc’s native APIs directly to modify existing readers and writers or create new ones.
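To illustrate the JSON-based approach, here is a minimal sketch of a filter that uses only the Python standard library. It assumes the JSON AST layout in which every element is an object with a "t" (type) key and an optional "c" (content) key. The function names walk and upper_case_str are hypothetical, and the transformation (upper-casing all text) is chosen only because it is easy to verify.

```python
import json
import sys


def walk(node, action):
    """Recursively apply `action` to every AST element (a dict with a "t" key)."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            node = action(node)
        return {key: walk(value, action) for key, value in node.items()}
    return node  # strings, numbers, booleans, None


def upper_case_str(element):
    """Upper-case the text of Str elements; return any other element unchanged."""
    if element.get("t") == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return element


def main():
    # A JSON filter reads the AST from stdin and writes the modified AST to stdout.
    document = json.load(sys.stdin)
    json.dump(walk(document, upper_case_str), sys.stdout)


if __name__ == "__main__":
    main()
```

Saved under a name such as uppercase.py (hypothetical) and made executable, a script like this can be passed to Pandoc with the --filter option, or placed between a pandoc -t json and a pandoc -f json invocation in a shell pipeline.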
Document encoding formats that Pandoc converts from:
bibtex (BibTeX bibliography)
biblatex (BibLaTeX bibliography)
bits (BITS XML, alias for jats)
commonmark (CommonMark Markdown)
commonmark_x (CommonMark Markdown with extensions)
creole (Creole 1.0)
csljson (CSL JSON bibliography)
csv (CSV table)
tsv (TSV table)
djot (Djot markup)
docbook (DocBook)
docx (Word docx)
dokuwiki (DokuWiki markup)
endnotexml (EndNote XML bibliography)
epub (EPUB)
fb2 (FictionBook2 e-book)
gfm (GitHub-Flavored Markdown), or the deprecated and less accurate markdown_github; use markdown_github only if you need extensions not supported in gfm.
haddock (Haddock markup)
html (HTML)
ipynb (Jupyter notebook)
jats (JATS XML)
jira (Jira/Confluence wiki markup)
json (JSON version of native AST)
latex (LaTeX)
markdown (Pandoc’s Markdown)
markdown_mmd (MultiMarkdown)
markdown_phpextra (PHP Markdown Extra)
markdown_strict (original unextended Markdown)
mediawiki (MediaWiki markup)
man (roff man)
muse (Muse)
native (native Haskell)
odt (OpenOffice text document)
opml (OPML)
org (Emacs Org mode)
ris (RIS bibliography)
rtf (Rich Text Format)
rst (reStructuredText)
t2t (txt2tags)
textile (Textile)
tikiwiki (TikiWiki markup)
twiki (TWiki markup)
typst (typst)
vimwiki (Vimwiki)
Document encoding formats that Pandoc converts to:
asciidoc (modern AsciiDoc as interpreted by AsciiDoctor)
asciidoc_legacy (AsciiDoc as interpreted by asciidoc-py)
asciidoctor (deprecated synonym for asciidoc)
beamer (LaTeX beamer slide show)
bibtex (BibTeX bibliography)
biblatex (BibLaTeX bibliography)
chunkedhtml (zip archive of multiple linked HTML files)
commonmark (CommonMark Markdown)
commonmark_x (CommonMark Markdown with extensions)
context (ConTeXt)
csljson (CSL JSON bibliography)
djot (Djot markup)
docbook or docbook4 (DocBook 4)
docbook5 (DocBook 5)
docx (Word docx)
dokuwiki (DokuWiki markup)
epub or epub3 (EPUB v3 book)
epub2 (EPUB v2)
fb2 (FictionBook2 e-book)
gfm (GitHub-Flavored Markdown), or the deprecated and less accurate markdown_github; use markdown_github only if you need extensions not supported in gfm.
haddock (Haddock markup)
html or html5 (HTML, i.e. HTML5/XHTML polyglot markup)
html4 (XHTML 1.0 Transitional)
icml (InDesign ICML)
ipynb (Jupyter notebook)
jats_archiving (JATS XML, Archiving and Interchange Tag Set)
jats_articleauthoring (JATS XML, Article Authoring Tag Set)
jats_publishing (JATS XML, Journal Publishing Tag Set)
jats (alias for jats_archiving)
jira (Jira/Confluence wiki markup)
json (JSON version of native AST)
latex (LaTeX)
man (roff man)
markdown (Pandoc’s Markdown)
markdown_mmd (MultiMarkdown)
markdown_phpextra (PHP Markdown Extra)
markdown_strict (original unextended Markdown)
markua (Markua)
mediawiki (MediaWiki markup)
ms (roff ms)
muse (Muse)
native (native Haskell)
odt (OpenOffice text document)
opml (OPML)
opendocument (OpenDocument)
org (Emacs Org mode)
pdf (PDF)
plain (plain text)
pptx (PowerPoint slide show)
rst (reStructuredText)
rtf (Rich Text Format)
texinfo (GNU Texinfo)
textile (Textile)
slideous (Slideous HTML and JavaScript slide show)
slidy (Slidy HTML and JavaScript slide show)
dzslides (DZSlides HTML5 + JavaScript slide show)
revealjs (reveal.js HTML5 + JavaScript slide show)
s5 (S5 HTML and JavaScript slide show)
tei (TEI Simple)
typst (typst)
xwiki (XWiki markup)
zimwiki (ZimWiki markup)
Input-centric utilities are focused on the format they convert from rather than to.
AsciiDoctor is the official Ruby-based AsciiDoc utility, which converts to the following formats:
DocBook’s official conversion approach is through a set of XSLT stylesheets rather than a language-specific library or command-line tool. The offered stylesheets help convert DocBook XML files to common formats such as:
soffice is the official command-line binary for launching LibreOffice. It can also be run in headless mode to convert OpenDocument files (and also Microsoft Word files) to:
Docutils is the official Python-based set of utilities to convert reST (reStructuredText) to:
Note that Pandoc and most input-centric tools are already capable of generating PDF documents. The following are additional tools whose sole goal is the generation of PDF documents:
MarkItDown is a Python-based tool that uses LLM technology to convert a variety of document encoding formats including PDF, PowerPoint, Word, Excel, HTML, CSV, JSON, XML, and raster images into Markdown documents.
© 2022-2024 Ernesto Garbarino | Contact me at ernesto@garba.org