A Microsoft Word Document is a paper-oriented document encoding format used by the Microsoft Word application. Microsoft Word files typically end with the extension .doc or .docx. Although primarily paper-oriented, Microsoft Word documents may be displayed in various modes in which the notion of pages is abstracted away.
The original binary .doc format has evolved substantially since the early versions of the word processor for Xenix and MS-DOS in 1983. The current family of formats uses the .docx extension and consists of a zipped folder that contains XML files, among many other resources.
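Because a .docx file is an ordinary ZIP archive, its contents can be inspected with any ZIP-aware tool. The following is a minimal sketch using Python's standard zipfile module; the function names list_docx_parts and read_part are illustrative and not part of any Word-related library.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of all files stored inside a .docx package."""
    with zipfile.ZipFile(path) as package:
        return package.namelist()


def read_part(path, part_name):
    """Return one file from the package (e.g. an XML part) decoded as UTF-8 text."""
    with zipfile.ZipFile(path) as package:
        return package.read(part_name).decode("utf-8")
```

Calling list_docx_parts on a real document would typically show several XML files alongside other resources such as images.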
Example
The example below shows a simplified and abridged version of word/document.xml, which encodes the main document within a .docx file:
<?xml version="1.0" encoding="UTF-8"?>
<document>
  <body>
    <p>
      <pPr>
        <pStyle val="Heading1" />
      </pPr>
      <r>
        <t xml:space="preserve">This is a level 1 heading</t>
      </r>
    </p>
    <p>
      <pPr>
        <pStyle val="Heading2" />
      </pPr>
      <r>
        <t xml:space="preserve">This is a level 2 heading</t>
      </r>
    </p>
    <p>
      <pPr>
        <pStyle val="FirstParagraph" />
      </pPr>
      <r>
        <t xml:space="preserve">This is text in</t>
      </r>
      <r>
        <t xml:space="preserve"> </t>
      </r>
      <r>
        <rPr>
          <b />
          <bCs />
        </rPr>
        <t xml:space="preserve">bold</t>
      </r>
      <r>
        <t xml:space="preserve"> and </t>
      </r>
      <r>
        <rPr>
          <i />
          <iCs />
        </rPr>
        <t xml:space="preserve">italics</t>
      </r>
      <r>
        <t xml:space="preserve">, and this is an external link to</t>
      </r>
      <r>
        <t xml:space="preserve"> </t>
      </r>
      <hyperlink r:id="rId20">
        <r>
          <rPr>
            <rStyle val="Hyperlink" />
          </rPr>
          <t xml:space="preserve">DocOps</t>
        </r>
      </hyperlink>
      <r>
        <t xml:space="preserve">. Now, some bullet points:</t>
      </r>
    </p>
    <p>
      <pPr>
        <pStyle val="Compact" />
        <numPr>
          <ilvl val="0" />
          <numId val="1001" />
        </numPr>
      </pPr>
      <r>
        <t xml:space="preserve">Bullet point 1</t>
      </r>
    </p>
    <p>
      <pPr>
        <pStyle val="Compact" />
        <numPr>
          <ilvl val="0" />
          <numId val="1001" />
        </numPr>
      </pPr>
      <r>
        <t xml:space="preserve">Bullet point 2</t>
      </r>
    </p>
    <p>
      <pPr>
        <pStyle val="Compact" />
        <numPr>
          <ilvl val="0" />
          <numId val="1001" />
        </numPr>
      </pPr>
      <r>
        <t xml:space="preserve">Bullet point 3</t>
      </r>
    </p>
    <sectPr />
  </body>
</document>
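The listing shows that the text of a single paragraph is split across several runs (r elements), each carrying its own formatting. As a rough sketch of how the visible text could be reassembled, the Python snippet below joins the t elements of each paragraph. It assumes the simplified, namespace-free element names used in this example; a real word/document.xml qualifies these names with the WordprocessingML namespace (conventionally the w: prefix), so the lookups would need adjusting. The name paragraph_texts and the SNIPPET constant are illustrative only.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the simplified listing (namespace-free element names).
SNIPPET = """<document>
  <body>
    <p>
      <pPr><pStyle val="FirstParagraph" /></pPr>
      <r><t xml:space="preserve">This is text in</t></r>
      <r><t xml:space="preserve"> </t></r>
      <r><rPr><b /><bCs /></rPr><t xml:space="preserve">bold</t></r>
    </p>
  </body>
</document>"""


def paragraph_texts(document_xml):
    """Return one string per <p>, built by joining the text of its <t> elements."""
    root = ET.fromstring(document_xml)
    return ["".join(t.text or "" for t in p.iter("t")) for p in root.iter("p")]


print(paragraph_texts(SNIPPET))  # ['This is text in bold']
```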
In the above example, the target of the link referenced by <hyperlink r:id="rId20"> is not stored in the document itself but in a separate file, word/_rels/document.xml.rels, which maps each relationship ID to its target.
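To illustrate how a relationship ID such as rId20 maps to a link target, the Python sketch below parses a relationships part and looks up one entry. The embedded XML is a hypothetical reconstruction, not the file from this example: the Target URL https://example.com/ is a placeholder, and the function name resolve_relationship is illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace used by package relationship parts.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Hypothetical relationships part; the Target value is a placeholder URL.
SAMPLE_RELS = """<?xml version="1.0" encoding="UTF-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId20"
                Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
                Target="https://example.com/"
                TargetMode="External" />
</Relationships>"""


def resolve_relationship(rels_xml, rel_id):
    """Return the Target of the relationship with the given Id, or None if absent."""
    root = ET.fromstring(rels_xml.encode("utf-8"))
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Id") == rel_id:
            return rel.get("Target")
    return None


print(resolve_relationship(SAMPLE_RELS, "rId20"))  # https://example.com/
```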
The .docx format is complex and hard to generate programmatically from scratch without the aid of a specialized library. From a DocOps perspective, Microsoft Word Documents are usually treated as a render target. That is to say, documents are authored in a different format and then rendered to Microsoft Word.
The example has been generated using Pandoc from the Markdown version.
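A render step of this kind could be scripted roughly as follows. This is a sketch only: it assumes the pandoc executable is installed and on the PATH, relies on Pandoc inferring the output format from the .docx extension, and the function names and file names are hypothetical.

```python
import shutil
import subprocess


def build_pandoc_command(source_path, output_path):
    """Build the argument list; Pandoc infers the output format from the file extension."""
    return ["pandoc", source_path, "-o", output_path]


def render_to_docx(source_path, output_path):
    """Run Pandoc to render a source file (e.g. Markdown) to a Word document."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    subprocess.run(build_pandoc_command(source_path, output_path), check=True)
```

For instance, render_to_docx("article.md", "article.docx") would invoke pandoc article.md -o article.docx.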
© 2022-2024 Ernesto Garbarino | Contact me at ernesto@garba.org