Automated Content Generation


The tenet of automated content generation states that a documentation system should generate content in a hands-free, mechanical fashion whenever possible. It directly supports the principle of generative content by producing content automatically from applications, files, and data.

The tenet of generative content prompts us to automate the generation of documentation. In this example, a customer trajectory document view is generated from source data in tabular format.

This tenet shares similarities with the tenet of composability, as well as embedding and blending. The difference is that rather than selecting document parts, or querying an external source within the confines of a content component, we are able to generate brand new content, document views, or even entire bodies of documentation.

We can reason about generative content based on the nature of the source content, which gives us the following fundamental use cases:

  • Existing documents: new documentation is generated from existing documents.
  • Data: documentation is generated from applications, databases, spreadsheets, etc.
  • Code: documentation is generated from application code and configuration files.
  • Multimedia content: documentation is generated from audio and video content.

Let’s look at each of them.

Document-centric Content Generation

Document-centric content generation produces new content from material that is already expressed in textual form. A basic example is document merging and consolidation.

Say, for instance, that a large project has a series of documents produced in various formats by different people, outside the realm of a central documentation platform. We can create an automated process that selects and combines these documents so that all of the project’s moving parts can be easily grasped. Such a process may involve skipping cover pages, renumbering headings, expanding business terms, and removing formatting, so that the result adheres to the tenets we have elaborated upon, such as that of consistent layout.
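
As a minimal sketch of such a merging process, the Python below combines plain-text documents, skips each cover line, and renumbers headings into one sequence. The input conventions are assumptions made for illustration only: each document is a string whose first line is a cover title, and headings look like "1. Title". The function name merge_documents and the sample documents are hypothetical.

```python
import re

def merge_documents(documents):
    """Merge plain-text documents, renumbering headings across all of them.

    Assumed (hypothetical) conventions: the first line of each document is
    a cover title to skip, and headings have the form "<number>. <title>".
    """
    heading = re.compile(r"^\d+\.\s+(.*)$")
    merged, counter = [], 0
    for text in documents:
        lines = text.strip().splitlines()[1:]  # skip the cover line
        for line in lines:
            match = heading.match(line)
            if match:
                counter += 1  # continue numbering across documents
                merged.append(f"{counter}. {match.group(1)}")
            else:
                merged.append(line.rstrip())
    return "\n".join(merged)

doc_a = "PROJECT PLAN (cover)\n1. Scope\nDeliver the billing portal.\n2. Timeline\nThree phases."
doc_b = "RISK LOG (cover)\n1. Risks\nVendor delays."
combined = merge_documents([doc_a, doc_b])
print(combined)
```

Real projects would first need converters for each source format; this sketch only covers the consolidation step once the text has been extracted.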

With the advent of LLM technology, content may also be generated by summarizing large bodies of text, or by generating content that fits predefined topics. For example, the questions most frequently put to a customer care center can be formulated as LLM prompts to generate relevant content for training or scripting purposes.
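
The prompt-formulation step can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the question log, the PROMPT_TEMPLATE wording, and the build_prompts helper are all hypothetical, and the sketch stops at producing prompt strings rather than calling any particular LLM service.

```python
from collections import Counter

# Hypothetical prompt wording; adjust to the model and house style in use.
PROMPT_TEMPLATE = (
    "You are writing training material for customer care agents.\n"
    "Write a short, accurate answer to this frequently asked question:\n"
    "{question}"
)

def build_prompts(questions, top_n=2):
    """Return one prompt per question, for the top_n most frequent questions."""
    normalized = [q.strip().lower() for q in questions]
    top = Counter(normalized).most_common(top_n)
    return [PROMPT_TEMPLATE.format(question=q) for q, _count in top]

# Assumed sample of logged customer questions.
asked = [
    "How do I reset my password?",
    "Where is my invoice?",
    "how do I reset my password?",
    "Can I change my plan?",
    "Where is my invoice?",
    "How do I reset my password?",
]
prompts = build_prompts(asked)
for prompt in prompts:
    print(prompt, end="\n\n")
```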

Content Generation From Data

Nearly all modern business applications store data in structured databases and allow interaction via APIs. Such data can be used to generate documentation programmatically. For example, a billing system’s database (or API) can be queried to create documents that describe each of the tariffs and discounts defined in it, without having users maintain parallel ‘paper’ versions of details that are already natively crystallized as structured data.
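
The billing example can be sketched with Python's built-in sqlite3 module. The tariffs table, its columns, and the sample rows are assumptions invented for illustration; a real billing system would have its own schema or API.

```python
import sqlite3

def tariff_documents(conn):
    """Render one short text document per tariff row."""
    rows = conn.execute(
        "SELECT name, monthly_fee, discount_pct FROM tariffs ORDER BY name"
    ).fetchall()
    return [
        f"Tariff: {name}\nMonthly fee: {fee:.2f}\nDiscount: {discount}%\n"
        for name, fee, discount in rows
    ]

# Hypothetical schema and data standing in for the billing database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tariffs (name TEXT, monthly_fee REAL, discount_pct INTEGER)"
)
conn.executemany(
    "INSERT INTO tariffs VALUES (?, ?, ?)",
    [("Premium", 49.9, 10), ("Basic", 19.0, 0)],
)
docs = tariff_documents(conn)
print("\n".join(docs))
```

Because the documents are derived on demand, a tariff change in the database shows up in the next generation run with no manual editing.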

We may also set up structured data specifically for the purpose of content generation. This is the approach taken by business intelligence teams when generating business reports. In this case, a specific data structure is agreed upon for the generation of various tables and infographics, which are then embedded in a dashboard or business report document.

While there are a number of tools, such as Matplotlib, for generating visualizations of statistical information, there is also a healthy ecosystem of tools for generating business analysis and software engineering artifacts such as flowcharts, class diagrams, and cloud infrastructure architectures. For example, Graphviz can generate most ‘boxes and arrows’ diagram types from data, avoiding manual ‘drawing’ of information. Such an example is presented at the beginning of this article.
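
As a rough sketch of the ‘boxes and arrows from data’ idea, the Python below turns a list of (source, target) pairs into Graphviz DOT text. The to_dot helper and the customer journey steps are hypothetical, and the sketch assumes node labels contain no double quotes.

```python
def to_dot(edges, name="flow"):
    """Build Graphviz DOT source for a left-to-right directed graph."""
    lines = [f"digraph {name} {{", "    rankdir=LR;"]
    for source, target in edges:
        # Labels are quoted; this sketch assumes they contain no '"'.
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)

# Assumed tabular source data: one row per transition.
journey = [
    ("Visit site", "Sign up"),
    ("Sign up", "First purchase"),
    ("First purchase", "Renewal"),
]
dot_source = to_dot(journey, name="customer_trajectory")
print(dot_source)
```

The resulting text can be saved to a file and rendered with the Graphviz command-line tool, for example dot -Tsvg journey.dot -o journey.svg.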

Content Generation From Code 

In the case of embedding and blending we usually have a main body of text in which code snippets are embedded. This approach is suitable for tutorials or guides but not for reference documentation. 

When documenting APIs, including web services, methods, functions, and commands, it is preferable to generate the entire body of documentation from the relevant codebase.
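
A minimal sketch of generating reference entries directly from code, using Python's standard ast module. The SOURCE snippet and the reference_entries helper are hypothetical, and only top-level functions with positional parameters are handled.

```python
import ast

# Hypothetical codebase fragment to be documented.
SOURCE = '''
def apply_discount(price, pct):
    """Return price reduced by pct percent."""
    return price * (1 - pct / 100)

def tariff_name(code):
    """Look up the display name for a tariff code."""
    return code.title()
'''

def reference_entries(source):
    """Return one 'signature + docstring' entry per top-level function."""
    tree = ast.parse(source)
    entries = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            doc = ast.get_docstring(node) or "(undocumented)"
            entries.append(f"{node.name}({args})\n    {doc}")
    return entries

entries = reference_entries(SOURCE)
print("\n\n".join(entries))
```

Since the entries are plain strings, they can be fed into the same layout, search, and rendering pipeline as any other documentation rather than being locked into a tool-specific HTML output.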

Most programming languages have an associated tool to generate HTML-based documentation, and many web APIs are described using the OpenAPI specification, which is commonly rendered as HTML using tools such as SmartBear SwaggerHub. However, we often want greater control over key aspects of the documentation generation process, especially if we want to adhere to the tenets expressed in this book:

  • Coherence with the rest of the enterprise documentation—tenet of consistent layout.
  • Searchability so that code reference is searchable as any other type of documentation—tenet of contemporary prompt.
  • Connectedness so that business terms in narratives are meaningful—tenet of connected content.
  • Portability so that the code reference can be printed, browsed as an ebook, and so on—tenet of decoupled rendering.

It is worth noting that, for the purposes of content generation, code and other forms of semi-structured data require more effort than simple structured data. For example, extracting content from source code may require an off-the-shelf parser, or writing our own in the case of an obscure language or file format. Naturally, we can also use vanilla text manipulation primitives (even the grep command) for simple use cases such as extracting comments.
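
To illustrate why a parser can be worth the extra effort even for comment extraction, the sketch below uses Python's standard tokenize module. The sample code string and the extract_comments helper are hypothetical. A naive text search for "#" would also match the string literal on the second line, whereas the tokenizer reports only genuine comments.

```python
import io
import tokenize

def extract_comments(source):
    """Return (line_number, text) for each real comment in Python source."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append((tok.start[0], tok.string.lstrip("#").strip()))
    return comments

# Assumed sample input: line 2 contains '#' inside a string, not a comment.
code = 'rate = 0.2  # VAT rate\nlabel = "# not a comment"\n# Totals are rounded\n'
comments = extract_comments(code)
for line_no, text in comments:
    print(line_no, text)
```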

In a nutshell, for semi-structured data such as code we need to bring extra tooling to preprocess, parse, or decode the data source before it is in a sufficiently structured shape that facilitates the generation of documentation.

Content Generation From Multimedia Content

In the case of embedding and blending, we may embed a video in a document, its transcript, or both. We may also generate top-level documentation from a library of multimedia content. For example, when integrating with videoconferencing applications, we can organize the content by title, participants, time, and the topics derived from each transcript.
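
A top-level index of this kind might be sketched as follows. Everything here is an assumption for illustration: the Recording fields, the TOPIC_KEYWORDS table, and the simple keyword matching that stands in for whatever topic extraction a real system would apply to transcripts.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Recording:
    title: str
    participants: list
    date: str        # ISO date, so string sorting is chronological
    transcript: str

# Hypothetical topic table; a real system might use an LLM or classifier.
TOPIC_KEYWORDS = {
    "billing": ["invoice", "tariff"],
    "onboarding": ["welcome", "training"],
}

def topics_for(recording):
    """Return the topics whose keywords appear in the transcript."""
    text = recording.transcript.lower()
    return sorted(
        topic for topic, words in TOPIC_KEYWORDS.items()
        if any(word in text for word in words)
    )

def build_index(recordings):
    """Group recordings by topic, listing date, title, and participants."""
    index = defaultdict(list)
    for rec in sorted(recordings, key=lambda r: r.date):
        entry = f"{rec.date} {rec.title} ({', '.join(rec.participants)})"
        for topic in topics_for(rec):
            index[topic].append(entry)
    return dict(index)

library = [
    Recording("Q1 tariff review", ["Ana", "Ben"], "2024-02-01",
              "We revised the tariff table and the invoice layout."),
    Recording("New hire session", ["Cy"], "2024-01-15",
              "Welcome everyone, training starts today."),
]
index = build_index(library)
for topic, items in sorted(index.items()):
    print(topic, items)
```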


© 2022-2024 Ernesto Garbarino | Contact me at ernesto@garba.org