How to cite this paper

Kimber, Eliot. “High-Quality Microsoft Word documents from XML: The Wordinator.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Kimber01.

Balisage: The Markup Conference 2020
July 27 - 31, 2020

Balisage Paper: High-Quality Microsoft Word documents from XML: The Wordinator

Eliot Kimber

Senior Solutions Architect

Contrext, LLC

`<ekimber@contrext.com>`

Eliot Kimber is an XML practitioner currently working with a U.S. government agency on a new report authoring, management, and delivery system. He has been involved with SGML and XML for more than 30 years. Eliot has contributed to a number of standards, including SGML, HyTime, XML, XSLT, DSSSL, and DITA. While Eliot's focus has been managing large scale hyperdocuments for authoring and delivery, most of his day-to-day work involves producing online and paged (or pageable) media from XML documents. Eliot maintains a number of open-source projects including DITA for Publishers, The Wordinator, and the DITA Community collection of DITA-related tools and other aids. Eliot is author of DITA for Practitioners, Vol 1: Architecture and Technology, from XML Press. When not trying to retire the technical debt in his various open-source projects, Eliot lives with his family in Austin, Texas, where he practices Aikido and bakes bread.

Abstract

Many products make XML from Microsoft Word, but consider the reverse: making Word versions of your XML documents, thus using MS Word as a document composition engine. The Wordinator enables automatic creation of high-quality Word documents from XML source. It uses an extension of the Word2DITA project’s SimpleWP (Simple Word Processing markup language) as the input to an Apache POI-based Java application that generates Word documents. XSLT generates the SimpleWP XML, managing the mapping of source XML elements to Word constructs and styles. I consider, in particular, the separation of concerns between the XSLT that generates the SimpleWP XML and the Java code that generates the Word documents.

Problem Statement
Solution: Separate The XML Transform from The DOCX Generation
SimpleWP to DOCX via POI
Authored XML to DOCX Process
Authored XML to SimpleWP XML
Conclusions and Future Work

Problem Statement

The Wordinator is a Java processor that takes as input a simplified XML representation of a word processing document and produces as output a Microsoft Word DOCX document. The key requirements for The Wordinator are:

Accurately and completely reflect the page layout details and visual rendering required for non-trivial documents, initially codified municipal code as published by Municode, Inc., including complex tables with horizontal and vertical spans and embedded graphics.
Support the creation of multiple page sequences with different page geometries and different running heads and feet.
Minimize the effort required to configure the mapping from source elements to named styles.
Support local formatting overrides (that is, do not limit styling only to the use of named styles).
Enable the automatic creation of DOCX files for different components of the source document as authored ("chunking"), including the ability to create both a single DOCX for the complete document as well as individual DOCX files for subcomponents of the document in a single processing run.
Enable ease of integration into a larger XML-based publishing pipeline.

The Wordinator also integrates with Saxon in order to provide a one-command process that takes as input the document source as authored, generates the intermediate word processing XML in memory, and produces the DOCX files as output. See Figure 1.

The driving requirements for The Wordinator [WORDINATOR] come from Municode, Inc.'s need to provide Microsoft Word versions of the municipal codes they publish to HTML such that for any chosen section of a municipality's code, the user of the HTML can download a high-quality Word version of what they are reading. The Word documents may be statically generated as part of the HTML publishing process or generated on demand by the web server that provides the HTML versions of the code.^[1]

This requirement stems from the fact that the people working on municipal code almost invariably do their work in Microsoft Word. Almost without exception, reviews of and revisions to municipal code that are not done on printed paper are done in Word.

More generally, Microsoft Word is a de facto standard for document viewing and printing in many organizations and for many private individuals. For example, the U.S. Government Accountability Office uses Microsoft Word for all the drafts of its reports (the reports are either published directly from their Word drafts or imported into the GAO's authoring management system as XML generated from the Word documents). GAO also uses Word as an intermediate format for producing accessible PDFs of their reports.

In many commercial publishing contexts Word is likewise used as the primary or only format for drafts of publications as they are developed, even when the publications themselves will be put into an XML format for final pre-publication preparation and publishing.

Another general requirement is using Word as a page composition engine. For example, in the DITA community the ability to print DITA documents using Microsoft Word would serve the needs of many small organizations that do not have the time or resources to customize the DITA community’s main open-source pagination tool or cannot afford other commercial tools.

While it is not difficult to generate Microsoft's DOCX format using typical XML processing tools, it is a challenge to generate it correctly and with high quality such that the Word documents accurately reflect all of the structures and layout features required for a given document (to the degree that Word can support those layout requirements).

DOCX is an XML-based format optimized for the representation of the internal structures used by word processors, spreadsheets, and so on. The XML files are packaged into a Zip file that then forms the working DOCX file. DOCX is standardized as ECMA International standard ECMA-376, Office Open XML File Formats [ECMA-376-1].
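
Because a DOCX file is an ordinary Zip package, its parts can be listed with nothing more than the JDK. The following is a minimal sketch and not code from The Wordinator: the class name DocxPartsSketch and its methods are hypothetical, and the package it builds in memory is a stand-in with placeholder content rather than a valid Word document. The three part names used are typical of real DOCX packages.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of The Wordinator or POI.
final class DocxPartsSketch {

  /** Builds a stand-in package with placeholder parts (NOT a valid Word document). */
  static byte[] buildStandInPackage() {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
      for (String name : List.of("[Content_Types].xml", "_rels/.rels", "word/document.xml")) {
        zip.putNextEntry(new ZipEntry(name));
        zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Lists the names of the parts (Zip entries) in a DOCX-style package, in stored order. */
  static List<String> listParts(byte[] packageBytes) {
    List<String> names = new ArrayList<>();
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(packageBytes))) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        names.add(entry.getName());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return names;
  }

  public static void main(String[] args) {
    listParts(buildStandInPackage()).forEach(System.out::println);
  }
}
```

Reading the bytes of a real DOCX file and passing them to listParts() would show the same kind of structure, with many more parts (styles, numbering, relationships, media, and so on).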

Because DOCX is an XML format, it is of course possible to generate Office Open XML directly using XSLT, and a number of tools do that, including a plug-in for the DITA Open Toolkit [ELOVIRTA]. However, the markup generated is highly detailed and requires managing a number of low-level concerns, such as ID-based references among different files, the detailed rules for the construction of specific components, and so on. It is easy to get this markup wrong due to the complexity of the markup itself and the vagaries of how it is processed by Microsoft Word.

In the context of Municode's requirements, the need was for Word documents that matched, as closely as possible, the visual layout of the municipal code as published in HTML and PDF, as well as ensuring that all content was correctly reflected, enabling working hyperlinks, generated tables of contents, and so on. In addition, the visual style of the generated Word documents could potentially vary from municipality to municipality, indicating the need for easy-to-configure styling and content organization details.

Municode, like many publishers, was putting in place a general XML-based system for producing multiple outputs from a single XML source, meaning that the core XML processing tools (in particular the Saxon XSLT engine) were available for use for Word generation as well.

In addition to meeting Municode's immediate requirements, I wanted to develop a general-purpose XML-to-DOCX tool that could be quickly adapted to other documentation formats, in particular DITA. Municode agreed to allow me to develop The Wordinator as an open-source project.

Solution: Separate The XML Transform from The DOCX Generation

Transforming directly from arbitrary XML for published documents (DocBook, JATS, DITA, HTML5) to DOCX, while possible, does not provide a general solution that could be easily adapted to other formats.

As a general design principle, I like to use multi-phase processes that separate different concerns as much as possible and as appropriate.

In this case, there are three main concerns:

Defining the visual styling and page layout of the generated Word documents.
Mapping of source document elements to the appropriate Word structures: paragraphs and character runs with specific visual effects or named styles, tables, placed images, and other page constructs, such as running heads and feet.
Generating the DOCX files themselves.

These three concerns are reflected more generally in any XML-based publishing process:

How should the content as published look? (layout and styling)
How is the content as authored mapped to the input to the publishing engine? (transformation mapping)
How are the published artifacts generated? (publishing automation that replaces manual artifact creation such as manual page layout)

Commonly-used XML-based publishing technologies such as XSL Formatting Objects (XSL-FO) present the challenge that the layout and styling concern is not easily separable from the transformation mapping concern, which makes XSL-FO a challenge to use and more expensive to implement and maintain than approaches that keep the design concern separate (for example, using CSS pagination).

As a general observation, the more that layout and styling are defined in the transformation logic, the harder they are to both develop initially and adapt to new requirements, because it usually requires a software engineer to implement any required stylistic changes. That is, the styling concern is not in the hands of those who could or would otherwise define the styling.

In the case of Word documents, the styling concern is best implemented using Word styles, which provide a fairly complete mechanism for defining the visual look and feel of the resulting document, including page layout definitions (page geometry, running heads and feet, etc.). In addition, Word's existing features for generating tables of contents and reflecting layout-specific data (page numbers, paragraph numbers, automatic list numbers, etc.) satisfy most, if not all, layout definition requirements.

Given a template with a complete set of styles for paragraphs, text runs, tables, and objects, as well as page design definitions (page geometry and page headers and footers), the mapping from source document elements to their appropriate visual renderings is mostly a matter of mapping elements in context to Word component types and style names. This makes the source-to-layout transform about as simple as it can be, significantly reducing the engineering cost needed to implement the transformation for any given input XML source. The mapping is simple enough that it could be mostly or entirely defined through a declarative configuration or defined using an interactive mapping tool.

That leaves the generation of the DOCX data itself.

The Apache POI library [POI] provides a robust open-source Java API for reading and writing Office Open documents. This API handles most of the details of the Office Open XML format (OOXML) and thus makes the actual generation task both easier to implement and more reliable than the equivalent direct OOXML generation using XSLT would be. The Apache POI project is actively maintained and provides reasonably frequent updates.

I had prior experience using the POI library to generate Office documents in the context of The Slidinator [SLIDINATOR], a tool for generating PowerPoint documents from DITA-based XML source, so I knew that POI would provide a quick and robust solution for generating Word documents.

That left only the question of how to get from the source XML to POI. This requires a Java processor that interprets some input source and calls the POI API to produce the result DOCX files. In addition, the processor must be able to read a Word template (DOTX) that defines the styles and page layouts to which the source XML is mapped.

My solution was to adapt the Simple Word Processing (SimpleWP) markup I originally developed for the Word2DITA transformation framework [WORD2DITA], which goes from DOCX documents to XML for document authoring, by adding the information needed to also go from arbitrary XML to DOCX.

The SimpleWP XML is then processed in Java to generate the DOCX content via the POI library, which handles generating both the individual XML files that make up a DOCX file as well as doing the Zip processing to create the final working DOCX file.

In terms of the above three concerns, the solution is:

Use Microsoft Word to define a normal template document that provides all the named styles needed to implement the desired published look and feel, as well as the necessary page layout and section definitions.
Implement an XSLT transform that generates SimpleWP XML from the authored source XML. Municode authors in HTML5 so the initial implementation of this concern was a relatively simple HTML-to-SimpleWP transform.
Provide a general-purpose Java component that reads SimpleWP input documents and generates the DOCX results using the POI library.

This separation of concerns keeps the styling task in the hands of Microsoft Word experts, minimizes the source-to-style mapping XSLT implementation effort, and largely encapsulates the details of the DOCX generation in the SimpleWP-to-POI process, which can be treated as a black box.

In addition, because The Wordinator comes out of the box with an HTML-to-SimpleWP mapping, other documentation source vocabularies can use The Wordinator by generating HTML rather than going all the way to SimpleWP. As most, if not all, such vocabularies already have robust HTML generation transforms, the cost of adapting those to generate HTML for use with The Wordinator should be low.

SimpleWP to DOCX via POI

The Simple Word Processing XML vocabulary (SimpleWP) is a simplified representation of typical word processing formats, specifically Word's structures: paragraphs, inline runs, tables, objects (image references, other embedded objects), etc. It provides the minimum information needed to either capture or generate the essential content and properties of Word document content.

The SimpleWP vocabulary was originally developed to enable the transformation of Word documents into DITA XML and as such did not reflect layout-specific details such as page sequences and what Word calls "sections", which are sequences of pages with the same page geometry and running head and foot definitions (what would be page sequences in XSL-FO).

To adapt SimpleWP to the needs of DOCX generation I added markup to represent page masters, page sequence masters, and page sequences, that is, Office Open sections and section-specific components. I used terminology that is more reflective of XSL Formatting Objects because that is what I'm most familiar with and what will likely be most familiar to other XML practitioners who might work with The Wordinator.

A typical SimpleWP document looks like this:

<document xmlns="urn:ns:wordinator:simplewpml">
  <page-sequence-properties>
    <headers-and-footers>
      <header type="odd">
        <p style="Header" styleId="Header">
          <run><tab/></run>
          <run><tab/></run>
          <run>IFRS 13</run>
        </p>
      </header>
      <header type="even">
        <p style="Header" styleId="Header">
          <run>IFRS 13</run>
        </p>
      </header>
      <footer type="odd">
        <p style="Footer" styleId="Footer">
          <run><tab/></run>
          <run>© IFRS Foundation</run>
          <run><tab/></run>
          <page-number-ref/>
        </p>
      </footer>
      <footer type="even">
        <p style="Footer" styleId="Footer">
          <page-number-ref/>
          <run><tab/></run>
          <run>© IFRS Foundation</run>
        </p>
      </footer>
    </headers-and-footers>
  </page-sequence-properties>
  <body>
    <section type="oddPage">
      <body>
        <p style="IASB Identifier" styleId="IASBIdentifier">
          <run>IFRS 13</run>
        </p>
        ...
      </body>
    </section>
    ...
  </body>
</document>

The full SimpleWP grammar is available from The Wordinator project. Its design is purely utilitarian, with the goal of keeping it as simple as possible in order to enable generation of all required Word features. In particular, it is not (yet) a full representation of everything you could do in a Word document. For example, it does not provide a way to represent arbitrarily-placed text boxes.

The Java code that processes SimpleWP XML is implemented as a single Java class, DocxGenerator, of about 2300 lines, that implements the overall business logic, supported by a number of utility classes that abstract some fundamental components, such as measurements and table column definitions.

The DocxGenerator class has three inputs:

A SimpleWP XML document.
The Word template (DOTX) that provides the style and page layout definitions to use in the result document.
The DOCX file to write to.

The code as implemented assumes that all DOCX files are written to the file system--there was no requirement to be able to stream the DOCX for output, although of course that could be added easily enough.

The DOCX generation process operates on the SimpleWP XML using the Apache XMLBeans cursor API (XmlCursor). The XML document is first parsed into an XmlObject instance:

XmlObject xml = XmlObject.Factory.parse(inFile);

XmlObject uses a cursor model to step through the XML, including moving up and down the document hierarchy. This is the same approach used in the POI code itself to read and write the DOCX XML files.

For example, the top-level constructDoc() method looks like this:

private void constructDoc(XWPFDocument doc, XmlObject xml) throws DocxGenerationException {
    XmlCursor cursor = xml.newCursor();
    cursor.toFirstChild(); // Put us on the root element of the document
    cursor.push();
    XmlObject pageSequenceProperties = null;
    if (cursor.toChild(new QName(DocxConstants.SIMPLE_WP_NS, "page-sequence-properties"))) {
      // Set up document-level headers. These will apply to the whole
      // document if there are no sections, or to the last section if
      // there are sections. Results in a w:sectPr as  the last child 
      // of w:body.
      setupPageSequence(doc, cursor.getObject());
      pageSequenceProperties = cursor.getObject();
    }
    cursor.pop();
    cursor.toChild(new QName(DocxConstants.SIMPLE_WP_NS, "body"));
    handleBody(doc, cursor.getObject(), pageSequenceProperties);      
}

This provides a reasonably simple and natural way to process the XML. The main challenge is ensuring that pushes and pops on the cursor stack are balanced.
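
One way to make the push/pop balance harder to get wrong is to route every save-and-restore through a single helper that uses try/finally. The following is only a sketch of that idea and is not how The Wordinator is written: PositionStack is a hypothetical stand-in for the two cursor-stack operations used above (it is not the XMLBeans interface), and CountingStack exists purely so the example runs on its own.

```java
import java.util.function.Supplier;

// Hypothetical illustration class; not part of The Wordinator, POI, or XMLBeans.
final class BalancedCursorSketch {

  /** Stand-in for the cursor-stack operations (assumption: push saves, pop restores). */
  interface PositionStack {
    void push();
    void pop();
  }

  /** Trivial implementation that only tracks stack depth, for demonstration. */
  static final class CountingStack implements PositionStack {
    int depth = 0;
    @Override public void push() { depth++; }
    @Override public void pop() { depth--; }
  }

  /** Saves the position, runs the work, and always restores the position. */
  static <T> T withSavedPosition(PositionStack cursor, Supplier<T> work) {
    cursor.push();
    try {
      return work.get();
    } finally {
      cursor.pop(); // runs on normal return and when work throws
    }
  }

  public static void main(String[] args) {
    CountingStack stack = new CountingStack();
    String result = withSavedPosition(stack, () -> "handled child");
    System.out.println(result + ", depth=" + stack.depth);
  }
}
```

With this shape, a handler that returns early or throws cannot leave an unmatched push behind, at the cost of wrapping each navigation step in a lambda.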

The main output processing is handled by the handleBody() method, which processes the content of the <body> element to which it is applied and returns the last paragraph in the section or complete document to which the body applies:

private XWPFParagraph handleBody(
      XWPFDocument doc, 
      XmlObject xml, 
      XmlObject pageSequenceProperties) 
        throws DocxGenerationException {
  XmlCursor cursor = xml.newCursor();
  if (cursor.toFirstChild()) {
    do {
      String tagName = cursor.getName().getLocalPart();
      String namespace = cursor.getName().getNamespaceURI();
      if ("p".equals(tagName)) {
        XWPFParagraph p = doc.createParagraph();
        makeParagraph(p, cursor);
      } else if ("section".equals(tagName)) {
        handleSection(doc, cursor.getObject(), pageSequenceProperties);
      } else if ("table".equals(tagName)) {
        XWPFTable table = doc.createTable();
        makeTable(table, cursor.getObject());
      } else if ("object".equals(tagName)) {
        // FIXME: This is currently unimplemented.
        makeObject(doc, cursor);
      } else {
        log.warn("handleBody(): Unexpected element {" + namespace + "}:'" + tagName + "' in <body>. Ignored.");
      }
    } while (cursor.toNextSibling());  
  }
  // The section properties always go on an empty paragraph.
  XWPFParagraph lastPara = doc.createParagraph();
  lastPara.setSpacingBefore(0);
  lastPara.setSpacingAfter(0);
  return lastPara;
}

This method simply iterates over the children of <body> and dispatches each child to the appropriate handler.

The XWPF objects are the top-level POI classes that abstract fundamental Office Open constructs for Word documents, hide the details of how the actual Office Open XML is constructed, and provide appropriate methods for constructing the objects in terms of their semantics rather than in terms of the underlying Office Open details. This makes the API about as easy to use as it could be for this task.

Where the XWPF classes do not support generation of the Office Open XML details, it is (usually) possible to construct the underlying XML structures directly using the lower level POI APIs.

In a few cases I needed to extend the XWPF API to meet the needs of The Wordinator. In each of those cases I was able either to contribute the enhancement back to the POI project in time for a release that met Municode's schedule or to produce my own local build of POI, as needed.

In general, it was clear that most users of POI are reading, but not writing, Word documents.

Construction of individual elements, such as paragraphs, gets a little more involved (some details omitted for brevity):

private XWPFParagraph makeParagraph(
    XWPFParagraph para, 
    XmlCursor cursor, 
    Map<String, String> additionalProperties) 
        throws DocxGenerationException {
  
  cursor.push();
  String styleName = cursor.getAttributeText(DocxConstants.QNAME_STYLE_ATT);
  String styleId = cursor.getAttributeText(DocxConstants.QNAME_STYLEID_ATT);

  if (null != styleName && null == styleId) {
    // Look up the style by name:
    XWPFStyle style = para.getDocument().getStyles().getStyleWithName(styleName);
    if (null != style) {
      styleId = style.getStyleId();
    }
  }
  if (null != styleId) {
    para.setStyle(styleId);
  }
        
  // Explicit page break on a paragraph should override the section-level break I would think.
  String pageBreakBefore = cursor.getAttributeText(DocxConstants.QNAME_PAGE_BREAK_BEFORE_ATT);
  if (pageBreakBefore != null) {
    boolean breakValue = Boolean.valueOf(pageBreakBefore);
    para.setPageBreak(breakValue);
  }

  if (cursor.toFirstChild()) {
    do {
      String tagName = cursor.getName().getLocalPart();
      String namespace = cursor.getName().getNamespaceURI();
      if ("run".equals(tagName)) {
        makeRun(para, cursor.getObject());
      } else if ("bookmarkStart".equals(tagName)) {
        makeBookmarkStart(para, cursor);
      } else if ("bookmarkEnd".equals(tagName)) {
        makeBookmarkEnd(para, cursor);
      } else if ("fn".equals(tagName)) {
        makeFootnote(para, cursor.getObject());
      } else if ("hyperlink".equals(tagName)) {
        makeHyperlink(para, cursor);
      } else if ("image".equals(tagName)) {
        makeImage(para, cursor);
      } else if ("object".equals(tagName)) {
        makeObject(para, cursor);
      } else if ("page-number-ref".equals(tagName)) {
        makePageNumberRef(para, cursor);
      } else {
        log.warn("Unexpected element {" + namespace + "}:" + tagName + " in <p>. Ignored.");
      }
    } while(cursor.toNextSibling());
  }
  cursor.pop();
  return para;
}

Again, the method iterates over the contents of the incoming paragraph and dispatches each child to the appropriate construction handler.

The main implementation challenge with paragraphs is applying the appropriate styles. When an input SimpleWP paragraph, run, or table specifies a style name, the style must be present in the input template or there is no way to correctly style the document.

In addition, Office Open XML has the concept of "latent styles", which are styles where the style definition is defined entirely in the processing application, i.e., Microsoft Word, and is not otherwise defined in the Office Open XML anywhere. References to latent styles are thus not resolvable to anything in a general way because they are by definition application-defined. The DOCX file lists the names of all latent styles, so you can know if a style name is the name of a latent style, but you have no general way of knowing what the definition of that style is.

For example, in Microsoft Word, when you select the option to view All Styles, you are seeing both styles that are explicitly defined in the document's style catalog as well as all latent styles. If you subsequently select a latent style for use on content in the document, the latent style is copied into your document's local style catalog. This ensures that a given document's style catalog is as small as possible but makes it hard to know, a priori, what the actual definition of a given latent style is, as there's no generally-available definition of the latent styles that I'm aware of, short of creating a document that uses every latent style and then capturing its style catalog.

One missing feature in the XWPF API is access to the list of latent styles to know if a given style name is in fact a latent style--the API simply never considered the need because it has no relevance to reading DOCX files, only to writing them (or working with styles in some way). The challenge for The Wordinator is distinguishing between a style name that does not exist at all and a style name that is a latent style so that the user can be accurately informed about a bad style name as distinct from a reference to a latent style.
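
The three-way distinction described here (defined, latent, or nonexistent) can be sketched as a simple classification. This is an illustration only, not The Wordinator's code: the class StyleLookupSketch and its methods are hypothetical, and it assumes the caller has already obtained both the template's defined styles (name to style ID) and the set of latent style names by some means, for example by reading the template's styles part directly. The style names in main() are sample data.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of The Wordinator or POI.
final class StyleLookupSketch {

  enum StyleStatus { DEFINED, LATENT, UNKNOWN }

  /**
   * Classifies a style name.
   * definedNameToId: styles explicitly defined in the template (assumed input).
   * latentNames: names listed as latent styles (assumed input).
   */
  static StyleStatus classify(
      String styleName, Map<String, String> definedNameToId, Set<String> latentNames) {
    if (definedNameToId.containsKey(styleName)) {
      return StyleStatus.DEFINED;
    }
    if (latentNames.contains(styleName)) {
      return StyleStatus.LATENT;
    }
    return StyleStatus.UNKNOWN;
  }

  /** Builds a user-facing message that distinguishes the two failure cases. */
  static String describe(
      String styleName, Map<String, String> definedNameToId, Set<String> latentNames) {
    switch (classify(styleName, definedNameToId, latentNames)) {
      case DEFINED:
        return "Style '" + styleName + "' resolved to ID '" + definedNameToId.get(styleName) + "'.";
      case LATENT:
        return "Style '" + styleName + "' is a latent style: add it to the template to use it.";
      default:
        return "Style '" + styleName + "' does not exist in the template.";
    }
  }

  public static void main(String[] args) {
    Map<String, String> defined = Map.of("IASB Identifier", "IASBIdentifier");
    Set<String> latent = Set.of("Intense Quote");
    for (String name : new String[] {"IASB Identifier", "Intense Quote", "No Such Style"}) {
      System.out.println(describe(name, defined, latent));
    }
  }
}
```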

As in all publishing processes, tables are the most challenging structure to generate, mostly because of the challenge of handling vertical spans. However, because the SimpleWP table markup is already a close match to the Office Open XML table model, the actual processing is not that complicated.

Another table generation challenge is relative column widths where relative widths are mixed with absolute widths.

Office Open has a mechanism for specifying relative widths as a percentage of the total table width but the SimpleWP markup usually will not specify the absolute width of the table because that is normally a function of the output rendering.

Without knowing the width of the table there is no way to determine what fraction of the total a given relative-width column is when any other columns have explicit widths.

If all the columns have relative widths then you can calculate the percentage each column uses.

The SimpleWP table element provides an attribute for specifying the explicit width of the table but most authoring formats do not provide a reliable or general way to know what the rendered width of the table should be.

Thus, when the width of the table is not specified, The Wordinator effectively requires either all explicit widths or all relative widths and issues a warning if this is not the case.
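
The arithmetic behind this rule is small enough to sketch. The following is an illustration under stated assumptions, not The Wordinator's implementation: the class ColumnWidthSketch, its nested ColumnWidth type, and the choice of points for absolute widths are all hypothetical. It shows that all-relative columns reduce to percentages of the table width, and that a mix of relative and absolute columns is detectable so that a warning can be issued.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of The Wordinator.
final class ColumnWidthSketch {

  /** A column width: either a relative weight or an absolute measure (points assumed). */
  static final class ColumnWidth {
    final double value;
    final boolean isRelative;

    private ColumnWidth(double value, boolean isRelative) {
      this.value = value;
      this.isRelative = isRelative;
    }

    static ColumnWidth relative(double weight) { return new ColumnWidth(weight, true); }
    static ColumnWidth absolute(double points) { return new ColumnWidth(points, false); }
  }

  /** True when relative and absolute widths are mixed (the case that warrants a warning). */
  static boolean isMixed(List<ColumnWidth> columns) {
    boolean anyRelative = columns.stream().anyMatch(c -> c.isRelative);
    boolean anyAbsolute = columns.stream().anyMatch(c -> !c.isRelative);
    return anyRelative && anyAbsolute;
  }

  /** Converts all-relative widths to percentages of the (unknown) table width. */
  static List<Double> toPercentages(List<ColumnWidth> columns) {
    double totalWeight = 0;
    for (ColumnWidth c : columns) {
      if (!c.isRelative) {
        throw new IllegalArgumentException("Percentages need all columns to be relative.");
      }
      totalWeight += c.value;
    }
    if (totalWeight <= 0) {
      throw new IllegalArgumentException("Relative weights must sum to a positive value.");
    }
    List<Double> percentages = new ArrayList<>();
    for (ColumnWidth c : columns) {
      percentages.add(c.value * 100.0 / totalWeight);
    }
    return percentages;
  }

  public static void main(String[] args) {
    List<ColumnWidth> columns = List.of(
        ColumnWidth.relative(1), ColumnWidth.relative(2), ColumnWidth.relative(1));
    System.out.println(toPercentages(columns));
  }
}
```

With weights 1, 2, and 1 the columns take 25, 50, and 25 percent of the table. Once a column with an absolute width is added, the remaining share cannot be computed without the table width, which is exactly the case the warning covers.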

As with similar single-pass composition processes, the DOCX generation process does not have access to the formatted DOCX document, so it cannot know what the final rendered size of any component is.

Authored XML to DOCX Process

While the core DOCX generation processor takes as input SimpleWP documents, the normal use case for The Wordinator starts with the authored XML as input, producing one or more DOCX files as output with all intermediate processing done in memory, as opposed to first generating a set of SimpleWP documents and then processing them to DOCX as separate process invocations.

To facilitate this use case, The Wordinator integrates Saxon to do the authored-XML-to-SimpleWP transform and then immediately generate DOCX from the resulting SimpleWP XML.

The Wordinator provides a command-line application that can be invoked by specifying the authored XML source, the XSLT transform to apply to that source, the DOTX template to use for the result DOCX files, and the directory to write the DOCX files to.

The connection from the Saxon result to the DOCX generator is done using a Saxon-specific output URI resolver:

Processor processor = new Processor(false);
DocxGeneratingOutputUriResolver outputResolver = 
      new DocxGeneratingOutputUriResolver(outDir, templateDoc, log);
processor.setConfigurationProperty(Feature.OUTPUT_URI_RESOLVER, outputResolver);

...
XdmValue result = transformer.applyTemplates(docSource);

The output URI resolver is then used by Saxon for xsl:result-document instructions, effectively providing the result SimpleWP output of the XSLT transform to the DOCX builder. The following is from the HTML-to-SimpleWP transform provided with The Wordinator, where $result-uri is the URI of the result DOCX file:

<xsl:template 
    match="xhtml:section[local:is-chunk(.)] | section[local:is-chunk(.)] | 
           xhtml:html[local:is-chunk(.)] | html[local:is-chunk(.)]" priority="10">
  
  ... (generate $swpx-base-result)...
  
  <xsl:result-document href="{$result-uri}" format="swpx" >
    <xsl:apply-templates select="$swpx-base-result" mode="cleanup-swpx">
      <xsl:with-param name="doDebug" as="xs:boolean" tunnel="yes" select="$doDebug"/>
    </xsl:apply-templates>
  </xsl:result-document>
</xsl:template>

The DocxGeneratingOutputUriResolver's resolve() method sets up the Result object, which simply provides the result URI to use for the generated DOCX file:

public Result resolve(String href, String base) throws TransformerException {
  saxHandler = XmlObject.Factory.newXmlSaxHandler();

  Result result = new SAXResult(saxHandler.getContentHandler());
  result.setSystemId(href);
  return result;
  
}

The resolver's close() method does the actual DOCX generation:

public void close(Result result) throws TransformerException {

  // Do the DOCX building
  try {
    XmlObject xml = saxHandler.getObject();
    String outFilepath = URLDecoder.decode(result.getSystemId(), "UTF-8");
    String filename = FilenameUtils.getBaseName(outFilepath) + ".docx";
    File outFile = new File(outDir, filename);
    File inFile = new File(new URL(result.getSystemId()).toURI());
    log.info("Generating DOCX file \"" + outFile.getAbsolutePath() + "\"");
    DocxGenerator generator = new DocxGenerator(inFile, outFile, templateDoc);
    generator.setDotsPerInch(dotsPerInch);
    generator.generate(xml);
  } catch (Exception e) {
    throw new TransformerException(e);
  }

}
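
The file-naming step in close() is worth isolating, because it means only the base name of the result URI survives: the directory portion and the extension are discarded, and the file is written to the configured output directory with a .docx extension. The following JDK-only sketch approximates that derivation. It is not the resolver's code (which uses FilenameUtils), the class and method names are hypothetical, and the sample URIs, including the .swpx extension, are made up for illustration.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; approximates the naming logic with the JDK only.
final class DocxFilenameSketch {

  /** Derives "<base name>.docx" from a result system ID (URI string). */
  static String docxFilenameFor(String resultSystemId) {
    // Note: URLDecoder also turns '+' into a space, as it does in the original code.
    String decoded = URLDecoder.decode(resultSystemId, StandardCharsets.UTF_8);
    int lastSeparator = Math.max(decoded.lastIndexOf('/'), decoded.lastIndexOf('\\'));
    String name = decoded.substring(lastSeparator + 1);
    int lastDot = name.lastIndexOf('.');
    String baseName = (lastDot >= 0) ? name.substring(0, lastDot) : name;
    return baseName + ".docx";
  }

  public static void main(String[] args) {
    System.out.println(docxFilenameFor("file:/out/chapter%201.swpx"));
    System.out.println(docxFilenameFor("file:/out/title12"));
  }
}
```

A practical consequence for the XSLT author is that two result documents with the same base name in different directories would map to the same DOCX file name under this derivation.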

Note that this approach puts the decision of how to chunk the result DOCX files in the authoring-to-SimpleWP transform.

The connection between Saxon itself and the DOCX generator is completely generic.

This use of an output URI resolver to manage the generation of the DOCX files makes the use of the DOCX generator as natural as using Saxon to generate XML files and does not require any special consideration on the part of the XSLT implementor other than specifying the name of the result DOCX file in the result document URI.

Authored XML to SimpleWP XML

The generation of SimpleWP XML involves mapping the authored structures to the appropriate word processing structure and mapping content elements in context to the appropriate named style or specific formatting effect.

The HTML-to-SimpleWP transform provided with The Wordinator demonstrates the general technique:

A main processor maps the source XML to the appropriate general SimpleWP structures: document, section, paragraph, run, table, etc.
A "get style name" mode module provides the element-in-context to style name mappings.

The main processor is fairly generic for a given input vocabulary because for most authoring vocabularies the structural mapping will be the same regardless of the content details. For example, for DocBook, each top-level division would probably result in a new result section and <para> will almost always result in a SimpleWP paragraph.

For example, the base template for HTML paragraphs and similar block elements is:

<xsl:template match="
  xhtml:p | p | 
  xhtml:dt | dt | 
  xhtml:dd[empty(xhtml:p)] | dd[empty(p)] | 
  xhtml:pre | pre">
  <xsl:param name="doDebug" as="xs:boolean" tunnel="yes" select="false()"/>

  <wp:p>
    <xsl:call-template name="set-style">
      <xsl:with-param name="doDebug" as="xs:boolean" tunnel="yes" select="$doDebug"/>
    </xsl:call-template>
    <xsl:apply-templates select="." mode="text-before">
      <xsl:with-param name="doDebug" as="xs:boolean" tunnel="yes" select="$doDebug"/>
    </xsl:apply-templates>
    <xsl:apply-templates>
      <xsl:with-param name="doDebug" as="xs:boolean" tunnel="yes" select="$doDebug"/>
    </xsl:apply-templates>
    <xsl:apply-templates select="." mode="text-after">
      <xsl:with-param name="doDebug" as="xs:boolean" tunnel="yes" select="$doDebug"/>
    </xsl:apply-templates>
  </wp:p>
</xsl:template>

The "get style name" module provides the more variable, and easier-to-code, mapping of elements in context to style names.

The get style name module provides three main services:

A literal class- or tag-name-to-style map using an XSLT map.
A base mapping to Word's default styles for headings, lists, and other common structures for which Word provides built-in styles or dedicated OOXML structures.
Explicit element-in-context mapping to arbitrary style names or format overrides.

The design goal is to make the mapping from the authored source (or the HTML generated from the authored source) to Word styles as simple as possible to specify so that the easy cases are easy, obvious default mappings just work, and special cases can be implemented without unnecessary overhead.
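The layering of these three services can be pictured outside XSLT as a chain of lookups. The Java sketch below is only an analogy, not how The Wordinator is implemented (it uses XSLT template matching). The `Elem` and `ContextRule` types, the sample entries, and the precedence order (context rules first, then the literal map, then built-in defaults) are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.function.Predicate;

final class StyleNameSketch {

    /** Hypothetical stand-in for an element in context; not a Wordinator type. */
    static final class Elem {
        final String name;        // tag name, e.g. "p" (required)
        final String cssClass;    // @class value, or null
        final String parentName;  // parent tag name, or null
        final String parentClass; // parent @class value, or null
        final int position;       // 1-based position among same-named siblings

        Elem(String name, String cssClass, String parentName,
             String parentClass, int position) {
            this.name = Objects.requireNonNull(name);
            this.cssClass = cssClass;
            this.parentName = parentName;
            this.parentClass = parentClass;
            this.position = position;
        }
    }

    /** An explicit element-in-context rule: a test plus the style it yields. */
    static final class ContextRule {
        final Predicate<Elem> matches;
        final String styleName;

        ContextRule(Predicate<Elem> matches, String styleName) {
            this.matches = matches;
            this.styleName = styleName;
        }
    }

    // Explicit element-in-context rules (sample entry only).
    static final List<ContextRule> CONTEXT_RULES = List.of(
        new ContextRule(
            e -> "p".equals(e.name) && "article".equals(e.parentName)
                && "sidebar".equals(e.parentClass) && e.position == 1,
            "Sidebar Para First"));

    // Literal class- or tag-name-to-style map (sample entries only).
    static final Map<String, String> CLASS_TO_STYLE = Map.of(
        "b", "bold", "cite", "italic",
        "codeblock", "Codeblock", "codeph", "codeph");

    // Defaults that point at Word's built-in styles (sample entries only).
    static final Map<String, String> BUILT_IN = Map.of(
        "h1", "Heading 1", "h2", "Heading 2");

    /** Returns the style name for e, or empty if nothing applies. */
    static Optional<String> styleNameFor(Elem e) {
        for (ContextRule rule : CONTEXT_RULES) {
            if (rule.matches.test(e)) {
                return Optional.of(rule.styleName);
            }
        }
        // Immutable maps reject null keys, so guard the optional class value.
        if (e.cssClass != null && CLASS_TO_STYLE.containsKey(e.cssClass)) {
            return Optional.of(CLASS_TO_STYLE.get(e.cssClass));
        }
        if (CLASS_TO_STYLE.containsKey(e.name)) {
            return Optional.of(CLASS_TO_STYLE.get(e.name));
        }
        return Optional.ofNullable(BUILT_IN.get(e.name));
    }
}
```

An empty result corresponds to the optional return type of the XSLT mode: no style name is asserted and the paragraph or run falls back to the template's default formatting.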

The class or tag name mapping is simply a literal XSLT map:

<xsl:variable name="classToStyleNameMap" as="map(xs:string, xs:string)">
  <!-- Mapping of DITA element types and generated classes to likely 
       style names.
    -->
  <xsl:map>
    <xsl:map-entry key="'b'" select="'bold'"/>
    <xsl:map-entry key="'cite'" select="'italic'"/>
    <xsl:map-entry key="'cmd'" select="'cmd'"/>
    <xsl:map-entry key="'cmdname'" select="'cmdname'"/>
    <xsl:map-entry key="'codeblock'" select="'Codeblock'"/>
    <xsl:map-entry key="'codeph'" select="'codeph'"/>
    ...
    <xsl:map-entry key="'xmlelement'" select="'xmlelement'"/>
    <xsl:map-entry key="'xmlnsname'" select="'xmlnsname'"/>
  </xsl:map>
</xsl:variable>

This syntax is simple enough that anyone should be able to modify it, but of course it could be moved to a separate configuration file if that were useful. That level of flexibility and convenience was not a requirement for Municode.

One could imagine, for example, an interactive application that reads the style definitions from a DOTX or DOCX file and the @class values from an HTML document or the set of element type names from an XML document and enables specifying the mapping from class to style.

The default "obvious" mappings include mapping hierarchical titles to Word's built-in Heading N styles and styles for lists, e.g.:

<xsl:template mode="get-style-name" as="xs:string?"
  match="
    xhtml:h1 |
    xhtml:h2 |
    xhtml:h3 |
    xhtml:h4 |
    xhtml:h5
  " 
  >
  <xsl:param name="doDebug" as="xs:boolean" tunnel="yes" select="false()"/>
  
  <xsl:variable name="heading-number" as="xs:string"
    select="substring-after(local-name(.), 'h')"
  />
  <xsl:variable name="headingLevel" as="xs:integer"
    select="xs:integer($heading-number)"
  />
  <xsl:variable name="result" as="xs:string" 
    select="'Heading ' || $headingLevel"
  />
  <xsl:sequence select="$result"/>
</xsl:template>

Note that the result of the template is an optional string, which is the name of the Word style to use in the generated DOCX file.
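The same derivation can be written as a small Java function, shown here only to make the logic explicit. The class and method names are hypothetical. The sketch works from the local part of the element name, so a namespace prefix such as `xhtml:` in the source document does not interfere, and it returns an empty `Optional` where the XSLT mode would return an empty sequence.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeadingStyleSketch {

    // Matches the local names h1 through h5, capturing the level digit.
    private static final Pattern HEADING = Pattern.compile("h([1-5])");

    /** Returns "Heading N" for h1..h5 element names, otherwise empty. */
    static Optional<String> headingStyle(String qName) {
        // Strip any namespace prefix; indexOf returns -1 when there is none,
        // in which case substring(0) keeps the whole name.
        String localName = qName.substring(qName.indexOf(':') + 1);
        Matcher m = HEADING.matcher(localName);
        if (!m.matches()) {
            return Optional.empty();
        }
        int level = Integer.parseInt(m.group(1));
        return Optional.of("Heading " + level);
    }

    public static void main(String[] args) {
        System.out.println(headingStyle("xhtml:h3").orElse("(no style)"));
        System.out.println(headingStyle("p").orElse("(no style)"));
    }
}
```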

Special case mappings are handled using normal XSLT match templates that return the style name:

<xsl:template mode="get-style-name" as="xs:string?"
  match="xhtml:article[@class = ('sidebar')]/xhtml:p[1]" 
  >
  <xsl:sequence select="'Sidebar Para First'"/>
</xsl:template>

Even this template, while full XSLT, is simple enough that people who understand XPath well enough to construct the correct match expression but are not otherwise XSLT programmers could implement this kind of special case mapping.

While the simplicity of this template suggests that there should be a way to capture the same mapping in some kind of configuration file, my thinking on this issue to date (which started more than 10 years ago with work I did to implement a similar DITA-to-InDesign transformation, DITA2ICML) has been that any such configuration file would not be significantly easier to create than this style of simple XSLT template. The cost of implementing the processing of such a configuration file would also be significant, especially if the target audience is not expected to be XML or XSLT experts: the configuration file processing would have to provide robust error handling with clear messages, appropriate convenience features, and so on.

However, that work and thinking was done with XSLT 1 and 2. Newer features in XSLT 3, such as the ability to dynamically evaluate XPath expressions and better facilities for error handling, might lower the cost of such a configuration mechanism enough to make the added convenience worth the implementation effort, if that level of configuration convenience is otherwise a requirement.
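To give a feel for what a configuration-driven mapping with dynamically evaluated XPath might involve, here is a minimal Java sketch using the JDK's JAXP XPath 1.0 API. It is an analogy only: it is not XSLT 3's dynamic evaluation and is not part of The Wordinator. The rule format (an ordered map from an XPath boolean test to a style name), the class name, and the sample rules are assumptions, and the sample input is HTML in no namespace to keep the expressions short.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ConfiguredStyleSketch {

    // XPath boolean test -> style name. Iteration order is the priority
    // order, so callers should pass an ordered map such as LinkedHashMap.
    private final Map<String, String> rules;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    ConfiguredStyleSketch(Map<String, String> rules) {
        this.rules = rules;
    }

    /** Returns the style of the first rule whose test is true for element. */
    Optional<String> styleNameFor(Node element) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            try {
                Boolean hit = (Boolean) xpath.evaluate(
                    rule.getKey(), element, XPathConstants.BOOLEAN);
                if (hit) {
                    return Optional.of(rule.getValue());
                }
            } catch (XPathExpressionException e) {
                // Report which configured expression was bad.
                throw new IllegalArgumentException(
                    "Bad XPath in style rule: " + rule.getKey(), e);
            }
        }
        return Optional.empty();
    }

    /** Parses an XML string into a DOM document (helper for the demo). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }
}
```

Even this small sketch needs its own error reporting for bad expressions, which hints at where the implementation cost of a configuration mechanism would go.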

Conclusions and Future Work

The Wordinator achieves its original requirement of producing high-quality Word documents, including support for multiple page sequences with different page geometries, headers, and footers, complex tables, embedded graphics, and multi-column content.

Leveraging the Apache POI library kept the implementation cost to a minimum, within Municode's limited budget. In addition, the use of POI, with its robust and mature implementation of the Office Open XML format, limits the risk of producing bad DOCX data.

The implementation demonstrates the utility of separating the three concerns of style and layout, authored-content-to-style mapping, and deliverable generation. It also demonstrates a useful technique of using Saxon output URI resolvers to post-process the direct output of an XSLT transformation into some non-XML format that cannot be generated (or easily generated) using XSLT alone.

By using a simple-as-possible intermediate format (Simple Word Processing XML) as the input to the DOCX generation process, the complexity of the authored-content-to-deliverable-structure mapping is minimized.

The current Wordinator release, 1.0.2, is sufficiently complete to meet the needs of most documents that do not require page layouts and typographic effects that can only be produced in more sophisticated page layout systems or through manual construction of pages. At this level of completeness it is limited mostly by the inherent limitations in Microsoft Word itself.

However, Wordinator does not implement all layout features of Word, so there is room for improvement, for example, generation of content in floating text boxes.

Another area for investigation is the generation of Word documents with accessibility features to be used as input to existing tools that generate accessible PDFs from Word documents. The U.S. GAO currently uses Word to create accessible versions of GAO reports, but the Word document is created manually as part of the publishing process. It is likely that Wordinator could be used to generate Word documents with the necessary accessibility features, removing a manual process without completely rearchitecting the current GAO publishing process.

Additional possible extensions include using CSS to define the Word style details, which would enable automatic generation of the Word styles from CSS style sheets and the reuse of existing CSS style sheets used for web or paged delivery.

The Wordinator could easily be adapted to take non-XML data as input, in particular JSON, using XSLT 3's JSON processing features.

While part of my personal motivation for building The Wordinator was to enable the easy publishing of DITA content to PDF via Word, I have not actually implemented that process. It should be a relatively simple task to extend the existing DITA Open Toolkit HTML transformation to produce HTML optimized for use with The Wordinator, along with DITA-specific Word Templates that provide named styles corresponding to DITA content elements.

References

[DITA2ICML] DITA to InDesign project, https://github.com/dita4publishers/org.dita4publishers.dita2indesign.

[ECMA-376-1] ECMA-376-1:2016 Office Open XML File Formats — Fundamentals and Markup Language Reference, https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-376, Fifth Edition, Part 1 - Fundamentals And Markup Language Reference.zip.

[ELOVIRTA] DITA to Word plug-in, Jarno Elovirta, https://github.com/jelovirt/com.elovirta.ooxml/.

[POI] Apache POI - the Java API for Microsoft Documents, https://poi.apache.org/.

[SLIDINATOR] The Slidinator project, https://github.com/drmacro/slidinator.

[WORD2DITA] Word2DITA Project, https://github.com/dita4publishers/org.dita4publishers.word2dita.

[WORDINATOR] The Wordinator Project, https://github.com/drmacro/wordinator.

^[1] A typical example of Municode's product includes the code for the city of Austin, Texas: https://library.municode.com/tx/austin/codes/code_of_ordinances?nodeId=TIT5CIRI_CH5-1HODI_ART2DIHOAIHOACCO_DIV3PRAGDI_S5-1-51DISAREHO. Note that as of 5 July 2020 this site reflects the old Word document creation tooling and not the use of The Wordinator.
