Nordström, Ari, and Jean Kaplansky. “Migrating DocBook to Uncharted Waters.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Nordstrom01.
Balisage: The Markup Conference 2022 August 1 - 5, 2022
Balisage Paper: Migrating DocBook to Uncharted Waters
Ari Nordström
Ari is an independent markup geek based in Göteborg, Sweden. He has provided
angled brackets to many organisations and companies across a number of borders
over the years, some of which deliver the rule of law, help dairy farmers make a
living, and assist in servicing commercial aircraft. And others are just for
fun.
Ari is the proud owner and head projectionist of Western Sweden's last
functioning 35/70mm cinema, situated in his garage, which should explain why he
once wrote a paper on automating commercial cinemas using XML.
Jean Kaplansky
Jean Kaplansky is a proven and experienced Publishing Content Platform
Solutions Architect with a background working in small, medium, and
enterprise business environments across multiple industries, including Software,
Hardware, Aerospace, Insurance, Scholarly Publishing, and the instructional
design and publishing of education content. Jean has done extensive
DocBook stylesheet design and content analysis throughout her career.
Jean enjoys a challenge in addition to solving problems and explaining
technical things to non-technical people. Jean has contributed to publishing
taxonomies, schemas, authoring tools, Content Management, and Automated
Publishing systems development. Jean also contributes to designing new business
workflows and processes based on information architecture, knowledge management
science, and best practices.
Kaplan North America (KNA) is migrating all XML source content across business
lines to a new technology stack. The stack uses Alfresco
Community Edition and Componize for Component Content
Management System (CCMS) capabilities, and Oxygen XML Web
Author for editing and authoring new content. Approximately 50 GB of
content scheduled for migration exists in DocBook 4.0 and MathML 3.1. All
XML content is stored in Arbortext Content Manager (ACM), a content management system built by
PTC, and based on the Windchill Product Lifecycle Management (PLM) software.
This article aims to document the XSLT and XProc pipeline solution built to
migrate KNA's content and, further, to create HTML with metadata for ingestion into
KNA's Learning Management System (LMS) as a product. No project is perfect, however.
As a result, we experienced a few hard-learned lessons as consultants while we
worked our way through the project. We were excited to work on a DocBook project
with XML tools — work we greatly enjoy — which led to an over-confidence that
left us slightly stuck in the project's human variables. We learned (and
continue to learn) valuable lessons about consulting and acting as trusted advisors
throughout the project.
Successfully migrating 50 GB of Kaplan North America's (KNA) DocBook content included specific requirements:
to implement an extensive content analysis of forty-year-old content,
to understand partially documented Learning Management System (LMS) metadata,
and
to identify the best approach to modify DOCTYPE statements reflecting
an association with Oxygen XML Web
Author.
Mission Statement
Finding the right people to work on a project to migrate 50 GB of mature DocBook content to modern content management and authoring solutions is a
challenge in 2022! Our search for service vendors in the United States willing to work with old
DocBook content was unsuccessful. Some vendors required reminding that their product did support
DocBook once. Other service vendors were willing to write XSLT but did not want to
create the necessary pipeline to move the content through syntax cleanup and
transformation to the required LMS HTML output. One after another, service vendors
declined to work with Kaplan's DocBook content in 2021. Moreover, an LMS publishing
deadline was fast approaching.
The quest: Kaplan North America's (KNA) engineering
licensing test preparation content originated in the mid-1970s. The engineering,
architecture, and design content is product content used across various Kaplan-based
programs, including test preparation, education services, and professional license
programs. The content is part of services provided to corporations, government agencies,
student groups, universities, and professionals. Kaplan's programs set the standard
for
license review and advanced educational services.
Kaplan's highly technical engineering content originated in 1975. It evolved through
a
variety of markup languages, eventually migrating into DocBook 4.0 authored, managed,
and published with an Arbortext Editor (AE), Arbortext Publishing Engine (APE), and
Arbortext Content Manager (ACM) implementation in 2015. Previous content migration
efforts did not explicitly include a project step to thoroughly scrub the content
for
dated SGML/XML syntax components, including character entity syntaxes, extensive Math
markup, and table formatting partially specified by Arbortext Processing Instructions
(PIs). By 2021 the entire body of engineering content existed wholly in the Arbortext
ecosystem and included many Arbortext-specific markup inserts to help control editor
state content management.
The Quest: Migrate 50 GB of legacy content from the
Arbortext ecosystem to Alfresco and Componize as a Component Content Management System
(CCMS) using Oxygen XML Web Author for future editing and new content creation in
2022.
Rendered product output includes HTML with precisely specified metadata for LMS ingestion
in addition to printed books and digital EPUB content.
It became evident early on in planning the DocBook content migration that one or more
transforms in a pipeline were required to clean up the content syntax, remove remnants
of older content management and authoring solutions, and, most importantly, produce
output ready for ingestion into Alfresco Componize and check-out with Oxygen XML Web
Author.
Quest and challenges accepted!
In late September 2021, Jean Kaplansky, Kaplan's Solutions Architect for the new content
platform, finally found someone willing to write XSLT and XProc to take Kaplan's DocBook
content through a series of transformations to clean the codebase, enable future content
editing and authoring with Oxygen XML Web Author, and produce LMS-ingestible HTML, with an
initial publishing deadline set for late November 2021. Ari Nordström came forward, willing
to take on the unknown, with a strong background in content
transformation and pipelines — including an XProc framework of his invention, XProc
Batch [XProc Batch Wrapper Scripts].
Given Jean's extensive analysis of one of the more significant DocBook titles, the
tasks seemed pretty straightforward[1]:
Clean up and convert DocBook 4.0 to 4.3 while maintaining the association with
MathML 3.1 and content referenced by ACM-style XIncludes
Migrate the newly cleaned and converted DocBook to a logical content
management organization in Alfresco Content Services Community Edition with the
XML managed by Componize's Alfresco plug-in to make Alfresco "XML-aware."
Convert the newly housed DocBook 4.3 to (X)HTML including specific LMS
metadata
Easy, right? Ari's first thought was pipelines, followed by ooh,
I can do XProc again. We moved forward — and immediately ran into multiple
and previously unconsidered challenges — because DocBook! XProc! We get to work on
a
cool project!
XSLT Pipelines Overview
Those recollecting previous Balisage papers may remember Ari's fascination with
XSLT pipelines [Pipelined XSLT Transformations]. While the paper at hand is not about how
XSLT pipelines work, the idea is simple enough:
A sequence of XSLT stylesheets listed in an XML manifest file runs in the order
presented, with the previous XSLT's output providing the input to the next. Each
XSLT transform handles one thing and one thing only[2], and that is really about it. Every XSLT is essentially an ID transform
with a few tweaks, and since the full transformation is a sequence of these XSLTs,
it is easy to modify the transform by adding, removing, and moving the individual
XSLTs. For example, this manifest is the XSLT pipeline part of the DocBook cleanup
described in the following section:
<manifest xmlns="http://www.corbas.co.uk/ns/transforms/data" xml:base=".">
  <group description="Initial stuff" xml:base="../xslt/common/">
    <item href="idtransform.xsl" description="UTF-8 serialisation to get rid of the character entities"/>
    <item href="remove-wc-protocol.xsl" description="Remove Windchill protocol from hrefs"/>
    <item href="clean-hrefs.xsl" description="Remove unwanted chars in hrefs"/>
    <item href="uri-encode-hrefs.xsl" description="URI-encode refs"/>
  </group>
  <group description="Main processing" xml:base="../xslt/db-cleanup/">
    <item href="fix-xi-namespaces.xsl" description="Move any non-standard XIncludes to the right namespace"/>
    <item href="normalize-docbook.xsl" description="Normalise XIncludes, remove namespaces, serialise to UTF-8" enabled="false"/>
    <item href="orphaned-table.xsl" description="Reinsert title and tgroup in table orphaned by Tagsoup"/>
    <item href="convert-table-pi.xsl" description="Table PI handling"/>
    <item href="cleanup.xsl" description="Various cleanup"/>
  </group>
</manifest>
Did we mention that it's practically self-documenting?
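For a feel of what an individual step looks like, here is a minimal sketch of one (not one of the production stylesheets, just an illustration of the pattern): an identity transform with a single tweak, serialised to UTF-8.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">

  <!-- Serialising as UTF-8 turns resolved character entities into plain characters -->
  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- The identity template: copy everything as-is... -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- ...except the one thing this step handles; here, dropping Arbortext "Pub" PIs -->
  <xsl:template match="processing-instruction('Pub')"/>

</xsl:stylesheet>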
DocBook Cleanup
The DocBook 4.0 sources had some complicating factors, all of which we needed to
address:
The character encoding, sort of ISO 8859-ish, with some company-specific
entities sprinkled throughout
The old Arbortext Editor/Arbortext Content Manager (ACM) authoring and Content
Management System (CMS) environment:
XIncludes, some with fake namespace declarations[3]
DOCTYPE declarations only in the root XML,
not the XIncluded content
An ACM-specific file protocol (see fake namespace declarations,
above)
Arbortext processing instructions to manually override automatic
formatting
(Lots of) MathML equations, some with fake namespace declarations
Inconsistent file naming including characters that should never appear in a
file name[4]
And more, some of which only came to light later, when we examined additional
books.
Above all, we needed to encode everything in Unicode and UTF-8. The Arbortext PIs
and
other Arbortext-specific constructs needed to go, and the XInclude targets should
have
normalised DOCTYPE statements.
A hard requirement remained for us to continue working with a DocBook 4.3 DOCTYPE
statement, including an internal subset to pull in MathML.
Encoding and DOCTYPE
Character entities, of course, are only possible to convert if declarations are
present. In the case of DocBook 4.0 or any other DTD-validated XML documents, the
place to look for character entity definitions is the DTD. A parser normally
accesses the DTD by reading the document's DOCTYPE declaration.
However, what if the DOCTYPE definition is missing?
In a nutshell, the DocBook 4.0 source consisted of multiple files, with the
top-level (root) file containing XInclude links pointing to
chapters, which in turn would point to sections or Q and A entries. However, only
the top-level XML had a DOCTYPE statement. The lower-level XML files
did not require a DOCTYPE statement to work in Arbortext's ecosystem,
creating an interesting chicken-or-egg problem.
See, if a parser finds a character entity it can't resolve, it usually reports an
error and exits. The XIncluded files contained character entities but no
DOCTYPE statements, so the initial cleanup scripts threw errors and exited as soon
as they tried to open those XML files, because the entities did not resolve.
So, before fixing the encoding, the files needed DOCTYPE declarations
where missing. That ruled out anything XSLT[5] and most things XProc[6], so what, then? How about something in an Ant build script that would
then run the actual cleanup? Ant, an XML-based pipelining language, offers many
tools to process XML, but very few will uncritically open and read an XML file that
is not well-formed.
Regular expressions are always an option, of course, but just look at these two
variations:
The previous example has no DOCTYPE to begin with, so it may be
possible to match the root element. But what about this one?:
<!-- Fragment document type declaration subset:
Arbortext, Inc., 1988-2017, v.4002
<!DOCTYPE set PUBLIC "-//Arbortext//DTD DocBook XML V4.0 + MathML//EN"
"axdocbook_math.dtd" [
<!ENTITY minus "&#8722;">
]>
-->
<?Pub EntList alpha bull copy rArr sect trade lbrace rbrace thinsp delta
Delta?>
<?Pub Inc?>
<qandaentry id="qandaentryFEBB06-014" role="1of4" vendor="0000080121">
This one is sneaky because the DOCTYPE is out-commented over multiple
lines (and has the wrong document element). You must match the comment markup and
then the DOCTYPE inside it to know whether a real DOCTYPE is present.
And so on. There are a couple of things to do here:
Find out if there is a DOCTYPE already (we don't want to
include two).
Get the root element (because it's part of the
DOCTYPE).
Compose and add the DOCTYPE.
A few of the XIncluded files had out-commented DOCTYPE declarations,
complicating Step 1. Also, getting the root element was not obvious because of the
comments, processing instructions, etc., so the regex idea was a no-go.
OK, so maybe an Ant exec task to run external software that adds the
DOCTYPE? Most parsers are out of the question because of the basic
well-formedness issues, but there is something called TagSoup
[TagSoup - Just Keep On Truckin'], a rather
brilliant piece of software intended to fix bad HTML. Among other things, it can add
any DOCTYPE declaration to a file, not just HTML, but more importantly,
it doesn't care if the file contains unresolved character entities—that is one of
the problems it's designed to ignore while fixing other things!
It doesn't do reporting very well, however, so it couldn't reliably check for an
existing DOCTYPE and only then, if there wasn't one, add it. To check for
the existence of one, we ended up using xmllint
[Installing Libxml/xmllint for Windows][7].
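Wired into the Ant build, the check-then-fix logic looks roughly like the sketch below. This is not the project's actual build; the property names, the exact xmllint error message, and the TagSoup doctype options are assumptions to verify against the installed versions.
<target name="check-doctype">
  <!-- xmllint, run with the noout and valid options, writes a consistent
       error to stderr when no DTD is declared -->
  <exec executable="xmllint" errorproperty="lint.errors">
    <arg value="--noout"/>
    <arg value="--valid"/>
    <arg value="${file}"/>
  </exec>
  <condition property="doctype.missing">
    <!-- ${no.doctype.message}: whatever message your xmllint emits for a missing DTD -->
    <contains string="${lint.errors}" substring="${no.doctype.message}"/>
  </condition>
</target>

<target name="add-doctype" depends="check-doctype" if="doctype.missing">
  <!-- TagSoup rewrites the file with the requested DOCTYPE and happily ignores
       the unresolved character entities that trip up ordinary parsers -->
  <java jar="${tagsoup.jar}" fork="true" output="${file}.withdoctype">
    <arg value="--doctype-public=-//Arbortext//DTD DocBook XML V4.0 + MathML//EN"/>
    <arg value="--doctype-system=axdocbook_math.dtd"/>
    <arg value="${file}"/>
  </java>
</target>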
Note
An added complication was the fact that all this needed to run on Windows
machines. As a Linux user since the mid-90s, Ari was accustomed to SGML and XML
toolkits, from OpenSP to libxml, being
readily available and only an apt-get install away; he was shocked
to realise how much effort it took to add xmllint.
XIncludes
With the character encoding handled, the next task was processing the XIncludes.
We promptly learned that ACM used a non-standard file protocol in the XInclude
href values. Handling this was straightforward in an XSLT step and a regular
expression.
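A minimal sketch of such a step follows; the 'x-wc://' prefix is a stand-in for the actual Windchill protocol, and the production remove-wc-protocol.xsl differs in detail.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xi="http://www.w3.org/2001/XInclude"
                version="3.0">

  <!-- Identity copy -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Strip the CMS-specific protocol so the href becomes a plain relative path -->
  <xsl:template match="xi:include/@href">
    <xsl:attribute name="href" select="replace(., '^x-wc://(file=)?', '')"/>
  </xsl:template>

</xsl:stylesheet>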
While the normalisation of XIncludes (that is, recursively pulling the XInclude
contents into a single file) is not a complex problem, the fact that there was no
way to know which of the hundreds of XML files in an input folder was the root
complicated matters. Once you have pulled an XInclude target into the parent file,
you no longer have to process the child; you can leave it behind and only process
the normalised parent. But how do you leave them all behind?
The XSLT pipeline acts on an input directory: it lists the files in that directory
and then iterates through the list, applying the pipeline to each one. Here is the
clever bit, though: you can apply include and exclude filters to the operation
using regular expressions.
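In XProc 1.0 terms, the idea boils down to something like this fragment (the real framework, XProc Batch, wraps the directory listing differently; $input-dir and $xinclude-targets stand for options declared elsewhere in the pipeline):
<p:directory-list>
  <p:with-option name="path" select="$input-dir"/>
  <p:with-option name="exclude-filter" select="$xinclude-targets"/>
</p:directory-list>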
A pre-processing pipeline step first walks through all of the input files, opening
each of them in turn, and compiles a list of the XInclude targets, each separated
with a | (pipe) character, for use as an exclude filter.
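A minimal sketch of how such a list can be compiled is shown below; the Saxon-style collection URI is an assumption, and the production step lives in the pre-processing pipeline and differs in detail.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xi="http://www.w3.org/2001/XInclude"
                version="3.0">

  <xsl:output method="text"/>

  <!-- Join every XInclude target found in the input folder with '|' so the list
       can be used directly as a regular expression (in practice the file names
       first need the renaming described in the next section) -->
  <xsl:template name="xsl:initial-template">
    <xsl:value-of select="string-join(
        distinct-values(collection('input/?select=*.xml')//xi:include/@href), '|')"/>
  </xsl:template>

</xsl:stylesheet>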
We could extend the exclude filter, but the principle was easy enough except for
the filenames.
Filenames
The first few tests had filenames without particular issues, but the first
complete book we examined highlighted the weakness of the exclude filter approach
described above.
The Arbortext system happily processes filenames with any number of weird
characters while some of the pipeline tools choked. For example, some file names
contained multiple spaces, parentheses, square brackets, commas, and other
characters not typically found in file names.
If you know your regular expressions, you will know that several of these would
have to be escaped if used in a pattern. So, when listing the XInclude targets and
using the list as an exclude filter (see above), the list would be unusable as a
pattern unless every offending filename character was escaped correctly in the
regex. This was not an attractive proposition, especially after examining more book
sources.
Easier was writing a pre-processing step to rename the files by removing (or
replacing) the offending characters and then opening each file and applying the same
renaming rules to any referenced files—all this had to be run before the actual
cleanup and normalisation to make sure that the XInclude target list used as an
exclude filter would not break[8].
The rename step eventually became a separate XProc library where XSLT stylesheets
handled most of the heavy lifting.
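A sketch of the kind of renaming rule those stylesheets apply (the real library handles more cases, and the function name is made up for illustration):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:local="urn:example:rename"
                version="3.0">

  <!-- Map an unruly filename to something safe for file systems and regexes -->
  <xsl:function name="local:safe-name" as="xs:string">
    <xsl:param name="filename" as="xs:string"/>
    <!-- Replace whitespace, parentheses, brackets, commas and similar characters
         with underscores, then collapse runs of underscores -->
    <xsl:sequence select="replace(replace($filename, '[\s()\[\],&amp;#%]+', '_'), '_+', '_')"/>
  </xsl:function>

</xsl:stylesheet>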
4.0 to 4.3 and Cleanup
The actual DocBook 4.0 to 4.3 conversion held very little excitement when compared
to the above: a handful of XSLTs to get rid of Arbortext PIs for formatting and
other functionality and to unify namespaces—for some reason, authors had sometimes
added, um, alternative namespace URLs for MathML and XInclude. The DocBook versions
as such are very similar.
The 4.3 DOCTYPE
A final cleanup complication was that the full DOCTYPE for the target
files required an internal subset:
[
  <!-- Will use "m" as a prefix.-->
  <!ENTITY % equation.content "(alt?, (graphic|mediaobject|m:math)+)">
  <!ENTITY % inlineequation.content "(alt?, (graphic|inlinemediaobject|m:math)+)">
  <!ENTITY % NS.prefixed "INCLUDE">
  <!ENTITY % MATHML.prefix "m">
  <!ENTITY % NamespaceDecl.attrib "
    xmlns:m CDATA #FIXED 'http://www.w3.org/1998/Math/MathML'
  ">
  <!-- Include the MathML DTD -->
  <!ENTITY % mathml PUBLIC "-//W3C//DTD MathML 3.0//EN"
    "http://www.w3.org/TR/MathML3/dtd/mathml3.dtd">
  %mathml;
  <!ENTITY % xinclude SYSTEM "http://www.docbook.org/xml/4.4/xinclude.mod" >
  %xinclude;
]
The XML world has been moving away from DTDs and DOCTYPE declarations
ever since XML first came out, 24 years ago. You can't actually output an internal
subset or serialise an ENTITY declaration using XSLT[9]. You can still serialise a basic DOCTYPE, though, with
PUBLIC and SYSTEM identifiers.
This was good enough. Remember the Ant build wrapped around everything? Well, we
kept the internal subset in an external text file, and then, last in the build,
after provisioning the processed files with vanilla DocBook 4.3 DOCTYPE
declarations, a final Ant step appended that external text file to the
DOCTYPEs[10].
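A sketch of that final step, not the production build; the file and property names are placeholders:
<loadfile property="internal.subset" srcFile="internal-subset.txt"/>
<replaceregexp>
  <!-- Match the freshly written DOCTYPE and splice the internal subset in
       just before its closing bracket -->
  <regexp pattern='(&lt;!DOCTYPE\s+\w+\s+PUBLIC\s+"[^"]*"\s+"[^"]*")\s*>'/>
  <substitution expression='\1 ${internal.subset}>'/>
  <fileset dir="${output.dir}" includes="**/*.xml"/>
</replaceregexp>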
The second part of the mission statement was to transform DocBook 4.3 to (X)HTML; this
would mostly be a vanilla HTML transform, but with some bits added, mainly
publishing-specific metadata on Q and A entries, sections, and chapters.
The initial idea was to use the standard DocBook XSLT package as included in oXygen.
As Ari wasn't keen on using them—XSLT 1.0 is just so last century—and really wanted
to
try xslTNG [DocBook xslTNG],
Norm Walsh's XSLT 3.0 package presented at Balisage 2020 [XSLT 3.0 on ordinary prose], the
plan morphed into doing some initial pipelining to preprocess the DocBook 4.3 to add
the
required metadata and other content[11], and then simply hand over everything to xslTNG with a few tweaks in the
calling XSLT.
While the theory was that since the HTML output would be in fragments we should be
able to simply convert the input DocBook XInclude fragments (see section “XIncludes”) directly, the
required output fragments did not actually match the input. This little tidbit was
not
part of the original statement of work; it only came to light later.
For example, while the Q and A entries were separate source XML files, some published
books (Reference Manuals, known as RM books) would include them inside
their parent chapters with everything normalised, while other book types required
them
to be in separate files. Similarly, there were other conditions governing the output
fragmentation. This would also mean that the metadata to be inserted would vary—if
output as separate files, the Q and A entries required metadata headers; if included
inside the chapters, no metadata was needed.
In other words, context would decide where the books would need
to be split apart, not the input fragmentation.
Normalisation and Split Markup
All this meant that we'd have to start by normalising the sources after all,
insert markup to identify where they would be pulled apart again, later in the
pipeline, and then add metadata and other structures.
This is where the XInclude normalisation step [section “XIncludes”], which preprocesses
the XIncludes and uses the XInclude targets as an exclude filter on the directory
listing, became a lifesaver. The main book file would grow significantly larger in
the first step, since everything was pulled in.
Here, the normalisation is followed by steps to prepare the newly assembled book
to be split again. Some conditions had to be evaluated as part of the normalisation
because they required examining filenames and containing folders to produce
context markup[12]:
<!-- Find out stuff about the filename to use for later naming and split -->
<xsl:template match="*[not(parent::*)]">
<xsl:variable name="name" select="name(.)"/>
<xsl:variable name="filename" select="tokenize(base-uri(.),'/')[last()]"/>
<xsl:variable name="label">
<xsl:choose>
<!-- Properly marked-up chapter -->
<xsl:when test="@label and @label!='' and $name='chapter'">
<xsl:value-of select="@label"/>
</xsl:when>
<!-- Diagnostic -->
<xsl:when test="matches(@id,'diag')">
<xsl:analyze-string select="@id" regex="^(.*)diag([0-9]+)$">
<xsl:matching-substring>
<xsl:value-of select="format-number(number(regex-group(2)),'1')"/>
</xsl:matching-substring>
<xsl:non-matching-substring/>
</xsl:analyze-string>
</xsl:when>
<!-- Is it a front matter chapter? -->
<xsl:when test="matches(@id,'^(.*)00$')">
<xsl:value-of select="'0'"/>
</xsl:when>
<!-- Gleaned from @id -->
<xsl:when test="matches(@id,'^chapter[A-Z]+[0-9]+$')">
<xsl:analyze-string select="@id" regex="^chapter[A-Z]+([0-9]+)$">
<xsl:matching-substring>
<xsl:value-of select="format-number(number(regex-group(1)),'1')"/>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="''"/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<xsl:variable name="title" select="string-join(title//text())" as="xs:string?"/>
<!-- ID with 'diag' means diagnostic -->
<xsl:variable name="dg" select="if (matches(@id,'diag')) then ('diag') else ()"/>
<!-- RM books are identified thusly -->
<xsl:variable name="rm" select="if (matches($path,'RM/data/$')) then ('rm') else ()"/>
<xsl:variable name="prefix">
<xsl:choose>
<xsl:when test="$name=('chapter','bookinfo')">
<xsl:value-of select="'Chapter'"/>
</xsl:when>
<xsl:when test="$name='qandaentry'"/>
<xsl:otherwise>
<xsl:value-of select="$name"/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<xsl:variable name="identifier" as="xs:string">
<xsl:choose>
<!-- Intro front matter chapter -->
<xsl:when test="$title='Introduction' or $label = '0.50'"><!-- FIXME -->
<xsl:value-of select="'0.50'"/>
</xsl:when>
<!-- Diagnostic test inside the book, chapter label xx.50 -->
<xsl:when test="$name='chapter' and matches($filename,'_DG_')">
...
</xsl:when>
<!-- Labelled (non-diagnostic) chapter -->
<xsl:when test="$name='chapter' and $label!='0' and $label!=''">
<xsl:value-of select="$label"/>
</xsl:when>
...
</xsl:choose>
</xsl:variable>
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:attribute name="prefix" select="$prefix"/>
<xsl:attribute name="identifier" select="$identifier"/>
<xsl:attribute name="dg" select="$dg"/>
<xsl:if test="$rm='rm'">
<xsl:attribute name="rm" select="$rm"/>
</xsl:if>
<xsl:apply-templates select="node()"/>
</xsl:copy>
</xsl:template>
Based on this, the next step would then identify splits, like in this
example:
<xsl:if test="name(.)=$split-elements and not(ancestor::chapter[@rm='rm'])">
  <xsl:attribute name="split" select="true()"/>
</xsl:if>
A step that named the soon-to-be fragments in a temporary @name
attribute then followed.
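A minimal sketch of what such a naming step might look like; the file-naming scheme shown (prefix, underscore, identifier) is an assumption rather than the production rule.
<xsl:template match="*[@split='true']">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <!-- e.g. name="Chapter_12.xml"; @prefix and @identifier were added during normalisation -->
    <xsl:attribute name="name"
        select="concat(string-join((@prefix, @identifier)[. != ''], '_'), '.xml')"/>
    <xsl:apply-templates select="node()"/>
  </xsl:copy>
</xsl:template>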
Note that none of this was part of the initial pipeline—the original statement of
work left out everything having to do with splitting based on context and type of
book.
Note
This sort of thing is where a pipelined approach will excel. It's much easier
to add processing in the middle of a multi-step pipeline than a monolithic XSLT
transform.
Metadata
The metadata depended on context and the type of book, essentially the same
criteria used to determine where to split the normalised files. For example, the
@rm attribute, added during normalisation to identify the Reference
Manual book type context, determined if the Q and A entries needed metadata or
not:
<xsl:template match="orderedlist[(ancestor::chapter[@rm='' or not(@rm)]) and
ancestor::question and listitem[@role='correct']]">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:attribute name="class" select="'ktp-answer-set'"/>
<xsl:apply-templates select="node()">
<xsl:with-param name="property" select="'ktp:answer'" tunnel="yes"/>
</xsl:apply-templates>
</xsl:copy>
</xsl:template>
Other steps also pre-converted selected structures to
HTML; the next section shows examples.
What metadata was added to what elements and in what contexts also changed during
the project, and again, the pipeline approach proved its worth.
Structural Tweaks
We ended up using seven pipeline steps to add metadata to the required DocBook
structures, with another four to tweak them further, and it proved easier to convert
these to HTML as part of the metadata additions rather than having xslTNG do it
later.
The Q and A entries, for example, use lists both to list possible answers and
for, well, listing things. While it was certainly possible to convert them to HTML
using xslTNG, it was easier in a pipeline step:
<xsl:template match="orderedlist[ancestor::question]">
<xhtml:ol keep="true">
<xsl:copy-of select="@*"/>
<xsl:if test="not(@class) and (ancestor::chapter[@rm='' or not(@rm)]) and not(@numeration='upperroman')">
<xsl:attribute name="class" select="'ktp-answer-set'"/>
</xsl:if>
<xsl:choose>
<xsl:when test="@numeration='upperroman'">
<xsl:attribute name="style" select="'list-style: upper-roman'"/>
</xsl:when>
</xsl:choose>
<xsl:apply-templates select="node()"/>
</xhtml:ol>
</xsl:template>
<xsl:template match="listitem[ancestor::question]">
<xhtml:li keep="true">
<xsl:copy-of select="@* except @role"/>
<xsl:if test="not(@property) and (ancestor::chapter[@rm='' or not(@rm)]) and not(ancestor::orderedlist[@numeration='upperroman'])">
<xsl:attribute name="property" select="'ktp:answer'"/>
<xsl:attribute name="typeof" select="'ktp:Answer'"/>
<xsl:processing-instruction name="no-correct-answer-given"/>
</xsl:if>
<xsl:apply-templates select="node()"/>
</xhtml:li>
</xsl:template>
Note keep="true". These attributes were inserted whenever we pre-converted to HTML and
told xslTNG not to touch that particular node.
Note
@keep has two possible values: true means
don't touch the current element and its attributes, and
copy means don't touch the current node or any
descendants.
Other tweaks included generating cross-reference labels[13], tweaking footnotes, adding table and equation caption labels and
numbering, and adding @class attributes for the publishing CSS. One
late-breaking step also replaced .eps image file suffixes with
.png when the client changed the image format.
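The suffix replacement, for instance, needs little more than a template like this sketch, added to an identity stylesheet and assuming the references live in DocBook @fileref attributes:
<xsl:template match="@fileref[ends-with(., '.eps')]">
  <xsl:attribute name="fileref" select="replace(., '\.eps$', '.png')"/>
</xsl:template>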
Cleanup and Split
XSLT pipelines tend to end with cleanup steps, from removing processing attributes
to getting rid of unwanted namespace prefixes and so on. This one was no exception.
However, once the XSLT pipeline, a matter of running XSLTs listed in a manifest file
in sequence, was finished, it was still necessary to split the files, which added
a
separate XSLT file to the XProc itself.
Why not in the XSLT pipeline? The XProc step running the XSLT pipeline manifest
is, for practical purposes, a black box that accepts an input folder of files and
hands over a converted result folder, all as part of a single XProc step. If we
wanted to do secondary output from that final XSLT, we would have to do much
tweaking inside that black box. It was far easier to add an XSLT that splits the
output of the XSLT pipeline into result documents and then to iterate through
those in XProc.
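A sketch of that splitting XSLT, reusing the @split and @name attributes added earlier in the pipeline (the real step does more bookkeeping):
<xsl:template match="*[@split='true']">
  <xsl:result-document href="{@name}">
    <xsl:copy>
      <!-- Drop the temporary processing attributes on the way out -->
      <xsl:copy-of select="@* except (@split, @name)"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:result-document>
</xsl:template>
In XProc 1.0, these secondary result documents appear on the p:xslt step's secondary output port, where the pipeline can iterate over them and store each one.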
Notes on xslTNG
In hindsight, the final pipeline ended up converting more DocBook to HTML than
planned; we had simply assumed that xslTNG would do most of the heavy lifting. Ari
really wanted to test it in a live setting, having watched Norm's presentation at
Balisage in 2020 [XSLT 3.0 on ordinary prose], but doing the entire conversion in the
pipeline would have resulted in an easier-to-maintain transform.
There's also the not-at-all-insignificant fact that what we have here is DocBook
4.3, whereas xslTNG much prefers DB 5.0 and later. 4.3 still has
INCLUDE/IGNORE sections for SGML, and the DocBook elements do not
live in a namespace[14].
As it stands, the calling XSLT does a few tweaks, from tweaking IDs to changing
section numbering. It sets a few variables, and, above all, it makes sure that the
@keep attributes are respected.
Final Tweaking
In the end, one additional XSLT, run after xslTNG, removed some final bits and
pieces (DocBook attributes that xslTNG didn't seem to recognise, for one thing) and
restored HTML wrappers that went missing in xslTNG, either
because Ari missed some vital information in the excellent documentation [DocBook xslTNG Reference] or
(more likely) because of the code leading up to xslTNG.
XProc, As Ever
Ari's been an XProc convert since before 1.0 came out. The basic idea is fabulous;
you black-box everything, connect the boxes, and out comes the result(s). The spec
is somewhat less than the sum of its parts; it always gave the impression of knowing
about the latest and the greatest while not quite being able to deliver it. You
could do amazing things but only within strict confines and never without some
pain.
It follows that for all that other processing — the pain — you were resigned to
Ant processing or similar, and so it was with the subject at hand. Because of XProc
1.0's strict confines, Ant was a necessity.
This year — this month, actually — XProc 3.0 was finally released. The libraries
that do the heavy lifting have all been converted to XProc 3.0 now (again, this
month!), and had this project happened a few months later, it would all have been
XProc 3.0. Maybe next year's talk.
Project Learning Outcomes: If only we had...
Even experienced XML publishing consultants can empathize too much with clients who
say they have a hard deadline. Despite our best efforts to put together a Statement
of
Work that defined the complete scope of work, we learned some hard lessons:
Lesson learned: Ensure identification of all
stakeholders at the beginning of the project — We went into the
project with goodwill and faith in our client to provide adequate
guidance and collateral to finish the project within the constraints we thought
we put in the SOW. Unfortunately, we did not realize the presence of a second
project stakeholder: the team responsible for ingesting the HTML content into
the company's current LMS platform.
Our original stakeholder representative provided access to a repository of
historical project information; however, much of the current process
documentation focused on meeting requirements for ingesting content to a
different LMS. Further, access to specific information about ingesting content
into the older LMS was less than transparent, and the LMS-specific metadata was
applied "on-the-fly" during content ingestion.
Lesson learned: Each stakeholder must have a direct
representative participating in the project — We specifically
asked for the information we needed to complete the project from the outset,
relying on the client's point person, the current CMS and DocBook solutions
administrator, to gather required documentation based on our collective
experience working with XML content migrations. Our first point person was the
current DocBook CMS and workflow system administrator, who is deeply involved in
the DocBook content architecture required to produce content for delivery to the
previous LMS, EPUB, and print products. We expected our DocBook point person to
specify the detailed documentation required to clean up existing markup and
migrate the content from one authoring and content management system to
another.
However, we missed a critical outcome expectation: Ingesting HTML with
required metadata values into the current LMS platform.
Our DocBook project point person initially shared some HTML metadata ingestion
requirements but obtained requirements from a pilot project that pulled content
into the LMS from the previous LMS instead of the DocBook 4.3 files coming out
of the pipeline. Much confusion about the precise HTML metadata required in the
content ensued. We had many meetings where the entire project team (sometimes
ourselves) required continual reminders that the current effort was to create a
DocBook to HTML transform destined for the current LMS
platform. As a result, our DocBook point person was not entirely aware of the
HTML metadata required to ingest and process the DocBook HTML originating from
the XProc pipeline.
The rabbit hole: the LMS platform team,
filled with a sense of "we are on a deadline," tested HTML ingestion from
content pulled from a previous LMS to the current LMS platform. We collectively
realized that the required LMS platform HTML metadata documentation was still a
work in progress in the context of the HTML produced from DocBook through the
XProc pipeline. We required significant input and documentation from the LMS
team for Ari to write an HTML transform to produce a critical part of the
project's expected outcomes. Multiple meetings happened before all of the
stakeholders, Ari, and Jean fully understood the difference between the
initial testing by the LMS team and the HTML produced by the XProc
pipeline.
The requirement for the HTML transforms pipeline to generate HTML metadata
required by the current LMS platform initially looked like scope creep from our
original SOW. Our "aha" moment regarding the LMS platform HTML metadata
requirements documentation, and the project scope required to refactor the HTML
transform, was a collective headache for ourselves and the multiple stakeholders
across the project team.
Lesson learned: Project teams may determine their own
single-sourcing XML best practices different from other industries
— ...and a client may insist on continuing to follow their own
best practices regardless of what is considered an industry best practice. We
had extensive conversations about best practices with our DocBook point person.
The project team built a quirky (but it worked for them) single-sourcing
architecture that caused hiccups in how we needed the transforms to handle XML
content. Our stakeholders vehemently defended their chosen approach to content
architecture contrary to what we, consultants, know as single-sourcing XML best
practices. We reluctantly learned to adapt to match the client's
adamance.
Lesson learned: Make sure the stakeholder has adequate
resources to handle required work; Training may take longer than anticipated
— We provided extensive "how to run the pipeline" training to a
single person. Our understanding was that we were "training the trainer." The
reality of the effort required to transform pipelines on large bodies of content
destined for migration and ingestion into an LMS was overwhelming for the
stakeholder's available resources. We added two more people to help run the
project content through the transform pipelines. Even so, we did not adequately
anticipate the amount of training and step-by-step walkthroughs required to
enable people to run the pipeline independently.
Lesson learned: Stakeholders may change priorities in
the "messy" middle — When we realized the extent of LMS platform
HTML metadata ingestion requirements, our DocBook stakeholders had already lost
sight of the project's original purpose:
Create DocBook XML stored in the Alfresco Componize CCMS
for future editing with Oxygen XML Web Author.
Create an XProc-based pipeline to create HTML output for ingestion
into the client's LMS.
Over time, our DocBook stakeholders became more and more focused on the LMS
ingestion and publishing deadline, leaving the Alfresco/Componize-ready DocBook
content sitting uningested, outside the target repository. The change in priority means
the project will not be complete until the client reprioritizes ingesting the
cleaned and Oxygen XML Web Author-aware DocBook content into the Alfresco
Componize CCMS for future authoring and publishing efforts.
Lesson learned: Always provide a preferred format for
client feedback at the project's outset — For example, while we
are both familiar with working in GitHub with GitHub issues, our client was more
familiar with collecting feedback from multiple sources in Google Docs. We spent
significant time analyzing stakeholder feedback, separating real issues from
duplicates and from what we considered fundamental misunderstandings about the
nature of markup on the LMS team's part. To our credit, we eventually
moved all feedback and related communications to GitHub issues,
where we could track the discussion and gather previously missing information on
an issue-by-issue basis.
Moving to GitHub Issues was no panacea for our communication issues, however.
To communicate more effectively, we had to provide in-depth Git and GitHub
education to our DocBook stakeholders.
Lesson learned: Set expectations about the role of the
consultant firmly at the beginning of the project — At one point,
our stakeholders started to blur the lines between the role of consultants and
the role of content processing services, resulting in scope creep. We had to
remind the individuals filing GitHub issues that they could not present a brand-new
requirement as "a problem to fix" within the original scope of
work.
We noticed that individuals did not know a specific issue existed until they
filed a GitHub issue. We continually vetted each issue to determine whether
the problem was in scope or a new problem to solve. Further, unexpected
variance in markup patterns often required us to make the call on whether or not
the issue was in scope.
We had to reset stakeholder expectations: individuals filing GitHub issues
were not allowed to "drip" new and out-of-scope requirements into issues to
resolve one by one. It is easy to lose track of how far the scope creeps from
the SOW without properly defining expectations upfront about what is genuinely a
"problem to solve" issue and what is an entirely new requirement. It is too easy
to have stakeholders take advantage of a consultant's goodwill as a trusted
advisor. Keep a parking lot for newly discovered requirements during the
project. Write another SOW to implement the new requirements as required.
Lesson learned: Do not put aside legal and financial
aspects of setting up the SOW contract BEFORE starting work —
…even for a friend who represents the stakeholder. It is too easy for
stakeholders to take advantage of a consultant's goodwill while blaming hang-ups
with settling contractual issues on finance or legal departments. Our
stakeholders panicked and became distracted about meeting what turned out to be
an unrealistic deadline. We allowed ourselves to become too empathetic with the
stakeholder's panic and the distraction affected our better judgment.
Aside from our lessons learned from our admitted over-eagerness to have great fun
getting our hands deep into DocBook once again, the project eventually turned a corner
into a workflow that succeeds in meeting the original content migration goals AND
producing correct HTML content for LMS ingestion. The internal project team eventually
learned how vital the XProc pipeline was to their migration effort and completed content
migration of the most extensive product series. The client resolved internal accounting
issues eventually, and Ari received payment. The project team continues to work with
the
XProc pipeline today as the primary tool enabling content migration of the remaining
series products from ACM to Alfresco and from Arbortext Editor to Oxygen XML Web
Author.
Project Learning Outcomes: The XML Community is Alive and Kicking!
There was no way this project could come together without input from additional
colleagues in the greater XML community. We are grateful to both Tommie Usdin and
Liam
Quin for their assistance behind the scenes in creating a workable scenario that made
it
possible for one developer in Sweden to work with another developer in New York for
a
project based in San Francisco.
We faced continual stakeholder challenges throughout the project, but we had fun
working together on DocBook content in a clever solution that addressed the original
problem specified as the SOW scope. In addition, DocBook projects are great late-night
projects for developing lasting friendships and potential project colleagues in future
consulting efforts.
[1] Straightforward was an erroneous assumption. In hindsight, Jean should have
analyzed at least five titles from each of three product series.
[2] The trick is to define what that one thing is.
[3] This is confusing to a developer until one realizes precisely
why content is not coming through the XSLT.
[4] Know the POSIX standard? Well, this wasn't even close. Multiple
whitespace, forbidden characters, unsuitable characters...
[5] Because the input to XSLT must be well-formed. It is probably possible to do
something with regular expressions, but here we much preferred not to; there
were too many variations to consider.
[6] For this project, we had XProc 1.0, which really will only deal with (at
least) well-formed XML. XProc 3.0 can do a lot better in this regard.
[7] Moreover, no, xmllint cannot add a
DOCTYPE, unfortunately. It can, however, give the same
error message every time there is not one.
[8] Ari: After I had written the pre-processing steps to normalise the
XIncludes, Jean told me there was no need; the client now decided they
wanted to keep the pieces as-is.
[9] Unless you're outputting your DOCTYPE with
xsl:text, which is not a unique approach.
[10] For those of you who do not read Ant, it's essentially a regular
expression matching the newly inserted DOCTYPE in
all XML files in the output directory and inserting
the internal subset last in that DOCTYPE.
[11] Cross-reference labels, various @class attribute values,
captions, etc.
[12] As seen in the XSLT example, we pulled some information from attribute
values.
[13] Again, certainly possible in xslTNG, but easier in the pipeline.
[14] Also, I could not understand why my ID tweaks did not work for quite some
time. That is, until remembering that later DocBook versions use
@xml:id, not @id…