Basics

Imagine you need to transform a Microsoft Excel 2013 spreadsheet to DocBook. The input is nothing like your final output; there is no structure and the input's got no clear semantics. The input is a mess of word processor formatting and weird namespaces, and your actual contents are either hidden in multiple t and v elements, or not visible at all on first sight, seemingly placed there at random. And imagine that you'll have to use traditional XSLT development to do it. I'm sure you can do it, if you're a reasonably experienced XSLT developer with some insight into the xlsx format, but you'll experience pain like none you've known before.

Now, imagine you'd instead be able to do it step by step, with one step's output being the input to the next: first, you remove anything that isn't needed; second, you include the shared strings into your cells; third, you normalise your hyperlinks; and so on (where “and so on” means pretty much every actual conversion step).

Wouldn't that be wonderful?

Ideally, such a transformation should be describable by a document listing the individual steps in order:

  1. Step 1

  2. Step 2

  3. Step 3

And ideally, it should be possible to change the way the pipeline works, simply by adding a step between existing steps or by rearranging the individual steps:

  1. Step 1

  2. Step 3

  3. Step 4

  4. Step 2

If any number of steps is allowed, it should be possible to isolate concerns, that is, to have each and every step address one single thing, be it converting lists, adding a caption to images, or resolving external references.

The idea isn't new, of course (see section “Some Previous Work (That Inspired Me)”), and I don't in any way claim to have originated any of it; I'm simply a happy user, a convert, and hope never to have to go back to a more traditional, monolithic way of developing XSLTs again.

Disclaimer

A number of excellent points were made by the reviewers of this paper, some of the points by almost all of them, and while I have addressed some in the paper itself, I've elected to discuss others here.

Who Uses Monolithic XSLT These Days?

You'd be surprised. All but one of my clients over the last five years or so used monolithic XSLT. Sure, they'd have modules upon modules, and sometimes they'd use variables to hint at a pipelined approach to specific parts of the transformation, but essentially, the transformation was still monolithic.

The one client who did use pipelined XSLT was the one that converted me, pretty much on the spot about five years ago as I write this. I had previously created pipelined transformations by hard-coding my XSLTs in p:xslt steps in XProc 1.0 pipelines but had no generic approach to adding an arbitrary number of XSLT steps to a pipeline.

Note

And even that client had developers who would seemingly write XSLT pipelines but still have each step focus on multiple things rather than refactoring the steps; clearly there was room for improvement.

Um, XSLT 3.0? XProc 3.0?

Some reviewers rightly point out features in XSLT 3.0 that would simplify a lot of the approach discussed in this paper. I agree. However, two things: one, none of my clients uses XSLT 3.0, and two, therefore my experience in XSLT 3.0 is limited to things done in my own time, which is not nearly enough. It's on my list, though. Any day now.

XProc 3.0? Absolutely, there are so many goodies in there, many of which will utterly change the predominantly XProc 1.0/XSLT 2.0 approach discussed here. However, the first XProc 3.0 processor[1] was only becoming available when I was halfway through writing this paper. There wasn't enough time to rewrite the paper, let alone the XProc 1.0 pipelines I was and still am using in my day-to-day work. But do read section “Thoughts on an XProc 3.0 Approach”, where I outline some of my thoughts on rewriting everything in XProc 3.0.

Mechanics

My preferred approach to pipelined XSLT transformations is largely based on Nic Gibson's XProc Tools (see XProc Tools), where an XML-based manifest file lists the XSLT stylesheets to be run, in document order. A simple manifest looks like this:

<manifest xmlns="http://www.corbas.co.uk/ns/transforms/data" xml:base=".">
    
    <group description="XLSX normalisation and cleanup steps" xml:base="../xslt/">
        <item href="step1.xsl" description="To element one">
            <meta name="param1" value="value1"/>
        </item>
        <item href="step2.xsl" description="To element two"/>
        <item href="step3.xsl" description="To element three">
            <meta name="param3A" value="value3A"/>
            <meta name="param3B" value="value3B"/>
        </item>
        <item href="step4.xsl" description="To element four"/>
    </group>
    
</manifest>

Each item element identifies an XSLT stylesheet to be run (@href), in the order they're listed, and everything has a @description attribute that tells us what the step does. Also note that Stylesheets #1 and #3 have input parameters described by meta elements. Changing the order in which the steps run is easy enough: simply move the items about in your favourite XML editor.

The manifest adheres to a RELAX NG compact schema, available on GitHub as part of the XProc Tools repository (see XProc Tools).

Under the Hood

An XProc step reads the manifest document and loads the XSLT stylesheet and input parameters listed by each item into an in-memory sequence, pairwise. It then loops over every XSLT and parameter pair in the sequence using p:for-each, creating a transform (a p:xslt step) for each iteration, that is, for each step of the pipeline described by the manifest.

This implies some recursion, of course; the XProc step that creates the transform is called by itself once for each step in the manifest, the output of the previous step being used as the input to the next, and applying the next XSLT/parameters pair on it. It will discard each XSLT after use and check if there are any left in the sequence; if not, the pipeline is done and the output is saved to disk.

By default, the entire transform is done in-memory, without any serialisation until after the last step, but it is, of course, possible to save the outputs of the intermediate steps for debugging purposes (see section “Debugging”).
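The same fold can be sketched outside XProc. The following is not how XProc Tools implements it; it is a minimal XSLT 3.0 sketch, assuming a processor that supports fn:transform() and higher-order functions, and a hypothetical manifest-uri parameter. Each item's stylesheet is applied to the result of the previous one, with any meta elements passed on as stylesheet parameters.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xmlns:m="http://www.corbas.co.uk/ns/transforms/data"
    exclude-result-prefixes="xs map m"
    version="3.0">

    <!-- Hypothetical parameter: absolute URI of the manifest file -->
    <xsl:param name="manifest-uri" as="xs:string" required="yes"/>

    <xsl:template match="/">
        <xsl:variable name="source" select="."/>
        <!-- The items in document order; base-uri() below honours xml:base -->
        <xsl:variable name="items" select="doc($manifest-uri)//m:item"/>

        <!-- Fold over the items: each result becomes the next step's source -->
        <xsl:sequence select="
            fold-left($items, $source, function($doc, $item) {
                transform(map {
                    'stylesheet-location': string(resolve-uri($item/@href, base-uri($item))),
                    'source-node': $doc,
                    'stylesheet-params': map:merge(
                        for $meta in $item/m:meta
                        return map { QName('', string($meta/@name)): string($meta/@value) })
                })?output
            })"/>
    </xsl:template>

</xsl:stylesheet>
```

Like the default XProc behaviour, this keeps everything in memory and leaves serialisation to whatever runs the stylesheet. It does not produce any debug output.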

Developing

Developing XSLT stylesheets for pipelines is straightforward and quite a bit easier than developing those monolithic stylesheets we've all been used to. Given the manifest file above, step1.xsl, for example, looks like this:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
    
    <xsl:output method="xml" indent="yes"/>
    
    <xsl:param name="param1"/>
    
    <xsl:template match="/">
        <xsl:apply-templates select="node()" mode="STEP-1"/>
    </xsl:template>
    
    <xsl:template match="*" mode="STEP-1" priority="1">
        <one one="{$param1}">
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()" mode="STEP-1"/>
        </one>
    </xsl:template>
    
    <xsl:template match="node()" mode="STEP-1">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()" mode="STEP-1"/>
        </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>

There is a higher-priority template that converts every input element to a one element in the output and adds @one="{$param1}" to each, but the rest is just an identity transform, copying its input verbatim. And if we'd wanted to only convert, say, para elements, we'd obviously not need @priority at all. This is the case with step2.xsl, which only matches one elements:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
    
    <xsl:output method="xml" indent="yes"/>
    
    <xsl:template match="/">
        <xsl:apply-templates select="node()" mode="STEP-2"/>
    </xsl:template>
    
    
    <xsl:template match="one" mode="STEP-2">
        <two>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()" mode="STEP-2"/>
        </two>
    </xsl:template>
    
    
    <xsl:template match="node()" mode="STEP-2">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()" mode="STEP-2"/>
        </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>

Steps #3 and #4 are both very similar to #2, as this is merely an example pipeline demonstrating the basic ideas. It does illustrate the approach, though, most importantly that individual steps should always be as simple as possible. This helps us achieve that all-important separation of concerns; if you're tempted to add an unrelated template to an existing step rather than creating a new one, you're adding needless complexity to that step.
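As an illustration of a step taking two parameters, step3.xsl might look something like the following. This is only a sketch that follows the pattern of step2.xsl and the manifest entry for step 3; the attribute names three-a and three-b are made up for the example, and the actual file in the example repository may well differ.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- Set from the meta elements of the manifest item -->
    <xsl:param name="param3A"/>
    <xsl:param name="param3B"/>

    <xsl:template match="/">
        <xsl:apply-templates select="node()" mode="STEP-3"/>
    </xsl:template>

    <!-- The one thing this step changes: two becomes three -->
    <xsl:template match="two" mode="STEP-3">
        <!-- three-a and three-b are hypothetical attribute names -->
        <three three-a="{$param3A}" three-b="{$param3B}">
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()" mode="STEP-3"/>
        </three>
    </xsl:template>

    <!-- Everything else is copied as-is -->
    <xsl:template match="node()" mode="STEP-3">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()" mode="STEP-3"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```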

In other words, we create simple stylesheets that focus on changing one thing, be it a single element or a group of similar elements. For example, I tend to have a single step for all inline processing if I only have a few inline elements to transform. More complex semantics might require multiple steps. For example, when upconverting a flat structure such as a Microsoft Word docx file, you'll likely want to use several steps to convert docx lists to a structured XML list—the first would be to identify the list items themselves, the second to determine list labels, the third to wrap the items in a list element, and so on[2]. If not, the steps will become needlessly complex and hard to follow, which is what we're trying to avoid here.
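To make the list example a little more concrete, here is a sketch of what the third of those steps, wrapping the items in a list element, could look like. It assumes that earlier steps have already tagged the list items as li elements sitting next to other block-level siblings; the li and ul names and the WRAP-LISTS mode are hypothetical. Labels and nesting are left to other steps.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

    <xsl:template match="/">
        <xsl:apply-templates select="node()" mode="WRAP-LISTS"/>
    </xsl:template>

    <!-- Any element with li children that is not already a list -->
    <xsl:template match="*[li][not(self::ul)]" mode="WRAP-LISTS">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <!-- Group runs of adjacent li elements; whitespace-only text
                 between two li elements stays with the run -->
            <xsl:for-each-group select="node()"
                group-adjacent="self::li or
                    self::text()[not(normalize-space())]
                                [preceding-sibling::*[1][self::li]]
                                [following-sibling::*[1][self::li]]">
                <xsl:choose>
                    <xsl:when test="current-grouping-key()">
                        <ul>
                            <xsl:apply-templates select="current-group()" mode="WRAP-LISTS"/>
                        </ul>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:apply-templates select="current-group()" mode="WRAP-LISTS"/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:for-each-group>
        </xsl:copy>
    </xsl:template>

    <!-- Everything else is copied as-is -->
    <xsl:template match="node()" mode="WRAP-LISTS">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()" mode="WRAP-LISTS"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```

The step does exactly one thing: it does not decide what counts as a list item, and it does not touch the items themselves.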

Having said that, how you write XSLT pipelines is up to you; however, you might want to consider future maintenance, the level of documentation, the skills of the developers, etc., when writing. Some XSLTs will be large regardless, simply because it's what is required.

The example pipeline above is available on GitHub (see XSLT Pipeline Example Repository) for the inquisitive reader.

Debugging

Writing and running a lot of simple steps in sequence rather than a single monolithic step in one go is, in my humble opinion, useful enough to be preferred, but debugging is really when the pipelined approach comes into its own. By letting the pipeline save the output of each intermediate step to disk, you can easily troubleshoot your transformation. Usually, it's much easier to a) find the approximate location of the problem in the pipeline, often down to a single step, and b) test and fix the bug.

In debug mode, the XProc pipeline will produce debug output like this:

ari@toddao:~/Documents/projects/findcourses/poc/tmp/debug/Activate_Learning.xml$ ls -lh
total 106M
-rw-r--r-- 1 ari ari 3,5M mar  5 16:16 0-Activate_Learning.xml
-rw-r--r-- 1 ari ari 3,5M mar  5 16:16 1-XLSX-UTIL_remove-empty.xsl.xml
-rw-r--r-- 1 ari ari 6,4M mar  5 16:16 2-XLSX-UTIL_normalisation.xsl.xml
-rw-r--r-- 1 ari ari 6,4M mar  5 16:16 3-XLSX-UTIL_hyperlinks.xsl.xml
-rw-r--r-- 1 ari ari 4,8M mar  5 16:16 4-XLSX-UTIL_cleanup.xsl.xml
-rw-r--r-- 1 ari ari 4,9M mar  5 16:16 5-XLSX2XML_structure.xsl.xml
-rw-r--r-- 1 ari ari 6,2M mar  5 16:16 6-XLSX2XML_courses.xsl.xml
-rw-r--r-- 1 ari ari 5,6M mar  5 16:16 7-XLSX2XML_dates.xsl.xml
-rw-r--r-- 1 ari ari 5,6M mar  5 16:16 8-XLSX2XML_locations.xsl.xml
-rw-r--r-- 1 ari ari 6,0M mar  5 16:16 9-XLSX2XML_fields.xsl.xml
-rw-r--r-- 1 ari ari 5,7M mar  5 16:16 10-EXC2XI_course.xsl.xml
-rw-r--r-- 1 ari ari 5,7M mar  5 16:16 11-EXC2XI_content-fields.xsl.xml
-rw-r--r-- 1 ari ari 5,7M mar  5 16:16 12-EXC2XI_course-links.xsl.xml
-rw-r--r-- 1 ari ari 5,5M mar  5 16:16 13-EXC2XI_categories.xsl.xml
-rw-r--r-- 1 ari ari 5,5M mar  5 16:16 14-EXC2XI_exc-locations.xsl.xml
-rw-r--r-- 1 ari ari 5,6M mar  5 16:16 15-EXC2XI_exc-events.xsl.xml
-rw-r--r-- 1 ari ari 5,3M mar  5 16:16 16-EXC2XI_exc-duration.xsl.xml
-rw-r--r-- 1 ari ari 5,3M mar  5 16:16 17-EXC2XI_exc-email.xsl.xml
-rw-r--r-- 1 ari ari 4,4M mar  5 16:16 18-EXC2XI_xi-dedupe.xsl.xml
-rw-r--r-- 1 ari ari 4,4M mar  5 16:16 19-EXC2XI_xi-cleanup.xsl.xml

This is from a live customer project and so a bit more complex. Each file listed is the output from the transformation performed by the XSLT stylesheet the file is named after. The debug files are all prefixed by ordinal numbers so they can be sorted in the running order. Debugging a step is as easy as opening the output of the immediately preceding step in an XML IDE and applying the offending XSLT on it. An XSLT that focusses on one task becomes very easy to debug.

Useful Tips

Some of the suggestions and tips here will be obvious to many of you. My apologies, if that is the case.

Pipeline Organisation

A useful feature of the XSLT manifest file is the ability to group related steps (this is an outline of a manifest for a docx to XML conversion):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="xproc-tools/schema/manifest.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<manifest 
	xmlns="http://www.corbas.co.uk/ns/transforms/data"
	xml:id="migration.rtf.stair"
	description="migration.rtf.stair"
	xml:base="."
	version="1.0">
	
	<group
		xml:id="word.conversion"
		description="word.conversion"
		xml:base="../xslt/common/"
		enabled="true">
		<item href="word-to-xhtml5-elements.xsl" description="Flat Word to XHTML conversion"/>
		<item href="add-eqn-pi.xsl" description="inserts a PI placeholder for equations"/>
		...
	</group>
	
	<group
		xml:id="word.conversion.cleanup"
		description="word.conversion"
		xml:base="../xslt/common/"
		enabled="true">
		...
	</group>
	
	<!-- Up till now we've done common word conversion components -->
	
	<!-- Preprocess for Kepler (HTML to HTML) -->
	<group
		xml:id="preprocess.word.conversion.stair"
		description="preprocess.word.conversion.stair"
		xml:base="../xslt/stair-halsbury/"
		enabled="true">
		...
	</group>
	
	
	<!-- Basic construction of Kepler block structures and inline markup -->
	<group
		xml:id="structure.inline.word.conversion.stair"
		description="structure.inline.word.conversion.stair"
		xml:base="../xslt/stair-halsbury/"
		enabled="true">		
		...
	</group>
	
	<!-- Xref and cite processing -->
	<group
		xml:id="xrefs.cites.word.conversion.stair"
		description="xrefs.cites.word.conversion.stair"
		xml:base="../xslt/stair-halsbury/"
		enabled="true">
		...
	</group>
	
	<!-- Other Kepler fixes and cleanup -->
	<group
		xml:id="word.conversion.stair"
		description="word.conversion.stair"
		xml:base="../xslt/stair-halsbury/"
		enabled="true">

		...
		<item href="html2kepler_cleanup.xsl"/>
		<item href="html2kepler_namespacecleanup.xsl"/>
	</group>
	
</manifest>

Notice that each group element contains steps manipulating the same types of information, for example, cites and cross-references in the above example[3]. Each group can be given its own @xml:base value, allowing better organisation of the steps in different directories on a file system.

Attribute Processing

A basic rule of pipelined XSLT is to copy everything that you aren't touching in a step. If you need to change one attribute but leave the rest, this is a useful approach:

<xsl:copy-of select="@* except @style"/>
<xsl:attribute name="formatting" select="@style"/>

If you're converting something like Microsoft Word docx, there are probably lots of attributes that you can discard, so a variant of the above is to parameterise anything following except and use a generic step with those parameter values as input.
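A generic step of that kind could be sketched as follows. The parameter name discard, its space-separated value format, and the DROP-ATTS mode are assumptions made for this example, not something defined by XProc Tools.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <!-- Hypothetical parameter: space-separated names of the attributes
         to discard, set in the manifest with, for example,
         <meta name="discard" value="style frame"/> -->
    <xsl:param name="discard" as="xs:string" select="''"/>

    <!-- An empty parameter value gives an empty sequence: nothing is discarded -->
    <xsl:variable name="discard-names" as="xs:string*"
        select="tokenize(normalize-space($discard), ' ')"/>

    <xsl:template match="/">
        <xsl:apply-templates select="node()" mode="DROP-ATTS"/>
    </xsl:template>

    <!-- Identity transform, minus the listed attributes -->
    <xsl:template match="node()" mode="DROP-ATTS">
        <xsl:copy>
            <xsl:copy-of select="@*[not(name() = $discard-names)]"/>
            <xsl:apply-templates select="node()" mode="DROP-ATTS"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```

The same stylesheet can then be listed more than once in a manifest, each item with its own meta value.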

When writing a complex pipeline of several dozens of steps, I've found it's really useful to leave breadcrumbs behind until the very last steps, attributes that tell me where something came from. For example, the original element's name is sometimes useful, especially if you map several source elements to a single target:

<xsl:template match="para" mode="STEP-2">
    <p role="{name(.)}">
        ...
    </p>
</xsl:template>

Here, role="{name(.)}" inserts the source element name in @role.

And sometimes it's helpful to leave behind entire attribute name/value pairs. Here, I'm copying two attributes to a combined @src-styling attribute:

<xsl:attribute name="src-styling">
    <xsl:for-each select="(@style,@frame)">
        <xsl:value-of select="concat('@',name(.),'=',.)"/>
        <xsl:value-of select="if (position()!=last()) then (' ') else ()"/>
    </xsl:for-each>
</xsl:attribute>

For a source like this:

<para style="1" frame="dash">Test</para>

We'd add an attribute like this:

<p src-styling="@style=1 @frame=dash">...</p>

This is really useful when you are debugging two dozen steps and a few days later, when you no longer remember the specifics of the transform but need to remind yourself of some processing attributes in the source that you did away with.

Testing

What about testing, you say? Well, writing tests is still about writing XSpec tests for each step which is fine, but if you need to write tests that check actual XML input and output, it won't be possible by merely checking the pipeline inputs and outputs in a single XSpec. XSpec acts on a single XSLT applied on a single XML file, not a sequence of them. We must therefore instantiate the XSpec tests for any step that requires using the actual step inputs and outputs.
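For comparison, a generic step test with inline input, one that does not depend on the actual step inputs and outputs, could look like this for step2.xsl in the example pipeline. The path to the stylesheet is an assumption.

```xml
<x:description
    xmlns:x="http://www.jenitennison.com/xslt/xspec"
    stylesheet="../xslt/step2.xsl">

    <x:scenario label="Step 2 converts one elements to two elements">
        <!-- Apply the templates in the step's mode to the inline input -->
        <x:context mode="STEP-2">
            <one one="value1">Some text</one>
        </x:context>
        <x:expect label="a two element with the same attributes and content">
            <two one="value1">Some text</two>
        </x:expect>
    </x:scenario>

</x:description>
```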

I've written a set of XSLT and XProc tools that use the debug output and generic step XSpec tests to generate instance XSpecs. The XProc orchestrates this by using an XSpec manifest file, much like the XSLT pipeline manifest, running each XSpec in sequence. Here's an example XSpec manifest from a live customer project:

<tests
    xmlns="http://www.sgmlguru.org/ns/xproc/steps"
    manifest="../../pipelines/poc/poc-xlsx2xml-manifest.xml"
    xml:base="/home/ari/Documents/repos/xlsx2xml/">
    
    <group description="XLSX normalisation steps">
        <test xslt="xslt/common/XLSX-UTIL_normalisation.xsl" xspec="xspec/common/XLSX-UTIL_normalisation.xspec" description="Normalise shared strings."/>
        <test xslt="xslt/common/XLSX-UTIL_hyperlinks.xsl" xspec="xspec/common/XLSX-UTIL_hyperlinks.xspec" description="Normalise referenced hyperlinks."/>
    </group>
    
    <group description="Conversion steps">
        <test xslt="xslt/common/XLSX2XML_structure.xsl" xspec="xspec/common/XLSX2XML_structure.xspec" description="Convert main structures."/>
        <test xslt="xslt/common/XLSX2XML_dates.xsl" xspec="xspec/common/XLSX2XML_dates.xspec" description="Convert ECMA 376 dates to xs:dateTime."/>
        <test xslt="xslt/common/XLSX2XML_locations.xsl" xspec="xspec/common/XLSX2XML_locations.xspec" description="Convert location info to exc, group it in locations wrapper."/>
    </group>
    
</tests>

Here, each test matches a pipeline manifest item, and this manifest actually tests the pipeline I'm illustrating debug output with, above.

Note that there is nothing to stop you from adding more than one XSpec test to a pipeline step. Quite the contrary, in fact, as splitting tests into multiple files will often make for more readable code.

Documenting

An XSLT pipeline, while not self-documenting, should be relatively easy to understand simply by looking at the manifest. Every element in the manifest includes a @description attribute that should contain at least a basic description of what a step does.

From a documentation perspective, the test manifest is possibly even better. It also contains descriptions, of course, but it is easy to write an XSLT that extracts descriptions and XSpec tests to document a step[4].
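A first stab at such an XSLT might look like the sketch below. It reads the test manifest shown earlier and outputs a DocBook-like outline (namespace omitted for brevity), listing each step's description followed by the scenario labels of its XSpec file, if that file can be found. The choice of output elements is arbitrary.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:t="http://www.sgmlguru.org/ns/xproc/steps"
    xmlns:x="http://www.jenitennison.com/xslt/xspec"
    exclude-result-prefixes="t x"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/t:tests">
        <section>
            <title>Pipeline steps</title>
            <xsl:apply-templates select="t:group"/>
        </section>
    </xsl:template>

    <xsl:template match="t:group">
        <section>
            <title><xsl:value-of select="@description"/></title>
            <xsl:apply-templates select="t:test"/>
        </section>
    </xsl:template>

    <xsl:template match="t:test">
        <!-- Resolve the XSpec location against the manifest's xml:base -->
        <xsl:variable name="xspec-uri" select="resolve-uri(@xspec, base-uri(.))"/>
        <xsl:variable name="scenarios"
            select="if (doc-available($xspec-uri))
                    then doc($xspec-uri)//x:scenario[@label]
                    else ()"/>
        <formalpara>
            <!-- Use the stylesheet's file name as the title -->
            <title><xsl:value-of select="tokenize(@xslt, '/')[last()]"/></title>
            <para><xsl:value-of select="@description"/></para>
        </formalpara>
        <xsl:if test="$scenarios">
            <itemizedlist>
                <xsl:for-each select="$scenarios">
                    <listitem>
                        <para><xsl:value-of select="@label"/></para>
                    </listitem>
                </xsl:for-each>
            </itemizedlist>
        </xsl:if>
    </xsl:template>

</xsl:stylesheet>
```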

In my ever-so-humble opinion, one of the real advantages of this pipelined approach is to be able to write short steps, meaning that each step will be easier to understand[5].

Implementations

My preferred way to write XSLT pipelines, again, is to use XProc Tools (see XProc Tools) as the basis for my pipelines. The XProc Tools are open source (LGPL 3.0 license) and freely available on GitHub, but they are just that, a set of tools. There is the XProc step that runs XSLT manifests as discussed above, and there's a step for reading directories and subdirectories recursively, plus some other useful steps. They remain an excellent showcase of what XProc can do.

They don't run out of the box, however. They require wrapper scripts that run the pipeline transformation step in context, validate the output, unzip docx archives, and so on. To this end, I've written additional XProc scripts that do all this, from converting entire directories in batch to validating the output with Schematron and applying XSpec tests. These, just like XProc Tools, are open source (LGPL 3.0) and available on GitHub (see XProc Batch Wrapper Scripts).

Note

I've also written an example XSLT pipeline that shows how to use the above. This, again, is open source and available on GitHub (see XSLT Pipeline Example Repository).

XProc pipelines can be run from oXygen XML Editor or from the command line[6].

Thoughts on an XProc 3.0 Approach

XProc Tools, just as my XProc Batch scripts, are written in XProc 1.0 and require XML Calabash 1.x to run. XProc 3.0 is in last call, however, and the XProc 3.0 processor MorganaXProc-III (see MorganaXProc-III XProc 3.0 Processor) has recently been made available as a public beta, so I am currently writing XProc 3.0 versions of everything. XProc 3.0 (see XProc 3.0: A Pipeline Language) is a significant improvement over XProc 1.0, both in functionality and in ease of use, and I hope to have everything up and running by Balisage 2020.

A @version='3.0' alternative is, of course, the most basic of rewrites, the idea being to do as little as possible and merely make sure that the syntax is right. But there's just so much that has been improved and updated. For example:

  • XProc 1.0 couldn't do recursive listing of directories so Nic Gibson, like so many others, had a library step for the purpose. XProc 3.0, of course, does this out of the box, so there's no reason for that library.

  • p:store, the step that, well, stores whatever is on its input port, is now also an identity step, which is just such a lifesaver. No need for various pipes to redirect stuff, so a lot of the code in the 1.0 implementations just disappears.

  • There's a @use-when attribute that you can attach to pretty much anything. An obvious use is to do something like p:store[@use-when=$debug] when you want to output debug info. This essentially gets rid of another library step.

  • XProc 3.0 makes it easy to add message output to any step of your choice through a @message attribute, allowing us to produce verbose output of anything we like in the pipeline. This again vastly simplifies the solution in my 1.0 pipelines, which is essentially a Calabash extension step.

  • Variables in 3.0 are much easier to handle than the 1.0 approach where they had to appear in the step prolog.

  • Etc.

Add to that all the syntactic sugar and it does not make sense to simply reproduce the 1.0 pipelines in 3.0. For example, running an XSLT stylesheet and storing the result in 3.0 is this easy:

<p:xslt>
   <p:with-input port="stylesheet" href="..."/>
</p:xslt>
<p:store use-when="$debug" href="..."/>

...

We could easily produce this by applying an XSLT to the pipeline manifest file; it looks a lot like the manifest to begin with.
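As a rough sketch of that idea, and no more than that, the following XSLT turns a manifest into an XProc 3.0 pipeline with one p:xslt and one conditional p:store per item. The debug-dir parameter and the naming of the debug files are assumptions, stylesheet parameters from meta elements are left out, and the generated pipeline has not been tested against a 3.0 processor.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:m="http://www.corbas.co.uk/ns/transforms/data"
    xmlns:p="http://www.w3.org/ns/xproc"
    exclude-result-prefixes="m"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- Hypothetical parameter: where the generated pipeline stores debug output -->
    <xsl:param name="debug-dir" select="'debug/'"/>

    <xsl:template match="/m:manifest">
        <p:declare-step version="3.0">
            <p:input port="source"/>
            <p:output port="result"/>
            <!-- use-when is evaluated statically, so debug is a static option -->
            <p:option name="debug" static="true" select="false()"/>
            <xsl:apply-templates select=".//m:item"/>
        </p:declare-step>
    </xsl:template>

    <!-- One p:xslt per manifest item, each followed by a conditional p:store -->
    <xsl:template match="m:item">
        <xsl:variable name="filename" select="tokenize(@href, '/')[last()]"/>
        <p:xslt>
            <!-- Resolve the stylesheet location against the item's xml:base -->
            <p:with-input port="stylesheet" href="{resolve-uri(@href, base-uri(.))}"/>
        </p:xslt>
        <!-- Debug files are numbered in running order -->
        <p:store use-when="$debug" href="{$debug-dir}{position()}-{$filename}.xml"/>
    </xsl:template>

</xsl:stylesheet>
```

Since p:store passes its input through, the steps chain without any explicit pipes whether or not the debug output is switched on.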

This was doable but not very practical in XProc 1.0. The in-memory sequence of XSLTs was a better fit, and while that would work here, too, this would probably consume less memory.

XQuery?

XProc 1.0 enjoys limited support, unfortunately. Recently, I used the pipelined XSLT approach for live conversions of input Excel spreadsheets to XML for a client and needed to run everything in the eXist-db XML database. I discovered that its XProc support was nowhere near sufficient, so I replaced the XProc pipeline runner with an XQuery module and made that open source, too (see XProc Batch Wrapper Scripts). The XQuery module works in almost exactly the same way as the XProc Tools do (see section “Under the Hood”), but requires eXist-db to run[7].

Apart from having to run everything from within eXist-db, the approach remains the same; the one thing I need is some XQuery that includes this[8] to run the pipeline:

pipelines:transform-collection($sources,$manifest-uri,$debug,$out)

I was able to upload my pipeline manifest and steps to eXist-db unchanged. As I don't yet have XQuery functions to run my XSpec tests in eXist-db, I could instead develop the pipeline on the file system, using XProc, and upload everything to eXist-db for the live conversion.

A More Generic Approach?

A reviewer of the first version of this paper suggested a more generic approach. Forget XProc, forget XQuery; instead, base the pipelined XSLT on the concept of the manifest and use whatever implementation language that fits, the idea, as I understand it[9], being to include all logic in the manifest.

Yes, the idea did (sort of) occur to me when writing the XQuery implementation. The problem is that by adding all the logic to the manifest, I'd effectively be creating yet another pipeline language and I don't think I want to go that far[10]. My motivation here is to simplify my day-to-day XSLT. My point of view for the last five years has been very much that of converting a large number of documents from one format to another with as much simplicity, transparency, and debugging capability as possible, and so I wasn't after inventing something new as much as I was after streamlining and adapting something that already existed to my needs.

Thus, while the idea is worth exploring, it's not on my short-term list.

End Notes

Why?

Still need convincing? Consider these points:

  • It's easier to write. You do one task per step.

  • It's easier to debug. See above.

  • It's easier to take over from someone else.

  • It's safer, as you're less likely to remove elements from the source without noticing. Everything but the one thing you change is an identity transform!

  • And it's fun!

Why Not?

Are there reasons not to do pipelined transforms?

  • If you're supplying the transformation live, in a time-critical context, the XProc 1.0 approach might not be the way to go, especially with a large pipeline. Every step is read into memory, which means the computer will consume lots of it. There's also the fact that the JVM running the XProc engine has a start-up time, further adding to the total.

  • Not every environment will support XProc[11], and not everyone will want to transform their stuff using eXist-db. However, I believe both of these can be addressed in either XProc 3.0 or an XQuery implementation that doesn't depend on eXist-db.

  • The XProc (or XQuery) pipeline framework could easily add an extra tool that has to be managed[12], especially if there is a need to add features to the manifest.

  • A reviewer suggested that the steps might not necessarily be independent. This is true, of course (I've had countless problems arising from the wrong step order and such), but the steps are not meant to be independent. They are meant to be run in order, which implies the opposite, and very frequently, one will have to use temporary markup as step output to preserve or make explicit some aspect of the step transform. The idea is not independence for each step; the idea is an isolation of concerns, doing one thing at a time.

    It is my experience that most problems arise not from step interdependence as such but from expanding a step to do too much. I'm always tempted to add some tiny little tweak here or an additional template there in an existing step rather than adding a new one. It's about convenience and laziness[13].

Also, while not a reason to avoid pipelined XSLT, turning on debugging will consume a lot of disk space, and so you might not want to do it if you're running conversions of many large documents in batch. I've run out of space on a half-empty 500 GB disk a couple of times.

Note

Having been playing with XProc 3.0 and generally thinking about debugging, I'm now convinced that I need to add a feature to handle debugging levels in the manifest. If you have an 80-step pipeline, having debug output from each step is way over the top. Why not use the manifest items to decide how important a step is, how prone it is to fail? Some will be so basic that there is no need to produce debug output from each step. Or perhaps a combination, or maybe limiting debug to group outputs?

Some Previous Work (That Inspired Me)

Me, I was heavily influenced by Nic Gibson's XProc Tools (XProc Tools) that allow XSLT developers to define their XSLT pipelines in manifest documents listing the individual stylesheets and run the pipelines using XProc. While working at LexisNexis UK[14], I used those scripts for many conversion projects (see Up and Sideways), frequently wrapping the XProc inside Ant scripts. Later, I've also introduced similar solutions for other clients.

Others have approached the problem by introducing XSLT variables in sequence within the same stylesheet, something like this:

<xsl:template match="something">
    
    <xsl:variable name="step1">
		<xsl:element name="some-element">
			<xsl:for-each select="$something-to-start-with/...">
				...
			</xsl:for-each>
		</xsl:element>
	</xsl:variable>
	
	...
	
	<xsl:variable name="step5">
		<xsl:element name="some-element">
			<xsl:for-each select="$step4/...">
				...
			</xsl:for-each>
		</xsl:element>
	</xsl:variable>
	
	<xsl:result-document href="{concat($filepath, 'aaa_step5.xml')}">
		<xsl:copy-of select="$step5"/>
	</xsl:result-document>
	
	<xsl:variable name="step5aa">
		<xsl:element name="some-element">
			<xsl:for-each select="$step5/...">
				...
			</xsl:for-each>
		</xsl:element>
	</xsl:variable>
	
	<xsl:result-document href="{concat($filepath, 'aaa_step5aa.xml')}">
		<xsl:copy-of select="$step5aa"/>
	</xsl:result-document>
    
    ...
    
</xsl:template>

Note the xsl:result-document elements inserted between variables. These are for providing debug output from the preceding variables. The approach is certainly useful, but it will not work well for anything beyond small documents or document fragments: not for converting anything of any real size, and not in batch.

I should also mention Liam Quin, who demonstrated an XSLT 3.0 approach using transform() at XML Prague (see XProc in XSLT: Why and Why Not).

Note

A reviewer also mentioned Orbeon XPL and Netkernel. I haven't considered either one and cannot speak to their applicability in this context. I did not intend this paper to be a survey of available pipeline languages, however, as there are quite a few others that are worth mentioning. Ant, for one, is tremendously useful and could certainly be used to implement a pipelined XSLT toolkit based on XML manifest files.

Future Work

I have a couple of things on my to-do list:

  • Move to XProc 3.0 for everything. There's a ton of new features and syntactic shortcuts, and I fully expect the 3.0 implementation to be tiny in comparison.

  • This should include proper XProc 3.0 tests.

  • The eXist-db/XQuery implementation doesn't support XSpec tests at all yet. Again, I'm planning to add support for XSpec in the future, hopefully in time for Balisage 2020.

  • Generated documentation (as suggested in section “Documenting”).

  • I've been thinking about adding a few Ant wrapper scripts that run pipelines[15] and do some things that aren't readily doable in XProc on the side, if needed.

Thanks To...

  • Nic Gibson

  • Phil Hodder

  • Norm Tovey-Walsh

  • Achim Berndzen

  • Liam Quin

  • ...and many others whose approaches to pipelining taught me a lot.

References

[XProc Tools] XProc Tools by Nic Gibson [online, fetched on 1 April 2020]. https://github.com/Corbas/xproc-tools

[XProc Batch Wrapper Scripts] XProc Batch by Ari Nordström [online, fetched on 1 April 2020]. https://github.com/sgmlguru/xproc-batch

[XSLT Pipeline Example Repository] XSLT Pipelines by Ari Nordström. https://github.com/sgmlguru/xslt-pipelines

[XProc 3.0: A Pipeline Language] XProc 3.0: A Pipeline Language [online, fetched on 1 April 2020]. http://spec.xproc.org/

[MorganaXProc-III XProc 3.0 Processor] MorganaXProc-III by Achim Berndzen, an XProc 3.0 processor. https://www.xml-project.com/morganaxproc-iii/

[Up and Sideways] Ari Nordström, Up and Sideways: RTF to XML. Presented at Symposium on Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions, Washington, DC, July 31, 2017. In Proceedings of the Symposium on Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). https://doi.org/10.4242/BalisageVol20.Nordstrom01

[XProc in XSLT: Why and Why Not] Liam Quin, XProc in XSLT: Why and Why Not. Presented at XML Prague 2019, February 7-9, 2019. In XML Prague 2019 Conference Proceedings, https://archive.xmlprague.cz/2019/files/xmlprague-2019-proceedings.pdf

[XPants] Phil Hodder, XPants - XML Practical Ant Scripts. https://github.com/encodis/xpants

[Lightweight XML DevOps Using Apache Ant] Phil Hodder, Lightweight XML DevOps Using Apache Ant. Presented at Markup UK 2018, June 9-10, 2018. In Markup UK 2018 Proceedings, https://markupuk.org/2018/webhelp/index.html#ar11.html



[1] Namely, the MorganaXProc-III Beta.

[2] Quite possibly, you'd need a recursive approach to handle nested lists.

[3] The example is from LexisNexis, so legal document processing.

[4] This is on my to-do list.

[5] I've had to debug monolithic XSLT stylesheets of 5,000+ lines of code with nested for-each-groups inside multiple variables in sequence, simulating a pipelined approach, and can confidently say that I'd much rather have a 100-step pipeline without documentation using the approach outlined here than a 5,000 line XSLT with documentation.

[6] For examples, see XProc Tools.

[7] It shouldn't be too hard to port the code to BaseX or MarkLogic, or a file system-based solution, but that's for another project.

[8] As opposed to a shell script that runs my XProc pipeline.

[9] Any misread or misunderstanding is, of course, mine and mine only.

[10] I've done one of those; see my 2012 Balisage paper for an example on how to configure XProc publishing pipelines using an XML file.

[11] Which is why I wrote the XQuery module for eXist-db. But XProc, especially in version 3.0, is tremendously useful, and so I always want to counter with well, why not? and choose something that does support XProc.

[12] Although even for a complex transformation comprising close to a hundred steps, the XProc tools have been both stable and reliable.

[13] A rule of thumb I've come to rely on is that if my XSpec step tests grow too much (they have too many scenarios), I'd better start refactoring.

[14] The one client of mine who didn't do monolithic XSLT, in case you wondered.

[15] To get an idea of what Ant can do in this context, have a look at XPants, presented at Markup UK 2018 (see Lightweight XML DevOps Using Apache Ant).

Ari Nordström

Ari Nordström is an independent markup geek based in Göteborg, Sweden. He has provided angled brackets to many organisations and companies across a number of borders over the years.

Ari is the proud owner and head projectionist of Western Sweden's last functioning 35/70mm cinema, situated in his garage, which should explain why he once wrote a paper on automating commercial cinemas using XML.