How to cite this paper
Nordström, Ari. “Pipelined XSLT Transformations.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Nordstrom01.
Balisage: The Markup Conference 2020
July 27 - 31, 2020
Balisage Paper: Pipelined XSLT Transformations
Ari Nordström
Ari Nordström is an independent markup geek based in Göteborg, Sweden.
He has provided angled brackets to many organisations and companies
across a number of borders over the years.
Ari is the proud owner and head projectionist of Western Sweden's last
functioning 35/70mm cinema, situated in his garage, which should explain
why he once wrote a paper on automating commercial cinemas using
XML.
Copyright ©2020 Ari Nordström
Abstract
This paper is a commercial for pipelined XSLT transformations, hoping to convince
the reader to move to a liberating step-by-step approach to writing
transformations rather than continuing to endure the pain that is known as
monolithic XSLTs. Pipelined transformations are about writing small steps run in
sequence, the output of one step providing the input of the next, and with each step
focussing on one task rather than, well, everything. Developing and debugging
becomes easier, and so does testing and documentation.
It all runs in XProc, but knowing XProc is not a requirement (even though it's fun
in its own right); XSLT is enough to enjoy pipelined XSLTs. This paper should
provide enough information, including implementations and examples, to get the
reader started.
Table of Contents
- Basics
  - Disclaimer
    - Who Uses Monolithic XSLT These Days?
    - Um, XSLT 3.0? XProc 3.0?
- Mechanics
  - Under the Hood
- Developing
  - Debugging
  - Useful Tips
    - Pipeline Organisation
    - Attribute Processing
- Testing
- Documenting
- Implementations
  - Thoughts on an XProc 3.0 Approach
  - XQuery?
  - A More Generic Approach?
- End Notes
  - Why?
  - Why Not?
  - Some Previous Work (That Inspired Me)
  - Future Work
  - Thanks To...
Basics
Imagine you need to transform a Microsoft Excel 2013 spreadsheet to DocBook. The input
is nothing like your final output; there is no structure and the input's got no clear
semantics. The input is a mess of word processor formatting and weird namespaces, and
your actual contents are either hidden in multiple t and v elements, or not visible at
all on first sight, seemingly placed there at random. And imagine that you'll have to
use traditional XSLT development to do it. I'm sure you can do it, if you're a
reasonably experienced XSLT developer with some insight into the xlsx format, but
you'll experience pain like none you've known before.
Now, imagine you'd instead be able to do it step by step, with one step's output being
the input to the next: first, you remove anything that isn't
needed, second you include the shared strings into your cells,
third, you normalise your hyperlinks, and so on (where "and so on" means pretty much
every actual conversion step). You'd do
it all in steps that use the previous one's output as input.
Wouldn't that be wonderful?
Ideally, such a transformation should be describable by a document listing the
individual steps in order:
- Step 1
- Step 2
- Step 3
- Step 4
And ideally, it should be possible to change the way the pipeline
works, simply by adding a step between existing steps or by rearranging the individual
steps:
- Step 1
- Step 3
- Step 4
- Step 2
With any number of steps allowed, it should be possible to isolate
concerns, that is, to have each and every step address one single thing,
be it to convert lists, add a caption to images or resolve external references.
The idea isn't new, of course (see section “Some Previous Work (That Inspired Me)”), and I don't in any way claim to have originated any of
it; I'm simply a happy user, a convert, and hope never to have to go back to a more
traditional, monolithic way of developing XSLTs again.
Disclaimer
A number of excellent points were made by the reviewers of this paper, some of the
points by almost all of them, and while I have addressed some in the paper itself,
I've elected to discuss others here.
Who Uses Monolithic XSLT These Days?
You'd be surprised. All but one of my clients over the last five-plus years used
monolithic XSLT. Sure, they'd have modules upon modules, and sometimes they'd
use variables to hint at a pipelined approach to specific parts of the
transformation, but essentially, the transformation was still monolithic.
The one client who did use pipelined XSLT was the one that converted me,
pretty much on the spot about five years ago as I write this. I had previously
created pipelined transformations by hard-coding my XSLTs in p:xslt
steps in XProc 1.0 pipelines but had no generic approach to adding an arbitrary
number of XSLT steps to a pipeline.
Note
And even that client had developers who would seemingly write XSLT
pipelines but still have each step focus on multiple things rather than
refactoring the steps; clearly there was room for improvement.
Um, XSLT 3.0? XProc 3.0?
Some reviewers rightly point out features in XSLT 3.0 that would simplify a
lot of the approach discussed in this paper. I agree. However, two things: one,
none of my clients uses XSLT 3.0, and two, therefore my
experience in XSLT 3.0 is limited to things done in my own time, which is not
nearly enough. It's on my list, though. Any day now.
XProc 3.0? Absolutely, there are so many goodies in there, many of which will
utterly change the predominantly XProc 1.0/XSLT 2.0 approach discussed here.
However, the first XProc 3.0 processor was only becoming available when I was halfway through writing this
paper. There wasn't enough time to rewrite the paper, let alone the XProc 1.0
pipelines I was and still am using in my day-to-day work. But do read section “Thoughts on an XProc 3.0 Approach”,
where I outline some of my thoughts on rewriting everything in XProc 3.0.
Mechanics
My preferred approach to pipelined XSLT transformations is largely based on Nic
Gibson's XProc Tools (see XProc Tools), where an XML-based manifest file
lists the XSLT stylesheets to be run, in document order. A simple manifest looks like
this:
<manifest xmlns="http://www.corbas.co.uk/ns/transforms/data" xml:base=".">
  <group description="XLSX normalisation and cleanup steps" xml:base="../xslt/">
    <item href="step1.xsl" description="To element one">
      <meta name="param1" value="value1"/>
    </item>
    <item href="step2.xsl" description="To element two"/>
    <item href="step3.xsl" description="To element three">
      <meta name="param3A" value="value3A"/>
      <meta name="param3B" value="value3B"/>
    </item>
    <item href="step4.xsl" description="To element four"/>
  </group>
</manifest>
Each item element identifies an XSLT stylesheet to be run (@href), in the order
they're listed, and everything has a @description attribute that tells us what the
step does. Also note that stylesheets #1 and #3 have input parameters described by
meta elements.
Changing the order in which the steps run is easy enough: simply move the items about
in your favourite XML editor.
The manifest adheres to a RELAX NG compact schema, available on GitHub as part of
the XProc Tools repository (see XProc Tools).
Under the Hood
An XProc step reads the manifest document and loads the XSLT and input parameters
listed by each item into an in-memory sequence, pairwise. It then loops over every
XSLT and parameter pair in the sequence using p:for-each, creating a transform (a
p:xslt step) for each iteration, that is, for each step of the pipeline described by
the manifest.
This implies some recursion, of course; the XProc step that creates the transform
is called by itself once for each step in the manifest, the output of the previous
step being used as the input to the next, and applying the next XSLT/parameters pair
on it. It will discard each XSLT after use and check if there are any left in the
sequence; if not, the pipeline is done and the output is saved to disk.
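The recursion can be pictured with a stripped-down XProc 1.0 step. This is a sketch of the idea only and not the XProc Tools code: the ex:run-steps type and its namespace are made up for illustration, the stylesheets are assumed to arrive already loaded as a sequence of documents on a stylesheets port, and parameter handling and debug output are left out entirely.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:ex="http://example.org/ns/pipeline-sketch"
                type="ex:run-steps" name="run-steps" version="1.0">

  <!-- The document to transform, and the stylesheets still to be applied -->
  <p:input port="source" primary="true"/>
  <p:input port="stylesheets" sequence="true"/>
  <p:output port="result" primary="true"/>

  <!-- Separate the first stylesheet (matched) from the rest (not-matched) -->
  <p:split-sequence name="split" test="position() = 1" initial-only="true">
    <p:input port="source">
      <p:pipe step="run-steps" port="stylesheets"/>
    </p:input>
  </p:split-sequence>

  <!-- Is there a stylesheet left to apply? The count is 0 or 1 -->
  <p:count limit="1">
    <p:input port="source">
      <p:pipe step="split" port="matched"/>
    </p:input>
  </p:count>

  <p:choose>
    <!-- Nothing left: pass the current document through unchanged -->
    <p:when test="number(/c:result) = 0">
      <p:identity>
        <p:input port="source">
          <p:pipe step="run-steps" port="source"/>
        </p:input>
      </p:identity>
    </p:when>
    <!-- Otherwise apply the first stylesheet, then recurse with the rest -->
    <p:otherwise>
      <p:xslt>
        <p:input port="source">
          <p:pipe step="run-steps" port="source"/>
        </p:input>
        <p:input port="stylesheet">
          <p:pipe step="split" port="matched"/>
        </p:input>
        <p:input port="parameters">
          <p:empty/>
        </p:input>
      </p:xslt>
      <ex:run-steps>
        <p:input port="stylesheets">
          <p:pipe step="split" port="not-matched"/>
        </p:input>
      </ex:run-steps>
    </p:otherwise>
  </p:choose>
</p:declare-step>
```

The recursive call reads its source from the p:xslt step immediately before it, which is how one step's output becomes the next step's input. A fuller version would carry each stylesheet's parameters along in the same sequence.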
By default, the entire transform is done in-memory, without any serialisation
until after the last step, but it is, of course, possible to save the outputs of the
intermediate steps for debugging purposes (see section “Debugging”).
Developing
Developing XSLT stylesheets for pipelines is straightforward and quite a bit easier
than developing those monolithic stylesheets we've all been used to. Given the
manifest file above, step1.xsl, for example, looks like this:
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:param name="param1"/>
  <xsl:template match="/">
    <xsl:apply-templates select="node()" mode="STEP-1"/>
  </xsl:template>
  <xsl:template match="*" mode="STEP-1" priority="1">
    <one one="{$param1}">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="node()" mode="STEP-1"/>
    </one>
  </xsl:template>
  <xsl:template match="node()" mode="STEP-1">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="node()" mode="STEP-1"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
There is a higher-priority template that converts every input element to a one
element in the output and adds @one="{$param1}" to each, but the rest is just an
identity transform, copying its input verbatim. And if you only wanted to convert,
say, para elements, you'd obviously not need @priority at all. This is the case with
step2.xsl, which converts only elements named one:
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <xsl:apply-templates select="node()" mode="STEP-2"/>
  </xsl:template>
  <xsl:template match="one" mode="STEP-2">
    <two>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="node()" mode="STEP-2"/>
    </two>
  </xsl:template>
  <xsl:template match="node()" mode="STEP-2">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="node()" mode="STEP-2"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
Steps #3 and #4 are both very similar to #2, as this is merely an example pipeline
demonstrating the basic ideas. It does illustrate the approach, though, most importantly
that individual steps should always be as simple as possible. This helps us achieve
that all-important separation of concerns; if you're tempted to add an unrelated
template to an existing step rather than creating a new one, you're adding needless
complexity to that step.
In other words, we create simple stylesheets that focus on changing one thing, be
it a single element or a group of similar elements. For example, I tend to have a
single step for all inline processing if I only have a few inline elements to
transform. More complex semantics might require multiple steps. For example, when
upconverting a flat structure such as a Microsoft Word docx file, you'll likely want
to use several steps to convert docx lists to a structured XML list: the first would
be to identify the list items themselves, the second to determine list labels, the
third to wrap the items in a list element, and so on. Otherwise, the steps will become
needlessly complex and hard to follow, which is what we're trying to avoid here.
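As an illustration of how small such a step can be, here is a sketch of the third step only, the wrapping. It assumes that earlier steps have already turned list paragraphs into sibling item elements; the item and list element names and the STEP-WRAP-LISTS mode are made up for this example and do not come from any real conversion.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <xsl:template match="/">
    <xsl:apply-templates select="node()" mode="STEP-WRAP-LISTS"/>
  </xsl:template>

  <!-- Any element with at least one (hypothetical) item child:
       wrap each run of adjacent item children in a list element.
       Note that whitespace-only text nodes between items end a run,
       so this assumes they were removed in an earlier step. -->
  <xsl:template match="*[item]" mode="STEP-WRAP-LISTS">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:for-each-group select="node()" group-adjacent="exists(self::item)">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <list>
              <xsl:apply-templates select="current-group()" mode="STEP-WRAP-LISTS"/>
            </list>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()" mode="STEP-WRAP-LISTS"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Everything else is copied as is -->
  <xsl:template match="node()" mode="STEP-WRAP-LISTS">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="node()" mode="STEP-WRAP-LISTS"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because identifying the items and determining their labels happened in earlier steps, this step needs only one grouping template plus the identity transform.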
Having said that, how you write XSLT pipelines is up to you; however, you might want
to consider future maintenance, the level of documentation, the skills of the
developers, etc., when writing. Some XSLTs will be large regardless, simply because
it's what is required.
The example pipeline above is available on GitHub (see XSLT Pipeline Example Repository) for the
inquisitive reader.
Debugging
Writing and running a lot of simple steps in sequence rather than a single
monolithic step in one go is, in my humble opinion, useful enough to be preferred,
but debugging is really when the pipelined approach comes into its own. By letting
the pipeline save the output of each intermediate step to disk, you can easily
troubleshoot your transformation. It becomes much easier to a) find the approximate
location of the problem in the pipeline, usually down to a single step, and b) test
and fix the bug.
In debug mode, the XProc pipeline will produce debug output like this:
ari@toddao:~/Documents/projects/findcourses/poc/tmp/debug/Activate_Learning.xml$ ls -lh
total 106M
-rw-r--r-- 1 ari ari 3,5M mar 5 16:16 0-Activate_Learning.xml
-rw-r--r-- 1 ari ari 3,5M mar 5 16:16 1-XLSX-UTIL_remove-empty.xsl.xml
-rw-r--r-- 1 ari ari 6,4M mar 5 16:16 2-XLSX-UTIL_normalisation.xsl.xml
-rw-r--r-- 1 ari ari 6,4M mar 5 16:16 3-XLSX-UTIL_hyperlinks.xsl.xml
-rw-r--r-- 1 ari ari 4,8M mar 5 16:16 4-XLSX-UTIL_cleanup.xsl.xml
-rw-r--r-- 1 ari ari 4,9M mar 5 16:16 5-XLSX2XML_structure.xsl.xml
-rw-r--r-- 1 ari ari 6,2M mar 5 16:16 6-XLSX2XML_courses.xsl.xml
-rw-r--r-- 1 ari ari 5,6M mar 5 16:16 7-XLSX2XML_dates.xsl.xml
-rw-r--r-- 1 ari ari 5,6M mar 5 16:16 8-XLSX2XML_locations.xsl.xml
-rw-r--r-- 1 ari ari 6,0M mar 5 16:16 9-XLSX2XML_fields.xsl.xml
-rw-r--r-- 1 ari ari 5,7M mar 5 16:16 10-EXC2XI_course.xsl.xml
-rw-r--r-- 1 ari ari 5,7M mar 5 16:16 11-EXC2XI_content-fields.xsl.xml
-rw-r--r-- 1 ari ari 5,7M mar 5 16:16 12-EXC2XI_course-links.xsl.xml
-rw-r--r-- 1 ari ari 5,5M mar 5 16:16 13-EXC2XI_categories.xsl.xml
-rw-r--r-- 1 ari ari 5,5M mar 5 16:16 14-EXC2XI_exc-locations.xsl.xml
-rw-r--r-- 1 ari ari 5,6M mar 5 16:16 15-EXC2XI_exc-events.xsl.xml
-rw-r--r-- 1 ari ari 5,3M mar 5 16:16 16-EXC2XI_exc-duration.xsl.xml
-rw-r--r-- 1 ari ari 5,3M mar 5 16:16 17-EXC2XI_exc-email.xsl.xml
-rw-r--r-- 1 ari ari 4,4M mar 5 16:16 18-EXC2XI_xi-dedupe.xsl.xml
-rw-r--r-- 1 ari ari 4,4M mar 5 16:16 19-EXC2XI_xi-cleanup.xsl.xml
This is from a live customer project and so a bit more complex. Each file listed
is the output from the transformation performed by the XSLT stylesheet the
file is named after. The debug files are all prefixed by ordinal
numbers so they can be sorted in the running order. Debugging a step is as easy as
opening the output of the immediately preceding step in an XML IDE and applying the
offending XSLT on it. An XSLT that focusses on one task becomes very easy to
debug.
Useful Tips
Some of the suggestions and tips here will be obvious to many of you. My
apologies, if that is the case.
Pipeline Organisation
A useful feature of the XSLT manifest file is the ability to group related
steps (this is an outline of a manifest for a docx to XML conversion):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="xproc-tools/schema/manifest.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<manifest
    xmlns="http://www.corbas.co.uk/ns/transforms/data"
    xml:id="migration.rtf.stair"
    description="migration.rtf.stair"
    xml:base="."
    version="1.0">
  <group
      xml:id="word.conversion"
      description="word.conversion"
      xml:base="../xslt/common/"
      enabled="true">
    <item href="word-to-xhtml5-elements.xsl" description="Flat Word to XHTML conversion"/>
    <item href="add-eqn-pi.xsl" description="inserts a PI placeholder for equations"/>
    ...
  </group>
  <group
      xml:id="word.conversion.cleanup"
      description="word.conversion"
      xml:base="../xslt/common/"
      enabled="true">
    ...
  </group>
  <!-- Up till now we've done common word conversion components -->
  <!-- Preprocess for Kepler (HTML to HTML) -->
  <group
      xml:id="preprocess.word.conversion.stair"
      description="preprocess.word.conversion.stair"
      xml:base="../xslt/stair-halsbury/"
      enabled="true">
    ...
  </group>
  <!-- Basic construction of Kepler block structures and inline markup -->
  <group
      xml:id="structure.inline.word.conversion.stair"
      description="structure.inline.word.conversion.stair"
      xml:base="../xslt/stair-halsbury/"
      enabled="true">
    ...
  </group>
  <!-- Xref and cite processing -->
  <group
      xml:id="xrefs.cites.word.conversion.stair"
      description="xrefs.cites.word.conversion.stair"
      xml:base="../xslt/stair-halsbury/"
      enabled="true">
    ...
  </group>
  <!-- Other Kepler fixes and cleanup -->
  <group
      xml:id="word.conversion.stair"
      description="word.conversion.stair"
      xml:base="../xslt/stair-halsbury/"
      enabled="true">
    ...
    <item href="html2kepler_cleanup.xsl"/>
    <item href="html2kepler_namespacecleanup.xsl"/>
  </group>
</manifest>
Notice that the group elements contain steps manipulating the same types of
information, for example, cites and cross-references in the above example. Each group
can be given its own @xml:base value, allowing better organisation of the steps in
different directories on a file system.
Attribute Processing
A basic rule of pipelined XSLT is to copy everything that you aren't
touching in a step. If you need to change one attribute but leave
the rest, this is a useful approach:
<xsl:copy-of select="@* except @style"/>
<xsl:attribute name="formatting" select="@style"/>
If you're converting something like Microsoft Word docx, there
are probably lots of attributes that you can discard, so a variant of the above
is to parameterise anything following except and use a generic step
with those parameter values as input.
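A minimal sketch of such a generic step follows. It assumes the attribute names arrive as a single space-separated string in a parameter; the discard parameter name and the STEP-DISCARD mode are invented for this example, and names are compared as written, including any prefix.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <!-- Hypothetical parameter: space-separated names of attributes to drop -->
  <xsl:param name="discard" select="''"/>

  <!-- The names as a sequence; empty if the parameter is empty -->
  <xsl:variable name="discard-names"
                select="tokenize(normalize-space($discard), ' ')"/>

  <xsl:template match="/">
    <xsl:apply-templates select="node()" mode="STEP-DISCARD"/>
  </xsl:template>

  <!-- Elements: copy every attribute whose name is not on the discard list -->
  <xsl:template match="*" mode="STEP-DISCARD" priority="1">
    <xsl:copy>
      <xsl:copy-of select="@*[not(name() = $discard-names)]"/>
      <xsl:apply-templates select="node()" mode="STEP-DISCARD"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text, comments and processing instructions are copied as is -->
  <xsl:template match="node()" mode="STEP-DISCARD">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
```

In the manifest, the step would then be an item with a meta child along the lines of name="discard" value="style frame", so the same stylesheet can be reused with a different list wherever it is needed.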
When writing a complex pipeline of several dozens of steps, I've found it's
really useful to leave breadcrumbs behind until the very last steps, attributes
that tell me where something came from. For example, the original element's name
is sometimes useful, especially if you map several source elements to a single
target:
<xsl:template match="para" mode="STEP-2">
  <p role="{name(.)}">
    ...
  </p>
</xsl:template>
Here, role="{name(.)}" inserts the source element name in @role.
And sometimes it's helpful to leave behind entire attribute name/value pairs.
Here, I'm copying two attributes to a combined @src-styling attribute:
<xsl:attribute name="src-styling">
  <xsl:for-each select="(@style,@frame)">
    <xsl:value-of select="concat('@',name(.),'=',.)"/>
    <xsl:value-of select="if (position()!=last()) then (' ') else ()"/>
  </xsl:for-each>
</xsl:attribute>
For a source like this:
<para style="1" frame="dash">Test</para>
We'd add an attribute like this:
<p src-styling="@style=1 @frame=dash">...</p>
This is really useful when you're debugging two dozen steps and a few days later,
when you no longer remember the specifics of the transform but need to remind
yourself of some processing attributes in the source you did away with.
Testing
What about testing, you say? Well, writing tests is still about writing XSpec tests
for each step, which is fine, but if you need to write tests that check actual XML
input and output, it won't be possible by merely checking the pipeline inputs and
outputs in a single XSpec. XSpec acts on a single XSLT applied on a single XML file,
not a sequence of them. We must therefore instantiate the XSpec tests for any step
that requires using the actual step inputs and outputs.
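To make this concrete, an instantiated test for a single step might look something like the sketch below. It is hand-written for illustration and is not output from the tools described in this paper: the relative paths are assumptions, the context file stands for the debug output of the step that runs immediately before the one under test, and the date element name is hypothetical.

```xml
<x:description
    xmlns:x="http://www.jenitennison.com/xslt/xspec"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    stylesheet="../../xslt/common/XLSX2XML_dates.xsl">

  <x:scenario label="Date conversion applied to the output of the preceding step">

    <!-- Input: the debug output saved by the preceding step (assumed path) -->
    <x:context href="../../tmp/debug/6-XLSX2XML_courses.xsl.xml"/>

    <!-- Every (hypothetical) date element in the result must hold an xs:dateTime -->
    <x:expect
        label="all date elements contain xs:dateTime values"
        test="exists($x:result//date) and
              (every $d in $x:result//date satisfies $d castable as xs:dateTime)"/>

  </x:scenario>
</x:description>
```

The point is that the context document is not the pipeline's original input but the output of the previous step, which is why a test has to be instantiated per step.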
I've written a set of XSLT and XProc tools that use the debug output and generic step
XSpec tests to generate instance XSpecs. The XProc orchestrates this by using an XSpec
manifest file, much like the XSLT pipeline manifest, running each XSpec in sequence.
Here's an example XSpec manifest from a live customer project:
<tests
    xmlns="http://www.sgmlguru.org/ns/xproc/steps"
    manifest="../../pipelines/poc/poc-xlsx2xml-manifest.xml"
    xml:base="/home/ari/Documents/repos/xlsx2xml/">
  <group description="XLSX normalisation steps">
    <test xslt="xslt/common/XLSX-UTIL_normalisation.xsl" xspec="xspec/common/XLSX-UTIL_normalisation.xspec" description="Normalise shared strings."/>
    <test xslt="xslt/common/XLSX-UTIL_hyperlinks.xsl" xspec="xspec/common/XLSX-UTIL_hyperlinks.xspec" description="Normalise referenced hyperlinks."/>
  </group>
  <group description="Conversion steps">
    <test xslt="xslt/common/XLSX2XML_structure.xsl" xspec="xspec/common/XLSX2XML_structure.xspec" description="Convert main structures."/>
    <test xslt="xslt/common/XLSX2XML_dates.xsl" xspec="xspec/common/XLSX2XML_dates.xspec" description="Convert ECMA 376 dates to xs:dateTime."/>
    <test xslt="xslt/common/XLSX2XML_locations.xsl" xspec="xspec/common/XLSX2XML_locations.xspec" description="Convert location info to exc, group it in locations wrapper."/>
  </group>
</tests>
Here, each test matches a pipeline manifest item, and this
manifest actually tests the pipeline I'm illustrating debug output with, above.
Note that there is nothing to stop you from adding more than one XSpec test to a
pipeline step. Quite the contrary, in fact, as splitting tests into multiple files
will often make for more readable code.
Documenting
An XSLT pipeline, while not self-documenting, should be relatively easy to understand
simply by looking at the manifest. Every element in the manifest includes a
@description attribute that should contain at least a basic description
of what a step does.
From a documentation perspective, the test manifest is possibly even better. It also
contains descriptions, of course, but it is easy to write an XSLT that extracts
descriptions and XSpec tests to document a step.
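As a rough sketch of the idea, the stylesheet below turns a pipeline manifest into a plain-text outline of its steps. It assumes that items sit directly inside groups, as in the manifests shown in this paper, and it does nothing with the test manifest; the layout of the output is arbitrary.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:m="http://www.corbas.co.uk/ns/transforms/data"
    exclude-result-prefixes="m"
    version="2.0">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:apply-templates select="m:manifest/m:group"/>
  </xsl:template>

  <!-- One heading line per group, followed by its steps -->
  <xsl:template match="m:group">
    <xsl:value-of select="@description"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates select="m:item"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- One line per step, numbered in running order across all groups -->
  <xsl:template match="m:item">
    <xsl:text>  </xsl:text>
    <xsl:number level="any" count="m:item" format="1. "/>
    <xsl:value-of select="@href"/>
    <xsl:text>: </xsl:text>
    <xsl:value-of select="(@description, '(no description)')[1]"/>
    <xsl:text>&#10;</xsl:text>
    <!-- List any parameters declared for the step -->
    <xsl:for-each select="m:meta">
      <xsl:value-of
          select="concat('       parameter ', @name, ' = ', @value, '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Steps that lack a @description show up as "(no description)", which doubles as a reminder to go back and add one.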
In my ever-so-humble opinion, one of the real advantages of this pipelined approach
is to be able to write short steps, meaning that each step will be easier to understand.
Implementations
My preferred way to write XSLT pipelines, again, is to use XProc Tools (see XProc Tools) as the
basis for my pipelines. XProc Tools is open source (LGPL 3.0 license) and freely
available on GitHub, but it is just that, a set of tools. There is the XProc step
that runs XSLT manifests as discussed above, and there's a step for reading directories
and subdirectories recursively, plus some other useful steps. The tools remain an
excellent showcase of what XProc can do.
They don't run out of the box, however. They require wrapper scripts that run the
pipeline transformation step in context, validate the output, unzip docx archives,
and so on. To this end, I've written additional XProc scripts that do all this, from
converting entire directories in batch to validating the output with Schematron and
applying XSpec tests. These, just like XProc Tools, are open source (LGPL 3.0) and
available on GitHub (see XProc Batch Wrapper Scripts).
Note
I've also written an example XSLT pipeline that shows how to use the above. This,
again, is open source and available on GitHub (see XSLT Pipeline Example Repository).
XProc pipelines can be run from oXygen XML Editor or from the command line.
Thoughts on an XProc 3.0 Approach
XProc Tools, just as my XProc Batch scripts, are written in XProc 1.0 and require
XML Calabash 1.x to run. XProc 3.0 is in last call, however, and the XProc 3.0
processor MorganaXProc-III (see MorganaXProc-III XProc 3.0 Processor) has recently been made available as a public beta, so
I am currently writing XProc 3.0 versions of everything. XProc 3.0 (see XProc 3.0: A Pipeline Language) is a
significant improvement over XProc 1.0, both in functionality and in ease of use,
and I hope to have everything up and running by Balisage 2020.
A @version='3.0' alternative is, of course, the most basic of
rewrites, the idea being to do as little as possible and merely make sure that the
syntax is right. But there's just so much that has been improved and updated. For
example:
- XProc 1.0 couldn't do recursive listing of directories so Nic Gibson, like
  so many others, had a library step for the purpose. XProc 3.0, of course,
  does this out of the box, so there's no reason for that library.
- p:store, the step that, well, stores whatever is on its input
  port, is now also an identity step, which is just such a lifesaver. No need
  for various pipes to redirect stuff, so a lot of the code in the 1.0
  implementations just disappears.
- There's a @use-when attribute that you can attach on pretty
  much anything. An obvious use is to do something like
  p:store[@use-when=$debug] when you want to output debug
  info. This essentially gets rid of another library step.
- XProc 3.0 makes it easy to add message output to any step of your choice
  through a @message attribute, allowing us to produce verbose
  output of anything we like in the pipeline, again vastly simplifying the
  solution in my 1.0 pipelines, essentially a Calabash extension step.
- Variables in 3.0 are much easier to handle than the 1.0 approach where
  they had to appear in the step prolog.
- Etc.
Add to that all the syntactic sugar and it does not make sense to simply reproduce
the 1.0 pipelines in 3.0. For example, running an XSLT stylesheet and storing the
result in 3.0 is this easy:
<p:xslt>
  <p:with-input port="stylesheet" href="..."/>
</p:xslt>
<p:store use-when="$debug" href="..."/>
...
We could easily produce this by applying an XSLT on the pipeline manifest file —
it looks a lot like the manifest to begin with.
This was doable but not very practical in XProc 1.0. The in-memory sequence of
XSLTs was a better fit, and while that would work here, too, this would probably
consume less memory.
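A sketch of what such a generating stylesheet might look like follows. It is untested against a real XProc 3.0 processor and makes several assumptions: the manifest is loaded from a file so that base-uri() can resolve @xml:base, step parameters are ignored, the debug option is declared static so that use-when can refer to it, and the debug file naming is invented.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:m="http://www.corbas.co.uk/ns/transforms/data"
    xmlns:p="http://www.w3.org/ns/xproc"
    exclude-result-prefixes="m"
    version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- The manifest becomes one XProc 3.0 step declaration -->
  <xsl:template match="/m:manifest">
    <p:declare-step version="3.0">
      <p:input port="source"/>
      <p:output port="result"/>
      <!-- Static option, so that use-when="$debug" can be evaluated -->
      <p:option name="debug" static="true" select="false()"/>
      <!-- All items in document order, whichever group they are in -->
      <xsl:apply-templates select=".//m:item"/>
    </p:declare-step>
  </xsl:template>

  <!-- Each item becomes a p:xslt followed by an optional p:store -->
  <xsl:template match="m:item">
    <p:xslt>
      <!-- Resolve @href against the nearest @xml:base in the manifest -->
      <p:with-input port="stylesheet" href="{resolve-uri(@href, base-uri(.))}"/>
    </p:xslt>
    <!-- Hypothetical debug file name: running number plus stylesheet file name -->
    <p:store use-when="$debug"
             href="{concat('debug/', position(), '-', tokenize(@href, '/')[last()], '.xml')}"/>
  </xsl:template>
</xsl:stylesheet>
```

Since each generated p:xslt reads from the step before it, the manifest order is also the running order, with no recursion or in-memory sequence of stylesheets involved.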
XQuery?
XProc 1.0 enjoys limited support, unfortunately, and so recently, when I used the
pipelined XSLT approach for live conversions of input Excel spreadsheets to XML for
a client and needed to run everything in the eXist-db XML
Database, I discovered that the XProc support was nowhere near
sufficient, so I replaced the XProc pipeline runner with an XQuery module and made
that open source, too (see XProc Batch Wrapper Scripts). The XQuery module works in almost the exact same
way as the XProc Tools do (see section “Under the Hood”), but requires eXist-db to run.
Apart from having to run everything from within eXist-db, the approach remains the
same; the one thing I need is some XQuery that includes this to run the pipeline:
pipelines:transform-collection($sources,$manifest-uri,$debug,$out)
I was able to upload my pipeline manifest and steps to eXist-db unchanged. As I
don't yet have XQuery functions to run my XSpec tests in eXist-db, I could instead
develop the pipeline on the file system, using XProc, and upload everything to
eXist-db for the live conversion.
A More Generic Approach?
A reviewer of the first version of this paper suggested a more generic approach.
Forget XProc, forget XQuery; instead, base the pipelined XSLT on the concept of the
manifest and use whatever implementation language fits, the idea, as I
understand it, being to include all logic in the manifest.
Yes, the idea did (sort of) occur to me when writing the XQuery implementation.
The problem is that if I added all the logic to the manifest, I'd effectively be
creating yet another pipeline language and I don't think I want to go that far. My
motivation here is to simplify my day-to-day XSLT. My focus for the last five years
has been very much on converting a large number of documents
from one format to another with as much simplicity, transparency, and debugging
capabilities as possible, and so I wasn't after inventing something new as much as
I was after streamlining and adapting something that already existed to my
needs.
Thus, while the idea is worth exploring, it's not on my short-term list.
End Notes
Why?
Still need convincing? Consider these points:
- It's easier to write. You do one task per step.
- It's easier to debug. See above.
- It's easier to take over from someone else.
- It's safer, as you're less likely to remove elements
  from the source without noticing. Everything but the one thing you change is
  an identity transform!
- And it's fun!
Why Not?
Are there reasons not to do pipelined transforms?
- If you're supplying the transformation live, in a time-critical context,
  the XProc 1.0 approach might not be the way to go, especially with a large
  pipeline. Every step is read into memory, which means the computer will
  consume lots of it. There's also the fact that the JVM running the XProc
  engine has a start-up time, further adding to the total.
- Not every environment will support XProc, and not everyone will want to transform
  their stuff using eXist-db. However, I believe both of these can be addressed in
  either XProc 3.0 or an XQuery implementation that doesn't depend on eXist-db.
- The XProc (or XQuery) pipeline framework could easily become an extra tool
  that has to be managed, especially if there is a need to add features to the
  manifest.
- A reviewer suggested that the steps might not necessarily be independent.
  This is true, of course (I've had countless problems arising from the wrong
  step order and such), but the steps are not meant to be
  independent. They are meant to be run in order, which implies
  the opposite, and very frequently, one will have to use temporary markup as
  step output to preserve or make explicit some aspect of the step transform.
  The idea is not independence for each step, the idea is an isolation of
  concerns, doing one thing at a time.
  It is my experience that most problems arise not from step interdependence
  as such but from expanding a step to do too much. I'm always tempted to add
  some tiny little tweak here or an additional template there in an existing
  step rather than adding a new one. It's about convenience and laziness.
Also, while not a reason to avoid pipelined XSLT, turning on debugging will
consume a lot of disk space, and so you might not want to do it if you're running
conversions of many large documents in batch. I've run out of space on a half-empty
500 GB disk a couple of times.
Note
Having been playing with XProc 3.0 and generally thinking about debugging, I'm
now convinced that I need to add a feature to handle debugging levels in the
manifest. If you have an 80-step pipeline, having debug output from each step is
way over the top. Why not use the manifest items to decide how important a step
is, how prone it is to fail? Some will be so basic that there is no need to
produce debug output from each step. Or perhaps a combination, or maybe limiting
debug to group outputs?
Some Previous Work (That Inspired Me)
Me, I was heavily influenced by Nic Gibson's XProc Tools (see XProc Tools),
which allow XSLT developers to define their XSLT pipelines in manifest documents
listing the individual stylesheets and run the pipelines using XProc. While working
at LexisNexis UK, I used those scripts for many conversion projects (see Up and Sideways),
frequently wrapping the XProc inside Ant scripts. Later, I also introduced
similar solutions for other clients.
Others have approached the problem by introducing XSLT variables in sequence
within the same stylesheet, something like this:
<xsl:template match="something">
  <xsl:variable name="step1">
    <xsl:element name="some-element">
      <xsl:for-each select="$something-to-start-with/...">
        ...
      </xsl:for-each>
    </xsl:element>
  </xsl:variable>
  ...
  <xsl:variable name="step5">
    <xsl:element name="some-element">
      <xsl:for-each select="$step4/...">
        ...
      </xsl:for-each>
    </xsl:element>
  </xsl:variable>
  <xsl:result-document href="{concat($filepath, 'aaa_step5.xml')}">
    <xsl:copy-of select="$step5"/>
  </xsl:result-document>
  <xsl:variable name="step5aa">
    <xsl:element name="some-element">
      <xsl:for-each select="$step5/...">
        ...
      </xsl:for-each>
    </xsl:element>
  </xsl:variable>
  <xsl:result-document href="{concat($filepath, 'aaa_step5aa.xml')}">
    <xsl:copy-of select="$step5aa"/>
  </xsl:result-document>
  ...
</xsl:template>
Note the xsl:result-document elements inserted between variables.
These are for providing debug output from the preceding variables. The approach is
certainly useful, but will not work well for anything beyond small documents or
document fragments, and not for converting anything of any real size, least of all
in batch.
I should also mention Liam Quin, who demonstrated an XSLT 3.0 approach using
transform() at XML Prague (see XProc in XSLT: Why and Why Not).
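For readers who have not met transform(), the following is a minimal sketch of how a sequence of stylesheets can be chained with it in XSLT 3.0. It is not a reproduction of the code presented at XML Prague. It assumes a processor that supports fn:transform and higher-order functions, the steps parameter is invented for the example, and stylesheet parameters and debug output are left out.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="3.0">

  <!-- Hypothetical parameter: stylesheet URIs in running order,
       relative to the location of this stylesheet -->
  <xsl:param name="steps" as="xs:string*" select="('step1.xsl', 'step2.xsl')"/>

  <xsl:template match="/">
    <!-- Start with the input document and apply each stylesheet in turn;
         the principal result of one transform is the source of the next -->
    <xsl:sequence select="
      fold-left($steps, /, function($doc, $step) {
        transform(map {
          'stylesheet-location': resolve-uri($step, static-base-uri()),
          'source-node': $doc
        })?output
      })"/>
  </xsl:template>
</xsl:stylesheet>
```

The fold plays the role that recursion plays in the XProc 1.0 implementation: each iteration consumes one stylesheet and hands its output on to the next.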
Note
A reviewer also mentioned Orbeon XPL and Netkernel. I haven't considered
either one and cannot speak to their applicability in this context. I did not
intend this paper to be a survey of available pipeline languages, however, as
there are quite a few others that are worth mentioning. Ant, for one, is
tremendously useful and could certainly be used to implement a pipelined XSLT
toolkit based on XML manifest files.
Future Work
I have a couple of things on my to-do list:
- Move to XProc 3.0 for everything. There's a ton of new features and
  syntactic shortcuts, and I fully expect the 3.0 implementation to be tiny in
  comparison.
- This should include proper XProc 3.0 tests.
- The eXist-db/XQuery implementation doesn't support XSpec tests at all yet.
  Again, I'm planning to add support for XSpec in the future, hopefully in
  time for Balisage 2020.
- Generated documentation (as suggested in section “Documenting”).
- I've been thinking about adding a few Ant wrapper scripts that run pipelines and
  do some things that aren't readily doable in XProc on the side, if needed.
References
[XProc Tools] XProc Tools by Nic Gibson [online, fetched on 1 April 2020]. https://github.com/Corbas/xproc-tools
[XProc Batch Wrapper Scripts] XProc Batch by Ari Nordström [online, fetched on 1 April 2020]. https://github.com/sgmlguru/xproc-batch
[XSLT Pipeline Example Repository] XSLT Pipelines by Ari Nordström. https://github.com/sgmlguru/xslt-pipelines
[XProc 3.0: A Pipeline Language] XProc 3.0: A Pipeline Language [online, fetched on 1 April 2020]. http://spec.xproc.org/
[MorganaXProc-III XProc 3.0 Processor] MorganaXProc-III by Achim Berndzen, an XProc 3.0 processor. https://www.xml-project.com/morganaxproc-iii/
[Up and Sideways] Ari Nordström, Up and Sideways: RTF to XML. Presented at Symposium on Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions, Washington, DC, July 31, 2017. In Proceedings of the Symposium on Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Nordstrom01.
[XProc in XSLT: Why and Why Not] Liam Quin, XProc in XSLT: Why and Why Not. Presented at XML Prague 2019, February 7-9, 2019. In XML Prague 2019 Conference Proceedings, https://archive.xmlprague.cz/2019/files/xmlprague-2019-proceedings.pdf
[XPants] Phil Hodder, XPants - XML Practical Ant Scripts. https://github.com/encodis/xpants
[Lightweight XML DevOps Using Apache Ant] Phil Hodder, Lightweight XML DevOps Using Apache Ant. Presented at Markup UK 2018, June 9-10, 2018. In Markup UK 2018 Proceedings, https://markupuk.org/2018/webhelp/index.html#ar11.html