How to cite this paper
Walsh, Norman, and Achim Berndzen. “XProc 3.0.” Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). https://doi.org/10.4242/BalisageVol23.Walsh02.
Balisage: The Markup Conference 2019
July 30 - August 2, 2019
Balisage Paper: XProc 3.0
Norman Walsh
Norman Walsh is a Principal Engineer at MarkLogic Corporation where he
helps to develop APIs and tools for advanced content applications. He
was the chair of the XML Processing Model Working Group at the W3C and
is a member of the XProc 3.0 editorial team. Norm has spent more than
twenty years developing commercial and open source software including
XML Calabash, his XProc processor.
Achim Berndzen
Achim earned an M.A. in philosophy at Aachen University and
has more than 20 years of teaching experience in communications.
In 2014 he founded <xml-project />. He is the developer of
MorganaXProc, a fully compliant XProc
processor with an emphasis on configurability and pluggability. He
is a member of the XProc 3.0 editors group and currently develops
MorganaXProc-III.
Copyright ©2019 by the authors.
Abstract
XProc 3.0 is an XML pipeline language for constructing
markup centric workflows. With a rich vocabulary of steps and
modern control structures, it allows the author to easily build
complex pipelines.
Table of Contents
- Introduction
- What is a pipeline language?
- What is XProc?
-
- What about XProc 1.0?
- What about XProc 2.0?
- Pipeline concepts
-
- Steps and ports
- Step options
- Documents
-
- Documents from URIs
- Documents from another step
- Inline documents
- “Empty” documents
- XPath expressions
-
- Variables
- Value templates
- Long form options
- Atomic and compound steps
- Pipelines are graphs
- Hands on: building some pipelines
-
- The anatomy of a step
- Our first XProc pipeline
- Changing the pipeline
- Compound steps
-
- Writing pipeline steps
- Loops with for-each
- Conditionals
-
- p:choose
- p:if
- Exception handling
- Viewports
- Groups
- Libraries
- Loose ends
-
- Document properties
- Irreducible complexity
- Why not just use XSLT?
Introduction
XProc is a language for defining pipelines. Pipelines are
ubiquitous in our lives. They operate all around us, all the time, at
every level. There are multienzyme complexes in your cells that
function as pipelines strictly controlling metabolic processes [Pröschel et al., 2015]. Modern CPU
architectures, like the one you probably have in your phone, run
pipelines of instructions: literal pipelines implemented in silicon.
The global delivery supply chain network that powers modern industry
is a massively complicated, massively pipelined process. And we’ve
said nothing of the literal pipelines of oil and water and gas that we
rely upon daily. These pipelines are analogies, some stronger than others,
for what XProc does.
Anecdotally, one of the strengths of Unix (specifically of the Unix
command line interface) is that it offers a broad collection of
“small, sharp tools” that can easily be combined. Small in the sense
that they accomplish a single, focussed task. Sharp in the sense that they do
that task efficiently, with a minimum of fuss.
Learning to think about problems in terms of small, sharp tools
is incredibly valuable. For the benefit of readers who
aren’t familiar with the Unix command line, let’s move our analogy out
into the real world. A pair of scissors is a prototypical small,
sharp tool; the antithesis of a
Rube Goldberg machine. Other
examples that we might classify as small, sharp tools are string,
tape, and paper clips. Each one does a single, particular
thing (tying, sticking, clipping) and does it well. They’re
also adaptable. String can be used to tie many things; scissors can
cut many things; tape and paper clips likewise.
When we compose tools together, we’re forming pipelines.
What is a pipeline language?
A pipeline language provides a set of tools and a declarative
language for describing how those tools should be composed.
In this context, we mean software tools. In particular, as markup
users, we mean tools that parse, validate, transform, perform XInclude, rename
elements, add attributes, etc.
You write ad hoc pipelines with these tools every day: you write shell
scripts or Windows batch files or Makefiles or Ant build scripts or
Gradle build scripts, or any one of a dozen other possibilities (in a
large system, more likely several of them).
Looking slightly farther afield from the core markup language
technologies, we also want to get data from APIs, extract
information from .docx
files, update bug tracking systems, construct
EPUB files and publish PDF documents.
Integrating that broader set of tools into ad hoc pipelines only
increases the complexity of those scripts and makes it harder to
understand what they do.
What is XProc?
XProc is an (extensible) set of small, sharp tools for creating and
transforming markup and other documents, and a declarative XML
vocabulary for describing pipelines composed in this way.
XProc 3.0 is actively being developed by a community group. There are:
-
Four principal editors: Achim Berndzen, Gerrit Imsieke, Erik Siegel,
and Norman Walsh.
-
Several specifications: a core language spec currently in “last
call”, a standard step library expected to go into “last call” this
year, and several specifications for optional steps and additional
vocabularies.
-
Two independent implementations tracking the specifications,
MorganaXProc by Achim Berndzen and XML Calabash by Norman Walsh.
-
A
public organization
at GitHub where you are encouraged to comment
on the specifications.
-
A public
xproc-dev mailing list
that you are encouraged to join.
-
Public workshops held several times a year, often co-located with
other markup events.
-
In addition, Erik Siegel has written a complete
programming guide to XProc which will be published by XML
Press as soon as the editors stop changing
things!
The narrative structure of the rest of this paper is designed to
give you a complete overview of XProc. It contains several examples
and describes all of the major structures in XProc. It doesn’t attempt
to cover every nuance of programming with pipelines. Please feel free
to ask questions in one of the fora above, or talk to any of the editors.
What about XProc 1.0?
XProc 1.0 became a W3C Recommendation in 2010. It has been used very
successfully by many users, but has not seen anything that could
reasonably be described as widespread adoption. There are several
reasons for this. Although most pipelines in the real world need to
interact with at least some non-XML data, the XProc 1.0 language is
extremely XML-centric. The language is also verbose with few
syntactic shortcuts and a number of complex features that hinder
casual adoption.
In addition, XML Calabash, the implementation introduced to most users
interested in learning XProc 1.0, provides very little assistance to
inexperienced users and quite terse error messages.
If you have never used XProc 1.0: good. You may begin your journey
into XML pipelines with XProc 3.0 and never have to wrestle with the
inconveniences of XProc 1.0. If you have used XProc 1.0 successfully,
the community group believes that you will be delighted by the
improvements in XProc 3.0. If you have attempted to use XProc 1.0 and
been stymied by it, please attempt to set aside the prejudices you may feel towards
XProc
and journey into a new world of XML pipelines.
What about XProc 2.0?
There isn’t one. This is something of a running joke.
As the working groups involved in the
development of XPath, XSLT, XQuery, and the family of related
specifications worked their way towards a second major release, they
had a problem.
When that work started, XQuery 1.0 had been published along with XSLT
2.0. The ongoing work was very much a product of cooperation between
the working groups. The next release of XSLT couldn’t be 2.0,
obviously, but having XQuery 2.0 and XSLT 3.0 seemed only likely to
introduce confusion for users. The decision was made that XQuery would
skip 2.0 entirely and the group would meet at 3.0.
In the XProc realm, there had been some work on a 2.0 version; it is
possible to find drafts labeled 2.0 on the internet, even though that
work was abandoned at the W3C before advancing very far.
It seemed that the simplest thing to do, given that we were
building on the 3.x versions of the underlying XML specifications, was
to fall in line and jump directly to 3.0 as well.
Pipeline concepts
We believe it will be easier to understand the examples which
follow if we invest a little time laying some conceptual groundwork.
Steps and ports
The central concept in XProc pipelines is the step. Steps are the tools
from which pipelines are composed. You can think of a step as a kind
of box. It has holes in the top where you can pour in source
documents, it has holes in the bottom out of which result documents
will flow, and it may have some switches on the side that you can
toggle to control the behavior of the box. In XProc, the holes are
called ports and the switches are called options.
The simplest possible step is the p:identity
step.
It has one source port, one output port, and no options. You pour
documents in the top, they come out the bottom. It is every bit as
simple as it sounds, although not quite as pointless!
A slightly more complex step is the p:xinclude
step, which performs XInclude processing. Like the identity step,
it has a single input port and a single output port. If you put an XML
document in the source port, it will be transformed according to the
rules of the XInclude specification and the transformed document will
flow out of the result port.
The XInclude specification mandates two user options: one to control
how xml:base
attributes are propagated and another to control
xml:lang
attributes (both of these are called “fixups” by
the specification). These options are exposed directly on the “side” of the
XInclude box.
Here’s an example of using the XInclude step in an XProc pipeline:
<p:xinclude name="expanded-docs"/>
As this example shows, steps can also have names. We’ll come back
to those names later in the section “Documents from another step”. The names
aren’t required. You can name every step if you want, or only the steps
where you need the names to connect to them.
Here’s an example of using p:xinclude
with xml:base
fixup explicitly disabled:
<p:xinclude fixup-xml-base="false"/>
We’ll look at options more closely in the section “Step options”.
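As a small illustration of setting both fixup options at once, the fragment below is a sketch rather than a recommendation. It assumes that the second option is named fixup-xml-lang, by analogy with fixup-xml-base, and it is meant to be placed inside a pipeline, not run on its own.

```xml
<!-- Fragment: propagate xml:base attributes but leave xml:lang alone.
     The option name fixup-xml-lang is assumed by analogy with fixup-xml-base. -->
<p:xinclude name="expanded-docs"
            fixup-xml-base="true"
            fixup-xml-lang="false"/>
```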
An often used step that is a bit more interesting is the XSLT step.
Before we present the “XSLT box” we need to cover ports in a little more
detail.
-
Ports are always named and the names are always unique
on any given step. In the case of XSLT, we’ll have a port called
“source”, for providing the step with documents we want to transform, and
a port called “stylesheet”, where we give the step our XSLT
stylesheet.
-
Ports can be defined so that they accept (or produce) either a
single document or a sequence of documents (zero or more). It’s an
error to pour two documents into a port that only accepts a single
document. It’s an error if a step defined to produce a single
document on an output port doesn’t produce exactly one document on
that port.
-
Any kind of document can flow through a pipeline: XML documents,
HTML documents, JSON documents, JPG images, PDF files, ZIP files, etc. (We’ll come
on to describing how you get documents into pipelines in a bit.)
Some steps, like the identity step (or the p:count
step that just
counts the documents that flow through it) don’t care about what kinds
of documents they receive. Most steps do care. It only makes sense to
send XML documents to the XInclude step, for example.
Ports can specify what content types they accept (or produce). It is
an error to send any other kind of document through that port.
-
Finally, at most one input port and at most one output port can be
designated as “primary”. That doesn’t really have anything to do
with the semantics of the step; it has to do with how steps are
connected together. You can think of the primary ports
as having little magnets so they snap together automatically when
you put two steps next to each other.
With those concepts in hand, we’re ready to look at the input ports
of the XSLT step. In XProc, they’re defined like this:
<p:input port="source" content-types="any" sequence="true" primary="true"/>
<p:input port="stylesheet" content-types="xml"/>
Those declarations say that the port named “source” accepts a sequence
of any kind of document and is the primary input port for the step.
The “stylesheet” port accepts only a single XML document.
The output ports are defined like this:
<p:output port="result" primary="true" sequence="true" content-types="any"/>
<p:output port="secondary" sequence="true" content-types="any"/>
In other words, the port named “result” is the primary output port and
it can produce a sequence of anything. The “secondary” output port can
also produce a sequence of anything.
If you’re familiar with XSLT, the result port is where the main result
document appears (the one you didn’t identify with
xsl:result-document
or the one with an xsl:result-document
that
doesn’t specify a URI). Any other result documents produced appear on
the “secondary” port.
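To make the two output ports concrete, here is a hedged sketch of a fragment that reads the “secondary” port explicitly. The stylesheet name split.xsl and the step name split-chapters are hypothetical, and the p:pipe element it relies on is the one described in the section “Documents from another step”.

```xml
<!-- Fragment: split.xsl is a hypothetical stylesheet that uses
     xsl:result-document to create one extra document per chapter. -->
<p:xslt name="split-chapters">
  <p:with-input port="stylesheet" href="split.xsl"/>
</p:xslt>

<!-- Count the documents that arrived on the non-primary "secondary" port.
     A non-primary port is never connected automatically, so we name it. -->
<p:count>
  <p:with-input>
    <p:pipe step="split-chapters" port="secondary"/>
  </p:with-input>
</p:count>
```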
This is a good time to point out that steps in XProc do not typically
write to disk. If you’re used to running XSLT from the command line or
from within an editor, your mental model may be that XSLT reads files
from disk, does some transformations, and writes the results back to
disk. This is not the case in XProc. In XProc, everything flows
through the pipeline. There’s a step, p:store
, that will write to
disk, but otherwise, all your documents are ephemeral.
Step options
The p:xslt
step also has a number of options. These
correspond to the processor options “initial mode”, “named template”,
and “output base URI”. Like the options on the XInclude step, the
options are defined by the XSLT specification itself:
<p:option name="initial-mode" as="xs:QName?"/>
<p:option name="template-name" as="xs:QName?"/>
<p:option name="output-base-uri" as="xs:anyURI?"/>
As you can see, options have a name and may define their type.
They may also define a default value or assert that they are required,
though none of these options do either. When your pipeline is running,
values will be computed for these options and passed to the step.
Unlike ports, through which documents flow, options can be any
XPath 3.1 Data Model
[XDM] item.
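For comparison, the following sketch shows what declarations with a default value, a required value, and a non-atomic type might look like. The option names are hypothetical and belong to no standard step; the xs prefix is assumed to be bound to the XML Schema namespace in the enclosing pipeline.

```xml
<!-- Hypothetical option declarations for a user-written step. -->

<!-- A default value: the select expression is used if no value is passed. -->
<p:option name="css" as="xs:string" select="'basic.css'"/>

<!-- A required option: using the step without a value is an error. -->
<p:option name="output-dir" as="xs:anyURI" required="true"/>

<!-- Options are not limited to atomic values; this one holds a map. -->
<p:option name="settings" as="map(xs:string, item()*)" select="map{}"/>
```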
The p:xslt
step has a version option,
so that you can assert, for example, that you need
an XSLT 3.0 processor and there’s no point even trying to run the step
if the XProc implementation can’t provide one.
Finally, there’s an option called parameters
that takes a map. This is how
you pass stylesheet parameters to the step. Here’s a complete syntax
summary for the p:xslt
step:
<p:declare-step type="p:xslt">
<p:input port="source" content-types="any" sequence="true" primary="true"/>
<p:input port="stylesheet" content-types="xml"/>
<p:output port="result" primary="true" sequence="true" content-types="any"/>
<p:output port="secondary" sequence="true" content-types="*/*"/>
<p:option name="initial-mode" as="xs:QName?"/>
<p:option name="template-name" as="xs:QName?"/>
<p:option name="output-base-uri" as="xs:anyURI?"/>
<p:option name="version" as="xs:string?"/>
<p:option name="parameters" as="map(xs:QName,item()*)?"/>
</p:declare-step>
You can use XSLT as many times as you like in your pipeline, with
different inputs and different option values, but every instance of
the p:xslt
step will fit this “signature”.
Imagine that you have a stylesheet, tohtml.xsl
that transforms XML
into HTML. It has a single stylesheet parameter, css
, that allows the
user to specify what CSS stylesheet link should be inserted into the
output. Here’s how you might use that in an XProc pipeline:
<p:xslt parameters="map { 'css': 'basic.css' }">
<p:with-input port="stylesheet" href="tohtml.xsl"/>
</p:xslt>
Option values can also be computed dynamically with expressions
as we’ll see in the
section “XPath expressions”.
In the declaration of a step (the
definition of its signature) the allowed inputs and outputs are
identified with p:input
and p:output
. When a
step is used, the p:with-input
element makes a connection to one of the ports on the step. In the
example above, the pipeline author is connecting the
stylesheet
port to the document tohtml.xsl
.
The source port, the primary port, is being connected automatically in
this example.
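If you prefer not to rely on the automatic connection, both ports can be bound explicitly. The fragment below is a sketch under that assumption; article.xml is a hypothetical source document.

```xml
<!-- Fragment: the same p:xslt step with both input ports bound by hand.
     article.xml is a hypothetical source document. -->
<p:xslt parameters="map { 'css': 'basic.css' }">
  <p:with-input port="source" href="article.xml"/>
  <p:with-input port="stylesheet" href="tohtml.xsl"/>
</p:xslt>
```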
An obvious analogy for connecting up steps is to think of them
as tanks with ports on the top and bottom, the connections between them
as hoses, and the documents like water.
You link all the steps with hoses and then pour water in the
top of your pipeline; magic happens and the results pour out the
bottom.
It’s a good analogy, but don’t hold onto it too tightly. It breaks
down in a couple of ways. First, you can attach any number of “hoses”
to the output of a step. Want to connect the output of the validator
to ten different XSLT steps? No problem. Second, you never have to
think about the “output ends” of the pipes. Each input port identifies
where it gets documents. If you say that the p:xslt
step gets its
input from the result of the validator, you’ve said implicitly that
the output of the validator is connected to the XSLT step. You can’t
say that explicitly. The outputs are all implicitly connected according to
how the inputs are defined.
By the way, if you don’t connect anything to a particular output
port, that’s ok. The processor will automatically stick a bucket under
there for you and take care of it.
Documents
As stated earlier, any kind of document can flow through an XProc
pipeline, but where do documents come from? There are four possible
answers to that question: from a URI, from another step, from “inline”,
or “from nowhere” (a way of saying explicitly that nothing goes to a
particular port).
Documents from URIs
The p:document
element reads from a URI:
<p:document href="mydocument.xml"/>
The URI value can be an expression, in which case it may be useful
to assert what kind of documents are acceptable:
<p:document href="{$userinput}.json" content-type="application/json"/>
We saw the content-type
attribute earlier in the discussion of ports.
Generally, you can specify a list of MIME Media Types there, but you can
also use shortcuts: “xml”, “html”, “text”, or “json”. In fact, the example
above uses application/json
merely as an example; using “json” would
be simpler.
If the (computed) URI is relative, it will be made absolute with
respect to the base URI of the p:document
element on which
it appears.
As you saw in the p:with-input
example in section “Step options”, there is a shortcut for the simple case
where you want to read a single document into a port: put the href
attribute directly on the p:with-input
element. (In that case, a relative URI
will be made absolute with respect to the base URI of the
p:with-input
element.)
Documents from another step
The “magnetic” property of primary ports means that they’ll
automatically snap together for you; in many cases these implicit
connections are all that’s necessary. But they only work for steps
that are next to each other, so you will still sometimes have to add a pipe
to connect two steps together.
The p:pipe
element constructs an explicit connection between two
steps. The pipe has two attributes: step
, which gives the name of the
step you’re connecting to; and port
, which gives the name of the port
you’re reading. There are sensible defaults: for example, if you omit
the port, the primary output port is assumed.
Here’s a pipe that connects back to the first XInclude example:
<p:pipe step="expanded-docs"/>
It would be perfectly fine to add port="result"
to that pipe, but
it’s not necessary.
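Putting implicit and explicit connections side by side may help. In this sketch, book.xml and the step names to-html and how-many are hypothetical. The p:xslt step picks up the XInclude result automatically because it comes next; the p:count step is not adjacent to the XInclude step, so it needs a pipe. Note that the same output is read twice.

```xml
<!-- Fragment: one output port read by two different steps. -->
<p:xinclude name="expanded-docs">
  <p:with-input href="book.xml"/>
</p:xinclude>

<!-- Implicit: the primary "source" port snaps onto the preceding step. -->
<p:xslt name="to-html">
  <p:with-input port="stylesheet" href="tohtml.xsl"/>
</p:xslt>

<!-- Explicit: without this pipe, p:count would read from "to-html". -->
<p:count name="how-many">
  <p:with-input>
    <p:pipe step="expanded-docs"/>
  </p:with-input>
</p:count>
```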
Inline documents
You can just type the documents inline if you want. This is one
common use of the p:identity
step:
<p:identity name="config">
<p:with-input>
<p:inline content-type="application/json">
{
"config": {
"uri": "http://example.com/",
"port": 8080,
"oauth": true
}
}
</p:inline>
</p:with-input>
</p:identity>
Now any step in the pipeline can read from the “config” step to
get the configuration data. The p:inline
element is
required here because the content isn’t XML, so the content type must
be specified. If the inline data were a single XML document, p:inline
could be omitted.
<p:identity name="state-capitols">
<p:with-input>
<states>
<alabama abbrev="AL">Montgomery</alabama>
<alaska abbrev="AK">Juneau</alaska>
<!-- ... -->
<wisconsin abbrev="WI">Madison</wisconsin>
<wyoming abbrev="WY">Cheyenne</wyoming>
</states>
</p:with-input>
</p:identity>
I’ve also elided the port name (port="source"
); this is fine because
the p:identity
step only has one input port (and, technically,
because it’s the primary input port).
This inline data needn’t always be in an identity step; you
can put it directly into the input port on any step. There are
additional attributes on p:inline
that allow you to inline encoded
binary data, if you wish.
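As a sketch of inline data placed directly on a step other than identity, the fragment below feeds a small, made-up document straight into p:add-attribute. The element names and attribute values are illustrative only.

```xml
<!-- Fragment: an inline XML document bound directly to the source port. -->
<p:add-attribute match="/doc"
                 attribute-name="status"
                 attribute-value="draft">
  <p:with-input>
    <doc>
      <title>Example</title>
    </doc>
  </p:with-input>
</p:add-attribute>
```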
“Empty” documents
Sometimes it’s useful to say explicitly that no documents should
appear on a particular port. This is necessary if you want to defeat
the default connection mechanisms that would ordinarily apply.
The p:empty
connection serves this purpose:
<p:count>
<p:with-input>
<p:empty/>
</p:with-input>
</p:count>
Irrespective of the context in which this appears, no documents will be
sent to the count step and it will invariably return 0.
XPath expressions
XProc uses XPath as its expression language. Expressions appear most
commonly in attribute and text value templates and in the expressions
that initialize options and variables.
Variables
It is sometimes useful to calculate a value in an XProc pipeline and
then use that value in subsequent expressions. There are both
practical and pedagogical reasons to do this. A variable has a name,
an optional type, and an expression that initializes it:
<p:variable name="pi" select="355 div 113"/>
Variables are lexically scoped and can appear anywhere in a pipeline.
The set of “in scope” variables can be referenced in
XPath expressions. The variable declaration may identify what document
should be used as the context item.
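A variable that declares a type and takes its context item from another step might look like the following sketch. It assumes a step named expanded-docs, as in the earlier XInclude example, whose result is a hypothetical book document with chapter children.

```xml
<!-- Fragment: the p:pipe child identifies the context item against
     which the select expression is evaluated. -->
<p:variable name="chapter-count" as="xs:integer"
            select="count(/book/chapter)">
  <p:pipe step="expanded-docs" port="result"/>
</p:variable>
```

Subsequent steps in the same scope could then refer to $chapter-count in their own expressions.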
Value templates
When an option is passed to a step, its value can be initialized with
an attribute value template:
<p:xinclude fixup-xml-base="{$dofixup}"/>
Value templates can be used in inline content:
<p:identity name="constants">
<p:with-input>
<constants>
<e>2.71828183</e>
<pi>{$pi}</pi>
</constants>
</p:with-input>
</p:identity>
Unlike text value templates in XSLT, text value templates in XProc
can insert nodes into the document.
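The following sketch illustrates node insertion. It assumes a step named expanded-docs producing a hypothetical book document; the variable holds the title element itself, not just its string value, so the text value template would place a copy of that element inside summary.

```xml
<!-- Fragment: $title is bound to an element node, not a string. -->
<p:variable name="title" select="/book/title">
  <p:pipe step="expanded-docs"/>
</p:variable>

<p:identity name="summary">
  <p:with-input>
    <!-- The template inserts the title element as a child of summary. -->
    <summary>{$title}</summary>
  </p:with-input>
</p:identity>
```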
Atomic and compound steps
All of the steps we’ve looked at so far are “atomic steps”: they
have inputs, outputs, and options, but they have no internal structure.
They are effectively “black boxes”. The p:xslt
step does
XSLT, the p:identity
step copies its input blindly,
the p:xinclude
step performs XInclude processing.
Aside from any options exposed, you have no control over the
behavior of the step.
XProc also has a small vocabulary of “compound steps”
(see the section “Compound steps”). These steps are “white boxes”.
The steps explicitly wrap around an internal “subpipeline” that defines
some of their behavior. Whereas two p:xslt
steps always do the same
thing, two p:for-each
steps can do very different things.
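To give a first impression of a “white box”, here is a hedged sketch of a p:for-each whose subpipeline consists of a single step. The book and chapter element names and the step name expanded-docs are assumptions for illustration; a different subpipeline in the same wrapper would produce entirely different behavior.

```xml
<!-- Fragment: each selected chapter becomes a document in its own right
     and is passed, one at a time, through the subpipeline. -->
<p:for-each name="each-chapter">
  <p:with-input select="/book/chapter">
    <p:pipe step="expanded-docs"/>
  </p:with-input>

  <!-- The subpipeline: mark every chapter as reviewed. -->
  <p:add-attribute match="/chapter"
                   attribute-name="status"
                   attribute-value="reviewed"/>
</p:for-each>
```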
Pipelines are graphs
Steps can be connected together in arbitrary ways. Many steps can read
from the same output port and any given step can combine the outputs
from many different steps into one input port. In this way, a pipeline is
a graph.
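As a sketch of several outputs converging on one input port, the fragment below gathers the results of two hypothetical steps, named first and second, into a single wrapper document. It uses the standard p:wrap-sequence step; the wrapper name results is arbitrary.

```xml
<!-- Fragment: two pipes on one input port form a sequence of documents. -->
<p:wrap-sequence wrapper="results">
  <p:with-input>
    <p:pipe step="first"/>
    <p:pipe step="second"/>
  </p:with-input>
</p:wrap-sequence>
```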
A key constraint is that the graph must be acyclic. A step can
never read its own output, no matter how indirectly. Only M. C. Escher
can make water flow uphill! Once a document has passed through a step,
the only direction it can go is down.
One subtlety: when a variable is defined, it may have a context item
that is the output of a step. If it does, subsequent references to
that variable count as “connections” to that output port when
considering whether or not the pipeline contains any loops.
Hands on: building some pipelines
As we’ve seen, steps are the basic building blocks in XProc 3.0.
A large library of standard steps comes with every conformant
implementation:
-
There are 50+ atomic step types in the standard library. These
atomic steps are the smallest tools in your pipeline, doing things
such as XSLT transformations, validation with Schematron, calling an
HTTP web service, or adding an attribute to element nodes in a
document.
-
The XProc 3.0 specification defines additional, optional step
libraries with about twenty steps. They’re optional in the sense that a
conformant implementation is not required to implement them, though
most probably will. Optional step libraries include steps for file
handling, interacting with the operating system, and producing paged
media, among others.
-
In addition to the large library of atomic steps, XProc 3.0 also
defines five compound steps containing subpipelines. These
subpipelines can themselves be composed of atomic or compound steps.
Compound steps are used for control flow, looping, and catching
exceptions, for example. We’ll look at them more closely in
the section “Compound steps”.
-
Implementations may also ship with additional steps defined either
by the implementor or by some community process. The set
of available atomic steps might even be user-extensible;
implementations might allow users to program their own atomic
steps.
The anatomy of a step
In XProc, documents flow between steps: one or more documents flow
into a step; some work characteristic of that step is performed; and one or
more documents flow out of the step, usually to another step. XProc 3.0
has five document types:
-
An XML document is an instance of a document in the XPath Data
Model (XDM). These do not necessarily have to be well-formed XML
documents; any XDM document instance will do. (XSLT can produce
instances that contain multiple top-level elements, for example, or
that contain only text nodes.)
-
An HTML document is essentially the same as an XML document.
What’s different is that documents with an HTML media type will be
parsed with an HTML parser (rather than an XML parser, so they do not
have to be well-formed XML when they are loaded). If an HTML document
is serialized, by default the HTML serializer will be used.
-
A text document is a
document without any markup. In the XDM, it is represented by a
document containing a single text node.
-
A JSON document is one that contains a map, an array, or atomic
values. These are represented in the XDM as maps, arrays, and atomic
values. Any valid JSON document can be loaded; it is also possible to
create maps and arrays that contain data types not available in JSON
(for example, xs:dateTime
values). These will flow through
the pipeline just fine, and will be converted back to JSON strings at
serialization time (if they’re ever serialized).
-
Finally there are other documents: anything else. This
includes binary images or ZIP documents (in an ePUB), or a PDF rendered
from a DocBook source. Implementations have some latitude in how they
process arbitrary data.
These documents flow through the input and output ports of
steps. Steps can have an arbitrary number of input and output ports
corresponding to their requirements. The p:xslt
step, as
we’ve seen, has two input ports and two output ports. Some steps may
have no input ports at all, only output ports (think of a step that
loads a document from disk), others may have input ports,
but no output ports. (It’s conceivable to have a step with no ports of
any kind, but it’s not obvious what purpose it would serve in the
pipeline.)
We’re going to use two steps in our example pipeline,
p:add-attribute
and p:store
. Here’s
the signature for p:add-attribute
:
<p:declare-step type="p:add-attribute">
<p:input port="source" content-types="xml html"/>
<p:output port="result" content-types="xml html"/>
<p:option name="match" as="xs:string" select="'/*'"/>
<p:option name="attribute-name" required="true" as="xs:QName"/>
<p:option name="attribute-value" required="true" as="xs:string"/>
</p:declare-step>
As you might guess from its name, the
p:add-attribute
step adds attributes to elements in a
document. The document that arrives on the source port is decorated
with attributes and the resulting document flows out of the result
port.
What attributes are added? The attribute-name
and
attribute-value
options define the attribute name and its
value. The match
option contains a “selection
pattern”, a concept borrowed from XSLT 3.0 (in XSLT 2.0, it was
called a “match pattern”), to identify which elements in the source
document to change.
The signature for p:store
may be a little more
surprising:
<p:declare-step type="p:store">
<p:input port="source" content-types="any"/>
<p:output port="result" content-types="any" primary="true"/>
<p:output port="result-uri" content-types="application/xml"/>
<p:option name="href" required="true" as="xs:anyURI"/>
<p:option name="serialization" as="map(xs:QName,item()*)?"/>
</p:declare-step>
It takes the document that appears on its
source port and stores it in the location identified by the
href
option. (It’s implementation-defined whether any URI
schemes besides “file:
” are supported.) The serialization
option allows you to specify how XML and HTML documents should be
serialized (with or without indentation, for example, or using
XHTML-style empty tags).
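For instance, a request for indented output might be sketched as follows. The output location is hypothetical, and the map key indent is assumed to correspond to the serialization parameter of the same name.

```xml
<!-- Fragment: store the document with indentation turned on.
     out/result.xml is a hypothetical location. -->
<p:store href="out/result.xml"
         serialization="map { 'indent': true() }"/>
```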
The p:store
step has two output ports. What appears on the result
port is the same document that appeared on the source
port.
What appears on the result-uri
port is a document that
contains the absolute URI where the document was written.
This might not be intuitive at first glance, but it is a
convenience for pipeline authors. Think of debugging a pipeline: if
you want to inspect some intermediate results, just add a
p:store
in your pipeline and you’re done. The
result-uri
output is useful, for example, if you need to
send the location where a PDF was stored to some downstream process.
Either port’s output might be useful in some workflows, but you’re also free to
ignore one (or both!) of them.
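A sketch of reading the non-primary result-uri port follows. The step names save and where-stored and the output location are hypothetical; the identity step merely stands in for whatever downstream process needs to know where the file went.

```xml
<!-- Fragment: store a document, then pick up the URI it was written to. -->
<p:store name="save" href="out/book.pdf"/>

<p:identity name="where-stored">
  <p:with-input>
    <p:pipe step="save" port="result-uri"/>
  </p:with-input>
</p:identity>
```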
Our first XProc pipeline
With that preamble out of the way, let’s try to put the concepts
we’ve learned into a usable pipeline. It’s going to be a simple, contrived
pipeline, but a whole and usable one nevertheless.
Suppose we have an XHTML document with URI
“somewhere.xhtml
”. For some reason we need to change this
document to add an attribute named class
with value
“header
” to all h1
elements.
Assume we want to save the changed document at
“somewhere_new.xhtml
”.
Our source document might look like
this:
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Chapter 1</h1>
<p>Text of chapter.</p>
<h1>Chapter 2</h1>
<p>Some more text.</p>
<!-- ... -->
<h1>Chapter 99</h1>
<p>Text of the final chapter.</p>
</body>
</html>
Doing this by hand would be both boring and error-prone. So this
is an extremely simple, but typical, use case for an XProc
pipeline.
Given what we already know about XProc, we can sketch out what
is required: it’s a p:add-attribute
step and a
p:store
step, where the output port
result
of the former is connected to input port
source
of the latter.
Here’s what our “add attributes” step might look like using
the long-form options:
<p:add-attribute name="attribute-adder">
<p:with-option name="attribute-name" select="'class'"/>
<p:with-option name="attribute-value" select="'header'"/>
<p:with-option name="match" select="'xhtml:h1'"/>
</p:add-attribute>
Note that the value of a select
attribute in XProc
is an XPath expression, just like it is in XSLT. If you don’t
“quote” string values twice, you’ll get strange results.
If, for example, you left out the single quotes around “class”, you’d
be asking the processor to find an element named
class
in the context document and use its string
value as the value for the attribute-name
option.
That’s not likely to go well.
In any event, we’re more likely to use the convenient shortcut forms
in practice, so let’s switch to those:
<p:add-attribute name="attribute-adder"
                 match="xhtml:h1"
                 attribute-name="class"
                 attribute-value="header"/>
Much better, except it doesn’t have any input. You might think
that would mean you wouldn’t get any output, but if you glance back at
the signature for p:add-attribute
, you’ll see that
the source
port does not allow a sequence (i.e., it requires
exactly one input; not zero, and not more than one). If you don’t
provide any input, you’ll get an error.
Providing a way for a step to receive input is called “binding
the port” in XProc. To mark a port binding for a step, XProc 3.0 uses
a p:with-input
element where the port
attribute is used to name the port which is to be bound. Inside this
element the actual binding takes place.
We saw p:document
before; it’s what we need here;
but we also saw that you can use the href
trick to read
a single document. Let’s just use that:
<p:add-attribute name="attribute-adder"
                 match="xhtml:h1"
                 attribute-name="class"
                 attribute-value="header">
  <p:with-input href="somewhere.xhtml"/>
</p:add-attribute>
Our other step is p:store
and we already know
everything we need to write that:
<p:store href="somewhere_new.xhtml"/>
Now we only need to work out how to connect the result output
from p:add-attribute
to the source port on
p:store
. We can do that with a p:pipe:
<p:store href="somewhere_new.xhtml">
  <p:with-input>
    <p:pipe step="attribute-adder" port="result"/>
  </p:with-input>
</p:store>
Our first pipeline is almost complete. We have written the
two steps required to do the task, we have set the steps’ options to
the required values, and we have bound the input ports of the two
steps. Two things are left: we need to give our steps a common root
element (every XProc pipeline has to be a valid XML document) and we
have to bind the namespace prefixes we’ve used. The root element of
every pipeline in XProc 3.0 has to be a
p:declare-step
; it has a version
attribute that must be set to “3.0”.
So our final pipeline looks like this:
<p:declare-step version="3.0"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <p:add-attribute name="attribute-adder"
                   match="xhtml:h1"
                   attribute-name="class"
                   attribute-value="header">
    <p:with-input href="somewhere.xhtml"/>
  </p:add-attribute>
  <p:store href="somewhere_new.xhtml">
    <p:with-input>
      <p:pipe step="attribute-adder" port="result"/>
    </p:with-input>
  </p:store>
</p:declare-step>
We can simplify this further. When two steps appear
adjacent to each other in a pipeline, the default connection (the
“magnetics”) will connect the primary output port of the
first step to the primary input port of the second. That’s exactly the
situation we have here, so we can remove the explicit pipe binding.
<p:declare-step version="3.0"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <p:add-attribute name="attribute-adder"
                   match="xhtml:h1"
                   attribute-name="class"
                   attribute-value="header">
    <p:with-input href="somewhere.xhtml"/>
  </p:add-attribute>
  <p:store href="somewhere_new.xhtml"/>
</p:declare-step>
Changing the pipeline
Given our first pipeline, let’s consider how we might adapt it
over time. Suppose our task becomes a little more complicated; not only
should we add the class
attribute to the elements, but
we should also mark the header nesting by adding an attribute
level
with value “1”.
All we have to do is to add another
p:add-attribute
step between our two, existing steps.
<p:declare-step version="3.0"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <p:add-attribute name="attribute-adder"
                   match="xhtml:h1"
                   attribute-name="class"
                   attribute-value="header">
    <p:with-input href="somewhere.xhtml"/>
  </p:add-attribute>
  <!-- No p:with-input here: by default this step reads the result of attribute-adder -->
  <p:add-attribute name="level-adder"
                   match="xhtml:h1"
                   attribute-name="level"
                   attribute-value="1"/>
  <p:store href="somewhere_new.xhtml"/>
</p:declare-step>
This example demonstrates the convenience of the default
bindings. If we’d left in our explicit pipe binding to
“attribute-adder”, p:store would still have read from that step and the
stored document would not have included the change made by the new step.
In practice, everything in an XProc pipeline is about the
connections between steps. Inserting new steps usually also involves
fixing up the connections. Forgetting this can lead to surprising
results.
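To make the pitfall concrete, here is a sketch of what the explicit variant of our p:store step would need to look like after the “level-adder” step has been inserted. It assumes the step names used in the example above; nothing else changes in the pipeline.

```xml
<p:store href="somewhere_new.xhtml">
  <p:with-input>
    <!-- Before the new step was inserted, this pipe named step="attribute-adder". -->
    <p:pipe step="level-adder" port="result"/>
  </p:with-input>
</p:store>
```

With the default bindings, no such fix-up is needed, because p:store simply reads from whichever step precedes it.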
It may also have occurred to you by now that, if you make all of
the connections explicit (which you are entirely free to do), then the
order of the steps in your pipeline document is
basically irrelevant. For the sake of the poor soul (very possibly
yourself) who has to modify your pipeline in six months, don’t
take advantage of this fact.
A good rule of thumb is to represent linear flows in your
pipeline with linear sequences of steps in your pipeline document.
Branching, merging, and nested pipelines always introduce some amount of
complexity; see the section “Irreducible complexity”.
Compound steps
In addition to a large vocabulary of atomic steps, steps like
p:xinclude
and p:xslt
which contain no
subpipeline, XProc 3.0 defines several “compound” steps that let you control the
flow of documents.
Writing pipeline steps
As we saw above, p:declare-step
lets you write your own pipeline
steps. Once written, you can call them directly or embed them in
other pipelines.
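As an illustration only, the following sketch wraps the “add attribute” logic from our first pipeline in a reusable step and then calls it. The ex prefix, its namespace URI, the step type ex:mark-headers, and the class-name option are all hypothetical names invented for this example.

```xml
<p:declare-step version="3.0"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xhtml="http://www.w3.org/1999/xhtml"
                xmlns:ex="http://example.com/ns/steps">

  <!-- Hypothetical reusable step: the type attribute gives it a name
       by which other pipelines can call it. -->
  <p:declare-step type="ex:mark-headers">
    <p:input port="source"/>
    <p:output port="result"/>
    <p:option name="class-name" select="'header'"/>

    <!-- Reads from the source port and writes to the result port
         through the default connections. -->
    <p:add-attribute match="xhtml:h1"
                     attribute-name="class"
                     attribute-value="{$class-name}"/>
  </p:declare-step>

  <!-- Call the new step just like a built-in one. -->
  <ex:mark-headers class-name="title">
    <p:with-input href="somewhere.xhtml"/>
  </ex:mark-headers>
  <p:store href="somewhere_new.xhtml"/>
</p:declare-step>
```

Once declared, the step is used with the same shortcut syntax for options and the same p:with-input bindings as the standard steps.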
Loops with for-each
The p:for-each
step lets you apply a series of steps (a
subpipeline) to each of the input documents you provide to it.
The p:directory-list
step returns a directory listing. The p:load
step has an href
option and it loads the document identified by that
URI.
We can combine these steps with p:for-each
to process all of the
documents in a directory:
<p:directory-list path="*.xml"/>
<p:for-each select="//c:file">
<p:load href="{resolve-uri(@name, base-uri(.))}"/>
<p:xslt>
<p:with-input port="stylesheet" href="tohtml.xsl"/>
</p:xslt>
</p:for-each>
Here we get a list of all the files in the current directory
that match “*.xml”, load each one, and run XSLT over it.
The resulting sequence of transformed HTML documents appears on the
output port of the p:for-each.
Conditionals
There are two conditional elements, a general p:choose
and a syntactic
shortcut, p:if
, for the simple case of a single conditional.
p:choose
Looking back at the p:for-each
example, suppose
some of the documents in the directory are
already XHTML. We don’t want to process them with
our tohtml.xsl
stylesheet because they’re already HTML,
but we do want to process the other documents. We can use
p:choose
to achieve this:
<p:directory-list path="*.xml"/>
<p:for-each select="//c:file">
<p:load href="{resolve-uri(@name, base-uri(.))}"/>
<p:choose>
<p:when test="/h:html">
<p:identity/>
</p:when>
<p:otherwise>
<p:xslt>
<p:with-input port="stylesheet" href="tohtml.xsl"/>
</p:xslt>
</p:otherwise>
</p:choose>
</p:for-each>
The p:choose
step will evaluate the test
condition on each
p:when
and run only the first one that matches. In this case, if the
root element of the document loaded is h:html
, then we pass it through
the identity step. Otherwise, we pass it through XSLT. The output of
the p:choose
step is the output of the single branch that gets run.
p:if
It is very common in pipelines to have conditionals where you
want to perform some step if an expression is true, and pass the
document through unchanged if it isn’t.
That’s what the preceding p:choose
example does, in fact. The p:if
statement can be used to simplify this case. It has a single test
expression. If the expression is true, then its subpipeline is
evaluated; otherwise, it passes its source through unchanged.
The preceding pipeline can be simplified with p:if
:
<p:directory-list path="*.xml"/>
<p:for-each select="//c:file">
<p:load href="{resolve-uri(@name, base-uri(.))}"/>
<p:if test="not(/h:html)">
<p:xslt>
<p:with-input port="stylesheet" href="tohtml.xsl"/>
</p:xslt>
</p:if>
</p:for-each>
The semantics are exactly the same. If the document element is not
h:html
, it will be transformed, otherwise it will pass through
unchanged.
Exception handling
Many pipelines just assume that nothing will go wrong; often
nothing does. But on the occasions when a step fails, that failure “cascades
up” through the pipeline and if nothing “catches” it, the whole
pipeline will crash.
Sometimes, having the whole pipeline crash is not appropriate. We can
write defensive pipelines by adding try/catch elements around the
steps that we know might fail (and for which there is some useful
corrective action). That’s what p:try
is for:
<p:try>
<ex:do-something/>
<p:catch code="err:XC0053">
<ex:recover-from-validation-error/>
</p:catch>
<p:catch>
<ex:recover-from-other-errors/>
</p:catch>
</p:try>
This pipeline will do ex:do-something
. If that succeeds, that’s the
result of the p:try
. If it fails, p:try
will choose a “catch”
pipeline to deal with the error.
If the error thrown is err:XC0053
, a validation error
(unfortunately, you just have to look up the error codes), the
ex:recover-from-validation-error
pipeline will be run. If it
succeeds, that’s the result of the p:try
. (If it fails, the whole
p:try
fails and we’d better hope there’s another one higher up!) If
the error thrown isn’t a validation error, then
ex:recover-from-other-errors
will run.
In no case will more than one catch branch run.
Viewports
The p:viewport
step is a looping step, like p:for-each
. The
difference is that where p:for-each
loops over a set of documents,
p:viewport
loops over parts of a single document.
Suppose there’s some processing that you want to perform on specific
sections of a document. Let’s say you want to transform all sections
that are marked as “final” in some way.
Because sections can be nested arbitrarily, there’s no straightforward
way to “pull apart” the document so that you can run p:for-each
over
it. Instead, you need to use p:viewport
:
<p:viewport match="section[@status='final']">
<p:xslt>
<p:with-input port="stylesheet" href="final-sections.xsl"/>
</p:xslt>
</p:viewport>
This step will take each section marked as “final” out of the input
document and transform it with final-sections.xsl
. It will then
stitch the results of that transformation back into the original
document exactly where the sections appeared initially.
All of the other content in the document will be left untouched.
Groups
The p:group
element does nothing. Like div
in HTML, it’s a
free-form wrapper that allows authors to group steps together. This
may make the pipelines easier to edit and it provides a way for
authors to limit the scope of variables and steps.
You’ll probably never use it.
Libraries
The pipeline steps you write can be grouped together into libraries
for convenience. This allows whole libraries of related steps to be
imported at once.
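A minimal sketch of such a library might look like the following. The file name steps.xpl, the ex prefix and namespace URI, and the step type ex:to-html are assumptions made for illustration; the stylesheet name is borrowed from the earlier examples.

```xml
<!-- steps.xpl: a hypothetical library containing one reusable step -->
<p:library version="3.0"
           xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:ex="http://example.com/ns/steps">

  <p:declare-step type="ex:to-html">
    <p:input port="source"/>
    <p:output port="result"/>
    <p:xslt>
      <p:with-input port="stylesheet" href="tohtml.xsl"/>
    </p:xslt>
  </p:declare-step>

  <!-- Further related p:declare-step elements would go here. -->
</p:library>
```

A pipeline that wants to use these steps would then place a p:import element whose href attribute points at steps.xpl before its first step, after which ex:to-html can be called like any other step.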
Loose ends
The authors wish to address a few more topics, without cluttering the
flow of the preceding narrative.
Document properties
All documents flowing through an XProc pipeline have an
associated collection of document properties. The document properties
are name/value pairs that may be retrieved by expressions in the
pipeline language and set by steps.
There are standard properties for the base URI, media type, and
serialization properties. Authors are free to take advantage of
document properties to associate metadata with documents as they flow
through the pipeline.
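As a rough sketch of how this might look in practice, the pipeline below attaches a property to a document and later tests it. It assumes the p:set-properties step and the p:document-property function as described in the XProc 3.0 specification; the property name “status” and its value are hypothetical.

```xml
<p:declare-step version="3.0"
                xmlns:p="http://www.w3.org/ns/xproc">
  <p:output port="result"/>

  <p:load href="somewhere.xhtml"/>

  <!-- Attach a hypothetical "status" property to the document. -->
  <p:set-properties>
    <p:with-option name="properties" select="map{'status': 'reviewed'}"/>
  </p:set-properties>

  <!-- Read the property back in an XPath expression. -->
  <p:if test="p:document-property(., 'status') = 'reviewed'">
    <p:add-attribute match="/*"
                     attribute-name="status"
                     attribute-value="reviewed"/>
  </p:if>
</p:declare-step>
```

The property travels with the document rather than inside it, so the markup itself is only changed by the p:add-attribute step.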
One natural question to ask is, when is metadata preserved? It
seems pretty clear that the properties associated with a document
should survive if the document passes through a p:identity
step. Conversely, it seems likely that the output from a
DocBook-to-HTML transformation is in no practical sense “the same
document” that went in and preserving document properties is as likely
to be an error as not.
Step authors should describe how their pipelines affect the
properties of the documents flowing through them.
Irreducible complexity
The syntax of XProc 3.0 is, we believe, a marked improvement
over the XProc 1.0 syntax. While much of it is still familiar, some
awkward concepts have been removed and a large number of authoring
shortcuts have been added.
Unfortunately, at the end of the day, complex pipelines are
still, quite obviously, complex. XProc is, fundamentally, a tree-based language
describing a graph-shaped problem. Until such time as someone invents
a useful, graph-shaped syntax, we may be stuck with a certain amount
of irreducible complexity.
Why not just use XSLT?
XSLT is a fabulous tool. It appears in almost every XProc
pipeline written to process XML. It is very definitely a sharp tool,
but it is by no means “small” anymore. The XSLT 3.0 specification runs
to more than 1,000 pages; printed in a similar way, the XProc 3.0
specification doesn’t (yet) break 100 pages.
That is absolutely not a criticism of XSLT. But there is value
in breaking problems down into simpler parts. Developing, testing, and
debugging six small stylesheets is much easier than performing any of
those tasks on a single stylesheet that performs all six functions.
Combining processing into a single stylesheet also introduces whole
classes of errors that simply don’t occur in small, separate
stylesheets.
If XSLT will do the job, by all means, use it. But we think
there is a role for declarative pipelines that is complementary to
XSLT.
References
[Proschel2015]
“Engineering of Metabolic Pathways by Artificial Enzyme
Channels”.
Frontiers in Bioengineering and Biotechnology.
Pröschel M, Detsch R, Boccaccini AR, and Sonnewald U.
2015. doi:https://doi.org/10.3389/fbioe.2015.00168.
[XDM]
XQuery and XPath Data Model 3.1.
Norman Walsh, John Snelson, and Andrew Coleman, editors.
W3C Recommendation. 21 March 2017.
http://www.w3.org/TR/xpath-datamodel-31/
[XInclude]
XML Inclusions (XInclude) Version 1.0 (Second Edition).
Jonathan Marsh, David Orchard, and Daniel Veillard, editors.
W3C Recommendation. 15 November 2006.
http://www.w3.org/TR/xinclude/
[XProc30]
XProc 3.0: An XML Pipeline Language.
Norman Walsh, Achim Berndzen, Gerrit Imsieke and Erik Siegel, editors.
http://spec.xproc.org/
[XSLT30]
XSL Transformations (XSLT) Version 3.0.
Michael Kay. W3C Recommendation 8 June 2017.
http://www.w3.org/TR/xslt-30/