Bauman, Syd. “Two Paths are Better than One.” Presented at Balisage: The Markup Conference 2024, Washington, DC, July 29 - August 2, 2024. In Proceedings of Balisage: The Markup Conference 2024. Balisage Series on Markup Technologies, vol. 29 (2024).

Balisage Paper: Two Paths are Better than One

Syd Bauman

Senior XML Programmer / Analyst

Northeastern University / Library / DSG / WWP

Syd Bauman became a hard-core computer user in 1982, and a devotee of descriptive markup two years later. He began using SGML and the TEI when he came to the Women Writers Project in 1990. Although his title would have you believe that he is a computer programmer, Syd is fond of pointing out that he doesn’t write that much actual code, but usually writes in XSLT, and his programs are always free (as in speech). From 2001 to 2007 Syd served as North American editor of the TEI, and is currently on the TEI Technical Council.

Being able to execute an XPath against a corpus of XML documents is powerful. Being able to execute a second XPath from each node selected by the first is even more powerful.

Table of Contents

Note to the reader
Brief History
Main Methods of XML Analysis
XPath Tools for XML Analysis
Two-path power
Real or imagined?
Human cognition — bears of very little brain
Other things
Distressing dogmatic division
Appendix A. A tangent on the danger of implied attribute order

Note to the reader

A link to an updated version (or even a newer edition) of this paper may be available on the WWP bibliography page.


In 1980 the long-lost father of the protagonist of a reasonably popular science fiction motion picture gave some suspect advice to his son: “If you only knew the power of the dark side!” I sometimes think to myself “If you only knew the power of an XPath!”.[1] Not that I think that XPath is analogous to the dark side or will lead to any kind of galactic conquest, let alone that of an evil fascist empire. Rather it is because I see XPath as the underlying Swiss army knife that leverages the power of XML data, that slices, dices, chops, shreds, purees; but wait, there’s more.

But it has recently occurred to me — and this is a thought that may have occurred to many of you years ago (I’m a bit slow) — that XPaths are synergistic: that the power of two XPaths is greater than twice the power of one.

I should confess that when I say XPath herein I am primarily (but not always) considering what XPath 1.0 calls a Location Path (XPath1§2) and XPath 2.0 and 3.1 call a path expression (XPath2§3.2 and XPath3.1§3.3).

Brief History

When TEI P2 (TEIP2) was published in 1993 it included what seemed to me to be both a pretty obvious and pretty exciting new feature — the extended pointer syntax.[2] Roughly analogous to XPath, it permitted selection of a single sequence of PCDATA characters from an SGML document based on a combination of one or more tests of its position in the element tree, character offset, or pattern matching. It also allowed selection of bits of graphical or spatio-temporal data (e.g., for referring to sections of images or 3D models), and anything that could be referred to using HyQ (HyTime; also DDMHW) or a foreign mechanism, but these systems were not defined fully by the TEI Guidelines.[3]

While the fact that an extended pointer only refers to PCDATA is a significant limitation of the TEI extended pointer mechanism, the fact that it only points to one sequence of characters is not. TEI has other mechanisms for combining the various snippets of text pointed at by a set of pointers into one.

The major point for our purposes here, though, is that although the actual syntax was quite different, the concept of selecting a portion of an (SG|X)ML tree structure by taking steps in particular directions within that structure is essentially the same. TEI extended pointers even used directional names similar to the names of what are called axes in XPath (the main difference being that following was called next).


XPath 1.0 itself became a recommendation in 1999; the latest version, 3.1, in 2017; and an updated next generation version 4.0 is being worked on currently. While 3.1 offers significant improvements over 1.0, the basic underlying mechanism for addressing nodes in the XML tree remains the same. In fact, the tool I use most often for simple XPath execution (i.e., use of XPath other than in XSLT or Schematron) is based on XPath 1.0.

Main Methods of XML Analysis

My job title is Senior XML Programmer/Analyst[4], so it is not surprising that I often find myself tasked with analyzing an XML corpus, usually looking for insights into the encoding or for inconsistencies (because inconsistencies are often, although not always, errors). The two major categories of tools I see used for analyzing XML corpora are regular expression string searches and XPaths. The regular expressions are sometimes used in a relatively simple manner (e.g., grep) and sometimes wrapped in a full programming language (e.g., JavaScript, Perl, Python, or Ruby). The XPaths are sometimes used in a relatively simple manner (e.g., one of the commandline XPath processors listed below) and sometimes wrapped in Schematron, XQuery, or XSLT.

Regular expressions are ubiquitously available and easy to learn. Moreover, because a string search (whether a “plain” search, “wildcard” search, or a full-fledged regular expression search) operates over the characters used as the actual contents of the file(s), regular expressions are useful in all sorts of non-XML cases.[5]

XPath, on the other hand, while extraordinarily powerful, is a relatively niche technology: Its only intended use case is for operating over XML documents.[6] Thus finding tutorials and software for XPath, while not difficult, is far harder than finding tutorials or software for regular expressions.

But more important than their relative availability, the major difference between regular expression string searches and XPath is that the former operate over the characters used to serialize the XML (i.e., the actual contents of the file(s)), whereas XPath operates over the XML tree after it has been parsed, i.e., over the XDM (XDMspec; also Kay2). Thus for working with XML data, regular expressions have the occasionally useful advantage of allowing you to search for things that are not part of the XDM (for example, the whitespace characters before an attribute specification). XPath, in turn, has the very frequently useful advantage of not making you work around things that are not part of the XDM (for example, whether an attribute value is delimited with double quote characters (‘"’, U+0022, which SGML called LIT) or single quote characters (‘'’, U+0027, which SGML called LITA)).

A colleague recently recommended a simple regular expression search to find, within a particular XML corpus, all of the <hi> elements that contain nothing but whitespace:

        $ egrep '<hi[^>]*>\s+</hi>' /files/in/corpus/*.xml
And this works perfectly well for her case, because in her corpus the character ‘>’ is not permitted in an attribute value, and because she does not mind finding cases of <hi> </hi> in comments or processing instructions.

But in arbitrary XML the character ‘>’ might be in an attribute value. So she would need either a much more complicated regular expression[7] or a somewhat simpler XPath path expression, for example, in XPath 3.1:

        //hi[ matches( ., '^\s+$') ]
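The difference between searching the serialized characters and searching the parsed tree can be seen in a small experiment. What follows is only a minimal sketch in Python, not a tool used in this paper: the sample document is invented, and because the standard library's ElementTree implements only a subset of XPath (no matches() function), the whitespace test is written in Python rather than in XPath.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one <hi> inside a comment, one with '>' in an
# attribute value, one ordinary whitespace-only <hi>, and one with content.
SAMPLE = """<doc>
  <!-- <hi> </hi> -->
  <hi rend="a>b">   </hi>
  <hi> </hi>
  <hi>text</hi>
</doc>"""

# The regular expression from the egrep example, run over the raw characters.
regex_hits = re.findall(r'<hi[^>]*>\s+</hi>', SAMPLE)

# The tree-based counterpart: every <hi> whose string value is non-empty
# and consists of nothing but whitespace (comments are not in the tree).
root = ET.fromstring(SAMPLE)
tree_hits = []
for hi in root.iter('hi'):
    value = ''.join(hi.itertext())
    if value != '' and value.strip() == '':
        tree_hits.append(hi)
```

Both searches report two hits here, but not the same two: the regular expression finds the commented-out element and misses the one whose attribute value contains ‘>’, while the tree search does the reverse.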

XPath Tools for XML Analysis

So unless one needs to look for features that are not part of the XDM, XPath is probably the best tool for analysis of XML files, and certainly the one I use and recommend the most. There are quite a few systems that will allow you to apply a single XPath to an XML file and give you the result. The oXygen XML Editor, for example, has an “XPath Toolbar” that allows the user to enter an XPath which can be executed against the current file or a set of files, although when executed against a set of files what you get is separate results for each file, not the combined results for all files.[9]


While I love oXygen (it is the only payware on my main system), I primarily work on the commandline. There are, not surprisingly, quite a few utilities for using XPath on the commandline.[10] My favorite, by far, is xmlstarlet, but I do not claim that it is the best, only that I like it the most, probably because it was the first one I started using seriously. I tried writing my own a few years ago, and while miserable failure would be an exaggeration, roaring success would be a far greater exaggeration. My program, a bash script based on the xpath++ command, had the signature

        $ xpath.bash [-1|-2|-3] <XPath expression> <XML file>
I wrote it, in part, because I felt a bit too constrained by the XPath 1.0 limitation of xmlstarlet. But even on those rare occasions when my program was working, I found that it was not powerful enough, even though it permitted XPath 3.1. It slowly dawned on me that what made (my use of) xmlstarlet so powerful was the capability to do something else with the results of XPath A, for example to execute an XPath B from every selected node. After all, that’s what makes Schematron so powerful, right?


Schematron,[11] which can be very useful for querying XML data, is basically just that: the capability to test any XPath B (the @test attribute) from every node selected by any XPath A (the @context attribute). Sure, it adds other twists to this basic concept; for starters, every XPath B is evaluated as a Boolean. But it also has the capability to insert the value of an XPath into your output message (with <sch:value-of>), abstract patterns, the clever feature that for any given node only the first XPath A within a <pattern> that matches it is used, and importantly the ability to store the results of an XPath as a “variable”.[12] And although those extra capabilities and scaffolding are important, it seems to me that the power of Schematron comes from this basic capability to evaluate XPath B from each node selected by XPath A.
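The core idea (evaluate B from every node selected by A, and treat the result as a Boolean) is small enough to sketch in a few lines. The following Python is an illustration only, not how any Schematron processor is implemented; the function name and the sample document are invented, ElementTree's limited path syntax stands in for XPath, and “selects at least one node” stands in for Schematron's Boolean evaluation.

```python
import xml.etree.ElementTree as ET

def failed_assertions(root, context, test):
    """Return each node selected by `context` (XPath A) for which
    `test` (XPath B), evaluated from that node, selects nothing."""
    return [node for node in root.findall(context)
            if not node.findall(test)]

# Hypothetical constraint: every <list> must contain at least one <item>.
doc = ET.fromstring(
    "<doc><list n='a'><item/></list><list n='b'/></doc>")
bad = failed_assertions(doc, ".//list", "item")
```

Here `bad` holds the one `<list>` (the one with `n='b'`) that has no `<item>` child. Everything else Schematron offers (messages, variables, rule ordering within a pattern) is deliberately left out of this sketch.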

As an example, consider the following pattern which tests that the values of @maxOccurs and @minOccurs are usable.

  <sch:pattern id="att.repeatable-MINandMAXoccurs">
    <sch:rule context="*[ @minOccurs  and  @maxOccurs ]">
      <sch:let name="min" value="@minOccurs cast as xs:integer"/>
      <sch:let name="max" value="if ( normalize-space( @maxOccurs ) eq 'unbounded')
                                 then -1
                                 else @maxOccurs cast as xs:integer"/>
      <sch:assert test="$max eq -1  or  $max ge $min">
        @maxOccurs should be greater than or equal to @minOccurs
      </sch:assert>
    </sch:rule>
    <sch:rule context="*[ @minOccurs  and  not( @maxOccurs ) ]">
      <sch:assert test="@minOccurs cast as xs:integer lt 2">
        When @maxOccurs is not specified, @minOccurs must be 0 or 1
      </sch:assert>
    </sch:rule>
  </sch:pattern>
This pattern uses four XPath tests and two assignments. (The values that are assigned are expressed in XPath as well, but this is not particularly germane to the argument here.) I submit that if the job of these four XPaths were assigned to a single XPath, the result would be significantly harder to decipher, whether intermediate variables are used or not. Here is the same test using intermediate variables (as the Schematron does):
        if ( @minOccurs  and  not( @maxOccurs ) ) then
          if ( @minOccurs cast as xs:integer ge 2 ) then
            'When @maxOccurs is not specified, @minOccurs must be 0 or 1'
          else ''
        else if ( @minOccurs  and  @maxOccurs ) then
          let $max := if ( normalize-space( @maxOccurs ) eq 'unbounded')
                      then -1
                      else @maxOccurs cast as xs:integer,
              $min := @minOccurs cast as xs:integer
          return
            if ( $max eq -1  or  $max ge $min ) then ''
            else '@maxOccurs should be greater than or equal to @minOccurs'
        else ''
Without the intermediate variables comprehension is marginally more difficult:
        if ( @minOccurs  and  @maxOccurs ) then
          if ( @maxOccurs eq 'unbounded'
               or (
                 if ( @maxOccurs castable as xs:integer )
                 then xs:integer( @maxOccurs ) ge xs:integer( @minOccurs )
                 else false()
               ) )
          then ''
          else '@maxOccurs should be greater than or equal to @minOccurs'
        else
          if ( @minOccurs  and  not( @maxOccurs ) ) then
            if ( @minOccurs cast as xs:integer gt 1 ) then
              'When @maxOccurs is not specified, @minOccurs must be 0 or 1'
            else ''
          else ''
And, not surprisingly, without whitespace (or color syntax highlighting or some other aid for the human reader), all hope is lost:
if (@minOccurs and not(@maxOccurs)) then if (@minOccurs cast as xs:integer ge 2) then 'When @maxOccurs is not specified, @minOccurs must be 0 or 1' else '' else if (@minOccurs and @maxOccurs) then let $max := if (normalize-space(@maxOccurs) eq 'unbounded') then -1 else @maxOccurs cast as xs:integer, $min := @minOccurs cast as xs:integer return if ($max eq -1 or $max ge $min) then '' else '@maxOccurs should be greater than or equal to @minOccurs' else ''
While the loss of whitespace would also make the Schematron version very hard to read, lots of XML tools will easily indent it in a reasonable fashion. (E.g., oXygen has a “Format and Indent” capability; or xmllint --format ugly_input.xml > prettier_output.xml.[13])

Two-path power

Real or imagined?

Of course this assertion, that two XPaths are more powerful than one, is not strictly true. Executing two XPath path expressions in a row is not much different from tacking one onto the other. That is, executing XPath B from each node selected by XPath A is not significantly different from XPath A / XPath B. Furthermore, in XPath 3.1 we have the simple mapping operator (‘!’) so if items instead of nodes are involved, one could use XPath A ! XPath B.
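The near-equivalence of “B from each node selected by A” and the single path A / B can be checked directly. This is a minimal sketch using Python's ElementTree and an invented document; it holds for simple downward paths like these, whereas in general XPath's ‘/’ also removes duplicates and restores document order, which a naive loop does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, loosely shaped like a version-control log.
doc = ET.fromstring(
    "<log>"
    "<logentry><paths><path>a.odd</path><path>b.xml</path></paths></logentry>"
    "<logentry><paths><path>c.odd</path></paths></logentry>"
    "</log>")

path_a = "logentry"      # XPath A
path_b = "paths/path"    # XPath B

# Two steps: evaluate B from each node selected by A.
two_step = [b for a in doc.findall(path_a) for b in a.findall(path_b)]

# One step: the concatenated path A/B.
one_step = doc.findall(path_a + "/" + path_b)
```

Both lists contain the same three `<path>` elements in the same order, so nothing is gained in expressive power; the gain, if any, lies elsewhere.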

So if my two XPaths (A and B) could have easily been expressed as a single XPath (let’s call it AB), then why do I think it so much better to be able to express them separately? For at least two reasons. First because of human cognition, and second because of what else can be done in addition to XPath B at each node selected by XPath A.

Human cognition — bears of very little brain

When you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it.

— A.A. Milne, Winnie-the-Pooh

Far be it from me to imply that any in the audience or reading this paper is of very little brain, except insofar as, like me, your brain is generally bigger when you write a complex XPath (or any other snippet of complicated code) than it is when you try to read and understand it months or years later. And whether the author of an XPath is of big brain or not, it is almost always the case that someday someone of smaller brain (or, at least, of lesser understanding of the task at hand) will be reading it.

That is, I assert that readability is usually a more important feature of an XPath in particular, and computer code in general, than is brevity or execution efficiency. For starters, “[p]rogrammers think they get paid to write code, but the truth is, we spend a lot more time reading it than writing it” (NTW1). Given that programmer time in general, and my time in particular, is quite valuable, writing XPaths so that they can be read and understood quickly is valuable.

Thus the divide and conquer approach[14] of writing complicated chunks of code, including XPaths, can be quite advantageous. This method, of course, is a method for making it easier to write code; my contention here is that it can also make it easier to read the code. Thus I am thinking of this as divide for comprehension instead (although perhaps divide to comprehend is snappier).

As a simple example, the equivalent of the following XPath appears in an obfuscation program I wrote (inspired by and in part based on ERHobx):

        ( ( ( $seed * $b + 1 ) mod $m ) + 1 + $low + $me ) mod ( $high - $low + 1 ) + $low
While it is not impossible to figure out what is going on here, this seems like a reasonable candidate for the “Bet you can’t tell me what this does” game.[15] (It helps to know that all the variables are declared as xs:integer, and that $me is between $low and $high.) When re-written as two XPaths (here using XSLT variables), it becomes much easier (although still not easy) to make sense of:
        <xsl:variable name="random" select="( ( $seed * $b + 1 ) mod $m ) + 1"                         as="xs:integer"/>
        <xsl:variable name="result" select="( $random + $low + $me ) mod ( $high - $low + 1 ) + $low"  as="xs:integer"/>
While this code is still not at all obvious at first glance, after a few minutes puzzling one can figure out that $random is a pseudorandom number generated by a linear congruential approach, and $result is thus a pseudorandom number between $low and $high that is based on $me.
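The arithmetic can be checked outside of XSLT. The sketch below restates both forms in Python; the constants are invented for illustration (the actual values of $seed, $b, and $m are not given here), and Python's `%` agrees with XPath's `mod` only because every operand is a non-negative integer.

```python
def single_expression(seed, b, m, low, high, me):
    # The one-line form, transcribed term by term.
    return (((seed * b + 1) % m) + 1 + low + me) % (high - low + 1) + low

def two_steps(seed, b, m, low, high, me):
    # Step 1: a linear congruential step yields a pseudorandom integer.
    random = ((seed * b + 1) % m) + 1
    # Step 2: fold that integer, offset by low and me, into low..high.
    return (random + low + me) % (high - low + 1) + low

# Illustrative values only (assumptions, not the program's real constants).
args = dict(seed=42, b=1103515245, m=2**31, low=10, high=20, me=13)
```

Calling `single_expression(**args)` and `two_steps(**args)` gives the same number, and because the final step is “something mod (high - low + 1), plus low”, that number always lies between `low` and `high` inclusive.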

It is worth noting, by the way, that the length of an XPath, even if cleverly measured in tokens or terms rather than as the raw number of characters, is not necessarily a good measure of its complexity. For example, the longest XPath I have ever written (before variable substitution, at least) is 45,992 characters long, or 41,994 without spaces, but is very easy to understand. It was written for the project I presented last year, and is nothing more than a single sequence of the first 3,999 positive integers represented in Roman numerals, except those that actually spell out an English word (i.e., ‘I’ and “MIX”) are recorded as a zero-length string instead.

Conversely, some reasonably short XPaths can be quite difficult to decipher. For example, the above example of pseudorandom number generation in a single XPath has only 59 characters other than ignorable whitespace.

Other things

In a corpus of a particular journal I was working with, each article is encoded as a TEI document which has a root <TEI> element which itself has two children: <teiHeader> for metadata and <text> for the content of the article (including its bibliography, etc.). Each article has, at a specific spot in the <teiHeader>, a pair of <idno> elements to specify to which volume and issue of the journal the article belongs. I wanted to ask the question “How many articles are in each issue of volume 4?”.

I performed this task as a single one-liner in bash:

        $ xsel -t -m "/*/t:teiHeader//t:idno[@type='volume'][.='004']" -v "../t:idno[@type='issue']" -n /path/to/articles/*.xml | rank
That takes some unpacking.


xsel
A shell alias for xmlstarlet select -N t=,[16] which itself deserves some unpacking: xmlstarlet’s select subcommand (sel for short) allows as many -N options as desired before the first template (where the XPaths are specified), each of which binds a namespace URI to a prefix. Here I have bound the prefix t: to the TEI namespace.

-t
Short for --template, this option introduces a template (not unlike a simple XSLT template using different syntax).

-m
Short for --match, indicates that the template being defined should fire when the XPath (1.0) specified as the argument is found. We might call this XPath ‘A’.

"/*/t:teiHeader//t:idno[@type='volume'][.='004']"
An XPath (1.0) that selects all of the (metadata) <tei:idno> elements that have a @type attribute with the value volume and whose content is 004. In the vast majority of articles there should be zero such elements; in a dozen or so there should be one.[17]

-v
The first option of the template body, short for --value-of; its argument is an XPath (1.0) whose value should be returned when the template body is executed. We might call this XPath ‘B’.[18]

"../t:idno[@type='issue']"
An XPath (1.0) that grabs the sibling <tei:idno type="issue"> of the <tei:idno> that has been matched by the template and prints it (to standard output). Given that it is just being printed, the textual value is what is output.

-n
The second option of the template body, short for --nl, causes a new line character (U+000A) to be printed to standard output.

/path/to/articles/*.xml
A glob pattern that selects the desired files to process.[19]

|
The Unix pipe redirector. The output (STDOUT, in this case of xsel) of the command on its left is handed over to the command on its right for use as its input (STDIN, in this case to rank).

rank
A shell alias for sort | uniq -c | sort -nrs. That is, take the input records as a sequence, and sort the distinct values of that sequence by how many times each record occurs.[20]
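For readers who do not have xmlstarlet at hand, the same division of labor (XPath A, then XPath B from each match, then counting) can be sketched with Python's standard library. This is an approximation under stated assumptions: the tiny stand-in articles are invented and are not valid TEI, and because ElementTree cannot step up from the node a search starts at, the parent step (`..`) is moved from the start of B to the end of A.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"t": "http://www.tei-c.org/ns/1.0"}   # the TEI namespace

def issues_of_volume(root, volume):
    """XPath A selects the parent of each matching volume <idno>;
    XPath B, run from each such parent, selects the issue <idno>."""
    path_a = f".//t:teiHeader//t:idno[@type='volume'][.='{volume}']/.."
    path_b = "t:idno[@type='issue']"
    return [b.text for a in root.findall(path_a, NS)
                   for b in a.findall(path_b, NS)]

def article(volume, issue):
    """Build a minimal stand-in for one article (hypothetical structure)."""
    return ET.fromstring(
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>'
        f'<idno type="volume">{volume}</idno><idno type="issue">{issue}</idno>'
        '</fileDesc></teiHeader><text/></TEI>')

corpus = [article("004", "2"), article("004", "1"),
          article("004", "2"), article("003", "1")]
issues = [i for doc in corpus for i in issues_of_volume(doc, "004")]

# Roughly what the `rank` alias does: count each distinct value and
# list the most frequent first.
ranked = Counter(sorted(issues)).most_common()
```

With this invented corpus `ranked` says that issue 2 of volume 4 has two articles and issue 1 has one. Note how the counting is handled entirely outside the two paths, just as `rank` handles it outside xmlstarlet.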

It is not particularly difficult to whip up an XSLT or XQuery program that reports the same information in a single XPath. But I submit that, presuming the reader has a good knowledge of the host languages involved (in this case bash on one side and either XSLT or XQuery on the other), the shell one-liner is far easier to write, and somewhat easier to understand.

For example, the following XPath (3.1) does a similar job.

        let $articleCorpus := collection('/path/to/articles?select=*.xml')
          let $results :=
            let $v4issues := $articleCorpus/TEI/teiHeader[.//idno[@type eq 'volume'] eq '004']//idno[@type eq 'issue']/text()
              for $thisIssueNum in distinct-values( $v4issues!sort(.) )
              return count( $v4issues[ . eq $thisIssueNum ] )||' '||$thisIssueNum||'&#x0A;'
          return sort( $results, (), function($result) { tokenize($result)[1] cast as xs:integer } )

The result is only “similar” because while the same values are returned in the same order, the whitespace is not quite as nice. Note that this is not a direct comparison of one XPath to two, in that in my shell one-liner (using “divide to comprehend”) the job of counting occurrences and sorting the results has been factored out of the realm of XPath into a simple bash pipeline. If we similarly factor out that work into the host language (in the example below XSLT) we see that a major portion of the advantage of my bash one-liner is that this process of sorting the unique values of the sequence by how many times each occurs is so tersely expressed.

        <xsl:variable name="v4issues" select="$allArticles/TEI/teiHeader[.//idno[@type eq 'volume'] eq '004']//idno[@type eq 'issue']/text()"/>
        <xsl:variable name="uniq_v4issues" select="distinct-values( $v4issues )"/>
        <xsl:for-each select="$uniq_v4issues">
          <xsl:sort select="count( $v4issues[ . eq current() ] )"/>
          <xsl:sequence select="count( $v4issues[ . eq current() ] )||' '||.||'&#x0A;'"/>
        </xsl:for-each>

Distressing dogmatic division

Of course, it is not the case that for every XPath that can be divided into two, doing so improves readability and comprehension. Consider the following single XPath designed to be used to query a Subversion log file that has been created with the --xml and --verbose switches.

        /log/logentry[author = 'syd']/paths/path[ contains(., '.odd') ]!concat( ../../date, ' revision ', ../../@revision, '&#x0A; : ', normalize-space(../../msg) )
The above is a long and unwieldy XPath, somewhat hard to understand quickly. But dividing it up recursively into nearly atomic steps (here using the host language XSLT), I claim, results in code that is even harder to comprehend in its totality. (It is likely somewhat easier to debug, though. This is because when debugging we often want to examine each minute step, rather than the totality.)
  <xsl:template name="xsl:initial-template" match="/">
    <xsl:variable name="inputDocument" select="/" as="document-node()"/>
    <xsl:variable name="outermost" select="$inputDocument/*" as="element(log)"/>
    <xsl:variable name="allEntries" select="$outermost/logentry" as="element(logentry)+"/>
    <xsl:variable name="sydEntries" select="$allEntries[ author = 'syd']" as="element(logentry)*"/>
    <xsl:variable name="sydPathss" select="$sydEntries/paths" as="element(paths)*"/>
    <xsl:variable name="sydPaths" select="$sydPathss/path" as="element(path)*"/>
    <xsl:variable name="sydODDPaths" select="$sydPaths[ contains( ., '.odd')]" as="element(path)*"/>
    <xsl:apply-templates select="$sydODDPaths"/>
  </xsl:template>

  <xsl:template match="path">
    <xsl:variable name="date" select="../../date" as="xs:string"/>
    <xsl:variable name="boilerplate1" select="' revision '" as="xs:string"/>
    <xsl:variable name="revision" select="../../@revision" as="xs:string"/>
    <xsl:variable name="boilerplate2" select="'&#x0A; : '" as="xs:string"/>
    <xsl:variable name="message" select="normalize-space(../../msg)" as="xs:string"/>
    <xsl:sequence select="$date||$boilerplate1||$revision||$boilerplate2||$message||'&#x0A;'"/>
  </xsl:template>

However, not surprisingly, dividing it into a small number of XPaths (in this case just two) yields much more readable and comprehensible code than either of the above.

  <xsl:template name="xsl:initial-template" match="/">
    <xsl:variable name="paths" select="/log/logentry[author='syd']/paths/path[contains(.,'.odd')]"/>
    <xsl:sequence select="$paths!concat( ../../date, ' revision ', ../../@revision, '&#x0A; : ', normalize-space(../../msg) )"/>
  </xsl:template>
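The same “a small number of steps” shape carries over to other host languages. Here is a minimal sketch in Python, with an invented log (the author names, dates, and revision numbers are made up). Because ElementTree cannot climb from a `<path>` back up to its `<logentry>`, the split falls at the log entry rather than at the path, so this is an analogue of the two-XPath version and not a transcription of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt shaped like the output of: svn log --xml --verbose
LOG = """<log>
<logentry revision="101"><author>syd</author><date>2024-05-01</date>
  <paths><path>/trunk/schema/wwp.odd</path></paths>
  <msg>Tighten   content model</msg></logentry>
<logentry revision="102"><author>ash</author><date>2024-05-02</date>
  <paths><path>/trunk/schema/other.odd</path></paths>
  <msg>Unrelated</msg></logentry>
</log>"""

root = ET.fromstring(LOG)

# Step 1: select syd's log entries.
entries = root.findall("logentry[author='syd']")

# Step 2: from each entry, emit one line per changed path containing '.odd';
# ' '.join(s.split()) plays the role of normalize-space().
report = [f"{e.findtext('date')} revision {e.get('revision')}\n : "
          f"{' '.join(e.findtext('msg').split())}"
          for e in entries
          for p in e.findall("paths/path") if ".odd" in (p.text or "")]
```

With the invented log above, `report` contains a single entry, for revision 101.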

Appendix A. A tangent on the danger of implied attribute order

In XML the order of attributes is insignificant. That is, <song title="White Rabbit" composer="Grace Slick" performedBy="Jefferson Airplane"/> is exactly the same (informationally) as <song composer="Grace Slick" performedBy="Jefferson Airplane" title="White Rabbit"/>. Thus with either of those two elements as the input document, an XPath processor presented with for $a in /*/@* return name($a) might well return ('composer', 'performedBy', 'title'), but might instead return ('title', 'performedBy', 'composer').

Given that this is the case, why does XPath permit the expression @*[1] or attribute::*[ position() > count( parent::*/preceding-sibling::*[1]/@* ) ]? I think there might be an argument in favor of allowing them based on treating various types of nodes consistently. But I would be far more worried that a programmer who is somewhat unfamiliar with XML itself would use such a construct, and then perhaps years later the system would suffer a catastrophic failure because of a change in the XPath engine.

My first thought was that this could be addressed by having fn:position() return "NaN" when used in a predicate that is selecting from attribute::*. While this prevents the problematic formulations mentioned above, it has the weird side-effect that it would allow @*[ position() ne 'NaN'] (which selects none of them) and @*[ position() eq 'NaN'] (which selects all of them).
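To make the point about insignificant order concrete: code that treats attributes as an unordered mapping cannot be broken by a change in the order a parser or XPath engine happens to deliver them. A minimal sketch using Python's ElementTree, which exposes attributes as a dictionary:

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<song title="White Rabbit" composer="Grace Slick" '
                  'performedBy="Jefferson Airplane"/>')
b = ET.fromstring('<song composer="Grace Slick" '
                  'performedBy="Jefferson Airplane" title="White Rabbit"/>')

# Dictionary comparison ignores order, so the two elements carry the
# same attribute information.
same_information = a.attrib == b.attrib

# If a sequence of names is needed, impose an explicit order rather than
# relying on whatever order the processor returns.
names = sorted(a.attrib)
```

Sorting explicitly (or looking attributes up by name) is the order-independent counterpart of constructs like @*[1].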


XML; XPath; Schematron; command line