Two Paths are Better than One

Syd Bauman

Abstract

Being able to execute an XPath against a corpus of XML documents is powerful. Being able to execute a second XPath from each node selected by the first is even more powerful.

Note to the reader

A link to an updated version (or even a newer edition) of this paper may be available on the WWP bibliography page.

Introduction

In 1980 the long-lost father of the protagonist of a reasonably popular science fiction motion picture gave some suspect advice to his son: If you only knew the power of the dark side! I sometimes think to myself “If you only knew the power of an XPath!”.^[1] Not that I think that XPath is analogous to the dark side or will lead to any kind of galactic conquest, let alone that of an evil fascist empire. Rather it is because I see XPath as the underlying Swiss army knife that leverages the power of XML data, that slices, dices, chops, shreds, purees — but wait, there’s more.

But it has recently occurred to me — and this is a thought that may have occurred to many of you years ago (I’m a bit slow) — that XPaths are synergistic: that the power of two XPaths is greater than twice the power of one.

I should confess that when I say XPath herein I am primarily (but not always) considering what XPath 1.0 calls a Location Path (XPath1§2) and XPath 2.0 and 3.1 call a path expression (XPath2§3.2 and XPath3.1§3.3).

Brief History

When TEI P2 (TEIP2) was published in 1993 it included what seemed to me to be both a pretty obvious and pretty exciting new feature — the extended pointer syntax.^[2] Roughly analogous to XPath, it permitted selection of a single sequence of PCDATA characters from an SGML document based on a combination of one or more tests of its position in the element tree, character offset, or pattern matching. It also allowed selection of bits of graphical or spatio-temporal data (e.g., for referring to sections of images or 3D models), and anything that could be referred to using HyQ (HyTime; also DDMHW) or a foreign mechanism, but these systems were not defined fully by the TEI Guidelines.^[3]

While the fact that an extended pointer only refers to PCDATA is a significant limitation of the TEI extended pointer mechanism, the fact that it only points to one sequence of characters is not. TEI has other mechanisms for combining the various snippets of text pointed at by a set of pointers into one.

The major point for our purposes here, though, is that although the actual syntax was quite different, the concept of selecting a portion of an (SG|X)ML tree structure by taking steps in particular directions within that structure is essentially the same. TEI extended pointers even used directional names similar to the names of what are called axes in XPath (the main difference being that following was called next).

XPath

XPath 1.0 itself became a recommendation in 1999; the latest version, 3.1, in 2017; and an updated next generation version 4.0 is being worked on currently. While 3.1 offers significant improvements over 1.0, the basic underlying mechanism for addressing nodes in the XML tree remains the same. In fact, the tool I use most often for simple XPath execution (i.e., use of XPath other than in XSLT or Schematron) is based on XPath 1.0.

Main Methods of XML Analysis

My job title is Senior XML Programmer/Analyst^[4], so it is not surprising that I often find myself tasked with analyzing an XML corpus, usually looking for insights into the encoding or for inconsistencies (because inconsistencies are often, although not always, errors). The two major categories of tools I see used for analyzing XML corpora are regular expression string searches and XPaths. The regular expressions are sometimes used in a relatively simple manner (e.g., grep) and sometimes wrapped in a full programming language (e.g., JavaScript, Perl, Python, or Ruby). The XPaths are sometimes used in a relatively simple manner (e.g., one of the commandline XPath processors listed below) and sometimes wrapped in Schematron, XQuery, or XSLT.

Regular expressions are ubiquitously available and easy to learn. Moreover, because a string search (whether a “plain” search, “wildcard” search, or a full-fledged regular expression search) operates over the characters used as the actual contents of the file(s), regular expressions are useful in all sorts of non-XML cases.^[5]

XPath, on the other hand, while extraordinarily powerful, is a relatively niche technology: Its only intended use case is for operating over XML documents.^[6] Thus finding tutorials and software for XPath, while not difficult, is far harder than finding tutorials or software for regular expressions.

But more important than their relative availability is where each one operates. Regular expression string searches operate over the characters used to serialize the XML, i.e., the actual contents of the file(s), whereas XPath operates over the XML tree after it has been parsed, i.e., over the XDM (XDMspec; also Kay2). Thus for working with XML data, regular expressions have the occasionally useful advantage of allowing you to search for things that are not part of the XDM (for example, the whitespace characters before an attribute specification). On the other hand, XPath has the very frequently useful advantage of not making you work around things that are not part of the XDM (for example, whether an attribute value is delimited with double quote characters (‘"’, U+0022, which SGML called LIT) or single quote characters (‘'’, U+0027, which SGML called LITA)).
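As a minimal sketch of that difference (in Python, using only the standard library's re and xml.etree.ElementTree modules, neither of which is a tool discussed in this paper; the sample elements are invented for illustration), consider one element serialized with each kind of attribute delimiter:

```python
import re
import xml.etree.ElementTree as ET

# Two serializations of the same element: only the attribute delimiters differ.
double_quoted = '<hi rend="italic">text</hi>'
single_quoted = "<hi rend='italic'>text</hi>"

# A string search sees the delimiters, so a pattern written for one form
# misses the other.
pattern = re.compile(r'rend="italic"')
regex_hits = [bool(pattern.search(s)) for s in (double_quoted, single_quoted)]

# After parsing, the delimiters are gone: both trees carry the same value.
parsed_values = [ET.fromstring(s).get("rend") for s in (double_quoted, single_quoted)]

print(regex_hits)     # [True, False]
print(parsed_values)  # ['italic', 'italic']
```

The string search has to anticipate both delimiters; the tree-based lookup never sees them.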

A colleague recently recommended a simple regular expression search to find, within a particular XML corpus, all of the <hi> elements that contain nothing but whitespace:

        $ egrep '<hi[^>]*>\s+</hi>' /files/in/corpus/*.xml

And this works perfectly well for her case, because in her corpus the character ‘>’ is not permitted in an attribute value, and because she does not mind finding cases of <hi> </hi> in comments or processing instructions.

But in arbitrary XML the character ‘>’ might be in an attribute value. So she would need either a much more complicated regular expression^[7] or a somewhat simpler XPath path expression, for example, in XPath 3.1:

        //hi[ matches( ., '^\s+$') ]

^[8]
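The two failure modes of the string search can be reproduced in a small sketch. It is written in Python with the standard library only; the sample document is invented, and the list comprehension merely imitates the XPath above, since ElementTree does not implement matches().

```python
import re
import xml.etree.ElementTree as ET

# Invented sample: the first <hi> has '>' in an attribute value, and a
# whitespace-only <hi> also appears inside a comment.
sample = """<p>
  <hi rend="a>b"> </hi>
  <hi rend="bold">word</hi>
  <!-- <hi> </hi> -->
</p>"""

# The string search from the text, expressed as a Python regular expression.
regex = re.compile(r"<hi[^>]*>\s+</hi>")
regex_matches = regex.findall(sample)

# The tree-based test: <hi> elements whose string value consists of one or
# more whitespace characters, imitating //hi[ matches( ., '^\s+$') ].
root = ET.fromstring(sample)
tree_matches = [
    hi for hi in root.iter("hi")
    if re.fullmatch(r"\s+", "".join(hi.itertext()))
]

print(regex_matches)                          # ['<hi> </hi>']
print([hi.get("rend") for hi in tree_matches])  # ['a>b']
```

Here the string search reports only the commented-out element and misses the real one, while the tree-based test does the reverse.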

XPath Tools for XML Analysis

So unless one needs to look for features that are not part of the XDM, XPath is probably the best tool for analysis of XML files, and certainly the one I use and recommend the most. There are quite a few systems that will allow you to apply a single XPath to an XML file and give you the result. The oXygen XML Editor, for example, has an “XPath Toolbar” that allows the user to enter an XPath which can be executed against the current file or a set of files, although when executed against a set of files what you get is separate results for each file, not the combined results for all files.^[9]

Commandline

While I love oXygen (it is the only payware on my main system), I primarily work on the commandline. There are, not surprisingly, quite a few utilities for using XPath on the commandline.^[10] My favorite, by far, is xmlstarlet, but I do not claim that it is the best, only that I like it the most, probably because it was the first one I started using seriously. I tried writing my own a few years ago, and while miserable failure would be an exaggeration, roaring success would be a far greater exaggeration. My program, a bash script based on the xpath++ command, had the signature

        $ xpath.bash [-1|-2|-3] <XPath expression> <XML file>

I wrote it, in part, because I felt a bit too constrained by the XPath 1.0 limitation of xmlstarlet. But even on those rare occasions when my program was working, I found that it was not powerful enough, even though it permitted XPath 3.1. It slowly dawned on me that what made (my use of) xmlstarlet so powerful was the capability to do something else with the results of XPath A, for example to execute an XPath B from every selected node. After all, that’s what makes Schematron so powerful, right?

Schematron

Schematron,^[11] which can be very useful for querying XML data, is basically just that: the capability to test any XPath B (the @test attribute) from every node selected by any XPath A (the @context attribute). Sure, it adds other twists to this basic concept. For starters, every XPath B is evaluated as a Boolean. But it also has the capability to insert the value of an XPath into your output message (with <sch:value-of>), abstract patterns, the clever feature that within a <pattern> a given node is tested only by the first rule whose XPath A matches it, and importantly the ability to store the results of an XPath as a “variable”.^[12] And although those extra capabilities and scaffolding are important, it seems to me that the power of Schematron comes from this basic capability to evaluate XPath B from each node selected by XPath A.
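The basic mechanism can be sketched in a few lines. This is only an illustration of the idea, not of how any Schematron processor is implemented; it is in Python, uses the limited XPath subset of the standard library's xml.etree.ElementTree, and the function name check, the sample document, and the rule are all invented.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
doc = ET.fromstring("""<list>
  <item n="1"><label>one</label></item>
  <item n="2"/>
  <item><label>three</label></item>
</list>""")

def check(root, context, test, message):
    """Evaluate `test` from each node selected by `context`; collect
    `message` for each context node where the test selects nothing."""
    failures = []
    for node in root.findall(context):   # XPath A: choose the context nodes
        if not node.findall(test):       # XPath B: run from each, read as a Boolean
            failures.append((node.get("n"), message))
    return failures

# "Every <item> that has @n should have a <label> child."
report = check(doc, ".//item[@n]", "./label", "item lacks a label")
print(report)  # [('2', 'item lacks a label')]
```

The third <item> is never tested, because XPath A does not select it; that filtering is half of the work.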

As an example, consider the following pattern which tests that the values of @maxOccurs and @minOccurs are usable.

  <sch:pattern id="att.repeatable-MINandMAXoccurs">
    <sch:rule context="*[ @minOccurs  and  @maxOccurs ]">
      <sch:let name="min" value="@minOccurs cast as xs:integer"/>
      <sch:let name="max" value="if ( normalize-space( @maxOccurs ) eq 'unbounded')
                                 then -1
                                 else @maxOccurs cast as xs:integer"/>
      <sch:assert test="$max eq -1  or  $max ge $min">
        @maxOccurs should be greater than or equal to @minOccurs
      </sch:assert>
    </sch:rule>
    <sch:rule context="*[ @minOccurs  and  not( @maxOccurs ) ]">
      <sch:assert test="@minOccurs cast as xs:integer lt 2">
        When @maxOccurs is not specified, @minOccurs must be 0 or 1
      </sch:assert>
    </sch:rule>
  </sch:pattern>

This pattern uses four XPath tests and two assignments. (The values that are assigned are expressed in XPath as well, but this is not particularly germane to the argument here.) I submit that if the job of these four XPaths were assigned to a single XPath, the result would be significantly harder to decipher, whether intermediate variables are used or not. Here is the same test using intermediate variables (as the Schematron does):

        if ( @minOccurs  and  not( @maxOccurs ) ) then
          if ( @minOccurs cast as xs:integer ge 2 ) then
            'When @maxOccurs is not specified, @minOccurs must be 0 or 1'
          else ''
        else if ( @minOccurs  and  @maxOccurs ) then
          let $max := if ( normalize-space( @maxOccurs ) eq 'unbounded')
                      then -1
                      else @maxOccurs cast as xs:integer,
              $min := @minOccurs cast as xs:integer
          return
          if ( $max eq -1  or  $max ge $min ) then
            ''
          else
            '@maxOccurs should be greater than or equal to @minOccurs'
        else ''

Without the intermediate variables comprehension is marginally more difficult:

        if ( @minOccurs  and  @maxOccurs ) then
          if ( @maxOccurs eq 'unbounded'
               or
               (
                 if ( @maxOccurs castable as xs:integer )
                 then xs:integer( @maxOccurs ) ge xs:integer( @minOccurs )
                 else false()
                )
              )
          then ''
          else '@maxOccurs should be greater than or equal to @minOccurs'
        else
          if ( @minOccurs  and  not( @maxOccurs ) ) then
            if ( @minOccurs cast as xs:integer gt 1 ) then
              'When @maxOccurs is not specified, @minOccurs must be 0 or 1'
            else ''
          else ''

And, not surprisingly, without whitespace (or color syntax highlighting or some other aid for the human reader), all hope is lost:

if (@minOccurs and not(@maxOccurs)) then if (@minOccurs cast as xs:integer ge 2) then 'When @maxOccurs is not specified, @minOccurs must be 0 or 1' else '' else if (@minOccurs and @maxOccurs) then let $max := if (normalize-space(@maxOccurs) eq 'unbounded') then -1 else @maxOccurs cast as xs:integer, $min := @minOccurs cast as xs:integer return if ($max eq -1 or $max ge $min) then '' else '@maxOccurs should be greater than or equal to @minOccurs' else ''

While the loss of whitespace would also make the Schematron version very hard to read, lots of XML tools will easily indent it in a reasonable fashion. (E.g., oXygen has a “Format and Indent” capability; or, on the commandline, `xmllint --format ugly_input.xml > prettier_output.xml`.^[13])

Two-path power

Real or imagined?

Of course this assertion, that two XPaths are more powerful than one, is not strictly true. Executing two XPath path expressions in a row is not much different from tacking one onto the other. That is, executing XPath B from each node selected by XPath A is not significantly different from XPath A / XPath B. Furthermore, in XPath 3.1 we have the simple mapping operator (‘!’) so if items instead of nodes are involved, one could use XPath A ! XPath B.
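A small sketch of that near-equivalence, in Python with the standard library's xml.etree.ElementTree (the sample document and the two paths are invented for illustration). One caveat applies: a real XPath processor also removes duplicates from the result of A / B and returns it in document order, which a plain nested loop does not guarantee in general.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
doc = ET.fromstring("""<log>
  <entry><path>a.odd</path><path>b.xml</path></entry>
  <entry><path>c.odd</path></entry>
</log>""")

path_a = "./entry"
path_b = "./path"

# Two steps: run B from each node selected by A.
two_step = [p for entry in doc.findall(path_a) for p in entry.findall(path_b)]

# One step: A / B written as a single path.
one_step = doc.findall("./entry/path")

print([p.text for p in two_step])  # ['a.odd', 'b.xml', 'c.odd']
print(two_step == one_step)        # True
```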

So if my two XPaths (A and B) could have easily been expressed as a single XPath (let’s call it AB), then why do I think it so much better to be able to express them separately? For at least two reasons. First because of human cognition, and second because of what else can be done in addition to XPath B at each node selected by XPath A.

Human cognition — bears of very little brain

When you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it.

— A.A. Milne, Winnie-the-Pooh

Far be it from me to imply that any in the audience or reading this paper is of very little brain, except insofar as, like me, your brain is generally bigger when you write a complex XPath (or any other snippet of complicated code) than it is when you try to read and understand it months or years later. And whether the author of an XPath is of big brain or not, it is almost always the case that someday someone of smaller brain (or, at least, of lesser understanding of the task at hand) will be reading it.

That is, I assert that readability is usually a more important feature of an XPath in particular, and computer code in general, than is brevity or execution efficiency. For starters, “[p]rogrammers think they get paid to write code, but the truth is, we spend a lot more time reading it than writing it” (NTW1). Given that programmer time in general, and my time in particular, is quite valuable, writing XPaths so that they can be read and understood quickly is valuable.

Thus the divide and conquer approach^[14] of writing complicated chunks of code, including XPaths, can be quite advantageous. This method, of course, is a method for making it easier to write code; my contention here is that it can also make it easier to read the code. Thus I am thinking of this as divide for comprehension instead (although perhaps divide to comprehend is snappier).

As a simple example, the equivalent of the following XPath appears in an obfuscation program I wrote (inspired by and in part based on ERHobx):

        ( ( ( $seed * $b + 1 ) mod $m ) + 1 + $low + $me ) mod ( $high - $low + 1 ) + $low

While it is not impossible to figure out what is going on here, this seems like a reasonable candidate for the “Bet you can’t tell me what this does” game.^[15] (It helps to know that all the variables are declared as xs:integer, and that $me is between $low and $high.) When re-written as two XPaths (here using XSLT variables), it becomes much easier (although still not easy) to make sense of:

        <xsl:variable name="random" select="( ( $seed * $b + 1 ) mod $m ) + 1"                         as="xs:integer"/>
        <xsl:variable name="result" select="( $random + $low + $me ) mod ( $high - $low + 1 ) + $low"  as="xs:integer"/>

While this code is still not at all obvious at first glance, after a few minutes puzzling one can figure out that $random is a pseudorandom number generated by a linear congruential approach, and $result is thus a pseudorandom number between $low and $high that is based on $me.
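That the split changes nothing about the result can be checked mechanically. The sketch below is in Python rather than XPath, assumes all values are non-negative integers (where Python's % agrees with XPath's mod), and uses parameter values that are made up for illustration; they are not the ones used in the obfuscation program.

```python
def single_expression(seed, b, m, low, high, me):
    """The one-expression form quoted above."""
    return (((seed * b + 1) % m) + 1 + low + me) % (high - low + 1) + low

def two_steps(seed, b, m, low, high, me):
    """The same computation split into the two named steps."""
    random = ((seed * b + 1) % m) + 1                      # linear congruential step
    result = (random + low + me) % (high - low + 1) + low  # fold into low..high
    return result

# Hypothetical parameter values, chosen only for illustration.
B, M, LOW, HIGH = 21, 101, 65, 90

for seed in range(50):
    for me in range(LOW, HIGH + 1):
        value = two_steps(seed, B, M, LOW, HIGH, me)
        assert value == single_expression(seed, B, M, LOW, HIGH, me)
        assert LOW <= value <= HIGH
```

Both forms agree for every input tried, and the result always lands between the low and high bounds, as the prose description says it should.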

It is worth noting, by the way, that the length of an XPath, even if cleverly measured in tokens or terms rather than as the raw number of characters, is not necessarily a good measure of its complexity. For example, the longest XPath I have ever written (before variable substitution, at least) is 45,992 characters long, or 41,994 without spaces, but is very easy to understand. It was written for the project I presented last year, and is nothing more than a single sequence of the first 3,999 positive integers represented in Roman numerals, except that those that actually spell out an English word (i.e., “I” and “MIX”) are recorded as a zero-length string instead.
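To give a sense of why such a long expression is nonetheless simple, here is a sketch of how a sequence of that shape could be generated. It is in Python, it is not the method used to produce the actual XPath, and the exact layout of the generated expression (quoting, spacing) is an assumption.

```python
def roman(n):
    """Return the Roman numeral for an integer from 1 to 3999."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)   # how many times this symbol is needed
        out.append(symbol * count)
    return "".join(out)

# The two numerals the text says spell English words.
ENGLISH_WORDS = {"I", "MIX"}

numerals = ["" if roman(n) in ENGLISH_WORDS else roman(n) for n in range(1, 4000)]

# Assumed layout: a parenthesized, comma-separated sequence of string literals.
xpath_sequence = "( " + ", ".join(f"'{r}'" for r in numerals) + " )"
print(xpath_sequence[:40])
```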

Conversely, some reasonably short XPaths can be quite difficult to decipher. For example, the above example of pseudorandom number generation in a single XPath has only 59 characters other than ignorable whitespace.

Other things

In a corpus of a particular journal I was working with, each article is encoded as a TEI document whose root <TEI> element has two children: <teiHeader> for metadata and <text> for the content of the article (including its bibliography, etc.). Each article has, at a specific spot in the <teiHeader>, a pair of <idno> elements to specify to which volume and issue of the journal the article belongs. I wanted to ask the question “How many articles are in each issue of volume 4?”.

I performed this task as a single one-liner in bash:

        $ xsel -t -m "/*/t:teiHeader//t:idno[@type='volume'][.='004']" -v "../t:idno[@type='issue']" -n /path/to/articles/*.xml | rank

That takes some unpacking.

xsel	A shell alias to `xmlstarlet select -N t=http://www.tei-c.org/ns/1.0`,^[16] which itself deserves some unpacking: xmlstarlet’s `select` subcommand (`sel` for short) allows as many `-N` options as desired before the first template (where the XPaths are specified), each of which binds a namespace URI to a prefix. Here I have bound the prefix `t:` to the TEI namespace.
-t	Short for `--template`, this option introduces a template (not unlike a simple XSLT template using different syntax).
-m	Short for `--match`, indicates that the template being defined should fire when the XPath (1.0) specified as the argument is found. We might call this XPath ‘A’.
/*/t:teiHeader//t:idno[@type='volume'][.='004']	An XPath (1.0) that selects all of the (metadata) `<tei:idno>` elements that have a `@type` attribute with the value `volume` and whose content is `004`. In the vast majority of articles there should be zero such elements; in a dozen or so there should be one.^[17]
-v	The first option of the template body, short for `--value-of`; its argument is an XPath (1.0) whose value should be returned when the template body is executed. We might call this XPath ‘B’.^[18]
../t:idno[@type='issue']	An XPath (1.0) that grabs the sibling `<tei:idno type="issue">` of the `<tei:idno>` that has been matched by the template and prints it (to standard output). Given that it is just being printed, the textual value is what is output.
-n	The second option of the template body, short for `--nl`, causes a new line character (U+000A) to be printed to standard output.
/path/to/articles/*.xml	A glob pattern that selects the desired files to process.^[19]
|	The Unix pipe redirector. The output (STDOUT, in this case of `xsel`) of the command on its left is handed over to the command on its right for use as its input (STDIN, in this case to `rank`).
rank	A shell alias for `sort | uniq -c | sort -nrs`. That is, take the input records as a sequence, and sort the distinct values of that sequence by how many times each record occurs.^[20]
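The division of labor in that one-liner can be sketched as follows. This is an illustration in Python with the standard library only, not a replacement for xmlstarlet: the helper names (make_article, issues_of_volume) and the sample header structure are invented, ElementTree supports only a subset of XPath, and because it cannot step to the parent of a matched node a child-to-parent map stands in for the `..` step.

```python
from collections import Counter
import xml.etree.ElementTree as ET

NS = {"t": "http://www.tei-c.org/ns/1.0"}

def make_article(volume, issue):
    """Build a tiny stand-in for one article file (invented structure)."""
    return ET.fromstring(
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>'
        '<publicationStmt>'
        f'<idno type="volume">{volume}</idno><idno type="issue">{issue}</idno>'
        '</publicationStmt></fileDesc></teiHeader><text/></TEI>'
    )

def issues_of_volume(root, volume):
    # Child -> parent map, standing in for the '..' step of XPath B.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # XPath A: the volume <idno> elements in the header with the wanted content.
    path_a = f"./t:teiHeader//t:idno[@type='volume'][.='{volume}']"
    # XPath B: evaluated once for each node selected by A.
    path_b = "./t:idno[@type='issue']"
    return [b.text
            for a in root.findall(path_a, NS)
            for b in parent_of[a].findall(path_b, NS)]

def rank(records):
    """Rough equivalent of `sort | uniq -c | sort -nrs`."""
    counted = sorted(Counter(records).items())                  # sort | uniq -c
    return sorted(counted, key=lambda kv: kv[1], reverse=True)  # sort -nrs (stable)

# Invented corpus of (volume, issue) pairs.
corpus = [make_article(v, i) for v, i in
          [("004", "2"), ("004", "1"), ("003", "1"), ("004", "2"),
           ("004", "3"), ("004", "2"), ("004", "1")]]

values = [issue for article in corpus for issue in issues_of_volume(article, "004")]
print(rank(values))  # [('2', 3), ('1', 2), ('3', 1)]
```

As in the shell version, the selecting and the counting are separate steps, and each can be read on its own.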

It is not particularly difficult to whip up an XSLT or XQuery program that reports the same information in a single XPath. But I submit that, presuming the reader has a good knowledge of the host languages involved (in this case bash on one side and either XSLT or XQuery on the other), the shell one-liner is far easier to write, and somewhat easier to understand.

For example, the following XPath (3.1) does a similar job.

        let $articleCorpus := collection('/path/to/articles?select=*.xml')
        return
          let $results :=
            let $v4issues := $articleCorpus/TEI/teiHeader[.//idno[@type eq 'volume'] eq '004']//idno[@type eq 'issue']/text()
            return
              for $thisIssueNum in distinct-values( $v4issues!sort(.) )
              return count( $v4issues[ . eq $thisIssueNum ] )||' '||$thisIssueNum||'&#x0A;'
          return sort( $results, (), function($result) { tokenize($result)[1] cast as xs:integer } )

^[21]

The result is only “similar” because while the same values are returned in the same order, the whitespace is not quite as nice. Note that this is not a direct comparison of one XPath to two, in that in my shell one-liner (using “divide to comprehend”) the job of counting occurrences and sorting the results has been factored out of the realm of XPath into a simple bash pipeline. If we similarly factor out that work into the host language (in the example below XSLT) we see that a major portion of the advantage of my bash one-liner is that this process of sorting the unique values of the sequence by how many times each occurs is so tersely expressed.

        <xsl:variable name="v4issues" select="$allArticles/TEI/teiHeader[.//idno[@type eq 'volume'] eq '004']//idno[@type eq 'issue']/text()"/>
        <xsl:variable name="uniq_v4issues" select="distinct-values( $v4issues )"/>
        <xsl:for-each select="$uniq_v4issues">
          <xsl:sort select="count( $v4issues[ . eq current() ] )"/>
          <xsl:sequence select="count( $v4issues[ . eq current() ] )||' '||.||'&#x0A;'"/>
        </xsl:for-each>

Distressing dogmatic division

Of course, it is not the case that for every XPath that can be divided into two, doing so improves readability and comprehension. Consider the following single XPath designed to be used to query a Subversion log file that has been created with the --xml and --verbose switches.

        /log/logentry[author = 'syd']/paths/path[ contains(., '.odd') ]!concat( ../../date, ' revision ', ../../@revision, '&#x0A; : ', normalize-space(../../msg) )

The above is a long and unwieldy XPath, somewhat hard to understand quickly. But dividing it up recursively into nearly atomic steps (here using the host language XSLT), I claim, results in code that is even harder to comprehend in its totality. (It is likely somewhat easier to debug, though. This is because when debugging we often want to examine each minute step, rather than the totality.)

  <xsl:template name="xsl:initial-template" match="/">
    <xsl:variable name="inputDocument" select="/" as="document-node()"/>
    <xsl:variable name="outermost" select="$inputDocument/*" as="element(log)"/>
    <xsl:variable name="allEntries" select="$outermost/logentry" as="element(logentry)+"/>
    <xsl:variable name="sydEntries" select="$allEntries[ author = 'syd']" as="element(logentry)*"/>
    <xsl:variable name="sydPathss" select="$sydEntries/paths" as="element(paths)*"/>
    <xsl:variable name="sydPaths" select="$sydPathss/path" as="element(path)*"/>
    <xsl:variable name="sydODDPaths" select="$sydPaths[ contains( ., '.odd')]" as="element(path)*"/>
    <xsl:apply-templates select="$sydODDPaths"/>
  </xsl:template>

  <xsl:template match="path">
    <xsl:variable name="date" select="../../date" as="xs:string"/>
    <xsl:variable name="boilerplate1" select="' revision '" as="xs:string"/>
    <xsl:variable name="revision" select="../../@revision" as="xs:string"/>
    <xsl:variable name="boilerplate2" select="'&#x0A; : '" as="xs:string"/>
    <xsl:variable name="message" select="normalize-space(../../msg)" as="xs:string"/>
    <xsl:sequence select="$date||$boilerplate1||$revision||$boilerplate2||$message||'&#x0A;'"/>
  </xsl:template>

However, not surprisingly, dividing it into a small number of XPaths (in this case just two) yields much more readable and comprehensible code than either of the above.

  <xsl:template name="xsl:initial-template" match="/">
    <xsl:variable name="paths" select="/log/logentry[author='syd']/paths/path[contains(.,'.odd')]"/>
    <xsl:sequence select="$paths!concat( ../../date, ' revision ', ../../@revision, '&#x0A; : ', normalize-space(../../msg) )"/>
  </xsl:template>
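The same two-step shape carries over to other host languages. As a hedged sketch (in Python with the standard library's xml.etree.ElementTree; the log fragment is invented, and because ElementTree supports neither contains() nor stepping up from a matched node, the first step keeps each path together with its log entry):

```python
import xml.etree.ElementTree as ET

# Invented fragment in the shape of `svn log --xml --verbose` output.
log = ET.fromstring("""<log>
  <logentry revision="101">
    <author>syd</author><date>2024-03-01T12:00:00Z</date>
    <paths><path action="M">/trunk/P5/tei.odd</path><path action="M">/trunk/README</path></paths>
    <msg>  tweak
      the ODD </msg>
  </logentry>
  <logentry revision="102">
    <author>other</author><date>2024-03-02T12:00:00Z</date>
    <paths><path action="M">/trunk/P5/other.odd</path></paths>
    <msg>unrelated</msg>
  </logentry>
</log>""")

# Step 1: select the .odd paths in entries by 'syd', remembering each entry.
selected = [(entry, path)
            for entry in log.findall("./logentry[author='syd']")
            for path in entry.findall("./paths/path")
            if ".odd" in (path.text or "")]

# Step 2: build one report string per selected path.
lines = [
    f"{entry.findtext('date')} revision {entry.get('revision')}\n : "
    + " ".join(entry.findtext("msg").split())   # like normalize-space()
    for entry, _path in selected
]

print("\n".join(lines))
```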

Appendix A. A tangent on the danger of implied attribute order

In XML the order of attributes is insignificant. That is, <song title="White Rabbit" composer="Grace Slick" performedBy="Jefferson Airplane"/> is exactly the same (informationally) as <song composer="Grace Slick" performedBy="Jefferson Airplane" title="White Rabbit"/>. Thus with either of those two elements as the input document, an XPath processor presented with for $a in /*/@* return name($a) might well return ('composer', 'performedBy', 'title'), but might instead return ('title', 'performedBy', 'composer').
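A brief sketch of the safe and unsafe ways to get at those attributes, in Python with the standard library's xml.etree.ElementTree (used here only for illustration): comparing the attributes as a name-to-value mapping, or asking for one by name, does not depend on order, whereas asking for "the first" attribute returns whatever order the parser in use happens to report.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring(
    '<song title="White Rabbit" composer="Grace Slick" performedBy="Jefferson Airplane"/>')
second = ET.fromstring(
    '<song composer="Grace Slick" performedBy="Jefferson Airplane" title="White Rabbit"/>')

# Order-independent: the two elements carry the same name/value pairs.
same_information = first.attrib == second.attrib

# Order-independent: select an attribute by name.
title_by_name = (first.get("title"), second.get("title"))

# Order-dependent, and therefore fragile: "the first attribute" is whatever
# this particular parser reports first for each element.
first_listed = (next(iter(first.attrib)), next(iter(second.attrib)))

print(same_information)  # True
print(title_by_name)     # ('White Rabbit', 'White Rabbit')
```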

Given that this is the case, why does XPath permit the expression @*[1] or attribute::*[ position() > count( parent::*/preceding-sibling::*[1]/@* ) ]? I think there might be an argument in favor of allowing them based on treating various types of nodes consistently. But I would be far more worried that a programmer who is somewhat unfamiliar with XML itself would use such a construct, and then perhaps years later the system would suffer a catastrophic failure because of a change in the XPath engine.

My first thought was that this could be addressed by having fn:position() return "NaN" when used in a predicate that is selecting from attribute::*. While this prevents the problematic formulations mentioned above, it has the weird side-effect that it would allow @*[ position() ne 'NaN'] (which selects none of them) and @*[ position() eq 'NaN'] (which selects all of them).

References

[ERHobx] Harold, Elliotte Rusty. Obscuring XML. Proceedings of Extreme Markup Languages®. Idealliance 2005.

[XDMspec] Walsh, Norman, John Snelson, and Andrew Coleman, eds. XQuery and XPath Data Model 3.1. 2017-03-21. World Wide Web Consortium. (Accessed 2024-04-03.)

[Kay2] Kay, Michael. XSLT 2.0 and XPath 2.0, 4th edition. Wiley Publishing, Inc., Indianapolis, 2008. pp. 45–67.

[TEIP2] Sperberg-McQueen, C. M. and Lou Burnard, eds. Guidelines for Electronic Text Encoding and Interchange 1992–1993; ACH, ACL, & ALLC. (This document is not directly available on the web, but an archive of the plain text can be downloaded from the TEI vault. The section quoted here (16.3) is in file p2sa.doc.)

[XPath1§2] Clark, James and Steve DeRose, eds. XML Path Language (XPath) Version 1.0, section 2, Location Paths. 1999-11-16. World Wide Web Consortium. (Accessed 2024-03-30.)

[XPath2§3.2] Berglund, Anders, Scott Boag, Don Chamberlin, Mary F. Fernández, Michael Kay, Jonathan Robie, and Jérôme Siméon, eds. XML Path Language (XPath) 2.0 (Second Edition), section 3.2, Path Expressions. 2010-12-14. World Wide Web Consortium. (Accessed 2024-03-30.)

[XPath3.1§3.3] Robie, Jonathan, Michael Dyck, and Josh Spiegel, eds. XML Path Language (XPath) 3.1, section 3.3, Path Expressions. 2017-03-21. World Wide Web Consortium. (Accessed 2024-03-30.)

[HyTime] ISO - ISO/IEC 10744:1997 - Information technology — Hypermedia/Time-based Structuring Language (HyTime), Edition 2. ISO, 1997.

[DDMHW] DeRose, Steven and David Durand. Making Hypermedia Work: A User’s Guide to HyTime. Kluwer Academic Publishers, 1994 (ISBN 0-7923-9432-1).

[NTW1] Tovey-Walsh, Norman. On the xml.com Slack workspace, #xpath-ng channel, 2024-07-11T08:05:03Z.

[W.DandC] Divide-and-conquer algorithm. Wikipedia. Wikimedia Foundation, 2024-04-16T22:14, https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm.

^[1] This is not entirely true. I actually remembered the quote as “Don’t underestimate the power of the dark side!”, but on researching this paper discovered I was simply wrong.

^[2] When I say obvious I mean the need for a feature that allowed pointing into SGML documents based on hierarchy etc., was obvious, not that the particular details of the extended pointer syntax were obvious.

^[3] Quite presciently, though, the name of the only foreign (i.e., non-TEI) system mentioned is XFORM.

^[4] However, I often think of myself as more of an XML Data Hygienist.

^[5] In truth, if a student wanted to learn one and only one search technology, I would likely recommend regular expressions, which are, as Martin Holmes has said, the next thing to learn after you learn to type.

^[6] Which does not mean it is not useful for anything else. For example, the audience at XML Prague 2024 seemed to conclude that not only is it also useful for operating over JSON, it may be better than JSONPath for operating over JSON.

^[7] Using the W3C Schema regular expression language:

        <hi(\s+\i\c*\s*=\s*('[^']*'|"[^"]*"))*>\s+</hi>

But in truth most other regular expression languages do not have shortcuts for NameStartChar (\i) and NameChar (\c), so for truly arbitrary XML in those languages the regular expression is much longer. In fact in the rare cases where I have wanted to do this, I have taken a quick look at the list of all attribute names first (using, e.g., `xsel -t -m "//@*" -v "name(.)" -n /path/to/files.xml | rank`), so the regular expression could be comparatively concise.

^[8] Or, in XPath 1.0 either of the following:

        //tei:hi[ string-length(.) > 0  and  normalize-space(.) = '' ]

        //tei:hi[ normalize-space( concat( '☮', ., '☮' ) ) = '☮&#x20;☮' ]

or …

^[9] In October of 2010 while presenting at a workshop on encoding manuscripts at the University of Nebraska Lincoln I remember realizing quite suddenly that none of the dozen or so participants had been taking notes when I off-handedly used this feature of oXygen to demonstrate something, and immediately at least half the room started jotting it down and there were one or two requests for a repeat. These folks understood how important XPath is, but had not been aware of the oXygen feature.

^[10]

some commandline XPath tools

xidel
BaseX can be used in standalone command-line mode
xmlsh
Perl-based
- xpath
- xsh
- xml_grep
libxml2-based and libxslt-based
- xmllint
- xmlstarlet
Saxon-based
- Saxon (XQuery)
- Saxon’s Gizmo
- saxon-lint
Python-based
- python-lxml
Ruby experts can also write Ruby one-liners using REXML or nokogiri

^[11] By which I mean ISO Schematron, but for the purposes of this discussion, version does not really matter.

^[12] Not to mention Schematron Quick Fixes.

^[13] I call the output “prettier”, not “pretty”, because in my experience no indentation algorithm produces exactly the indentation a given user wants, but they very often produce a close approximation thereof.

^[14] I am not limiting divide and conquer to the actual algorithm paradigm described, e.g. in W.DandC, but rather using it to denote the general approach of dividing large problems or large chunks of code into smaller problems or small snippets of code.

^[15] A game at which APL is always the clear winner. For an undergraduate assignment I once wrote a 42-line APL program: 1 line of housekeeping, 1 line of code, and 40 lines of commentary to explain the 1 line of code. I’m sure others have even more impressive horror stories.

^[16] Actually, the alias I use provides over a dozen other namespace bindings, but they are not relevant to this example.

^[17] Given that there are no <tei:idno type="volume"> elements in the contents of any of the articles, I could have just used //t:idno as the beginning of the XPath, but a) I did not know that at the time, and b) in theory this XPath should be mildly faster than just //t:idno, especially if there are lots of articles with lots of nodes, particularly <tei:idno> nodes, within the /t:TEI/t:text. I measured the time difference in a not particularly rigorous test, and found that while the longer XPath was a wee bit faster, the speed difference was such that I would have to execute it hundreds of times to make up for the extra time it took for me to type the longer XPath — and I am a reasonably fast typist.

This observation perfectly matches my (perhaps imperfect) recollection of Michael Kay’s advice on XSLT efficiency. Mr. Kay has said on more than one occasion that the stylesheet writer is not likely to be able to predict what the optimizer does, and thus what is fast and what is not. The corollary is not to bother worrying about which XPath or XSLT construct is computationally more efficient until a problem arises (i.e., until one’s program is problematically slow), and at that point actually timing components is a better approach than reasoning about which construct is likely faster.

^[18] However, it is common to use more than one -v option, so we may need to call them XPath ‘B1’, ‘B2’, etc.

^[19] In truth, as with most shell commands, this is really just a list of filepaths. If a simple glob will not get the set of files I want, I have often used a more complex method of listing the desired files. Here are some examples.

/path/to/articles/000[12]*/art*.xml	The list of files in the subdirectories of `/path/to/articles/` which start with `0001` or `0002`, and whose filename starts with `art`. (Imagine a corpus in which the articles are stored in files named “art_[author]”, where “[author]” is the first three letters of the primary author’s surname followed by the month and day of their date of birth, each stored in one of thousands of sequentially numbered directories each named with a 5-digit number. This glob gets only the 1XX and 2XX series articles, but avoids the extraneous files `schemas.xml` (generated by Emacs/nxml), `bio.xml` (provided by the author), etc., as these do not start with `art`.)
/path/to/TEI/P5/Source/Specs/teidata.*.xml /path/to/TEI/P5/Source/Specs/macro.*.xml	The list of files that define TEI datatypes and macros, avoiding the files that define classes, elements, or modules.
$( find release -name '*.xsl' -o -name '*.xslt' -o -name '*.sch' -o -name '*.isosch' )	The list of all XSLT files (whether named `.xsl` or `.xslt`) and all Schematron files (whether named `.sch` or `.isosch`) anywhere within the `release/` directory (whether directly in `release/`, or in `release/validation/`, `release/stylesheets/`, or `release/lib/`, or …).
$( xsel -t -m "/*/c:group[@id='published']/c:uri" -v "@uri" /path/to/catalog.xml )	The list of published articles extracted from an XML catalog. (Presumably unpublished files are stored in a different `<c:group>`.)
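
The `find` approach can be tried out safely on a throwaway directory tree. What follows is only a sketch: the scratch directory and the file names in it are made up for illustration, and it assumes a POSIX-style shell (such as bash) in which an unquoted variable is split into words.

```shell
# Hypothetical scratch tree standing in for the release/ directory.
demo=$(mktemp -d)
mkdir -p "$demo/release/stylesheets" "$demo/release/validation"
touch "$demo/release/stylesheets/a.xsl" "$demo/release/stylesheets/b.xslt" \
      "$demo/release/validation/c.sch" "$demo/release/validation/d.isosch" \
      "$demo/release/README.txt"

# Quoting each pattern hands the wildcard to find itself, so it is matched
# against file names at every depth rather than expanded once by the shell.
files=$( cd "$demo" && find release -name '*.xsl' -o -name '*.xslt' \
                                     -o -name '*.sch' -o -name '*.isosch' )

# Leaving $files unquoted relies on word splitting to separate the paths,
# so this only suits file names that contain no whitespace.
printf '%s\n' $files | sort
```

The four XSLT and Schematron files are listed and `README.txt` is not. For file names that may contain spaces, letting `find` run the command itself (with `-exec … {} +`) is the more robust choice.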

^[20] The idea that this pattern of commands — sort, uniq(ue), count, sort on the count — is so commonly helpful in corpus analysis that it deserves its own shorthand is something I picked up from a chat with Steve DeRose during a break at one of the first Balisage conferences. A quick, unscientific count says that I have used this shorthand in ~6.2% of the last 19,524 commands I have issued.
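
For readers who have not met the pattern, here is a minimal sketch of such a shorthand as a shell function. The name `tally` is hypothetical (it is not the name of the actual shorthand), and putting the most frequent value first is likewise an assumption.

```shell
# Hypothetical shorthand: sort the lines, collapse duplicates while
# counting them, then sort numerically on the count, largest first.
tally() {
  sort | uniq -c | sort -rn
}

# Usage: tally a small list of values, one per line.
printf '%s\n' 004 012 004 007 004 012 | tally
```

Here `004` is reported first with a count of 3, then `012` with 2, then `007` with 1. The exact padding in front of each count varies between implementations of `uniq`.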

^[21] The astute reader might question the use of the eq item comparison operator rather than the = sequence comparison operator for that first equality check. After all, although '004' is guaranteed to be a single item, .//idno[@type eq 'volume'] might very well return a sequence of two or more <idno> elements, in which case = would still work fine and eq would result in an error. However, for this particular corpus there is a rule that there should be one and only one <idno type="volume"> in each article. Thus if there were two or more, I wanted to know about it. (Since this rule is schema-enforced, there were none.)

Author's keywords for this paper:

XML; XPath; Schematron; command line

Syd Bauman

Senior XML Programmer / Analyst

Northeastern University / Library / DSG / WWP

`<s.bauman@northeastern.edu>`

Syd Bauman became a hard-core computer user in 1982, and a devotee of descriptive markup two years later. He began using SGML and the TEI when he came to the Women Writers Project in 1990. Although his title would have you believe that he is a computer programmer, Syd is fond of pointing out that he doesn’t write that much actual code, but usually writes in XSLT, and his programs are always free (as in speech). From 2001 to 2007 Syd served as North American editor of the TEI, and is currently on the TEI Technical Council.

Balisage: The Markup Conference

Balisage Paper: Two Paths are Better than One

Balisage Series on Markup Technologies