Note to the reader
A link to an updated version (or even a newer edition) of this paper may be available on the WWP bibliography page.
Introduction
In 1980 the long-lost father of the protagonist of a
reasonably popular science fiction motion picture gave some
suspect advice to his son: “If you only knew the power of
the dark side!”
I sometimes think to myself “If you only
knew the power of an XPath!”.[1] Not that I
think that XPath is analogous to the dark side or will lead to any
kind of galactic conquest, let alone that of an evil fascist
empire. Rather it is because I see XPath as the underlying Swiss
army knife that leverages the power of XML data, that slices,
dices, chops, shreds, purees — but wait, there’s more.
But it has recently occurred to me — and this is a thought that may have occurred to many of you years ago (I’m a bit slow) — that XPaths are synergistic: that the power of two XPaths is greater than twice the power of one.
I should confess that when I say “XPath” herein I am primarily (but not always) considering what XPath 1.0 calls a “Location Path” (XPath1§2) and XPath 2.0 and 3.1 call a “path expression” (XPath2§3.2 and XPath3.1§3.3).
Brief History
When TEI P2 (TEIP2) was published in 1993
it included what seemed to me to be both a pretty obvious and
pretty exciting new feature — the extended pointer
syntax.[2] Roughly analogous to XPath, it
permitted selection of a single sequence of PCDATA characters from
an SGML document based on a combination of one or more tests of
its position in the element tree, character offset, or pattern
matching. It also allowed selection of bits of graphical or spatio-temporal data (e.g., for referring to sections of images or 3D models), and anything that could be referred to using HyQ (HyTime; also DDMHW) or a foreign mechanism, but these systems were not defined fully by the TEI Guidelines.[3]
While the fact that an extended pointer only refers to PCDATA is a significant limitation of the TEI extended pointer mechanism, the fact that it only points to one sequence of characters is not. TEI has other mechanisms for combining the various snippets of text pointed at by a set of pointers into one.
The major point for our purposes here, though, is that
although the actual syntax was quite
different, the concept of selecting a portion of an (SG|X)ML tree
structure by taking steps in particular directions within that
structure is essentially the same. TEI extended pointers even used
directional names similar to the names of what are called axes in
XPath (the main difference being that “following” was called “next”).
XPath
XPath 1.0 itself became a recommendation in 1999; the
latest version, 3.1, in 2017; and an updated “next generation” version 4.0 is being worked on
currently. While 3.1 offers significant improvements over 1.0,
the basic underlying mechanism for addressing nodes in the XML
tree remains the same. In fact, the tool I use most often for
simple XPath execution (i.e., use of XPath other than in XSLT or
Schematron) is based on XPath 1.0.
Main Methods of XML Analysis
My job title is Senior XML Programmer/Analyst,[4]
so it is not surprising that I often find myself tasked with
analyzing an XML corpus, usually looking for insights into the
encoding or for inconsistencies (because inconsistencies are
often, although not always, errors). The two major categories of
tools I see used for analyzing XML corpora are regular expression
string searches and XPaths. The regular expressions are sometimes
used in a relatively simple manner (e.g., grep) and
sometimes wrapped in a full programming language (e.g.,
JavaScript, Perl, Python, or Ruby). The XPaths are sometimes used
in a relatively simple manner (e.g., one of the commandline XPath
processors listed below) and sometimes wrapped in Schematron,
XQuery, or XSLT.
Regular expressions are ubiquitously available and easy to learn. Moreover, because a string search (whether a “plain” search, “wildcard” search, or a full-fledged regular expression search) operates over the characters used as the actual contents of the file(s), regular expressions are useful in all sorts of non-XML cases.[5]
XPath, on the other hand, while extraordinarily powerful, is a relatively niche technology: Its only intended use case is for operating over XML documents.[6] Thus finding tutorials and software for XPath, while not difficult, is far harder than finding tutorials or software for regular expressions.
But more important than their relative availability, the major difference between regular expression string searches and XPath is that the former operate over the characters used to serialize the XML, i.e., the actual contents of the file(s), whereas XPath operates over the XML tree after it has been parsed, i.e., over the XDM (XDMspec; also Kay2). Thus for working with XML data, regular expressions have the occasionally useful advantage of allowing you to search for things that are not part of the XDM (for example, the whitespace characters before an attribute specification). On the other hand, XPath has the very frequently useful advantage of not making you work around things that are not part of the XDM (for example, whether an attribute value is delimited with double quote characters (‘"’, U+0022, which SGML called LIT) or single quote characters (‘'’, U+0027, which SGML called LITA)).
A colleague recently recommended a simple regular expression
search to find, within a particular XML corpus, all of the
<hi>
elements that contain nothing but whitespace:
$ egrep '<hi[^>]*>\s+</hi>' /files/in/corpus/*.xml
And this works perfectly well for her case, because in her corpus the character ‘>’ is not permitted in an attribute value, and because she does not mind finding cases of <hi> </hi> in comments or processing instructions.
But in arbitrary XML the character ‘>’ might be in an attribute value. So she would need either a much more complicated regular expression[7] or a somewhat simpler XPath path expression, for example, in XPath 3.1:
//hi[ matches( ., '^\s+$') ][8]
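The difference between the two approaches can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not my colleague's actual workflow: the sample document and the function names are invented, and the tree-based check only approximates the XPath above by testing for the four XML whitespace characters.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample: two whitespace-only <hi> elements (one with '>' in an
# attribute value), one whitespace-only <hi> inside a comment, one with text.
SAMPLE = """<p>
  <hi rend="bold"> </hi>
  <hi rend="a>b">   </hi>
  <!-- <hi> </hi> -->
  <hi>text</hi>
</p>"""

# The same pattern as the egrep command, applied to the serialized characters.
HI_REGEX = re.compile(r"<hi[^>]*>\s+</hi>")
# XML whitespace only (space, tab, CR, LF), at least one character.
XML_WS = re.compile(r"[ \t\r\n]+")


def regex_hits(xml_text):
    """Return the substrings that the string search reports."""
    return HI_REGEX.findall(xml_text)


def tree_hits(xml_text):
    """Return <hi> elements whose string value is non-empty whitespace."""
    root = ET.fromstring(xml_text)  # the default parser discards comments
    return [hi for hi in root.iter("hi")
            if XML_WS.fullmatch("".join(hi.itertext()))]


print(regex_hits(SAMPLE))
print([hi.get("rend") for hi in tree_hits(SAMPLE)])
```

On this sample the string search misses the element whose attribute value contains ‘>’ and reports the one inside the comment, while the search over the parsed tree does the reverse.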
XPath Tools for XML Analysis
So unless one needs to look for features that are not part of the XDM, XPath is probably the best tool for analysis of XML files, and certainly the one I use and recommend the most. There are quite a few systems that will allow you to apply a single XPath to an XML file and give you the result. The oXygen XML Editor, for example, has an “XPath Toolbar” that allows the user to enter an XPath which can be executed against the current file or a set of files, although when executed against a set of files what you get is separate results for each file, not the combined results for all files.[9]
Commandline
While I love oXygen (it is the only payware on my main system), I primarily work on the commandline. There are, not surprisingly, quite a few utilities for using XPath on the commandline.[10] My favorite, by far, is xmlstarlet, but I do not claim that it is the best, only that I like it the most, probably because it was the first one I started using seriously. I tried writing my own a few years ago, and while “miserable failure” would be an exaggeration, “roaring success” would be a far greater exaggeration. My program, a bash script based on the xpath++ command, had the signature
$ xpath.bash [-1|-2|-3] <XPath expression> <XML file>
I wrote it, in part, because I felt a bit too constrained by the XPath 1.0 limitation of xmlstarlet. But even on those rare occasions when my program was working, I found that it was not powerful enough, even though it permitted XPath 3.1. It slowly dawned on me that what made (my use of) xmlstarlet so powerful was the capability to do something else with the results of XPath A, for example to execute an XPath B from every selected node. After all, that’s what makes Schematron so powerful, right?
Schematron
Schematron,[11] which can be very
useful for querying XML data, is basically just that — the
capability to test any XPath B (the @test attribute) from every node selected by any XPath A (the @context attribute). Sure, it adds other twists
to this basic concept — for starters, every XPath B is
evaluated as a Boolean. But it also has the capability to
insert an XPath into your output message (with <sch:value-of>), abstract patterns, the clever feature that for any given node only the first matching XPath A within a <pattern> is executed, and importantly the ability to store the results of an XPath as a
“variable”.[12] And although those extra capabilities
and scaffolding are important, it seems to me that the power of Schematron comes from this
basic capability to evaluate XPath B from each node selected
by XPath A.
As an example, consider the following pattern which tests that the values of @maxOccurs and @minOccurs are usable.
<sch:pattern id="att.repeatable-MINandMAXoccurs">
  <sch:rule context="*[ @minOccurs and @maxOccurs ]">
    <sch:let name="min" value="@minOccurs cast as xs:integer"/>
    <sch:let name="max" value="if ( normalize-space( @maxOccurs ) eq 'unbounded') then -1 else @maxOccurs cast as xs:integer"/>
    <sch:assert test="$max eq -1 or $max ge $min">
      @maxOccurs should be greater than or equal to @minOccurs
    </sch:assert>
  </sch:rule>
  <sch:rule context="*[ @minOccurs and not( @maxOccurs ) ]">
    <sch:assert test="@minOccurs cast as xs:integer lt 2">
      When @maxOccurs is not specified, @minOccurs must be 0 or 1
    </sch:assert>
  </sch:rule>
</sch:pattern>
This pattern uses four XPath tests and two assignments. (The values that are assigned are expressed in XPath as well, but this is not particularly germane to the argument here.) I submit that if the job of these four XPaths were assigned to a single XPath, the result would be significantly harder to decipher, whether intermediate variables are used or not. Here is the same test using intermediate variables (as the Schematron does):
if ( @minOccurs and not( @maxOccurs ) )
then if ( @minOccurs cast as xs:integer ge 2 )
     then 'When @maxOccurs is not specified, @minOccurs must be 0 or 1'
     else ''
else if ( @minOccurs and @maxOccurs )
     then let $max := if ( normalize-space( @maxOccurs ) eq 'unbounded')
                      then -1
                      else @maxOccurs cast as xs:integer,
              $min := @minOccurs cast as xs:integer
          return if ( $max eq -1 or $max ge $min )
                 then ''
                 else '@maxOccurs should be greater than or equal to @minOccurs'
     else ''
Without the intermediate variables comprehension is marginally more difficult:
if ( @minOccurs and @maxOccurs )
then if ( @maxOccurs eq 'unbounded'
          or ( if ( @maxOccurs castable as xs:integer )
               then xs:integer( @maxOccurs ) ge xs:integer( @minOccurs )
               else false() ) )
     then ''
     else '@maxOccurs should be greater than or equal to @minOccurs'
else if ( @minOccurs and not( @maxOccurs ) )
     then if ( @minOccurs cast as xs:integer gt 1 )
          then 'When @maxOccurs is not specified, @minOccurs must be 0 or 1'
          else ''
     else ''
And, not surprisingly, without whitespace (or color syntax highlighting or some other aid for the human reader), all hope is lost:
if (@minOccurs and not(@maxOccurs)) then if (@minOccurs cast as xs:integer ge 2) then 'When @maxOccurs is not specified, @minOccurs must be 0 or 1' else '' else if (@minOccurs and @maxOccurs) then let $max := if (normalize-space(@maxOccurs) eq 'unbounded') then -1 else @maxOccurs cast as xs:integer, $min := @minOccurs cast as xs:integer return if ($max eq -1 or $max ge $min) then '' else '@maxOccurs should be greater than or equal to @minOccurs' else ''
While the loss of whitespace would also make the XSLT version very hard to read, lots of XML tools will easily indent it in a reasonable fashion. (E.g., oXygen has a “Format and Indent” capability; or xmllint --format ugly_input.xml > prettier_output.xml.[13])
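The basic mechanism of the pattern above, testing an XPath B from each node selected by an XPath A, with only the first matching rule applied to any given node, can be mimicked in a few lines of Python. This is only a rough sketch and not how any Schematron processor is implemented: the rule table, the function names, and the sample document are all invented, and plain Python callables stand in for the XPaths.

```python
import xml.etree.ElementTree as ET


def max_value(elem):
    """Value of @maxOccurs as an int, with 'unbounded' mapped to -1."""
    raw = " ".join(elem.get("maxOccurs").split())  # akin to normalize-space()
    return -1 if raw == "unbounded" else int(raw)


# Each rule: a context check ("XPath A"), a test ("XPath B"), and a message.
RULES = [
    (lambda e: "minOccurs" in e.attrib and "maxOccurs" in e.attrib,
     lambda e: max_value(e) == -1 or max_value(e) >= int(e.get("minOccurs")),
     "@maxOccurs should be greater than or equal to @minOccurs"),
    (lambda e: "minOccurs" in e.attrib and "maxOccurs" not in e.attrib,
     lambda e: int(e.get("minOccurs")) < 2,
     "When @maxOccurs is not specified, @minOccurs must be 0 or 1"),
]


def validate(root):
    """Return (element name, message) for every failed assertion."""
    messages = []
    for elem in root.iter():
        for in_context, test, message in RULES:
            if in_context(elem):
                if not test(elem):
                    messages.append((elem.tag, message))
                break  # only the first matching rule applies to a node
    return messages


# Invented sample document.
DOC = ET.fromstring(
    '<spec><a minOccurs="1" maxOccurs="3"/><b minOccurs="4" maxOccurs="2"/>'
    '<c minOccurs="2"/><d minOccurs="0" maxOccurs=" unbounded "/><e/></spec>')

for name, message in validate(DOC):
    print(name, message)
```

Even in this toy form, each context check and each test stays small enough to read on its own.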
Two-path power
Real or imagined?
Of course this assertion, that two XPaths are more
powerful than one, is not strictly true. Executing two XPath
path expressions in a row is not much different from tacking
one onto the other. That is, executing XPath B from each node selected by XPath A is not significantly different from XPath A / XPath B. Furthermore, in XPath 3.1 we have the simple mapping operator (‘!’) so if items instead of nodes are involved, one could use XPath A ! XPath B.
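The equivalence is easy to see in a small sketch. The following Python uses the limited XPath subset of the standard library's ElementTree; the document and the two paths are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
DOC = ET.fromstring(
    "<list><item><name>a</name></item>"
    "<item><name>b</name><name>c</name></item>"
    "<other><name>d</name></other></list>")

PATH_A = "./item"   # hypothetical XPath A
PATH_B = "./name"   # hypothetical XPath B


def two_step(root):
    """Execute XPath B from each node selected by XPath A."""
    return [name for item in root.findall(PATH_A)
            for name in item.findall(PATH_B)]


def one_step(root):
    """Execute the single combined path A/B."""
    return root.findall("./item/name")


print([n.text for n in two_step(DOC)])
print([n.text for n in one_step(DOC)])
```

With child steps like these there can be no duplicates. With other axes the ‘/’ operator of XPath also removes duplicate nodes and returns them in document order, which a plain loop like `two_step` would not do.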
So if my two XPaths (A and B) could have easily been expressed as a single XPath (let’s call it AB), then why do I think it so much better to be able to express them separately? For at least two reasons. First because of human cognition, and second because of what else can be done in addition to XPath B at each node selected by XPath A.
Human cognition — bears of very little brain
When you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it.
— A.A. Milne, Winnie-the-Pooh
Far be it from me to imply that any in the audience or reading this paper is of very little brain, except insofar as, like me, your brain is generally bigger when you write a complex XPath (or any other snippet of complicated code) than it is when you try to read and understand it months or years later. And whether the author of an XPath is of big brain or not, it is almost always the case that someday someone of smaller brain (or, at least, of lesser understanding of the task at hand) will be reading it.
That is, I assert that readability is usually a more important
feature of an XPath in particular, and computer code in general,
than is brevity or execution efficiency. For starters, “[p]rogrammers think they get paid to write code, but the truth is, we spend a lot more time reading it than writing it” (NTW1). Given that programmer time in general,
and my time in particular, is
quite valuable, writing XPaths so that they can be read and
understood quickly is valuable.
Thus the “divide and conquer” approach[14] to writing complicated chunks of code, including XPaths, can be quite advantageous. This method, of course, is a method for making it easier to write code; my contention here is that it can also make it easier to read the code. Thus I am thinking of this as “divide for comprehension” instead (although perhaps “divide to comprehend” is snappier).
As a simple example, the equivalent of the following XPath appears in an obfuscation program I wrote (inspired by and in part based on ERHobx):
( ( ( $seed * $b + 1 ) mod $m ) + 1 + $low + $me ) mod ( $high - $low + 1 ) + $low
While it is not impossible to figure out what is going on here, this seems like a reasonable candidate for the “Bet you can’t tell me what this does” game.[15] (It helps to know that all the variables are declared as xs:integer, and that $me is between $low and $high.) When re-written as two XPaths (here using XSLT variables), it becomes much easier (although still not easy) to make sense of:
<xsl:variable name="random" select="( ( $seed * $b + 1 ) mod $m ) + 1" as="xs:integer"/>
<xsl:variable name="result" select="( $random + $low + $me ) mod ( $high - $low + 1 ) + $low" as="xs:integer"/>
While this code is still not at all obvious at first glance, after a few minutes puzzling one can figure out that $random is a pseudorandom number generated by a linear congruential approach, and $result is thus a pseudorandom number between $low and $high that is based on $me.
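The same two-step split can be written out in Python to check the claim about the range. This is a sketch, not the obfuscation program itself: the function names and the sample values for the seed, multiplier, and modulus are invented, and all values are assumed to be non-negative (where Python's `%` and XPath's `mod` agree).

```python
def next_random(seed: int, b: int, m: int) -> int:
    """Step 1: one linear congruential step, shifted into the range 1..m."""
    return ((seed * b + 1) % m) + 1


def obfuscated_value(random: int, me: int, low: int, high: int) -> int:
    """Step 2: fold the random number and `me` into the range low..high."""
    return (random + low + me) % (high - low + 1) + low


# Hypothetical values: code points of 'A'..'Z' as the range.
LOW, HIGH = 65, 90
SEED, B, M = 42, 75, 65537

results = [obfuscated_value(next_random(SEED, B, M), me, LOW, HIGH)
           for me in range(LOW, HIGH + 1)]
print(all(LOW <= r <= HIGH for r in results))
```

Because the final step takes a remainder modulo the size of the range and then adds `low`, the result cannot fall outside `low` to `high`, whatever the random number happens to be.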
It is worth noting, by the way, that the length of an XPath, even if cleverly measured in tokens or terms rather than as the raw number of characters, is not necessarily a good measure of its complexity. For example, the longest XPath I have ever written (before variable substitution, at least) is 45,992 characters long, or 41,994 without spaces, but is very easy to understand. It was written for the project I presented last year, and is nothing more than a single sequence of the first 3,999 positive integers represented in Roman numerals, except those that actually spell out an English word (i.e., ‘I’ and “MIX”) are recorded as a zero-length string instead.
Conversely, some reasonably short XPaths can be quite difficult to decipher. For example, the above example of pseudorandom number generation in a single XPath has only 59 characters other than ignorable whitespace.
Other things
In a corpus of a particular journal I was working with, each article is encoded as a TEI document which has a root <TEI> element which itself has two children: <teiHeader> for metadata and <text> for the content of the article (including its bibliography, etc.). Each article has, at a specific spot in the <teiHeader>, a pair of <idno> elements to specify to which volume and issue of the journal the article belongs. I wanted to ask the question “How many articles are in each issue of volume 4?”.
I performed this task as a single one-liner in bash:
$ xsel -t -m "/*/t:teiHeader//t:idno[@type='volume'][.='004']" -v "../t:idno[@type='issue']" -n /path/to/articles/*.xml | rank
That takes some unpacking.
xsel |
A shell alias to |
-t |
Short for |
-m |
Short for |
/*/t:teiHeader//t:idno[@type='volume'][.='004'] |
An XPath (1.0) that selects all of the (metadata)
|
-v |
The first option of the template body, short for
|
../t:idno[@type='issue'] |
An XPath (1.0) that grabs the sibling
|
-n |
The second option of the template body, short for
|
/path/to/articles/*.xml |
A glob pattern that selects the desired files to process.[19] |
| |
The Unix pipe redirector. The output (STDOUT, in
this case of |
rank |
A shell alias for |
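For readers who do not have xmlstarlet to hand, the shape of the one-liner (XPath A to find the volume, XPath B from each match to fetch the issue, then count and sort) can be sketched in Python with the standard library. This is an approximation under assumptions, not a translation of my aliases: the function names and the miniature article are invented, the `rank` function is assumed to list counts in ascending order, and the text of each <idno> is assumed to be a single text node.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"t": "http://www.tei-c.org/ns/1.0"}  # the TEI namespace

# Invented miniature article; the real header layout is an assumption here.
ARTICLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>'
    '<publicationStmt><idno type="volume">{volume}</idno>'
    '<idno type="issue">{issue}</idno></publicationStmt></fileDesc>'
    '</teiHeader><text><body><p/></body></text></TEI>'
)


def issues_of_volume(root, volume="004"):
    """XPath A, then XPath B from each node that A selects."""
    for header in root.findall("t:teiHeader", NS):
        # ElementTree has no parent axis, so build a child-to-parent map.
        parent_of = {child: parent
                     for parent in header.iter() for child in parent}
        for idno in header.iterfind(".//t:idno[@type='volume']", NS):
            if (idno.text or "") == volume:
                siblings = parent_of[idno].findall("t:idno[@type='issue']", NS)
                for issue in siblings:
                    yield issue.text or ""


def rank(values):
    """Count each distinct value; list the counts in ascending order."""
    return sorted(Counter(values).items(), key=lambda pair: (pair[1], pair[0]))


corpus = [ARTICLE.format(volume=v, issue=i)
          for v, i in [("004", "1"), ("004", "2"), ("004", "2"), ("003", "1")]]
issues = [issue for doc in corpus
          for issue in issues_of_volume(ET.fromstring(doc))]
for value, count in rank(issues):
    print(f"{count:7d} {value}")
```

Note how much of the Python is scaffolding that the commandline version gets for free from the shell glob, the pipe, and the two small XPaths.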
It is not particularly difficult to whip up an XSLT or XQuery program that reports the same information in a single XPath. But I submit that, presuming the reader has a good knowledge of the host languages involved (in this case bash on one side and either XSLT or XQuery on the other), the shell one-liner is far easier to write, and somewhat easier to understand.
For example, the following XPath (3.1) does a similar job.
let $articleCorpus := collection('/path/to/articles?select=*.xml')
return
  let $results :=
    let $v4issues := $articleCorpus/TEI/teiHeader[.//idno[@type eq 'volume'] eq '004']//idno[@type eq 'issue']/text()
    return
      for $thisIssueNum in distinct-values( $v4issues!sort(.) )
      return count( $v4issues[ . eq $thisIssueNum ] )||' '||$thisIssueNum||'
'
  return sort( $results, (), function($result) { tokenize($result)[1] cast as xs:integer } )[21]
The result is only “similar” because while the same values are returned in the same order, the whitespace is not quite as nice. Note that this is not a direct comparison of one XPath to two, in that in my shell one-liner (using “divide to comprehend”) the job of counting occurrences and sorting the results has been factored out of the realm of XPath into a simple bash pipeline. If we similarly factor out that work into the host language (in the example below XSLT) we see that a major portion of the advantage of my bash one-liner is that this process of sorting the unique values of the sequence by how many times each occurs is so tersely expressed.
<xsl:variable name="v4issues" select="$allArticles/TEI/teiHeader[.//idno[@type eq 'volume'] eq '004']//idno[@type eq 'issue']/text()"/>
<xsl:variable name="uniq_v4issues" select="distinct-values( $v4issues )"/>
<xsl:for-each select="$uniq_v4issues">
  <xsl:sort select="count( $v4issues[ . eq current() ] )"/>
  <xsl:sequence select="count( $v4issues[ . eq current() ] )||' '||.||'&#x0A;'"/>
</xsl:for-each>
Distressing dogmatic division
Of course, it is not the case that for every XPath that
can be divided into two, doing so improves readability and
comprehension. Consider the following single XPath designed to be used to query a Subversion log file that has been created with the --xml and --verbose switches.
/log/logentry[author = 'syd']/paths/path[ contains(., '.odd') ]!concat( ../../date, ' revision ', ../../@revision, '
 : ', normalize-space(../../msg) )
The above is a long and unwieldy XPath, somewhat hard to understand quickly. But dividing it up recursively into nearly atomic steps (here using the host language XSLT), I claim, results in code that is even harder to comprehend in its totality. (It is likely somewhat easier to debug, though. This is because when debugging we often want to examine each minute step, rather than the totality.)
<xsl:template name="xsl:initial-template" match="/">
  <xsl:variable name="inputDocument" select="/" as="document-node()"/>
  <xsl:variable name="outermost" select="$inputDocument/*" as="element(log)"/>
  <xsl:variable name="allEntries" select="$outermost/logentry" as="element(logentry)+"/>
  <xsl:variable name="sydEntries" select="$allEntries[ author = 'syd']" as="element(logentry)*"/>
  <xsl:variable name="sydPathss" select="$sydEntries/paths" as="element(paths)*"/>
  <xsl:variable name="sydPaths" select="$sydPathss/path" as="element(path)*"/>
  <xsl:variable name="sydODDPaths" select="$sydPaths[ contains( ., '.odd')]" as="element(path)*"/>
  <xsl:apply-templates select="$sydODDPaths"/>
</xsl:template>
<xsl:template match="path">
  <xsl:variable name="date" select="../../date" as="xs:string"/>
  <xsl:variable name="boilerplate1" select="' revision '" as="xs:string"/>
  <xsl:variable name="revision" select="../../@revision" as="xs:string"/>
  <xsl:variable name="boilerplate2" select="'&#x0A; : '" as="xs:string"/>
  <xsl:variable name="message" select="normalize-space(../../msg)" as="xs:string"/>
  <xsl:sequence select="$date||$boilerplate1||$revision||$boilerplate2||$message||'&#x0A;'"/>
</xsl:template>
However, not surprisingly, dividing it into a small number of XPaths (in this case just two) yields much more readable and comprehensible code than either of the above.
<xsl:template name="xsl:initial-template" match="/">
  <xsl:variable name="paths" select="/log/logentry[author='syd']/paths/path[contains(.,'.odd')]"/>
  <xsl:sequence select="$paths!concat( ../../date, ' revision ', ../../@revision, '&#x0A; : ', normalize-space(../../msg) )"/>
</xsl:template>
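The same two-way split, one step to select and one step to format, carries over to other host languages. As a rough illustration, here it is in Python with the standard library; the sample log and the function names are invented, and the element names are those discussed above for Subversion's XML log output.

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape of `svn log --xml --verbose` output.
SAMPLE_LOG = """<log>
  <logentry revision="101">
    <author>syd</author>
    <date>2024-01-02T03:04:05.000000Z</date>
    <paths>
      <path kind="file" action="M">/trunk/Specs/div.odd</path>
      <path kind="file" action="M">/trunk/README</path>
    </paths>
    <msg>tweak   the
    content model</msg>
  </logentry>
  <logentry revision="102">
    <author>someone</author>
    <date>2024-01-03T00:00:00.000000Z</date>
    <paths><path kind="file" action="A">/trunk/Specs/p.odd</path></paths>
    <msg>not selected</msg>
  </logentry>
</log>"""


def odd_paths_by(root, author):
    """Step 1: pair each matching <path> with its <logentry>."""
    return [(entry, path)
            for entry in root.findall("logentry")
            if entry.findtext("author") == author
            for path in entry.findall("paths/path")
            if ".odd" in (path.text or "")]


def describe(entry):
    """Step 2: format one report line from the <logentry>."""
    msg = " ".join((entry.findtext("msg") or "").split())
    return f"{entry.findtext('date')} revision {entry.get('revision')}\n : {msg}"


root = ET.fromstring(SAMPLE_LOG)
for entry, _path in odd_paths_by(root, "syd"):
    print(describe(entry))
```

Dividing any further, for example one function per element, would reproduce the over-division problem of the longer XSLT above.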
Appendix A. A tangent on the danger of implied attribute order
In XML the order of attributes is insignificant. That is,
<song title="White Rabbit" composer="Grace Slick" performedBy="Jefferson Airplane"/>
is exactly the same (informationally) as
<song composer="Grace Slick" performedBy="Jefferson Airplane" title="White Rabbit"/>.
Thus with either of those two elements as the input document, an XPath processor presented with for $a in /*/@* return name($a) might well return ('composer', 'performedBy', 'title'), but might instead return ('title', 'performedBy', 'composer').
Given that this is the case, why does XPath permit the expression @*[1] or attribute::*[ position() > count( parent::*/preceding-sibling::*[1]/@* ) ]? I think
there might be an argument in favor of allowing them based on
treating various types of nodes consistently. But I would be far
more worried that a programmer who is somewhat unfamiliar with XML
itself would use such a construct, and then perhaps years later
the system would suffer a catastrophic failure because of a change
in the XPath engine.
My first thought was that this could be addressed by having fn:position() return "NaN" when used in a predicate that is selecting from attribute::*. While this prevents the problematic formulations mentioned above, it has the weird side-effect that it would allow @*[ position() ne 'NaN'] (which selects none of them) and @*[ position() eq 'NaN'] (which selects all of them).
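The safer habit, in whatever language, is to compare attributes as an unordered set or to impose an explicit order before relying on position. A minimal Python sketch of that habit follows; the function names are invented, and the order in which any particular library reports attributes should be treated as an implementation detail.

```python
import xml.etree.ElementTree as ET

A = ET.fromstring('<song title="White Rabbit" composer="Grace Slick" '
                  'performedBy="Jefferson Airplane"/>')
B = ET.fromstring('<song composer="Grace Slick" '
                  'performedBy="Jefferson Airplane" title="White Rabbit"/>')


def same_attributes(x, y):
    """Compare attributes as name/value mappings; order plays no part."""
    return x.attrib == y.attrib


def attribute_names(elem):
    """Impose an explicit (alphabetical) order rather than trusting the parser."""
    return sorted(elem.attrib)


print(same_attributes(A, B))
print(attribute_names(A))
```

Code that instead takes “the first attribute” of either element is depending on behavior that XML itself does not promise.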
References
[ERHobx] Harold, Elliotte Rusty. Obscuring XML. Proceedings of Extreme Markup Languages®. Idealliance 2005.
[XDMspec] Walsh, Norman, John Snelson, and Andrew Coleman, eds. XQuery and XPath Data Model 3.1. 2017-03-21. World Wide Web Consortium. (Accessed 2024-04-03.)
[Kay2] Kay, Michael. XSLT 2.0 and XPath 2.0, 4th edition. Wiley Publishing, Inc., Indianapolis, 2008. pp. 45–67.
[TEIP2] Sperberg-McQueen, C. M. and Lou Burnard, eds. Guidelines for Electronic Text Encoding and Interchange 1992–1993; ACH, ACL, & ALLC. (This document is not directly available on the web, but an archive of the plain text can be downloaded from the TEI vault. The section quoted here (16.3) is in file p2sa.doc.)
[XPath1§2] Clark, James and Steve DeRose, eds. XML Path Language (XPath) Version 1.0, section 2, Location Paths. 1999-11-16. World Wide Web Consortium. (Accessed 2024-03-30.)
[XPath2§3.2] Berglund, Anders, Scott Boag, Don Chamberlin, Mary F. Fernández, Michael Kay, Jonathan Robie, and Jérôme Siméon, eds. XML Path Language (XPath) 2.0 (Second Edition), section 3.2, Path Expressions. 2010-12-14. World Wide Web Consortium. (Accessed 2024-03-30.)
[XPath3.1§3.3] Robie, Jonathan, Michael Dyck, and Josh Spiegel, eds. XML Path Language (XPath) 3.1, section 3.3, Path Expressions. 2017-03-21. World Wide Web Consortium. (Accessed 2024-03-30.)
[HyTime] ISO - ISO/IEC 10744:1997 - Information technology — Hypermedia/Time-based Structuring Language (HyTime), Edition 2. ISO, 1997.
[DDMHW] DeRose, Steven and David Durand. Making Hypermedia Work: A User’s Guide to HyTime. Kluwer Academic Publishers, 1994 (ISBN 0-7923-9432-1).
[NTW1] Tovey-Walsh, Norman. On the xml.com Slack workspace, #xpath-ng channel, 2024-07-11T08:05:03Z.
[W.DandC] Divide-and-conquer algorithm. Wikipedia. Wikimedia Foundation, 2024-04-16T22:14, https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm.
[1] This is not entirely true. I actually remembered the quote as “Don’t underestimate the power of the dark side!”, but on researching this paper discovered I was simply wrong.
[2] When I say “obvious” I mean that the need for a feature that allowed pointing into SGML documents based on hierarchy, etc., was obvious, not that the particular details of the extended pointer syntax were obvious.
[3] Quite presciently, though, the name of the only foreign (i.e., non-TEI) system mentioned is XFORM.
[4] However, I often think of myself as more of an XML Data Hygienist.
[5] In truth, if a student wanted to learn one and only one search technology, I would likely recommend regular expressions, which are, as Martin Holmes has said, “the next thing to learn after you learn to type”.
[6] Which does not mean it is not useful for anything else. For example, the audience at XML Prague 2024 seemed to conclude that not only is it also useful for operating over JSON, it may be better than JSONPath for operating over JSON.
[7] Using the W3C Schema regular expression language:
<hi(\s+\i\c*\s*=\s*('[^']*'|"[^"]*"))*>\s+</hi>
But in truth most other regular expression languages do not have shortcuts for NameStartChar (\i) and NameChar (\c), so for truly arbitrary XML in those languages the regular expression is much longer. In fact in the rare cases where I have wanted to do this, I have taken a quick look at the list of all attribute names first (using, e.g., xsel -t -m "//@*" -v "name(.)" -n /path/to/files.xml | rank), so the regular expression could be comparatively concise.
[8] Or, in XPath 1.0 either of the following:
//tei:hi[ string-length(.) > 0 and normalize-space(.) = '']
or
//tei:hi[ normalize-space( concat('☮', ., '☮') ) = '☮ ☮']
or …
[9] In October of 2010 while presenting at a workshop on encoding manuscripts at the University of Nebraska Lincoln I remember realizing quite suddenly that none of the dozen or so participants had been taking notes when I off-handedly used this feature of oXygen to demonstrate something, and immediately at least half the room started jotting it down and there were one or two requests for a repeat. These folks understood how important XPath is, but had not been aware of the oXygen feature.
[10] Some commandline XPath tools:
- xidel
- BaseX (can be used in standalone command-line mode)
- xmlsh
- Perl-based
  - xpath
  - xsh
  - xml_grep
- libxml2-based and libxslt-based
  - xmllint
  - xmlstarlet
- Saxon-based
  - Saxon (XQuery)
  - Saxon’s Gizmo
  - saxon-lint
- Python-based
  - python-lxml
- Ruby experts can also write Ruby one-liners using REXML or nokogiri
[11] By which I mean ISO Schematron, but for the purposes of this discussion, version does not really matter.
[12] Not to mention Schematron Quick Fixes.
[13] I call the output “prettier”, not “pretty”, because in my experience no indentation algorithm produces exactly the indentation a given user wants, but they very often produce a close approximation thereof.
[14] I am not limiting “divide and conquer” to the actual algorithm paradigm described, e.g., in W.DandC, but rather using it to denote the general approach of dividing large problems or large chunks of code into smaller problems or small snippets of code.
[15] A game at which APL is always the clear winner. For an undergraduate assignment I once wrote a 42-line APL program: 1 line of housekeeping, 1 line of code, and 40 lines of commentary to explain the 1 line of code. I’m sure others have even more impressive horror stories.
[16] Actually, the alias I use provides over a dozen other namespace bindings, but they are not relevant to this example.
[17] Given that there are no <tei:idno type="volume"> elements in the contents of any of the articles, I could have just used //t:idno as the beginning of the XPath, but a) I did not know that at the time, and b) in theory this XPath should be mildly faster than just //t:idno, especially if there are lots of articles with lots of nodes, particularly <tei:idno> nodes, within the /t:TEI/t:text. I measured the time
difference in a not particularly rigorous test, and
found that while the longer XPath was a wee bit faster,
the speed difference was such that I would have to
execute it hundreds of times to make up for the extra
time it took for me to type the longer XPath — and I am
a reasonably fast typist.
This observation perfectly matches my (perhaps imperfect) recollection of Michael Kay’s advice on XSLT efficiency. Mr. Kay has said on more than one occasion that the stylesheet writer is not likely to be able to predict what the optimizer does, and thus what is fast and what is not. The corollary is not to bother worrying about which XPath or XSLT construct is computationally more efficient until a problem arises — i.e., until one’s program is problematically too slow — and at that point actually timing components is a better approach than reasoning about which construct is likely faster.
[18] However, it is common to use more than one -v option, so we may need to call them XPath ‘B1’, ‘B2’, etc.
[19] In truth, as with most shell commands, this is really just a list of filepaths. If a simple glob will not get the set of files I want, I have often used a more complex method of listing the desired files. Here are some examples.
/path/to/articles/000[12]*/art*.xml |
The list of files in the subdirectories of
(Imagine a corpus in which the articles are
stored in files named “art_[author]”, where
“[author]” is the first three letters of the
primary author’s surname followed by the month and
day of their date of birth, each stored in one of
thousands of sequentially numbered directories
each named with a 5-digit number. This glob gets
only the 1XX and 2XX series articles, but avoids
the extraneous files |
/path/to/TEI/P5/Source/Specs/teidata*.xml /path/to/TEI/P5/Source/Specs/macro*.xml |
The list of files that define TEI datatypes and macros, avoiding the files that define classes, elements, or modules. |
$( find release -name '*.xsl' -o -name '*.xslt' -o -name '*.sch' -o -name '*.isosch') |
The list of all XSLT files (whether named
|
$( xsel -t -m "/*/c:group[@id='published']/c:uri" -v "@uri" /path/to/catalog.xml ) |
The list of published articles
extracted from an XML catalog. (Presumably
unpublished files are stored in a different
|
[20] The idea that this pattern of commands — sort, uniq(ue), count, sort on the count — is so commonly helpful in corpus analysis that it deserves its own shorthand is something I picked up from a chat with Steve DeRose during a break at one of the first Balisage conferences. A quick, unscientific count says that I have used this shorthand in ~6.2% of the last 19,524 commands I have issued.
[21] The astute reader might question the use of the eq item comparison operator rather than the = sequence comparison operator for that first equality check. After all, although '004' is guaranteed to be a single item, .//idno[@type eq 'volume'] might very well return a sequence of two or more <idno> elements, in which case = would still work fine and eq would result in an error. However, for this particular corpus there is a rule that there should be one and only one <idno type="volume"> in each article. Thus if there were two or more, I wanted to know about it. (Since this rule is schema-enforced, there were none.)