How to cite this paper
Birnbaum, David J. “An XML user steps into, and escapes from, XPath quicksand.” Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). https://doi.org/10.4242/BalisageVol3.Birnbaum01.
Balisage: The Markup Conference 2009
August 11 - 14, 2009
Balisage Paper: An XML user steps into, and escapes from, XPath quicksand
David J. Birnbaum
Professor and Chair
Department of Slavic Languages and Literatures, University of Pittsburgh
David J. Birnbaum is Professor and Chair of the Department of Slavic Languages and Literatures at the University of Pittsburgh. He has been involved in the study of electronic text technology since the mid-1980s, has delivered presentations at a variety of electronic text technology conferences, and has served on the board of the Association for Computers and the Humanities, the editorial board of Markup Languages: Theory and Practice, and the Text Encoding Initiative Council. Much of his electronic text work intersects with his research in medieval Slavic manuscript studies, but he also often writes about issues in the philosophy of markup.
Copyright © 2009 by David J. Birnbaum. All rights reserved. Used by permission.
Abstract
Until recently, the admirable and impressive eXist XML database sometimes failed to optimize queries with numerical predicates. For example, a search for $i/following::word[1] would retrieve all <word> elements on the following axis and only then apply the predicate as a filter to return only the first of them. This was enormously inefficient when $i pointed to a node near the beginning of a very large document, with many thousands of following <word> elements. As an end-user without the Java programming skills to write optimization code for eXist, the author describes two types of optimization in the more familiar XML, XPath, and XQuery, which reduced the number of nodes that needed to be accessed and thus improved response time substantially.
A subsequent optimization introduced by the eXist developers into the eXist code base is described in an addendum to this paper. Although this revision partially obviates the need for the work-arounds developed earlier, the analysis of the efficiency of various XPath approaches to a single problem continues to provide valuable general lessons about XPath.
Table of Contents
- The corpus
- The task
- The problem
- An XPath solution
- A range index solution
- Time test results
- Is it XML?
- Conclusion
- Addendum
The corpus
Alain de Lille’s (Alanus ab insulis) allegorical-philosophical epic, the Anticlaudianus, is a poem divided into nine books with a brief verse prologue and an equally brief prose pre-prologue. The text was published by Robert Bossuat in 1955 (Anticlaudianus / Alain de Lille: texte critique avec une introduction et des tables publié par R. Bossuat [Paris: J. Vrin, 1955]), cleaned up in 2009 by Danuta Shanzer (University of Illinois at Urbana-Champaign, shanzer@illinois.edu) as a Dumbarton Oaks Medieval Library Latin Series Work In Progress, and converted to XML and published as a queriable concordance by David J. Birnbaum (University of Pittsburgh, djbpitt@pitt.edu). When this report was last edited in August 2009, the concordance was freely accessible at http://clover.slavic.pitt.edu:8081/exist/acl/search.html; it will eventually move to a different stable, freely accessible address, which has not yet been determined but which should be discoverable through search engines.
The corpus consists of a single XML file subdivided into <book> elements (prose introduction, verse introduction, nine principal chapters). Each <book> is divided into <line> elements (4344 <line> elements in the nine principal <book> elements, for an average of 482.7 <line> elements per <book> element), and each <line> is divided into <word> elements (27222 <word> elements in the nine principal books, or approximately 6.27 <word> elements per <line> element; the <word> counts per <line> range from a low of 4 to a high of 10). For example, Book 2 begins:
<book n="2">
<line>
<word>Regia</word>
<word>tota</word>
<word>silet;</word>
<word>expirat</word>
<word>murmur</word>
<word>in</word>
<word>altum,</word>
</line>
<line>
<word>cum</word>
<word>visu</word>
<word>placidos</word>
<word>delegat</word>
<word>curia</word>
<word>vultus,</word>
</line>
<!-- more lines -->
</book>
Because scholars are likely to be more interested in the contents of the nine principal books than in the contents of the prose and verse introductions, the counts and calculations below omit the latter. They are, nonetheless, relevant in certain situations, such as when a query consults all <word> elements in the document, including those in the introductions. The prose introduction contains 477 <word> elements and the verse introduction 50.
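For readers who want to verify these counts against the stored document, a minimal XQuery sketch follows. It assumes (as the retrieval examples later in this paper do) that the document is stored at /db/acl/acl.xml, and it further assumes that the nine principal books are the <book> elements that carry an @n attribute, as in the sample above; the encoding of the two introductions is not shown in this paper, so that second assumption is only a guess:

(: counting sketch; assumes @n marks the nine principal books :)
(count(doc("/db/acl/acl.xml")//book[@n]/line),        (: 4344 :)
 count(doc("/db/acl/acl.xml")//book[@n]/line/word),   (: 27222 :)
 count(doc("/db/acl/acl.xml")//book[@n]/line/word)
     div count(doc("/db/acl/acl.xml")//book[@n]/line) (: approximately 6.27 :)
)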
The task
The goal of the electronic concordance project is to enable users to search for words and generate a keyword-in-context (KWIC) report on the fly. The system was originally implemented using the eXist XML database (http://www.exist-db.org), version 1.3.0dev-rev:8710-20090308. Version 1.3 of eXist, still in beta at the time this report was last revised, is the first to introduce a Lucene-based index, which is intended eventually to replace the original proprietary full-text index, and the present concordance is indexed and accessed using the Lucene index. When the user enters a query string and launches a search, the system retrieves all hits and returns them with several preceding and following words (three each by default, but the user can adjust this number). Line breaks in the original are rendered as slashes in the KWIC output (this part of the code has been omitted from most of this report in the interest of legibility, except when it is the object of optimization). For example, a search for pugna (Latin for ‘fight’), which occurs four times in the corpus, returns:
- Acl.8.254: Martis amore / succensi, pugna cupiunt incidere vitam. /
- Acl.9.283: Luxum / Sobrietas, sed pugna favet Virtutibus, harum /
- Acl.9.331: fraudesque recurrit. / degeneri pugna, servili Marte, dolosa /
- Acl.9.384: indignata sub umbras. / Pugna cadit, cedit iuveni
The XPath preceding and following axes are ideally suited to this type of project, since they ignore <line> element boundaries and treat adjacent <word> elements identically irrespective of whether they fall in the same <line> element as the target word or in preceding or following <line> elements. For example, the system can retrieve the three <word> elements that precede the target <word> element with the following XQuery (assume that $i represents the target <word> element):
for $j in reverse(1 to 3) return $i/preceding::word[$j]
This query returns the third, second, and first <word> elements before the target <word> element, in the specified order. An analogous statement can retrieve the three <word> elements that follow the hit, thus providing the rest of the context. Because queries along the long (preceding and following) axes make no distinction between preceding and following <word> elements within the same <line> element and those that require crossing a <line> element boundary, the resulting XQuery code is lucid and clean, making it extremely easy to read, write, and maintain.
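The analogous following-context statement is not spelled out at this point; a minimal sketch (this is the same expression that is discussed in the Addendum below) would be:

for $j in (1 to 3) return $i/following::word[$j]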
The problem
The preceding strategy retrieves the correct results, and does so with elegant code, but initially it proved unusable in practice for reasons of efficiency, even with appropriately configured eXist Lucene and range indexes (“Configuring,” “Lucene,” “Tuning”). Until a recent revision in the eXist code base (see the Addendum, below), XPath expressions that addressed the long axes were inefficient because eXist retrieved the entire set of nodes on the specified axis before looking at the predicate. For example, in the worst case a hit would fall at the beginning of the first <line> in the first <book> element, which meant that in order to find the three immediately following <word> elements by looking on the following axis, eXist would first retrieve as many as 27221 following <word> elements and only then apply a numerical predicate to filter the returned result. Since the context includes both preceding and following <word> elements (that is, it requires accessing the preceding axis in the former case and the following axis in the latter), hits that have fewer preceding <word> elements have more following ones, and vice versa, which means that all hits require sets of retrievals that span the entire document. The overhead was not a significant problem with queries that retrieved a mere handful of hits, but those that retrieved as few as two hundred hits could take several minutes to return, and sometimes they failed to return entirely because they generated Java heap overflow errors (which could have been evaded by increasing the heap size, but that would not have provided a solution to the efficiency problem).
This problem is not a unique or inherent property of the long axes. Rather, it is a property of the number of nodes on which the predicate operates. For example, in a flattened tree (imagine transforming the document to one where all <word> elements are directly under the root and line and book boundaries are encoded as empty milestone tags), a hit at the beginning of the document that queried the following-sibling axis, rather than the following axis, would nonetheless retrieve tens of thousands of unwanted <word> elements before applying the predicates to select the mere three that were actually needed. For this reason, although the problem may initially appear to be an overlap issue, in that in its original form it crosses the boundaries of <line> elements, the flattening thought experiment reveals that it is actually an optimization problem that is independent of both the specific axes involved and the depth of the nesting. If, for example, eXist were to look first at the predicate and then access only the necessary elements on the specified axis, instead of first retrieving all elements on that axis and only then consulting the predicate and discarding the unwanted ones (in the present case, all but one), processing would not be suffocated by unnecessary retrieval, irrespective of whether the query needed to cross a <line> element boundary.
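As an illustration of the thought experiment, a flattened version of the opening of Book 2 might look like the following (a hypothetical sketch; the empty milestone elements and their names are invented here for illustration):

<poem>
    <bookStart n="2"/>
    <lineStart/>
    <word>Regia</word>
    <word>tota</word>
    <word>silet;</word>
    <!-- remaining words of the document, all directly under the root -->
</poem>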
As is discussed in the Addendum, below, eXist has since implemented this type of optimization, but the problem nonetheless continues to merit consideration. Among other things, the issue was never really about eXist: were that the case, an obvious solution would have been to use an alternative platform that provided the necessary optimization. Instead, the problem provides an opportunity to reflect more generally on the nature of XPath expressions and the relationship between XPath and XML.
For example, although the inefficiency described above is independent of the specific axis involved, insofar as it could also have arisen with the sibling axes in a flattened tree, it nonetheless does depend on the XML structure, or, more precisely, on its indifference to the XML structure. What the long axes in the actual problem and the sibling axes in the flattened-tree alternative have in common is that they operate independently of the tree structure. For example, the sibling axes in the original tree constrain the number of nodes involved because the tree is balanced in a way that ensures that no <word> element will have more than nine sibling <word> elements. What the long axes in that same tree and the sibling axes in the hypothetical flattened tree have in common, on the other hand, is that all <word> elements are treated as though they are on the same level of the tree (in the former case because the long axes ignore the tree and in the latter case because the tree is flattened, and therefore irrelevant). This suggests that unless one can be certain that the software that will evaluate one’s XPath expressions will optimize one’s query, the designer should take into consideration the number of nodes that will be addressed by those expressions.
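A quick way to gauge that number is simply to count the nodes an expression addresses before committing to it; for this corpus (a sketch):

(count($i/following::word),          (: up to 27221 in this corpus :)
 count($i/following-sibling::word))  (: at most 9, since no line has more than 10 words :)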
An XPath solution
Since eXist was unable to optimize the queries in question, that duty fell on the user, who, in this case, first adopted a strategy that avoided the long axes, favoring instead the sibling axes at all levels (<word> and <line>). This constrains searches for <word> elements within a <line> element to an average of 2.64 (5.27 / 2) elements and searches for <line> elements within a <book> element to an average of 240.85 (481.7 / 2) elements, and in the worst case to 5.27 and 481.7, respectively. These numbers should be multiplied by six for a typical query, which retrieves three preceding and three following <word> elements, but they still compare very favorably to queries along the long axes, which consult 13610.5 (27221 / 2) <word> elements on average and 27221 in the worst case. A prose explanation of the strategy for retrieving the three <word> elements following the target while using the following-sibling axis exclusively instead of the following axis is:
- If there is a <word> element in the appropriate position on the following-sibling axis (that is, in the same <line> element), retrieve it.
- If not, navigate up to the parent axis (a <line> element), get its following-sibling <line> element, and retrieve the appropriate <word> child element from within that <line>.
The following XQuery snippet retrieves the three <word> elements that follow the target. Assume that $i refers to the target (here and in subsequent examples):
if (count($i/following-sibling::word) ge 3)
then for $j in (1 to 3) return $i/following-sibling::word[$j]
else
    if (count($i/following-sibling::word) eq 2)
    then (for $j in (1 to 2) return $i/following-sibling::word[$j],
        (: line[1] restricts retrieval to the immediately following sibling <line> :)
        $i/parent::line/following-sibling::line[1]/word[1])
    else
        if (count($i/following-sibling::word) eq 1)
        then ($i/following-sibling::word[1], for $j in (1 to 2)
            return $i/parent::line/following-sibling::line[1]/word[$j])
        else
            for $j in (1 to 3) return $i/parent::line/following-sibling::line[1]/word[$j]
An analogous strategy can retrieve the three <word> elements that immediately precede the target.
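That preceding-context variant is not reproduced here verbatim; a minimal sketch of its shape (not the author's actual code) might run as follows. Note the reverse() calls, which keep the output in document order, and the [1] predicate, which selects the nearest preceding sibling <line>:

if (count($i/preceding-sibling::word) ge 3)
then for $j in reverse(1 to 3) return $i/preceding-sibling::word[$j]
else
    let $prev := $i/parent::line/preceding-sibling::line[1]
    let $short := 3 - count($i/preceding-sibling::word)
    return (
        (: the last $short words of the previous line, in document order ... :)
        $prev/word[position() gt last() - $short],
        (: ... followed by the target's own preceding siblings :)
        for $j in reverse(1 to 3 - $short) return $i/preceding-sibling::word[$j]
    )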
It is possible to generalize this solution to allow the user to specify at run time the number of preceding or following <word> elements to provide as context, along the following lines (assume that $scope specifies the number of words of context to provide after the target <word>):
if (count($i/following-sibling::word) ge $scope)
then for $j in (1 to $scope) return $i/following-sibling::word[$j]
else for $k in (0 to ($scope - 1)) return
    if (count($i/following-sibling::word) eq $k)
    then (
        for $j in (1 to $k) return $i/following-sibling::word[$j],
        for $j in (1 to ($scope - $k)) return $i/parent::line/following-sibling::line[1]/word[$j]
    )
    else ()  (: an empty sequence, rather than "", avoids stray empty strings in the output :)
An analogous strategy can retrieve a user-specified number of <word> elements that immediately precede the target. This generalization, however, is fragile, because it looks only at the <line> element that contains the target <word> element plus the immediately preceding or following sibling <line> element. This means that, for example, if the user wants to include a context of five following <word> elements, the code will fail when the target <word> falls at the end of a <line> and the following (sibling) <line> contains only four <word> elements. It might be possible to circumvent this limitation through a recursive approach, but by that point the code would become so convoluted as to be difficult to understand and impractical to maintain.
A range index solution
In addition to the new Lucene-based full-text index (and the original full-text index, which is still available), eXist also supports range indexes, which provide very fast access to element nodes on the descendant-or-self (//) or child (/) axis and attribute nodes on the attribute (/@) axis according to their typed data values. This suggests an alternative approach:
- Before storing the XML source document, create an @offset attribute for each <word> element and assign it a unique sequential integer value. The first <word> element in the document has an @offset value of 1, the next has a value of 2, etc. (A sketch of such a preprocessing pass follows this list.)
- Retrieve the target <word> element by using the new Lucene full-text index.
- Retrieve the adjacent context <word> elements according to their @offset attribute values by counting backward or forward from the @offset value of the target <word> element.
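A minimal sketch of such a preprocessing pass, written as a recursive XQuery function (hypothetical: the function name local:enrich and the rebuild-the-tree approach are illustrative rather than the author's actual tooling, and the quadratic count() is acceptable only because the pass runs once, offline):

declare function local:enrich($n as node()) as node() {
    typeswitch ($n)
        (: number each <word> by its position in document order :)
        case element(word) return
            <word offset="{count($n/preceding::word) + 1}">{$n/node()}</word>
        (: copy every other element, recursing into its children :)
        case element() return
            element {node-name($n)} {
                $n/@*,
                for $c in $n/node() return local:enrich($c)
            }
        default return $n
};
local:enrich(doc("/db/acl/acl.xml")/*)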
This approach is available only where the designer has control over the XML source and is able to incorporate a specific @offset attribute that is to be used only for navigation during retrieval. It has at least two weaknesses, one aesthetic and one technical:
- The aesthetic problem is that the ordinal position of each <word> element within the document is an inherent property of the document structure, and the user should not have to specify, through the insertion of character-based markup into the document, a value that is already encoded implicitly but consistently and unambiguously in the markup structure.
- The technical problem is that the insertion or deletion of a word in the document will break the numbering, requiring that the @offset values be calculated anew and rewritten. In the present case the document is relatively stable (that is, more stable than, for example, an on-line commerce site that writes new data for every transaction), but the editor may still choose to modify her reading at some point as she reconsiders the available evidence and perhaps discovers and needs to integrate the data from newly discovered manuscript witnesses.
In practice, neither of these weaknesses imposes a serious inconvenience in the context of the present project, and the opportunity to use the eXist range index feature provides a substantial improvement in response time over alternative strategies (see the time test data below).
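For orientation, an index configuration (collection.xconf) along the lines described in the eXist documentation might look like the following. This is a sketch rather than the project's actual configuration file; the qname values simply match the markup shown in this paper:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <!-- Lucene full-text index on <word> elements -->
        <lucene>
            <text qname="word"/>
        </lucene>
        <!-- typed range index on @offset, so that numeric comparisons can use it -->
        <create qname="@offset" type="xs:integer"/>
    </index>
</collection>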
With the modification to the XML source described above, the first two lines of Book 2 now look like:
<book n="2">
<line>
<word offset="3708">Regia</word>
<word offset="3709">tota</word>
<word offset="3710">silet;</word>
<word offset="3711">expirat</word>
<word offset="3712">murmur</word>
<word offset="3713">in</word>
<word offset="3714">altum,</word>
</line>
<line>
<word offset="3715">cum</word>
<word offset="3716">visu</word>
<word offset="3717">placidos</word>
<word offset="3718">delegat</word>
<word offset="3719">curia</word>
<word offset="3720">vultus,</word>
</line>
<!-- more lines -->
</book>
The following code will now use a properly configured eXist range index to retrieve the three <word> elements following the target <word> element (assume $i is the target <word> element):
let $offset := $i/@offset
return
for $j in (1 to 3) return doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]
An analogous strategy can retrieve the three <word> elements that immediately precede the target. Furthermore, this solution is easily generalized to allow the user to specify the number of preceding or following context <word> elements at run time (assume that $scope specifies the number of words of context to provide after the target <word>):
let $offset := $i/@offset
return
for $j in (1 to $scope) return doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]
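The preceding-context analogue simply counts downward from the target's @offset value; a minimal sketch (not the author's verbatim code; reverse() keeps the output in document order):

let $offset := $i/@offset
return
    for $j in reverse(1 to $scope) return doc("/db/acl/acl.xml")//word[@offset eq ($offset - $j)]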
The strategy of writing structural information about the XML (such as offset position) into attribute values of the XML source itself in the interest of improved execution time can also be applied to writing slashes to mark the ends of lines. The most natural XPath way to write a slash after the target <word> element when it ends a <line> element in the XML source is:
if (not($i/following-sibling::word)) then " / " else ""
A similar approach can be used to write slashes after the leading and trailing context <word> elements when they end a <line> element. This method is not maddeningly slow, because the number of nodes on the sibling axes is typically small, but further improvement in processing speed is available by modifying the XML source to include the string " / " (without the quotation marks) as an attribute value associated with <word> elements that fall at the end of <line> elements, and then retrieving it when generating the report instead of consulting the sibling axis. The first two lines of Book 2 now look like:
<book n="2">
<line>
<word offset="3708">Regia</word>
<word offset="3709">tota</word>
<word offset="3710">silet;</word>
<word offset="3711">expirat</word>
<word offset="3712">murmur</word>
<word offset="3713">in</word>
<word last=" / " offset="3714">altum,</word>
</line>
<line>
<word offset="3715">cum</word>
<word offset="3716">visu</word>
<word offset="3717">placidos</word>
<word offset="3718">delegat</word>
<word offset="3719">curia</word>
<word last=" / " offset="3720">vultus,</word>
</line>
<!-- more lines -->
</book>
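The @last attribute can be written during the same kind of one-time preprocessing pass sketched earlier; a hypothetical revision of the element(word) case of local:enrich:

(: hypothetical: replaces the element(word) case in the local:enrich sketch above :)
case element(word) return
    <word offset="{count($n/preceding::word) + 1}">{
        (: mark line-final words with the slash that the KWIC report will print :)
        if (not($n/following-sibling::word)) then attribute last { " / " } else (),
        $n/node()
    }</word>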
The XQuery code to write the trailing context <word> elements is then:
let $offset := $i/@offset
for $j in (1 to 3) return (
" ",
data(doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]),
data(doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]/@last)
)
This approach writes a space after the target <word> element, followed by the appropriate trailing context <word> element, followed by the value of the @last attribute of that <word> element. If there is a @last attribute, it has the value " / " (without the quotation marks), which is what we want to write. If the attribute is missing, that statement operates vacuously, producing no output.
Time test results
To test the relative efficiency of the various coding strategies described above, the same query was executed ten times with each of four strategies. The query was a search for the word sed (Latin for ‘but’), which occurs 221 times in the corpus, and the XQuery scripts were all written to return it along with its location (book and line number) and with three context words on either side. The search strategies were:
- Long axes: Search for context words using the preceding and following axes. Place slashes at the ends of lines by checking whether each output word has a following-sibling <word> element in the same <line>.
- Sibling axes: Search for context words using the sibling axes. If there are not enough context <word> elements in the same <line>, find the nearest sibling of the <line> and navigate to the desired <word> element along the child axis. Place slashes as described above.
- @offset: Modify the XML to add an @offset attribute to every <word> element and find the context words by counting down or up from the @offset attribute value of the target <word>. Place slashes as described above.
- @last: Same as @offset, except add a @last attribute in the XML to every <word> element that is the last in its parent <line>, and place slashes at the ends of lines by returning the value of that attribute.
The tests were conducted on a Gateway 3GHz Pentium D with 1GB of memory, running Microsoft Windows Vista with Service Pack 1. The Java version is 1.6.0_13 and the eXist version is 1.3.0dev-rev:0000-20090528.
Table I

Test no.   Long axes   Sibling axes   @offset   @last
1            447.750       38.477      34.345    24.413
2            440.265       38.390      34.446    24.416
3            559.141       38.710      34.438    24.445
4            905.472       38.484      35.409    24.384
5            702.915       38.590      34.562    24.447
6            530.739       38.341      34.798    24.424
7            851.608       38.714      34.145    24.415
8            473.601       38.496      34.410    24.395
9            670.772       39.521      34.423    24.463
10           473.601       38.317      34.429    24.469
Mean         631.150       38.604      34.541    24.427
Ratio 1       25.838        1.580       1.414     1.000
Ratio 2       18.273        1.118       1.000
Ratio 3       16.349        1.000
Times are reported in seconds. The three ratio lines in the table set the time for one of the tests at a value of 1 and then calculate the amount of time the other implementations required in proportion to it.
The results show that querying along the long axes took more than 16 times as much time as querying along the sibling axes. The sibling-axes implementation in turn took about 12% longer than the @offset implementation, and the @offset implementation took about 41% longer than the @last implementation. All told, the implementation that relies on the long axes took more than 25 times as much time as the one with the greatest optimization.
Is it XML?
The XML version of the poem has an inherent hierarchy (the poem contains books, which contain lines, which contain words) and inherent order (the words occur in a particular order, as do the lines and books). Those inherent features are encoded naturally in the structure of the XML document because XML documents are obligatorily hierarchical (even though in some projects the hierarchy may be flat) and ordered (even though in some projects the user may ignore the order). The addition of @offset and @last attributes and the adoption of a strategy that treats the document as flat and never looks at the hierarchy essentially transforms the approach from one that is based on natural properties of XML documents to one that is based on a flat-file database way of thinking. That is, we could map each <word> element in the XML version to a record in a database table, the fields of which would be the textual representation of the word (a character string), the offset value (a unique positive integer), an indication of whether the word falls at the end of a line (a boolean value), and the book and line number (a string value, which is used in reporting). Records in a database do not have an inherent order, but once we rely on the value of the @offset attribute in the XML document, the <word> elements might as well be sprinkled through the document in any order, and the <line> and <book> elements play no role at all in the system. That is, except for the book and line number, the most highly optimized (and most efficient) implementation above adopts precisely a flat-file database approach, which raises the question of whether this project should have been undertaken in XML in the first place.
The answer to that rhetorical question is that of course it should have been undertaken in XML, because the order and hierarchy are meaningful. They are inherent in the XML structure but must be written explicitly into a corresponding database implementation, which indicates that this is data that wants, as it were, to be regarded as an ordered and hierarchical XML document. The problem is not that the data is inherently tabular, and therefore inherently suited to a flat-file database solution, but that the XML tool available to manipulate the data was not sufficiently optimized for the type of retrieval required.
Conclusion
The best solution would be, of course, an optimization within eXist that would let users write concise and legible XQuery code (using the long axes), which would then be executed efficiently through optimization behind the scenes. This type of solution would remove the need both for more complex code (along the lines of the sibling-axes approach described above) and for modifying the XML to write information into the document in character form when that information is already inherent in the document structure. Until such a solution became available, though, the strategies described above provided a substantial improvement over explicit use of the long axes, salvaging a project that would otherwise have been unusable for reasons of efficiency.
Addendum
eXist is an open-source project, which means that impatient users who require an optimization not already present in the code have the opportunity to implement that optimization themselves and contribute it to the project. Unfortunately, in the present case this particular impatient user lacked the Java programming skills to undertake the task. Fortunately, however, the eXist development team is very responsive to feature requests from users, and shortly after I wrote to the developers about the problem they released an upgrade that implemented precisely the modification described above (consult the predicate first and retrieve only the nodes that will be needed from the designated axis). Rerunning the original code that relied on the long axes on the same machine as the earlier tests, but using eXist version 1.3.0dev-rev9622-20090802, which includes this new optimization, yielded times of 1.754, 1.778, 1.765, 1.944, 1.944, 1.777, 18.949, 1.838, 1.763, and 1.798 seconds. The mean time for these tests was 3.531 seconds, and if we exclude the aberrant long time on the seventh trial (an artifact of a system process that woke up at an inconvenient moment?), the mean drops to 1.818 seconds. The 3.531-second figure is 14.455% of the best mean time (24.427 seconds) achieved with my XPath-based optimizations and 0.559% of the mean time of the long-axes search (631.150 seconds) before the introduction of the eXist-internal optimization. The 1.818-second figure is 7.443% of the best mean time (24.427 seconds) achieved with my XPath-based optimizations and 0.288% of the mean time of the long-axes search (631.150 seconds) before the introduction of the eXist-internal optimization.
The eXist optimization works by checking the static return type of the predicate expression to determine whether it is a positional predicate. (This paragraph reproduces more or less verbatim an explanation provided by the eXist developers.) If the answer is yes and there is no context dependency, the predicate is evaluated in advance and the result is used to limit the range of the context selection (e.g., following::word). For example, $i/following::word[1] would benefit from the optimization because the predicate is statically recognizable as positional and entails no context dependency. On the other hand, $i/following::word[position() = 1] would not be optimized, because it introduces a context dependency insofar as position() returns the position of the current context item and cannot be evaluated without looking at the context.
Furthermore, determining the static type is not always easy. In particular, the type information is passed along local variables declared in a let or for clause, but it gets lost through function calls. My original query, for $j in (1 to 3) return $i/following::word[$j], works, but if $j were a function parameter, it would not. Additionally, support for this optimization with particular XPath functions is being introduced only incrementally, to avoid breaking existing code. For example, the developers’ initial attempt at an optimization failed with the reverse() function that I used to retrieve the three preceding words in the correct order, although support for this function was later added to the optimization.
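To illustrate the function-call limitation, a hypothetical sketch (the function name is invented) of the shape that would defeat the optimization:

(: because $n arrives as a function parameter, the static type information
   is lost at the call boundary, the predicate is no longer recognized as
   positional, and the optimization does not apply :)
declare function local:nth-following($i as element(word), $n as xs:integer)
        as element(word)? {
    $i/following::word[$n]
};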
The unsurprising technical conclusion, then, is that, at least in the present case, optimization of the XPath code by the user to reduce the scope of a query can achieve substantial improvement, but much more impressive results are obtained by optimizing the Java code underlying the XPath interpreter. What this experiment also reveals, though, is that the user was not reduced to waiting helplessly for a resolution by the developers, and was able to achieve meaningful improvement in those areas that he did control, viz., the XML, XPath, and XQuery.
In his concluding statement at the Balisage 2009 pre-conference Symposium on Processing XML Efficiently, Michael Kay invoked David Wheeler’s advice that application developers optimize the code that users actually write, that is, that they find out what people are doing and make that go quickly. From an end-user perspective, though, the lesson can be reversed: find out what goes quickly and use it.