The corpus
Alain de Lille’s (Alanus ab Insulis) allegorical-philosophical epic, the Anticlaudianus, is a poem in nine books with a brief verse prologue and an equally brief prose pre-prologue. The text was published by Robert Bossuat in 1955 (Anticlaudianus / Alain de Lille: texte critique avec une introduction et des tables, publié par R. Bossuat [Paris: J. Vrin, 1955]), cleaned up in 2009 by Danuta Shanzer (University of Illinois at Urbana-Champaign, shanzer@illinois.edu) as a Dumbarton Oaks Medieval Library Latin Series work in progress, and converted to XML and published as a queryable concordance by David J. Birnbaum (University of Pittsburgh, djbpitt@pitt.edu). When this report was last edited in August 2009, the concordance was freely accessible at http://clover.slavic.pitt.edu:8081/exist/acl/search.html; it will eventually move to a different stable, freely accessible address, which has not yet been determined but which should be discoverable through search engines.
The corpus consists of a single XML file subdivided into <book> elements (prose introduction, verse introduction, nine principal chapters). Each <book> is divided into <line> elements (4344 <line> elements in the nine principal <book> elements, for an average of 482.7 <line> elements per <book> element), and each <line> is divided into <word> elements (27222 <word> elements in the nine principal books, or approximately 6.27 <word> elements per <line> element; the <word> counts per <line> range from a low of 4 to a high of 10). For example, Book 2 begins:
<book n="2"> <line> <word>Regia</word> <word>tota</word> <word>silet;</word> <word>expirat</word> <word>murmur</word> <word>in</word> <word>altum,</word> </line> <line> <word>cum</word> <word>visu</word> <word>placidos</word> <word>delegat</word> <word>curia</word> <word>vultus,</word> </line> <!-- more lines --> </book>
Because scholars are likely to be interested more in the contents of the nine principal books than in the contents of the prose and verse introductions, the counts and calculations below omit the latter. The introductions are, nonetheless, relevant in certain situations, such as when a query consults all <word> elements in the document, including those in the introductions. The prose introduction contains 477 <word> elements and the verse introduction 50.
The task
The goal of the electronic concordance project is to enable users to search for words and generate a keyword-in-context (KWIC) report on the fly. The system was originally implemented using the eXist XML database (http://www.exist-db.org), version 1.3.0dev-rev:8710-20090308. Version 1.3 of eXist, still in beta at the time this report was last revised, is the first to introduce a Lucene-based index, which is intended eventually to replace the original proprietary full-text index, and the present concordance is indexed and accessed using the Lucene index. When the user enters a query string and launches a search, the system retrieves all hits and returns them with several preceding and following words (three each by default, but the user can adjust this number). Line breaks in the original are rendered as slashes in the KWIC output (this part of the code has been omitted from most of this report in the interest of legibility, except when it is the object of optimization). For example, a search for pugna (Latin for ‘fight’), which occurs four times in the corpus, returns:
- Acl.8.254: Martis amore / succensi, pugna cupiunt incidere vitam. /
- Acl.9.283: Luxum / Sobrietas, sed pugna favet Virtutibus, harum /
- Acl.9.331: fraudesque recurrit. / degeneri pugna, servili Marte, dolosa /
- Acl.9.384: indignata sub umbras. / Pugna cadit, cedit iuveni
The XPath preceding and following axes are ideally suited to this type of project, since they ignore <line> element boundaries and treat adjacent <word> elements identically irrespective of whether they fall in the same <line> element as the target word or in preceding or following <line> elements. For example, the system can retrieve the three <word> elements that precede the target <word> element with the following XQuery (assume that $i represents the target <word> element):
for $j in reverse(1 to 3) return $i/preceding::word[$j]
This query returns the third, second, and first <word> elements before the target <word> element, in that order, which is document order. An analogous statement (see the sketch below) can retrieve the three <word> elements that follow the hit, thus providing the rest of the context. Because queries along the long (preceding and following) axes make no distinction between preceding and following <word> elements within the same <line> element and those that require crossing a <line> element boundary, the resulting XQuery code is lucid and concise, and therefore easy to read, write, and maintain.
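For completeness, the analogous statement for the trailing context is equally concise (a sketch mirroring the query above; no reverse() is needed because the following axis already yields the context in document order):

for $j in (1 to 3) return $i/following::word[$j]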
The problem
The preceding strategy retrieves the correct results, and does so with elegant code, but initially it proved unusable in practice for reasons of efficiency, even with appropriately configured eXist Lucene and range indexes (“Configuring,” “Lucene,” “Tuning”). Until a recent revision in the eXist code base (see the Addendum, below), XPath expressions that addressed the long axes were inefficient because eXist retrieved the entire set of nodes on the specified axis before looking at the predicate. For example, in the worst case a hit would fall at the beginning of the first <line> in the first <book> element, which meant that in order to find the three immediately following <word> elements by looking on the following axis, eXist would first retrieve as many as 27221 following <word> elements and only then apply a numerical predicate to filter the returned result. Since the context includes both preceding and following <word> elements (that is, it requires accessing the preceding axis in the former case and the following axis in the latter), hits that have fewer preceding <word> elements have more following ones, and vice versa, which means that all hits require sets of retrievals that span the entire document. The overhead was not a significant problem with queries that retrieved a mere handful of hits, but those that retrieved as few as two hundred hits could take several minutes to return, and sometimes they failed to return at all because they generated Java heap overflow errors (which could have been evaded by increasing the heap size, but that would not have solved the underlying efficiency problem).
This problem is not a unique or inherent property of the long axes. Rather, it is a property of the number of nodes on which the predicate operates. For example, in a flattened tree (imagine transforming the document to one where all <word> elements are directly under the root and line and book boundaries are encoded as empty milestone tags), a hit at the beginning of the document that queried the following-sibling axis, rather than the following axis, would nonetheless retrieve tens of thousands of unwanted <word> elements before applying the predicates to select the mere three that were actually needed. For this reason, although the problem may initially appear to be an overlap issue, in that in its original form it crosses the boundaries of <line> elements, the flattening thought experiment reveals that it is actually an optimization problem that is independent of both the specific axes involved and the depth of the nesting. If, for example, eXist were to look first at the predicate and then access only the necessary elements on the specified axis, instead of first retrieving all elements on that axis and only then consulting the predicate and discarding the unwanted ones (in the present case, all but one), processing would not be suffocated by unnecessary retrieval, irrespective of whether the query needed to cross a <line> element boundary.
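For concreteness, the flattened document of the thought experiment might look like the following sketch (the root element name is illustrative, and the empty <book> and <line> elements are milestones marking the start of each unit):

<poem>
    <book n="2"/>
    <line/>
    <word>Regia</word>
    <word>tota</word>
    <word>silet;</word>
    <!-- remaining words of the line -->
    <line/>
    <word>cum</word>
    <word>visu</word>
    <!-- more words and milestone tags -->
</poem>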
As is discussed in the Addendum, below, eXist has since implemented this type of optimization, but the problem nonetheless continues to merit consideration. Among other things, the issue was never specific to eXist: were that the case, an obvious solution would have been to use an alternative platform that provided the necessary optimization. Instead, the problem provides an opportunity to reflect more generally on the nature of XPath expressions and the relationship between XPath and XML.
For example, although the inefficiency described above is independent of the specific axis involved, insofar as it could also have arisen with the sibling axes in a flattened tree, it nonetheless does depend on the XML structure, or, more precisely, on the query’s indifference to that structure. What the long axes in the actual problem and the sibling axes in the flattened-tree alternative have in common is that they operate independently of the tree structure. By contrast, the sibling axes in the original tree constrain the number of nodes involved because the tree is balanced in a way that ensures that no <word> element will have more than nine sibling <word> elements. What the long axes in that same tree and the sibling axes in the hypothetical flattened tree have in common, on the other hand, is that all <word> elements are treated as though they were on the same level of the tree (in the former case because the long axes ignore the tree and in the latter case because the tree is flattened, and the hierarchy therefore irrelevant). This suggests that unless one can be certain that the software that will evaluate one’s XPath expressions will optimize one’s query, the designer should take into consideration the number of nodes that will be addressed by those expressions.
An XPath solution
Since eXist was unable to optimize the queries in question, that duty fell on the user, who in this case first adopted a strategy that avoided the long axes, favoring instead the sibling axes at all levels (<word> and <line>). This constrains searches for <word> elements within a <line> element to an average of 2.64 (5.27 / 2) elements and searches for <line> elements within a <book> element to 240.5 (481.7 / 2) elements, and in the worst case to 5.27 and 481.7, respectively. These numbers should be multiplied by six for a typical query, which retrieves three preceding and three following <word> elements, but they still compare very favorably to queries along the long axes, which consult 13610.5 (27221 / 2) <word> elements on average and 27221 in the worst case. A prose explanation of the strategy for retrieving the three <word> elements following the target while using the following-sibling axis exclusively instead of the following axis is:
- If there is a <word> element in the appropriate position on the following-sibling axis (that is, in the same <line> element), retrieve it.
- If not, navigate up to the parent axis (a <line> element), get its following-sibling <line> element, and retrieve the appropriate <word> child element from within that <line>.
The following XQuery snippet retrieves the three <word> elements that follow the target. Assume that $i refers to the target (here and in subsequent examples):
if (count($i/following-sibling::word) ge 3) then
    for $j in (1 to 3) return $i/following-sibling::word[$j]
else if (count($i/following-sibling::word) eq 2) then (
    for $j in (1 to 2) return $i/following-sibling::word[$j],
    $i/parent::line/following-sibling::line[1]/word[1]
)
else if (count($i/following-sibling::word) eq 1) then (
    $i/following-sibling::word[1],
    for $j in (1 to 2) return $i/parent::line/following-sibling::line[1]/word[$j]
)
else
    for $j in (1 to 3) return $i/parent::line/following-sibling::line[1]/word[$j]
An analogous strategy, mirrored onto the preceding-sibling axis, can retrieve the three <word> elements that immediately precede the target (see the sketch below).
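For readers who want the details, the mirror image might look like the following sketch (it makes the same assumption as the code above, namely that the preceding sibling <line> element, when one is needed, contains enough <word> elements; reverse() returns the context in document order):

let $p := count($i/preceding-sibling::word)
let $prev := $i/parent::line/preceding-sibling::line[1]
return
    if ($p ge 3) then
        (: all three context words are in the same line :)
        for $j in reverse(1 to 3) return $i/preceding-sibling::word[$j]
    else (
        (: take the last (3 - $p) words of the previous line ... :)
        for $j in reverse(0 to (2 - $p)) return $prev/word[last() - $j],
        (: ... and then the $p preceding siblings, in document order :)
        for $j in reverse(1 to $p) return $i/preceding-sibling::word[$j]
    )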
It is possible to generalize this solution to allow the user to specify at run time the number of preceding or following <word> elements to provide as context, along the following lines (assume that $scope specifies the number of words of context to provide after the target <word>):
if (count($i/following-sibling::word) ge $scope) then
    for $j in (1 to $scope) return $i/following-sibling::word[$j]
else
    for $k in (0 to ($scope - 1)) return
        if (count($i/following-sibling::word) eq $k) then (
            for $j in (1 to $k) return $i/following-sibling::word[$j],
            for $j in (1 to ($scope - $k)) return
                $i/parent::line/following-sibling::line[1]/word[$j]
        )
        else ()
An analogous strategy can retrieve a user-specified number of <word> elements that immediately precede the target. This generalization, however, is fragile, because it looks only at the <line> element that contains the target <word> element plus the immediately preceding or following sibling <line> element. This means that, for example, if the user wants to include a context of five following <word> elements, the code will fail when the target <word> falls at the end of a <line> and the following (sibling) <line> contains only four <word> elements. It might be possible to circumvent this limitation through a recursive approach (sketched below), but by that point the code would become so convoluted as to be difficult to understand and impractical to maintain.
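For the curious, one possible shape for such a recursion is sketched here (local:next-words is a hypothetical name, and the sketch assumes that every <line> element contains at least one <word> element); it walks forward one <word> at a time, crossing <line> boundaries as needed, so it survives arbitrarily short lines at the cost of one function call per context word:

declare function local:next-words($w as element(word)?, $n as xs:integer)
        as element(word)* {
    if ($n le 0 or empty($w)) then ()
    else
        (: the next word is either the nearest following sibling or,
           failing that, the first word of the next line :)
        let $next :=
            if ($w/following-sibling::word) then
                $w/following-sibling::word[1]
            else
                $w/parent::line/following-sibling::line[1]/word[1]
        return ($next, local:next-words($next, $n - 1))
};

local:next-words($i, $scope)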
A range index solution
In addition to the new Lucene-based full-text index (and the original full-text index, which is still available), eXist also supports range indexes, which provide very fast access to element nodes on the descendant-or-self (//) or child (/) axis and attribute nodes on the attribute (/@) axis according to their typed data values. This suggests an alternative approach:
- Before storing the XML source document, create an @offset attribute for each <word> element and assign it a unique sequential integer value: the first <word> element in the document has an @offset value of 1, the next has a value of 2, and so on (see the sketch after this list).
- Retrieve the target <word> element by using the new Lucene full-text index.
- Retrieve the adjacent context <word> elements according to their @offset attribute values by counting backward or forward from the @offset value of the target <word> element.
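As promised in the first step above, the numbering can be generated mechanically before the document is stored; the following identity transform is a sketch of one way to do so in XQuery (local:number is a hypothetical name, and the input filename is illustrative):

declare function local:number($n as node()) as node() {
    typeswitch ($n)
        (: rebuild each word element with a sequential @offset attribute :)
        case element(word) return
            element word {
                attribute offset { count($n/preceding::word) + 1 },
                $n/@*,
                $n/node()
            }
        (: copy every other element, recurring into its children :)
        case element() return
            element { node-name($n) } {
                $n/@*,
                for $c in $n/node() return local:number($c)
            }
        (: text nodes, comments, etc. pass through unchanged :)
        default return $n
};

local:number(doc("acl.xml")/*)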
This approach is available only where the designer has control over the XML source and is able to incorporate a specific @offset attribute that is to be used only for navigation during retrieval. It has at least two weaknesses, one aesthetic and one technical:
- The aesthetic problem is that the ordinal position of each <word> element within the document is an inherent property of the document structure, and the user should not have to specify, through the insertion of character-based markup into the document, a value that is already encoded implicitly but consistently and unambiguously in the markup structure.
- The technical problem is that the insertion or deletion of a word in the document will break the numbering, requiring that the @offset values be calculated anew and rewritten. In the present case the document is relatively stable (that is, more stable than, for example, an on-line commerce site that writes new data for every transaction), but the editor may still choose to modify her reading at some point as she reconsiders the available evidence and perhaps discovers and needs to integrate data from newly discovered manuscript witnesses.
In practice, neither of these weaknesses imposes a serious inconvenience in the context of the present project, and the opportunity to use the eXist range index feature provides a substantial improvement in response time over alternative strategies (see the time test data below).
With the modification to the XML source described above, the first two lines of Book 2 now look like:
<book n="2"> <line> <word offset="3708">Regia</word> <word offset="3709">tota</word> <word offset="3710">silet;</word> <word offset="3711">expirat</word> <word offset="3712">murmur</word> <word offset="3713">in</word> <word offset="3714">altum,</word> </line> <line> <word offset="3715">cum</word> <word offset="3716">visu</word> <word offset="3717">placidos</word> <word offset="3718">delegat</word> <word offset="3719">curia</word> <word offset="3720">vultus,</word> </line> <!-- more lines --> </book>
The following code will now use a properly configured eXist range index to retrieve the three <word> elements following the target <word> element (assume that $i is the target <word> element):
let $offset := $i/@offset
return
    for $j in (1 to 3)
    return doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]
An analogous strategy can retrieve the three <word> elements that immediately precede the target. Furthermore, this solution is easily generalized to allow the user to specify the number of preceding or following context <word> elements at run time (assume that $scope specifies the number of words of context to provide after the target <word>):
let $offset := $i/@offset
return
    for $j in (1 to $scope)
    return doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]
The strategy of writing structural information about the XML (such as offset position) into attribute values of the XML source itself in the interest of improved execution time can also be applied to writing slashes to mark the ends of lines. The most natural XPath way to write a slash after the target <word> element when it ends a <line> element in the XML source is:
if (not($i/following-sibling::word)) then " / " else ""
A similar approach can be used to write slashes after the leading and trailing context <word> elements when they end a <line> element. This method is not maddeningly slow, because the number of nodes on the sibling axes is typically small, but further improvement in processing speed is available by modifying the XML source to include the string " / " (without the quotation marks) as an attribute value associated with <word> elements that fall at the end of <line> elements and then retrieving it when generating the report, instead of consulting the sibling axis. The first two lines of Book 2 now look like:
<book n="2"> <line> <word offset="3708">Regia</word> <word offset="3709">tota</word> <word offset="3710">silet;</word> <word offset="3711">expirat</word> <word offset="3712">murmur</word> <word offset="3713">in</word> <word last=" / " offset="3714">altum,</word> </line> <line> <word offset="3715">cum</word> <word offset="3716">visu</word> <word offset="3717">placidos</word> <word offset="3718">delegat</word> <word offset="3719">curia</word> <word last=" / " offset="3720">vultus,</word> </line> <!-- more lines --> </book>
The XQuery code to write the trailing context <word> elements is then:
let $offset := $i/@offset
for $j in (1 to 3)
return (
    " ",
    data(doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]),
    data(doc("/db/acl/acl.xml")//word[@offset eq ($offset + $j)]/@last)
)
This approach writes a space after the target <word> element, followed by the appropriate trailing context <word> element, followed by the value of the @last attribute of that <word> element. If there is a @last attribute, it has the value " / " (without the quotation marks), which is what we want to write. If the attribute is missing, that statement operates vacuously, producing no output.
Time test results
To test the relative efficiency of the various coding strategies described above, the same query was executed ten times with each of four strategies. The query was a search for the word sed (Latin for ‘but’), which occurs 221 times in the corpus, and the XQuery scripts were all written to return it along with its location (book and line number) and with three context words on either side. The search strategies were:
- Long axes: Search for context words using the preceding and following axes. Place slashes at the ends of lines by checking whether each output word has a following-sibling <word> element in the same <line>.
- Sibling axes: Search for context words using the sibling axes. If there are not enough context <word> elements in the same <line>, find the nearest sibling of the <line> and navigate to the desired <word> element along the child axis. Place slashes as described above.
- @offset: Modify the XML to add an @offset attribute to every <word> element and find the context words by counting down or up from the @offset attribute value of the target <word>. Place slashes as described above.
- @last: Same as @offset, except also add a @last attribute in the XML to every <word> element that is the last in its parent <line>, and place slashes at the ends of lines by returning the value of that attribute.
The tests were conducted on a Gateway 3 GHz Pentium D with 1 GB of memory, running Microsoft Windows Vista with Service Pack 1. The Java version is 1.6.0_13 and the eXist version is 1.3.0dev-rev:0000-20090528.
Table I

Test no. | Long axes | Sibling axes | @offset | @last
1        |   447.750 |       38.477 |  34.345 | 24.413
2        |   440.265 |       38.390 |  34.446 | 24.416
3        |   559.141 |       38.710 |  34.438 | 24.445
4        |   905.472 |       38.484 |  35.409 | 24.384
5        |   702.915 |       38.590 |  34.562 | 24.447
6        |   530.739 |       38.341 |  34.798 | 24.424
7        |   851.608 |       38.714 |  34.145 | 24.415
8        |   473.601 |       38.496 |  34.410 | 24.395
9        |   670.772 |       39.521 |  34.423 | 24.463
10       |   473.601 |       38.317 |  34.429 | 24.469
Mean     |   631.150 |       38.604 |  34.541 | 24.427
Ratio 1  |    25.838 |        1.580 |   1.414 |  1.000
Ratio 2  |    18.273 |        1.118 |   1.000 |
Ratio 3  |    16.349 |        1.000 |         |
Times are reported in seconds. The three ratio lines in the table set the time for one of the tests at a value of 1 and then calculate the amount of time the other implementations required in proportion to it.
The results show that querying along the long axes took more than 16 times as much time as querying along the sibling axes. Using the @offset attribute value instead of either the long axes or the sibling axes saved an additional 11% in time, and using the @last attribute value as well saved a further 29% over that (the @offset implementation took 1.41 times as long as the @last implementation). All told, the implementation that relies on the long axes took more than 25 times as much time as the one with the greatest optimization.
Is it XML?
The XML version of the poem has an inherent hierarchy (the poem contains books, which contain lines, which contain words) and inherent order (the words occur in a particular order, as do the lines and books). Those inherent features are encoded naturally in the structure of the XML document, because XML documents are obligatorily hierarchical (even though in some projects the hierarchy may be flat) and ordered (even though in some projects the user may ignore the order). The addition of @offset and @last attributes and the adoption of a strategy that treats the document as flat and never looks at the hierarchy essentially transforms the approach from one that is based on natural properties of XML documents to one that is based on a flat-file database way of thinking. That is, we could map each <word> element in the XML version to a record in a database table, the fields of which would be the textual representation of the word (a character string), the offset value (a unique positive integer), an indication of whether the word falls at the end of a line (a boolean value), and the book and line number (a string value, which is used in reporting). Records in a database do not have an inherent order, but once we rely on the value of the @offset attribute in the XML document, the <word> elements might as well be sprinkled through the document in any order, and the <line> and <book> elements play no role at all in the system. That is, except for the book and line number, the most highly optimized (and most efficient) implementation above adopts precisely a flat-file database approach, which raises the question of whether this project should have been undertaken in XML in the first place.
The answer to that rhetorical question is that of course it should have been undertaken in XML because the order and hierarchy are meaningful. They are inherent in the XML structure but must be written explicitly into a corresponding database implementation, which indicates that this is data that wants, as it were, to be regarded as an ordered and hierarchical XML document. The problem is not that the data is inherently tabular, and therefore inherently suited to a flat-file database solution, but that the XML tool available to manipulate the data was not sufficiently optimized for the type of retrieval required.
Conclusion
The best solution would be, of course, an optimization within eXist that would let users write concise and legible XQuery code (using the long axes), which would then be executed efficiently through optimization behind the scenes. This type of solution would remove the need both for more complex code (along the lines of the sibling-axes approach described above) and for modifying the XML to write information into the document in character form when that information is already inherent in the document structure. Until such a solution became available, though, the strategies described above provided a substantial improvement over explicit use of the long axes, salvaging a project that would otherwise have been unusable for reasons of efficiency.
Addendum
eXist is an open-source project, which means that impatient users who require an optimization not already present in the code have the opportunity to implement that optimization themselves and contribute it to the project. Unfortunately, in the present case this particular impatient user lacked the Java programming skills to undertake the task. Fortunately, however, the eXist development team is very responsive to feature requests from users, and shortly after I wrote to the developers about the problem they released an upgrade that implemented precisely the modification described above (consult the predicate first and retrieve only the nodes that will be needed from the designated axis). Rerunning the original code that relied on the long axes on the same machine as the earlier tests but using eXist version 1.3.0dev-rev9622-20090802, which includes this new optimization, yielded times of 1.754, 1.778, 1.765, 1.944, 1.944, 1.777, 18.949, 1.838, 1.763, and 1.798 seconds. The mean time for these tests was 3.531 seconds, and if we exclude the aberrant long time on the seventh trial (an artifact of a system process that woke up at an inconvenient moment?), the mean drops to 1.818 seconds. The 3.531-second figure is 14.455% of the best mean time (24.427 seconds) achieved with my XPath-based optimizations and 0.559% of the mean time of the long-axes search (631.150 seconds) before the introduction of the eXist-internal optimization. The 1.818-second figure is 7.443% of the best mean time (24.427 seconds) achieved with my XPath-based optimizations and 0.288% of the mean time of the long-axes search (631.150 seconds) before the introduction of the eXist-internal optimization.
The eXist optimization works by checking the static return type of the predicate expression to determine whether it is a positional predicate. (This paragraph reproduces more or less verbatim an explanation provided by the eXist developers.) If the answer is yes and there is no context dependency, the predicate will be evaluated in advance and the result will be used to limit the range of the context selection (e.g., following::word). For example, $i/following::word[1] would benefit from the optimization because the predicate is positional and entails no context dependency. On the other hand, $i/following::word[position() = 1] would not be optimized, because it introduces a context dependency insofar as position() returns the position of the current context item and cannot be evaluated without looking at the context.
Furthermore, determining the static type is not always easy. In particular, the type information is passed along local variables declared in a let or for clause, but it gets lost through function calls. My original query, for $j in (1 to 3) return $i/following::word[$j], works, but if $j were a function parameter, it would not (see the sketch below). Additionally, support for this optimization with particular XPath functions is being introduced only incrementally, to avoid breaking existing code. For example, the developers’ initial attempt at an optimization failed with the reverse() function that I used to retrieve the three preceding words in the correct order, although support for this function was added to the optimization later.
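To make the contrast concrete, the following sketch routes the same positional lookup through a user-defined function (local:nth is a hypothetical name); on the reasoning above, the first expression below is eligible for the optimization, while the second is not:

declare function local:nth($w as element(word), $j as xs:integer)
        as element(word)? {
    (: across the call boundary the optimizer can no longer see that $j
       serves as a positional constant :)
    $w/following::word[$j]
};

(: optimized: the type of $j is visible from the for clause :)
for $j in (1 to 3) return $i/following::word[$j]
,
(: not optimized: the same lookup routed through a function call :)
for $j in (1 to 3) return local:nth($i, $j)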
The unsurprising technical conclusion, then, is that, at least in the present case, optimization of the XPath code by the user to reduce the scope of a query can achieve substantial improvement, but much more impressive results are obtained by optimizing the Java code underlying the XPath interpreter. What this experiment also reveals, though, is that, at least in the present case, the user was not reduced to waiting helplessly for a resolution by the developers, and was able to achieve meaningful improvement in those areas that he did control, viz., the XML, XPath, and XQuery.
In his concluding statement at the Balisage 2009 pre-conference Symposium on Processing XML Efficiently, Michael Kay invoked David Wheeler’s advice that application developers optimize the code that users actually write, that is, that they find out what people are doing and make that go quickly. From an end-user perspective, though, the lesson can be reversed: Find out what goes quickly and use it.
References
[Configuring] “Configuring Database Indexes.” (Part of the eXist documentation.) http://www.exist-db.org/indexing.html. Accessed 2009-05-31.
[Lucene] “Lucene-based Full Text Index.” (Part of the eXist documentation.) http://www.exist-db.org/lucene.html. Accessed 2009-05-31.
[Tuning] “Tuning the Database.” (Part of the eXist documentation.) http://exist.sourceforge.net/tuning.html. Accessed 2009-05-31.