Sperberg-McQueen, C. M. “Aparecium: An XQuery / XSLT library for invisible XML.” Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). https://doi.org/10.4242/BalisageVol23.Sperberg-McQueen01.
Balisage Paper: Aparecium
An XQuery / XSLT library for invisible XML
C. M. Sperberg-McQueen
Founder and principal
Black Mesa Technologies LLC
C. M. Sperberg-McQueen is the founder and
principal of Black Mesa Technologies, a consultancy
specializing in helping memory institutions improve
the long term preservation of and access to the
information for which they are responsible.
He served as editor in chief of the TEI
Guidelines from 1988 to 2000, and has also served
as co-editor of the World Wide Web Consortium's
XML 1.0 and XML Schema 1.1
specifications.
Aparecium is an XQuery / XSLT library for reading non-XML
data as XML, under the control of an invisible
XML grammar describing the structure of the input.
The use of the library is illustrated with an application, and
some technical issues in the development of the library are
discussed.
She tapped the diary three times and said, Aparecium!
— J. K. Rowling,
Harry Potter and the Chamber of Secrets
Introduction
This paper introduces Aparecium, a library intended to make
the use of invisible XML convenient for users of
XSLT and XQuery. This introduction provides some background
information on invisible XML and describes the purpose of the
Aparecium library. The next section illustrates the use of the
library with a sample application to parse and evaluate arithmetic
expressions (section “A sample application”). Following that
illustration the library's public interface is described and some
technical issues are discussed. The paper concludes with a survey
of related work (section “Related work”) and a section on
future tasks (section “Future work”).
Invisible XML
In 2013, Steven Pemberton introduced an idea he called
invisible XML (Pemberton 2013),
which he describes as “a method for treating non-XML
documents as if they were XML” (Pemberton 2019b). More recently, he has produced
a more formal specification of the idea (Pemberton 2019a). From a data creator's or author's point
of view, invisible XML (ixml, for short) allows authors to use
formats other than XML if they prefer, while still retaining
access to the XML tool chain. From the XML programmer's point
of view, it allows people working within the XML eco-system to
have more convenient access to non-XML data. The only
prerequisite is that it must be possible to describe the data
using a context-free grammar.
A great many resources are available in specialized
formats like CSS, wiki markup, or domain-specific notations.
Social-science data may interleave group-level records with
individual-level records in a way that varies from project to
project. Other resources are distributed in generic non-XML
formats like comma-separated values or JSON. Nor is it only
entire documents, files, or web resources which may use non-XML
formats. SGML and XML documents have a long history of
including elements or attributes whose value is written in a
specialized notation: mathematical expressions are often written
in TeX or LaTeX; pointers may be written as URLs, as other forms
of URI, as ISSNs or ISBNs, or (not too long ago) in TEI
extended-pointer notation; inline styles are now frequently
written in CSS. All of these notations make visible some of the
structure of the information; sometimes it would be convenient
to have easy access to that structure in XML processing.
Invisible XML is a way of making that possible.
Context-free grammars and parse trees
The main requirement of invisible XML is that the non-XML
data to be read should be described by a context-free grammar.
(Readers not familiar with this concept may think of it as a
kind of DTD or schema for non-XML data.) For example, we might
describe the language of arithmetic expressions like 23 *
(2 + 7.2) with the following grammar, written in ixml
notation:
{ An expression is the sum of one or more terms. }
expression: term, (add-op, term)*.
{ A term is the product of one or more factors. }
term: factor, (mul-op, factor)*.
{ Addition operators and multiplication operators are
+, -, *, and /. }
add-op: "+"; "-".
mul-op: "*"; "/".
{ A factor is either a number or a parenthesized
sub-expression. }
factor: number; "(", expression, ")".
number: digit+, (".", digit*)?.
Here, braces surround comments; each rule of the grammar
consists of a single non-terminal symbol followed by a colon, a
regular expression on the right-hand side of the colon, and a
full stop. On the right-hand side, quoted strings denote
terminal symbols and unquoted identifiers denote non-terminal
symbols; commas separate items which must appear in sequence,
semicolons separate alternatives, and parentheses enforce
grouping when necessary. A question mark signals that the
immediately preceding expression is optional, a plus sign that
it may occur one or more times, and an asterisk that it may
occur zero or more times.
Parsed against the grammar above, the arithmetic
expression 23 + 1834 / (60 - 3) - 9 has a parse
tree in which the nesting of expressions, terms, and factors
mirrors the structure of the grammar. It will not have escaped
the reader that such a tree structure can readily be expressed
in an XML document:
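(The sketch below is illustrative: element names follow the grammar above, terminal characters appear as text content, and whitespace is omitted, since the grammar as given does not allow it.)

<expression>
  <term>
    <factor><number>23</number></factor>
  </term>
  <add-op>+</add-op>
  <term>
    <factor><number>1834</number></factor>
    <mul-op>/</mul-op>
    <factor>(<expression>
        <term><factor><number>60</number></factor></term>
        <add-op>-</add-op>
        <term><factor><number>3</number></factor></term>
      </expression>)</factor>
  </term>
  <add-op>-</add-op>
  <term>
    <factor><number>9</number></factor>
  </term>
</expression>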
Such deeply nested tree structures and such a high
markup-to-content ratio are unusual in XML vocabularies
designed for hand authoring, but, when well designed, they can
make processing the information much simpler.[1] For example, it
should be fairly easy to see that an XSLT stylesheet is capable
not just of displaying but of evaluating arithmetic expressions
represented in this way. Each node in the syntax tree which
represents a well-formed (sub)expression can be assigned a
value. For leaf nodes containing numerals, the value is the
number denoted by the numeral. Leaf nodes denoting operators and
whitespace are not well formed arithmetic expressions and have
no numeric value. In the case of factors, terms, and
expressions, the value depends on the values of their children,
in the pattern known as compositional
semantics (the meaning of the compound expression is
composed from the meanings of its component
parts). Each node of the syntax tree can thus be labelled with
its value; the value at the root is the value of the whole expression.
In a nutshell, the idea of invisible XML is to regard the
XML structure as latent in the original expression: present, but
invisible. The task of Aparecium or any other invisible-XML
parser is to make that invisible XML visible, so that it can be
processed by standard XML tools.
Earley parsers
Many tools for using context-free grammars require that
the grammar have particular properties which are necessary for
particular parsing strategies to work: some tools require that
the grammar be (as computer scientists say) LL(1), or LL(k);
others like the classic parser generator yacc require that the
grammar be LALR(1). I will not detain the reader with a
description of what these different properties entail, or even
what the abbreviations mean; suffice it to say that not all
context-free grammars satisfy the constraints.
The advantage of such restrictions is that they allow the
tool to use
parsing techniques which are relatively simple, require
relatively little storage, and are relatively fast. The
disadvantage is that even experienced parser writers may find it
difficult to write grammars which satisfy the constraints, and
for existing notations such grammars may not even be possible.
Worse, it is very difficult to explain to users of the tool even
what the constraints are, without administering a short course
in the relevant parsing algorithms.[2]
Invisible XML takes a different approach: users are not
required to supply grammars that satisfy any particular set of
implementation-related constraints. Instead, implementations
are required to accept any context-free grammar. This in turn
suggests that implementations will need to use one of the
algorithms which is guaranteed to work for all context-free
grammars without restriction. There are several such
algorithms[3], but the invisible XML specification suggests the
use of an Earley parser.
Earley parsers are guaranteed to handle arbitrary
context-free grammars, including ambiguous ones. In the worst
case they parse an input in time and space proportional to the
cube of the length of the input, but on non-pathological
grammars their performance is much closer to being linear in the
length of the input. They will still however typically be
slower than non-general parsers, because they have more
housekeeping overhead. Issues arising in the use of functional
languages like XQuery or XSLT to implement Earley parsing, and
the extension of the algorithm to cover regular-right-part
grammars (like invisible XML) instead of only grammars in
Backus-Naur Form, have been discussed in general terms by Sperberg-McQueen 2017.
The Aparecium library
Pemberton's original description of invisible XML proposes
that those who publish resources parseable by means of invisible
XML should label them with the MIME type
application/xml and include a link to the
appropriate invisible-XML grammar in the HTTP header (Pemberton 2013). When the HTTP server is configured
as proposed, a CSS stylesheet can be served both as
text/css and as
application/xml-invisible; content negotiation
between the client and the server will determine which form is
sent. When the client's request specifies a preference for the
application/xml form, the server will serve the stylesheet as
usual, but the Content-type line of the response will carry an
additional syntax parameter giving the location of the ixml
grammar. The syntax parameter allows suitably designed
software on the client side to fetch the ixml grammar for
the resource from the specified location, use it to parse
the resource, and present the resource to the user in an XML
form.
This design allows invisible XML to fit neatly into
the architecture of the Web, but for the moment it has
a certain flaw. Most of the non-XML resources which
we might like to process as XML are served only in their
default MIME type, not as application/xml.
The Aparecium library is intended to make it easy for
people working in XSLT or XQuery to use invisible XML
without requiring any cooperation from webmasters or servers
belonging to other organizations. It does not rely on
distant servers to label resources with the application/xml
MIME type, although if they do, it can handle those cases,
too.
For resources served as described by Pemberton 2013, Aparecium offers a call modeled on
the doc() function of XPath 2.0: a call to
aparecium:doc($uri)
will (in the normal case) return an XML representation of
the resource at the URI given.
For resources not served in this way, Aparecium offers
a way for the XSLT or XQuery programmer to take matters
into their own hands by specifying both the URI of the
resource to retrieve and the URI of an ixml grammar
describing it:
aparecium:parse-resource($uriR, $uriG)
When the non-XML material and its grammar are not on the
Web or in files but are available in string form, Aparecium
offers a similar function to parse a string:
aparecium:parse-string($string, $grammar)
Since ixml grammars are themselves described by an ixml
grammar, they also have a defined XML representation, so the
grammar can also be supplied by the user in XML form; this
saves parsing the grammar into XML preparatory to parsing
the resource the user is interested in. Given the XML form
of a grammar, Aparecium pre-processes it before using it to
parse input; this pre-processed form can also be used. A
family of functions is provided to allow various
combinations of these variations: input as URI or string;
grammar as URI or as string or as
standard XML or as pre-processed
XML.
A sample application
As a simple application of Aparecium, let us consider
an evaluator for arithmetic expressions.
The ixml grammar for the expressions we would
like to accept is similar to the one given above,
but extended in a few ways: we allow whitespace to
appear within expressions; we predefine two named
constants; and we allow some alternative Unicode
forms for symbols and constants.
Expressions we would like to reject (ideally with
useful diagnostics, though that is not the case at
the moment) include:
1 + 3 2
1 * 2 +- 10 / 2
1 * (2 = 10) / 2
1.02e3 + 3
π × 42^2
r = 42; π × r^2
The application will allow the user to type in an
arithmetic expression and produce an XML representation of the
parse tree. For example, we can type in the expression
mentioned above into the text window:
When the user clicks on the appropriate buttons, the
expression is parsed and the XML representation already shown
above is displayed.
Note: For pedagogical purposes, the
demo application produces not only the
abstract syntax tree specified in the
ixml grammar, but also a raw or
concrete parse tree which represents
all non-terminal symbols as elements, eliding none and
representing none of them as attributes. Such behavior
is unlikely to be of interest except to those interested
in parser implementation.
To illustrate the possible exploitation of the XML
structure, the application also provides a button to find the
numeric value of the expression. To make it easier (possibly)
to understand the evaluation process, the values of all
subexpressions in the tree are also shown:
The internals of the demo application are not of great
interest except as they illustrate the use of the Aparecium
library. The user interface is provided by an XForm displayed
by a web browser, which sends the expression typed by the user
to the server and accepts the XML representation of the
expression's parse tree in return. The actual parsing takes
place on the server (although it could in principle be
performed in the browser using an implementation of XQuery or
XSLT 3.0 in the browser). The arithmetic evaluation of the
expression is handled by an XSLT 1.0 stylesheet running in the
browser; the only non-trivial parts of the stylesheet are the
templates that calculate the values of expressions and terms,
which must cope with an indeterminate number of child
expressions and apply the specified operators in the
appropriate left-to-right sequence.
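As an illustration of this compositional, left-to-right evaluation (not the demo's actual code, which is XSLT 1.0), a rough XQuery sketch over parse trees using the element names of the grammar in the introduction might look like this:

declare function local:value($e as element()) as xs:double {
  typeswitch ($e)
    (: a numeral denotes itself :)
    case element(number) return xs:double($e)
    (: a factor is either a parenthesized expression or a number :)
    case element(factor) return
      if ($e/expression)
      then local:value($e/expression)
      else local:value($e/number)
    (: expression and term: fold the operators over the operands,
       left to right; whitespace children, if any, are simply ignored :)
    default return
      let $operands  := $e/(term | factor)
      let $operators := $e/(add-op | mul-op)
      return fold-left(
        for $i in 1 to count($operators)
        return map { "op":  string($operators[$i]),
                     "val": local:value($operands[$i + 1]) },
        local:value($operands[1]),
        function($acc, $step) {
          switch ($step?op)
            case "+" return $acc + $step?val
            case "-" return $acc - $step?val
            case "*" return $acc * $step?val
            case "/" return $acc div $step?val
            default  return $acc
        }
      )
};

Applied to the parse tree shown earlier for 23 + 1834 / (60 - 3) - 9, such a function would return approximately 46.18.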
The part of the application of most concern for purposes of
this paper is the XQuery module that turns a string containing
an arithmetic expression into an XML document representing its
parse tree. In this case, it could consist essentially of a
single call to the Aparecium parse-string()
function, like this:[4]
import module namespace a = "http://blackmesatech.com/2019/iXML/Aparecium";
declare variable $expression as xs:string external;

(: parse the expression against the arith.ixml grammar, fetched as text :)
a:parse-string(
  $expression,
  unparsed-text(
    "http://blackmesatech.com/2019/demos/arith/arith.ixml"
  )
)
To explain why this is not the way the demo application is
actually implemented will require a short discussion of some
of the internal organization of the library.
Ignoring the fact that at various times things
move from one machine to another, the overall flow of
the application is as shown in this diagram:
There are two inputs (the expression and the ixml
grammar for arithmetic expressions) and one output (the number
denoted by the expression). Within the function
parse-resource(), however, the grammar
goes through several transformations, as shown in this
expansion of the diagram:
Since the grammar does not change between calls to the demo
application, there is no advantage in parsing it each time and
preparing it for use. Instead, the XML version is generated
and prepared for use once as part of application development,
and the core module of the application uses that prepared
grammar rather than the original ixml form:[5]
import module namespace a = "http://blackmesatech.com/2019/iXML/Aparecium";
declare variable $expression as xs:string external;

a:parse-string-with-compiled-grammar(
  $expression,
  "http://blackmesatech.com/2019/demos/arith/arith.gl.xml"
)
A real application would of course
perform as many sanity checks on the input, and recover
from as many user errors, as possible at this point; the demo
application does not currently do any error checking.
If the design of the interface has been successful, the
reader will now be thinking something like “That's all there
is to it?” And, as far as making the non-XML input
available as XML is concerned, yes, that is all there is
to it. It will sometimes happen that providing the
grammar needed to describe the non-XML resource will be
non-trivial[6] but one goal of
Aparecium is that once such a grammar is available, actually
gaining access to non-XML resources that follow that grammar
should be as easy as a call to the doc()
function.
The library
The preceding sections have reviewed some background
information and given a simple example of the library use. We
turn in this section to a discussion of the library itself: its
interfaces, its current state of development, and some
technical issues which arise in its development.
Basic interfaces
The implementation of Aparecium is spread across several
modules; all public-facing functions are gathered into one
module, so only one import instruction should
be needed. The functions which need the least explanation
have already been mentioned:
aparecium:parse-string($Input,
$Grammar) takes two strings as arguments: one
is the input to be parsed, the second is the ixml grammar
to be used.
First the grammar is itself parsed against the ixml
grammar for ixml grammars to produce XML, and then the XML
is annotated. To be more specific: the elements representing
symbols on the right-hand side of rules are given
additional attributes which allow the rule to be interpreted
as a Gluschkov automaton, following the algorithm given by
Brüggemann-Klein 1993a. The annotated grammar is then
used to parse the input, and the resulting XML document is
returned.
aparecium:parse-resource($Resource,
$Grammar) takes two URIs as arguments. The first
is the URI of the input to be parsed, the second is that
of the ixml grammar to be used.
Each is dereferenced into a string (using the
unparsed-text() function); from this point on
the course of events is as described for
parse-string().
The workhorse function of the library requires that the
grammar be provided neither as a URI nor as a string, but
already pre-processed into annotated XML:
aparecium:parse-string-with-compiled-grammar($Input,
$Grammar) takes a string representing the input and
an annotated XML form of the grammar. It calls the Earley
parser with these as arguments and returns the resulting
XML document.
In the simple case, the resulting XML document will be
a parse tree for the input as specified in the ixml
grammar: terminal symbols will all be included by default,
but may be omitted from the tree if marked with a minus
sign. Non-terminal symbols will by default be represented
as elements, but if marked - they will
be elided (though their children will continue to
appear). If marked with @, a
non-terminal will become an attribute on the nearest
ancestor node in the parse tree that becomes an element;
this will normally be the parent, but it may be a
grandparent or other ancestor if the parent and other
intervening ancestors are elided. This is all as
described in Pemberton 2013. (Note: in
the current implementation the parse function returns a
full parse tree and a separate process is needed to
perform the elisions and conversions from element to
attribute.)
If the input does not parse against the grammar, an XML
document with trace and diagnostic information is
returned. (In the current implementation, this
essentially dumps all the relevant data structures, in a
way that the intended users of the library will not find
helpful; it is hoped that future revisions of the library
will produce better diagnostics.)
If the input is ambiguous, multiple trees are
returned; this topic is discussed further below (section “Treatment of ambiguity”).
Aparecium also provides a function for fetching resources
whose publishers provide access to them with the MIME type
application/invisible-xml and a pointer to an
appropriate ixml grammar, as described in Pemberton 2013.
aparecium:doc($Resource) takes a
string containing the URI of an invisible-XML resource,
retrieves it, checks the HTTP header of the response for a
pointer to the ixml grammar, retrieves and parses the
grammar, and then parses the resource against the grammar
and returns the resulting XML.
Note: this function is not yet implemented (though it
should be by the time of Balisage 2019); testing it
will require configuring an HTTP server to behave as
required by Pemberton 2013.
Several functions are also available for preparing
grammars.
aparecium:parse-grammar-from-string($Grammar)
takes a string containing an ixml grammar, parses it
against the ixml grammar for ixml grammars, and returns
the resulting XML.
aparecium:parse-grammar-from-uri($Grammar)
takes a string containing the URI of an ixml grammar,
fetches the grammar by dereferencing the URI, parses it
against the ixml grammar for ixml grammars, and returns
the resulting XML.
aparecium:compile-grammar-from-string($Grammar)
takes a string containing an ixml grammar, parses the
grammar, compiles (annotates) it as
described above, and returns the resulting annotated XML.
aparecium:compile-grammar-from-uri($Grammar)
takes a string containing the URI of an ixml grammar,
fetches the grammar by dereferencing the URI, and then
proceeds as for compiling from a string.
aparecium:compile-grammar-from-xml($Grammar)
takes as input the XML representation of an ixml grammar,
annotates it as described above, and returns the annotated
XML.
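For example, the prepared grammar used by the arithmetic demo could be produced once, as part of application development, by a call along the following lines (this is an illustration, not the demo's actual build step; the result would then be serialized to arith.gl.xml):

import module namespace a = "http://blackmesatech.com/2019/iXML/Aparecium";

(: fetch and compile the ixml grammar; the result is the annotated XML form :)
a:compile-grammar-from-uri(
  "http://blackmesatech.com/2019/demos/arith/arith.ixml"
)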
Current state of implementation
The current implementation of Aparecium is probably best
described as a proof of concept. The parsing functions work
(on at least some inputs) and produce XML which does
represent the abstract syntax tree specified in the relevant
ixml grammar. The results are satisfactory enough to allow
simple demo applications like the arithmetic-expression
evaluator described above to be built.
On the other hand, a great deal of work remains to be
done before the library reaches production quality. The
current test suite contains 75 or so grammars, including
some chosen to exhibit odd or pathological behavior like the
infinite ambiguity described below, but many more are
needed. The set of available unit tests also needs to be
expanded greatly. Better diagnostics are needed for cases
where things go wrong. And the code needs to be improved so
that things go wrong less frequently.
Currently, the implementation is written only in XQuery;
it has been tested with BaseX and with Saxon, but it has not
yet been tested with eXist, MarkLogic, or any other XQuery
processor.
The biggest single shortcoming of the current
implementation, however, is speed. This results in part
from a conscious decision to worry not about speed but only
about correctness, and to postpone optimization until a more
complete test suite has been developed and the code is
passing all the tests. The current implementation has no
trouble with short inputs like those foreseen in the
arithmetic-expression demo, but parse times for larger
documents, even ones of only a few hundred bytes, are
painfully slow, measured in minutes rather than in seconds.
Fortunately, there are a number of fairly obvious ways to
improve performance, and the current code is as far as
possible written to hide low-level details of data
representation from the high-level functions that perform
parsing.
Data structures: representation of Earley items
The central data structure of the Earley algorithm is the
Earley item: a triple (x,
y, L), where
x is a number between 0 and the
length of the input, inclusive (the start
position)
y is a number between
x and the length of the input,
inclusive (the end position)
L is a location
in a rule of the grammar
In grammars written in Backus/Naur Form, each grammar
rule consists of a single non-terminal symbol on the
left-hand side and a sequence of symbols on the right-hand
side. If the rules of the grammar are numbered, therefore,
each Earley item can be represented by a tuple of four
integers (x, y,
r, i) giving the start
position, end position, rule number, and index into the
rule.
In regular-right-part grammars like those of invisible
XML, the right-hand side of a rule is not a flat sequence of
symbols but a regular expression whose basic symbols are
symbols in the vocabulary of the grammar. It is possible in
principle to translate a grammar from regular-right-part
notation to the simpler structures of BNF-style grammars,
and the translation could remain in ixml notation. But
it seemed easier (as well as more fun) to make
the Earley algorithm work directly with regular-right-part
grammars, following the discussion in Sperberg-McQueen 2017.
The most convenient way to package together the bits of
disparate information that constitute an Earley item is (or so
it seemed to the implementor) to represent each Earley item as
an XML element of type item, with
from and to attributes
giving the start and end points. The location information is
given by including a rule element as the
only child of the item, and giving the location within the rule
in an attribute named ri (rule
index). This in turn means that we need a way to denote
positions within a rule; a brief digression on the
representation of grammar rules in Aparecium appears to be
necessary.
In the XML notation prescribed for ixml grammars, the
first rule of the arithmetic grammar given above looks
something like this:[7]
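(The following is a rough sketch only: the rule, def, and nonterminal element names are those used elsewhere in this paper, but the element names for grouping and repetition are guesses.)

<rule name="expression">
  <def>
    <nonterminal name="term"/>
    <repeat0>
      <def>
        <nonterminal name="s"/>
        <nonterminal name="addop"/>
        <nonterminal name="s"/>
        <nonterminal name="term"/>
      </def>
    </repeat0>
  </def>
</rule>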
The XML element structure of the def
element corresponds in a straightforward way to the
structure of the regular expression on the right-hand side
of the ixml rule, which was, it may be recalled:
expression: term, (s, addop, s, term)*.
The def element and each of its descendants
represents an expression, and like all regular expressions,
these may be interpreted as defining regular languages.
The usual way to determine whether a given sequence of
symbols is in the language so defined is to translate
the regular expression into a finite state automaton (FSA).
One common way of doing so is the Thompson
construction (Thompson 1968),
descriptions of which are readily and widely available. But
the Thompson construction involves first creating an FSA
with a potentially large number of states and
epsilon-transitions (transitions that move from one state of
the automaton to another without consuming any tokens of the
input) and then performing a tedious and expensive process
to eliminate the epsilon transitions again and reduce the
number of states. The relation between the resulting FSA
and the original regular expression is typically
obscure.
For our purposes, a better algorithm is that given by
Brüggemann-Klein 1993a, which constructs the
Gluschkov automaton for the regular
expression (so called for the Russian mathematician Viktor
Gluschkov, whose work Brüggemann-Klein extends). I will not
attempt to describe the algorithm here, but limit myself to
a few observations. In the Gluschkov automaton, each state
corresponds one-to-one with (or just: is) one
of the basic symbols in the original regular expression[8]; the automaton also has an additional
start state which corresponds to no symbol in the
expression. In the Gluschkov automaton, we move to a given
state s whenever we read in the input a
symbol matching the symbol associated with s.
The relation between the regular expression and the
automaton is thus very simple and straightforward, and the
number of states in the FSA is predictable from the regular
expression.
To simplify description and calculation of the automaton,
it is convenient if each state (and each subexpression) has
a name. So the first modification we make to the
representation of the rule is to add an identifier for each
state. With identifiers added, the rule for
expression looks something like this:
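(Again a rough sketch; the attribute name used here for the identifiers, id, is a guess.)

<rule name="expression">
  <def>
    <nonterminal name="term"  id="term_1"/>
    <repeat0>
      <def>
        <nonterminal name="s"     id="s_1"/>
        <nonterminal name="addop" id="addop_1"/>
        <nonterminal name="s"     id="s_2"/>
        <nonterminal name="term"  id="term_2"/>
      </def>
    </repeat0>
  </def>
</rule>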
Following Brüggemann-Klein's algorithm, we then calculate
for each sub-expression (down to and including the basic
symbols themselves, here all nonterminal
elements) a number of properties:
The set of states which can be visited
first in paths across the
expression.
The set of states which can be visited
last.
Whether the expression matches the empty string (if
so, the expression is nullable).
For each symbol in the subexpression, what other
symbols in the subexpression can follow it in a path: the
follow set for the symbol.
This information is supplied as an attribute on each
expression; the follow sets are given by attributes in a
namespace created for the purpose: the attribute value
specification follow:term_1 = "s_1" on the outermost
def element means that within that
expression, state s_1 can follow state term_1.
The process of calculating the values for these properties
and assigning the appropriate attributes has been referred to
elsewhere as compiling or
annotating the grammar. The fully annotated rule
now looks something like this:
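(A rough sketch, showing only the annotations on the outermost def element; the first, last, and follow:* attribute names are as described above, while the nullable attribute name and the space-separated serialization of the sets are guesses. The analogous annotations on inner subexpressions are elided.)

<rule name="expression">
  <def id="q0" nullable="false"
       first="term_1" last="term_1 term_2"
       follow:term_1="s_1" follow:s_1="addop_1"
       follow:addop_1="s_2" follow:s_2="term_2"
       follow:term_2="s_1">
    <nonterminal name="term"  id="term_1"/>
    <repeat0>
      <def>
        <nonterminal name="s"     id="s_1"/>
        <nonterminal name="addop" id="addop_1"/>
        <nonterminal name="s"     id="s_2"/>
        <nonterminal name="term"  id="term_2"/>
      </def>
    </repeat0>
  </def>
</rule>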
The finite state automaton whose states are the basic
symbols of the expression rule (plus an
extra start state) and whose transitions and other properties
are given by the first,
last, and follow:*
attributes on the outermost def element can
be illustrated by the following diagram:
Our digression into the structure of rules is almost
complete. It remains only to be observed that the current
state of an FSA which has partially recognized a string is
given fully by the name (the identity, really) of the state.
(That is pretty much a defining feature of a finite state
automaton: its future behavior is defined solely by the
identity of the current state.)
So we can represent the current position in the
right-hand side of a rule by recording the name of the
current state: that is, given the way the Gluschkov
automaton is constructed, the identifier assigned in the
rule to the corresponding terminal or non-terminal symbol.
If nothing has yet been read, we are in the start state
for the rule. By convention, the start state of any rule
is named q0.
This concludes the digression on the structure of
grammar rules in Aparecium.
We can now show an example of an Earley item as
represented in the current implementation. At position 20 in
the string 23 + 1834 / (60 - 3) - 9
(that is, just after the right parenthesis), the parsing
algorithm will generate an Earley item that means, in effect,
after reading the input string from position 0 to
position 20, the FSA for rule expression is
in state term_2. That Earley item
will have the following form:
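(A sketch: the from, to, and ri attribute names follow the description above; the content of the embedded rule element, which in the implementation is the full annotated rule, is elided.)

<item from="0" to="20" ri="term_2">
  <rule name="expression">
    <!-- the annotated right-hand side of the rule appears here -->
  </rule>
</item>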
This representation is convenient for the programmer but
has a practical flaw. There will be at least one Earley
item for each terminal symbol in the input; in ixml
grammars, that means one for each character. Making the
rule element a child of the
item element makes it extremely convenient to
consult the rule when needed, but it also requires that the
XQuery or XSLT implementation make a new copy of the rule
for each Earley item created (because each copy of the rule
will have a different parent element). The frequent copying
creates a perceptible strain on memory in the current
implementation of Aparecium. A simpler representation of
the Earley item would not reproduce the rule but merely
point at it, thus:
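(Again a sketch, with illustrative attribute names.)

<item from="0" to="20" rule="expression" state="term_2"/>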
Better performance can also come from representing the Earley
item not as an element but as a map which points at, rather than
naming, the rule and state (assuming that the variables
$rule and $state have
been appropriately initialized):
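A sketch of such a map, with invented key names, might look like this:

map {
  "from"  : 0,
  "to"    : 20,
  "rule"  : $rule,   (: a reference to the rule element, not a copy of it :)
  "state" : $state   (: the identifier of the current state, here "term_2" :)
}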
Changing the representation of items to use maps instead of
elements reduced parse time for grammars in the test suite
by between 15% and 30%. This was much less than had been
expected. A much more dramatic effect (seven-fold speedup)
came from changing the representation of the set of Earley
items being calculated by the algorithm to make the search
for Earley items with specific properties faster.
Treatment of ambiguity
A sentence is ambiguous if it has more
than one derivation that follows the rules of the grammar.
One of the great advantages of Earley parsing is that it
is robust in the presence of ambiguity: if there is
exactly one derivation of a sentence, the Earley parser
will find it; if there is more than one, the Earley parser
will find more than one. This contrasts sharply with most
widely used parsing techniques, which cannot function if
the grammar has any ambiguous sentences. Unfortunately,
existing parsing tools are often not much help detecting
ambiguity or helping the grammar author remove it. So a
great deal of time can be spent, with tools like yacc,
trying more or less random changes to the grammar in an
attempt to eliminate the parsing conflicts detected by the
tool.
For ambiguous sentences, Pemberton 2013
prescribes that an ixml parse should return any one of the
parse trees, with a flag to signal that there are other
parse trees for the input. This is clearly satisfactory if
it is known that any ambiguity is inconsequential. If a
grammar for arithmetic expressions is ambiguous only because
whitespace may be regarded either as a child of a term or as
the child of an expression, then it does not matter for
numerical evaluation which tree is selected: whitespace
plays no role. The rule may be less satisfactory, however,
if the two parses would assign different meanings to the
sentence. In such a case, it may be useful for a grammar
writer (though perhaps not a grammar user) to be able to
compare the two parses; that comparison may help make the
source of the ambiguity clear.
For this reason, the current version of Aparecium
deviates from the rule enunciated by Pemberton: for
ambiguous sentences Aparecium returns multiple parse trees.
In some simple cases, it returns all parse
trees, but that is not and cannot be true in general: in
some cases, a sentence may have an infinite number of parse
trees.
Consider the following grammar:
A: A; "a".
An A consists either of another
A or of the letter a. The
only sentence in the language defined by this grammar is
a, but it has an infinite number of parse trees.
XML representations of a few of these include:
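For instance (serialization details are illustrative):

<A>a</A>

<A><A>a</A></A>

<A><A><A>a</A></A></A>

and so on, without limit.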
Another small grammar illustrates a different kind of
infinity in parse trees:
S: (X)*. X: "x"; {nil}.
Here the language consists of strings of zero or more
occurrences of x, but each sentence has an
infinite number of parse trees. For the sentence
x, the parse trees include the following.
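For instance (again, serialization details are illustrative; an X that matches the empty string appears here as an empty element):

<S><X>x</X></S>

<S><X/><X>x</X></S>

<S><X>x</X><X/></S>

<S><X/><X/><X>x</X></S>

and so on, with any number of empty X elements.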
In both of these cases, Aparecium will detect the
infinite ambiguity and return only a finite number of
parse trees.
Related work
Related work falls into several groups: other work on the
implementation of invisible XML; other parser tools and
techniques for XSLT and XQuery; other implementations of
Earley parsing; and other tools for making non-XML data
accessible in XML.
Implementations of invisible
XML. As far as the author knows, there are at the
moment three implementations of invisible XML, only one of which
has been published. Steven Pemberton has (I believe) a
prototype or proof of concept implementation, but his ixml web
page says only “Software to support ixml will be made
available at a later date” (Pemberton 2018). John Snelson created an
implementation in the MarkLogic dialect of XQuery but found its
performance unsatisfactory and has abandoned it. The only
published implementation I know of is the Perl package called
XML::Invisible, attributed to Ed J and available
from the Comprehensive Perl Archive Network, CPAN (ETJ 2018). It uses the PEG (parsing expression
grammars) engine Pegex to generate functions to parse input that
conforms to the given grammar; its notation is that of Pegex
rather than the grammar notation proposed by Pemberton 2013.
Note: The author has not yet been able to
experiment with either Snelson's parser or the XML::Invisible
package; when such experimentation is possible it should be
possible to say something more about how they compare to the
work described here.
Other parsing tools and techniques for
XSLT and XQuery. There are, moreover, other tools and
techniques for parsing in XQuery and/or XSLT. The most
prominent of these is probably the tool REx, developed by
Gunther Rademacher and made available on the Web at https://www.bottlecaps.de/rex/.
REx accepts as its input a grammar written in a notation based
on that used in the XML, XPath, XQuery, and other related
specifications (with some extensions to the notation in the area
of lexical analysis), and generates as output an executable
parser in any of a variety of languages (notably including
XQuery and XSLT as well as several imperative languages). The
user may specify how much lookahead to perform and may select a
backtracking parser (effectively allowing unlimited lookahead);
one of the options is for the parser to emit an XML
representation of the parse tree. For well known notations,
therefore, parsers generated by REx are a plausible alternative
to the use of ixml for parsing non-XML data; the REx site offers
grammars in REx notation for several specifications of interest,
e.g. XPath, XQuery, and JSON.
Another relevant tool is the LR-1 parser provided by Dimitre
Novatchev as part of his FXSL package for functional programming
in XSLT (Novatchev 2006b; on FXSL, but not the
parser, see also Novatchev 2003 and Novatchev 2006a). Like REx, the FXSL parser is
designed as a tool to aid in the development of software that
needs a parser. It has a relatively steep learning curve: the
parsing function requires that the user provide functions for
lexical scanning and for grammar-rule reduction, as well as an XML
representation of the LR parsing tables, which appear to be most
conveniently created by post-editing the output of a customized
version of Berkeley Yacc provided as part of the FXSL
distribution.
There has also been some discussion of writing recursive
descent parsers in XSLT and XQuery (Sperberg-McQueen 2013),
but the approach does not currently seem to be widely used.
The current work differs from these in sacrificing the goal
of speed for greater generality and easier usage. The LR parser
of FXSL requires that the grammar be LALR(1); recursive-descent
parsing requires that it be LL(1); REx does not restrict its
grammars in the same way, but does require that they be free of
left recursion and ambiguity. The library presented here will in
contrast accept any context-free grammar, with or without left
recursion, with or without bounded lookahead, whether the grammar
is ambiguous or unambiguous.
Other Earley parsers. A
prominent implementation of Earley parsing is the open-source
Perl module Marpa (Kegler 2018). The author,
Jeffrey Kegler, has implemented a number of optimizations which
mean that in many cases parsing takes O(N) time, where
N is the length of the input.
Other work on making non-XML resources
accessible to XML tools. The idea of making non-XML
resources accessible for processing with XML tools is not a new
one. One antecedent is the Data Format Definition Language
(DFDL) (Rose 2005, Zhu 2005,
Beckerle / Hanson 2014); it was developed by the Open Grid
Forum to allow non-XML data (textual or binary) to be presented
as XML. It involves writing an XSD schema to describe the XML
format (using a subset of XSD) and adding DFDL annotations to
define the mapping between the XML format and the non-XML format.
Another antecedent is the concept of XML lenses
(which make everything look XML-colored), implemented for
example in the product Data Direct Connectors (Lavinio 2007). In the Data Direct case, such
connectors involve Java code which intercepts calls to
dereference URIs, fetches the data, and translates it into XML
for presentation to the XQuery or XSLT caller; analogous
translations in the reverse direction can map from XML into the
non-XML format. Because the mappings are implemented in Java
code, they can involve any operation a Turing-complete language
can perform. More distant antecedents are the
DATATAG and SHORTREF facilities of
SGML, which allowed text without visible markup to be parsed as
structured SGML. Unfortunately, very few people seem to have
developed any facility with either DATATAG or
SHORTREF; the MARK-IT parser from SoBeMaP (later
SEMA Group) did include an example in which a Pascal-like
programming language was defined using pure SGML facilities, but
this approach does not seem to have found many followers. Ever
since XML defined a subset of SGML in which DATATAG
and SHORTREF were removed, there have been periodic
suggestions to specify and build similar functionality; perhaps
the regular fragmentations proposal of Simon
St. Laurent and the STnG system of Ari Krupnikov can stand as
two examples for many in this line (St. Laurent 2001, Krupnikov 2003).
For some people (including the author of this paper),
invisible XML appears to have achieved more traction than its
antecedents. It is less ambitious than some: it targets textual
data only, leaving binary data aside. It is more ambitious than
others: it allows arbitrary context-free grammars, rather than
restricting itself to regular expressions or to grammars with
particular look-ahead properties. It is high-level, working with
context-free grammars instead of arbitrary executable code. It
requires some understanding of context-free grammars and how to
write them, and it benefits from some understanding of what one
will want the XML representation to look like, but it requires
very little tool-specific knowledge beyond the details of the ixml
grammar syntax. As the library presented here illustrates, it is
also possible to use ixml grammars to parse non-XML data on the
Web even without any effort by the owners or publishers of that
data, and without making dramatic changes to the network
infrastructure of the user's machine (so: no Java code to
intercept HTTP requests, no proxy server setup, only the
installation of an XQuery or XSLT module and calls to that
module).
Future work
The immediate task for the ongoing work on the library is
to change it from a proof of concept to a reliable practical tool
for use by XQuery and XSLT programmers. A systematic suite of
unit tests and system tests is important both to help establish
correctness of the code and to help ensure that ensuing attempts
to improve performance do not introduce incorrect behavior.
When complete (or less incomplete), the library will be
released under an open-source license; dual licensing may be
available for those who require other terms.
Performance improvements are necessary before the library
can be a useful tool for more than specialized applications and
short inputs. The obvious place to start is with the
representation of Earley items. Reducing the amount of copying
that need be done, and the amount of memory needed, as the set of
items grows, may make a significant difference. Improving the
processor's ability to find the needed Earley item in a search of
the current set is also desirable. Earley's original algorithm is
built around a data structure in which items are in effect indexed
by their start value; it seems straightforward to do the same
using maps in XQuery and XSLT 3.0.
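A sketch of what such an index might look like in XQuery 3.1, assuming Earley items represented as maps with a from key as sketched earlier (the key names and the sample items are invented for illustration):

(: index a sequence of Earley-item maps by their start position,
   so that all items starting at a given position can be retrieved
   with a single map lookup :)
let $items := (
  map { "from": 0,  "to": 20, "rule": "expression", "state": "term_2" },
  map { "from": 13, "to": 19, "rule": "expression", "state": "term_2" }
)
let $index := map:merge(
  for $item in $items
  group by $from := $item?from
  return map:entry($from, $item)
)
return $index(0)    (: the items whose start position is 0 :)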
Performance comparisons with other available tools will
also be of interest. It is to be expected that Earley parsing
will often be slower than parsing using other techniques[9]; it
will be interesting to see how much slower it is.
Some additional features seem potentially desirable. On
the assumption that many users will want to parse non-XML
resources conforming to some widely known notations like CSS,
JSON, XPath, or XPointer, it might be helpful to package
grammars for such notations with the library. That in turn
would make it possible to let the aparecium:doc()
function select an ixml grammar based on the MIME type of the
resource, even if the publisher has not provided a grammar.
Since many published notations are designed to be parseable
with existing parser tools, it might also be feasible to
package parsing functions for some notations with the library,
and use a faster algorithm when possible. This suggests
outfitting Aparecium with a pre-populated repository of
grammars and parsers and providing methods to allow the user
to add and delete items from the repository.
In a related vein, it would be desirable to have a library
of functions for the analysis of grammars. Given an ixml
grammar, such a library could determine whether the grammar as
given is LL(1) or LR(1) and thus suitable for parsing with
other tools. It would also be desirable to be able to test
grammars for ambiguity. This is not, in general, a soluble
problem: there is no algorithm for detecting ambiguity that
will work in all cases. But there are methods which will
detect at least some ambiguities, and which can explain and
illustrate the ambiguity usefully. Utilities for translating
grammars between notations might also be useful, although
few grammatical notations have free-standing specifications
in the style of ixml: many are tied to specific pieces
of software and have documentation which is at best
imperfect.
I hope to have shown in this paper that invisible XML can
be made useful to XSLT and XQuery programmers, independent of
the willingness of information publishers to provide invisible
XML grammars for their information, by providing a library
that allows them to supply the grammars themselves. Using
non-XML resources for which an invisible XML grammar is
available thus becomes as simple as using XML resources: a
call to a single function, to load the resource as an XML
document. The current implementation of Aparecium provides a
demonstration that such a library can exist; what is needed
next is work to move it from a proof of concept to a useful
tool.
[1] The proposition that deep nesting can make processing
simpler is one of a number of important insights into markup
which Lynn Price worked for a long time to get into my head.
She often gave it as a motivation for the presence in SGML of
markup minimization features like OMITTAG, SHORTREF, and
DATATAG: in the ideal case, they provide a notation simple
enough to be suitable for human authoring and rich enough to
make processing easier.
[2]
In many cases, a grammar can be reworked to become LL(1), or
LR(1), or whatever is required. Unfortunately, there is no
simple set of rules (and in fact no reliable set of rules at
all) to say what changes we can make in a grammar without
changing the language recognized by the grammar. If there
were such rules, the tool could do the transformation for us.
Intuitively, experience with existing parser-generation tools
leads many users to believe that any reasonable
language can be defined by a grammar satisfying the
constraints of the tool, and that languages which persist in
being unreasonable can usefully be improved by
modifying them until they become reasonable again. This
belief may have some truth to it, although at first glance it
looks as if the parsing tool were merely training its users
not to want what it cannot provide.
[3]
Among those best known are the Cocke / Younger / Kasami (CYK)
algorithm, the Earley algorithm, and a number of
generalized algorithms (generalized LR,
generalized LL, generalized LC) and backtracking algorithms.
By far the best source for descriptions of these and other
parsing algorithms is Grune/Jacobs 2008.
[4]
Just as parse-resource() takes two URIs for the
input and the grammar as arguments, so
parse-string() takes two strings. It is thus
necessary to dereference the URI for the grammar, as shown in
the example.
[5]
The gl in the filename
arith.gl.xml reflects the fact that (as is
about to be explained below) this version of the grammar has
annotations that allow it to be interpreted as a set of
Gluschkov automata.
[6] As a case in point: the
arithmetic grammar shown is not the first draft; several
errors have needed to be fixed, and the version shown here
still has a potentially painful problem in the handling of
whitespace characters. Grammar development is not very hard,
but it can raise some tricky issues. Whitespace handling is
consistently one of them. SGML users may remember
nostalgically the observation known as Goldfarb's Law: if a
text processing application has bugs, at least one of them
will have to do with whitespace handling. Grammar development
appears to be subject to this law.
[7] The syntax shown is that currently used by Aparecium; it
is based on the grammar exhibited in Pemberton 2013 and on a 2016 revision of the grammar.
In the current version of the ixml specification (Pemberton 2019a) the structure for this rule is the same,
but the def element has been renamed
alts. Other changes in the syntax are not
visible in this example.
[8] Formally, a finite state automaton has a set
of states, but it is immaterial what the states actually
are: the only intrinsic property they need to possess is
that they must be distinguishable from each other, which is
implicit in the phrase set of states: if we cannot
distinguish the elements of a set from each other, then we
cannot handle it as a set after all.
It follows from this that there is
no necessity to imagine a set of states that correspond 1:1
to the symbols of the regular expression; we can take the
set of symbols (symbol tokens, to be exact) in the
expression, and make that be the set of states in our
automaton.
[9] Grune and Jacobs say (p. 547 of Grune/Jacobs 2008):
“It should be noted that if any of the general parsers
performs in linear time, it may still be a factor of ten or so
slower than a deterministic method, due to the much heavier
administration they need.”
References

[Beckerle / Hanson 2014] Beckerle, Michael J., and Stephen M. Hanson. Data Format Description Language (DFDL) v1.0 Specification. Open Grid Forum DFDL Working Group, Document GFD-P-R.207, September 2014. On the Web at https://www.ogf.org/documents/GFD.207.pdf.

[Pemberton 2013] Pemberton, Steven. Invisible XML. Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). https://doi.org/10.4242/BalisageVol10.Pemberton01.

[Rose 2005] Rose, Kristoffer H. Introducing the Data Format Description Language. Paper given at Extreme Markup Languages 2005, Montréal, sponsored by IDEAlliance.

[Sperberg-McQueen 2013] Sperberg-McQueen, C. M. Recursive descent parsing in XQuery (and other functional languages). Blog post, 7 January 2013. On the Web at http://cmsmcq.com/mib/?p=1260.

[Sperberg-McQueen 2017] Sperberg-McQueen, C. M. Translating imperative algorithms into declarative, functional terms: towards Earley parsing in XSLT and XQuery. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). https://doi.org/10.4242/BalisageVol19.Sperberg-McQueen01.