How to cite this paper
Ogbuji, Uche. “A MicroXPath for MicroXML (AKA A New, Simpler Way of Looking at XML Data Content).” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Ogbuji01.
Balisage: The Markup Conference 2016
August 2 - 5, 2016
Balisage Paper: A MicroXPath for MicroXML (AKA A New, Simpler Way of Looking at XML Data Content)
Uche Ogbuji
CTO, Partner
Zepheira LLC
Uche Ogbuji is a pioneer in the integration of Web architecture with traditional
enterprise data technology. An Electrical/Computer Engineer by education, Uche has
written over 300 articles on XML, RDF, Web services and related topics, connecting
open source and commercial software development. His present project is publishing
to the Web "dark" data from the comprehensive riches of library catalogs and other
metadata. Uche is also an award-winning poet, and first Balisage Poet Laureate (2015).
Copyright © 2016 Uche Ogbuji
Abstract
There has always been tension in the development, and in the community reception of
the XML stack of technologies, across the many poles of technology and philological
interests which have been attracted by XML's success. For some XML was too spare for
rich data applications, and needed additional support from schema systems and sophisticated
query systems, culminating in XSLT 3.0 and XQuery. Others craved greater and greater
simplicity, allergic to any constructs complicating things too far beyond the basis
of elements, attributes and text. This latter camp came together in 2012 to create
a MicroXML specification. MicroXML is a radical simplification, stripping away namespaces,
syntactic quirks such as CDATA Sections, the various trappings of DTDs, and much more.
Nevertheless, there is need for systems to process it, and these systems start with
a basic data model of the MicroXML, and a basic language for processing documents
using that data model. In other words, MicroXML needs an XPath.
Even XPath 1.0 is too complex for MicroXML, because of its handling of the many features
removed from MicroXML. What's needed is a MicroXPath, which can be the basis of additional
MicroXML processing technologies. This paper provides a straw man specification for
MicroXPath, and includes a discussion of the technical considerations for MicroXPath,
including differences from the W3C XPath recommendations, based on the differences
between XML 1.0 and MicroXML.
Table of Contents
- MicroXML
- MicroXPath Design Principles
- MicroXPath Data Model
- Location Paths
- Sequences
- Core Functions
- Implementation
- Conclusion
MicroXML
There have been many attempts to modify XML, usually to simplify it. This should
be considered a normal consequence of XML's success. There have been even more episodes
of insistence that XML is dead because some other format was supplanting it, whether
YAML, HTML5 or JSON. This is also a natural consequence of XML's success. For a long
time the idea of a modest simplification of XML has buzzed around groups of XML experts.
MicroXML [MicroXML Spec] is a specification that emerged from a W3C community group
[MicroXML Community Group], offering a backward-compatible format that looks
to keep the best of XML while omitting anything considered too complicating.
The MicroXML specification is only eight pages or so, compared to the 37-odd pages
of the XML 1.0 specification. Even so MicroXML provides something XML 1.0 did not,
a data model. In the XML world the lack of a data model in the foundational spec led
to a succession of separate specifications for XML data models, including the XML
Infoset and the XPath Data Model (XDM) for XPath 2.0 and beyond. It's worth noting
that the simplest of these, the XML Infoset is in itself twice the length of the MicroXML
spec. The most widely used data model was one of the first options, the Document Object
Model (DOM), which was enormously complicated, partly because it also served as a
scaffolding for dynamic, in-browser operations on HTML as well as XML. By including
a data model which can be specified in 4 pages or so, MicroXML helps enforce simplicity
and improves the likelihood of interoperability of implementations.
The next logical step after considering the MicroXML data model is thinking how
to jot down basic expressions in the context of a document. XPath provides a way to do so
in the XML space, and developing a subset and variation on XPath suitable for MicroXML
would be a valuable furtherance of that technology stack. There are several possible
approaches to developing a MicroXPath, from creating an entirely new language to adapting
an existing one such as CSS 3 or XPath. This paper describes an approach largely based
on XPath 1.0, but with one major concept taken from XPath 2.0.
MicroXPath Design Principles
XPath, like XML, has been used in many different ways. XPath 1.0 developed as a
unified selection language for XPointer and expression language for XSLT. Use of XPointer
faded away, and XSLT predominated, but many limitations of XSLT 1.0 emerged. Many
users, however, found XPath useful as a utility language within non-XSLT host environments,
from XML databases to general-purpose programming languages. One simple XPath could
eliminate dozens of lines of DOM traversal code. When the time came to develop XPath 2.0 there
was a clamor for features from XSLT users as well as XML database users. This led
to a language with many more features than XPath 1.0, but one that is also far more complex. This
added complexity for the most part isn't needed when the host environment is powerfully
expressive, as with Java or Python programs.
The target for MicroXPath is to assume that very sophisticated processing can be
done by a Turing-complete and fully expressive modern programming language. MicroXPath
focuses on delivering nodes from the document to the host environment. It does offer
a system of expressions which goes beyond MicroXML nodes, but this is largely to power
predicate operations which are used to narrow down the selection of nodes.
Before getting to MicroXPath design goals, it's worth remembering the key goals
of MicroXML. As established by the Community Group these are as follows.
- The syntax of MicroXML is a subset of XML 1.0.
- MicroXML specifies a data model and a mapping from the syntax to the data model,
  which is substantially consistent with XML 1.0.
- MicroXML is dramatically simpler than XML regarding its specification, syntax,
  and data model.
- MicroXML is designed to complement rather than replace XML, JSON, and HTML.
- MicroXML supports the needs of documents, in particular mixed content.
- MicroXML supports Unicode.
- MicroXML supports the use of text editors for authoring.
- MicroXML is able to straightforwardly represent HTML.
- The specification of MicroXML is as self-contained as is practical.
MicroXPath is inspired by the above, and has its own minimal set of design goals.
- A large proportion of XPath 1.0 expressions produce similar results in MicroXPath, notably
  excluding expressions which involve namespaces.
- MicroXPath incorporates additional features based on experience with the limitations
  of XPath 1.0.
- MicroXPath is read-only. It does not modify the context provided by the host environment.
- MicroXPath is designed to provide information from MicroXML nodes directly to a
  modern, Turing-complete host language. MicroXPath itself is not intended to be computationally
  complete in any way except in its reach of MicroXML nodes based on a provided context
  structure.
The final goal also enshrines the decision for MicroXPath to be substantially based
on XPath 1.0 and not XPath 2.0 or 3.0. Other design principles important to these
later XPath versions, such as composability, and thus mathematical closure, are pursued
in MicroXPath only as far as practical. MicroXPath also preserves "syntactic sugar"
from XPath, such as axes, which are a convenient way of writing node traversal along
common relationships, and node tests, which are a useful abbreviation of some predicates.
MicroXPath Data Model
MicroXPath has a data model that's a superset of the MicroXML data model. A key
construct in MicroXPath is the sequence, which is based on the XPath 2.0 construct.
XPath 1.0 was built around the concept of node sets, for a variety of reasons relating
to its origins supporting XPointer and XSLT. This led to a great deal of confusion
among users, especially when the result tree fragment construct from XSLT was brought
into the picture. XPath 2.0 drew from many of those lessons to rely on a more versatile
sequence construct, including a node list, which is a sequence of nodes. MicroXPath
adopts this approach. The results of all MicroXPath expressions are sequences.
A MicroXPath sequence provides zero or more objects. An object can be one of four
types.
- element (MicroXML element item. There are no other node types.)
- boolean (true or false)
- number (floating-point number)
- string (sequence of UCS characters. This is an abstract sequence, and not a MicroXML
  sequence object in itself.)
A MicroXPath sequence cannot contain another MicroXPath sequence. Nesting is not
allowed. Any operation that would seem to result in nested sequences implicitly has
those sequences flattened.
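The flattening rule can be illustrated with a minimal Python sketch. The helper name flatten and the use of Python tuples and lists to stand in for MicroXPath sequences are assumptions made for illustration; they are not taken from any specification or implementation.

```python
def flatten(items):
    """Yield the atomic items of a possibly nested sequence, in order.

    Tuples and lists stand in for MicroXPath sequences here (an assumption
    of this sketch); every other Python object is treated as a single item.
    """
    for item in items:
        if isinstance(item, (tuple, list)):
            # A nested sequence contributes its items, never itself
            yield from flatten(item)
        else:
            yield item


nested = (10, (1, 2), (), (3, 4), (5,))
flat = list(flatten(nested))  # six items, no nesting left
```

An empty nested sequence simply contributes nothing, so no special case is needed for it.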
A MicroXPath expression is evaluated with respect to a context, comprising the information
that can affect the result of the expression, namely the following.
- context node, the current item being processed
- context object, an object of any type. If a node, it must be identical to the context
  node
- context position, a non-zero positive integer giving the position of the context
  node within the sequence of items being processed
- context size, a non-zero positive integer giving the number of nodes in the sequence
  of items being processed
- variable bindings, a mapping from names to values set by the hosting environment
- function library, a mapping from names to behaviors set by the hosting environment
- key bindings, a mapping of mappings to make available through the key()
  function.
The biggest difference from the XPath 1 context is the addition of the context object. This
is meant to support predicate expressions, which can operate on any sequence. In an
expression such as (1, 2, 3, 4, 5)[. > 3], the predicate sees each number in order, and the .
evaluates to each number in turn, so that in this case the result would be the sequence (4, 5).
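A predicate applied to an arbitrary sequence might be sketched in Python as follows. The Context class and the apply_predicate function are hypothetical names introduced for this illustration, the context is pared down to three of the fields listed above, and only boolean-valued predicates are handled.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator


@dataclass(frozen=True)
class Context:
    """Hypothetical, pared-down evaluation context for this sketch."""
    obj: Any        # context object: the item currently being tested
    position: int   # 1-based position within the sequence
    size: int       # number of items in the sequence


def apply_predicate(seq: Iterable[Any],
                    predicate: Callable[[Context], bool]) -> Iterator[Any]:
    """Yield the items of seq for which predicate holds, preserving order."""
    items = list(seq)  # the context size must be known up front
    for position, obj in enumerate(items, start=1):
        if predicate(Context(obj, position, len(items))):
            yield obj


# Counterpart of (1, 2, 3, 4, 5)[. > 3]
result = list(apply_predicate((1, 2, 3, 4, 5), lambda ctx: ctx.obj > 3))
```

Because the predicate receives the whole context, tests on position and size work the same way whether the items are nodes, numbers or strings.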
MicroXPath does introduce two node types which are not in the MicroXML data model.
These are required in order to ensure that most XPath expressions retain similar semantics
in MicroXPath. For example without a root node, the semantics of absolute location
paths would be radically different. MicroXPath introduces a root node object purely
for purposes of expression evaluation. It also introduces attribute nodes. It is perfectly
acceptable for an implementation to not construct root nodes or attribute nodes until
required by expression semantics. The MicroXPath implementation I wrote does use such
a just-in-time strategy in constructing root and attribute nodes.
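As a rough illustration of the just-in-time idea, attribute nodes can be produced only when the attribute axis is actually traversed. The Element and AttributeNode classes and the attribute_axis function below are hypothetical stand-ins written for this sketch; they are not the classes used in any particular implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Union


@dataclass
class Element:
    """Stand-in for a MicroXML element: name, attribute map, children."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


@dataclass(frozen=True)
class AttributeNode:
    """Exists only during expression evaluation (hypothetical class)."""
    parent: Element
    name: str
    value: str


def attribute_axis(element: Element) -> Iterator[AttributeNode]:
    """Build attribute nodes lazily, only when the axis is traversed."""
    for name, value in element.attributes.items():
        yield AttributeNode(element, name, value)


para = Element("p", {"class": "intro", "lang": "en"}, ["Hello"])
names = [attr.name for attr in attribute_axis(para)]
```

The parsed document keeps only the plain attribute map; node objects with a parent link appear on demand and can be discarded after evaluation.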
Location Paths
A location path selects a set of nodes relative to the context node. MicroXPath
location paths are very similar to XPath ones, sometimes nicknamed "Tumblers"
among XPointer users. The main difference is that name tests are never in the form
of QNames. The result of evaluating an expression that is a location path is a sequence
of nodes.
All the examples from the top of section 2 of the XPath 1.0 spec [XPath 1.0] happen to be
valid MicroXPath expressions. Ditto all location paths using abbreviated
syntax in section 2.5. All XPath 1.0 axes are valid in MicroXPath excepting the namespace
axis. In other words the following are MicroXPath axes, each containing the same nodes
as in XPath 1.0.
- ancestor
- ancestor-or-self
- attribute
- child
- descendant
- descendant-or-self
- following
- following-sibling
- parent
- preceding
- preceding-sibling
- self
Attributes are the principal node type for the attribute axis, and elements for
all other axes. There are only two node tests in MicroXPath, node() and text().
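A few of these axes can be sketched as Python generators over a toy tree. The Element class with parent links and the function names child, descendant, ancestor and name_test are assumptions of this sketch, which also ignores text children for brevity.

```python
class Element:
    """Minimal element with parent links (hypothetical, for this sketch)."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child_item in self.children:
            if isinstance(child_item, Element):
                child_item.parent = self


def child(node):
    """child axis, restricted to elements, in document order."""
    return (c for c in node.children if isinstance(c, Element))


def descendant(node):
    """descendant axis: depth-first pre-order, which is document order."""
    for c in child(node):
        yield c
        yield from descendant(c)


def ancestor(node):
    """ancestor axis: parent, grandparent, and so on up the tree."""
    while node.parent is not None:
        node = node.parent
        yield node


def name_test(nodes, name):
    """Keep only nodes with the given simple name (no QNames)."""
    return (n for n in nodes if n.name == name)


doc = Element("doc", [
    Element("sec", [Element("p", ["one"]), Element("p", ["two"])]),
    Element("p", ["three"]),
])
all_p = list(name_test(descendant(doc), "p"))  # roughly descendant::p
```

Since names are plain strings, the name test reduces to a string comparison, with no prefix or namespace resolution involved.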
Sequences
Syntactically MicroXPath is much the same as XPath 1.0, but there is one significant
addition. MicroXPath provides a syntax for creating sequences, borrowing from XPath
2.0/3.0. The following are examples of expressions that construct sequences, taken
from 3.4.1 of the XPath 3 spec. They have the same semantics in MicroXPath.
- (10, 1, 2, 3, 4) results in a sequence of five integers.
- (10, (1, 2), (), (3, 4), (5)) results in a sequence with six items, 10, 1, 2, 3, 4, 5.
  The five component sequences of length one, two, zero, two, and one, respectively,
  are combined.
- (salary, bonus) results in a sequence containing all salary children of the context
  node followed by all bonus children. The salary and bonus children flow into the result
  separately in document order, but the resulting sequence may not be in document order.
- ($price, $price) results in a sequence with the value of the variable $price
  twice over. If $price is bound to the value 10.50, the result of this expression
  is the sequence 10.50, 10.50. If $price is bound to the sequence 1, 2, 3, the result
  of this expression is the sequence 1, 2, 3, 1, 2, 3.
As you can see from the second example the empty sequence is expressed as ()
and a sequence of a single item ($item) is the same as the item expressed directly $item.
This derives from the fact that all MicroXPath expressions are sequences. XPath
2.0/3.0 range expressions are not supported, but there are core functions to provide
similar features.
The MicroXPath union operator | behaves much as in XPath 1, and is also a way to
arrange a sequence into document order. The first items in the result sequence of $a|$b
would be all nodes in either $a or $b, in document order. Next would come all strings,
sorted by code points, then all numbers, sorted numerically, then finally booleans:
false if it occurs in either of the arguments, followed by true if it occurs. This is
a simple case of the more general built-in union function discussed below.
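The ordering rule just described might be sketched in Python as follows. The Node class, with its explicit document-order index, and the decision to treat every number as a float are assumptions of this sketch and not part of any specification.

```python
class Node:
    """Stand-in node carrying its document-order index (an assumption)."""

    def __init__(self, name, order):
        self.name = name
        self.order = order


def union(*sequences):
    """Merge sequences: nodes, then strings, numbers, and booleans."""
    nodes, strings, numbers, booleans = {}, set(), set(), set()
    for seq in sequences:
        for item in seq:
            if isinstance(item, Node):
                nodes[id(item)] = item    # node identity, not equality
            elif isinstance(item, str):
                strings.add(item)
            elif isinstance(item, bool):  # test bool before numbers
                booleans.add(item)
            else:
                numbers.add(float(item))
    result = sorted(nodes.values(), key=lambda n: n.order)
    result += sorted(strings)   # Python compares str by code point
    result += sorted(numbers)
    result += sorted(booleans)  # False sorts before True
    return result


a, b = Node("a", 0), Node("b", 1)
merged = union([b, "z", 2, True], [a, b, "A", 1.5, False, 2])
```

The boolean check has to come before the numeric one because Python treats bool as a subtype of int, which would otherwise fold true and false into the numbers.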
Core Functions
Unlike XPath there is no syntactic distinction between core functions and extension
functions. MicroXPath defines a core library of functions which must always be made
available by a conforming host environment. The host environment can provide any additional
functions as long as their names do not conflict with any of the core function names,
nor the reserved names "node" and "text" (reserved because they are names of node
tests which are semantically different from functions but use similar-seeming syntax).
XPath 1 functions related to namespace processing are not in the MicroXPath core
set, namely local-name and namespace-uri. The name function returns the simple node
name (generic identifier) of an element or attribute
node. Another change is that
Other changes have to do with the use of sequences. The count function now takes
any sequence and returns the number of items in that sequence. The string function
always operates on the first item in its argument sequence, regardless of
document order.
MicroXPath defines a number of additional functions, many of which derive from
EXSLT [EXSLT]. It borrows the key function from XSLT 1.0 [XSLT 1.0], but MicroXPath
does not specify a way to create lookup tables (e.g. <xsl:key …/>).
Rather it's up to the host environment to provide lookup tables in the execution
context. MicroXML does not support any aspect of DocType declarations (for
ID DTD types), nor does it support namespaces (for xml:id attributes), so there is
no standard way to specify element IDs. As such there is no id function in MicroXPath
and users must rely on key. There is no lang function either, but there is a same-lang
function which can be used to do similar ISO-639-modulo comparisons.
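To make the "mapping of mappings" idea concrete, here is a minimal Python sketch of a key lookup against host-supplied tables. The function signature, the table name "by-id" and the use of strings in place of element nodes are all assumptions made for illustration; the handling of unknown tables and values is likewise a choice made for this sketch.

```python
def key(key_bindings, name, value):
    """Yield the items registered under value in the table called name.

    key_bindings maps a table name to a lookup table, and each lookup
    table maps a value to a sequence of items.
    """
    table = key_bindings.get(name, {})
    # An unknown table or value gives the empty sequence, not an error
    # (a choice made for this sketch)
    yield from table.get(value, ())


# Hypothetical tables a host program might build before evaluation;
# strings stand in for element nodes to keep the sketch short
bindings = {
    "by-id": {
        "intro": ["<section id='intro'>"],
        "refs": ["<section id='refs'>"],
    },
}
found = list(key(bindings, "by-id", "intro"))
missing = list(key(bindings, "by-id", "nope"))
```

Because the host builds the tables, it is free to index on whatever attribute or computed value suits the application, including one that plays the role of an ID.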
union() takes two or more arguments and returns the sequence union of these arguments, with
duplicates removed. union($a, $b, $c) returns the same results as $a|$b|$c,
but you might expect the former to be more readily optimized by implementations.
intersection() takes two or more arguments and returns the sequence intersection of
these arguments, i.e. only items occurring in all the arguments. The resulting order
is analogous to the results order of union().
MicroXPath offers a few core functions using the regular expression syntax as defined
by Perl [Perl Regex]. This is very similar to the regular expressions defined in section 7.6
of XPath 2.0 Functions and Operators [XPath 2.0 Functions]. The functions are matches,
replace and tokenize.
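Python's standard re module uses a Perl-style syntax, so the three functions can be approximated in a few lines. The argument order below is modeled on the XPath 2.0 functions of the same names and is an assumption of this sketch, as is the use of Python's own replacement-string syntax.

```python
import re


def matches(text, pattern):
    """True if pattern matches anywhere in text (unanchored search)."""
    return re.search(pattern, text) is not None


def replace(text, pattern, replacement):
    """Replace every non-overlapping match of pattern in text."""
    return re.sub(pattern, replacement, text)


def tokenize(text, pattern):
    """Split text on matches of pattern, yielding the pieces as a sequence."""
    yield from re.split(pattern, text)


has_version = matches("MicroXPath 1.0", r"\d+\.\d+")
swapped = replace("a-b-c", "-", "+")
tokens = list(tokenize("a, b,c", r",\s*"))
```

Having tokenize yield its pieces one at a time fits the rule that every expression result is a flat sequence.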
object-type() is similar to exsl:object-type(), operating on the first item in the
argument sequence. evaluate() is similar to dyn:evaluate(), and provides some of the
power of higher-order functions. MicroXPath does not support
functions as a core data type, as in XPath 3 [XPath 3.0] (see section 3.2.2, "Dynamic
Function Call"). For one thing, this would require side
effects in the host environment. This is not necessarily a cast-iron design principle,
but evaluate() seems to provide most of the useful additional power in a simpler package.
There are also MicroXPath core functions to provide similar features to XPath 2.0/3.0
range expressions and to provide map/apply capabilities.
Implementation
There is an implementation in Amara 3 [Amara 3 MicroXPath], written for Python 3.4 or higher, and which also implements MicroXML. Amara 3 is
able to parse XML 1.0 into the MicroXML data model, so as long as you don't need to
worry about deep namespace processing, you can readily try out MicroXPath on XML 1.0
documents as well as MicroXML.
The implementation uses the PLY library to generate the lexical scanner and parse
table to create an AST. The AST's control flow uses Python's generators and the yield
statement to reflect the fact that MicroXPath expressions result in a sequence. This
also makes for straightforward and efficient computation. Expressions that aggregate
from component expressions use the yield from syntax, introduced in Python 3.3.
This feature also makes it easy to treat sequences as
the dynamic outcomes of operation, efficient and automatically flattened. The computation
of location paths naturally results in document ordered sequences. Sorting by document
order is only required with the union operator and the set functions union
and intersection.
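The generator approach can be shown with a toy AST. The class names Literal, VariableRef and SequenceExpr and the compute method are hypothetical, chosen for this sketch rather than taken from the Amara 3 source, and the context is reduced to a plain dictionary of variable bindings.

```python
class Literal:
    """AST node for a single atomic value (hypothetical class name)."""

    def __init__(self, value):
        self.value = value

    def compute(self, ctx):
        yield self.value  # a one-item sequence


class VariableRef:
    """AST node for $name, looked up in the variable bindings."""

    def __init__(self, name):
        self.name = name

    def compute(self, ctx):
        # A binding is itself a sequence, so its items are passed through
        yield from ctx[self.name]


class SequenceExpr:
    """AST node for (expr1, expr2, ...)."""

    def __init__(self, *parts):
        self.parts = parts

    def compute(self, ctx):
        for part in self.parts:
            # Delegating to each sub-generator flattens nesting for free
            yield from part.compute(ctx)


# (10, (1, 2), (), (3, 4), (5))
nested = SequenceExpr(
    Literal(10),
    SequenceExpr(Literal(1), Literal(2)),
    SequenceExpr(),
    SequenceExpr(Literal(3), Literal(4)),
    SequenceExpr(Literal(5)),
)
numbers = list(nested.compute({}))

# ($price, $price) with $price bound to the sequence 1, 2, 3
twice = SequenceExpr(VariableRef("price"), VariableRef("price"))
prices = list(twice.compute({"price": (1, 2, 3)}))
```

No node ever builds an intermediate list; items are produced only as the host program consumes them, which is where the efficiency comes from.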
Conclusion
XPath 1.0 has always been an admirably well-designed language, in the context of its
design constraints. It had to serve the uneven masters of XSLT and XPointer, the latter
of which then faded into obscurity. It had to deal with the quirks of XML Namespaces.
It had to account for limitations claimed by key implementors which now seem quaint
in the light of the latest software engineering. Despite that it has been very successful.
Even as many of the small ranks of hard-core XML types moved on to later XPath versions
and XQuery, XPath 1.0 remains the most widely used and most recognizable processing technology.
It has, especially in its location paths system, influenced other languages and processing
tools.
The significant simplification of MicroXML over XML 1.x has opened up a fine opportunity
to explore what XPath could be like with fewer shackles on its design. MicroXPath
takes the best of XPath 1.0, and sprinkles in a few key bits from XPath 2.0/3.0 and
EXSLT. The resulting language is about on par in complexity with XPath 1.0. In truth
the "Micro" moniker is not as apt as it is with "MicroXML." MicroXPath omits dealing
with namespaces, comments and processing instructions, but it adds operations related
to sequences, and some useful core functions. Implementing MicroXPath was one test
of it as a language, and efficient implementation in Python proved a quite natural
use of that language's generator/iterator features.