How to cite this paper
van der Vlist, Eric. “Fleshing the XDM chimera.” Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). https://doi.org/10.4242/BalisageVol8.Vlist01.
Balisage: The Markup Conference 2012
August 7 - 10, 2012
Balisage Paper: Fleshing the XDM chimera
Eric van der Vlist
Eric is an independent consultant and trainer. His domains of expertise include Web
development and XML technologies.
He is the creator and main editor of XMLfr.org, the main site dedicated to XML technologies in French, the author of the O'Reilly
animal books XML Schema and RELAX NG and a member of the ISO DSDL (http://dsdl.org) working group focused on XML schema languages.
He is based in Paris and you can reach him by mail (vdv@dyomedea.com) or meet him at one of the many conferences where he
presents his projects.
Published under the Creative Commons "cc by" license
Abstract
The XQuery and XPath Data Model 3.0 (XDM) is the kernel of the XML ecosystem. XDM
has been extended with foreign item types to embrace new data sources such as JSON,
at the risk of becoming a chimera. This talk explores some ways to move this fundamental piece of
the XML stack forward.
Table of Contents
- Motivation
- XML Data Models
  - XPath/XSLT 1.0
  - XDM 2.0: XPath 2.0/XSLT 2.0/XQuery 1.0
  - XDM 3.0: XPath 3.0/XSLT 3.0/XQuery 3.0
  - Identity Crisis
- Introducing χίμαιραλ (chimeral), the Chimera Language
  - Example
  - χίμαιραλ In a Nutshell
  - Remaining Issues
  - χίμαιραλ and the identity crisis
- Moving the chimera forward
  - Embracing RDF
  - Syntactical sugar
  - XPath
  - Validation
- Conclusion
Motivation
Chimera (mythology): The Chimera (also Chimaera or Chimæra) (Greek: Χίμαιρα, Khimaira, from χίμαρος,
khimaros, "she-goat") was, according to Greek mythology, a monstrous fire-breathing
female creature of Lycia in Asia Minor, composed of the parts of multiple animals:
upon the body of
a lioness with a tail that ended in a snake's head, the head of a goat arose on her
back at the center of her spine. The Chimera was one of the offspring of Typhon and
Echidna and a
sibling of such monsters as Cerberus and the Lernaean Hydra. The term chimera has
also come to describe any mythical animal with parts taken from various animals and,
more generally,
an impossible or foolish fantasy.
— Wikipedia
Chimera (genetics): A chimera or chimaera is a single organism (usually an animal) that is composed
of
two or more different populations of genetically distinct cells that originated from
different zygotes involved in sexual reproduction. If the different cells have emerged
from the
same zygote, the organism is called a mosaic. Chimeras are formed from at least four
parent cells (two fertilized eggs or early embryos fused together). Each population
of cells keeps
its own character and the resulting organism is a mixture of tissues.
— Wikipedia
During her opening keynote at XML Prague 2012, speaking about the relation between
XML, HTML, JSON and RDF, Jeni Tennison warned us against the temptation to create
chimeras:
chimeras are usually ugly, foolish or impossible fantasies.
The next morning, Michael Kay and Jonathan Robie came to present new features in XPath/XQuery/XSLT
3.0. A lot of these features are directly based on the XQuery and XPath Data Model
3.0
(aka XDM):
The XPath Data Model is the abstraction over which XPath expressions are evaluated.
Historically, all of the items in the data model could be derived directly (nodes)
or
indirectly (typed values, sequences) from an XML document. However, as the XPath expression
language has matured, new features have been added which require additional types
of
items to appear in the data model. These items have no direct XML serialization, but
they are never the less part of the data model.
XDM 3.0 is composed of items from a number of different technologies:
-
Items from the XML Infoset (nodes, attributes, ...)
-
Datatype information borrowed from the Post Schema Validation Infoset
-
Sequences
-
Atomic values
-
Functions that can also be used to model JSON arrays
Note
The feature that will be introduced to model JSON arrays is called "maps" and it will
be specified as an XSLT feature in the XSLT 3.0 recommendation (not published yet).
The XSLT 3.0 editor, Michael Kay, has published an early version of this feature in his blog. In this paper, XDM 3.0 will
refer to the XSLT 3.0 data model (the XPath 3.0 data model augmented with maps).
XDM 3.0 being a single data model composed of items from different data models, it
is fair to say that it is a chimera!
Following Jeni Tennison on stage, I tried to show that in a world where HTML 5 on one
hand and JSON on the other hand are gaining traction, XML has become an ecosystem
in a competitive environment and that its data model is a major competitive advantage.
Among other factors, the continued success of XML will thus come from its ability
to seamlessly integrate other data models such as JSON.
If we follow this conclusion, we must admit that this chimera is essential to the
future of XML and do our best to make it elegant and smart.
XML Data Models
Whether it's a bug or a feature could be debated endlessly, but a remarkable feature
of the XML recommendation is that it's all about syntax and parsing rules and does not really
define a data model. The big advantage is that everyone can find pretty much what they want in XML
documents, but for the sake of this paper we need to choose a well-known and well-defined
data model to work on.
The most common XML data model is probably the data model defined by the trio XPath/XSLT/XQuery,
known as "XDM" since XPath version 2.0, and that's the one we will choose.
XDM version 3.0, still a work in progress, will be the third version of this data model.
It's important to understand its design and evolution to use its most advanced features
and we'll start our exploration with a short history of its versions.
XPath/XSLT 1.0
The XPath 1.0 data model is described as being composed of seven types of nodes (root, elements, text, attributes,
namespaces, processing instructions and comments).
The XSLT 1.0 data model is defined as being the XPath 1.0 data model with:
-
Relaxed constraints on root node children to support well-formed external general
parsed entities that are not well formed XML documents
-
An additional "base URI" property on every node.
-
An additional "unparsed entities" property on the root node.
It's fair to say that these two -very close- data models are completely focused on
XML, but is that all?
Not entirely: these two specifications introduce other notions that should be considered
as related to the data model even if they are not described in their sections called
"Data Model"...
XSLT 1.0 mentions in passing the four basic XPath data-types (string, number, boolean, node-set)
and explicitly adds a fifth one: result tree fragments.
These four basic data-types are implicitly defined in XPath 1.0 in its section about
its function library but no formal description of these types is given.
XDM 2.0: XPath 2.0/XSLT 2.0/XQuery 1.0
In version 2.0, the XDM is promoted to get its own specification.
XDM 2.0 keeps the same seven types of nodes as XPath 1.0 and integrates the additions
from the XSLT 1.0 data model. A number of properties are added to these nodes to capture
information that had been left outside the data model by the previous version and
also to support the data-type system from the PSVI (Post Schema Validation Infoset).
The term "data-type" or simply "type" being now used to refer to XML Schema data-types,
a new terminology is introduced where the data model is composed of "information items"
(or
items) being either XML nodes or "atomic values".
The concept of "sequences" is also introduced. Sequences are not strictly considered
as items but play a very important role in XDM. They are defined as an ordered collection
of zero or more items.
The data model is thus now composed of three different concepts:
-
nodes
-
atomic values
-
sequences
XDM 2.0 notes that an important difference between nodes and atomic values is that
only nodes have identities:
This is a crucial distinction that divides the data model into two different kinds
of items (those which have an identity and those which don't). Let's take an
example:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>5</foo>
<foo>5</foo>
<bar foo="5">
<foo>5</foo>
</bar>
</root>
The three <foo>5</foo> elements look similar and can be considered "deeply equal" but they
are three different elements with three different identities. This is needed because
some of their properties are different: the parent of the first two is <root/>
while the parent of the third one is <bar/>, the preceding sibling of the second one is the first one while the first one has
no preceding sibling, ...
The three "5" text nodes are similar but they still are different text nodes with
different identities and this is necessary because they don't have the same parent
elements.
By contrast, the atomic values of the three <foo/> elements (and the atomic value of the @foo
attribute) are the same atomic value, "5" (assuming they
have all been declared with the same datatype). Among many other things, this means
that when you manipulate their values, you can't get back to the node that is holding
the value.
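The distinction can be observed with any XML API that exposes nodes as objects. The following Python sketch is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for XDM (an assumption, since ElementTree is not an XDM implementation and has no separate text nodes) and compares object identity with value equality on the example above.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
  <foo>5</foo>
  <foo>5</foo>
  <bar foo="5">
    <foo>5</foo>
  </bar>
</root>"""

root = ET.fromstring(DOC)
foos = list(root.iter("foo"))  # the three <foo> elements, in document order

# Nodes: three distinct objects, even though they look the same.
distinct_nodes = len({id(foo) for foo in foos})

# Atomic values: once extracted, the values are indistinguishable and
# nothing in a value leads back to the element it was taken from.
values = [int(foo.text) for foo in foos]
attribute_value = int(root.find("bar").get("foo"))

print(distinct_nodes)
print(set(values) | {attribute_value})
```

The three elements stay distinguishable as objects while their typed values collapse into a single integer.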
XDM 3.0: XPath 3.0/XSLT 3.0/XQuery 3.0
Note
These specifications are still works in progress, currently divided between XQuery and XPath Data
Model 3.0 and data model extensions described in XSL Transformations (XSLT) Version 3.0.
XDM 3.0 adds functions as a third kind of item, transforming XQuery and XSLT into
functional languages.
Like atomic values, functions have no identity:
XSLT 3.0 adds to XDM 3.0 a fourth kind of item: maps, derived from functions, which,
among many other use cases, can be used to model JSON objects:
Like atomic values and functions (from which they are derived), maps have no identity:
Note
In this statement, the specification does acknowledge that sequences have no identity
either. This is understandable but didn't seem to be clearly specified elsewhere.
Of course, XSLT 3.0 is also adding functions to create and manipulate maps and to serialize/deserialize
them as JSON, as well as a syntax to define map literals. It does not add any new pattern to
select or match maps or map entries, though.
Identity Crisis
Appolonius' ship is a beautiful ship. Over the years it has been repaired so many
times that there is not a single piece of the original materials remaining. The question
is,
therefore, is it really still Appolonius' ship?
— ObjectIdentity on c2.com
Object identity is often confused with mutability. The need for objects to have identities
is more obvious when they are mutable, their identities being then used to track them
despite
their changes like Appolonius' ship. However, XDM 3.0 gives us a good opportunity
to explore the meaning and consequences of having (or not having) an identity for
immutable object
structures.
The definition of node identity in XDM 3.0 is directly copied from XDM 2.0:
I find this definition confusing:
-
Why should the value “5” as an integer be instantiated and why should we care? The
value “5” as an integer is... the value “5” as an integer! It's unique and being unique,
doesn't it have an identity?
-
A node, with all the properties defined in XDM (including its document-uri and parent
accessors) would be unique if it had "previous-sibling" or "document-order" accessors.
Note
To find the previous siblings of a node relying only on the accessors defined in XDM
(2.0 or 3.0), you'd have to access the node's parent and loop over its children
until you find the current node, which you would identify as such by checking its identity.
Rather than focussing on uniqueness, which for immutable information items does not
really matter, a better differentiation could be between information items which have
enough context
information to "know where they belong" in the data model and those which don't.
This differentiation has the benefit of highlighting the consequences of having or
not having an identity: to be able to navigate between an information item and its
ancestors or siblings, this item must know where it belongs. When that's not the case, it is still possible
to navigate between the item and its descendants but axes such as ancestor:: or sibling::
are not available.
Note
Identity can be seen as the price to pay for the ancestor:: and sibling:: axes.
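Python's standard xml.etree.ElementTree happens to illustrate this price: its elements know their children but carry no reference to their parent. The sketch below is an illustration under that assumption, not an XDM implementation. It has to build the context information separately and then finds preceding siblings by looping over the parent's children and checking identity.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root><foo>5</foo><foo>5</foo><bar><foo>5</foo></bar></root>"
)

# ElementTree elements only know their descendants: no parent accessor.
has_parent_accessor = hasattr(root[0], "getparent")

# Context information has to be computed from the top and stored apart.
parent_of = {child: parent for parent in root.iter() for child in parent}


def preceding_siblings(elem):
    """Return the siblings located before elem, found by identity."""
    parent = parent_of.get(elem)
    if parent is None:  # the root element has no parent in this sketch
        return []
    siblings = list(parent)
    position = next(i for i, s in enumerate(siblings) if s is elem)
    return siblings[:position]


second_foo = root[1]
print([s.tag for s in preceding_siblings(second_foo)])
```

The `parent_of` dictionary plays the role of the context information discussed above: without it, only downward navigation is possible.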
Let's go back to a simple example:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>5</foo>
<foo>5</foo>
<bar>
<foo>5</foo>
</bar>
</root>
In a hypothetical data model where nodes have no identity, there would be only 3
distinct elements: root, bar and a single foo.
If we add identity (or context information) properties, the foo elements become three
different information items since they differ by these properties.
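This counting can be checked with a short Python sketch. It is only an illustration: the `shape` and `shapes_with_context` helpers are hypothetical names, and a tuple of child positions is used as a stand-in for whatever context properties a real data model would define.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
  <foo>5</foo>
  <foo>5</foo>
  <bar>
    <foo>5</foo>
  </bar>
</root>"""


def shape(elem):
    """Context-free description of an element: name, text, child shapes."""
    text = (elem.text or "").strip()
    return (elem.tag, text, tuple(shape(child) for child in elem))


def shapes_with_context(elem, path=()):
    """Yield (path, shape) pairs; the path acts as context information."""
    yield (path, shape(elem))
    for index, child in enumerate(elem):
        yield from shapes_with_context(child, path + (index,))


root = ET.fromstring(DOC)
without_identity = {shape(e) for e in root.iter()}
with_context = set(shapes_with_context(root))

print(len(without_identity), len(with_context))
```

Without context, the three foo elements collapse into one value; with a path attached, all five elements of the document are distinct again.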
The process of adding these properties to an information item looks familiar. Depending
on your background, you can compare it to:
We've seen that XDM 3.0 acknowledges this difference between information items which
have context information and those which don't. I don't want to deny that both
types of data models have their use cases: there are obviously many use cases where context information
is needed and use cases where lightweight structures are a better fit.
That being said, if we are serious about the support of JSON in XDM, we should offer
the same features to access data whether this data is stored in maps or in XML nodes.
Let's consider this JSON object borrowed from the XSLT 3.0 Working
Draft:
{ "accounting" : [
{ "firstName" : "John",
"lastName" : "Doe",
"age" : 23 },
{ "firstName" : "Mary",
"lastName" : "Smith",
"age" : 32 }
],
"sales" : [
{ "firstName" : "Sally",
"lastName" : "Green",
"age" : 27 },
{ "firstName" : "Jim",
"lastName" : "Galley",
"age" : 41 }
]
}
This object could be represented in XML by the following
document:
<?xml version="1.0" encoding="UTF-8"?>
<company>
<department name="sales">
<employee>
<firstName>Sally</firstName>
<lastName>Green</lastName>
<age>27</age>
</employee>
<employee>
<firstName>Jim</firstName>
<lastName>Galley</lastName>
<age>41</age>
</employee>
</department>
<department name="accounting">
<employee>
<firstName>John</firstName>
<lastName>Doe</lastName>
<age>23</age>
</employee>
<employee>
<firstName>Mary</firstName>
<lastName>Smith</lastName>
<age>32</age>
</employee>
</department>
</company>
The features introduced in the latest XSLT 3.0 Working Draft make it rather easy to transform
from one model to the other, but these two models do not have, by far,
the same features.
In the XML flavor, when the context item is the employee "John Doe", you can easily
find out what his department is because this is an element and elements do carry context
information.
In the map flavor, by contrast, when the context item is an employee map, this object
has no context information and you can't tell which department he belongs to without looping
within the containing map.
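The loop in question can be sketched in Python, with plain dictionaries and lists standing in for XDM maps (an assumption made for illustration; `department_of` is a hypothetical helper). Since maps have no identity, the employee is searched by value, so two employees with identical entries in different departments could not be told apart.

```python
import json

COMPANY = json.loads("""
{ "accounting" : [
    { "firstName" : "John", "lastName" : "Doe", "age" : 23 },
    { "firstName" : "Mary", "lastName" : "Smith", "age" : 32 } ],
  "sales" : [
    { "firstName" : "Sally", "lastName" : "Green", "age" : 27 },
    { "firstName" : "Jim", "lastName" : "Galley", "age" : 41 } ]
}
""")


def department_of(company, employee):
    """Find an employee's department by scanning the containing map."""
    for name, employees in company.items():
        if employee in employees:  # comparison by value, not by identity
            return name
    return None


john = COMPANY["accounting"][0]
print(department_of(COMPANY, john))
```

Note that the function needs the containing map as an argument: the employee map alone carries no information about where it belongs.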
This important restriction lies purely at the data model level. It is aggravated by the fact that the
XPath syntax has not been extended to generalize axes so that they can work with maps.
If I work with the XML version of this structure, it is straightforward to evaluate things such as the number
of employees, the average age of employees, the number of departments, the number
of employees by department or the average age by department, to find out if there is an employee
called "Mary Smith" in one of the departments, to find the employees who are more than 40,
to get a list of employees from all the departments sorted by age, ... In the map flavor, by contrast,
I don't have any XPath axis available and must do all these operations using a limited
number of map functions (map:keys(), map:contains(), map:get()). In other words, while I can use
XPath expressions with the XML version, I must use DOM-like operations to access the
map version!
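The asymmetry can be illustrated with a Python sketch. This is an analogy, not XDM: ElementTree's limited path syntax stands in for XPath, and dictionary accessors stand in for map:keys() and map:get().

```python
import json
import xml.etree.ElementTree as ET

XML = ET.fromstring("""<company>
  <department name="sales">
    <employee><firstName>Sally</firstName><lastName>Green</lastName><age>27</age></employee>
    <employee><firstName>Jim</firstName><lastName>Galley</lastName><age>41</age></employee>
  </department>
  <department name="accounting">
    <employee><firstName>John</firstName><lastName>Doe</lastName><age>23</age></employee>
    <employee><firstName>Mary</firstName><lastName>Smith</lastName><age>32</age></employee>
  </department>
</company>""")

MAP = json.loads("""
{ "accounting" : [
    { "firstName" : "John", "lastName" : "Doe", "age" : 23 },
    { "firstName" : "Mary", "lastName" : "Smith", "age" : 32 } ],
  "sales" : [
    { "firstName" : "Sally", "lastName" : "Green", "age" : 27 },
    { "firstName" : "Jim", "lastName" : "Galley", "age" : 41 } ]
}
""")

# XML flavor: one path expression selects every employee, at any depth.
xml_ages = [int(e.findtext("age")) for e in XML.findall(".//employee")]
xml_mary = XML.findall(".//employee[firstName='Mary'][lastName='Smith']")

# Map flavor: keys and values are walked explicitly, level by level.
map_ages = [emp["age"] for key in MAP.keys() for emp in MAP[key]]
map_mary = [emp for key in MAP.keys() for emp in MAP[key]
            if emp["firstName"] == "Mary" and emp["lastName"] == "Smith"]

average_age = sum(xml_ages) / len(xml_ages)
print(average_age, len(xml_mary), len(map_mary))
```

Both flavors give the same answers, but the map flavor hard-codes the depth of the structure in every query.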
To summarize, yes XDM 3.0 does support JSON but to do pretty much anything interesting
with JSON objects, you'd better transform them into XML nodes first! XSLT 3.0 does
give you the
tools to do this transformation quite easily but the message to JSON users is that
we don't treat their data model as a first class citizen.
To make it worse, XPath is used by many other specifications, within and outside the
W3C, and the level of support for JSON provided by XDM and XPath will determine how
well these specifications will be able to support JSON. Specifications that are impacted
by this issue include XForms, XProc and Schematron. Supporting JSON would be really
useful for these three specifications if and only if map items could have the same features as nodes.
Furthermore, the same asymmetry exists when you want to create these two structures
from other sources: to create the XML structure you can use sequence constructors
but to create the map structure, you have to use the map:new() and map:item() functions.
My proposal to solve this issue is:
-
To acknowledge the fact that any type of information item can be either "context independent"
or include context information and explore the consequences of this
statement.
-
To generalize XPath axes so that they can be used with map items.
-
To create sequence constructors for maps and map entries.
You are welcome to discuss this further:
Introducing χίμαιραλ (chimeral), the Chimera Language
When I started to work on χίμαιραλ a few months ago, my first motivation was to propose an XDM serialization for maps
which would turn the rather abstract prose from the specification into concrete angle brackets that you
could see and read.
The exercise has been very instructive and helped me a lot to understand the spec;
however, a more ambitious use pattern emerged as I was making progress. The
XSLT 3.0 Working Draft is part of a batch of Working Drafts which are far more advanced. My proposals
to solve the "map identity crisis" are probably too intrusive and too late to be taken
into account and the batch of specifications will most probably carry on with the current proposal.
If that's the case, we've seen that it makes a lot of sense to convert maps into nodes
to enable the use of XPath axes, and χίμαιραλ provides a generic target format for these
conversions.
Example
Let's take again the JSON object borrowed from the XSLT 3.0 Working
Draft:
{ "accounting" : [
{ "firstName" : "John",
"lastName" : "Doe",
"age" : 23 },
{ "firstName" : "Mary",
"lastName" : "Smith",
"age" : 32 }
],
"sales" : [
{ "firstName" : "Sally",
"lastName" : "Green",
"age" : 27 },
{ "firstName" : "Jim",
"lastName" : "Galley",
"age" : 41 }
]
}
Its χίμαιραλ representation is:
<?xml version="1.0" encoding="UTF-8"?>
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:map>
<χ:entry key="sales" keyType="string">
<χ:map>
<χ:entry key="1" keyType="number">
<χ:map>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Green</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">27</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">Sally</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:entry>
<χ:entry key="2" keyType="number">
<χ:map>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Galley</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">41</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">Jim</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:entry>
</χ:map>
</χ:entry>
<χ:entry key="accounting" keyType="string">
<χ:map>
<χ:entry key="1" keyType="number">
<χ:map>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Doe</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">23</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">John</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:entry>
<χ:entry key="2" keyType="number">
<χ:map>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Smith</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">32</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">Mary</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:entry>
</χ:map>
</χ:entry>
</χ:map>
</χ:data-model>
Granted, it's much more verbose than the JSON version, but it's the exact translation
of the XDM corresponding to the JSON object in XML.
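The mapping used in this example can be sketched in Python with the standard json and xml.etree.ElementTree modules. This is a minimal sketch, not the XSLT transformation behind χίμαιραλ: the handling of booleans and null is an assumption (neither appears in the example), helper names such as `to_chimeral` are hypothetical, and entries are emitted in input order whereas maps are unordered.

```python
import json
import xml.etree.ElementTree as ET

CHI = "http://χίμαιραλ.com#"      # namespace URI taken from the examples
ET.register_namespace("χ", CHI)   # keep the χ prefix when serializing


def q(local_name):
    """Name of a χίμαιραλ element in ElementTree's {uri}local form."""
    return "{%s}%s" % (CHI, local_name)


def add_map(parent, entries):
    """Append a χ:map with one χ:entry per (key, key type, value)."""
    map_elem = ET.SubElement(parent, q("map"))
    for key, key_type, value in entries:
        attributes = {"key": key, "keyType": key_type}
        add_value(ET.SubElement(map_elem, q("entry"), attributes), value)


def add_value(parent, value):
    """Append the χίμαιραλ representation of a parsed JSON value."""
    if isinstance(value, dict):
        add_map(parent, [(k, "string", v) for k, v in value.items()])
    elif isinstance(value, list):
        # Arrays are maps keyed by position, starting at 1 as in the example.
        add_map(parent,
                [(str(i), "number", v) for i, v in enumerate(value, 1)])
    elif value is None:
        pass  # assumption: null becomes an empty sequence (no child)
    else:
        if isinstance(value, bool):  # assumption: not shown in the example
            kind, text = "boolean", str(value).lower()
        elif isinstance(value, (int, float)):
            kind, text = "number", str(value)
        else:
            kind, text = "string", value
        ET.SubElement(parent, q("atomic-value"), {"type": kind}).text = text


def to_chimeral(json_text):
    """Build a χ:data-model element from a JSON string."""
    root = ET.Element(q("data-model"))
    add_value(root, json.loads(json_text))
    return root


document = to_chimeral(
    '{"sales": [{"firstName": "Sally", "lastName": "Green", "age": 27}]}'
)
print(ET.tostring(document, encoding="unicode"))
```

The boolean test comes before the numeric one because Python treats `bool` as a subclass of `int`.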
χίμαιραλ In a Nutshell
The design goals are:
-
Be as close as possible to the XDM and its terminology
-
Represent XML nodes as... XML nodes
-
Allow round-trips (an XDM model serialized as χίμαιραλ should give an XDM model identical
to the original one when de-serialized)
-
Be easy to process using XPath/XQuery/XSLT
-
Support of the PSVI is not a goal
χίμαιραλ is not the only proposal to serialize XDM as XML. Two other notable ones
are:
-
Zorba's XDM serialization is a straight and
accurate XDM serialization which does support PSVI annotations. As a consequence,
nodes are serialized as xdm:* elements (an element is an xdm:element, an attribute an xdm:attribute
element, ...). This doesn't meet my second requirement to represent nodes as themselves.
-
XDML, presented by Hans-Jürgen Rennau and David A. Lee at
Balisage 2011, is more than just an XDM serialization and also includes manipulation
and processing definitions. It introduces its own terminology and concepts and is
too far away from XDM for my design goals.
A lot of attention has been given to the first design goal: the structure of a χίμαιραλ
model and the name of its elements and attributes are directly derived from the
specifications.
In XDM, map entries' values can be arrays (an array being nothing else than a map
with integer keys) but also sequences (which is not possible in JSON). χίμαιραλ respects
the fact that in XDM there is no difference between an item and a sequence composed of this single item,
and represents sequences by a repetition of values.
The map map{1:= 'foo'} is serialized as:
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:map>
<χ:entry key="1" keyType="number">
<χ:atomic-value type="string">foo</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:data-model>
And the map map{1:= ('foo', 'bar')} is serialized as:
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:map>
<χ:entry key="1" keyType="number">
<χ:atomic-value type="string">foo</χ:atomic-value>
<χ:atomic-value type="string">bar</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:data-model>
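Reading such a serialization back can be sketched in Python. This is a hedged illustration limited to maps and atomic values, with hypothetical helper names. Python has no equivalent of the XDM rule that an item is the same as a singleton sequence, so the sketch always returns entry values as lists, a single value being a list of length one.

```python
import xml.etree.ElementTree as ET

CHI = "http://χίμαιραλ.com#"  # namespace URI taken from the examples

SERIALIZED = """<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
  <χ:map>
    <χ:entry key="1" keyType="number">
      <χ:atomic-value type="string">foo</χ:atomic-value>
      <χ:atomic-value type="string">bar</χ:atomic-value>
    </χ:entry>
  </χ:map>
</χ:data-model>"""


def q(local_name):
    """Name of a χίμαιραλ element in ElementTree's {uri}local form."""
    return "{%s}%s" % (CHI, local_name)


def read_atomic(text, type_name):
    """Convert a serialized atomic value or key back to a Python value."""
    if type_name == "number":
        number = float(text)
        return int(number) if number.is_integer() else number
    return text


def read_value(elem):
    """Read one item of a sequence: a nested map or an atomic value."""
    if elem.tag == q("map"):
        return read_map(elem)
    if elem.tag == q("atomic-value"):
        return read_atomic(elem.text or "", elem.get("type"))
    raise ValueError("unsupported item: %s" % elem.tag)


def read_map(map_elem):
    """Read a χ:map as a dict whose values are sequences (lists)."""
    return {
        read_atomic(entry.get("key"), entry.get("keyType")):
            [read_value(item) for item in entry]
        for entry in map_elem.findall(q("entry"))
    }


data_model = ET.fromstring(SERIALIZED)
print(read_map(data_model.find(q("map"))))
```

Node references and functions are left out of this sketch.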
We've seen that XDM makes a clear distinction between nodes, which have identities,
and other item types (atomic values, functions and maps), which don't. XDM allows
nodes to be used as map entry values. χίμαιραλ supports this feature too, but copying the nodes would create
new nodes with different identities.
To avoid that, the documents to which these nodes belong are copied into χ:instance elements
and references between map entry values and instances are made using XPath expressions.
The following $map variable:
<xsl:variable name="a-node">
<foo/>
</xsl:variable>
<xsl:variable name="map" select="map{'a-node':= $a-node}"/>
is serialized as:
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:instance id="d4" kind="document">
<foo/>
</χ:instance>
<χ:map>
<χ:entry key="a-node" keyType="string">
<χ:node kind="document" instance="d4" path="/"/>
</χ:entry>
</χ:map>
</χ:data-model>
Like XSLT variables, instances do not always contain document nodes and the following
$map variable:
<xsl:variable name="a-node" as="node()">
<foo/>
</xsl:variable>
<xsl:variable name="map" select="map{'a-node':= $a-node}"/>
is serialized as:
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:instance id="d4e0" kind="fragment">
<foo/>
</χ:instance>
<χ:map>
<χ:entry key="a-node" keyType="string">
<χ:node kind="element" instance="d4e0" path="root()" name="foo"/>
</χ:entry>
</χ:map>
</χ:data-model>
Nodes can belong to more than one instance, and this $map variable:
<xsl:variable name="a-node" as="node()*">
<foo/>
<bar/>
</xsl:variable>
<xsl:variable name="map" select="map{'a-node':= $a-node}"/>
is serialized as:
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:instance id="d4e0" kind="fragment">
<foo/>
</χ:instance>
<χ:instance id="d4e3" kind="fragment">
<bar/>
</χ:instance>
<χ:map>
<χ:entry key="a-node" keyType="string">
<χ:node kind="element" instance="d4e0" path="root()" name="foo"/>
<χ:node kind="element" instance="d4e3" path="root()" name="bar"/>
</χ:entry>
</χ:map>
</χ:data-model>
Nodes can be "deep linked", the same node can be linked several times and nodes can
be mixed with atomic values at will. The following $map variable:
<xsl:variable name="doc">
<department name="sales">
<employee>
<firstName>Sally</firstName>
<lastName>Green</lastName>
<age>27</age>
</employee>
<employee>
<firstName>Jim</firstName>
<lastName>Galley</lastName>
<age>41</age>
</employee>
</department>
<department name="accounting">
<employee>
<firstName>John</firstName>
<lastName>Doe</lastName>
<age>23</age>
</employee>
<employee>
<firstName>Mary</firstName>
<lastName>Smith</lastName>
<age>32</age>
</employee>
</department>
</xsl:variable>
<xsl:variable name="map"
select="map{
'sales' := $doc/department[@name='sales'],
'Sally' := $doc//employee[firstName = 'Sally'],
'kids' := $doc//employee[age &lt; 30],
'dep-names-attributes' := $doc/department/@name,
'dep-names' := for $name in $doc/department/@name return string($name)
}"/>
is serialized as:
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:instance id="d4" kind="document">
<department name="sales">
<employee>
<firstName>Sally</firstName>
<lastName>Green</lastName>
<age>27</age>
</employee>
<employee>
<firstName>Jim</firstName>
<lastName>Galley</lastName>
<age>41</age>
</employee>
</department>
<department name="accounting">
<employee>
<firstName>John</firstName>
<lastName>Doe</lastName>
<age>23</age>
</employee>
<employee>
<firstName>Mary</firstName>
<lastName>Smith</lastName>
<age>32</age>
</employee>
</department>
</χ:instance>
<χ:map>
<χ:entry key="sales" keyType="string">
<χ:node kind="element"
instance="d4"
path="/&quot;&quot;:department[1]"
name="department"/>
</χ:entry>
<χ:entry key="Sally" keyType="string">
<χ:node kind="element"
instance="d4"
path="/&quot;&quot;:department[1]/&quot;&quot;:employee[1]"
name="employee"/>
</χ:entry>
<χ:entry key="kids" keyType="string">
<χ:node kind="element"
instance="d4"
path="/&quot;&quot;:department[1]/&quot;&quot;:employee[1]"
name="employee"/>
<χ:node kind="element"
instance="d4"
path="/&quot;&quot;:department[2]/&quot;&quot;:employee[1]"
name="employee"/>
</χ:entry>
<χ:entry key="dep-names-attributes" keyType="string">
<χ:node kind="attribute"
instance="d4"
path="/&quot;&quot;:department[1]/@name"
name="name">sales</χ:node>
<χ:node kind="attribute"
instance="d4"
path="/&quot;&quot;:department[2]/@name"
name="name">accounting</χ:node>
</χ:entry>
<χ:entry key="dep-names" keyType="string">
<χ:atomic-value type="string">sales</χ:atomic-value>
<χ:atomic-value type="string">accounting</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:data-model>
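The path attributes above locate each referenced node by name and position inside its instance. A Python sketch of how such paths could be computed is given below. It rests on several assumptions: it is not the transformation used to produce the listing, it only handles elements in no namespace, the `<instance>` wrapper is a hypothetical stand-in for the document node, and `path_to` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# <instance> is a hypothetical wrapper standing for the document node.
INSTANCE = ET.fromstring("""<instance>
  <department name="sales">
    <employee><firstName>Sally</firstName><age>27</age></employee>
    <employee><firstName>Jim</firstName><age>41</age></employee>
  </department>
  <department name="accounting">
    <employee><firstName>John</firstName><age>23</age></employee>
    <employee><firstName>Mary</firstName><age>32</age></employee>
  </department>
</instance>""")


def path_to(instance, target):
    """Build a positional path from the instance down to target."""
    parent_of = {child: parent
                 for parent in instance.iter() for child in parent}
    steps = []
    node = target
    while node is not instance:
        parent = parent_of[node]
        same_name = [s for s in parent if s.tag == node.tag]
        position = next(i for i, s in enumerate(same_name, 1) if s is node)
        # "" is the (empty) namespace name, as in the paths shown above.
        steps.append('"":%s[%d]' % (node.tag, position))
        node = parent
    return "/" + "/".join(reversed(steps))


sally = INSTANCE.find(".//employee[firstName='Sally']")
john = INSTANCE.find(".//employee[firstName='John']")
print(path_to(INSTANCE, sally))
print(path_to(INSTANCE, john))
```

Passing the instance itself as the target yields "/", which matches the reference to a document node shown earlier.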
Remaining Issues
A collation property should be added to <χ:map/>, probably as an attribute, the transformation to serialize to χίμαιραλ should be
cleaned up and the reverse transformation should be implemented.
These are fairly minor issues. The biggest remaining one is probably to find a way to cleanly
serialize references to nodes that are not contained within an element, such as the attribute in the
following $map variable:
<xsl:variable name="attribute" as="node()">
<xsl:attribute name="foo">bar</xsl:attribute>
</xsl:variable>
<xsl:variable name="map"
select="map{
'attribute' := $attribute
}"/>
Support of functions should also be considered.
χίμαιραλ and the identity crisis
To some extent, χίμαιραλ can be considered as a solution to the XDM identity crisis:
-
Serializing an XDM model as χίμαιραλ creates elements for maps, map entries and atomic
values and these elements, being nodes, have identities. The serialization is
therefore also an instantiation of XDM information items as defined above.
-
De-serializing a χίμαιραλ document to create an XDM data model is also a de-instantiation,
except of course that the identity of XML nodes is not "removed".
However, χίμαιραλ does keep a strong difference between nodes, which are kept in <χ:instance>
elements, and maps and atomic values.
Moving the chimera forward
χίμαιραλ is a good playground to explore the new possibilities offered by XDM 3.0.
Here is a (non exhaustive) list of a few directions that seem interesting...
Note
Don't expect to find fully baked proposals in this section: on the contrary, it contains
very early drafts of ideas to follow to support XDM maps as "first class
citizens"!
Embracing RDF
If you had the opportunity to enjoy the sunny weather of Orlando in December 2001,
you may remember "The Syntactic Web", a provocative talk in which
Jonathan Robie showed how XQuery 1.0 could be used to query normalized XML/RDF
documents.
The gap between RDF triples and the versatility of its XML representation was a big
issue, but the new features brought by this new version of the XPath/XQuery/XSLT package
should
help us.
The basic data model of RDF is based on triples, a triple being composed of a subject,
a predicate and an object. In XDM, a triple can now be represented by either a sequence,
an array or a map of three items.
XDM sequences have the property that they cannot include other sequences and representing
triples as sequences would mean that you couldn't define sequences of triples. For
that reason it is probably better to define triples as maps or arrays. An array being a
map indexed by integers, that doesn't make a huge difference at a conceptual level,
but I find it cleaner to access the subject of a triple using a QName (such as rdf:subject) rather
than an index. Following this principle, we could define a triple as:
map {
xs:QName('rdf:subject') := xs:anyURI('http://www.example.org/index.html'),
xs:QName('rdf:predicate') := xs:anyURI('http://purl.org/dc/elements/1.1/creator'),
xs:QName('rdf:object') := xs:anyURI('http://www.example.org/staffid/85740')
}
The χίμαιραλ serialization of this map is:
<χ:data-model xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:χ="http://χίμαιραλ.com#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<χ:map>
<χ:entry key="rdf:object"
keyType="xs:QName">
<χ:atomic-value type="xs:anyURI">http://www.example.org/staffid/85740</χ:atomic-value>
</χ:entry>
<χ:entry key="rdf:predicate"
keyType="xs:QName">
<χ:atomic-value type="xs:anyURI">http://purl.org/dc/elements/1.1/creator</χ:atomic-value>
</χ:entry>
<χ:entry key="rdf:subject"
keyType="xs:QName">
<χ:atomic-value type="xs:anyURI">http://www.example.org/index.html</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:data-model>
What can we do with such triples? Using higher order functions, it should not be too
difficult to define triple stores with basic query features!
Is this lightweight enough? Or does RDF support deserve new information item types
to be supported by XDM?
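As a rough indication of how little is needed, here is a sketch of a triple store with pattern queries built on higher order functions. It is written in Python rather than XQuery, with dictionaries standing in for XDM maps and plain strings for QNames and URIs. The second triple and the helper names are hypothetical additions for illustration.

```python
SUBJECT, PREDICATE, OBJECT = "rdf:subject", "rdf:predicate", "rdf:object"

CREATOR = "http://purl.org/dc/elements/1.1/creator"
LANGUAGE = "http://purl.org/dc/elements/1.1/language"
PAGE = "http://www.example.org/index.html"


def triple(subject, predicate, obj):
    """A triple as a map with three entries, as in the example above."""
    return {SUBJECT: subject, PREDICATE: predicate, OBJECT: obj}


def pattern(subject=None, predicate=None, obj=None):
    """Return a test function for triples; None acts as a wildcard."""
    wanted = {SUBJECT: subject, PREDICATE: predicate, OBJECT: obj}

    def test(candidate):
        return all(value is None or candidate[key] == value
                   for key, value in wanted.items())

    return test


def query(store, test):
    """Select the triples of a store that satisfy a test function."""
    return [candidate for candidate in store if test(candidate)]


STORE = [
    triple(PAGE, CREATOR, "http://www.example.org/staffid/85740"),
    triple(PAGE, LANGUAGE, "en"),  # hypothetical second triple
]

creators = query(STORE, pattern(subject=PAGE, predicate=CREATOR))
print([t[OBJECT] for t in creators])
```

The store is just a sequence of maps and a query is a function applied to it, which is the kind of construct that function items make possible in XDM 3.0.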
Syntactical sugar
We've seen that this JSON object
{ "accounting" : [
{ "firstName" : "John",
"lastName" : "Doe",
"age" : 23 },
{ "firstName" : "Mary",
"lastName" : "Smith",
"age" : 32 }
],
"sales" : [
{ "firstName" : "Sally",
"lastName" : "Green",
"age" : 27 },
{ "firstName" : "Jim",
"lastName" : "Galley",
"age" : 41 }
]
}
is serialized in χίμαιραλ as:
<?xml version="1.0" encoding="UTF-8"?>
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:map>
<χ:entry key="sales" keyType="string">
<χ:map>
<χ:entry key="1" keyType="number">
<χ:map>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Green</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">27</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">Sally</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:entry>
<χ:entry key="2" keyType="number">
<χ:map>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Galley</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">41</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">Jim</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:entry>
</χ:map>
</χ:entry>
<χ:entry key="accounting" keyType="string">
<χ:map>
<χ:entry key="1" keyType="number">
<χ:map>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Doe</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">23</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">John</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:entry>
<χ:entry key="2" keyType="number">
<χ:map>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Smith</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">32</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">Mary</χ:atomic-value>
</χ:entry>
</χ:map>
</χ:entry>
</χ:map>
</χ:entry>
</χ:map>
</χ:data-model>
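The mapping used in this listing (objects become maps with string keys, arrays become maps keyed by position, strings and numbers become atomic values) could be sketched as follows in Python. This is an illustration only, not the reference implementation: the function names q and to_chimeral are hypothetical, and booleans and null are rejected because the listing does not show how they are serialized.

```python
import json
import xml.etree.ElementTree as ET

CHI = "http://χίμαιραλ.com#"  # namespace URI taken from the listings
ET.register_namespace("χ", CHI)


def q(local_name):
    """Return the ElementTree name of a χίμαιραλ element."""
    return "{%s}%s" % (CHI, local_name)


def to_chimeral(value, parent):
    """Append the χίμαιραλ form of a parsed JSON value to parent."""
    if isinstance(value, dict):
        # JSON object: one entry per member, keyed by a string.
        container = ET.SubElement(parent, q("map"))
        for key, item in value.items():
            entry = ET.SubElement(container, q("entry"),
                                  {"key": key, "keyType": "string"})
            to_chimeral(item, entry)
    elif isinstance(value, list):
        # JSON array: a map keyed by 1-based position.
        container = ET.SubElement(parent, q("map"))
        for position, item in enumerate(value, start=1):
            entry = ET.SubElement(container, q("entry"),
                                  {"key": str(position), "keyType": "number"})
            to_chimeral(item, entry)
    elif isinstance(value, str):
        ET.SubElement(parent, q("atomic-value"), {"type": "string"}).text = value
    elif isinstance(value, (int, float)) and not isinstance(value, bool):
        ET.SubElement(parent, q("atomic-value"), {"type": "number"}).text = str(value)
    else:
        # Assumption: booleans and null are not covered by the listing.
        raise TypeError("unsupported JSON value: %r" % (value,))


document = json.loads('{"sales": [{"firstName": "Sally", "age": 27}]}')
root = ET.Element(q("data-model"))
to_chimeral(document, root)
print(ET.tostring(root, encoding="unicode"))
```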
We can work with that, but wouldn't it be nice if we had a native syntax that does
not use XML elements and attributes to represent maps?
Depending on the requirements, many approaches are possible.
A first option would be to define pluggable notation parsers within XML and
write:
<χ:notation mediatype="application/json"><![CDATA[
{ "accounting" : [
{ "firstName" : "John",
"lastName" : "Doe",
"age" : 23 },
{ "firstName" : "Mary",
"lastName" : "Smith",
"age" : 32 }
],
"sales" : [
{ "firstName" : "Sally",
"lastName" : "Green",
"age" : 27 },
{ "firstName" : "Jim",
"lastName" : "Galley",
"age" : 41 }
]
}
]]></χ:notation>
The
<χ:notation/>
element would trigger a parser supporting the application/json media type. This
is less verbose and more natural to JSON users, but
it does not allow XML nodes to be added to maps or sequences.
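A pluggable notation parser of this kind might look like the following Python sketch. The PARSERS registry and the expand_notations function are hypothetical names, and the sketch simply returns the parsed values instead of merging them into a data model.

```python
import json
import xml.etree.ElementTree as ET

CHI = "http://χίμαιραλ.com#"  # namespace URI taken from the listings

# Hypothetical registry: one parser function per supported media type.
PARSERS = {"application/json": json.loads}


def expand_notations(xml_text):
    """Parse the content of every χ:notation element with the matching parser."""
    root = ET.fromstring(xml_text)
    results = []
    for notation in root.iter("{%s}notation" % CHI):
        mediatype = notation.get("mediatype")
        parser = PARSERS.get(mediatype)
        if parser is None:
            raise ValueError("no parser registered for %r" % mediatype)
        # The CDATA section is exposed as the text of the element.
        results.append(parser(notation.text))
    return results


sample = """<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:notation mediatype="application/json"><![CDATA[
{ "sales" : [ { "firstName" : "Sally", "age" : 27 } ] }
]]></χ:notation>
</χ:data-model>"""

print(expand_notations(sample))
```

Supporting another notation would only require registering another parser under its media type.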
Another direction would be to extend the syntax of XML itself. To do so, again, there
are many possibilities. The markup in XML is based on angle brackets and the distinction
between
the different XML productions is usually done through the character following the
bracket in the opening tags.
This principle leaves a lot of possibilities. For instance, maps could be identified
by the tags <{>
and </}>
to follow the characters used by XDM map
literals and JSON objects
:
<χ:data-model>
<{>
<χ:entry key="sales" keyType="string">
<{>
<χ:entry key="1" keyType="number">
<{>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Green</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">27</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">Sally</χ:atomic-value>
</χ:entry>
</}>
</χ:entry>
<χ:entry key="2" keyType="number">
<{>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Galley</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">41</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">Jim</χ:atomic-value>
</χ:entry>
</}>
</χ:entry>
</}>
</χ:entry>
<χ:entry key="accounting" keyType="string">
<{>
<χ:entry key="1" keyType="number">
<{>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Doe</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">23</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">John</χ:atomic-value>
</χ:entry>
</}>
</χ:entry>
<χ:entry key="2" keyType="number">
<{>
<χ:entry key="lastName" keyType="string">
<χ:atomic-value type="string">Smith</χ:atomic-value>
</χ:entry>
<χ:entry key="age" keyType="string">
<χ:atomic-value type="number">32</χ:atomic-value>
</χ:entry>
<χ:entry key="firstName" keyType="string">
<χ:atomic-value type="string">Mary</χ:atomic-value>
</χ:entry>
</}>
</χ:entry>
</}>
</χ:entry>
</}>
</χ:data-model>
Map entries are not ordered and in that respect they are similar to XML attributes.
We could use this similarity and use the character @
to identify map
entries:
<χ:data-model>
<{>
<@"sales" keyType="string">
<{>
<@"1" keyType="number">
<{>
<@"lastName" keyType="string">
<χ:atomic-value type="string">Green</χ:atomic-value>
</@"lastName">
<@"age" keyType="string">
<χ:atomic-value type="number">27</χ:atomic-value>
</@"age">
<@"firstName" keyType="string">
<χ:atomic-value type="string">Sally</χ:atomic-value>
</@"firstName">
</}>
</@"1">
<@"2" keyType="number">
<{>
<@"lastName" keyType="string">
<χ:atomic-value type="string">Galley</χ:atomic-value>
</@"lastName">
<@"age" keyType="string">
<χ:atomic-value type="number">41</χ:atomic-value>
</@"age">
<@"firstName" keyType="string">
<χ:atomic-value type="string">Jim</χ:atomic-value>
</@"firstName">
</}>
</@"2">
</}>
</@"sales">
<@"accounting" keyType="string">
<{>
<@"1" keyType="number">
<{>
<@"lastName" keyType="string">
<χ:atomic-value type="string">Doe</χ:atomic-value>
</@"lastName">
<@"age" keyType="string">
<χ:atomic-value type="number">23</χ:atomic-value>
</@"age">
<@"firstName" keyType="string">
<χ:atomic-value type="string">John</χ:atomic-value>
</@"firstName">
</}>
</@"1">
<@"2" keyType="number">
<{>
<@"lastName" keyType="string">
<χ:atomic-value type="string">Smith</χ:atomic-value>
</@"lastName">
<@"age" keyType="string">
<χ:atomic-value type="number">32</χ:atomic-value>
</@"age">
<@"firstName" keyType="string">
<χ:atomic-value type="string">Mary</χ:atomic-value>
</@"firstName">
</}>
</@"2">
</}>
</@"accounting">
</}>
</χ:data-model>
The key names have been enclosed in quotes because map keys can include any character,
including whitespace, but the quotes could be made optional when they are not needed. We
could
also give the keyType attribute a default value of
"string":
<χ:data-model>
<{>
<@sales>
<{>
<@1 keyType="number">
<{>
<@lastName>
<χ:atomic-value type="string">Green</χ:atomic-value>
</@lastName>
<@age>
<χ:atomic-value type="number">27</χ:atomic-value>
</@age>
<@firstName>
<χ:atomic-value type="string">Sally</χ:atomic-value>
</@firstName>
</}>
</@1>
<@2 keyType="number">
<{>
<@lastName>
<χ:atomic-value type="string">Galley</χ:atomic-value>
</@lastName>
<@age>
<χ:atomic-value type="number">41</χ:atomic-value>
</@age>
<@firstName>
<χ:atomic-value type="string">Jim</χ:atomic-value>
</@firstName>
</}>
</@2>
</}>
</@sales>
<@accounting>
<{>
<@1 keyType="number">
<{>
<@lastName>
<χ:atomic-value type="string">Doe</χ:atomic-value>
</@lastName>
<@age>
<χ:atomic-value type="number">23</χ:atomic-value>
</@age>
<@firstName>
<χ:atomic-value type="string">John</χ:atomic-value>
</@firstName>
</}>
</@1>
<@2 keyType="number">
<{>
<@lastName>
<χ:atomic-value type="string">Smith</χ:atomic-value>
</@lastName>
<@age>
<χ:atomic-value type="number">32</χ:atomic-value>
</@age>
<@firstName>
<χ:atomic-value type="string">Mary</χ:atomic-value>
</@firstName>
</}>
</@2>
</}>
</@accounting>
</}>
</χ:data-model>
Atomic values could be identified by <=>
and </=>
with the same default value ("string") applied to their type
attribute:
<χ:data-model>
<{>
<@sales>
<{>
<@1 keyType="number">
<{>
<@lastName>
<=>Green</=>
</@lastName>
<@age>
<= type="number">27</=>
</@age>
<@firstName>
<=>Sally</=>
</@firstName>
</}>
</@1>
<@2 keyType="number">
<{>
<@lastName>
<=>Galley</=>
</@lastName>
<@age>
<= type="number">41</=>
</@age>
<@firstName>
<=>Jim</=>
</@firstName>
</}>
</@2>
</}>
</@sales>
<@accounting>
<{>
<@1 keyType="number">
<{>
<@lastName>
<=>Doe</=>
</@lastName>
<@age>
<= type="number">23</=>
</@age>
<@firstName>
<=>John</=>
</@firstName>
</}>
</@1>
<@2 keyType="number">
<{>
<@lastName>
<=>Smith</=>
</@lastName>
<@age>
<= type="number">32</=>
</@age>
<@firstName>
<=>Mary</=>
</@firstName>
</}>
</@2>
</}>
</@accounting>
</}>
</χ:data-model>
The tags that surround atomic values are useful when these values are within a sequence
but look superfluous when the entry holds a single value. The next step could be to
allow,
as a shortcut in that case, the value and its type attribute to be included directly
in the
entry:
<χ:data-model>
<{>
<@sales>
<{>
<@1 keyType="number">
<{>
<@lastName>Green</@lastName>
<@age type="number">27</@age>
<@firstName>Sally</@firstName>
</}>
</@1>
<@2 keyType="number">
<{>
<@lastName>Galley</@lastName>
<@age type="number">41</@age>
<@firstName>Jim</@firstName>
</}>
</@2>
</}>
</@sales>
<@accounting>
<{>
<@1 keyType="number">
<{>
<@lastName>Doe</@lastName>
<@age type="number">23</@age>
<@firstName>John</@firstName>
</}>
</@1>
<@2 keyType="number">
<{>
<@lastName>Smith</@lastName>
<@age type="number">32</@age>
<@firstName>Mary</@firstName>
</}>
</@2>
</}>
</@accounting>
</}>
</χ:data-model>
XPath
The χίμαιραλ serialization being XML, it is possible to use XPath path expressions
to query its structure. For instance, to get a list of employees who are under
30, we can
write:
χ:map/χ:entry/χ:map/χ:entry/χ:map[χ:entry[@key='age'][χ:atomic-value < 30]]
Or, if we're feeling lucky:
//χ:map[χ:entry[@key='age'][χ:atomic-value < 30]]
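For readers who want to try the second query, here is a hedged Python sketch using the standard xml.etree.ElementTree module. ElementTree supports only a subset of XPath and has no numeric comparison, so the age test is done in Python. The function name young_employees and the reduced sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NS = {"χ": "http://χίμαιραλ.com#"}  # prefix binding used in the path expressions


def young_employees(root, limit=30):
    """Return the χ:map elements that have an 'age' entry below limit."""
    matches = []
    for candidate in root.iterfind(".//χ:map", NS):
        age = candidate.find("χ:entry[@key='age']/χ:atomic-value", NS)
        # The numeric comparison is done here, outside the path expression.
        if age is not None and float(age.text) < limit:
            matches.append(candidate)
    return matches


sample = """<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
<χ:map>
<χ:entry key="sales" keyType="string">
<χ:map>
<χ:entry key="1" keyType="number">
<χ:map>
<χ:entry key="age" keyType="string"><χ:atomic-value type="number">27</χ:atomic-value></χ:entry>
<χ:entry key="firstName" keyType="string"><χ:atomic-value type="string">Sally</χ:atomic-value></χ:entry>
</χ:map>
</χ:entry>
<χ:entry key="2" keyType="number">
<χ:map>
<χ:entry key="age" keyType="string"><χ:atomic-value type="number">41</χ:atomic-value></χ:entry>
<χ:entry key="firstName" keyType="string"><χ:atomic-value type="string">Jim</χ:atomic-value></χ:entry>
</χ:map>
</χ:entry>
</χ:map>
</χ:entry>
</χ:map>
</χ:data-model>"""

root = ET.fromstring(sample)
names = [m.find("χ:entry[@key='firstName']/χ:atomic-value", NS).text
         for m in young_employees(root)]
print(names)
```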
Again, that's good as long as we work on a χίμαιραλ serialization but it would be good
to be able to use path expressions directly on map data structures. To do so we would need, at a minimum, to
define steps to match maps and entries.
XSLT 3.0 introduces a new map()
item type which could be used as a kind test to identify maps.
If we follow the idea that map entries are similar to XML attributes, we could use
the @
notation to identify them. The XPath expression would then
become:
map()/@*/map()/@*/map()[@age < 30]
Or, if we're feeling lucky:
//map()[@age < 30]
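Until such steps exist, the intent of the last expression can be approximated on native structures. The Python sketch below assumes that maps are dictionaries and that sequences are lists (whereas χίμαιραλ models arrays as maps keyed by position). The name descendant_maps is hypothetical.

```python
def descendant_maps(value):
    """Yield every map (dict) found in value, at any depth."""
    if isinstance(value, dict):
        yield value
        for item in value.values():
            yield from descendant_maps(item)
    elif isinstance(value, list):
        for item in value:
            yield from descendant_maps(item)


data = {
    "accounting": [
        {"firstName": "John", "lastName": "Doe", "age": 23},
        {"firstName": "Mary", "lastName": "Smith", "age": 32},
    ],
    "sales": [
        {"firstName": "Sally", "lastName": "Green", "age": 27},
        {"firstName": "Jim", "lastName": "Galley", "age": 41},
    ],
}

# Rough equivalent of a descendant step on maps followed by a test on "age".
young = [m for m in descendant_maps(data)
         if isinstance(m.get("age"), (int, float)) and m["age"] < 30]
print([m["firstName"] for m in young])
```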
Validation
These data models can be complex. Wouldn't it be useful to be able to validate them
with schema languages? This would give us a way to validate JSON maps!
Of course we can already serialize them in χίμαιραλ and validate the serialization
using any schema language, but again it would be good to be able to validate these
structures
directly.
A RELAX NG schema to validate the χίμαιραλ serialization of our example would
be:
namespace χ = "http://χίμαιραλ.com#"
start = element χ:data-model { top-level-map }
# Top level map: departments
top-level-map =
element χ:map {
element χ:entry {
attribute key { xsd:NMTOKEN },
attribute keyType { "string" },
emp-array
}*
}
# List of employees
emp-array =
element χ:map {
element χ:entry {
attribute key { xsd:positiveInteger },
attribute keyType { "number" },
emp-map
}*
}
# Description of an employee
emp-map = element χ:map { (age | firstName | lastName) + }
age =
element χ:entry {
attribute key { "age" },
attribute keyType { "string" },
element χ:atomic-value {
attribute type { "number" },
xsd:positiveInteger
}
}
firstName =
element χ:entry {
attribute key { "firstName" },
attribute keyType { "string" },
element χ:atomic-value {
attribute type { "string" },
xsd:token
}
}
lastName =
element χ:entry {
attribute key { "lastName" },
attribute keyType { "string" },
element χ:atomic-value {
attribute type { "string" },
xsd:token
}
}
Note
In the description of the maps used to describe employees, we cannot use interleave
patterns: RELAX NG forbids element patterns with overlapping names in different branches of an interleave, and every entry here is a χ:entry element. The schema is therefore only approximate. In this specific case, we could
enumerate the six possible combinations but the exercise would quickly become verbose
if the number of items
grew:
emp-map = element χ:map {
(age, firstName, lastName)
| (age, lastName, firstName)
| (firstName, age, lastName)
| (firstName, lastName, age)
| (lastName, age, firstName)
| (lastName, firstName, age)
}
A Schematron schema for the χίμαιραλ serialization could be developed based on XPath
expressions similar to those that have been shown in the previous section.
Again, it would be interesting to support maps directly as first class citizens in
XML schema languages.
The ability to use Schematron on XDM maps depends directly on the ability to browse
maps using patterns and path expressions in XPath and XSLT (see above)...
The main impact on RELAX NG would be to add map
and entry
patterns and the schema could look
like:
namespace χ = "http://χίμαιραλ.com#"
start = element χ:data-model { top-level-map }
# Top level map: departments
top-level-map =
map {
entry xsd:NMTOKEN {
emp-array
}*
}
# List of employees
emp-array =
map {
entry xsd:positiveInteger {
emp-map
}*
}
# Description of an employee
emp-map = map { age, firstName, lastName }
age =
entry age {
xsd:positiveInteger
}
firstName =
entry firstName {
xsd:token
}
lastName =
entry lastName {
xsd:token
}
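The constraints that such a schema would express on native maps can be sketched procedurally in Python. This is only an illustration under stated assumptions: maps are dictionaries, the employee list is a Python list rather than a map keyed by position, the xsd:NMTOKEN check on department keys is omitted, and the validate function is a hypothetical name.

```python
EMPLOYEE_KEYS = {"age", "firstName", "lastName"}


def validate(data):
    """Return a list of error messages; an empty list means the data is valid."""
    if not isinstance(data, dict):
        return ["top level: expected a map of departments"]
    errors = []
    for department, employees in data.items():
        if not isinstance(employees, list):
            errors.append("%s: expected a list of employees" % department)
            continue
        for position, employee in enumerate(employees, start=1):
            where = "%s[%d]" % (department, position)
            if not isinstance(employee, dict):
                errors.append("%s: expected a map" % where)
                continue
            # Entries may come in any order, but all three are required.
            if set(employee) != EMPLOYEE_KEYS:
                errors.append("%s: missing or unexpected keys" % where)
                continue
            age = employee["age"]
            if isinstance(age, bool) or not isinstance(age, int) or age <= 0:
                errors.append("%s: age must be a positive integer" % where)
            for key in ("firstName", "lastName"):
                if not isinstance(employee[key], str):
                    errors.append("%s: %s must be a string" % (where, key))
    return errors


good = {"sales": [{"firstName": "Jim", "lastName": "Galley", "age": 41}]}
bad = {"sales": [{"firstName": "Jim", "age": -1}]}
print(validate(good))
print(validate(bad))
```

Note that the unordered test on the key set is exactly what the interleave restriction prevents us from expressing on the χίμαιραλ serialization.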
Sequences could probably be supported without adding a new pattern, but this would require
relaxing some restrictions to allow the description of sequences mixing atomic values,
maps and
nodes (in RELAX NG, sequences of atomic values are already possible in list datatypes, and
sequences of nodes are of course available to describe node contents, but these two
types of
sequences cannot be mixed).
Conclusion
According to the definition of chimeras in genetics from Wikipedia quoted in the introduction,
chimeras are formed from at least four parent cells (two fertilized eggs or early
embryos fused together). Each population of cells keeps its own character and the
resulting organism is a mixture of tissues.
The current XDM proposals have added to the XML data model a foreign model to represent
maps. This new model is a superset of the JSON data model. The two data models keep
their own
character and the resulting model is a mixture of information items.
It's fair to say that the current XDM proposal is a chimera, something that Jeni Tennison
describes as usually ugly, foolish or impossible fantasies.
I hope that the proposals sketched in this paper will help to address this situation
and fully integrate these new information items in the XML ecosystem.