Balisage: Fleshing the XDM chimera

Motivation

Chimera (mythology): The Chimera (also Chimaera or Chimæra) (Greek: Χίμαιρα, Khimaira, from χίμαρος, khimaros, "she-goat") was, according to Greek mythology, a monstrous fire-breathing female creature of Lycia in Asia Minor, composed of the parts of multiple animals: upon the body of a lioness with a tail that ended in a snake's head, the head of a goat arose on her back at the center of her spine. The Chimera was one of the offspring of Typhon and Echidna and a sibling of such monsters as Cerberus and the Lernaean Hydra. The term chimera has also come to describe any mythical animal with parts taken from various animals and, more generally, an impossible or foolish fantasy.

— Wikipedia

Chimera (genetics): A chimera or chimaera is a single organism (usually an animal) that is composed of two or more different populations of genetically distinct cells that originated from different zygotes involved in sexual reproduction. If the different cells have emerged from the same zygote, the organism is called a mosaic. Chimeras are formed from at least four parent cells (two fertilized eggs or early embryos fused together). Each population of cells keeps its own character and the resulting organism is a mixture of tissues.

— Wikipedia

During her opening keynote at XML Prague 2012, speaking about the relation between XML, HTML, JSON and RDF, Jeni Tennison warned us against the temptation to create chimeras: chimera are usually ugly, foolish or impossible fantasies.

The next morning, Michael Kay and Jonathan Robie came to present new features in XPath/XQuery/XSLT 3.0. A lot of these features are directly based on the XQuery and XPath Data Model 3.0 (aka XDM):

The XPath Data Model is the abstraction over which XPath expressions are evaluated. Historically, all of the items in the data model could be derived directly (nodes) or indirectly (typed values, sequences) from an XML document. However, as the XPath expression language has matured, new features have been added which require additional types of items to appear in the data model. These items have no direct XML serialization, but they are never the less part of the data model.

XDM 3.0 is composed of items from a number of different technologies:

Items from the XML Infoset (nodes, attributes, ...)
Datatype information borrowed from the Post Schema Validation Infoset
Sequences
Atomic values
Functions that can also be used to model JSON arrays

Note

The feature that will be introduced to model JSON arrays is called "maps" and it will be specified as a XSLT feature in the XSLT 3.0 recommendation (not published yet). The XSLT 3.0 editor, Michael Kay has published an early version of this feature in his blog. In this paper, XDM 3.0 will refer to the XSLT 3.0 data model (the XPath 3.0 data model augmented with maps).

XDM 3.0 being a single data model composed of items from different data models, it is fair to say that it is a chimera!

Following Jeni Tennison on stage, I have tried to show that in a world where HTML 5 on one hand and JSON on the other hand are gaining traction, XML has become an ecosystem in a competitive environment and that it's data model is a major competitive advantage.

Among other factors, the continued success of XML will thus come from its ability to seamlessly integrate other data models such as JSON.

If we follow this conclusion, we must admit that this chimera is essential to the future of XML and do our best to make it elegant and smart.

XML Data Models

Whether it's a bug or a feature could be debated endlessly, but a remarkable feature of the XML recommendation it's all about syntax and parsing rule and does not really define a data model. The big advantage is that everyone can find pretty much what he wants in XML documents but for the sake of this paper we need to choose a well known -and well defined- data model to work on.

The most common XML data model is probably the data model defined by the trio XPath/XSLT/XQuery known as "XDM" since XPath version 2.0 and that's the one we will choose.

XDM version 3.0, still work in progress, will be the third version of this data model. It's important to understand its design and evolution to use its most advanced features and we'll start our prospective by a short history of its versions.

XPath/XSLT 1.0

The XPath 1.0 data model is described as being composed of seven types of nodes (root, elements, text, attributes, namespaces, processing instructions and comments).

The XSLT 1.0 data model is defined as being the XPath 1.0 data model with:

Relaxed constraints on root node children to support well-formed external general parsed entities that are not well formed XML documents
An additional "base URI" property on every node.
An additional "unparsed entities" property on the root node.

It's fair to say that these two -very close- data models are completely focused on XML, but is that all?

Not entirely and these two specifications introduce other notions that should be considered as related to the data model even if they are not described in their sections called "Data Model"...

XSLT 1.0 inadvertently mentions the four basic XPath data-types (string, number, boolean, node-set) to explicitly add a fifth one: result tree fragments.

These four basic data-types are implicitly defined in XPath 1.0 in its section about its function library but no formal description of these types is given.

XDML 2.0: XPath 2.0/XSLT 2.0/XQuery 1.0

In version 2.0, the XDM is promoted to get its own specification.

XDM 2.0 keeps the same seven types of nodes as XPath 1.0 and integrates the additions from the XSLT 1.0 data model. A number of properties are added to these nodes to capture information that had been left outside the data model by the previous version and also to support the data-type system from the PSVI (Post Schema Validation Infoset).

The term "data-type" or simply "type" being now used to refer to XML Schema data-types, a new terminology is introduced where the data model is composed of "information items" (or items) being either XML nodes or "atomic values".

The concept of "sequences" is also introduced. Sequences are not strictly considered as items but play a very important role in XDM. They are defined as an ordered collection of zero or more items.

The data model is thus now composed of three different concepts:

nodes
atomic values
sequences

XDM 2.0 notes that an important difference between nodes and atomic values is that only nodes have identities:

Each node has a unique identity. Every node in an instance of the data model is unique: identical to itself, and not identical to any other node. (Atomic values do not have identity; every instance of the value “5” as an integer is identical to every other instance of the value “5” as an integer.)

— XQuery 1.0 and XPath 2.0 Data Model (XDM) (Second Edition)

This is a crucial distinction that divides the data model into two different kind of items (those which have an identity and those which haven't one). Let's take an example:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <foo>5</foo>
    <foo>5</foo>
    <bar foo="5">
        <foo>5</foo>
    </bar>
</root>

The three <foo>5</foo> look similar and can be considered "deeply equal" but they are three different elements with three different identities. This is needed because some of their properties are different: the parent of the first two is <root/> while the parent of the third one is <bar/>, the preceding sibling of the second one is the first one while the first one has no preceeding sibling, ...

The three "5" text nodes are similar but they still are different text nodes with different identities and this is necessary because they don't have the same parent elements.

By contrast, the atomic values of the three <foo/> element (and the atomic value of the @foo attribute) are the same atomic value, the "5" (assuming they have all been declared with the same datatype). Among many other things, this means that when you manipulate their values, you can't access back to the node that is holding the value).

XDM 3.0: XPath 3.0/XSLT 3.0/XQuery 3.0

Note

These specifications are still work on progress, currently divided between XQuery and XPath Data Model 3.0 and data model extensions described in XSL Transformations (XSLT) Version 3.0.

XDM 3.0 adds functions as a third kind of items, transforming XQuery and XSLT into functional languages.

Like atomic values, functions have no identity:

Functions have no identity, cannot be compared, and have no serialization.

— XQuery and XPath Data Model 3.0 - W3C Working Draft 13 December 2011

XSLT 3.0 adds to XDM 3.0 a fourth king of items: maps, derived from functions which, among many other use cases, can be used to model JSON objects:

Like atomic values and functions (from which they are derived), maps have no identity:

Like sequences, maps have no identity. It is meaningful to compare the contents of two maps, but there is no way of asking whether they are "the same map": two maps with the same content are indistinguishable.

— XSL Transformations (XSLT) Version 3.0 - W3C Working Draft 10 July 2012

Note

In this statement, the specification does acknowledge that sequences have no identity either. This is understandable but didn't seem to be clearly specified elsewhere.

Of course, XSLT 3.0 is also adding functions to create, manipulate maps and serialize/deserialize them as JSON and a syntax to define map literals. It does not any new pattern to select of match maps or map entries, though.

Identity Crisis

Appolonius' ship is a beautiful ship. Over the years it has been repaired so many times that there is not a single piece of the original materials remaining. The question is, therefore, is it really still Appolonius' ship?

— ObjectIdentity on c2.com

Object identity is often confused with mutability. The need for objects to have identities is more obvious when they are mutable, their identities being then used to track them despite their changes like Appolonius' ship. However, XDM 3.0 gives us a good opportunity to explore the meaning and consequences of having (or not having) an identity for immutable object structures.

The definition of node identity in XDM 3.0 is directly copied from XDM 2.0:

Each node has a unique identity. Every node in an instance of the data model is unique: identical to itself, and not identical to any other node. (Atomic values do not have identity; every instance of the value “5” as an integer is identical to every other instance of the value “5” as an integer.)

— XQuery and XPath Data Model 3.0 - W3C Working Draft 13 December 2011

I find this definition confusing:

Why should the value “5” as an integer be instantiated and why should we care? The value “5” as an integer is... the value “5” as an integer! It's unique and being unique, doesn't it have an identity?
A node, with all the properties defined in XDM (including its document-uri and parent accessors) would be unique if it had "previous-sibling" or "document-order" accessors.

Note

To find the previous siblings of a node relying only on the accessors defined in XDM (2.0 or 3.0), you'd have to access to the node's parent and loop over it's children until you find the current node that you would identify as such by checking its identity.

Rather than focussing on uniqueness, which for immutable information items does not really matter, a better differentiation could be between information items which have enough context information to "know where they belong" in the data model and those which don't.

This differentiation has the benefit of highlighting the consequences of having or not having an identity: to be able to navigate between an information item and its ancestors or sibling this item must know where it belongs. When that's not the case, it is still be possible to navigate between the item and its descendants but axis such as ancestor:: or sibling:: are not available.

Note

Identity can be seen as the price to pay for the ancestor:: and sibling:: axis.

Let's take back a simple example:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <foo>5</foo>
    <foo>5</foo>
    <bar> 
        <foo>5</foo>
    </bar>
</root>

In an hypothetical data model where nodes have no identity, there would be only 3 elements:

The root element
The bar element
The foo element (referred twice has children of root end once as child of bar)

If we add identity (or context information) properties, the foo elements become three information different items since they defer by these properties.

The process of adding these properties to an information item looks familiar. Depending on your background, you can compare it to:

class/object instantiation in class based Object Oriented Programming
clones in prototype based Object Oriented Programming
RDF reification.

We've seen that XDM 3.0 acknowledges this difference between information items which have context information and those which don't have. I don't want to deny that both types of data models have their use cases: there are obviously many use cases where context information is needed and use cases where lightweight structures are a better fit.

That being said, if we are serious about the support of JSON in XDM, we should offer the same features to access data whether this data is stored in maps or in XML nodes.

Let's consider this JSON object borrowed from the XSLT 3.0 Working Draft:

{ "accounting" : [ 
      { "firstName" : "John", 
        "lastName"  : "Doe",
        "age"       : 23 },
      
      { "firstName" : "Mary", 
        "lastName"  : "Smith",
        "age"       : 32 }
                 ],                                 
  "sales"     : [ 
      { "firstName" : "Sally", 
        "lastName"  : "Green",
        "age"       : 27 },
      
      { "firstName" : "Jim",  
        "lastName"  : "Galley",
        "age"       : 41 }
                  ]
}

This object could be represented in XML by the following document:

<?xml version="1.0" encoding="UTF-8"?>
<company>
    <department name="sales">
        <employee>
            <firstName>Sally</firstName>
            <lastName>Green</lastName>
            <age>27</age>
        </employee>
        <employee>
            <firstName>Jim</firstName>
            <lastName>Galley</lastName>
            <age>41</age>
        </employee>
    </department>
    <department name="accounting">
        <employee>
            <firstName>John</firstName>
            <lastName>Doe</lastName>
            <age>23</age>
        </employee>
        <employee>
            <firstName>Mary</firstName>
            <lastName>Smith</lastName>
            <age>32</age>
        </employee>
    </department>
</company>

The features introduced in the latest XSLT 3.0 Working Draft do allow to transform rather easily from one model to the other, but these two models do not have, bar far, the same features.

In the XML flavor, when the context item is the employee "John Doe", you can easily find out what his department is because this is an element and element do carry context information. In the map flavor by contrast when the context item is an employee map, this object has no context information and you can't tell which is his department without looping within the containing map.

This important restriction is at a purely data model level. It is aggravated by the XPath syntax has not been extended to generalize axis so that they can work with maps. If I work with the XML version of this structure, it's obvious to evaluate things such as the number of employees, the average age of employees, the number of departments, the number of employees by department, the average age by department, obvious to find out if there is an employee called "Mary Smith" in one of the departments, the employees who are more than 40, to get a list of employees from all the department sorted by age, ... In the map flavor by contrast, I don't have any XPath axis available and must do all these operations using a limited number of map functions (map:keys(), map:contains(), map:get()). In other words, while I can use XPath expressions with the XML version, I must use DOM like operations to access the map version!

To summarize, yes XDM 3.0 does support JSON but to do pretty much anything interesting with JSON objects, you'd better transform them into XML nodes first! XSLT 3.0 does give you the tools to do this transformation quite easily but the message to JSON users is that we don't treat their data model as a first class citizen.

To make it worse, XPath is used by many other specifications, within and outside the W3C and the level of support for JSON provided by XDM and XPath will determine how these specifications will be able to support for JSON. Specifications that are impacted by this issue include XForms, XProc and Schematron. Supporting JSON would be really useful for these three specifications if and only if map items could have the same features than nodes.

Furthermore, the same asymmetry exists when you went to create these two structures from other sources: to create the XML structure you can use sequence constructors but to create the map structure, you have to use the map:new() and map:item() functions.

My proposal to solve this issue is:

To acknowledge the fact that any type of information item can be either "context independent" or include context information and explore the consequences of this statement.
To generalize XPath axis so that they can be used with map items.
To create sequence constructors for maps and map entries.

You are welcome to discuss this further:

W3C XSLT 3.0 Bug 16118, "Maps should be first class citizens"
Blog entry: XDM Maps should be first class citizens.
Blog entry: More musings on XDM 3.0

Introducing χίμαιραλ (chimeral), the Chimera Language

When I started to work on χίμαιραλ a few months ago, my first motivation was to propose an XDM serialization for maps which would turn the rather abstract prose from the specification into concrete angle brackets that you could see and read.

The exercise has been very instructive and helped me a lot to understand the spec, however a more ambitious use pattern has emerged while I was making progress. The XSLT 3.0 Working Draft is part of a batch of Working Drafts which are far more advanced. My proposals to solve the "map identity crisis" are probably too intrusive and too late to be taken into account and the batch of specifications will most probably carry on with the current proposal.

If that's the case, we've seen that it makes a lot of sense to convert maps into nodes to enable to use XPath axis and χίμαιραλ provides a generic target format for these conversions.

Example

Let's take again the JSON object borrowed from the XSLT 3.0 Working Draft:

{ "accounting" : [ 
      { "firstName" : "John", 
        "lastName"  : "Doe",
        "age"       : 23 },
      
      { "firstName" : "Mary", 
        "lastName"  : "Smith",
        "age"       : 32 }
                 ],                                 
  "sales"     : [ 
      { "firstName" : "Sally", 
        "lastName"  : "Green",
        "age"       : 27 },
      
      { "firstName" : "Jim",  
        "lastName"  : "Galley",
        "age"       : 41 }
                  ]
}

Its χίμαιραλ representation is:

<?xml version="1.0" encoding="UTF-8"?>
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
    <χ:map>
        <χ:entry key="sales" keyType="string">
            <χ:map>
                <χ:entry key="1" keyType="number">
                    <χ:map>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Green</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">27</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">Sally</χ:atomic-value>
                        </χ:entry>
                    </χ:map>
                </χ:entry>
                <χ:entry key="2" keyType="number">
                    <χ:map>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Galley</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">41</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">Jim</χ:atomic-value>
                        </χ:entry>
                    </χ:map>
                </χ:entry>
            </χ:map>
        </χ:entry>
        <χ:entry key="accounting" keyType="string">
            <χ:map>
                <χ:entry key="1" keyType="number">
                    <χ:map>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Doe</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">23</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">John</χ:atomic-value>
                        </χ:entry>
                    </χ:map>
                </χ:entry>
                <χ:entry key="2" keyType="number">
                    <χ:map>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Smith</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">32</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">Mary</χ:atomic-value>
                        </χ:entry>
                    </χ:map>
                </χ:entry>
            </χ:map>
        </χ:entry>
    </χ:map>
</χ:data-model>

Granted, it's much more verbose than the JSON version, but it's the exact translation of the XDM corresponding to the JSON object in XML.

χίμαιραλ In a Nutshell

The design goals are:

Be as close as possible to the XDM and its terminology
Represent XML nodes as... XML nodes
Allow round-trips (an XDM model serialized as χίμαιραλ should give a XDM model identical to the original one when de-serialized)
Be easy to process using XPath/XQuery/XSLT
Support of the PSVI is not a goal

χίμαιραλ is not the only proposal to serialize XDM as XML. Two other notable ones are:

Zorba's XDM serialization is a straight and accurate XDM serialization which does support PSVI annotations. As a consequence, nodes are serialized as xdm:* elements (an element is an xdm:element, an attribute an xdm:attribute element, ...). This does'n meet by second requirement to represent nodes as themselves.
XDML, presented by Rennau, Hans-Jürgen, and David A. Lee at Balisage 2011 is more than just an XDM serialization and also includes manipulation and processing definitions. It introduces its own terminology and concepts and is too far away from XDM for my design goals.

A lot of attention has been given to the first design goal: the structure of a χίμαιραλ model and the name of its elements and attributes are directly derived from the specifications.

In XDM, map entries' values can be arrays (an array beeing nothing else than a map with integer keys) but also sequences (which is not possible in JSON). χίμαιραλ respects the fact that in XDM there is no difference between a sequence composed of a single element and represents sequences by a repetition of values.

The map map{1:= 'foo'} is serialized as:

<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
   <χ:map>
      <χ:entry key="1" keyType="number">
         <χ:atomic-value type="string">foo</χ:atomic-value>
      </χ:entry>
   </χ:map>
</χ:data-model>

And the map map{1:= ('foo', 'bar')} is serialized as:

<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
   <χ:map>
      <χ:entry key="1" keyType="number">
         <χ:atomic-value type="string">foo</χ:atomic-value>
         <χ:atomic-value type="string">bar</χ:atomic-value>
      </χ:entry>
   </χ:map>
</χ:data-model>

We've seen that XDM makes a clear distinction between nodes which have identities and other item types (atomic values, functions and maps) which haven't. XDM allows to use nodes as map entry values. χίμαιραλ allows this feature too, but copying the nodes would create new nodes with different identities.

To avoid that, documents to which these nodes belong are copied into χ:instance elements and references between map entries values and instances are made using XPath expressions.

The following $map variable:

<xsl:variable name="a-node">
    <foo/>
</xsl:variable>
<xsl:variable name="map" select="map{'a-node':= $a-node}"/>

Is serialized as:

<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
   <χ:instance id="d4" kind="document">
      <foo/>
   </χ:instance>
   <χ:map>
      <χ:entry key="a-node" keyType="string">
         <χ:node kind="document" instance="d4" path="/"/>
      </χ:entry>
   </χ:map>
</χ:data-model>

Like XSLT variable, instances do not always contain document nodes and the following $map variable:

<xsl:variable name="a-node" as="node()">
    <foo/>
</xsl:variable>
<xsl:variable name="map" select="map{'a-node':= $a-node}"/>

Is serialized as:

<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
   <χ:instance id="d4e0" kind="fragment">
      <foo/>
   </χ:instance>
   <χ:map>
      <χ:entry key="a-node" keyType="string">
         <χ:node kind="element" instance="d4e0" path="root()" name="foo"/>
      </χ:entry>
   </χ:map>
</χ:data-model>

Nodes can belong to more than one instances, and this $map variable:

<xsl:variable name="a-node" as="node()*">
    <foo/>
    <bar/>
</xsl:variable>
<xsl:variable name="map" select="map{'a-node':= $a-node}"/>

Is serialized as:

<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
   <χ:instance id="d4e0" kind="fragment">
      <foo/>
   </χ:instance>
   <χ:instance id="d4e3" kind="fragment">
      <bar/>
   </χ:instance>
   <χ:map>
      <χ:entry key="a-node" keyType="string">
         <χ:node kind="element" instance="d4e0" path="root()" name="foo"/>
         <χ:node kind="element" instance="d4e3" path="root()" name="bar"/>
      </χ:entry>
   </χ:map>
</χ:data-model>

Nodes can be "deep linked", a same node can be linked several times and nodes can be mixed with atomic values at wish. The following $map variable:

<xsl:variable name="doc">
    <department name="sales">
        <employee>
            <firstName>Sally</firstName>
            <lastName>Green</lastName>
            <age>27</age>
        </employee>
        <employee>
            <firstName>Jim</firstName>
            <lastName>Galley</lastName>
            <age>41</age>
        </employee>
    </department>
    <department name="accounting">
        <employee>
            <firstName>John</firstName>
            <lastName>Doe</lastName>
            <age>23</age>
        </employee>
        <employee>
            <firstName>Mary</firstName>
            <lastName>Smith</lastName>
            <age>32</age>
        </employee>
    </department>
</xsl:variable>
<xsl:variable name="map"
    select="map{
            'sales' := $doc/department[@name='sales'],
            'Sally' := $doc//employee[firstName = 'Sally'],
            'kids'  := $doc//employee[age &lt; 30],
            'dep-names-attributes' := $doc/department/@name,
            'dep-names' := for $name in $doc/department/@name return string($name)
            }"/>

Is serialized as:

<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
   <χ:instance id="d4" kind="document">
      <department name="sales">
         <employee>
            <firstName>Sally</firstName>
            <lastName>Green</lastName>
            <age>27</age>
         </employee>
         <employee>
            <firstName>Jim</firstName>
            <lastName>Galley</lastName>
            <age>41</age>
         </employee>
      </department>
      <department name="accounting">
         <employee>
            <firstName>John</firstName>
            <lastName>Doe</lastName>
            <age>23</age>
         </employee>
         <employee>
            <firstName>Mary</firstName>
            <lastName>Smith</lastName>
            <age>32</age>
         </employee>
      </department>
   </χ:instance>
   <χ:map>
      <χ:entry key="sales" keyType="string">
         <χ:node kind="element"
                 instance="d4"
                 path="/&#34;&#34;:department[1]"
                 name="department"/>
      </χ:entry>
      <χ:entry key="Sally" keyType="string">
         <χ:node kind="element"
                 instance="d4"
                 path="/&#34;&#34;:department[1]/&#34;&#34;:employee[1]"
                 name="employee"/>
      </χ:entry>
      <χ:entry key="kids" keyType="string">
         <χ:node kind="element"
                 instance="d4"
                 path="/&#34;&#34;:department[1]/&#34;&#34;:employee[1]"
                 name="employee"/>
         <χ:node kind="element"
                 instance="d4"
                 path="/&#34;&#34;:department[2]/&#34;&#34;:employee[1]"
                 name="employee"/>
      </χ:entry>
      <χ:entry key="dep-names-attributes" keyType="string">
         <χ:node kind="attribute"
                 instance="d4"
                 path="/&#34;&#34;:department[1]/@name"
                 name="name">sales</χ:node>
         <χ:node kind="attribute"
                 instance="d4"
                 path="/&#34;&#34;:department[2]/@name"
                 name="name">accounting</χ:node>
      </χ:entry>
      <χ:entry key="dep-names" keyType="string">
         <χ:atomic-value type="string">sales</χ:atomic-value>
         <χ:atomic-value type="string">accounting</χ:atomic-value>
      </χ:entry>
   </χ:map>
</χ:data-model>

Remaining Issues

A collation property should be added to <χ:map/>, probably as an attribute, the transformation to serialize to χίμαιραλ should be cleaned up and the reverse transformation should be implemented.

These are pretty trivial issues and the biggest one is probably to find a way to cleanly serialize references to nodes that are not contained within an element, such as the following $map variable:

<xsl:variable name="attribute" as="node()">
    <xsl:attribute name="foo">bar</xsl:attribute>
</xsl:variable>
<xsl:variable name="map"
    select="map{
            'attribute' := $attribute
            }"/>

Support of functions should also be considered.

χίμαιραλ and the identity crisis

To some extend, χίμαιραλ can be considered as a solution to the XDM identity crisis:

Serializing an XDM model as χίμαιραλ creates elements for maps, map entries and atomic values and these elements, being nodes, have identities. The serialization is therefore also an instantiation of XDM information items as defined above.
De-serializing a χίμαιραλ to create an XDM data model is also a de-instantiation-- except of course that the identity of XML nodes is not "removed".

However, χίμαιραλ does keep a strong difference between nodes which are kept in <χ:instance> elements and maps and atomic values.

Moving the chimera forward

χίμαιραλ is a good playground to explore the new possibilities offered by XDM 3.0. Here is a (non exhaustive) list of a few directions that seem interesting...

Note

Don't expect to find fully baked proposals in this section which contains, on the contrary very early drafts of ideas to follow to support XDM maps as "first class citizens"!

Embracing RDF

If you had the opportunity to enjoy the sunny weather of Orlando in December 2001, you may remember "The Syntactic Web" a provocative talk where Jonathan Robie has shown how XQuery 1.0 could be used to query normalized XML/RDF documents.

The gap between RDF triples and the versatility of its XML representation was a big issue, but the new features brought by this new version of the XPath/XQuery/XSLT package should help us.

The basic data model of RDF is based on triples, a triple being a composed of a subject, a predicate and an object. In XDM, a triple can now be represented by either a sequence, an array or a map of three items.

XDM sequences have the property that they cannot include other sequences and representing triples as sequences would mean that you couldn't define sequences of triples. For that reason it is probably better to define triples as maps or arrays. An array being a map indexed by integers, that doesn't make a huge difference at a conceptual level, but I find it cleaner to access to the subject of a triple using a QName (such as rdf:subject) rather than an index. Following this principle, we could define a triple as:

map {
    xs:QName('rdf:subject')   := xs:anyURI('http://www.example.org/index.html'),
    xs:QName('rdf:predicate') := xs:anyURI('http://purl.org/dc/elements/1.1/creator'),
    xs:QName('rdf:object')    := xs:anyURI('http://www.example.org/staffid/85740')
}

The χίμαιραλ serialization of this map is:

<χ:data-model xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:χ="http://χίμαιραλ.com#"
              xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
   <χ:map>
      <χ:entry key="rdf:object"
               keyType="xs:QName">
         <χ:atomic-value type="xs:anyURI">http://www.example.org/staffid/85740</χ:atomic-value>
      </χ:entry>
      <χ:entry key="rdf:predicate"
               keyType="xs:QName">
         <χ:atomic-value type="xs:anyURI">http://purl.org/dc/elements/1.1/creator</χ:atomic-value>
      </χ:entry>
      <χ:entry key="rdf:subject"
               keyType="xs:QName">
         <χ:atomic-value type="xs:anyURI">http://www.example.org/index.html</χ:atomic-value>
      </χ:entry>
   </χ:map>
</χ:data-model>

What can we do with such triples? Using higher order functions, it should not be too difficult to define triple stores with basic query features!

Is this lightweight enough? Or does RDF support deserve new information item types to be supported by XDM?

Syntactical sugar

We've seen that this JSON object

{ "accounting" : [ 
      { "firstName" : "John", 
        "lastName"  : "Doe",
        "age"       : 23 },
      
      { "firstName" : "Mary", 
        "lastName"  : "Smith",
        "age"       : 32 }
                 ],                                 
  "sales"     : [ 
      { "firstName" : "Sally", 
        "lastName"  : "Green",
        "age"       : 27 },
      
      { "firstName" : "Jim",  
        "lastName"  : "Galley",
        "age"       : 41 }
                  ]
}

Is serialized in χίμαιραλ as:

<?xml version="1.0" encoding="UTF-8"?>
<χ:data-model xmlns:χ="http://χίμαιραλ.com#">
    <χ:map>
        <χ:entry key="sales" keyType="string">
            <χ:map>
                <χ:entry key="1" keyType="number">
                    <χ:map>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Green</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">27</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">Sally</χ:atomic-value>
                        </χ:entry>
                    </χ:map>
                </χ:entry>
                <χ:entry key="2" keyType="number">
                    <χ:map>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Galley</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">41</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">Jim</χ:atomic-value>
                        </χ:entry>
                    </χ:map>
                </χ:entry>
            </χ:map>
        </χ:entry>
        <χ:entry key="accounting" keyType="string">
            <χ:map>
                <χ:entry key="1" keyType="number">
                    <χ:map>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Doe</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">23</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">John</χ:atomic-value>
                        </χ:entry>
                    </χ:map>
                </χ:entry>
                <χ:entry key="2" keyType="number">
                    <χ:map>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Smith</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">32</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">Mary</χ:atomic-value>
                        </χ:entry>
                    </χ:map>
                </χ:entry>
            </χ:map>
        </χ:entry>
    </χ:map>
</χ:data-model>

We can work with that, but wouldn't it be nice if we had a native syntax that does not use XML elements and attributes to represent maps?

Depending on the requirements, many approaches are possible.

A first option would be to define pluggable notation parsers within XML and write:

<χ:notation mediatype="application/json"><![CDATA[
{ "accounting" : [ 
      { "firstName" : "John", 
        "lastName"  : "Doe",
        "age"       : 23 },
      
      { "firstName" : "Mary", 
        "lastName"  : "Smith",
        "age"       : 32 }
                 ],                                 
  "sales"     : [ 
      { "firstName" : "Sally", 
        "lastName"  : "Green",
        "age"       : 27 },
      
      { "firstName" : "Jim",  
        "lastName"  : "Galley",
        "age"       : 41 }
                  ]
}                  
]]></χ:notation>

The meaning of the <χ:notation/> element would be to trigger a parser supporting the application/json datatype. This is less verbose, more natural to JSON users, but doesn't allow to add XML nodes in maps or sequences.

Another direction would be to extend the syntax of XML itself. To do so, again, there are many possibilities. The markup in XML is based on angle brackets and the distinction between the different XML productions is usually done through the character following the bracket in the opening tags.

This principle leaves a lot of possibilities. For instance, maps could be identified by the tags <{> and </}> to follow the characters used by XDM map literals and JSON objects :

<χ:data-model>
    <{>
        <χ:entry key="sales" keyType="string">
            <{>
                <χ:entry key="1" keyType="number">
                    <{>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Green</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">27</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">Sally</χ:atomic-value>
                        </χ:entry>
                    </}>
                </χ:entry>
                <χ:entry key="2" keyType="number">
                    <{>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Galley</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">41</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">Jim</χ:atomic-value>
                        </χ:entry>
                    </}>
                </χ:entry>
            </}>
        </χ:entry>
        <χ:entry key="accounting" keyType="string">
            <{>
                <χ:entry key="1" keyType="number">
                    <{>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Doe</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">23</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">John</χ:atomic-value>
                        </χ:entry>
                    </}>
                </χ:entry>
                <χ:entry key="2" keyType="number">
                    <{>
                        <χ:entry key="lastName" keyType="string">
                            <χ:atomic-value type="string">Smith</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="age" keyType="string">
                            <χ:atomic-value type="number">32</χ:atomic-value>
                        </χ:entry>
                        <χ:entry key="firstName" keyType="string">
                            <χ:atomic-value type="string">Mary</χ:atomic-value>
                        </χ:entry>
                    </}>
                </χ:entry>
            </}>
        </χ:entry>
    </}>
</χ:data-model>

Map entries are not ordered and in that respect they are similar to XML attributes. We could use this similarity and use the character @ to identify map entries:

<χ:data-model>
    <{>
        <@"sales" keyType="string">
            <{>
                <@"1" keyType="number">
                    <{>
                        <@"lastName" keyType="string">
                            <χ:atomic-value type="string">Green</χ:atomic-value>
                        </@"lastName">
                        <@"age" keyType="string">
                            <χ:atomic-value type="number">27</χ:atomic-value>
                        </@"age">
                        <@"firstName" keyType="string">
                            <χ:atomic-value type="string">Sally</χ:atomic-value>
                        </@"firstName">
                    </}>
                </@"1">
                <@"2" keyType="number">
                    <{>
                        <@"lastName" keyType="string">
                            <χ:atomic-value type="string">Galley</χ:atomic-value>
                        </@"lastName">
                        <@"age" keyType="string">
                            <χ:atomic-value type="number">41</χ:atomic-value>
                        </@"age">
                        <@"firstName" keyType="string">
                            <χ:atomic-value type="string">Jim</χ:atomic-value>
                        </@"firstName">
                    </}>
                </@"2">
            </}>
        </@"sales">
        <@"accounting" keyType="string">
            <{>
                <@"1" keyType="number">
                    <{>
                        <@"lastName" keyType="string">
                            <χ:atomic-value type="string">Doe</χ:atomic-value>
                        </@"lastName">
                        <@"age" keyType="string">
                            <χ:atomic-value type="number">23</χ:atomic-value>
                        </@"age">
                        <@"firstName" keyType="string">
                            <χ:atomic-value type="string">John</χ:atomic-value>
                        </@"firstName">
                    </}>
                </@"1">
                <@"2" keyType="number">
                    <{>
                        <@"lastName" keyType="string">
                            <χ:atomic-value type="string">Smith</χ:atomic-value>
                        </@"lastName">
                        <@"age" keyType="string">
                            <χ:atomic-value type="number">32</χ:atomic-value>
                        </@"age">
                        <@"firstName" keyType="string">
                            <χ:atomic-value type="string">Mary</χ:atomic-value>
                        </@"firstName">
                    </}>
                </@"2">
            </}>
        </@"accounting">
    </}>
</χ:data-model>

The key names have been enclosed between quotes because map keys can include any character including whitespaces, but they can be made optional when they are not needed. We could also give to the keyType a default value of "string":

<χ:data-model>
    <{>
        <@sales>
            <{>
                <@1 keyType="number">
                    <{>
                        <@lastName>
                            <χ:atomic-value type="string">Green</χ:atomic-value>
                        </@lastName
                        <@age>
                            <χ:atomic-value type="number">27</χ:atomic-value>
                        </@age
                        <@firstName>
                            <χ:atomic-value type="string">Sally</χ:atomic-value>
                        </@firstName
                    </}>
                </@1
                <@2 keyType="number">
                    <{>
                        <@lastName>
                            <χ:atomic-value type="string">Galley</χ:atomic-value>
                        </@lastName
                        <@age>
                            <χ:atomic-value type="number">41</χ:atomic-value>
                        </@age
                        <@firstName>
                            <χ:atomic-value type="string">Jim</χ:atomic-value>
                        </@firstName
                    </}>
                </@2
            </}>
        </@sales
        <@accounting>
            <{>
                <@1 keyType="number">
                    <{>
                        <@lastName>
                            <χ:atomic-value type="string">Doe</χ:atomic-value>
                        </@lastName
                        <@age>
                            <χ:atomic-value type="number">23</χ:atomic-value>
                        </@age
                        <@firstName>
                            <χ:atomic-value type="string">John</χ:atomic-value>
                        </@firstName
                    </}>
                </@1
                <@2 keyType="number">
                    <{>
                        <@lastName>
                            <χ:atomic-value type="string">Smith</χ:atomic-value>
                        </@lastName
                        <@age>
                            <χ:atomic-value type="number">32</χ:atomic-value>
                        </@age
                        <@firstName>
                            <χ:atomic-value type="string">Mary</χ:atomic-value>
                        </@firstName
                    </}>
                </@2
            </}>
        </@accounting
    </}>
</χ:data-model>

Atomic values could be identified by <=> and </=> and the same default value applied to its type attribute:

<χ:data-model>
    <{>
        <@sales>
            <{>
                <@1 keyType="number">
                    <{>
                        <@lastName>
                            <=>Green</=>
                        </@lastName>
                        <@age>
                            <= type="number">27</=>
                        </@age>
                        <@firstName>
                            <=>Sally</=>
                        </@firstName>
                    </}>
                </@1>
                <@2 keyType="number">
                    <{>
                        <@lastName>
                            <=>Galley</=>
                        </@lastName>
                        <@age>
                            <= type="number">41</=>
                        </@age>
                        <@firstName>
                            <=>Jim</=>
                        </@firstName>
                    </}>
                </@2>
            </}>
        </@sales>
        <@accounting>
            <{>
                <@1 keyType="number">
                    <{>
                        <@lastName>
                            <=>Doe</=>
                        </@lastName>
                        <@age>
                            <= type="number">23</=>
                        </@age>
                        <@firstName>
                            <=>John</=>
                        </@firstName>
                    </}>
                </@1>
                <@2 keyType="number">
                    <{>
                        <@lastName>
                            <=>Smith</=>
                        </@lastName>
                        <@age>
                            <= type="number">32</=>
                        </@age>
                        <@firstName>
                            <=>Mary</=>
                        </@firstName>
                    </}>
                </@2>
            </}>
        </@accounting>
    </}>
</χ:data-model>

The tags that surround atomic values are useful when these values are within a sequence but look superfluous when the item has a single value. The next step could be to define that in that case as a shortcut the value and its type attribute could be directly included in the item:

<χ:data-model>
    <{>
        <@sales>
            <{>
                <@1 keyType="number">
                    <{>
                        <@lastName>Green</@lastName>
                        <@age type="number">27</@age>
                        <@firstName>Sally</@firstName>
                    </}>
                </@1>
                <@2 keyType="number">
                    <{>
                        <@lastName>Galley</@lastName>
                        <@age type="number">41</@age>
                        <@firstName>Jim</@firstName>
                    </}>
                </@2>
            </}>
        </@sales>
        <@accounting>
            <{>
                <@1 keyType="number">
                    <{>
                        <@lastName>Doe</@lastName>
                        <@age type="number">23</@age>
                        <@firstName>John</@firstName>
                    </}>
                </@1>
                <@2 keyType="number">
                    <{>
                        <@lastName>Smith</@lastName>
                        <@age type="number">32</@age>
                        <@firstName>Mary</@firstName>
                    </}>
                </@2>
            </}>
        </@accounting>
    </}>
</χ:data-model>

XPath

The χίμαιραλ serialization being XML, it is possible to use XPath path expressions to query its structure. For instance, to get a list of employees which are less than 30, we can write:

χ:map/χ:entry/χ:map/χ:entry/χ:map[χ:entry[@key='age'][χ:atomic-value < 30]]

Or, if we're feeling lucky:

//χ:map[χ:entry[@key='age'][χ:atomic-value < 30]]

Again, that's good as long we work on a χίμαιραλ serialization but it would be good to be able to use path expressions directly on map data structures. To do so we would need at minima to define steps to match maps and entries.

XSLT 3.0 introduces a new map() item type which could be used as a kind test to identify maps.

If we follow the idea that map entries are similar to XML attributes, we could use the @ notation to identify them. The XPath expression would then become:

map()/@*/map()/@*/map()[@age < 30]]

Or, if we're feeling lucky:

//map()[@age < 30]]

Validation

These data models can be complex. Wouldn't it be useful to be able to validate them with schema languages? This would give us a way to validate JSON maps!

Of course we can already serialize them in χίμαιραλ and validate the serialization using any schema language, but again it would be good to be able to validate these structures directly.

A RELAX NG schema to validate the χίμαιραλ serialization of our example would be:

namespace χ = "http://χίμαιραλ.com#"

start = element χ:data-model { top-level-map }

# Top level map: departments
top-level-map =
    element χ:map {
        element χ:entry {
            attribute key { xsd:NMTOKEN },
            attribute keyType { "string" },
            emp-array
        }*
    }

# List of employees
emp-array =
    element χ:map {
        element χ:entry {
            attribute key { xsd:positiveInteger },
            attribute keyType { "number" },
            emp-map
        }*
    }

# Description of an employee
emp-map = element χ:map { (age | firstName | lastName) + }

age =
    element χ:entry {
        attribute key { "age" },
        attribute keyType { "string" },
        element χ:atomic-value {
            attribute type { "number" },
            xsd:positiveInteger
        }
    }

firstName =
    element χ:entry {
        attribute key { "firstName" },
        attribute keyType { "string" },
        element χ:atomic-value {
            attribute type { "string" },
            xsd:token
        }
    }

lastName =
    element χ:entry {
        attribute key { "lastName" },
        attribute keyType { "string" },
        element χ:atomic-value {
            attribute type { "string" },
            xsd:token
        }
    }

Note

In the description of the maps used to describe employees, we cannot use interleave patterns because of the restriction on interleave and the schema is approximate. In this specific case, we could enumerate the six possible combinations but the exercise would quickly become verbose if the number of items grew:

emp-map = element χ:map { 
      (age, firstName, lastName)  
    | (age, lastName, firstName) 
    | (firstName, age, lastName)  
    | (firstName, lastName, age) 
    | (lastName, age, firstName)  
    | (lastName, firstName, age) 
}

A Schematron schema for the χίμαιραλ serialization could be developed based on XPath expressions similar to those that have been shown in the previous section.

Again, it would be interesting to support maps directly as first class citizens in XML schema languages.

The ability to use Schematron on XDM maps depends directly on the ability to browse maps using patterns and path expressions in XPath and XSLT (see above)...

The main impact on RELAX NG would be to add map and item patterns and the schema could look like:

namespace χ = "http://χίμαιραλ.com#"

start = element χ:data-model { top-level-map }

# Top level map: departments
top-level-map =
    map  {
        entry xsd:NMTOKEN {
            emp-array
        }*
    }

# List of employees
emp-array =
    map {
        entry xsd:positiveInteger {
            emp-map
        }*
    }

# Description of an employee
emp-map = map { age, firstName, lastName  }

age =
    entry age {
            xsd:positiveInteger
        }
    }

firstName =
    entry firstName {
             xsd:token
        }
    }

lastName =
    entry lastName {
            xsd:token
        }
    }

Sequences could probably be supported without adding a new pattern but would require to relax some restrictions to allow the description of sequences mixing atomic values, maps and nodes (in Relax NG, sequences of atomic values are already possible in list datatypes, sequences of nodes are of course available to describe node contents but these two type of sequences cannot be mixed).

Conclusion

According to the definition of chimeras in genetics from Wikipedia quoted in the introduction, chimeras are formed from at least four parent cells (two fertilized eggs or early embryos fused together). Each population of cells keeps its own character and the resulting organism is a mixture of tissues.

The current XDM proposals have added to the XML data model a foreign model to represent maps. This new model is a superset of the JSON data model. The two data models keep their own character and the resulting model is a mixture of information items.

It's far to say that the current XDM proposal is a chimera, something described as usually ugly, foolish or impossible fantasies by Jeni Tennison.

I hope that the proposals sketched in this paper will help to address this situation and fully integrate these new information items in the XML echosystem.

BalisageThe Markup Conference2012

Balisage Paper: Fleshing the XDM chimera

Abstract

Table of Contents

Motivation

Note

XML Data Models

XPath/XSLT 1.0

XDML 2.0: XPath 2.0/XSLT 2.0/XQuery 1.0

XDM 3.0: XPath 3.0/XSLT 3.0/XQuery 3.0

Note

Note

Identity Crisis

Note

Note

Introducing χίμαιραλ (chimeral), the Chimera Language

Example

χίμαιραλ In a Nutshell

Remaining Issues

χίμαιραλ and the identity crisis

Moving the chimera forward

Note

Embracing RDF

Syntactical sugar

XPath

Validation

Note

Conclusion

Balisage Series on Markup Technologies