How to cite this paper
La Fontaine, Robin. “Element order is always important in XML, except when it isn't.” Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). https://doi.org/10.4242/BalisageVol26.LaFontaine01.
Balisage: The Markup Conference 2021
August 2 - 6, 2021
Balisage Paper: Element order is always important in XML, except when it isn't
Robin La Fontaine
Robin is the founder and CEO of DeltaXML. His background includes computer
aided design software and he has been addressing the challenges and
opportunities associated with information change for many years. DeltaXML tools
are now providing critical comparison and merge support for corporate and
commercial publishing systems around the world, and are integrated into content
management, financial and network management applications supplied by major
players. Robin studied Engineering Science at Worcester College, Oxford and
Computer Science at the University of Hertford. He is a Chartered Engineer and
member of the Institution of Mechanical Engineers. He has three adult children,
three and a half grandchildren, and never finds quite enough time for walking,
gardening and working with wood.
Copyright ©2021 DeltaXML Ltd. All Rights Reserved.
Abstract
"Which came first," begins an old joke. But the more interesting question might
be, "does it even matter?" There are many obvious and several not-so-obvious ways
in
which the order of items (be they XML elements or attributes, or JSON maps or
arrays) can be understood to be significant or insignificant. These are not new
questions and how they’re answered plays out across vocabulary design, schema
design, and individual documents. They are important questions when it comes
deciding if two documents are “the same” or “different” and to what extent.
This paper challenges the one-size-fits-all decree in XML that order needs to be
preserved and reviews the implications of 'order'. When ordered elements can be
moved then we have something that has some common ground with orderless. This paper
establishes a continuum between ordered information and orderless information and
proposes that these are not as far apart as they might at first appear.
Table of Contents
- Introduction and Background
- XML elements and attributes, JSON objects and arrays
- Comparison and moves - a focus for order
- Is order a binary choice?
- Schema Languages and Orderless
- Conclusions
Introduction and Background
What one community takes for granted and never even discusses, because it is
'obvious', another community sees as an important area for debate and discussion.
Order
in XML is certainly one of these areas. The community of those who use XML for
human-readable documents never questions the assumption that the order of elements
is
important, it is obvious to anyone that swapping a couple of paragraphs round can
have a
significant impact on the meaning of a document. Similarly the attributes of a paragraph
are just attributes and it is 'obvious' that declaring one before another is of no
consequence.
The data community, if we can be so bold as to make this distinction, will always
define whether or not a collection of data items constitute a set (where they can
appear
in any order) or a list (where order is important). As the Eskimo language has 50
words
for snow (and the Scots quite a few words for mist and fog) so the data community
has
many words for collections, including for ordered collections lists, arrays, vectors
and
for orderless collections maps (name/value pairs), set (where each object is unique),
and bag (where duplicates are allowed).
The document community makes a distinction between 'content', the words that appear
on
a page, and structure/formatting which affect where the words may appear or what font
they may be in. The data community does not have such a binary distinction between
what
constitutes data and what is meta-data, because the distinction may depend on the
type
of processing that is being performed. To put it another way, one person's meta-data
is
another person's data.
This vexed subject of order in XML was discussed at Balisage in 2002 by Tommie Usdin
[1]. There was clearly quite a discussion about it and the
title of her paper, "When 'It Doesn’t Matter' means 'It Matters' ", is uncannily similar
to the one for this paper. Tommie introduces that paper with a story about how order
that was 'obvious' to the document author was not so obvious to a programmer. Her
paper
proposed, and I am very much in agreement, that tight specifications and only one
way to
represent something is the best way forward for unambiguous information exchange,
and we
will return to that later.
So it is not surprising that we have issues around how best to handle order, though
it
is perhaps surprising that we had the same issue almost 20 years ago. As Tommie says
in
her paper, "please listen carefully when someone says 'it doesn’t matter'. Figure
out if
they mean 'there is not information to be conveyed here' or if they mean 'there is
a
great deal of information to be conveyed here'. Or if they don’t know which they mean
because they haven’t thought it through."
XML elements and attributes, JSON objects and arrays
It is worth considering how JSON [2] handles order, as it
has been adopted in place of XML by many in the data community because XML did not
work
well for some types of data. JSON objects with their members resemble XML attributes
in
that they are name/value pairs and they have unique names within their immediate
context. JSON object members are also not ordered, so this is a common feature with
XML
attributes and both implement the concept of a map. A lot of data is order independent
and unless each data item has a name the only JSON structure that can be used is the
array construct. However, JSON defines arrays as ordered collections, as in a list,
and
this reflects the significance of order in a sequence of XML elements.
If we look at XML Schema [3] and JSON Schema [4] we find the ability to define the type of items, minimum and
maximum numbers of items but not whether order is significant. It is for the author
or
provider of the information to decide on the order of the information but we cannot
specify whether or not this order conveys information. It is in one sense reasonable
that the significance of order depends on the information but it seems an omission
not
to be able to indicate one way or the other in a schema.
We find some distortions in the way that data is represented when either format is
adopted. For example, in SVG [5] the points of a polygon and the
line style could be defined in any order, and this seems to be one reason they are
defined as attributes as shown in this example. The representation of points within
an
attribute might be more compact than if elements were used, but all the power of markup
is lost because the attribute has no internal structure. Using attributes in this
way
precludes, for example, the addition of an attribute on a point.
<polygon points="220,10 300,210 170,250 123,234"
style="fill:lime;stroke:purple;stroke-width:1" />
Similarly, if we look at JSON data, and we want to represent data items that each
have
a unique key then the natural way to do this might be using an object where each unique
key is represented as the member name, as follows.
{"contacts": {
"324": {
"first_name": "AN",
"last_name": "Other"
},
"127": {
"first_name": "John",
"last_name": "Doe"
}
}}
However, it is much more usual to see such data represented in an array like
this.
{"contacts": [
{
"id": "324",
"first_name": "AN",
"last_name": "Other"
},
{
"id": "127",
"first_name": "John",
"last_name": "Doe"
}
]}
It could be argued that both XML and JSON have one way to represent ordered data
(elements and arrays) and one way to represent orderless data (attributes and objects).
But adopting these structures means that some uncomfortable consequences are
encountered. In both cases, each orderless data unit or item (it is not possible to
use
the term object or entity without causing confusion) must have a unique identifier,
i.e.
the XML attribute name or the JSON member name. The temptation is to assign an arbitrary
key if there is no key or if it is difficult to determine what the key should be,
but
that is not a sensible way forward. This is why in practice some XML elements can
appear
in any order without change of meaning, and the same is true for the content of some
JSON arrays. Therefore it is important as part of the definition of any data format
to
know whether or not the order is significant.
We can draw up a table to illustrate the characteristics of the two formats in this
area.
Table I
Order and Uniqueness in XML and JSON
|
XML |
JSON |
Comment |
Order |
Elements ordered |
Array members ordered |
Both XML elements and JSON arrays commonly used for orderless data
especially if not keyed
|
Orderless |
Attributes are name/value pairs and are orderless |
Object members are name/value pairs and are orderless |
Attribute values have no structure, so do not work for structured
information
|
Keys |
Attributes keyed by name of attribute |
Object members keyed by name |
Implication is that orderless needs to be keyed, but this is not always
the case
|
Uniqueness |
Powerful 'unique' concept in XML Schema that allows scope to be
defined
|
Provided in JSON Schema for arrays |
|
ID |
ID attribute is unique across the whole document |
|
|
This table shows that according to the definitions, it is not possible to specify
XML
child elements or JSON array members where the order can be changed without changing
the
meaning.
In the XML world, XML Schema does not help here. It allows complex types to contain
'sequence', 'choice' or 'all'. A sequence or choice may occur more than once. Choice
is
just one of a list of elements and sequence is a defined order of elements, 'all'
is
unordered. This is certainly useful in that it allows us to validate the syntax of
an
instance file but the order of the instance data is always deemed to be significant.
Perhaps there is sometimes confusion between the ordering of elements that is defined
for validation of the structure and the significance of the order of elements as they
appear in the instance data. Our discussion is about the ordering of instance data
and
not the ordering of elements that constitute the structure. This might appear a subtle
distinction, but it is not!
When we look into the logic behind the above decisions about XML Schema, there is
an
interesting document that has no official standing but stems from discussions within
the
Web Services working group using XML Schema. This describes schema patterns [6]. Here there is the concept of a vector 'A vector is
an ordered sequence of repeated items of the same data type.' and map 'A map is an
unordered collection of repeated items of the same type... each item is accessible
by a
key value, unique within the scope of the collection.' So again we see unordered or
orderless data needing to have keys and there is no provision for orderless data without
keys.
Comparison and moves - a focus for order
Knowing whether or not order is important is essential to the process of determining
whether or not two XML documents are the same or different. This was noted as long
ago
as 2001 in 'A Delta Format for XML: Identifying changes in XML files and representing
the changes in XML' [7], where it states "some elements
may appear in any order and so a change to order should not be identified as a
change."
Determining if a document has been changed may be critically important, especially
if
that document is a legal or technical document. It is not possible to determine if
a
document has changed unless we can determine with certainly whether two documents
are
the same. This may seem simple and obvious but in practice equality is often not well
defined. The result is that a change to the order of elements is identified as a change
but may not be a 'real' change. Another example is change to white space, which could
simply be the result of pretty-printing, may often not be considered to constitute
a
change.
We know that a change in attribute order does not constitute a difference, but the
change in order of elements should by default be flagged as a difference. But if those
elements are simply elements that have attributes conveying some meta-data and have
no
content, then a change in order may not be significant. This suggests that the
significance of order may depend on other factors, e.g. whether or not there is text
content directly or indirectly within an element.
It is worth noting that Canonical XML [8] provides some
support in that "It is the goal of this specification to establish a method for
determining whether two documents are identical, or whether an application has not
changed a document, except for transformations permitted by XML 1.0 and Namespaces
in
XML 1.0." But of course this is based on the premise that order of elements is always
important so it does not help when a change in order is not a 'real' change.
XML element comparison by default preserves the order of elements in the two
documents. A change of order will therefore result in elements appearing as added
and
deleted. But the result of a change of order could also be represented as a poor match
between two paragraphs that are deemed to be in the same position, for example because
they have common paragraphs before and after them. It is not trivial to determine
what
has changed - have several paragraphs been modified or have one or more paragraphs
been
moved? This suggests that when considering order in the context of comparison one
factor is whether or not we are interested in identifying moves.
Is order a binary choice?
We have deliberately muddied the waters here by introducing the possible distinction
of elements with and without text content, and in considering moves. There appear
to be
binary alternatives: either the order of elements is significant or it is not. Therefore
to determine whether two documents are equal, we test this equality by an ordered
comparison, i.e. preserving the order of elements in both documents, or by ignoring
order and seeking to find pairs of equal elements. By 'equal' here we mean that the
full
content of the elements are equal, often known as 'tree-equal'. If the ordered alignment
shows no differences then by definition the orderless will also show no differences,
but
vice versa is not true.
It is always the case that an orderless alignment will be at least as good if not
better than an ordered alignment. The term 'better' here means that more information
is
aligned, taking into account the full tree structure of all the child elements. Two
elements that are tree-equal represent the 'best' alignment. We discuss later how
to
measure this quality of alignment, e.g. by determining how many words (or PCDATA
characters) and/or attributes are aligned as equal.
This implies that as we move from a strictly ordered alignment to an orderless
alignment, the quality of alignment is likely to improve. If 'nothing can move' then
we
cannot improve on the ordered alignment, but if 'anything can move' then we can use
the
orderless alignment as the best result.
The diagram above illustrates two lists of child elements, denoted 1, 2, 3, 4 in one
XML document and A, B, C, D, E in the other. These labels are purely so that we can
identify them in this discussion. The lines between an element in one file and one
in
the other shows a possible alignment, where the left pairing denotes an alignment
that
preserves order and the right pairing shows an alignment that does not preserve order.
The width of the connecting lines shows the quality of the alignment, where wider
line
shows a better alignment. Note that here we are discussing element alignment rather
than
the alignment of individual words or text items which will either be treated as equal
or
not equal.
We can see here that although 2 is aligned with B when an ordered alignment is made,
the quality of this is not as good as between 2 and D. We can deduce that 2 was probably
moved to a new position, D, rather than modified to create B. The orderless alignment
here sheds some light on what has probably happened in moving from one document to
the
other. This example is very simple and has few items to be aligned, but for a large
number of items of different sizes the complexity rises very rapidly.
If we do not allow any moves, then we have the ordered alignment. If we allow any
moves anywhere then we end up with the orderless alignment. But what happens between
these two extremes? Is there anything meaningful in between them? It is interesting
to
explore this, and in developing algorithms to find 'optimum' alignment we find that
we
are obliged to explore this further.
The alignment will typically improve as we consider some elements as moved rather
than
added, deleted or changed. This suggests a continuum between ordered and orderless
that
is determined by how tolerant we might be to considering that an element has been
moved.
It is interesting to speculate whether we can find an optimum result. How might this
be defined? We would need to measure the quality of an alignment in some way and to
assign a cost to a move. A move would seem to be better always than an addition and
a
deletion, but we would need more detailed quality metrics to determine if an element
that could be aligned with its order preserved is a better or worse result than an
alternative alignment that is better but changes the order. When paragraphs are being
aligned it is normal to apply some threshold so that two paragraphs are deemed to
be
different rather than modified versions of one another if this threshold of common
words
is not met: it is nonsense to align two long paragraphs just because they both contain
a
similar sequence of common words such as 'and', 'but', 'the'. The same could also
be
applied to data though the definition of a good threshold metric is not trivial.
It would take a full paper to explore the metrics that could be applied to determine
the quality of an alignment. Counting the number of words or elements that are equal
versus the number that are not equal is one simple measure but for human readability
of
documents the degree of fragmentation is also important. For example, if the number
of
items aligned is the same in two possible alignments but in one of the two the items
are
in a contiguous sequence, then this would seem to be the better alignment because
it is
less fragmented. For this discussion, we will consider some measurement of Alignment
Quality as having a value between 1 (exact tree equality) and 0 (nothing is
aligned).
Let us consider an Order Metric as a value of 1 to denote fully ordered and a value
of
0 to denote fully orderless. What might 0.9 mean? It could mean, in our example above,
"if the Alignment Quality of 2 and D is greater than 0.9 and the Alignment Quality
of 2
and B is less than 0.9, then align 2 and D." As we slide our Order Metric down from
1
towards 0 we find the alignment changes and there are more moves.
Clearly if the order really does not matter, in other words it has no meaning at all,
then our Order Metric will be 0. But it is usually the case that even if order is
important, we would like to see when items have been moved. In the simplest case,
an
Order Metric of 0.999 would only discover a move if the moved item was more or less
equal to the item in the new position. Document authors would like to see moves even
when there are also changes and this becomes more complex to determine.
We also find a similar effect in data because the simple map model of orderless items,
where they each have a key, is limited when data is changing - not least because
sometimes the key itself is changed and all the other data remains the same! This
is
actually an important use case when comparing data between two different applications
where a key has been incorrectly set in one of the two. So with data even if the order
has some significance, it may be useful to determine where items have been moved.
Schema Languages and Orderless
Returning to schema languages, we have established that there is no way to indicate
that we have a sequence of items which do not have keys (so XML attribute or JSON
Object
members are not appropriate) and where the order in which they appear conveys no meaning
at all. Currently it is only possible to specify this in the documentation.
The simplest model for orderless is to indicate that all the child elements of an
element are orderless. A more complex model would be to specify that certain children
in
a sequence or choice are orderless but this would seem to introduce unnecessary
complexity because the orderless children can always be put in a wrapper element,
which
would seem to be good practice.
Returning to Tommie's paper, schema developers should take note of her comment, "But
if it means the same thing which ever sequence they come in, then pick one and require it!" This means that if elements A, B and C can
occur once each but the order they appear in has no meaning, then specify the order
and
then there is no ambiguity about whether or not it is important because it cannot
be
changed. It is counter-intuitive to require a certain order when the order does not
matter, but that is the correct way to do it, for the reasons articulated in that
paper.
We could avoid adding the burden of having to output information in a specific order
if
we were able to specify in the schema that the order was not significant.
If we can explore further the definition of an Order Metric then this would certainly
be useful in comparing information expressed as XML or JSON, but it does not seem
to
have a place in a schema language. The concept of a move belongs in the configuration
of
a comparison. This is because it might be local to one element, so a child element
can
be moved to another position in the same parent element, or it could be global so
an
element could be moved to somewhere else in the document. Again, there is a continuum
between these two extremes, e.g. a paragraph could be moved within a section but not
across sections.
Conclusions
This paper has considered how the significance of order within an XML document or
a
JSON data set cannot be fully specified in a schema. There seems to be an implication
that unordered or orderless data must have keys, as in a map structure. There may
also
be some confusion between the ordering of elements for structural validation and the
ordering of instances of these elements in a document.
The paper has also shown how looking at order and orderless as a continuum rather
than
a binary choice is helpful to understand change in the sense of moving an element
from
one position to another.
Any future update to XML Schema or JSON Schema should consider how to define the
significance of order, in instance data, for XML element children and JSON array
members.