Introductory matters
XML and JSON differ in how they punctuate data; this is obvious. Perhaps less obvious
is how they differ in the topology and semantics they represent. They share a number
of basic structures, and from 50,000 feet up they're both hierarchical.
This paper is concerned first with the superficial problem of syntax, where JSON
has achieved a reputation in some quarters for being easier to use than XML; and second
with the subtler but ultimately more important problem of data modeling.
This papers intends to demonstrate in a practical way, that JSON's seeming advantage
over XML in programming convenience actually has little to do with the languages,
but much to do with how interfaces to them have been set up in various programming languages. I demonstrate a programming
interface to XML data, that is far simpler to use than the APIs usually derived from
DOM (). In addition, the API supports the stronger semantics and topology of XML,
and so works not only for the simple cases JSON addresses, but also the countless
cases where XML's richer selection of abstract data types is needed.
This paper has two basic points to make:
-
Programming languages can make it far easier to deal with XML, by using their native
facilities for abstract collection types, instead of mechanically implementing the
DOM API, which is not optimized for any particular language.
-
XML brings to the fore a particular abstract collection type that is not widely discussed,
but follows logically from an analysis of other common types. Although this type may
not arise so commonly in traditional data
tasks, it does arise all the time in dealing with documents. Trying to manage document-shaped
information with arrays and hashes is (of course) possible, but exceedingly awkward.
A much better solution is to implement the needed type, and make it as natural to
deal with in programming, as arrays and hashes are now.
Asking a developer in the 1950s about hash tables (or dictionaries or associative
arrays), when all they had in their tools were arrays, would likely have provoked
responses such as: Much too complicated.
Doesn't fit the way my tools work.
Programs just don't have to deal with that kind of thing very much.
Too much stuff in there.
You can build that all with stuff I've already got.
Clever, but not worth the effort.
Or even, If the experts who built Fortran didn't see it as necessary, why should I?
These were also common responses to Unicode, to object-orientation, and to multi-threading.
They are much the same arguments made now against XML. And although none of these
features has become completely ubiquitous, few developers would oppose built-in support
for them.
As with hashes, Unicode, and other advances, I think the key is to build support into
the fabric of programming. When developers have to understand Unicode to make their
code Unicode-safe, they're unlikely to put in the effort — they have better things
to do, or at least, things much more obviously related to the task at hand. Rather,
the few character-set specialists need to build Unicode-awareness and Unicode-safety
into the system libraries and utilities (just try feeding Unicode to the usual *nix
commands!), so they fade into the background and become just the expected
behavior. Then developers with better things to do, will use them — and (if the Unicode
folks do their implementations right) will notice (perhaps long after) that their
systems work better. Meanwhile, their managers may notice that they can sell in new
areas with minimal new development costs.
I begin, then, with a look at basic abstract collection types, and at JSON (Cro06), Javascript (ECM11), XML (Bra08), and DOM (LeH04) in light of the types they use and the syntax commonly provided for getting at them.
The main portion of the paper then examines the more general notion of abstract collection
types, some data's need (and XML's support) for a type that JSON does not provide,
and the design and implementation of an interface in Python, to that additional abstract
collection type. Given that abstract collection type (which is useful for many things
besides XML), a syntactically-native, language-appropriate, easy-to-use interface
to XML's more general data becomes easy.
On very basic data and collection types
Atomic datatypes such as numbers, strings, and Booleans, are ubiquitous in programming,
and the problems they pose are, if annoying, at least familiar: False
vs. #F
vs. 0
vs. nil
; 99
vs. 99
vs. "099"
; and a variety of challenges that arise from mathematical and real-world characteristics
such as precision, spelling complexity and variability, etc.
Arrays are lists of data items, with the items commonly of atomic types, but in general
of any types. Arrays are organized merely by position, generally using non-negative
integers. High-level programming languages have included them for a very long time,
and the syntax for accessing them is generally similar, although programming languages
avoid the literal subscripts we all learned in early math classes. Developers are
used to accommodating punctuation differences as they change from one language to
another: $myArray[0]
vs. myArray(1)
vs. item 7 of myArray
.
Dictionaries, hashes, or associative arrays instead organize their members via a second
piece of data for each: a name
or key
. They are not so ubiquitously built in as arrays.
Their punctuation similarly varies: h = [ a=>1, b=>2 ]
vs. h = { 'a':1, n:'2' }
vs. h["a"]=1, h["b"]=2
vs. (hash ('a 1) ('b 2))
, and so on).
With minor variations, these types are simple and widespread. Many programming languages
provide special syntax for each of them; thus, if you know the programming language,
you know the syntax for using these types. That is a nice advantage when you deal
with data that is accurately modeled with these types.
It is important to keep in mind that there are several levels of abstraction here.
A file containing [ 1, 2, 3 ]
is not an array; it represents one (it may actually be
holes in a piece of compressed wood-pulp); or it might represent a vector in a left-handed
non-orthogonal 3-space. It might be loaded into 12 bytes of RAM, intended to be interpreted
as 3 32-bit (signed?) integers; or Javascript commonly represents arrays as hashes,
by converting the indices from integers to strings that it then hashes. Russel and
others represent an integer n
as a set nested n
-levels deep. And in the end, all of these are merely representations of an abstract
notion of quantity that cannot be directly perceived. For present purposes we need
not contend with all this ontological complexity; but the differences between these
notions are central:
-
The representation of data via a sequence of characters in a text file. For example
{ 'a':1, 'b':2, 'c':"\u2172" }
.
-
The data structure created from the representation via loading,
composed of constructs implemented by those who implement a given programming language.
For example, a contiguous sequence of 12 bytes, preceded by 8 extra bytes of overhead
such as the width (4) of the entries, the total length of the area, who owns it, etc.
-
The interface provided to those data structures for application programmers. For example,
myArray
, or myArray[1]
or myArray.1
for the second item.
On JSON and Javascript's abstract collection types
JSON (JavaScript Object Notation) is a way of representing certain data collections
as text files. JSON, for most purposes, simply is
a part of Javascript: a valid JSON file can be pasted in as the right-hand-side of
a Javascript assignment statement, and should just work
. The reverse is true for simple cases, though not in general. In my opinion, this
is by far the largest factor in JSON's reputed ease of use:
-
JSON supports just the same abstract collection types as Javascript: array (sort of) and dictionary. Thus, the mapping from JSON's idea
of how data can be structured, to Javascript's, is trivial; there essentially isn't
a mapping
at all.
-
It supports most of the same scalar types: integer, floating point number, Boolean, and string.
-
Its syntax is almost the same as expressions in Javascript. One exception is that
one can't interpolate variables, such as [ 1, myVar, "c",... ]
, so one cannot factor out heavily-reused pieces of data except by introducing conventions
outside JSON's awareness. Thus JSON as a representation is slightly less capable than Javascript; but within it's capabilities it is
Javascript.
-
Once loading some JSON is finished (which, like saving, can be a one-step process,
for example via eval(...)
), the programmer accesses any part of the (formerly JSON) data structure just as
if it were the same data declared as native constants in Javascript.
These are all legitimate virtues of JSON. They are greatest for the Javascript programmer,
but similar advantages apply in many programming languages because similar basic types
are common.
Javascript's (hence JSON's) selection of abstract collection types, while simple,
is even smaller than programmers are used to dealing with. If anything, it is slightly
too small, because there is considerable confusion about Javascript's arrays and dictionaries.
Javascript objects are essentially dictionaries, in that they can have properties
which are unordered and are accessed by name, such as obj.myProp
. Thus, there is not independent dictionary type.
There is an Array
constructor, which makes objects that support the usual array/list operations (including
sort
), and whose members can be accessed by a[2]
syntax. They can be initialized with syntax like a = [ 1, 2, 3 ]
.
So far, this is mostly as one would expect. However, an object's members can also
be accessed with array syntax: obj["myProp"]
. This accesses the same property as obj.myProp].
But if you assign obj[99] = "hello"
, the integer 99 is converted to the string 99
, and obj
gains a property named 99
(see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array).
One could access it as obj.99
except that Javascript syntax does not allow numbers as operands of .
. Likewise, if assigning a["2"] = "bar"
affects the same member as a[2] = "bar"
.
The killer, however, is this: if you explicitly create an array, you can still index
into it with non-integers: a["foo"] = "bar"
. You can set and get such members. However, the value of a.length
does not change unless those strings; it is defined as the number (property name)
of the last non-negative-number-named element, plus 1. Obviously that means that arrays
start at 0. You can happily assign to other numeric indices (quoted or not): x[3.0]
= 1; x[0.3e3] = 1; x[3.000000000000000001] = 1; x[3.1] = 0; x[-1] = 1;
. However, only the first three of those five become part of the array
; the others quietly become properties instead (therefore still accessible as, for
example, x[3.1]
As far as I can tell, this amounts to Javascript having only dictionaries, not arrays,
as data structures, but both as API (the Array
constructor makes an object with extra properties and methods such as length
, sort
, etc.). Array methods work so long as you behave just so. For example, if you make
an Array and assign to a["foo"]
or a[3.1]
, it does create a.foo
; but (for example) a.length
does not increase.
Implementing other abstract collection types using only dictionaries is (as with all
Turing-computable matters) possible, but awkward compared to using types whose semantics
and consequent APIs directly match the problem at hand. In Javascript's case, Arrays
are fairly intuitive so long as you carefully follow the implicit rules (rules that
are not checked by the language at all). If you bump up against the edges, things
get messy.
On XML's abstract collection types
XML must handle documents; that is its reason for existence. While it has proven useful
for many other things, such as configuration files, database interchange, and even
sensor data transfer in automative systems, handling all those at the cost of being
effective for documents would be failure.
So what is a document, really? Document structuring and representation clearly require
at least three things (Coo87, DeR90):
-
order (this paragraph comes before that one)
-
hierarchy (this chapter contains that section, or this strophe contains that stanza)
-
a wide variety of names
Disregarding the order of speeches in one scene of a play, or of the steps in a repair
procedure, both lead to absurdity. That is not to say that everything in XML is ordered; attributes are not, by definition; some bibliography schemas care
not how the components of each entry are ordered; and there are many other cases.
But XML does need to support order to support documents.
Hierarchy is also essential — it can makes a great deal of difference whether this
footnote (or quotation, or other object) is part of one section (or pericope, or procedure),
vs. another. Again, not everything is very hierarchical; front-matter may simply be a long, flat list of metadata items.
Finally, the type-names of various components are extremely important. Aircraft repair
manuals contain a variety of warnings, some of which may have mandatory layout requirements
so they cannot be lost in fine print. Citations and bibliographies are very important
in some genres. Sacred texts are very careful about the distinction between (canonical)
content and (merely convenient) annotations. Document representations have reified
such linguistic categories, so they can be used for all kinds of processing: Lay out footnotes this way
; Retrieve all the distinct conference proceedings from the bibliography
; Check that all the verses are present in this translation of the Gospel of John
; Build a table of contents from the first 'title' element in each 'div' not nested
more than 3 divs deep
; and so on.
Element type names are handles by which humans organize their thoughts into writing;
they are also the usual keys by which processing is triggered, much like class and
method names in programming. This function can be moved around syntactically, but
remains central (there is a reason that <span class="p">
looks strange).
At a minimum, any representation that purports to handle documents needs to support
these capabilities. Child-nodes must be accessible by both name and position, and
multiple children of the same type must be permissible.
An example of a trivial document portion
It is sadly true that those who argue that JSON, SQL, or other data
formats are perfectly adequate for documents, rarely show any examples of realistic
documents. Rather, the examples virtually always lack hierarchy; even more frequently
lack mixed content (an extraordinarily important omission); often have no relevant
order; and usually include nothing notionally like attributes (such as properties
of elements as wholes; the occasional exception is IDs).
Such examples are unconvincing.
CSV and its kin cannot handle documents well because they lack hierarchy; because
(in theory though seldom in practice) they lack order), and because while they may
have names (for fields) they lack the ability to put named objects in meaningful orders.
About the best you can do is (Way01, sec. 10):
"h1", "10: The Argonauts"
...
"p", But the hero essayed to hush their laments and assuage their pain
With words of cheer, and he spake,"
"quo", "Take up my war-array,"
"p-resume", "To the thralls, and with downcast eyes did these in silence obey."
Even in this tiny and unusually simple snippet there are many problems: After the
quotation the containing paragraph ought to continue on, but because there is no hierarchy
some hack must be used to re-start it. The same thing would happen if we tried to
separate the section-heading's number and title; reality is far more complex.
Documents require an abstract collection type that permits two important kinds of
retrieval at every level of a hierarchy: By order (for global processing such as rendering,
indexing, and the like); as well as by position (for many analytic and transformation
processes). Moreover, these need to work together. Users must be able to refer to
things like the 3rd footnote in chapter 4,
the last word of each speech attributed to Medea,
all the images,
and so on.
JSON is vastly more capable than CSV, and so can make a more credible attempt to represent
documents; the presence of hierarchy is particularly helpful. But the unavailability
of collections whose members are both ordered and named still makes it awkward, and
to my knowledge JSON has never been seriously used for substantial documents, such
as books, articles, manuals, and so on. It is hard to see a general solution much
cleaner than this, because arrays are needed to keep order, while something else is
needed to distinguish attributes (which form a dictionary) from sub-elements and text:
{ "book": [
{ "section": [
{ "title": [
{ "n": 10 },
"The Argonauts"
]},
{ "p": [
"But the hero essayed to hush their laments and assuage their pain
With words of cheer, and he spake, ",
{ "quo": [
"Take up my war-array,"
]},
" To the thralls, and with downcast eyes did these in silence obey."
]}
]}
]}
This does represent
the needed relationships of precedence and containment; but it is
far less readable and far more error-prone than XML.
<book>
<section>
<title><n>10</n> The Argonauts</title>
<p>But the hero essayed to hush their laments and assuage their pain
With words of cheer, and he spake,
<quo>Take up my war-array,</quo>
To the thralls, and with downcast eyes did these in silence obey.</p>
</section>
</book>
Having an abstract collection type that actually corresponds to document topology
is simply cleaner. A programming language can provide an appropriate interface to
that abstract collection type, and thus enable a much clearer and easier interface
to XML than hammering XML-shaped pegs into JSON-shaped holes (however nice the latter's
fit for JSON-shaped pegs may be).
Python has a rich collection of abstract collection types; rich enough to reveal that
the type XML most needs, while not built in, fits into a slot implied by the relationships
of abstract collection types that are already available in Python. This type's properties
are predicted by symmetry: the same distinctive properties displayed by types already
observed can be combined in another way.
Defining the missing type is one small step for Python. But having taken that step,
XML can be treated in Python much as JSON is treated in Javascript. An implementation
in this form makes use of XML in Python nearly as trivial as use of JSON in Javascript.
And although support for abstract collection types in Python is particularly strong,
analogous APIs for XML can be made with little more work in other languages, vastly
simplifying use of XML for developers in general.
On DOM
DOM, the Document Object Model,
is a standard, widely-used interface to XML structure. The term Object Model
can mean either the set of formal properties of some data object, or a collection
of classes and APIs for accessing something. DOM involves mainly the second sense:
it is essentially an API.
But DOM is not XML, nor the complete object model of XML in the first sense (although
it constrains many things about that object model), nor the only way a programmer
can access XML data. This paper discusses an alternative way of dealing with XML in
programs, yielding the desired simplification for programmers:
-
DOM was designed generically rather than with any particular programming language
in mind. It is thus not optimized to feel native in Javascript or Python or C++ or
anything else. When mechanically implemented, it speaks with an accent. But this has
nothing to do with XML syntax or semantics, and building more native interfaces is
not that hard.
-
DOM includes two levels in the one notion of Node: A Node in itself, and its context
in a tree. While perfectly reasonable, this portmanteau approach makes DOM nodes seem
much more complex than need be. In addition, there are general features such as userData,
namespaces, isSupported, etc. to complicate matters.
-
DOM defines many subclasses of Node, corresponding to SGML syntactic constructs (comments,
PIs, documents, etc.), and for HTML the much lower level of individual element types
(p, li, i, etc.). This complicates the API to be learned; conceptually, it is as reasonable
to think of XML as including only two Node types: elements and leaves; or even 1:
Nodes, some of which contain other Nodes and some of which don't.
-
DOM is a quite low-level interface. For example, it does not provide any way to get
the 3rd p
element child of a given node; nor even the 3rd element as opposed to text nodes. The latter is particularly annoying
because text nodes are in many contexts reliably whitespace-only and irrelevant to
most or all processing (it is frustrating to write or even just call the same filtering
code so many times). Providing higher-level operations helps immensely, as the popularity
of libraries such as JQuery (JQu) attests.
Common operations on arrays and dictionaries
Arrays and dictionaries are very common; they are reasonably thought of as the
simplest collection types: one accessed by integer indices,
and one by string (usually) keys
. This glosses over many differences in theory, implementation, and use, but will
suffice for the moment. Some common operations are shown here, for a plain Javascript/JSON
array, and for the children of a DOM node (other aspects of XML and of DOM are discussed
later).
Table I
Description |
Javascript |
Javascript DOM |
Get first item |
n[0] |
n.firstChild |
Get second item |
n[1] |
n.childNodes(1) |
Items 1-3 |
n.slice(0, 3) |
n.childNodes.slice(0,3) |
"eggs" attribute" |
n.eggs |
n.getAttribute("eggs") |
two items equivalent |
n1 == n2 |
n1.isEqualNode(n2) |
two items identical |
n1 === n2 |
n1.isSameNode(n2) |
replace item 3 |
n1[2] = n3 |
n1.replaceChild(n2, n3) |
The Javascript is obviously shorter, cleaner, and perhaps most important, in keeping
with the rest of the language (ignoring the grey area where arrays and dictionaries
meet, discussed earlier). Of course there is more to a DOM Node; but these illustrate
some of the most common operations.
Already we can see a simple way to make XML structures far easier
to deal with in programming languages with even minimal collection structures: Just
use the native array and dictionary syntax instead of functions, for XML constructs
that are accessed that way. For example, instead of writing c = n.childNodes[3]
,
just say c = n[3]
; this could be done in current Javascript by making Node (or XMLNode
or whatever) be a subclass of Array, with the child-pointers as the members, and
the other data as properties. Although Javascript does not provide operator overloading,
in languages that do, other simplifications such as ==
instead of isEqualNode
can also be provided.
Such changes are part of the proposal below, but there is much more that can be done
by taking the document world's requirements for abstract collection types seriously,
and using languages that have a more robust notion and range of types.
On the implementation of abstract collection types
Computer scientists use the notion of abstract collection types to
describe various ways in which data items can be aggregated, such as the arrays and
dictionaries already discussed. Abstract
in this case means that the datatypes are characterized by their storage and access
behaviors (or topology, if you will), rather than by how they are implemented. Arrays
are distinct from hashes because one indexes items by position, and the other by name
(or more properly key
).
Most any abstract collection type can be concretely implemented using most any concrete
data structure; this is to be expected with Turing Machines. For example, arrays are
often implemented using the array
data structure (a contiguous series of equal-sized memory blocks), but we have already
seen that Javascript is an exception. Sparse arrays such as used in high-dimensionality
problems in NLP and physics commonly implement arrays using linked lists or even hashes
in order not to waste space on large numbers of empty members.
Similarly, an abstract dictionary can be implemented in a concrete array: Just put
each key and item together in an array, and then make an array out of those arrays
as shown below. Large instances might be slow to access, but provide all needed functionality:
myDict = [
[ 1,
[ "Gouda", [ "Netherlands" ] ],
[ "Wisconsin", [ "Wisconsin" ] ],
],
[ 2,
[ "Beemster", [ "Japan" ] ],
],
[ 3,
[ "Brie", [ "France", "USA" ] ],
]
]
As should be clear even from this trivial example, implementing an abstract collection
type by using a less-than-natural data structure type frequently involves
inserting extra layers. In some cases it doubles the number of layers, commonly with
alternating arrays and dictionaries. This introduces several problems:
First, it's hard to read. Many developers do not carefully align things
as in the example above, although balancing brackets and braces by eye is otherwise
difficult.
Second, debugging bracketing errors is difficult because there are so few notations
(commonly 1 or 2) being used to express so many distinctions (one per layer).
Third, there are many different ways to introduce the extra layers, so different people
will do so in different ways. Therefore, data with the same abstract structure will
show up in various concrete forms, and receiving software is unlikely to copy without
modification, compromising data portability.
A feature analysis of abstract collection types
There are several different abstract collection types beyond array and dictionary,
that have substantial functional differences. The most obvious example is the set,
which has quite different semantics because it has neither position nor identifiers,
only data items.
The Priority queue
(used to choose tasks in order of importance or urgency) is another abstract collection
type, which introduces a new feature:
Its members are accessed by priority, which is very much like a position in an array;
however, unlike with an arrays there can be any number of tasks with the same priority.
Dictionaries and sets also fail to encompass priority queues.
As before, introducing more layers is a likely if messy workaround, such as grouping
tasks by level and then grouping the levels:
myDict = [
[ "Brie", "France", 3 ],
[ "Gouda", "Netherlands", 1 ],
[ "Beemster", "Japan", 2 ],
[ "Wisconsin", "Wisconsin", 1 ],
]
The priority queue is effectively an array but with duplicate positions.
The bag
or multiset
provided in many programming libraries is, similarly, a set but with duplicate entries.
Some programming languages provide versions of some or all these
abstract collection types, which are immutable, or restricted from being modified.
When applicable, this can enable certain kinds of optimization.
In dynamic contexts queues and stacks are
distinct abstract collection types: Their items, after all, are not accessed either
by name or by position, but by rule which concern the collection as a whole. However,
in immutable contexts (such as data representation) these distinctions are seldom
relevant. Either would likely be transmitted as a simple list; the receiving application
can do what it likes.
This leads me to suggest the following features for distinguishing abstract collection
types:
-
Is the order (position) of entries a way to refer to them?
-
Are entries identified (named) by some key in addition to their value?
-
Can one position identify only one entry? For example, if you store something in myArray[3]
and then do so again, the first item is replaced. But if you store another item in
a priority queue under the same priority, the first remains.
-
Can one key identify only one entry?
-
Is the data structure mutable?
These features define a space of distinct abstract collection types.
The space is not entirely orthogonal; for example, the question of whether there can
be multiple entries at the same position is moot if the collection type is not ordered
(as dictionaries), and the question of whether there can be multiple entries with
the same name is moot if the collection type has no names (as sets).
A few more properties are commonly distinguished, which are less relevant here:
-
Must the data items in each entry all be of the same type? Arrays are sometimes divided
into homogenous
ones (in which all entries' data items must be of the same datatype, and
heterogeneous
ones (in which items can be of mixed types). For example, Python has arrays, which
are heterogeneous, and byteArrays which are homogeneous. Collection types rarely place
any restriction on entries containing equivalent (or identical) data items.
-
Is the type recursive? That is, can it have as members, items of its own type? Typically
programs allows collection types to have any types they like as members, so recursion
is the norm (as in all of JSON, Python, and XML).
The types and their properties are shown below (parenthesized names are not provided
by Python).
Table II
Name |
Position |
Named |
DupPos |
DupName |
Immutable form |
set |
0 |
0 |
– |
0 |
frozenset |
(multiset or bag) |
0 |
0 |
– |
1 |
|
dict, defaultdict |
0 |
1 |
– |
0 |
|
Counter |
0 |
1 |
– |
1 |
|
array/list, bytearray |
1 |
0 |
0 |
– |
tuple, string, u string, buffer |
priority queue |
1 |
0 |
1 |
– |
|
OrderedDict |
1 |
1 |
0 |
0 |
namedtuple |
? |
1 |
1 |
0 |
1 |
|
Python's OrderedDict
has much in common with XML child sequences.
It is a variation on dictionaries, that also remembers in what order members were
added. Members can be added with the usual od["aardvark"] = "Tubulidentata"
syntax, and accessed by the same name. It remembers the order items were added, and
can iterate in that order with for k,v in od.items()
items.
However, members cannot be set or retrieved by position, nor can they be moved around
(except by deleting each and re-adding it to move it to the end). There is not a method
to find out what position an item is in. It also cannot support multiple members with
the same name as needed for XML. So OrderedDict functions mainly like a regular dictionary,
just with a special iterator.
From this table it is obvious that there are two case missing: First, Python does
not have multisets (some other languages do). They can be simulated easily with Counter.
The last row has no Python type, and no convenient way to simulate it. Sets have neither
position nor name; arrays have position but no name; dictionaries have name but no
position; but where is the final combination? Where is an abstract collection type
that has both name and position, and can have repetitions of at least one of them?
A full-fledged abstract collection type for the final slot turns out to be
exactly the structure that XML needs, but that is hardest to simulate using only arrays
and/or dictionaries. I call it NamedArray,
and it is described in the next section.
XML collection types
XML includes a few collection types, but the specification itself doesn't
describe or formalize them as such. The attributes of an element form a dictionary.
The tokens within an attribute of type IDREFS or NMTOKENS form a homogeneous array.
But the children of a given element do not constitute any of the types discussed so
far, because both their names (element type names) and their positions are important. What are they,
and how can we best access them with typical programming-language syntax?
First, consider XML elements. They are the only Nodes (other than the special case
of Root), that are not leaves in the XML structure: that can have children. DOM provides
them as an (abstract) array known as childNodes
, so they are accessed by position. That's fine as far as it goes, especially if languages
would make them accessible via their native array syntax as shown earlier.
The basic goal here is to find a simple mapping of XML structures to the
native collection syntax and semantics of Python, much like the simple mapping of
JSON to the native collection syntax and semantics of Javascript.
In practice, many operations on XML do not access child nodes by numerical position;
getting the 20th child is rarely needed, except when iterating over all children.
Much more common is to want all the children of a given element type This defines
a basic requirement for an OrderedDict-like abstract collection type to suffice for
XML.
XML as described by DOM has an orthogonal complexity that gets in the way of this:
The distinction of nodeType and nodeName. Every element has a name such as para
or stanza
; but there are also several types of nodes that are not elements. This 2-way or 2-level
distinction complicates XML processing, so to keep the syntax and semantics simple
it would help to get rid of it. The next section deals with this.
Simplifying XML node typing
Every XML node is one of several nodeTypes, as described by DOM. The fundamental ones
are element, attribute, text, processing-instruction, root, namespace, and comment
(others have to do with physical rather than logical structure, for example CDATA
sections and entity references, which need not concern us here).
But elements themselves have types (which DOM calls nodeNames), such as para
or chapter
. The overloading of type
is confusing. The simplification proposed here begins by reducing the variety of
node-types:
-
Root is merely a special case of element.
-
Namespace is merely a special case of attribute.
-
Comments and PIs can be thought of as merely special cases of text (or the PI's can
be thought of as a special case of an empty element, in order to treat its target
as an attribute, or to accomodate the commonplace of using attribute syntax within
PIs).
Let us call the new unified node construct an Elm,
since it is largely similar to DOM Element, but subsumes the other nodes types as
well. By introducing reserved Elm names such as _TEXT
, _PI
, _COMMENT
, and _ROOT
, and by treating the text content of such (empty, leaf) Elms as a special attribute,
the inventory of node types drops to just Elms and attributes.
Conceptually, Elm leaves could treat individual characters as child nodes, rather
than strings. This changes little; in Python strings are a list type, and all the
usual list operations can apply to them anyway.
The attributes of any given Elm form a dictionary, so it is convenient to introduce
a specific node to encapsulate an element's attribute list as a whole (including both
regular and namespace attributes). This corresponds to the XML Attribute-list.
It has often been suggested that the type name of an XML element is merely a special
case of an attribute (it applies to the element as a whole, has a simple token as
value, comes from a schema-constrained namespace, etc.). So the element type can be
treated merely as an attribute, which we may name _TYPE
.
Conveniently, the attribute-list is a familiar type (dictionary), and the Elm has
all the positive properties of an array (plus the presence and use of names).
Mapping to Javascript types and syntax
Javascript at least seems to have notions of accessing array members, and of accessing
object properties. As we have seen, their syntax and semantics overlap somewhat; but
to implement an abstract collection type for XML child sequences we can maintain the
convention that the children of an Elm are identified by number and accessed via array
notation; while the attributes of an Elm are identified by name and accessed via property
notation.
This is enough to permit a much more palatable syntax for dealing with XML structures
in Javascript. Such an implementation may want to support the existing DOM methods
and properties as well, to ease the way for existing users.
Mapping to Python types and syntax
Given the same unification of subtypes of Node just discussed, the collection of a
node's children fits neatly into the taxonomy of abstract collection types discussed
earlier, as a more functional analog to OrderedDict. I call it NamedArray
. The basic properties of this new type are:
-
Order of members is maintained.
-
All the usual array operations are supported, such as accessing members by position,
splicing, etc. (sort()
probably won't be used much for XML, but may be for other applications of NamedArray).
Operations that insert members by position (such as append()
) take an extra parameter for the name.
-
Like in regular dictionaries, each member of a NamedArray has a name (which is a string
for the XML case, but which Python in general would allow to be of any hashable type).
-
Unlike in regular dictionaries, NamedArrays may have multiple entries with the same
name (at different positions).
-
All the usual dictionary operations are supported, such as accessing members by name.
However, the semantics are slightly different because duplicate names are permitted.
Assigning to a member solely by name replaces the first member with that name; so
does deleting by name. Members can also be referred to by name and number simultaneously,
which refers to the n-th member with the given name.
-
The reserved name *
matches any Elm node type name, but not any of the reserved (_
-initial) names.
The specific syntax for these operations is described below.
This provides the most commonly-needed operations for XML. Child nodes can be iterated
over by number, and the sequence can be edited by the usual splice, append, and other
list operations (as in DOM, but with more natural syntax). However, it is equally
easy to iterate over all the comments, PIs, text nodes, elements, or elements of a
given type, or simply slice out the elements of a given type. These are extremely
common requirements in XML processing.
This idea led the author to investigate what it takes to support an additional collection
type in Python, with the native syntactic niceties.
Python doesn’t have only the array[n]
notation for arrays. It also provides slicing such as array[start:end]
(note that the 2nd argument specifies the entry after the last one desired — this is commonly convenient). It also provides dict[key]
notation, and a separate notion of object properties, accessed like object.propname
.
Finally, Python has an unusual but very useful array[start:end:interval]
notation. For example, array[0:20:2]
retrieves every other item from among the array's first 20 entries. This is said
to be heavily used with the numpy/scipy scientific and math packages.
As with the abstract collection types, there turns out to be a natural combination
of these
features which is not built in to Python, but is an excellent fit for XML:
Table III
Description |
DOM |
Pythonish |
Get first child |
n.firstChild |
n[1] |
Get second child |
n.childNodes(2) |
n[2] |
Get last child |
n.lastChild |
n[-1] |
children 1-3 |
n.childNodes.slice(0,3) |
n[1, 4) |
two nodes equivalent |
n1.isEqualNode(n2) |
n1 == n2 |
two nodes identical |
n1.isSameNode(n2) |
n1 === n2 |
replace third child |
n1.replaceChild(n2, n3) |
n1[2] = n3 |
get first "p" child |
|
n["p"] |
get third "p" child |
|
n["p":3] |
get last "p" child |
|
n["p":-1] |
get the first 20 "p" children |
|
n["p":1:21] |
get all "p" children from among the first 20 children |
|
n[1:21:"p"] |
walk down by types |
|
doc[“chap”:3][“sec”:2]["p"] |
It is often helpful to be able to get all child Nodes that are in fact elements; as
noted earlier n["*"]
is
defined to do this. The reserved Elm types _TEXT
, _PI
, _COMMENT
, _ROOT
, and _ATTLIST
are simply names when XML uses NamedArray, so NamedArray itself has no awareness
of XML convention.
NamedArray is mutable, and does not require homogeneity. The members can be data of
whatever type. In an XML document many will be Elms, some of which contain NamedArrays
of its own; leaf nodes, however, could be inserted as raw strings, or as separate
objects corresponding to DOM's Text, Comment, and PI nodes; NamedArray doesn't care.
This seems, at least to this author, quite intuitive. The various slicing options
make the interface considerably higher-level, and the general applicability of []
considerably reduces the number of methods and properties required to achieve DOM's
functionality.
The subtlest detail is
perhaps the distinction between n["p":1:21]
and n[1:21:"p"]
. The former retrieves all children named p
, and then retrieves the first 20 of those; the second instead retrieves the first
20 children, and then extracts
all of those that are named p
. That is, the slicing operations go from left to right. This may seem familiar to
users of XPath's successive []
filters.
The rest of the XML Nodes
So far we have modeled only the child-sequence requirements of XML, using a new abstract
collection type that is not XML-specific, and that fits neatly into the pattern of
Python's already-existing types. However, there is more to an XML node than a NamedArray
of children: it also has attributes, references to neighboring nodes, properties related
to namespaces, and a variety of methods.
To begin with attributes, in Javascript they are probably best implemented as properties;
however, this can lead to name clashes in relation to properties that are part of
the Node implementation (nextSibling
, etc.). Another option, which seems better when it is possible, is to keep them all
in a dictionary. The dictionary in turn can be a property, or as suggested below,
the first member of the NamedArray.
JSOX has a full-fledged dictionary representing the attribute list as a whole.
Because the Elm type (which subsumes DOM nodeType and nodeName) is always present,
and considered a reserved attribute, every Elm has at least one member in that dictionary.
I favor placing the Attlist dictionary at position 0 in the child sequence, considering
it a node in the same way that PIs, comment, and text portions are nodes. Placing
it at position 0 fits with where it appears in XML syntax, while also hinting that
it is special.
With slicing operations, it can be included or excluded at desired. I see this as
a rather nice compromise between (a) most programming languages using 0-based arrays,
and (b) many people thinking of the first element as #1. Those who disagree can put the attributes elsewhere with no substantial effect
on the rest of this proposal.
The other DOM properties (parentNode
, etc.) are provided as properties as usual, as are more
Python-native synonyms: cloneNode (copy), textContent (toString),
appendChild (append), hasChildNodes and hasAttribute (in), removeAttribute (del),
removeChild (del), etc.
Implementation
In Python most operators are just shorthand for function calls, and one can override
the function definitions. The []
notation invokes __getitem__()
. Python already accepts 3 arguments to the bracket notation, and tests show that
any of them can be a string instead of an integer. Thus, the semantics just described
are easily implemented:
def __getitem__(self, n1, n2=None, n3=None):
t1 = t2 = t3 = None
nargs = 1; t1 = type(n1)
if (n2 != None): nargs = 2; t2 = type(n2)
if (n3 != None): nargs = 3; t3 = type(n3)
if (nargs==1):
if (t1==StringType):
if (name == '*'): return(self.getElements());
return(self.getByName(t1)[1])
else:
return(self._subdata[int(t1)])
elif (nargs==2):
if (t1==IntType):
if (t2==IntType):
return(self._subdata[t1:t2])
else:
return(self.getByName(t2)[t1])
else:
if (t2==IntType):
return(self.getByName(t1)[t2])
else:
raise(TypeError)
return(None)
else: # nargs==3
if (t1==IntType and t2==IntType):
if (t3==IntType):
return(self._subdata[t1:t2:t3])
else:
return(self._subdata[t1:t2].getByName(t3))
elif (t2==IntType and t3==IntType):
return(self.getByName(t1)._subdata[t2:t3])
else:
raise(TypeError)
return(None)
Extensibility
The same kinds of indexing along all the XPath axes (Cla99) would be similarly useful, perhaps exposed
as lazily-evaluated properties of type array-of-Node. Some form of lazy evaluation
or indexing is important here, lest enumerating all preceding or following elements
be too expensive. It may be that Python's buffer
type, which supports read-only access to successive subsequences of a sequence (reminiscent
of Scheme cdr'ing
down a list) would provide an effective approach.
Although XML's child sequences are modelled well by NamedArray models, documents sometimes
contain particular components that are logically other types. For example, the rows
in a table body form an array, while the members of a definition list or glossary
may form a dictionary, and a shopping list might form a set. Often the difference
is visible only in a schema or documentation:
<!ELEMENT tbody (tr*)>
<!ELEMENT glossary (term, definition)*>
All these can be modeled via NamedArray, but it is more perspicuous to provides no
syntactic way to indicate that a particular element (type or instance) is conceptually
a more constrained type, such as array, set, or dict (schema languages can specify
such semantics, of course). This could be a useful addition both to add clarity to
XML data, and to aid performance and perspicuity to implementations.
Were XML not so well established and so widely and compatibly implemented, this distinction
could be
accomplished by some minor syntactic enhancements, such as these which trade on the
familiarity of arrays and dicts:
<x> — for the usual XML childSequence
<[x]> — to indicate the children constitute a logical array
<{x}> — to indicate the children constitute a logical dictionary
I find this intuitive, but in addition to not being XML-compatible, one runs out of
punctuation too soon — what would sets use?
<<x>>
?
<⊂x⊂>
?
<〔x〕>
?
A better solution on both counts is to define a namespace in which the abstract types
are enumerated (ac
for abstract collection
):
<x ac:type="childSequence">
<x ac:type="array">
<x ac:type="dict">
This allows complete extensibility, even for specifying finer properties such
as homogeneity, recursion, etc.
Documents
We now come to the most widespread XML use case for which JSON does not suffice: documents.
Let us take the Ἀργοναυτικά or Medea as an example (both, after all, have much to do with Jason). As with most documents,
such works contain long sequences of content objects — many of them verse lines, but
intermixed with other types: quotations, stanza and section boundaries, and in some
editions notes, links, catchwords, illustrations, stage directions, and so forth.
It should be evident to the reader that re-ordering the parts of a book or chapter
(or a Web page) in the manner of a JSON hash table is rather worse than mere inconvenience
(such as seen with the examples above). So when JSON replaces
XML, what is to be done? It is a commonplace that perhaps 80-90% even of corporate
data, resides in documents rather than databases; but whatever that precise figure
may be, Appolonius of Rhodes, Clement of Alexandria, and Harry of Hogwarts are not
going away.
In JSON only arrays preserve order, so they must be used. But an array of strings
is entirely inadequate; one must reconstruct the entire hierarchy of books, chapters,
sections, stanzas, verse lines, footnotes, and whatever else, via arrays. Then, since
arrays provide no place to label their entries, each entry must become a collection
of its own; perhaps a tuple such as ["stanza", "...." ]
, or a dictionary such as {"label":"stanza", "content":"...." }
.
The latter has the advantage of supporting additional properties, whether for formatting,
linking and notes, or meta-textual information such as that the stanza is missing
from certain manuscripts. As always, indirection can be used to make an infinite variety
of hacks possible; but once we have addressed even the most rudimentary demands of
actual documents, JSON's limitation to arrays and hashes forces us into unintuitive,
needlessly complex representations.
I carried out such a conversion for the much simpler case of Twitter data. By taking
typical advantage of attributes, the data at once becomes:
-
More readable (this, of course, is always subjective).
-
Easily processable for multiple purposes (linguistic processing can ignore attributes,
style calculation can ignore content, rendering can use everything, ...).
-
Usable with generic software (because the notion of labeled and yet ordered items is required for all XML software, but for no JSON software).
-
Smaller. The XML ends up about 10% smaller than the JSON, with names and whitespace for pretty-printing kept constant. This is partly because
attribute names do not need to be quoted, unlike alphanumeric JSON dictionary keys.
-
Validatable. This is a particularly important issue. Although it would be possible
to design schema languages for JSON, none seem to exist at present. Thus, while it
is very easy to save and load JSON from Javascript, it is not at all easy to describe
exactly what structure you are sending, or to check that it's right. This rules out
JSON for mission-critical applications, unless a special process is created for checking.
XML has been called "verbose."
But this is often simply wrong. First, skilled rather than mechanical use of XML is
often syntactically more compact than JSON.
For example, schemas for XML support default values,
and in many data-intensive applications many entries have the default
value (that's why it's the default). Omitting them can
save a lot; but in JSON, unlike XML, defaults must be implemented at a higher level.
Second, data compression works very effectively on frequently-repeated things like
tags. An email exchange on the xml-dev list long ago addressed this in detail (discussed
by Edd Dumbill at http://www.xml.com/pub/a/2004/12/15/deviant.html).
Third, safe use of JSON often requires processing it twice. Because JSON is essentially
a subset of Javascript code, it is tempting to simply "eval" it in order to get it
parsed. Doing so opens the door to injection attacks. The easiest way to avoid this
is to run a separate regex match first (Cro06).
Summary
JSON has established itself as an easy way to get data in and out of Javascript programs;
that is a good and useful thing. This paper, however, contends that XML's reputation
for relative complexity or difficulty in that context arose in large measure because
Javascript mechanically translated the language-agnostic Document Object Model to
become it's API for XML data, rather than choosing syntax and signatures tailored
to the Javascript sprachgefühl.
Javascript programmers accessing XML structures write, for example, n.childNodes[3].childNodes[2].getAttribute('class')
when they would quite sensibly rather write n[3][2].attrs['class']
— countless times.
There is no reason this has to be the case. Even in Javascript, with a very small
inventory of abstract collection types and little ability to extend syntax (for example
by operator overloading), there is no reason not to provide the latter, or some similar
more native and less annoying syntax.
Python provides operator overloading, and so can do even better at providing "native"
feel for non-Python-derived formats. Such a Python interface to XML structures is
presented. It provides far higher-level functionality than DOM, such as filtering
nodes by nodeType and nodeName, slicing and splicing groups of children, and so on.
The code to implement it is so small that the main portion is included here.
However, a deeper question arises when one considers just what abstract collection
type XML data represents. Most XML constructs constitute ordinary arrays or dictionaries,
the most basic and widespread collection types. However, the sequence of children
of a given node does not, in part because it is organized and accessed by both position and name. An analysis of Python's abstract collection types into their distinctive
features reveals that XML child sequences fit
into a symmetry gap in the pattern. This paper presents a NamedArray
type and implementation to fill that gap.
Some have suggested that JSON will "replace" XML, supposedly because it is simpler
but still adequate. But any difference in simplicity is minor once you have an appropriate
API; and JSON has not yet shown itself adequate for XML's primary application space:
documents. Examples that accompany claims of JSON's adequacy uniformly lack the most
salient distinctive characteristics of documents. As one example (at the time of writing)
Wikipedia's article "Document-oriented databases" gave (only) these two examples of
"documents":
{
FirstName: "Bob",
Address: "5 Oak St.",
Hobby: "sailing"
}
{
FirstName: "Jonathan",
Address: "15 Wanamassa Point Road",
Children: [
{Name: "Michael", Age: 10},
{Name: "Jennifer", Age: 8},
{Name: "Samantha", Age: 5},
{Name: "Elena", Age: 2}
]
}
The second at least contains one example of repetition, and represents it via hierarchy. This is more than many published
examples,
but there is not a hint of heterogeneous child sequences; of the same component type
appearing at multiple levels; of non-trivial nesting; of anything like the distinction
of sub-parts vs. properties of components; nor mention of schemas, constraints, defaults,
inheritance, or validation issues.
Such examples do nothing to strengthen JSON's case for adequacy in the document world,
however handy it is in the word of simple non-document data. An adequate system for
documents must take real account of the fundamental characteristics of documents.
If the simpler concepts and structures of regular and associative arrays sufficed
for the world of documents and literature, we'd all have saved ourselves a lot of
trouble by talking in SQL long ago.
Thus I conclude that JSON, while an excellent idea for the niche of config files,
transmission of simple / tabular / homogeneous data structure, is insufficient to
handle documents perspicuously. Until it does, the XML world need have no anxiety
on its account; though we should undertake a more serious effort to make XML more
accessible to developers new to the field, or in neighboring fields that interacting
with XML only peripherally.
References
[Bra08] Bray, Tim, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau.
Extensible Markup Language (XML) 1.0 (Fifth Edition).
W3C Recommendation 26 November 2008.
http://www.w3.org/TR/REC-xml/
[Cla99] Clark, James and Steve DeRose. 1999.
XML Path Language (XPath). Version 1.0. W3C Recommendation 16 November 1999.
http://www.w3.org/TR/xpath/
[Coo87] Coombs, James H., Allen H. Renear, Steven J. DeRose. 1987.
Markup systems and the future of scholarly text processing.
Communications of the ACM 30(11): 933-947.
http://dl.acm.org/citation.cfm?id=32209. doi:https://doi.org/10.1145/32206.32209.
[DeR90] DeRose, Steven, David G Durand, Elli Mylonas, and Allen Renear. Winter 1990.
What is text, really?
Journal of Computing in Higher Education
1(2): 3-26. Springer US. ISSN 1042-1726.
http://link.springer.com/article/10.1007%2FBF02941632. doi:https://doi.org/10.1007/BF02941632.
[Dur96] Durand, David G., Elli Mylonas, Steven DeRose. 1996.
What Should Markup Really Be:
Applying theories of text to the design of markup systems.
ALLC/ACH.
[Cro06] Crockford, D. 2006.
The application/json Media Type for JavaScript Object Notation (JSON)
IETF RFC 4627.
http://www.ietf.org/rfc/rfc4627.txt
[ECM11] ECMA International.
ECMAScript® Language Specification.
Standard ECMA-262, 5.1 Edition / June 2011.
Geneva: Ecma International.
http://www.ecma-international.org/ecma-262/5.1/
[Gri68]
Griswold, R. E., J. F. Poage, I. P. Polonsky. 1968, 1971.
THE SNOBOL4 PROGRAMMING LANGUAGE. 2nd ed.
Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
http://www.math.bas.bg/bantchev/place/snobol/gpp-2ed.pdf
[JQu]
JQuery API.
http://api.jquery.com
[LeH04] Le Hors, Arnaud, et al. 07 April 2004.
Document Object Model (DOM) Level 3 Core Specification.
Version 1.0. W3C Recommendation 07 April 2004.
http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/
[Way01] Way, Arthur Sanders (tr). 1901.
Apollonius Rhodius. Argonautica.
London: J. M. Dent and Co.
https://archive.org/details/taleofargonauts00apol