Introductory matters

XML and JSON differ in how they punctuate data; this is obvious. Perhaps less obvious is how they differ in the topology and semantics they represent. They share a number of basic structures, and from 50,000 feet up they're both hierarchical. This paper is concerned first with the superficial problem of syntax, where JSON has achieved a reputation in some quarters for being easier to use than XML; and second with the subtler but ultimately more important problem of data modeling.

This papers intends to demonstrate in a practical way, that JSON's seeming advantage over XML in programming convenience actually has little to do with the languages, but much to do with how interfaces to them have been set up in various programming languages. I demonstrate a programming interface to XML data, that is far simpler to use than the APIs usually derived from DOM (). In addition, the API supports the stronger semantics and topology of XML, and so works not only for the simple cases JSON addresses, but also the countless cases where XML's richer selection of abstract data types is needed.

This paper has two basic points to make:

  1. Programming languages can make it far easier to deal with XML, by using their native facilities for abstract collection types, instead of mechanically implementing the DOM API, which is not optimized for any particular language.

  2. XML brings to the fore a particular abstract collection type that is not widely discussed, but follows logically from an analysis of other common types. Although this type may not arise so commonly in traditional data tasks, it does arise all the time in dealing with documents. Trying to manage document-shaped information with arrays and hashes is (of course) possible, but exceedingly awkward. A much better solution is to implement the needed type, and make it as natural to deal with in programming, as arrays and hashes are now.

Asking a developer in the 1950s about hash tables (or dictionaries or associative arrays), when all they had in their tools were arrays, would likely have provoked responses such as: Much too complicated. Doesn't fit the way my tools work. Programs just don't have to deal with that kind of thing very much. Too much stuff in there. You can build that all with stuff I've already got. Clever, but not worth the effort. Or even, If the experts who built Fortran didn't see it as necessary, why should I? These were also common responses to Unicode, to object-orientation, and to multi-threading. They are much the same arguments made now against XML. And although none of these features has become completely ubiquitous, few developers would oppose built-in support for them.[1]

As with hashes, Unicode, and other advances, I think the key is to build support into the fabric of programming. When developers have to understand Unicode to make their code Unicode-safe, they're unlikely to put in the effort — they have better things to do, or at least, things much more obviously related to the task at hand. Rather, the few character-set specialists need to build Unicode-awareness and Unicode-safety into the system libraries and utilities (just try feeding Unicode to the usual *nix commands!), so they fade into the background and become just the expected behavior. Then developers with better things to do, will use them — and (if the Unicode folks do their implementations right) will notice (perhaps long after) that their systems work better. Meanwhile, their managers may notice that they can sell in new areas with minimal new development costs.

I begin, then, with a look at basic abstract collection types, and at JSON (Cro06), Javascript (ECM11), XML (Bra08), and DOM (LeH04) in light of the types they use and the syntax commonly provided for getting at them. The main portion of the paper then examines the more general notion of abstract collection types, some data's need (and XML's support) for a type that JSON does not provide, and the design and implementation of an interface in Python, to that additional abstract collection type. Given that abstract collection type (which is useful for many things besides XML), a syntactically-native, language-appropriate, easy-to-use interface to XML's more general data becomes easy.

On very basic data and collection types

Atomic datatypes such as numbers, strings, and Booleans, are ubiquitous in programming, and the problems they pose are, if annoying, at least familiar: False vs. #F vs. 0 vs. nil; 99 vs. 99 vs. "099"; and a variety of challenges that arise from mathematical and real-world characteristics such as precision, spelling complexity and variability, etc.

Arrays are lists of data items, with the items commonly of atomic types, but in general of any types. Arrays are organized merely by position, generally using non-negative integers. High-level programming languages have included them for a very long time, and the syntax for accessing them is generally similar, although programming languages avoid the literal subscripts we all learned in early math classes. Developers are used to accommodating punctuation differences as they change from one language to another: $myArray[0] vs. myArray(1) vs. item 7 of myArray.

Dictionaries, hashes, or associative arrays instead organize their members via a second piece of data for each: a name or key[2]. They are not so ubiquitously built in as arrays. Their punctuation similarly varies: h = [ a=>1, b=>2 ] vs. h = { 'a':1, n:'2' } vs. h["a"]=1, h["b"]=2 vs. (hash ('a 1) ('b 2)), and so on).

With minor variations, these types are simple and widespread. Many programming languages provide special syntax for each of them; thus, if you know the programming language, you know the syntax for using these types. That is a nice advantage when you deal with data that is accurately modeled with these types.

It is important to keep in mind that there are several levels of abstraction here. A file containing [ 1, 2, 3 ] is not an array; it represents one (it may actually be holes in a piece of compressed wood-pulp); or it might represent a vector in a left-handed non-orthogonal 3-space. It might be loaded into 12 bytes of RAM, intended to be interpreted as 3 32-bit (signed?) integers; or Javascript commonly represents arrays as hashes, by converting the indices from integers to strings that it then hashes. Russel and others represent an integer n as a set nested n-levels deep. And in the end, all of these are merely representations of an abstract notion of quantity that cannot be directly perceived. For present purposes we need not contend with all this ontological complexity; but the differences between these notions are central:

  1. The representation of data via a sequence of characters in a text file. For example { 'a':1, 'b':2, 'c':"\u2172" }.

  2. The data structure created from the representation via loading, composed of constructs implemented by those who implement a given programming language. For example, a contiguous sequence of 12 bytes, preceded by 8 extra bytes of overhead such as the width (4) of the entries, the total length of the area, who owns it, etc.

  3. The interface provided to those data structures for application programmers. For example, myArray, or myArray[1] or myArray.1 for the second item.

On JSON and Javascript's abstract collection types

JSON (JavaScript Object Notation) is a way of representing certain data collections as text files. JSON, for most purposes, simply is a part of Javascript: a valid JSON file can be pasted in as the right-hand-side of a Javascript assignment statement, and should just work. The reverse is true for simple cases, though not in general. In my opinion, this is by far the largest factor in JSON's reputed ease of use:

  1. JSON supports just the same abstract collection types as Javascript: array (sort of) and dictionary. Thus, the mapping from JSON's idea of how data can be structured, to Javascript's, is trivial; there essentially isn't a mapping at all.

  2. It supports most of the same scalar types: integer, floating point number, Boolean, and string.

  3. Its syntax is almost the same as expressions in Javascript. One exception is that one can't interpolate variables, such as [ 1, myVar, "c",... ], so one cannot factor out heavily-reused pieces of data except by introducing conventions outside JSON's awareness. Thus JSON as a representation is slightly less capable than Javascript; but within it's capabilities it is Javascript.

  4. Once loading some JSON is finished (which, like saving, can be a one-step process, for example via eval(...)), the programmer accesses any part of the (formerly JSON) data structure just as if it were the same data declared as native constants in Javascript.

These are all legitimate virtues of JSON. They are greatest for the Javascript programmer, but similar advantages apply in many programming languages because similar basic types are common.

Javascript's (hence JSON's) selection of abstract collection types, while simple, is even smaller than programmers are used to dealing with. If anything, it is slightly too small, because there is considerable confusion about Javascript's arrays and dictionaries.

Javascript objects are essentially dictionaries, in that they can have properties which are unordered and are accessed by name, such as obj.myProp. Thus, there is not independent dictionary type.

There is an Array constructor, which makes objects that support the usual array/list operations (including sort), and whose members can be accessed by a[2] syntax. They can be initialized with syntax like a = [ 1, 2, 3 ].

So far, this is mostly as one would expect. However, an object's members can also be accessed with array syntax: obj["myProp"]. This accesses the same property as obj.myProp]. But if you assign obj[99] = "hello", the integer 99 is converted to the string 99, and obj gains a property named 99 (see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array). One could access it as obj.99 except that Javascript syntax does not allow numbers as operands of .. Likewise, if assigning a["2"] = "bar" affects the same member as a[2] = "bar".

The killer, however, is this: if you explicitly create an array, you can still index into it with non-integers: a["foo"] = "bar". You can set and get such members. However, the value of a.length does not change unless those strings; it is defined as the number (property name) of the last non-negative-number-named element, plus 1. Obviously that means that arrays start at 0. You can happily assign to other numeric indices (quoted or not): x[3.0] = 1; x[0.3e3] = 1; x[3.000000000000000001] = 1; x[3.1] = 0; x[-1] = 1; . However, only the first three of those five become part of the array; the others quietly become properties instead (therefore still accessible as, for example, x[3.1]

As far as I can tell, this amounts to Javascript having only dictionaries, not arrays, as data structures, but both as API (the Array constructor makes an object with extra properties and methods such as length, sort, etc.). Array methods work so long as you behave just so. For example, if you make an Array and assign to a["foo"] or a[3.1], it does create a.foo; but (for example) a.length does not increase.

Implementing other abstract collection types using only dictionaries is (as with all Turing-computable matters) possible, but awkward compared to using types whose semantics and consequent APIs directly match the problem at hand. In Javascript's case, Arrays are fairly intuitive so long as you carefully follow the implicit rules (rules that are not checked by the language at all). If you bump up against the edges, things get messy.

On XML's abstract collection types

XML must handle documents; that is its reason for existence. While it has proven useful for many other things, such as configuration files, database interchange, and even sensor data transfer in automative systems, handling all those at the cost of being effective for documents would be failure.

So what is a document, really? Document structuring and representation clearly require at least three things (Coo87, DeR90):

  1. order (this paragraph comes before that one)

  2. hierarchy (this chapter contains that section, or this strophe contains that stanza)

  3. a wide variety of names

Disregarding the order of speeches in one scene of a play, or of the steps in a repair procedure, both lead to absurdity. That is not to say that everything in XML is ordered; attributes are not, by definition; some bibliography schemas care not how the components of each entry are ordered; and there are many other cases. But XML does need to support order to support documents.

Hierarchy is also essential — it can makes a great deal of difference whether this footnote (or quotation, or other object) is part of one section (or pericope, or procedure), vs. another. Again, not everything is very hierarchical; front-matter may simply be a long, flat list of metadata items.

Finally, the type-names of various components are extremely important. Aircraft repair manuals contain a variety of warnings, some of which may have mandatory layout requirements so they cannot be lost in fine print. Citations and bibliographies are very important in some genres. Sacred texts are very careful about the distinction between (canonical) content and (merely convenient) annotations. Document representations have reified such linguistic categories, so they can be used for all kinds of processing: Lay out footnotes this way; Retrieve all the distinct conference proceedings from the bibliography; Check that all the verses are present in this translation of the Gospel of John; Build a table of contents from the first 'title' element in each 'div' not nested more than 3 divs deep; and so on.

Element type names are handles by which humans organize their thoughts into writing; they are also the usual keys by which processing is triggered, much like class and method names in programming. This function can be moved around syntactically, but remains central (there is a reason that <span class="p"> looks strange).

At a minimum, any representation that purports to handle documents needs to support these capabilities. Child-nodes must be accessible by both name and position, and multiple children of the same type must be permissible.

An example of a trivial document portion

It is sadly true that those who argue that JSON, SQL, or other data formats are perfectly adequate for documents, rarely show any examples of realistic documents. Rather, the examples virtually always lack hierarchy; even more frequently lack mixed content (an extraordinarily important omission); often have no relevant order; and usually include nothing notionally like attributes (such as properties of elements as wholes; the occasional exception is IDs).

Such examples are unconvincing.

CSV and its kin cannot handle documents well because they lack hierarchy; because (in theory though seldom in practice) they lack order), and because while they may have names (for fields) they lack the ability to put named objects in meaningful orders. About the best you can do is (Way01, sec. 10):

"h1", "10: The Argonauts"
...
"p", But the hero essayed to hush their laments and assuage their pain
With words of cheer, and he spake,"
"quo", "Take up my war-array,"
"p-resume", "To the thralls, and with downcast eyes did these in silence obey."

Even in this tiny and unusually simple snippet there are many problems: After the quotation the containing paragraph ought to continue on, but because there is no hierarchy some hack must be used to re-start it. The same thing would happen if we tried to separate the section-heading's number and title; reality is far more complex.

Documents require an abstract collection type that permits two important kinds of retrieval at every level of a hierarchy: By order (for global processing such as rendering, indexing, and the like); as well as by position (for many analytic and transformation processes). Moreover, these need to work together. Users must be able to refer to things like the 3rd footnote in chapter 4, the last word of each speech attributed to Medea, all the images, and so on.

JSON is vastly more capable than CSV, and so can make a more credible attempt to represent documents; the presence of hierarchy is particularly helpful. But the unavailability of collections whose members are both ordered and named still makes it awkward, and to my knowledge JSON has never been seriously used for substantial documents, such as books, articles, manuals, and so on. It is hard to see a general solution much cleaner than this, because arrays are needed to keep order, while something else is needed to distinguish attributes (which form a dictionary) from sub-elements and text:

{ "book": [
  { "section": [
    { "title": [
      { "n": 10 },
      "The Argonauts"
    ]},
    { "p": [
      "But the hero essayed to hush their laments and assuage their pain
       With words of cheer, and he spake, ",
      { "quo": [
        "Take up my war-array,"
      ]},
      " To the thralls, and with downcast eyes did these in silence obey."
    ]}
  ]}
]}

This does represent the needed relationships of precedence and containment; but it is far less readable and far more error-prone than XML.

<book>
  <section>
    <title><n>10</n> The Argonauts</title>
    <p>But the hero essayed to hush their laments and assuage their pain
    With words of cheer, and he spake,
    <quo>Take up my war-array,</quo>
    To the thralls, and with downcast eyes did these in silence obey.</p>
  </section>
</book>

Having an abstract collection type that actually corresponds to document topology is simply cleaner. A programming language can provide an appropriate interface to that abstract collection type, and thus enable a much clearer and easier interface to XML than hammering XML-shaped pegs into JSON-shaped holes (however nice the latter's fit for JSON-shaped pegs may be).

Python has a rich collection of abstract collection types; rich enough to reveal that the type XML most needs, while not built in, fits into a slot implied by the relationships of abstract collection types that are already available in Python. This type's properties are predicted by symmetry: the same distinctive properties displayed by types already observed can be combined in another way.

Defining the missing type is one small step for Python. But having taken that step, XML can be treated in Python much as JSON is treated in Javascript. An implementation in this form makes use of XML in Python nearly as trivial as use of JSON in Javascript. And although support for abstract collection types in Python is particularly strong, analogous APIs for XML can be made with little more work in other languages, vastly simplifying use of XML for developers in general.

On DOM

DOM, the Document Object Model, is a standard, widely-used interface to XML structure. The term Object Model can mean either the set of formal properties of some data object, or a collection of classes and APIs for accessing something. DOM involves mainly the second sense: it is essentially an API.

But DOM is not XML, nor the complete object model of XML in the first sense (although it constrains many things about that object model), nor the only way a programmer can access XML data. This paper discusses an alternative way of dealing with XML in programs, yielding the desired simplification for programmers:

  1. DOM was designed generically rather than with any particular programming language in mind. It is thus not optimized to feel native in Javascript or Python or C++ or anything else. When mechanically implemented, it speaks with an accent. But this has nothing to do with XML syntax or semantics, and building more native interfaces is not that hard.

  2. DOM includes two levels in the one notion of Node: A Node in itself, and its context in a tree. While perfectly reasonable, this portmanteau approach makes DOM nodes seem much more complex than need be. In addition, there are general features such as userData, namespaces, isSupported, etc. to complicate matters.

  3. DOM defines many subclasses of Node, corresponding to SGML syntactic constructs (comments, PIs, documents, etc.), and for HTML the much lower level of individual element types (p, li, i, etc.). This complicates the API to be learned; conceptually, it is as reasonable to think of XML as including only two Node types: elements and leaves; or even 1: Nodes, some of which contain other Nodes and some of which don't.

  4. DOM is a quite low-level interface. For example, it does not provide any way to get the 3rd p element child of a given node; nor even the 3rd element as opposed to text nodes. The latter is particularly annoying because text nodes are in many contexts reliably whitespace-only and irrelevant to most or all processing (it is frustrating to write or even just call the same filtering code so many times). Providing higher-level operations helps immensely, as the popularity of libraries such as JQuery (JQu) attests.

Common operations on arrays and dictionaries

Arrays and dictionaries are very common; they are reasonably thought of as the simplest collection types: one accessed by integer indices, and one by string (usually) keys. This glosses over many differences in theory, implementation, and use, but will suffice for the moment. Some common operations are shown here, for a plain Javascript/JSON array, and for the children of a DOM node (other aspects of XML and of DOM are discussed later).

Table I

Description Javascript Javascript DOM
Get first item n[0] n.firstChild
Get second item n[1] n.childNodes(1)
Items 1-3 n.slice(0, 3) n.childNodes.slice(0,3)
"eggs" attribute" n.eggs n.getAttribute("eggs")
two items equivalent n1 == n2 n1.isEqualNode(n2)
two items identical n1 === n2 n1.isSameNode(n2)
replace item 3 n1[2] = n3 n1.replaceChild(n2, n3)

The Javascript is obviously shorter, cleaner, and perhaps most important, in keeping with the rest of the language (ignoring the grey area where arrays and dictionaries meet, discussed earlier). Of course there is more to a DOM Node; but these illustrate some of the most common operations.

Already we can see a simple way to make XML structures far easier to deal with in programming languages with even minimal collection structures: Just use the native array and dictionary syntax instead of functions, for XML constructs that are accessed that way. For example, instead of writing c = n.childNodes[3], just say c = n[3]; this could be done in current Javascript by making Node (or XMLNode or whatever) be a subclass of Array, with the child-pointers as the members, and the other data as properties. Although Javascript does not provide operator overloading, in languages that do, other simplifications such as == instead of isEqualNode can also be provided.

Such changes are part of the proposal below, but there is much more that can be done by taking the document world's requirements for abstract collection types seriously, and using languages that have a more robust notion and range of types.

On the implementation of abstract collection types

Computer scientists use the notion of abstract collection types to describe various ways in which data items can be aggregated, such as the arrays and dictionaries already discussed. Abstract in this case means that the datatypes are characterized by their storage and access behaviors (or topology, if you will), rather than by how they are implemented. Arrays are distinct from hashes because one indexes items by position, and the other by name (or more properly key).

Most any abstract collection type can be concretely implemented using most any concrete data structure; this is to be expected with Turing Machines. For example, arrays are often implemented using the array data structure (a contiguous series of equal-sized memory blocks), but we have already seen that Javascript is an exception. Sparse arrays such as used in high-dimensionality problems in NLP and physics commonly implement arrays using linked lists or even hashes in order not to waste space on large numbers of empty members.

Similarly, an abstract dictionary can be implemented in a concrete array: Just put each key and item together in an array, and then make an array out of those arrays as shown below. Large instances might be slow to access, but provide all needed functionality:

myDict = [
    [ 1,
        [ "Gouda",     [ "Netherlands" ] ],
        [ "Wisconsin", [ "Wisconsin"   ] ],
    ],
    [ 2,
        [ "Beemster",  [ "Japan"       ] ],
    ],
    [ 3,
        [ "Brie",      [ "France", "USA" ] ],
    ]
]

As should be clear even from this trivial example, implementing an abstract collection type by using a less-than-natural data structure type frequently involves inserting extra layers. In some cases it doubles the number of layers, commonly with alternating arrays and dictionaries. This introduces several problems:

First, it's hard to read. Many developers do not carefully align things as in the example above, although balancing brackets and braces by eye is otherwise difficult.

Second, debugging bracketing errors is difficult because there are so few notations (commonly 1 or 2) being used to express so many distinctions (one per layer).

Third, there are many different ways to introduce the extra layers, so different people will do so in different ways. Therefore, data with the same abstract structure will show up in various concrete forms, and receiving software is unlikely to copy without modification, compromising data portability.

A feature analysis of abstract collection types

There are several different abstract collection types beyond array and dictionary, that have substantial functional differences. The most obvious example is the set, which has quite different semantics because it has neither position nor identifiers, only data items.

The Priority queue (used to choose tasks in order of importance or urgency) is another abstract collection type, which introduces a new feature: Its members are accessed by priority, which is very much like a position in an array; however, unlike with an arrays there can be any number of tasks with the same priority. Dictionaries and sets also fail to encompass priority queues.

As before, introducing more layers is a likely if messy workaround, such as grouping tasks by level and then grouping the levels:

myDict = [
    [ "Brie",       "France",      3 ],
    [ "Gouda",      "Netherlands", 1 ],
    [ "Beemster",   "Japan",       2 ],
    [ "Wisconsin",  "Wisconsin",   1 ],
]

The priority queue is effectively an array but with duplicate positions. The bag or multiset provided in many programming libraries is, similarly, a set but with duplicate entries.

Some programming languages provide versions of some or all these abstract collection types, which are immutable, or restricted from being modified. When applicable, this can enable certain kinds of optimization.

In dynamic contexts queues and stacks are distinct abstract collection types: Their items, after all, are not accessed either by name or by position, but by rule which concern the collection as a whole. However, in immutable contexts (such as data representation) these distinctions are seldom relevant. Either would likely be transmitted as a simple list; the receiving application can do what it likes.

This leads me to suggest the following features for distinguishing abstract collection types:

  1. Is the order (position) of entries a way to refer to them?

  2. Are entries identified (named) by some key in addition to their value?

  3. Can one position identify only one entry? For example, if you store something in myArray[3] and then do so again, the first item is replaced. But if you store another item in a priority queue under the same priority, the first remains.

  4. Can one key identify only one entry?

  5. Is the data structure mutable?

These features define a space of distinct abstract collection types. The space is not entirely orthogonal; for example, the question of whether there can be multiple entries at the same position is moot if the collection type is not ordered (as dictionaries), and the question of whether there can be multiple entries with the same name is moot if the collection type has no names (as sets).

A few more properties are commonly distinguished, which are less relevant here:

  1. Must the data items in each entry all be of the same type? Arrays are sometimes divided into homogenous ones (in which all entries' data items must be of the same datatype, and heterogeneous ones (in which items can be of mixed types). For example, Python has arrays, which are heterogeneous, and byteArrays which are homogeneous. Collection types rarely place any restriction on entries containing equivalent (or identical) data items.

  2. Is the type recursive? That is, can it have as members, items of its own type? Typically programs allows collection types to have any types they like as members, so recursion is the norm (as in all of JSON, Python, and XML).

The types and their properties are shown below (parenthesized names are not provided by Python).

Table II

Name Position Named DupPos DupName Immutable form
set 0 0 0 frozenset
(multiset or bag) 0 0 1[3]  
dict, defaultdict 0 1 0  
Counter 0 1 1  
array/list, bytearray 1 0 0 tuple, string, u string, buffer
priority queue 1 0 1  
OrderedDict 1 1 0 0 namedtuple
? 1 1 0 1  

Python's OrderedDict has much in common with XML child sequences. It is a variation on dictionaries, that also remembers in what order members were added. Members can be added with the usual od["aardvark"] = "Tubulidentata" syntax, and accessed by the same name. It remembers the order items were added, and can iterate in that order with for k,v in od.items() items.

However, members cannot be set or retrieved by position, nor can they be moved around (except by deleting each and re-adding it to move it to the end). There is not a method to find out what position an item is in. It also cannot support multiple members with the same name as needed for XML. So OrderedDict functions mainly like a regular dictionary, just with a special iterator.

From this table it is obvious that there are two case missing: First, Python does not have multisets (some other languages do). They can be simulated easily with Counter. The last row has no Python type, and no convenient way to simulate it. Sets have neither position nor name; arrays have position but no name; dictionaries have name but no position; but where is the final combination? Where is an abstract collection type that has both name and position, and can have repetitions of at least one of them?

A full-fledged abstract collection type for the final slot turns out to be exactly the structure that XML needs, but that is hardest to simulate using only arrays and/or dictionaries. I call it NamedArray, and it is described in the next section.[4]

XML collection types

XML includes a few collection types, but the specification itself doesn't describe or formalize them as such. The attributes of an element form a dictionary. The tokens within an attribute of type IDREFS or NMTOKENS form a homogeneous array. But the children of a given element do not constitute any of the types discussed so far, because both their names (element type names) and their positions are important. What are they, and how can we best access them with typical programming-language syntax?

First, consider XML elements. They are the only Nodes (other than the special case of Root), that are not leaves in the XML structure: that can have children. DOM provides them as an (abstract) array known as childNodes, so they are accessed by position. That's fine as far as it goes, especially if languages would make them accessible via their native array syntax as shown earlier.

The basic goal here is to find a simple mapping of XML structures to the native collection syntax and semantics of Python, much like the simple mapping of JSON to the native collection syntax and semantics of Javascript.

In practice, many operations on XML do not access child nodes by numerical position; getting the 20th child is rarely needed, except when iterating over all children. Much more common is to want all the children of a given element type This defines a basic requirement for an OrderedDict-like abstract collection type to suffice for XML.

XML as described by DOM has an orthogonal complexity that gets in the way of this: The distinction of nodeType and nodeName. Every element has a name such as para or stanza; but there are also several types of nodes that are not elements. This 2-way or 2-level distinction complicates XML processing, so to keep the syntax and semantics simple it would help to get rid of it. The next section deals with this.

Simplifying XML node typing

Every XML node is one of several nodeTypes, as described by DOM. The fundamental ones are element, attribute, text, processing-instruction, root, namespace, and comment (others have to do with physical rather than logical structure, for example CDATA sections and entity references, which need not concern us here). But elements themselves have types (which DOM calls nodeNames), such as para or chapter. The overloading of type is confusing. The simplification proposed here begins by reducing the variety of node-types:

  1. Root is merely a special case of element.

  2. Namespace is merely a special case of attribute.

  3. Comments and PIs can be thought of as merely special cases of text (or the PI's can be thought of as a special case of an empty element, in order to treat its target as an attribute, or to accomodate the commonplace of using attribute syntax within PIs).

Let us call the new unified node construct an Elm, since it is largely similar to DOM Element, but subsumes the other nodes types as well. By introducing reserved Elm names such as _TEXT, _PI, _COMMENT, and _ROOT, and by treating the text content of such (empty, leaf) Elms as a special attribute, the inventory of node types drops to just Elms and attributes.

Conceptually, Elm leaves could treat individual characters as child nodes, rather than strings. This changes little; in Python strings are a list type, and all the usual list operations can apply to them anyway.

The attributes of any given Elm form a dictionary, so it is convenient to introduce a specific node to encapsulate an element's attribute list as a whole (including both regular and namespace attributes). This corresponds to the XML Attribute-list.

It has often been suggested that the type name of an XML element is merely a special case of an attribute (it applies to the element as a whole, has a simple token as value, comes from a schema-constrained namespace, etc.). So the element type can be treated merely as an attribute, which we may name _TYPE.

Conveniently, the attribute-list is a familiar type (dictionary), and the Elm has all the positive properties of an array (plus the presence and use of names).

Mapping to Javascript types and syntax

Javascript at least seems to have notions of accessing array members, and of accessing object properties. As we have seen, their syntax and semantics overlap somewhat; but to implement an abstract collection type for XML child sequences we can maintain the convention that the children of an Elm are identified by number and accessed via array notation; while the attributes of an Elm are identified by name and accessed via property notation.[5]

This is enough to permit a much more palatable syntax for dealing with XML structures in Javascript. Such an implementation may want to support the existing DOM methods and properties as well, to ease the way for existing users.

Mapping to Python types and syntax

Given the same unification of subtypes of Node just discussed, the collection of a node's children fits neatly into the taxonomy of abstract collection types discussed earlier, as a more functional analog to OrderedDict. I call it NamedArray. The basic properties of this new type are:

  1. Order of members is maintained.

  2. All the usual array operations are supported, such as accessing members by position, splicing, etc. (sort() probably won't be used much for XML, but may be for other applications of NamedArray). Operations that insert members by position (such as append()) take an extra parameter for the name.

  3. Like in regular dictionaries, each member of a NamedArray has a name (which is a string for the XML case, but which Python in general would allow to be of any hashable type).

  4. Unlike in regular dictionaries, NamedArrays may have multiple entries with the same name (at different positions).

  5. All the usual dictionary operations are supported, such as accessing members by name. However, the semantics are slightly different because duplicate names are permitted. Assigning to a member solely by name replaces the first member with that name; so does deleting by name. Members can also be referred to by name and number simultaneously, which refers to the n-th member with the given name.

  6. The reserved name * matches any Elm node type name, but not any of the reserved (_-initial) names.

The specific syntax for these operations is described below.

This provides the most commonly-needed operations for XML. Child nodes can be iterated over by number, and the sequence can be edited by the usual splice, append, and other list operations (as in DOM, but with more natural syntax). However, it is equally easy to iterate over all the comments, PIs, text nodes, elements, or elements of a given type, or simply slice out the elements of a given type. These are extremely common requirements in XML processing.

This idea led the author to investigate what it takes to support an additional collection type in Python, with the native syntactic niceties.

Python doesn’t have only the array[n] notation for arrays. It also provides slicing such as array[start:end] (note that the 2nd argument specifies the entry after the last one desired — this is commonly convenient). It also provides dict[key] notation, and a separate notion of object properties, accessed like object.propname[6].

Finally, Python has an unusual but very useful array[start:end:interval] notation. For example, array[0:20:2] retrieves every other item from among the array's first 20 entries. This is said to be heavily used with the numpy/scipy scientific and math packages.

As with the abstract collection types, there turns out to be a natural combination of these features which is not built in to Python, but is an excellent fit for XML:

Table III

Description DOM Pythonish
Get first child n.firstChild n[1]
Get second child n.childNodes(2) n[2]
Get last child n.lastChild n[-1]
children 1-3 n.childNodes.slice(0,3) n[1, 4)
two nodes equivalent n1.isEqualNode(n2) n1 == n2
two nodes identical n1.isSameNode(n2) n1 === n2
replace third child n1.replaceChild(n2, n3) n1[2] = n3
get first "p" child n["p"]
get third "p" child n["p":3]
get last "p" child n["p":-1]
get the first 20 "p" children n["p":1:21]
get all "p" children from among the first 20 children n[1:21:"p"]
walk down by types doc[“chap”:3][“sec”:2]["p"]

It is often helpful to be able to get all child Nodes that are in fact elements; as noted earlier n["*"] is defined to do this. The reserved Elm types _TEXT, _PI, _COMMENT, _ROOT, and _ATTLIST are simply names when XML uses NamedArray, so NamedArray itself has no awareness of XML convention.

NamedArray is mutable, and does not require homogeneity. The members can be data of whatever type. In an XML document many will be Elms, some of which contain NamedArrays of its own; leaf nodes, however, could be inserted as raw strings, or as separate objects corresponding to DOM's Text, Comment, and PI nodes; NamedArray doesn't care.

This seems, at least to this author, quite intuitive. The various slicing options make the interface considerably higher-level, and the general applicability of [] considerably reduces the number of methods and properties required to achieve DOM's functionality.

The subtlest detail is perhaps the distinction between n["p":1:21] and n[1:21:"p"]. The former retrieves all children named p, and then retrieves the first 20 of those; the second instead retrieves the first 20 children, and then extracts all of those that are named p. That is, the slicing operations go from left to right. This may seem familiar to users of XPath's successive [] filters.

The rest of the XML Nodes

So far we have modeled only the child-sequence requirements of XML, using a new abstract collection type that is not XML-specific, and that fits neatly into the pattern of Python's already-existing types. However, there is more to an XML node than a NamedArray of children: it also has attributes, references to neighboring nodes, properties related to namespaces, and a variety of methods.

To begin with attributes, in Javascript they are probably best implemented as properties; however, this can lead to name clashes in relation to properties that are part of the Node implementation (nextSibling, etc.). Another option, which seems better when it is possible, is to keep them all in a dictionary. The dictionary in turn can be a property, or as suggested below, the first member of the NamedArray.

JSOX has a full-fledged dictionary representing the attribute list as a whole. Because the Elm type (which subsumes DOM nodeType and nodeName) is always present, and considered a reserved attribute, every Elm has at least one member in that dictionary. I favor placing the Attlist dictionary at position 0 in the child sequence, considering it a node in the same way that PIs, comment, and text portions are nodes. Placing it at position 0 fits with where it appears in XML syntax, while also hinting that it is special. With slicing operations, it can be included or excluded at desired. I see this as a rather nice compromise between (a) most programming languages using 0-based arrays, and (b) many people thinking of the first element as #1. Those who disagree can put the attributes elsewhere with no substantial effect on the rest of this proposal.

The other DOM properties (parentNode, etc.) are provided as properties as usual, as are more Python-native synonyms: cloneNode (copy), textContent (toString), appendChild (append), hasChildNodes and hasAttribute (in), removeAttribute (del), removeChild (del), etc.

Implementation

In Python most operators are just shorthand for function calls, and one can override the function definitions. The [] notation invokes __getitem__(). Python already accepts 3 arguments to the bracket notation, and tests show that any of them can be a string instead of an integer. Thus, the semantics just described are easily implemented:

def __getitem__(self, n1, n2=None, n3=None):
    t1 = t2 = t3 = None
    nargs = 1; t1 = type(n1)
    if (n2 != None): nargs = 2; t2 = type(n2)
    if (n3 != None): nargs = 3; t3 = type(n3)

    if (nargs==1):
        if (t1==StringType):
            if (name == '*'): return(self.getElements());
            return(self.getByName(t1)[1])
        else:
            return(self._subdata[int(t1)])
    elif (nargs==2):
        if (t1==IntType):
            if (t2==IntType):
                return(self._subdata[t1:t2])
            else:
                return(self.getByName(t2)[t1])
        else:
            if (t2==IntType):
                return(self.getByName(t1)[t2])
            else:
                raise(TypeError)
                return(None)
    else: # nargs==3
        if (t1==IntType and t2==IntType):
            if (t3==IntType):
                return(self._subdata[t1:t2:t3])
            else:
                return(self._subdata[t1:t2].getByName(t3))
        elif (t2==IntType and t3==IntType):
                return(self.getByName(t1)._subdata[t2:t3])
        else:
            raise(TypeError)
            return(None)

Extensibility

The same kinds of indexing along all the XPath axes (Cla99) would be similarly useful, perhaps exposed as lazily-evaluated properties of type array-of-Node. Some form of lazy evaluation or indexing is important here, lest enumerating all preceding or following elements be too expensive. It may be that Python's buffer type, which supports read-only access to successive subsequences of a sequence (reminiscent of Scheme cdr'ing down a list) would provide an effective approach.

Although XML's child sequences are modelled well by NamedArray models, documents sometimes contain particular components that are logically other types. For example, the rows in a table body form an array, while the members of a definition list or glossary may form a dictionary, and a shopping list might form a set. Often the difference is visible only in a schema or documentation:

<!ELEMENT tbody    (tr*)>
<!ELEMENT glossary (term, definition)*>

All these can be modeled via NamedArray, but it is more perspicuous to provides no syntactic way to indicate that a particular element (type or instance) is conceptually a more constrained type, such as array, set, or dict (schema languages can specify such semantics, of course). This could be a useful addition both to add clarity to XML data, and to aid performance and perspicuity to implementations.

Were XML not so well established and so widely and compatibly implemented, this distinction could be accomplished by some minor syntactic enhancements, such as these which trade on the familiarity of arrays and dicts:

<x>   &#8212; for the usual XML childSequence
<[x]> &#8212; to indicate the children constitute a logical array
<{x}> &#8212; to indicate the children constitute a logical dictionary

I find this intuitive, but in addition to not being XML-compatible, one runs out of punctuation too soon — what would sets use? <<x>>? <⊂x⊂>? <〔x〕>?

A better solution on both counts is to define a namespace in which the abstract types are enumerated (ac for abstract collection):

<x ac:type="childSequence">
<x ac:type="array">
<x ac:type="dict">

This allows complete extensibility, even for specifying finer properties such as homogeneity, recursion, etc.

Documents

We now come to the most widespread XML use case for which JSON does not suffice: documents. Let us take the Ἀργοναυτικά or Medea as an example (both, after all, have much to do with Jason). As with most documents, such works contain long sequences of content objects — many of them verse lines, but intermixed with other types: quotations, stanza and section boundaries, and in some editions notes, links, catchwords, illustrations, stage directions, and so forth.

It should be evident to the reader that re-ordering the parts of a book or chapter (or a Web page) in the manner of a JSON hash table is rather worse than mere inconvenience (such as seen with the examples above). So when JSON replaces XML, what is to be done? It is a commonplace that perhaps 80-90% even of corporate data, resides in documents rather than databases; but whatever that precise figure may be, Appolonius of Rhodes, Clement of Alexandria, and Harry of Hogwarts are not going away.

In JSON only arrays preserve order, so they must be used. But an array of strings is entirely inadequate; one must reconstruct the entire hierarchy of books, chapters, sections, stanzas, verse lines, footnotes, and whatever else, via arrays. Then, since arrays provide no place to label their entries, each entry must become a collection of its own; perhaps a tuple such as ["stanza", "...." ], or a dictionary such as {"label":"stanza", "content":"...." }.

The latter has the advantage of supporting additional properties, whether for formatting, linking and notes, or meta-textual information such as that the stanza is missing from certain manuscripts. As always, indirection can be used to make an infinite variety of hacks possible; but once we have addressed even the most rudimentary demands of actual documents, JSON's limitation to arrays and hashes forces us into unintuitive, needlessly complex representations.

I carried out such a conversion for the much simpler case of Twitter data. By taking typical advantage of attributes, the data at once becomes:

  1. More readable (this, of course, is always subjective).

  2. Easily processable for multiple purposes (linguistic processing can ignore attributes, style calculation can ignore content, rendering can use everything, ...).

  3. Usable with generic software (because the notion of labeled and yet ordered items is required for all XML software, but for no JSON software).

  4. Smaller. The XML ends up about 10% smaller than the JSON, with names and whitespace for pretty-printing kept constant. This is partly because attribute names do not need to be quoted, unlike alphanumeric JSON dictionary keys.

  5. Validatable. This is a particularly important issue. Although it would be possible to design schema languages for JSON, none seem to exist at present. Thus, while it is very easy to save and load JSON from Javascript, it is not at all easy to describe exactly what structure you are sending, or to check that it's right. This rules out JSON for mission-critical applications, unless a special process is created for checking.

XML has been called "verbose." But this is often simply wrong. First, skilled rather than mechanical use of XML is often syntactically more compact than JSON. For example, schemas for XML support default values, and in many data-intensive applications many entries have the default value (that's why it's the default). Omitting them can save a lot; but in JSON, unlike XML, defaults must be implemented at a higher level. Second, data compression works very effectively on frequently-repeated things like tags. An email exchange on the xml-dev list long ago addressed this in detail (discussed by Edd Dumbill at http://www.xml.com/pub/a/2004/12/15/deviant.html). Third, safe use of JSON often requires processing it twice. Because JSON is essentially a subset of Javascript code, it is tempting to simply "eval" it in order to get it parsed. Doing so opens the door to injection attacks. The easiest way to avoid this is to run a separate regex match first (Cro06).

Summary

JSON has established itself as an easy way to get data in and out of Javascript programs; that is a good and useful thing. This paper, however, contends that XML's reputation for relative complexity or difficulty in that context arose in large measure because Javascript mechanically translated the language-agnostic Document Object Model to become it's API for XML data, rather than choosing syntax and signatures tailored to the Javascript sprachgefühl.[7] Javascript programmers accessing XML structures write, for example, n.childNodes[3].childNodes[2].getAttribute('class') when they would quite sensibly rather write n[3][2].attrs['class'] — countless times.

There is no reason this has to be the case. Even in Javascript, with a very small inventory of abstract collection types and little ability to extend syntax (for example by operator overloading), there is no reason not to provide the latter, or some similar more native and less annoying syntax.

Python provides operator overloading, and so can do even better at providing "native" feel for non-Python-derived formats. Such a Python interface to XML structures is presented. It provides far higher-level functionality than DOM, such as filtering nodes by nodeType and nodeName, slicing and splicing groups of children, and so on. The code to implement it is so small that the main portion is included here.

However, a deeper question arises when one considers just what abstract collection type XML data represents. Most XML constructs constitute ordinary arrays or dictionaries, the most basic and widespread collection types. However, the sequence of children of a given node does not, in part because it is organized and accessed by both position and name. An analysis of Python's abstract collection types into their distinctive features reveals that XML child sequences fit into a symmetry gap in the pattern. This paper presents a NamedArray type and implementation to fill that gap.

Some have suggested that JSON will "replace" XML, supposedly because it is simpler but still adequate. But any difference in simplicity is minor once you have an appropriate API; and JSON has not yet shown itself adequate for XML's primary application space: documents. Examples that accompany claims of JSON's adequacy uniformly lack the most salient distinctive characteristics of documents. As one example (at the time of writing) Wikipedia's article "Document-oriented databases" gave (only) these two examples of "documents":

{
    FirstName: "Bob",
    Address: "5 Oak St.",
    Hobby: "sailing"
}

{
    FirstName: "Jonathan",
    Address: "15 Wanamassa Point Road",
    Children: [
        {Name: "Michael", Age: 10},
        {Name: "Jennifer", Age: 8},
        {Name: "Samantha", Age: 5},
        {Name: "Elena", Age: 2}
    ]
}

The second at least contains one example of repetition, and represents it via hierarchy. This is more than many published examples, but there is not a hint of heterogeneous child sequences; of the same component type appearing at multiple levels; of non-trivial nesting; of anything like the distinction of sub-parts vs. properties of components; nor mention of schemas, constraints, defaults, inheritance, or validation issues.

Such examples do nothing to strengthen JSON's case for adequacy in the document world, however handy it is in the word of simple non-document data. An adequate system for documents must take real account of the fundamental characteristics of documents. If the simpler concepts and structures of regular and associative arrays sufficed for the world of documents and literature, we'd all have saved ourselves a lot of trouble by talking in SQL long ago.

Thus I conclude that JSON, while an excellent idea for the niche of config files, transmission of simple / tabular / homogeneous data structure, is insufficient to handle documents perspicuously. Until it does, the XML world need have no anxiety on its account; though we should undertake a more serious effort to make XML more accessible to developers new to the field, or in neighboring fields that interacting with XML only peripherally.

References

[Bra08] Bray, Tim, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau. Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C Recommendation 26 November 2008. http://www.w3.org/TR/REC-xml/

[Cla99] Clark, James and Steve DeRose. 1999. XML Path Language (XPath). Version 1.0. W3C Recommendation 16 November 1999. http://www.w3.org/TR/xpath/

[Coo87] Coombs, James H., Allen H. Renear, Steven J. DeRose. 1987. Markup systems and the future of scholarly text processing. Communications of the ACM 30(11): 933-947. http://dl.acm.org/citation.cfm?id=32209. doi:https://doi.org/10.1145/32206.32209.

[DeR90] DeRose, Steven, David G Durand, Elli Mylonas, and Allen Renear. Winter 1990. What is text, really? Journal of Computing in Higher Education 1(2): 3-26. Springer US. ISSN 1042-1726. http://link.springer.com/article/10.1007%2FBF02941632. doi:https://doi.org/10.1007/BF02941632.

[Dur96] Durand, David G., Elli Mylonas, Steven DeRose. 1996. What Should Markup Really Be: Applying theories of text to the design of markup systems. ALLC/ACH.

[Cro06] Crockford, D. 2006. The application/json Media Type for JavaScript Object Notation (JSON) IETF RFC 4627. http://www.ietf.org/rfc/rfc4627.txt

[ECM11] ECMA International. ECMAScript® Language Specification. Standard ECMA-262, 5.1 Edition / June 2011. Geneva: Ecma International. http://www.ecma-international.org/ecma-262/5.1/

[Gri68] Griswold, R. E., J. F. Poage, I. P. Polonsky. 1968, 1971. THE SNOBOL4 PROGRAMMING LANGUAGE. 2nd ed. Englewood Cliffs, New Jersey: Prentice-Hall, Inc. http://www.math.bas.bg/bantchev/place/snobol/gpp-2ed.pdf

[JQu] JQuery API. http://api.jquery.com

[LeH04] Le Hors, Arnaud, et al. 07 April 2004. Document Object Model (DOM) Level 3 Core Specification. Version 1.0. W3C Recommendation 07 April 2004. http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/

[Way01] Way, Arthur Sanders (tr). 1901. Apollonius Rhodius. Argonautica. London: J. M. Dent and Co. https://archive.org/details/taleofargonauts00apol



[1] In a discussion at http://programmers.stackexchange.com/questions/173573/history-of-associative-array, Yannis Rizos suggests that Snobol was the first language (around 1967) to provide a dictionary-like tool built in, which it called tables: A table is similar to a one-dimensional array. However, instead of referencing an element with an integer, any data object can be used. (Gri68, p. 19) For example, T<'A'> = 5.

[2] Properly speaking, hash table is a data structure very commonly used to implement such keyed collections (binary search trees are another); but the terms are commonly used interchangeably.

[3] Properly speaking, sets allow duplicate entries, not duplicate names; the definition of the DupName feature can be slightly complicated to account for this if desired.

[4] There is also room, not shown, for variants of OrderedDict and NamedArray that permit multiple members at the same position, by analogy with priority queues. I have not yet investigated these in any detail, but would tentatively suggest they may be useful for fined-grained version control.

[5] _ is used to prefix reserved names such as _TYPE because it is an acceptable identifier-start character in Javascript, Python, and many other languages. XML names can, however, also include colon, period, and hyphen, which would necessitate using Javascript bracket notations instead of dot notation.

[6] This is distinctive: Perl uses [] for arrays, {} for dicts, and lacks . notation. Javascript uses array[n], obj[key], and obj.key, but they are largely synonymous.

[7] Linguists and translators will recognize this as the classic choice between "literal" and "idiomatic" translations in natural language; the format may (or may not) be more precise, but are far less readable, and often feel "foreign," leading to lower acceptance by readers.

Author's keywords for this paper:
Python; DOM; API; XML; JSON; Abstract collection types; Markup Systems; Data modeling; Container (type theory)

Steven J. DeRose

Consultant

Steve DeRose has been working with electronic document and hypertext systems since joining Andries van Dam's FRESS project in 1979. He holds degrees in Computer Science and in Linguistics and a Ph.D. in Computational Linguistics from Brown University.

He co-founded Electronic Book Technologies in 1989 to build the first SGML browser and retrieval system, DynaText, and has been deeply involved in standards development including XML, TEI, HyTime, HTML 4, XPath, XPointer, EAD, Open eBook, OSIS, NLM and others. He has served as Chief Scientist of Brown University's Scholarly Technology Group and Adjunct Associate Professor of Computer Science. He has written many papers, two books, and eleven patents. Most recently he has been working as a consultant in text analytics.