How to cite this paper
Couthures, Alain. “My document object model can do more than yours: Extending the DOM for data manipulation.” Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). https://doi.org/10.4242/BalisageVol10.Couthures01.
Balisage: The Markup Conference 2013
August 6 - 9, 2013
Balisage Paper: My document object model can do more than yours
Extending the DOM for data manipulation
Alain Couthures
Alain Couthures is the project leader for XSLTForms which is a
client-side XForms implementation based on XSLT and Javascript. He is
an Invited Expert in the W3C Forms Working Group.
Abstract
Document object models, specifically the browser DOM, were
designed to represent HTML and XML documents. Languages such as XPath
were designed to access and traverse the DOM of HTML and XML documents.
But suppose we wanted to bring the power and convenience of XML
technologies like XPath to new data types. Could we extend the DOM to
support CSV files? JSON? ZIP files? Yes we can! This paper explores a
number of ways in which the DOM can be made to do more. We can loosen
restrictions, describe new sequence types, and even define new XPath
axes to make the DOM better and more useful.
Table of Contents
- Introduction
- Node names
- Multiple document type nodes and multiple document elements
- Sequence Node Type
- Typed-value Node Type
- Named axes
- Parsing and serializing
- Conclusion
Introduction
XML is well-known for two major uses: documents and data exchange.
Actually, any XML "document" can also be considered as a small consistent
database. As an example, an invoice requires many tables to be described
while one XML document is enough for this.
For small amounts of data, an XML "document" is usually parsed in
memory and the DOM API is a common library to manipulate its contents
within browsers or in Microsoft environments (MSXML, .Net) or in PHP. DOM
Level 1 was quite limited and was designed for HTML. DOM became
namespace-aware with DOM2. The latest version of DOM is Level 3 and it
was published in 2004. There is a quite recent working draft for DOM4 (6
December 2012).
There are many criticisms of the DOM API, probably because it is
clearly a low-level interface and because making it more complex for full
XML support was of no interest to HTML-only fans. The DOM structure is
also not fully appropriate for building an XPath engine as defined in XDM
(presence of CDATA and entity nodes and lack of namespace nodes).
As a matter of fact, DOM3 has not been fully implemented in browsers
and DOM4 might lose functionalities that are vital for building an XPath
engine on top of it, for example if attributes are no longer nodes.
XForms 1.0, and later XForms 1.1, were designed for editing XML
instances with a browser when embedded in an HTML page. XForms is based on
XPath and any XForms implementation requires extra features for nodes,
such as properties ("validity", "relevant", "read-only", ...) which cannot
be found in native DOM implementations in browsers. There are also extra
XPath functions, such as "instance()", "index()", "event()", defined in
the XForms specifications, while XPath engines in browsers cannot be
extended. That is why XSLTForms, a client-side XForms implementation, has
its own XPath engine written in Javascript and that is why its ancestor
project, AJAXForms, even had its own DOM implementation. Native XML
storage was chosen for XSLTForms for performance and for proven
compliance when serializing XML instances with multiple namespaces. Today,
this has to be reconsidered.
XForms 2.0 is not limited to XML editing. At least CSV and JSON are
also supported. The question for an XForms implementer is how to integrate
these notations at a low level so that authors can keep using
XPath.
CSV might seem to be an old format but it is still used in
many import/export functions of applications. For relational databases, a
CSV file maps naturally to the content of a table.
Mapping JSON to XML while preserving the readability of XPath
expressions is not easy. Many attempts have already been made but there is
not yet an agreement on a particular one for each situation.
This paper describes how, with a few extensions, CSV and JSON can
be loaded into an extended DOM structure so that an XPath engine can
manipulate them immediately.
There are even more challenging notations or file formats which can
be of interest for authors. Typically, applications with a light server
side or offline applications want to manipulate files, not just data in an
exchange notation. Many of them are binary formats and they can now be
manipulated within browsers with Javascript. The most common ones are
word-processing documents or spreadsheets in a ZIP package.
Possibilities are numerous. For XQuery implementors and developers,
XQuery instructions might also be parsed into a DOM structure as if there
was an XQueryX source. Programming languages are defined according to a
grammar and a grammar is similar to an XML Schema for text sources. For
Apache administrators, httpd.conf file and log files have a format which
can be loaded into a DOM structure. There are also emerging notations
which are simpler or richer than XML and they can surely be parsed and
stored in an extended DOM structure.
A bigger challenge is to use a DOM structure not only for a tree or
a forest of trees (such as a document with multiple document elements) but
for graphs. A navigational approach can be obtained with named axes: the
data designer can specify different sets of children for a node, each one
being assigned a name.
A non-planar structure is even possible within a DOM structure, the
difficulties being about internal ids to be used when parsing and
serializing. But this is already done by developers with workarounds and a
way to standardize this would be appreciated.
Node names
Most data engines (such as relational databases, CSV, JSON,...)
don't have restrictions for names: any character is possible and an
enclosing delimiter is used in the corresponding query languages to ensure
there will be no mismatch. For example, MySQL uses the back-quote
character, MS-Access uses square brackets. Names without restrictions are
clearly more user-friendly. XML 1.1 extended possibilities for names but
was not implemented much...
For the DOM, a name is stored within a string: there is no
restriction for names due to the DOM structure.
Quotes, apostrophes, square brackets, angle brackets and parentheses
are already used in XML and XPath. So, the back-quote character is a good
candidate for this purpose:
`+` is now a valid node name in XPath
Encoding within a name is still required in XML and XPath for
special characters (new line, quotes,...). Those characters can easily be
escaped as entities.
`&amp;` stands for an ampersand character
`&#10;` stands for a newline character
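As an illustration only, here is a minimal sketch of how an XPath tokenizer might read a back-quoted name and decode its entity escapes. The helper name readQuotedName and the choice of supported escapes (the five predefined XML entities plus numeric character references) are assumptions made for this example; they are not taken from XSLTForms.

```javascript
// Hypothetical helper: reads a back-quoted name starting at position `pos`
// of an XPath expression and returns the decoded name and the next position.
const NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };

function readQuotedName(xpath, pos) {
  if (xpath[pos] !== "`") {
    throw new SyntaxError("Expected a back-quote at position " + pos);
  }
  const end = xpath.indexOf("`", pos + 1);
  if (end === -1) {
    throw new SyntaxError("Unterminated back-quoted name");
  }
  const raw = xpath.slice(pos + 1, end);
  // Decode numeric references (decimal or hexadecimal) and named entities.
  const name = raw.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-z]+);/g, (match, ref) => {
    if (ref.startsWith("#x")) {
      return String.fromCodePoint(parseInt(ref.slice(2), 16));
    }
    if (ref.startsWith("#")) {
      return String.fromCodePoint(parseInt(ref.slice(1), 10));
    }
    if (Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, ref)) {
      return NAMED_ENTITIES[ref];
    }
    throw new SyntaxError("Unknown entity " + match);
  });
  return { name, next: end + 1 };
}
```

With this sketch, a back-quote inside a name would itself have to be written as a numeric character reference, since the first unescaped back-quote closes the name.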
Multiple document type nodes and multiple document elements
Multiple document type nodes can be very useful within a ZIP package
because each ZIP component might have its own document properties (media
type, encoding, ...).
Multiple document elements are a pertinent feature for many formats
whose serializers don't guarantee that the end of the contents has been
reached. It is even an advantage to allow data to be added at the end
without breaking a well-formedness rule. Is it still worthwhile to always
double check that all data has been received?
Those two extensions do not require the DOM structure to be
modified. They impact the parser and the serializer. Methods of the DOM
API, such as appendChild and so on, have to accept them.
For JSON:
{"a": "Hello", "b": "World"}
becomes
document(
  element("a",
    text("Hello")
  ),
  element("b",
    text("World")
  )
)
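The mapping above could be sketched as follows, using plain Javascript objects in place of real DOM nodes. The constructors text, element and documentNode and the function jsonObjectToDocument are hypothetical names chosen for this example, and the input is assumed to be strict JSON with quoted member names. Arrays are deliberately rejected here.

```javascript
// Hypothetical node constructors mirroring the notation used in this paper.
function text(value) { return { type: "text", value: String(value) }; }
function element(name, ...children) { return { type: "element", name, children }; }
function documentNode(...children) { return { type: "document", children }; }

function isJsonObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Each member of a JSON object becomes one element; nested objects recurse.
function membersToElements(object) {
  return Object.entries(object).map(([name, value]) => {
    if (Array.isArray(value)) {
      throw new TypeError("Arrays are out of scope for this sketch");
    }
    return isJsonObject(value)
      ? element(name, ...membersToElements(value))
      : element(name, text(value));
  });
}

// The members of the top-level object become sibling document elements.
function jsonObjectToDocument(json) {
  const data = JSON.parse(json);
  if (!isJsonObject(data)) {
    throw new TypeError("Expected a JSON object at the top level");
  }
  return documentNode(...membersToElements(data));
}

const helloDoc = jsonObjectToDocument('{"a": "Hello", "b": "World"}');
console.log(helloDoc.children.map((child) => child.name));
```

The resulting document node holds two sibling elements at the top level, which is exactly what a standard DOM would refuse.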
Sequence Node Type
Manipulation of an ordered set of data is possible in many formats.
For example, JSON allows "arrays" which can be embedded.
There is a known trick when converting a named JSON array into XML:
just emit one element per item, each one bearing the array name. Luckily,
even the bracket notation is similar in XPath because of the shortcut for
"[position() = n]". But there are also anonymous arrays in JSON, empty
arrays and arrays with a single item. So, this trick is not enough and
extra metadata becomes mandatory.
Support of a new node type is enough to deal with ordered sets. This
could be named "Sequence Node Type" because of sequences defined in XPath,
even if XPath does not allow embedded sequences.
A sequence node can be seen as an element without a name and without
attributes. Instead of defining such a new node type, it could also be
considered to allow elements without a name but this would interfere
within XPath expressions with a selector such as '*'.
It could also be compared to a document fragment being stored with
the structure instead of just being a temporary embedding node. It would
then be necessary to add a parameter to each method accepting a document
fragment to specify whether the document fragment node should be dropped
or not.
Since a sequence node type corresponds to a specific need, it
is better to define a new node type.
["a", "b"] becomes
document( sequence( text("a"), text("b") ) )
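A rough sketch of this conversion, again with plain objects standing in for DOM nodes, might look like the following. The names textNode, sequenceNode, arrayToSequence, jsonArrayToDocument and textAt are hypothetical. Objects inside arrays are left out because the paper does not say how they should be represented.

```javascript
// Hypothetical node constructors for this example.
function textNode(value) { return { type: "text", value: String(value) }; }
function sequenceNode(...children) { return { type: "sequence", children }; }

// Arrays become sequence nodes; nested arrays become nested sequence nodes,
// so anonymous, empty and single-item arrays all keep their shape.
function arrayToSequence(array) {
  return sequenceNode(...array.map((item) => {
    if (Array.isArray(item)) {
      return arrayToSequence(item);
    }
    if (item !== null && typeof item === "object") {
      throw new TypeError("Objects inside arrays are out of scope for this sketch");
    }
    return textNode(item);
  }));
}

function jsonArrayToDocument(json) {
  const data = JSON.parse(json);
  if (!Array.isArray(data)) {
    throw new TypeError("Expected a JSON array at the top level");
  }
  return { type: "document", children: [arrayToSequence(data)] };
}

// 1-based positional lookup among the text children of a sequence node.
function textAt(sequence, position) {
  const texts = sequence.children.filter((child) => child.type === "text");
  const found = texts[position - 1];
  return found ? found.value : undefined;
}

const lettersDoc = jsonArrayToDocument('["a", "b"]');
console.log(textAt(lettersDoc.children[0], 1));
```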
It is more problematic to choose a notation in XPath for such
nodes:
"#/text()[1]" returns "a" for ["a", "b"]
Typed-value Node Type
In DOM3, only elements and attributes can carry a data type as a
property, and the only nodes holding actual values are text nodes.
This is not enough for JSON because stand-alone values are possible.
The following JSON expressions are valid:
[1, true]
null
2.154E-7
["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
Every value in JSON has a data type and JSON allows numbers,
strings, booleans and null. From the Javascript point of view, when
JSON data is directly evaluated, it can even contain expressions: for
example, Date object creations are possible.
In CSV, no type is specified and, commonly, a spreadsheet processor
will try to associate the most fitting type with each value. Import tools
help users to manually force a type for a column.
But, because the number of header lines is not limited to just one
line, another possible trick is to associate a type with each CSV column
within the header (when column names are unknown, an empty line can be
used). For example:
xs:anyURI,xs:gYear,xs:positiveInteger
http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
http://en.wikipedia.org/wiki/Afghanistan,1963,10188299
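A simplified reader for such a typed header could be sketched like this. It assumes that the first non-empty line holds the column types, that fields contain no quoted commas or embedded line breaks, and that only a few XSD types are converted to native Javascript values. The names typedValue, CSV_CASTS and parseTypedCsv are hypothetical.

```javascript
// Hypothetical typed-value node.
function typedValue(datatype, value) {
  return { type: "value", datatype, value };
}

// Converters for a few XSD types; any other type keeps its lexical form.
const CSV_CASTS = new Map([
  ["xs:positiveInteger", (field) => Number.parseInt(field, 10)],
  ["xs:decimal", (field) => Number.parseFloat(field)],
  ["xs:boolean", (field) => field === "true" || field === "1"],
]);

// Returns one array of typed-value nodes per data line.
function parseTypedCsv(csv) {
  const lines = csv.split(/\r?\n/).filter((line) => line.length > 0);
  if (lines.length === 0) {
    return [];
  }
  const types = lines[0].split(",");
  return lines.slice(1).map((line) => {
    const fields = line.split(",");
    if (fields.length !== types.length) {
      throw new RangeError("Field count does not match the type header");
    }
    return fields.map((field, index) => {
      const cast = CSV_CASTS.get(types[index]);
      return typedValue(types[index], cast ? cast(field) : field);
    });
  });
}

const populationRows = parseTypedCsv([
  "xs:anyURI,xs:gYear,xs:positiveInteger",
  "http://en.wikipedia.org/wiki/Afghanistan,1960,9616353",
  "http://en.wikipedia.org/wiki/Afghanistan,1961,9799379",
].join("\n"));
console.log(populationRows[0][2]);
```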
As a consequence, for a DOM structure to be used to store JSON or
CSV data, the text node type is not enough and a new node type has to be
defined for a typed value.
42 becomes
value("xsd:decimal", 42)
{"a": 42, "b": "Hello"} becomes
document( element("a", value("xsd:decimal", 42) ), element("b", value("xsd:string", "Hello") ) )
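As a hedged sketch, typed-value nodes for JSON scalars could be produced as shown below. Numbers are labeled xsd:decimal to follow the example above, although xsd:double may be a closer match for values written with an exponent. The datatype name "json:null" is purely an assumption, since XSD has no type for null. The function names are hypothetical, and nested objects and arrays are not handled.

```javascript
// Hypothetical typed-value node constructor.
function value(datatype, content) {
  return { type: "value", datatype, content };
}

function jsonScalarToValue(scalar) {
  if (scalar === null) {
    return value("json:null", null); // assumed name, not an XSD type
  }
  switch (typeof scalar) {
    case "number": return value("xsd:decimal", scalar);
    case "string": return value("xsd:string", scalar);
    case "boolean": return value("xsd:boolean", scalar);
    default: throw new TypeError("Not a JSON scalar: " + typeof scalar);
  }
}

// A stand-alone scalar becomes a single value node; a flat object becomes
// a document whose elements each hold one value node.
function jsonToTypedTree(json) {
  const data = JSON.parse(json);
  if (data !== null && typeof data === "object" && !Array.isArray(data)) {
    const children = Object.entries(data).map(([name, member]) => ({
      type: "element",
      name,
      children: [jsonScalarToValue(member)],
    }));
    return { type: "document", children };
  }
  return jsonScalarToValue(data);
}

const answerNode = jsonToTypedTree("42");
const mixedDoc = jsonToTypedTree('{"a": 42, "b": "Hello"}');
console.log(answerNode.datatype, mixedDoc.children[1].children[0].datatype);
```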
This need not be limited to XSD types: truly binary types (neither
xsd:hexBinary nor xsd:base64Binary) are required in different situations,
such as ZIP packages with embedded images as found, for example, in
OpenXML files (.xlsx, .docx,...).
Named axes
Attributes can be considered as low-level elements but there are
advantages in using them because they are clearly separated from children
nodes. As a consequence, whether they are present or not, they do not
interfere when calling XPath nodeset functions upon children. For XPath,
attributes are accessed with a dedicated axis ("attribute::", with "@" as
the well-known shortcut). So, in a sense, there are possibly two distinct
sub-trees under an element.
<person firstname="Paul" lastname="Verdier" >
  <address>17 rue de Rivoli</address>
  <address>75001 Paris</address>
  <address>France</address>
</person>
could also be serialized as
<person>
  <attribute::firstname>Paul</attribute::firstname>
  <attribute::lastname>Verdier</attribute::lastname>
  <address>17 rue de Rivoli</address>
  <address>75001 Paris</address>
  <address>France</address>
</person>
Allowing developers to define named axes means that an element can
have several sets of children, each in a specific context which might also
be seen as a relationship. The name for the axis is the name of the
relationship. The resulting structure can be compared to the navigational
CODASYL database model.
This point is not at all difficult to implement (a map of arrays for
children instead of a unique array) and a corresponding syntax for XPath
can easily be defined ("child(axis-name)::" for example).
For XML-like serialization, the list of axes could precede the name
of an element child:
<product>
  <designer::person>Peter</designer::person>
  <user::person>Jack</user::person>
  <user::person>Emily</user::person>
</product>
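The "map of arrays" idea mentioned above can be sketched in a few lines, using the product example. The class AxisNode and its methods append and select are hypothetical names. The select method only approximates what a "child(axis-name)::" step followed by a name test might return.

```javascript
// Hypothetical node whose children are grouped by axis name.
class AxisNode {
  constructor(name, textValue = null) {
    this.name = name;
    this.textValue = textValue;
    this.axes = new Map(); // axis name -> ordered array of child nodes
  }

  // Appends a child under the given axis, creating the axis on first use.
  append(axisName, child) {
    if (!this.axes.has(axisName)) {
      this.axes.set(axisName, []);
    }
    this.axes.get(axisName).push(child);
    return child;
  }

  // Children on one axis, optionally filtered by name ("*" matches any name).
  select(axisName, nodeName = "*") {
    const children = this.axes.get(axisName) || [];
    return children.filter((child) => nodeName === "*" || child.name === nodeName);
  }
}

const product = new AxisNode("product");
product.append("designer", new AxisNode("person", "Peter"));
product.append("user", new AxisNode("person", "Jack"));
product.append("user", new AxisNode("person", "Emily"));

console.log(product.select("user", "person").map((node) => node.textValue));
```

Because each axis has its own array, counting or indexing the users is unaffected by the presence of a designer, in the same way that attributes do not interfere with children.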
When multiple parents are allowed, the structure resembles the
network CODASYL database model.
A use case for XML developers is embedding in the same document data
and specific data types declarations (in the same approach, fonts can be
embedded within a PDF file). This is possible with the data document
element and the schema document element being at the top level.
Serialization is clumsier with a network data model: a child
element cannot be serialized under each parent element. A specific
notation can be used to list links, such as in multipart messages. This
will be less human-friendly but XML documents are not often generated
manually and serialization might be seen as just a way to exchange a
database.
For example:
<person>
  <attribute::email>
    <attribute::xsi:type xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#email_type"/>
    paul.verdier@rivoli.fr
  </attribute::email>
</person>
<xs:simpleType xmlns:xs="http://www.w3.org/2001/XMLSchema" id="email_type">
  <xs:restriction base="xs:string">
    <xs:pattern value="(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})"/>
  </xs:restriction>
</xs:simpleType>
Parsing and serializing
Currently, with the Javascript DOMParser class, only XML and,
possibly, HTML and SVG documents can be parsed in a browser and the
corresponding media type has to be provided as a parameter of the
parseFromString() method.
Extending support to other media types such as "text/csv" and
"application/json" allows CSV and JSON to be loaded into a DOM structure.
XSLTForms now supports CSV/XML and JSON/XML conversions written in
Javascript.
For other formats, a third parameter can be the grammar to be used
to generate the corresponding tree.
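One way to picture such an extended parseFromString() is a registry of handlers keyed by media type, with the optional grammar passed through as a third argument. This is only a sketch: the class ExtendedParser and its register method are hypothetical, the handlers return plain objects rather than DOM nodes, and the CSV handler ignores quoted fields.

```javascript
// Hypothetical parser that dispatches on the media type.
class ExtendedParser {
  constructor() {
    this.handlers = new Map(); // media type -> function(source, grammar)
  }

  register(mediaType, handler) {
    this.handlers.set(mediaType, handler);
  }

  // Same call shape as DOMParser.parseFromString, plus an optional grammar.
  parseFromString(source, mediaType, grammar) {
    const handler = this.handlers.get(mediaType);
    if (!handler) {
      throw new TypeError("Unsupported media type: " + mediaType);
    }
    return handler(source, grammar);
  }
}

const extendedParser = new ExtendedParser();

extendedParser.register("application/json", (source) => ({
  type: "document",
  mediaType: "application/json",
  content: JSON.parse(source),
}));

extendedParser.register("text/csv", (source) => ({
  type: "document",
  mediaType: "text/csv",
  content: source
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .map((line) => line.split(",")),
}));

const parsedJson = extendedParser.parseFromString("[1, true]", "application/json");
const parsedCsv = extendedParser.parseFromString("a,b\nc,d", "text/csv");
console.log(parsedJson.content, parsedCsv.content);
```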
Because data cannot always be stored as a string (binary data, for
example), another method also has to be defined, such as
parseFromUint32Array. This allows media types such as "application/zip"
for ZIP packages to be supported. This has already been prototyped in
XSLTForms with the zip_inflate and zip_deflate Javascript functions
written by Masanao Izumo (http://www.onicos.com/staff/iz/amuse/javascript/expert/)
and there is no performance issue with standard MS-Excel or MS-Word
files.
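To illustrate why a binary entry point is needed, the following sketch reads the first local file header of a ZIP package from a typed array. It uses a Uint8Array rather than the Uint32Array suggested above, which is a choice made for this example only. The function readFirstZipEntry is hypothetical and is unrelated to the zip_inflate and zip_deflate functions. The entry name is assumed to be plain ASCII.

```javascript
// Hypothetical helper: decodes the fixed part of a ZIP local file header.
function readFirstZipEntry(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // A local file header is at least 30 bytes and starts with "PK\x03\x04".
  if (bytes.byteLength < 30 || view.getUint32(0, true) !== 0x04034b50) {
    throw new TypeError("Not a ZIP local file header");
  }
  const method = view.getUint16(8, true);          // 0 = stored, 8 = deflated
  const compressedSize = view.getUint32(18, true); // may be 0 if a data descriptor follows
  const nameLength = view.getUint16(26, true);
  const extraLength = view.getUint16(28, true);
  let name = "";
  for (let i = 0; i < nameLength; i++) {
    name += String.fromCharCode(bytes[30 + i]);
  }
  // The entry data starts right after the name and the extra field.
  return { name, method, compressedSize, dataOffset: 30 + nameLength + extraLength };
}
```

Each entry found this way could then be handed to the parser matching its own media type, which is where multiple document type nodes become useful.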
When serializing, some links could be internal links. It could be
possible to add id attributes when required but another mechanism should
be provided, a sort of internal id such as keys in XSLT, with its
specific notation such as internal:key.
Of course, exceptions will be raised because not all DOM structures
can be serialized in any format. There might be different workarounds to
serialize a DOM structure in a different media type without data
loss.
Conclusion
Trees are everywhere and they can even store links to represent
graphs. XML notation is a convenient way to serialize not-so-limited trees
but people can consider that it is too verbose. The real point is the
power in data description and data manipulation.
The DOM structure is already used by browsers as an XML Data Model.
Extending the DOM for non-XML data manipulation is not really difficult
and XPath is not heavily affected. The resulting structure can support
plenty of existing data formats. Because it is a low-level component,
upper layers can immediately benefit from it without significant
modifications.
XSLTForms is directly affected by an extended DOM implementation
written in Javascript. It is a good opportunity to easily prototype and
test new features. This implementation is currently in development and it
already has its own name: "Fleur".