Note: Acknowledgements
This paper describes concepts and source code originally developed by David G. Holmes, formerly of TIBCO Software Inc., without whose innovation and energy neither the paper nor the material that it describes would be possible. David was the senior architect responsible for driving the development (over several iterations) of the gXML code base, and the original advocate of opening the source.
The Problem(s) with XML Tree APIs in Java
Java was one of the first major programming languages with support for XML. It was one of the targets for the Interface Definition Language modules that were developed as the basis of the Document Object Model DOM. Early adoption helped to prove the capabilities of both XML and of Java, but as might be expected, early adoption also has its drawbacks. A number of developers using XML in Java have noted these problems. For instance, Dennis Sosnoski compared a number of tree models in a two-part investigation in 2001 and 2002 (see "XML and Java technologies: Document models, Part 1: Performance" DMPerf and "XML and Java technologies: Java Document Model Usage" DMUse). More recently, Elliotte Harold documented "What's Wrong with XML APIs" WhatsWrong as part of the development of the XOM XOM API. This analysis falls into that tradition, though it does not agree wholly with the previous analyses. We identify four classes of problem with existing tree model APIs.
The first problem is multiplicity. For a variety of reasons, Java developers have not, on the whole, been enthusiastic partisans of the DOM. Alternatives were proposed early; Xalan Xalan, one of the major early XSLT processors, defined its own internal XML tree model (the Data Table Model XalanDTM) in preference to using the DOM. At present, there are at least five well-known tree models for XML in Java: DOM DOM, JDOM JDOM, DOM4J DOM4J, XOM XOM, and AxiOM AxiOM, as well as an unknown number of proprietary APIs to the same purpose (the authors of this paper know of at least six such private APIs). Applications and processors written for one of these models are generally not usable with other models.
The second problem is interoperability. The first tree model to appear on the scene has had a first mover advantage. Subsequent tree model designs have intended to address the shortcomings of the DOM, but not to interoperate with it (note that both DOM4J and AxiOM later added optional DOM interface implementations to address this problem, accepting the disadvantages of the DOM in order to achieve compatibility in this mode). Knowledge of the tricks and optimizations appropriate to one model does not transfer to other tree models. Though the successor models have all positioned themselves as better solutions than the DOM, they have not been adopted as widely. This is most likely due to the DOM's first mover advantage, and the consequent network effect: although other models may have technical advantages that make them more suitable than the DOM for a given application, in order to use those new models efficiently within the JVM, all parts of the application need to use the same tree model. Developers must solve a cruel equation in which the marginal benefits of switching from the DOM are typically low, whereas the marginal costs are always high. The alternatives seem to be to write multiple code paths to achieve the same purpose (with different tree models), or to wrap each node of each tree model in an application-specific abstraction. Some projects, such as Woden Woden and Jaxen Jaxen, have taken one or the other of these approaches in preference to adopting the DOM as the sole programming model.
The DOM, as the first XML tree model for Java, established the universe of discussion for design of tree models. Development of the DOM preceded the Namespaces in XML XMLNS and XML Infoset Infoset specifications. For backward compatibility, the DOM could never enforce these specifications, though it could enable them. Further development of the DOM may be characterized as too closely approaching the Lava Flow LavaFlow anti-pattern. Indeed, the DOM exposes fifteen "basic" abstractions (node types), compared to eleven in the Infoset, and seven in the XDM. Successor APIs have generally targeted the Infoset, but with widely varying interpretations. This is the problem of variability. Each model exposes different property sets. The boundaries between lexical, syntactic, and semantic are drawn at different points. One consequence of this variability is that it is difficult or awkward to add support for specifications "higher in the stack." For instance, XPath 1.0 XPath1 and XSLT 1.0 XSLT1 work perfectly adequately as external tools (one per tree model, or by generalizing the concept of "Node" to "Object"), and some models have built-in support (at least for XPath). XML Schema support (see WXS1 and WXS2) is rarely found—a DOM Level 3 module supports it, but in a fashion that is not noted for ease of use, and the module is not widely implemented. Similar situations exist for specifications such as XQuery 1.0, XPath 2.0, and XSLT 2.0. Even SOAP/XMLP is arguably under-supported. AxiOM, after all, is an entire XML tree model built largely so that the SOAP abstractions could be represented cleanly as extensions.
Finally, the problem of weight plagues most of these tree models. The DOM itself is notoriously heavyweight, typically occupying three to ten times the space, in memory, that the (already verbose) XML occupies as a character stream, according to Harold's Processing XML with Java XMLInJava. Successor models have done better in this area. Dennis Sosnoski's evaluation, "Document Models Part 1: Performance" DMPerf, though dated, provides an excellent illustration of this problem. A large part of the problem lies in the unrestricted mutability of these models. All of the prominent XML tree models for Java must restrict programming to serial, synchronous access. A mutable tree model is effectively a mutable collection, so any changes made to it by a single writer may have disastrous effects upon multiple readers. Issues of weight cannot easily be addressed by storing the bulk of the document on disk, or by concurrent processing, because the document may be modified during processing.
There are alternatives: applications and processors with higher performance requirements are often written to abstractions that do not model XML as a tree, such as SAX, StAX, or XML data binding (in its various flavors). Sosnoski's article discusses some of these alternatives; Harold's presentation also notes both advantages and disadvantages. The chief drawback to these approaches is that they expose paradigms which are not as easily or intuitively understood as the tree model, which are more of a challenge for some developers. A tree model is preferred. A single model for navigation and interrogation seems best. To date, attempts to create this single model have proven suboptimal in most environments.
gXML Design Considerations
gXML is a new API for analyzing, creating, and manipulating XML in Java. It embodies the XQuery Data Model, and is consequently a tree-oriented API, but it does not introduce a new tree model comparable to existing models. Instead, it is intended to run over existing tree models, and to permit the introduction of new, specialized models optimized for a particular purpose. Its design rests on four pillars: the Handle/Body design pattern, Java generics, the XQuery Data Model, and immutability for XML processing as a paradigm. These four principles answer the four problems outlined above.
The Handle/Body Pattern
gXML makes extensive use of the Handle/Body pattern (called the Bridge pattern in Design Patterns GOF). This pattern provides a well-defined set of operations over an abstraction (the handle), which may then be adapted to specific implementations (the body). For gXML, the primary "handles" are the Model or Cursor, the Processing Context, the Node Factory in the mutable API, and the type (Meta) and typed-value (Atom) Bridges in the schema-aware API.
When presenting gXML to a new audience, one of the most common stumbling points is the distinction between Handle/Body and Wrapper (a name that Design Patterns gives to both Adapter and Decorator). gXML does not wrap every node in the tree. Applications and processors are presented with one new abstraction, represented by a single instance (a Singleton for model, or a single instance per tree for cursor). gXML adds very little weight to the existing tree model, compared to the significant additional weight added by the necessity to wrap every node in a tree. Although there is a cost (in memory and performance) to using the handles rather than directly manipulating the bodies, the benefits (in flexibility and capability) are more nearly commensurate: in exchange for a memory/performance impact measured in low single-digit percentages (for most tree model APIs), an application or processor gains the ability to manipulate all supported tree model APIs (currently three; more are anticipated).
There are a number of attractive consequences of using this design pattern. First, since applications and processors need not write separate code paths for different tree models, these models can be injected very late, even at runtime. That suggests that they can be compared, based on the application's or processor's requirements, and the tree model best suited to the problem at hand preferred. It also suggests that application and processor developers might have a sounder foundation to suggest improvements to developers of the models. Second, by bringing peace to these warring models, by allowing developers to choose a model based on technical merits without considering the importance of the network effect for the DOM, gXML also enables the creation of "niche" tree models for XML, models designed and optimized for particular use cases. In other words, by always using these handles for access, special-purpose bodies become more practical. These topics will be revisited in Advancing the State of the Art, below.
gXML's use of the Handle/Body pattern for XML tree models might be compared to the similar pattern used for database drivers in the Java Database Connectivity (JDBC) API. Each bridge may be viewed as equivalent to a vendor-specific driver.
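The difference between a single handle and a per-node wrapper can be made concrete with a minimal sketch. Everything named here (TreeHandle, DomHandle, Toy, ToyHandle, countNamed) is a hypothetical illustration invented for this example, not part of the gXML API; only the org.w3c.dom and javax.xml.parsers calls are real. One stateless handle instance serves every node of a tree model, and the generic method is written once for all bodies.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class HandleBodySketch {
    /** Hypothetical handle: one stateless instance serves every node of one tree model. */
    interface TreeHandle<N> {
        String localName(N node);      // null for nodes that have no name
        Iterable<N> children(N node);
    }

    /** Body 1: the DOM. The handle reads DOM nodes directly; no node is wrapped. */
    static final class DomHandle implements TreeHandle<Node> {
        public String localName(Node node) { return node.getLocalName(); }
        public Iterable<Node> children(Node node) {
            List<Node> out = new ArrayList<>();
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) out.add(c);
            return out;
        }
    }

    /** Body 2: a toy tree model invented for this sketch. */
    static final class Toy {
        final String name;
        final List<Toy> kids;
        Toy(String name, Toy... kids) {
            this.name = name;
            this.kids = Collections.unmodifiableList(Arrays.asList(kids));
        }
    }

    static final class ToyHandle implements TreeHandle<Toy> {
        public String localName(Toy node) { return node.name; }
        public Iterable<Toy> children(Toy node) { return node.kids; }
    }

    /** Written once against the handle; works unchanged for any body. */
    static <N> int countNamed(TreeHandle<N> handle, N node, String name) {
        int count = name.equals(handle.localName(node)) ? 1 : 0;
        for (N child : handle.children(node)) count += countNamed(handle, child, name);
        return count;
    }

    static Node parseDom(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // needed for getLocalName() to be non-null on elements
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node dom = parseDom("<a><b/><c><b/></c></a>");
        Toy toy = new Toy("a", new Toy("b"), new Toy("c", new Toy("b")));
        System.out.println(countNamed(new DomHandle(), dom, "b"));
        System.out.println(countNamed(new ToyHandle(), toy, "b"));
    }
}
```

The cost of the indirection is one extra method call per property access; a wrapper design would instead allocate one additional object per node.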
The 'G' in 'XML'
gXML makes extensive use of Java generics. First, it defines two common parameters, N and A. N is the "node" handle; A is the "atom" or "atomic value" handle. Furthermore, gXML makes extensive use of Java's built-in generics; APIs that accept or return collections typically use Iterable in their signatures (as opposed to counts, specialized objects with pseudo-iterators, single-use iterators, or arrays).
The use of generics is the primary answer, in gXML, to the problem of interoperability. By defining these parameters, particularly the <N>ode handle, each of the tree models can be viewed and manipulated through the lens of the XQuery Data Model. One notable consequence is that the enormous network effect created by the existence of parsers, processors, and applications that understand no model but the DOM, regardless of its fitness for their domain of operation, no longer matters to developers of gXML-based processors and applications. gXML includes a DOM bridge; it is thereby able to leverage that network effect. Every bridge added, adds to the network effect—though not, as a rule, for a single document: conversion from model to model remains expensive.
The XQuery Data Model
Perhaps the most important driver for the development of gXML was the desire to have a Java API that embodied the XQuery Data Model. The XDM is more rigorous than its predecessor, the XML Infoset specification (which was driven in part from a need to model existing APIs, including DOM, SAX, XPath, and Namespaces in XML). It is conceptually complete, and defined in a context that permits type definition, navigation operations, and more advanced functions. This rigorous, well-defined specification was adopted as the basis for the API, and represents gXML's answer to the problem of variability. Is a property or concept in the XDM specification? Then it should be in the gXML API. If it is not in the specification, then either it should not be exposed in the API, or it should be compatible with the well-specified API. For instance, the entire mutable API was added as an extension; XQuery does not define operations that modify trees.
Another important reason to adopt the XQuery Data Model is that it provides the first well-integrated access to XML Schema information (one might argue that XQuery and XSLT2 provide the "missing language" for the XML Schema type system). A great deal of XML processing has no need to concern itself with validation, typing, and particularly with the post-Schema validation infoset; those applications and processors that need it, however, need it very badly. gXML defines a common model for XML Schema, compatible with the XDM's definition and use of XML Schema types and typed values, as a standard extension.
gXML is not the only model to provide support for XML Schema, but the schema-aware extensions in gXML can be implemented for any tree model, and are exposed via APIs that are clearly related to (usually extensions of) the core gXML APIs. In other words, by addressing the problem of variability via adherence to and conformance with the XQuery Data Model Specification, gXML enables the development of a "next wave" of XML processing technologies, based on XPath 2.0, XSLT 2.0, and XQuery 1.0 (including the new generation of XQuery-conformant databases).
The Immutable Approach
In the experience of the developers of gXML, most of the nodes in any given XML instance document are never modified. These nodes need not be mutable—but because some nodes are modified in the common paradigm of XML processing, all nodes must be defined to be mutable. The core gXML API dispenses with mutability. Instead, it promotes a paradigm in which a received or generated XML document is an input, and the XML supplied to other processes (in the same VM, on the same machine, or somewhere else on the network) is a transformation of the input. This approach addresses the problem of weight. In combination with the enabling of custom, potentially domain-specific XML tree models accessed via a gXML bridge, the immutable paradigm (over an immutable tree model) can achieve optimizations not possible for a tree model in which the existence of mutability militates against caching, compaction, and deferred loading. It is not possible, at this point, to quantify the potential performance benefits rigorously because the pure-immutable model remains hypothetical (other priorities have taken precedence). Here we speculate.
Such a hypothetical immutable model would not need to guard against modification of a document in one thread while another thread reads it. It would provide guarantees that would permit processing of large documents to be parallelized; an immutable, late-loading model might be able to provide access to XML documents of a size infeasible for mutable models. A certain number of these optimizations are available even for bridges over mutable models; if the convention encourages immutability, then processors can define their operations only when the convention is adhered to, warning users that breaking the convention may lead to undefined (and incorrect) results.
Immutability enables performance enhancements: for instance, models in memory which occupy a fraction of the size of the XML as a character stream rather than a multiple of its size; concurrent processing of XML documents; storage of the bulk of a document on disk with indexing and a very light footprint in memory. We've noticed unanticipated potential as well: if there is no requirement to modify the document in memory, then a gXML bridge may reasonably be defined over any structured hierarchical data format analogous to XML: JSON, CSV, a file system, a MIME multipart message. Perhaps more strikingly, immutable models can potentially cross the VM boundary, via JNI to other languages, into hardware accelerators, and so on.
The gXML Core
The gXML API is designed for rapid understanding. The core API can be described as a collection of five interfaces. In practice, more interfaces are available, but understanding these five is necessary and sufficient to understand and use the gXML base API. These abstractions adhere to the design principle of immutability, and do not introduce any dependency upon XML Schema.
The core API is completed with two extensions. The mutable extension adds mutability by adding methods to the base interfaces, or by adding new interfaces. The schema-aware extension adds schema awareness, again by adding methods to base interfaces, or by adding new interfaces; the schema-aware extension also introduces the "atom" parameter.
Untyped, Immutable
The heart of the gXML API is an abstraction called Model. Model is stateless; each bridge implements it. The methods on Model permit interrogation of XQuery Data Model properties (getNamespaceURI(N), getLocalName(N), getStringValue(N), getNodeKind(N), etc.), and provide XQuery/XPath navigation (child, descendant, ancestor, sibling, attribute, namespace axes). Since this abstraction is stateless, each method's first parameter is a context node, the node for which information is requested, or from which navigation begins. The XQuery Data Model defines seven node types: Document, Element, Text, Attribute, Namespace, Comment, and ProcessingInstruction. Returns from each method vary by node type, in conformance with the Data Model specification, but the API does not distinguish node types (the argument or return value is <N>, not <? extends N>). Appendix A documents this interface.
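As an illustration of what "stateless, with the context node first" means in practice, here is a minimal sketch of a much-reduced model over the DOM. SketchModel, DomModel, and NodeKind are hypothetical names for this example; the method names echo those mentioned above, but the signatures are assumptions rather than the actual gXML interface. The document case in getStringValue shows one small mismatch between the DOM and the XQuery Data Model that a bridge has to absorb.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ModelSketch {
    /** The seven XDM node kinds. */
    enum NodeKind { DOCUMENT, ELEMENT, TEXT, ATTRIBUTE, NAMESPACE, COMMENT, PROCESSING_INSTRUCTION }

    /** Hypothetical, much-reduced stand-in for Model: no state, context node first. */
    interface SketchModel<N> {
        NodeKind getNodeKind(N node);
        String getLocalName(N node);
        String getStringValue(N node);
        N getFirstChildElement(N node);
    }

    /** One instance can serve every DOM tree, because it holds no state. */
    static final class DomModel implements SketchModel<Node> {
        public NodeKind getNodeKind(Node node) {
            switch (node.getNodeType()) {
                case Node.DOCUMENT_NODE: return NodeKind.DOCUMENT;
                case Node.ELEMENT_NODE: return NodeKind.ELEMENT;
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE: return NodeKind.TEXT;
                case Node.ATTRIBUTE_NODE: return NodeKind.ATTRIBUTE; // xmlns attributes not separated here
                case Node.COMMENT_NODE: return NodeKind.COMMENT;
                case Node.PROCESSING_INSTRUCTION_NODE: return NodeKind.PROCESSING_INSTRUCTION;
                default: return null; // DOM node types with no XDM counterpart
            }
        }

        public String getLocalName(Node node) { return node.getLocalName(); }

        public String getStringValue(Node node) {
            if (node.getNodeType() == Node.DOCUMENT_NODE) {
                // Document.getTextContent() is null in the DOM; the XDM string value of a
                // document is the concatenated text of its descendants.
                Node root = ((Document) node).getDocumentElement();
                return root == null ? "" : root.getTextContent();
            }
            return node.getTextContent();
        }

        public Node getFirstChildElement(Node node) {
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) return c;
            }
            return null;
        }
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DomModel model = new DomModel();
        Document doc = parse("<a>x<b>y</b><!--c-->z</a>");
        System.out.println(model.getNodeKind(doc) + " " + model.getStringValue(doc));
    }
}
```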
For convenience, a very similar API, with minimal (positional) state is also defined: Cursor. Cursor provides a common idiom, maintaining its positional state within the target tree, which is frequently encountered in processing XML. Where Model's navigation APIs typically return a node (N getFirstChildElement(N context)), Cursor's corresponding APIs return true or false and change the Cursor's state (boolean moveToFirstChildElement()). Where Model's property accessors require a context node (String getStringValue(N context)), Cursor's use its current state (String getStringValue()). The design intent is that anything that may be accomplished with a Model may also be accomplished with a Cursor. Note that Cursor is not forward-only.
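The relationship between the two idioms can be sketched by layering a cursor over a stateless model. The names NavModel, DomNav, and SketchCursor are hypothetical, and the method set is an assumption cut down to what the walk needs. The cursor holds the context node as its only state; each move returns true or false, and a failed move leaves the position unchanged. The walk also moves back up the tree, which a forward-only cursor could not do.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CursorSketch {
    /** Hypothetical stateless navigation handle (a reduced stand-in for Model). */
    interface NavModel<N> {
        N getParent(N node);
        N getFirstChildElement(N node);
        N getNextSiblingElement(N node);
        String getLocalName(N node);
    }

    static final class DomNav implements NavModel<Node> {
        public Node getParent(Node n) { return n.getParentNode(); }
        public Node getFirstChildElement(Node n) { return firstElementFrom(n.getFirstChild()); }
        public Node getNextSiblingElement(Node n) { return firstElementFrom(n.getNextSibling()); }
        public String getLocalName(Node n) { return n.getLocalName(); }
        private static Node firstElementFrom(Node c) {
            while (c != null && c.getNodeType() != Node.ELEMENT_NODE) c = c.getNextSibling();
            return c;
        }
    }

    /** Hypothetical cursor: the same operations, with the context node held as state. */
    static final class SketchCursor<N> {
        private final NavModel<N> model;
        private N current;
        SketchCursor(NavModel<N> model, N start) { this.model = model; this.current = start; }
        boolean moveToFirstChildElement() { return moveTo(model.getFirstChildElement(current)); }
        boolean moveToNextSiblingElement() { return moveTo(model.getNextSiblingElement(current)); }
        boolean moveToParent() { return moveTo(model.getParent(current)); }
        String getLocalName() { return model.getLocalName(current); }
        private boolean moveTo(N target) {
            if (target == null) return false; // stay put when the move is impossible
            current = target;
            return true;
        }
    }

    /** Lists descendant element names in document order using only cursor moves. */
    static <N> String elementNames(NavModel<N> model, N start) {
        SketchCursor<N> c = new SketchCursor<>(model, start);
        StringBuilder sb = new StringBuilder();
        if (!c.moveToFirstChildElement()) return "";
        int depth = 1;
        while (true) {
            sb.append(sb.length() == 0 ? "" : " ").append(c.getLocalName());
            if (c.moveToFirstChildElement()) { depth++; continue; }
            while (!c.moveToNextSiblingElement()) {
                if (depth == 1) return sb.toString(); // back at the top level: done
                c.moveToParent();                     // backward move
                depth--;
            }
        }
    }

    static Node parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames(new DomNav(), parse("<a><b/><c><d/></c></a>")));
    }
}
```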
When processing XML, some applications can make use of gXML with nothing more than Model or Cursor. More advanced uses might need the third primary abstraction in the core gXML API, the ProcessingContext. A processing context is precisely what it claims to be: a specialized (for the target tree model), stateful abstraction which provides uniform access to the collection of abstractions which together make up a bridge. Model, Cursor, and ProcessingContext are all parameterized only by <N>ode. The TypedContext extension introduces the <A>tom parameter. ProcessingContext provides Model<N> getModel() and Cursor<N> newCursor(N context) methods, an accessor for the (singleton) Model and a factory for the Cursor. Several additional accessors, functions, and factory methods are available from the context: it is the source for the mutable and typed context extensions (getMutableContext() and getTypedContext()), and for DocumentHandler and FragmentBuilder; it can report whether candidate objects are compatible with the bridge's specialization of <N>ode; it includes a mechanism to permit feature-based extension. For greatest generality, applications should access a bridge via its processing context. An optional ProcessingContextFactory interface is also included in the API, but experience suggests that provision of instances of the factory is an impediment to the target design pattern, dependency injection. That is, applications ought to instantiate the factory interface themselves, consistent with the injection mechanism or API which they use.
The processing context provides access to DocumentHandler, which in turn provides methods to parse from and serialize to streams, readers and writers. ProcessingContext is also a factory for FragmentBuilder, which is-a ContentHandler (for the XDM, not the SAX interface of the same name) and is-a NodeSource. FragmentBuilder is used to programmatically build trees or tree fragments in memory, parallel to parsing a document into memory via the document handler's various parse methods. Model and Cursor also accept a ContentHandler argument to stream or write themselves. In short, these abstractions provide a range of input/output operations for XML using a particular bridge.
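Programmatic construction through an event-style builder might look like the following sketch. SketchBuilder and DomBuilder are hypothetical; the method names are assumptions modelled on the start/end pairs and leaf-creating methods that such a builder needs, not the real FragmentBuilder signatures. The caller issues events in document order and never touches the DOM directly.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class BuilderSketch {
    /** Hypothetical event-style builder: start/end pairs for containers, one call per leaf. */
    interface SketchBuilder<N> {
        void startDocument();
        void endDocument();
        void startElement(String namespaceURI, String localName);
        void endElement();
        void attribute(String namespaceURI, String localName, String value);
        void text(String data);
        void comment(String data);
        N getNode(); // the finished tree
    }

    /** Builds a DOM tree from the events; "current" tracks the open container. */
    static final class DomBuilder implements SketchBuilder<Node> {
        private final Document doc = newDocument();
        private Node current;

        public void startDocument() { current = doc; }
        public void endDocument() { current = null; }
        public void startElement(String namespaceURI, String localName) {
            Element e = doc.createElementNS(namespaceURI, localName);
            current.appendChild(e);
            current = e;
        }
        public void endElement() { current = current.getParentNode(); }
        public void attribute(String namespaceURI, String localName, String value) {
            // Only valid while an element is open; the sketch does not guard against misuse.
            ((Element) current).setAttributeNS(namespaceURI, localName, value);
        }
        public void text(String data) { current.appendChild(doc.createTextNode(data)); }
        public void comment(String data) { current.appendChild(doc.createComment(data)); }
        public Node getNode() { return doc; }
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds the equivalent of <greeting lang="en">hi<!--note--></greeting>. */
    static <N> N buildGreeting(SketchBuilder<N> builder) {
        builder.startDocument();
        builder.startElement(null, "greeting"); // null means "no namespace" in the DOM
        builder.attribute(null, "lang", "en");
        builder.text("hi");
        builder.comment("note");
        builder.endElement();
        builder.endDocument();
        return builder.getNode();
    }

    public static void main(String[] args) {
        Document doc = (Document) buildGreeting(new DomBuilder());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```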
These five abstractions make up the core of the gXML API. There are other, supporting abstractions, some of which become more significant in particular contexts. An untyped, immutable bridge implementation (minimally) provides implementations for these five abstractions over a given tree model.
Mutability
gXML provides two standard extensions in the core ProcessingContext to permit bridges to signal support for optional functionality. The first extension permits mutability. Immutability provides important benefits for XML processing, but all currently-available tree models are mutable, and nearly all processors and applications expect mutability. To ease migration, ProcessingContext provides a method, getMutableContext(), which permits the bridge to signal that it supports mutability, by returning an implementation of the MutableContext extension. A mutable context, in turn, provides access to MutableModel and MutableCursor, each of which extends the corresponding immutable interface (adding methods to add and remove nodes, and to change the content of a document or element node), and also provides access to a NodeFactory implementation which permits the creation of nodes in memory, independent of any tree (within the limits of the underlying tree model).
Nota bene: the mutable interfaces, unlike other abstractions in gXML, are not attempts to implement a portion of the XQuery Data Model in Java. The XQuery Data Model (and, in fact, XQuery 1.0, XSLT 2.0, and XPath 2.0) do not provide specification of property mutators. Consequently, this portion of the API has been designed to be roughly compatible with the XDM, as an extension, and to be roughly compatible with the corresponding mutable APIs in dominant tree models. However, once XQuery produces its "update" mechanism, this portion of the API is unlikely to prove conformant.
Schema Awareness
The TypedContext extension parallels the MutableContext extension. It provides the XDM-defined schema-aware properties and manipulations. Most notably, the typed context introduces an additional parameter, the <A>tom handle. The base and mutable interfaces deal only with string values for text node and attribute content (in XDM terms, actually untyped atomic). The XQuery Data Model defines the concept of "atom", which corresponds to a typed value or list of typed values. Atoms are inherently sequences of atoms (a single atom is a one-element list); "sequence" is also introduced in the schema-aware API, but unlike atom, is not represented by an independent common parameter.
TypedContext is more complex than MutableContext. As a mutable context provides access to mutable models and cursors, a typed context provides an accessor for a TypedModel and is a factory for TypedCursor, which are extensions of the base Model and Cursor, adding methods to access the type-name and typed-value properties. As the base processing context can identify <N>odes, so the typed context can identify <A>toms. TypedContext enhances the base FragmentBuilder as a type- and atom-aware SequenceBuilder. To handle typed values, TypedContext provides an accessor for the AtomBridge, which in turn provides facilities to create, compile, cast, convert (to Java native types), and query atoms, in a fashion consistent with the XDM.
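The role of the <A>tom parameter can be illustrated with a toy. In this sketch ToyAtom, SketchAtomBridge, and ToyAtomBridge are hypothetical, and the four operations (create, cast, convert to a Java type, query the type name) are assumptions chosen to mirror the kinds of facility listed above, not the actual AtomBridge methods. The generic method turns list-typed content into a sequence of integer atoms without knowing the concrete atom class.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

class AtomSketch {
    /** Toy atom invented for this sketch: an XML Schema type name plus a Java value. */
    static final class ToyAtom {
        final String typeName;
        final Object value;
        ToyAtom(String typeName, Object value) { this.typeName = typeName; this.value = value; }
    }

    /** Hypothetical, reduced stand-in for an atom bridge, parameterized by the atom handle A. */
    interface SketchAtomBridge<A> {
        A createUntypedAtomic(String lexical); // create
        A castToInteger(A atom);               // cast
        BigInteger toBigInteger(A atom);       // convert to a Java native type
        String getTypeName(A atom);            // query
    }

    static final class ToyAtomBridge implements SketchAtomBridge<ToyAtom> {
        public ToyAtom createUntypedAtomic(String lexical) {
            return new ToyAtom("xs:untypedAtomic", lexical);
        }
        public ToyAtom castToInteger(ToyAtom atom) {
            if (atom.value instanceof BigInteger) return atom;
            // Throws NumberFormatException when the lexical form is not an integer.
            return new ToyAtom("xs:integer", new BigInteger(atom.value.toString().trim()));
        }
        public BigInteger toBigInteger(ToyAtom atom) {
            return (BigInteger) castToInteger(atom).value;
        }
        public String getTypeName(ToyAtom atom) { return atom.typeName; }
    }

    /** Typed value of list-typed content such as "1 2 30": a sequence of integer atoms. */
    static <A> List<A> integerList(SketchAtomBridge<A> bridge, String lexical) {
        List<A> atoms = new ArrayList<>();
        for (String token : lexical.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                atoms.add(bridge.castToInteger(bridge.createUntypedAtomic(token)));
            }
        }
        return atoms;
    }

    public static void main(String[] args) {
        ToyAtomBridge bridge = new ToyAtomBridge();
        for (ToyAtom atom : integerList(bridge, " 1 2  30 ")) {
            System.out.println(bridge.getTypeName(atom) + " " + bridge.toBigInteger(atom));
        }
    }
}
```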
TypedContext also provides access to the MetaBridge, which primarily serves to map the names of types to their corresponding implementations in the (included) XML Schema model. TypedContext makes use of this bridge itself, because it extends the core schema model interface, SmSchema. SmSchema permits definition and declaration of custom types, registry of types, and lookup of types. In other words, the typed context provides a cache of types (supplied via parsing of schemas or programmatically) which are being used in the processing of a collection of XML documents. This is actually the origin of the concept and term "processing context," though it now exists for the untyped API as well.
Building Bridges with gXML
For greatest utility, gXML ought to have bridges on every tree model for XML in Java. The authors have not been able to accomplish this themselves, but can demonstrate that creating additional bridges is a straightforward task.
The three bridges included in the gXML source tree provide examples of the finished product. The development process is easily described. Note, however, that most tree models present unique challenges when adapted to the XQuery Data Model; our experience suggests that most development time is consumed by handling these impedance mismatches.
Untyped, Immutable
What needs to be done to create a new base bridge (untyped, immutable) for an as-yet unsupported tree model? There are five steps:
- Implement ProcessingContext and Model. Decide what the <N> (node) abstraction must be.
  For instance: the DOM defines <N> as Node. AxiOM defines it as Object (AxiOM does not have a single base interface that marks all node types). The Cx bridge proof-of-concept uses XmlNode.
- Use the bridgekit module to get a simple, generic implementation of Cursor (over the custom Model).
  The bridgekit module is a collection of utilities intended to help bridge developers. It includes, for instance, an implementation of the XML Schema model (SmSchema) and the XmlAtom typed-value implementation, as well as the CursorOnModel helper used here.
- Implement FragmentBuilder.
  The FragmentBuilder interface has five methods for creating Text, Attribute, Namespace, Comment, and Processing Instruction node types, and an additional two each (start and end) for the container node types, Element and Document.
- Use the generic implementation of DocumentHandler from the input-output processor.
  The generic DocumentHandler in the input-output module is not terribly mature or robust, but can do the job for an initial implementation.
- Use the bridgetest module to verify equivalence with existing bridges.
  The bridgetest module is designed to make implementation easy; enabling each test requires only that the bridge implement the single abstract method, which returns the bridge's implementation of ProcessingContext (from which all other abstractions can be reached). Adding a test implementation is thus mostly a mechanical task.
This is all that's required. For this minimum, getMutableContext() and getTypedContext() (on ProcessingContext) should both return null, indicating no support.
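From the caller's side, the null-return convention amounts to a capability check before any mutation is attempted. The sketch below uses hypothetical names throughout (SketchContext, SketchMutable, the two toy contexts, tryAppend); only the convention that a null return from getMutableContext() signals no support is taken from the text.

```java
import java.util.ArrayList;
import java.util.List;

class CapabilitySketch {
    /** Toy node type invented for this sketch. */
    static final class Toy {
        final String name;
        final List<Toy> kids = new ArrayList<>();
        Toy(String name) { this.name = name; }
    }

    /** Hypothetical mutable extension. */
    interface SketchMutable<N> {
        void appendChild(N parent, N child);
    }

    /** Hypothetical, reduced processing context. */
    interface SketchContext<N> {
        boolean isNode(Object candidate);
        SketchMutable<N> getMutableContext(); // null signals "mutability not supported"
    }

    /** A base (untyped, immutable) bridge: the optional extension is absent. */
    static final class ReadOnlyToyContext implements SketchContext<Toy> {
        public boolean isNode(Object candidate) { return candidate instanceof Toy; }
        public SketchMutable<Toy> getMutableContext() { return null; }
    }

    /** The same bridge with the mutable extension supplied. */
    static final class MutableToyContext implements SketchContext<Toy> {
        public boolean isNode(Object candidate) { return candidate instanceof Toy; }
        public SketchMutable<Toy> getMutableContext() {
            return (parent, child) -> parent.kids.add(child);
        }
    }

    /** Returns true if the child was attached, false if the bridge is read-only. */
    static <N> boolean tryAppend(SketchContext<N> context, N parent, N child) {
        SketchMutable<N> mutable = context.getMutableContext();
        if (mutable == null) return false;
        mutable.appendChild(parent, child);
        return true;
    }

    public static void main(String[] args) {
        Toy root = new Toy("root");
        System.out.println(tryAppend(new ReadOnlyToyContext(), root, new Toy("a")));
        System.out.println(tryAppend(new MutableToyContext(), root, new Toy("b")));
        System.out.println(root.kids.size());
    }
}
```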
Mutability
To add support for mutability:
- Implement MutableContext and return it from ProcessingContext instead of null.
  MutableContext provides access to the NodeFactory, MutableModel, and MutableCursor implementations.
- Implement MutableModel as an extension of the base Model from above.
  MutableModel adds methods to set attributes and namespaces, to add, remove, and replace children.
- Use the bridgekit module to base the bridge's MutableCursor on its MutableModel.
  The bridgekit implementations are reasonable starting points, though optimization is likely to require a custom implementation.
- Implement NodeFactory.
  NodeFactory contains methods to create each node type, where MutableModel establishes the relationships between nodes.
- Add tests from the bridgetest module.
  In this case, there's only one, at present.
This is admittedly easier to describe than to accomplish. Approaches to mutability among tree models vary much more widely than approaches to navigation and analysis.
On the other hand, gXML's approach to mutability is more restricted than most current tree APIs. The gXML mutable API does not support changing the value of a text or attribute node, for instance. Leaf nodes remain immutable; container nodes (document and element) are mutable in content (contained nodes) only.
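A short sketch may help show what "mutable in content only" implies for everyday edits. The interfaces here (SketchNodeFactory, SketchMutableModel) and their method names are hypothetical, and the DOM is used as the underlying body purely for illustration. To "change" the text of an element, the caller creates a new text leaf through the factory and swaps it for the old one; no leaf is ever edited in place.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class MutableSketch {
    /** Hypothetical factory: creates detached nodes (DOM nodes still need an owner document). */
    interface SketchNodeFactory<N> {
        N createElement(String localName);
        N createText(String data);
    }

    /** Hypothetical mutable model: only the content of containers can change. */
    interface SketchMutableModel<N> {
        void appendChild(N parent, N child);
        void replaceChild(N parent, N oldChild, N newChild);
        N getFirstChild(N parent);
        String getStringValue(N node);
    }

    static final class DomFactory implements SketchNodeFactory<Node> {
        private final Document owner;
        DomFactory(Document owner) { this.owner = owner; }
        public Node createElement(String localName) { return owner.createElementNS(null, localName); }
        public Node createText(String data) { return owner.createTextNode(data); }
    }

    static final class DomMutableModel implements SketchMutableModel<Node> {
        public void appendChild(Node parent, Node child) { parent.appendChild(child); }
        public void replaceChild(Node parent, Node oldChild, Node newChild) {
            parent.replaceChild(newChild, oldChild); // DOM argument order: new first, then old
        }
        public Node getFirstChild(Node parent) { return parent.getFirstChild(); }
        public String getStringValue(Node node) { return node.getTextContent(); }
    }

    /** Swaps the element's first child for a fresh text leaf (assumes that child is the text). */
    static <N> void replaceText(SketchMutableModel<N> model, SketchNodeFactory<N> factory,
                                N element, String text) {
        N old = model.getFirstChild(element);
        N fresh = factory.createText(text);
        if (old == null) model.appendChild(element, fresh);
        else model.replaceChild(element, old, fresh);
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds <greeting>hello</greeting>, replaces the leaf, and returns the new string value. */
    static String demo() {
        Document doc = newDocument();
        DomFactory factory = new DomFactory(doc);
        DomMutableModel model = new DomMutableModel();
        Node greeting = factory.createElement("greeting");
        model.appendChild(doc, greeting);
        model.appendChild(greeting, factory.createText("hello"));
        replaceText(model, factory, greeting, "goodbye");
        return model.getStringValue(greeting);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```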
Schema Awareness
To add support for schema-awareness:
- Implement TypedContext and return it from ProcessingContext instead of null; note that TypedContext is-a SmSchema. Decide what the <A> (atom) abstraction must be.
  Current implementations all define <A> as XmlAtom. This is not required.
- Implement TypedModel as an extension of the base Model from above.
  The TypedModel interface adds only five methods to Model, all related to the introduction of type names and typed values. Actually ensuring that the type annotations and typed values are associated with the nodes in the tree is one of the most challenging tasks in implementation.
- Use the bridgekit module to base the bridge's TypedCursor on its TypedModel.
  CursorOnTypedModel extends CursorOnModel as expected.
- Implement or reuse from the bridgekit module an AtomBridge (typed value support).
  If the chosen <A>tom is XmlAtom, the XmlAtomBridge already exists.
- Implement or reuse from the bridgekit module a MetaBridge (type support).
  Again, if the <A>tom is XmlAtom, a MetaBridge exists in the bridgekit.
- Implement SequenceBuilder as an extension of the FragmentBuilder from above.
  SequenceBuilder adds overrides for the attribute(), startElement(), and text() methods (adding type names and typed values), plus methods to create an atom and to start and end a sequence.
- Add the typed tests from the bridgetest module.
  As with the standard tests, these are easy to implement, following the same pattern.
For schema awareness, the most straightforward approach is going to be reusing the generic implementations found in the bridgekit module, but better results may be achieved by customizing the code. This is an area requiring further experience before establishing guidelines for best practices.
Bridge Traffic
Using bridges is a little less amenable to slideshow style lists, but the principles remain straightforward. When using gXML, it is important to understand "dependency inversion": bridges should be injected, if at all possible, rather than directly instantiated. It is possible to design an application or processor that can react to input by directly instantiating the needed bridge, but it's best to reduce the number of places that contain reference to the tree model packages to as few as possible. One class is ideal; it is then responsible for providing a processing context for a given bridge on demand.
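The "one class" arrangement can be sketched as follows. All names are hypothetical (SketchContext, BridgeProvider, DomContext, FlatContext, textOf), and the reduced context with parse and getStringValue is an assumption standing in for the real processing context and document handler. FlatContext is a deliberately crude toy bridge included only for contrast. The application method is generic in N and compiles without naming any tree model; BridgeProvider is the only place that selects one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BridgeProviderSketch {
    /** Hypothetical, reduced context: a parser plus one property accessor. */
    interface SketchContext<N> {
        N parse(String xml);
        String getStringValue(N node);
    }

    /** The only class in the application that chooses a concrete bridge. */
    static final class BridgeProvider {
        static SketchContext<?> contextFor(String bridgeName) {
            switch (bridgeName) {
                case "dom": return new DomContext();
                case "flat": return new FlatContext();
                default: throw new IllegalArgumentException("unknown bridge: " + bridgeName);
            }
        }
    }

    /** Bridge over the DOM; parse returns the document element. */
    static final class DomContext implements SketchContext<Node> {
        public Node parse(String xml) {
            try {
                DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
                f.setNamespaceAware(true);
                return f.newDocumentBuilder()
                        .parse(new InputSource(new StringReader(xml)))
                        .getDocumentElement();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
        public String getStringValue(Node node) { return node.getTextContent(); }
    }

    /** Toy bridge invented for contrast: the "tree" is just the text with tags stripped. */
    static final class FlatContext implements SketchContext<String> {
        public String parse(String xml) { return xml.replaceAll("<[^>]*>", ""); }
        public String getStringValue(String node) { return node; }
    }

    /** Application code: generic in N, unaware of which bridge was injected. */
    static <N> String textOf(SketchContext<N> context, String xml) {
        return context.getStringValue(context.parse(xml));
    }

    public static void main(String[] args) {
        String xml = "<a>x<b>y</b></a>";
        System.out.println(textOf(BridgeProvider.contextFor("dom"), xml));
        System.out.println(textOf(BridgeProvider.contextFor("flat"), xml));
    }
}
```

Because contextFor returns SketchContext<?>, the compiler captures the unknown node type at the call to textOf, so the caller never has to name it.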
Most applications will spend most of their time with the Model (or Cursor) interfaces, which permit navigation and interrogation. Methods provide access to names, values, and other characteristics (XQuery Data Model properties) of the node, and permit navigation in a variety of ways to target nodes. An appendix shows the content of the Model interface.
FragmentBuilder (for construction in memory) and DocumentHandler (for parsing and serializing) are likely to be important. Existing applications or developers wedded to the concept of mutability are likely to make use of the APIs in the mutable model (or cursor) and the NodeFactory. Applications or processors needing W3C XML Schema support (common inside the enterprise, for instance) are likely to make extensive use of TypedContext, particularly as a schema cache and for access to typed models and cursors.
At present, gXML has bridges, in varying states of maturity, for the DOM (level 3 support currently required), for AxiOM (LLOM only; support for typed context rather weak), and for a reference bridge called Cx (a clean, if naive, reimplementation of the XQuery Data Model from scratch, and a gXML bridge over that implementation). The DOM was chosen because of its ubiquity; AxiOM because the web services area is a target for gXML proselytizers; Cx exists primarily to demonstrate that the shared idiosyncrasies of DOM and AxiOM (there are a few) are not fundamental to gXML.
Processing XML with gXML
gXML provides an extensive API for bridges, which not only provides the entry point for applications and processors, but also makes the development of new bridges easy to describe. In sharp contrast, no interface, no contract, is specified for XML processors designed for use with gXML. While some processors might reasonably be defined to have a method with the signature: N process(N, Model<N>), for others this is entirely inappropriate. Even for processors that might reasonably "process" a node, their function is more clearly expressed if they "transform" or "extract" or "enhance", or otherwise mark their "processing" by its specific name, not the more general one.
So, what is a gXML processor? As the gXML team uses the term, a processor is a code library that performs some specific, well-described function over XML. Most processors can be described with a single word or phrase: "serializer," "parser," "converter," "validator," "transformer," "signer," and so on. A processor is distinguished from an "application," which may create (generate), destroy (consume), modify, and otherwise manipulate XML in multiple steps. Where a processor contributes special functionality to the performance of a goal, the application oversees and orchestrates achievement of the goal from receipt to completion. To further distinguish, a bridge provides the abstraction over which the applications and processors operate, including the model, input/output, and a context that associates related tree-specific functions.
Stateful
gXML processors may be divided, for purposes of discussion, into two classes: stateful and stateless. Here, "state" refers to the processor's need to maintain state in the form of any of the parameters specialized by a particular bridge implementation (<N> and <A>), disregarding maintenance of state unrelated to gXML parameters. A stateful processor is ideally written generically, but certain of its component classes will themselves be parameterized with one or both of the node and atom handles. Consequently, at instantiation, a given instance of a processor is tied, ipso facto, to a particular bridge implementation. Like java.util.List<QName>, a generic processor taking only <N> as a parameter would have to be specialized as GenericProcessor<Node> for use with the DOM bridge; the same class would be separately instantiated for use with the Cx bridge as GenericProcessor<XmlNode>. Stateful processors typically contain one or more member fields whose type is specified as a parameter (or which is a parameterized class, such as an instance of Cursor<N> or Bookmark<N>).
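The stateful style can be sketched as follows. This is an illustration under stated assumptions rather than gXML code: Model here is a hypothetical one-method subset of the real interface, GenericProcessor borrows the name used above but its methods are invented, and ToyNode and ToyModel stand in for a bridge's node type and model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical one-method subset of Model<N>; the method name follows the appendix listing.
interface Model<N> {
    String getLocalName(N node);
}

// Stateful: the fields are typed with <N>, so each instance is bound to one
// bridge's node type from the moment it is instantiated.
final class GenericProcessor<N> {
    private final Model<N> model;                        // parameterized member
    private final List<N> bookmarks = new ArrayList<>(); // state typed with N

    GenericProcessor(Model<N> model) {
        this.model = model;
    }

    void bookmark(N node) {
        bookmarks.add(node);
    }

    List<String> bookmarkedNames() {
        List<String> names = new ArrayList<>();
        for (N node : bookmarks) {
            names.add(model.getLocalName(node));
        }
        return names;
    }
}

// Toy node type and model, invented for this sketch.
final class ToyNode {
    final String name;

    ToyNode(String name) {
        this.name = name;
    }
}

final class ToyModel implements Model<ToyNode> {
    @Override
    public String getLocalName(ToyNode node) {
        return node.name;
    }
}

final class StatefulDemo {
    public static void main(String[] args) {
        // Specialized for the toy bridge; a DOM bridge would need GenericProcessor<org.w3c.dom.Node>.
        GenericProcessor<ToyNode> processor = new GenericProcessor<>(new ToyModel());
        processor.bookmark(new ToyNode("order"));
        processor.bookmark(new ToyNode("item"));
        System.out.println(processor.bookmarkedNames());
    }
}
```

The compiler rejects any attempt to hand a node from one bridge to a processor instance specialized for another, which is the practical meaning of being tied to a bridge at instantiation.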
For example, an input-output module is included in the gXML source tree. This module includes a stateful processor implementing DocumentHandler<N>. This DocumentHandler contains a member field which is a FragmentBuilder<N> supplied by the bridge's ProcessingContext. This is a good example of the stateful style: at instantiation, each DefaultDocumentHandler<N> is specialized for the bridge's definition of <N>, associating this handler instance with a particular bridge (in fact, associating it with a single instance of the bridge's implementation of ProcessingContext). This processor's "process" methods are defined by the DocumentHandler interface, found in the core API.
Stateless
An alternate style of implementation is the stateless processor. If no class in the processor needs to retain state typed as or with a gXML parameter, then the processor may be used by declaring the necessary parameters on a method, and supplying the necessary disambiguation as arguments to the method. For instance, a stateless processor might expose the method:
<N> N nearestAncestor(Iterable<N> context, Model<N> model)
The arguments to the method are both parameterized: the context provides a collection of nodes; the model provides the tool to interrogate each of the nodes in the supplied context (this hypothetical example finds the nearest common ancestor of all the nodes supplied in the list, or null if no such common ancestor exists).
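One possible body for such a method is sketched below. Several things here are assumptions: Model is reduced to a hypothetical single-method interface exposing only getParent, node identity is tested with ==, and "ancestor" is read as a proper ancestor (a node is never its own ancestor). ToyNode is an invented tree model used only to exercise the method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical one-method subset of Model<N>; getParent appears in the appendix listing.
interface Model<N> {
    N getParent(N origin);
}

final class Ancestors {
    private Ancestors() {}

    // Stateless: both type parameter and model arrive as arguments; nothing is retained.
    static <N> N nearestAncestor(Iterable<N> context, Model<N> model) {
        List<N> common = null; // root-first chain of ancestors shared by all nodes seen so far
        for (N node : context) {
            List<N> chain = new ArrayList<>();
            for (N a = model.getParent(node); a != null; a = model.getParent(a)) {
                chain.add(a);
            }
            Collections.reverse(chain); // root first
            if (common == null) {
                common = chain;
            } else {
                // Keep only the shared prefix (identity comparison is an assumption of this sketch).
                int shared = 0;
                while (shared < common.size() && shared < chain.size()
                        && common.get(shared) == chain.get(shared)) {
                    shared++;
                }
                common = new ArrayList<>(common.subList(0, shared));
            }
        }
        return (common == null || common.isEmpty()) ? null : common.get(common.size() - 1);
    }
}

// Toy tree model, invented for this sketch: each node knows only its parent.
final class ToyNode {
    static final Model<ToyNode> MODEL = node -> node.parent;

    final String name;
    final ToyNode parent;

    ToyNode(String name, ToyNode parent) {
        this.name = name;
        this.parent = parent;
    }
}

final class StatelessDemo {
    public static void main(String[] args) {
        ToyNode root = new ToyNode("root", null);
        ToyNode a = new ToyNode("a", root);
        ToyNode b = new ToyNode("b", a);
        ToyNode c = new ToyNode("c", a);
        ToyNode d = new ToyNode("d", b);
        System.out.println(Ancestors.nearestAncestor(Arrays.asList(d, c), ToyNode.MODEL).name); // prints "a"
    }
}
```

Because no field is typed with N, the same static method serves every bridge in the same application without any per-bridge instantiation.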
An extremely simple example of a stateless processor may be found in the convert module, in the gXML source tree. It's so simple that it's debatable whether it's a processor, or simply an instantiation of an idiom. StaticConverter has a single, static method, with the signature:
<Nsrc, Ntrg> Ntrg convert(Cursor<Nsrc> cursor, FragmentBuilder<Ntrg> builder)
It does what it says on the tin: using the supplied Cursor and FragmentBuilder, from one or two different bridges, it converts from one tree model representation to another (strictly speaking, this is a transforming copy, rather than a conversion; also, if the Cursor and FragmentBuilder are supplied by the same bridge, this is simply a copy).
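The idiom can be sketched in a few lines. The Cursor and FragmentBuilder interfaces below are hypothetical, heavily reduced stand-ins (elements and text only, no namespaces or attributes) and are not the real gXML declarations; ToyNode, ToyCursor, and MarkupBuilder are invented so that the sketch has a source and a target "tree model" to copy between.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, reduced stand-ins for the gXML interfaces of the same names.
interface Cursor<N> {
    boolean isElement();
    String getLocalName();
    String getStringValue();
    boolean moveToFirstChild();
    boolean moveToNextSibling();
    boolean moveToParent();
}

interface FragmentBuilder<N> {
    void startElement(String localName);
    void text(String value);
    void endElement();
    N getNode();
}

final class StaticConverterSketch {
    private StaticConverterSketch() {}

    // Same shape as the signature quoted in the text: read via one bridge, build via another.
    static <Nsrc, Ntrg> Ntrg convert(Cursor<Nsrc> cursor, FragmentBuilder<Ntrg> builder) {
        copy(cursor, builder);
        return builder.getNode();
    }

    // Replays the subtree under the cursor as builder events; the cursor ends where it started.
    private static void copy(Cursor<?> cursor, FragmentBuilder<?> builder) {
        if (!cursor.isElement()) {
            builder.text(cursor.getStringValue());
            return;
        }
        builder.startElement(cursor.getLocalName());
        if (cursor.moveToFirstChild()) {
            do {
                copy(cursor, builder);
            } while (cursor.moveToNextSibling());
            cursor.moveToParent();
        }
        builder.endElement();
    }

    public static void main(String[] args) {
        ToyNode tree = new ToyNode("a", null).add(new ToyNode(null, "hi"));
        System.out.println(convert(new ToyCursor(tree), new MarkupBuilder())); // prints "<a>hi</a>"
    }
}

// Toy source model: a node is an element (name set) or a text node (text set).
final class ToyNode {
    final String name;
    final String text;
    final List<ToyNode> children = new ArrayList<>();
    ToyNode parent;

    ToyNode(String name, String text) { this.name = name; this.text = text; }

    ToyNode add(ToyNode child) { child.parent = this; children.add(child); return this; }
}

final class ToyCursor implements Cursor<ToyNode> {
    private ToyNode current;

    ToyCursor(ToyNode start) { this.current = start; }

    public boolean isElement() { return current.name != null; }
    public String getLocalName() { return current.name; }
    public String getStringValue() { return current.text; }

    public boolean moveToFirstChild() {
        if (current.children.isEmpty()) return false;
        current = current.children.get(0);
        return true;
    }

    public boolean moveToNextSibling() {
        if (current.parent == null) return false;
        List<ToyNode> siblings = current.parent.children;
        int next = siblings.indexOf(current) + 1;
        if (next >= siblings.size()) return false;
        current = siblings.get(next);
        return true;
    }

    public boolean moveToParent() {
        if (current.parent == null) return false;
        current = current.parent;
        return true;
    }
}

// Toy target model: the "tree" is unescaped markup held in a String.
final class MarkupBuilder implements FragmentBuilder<String> {
    private final StringBuilder out = new StringBuilder();
    private final Deque<String> open = new ArrayDeque<>();

    public void startElement(String localName) { open.push(localName); out.append('<').append(localName).append('>'); }
    public void text(String value) { out.append(value); }
    public void endElement() { out.append("</").append(open.pop()).append('>'); }
    public String getNode() { return out.toString(); }
}
```

Nothing in convert mentions either toy class, which is why a cursor and a builder drawn from two different bridges can be paired freely.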
A more complex example may be found in the same module: Converter mixes the stateful and stateless styles. It is instantiated with a (source) processing context; it is then able, on request, to convert to any supplied target processing context, retaining type information if possible (if both source and target bridges advertise themselves as schema-aware, it uses SequenceBuilder and the TypedModel's atom-aware stream() method in preference to the untyped FragmentBuilder and Model).
Developing and Refactoring
The gXML source tree contains, in addition to the processors mentioned above, an XPath 1.0 processor, a schema parser, and a schema validator. The XPath processor is stateless; the schema processors (unsurprisingly) stateful. Processors for XPath 2.0, XSLT 2.0, and XQuery 1.0 have also been explored, although this code is not included in the distribution.
During the development of the API, in early 2009, the Apache Woden project (1.0M8) was refactored as a proof of concept. This effort was based on an earlier revision of the API; the refactoring was extensive, taking advantage of the immutable paradigm. Woden was chosen as an example because it contained an example of multi-tree abstraction: wrapper classes permit Woden to parse and analyze WSDL supplied either as AxiOM or as DOM trees. The project required about a month, but the result seemed a dramatic validation of gXML principles and design: the lines of code (LOC) count was reduced by about 15%, inconsistencies in the handling of DOM versus AxiOM were eliminated, and supported models grew from two to five (including DOM, AxiOM, the Cx reference model, a proprietary internal model, and an experimental model based on EXI). There is no guarantee of such an LOC count reduction, of course; results will depend upon the original source.
As part of the preparation for release as open source, a similar effort was undertaken to refactor the Apache XML Security project in early 2010. This was a more cautious effort, adopting as a guideline that no externally used API should change. Instead, the existing interfaces were enhanced with a gXML code path. In addition to preserving backward compatibility in the API, this refactoring did not attempt a wholesale restatement of the security problem in immutable context, but relied extensively upon MutableContext and the capabilities supported therein. This effort is ongoing, and does not appear to promise a reduction in code size, given its goals. It has provided the team with an excellent test case for the mutable APIs (and has even exposed missing XDM-defined functionality in the core APIs); these findings have been used to improve both areas. Nonetheless, it appears to validate the concept of cautious, compatibility-maintaining refactoring; the refactored API appears able to pass the same tests that the original DOM-based API passed.
The experience from these (and other) proofs of concept, refactoring existing XML processors and developing new processors, leads to some tentative conclusions about the efforts involved and the possible development patterns. We note that because all current tree models incorporate mutability without questioning its utility, most processors approach problems of XML manipulation as a tree mutation.
New Development
The time required for development of a new processor varies depending upon the complexity of the processing. In our experience, adopting the immutable paradigm can actually simplify development, though it requires an effort to state the problem as a transformation rather than as a mutation. Processors developed for gXML take no more time, and often less, to develop (and debug) than processors over a single tree model. When designed for immutability, the resulting processor often shows excellent performance characteristics, without requiring significant attention to this area.
Examples are included in the distribution, in the processor module and its children: input-output, convert, w3c.xs (schema parsing), and w3c.xs.validation.
Refactoring: Processing Mutable Trees
Existing processors that have already been released (such as the Apache XML Security example) are apt to wish to maintain existing customer bases. The approach to take, in this case, seems to be to produce an extended, parallel API: where the existing API takes a Node, provide an overload that accepts (for example) N, Model<N>, or (if changing the state of the supplied argument is acceptable) Cursor<N>. Then change the original DOM-based function so that it merely calls the new gXML-based method. This approach increases the size of the code base, but preserves the logic of the API, validation via the existing test suite, and compatibility with existing clients.
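The parallel-API pattern might look like the following sketch. ElementCounter and its methods are invented for illustration, Model is a hypothetical three-method subset of the real interface (the method names are taken from the appendix listing), and DomModel is a toy adapter rather than the gXML DOM bridge. Only standard org.w3c.dom and JAXP calls are used for the DOM side.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical three-method subset of Model<N>.
interface Model<N> {
    N getFirstChild(N origin);
    N getNextSibling(N node);
    boolean isElement(N node);
}

// Toy DOM adapter, standing in for the model a DOM bridge would supply.
final class DomModel implements Model<Node> {
    public Node getFirstChild(Node origin) { return origin.getFirstChild(); }
    public Node getNextSibling(Node node) { return node.getNextSibling(); }
    public boolean isElement(Node node) { return node.getNodeType() == Node.ELEMENT_NODE; }
}

final class ElementCounter {
    private ElementCounter() {}

    // Existing public entry point: the signature is untouched, the body now only delegates.
    static int countElements(Node node) {
        return countElements(node, new DomModel());
    }

    // New parallel entry point: the same logic, usable with any tree model that has a Model<N>.
    static <N> int countElements(N node, Model<N> model) {
        int count = model.isElement(node) ? 1 : 0;
        for (N child = model.getFirstChild(node); child != null; child = model.getNextSibling(child)) {
            count += countElements(child, model);
        }
        return count;
    }

    // Helper for the demo: parses a string with the JDK's default DOM parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b/><c>text<d/></c></a>");
        System.out.println(countElements(doc)); // existing DOM callers are unaffected; prints 4
    }
}
```

Existing tests written against the Node-based method keep running unchanged, and they now exercise the generic code path as well.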
Firm estimates depend upon the size and complexity of the code base, but experience seems to demonstrate that once the principles are understood, much of the refactoring proceeds in a nearly mechanical fashion. The primary advantage to this form of refactoring is the addition of support for all defined gXML bridges (or all bridges that support mutability); this in turn may permit customers to choose models better suited for a particular problem domain. In the XML Security case, the refactoring produces the ability to use the processor with AxiOM (in the current state of the art; potentially with other tree models as those are developed as well).
Refactoring: Processing Immutable Trees
Refactoring an XML processor for immutable operation is more challenging. The general principle is that instead of considering the problem as one of modifying a tree, the problem is stated as a transforming copy. The XML document is an input; other inputs guide the processing; the output is a new XML document (the original is then typically discarded, or sometimes archived). Our experience addressed Apache Woden, in part because the project was then recently graduated from incubation (that is, it had just made a public 1.0 release), so preservation of API compatibility was deemed less critical; widespread adoption had not yet occurred. Another example is the xpath.impl processor, based on the xpath API module; these modules were both created by refactoring a portion of James Clark's and Bill Lindsey's XT XT. XPath has no need for mutability, obviously; stating the XPath processing problem in immutable context is trivial.
This approach typically changes the logic of processing as well as changing the public API; developers may find that the code that "enhances" (mutates) a tree with information must be localized. That is, instead of receiving, analyzing, modifying, analyzing further, etc., the process is receiving, analyzing, generating/transforming, analyzing further. Creation of new documents is potentially expensive; this is apt to lead developers to minimize occurrences of the event. Awareness of this issue, in our experience, led to code that was more straightforward, easier to understand, and better encapsulated. Note also that a refactoring of a publicly released API might proceed first by preserving API compatibility, and later providing an alternate, transformative code path that parallels the modification path.
Advancing the State of the Art
The gXML team believes that this API presents an exciting opportunity to change the paradigms for XML processing in Java, and to enable a host of additional opportunities for advancing the state of the art. We have discussed the API, bridges, and processors in some detail, above. Now, let's examine the further opportunities that gXML enables.
Because gXML encourages the practice of dependency inversion, of injecting a particular tree model (bridge) at runtime, it effectively bypasses—even leverages, by inclusion of a bridge for the DOM in the distribution—the DOM network effect that has presented Java developers of XML processors and applications with a Hobson's choice: choose a tree model which is technically superior or less awkward to program against but lose interoperability with the vast majority of existing processors and applications, or choose the DOM with its peculiarities and quirks and limitations but gain interoperability with the wider XML ecosystem. Developers of alternative Java XML tree models will (we hope) welcome this, and contribute bridges. Moreover, by permitting this late binding of the tree model, gXML enables use-case specific comparisons of models to each other. This capability for comparison, without losing interoperability, may lead to wider adoption of one or more of the successor models, in one application domain or across domains. Further, given the ability to compare two models in such a way, application and processor developers can provide clear test cases demonstrating issues, which developers of the tree model may find more compelling, more deserving of attention, than is currently the case when any comparison must first develop a custom framework/harness.
By enabling injection of the model, gXML also potentially permits the development of domain-specific tree models, optimized for particular use cases. Such "niche" models are actively discouraged in the current state of the art: they lead in the direction of private code, difficult to learn and difficult to maintain. AxiOM provides an example of a domain-specific model that has survived the process of marginalization; one might argue that it has done so in part through its strong association with the high-profile project Apache Axis 2. Other domains such as strongly typed XML, large XML processing, and XML in constrained memory environments come to mind as potential targets. Customization and optimization are possible both for the underlying tree model, and for the bridge implementation. There is no restriction against implementing multiple bridges for a single underlying tree model—since the pattern is injection, two significantly different bridge implementations over the same underlying tree model may be used by a single application. Here again, there are significant opportunities for domain optimization, in this case by optimizing the bridge implementation rather than changing the underlying tree model.
gXML's championing of the immutable paradigm for XML processing carries powerful potentials for performance enhancements. We cannot, at this point, quantify these benefits (they may even be chimerical), but we have seen immutability adopted in other areas specifically in order to improve performance. Immutability provides guarantees that enable concurrent processing, an increasingly common requirement for applications and processors that must scale to handle large volumes of traffic. With a custom tree model (even an immutable implementation of the DOM, potentially), the notorious impact of XML on memory can potentially be reduced. For applications and processors that already address multiple tree models, significant reductions in code size may accompany improved performance and consistency. Our experience suggests that restating problems as transformation rather than mutation tends to lead to cleaner, better-encapsulated, and typically more performant code.
One particular area in which gXML holds enormous promise is in the processing of "large XML". This is, in a way, the same problem as processing XML with "constrained memory;" whether one identifies the XML as too-large, or memory as too-small, the problem is the same. How can XML be processed if it is too large to fit at once into memory? The obvious answer is a custom tree model, but this answer immediately presents the developer with the DOM "Hobson's choice" outlined above. gXML removes that issue; a processor or application programmed against the gXML API can inject a simple, mature tree model for most processing, or a custom, stored-to-disk, low-memory tree model when the size of the target document exceeds a specified threshold.
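The threshold-based selection described here can be sketched as follows. Everything in the sketch is an assumption made for illustration: ProcessingContext is a hypothetical three-method interface and not the gXML one, and the two "tree models" are toys (one keeps the text in memory, the other hands out Integer handles into a list that stands in for disk storage). The point of interest is that the node type differs between the two contexts, so the processing step is a generic method and the selector returns a wildcard type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, minimal processing context: loads a document and answers one question about it.
interface ProcessingContext<N> {
    N load(String xml);
    int length(N document);
    String name();
}

// Toy "simple, mature" model: the node handle is the in-memory text itself.
final class InMemoryContext implements ProcessingContext<String> {
    public String load(String xml) { return xml; }
    public int length(String document) { return document.length(); }
    public String name() { return "in-memory"; }
}

// Toy "stored" model: the node handle is an Integer key; the list stands in for disk.
final class StoredContext implements ProcessingContext<Integer> {
    private final List<String> store = new ArrayList<>();

    public Integer load(String xml) {
        store.add(xml);
        return store.size() - 1;
    }

    public int length(Integer document) { return store.get(document).length(); }
    public String name() { return "stored"; }
}

final class ThresholdSelector {
    private ThresholdSelector() {}

    // The only place that knows which concrete contexts exist.
    static ProcessingContext<?> select(long documentSize, long threshold) {
        return documentSize > threshold ? new StoredContext() : new InMemoryContext();
    }

    // Generic processing step: wildcard capture binds N to whichever node type was selected.
    static <N> String process(ProcessingContext<N> context, String xml) {
        N document = context.load(xml);
        return context.name() + ":" + context.length(document);
    }

    public static void main(String[] args) {
        String small = "<a/>";
        String large = "<a><b/><c/></a>";
        long threshold = 10; // illustrative value only
        System.out.println(process(select(small.length(), threshold), small)); // prints "in-memory:4"
        System.out.println(process(select(large.length(), threshold), large)); // prints "stored:15"
    }
}
```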
Developers of technologies that compete with XML as descriptions of structured, hierarchical data may have no interest in presenting their formats as XML (may even resent the suggestion), but there are advantages to doing so: the XML programming environment is a large one, populated with numerous processors and applications. A bridge over other such data formats—JSON, for a high-profile example—could provide that format with the capabilities of the entire suite of XML tools (with the reservation that there is apt to be an impedance mismatch of some degree, that the bridge will attempt to minimize). This is most interesting when gXML is used with the immutable paradigm; modifying these alternative structured hierarchical data formats as well as analyzing them is a more difficult problem and likely to have a higher degree of impedance mismatch.
Again particularly with respect to immutable processing, gXML offers an opportunity to pass XML across the virtual machine/Java Native Interface boundary. The XQuery Data Model defines the operations and properties that are possible with (g)XML; there is no impediment to producing a specification-compliant API in other languages, whether they are hosted in the VM (Scala, Jython) or outside it (C++, Perl, Lua). This in turn suggests possibilities for enabling most-efficient processing, for enabling scripting in domain-specific languages, and so on.
Perhaps most significantly, from the point of view of the gXML development team: in recent years a number of new specifications have appeared that offer exciting opportunities for advancing the state of the art of XML processing. In Java, adoption of these technologies—XQuery, XSLT2, XML databases—has been slowed by the lack of support in dominant models, and the limited extensibility possible. Even XML Schema has seen relatively little adoption/development outside the enterprise; gXML includes a schema model to address that issue. More importantly, the XQuery Data Model seems to offer a well-thought foundation for the next ten years of development in XML-related technologies. gXML proposes to embody that model for Java, while providing compatibility with the existing tree models, enabling a unification of processing while promoting differentiation, specialization, and customization of models.
gXML Solution(s)
We submit that gXML addresses the problems that its design set out to address, and that have plagued a large population of developers. It resolves the problem of multiple, competing tree models in Java, leverages the network effect of the dominant Java tree model for XML (and in fact shares that network effect with any other tree model over which a gXML bridge is available), and permits comparison of and late (even runtime) selection of a model best suited to the task. In the process, it begins to resolve the problems of interoperability. It is based on a well-defined, rigorous specification (the XQuery Data Model), which appears to be the best foundation for the next generation of XML technologies. It introduces and promotes the immutable paradigm for XML processing, and permits or encourages the development of models able to fulfill the promise of that paradigm.
gXML represents about five man-years of development, in its current state. Its corporate sponsor has contributed it to open source because its value can be directly correlated with its adoption. More bridges: more value (to the contributing corporation and to everyone using gXML). More processors: more value. For more code, though, we need help. Get involved! Try the code. Our experience has been that it has immediate benefits, even for isolated applications and processors. See a bug? Contribute a patch! Intrigued by the promise gXML offers? Become a committer!
Based on the previous ten years, introduction of so significant a shift in APIs and paradigms in the Java world will need to last at least ten years. The APIs developed ten years ago, viewed in hindsight, show what seem to be obvious lacunae or missed focus. Are there such gaps and blind spots in gXML? Take a look; if we're missing something, tell us now, and help us to address it.
Interested in the opportunities, but not in refining the core APIs? Want to provide an XQuery Data Model over a different, currently unsupported tree model (even over a non-XML structured data model)? Write a bridge. Our experience suggests that investment for a new bridge is about one programmer-month, for complete, but unoptimized functionality. Refinements depend upon the underlying tree model; those that are closer in concept to the XQuery Data Model tend to be easier to improve, while those further away (particularly if they don't conform to XML Infoset) provide more challenges. If developers involved in JDOM, DOM4J, or XOM are reading this, we hope to have intrigued you enough that you'll contribute (or provide independently) a bridge implementation for those models. What about a bridge for JSON? CSV? Could the new, XQuery-conformant crop of XML databases expose programming interfaces as bridges or as processors?
Interested in a particular application of XML? Can it be conceived as an XML processor? Development investment for a gXML processor varies pretty widely, depending upon the complexity of the processing to be done. For instance, the schema validation module included in the gXML source represents perhaps six months of work; the conversion processor (because it really does nothing more than embody an idiom already supported in the gXML core APIs) required no more than a week. XQuery or XSLT 2.0 processors would represent significant time investments. The field is vast, though, so it is impossible to characterize (either in time or complexity) everything in it.
Are we missing an obvious opportunity? Tell us about it. Or ... do it, and show us up. Our primary hope, in releasing the code and this paper, is to generate some excitement about the possibilities we believe to be inherent in the gXML refactoring of XML in Java. Get excited; this could change the game.
Appendix A. gXML: Source
As previously noted, the core of the gXML paradigm is an abstraction called Model. Because this is an example of the Handle/Body design pattern (and is stateless), only one instance of Model is needed for navigation and investigation for any and all instances of the XML tree model for which the particular Model is specialized. Consequently, it seems worthwhile to show the content of the Model abstraction. Comments have been removed.
Model is composed from three interfaces, reflecting three different forms of information that might be obtained from an XQuery Data Model: NodeInformer reports information about the content/state of a particular node in context; NodeNavigator permits one to obtain a different node given a particular starting node; AxisNavigator supplies iteration over the standard XPath/XQuery axes, starting from a particular origin node.
public interface Model<N> extends Comparator<N>, NodeInformer<N>, NodeNavigator<N>, AxisNavigator<N> {
    void stream(N node, boolean copyNamespaces, ContentHandler handler) throws GxmlException;
}

public interface NodeInformer<N> {
    Iterable<QName> getAttributeNames(N node, boolean orderCanonical);
    String getAttributeStringValue(N parent, String namespaceURI, String localName);
    URI getBaseURI(N node);
    URI getDocumentURI(N node);
    String getLocalName(N node);
    Iterable<NamespaceBinding> getNamespaceBindings(N node);
    String getNamespaceForPrefix(N node, String prefix);
    Iterable<String> getNamespaceNames(N node, boolean orderCanonical);
    String getNamespaceURI(N node);
    Object getNodeId(N node);
    NodeKind getNodeKind(N node);
    String getPrefix(N node);
    String getStringValue(N node);
    boolean hasAttributes(N node);
    boolean hasChildren(N node);
    boolean hasNamespaces(N node);
    boolean hasNextSibling(N node);
    boolean hasParent(N node);
    boolean hasPreviousSibling(N node);
    boolean isAttribute(N node);
    boolean isElement(N node);
    boolean isId(N node);
    boolean isIdRefs(N node);
    boolean isNamespace(N node);
    boolean isText(N node);
    boolean matches(N node, NodeKind nodeKind, String namespaceURI, String localName);
    boolean matches(N node, String namespaceURI, String localName);
}

public interface NodeNavigator<N> {
    N getAttribute(N node, String namespaceURI, String localName);
    N getElementById(N context, String id);
    N getFirstChild(N origin);
    N getFirstChildElement(N node);
    N getFirstChildElementByName(N node, String namespaceURI, String localName);
    N getLastChild(N node);
    N getNextSibling(N node);
    N getNextSiblingElement(N node);
    N getNextSiblingElementByName(N node, String namespaceURI, String localName);
    N getParent(N origin);
    N getPreviousSibling(N node);
    N getRoot(N node);
}

public interface AxisNavigator<N> {
    Iterable<N> getAncestorAxis(N node);
    Iterable<N> getAncestorOrSelfAxis(N node);
    Iterable<N> getAttributeAxis(N node, boolean inherit);
    Iterable<N> getChildAxis(N node);
    Iterable<N> getChildElements(N node);
    Iterable<N> getChildElementsByName(N node, String namespaceURI, String localName);
    Iterable<N> getDescendantAxis(N node);
    Iterable<N> getDescendantOrSelfAxis(N node);
    Iterable<N> getFollowingAxis(N node);
    Iterable<N> getFollowingSiblingAxis(N node);
    Iterable<N> getNamespaceAxis(N node, boolean inherit);
    Iterable<N> getPrecedingAxis(N node);
    Iterable<N> getPrecedingSiblingAxis(N node);
}
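As an illustration of the Handle/Body arrangement, the sketch below uses one stateless model instance to interrogate nodes from two separate documents. It is not gXML code: Model here is a hypothetical three-method subset of the listing above (getLocalName, getParent, isElement), DomModel is a toy adapter over the standard org.w3c.dom API, and Paths.pathOf is an invented helper.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical subset of Model<N>; the three method names are taken from the listing.
interface Model<N> {
    String getLocalName(N node);
    N getParent(N origin);
    boolean isElement(N node);
}

// Toy DOM adapter. It holds no state, so a single instance serves every DOM tree.
final class DomModel implements Model<Node> {
    static final DomModel INSTANCE = new DomModel();

    private DomModel() {}

    public String getLocalName(Node node) { return node.getLocalName(); }
    public Node getParent(Node origin) { return origin.getParentNode(); }
    public boolean isElement(Node node) { return node.getNodeType() == Node.ELEMENT_NODE; }
}

final class Paths {
    private Paths() {}

    // Builds a simple /a/b/c path by walking parent links; the node is the handle, the model the body.
    static <N> String pathOf(N node, Model<N> model) {
        StringBuilder path = new StringBuilder();
        for (N n = node; n != null && model.isElement(n); n = model.getParent(n)) {
            path.insert(0, "/" + model.getLocalName(n));
        }
        return path.toString();
    }

    // Namespace-aware parsing is required for DOM getLocalName() to return a value.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo document", e);
        }
    }

    public static void main(String[] args) {
        Node d = parse("<a><c><d/></c></a>").getElementsByTagName("d").item(0);
        Node x = parse("<x><y/></x>").getDocumentElement();
        // The same model instance answers for nodes of both documents.
        System.out.println(pathOf(d, DomModel.INSTANCE)); // prints "/a/c/d"
        System.out.println(pathOf(x, DomModel.INSTANCE)); // prints "/x"
    }
}
```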
References
[AxiOM] Axiom 1.2.8 API http://ws.apache.org/commons/axiom/apidocs/index.html
[LavaFlow] Brown W., R. Malveau, H. McCormick, T. Mowbray, and S. W. Thomas. Lava Flow anti-pattern (Dec. 1999) http://www.antipatterns.com/lavaflow.htm
[DOM] Document Object Model Technical Reports http://www.w3.org/DOM/DOMTR
[DOM4J] DOM4J Introduction http://dom4j.sourceforge.net/
[XML] Extensible Markup Language (XML) 1.0 (Fifth Edition) http://www.w3.org/TR/xml/
[GOF] Gamma, E., R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software Addison-Wesley, 1995.
[XMLInJava] Harold, E. Processing XML with Java http://www.cafeconleche.org/books/xmljava/
[WhatsWrong] Harold, E. "What's Wrong with XML APIs (and how to fix them)" http://www.xom.nu/whatswrong/whatswrong.html
[Jaxen] Jaxen http://jaxen.org/
[JDOM] JDOM v1.1.1 API Specification http://www.jdom.org/docs/apidocs/
[XMLNS] Namespaces in XML 1.0 (Second Edition) http://www.w3.org/TR/xml-names
[DMPerf] Sosnoski, D. "XML and Java technologies: Document models, Part 1: Performance" http://www.ibm.com/developerworks/xml/library/x-injava/index.html
[DMUse] Sosnoski, D. "XML and Java technologies: Java document model usage" http://www.ibm.com/developerworks/xml/library/x-injava2/
[Woden] Welcome to Woden http://ws.apache.org/woden/
[Xalan] Xalan-Java http://xml.apache.org/xalan-j/index.html
[XalanDTM] XalanDTM http://xml.apache.org/xalan-j/dtm.html
[Infoset] XML Information Set (Second Edition) http://www.w3.org/TR/xml-infoset
[XPath1] XML Path Language (XPath), Version 1.0 http://www.w3.org/TR/xpath/
[WXS1] XML Schema Part 1: Structures Second Edition http://www.w3.org/TR/xmlschema-1/
[WXS2] XML Schema Part 2: Datatypes Second Edition http://www.w3.org/TR/xmlschema-2/
[XOM] XOM 1.2.5 http://www.xom.nu/apidocs/
[XDM] XQuery 1.0 and XPath 2.0 Data Model (XDM) http://www.w3.org/TR/xpath-datamodel/
[XSLT1] XSL Transformations (XSLT), Version 1.0 http://www.w3.org/TR/xslt