Note: Acknowledgements

This paper describes concepts and source code originally developed by David G. Holmes, formerly of TIBCO Software Inc., without whose innovation and energy neither the paper nor the material that it describes would be possible. David was the senior architect responsible for driving the development (over several iterations) of the gXML code base, and the original advocate of opening the source.

The Problem(s) with XML Tree APIs in Java

Java was one of the first major programming languages with support for XML. It was one of the targets for the Interface Definition Language modules that were developed as the basis of the Document Object Model DOM. Early adoption helped to prove the capabilities of both XML and of Java, but as might be expected, early adoption also has its drawbacks. A number of developers using XML in Java have noted these problems. For instance, Dennis Sosnoski compared a number of tree models in a two-part investigation in 2001 and 2002 (see "XML and Java technologies: Document models, Part 1: Performance" DMPerf and "XML and Java technologies: Java Document Model Usage" DMUse). More recently, Elliotte Harold documented "What's Wrong with XML APIs" WhatsWrong as part of the development of the XOM XOM API. This analysis falls into that tradition, though it does not agree wholly with the previous analyses. We identify four classes of problem with existing tree model APIs.

The first problem is multiplicity. For a variety of reasons, Java developers have not, on the whole, been enthusiastic partisans of the DOM. Alternatives were proposed early; Xalan Xalan, one of the major early XSLT processors, defined its own internal XML tree model (the Data Table Model XalanDTM) in preference to using the DOM. At present, there are at least five well-known tree models for XML in Java: DOM DOM, JDOM JDOM, DOM4J DOM4J, XOM XOM, and AxiOM AxiOM, as well as an unknown number of proprietary APIs to the same purpose (the authors of this paper know of at least six such private APIs). Applications and processors written for one of these models are generally not usable with other models.

The second problem is interoperability. The first tree model to appear on the scene has had a first-mover advantage. Subsequent tree model designs have intended to address the shortcomings of the DOM, but not to interoperate with it (note that both DOM4J and AxiOM later added optional DOM interface implementations to address this problem, accepting the disadvantages of the DOM in order to achieve compatibility in this mode). Knowledge of the tricks and optimizations appropriate to one model does not transfer to other tree models. Though the successor models have all positioned themselves as better solutions than the DOM, they have not been adopted as widely. This is most likely due to the DOM's first-mover advantage, and the consequent network effect: although other models may have technical advantages that make them more suitable than the DOM for a given application, in order to use those new models efficiently within the JVM, all parts of the application need to use the same tree model. Developers must solve a cruel equation in which the marginal benefits of switching from the DOM are typically low, whereas the marginal costs are always high. The alternatives seem to be to write multiple code paths to achieve the same purpose (with different tree models), or to wrap each node of each tree model in an application-specific abstraction. Some projects, such as Woden Woden and Jaxen Jaxen, have taken one or the other of these approaches in preference to adopting the DOM as the sole programming model.

The DOM, as the first XML tree model for Java, established the universe of discussion for design of tree models. Development of the DOM preceded the Namespaces in XML XMLNS and XML Infoset Infoset specifications. For backward compatibility, the DOM could never enforce these specifications, though it could enable them. Further development of the DOM may be characterized as too closely approaching the Lava Flow LavaFlow anti-pattern. Indeed, the DOM exposes fifteen "basic" abstractions (node types), compared to eleven in the Infoset, and seven in the XDM. Successor APIs have generally targeted the Infoset, but with widely varying interpretations. This is the problem of variability. Each model exposes different property sets. The boundaries between lexical, syntactic, and semantic are drawn at different points. One consequence of this variability is that it is difficult or awkward to add support for specifications "higher in the stack." For instance, XPath 1.0 XPath1 and XSLT 1.0 XSLT1 work perfectly adequately as external tools (one per tree model, or by generalizing the concept of "Node" to "Object"), and some models have built-in support (at least for XPath). XML Schema support (see WXS1 and WXS2) is rarely found—a DOM Level 3 module supports it, but in a fashion that is not noted for ease of use, and the module is not widely implemented. Similar situations exist for specifications such as XQuery 1.0, XPath 2.0, and XSLT 2.0. Even SOAP/XMLP is arguably under-supported. AxiOM, after all, is an entire XML tree model built largely so that the SOAP abstractions could be represented cleanly as extensions.

Finally, the problem of weight plagues most of these tree models. The DOM itself is notoriously heavyweight, typically occupying three to ten times the space, in memory, that the (already verbose) XML occupies as a character stream, according to Harold's Processing XML with Java XMLInJava. Successor models have done better in this area. Dennis Sosnoski's evaluation, "Document Models Part 1: Performance" DMPerf, though dated, provides an excellent illustration of this problem. A large part of the problem lies in the unrestricted mutability of these models. All of the prominent XML tree models for Java must restrict programming to serial, synchronous access. A mutable tree model is effectively a mutable collection, so any changes made to it by a single writer may have disastrous effects upon multiple readers. Issues of weight cannot easily be addressed by storing the bulk of the document on disk, or by concurrent processing, because the document may be modified during processing.

There are alternatives: applications and processors with higher performance requirements are often written to abstractions that do not model XML as a tree, such as SAX, StAX, or XML data binding (in its various flavors). Sosnoski's article discusses some of these alternatives; Harold's presentation also notes both advantages and disadvantages. The chief drawback to these approaches is that they expose paradigms which are not as easily or intuitively understood as the tree model, which are more of a challenge for some developers. A tree model is preferred. A single model for navigation and interrogation seems best. To date, attempts to create this single model have proven suboptimal in most environments.

gXML Design Considerations

gXML is a new API for analyzing, creating, and manipulating XML in Java. It embodies the XQuery Data Model, and is consequently a tree-oriented API, but it does not introduce a new tree model comparable to existing models. Instead, it is intended to run over existing tree models, and to permit the introduction of new, specialized models optimized for a particular purpose. Its design rests on four pillars: the Handle/Body design pattern, Java generics, the XQuery Data Model, and immutability for XML processing as a paradigm. These four principles answer the four problems outlined above.

The Handle/Body Pattern

gXML makes extensive use of the Handle/Body pattern (called the Bridge pattern in Design Patterns GOF). This pattern provides a well-defined set of operations over an abstraction (the handle), which may then be adapted to specific implementations (the body). For gXML, the primary "handles" are the Model or Cursor, the Processing Context, the Node Factory in the mutable API, and the type (Meta) and typed-value (Atom) Bridges in the schema-aware API.

When presenting gXML to a new audience, one of the most common stumbling points is the distinction between Handle/Body and Wrapper (called Facade in Design Patterns). gXML does not wrap every node in the tree. Applications and processors are presented with one new abstraction, represented by a single instance (a Singleton for model, or a single instance per tree for cursor). gXML adds very little weight to the existing tree model, compared to the significant additional weight added by the necessity to wrap every node in a tree. Although there is a cost (in memory and performance) to using the handles rather than directly manipulating the bodies, the benefits (in flexibility and capability) are more nearly commensurate: in exchange for a memory/performance impact measured in low single-digit percentages (for most tree model APIs), an application or processor gains the ability to manipulate all supported tree model APIs (currently three; more are anticipated).

There are a number of attractive consequences of using this design pattern. First, since applications and processors need not write separate code paths for different tree models, these models can be injected very late, even at runtime. That suggests that they can be compared, based on the application's or processor's requirements, and the tree model best suited to the problem at hand preferred. It also suggests that application and processor developers might have a sounder foundation to suggest improvements to developers of the models. Second, by bringing peace to these warring models, by allowing developers to choose a model based on technical merits without considering the importance of the network effect for the DOM, gXML also enables the creation of "niche" tree models for XML, models designed and optimized for particular use cases. In other words, by always using these handles for access, special-purpose bodies become more practical. These topics will be revisited in Advancing the State of the Art, below.

gXML's use of the Handle/Body pattern for XML tree models might be compared to the similar pattern used for database drivers in the Java Database Connectivity (JDBC) API. Each bridge may be viewed as equivalent to a vendor-specific driver.
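The difference between a handle and a per-node wrapper can be made concrete with a minimal sketch. The Model interface below is a hypothetical three-method reduction (the method names are taken from this paper, but the reduction and the class names DomModel and HandleBodyDemo are assumptions made for illustration, not the gXML source). One stateless DomModel instance serves every node of every DOM tree, so no per-node objects are allocated.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical, heavily reduced handle; the real Model has many more methods.
interface Model<N> {
    String getLocalName(N context);
    String getStringValue(N context);
    N getFirstChildElement(N context);
}

// Body for the DOM: stateless, so a single instance serves all nodes.
final class DomModel implements Model<Node> {
    public String getLocalName(Node context) {
        return context.getLocalName();
    }
    public String getStringValue(Node context) {
        return context.getTextContent();
    }
    public Node getFirstChildElement(Node context) {
        for (Node c = context.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                return c;
            }
        }
        return null;
    }
}

public class HandleBodyDemo {
    // Written once against the handle; no tree model package is named here.
    static <N> String firstChildName(Model<N> model, N context) {
        N child = model.getFirstChildElement(context);
        return child == null ? null : model.getLocalName(child);
    }

    // Parses with the JDK's DOM parser and asks the question through the handle.
    public static String firstChildNameOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getLocalName()
            Node root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return firstChildName(new DomModel(), root);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstChildNameOf("<order><item>widget</item></order>"));
    }
}
```

Only DomModel and the parsing helper mention org.w3c.dom; firstChildName would work unchanged over any other body that implements the same handle.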

The 'G' in 'gXML'

gXML makes extensive use of Java generics. First, it defines two common parameters, N and A. N is the "node" handle; A is the "atom" or "atomic value" handle. Furthermore, gXML makes extensive use of Java's built-in generics; APIs that accept or return collections typically use Iterable in their signatures (as opposed to counts, specialized objects with pseudo-iterators, single-use iterators, or arrays).

The use of generics is the primary answer, in gXML, to the problem of interoperability. By defining these parameters, particularly the <N>ode handle, each of the tree models can be viewed and manipulated through the lens of the XQuery Data Model. One notable consequence is that the enormous network effect created by the existence of parsers, processors, and applications that understand no model but the DOM, regardless of its fitness for their domain of operation, no longer matters to developers of gXML-based processors and applications. gXML includes a DOM bridge; it is thereby able to leverage that network effect. Every bridge added increases the network effect, though not, as a rule, for a single document: conversion from model to model remains expensive.

The XQuery Data Model

Perhaps the most important driver for the development of gXML was the desire to have a Java API that embodied the XQuery Data Model. The XDM is more rigorous than its predecessor, the XML Infoset specification (which was driven in part from a need to model existing APIs, including DOM, SAX, XPath, and Namespaces in XML). It is conceptually complete, and defined in a context that permits type definition, navigation operations, and more advanced functions. This rigorous, well-defined specification was adopted as the basis for the API, and represents gXML's answer to the problem of variability. Is a property or concept in the XDM specification? Then it should be in the gXML API. If it is not in the specification, then either it should not be exposed in the API, or it should be compatible with the well-specified API. For instance, the entire mutable API was added as an extension; XQuery does not define operations that modify trees.

Another important reason to adopt the XQuery Data Model is that it provides the first well-integrated access to XML Schema information (one might argue that XQuery and XSLT2 provide the "missing language" for the XML Schema type system). A great deal of XML processing has no need to concern itself with validation, typing, and particularly with the post-Schema validation infoset; those applications and processors that need it, however, need it very badly. gXML defines a common model for XML Schema, compatible with the XDM's definition and use of XML Schema types and typed values, as a standard extension.

gXML is not the only model to provide support for XML Schema, but the schema-aware extensions in gXML can be implemented for any tree model, and are exposed via APIs that are clearly related to (usually extensions of) the core gXML APIs. In other words, by addressing the problem of variability via adherence to and conformance with the XQuery Data Model Specification, gXML enables the development of a "next wave" of XML processing technologies, based on XPath 2.0, XSLT 2.0, and XQuery 1.0 (including the new generation of XQuery-conformant databases).

The Immutable Approach

In the experience of the developers of gXML, most of the nodes in any given XML instance document are never modified. These nodes need not be mutable—but because some nodes are modified in the common paradigm of XML processing, all nodes must be defined to be mutable. The core gXML API dispenses with mutability. Instead, it promotes a paradigm in which a received or generated XML document is an input, and the XML supplied to other processes (in the same VM, on the same machine, or somewhere else on the network) is a transformation of the input. This approach addresses the problem of weight. In combination with the enabling of custom, potentially domain-specific XML tree models accessed via a gXML bridge, the immutable paradigm (over an immutable tree model) can achieve optimizations not possible for a tree model in which the existence of mutability militates against caching, compaction, and deferred loading. It is not possible, at this point, to quantify the potential performance benefits rigorously because the pure-immutable model remains hypothetical (other priorities have taken precedence). Here we speculate.

Such a hypothetical immutable model would not need to guard against modification of a document in one thread while another thread reads it. It would provide guarantees that would permit processing of large documents to be parallelized; an immutable, late-loading model might be able to provide access to XML documents of a size infeasible for mutable models. A certain number of these optimizations are available even for bridges over mutable models; if the convention encourages immutability, then processors can define their operations only when the convention is adhered to, warning users that breaking the convention may lead to undefined (and incorrect) results.
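The concurrency argument can be illustrated with a toy. The ImmutableNode class below is purely hypothetical (it is not a gXML bridge or an existing tree model). Every field is final and the child list cannot be changed after construction, so several threads may traverse the tree at once without locks or defensive copies.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable node: final fields, unmodifiable child list.
final class ImmutableNode {
    private final String name;
    private final List<ImmutableNode> children;

    ImmutableNode(String name, ImmutableNode... children) {
        this.name = name;
        // Copy the array so later changes by the caller cannot leak in.
        this.children = Collections.unmodifiableList(Arrays.asList(children.clone()));
    }

    String name() {
        return name;
    }

    List<ImmutableNode> children() {
        return children;
    }
}

public class ParallelReadDemo {
    // Counts nodes; subtrees are visited in parallel with no synchronization.
    static int size(ImmutableNode node) {
        return 1 + node.children().parallelStream()
                .mapToInt(ParallelReadDemo::size)
                .sum();
    }

    public static int demo() {
        ImmutableNode tree = new ImmutableNode("a",
                new ImmutableNode("b", new ImmutableNode("c")),
                new ImmutableNode("d"));
        return size(tree);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The same traversal over a mutable tree would need either external locking or a convention that no writer runs during the read, which is the convention-based fallback described above.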

Immutability enables performance enhancements: for instance, models in memory which occupy a fraction of the size of the XML as a character stream rather than a multiple of its size; concurrent processing of XML documents; storage of the bulk of a document on disk with indexing and a very light footprint in memory. We've noticed unanticipated potential as well: if there is no requirement to modify the document in memory, then a gXML bridge may reasonably be defined over any structured hierarchical data format analogous to XML: JSON, CSV, a file system, a MIME multipart message. Perhaps more strikingly, immutable models can potentially cross the VM boundary, via JNI to other languages, into hardware accelerators, and so on.

The gXML Core

The gXML API is designed for rapid understanding. The core API can be described as a collection of five interfaces. In practice, more interfaces are available, but understanding these five is necessary and sufficient to understand and use the gXML base API. These abstractions adhere to the design principle of immutability, and do not introduce any dependency upon XML Schema.

The core API is completed with two extensions. The mutable extension adds mutability by adding methods to the base interfaces, or by adding new interfaces. The schema-aware extension adds schema awareness, again by adding methods to base interfaces, or by adding new interfaces; the schema-aware extension also introduces the "atom" parameter.

Untyped, Immutable

The heart of the gXML API is an abstraction called Model. Model is stateless; each bridge implements it. The methods on Model permit interrogation of XQuery Data Model properties (getNamespaceURI(N), getLocalName(N), getStringValue(N), getNodeKind(N), etc.), and provide XQuery/XPath navigation (child, descendant, ancestor, sibling, attribute, namespace axes). Since this abstraction is stateless, each method's first parameter is a context node, the node for which information is requested, or from which navigation begins. The XQuery Data Model defines seven node types: Document, Element, Text, Attribute, Namespace, Comment, and ProcessingInstruction. Returns from each method vary by node type, in conformance with the Data Model specification, but the API does not distinguish node types (the argument or return value is <N>, not <? extends N>). Appendix A documents this interface.

For convenience, a very similar API, with minimal (positional) state is also defined: Cursor. Cursor provides a common idiom, maintaining its positional state within the target tree, which is frequently encountered in processing XML. Where Model's navigation APIs typically return a node (N getFirstChildElement(N context)), Cursor's corresponding APIs return true or false and change the Cursor's state (boolean moveToFirstChildElement()). Where Model's property accessors require a context node (String getStringValue(N context)), Cursor's use its current state (String getStringValue()). The design intent is that anything that may be accomplished with a Model may also be accomplished with a Cursor. Note that Cursor is not forward-only.
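The relationship between the two idioms can be sketched as a cursor layered over a model. The interfaces here are hypothetical reductions: the method names getFirstChildElement, moveToFirstChildElement, and getStringValue come from the text, while getParent, moveToParent, and the exact shape of CursorOnModel are assumptions for illustration. The cursor holds only a position and delegates every question to the stateless model.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical reduced Model: every method takes a context node.
interface Model<N> {
    String getLocalName(N context);
    String getStringValue(N context);
    N getParent(N context);
    N getFirstChildElement(N context);
}

final class DomModel implements Model<Node> {
    public String getLocalName(Node n) { return n.getLocalName(); }
    public String getStringValue(Node n) { return n.getTextContent(); }
    public Node getParent(Node n) { return n.getParentNode(); }
    public Node getFirstChildElement(Node n) {
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) return c;
        }
        return null;
    }
}

// Generic cursor: its only state is the current position in the tree.
final class CursorOnModel<N> {
    private final Model<N> model;
    private N current;

    CursorOnModel(Model<N> model, N start) {
        this.model = model;
        this.current = start;
    }

    boolean moveToFirstChildElement() { return moveTo(model.getFirstChildElement(current)); }
    boolean moveToParent() { return moveTo(model.getParent(current)); }
    String getLocalName() { return model.getLocalName(current); }
    String getStringValue() { return model.getStringValue(current); }

    // Moves only if the target exists; otherwise the position is unchanged.
    private boolean moveTo(N target) {
        if (target == null) return false;
        current = target;
        return true;
    }
}

public class CursorDemo {
    // Descends along first child elements, then steps back up once.
    public static String walk(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Node root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            CursorOnModel<Node> cursor = new CursorOnModel<>(new DomModel(), root);
            StringBuilder path = new StringBuilder(cursor.getLocalName());
            while (cursor.moveToFirstChildElement()) {
                path.append('/').append(cursor.getLocalName());
            }
            cursor.moveToParent(); // the cursor is not forward-only
            return path + " (back at " + cursor.getLocalName() + ")";
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(walk("<a><b><c>text</c></b></a>"));
    }
}
```

Because CursorOnModel depends only on Model<N>, a bridge that supplies a model gets a working cursor without further code.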

When processing XML, some applications can make use of gXML with nothing more than Model or Cursor. More advanced uses might need the third primary abstraction in the core gXML API, the ProcessingContext. A processing context is precisely what it claims to be: a specialized (for the target tree model), stateful abstraction which provides uniform access to the collection of abstractions which together make up a bridge. Model, Cursor, and ProcessingContext are all parameterized only by <N>ode. The TypedContext extension introduces the <A>tom parameter.

ProcessingContext provides Model<N> getModel() and Cursor<N> newCursor(N context) methods, an accessor for the (singleton) Model and a factory for the Cursor. Several additional accessors, functions, and factory methods are available from the context: it is the source for the mutable and typed context extensions (getMutableContext() and getTypedContext()), and for DocumentHandler and FragmentBuilder; it can report whether candidate objects are compatible with the bridge's specialization of <N>ode; it includes a mechanism to permit feature-based extension. For greatest generality, applications should access a bridge via its processing context. An optional ProcessingContextFactory interface is also included in the API, but experience suggests that provision of instances of the factory is an impediment to the target design pattern, dependency injection. That is, applications ought to instantiate the factory interface themselves, consistent with the injection mechanism or API which they use.

The processing context provides access to DocumentHandler, which in turn provides methods to parse from and serialize to streams, readers and writers. ProcessingContext is also a factory for FragmentBuilder, which is-a ContentHandler (for the XDM, not the SAX interface of the same name) and is-a NodeSource. FragmentBuilder is used to programmatically build trees or tree fragments in memory, parallel to parsing a document into memory via the document handler's various parse methods. Model and Cursor also accept a ContentHandler argument to stream or write themselves. In short, these abstractions provide a range of input/output operations for XML using a particular bridge.

These five abstractions make up the core of the gXML API. There are other, supporting abstractions, some of which become more significant in particular contexts. An untyped, immutable bridge implementation (minimally) provides implementations for these five abstractions over a given tree model.

Mutability

gXML provides two standard extensions in the core ProcessingContext to permit bridges to signal support for optional functionality. The first extension permits mutability. Immutability provides important benefits for XML processing, but all currently available tree models are mutable, and nearly all processors and applications expect mutability. To ease migration, ProcessingContext provides a method, getMutableContext(), which permits the bridge to signal that it supports mutability by returning an implementation of the MutableContext extension. A mutable context, in turn, provides access to MutableModel and MutableCursor, each of which extends the corresponding immutable interface (adding methods to add and remove nodes, and to change the content of a document or element node), and also provides access to a NodeFactory implementation which permits the creation of nodes in memory, independent of any tree (within the limits of the underlying tree model).

Nota bene: the mutable interfaces, unlike other abstractions in gXML, are not attempts to implement a portion of the XQuery Data Model in Java. Neither the XQuery Data Model nor, in fact, XQuery 1.0, XSLT 2.0, or XPath 2.0 specifies property mutators. Consequently, this portion of the API has been designed to be roughly compatible with the XDM, as an extension, and to be roughly compatible with the corresponding mutable APIs in dominant tree models. However, once XQuery produces its "update" mechanism, this portion of the API is unlikely to prove conformant.

Schema Awareness

The TypedContext extension parallels the MutableContext extension. It provides the XDM-defined schema-aware properties and manipulations. Most notably, the typed context introduces an additional parameter, the <A>tom handle. The base and mutable interfaces deal only with string values for text node and attribute content (in XDM terms, actually untyped atomic). The XQuery Data Model defines the concept of "atom", which corresponds to a typed value or list of typed values. Atoms are inherently sequences of atoms (a single atom is a one-element list); "sequence" is also introduced in the schema-aware API, but unlike atom, is not represented by an independent common parameter.

TypedContext is more complex than MutableContext. As a mutable context provides access to mutable models and cursors, a typed context provides an accessor for a TypedModel and is a factory for TypedCursor, which are extensions of the base Model and Cursor, adding methods to access the type-name and typed-value properties. As the base processing context can identify <N>odes, so the typed context can identify <A>toms. TypedContext enhances the base FragmentBuilder as a type- and atom-aware SequenceBuilder. To handle typed values, TypedContext provides an accessor for the AtomBridge, which in turn provides facilities to create, compile, cast, convert (to Java native types), and query atoms, in a fashion consistent with the XDM.

TypedContext also provides access to the MetaBridge, which primarily serves to map the names of types to their corresponding implementations in the (included) XML Schema model. TypedContext makes use of this bridge itself, because it extends the core schema model interface, SmSchema. SmSchema permits definition and declaration of custom types, registry of types, and lookup of types. In other words, the typed context provides a cache of types (supplied via parsing of schemas or programmatically) which are being used in the processing of a collection of XML documents. This is actually the origin of the concept and term "processing context," though it now exists for the untyped API as well.

Building Bridges with gXML

For greatest utility, gXML ought to have bridges on every tree model for XML in Java. The authors have not been able to accomplish this themselves, but can demonstrate that creating additional bridges is a straightforward task.

The three bridges included in the gXML source tree provide examples of the finished product. The development process is easily described. Note, however, that most tree models present unique challenges when adapted to the XQuery Data Model; our experience suggests that most development time is consumed by handling these impedance mismatches.

Untyped, Immutable

What needs to be done to create a new base bridge (untyped, immutable) for an as-yet unsupported tree model? There are five steps:

  1. Implement ProcessingContext and Model. Decide what the <N> (node) abstraction must be.

    For instance: the DOM defines <N> as Node. AxiOM defines it as Object (AxiOM does not have a single base interface that marks all node types). The Cx bridge proof-of-concept uses XmlNode.

  2. Use the bridgekit module to get a simple, generic implementation of Cursor (over the custom Model).

    The bridgekit module is a collection of utilities intended to help bridge developers. It includes, for instance, an implementation of the XML Schema model (SmSchema) and the XmlAtom typed-value implementation, as well as the CursorOnModel helper used here.

  3. Implement FragmentBuilder.

    The FragmentBuilder interface has five methods for creating Text, Attribute, Namespace, Comment, and Processing Instruction node types, and an additional two each (start and end) for the container node types, Element and Document.

  4. Use the generic implementation of DocumentHandler from the input-output module.

    The generic DocumentHandler in the input-output module is not terribly mature or robust, but can do the job for an initial implementation.

  5. Use the bridgetest module to verify equivalence with existing bridges.

    The bridgetest module is designed to make implementation easy; enabling each test requires only that the bridge implement the single abstract method, which returns the bridge's implementation of ProcessingContext (from which all other abstractions can be reached). Adding a test implementation is thus mostly a mechanical task.

This is all that's required. For this minimum, getMutableContext() and getTypedContext() (on ProcessingContext) should both return null, indicating no support.
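Step 3 is the one most likely to need hand-written code, so a minimal sketch of it follows. The interface is a hypothetical reduction of FragmentBuilder: namespace, comment, and processing-instruction events are omitted, the parameter lists are simplified, and getNode() stands in for the NodeSource role. None of these signatures is taken from the gXML source. The body is the JDK's DOM.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical reduction of FragmentBuilder; signatures are assumptions.
interface FragmentBuilder<N> {
    void startDocument();
    void endDocument();
    void startElement(String namespaceURI, String localName);
    void endElement();
    void attribute(String localName, String value);
    void text(String value);
    N getNode(); // NodeSource role: hands back what was built
}

final class DomFragmentBuilder implements FragmentBuilder<Node> {
    private final Document owner;
    private final Deque<Node> open = new ArrayDeque<>();
    private Node result;

    DomFragmentBuilder() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            owner = factory.newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public void startDocument() { open.push(owner); }
    public void endDocument() { close(); }

    public void startElement(String namespaceURI, String localName) {
        Element element = owner.createElementNS(namespaceURI, localName);
        if (!open.isEmpty()) {
            open.peek().appendChild(element);
        }
        open.push(element);
    }

    public void endElement() { close(); }

    // Applies to the innermost open element.
    public void attribute(String localName, String value) {
        ((Element) open.peek()).setAttributeNS(null, localName, value);
    }

    public void text(String value) {
        open.peek().appendChild(owner.createTextNode(value));
    }

    public Node getNode() { return result; }

    // The outermost container to close becomes the built fragment.
    private void close() {
        Node closed = open.pop();
        if (open.isEmpty()) {
            result = closed;
        }
    }
}

public class FragmentBuilderDemo {
    public static String demo() {
        FragmentBuilder<Node> builder = new DomFragmentBuilder();
        builder.startElement(null, "order");
        builder.attribute("id", "7");
        builder.startElement(null, "item");
        builder.text("widget");
        builder.endElement();
        builder.endElement();
        Element order = (Element) builder.getNode();
        return order.getLocalName() + "@id=" + order.getAttribute("id")
                + ":" + order.getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The stack of open containers is the only bookkeeping required; a fragment that starts with an element rather than a document simply yields that element as its result.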

Mutability

To add support for mutability:

  1. Implement MutableContext and return it from ProcessingContext instead of null.

    MutableContext provides access to the NodeFactory, MutableModel, and MutableCursor implementations.

  2. Implement MutableModel as an extension of the base Model from above.

    MutableModel adds methods to set attributes and namespaces, to add, remove, and replace children.

  3. Use the bridgekit module to base the bridge's MutableCursor on its MutableModel.

    The bridgekit implementations are reasonable starting points, though optimization is likely to require a custom implementation.

  4. Implement NodeFactory.

    NodeFactory contains methods to create each node type, where MutableModel establishes the relationships between nodes.

  5. Add tests from the bridgetest module.

    In this case, there's only one, at present.

This is admittedly easier to describe than to accomplish. Approaches to mutability among tree models vary much more widely than approaches to navigation and analysis.

On the other hand, gXML's approach to mutability is more restricted than most current tree APIs. The gXML mutable API does not support changing the value of a text or attribute node, for instance. Leaf nodes remain immutable; container nodes (document and element) are mutable in content (contained nodes) only.
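A short sketch may clarify what this restriction means in practice. The NodeFactory and MutableModel interfaces below are hypothetical reductions with assumed signatures, implemented over the JDK's DOM. There is deliberately no method that sets the value of a text node, so "changing" text means replacing one leaf with another inside its container.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical reduced interfaces; names follow the text, signatures are assumed.
interface NodeFactory<N> {
    N createElement(String namespaceURI, String localName);
    N createText(String value);
}

interface MutableModel<N> {
    void appendChild(N parent, N child);
    void replaceChild(N parent, N oldChild, N newChild);
    // Deliberately absent: no setter for text or attribute values.
}

final class DomNodeFactory implements NodeFactory<Node> {
    private final Document owner;

    DomNodeFactory() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            owner = factory.newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public Node createElement(String namespaceURI, String localName) {
        return owner.createElementNS(namespaceURI, localName);
    }

    public Node createText(String value) {
        return owner.createTextNode(value);
    }
}

final class DomMutableModel implements MutableModel<Node> {
    public void appendChild(Node parent, Node child) {
        parent.appendChild(child);
    }

    public void replaceChild(Node parent, Node oldChild, Node newChild) {
        parent.replaceChild(newChild, oldChild); // DOM takes the new child first
    }
}

public class RestrictedMutationDemo {
    public static String demo() {
        NodeFactory<Node> factory = new DomNodeFactory();
        MutableModel<Node> model = new DomMutableModel();
        Node status = factory.createElement(null, "status");
        Node pending = factory.createText("pending");
        model.appendChild(status, pending);
        // The container changes; the leaf is swapped, never edited.
        model.replaceChild(status, pending, factory.createText("shipped"));
        return status.getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The division of labor matches the steps above: the factory creates free-standing nodes, and the mutable model establishes or changes their relationships.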

Schema Awareness

To add support for schema-awareness:

  1. Implement TypedContext and return it from ProcessingContext instead of null; note that TypedContext is-a SmSchema. Decide what the <A> (atom) abstraction must be.

    Current implementations all define <A> as XmlAtom. This is not required.

  2. Implement TypedModel as an extension of the base Model from above.

    The TypedModel interface adds only five methods to Model, all related to the introduction of type names and typed values. Actually ensuring that the type annotations and typed values are associated with the nodes in the tree is one of the most challenging tasks in implementation.

  3. Use the bridgekit module to base the bridge's TypedCursor on its TypedModel.

    CursorOnTypedModel extends CursorOnModel as expected.

  4. Implement or reuse from the bridgekit module an AtomBridge (typed value support).

    If the chosen <A>tom is XmlAtom, the XmlAtomBridge already exists.

  5. Implement or reuse from the bridgekit module a MetaBridge (type support).

    Again, if the <A>tom is XmlAtom, a MetaBridge exists in the bridgekit.

  6. Implement SequenceBuilder as an extension of the FragmentBuilder from above.

    SequenceBuilder adds overrides for the attribute(), startElement(), and text() methods (adding type names and typed values), plus methods to create an atom and to start and end a sequence.

  7. Add the typed tests from the bridgetest module.

    As with the standard tests, these are easy to implement, following the same pattern.

For schema awareness, the most straightforward approach is going to be reusing the generic implementations found in the bridgekit module, but better results may be achieved by customizing the code. This is an area requiring further experience before establishing guidelines for best practices.

Bridge Traffic

Using bridges is a little less amenable to slideshow style lists, but the principles remain straightforward. When using gXML, it is important to understand "dependency inversion": bridges should be injected, if at all possible, rather than directly instantiated. It is possible to design an application or processor that can react to input by directly instantiating the needed bridge, but it's best to reduce the number of places that contain reference to the tree model packages to as few as possible. One class is ideal; it is then responsible for providing a processing context for a given bridge on demand.
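The following sketch shows one way to arrange this, under stated assumptions. The three interfaces are hypothetical reductions, and newDocumentHandler() is an assumed accessor name; ElementCounter and DomBridgeProvider are invented for the example. The application class is generic in <N> and receives its processing context through the constructor, and DomBridgeProvider is the single class that refers to a tree model package.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical reduced interfaces; accessor names are assumptions.
interface Model<N> {
    Iterable<N> getChildElements(N context);
}

interface DocumentHandler<N> {
    N parse(Reader source);
}

interface ProcessingContext<N> {
    Model<N> getModel();
    DocumentHandler<N> newDocumentHandler();
}

// Application code: generic in N, with the bridge injected.
final class ElementCounter<N> {
    private final ProcessingContext<N> context;

    ElementCounter(ProcessingContext<N> context) {
        this.context = context;
    }

    int count(String xml) {
        N document = context.newDocumentHandler().parse(new StringReader(xml));
        return countBelow(context.getModel(), document);
    }

    private int countBelow(Model<N> model, N node) {
        int total = 0;
        for (N child : model.getChildElements(node)) {
            total += 1 + countBelow(model, child);
        }
        return total;
    }
}

// The one class that names a tree model package.
final class DomBridgeProvider {
    static ProcessingContext<Node> context() {
        Model<Node> model = parent -> {
            List<Node> elements = new ArrayList<>();
            for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) elements.add(c);
            }
            return elements;
        };
        DocumentHandler<Node> handler = source -> {
            try {
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                return factory.newDocumentBuilder().parse(new InputSource(source));
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        };
        return new ProcessingContext<Node>() {
            public Model<Node> getModel() { return model; }
            public DocumentHandler<Node> newDocumentHandler() { return handler; }
        };
    }
}

public class InjectionDemo {
    public static int countWithDom(String xml) {
        return new ElementCounter<>(DomBridgeProvider.context()).count(xml);
    }

    public static void main(String[] args) {
        System.out.println(countWithDom("<a><b/><b><c/></b></a>"));
    }
}
```

Swapping tree models then means swapping the provider, whether by hand as here or through whatever injection mechanism the application already uses.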

Most applications will spend most of their time with the Model (or Cursor) interfaces, which permit navigation and interrogation. Methods provide access to names, values, and other characteristics (XQuery Data Model properties) of the node, and permit navigation in a variety of ways to target nodes. An appendix shows the content of the Model interface. FragmentBuilder (for construction in memory) and DocumentHandler (for parsing and serializing) are likely to be important. Existing applications or developers wedded to the concept of mutability are likely to make use of the APIs in the mutable model (or cursor) and the NodeFactory. Applications or processors needing W3C XML Schema support (common inside the enterprise, for instance) are likely to make extensive use of TypedContext, particularly as a schema cache and for access to typed models and cursors.

At present, gXML has bridges, in varying states of maturity, for the DOM (level 3 support currently required), for AxiOM (LLOM only; support for typed context rather weak), and for a reference bridge called Cx (a clean, if naive, reimplementation of the XQuery Data Model from scratch, and a gXML bridge over that implementation). The DOM was chosen because of its ubiquity; AxiOM because the web services area is a target for gXML proselytizers; Cx exists primarily to demonstrate that the shared idiosyncrasies of DOM and AxiOM (there are a few) are not fundamental to gXML.

Processing XML with gXML

gXML provides an extensive API for bridges, which not only provides the entry point for applications and processors, but also makes the development of new bridges easy to describe. In sharp contrast, no interface, no contract, is specified for XML processors designed for use with gXML. While some processors might reasonably be defined to have a method with the signature: N process(N, Model<N>), for others this is entirely inappropriate. Even for processors that might reasonably "process" a node, their function is more clearly expressed if they "transform" or "extract" or "enhance", or otherwise mark their "processing" by its specific name, not the more general one.

So, what is a gXML processor? As the gXML team uses the term, a processor is a code library that performs some specific, well-described function over XML. Most processors can be described with a single word or phrase: "serializer," "parser," "converter," "validator," "transformer," "signer," and so on. A processor is distinguished from an "application," which may create (generate), destroy (consume), modify, and otherwise manipulate XML in multiple steps. Where a processor contributes special functionality to the performance of a goal, the application oversees and orchestrates achievement of the goal from receipt to completion. To further distinguish, a bridge provides the abstraction over which the applications and processors operate, including the model, input/output, and a context that associates related tree-specific functions.

Stateful

gXML processors may be divided, for purposes of discussion, into two classes: stateful and stateless. Here, "state" refers to the processor's need to maintain state in the form of any of the parameters specialized by a particular bridge implementation (<N> and <A>), disregarding maintenance of state unrelated to gXML parameters. A stateful processor is ideally written generically, but certain of its component classes will themselves be parameterized with one or both of the node and atom handles. Consequently, at instantiation, a given instance of a processor is tied, ipso facto, to a particular bridge implementation. Like java.util.List<QName>, a generic processor taking only <N> as a parameter would have to be specialized as GenericProcessor<Node> for use with the DOM bridge; the same class would be separately instantiated for use with the Cx bridge as GenericProcessor<XmlNode>. Stateful processors typically contain one or more member fields whose type is specified as a parameter (or which is a parameterized class, such as an instance of Cursor<N> or Bookmark<N>).

For example, an input-output module is included in the gXML source tree. This module includes a stateful processor implementing DocumentHandler<N>. This DocumentHandler contains a member field which is a FragmentBuilder<N> supplied by the bridge's ProcessingContext. This is a good example of the stateful style: at instantiation, each DefaultDocumentHandler<N> is specialized for the bridge's definition of <N>, associating this handler instance with a particular bridge (in fact, associating it with a single instance of the bridge's implementation of ProcessingContext). This processor's "process" methods are defined by the DocumentHandler interface, found in the core API.

Stateless

An alternate style of implementation is the stateless processor. If no class in the processor needs to retain state typed as or with a gXML parameter, then the processor may be used by declaring the necessary parameters on a method, and supplying the necessary disambiguation as arguments to the method. For instance, a stateless processor might expose the method:

    <N> N nearestAncestor(Iterable<N> context, Model<N> model)

The arguments to the method are both parameterized: the context provides a collection of nodes; the model provides the tool to interrogate each of the nodes in the supplied context (this hypothetical example finds the nearest common ancestor of all the nodes supplied in the list, or null if no such common ancestor exists).
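One possible body for such a method is sketched below. It is an illustration only, under several assumptions: ParentModel is a hypothetical stand-in that keeps just the getParent() method from the NodeNavigator interface in the appendix, a node is treated as its own ancestor (ancestor-or-self semantics), node identity is decided by equals(), and PNode is a toy tree invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical cut-down stand-in for the navigation part of Model<N>.
interface ParentModel<N> {
    N getParent(N origin); // returns null at the root
}

final class Ancestors {
    private Ancestors() {}

    // Stateless: no field is typed with N; the model arrives as an argument.
    static <N> N nearestAncestor(Iterable<N> context, ParentModel<N> model) {
        List<N> candidates = null; // common ancestors so far, nearest first
        for (N node : context) {
            // Collect the ancestor-or-self chain of this node, nearest first.
            List<N> chain = new ArrayList<>();
            for (N n = node; n != null; n = model.getParent(n)) {
                chain.add(n);
            }
            if (candidates == null) {
                candidates = chain;
            } else {
                // Keep only ancestors shared with this node; order is preserved.
                candidates.retainAll(new HashSet<>(chain));
            }
        }
        return (candidates == null || candidates.isEmpty()) ? null : candidates.get(0);
    }
}

// Toy tree model used only to exercise the sketch.
final class PNode {
    final String name;
    final PNode parent;

    PNode(String name, PNode parent) {
        this.name = name;
        this.parent = parent;
    }
}

final class NearestAncestorSketch {
    public static void main(String[] args) {
        PNode root = new PNode("root", null);
        PNode a = new PNode("a", root);
        PNode b = new PNode("b", a);
        PNode c = new PNode("c", a);
        ParentModel<PNode> model = n -> n.parent;
        System.out.println(Ancestors.nearestAncestor(Arrays.asList(b, c), model).name);
    }
}
```

Because the type parameter is declared on the method rather than on the class, the same Ancestors class serves any number of bridges at once.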

An extremely simple example of a stateless processor may be found in the convert module, in the gXML source tree. It's so simple that it's debatable whether it's a processor, or simply an instantiation of an idiom. StaticConverter has a single, static method, with the signature:

    <Nsrc, Ntrg> Ntrg convert(Cursor<Nsrc> cursor, FragmentBuilder<Ntrg> builder)

It does what it says on the tin: using the supplied Cursor and FragmentBuilder, from one or two different bridges, it converts from one tree model representation to another (strictly speaking, this is a transforming copy, rather than a conversion; also, if the Cursor and FragmentBuilder are supplied by the same bridge, this is simply a copy).
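The idiom of a transforming copy can be sketched in a few lines. The code below is not the StaticConverter source: SourceModel and TargetBuilder are hypothetical, cut-down stand-ins for the far richer Cursor, Model, and FragmentBuilder interfaces, the sketch takes a node plus a model where the real method takes a Cursor, and the "target tree model" is simply a string of markup (with no escaping) so that the example stays self-contained.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-ins; the real gXML interfaces carry much more.
interface SourceModel<N> {
    boolean isElement(N node);
    String getLocalName(N node);
    String getStringValue(N node);
    Iterable<N> getChildAxis(N node);
}

interface TargetBuilder<N> {
    void startElement(String localName);
    void text(String data);
    void endElement();
    N getNode(); // the finished result
}

final class CopySketch {
    private CopySketch() {}

    // Walks the source through its model and replays it into the builder.
    static <Nsrc, Ntrg> Ntrg convert(Nsrc root, SourceModel<Nsrc> model, TargetBuilder<Ntrg> builder) {
        emit(root, model, builder);
        return builder.getNode();
    }

    private static <Nsrc> void emit(Nsrc node, SourceModel<Nsrc> model, TargetBuilder<?> builder) {
        if (model.isElement(node)) {
            builder.startElement(model.getLocalName(node));
            for (Nsrc child : model.getChildAxis(node)) {
                emit(child, model, builder);
            }
            builder.endElement();
        } else {
            builder.text(model.getStringValue(node)); // toy: anything else is text
        }
    }
}

// Toy source tree: name is null for text nodes, text is null for elements.
final class SrcNode {
    final String name;
    final String text;
    final List<SrcNode> children = new ArrayList<>();

    private SrcNode(String name, String text) { this.name = name; this.text = text; }

    static SrcNode element(String name, SrcNode... kids) {
        SrcNode n = new SrcNode(name, null);
        for (SrcNode k : kids) n.children.add(k);
        return n;
    }

    static SrcNode text(String data) { return new SrcNode(null, data); }
}

final class SrcModel implements SourceModel<SrcNode> {
    public boolean isElement(SrcNode node) { return node.name != null; }
    public String getLocalName(SrcNode node) { return node.name; }
    public String getStringValue(SrcNode node) { return node.text; } // text nodes only in this toy
    public Iterable<SrcNode> getChildAxis(SrcNode node) { return node.children; }
}

// Toy target "model": a markup string, written without escaping.
final class MarkupBuilder implements TargetBuilder<String> {
    private final StringBuilder out = new StringBuilder();
    private final Deque<String> open = new ArrayDeque<>();

    public void startElement(String localName) { out.append('<').append(localName).append('>'); open.push(localName); }
    public void text(String data) { out.append(data); }
    public void endElement() { out.append("</").append(open.pop()).append('>'); }
    public String getNode() { return out.toString(); }
}
```

The source and target types never meet: convert() sees the source only through its model and the target only through its builder, which is what allows the two to come from different bridges.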

A more complex example may be found in the same module: Converter mixes the stateful and stateless styles. It is instantiated with a (source) processing context; it is then able, on request, to convert to any supplied target processing context—retaining type information, if possible (if both source and target bridges advertise themselves as schema-aware, it uses SequenceBuilder and the TypedModel's atom-aware stream() method in preference to the untyped FragmentBuilder and Model).

Developing and Refactoring

The gXML source tree contains, in addition to the processors mentioned above, an XPath 1.0 processor, a schema parser, and a schema validator. The XPath processor is stateless; the schema processors (unsurprisingly) stateful. Processors for XPath 2.0, XSLT 2.0, and XQuery 1.0 have also been explored, although this code is not included in the distribution.

During the development of the API, in early 2009, the Apache Woden project (1.0M8) was refactored as a proof of concept. This effort was based on an earlier revision of the API; the refactoring was extensive, taking advantage of the immutable paradigm. Woden was chosen as an example because it contained an example of multi-tree abstraction: wrapper classes permit Woden to parse and analyze WSDL supplied either as AxiOM or as DOM trees. The project required about a month, but the result seemed a dramatic validation of gXML principles and design: the lines of code (LOC) count was reduced by about 15%, inconsistencies in the handling of DOM versus AxiOM were eliminated, and supported models grew from two to five (including DOM, AxiOM, the Cx reference model, a proprietary internal model, and an experimental model based on EXI). There is no guarantee of such an LOC count reduction, of course; results will depend upon the original source.

As part of the preparation for release as open source, a similar effort was undertaken to refactor the Apache XML Security project in early 2010. This was a more cautious effort, adopting as a guideline that no externally used API should change. Instead, the existing interfaces were enhanced with a gXML code path. In addition to preservation of backward compatibility in the API, this refactoring did not attempt a wholesale restatement of the security problem in immutable context, but relied extensively upon MutableContext and the capabilities supported therein. This effort is ongoing, and does not appear to promise a reduction in code size, given its goals. It has provided the team with an excellent test case for the mutable APIs (and even demonstrated missing XDM-defined functionality in the core APIs) which have been used to improve both areas. Nonetheless, it appears to validate the concept of cautious, compatibility-maintaining refactoring; the refactored API appears able to pass the same tests that the original DOM-based API passed.

The experience from these (and other) proofs of concept, refactoring existing XML processors and developing new processors, leads to some tentative conclusions about the efforts involved and the possible development patterns. We note that because all current tree models incorporate mutability without questioning its utility, most processors approach problems of XML manipulation as a tree mutation.

New Development

The time required for development of a new processor varies depending upon the complexity of the processing. In our experience, adopting the immutable paradigm can actually simplify development, though it requires an effort to state the problem as a transformation rather than as a mutation. Processors developed for gXML take no more, and often less, time to develop (and debug) than processors over a single tree model. When designed for immutability, the resulting processor often shows excellent performance characteristics, without requiring significant attention to this area.

Examples are included in the distribution, in the processor module and its children: input-output, convert, w3c.xs (schema parsing), and w3c.xs.validation.

Refactoring: Processing Mutable Trees

Existing processors (such as the Apache XML Security example) that have already been released are apt to wish to maintain existing customer bases. The approach to take, in this case, seems to be to produce an extended, parallel API: where the existing API takes a Node, provide an overload that accepts (for example) N, Model<N>, or (if changing the state of the supplied argument is acceptable) Cursor<N>. Then change the original DOM-based function so that it merely calls the new gXML-based method. This approach increases the size of the code base, but preserves the logic of the API, validation via the existing test suite, and compatibility with existing clients.
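The parallel-API pattern might look like the following sketch. It is illustrative only: ChildModel is a hypothetical stand-in that copies the getFirstChild(), getNextSibling(), and isElement() signatures from the appendix, DomChildModel plays the part that the DOM bridge would play in gXML, and ChildCounter is an invented example of an "existing" DOM-based API.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical cut-down stand-in for the part of Model<N> used here.
interface ChildModel<N> {
    N getFirstChild(N origin);
    N getNextSibling(N node);
    boolean isElement(N node);
}

// Stand-in for what a DOM bridge would supply.
final class DomChildModel implements ChildModel<Node> {
    public Node getFirstChild(Node origin) { return origin.getFirstChild(); }
    public Node getNextSibling(Node node) { return node.getNextSibling(); }
    public boolean isElement(Node node) { return node.getNodeType() == Node.ELEMENT_NODE; }
}

final class ChildCounter {
    private static final ChildModel<Node> DOM = new DomChildModel();

    private ChildCounter() {}

    // Existing public API: the signature is unchanged, but the body now
    // merely delegates to the generic method.
    static int countChildElements(Node parent) {
        return countChildElements(parent, DOM);
    }

    // New, parallel API: the same logic, usable with any tree model.
    static <N> int countChildElements(N parent, ChildModel<N> model) {
        int count = 0;
        for (N child = model.getFirstChild(parent); child != null; child = model.getNextSibling(child)) {
            if (model.isElement(child)) {
                count++;
            }
        }
        return count;
    }
}

final class ChildCounterDemo {
    // Builds a small DOM tree and calls the original, DOM-typed entry point.
    static int run() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            root.appendChild(doc.createElement("a"));
            root.appendChild(doc.createTextNode("text"));
            root.appendChild(doc.createElement("b"));
            return ChildCounter.countChildElements(root); // old signature, new code path
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Existing callers and tests continue to use the one-argument form, so the existing test suite exercises the new generic code path without modification.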

Firm estimates depend upon the size and complexity of the code base, but experience seems to demonstrate that once the principles are understood, much of the refactoring proceeds in a nearly mechanical fashion. The primary advantage to this form of refactoring is the addition of support for all defined gXML bridges (or all bridges that support mutability); this in turn may permit customers to choose models better suited for a particular problem domain. In the XML Security case, the refactoring produces the ability to use the processor with AxiOM (in the current state of the art; potentially with other tree models as those are developed as well).

Refactoring: Processing Immutable Trees

Refactoring an XML processor for immutable operation is more challenging. The general principle is that instead of considering the problem as one of modifying a tree, the problem is stated as a transforming copy. The XML document is an input; other inputs guide the processing; the output is a new XML document (the original is then typically discarded, or sometimes archived). Our experience addressed Apache Woden, in part because the project was then recently graduated from incubation (that is, it had just made a public 1.0 release), so preservation of API compatibility was deemed less critical; widespread adoption had not yet occurred. Another example is the xpath.impl processor, based on the xpath API module; these modules were both created by refactoring a portion of James Clark's and Bill Lindsey's XT XT. XPath has no need for mutability, obviously; stating the XPath processing problem in immutable context is trivial.

This approach typically changes the logic of processing as well as changing the public API; developers may find that the code that "enhances" (mutates) a tree with information must be localized. That is, instead of receiving, analyzing, modifying, analyzing further, etc., the process is receiving, analyzing, generating/transforming, analyzing further. Creation of new documents is potentially expensive; this is apt to lead developers to minimize occurrences of the event. Awareness of this issue, in our experience, led to code that was more straightforward, easier to understand, and better encapsulated. Note also that a refactoring of a publicly released API might proceed first by preserving API compatibility, and later providing an alternate, transformative code path that parallels the modification path.

Advancing the State of the Art

The gXML team believes that this API presents an exciting opportunity to change the paradigms for XML processing in Java, and to enable a host of additional opportunities for advancing the state of the art. We have discussed the API, bridges, and processors in some detail, above. Now, let's examine the further opportunities that gXML enables.

Because gXML encourages the practice of dependency inversion, of injecting a particular tree model (bridge) at runtime, it effectively bypasses—even leverages, by inclusion of a bridge for the DOM in the distribution—the DOM network effect that has presented Java developers of XML processors and applications with a Hobson's choice: choose a tree model which is technically superior or less awkward to program against but lose interoperability with the vast majority of existing processors and applications, or choose the DOM with its peculiarities and quirks and limitations but gain interoperability with the wider XML ecosystem. Developers of alternative Java XML tree models will (we hope) welcome this, and contribute bridges. Moreover, by permitting this late binding of the tree model, gXML enables use-case specific comparisons of models to each other. This capability for comparison, without losing interoperability, may lead to wider adoption of one or more of the successor models, in one application domain or across domains. Further, given the ability to compare two models in such a way, application and processor developers can provide clear test cases demonstrating issues, which developers of the tree model may find more compelling, more deserving of attention, than is currently the case when any comparison must first develop a custom framework/harness.

By enabling injection of the model, gXML also potentially permits the development of domain-specific tree models, optimized for particular use cases. Such "niche" models are actively discouraged in the current state of the art: they lead in the direction of private code, difficult to learn and difficult to maintain. AxiOM provides an example of a domain-specific model that has survived the process of marginalization; one might argue that it has done so in part through its strong association with the high-profile project Apache Axis 2. Other domains such as strongly typed XML, large XML processing, and XML in constrained memory environments come to mind as potential targets. Customization and optimization are possible both for the underlying tree model, and for the bridge implementation. There is no restriction against implementing multiple bridges for a single underlying tree model—since the pattern is injection, two significantly different bridge implementations over the same underlying tree model may be used by a single application. Here again, there are significant opportunities for domain optimization, in this case by optimizing the bridge implementation rather than changing the underlying tree model.

gXML's championing of the immutable paradigm for XML processing carries powerful potentials for performance enhancements. We cannot, at this point, quantify these benefits (they may even be chimerical), but we have seen immutability adopted in other areas specifically in order to improve performance. Immutability provides guarantees that enable concurrent processing, an increasingly common requirement for applications and processors that must scale to handle large volumes of traffic. With a custom tree model (even an immutable implementation of the DOM, potentially), the notorious impact of XML on memory can potentially be reduced. For applications and processors that already address multiple tree models, significant reductions in code size may accompany improved performance and consistency. Our experience suggests that restating problems as transformation rather than mutation tends to lead to cleaner, better-encapsulated, and typically more performant code.

One particular area in which gXML holds enormous promise is in the processing of "large XML". This is, in a way, the same problem as processing XML with "constrained memory;" whether one identifies the XML as too-large, or memory as too-small, the problem is the same. How can XML be processed if it is too large to fit at once into memory? The obvious answer is a custom tree model, but this answer immediately presents the developer with the DOM "Hobson's choice" outlined above. gXML removes that issue; a processor or application programmed against the gXML API can inject a simple, mature tree model for most processing, or a custom, stored-to-disk, low-memory tree model when the size of the target document exceeds a specified threshold.

Developers of technologies that compete with XML as descriptions of structured, hierarchical data may have no interest in presenting their formats as XML (may even resent the suggestion), but there are advantages to doing so: the XML programming environment is a large one, populated with numerous processors and applications. A bridge over other such data formats—JSON, for a high-profile example—could provide that format with the capabilities of the entire suite of XML tools (with the reservation that there is apt to be an impedance mismatch of some degree, that the bridge will attempt to minimize). This is most interesting when gXML is used with the immutable paradigm; modifying these alternative structured hierarchical data formats as well as analyzing them is a more difficult problem and likely to have a higher degree of impedance mismatch.

Again particularly with respect to immutable processing, gXML offers an opportunity to pass XML across the virtual machine/Java Native Interface boundary. The XQuery Data Model defines the operations and properties that are possible with (g)XML; there is no impediment to producing a specification-compliant API in other languages, whether they are hosted in the VM (Scala, Jython) or outside it (C++, Perl, Lua). This in turn suggests possibilities for enabling most-efficient processing, for enabling scripting in domain-specific languages, and so on.

Perhaps most significantly, from the point of view of the gXML development team: in recent years a number of new specifications have appeared that offer exciting opportunities for advancing the state of the art of XML processing. In Java, adoption of these technologies (XQuery, XSLT 2.0, XML databases) has been slowed by the lack of support in dominant models, and the limited extensibility possible. Even XML Schema has seen relatively little adoption/development outside the enterprise; gXML includes a schema model to address that issue. More importantly, the XQuery Data Model seems to offer a well-thought-out foundation for the next ten years of development in XML-related technologies. gXML proposes to embody that model for Java, while providing compatibility with the existing tree models, enabling a unification of processing while promoting differentiation, specialization, and customization of models.

gXML Solution(s)

We submit that gXML addresses the problems that its design set out to address, and that have plagued a large population of developers. It resolves the problem of multiple, competing tree models in Java, leverages the network effect of the dominant Java tree model for XML (and in fact shares that network effect with any other tree model over which a gXML bridge is available), and permits comparison of and late (even runtime) selection of a model best suited to the task. In the process, it begins to resolve the problems of interoperability. It is based on a well-defined, rigorous specification (the XQuery Data Model), which appears to be the best foundation for the next generation of XML technologies. It introduces and promotes the immutable paradigm for XML processing, and permits or encourages the development of models able to fulfill the promise of that paradigm.

gXML represents about five man-years of development, in its current state. Its corporate sponsor has contributed it to open source because its value can be directly correlated with its adoption. More bridges: more value (to the contributing corporation and to everyone using gXML). More processors: more value. For more code, though, we need help. Get involved! Try the code. Our experience has been that it has immediate benefits, even for isolated applications and processors. See a bug? Contribute a patch! Intrigued by the promise gXML offers? Become a committer!

If the previous ten years are any guide, so significant a shift in APIs and paradigms in the Java world will need to last at least ten years. The APIs developed ten years ago, viewed in hindsight, show what seem to be obvious lacunae or missed focus. Are there such gaps and blind spots in gXML? Take a look; if we're missing something, tell us now, and help us to address it.

Interested in the opportunities, but not in refining the core APIs? Want to provide an XQuery Data Model over a different, currently unsupported tree model (even over a non-XML structured data model)? Write a bridge. Our experience suggests that investment for a new bridge is about one programmer-month, for complete, but unoptimized functionality. Refinements depend upon the underlying tree model; those that are closer in concept to the XQuery Data Model tend to be easier to improve, while those further away (particularly if they don't conform to XML Infoset) provide more challenges. If developers involved in JDOM, DOM4J, or XOM are reading this, we hope to have intrigued you enough that you'll contribute (or provide independently) a bridge implementation for those models. What about a bridge for JSON? CSV? Could the new, XQuery-conformant crop of XML databases expose programming interfaces as bridges or as processors?

Interested in a particular application of XML? Can it be conceived as an XML processor? Development investment for a gXML processor varies pretty widely, depending upon the complexity of the processing to be done. For instance, the schema validation module included in the gXML source represents perhaps six months of work; the conversion processor (because it really does nothing more than embody an idiom already supported in the gXML core APIs) required no more than a week. XQuery or XSLT 2.0 processors would represent significant time investments. The field is vast, though, so it is impossible to characterize (either in time or complexity) everything in it.

Are we missing an obvious opportunity? Tell us about it. Or ... do it, and show us up. Our primary hope, in releasing the code and this paper, is to generate some excitement about the possibilities we believe to be inherent in the gXML refactoring of XML in Java. Get excited; this could change the game.

Appendix A. gXML: Source

As previously noted, the core of the gXML paradigm is an abstraction called Model. Because this is an example of the Handle/Body design pattern (and is stateless), only one instance of Model is needed for navigation and investigation for any and all instances of the XML tree model for which the particular Model is specialized. Consequently, it seems worthwhile to show the content of the Model abstraction. Comments have been removed.

Model is composed from three interfaces, reflecting three different forms of information that might be obtained from an XQuery Data Model: NodeInformer reports information about the content/state of a particular node in context; NodeNavigator permits one to obtain a different node given a particular starting node; AxisNavigator supplies iteration over the standard XPath/XQuery axes, starting from a particular origin node.

public interface Model<N>
    extends Comparator<N>, NodeInformer<N>, NodeNavigator<N>, AxisNavigator<N> {
    void stream(N node, boolean copyNamespaces, ContentHandler handler) throws GxmlException;
}

public interface NodeInformer<N> {
    Iterable<QName> getAttributeNames(N node, boolean orderCanonical);

    String getAttributeStringValue(N parent, String namespaceURI, String localName);

    URI getBaseURI(N node);

    URI getDocumentURI(N node);

    String getLocalName(N node);

    Iterable<NamespaceBinding> getNamespaceBindings(N node);

    String getNamespaceForPrefix(N node, String prefix);
    
    Iterable<String> getNamespaceNames(N node, boolean orderCanonical);

    String getNamespaceURI(N node);

    Object getNodeId(N node);

    NodeKind getNodeKind(N node);

    String getPrefix(N node);

    String getStringValue(N node);

    boolean hasAttributes(N node);

    boolean hasChildren(N node);

    boolean hasNamespaces(N node);

    boolean hasNextSibling(N node);

    boolean hasParent(N node);

    boolean hasPreviousSibling(N node);

    boolean isAttribute(N node);

    boolean isElement(N node);

    boolean isId(N node);

    boolean isIdRefs(N node);

    boolean isNamespace(N node);

    boolean isText(N node);

    boolean matches(N node, NodeKind nodeKind, String namespaceURI, String localName);

    boolean matches(N node, String namespaceURI, String localName);
}

public interface NodeNavigator<N> {
    N getAttribute(N node, String namespaceURI, String localName);

    N getElementById(N context, String id);

    N getFirstChild(N origin);

    N getFirstChildElement(N node);

    N getFirstChildElementByName(N node, String namespaceURI, String localName);

    N getLastChild(N node);

    N getNextSibling(N node);

    N getNextSiblingElement(N node);

    N getNextSiblingElementByName(N node, String namespaceURI, String localName);

    N getParent(N origin);

    N getPreviousSibling(N node);

    N getRoot(N node);
}

public interface AxisNavigator<N> {
    Iterable<N> getAncestorAxis(N node);

    Iterable<N> getAncestorOrSelfAxis(N node);

    Iterable<N> getAttributeAxis(N node, boolean inherit);

    Iterable<N> getChildAxis(N node);

    Iterable<N> getChildElements(N node);

    Iterable<N> getChildElementsByName(N node, String namespaceURI, String localName);

    Iterable<N> getDescendantAxis(N node);

    Iterable<N> getDescendantOrSelfAxis(N node);

    Iterable<N> getFollowingAxis(N node);

    Iterable<N> getFollowingSiblingAxis(N node);

    Iterable<N> getNamespaceAxis(N node, boolean inherit);

    Iterable<N> getPrecedingAxis(N node);

    Iterable<N> getPrecedingSiblingAxis(N node);
}

References

[AxiOM] Axiom 1.2.8 API http://ws.apache.org/commons/axiom/apidocs/index.html

[LavaFlow] Brown W., R. Malveau, H. McCormick, T. Mowbray, and S. W. Thomas. Lava Flow anti-pattern (Dec. 1999) http://www.antipatterns.com/lavaflow.htm

[DOM] Document Object Model Technical Reports http://www.w3.org/DOM/DOMTR

[DOM4J] DOM4J Introduction http://dom4j.sourceforge.net/

[XML] Extensible Markup Language (XML) 1.0 (Fifth Edition) http://www.w3.org/TR/xml/

[GOF] Gamma, E., R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software Addison-Wesley, 1995.

[XMLInJava] Harold, E. Processing XML with Java http://www.cafeconleche.org/books/xmljava/

[WhatsWrong] Harold, E. "What's Wrong with XML APIs (and how to fix them)" http://www.xom.nu/whatswrong/whatswrong.html

[Jaxen] Jaxen http://jaxen.org/

[JDOM] JDOM v1.1.1 API Specification http://www.jdom.org/docs/apidocs/

[XMLNS] Namespaces in XML 1.0 (Second Edition) http://www.w3.org/TR/xml-names

[DMPerf] Sosnoski, D. "XML and Java technologies: Document models, Part 1: Performance" http://www.ibm.com/developerworks/xml/library/x-injava/index.html

[DMUse] Sosnoski, D. "XML and Java technologies: Java document model usage" http://www.ibm.com/developerworks/xml/library/x-injava2/

[Woden] Welcome to Woden http://ws.apache.org/woden/

[Xalan] Xalan-Java http://xml.apache.org/xalan-j/index.html

[XalanDTM] XalanDTM http://xml.apache.org/xalan-j/dtm.html

[Infoset] XML Information Set (Second Edition) http://www.w3.org/TR/xml-infoset

[XPath1] XML Path Language (XPath), Version 1.0 http://www.w3.org/TR/xpath/

[WXS1] XML Schema Part 1: Structures Second Edition http://www.w3.org/TR/xmlschema-1/

[WXS2] XML Schema Part 2: Datatypes Second Edition http://www.w3.org/TR/xmlschema-2/

[XOM] XOM 1.2.5 http://www.xom.nu/apidocs/

[XDM] XQuery 1.0 and XPath 2.0 Data Model (XDM) http://www.w3.org/TR/xpath-datamodel/

[XSLT1] XSL Transformations (XSLT), Version 1.0 http://www.w3.org/TR/xslt

[XT] XT http://www.blnz.com/xt/index.html

Author's keywords for this paper:
XQuery Data Model; Handle/Body design pattern; Document Object Model; DOM; JDOM; DOM4J; AxiOM; XOM; XPath; XML Infoset

Amelia A. Lewis

Senior Architect

TIBCO Software Inc.

Amelia Lewis is a senior architect with the TIBCO/Extensibility division of TIBCO Software Inc. Her primary focus, since 2000, has been XML technologies, inside and outside TIBCO. She has been active in a variety of XML-related specifications efforts and developer-oriented XML mailing lists; she has extensive experience with implementation of a variety of XML technologies, using most of the tree models mentioned in this paper.

Eric E. Johnson

Principal Architect

TIBCO Software Inc.

Eric Johnson is a principal architect at TIBCO Software Inc. Eric joined TIBCO in 2000, as part of TIBCO's acquisition of Extensibility, an XML tools company. While Eric now works in a variety of areas, including governance, build architecture, and various standards including SOAP/JMS, SCA, and OSGi, he has also maintained a strong interest in improving the core technologies that TIBCO uses, especially those related to XML.