Introduction
To process XML documents in a programming language, one can essentially choose between two application programming interfaces (APIs). The Simple API for XML (SAX) is an event-centric interface, while the Document Object Model (DOM) provides a sophisticated object structure for working with XML documents. This contribution introduces an event-centric API for working with XCONCUR documents, which is inspired by XML's SAX API.
Section 2 gives a brief overview of the XCONCUR document syntax, section 3 describes an event-centric XCONCUR API, and section 4 contains an outlook on further work.
XCONCUR
XCONCUR is an extension to XML whose major goal is to provide a convenient method for expressing concurrent hierarchies. An XCONCUR document may contain an arbitrary number of annotation layers. Each layer can be transformed into a well-formed XML document by a simple filtering process. Therefore, an XCONCUR document can be seen as a set of interwoven XML documents. Figure 1 shows an XCONCUR example document with two annotation layers. Each tag is prefixed by an annotation layer id and is thus assigned to a layer. The XCONCUR schema declarations allow an annotation schema to be assigned to each layer. The annotation schema may be written in any of the current XML schema languages, e.g. DTD, XML Schema or RELAX NG. If an annotation schema has been assigned to an annotation layer, the layer is validated against this schema. While the use of annotation schemas is optional, an XCONCUR document is required to be well-formed: each XCONCUR document can be decomposed into a set of XML documents by selecting one layer and removing the tags of all other annotation layers as well as the annotation layer prefixes. The resulting XML documents are required to be well-formed. Additionally, an XCONCUR constraint declaration can optionally be used to associate an XCONCUR-CL constraint set with the document, which allows cross-tree validation. For details see Schonefeld (2007) and Witt et al. (2007).
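The filtering process described above can be illustrated with a minimal Python sketch. It is not part of the reference implementation. It assumes, purely for illustration, that layered tags have the shape <(layer)name ...> and </(layer)name>; the function name project_layer and the sample text are hypothetical, and XCONCUR declarations are not handled.

```python
import re
import xml.etree.ElementTree as ET

# ASSUMPTION (illustration only): tags look like <(layer)name ...> and
# </(layer)name>. Declarations and '>' inside attribute values are not handled.
LAYERED_TAG = re.compile(r"<(?P<slash>/?)\((?P<layer>[^)\s]+)\)(?P<rest>[^<>]*)>")


def project_layer(xconcur_body, layer):
    """Return the XML projection of one annotation layer."""

    def rewrite(match):
        if match.group("layer") != layer:
            return ""  # tag of another layer: drop it, keep the surrounding text
        # tag of the selected layer: keep it, minus the layer prefix
        return "<" + match.group("slash") + match.group("rest") + ">"

    return LAYERED_TAG.sub(rewrite, xconcur_body)


# Two overlapping layers: a line (l) ends inside a sentence (s).
document = (
    "<(l1)div><(l2)text>"
    "<(l1)l><(l2)s>Scorn not</(l1)l>"
    "<(l1)l> the sonnet</(l2)s></(l1)l>"
    "</(l2)text></(l1)div>"
)
layer1 = project_layer(document, "l1")
layer2 = project_layer(document, "l2")

# ET.fromstring raises ParseError if a projection is not well-formed XML.
for projection in (layer1, layer2):
    ET.fromstring(projection)
    print(projection)
```

Although the two hierarchies overlap in the combined document, each projection on its own passes the XML well-formedness check, which is the property required of every XCONCUR document.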
An event-centric application programming interface
The event-centric API for processing XCONCUR documents is heavily inspired by XML's SAX API (see Megginson et al. (2002)). It provides a very low-level approach to working with XCONCUR documents. While processing a document, the parser emits a series of events. An application may receive those events and perform custom actions, e.g. build an in-memory representation of the document. Since the application ultimately decides which events to accept and how to handle them, the parser only has to build up a minimal in-memory representation to perform its work. This streaming approach is therefore quite memory-efficient.
The API essentially defines a number of start events, which signal the beginning of a construct in the parsed document (e.g. a start tag), and their corresponding end events. The event signaling character data is an exception: only a single characters event exists, without any start or end event. The following list contains the events defined by the API. All events marked with an asterisk are unique to the XCONCUR API; all others have been adapted to cope with more than one annotation layer.
Start Document ()
The beginning of the document has been detected. This event is sent after the XCONCUR declaration has been read.
End Document ()
The end of the document has been detected. This event is sent when the document has been processed completely.
Start Layer (layer)*
A new annotation layer has been detected. This event is sent either when an XCONCUR layer declaration has been processed or when the root tag of a new annotation layer has been found. The annotation layer prefix is provided.
End Layer (layer)*
The end of an annotation layer has been detected. This event is sent after the matching end tag for the annotation layer's root element has been processed. The annotation layer prefix is provided.
Start Primary Data ()*
This event signals the beginning of the character data of the document. It is sent after the start tags of the root elements of all annotation layers in the document have been processed.
End Primary Data ()*
This event signals the end of the actual character data of the document. It is sent right before the first end tag of a root element of any annotation layer is processed.
Start Prefix Mapping (layer, prefix, uri)
This event signals the beginning of the scope of a namespace prefix mapping on a layer. It is sent just before the start element event of the element that declares the prefix mapping is emitted. The event carries the annotation layer, the namespace prefix and the namespace URI. If an element defines more than one prefix mapping, the start prefix mapping events may occur in any order.
End Prefix Mapping (layer, prefix, uri)
This event signals the end of the scope of a namespace prefix mapping on a layer. It is sent just after the end element event of the element that declared the mapping has been emitted. The event carries the annotation layer, the namespace prefix and the namespace URI. If an element defined more than one prefix mapping, the end prefix mapping events may occur in any order.
Characters (characters)
This event signals character data. More than one characters event may be emitted for a single chunk of character data in the document.
Start Element (layer, uri, localname, qname, attributes)
A start tag has been detected. The event carries the annotation layer prefix, the namespace URI, the local name and the qualified name of the tag. Furthermore, a list of attributes is available. This list is empty if the element has no attributes; otherwise it contains the namespace URI, local name, qualified name and value of each attribute.
End Element (layer, uri, localname, qname)
An end tag has been detected. The event carries the annotation layer prefix, the namespace URI, the local name and the qualified name of the tag.
The major difference to XML's SAX-API is that the element and prefix mapping events have been modified to also carry the annotation layer id, so an application can take this information into account; the characters event carries no layer id. Furthermore, the start/end layer and start/end primary data events have been added. The start/end layer events provide an easy mechanism for the application to determine which annotation layers exist in an XCONCUR document and to perform actions, e.g. allocating memory for each layer. Strictly speaking, one could derive this information from other events (e.g. by checking whether a start element event that was just received carries a yet unknown annotation layer id), but the start/end layer events make the application easier to write, since the programmer can rely upon them. The same holds for the start/end primary data events. They signal the start and end of the actual character data of a document.
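To make the role of the layer and primary data events concrete, the following Python sketch renders the event list as a handler class with one callback per event and derives a small handler that allocates per-layer state on start layer. The Python method names, the class LayerInventory, the use of an empty string for "no namespace" and the hand-written event sequence are assumptions made for illustration; they are not taken from the C++ or Java reference implementations.

```python
class ContentHandler:
    """Hypothetical Python rendering of the event list; all callbacks are no-ops."""

    def start_document(self): pass
    def end_document(self): pass
    def start_layer(self, layer): pass
    def end_layer(self, layer): pass
    def start_primary_data(self): pass
    def end_primary_data(self): pass
    def start_prefix_mapping(self, layer, prefix, uri): pass
    def end_prefix_mapping(self, layer, prefix, uri): pass
    def characters(self, characters): pass
    def start_element(self, layer, uri, localname, qname, attributes): pass
    def end_element(self, layer, uri, localname, qname): pass


class LayerInventory(ContentHandler):
    """Records the element names of each layer and collects the primary data."""

    def __init__(self):
        self.elements_per_layer = {}
        self.primary_data = []
        self._inside_primary_data = False

    def start_layer(self, layer):
        # Per-layer state is allocated here, not on the first start element.
        self.elements_per_layer[layer] = []

    def start_element(self, layer, uri, localname, qname, attributes):
        self.elements_per_layer[layer].append(qname)

    def start_primary_data(self):
        self._inside_primary_data = True

    def end_primary_data(self):
        self._inside_primary_data = False

    def characters(self, characters):
        if self._inside_primary_data:
            self.primary_data.append(characters)


# Hand-written event sequence standing in for a parser (illustrative order).
handler = LayerInventory()
handler.start_document()
handler.start_layer("l1")
handler.start_element("l1", "", "div", "div", [])
handler.start_layer("l2")
handler.start_element("l2", "", "text", "text", [])
handler.start_primary_data()
handler.start_element("l1", "", "l", "l", [])
handler.start_element("l2", "", "s", "s", [])
handler.characters("Scorn not ")   # one chunk of text may arrive in pieces
handler.characters("the sonnet")
handler.end_element("l2", "", "s", "s")
handler.end_element("l1", "", "l", "l")
handler.end_primary_data()
handler.end_element("l2", "", "text", "text")
handler.end_layer("l2")
handler.end_element("l1", "", "div", "div")
handler.end_layer("l1")
handler.end_document()

print(handler.elements_per_layer)
print("".join(handler.primary_data))
```

Because start layer always precedes the first element of a layer, start_element can index the dictionary without checking for unknown layer ids, which is the convenience argued for above.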
The XCONCUR SAX-API provides various classes and interfaces. Its most important entities are the XConcurReader and ContentHandler classes. The XConcurReader class encapsulates the underlying parser[1]. The ContentHandler defines an interface that needs to be implemented by the user's program and acts as the message sink for the events generated by the parser. The API also contains various auxiliary classes, e.g. classes that provide abstract input sources for reading XCONCUR documents, and error reporting classes.
Figure 2 shows an excerpt of a class implementing the ContentHandler interface. Given this class, a typical sequence for parsing an XCONCUR document is shown in Figure 3.
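As a rough illustration of such a parse sequence (create a handler, create a reader, register the handler, start the parse), consider the following Python sketch. It does not reproduce the figures or the reference implementation: ToyXConcurReader, set_content_handler, parse and EventLog are hypothetical names modelled loosely on SAX. The toy reader assumes tags of the form <(layer)name> without attributes and emits only document, layer, element and characters events; prefix mapping and primary data events are omitted.

```python
import re

# ASSUMPTION (illustration only): tags look like <(layer)name> and </(layer)name>.
TOKEN = re.compile(
    r"<(?P<close>/?)\((?P<layer>[^)\s]+)\)(?P<name>[^\s<>/]+)>|(?P<text>[^<]+)"
)


class EventLog:
    """Minimal handler that records every event it receives as a tuple."""

    def __init__(self):
        self.events = []

    def start_document(self): self.events.append(("start document",))
    def end_document(self): self.events.append(("end document",))
    def start_layer(self, layer): self.events.append(("start layer", layer))
    def end_layer(self, layer): self.events.append(("end layer", layer))
    def start_element(self, layer, name): self.events.append(("start element", layer, name))
    def end_element(self, layer, name): self.events.append(("end element", layer, name))
    def characters(self, characters): self.events.append(("characters", characters))


class ToyXConcurReader:
    """Hypothetical stand-in for XConcurReader; not the reference parser."""

    def __init__(self):
        self.content_handler = None

    def set_content_handler(self, handler):
        self.content_handler = handler

    def parse(self, text):
        handler = self.content_handler
        open_elements = {}  # layer id -> stack of currently open element names
        handler.start_document()
        for token in TOKEN.finditer(text):
            if token.group("text") is not None:
                handler.characters(token.group("text"))
                continue
            layer, name = token.group("layer"), token.group("name")
            if token.group("close"):
                stack = open_elements.get(layer)
                if not stack or stack.pop() != name:
                    raise ValueError("layer %s is not well-formed at </%s>" % (layer, name))
                handler.end_element(layer, name)
                if not stack:  # the root element of this layer was just closed
                    handler.end_layer(layer)
            else:
                if layer not in open_elements:  # first tag seen for this layer
                    open_elements[layer] = []
                    handler.start_layer(layer)
                open_elements[layer].append(name)
                handler.start_element(layer, name)
        handler.end_document()


# The parse sequence: handler, reader, registration, parse.
handler = EventLog()
reader = ToyXConcurReader()
reader.set_content_handler(handler)
reader.parse(
    "<(l1)div><(l2)text><(l1)l><(l2)s>Scorn</(l1)l>"
    "<(l1)l> not</(l2)s></(l1)l></(l2)text></(l1)div>"
)
for event in handler.events:
    print(event)
```

Well-formedness is checked per layer with one stack per annotation layer, so overlapping elements from different layers are accepted while a mismatched end tag within one layer is rejected.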
The C++ reference implementation of the XCONCUR SAX-API contains a program called xconcurlint. It uses the API to read an XCONCUR document and prints the events that are emitted by the parser. Figure 4 shows a transcript of the parse of the XCONCUR document from Figure 1. The event types are printed in curly brackets. Other event-specific information, such as the annotation layer prefix or the element name, is also printed.
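A handler in the spirit of such an event printer might look like the following Python sketch. It is not the xconcurlint source, and the exact output format of xconcurlint is not reproduced here; only the convention of putting the event type in curly brackets is taken from the description above. The class name EventPrinter, the method names and the hand-written event sequence are hypothetical.

```python
class EventPrinter:
    """Formats each received event as '{event type} details'."""

    def __init__(self):
        self.lines = []

    def _emit(self, event_type, *details):
        # Event type in curly brackets, followed by event-specific information.
        self.lines.append(" ".join(["{" + event_type + "}", *details]))

    def start_document(self):
        self._emit("start document")

    def end_document(self):
        self._emit("end document")

    def start_layer(self, layer):
        self._emit("start layer", layer)

    def end_layer(self, layer):
        self._emit("end layer", layer)

    def start_primary_data(self):
        self._emit("start primary data")

    def end_primary_data(self):
        self._emit("end primary data")

    def start_element(self, layer, uri, localname, qname, attributes):
        self._emit("start element", layer, qname)

    def end_element(self, layer, uri, localname, qname):
        self._emit("end element", layer, qname)

    def characters(self, characters):
        # repr() keeps whitespace-only chunks visible in the transcript.
        self._emit("characters", repr(characters))


# Hand-written event sequence standing in for a parser.
printer = EventPrinter()
printer.start_document()
printer.start_layer("l1")
printer.start_element("l1", "", "div", "div", [])
printer.start_primary_data()
printer.characters("Scorn not the sonnet")
printer.end_primary_data()
printer.end_element("l1", "", "div", "div")
printer.end_layer("l1")
printer.end_document()
print("\n".join(printer.lines))
```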
Conclusion
The XCONCUR SAX-API provides a very low-level, yet powerful, interface for processing XCONCUR documents. It is relatively simple to work with, and programmers who are familiar with XML's SAX-API should quickly feel at ease with it. The API makes very few assumptions about the underlying parser and provides a uniform interface for using parser implementations from different vendors. Furthermore, the API can easily be ported to different programming languages. A C++ and a Java reference implementation are available[2]. For the Java language bindings, the API is implemented in plain Java, while the parsing itself is delegated to the C++ implementation of the parser.
Future work involves creating an object-based API similar to XML's DOM-API. Conceptual work on this is currently underway, and the XCONCUR-DOM parser will be built upon the XCONCUR-SAX parser. Furthermore, the Mascarpone XCONCUR editor needs to be overhauled to use the new APIs.
Appendix A. API interfaces
This appendix lists the most fundamental interfaces of the XCONCUR SAX-API. The full API contains a few more interfaces and classes.
References
[Megginson et al. (2002)] David Megginson: “Simple API for XML processing”. Available online at http://www.saxproject.org/quickstart.html
[Le Hors et al. (2004)] Arnaud Le Hors, Philippe Le Hégaret, Lauren Wood, Gavin Nicol, Jonathan Robie, Mike Champion, Steve Byrne: “Document Object Model (DOM) Level 3 Core Specification”. World Wide Web Consortium, 2004. Available online at http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/
[Schonefeld (2007)] Oliver Schonefeld: “XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup”. In: Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures for linguistic resources and applications: Proceedings of the Biennial GLDV Conference 2007, Georg Rehm, Andreas Witt, Lothar Lemnitzer (eds), Tübingen Verlag, Germany, 2007. Pp. 347–356.
[Witt et al. (2007)] Andreas Witt, Oliver Schonefeld, Georg Rehm, Jonathan Khoo, Kilian Evang: “On the Lossless Transformation of Single-File, Multi-Layer Annotations into Multi-Rooted Trees”. In: Proceedings of Extreme Markup Languages 2007, Montréal, Canada, 2007. Available online at http://www.idealliance.org/papers/extreme/proceedings/html/2007/Witt01/EML2007Witt01.xml
[Bray et al. (2006)] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau, John Cowan: “Extensible Markup Language (XML) 1.1”. World Wide Web Consortium, 2006, 2nd edition. Available online at http://www.w3.org/TR/2006/REC-xml11-20060816/
[1] The parser implementation is not part of the API. Different vendors could supply their own implementations. The reference implementation of the XCONCUR SAX-API currently provides a non-validating parser.
[2] The author provides the software for evaluation and academic purposes upon request.
[3] All online resources were last checked on 2008/08/31.