Introduction

To process XML documents in a programming language, one can choose between two different application programming interfaces (APIs). The Simple API for XML (SAX) is an event-centric interface, while the Document Object Model (DOM) provides a sophisticated object structure for working with XML documents. This contribution introduces an event-centric API for working with XCONCUR documents, which is inspired by XML's SAX-API.

Section 2 gives a brief overview of the XCONCUR document syntax, section 3 describes an event-centric XCONCUR API, and section 4 contains an outlook on further work.

XCONCUR

XCONCUR is an extension to XML whose major goal is to provide a convenient method for expressing concurrent hierarchies. An XCONCUR document may contain an arbitrary number of annotation layers. Each layer can be transformed into a well-formed XML document by a simple filtering process. Therefore, an XCONCUR document can be seen as a set of interwoven XML documents. Figure 1 shows an XCONCUR example document with two annotation layers. Each tag is prefixed by an annotation layer id and thus assigned to a layer. The XCONCUR schema declarations make it possible to assign an annotation schema to each layer. The annotation schema may be written in any of the current XML schema languages, e.g. DTD, XML Schema or RELAX NG. If an annotation schema has been assigned to an annotation layer, the layer is validated against this schema. While the use of annotation schemas is optional, an XCONCUR document is required to be well-formed: each XCONCUR document can be decomposed into a set of XML documents by selecting one layer at a time and removing the tags of all other annotation layers as well as the annotation layer prefixes. The resulting XML documents are required to be well-formed. Additionally, an XCONCUR constraint declaration can optionally be used to associate an XCONCUR-CL constraint set with the document, which allows cross-tree validation. For details see Schonefeld (2007) and Witt et al. (2007).

Figure 1: XCONCUR example

<?xconcur version="1.1" encoding="iso-8859-1"?>
<?xconcur-schema layer="l1" root="div" system="teispok2.dtd"?>
<?xconcur-schema layer="l2" root="text" system="teiana2.dtd"?>
<?xconcur-constraint system="peterandpaul.xcs" xconcur:l1="L1" xconcur:l2="L2"?>
<(l1)div type="dialog" org="uniform">
  <(l2)text>
      <(l1)u who="Peter">
    <(l2)s>Hey Paul!</(l2)s>
      <(l2)s>Would you give me
    </(l1)u>
    <(l1)u who="Paul">
      the hammer?</(l2)s>
    </(l1)u>
  </(l2)text>
</(l1)div>
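
The filtering step that turns one annotation layer into an XML document can be sketched in a few lines of C++. The function name ProjectLayer is hypothetical and not part of any XCONCUR implementation, and the sketch rests on simplifying assumptions: tags have the form <(id)name ...> or </(id)name>, no '>' occurs inside attribute values, and comments and CDATA sections are absent.

```cpp
#include <string>

// Hypothetical helper (not part of the XCONCUR API): projects one annotation
// layer of an XCONCUR document onto plain XML. Simplifying assumptions: tags
// look like <(id)name ...> or </(id)name>, attribute values contain no '>',
// and the document contains no comments or CDATA sections.
std::string ProjectLayer(const std::string& doc, const std::string& layer) {
  std::string out;
  std::string::size_type pos = 0;
  while (pos < doc.size()) {
    if (doc[pos] != '<') {  // character data is shared by all layers
      out += doc[pos++];
      continue;
    }
    const std::string::size_type end = doc.find('>', pos);
    if (end == std::string::npos) {  // unterminated markup: copy the rest
      out.append(doc, pos, std::string::npos);
      break;
    }
    const std::string tag = doc.substr(pos, end - pos + 1);
    pos = end + 1;
    if (tag.compare(0, 2, "<?") == 0) continue;  // drop XCONCUR declarations
    const bool closing = tag.compare(0, 2, "</") == 0;
    const std::string::size_type open = closing ? 2 : 1;  // index of '('
    const std::string::size_type close = tag.find(')', open);
    if (tag[open] != '(' || close == std::string::npos) {
      out += tag;  // tag without a layer prefix: keep it unchanged
      continue;
    }
    // Skip tags that belong to any other annotation layer.
    if (tag.compare(open + 1, close - open - 1, layer) != 0) continue;
    out += closing ? "</" : "<";
    out.append(tag, close + 1, std::string::npos);  // name, attributes, '>'
  }
  return out;
}
```

Calling ProjectLayer once per layer id yields one XML document per layer. Character data is copied unchanged, so whitespace that surrounds the tags of other layers stays in the result.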

An event-centric application programming interface

The event-centric API for processing XCONCUR documents is heavily inspired by XML's SAX API (see Megginson et al. (2002)). It provides a very low-level approach for working with XCONCUR documents. While processing a document, the parser emits a series of events. An application may receive those events and perform custom actions, e.g. build an in-memory representation of the document. Since the application ultimately decides which events to accept and how to handle them, the parser only has to build up a very minimal in-memory representation to perform its work. This streaming approach is therefore quite memory-efficient.

The API defines a number of start events, which signal the beginning of an entity in the parsed document (e.g. a start tag), and their corresponding end events. The event signaling character data is an exception: only a single character data event exists, without any start or end event. The following list contains the events defined by the API. All events marked with an asterisk are unique to the XCONCUR API; all others have been adopted from SAX and, where necessary, adapted to cope with more than one annotation layer.

Start Document ()

The beginning of the document has been detected. This event is sent after the XCONCUR declaration has been read.

End Document ()

The end of the document has been detected. This event is sent when the document has been processed completely.

Start Layer (layer)*

A new annotation layer has been detected. This event is sent either when an XCONCUR layer declaration has been processed or when the root tag of a new annotation layer has been found. The name of the annotation layer prefix is provided.

End Layer (layer)*

The end of an annotation layer has been detected. This event is sent after the matching end tag for the annotation layer's root element has been processed. The name of the annotation layer prefix is provided.

Start Primary Data ()*

This event signals the beginning of the character data of the document. It is sent after the root elements of all annotation layers in the document have been processed.

End Primary Data ()*

This event signals the end of the actual character data of the document. It is sent right before the first end tag of the root element of any annotation layer is processed.

Start Prefix Mapping (layer, prefix, uri)

This event signals the beginning of the scope of a namespace prefix mapping on a layer. It is sent just before the start element event of the element that declares the prefix mapping. The event carries the annotation layer, the namespace prefix and the namespace URI. If an element declares more than one prefix mapping, the start prefix mapping events may occur in any order.

End Prefix Mapping (layer, prefix, uri)

This event signals the end of the scope of a namespace prefix mapping on a layer. It is sent just after the end element event of the element that declared the mapping. The event carries the annotation layer, the namespace prefix and the namespace URI. If an element declared more than one prefix mapping, the end prefix mapping events may occur in any order.

Characters (characters)

This event signals character data. More than one character data event may be emitted for one chunk of character data in the document.

Start Element (layer, uri, localname, qname, attributes)

A start tag has been detected. The event carries the annotation layer prefix, the namespace URI, the local name and the qualified name of the tag. Furthermore, a list of attributes is available. This list is empty if the element has no attributes; otherwise it contains the namespace URI, local name, qualified name and value of each attribute.

End Element (layer, uri, localname, qname)

An end tag has been detected. The event carries the annotation layer prefix, the namespace URI, the local name and the qualified name of the tag.

The major difference from XML's SAX-API is that the element and prefix mapping events have been modified to also carry the annotation layer id, so an application can take this information into account. Furthermore, the start/end layer and start/end primary data events have been added. The start/end layer events provide an easy mechanism for the application to determine which annotation layers exist in an XCONCUR document and to perform per-layer actions, e.g. allocating memory for each layer. Strictly speaking, one could derive this information from other events (e.g. by checking whether a start element event carries a yet unknown annotation layer id), but the explicit start/end layer events make applications easier to write, since the programmer can rely upon them. The same holds for the start/end primary data events, which signal the start and end of the actual character data of a document.
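
As an illustration of how an application might use the layer id, the following sketch allocates one stack of open elements per annotation layer when a start layer event arrives and checks that end tags match the innermost open element of their own layer. The class LayerTracker is a hypothetical stand-in, not part of the API: it does not derive from the real ContentHandler, and its methods take only the layer id and the qualified name.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a ContentHandler implementation; only the layer
// id and the qualified name of the element events are used.
class LayerTracker {
public:
  LayerTracker() : well_nested_(true) {}

  // Start Layer: allocate the per-layer state up front.
  void StartLayer(const std::string& layer) { open_[layer]; }

  // End Layer: every element of the layer must be closed by now.
  void EndLayer(const std::string& layer) {
    if (!open_[layer].empty()) well_nested_ = false;
  }

  void StartElement(const std::string& layer, const std::string& qname) {
    open_[layer].push_back(qname);
  }

  // An end tag must match the innermost open element of its own layer;
  // elements that are still open on other layers are irrelevant.
  void EndElement(const std::string& layer, const std::string& qname) {
    std::vector<std::string>& stack = open_[layer];
    if (stack.empty() || stack.back() != qname) {
      well_nested_ = false;
      return;
    }
    stack.pop_back();
  }

  std::size_t LayerCount() const { return open_.size(); }
  bool WellNested() const { return well_nested_; }

private:
  std::map<std::string, std::vector<std::string> > open_;
  bool well_nested_;
};
```

Because each layer has its own stack, overlapping elements on different layers do not disturb the check, while a mismatch within a single layer is still detected.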

The XCONCUR SAX-API provides various classes and interfaces. The most important ones are the XConcurReader and ContentHandler classes. The XConcurReader class encapsulates the underlying parser[1]. The ContentHandler class defines an interface which needs to be implemented by the user's program and acts as the message sink for the events generated by the parser. In addition, the API contains various auxiliary classes, e.g. abstract input sources for reading XCONCUR documents and error reporting classes.

Figure 2 shows an excerpt of a class implementing the ContentHandler interface. Given this class, a typical sequence for parsing an XCONCUR document is shown in Figure 3.

Figure 2: An example implementation of the ContentHandler interface

#include <cstring>  // std::strcmp

class MyContentHandler : public ContentHandler {
public:
  virtual void StartElement(const char* const layer,
                            const char* const uri,
                            const char* const localname,
                            const char* const qname,
                            const Attributes &attrs) {
    // std::strcmp returns 0 if both strings are equal
    if (std::strcmp(layer, "l1") == 0) {
      // do something for start element events on layer "l1"
    }
  }

  virtual void EndElement(const char* const layer,
                          const char* const uri,
                          const char* const localname,
                          const char* const qname) {
    if (std::strcmp(layer, "l1") == 0) {
      // do something for end element events on layer "l1"
    }
  }

  // ...
}; // class MyContentHandler

Figure 3: Typical sequence to invoke the parser

try {
  // create reader instance
  XConcurReader *reader = XConcurReaderFactory::CreateReader();

  // class 'MyContentHandler' implements the ContentHandler interface
  MyContentHandler handler;

  // register content handler with reader (the reader expects a pointer)
  reader->SetContentHandler(&handler);

  // create input source
  // NOTE: 'input' is an InputStream object which points to an XCONCUR file
  InputSource source(input);

  // parse document
  reader->Parse(&source);
} catch (XConcurException &e) {
  // handle exception
}

The C++ reference implementation of the XCONCUR SAX-API contains a program called xconcurlint. It uses the API to read an XCONCUR document and prints the events emitted by the parser. Figure 4 shows a transcript of parsing the XCONCUR document from Figure 1. The event types are printed in curly brackets. Other event-specific information, such as the annotation layer prefix or the element name, is also printed.

Figure 4: Output created by the xconcurlint utility

{START LAYER} l1
{START ELEMENT} l1, div
                 type = dialog
                 org = uniform
{START LAYER} l2
{START ELEMENT} l2, text
{START PRIMARY DATA}
{CHARACTERS} "\n      "
{START ELEMENT} l1, u
                 who = Peter
{CHARACTERS} "\n    "
{START ELEMENT} l2, s
{CHARACTERS} "Hey Paul!"
{END ELEMENT} l2, s
{CHARACTERS} "\n      "
{START ELEMENT} l2, s
{CHARACTERS} "Would you give me\n    "
{END ELEMENT} l1, u
{CHARACTERS} "\n    "
{START ELEMENT} l1, u
                 who = Paul
{CHARACTERS} "\n      "
{CHARACTERS} "the hammer?"
{END ELEMENT} l2, s
{CHARACTERS} "\n    "
{END ELEMENT} l1, u
{CHARACTERS} "\n  "
{END PRIMARY DATA}
{END ELEMENT} l2, text
{END LAYER} l2
{END ELEMENT} l1, div
{END LAYER} l1
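
The transcript shows that the text of the second s element on layer l2 arrives in several characters events, interleaved with element events of layer l1. The following sketch shows one way an application could reassemble such text. TextCollector and Feed are hypothetical names, and the class is a stand-in that does not derive from the real ContentHandler. It also assumes that, as in SAX, offset and len delimit the slice of chars that belongs to the event.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a ContentHandler implementation that collects the
// text of every element with a given name on a given annotation layer.
class TextCollector {
public:
  TextCollector(const std::string& layer, const std::string& qname)
      : layer_(layer), qname_(qname), depth_(0) {}

  void StartElement(const std::string& layer, const std::string& qname) {
    if (layer != layer_ || qname != qname_) return;  // other layer or element
    if (depth_ == 0) buffer_.clear();                // outermost match starts
    ++depth_;
  }

  // Assumption: as in SAX, 'offset' and 'len' delimit the slice of 'chars'
  // that belongs to this event. Consecutive events are simply concatenated.
  void Characters(const char* chars, std::size_t offset, std::size_t len) {
    if (depth_ > 0) buffer_.append(chars + offset, len);
  }

  void EndElement(const std::string& layer, const std::string& qname) {
    if (layer != layer_ || qname != qname_ || depth_ == 0) return;
    if (--depth_ == 0) texts_.push_back(buffer_);    // outermost match ends
  }

  const std::vector<std::string>& Texts() const { return texts_; }

private:
  std::string layer_;
  std::string qname_;
  int depth_;
  std::string buffer_;
  std::vector<std::string> texts_;
};

// Convenience for replaying a transcript: sends 'text' as one characters event.
void Feed(TextCollector& collector, const std::string& text) {
  collector.Characters(text.data(), 0, text.size());
}
```

Element events of other layers are ignored, so text that spans two u elements on layer l1 still ends up in a single entry for the enclosing s element on layer l2, including the whitespace between the tags.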

Conclusion

The XCONCUR SAX-API provides a very low-level, yet powerful, interface for processing XCONCUR documents. It is relatively simple and easy to work with. Programmers who are familiar with XML's SAX-API should quickly feel at ease with the XCONCUR API. The API makes very few assumptions about the underlying parser and provides a uniform interface for using parser implementations from different vendors. Furthermore, the API can easily be ported to different programming languages. A C++ and a Java reference implementation are available[2]. For the Java language bindings, the API is implemented in plain Java, while the parsing itself is delegated to the C++ implementation of the parser.

Future work involves creating an object-based API similar to XML's DOM-API. Conceptual work on this is currently underway, and the XCONCUR-DOM parser will be built upon the XCONCUR-SAX parser. Furthermore, the Mascarpone XCONCUR editor needs to be overhauled to use the new APIs.

Appendix A. API interfaces

This appendix lists the most fundamental interfaces of the XCONCUR SAX-API. The full API contains a few more interfaces and classes.

Figure 5: XConcurReader interface

class XConcurReader {
public:
  virtual ContentHandler* GetContentHandler() const = 0;

  virtual void SetContentHandler(ContentHandler *handler) = 0;

  virtual ErrorHandler* GetErrorHandler() const = 0;

  virtual void SetErrorHandler(ErrorHandler *handler) = 0;

  virtual void Parse(InputSource *source) = 0;

  virtual void SetFeature(const char* const name, const bool value) = 0;

  virtual bool GetFeature(const char* const name) = 0;

  virtual ~XConcurReader();
}; // class XConcurReader

Figure 6: ContentHandler interface

class ContentHandler {
public:
  virtual ~ContentHandler();

  virtual void StartDocument() = 0;

  virtual void EndDocument() = 0;

  virtual void StartLayer(const char* const prefix) = 0;

  virtual void EndLayer(const char* const prefix) = 0;

  virtual void StartPrimaryData() = 0;

  virtual void EndPrimaryData() = 0;

  virtual void StartPrefixMapping(const char* const layer,
                                  const char* const prefix,
                                  const char* const uri) = 0;

  virtual void EndPrefixMapping(const char* const layer,
                                const char* const prefix) = 0;

  virtual void Characters(const char* const chars,
                          const size_t offset,
                          const size_t len) = 0;

  virtual void StartElement(const char* const layer,
                            const char* const uri,
                            const char* const localname,
                            const char* const qname,
                            const Attributes &attrs) = 0;

  virtual void EndElement(const char* const layer,
                          const char* const uri,
                          const char* const localname,
                          const char* const qname) = 0;
}; // interface ContentHandler
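
Since every method of the interface is pure virtual, a handler has to implement all eleven of them, even if it is interested in only one event. A convenience base class with empty method bodies, similar in spirit to the DefaultHandler class of Java SAX, can take care of this. The following is a minimal sketch: DefaultHandler and LayerElementCounter are hypothetical and not part of the API as described here, the interface is repeated in condensed form so that the sketch compiles on its own, and Attributes is reduced to an empty placeholder.

```cpp
#include <cstddef>
#include <string>

class Attributes {};  // empty placeholder for the interface of Figure 7

// Condensed copy of Figure 6 so that the sketch compiles on its own. The
// destructor gets an inline body, and top-level const on parameters is
// omitted, which does not change the method signatures.
class ContentHandler {
public:
  virtual ~ContentHandler() {}
  virtual void StartDocument() = 0;
  virtual void EndDocument() = 0;
  virtual void StartLayer(const char* prefix) = 0;
  virtual void EndLayer(const char* prefix) = 0;
  virtual void StartPrimaryData() = 0;
  virtual void EndPrimaryData() = 0;
  virtual void StartPrefixMapping(const char* layer, const char* prefix,
                                  const char* uri) = 0;
  virtual void EndPrefixMapping(const char* layer, const char* prefix) = 0;
  virtual void Characters(const char* chars, std::size_t offset,
                          std::size_t len) = 0;
  virtual void StartElement(const char* layer, const char* uri,
                            const char* localname, const char* qname,
                            const Attributes& attrs) = 0;
  virtual void EndElement(const char* layer, const char* uri,
                          const char* localname, const char* qname) = 0;
};

// Hypothetical adapter, not part of the API as described: ignores every event.
class DefaultHandler : public ContentHandler {
public:
  virtual void StartDocument() {}
  virtual void EndDocument() {}
  virtual void StartLayer(const char*) {}
  virtual void EndLayer(const char*) {}
  virtual void StartPrimaryData() {}
  virtual void EndPrimaryData() {}
  virtual void StartPrefixMapping(const char*, const char*, const char*) {}
  virtual void EndPrefixMapping(const char*, const char*) {}
  virtual void Characters(const char*, std::size_t, std::size_t) {}
  virtual void StartElement(const char*, const char*, const char*,
                            const char*, const Attributes&) {}
  virtual void EndElement(const char*, const char*, const char*,
                          const char*) {}
};

// Example subclass: counts the start tags seen on one annotation layer.
class LayerElementCounter : public DefaultHandler {
public:
  explicit LayerElementCounter(const std::string& layer)
      : layer_(layer), count_(0) {}

  virtual void StartElement(const char* layer, const char*, const char*,
                            const char*, const Attributes&) {
    if (layer_ == layer) ++count_;
  }

  int Count() const { return count_; }

private:
  std::string layer_;
  int count_;
};
```

A handler derived from such an adapter overrides only the events it needs. The price is that a misspelled method name silently creates a new method instead of overriding an event, which is an error the pure virtual interface would not catch either.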

Figure 7: Attributes interface

class Attributes {
public:

  virtual int GetLength() const = 0;

  virtual int GetIndex(const char* const qname) const = 0;

  virtual int GetIndex(const char* const uri,
                       const char* const localname) const = 0;

  virtual const char* const GetQName(const int idx) const = 0;

  virtual const char* const GetURI(const int idx) const = 0;

  virtual const char* const GetLocalName(const int idx) const = 0;

  virtual const char* const GetType(const char* const qname) const = 0;

  virtual const char* const GetType(const char* const uri,
                                    const char* const localname) const = 0;

  virtual const char* const GetType(const int idx) const = 0;

  virtual const char* const GetValue(const char* const qname) const = 0;

  virtual const char* const GetValue(const char* const uri,
                                     const char* const localname) const = 0;

  virtual const char* const GetValue(const int idx) const = 0;

  virtual bool IsDeclared(const char* const qname) const = 0;

  virtual bool IsDeclared(const char* const uri,
                          const char* const localname) const = 0;

  virtual bool IsDeclared(const int idx) const = 0;

  virtual bool IsSpecified(const char* const qname) const = 0;

  virtual bool IsSpecified(const char* const uri,
                           const char* const localname) const = 0;

  virtual bool IsSpecified(const int idx) const = 0;

protected:
  virtual ~Attributes();
}; // interface Attributes
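
Index-based access is the usual way to visit all attributes of an element. The sketch below renders one line per attribute, in the style of the xconcurlint transcript in Figure 4. It is not taken from xconcurlint: the Attributes class here is a reduced stand-in that declares only the three methods used (with a public destructor), and VectorAttributes and FormatAttributes are hypothetical names introduced for the example.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Reduced stand-in for the Attributes interface of Figure 7: only the three
// methods used below are declared.
class Attributes {
public:
  virtual ~Attributes() {}
  virtual int GetLength() const = 0;
  virtual const char* GetQName(int idx) const = 0;
  virtual const char* GetValue(int idx) const = 0;
};

// Hypothetical in-memory implementation, used only to drive the example.
class VectorAttributes : public Attributes {
public:
  void Add(const std::string& qname, const std::string& value) {
    items_.push_back(std::make_pair(qname, value));
  }
  virtual int GetLength() const { return static_cast<int>(items_.size()); }
  virtual const char* GetQName(int idx) const {
    return items_[static_cast<std::size_t>(idx)].first.c_str();
  }
  virtual const char* GetValue(int idx) const {
    return items_[static_cast<std::size_t>(idx)].second.c_str();
  }

private:
  std::vector<std::pair<std::string, std::string> > items_;
};

// Renders one "qname = value" line per attribute, in index order.
std::string FormatAttributes(const Attributes& attrs) {
  std::string out;
  for (int i = 0; i < attrs.GetLength(); ++i) {
    out += attrs.GetQName(i);
    out += " = ";
    out += attrs.GetValue(i);
    out += '\n';
  }
  return out;
}
```

FormatAttributes depends only on the abstract interface, so the same loop would work with the attribute list a parser passes to a start element event, provided that list offers these three methods.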

References

[Megginson et al. (2002)] David Megginson, “Simple API for XML processing”. Available online at http://www.saxproject.org/quickstart.html

[Le Hors et al. (2004)] Arnaud Le Hors, Philippe Le Hégaret, Lauren Wood, Gavin Nicol, Jonathan Robie, Mike Champion, Steve Byrne: “Document Object Model (DOM) Level 3 Core Specification”. World Wide Web Consortium, 2004. Available online at http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/

[Schonefeld (2007)] Oliver Schonefeld: “XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup”. In: Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures for linguistic resources and applications: Proceedings of the Biennial GLDV Conference 2007, Georg Rehm, Andreas Witt, Lothar Lemnitzer (eds), Tübingen Verlag, Germany, 2007. Pp. 347–356.

[Witt et al. (2007)] Andreas Witt, Oliver Schonefeld, Georg Rehm, Jonathan Khoo, Kilian Evang: “On the Lossless Transformation of Single-File, Multi-Layer Annotations into Multi-Rooted Trees”. In: Proceedings of Extreme Markup Languages 2007, Montréal, Canada, 2007. Available online at http://www.idealliance.org/papers/extreme/proceedings/html/2007/Witt01/EML2007Witt01.xml

[Bray et al. (2006)] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Francois Yergeau, John Cowan: “Extensible Markup Language (XML) 1.1”. World Wide Web Consortium, 2006, 2nd edition. Available online at http://www.w3.org/TR/2006/REC-xml11-20060816/



[1] The parser implementation is not part of the API. Different vendors could supply their own implementation. The reference implementation of the XCONCUR SAX-API currently provides a non-validating parser.

[2] The author provides the software for evaluation and academic purposes upon request.

[3] All online resources were last checked on 2008/08/31.

Author's keywords for this paper:
processing XCONCUR

Oliver Schonefeld

University of Tübingen

Oliver Schonefeld works at the University of Tübingen's collaborative research centre Linguistic Data Structures in a project that develops the foundations for sustainable linguistic resources. He studied computer science at the University of Bielefeld until 2005. This contribution deals with aspects of his forthcoming PhD thesis.