Schonefeld, Oliver. “An event-centric API for processing concurrent markup.” Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). https://doi.org/10.4242/BalisageVol1.Schonefeld01.
Balisage: The Markup Conference 2008 August 12 - 15, 2008
Balisage Paper: A Simple API for XCONCUR
Processing concurrent markup using an event-centric API
Oliver Schonefeld works at the University of Tübingen's
collaborative research centre Linguistic Data Structures, in
a project that develops the foundations for sustainable
linguistic resources. He studied computer science at the
University of Bielefeld until 2005. This contribution
deals with aspects of his forthcoming PhD thesis.
Programmers can choose between two types of API
when working with XML documents. One provides an event-centric
view of the document (SAX), while the other offers an object-centric
view (DOM). This contribution introduces an event-centric
programming interface for working with XCONCUR documents, which is
inspired by XML's SAX-API. It provides an easy-to-use
API for parsing XCONCUR documents.
To process XML documents using a programming language, one can
choose between two application programming
interfaces (APIs). The Simple API for XML
processing (SAX) is an event-centric interface, while the
Document Object Model (DOM) provides
a sophisticated object structure for working with XML
documents.
This contribution introduces an event-centric API for working
with XCONCUR documents, which is inspired by XML's SAX-API.
Section 2 gives a brief overview of
the XCONCUR document syntax, section 3 describes the event-centric
XCONCUR API, and section 4 contains an
outlook on further work.
XCONCUR
XCONCUR is an extension to XML whose major goal is to
provide a convenient method for expressing concurrent
hierarchies. An XCONCUR document may contain an arbitrary number
of annotation layers. Each layer can be transformed into a
well-formed XML document by a simple filtering
process. An XCONCUR document can therefore be seen as a set of
interwoven XML documents. Figure 1
shows an XCONCUR example document with two annotation
layers. Each tag is prefixed by an annotation layer id and thus
assigned to a layer. The XCONCUR schema declarations
make it possible to assign an annotation schema to each layer. The
annotation schema may be written in any of the current XML
schema languages, e.g. DTD, XML Schema or RELAX NG. If an
annotation schema has been assigned to an annotation layer, the
layer is validated using this schema. While the use of
annotation schemas is optional, an XCONCUR document is required
to be well-formed: each XCONCUR document can be decomposed into a
set of XML documents by selecting one layer and removing the
tags of the other annotation layers along with the annotation layer
prefixes. The resulting XML documents are required to be
well-formed. Additionally, an XCONCUR constraint declaration can
optionally be used to associate an XCONCUR-CL constraint set with
the document, which allows cross-tree validation. For details
see Schonefeld (2007) and Witt et al. (2007).
An event-centric application programming interface
The event-centric API for processing XCONCUR documents is
heavily inspired by XML's SAX API (see Megginson et al. (2002)). It provides a very low-level approach for
working with XCONCUR documents. While processing a document, the
parser emits a series of events. An application may receive
those events and perform custom actions, e.g. build an in-memory
representation of the document. Since the application ultimately
decides which events to accept and how to handle them, the
parser only has to build up a very minimal in-memory
representation to perform its work. This streaming approach is
therefore quite memory-efficient.
The API defines a number of start events, which
signal the beginning of a construct in the parsed document (e.g. a
start tag), and their corresponding end events. The character data
event is an exception: it is a single event
without a start or end counterpart. The following list
contains the events defined by the API. All events
marked with an asterisk are unique to the XCONCUR API; all others
correspond to SAX events, adapted where necessary to cope with
more than one annotation layer.
Start Document ()
The beginning of the document has been detected. This event
is sent after the XCONCUR declaration has been read.
End Document ()
The end of the document has been detected. This
event is sent when the document has been processed completely.
Start Layer (layer)*
A new annotation layer has been detected. This event is
sent either when an XCONCUR layer declaration has been
processed or when the root tag of a new annotation layer has
been found. The annotation layer prefix is
provided.
End Layer (layer)*
The end of an annotation layer has been detected. This
event is sent after the matching end tag for the
annotation layer's root element has been processed. The
annotation layer prefix is provided.
Start Primary Data ()*
This event signals the beginning of the character data of
the document. It is sent after the root elements of all
annotation layers in the document have been processed.
End Primary Data ()*
This event signals the end of the actual character data
of the document. It is sent right before the first end
tag of a root element of any annotation layer is
processed.
Start Prefix Mapping (layer, prefix, uri)
This event signals the beginning of the scope of a
namespace prefix mapping on a layer. It is sent
just before the start element event of the element that declares
the prefix mapping is emitted. The event carries
the annotation layer, the namespace
prefix and the namespace URI. If an element
defines more than one prefix mapping, the start prefix
mapping events may occur in any order.
End Prefix Mapping (layer, prefix, uri)
This event signals the end of the scope of a
namespace prefix mapping on a layer. It is sent just after
the end element event for the element that declared the
mapping has been emitted. The event carries
the annotation layer, the namespace prefix and the
namespace URI. If an element defined more than
one prefix mapping, the end prefix mapping events may
occur in any order.
Characters (characters)
This event signals character data. More than one
character data event may be emitted for one chunk of
character data in the document.
Start Element (layer, uri, localname, qname, attributes)
A start tag has been detected. The event carries the
annotation layer prefix, the namespace URI, the local
name and the qualified name of the tag. Furthermore, a list of
attributes is available. This list is empty if the
element has no attributes; otherwise it contains the namespace URI,
local name, qualified name and value of each attribute.
End Element (layer, uri, localname, qname)
An end tag has been detected. The event carries the
annotation layer prefix, the namespace URI, the local
name and the qualified name of the tag.
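The event list above maps naturally onto a single callback interface. The following Java sketch shows one possible rendering. Only the interface name ContentHandler is taken from this paper; the method names, parameter types, and the Attribute and RecordingHandler classes are assumptions made for illustration and are not copied from the reference implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical attribute value object: namespace URI, local name, qualified name, value. */
final class Attribute {
    final String uri, localName, qName, value;
    Attribute(String uri, String localName, String qName, String value) {
        this.uri = uri; this.localName = localName; this.qName = qName; this.value = value;
    }
}

/** One callback per event in the list above; all signatures are assumptions. */
interface ContentHandler {
    void startDocument();
    void endDocument();
    void startLayer(String layer);                 // XCONCUR only
    void endLayer(String layer);                   // XCONCUR only
    void startPrimaryData();                       // XCONCUR only
    void endPrimaryData();                         // XCONCUR only
    void startPrefixMapping(String layer, String prefix, String uri);
    void endPrefixMapping(String layer, String prefix, String uri);
    void characters(String characters);            // carries no layer id
    void startElement(String layer, String uri, String localName,
                      String qName, List<Attribute> attributes);
    void endElement(String layer, String uri, String localName, String qName);
}

/** Records every event as a short string, which makes the event order easy to inspect. */
class RecordingHandler implements ContentHandler {
    final List<String> events = new ArrayList<>();

    public void startDocument() { events.add("startDocument"); }
    public void endDocument() { events.add("endDocument"); }
    public void startLayer(String layer) { events.add("startLayer " + layer); }
    public void endLayer(String layer) { events.add("endLayer " + layer); }
    public void startPrimaryData() { events.add("startPrimaryData"); }
    public void endPrimaryData() { events.add("endPrimaryData"); }
    public void startPrefixMapping(String layer, String prefix, String uri) {
        events.add("startPrefixMapping " + layer + " " + prefix + "=" + uri);
    }
    public void endPrefixMapping(String layer, String prefix, String uri) {
        events.add("endPrefixMapping " + layer + " " + prefix + "=" + uri);
    }
    public void characters(String characters) { events.add("characters " + characters); }
    public void startElement(String layer, String uri, String localName,
                             String qName, List<Attribute> attributes) {
        events.add("startElement " + layer + " " + qName + " attrs=" + attributes.size());
    }
    public void endElement(String layer, String uri, String localName, String qName) {
        events.add("endElement " + layer + " " + qName);
    }
}
```

A parser would call these methods in document order; an application that only cares about a few events can leave the remaining method bodies empty.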
The major difference from XML's SAX-API is that the element and
prefix mapping events have been modified to also carry the
annotation layer id, so an application can take this
information into account. Furthermore, the start/end layer and
start/end primary data events have been added. The start/end layer
events provide an easy mechanism for the application to determine
which annotation layers exist in an XCONCUR document and to perform
actions such as allocating memory for each layer. Strictly speaking,
one could derive this information from other events
(e.g. by checking whether a start element event that has just been
received carries a yet unknown annotation layer id), but by providing the
start/end layer events, the API makes the application easier to write,
since the programmer can rely upon these events. The same holds for
the start/end primary data events. They signal the start and end
of the actual character data of a document.
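As an illustration of how an application might rely on the start/end layer events, the following Java sketch allocates one list per layer when the layer is announced and fills it from start element events. The class LayerIndex and its method signatures are hypothetical and reduced to the callbacks needed here; they are not part of the XCONCUR SAX-API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Collects the element names seen on each annotation layer (illustrative only). */
class LayerIndex {
    // One list per layer, allocated when the layer is announced.
    private final Map<String, List<String>> elementsByLayer = new LinkedHashMap<>();

    /** Start Layer event: allocate per-layer storage up front. */
    void startLayer(String layer) {
        elementsByLayer.put(layer, new ArrayList<>());
    }

    /** Start Element event: the layer id selects the storage to use. */
    void startElement(String layer, String qName) {
        List<String> names = elementsByLayer.get(layer);
        if (names == null) {
            // Should not happen if Start Layer always precedes the layer's root element.
            throw new IllegalStateException("element on unannounced layer: " + layer);
        }
        names.add(qName);
    }

    /** End Layer event: the layer is complete, so its list is made read-only. */
    void endLayer(String layer) {
        elementsByLayer.computeIfPresent(layer, (k, v) -> Collections.unmodifiableList(v));
    }

    Map<String, List<String>> elementsByLayer() {
        return elementsByLayer;
    }
}
```

Without the start layer event, startElement would have to create the list lazily on the first unknown layer id, which is the workaround described in the paragraph above.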
The XCONCUR SAX-API provides various classes and interfaces. The
most important of these are the
XConcurReader and ContentHandler
classes. The XConcurReader class encapsulates the
underlying parser[1]. The ContentHandler defines
an interface, which needs to be implemented by the user's program and
acts as the message sink for the events generated by the
parser. The API also contains various auxiliary classes,
e.g. abstract input sources for reading XCONCUR documents
and error reporting classes.
Figure 2 shows an excerpt of a class
implementing the ContentHandler interface. Given this
class, a typical sequence for parsing an XCONCUR document is shown
in Figure 3.
The C++ reference implementation of the XCONCUR SAX-API contains a
program called xconcurlint. It uses the API to read
an XCONCUR document and prints the events emitted by
the parser. Figure 4 shows a
transcript of the parse of the XCONCUR document from figure 1. The event types are printed in curly
brackets. Other event-specific information, such as the annotation layer
prefix or the element name, is also printed.
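The idea behind such an event printer can be sketched in a few lines of Java. The example below is not xconcurlint and does not reproduce its output format; the class names, the reduced method signatures and the exact line layout are assumptions. Because no parser is included, the events of a tiny two-layer document are fed to the printer by hand, in the order described in the event list.

```java
/** Builds a transcript with one line per event: "{event type} details" (assumed layout). */
class EventPrinter {
    private final StringBuilder out = new StringBuilder();

    private void emit(String type, String... details) {
        out.append('{').append(type).append('}');
        for (String detail : details) {
            out.append(' ').append(detail);
        }
        out.append('\n');
    }

    void startDocument() { emit("start document"); }
    void endDocument() { emit("end document"); }
    void startLayer(String layer) { emit("start layer", layer); }
    void endLayer(String layer) { emit("end layer", layer); }
    void startPrimaryData() { emit("start primary data"); }
    void endPrimaryData() { emit("end primary data"); }
    void startElement(String layer, String qName) { emit("start element", layer, qName); }
    void endElement(String layer, String qName) { emit("end element", layer, qName); }
    void characters(String text) { emit("characters", "\"" + text + "\""); }

    String transcript() { return out.toString(); }
}

/** Drives the printer by hand with the events of a tiny two-layer document. */
class EventPrinterDemo {
    static String run() {
        EventPrinter p = new EventPrinter();
        p.startDocument();
        p.startLayer("l1");               // root tag of layer l1 found
        p.startElement("l1", "text");
        p.startLayer("l2");               // root tag of layer l2 found
        p.startElement("l2", "div");
        p.startPrimaryData();             // all layer roots are open
        p.characters("Hello");
        p.endPrimaryData();               // first root end tag is about to be processed
        p.endElement("l2", "div");
        p.endLayer("l2");
        p.endElement("l1", "text");
        p.endLayer("l1");
        p.endDocument();
        return p.transcript();
    }
}
```

Calling EventPrinterDemo.run() returns the transcript as a string, so it can be printed or compared against an expected event sequence in a test.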
Conclusion
The XCONCUR SAX-API provides a very low-level, yet powerful,
interface for processing XCONCUR documents. It is a relatively
simple interface that is easy to work with.
Programmers who are familiar with XML's SAX-API
should quickly feel at ease with the XCONCUR API. The API
makes very few assumptions about the underlying parser and
provides a uniform interface for using parser implementations
from different vendors. Furthermore, the API can easily be
ported to different programming languages. C++ and Java
reference implementations are available[2]. For the Java language bindings, the
API is implemented in plain Java, while the parser itself is
provided by the C++ implementation.
Future work involves creating an object-based API similar to
XML's DOM-API. Conceptual work on this is currently underway,
and the XCONCUR-DOM parser will be built upon the XCONCUR-SAX
parser. Furthermore, the Mascarpone XCONCUR editor needs to be
overhauled to use the new APIs.
Appendix A. API interfaces
This appendix lists the most fundamental interfaces of the
XCONCUR SAX-API. The full API contains a few more interfaces and
classes.
[Le Hors et al. (2004)]
Arnaud Le Hors, Philippe Le Hégaret, Lauren Wood, Gavin Nicol,
Jonathan Robie, Mike Champion, Steve Byrne: “Document Object Model (DOM) Level 3 Core
Specification”. World Wide Web Consortium,
2004. Available online at
http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/
[Schonefeld (2007)]
Oliver Schonefeld: “XCONCUR and
XCONCUR-CL: A constraint-based approach for the validation of
concurrent markup”. In: Datenstrukturen für
linguistische Ressourcen und ihre Anwendungen / Data structures
for linguistic resources and applications: Proceedings of the
Biennial GLDV Conference 2007, Georg Rehm, Andreas Witt, Lothar
Lemnitzer (eds), Tübingen Verlag, Germany, 2007. Pp. 347–356.
[Bray et al. (2006)]
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler,
Francois Yergeau, John Cowan: “Extensible Markup Language (XML)
1.1”. World Wide Web Consortium, 2006, 2nd
edition. Available online at http://www.w3.org/TR/2006/REC-xml11-20060816/
[1] The parser
implementation is not part of the API. Different vendors could
supply their own implementations. The reference implementation of
the XCONCUR SAX-API currently provides a non-validating
parser.
[2] The author provides the software for
evaluation and academic purposes upon
request.
[3] All online resources were last
checked on 2008/08/31.
[Witt et al. (2007)]
Andreas Witt, Oliver Schonefeld, Georg Rehm, Jonathan Khoo,
Kilian Evang: “On the Lossless
Transformation of Single-File, Multi-Layer Annotations into
Multi-Rooted Trees”. In: Proceedings of Extreme
Markup Languages 2007, Montréal, Canada, 2007. Available online
at
http://www.idealliance.org/papers/extreme/proceedings/html/2007/Witt01/EML2007Witt01.xml