Multiple annotated documents
Markup languages are often defined for structuring the information of a specific text type, such as web pages (HTML) or technical articles and books (DocBook), or of a set of information items, such as vector graphics (SVG) or protocol information (SOAP). Their structure is therefore determined (within limits) by a document grammar that allows for specific elements and attributes. In addition, the different XML-based document grammar formalisms allow, to a certain degree, the combination of elements (and attributes) from different markup languages, usually by means of XML namespaces (Bray et al., 2009). In practice, a host language can include islands of foreign markup (guest languages). Apart from the already mentioned SOAP, there are various examples of such combinations of host and guest markup languages: an XHTML driver (Ishikawa, 2002) allows for the combination of XHTML (as a host language) with MathML and SVG (as guest languages), and the Atom Syndication Format (Nottingham and Sayre, 2005) can be used in conjunction with a wide range of extensions (e.g. for Threading, see Snell, 2006, or Activity Streams, see Atkins et al., 2011), while it is also meant to be embedded in parts in the RSS format (Winer, 2009).
Although XML namespaces support the combination of elements derived from different markup languages, they do not change XML's formal model, which prohibits overlapping markup. However, standoff markup (instead of inline annotation) may be used to circumvent this problem. The meta markup language XStandoff (Stührenberg and Jettka, 2009) embeds (slightly transformed) islands of guest languages (with their respective XML namespaces) and combines them with a standardized standoff approach as a key feature for the storage of multiple (and possibly overlapping) hierarchies.
Typical problems when dealing with multiple and/or standoff annotations are related to the production and processing of instances. Although each markup language involved is usually defined by a document grammar of its own, it can be cumbersome to validate an instance combining elements from a large variety of document grammars (XStandoff is capable of validating these instances, but adapted XML schema files have to be present for each guest language). This behaviour can be controlled by means of the document grammar formalism. For example, XML Schema allows different values of the processContents attribute, which may occur on the any element. The value lax provided in Figure 1 (taken from XStandoff's layer element) instructs the processor to validate those elements for which it can obtain schema information and to raise no error for the others (see Fallside and Walmsley, 2004). In addition, the namespace attribute may be used to control the allowed namespaces. While XSD 1.0 allows only the values ##any, ##other or a list of namespaces (including the special values ##targetNamespace and ##local, see Thompson et al., 2004), RELAX NG supports the exclusion of namespaces (by using the except pattern in combination with nsName). XSD 1.1 (Gao et al., 2012) introduced the notNamespace and notQName attributes.
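The namespace constraints of XSD 1.0 wildcards can be made concrete with a small sketch. The following Python function is only an illustration under our own assumptions: the name wildcard_allows, its parameters and the host namespace http://example.org/host are hypothetical, and None stands for "no namespace". It is not taken from any schema processor.

```python
def wildcard_allows(constraint, target_ns, element_ns):
    """Return True if an XSD 1.0 wildcard admits an element in element_ns.

    constraint -- value of the wildcard's namespace attribute
    target_ns  -- target namespace of the enclosing schema (None if absent)
    element_ns -- namespace of the candidate element (None if unqualified)
    """
    if constraint == "##any":
        return True
    if constraint == "##other":
        # XSD 1.0: the element must be namespace-qualified and must not
        # belong to the target namespace.
        return element_ns is not None and element_ns != target_ns
    # Otherwise the value is a whitespace-separated list of namespace
    # names and the special tokens ##targetNamespace and ##local.
    allowed = set()
    for token in constraint.split():
        if token == "##targetNamespace":
            allowed.add(target_ns)
        elif token == "##local":
            allowed.add(None)
        else:
            allowed.add(token)
    return element_ns in allowed


HOST = "http://example.org/host"              # hypothetical host namespace
MORPH = "http://www.xstandoff.net/morphemes"  # guest namespace used in this article

print(wildcard_allows("##other", HOST, MORPH))              # guest element
print(wildcard_allows("##other", HOST, HOST))               # host element
print(wildcard_allows("##targetNamespace ##local", HOST, None))
```

The sketch covers only the XSD 1.0 values; the exclusions offered by RELAX NG and by the XSD 1.1 attributes would need an additional negative list.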
The production of multiple annotated documents is typically the result of the combination of formerly stand-alone documents (or their parts), such as the inclusion of externally created SVG graphics in an XHTML host document, or the outcome of a mostly automated process (see Stührenberg and Jettka, 2009 for a discussion on the production of XStandoff instances). What is still lacking is an API (Application Programming Interface) that is flexible enough to support the production and processing of multiple annotated instances, even if annotations refer to the same primary data by means of standoff annotation. We will demonstrate such an API in the remainder of this article.
Creating an extensible API
XML::Loy (Diewald, 2011) is a Perl library that provides a simple programming interface for the creation of XML documents with multiple namespaces. It is based on Mojo::DOM, an HTML/XML DOM parser that is part of the Mojolicious framework (Riedel, 2008).
Mojo::DOM provides CSS selector based methods for DOM traversal (van Kesteren and Hunt, 2013), similar to JavaScript's querySelector() and querySelectorAll() methods.
The basic methods for the manipulation of the XML Document Object Model provided by XML::Loy are add() and set(). By applying these methods, new nodes can be introduced as children of any node in the document. While add() always appends additional nodes to the document, set() only appends a node if no child of the given type exists; otherwise the existing child is overwritten. Both methods are invoked on a chosen node in the document tree (which acts as the parent of the newly introduced node). They accept the element name as a string parameter, followed by an optional hash reference containing attributes and an optional string containing the textual content of the element. A final string can be used to put a comment in front of the element.
In the example presented in Figure 2, a new XML::Loy document instance is created with a root element document. Applying the set() method, a new title element is introduced as a child of the root element. The second call of set() overwrites the content of the title element. By using the add() method we insert multiple paragraph elements without overwriting existing ones. These elements are defined with both an id attribute and textual content. By applying the to_pretty_xml() method, the result can be printed as XML.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<document>
  <title>My New Title</title>
  <paragraph id="p-1">First Paragraph</paragraph>
  <paragraph id="p-2">Second Paragraph</paragraph>
</document>
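The difference between the two methods can also be sketched independently of Perl. The following Python fragment uses only the standard library's xml.etree.ElementTree and mimics the described semantics; the helper names add and set_child and the first title value "My Title" are our own assumptions for illustration, and the fragment is not the XML::Loy implementation (whose behaviour may differ in details such as the handling of child nodes).

```python
import xml.etree.ElementTree as ET


def add(parent, tag, attrib=None, text=None):
    """Always append a new child element (the described add() behaviour)."""
    child = ET.SubElement(parent, tag, attrib or {})
    child.text = text
    return child


def set_child(parent, tag, attrib=None, text=None):
    """Append a child only if none of that name exists; otherwise
    overwrite the attributes and text of the existing one."""
    existing = parent.find(tag)
    if existing is None:
        return add(parent, tag, attrib, text)
    existing.attrib.clear()
    existing.attrib.update(attrib or {})
    existing.text = text
    return existing


doc = ET.Element("document")
set_child(doc, "title", text="My Title")      # hypothetical first value
set_child(doc, "title", text="My New Title")  # overwrites, does not duplicate
add(doc, "paragraph", {"id": "p-1"}, "First Paragraph")
add(doc, "paragraph", {"id": "p-2"}, "Second Paragraph")

ET.indent(doc)  # pretty-printing helper, requires Python 3.9 or later
print(ET.tostring(doc, encoding="unicode"))
```

Calling set_child() twice leaves a single title element, whereas each call of add() contributes a further paragraph.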
The strength of this simple approach for document manipulation is the ability to pass these methods to new extension modules that can represent APIs for specific XML namespaces, as both host and guest languages. The example given in Figure 3 is meant to illustrate these capabilities by creating a simple XML::Loy extension for morpheme annotations.
The class inherits all XML creation methods from XML::Loy and thus also all XML traversal methods from Mojo::DOM. When defining the base class, an optional namespace http://www.xstandoff.net/morphemes is bound to the morph prefix, which means that all invocations of set() and add() from this class will be bound to the morph namespace. The newly created morphemes() method appends a morphemes element bound to the given namespace as a child of the invoking node.
To implement simple grammar rules in the API, the methods can check the invoking context, for example by constraining the introduction of morpheme elements to morphemes parent nodes only (see the regular expression check /^(?:morph:)?morphemes$/).
This newly created API for the http://www.xstandoff.net/morphemes namespace can now be used to create new document instances (see Figure 4 and the output shown in Figure 5).
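The idea of a namespace-bound method that checks its invoking context can be approximated in a few lines of Python. This is a sketch under our own assumptions, using the standard library's xml.etree.ElementTree rather than XML::Loy: the function names mirror the methods described above, the root element document is hypothetical, and the check compares the namespace-qualified tag name instead of applying the regular expression.

```python
import xml.etree.ElementTree as ET

MORPH_NS = "http://www.xstandoff.net/morphemes"
ET.register_namespace("morph", MORPH_NS)  # serialize with the morph prefix


def morphemes(parent):
    """Append a morphemes container in the morph namespace to any node."""
    return ET.SubElement(parent, f"{{{MORPH_NS}}}morphemes")


def morpheme(parent, text, **attrib):
    """Append a morpheme element, but only below a morphemes parent."""
    if parent.tag != f"{{{MORPH_NS}}}morphemes":
        return None  # grammar rule violated: refuse to add the element
    child = ET.SubElement(parent, f"{{{MORPH_NS}}}morpheme", attrib)
    child.text = text
    return child


doc = ET.Element("document")      # hypothetical host element
container = morphemes(doc)
morpheme(container, "bright")
morpheme(container, "er")
rejected = morpheme(doc, "sun")   # wrong parent, so nothing is added

print(ET.tostring(doc, encoding="unicode"))
```

Returning None for a rejected call keeps the sketch short; an implementation could equally raise an error or emit a warning.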
By using the generic methods add() and set() provided by XML::Loy, the class can easily be used for extending an existing XML::Loy based class (i.e. as a guest language inside another host language). In the example shown in Figure 6, a simplified HTML instance is read and instantiated. Elements from the http://www.xstandoff.net/morphemes namespace are appended using the API described above (the output is shown in Figure 7).
By extending the XML::Loy base object with the newly created class using the extension() method[1], all method calls from the extension class are available for namespace-aware traversal and manipulation. In general, using such an extensible API provides at least some of the functionality usually made available by document grammars (the nesting of elements, for example) and adds methods to create and manipulate the respective class of instances.
XStandoff as an example application
XStandoff's predecessor SGF (Sekimo Generic Format) was developed in 2008 (see Stührenberg and Goecke, 2008) as a meta format for storing and analyzing multiple annotated instances as part of a linguistic corpus. In 2009 the format was generalized and enhanced. Since then, XStandoff has combined standoff notation with the formal model of General Ordered-Descendant Directed Acyclic Graphs (GODDAG, introduced in Sperberg-McQueen and Huitfeldt, 2004; see Sperberg-McQueen and Huitfeldt, 2008 for a more recent discussion). The format as such is capable of representing multiple hierarchies and specifically challenging structures such as overlaps, discontinuous elements and virtual elements. The basic structure of an XStandoff instance consists of the root element corpusData, underneath which the child elements meta (optional), resources (optional), primaryData (optional in the proposed release 2.0, see Stührenberg, 2013), segmentation and annotation are subsumed. Figure 8 shows an example XStandoff document.[2]
In this example, the sentence "The sun shines brighter." is annotated with two linguistic levels (and respective layers): morphemes and syllables. We cannot combine both annotation layers in an inline annotation, since there is an overlap between the two syllables "brigh" and "ter" and the two morphemes "bright" and "er" (see Figure 9 for a visualization of the overlap).
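The overlap can be verified by computing character spans for both segmentations of the word and testing each pair of spans. The following Python sketch assumes zero-based character offsets with an exclusive end position; the helper names spans and crosses are our own and serve illustration only.

```python
TEXT = "The sun shines brighter."


def spans(parts, offset):
    """Turn consecutive substrings into (start, end) character spans."""
    result = {}
    for part in parts:
        result[part] = (offset, offset + len(part))
        offset += len(part)
    return result


def crosses(a, b):
    """True if two spans intersect without one containing the other."""
    intersect = a[0] < b[1] and b[0] < a[1]
    nested = (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])
    return intersect and not nested


word_start = TEXT.index("brighter")
morpheme_spans = spans(["bright", "er"], word_start)
syllable_spans = spans(["brigh", "ter"], word_start)

# Pairs of spans that cannot be nested in a single XML hierarchy.
conflicts = [
    (m, s)
    for m, m_span in morpheme_spans.items()
    for s, s_span in syllable_spans.items()
    if crosses(m_span, s_span)
]
print(morpheme_spans, syllable_spans, conflicts)
```

Only the pair "bright" and "ter" crosses: "brigh" is contained in "bright" and "er" is contained in "ter", so these pairs could be nested, but the crossing pair rules out a single inline hierarchy.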
Each annotation is encapsulated underneath a layer element (which in turn is a child element of a level element, since it is possible to have more than one serialization, that is, layer, for a conceptual level).[3] The xsf:segment attribute is used to link the annotation with the respective part of the primary data. Similar to other standoff approaches, XStandoff uses character positions for defining segments over textual primary data. Changes to the input text therefore result in an out-of-sync situation between primary data and annotation. Processing XStandoff instances requires dealing with at least n+1 XML namespaces: one for XStandoff itself and one for each of the n annotation layers.
Up to now, these instances have been created by transforming inline annotations via a set of XSLT 2.0 stylesheets (see Stührenberg and Jettka, 2009 for a detailed discussion). In the following section[4] we will outline an example API for XStandoff based on XML::Loy that makes it easy to deal with the dynamic creation of multi-layered annotations.
Creating and processing XStandoff instances using XML::Loy
As presented in the previous section, XStandoff associates annotations with primary data by defining segment spans[5] to which the annotations are linked via XML ID/IDREF integrity features. There are multiple ways to cope with standoff annotation: compared to the XStandoff-Toolkit discussed in Stührenberg and Jettka, 2009, our API provides an additional way to access and manipulate both annotations and primary data directly.
In Figure 10, a new corpusData element is created. Next, a textualContent element is added (below an automatically introduced primaryData element with a unique xml:id). Seven manually defined segment elements are appended for selecting spans over the textual primary data aligned to the words and the sentence as a whole. Figure 11 shows the output.
The document creation is simple, as most elements such as corpusData, textualContent and segment have corresponding API methods for finding, appending, updating and removing elements of the document. Segments are appended by defining their scope.
The manipulation of the primary data is possible by applying the segment_content() method, which associates primary data with segment spans (see Figure 12).
The textual content virtually delimited by a segment can be retrieved, replaced and manipulated, while all other segments stay intact: if their offsets change, their start and end positions are recalculated accordingly. This addresses one of the key problems with standoff annotation: usually, if one alters the primary data without updating the corresponding segments, the association of annotations and corresponding primary data will break. Due to the dynamic access to primary data information provided by this API, working with standoff annotations can be nearly as flexible as working with inline annotations, without the limitations of inline formats, such as the inability to represent overlaps (see Figure 9).
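The offset recalculation can be illustrated with a minimal Python sketch. It assumes zero-based character offsets with an exclusive end position and handles only segments that lie before, after, or around the edited span; the function name replace_segment and the segment identifiers are hypothetical and do not reflect the internals of our API.

```python
def replace_segment(text, segments, seg_id, new_content):
    """Replace the text of one segment and recompute affected offsets.

    segments maps segment ids to (start, end) character offsets.
    Returns the new text and a new mapping; the input is not modified.
    """
    start, end = segments[seg_id]
    delta = len(new_content) - (end - start)
    new_text = text[:start] + new_content + text[end:]
    updated = {}
    for sid, (s, e) in segments.items():
        if e <= start:                 # entirely before the edit: unchanged
            updated[sid] = (s, e)
        elif s >= end:                 # entirely after the edit: shift both ends
            updated[sid] = (s + delta, e + delta)
        elif s <= start and e >= end:  # encloses (or is) the edited span: stretch
            updated[sid] = (s, e + delta)
        else:
            raise ValueError(
                f"segment {sid!r} lies inside or partially overlaps the edited span"
            )
    return new_text, updated


text = "The sun shines brighter."
segments = {
    "seg1": (0, 24),   # whole sentence
    "seg2": (0, 3),    # The
    "seg3": (4, 7),    # sun
    "seg4": (8, 14),   # shines
    "seg5": (15, 23),  # brighter
}
text, segments = replace_segment(text, segments, "seg3", "moon")
print(text)
print(segments)
```

Because "moon" is one character longer than "sun", the sentence segment grows by one and the segments for "shines" and "brighter" are shifted by one, while the segment for "The" keeps its offsets.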
The morpheme extension created in section “Creating an extensible API” can easily be adapted to represent an annotation layer whose segment spans overlap with those of an annotation of syllables (see Figure 13).
The resulting document is similar to Figure 8, but with the modified primary data "The moon shines brighter." and updated segment spans.
Another problem with some standoff formats is the association with decoupled primary data content. In XStandoff the primary data can be included in the XSF instance (as seen in the previous examples) or stored in a separate file and referenced via the primaryDataRef element (in the case of larger textual primary data, multimedia-based primary data or multiple primary data files). If this file is in local storage, the API will take care of updating the external textual content as well. Trying to modify files that are not modifiable (e.g. accessible online only) will result in a warning.
Since metadata in XStandoff can be either included inline or referenced in the same way, the handling of metadata in our API can be treated alike, with a slight difference if the metadata itself is a well-formed XML document. The example given in Figure 15 assumes a simple metadata document in RDF with a Dublin Core namespace at the location files/meta.xml in the local file system (shown in Figure 14).
The API enables the reference to the external document and supports the access by defining a new XML::Loy object with an extension for dealing with Dublin Core data.[6] As a result, the Dublin Core annotated title element can be accessed directly, although the data is not embedded in the document.
Conclusion and future work
We have demonstrated the XML::Loy API, which can be used as a framework for the development of extensible modules for given namespaces (and therefore markup languages). Modules created as extensions can then be used in a simple yet powerful way to create and process multiple annotated instances, even with standoff markup and referenced documents for primary data and metadata.
The current implementation of XML::Loy is written in pure Perl, with the focus on demonstrating the flexibility and extensibility of our approach rather than on creating a performance-optimized system. Since the whole API (including the extension modules and examples described in this paper) is available under a free license at http://github.com/Akron/XML-Loy-XStandoff, possible further steps include performance optimizations and the creation of an extension repository for popular standardized markup languages (such as OLAC, DocBook and TEI).
Acknowledgements
We would like to thank the anonymous reviewers of this paper for their helpful comments and ideas.
References
[Atkins et al., 2011] Martin Atkins, Will Norris, Chris Messina, Monica Wilkinson, and Rob Dolin (2011). Atom Activity Streams 1.0. http://activitystrea.ms/specs/atom/1.0/
[Bray et al., 2009] Tim Bray, Dave Hollander, Andrew Layman, Richard Tobin, and Henry S. Thompson (2009). Namespaces in XML 1.0 (Third Edition). W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2009/REC-xml-names-20091208/
[Diewald, 2011] Nils Diewald (2011). XML::Loy – Extensible XML Reader and Writer. http://search.cpan.org/dist/XML-Loy/
[Fallside and Walmsley, 2004] David C. Fallside and Priscilla Walmsley (2004). XML Schema Part 0: Primer Second Edition. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[Gao et al., 2012] Shudi (Sandy) Gao, C. M. Sperberg-McQueen, and Henry S. Thompson (2012). W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/
[Goecke et al., 2010] Daniela Goecke, Harald Lüngen, Dieter Metzing, Maik Stührenberg, and Andreas Witt (2010). Different views on markup. Distinguishing Levels and Layers. In: Witt, A. and Metzing, D. (eds.), Linguistic Modeling of Information and Markup Languages. Dordrecht: Springer. doi:https://doi.org/10.1007/978-90-481-3331-4_1.
[Ishikawa, 2002] Masayasu Ishikawa (2002). An XHTML+MathML+SVG Profile. W3C Working Draft, World Wide Web Consortium (W3C). http://www.w3.org/TR/XHTMLplusMathMLplusSVG/xhtml-math-svg.html
[van Kesteren and Hunt, 2013] Anne van Kesteren and Lachlan Hunt (2013). Selectors API Level 1. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2013/REC-selectors-api-20130221/
[Nottingham and Sayre, 2005] Mark Nottingham and Robert Sayre (2005). The Atom Syndication Format. The Internet Society. http://tools.ietf.org/html/rfc4287
[Riedel, 2008] Sebastian Riedel (2008). Mojolicious. Real-time web framework. http://search.cpan.org/dist/Mojolicious/
[Snell, 2006] James M. Snell (2006). Atom Threading Extensions. The Internet Society. http://www.ietf.org/rfc/rfc4685.txt
[Sperberg-McQueen and Huitfeldt, 2004] C. M. Sperberg-McQueen and Claus Huitfeldt (2004). GODDAG: A Data Structure for Overlapping Hierarchies. In: King, P. and Munson, E. V. (eds.), Proceedings of the 5th International Workshop on the Principles of Digital Document Processing (PODDP 2000), volume 2023 of Lecture Notes in Computer Science, Springer.
[Sperberg-McQueen and Huitfeldt, 2008] C. M. Sperberg-McQueen and Claus Huitfeldt (2008). GODDAG. Presented at the Goddag workshop, Amsterdam, 1-5 December 2008.
[Stührenberg and Goecke, 2008] Maik Stührenberg and Daniela Goecke (2008). SGF – An integrated model for multiple annotations and its application in a linguistic domain. Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In: Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1. doi:https://doi.org/10.4242/BalisageVol1.Stuehrenberg01
[Stührenberg and Jettka, 2009] Maik Stührenberg and Daniel Jettka (2009). A toolkit for multi-dimensional markup: The development of SGF to XStandoff. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3. doi:https://doi.org/10.4242/BalisageVol3.Stuhrenberg01.
[Stührenberg, 2013] Maik Stührenberg (2013). What, when, where? Spatial and temporal annotations with XStandoff. In Proceedings of Balisage: The Markup Conference 2013. doi:https://doi.org/10.4242/BalisageVol10.Stuhrenberg01.
[Thompson et al., 2004] Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn (2004). XML Schema Part 1: Structures Second Edition. W3C Recommendation, World Wide Web Consortium (W3C). http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[Winer, 2009] Dave Winer (2009). RSS 2.0 Specification. http://www.rssboard.org/rss-specification
[1] The leading minus symbol is a shortcut for the XML::Loy module namespace, meaning that the qualified name is XML::Loy::Example::Morphemes. More than one extension can be passed at once.
[2] More examples can be found at http://www.xstandoff.net/examples.
[3] Think of different POS taggers for example.
[4] The software presented in this section is freely available under the GPL or the Artistic License at http://github.com/Akron/XML-Loy-XStandoff.
[5] In the following example we will limit our view to segments defined by character positions. See Stührenberg, 2013 for examples of other segmentation methods supported by XStandoff.
[6] This extension is not described in this article.