How to cite this paper
Piez, Wendell. “Luminescent: parsing LMNL by XSLT upconversion.” Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). https://doi.org/10.4242/BalisageVol8.Piez01.
Balisage: The Markup Conference 2012
August 7 - 10, 2012
Balisage Paper: Luminescent: parsing LMNL by XSLT upconversion
Wendell Piez
Mulberry Technologies, Inc.
Wendell Piez has been attending Balisage and its antecedent conferences since the
early days of XML; among his contributions has been, with Jeni Tennison, the original
LMNL
proposal (2002).
Copyright © 2012 by the author. Used with permission.
Abstract
Among attempts to deal with the overlap problem, LMNL (Layered Markup and Annotation
Language) has attracted its share of attention but has also never grown much past
its
origins as a thought experiment. LMNL’s conceptual model differs from XML’s, and by
design
its notation also differs from XML’s. Nonetheless, a pipeline of XSLT transformations
can
parse LMNL input and construct an XML representation of LMNL, with the resulting benefit
that further XML tools can be used to analyze and process documents originating from
the
alien notation. The key is to regard the task as an upconversion: structural induction
performed over plain text.
Table of Contents
- LMNL: the Layered Markup and Annotation Language
  - Ranges
  - Arbitrary overlap
  - Annotations
  - Atoms
- xLMNL: an XML-based representation of the LMNL data model
- Compiling LMNL syntax into xLMNL via XSLT upconversion
  - Checking LMNL syntax for well-formedness
- Working with the model: prototype LMNL applications
- Reflections
- Appendix A. xLMNL example
  - LMNL syntax:
  - Compiled into xLMNL
- Appendix B. RNC schema for xLMNL
- Appendix C. Demonstrations and source code
Luminescent is a prototype parser and compiler for LMNL
syntax, converting LMNL documents into xLMNL, an XML-based
representation of the LMNL model suitable for further processing. It consists of a
series of
XSLT 2.0 stylesheets, currently running in a web server (using Cocoon) or in batch
mode (using
an XProc pipeline). A second XProc pipeline can apply Schematron validation to the
intermediate
formats generated in Luminescent to detect and locate syntax errors in the input
document.
LMNL: the Layered Markup and Annotation Language
LMNL (Layered Markup and Annotation Language) is an approach to markup first proposed
by
Jeni Tennison and myself in 2002 [Tennison and Piez 2002]. It emulates XML in some
respects, but also differs from it in several fundamental ways, suggesting some very
different
approaches to modeling text-based information using markup, with some very different
applications. For this reason, even if an alternative processing stack could never
be built on
LMNL (which presumably it could, given enough time, effort and resources), and even
if LMNL is
never regarded as a replacement for XML (which it was never intended to be), it turns
out to be a fertile laboratory for solutions to modeling problems, including XML-based
solutions for XML platforms.
XML is defined [XML Recommendation] as a syntax, but implies a model, which was
described by the (non-normative) XML Information Set [XML Infoset], expressed in any number of code libraries and APIs (both official
and unofficial), and finally standardized (at least in one variant) in the XPath 2.0/XQuery Data Model (XDM) [XDM]. LMNL
inverts this, being defined first as an abstract model, whose syntax is proposed incidentally,
as a form of representation (and as such, one among many conceivable). Nevertheless,
the idea
is the same: a formal model stabilizes a set of capabilities for tools performing
useful
operations over text-based information sets, and provides a basis for interoperability,
while
a syntax provides a serialization format and an interface for developers and users.
Like XML,
LMNL is conceived in order to support markup, a means of
assigning labels and attributing properties and relationships to data points or fields
in
text, by means of text; and like XML, LMNL expects to provide a basis for descriptive and declarative markup
applications (although, again like XML, not only those), which support document and
data
processing within layered systems that can thus benefit from separation of concerns
(between
authoring, editorial, data management, and production tasks, for example), and that
are not
locked into single applications. Again like XML, LMNL does this by leaving it to applications
to define their own sets of names, labels or keywords, to which they can assign whatever
semantics they see fit. In this respect, LMNL syntax (like XML) is a meta-language
while LMNL
itself (like the XDM) is a meta-model: a model (with a design and hence a particular
set of
affordances in application) that we use to make models, of documents, families of
documents,
and assorted information sets of whatever description.
This much is similar; the differences from XML are (primarily) in the design of the
model
itself, and (secondarily) in the syntax proposed to represent it. The syntax is designed
to
look as little like XML as possible, for two reasons: first, so that LMNL syntax may
be
embedded directly into XML syntax, or the reverse; and secondly, to reduce cognitive
overload
when thinking about LMNL and XML together, or when thinking about LMNL with the burden
of
expectations formed by XML. (At the level of the model, we have similarly tried to
avoid using
XML terminology for LMNL concepts except where the connections are strong.) In the
interests
of brevity, rather than explicate the model fully and offer rationales for it here,
I offer a
simple summary description of the model, and of LMNL syntax, together.
Note
Readers may wish to review some of the historical LMNL specifications, which can now
be found at lmnl-markup.org.
Ranges
Where XML has elements, LMNL has ranges. Unlike XML
elements, ranges in LMNL have no necessary relation with one another: they are neither
parents, nor children of each other, nor in any hierarchy at all. Ranges may be named
(names
in LMNL are qualified by namespaces in much the way they are in XML), or anonymous.
The
assumption is that they will ordinarily have generic names indicating their type,
like XML
elements. Ranges are properties of an owner limen (using
the Latin word for "doorstep" to designate this important data object type),
which belongs either to the document as a whole or to an annotation, and which has a
value comprising a single string (a sequence of contiguous
characters). The value of the range will be a substring of the value of the limen,
while its position will be the character offset within its limen where it starts.
In order to avoid confusion with XML, LMNL syntax uses a different set of delimiters
to
identify starts and ends of ranges. This example shows a chunk of LMNL syntax with
two types
of ranges, s and l, marked
over the stream of text. s ranges do not overlap with other
s ranges, and l never
overlaps with l, but the two types overlap each
other:
[s}[l}He manages to keep the upper hand{l]
[l}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l}We fence our flowers in and the hens range.{l]{s]
In the way that XML has a concise empty-element syntax, empty ranges may also be marked
with single tags, as in [br]. Empty ranges have no value (or their value is an
empty string), although they do have a position within their owner layer.
It is sometimes convenient (although LMNL syntax does not require it) to designate
a
single range covering the entire
document:
[excerpt}
[s}[l}He manages to keep the upper hand{l]
[l}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l}We fence our flowers in and the hens range.{l]{s]
{excerpt]
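The model described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of Luminescent: the class names Limen and Range, the value_of method and the end-exclusive offset convention are all hypothetical, and the empty br range is invented for the example. Offsets are counted over the text of the first sample above (without the enclosing excerpt range).

```python
from dataclasses import dataclass, field

@dataclass
class Range:
    name: str    # generic name; "" would stand for an anonymous range
    start: int   # character offset in the owner limen where the range starts
    end: int     # offset just past its last character; start == end when empty

@dataclass
class Limen:
    value: str                                  # the single string owned by the limen
    ranges: list = field(default_factory=list)  # a flat list: no parent/child links

    def value_of(self, r: Range) -> str:
        """The value of a range is a substring of the value of its limen."""
        return self.value[r.start:r.end]

text = ("He manages to keep the upper hand\n"
        "On his own farm. He's boss. But as to hens:\n"
        "We fence our flowers in and the hens range.")

doc = Limen(text, [
    Range("s", 0, 50),    # first sentence, crossing the first line break
    Range("l", 0, 33),    # first line
    Range("l", 34, 77),   # second line, which starts inside the first sentence
    Range("br", 33, 33),  # hypothetical empty range: a position but no value
])

for r in doc.ranges:
    print(r.name, r.start, r.end, repr(doc.value_of(r)))
```

Nothing in the structure relates one range to another; any containment or overlap has to be computed from the offsets.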
Arbitrary overlap
LMNL supports arbitrary overlap, which is to say overlapping ranges of the same type.
This is important for certain potential applications such as annotation frameworks
and range
indexing, where ranges of text need to be identified that may overlap, while still
being of
the same type.
In LMNL syntax, this example shows two ranges named r,
overlapping each
other:
[r=r1}A case [r=r2}of{r=r1] arbitrary overlap{r=r2]
While the range identifier (given after the =) is optional, when it is not given, a
close tag is presumed to match the most recent open tag with the same combination of name
and identifier; thus to express overlap of this kind (rather than one
r range simply being enclosed in the other), the identifier is necessary on
the tags marking at least one of the ranges involved. But the identifier is not formally
part of the name.
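The matching rule just described can be illustrated with a short Python sketch. It is a simplification, not the Luminescent code: the function name match_ranges and the regular expression are assumptions, and only bare start and end tags (with or without an identifier) are recognized, so annotations, atoms, comments and empty ranges are out of scope. Start tags left unclosed are silently ignored here.

```python
import re

# [name} or [name=id} opens a range; {name] or {name=id] closes one.
TAG = re.compile(r"\[(\w+)(?:=(\w+))?\}|\{(\w+)(?:=(\w+))?\]")

def match_ranges(lmnl):
    """Return (name, identifier, start, end) tuples; offsets count text characters only."""
    open_tags, ranges = [], []
    offset = pos = 0
    for m in TAG.finditer(lmnl):
        offset += m.start() - pos           # text seen since the previous tag
        pos = m.end()
        if m.group(1):                      # start tag: remember where it opened
            open_tags.append((m.group(1), m.group(2), offset))
            continue
        key = (m.group(3), m.group(4))      # end tag: search the most recent opens first
        for i in reversed(range(len(open_tags))):
            if open_tags[i][:2] == key:     # same combination of name and identifier
                name, ident, start = open_tags.pop(i)
                ranges.append((name, ident, start, offset))
                break
        else:
            raise ValueError(f"no start tag matches {m.group()}")
    return sorted(ranges, key=lambda r: (r[2], r[3]))

print(match_ranges("[r=r1}A case [r=r2}of{r=r1] arbitrary overlap{r=r2]"))
print(match_ranges("[r}A case [r}of{r] arbitrary overlap{r]"))
```

With identifiers the two r ranges come out as (0, 9) and (7, 27), which overlap; without them the first close tag pairs with the nearer open tag, so one range is simply enclosed in the other.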
Annotations
While XML elements may have attributes, LMNL ranges may have annotations. Unlike XML attributes, there is no restriction against assigning
more than one annotation with the same name to a given range; likewise, the order
of
annotations on a range is supported in the model.
In the syntax, annotations are represented by using tagging inside
tagging:
[excerpt [source}The Housekeeper{source] [author}Robert Frost{author]}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]
In
order to reduce tagging overhead, when annotations contain only simple string values,
their
close tags may be presented in abbreviated notation (resembling anonymous end
tags):
...[l [n}145{]}On his own farm.{s [id}s1{]]...
In
addition (as this example also shows), the syntax permits placing annotations on end
tags,
not only on start tags.
Finally, while attributes in XML assign properties to elements as name-value pairs,
LMNL
annotations may be structured. In the LMNL model, annotations are isomorphic to LMNL
documents: like a document, an annotation has a limen with content and optionally
ranges
over that content. Likewise, like ranges (including ranges over annotation content),
annotations may be annotated.
Given this flexibility it is sometimes convenient for annotations, like ranges, to
be
empty, having no content but only annotations, which it groups, orders and names.
So this is legal syntax and represents a coherent LMNL document
object:
[excerpt
[source [date}1915{][title}The Housekeeper{]]
[author
[name}Robert Frost{]
[dates}1874-1963{]] }
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]
In this example, the excerpt range carries two empty annotations,
source and author, each of which has annotations of its own.
This is an especially powerful feature of LMNL, not only because it provides a very
useful capability in modeling (as it presents annotations in a directed graph structure
– as if XML attributes could have their own attributes), but also because of its
implications for the way documentary information is organized and linked. For example,
a
LMNL system might well support attaching a document dynamically as an annotation to
a range
in another document.
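As a rough illustration of annotations that are themselves annotated, the following Python sketch models the annotations on the excerpt range from the example above. The class name Annotation, its fields and the helper paths are hypothetical names chosen for this sketch; the LMNL model itself is not defined in these terms.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    name: str
    value: str = ""                                  # content of the annotation's own limen
    ranges: list = field(default_factory=list)       # ranges over that content, if any
    annotations: list = field(default_factory=list)  # ordered; repeated names are allowed

excerpt_annotations = [
    Annotation("source", annotations=[
        Annotation("date", "1915"),
        Annotation("title", "The Housekeeper"),
    ]),
    Annotation("author", annotations=[
        Annotation("name", "Robert Frost"),
        Annotation("dates", "1874-1963"),
    ]),
]

def paths(annotations, prefix=""):
    """Walk the annotation structure, yielding (path, value) for non-empty annotations."""
    for a in annotations:
        here = f"{prefix}/{a.name}"
        if a.value:
            yield here, a.value
        yield from paths(a.annotations, here)

for path, value in paths(excerpt_annotations):
    print(path, "=", value)
```

Because an annotation carries its own content, ranges and annotations, attaching a whole document as an annotation would need no additional machinery in a structure like this one.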
Atoms
At its base, a LMNL document is defined as a sequence of atoms: the most common type of atom
will ordinarily be a character atom, represented by a single Unicode character in the syntax.
Yet while every character in Unicode maps to a corresponding atom, atoms in LMNL are also
capable of representing other information of whatever kind an application may find it useful
to represent in this way.
An atom has string length of 1. Consequently, and unlike empty ranges, atoms not only
have location, but they occupy space, are included in the value of ranges in
which they participate, and can be marked up. Atoms are identified with their own
notation, {{ }}, in the syntax. In this example, an atom named logo is marked
up with a range named link:
[link [href}lmnl-markup.org{]}{{logo [src}lmnl-markup.org/hat.png{]}}{link]
xLMNL: an XML-based representation of the LMNL data model
One way LMNL builds on the conceptual foundation of XML is by differentiating between
operations on the syntax, which imply parsing, and operations on optimized representations
of
documents held in memory: the model. This differentiation gives us leverage in
development, since we have the opportunity to identify either syntax or model as the
appropriate place for design and implementation, whether that be of the tag set itself
(considered as a set of labels and constraints over their use), user interfaces,
transformations or anything else.
Paradoxically, while the LMNL model is designed in deliberate contrast to XML, it
is
nevertheless useful to specify an XML-based representation of it, for several reasons.
First,
it exposes instances conveniently by giving us the opportunity to serialize LMNL documents
in
XML syntax. Second, it makes it possible to use XML-based tools (such as XSLT, schema
technologies, XQuery, XML servers, CMS and database technology) to query and manipulate
LMNL
– an advantage for those of us who are well-practiced in these technologies for data
processing, but not in Java or Python. And thirdly, it clarifies some of the resemblances
and
differences between LMNL and other approaches (especially XML-based approaches) to
the problem
set.
Since 2002, I have experimented with adapting XML to LMNL in several different ways.
Not
only can XML elements be construed as LMNL ranges and XML attributes as LMNL annotations
(this
is the essence of the CLIX and ECLIX approaches, cf. [Piez 2004]); also,
XML-based notations for representing overlap, such as milestone-based notations or
segmented
and aligned XML elements, can be mapped into LMNL. This provides a framework, at least,
for
thinking systematically about how to implement and maintain processes to manage these
awkward
and difficult forms of XML.
Yet the real power of the LMNL model as such cannot be exploited without a more direct
representation. xLMNL is an XML-based representation of the
model itself: that is to say, it leaves behind the concept of a document as an information
set
represented in embedded markup (literal tags applied directly to literal text), and
simply
uses XML as a kind of poor man's (hierarchical) database. This gives us many of
the advantages of an XML platform described above, while making downstream applications
more
tractable, inasmuch as they can work directly with LMNL as conceptualized, rather
than at a
remove. At the price of being somewhat heavyweight and memory intensive, xLMNL is
thus a
useful interim format for testing ideas and demonstrating concepts.
Again, the most concise way of presenting this design is by way of an example: the
xLMNL
equivalent of the document given above is presented in Appendix A.
Note
Note however that the notation itself is not at all concise! In fact there are many
redundancies built into xLMNL, as compared to a bare LMNL range model, in order to
streamline downstream processes. For example, text layer content is broken up into
spans
which are indexed to the ranges in which they participate. While a LMNL processor
might
wish to calculate this on the fly, when working on a static document it makes sense
to
index them only once, so this is done in xLMNL. It should go without saying that this
does
not preclude a more lightweight standoff-based XML representation of LMNL.
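The span indexing described in this note can be sketched as follows. This is an illustrative reconstruction in Python, not the XSLT used in Luminescent; the function name spans_for and the tuple layout are assumptions. The text is cut at every range boundary, and each resulting span records the ranges that cover it.

```python
def spans_for(text, ranges):
    """Cut text at every range boundary and index each span to its covering ranges.

    ranges: list of (range_id, start, end) with end-exclusive character offsets.
    Returns a list of (start, end, substring, [range_id, ...]).
    """
    cuts = sorted({0, len(text)} | {o for _, s, e in ranges for o in (s, e)})
    spans = []
    for a, b in zip(cuts, cuts[1:]):
        covering = [rid for rid, s, e in ranges if s <= a and b <= e]
        spans.append((a, b, text[a:b], covering))
    return spans

text = "A case of arbitrary overlap"
ranges = [("r1", 0, 9), ("r2", 7, 27)]
for span in spans_for(text, ranges):
    print(span)
```

Each tuple corresponds roughly to one x:span element, with the list of identifiers playing the part of its ranges attribute. Empty ranges contribute a cut point but never cover a span.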
xLMNL has undergone several iterations since I first started modeling LMNL directly
with XML in 2004 [Piez 2004, and see also Piez 2010].
Developers who work on the overlap problem in XML will recognize this as a standoff
representation of ranges. As such, it might be generated and maintained in any number
of ways
– even (if rather onerously) by hand.
Nevertheless, no claim should be inferred that I suppose xLMNL to be at all an optimal
approach to working with LMNL on an XML platform. The best argument for doing this
is that
fairly dramatic demonstrations of the interest of overlapping markup are not all that
hard to
come by if one only has a means by which to create them, and xLMNL is a step along
the
way.
A schema for xLMNL, using Relax NG (compact syntax)
appears in Appendix B.
Compiling LMNL syntax into xLMNL via XSLT upconversion
In its current form, the complete Luminescent pipeline has thirteen steps, each of
which
is implemented in an XSLT 2.0 transformation. These can be chained together using
any
available means; I have used both XProc and Cocoon (which is convenient for hooking
Luminescent together with further transformations processing xLMNL into various targets).
Several of the steps could be combined for greater efficiency; the reason to have
so many
presently is to maximize transparency for development and debugging.
The steps proceed as follows:
-
Comments are extracted using a regular expression matching on open and close comment
delimiters ([!-- and --]). This has to be done first so that
markup inside comments will not be processed in subsequent steps. The result is a
single
element (representing the root of the tag tree) containing a sequence of strings and
elements representing comments.
-
Tokenization: all open and close tag delimiters, [, {, ] and }, in document content
(i.e., not inside comments) are matched and wrapped as XML t elements (for token).
The result is a sequence of strings interspersed with
comments and these elements, representing tag delimiters.
-
The token (t) elements are marked with line and
character offsets, to be carried forward for purposes of any error reporting that
has to
be performed later.
-
A sibling recursion is applied to infer tagging from the tokens. A tag element is initiated
with each open delimiter ([ or {); each close delimiter (] or }) ends the tag element
most recently started. The result is a rudimentary tag tree of the document. Delimiters
and comments are retained.
-
Types are assigned to the tags, which are mapped to start, end, empty and atom elements. This works by
inferring each type of tag from its open and close delimiters: [r} for start, {r] for end,
[e] for empty, and {{a}} for atom. The extra level of delimiters required for atoms is
respected; tags with outer shells but no inner shells (that is, that fail to
respect the double-brace syntax of atoms, as in {{atom}}) are marked as errors.
Simultaneously, tag names (generic identifiers) are extracted from their values. Any
tag that has a range identifier alongside its generic identifier keeps the range identifier
as part of its GI. (So a tag [range=r1} is represented as <range gi="range=r1"/>.)
-
Start tags are marked with unique identifiers (distinct from any range identifiers
already given).
-
By means of another sibling recursion, end tags are marked with the identifier of
the most recent start tag with the same GI.
Since range identifiers are still, at this stage, considered part of the GI, the
sibling recursion in this process matches end tags to start tags correctly.
-
Matching start and end-tag pairs appearing inside tags are promoted into
annotations.
This is the trickiest step, for two reasons. First, abbreviated syntax permitted for
simple annotations means that anonymous end tags ({]) may be matched with
named start tags. Secondly, annotations may contain markup, and so not just any tag
directly inside a tag is actually an annotation delimiter (it could mark up a range
over
content inside the annotation). This process must work, again, via sibling recursion
(the third one performed in the pipeline). Where tagging is not correct, error elements may be generated.
-
Character offsets are marked on start, end, empty and atom tag elements, and text spans are wrapped (with span elements) and marked with character offsets within their
owner layer (or limen in LMNL terminology: the
annotation or document within which they appear). The offsets are determined from
the
lengths of string content (text nodes in the XML), with any atoms appearing being
given
length 1, while comments and range markers have length 0.
-
Proper generic identifiers (range names) are derived from combinations of ranges
with their identifiers. (The identifiers are saved as label attributes in case they may be wanted.)
-
Unique identifiers are assigned to ranges; range start and end tags have the same
identifier, while empty range tags have their own.
Similarly, annotations are marked with unique identifiers, as is the document as a
whole.
-
Layer identifiers are assigned to spans, corresponding to the limen (annotation or
document) in which the span appears. Strictly speaking these identifiers are redundant,
since the same information is given by the xLMNL document structure; but they are
useful
for optimizing subsequent (downstream) processes or (potentially) for processing or
aggregating LMNL documents described in multiple xLMNL instances.
The result of this step is a comprehensive tag tree of the marked up
LMNL syntax instance.
(A later project goal will be to codify this format for interchange; it maps to the
earlier CLIX format. This may also prove to be more robust than xLMNL for maintenance
of
LMNL data sets in XML, since ranges are still represented by tags within the text
stream
rather than standoff markup.)
-
The tag tree is converted into xLMNL by reading range elements from start/end tag
pairs, or from empty range markers as the case may be. Ranges are marked with the
start
and end offsets, read from their tags. Spans are marked with pointers to the ranges
in
which they participate. (A fourth sibling recursion accomplishes this. Again, the
information here is redundant but useful.)
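The early steps listed above (comment extraction, tokenization with line and character offsets, tag inference and tag typing) can be sketched compactly in Python. This is a hedged illustration of the technique, not a port of the XSLT: the names tokenize, build_tags, classify and gi, the dictionary layout, and the use of a stack in place of sibling recursion are assumptions of the sketch. The error code is borrowed from the report shown in the next section; which pipeline step actually emits it is not specified here.

```python
import re

# Comments are matched first so that delimiters inside them never become tokens.
TOKEN = re.compile(r"(?P<comment>\[!--.*?--\])|(?P<delim>[\[\]{}])", re.S)

def tokenize(source):
    """Split source into text, comment and delimiter tokens; delimiters carry line, col."""
    tokens, pos = [], 0
    for m in TOKEN.finditer(source):
        if m.start() > pos:
            tokens.append(("text", source[pos:m.start()]))
        if m.lastgroup == "comment":
            tokens.append(("comment", m.group()))
        else:
            line = source.count("\n", 0, m.start()) + 1
            col = m.start() - source.rfind("\n", 0, m.start())  # 1-based column
            tokens.append(("delim", m.group(), line, col))
        pos = m.end()
    if pos < len(source):
        tokens.append(("text", source[pos:]))
    return tokens

def build_tags(tokens):
    """Each open delimiter starts a tag; each close delimiter ends the most recent one."""
    root = {"open": None, "close": None, "children": []}
    stack = [root]
    for tok in tokens:
        if tok[0] == "delim" and tok[1] in "[{":
            tag = {"open": tok[1], "close": None, "children": [], "at": tok[2:]}
            stack[-1]["children"].append(tag)
            stack.append(tag)
        elif tok[0] == "delim":
            if len(stack) == 1:   # a close delimiter with nothing left to close
                root["children"].append(("error", "UNEXPECTED-TAGGING", tok[1], tok[2:]))
            else:
                stack.pop()["close"] = tok[1]
        else:
            stack[-1]["children"].append(tok)
    return root

KINDS = {("[", "}"): "start", ("{", "]"): "end", ("[", "]"): "empty"}

def classify(tag):
    """Infer the tag type from its delimiters; atoms need an inner {...} shell."""
    pair = (tag["open"], tag["close"])
    if pair == ("{", "}"):
        inner = [c for c in tag["children"] if isinstance(c, dict)]
        ok = len(inner) == 1 and (inner[0]["open"], inner[0]["close"]) == ("{", "}")
        return "atom" if ok else "error"
    return KINDS.get(pair, "error")

def gi(tag):
    """Generic identifier: first word of the tag's leading text (any =id is kept)."""
    first = tag["children"][0] if tag["children"] else None
    if isinstance(first, tuple) and first[0] == "text" and first[1].split():
        return first[1].split()[0]
    return ""

root = build_tags(tokenize("[s}[l [n}144{n]}He{l]{s]{{logo}}[br]"))
print([(classify(t), gi(t)) for t in root["children"] if isinstance(t, dict)])
```

In the resulting tree, the n start and end tags sit inside the l start tag, which is the raw material from which a later step promotes matched pairs into annotations.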
Checking LMNL syntax for well-formedness
Rather than stop processing, the pipeline currently emits error elements when it encounters problems, with codes identifying the issue.
This appears to work well.
In addition, more precise diagnostics are performed by applying Schematron validation
to
particular steps in the pipeline. (This is implemented with a second XProc pipeline
specification that imports the main one, applies Schematron schemas to the results
of two of
Luminescent's intermediate formats, aggregates their results together and formats
them.) For
example, using Schematron it is easy to check whether all start tags have matching
end tags
or vice-versa, or that range or annotation names follow their rules. Because the
intermediate formats carry forward information on the location of tagging in the original
LMNL syntax instance, Schematron can report the locations of tagging found to be
problematic.
This is especially important since LMNL syntax becomes hard to read as the markup
becomes more complex. For example, here is a malformed
instance:
[excerpt [source}The Housekeeper{source] [author}Robert Frost{author]]}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]
(The error occurs at the end of the first line, where an extra ] appears before the }
ending the start tag.)
Schematron reports this:
Error UNEXPECTED-TAGGING reported for } at 1:71,
C:\Projects\LMNL\Luminescent\lmnl\frost-quote.lmnl
No start tag matches end tag {excerpt] at 5:1,
C:\Projects\LMNL\Luminescent\lmnl\frost-quote.lmnl
The processor has taken the mistaken ], as it must, as the end of the tag; and
since it therefore makes an empty range marker, the end tag that is supposed to match
it is found to have no start tag.
The two errors are detected differently. The first error is reported for any tag
delimiter that can't be matched with a corresponding delimiter of the opposite kind
(start
or end). The second is reported for the failure to follow the constraint that all
start tags
must have end tags and vice versa.
The line numbers and offsets reported (1:71 and 5:1) correctly locate the problems;
character 71 of line 1 is the location of the orphaned tag close delimiter }
(which would have closed a start tag had the ] character not intervened), while
line 5 character 1 is where the orphaned end tag is located.
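The second kind of check (every start tag must have an end tag and vice versa) can be imitated outside Schematron with a short Python sketch. It assumes, hypothetically, that typed tags are already available as (kind, gi, line, col) tuples; the function name unmatched is an assumption, and the message wording imitates the report above rather than reproducing the actual Schematron schema.

```python
def unmatched(tags):
    """Report start and end tags without partners.

    tags: iterable of (kind, gi, line, col), kind being "start", "end" or "empty".
    """
    open_tags, problems = [], []
    for kind, gi, line, col in tags:
        if kind == "start":
            open_tags.append((gi, line, col))
        elif kind == "end":
            for i in reversed(range(len(open_tags))):
                if open_tags[i][0] == gi:   # most recent open tag with the same GI
                    del open_tags[i]
                    break
            else:
                problems.append(f"No start tag matches end tag {{{gi}] at {line}:{col}")
    for gi, line, col in open_tags:
        problems.append(f"No end tag matches start tag [{gi}}} at {line}:{col}")
    return problems

# The malformed instance, reduced to a few of its tags: the stray ] has turned
# the excerpt start tag into an empty range marker.
tags = [
    ("empty", "excerpt", 1, 1),
    ("start", "s", 2, 1),
    ("end", "s", 3, 32),
    ("end", "excerpt", 5, 1),
]
print(unmatched(tags))
```

Carrying line and column forward on every tag is what makes the final message locatable in the original syntax.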
Working with the model: prototype LMNL applications
Currently I have several processes running with xLMNL as source. Some of these are
tuned
to particular tag sets, while others are generic. A selection is offered in place
of
presentation slides for this paper (the zipped package contains a mix of HTML, XML
and SVG and
can be reviewed starting from index.html using any current web browser).
-
A generic diagnostic stylesheet can report which range types overlap with which
other range types. (This is most useful to know for process customization.)
-
XML can be extracted from xLMNL dynamically, using a parameterized listing of range
types to be reflected as a hierarchy of XML elements. Ranges of these types are promoted
into XML elements; their annotations become, when they have simple values, XML
attributes. Ranges not among these types, and annotations that are not cast to
attributes, become XML elements representing range delimiters (tags) or annotation
structures. Spans of text are kept with pointers to the ranges in which they
participate, when these have not been cast to ancestor elements.
This process can be run independently, but its functionality is also available
dynamically as a function call in XSLT, operating on any xLMNL document or annotation
(or a subset of spans from within a document or annotation, perhaps those associated
with a given range) and casting it into XML.
This is also a generic process, although the particular ranges to be converted into
XML elements are passed in at run time.
-
SVG graphs and HTML renditions can be generated to display and depict LMNL
documents. These transformations, to be sure, are not always trivial; but their
difficulties are greatly mitigated by the XML extraction process just mentioned, used
to
cast LMNL into intermediate XML formats (hierarchical views of the LMNL).
These are not generic processes, since of course particular displays are optimized
for particular tagging semantics, but some of them do rely on imported functionalities
implemented generically (such as the logic that generates SVG bubble graphs),
so that logic can be shared.
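The first process in this list, the report of which range types overlap with which, reduces to a pairwise comparison of offsets. The following Python sketch shows the idea under simplifying assumptions: ranges are given as (name, start, end) tuples rather than read from xLMNL, the function name overlap_report is hypothetical, and "overlap" means that two ranges cross without either containing the other.

```python
from itertools import combinations

def overlap_report(ranges):
    """Return the sorted pairs of range type names for which some two ranges cross."""
    pairs = set()
    for (n1, s1, e1), (n2, s2, e2) in combinations(ranges, 2):
        if s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1:
            pairs.add(tuple(sorted((n1, n2))))
    return sorted(pairs)

# Sentence (s) and line (l) ranges of the Frost excerpt, offsets over the bare text.
frost = [("s", 0, 50), ("l", 0, 33), ("l", 34, 77),
         ("s", 51, 61), ("s", 62, 121), ("l", 78, 121)]
print(overlap_report(frost))
print(overlap_report([("r", 0, 9), ("r", 7, 27)]))
```

A type paired with itself in the result signals arbitrary overlap, as with the two r ranges in the second call. The comparison is quadratic in the number of ranges, which is acceptable for a diagnostic on small documents.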
Links to demonstrations are provided in Appendix C.
Reflections
I can make no pretense as to the efficiency or scalability of this approach. So far,
it
has only worked well enough for my purposes: to demonstrate its feasibility in principle,
and
to test the specifications. While it has performed adequately on documents up
to several
hundred KB in size, and experience suggests that processing bottlenecks for Luminescent
are
actually more likely coming out of xLMNL rather than into it, I have no data to confirm
my
intuitions here. There does appear to be a rich and interesting set of problems at
hand.
Nevertheless, if nothing else, this exercise has suggested some very interesting things
about markup technologies beyond XML. One of the keys appears to be the separation
of the
parsing of the syntax from the construction of the model; so the parse tree is a tree
only of
the tags, from which the document model is derived by a different process. (The parse
itself
works like a parse of S-expressions, in which open and close delimiters are recursively
parsed
into tags.) In this view of things, machine-automated text processing can support a very
different form of document description than that provided by the operational semantics
of XML,
which in order to build a document model from the markup in a single pass, must limit
itself
to a syntax in which not just tags but the element structure itself can be described
by a
context-free grammar. Thus its document models are limited to trees and to graphs projected over that
tree [Bos 2005]. While not, formally, more expressive than XML markup (since
graphs projected over a tree can express the same relations as LMNL markup, as indeed
they do
in xLMNL or other XML-based representations of LMNL), LMNL markup is practically so;
it can
get closer to the text than XML does, inasmuch as in order to fit within its
own rules, XML's representation of a document (or at any rate, of a document in which
overlapping structures or features, or structured annotations, are represented) is
always
getting in its own way.
Related to this is another aspect of this work: this parsing or compiling process
does not
assume a single depth-first traversal of structures implicit in the syntax, and so
does not
perform a single pass over the data. Instead, it considers that the entire text is
available
to the parser at once, and works by applying several distinct heuristic operations
in
sequence: first tags are inferred from delimiting tokens; then different types of
tags (open,
close, empty or atom) are recognized; then open/close pairs are matched, etc. Whether
this
technique is very novel or interesting, or how it relates to (or evades, or complicates)
classic problems in text processing, I am not highly qualified to say. Yet it might
be
interesting for the sole reason that it serves as a proof of concept for generalized
plain
text processing in XSLT.
What I as a markup user find most remarkable, however, is what happens once a tool
chain
like this is in place. XML practitioners, I think, or at least those of us who work
with
structurally complex texts, are familiar with a conflict between the wish to describe
our
information accurately, capably and gracefully, and the need to force everything into
a single
hierarchy of elements – for reasons having nothing to do with the purposes of the
markup, but only because the processing infrastructure insists on it, behind the scenes,
before work has even begun. This conflict is apparent every time we work with (or
must
develop) a schema that has to make design compromises in order to address a requirement
to
represent things that overlap, introducing one or more of the well-worn but cumbersome
workarounds for doing so. Sometimes we are faced with truly vexing problems in tagging,
and
even in the best case, having to use workarounds generates a certain amount of mental
background noise. When working with LMNL markup, all this clamor is silenced. Even
in small
demonstrations, I am finding it liberating to be able to mark exactly what I wish
to describe,
with concern only for its clearest denotation in tags and its fidelity to what I want
to
represent in the text. If this is possible at all (and it evidently is), XML's early
commitment to a single tree representation of something as complex as a text (meaning
that
word in the sense that literary scholars do, with everything it entails) appears to
be a
premature optimization – in other words, not always an optimization at all. When tags
in
plain text can be used to represent whatever structures in and features of text we
care to
discover, irrespective of whether they fit easily into a single tree-shaped model,
then the
potentials of markup are magnified immensely. We have only just started to explore
the
possibilities.
Appendix A. xLMNL example
LMNL syntax:
[excerpt}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt
[source [date}1915{][title}The Housekeeper{]]
[author
[name}Robert Frost{]
[dates}1874-1963{]] ]
Compiled into xLMNL
White space is added for legibility; LF characters in the data appear as line breaks
inside the x:span elements that contain them.
<?xml version="1.0" encoding="UTF-8"?>
<x:document xmlns:x="http://lmnl-markup.org/ns/xLMNL" ID="N.d1e1"
base-uri="file:/c:/Projects/LMNL/Luminescent/lmnl/frost-example.lmnl">
<x:content>
<x:span start="0" end="1" layer="N.d1e1" ranges="R.d1e2">
</x:span>
<x:span start="1" end="34" layer="N.d1e1" ranges="R.d1e2 R.d1e5 R.d1e6">He manages to keep the upper hand</x:span>
<x:span start="34" end="35" layer="N.d1e1" ranges="R.d1e2 R.d1e5">
</x:span>
<x:span start="35" end="51" layer="N.d1e1" ranges="R.d1e2 R.d1e5 R.d1e15">On his own farm.</x:span>
<x:span start="51" end="52" layer="N.d1e1" ranges="R.d1e2 R.d1e15"> </x:span>
<x:span start="52" end="62" layer="N.d1e1" ranges="R.d1e2 R.d1e15 R.d1e25">He's boss.</x:span>
<x:span start="62" end="63" layer="N.d1e1" ranges="R.d1e2 R.d1e15"> </x:span>
<x:span start="63" end="78" layer="N.d1e1" ranges="R.d1e2 R.d1e15 R.d1e31">But as to hens:</x:span>
<x:span start="78" end="79" layer="N.d1e1" ranges="R.d1e2 R.d1e31">
</x:span>
<x:span start="79" end="122" layer="N.d1e1" ranges="R.d1e2 R.d1e31 R.d1e37">We fence our flowers in and the hens range.</x:span>
<x:span start="122" end="123" layer="N.d1e1" ranges="R.d1e2"> </x:span>
</x:content>
<x:range start="0" end="123" ID="R.d1e2" sl="1" so="1" name="excerpt" el="9" eo="25">
<x:annotation ID="N.d1e49" sl="6" so="3" el="6" eo="47" name="source">
<x:annotation ID="N.d1e50" sl="6" so="11" el="6" eo="22" name="date">
<x:content>
<x:span start="0" end="4" layer="N.d1e50">1915</x:span>
</x:content>
</x:annotation>
<x:annotation ID="N.d1e53" sl="6" so="23" el="6" eo="46" name="title">
<x:content>
<x:span start="0" end="15" layer="N.d1e53">The Housekeeper</x:span>
</x:content>
</x:annotation>
<x:content/>
</x:annotation>
<x:annotation ID="N.d1e56" sl="7" so="3" el="9" eo="23" name="author">
<x:annotation ID="N.d1e57" sl="8" so="5" el="8" eo="24" name="name">
<x:content>
<x:span start="0" end="12" layer="N.d1e57">Robert Frost</x:span>
</x:content>
</x:annotation>
<x:annotation ID="N.d1e60" sl="9" so="5" el="9" eo="22" name="dates">
<x:content>
<x:span start="0" end="9" layer="N.d1e60">1874-1963</x:span>
</x:content>
</x:annotation>
<x:content/>
</x:annotation>
</x:range>
<x:range start="1" end="51" ID="R.d1e5" sl="2" so="1" name="s" el="3" eo="32"/>
<x:range start="1" end="34" ID="R.d1e6" sl="2" so="4" name="l" el="2" eo="52">
<x:annotation ID="N.d1e7" sl="2" so="7" el="2" eo="15" name="n">
<x:content>
<x:span start="0" end="3" layer="N.d1e7">144</x:span>
</x:content>
</x:annotation>
</x:range>
<x:range start="35" end="78" ID="R.d1e15" sl="3" so="1" name="l" el="3" eo="71">
<x:annotation ID="N.d1e16" sl="3" so="4" el="3" eo="12" name="n">
<x:content>
<x:span start="0" end="3" layer="N.d1e16">145</x:span>
</x:content>
</x:annotation>
</x:range>
<x:range start="52" end="62" ID="R.d1e25" sl="3" so="34" name="s" el="3" eo="49"/>
<x:range start="63" end="122" ID="R.d1e31" sl="3" so="51" name="s" el="4" eo="62"/>
<x:range start="79" end="122" ID="R.d1e37" sl="4" so="1" name="l" el="4" eo="59">
<x:annotation ID="N.d1e38" sl="4" so="4" el="4" eo="12" name="n">
<x:content>
<x:span start="0" end="3" layer="N.d1e38">146</x:span>
</x:content>
</x:annotation>
</x:range>
</x:document>
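As an illustration of how this representation can be queried, here is a minimal sketch in Python (not part of Luminescent, which is written in XSLT). It assumes an abridged copy of the document above, keeping only the spans and ranges and dropping annotations and the sl/so/el/eo attributes. The function names range_text and overlapping_pairs are hypothetical. The first recovers the text of a range by concatenating the spans that list its ID; the second reports pairs of ranges that overlap without either containing the other.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://lmnl-markup.org/ns/xLMNL"}

# Abridged copy of the Frost example: spans and ranges only (an assumption
# made for brevity; annotations and debugging attributes are omitted).
XLMNL = """<x:document xmlns:x="http://lmnl-markup.org/ns/xLMNL" ID="N.d1e1">
<x:content>
<x:span start="0" end="1" layer="N.d1e1" ranges="R.d1e2">&#xA;</x:span>
<x:span start="1" end="34" layer="N.d1e1" ranges="R.d1e2 R.d1e5 R.d1e6">He manages to keep the upper hand</x:span>
<x:span start="34" end="35" layer="N.d1e1" ranges="R.d1e2 R.d1e5">&#xA;</x:span>
<x:span start="35" end="51" layer="N.d1e1" ranges="R.d1e2 R.d1e5 R.d1e15">On his own farm.</x:span>
<x:span start="51" end="52" layer="N.d1e1" ranges="R.d1e2 R.d1e15"> </x:span>
<x:span start="52" end="62" layer="N.d1e1" ranges="R.d1e2 R.d1e15 R.d1e25">He's boss.</x:span>
<x:span start="62" end="63" layer="N.d1e1" ranges="R.d1e2 R.d1e15"> </x:span>
<x:span start="63" end="78" layer="N.d1e1" ranges="R.d1e2 R.d1e15 R.d1e31">But as to hens:</x:span>
<x:span start="78" end="79" layer="N.d1e1" ranges="R.d1e2 R.d1e31">&#xA;</x:span>
<x:span start="79" end="122" layer="N.d1e1" ranges="R.d1e2 R.d1e31 R.d1e37">We fence our flowers in and the hens range.</x:span>
<x:span start="122" end="123" layer="N.d1e1" ranges="R.d1e2">&#xA;</x:span>
</x:content>
<x:range start="0" end="123" ID="R.d1e2" name="excerpt"/>
<x:range start="1" end="51" ID="R.d1e5" name="s"/>
<x:range start="1" end="34" ID="R.d1e6" name="l"/>
<x:range start="35" end="78" ID="R.d1e15" name="l"/>
<x:range start="52" end="62" ID="R.d1e25" name="s"/>
<x:range start="63" end="122" ID="R.d1e31" name="s"/>
<x:range start="79" end="122" ID="R.d1e37" name="l"/>
</x:document>"""


def range_text(doc, range_id):
    """Concatenate, in document order, the spans that list range_id."""
    return "".join(
        span.text or ""
        for span in doc.iterfind("x:content/x:span", NS)
        if range_id in span.get("ranges", "").split()
    )


def overlapping_pairs(doc):
    """Pairs of range IDs that overlap with neither range containing the other."""
    ranges = [
        (r.get("ID"), int(r.get("start")), int(r.get("end")))
        for r in doc.iterfind("x:range", NS)
    ]
    pairs = []
    for i, (id_a, start_a, end_a) in enumerate(ranges):
        for id_b, start_b, end_b in ranges[i + 1:]:
            # One range starts strictly inside the other and ends strictly after it.
            if start_a < start_b < end_a < end_b or start_b < start_a < end_b < end_a:
                pairs.append((id_a, id_b))
    return pairs


doc = ET.fromstring(XLMNL)
print(repr(range_text(doc, "R.d1e15")))  # text of the second line
print(overlapping_pairs(doc))            # [('R.d1e5', 'R.d1e15'), ('R.d1e15', 'R.d1e31')]
```

The two pairs reported are the places where a sentence (s) and a verse line (l) cross: the first sentence runs into line 145, and line 145 ends inside the third sentence. Ranges that share a start or end offset, or that nest, are not counted as overlapping under this definition.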
Appendix B. RNC schema for xLMNL
namespace x = "http://lmnl-markup.org/ns/xLMNL"

start =
  element x:document {
    document-model }

document-model =
  attribute base-uri { xsd:anyURI }?,
  attribute ID { xsd:ID },
  attribute name { xsd:QName }?,
  debug-support?,
  (annotation | comment)*,
  ( content,
    range*,
    (annotation | comment)*)?

annotation =
  element x:annotation {
    document-model }

content =
  element x:content {
    element x:span {
      attribute layer { xsd:IDREF },
      attribute ranges { xsd:IDREFS }?,
      attribute start { xsd:integer },
      attribute end { xsd:integer },
      (text
       | element x:atom {
           attribute name { xsd:NCName },
           debug-support?,
           annotation*
         }
       | comment )+
    }*
  }

range =
  element x:range {
    attribute ID { xsd:ID },
    attribute name { xsd:NCName }?,
    attribute start { xsd:integer },
    attribute end { xsd:integer },
    debug-support?,
    (annotation | comment)*
  }

comment =
  element x:comment {
    debug-support?,
    text }

debug-support =
  attribute sl { xsd:integer },
  attribute so { xsd:integer },
  attribute el { xsd:integer },
  attribute eo { xsd:integer }
A full specification for xLMNL would include constraints not captured by this schema,
such as that offsets (start and end attributes) must be whole numbers (positive integers
or 0); the value of end must be greater than or equal to the value of start on the same
range; the difference between the start and end of a span (its length) must be equal to
its string length plus the count of its atom children; referential integrity must be
maintained between spans, ranges and layers (limina), and so forth.
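Constraints of this kind could be checked by a small program run after schema validation. The following is a minimal sketch in Python, not part of Luminescent; the function names and the two sample documents are hypothetical. It makes one reading of "referential integrity" that is an assumption: each span's layer attribute must name the enclosing document or annotation, and each ID in its ranges attribute must belong to a range in that same layer.

```python
import xml.etree.ElementTree as ET

X = "{http://lmnl-markup.org/ns/xLMNL}"  # namespace in ElementTree's Clark notation


def check_layer(layer):
    """Check one layer (x:document or x:annotation); return a list of messages."""
    problems = []
    range_ids = set()
    for rng in layer.findall(X + "range"):
        range_ids.add(rng.get("ID"))
        start, end = int(rng.get("start")), int(rng.get("end"))
        if start < 0 or end < start:
            problems.append(f"range {rng.get('ID')}: bad offsets {start}..{end}")
    for span in layer.findall(f"{X}content/{X}span"):
        start, end = int(span.get("start")), int(span.get("end"))
        label = f"span {start}..{end}"
        if start < 0 or end < start:
            problems.append(f"{label}: bad offsets")
        # Text directly inside the span: its leading text plus the tail of each child.
        text_len = len(span.text or "") + sum(len(c.tail or "") for c in span)
        atoms = len(span.findall(X + "atom"))
        if end - start != text_len + atoms:
            problems.append(f"{label}: length is not {text_len} chars + {atoms} atoms")
        if span.get("layer") != layer.get("ID"):
            problems.append(f"{label}: layer attribute does not name the enclosing layer")
        for ref in span.get("ranges", "").split():
            if ref not in range_ids:
                problems.append(f"{label}: no range {ref} in this layer")
    return problems


def check_document(root):
    """Check the document layer and every annotation layer nested within it."""
    problems = []
    for layer in [root, *root.iter(X + "annotation")]:
        problems.extend(check_layer(layer))
    return problems


# Hypothetical samples; the atom name "nbsp" is illustrative only.
GOOD = """<x:document xmlns:x="http://lmnl-markup.org/ns/xLMNL" ID="N.1">
<x:content>
<x:span start="0" end="4" layer="N.1" ranges="R.1">1915</x:span>
<x:span start="4" end="7" layer="N.1">ab<x:atom name="nbsp"/></x:span>
</x:content>
<x:range start="0" end="4" ID="R.1" name="date"/>
</x:document>"""

BAD = """<x:document xmlns:x="http://lmnl-markup.org/ns/xLMNL" ID="N.1">
<x:content>
<x:span start="0" end="5" layer="N.1" ranges="R.2">1915</x:span>
</x:content>
<x:range start="4" end="0" ID="R.1" name="date"/>
</x:document>"""

print(check_document(ET.fromstring(GOOD)))  # []
for message in check_document(ET.fromstring(BAD)):
    print(message)
```

The second sample is reported three times: the range ends before it starts, the span's length does not match its content, and the span refers to a range that does not exist. Only text directly inside a span is counted toward its length, so the content of comments or of annotations on atoms does not contribute.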
Appendix C. Demonstrations and source code
A demonstration showing results of the Luminescent pipeline accompanies this paper,
in the Slides and Materials linked in the Proceedings. Unzip the package and open
index.html, which describes the examples and presents links for examining them.
Many browsers will now attempt to render the SVG examples, and may do a reasonable job.
But best results will be obtained from a fully conformant SVG viewer with panning and
zooming to arbitrary levels of scale. (Most browsers will not zoom in as far as you may
want to go.) Apache Squiggle (distributed with Batik) is recommended.
Source code for Luminescent is available on GitHub, at
https://github.com/wendellpiez/Luminescent.
References
[Bos 2005] Bos, Bert. “The XML data model.” 2005. See http://www.w3.org/XML/Datamodel.html.
[Cayless and Soroka 2010] Cayless, Hugh A., and Adam Soroka. “On Implementing
string-range() for TEI.” Presented at Balisage: The Markup Conference 2010 (Montréal,
Canada, August 3 - 6, 2010). In Proceedings of Balisage: The Markup Conference 2010.
Balisage Series on Markup Technologies, vol. 5 (2010).
doi:https://doi.org/10.4242/BalisageVol5.Cayless01.
[DeRose 2004] DeRose, Steven. “Markup Overlap: A Review and a Horse.” Presented at
Extreme Markup Languages 2004 (Montréal, Canada).
[Durusau and O'Donnell n.d.] Durusau, Patrick, and Matthew Brook O'Donnell. “JITTs
(Just-in-time Trees).” http://www.durusau.net/publications/NY_xml_sig.pdf.
[lmnl-markup.org] LMNL-markup.org. See
http://www.lmnl-markup.org.
[Piez 2004] Piez, Wendell. “Half-steps toward LMNL.” Presented at Extreme Markup
Languages 2004 (Montréal, Canada). See
http://www.piez.org/wendell/papers/LMNL-halfsteps.pdf.
[Piez 2010] Piez, Wendell. “Towards Hermeneutic Markup: An architectural outline.”
Presented at Digital Humanities 2010 (London, England). See
http://www.piez.org/wendell/papers/dh2010/index.html.
[Portier and Calabretto 2009] Portier, Pierre-Édouard, and Sylvie Calabretto. “Methodology for the construction
of multi-structured documents.” Presented at Balisage: The Markup Conference 2009
(Montréal,
Canada, August 11 - 14, 2009). In Proceedings of Balisage: The Markup
Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009).
doi:https://doi.org/10.4242/BalisageVol3.Portier01.
[Portier and Calabretto 2010] Portier, Pierre-Édouard, and Sylvie Calabretto. “Multi-structured documents and the
emergence of annotations vocabularies.” Presented at Balisage: The Markup Conference
2010,
Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The
Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010).
doi:https://doi.org/10.4242/BalisageVol5.Portier01.
[Pondorf and Witt 2010] Pondorf, Denis, and Andreas Witt. “Freestyle Markup Language:
Specification of an intuitive, powerful, polyhierarchical new extensible markup
language.” Presented at Balisage: The Markup Conference 2010 (Montréal, Canada,
August 3 - 6, 2010). In Proceedings of Balisage: The Markup Conference 2010. Balisage
Series on Markup Technologies, vol. 5 (2010).
doi:https://doi.org/10.4242/BalisageVol5.Pondorf01.
[Schmidt 2010] Schmidt, Desmond. “The inadequacy of embedded markup for cultural
heritage texts.” In Literary and Linguistic Computing (2010) 25 (3): 337-356.
doi:https://doi.org/10.1093/llc/fqq007.
[Sperberg-McQueen and Huitfeldt 1999] Sperberg-McQueen, Michael, and Claus Huitfeldt.
“Concurrent Document Hierarchies in MECS and SGML.” In Literary and Linguistic
Computing (1999) 14, pp. 29-42. doi:https://doi.org/10.1093/llc/14.1.29.
[Stegmann and Witt 2009] Stegmann, Jens, and Andreas Witt. “TEI Feature Structures as a
Representation Format for Multiple Annotation and Generic XML Documents.” Presented at
Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In
Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup
Technologies, vol. 3 (2009). doi:https://doi.org/10.4242/BalisageVol3.Stegmann01.
[Stührenberg and Jettka 2009] Stührenberg, Maik, and Daniel Jettka. “A toolkit for
multi-dimensional markup: The development of SGF to XStandoff.” Presented at Balisage:
The Markup Conference 2009 (Montréal, Canada, August 11 - 14, 2009). In Proceedings of
Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3
(2009). doi:https://doi.org/10.4242/BalisageVol3.Stuhrenberg01.
[Tennison and Piez 2002] Tennison, Jeni, and Wendell Piez. “The Layered Markup and
Annotation Language (LMNL).” Presented at Extreme Markup Languages 2002 (Montréal,
Canada).
[XDM] Berglund, Anders, Mary Fernández, Ashok Malhotra, Jonathan Marsh, Marton Nagy, and
Norman Walsh, eds. XQuery 1.0 and XPath 2.0 Data Model (XDM) (Second Edition). W3C
Recommendation 14 December 2010. http://www.w3.org/TR/xpath-datamodel/.
[XML Infoset] Cowan, John, and Richard Tobin, eds. XML Information Set (Second Edition).
W3C Recommendation 4 February 2004. http://www.w3.org/TR/xml-infoset/.
[XML Recommendation] Bray, Tim, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and
François Yergeau, eds. Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C
Recommendation 26 November 2008. http://www.w3.org/TR/REC-xml/.