For some time now, people interested in descriptive markup have been considering the problem of how best to handle overlapping structures in electronic representations of documents. There have been proposals for handling such overlap in SGML using CONCUR, for handling it in SGML or XML using application-level semantics (milestone elements, Trojan Horse markup, fragmentation and recombination using virtual elements of various kinds, standoff markup), for resurrecting CONCUR in the XML context, and for a variety of non-XML approaches (colored XML, LMNL, Just-in-Time trees, TexMecs, Goddag structures, EARMARK). The literature on the subject is still manageable, but it has grown to the point where it is hard to keep track even of the number of reviews of the literature.
The proliferation of proposals has led to some secondary phenomena which seem to be problems in their own right. Because there are so many proposals for dealing with overlap, it can be difficult to keep track of them all. Because so many of the papers describing them use only a few terse examples, it can be challenging to understand just how a proposal works in practice, and unclear just how any given proposal resembles or differs from other proposals made elsewhere. Most important of all, it is currently difficult to compare different techniques for dealing with overlap with each other and to reach well-founded conclusions about their relative convenience.
The MLCD Overlap Corpus (MOC) is a first step toward improving this situation. This
paper
describes the current state of MOC and future plans for the project.
Aims
The main immediate goal of the MOC project is to build a corpus of well-understood and well-documented examples of overlap, discontinuity, alternate ordering, and related phenomena in various notations, for use in the investigation of methods of recording such phenomena. Where possible, we would like to allow, indeed to encourage, a wider community to participate in and contribute to building the corpus. When the corpus has reached a suitable size and degree of completeness, we would also like to make it available for research and to encourage its use.
To address the concerns which led to the project, the MOC corpus should satisfy a number of requirements.
- It should provide illustrative examples to make it easier to understand various overlap solutions.
- It should provide readily available documentation of overlap proposals (with pointers to the original papers).
- Its samples should cover as wide a range of problems as is feasible, in the interests of seeing whether different proposals for overlap work better on different kinds of problems.
- The corpus may provide, or should at least support work toward, some kind of systematic categorization or typology of overlap problems.
- The samples in the corpus should be able to serve as a kind of testbed for the development of tools, including editors, translators, query languages, and so on.
- It should be able to serve as a testbed for head-to-head comparison of overlap solutions, by making it possible to build demonstration applications using the same documents in different encodings and to compare the volume and complexity of the code needed to support the different encodings.
It is currently difficult to compare different techniques for dealing with overlap and to apply concrete metrics to them. We believe that MOC may provide a testbed for application and tool development, and an empirical basis for answering questions such as:
- How successful is a given syntactic proposal in capturing the relevant information about a given document with overlapping structures?
- How verbose or succinct is the proposal's markup for the document?
- How complex is each proposal's markup (assuming it is possible to specify some quantitative measure of markup complexity)?
- How complex is the task of parsing a given syntax and mapping it to a given data structure?
- How successful is a given proposed data structure in capturing the relevant information about overlapping structures in a document?
- Given a representation of a particular document in a given data structure, how complex is the task of operating on that data structure in support of a given application using the document?
Content of the corpus
Structure
In its initial form, MOC will comprise three sets of samples:
- toy samples, typically just a few lines in length. Most toy samples are drawn from the literature on overlap. These samples usually reduce the problem of overlap to very minimal terms, which makes them helpful for highlighting the essential features of a particular overlap problem and the proposed solution. By the same token, they elide many of the details that must be handled in practical applications.
- short samples, each typically a few pages long. These samples are designed to be large enough to illustrate the interaction of overlapping structures with other problems of text encoding, but short enough to make it feasible to encode them multiple times by hand.
- long samples, each typically a complete document (e.g. a play, long short story, or novel). These samples are designed to be large enough to make it feasible to build simple text applications (e.g. interactive search and retrieval systems or text visualization systems) using the MOC samples as data, and to illuminate technical issues in the processing of overlapping structures. By current standards, however, none of the samples in this class are expected to be big in the sense of "big data".
Ancillary materials
Along with the samples, MOC records information about each sample group, notation,
vocabulary, and idiom used in the corpus, together with bibliographic references.
For
notations like XML and vocabularies like TEI, documentation is readily available and
MOC
makes no attempt to compete with other sources as regards completeness of its lists
of
bibliographic references. But for less commonly known notations, it is hoped that
MOC's
collection of information may be helpful to those seeking to learn more.
Each notation, vocabulary, idiom, sample, and sample group in MOC has a distinct
URI;
users can dereference the URI to see the information MOC has about the item in question.
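By way of illustration only (a hypothetical sketch; the actual catalog model is described in [Sperberg-McQueen 2010]), a catalog record for a sample might gather this information as follows:
<!-- Hypothetical catalog record; element and attribute names
     are invented here for illustration. -->
<sample xml:id="moc-s-0042"
        group="peter-paul"
        notation="xml"
        vocabulary="tei"
        idiom="next-prev"
        status="validated"/>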
Selection criteria
Since its purpose is to illuminate problems connected with overlap and with existing proposals for handling it, MOC does not attempt to make its selection of texts representative of any particular linguistic or textual population. (MOC is not a representative corpus in that sense.) For MOC, the relevant population is not a particular set of natural-language users, but the set of overlap-related problems encountered by people who work with natural-language texts for whatever purposes. Accordingly, MOC takes a resolutely opportunistic approach to samples; we will take samples anywhere we can find them. This is particularly visible in the toy samples: the current version of MOC includes among its toy samples many brief examples originally published in papers on overlap that have come to our attention. Opportunistic sampling is less fruitful when it comes to the short and long samples.
Since one of the purposes of MOC is to support investigations of different kinds of overlap as well as different ways of encoding overlap, the collections of short and long samples will, to the extent possible, reflect a variety of overlap phenomena and textual interests. In the absence of a well-grounded categorization of different kinds of overlap, it is difficult to be certain how many really different kinds of overlap there are, and which kinds are structurally and conceptually isomorphic. Lacking such a categorization, we hope to include examples of at least the following kinds of overlap:
- structural overlap and multiple hierarchies (as in verse drama, or physical and logical hierarchies [page vs paragraph], or in the analysis of the Peter / Paul example above into utterances and into syntactic units)
- overlapping annotation targets (as in fine-grained commentary on specific texts)
- change-history markup showing the revision of a text over time (of practical import for technical documentation, but also of interest for genetic editions)
- overlapping sites of textual variation (as in text-critical editions)
- discontinuous and disordered elements (as in cases where one text is quoted and commented on in another text, for which songs and plays-within-plays in drama provide examples; a well-known example in the overlap literature is the attempt of Hughie, Louis, and Dewey to remember a haiku)
We also hope to provide examples that illustrate the occurrence of overlap in texts and applications of interest in different communities:
- literary study
- lexicology
- metrical study
- language corpora (discourse analysis, syntax, prosody, ...)
- textual criticism
- document publishing
- documentary, historical-critical, genetic, and other scholarly editions
- analytical bibliography
- historical annotation
- legal documents
Work flow
Each sample in the corpus goes through the following processes, leading to the corresponding status:
- candidate: The sample has been collected and may or may not be included in the corpus proper. (We expect this will apply just to toy samples, but it may also apply to others.)
- projected: We have agreed in principle and in theory that we want this sample.
- planned: We have agreed on the desired properties of the sample in sufficient detail to allow data capture to proceed.
- incomplete: Data capture has begun but has not yet been completed.
- rough: Data capture has been completed, and the person who did the data capture has done an initial proofreading.
- validated (or wf-checked): The sample has been validated against all appropriate schemas, if there are any, or (if there is no schema) has been checked for well-formedness by some automatic tool.
At this point the paths divide. Toy samples and short samples undergo repeated proofreadings (the initial plan is to do three proofreadings for each, but that plan has not yet been put to the test). Long samples are, we assume, too long for multiple proofreadings (or possibly even for one). Instead, we perform a single proofreading and several spot checks.
Once it reaches the validated state, each long sample acquires a list of spot checks to be performed. One by one, not necessarily in any prescribed order, the prescribed checks are performed. Each check results, possibly, in corrections and re-validation, and possibly in the addition of new checks to the to-check list. Whenever we notice something odd or amiss in the document, especially if it could be a systematic problem, a new task is added to the to-check list (assuming we can devise a way to check systematically for the error in question).
It is not yet clear exactly what spot checks we need to do; we expect them to vary with the notation, the idioms, the sample, etc. But some examples may make the idea clearer:
- When feasible, a spell-checker is used to check the text for typographic errors.
- A selected one-, ten-, or one-hundred-percent sample of markup constructs (typically occurrences of particular element types or attributes) is checked for semantic plausibility (their syntactic correctness having already been guaranteed by validation). For example, we might spot-check one percent, ten percent, or all of the markup used for page breaks, page numbers, the TEI part attribute, the next and prev attributes, the join element, instances of markup for discontinuous elements, instances of fragmented elements, or Trojan horses, to make sure they are semantically correct. The specific constructs that need checking will, of course, typically depend on the idiom used; a sketch of one such check follows this list.
Systematic errors found in spot checks are fixed in whatever way we can manage.
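To make the idea concrete, here is a minimal Schematron sketch of a check on the next/prev idiom. It is our own illustration, not part of any published MOC tooling, and it assumes that id, next, and prev are plain attributes; a check like this verifies only that the links are reciprocal, so the semantic plausibility of the fragmentation itself still needs human inspection.
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <!-- every element with a next attribute must be pointed back at -->
    <sch:rule context="*[@next]">
      <sch:assert test="//*[@id = current()/@next]/@prev = @id">
        The element named by @next must point back via @prev.
      </sch:assert>
    </sch:rule>
    <!-- and symmetrically for prev -->
    <sch:rule context="*[@prev]">
      <sch:assert test="//*[@id = current()/@prev]/@next = @id">
        The element named by @prev must point back via @next.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>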
Current status
The ultimate aim of MOC is to provide a fully populated matrix of materials: for each
sample group, one sample in each relevant combination of notation, vocabulary, and
idiom.
As a first step towards this larger goal we have built a prototype corpus of toy samples (MOC-POC), as a proof of concept. MOC-POC currently contains 52 samples distributed over 14 sample groups, 4 notations, and 6 idioms.
The text fragments comprising the samples of MOC-POC are taken from a selection of
research publications on the overlap problem. This prototype does not claim any kind
of
completeness; it has, however, successfully identified a number of weak spots in our
initial
design.
Notations currently represented in MOC-POC are:
- XML
- XConcur
- LMNL saw-tooth notation (a sketch follows this list)
- TexMecs
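To give a flavor of the saw-tooth notation, here is our own sketch (not a sample drawn from the corpus): in LMNL, ranges may overlap freely, with start-tags written [s} and end-tags {s], so a sentence that crosses a speech boundary needs no fragmentation.
[sp}[speaker}Peter{speaker]
[s}Hey, Paul!{s] [s}Would you pass me {sp]
[sp}[speaker}Paul{speaker][stage}Handing him the hammer{stage]
the hammer?{s]{sp]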
Most of the samples are encoded using a vocabulary taken from or based on some version
of
TEI, but ad hoc vocabularies are also represented.
Samples encoded in XML use six different idioms to resolve overlap problems:
- Fragmentation using the next and prev attributes defined as part of the TEI tag set for segmentation and alignment.
- Fragmentation using the part attribute provided for certain elements in the TEI vocabulary. (If a TEI-encoded example requires fragmentation of an element for which TEI provides no part attribute, the attribute is added.)
- Fragmentation using the part and id attributes and the join element defined as part of the TEI tag set for segmentation and alignment.
- The "Trojan horse 1" idiom uses milestone tags to resolve overlap. Milestones are used only when necessary; normal XML elements are used in all other cases. Which elements to mark with milestones is left to the encoder's choice.
- "Trojan horse 2", like "Trojan horse 1", uses milestone tags to resolve overlap. However, the Trojan horse 2 idiom represents every element involved in the overlap as a pair of milestones with intervening content. (A sketch of the milestone technique follows this list.)
- The "XStandoff" idiom uses XML-conformant markup that points to the character data ("primary data"), which is kept in a separate location.
Preliminary results
A few preliminary results of our work on MOC can be mentioned.
Our attempts to explore the solution space of techniques like TEI-style fragmentation led very quickly to the realization that the TEI's techniques for handling overlapping structures (here we will use the next and prev attributes as an example, but the same observations apply to all the techniques described by the TEI) do not in themselves fully determine the encoding of a given sample, even when there is no uncertainty about which textual features are to be encoded. This is not surprising in itself; the TEI almost always leaves a great deal of leeway to the individual project and its encoding policies. But it does mean that a full description of how the TEI is used to encode a given sample must go beyond saying that the next and prev attributes are used.
When next and prev are used, an overlap of two logical elements is resolved by breaking one of the logical elements into smaller pieces (fragmentation) and using next and prev to signal that each XML element is just a fragment of the original logical element. For example, the Peter/Paul example given earlier might be encoded this way:
<sp>
  <speaker>Peter</speaker>
  <p>
    <s id="s1">Hey, Paul!</s>
    <!-- the sentence is fragmented: s2a continues as s2b below -->
    <s id="s2a" next="s2b">Would you pass me </s>
    —
  </p>
</sp>
<sp>
  <speaker>Paul</speaker>
  <stage>Handing him the hammer</stage>
  <p>—
    <s id="s2b" prev="s2a">the hammer?</s>
  </p>
</sp>
Here sentences are tagged using the s (sentence-unit) element, and the second sentence is fragmented to fit within the hierarchy defined by the speech elements. It would be logically possible, however, to break the speech elements as needed to fit within the s-unit hierarchy:
<p>
  <s id="s1">
    <sp who="Peter" id="sp1a" next="sp1b">Hey, Paul!</sp>
  </s>
  <s id="s2">
    <!-- here the speech, not the sentence, is fragmented -->
    <sp id="sp1b" prev="sp1a">Would you pass me —</sp>
    <sp id="sp2" who="Paul">— the hammer?</sp>
  </s>
</p>
In order to be usable, an encoding using next and prev to resolve overlap problems will need to be consistent in choosing which logical elements to fragment and which to leave intact. In some cases, it will suffice to say, for each element type in the vocabulary, whether or not it is to be fragmented in case of need. Those elements which are never to be fragmented or modified are referred to, jokingly, as sacred; the others, in contrast, as profane. But a binary classification of element types as either sacred or profane suffices only when every pair of overlapping elements has one sacred and one profane member: it does not provide adequate guidance when both elements in the pair are sacred, or both profane. In more complex cases, therefore, it may be desirable to formulate a scale of values assigning each element type a degree of sacredness or profanity, and to ensure that no two element types which overlap each other have the same value. Then the rule can be formulated: for any pair of overlapping logical elements, represent the more sacred logical element as a single XML element, and fragment the less sacred element in order to make the XML elements nest.
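Applied to the Peter / Paul example, such a scale might look like this; the values are hypothetical, purely for illustration:
<!-- Hypothetical sacredness scale: sp = 2, s = 1.
     sp outranks s, so s is fragmented, yielding the first
     encoding above; swapping the values (sp = 1, s = 2) would
     make sp the fragmented element, yielding the second. -->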
The sacred / profane distinction has been picked up (and stretched into a slightly
different shape) by [Marinelli / Vitali / Zacchiroli 2008].
Conclusion and future work
The MOC project has been presented to markup-related communities on three occasions:
a
poster session at Digital Humanities 2010 in London, a nocturne at Balisage 2010,
and a
talk at the TEI 2010 Members Meeting in Zadar, Croatia. In all cases, the response
of
participants suggested that a corpus along the lines envisaged for MOC may meet a
need of
the community. The nocturne, in particular, led to the creation of a mailing list for
project-related discussion at Brown University (rather quiet so far, but still
there).
As already mentioned, however, MOC is still a work in progress. More lies ahead than behind.
Our first task is to make a first version of MOC which is reasonably complete and suitable for at least some of its intended uses. The steps we intend to take are:
- Concerning the technical infrastructure:
  - Finalize and document decisions on the corpus repository structure and linking possibilities and mechanisms.
  - Adjust the structure of the current repository to conform to the above decisions.
  - Build, test, and deploy a multi-user and user-friendly interface to the repository.
- Call for contributions from any group or community interested in overlap to take part in populating the corpus.
- Develop collaboration and work-organization strategies (including funding).
- Populate the corpus up to a critical-mass size (including full-size samples):
  - systematic extension of the bibliography
  - selection of useful toy examples from the literature
  - identification of a small but illustrative set of idioms to be illustrated
  - selection and careful encoding of a small set of small examples
  - selection and careful encoding of a (very) small set of large examples
  - systematic encoding of all examples in all applicable notations, vocabularies, and idioms
Once MOC has something like a critical mass of samples, it should be possible to use it to investigate and illustrate the relative merits of various encodings, building applications that operate on the data: for example, displays using visualizations like those developed by Wendell Piez for demonstrations of LMNL, or simple search and retrieval interfaces.
Such applications should make it possible to explore the suggestion by Fabio Vitali
and
his research team [Di Iorio et al. 2009] that SPARQL might be a more useful query
language for overlapping structures than the various extensions to XPath described
in the
literature.
References
[ACH/ACL/ALLC 1994] Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing. 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
[Barnard et al. 1988] Barnard, D., Hayter, R., Karababa, M., Logan, G. and McFadden, J. 1988. SGML Markup for Literary Texts. Computers and the Humanities 22: 265-276. doi:https://doi.org/10.1007/BF00118602.
[Barnard et al. 1995] Barnard, D., Burnard, L., Gaspart, J. P., Price, L. A., Sperberg-McQueen, C. M. and Varile, G. B. 1995. Hierarchical encoding of text: Technical problems and SGML solutions. Computers and the Humanities 29: 211-231. doi:https://doi.org/10.1007/BF01830617.
[Carletta et al. 2005] Carletta, J., Evert, S., Heid, U. and Kilgour, J. 2005. The NITE XML Toolkit: data model and query. Language Resources and Evaluation 39(4): 313-334. doi:https://doi.org/10.1007/s10579-006-9001-9.
[Chatti et al. 2007] Chatti, N., Kaouk, S., Calabretto, S. and Pinon, J. M. 2007. MultiX: an XML-based formalism to encode multi-structured documents. Proceedings of Extreme Markup Languages 2007, Montréal (Canada), Aug. 2007. http://conferences.idealliance.org/extreme/html/2007/Chatti01/EML2007Chatti01.html
[DeRose 2004] DeRose, Steven. 2004. Markup overlap: A review and a horse. Proceedings of Extreme Markup Languages 2004, Montréal (Canada), Aug. 2004. http://conferences.idealliance.org/extreme/html/2004/DeRose01/EML2004DeRose01.html
[Di Iorio et al. 2009] Di Iorio, A.; Peroni, S.; and Vitali, F. Towards markup support for full GODDAGs and beyond: the EARMARK approach. Proceedings of Balisage: The Markup Conference 2009, Montréal (Canada), August 11-14, 2009. doi:https://doi.org/10.4242/BalisageVol3.Peroni01.
[Durusau and O’Donnell 2002] Durusau, Patrick and O’Donnell, Matthew Brook. Coming down from the trees: Next step in the evolution of markup? Proceedings of Extreme Markup Languages® 2002. http://www.durusau.net/publications/Down_from_the_trees.pdf
[Hilbert et al. 2005] Hilbert, Mirco; Schonefeld, Oliver; and Witt, Andreas. Making CONCUR work. Proceedings of Extreme Markup Languages® 2005. http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml
[Huitfeldt and Marcoux 2010] Huitfeldt, Claus and Marcoux, Yves. The MLCD overlap corpus: A markup research infrastructure. Presented at the TEI Members Meeting 2010, Zadar (Croatia).
[Huitfeldt and Sperberg-McQueen 2003] Huitfeldt, Claus and Sperberg-McQueen, C. M. TexMECS: An experimental markup meta-language for complex documents. Working paper of the project Markup Languages for Complex Documents (MLCD), University of Bergen, January 2001, rev. October 2003. http://mlcd.blackmesatech.com/mlcd/2003/Papers/texmecs.html
[Huitfeldt et al. 2010] Huitfeldt, Claus; Sperberg-McQueen, C. M.; and Marcoux, Yves. The MLCD Overlap Corpus (MOC). Poster presented at the Digital Humanities 2010 Conference, King's College, London, 7-10 July 2010. http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/html/ab-633.html
[Jagadish et al. 2004] Jagadish, H. V.; Lakshmanan, L. V. S.; Scannapieco, M.; Srivastava, D.; and Wiwatwattana, N. Colorful XML: one hierarchy isn't enough. Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France, pp. 251-262, 2004. doi:https://doi.org/10.1145/1007568.1007598.
[Marinelli / Vitali / Zacchiroli 2008] Marinelli, Paolo; Vitali, Fabio; Zacchiroli, Stefano. Towards the unification of formats for overlapping markup. The New Review of Hypermedia and Multimedia 14: 57-94. doi:https://doi.org/10.1080/13614560802316145; see http://en.scientificcommons.org/38517317, http://www.tandfonline.com/doi/full/10.1080/13614560802316145, and http://hal.archives-ouvertes.fr/docs/00/34/05/78/PDF/nrhm-overlapping-conversions.pdf
[Schonefeld 2007] Schonefeld, Oliver. XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup. In Georg Rehm, Andreas Witt, and Lothar Lemnitzer (eds.), Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures for linguistic resources and applications: Proceedings of the Biennial GLDV Conference 2007. Tübingen: Gunter Narr Verlag, pp. 347-356, 2007. See also http://www.xconcur.org/.
[Sperberg-McQueen 2010] Sperberg-McQueen, C. M. MOC catalog and maintenance plan. http://mlcd.blackmesatech.com/mlcd/2010/A/moc-catalog-sketch.xml (the formal model itself, in Alloy, is available at http://mlcd.blackmesatech.com/mlcd/2010/A/moc-catalog.als).
[Sperberg-McQueen / Huitfeldt 1998] Sperberg-McQueen, C. M. and Huitfeldt, Claus. 1998. Concurrent Document Hierarchies in MECS and SGML. Literary and Linguistic Computing 14: 29-42.
[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C. M. and Huitfeldt, Claus. GODDAG: A Data Structure for Overlapping Hierarchies. In Peter R. King and Ethan V. Munson (eds.), Digital documents: systems and principles. Lecture Notes in Computer Science 2023. Berlin: Springer, 2004, pp. 139-160. Paper given at Digital Documents: Systems and Principles, 8th International Conference on Digital Documents and Electronic Publishing, DDEP 2000, 5th International Workshop on the Principles of Digital Document Processing, PODDP 2000, Munich, Germany, September 13-15, 2000. http://www.springerlink.com/content/98j1vbu5nby73ul3/?p=4eefed0ac09e4ee381d09d3ac2afcb46&pi=8; http://cmsmcq.com/2000/poddp2000.html; http://www.w3.org/People/cmsmcq/2000/poddp2000.html
[Stührenberg / Goecke 2008] Stührenberg, M. and Goecke, D. 2008. SGF — An integrated model for multiple annotations and its application in a linguistic domain. Proceedings of Balisage: The Markup Conference 2008, Montréal (Canada), August 12-15, 2008. http://www.balisage.net/Proceedings/vol1/html/Stuehrenberg01/BalisageVol1-Stuehrenberg01.html. doi:https://doi.org/10.4242/BalisageVol1.Stuehrenberg01.
[Stührenberg and Jettka 2009] Stührenberg, M. and Jettka, D. A toolkit for multi-dimensional markup: The development of SGF to XStandoff. Proceedings of Balisage: The Markup Conference 2009, Montréal (Canada), August 11-14, 2009. doi:https://doi.org/10.4242/BalisageVol3.Stuhrenberg01.
[TEI 2007] TEI Consortium. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Ed. Lou Burnard and Syd Bauman. Oxford, Providence, Charlottesville, Nancy: The TEI Consortium, 2007, rev. 2010.
[Tennison and Piez 2002] Tennison, Jeni and Piez, Wendell. The Layered Markup and Annotation Language (LMNL). Proceedings of Extreme Markup Languages® 2002. http://conferences.idealliance.org/extreme/html/2002/Tennison02/EML2002Tennison02.html (abstract only). Some information on LMNL can be found at http://www.piez.org/wendell/LMNL/lmnl-page.html.
[Witt 2004] Witt, Andreas. 2004. Multiple hierarchies: new aspects of an old solution. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Witt01/EML2004Witt01.html
[Witt / Lüngen / Goecke 2005] Witt, A., Lüngen, H., Sasaki, F. and Goecke, D. 2005. Unification of XML Documents with Concurrent Markup. Literary and Linguistic Computing 20(1): 103-116. doi:https://doi.org/10.1093/llc/fqh046.