For some time now, people interested in descriptive markup have been considering the problem of how best to handle overlapping structures in electronic representations of documents. There have been proposals for handling such overlap in SGML using CONCUR, for handling it in SGML or XML using application-level semantics (milestone elements, Trojan Horse markup, fragmentation and recombination using virtual elements of various kinds, standoff markup), for resurrecting CONCUR in the XML context, and for a variety of non-XML approaches (colored XML, LMNL, Just-in-Time trees, TexMecs, Goddag structures, EARMARK). The literature on the subject is still manageable, but it has grown to the point where it is hard to keep track even of the number of reviews of the literature.
The proliferation of proposals has led to some secondary phenomena which seem to be problems in their own right. Because there are so many proposals for dealing with overlap, it can be difficult to keep track of them all. Because so many of the papers describing them use only a few terse examples, it can be challenging to understand just how a proposal works in practice, and unclear just how any given proposal resembles or differs from other proposals made elsewhere. Most important of all, it is currently difficult to compare different techniques for dealing with overlap with each other and to reach well-founded conclusions about their relative convenience.
The MLCD Overlap Corpus (MOC) is a first step toward improving this situation. This
paper
describes the current state of MOC and future plans for the project.
Aims
The main immediate goal of the MOC project is to build a corpus of well-understood and well-documented examples of overlap, discontinuity, alternate ordering, and related phenomena in various notations, for use in the investigation of methods of recording such phenomena. Where possible, we would like to allow, indeed to encourage, a wider community to participate in and contribute to building the corpus. When the corpus has reached a suitable size and degree of completeness, we would also like to make it available for research and to encourage its use.
To address the concerns which led to the project, the MOC corpus should satisfy a number of requirements.
- It should provide illustrative examples to make it easier to understand various overlap solutions.
- It should provide readily available documentation of overlap proposals (with pointers to the original papers).
- Its samples should cover as wide a range of problems as is feasible, in the interests of seeing whether different proposals for overlap work better on different kinds of problems.
- The corpus may provide, or should at least support work toward, some kind of systematic categorization or typology of overlap problems.
- The samples in the corpus should be able to serve as a kind of testbed for the development of tools, including editors, translators, query languages, and so on.
- It should be able to serve as a testbed for head-to-head comparison of overlap solutions, by making it possible to build demonstration applications using the same documents in different encodings and to compare the volume and complexity of the code needed to support the different encodings.
It is currently difficult to compare different techniques for dealing with overlap and to apply concrete metrics to them. We believe that MOC may provide a testbed for application and tool development, and an empirical basis for answering questions such as:
- How successful is a given syntactic proposal in capturing the relevant information about a given document with overlapping structures?
- How verbose or succinct is the proposal's markup for the document?
- How complex is each proposal's markup (assuming it is possible to specify some quantitative measure of markup complexity)?
- How complex is the task of parsing a given syntax and mapping it to a given data structure?
- How successful is a given proposed data structure in capturing the relevant information about overlapping structures in a document?
- Given a representation of a particular document in a given data structure, how complex is the task of operating on that data structure in support of a given application using the document?
Content of the corpus
Structure
In its initial form, MOC will comprise three sets of samples:
- toy samples, typically just a few lines in length. Most toy samples are drawn from the literature on overlap. These samples usually reduce the problem of overlap to very minimal terms, which makes them helpful for highlighting the essential features of a particular overlap problem and the proposed solution. By the same token, they elide many of the details that must be handled in practical applications.
- short samples, each typically a few pages long. These samples are designed to be large enough to illustrate the interaction of overlapping structures with other problems of text encoding, but short enough to make it feasible to encode them multiple times by hand.
- long samples, each typically a complete document (e.g. a play, long short story, or novel). These samples are designed to be large enough to make it feasible to build simple text applications (e.g. interactive search and retrieval systems or text visualization systems) using the MOC samples as data, and to illuminate technical issues in the processing of overlapping structures. By current standards, however, none of the samples in this class are expected to be big in the sense of "big data".
Ancillary materials
Along with the samples, MOC records information about each sample group, notation,
vocabulary, and idiom used in the corpus, together with bibliographic references.
For
notations like XML and vocabularies like TEI, documentation is readily available and
MOC
makes no attempt to compete with other sources as regards completeness of its lists
of
bibliographic references. But for less commonly known notations, it is hoped that
MOC's
collection of information may be helpful to those seeking to learn more.
Each notation, vocabulary, idiom, sample, and sample group in MOC has a distinct
URI;
users can dereference the URI to see the information MOC has about the item in question.
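By way of illustration only (a hypothetical sketch; the actual catalog model is described in [Sperberg-McQueen 2010]), a catalog record for a sample might gather this information as follows:
<!-- Hypothetical catalog record; element and attribute names
     are invented here for illustration. -->
<sample xml:id="moc-s-0042"
        group="peter-paul"
        notation="xml"
        vocabulary="tei"
        idiom="next-prev"
        status="validated"/>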
Selection criteria
Since its purpose is to illuminate problems connected with overlap and with existing proposals for handling it, MOC does not attempt to make its selection of texts representative of any particular linguistic or textual population. (MOC is not a representative corpus in that sense.) For MOC, the relevant population is not a particular set of natural-language users, but the set of overlap-related problems encountered by people who work with natural-language texts for whatever purposes. Accordingly, MOC takes a resolutely opportunistic approach to samples; we will take samples anywhere we can find them. This is particularly visible in the toy samples: the current version of MOC includes among its toy samples many brief examples originally published in papers on overlap that have come to our attention. Opportunistic sampling is less fruitful when it comes to the short and long samples.
Since one of the purposes of MOC is to support investigations of different kinds of overlap as well as different ways of encoding overlap, the collections of short and long samples will, to the extent possible, reflect a variety of overlap phenomena and textual interests. In the absence of a well-grounded categorization of different kinds of overlap, it is difficult to be certain how many really different kinds of overlap there are, and which kinds are structurally and conceptually isomorphic. Lacking such a categorization, we hope to include examples of at least the following kinds of overlap:
- structural overlap and multiple hierarchies (as in verse drama, or physical and logical hierarchies [page vs paragraph], or in the analysis of the Peter / Paul example above into utterances and into syntactic units)
- overlapping annotation targets (as in fine-grained commentary on specific texts)
- change-history markup showing the revision of a text over time (of practical import for technical documentation, but also of interest for genetic editions)
- overlapping sites of textual variation (as in text-critical editions)
- discontinuous and disordered elements (as in cases where one text is quoted and commented on in another text, for which songs and plays-within-plays in drama provide examples; a well-known example in the overlap literature is the attempt of Hughie, Louis, and Dewey to remember a haiku)
We also hope to provide examples that illustrate the occurrence of overlap in texts and applications of interest in different communities:
- literary study
- lexicology
- metrical study
- language corpora (discourse analysis, syntax, prosody, ...)
- textual criticism
- document publishing
- documentary, historical-critical, genetic, and other scholarly editions
- analytical bibliography
- historical annotation
- legal documents
Work flow
Each sample in the corpus goes through the following processes, leading to the corresponding status:
- candidate: The sample has been collected and may or may not be included in the corpus proper. (We expect this will apply just to toy samples, but it may also apply to others.)
- projected: We have agreed in principle and in theory that we want this sample.
- planned: We have agreed on the desired properties of the sample in sufficient detail to allow data capture to proceed.
- incomplete: Data capture has begun but has not yet been completed.
- rough: Data capture has been completed, and the person who did the data capture has done an initial proofreading.
- validated (or wf-checked): The sample has been validated against all appropriate schemas, if there are any, or (if there is no schema) has been checked for well-formedness by some automatic tool.
At this point the paths divide. Toy samples and short samples undergo repeated proofreadings (the initial plan is to do three proofreadings for each, but that plan has not yet been put to the test). Long samples are, we assume, too long for multiple proofreadings (or possibly even for one). Instead, we perform a single proofreading and several spot checks.
Once it reaches the validated state, each long sample acquires a list of spot checks to be performed. One by one, not necessarily in any prescribed order, the prescribed checks are performed. Each check results, possibly, in corrections and re-validation, and possibly in the addition of new checks to the to-check list. Whenever we notice something odd or amiss in the document, especially if it could be a systematic problem, a new task is added to the to-check list (assuming we can devise a way to check systematically for the error in question).
It is not yet clear exactly what spot checks we need to do; we expect them to vary with the notation, the idioms, the sample, etc. But some examples may make the idea clearer:
- When feasible, a spell-checker is used to check the text for typographic errors.
- A selected one-, ten-, or one-hundred-percent sample of markup constructs (typically occurrences of particular element types or attributes) is checked for semantic plausibility (their syntactic correctness having already been guaranteed by validation). For example, we might spot-check one percent, ten percent, or all of the markup used for page breaks, page numbers, the TEI part attribute, the next and prev attributes, the join element, instances of markup for discontinuous elements, instances of fragmented elements, or Trojan horses, to make sure they are semantically correct. The specific constructs that need checking will, of course, typically depend on the idiom used; a sketch of one such check follows this list.
Systematic errors found in spot checks are fixed in whatever way we can manage.
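To make the idea concrete, here is a minimal Schematron sketch of a check on the next/prev idiom. It is our own illustration, not part of any published MOC tooling, and it assumes that id, next, and prev are plain attributes; a check like this verifies only that the links are reciprocal, so the semantic plausibility of the fragmentation itself still needs human inspection.
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <!-- every element with a next attribute must be pointed back at -->
    <sch:rule context="*[@next]">
      <sch:assert test="//*[@id = current()/@next]/@prev = @id">
        The element named by @next must point back via @prev.
      </sch:assert>
    </sch:rule>
    <!-- and symmetrically for prev -->
    <sch:rule context="*[@prev]">
      <sch:assert test="//*[@id = current()/@prev]/@next = @id">
        The element named by @prev must point back via @next.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>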
Current status
The ultimate aim of MOC is to provide a fully populated matrix of materials: for each
sample group, one sample in each relevant combination of notation, vocabulary, and
idiom.
As a first step towards this larger goal we have built a prototype corpus of toy samples (MOC-POC), as a proof of concept. MOC-POC currently contains 52 samples distributed over 14 sample groups, 4 notations, and 6 idioms.
The text fragments comprising the samples of MOC-POC are taken from a selection of
research publications on the overlap problem. This prototype does not claim any kind
of
completeness; it has, however, successfully identified a number of weak spots in our
initial
design.
Notations currently represented in MOC-POC are:
- XML
- XConcur
- LMNL saw-tooth notation (a sketch follows this list)
- TexMecs
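To give a flavor of the saw-tooth notation, here is our own sketch (not a sample drawn from the corpus): in LMNL, ranges may overlap freely, with start-tags written [s} and end-tags {s], so a sentence that crosses a speech boundary needs no fragmentation.
[sp}[speaker}Peter{speaker]
[s}Hey, Paul!{s] [s}Would you pass me {sp]
[sp}[speaker}Paul{speaker][stage}Handing him the hammer{stage]
the hammer?{s]{sp]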
Most of the samples are encoded using a vocabulary taken from or based on some version
of
TEI, but ad hoc vocabularies are also represented.
Samples encoded in XML use six different idioms to resolve overlap problems:
- Fragmentation using the next and prev attributes defined as part of the TEI tag set for segmentation and alignment.
- Fragmentation using the part attribute provided for certain elements in the TEI vocabulary. (If a TEI-encoded example requires fragmentation of an element for which TEI provides no part attribute, the attribute is added.)
- Fragmentation using the part and id attributes and the join element defined as part of the TEI tag set for segmentation and alignment.
- The "Trojan horse 1" idiom uses milestone tags to resolve overlap. Milestones are used only when necessary; normal XML elements are used in all other cases. Which elements to mark with milestones is left to the encoder's choice.
- "Trojan horse 2", like "Trojan horse 1", uses milestone tags to resolve overlap. However, the Trojan horse 2 idiom represents every element involved in the overlap as a pair of milestones with intervening content. (A sketch of the milestone technique follows this list.)
- The "XStandoff" idiom uses XML-conformant markup that points to the character data ("primary data"), which is kept in a separate location.
Preliminary results
A few preliminary results of our work on MOC can be mentioned.
Our attempts to explore the solution space of techniques like TEI-style fragmentation led very quickly to the realization that the TEI's techniques for handling overlapping structures (here we will use the next and prev attributes as an example, but the same observations apply to all the techniques described by the TEI) do not in themselves fully determine the encoding of a given sample, even when there is no uncertainty about which textual features are to be encoded. This is not surprising in itself; the TEI almost always leaves a great deal of leeway to the individual project and its encoding policies. But it does mean that a full description of how the TEI is used to encode a given sample must go beyond saying that the next and prev attributes are used.
When next and prev are used, an overlap of two logical elements is resolved by breaking one of the logical elements into smaller pieces (fragmentation) and using next and prev to signal that each XML element is just a fragment of the original logical element. For example, the Peter/Paul example given earlier might be encoded this way:
<sp>
  <speaker>Peter</speaker>
  <p>
    <s id="s1">Hey, Paul!</s>
    <!-- the sentence is fragmented: s2a continues as s2b below -->
    <s id="s2a" next="s2b">Would you pass me </s>
    —
  </p>
</sp>
<sp>
  <speaker>Paul</speaker>
  <stage>Handing him the hammer</stage>
  <p>—
    <s id="s2b" prev="s2a">the hammer?</s>
  </p>
</sp>
Here sentences are tagged using the s (sentence-unit) element, and the second sentence is fragmented to fit within the hierarchy defined by the speech elements. It would be logically possible, however, to break the speech elements as needed to fit within the s-unit hierarchy:
<p>
  <s id="s1">
    <sp who="Peter" id="sp1a" next="sp1b">Hey, Paul!</sp>
  </s>
  <s id="s2">
    <!-- here the speech, not the sentence, is fragmented -->
    <sp id="sp1b" prev="sp1a">Would you pass me —</sp>
    <sp id="sp2" who="Paul">— the hammer?</sp>
  </s>
</p>
In order to be usable, an encoding using next and prev to resolve overlap problems will need to be consistent in choosing which logical elements to fragment and which to leave intact. In some cases, it will suffice to say, for each element type in the vocabulary, whether or not it is to be fragmented in case of need. Those elements which are never to be fragmented or modified are referred to, jokingly, as sacred; the others, in contrast, as profane. But a binary classification of element types as either sacred or profane suffices only when every pair of overlapping elements has one sacred and one profane member: it does not provide adequate guidance when both elements in the pair are sacred, or both profane. In more complex cases, therefore, it may be desirable to formulate a scale of values assigning each element type a degree of sacredness or profanity, and to ensure that no two element types which overlap each other have the same value. Then the rule can be formulated: for any pair of overlapping logical elements, represent the more sacred logical element as a single XML element, and fragment the less sacred element in order to make the XML elements nest.
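Applied to the Peter / Paul example, such a scale might look like this; the values are hypothetical, purely for illustration:
<!-- Hypothetical sacredness scale: sp = 2, s = 1.
     sp outranks s, so s is fragmented, yielding the first
     encoding above; swapping the values (sp = 1, s = 2) would
     make sp the fragmented element, yielding the second. -->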
The sacred / profane distinction has been picked up (and stretched into a slightly
different shape) by [Marinelli / Vitali / Zacchiroli 2008].
Conclusion and future work
The MOC project has been presented to markup-related communities on three occasions:
a
poster session at Digital Humanities 2010 in London, a nocturne at Balisage 2010,
and a
talk at the TEI 2010 Members Meeting in Zadar, Croatia. In all cases, the response
of
participants suggested that a corpus along the lines envisaged for MOC may meet a
need of
the community. The nocturne, in particular, led to the creation of a mailing list for
project-related discussion at Brown University (rather quiet so far, but still
there).
As already mentioned, however, MOC is still a work in progress. More lies ahead than behind.
Our first task is to make a first version of MOC which is reasonably complete and suitable for at least some of its intended uses. The steps we intend to take are:
- Concerning the technical infrastructure:
  - Finalize and document decisions on the corpus repository structure and linking possibilities and mechanisms.
  - Adjust the structure of the current repository to conform to the above decisions.
  - Build, test, and deploy a multi-user and user-friendly interface to the repository.
- Call for contributions from any group or community interested in overlap to take part in populating the corpus.
- Develop collaboration and work-organization strategies (including funding).
- Populate the corpus up to a critical-mass size (including full-size samples):
  - systematic extension of the bibliography
  - selection of useful toy examples from the literature
  - identification of a small but illustrative set of idioms to be illustrated
  - selection and careful encoding of a small set of small examples
  - selection and careful encoding of a (very) small set of large examples
  - systematic encoding of all examples in all applicable notations, vocabularies, and idioms
Once MOC has something like a critical mass of samples, it should be possible to use it to investigate and illustrate the relative merits of various encodings, building applications that operate on the data: for example, displays using visualizations like those developed by Wendell Piez for demonstrations of LMNL, or simple search and retrieval interfaces.
Such applications should make it possible to explore the suggestion by Fabio Vitali
and
his research team [Di Iorio et al. 2009] that SPARQL might be a more useful query
language for overlapping structures than the various extensions to XPath described
in the
literature.
References
[ACH/ACL/ALLC 1994] Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing. 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
[Barnard et al. 1988] Barnard, D., Hayter, R., Karababa, M., Logan, G. and McFadden, J. 1988. SGML Markup for Literary Texts. Computers and the Humanities 22: 265-276. doi:https://doi.org/10.1007/BF00118602.
[Barnard et al. 1995] Barnard, D., Burnard, L., Gaspart, J. P., Price, L. A., Sperberg-McQueen, C. M. and Varile, G. B. 1995. Hierarchical encoding of text: Technical problems and SGML solutions. Computers and the Humanities 29: 211-231. doi:https://doi.org/10.1007/BF01830617.
[Carletta et al. 2005] Carletta, J., Evert, S., Heid, U. and Kilgour, J. 2005. The NITE XML Toolkit: data model and query. Language Resources and Evaluation 39(4): 313-334. doi:https://doi.org/10.1007/s10579-006-9001-9.
[Chatti et al. 2007] Chatti, N., Kaouk, S., Calabretto, S. and Pinon, J. M. 2007. MultiX: an XML-based formalism to encode multi-structured documents. Proceedings of Extreme Markup Languages 2007, Montréal (Canada), Aug. 2007. http://conferences.idealliance.org/extreme/html/2007/Chatti01/EML2007Chatti01.html
[DeRose 2004] DeRose, Steven. 2004. Markup overlap: A review and a horse. Proceedings of Extreme Markup Languages 2004, Montréal (Canada), Aug. 2004. http://conferences.idealliance.org/extreme/html/2004/DeRose01/EML2004DeRose01.html
[Di Iorio et al. 2009] Di Iorio, A.; Peroni, S.; and Vitali, F. Towards markup support for full GODDAGs and beyond: the EARMARK approach. Proceedings of Balisage: The Markup Conference 2009, Montréal (Canada), August 11-14, 2009. doi:https://doi.org/10.4242/BalisageVol3.Peroni01.
[Durusau and O’Donnell 2002] Durusau, Patrick and O’Donnell, Matthew Brook. Coming down from the trees: Next step in the evolution of markup? Proceedings of Extreme Markup Languages® 2002. http://www.durusau.net/publications/Down_from_the_trees.pdf
[Hilbert et al. 2005] Hilbert, Mirco; Schonefeld, Oliver; and Witt, Andreas. Making CONCUR work. Proceedings of Extreme Markup Languages® 2005. http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml
[Huitfeldt and Marcoux 2010] Huitfeldt, Claus and Marcoux, Yves. The MLCD overlap corpus: A markup research infrastructure. Presented at the TEI Members Meeting 2010, Zadar (Croatia).
[Huitfeldt and Sperberg-McQueen 2003] Huitfeldt, Claus and Sperberg-McQueen, C. M. TexMECS: An experimental markup meta-language for complex documents. Working paper of the project Markup Languages for Complex Documents (MLCD), University of Bergen, January 2001, rev. October 2003. http://mlcd.blackmesatech.com/mlcd/2003/Papers/texmecs.html
[Huitfeldt et al. 2010] Huitfeldt, Claus; Sperberg-McQueen, C. M.; and Marcoux, Yves. The MLCD Overlap Corpus (MOC). Poster presented at the Digital Humanities 2010 Conference, King's College, London, 7-10 July 2010. http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/html/ab-633.html
[Jagadish et al. 2004] Jagadish, H. V.; Lakshmanan, L. V. S.; Scannapieco, M.; Srivastava, D.; and Wiwatwattana, N. Colorful XML: one hierarchy isn't enough. Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France, pp. 251-262, 2004. doi:https://doi.org/10.1145/1007568.1007598.
[Marinelli / Vitali / Zacchiroli 2008] Marinelli, Paolo; Vitali, Fabio; Zacchiroli, Stefano. Towards the unification of formats for overlapping markup. The New Review of Hypermedia and Multimedia 14: 57-94. doi:https://doi.org/10.1080/13614560802316145; see http://en.scientificcommons.org/38517317, http://www.tandfonline.com/doi/full/10.1080/13614560802316145, and http://hal.archives-ouvertes.fr/docs/00/34/05/78/PDF/nrhm-overlapping-conversions.pdf
[Schonefeld 2007] Schonefeld, Oliver. XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup. In Georg Rehm, Andreas Witt, and Lothar Lemnitzer (eds.), Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures for linguistic resources and applications: Proceedings of the Biennial GLDV Conference 2007. Tübingen: Gunter Narr Verlag, pp. 347-356, 2007. See also http://www.xconcur.org/.
[Sperberg-McQueen 2010] Sperberg-McQueen, C. M. MOC catalog and maintenance plan. http://mlcd.blackmesatech.com/mlcd/2010/A/moc-catalog-sketch.xml (the formal model itself, in Alloy, is available at http://mlcd.blackmesatech.com/mlcd/2010/A/moc-catalog.als).
[Sperberg-McQueen / Huitfeldt 1998] Sperberg-McQueen, C. M. and Huitfeldt, Claus. 1998. Concurrent Document Hierarchies in MECS and SGML. Literary and Linguistic Computing 14: 29-42.
[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C. M. and Huitfeldt, Claus. GODDAG: A Data Structure for Overlapping Hierarchies. In Peter R. King and Ethan V. Munson (eds.), Digital documents: systems and principles. Lecture Notes in Computer Science 2023. Berlin: Springer, 2004, pp. 139-160. Paper given at Digital Documents: Systems and Principles, 8th International Conference on Digital Documents and Electronic Publishing, DDEP 2000, 5th International Workshop on the Principles of Digital Document Processing, PODDP 2000, Munich, Germany, September 13-15, 2000. http://www.springerlink.com/content/98j1vbu5nby73ul3/?p=4eefed0ac09e4ee381d09d3ac2afcb46&pi=8; http://cmsmcq.com/2000/poddp2000.html; http://www.w3.org/People/cmsmcq/2000/poddp2000.html
[Stührenberg / Goecke 2008] Stührenberg, M. and Goecke, D. 2008. SGF — An integrated model for multiple annotations and its application in a linguistic domain. Proceedings of Balisage: The Markup Conference 2008, Montréal (Canada), August 12-15, 2008. http://www.balisage.net/Proceedings/vol1/html/Stuehrenberg01/BalisageVol1-Stuehrenberg01.html. doi:https://doi.org/10.4242/BalisageVol1.Stuehrenberg01.
[Stührenberg and Jettka 2009] Stührenberg, M. and Jettka, D. A toolkit for multi-dimensional markup: The development of SGF to XStandoff. Proceedings of Balisage: The Markup Conference 2009, Montréal (Canada), August 11-14, 2009. doi:https://doi.org/10.4242/BalisageVol3.Stuhrenberg01.
[TEI 2007] TEI Consortium. 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Ed. Lou Burnard and Syd Bauman. Oxford, Providence, Charlottesville, Nancy: The TEI Consortium, 2007, rev. 2010.
[Tennison and Piez 2002] Tennison, Jeni and Piez, Wendell. The Layered Markup and Annotation Language (LMNL). Proceedings of Extreme Markup Languages® 2002. http://conferences.idealliance.org/extreme/html/2002/Tennison02/EML2002Tennison02.html (abstract only). Some information on LMNL can be found at http://www.piez.org/wendell/LMNL/lmnl-page.html.
[Witt 2004] Witt, Andreas. 2004. Multiple hierarchies: new aspects of an old solution. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Witt01/EML2004Witt01.html
[Witt / Lüngen / Goecke 2005] Witt, A., Lüngen, H., Sasaki, F. and Goecke, D. 2005. Unification of XML Documents with Concurrent Markup. Literary and Linguistic Computing 20(1): 103-116. doi:https://doi.org/10.1093/llc/fqh046.