Types of Standoff Markup in TEI[1]
The essential user story for standoff markup is: “I have a text that I want to add
information to without modifying the source.” This might involve changing the structure of
the source, e.g. taking a text with page containers and representing it with div and
paragraph containers. It might mean adding linguistic annotation to the words of the
source. It might mean the addition of notes or commentary on arbitrary chunks of text. It
might mean representing the results of a machine or human named entity recognition
process. It might mean noting when and where witnesses diverge from the base text. It
might mean the alignment of one text with another, e.g. a translation with its source.
Various mechanisms exist for making the connections involved in these processes: the TEI
Guidelines have a set of global attributes for connecting elements[2], <note> has a target
attribute, <span> (which associates an interpretive annotation with a run of text) has
target and from/to attributes, and <link> can associate any number of elements using its
target attribute. The targets of these annotations may be referred to by their IDs,
<anchor>s may be placed to mark targets, words or text segments may be wrapped in <w>
elements, and TEI Pointers may indicate arbitrary runs of text and/or elements. Most of
these techniques involve what we might term “associative” markup: they attach additional
information to some element or segment in the source text.
Properly speaking, any annotation that occurs away from the thing it is annotating might
be called standoff markup. When the TEI Guidelines discuss standoff markup, however, they
primarily refer to what we might term “restructuring” or “reconstructive” markup, wherein
text from one place is imported into structured markup elsewhere. In section 16.9[3], for
example, which is the main section on standoff markup, all of the examples use XInclude to
pull text and/or markup from elsewhere into a new structure.
Figure 1: Example from section 16.9.3, “Stand-off Markup in TEI”
Source document:
<p xml:id="par1">home, <emph>home</emph> on Brokeback Mountain.</p>
<p xml:id="par2">That was the <emph>song</emph> that I sang</p>
Restructuring document:
<div><include href="example1.xml" xmlns="http://www.w3.org/2001/XInclude"
Result document:
<p xml:id="par1">home, <emph>home</emph> on Brokeback Mountain.</p>
<p xml:id="par2">That was the <emph>song</emph> that I sang</p>
) do more or less the same thing in a purely TEI fashion, even though this is not
described as a “standoff” technique in that section. This is essentially the method
employed by (e.g.) CATMA[4] in its TEI export format to layer annotations onto a run of
text.[5] CATMA uses <ptr> to refer to unmarked text and <seg> with an @ana attribute that
refers to the annotation body (which is a TEI feature structure). Restructuring markup may
be very useful in cases where overlap would otherwise be an issue. A “diplomatic” model of
a codex might mark the text as contained by pages, for example, but there might well be a
need for a parallel version where the text is structured into chapters and paragraphs.
Even though the TEI Guidelines consider restructuring markup to be the main form of
“standoff” markup, for the purposes of this paper I will use a broader definition of
“standoff” that includes any markup where information is added to a resource without
modifying the resource itself. Associative markup need not be used in a standoff fashion,
to be sure: <note>s may occur either inline or standoff, for example.
Figure 2: Notes
<p>Some text<note>with an inline note</note>.</p>
<p xml:id="id">Some text.</p>
<note target="#id">with a standoff note</note>
<p>Some text.<ptr target="#id"/></p>
... (elsewhere)
<note xml:id="id">with a referenced note</note>
Between the two poles of associative and restructuring standoff annotation, there lies a
third type, and it is both important and, in TEI, almost entirely neglected. We might call
this type “assertive” annotation, because rather than simply attaching new information to
the text, it makes a positive, actionable statement about the segment of text in question.
The obvious way to do this in TEI is with inline markup. A common example of this kind of
annotation is the marking of named entities. In TEI, a name is marked with the <name>
element, wrapping the thing named, with a @ref attribute which points to some record of
the entity identified (e.g. an entry in a personography or gazetteer). There are a number
of refinements of <name>, including <persName> and <placeName>, as well as the more
general <rs> (referring string). In all of these cases, the element is making a statement
about its content and possibly linking (via the ref attribute) to additional information
about the referent of the name. Someone wishing to say the same sort of thing using
standoff markup quickly runs into difficulty. A typical way to identify a name with a
person in TEI might look like this:
Figure 3: Identifying a personal name
<person xml:id="JC">
<persName>Gaius Iulius Caesar</persName>
<idno type="URL">https://www.wikidata.org/wiki/Q1048</idno>
</person>
<p>Litteris <persName ref="#JC">C. Caesaris</persName> consulibus redditis aegre ab his impetratum est summa
tribunorum plebis contentione, ut in senatu recitarentur;...</p>
The marked up section is saying “this string identifies a personal name and refers to the person identified in the person element with id ‘JC’”. To say the same thing with standoff markup is much trickier. We could, for example, do something like:
Figure 4: Associating a person with a span of text
<person xml:id="JC">
<persName>Gaius Iulius Caesar</persName>
<idno type="URL">https://www.wikidata.org/wiki/Q1048</idno>
</person>
<p xml:id="p1">Litteris C. Caesaris consulibus redditis aegre ab his impetratum est summa
tribunorum plebis contentione, ut in senatu recitarentur;...</p>
<span from="#match('p1','C. Caesaris')"><ptr target="#JC"/></span>
In this example, the span element points to the name “C. Caesaris” in the text, and annotates it with a pointer to the person element that identifies Julius Caesar. The semantics of the TEI do not permit us to explicitly say “that string represents a personal name”, however. We can only imply it by association. It should be noted that other annotation systems, like Web Annotation[6], for example, suffer from this same semantic drawback. In WA, associating a piece of information with a target is straightforward, but having the annotation make an assertion about the target involves (somewhat awkwardly) embedding RDF that makes the assertion into the body of the annotation. One solution to the problem of creating assertive standoff annotations in TEI would be to use restructuring markup to generate a new text that wrapped all of the named entities in their appropriate markup, but this is a heavyweight solution with some drawbacks. So how else could we implement assertive annotation in TEI using standoff markup?
Assertive Standoff Annotation
The question has both theoretical and practical implications. We will need both to
determine what markup structures might be used and to find a solution for creating the
annotations themselves. Fortunately, an online annotation system capable of working with
TEI already exists. Recogito[7] is an annotation tool developed by Pelagios Commons which
provides for the machine-assisted annotation of texts, including TEI texts, with the names
of persons, places, and events. These may be exported in a variety of formats, including
TEI. The TEI export involves inserting inline assertive markup (e.g. <persName> tags) into
the existing document, and the export mechanism understandably has trouble with
overlapping markup (e.g. if part of a name already contains markup or the name overlaps
another structure). An obvious fix for this is to export it as standoff. But it would have
to be in a standard format that was still TEI, and it would have to be possible to do
something useful with the result. One possible “something useful” that suggests itself is
to create a browser-based view of the document plus annotation using CETEIcean[8], a
prospect made even more inviting by the fact that Recogito uses CETEIcean to render TEI
documents for annotation. All sorts of visualizations are possible, including turning the
identified names into links, adding mouseover animations for them, index generation, and
so on.
But the sticking point here is the “standard format” part. What we would need is a
mechanism for applying markup to fragments of a source text without restructuring it: in
other words, a way to do assertive standoff annotation. Such a construct does in fact
exist, but it is not used in precisely this way. The elements in the Critical Apparatus
module[9] are designed to model textual variance: cases where the witnesses to a text
differ and the editor wishes to record the alternate possibilities.
Figure 5: An example apparatus entry
<p n="1" xml:id="p1">
<seg n="1" xml:id="seg-1.1">Bello Alexandrino
conflato Caesar <app>
<lem>Rhodo</lem>
<rdg wit="#S" ana="#orthographical">Ordo</rdg>
</app> atque ex Syria Ciliciaque omnem classem
arcessit; ...</seg>
</p>
tags around a name, would not be quite such a
stretch as it might at first seem. It does not seem unreasonable to treat something
like the
identification of “C. Caesaris” as a personal name as a type of editorial emendation,
though it involves variant markup rather than variant text.
Usefully for us, a critical apparatus may appear either in an inline or standoff position.
A standoff apparatus attaches to the base text using @from and @to attributes, which
indicate the location of the start and end of the varying text (if a TEI Pointer expressing
a range is used, or a single element contains the whole variant, only the @from attribute
is needed). Given a text like the one in Figure 6, a place name could be identified inline
(Figure 7) or with a standoff apparatus entry (Figure 8).
Figure 6: Base text
<div type="textpart" subtype="chapter" n="1" xml:id="c1">
<p type="textpart" subtype="section" n="1" xml:id="c1s1">
<seg n="1" xml:id="c1s1p1">Gallia est omnis divisa in partes tres, quarum unam
incolunt Belgae, Aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli
appellantur.</seg>
</p>
</div>
Figure 7: Base text with inline place name identification
<div type="textpart" subtype="chapter" n="1" xml:id="c1">
<p type="textpart" subtype="section" n="1" xml:id="c1s1">
<seg n="1" xml:id="c1s1p1"><placeName
ref="https://pleiades.stoa.org/places/993">Gallia</placeName> est omnis divisa
in partes tres, quarum unam incolunt Belgae, Aliam Aquitani, tertiam qui ipsorum
lingua Celtae, nostra Galli appellantur.</seg>
</p>
</div>
Figure 8: A standoff place name identification
<app from="#string-index(//seg[@xml:id='c1s1p1'],0)" to="#string-index(//seg[@xml:id='c1s1p1'],6)">
<rdg><placeName ref="https://pleiades.stoa.org/places/993" source="#Damon">Gallia</placeName></rdg>
</app>
Such an annotation structure could either be embedded in the original or delivered as a separate document.
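As a rough illustration of how a consumer might resolve the @from and @to pointers of a standoff apparatus entry like the one in Figure 8, the following Python sketch handles only the exact pointer shape shown there: an XPath that selects one element by its xml:id, plus a character offset. The function names are hypothetical, the offsets are assumed to behave like Python slice positions, and the full TEI Pointer syntax is not covered.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: only pointers of the form
#   #string-index(//name[@xml:id='ID'],OFFSET)
# are supported. Anything else is rejected.
POINTER = re.compile(r"#string-index\(//\w+\[@xml:id='([^']+)'\],\s*(\d+)\)")


def parse_pointer(pointer):
    """Split a string-index pointer into (xml:id, character offset)."""
    match = POINTER.fullmatch(pointer)
    if match is None:
        raise ValueError(f"unsupported pointer: {pointer}")
    return match.group(1), int(match.group(2))


def find_by_id(root, xml_id):
    """Return the first element in the tree carrying the given xml:id."""
    for elem in root.iter():
        if elem.get(XML_ID) == xml_id:
            return elem
    raise KeyError(xml_id)


def resolve_range(root, from_pointer, to_pointer):
    """Return the text between two pointers into the same element."""
    start_id, start = parse_pointer(from_pointer)
    end_id, end = parse_pointer(to_pointer)
    if start_id != end_id:
        raise ValueError("this sketch only handles ranges within one element")
    text = "".join(find_by_id(root, start_id).itertext())
    return text[start:end]


# Shortened copy of the base text, for illustration only.
BASE = ET.fromstring(
    '<p xml:id="c1s1"><seg xml:id="c1s1p1">'
    "Gallia est omnis divisa in partes tres"
    "</seg></p>"
)

annotated = resolve_range(
    BASE,
    "#string-index(//seg[@xml:id='c1s1p1'],0)",
    "#string-index(//seg[@xml:id='c1s1p1'],6)",
)
print(annotated)
```

Looking the element up by xml:id rather than evaluating the XPath keeps the sketch within the standard library. A fuller implementation would need a real XPath engine and would have to handle ranges that cross element boundaries.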
Given such a document, what could we do with it? And what problems will need to be
overcome in order to use it? CETEIcean is already used as the basis for critical editions
that use an inline apparatus to model textual variation. Experimentally, at least,
resolving a set of standoff assertive annotations and applying them to the text as links is
straightforward. Recogito internally marks the location of the start and end of an
annotated string, using a generated XPath to register the nearest parent element(s) of the
start and end points and then indexing into the strings that contain them. This method is
isomorphic to the TEI Pointers[11] used in the example above.
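Once a pointer has been reduced to an element and a pair of character offsets, applying the annotation amounts to wrapping that run of text in a new element. The Python sketch below shows the idea under a strong simplifying assumption: the annotated range lies entirely within the target element's leading text node, before any child element. The function name is hypothetical, and this is not a description of how CETEIcean or Recogito implement the step.

```python
import xml.etree.ElementTree as ET


def wrap_leading_text(target, start, end, tag, attrib):
    """Wrap target.text[start:end] in a new child element.

    Assumption: the range falls wholly inside the target's leading
    text node. Ranges that touch child elements are not handled.
    """
    text = target.text or ""
    if not 0 <= start <= end <= len(text):
        raise ValueError("range is outside the leading text node")
    wrapper = ET.Element(tag, attrib)
    wrapper.text = text[start:end]   # the annotated string
    wrapper.tail = text[end:]        # whatever followed it
    target.text = text[:start]       # whatever preceded it
    target.insert(0, wrapper)        # leading text precedes all children
    return wrapper


seg = ET.fromstring("<seg>Gallia est omnis divisa in partes tres</seg>")
wrap_leading_text(
    seg, 0, 6, "placeName", {"ref": "https://pleiades.stoa.org/places/993"}
)
result = ET.tostring(seg, encoding="unicode")
print(result)
```

A browser-based view could emit a link element instead of <placeName> at the same point. The splitting of the text node into before, annotated, and after is the part that carries over.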
The difficult piece of the puzzle, and the one which remains unresolved at the time of
writing (August 2019), is the shape of the TEI export. A proposal for a new <standoff>
container for annotations pointing into the text is being debated by the TEI community.
Some form of this will likely be adopted for a future release, and may then serve as a
place to put standoff assertive annotations of the kind mooted above. Whether or not the
existing apparatus markup is deemed suitable for such annotations is an open question. It
is to be hoped that if it is not, some equivalent structure can be developed. The
technological pieces of the puzzle are all in place, so we can hope that the standards
development component will catch up before too long.
As part of the TEI Council’s Fall face-to-face meeting, we convened several stakeholders from the community to try to work out a structure for standoff markup. The meeting was held on September 16th, 2019 in Graz, Austria. The following decisions were agreed upon:
1. The <TEI> element will be allowed to nest, so that one TEI document may be embedded directly inside another.
2. A new <standoff> element will be created, which will be of the model.resourceLike class, meaning that it can appear directly inside the <TEI> element, alongside the <teiHeader>, etc.
3. <standoff> will contain most list-like elements, including <listPerson> and <listOrg>. <listApp> will be available also, though whether it will be used for assertive annotations of the type outlined here, or whether a new, parallel structure will be created, is an open question.
4. <standoff> will also contain a new <listAnnotation> element, which will contain <annotationBlock> (used for linguistic annotation), and/or a new <annotation> element.
5. The precise content model of the new <annotation> element is still to be determined. The plan is to model it after the Web Annotation data model[12], with some TEI-specific modifications.
Steps 1–4 will be implemented right away, with a plan to include them in the next release (probably in Spring 2020) and discussions on #5 will proceed in parallel.
