Cayless, Hugh. “Implementing TEI Standoff Annotation in the browser.” Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). https://doi.org/10.4242/BalisageVol23.Cayless01.
Balisage: The Markup Conference 2019 July 30 - August 2, 2019
Balisage Paper: Implementing TEI Standoff Annotation in the browser
Hugh Cayless
Hugh is a Senior Digital Humanities Developer at the Duke Collaboratory for Classics
Computing (DC3).
The essential user story for standoff markup is: “I have a text that I want to add
information to without modifying the source.” This might involve changing the structure
of the source, e.g. taking a text with page containers and representing it with div and
paragraph containers. It might mean adding linguistic annotation to the words of the source.
It might
mean the addition of notes or commentary on arbitrary chunks of text. It might mean
representing the results of a machine or human named entity recognition process. It
might mean
noting when and where witnesses diverge from the base text. It might mean the alignment
of one
text with another, e.g. a translation with its source. Various mechanisms exist for
handling the connections involved in these processes. The TEI Guidelines have a set of
global attributes for connecting elements[2]; <note> has a target attribute; <span>
(which associates an interpretive annotation with a run of text) has target and from/to
attributes; and <link> can associate any number of elements using its target attribute.
The targets of these annotations may be referred to by their IDs, <anchor>s may be
placed to mark targets, words or text segments may be wrapped in <w> or
<seg> elements, and TEI Pointers may indicate arbitrary runs of text and/or
elements. Most of these techniques involve what we might term “associative” markup: they
attach additional information to some element or segment in the source text.
Properly speaking, any annotation that occurs away from the thing it is annotating
might
be called standoff markup. When the TEI Guidelines discuss standoff markup, however,
they
primarily refer to what we might term “restructuring” or “reconstructive” markup,
wherein text
from one place is imported into structured markup elsewhere. In section 16.9[3], for example, which is the main section on standoff markup, all of the examples
use XInclude to pull text and/or markup from elsewhere into a new structure.
Given a source document, another document may pull chunks out of it and present them
embedded in a new structure. It should be noted that the examples cited in section
16.7 (on
<join>) do more or less the same thing in a purely TEI fashion, even though
this is not described as a “standoff” technique in that section. This is essentially
the
method employed by (e.g.) CATMA[4] in its TEI export format to layer annotations onto a run of text.[5] CATMA uses <ptr> to refer to unmarked text and
<seg> with an @ana attribute that refers to the annotation body
(which is a TEI feature structure). Restructuring markup may be very useful in cases
where
overlap would otherwise be an issue. A “diplomatic” model of a codex might mark the
text as
contained by pages, for example, but there might well be a need for a parallel version
where
the text is structured into chapters and paragraphs.
Even though the TEI Guidelines consider restructuring markup to be the main form of
“standoff” markup, for the purposes of this paper I will use a broader definition
of
“standoff” that includes any markup where information is added to a resource without
directly
modifying the resource itself. Associative markup need not be used in a standoff fashion,
to
be sure: <note>s may occur either inline or standoff, for example.
Between the two poles of associative and restructuring standoff annotation, there
lies a
third, and it is both important and, in TEI, almost entirely neglected. We might call
this
type “assertive” annotation, because rather than simply attaching new information
to the text,
it makes a positive, actionable statement about the segment of text in question. The
only
obvious way to do this in TEI is with inline markup. A common example of this kind
of
annotation is the marking of named entities. In TEI, a name is marked with the
<name> element, wrapping the thing named, with a @ref attribute which points
to some record of the entity identified (e.g. an entry in a personography or gazetteer).
There
are a number of refinements of <name>, including <persName>,
<orgName>, and <placeName> as well as the more general
<rs> (referring string). In all of these cases, the element is making a
statement about its content and possibly linking (via the ref attribute) to additional
information about the referent of the name. Someone wishing to say the same sort of
thing
using standoff markup quickly runs into difficulty. A typical way to identify a name
with a
person in TEI might look like this:
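A minimal sketch of inline markup of this kind, embedded in Python so that it can be checked. The surrounding words, the wrapper elements, and the content of the person record are assumptions made for illustration; only the name “C. Caesaris” and the id “JC” come from the discussion here, and the fragment is not a complete or valid TEI document.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: the sentence and the person record are invented.
INLINE = f"""
<TEI xmlns="{TEI}">
  <listPerson>
    <person xml:id="JC"><persName>Gaius Iulius Caesar</persName></person>
  </listPerson>
  <p>bellum <persName ref="#JC">C. Caesaris</persName></p>
</TEI>
"""

def resolve_names(xml_text):
    """Return (name string, id of the person referred to) for each persName with @ref."""
    root = ET.fromstring(xml_text)
    people = {p.get(XML_ID) for p in root.iter(f"{{{TEI}}}person")}
    found = []
    for name in root.iter(f"{{{TEI}}}persName"):
        ref = name.get("ref")
        # Only local references of the form "#id" are resolved in this sketch.
        if ref and ref.startswith("#") and ref[1:] in people:
            found.append(("".join(name.itertext()), ref[1:]))
    return found

print(resolve_names(INLINE))
```

The element name itself carries the assertion: because the string is wrapped in <persName>, a processor needs no further information to know that it is a personal name.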
The marked up section is saying “this string identifies a personal name and refers
to the
person identified in the person element with id ‘JC’”. To say the same thing with
standoff
markup is much trickier. We could, for example, do something like:
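A sketch of one possible standoff equivalent, again embedded in Python. The pointer syntax is a simplified form of the TEI string-index() scheme that takes an xml:id instead of an XPath, the choice of @corresp to carry the link to the person is an assumption, and the source sentence is invented.

```python
import re
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Simplified pointer form assumed here: #string-index(ID,OFFSET)
POINTER = re.compile(r"#string-index\(([^,]+),\s*(\d+)\)")

# Hypothetical source and standoff annotation.
SOURCE = f'<p xmlns="{TEI}" xml:id="p1">bellum C. Caesaris</p>'
STANDOFF = (
    f'<span xmlns="{TEI}" from="#string-index(p1,7)" '
    'to="#string-index(p1,18)" corresp="#JC"/>'
)

def targeted_text(source_xml, span_xml):
    """Return the run of source text the span points at, plus the span's @corresp."""
    src = ET.fromstring(source_xml)
    span = ET.fromstring(span_xml)
    from_id, start = POINTER.fullmatch(span.get("from")).groups()
    to_id, end = POINTER.fullmatch(span.get("to")).groups()
    if not (from_id == to_id == src.get(XML_ID)):
        raise ValueError("this sketch only handles pointers into one element")
    text = "".join(src.itertext())
    # Offsets count the points between characters, so they work as slice bounds.
    return text[int(start):int(end)], span.get("corresp")

print(targeted_text(SOURCE, STANDOFF))
```

The result pairs a string with a person record, but nothing in the <span> says what kind of thing the string is.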
In this example, the span element points to the name “C. Caesaris” in the text, and
annotates it with a pointer to the person element that identifies Julius Caesar. The
semantics
of the TEI do not permit us to explicitly say “that string represents a personal name”,
however. We can only imply it by association. It should be noted that other annotation
systems, like Web Annotation[6], for example, suffer from this same semantic drawback. In WA, associating a piece
of information with a target is straightforward, but having the annotation make an
assertion
about the target involves (somewhat awkwardly) embedding RDF that makes the assertion
into the
body of the annotation. One solution to the problem of creating assertive standoff
annotations
in TEI would be to use restructuring markup to generate a new text that wrapped all
of the
named entities in their appropriate markup, but this is a heavyweight solution with
some
drawbacks. So how else could we implement assertive annotation in TEI using standoff
markup?
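For comparison, a Web Annotation that associates a run of text with an entity record might be built as in the following sketch. The property names follow the Web Annotation data model; both URLs and the character offsets are placeholders invented for the example.

```python
import json

def identify(source_url, start, end, entity_url):
    """Build a Web Annotation linking a character range to an entity record."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "identifying",
        "body": {
            "type": "SpecificResource",
            "source": entity_url,
            "purpose": "identifying",
        },
        "target": {
            "source": source_url,
            "selector": {"type": "TextPositionSelector", "start": start, "end": end},
        },
    }

# Placeholder URLs and offsets, for illustration only.
annotation = identify("http://example.org/text.xml", 7, 18,
                      "http://example.org/people/JC")
print(json.dumps(annotation, indent=2))
```

The annotation links the target to a resource, but it contains no statement that the target is a personal name; saying so would require additional RDF in the body.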
Assertive Standoff Annotation
The question has both theoretical and practical implications. We will need both to
determine what markup structures might be used and to find a way of creating the annotations
themselves. Fortunately, an online annotation system capable of working with TEI already
exists. Recogito[7] is an annotation tool developed by Pelagios Commons which provides for the
machine-assisted annotation of texts, including TEI texts, with the names of persons,
places,
and events. These may be exported in a variety of formats, including TEI. The TEI
export
involves inserting inline assertive markup (e.g. <persName> tags) into the
existing document, and the export mechanism understandably has trouble with overlapping
markup
(e.g. if part of a name already contains markup or the name overlaps another structure).
The
obvious fix for this is to export it as standoff. But it would have to be in a standard
format
that was still TEI, and it would have to be possible to do something useful with the
output.
One possible “something useful” that suggests itself is to create a browser-based
view of the
document plus annotation using CETEIcean[8], a prospect made even more inviting by the fact that Recogito uses CETEIcean to
render TEI documents for annotation. All sorts of visualizations are possible, including
turning the identified names into links, adding mouseover animations for them, index
generation, and so on.
But the sticking point here is the “standard format” part. What we would need is a
TEI mechanism for applying markup to fragments of a source text without restructuring
it: in other words, a way to do assertive standoff annotation. Such a construct does
in fact exist, but it is not used in precisely this way. The elements in the Critical
Apparatus module[9] are designed to model textual variance: cases where the witnesses
to a text differ and the editor wishes to record the alternate possibilities.
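A minimal sketch of an inline apparatus entry, assuming the usual <app>, <lem>, and <rdg> elements with a @wit attribute naming the witness. The readings and the witness siglum S are those discussed in this paragraph; the exact shape of the markup is an assumption.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"

# Assumed markup: base reading in <lem>, variant reading in <rdg>.
APP = f'<app xmlns="{TEI}"><lem>Rhodo</lem><rdg wit="#S">Ordo</rdg></app>'

def readings(app_xml):
    """Return the base reading and a mapping from witness to variant reading."""
    app = ET.fromstring(app_xml)
    lem = app.find(f"{{{TEI}}}lem")
    variants = {
        rdg.get("wit"): "".join(rdg.itertext())
        for rdg in app.iter(f"{{{TEI}}}rdg")
    }
    return "".join(lem.itertext()), variants

print(readings(APP))
```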
Here, the base text has “Rhodo” (Rhodes) and the editor wishes readers to know that
a
witness, S, has “Ordo” instead. It might at first seem outlandish to
suggest using such a specialized type of markup to record annotations, but critical
apparatus
markup has several advantages that make it attractive. It can take either an inline
or
standoff form, it is designed explicitly for making assertive annotations and recording
their
provenance, it can accommodate differences in markup as well as text, it can cope
with
overlap, and it even has mechanisms for recording dependencies or conflicts between
readings.
A transposition in a variant, for example, requires that the base reading exclude
the variant,
and vice versa. Person or place identifications might have the same sorts of requirements,
one
identification implying—or ruling out—another. It is already common practice to note
the
suggestions of previous editors in the apparatus, so suggested emendations to the
markup, such
as the addition of <persName> tags around a name, would not be quite such a
stretch as it might at first seem. It does not seem unreasonable to treat something
like the
identification of “C. Caesaris” as a personal name as a type of editorial emendation,
even
though it involves variant markup rather than variant text.
Usefully for us, a critical apparatus may appear in either an inline or a standoff position.
A standoff apparatus attaches to the base text using @from and @to attributes, which
indicate the locations of the start and end of the varying text (if a TEI Pointer expressing
a range is used, or a single element contains the whole variant, only the @from attribute
is needed). Given a text like
We might do inline identification of the place named in the first word thus:
But an annotation system like Recogito could use a standoff apparatus to propose
this change instead:
without needing to alter the source text. Usefully, the semantics of the latter are
explicit: “Damon says this piece of the text should be read as a place name.”[10]
Such an annotation structure could either be embedded in the original or delivered
as a
separate document.
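The following sketch shows how a standoff apparatus entry of this kind might be resolved against its source. The source sentence, the ids, and the use of @resp to credit the editor are assumptions made for illustration, and the pointers use a simplified string-index() form that takes an xml:id instead of an XPath.

```python
import re
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
POINTER = re.compile(r"#string-index\(([^,]+),\s*(\d+)\)")

# Hypothetical source text and annotation; neither comes from a real edition.
SOURCE = f'<p xmlns="{TEI}" xml:id="p1">Rhodo venit</p>'
STANDOFF = f"""
<listApp xmlns="{TEI}">
  <app from="#string-index(p1,0)" to="#string-index(p1,5)">
    <rdg resp="#Damon"><placeName ref="#rhodes">Rhodo</placeName></rdg>
  </app>
</listApp>
"""

def apply_apparatus(source_xml, standoff_xml):
    """Return (annotated copy of the source element, @resp of the reading)."""
    src = ET.fromstring(source_xml)
    app = ET.fromstring(standoff_xml).find(f"{{{TEI}}}app")
    (from_id, start), (to_id, end) = (
        POINTER.fullmatch(app.get(name)).groups() for name in ("from", "to")
    )
    if not (from_id == to_id == src.get(XML_ID)) or len(src):
        raise ValueError("this sketch handles one text-only element")
    start, end = int(start), int(end)
    text = src.text or ""
    rdg = app.find(f"{{{TEI}}}rdg")
    proposed = rdg[0]  # the markup the reading proposes, e.g. <placeName>
    if "".join(proposed.itertext()) != text[start:end]:
        raise ValueError("reading does not match the targeted text")
    # Build a new element; the parsed source is left untouched.
    result = ET.Element(src.tag, src.attrib)
    result.text = text[:start]
    wrapper = ET.SubElement(result, proposed.tag, proposed.attrib)
    wrapper.text = text[start:end]
    wrapper.tail = text[end:]
    return result, rdg.get("resp")

result, resp = apply_apparatus(SOURCE, STANDOFF)
print(resp, ET.tostring(result, encoding="unicode"))
```

Because the reading carries its own markup, the annotation both locates the string and states what it is, and the source is never rewritten.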
Given such a document, what could we do with it? And what problems will need to be
overcome in order to use it? CETEIcean is already used as the basis for critical editions
that
use an inline apparatus to model textual variation. Experimentally, at least, resolving
a set
of standoff assertive annotations and applying them to the text as links is straightforward.
Recogito internally marks the location of the start and end of an annotated string,
using a
generated XPath to register the nearest parent element(s) of the start and end points,
and
then indexes into the string containing the start and end. This method is isomorphic
to the
string-index() TEI Pointers[11] used in the example above.
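The correspondence can be sketched as a lossless round trip between an (XPath, offset) pair and a string-index() pointer. The Anchor type is hypothetical and does not reproduce Recogito's actual internal format; it only models the two pieces of information described above.

```python
import re
from typing import NamedTuple

class Anchor(NamedTuple):
    """Hypothetical model of one end of an annotated string."""
    xpath: str   # path to the nearest parent element
    offset: int  # character index into that element's text

POINTER = re.compile(r"#string-index\((.+),\s*(\d+)\)")

def to_pointer(anchor):
    """Express an anchor as a TEI string-index() pointer."""
    return f"#string-index({anchor.xpath},{anchor.offset})"

def from_pointer(pointer):
    """Recover the anchor from a string-index() pointer."""
    match = POINTER.fullmatch(pointer)
    if match is None:
        raise ValueError(f"not a string-index pointer: {pointer!r}")
    return Anchor(match.group(1), int(match.group(2)))

start = Anchor("//p[1]", 0)
print(to_pointer(start))
```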
The difficult piece of the puzzle, and the one which remains unresolved at the time
of
writing (August 2019), is the shape of the TEI export. A proposal for a new <standoff>
container for annotations pointing into the text is being debated by the TEI community.
Some
form of this will likely be adopted for a future release, and may then serve as a
place to put
standoff assertive annotations of the kind mooted above. Whether or not the existing
critical
apparatus markup is deemed suitable for such annotations is an open question. It is
to be
hoped that if it is not, some equivalent structure can be developed. The technological
pieces
of the puzzle are all in place, so we can hope that the standards development component
will
catch up before too long.
Postscript
As part of the TEI Council’s Fall face-to-face meeting, we convened several stakeholders
from the community to try to work out a structure for standoff markup. The meeting
was held
on September 16th, 2019 in Graz, Austria. The following decisions were agreed upon:
1. The <TEI> element will be allowed to nest, so that one TEI document may be embedded
directly inside another.
2. A new <standOff> element will be created, which will be a member of the
model.resourceLike class, meaning that it can appear directly inside the
<TEI> element, alongside the <teiHeader>, <text>, etc.
3. <standOff> will contain most list-like elements, including
<listPerson>, <listPlace>, and <listOrg>. <listApp> will also be available,
though whether it will be used for assertive annotations of the type outlined here, or
whether a new, parallel structure will be created, is an open question.
4. <standOff> will also contain a new <listAnnotation>
element, which will contain <annotationBlock> (used for linguistic annotation)
and/or a new <annotation> element.
5. The precise content model of the new <annotation> element is still to
be determined. The plan is to model it after the Web Annotation data model[12], with some TEI-specific modifications.
Steps 1–4 will be implemented right away, with a plan to include them in the next
release (probably in Spring 2020), and discussions on #5 will proceed in parallel.
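A sketch of the document shape these decisions imply, built in Python. The element names come from the decisions listed above; the children are left empty, and the schema as eventually released may differ from this outline.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"

def q(name):
    """Qualify a local name with the TEI namespace."""
    return f"{{{TEI}}}{name}"

def skeleton():
    """Build an empty TEI document with <standOff> as a sibling of <text>."""
    root = ET.Element(q("TEI"))
    ET.SubElement(root, q("teiHeader"))
    ET.SubElement(root, q("text"))
    stand_off = ET.SubElement(root, q("standOff"))
    # Containers named in the decisions; their contents are omitted here.
    for name in ("listPerson", "listPlace", "listOrg", "listApp", "listAnnotation"):
        ET.SubElement(stand_off, q(name))
    return root

doc = skeleton()
print([child.tag.split("}")[1] for child in doc])
```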
[1] The idea for this paper presented itself right before the submission deadline, and
as
a result, it will require a good deal more fleshing out before presentation, but I
think
the skeleton is workable.
[8] https://github.com/TEIC/CETEIcean. See Cayless, Hugh, and Raffaele Viglianti.
“CETEIcean: TEI in the Browser.” Presented at Balisage: The Markup Conference 2018,
Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup
Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). https://doi.org/10.4242/BalisageVol21.Cayless01.