TEI, standoff markup, and string-range()
Section 16.2.4 of the Text Encoding Initiative Guidelines outlines a number of pointer
schemes that are related to functions defined in the XPointer specification [XPtr]. These
can (notionally at least) be used to produce standoff markup on a TEI document. There are
a variety of problems with the pointer schemes defined by the Guidelines, and also with
the related XPointer functions, but the most basic is that most of them have no
implementation. There is therefore no good way to use them and, because they are unused,
no good reason to implement them either. It is a Catch-22. The TEI pointer schemes are
clearly meant to be used in concert with XInclude, as functions that retrieve text or node
sets (see the example in 16.9.3), but their effects are underspecified in the Guidelines.
Recent developments in the TEI have opened up the possibility of creating an
implementation of at least one of these schemes, namely string-range(). The string-range()
pointer scheme is defined thus:
16.2.4.5 string-range(fragmentIdentifier, offset [, length])
The string-range() scheme locates a range based on character positions. While
string-range endpoints are points adjacent to character positions, they must be designated
by the characters to which they are adjacent, in the same way that the nodes corresponding
to XML elements are. This avoids ambiguity about which point between two characters
is
indicated when characters are interrupted by markup.
The first argument to string-range() designates a node or a range within which a
string is to be located. No string range, even an empty one, can be defined by a
string-range() if the fragment identified has the empty string as its value. Every
string-range is defined based on an ‘origin character’. The origin is numbered 0,
and
designates the first character of the string-value of pointer. The offset is a character
index relative to the origin; the start of the resulting range is the position designated
by the sum of the origin and offset.
If length is specified, the end of the range is at a point adjacent to the character
designated by the origin added to the offset and length. If the offset is negative,
or
length is sufficiently large, a string-range can designate characters outside the
string-value of the initial pointer. In this case, characters are located using the
string-value of the entire document. It is also legal for length plus the origin to
exceed
the length of the string-value of the document by one, in order to accommodate ranges
that
include the last character of a document.
If length is not specified, it defaults to the value 1, and the string range contains
one character. If it is specified as 0, the zero-length range is interpreted as the
point
immediately preceding the origin character or offset character if there is one.
[http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSSR]
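The offset and length arithmetic in the quoted definition can be illustrated with a minimal Python sketch. The function name string_range and its behaviour are assumptions made for illustration only: characters are counted as Unicode code points, a zero length yields an empty string standing in for a point, and ranges that reach outside the string-value are rejected rather than resolved against the whole document as the Guidelines allow.

```python
from typing import Optional


def string_range(string_value: str, offset: int, length: int = 1) -> Optional[str]:
    """Return the characters selected by (offset, length) in string_value.

    Hypothetical helper, not part of any TEI tool. Assumption: ranges that
    would leave the string-value raise an error instead of continuing into
    the rest of the document.
    """
    if string_value == "":
        # The quoted text allows no range at all on an empty string-value.
        return None
    if offset < 0 or length < 0 or offset + length > len(string_value):
        raise ValueError("range falls outside the string-value")
    # The origin is character 0; length defaults to one character.
    return string_value[offset:offset + length]


print(string_range("abcdef", 2, 3))  # cde
print(string_range("abcdef", 2))     # c
```

A length of 0 returns the empty string here, which loses the position of the point; a fuller implementation would have to return the position itself.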
In theory, at least, string-range can be used to indicate an arbitrary section
of text in a TEI document, without regard to the way that text is nested within the
document's
structure. A range could start inside one element, and end inside another. Put another
way, it
can span multiple text() nodes. This means that if string-range() can be implemented,
it would
present a solution to the overlapping hierarchies problem.
Since string-range depends on marking a starting point and length of text within a section
of the document, it runs immediately into a problem with the way XML regards some
whitespace as "ignorable". Space between elements, for example, is not necessarily
preserved during operations on the document. Someone editing a document might pretty-print
it in order to make it more readable. This would introduce extra newline and space
characters into the document, and immediately break any string-range() pointers. In other
words, the ignorable whitespace content of the document could be changed as a part of
normal processing that doesn’t involve any editing of the document. This year, for the
first time, TEI has begun to allow the xml:space attribute
[http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.html]. This
means that the ignorable whitespace issue can be accommodated in a standard way: an
element carrying xml:space="preserve" signals that its whitespace should be left as it is.
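The effect of pretty-printing on character offsets can be seen in a small Python sketch. The two sample documents and the helper string_value are invented for illustration; the only difference between the documents is indentation.

```python
import xml.dom.minidom as minidom


def string_value(xml_text: str) -> str:
    """Concatenate all text nodes of a document in document order."""
    def walk(node):
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            yield node.data
        for child in node.childNodes:
            yield from walk(child)
    return "".join(walk(minidom.parseString(xml_text)))


compact = "<div><p>one</p><p>two</p></div>"
indented = "<div>\n  <p>one</p>\n  <p>two</p>\n</div>"

# The same offset and length select different characters once the
# document has been indented.
print(string_value(compact)[3:6])   # two
print(string_value(indented)[3:6])  # one
```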
A second problem, and one that applies to several of the pointer schemes that the
Guidelines specify, is that they extend the XML data model. The TEI pointer scheme
conceives of Nodes and Node Sets (both of which correspond to objects in the XML
Infoset/DOM), but also Points and Ranges. Points are theoretical objects that must lie
between element nodes or between characters in text nodes. This is a useful concept for
marking arbitrary ranges in a document, but since it does not correspond to anything
conceived of by the XML specifications, there are no hooks in XML processing tools on
which to hang Points. They cannot be passed to or returned by any XPath function or XSLT
instruction. This makes implementation a complex task. At best, they can be encapsulated
in special-purpose markup for passing as messages or handled as uninterpreted XPath
expressions. The former technique introduces a problem of standardization and the latter
requires second-order processing, with the dangers and difficulties that implies. Since
string-range focuses on text, however, it is possible to count, for each text node, the
concatenated length of text nodes on the preceding axis, and thereby to locate the text
nodes containing the start and end points indicated in a string-range() pointer.
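The counting approach described above might be sketched in Python as follows. This is an illustration under our own assumptions, not the XSLT implementation: the names text_nodes and locate are hypothetical, and a running total stands in for summing the lengths of text nodes on the preceding axis.

```python
import xml.dom.minidom as minidom


def text_nodes(node):
    """Yield the text nodes below node in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def locate(root, offset, length):
    """List (parent element name, local start, local end) for every text
    node that overlaps the range starting at offset."""
    hits, seen = [], 0  # seen = combined length of the preceding text nodes
    end = offset + length
    for t in text_nodes(root):
        lo, hi = seen, seen + len(t.data)
        if lo < end and hi > offset:
            hits.append((t.parentNode.tagName, max(offset, lo) - lo, min(end, hi) - lo))
        seen = hi
    return hits


doc = minidom.parseString("<p>ab<hi>cd</hi>ef</p>")
# The range covers "bcde", which is spread over three text nodes.
print(locate(doc.documentElement, 1, 4))
# [('p', 1, 2), ('hi', 0, 2), ('p', 0, 1)]
```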
A third problem with string-range() as defined by the TEI, and in fact with all of its
XPointer schemes, is that the specification (the TEI Guidelines) doesn't properly address
what implementation would mean. The example in 16.9.3 uses string-range in XInclude
elements to import text from one XML document to another. Of course this example doesn’t
work, because TEI’s string-range has no XInclude implementation. But the (unstated)
implication seems to be that the string-range() function returns plain text only.
String-range could certainly be used to declaratively indicate arbitrary sections of a
document, but without some mechanism for executing it, there is nothing concrete for an
implementer to do. A further complication is that there is nothing stopping a string-range
from indicating text that overlaps elements in a non-hierarchical fashion. Should an
implementer ignore elements thus captured? Or return them somehow? A related issue is the
fact that since string-range defines text-based locations, elements are effectively
invisible to it. A standalone element (e.g. <lb/>) immediately before text that one wants
to mark with a string-range() won't automatically be part of that range.
Given the underspecified functionality of string-range, the authors have made some
assumptions about implementation details. We have decided not to extend any existing
XInclude
implementation. Instead, we have decided to use string-range only in a declarative
fashion, as
a pointing mechanism within TEI, and we are developing XPath 2.0 functions that complement
and
use string-range(). Where it declares a range, they will be able to retrieve that
range. We
propose three functions, with the following signatures:
get-string-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])
- takes as arguments an XPath indicating a parent element (e.g. a div on which
@xml:space="preserve" has been set) and a set of integer pairs of character offsets
- returns a sequence of strings derived from text nodes or portions of text nodes
between the pairs of points passed in as parameters.
get-milestone-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])
- takes as arguments an XPath indicating a parent element (e.g. a div on which
@xml:space="preserve" has been set) and a set of integer pairs of character offsets
- returns a sequence where elements have been converted to milestones (e.g.
<p-start> and <p-end> instead of <p>).
get-fragment-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])
- takes as arguments an XPath indicating a parent element (e.g. a div on which
@xml:space="preserve" has been set) and a set of integer pairs of character offsets
- returns a well-formed document fragment, where elements split by the range have been
automatically opened or closed.
An XSLT 2.0 stylesheet that implements these functions is under development at
http://github.com/hcayless/tei-string-range.
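To make the intended behaviour of the first of these functions concrete, here is a rough Python analogue. It is not the XSLT implementation in the repository. It assumes that each pair of offsets is a (start, end) pair measured in the string-value of the parent element, which is one reading of the signature above, and the name get_string_range is our own.

```python
import xml.dom.minidom as minidom


def get_string_range(parent, *offsets):
    """Return the text-node portions lying between each (start, end) pair."""
    if len(offsets) % 2:
        raise ValueError("offsets must come in (start, end) pairs")
    # Record (start position, text) for every text node in document order.
    nodes, pos, stack = [], 0, [parent]
    while stack:
        node = stack.pop()
        if node.nodeType == node.TEXT_NODE:
            nodes.append((pos, node.data))
            pos += len(node.data)
        else:
            stack.extend(reversed(node.childNodes))
    pieces = []
    for start, end in zip(offsets[::2], offsets[1::2]):
        for lo, text in nodes:
            # Clip the requested range to this text node.
            a, b = max(start - lo, 0), min(end - lo, len(text))
            if a < b:
                pieces.append(text[a:b])
    return pieces


root = minidom.parseString(
    '<div xml:space="preserve"><p>ab<hi>cd</hi>ef</p><p>gh</p></div>'
).documentElement
print(get_string_range(root, 1, 5, 6, 8))  # ['b', 'cd', 'e', 'gh']
```

The milestone and fragment variants would walk the same node list but would also have to emit the elements that the range enters or leaves.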
A fourth problem lies in the ease of use of the string-range function. Determining the
index location of a piece of arbitrary text in a TEI document is prohibitively difficult
for a human editor. It would be relatively easy to programmatically generate a
string-range based on a selected range in an XML editor, like oXygen, but without this
kind of functionality, it will be quite hard for someone marking up a document to create
the expression with facility. What is needed at a bare minimum is a means to mark range
starts and ends, in an editor-independent fashion, which can then be converted to
string-range expressions. We propose using processing instructions in the form
<?range-start r="n"?>/<?range-end r="n"?>, where "n" identifies a particular range. Pairs
of these will mark range starts and ends, and can be processed by an XSLT stylesheet to
create <linkGrp>s containing links that use string-range() to identify the marked ranges.
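The conversion from processing-instruction pairs to pointers can be sketched in Python. This is an illustration only: the function name ranges_from_pis is hypothetical, offsets are counted from the start of the root element's string-value, the pointer syntax copies the form used in this paper's examples, and the <linkGrp> wrapper is left out.

```python
import xml.dom.minidom as minidom


def ranges_from_pis(xml_text, target_id):
    """Map each range id to a string-range() pointer built from its PI pair."""
    starts, pointers, pos = {}, {}, 0

    def walk(node):
        nonlocal pos
        for child in node.childNodes:
            if child.nodeType == child.TEXT_NODE:
                pos += len(child.data)
            elif child.nodeType == child.PROCESSING_INSTRUCTION_NODE:
                if child.target in ("range-start", "range-end"):
                    # The PI data looks like r="n"; the quoted value is the range id.
                    rid = child.data.split('"')[1]
                    if child.target == "range-start":
                        starts[rid] = pos
                    else:
                        pointers[rid] = (
                            f"#string-range('{target_id}', {starts[rid]}, {pos - starts[rid]})"
                        )
            else:
                walk(child)

    walk(minidom.parseString(xml_text).documentElement)
    return pointers


sample = ('<div xml:id="d1"><p>ab<?range-start r="1"?>cd</p>'
          '<p>ef<?range-end r="1"?>gh</p></div>')
print(ranges_from_pis(sample, "d1"))  # {'1': "#string-range('d1', 2, 4)"}
```

Because the start and end markers are not elements, a marked range may begin in one paragraph and end in another, as it does in the sample.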
Our implementation, then, consists of a simple way to create string-range() pointers using
an XSLT 2.0 stylesheet transformation and a set of functions that can be used to process
the data marked by a string-range() in the context of an XPath 2.0 processor. Using these
stylesheets it is possible, for example, to mark up ranges of text in a non-hierarchical
way and then generate a set of links denoting those ranges, to which additional standoff
markup may be linked. Alternatively, one can convert a document with inline markup to one
where a division contains plain text and a second division contains markup and pointers to
the text.
While the authors intend this effort to be a practical addition to the TEI’s arsenal of
tools, this kind of implementation raises theoretical issues that bring us back to the
question of the adequacy of inline markup. In the example below, taken from
http://github.com/hcayless/tei-string-range/blob/master/bgu.1.116.xml, a
transcription of a document written on papyrus from the Arsinoite nome in Egypt, some of
the text content in the edition <div> is readable in the original, and some has been
supplied by the editor.
<lb n="1"/><handShift new="m3"/> <num value="62">ξβ</num>
<lb n="2"/><handShift new="m1"/>
<supplied reason="lost">Ἁρποκρατίω</supplied>ν<supplied reason="lost">ι</supplied>
τ<supplied reason="lost">ῷ κ</supplied>αὶ Ἱέρακι
<expan>β<supplied reason="lost">ασ<ex>ιλικῷ</ex></supplied></expan>
<lb n="3"/><supplied reason="lost"><expan>γρ<ex>αμματεῖ</ex></expan>
<expan>Ἀρσ<ex>ινοΐτου</ex></expan></supplied>
<expan>Ἡρ<supplied reason="lost">ακ<ex>λείδου</ex></supplied></expan>
<supplied reason="lost"> με</supplied>ρίδος
<lb n="4"/><supplied reason="lost">παρὰ</supplied>
Ὡ<supplied reason="lost">ριγέ</supplied><unclear>ν</unclear>ους
Ἰσιδ<supplied reason="lost">ώ</supplied>ρο<supplied reason="lost">υ</supplied>
<lb n="5"/><supplied reason="lost">τῶν ἀπὸ</supplied> τῆ<supplied reason="lost">ς</supplied>
<expan>μ<supplied reason="lost">ητρ</supplied>ο<ex>πόλεως</ex></expan>
<expan>ἀπογε<supplied reason="lost">γρ</supplied>α<ex>μμένου</ex></expan>
<lb n="6"/><supplied reason="lost">ἐπʼ <expan>ἀμφό<ex>δου</ex></expan> </supplied>
<gap reason="lost" quantity="1" unit="character"/><abbr>ερω</abbr>
Θε<gap reason="lost" quantity="1" unit="character"/><abbr><unclear>μι</unclear>
<gap reason="illegible" quantity="1" unit="character"/></abbr>.
A transcription of the first six lines following the Leiden convention reads thus:
(hand 3) ξβ
(hand 1) [Ἁρποκρατίω]ν[ι] τ[ῷ κ]αὶ Ἱέρακι β[ασ(ιλικῷ)]
[γρ(αμματεῖ) Ἀρσ(ινοΐτου)] Ἡρ[ακ(λείδου) με]ρίδος
[παρὰ] Ὡ[ριγέ]ν̣ους Ἰσιδ[ώ]ρο[υ]
[τῶν ἀπὸ] τῆ[ς] μ[ητρ]ο(πόλεως) ἀπογε[γρ]α(μμένου)
[ἐπʼ ἀμφό(δου) ̣]ερω( ) Θε[ ̣]μ̣ι̣[ ̣]( ).
A “plain text” version, obtained by removing the markup from the text content
of the TEI document, looks like this:
ξβ
Ἁρποκρατίωνι τῷ καὶ Ἱέρακι βασιλικῷ
γραμματεῖ Ἀρσινοΐτου Ἡρακλείδου μερίδος
παρὰ Ὡριγένους Ἰσιδώρου
τῶν ἀπὸ τῆς μητροπόλεως ἀπογεγραμμένου
ἐπʼ ἀμφόδου ερω Θεμι.
while the extracted markup, with <ptr> elements that refer back to the text div, looks
like this:
<lb n="1"/>
<handShift new="m3"/>
<ptr target="#string-range('d2e120', 6, 1)"/>
<num value="62">
<ptr target="#string-range('d2e120', 7, 2)"/>
</num>
<ptr target="#string-range('d2e120', 9, 7)"/>
<lb n="2"/>
<handShift new="m1"/>
<ptr target="#string-range('d2e120', 16, 1)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 17, 10)"/>
</supplied>
<ptr target="#string-range('d2e120', 27, 1)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 28, 1)"/>
</supplied>
<ptr target="#string-range('d2e120', 29, 2)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 31, 3)"/>
</supplied>
<ptr target="#string-range('d2e120', 34, 10)"/>
<expan>
<ptr target="#string-range('d2e120', 44, 1)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 45, 2)"/>
<ex>
<ptr target="#string-range('d2e120', 47, 5)"/>
</ex>
</supplied>
</expan>
<ptr target="#string-range('d2e120', 52, 7)"/>
<lb n="3"/>
<supplied reason="lost">
<expan>
<ptr target="#string-range('d2e120', 59, 2)"/>
<ex>
<ptr target="#string-range('d2e120', 61, 7)"/>
</ex>
</expan>
<ptr target="#string-range('d2e120', 68, 1)"/>
<expan>
<ptr target="#string-range('d2e120', 69, 3)"/>
<ex>
<ptr target="#string-range('d2e120', 72, 7)"/>
</ex>
</expan>
</supplied>
<ptr target="#string-range('d2e120', 79, 1)"/>
<expan>
<ptr target="#string-range('d2e120', 80, 2)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 82, 2)"/>
<ex>
<ptr target="#string-range('d2e120', 84, 6)"/>
</ex>
</supplied>
</expan>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 90, 3)"/>
</supplied>
<ptr target="#string-range('d2e120', 93, 12)"/>
<lb n="4"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 105, 4)"/>
</supplied>
<ptr target="#string-range('d2e120', 109, 2)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 111, 4)"/>
</supplied>
<unclear>
<ptr target="#string-range('d2e120', 115, 1)"/>
</unclear>
<ptr target="#string-range('d2e120', 116, 8)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 124, 1)"/>
</supplied>
<ptr target="#string-range('d2e120', 125, 2)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 127, 1)"/>
</supplied>
<ptr target="#string-range('d2e120', 128, 7)"/>
<lb n="5"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 135, 7)"/>
</supplied>
<ptr target="#string-range('d2e120', 142, 3)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 145, 1)"/>
</supplied>
<ptr target="#string-range('d2e120', 146, 1)"/>
<expan>
<ptr target="#string-range('d2e120', 147, 1)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 148, 3)"/>
</supplied>
<ptr target="#string-range('d2e120', 151, 1)"/>
<ex>
<ptr target="#string-range('d2e120', 152, 6)"/>
</ex>
</expan>
<ptr target="#string-range('d2e120', 158, 1)"/>
<expan>
<ptr target="#string-range('d2e120', 159, 5)"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 164, 2)"/>
</supplied>
<ptr target="#string-range('d2e120', 166, 1)"/>
<ex>
<ptr target="#string-range('d2e120', 167, 6)"/>
</ex>
</expan>
<ptr target="#string-range('d2e120', 173, 7)"/>
<lb n="6"/>
<supplied reason="lost">
<ptr target="#string-range('d2e120', 180, 4)"/>
<expan>
<ptr target="#string-range('d2e120', 184, 4)"/>
<ex>
<ptr target="#string-range('d2e120', 188, 3)"/>
</ex>
</expan>
<ptr target="#string-range('d2e120', 191, 1)"/>
</supplied>
<gap reason="lost" quantity="1" unit="character"/>
<abbr>
<ptr target="#string-range('d2e120', 192, 3)"/>
</abbr>
<ptr target="#string-range('d2e120', 195, 3)"/>
<gap reason="lost" quantity="1" unit="character"/>
<abbr>
<unclear>
<ptr target="#string-range('d2e120', 198, 2)"/>
</unclear>
<gap reason="illegible" quantity="1" unit="character"/>
</abbr>
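The separation shown above can be approximated in a few lines of Python. This is a sketch of the idea rather than the stylesheet itself: split_text_and_markup and the sample input are invented for illustration, offsets here start at 0 for the first text node, and every text node (including whitespace-only ones) is replaced by a <ptr>.

```python
import xml.dom.minidom as minidom


def split_text_and_markup(xml_text, text_id):
    """Return (plain text, markup with each text node replaced by a <ptr>)."""
    doc = minidom.parseString(xml_text)
    chunks, pos = [], 0

    def walk(node):
        nonlocal pos
        for child in list(node.childNodes):  # copy: the tree changes as we go
            if child.nodeType == child.TEXT_NODE:
                ptr = doc.createElement("ptr")
                ptr.setAttribute(
                    "target",
                    f"#string-range('{text_id}', {pos}, {len(child.data)})",
                )
                node.replaceChild(ptr, child)
                chunks.append(child.data)
                pos += len(child.data)
            else:
                walk(child)

    walk(doc.documentElement)
    return "".join(chunks), doc.documentElement.toxml()


text, markup = split_text_and_markup(
    '<ab><num value="62">ab</num> c<supplied reason="lost">de</supplied></ab>', "d1"
)
print(text)    # ab cde
print(markup)
```

Empty elements such as <lb/> and <gap/> pass through unchanged, which is why they appear in the extracted markup without any pointer.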
This example is actually a fairly unproblematic one, since it does not contain any
alternate readings or editorial corrections or normalization. Yet even here there are
difficulties: “Θεμι” (as is clear in the Leiden version) contains two gaps and unclear
text, but since these visual features of the document are indicated using <gap/> and
<unclear/> tags, it looks like an undamaged word-fragment in the plain text version. It
must be noted that the traditional way of publishing these documents in print employs
inline markup. So, in this example at least, a plain text version would itself be a
somewhat misleading version of the document. This is not a refutation of Schmidt’s points,
because there are many other ways one could encode the document, using standoff markup,
that would mitigate this problem. But perhaps it suggests that there are at least some
uses of inline markup (when it encodes features of the text that cannot be expressed
straightforwardly in Unicode) that may be hard to replace.
The ability to extract the markup from the text and still preserve the manipulability it
previously enjoyed suggests some additional possibilities: one could now layer in name and
place information, lexical and grammatical analysis, and structural information such as
line containment (rather than just marking line beginnings). Different views could be
generated, using these layers individually or in combination. Nothing stops us from
layering these on top of inline markup either.
Since it relies on character offsets, any implementation of string-range() is inherently
somewhat brittle. The adoption of @xml:space by the TEI closes off one means by which
links using string-range could be broken, but can do nothing to mitigate the danger of
someone editing the text directly. Projects that use this mechanism will have to prevent
the breakage of string-range links either through workflow or editing environments that
manage shifting offsets.
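What managing shifting offsets might involve can be shown with a small Python sketch. It is hypothetical and not part of our stylesheets: the function shift_range and its parameters are invented, and an edit is modelled as deleting some characters at a position and then inserting others at the same position.

```python
def shift_range(offset, length, edit_at, inserted, deleted=0):
    """Return the (offset, length) a range should have after an edit.

    The edit removes `deleted` characters starting at `edit_at` and then
    inserts `inserted` characters at the same position.
    """
    delta = inserted - deleted
    if edit_at + deleted <= offset:
        # The edit lies wholly before the range: move the range.
        return offset + delta, length
    if edit_at >= offset + length:
        # The edit lies wholly after the range: nothing changes.
        return offset, length
    if offset <= edit_at and edit_at + deleted <= offset + length:
        # The edit lies wholly inside the range: resize the range.
        return offset, length + delta
    raise ValueError("edit straddles a range boundary; the range needs review")


print(shift_range(10, 5, 2, 3))      # (13, 5)
print(shift_range(10, 5, 12, 2, 1))  # (10, 6)
```

An edit that crosses the start or end of a range cannot be resolved mechanically, so the sketch raises an error and leaves the decision to a person.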
We have already learned a good deal from our implementation efforts to date. If this
approach is something other users of TEI or even the TEI Consortium itself wishes to
support, there are several changes we would suggest. First, that the Guidelines be emended
to contain a more thorough specification of the TEI pointer schemes. Second, that a
working group be formed to look at practical implementations of standoff markup and at
appropriate usage patterns for these. We must note that the example stylesheet we provide
to generate a text + standoff markup version of a valid TEI document results in invalid
TEI when applied to the bgu.1.116 example, because elements like <ex/> can only contain
text, not pointers to text. Moreover, if one wants to extract a string-range with the
inline markup converted to standalone elements, then again the result will not be valid
TEI. We hope our efforts outlined above will prompt some useful examination and perhaps
revision of the TEI Guidelines' perspective on standoff markup.