Introduction
The genesis of this paper lies in a discussion[1] on the Humanist mailing list that began with a request for comment from Desmond
Schmidt on his recent article in LLC, “The inadequacy of embedded markup for
cultural heritage texts” [Schmidt2010]. The core of the article
is an argument (really a series of arguments) that the insertion of what we will
call “inline” markup (the format of which is typically XML) into the midst of a text
to be interpreted is in some sense a violation of that text. Schmidt comes at this
from
several angles, highlighting the overlap problem, the imposition of subjective interpretation
on the text in the form of markup that could become obsolete before the text itself
does, the
ways in which inline markup may duplicate information that could be derived automatically,
and
the fact that markup technologies like XML are “industrial” and inherit from
textual command languages designed for print.
The authors aren’t sure they completely agree with all of this, but Schmidt’s is a
thoughtful article, and a useful contribution to the ongoing debate over how satisfactory
XML
is for representing text. The subsequent discussion on Humanist went on for an unusually
long
series of posts, and was at times quite contentious. It inspired Hugh Cayless to call
a
session on “The (in)adequacies of markup”
[http://thatcamp.org/2010/the-inadequacies-of-markup/] at the THATCamp meeting
held shortly afterwards at George Mason University. The session participants quickly
agreed on
a ruthlessly practical approach. As programmers, we are quite pleased that XML is
an “industrial tool”,
and while we’ll happily acknowledge the shortcomings of the
Text Encoding Initiative (TEI), the size of its install base and the number of texts
already
encoded using it led us to look for solutions to the problems inherent in inline markup
that
could be implemented within the context of XML and the TEI. The obvious alternative
to inline
markup is standoff markup, and the TEI Guidelines have at least some things to say
about doing
standoff markup in TEI.
TEI, standoff markup, and string-range()
Section 16.2.4 of the Text Encoding Initiative Guidelines outlines a number of pointer schemes that are related to functions defined in the XPointer specification [XPtr]. These can (notionally at least) be used to produce standoff markup on a TEI document. There are a variety of problems with the pointer schemes defined by the Guidelines, and also with the related XPointer functions, but the most basic is that most of them don't have any implementation. There is, therefore, no good way to use them and, because they are unused, no good reason to implement them either. It is a Catch-22. The TEI pointer schemes are clearly meant to be used in concert with XInclude, as functions that retrieve text or node sets (see the example in 16.9.3), but their effects are underspecified in the Guidelines.
Recent developments in the TEI have opened up the possibility of creating an implementation of at least one of these schemes, namely string-range(). The string-range() pointer scheme is defined thus:
16.2.4.5 string-range(fragmentIdentifier, offset [, length])
The string-range() scheme locates a range based on character positions. While string-range endpoints are points adjacent to character positions, they must be designated by the characters to which they are adjacent, in the same way that the nodes corresponding to XML elements are. This avoids ambiguity about which point between two characters is indicated when characters are interrupted by markup.
The first argument to string-range() designates a node or a range within which a string is to be located. No string range, even an empty one, can be defined by a string-range() if the fragment identified has the empty string as its value. Every string-range is defined based on an ‘origin character’. The origin is numbered 0, and designates the first character of the string-value of pointer. The offset is a character index relative to the origin; the start of the resulting range is the position designated by the sum of the origin and offset.
If length is specified, the end of the range is at a point adjacent to the character designated by the origin added to the offset and length. If the offset is negative, or length is sufficiently large, a string-range can designate characters outside the string-value of the initial pointer. In this case, characters are located using the string-value of the entire document. It is also legal for length plus the origin to exceed the length of the string-value of the document by one, in order to accommodate ranges that include the last character of a document.
If length is not specified, it defaults to the value 1, and the string range contains one character. If it is specified as 0, the zero-length range is interpreted as the point immediately preceding the origin character or offset character if there is one. [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSSR][2]
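The offset and length arithmetic in this definition can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name string_range is ours, the origin is taken to be 0 as in the quoted text, and ranges that would spill outside the string-value of the target are rejected rather than resolved against the whole document.

```python
def string_range(string_value: str, offset: int, length: int = 1) -> str:
    """Return the characters selected by (offset, length) within one string-value.

    Assumption: 0-based origin; ranges reaching outside the string-value
    are rejected instead of being located in the rest of the document.
    """
    if string_value == "":
        raise ValueError("no string range can be defined on an empty string-value")
    if length < 0:
        raise ValueError("length must not be negative")
    start, end = offset, offset + length
    if start < 0 or end > len(string_value):
        raise ValueError("range falls outside the string-value (not handled here)")
    return string_value[start:end]


text = "abcdefgh"
three_chars = string_range(text, 2, 3)   # explicit length
one_char = string_range(text, 2)         # length defaults to 1
point = string_range(text, 2, 0)         # zero-length range: a point, no characters
```

A zero-length range selects no characters at all; it only marks the point immediately preceding the character at the given offset.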
Since string-range depends on marking a starting point and length of text within a section of the document, it runs immediately into a problem with the way XML regards some whitespace as "ignorable". Space between elements, for example, is not necessarily preserved during operations on the document. Someone editing a document might pretty-print it in order to make it more readable. This would introduce extra newline and space characters into the document, and immediately break any string-range() pointers. In other words, the ignorable whitespace content of the document could be changed as a part of normal processing that doesn’t involve any editing of the document. This year, for the first time, the TEI has begun to allow the xml:space attribute. [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.html] This means that the ignorable whitespace issue can be accommodated in a standard way.
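The effect of pretty-printing on character offsets is easy to demonstrate. The following Python sketch uses two invented documents (not TEI) that differ only in whitespace between elements, and shows that the same word sits at a different offset in each string-value.

```python
import xml.etree.ElementTree as ET

# Two serializations of the "same" document; only inter-element whitespace differs.
compact = "<div><p>alpha</p><p>beta</p></div>"
pretty = "<div>\n  <p>alpha</p>\n  <p>beta</p>\n</div>"


def string_value(xml_text: str) -> str:
    """Concatenate all text nodes of the document in document order."""
    return "".join(ET.fromstring(xml_text).itertext())


compact_offset = string_value(compact).index("beta")
pretty_offset = string_value(pretty).index("beta")

# The offsets disagree, so a pointer computed against one serialization
# selects the wrong characters in the other.
print(compact_offset, pretty_offset)
```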
A second problem, and one that applies to several of the pointer schemes that the Guidelines specify, is that they extend the XML data model. The TEI pointer scheme conceives of Nodes and Node Sets (both of which correspond to objects in the XML Infoset/DOM), but also Points and Ranges. Points are theoretical objects that must lie between element nodes or between characters in text nodes. This is a useful concept for marking arbitrary ranges in a document, but since it does not correspond to anything conceived of by the XML specifications, there are no hooks in XML processing tools on which to hang Points. They cannot be passed to or returned by any XPath function or XSLT instruction. This makes implementation a complex task. At best, they can be encapsulated in special-purpose markup for passing as messages or handled as uninterpreted XPath expressions. The former technique introduces a problem of standardization and the latter requires second-order processing, with the dangers and difficulties that implies. Since string-range focuses on text, however, it is possible to count, for each text node, the concatenated length of text nodes on the preceding axis, and thereby to locate the text nodes containing the start and end points indicated in a string-range() pointer.
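The counting strategy described in the last sentence can be sketched as follows. This is a Python illustration rather than XPath 2.0 code; the helper names text_nodes and locate are hypothetical, the sample element is a simplified ASCII stand-in for a TEI fragment, and only ranges with a length greater than zero are considered.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List, Tuple


def text_nodes(elem: ET.Element) -> Iterator[Tuple[ET.Element, str, str]]:
    """Yield (owner, slot, text) for every text node below elem, in document order.

    ElementTree stores text either as an element's .text or as a child's .tail,
    so slot is "text" or "tail".
    """
    if elem.text:
        yield elem, "text", elem.text
    for child in elem:
        yield from text_nodes(child)
        if child.tail:
            yield child, "tail", child.tail


def locate(root: ET.Element, offset: int, length: int) -> List[Tuple[str, str, int, int]]:
    """Return (owner tag, slot, local start, local end) for each text node the range touches."""
    start, end = offset, offset + length
    hits = []
    preceding = 0  # combined length of all text nodes seen so far
    for owner, slot, text in text_nodes(root):
        node_start, node_end = preceding, preceding + len(text)
        if node_start < end and start < node_end:
            hits.append((owner.tag, slot,
                         max(start, node_start) - node_start,
                         min(end, node_end) - node_start))
        preceding = node_end
    return hits


root = ET.fromstring('<ab>t<supplied reason="lost">o k</supplied>ai Hieraki</ab>')
hits = locate(root, 0, 6)  # the six characters "to kai"
```

Here the range touches three text nodes: the single character before the supplied element, the whole of its content, and the first two characters after it.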
A third problem with string-range() as defined by the TEI, and in fact with all of
its
XPointer schemes, is that the specification (the TEI Guidelines) doesn't properly
address what
implementation would mean. The example in 16.9.3 uses string-range in XInclude elements
to
import text from one XML document to another. Of course this example doesn’t work,
because
TEI’s string-range has no XInclude implementation. But the (unstated) implication
seems to be
that the string-range() function returns plain text only. String-range could certainly
be used
to declaratively indicate arbitrary sections of a document, but without some mechanism
for
executing it, there is nothing concrete for an implementer to do. A further complication
is
that there is nothing stopping a string-range from indicating text that overlaps elements
in a
non-hierarchical fashion. Should an implementer ignore elements thus captured? Or
return them
somehow? A related issue is the fact that since string-range defines text-based locations,
elements are effectively invisible to it. A standalone element (e.g. <lb/>)
immediately before text that one wants to mark with a string-range() won't automatically
be
part of that range.
Given the underspecified functionality of string-range, the authors have made some assumptions about implementation details. We have decided not to extend any existing XInclude implementation. Instead, we have decided to use string-range only in a declarative fashion, as a pointing mechanism within TEI, and we are developing XPath 2.0 functions that complement and use string-range(). Where it declares a range, they will be able to retrieve that range. We propose three functions, with the following signatures:
get-string-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])
- takes as arguments an XPath indicating a parent element (e.g. a div on which @xml:space="preserve" has been set) and a set of integer pairs of character offsets
- returns a sequence of strings derived from text nodes or portions of text nodes between the pairs of points passed in as parameters.
get-milestone-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])
- takes as arguments an XPath indicating a parent element (e.g. a div on which @xml:space="preserve" has been set) and a set of integer pairs of character offsets
- returns a sequence where elements have been converted to milestones (e.g. <p-start> and <p-end> instead of <p>).
get-fragment-range(parentElt, offset1, offset2 [, offset3, offset4, etc.])
- takes as arguments an XPath indicating a parent element (e.g. a div on which @xml:space="preserve" has been set) and a set of integer pairs of character offsets
- returns a well-formed document fragment, where elements split by the range have been automatically opened or closed.
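To make the intended behaviour of the first two functions more concrete, here is a rough Python analogue. It is not the XPath 2.0 implementation: the names are transliterated, each pair is read as a (start, end) offset pair, the milestone version handles a single pair, and the rule for tags that sit exactly on a range boundary is our own assumption (a start tag at the range start and an end tag at the range end are kept, along with tags strictly inside the range).

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List, Tuple


def get_string_range(parent: ET.Element, *offsets: int) -> List[str]:
    """Return one string per (start, end) pair of offsets into parent's string-value."""
    if not offsets or len(offsets) % 2:
        raise ValueError("offsets must be given as (start, end) pairs")
    value = "".join(parent.itertext())
    result = []
    for start, end in zip(offsets[0::2], offsets[1::2]):
        if not 0 <= start <= end <= len(value):
            raise ValueError(f"pair ({start}, {end}) is outside the string-value")
        result.append(value[start:end])
    return result


def events(elem: ET.Element) -> Iterator[Tuple[str, str]]:
    """Yield ("text", string) and ("start" / "end", tag) events for elem's content."""
    if elem.text:
        yield "text", elem.text
    for child in elem:
        yield "start", child.tag
        yield from events(child)
        yield "end", child.tag
        if child.tail:
            yield "text", child.tail


def get_milestone_range(parent: ET.Element, start: int, end: int) -> List[str]:
    """Return text pieces and milestone markers (as strings) for the range [start, end)."""
    out, pos = [], 0
    for kind, value in events(parent):
        if kind == "text":
            lo, hi = max(start, pos), min(end, pos + len(value))
            if lo < hi:
                out.append(value[lo - pos:hi - pos])
            pos += len(value)
        elif (start < pos < end
              or (kind == "start" and pos == start)
              or (kind == "end" and pos == end)):
            out.append(f"<{value}-{kind}/>")  # assumed boundary rule, see lead-in
    return out


div = ET.fromstring("<div><p>abcd</p><p>efgh</p></div>")
strings = get_string_range(div, 2, 6, 0, 1)
milestones = get_milestone_range(div, 2, 6)
```

For the range 2 to 6, which straddles the two paragraphs, the string version returns plain text only, while the milestone version records where the first paragraph ends and the second begins.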
A fourth problem lies in the ease-of-use of the string-range function. Determining
the
index location of a piece of arbitrary text in a TEI document is prohibitively difficult
for a
human editor. It would be relatively easy to programmatically generate a string-range
based on
a selected range in an XML editor, like oXygen, but without this kind of functionality,
it
will be quite hard for someone marking up a document to create the expression with
facility.
What is needed at a bare minimum is a means to mark range starts and ends, in an
editor-independent fashion, which can then be converted to string-range expressions.
We
propose using processing instructions in the form <?range-start r="n"?>/<?range-end r="n"?>, where "n" identifies a particular
range. Pairs of these will mark range starts and ends, and can be processed by an
XSLT
stylesheet to create <linkGrp>s containing links that use string-range() to
identify the marked ranges.
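The conversion from processing-instruction pairs to string-range() expressions can be illustrated in Python with the standard library's minidom, which keeps processing instructions in the tree. This sketch stands in for the XSLT stylesheet and is not a translation of it; the function name, the sample document and the exact shape of the emitted pointer strings are assumptions made for the example.

```python
import re
from typing import Dict
from xml.dom import Node, minidom

RANGE_ID = re.compile(r'r="([^"]*)"')


def ranges_from_pis(xml_text: str) -> Dict[str, str]:
    """Map each range id to a string-range() pointer relative to the root element."""
    doc = minidom.parseString(xml_text)
    container = doc.documentElement
    container_id = container.getAttribute("xml:id")
    starts: Dict[str, int] = {}
    pointers: Dict[str, str] = {}
    position = 0  # number of text characters seen so far

    def walk(node) -> None:
        nonlocal position
        for child in node.childNodes:
            if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
                position += len(child.data)
            elif child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
                match = RANGE_ID.search(child.data)
                if match is None:
                    continue  # some other processing instruction
                range_id = match.group(1)
                if child.target == "range-start":
                    starts[range_id] = position
                elif child.target == "range-end" and range_id in starts:
                    start = starts.pop(range_id)
                    pointers[range_id] = (
                        f"#string-range('{container_id}', {start}, {position - start})"
                    )
            elif child.nodeType == Node.ELEMENT_NODE:
                walk(child)

    walk(container)
    return pointers


SAMPLE = ('<ab xml:id="t1">one <?range-start r="a"?>two <hi><?range-start r="b"?>three'
          '<?range-end r="a"?></hi> four<?range-end r="b"?></ab>')
pointers = ranges_from_pis(SAMPLE)
```

Ranges a and b overlap each other and cross the boundary of the hi element, which is exactly the kind of non-hierarchical marking that inline elements cannot express directly.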
Our implementation, then, consists of a simple way to create string-range() pointers using an XSLT 2.0 stylesheet transformation and a set of functions that can be used to process the data marked by a string-range() in the context of an XPath 2.0 processor. Using these stylesheets it is possible, for example, to mark up ranges of text in a non-hierarchical way and then generate a set of links denoting those ranges, to which additional standoff markup may be linked. Alternatively, one can convert a document with inline markup to one where a division contains plain text and a second division contains markup and pointers to the text.
While the authors intend this effort to be a practical addition to the TEI’s arsenal
of
tools, this kind of implementation raises theoretical questions that bring us back
to the
question of the adequacy of inline markup. In the example below, taken from
http://github.com/hcayless/tei-string-range/blob/master/bgu.1.116.xml, a
transcription of a document written on papyrus from the Arsinoite nome in Egypt, some of the
text content in the edition <div> is readable in the original, and some has been
supplied by the editor.
<lb n="1"/><handShift new="m3"/> <num value="62">ξβ</num> <lb n="2"/><handShift new="m1"/> <supplied reason="lost">Ἁρποκρατίω</supplied>ν<supplied reason="lost">ι</supplied> τ<supplied reason="lost">ῷ κ</supplied>αὶ Ἱέρακι <expan>β<supplied reason="lost">ασ<ex>ιλικῷ</ex></supplied></expan> <lb n="3"/><supplied reason="lost"><expan>γρ<ex>αμματεῖ</ex></expan> <expan>Ἀρσ<ex>ινοΐτου</ex></expan></supplied> <expan>Ἡρ<supplied reason="lost">ακ<ex>λείδου</ex></supplied></expan> <supplied reason="lost"> με</supplied>ρίδος <lb n="4"/><supplied reason="lost">παρὰ</supplied> Ὡ<supplied reason="lost">ριγέ</supplied><unclear>ν</unclear>ους Ἰσιδ<supplied reason="lost">ώ</supplied>ρο<supplied reason="lost">υ</supplied> <lb n="5"/><supplied reason="lost">τῶν ἀπὸ</supplied> τῆ<supplied reason="lost">ς</supplied> <expan>μ<supplied reason="lost">ητρ</supplied>ο<ex>πόλεως</ex></expan> <expan>ἀπογε<supplied reason="lost">γρ</supplied>α<ex>μμένου</ex></expan> <lb n="6"/><supplied reason="lost">ἐπʼ <expan>ἀμφό<ex>δου</ex></expan> </supplied> <gap reason="lost" quantity="1" unit="character"/><abbr>ερω</abbr> Θε<gap reason="lost" quantity="1" unit="character"/><abbr><unclear>μι</unclear> <gap reason="illegible" quantity="1" unit="character"/></abbr>.
A transcription of the first six lines following the Leiden convention reads thus:
(hand 3) ξβ (hand 1) [Ἁρποκρατίω]ν[ι] τ[ῷ κ]αὶ Ἱέρακι β[ασ(ιλικῷ)] [γρ(αμματεῖ) Ἀρσ(ινοΐτου)] Ἡρ[ακ(λείδου) με]ρίδος [παρὰ] Ὡ[ριγέ]ν̣ους Ἰσιδ[ώ]ρο[υ] [τῶν ἀπὸ] τῆ[ς] μ[ητρ]ο(πόλεως) ἀπογε[γρ]α(μμένου) [ἐπʼ ἀμφό(δου) ̣]ερω( ) Θε[ ̣]μ̣ι̣[ ̣]( ).
With the markup extracted, the plain text of the same lines reads:
ξβ Ἁρποκρατίωνι τῷ καὶ Ἱέρακι βασιλικῷ γραμματεῖ Ἀρσινοΐτου Ἡρακλείδου μερίδος παρὰ Ὡριγένους Ἰσιδώρου τῶν ἀπὸ τῆς μητροπόλεως ἀπογεγραμμένου ἐπʼ ἀμφόδου ερω Θεμι.
The second division, containing the markup with <ptr> elements that refer
back to the text div, looks like:
<lb n="1"/> <handShift new="m3"/> <ptr target="#string-range('d2e120', 6, 1)"/> <num value="62"> <ptr target="#string-range('d2e120', 7, 2)"/> </num> <ptr target="#string-range('d2e120', 9, 7)"/> <lb n="2"/> <handShift new="m1"/> <ptr target="#string-range('d2e120', 16, 1)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 17, 10)"/> </supplied> <ptr target="#string-range('d2e120', 27, 1)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 28, 1)"/> </supplied> <ptr target="#string-range('d2e120', 29, 2)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 31, 3)"/> </supplied> <ptr target="#string-range('d2e120', 34, 10)"/> <expan> <ptr target="#string-range('d2e120', 44, 1)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 45, 2)"/> <ex> <ptr target="#string-range('d2e120', 47, 5)"/> </ex> </supplied> </expan> <ptr target="#string-range('d2e120', 52, 7)"/> <lb n="3"/> <supplied reason="lost"> <expan> <ptr target="#string-range('d2e120', 59, 2)"/> <ex> <ptr target="#string-range('d2e120', 61, 7)"/> </ex> </expan> <ptr target="#string-range('d2e120', 68, 1)"/> <expan> <ptr target="#string-range('d2e120', 69, 3)"/> <ex> <ptr target="#string-range('d2e120', 72, 7)"/> </ex> </expan> </supplied> <ptr target="#string-range('d2e120', 79, 1)"/> <expan> <ptr target="#string-range('d2e120', 80, 2)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 82, 2)"/> <ex> <ptr target="#string-range('d2e120', 84, 6)"/> </ex> </supplied> </expan> <supplied reason="lost"> <ptr target="#string-range('d2e120', 90, 3)"/> </supplied> <ptr target="#string-range('d2e120', 93, 12)"/> <lb n="4"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 105, 4)"/> </supplied> <ptr target="#string-range('d2e120', 109, 2)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 111, 4)"/> </supplied> <unclear> <ptr target="#string-range('d2e120', 115, 1)"/> </unclear> <ptr target="#string-range('d2e120', 116, 8)"/> 
<supplied reason="lost"> <ptr target="#string-range('d2e120', 124, 1)"/> </supplied> <ptr target="#string-range('d2e120', 125, 2)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 127, 1)"/> </supplied> <ptr target="#string-range('d2e120', 128, 7)"/> <lb n="5"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 135, 7)"/> </supplied> <ptr target="#string-range('d2e120', 142, 3)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 145, 1)"/> </supplied> <ptr target="#string-range('d2e120', 146, 1)"/> <expan> <ptr target="#string-range('d2e120', 147, 1)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 148, 3)"/> </supplied> <ptr target="#string-range('d2e120', 151, 1)"/> <ex> <ptr target="#string-range('d2e120', 152, 6)"/> </ex> </expan> <ptr target="#string-range('d2e120', 158, 1)"/> <expan> <ptr target="#string-range('d2e120', 159, 5)"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 164, 2)"/> </supplied> <ptr target="#string-range('d2e120', 166, 1)"/> <ex> <ptr target="#string-range('d2e120', 167, 6)"/> </ex> </expan> <ptr target="#string-range('d2e120', 173, 7)"/> <lb n="6"/> <supplied reason="lost"> <ptr target="#string-range('d2e120', 180, 4)"/> <expan> <ptr target="#string-range('d2e120', 184, 4)"/> <ex> <ptr target="#string-range('d2e120', 188, 3)"/> </ex> </expan> <ptr target="#string-range('d2e120', 191, 1)"/> </supplied> <gap reason="lost" quantity="1" unit="character"/> <abbr> <ptr target="#string-range('d2e120', 192, 3)"/> </abbr> <ptr target="#string-range('d2e120', 195, 3)"/> <gap reason="lost" quantity="1" unit="character"/> <abbr> <unclear> <ptr target="#string-range('d2e120', 198, 2)"/> </unclear> <gap reason="illegible" quantity="1" unit="character"/> </abbr>
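In the listing above, each pointer begins exactly where the previous one ends, so the pointers tile the text division without gaps. A small consistency check of this kind can be sketched in Python; the regular expression reflects only the pointer syntax visible in the example, and the function names are ours.

```python
import re
from typing import List, Tuple

POINTER = re.compile(r"#string-range\('([^']+)',\s*(\d+),\s*(\d+)\)")


def parse_pointer(target: str) -> Tuple[str, int, int]:
    """Split a target like #string-range('d2e120', 6, 1) into (id, offset, length)."""
    match = POINTER.fullmatch(target)
    if match is None:
        raise ValueError(f"not a string-range pointer: {target!r}")
    ident, offset, length = match.groups()
    return ident, int(offset), int(length)


def find_gaps(targets: List[str]) -> List[Tuple[int, int]]:
    """Return (expected, found) offsets wherever a pointer does not start
    at the point where the previous pointer ended."""
    problems = []
    expected = None
    for target in targets:
        _, offset, length = parse_pointer(target)
        if expected is not None and offset != expected:
            problems.append((expected, offset))
        expected = offset + length
    return problems


# The first six pointers of the example above.
targets = [
    "#string-range('d2e120', 6, 1)",
    "#string-range('d2e120', 7, 2)",
    "#string-range('d2e120', 9, 7)",
    "#string-range('d2e120', 16, 1)",
    "#string-range('d2e120', 17, 10)",
    "#string-range('d2e120', 27, 1)",
]
gaps = find_gaps(targets)
```

A check like this catches pointers that have drifted apart, but it cannot tell whether the whole sequence has shifted relative to the text.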
This example is actually a fairly unproblematic one, since it does not contain any
alternate readings or editorial corrections or normalization. Yet even here there
are
difficulties: “Θεμι” (as is clear in the Leiden version) contains two gaps and unclear
text,
but since these visual features of the document are indicated using <gap/> and <unclear/>
tags, it looks like an undamaged word-fragment in the plain
text version. It must be noted that the traditional way of publishing these documents
in print
employs inline markup. So, in this example at least, a plain text version would itself
be a
somewhat misleading version of the document. This is not a refutation of Schmidt’s
points,
because there are many other ways one could encode the document, using standoff markup,
that
would mitigate this problem. But perhaps it suggests that there are at least some
uses of
inline markup (when it encodes features of the text that cannot be expressed straightforwardly
in Unicode) that may be hard to replace.
The ability to extract the markup from the text and still preserve the manipulability it previously enjoyed suggests some additional possibilities: one could now layer in name and place information, lexical and grammatical analysis, structural information, such as line containment, rather than just marking line beginnings, etc. Different views could be generated, using these individually or using combinations of them. Nothing stops us from layering these on top of inline markup either.
Since it relies on character offsets, any implementation of string-range() is inherently
somewhat brittle. The adoption of @xml:space by the TEI closes off one means by which links
using string-range could be broken, but can do nothing to mitigate the danger of someone
editing the text directly. Projects that use this mechanism will have to prevent the
breakage
of string-range links either through workflow or editing environments that manage
shifting
offsets.
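As an indication of what managing shifting offsets might involve, the Python sketch below adjusts a list of (offset, length) ranges after characters are inserted into the text. It is a hypothetical helper, not part of our stylesheets: it handles insertions only, and it assumes that an insertion falling inside a range should lengthen that range.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length)


def shift_ranges(ranges: List[Range], position: int, inserted: int) -> List[Range]:
    """Adjust ranges after `inserted` characters are added at `position`."""
    adjusted = []
    for offset, length in ranges:
        if position <= offset:
            # Insertion at or before the range start: move the whole range along.
            adjusted.append((offset + inserted, length))
        elif position < offset + length:
            # Insertion inside the range: the range grows (an assumption).
            adjusted.append((offset, length + inserted))
        else:
            # Insertion after the range: nothing changes.
            adjusted.append((offset, length))
    return adjusted


old_text = "alpha beta gamma"
ranges = [(0, 5), (6, 4), (11, 5)]  # alpha, beta, gamma
new_text = old_text[:6] + "big " + old_text[6:]
new_ranges = shift_ranges(ranges, 6, len("big "))
selected = [new_text[o:o + n] for o, n in new_ranges]
```

Deletions, and edits that straddle a range boundary, need further policy decisions that this sketch leaves open.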
We have already learned a good deal from our implementation efforts to date. If this
approach is something other users of TEI or even the TEI Consortium itself wishes
to support,
there are several changes we would suggest. First, that the Guidelines be emended
to contain a
more thorough specification of the TEI pointer schemes. Second, that a working group
be formed
to look at practical implementations of standoff markup and at appropriate usage patterns
for
these. We must note that the example stylesheet we provide to generate a text + standoff
markup version of a valid TEI document results in invalid TEI when applied to the
bgu.1.116
example, because elements like <ex/> can only contain text, not pointers to
text. Moreover, if one wants to extract a string-range with the inline markup converted
to
standalone elements, then again the result will not be valid TEI. We hope our efforts
outlined
above will prompt some useful examination and perhaps revision of the TEI Guidelines’
perspective on standoff markup.
References
[TEIP5] Burnard, L. and S. Bauman (eds), Text Encoding Initiative: P5 Guidelines, http://www.tei-c.org/Guidelines/P5/ (2007).
[XPtr] DeRose, Steve, Eve Maler, and Ron Daniel Jr., XML Pointer Language (XPointer) Version 1.0, http://www.w3.org/TR/WD-xptr (2001).
[Schmidt2010] Schmidt, Desmond, “The inadequacy of embedded markup for cultural heritage texts”, Literary and Linguistic Computing 25.2 (2010).
[1] The discussion, in which most posts have the title “the inadequacies of markup”, began with
http://www.digitalhumanities.org/cgi-bin/humanist/archive/archive_msg.cgi?file=/Humanist.vol23.txt&msgnum=762&start=98202&end=98321
on April 25th and carried on for about three weeks. The postings may be found in
http://www.digitalhumanities.org/cgi-bin/humanist/archive/archive.cgi?list=/Humanist.vol23.txt
and
http://www.digitalhumanities.org/cgi-bin/humanist/archive/archive.cgi?list=/Humanist.vol24.txt
[2] We are so far being quite restrictive in our interpretation of the term fragmentIdentifier. In theory this could encompass any means of
identifying a section of the document, including functions in the XPointer framework,
for example. In practice, fragment identifiers are context-dependent, relying both
on
the MIME type of the document identified by the URI and on the functionality of the
technology used to call them. For example, in the context of an XInclude element,
some
xpointer functions will work, whereas in the context of a browser-based hyperlink,
only @id or @xml:id values work. Since we are working outside XInclude, we take the
narrow view that a fragment identifier in a string-range can only be the value of
an
@xml:id attribute somewhere in the current document or in an external XML
document.