Introduction
Two systems are said to be interoperable if each is able to work with the parts or products of the other, with minimal if any external intervention. When applied to formats of digital texts, interoperability is commonly differentiated into two types, syntactic and semantic.
Note
My distinction between and use of syntactic and semantic is congruent with that of the European Interoperability Framework (European Commission 2010, 23):
Semantic interoperability is about the meaning of data elements and the relationship between them. It includes developing vocabulary to describe data exchanges, and ensures that data elements are understood in the same way by communicating parties.
Syntactic interoperability is about describing the exact format of the information to be exchanged in terms of grammar, format and schemas.
Syntactic interoperability refers to consistency or completeness in encoding, markup, and related conventions attached to that markup. It generally implies the complete, lossless exchange of data, no matter its meaning. We witness syntactic interoperability every day that we use the Web. Major updated browsers accessing the data in any page written validly in a version of Hypertext Markup Language (HTML) will present different readers with the same content and roughly the same display. Likewise, in the realm of textual scholarship, files validly marked up with one of the Text Encoding Initiative (TEI) formats are, in general, syntactically interoperable. A valid TEI file created by one party can be shared with any other to be studied, processed, or otherwise used.
Semantic interoperability stands a level higher, and characterizes systems that can losslessly exchange not just the data but any associated or underlying meaning. For example, the UTF-8 string "France" may be syntactically interoperable with other systems that handle UTF-8, but for it to be semantically interoperable, the underlying significance or meaning, i.e., that the string represents the name of the country France, should also be preserved after exchange. Such semantics admit degrees of interest and importance. For example, in both HTML and TEI, <div> and <p> have some semantic meaning, but to most users one of little import or precision. HTML 5 has introduced a few other semantically interesting elements, e.g., <article>, but there are not many of these, thus keeping the vocabulary to fewer than 120 elements. In its more concerted effort to support scholarly concepts with markup, the TEI Consortium has produced many more elements, and with even greater precision, e.g., <watermark> and <residence>, so that in its full schema TEI supports nearly 550 elements. The TEI encourages projects and users to build on this effort by customizing the TEI to add their own semantically precise elements, or to remove ones that have no relevance to a given project.
But assigning an XML element to every possible concept of interest is impractical, even in a customized TEI scheme. Thousands of concepts could be encoded, but with what result? If an elemental vocabulary gets too large, it winds up being misunderstood or misused. Or it may legitimize interpretations that members of the community may regard as wrongly deviating from standard usage.
An alternative has emerged to making elements the main carrier of semantics. Known loosely and variously as linked data, open linked data, or the semantic web, this set of practices builds upon a recommendation of the World Wide Web Consortium (W3C) called the Resource Description Framework (RDF), a relatively simple data model that envisions data as a network of nodes connected by lines, termed rather misleadingly edges (http://www.w3.org/RDF/).
Note
In everyday usage, edge implies the juncture of two surfaces of one or more solid objects, with no implications for where that edge might begin or end, if it does at all. None of these sine qua nons for real-life edges have a place in the RDF appropriation of the metaphor. A newcomer may be forgiven for objecting that what is depicted looks like a line, not an edge.
In RDF, nodes and edges alike are named with uniform resource identifiers (URIs), many of them expressed as http:// uniform resource locators (URLs), so that further information about a thing or concept can be automatically retrieved. The method of transferring semantics thus shifts, from elements and attributes to the data they contain, namely URIs.
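To make the data model concrete, here is a minimal sketch of a single RDF assertion, written in Python with the rdflib library; my choice of library and of the DBpedia and RDFS URIs is illustrative, not prescribed by the RDF recommendation:

# A single RDF triple: two nodes (subject and object) joined by an edge
# (the predicate). Subject and predicate are URIs; the object here is a
# literal string. The URIs are illustrative.
from rdflib import Graph, Literal, Namespace, URIRef

g = Graph()
dbr = Namespace("http://dbpedia.org/resource/")
label = URIRef("http://www.w3.org/2000/01/rdf-schema#label")

# Assert: <France> has the human-readable label "France".
g.add((dbr.France, label, Literal("France")))

print(g.serialize(format="turtle"))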
RDF conventions have been implemented in markup languages to various degrees. Across the Internet, RDFa and other forms of structured markup (Microdata, Microformats) have been applied widely, helping HTML become a major vehicle for semantic interoperability. The Web is populated with billions of assertions that are semantically comparable.
Note
The University of Mannheim's Web Data Commons project, http://webdatacommons.org, conducts regular crawls of the entire Web. The project showed that in winter 2014 31% of HTML pages retrieved from 2.01 billion URLs (up from 26% of 2.24 billion in 2013) had some kind of structured markup, resulting in 20.5 billion RDF quads (RDF triples attached to a named graph; this figure is up from 17.2 billion in 2013). See http://webdatacommons.org/structureddata/2014-12/stats/stats.html and http://webdatacommons.org/structureddata/2013-11/stats/stats.html.
Note
For a theoretical reflection on canonical or standardized reference numbers and their place in digital projects, see Kalvesmaki 2014.
In this article I offer three practical ways to make standardized references in TEI more semantically interoperable. The first of these, deployment of Canonical Text Services URNs, is somewhat well known but has not yet been broadly used in TEI cross-references. The second has, to my knowledge, not yet been tried at all, namely, informal communities agreeing to adopt Schematron files, to be added to the prolog of TEI files to standardize cross-references to a work that is frequently cited. My third and final approach shifts to stand-off markup, and I offer a model based upon the Text Alignment Network, a planned TEI-friendly XML format for the interchange of aligned texts.
Standard Cross-References in TEI
[B]ecause the choice of tags is guided by human interpretation, TEI-XML encoded files are in general not interoperable (Schmidt 2014)
Doubts about the interoperability of the XML format supported by the Text Encoding Initiative (TEI) have been voiced on numerous occasions, even within the flagship journal of the TEI, as in the quote above.
Note
See also Schmidt 2010, and the October 2014 discussions on the public TEI-L listserv, initiated by Roberto Rosselli Del Turco under the subject line "Interchange of TEI documents: examples?" (https://listserv.brown.edu/archives/cgi-bin/wa).
The TEI provides numerous ways to encode a standardized cross-reference. The most prominent is @cRef, in tandem with <ptr> or <ref> (and sometimes supplemented by <cRefPattern>). But there are other ways as well. One could also use those elements with @target or @type. Or one could use <quote> along with @source. Other methods include the use of <link> and <linkGrp>, or even loose, unstructured mechanisms such as <bibl>. (The variety of options, as I shall argue, hampers interoperability.)
A few of these many options are discussed further in this paper. But for ease of discussion, I will concentrate on @cRef, presented in the TEI Guidelines as an ideal solution for an encoder who wishes to create a cross-reference to another work by means of a standardized or canonical reference. The relevant parts of the Guidelines, §3.10.4 and §16.2.5, although accurate, are disjoint, technical, and not clearly connected to everyday usage. So I present the material somewhat differently, from the perspective of the ordinary encoder who is putting a project together and doing their best to follow the recommended steps.
Note
All references to the TEI guidelines are based on version 2.8.0 of the P5 Guidelines, http://www.tei-c.org/Guidelines/P5/, last accessed 3 July 2015.
The Guidelines illustrate @cRef with the example of a text that quotes from the gospel of Matthew, chapter 5 verse 7 (Guidelines §16.2.5). Let us enhance this example by considering the needs of an encoder who is editing works by Anne Brontë and who has decided to encode explicit quotations, including the quotation from Matthew 5:7 that appears at chapter 5, paragraph 18 of Agnes Grey. Because our focus is on both syntax and semantics, let us assume that the encoder wishes to provide a cross-reference that will refer to as many versions of that text as possible, created independently by other encoders or projects, and will be as useful as possible to the maximum number of users, with a minimum of human intervention for processing the data. Let us also assume that all the TEI transcriptions that exist in the world are both discoverable and available. Of course, this is a terrible assumption to make in real life, but the problems associated with discoverability and availability are ubiquitous for this method and every other one, whether discussed in this article or not. Assessing those problems here would be repetitive and tangential to the main point, interoperability.
We turn to the Brontë encoder, who has prepared a plain TEI transcription of Agnes Grey, and now turns to marking up cross-references. Following the TEI guidelines, the encoder tags the quotation with <quote>. After seeing that only <gloss>, <term>, <ptr>, and <ref> support @cRef, the encoder ignores the first two. Upon further reading, particularly of the examples, the encoder, feeling that both <ptr> and <ref> are equally valid, decides that the markup is more of a reference than a pointer, so adds <ref> nearby in a valid location. The relevant part of the TEI file might look like this:
.....
<div type="chapter" n="5">
   .....
   <p xml:base="•••••••••">‘But, for the child’s own sake, it ought not to be
      encouraged to have such amusements,’ answered I, as meekly as I could, to
      make up for such unusual pertinacity. <said>‘<quote>“Blessed are the
      merciful, for they shall obtain mercy</quote><ref cRef="•••"/>.”’</said></p>
   .....
</div>
.....

The encoder has given @cRef and @xml:base dummy values because it is as yet unknown what kind of values are expected. A target Bible text must be chosen, and then it must be interrogated to find out what elements and attributes have been used, and with what values. So the encoder finds one in TEI format. After noting the URL, the encoder studies the file and finds that it has the following structure at the place quoted:

.....
<div n="Matt">
   .....
   <div type="chap" n="5">
      .....
      <ab type="v" n="7">Blessed are the merciful, for they will be shown mercy.</ab>
      .....
   </div>
   .....
</div>
.....
The encoder therefore replaces ••••••••• with the target URL (let's call it http://example.com/nt.xml) and replaces ••• with Matt 5:7. But the latter, being so far parsable only by humans, must be converted to something a computer can act upon. So the Brontë encoder, again following the Guidelines, adds a statement to the <teiHeader>, something like this:
<teiHeader>
   .....
   <encodingDesc>
      <refsDecl xml:id="biblical">
         <cRefPattern matchPattern="(.+) (.+):(.+)"
            replacementPattern="#xpath(//div[@n='$1']/div[@n='$2']/ab[@n='$3'])">
            <p>This pointer pattern extracts and references the <q>book,</q>
               <q>chapter,</q> and <q>verse</q> parts of a biblical reference.</p>
         </cRefPattern>
      </refsDecl>
   </encodingDesc>
   .....
</teiHeader>
Note
The program listing above departs slightly from the official example in the TEI Guidelines (§16.2.5), which uses #xpath(//div[@n='$1']/div[$2]/div[$3]), an XPath expression that assumes that verse labels and positions are isomorphic. That is a false assumption for most modern editions, which suppress or demote verses considered spurious without altering the canonical numbering. The @replacementPattern in my example also takes into account advice at §16.3 that Bible verses should be tagged <ab>.
The <cRefPattern> stipulates for any TEI processor that Matt 5:7 should be converted to the URL http://example.com/nt.xml#xpath(//div[@n='Matt']/div[@n='5']/ab[@n='7']).
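TEI processors would normally implement this conversion with preprocessing stylesheets; the following Python sketch merely illustrates the mechanics, assuming the dummy URL and the <cRefPattern> declared above:

# Resolve a @cRef value against @matchPattern and @replacementPattern.
import re

match_pattern = r"(.+) (.+):(.+)"
replacement_pattern = "#xpath(//div[@n='$1']/div[@n='$2']/ab[@n='$3'])"
xml_base = "http://example.com/nt.xml"

def resolve_cref(cref):
    m = re.fullmatch(match_pattern, cref)
    if m is None:
        raise ValueError("No match for " + repr(cref))
    result = replacement_pattern
    # $1, $2, ... stand for captured groups, as in XSLT regex replacement.
    for i, group in enumerate(m.groups(), start=1):
        result = result.replace("$" + str(i), group)
    return xml_base + result

print(resolve_cref("Matt 5:7"))
# -> http://example.com/nt.xml#xpath(//div[@n='Matt']/div[@n='5']/ab[@n='7'])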
The encoder's job finishes, and the work now moves to those who wish to process, publish, or study the data. This requires the use of some TEI-compliant and -aware processing mechanism, which will take the TEI elements and attributes that have been used for cross-referencing, resolve them to retrieve a string or document fragment, and then transform that data according to whatever purpose is intended. Although the end result differs widely from one processor to another, the initial, preparatory step is common across the board. All processors must be programmed to find instances of @cRef, take the string value, find a matching pattern in @matchPattern (in <cRefPattern>), create an XPath expression to be applied to the target XML file of Matthew (specified by @xml:base), and then retrieve the document fragment for later transformation.
But even in this preparatory stage, the processor requires some human intervention. Someone must first step in and configure it to address irregularities not found in other TEI files. The person configuring the processor must study the Brontë text and discern which elements have been used for cross-references, and with what kind of editorial consistency. Perhaps the configurer is surprised to find that the encoder chose <ref> instead of <ptr>, and that the former was left empty. Perhaps the configurer is surprised to find that the Brontë encoder was enamoured of @cRef and ignored a simpler solution, that of <quote> with @source. Perhaps the encoder and configurer will engage in a spirited discussion as to the best use of TEI.
Perhaps the configurer and encoder are not on speaking terms, and <ref> stands. The configurer must interrogate the use of the element even further to determine what relationship any given <quote> and <ref> pair share. After all, the former could be the previous sibling, next sibling, parent, or child of the latter. (Of these four valid configurations, three are offered as examples in the TEI Guidelines.) The configurer might find that in a series of adjacent quotes it is difficult to tell which <quote> is paired with which <ref>, and the encoder may not have been consistent. The variety of options in TEI is the source of extra work for the person configuring the pre-processor. As Schmidt points out, in the quote above, the choice of an element, as well as its placement, is subject to human interpretation, and is therefore detrimental to interoperability.
Such a workflow also requires quite a lot of human intervention and interpretation at both stages (transcription, pre-processing configuration). And not only does it fail to preserve any data required for semantic interoperability, such as URNs, but it can scarcely be said to be even syntactically interoperable. The syntax of the values of @cRef and @replacementPattern is guaranteed to be applicable only to one quoting version and one quoted version. Any attempt to apply the data to other versions of the New Testament (reflected by, say, changing the value of @xml:base) must be preceded by checking the structure and contents of the new file. In addition, once @cRef is used this way, it becomes difficult to use the attribute to refer to works other than the New Testament.
Note
This is most acute when an encoder wishes to use @cRef to point to multiple works, a practice that would tax the limits of @xml:base.
@cRef as an interoperable cross-reference mechanism thus proves to be rather limited. It may be suitable for a single project depending upon specific files, but it is not prepared to handle a distributed network of independently created TEI files.
TEI @cRef + Canonical Text Services URNs
The limitations of @cRef prompt many TEI users to migrate to more complex TEI linking mechanisms (discussed below). But @cRef need not be abandoned so quickly. Its syntactic and semantic value can be enhanced rather easily through Canonical Text Services (CTS) URNs, a convention that defines a way to coin unique, computer-actionable references to literary works independent of individual versions. A description of the syntax of CTS URNs would take us too far afield, and is easily found elsewhere.
Note
Discussed informally at http://www.homermultitext.org/hmt-doc/cite/cts-subreferences.html and defined formally at http://www.homermultitext.org/hmt-docs/specifications/ctsurn/. See also Kalvesmaki 2014, paras. 15-24, esp. notes 12-17, where I register some concerns about the design of CTS URNs.
Suffice it to say that the CTS URN for Matthew 5:7 is urn:cts:greekLit:tlg0031.tlg001:5.7 (the Greek New Testament is catalogued by the Thesaurus Linguae Graecae as author number 0031, and Matthew as work number 001). This URN is said, by definition, to be valid for any version of Matthew.
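For illustration, here is a hedged Python sketch of how such a URN might be decomposed into its parts; the field names are my own glosses on the URN's components, not official CTS terminology:

# Split a CTS URN into its components. The work field is itself
# dot-delimited: text group (author), work, and optionally version.
def parse_cts_urn(urn):
    prefix, nid, namespace, work, *passage = urn.split(":")
    if (prefix, nid) != ("urn", "cts"):
        raise ValueError("Not a CTS URN: " + urn)
    work_parts = work.split(".")
    return {
        "namespace": namespace,                                    # greekLit
        "textgroup": work_parts[0],                                # tlg0031
        "work": work_parts[1] if len(work_parts) > 1 else None,    # tlg001
        "version": work_parts[2] if len(work_parts) > 2 else None, # e.g. diaryA
        "passage": passage[0] if passage else None,                # 5.7
    }

print(parse_cts_urn("urn:cts:greekLit:tlg0031.tlg001:5.7"))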
Let us revisit the workflow of our example. Above we started with the Brontë encoder, and we placed no special requirements upon the TEI-compliant version of Matthew she or he used. But under the CTS URN method, the process has to start earlier, with the target text. Or rather, more precisely, a new participant is introduced as an intermediary between the New Testament encoder and the Brontë one, namely, a CTS server.
The person who administers a CTS server finds one or more TEI-compliant New Testament texts, and processes those texts, importing them into an RDF-compliant data store. During that process each segment of text is converted into RDF data that connects the text string with a CTS URN (in RDF terms, the latter would be the subject and the former the object). The data could be stored and served in any number of ways, for example as a relational database or as a SPARQL Protocol and RDF Query Language (SPARQL) endpoint.
Note
Whereas the architects of CTS have developed CTS as a SPARQL endpoint, Jochen Tiepmar, at the University of Leipzig, has deployed a CTS server as a MySQL database. See https://github.com/cite-architecture/sparqlcts and http://www.culingtec.uni-leipzig.de/ESU_C_T/node/471.
In our example, we start with an administrator of a CTS server, who finds a TEI New Testament. After interrogating the data structure, the administrator imports the verses of the New Testament, along with their proper CTS URNs, into the service. The administrator publishes specifications for the API that state that any queries should target the URL http://ctsservice.example.com/text, add a question mark, then the CTS URN.
Work shifts to the Brontë transcriber, who now does not need to study the structure of any particular New Testament text. All he or she needs to do is get the base URL for the CTS service, follow the specifications for the API, and encode the novel accordingly, e.g.:
.....
<div xml:base="http://ctsservice.example.com/text?">
   <p>‘But, for the child’s own sake, it ought not to be encouraged to have such
      amusements,’ answered I, as meekly as I could, to make up for such unusual
      pertinacity. ‘<quote>“Blessed are the merciful, for they shall obtain
      mercy.”</quote><ref cRef="urn:cts:greekLit:tlg0031.tlg001:5.7"/>’</p>
.....
This particular CTS URN points to every version of the New Testament held in a particular CTS service. But if the Brontë encoder knows that the quotation is from a specific version of Matthew, say a handwritten diary, and finds that version available in a CTS service, the value of @cRef can simply be narrowed further, e.g., urn:cts:greekLit:tlg0031.tlg001.diaryA:5.7.

The two attributes @xml:base and @cRef are all that is required of the transcriber. The syntax of the CTS URN renders <cRefPattern> unnecessary.
The work now shifts to the person configuring the processor, who still must interrogate the Brontë text to see how elements and attributes have been used for cross-referencing. But once that is accomplished, the processor can be preconfigured by simply concatenating @xml:base and @cRef. Before sending this request to the CTS service, the configurer may wish to restrict the number of versions returned, which is simple enough: the value of @cRef or the SPARQL query is changed to specify the version or versions intended. The text or texts that are returned from the CTS service are then ready for transformation.
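The preconfiguration amounts to little more than string concatenation, as this Python sketch shows; the service URL is the dummy value introduced above:

# Build the request to the CTS service: @xml:base plus @cRef.
xml_base = "http://ctsservice.example.com/text?"
cref = "urn:cts:greekLit:tlg0031.tlg001:5.7"
request_url = xml_base + cref

# To restrict the request to one version, narrow the URN itself,
# not the processing logic:
request_url_diary = xml_base + "urn:cts:greekLit:tlg0031.tlg001.diaryA:5.7"

print(request_url)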
Under this method, the amount of work required of the transcriber and the pre-processor is reduced considerably. The transcriber does not need to know anything about regular expressions, XPath, and replacement patterns. The person configuring the processor does not need to rewrite any preprocessing stylesheets. The syntactic and semantic interoperability of the Brontë TEI file is increased significantly. The syntactic irregularities inherent in the customary use of @cRef are eliminated by the CTS specifications, which dictate exactly how a valid URN must be constructed. And a new level of semantic interoperability not traditionally part of TEI files has been introduced. In that single CTS URN, one has a machine-actionable name not only for a particular passage but for a collection, a work, or, possibly, a specific version. The Brontë encoder has not only pointed to a specific set of texts in a CTS service, but has uniquely named both a work (gospel of Matthew) and a specific part of that work (5:7). That URN can be used by any other system that is CTS URN-aware to collate the assertion governed by @cRef into heterogeneous datasets. And that means that the cross-reference declared in the TEI file of the Brontë transcription has now been released to the semantic web.
This approach to cross-references assumes, of course, that a quoted text is available in a CTS service, an assumption we made at the outset (see above). But the need to have an available CTS server is a reminder that this method introduces a major step into the workflow, and an added point of possible failure in data processing. The relationship between source text, cross-reference, and target text is now mediated. In addition, the extra labor on the part of the CTS administrator is not to be underestimated. CTS services require software packages (e.g., SPARQL endpoints) that must be configured and maintained, requiring server administrator skills well beyond simply uploading a plain XML file to a public server. The average TEI encoder who has a basic website is not likely to be ready to administer a CTS server. There are also, at this time, few examples of CTS services, and only as that number grows will the specifics of other opportunities and shortcomings be made clear.
TEI @cRef + Shared Schematron
At the heart of a CTS URN is a familiar, standardized canonical reference system that has been transformed into a syntactically regularized string, to bridge independently created texts. Another way a community of encoders and projects can exploit so-called canonical references in the name of interoperability is to transform standardized references into an agreed controlled vocabulary, and then to specify the rules for that vocabulary with a Schematron file. Anyone choosing to use the convention need merely add a reference to the Schematron file in the prolog of their TEI documents. This inclusion not only tells other users that the shared cross-reference system has been adopted, but, in the validation process, can weed out bad values and provide contextual help to the TEI encoder who may not know all the rules for the cross-reference system.
Note
The method advocated below somewhat resembles the constraints applied by the schemas developed for the Mary Baker Eddy Library, which regulate the syntax of cross-references within a single corpus to a variety of works. For documentation see http://www.wwp.neu.edu/outreach/seminars/mbel/TEI_development/schemas/mbel.odd; http://www.wwp.neu.edu/outreach/seminars/mbel/TEI_development/schemas/mbel.doc.html#att.pointing; and http://www.wwp.neu.edu/outreach/seminars/mbel/TEI_development/schemas/mbel.isosch. But whereas the Mary Baker Eddy schema focuses on the needs of a single project dealing with multiple works, in this section I deal with the inverse: multiple projects trying to interoperably quote a single work, no matter the specific version.
This method starts further upstream than either the Brontë encoder or a putative CTS server. It begins with the community that wishes to make Matthew and the rest of the New Testament (maybe the Bible in general) open to standardized cross-references. Out of that community a person or project (or perhaps a TEI special interest group) agrees to host and maintain master versions of the schema files. The community agrees to create a pair of Schematron files, one to regulate transcriptions of the New Testament, the other, transcriptions of texts that quote from the New Testament.
The first file defines the structure of the New Testament text and permissible values. Let us suppose the community has agreed that any New Testament transcription should have three levels of <div>, one for books, one for chapters, and one for verses. They also agree on a set of abbreviations to be used for the names of the books. They envision transcriptions of the New Testament having a TEI <text> that looks something like this:
<text>
   <body>
      <div n="Mt">
         .....
         <div n="5">
            .....
            <div n="7">
               <p>μακάριοι οἱ ἐλεήμονες, ὅτι αὐτοὶ ἐλεηθήσονται.</p>
            </div>
            .....
         </div>
         .....
      </div>
      .....
   </body>
</text>
To enforce this structure, the community encodes assorted rules in the first of the two Schematron files. For example, this rule defines permissible book abbreviations:
<rule context="tei:div">
   <let name="hierarchy" value="count(ancestor::tei:div) + 1"/>
   <report test="$hierarchy = 1 and not(matches(@n,'^(Mt|Mk|Lu|Jn|Ac|
      Ro|1Co|2Co|Gal|Eph|Php|Col|1Th|2Th|1Tim|2Tim|Tit|Phm|
      Heb|Jam|1Pe|2Pe|1Jn|2Jn|3Jn|Jud|Re)$','x'))"
      >Book value must be one of the following: Mt, Mk, Lu, Jn, Ac, Ro, 1Co, 2Co,
      Gal, Eph, Php, Col, 1Th, 2Th, 1Tim, 2Tim, Tit, Phm, Heb, Jam, 1Pe, 2Pe, 1Jn,
      2Jn, 3Jn, Jud, Re.</report>
   .....
</rule>
The example above concisely specifies that first-level <div>s (those at the book level in the hierarchy) must have values of @n that draw from one of the abbreviations adopted by the community for the twenty-seven books of the New Testament. In the case of Matthew, the agreed abbreviation is Mt.
This <report> is but one of many that could be declared within the same <rule>. Another could include a specification as to the number of chapters allowed in a particular book. This next <report> specifies that second-level <div>s pertaining to the book of Matthew must be numbered 1 through 28:
<report test="$hierarchy = 2 and ../@n ='Mt' and @n and not(matches(@n,'^([1-9]|1[0-9]|2[0-8])$'))">Mt has a maximum of 28 chapters.</report>
The verse numbers too can be defined, as in this next report, which specifies that verse numbers for Matthew 5 run from 1 through 48:
<report test="$hierarchy = 3 and ../../@n = 'Mt' and ../@n = '5' and @n and not(matches(@n,'^([1-9]|[1-3][0-9]|4[0-8])$'))">Mt 5 takes verses 1 through 48.</report>
Furthermore, let us suppose that this community agrees with many modern textual editors that certain verses should be deprecated, but they do not wish to render a text that includes them as being invalid. For example, Matthew 18:11, widely regarded as spurious, could be flagged in a report, but merely as a warning:
<report test="$hierarchy = 3 and ../../@n = 'Mt' and ../@n = '18' and @n='11'" role="warning">Most critical editions suppress Mt 18.11 as spurious.</report>
Perhaps most important of all, the schema file can declare that every <div> should have values of @n such that every <div> furthest from the root is uniquely citable, what I call the Leaf Div Uniqueness Rule:
<pattern>
   <let name="leafdiv-flatrefs" value="for $i in (//tei:div[not(descendant::tei:div)])
      return string-join($i/ancestor-or-self::tei:div/@n,' ')"/>
   <rule context="tei:div">
      .....
      <let name="this-ref" value="string-join(./ancestor-or-self::tei:div/@n,' ')"/>
      .....
      <report test="not(descendant::tei:div) and
         count(index-of($leafdiv-flatrefs,$this-ref)) > 1"
         >Canonical references must be unique.</report>
   </rule>
</pattern>
The <pattern> above binds to the variable $leafdiv-flatrefs a sequence of canonical references for all leaf <div>s. Each item in the sequence is a string made up of all the @n values of a leaf <div> and its ancestors, joined by a delimiter, e.g., Mt 5 7. Each item must be unique to the sequence, a rule that is checked by the <report>. If it is not, the duplicate leaf <div>s are marked as invalid. Enforcement of the Leaf Div Uniqueness Rule allows chains of @n joined vertically along an XML hierarchy to act as an ID, one that economically follows the standardized (canonical) reference systems that are familiar to human encoders.
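For readers more comfortable with a procedural statement of the rule, here is an equivalent check sketched in Python with lxml; the Schematron above remains the normative form, and the file name is hypothetical:

# Flatten each leaf <div> into its chain of @n values ('Mt 5 7') and
# report any chain that occurs more than once.
from collections import Counter
from lxml import etree

TEI_DIV = "{http://www.tei-c.org/ns/1.0}div"

def leaf_div_refs(tree):
    refs = []
    for div in tree.iter(TEI_DIV):
        if div.find(TEI_DIV) is None:  # no child <div>: a leaf
            chain = [d.get("n", "") for d in reversed(list(div.iterancestors(TEI_DIV)))]
            refs.append(" ".join(chain + [div.get("n", "")]))
    return refs

tree = etree.parse("nt.xml")  # hypothetical transcription file
duplicates = [ref for ref, count in Counter(leaf_div_refs(tree)).items() if count > 1]
if duplicates:
    print("Canonical references must be unique:", duplicates)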
Note
The uniqueness rule must apply only to leafmost <div>s because there are cases where a <div> midlevel in the hierarchy is intentionally split. For example, in the Greek Septuagint (LXX) version of Proverbs, the 30th chapter is split and interleaved with the two halves of chapter 24 (24.1-24.22e [22a-22e are LXX verses not extant in the Hebrew]; 30.1-30.14; 24.23-24.34; and 30.15-30.33). In this case the @ns of the two split book <div>s must be identical. This also explains why the report is tested not against a leafmost <div>'s siblings (which may be but a partial selection of siblings according to the reference system) but against the entire sequence of leafmost <div>s.
A further advantage of this approach is economy: @n has little if any repetition.
Note
Such repetition is found in alternate approaches, such as those that use @xml:id in the leafmost <div>, e.g., <div xml:id="Mt.5.7">, where Mt and 5 could have been inferred from the ancestors' @xml:id values. Abbreviations of book names and chapter numbers would need to be repeated for all ca. eight thousand verses of the New Testament.
We turn now to the second of the pair of shared Schematron files, that pertaining to the quoting text and the syntax of the cross-reference. Here rules are superimposed upon @cRef (or @source or @ref). The community anticipates that the attribute might be used for multiple space-delimited cross-references, and for references to works other than the New Testament. They anticipate complex quoting files that might look something like this (illustrating the work of an encoder who wishes to add cross-references outside the New Testament, here to Proverbs 11:17):
.....
<div type="chapter" n="5">
<p n="18">‘But, for the child’s own sake, it ought not to be encouraged to have such
amusements,’ answered I, as meekly as I could, to make up for such unusual
pertinacity. ‘<quote>“Blessed are the merciful, for they shall
obtain mercy.”</quote><ref cRef="NT.Mt.5.7 HebB.Prov.11.17"/>’</p>
</div>
.....
The community therefore defines both a prefix for the work (NT) and some character to be used as a delimiter (here a period, but many other nonspacing, nonword characters would also serve). And the community specifies that every value of @cRef that begins with the reserved prefix should construct the cross-reference according to the established rules. For example, this next rule specifies that the second element of any New Testament cross-reference (e.g., the Mt in NT.Mt.5.7) should be one of the acceptable book abbreviations:
<pattern>
   <rule context="@cRef">
      <let name="delimiter" value="'\.'"/>
      <let name="these-refs" value="tokenize(.,'\s+')"/>
      <let name="invalid-books" value="for $i in $these-refs return
         if(matches($i,concat('^NT',$delimiter)) and
         not(matches(tokenize($i,$delimiter)[2],'^(Mt|Mk|Lu|Jn|Ac|
         Ro|1Co|2Co|Gal|Eph|Php|Col|1Th|2Th|1Tim|2Tim|Tit|Phm|
         Heb|Jam|1Pe|2Pe|1Jn|2Jn|3Jn|Jud|Re)$','x')))
         then true() else false()"/>
      <report test="some $i in $invalid-books satisfies $i = true()">Error in
         cross-reference no. <value-of select="index-of($invalid-books,true())"/>.
         Book value must be one of the following: Mt, Mk, Lu, Jn, Ac, Ro, 1Co, 2Co,
         Gal, Eph, Php, Col, 1Th, 2Th, 1Tim, 2Tim, Tit, Phm, Heb, Jam, 1Pe, 2Pe,
         1Jn, 2Jn, 3Jn, Jud, Re, separated from subsequent values by this delimiter:
         <value-of select="replace($delimiter,'\\','')"/></report>
      .....
   </rule>
</pattern>
Under this <rule>, every @cRef is tokenized into a sequence of space-delimited cross-references, assigned to the variable $these-refs. Another variable checks the ones that begin with NT, and makes sure that the next part (defined by the delimiter, the period) is one of the acceptable abbreviations for a New Testament book. If any value does not conform, that @cRef is marked as invalid, and a message is returned indicating which cross-reference is faulty, as well as a list of acceptable values and the delimiter that should be used to separate parts of a cross-reference.
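The same logic can be restated as a short Python sketch, mirroring (not replacing) the Schematron pattern above:

# Tokenize @cRef on whitespace; for each reference with the reserved NT
# prefix, check that the next delimited part is an approved abbreviation.
import re

NT_BOOKS = {"Mt", "Mk", "Lu", "Jn", "Ac", "Ro", "1Co", "2Co", "Gal", "Eph",
            "Php", "Col", "1Th", "2Th", "1Tim", "2Tim", "Tit", "Phm", "Heb",
            "Jam", "1Pe", "2Pe", "1Jn", "2Jn", "3Jn", "Jud", "Re"}
DELIMITER = "."

def invalid_nt_refs(cref):
    bad = []
    for ref in re.split(r"\s+", cref.strip()):
        parts = ref.split(DELIMITER)
        if parts[0] == "NT" and (len(parts) < 2 or parts[1] not in NT_BOOKS):
            bad.append(ref)
    return bad

print(invalid_nt_refs("NT.Mt.5.7 HebB.Prov.11.17"))  # [] -- both pass
print(invalid_nt_refs("NT.Matt.5.7"))                # ['NT.Matt.5.7']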
Other reports found in the first Schematron file can be replicated here as well. For example, allowable chapter and verse numbers can be specified (examples suppressed here for the sake of brevity). That second shared Schematron file could also specify exactly where the <ref> should be placed relative to the quotation:
.....
<report test="$this-val[1] = 'NT' and
   not(name(../preceding-sibling::*[1]) = 'quote')">An element containing @cRef
   must come immediately after the closing tag of the matching quote
   element.</report>
.....
This report specifies that the element containing @cRef must be the very next sibling of its corresponding <quote>. This test removes the guesswork as to where a quotation's cross-reference is to be found, and so saves some labor on the part of the person configuring a processor.
The blocks of code in the examples above are not necessarily computationally efficient, nor do they necessarily represent the best use of TEI elements. They merely illustrate the types of patterns and rules a community of practice might embrace. Once the community has established their rules, the two master Schematron files are posted in a central location. The community has the freedom to update those rules as the community learns what works and what doesn't, and the updates benefit every user.
Now work shifts to the two different communities of transcribers. The first consists of those who wish to provide a citable transcription of the New Testament. They begin by adding to a pre-existing TEI file an extra prolog statement, for example:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_lite.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_lite.rng"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-model href="http://example.org/schemas/nt/1.0/nt-quotable.sch"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
.....
</TEI>
The transcriber runs the validator, and might find that the once-valid TEI file is now rendered invalid, because it does not follow the new rules precisely. But the explanations provided by the error messages will advise the transcriber on how and where to alter the file to make it valid, so it can be made interoperable with all others.
Note
In fact, the Schematron file could be provided with Schematron Quick Fixes, which in SQF-aware XML processors would allow the invalid data to be corrected with just two clicks or keystrokes, or even automatically. See http://www.schematron-quickfix.com/.
We now turn to the Brontë encoder, who, like the New Testament transcribers, adds a prolog:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-model href="http://example.org/schemas/nt/1.0/quoting-nt.sch"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
.....
</TEI>
And once again, the encoder runs the validator, and the extra Schematron pattern is used to see if the citations to the New Testament conform to the rules agreed upon by the community. If there are any errors, the message specifies exactly where and for what reason. The Brontë encoder edits the file until there are no more error messages.
This process can be repeated as often as one wants, upon any version of any text, whether quoting or quoted. In fact, the two Schematron files can be combined in the same document, to allow a New Testament to be marked with internal cross-references. No matter the context, the Schematron reports steer the transcriber toward the (usually small) fixes that need to be made. @cRef alone is sufficient to declare the cross-reference. Neither @xml:base nor <cRefPattern> is necessary. An @xml:base could be supplied, if so desired, but the @cRef is now applicable to any version of the New Testament that adopts the shared Schematron files.
Work now turns to the processor to do something with the cross-reference. Here, because the structure of every New Testament TEI file has been precisely defined (as a series of tessellated <div>s), very little human intervention is needed. Or, rather, the type of human intervention shifts, primarily to deciding which and how many of the available versions of the New Testament should be processed (compare the same wealth of riches in the CTS method). Once a processor is configured to handle these user-defined cross-references to the New Testament, it can be used on any valid file that also uses them, with no extra work. Naturally, this applies only to the preprocessing phase. How exactly that data will be used (display, statistics, etc.) is determined by what users want.
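As a sketch of how light that preprocessing becomes, the following Python (using lxml) resolves a community cross-reference against any conforming New Testament file; the file name is hypothetical:

# Because every conforming file nests book/chapter/verse <div>s the same
# way, one reference maps onto one XPath for all of them.
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def fetch(ref, nt_file):
    prefix, book, chapter, verse = ref.split(".")
    assert prefix == "NT"
    xpath = ("//tei:div[@n='%s']/tei:div[@n='%s']/tei:div[@n='%s']"
             % (book, chapter, verse))
    hits = etree.parse(nt_file).xpath(xpath, namespaces=TEI_NS)
    return "".join(hits[0].itertext()).strip() if hits else None

print(fetch("NT.Mt.5.7", "any-conforming-nt.xml"))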
This method greatly improves both the syntactic and semantic interoperability of TEI files. It requires no new infrastructure, and it supports both customized and standard TEI schemas. The shared Schematron files provide structure and predictability—a controlled vocabulary for cross-references—in areas where encoders most want it. As with CTS, a middleman has been introduced, but it is a rather simple and benign one: two relatively small Schematron files made available by HTTP request, which users will normally cache on a local drive for day-to-day work. So maintenance and overhead are rather light.
Note too that the shared Schematron files can be used with TEI Lite, TEI All, or even customized TEI. No one has to use the same version of TEI in order to make New Testament references interoperable. The validation files do not preclude any other markup within a leaf <div>. They can be used on any version of the New Testament, partial or complete, in any language, and the books or chapters need not be in a specified order (thereby accommodating unusual editions that adopt alternative orders of the books of the New Testament).
Furthermore, this effort could be extended outside the TEI realm. That same community might create variations of the Schematron file pairs for XHTML 1, thereby allowing web pages to serve as host to syntactically and semantically interoperable transcriptions of New Testaments, or of texts quoting the New Testament.
But this general method also has a few major problems. It might work fine for heavily quoted works, but what about less frequently quoted ones? Organizing a community of practice to agree on rules might be difficult if not impossible for some texts (including, ironically, the Bible). Further, how would reserved keywords (here, NT) within the value of @cRef be minted without conflict? What happens in the case of duplicate or ambiguous prefixes adopted by independent communities? Such questions should be regarded not as reasons for abandonment but as problems that can and should be solved. But those solutions go beyond the scope of this article.
Note
The problem of conflicting prefixes could be solved if they were handled like namespace prefixes. But such "work prefixes" would require new specifications in the TEI Guidelines, to ensure the integrity of the method.
TEI + Stand-off Markup
The three methods discussed so far assume cross-references that are embedded within a transcription. Such inline annotation is the most common way an encoder points from one text to another, not just in TEI but also in HTML. But the TEI guidelines (§§16.9-16.10) provide for an alternative approach, stand-off markup, where linking and cross-referencing are placed in a file separate from the transcriptions. Such stand-off markup or annotation has a few drawbacks, the most immediate being that it is difficult to see at a glance the text to which an annotation applies, either because the files must be navigated and edited independently or because the semantics in the pointing scheme may be difficult for a human to parse (character counting, complex or opaque XPath expressions, etc.). But stand-off markup also has great benefits. It allows multiple complementary or competing annotations to be made of the same base transcription; stand-off markup files can be created, edited, and served independently of any source texts; and it facilitates a division of labor that allows transcribers and annotators to focus independently and concurrently on their discrete tasks.
The current specifications of the TEI guidelines provide for a specific method of stand-off markup. It presumes that one or more transcription files are to be found somewhere, and an external aligning file stands apart from them. That external file can point to the source files either by means of XInclude elements (explained at TEI Guidelines §16.9) or by using @target with <ptr>, <ref>, or <link> (TEI Guidelines §§16.2, 16.7). Common to all these methods is a reliance upon the TEI XPointer scheme, which provides a precise, stable, and expressive reference system that follows a straightforward, consistent syntax. The following examples show two different ways to create a stand-off cross-reference from the Brontë novel's quotation to the New Testament:
.....
<linkGrp>
   <link target="http://example2.com/agnesgray.xml#xpath(//div[@n='5']/p[18])
      http://example.com/nt.xml#xpath(//div[@n='Matt']/div[@n='5']/div[@n='7'])"/>
</linkGrp>
.....
.....
<body>
   <div>
      <include href="http://example2.com/agnesgray.xml"
         xmlns="http://www.w3.org/2001/XInclude"
         xpointer="range(xpath(//div[@n='5']/p[18]))"/>
      <include href="http://example.com/nt.xml"
         xmlns="http://www.w3.org/2001/XInclude"
         xpointer="range(xpath(//div[@n='Matt']/div[@n='5']/div[@n='7']))"/>
   </div>
</body>
.....
Other examples using <ref> or <link> would look similar to the second one above. The XPointer framework stands at the heart of them all, pinpointing the precise node or document fragment that is meant. But as currently constructed, this XPointer scheme shares with @cRef a lack of semantics behind the syntax. That is, no information about the meaning of a particular node is built into the XPointer scheme. For the examples above, there is no way to imply in the XPath fragment div[@n='Matt'] that the div means a book and that the @n means the name of that book. In addition, this fragment has coinage only within a specific TEI file. Its interoperability is as limited as @cRef was shown to be above, since the XPointers are not guaranteed to have any validity for other versions of the same work. For every new version of Matthew or Agnes Grey that the encoder wishes to include, the file structure must be interrogated and a new XPointer expression created.
I propose a different approach to stand-off cross-references, one that relies upon semantically defined alignment. My proposal shares points with the previous two methods (CTS URNs and community-written Schematron files) but is more extensive in scope, anticipating an ecosystem of scholarly texts in which stand-off markup is the norm for all types of annotations, not simply cross-references. This ecosystem is the goal of a project that is still in development, the Text Alignment Network (TAN; http://textalign.net), a suite of XML encoding formats and set of recommended best practices to serve anyone who wishes to encode, exchange, and study varieties of text reuse: translations, quotations, paraphrases, adaptations, summaries, and so forth. In this section I use fragments of examples created in the TAN format to illustrate how stand-off annotation might be used to maximize the syntactic and semantic interoperability of the cross-reference.
Note
Because the TAN format is still under development, examples provided in this article may be rendered invalid in any public release.
Methods discussed above moved the beginning of the encoding workflow earlier, either to a new network of CTS servers or to communities of practice coming up with their own Schematron files. Under the TAN method, work begins with what I hope will become an informal community that actively develops and maintains TAN validation schemas, documentation, and examples, and houses those files in a central repository.
To make the format maximally useful to TEI users, TAN defines a minor customization of the TEI All schema, introducing a few constraints. Every transcription file must:
- be dedicated exclusively to a normalized text of one version of one work found on one text-bearing object;
- be uniquely named;
- uniquely name the work that has been transcribed;
- segment the transcription of the work into a series of nested <div>s, each of which must:
  - contain other <div>s or no <div> at all;
  - take @type and @n, specifying the type of division and its name;
  - observe the Leaf Div Uniqueness Rule (explained above);
- define every metadatum with both human-readable names and machine-readable ones (URI/IRIs).
A transcriber may include richer TEI markup within leaf <div>s, but such markup is likely to be ignored by TAN users, since they are interested in TEI files primarily as a source of normalized, well-segmented transcriptions. Extra markup, such as nuanced, complex cross-references, is expected to be found in a separate file.
So, coming back to our example, we start with the transcriber of Agnes Grey, who makes a few adjustments to the TEI file (explained below):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/1/schemas/TAN-TEI.rnc"
   type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/1/schemas/TAN-TEI.sch"
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" TAN-version="1"
   id="tag:textalign.net,2015-04-07:test1">
   <teiHeader>
      .....
   </teiHeader>
   <head xmlns="tag:textalign.net,2015:ns">
      .....
      <declarations>
         <work>
            <IRI>http://dbpedia.org/resource/Agnes_Grey</IRI>
            <name>Agnes Grey</name>
         </work>
         <div-type xml:id="chapter">
            <IRI>http://dbpedia.org/resource/Chapter_(books)</IRI>
            <name>chapter</name>
         </div-type>
         <div-type xml:id="p">
            <IRI>http://dbpedia.org/resource/Paragraph</IRI>
            <name>paragraph</name>
         </div-type>
         .....
      </declarations>
      .....
   </head>
   <body xml:lang="eng">
      .....
      <div type="chapter" n="5">
         .....
         <div n="18" type="p">
            <p>‘But, for the child’s own sake, it ought not to be encouraged to
               have such amusements,’ answered I, as meekly as I could, to make up
               for such unusual pertinacity. ‘“Blessed are the merciful, for they
               shall obtain mercy.”’</p>
         </div>
         .....
      </div>
      .....
   </body>
</TEI>
That is all the Brontë encoder need do. The New Testament transcriber has a similar responsibility:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/1/schemas/TAN-TEI.rnc"
   type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/1/schemas/TAN-TEI.sch"
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" TAN-version="1"
   id="tag:textalign.net,2015-04-07:test2">
   <teiHeader>
      .....
   </teiHeader>
   <head xmlns="tag:textalign.net,2015:ns">
      .....
      <declarations>
         <work>
            <IRI>http://dbpedia.org/resource/New_testament</IRI>
            <name>New Testament</name>
         </work>
         <div-type xml:id="bk">
            <IRI>http://dbpedia.org/resource/Book</IRI>
            <name>book</name>
         </div-type>
         <div-type xml:id="ch">
            <IRI>http://dbpedia.org/resource/Chapter_(books)</IRI>
            <name>chapter</name>
         </div-type>
         <div-type xml:id="v">
            <IRI>tag:textalign.net,2015-04-07:div-type:verse:biblical</IRI>
            <name>verse (Bible)</name>
         </div-type>
         .....
      </declarations>
      .....
   </head>
   <body xml:lang="eng">
      <div n="Matt" type="bk">
         .....
         <div n="5" type="ch">
            .....
            <div n="7" type="v"><ab>Blessed are the merciful: for they shall
               obtain mercy.</ab></div>
            .....
         </div>
         .....
      </div>
      .....
   </body>
</TEI>
Starting from the top of both examples, observe the following:
- The prolog contains two declarations, one pointing to a customized TEI schema in RELAX-NG (compact syntax) and another pointing to a Schematron file. (These URLs do not resolve; they are merely illustrative.)
- The rootmost element, <TEI>, has @TAN-version and @id. The latter is a user-defined URN naming the file. (Actually, the name applies to all versions of that file, but I avoid a full explanation here.)
- There is a new <head> element. The TAN suite has formats for different kinds of data (some of which one would never use TEI to encode). Metadata from one type of TAN file to the next must be predictably and consistently structured. In a word, <teiHeader> is inadequate for TAN files, and would be confusing when juxtaposed with other TAN files. The <tan:head> structures metadata in a manner consistent with other TAN files. The need for predictability is also why it is a sibling, not a child, of <teiHeader>.
- The literary work and the division types are defined by <work> and <div-type>, which take what I call an IRI + name pattern, a recurrent feature of all TAN files. One or more <IRI>s supply a computer-readable name in the form of an Internationalized Resource Identifier (IRI, an extension of URI, Uniform Resource Identifier) and one or more <name>s, a human-readable one. The @xml:id provides a local identifier so that the entity, properly defined by its IRI values, can be easily referenced. Thus, the two examples assign the division "chapter" different abbreviations (ch versus chapter), but this difference does not matter because the definition, made by <IRI>, is shared.
- <body> takes a set of nested <div>s. Any markup inside a leaf <div> is optional, and will be ignored by many users of the file. (For this reason, a bare TAN format for transcriptions is provided, to support users who prefer plain text to TEI.)
The transcribers' work is finished. Before we move to the next phase, however, it is worth noting some important gains in interoperability that have already been made. Because a TAN transcriber is compelled to segment a single work according to a semi-intuitive reference system, and to declare the work and the types of division according to IRI/URIs, we have in place the foundation for computer-actionable alignment. That is, if one were to have one hundred people each independently transcribe a different version of Agnes Grey or the New Testament along TAN rules, it is likely that many of them would structure, define, and label <div>s in a similar fashion. Thus, a good number of these versions will already be prepared for automatic alignment, with no human intervention whatsoever. There will always be some versions encoded differently, of course, and the TAN format provides the tools for an aligner to easily reconcile differences where they exist. But even before the aligner has arrived, the stage has been set for computers to create multilingual editions of versions of the same work with minimal human intervention.
At this point, work shifts to the annotator who wants to encode the cross-reference. TAN specifies two formats for cross-referencing. One is designed exclusively for pairs of texts (bitexts) and is used to create clusters of words (or merely letters) that correspond across the bitexts. This format, intended for highly detailed, nuanced, and complex work, provides a kind of microscopic alignment. But we focus here on the other kind of format, macroscopic, which is intended to be used to align any number of versions of any number of works, and to specify further alignments on the basis of leaf <div>s (but larger or more precise alignments, down to the level of words, can also be made).
Let us suppose an aligner has found not only our two example TAN transcription files but another version of each work, and wishes to declare a cross-reference from the Brontë novel to the New Testament that applies to all four. That alignment file will look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/schemas/1/TAN-TEI.rnc"
   type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/schemas/1/TAN-TEI.sch"
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A-div xmlns="tag:textalign.net,2015:ns" TAN-version="1"
   id="tag:textalign.net,2015-04-07:alignment-test1">
   <head>
      .....
      <source xml:id="bronte">
         <IRI>tag:textalign.net,2015-04-07:test1</IRI>
         <name>Agnes Grey in English</name>
         <location when-accessed="2015-07-13">test1.xml</location>
      </source>
      <source xml:id="bronte-fra">
         <IRI>tag:textalign.net,2015-04-07:test3</IRI>
         <name>Agnes Grey in French</name>
         <location when-accessed="2015-07-13">test3.xml</location>
      </source>
      <source xml:id="nt">
         <IRI>tag:textalign.net,2015-04-07:test2</IRI>
         <name>King James version of the New Testament</name>
         <location when-accessed="2015-07-13">test2.xml</location>
      </source>
      <source xml:id="nt-grc">
         <IRI>tag:textalign.net,2015-04-07:test4</IRI>
         <name>Nestle Aland version of the Greek New Testament</name>
         <location when-accessed="2015-07-13">test4.xml</location>
      </source>
      .....
   </head>
   <body>
      <align>
         <div-ref src="bronte" ref="chapter 5 p 18"/>
         <div-ref src="nt" ref="bk Matt ch 5 v 7"/>
      </align>
   </body>
</TAN-A-div>
The <head> is somewhat long, because four different versions are in play, and each needs the IRI + name pattern (see above) as well as one or more <location>s, to specify where the source has been found. But the <body> is relatively straightforward. A single <align> encloses a set of <div-ref>s, each of which names a particular passage by identifying the source and reference. The pair of <div-ref>s provides a two-way cross-reference that follows a human-friendly syntax, one that does not require any knowledge of XPath, XPointer, regular expressions, and so forth.
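Here is a hedged Python sketch of how a processor might resolve one <div-ref> against its source, matching by the source's local @type ids; a full TAN processor would reconcile division types across sources through their IRIs:

# @ref alternates division type and name: "chapter 5 p 18" becomes the
# pairs [('chapter', '5'), ('p', '18')], which map onto nested <div>s.
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def resolve_div_ref(source_file, ref):
    tokens = ref.split()
    pairs = list(zip(tokens[0::2], tokens[1::2]))
    xpath = "//" + "/".join(
        "tei:div[@type='%s'][@n='%s']" % (t, n) for t, n in pairs)
    return etree.parse(source_file).xpath(xpath, namespaces=TEI_NS)

# e.g., against the Bronte transcription above (hypothetical file name):
# resolve_div_ref("test1.xml", "chapter 5 p 18")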
Even though this cross-reference invokes only the sources given the ids bronte and nt, the reference applies to all four sources. That is because <div>-based alignment rules stipulate that every processor must infer alignment wherever possible and that, unless otherwise specified, alignment is transitive. If two texts are versions of the same work (discerned through the <IRI> values of each source's <work>), then their constituent parts—their <div>s—should be aligned wherever they can be (using the IRI values of @type and the data values of @n). Furthermore, if special alignment is made across works (such as the cross-reference above), then that alignment is to be treated as transitive unless otherwise specified. That is, if an <align> says that X ~ (aligns with) Y, then for every A ~ X and every B ~ Y, A ~ B.
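The transitivity rule can be stated compactly in code. This Python sketch, with passages written as (source, reference) pairs and illustrative data drawn from the example above, expands the one explicit alignment into all the inferred ones:

# X ~ Y declared by the aligner; A ~ X and B ~ Y inferred because
# versions of one work share a reference system. Then every A ~ B.
explicit = [(("bronte", "chapter 5 p 18"), ("nt", "bk Matt ch 5 v 7"))]

same_work = {
    ("bronte", "chapter 5 p 18"): {("bronte-fra", "chapter 5 p 18")},
    ("nt", "bk Matt ch 5 v 7"): {("nt-grc", "bk Matt ch 5 v 7")},
}

def versions(passage):
    return {passage} | same_work.get(passage, set())

for x, y in explicit:
    for a in versions(x):
        for b in versions(y):
            print(a, "~", b)  # four pairings from one declared alignment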
There are a number of benefits to the simplified <div>-based alignment illustrated above, but one should be singled out. The value of @when-accessed (a required attribute of <location>) indicates when the aligner last saw a source transcription. If that file is corrected and updated, and the date of the change is logged in the source file, then when the aligner validates the alignment file, the Schematron pattern will issue a warning that the source has been updated. The aligner can then decide if the changes have any important consequences. So transcribers can keep their files in a central location and have the liberty of correcting typographical errors. They need not worry about altering any stand-off markup files or hunting down every person using their files. The Schematron schemas do the notifying. Those who depend upon the source file can be automatically informed of any changes, one of the signal strengths of stand-off markup.
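The check itself is a simple date comparison, as in this Python sketch; the dates are invented, and in practice @when-accessed would come from the alignment file and the change dates from the source's revision log:

# Warn if the source logged any change after the aligner last saw it.
from datetime import date

when_accessed = date.fromisoformat("2015-07-13")  # from @when-accessed
source_changes = [date.fromisoformat(d) for d in ("2015-06-30", "2015-08-02")]

if any(change > when_accessed for change in source_changes):
    print("Warning: source updated since the alignment last consulted it.")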
The aligner's task is finished, and work shifts to the processor. Configuration of the pre-processor is a one-time affair that will apply not only to any version of a particular text (as was the case with the method of the shared Schematron file, discussed above) but to any TAN div-based alignment file for any work. That is, those who configure processors do not need to learn the structure of a given work or transcription file. They need only to know the TAN specifications for alignment (i.e., how to interpret a TAN-A-div file). Any TAN-compliant processor can be used on any TAN-A-div file, no matter how many works or versions it has. How the processor uses or transforms the data is another issue altogether, because that depends upon the purpose and questions the transformation serves. But the preliminary pre-processing stage need be configured only once, since all valid TAN files (both transcription and alignment) are interoperable, both syntactically and semantically.
There is obviously much more I should say about TAN alignment, in response to important questions or concerns. What if independent transcriptions of the same work are discordant, using different values for @n? What if division types and works are defined by different IRI vocabularies? What about versions of the same work that use altogether different reference systems? What about works that are similar but not really the same? What about coordinating specific ranges of text smaller than the leaf <div>? What if a commonly used reference system is misleading or inadequate?
These questions and more have been anticipated, and will be addressed in the full specifications for the Text Alignment Network. Explaining any single point adequately would involve moving into territory outside the remit of this article, and would raise yet other questions that would require a full discussion of the TAN design principles and rules.
But let us assume for the sake of argument that these concerns are not handled adequately under TAN specifications. Inevitable shortcomings aside, consider how much extra interoperability has been secured in the simple examples above. Like CTS URNs, TAN-compliant TEI provides a means for uniquely naming literary works. Like the shared Schematron method, TAN-TEI offers transcribers rules to make their texts consistent and predictably structured (and therefore citable). And by compelling <div> to be given a semantically precise definition, TAN specifications allow an otherwise generic element to become highly productive and semantically precise. That is, a transcriber is now free to define <div> to mean a textual division that might be unusual or specific to a field. Thus the world of textual divisions is now opened to the semantic web.
Even if TAN proves to have fatal flaws, I hope these examples inspire someone to create a better stand-off annotation system. If the goal is to allow a cross-reference to apply to any number of versions of any two works, then in-line annotation is not viable, because it indelibly impresses the cross-reference into a single version. To be applicable to other versions the cross-reference must be freed.
Conclusion
Three methods for enhancing the syntactic and semantic interoperability of cross-references in TEI files have been offered: Canonical Text Services URNs, shared Schematron files, and the stand-off markup of the Text Alignment Network. The first two could be implemented now. The principal barrier is practical—getting independent scholars, projects, and groups to adopt a method, try it out, and through trial and experience develop the protocols behind it. The third method needs both experimentation and development before it can be widely used. But all three show that greater interoperability is possible through a few modest adjustments to our approach to TEI. First, make source transcriptions predictably structured. Second, make sure that references to those predictably structured sources are themselves predictably structured. Third, define the syntax of the metadata such that each constituent part retains its semantics, defined by IRIs/URIs. Even if a reader finds one of the three methods unappealing, that method is successful if, in the end, it catalyzes a better way.
References
[European Commission 2010] European Commission. Annex 2, "Towards Interoperability for European Public Services," ver. 744 final (2010-12-16). http://ec.europa.eu/isa/documents/isa_annex_ii_eif_en.pdf (accessed 2015-07-02).
[Kalvesmaki 2014] Joel Kalvesmaki. "Canonical References in Electronic Texts: Rationale and Best Practices." Digital Humanities Quarterly 8.2 (2014). http://www.digitalhumanities.org/dhq/vol/8/2/000181/000181.html.
[Schmidt 2014] Desmond Schmidt. "Towards an Interoperable Digital Scholarly Edition." Journal of the Text Encoding Initiative 7 (November 2014). http://jtei.revues.org/979. doi:https://doi.org/10.4000/jtei.979.
[Schmidt 2010] Desmond Schmidt. "The Inadequacy of Embedded Markup for Cultural Heritage Texts." Literary and Linguistic Computing 25 (2010): 337-356. doi:https://doi.org/10.1093/llc/fqq007.