Introduction
Project context
The project Annotated Turki Manuscripts from the
Jarring Collection Online
(ATMO) is digitizing a number
of Central Asian manuscripts collected in the first half of the
twentieth century by the Swedish ethnographer and Turkic
philologist Gunnar Jarring.[1]
A number of previously undigitized documents have been scanned,
and the project has put digital facsimiles
of them
online. One is shown in Figure 1.
Further, the project is transcribing as many newly scanned
manuscripts as resources allow, and a number of transcriptions
are also available on the project's site. For as many of the
transcribed manuscripts as we can manage, the project is also
translating and providing word-by-word (or to be more precise,
morpheme-by-morpheme) linguistic annotation.
In order to simplify both the creation of the literatim transcripts and their later comparison with the scanned images of the originals, the transcriptions use the markup defined by the Text Encoding Initiative (TEI P5) for close transcriptions of physical sources, with elements for writing surfaces (here mostly pages), zones (regions of the surface used for writing), and lines. A line by line transcription of the page shown in Figure 1 is shown in Figure 2.
The linguistic annotation, however, is based on the linguistic structure of the texts and requires elements for sentences (or sentence-like units), words, and morphemes. As may be seen in Figure 3, the text is displayed sentence by sentence, with Latin transliteration, segmentation into morphemes, part of speech for each morpheme, and interlinear gloss for each morpheme shown immediately below each word, and a prose gloss for the entire sentence shown below the sentence, followed by any notes applicable to the sentence.
A display of the material oriented to speakers of Uyghur or to area specialists with non-linguistic interests (e.g. historians of religion or folklore) will require (or at least benefit from) markup for a third set of textual structures, with elements for texts (some manuscripts contain anthologies of multiple texts), headings, paragraphs, verse stanzas, verse lines, etc. Figure 4 shows a sample text-oriented display, with the original Perso-Arabic script on the right, the English sentence-by-sentence translation on the left, and the Latin transliteration between the two.
No two of these views nest neatly with each other.
The ATMO project thus exhibits in a particularly
straightforward and striking form the problem of
overlapping hierarchies
which the SGML and XML
communities have been discussing since the 1980s.[2]
This paper first describes the specific requirements to be met by the markup for the ATMO project; the following sections describe how the project is going about meeting those requirements. Sections are devoted to the abstract structure assumed for documents, the serialization forms used to represent that structure in XML, and the mechanisms employed for well-formedness checking and (very briefly) validation; these are all based on those of XML, but require some description of the application conventions employed and how they deal with multiplicity of document structures. The paper concludes with some indications of further work to be done and/or to be reported on in other papers.
Requirements
For transcription (and for the presentation of transcripts for
those interested in the physical organization of the
manuscript), the ATMO project uses markup whose elements
identify important units in the topography of the manuscript
exemplar: pages, regions on the page (header area including
folio numbers and page numbers, right margin, main writing area,
left margin, footer including catch-words), lines, and
highlighted areas within the lines. For tabular material,
extensive use is made of TEI's rend
attribute, to
allow the display stylesheets to approximate the layout of the
exemplar.[3]
For linguistic annotation and for presentation of annotated material for readers with linguistic interests, a close reproduction of the physical organization of the manuscript is not helpful; the key units of organization are sentences, words, and morphemes. Like many documentary linguistic projects, ATMO segments words to identify inflectional (but not derivational) morphemes and annotates each segment.
For presentation of the texts in regularized spelling and
for readers interested primarily in the cultural, ethnographic,
anthropological, religious, or historical import of the
material, neither the close reproduction of the physical
organization of the manuscript nor an exclusive focus on
sentences would be helpful; the kind of logical
structure
typically captured in document-oriented SGML
and XML vocabularies is more useful: texts or works, paragraphs
or other blocks, phrases of various kinds should be
identified.
In prose, where sentences normally nest within paragraphs
or similar units, the text-oriented
and
sentence-oriented structures are often compatible and can be
combined in a single tree structure. In verse, however, the two
structures do not nest.
It may be noted in passing that in the ATMO project these
three structures compete with or overlay each other only in the
main part of the document; the TEI header will be the same in
all views. In XML terms, the competing structures all occur
only within a container element; in the
case of ATMO the container is the tei:text
or
tei:sourceDoc
element. Within the container, again
some elements may be common to all structures.
From these observations several requirements arise, which in turn entail or suggest others:
-
Any of the three structures (which I will call page, sentence, and paragraph) should be visible and processable when needed.
-
Because we do not have the resources needed to re-create the XML software stack from the ground up, a second requirement is that if possible, all document representations used in the project should be XML.
-
Taken together, the two requirements just mentioned seem to suggest that we use XML representations in which one of the structures (I'll call it the dominant structure) is represented more or less conventionally, representing each structural unit of the dominant structure with one XML element (and vice versa), and the other two structures (the recessive structures) are represented in some other way (with milestone elements, fragmentations, stand-off markup, or some other technique).
Terminological note: for brevity, I will sometimes refer to elements or nodes appearing in a recessive structure as recessive elements, and to the markup delimiting such elements as recessive markup, and similarly for dominant elements and dominant markup.
We meet this requirement using Trojan-Horse markup (DeRose 2004) for the recessive structures.
-
Because we do not wish to privilege any one structure by making it permanently dominant, we would like to be able to view and process any document with any of the three structures as the dominant one.
-
Because we do not wish to have to perform triple maintenance on documents, we do not want to have three parallel static representations for each document which must be maintained in parallel; instead, we want to be able to translate from any of the three forms to either of the other two (changing from one dominant structure to another), without information loss.[4]
We meet this requirement with XSLT transformations which accept a document with one dominant and any number of recessive structures and write out an equivalent document with a dominant structure identified by a run-time parameter.[5]
-
Because each of the three structures is reasonably simple and well understood, we would like to be able to validate the markup for each structure using a conventional grammar-based schema language.
We meet this requirement by translating a set of document grammars defining the individual views into a set of related schemas (one for each dominant structure).
-
Because most of the uses we imagine for the project's data involve one or another of these views, but not more than one, it is probably not an absolute requirement for the ATMO project that multiple structures be visible and processable at the same time. But neither is it an absolute requirement that recessive structures be invisible to processing. A requirement to see all structures at once can easily arise whenever multiple structures are of interest: all it takes is beginning to wonder whether any two structures are completely orthogonal to each other or not. So we would tentatively like, if possible, to be able both to perform tasks that require taking more than one hierarchical structure into account and to ignore the recessive structures completely.
We believe we have met this requirement but do not have space to demonstrate how; we hope to report on processing techniques for concurrent documents in later work.
Document structure
ISO 8879 introduced the notion that a markup language can not only be defined as a set of character sequences but can also be associated naturally both with an abstract data type which represents the structure of the marked up document and with a mechanism for validating marked up documents. The following sections follow this pattern in describing explicitly the abstract data type for document structure, the serial form, and the mechanisms for well-formedness checking for the markup used by the ATMO project. It is hoped that later work will have space for fuller discussion of validation against schemas and the challenges of processing data with concurrent structures.
Concurrent trees and sharing of leaf nodes
The structure we postulate for documents is in essence that of the SGML feature CONCUR: multiple element trees sharing leaf nodes; see ISO 8879:1986 and Sperberg-McQueen / Huitfeldt 1999 for descriptions. Later work on the same or very similar data structures includes Dekhtyar / Iacob 2005, Hilbert / Schonefeld / Witt 2005, Schonefeld / Witt 2006, and Schonefeld 2007.
CONCUR has sometimes been described (by the current author and by others) as involving multiple element trees drawn over the same frontier of text nodes, comments, and processing instructions. This is a reasonable first approximation, but in fact the data structure implied by ISO 8879 is slightly more complicated: when CONCUR is used, it is not guaranteed or required that each document type have exactly the same character data.[7] There are two sources of variation. First, SGML's rules for record-end suppression depend crucially on the relative location of the record-end in question and the nearest markup. Since in a document marked up with CONCUR some markup is applicable to (visible in) only one document type, record ends affected by that markup will be suppressed in that document type and visible in others. Second, there is no requirement that a given general entity name be given the same declaration in different document types; if the replacement text for entity E differs in different DTDs, then the concurrent trees will have different frontiers at any point where entity E is referred to.
It would thus be more precise to say that concurrent markup describes multiple element trees over a frontier of text nodes, comments, and processing instructions which is shared in whole or in part. In any one tree, all leaf nodes (indeed, all nodes, if we assume an XDM-like data model) are totally ordered, and any leaf nodes shared among trees have the same relative ordering in all trees. (I.e., if N1 and N2 are present both in document type X and in document type Y, and N1 << N2 in X, then N1 << N2 in Y.)
It is not obvious at first glance that the ATMO project needs to allow different structures to cover different sets of leaf nodes; we defined the abstract model as allowing that possibility just in case that requirement showed up in later work. It did: when words are broken across line breaks, and even more obviously when broken across page breaks (so that the first part of the word and its ending may be separated by a catchword, a page number, a folio number, and other material in the top margin of the new page), the page view requires that each word fragment appear on the page where it is written in the manuscript, while the text and sentence views need the word to appear as an undivided whole. Annotations applicable only to a single view of the document would also be a use case for different views having slightly different character-data content.
Variations in whitespace, on the other hand, we hope to succeed in ignoring permanently.
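The ordering constraint on shared leaf nodes described above can be illustrated with a small sketch. The Python below is illustrative only and is not part of the project's tooling; it assumes that each tree's frontier is available as a list of leaf identifiers that are unique within that tree, and the identifiers themselves are invented for the example.

```python
def shared_order_consistent(frontier_x, frontier_y):
    """Return True if the leaves common to both frontiers occur in the
    same relative order in each (N1 << N2 in X implies N1 << N2 in Y)."""
    shared = set(frontier_x) & set(frontier_y)
    in_x = [leaf for leaf in frontier_x if leaf in shared]
    in_y = [leaf for leaf in frontier_y if leaf in shared]
    return in_x == in_y


# Hypothetical frontiers: the page view splits a word around a catchword,
# while the sentence view holds the same word as one undivided leaf.
page_view = ["t1", "word-part-1", "catchword", "word-part-2", "t2"]
sentence_view = ["t1", "whole-word", "t2"]

print(shared_order_consistent(page_view, sentence_view))
```

Leaves that occur in only one view (the word fragments, the catchword, the undivided word) play no part in the comparison, which is what allows the views to differ in character data while still agreeing on the order of whatever they share.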
Sharing of internal nodes (elements)
ISO 8879 can (as already noted above) be read as allowing an
SGML processor to make just one of the available document types
available for processing; it can also be read as allowing a
processor to make multiple document types available. Since 8879
does not constrain the interface offered by an SGML parser to its
consumer (or even require that there be such an interface —
the standard does not require that an SGML application be
divisible into an SGML parser
and a
consumer
), it is unspecified whether markup shared between
document types is treated by the interface as being the
same
in all applicable document types or not. It is
similarly unspecified whether the nodes that might appear in a
data structure representing the document are shared between
document types or not.
For purposes of the ATMO project, we do want some nodes to be shared across views: we wish to regard elements representing individual texts (in a manuscript which contains several distinct texts), paragraphs, headings, tables, and notes as occurring in all views: the text and sentence views should not have distinct but similar sets of paragraphs, but the same set of paragraphs. (Of course, such identity of elements across views is not readily detectable by inspection of the markup or by validation; node identity arises as an issue only in the context of processing with the XDM or some other object model. And even there, there is no way at the XDM level to express the identity of elements across different XDM documents representing different views of the manuscript: no XDM node occurs in more than one document.)
Illustration of concurrent trees with shared elements
An example may be helpful as an illustration of the data model. Consider the following haiku by Bashō as translated by Harold G. Henderson (Henderson 1958, p. 48), marked up with its metrical structure (line group, line):
<text xmlns="http://www.tei-c.org/ns/1.0">
  <body xml:id="body">
    <head xml:id="h1">The Village Without Bells</head>
    <lg xml:id="lg1">
      <l xml:id="L1">A village where they ring</l>
      <l xml:id="L2">no bells! — Oh, what do they do</l>
      <l xml:id="L3">at dusk in spring?</l>
    </lg>
  </body>
</text>
If instead we mark up the sentences, we will have something like this:
<text xmlns="http://www.tei-c.org/ns/1.0">
  <body xml:id="body">
    <head xml:id="h1">The Village Without Bells</head>
    <ab xml:id="ab1">
      <s xml:id="s1">A village where they ring no bells! — </s>
      <s xml:id="s2">Oh, what do they do at dusk in spring?</s>
    </ab>
  </body>
</text>
The metrical and the sentence structures of the document relate to each other as shown in Figure 5 below.
Mutual visibility of different views
ISO 8879 seems clearly to expect that even if multiple
document types are processed at the same time, any nodes not
shared (and the tags which mark their boundaries) will be
visible only in the document types to which they belong.
Concretely, this means that in the example given above, the
nodes for tei:body
and tei:head
are
shared between the sentence and meter structures, and the
boundary markers for the end of sentence 1 and the beginning of
sentence 2 are not children of the tei:l
element
for line 2. That is a convenient arrangement for many kinds of
processing, but it is also sometimes convenient for a process to
know not only about one dominant view but also about the other
recessive views of the document as well.
For the ATMO project, the initial expectation was that we would prefer that each view know nothing about the others, so that any tags relevant only for recessive views would be invisible, as would any text nodes not part of the dominant view. As will be seen below, however, the XML representation we have chosen entails the opposite: all text nodes and all tags are visible whether they are dominant or recessive. Once we got over the embarrassment of having failed to implement the intended design fully, however, experience taught us that this is often helpful in ways not anticipated at first. In the web display of any view, for example, the recessive markup can be used to provide hyperlinks to alternative views of the location being displayed; this would be much less convenient if recessive markup were invisible. Nor does the presence of recessive markup typically present any serious inconvenience: if it did, we could write general-purpose filters to strip out recessive markup from a document before processing it, but in practice it has proven to be just as simple for each process to have its own code to ignore explicitly those recessive tags it is not interested in.
Serial form
The serial form of the project's documents is XML, in which one dominant hierarchical structure is represented by XML elements in the straightforward conventional way (one XML element per node in the logical structure) and other recessive structures are represented by Trojan Horse elements, using essentially the notation proposed by DeRose 2004 and used in OSIS (Durusau 2005).
Trojan Horse markup
Trojan Horse markup is a systematic application of an idea that
was current in markup folklore no later than the 1980s and
instantiated by a number of element types defined in the TEI
Guidelines.[8] The TEI, for example, defines empty elements to
mark boundaries of specific kinds: pb
,
cb
, and lb
mark page, column, and line
breaks, and the more general milestone
element
marks boundaries of arbitrary kinds. These elements are
designed for marking boundaries in a complete tessellation of the
data (when a page break occurs, one page ends and another
begins); they do not provide clean methods of marking the start
and end of a region which is not immediately preceded and
succeeded by other regions of the same kind. Nor do they have
good ways of providing values for all the attributes which could
appear on the logical element being represented. Like the
element types just mentioned, Trojan Horse markup uses empty
elements to mark the start and end of regions which cannot be
represented as XML content elements, but does not define special
element types for the purpose. Instead, it uses empty instances
of the normal element type for the kind of textual feature being
recorded, and marks them as special by using the attributes
sID
and eID
to signal that the empty
element in question marks the start or the end of a virtual
element rather than a content element. Matching start- and
end-markers will have the same value for these attributes, which
allows reliable identification of pairs.
OSIS defines twelve element types as
milestoneable
(representable using Trojan Horse
markup). It uses the mechanism, for example, to represent
verses which cross paragraph boundaries:
<p> ...
  <verse sID="Esth.2.8" osisID="Esth.2.8"/>
  When the king ordered the search for beautiful women, many were taken to the king's palace in Susa, and Esther was one of them.
</p>
<p>Hegai was put in charge of all the women,
  <verse eID="Esth.2.8"/>
  <verse sID="Esth.2.9" osisID="Esth.2.9"/>
  and from the first day, Esther was his favorite. He began her beauty treatments at once. He also gave her plenty of food and seven special maids from the king's palace, and they had the best rooms.
  <verse eID="Esth.2.9"/>
</p>
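To make the pairing of start- and end-markers concrete, the following Python sketch recovers the text of a virtual element from an OSIS-style fragment. It is an illustration written for this discussion, not OSIS or ATMO code; the sample data and the function name virtual_element_text are invented, and the markers are assumed to carry un-namespaced sID and eID attributes as in the example above.

```python
import xml.etree.ElementTree as ET

# Invented sample in the style of the OSIS example: verse v1 crosses
# a paragraph boundary, so it cannot be a content element.
SAMPLE = """<div>
<p>Intro. <verse sID="v1"/>First part of the verse,</p>
<p>second part.<verse eID="v1"/> <verse sID="v2"/>Next verse.<verse eID="v2"/></p>
</div>"""


def virtual_element_text(root, marker_id):
    """Return the character data lying between the start-marker and the
    end-marker that share marker_id, with whitespace normalized."""
    parts = []
    inside = False

    def walk(elem):
        nonlocal inside
        if elem.get("sID") == marker_id:
            inside = True
        elif elem.get("eID") == marker_id:
            inside = False
        if inside and elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
            # A child's tail follows its end-tag in document order.
            if inside and child.tail:
                parts.append(child.tail)

    walk(root)
    return " ".join("".join(parts).split())


root = ET.fromstring(SAMPLE)
print(virtual_element_text(root, "v1"))
```

Because the shared identifier alone determines the extent of the virtual element, the walk never needs to know that the verse crosses a paragraph boundary.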
We make several small changes to the notation described by DeRose and used in OSIS:
-
We place the sID and eID attributes in a namespace (here conventionally bound to the prefix th).
-
We add a soleID attribute for use on empty recessive elements which we wish to represent with sole tags rather than start/end pairs.
-
We add an attribute named th:doc to each Trojan-Horse empty element, which contains a set of tokens identifying the structures of which the virtual element is part (in the ATMO project, we use the abbreviations P, T, and S for the page, text, and sentence views). The th:doc attribute simplifies the XSLT transform to change dominant hierarchies. Any elements with more than one name in the value of their th:doc attribute are logically shared across those document types.
It should be noted that other XML-based serializations are
also possible (and many appear to have been invented more or less
ad hoc). The Trojan-Horse empty elements can be replaced by
elements in the Trojan Horse namespace named
th:start
, th:end
, and
th:sole
, or by processing instructions with the
target th
(i.e. Trojan Horse). These have the
advantage that they require little or no change (respectively) to
any pre-existing schemas for the various hierarchies. They have
the disadvantage that to eyes accustomed to scanning conventional
XML, they are less legible. As DeRose pointed out when introducing
the notation, "The advantage that (unlike generic
milestones) Trojan milestones look like element tags (that is,
they have the same GI) should not be underestimated"
(DeRose 2004).
In what follows, I refer to Trojan Horse elements which mark the start of an element in a recessive structure as start-markers, those which mark the end of an element in a recessive structure as end-markers, and elements so marked as logical or virtual elements. Elements conventionally marked up with XML start- and end-tags I will refer to as content elements (even if in some particular cases they are empty).
Illustration
Using Trojan Horse markup, we can represent both the metrical structure and the sentence structure in the example shown above. When the metrical structure is dominant, the document might look like this:[9]
<text xmlns="http://www.tei-c.org/ns/1.0"
      xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse"
      th:doc="meter sentence">
  <body th:doc="meter sentence" xml:id="body">
    <head th:doc="meter sentence" xml:id="h1">The Village Without Bells</head>
    <lg th:doc="meter" xml:id="lg1">
      <ab th:doc="sentence" th:sID="ab1" xml:id="ab1"/>
      <l th:doc="meter" xml:id="L1">
        <s th:doc="sentence" th:sID="s1" xml:id="s1"/>
        A village where they ring
      </l>
      <l th:doc="meter" xml:id="L2">
        no bells! —
        <s th:doc="sentence" th:eID="s1"/>
        <s th:doc="sentence" th:sID="s2" xml:id="s2"/>
        Oh, what do they do
      </l>
      <l th:doc="meter" xml:id="L3">
        at dusk in spring?
      </l>
    </lg>
    <s th:doc="sentence" th:eID="s2"/>
    <ab th:doc="sentence" th:eID="ab1"/>
  </body>
</text>
When the sentence structure is dominant:
<text xmlns="http://www.tei-c.org/ns/1.0"
      xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse"
      th:doc="meter sentence">
  <body th:doc="meter sentence" xml:id="body">
    <head th:doc="meter sentence" xml:id="h1">The Village Without Bells</head>
    <lg th:doc="meter" th:sID="lg1" xml:id="lg1"/>
    <ab th:doc="sentence" xml:id="ab1">
      <l th:doc="meter" th:sID="L1" xml:id="L1"/>
      <s th:doc="sentence" xml:id="s1">
        A village where they ring
        <l th:doc="meter" th:eID="L1"/>
        <l th:doc="meter" th:sID="L2" xml:id="L2"/>
        no bells! —
      </s>
      <s th:doc="sentence" xml:id="s2">
        Oh, what do they do
        <l th:doc="meter" th:eID="L2"/>
        <l th:doc="meter" th:sID="L3" xml:id="L3"/>
        at dusk in spring?
        <l th:doc="meter" th:eID="L3"/>
        <lg th:doc="meter" th:eID="lg1"/>
      </s>
    </ab>
  </body>
</text>
Interpretation of tags in the input
Each tag in the document is either
-
dominant markup: an XML start-, end-, or sole-tag used conventionally and representing the beginning, end, or location of a node in the dominant structure, or
-
recessive markup: an empty Trojan-Horse element representing (or corresponding to) a start-, end-, or sole-tag in a recessive structure.
The two kinds are distinguished by their attributes:
-
Start- and sole-tags with th:sID or th:eID attributes are Trojan-Horse markup and relate to the recessive structures identified by the th:doc attribute.
-
Start- and sole-tags with neither th:sID nor th:eID attributes relate to the dominant structure.
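The interpretation rule just given can be sketched in a few lines of Python. This is an illustration only, not the project's code: the function names classify and structures are invented here, and th:soleID is treated as a third marker attribute on the assumption that sole-markers are recognized in the same way as start- and end-markers.

```python
import xml.etree.ElementTree as ET

TH = "http://www.blackmesatech.com/2017/nss/trojan-horse"
S_ID, E_ID, SOLE_ID, DOC = (f"{{{TH}}}{name}" for name in ("sID", "eID", "soleID", "doc"))


def classify(elem):
    """Say whether an element is dominant markup or a Trojan-Horse marker."""
    if elem.get(S_ID) is not None:
        return "start-marker"
    if elem.get(E_ID) is not None:
        return "end-marker"
    if elem.get(SOLE_ID) is not None:   # assumption: handled like sID / eID
        return "sole-marker"
    return "dominant"


def structures(elem):
    """Return the structure names listed in the element's th:doc attribute."""
    return elem.get(DOC, "").split()


# Line 2 of the haiku, abridged, with the metrical structure dominant.
LINE = f"""<l xmlns:th="{TH}" th:doc="meter" xml:id="L2">no bells!
<s th:doc="sentence" th:eID="s1"/><s th:doc="sentence" th:sID="s2" xml:id="s2"/>
Oh, what do they do</l>"""

for elem in ET.fromstring(LINE).iter():
    print(elem.tag, classify(elem), structures(elem))
```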
Note that strictly speaking some of the information recorded
is redundant and could be omitted: because the Trojan-Horse
elements correspond 1:1 to tags in a well-formed XML document with
a different dominant structure, each Trojan-Horse element marking
the end of a region closes the most recently begun matching
region; we could thus omit the th:sID
and
th:eID
attributes if we wished. We could similarly
omit th:doc
on end-tag elements. These omissions
would not, however, save as many characters as one might think:
without th:sID
and th:eID
we would need
to add some other simple signal to distinguish Trojan-Horse
elements from conventional elements. In practice, the redundant
co-indexing of th:sID
and th:eID
is
convenient for processing software, as it makes it easy to find
the matching tag in a pair. The redundant specification of
th:doc
on end-tag elements similarly makes processing
slightly simpler in the transforms which switch from one dominant
structure to another.
All-recessive form
It can sometimes be convenient to have no dominant
hierarchy at all, and to represent all three hierarchies as
recessive using Trojan Horse elements. The haiku example
looks like this in this shallow
form:
<tei:text xmlns:tei="http://www.tei-c.org/ns/1.0"
          xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse"
          th:doc="meter sentence">
  <tei:body th:doc="meter sentence" th:sID="body" xml:id="body"/>
  <tei:head th:doc="meter sentence" th:sID="h1" xml:id="h1"/>
  The Village Without Bells
  <tei:head th:doc="meter sentence" th:eID="h1"/>
  <tei:lg th:doc="meter" th:sID="lg1" xml:id="lg1"/>
  <tei:ab th:doc="sentence" th:sID="ab1" xml:id="ab1"/>
  <tei:l th:doc="meter" th:sID="L1" xml:id="L1"/>
  <tei:s th:doc="sentence" th:sID="s1" xml:id="s1"/>
  A village where they ring
  <tei:l th:doc="meter" th:eID="L1"/>
  <tei:l th:doc="meter" th:sID="L2" xml:id="L2"/>
  no bells! —
  <tei:s th:doc="sentence" th:eID="s1"/>
  <tei:s th:doc="sentence" th:sID="s2" xml:id="s2"/>
  Oh, what do they do
  <tei:l th:doc="meter" th:eID="L2"/>
  <tei:l th:doc="meter" th:sID="L3" xml:id="L3"/>
  at dusk in spring?
  <tei:l th:doc="meter" th:eID="L3"/>
  <tei:lg th:doc="meter" th:eID="lg1"/>
  <tei:s th:doc="sentence" th:eID="s2"/>
  <tei:ab th:doc="sentence" th:eID="ab1"/>
  <tei:body th:doc="meter sentence" th:eID="body"/>
</tei:text>
As may be observed, in this form the container element
(here tei:text
) contains a flat sequence of empty
elements and text nodes, with no further nesting; for this
reason we call this the shallow
form of the document. (It is called a
flattened
form in Birnbaum et al. 2018.)
Translation from one dominant hierarchy to another is
conveniently achieved by a two-step translation first into
shallow form and then into the new dominant hierarchy.
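The two-step translation can be sketched as follows. The project's own transformations are written in XSLT; the Python below is only an illustrative approximation, and the function names to_shallow and raise_structure are invented for it. It assumes that every content element carries an xml:id which can serve as its marker identifier (as in the haiku examples) and that there are no empty content elements needing conversion to sole-markers.

```python
import xml.etree.ElementTree as ET

TH = "http://www.blackmesatech.com/2017/nss/trojan-horse"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
S_ID, E_ID, SOLE_ID, DOC = (f"{{{TH}}}{name}" for name in ("sID", "eID", "soleID", "doc"))


def _append_text(parent, text):
    """Attach text after parent's current content (ElementTree stores it
    as parent.text or as the tail of the last child)."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def to_shallow(container):
    """Step 1: replace every content element below container by a
    start-marker / end-marker pair; existing markers are copied."""
    flat = ET.Element(container.tag, dict(container.attrib))

    def emit(elem):
        attrs = dict(elem.attrib)
        if any(a in attrs for a in (S_ID, E_ID, SOLE_ID)):
            ET.SubElement(flat, elem.tag, attrs)       # already a marker
            return
        ident = attrs[XML_ID]                           # assumed to be present
        ET.SubElement(flat, elem.tag, {**attrs, S_ID: ident})
        _append_text(flat, elem.text)
        for child in elem:
            emit(child)
            _append_text(flat, child.tail)
        ET.SubElement(flat, elem.tag, {DOC: attrs.get(DOC, ""), E_ID: ident})

    _append_text(flat, container.text)
    for child in container:
        emit(child)
        _append_text(flat, child.tail)
    return flat


def raise_structure(flat, doc):
    """Step 2: turn the markers belonging to structure doc back into
    content elements; all other markers stay recessive."""
    root = ET.Element(flat.tag, dict(flat.attrib))
    stack = [root]
    _append_text(root, flat.text)
    for marker in flat:
        attrs = dict(marker.attrib)
        if doc not in attrs.get(DOC, "").split():
            ET.SubElement(stack[-1], marker.tag, attrs)
        elif S_ID in attrs:
            del attrs[S_ID]
            stack.append(ET.SubElement(stack[-1], marker.tag, attrs))
        elif E_ID in attrs:
            stack.pop()
        else:  # sole-marker: becomes an empty content element
            attrs.pop(SOLE_ID, None)
            ET.SubElement(stack[-1], marker.tag, attrs)
        _append_text(stack[-1], marker.tail)
    return root


# Abridged haiku fragment with the metrical structure dominant.
SRC = (f'<text xmlns:th="{TH}" th:doc="meter sentence">'
       '<lg th:doc="meter" xml:id="lg1"><l th:doc="meter" xml:id="L1">'
       '<s th:doc="sentence" th:sID="s1" xml:id="s1"/>A village</l></lg>'
       '<s th:doc="sentence" th:eID="s1"/></text>')

flat = to_shallow(ET.fromstring(SRC))
by_sentence = raise_structure(flat, "sentence")
print(ET.tostring(by_sentence, encoding="unicode"))
```

Routing every change of dominant hierarchy through the shallow form means that only the raising step needs to know which structure is wanted; the flattening step is the same for every input. The sketch assumes its input is logically well formed and performs no checking of its own.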
Well-formedness checking and simple validation
Logical well-formedness checking
One immediate consequence of the syntax used here is that it is possible to construct well-formed XML documents which are not logically well formed. A document is logically well formed if the markup for each hierarchy (dominant or recessive) is well formed: each start-marker has exactly one corresponding end-marker, and vice versa, and start- / end-marker pairs nest properly, and the same is true for start- and end-tags. A document that is not logically well formed is logically ill formed. Logical ill-formedness will be manifest as XML ill-formedness if the markup for the dominant hierarchy is made recessive and the markup for some recessive hierarchy is made dominant.
Unfortunately, neither XML editors nor XML parsers will detect logical ill-formedness in a recessive hierarchy. And we cannot simply make each recessive hierarchy dominant in turn in order to check well-formedness using an XML parser: our transformations are written in XSLT, which normally produces no ill-formed output: if the recessive hierarchy is logically ill formed in the input, the transformation will either fail or (worse) succeed with erroneous output.
It is imperative, therefore, to develop tools for checking the well-formedness of documents in this format. As the examples above show, even in simple cases the density of markup can be very high, and without the aid of an editor in maintaining well-formedness, it is very easy to make the kind of errors familiar to anyone who has had to deal with attempts to edit XML documents in editors without sufficient XML awareness.[11]
The current state of our well-formedness checking is represented by an XSLT stylesheet whose core is given by the following template:
<xsl:template match="/">
  <report xmlns:tei="http://www.tei-c.org/ns/1.0"
          xmlns:p5="http://www.tei-c.org/ns/1.0"
          xmlns:bmt="http://blackmesatech.com/2015/nss/digifacs"
          xmlns:atmo="http://uyghur.ittc.ku.edu/2015/ns/0.1">
    <head>Well-formedness report for Trojan-Horse markup</head>
    <p>Input document: <xsl:value-of select="document-uri()"/></p>
    <p>$doctype parameter: <xsl:value-of select="$doctype"/></p>
    <p>$nesting parameter: <xsl:value-of select="$nesting"/></p>
    <p>Date, time: <xsl:value-of select="adjust-dateTime-to-timezone(current-dateTime(), ())"/>.</p>
    <!-- Collect the results of all checks before reporting on them. -->
    <xsl:variable name="results" as="element()*">
      <start-IDs>
        <xsl:call-template name="check-SIDs"/>
      </start-IDs>
      <end-IDs>
        <xsl:call-template name="check-EIDs"/>
      </end-IDs>
      <sole-IDs>
        <xsl:call-template name="check-SoleIDs"/>
      </sole-IDs>
      <!-- Document types to check: the characters of $doctype if it is
           supplied, otherwise every name found in a th:doc attribute. -->
      <xsl:variable name="lDT" as="xs:string*"
        select="if (exists($doctype))
                then (for $i in 1 to string-length($doctype)
                      return substring($doctype,$i, 1))[normalize-space()]
                else distinct-values(
                       for $a in descendant::*/attribute::th:doc
                       return tokenize($a,'\s+'))"/>
      <xsl:for-each select="$lDT">
        <xsl:call-template name="check-balance-on-doc">
          <xsl:with-param name="doctype" select="."/>
          <xsl:with-param name="nesting" select="$nesting"/>
        </xsl:call-template>
      </xsl:for-each>
    </xsl:variable>
    <!-- Summarize the error count, then include the detailed results. -->
    <xsl:variable name="c" as="xs:integer" select="count($results//error)"/>
    <summary>
      <xsl:value-of select="concat($c, if ($c eq 1) then ' error ' else ' errors ', 'found.')"/>
    </summary>
    <details>
      <xsl:sequence select="$results"/>
    </details>
  </report>
</xsl:template>
As can be seen, it generates an XML document with a report
on the well-formedness of the input. Initially it reports on
its input and parameters: $doctype
requests
well-formedness checking for one particular document type
(default is all), and $nesting
determines whether
each content element in the input with Trojan Horse children is
checked independently for well-formedness; documents in shallow
form set $nesting
to respect
and those
with a dominant hierarchy set it to ignore
.
Separate named templates[12] then check the start- and end-markers of the document to confirm that:
-
Each th:sID value is unique among start- or sole-markers; each th:eID value is unique among end-markers.
-
Each start-, sole-, or end-marker is empty.
-
No element has more than one of th:sID, th:eID, th:soleID among its attributes.
-
Each th:sID matches at least one th:eID. Each th:eID matches at least one th:sID.
-
Each th:sID matches at most one th:eID. Each th:eID matches at most one th:sID.
-
When th:sID and th:eID match, the two markers have the same generic identifier, the th:sID precedes the th:eID, and the th:doc attributes match.
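The checks in this list can be approximated outside XSLT as well. The following Python sketch is an illustration, not a port of the project's named templates; the function name check_markers, the wording of the messages, and the sample documents are all invented for it.

```python
import xml.etree.ElementTree as ET

TH = "http://www.blackmesatech.com/2017/nss/trojan-horse"
S_ID, E_ID, SOLE_ID, DOC = (f"{{{TH}}}{name}" for name in ("sID", "eID", "soleID", "doc"))


def check_markers(root):
    """Return a list of error messages for the marker-level checks."""
    errors, starts, ends, open_ids = [], {}, {}, set()
    for pos, elem in enumerate(root.iter()):
        present = [a for a in (S_ID, E_ID, SOLE_ID) if elem.get(a) is not None]
        if not present:
            continue                        # conventional (dominant) element
        if len(present) > 1:
            errors.append(f"{elem.tag}: more than one marker attribute")
        if len(elem) or elem.text:
            errors.append(f"{elem.tag}: marker is not empty")
        for attr in (S_ID, SOLE_ID):        # sID unique among start/sole
            ident = elem.get(attr)
            if ident is not None:
                if ident in open_ids:
                    errors.append(f"duplicate start or sole ID {ident}")
                open_ids.add(ident)
        if elem.get(S_ID) is not None:
            starts.setdefault(elem.get(S_ID), (pos, elem))
        if elem.get(E_ID) is not None:
            if elem.get(E_ID) in ends:
                errors.append(f"duplicate end ID {elem.get(E_ID)}")
            ends.setdefault(elem.get(E_ID), (pos, elem))
    for ident, (pos, start) in starts.items():
        if ident not in ends:
            errors.append(f"start-marker {ident} has no end-marker")
            continue
        end_pos, end = ends[ident]
        if start.tag != end.tag:
            errors.append(f"{ident}: generic identifiers differ")
        if pos > end_pos:
            errors.append(f"{ident}: end-marker precedes start-marker")
        if start.get(DOC) != end.get(DOC):
            errors.append(f"{ident}: th:doc values differ")
    errors.extend(f"end-marker {i} has no start-marker" for i in ends if i not in starts)
    return errors


GOOD = f'<ab xmlns:th="{TH}"><s th:doc="S" th:sID="s1"/>text<s th:doc="S" th:eID="s1"/></ab>'
BAD = (f'<ab xmlns:th="{TH}"><s th:doc="S" th:sID="s1"/>text'
       '<l th:doc="M" th:eID="s1"/><s th:doc="S" th:sID="s2"/></ab>')

for message in check_markers(ET.fromstring(BAD)):
    print(message)
```

Given uniqueness on both sides, the "at most one" conditions follow automatically, which is why the sketch has no separate test for them.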
Another named template then checks to see that the
sequence of start- and end-markers for a given document type
forms properly nesting elements: it progresses through the sequence of
markers, pushing th:sID
values onto a stack
and checking, when it encounters an end-marker, that the
th:eID
attribute on the end-marker
matches the value at the top of the stack. It can thus report
on errors of nesting in the recessive views.
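The stack-based check can be sketched in Python as follows. This is a simplified illustration rather than the project's template: it looks only at markers whose th:doc value names the requested document type and ignores content elements altogether, so it corresponds most closely to checking a document in shallow form. The function name check_nesting and the sample data are invented.

```python
import xml.etree.ElementTree as ET

TH = "http://www.blackmesatech.com/2017/nss/trojan-horse"
S_ID, E_ID, DOC = (f"{{{TH}}}{name}" for name in ("sID", "eID", "doc"))


def check_nesting(root, doc):
    """Check that the markers of document type doc nest properly."""
    errors, stack = [], []
    for elem in root.iter():                 # document order
        if doc not in elem.get(DOC, "").split():
            continue
        start_id, end_id = elem.get(S_ID), elem.get(E_ID)
        if start_id is not None:
            stack.append(start_id)
        elif end_id is not None:
            if stack and stack[-1] == end_id:
                stack.pop()
            else:
                top = stack[-1] if stack else "nothing"
                errors.append(f"end-marker {end_id} found while {top} is open")
    errors.extend(f"start-marker {i} is never closed" for i in stack)
    return errors


# Lines and sentences overlap each other, but each structure nests on its own.
SHALLOW = (f'<text xmlns:th="{TH}">'
           '<l th:doc="meter" th:sID="L1"/><s th:doc="sentence" th:sID="s1"/>'
           'A village where they ring<l th:doc="meter" th:eID="L1"/>'
           '<l th:doc="meter" th:sID="L2"/>no bells!'
           '<s th:doc="sentence" th:eID="s1"/><l th:doc="meter" th:eID="L2"/></text>')

# Two sentences that cross each other within the same structure.
CROSSED = (f'<text xmlns:th="{TH}">'
           '<s th:doc="sentence" th:sID="s1"/><s th:doc="sentence" th:sID="s2"/>'
           '<s th:doc="sentence" th:eID="s1"/><s th:doc="sentence" th:eID="s2"/></text>')

print(check_nesting(ET.fromstring(SHALLOW), "meter"))
print(check_nesting(ET.fromstring(CROSSED), "sentence"))
```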
Simple validation
It is straightforward (or more precisely: it is as straightforward as document design ever gets) to specify a basic document grammar for each structural view of the document, in which the elements of that structure (including any common elements) are defined and elements of other structures are ignored. In the discussion that follows, we assume that such grammars are available. For purposes of this discussion it does not matter whether the grammars are expressed in DTD notation, Relax NG, or XSD.
Given such basic grammars, validation of the markup described above can be achieved in any of several ways.
The simplest approach is to validate each view separately. For each structure S marked up in the document:
-
First, translate the document into a form where S is dominant.
-
Then use a simple transformation to omit all recessive markup (or translate it into processing instructions).
-
Finally, validate against the basic document grammar for S.
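The second step, omitting recessive markup, can be sketched in a few lines. The fragment below is an illustration in Python rather than the project's transformation; the namespace URI is a placeholder assumption and the function name strip_recessive_markup is hypothetical. It covers only the omission variant (not the translation into processing instructions), and the final validation step is left to whatever validating parser is at hand.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI for the th: prefix (an assumption, not the project's URI).
TH = "http://example.org/ns/trojan-horse"
MARKER_ATTRS = tuple(f"{{{TH}}}{name}" for name in ("sID", "eID", "soleID"))


def strip_recessive_markup(root):
    """Delete every Trojan Horse marker in place, keeping the text that follows it."""
    for parent in list(root.iter()):          # snapshot, since the tree is modified
        previous = None                       # last child kept so far
        for child in list(parent):
            if any(a in child.attrib for a in MARKER_ATTRS):
                tail = child.tail or ""
                # ElementTree stores text after an element on that element, so it
                # must be moved before the marker is removed.
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root
```

The result contains only the dominant structure and the character data, which is the form the basic grammar for that structure describes.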
For example, the basic grammar for the metrical structure of the haiku example might be (in DTD notation):
<!ELEMENT text (body) >
<!ELEMENT body (head?, lg+) >
<!ELEMENT head (#PCDATA) >
<!ELEMENT lg (l+) >
<!ELEMENT l (#PCDATA) >
The basic grammar for the sentence structure might be:
<!ELEMENT text (body) >
<!ELEMENT body (head?, ab) >
<!ELEMENT head (#PCDATA) >
<!ELEMENT ab (s+) >
<!ELEMENT s (#PCDATA) >
This approach has the advantage of simplicity in the grammars: each basic grammar can essentially ignore the other grammars. It has the disadvantage that XML editors can no longer validate the document usefully, because there is no document grammar that actually describes even approximately the set of acceptable documents.
A more convenient validation process can be achieved by making an augmented document grammar for each structural view, which accounts for both the dominant structure and the Trojan-Horse markup for recessive structures. Because the augmented grammar includes declarations for recessive markup, it can be applied without pre-processing the document to strip recessive markup. This makes it possible to use the augmented grammar in schema-aware XML editors.
The set of base grammars satisfies the definition in Sperberg-McQueen 2006 for a set of rabbit/duck grammars. All common elements and elements in the dominant structure are first-class elements, and all other elements are third-class. We achieve a single augmented schema by making all recessive elements second-class and accounting for their start- and end-tags in the content models of the dominant structure.
-
For each structure S, make a list of all element types present in other structures, for which recessive markup may appear in view S (and declarations for which thus need to appear in the augmented schema). Call this list R (for recessive).
Note that some element types may be present as content elements in all structures: for the ATMO project, the TEI header and the TEI note element (with all its possible descendants) are such elements. Note, however, that some instances of such element types may be present in some structures but not all: the main paragraphs of the text (not inside notes) will be content elements in the text and sentence views, but virtual elements marked by Trojan Horse markup in the page view. The p element and its descendants, therefore, must appear in the list R constructed for the page view.
-
Augment the document grammar for S (call the augmented grammar S′) by allowing start- or end-tags for all elements in R at any location in any content model.[13]
This is equivalent to adding all the elements of R as inclusion exceptions on the SGML content model for the container element(s). In Relax NG, the desired effect can be achieved using the interleave operator (except when RNG's ambiguity rules mean that it cannot). In other schema languages (XML DTDs, XSD), systematic changes will need to be made to content models.[14]
Validation against the modified document grammar S′ is possible without a prior transformation to strip out recessive markup, and thus S′ can be used to guide a validating XML editor.
An SGML DTD with an augmented form of the metrical grammar might be:
<!ELEMENT text (body) +(ab | s)>
<!ELEMENT body (head?, lg+) >
<!ELEMENT head (#PCDATA) >
<!ELEMENT lg (l+) >
<!ELEMENT l (#PCDATA) >
An XML DTD will require more changes:
<!ENTITY % R "ab | s" >
<!ELEMENT text (body)>
<!ELEMENT body ((%R;)*, (head, (%R;)*)?, (lg, (%R;)*)+) >
<!ELEMENT head (#PCDATA | %R;)* >
<!ELEMENT lg (l, (%R;)*)+ >
<!ELEMENT l (#PCDATA | %R;)* >
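The rewriting rule given in note 14 is mechanical enough to sketch as a string transformation. The Python fragment below is a minimal illustration, not the project's schema generator: the function name augment_content_model is hypothetical, the handling of mixed content is an addition modeled on the head and l declarations in the example above, and content models that themselves contain parameter-entity references are not supported. Its output is more heavily parenthesized than a hand-written DTD would be.

```python
import re

# A primitive content token: an element name plus an optional occurrence indicator.
TOKEN = re.compile(r"([A-Za-z_:][\w.\-:]*)([?*+]?)")


def augment_content_model(model, r="%R;"):
    """Rewrite one XML DTD content model so that elements in R may appear anywhere."""
    model = model.strip()
    if model in ("EMPTY", "ANY"):
        return model                          # left unchanged in this sketch
    if "#PCDATA" in model:
        # Mixed content: add R to the alternatives (an assumption beyond note 14).
        inner = model.rstrip("*").strip()
        if not (inner.startswith("(") and inner.endswith(")")):
            raise ValueError(f"unexpected mixed-content model: {model!r}")
        return f"({inner[1:-1].strip()} | {r})*"
    # Replace every token T with (T, (%R;)*), keeping its occurrence indicator.
    augmented = TOKEN.sub(lambda m: f"({m.group(1)}, ({r})*){m.group(2)}", model)
    # Then allow elements of R before the first token of the modified model M.
    return f"(({r})*, {augmented})"


if __name__ == "__main__":
    for name, model in [("body", "(head?, lg+)"), ("lg", "(l+)"), ("l", "(#PCDATA)")]:
        print(f"<!ELEMENT {name} {augment_content_model(model)} >")
```

Whether the generated models satisfy the determinism rules of XML DTDs depends on the base grammar and should be checked with a validating parser rather than assumed.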
Our current validation practice uses augmented grammars, but our method of generating them is slightly less systematic than could be desired and has run into a number of snags. We continue to seek improvements, but resource constraints may limit our ability to refine the process.
For project participants, it would perhaps be simplest and most convenient to use a validator built to understand rabbit/duck grammars and Trojan-Horse markup, capable of validating multiple document grammars in parallel. A prototype of such a validator was described in Sperberg-McQueen 2006, but it is not deployable on the ATMO server. In any case, for editing, an augmented grammar appears to be the best approach that is currently feasible.
Conclusions and future work
The paper has presented an account of one technique for representing multiple hierarchies systematically in XML and processing documents so marked up using an XML tool chain.
Within the project, it remains to make full use of the technique, and in particular to create a search interface that allows the user to exploit the presence of multiple overlapping tagged structures in the documents.
It would also be helpful to automate the creation of schemas more fully.
More generally, and beyond the confines of the ATMO project, several topics invite further examination. The ability to validate documents with concurrent hierarchies marked up in this way in a single pass would be helpful; even more helpful would be techniques for writing schemas in conventional schema languages to enforce validity or at least well-formedness with respect to recessive views, so that XML-aware editors could be warned against changes that destroy logical well-formedness. If such schemas could be generated by deterministic processes operating on simple base schemas, so much the better.
The ability to query richly marked up documents with multiple concurrent hierarchies is of interest not only to the ATMO project but to others. It seems clear that such queries can be supported in principle, but it is less clear how to make such queries convenient and intuitive to the end user, or how to make XPath / XQuery / XSLT formulations of cross-hierarchy searches convenient and intuitive to the XML programmer. In particular, providing tools for XPath-style navigation in the presence of multiple hierarchies would be challenging and interesting.
We can perhaps take query as a bellwether for the general problem of processing concurrent structures, but it is possible that other forms of processing may turn up requirements not visible in search and retrieval applications. Peter Sharpe of SoftQuad pointed out a number of years ago that even standard operations like cut and paste take on new complications in the presence of concurrent structures; there may be other operations we take for granted in the conventional XML context that similarly become more complicated in documents like those described here.
References
[Barnard et al. 1988] Barnard, David; Ron Hayter; Maria Karababa; George Logan and John McFadden. SGML Markup for Literary Texts. Computers and the Humanities 22 (1988): 265-276. doi:https://doi.org/10.1007/BF00118602.
[Barnard et al. 1995] Barnard, David, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, C. M. Sperberg-McQueen, and Giovanni Battista Varile. Hierarchical encoding of text: Technical problems and SGML solutions. Computers and the Humanities 29 (1995): 211-231. doi:https://doi.org/10.1007/BF01830617.
[Birnbaum et al. 2018] Birnbaum, David J., Elisa E. Beshero-Bondar, and C. M. Sperberg-McQueen. Flattening and unflattening XML markup: a Zen garden of XSLT and other tools. To be presented at Balisage: The Markup Conference 2018, Washington, DC. On the Web in the preliminary proceedings.
[Dekhtyar / Iacob 2005] Dekhtyar, Alex, and Ionut Emil Iacob. 2005. A Framework For Management of Concurrent XML Markup. Data and Knowledge Engineering 52.2: 185-215. doi:https://doi.org/10.1016/j.datak.2004.05.005.
[DeRose 2004] DeRose, Steven. 2004. Markup overlap: A review and a Horse. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. On the Web at http://conferences.idealliance.org/extreme/html/2004/DeRose01/EML2004DeRose01.html
[Durusau / O'Donnell 2001] Durusau, Patrick, and Matthew Brook O'Donnell. 2001. Implementing concurrent markup in XML. Paper given at Extreme Markup Languages 2001, Montréal, sponsored by IDEAlliance. Slides on the Web at http://www.durusau.net/publications/Implementing_concur.pdf.
[Durusau / O'Donnell 2002a] Durusau, Patrick, and Matthew Brook O'Donnell. 2002. JITTS (Just-In-Time-Trees). Talk given at New York XML Special Interest Group, January 2002. Slides on the Web at http://www.durusau.net/publications/NY_xml_sig.pdf.
[Durusau / O'Donnell 2002b] Durusau, Patrick, and Matthew Brook O'Donnell. 2002. Coming down from the trees: Next step in the evolution of markup? Late-breaking paper given at Extreme Markup Languages 2002, Montréal, sponsored by IDEAlliance. Slides on the Web at http://www.durusau.net/publications/Down_from_the_trees.pdf.
[Durusau / O'Donnell 2003] Durusau, Patrick, and Matthew Brook O'Donnell. 2003. Restoring the primacy of PCDATA. Paper given at XML Europe 2004, sponsored by IDEAlliance. Available on the Web at http://www.durusau.net/publications/Primacy_of_PCDATA.pdf.
[Durusau / O'Donnell 2004] Durusau, Patrick, and Matthew Brook O'Donnell. 2004. Tabling the overlap discussion. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://conferences.idealliance.org/extreme/html/2004/Durusau01/EML2004Durusau01.html.
[Durusau 2005] Durusau, Patrick. 2005. OSIS users manual (OSIS Schema 2.1.1). The canonical location on the Web appears to be http://www.bibletechnologies.net/utilities/fmtdocview.cfm?id=28871A67-D5F5-4381-B22EC4947601628B&method=title but the site is intermittently unavailable. Another copy is at http://ebible.org/osis/OSIS2_1UserManual_06March2006_-_with_O'Donnell_edits.PDF.
[Haentjens Dekker / Birnbaum 2017] Haentjens Dekker, Ronald, and David J. Birnbaum. It's more than just overlap: Text As Graph. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Dekker01.
[Henderson 1958] Henderson, Harold G. An introduction to haiku. (Garden City, New York: Doubleday, 1958).
[Hilbert / Schonefeld / Witt 2005] Hilbert, Mirco, Oliver Schonefeld, and Andreas Witt. Making CONCUR work. In Proceedings of Extreme Markup Languages 2005. On the Web at http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml
[ISO 8879:1986] International Organization for Standardization (ISO). 1986. ISO 8879-1986 (E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). International Organization for Standardization, Geneva, 1986.
[Jagadish et al. 2004] Jagadish, H. V., Laks V. S. Lakshmanan, Monica Scannapieco, Divesh Srivastava, and Nuwee Wiwatwattana. 2004. Colorful XML: One hierarchy isn't enough. Proceedings of the 2004 ACM SIGMOD International conference on management of data, Paris, sponsored by the Association for Computing Machinery Special Interest Group on Management of Data. New York: ACM Press. doi:https://doi.org/10.1145/1007568.1007598.
[Piez 2012] Piez, Wendell. Luminescent: parsing LMNL by XSLT upconversion. Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). doi:https://doi.org/10.4242/BalisageVol8.Piez01.
[Piez 2014] Piez, Wendell. Hierarchies within range space: From LMNL to OHCO. Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13 (2014). doi:https://doi.org/10.4242/BalisageVol13.Piez01.
[Schonefeld 2007] Schonefeld, Oliver. 2007. XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup. In Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data structures for linguistic resources and applications: Proceedings of the Biennial GLDV Conference 2007, ed. Georg Rehm, Andreas Witt, Lothar Lemnitzer. Tübingen: Gunter Narr Verlag. Pp. 347-356.
[Schonefeld / Witt 2006] Schonefeld, Oliver, and Andreas Witt. 2006. Towards validation of concurrent markup. Extreme Markup Languages 2006.
[Sperberg-McQueen / Huitfeldt 1999] Sperberg-McQueen, C. M., and Claus Huitfeldt. 1999. Concurrent document hierarchies in MECS and SGML. Literary & Linguistic Computing 14.1: 29-42. doi:https://doi.org/10.1093/llc/14.1.29.
[Sperberg-McQueen 2006] Sperberg-McQueen, C. M. Rabbit/duck grammars: a validation method for overlapping structures. In Proceedings of Extreme Markup Languages 2006. On the Web at http://conferences.idealliance.org/extreme/html/2006/SperbergMcQueen01/EML2006SperbergMcQueen01.html.
[TEI P5] Text Encoding Initiative Consortium. 2018. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.3.0, last updated 31 January 2018. Available on the Web at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html
[Witt 2004] Witt, Andreas. 2004. Multiple hierarchies: new aspects of an old solution. Paper given at Extreme Markup Languages 2004, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Witt01/EML2004Witt01.html
[1] Many of the manuscripts in the Jarring Collection were acquired during Jarring's 1929-1930 stay in Kashgar, a city on the Silk Road in what is now the Xinjiang Uyghur Autonomous Region in the far western portion of the People's Republic of China. Some of the manuscripts are in Persian, Arabic, or other languages, but most are in the language of Kashgar's main indigenous population, the Uyghurs, which Jarring called Eastern Turki or just Turki. It is a matter of some interest whether the language of these manuscripts should be identified as modern standard Uyghur (ISO language code uig) or as Chaghatay, the language of the Chaghatay Khanate, the latest common ancestor of modern standard Uyghur and of modern Uzbek. For what it's worth, the linguists in the ATMO project lean on linguistic grounds toward the latter classification.
Jarring later had a distinguished career in the Swedish foreign service and at the United Nations. Near the end of his career he donated his collection of manuscripts to the University Library in Lund, Sweden, where they now form the nucleus of the Jarring Collection.
The ATMO project has received funding from the Henry Luce Foundation. The author thanks the Luce Foundation for its financial support and his collaborators in the project (especially Prof. Arienne M. Dwyer, Dr. Alexandre Papas, Akbar Amat, and Gulnar Eziz) for the intellectual challenges of the collaboration.
[2] The earliest discussion I am aware of in a scholarly journal is that of Barnard et al. 1988, though there is earlier work in a master's thesis written under David Barnard's supervision. The discussion of the problem and potential solutions continues; see for example [Haentjens Dekker / Birnbaum 2017].
[3] The use of rend to distinguish things for which standard XML practice would prescribe different element types is suboptimal; it has unavoidable similarities to the practice sometimes described as a kind of thought experiment: could we use a vocabulary with just one element type e, distinguishing different kinds of structure only by use of a type, class, or role attribute? The answer turns out to be yes, but you won't enjoy it very much.
The awkwardness can probably be taken as a sign of flaws in the original document analysis within the ATMO project; one of the challenges in tagging hitherto unavailable material, however, is that the material one is going to tag may not be conveniently accessible. For the ATMO project, a systematic survey of the topographic structures found in the manuscripts would have required an extended visit to Sweden.
A retrospective redesign of the markup and retagging of the transcripts would probably be desirable but is unlikely to be feasible. The most recent revision of the page-view schema does, however, fix the most egregious problem of the initial schema by allowing tables to appear within zones of writing.
[4] There is a certain potential for confusion in having documents in three formats, any one of which may be the most recently edited master copy, with changes that must promptly be propagated to the other two copies. To reduce this confusion, we have in fact chosen as a matter of policy to identify one or another form as the standard master (or just default) format; any changes most easily made with a different dominant hierarchy should be followed immediately by automatically updating the default master form. The goal of the markup design described here is to allow decisions about master form and maintenance rules to be made on other grounds, and not to be foreclosed by limitations of the markup design.
[5] On the topic of such transformations and their algorithms see now the paper Birnbaum et al. 2018 elsewhere in this year's Balisage conference.
[6] They could also be treated as sole tags, in which case the stream seen by the SAX-based consumer would be very similar to that in the proposal made here. But this possibility was not mooted explicitly by Durusau and O'Donnell.
[7] The author is grateful to Lynne A. Price for patient explication of these details in conversations spanning a number of years.
[8] The name Trojan Horse markup is a jocular reference to Troy Griffitts, a participant in the development of the Open Scripture Information Standard, whom DeRose credits with the basic idea.
[9] N.B. I have inserted line breaks and indentation here and in other examples for ease of reading. If the details of whitespace may be meaningful at the application level, less convenient indentation may be needed.
[10] I apologize if I appear to belabor this point, but experience has shown that even normally acute observers have objected to Trojan-Horse markup on the erroneous supposition that it introduces ambiguity. The claim is based on a fundamental misunderstanding.
[11] This is true even for experienced XML users. Early in the process of deploying the format described in this paper, the author was obliged to make some relatively simple, mechanical edits in a recessive hierarchy. Because the inter-format transformations were not yet all ready, it was not feasible to transform that recessive hierarchy to make it dominant, so he edited the elements in the recessive hierarchy by hand. The process involved splitting each tei:surface element in two and supplying new hyperlinks to point to a new set of page images to replace the old set of images of two pages at a time. Although the process was essentially mechanical and was executed using a simple editor macro, the end result had two errors in its logical well-formedness, which cost a full day and a half in debugging time, and which were found only after the well-formedness checker described in this section had been written.
[12] The named templates themselves are not shown here, but the entire stylesheet is available for inspection at http://uyghur.ittc.ku.edu/lib/th-wf-checker.xsl
[13] In this simple approach, the dominant grammar will not distinguish between start- and end-tags for recessive elements; in the notation defined by Sperberg-McQueen 2006, this amounts to saying tag(x) can be used, but not stag(x) or etag(x).
[14] The simplest approach is to replace every primitive content token T with the expression (T, (%R;)*), where %R; is an or-group containing every element in R. Additionally, replace every content model M thus modified with the expression ((%R;)*, M).