Premise[1]
There are many pragmatic reasons to encode cultural heritage texts in TEI-compliant XML, but we have to be mindful of the XML framework becoming synonymous with the framework in which we conceptualize text. As various textual scholars have pointed out, the limitations of a technology can delimit the selection, modelling, and analysis of textual aspects.[2]
They remind us of the frequently used analogy of the hammer and the toolbox: if all you have is a hammer, everything will look like a nail. Indeed, the limitations of the XML data model have influenced and shaped our text encoding praxis (cf. Pierazzo 2015) and will continue to do so. If all you know is XML, every text will start to look like a tree. And while there are enough cases where a tree data model suffices, other cases benefit from an alternative.
This paper explores the potential of combining the XML data model and the Text-As-Graph (TAG) data model. The primary aim of the paper is to examine a practical and workable method for modeling and editing documents. TAG is still under development and not (yet) as mature as XML. Still, we find the affordances of the TAG model to be highly suitable for modeling and storing literary historical documents. XML remains a prominent technology for the analysis and publishing of texts. The question is, then, can we combine the strengths of TAG and XML into one powerful tool for everything we may want to do with text: modeling, storing, (collaborative) editing, processing, analyzing, and publishing? This paper explores that possibility by proposing a digital editing workflow in which scholarly editors can model, edit, and store text as a TAG hypergraph, and subsequently export the textual data to an XML format for further analysis or publication with XML-based tools.
Developing a more inclusive, flexible data model for text has been one of the guiding principles behind the design of TAG, a graph-based model under development at the Royal Netherlands Academy of Arts and Sciences. In previous Balisage contributions we discussed the TAG data model (Haentjens Dekker and Birnbaum 2017), its markup language TAGML and reference implementation Alexandria (Haentjens Dekker et al. 2018), and explored the ways in which TAGML can be used to model certain textual features that are notoriously difficult to model in XML (Bleeker et al. 2020). Over the course of the four+ years we have been working on the TAG markup stack (i.e., its data model, syntax, query language, and schema) we have gained some valuable insights. In some cases, this resulted in some small modifications of the data model and syntax.[3]
In each communication we pointed out that even though TAG was under active development, we considered our work, findings, and reflections already relevant for a general discussion on text modeling. We illustrated how other data models and markup systems express complex features like overlap and non-linear text, and we argued that text could best be expressed as a network structure (i.e., a graph). In doing so, we aimed to encourage a reflexive awareness of the relationship between an intellectual model and data models of text. One of the motivations for our continued work on TAG is to emphasize the value of, first, having a wider assortment of tools in one’s metaphorical text modeling toolbox and, second, knowing how to select the right tool for the job at hand. We believe that scholars can make an informed choice only when they know both the strong and the weak points of a data model.
Data models for texts and documents
Our work is informed by a definition of text as a sequence of characters (e.g., letters, digits, spaces, and punctuation, including symbols and music notation) that is inscribed on a material carrier: a document. From the text on a document, a reader derives information. In earlier contributions, we argued that this information can best be organized in a network structure. We stated that text is partially ordered, which means that it is not always possible to determine the order of all characters in the sequence. Examples of partially ordered text are non-linear, discontinuous, or overlapping structures.
Generally speaking, certain data models are more suitable for expressing complex features than others: as said, the inherent properties of a data model provide its scope and determine its limits. While TAG is conceptually based on a hypergraph, the model can be implemented in different ways. The best logical implementation of TAG depends on the purpose: we found that a variation on the Multi-Colored Trees (MCT or Colorful XML, developed by Jagadish et al. 2004) and the GODDAG model (Sperberg-McQueen and Huitfeldt 2000, Sperberg-McQueen 2018), which we described as a colored GODDAG (Haentjens Dekker et al. 2020), works best for overlapping structures as well as for export and visualization purposes. With export being the focus of the paper, we will describe both the hypergraph and the Colored GODDAG models in the following sections.
Data models
Text as a hypergraph
Just like any other graph, a hypergraph consists of nodes and edges. The important difference is that some edges in a hypergraph can join together two or more nodes (in contrast to the one-to-one edges of regular graphs). These are called hyperedges. Hyperedges are typically undirected and they can be used to express group relations. The hypergraph is not a common data structure in the (digital) humanities, but it is fairly well-known in the STEM research fields. By way of illustration, figure 1 shows a hypergraph used in microbiology:
The TAG hypergraph differs slightly from the model in figure 1. First, the nodes in the TAG hypergraph are typed. We distinguish five node types: each hypergraph consists of exactly one document node (the root), zero or more text nodes, zero or more markup nodes, zero or more annotation nodes, and zero or more branching nodes. Furthermore, the TAG model consists of undirected hyperedges as well as directed one-to-one edges. The document node, the text nodes, and the branching nodes indicate the stream of the text and are therefore connected by directed, regular edges. The markup and annotation nodes in the TAG hypergraph can be connected with either a hyperedge or a regular one-to-one edge. For example, a hyperedge can associate multiple text nodes with one and the same markup node, and an annotation node can be associated with one markup node by means of a regular edge. Figure 2 exemplifies the different types of nodes and edges. The text nodes are white, the regular edges are visualized as arrows, and the hyperedges are labelled and visualized in different colors.
Text in a hypergraph is read from left to right, starting with the document node and following the directed edges. There are two branches in the hypergraph; the beginning and the end of the branches are indicated with branching nodes. The markup node labelled “subst” in figure 2 is associated with two text nodes via a labelled hyperedge (yellow); the markup node labelled “add” is associated with one text node via a labelled hyperedge (dark green) and has an associated annotation node (light green) with information about the place of the addition in the source text.
Note that the markup in this example is properly nested: it represents one hierarchical structure and the markup elements do not overlap. Expressing multiple overlapping hierarchical structures in TAG is done by grouping together related markup nodes in a group.[4] The markup nodes within each group form a single hierarchy, but groups can share markup nodes and a TAG document can contain any number of groups. This way, overlapping structures can be easily expressed in TAG. The following section exemplifies the logical model (section “Text as Multi-Colored Trees”).
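The typed nodes and the two kinds of edges described above can be sketched as a handful of Kotlin classes. This is a minimal illustration under stated assumptions and not Alexandria's internal representation: every class and function name below (TagHypergraph, MarkupHyperedge, substitutionExample, and so on) is hypothetical, as are the sample text and the "place" annotation.

```kotlin
// Hypothetical sketch of the conceptual TAG hypergraph; none of these names come from Alexandria.
sealed class TagNode
class DocumentNode : TagNode()
class TextNode(val content: String) : TagNode()
class MarkupNode(val label: String) : TagNode()
class AnnotationNode(val name: String, val value: String) : TagNode()
class BranchingNode(val opens: Boolean) : TagNode()

// Directed one-to-one edge: carries the text stream from one node to the next.
class StreamEdge(val from: TagNode, val to: TagNode)

// Undirected hyperedge: joins one markup node with all text nodes it covers.
class MarkupHyperedge(val markup: MarkupNode, val texts: List<TextNode>)

// Regular edge that attaches a single annotation node to a markup node.
class AnnotationEdge(val markup: MarkupNode, val annotation: AnnotationNode)

class TagHypergraph(
    val document: DocumentNode,
    val streamEdges: List<StreamEdge>,
    val hyperedges: List<MarkupHyperedge>,
    val annotationEdges: List<AnnotationEdge>
) {
    // Text content of every text node joined to the given markup node.
    fun textCoveredBy(markup: MarkupNode): List<String> =
        hyperedges.filter { it.markup === markup }.flatMap { it.texts }.map { it.content }

    // Annotations attached to the given markup node, as name/value pairs.
    fun annotationsOf(markup: MarkupNode): Map<String, String> =
        annotationEdges.filter { it.markup === markup }
            .associate { it.annotation.name to it.annotation.value }
}

// Builds a small substitution with two branches (sample content is assumed).
fun substitutionExample(): Triple<TagHypergraph, MarkupNode, MarkupNode> {
    val doc = DocumentNode()
    val before = TextNode("had ")
    val open = BranchingNode(opens = true)
    val deleted = TextNode("slept")
    val added = TextNode("supped")
    val close = BranchingNode(opens = false)
    val after = TextNode(" one night")
    val subst = MarkupNode("subst")
    val add = MarkupNode("add")
    val graph = TagHypergraph(
        document = doc,
        streamEdges = listOf(
            StreamEdge(doc, before), StreamEdge(before, open),
            StreamEdge(open, deleted), StreamEdge(open, added),
            StreamEdge(deleted, close), StreamEdge(added, close),
            StreamEdge(close, after)
        ),
        hyperedges = listOf(
            MarkupHyperedge(subst, listOf(deleted, added)),
            MarkupHyperedge(add, listOf(added))
        ),
        annotationEdges = listOf(AnnotationEdge(add, AnnotationNode("place", "above")))
    )
    return Triple(graph, subst, add)
}
```

In this sketch the hyperedge is what allows the “subst” markup node to cover both branches at once, while the text stream itself stays a set of plain directed edges.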
Text as Multi-Colored Trees
Alexandria implements an MCT, an ordered directed acyclic graph with colors on the markup nodes. The MCT is inspired by the multitrees of GODDAG, the colored nodes of Colorful XML, and the combination of several XML trees of XConcur (Jagadish et al. 2004, Hilbert et al. 2005). The MCT model extends the XML tree in two ways: a node in an MCT has an additional property (its color), and an MCT database can consist of one or more colored trees (instead of XML’s single-rooted tree). Each tree has a different color. A node can be shared by more than one tree, in which case it has multiple colors. The trees within an MCT document can be navigated and manipulated with extended XQuery or XPath expressions in which the user first selects a leading color (Jagadish et al. 2004, see also Portier et al. 2012). The MCT is implemented in the Alexandria repository of TAG; see section “Workflow”. In the Alexandria MCT, the text nodes as well as the root document node are shared between all the colors.
Overlapping hierarchies in TAG are expressed as an MCT by assigning a colored tree to each group of markup nodes. Figure 3 shows a visualization of an MCT implementation with two groups of markup nodes, i.e., two colored trees. This simple example shows how two groups of markup nodes, one with the identifier “D” and the other with the identifier “T”, can be expressed as a red and a blue tree, respectively. The red tree (i.e., the group of markup nodes identified with “D”) shares the markup nodes labeled “s” and “add” with the blue tree (i.e., the group labeled “T”). The markup node labeled “del” is only part of the red tree. Each node is stored only once in the MCT model; edges between nodes are specified in each colored tree.
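The idea that each node is stored once while edges are kept per colored tree can be sketched as follows. This is only an illustration of the principle: ColoredNode, ColoredForest, and twoGroupExample are hypothetical names, and the exact parent-child shape of the two trees is an assumption, since it is taken from the prose description rather than from the figure.

```kotlin
// Hypothetical sketch of a multi-colored tree: nodes are shared, edges are stored per color.
class ColoredNode(val label: String, val colors: Set<String>)

class ColoredForest {
    // color -> (parent -> ordered children); a shared node appears in several color maps.
    private val edges = HashMap<String, MutableMap<ColoredNode, MutableList<ColoredNode>>>()

    fun addEdge(color: String, parent: ColoredNode, child: ColoredNode) {
        require(color in parent.colors && color in child.colors) {
            "both nodes must carry the color '$color'"
        }
        edges.getOrPut(color) { LinkedHashMap() }
            .getOrPut(parent) { mutableListOf() }
            .add(child)
    }

    fun children(color: String, parent: ColoredNode): List<ColoredNode> =
        edges[color]?.get(parent).orEmpty()

    // Labels of one colored tree in preorder, starting at the given root.
    fun labels(color: String, root: ColoredNode): List<String> =
        listOf(root.label) + children(color, root).flatMap { labels(color, it) }
}

// Two groups, "D" (red) and "T" (blue); "s" and "add" are shared, "del" is only in "D".
fun twoGroupExample(): Pair<ColoredForest, ColoredNode> {
    val root = ColoredNode("text", setOf("D", "T"))
    val s = ColoredNode("s", setOf("D", "T"))
    val del = ColoredNode("del", setOf("D"))
    val add = ColoredNode("add", setOf("D", "T"))
    val forest = ColoredForest()
    forest.addEdge("D", root, s)
    forest.addEdge("D", s, del)
    forest.addEdge("D", s, add)
    forest.addEdge("T", root, s)
    forest.addEdge("T", s, add)
    return forest to root
}
```

Selecting a color then amounts to choosing which edge map to follow from the shared root.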
Text features
In earlier publications, we distinguished at least three complex features with which digital scholarly editors have to deal on a regular basis: overlapping or concurrent hierarchies, discontinuous text, and non-linear structures (Haentjens Dekker and Birnbaum 2017, Haentjens Dekker et al. 2018). Expressing concurrent or overlapping structures is, if not the biggest, surely the most famous and most debated obstacle for text modeling in XML and therefore needs little explaining. Discontinuous text is usually illustrated with a quotation that is interrupted by an aside or by the narrator’s voice (cf. Sperberg-McQueen and Huitfeldt 2008). Non-linear structures, finally, can be found on the level of the source text as well as the markup, e.g., in-text revisions on a manuscript or an abbreviation and its expansion supplied by the editor in markup. The common denominator is a temporary break in the linearity of the text that can be conceptualized as a split of the text stream into two or more substreams or branches.
In this section we show how (or to what extent) these features are modeled in TAG. We take our examples from three pages of Love and Freindship, a short epistolary novel by Jane Austen, written between 1790 and 1793. The novel is written in a notebook entitled “Volume the Second” and is part of her Juvenilia manuscripts.[5]
For each text feature, we first describe our philological interpretation of the feature in question. We then show the syntactical serialization of the feature in TAGML and in XML, combined with visualizations of the underlying data models. This will facilitate the comparison between TAGML and XML and, we hope, increase the readers’ awareness of the relationship between the conceptual model and its logical implementation(s).
Discontinuous text
Discontinuous text can be found in the fragment of the text displayed in figure 7, which reads: “Beware my Laura (she would often say) Beware of the insipid Vanities and idle Dissipations […]”. The aside “(she would often say)” is a comment by the letter writer Laura upon the quoted text and can therefore be identified as a break in the quotation.
Discontinuous text in TAGML
The TAGML encoding of discontinuous text is similar to the TexMECS proposal: the q element is “paused” and subsequently “resumed” with the affixes - and + (see Sperberg-McQueen and Huitfeldt 2008).[6] On the level of the conceptual hypergraph model, the discontinuous text is part of one and the same q element, as is illustrated by figure 9.
The relationship between the two parts of the q element is also stored in the MCT implementation of Alexandria:
Discontinuous text in XML
The TEI Guidelines offer several options to encode discontinuous text, such as the use of next and prev attributes on the discontinued elements.[7] A simplified TEI XML transcription of the example sentence would then be:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata information here -->
  </teiHeader>
  <text>
    <body>
      <p>
        <!-- more text here -->
        <s>
          <q xml:id="q1" next="#q2">"Beware my Laura </q> (she would often say)
          <q xml:id="q2" prev="#q1"> Beware of [...] Southampton."</q>
        </s>
        <!-- more text here -->
      </p>
    </body>
  </text>
</TEI>
```
Here the q elements are given an xml:id, and the next and prev attributes are used to indicate that the first q element is continued in the next. Instead of, or in addition to, next and prev attributes, the two parts can be joined together using a link element: <link type="join" target="#q1 #q2"/>. As figure 11 shows, the elements are only linked on a syntactical level. On the level of the data model the q elements are two separate child elements of the div element.
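Because the link between the two q elements exists only in attribute values, any tool that wants the complete quotation has to follow the next pointers itself. The sketch below shows one way this could be done with the DOM parser that ships with the JDK; the function name joinDiscontinuous and its behaviour (trimming each part and joining the parts with a single space) are our own assumptions for illustration, not something prescribed by the TEI Guidelines.

```kotlin
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper: follows the chain of @next pointers starting at the element
// with the given xml:id and returns the concatenated text of all parts.
fun joinDiscontinuous(xml: String, firstId: String): String {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(ByteArrayInputStream(xml.toByteArray(Charsets.UTF_8)))

    // Index every element that carries an xml:id.
    val byId = HashMap<String, Element>()
    val all = doc.getElementsByTagName("*")
    for (i in 0 until all.length) {
        val element = all.item(i) as Element
        val id = element.getAttribute("xml:id")
        if (id.isNotEmpty()) byId[id] = element
    }

    val parts = ArrayList<String>()
    val seen = HashSet<String>()
    var current: Element? = byId[firstId]
    while (current != null) {
        // Stop if a pointer leads back to a part we already visited.
        if (!seen.add(current.getAttribute("xml:id"))) break
        parts.add(current.textContent.trim())
        // getAttribute returns "" when @next is absent, which ends the chain.
        current = byId[current.getAttribute("next").removePrefix("#")]
    }
    return parts.joinToString(" ")
}
```

The need for such a helper illustrates the point made above: in the tree itself the two q elements remain unrelated siblings, and the relation has to be reconstructed by the consuming application.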
Overlapping structures
Finding an example of overlapping structures in the manuscript notebook of Love and Freindship is fairly simple. Let’s say we want to express both the material features of the document and the linguistic structure of the text, i.e., the lines on the page and the sentences respectively. As the sentences run over several page lines and one page boundary, they overlap partly with the material structure of the document.
Overlap in TAG
As mentioned in section “Text as Multi-Colored Trees”, TAG handles overlapping structures by grouping the markup nodes of each structure into a separate group. Within each group, the markup nodes are hierarchically ordered. Figure 12 shows the TAGML encoding of Austen’s text containing the overlapping hierarchies: the material structure, whose nodes are assigned the identifier “D”, and the linguistic structure of the letter, whose nodes are assigned the identifier “T”. For example, the page and l markup elements have the identifier “D”, while the s markup elements are given the identifier “T”.[8] In this particular TAGML document, the root node text is shared: it has the “D”, “T”, and “P” identifiers.
The MCT of the entire novel is too large to visualize here, but the simplified visualization below shows the colored trees constituted by the material and linguistic structures:
Overlap in XML
The ubiquity of overlapping structures in cultural heritage texts has produced a wide variety of TEI P5 encoding systems. In our case it would be convenient to use empty <lb/> elements for the structure of the page lines, and the aforementioned next and prev attributes for the sentence that runs across the two pages. Below is a simplified TEI-compliant XML transcription in which the linguistic structure of the sentences is leading. Note again that the last sentence is made up of two separate s elements that are only linked on a syntactical level.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata information here -->
  </teiHeader>
  <text>
    <body>
      <!-- some text here -->
      <p>
        <s>
          <lb/>Our neighbourhood consisted
          <lb/>only of your Mother.</s>
        <s xml:id="s1" next="#s2">She may probably have already
          <lb/>told you that being left by her Parents in indegent
          <lb/>Circumstances she had retired into Wales on economi-</s>
      </p>
      <pb/>
      <p>
        <s xml:id="s2" prev="#s1">
          <lb/>cal motives.</s>
        <!-- some text and markup here -->
      </p>
    </body>
  </text>
</TEI>
```
From TAGML to XML and beyond
Workflow
After having encoded the text in TAGML, the document can be uploaded to the Alexandria repository. Alexandria is operated on the command line and is set up to work similarly to git, the version control software often used by programmers and digital humanists to work collaboratively on code. Via command line commands, users can “check in” and “check out” TAGML documents to and from the Alexandria repository. In addition to a TAGML document, users can also upload one or more views in which they define which (groups of) markup nodes are included or filtered out. We created the view functionality because the TAG document in Alexandria can potentially contain a large amount of information, and we assume that not all users will always be interested in all information. A view is expressed in JSON, for example: {"includeGroups":["D"]} {"includeMarkup":["del"]}. In this example, the view says to include the markup nodes that are grouped with the identifier “D” and all markup nodes labelled “del” when the TAGML document is checked out of the Alexandria repository.
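The effect of such a view can be sketched as a simple filter over markup nodes. The sketch is deliberately independent of Alexandria: the View and MarkupTag classes and the keeps function are hypothetical, and we assume, following the description above, that a markup node is retained when it belongs to an included group or carries an included label.

```kotlin
// Hypothetical model of a view; field names mirror the JSON keys used in the example above.
data class View(
    val includeGroups: Set<String> = emptySet(),
    val includeMarkup: Set<String> = emptySet()
)

// Hypothetical, simplified markup node: a label plus the group identifiers it belongs to.
data class MarkupTag(val label: String, val groups: Set<String>)

// Assumed semantics: keep a node if its label is included or if it is in an included group.
fun View.keeps(tag: MarkupTag): Boolean =
    tag.label in includeMarkup || tag.groups.any { it in includeGroups }

fun applyView(view: View, tags: List<MarkupTag>): List<MarkupTag> =
    tags.filter { view.keeps(it) }

fun main() {
    val view = View(includeGroups = setOf("D"), includeMarkup = setOf("del"))
    val tags = listOf(
        MarkupTag("page", setOf("D")),
        MarkupTag("l", setOf("D")),
        MarkupTag("s", setOf("T")),
        MarkupTag("del", setOf("T"))
    )
    // Only "s" is dropped: it is neither in group D nor labelled "del".
    println(applyView(view, tags).map { it.label })
}
```

Text nodes are not part of this filter: a view selects markup, while the text itself is always present in the checked-out document.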
An Alexandria check-in means that the TAGML document is parsed, verified for well-formedness, and stored as an MCT in the repository. A check-out in combination with a view generates a new TAGML document that contains the selected markup nodes, while the master TAGML document remains intact in the repository. The checked-out document can be edited and checked in again. Alternatively, users can indicate upon checkout whether they want the master document to be exported to another format: currently, we support SVG, XML, DOT, and PNG. Figure 14 depicts this workflow:
Let’s zoom in on the last step in this workflow diagram: the export to XML. The Alexandria command export-xml seems easy enough and to some extent it is: technically, converting an MCT to a single tree is fairly simple. Like XML, TAGML has markup tags with names, attributes, and values. And although TAGML supports different data types for attribute values (in addition to strings, TAGML attribute values can be integers, floats, lists, or Booleans), the TAGML attribute values can be converted to string type attribute values in XML. Still, a graph with multiple concurrent hierarchies contains more information than a mono-hierarchical tree, so a graph-to-tree conversion implies that we have to decide how to express that information. This depends in part on how the user plans to use the exported document. Representing overlapping hierarchies in a single tree, for example, will require additional tagging. Users may therefore want to scale down the amount of information in the XML document and select only the information that is relevant for their purpose. This means deciding whether the exported XML document should have a leading hierarchy and if so, which markup nodes should be part of it. After addressing the algorithmic side of the TAGML-to-XML conversion in section “The code”, we will discuss the editorial side.
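The conversion of typed attribute values to XML strings mentioned above can be illustrated with a short sketch. The function names are hypothetical, and the choice to serialize a list as space-separated items is an assumption made for the example; Alexandria may serialize lists differently.

```kotlin
// Hypothetical sketch: map a typed TAGML-style attribute value to an XML attribute string.
fun toXmlAttributeValue(value: Any): String = when (value) {
    is String -> value
    is Boolean, is Int, is Long, is Float, is Double -> value.toString()
    // Assumption: list items are separated by a single space.
    is List<*> -> value.joinToString(" ") { item ->
        toXmlAttributeValue(requireNotNull(item) { "null list items are not supported" })
    }
    else -> throw IllegalArgumentException("unsupported attribute type: ${value::class}")
}

// Escape the characters that may not appear literally in a double-quoted attribute value.
fun escapeAttribute(text: String): String =
    text.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;")

// Serialize a map of typed attributes as they would appear inside an XML start tag.
fun toXmlAttributes(attributes: Map<String, Any>): String =
    attributes.entries.joinToString(" ") { (name, value) ->
        "$name=\"${escapeAttribute(toXmlAttributeValue(value))}\""
    }
```

The conversion is lossy in one direction only: the XML output no longer records that a value was, say, an integer or a list, which is one example of information that the graph holds and the exported tree does not.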
The code
The steps taken during the TAGML to XML conversion are as follows:
- The user gives the export-xml command to the Alexandria server;
- The TAGTraverser iterates over the MCT of the TAGML document. If the user has provided a view, the traverser will also use information from the view and ignore certain (groups of) markup nodes;
- The TAGTraverser generates a stream of Events;
- For each Event, check to see whether it is an open tag, a close tag, or text characters;
- If the Event is text, the characters are transformed into an XML text node;
- If the Event is an open tag or a close tag, check to see if the user provided information about the leading hierarchy and if so, whether the tag is part of the leading hierarchy;
  - If not, the open tag or close tag is transformed into a Trojan Horse start or end element, respectively;
  - If the tag is part of the leading hierarchy, the open tag or close tag is transformed into an XML open tag or an XML close tag.
Figure 15 shows a flowchart of this process. The flowchart starts after the user has created a TAGML document and a view and uploaded them to the Alexandria repository.
The code base of Alexandria is written in Kotlin. The code fragment below shows the class definitions of Node, MCT, and Event. Edges are stored in two directions: incoming and outgoing. Target nodes are stored in LinkedHashSets to preserve the order of the nodes.
```kotlin
sealed class Node

// A markup node carries the colors (group identifiers) of every tree it belongs to.
data class Markup(
    val label: String,
    val colors: List<String>,
    val id: Long = System.currentTimeMillis()
) : Node()

data class Text(val content: String, val id: Long = System.currentTimeMillis()) : Node()

class MCT(val rootNode: Markup) {
    // Edges are stored in both directions; LinkedHashSet preserves the order of the target nodes.
    val outgoingEdges: MutableMap<Markup, LinkedHashSet<Node>> = HashMap()
    val incomingEdges: MutableMap<Node, LinkedHashSet<Markup>> = HashMap()
}

sealed class Event
data class MarkupOpen(val node: Markup) : Event()
data class MarkupClose(val node: Markup) : Event()
data class TextEvent(val node: Text) : Event()
```
As described above (section “Workflow”), a TAGML document that is checked into the Alexandria repository is parsed and stored as an MCT. When the user gives the export-xml command to Alexandria, the TAGTraverser iterates over the MCT and generates a stream of Events. The Kotlin code fragment below describes the traversal algorithm. It traverses the nodes in topological order and creates TextEvent, MarkupOpen, or MarkupClose events. A text node generates a TextEvent; a markup node generates a MarkupOpen event for that node. In order to link the MarkupClose events to the appropriate MarkupOpen events, we keep track of the markup that is currently open by using markup stacks. There is a global stack as well as one stack for each color in the MCT. Each markup node is added to the global stack and to the relevant color stack(s). Before we can generate a TextEvent or a MarkupOpen event for a node, we need to check whether the markup node on top of each relevant color stack is a parent of the current node. Markup nodes that are not a parent generate MarkupClose events and are removed from both the color stacks and the global stack. After all nodes have been processed in this manner, the markup that is left on the global stack generates the remaining MarkupClose events.
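```kotlin
import java.util.Stack

fun traverseMCT(mct: MCT): List<Event> {
    // topologicalSort is a helper that is not shown in this fragment.
    val nodes = topologicalSort(mct)
    val result = arrayListOf<Event>()
    val colorToStackMap = HashMap<String, Stack<Markup>>()
    val globalStack = LinkedHashSet<Markup>()

    for (node in nodes) {
        val parents = mct.incomingEdges.getOrElse(node) { emptySet<Markup>() }

        // A markup node only affects the stacks of its own colors; text is shared by all colors.
        val stacksToCheck: List<Stack<Markup>> = when (node) {
            is Markup -> colorToStackMap.entries
                .filter { node.colors.contains(it.key) }
                .map { it.value }
            is Text -> colorToStackMap.values.toList()
        }

        // Close all open markup that is not a parent of the current node.
        // The isNotEmpty check avoids peeking into an exhausted stack.
        for (stack in stacksToCheck) {
            while (stack.isNotEmpty() && stack.peek() !in parents) {
                val nodeToPop = stack.pop()
                // A shared node sits on several color stacks but is closed only once.
                if (globalStack.remove(nodeToPop)) result.add(MarkupClose(nodeToPop))
            }
        }

        when (node) {
            is Markup -> {
                node.colors
                    .map { colorToStackMap.getOrPut(it) { Stack() } }
                    .forEach { it.push(node) }
                globalStack.add(node)
                result.add(MarkupOpen(node))
            }
            is Text -> result.add(TextEvent(node))
        }
    }

    // Whatever is still open is closed in reverse order of opening.
    globalStack.reversed().forEach { node -> result.add(MarkupClose(node)) }
    return result
}
```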
Finally, the algorithm creates the XML document from the MCT. The code below describes how the algorithm loops over the Events and creates XML tags. The XML tags are based on the type of event and on whether the node associated with the event is part of the leading hierarchy. Nodes in the leading hierarchy are used to create XML content elements; nodes in the other hierarchies are converted to Trojan Horse elements. Trojan Horse elements are a specific type of element, or “segment-boundary delimiters”, with the namespace prefix th:. Two related milestones are linked by means of matching @sID and @eID attributes, so the regular XML <s>The sun is yellow</s> becomes <s th:sID="foo"/>The sun is yellow<s th:eID="foo"/> in Trojan Horse markup (DeRose 2004, Barnard et al. 1995, Sperberg-McQueen 2018). Additionally, the Trojan Horse markup elements are given an attribute that is generated from their TAGML group identifier, e.g., @th:doc="D" for all markup nodes in a group called “D”.
```kotlin
import java.io.Writer
import javax.xml.stream.XMLOutputFactory

fun createXML(mct: MCT, leadingHierarchy: String, writer: Writer) {
    val events = traverseMCT(mct)
    val xml = XMLOutputFactory.newFactory().createXMLStreamWriter(writer)
    for (event in events) {
        when (event) {
            is TextEvent -> xml.writeCharacters(event.node.content)
            is MarkupOpen ->
                if (event.node.colors.contains(leadingHierarchy)) {
                    // Leading hierarchy: regular XML start tag.
                    xml.writeStartElement(event.node.label)
                } else {
                    // Other hierarchies: empty Trojan Horse start marker.
                    xml.writeEmptyElement(event.node.label)
                    xml.writeAttribute("sID", event.node.id.toString())
                }
            is MarkupClose ->
                if (event.node.colors.contains(leadingHierarchy)) {
                    xml.writeEndElement()
                } else {
                    // Empty Trojan Horse end marker with the matching identifier.
                    xml.writeEmptyElement(event.node.label)
                    xml.writeAttribute("eID", event.node.id.toString())
                }
        }
    }
    // Flush the stream writer before closing the underlying writer.
    xml.flush()
    xml.close()
    writer.close()
}
```
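To see the effect of the leading hierarchy in isolation, the following self-contained sketch feeds a hand-written event stream to the same kind of loop. It is not part of Alexandria: the DemoEvent classes, the render function, and the hard-coded identifiers are hypothetical, and the namespace prefix and the th:doc attribute are left out for brevity.

```kotlin
import java.io.StringWriter
import javax.xml.stream.XMLOutputFactory

// Hypothetical, simplified events: label, colors and identifier are given directly.
sealed class DemoEvent
data class Open(val label: String, val colors: Set<String>, val id: String) : DemoEvent()
data class Close(val label: String, val colors: Set<String>, val id: String) : DemoEvent()
data class Chars(val content: String) : DemoEvent()

// Renders the events; `leading` is the color of the leading hierarchy, or null for none.
fun render(events: List<DemoEvent>, leading: String?): String {
    val out = StringWriter()
    val xml = XMLOutputFactory.newFactory().createXMLStreamWriter(out)
    fun isLeading(colors: Set<String>) = leading != null && colors.contains(leading)
    for (event in events) {
        when (event) {
            is Chars -> xml.writeCharacters(event.content)
            is Open ->
                if (isLeading(event.colors)) {
                    xml.writeStartElement(event.label)
                } else {
                    xml.writeEmptyElement(event.label)
                    xml.writeAttribute("sID", event.id)
                }
            is Close ->
                if (isLeading(event.colors)) {
                    xml.writeEndElement()
                } else {
                    xml.writeEmptyElement(event.label)
                    xml.writeAttribute("eID", event.id)
                }
        }
    }
    xml.flush()
    xml.close()
    return out.toString()
}

// One page line ("D") in which one sentence ("T") ends and the next one begins.
val overlapEvents: List<DemoEvent> = listOf(
    Open("l", setOf("D"), "l6"),
    Chars("only of your Mother. "),
    Close("s", setOf("T"), "s4"),
    Open("s", setOf("T"), "s7"),
    Chars("She may probably have already "),
    Close("l", setOf("D"), "l6")
)

fun main() {
    println(render(overlapEvents, "D"))   // l becomes a content element, s stays a milestone
    println(render(overlapEvents, null))  // every tag becomes a Trojan Horse milestone
}
```

With “D” as the leading color, the line becomes a regular element and the sentence boundaries remain empty markers; with no leading color, the output contains only text and empty markers.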
The user
The user is usually not aware of the algorithmic details of the XML export and, although the code is open source and there for anyone to look at, is not required to be. Still, the conversion process is not just a matter of clicking a button: it is designed so that the user can influence it. Let’s take a closer look. Figure 16 shows the (simplified) TAGML transcription of the entire text of the fourth letter from Love and Freindship.
The overlapping hierarchical structures in TAGML cannot be transformed to XML fully automatically. This is where the user comes in: they can choose the hierarchy formed by the group of markup nodes labelled “D”, the hierarchy formed by the markup nodes labelled “T”, or no leading hierarchy in the XML output. In the latter case, it suffices to use the command alexandria export-xml Austen_VtS, which says as much as “Hi Alexandria, please export the document called Austen_VtS to XML”. Subsequently, the TAGTraverser described in the previous section will iterate over the MCT, generate a stream of events, and build an XML tree. All TAGML markup nodes will be transformed into Trojan Horse elements. A short fragment of the Trojan Horse XML output is shown below:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xml xmlns:tag="http://tag.di.huc.knaw.nl/ns/tag" xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse" th:doc="D P T">
<text title="Love and Freindship" type="novel" author="Jane Austen" date="1790">
<page n="6" th:doc="D" th:sId="page0"/><head th:doc="D T" th:sId="head1"/><l th:doc="D" th:sId="l2"/>Letter 4th<l th:doc="D" th:eId="l2"/><head th:doc="D T" th:eId="head1"/>
<l th:doc="D" th:sId="l3"/>Laura to Marianne<l th:doc="D" th:eId="l3"/>
<s th:doc="T" th:sId="s4"/><l th:doc="D" th:sId="l5"/>Our neighbourhood was small, for it consisted <l th:doc="D" th:eId="l5"/>
<l th:doc="D" th:sId="l6"/>only of your Mother. <s th:doc="T" th:eId="s4"/> <s th:doc="T" th:sId="s7"/>She may probably have already <l th:doc="D" th:eId="l6"/>
<l th:doc="D" th:sId="l8"/>told you that being left by her Parents in indegent <l th:doc="D" th:eId="l8"/>
<l th:doc="D" th:sId="l9"/>Circumstances she had retired into Wales on economi-<l th:doc="D" th:eId="l9"/>
<page th:doc="D" th:eId="page0"/>
<page n="7" th:doc="D" th:sId="page10"/>
<l th:doc="D" th:sId="l11"/>cal motives. <s th:doc="T" th:eId="s7"/><s th:doc="T" th:sId="s12"/>There it was our freindship first <l th:doc="D" th:eId="l11"/>
<l th:doc="D" th:sId="l13"/>commenced – Isabel was then one and twenty – <l th:doc="D" th:eId="l13"/><s th:doc="T" th:eId="s12"/>
<s th:doc="T" th:sId="s14"/> <l th:doc="D" th:sId="l15"/>Tho' pleasing in both her Person and Manners <l th:doc="D" th:eId="l15"/>
<l th:doc="D" th:sId="l16"/>(between ourselves) she never possessed the hun-<l th:doc="D" th:eId="l16"/>
<l th:doc="D" th:sId="l17"/>dreth part of my Beauty or Accomplishments.<l th:doc="D" th:eId="l17"/><s th:doc="T" th:eId="s14"/>
<!-- more text and markup -->
<page th:doc="D" th:eId="page10"/>
</text>
</xml>
```
The user could also decide to make one of the hierarchical structures leading in the XML output. In that case, they can add a parameter to the Alexandria command with which they indicate which grouped markup nodes should form the leading hierarchy, e.g., alexandria export-xml Austen_VtS -l D. In this example command, the markup nodes in group “D” are transformed into XML content elements; the markup nodes not belonging to the leading hierarchy will be transformed into Trojan Horse elements. A fragment of the XML output with the markup group “D” as leading structure would look as follows:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xml xmlns:tag="http://tag.di.huc.knaw.nl/ns/tag" xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse" th:doc="P T">
  <text title="Love and Freindship" type="novel" author="Jane Austen" date="1790">
    <page n="6">
      <head><l>Letter 4th</l></head>
      <l>Laura to Marianne</l>
      <s th:doc="T" th:sId="s0"/><l>Our neighbourhood was small, for it consisted </l>
      <l>only of your Mother. <s th:doc="T" th:eId="s0"/> <s th:doc="T" th:sId="s1"/>She may probably have already </l>
      <l>told you that being left by her Parents in indegent </l>
      <l>Circumstances she had retired into Wales on economi-</l>
    </page>
    <page n="7">
      <l>cal motives. <s th:doc="T" th:eId="s1"/> <s th:doc="T" th:sId="s2"/>There it was our freindship first </l>
      <l>commenced – Isabel was then one and twenty – </l>
      <s th:doc="T" th:eId="s2"/>
      <s th:doc="T" th:sId="s3"/><l>Tho' pleasing in both her Person and Manners </l>
      <l>(between ourselves) she never possessed the hun-</l>
      <l>dreth part of my Beauty or Accomplishments.</l>
      <s th:doc="T" th:eId="s3"/>
      <!-- more text and markup -->
      <s th:doc="T" th:sId="s9"/>
      <q tag:n="1" th:doc="P" th:sId="q10"/>
      <l>"Beware my Laura<q th:doc="P" th:eId="q10"/> (she would often say) </l>
      <q tag:n="1" th:doc="P" th:sId="q11"/>
      <l>Beware of the insipid Vanities and idle Dissipations </l>
      <l>of the Metropolis of England; Beware of the </l>
      <l>unmeaning Luxuries of Bath &amp; of the Stink-</l>
      <l>ing fish of Southampton."</l>
      <q th:doc="P" th:eId="q11"/>
      <s th:doc="T" th:eId="s9"/>
    </page>
  </text>
</xml>
```
Note how the discontinuous text, marked with the suspend and resume signs in TAGML, is converted to XML. With the attribute @tag:n="1" the discontinued parts of the quotation are linked together. This approach is also suggested by Wendell Piez in Piez 2008. Finally, the header of both fragments contains information about which groups of markup nodes are represented in Trojan Horse markup (th:doc="D P T" versus th:doc="P T").
In order to illustrate the variety of options created by the XML export function of Alexandria, the final paragraphs of this section explore two hypothetical yet highly likely user scenarios. Let’s say that, after modeling the text in TAGML and uploading the TAGML document to Alexandria, a user wants to (1) publish it online, so they need an HTML version of the encoded text; and (2) send it for approval to another scholarly editor who prefers to work in Word (e.g., using LibreOffice or Microsoft Office). We can update the workflow diagram by adding these two steps:
As the diagram illustrates, the transformation steps in the editorial workflow do not take place in the context of Alexandria, but in the user’s own workspace. This is a conscious choice: we assume that most users prefer their own, possibly customized, tools, transformation scenarios, and work environments. Accordingly, we provide TAGML users with the opportunity to create workflows and pipelines in which TAGML is seamlessly integrated with other tools. For instance, the user can use XSLT in order to create an HTML document from the XML document.
For the transformation to Word, we use the open source software OxGarage, a RESTful web service that was created by members of the TEI community and allows users to manage transformations between various document formats.[9] Even without any additional information, the XML document sampled above (with the page structure as leading hierarchy) is easily transformed into a clearly readable Word document:
Note that the text of the addition, marked with add tags in both the source TAGML and the XML output, is automatically surrounded by diacritical marks in the Word output, and that the text marked as deleted in the TAGML and XML sources is represented as crossed out text between square brackets.
Future work
So far, our contributions to Balisage about TAG have discussed work in progress, and the present contribution is no different. While the basic export functionalities perform well, there are a few steps that need to be taken before TAGML documents can be fully converted to an XML format. At the moment, non-linear structures are not optimally converted. We conceptualize this structure as a linear stream of text “splitting” into two or more branches. In the case of an in-text revision like a substitution, the split leads to two branches of the text, each with a different reading. As the attentive reader may have seen in figure 15 (Figure 16), the TAGML approach to a non-linear structure is to encode the splitting into branches as follows:
some linear text <| branch 1 | branch 2 |> more linear text
[10] For example, the non-linear structure caused by the substitution in Austen’s text is encoded as follows:
[s> […] had spent a fortnight in Bath & had <|[del>slept<del]|[add>supped<add]|> one night in Southampton.<s]
Here, the notation <| and |> indicates the start and end of the branches, with | separating them. One branch reads “slept” and one reads “supped”. Currently, this branching structure is quite literally converted as such to XML:
<s> had spent a fortnight in Bath &amp; had <tag:branches th:doc="_default" th:sId=":branches6"/> <tag:branch th:doc="_default" th:sId=":branch7"/> <del>slept</del> <tag:branch th:doc="_default" th:eId=":branch7"/> <tag:branch th:doc="_default" th:sId=":branch8"/> <add>supped</add> <tag:branch th:doc="_default" th:eId=":branch8"/> <tag:branches th:doc="_default" th:eId=":branches6"/> one night in Southampton. </s>
The branches are represented as Trojan Horse elements and linked to one another with the Trojan Horse attributes. While this is technically correct, it is difficult for humans to read and equally hard to transform to valid TEI-XML. The difficulty is that non-linear TAGML cannot simply be mapped to TEI automatically: where TAGML uses general syntactical symbols, TEI offers multiple options with markup elements like subst, mod, or choice. Potentially this conversion will require user input as well.
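As a thought experiment, the Python sketch below shows what one such user-selected conversion could look like: it rewrites a tag:branches span into a TEI subst element that wraps the del and add readings. This is not an existing Alexandria feature. The namespace URIs, the function name branches_to_subst, and the decision to use subst (rather than mod or choice) are all assumptions made for this illustration, and the sample is adapted from the export shown above, with namespace declarations added so that it parses.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; the real ones used by Alexandria may differ.
TAG_NS = "http://example.org/tag"
TH_NS = "http://example.org/trojan-horse"

BRANCHES = f"{{{TAG_NS}}}branches"
BRANCH = f"{{{TAG_NS}}}branch"
S_ID = f"{{{TH_NS}}}sId"
E_ID = f"{{{TH_NS}}}eId"

SAMPLE = (
    f'<s xmlns:tag="{TAG_NS}" xmlns:th="{TH_NS}">'
    'had spent a fortnight in Bath &amp; had '
    '<tag:branches th:doc="_default" th:sId=":branches6"/> '
    '<tag:branch th:doc="_default" th:sId=":branch7"/> '
    '<del>slept</del> '
    '<tag:branch th:doc="_default" th:eId=":branch7"/> '
    '<tag:branch th:doc="_default" th:sId=":branch8"/> '
    '<add>supped</add> '
    '<tag:branch th:doc="_default" th:eId=":branch8"/> '
    '<tag:branches th:doc="_default" th:eId=":branches6"/>'
    ' one night in Southampton.</s>'
)


def branches_to_subst(parent):
    """Replace each tag:branches milestone span among parent's children with <subst>.

    Assumes the branches hold only element content (e.g. <del> and <add>);
    text between the milestones, such as whitespace, is dropped.
    """
    rebuilt, subst = [], None
    for child in list(parent):
        parent.remove(child)
        if child.tag == BRANCHES and child.get(S_ID) is not None:
            subst = ET.Element("subst")      # start milestone opens the wrapper
        elif child.tag == BRANCHES and child.get(E_ID) is not None and subst is not None:
            subst.tail = child.tail          # keep the text that follows the branching
            rebuilt.append(subst)
            subst = None
        elif subst is not None:
            if child.tag != BRANCH:          # keep the readings, drop branch milestones
                child.tail = None
                subst.append(child)
        else:
            rebuilt.append(child)            # content outside any branching is untouched
    parent.extend(rebuilt)
    return parent


root = branches_to_subst(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

The sketch hard-codes a single interpretation of the branching. In a workflow of the kind proposed here, that interpretation would be a parameter supplied by the editor, since the same TAGML notation could equally be rendered as mod or choice.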
Another item on our “(Soon) To Do” list is to move away from grouping markup elements with identifiers. We are at present examining the possibilities to implement a TAGML schema and to include in the schema information about the hierarchical structure(s) in a TAGML document. This would mean, for instance, that the user identifies which markup nodes are part of a certain hierarchy, which nodes can be shared between hierarchies, etc. In line with an XSL file, the schema may contain other information about the export. The user can subsequently point to this TAGML schema document when they give the export command to the Alexandria server.
And finally, on a more general level, future work entails the possibility for multiple users to collaborate in Alexandria. The repository is now initialized locally, and a user of Alexandria can already check in, check out, and edit a TAGML document on their local machine, but we aim to enable collaboration between multiple users. This requires, among other things, further development of the diff functionality, as we want to track the edits made to a TAGML document on the level of the text as well as the markup.
Reflection
It’s a well known fact that certain textual and documentary structures are less suited to be encoded as XML, such as concurrent or overlapping hierarchies, discontinuous text, and non-linear structures. In the past thirty years or so, numerous ways to deal with these structures have been proposed, ranging from XML-based approaches (to name but a few: empty or virtual elements, linking, Trojan Horse markup, stand-off approaches, encoding the same text multiple times, XCONCUR) to non-XML proposals (such as LMNL, TexMECS, or EARMARK).[11] There are various reasons why the proposed alternatives have not been adopted as text encoding standards. Some of them focused primarily on addressing just one limitation of the XML data model — e.g., overlap — and did not address the wider range of non-hierarchical text phenomena. Others never transcended the experimental phase, or needed to be abandoned simply because funding ran out. Nevertheless, the academic publications related to these proposals still make for an interesting read as they provide insight into the intellectual and technological history of text modeling. Most of them are not “just” technical: they are based on a philosophy of text. After all, when examining the question of how to express text informationally, the authors had to formulate their definition of text. If it’s not an Ordered Hierarchy of Content Objects, then what is it?
We conceptualize text as a partially ordered sequence of characters, inscribed on a material carrier. The information derived from the text can best be organized in a network structure, for which we propose a hypergraph. In this paper, we aimed to demonstrate the value of looking beyond the prevalent technologies and examining the alternatives vis-à-vis their philological notions of text, their conceptual model, the logical implementation(s), etc. We intended for the reader to realize that each approach to text modeling has its advantages and disadvantages. The best choice is not necessarily what is most commonly used, but what best addresses one’s research requirements and objective(s). There are many pragmatic reasons to encode literary historical texts in XML, but the XML framework should not become synonymous with the framework in which we conceptualize text.
We illustrated this notion by demonstrating how TAG can be combined with XML in an editorial workflow to model, store, edit, process, analyze, and publish text. The workflow combines the advantages of two data models: while TAG performs well with modeling and storing literary historical text, XML forms the input of many tools for further text analysis, transformation, and visualization. The XML output of the TAG reference implementation Alexandria uses Trojan Horse elements to avoid overlap conflicts. We opted for Trojan Horse because it is well known in the text encoding community and it is relatively easy, from a computational perspective, to have the computer generate milestones and unique IDs. Discontinuous elements are transformed into XML by adding a matching attribute (e.g., tag:n="q1") on the tags of the discontinued elements.
If we consider scholarly editing as an ongoing process of translating information from one carrier to another, it’s clear that scholars need to keep track of and understand these data transformations in order to make informed choices. By describing in detail the choices we made in designing TAG, by illustrating how these choices are reflected in the data model, and by proposing a workflow in which users have the opportunity to influence the TAGML-to-XML conversion process, we hope to have provoked amongst today’s textual scholars a continuous curiosity for data models for text modeling.
References
[Barnard et al. 1995] Barnard, David T., Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, C. Michael Sperberg-McQueen, and Giovanni Battista Varile. Hierarchical encoding of text: Technical problems and SGML solutions. Computers and the Humanities, vol. 29, no. 3, 1995, pp. 211-231. doi:https://doi.org/10.1007/BF01830617.
[Bleeker et al. 2020] Bleeker, Elli, Bram Buitendijk, and Ronald Haentjens Dekker. Marking Up Microrevisions With Major Implications: Non-linear Text in TAG. Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27-31, 2020. Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25, 2020. doi:https://doi.org/10.4242/BalisageVol25.Bleeker01.
[Cummings 2008] Cummings, James. The Text Encoding Initiative and the Study of Literature. A Companion to Digital Literary Studies, edited by Susan Schreibman and Ray Siemens. Oxford: Blackwell, 2008. http://www.digitalhumanities.org/companionDLS/.
[Haentjens Dekker and Birnbaum 2017] Haentjens Dekker, Ronald and David J. Birnbaum. It’s More Than Just Overlap: Text As Graph. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1-4, 2017. Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19, 2017. doi:https://doi.org/10.4242/BalisageVol19.Dekker01.
[Haentjens Dekker et al. 2018] Haentjens Dekker, Ronald, Elli Bleeker, Bram Buitendijk, Astrid Kulsdom, and David J. Birnbaum. TAGML: A markup language of many dimensions. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21, 2018. doi:https://doi.org/10.4242/BalisageVol21.HaentjensDekker01.
[Haentjens Dekker et al. 2020] Haentjens Dekker, Ronald, Bram Buitendijk, and Elli Bleeker. Parsing a Markup Language That Supports Overlap and Discontinuity. Proceedings of the ACM Symposium on Document Engineering 2020, pp. 1-4. doi:https://doi.org/10.1145/3395027.3419590.
[Huitfeldt 1994] Huitfeldt, Claus. Multi-dimensional Texts in a One-Dimensional Medium. Computers and the Humanities, vol. 28, no. 4, 1994, pp. 235-241. doi:https://doi.org/10.1007/BF01830270.
[Hilbert et al. 2005] Hilbert, Mirco, Oliver Schonefeld, and Andreas Witt. Making CONCUR work. Presented at Extreme Markup Languages 2005, Montréal, Québec, August 1-5, 2005. http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml.
[Jagadish et al. 2004] Jagadish, H.V., Laks V.S. Lakshmanan, M. Scannapieco, D. Srivastava, and N. Wiwatwattana. Colorful XML: One Hierarchy Isn’t Enough. Presented at SIGMOD 2004, Paris, France, June 13-18, 2004. doi:https://doi.org/10.1145/1007568.1007598.
[Kimber 2011] Kimber, Eliot. DITA Document Types: Enabling Blind Interchange Through Modular Vocabularies and Controlled Extension. Presented at Balisage: The Markup Conference 2011, Montréal, Canada, August 2-5, 2011. Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7, 2011. doi:https://doi.org/10.4242/BalisageVol7.Kimber01.
[Marcoux et al. 2011] Marcoux, Yves, Michael Sperberg-McQueen, and Claus Huitfeldt. Expressive Power of Markup Languages and Graph Structures. Presented at the Digital Humanities Conference 2011, Stanford, CA, June 19-22, 2011, pp. 178-180. https://core.ac.uk/reader/48606830.
[Niu et al. 2019] Niu, Ya-Wei, Cun-Quan Qu, Guang-Hui Wang, and Gui-Ying Yan. RWHMDA: Random Walk on Hypergraph for Microbe-Disease Association Prediction. Frontiers in Microbiology, vol. 10, 2019. doi:https://doi.org/10.3389/fmicb.2019.01578.
[Peroni et al. 2014] Peroni, Silvio, Francesco Poggi, and Fabio Vitali. Overlapproaches in Documents: a Definitive Classification (in OWL, 2!). Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5-8, 2014. Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13, 2014. doi:https://doi.org/10.4242/BalisageVol13.Peroni01.
[Pierazzo 2015] Pierazzo, Elena. TEI: XML and Beyond. Presented at the Text Encoding Initiative Conference and Members Meeting 2015, Lyon (France), October 28-31, 2015. Abstract of talk available online at http://tei2015.huma-num.fr/en/papers/#140.
[Piez 2008] Piez, Wendell. LMNL in miniature. An introduction. Presented at the Amsterdam Goddag Workshop, December 1-5, 2008. http://piez.org/wendell/LMNL/Amsterdam2008/presentation-slides.html.
[Portier et al. 2012] Portier, Pierre-Édouard, Noureddine Chatti, Sylvie Calabretto, Elöd Egyed-Zsigmond, and Jean-Marie Pinon. Modeling, Encoding And Querying Multi-structured Documents. Information Processing & Management, vol. 48, no. 5, 2012, pp. 931-955. doi:https://doi.org/10.1016/j.ipm.2011.11.004.
[De Rose 2004] DeRose, Steven J. Markup Overlap: a Review and a Horse. Presented at Extreme Markup Languages 2004, Montréal, Québec, August 2-6, 2004. http://xml.coverpages.org/DeRoseEML2004.pdf.
[Sahle 2013] Sahle, Patrick. Digitale Editionsformen. Zum Umgang mit der Überlieferung unter den Bedingungen des Medienwandels. Teil 3: Textbegriffe und Recodierung. Norderstedt: BoD, 2013. https://kups.ub.uni-koeln.de/5353/
[Schmidt 2010] Schmidt, Desmond. The Inadequacy of Embedded Markup for Cultural Heritage Texts. Literary and Linguistic Computing, vol. 25, no. 3, 2010, pp. 337-356. doi:https://doi.org/10.1093/llc/fqq007.
[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C.M. and Claus Huitfeldt. Markup Discontinued: Discontinuity in TexMecs, Goddag Structures, and Rabbit/Duck Grammars. Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12-15, 2008. Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1, 2008. doi:https://doi.org/10.4242/BalisageVol1.Sperberg-McQueen01.
[Sperberg-McQueen 2018] Sperberg-McQueen, C.M. Representing Concurrent Document Structures Using Trojan Horse Markup. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21, 2018. doi:https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01.
[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C.M. and Claus Huitfeldt. GODDAG: A Data Structure for Overlapping Hierarchies. International Workshop on Principles, 2000. doi:https://doi.org/10.1007/978-3-540-39916-2_12.
[Austen 2010] Sutherland, Kathryn, editor. Jane Austen’s Fiction Manuscripts: A Digital Edition. Available at http://www.janeausten.ac.uk/, consulted July 16th, 2021.
[Vitali 2016] Vitali, Fabio. The Expressive Power of Digital Formats: Criticizing the Manicure of the Wise Man Pointing at the Moon. Lecture at the DiXiT Convention 2, Cologne, Germany, March 15, 2016. Slides available at http://dixit.uni-koeln.de/wp-content/uploads/Vitali_Digital-formats.pdf.
[Watanna 1916] Watanna, Onoto. Marion, the Story of an Artist’s Model. New York: W.J. Watt, 1916. URN:oclc:record:1048793515.
[1] The authors are very grateful for the reviewers’ comments which have been insightful, useful, and highly appreciated. Many thanks.
[2] See notably Huitfeldt 1994 (p. 143; 147-151), Sahle 2013 (p. 381-382), and Pierazzo 2015 (p. 73-74) in relation to the conceptual model on which the encoding guidelines of the TEI are based.
[3] To give but one example, the edges in the graph were first directed (Haentjens Dekker and Birnbaum 2017), then undirected with the order of the Text nodes in the graph derived from their distance from the root node (Haentjens Dekker et al. 2018). The current version of the hypergraph data model has both directed edges (for the text-to-text nodes) and undirected edges (for the markup and annotation nodes).
[4] In earlier publications, we referred to groups of Markup nodes as layers, but we found this term to be confusing as “layers” are often used differently in different (humanities) contexts.
[5] All text, documentary transcriptions, and facsimiles are retrieved from the digital diplomatic edition Jane Austen’s Fiction Manuscripts: A Digital Edition, edited by Kathryn Sutherland and her team. The digital edition can be found at https://janeausten.ac.uk/index.html.
[6] The TAGML transcription was made in the Sublime Text editor, which offers TAGML syntax highlighting. See the GitHub page of the project for the most up-to-date information about the TAGML syntax highlighting.
[7] See the TEI Guidelines, chapter 16.7. See also the (suggestions for the) question posted by Joey Takeda on the TEI mailing list, “Another q question”, March 23, 2019 (permalink: https://listserv.brown.edu/cgi-bin/wa?A2=TEI-L;48339dc2.1903).
[8] Note that the quotation q is given the identifier “P”, because it overlaps locally with the s element.
[9] See the website and the Github page of OxGarage for more information.
[10] For a detailed discussion of the representation of non-linear structures in TAGML, see Haentjens Dekker et al. 2018 and Bleeker et al. 2020.
[11] For an overview and description of these and other proposals, we recommend among others De Rose 2004, Schmidt 2010, Portier et al. 2012, Peroni et al. 2014, and Vitali 2016.