Hyper, Multi, or Single? Thinking about Text in Graphs and Trees

Elli Bleeker; Ronald Haentjens Dekker; Bram Buitendijk

Abstract

This paper explores the potential of combining the Text-As-Graph (TAG) and the XML data models. It proposes a digital editing workflow in which users can model, edit, and store text in TAG, and subsequently export the data to XML for further analysis or publication with XML-based tools. The conversion from TAGML to XML presents several interesting challenges on a technical level as well as a philological level. Overall, we argue that there may be many pragmatic reasons to encode cultural heritage texts in XML, but we have to be mindful of the XML framework becoming synonymous with the framework in which we conceptualize text. The paper therefore dives deep into the translation from conceptual model to logical model(s) and argues in favor of understanding the affordances and limitations of the text modeling technologies we use.

Premise^[1]

There are many pragmatic reasons to encode cultural heritage texts in TEI-compliant XML, but we have to be mindful of the XML framework becoming synonymous with the framework in which we conceptualize text. As various textual scholars have pointed out, the limitations of a technology can delimit the selection, modelling, and analysis of textual aspects.^[2]

They remind us of the frequently used analogy of the hammer and the toolbox: if all you have is a hammer, everything will look like a nail. Indeed, the limitations of the XML data model have influenced and shaped our text encoding praxis (cf. Pierazzo 2015) and will continue to do so. If all you know is XML, every text will start to look like a tree. And while there are enough cases where a tree data model suffices, other cases benefit from an alternative.

This paper explores the potential of combining the XML data model and the Text-As-Graph (TAG) data model. The primary aim of the paper is to examine a practical and workable method for modeling and editing documents. TAG is still under development and not (yet) as mature as XML. Still, we find the affordances of the TAG model to be highly suitable for modeling and storing literary historical documents. XML remains a prominent technology for the analysis and publishing of texts. The question is, then, can we combine the strengths of TAG and XML into one powerful tool for everything we may want to do with text: modeling, storing, (collaborative) editing, processing, analyzing, and publishing? This paper explores that possibility by proposing a digital editing workflow in which scholarly editors can model, edit, and store text as a TAG hypergraph, and subsequently export the textual data to an XML format for further analysis or publication with XML-based tools.

Developing a more inclusive, flexible data model for text has been one of the guiding principles behind the design of TAG, a graph-based model under development at the Royal Netherlands Academy of Arts and Sciences. In previous Balisage contributions we discussed the TAG data model (Haentjens Dekker and Birnbaum 2017), its markup language TAGML and reference implementation Alexandria (Haentjens Dekker et al. 2018), and explored the ways in which TAGML can be used to model certain textual features that are notoriously difficult to model in XML (Bleeker et al. 2020). Over the course of the four+ years we have been working on the TAG markup stack (i.e., its data model, syntax, query language, and schema) we have gained some valuable insights. In some cases, this resulted in some small modifications of the data model and syntax.^[3]

In each communication we pointed out that even though TAG was under active development, we considered our work, findings, and reflections already relevant for a general discussion on text modeling. We illustrated how other data models and markup systems express complex features like overlap and non-linear text, and we argued that text could best be expressed as a network structure (i.e., a graph). In doing so, we aimed to encourage a reflexive awareness of the relationship between an intellectual model and data models of text. One of the motivations for our continued work on TAG is to emphasize the value of, first, having a wider assortment of tools in one’s metaphorical text modeling toolbox and, secondly, knowing how to select the right tool for the job at hand. We believe that scholars can make an informed choice only when they know both the strong and the weak points of a data model.

Data models for texts and documents

Our work is informed by a definition of text as a sequence of characters (e.g., letters, digits, spaces, and punctuation, including symbols and music notation) that is inscribed on a material carrier: a document. From the text on a document, a reader derives information. In earlier contributions, we argued that this information can best be organized in a network structure. We stated that text is partially ordered, which means that it is not always possible to determine the order of all characters in the sequence. Examples of partially ordered text are non-linear, discontinuous, or overlapping structures.

Generally speaking, certain data models are more suitable for expressing complex features than others: as said, the inherent properties of a data model provide its scope and determine its limits. While TAG is conceptually based on a hypergraph, the model can be implemented in different ways. The best logical implementation of TAG depends on the purpose: we found that a variation on the Multi-Colored Trees (MCT or Colorful XML, developed by Jagadish et al. 2004) and the GODDAG model (Sperberg-McQueen and Huitfeldt 2000, Sperberg-McQueen 2018), which we described as a colored GODDAG (Haentjens Dekker et al. 2020), works best for overlapping structures as well as for export and visualization purposes. With export being the focus of the paper, we will describe both the hypergraph and the Colored GODDAG models in the following sections.

Data models

Text as a hypergraph

Just like any other graph, a hypergraph consists of nodes and edges. The important difference is that some edges in a hypergraph can join any number of nodes (in contrast to the one-to-one edges of regular graphs). These are called hyperedges. Hyperedges are typically undirected and they can be used to express group relations. The hypergraph is not a common data structure in the (digital) humanities, but it is fairly well-known in the STEM research fields. By way of illustration, figure 1 shows a hypergraph used in microbiology:

The TAG hypergraph is slightly different than the model in figure 1. First, the nodes in the TAG hypergraph are typed. We distinguish five different node types: each hypergraph consists of exactly one document node (the root), zero or more text nodes, zero or more markup nodes, zero or more annotation nodes, and zero or more branching nodes. Furthermore, the TAG model consists of undirected hyperedges as well as directed one-to-one edges. The document node, the text nodes, and the branching nodes indicate the stream of the text and are therefore connected by directed, regular edges. The markup and annotation nodes in the TAG hypergraph can be connected with either a hyperedge or a regular one-to-one edge. For example, a hyperedge can associate multiple text nodes with one and the same markup node, and an annotation node can be associated with one markup node by means of a regular edge. Figure 2 below exemplifies the different types of nodes and edges. The text nodes are white, the regular edges are visualized as arrows, and the hyperedges are labelled and visualized in different colors:

Text in a hypergraph is read from left to right, starting with the document node and following the directed edges. There are two branches in the hypergraph; the beginning and the end of the branches are indicated with a branching node. The markup node labelled “subst” in figure 2 is associated with two text nodes via a labelled hyperedge (yellow); the markup node labelled “add” is associated with one text node via a labelled hyperedge (dark green) and has an associated annotation node (light green) with information about the place of the addition in the source text.
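By way of illustration, the node and edge types described above could be sketched in Kotlin as follows. This is a minimal sketch and not the Alexandria implementation: all class and function names (TagNode, HyperEdge, TagHypergraph, buildExample, textCoveredBy) are assumptions introduced here, and the text content and the annotation value are invented for the example.

```kotlin
// Illustrative sketch only: these names are assumptions, not Alexandria's API.
sealed class TagNode
object DocumentNode : TagNode()
data class TextNode(val content: String) : TagNode()
data class MarkupNode(val label: String) : TagNode()
data class AnnotationNode(val key: String, val value: String) : TagNode()
data class BranchingNode(val kind: String) : TagNode() // "split" or "join"

// Directed one-to-one edge (text stream, or markup-to-annotation link).
data class Edge(val from: TagNode, val to: TagNode)

// Undirected hyperedge: one markup node associated with any number of text nodes.
data class HyperEdge(val markup: MarkupNode, val textNodes: Set<TextNode>)

class TagHypergraph {
    val edges = mutableListOf<Edge>()
    val hyperEdges = mutableListOf<HyperEdge>()

    // Text content of every text node that a markup label is associated with.
    fun textCoveredBy(label: String): List<String> =
        hyperEdges.filter { it.markup.label == label }
            .flatMap { hyperEdge -> hyperEdge.textNodes.map { it.content } }
}

fun buildExample(): TagHypergraph {
    val before = TextNode("some ")
    val deleted = TextNode("old") // invented text
    val added = TextNode("new")   // invented text
    val after = TextNode(" text")
    val split = BranchingNode("split")
    val join = BranchingNode("join")
    val subst = MarkupNode("subst")
    val add = MarkupNode("add")

    val graph = TagHypergraph()
    // The text stream: document -> text -> split -> two branches -> join -> text.
    // The last edge attaches an (invented) annotation to the "add" markup node.
    graph.edges += listOf(
        Edge(DocumentNode, before), Edge(before, split),
        Edge(split, deleted), Edge(split, added),
        Edge(deleted, join), Edge(added, join), Edge(join, after),
        Edge(add, AnnotationNode("place", "above"))
    )
    graph.hyperEdges += HyperEdge(subst, setOf(deleted, added))
    graph.hyperEdges += HyperEdge(add, setOf(added))
    return graph
}
```

In this sketch, asking for the text covered by “subst” yields both branches of the substitution, whereas “add” covers only the second branch.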

Note that the markup in this example is properly nested: it represents one hierarchical structure and the markup elements do not overlap. Expressing multiple overlapping hierarchical structures in TAG is done by grouping together related markup nodes in a group.^[4] The markup nodes within each group form a single hierarchy, but groups can share markup nodes and a TAG document can contain any number of groups. This way, overlapping structures can be easily expressed in TAG. The following section exemplifies the logical model (section “Text as Multi-Colored Trees”).

Text as Multi-Colored Trees

Alexandria implements an MCT, an ordered directed acyclic graph with colors on the markup nodes. The MCT is inspired by the multitrees of GODDAG, the colored nodes of Colorful XML, and the combination of several XML trees of XConcur (Jagadish et al. 2004, Hilbert et al. 2005). The MCT model extends the XML tree in two ways: a node in an MCT has an additional property (its color), and an MCT database can consist of one or more colored trees (instead of XML’s single-rooted tree). Each tree has a different color. A node can be shared by more than one tree, in which case it has multiple colors. The trees within an MCT document can be navigated and manipulated with extended XQuery or XPath expressions in which the user first selects a leading color (Jagadish et al. 2004, see also Portier et al. 2012). The MCT is implemented in the Alexandria repository of TAG; see section “Workflow”. In the Alexandria MCT, the text nodes as well as the root document node are shared between all the colors.

Overlapping hierarchies in TAG are expressed as an MCT by assigning a colored tree to each group of markup nodes. Figure 3 shows a visualization of an MCT implementation with two groups of markup nodes, i.e., two colored trees.

This simple example shows how two groups of markup nodes, one with the identifier “D” and the other with the identifier “T”, can be expressed as a red and a blue tree, respectively. The red tree (i.e., the group of markup nodes identified with “D”) shares the markup nodes labeled “s” and “add” with the blue tree (i.e., the group labeled “T”). The markup node labeled “del” is only part of the red tree. Each node is stored only once in the MCT model; edges between nodes are specified in each colored tree.
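The idea that each node is stored once while edges are kept per color can be illustrated with a small Kotlin sketch. It is a simplification under stated assumptions: the class ColoredForest and its methods are hypothetical, nodes are identified by their label only, and the parent-child relations below are assumed for the sake of the example rather than read off figure 3.

```kotlin
// Hypothetical sketch: nodes are identified by label and stored once;
// edges are recorded separately for each color (i.e., each group of markup).
class ColoredForest {
    private val colorsOf = LinkedHashMap<String, MutableSet<String>>()
    private val children = HashMap<String, LinkedHashMap<String, MutableList<String>>>()

    fun addEdge(color: String, parent: String, child: String) {
        colorsOf.getOrPut(parent) { linkedSetOf() }.add(color)
        colorsOf.getOrPut(child) { linkedSetOf() }.add(color)
        children.getOrPut(color) { LinkedHashMap() }
            .getOrPut(parent) { mutableListOf() }
            .add(child)
    }

    // All colors (trees) that a node takes part in.
    fun colors(node: String): Set<String> = colorsOf[node].orEmpty()

    // Children of a node within one colored tree, in insertion order.
    fun childrenOf(color: String, parent: String): List<String> =
        children[color]?.get(parent).orEmpty()
}

fun exampleForest(): ColoredForest {
    val forest = ColoredForest()
    // Red tree, group "D" (assumed shape): s contains del and add.
    forest.addEdge("D", "s", "del")
    forest.addEdge("D", "s", "add")
    // Blue tree, group "T" (assumed shape): s contains add only.
    forest.addEdge("T", "s", "add")
    return forest
}
```

With these assumed edges, the nodes “s” and “add” carry both colors, while “del” carries only “D”; the children of “s” differ depending on the color that is selected.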

Text features

In earlier publications, we distinguished at least three complex features with which digital scholarly editors have to deal on a regular basis: overlapping or concurrent hierarchies, discontinuous text, and non-linear structures (Haentjens Dekker and Birnbaum 2017, Haentjens Dekker et al. 2018). Expressing concurrent or overlapping structures is, if not the biggest, surely the most famous and most debated obstacle for text modeling in XML and therefore needs little explaining. Discontinuous text is usually illustrated with a quotation that is interrupted by an aside or by the narrator’s voice (cf. Sperberg-McQueen and Huitfeldt 2008). Non-linear structures, finally, can be found on the level of the source text as well as the markup, e.g., in-text revisions on a manuscript or an abbreviation and its expansion supplied by the editor in markup. The common denominator is a temporary break in the linearity of the text that can be conceptualized as a split of the text stream in two or more substreams or branches.

In this section we show how (or to what extent) these features are modeled in TAG. We take our examples from three pages of Love and Freindship, a short epistolary novel by Jane Austen, written between 1790 and 1793. The novel is written in a notebook entitled “Volume the Second” and is part of her Juvenilia manuscripts.^[5]

For each text feature, we first describe our philological interpretation of the feature in question. We then show the syntactical serialization of the feature in TAGML and in XML, combined with visualizations of the underlying data models. This will facilitate the comparison between TAGML and XML and, we hope, increase the readers’ awareness of the relationship between the conceptual model and its logical implementation(s).

Discontinuous text

Discontinuous text can be found in the fragment of the text displayed in figure 7, which reads: Beware my Laura (she would often say) Beware of the insipid Vanities and idle Dissipations […]. The aside (she would often say) is a comment by the letter writer Laura upon the quoted text and can therefore be identified as a break in the quotation.

Discontinuous text in TAGML

The TAGML encoding of discontinuous text is similar to the TexMECS proposal: the q element is “paused” and subsequently “resumed” with the affixes - and + (see Sperberg-McQueen and Huitfeldt 2008).^[6]

On the level of the conceptual hypergraph model, the discontinuous text is part of one and the same q element, as is illustrated by figure 9.

The relationship between the two parts of the q element is also stored in the MCT implementation of Alexandria:

Discontinuous text in XML

The TEI Guidelines offer several options to encode discontinuous text, such as the use of next and prev attributes on the discontinued elements.^[7] A simplified TEI XML transcription of the example sentence would then be:

                              
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- metadata information here -->
    </teiHeader>
    <text>
        <body>
            <p>
                <!-- more text here -->
                <s>
                    <q xml:id="q1" next="#q2">"Beware my Laura </q>
                    (she would often say) <q xml:id="q2" prev="#q1"> Beware of [...] Southampton."</q>
                </s>
                <!-- more text here -->
            </p>
        </body>
    </text>
</TEI>

Here the q elements are given an xml:id and the next and prev attributes are used to indicate that the first q element is continued in the next. Instead of, or in addition to, next and prev attributes, the two parts can be joined together using a link element: <link type="join" target="#q1 #q2"/>. As figure 11 shows, the elements are only linked on a syntactical level. On the level of the data model the q elements are two separate child elements of the s element.
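Because the link between the two q elements exists only in the attribute values, any software that wants to treat the quotation as one unit has to follow the next pointers itself. The Kotlin sketch below shows one way this could be done with the standard Java DOM API; the function name joinedQuotes is hypothetical, and the sketch assumes that every next value is a local reference of the form #id.

```kotlin
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper: rejoins <q> fragments chained with next/prev attributes.
fun joinedQuotes(xml: String): List<String> {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(xml.byteInputStream())
    val nodeList = doc.getElementsByTagName("q")
    val quotes = (0 until nodeList.length).map { nodeList.item(it) as Element }
    val byId = quotes.filter { it.hasAttribute("xml:id") }
        .associateBy { it.getAttribute("xml:id") }

    // A fragment without a prev attribute starts a (possibly one-part) quotation.
    return quotes.filter { !it.hasAttribute("prev") }.map { first ->
        val parts = mutableListOf<String>()
        val seen = HashSet<Element>() // protects against circular next pointers
        var current: Element? = first
        while (current != null && seen.add(current)) {
            parts.add(current.textContent.trim())
            val target = current.getAttribute("next").removePrefix("#")
            current = if (target.isEmpty()) null else byId[target]
        }
        parts.joinToString(" ")
    }
}
```

Applied to the s element of the transcription above, the function returns a single string in which the two halves of the quotation are joined and the aside is left out.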

Overlapping structures

Finding an example of overlapping structures in the manuscript notebook of Love and Freindship is fairly simple. Let’s say we want to express both the material features of the document and the linguistic structure of the text, i.e., the lines on the page and the sentences respectively. As the sentences run over several page lines and one page boundary, they overlap partly with the material structure of the document.

Overlap in TAG

As mentioned in section “Text as Multi-Colored Trees”, TAG handles overlapping structures by grouping the markup nodes of each structure into a separate group. Within each group, the markup nodes are hierarchically ordered. Figure 12 illustrates the TAGML-encoded text of Austen’s text containing the overlapping hierarchies: the material structure, whose nodes are assigned the identifier “D”, and the linguistic structure of the letter, whose nodes are assigned the identifier “T”. For example, the page and l markup elements have the identifier “D”, while the s markup elements are given the identifier “T”.^[8] In this particular TAGML document, the root node text is shared: it has the “D”, “T”, and “P” identifiers.

The MCT of the entire novel is too large to visualize here, but the simplified visualization below shows the colored trees constituted by the material and linguistic structures:

Overlap in XML

The ubiquity of overlapping structures in cultural heritage texts produced a wide variety of TEI P5 encoding systems. In our case it would be convenient to use empty elements <lb/> for the structure of the page lines, and the aforementioned next and prev attributes for the sentence that runs across the two pages. Below is a simplified TEI-compliant XML transcription in which the linguistic structure of the sentences is leading. Note again that the last sentence is made up of two separate s elements that are only linked on a syntactical level.

                              
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- metadata information here -->
    </teiHeader>
    <text>
        <body>
            <!-- some text here -->
            <p>
                <s>
                <lb/>Our neighbourhood consisted <lb/>only of your Mother.</s>
                <s xml:id="s1" next="#s2">She may probably have already <lb/>told you that being left by her Parents in indegent 
                <lb/>Circumstances she had retired into Wales on economi-</s>
            </p>
            <pb/>
            <p>
                <s xml:id="s2" prev="#s1">
                <lb/>cal motives.</s>
                <!-- some text and markup here -->
            </p>
        </body>
    </text>
</TEI>

From TAGML to XML and beyond

Workflow

After having encoded the text in TAGML, the document can be uploaded to the Alexandria repository. Alexandria is operated on the command line and is set up to work similarly to Git, the version control software often used by programmers and digital humanists to work collaboratively on code. Via command line commands, users can “check in” and “check out” TAGML documents to the Alexandria repository. In addition to a TAGML document, users can also upload one or more views in which they can define which (groups of) markup nodes can be filtered out. We created the view functionality because the TAG document in Alexandria can potentially contain a large amount of information, and we assume that not all users will always be interested in all information. A view is expressed in JSON, for example: {"includeGroups":["D"], "includeMarkup":["del"]}. In this example, the view says to include the markup nodes that are grouped with the identifier “D” and all markup nodes labelled “del” when the TAGML document is checked out of the Alexandria repository.
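The effect of such a view can be pictured as a simple filter over the markup nodes. The Kotlin sketch below is an illustration under assumptions, not Alexandria’s code: the classes View and MarkupRef and the function keeps are hypothetical, and the sketch assumes that a markup node is kept when it matches either of the two inclusion lists.

```kotlin
// Hypothetical model of a view: the property names mirror the JSON keys.
data class View(
    val includeGroups: Set<String> = emptySet(),
    val includeMarkup: Set<String> = emptySet()
)

// Hypothetical, minimal description of a markup node: its label and its groups.
data class MarkupRef(val label: String, val groups: Set<String>)

// Assumed semantics: keep a node if its label is listed explicitly
// or if it belongs to at least one of the included groups.
fun View.keeps(markup: MarkupRef): Boolean =
    markup.label in includeMarkup || markup.groups.any { it in includeGroups }

fun applyView(view: View, markup: List<MarkupRef>): List<MarkupRef> =
    markup.filter { view.keeps(it) }
```

Under this reading, a view with group “D” and label “del” would retain an l node in group “D” and a del node in group “T”, but drop an s node that belongs to group “T” only.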

An Alexandria check-in means that the TAGML document is parsed, verified for well-formedness, and stored as an MCT in the repository. A check-out in combination with a view generates a new TAGML document that contains the selected markup nodes, while the master TAGML document remains intact in the repository. The checked-out document can be edited and checked in again. Alternatively, users can indicate upon check-out whether they want the master document to be exported to another format: currently, we support SVG, XML, DOT, and PNG. Figure 14 depicts this workflow:

Let’s zoom in on the last step in this workflow diagram: the export to XML. The Alexandria command export-xml seems easy enough and to some extent it is: technically, converting an MCT to a single tree is fairly simple. Like XML, TAGML has markup tags with names, attributes, and values. And although TAGML supports different data types for attribute values (in addition to strings, TAGML attribute values can be integers, floats, lists, or Booleans), the TAGML attribute values can be converted to string type attribute values in XML. Still, a graph with multiple concurrent hierarchies contains more information than a mono-hierarchical tree, so a graph-to-tree conversion implies that we have to decide how to express that information. This depends in part on how the user plans to use the exported document. Representing overlapping hierarchies in a single tree, for example, will require additional tagging. Users may therefore want to scale down the amount of information in the XML document and select only the information that is relevant for their purpose. This means deciding whether the exported XML document should have a leading hierarchy and if so, which markup nodes should be part of it. After addressing the algorithmic side of the TAGML-to-XML conversion in section “The code”, we will discuss the editorial side.
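The conversion of typed attribute values to XML strings mentioned above can be sketched as follows. The type names (TagValue and its subclasses) and the function toXmlAttribute are hypothetical, and the choice to serialize a list as space-separated items is an assumption made for this example; the actual export may serialize lists differently.

```kotlin
// Hypothetical representation of the TAGML attribute value types.
sealed class TagValue
data class StringValue(val value: String) : TagValue()
data class IntValue(val value: Long) : TagValue()
data class FloatValue(val value: Double) : TagValue()
data class BooleanValue(val value: Boolean) : TagValue()
data class ListValue(val items: List<TagValue>) : TagValue()

// Every typed value is flattened to the single type XML attributes offer: a string.
fun toXmlAttribute(value: TagValue): String = when (value) {
    is StringValue -> value.value
    is IntValue -> value.value.toString()
    is FloatValue -> value.value.toString()
    is BooleanValue -> value.value.toString()
    // Assumption: list items are joined with spaces.
    is ListValue -> value.items.joinToString(" ") { toXmlAttribute(it) }
}
```

The sketch also makes the information loss visible: after export, the integer 6 and the string "6" are indistinguishable in the XML document.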

The code

The steps taken during the TAGML to XML conversion are as follows:

The user gives the export-xml command to the Alexandria server;
The TAGTraverser iterates over the MCT of the TAGML document. If the user has provided it, the traverser will also use information from the view and ignore certain (groups of) markup nodes;
The TAGTraverser generates a stream of Events;
For each Event, check to see whether it is an open tag, a close tag, or text characters;
If the Event is text, the characters are transformed into an XML text node;
If the Event is an open tag or a close tag, check to see if the user provided information about the leading hierarchy and if so, whether the tag is part of the leading hierarchy;
- If not, the open tag or close tag is transformed into a Trojan Horse start or end element, respectively;
- If the tag is part of the leading hierarchy, the open tag or close tag is transformed into an XML open tag or an XML close tag.

Figure 15 shows a flowchart of this process. The flowchart starts after the user has created a TAGML document and a view and uploaded them to the Alexandria repository.

The code base of Alexandria is written in Kotlin. The code fragment below shows the class definitions of Node, MCT, and Event. Edges are stored in two directions: incoming and outgoing. Target nodes are stored in LinkedHashSets to preserve the order of the nodes.

                        
sealed class Node
data class Markup(val label: String, val colors: List<String>, val id: Long = System.currentTimeMillis()) : Node()
data class Text(val content: String, val id: Long = System.currentTimeMillis()) : Node()

class MCT(val rootNode: Markup) {
	val outgoingEdges: MutableMap<Markup, LinkedHashSet<Node>> = HashMap()
	val incomingEdges: MutableMap<Node, LinkedHashSet<Markup>> = HashMap()
}

sealed class Event
data class MarkupOpen(val node: Markup) : Event()
data class MarkupClose(val node: Markup) : Event()
data class TextEvent(val node: Text) : Event()

As described above (section “Workflow”), a TAGML document that is checked into the Alexandria repository is parsed and stored as an MCT. When the user gives the export-xml command to Alexandria, the TAGTraverser iterates over the MCT and generates a stream of Events. The Kotlin code fragment below describes the traversal algorithm. It traverses the nodes in topological order and creates TextEvent, MarkupOpen, or MarkupClose events. A text node generates a TextEvent; a markup node generates a MarkupOpen event for that node. In order to link the MarkupClose events to the appropriate MarkupOpen events, we keep track of the markup that is currently open by using markup stacks. There is a global stack as well as one stack for each color in the MCT. Each markup node is added to the global stack and to the relevant color stack(s). Before we can generate a TextEvent or a MarkupOpen event for a node, we check the top of the relevant color stack(s): as long as the markup node on top of a stack is not a parent of the current node, it generates a MarkupClose event and is removed from the color stack and the global stack. After all nodes have been processed in this manner, the markup that is left on the global stack generates the remaining MarkupClose events.

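import java.util.Stack

fun traverseMCT(mct: MCT): List<Event> {
	// topologicalSort (not shown here) yields the nodes of the MCT in topological order.
	val nodes = topologicalSort(mct)
	val result = arrayListOf<Event>()
	val colorToStackMap = HashMap<String, Stack<Markup>>()
	val globalStack = LinkedHashSet<Markup>()
	for (node in nodes) {
		val parents = mct.incomingEdges.getOrElse(node) { emptySet<Markup>() }
		// A markup node is checked against the stacks of its own colors; a text node against all stacks.
		val stacksToCheck: List<Stack<Markup>> =
				when (node) {
					is Markup -> colorToStackMap.entries.filter { node.colors.contains(it.key) }.map { it.value }
					is Text -> colorToStackMap.values.toList()
				}
		for (stack in stacksToCheck) {
			// Close all open markup that is not a parent of the current node;
			// the isNotEmpty check prevents peek() from failing on an empty stack.
			while (stack.isNotEmpty() && stack.peek() !in parents) {
				val nodeToPop = stack.pop()
				// A multi-colored node sits on several stacks but is closed only once.
				if (globalStack.remove(nodeToPop)) result.add(MarkupClose(nodeToPop))
			}
		}
		when (node) {
			is Markup -> {
				node.colors.map { colorToStackMap.getOrPut(it) { Stack() } }.forEach { it.push(node) }
				globalStack.add(node)
				result.add(MarkupOpen(node))
			}
			is Text -> result.add(TextEvent(node))
		}
	}
	// Close whatever is still open, innermost markup first.
	globalStack.reversed().forEach { node -> result.add(MarkupClose(node)) }
	return result
}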

Finally, the algorithm creates the XML document from the MCT. The code below describes how the algorithm loops over the Events and creates XML tags. The XML tags are based on the type of event and on whether the node associated with the event is part of the leading hierarchy. Nodes in the leading hierarchy are used to create XML content elements; nodes in the other hierarchies are converted to Trojan Horse elements. Trojan Horse elements are a specific type of empty element, or “segment-boundary delimiters”, with the namespace prefix th:. Two related milestones are linked by means of matching start and end identifiers (@th:sId and @th:eId in the Alexandria output), so the regular XML <s>The sun is yellow</s> becomes <s th:sId="foo"/>The sun is yellow<s th:eId="foo"/> in Trojan Horse markup (De Rose 2004, Barnard et al. 1995, Sperberg-McQueen 2018). Additionally, the Trojan Horse markup elements are given an attribute that is generated from their TAGML group identifier, e.g., @th:doc="D" for all markup nodes in a group called “D”.

                        
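import java.io.Writer
import javax.xml.stream.XMLOutputFactory

fun createXML(mct: MCT, leadingHierarchy: String, writer: Writer) {
	val events = traverseMCT(mct)
	val xml = XMLOutputFactory.newFactory().createXMLStreamWriter(writer)
	for (event in events) {
		when (event) {
			is TextEvent -> xml.writeCharacters(event.node.content)
			// Markup in the leading hierarchy becomes a regular start tag;
			// all other markup becomes an empty Trojan Horse start marker.
			is MarkupOpen ->
				if (event.node.colors.contains(leadingHierarchy)) xml.writeStartElement(event.node.label)
				else xml.apply {
					writeEmptyElement(event.node.label)
					writeAttribute("sID", event.node.id.toString())
				}
			// The same distinction applies to end tags and Trojan Horse end markers.
			is MarkupClose ->
				if (event.node.colors.contains(leadingHierarchy)) xml.writeEndElement()
				else xml.apply {
					writeEmptyElement(event.node.label)
					writeAttribute("eID", event.node.id.toString())
				}
		}
	}
	// Flush buffered output to the underlying Writer before closing it.
	xml.flush()
	xml.close()
	writer.close()
}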
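Seen from the consumer’s side, the matching start and end identifiers are what allows the extent of a Trojan Horse element to be recovered from the flat sequence of markers. The following Kotlin sketch illustrates the principle on a simplified token stream instead of a parsed XML document; the types Token, Start, End, and Chars and the function rangesOf are hypothetical and not part of Alexandria.

```kotlin
// Hypothetical token model of a Trojan Horse document.
sealed class Token
data class Start(val label: String, val id: String) : Token() // e.g. <s th:sId="s4"/>
data class End(val id: String) : Token()                      // e.g. <s th:eId="s4"/>
data class Chars(val text: String) : Token()

// Returns, per identifier, the text between its start and end markers.
fun rangesOf(tokens: List<Token>): Map<String, String> {
    val open = LinkedHashMap<String, StringBuilder>()
    val done = LinkedHashMap<String, String>()
    for (token in tokens) {
        when (token) {
            is Start -> open[token.id] = StringBuilder()
            // Text belongs to every range that is currently open, overlapping or not.
            is Chars -> open.values.forEach { it.append(token.text) }
            is End -> open.remove(token.id)?.let { done[token.id] = it.toString() }
        }
    }
    return done
}

// The first sentence of the letter, overlapping with two page lines.
fun sampleTokens(): List<Token> = listOf(
    Start("s", "s4"), Start("l", "l5"),
    Chars("Our neighbourhood was small, for it consisted "),
    End("l5"), Start("l", "l6"),
    Chars("only of your Mother. "),
    End("s4"), End("l6")
)
```

Because the ranges are matched by identifier rather than by nesting, the sentence s4 is recovered in full even though it starts in one line and ends in the middle of the next.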

The user

The user is usually not aware of the algorithmic details of the XML export and, although the code is open source and there for anyone to look at, is not required to be. Still, the conversion process is not just a matter of clicking a button: it is designed so that the user can influence it. Let’s take a closer look. Figure 16 shows the (simplified) TAGML transcription of the entire text of the fourth letter from Love and Freindship.

The overlapping hierarchical structures in TAGML cannot be automatically transformed to XML. This is where the user comes in: they can choose the hierarchy formed by the group of markup nodes labelled “D”, the hierarchy formed by the markup nodes labelled “T”, or no leading hierarchy in the XML output. In case of the latter, it suffices to use the command

alexandria export-xml Austen_VtS

which says as much as “Hi Alexandria, please export the document called Austen_VtS to XML”. Subsequently, the TAGTraverser described in the previous section will iterate over the MCT, generate a stream of events, and build an XML tree. All TAGML markup nodes will be transformed into Trojan Horse elements. A short fragment of the Trojan Horse XML output is shown below:

                        
<?xml version="1.0" encoding="UTF-8"?>
<xml xmlns:tag="http://tag.di.huc.knaw.nl/ns/tag"
 xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse" th:doc="D P T">
 <text title="Love and Freindship" type="novel" author="Jane Austen" date="1790">
<page n="6" th:doc="D" th:sId="page0"/><head th:doc="D T" th:sId="head1"/><l th:doc="D" th:sId="l2"
 />Letter 4th<l th:doc="D" th:eId="l2"/><head th:doc="D T" th:eId="head1"/><l th:doc="D"
 th:sId="l3"/>Laura to Marianne<l th:doc="D" th:eId="l3"/><s th:doc="T" th:sId="s4"/><l
 th:doc="D" th:sId="l5"/>Our neighbourhood was small, for it consisted <l th:doc="D"
 th:eId="l5"/><l th:doc="D" th:sId="l6"/>only of your Mother. <s th:doc="T" th:eId="s4"
 />
<s th:doc="T" th:sId="s7"/>She may probably have already <l th:doc="D" th:eId="l6"/><l
 th:doc="D" th:sId="l8"/>told you that being left by her Parents in indegent <l
 th:doc="D" th:eId="l8"/><l th:doc="D" th:sId="l9"/>Circumstances she had retired into
 Wales on economi-<l th:doc="D" th:eId="l9"/><page th:doc="D" th:eId="page0"/>
<page n="7" th:doc="D" th:sId="page10"/><l th:doc="D" th:sId="l11"/>cal motives. <s th:doc="T"
 th:eId="s7"/><s th:doc="T" th:sId="s12"/>There it was our freindship first <l th:doc="D"
 th:eId="l11"/><l th:doc="D" th:sId="l13"/>commenced – Isabel was then one and twenty – 
<l th:doc="D" th:eId="l13"/><s th:doc="T" th:eId="s12"/><s th:doc="T" th:sId="s14"/>
<l th:doc="D" th:sId="l15"/>Tho' pleasing in both her Person and Manners 
<l th:doc="D" th:eId="l15"/><l th:doc="D" th:sId="l16"/>(between ourselves) she never possessed the 
hun-<l th:doc="D" th:eId="l16"/><l th:doc="D" th:sId="l17"/>dreth part of my Beauty or
Accomplishments.<l th:doc="D" th:eId="l17"/><s th:doc="T" th:eId="s14"/>

        <!-- more text and markup --> 
    <page th:doc="D" th:eId="page10"/>
    </text>
</xml>

The user could also decide to make one of the hierarchical structures leading in the XML output. In that case, they can add a parameter to the Alexandria command with which they indicate which grouped markup nodes should form the leading hierarchy, e.g., alexandria export-xml Austen_VtS -l D. In this example command, the markup nodes in group “D” are transformed into XML content elements; the markup nodes not belonging to the leading hierarchy will be transformed into Trojan Horse elements. A fragment of the XML output with the markup group “D” as leading structure would look as follows:

                        
<?xml version="1.0" encoding="UTF-8"?>
<xml xmlns:tag="http://tag.di.huc.knaw.nl/ns/tag"
 xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse" th:doc="P T">
     <text title="Love and Freindship" type="novel" author="Jane Austen" date="1790">
     <page n="6">
     <head><l>Letter 4th</l></head>
         <l>Laura to Marianne</l>
         <s th:doc="T" th:sId="s0"/><l>Our neighbourhood was small, for it consisted </l>
         <l>only of your Mother. <s th:doc="T" th:eId="s0"/>
         <s th:doc="T" th:sId="s1"/>She may probably have already </l>
         <l>told you that being left by her Parents in indegent </l>
         <l>Circumstances she had retired into Wales on economi-</l>
     </page>
     <page n="7">
         <l>cal motives. <s th:doc="T" th:eId="s1"/>
         <s th:doc="T" th:sId="s2"/>There it was our freindship first </l>
         <l>commenced – Isabel was then one and twenty – </l> <s th:doc="T" th:eId="s2"/>
         <s th:doc="T" th:sId="s3"/><l>Tho&apos; pleasing in both her Person and Manners </l>
         <l>(between ourselves) she never possessed the hun-</l>
         <l>dreth part of my Beauty or Accomplishments.</l> <s th:doc="T" th:eId="s3"/>
             <!-- more text and markup --> 
         <s th:doc="T" th:sId="s9"/> <q tag:n="1" th:doc="P" th:sId="q10"/>
         <l>"Beware my Laura<q th:doc="P" th:eId="q10"/> (she would often say) </l>
         <q tag:n="1" th:doc="P" th:sId="q11"/>
         <l>Beware of the insipid Vanities and idle Dissipations </l>
         <l>of the Metropolis of England; Beware of the </l>
         <l>unmeaning Luxuries of Bath &amp; of the Stink-</l>
         <l>ing fish of Southampton."</l>
         <q th:doc="P" th:eId="q11"/> <s th:doc="T" th:eId="s9"/>

    </page>    
    </text>
</xml>

Note how the discontinuous text, marked with the suspend and resume signs in TAGML, is converted to XML. With the attribute @tag:n="1" the discontinued parts of the quotation are linked together. This approach is also suggested by Wendell Piez (Piez 2008). Finally, the root element of both fragments contains information about which groups of markup nodes are represented in Trojan Horse markup (th:doc="D P T" versus th:doc="P T").

In order to illustrate the variety of options created by the XML export function of Alexandria, the final paragraphs of this section explore two hypothetical yet highly likely user scenarios. Let’s say that, after modeling the text in TAGML and uploading the TAGML document to Alexandria, a user wants to (1) publish it online, so they need an HTML version of the encoded text; and (2) send it for approval to another scholarly editor who prefers to work in Word (e.g., using LibreOffice or Microsoft Office). We can update the workflow diagram by adding these two steps:

As the diagram illustrates, the transformation steps in the editorial workflow do not take place in the context of Alexandria, but in the user’s own workspace. This is a conscious choice: we assume that most users prefer their own, possibly customized, tools, transformation scenarios, and work environments. Accordingly, we provide TAGML users with the opportunity to create workflows and pipelines in which TAGML is seamlessly integrated with other tools. For instance, the user can use XSLT in order to create an HTML document from the XML document.
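As a rough illustration of such a downstream step, the sketch below turns the page-oriented XML export into simple HTML. It is written in Python as a stand-in for the XSLT stylesheet mentioned above, not as a replacement for it. The namespace URI, the function names, and the chosen HTML structure (one div per page, one span per line) are assumptions for the sake of the example; the Trojan Horse milestones are simply dropped, since the page hierarchy is the leading one.

```python
import html
import xml.etree.ElementTree as ET

TH = "urn:example:trojan-horse"  # placeholder namespace URI (assumption)

# Abridged version of the page-oriented export shown earlier.
SAMPLE = f"""<xml xmlns:th="{TH}">
<text title="Love and Freindship">
<page n="6">
<head><l>Letter 4th</l></head>
<l>Laura to Marianne</l>
<s th:doc="T" th:sId="s0"/><l>Our neighbourhood was small, for it consisted </l>
<l>only of your Mother. <s th:doc="T" th:eId="s0"/>
<s th:doc="T" th:sId="s1"/>She may probably have already </l>
</page>
</text>
</xml>"""


def normalized_text(elem):
    """Return all text inside elem (milestones contribute none) with collapsed whitespace."""
    return " ".join("".join(elem.itertext()).split())


def page_to_html(page):
    """Render one <page> as a <div>, keeping the manuscript's line breaks."""
    parts = [f'<div class="page" data-n="{html.escape(page.get("n", ""))}">']
    for child in page:
        if child.tag == "head":
            parts.append(f"  <h2>{html.escape(normalized_text(child))}</h2>")
        elif child.tag == "l":
            parts.append(f'  <span class="line">{html.escape(normalized_text(child))}</span><br/>')
        # Milestones that sit between lines carry no text and are skipped.
    parts.append("</div>")
    return "\n".join(parts)


def to_html(xml_string):
    """Convert every <page> of the export to an HTML fragment."""
    root = ET.fromstring(xml_string)
    return "\n".join(page_to_html(page) for page in root.iter("page"))


if __name__ == "__main__":
    print(to_html(SAMPLE))
```

A stylesheet or script that takes the sentence hierarchy as leading would instead read the milestones and ignore the l elements; the point is only that the choice is made in the user's own workspace.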

For the transformation to Word, we use the open source software OxGarage, a RESTful web service that was created by members of the TEI community and allows users to manage transformations between various document formats.^[9] Even without any additional information, the XML document sampled above (with the page structure as leading hierarchy) is easily transformed into a clearly readable Word document:

Note that the text of the addition, marked with add tags in both the source TAGML and the XML output, is automatically surrounded by diacritical marks in the Word output, and that the text marked as deleted in the TAGML and XML sources is represented as crossed out text between square brackets.

Future work

So far, our contributions to Balisage about TAG have discussed work in progress, and the present contribution is no different. While the basic export functionalities perform well, a few steps need to be taken before TAGML documents can be fully converted to an XML format. At the moment, non-linear structures are not optimally converted. We conceptualized this structure as a linear stream of text “splitting” into two or more branches. In the case of an in-text revision like a substitution, the split leads to two branches of the text, each with a different reading. As the attentive reader may have seen in figure 15 (Figure 16), the TAGML approach to a non-linear structure is to encode the splitting into branches as follows: some linear text <| branch 1 | branch 2 |> more linear text.^[10] For example, the non-linear structure caused by the substitution in Austen’s text is encoded as follows:

                     
[s> […] had spent a fortnight in Bath & had <|[del>slept<del]|[add>supped<add]|> one night in Southampton.<s]

Here, the notation <| and |> indicates the start and end of branches, with the | to separate them. One branch reads slept and one reads supped. Currently, this branching structure is quite literally converted as such to XML:

                     
<s> had spent a fortnight in Bath &amp; had 
            <tag:branches th:doc="_default" th:sId=":branches6"/>
            <tag:branch th:doc="_default" th:sId=":branch7"/>
                        <del>slept</del>
            <tag:branch th:doc="_default" th:eId=":branch7"/>
            <tag:branch th:doc="_default" th:sId=":branch8"/>
                        <add>supped</add>
            <tag:branch th:doc="_default" th:eId=":branch8"/>
            <tag:branches th:doc="_default" th:eId=":branches6"/> 
     one night in Southampton.
</s>

The branches are represented as Trojan Horse elements and linked to one another with the Trojan Horse attributes. While this is technically correct, it is difficult to read for humans and equally hard to transform to valid TEI-XML. Non-linear TAGML is, however, not easy to transform automatically into TEI-XML: where TAGML uses general syntactical symbols, TEI proposes multiple options with markup elements like subst, mod, or choice. Potentially, this conversion will require user input as well.
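One possible post-processing step can be sketched in Python as follows. It is not something Alexandria currently does; it rewrites a branches group into a TEI subst element, but only when the group holds exactly one del and one add. Choosing subst is exactly the kind of decision that may need user input, so the rule is hard-coded here purely as an assumption. The namespace URIs and the function name are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions; use the ones declared in the real export).
TH = "urn:example:trojan-horse"
TAG = "urn:example:tag"
SID, EID = f"{{{TH}}}sId", f"{{{TH}}}eId"
BRANCHES = f"{{{TAG}}}branches"

SAMPLE = f"""<s xmlns:th="{TH}" xmlns:tag="{TAG}"> had spent a fortnight in Bath &amp; had
<tag:branches th:doc="_default" th:sId=":branches6"/>
<tag:branch th:doc="_default" th:sId=":branch7"/>
<del>slept</del>
<tag:branch th:doc="_default" th:eId=":branch7"/>
<tag:branch th:doc="_default" th:sId=":branch8"/>
<add>supped</add>
<tag:branch th:doc="_default" th:eId=":branch8"/>
<tag:branches th:doc="_default" th:eId=":branches6"/>
 one night in Southampton.
</s>"""


def branches_to_subst(parent):
    """Replace each del/add branches group among parent's children with <subst>."""
    children = list(parent)
    i = 0
    while i < len(children):
        start = children[i]
        if start.tag != BRANCHES or start.get(SID) is None:
            i += 1
            continue
        # Find the end milestone whose eId matches this start milestone's sId.
        end = next(j for j in range(i + 1, len(children))
                   if children[j].tag == BRANCHES
                   and children[j].get(EID) == start.get(SID))
        span = children[i:end + 1]
        readings = [c for c in span if c.get(SID) is None and c.get(EID) is None]
        # Hard-coded rule (assumption): only a del plus an add becomes a subst.
        if sorted(r.tag for r in readings) == ["add", "del"]:
            subst = ET.Element("subst")
            subst.tail = children[end].tail       # text following the branches
            for reading in readings:
                reading.tail = None               # whitespace-only in this sample
                subst.append(reading)
            position = list(parent).index(start)
            for c in span:
                parent.remove(c)
            parent.insert(position, subst)
        i = end + 1


if __name__ == "__main__":
    s = ET.fromstring(SAMPLE)
    branches_to_subst(s)
    print(ET.tostring(s, encoding="unicode"))
```

The sketch discards the text between milestones inside the group, which is whitespace in this sample. Groups with other contents are left untouched, so a user could still decide between mod, choice, or another encoding for them.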

Another item on our “(Soon) To Do” list is to move away from grouping markup elements with identifiers. We are at present examining the possibility of implementing a TAGML schema and including in that schema information about the hierarchical structure(s) in a TAGML document. This would mean, for instance, that the user identifies which markup nodes are part of a certain hierarchy, which nodes can be shared between hierarchies, etc. Much like an XSL file, the schema may contain other information about the export. The user can subsequently point to this TAGML schema document when they give the export command to the Alexandria server. And finally, on a more general level, future work entails the possibility for multiple users to collaborate in Alexandria. The repository is now initialized locally, and a user of Alexandria can already check in, check out, and edit a TAGML document on their local machine, but we aim to enable collaboration between multiple users. This requires, among other things, further development of the diff functionality, as we want to track the edits made to a TAGML document on the level of the text as well as the markup.

Reflection

It’s a well known fact that certain textual and documentary structures are less suited to be encoded as XML, such as concurrent or overlapping hierarchies, discontinuous text, and non-linear structures. In the past thirty years or so, numerous ways to deal with these structures have been proposed, ranging from XML-based approaches (to name but a few: empty or virtual elements, linking, Trojan Horse markup, stand-off approaches, encoding the same text multiple times, XCONCUR) to non-XML proposals (such as LMNL, TexMECS, or EARMARK).^[11] There are various reasons why the proposed alternatives have not been adopted as text encoding standards. Some of them focused primarily on addressing just one limitation of the XML data model — e.g., overlap — and did not address the wider range of non-hierarchical text phenomena. Others never transcended the experimental phase, or needed to be abandoned simply because funding ran out. Nevertheless, the academic publications related to these proposals still make for an interesting read as they provide insight into the intellectual and technological history of text modeling. Most of them are not “just” technical: they are based on a philosophy of text. After all, when examining the question of how to express text informationally, the authors had to formulate their definition of text. If it’s not an Ordered Hierarchy of Content Objects, then what is it?

We conceptualize text as a partially ordered sequence of characters, inscribed on a material carrier. The information derived from the text can best be organized in a network structure, for which we propose a hypergraph. In this paper, we aimed to demonstrate the value of looking beyond the prevalent technologies and examining the alternatives vis-à-vis their philological notions of text, their conceptual model, the logical implementation(s), etc. We intended for the reader to realize that each approach to text modeling has its advantages and disadvantages. The best choice is not necessarily what is most commonly used, but what best addresses one’s research requirements and objective(s). There are many pragmatic reasons to encode literary historical texts in XML, but the XML framework should not become synonymous with the framework in which we conceptualize text.

We illustrated this notion by demonstrating how TAG can be combined with XML in an editorial workflow to model, store, edit, process, analyze, and publish text. The workflow combines the advantages of two data models: while TAG performs well with modeling and storing literary historical text, XML forms the input of many tools for further text analysis, transformation, and visualization. The XML output of the TAG reference implementation Alexandria uses Trojan Horse elements to avoid overlap conflicts. We opted for Trojan Horse because it is well known in the text encoding community and it is relatively easy, from a computational perspective, to have the computer generate milestones and unique IDs. Discontinuous elements are transformed into XML by adding a matching attribute (e.g., tag:n="q1") on the tags of the discontinued elements.

If we consider scholarly editing as an ongoing process of translating information from one carrier to another, it’s clear that scholars need to keep track of and understand these data transformations in order to make informed choices. By describing in detail the choices we made in designing TAG, by illustrating how these choices are reflected in the data model, and by proposing a workflow in which users have the opportunity to influence the TAGML-to-XML conversion process, we hope to have provoked amongst today’s textual scholars a continuous curiosity for data models for text modeling.

References

[Barnard et al. 1995] Barnard, David T., Burnard, Lou, Gaspart, Jean-Pierre, Price, Lynne A., Sperberg-McQueen, C. Michael, & Varile, Giovanni Battista. Hierarchical encoding of text: Technical problems and SGML solutions. Computers and the Humanities, vol.29, no.3, 1995, pp: 211-231. doi:https://doi.org/10.1007/BF01830617.

[Bleeker et al. 2020] Bleeker, Elli, Bram Buitendijk and Ronald Haentjens Dekker. Marking Up Microrevisions With Major Implications: Non-linear Text in TAG. Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27-31, 2020. Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25, 2020. doi:https://doi.org/10.4242/BalisageVol25.Bleeker01.

[Cummings 2008] Cummings, James. The Text Encoding Initiative and the Study of Literature. A Companion to Digital Literary Studies, edited by Susan Schreibman and Ray Siemens. Oxford: Blackwell, 2008. http://www.digitalhumanities.org/companionDLS/.

[Haentjens Dekker and Birnbaum 2017] Haentjens Dekker, Ronald and David J. Birnbaum. It’s More Than Just Overlap: Text As Graph. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1-4, 2017. Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19, 2017. doi:https://doi.org/10.4242/BalisageVol19.Dekker01.

[Haentjens Dekker et al. 2018] Haentjens Dekker, Ronald, Elli Bleeker, Bram Buitendijk, Astrid Kulsdom and David J. Birnbaum. TAGML: A markup language of many dimensions. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21, 2018. doi:https://doi.org/10.4242/BalisageVol21.HaentjensDekker01.

[Haentjens Dekker et al. 2020] Haentjens Dekker, Ronald, Bram Buitendijk, and Elli Bleeker. Parsing a Markup Language That Supports Overlap and Discontinuity. Proceedings of the ACM Symposium on Document Engineering 2020, pp. 1-4. doi:https://doi.org/10.1145/3395027.3419590.

[Huitfeldt 1994] Huitfeldt, Claus. Multi-dimensional Texts in a One-Dimensional Medium. Computers and the Humanities, vol. 28, no. 4, 1994, pp. 235-241. doi:https://doi.org/10.1007/BF01830270.

[Hilbert et al. 2005] Hilbert, Mirco, Oliver Schonefeld, and Andreas Witt. Making CONCUR work. Presented at Extreme Markup Languages 2005, Montréal, Québec, August 1-5, 2005. http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml.

[Jagadish et al. 2004] Jagadish, H.V., Laks V.S. Lakshmanan, M. Scannapieco, D. Srivastava, and N. Wiwatwattana. Colorful XML: One Hierarchy Isn’t Enough. Presented at SIGMOD 2004, Paris, France, June 13–18, 2004. doi:https://doi.org/10.1145/1007568.1007598.

[Kimber 2011] Kimber, Eliot. DITA Document Types: Enabling Blind Interchange Through Modular Vocabularies and Controlled Extension. Presented at Balisage: The Markup Conference 2011, Montréal, Canada, August 2-5, 2011. Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7, 2011. doi:https://doi.org/10.4242/BalisageVol7.Kimber01.

[Marcoux et al. 2011] Marcoux, Yves, Michael Sperberg-McQueen, and Claus Huitfeldt. Expressive Power of Markup Languages and Graph Structures. Presented at the Digital Humanities Conference 2011, Stanford, CA, June 19-22, 2011, pp. 178-180. https://core.ac.uk/reader/48606830.

[Niu et al. 2019] Niu Ya-Wei, Qu Cun-Quan, Wang Guang-Hui, Yan Gui-Ying. RWHMDA: Random Walk on Hypergraph for Microbe-Disease Association Prediction. Frontiers in Microbiology, vol. 10, 2019. doi:https://doi.org/10.3389/fmicb.2019.01578.

[Peroni et al. 2014] Peroni, Silvio, Francesco Poggi and Fabio Vitali. Overlapproaches in Documents: a Definitive Classification (in OWL, 2!). Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5-8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13, 2014. doi:https://doi.org/10.4242/BalisageVol13.Peroni01.

[Pierazzo 2015] Pierazzo, Elena. TEI: XML and Beyond. Presented at the Text Encoding Initiative Conference and Members Meeting 2015, Lyon (France), October 28-31, 2015. Abstract of talk available online at http://tei2015.huma-num.fr/en/papers/#140.

[Piez 2008] Piez, Wendell. LMNL in miniature. An introduction. Presented at the Amsterdam Goddag Workshop, December 1–5, 2008. http://piez.org/wendell/LMNL/Amsterdam2008/presentation-slides.html

[Portier et al. 2012] Portier, Pierre-Édouard, Noureddine Chatti, Sylvie Calabretto, Elöd Egyed-Zsigmond and Jean-Marie Pinon. Modeling, Encoding And Querying Multi-structured Documents. Information Processing & Management, vol. 48, no. 5, 2012, pp. 931-955. doi:https://doi.org/10.1016/j.ipm.2011.11.004.

[De Rose 2004] DeRose, Steven J. Markup Overlap: a Review and a Horse. Presented at Extreme Markup Languages 2004, Montréal, Québec, August 2-6, 2004. http://xml.coverpages.org/DeRoseEML2004.pdf.

[Sahle 2013] Sahle, Patrick. Digitale Editionsformen. Zum Umgang mit der Überlieferung unter den Bedingungen des Medienwandels. Teil 3: Textbegriffe und Recodierung. Norderstedt: BoD, 2013. https://kups.ub.uni-koeln.de/5353/

[Schmidt 2010] Schmidt, Desmond. The Inadequacy of Embedded Markup for Cultural Heritage Texts. Literary and Linguistic Computing 25, no. 3, 2010, pp. 337-356. doi:https://doi.org/10.1093/llc/fqq007.

[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C. M. and Claus Huitfeldt. Markup Discontinued: Discontinuity in TexMecs, Goddag Structures, and Rabbit/Duck Grammars. Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12-15, 2008. Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1, 2008. doi:https://doi.org/10.4242/BalisageVol1.Sperberg-McQueen01.

[Sperberg-McQueen 2018] Sperberg-McQueen, C.M. Representing Concurrent Document Structures Using Trojan Horse Markup. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21, 2018. doi:https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01.

[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C.M. and Claus Huitfeldt. GODDAG: A Data Structure for Overlapping Hierarchies. International Workshop on Principles, 2000. doi:https://doi.org/10.1007/978-3-540-39916-2_12.

[Austen 2010] Kathryn Sutherland, editor. Jane Austen’s Fiction Manuscripts: A Digital Edition. Available at http://www.janeausten.ac.uk/, consulted July 16th, 2021.

[Vitali 2016] Vitali, Fabio. The Expressive Power of Digital Formats: Criticizing the Manicure of the Wise Man Pointing at the Moon. Lecture at the DiXiT Convention 2, Cologne, Germany, March 15, 2016. Slides available at http://dixit.uni-koeln.de/wp-content/uploads/Vitali_Digital-formats.pdf.

[Watanna, 1916] Watanna, Onoto. Marion, the Story of an Artist’s Model. New York: W.J. Watt. URN:oclc:record:1048793515.

^[1] The authors are very grateful for the reviewers’ comments which have been insightful, useful, and highly appreciated. Many thanks.

^[2] See notably Huitfeldt 1994 (p. 143; 147-151), Sahle 2013 (p. 381-382), and Pierazzo 2015 (p.73-74) in relation to the conceptual model on which the encoding guidelines of the TEI are based.

^[3] To give but one example, the edges in the graph were first directed (Haentjens Dekker and Birnbaum 2017), then undirected with the order of the Text nodes in the graph derived from their distance from the root node (Haentjens Dekker et al. 2018). The current version of the hypergraph data model has both directed edges (for the text-to-text nodes) and undirected edges (for the markup and annotation nodes).

^[4] In earlier publications, we referred to groups of Markup nodes as layers, but we found this term to be confusing as “layers” are often used differently in different (humanities) contexts.

^[5] All text, documentary transcriptions, and facsimiles are retrieved from the digital diplomatic edition Jane Austen’s Fiction Manuscripts: A Digital Edition, edited by Kathryn Sutherland and her team. The digital edition can be found at https://janeausten.ac.uk/index.html.

^[6] The TAGML transcription is made in the Sublime Text editor, which offers TAGML syntax highlighting. See the Github page of the project for the most up-to-date information about the TAGML syntax highlighting.

^[7] See the TEI Guidelines, chapter 16.7. See also the (suggestions for the) question posted by Joey Takeda on the TEI mailing list, “Another q question”, March 23, 2019 (permalink: https://listserv.brown.edu/cgi-bin/wa?A2=TEI-L;48339dc2.1903).

^[8] Note that the quotation q is given the identifier “P”, because it overlaps locally with the s element.

^[9] See the website and the Github page of OxGarage for more information.

^[10] For a detailed discussion of the representation of non-linear structures in TAGML, see Haentjens Dekker et al. 2018 and Bleeker et al. 2020.

^[11] For an overview and description of these and other proposals, we recommend among others De Rose 2004, Schmidt 2010, Portier et al. 2012, Peroni et al. 2014, and Vitali 2016.

Elli Bleeker

Researcher

Huygens Institute for the History of the Netherlands

Elli Bleeker works as a researcher at the Huygens Institute for the History of the Netherlands. As a Research Fellow in the Marie Sklodowska-Curie funded network DiXiT (2013–2017), she received advanced training in manuscript studies, text modeling, and XML technologies for text modeling. She completed her PhD at the Centre for Manuscript Genetics at Antwerp University (2017) on the role of the scholarly editor in the digital environment. She specialized in digital scholarly editing with a focus on modern manuscripts, genetic criticism, and semi-automated collation. Currently, she works together with Ronald Haentjens Dekker and studies the potential of graph technologies for the modeling of literary and historical texts. This confronts her frequently with complex manuscripts that are very challenging to model computationally. Still, she would choose it again without a doubt.

Ronald Haentjens Dekker

Software engineer

Huygens Institute for the History of the Netherlands

Ronald Haentjens Dekker is a software architect and lead engineer of the Computational Modelling for Textual Sources (ComTES) at the Huygens Institute for the History of the Netherlands, part of the Royal Netherlands Academy of Arts and Sciences. As a software architect, he is responsible for translating research questions into technology or algorithms and explaining to researchers and management how specific technologies will influence their research. He has worked on transcription and annotation software, collation software, and repository software, and he is the lead developer of the CollateX collation tool. He also conducts workshops to teach researchers how to use scripting languages in combination with digital editions to enhance their research.

Bram Buitendijk

Software engineer

Digital Infrastructure Department, Humanities Cluster, Royal Netherlands Academy for Arts and Sciences

Bram Buitendijk is a software developer at the Humanities Cluster, part of the Royal Netherlands Academy of Arts and Sciences. He has worked on transcription and annotation software, collation software, and repository software.

Balisage: The Markup Conference

Balisage Paper: Hyper, Multi, or Single? Thinking about Text in Graphs and Trees

Elli Bleeker

Ronald Haentjens Dekker

Bram Buitendijk

Table of Contents

Premise^[1]

Data models for texts and documents

Data models

Text as a hypergraph

Text as Multi-Colored Trees

Text features

Discontinuous text

Discontinuous text in TAGML

Discontinuous text in XML

Overlapping structures

Overlap in TAG

Overlap in XML

From TAGML to XML and beyond

Workflow

The code

The user

Future work

Reflection

References

Balisage Series on Markup Technologies
