Bleeker, Elli, Ronald Haentjens Dekker and Bram Buitendijk. “Hyper, Multi, or Single? Thinking about Text in Graphs and Trees.” Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). https://doi.org/10.4242/BalisageVol26.Bleeker01.
Balisage: The Markup Conference 2021 August 2 - 6, 2021
Balisage Paper: Hyper, Multi, or Single? Thinking about Text in Graphs and Trees
Elli Bleeker
Researcher
Huygens Institute for the History of the Netherlands
Elli Bleeker works as a researcher at the Huygens Institute for the History of
the Netherlands. As a Research Fellow in the Marie Sklodowska-Curie funded
network DiXiT (2013–2017), she received advanced training in manuscript studies,
text modeling, and XML technologies for text modeling. She completed her PhD at
the Centre for Manuscript Genetics at Antwerp University (2017) on the role of
the scholarly editor in the digital environment. She specialized in digital
scholarly editing with a focus on modern manuscripts, genetic criticism, and
semi-automated collation. Currently, she works together with Ronald Haentjens
Dekker and studies the potential of graph technologies for the modeling of
literary and historical texts. This confronts her frequently with complex
manuscripts that are very challenging to model computationally. Still, she would
choose it again without a doubt.
Ronald Haentjens Dekker
Software engineer
Huygens Institute for the History of the Netherlands
Ronald Haentjens Dekker is a software architect and lead
engineer of the Computational Modelling for Textual Sources (ComTES) at the
Huygens Institute for the History of the Netherlands, part of the Royal
Netherlands Academy of Arts and Sciences. As a software architect, he is
responsible for translating research questions into technology or algorithms and
explaining to researchers and management how specific technologies will
influence their research. He has worked on transcription and annotation
software, collation software, and repository software, and he is the lead
developer of the CollateX collation tool. He also conducts workshops to teach
researchers how to use scripting languages in combination with digital editions
to enhance their research.
Bram Buitendijk
Software engineer
Digital Infrastructure Department, Humanities Cluster, Royal Netherlands
Academy for Arts and Sciences
Bram Buitendijk is a software developer at the Humanities
Cluster, part of the Royal Netherlands Academy of Arts and Sciences. He has
worked on transcription and annotation software, collation software, and
repository software.
This paper explores the potential of combining the Text-As-Graph (TAG) and the XML
data models. It proposes a digital editing workflow in which users can model, edit,
and store text in TAG, and subsequently export the data to XML for further analysis
or publication with XML-based tools. The conversion from TAGML to XML presents
several interesting challenges on a technical level as well as a philological level.
Overall, we argue that there may be many pragmatic reasons to encode cultural
heritage texts in XML, but we have to be mindful of the XML framework becoming
synonymous with the framework in which we conceptualize text. The paper therefore
dives deep into the translation from conceptual model to logical model(s) and argues
in favor of understanding the affordances and limitations of the text modeling
technologies we use.
There are many pragmatic reasons to encode cultural heritage texts in TEI-compliant
XML, but we have to be mindful of the XML framework becoming synonymous with the
framework in which we conceptualize text. As various textual scholars have pointed
out,
the limitations of a technology can delimit the selection, modelling, and analysis
of
textual aspects.[2]
They remind us of the frequently used analogy of the hammer and the toolbox: if all
you have is a hammer, everything will look like a nail. Indeed, the limitations of
the
XML data model have influenced and shaped our text encoding praxis (cf. Pierazzo 2015) and will continue to do so. If all you know is XML, every text
will start to look like a tree. And while there are enough cases where a tree data
model
suffices, other cases benefit from an alternative.
This paper explores the potential of combining the XML data model and the
Text-As-Graph (TAG) data model. The primary aim of the paper is to examine a practical
and workable method for modeling and editing documents. TAG is still under development
and not (yet) as mature as XML. Still, we find the affordances of the TAG model to
be
highly suitable for modeling and storing literary historical documents. XML remains
a
prominent technology for the analysis and publishing of texts. The question is, then,
can we combine the strengths of TAG and XML into one powerful tool for everything
we may
want to do with text: modeling, storing, (collaborative) editing, processing, analyzing,
and publishing? This paper explores that possibility by proposing a digital editing
workflow in which scholarly editors can model, edit, and store text as a TAG hypergraph,
and subsequently export the textual data to an XML format for further analysis or
publication with XML-based tools.
Developing a more inclusive, flexible data model for text has been one of the guiding
principles behind the design of TAG, a graph-based model under development at the
Royal
Netherlands Academy of Arts and Sciences. In previous Balisage contributions we
discussed the TAG data model (Haentjens Dekker and Birnbaum 2017), its markup language TAGML and
reference implementation Alexandria (Haentjens Dekker et al. 2018), and explored the ways in which TAGML can be used to model
certain textual features that are notoriously difficult to model in XML (Bleeker et al. 2020). Over the more than four years we have been working on
the TAG markup stack (i.e., its data model, syntax, query language, and schema) we have
gained valuable insights. In some cases, these resulted in small modifications
of the data model and syntax.[3]
In each communication we pointed out that even though TAG was under active
development, we considered our work, findings, and reflections already relevant for
a
general discussion on text modeling. We illustrated how other data models and markup
systems express complex features like overlap and non-linear text, and we argued that
text could best be expressed as a network structure (i.e., a graph). In doing so,
we
aimed to encourage a reflexive awareness of the relationship between an intellectual
model and data models of text. One of the motivations for our continued work on TAG is
to emphasize the value of, first, having a wider assortment of tools in one’s
metaphorical text modeling toolbox and, secondly, knowing how to select the right tool
for the job at hand. We believe that scholars can make an informed choice only when
they know both the strong and the weak points of a data model.
Data models for texts and documents
Our work is informed by a definition of text as a
sequence of characters (e.g., letters, digits, spaces, and punctuation, including
symbols and music notation) that is inscribed on a material carrier: a document. From the text on a document, a reader derives
information. In earlier contributions, we argued that this
information can best be organized in a network structure. We stated that text is
partially ordered, which means that it is not always possible
to determine the order of all characters in the sequence. Examples of partially ordered
text are non-linear, discontinuous, or overlapping structures.
Generally speaking, certain data models are more suitable for expressing complex
features than others: as said, the inherent properties of a data model provide its
scope
and determine its limits. While TAG is conceptually based on a hypergraph, the model
can
be implemented in different ways. The best logical implementation of TAG depends on
the
purpose: we found that a variation on the Multi-Colored Trees (MCT or Colorful XML,
developed by Jagadish et al. 2004) and the GODDAG model (Sperberg-McQueen and Huitfeldt 2000, Sperberg-McQueen 2018), which we described as a colored
GODDAG (Haentjens Dekker et al. 2020), works best for overlapping
structures as well as for export and visualization purposes. With export being the
focus
of the paper, we will describe both the hypergraph and the Colored GODDAG models in
the
following sections.
Data models
Text as a hypergraph
Just like any other graph, a hypergraph consists of nodes and edges. The
important difference is that some edges in a hypergraph can join together two or
more nodes (in contrast to the one-to-one edges of regular graphs). These are
called hyperedges. Hyperedges are typically
undirected and they can be used to express group relations. The hypergraph is
not a common data structure in the (digital) humanities, but it is fairly
well-known in the STEM research fields. By way of illustration, Figure 1
shows a hypergraph used in microbiology:
The TAG hypergraph is slightly different from the model in Figure 1. First,
the nodes in the TAG hypergraph are typed. We distinguish five different node
types: each hypergraph consists of exactly one document
node (the root), zero or more text
nodes, zero or more markup
nodes, zero or more annotation
nodes, and zero or more branching
nodes. Furthermore, the TAG model consists of undirected
hyperedges as well as directed one-to-one edges. The document node, the text
nodes, and the branching nodes indicate the stream of the text and are therefore
connected by directed, regular edges. The markup and annotation nodes in the TAG
hypergraph can be connected with either a hyperedge or a regular one-to-one
edge. For example, a hyperedge can associate multiple text nodes with one and
the same markup node, and an annotation node can be associated with one markup
node by means of a regular edge. Figure 2 below exemplifies the different types
of nodes and edges. The text nodes are white, the regular edges are visualized
as arrows, and the hyperedges are labelled and visualized in different colors:
Text in a hypergraph is read from left to right, starting with the
document node and following the directed edges. There are two branches in the
hypergraph; the beginning and the end of the branches are each indicated with a
branching node. The markup node labelled “subst” in Figure 2 is associated with two text nodes via a labelled
hyperedge (yellow); the markup node labelled “add” is associated with one text
node via a labelled hyperedge (dark green) and has an associated annotation node
(light green) with information about the place of the addition in the source
text.
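The node and edge types described above can be sketched in a few lines of Kotlin. This is a minimal illustration under our own assumptions, not the Alexandria implementation: all class and function names (TagHypergraph, connect, mark, annotate, readings) are hypothetical, and the sample text is invented. The sketch stores the directed text-stream edges separately from the hyperedges and shows that following the directed edges through a pair of branching nodes yields one reading per branch.

```kotlin
// Illustrative sketch only: hypothetical names, not the Alexandria data structures.
sealed class TagNode
class DocumentNode : TagNode()
class TextNode(val content: String) : TagNode()
class MarkupNode(val label: String) : TagNode()
class AnnotationNode(val key: String, val value: String) : TagNode()
class BranchingNode(val opensBranches: Boolean) : TagNode()

class TagHypergraph {
    val document = DocumentNode()
    // Directed one-to-one edges that carry the text stream.
    val textStream = LinkedHashMap<TagNode, MutableList<TagNode>>()
    // Hyperedges: one markup node joined to any number of text nodes.
    val hyperedges = LinkedHashMap<MarkupNode, MutableList<TextNode>>()
    // Regular edges from a markup node to its annotation nodes.
    val annotations = LinkedHashMap<MarkupNode, MutableList<AnnotationNode>>()

    fun connect(from: TagNode, to: TagNode) {
        textStream.getOrPut(from) { mutableListOf() }.add(to)
    }

    fun mark(markup: MarkupNode, vararg texts: TextNode) {
        hyperedges.getOrPut(markup) { mutableListOf() }.addAll(texts)
    }

    fun annotate(markup: MarkupNode, annotation: AnnotationNode) {
        annotations.getOrPut(markup) { mutableListOf() }.add(annotation)
    }
}

// Follow the directed edges from a node; every branch yields its own reading.
fun readings(graph: TagHypergraph, from: TagNode = graph.document): List<String> {
    val own = if (from is TextNode) from.content else ""
    val successors = graph.textStream[from].orEmpty()
    if (successors.isEmpty()) return listOf(own)
    return successors.flatMap { next -> readings(graph, next).map { own + it } }
}

// Invented sample: a substitution in which "quick" is replaced by "swift".
fun demoReadings(): List<String> {
    val g = TagHypergraph()
    val the = TextNode("The ")
    val split = BranchingNode(opensBranches = true)
    val deleted = TextNode("quick")
    val added = TextNode("swift")
    val join = BranchingNode(opensBranches = false)
    val fox = TextNode(" fox")
    g.connect(g.document, the)
    g.connect(the, split)
    g.connect(split, deleted)
    g.connect(split, added)
    g.connect(deleted, join)
    g.connect(added, join)
    g.connect(join, fox)
    val subst = MarkupNode("subst")
    val add = MarkupNode("add")
    g.mark(subst, deleted, added)                      // hyperedge over two text nodes
    g.mark(add, added)                                 // hyperedge over one text node
    g.annotate(add, AnnotationNode("place", "above"))  // regular edge to an annotation
    return readings(g)
}
```

Calling demoReadings() returns the two readings of the invented sentence, one for each branch between the branching nodes.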
Note that the markup in this example is properly nested: it represents one
hierarchical structure and the markup elements do not overlap. Expressing
multiple overlapping hierarchical structures in TAG is done by grouping together
related markup nodes in a group.[4] The markup nodes within each group form a single hierarchy, but
groups can share markup nodes and a TAG document can contain any number of
groups. This way, overlapping structures can be easily expressed in TAG. The
following section exemplifies the logical model (section “Text as Multi-Colored Trees”).
Text as Multi-Colored Trees
Alexandria implements an MCT, an ordered
directed acyclic graph with colors on the markup nodes. The MCT is inspired by
the multitrees of GODDAG, the colored nodes of Colorful XML, and the combination
of several XML trees of XConcur (Jagadish et al. 2004, Hilbert et al. 2005). The MCT model extends the XML tree in two ways: a
node in an MCT has an additional property (its color), and an MCT database can
consist of one or more colored trees (instead of XML’s single-rooted tree). Each
tree has a different color. A node can be shared by more than one tree; in that
case it has multiple colors. The trees within an MCT document can be navigated
and manipulated with extended XQuery or XPath expressions in which the user
first selects a leading color (Jagadish et al. 2004, see also Portier et al. 2012). The MCT is implemented in the Alexandria repository of TAG; see section “Workflow”. In the Alexandria
MCT, the text nodes as well as the root document node are shared between all the
colors.
Overlapping hierarchies in TAG are expressed as an MCT by assigning a colored
tree to each group of markup nodes. Figure 3 shows a visualization of an MCT
implementation with two groups of markup nodes, i.e., two colored trees.
This simple example shows how two groups of markup nodes, one with the
identifier “D” and the other with the identifier “T”, can be expressed as a red
and a blue tree, respectively. The red tree (i.e., the group of markup nodes
identified with “D”) shares the markup nodes labeled “s” and “add” with the blue
tree (i.e., the group labeled “T”). The markup node labeled “del” is only part
of the red tree. Each node is stored only once in the MCT model; edges between
nodes are specified in each colored tree.
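The principle that each node is stored once while the edges are specified per colored tree can be sketched as follows. This is a minimal illustration under our own assumptions: the names (MctNode, ColoredTrees, outline, colorsOf) are hypothetical and do not mirror Alexandria's classes, and the two text nodes are invented. The sketch builds the “D” and “T” trees of the example and prints each tree by following only the edges of one color.

```kotlin
// Illustrative sketch only: hypothetical names, not the Alexandria classes.
class MctNode(val label: String, val isText: Boolean = false)

class ColoredTrees {
    // color -> (parent -> ordered children); node objects are shared between colors.
    private val edges = LinkedHashMap<String, LinkedHashMap<MctNode, MutableList<MctNode>>>()

    fun addEdge(color: String, parent: MctNode, child: MctNode) {
        edges.getOrPut(color) { LinkedHashMap() }.getOrPut(parent) { mutableListOf() }.add(child)
    }

    fun children(color: String, parent: MctNode): List<MctNode> =
        edges[color]?.get(parent).orEmpty()

    // The colors of a node are the trees in which it occurs as parent or child.
    fun colorsOf(node: MctNode): Set<String> =
        edges.filterValues { tree -> node in tree.keys || tree.values.any { node in it } }.keys

    // Write one colored tree as nested labels, following only edges of that color.
    fun outline(color: String, node: MctNode): String =
        if (node.isText) "'${node.label}'"
        else node.label + children(color, node).joinToString(", ", "(", ")") { outline(color, it) }
}

fun demoOutlines(): Map<String, String> {
    val trees = ColoredTrees()
    val text = MctNode("text")
    val s = MctNode("s")
    val del = MctNode("del")
    val add = MctNode("add")
    val oldText = MctNode("old", isText = true)   // invented text content
    val newText = MctNode("new", isText = true)   // invented text content
    // Tree "D": s contains del and add.
    trees.addEdge("D", text, s)
    trees.addEdge("D", s, del)
    trees.addEdge("D", s, add)
    trees.addEdge("D", del, oldText)
    trees.addEdge("D", add, newText)
    // Tree "T": shares s and add, but has no del.
    trees.addEdge("T", text, s)
    trees.addEdge("T", s, oldText)
    trees.addEdge("T", s, add)
    trees.addEdge("T", add, newText)
    return mapOf("D" to trees.outline("D", text), "T" to trees.outline("T", text))
}
```

In this sketch the node labeled “del” occurs only in the “D” tree, while “s” and “add” are the same objects in both trees, which is the sharing the example describes.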
Text features
In earlier publications, we distinguished at least three complex features with
which digital scholarly editors have to deal on a regular basis: overlapping or
concurrent hierarchies, discontinuous text, and non-linear structures (Haentjens Dekker and Birnbaum 2017, Haentjens Dekker et al. 2018). Expressing concurrent or
overlapping structures is, if not the biggest, surely the most famous and most
debated obstacle for text modeling in XML and therefore needs little explaining.
Discontinuous text is usually illustrated with a quotation that is interrupted by
an
aside or by the narrator’s voice (cf. Sperberg-McQueen and Huitfeldt 2008). Non-linear
structures, finally, can be found on the level of the source text as well as the
markup, e.g., in-text revisions on a manuscript or an abbreviation and its expansion
supplied by the editor in markup. The common denominator is a temporary break in the
linearity of the text that can be conceptualized as a split of the text stream in
two or more substreams or branches.
In this section we show how (or to what extent) these features are modeled in TAG.
We take our examples from three pages of Love and Freindship, a
short epistolary novel by Jane Austen, written between 1790 and 1793. The novel is
written in a notebook entitled “Volume the Second” and is part of her
Juvenilia manuscripts.[5]
For each text feature, we first describe our philological interpretation of the
feature in question. We then show the syntactical serialization of the feature in
TAGML and in XML, combined with visualizations of the underlying data models. This
will facilitate the comparison between TAGML and XML and, we hope, increase the
readers’ awareness of the relationship between the conceptual model and its logical
implementation(s).
Discontinuous text
Discontinuous text can be found in the fragment of the text displayed in
figure 7, which reads: Beware my Laura (she would often say) Beware of
the insipid Vanities and idle Dissipations […]. The aside
(she would often say) is a comment by the letter writer Laura
upon the quoted text and can therefore be identified as a break in the
quotation.
Discontinuous text in TAGML
The TAGML encoding of discontinuous text is similar to the TexMECS
proposal: the q element is “paused” and subsequently “resumed”
with the affixes - and + (see Sperberg-McQueen and Huitfeldt 2008).[6] On the level of the conceptual hypergraph model, the discontinuous
text is part of one and the same q element, as is illustrated
by Figure 9.
The relationship between the two parts of the q
element is also stored in the MCT implementation of Alexandria:
Discontinuous text in XML
The TEI Guidelines offer several options to encode discontinuous text,
such as the use of next and prev attributes on the
discontinued elements.[7] A simplified TEI XML transcription of the example sentence would
then be:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata information here -->
  </teiHeader>
  <text>
    <body>
      <p>
        <!-- more text here -->
        <s>
          <q xml:id="q1" next="#q2">"Beware my Laura </q>
          (she would often say) <q xml:id="q2" prev="#q1"> Beware of [...] Southampton."</q>
        </s>
        <!-- more text here -->
      </p>
    </body>
  </text>
</TEI>
Here the q elements are given an xml:id, and the
next and prev attributes indicate
that the first q element is continued in the second. Instead of,
or in addition to, next and prev attributes, the
two parts can be joined together using a link element:
<link type="join" target="#q1 #q2"/>. As figure 11
shows, the elements are only linked on a syntactical level. On the level of
the data model, the q elements remain two separate child elements
of their parent element.
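Because the two q elements are linked only syntactically, a tool that needs the complete quotation has to follow the next pointers itself. The Kotlin sketch below illustrates this with the standard JDK DOM parser. It is a minimal sketch under our own assumptions: the function name joinLinkedParts is hypothetical, only the next, prev, and xml:id attributes from the transcription above are used, and the parts are simply joined with a space.

```kotlin
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Reassemble elements that are chained with next/prev pointers.
// Hypothetical helper; returns one joined string per chain.
fun joinLinkedParts(xml: String, elementName: String): List<String> {
    val document = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(ByteArrayInputStream(xml.toByteArray(Charsets.UTF_8)))
    val nodes = document.getElementsByTagName(elementName)
    val elements = (0 until nodes.length).map { nodes.item(it) as Element }
    // Index the elements by their xml:id so that "#id" pointers can be resolved.
    val byId = elements
        .filter { it.hasAttribute("xml:id") }
        .associateBy { it.getAttribute("xml:id") }
    return elements
        .filter { !it.hasAttribute("prev") }        // elements that start a chain
        .map { head ->
            val parts = mutableListOf<String>()
            val seen = HashSet<Element>()           // guards against circular pointers
            var current: Element? = head
            while (current != null && seen.add(current)) {
                parts.add(current.textContent.trim())
                current = byId[current.getAttribute("next").removePrefix("#")]
            }
            parts.joinToString(" ")
        }
}
```

Applied to the s element of the transcription above with elementName "q", the function returns a single string in which the aside (she would often say) no longer interrupts the quotation.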
Overlapping structures
Finding an example of overlapping structures in the manuscript notebook of
Love and Freindship is fairly simple. Let’s say we want
to express both the material features of the document and the linguistic
structure of the text, i.e., the lines on the page and the sentences
respectively. As the sentences run over several page lines and one page
boundary, they overlap partly with the material structure of the
document.
Overlap in TAG
As mentioned in section “Text as Multi-Colored Trees”, TAG handles
overlapping structures by grouping the markup nodes of each structure into a
separate group. Within each group, the markup nodes are hierarchically
ordered. Figure 12 shows the
TAGML encoding of Austen’s text containing the overlapping hierarchies:
the material structure, whose nodes are assigned the identifier “D”, and the
linguistic structure of the letter, whose nodes are assigned the identifier
“T”. For example, the page and l markup elements
have the identifier “D”, and the s markup elements are given the
identifier “T”.[8] In this particular TAGML document, the root node
text is shared: it has the “D”, “T”, and “P”
identifiers.
The MCT of the entire novel is too large to visualize here, but
the simplified visualization below shows the colored trees constituted by
the material and linguistic structures:
Overlap in XML
The ubiquity of overlapping structures in cultural heritage texts has produced
a wide variety of TEI P5 encoding strategies. In our case it would be
convenient to use empty <lb/> elements for the structure of
the page lines, and the aforementioned next and
prev attributes for the sentence that runs across the two
pages. Below is a simplified TEI-compliant XML transcription in which the
linguistic structure of the sentences is leading. Note again that the
last sentence is made up of two separate s elements that are
only linked on a syntactical
level.
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata information here -->
  </teiHeader>
  <text>
    <body>
      <!-- some text here -->
      <p>
        <s>
          <lb/>Our neighbourhood consisted <lb/>only of your Mother.</s>
        <s xml:id="s1" next="#s2">She may probably have already <lb/>told you that being left by her Parents in indegent
          <lb/>Circumstances she had retired into Wales on economi-</s>
      </p>
      <pb/>
      <p>
        <s xml:id="s2" prev="#s1">
          <lb/>cal motives.</s>
        <!-- some text and markup here -->
      </p>
    </body>
  </text>
</TEI>
From TAGML to XML and beyond
Workflow
After the text has been encoded in TAGML, the document can be uploaded to the
Alexandria repository. Alexandria is operated from the command line and is set up to work
similarly to Git, the version control software often used by programmers and digital
humanists to work collaboratively on code. Via command line commands, users can
“check in” and “check out” TAGML documents to and from the Alexandria repository. In addition to a TAGML document, users can
also upload one or more views in which they
define which (groups of) markup nodes are included or filtered out. We created the view
functionality because the TAG document in Alexandria can
potentially contain a large amount of information, and we assume that not all users
will always be interested in all information. A view is expressed in JSON, for
example: {"includeGroups": ["D"], "includeMarkup": ["del"]}. In this
example, the view says to include the markup nodes that are grouped with the
identifier “D” and all markup nodes labelled “del” when the TAGML document is
checked out of the Alexandria repository.
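The effect of such a view can be sketched as a simple filter over markup nodes. The sketch below is an illustration under our own assumptions, not Alexandria's view implementation: the View and MarkupTag classes and the keeps and applyView functions are hypothetical, and we assume that a markup node is kept when it matches either criterion (its label is listed, or it belongs to a listed group).

```kotlin
// Illustrative sketch only: hypothetical names and assumed "either criterion" semantics.
data class View(
    val includeGroups: Set<String> = emptySet(),
    val includeMarkup: Set<String> = emptySet()
)

data class MarkupTag(val label: String, val groups: Set<String>)

// A markup node is kept if its label is listed or it belongs to a listed group.
fun View.keeps(tag: MarkupTag): Boolean =
    tag.label in includeMarkup || tag.groups.any { it in includeGroups }

fun applyView(view: View, tags: List<MarkupTag>): List<MarkupTag> =
    tags.filter { view.keeps(it) }
```

With View(includeGroups = setOf("D"), includeMarkup = setOf("del")), a list containing page and l in group “D”, s in group “T”, and a del in group “T” is reduced to page, l, and del; the text itself is untouched because a view only filters markup.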
An Alexandria check-in means that the TAGML
document is parsed, verified for well-formedness, and stored as an MCT in the
repository. A check-out in combination with a view generates a new TAGML document
that contains the selected markup nodes, while the master TAGML document remains
intact in the repository. The checked-out document can be edited and checked in
again. Alternatively, users can indicate upon check-out whether they want the master
document to be exported to another format: currently, we support SVG, XML, DOT, and
PNG. Figure 14 depicts this workflow:
Let’s zoom in on the last step in this workflow diagram: the export to XML. The
Alexandria command export-xml
seems easy enough, and to some extent it is: technically, converting an MCT to a
single tree is fairly simple. Like XML, TAGML has markup tags with names,
attributes, and values. And although TAGML supports different data types for
attribute values (in addition to strings, TAGML attribute values can be integers,
floats, lists, or Booleans), the TAGML attribute values can be converted to
string type attribute values in XML. Still, a graph with multiple concurrent
hierarchies contains more information than a mono-hierarchical tree, so a
graph-to-tree conversion implies that we have to decide how to express that
information. This depends in part on how the user plans to use the exported
document. Representing overlapping hierarchies in a single tree, for example, will
require additional tagging. Users may therefore want to scale down the amount of
information in the XML document and select only the information that is relevant for
their purpose. This means deciding whether the exported XML document should have a
leading hierarchy and, if so, which markup nodes should be part of it. After
addressing the algorithmic side of the TAGML-to-XML conversion in section “The code”, we will discuss
the editorial side.
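The conversion of typed TAGML attribute values to XML strings mentioned above could look roughly as follows. This is a minimal sketch under our own assumptions: the function name toXmlAttributeValue is hypothetical, the TAGML types are represented by plain Kotlin values, and serializing a list as space-separated items is our choice for the illustration rather than documented Alexandria behavior.

```kotlin
// Illustrative sketch only: hypothetical function, assumed list serialization.
fun toXmlAttributeValue(value: Any): String = when (value) {
    is String -> value
    is Boolean -> value.toString()
    is Int, is Long -> value.toString()
    is Float, is Double -> value.toString()
    // Assumption: list items are converted one by one and separated by spaces.
    is List<*> -> value.joinToString(" ") { toXmlAttributeValue(it ?: "") }
    else -> throw IllegalArgumentException(
        "Unsupported attribute type: ${value.javaClass.name}"
    )
}
```

The type information itself is lost in this direction: after the export, the integer 6 and the string "6" are indistinguishable in the XML document.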
The code
The steps taken during the TAGML-to-XML conversion are as follows:
1. The user gives the export-xml command to the Alexandria server.
2. The TAGTraverser iterates over the MCT of the TAGML
document. If the user has provided a view, the traverser also uses the
information from the view and ignores certain (groups of) markup
nodes.
3. The TAGTraverser generates a stream of
Events.
4. For each Event, the algorithm checks whether it is an open tag, a
close tag, or text characters.
5. If the Event is text, the characters are transformed into
an XML text node.
6. If the Event is an open tag or a close tag, the algorithm checks
whether the user provided information about the leading hierarchy and, if so,
whether the tag is part of the leading hierarchy.
7. If the tag is not part of the leading hierarchy, the open tag or close tag
is transformed into a Trojan Horse start or end element, respectively.
8. If the tag is part of the leading hierarchy, the open tag
or close tag is transformed into an XML open tag or an XML
close tag.
Figure 15 shows a flowchart of this
process. The flowchart starts after the user has created a TAGML document and a view
and uploaded them to the Alexandria repository.
The code base of Alexandria is written in Kotlin.
The code fragment below shows the class definitions of Node,
MCT, and Event. Edges are stored in two directions:
incoming and outgoing. Target nodes are stored in LinkedHashSets to preserve the
order of the nodes.
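// Nodes of the MCT: markup nodes carry one or more colors, text nodes carry content.
// Note: ids default to the creation time in milliseconds, so nodes created within
// the same millisecond need explicit ids to remain distinct.
sealed class Node
data class Markup(val label: String, val colors: List<String>, val id: Long = System.currentTimeMillis()) : Node()
data class Text(val content: String, val id: Long = System.currentTimeMillis()) : Node()

class MCT(val rootNode: Markup) {
    // Edges are stored in both directions; LinkedHashSet preserves insertion order.
    val outgoingEdges: MutableMap<Markup, LinkedHashSet<Node>> = HashMap()
    val incomingEdges: MutableMap<Node, LinkedHashSet<Markup>> = HashMap()
}

// Events produced by traversing the MCT.
sealed class Event
data class MarkupOpen(val node: Markup) : Event()
data class MarkupClose(val node: Markup) : Event()
data class TextEvent(val node: Text) : Event()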
As described above (section “Workflow”), a TAGML document that is
checked into the Alexandria repository is parsed
and stored as an MCT. When the user gives the export-xml command to
Alexandria, the TAGTraverser
iterates over the MCT and generates a stream of Events. The Kotlin code
fragment below describes the traversal algorithm. It traverses the nodes in
topological order and creates TextEvent, MarkupOpen, or
MarkupClose events. A text node generates a TextEvent;
a markup node generates a MarkupOpen event for that node. In order to
link the MarkupClose events to the appropriate MarkupOpen
events, we keep track of the markup that is currently open by using markup stacks.
There is a global stack as well as one stack for each color in the MCT. Each markup
node is added to the global stack and to the relevant color stack(s). Before we can
generate a TextEvent or a MarkupOpen event for a node, we
need to check whether the markup node on top of each relevant color stack is a parent of
the current node. Markup nodes that are not a parent of the current node generate
MarkupClose events and
are removed from both the color stacks and the global stack. After all nodes have
been processed in this manner, the markup that is left on the global stack generates
the remaining MarkupClose events.
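import java.util.Stack

// topologicalSort(mct) is not shown here; it is expected to return the nodes
// of the MCT in topological order (parents before children).
fun traverseMCT(mct: MCT): List<Event> {
    val nodes = topologicalSort(mct)
    val result = arrayListOf<Event>()
    // One stack of open markup per color, plus a global record of all open markup.
    val colorToStackMap = HashMap<String, Stack<Markup>>()
    val globalStack = LinkedHashSet<Markup>()
    for (node in nodes) {
        val parents = mct.incomingEdges.getOrElse(node) { emptySet<Markup>() }
        // Markup only affects the stacks of its own colors; text is shared by all colors.
        val stacksToCheck: List<Stack<Markup>> =
            when (node) {
                is Markup -> colorToStackMap.entries.filter { node.colors.contains(it.key) }.map { it.value }
                is Text -> colorToStackMap.values.toList()
            }
        // Close every open markup node that is not a parent of the current node.
        for (stack in stacksToCheck) {
            // The isNotEmpty() check avoids an EmptyStackException from peek().
            while (stack.isNotEmpty() && stack.peek() !in parents) {
                val nodeToPop = stack.pop()
                // A node shared by several colors is closed only once.
                if (globalStack.remove(nodeToPop)) result.add(MarkupClose(nodeToPop))
            }
        }
        when (node) {
            is Markup -> {
                node.colors.map { colorToStackMap.getOrPut(it) { Stack() } }.forEach { it.push(node) }
                globalStack.add(node)
                result.add(MarkupOpen(node))
            }
            is Text -> result.add(TextEvent(node))
        }
    }
    // Close whatever is still open, innermost first.
    globalStack.reversed().forEach { node -> result.add(MarkupClose(node)) }
    return result
}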
Finally, the algorithm creates the XML document from the MCT. The code below
describes how the algorithm loops over the Events and creates XML tags.
The XML tags are based on the type of event and on whether the node associated with the
event belongs to the leading hierarchy. Nodes in the leading hierarchy are used to
create XML content elements; nodes in the other hierarchies are converted to Trojan
Horse elements. Trojan Horse elements are a specific type of element, or
“segment-boundary delimiters”, marked with the namespace prefix th:.
Two related milestones are linked by means of matching start and end identifier
attributes, so the regular XML <s>The sun is
yellow</s> becomes <s th:sID="foo"/>The sun is yellow<s
th:eID="foo"/> in Trojan Horse markup (De Rose 2004,
Barnard et al. 1995, Sperberg-McQueen 2018). Additionally, the
Trojan Horse markup elements are given an attribute that is generated from their
TAGML group identifier, e.g., @th:doc="D" for all markup nodes in a
group called “D”.
import java.io.Writer
import javax.xml.stream.XMLOutputFactory

fun createXML(mct: MCT, leadingHierarchy: String, writer: Writer) {
    val events = traverseMCT(mct)
    val xml = XMLOutputFactory.newFactory().createXMLStreamWriter(writer)
    for (event in events) {
        when (event) {
            is TextEvent -> xml.writeCharacters(event.node.content)
            // Leading hierarchy: regular start tag; otherwise an empty start marker.
            is MarkupOpen ->
                if (event.node.colors.contains(leadingHierarchy)) {
                    xml.writeStartElement(event.node.label)
                } else {
                    xml.writeEmptyElement(event.node.label)
                    xml.writeAttribute("sID", event.node.id.toString())
                }
            // Leading hierarchy: regular end tag; otherwise an empty end marker.
            is MarkupClose ->
                if (event.node.colors.contains(leadingHierarchy)) {
                    xml.writeEndElement()
                } else {
                    xml.writeEmptyElement(event.node.label)
                    xml.writeAttribute("eID", event.node.id.toString())
                }
        }
    }
    // Flush buffered output before closing the underlying writer.
    xml.flush()
    xml.close()
    writer.close()
}
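The fragment above does not write the th:doc attribute that the text describes. The self-contained sketch below shows one way the group identifier and the th: prefix could be added to the milestone elements. It is an illustration under our own assumptions, not the Alexandria code: the event classes (Ev, Open, Close, Chars) and the functions are hypothetical, the output is assembled as a plain string instead of with an XML stream writer, and the attribute names th:sId and th:eId follow the export samples shown in the section “The user”.

```kotlin
// Illustrative sketch only: hypothetical event types, not the Alexandria classes.
sealed class Ev
data class Open(val label: String, val groups: List<String>, val id: String) : Ev()
data class Close(val label: String, val groups: List<String>, val id: String) : Ev()
data class Chars(val content: String) : Ev()

fun escapeXml(text: String): String =
    text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

// An empty milestone element carrying its group(s) and a start or end identifier.
fun milestone(label: String, groups: List<String>, idAttribute: String, id: String): String =
    "<$label th:doc=\"${groups.joinToString(" ")}\" th:$idAttribute=\"$id\"/>"

// Markup in the leading group becomes regular tags; all other markup becomes milestones.
// Passing null as the leading group turns every markup node into a milestone.
fun serialize(events: List<Ev>, leading: String?): String {
    val out = StringBuilder()
    for (event in events) {
        when (event) {
            is Chars -> out.append(escapeXml(event.content))
            is Open ->
                if (leading != null && leading in event.groups) out.append("<${event.label}>")
                else out.append(milestone(event.label, event.groups, "sId", event.id))
            is Close ->
                if (leading != null && leading in event.groups) out.append("</${event.label}>")
                else out.append(milestone(event.label, event.groups, "eId", event.id))
        }
    }
    return out.toString()
}
```

For an s element in group “T”, calling serialize with "T" as the leading group produces regular start and end tags, whereas "D" or null as the leading group produces a pair of empty s milestones with matching identifiers. The sketch omits the namespace declaration for th:, which a complete document would declare on the root element.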
The user
The user is usually not aware of the algorithmic details of the XML export and,
although the code is open source and available for anyone to inspect, is not required to
be. Still, the conversion process is not just a matter of clicking a button: it
is designed so that the user can influence it. Let’s take a closer look. Figure 16
shows the (simplified) TAGML transcription of
the entire text of the fourth letter from Love and
Freindship.
The overlapping hierarchical structures in TAGML cannot be automatically
transformed to XML. This is where the user comes in: they can choose the hierarchy
formed by the group of markup nodes labelled “D”, the hierarchy formed by the markup
nodes labelled “T”, or no leading hierarchy in the XML output. In case of the
latter, it suffices to use the command alexandria export-xml
Austen_VtS, which says as much as “Hi Alexandria, please export the document
called Austen_VtS to XML”. Subsequently, the
TAGTraverser described in the previous section will iterate over
the MCT, generate a stream of events, and build an XML tree. All TAGML markup nodes
will be transformed into Trojan Horse elements. A short fragment of the Trojan Horse
XML output is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<xml xmlns:tag="http://tag.di.huc.knaw.nl/ns/tag"
xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse" th:doc="D P T">
<text title="Love and Freindship" type="novel" author="Jane Austen" date="1790">
<page n="6" th:doc="D" th:sId="page0"/><head th:doc="D T" th:sId="head1"/><l th:doc="D" th:sId="l2"
/>Letter 4th<l th:doc="D" th:eId="l2"/><head th:doc="D T" th:eId="head1"/><l th:doc="D"
th:sId="l3"/>Laura to Marianne<l th:doc="D" th:eId="l3"/><s th:doc="T" th:sId="s4"/><l
th:doc="D" th:sId="l5"/>Our neighbourhood was small, for it consisted <l th:doc="D"
th:eId="l5"/><l th:doc="D" th:sId="l6"/>only of your Mother. <s th:doc="T" th:eId="s4"
/>
<s th:doc="T" th:sId="s7"/>She may probably have already <l th:doc="D" th:eId="l6"/><l
th:doc="D" th:sId="l8"/>told you that being left by her Parents in indegent <l
th:doc="D" th:eId="l8"/><l th:doc="D" th:sId="l9"/>Circumstances she had retired into
Wales on economi-<l th:doc="D" th:eId="l9"/><page th:doc="D" th:eId="page0"/>
<page n="7" th:doc="D" th:sId="page10"/><l th:doc="D" th:sId="l11"/>cal motives. <s th:doc="T"
th:eId="s7"/><s th:doc="T" th:sId="s12"/>There it was our freindship first <l th:doc="D"
th:eId="l11"/><l th:doc="D" th:sId="l13"/>commenced – Isabel was then one and twenty –
<l th:doc="D" th:eId="l13"/><s th:doc="T" th:eId="s12"/><s th:doc="T" th:sId="s14"/>
<l th:doc="D" th:sId="l15"/>Tho' pleasing in both her Person and Manners
<l th:doc="D" th:eId="l15"/><l th:doc="D" th:sId="l16"/>(between ourselves) she never possessed the
hun-<l th:doc="D" th:eId="l16"/><l th:doc="D" th:sId="l17"/>dreth part of my Beauty or
Accomplishments.<l th:doc="D" th:eId="l17"/><s th:doc="T" th:eId="s14"/>
<!-- more text and markup -->
<page th:doc="D" th:eId="page10"/>
</text>
</xml>
The user could also decide to make one of the hierarchical structures leading in
the XML output. In that case, they can add a parameter to the Alexandria command that indicates which group of markup
nodes should form the leading hierarchy, e.g., alexandria export-xml
Austen_VtS -l D. In this example command, the markup nodes in group “D”
are transformed into XML content elements; the markup nodes not belonging to the
leading hierarchy are transformed into Trojan Horse elements. A fragment of the
XML output with the markup group “D” as leading structure would look as follows:
<?xml version="1.0" encoding="UTF-8"?>
<xml xmlns:tag="http://tag.di.huc.knaw.nl/ns/tag"
xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse" th:doc="P T">
<text title="Love and Freindship" type="novel" author="Jane Austen" date="1790">
<page n="6">
<head><l>Letter 4th</l></head>
<l>Laura to Marianne</l>
<s th:doc="T" th:sId="s0"/><l>Our neighbourhood was small, for it consisted </l>
<l>only of your Mother. <s th:doc="T" th:eId="s0"/>
<s th:doc="T" th:sId="s1"/>She may probably have already </l>
<l>told you that being left by her Parents in indegent </l>
<l>Circumstances she had retired into Wales on economi-</l>
</page>
<page n="7">
<l>cal motives. <s th:doc="T" th:eId="s1"/>
<s th:doc="T" th:sId="s2"/>There it was our freindship first </l>
<l>commenced – Isabel was then one and twenty – </l> <s th:doc="T" th:eId="s2"/>
<s th:doc="T" th:sId="s3"/><l>Tho' pleasing in both her Person and Manners </l>
<l>(between ourselves) she never possessed the hun-</l>
<l>dreth part of my Beauty or Accomplishments.</l> <s th:doc="T" th:eId="s3"/>
<!-- more text and markup -->
<s th:doc="T" th:sId="s9"/> <q tag:n="1" th:doc="P" th:sId="q10"/>
<l>"Beware my Laura<q th:doc="P" th:eId="q10"/> (she would often say) </l>
<q tag:n="1" th:doc="P" th:sId="q11"/>
<l>Beware of the insipid Vanities and idle Dissipations </l>
<l>of the Metropolis of England; Beware of the </l>
<l>unmeaning Luxuries of Bath &amp; of the Stink-</l>
<l>ing fish of Southampton."</l>
<q th:doc="P" th:eId="q11"/> <s th:doc="T" th:eId="s9"/>
</page>
</text>
</xml>
Note how the discontinuous text, marked with the suspend and resume signs in
TAGML, is converted to XML. The attribute @tag:n="1" links the
discontinued parts of the quotation together. This approach is also
suggested by Wendell Piez in Piez 2008. Finally, the header of both
fragments contains information about which groups of markup nodes are represented as
Trojan Horse elements (th:doc="T D P" versus th:doc="T P").
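As a minimal sketch of how a downstream tool might read this output, the following Python fragment pairs each start milestone (th:sId) with its end milestone (th:eId), collects the text in between, and then groups the discontinued parts of a quotation by their tag:n value. The function names milestone_ranges and join_discontinuous and the shortened sample are our own illustrative assumptions; they are not part of Alexandria.

```python
import xml.etree.ElementTree as ET

TH = "{http://www.blackmesatech.com/2017/nss/trojan-horse}"
TAG = "{http://tag.di.huc.knaw.nl/ns/tag}"

# Shortened fragment modelled on the export above (hypothetical test input).
SAMPLE = """<xml xmlns:tag="http://tag.di.huc.knaw.nl/ns/tag"
     xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse" th:doc="P T">
<page n="7"><s th:doc="T" th:sId="s9"/><q tag:n="1" th:doc="P" th:sId="q10"/><l>"Beware my Laura<q th:doc="P" th:eId="q10"/> (she would often say) </l><q tag:n="1" th:doc="P" th:sId="q11"/><l>Beware of the insipid Vanities</l><q th:doc="P" th:eId="q11"/><s th:doc="T" th:eId="s9"/></page>
</xml>"""


def milestone_ranges(root):
    """Return (name, group, link, text) for every start/end milestone pair."""
    open_ranges = {}  # sId -> [name, group, tag:n value, text pieces]
    closed = []

    def feed(text):
        # Text belongs to every range that is currently open.
        if text:
            for rng in open_ranges.values():
                rng[3].append(text)

    def walk(elem):
        start, end = elem.get(TH + "sId"), elem.get(TH + "eId")
        if start is not None:
            open_ranges[start] = [elem.tag, elem.get(TH + "doc"),
                                  elem.get(TAG + "n"), []]
        elif end is not None:
            name, group, link, pieces = open_ranges.pop(end)
            closed.append((name, group, link, "".join(pieces)))
        else:
            # Ordinary content element of the leading hierarchy.
            feed(elem.text)
            for child in elem:
                walk(child)
        # The tail follows the element, so it is fed after opening/closing.
        feed(elem.tail)

    walk(root)
    return closed


def join_discontinuous(ranges):
    """Group the text of ranges that share an element name and tag:n value."""
    joined = {}
    for name, group, link, text in ranges:
        if link is not None:
            joined.setdefault((name, link), []).append(text)
    return joined


ranges = milestone_ranges(ET.fromstring(SAMPLE))
for name, group, link, text in ranges:
    print(name, group, link, repr(text))
print(join_discontinuous(ranges))
```

The sketch assumes that every start milestone has a matching end milestone and that milestones are always empty elements; it does not validate either assumption.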
In order to illustrate the variety of options created by the XML export function
of Alexandria, the final paragraphs of this section
explore two hypothetical yet highly likely user scenarios. Let’s say that, after
modeling the text in TAGML and uploading the TAGML document in Alexandria, a user wants to (1) publish it online, so they need an
HTML version of the encoded text; and (2) send it for approval to another scholarly
editor who prefers to work in Word (e.g., using LibreOffice or Microsoft Office).
We can update the workflow diagram by adding these two steps:
As the diagram illustrates, the transformation steps in the editorial
workflow do not take place in the context of Alexandria, but in the user’s own workspace. This is a conscious
choice: we assume that most users prefer their own, possibly customized, tools,
transformation scenarios, and work environments. Accordingly, we provide TAGML
users with the opportunity to create workflows and pipelines in which TAGML is
seamlessly integrated with other tools. For instance, the user can use XSLT
to create an HTML document from the XML document.
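To give an impression of what such a transformation involves, the sketch below performs the same kind of step in Python rather than XSLT (a substitution made here only for illustration). It keeps the content elements of the leading hierarchy, drops the Trojan Horse milestones while preserving the text that follows them, and renames elements according to a mapping. The mapping in HTML_NAMES and the function names are hypothetical choices of ours, not something Alexandria prescribes.

```python
import xml.etree.ElementTree as ET

TH = "{http://www.blackmesatech.com/2017/nss/trojan-horse}"

# Hypothetical element-to-HTML mapping, chosen only for this illustration.
HTML_NAMES = {"text": "article", "page": "section", "head": "header", "l": "p"}

SAMPLE = """<xml xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse" th:doc="T">
<text><page n="6"><head><l>Letter 4th</l></head><s th:doc="T" th:sId="s0"/><l>Our neighbourhood was small, for it consisted </l><l>only of your Mother. <s th:doc="T" th:eId="s0"/></l></page></text>
</xml>"""


def is_milestone(elem):
    """True for Trojan Horse start or end markers."""
    return elem.get(TH + "sId") is not None or elem.get(TH + "eId") is not None


def to_html(elem):
    """Copy a content element to an HTML element, skipping milestones."""
    html = ET.Element(HTML_NAMES.get(elem.tag, "div"))
    if elem.get("n") is not None:
        html.set("data-n", elem.get("n"))  # e.g. the page number
    html.text = elem.text
    last = None
    for child in elem:
        if is_milestone(child):
            # Drop the marker but keep the text that follows it.
            if child.tail:
                if last is None:
                    html.text = (html.text or "") + child.tail
                else:
                    last.tail = (last.tail or "") + child.tail
        else:
            last = to_html(child)
            last.tail = child.tail
            html.append(last)
    return html


page = to_html(ET.fromstring(SAMPLE))
print(ET.tostring(page, encoding="unicode", method="html"))
```

Because the milestones are simply discarded, the sentence structure of group “T” is lost in this HTML; a user who wants to keep it would need an additional step that turns milestone pairs into, for instance, inline elements or class attributes.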
For the transformation to Word, we use the open source software OxGarage, a
RESTful web service that was created by members of the TEI community and allows
users to manage transformations between various document formats.[9] Even without any additional information, the XML document sampled above
(with the page structure as leading hierarchy) is easily transformed into a clearly
readable Word document:
Note that the text of the addition, marked with add tags in
both the source TAGML and the XML output, is automatically surrounded by diacritical
marks in the Word output, and that the text marked as deleted in the TAGML and XML
sources is represented as crossed out text between square brackets.
Future work
So far, our contributions to Balisage about TAG have discussed work in progress, and
the present contribution is no different. While the basic export functionalities perform
well, there are a few steps that need to be taken before TAGML documents can be fully
converted to an XML format. At the moment, non-linear structures are not optimally
converted. We conceptualized this structure as a linear stream of text “splitting” into
two or more branches. In the case of an in-text revision like a substitution, the split
leads to two branches of the text, each with a different reading. As the attentive
reader may have seen in figure 15 (Figure 16), the TAGML approach
to a non-linear structure is to encode the splitting into branches as follows:
some linear text <| branch 1 | branch 2 |> more linear text.[10] For example, the non-linear structure caused by the substitution in Austen’s
text is encoded as follows:
[s> […] had spent a fortnight in Bath & had <|[del>slept<del]|[add>supped<add]|> one night in Southampton.<s]
Here, the notation <| and |> indicates the start and end of
branches, with the | to separate them. One branch reads
slept and one reads supped. Currently, this branching
structure is quite literally converted as such to XML:
<s> had spent a fortnight in Bath &amp; had
<tag:branches th:doc="_default" th:sId=":branches6"/>
<tag:branch th:doc="_default" th:sId=":branch7"/>
<del>slept</del>
<tag:branch th:doc="_default" th:eId=":branch7"/>
<tag:branch th:doc="_default" th:sId=":branch8"/>
<add>supped</add>
<tag:branch th:doc="_default" th:eId=":branch8"/>
<tag:branches th:doc="_default" th:eId=":branches6"/>
one night in Southampton.
</s>
The branches are represented as Trojan Horse elements and linked
to one another with the Trojan Horse attributes. While this is technically correct, it
is difficult for humans to read and equally hard to transform to valid TEI-XML.
Non-linear TAGML is, however, not easy to transform automatically into XML: where TAGML
uses general syntactical symbols, TEI offers multiple options with markup elements
like subst, mod, or choice. This
conversion will potentially require user input as well.
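To make the problem concrete, here is a minimal sketch of one possible post-processing step, under the assumption that the user has decided that a branching consisting of exactly one del followed by one add should become a TEI subst. Any other combination of branches is left untouched, since choosing between subst, mod, or choice would need the user's input. The function name branches_to_subst is hypothetical, and the sketch only looks at the direct children of the element it is given.

```python
import xml.etree.ElementTree as ET

TH = "{http://www.blackmesatech.com/2017/nss/trojan-horse}"
TAG = "{http://tag.di.huc.knaw.nl/ns/tag}"

# The export shown above, with namespace declarations added so it parses alone.
SAMPLE = """<s xmlns:tag="http://tag.di.huc.knaw.nl/ns/tag"
   xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse"> had spent a fortnight in Bath &amp; had
<tag:branches th:doc="_default" th:sId=":branches6"/>
<tag:branch th:doc="_default" th:sId=":branch7"/>
<del>slept</del>
<tag:branch th:doc="_default" th:eId=":branch7"/>
<tag:branch th:doc="_default" th:sId=":branch8"/>
<add>supped</add>
<tag:branch th:doc="_default" th:eId=":branch8"/>
<tag:branches th:doc="_default" th:eId=":branches6"/>
one night in Southampton.
</s>"""

MARKERS = (TAG + "branches", TAG + "branch")


def branches_to_subst(parent):
    """Rewrite each del/add branching among parent's children as <subst>."""
    children = list(parent)
    i = 0
    while i < len(children):
        start = children[i]
        sid = start.get(TH + "sId")
        if start.tag != TAG + "branches" or sid is None:
            i += 1
            continue
        # Find the end milestone that closes this branching.
        j = next((k for k in range(i + 1, len(children))
                  if children[k].tag == TAG + "branches"
                  and children[k].get(TH + "eId") == sid), None)
        if j is None:
            i += 1
            continue
        span = children[i:j + 1]
        readings = [c for c in span if c.tag not in MARKERS]
        if [r.tag for r in readings] == ["del", "add"]:
            subst = ET.Element("subst")
            subst.tail = children[j].tail  # text after the branching
            for reading in readings:
                reading.tail = None        # drop whitespace between markers
                subst.append(reading)
            position = list(parent).index(start)
            for child in span:
                parent.remove(child)
            parent.insert(position, subst)
        i = j + 1
    return parent


sentence = branches_to_subst(ET.fromstring(SAMPLE))
print(ET.tostring(sentence, encoding="unicode"))
```

For the Austen example this yields a subst element wrapping the del and add readings, followed by the remaining text of the sentence. Nested branchings and branches that contain plain text instead of markup are outside the scope of this sketch.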
Another item on our “(Soon) To Do” list is to move away from grouping markup elements
with identifiers. We are at present examining the possibilities to implement a TAGML
schema and to include in the schema information about the hierarchical structure(s) in a
TAGML document. This would mean, for instance, that the user identifies which markup
nodes are part of a certain hierarchy, which nodes can be shared between hierarchies,
etc. Much like an XSL file, the schema may contain other information about the
export. The user can subsequently point to this TAGML schema document when they give the
export command to the Alexandria
server. And finally, on a more general level, future work entails the possibility for
multiple users to collaborate in Alexandria. The
repository is now initialized locally, and a user of Alexandria can already check in, check out, and edit a TAGML document on
their local machine, but we aim to enable collaboration between multiple users. This
requires, among other things, further development of the diff functionality, as we want to track
the edits made to a TAGML document on the level of the text as well as the
markup.
Reflection
It’s a well-known fact that certain textual and documentary structures are less suited
to be encoded as XML, such as concurrent or overlapping hierarchies, discontinuous
text,
and non-linear structures. In the past thirty years or so, numerous ways to deal with
these structures have been proposed, ranging from XML-based approaches (to name but
a
few: empty or virtual elements, linking, Trojan Horse markup, stand-off approaches,
encoding the same text multiple times, XCONCUR) to non-XML proposals (such as LMNL,
TexMECS, or EARMARK).[11] There are various reasons why the proposed alternatives have not been
adopted as text encoding standards. Some of them focused primarily on addressing just
one limitation of the XML data model — e.g., overlap — and did not address the wider
range of non-hierarchical text phenomena. Others never transcended the experimental
phase, or needed to be abandoned simply because funding ran out. Nevertheless, the
academic publications related to these proposals still make for an interesting read
as
they provide insight into the intellectual and technological history of text modeling.
Most of them are not “just” technical: they are based on a philosophy of text. After
all, when examining the question of how to express text informationally, the authors
had
to formulate their definition of text. If it’s not an Ordered Hierarchy of Content
Objects, then what is it?
We conceptualize text as a partially ordered sequence of characters, inscribed on
a
material carrier. The information derived from the text can best be organized in a
network structure, for which we propose a hypergraph. In this paper, we aimed to
demonstrate the value of looking beyond the prevalent technologies and examining the
alternatives vis-à-vis their philological notions of text, their conceptual model,
the
logical implementation(s), etc. We intended for the reader to realize that each approach
to text modeling has its advantages and disadvantages. The best choice is not
necessarily what is most commonly used, but what best addresses one’s research
requirements and objective(s). There are many pragmatic reasons to encode literary
historical texts in XML, but the XML framework should not become synonymous with the
framework in which we conceptualize text.
We illustrated this notion by demonstrating how TAG can be combined with XML in an
editorial workflow to model, store, edit, process, analyze, and publish text. The
workflow combines the advantages of two data models: while TAG performs well with
modeling and storing literary historical text, XML forms the input of many tools for
further text analysis, transformation, and visualization. The XML output of the TAG
reference implementation Alexandria uses Trojan Horse
elements to avoid overlap conflicts. We opted for Trojan Horse because it is well
known
in the text encoding community and it is relatively easy, from a computational
perspective, to have the computer generate milestones and unique IDs. Discontinuous
elements are transformed into XML by adding a matching attribute (e.g.,
tag:n="1") on the tags of the discontinued elements.
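The claim that milestones and unique IDs are easy to generate can be illustrated with a small sketch. It assumes that the markup is available as character ranges over the text, each with an element name and a group identifier, and it writes every range as a pair of Trojan Horse milestones. The function trojan_horse, the range format, and the ID scheme (element name plus a counter) are illustrative assumptions of ours and do not describe Alexandria's internals.

```python
import re
from xml.sax.saxutils import escape


def trojan_horse(text, ranges):
    """Serialize character ranges as Trojan Horse start/end milestones.

    ranges: list of (start, end, name, group) with 0 <= start < end <= len(text).
    """
    events = []
    for n, (start, end, name, group) in enumerate(ranges):
        ident = f"{name}{n}"  # unique because n is unique per range
        events.append((start, 1, n,
                       f'<{name} th:doc="{group}" th:sId="{ident}"/>'))
        events.append((end, 0, -n,
                       f'<{name} th:doc="{group}" th:eId="{ident}"/>'))
    # At the same offset, end milestones (0) are written before start ones (1).
    events.sort(key=lambda event: event[:3])

    out, pos = [], 0
    for offset, _, _, tag in events:
        out.append(escape(text[pos:offset]))
        out.append(tag)
        pos = offset
    out.append(escape(text[pos:]))
    return "".join(out)


text = "it consisted only of your Mother."
split = text.index("only")
ranges = [
    (0, split, "l", "D"),          # first manuscript line
    (split, len(text), "l", "D"),  # second manuscript line
    (0, len(text), "s", "T"),      # sentence overlapping both lines
]
result = trojan_horse(text, ranges)
print(result)
# Stripping the milestones again gives back the original text.
print(re.sub(r"<[^>]+>", "", result) == text)
```

Since every range is written as two empty elements, overlapping ranges never produce ill-formed XML; the th: namespace declaration is expected on an enclosing element, as in the export fragments above.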
If we consider scholarly editing as an ongoing process of translating information
from one carrier to another, it’s clear that scholars need to keep track of and
understand these data transformations in order to make informed choices. By describing
in detail the choices we made in designing TAG, by illustrating how these choices
are
reflected in the data model, and by proposing a workflow in which users have the
opportunity to influence the TAGML-to-XML conversion process, we hope to have provoked
amongst today’s textual scholars a continuous curiosity for data models for text
modeling.
References
[Barnard et al. 1995] Barnard, David T., Burnard,
Lou, Gaspart, Jean-Pierre, Price, Lynne A., Sperberg-McQueen, C. Michael, & Varile,
Giovanni Battista. Hierarchical encoding of text: Technical problems and SGML
solutions. Computers and the Humanities, vol.29,
no. 3, 1995, pp. 211-231. doi:https://doi.org/10.1007/BF01830617.
[Bleeker et al. 2020] Bleeker, Elli, Bram
Buitendijk and Ronald Haentjens Dekker. Marking Up Microrevisions With Major
Implications: Non-linear Text in TAG. Presented at Balisage: The Markup
Conference 2020, Washington, DC, July 27-31, 2020. Proceedings of
Balisage: The Markup Conference 2020. Balisage Series on Markup
Technologies, vol. 25, 2020.
doi:https://doi.org/10.4242/BalisageVol25.Bleeker01.
[Cummings 2008] Cummings, James. The Text
Encoding Initiative and the Study of Literature. A
Companion to Digital Literary Studies, edited by Susan Schreibman and Ray
Siemens. Oxford: Blackwell, 2008.
http://www.digitalhumanities.org/companionDLS/.
[Haentjens Dekker and Birnbaum 2017] Haentjens
Dekker, Ronald and David J. Birnbaum. It’s More Than Just Overlap: Text As
Graph. Presented at Balisage: The Markup Conference 2017,
Washington, DC, August 1-4, 2017. Proceedings of Balisage: The Markup
Conference 2017. Balisage Series on Markup Technologies, vol. 19, 2017.
doi:https://doi.org/10.4242/BalisageVol19.Dekker01.
[Haentjens Dekker et al. 2018] Haentjens Dekker,
Ronald, Elli Bleeker, Bram Buitendijk, Astrid Kulsdom and David J. Birnbaum.
TAGML: A markup language of many dimensions. Presented at Balisage:
The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup
Technologies, vol. 21, 2018.
doi:https://doi.org/10.4242/BalisageVol21.HaentjensDekker01.
[Jagadish et al. 2004] Jagadish, H.V., Laks V.S.
Lakshmanan, M. Scannapieco, D. Srivastava, and N. Wiwatwattana. Colorful XML: One
Hierarchy Isn’t Enough. Presented at SIGMOD 2004, Paris, France, June 13–18,
2004. doi:https://doi.org/10.1145/1007568.1007598.
[Kimber 2011] Kimber, Eliot. DITA Document
Types: Enabling Blind Interchange Through Modular Vocabularies and Controlled
Extension. Presented at Balisage: The Markup Conference 2011, Montréal,
Canada, August 2-5, 2011. Proceedings of Balisage: The Markup
Conference 2011. Balisage Series on Markup Technologies, vol. 7, 2011.
doi:https://doi.org/10.4242/BalisageVol7.Kimber01.
[Marcoux et al. 2011] Marcoux, Yves, Michael
Sperberg-McQueen, and Claus Huitfeldt. Expressive Power of Markup Languages and
Graph Structures. Presented at the Digital Humanities Conference 2011, Stanford, CA, June
19-22, 2011, pp. 178-180.
https://core.ac.uk/reader/48606830.
[Niu et al. 2019] Niu Ya-Wei, Qu Cun-Quan, Wang
Guang-Hui, Yan Gui-Ying. RWHMDA: Random Walk on Hypergraph for Microbe-Disease
Association Prediction. Frontiers in
Microbiology, vol. 10, 2019. doi:https://doi.org/10.3389/fmicb.2019.01578.
[Peroni et al. 2014] Peroni, Silvio, Francesco
Poggi and Fabio Vitali. Overlapproaches in Documents: a Definitive Classification
(in OWL, 2!). Presented at Balisage: The Markup Conference 2014, Washington,
DC, August 5-8, 2014. In Proceedings of Balisage: The Markup
Conference 2014. Balisage Series on Markup Technologies, vol. 13, 2014.
doi:https://doi.org/10.4242/BalisageVol13.Peroni01.
[Pierazzo 2015] Pierazzo, Elena. TEI: XML
and Beyond. Presented at the Text Encoding Initiative Conference and Members
Meeting 2015, Lyon (France), October 28-31, 2015. Abstract of talk available online
at
http://tei2015.huma-num.fr/en/papers/#140.
[Sahle 2013] Sahle, Patrick. Digitale
Editionsformen. Zum Umgang mit der Überlieferung unter den Bedingungen des
Medienwandels. Teil 3: Textbegriffe und Recodierung. Norderstedt: BoD, 2013.
https://kups.ub.uni-koeln.de/5353/
[Schmidt 2010] Schmidt, Desmond. The
Inadequacy of Embedded Markup for Cultural Heritage Texts. Literary and Linguistic Computing 25, no. 3, 2010, pp. 337-356.
doi:https://doi.org/10.1093/llc/fqq007.
[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C. M. and Claus Huitfeldt. Markup Discontinued: Discontinuity
in TexMecs, Goddag Structures, and Rabbit/Duck Grammars. Presented at
Balisage: The Markup Conference 2008, Montréal, Canada, August 12-15, 2008. Proceedings of Balisage: The Markup Conference 2008. Balisage Series on
Markup Technologies, vol. 1, 2008.
doi:https://doi.org/10.4242/BalisageVol1.Sperberg-McQueen01.
[Sperberg-McQueen 2018] Sperberg-McQueen, C.M.
Representing Concurrent Document Structures Using Trojan Horse
Markup. Presented at Balisage: The Markup Conference 2018, Washington, DC, July
31 - August 3, 2018. Proceedings of Balisage: The Markup
Conference 2018. Balisage Series on
Markup Technologies, vol. 21, 2018.
doi:https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01.
[Watanna, 1916] Watanna, Onoto. Marion, the
Story of an Artist’s Model. New York: W.J. Watt.
URN:oclc:record:1048793515.
[1] The authors are very grateful for the reviewers’ comments which have been insightful,
useful, and highly appreciated. Many thanks.
[2] See notably Huitfeldt 1994 (p. 143; 147-151), Sahle 2013 (p. 381-382), and Pierazzo 2015 (p.73-74)
in relation to the conceptual model on which the encoding guidelines of the TEI
are based.
[3] To give but one example, the edges in the graph were first directed (Haentjens Dekker and Birnbaum 2017), then undirected with the order of the Text nodes in
the graph derived from their distance from the root node (Haentjens Dekker et al. 2018). The current version of the hypergraph data model has
both directed edges (for the text-to-text nodes) and undirected edges (for the
markup and annotation nodes).
[4] In earlier publications, we referred to groups of Markup nodes as
layers, but we found this term to
be confusing as “layers” are often used differently in different
(humanities) contexts.
[5] All text, documentary transcriptions, and facsimiles are retrieved from
the digital diplomatic edition Jane Austen’s Fiction
Manuscripts: Digital Edition, edited by Kathryn Sutherland
and her team. The digital edition can be found at https://janeausten.ac.uk/index.html.
[6] The TAGML transcription is made in the Sublime Text editor, which offers
TAGML syntax highlighting. See the GitHub page of the project for the most up-to-date
information about the TAGML syntax highlighting.
Dekker, Ronald
Haentjens, Bram Buitendijk, and Elli Bleeker. Parsing a Markup Language That
Supports Overlap and Discontinuity. Proceedings of the
ACM Symposium on Document Engineering 2020, pp. 1-4.
doi:https://doi.org/10.1145/3395027.3419590.
Huitfeldt, Claus.
Multi-dimensional Texts in a One-Dimensional Medium. Computers and the Humanities, vol. 28, no. 4, 1994, pp.
235-241. doi:https://doi.org/10.1007/BF01830270.
DeRose, Steven J. Markup
Overlap: a Review and a Horse. Presented at Extreme Markup Languages 2004, Montréal, Québec, August 2-6, 2004.
http://xml.coverpages.org/DeRoseEML2004.pdf.
Sperberg-McQueen, C.M. and Claus Huitfeldt. GODDAG: A Data Structure for
Overlapping Hierarchies. International Workshop on
Principles, 2000.
doi:https://doi.org/10.1007/978-3-540-39916-2_12.
Kathryn Sutherland, editor. Jane Austen’s Fiction Manuscripts: A Digital Edition. Available
at http://www.janeausten.ac.uk/, consulted July 16th, 2021.