Introduction[1]
When we say that text encoding is a means of making explicit an interpretation of that text, we mean that the encoder is compelled to explicitly formulate their underlying assumptions about the text. We often forget to point out that text encoding also implies a (subconscious) choice of a certain data model. Needless to say, not all data models are equally suitable to express and query all kinds of textual information. It is crucial, then, for encoders to be(come) aware of the consequences of their choices, not only on the level of the tagset but also on the level of the data model. As a result, practising digital textual scholarship – the modeling, encoding, and analysing – can be as informative and enlightening as the end product.
Since data models for text are usually developed according to a specific conceptual idea of text, it is interesting to see what textual features are natively supported by a data model. What are, for example, the consequences of expressing text as a consecutive sequence of characters, with annotations as ranges on the text (LMNL data model)? Or: how will representing textual information as RDF statements (cf. EARMARK, see Peroni and Vitali 2009) instead of an ordered rooted tree (cf. XML) change the way we think about text? In each case, the affordances of the chosen data model will inevitably affect our encoding practice and the outcomes of an analysis.
Ideally, then, the choice of a suitable data model is primarily informed by one's research questions and the particulars of the textual material, and not by a prevailing standard. As Michael Sperberg-McQueen noted in his concluding remarks at the Balisage 2009 conference: "It is not standards in themselves that are harmful, but mindless adherence to standards that is harmful" (Sperberg-McQueen 2009). Making an informed choice for a specific data model is accordingly related to the research needs of a scholar, who needs to be clear about what textual feature(s) they want to examine, what result(s) they expect from the text modeling, and how they intend to get there.
In this contribution, we investigate how the TAG data model addresses a persistent challenge for modeling and analysing literary and historical documents. The contribution builds upon two previous Balisage papers, which introduced respectively the TAG model (Haentjens Dekker and Birnbaum 2017) and the TAGML syntax (Haentjens Dekker et al. 2018). Presently, we will expand on the potential of TAG and TAGML to model and process complex textual phenomena. We take our examples from fragments of the authorial manuscripts of Virginia Woolf's To the Lighthouse (1927). The text on these documents presents quite a few modeling challenges: words are deleted midway through a sentence, phrases are inserted between the lines or in the margin, paragraphs are transposed, changes to the text structure are indicated with arrows or metamarks, etc. In short: the documents contain the sort of textual phenomena that test the limitations of a data model. For this contribution, we concentrate on one particular phenomenon: in-text revisions and similar non-linear text structures.[2]
After briefly illustrating what we mean by in-text revisions and how they constitute non-linear information structures (section 2 Non-linear text), we go on to demonstrate how TAGML allows encoders to mark up non-linear text in a straightforward and idiomatic manner. Using the concept of the computational pipeline, we show in section 3 (Encoding non-linear text) how a TAGML transcription of non-linear text is tokenized, parsed, and stored as a single TAG hypergraph for text. The fourth section, Analysing non-linear text: collation, discusses the topic of automated collation and outlines how the non-linear information that is stored in the individual TAG hypergraphs can be used to arrive at a more refined collation output via graph-to-graph comparison.
The paper intends to demonstrate how TAG allows scholars to be extremely precise in expressing their interpretation of textual variance, which, in turn, positively affects the subsequent processing and analysis of the encoded texts. Because the TAG project is under ongoing development, this contribution will not be your average tool presentation. Rather, we intend to show in some detail how textual information is stored, interpreted, and processed by our data model. We consider this essential to understanding the potential of our model for supporting textual analysis. By providing detailed insights into the design choices and technical implementation of TAGML and HyperCollate, we emphasize how the choices made on the level of the storage and processing of textual information can affect the subsequent analysis.
Non-linear text
Challenges for text encoding
Briefly put, non-linear text means that the text does not form a linear stream of characters. As we explained in Bleeker et al. 2018 and Haentjens Dekker et al. 2018, textual content is normally fully ordered information: the text characters form a stream of characters and their order is inherent to their meaning. Fully ordered text is parsed and processed as it is read: in Western scripts that means from left to right, from top to bottom.
In many cases, however, text is not always or consistently a linear structure. Textual variation, for example, may constitute partially ordered information. Take the in-text revision in Figure 1:
The original text of this fragment reads "aux pierres en saillie toute une écume". The words "en saillie" are crossed out and the word "noyées" is added above the line, changing the phrase into "aux pierres noyées toute une écume". The words "en saillie" and "noyées" occupy the same location in the text and are thus mutually exclusive.
Text encoding implies the interpretation, transcription, and encoding of textual inscriptions on the document page. In the case of documents with in-text revisions, markup can be used to identify the subsequent stages in the writing and revision process, or to label the different types of revisions. To illustrate the trickiness of encoding non-linear text, let's take a look at how it can be done in TEI/XML. The TEI Guidelines, the de facto standard for text encoding (TEI P5), are currently based on the XML data model, which means that literary texts are usually modeled as an ordered, monohierarchical tree. So, as text encoders set out to encode the various stages of writing and revision as thoroughly as possible, they face the tricky task of encoding partially ordered information in a fully ordered data structure. The example above can be encoded in TEI/XML with <del> and <add> elements, and – if needed – a <subst> element to group the two elements together as a single intervention:
aux pierres <subst><del>en saillie</del><add>noyées</add></subst> toute une écume.
Here, the opening tag <subst> indicates where the textual information becomes temporarily non-linear. We can say that the two readings "en saillie" and "noyées" constitute two paths through the text. The tags <del> and <add> identify the two separate paths. Within the individual paths, the text is fully ordered again: the words "en saillie" would have a different meaning (if any) if the text characters lost their order. At the closing tag </subst> the two branches rejoin and the text becomes fully linear again.
Similar examples of non-linear or partially ordered information that is expressed in markup are the <choice> or <app> elements. Both are used to "group a number of alternative encodings for the same point in the text".[3]
Because XML cannot natively express non-linear text structures at the level of the model or the syntax (Haentjens Dekker and Birnbaum 2017), the TEI Guidelines provide several dedicated elements and schemata. As a result, the temporary non-linearity expressed with the TEI/XML element <subst> can be licensed by and validated against a schema.[4] In theory, a query processor that has access to and understands this schema will be able to recognize the non-linear information expressed by the markup. Following our example above, the processor will understand that the words "en saillie" and "noyées" are located at the same place in the text stream and that they are mutually exclusive alternatives to each other.
Unfortunately, this scenario does not apply to the majority of XML query processors. In most cases, query processing remains limited to a linear level. Put differently: the TEI/XML file may conceptually represent the encoder's idea of non-linear text, but that concept is typically not shared with a processor. This has significant consequences for searching, querying, and analysis. Desmond Schmidt found, for instance, that only 10% of digital editions using inline markup "could find literal expressions that span inline substitutions" (Schmidt 2019, note 3). The section below illustrates further complications that arise when collating texts with in-text variations.
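First, though, a minimal sketch of our own (using Python and lxml; not an example taken from Schmidt's survey) makes the querying limitation concrete: a literal search for the revised reading fails because the string value of the element concatenates both branches of the substitution.

    from lxml import etree

    xml = ('<s>aux pierres <subst><del>en saillie</del>'
           '<add>noyées</add></subst> toute une écume.</s>')
    s = etree.fromstring(xml)

    # The string value concatenates *both* branches of the substitution:
    print(s.xpath('string(.)'))
    # -> 'aux pierres en saillienoyées toute une écume.'

    # A naive query for the revised reading therefore finds nothing:
    print(s.xpath("contains(string(.), 'pierres noyées')"))  # False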
Challenges for text analysis
In this contribution, we focus on one form of text analysis that would particularly benefit from having access to non-linear information: collation. Collation can be defined as the comparison of two or more versions ("witnesses") of a literary text in order to establish a record of the textual variance. To this end, a scholar can use designated collation tools like CollateX (CollateX) or Juxta (Juxta). However, since these automated collation tools operate on character strings, the non-linear information about revisions within individual witnesses is not used to arrive at an alignment of the texts.
Editors who are working with witnesses containing in-text variation are compelled to either choose only one revision stage per text, or to pass on relevant information about deletions and additions through the collation pipeline so that it is present in the collation output (Bleeker 2017, section 2.2.5; Beshero-Bondar 2017; Bleeker et al. 2018; Birnbaum et al. 2018; Beshero-Bondar and Viglianti 2018).
The first option implies that relevant information about an author's writing and revision process is ignored by the collation tool, which may negatively affect the analysis of the textual variance. With the second option – passing on markup tags (in flattened form) – editors can at least use the retained information to visualize the deleted words in the collation output,[5] or to raise the flattened transcription again (see the Variorum Frankenstein project, Birnbaum et al. 2018, discussed in more detail in section 4.1.2 Passing along markup). However, this option requires a considerable set of technical skills that may not be available to most scholarly editors. Furthermore, the collation tool would still operate on the character stream, and the non-linear information would still not be part of the alignment process.
This brings us to three important requirements for the way TAG should handle non-linear information. First, editors need to be able to mark up this kind of partially ordered information in a straightforward manner. Secondly, a processor needs to recognize non-linear information as such, so that the texts can be queried and searched more effectively. And finally, we need a collation program that recognizes non-linear text as two or more mutually exclusive paths through the text, from which it then chooses the best match.
Our Balisage 2018 paper already demonstrated how to represent non-linear information in TAGML (Haentjens Dekker et al. 2018); section 3.2 TAGML will therefore focus on how this information is interpreted by the TAGML parser and stored as a hypergraph. Section 4.2 HyperCollate then describes how the individual TAG hypergraphs can be collated by the hypergraph-based collation tool HyperCollate (https://huygensing.github.io/hyper-collate/), which recognizes non-linear information and thus produces a more refined collation output. Both sections are preceded by an overview of the related work done in these areas.
Encoding non-linear text
TAGML
We have defined non-linear text as partially ordered information, and we have emphasized that it is desirable for a markup system to natively represent partially ordered information. Now let's move on to the approach of TAG. This section first describes the pipeline of the TAGML parser. It subsequently illustrates the operations of the parser with three examples of textual variation within a manuscript fragment: a single deletion, an immediate deletion, and a grouped revision. For each example, we show the TAGML transcription and the Abstract Syntax Tree (AST) that is created by the TAGML parser.
Note that a draft manuscript can present a huge number of complex textual variations (substitutions within substitutions within substitutions, transpositions of multiple segments, revisions within a word, etc.). For this paper, we selected three short examples, lest the visualisations of the ASTs become too large and uninformative.
TAGML Pipeline
Figure 3 presents a schematic representation of the TAGML processing pipeline.
The input of the pipeline is a TAGML document that contains a combination of text and markup. Markup tags indicate whether the text is fully ordered, partially ordered, or unordered.[9] The TAGML document is first tokenized by the TAGML lexer, which produces a stream of TAGML tokens; each token contains information about its position in the TAGML document, its type, and its length. We will discuss the lexer in more detail below. The stream of TAGML tokens is subsequently parsed with an ANTLR-generated parser that uses the TAGML grammar, resulting in an AST of the input TAGML document. The most up-to-date version of the TAGML grammar can be found on GitHub. From the AST, an Abstract Semantic Graph (ASG) can be generated. In the TAG data model, the ASG is implemented either as a Multi-Colored Tree (MCT) or a hypergraph.[10] In a tree model, the markup elements start at the top level and are (almost) all above the text elements, which sit at the bottom in leaf nodes. In contrast, the TAG hypergraph model has the text elements at the centre of the model. The relationship between Text nodes and Markup nodes is expressed by hyperedges.
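As a schematic illustration of the lexer output described above (a sketch of our own in Python; the names are ours, not those of the reference implementation), each token can be pictured as a small record of position, type, and length:

    from dataclasses import dataclass

    @dataclass
    class TagmlToken:
        position: int  # character offset in the TAGML document
        type: str      # e.g. 'MarkupOpener', 'Text', 'MarkupCloser'
        length: int    # number of characters the token spans

    # The TAGML snippet '[s>text<s]' would lex to roughly:
    tokens = [
        TagmlToken(0, 'MarkupOpener', 3),  # [s>
        TagmlToken(3, 'Text', 4),          # text
        TagmlToken(7, 'MarkupCloser', 3),  # <s]
    ]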
Examples
The examples in this section come from the authorial notebooks of Virginia Woolf's To the Lighthouse, written between 1926 and 1927 (Woolf 1927). Digital facsimiles of the notebook pages are available via the digital archive Woolf Online.[11] For each example we show four representations: the manuscript fragment, the TAGML transcription, its AST as produced by the parser, and its hypergraph representation. Again, we'd like to point out that these examples were selected for their simplicity; more complex examples would result in an exceptionally large AST visualisation that would not fit into this paper.
Single deletion
If we follow the tag suggestions of the TEI Guidelines for transcription of primary sources and map them to TAGML,[12] a textual fragment with the deletion can be represented as follows:
[TEI>[s>it seemed [?del>indeed<?del] as if it were now settled<s]<TEI]

The question mark prefix in the [?del> tag indicates that the element and its textual content are optional, thus yielding two ways of reading the text. When processed by the TAGML parser, the following AST is produced:
The TAGML grammar is context-sensitive,[13] and a TAGML document consists of one or more chunks. Each chunk can contain either a start tag, an end tag, text, or text variation. In this diagram, the TAGML tokens are visualized in the blue leaf nodes. The fact that the lexer has different modes enables us to reuse a (sequence of) character(s): based on their position in the TAGML document, the same characters may get a different function. For example, if the lexer is in the 'Default' mode, a left square bracket [ is identified as the start of a markup opener, so the lexer switches to the 'Inside Markup Opener' mode. The lexer remains in this mode until it encounters a token that prompts it to switch to another mode; in this case that could be the > token, which triggers the lexer to switch back to the 'Default' mode ('pop mode // back to default'). The diagram shows how a TAGML token can trigger the shift to a different mode: the modes of the lexer are visualized in red and connected to the tokens that trigger them.
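A minimal mode-switching lexer of our own, written in Python for illustration only (the actual lexer is generated by ANTLR from the TAGML grammar, and this sketch handles only text and markup openers), shows how the same character can yield different tokens depending on the current mode:

    def lex(source):
        mode, tokens, buf = 'DEFAULT', [], ''
        for ch in source:
            if mode == 'DEFAULT':
                if ch == '[':  # '[' starts a markup opener: flush text, switch mode
                    if buf:
                        tokens.append(('TEXT', buf))
                        buf = ''
                    mode = 'INSIDE_MARKUP_OPENER'
                else:
                    buf += ch
            elif mode == 'INSIDE_MARKUP_OPENER':
                if ch == '>':  # '>' pops the lexer back to the default mode
                    tokens.append(('MARKUP_OPEN', buf))
                    buf = ''
                    mode = 'DEFAULT'
                else:
                    buf += ch
        if buf:
            tokens.append(('TEXT', buf))
        return tokens

    print(lex('[s>it seemed'))
    # [('MARKUP_OPEN', 's'), ('TEXT', 'it seemed')]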
From the resulting AST, a Multi-Colored Tree or a hypergraph can be generated. Because hypergraphs are more suitable for representing non-linear information, we will show the hypergraph for each TAGML document. Note that the AST only expresses the syntactic structure of the TAGML document, which means the start tags and end tags are not linked. Going from an AST to a hypergraph, then, means that the start and end tags need to be reconnected in order to form Markup nodes in the hypergraph.
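How that reconnection might work can be sketched as follows (our own illustration, not the Alexandria implementation; because TAGML ranges may overlap, a plain stack does not suffice, so each close tag is matched to the most recent unmatched open tag with the same name):

    def pair_tags(tokens):
        open_tags, markup_nodes = {}, []
        for position, (kind, name) in enumerate(tokens):
            if kind == 'MARKUP_OPEN':
                open_tags.setdefault(name, []).append(position)
            elif kind == 'MARKUP_CLOSE':
                start = open_tags[name].pop()  # most recent unmatched opener
                markup_nodes.append((name, start, position))
        return markup_nodes

    tokens = [('MARKUP_OPEN', 's'), ('TEXT', 'it seemed'), ('MARKUP_CLOSE', 's')]
    print(pair_tags(tokens))  # [('s', 0, 2)]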
Immediate deletion
In TAGML, the immediate deletion is expressed as follows:
[TEI>[s>The [del>im<del] picture of stark & compromising severity.<s]<TEI]

Note that the [del> tag does not have the optional prefix ? because we interpret an immediate revision as part of the same writing stage as the rest of the text. This means there is just one reading of the text, and that reading includes the two deleted text characters "im".
Grouped revision
A grouped revision is a clear example of non-linear, partially ordered information: alternative readings for the same point in the text. In the TAGML syntax, the branching of the text stream is indicated with <|, the individual branches are separated with a vertical bar |, and the converging of the branches is indicated with |>. Individual branches contain markup and text. The present example can thus be encoded like this:

[TEI>[s> something <|[del>trustful<del]|[add>trusting<add]|>, something childlike<s]<TEI]

As we will see in the visualization of the AST below, the TAGML parser will recognize and interpret the divergence and convergence signs, so that the content of the branches is considered variant text ('ITV_text').
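To make the partial order explicit, the parsed variation can be pictured as plain data (a sketch of our own, not the parser's actual output): fully ordered text before and after the variation, with two mutually exclusive branches in between.

    variation = {
        'before': ' something ',
        'branches': [
            [('del', 'trustful')],  # first branch: the deleted reading
            [('add', 'trusting')],  # second branch: the added reading
        ],
        'after': ', something childlike',
    }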
Taken together, the different visualizations of in-text revisions illustrate how TAG and TAGML allow for a natural and idiomatic digital representation of in-text variation, one that we think comes as close as possible to how a human encoder understands it. The next section will look at ways to transfer this understanding to a collation tool.
Analysing non-linear text: collation
Related work
We defined collation at its most basic level as the comparison of two or more versions ("witnesses") of a text to find (dis)similarities between or among them. As mentioned above, collation software typically does not excel at handling markup within individual witnesses: such tools either collate the witnesses looking only at the plain text – ignoring all tags and attributes or requiring users to remove them in a preprocessing phase – or they transform the markup into plain text characters so that the tags are collated but their meaning is ignored. In either case, any partially ordered information is overlooked. Still, because the requirement of including (certain) markup elements in the collation process continues to exist, scholars have invented some nifty ways to work around the limitations of prevalent collation software. We distinguish three main approaches, which are briefly discussed in the following paragraphs:
- Creating additional witnesses for each in-text revision;
- Passing along markup for postprocessing;
- Comparing structured (data-centric) XML files.
Creating additional witnesses for each in-text revision
Section 3.1.2 (Stand-off approaches) described how users of the MVD technology can create separate layers for each occurrence of in-text variation. These layers can subsequently be used as temporary witnesses for the plain text collation. Using the MVD collation functionality 'Compare', the temporary witnesses and the base texts are compared to one another. Both the official text versions and their temporary layers are merged into one MVD, which stores the text shared by each version and layer only once, similar to a variant graph. 'Compare' does not recognize markup elements and consequently does not differentiate between text and markup. In fact, it is not quite correct to say that the collation program ignores the markup information about non-linearity: the markup was never there in the first place.[14] The result of the collation is an MVD variant graph containing in-text revisions as well as between-document variation. In the variant graph, in-text non-linear structures (i.e., within one witness) are distinguished from external variation (i.e., between witnesses) by sigla.
Passing along markup
Some collation tools allow for markup elements to be passed on through their pipeline. The markup elements are ignored during the alignment process, but they are present in the output and can then be used for further analysis or visualisation purposes.
Juxta Commons accepts TEI/XML encoded files for collation. The tool offers an interface that lists all TEI/XML elements contained in the input witness; the user can select which elements should be part of the collation and which can be filtered out. The TEI/XML elements are not part of the alignment process proper: they are saved as stand-off annotations to the text tokens and passed on through the collation pipeline. The elements can be visualized in the heat map representation of the collation result, which shows deletions in red and additions in green. In other output formats of Juxta, the <add> and <del> tags are no longer there.
Another method to pass along markup through the collation pipeline requires coding on the side of the editor, who interacts with the underlying code of the collation tool. We illustrate this approach by looking at the software CollateX.[15] The JSON input format for CollateX allows for extra properties to be added to words. In a preprocessing step, the text is tokenized and transformed into JSON word tokens. Editors can make a selection of markup elements that they wish to attach to the JSON token as a "t" property (the "t" standing for "token"). The collation is executed using the value of the token's "n" property ("n" for "normalized"), but the selected markup is included in the JSON alignment table output via the "t" property, and can be processed in the visualisation of the alignment table. This approach does approximate the goal of having in-text revisions marked as non-linear text in both input and output, but at the point of alignment the collation algorithm is unaware of any non-linearity in the witness's text and treats the partially ordered information as a linear structure.
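For illustration, a sketch of such pretokenized input (following the JSON input format documented for CollateX; attaching a markup string to the "t" property is our own choice here):

    witness_input = {
        "witnesses": [
            {"id": "A", "tokens": [
                {"t": "pierres", "n": "pierres"},
                {"t": "<del>en saillie</del>", "n": "en saillie"}
            ]},
            {"id": "B", "tokens": [
                {"t": "pierres", "n": "pierres"},
                {"t": "noyées", "n": "noyées"}
            ]}
        ]
    }

The alignment compares only the "n" values; whatever is stored in "t" travels through the pipeline untouched and resurfaces in the alignment table output.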
Comparing data-centric XML
Finally, there are several approaches to comparing XML trees (Barabucci et al. 2016, Ciancarini et al. 2016). Some XML editors, like oXygen, have a built-in XML comparison function. In theory, this approach would allow for the comparison of TEI/XML transcriptions containing non-linear information encoded with <subst> or <app> elements. The comparison functionality is primarily developed for structured, record-based XML documents – e.g., XML documents containing address information such as person name, address, age – and the documents are compared on the level of the XML elements.
Comparing two record-based XML documents results in an overview of the differences in markup. For example, an <add> element in witness A would be aligned with an <add> element at the same relative location in witness B. Note that these matches are made on the element level and not on the level of the textual content. Textual scholars studying the revisions in literary texts, however, would typically give preference to the textual content.
The DeltaXML software does detect and display changes between two XML documents, either on the level of the XML elements or on the level of the text content (Delta XML). Their Document comparator tool compares two XML documents and merges them into a new XML document that contains additional attributes representing the variation. What is more, DeltaXML provides the option to identify "orderless containers": XML elements "whose children can be arranged in an order that is not considered significant". Orderless containers are marked by placing the deltaxml:ordered="false" attribute on the element, so that all its children will be considered to occur in any order. It's a description that seems to apply to our interpretation of the <subst>, <app>, and <choice> elements. Comparing XML documents that contain a combination of ordered and orderless data (i.e., the children of some XML elements are ordered, but others are orderless) remains, according to the DeltaXML developers, "a challenge". They propose to solve it by preprocessing the input XML documents.[16]
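Applied to our running example, marking the substitution as an orderless container might look like the following snippet (a hypothetical sketch; DeltaXML's exact configuration requirements may differ):

    <subst deltaxml:ordered="false">
        <del>en saillie</del>
        <add>noyées</add>
    </subst>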
That data-centric XML comparison methods usually do not work for text-centric encoded documents has already been pointed out by Di Iorio et al., who wrote that although XML is used to encode both literary documents and database dumps, "there is something different from diff-ing an XML-encoded literary document and an XML-encoded database" (Di Iorio et al. 2009, emphasis in original). In the case of comparing an encoded literary document, the authors state, the output of a diff should be evaluated on its "naturalness", that is: the ability of an algorithm to identify the changes that would be identified by a manual, human approach. To this end, Di Iorio et al. developed the implementation JNDiff.[17] Di Iorio et al. were right to point out that the comparison of encoded literary documents is quite unique. In the majority of cases, the collation of such documents needs to take place on the level of the text. Rather than being compared, the markup indicating in-text revision needs to be interpreted as it was intended by the encoder: as identifying the start and end of textual variation.
HyperCollate
In developing HyperCollate, we assumed the need for a collation tool that, first, recognizes the multiple writing stages within each witness text and, secondly, chooses the best match from these different stages. This implies that the non-linear structures in an encoded text need to be recognized by the collation program, and that the program needs to take this information into account during the alignment process. The technical implications are, first, that the input witnesses are not treated as linear strings of characters but as hypergraphs consisting of partially ordered information, and, secondly, that the collation program recognizes this partially ordered information and processes it accordingly.
As a result, a multidimensional text containing multiple revision stages would no longer have to be "flattened" before collation. More importantly, we estimate that the collation result would approximate the way a human editor would collate the texts. Accordingly, the collation tool would effectively support and advance scholarship.
HyperCollate's approach
HyperCollate takes as input two TAG hypergraphs of the individual witnesses. Note that each hypergraph may contain a combination of ordered, partially ordered and unordered information. The partially ordered information is represented as two or more branches in the hypergraph, with the text nodes in these branches having the same position in the text vis-à-vis the document root node (see the hypergraph visualisations in section 3.2.2 Examples).
The computational pipeline of HyperCollate can be visualized as follows:
For each hypergraph witness, the textual content is first tokenized into text tokens. By default, the text nodes are segmented on whitespace and punctuation. Note that the output of the tokenizer is not a linear stream of tokens: the tokens contain information about their distance vis-à-vis the root node (their "depth"), and in the case of non-linearity there may be more than one text token at the same position. The two witnesses are subsequently aligned. Alignment here means that HyperCollate calculates the smallest possible number of changes needed to change one set into the other.
The output of that alignment is a set of matching tokens. Again, the tokens in this set contain information about their depth, and there may be more than one text token at the same position. Based on the information from this set of matches, the two witness hypergraphs are merged into a new collation hypergraph that contains all information about the Text nodes and the Markup nodes of the input witnesses. All the nodes that are not aligned are unique to one of the two witnesses; all the nodes that are aligned can be reduced from two to one node, with labels on the edges indicating which node is part of which witness.
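A hedged sketch of such a witness token (our own names, not HyperCollate's actual data structures): segmented on whitespace, carrying its position and its "depth", so that two tokens from rival branches can share one position.

    from dataclasses import dataclass

    @dataclass
    class WitnessToken:
        text: str
        position: int  # place in the (partial) order of the witness text
        depth: int     # distance from the root node of the hypergraph
        branch: int    # which branch of a variation the token sits in

    # 'trustful' and 'trusting' occupy the same position in rival branches:
    a = WitnessToken('trustful', position=1, depth=4, branch=0)
    b = WitnessToken('trusting', position=1, depth=4, branch=1)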
For every witness there is always a fully connected path of Text nodes through the collation hypergraph: from the start node to the end node, following the sigla on the edges. Labeled hyperedges are used to associate the Markup node(s) with the Text nodes. The collation hypergraph can be visualized as an ASCII table or a collation graph (in .dot, SVG, or PNG format). In the case of more than two witnesses, the collation output could also be used as the basis for a new collation following the "progressive alignment" method (Spencer and Howe 2004).
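The merge step can likewise be pictured with a small sketch (illustrative only): aligned tokens collapse into a single node whose edge labels carry the sigla of the witnesses that share it, while unaligned tokens keep a single siglum.

    from dataclasses import dataclass, field

    @dataclass
    class CollationNode:
        text: str
        sigla: set = field(default_factory=set)  # e.g. {'W1', 'W2'}

    shared = CollationNode('something', {'W1', 'W2'})  # aligned in both witnesses
    unique = CollationNode('trusting', {'W1'})         # unique to witness 1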
Examples
In the paragraphs below, we return to the examples from section 3.2.2 Examples. Each example text fragment is collated with HyperCollate against another version of the same text that can be found in the print proofs of To the Lighthouse (Woolf 1927).[18] The HyperCollate output, a collation hypergraph, can be visualized in multiple ways, from an ASCII alignment table to an SVG graph. Each format has its own level of information density. Here, we provide an alignment table visualisation and an SVG collation graph visualisation to illustrate what difference this density makes: a simpler visualization may be clearer, but it may lack relevant information about the textual variance and/or the markup.
Single deletion
Input witness 1:
[TEI>[s>it seemed [?del>indeed<?del] as if it were now settled<s]<TEI]
Input witness 2:
[TEI>[s>as if it were settled<s]<TEI]
Immediate deletion
Input witness 1:
[TEI>[s>he appeared the [del>im<del] picture of stark & compromising severity.<s]<TEI]
Input witness 2:
[TEI>[s>he appeared the image of stark & compromising severity.<s]<TEI]
Grouped revision
Input witness 1:
[TEI>[s>something <|[del>trustful<del]|[add>trusting<add]|>, something childlike <s]<TEI]
Input witness 2:
[TEI>[s>something trustful, something childlike <s]<TEI]
Discussion
The contribution concentrated on representing in-text variation in TAGML and subsequently collating that information with HyperCollate. We described how the TAG model understands textual variation within one text version as non-linear, partially ordered information. The TAGML syntax allows encoders to express partially ordered information in a straightforward manner.
Partially ordered information is recognized and processed as such by the TAGML parser, and stored as multiple branches in a hypergraph for text. The Text tokens within each branch are mutually exclusive and have the same depth, meaning that they are both at the same distance from the root document node of the hypergraph.
HyperCollate is a hypergraph-based collation program that implements the TAG model. HyperCollate aligns Text tokens based on their relative position in the text as well as their depth in the hypergraph. The program recognizes the branches in the input hypergraphs: if two Text tokens have the same position number, the program finds the best match between them. The output of HyperCollate is a collation hypergraph that can be visualized in different ways; in this paper we showed an alignment table and a variant graph visualization.
Presently, the TAG approach to transcription and collation takes the text as leading, using the markup as a basis for recognizing in-text variation as partially ordered information. Nevertheless, future work could look into aligning hypergraphs while also looking at other types of markup, e.g., paragraph or sentence breaks. This would be a drastic adjustment to the collation algorithm, though, because it would require HyperCollate to prioritize not only the text but also the markup. Another topic of future work is the TAG query language (TAGQL), which will make it possible to query the information of both individual TAGML documents and the collation hypergraph.[19]
Finally, we continue working on the different output formats of the collation. Currently, the collation output can be visualized as an ASCII table or a collation graph (in .dot, SVG, or PNG format). The examples used in this contribution were small text fragments and simplified TAGML transcriptions for a reason: representing and collating two larger TAGML transcriptions, each containing several stages of revisions, would result in an AST and a collation hypergraph that, in their entirety, cannot be visualized in any meaningful way for the reader. In fact, the TAGML input and HyperCollate output contain a much larger variety of textual information than a visualization of the collation hypergraph can show. Since this information can be of instrumental value to a deeper study of the text, a future aim of HyperCollate is to provide output in a TAGML format. This could be similar to the TEI critical apparatus, and would allow scholars to continue their textual analysis on the collation output.
Conclusion
So far, all of our contributions to Balisage are characterized by an aspect of 'ongoing research' and the same applies to this contribution. Among other things, HyperCollate is not yet operating optimally and TAGML does not have a fully functioning query language yet. Nevertheless, we hope to have shown the value of looking beyond the prevalent standard and continuing to question how we think about, represent, and analyse texts digitally.
As more processes are automated and new methods spring from using digital technologies, we have more and better digital instruments to map the writing process. But we need to pay equal attention to the thoughts that go into making these instruments: how are scholarly activities automated? And how does that affect our understanding of and interaction with text? In short: it is important that textual scholars continue to explore different options and to critically evaluate to what extent a data model addresses their research needs.
The underlying goal of our contribution was therefore to provide transparency about the way a TAGML document is parsed and subsequently collated. We illustrated how we transferred our human understanding of in-text variation to the computer and how this intelligence is used to improve the alignment of textual witnesses. In testing, we have found that HyperCollate's more refined collation technology allows scholars to closely examine different forms of textual variation and to discover patterns in the writing processes of literary authors. We can see this even with the small examples of Woolf's text used in this contribution: in two of the three cases a word that was deleted in the draft version reoccurred in the print proofs.
We consider HyperCollate as an inclusive approach to collation: scholars are no longer required to 'flatten' the manuscript text, to dive deep into the code of the collation software, or to create additional transcriptions solely for collation purposes. Instead, they can preserve and use the information about an author's creative revision process and arrive at a collation result that may reveal information previously hidden.
References
[Alexandria] Alexandria. https://huygensing.github.io/alexandria/ ; Information about installing and using the Alexandria command line app is available at links on the TAG portal at https://huygensing.github.io/TAG/.
[Barabucci et al. 2016] Barabucci, Gioele, Paolo Ciancarini, Angelo Di Iorio, and Fabio Vitali. "Measuring the quality of diff algorithms: a formalization". In Computer Standards & Interfaces, vol. 46 (2016), pp. 52-65. doi:https://doi.org/10.1016/j.csi.2015.12.005.
[Barrellon et al. 2017] Barrellon, Vincent, Pierre-Edouard Portier, Sylvie Calabretto, and Olivier Ferret. "Linear extended annotation graphs". In Proceedings of ACM Document Engineering, Malta, September 2017 (2017). doi:https://doi.org/10.1145/3103010.3103011. Available online.
[Beshero-Bondar 2017] Beshero-Bondar, Elisa Eileen. "Rebuilding a Digital Frankenstein by 2018: Reflections toward a Theory of Losses and Gains in Up-Translation". Presented at Symposium on Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions, Washington, DC, July 31, 2017. In Proceedings of the Symposium on Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Beshero-Bondar01.
[Beshero-Bondar and Viglianti 2018] Beshero-Bondar, Elisa E., and Raffaele Viglianti. "Stand-off Bridges in the Frankenstein Variorum Project: Interchange and Interoperability within TEI Markup Ecosystems". Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Beshero-Bondar01.
[Birnbaum 2015] Birnbaum, David J. "Using CollateX with XML: Recognizing and Tracking Markup Information During Collation". Blog post, published on June 28, 2015 under Computer-supported collation with CollateX. Available at http://collatex.obdurodon.org/xml-json-conversion.xhtml.
[Birnbaum et al. 2018] Birnbaum, David J., Elisa E. Beshero-Bondar and C. M. Sperberg-McQueen. "Flattening and unflattening XML markup: a Zen garden of XSLT and other tools". Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Birnbaum01.
[Bleeker 2017] Bleeker, Elli. "Mapping invention in writing: digital infrastructure and the role of the genetic editor". Ph.D. thesis, University of Antwerp (2017). Available via https://repository.uantwerpen.be/docman/irua/e959d6/155676.pdf.
[Bleeker et al. 2018] Bleeker, Elli, Bram Buitendijk, Ronald Haentjens Dekker, and Astrid Kulsdom. "Including XML markup in the automated collation of literary texts". Presented at XML Prague 2018, Prague, Czech Republic, February 8–10, 2018. In XML Prague 2018 - Conference Proceedings (2018), pp. 77–95. Available via http://archive.xmlprague.cz/2018/files/xmlprague-2018-proceedings.pdf.
[Bleeker et al. 2019] Bleeker, Elli, Bram Buitendijk and Ronald Haentjens Dekker. "Between Freedom and Formalisation: a Hypergraph Model for Representing the Nature of Text". Long paper presented at the TEI Conference and Members meeting 2019, September 16-20 2019, Graz, Austria. Slides available from https://zenodo.org/record/3929350. doi:https://doi.org/10.5281/zenodo.3929350.
[Ciancarini et al. 2016] Ciancarini, Paolo, Angelo Di Iorio, Carlo Marchetti, Michelle Schririnzi, and Fabio Vitali. "Bridging the gap between tracking and detecting changes in XML". In Software: Practice and Experience, vol. 46, no. 2 (2016), pp. 227-250. doi:https://doi.org/10.1002/spe.2305.
[CollateX] CollateX. https://pypi.python.org/pypi/collatex.
[Dekhtyar and Iacob 2005] Dekhtyar, Alex, and Ionut Emil Iacob. "A Framework for Management of Concurrent XML Markup". In Data and Knowledge Engineering, vol. 52, no. 2 (2005), pp. 185-215. doi:https://doi.org/10.1016/j.datak.2004.05.005.
[Delta XML] DeltaXML, v. 2019-09-12. https://www.deltaxml.com/.
[Di Iorio et al. 2009] Di Iorio, Angelo, Michele Schirinzi, Fabio Vitali, and Carlo Marchetti. "A natural and multi-layered approach to detect changes in tree-based textual documents". In International Conference on Enterprise Information Systems, pp. 90-101. Springer: Berlin, Heidelberg (2009). doi:https://doi.org/10.1007/978-3-642-01347-8_8.
[Haentjens Dekker and Birnbaum 2017] Haentjens Dekker, Ronald and David J. Birnbaum. "It's more than just overlap: Text As Graph". Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1–4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Dekker01.
[Haentjens Dekker et al. 2018] Haentjens Dekker, Ronald, Elli Bleeker, Bram Buitendijk, Astrid Kulsdom and David J. Birnbaum. "TAGML: A markup language of many dimensions". Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.HaentjensDekker01.
[Huitfeldt and Sperberg-McQueen 2003] Huitfeldt, Claus, and C. M. Sperberg-McQueen. "TexMECS: An experimental markup meta-language for complex documents". Last revised on October 5, 2003. Available online.
[Van Hulle et al. (editors) 2019] Van Hulle, Dirk, Mark Nixon, and Vincent Neyt, editors. Samuel Beckett's Digital Manuscript Project. Antwerp: University Press Antwerp. http://www.beckettarchive.org, last updated in 2019.
[Juxta] Juxta Commons. http://juxtacommons.org/ and http://www.juxtasoftware.org/.
[LMNL data model] LMNLWiki. "LMNL data model". From the Lost Archives of LMNL. http://lmnl-markup.org/specs/archive/LMNL_data_model.xhtml.
[Marcoux et al. 2012] Marcoux, Yves, Claus Huitfeldt and C. M. Sperberg-McQueen. "The MLCD Overlap Corpus (MOC): Project report". Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). doi:https://doi.org/10.4242/BalisageVol8.Huitfeldt02.
[Ostrowski et al.] Ostrowski, Donald, David J. Birnbaum, and Horace G. Lunt. "The e-PVL: An electronic edition of the Rus' primary chronicle". Via http://pvl.obdurodon.org/.
[Peroni and Vitali 2009] Peroni, Silvio and Fabio Vitali. "Annotation with EARMARK for Arbitrary, Overlapping and Out-of-Order Markup". Presented at the DocEng'09 conference, September 16-18, 2009. In Proceedings of the 2009 ACM Symposium on Document Engineering (2009), pp. 171-180. ACM: New York. doi:https://doi.org/10.1145/1600193.1600232.
[Peroni et al. 2014] Peroni, Silvio, Francesco Poggi and Fabio Vitali. "Overlapproaches in documents: a definitive classification (in OWL, 2!)". Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13 (2014). doi:https://doi.org/10.4242/BalisageVol13.Peroni01. https://www.balisage.net/Proceedings/vol13/html/Peroni01/BalisageVol13-Peroni01.html.
[Piez 2008] Piez, Wendell. "LMNL in miniature. An introduction". Amsterdam Goddag Workshop, 1–5 December 2008. http://piez.org/wendell/LMNL/Amsterdam2008/presentation-slides.html.
[Portier et al. 2012] Portier, Pierre-Édouard, Noureddine Chatti, Sylvie Calabretto, Elöd Egyed-Zsigmond and Jean-Marie Pinon. "Modeling, Encoding And Querying Multi-structured Documents". In Information Processing & Management, vol. 48, no. 5 (2012), pp. 931-955. doi:https://doi.org/10.1016/j.ipm.2011.11.004.
[Schmidt 2008] Schmidt, Desmond. "What is a Multi-Version Document?" Blog post, published on March 5, 2008. https://multiversiondocs.blogspot.com/2008/03/whats-multi-version-document.html.
[Schmidt and Colomb 2009] Schmidt, Desmond, and Robert Colomb. "A data structure for representing multi-version texts online". In International Journal of Human-Computer Studies, vol. 67, no. 6 (2009), pp. 497-514. doi:https://doi.org/10.1016/j.ijhcs.2009.02.001.
[Schmidt and Fiormonte 2010] Schmidt, Desmond, and Domenico Fiormonte. "Multi-Version Documents: A digitisation solution for textual cultural heritage artefacts". In Intelligenza Artificiale, vol. 4, no. 1 (2010), pp. 56-61.
[Schmidt 2019] Schmidt, Desmond. "A Model of Versions and Layers". In Digital Humanities Quarterly, vol. 13, no. 3 (2019). Available from http://digitalhumanities.org/dhq/vol/13/3/000430/000430.html.
[Schonefeld 2007] Schonefeld, Oliver. "XCONCUR and XCONCUR-CL: A constraint-based approach for the validation of concurrent markup". In Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference (2007).
[Spencer and Howe 2004] Spencer, Matthew and Christopher Howe. "Collating Texts Using Progressive Multiple Alignment". In Computers and the Humanities, vol. 38 (2004), pp. 253-270. doi:https://doi.org/10.1007/s10579-004-8682-1.
[Sperberg-McQueen 2009] Sperberg-McQueen, C. M. "Sometimes a question of scale". Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:https://doi.org/10.4242/BalisageVol3.Sperberg-McQueen02.
[TEI P5] Text Encoding Initiative Consortium. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 4.0.0, last updated 13 February 2020. Available from http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html.
[Vitali 2016] Vitali, Fabio. "The expressive power of digital formats: Criticizing the manicure of the wise man pointing at the moon". Workshop lecture presented at the DiXiT Convention, vol. 2 (2016). Slides available at http://dixit.uni-koeln.de/wp-content/uploads/Vitali_Digital-formats.pdf.
[Witt et al 2007] Witt, Andreas, Oliver Schonefeld, Georg Rehm, Jonathan Khoo, and Kilian Evang. "On the Lossless Transformation of Single-File, Multi-Layer Annotations in Multi-Rooted Trees". In Proceedings of Extreme Markup Languages 2007, Montréal, Canada, 2007. Available from http://conferences.idealliance.org/extreme/html/2007/Witt01/EML2007Witt01.xml.
[Woolf 1927] Woolf, Virginia. 1927. To the Lighthouse. Holograph ms. Berg Collection, New York Public Library; Proofs, Smith College Libraries. Pamela L. Caughie, Nick Hayward, Mark Hussey, Peter Shillingsburg, and George K. Thiruvathukal, editors. Woolf Online.
[1] The authors express their gratitude to the reviewers for their extensive and insightful comments.
[2] Overlap is evidently a recurring favorite of the markup community, and the past decades have witnessed a significant number of alternative markup languages and/or data models to represent overlapping structures. However, we argued elsewhere that there is much more to modeling complex texts than overlapping hierarchies alone (Haentjens Dekker and Birnbaum 2017). In the context of in-text revisions, Desmond Schmidt noted that solving the overlap problem does not necessarily solve the challenges of modeling textual variation: not all cases of textual variation are cases of overlapping hierarchies, and hence solutions to overlapping hierarchies cannot adequately represent textual variation (Schmidt and Colomb 2009, p. 499). Indeed, as the table in Figure 2 shows, data models developed to represent overlapping structures do not necessarily provide for expressing non-linear information.
[3] See the TEI Guidelines on the <choice> element.
[4] Note that the schema needs to be written in a language that supports unorderedness, such as XML Schema.
[5] This is done for example in the Beckett Digital Manuscript Project (Van Hulle et al. (editors) 2019); see their editorial policy: https://www.beckettarchive.org/editorial.jsp.
[6] For some examples of discontinuity and overlap in literary text, see Bleeker et al. 2019.
[7] Note that while the extended Annotation Graphs-model is a stand-off annotation model, its syntax LeAG is an inline markup syntax.
[8] The practice of splitting individual files into multiple layers is also promoted by Witt et al 2007. This transforms different sets of linguistic corpora into a set of separate XML documents that have identical text, but different annotations.
[9] An example of unordered information is metadata.
[10] Technically, an ASG is a Directed Acyclic Graph (DAG) and the hypergraph is a rooted mixed property hypergraph. Conceptually, however, they can be considered similar, so in this context we will treat the TAG hypergraph as an implementation of the ASG as well.
[11] We are grateful to and acknowledge the Society of Authors as the literary representative of the Estate of Virginia Woolf. The Woolf material may not be used for commercial purposes. Please credit the copyright holder when reusing Woolf's work.
[12] It is currently not an official part of our research to map the TEI semantics to TAGML, but we intend to work towards TAGML being an alternative expression of TEI.
[13] Because TAGML allows markup ranges to overlap, the markup does not have to be closed in the exact reverse order in which it was opened, as in XML. This makes the TAGML grammar context-sensitive. The ANTLR4 grammar used in the TAGML library, however, is context-free, because ANTLR4 does not provide a way to encode context-sensitive grammars. The current parser that is generated from the grammar therefore cannot check whether every open tag (e.g. [tag>) is eventually followed by a corresponding close tag (<tag]). This and other validity checks are done in post-processing. We are currently examining how to build a context-sensitive parser that does not require post-processing.
[14] Bleeker 2017, pp. 106-114, elaborates on the MVD "Compare" technology and why it will not always produce the desired results for the study of textual variation.
[15] A number of projects make use of CollateX' option to pass along information about relevant markup elements through the collation pipeline, e.g., the Beckett Digital Manuscript Project (Van Hulle et al. (editors) 2019), the critical edition of the Primary Chronicles of David J. Birnbaum (Ostrowski et al.; Birnbaum 2015) and the Frankenstein Variorium Project (Beshero-Bondar and Viglianti 2018).
[16] See the project page of the DeltaXML Document comparator, last accessed August 20, 2020.
[17] Unfortunately we have not been able to test JNDiff as it has not been updated since 2014 and it is not clear whether it is still maintained.
[18] Digital facsimiles of the page proofs are also available via the digital archive Woolf Online. The same copyright notice applies.
[19] In recognition of the crucial role TEI/XML plays in the text encoding community, we already provide a TAGML-to-XML export function. A TAGML file uploaded to the Alexandria database can be exported to XML with the alexandria export-xml [filename] -o [filename].xml command, which exports the specified TAGML document as XML. Of course, the conversion from hypergraph to tree implies information loss. Overlapping hierarchies are represented in the XML output as Trojan Horse markup. See the Alexandria documentation for more information (Alexandria).