Cycles of time in the Gothenburg model for computer-aided collation
A project applying computer-aided collation, by which we mean the guidance of a computational system to compare texts, challenges our debugging capacities in ways that scale and consume far more time than humans may expect at the outset. Scholarly editors working on collation projects often do so in the context of preparing critical editions that present a text-bearing cultural object in all its variations. These editors tend to prioritize a high degree of accuracy in the marking of variants, and may come to a computer-aided collation project prepared for saintly patience and endurance to spot and correct each and every error. Yet endurance has its mortal limits, and computationally-generated errors require the developers' patient scrutiny to be directed not to the outputs but to the preparation of the inputs. Humanities scholars (the author included) may be tempted to hand-correct errors as they spot them to perfect an output for a swift project launch, but this is brittle and can even compound errors, even if applied by a computer script that finds and replaces every instance of a distinctive pattern in the output. We should limit correcting the output, even computationally, in favor of making changes at the guidance and preparation stages and re-running the collation software. Taking cycles of time to study the errors generated by a collation process should lead us to a state of enlightenment about precisely what we are comparing and precisely how it is to be compared. Do we consider computer collation a form of machine learning, in which we expend energy preparing the training set? But it is really the scholars who learn over cycles of collation re-processing, with enhanced awareness of their systems for comparison. Perhaps we should think of computer-aided collation as machine-assisted human learning.
This paper marks a return to work on a digital collation project that began in the years just before the pandemic, and paused a moment in 2020-2021 when several of the original project members moved to new jobs and I, as the one leading the preparation of the collation data, concentrated on teaching and running a program in a new university during a time of global pandemic crisis disrupting university work. Returning to work on the Frankenstein Variorum has necessitated reorienting myself to complicated processes and reviewing the problems we identified as we last documented and left them. This has been a labor of reviewing and sampling our past code, updating its documentation, and trying to find new and better ways to continue our work. We set out to prepare a comparison of five distinct versions of the novel Frankenstein, without taking any single edition as the default standard. Rather, we seek to provide readers a reading view for each of the five versions that demonstrates which passages were altered and how heavily they were altered. The Frankenstein Variorum should help readers to see a heavily revised text and understand the abandoned, discontinuous branches of that revision.
Much of this project has investigated methods for moving between hierarchical and
flat
text structures, moving back and forth between markup and strings as we thread our
texts into the machine power-loom of collation software, examine the woven output,
and contemplate how to untangle the snags and snarls we find in the woven textile
of collation output. The Gothenburg model guides our work, an elaboration of five
distinct stages of work necessary to guide machine-assisted collation as established
in 2009 by the developers of collateX and Juxta in Gothenburg, Sweden:
- Tokenization (deciding on the smallest units of comparison)
- Normalization/Regularization (deciding which distinct features in the source documents need to be suppressed from comparison. An easy example of a normalization step is indicating that ampersands stand in for the word and. A more difficult judgment call is the handling of notational metamarks or even editorial markup that assists reading in one document but should be skipped over by the collation software as not meaningful for comparison with other documents.)
- Alignment (locating, via software, parallel equivalent passages and their moments of unison and divergence)
- Analysis/Feedback (reviewing, correcting, and adjusting the alignment algorithm as well as the normalization/regularization method)
- Visualization (finding an effective way to show and share the collation results)[1]
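To make the first two stages concrete, here is a minimal, hypothetical sketch (not our project code) of tokenization and normalization expressed in XSLT 3.0: a sample witness string is split on whitespace, and each raw token is paired with a normalized form (lowercased, with ampersands expanded), so that an aligner could compare the normalized forms while the raw readings are preserved.

<?xml version="1.0" encoding="UTF-8"?>
<!-- A minimal, hypothetical sketch of the Tokenization and Normalization stages. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   <xsl:output method="xml" indent="yes"/>
   <!-- Any short witness reading would do; this one is from the novel. -->
   <xsl:variable name="witness" select="'It was on a dreary night of November, &amp; I beheld'"/>
   <xsl:template name="xsl:initial-template">
      <tokens>
         <!-- Tokenization: split the reading on whitespace. -->
         <xsl:for-each select="tokenize($witness, '\s+')">
            <!-- Normalization: lowercase and expand ampersands; the raw token is kept as content. -->
            <token n="{replace(lower-case(.), '&amp;', 'and')}">
               <xsl:value-of select="."/>
            </token>
         </xsl:for-each>
      </tokens>
   </xsl:template>
</xsl:stylesheet>

Run with Saxon's -it option, this yields one token element per word, with @n holding the normalized form: the same division of labor our spine makes between the @n of a reading group and the unnormalized reading contents.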
These stages are usually numbered, but that numbering belies the cyclicality of the
Gothenburg process. We scholars who collate have to try and try again to retool, rethink, and re-run, and even this is not a steady process, because we have to decide whether certain kinds of problems are best resolved by changing the pre-processing to improve the normalization prior to collation, or by applying corrections to collation output errors in post-processing. Collation projects of significant length and complexity require computational assistance to visualize not only the collation output but, crucially, to locate patterns in the collation errors. The Analysis/Feedback stage is critical; without giving it procedural care, we risk inaccuracies or the human error introduced by hand-correcting outputs.
In returning to work on the Frankenstein Variorum, I have concentrated on the Analysis/Feedback stage to try to find patterns in the snags of our collation weave, and to try different normalization and alignment methods to improve the work. The work requires close observation of meticulous details (plenty of myopic eye fatigue), but also a return to the long view of our research goals. As David J. Birnbaum and Elena Spadini have discussed, normalization is interpretive, reflecting on transcription decisions and deciding on what properties should and should not be considered worthy of comparison.[2] In our project, the great challenge for the normalization process has been contending with the diplomatic markup of the Shelley-Godwin Archive's manuscript notebook edition of Frankenstein. The meticulous page-by-page markup structured on page surfaces and zones cannot be compared with the semantic trees representing the other print-based editions in the project, though deletions do matter, as do the insertion points of marginal additions and corrections. Alignment of our variant passages is improved by revisiting our normalization algorithms, and also by testing our software tools as soon as we recognize problems.
In this paper, I venture that the software we choose to assist us with aligning collations
matters not only for its effectiveness in applying an algorithm like Needleman-Wunsch
or Dekker or TAN Diff, but also for its capacity to guide the scholar in the documentation
of a highly complex, time-consuming, cyclical process. In the following sections I
first unfold the elaborate pipeline process of our collation and visualization, to
then introduce problems we have found and efforts to resolve them by exploring a new
method of alignment. XSLT has been vital to the pre-processing and post-processing
of our edition and collation data, but now we find it may be a benefit for handling
the entire process of tree flattening, string collation, and edition construction.
Long ago in the early years of XSLT, John Bradley extolled its virtues not only for
publishing documents but also for scholarly text analysis and research.[3] Though it is common in the Digital Humanities community now to think of Python or RStudio as the only tools one needs for text analysis, the strengths of
XSLT seem realized in its precision and elegance in transforming structured and unstructured
text and its particular ease of documentation for the humanities researcher. In this
paper, I contemplate a move from applying normalizing algorithms to stringified XML
in Python to restating and revising those algorithms in XSLT.
The Frankenstein Variorum project and its production pipeline
The Frankenstein Variorum project (hereafter referred to as FV) is an ongoing collation project that began during the recent 1818-2018 bicentennial
celebrating the first publication of Mary Shelley's novel. We are constructing a digital
variorum edition that highlights alterations to the novel Frankenstein over five key moments from its first drafting in 1816 to its author’s final revisions
by 1831. If we think of each source edition for the collation as a thread
, the collation algorithm can be said to weave five threads together into a woven
pattern that helps us to identify the moments when the editions align together and
where they diverge. The project applies automated collation, so far working with the
software collateX to create a weave
of five editions of the novel, output in TEI-conformant XML. We have shared papers
at Balisage in previous years about our processing work, preparing differently-encoded source
editions for machine-assisted collation[4], and working with the output of collation to construct a spine
file in TEI critical apparatus form, which coordinates the preparation of the edition
files that hold elements locating moments of divergence from the other editions.[5]
Our source threads
for weaving the collation pattern include two well-known digital editions: The Pennsylvania Electronic Edition (PAEE), an early hypertext edition produced at the University of Pennsylvania in the mid
1990s by Stuart Curran and Jack Lynch that represents the 1818 and 1831 published
editions of the novel, and the Shelley-Godwin Archive's edition of the manuscript notebooks (S-GA) published in 2013 by the University of Maryland. We prepared two other editions in
XML representing William Godwin’s corrections of his daughter's edition from 1823,
and the Thomas copy
, which represents Mary Shelley’s corrections and proposed revisions written in a
copy of the 1818 edition left in Italy with her friend Mrs. Thomas.
Here is a summary of our project’s production pipeline, discussed in more detail in previous Balisage papers:
- Preparing differently-encoded XML files for collation. This involves several stages of document analysis and pre-processing:
  - Identifying markup that indicates structures we care about: letter, chapter, paragraph, and verse structures, for example. Where these change from version to version, we want to be able to track them by including this markup with the text of each edition in the collation.
  - Re-sequencing margin annotations coded in the SGA edition of the manuscript notebook, as these margin annotations were encoded at the ends of each XML file and needed to be moved into reading order for collation. (For this resequencing, we wrote XSLT to follow the @xml:ids and pointers in each SGA edition file.)
  - Determining and marking chunk boundaries (dividing the texts into units that mostly start and end the same way to facilitate comparison): We divided Frankenstein into 33 chunks roughly corresponding with chapter structures. (Later, in refining the collation, we subdivided some of these chunks, breaking some complicated chunks into 3-5 sub-chunks, for a more granular comparison of shorter passages.)
  - Flattening the edition files' structural markup into Trojan milestone elements (converting structural elements that wrap chapters, letters, and paragraphs, like <div xml:id="chap-ID">, into self-closed matched pairs of elements that mark the starts and ends of structures: <div sID="chap-ID"/> and <div eID="chap-ID"/>). The use of these empty milestones facilitates the inclusion of markup in the collation when, for example, a version inserts a new paragraph or chapter boundary within an otherwise identical passage in the other editions. Ultimately this flattening in pre-processing facilitates post-processing of the edition files from text strings to XML holding collation data.[6] (A minimal sketch of this flattening appears after this list.)
- Normalizing the text and processing the collation. Up to this point, we have been applying a Python script to read the flattened XML as text, and to introduce a series of about 25 normalizations via regular expression patterns that instruct collateX to (among other things):
  - Convert & into and
  - Ignore some angle-bracketed material that is not relevant to the collation: for example, convert <surface.+?/> and <zone.+?/> into "" (nothing) when comparing strings of the editions
  - Simplify to ignore the attribute values on Trojan milestone elements: for example, remove the @sID and @eID attributes by converting (<p)\s+.+?(/>) into $1$2 (or simply <p/>) so that all paragraph markers are read as the same.
- Work with the output of the collation to prepare, in a series of stages, a very important file that we call our spine, which contains the collation data that organizes the variorum edition project. The format of the output is a prototype of a TEI critical apparatus, and in preliminary stages it is not quite valid against the TEI. Here is an example of part of the spine in an early post-processing stage, featuring a passage in which Victor Frankenstein beholds his completed Creature in each of the five editions. Notice that the <rdg> elements contain mixed content: each holds a passage from the source edition that is, crucially, not normalized. This is our project's particular approach to the parallel segmentation method of preparing a critical apparatus in TEI P5. Eventually the spine
of the collation needs to contain data particular to each version in order to reconstruct that version so that it stores information about its variations from the other versions. In the not-quite-TEI stage of our post-collation process represented here, we can see that a use of the Trojan milestone markers to indicate the starting points of paragraphs, and their values indicate their location in the original source edition's XML hierarchy.[7]<cx:apparatus xmlns:cx="http://interedition.eu/collatex/ns/1.0"> <app> <rdgGrp n="['chapter']"> <rdg wit="f1818"><milestone n="4" type="start" unit="chapter"/> <head sID="novel1_letter4_chapter4_div4_div4_head1"/>CHAPTER </rdg> <rdg wit="f1823"><milestone n="4" type="start" unit="chapter"/> <head sID="novel1_letter4_chapter4_div4_div4_head1"/>CHAPTER </rdg> <rdg wit="fThomas"><milestone n="4" type="start" unit="chapter"/> <head sID="novel1_letter4_chapter4_div4_div4_head1"/>CHAPTER </rdg> <rdg wit="f1831"><milestone n="5" type="start" unit="chapter"/> <head sID="novel1_letter4_chapter5_div4_div5_head1"/>CHAPTER </rdg> <rdg wit="fMS"><lb n="c56-0045__main__1"/> <milestone spanTo="#c56-0045.04" unit="tei:head"/>Chapter </rdg> </rdgGrp> </app> <app> <rdgGrp n="['7<shi rend="sup"></shi><p/>']"> <rdg wit="fMS">7<shi rend="sup"></shi> <milestone unit="tei:p"/><lb n="c56-0045__main__2"/> </rdg> </rdgGrp> <rdgGrp n="['iv.<p/>']"> <rdg wit="f1818">IV.<head eID="novel1_letter4_chapter4_div4_div4_head1"/> <p sID="novel1_letter4_chapter4_div4_div4_p1"/></rdg> <rdg wit="f1823">IV.<head eID="novel1_letter4_chapter4_div4_div4_head1"/> <p sID="novel1_letter4_chapter4_div4_div4_p1"/></rdg> <rdg wit="fThomas">IV.<head eID="novel1_letter4_chapter4_div4_div4_head1"/> <p sID="novel1_letter4_chapter4_div4_div4_p1"/></rdg> </rdgGrp> <rdgGrp n="['v.<p/>']"> <rdg wit="f1831">V.<head eID="novel1_letter4_chapter5_div4_div5_head1"/> <p sID="novel1_letter4_chapter5_div4_div5_p1"/> </rdg> </rdgGrp> </app> <app> <rdgGrp n="['it']"> <rdg wit="f1818">I<hi sID="novel1_letter4_chapter4_div4_div4_p1_hi1"/>T <hi eID="novel1_letter4_chapter4_div4_div4_p1_hi1"/> </rdg> <rdg wit="f1823">I<hi sID="novel1_letter4_chapter4_div4_div4_p1_hi1"/>T <hi eID="novel1_letter4_chapter4_div4_div4_p1_hi1"/> </rdg> <rdg wit="fThomas">I<hi sID="novel1_letter4_chapter4_div4_div4_p1_hi1"/>T <hi eID="novel1_letter4_chapter4_div4_div4_p1_hi1"/> </rdg> <rdg wit="f1831">I<hi sID="novel1_letter4_chapter5_div4_div5_p1_hi1"/>T <hi eID="novel1_letter4_chapter5_div4_div5_p1_hi1"/> </rdg> <rdg wit="fMS">It </rdg> </rdgGrp> </app> <app> <rdgGrp n="['was', 'on', 'a', 'dreary', 'night', 'of']"> <rdg wit="f1818">was on a dreary night of </rdg> <rdg wit="f1823">was on a dreary night of </rdg> <rdg wit="fThomas">was on a dreary night of </rdg> <rdg wit="f1831">was on a dreary night of </rdg> <rdg wit="fMS">was on a dreary night of </rdg> </rdgGrp> </app> <app> <rdgGrp n="['november', '']"> <rdg wit="fMS">November <lb n="c56-0045__main__3"/> </rdg> </rdgGrp> <rdgGrp n="['november,']"> <rdg wit="f1818">November, </rdg> <rdg wit="f1823">November, </rdg> <rdg wit="fThomas">November, </rdg> <rdg wit="f1831">November, </rdg> </rdgGrp> </app> <app> <rdgGrp n="['that', 'i', 'beheld']"> <rdg wit="f1818">that I beheld </rdg> <rdg wit="f1823">that I beheld </rdg> <rdg wit="fThomas">that I beheld </rdg> <rdg wit="f1831">that I beheld </rdg> <rdg wit="fMS">that I beheld </rdg> </rdgGrp> </app> <app> <rdgGrp n="['<del>the', 'frame', 'on', 'whic<del>', 'my', 'man', 'completeed,.', '<del>and<del>']"> <rdg wit="fMS"><del rend="strikethrough" 
sID="c56-0045__main__d2e9718"/> the frame on whic<del eID="c56-0045__main__d2e9718"/> my man comple<mdel>at</mdel>teed,. <del rend="strikethrough" sID="c56-0045__main__d2e9739"/> And<del eID="c56-0045__main__d2e9739"/><lb n="c56-0045__main__4"/> </rdg> </rdgGrp> <rdgGrp n="['the', 'accomplishment', 'of']"> <rdg wit="f1818">the accomplishment of my toils. </rdg> <rdg wit="f1823">the accomplishment of my toils. </rdg> <rdg wit="fThomas">the accomplishment of my toils. </rdg> <rdg wit="f1831">the accomplishment of my toils. </rdg> </rdgGrp> </app>
- Apply this collation data in post-processing to prepare the edition files:
  - Raise the flattened editions to their original form (convert Trojan milestones back into the original tree hierarchy).
  - With XSLT, pull information from the spine file, and insert <seg> elements in each of the distinct edition files to mark every point of variance captured in the collation. Each <seg> element is mapped to a specific location in the collation spine file. This mapping is done by recording in the value of the seg @xml:id a modified XPath expression for the particular <app> element in the collation spine. (The modification to the XPath is necessary so that the result is valid as an xml:id, which may not contain a solidus.) Here is an example of <seg> elements applied in the 1818 edition file. Here the @xml:id attributes are keyed to the specific <app> elements in the portion of the spine representing collation unit 10. Where the collation spine marks passages around chapter headings and paragraph boundaries, we apply a @part
attribute to designate an initial and final portion within the structure of the edition XML file:<div type="collation"> <milestone n="4" type="start" unit="chapter"/> <head xml:id="novel1_letter4_chapter4_div4_div4_head1"> CHAPTER <seg part="I" xml:id="C10_app2-f1818__I">IV.</seg> </head> <p xml:id="novel1_letter4_chapter4_div4_div4_p1"> <seg part="F" xml:id="C10_app2-f1818__F"> </seg>I<hi xml:id="novel1_letter4_chapter4_div4_div4_p1_hi1">T</hi> was on a dreary night of <seg xml:id="C10_app5-f1818">November, </seg> that I beheld <seg xml:id="C10_app7-f1818">the accomplishment of my toils. </seg> With an anxiety that almost amounted to <seg xml:id="C10_app9-f1818">agony, </seg> I collected <seg xml:id="C10_app11-f1818">the </seg>instruments of life around <seg xml:id="C10_app14-f1818">me, that </seg>I <seg xml:id="C10_app16-f1818">might infuse </seg>a spark of being into the lifeless thing that lay at my feet. . . . </p> . . . </div>
  - Convert the text contents of the spine critical apparatus into stand-off pointers to the <seg> elements in the new edition files. Here is a passage from the converted standoff spine featuring only the normalized tokens for each variant passage, and containing pointers to the locations of <seg>
elements in each edition XML file corresponding with a reading witness:<app xml:id="C10_app7" n="48"> <rdgGrp xml:id="C10_app7_rg1" n="['<del>the', 'frame', 'on', 'whic<del>', 'my', 'man', 'completeed,.', '<del>and<del>']"> <rdg wit="#fMS"> <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/fMS_C10.xml#string-range(//tei:surface[@xml:id='ox-ms_abinger_c56-0045']/tei:zone[@type='main']//tei:line[3],15,59)"/> </rdg> </rdgGrp> <rdgGrp xml:id="C10_app7_rg2" n="['the', 'accomplishment', 'of']"> <rdg wit="#f1818"> <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/f1818_C10.xml#C10_app7-f1818"/> </rdg> <rdg wit="#f1823"> <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/f1823_C10.xml#C10_app7-f1823"/> </rdg> <rdg wit="#fThomas"> <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/fThomas_C10.xml#C10_app7-fThomas"/> </rdg> <rdg wit="#f1831"> <ptr target="https://raw.githubusercontent.com/PghFrankenstein/fv-data/master/variorum-chunks/f1831_C10.xml#C10_app7-f1831"/> </rdg> </rdgGrp> </app>
  - Calculate special links and pointers to the original XML files of the Shelley-Godwin Archive's manuscript edition of Frankenstein, adjusting for the resequencing prior to collation.
- Publish the XML via a web framework (optimally CETEIcean, but for now using Node.js and React). Share the edition text files and XML data from our GitHub repo.[8]
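As promised above, here is a minimal sketch of the flattening step. This is not our production pipeline (which handles many more elements, including the S-GA surfaces and zones); it only illustrates how wrapper elements can be replaced by paired sID/eID milestones so that structure survives inside what collateX treats as a flat string. The xpath-default-namespace setting assumes TEI-namespaced input.

<?xml version="1.0" encoding="UTF-8"?>
<!-- A minimal sketch (not our production code) of flattening wrapper elements
     into Trojan milestones with sID/eID attributes. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
   xpath-default-namespace="http://www.tei-c.org/ns/1.0">
   <xsl:mode on-no-match="shallow-copy"/>
   <!-- Flatten the structural wrappers we want to track across versions. -->
   <xsl:template match="div | p | head">
      <!-- Use the element's xml:id if it has one; otherwise generate one. -->
      <xsl:variable name="marker-id" select="(@xml:id, generate-id())[1]"/>
      <xsl:element name="{local-name()}">
         <xsl:attribute name="sID" select="$marker-id"/>
      </xsl:element>
      <xsl:apply-templates/>
      <xsl:element name="{local-name()}">
         <xsl:attribute name="eID" select="$marker-id"/>
      </xsl:element>
   </xsl:template>
</xsl:stylesheet>

Applied to <div xml:id="chap-ID">...</div>, this yields <div sID="chap-ID"/> ... <div eID="chap-ID"/>, the pattern visible in the spine samples above.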
Snags in the Weaving
A Schematron flashlight
For an unhappy time two years ago, pressed by a deadline to launch a proof-of-concept
partial version of the Variorum interface, I hand-corrected the outputs of collation
assisted by a Schematron file that would help me to identify common problems. The
Schematron was like a flashlight guiding me through a forest of code in efforts to
revise the contents of <rdg>
elements and refine the organization of <app>
elements and their constituent <rdgGrp>
elements. The sheer volume of corrections was exhausting, and a short novel like
Frankenstein could seem infinitely long and tortuously slow to read while attempting to make spine adjustments
. I knew this method was not sufficient for our needs. This Schematron file was merely
my first attempt to try to address the problem systematically:
<?xml version="1.0" encoding="UTF-8"?> <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"> <sch:pattern> <sch:rule context="app"> <sch:report test="not(rdgGrp)" role="error">Empty app element--missing rdgGrps! That's an error introduced from editing the collation.</sch:report> <sch:report test="contains(descendant::rdg[@wit='fThomas'], '<del')" role="info"> Here is a place where the Thomas text contains a deleted passage. Is it completely encompassed in the app?</sch:report> <sch:assert test="count(descendant::rdg/@wit) = count(distinct-values(descendant::rdg/@wit))" role="error"> A repeated rdg witness is present! There's an error here introduced by editing the collation. </sch:assert> </sch:rule> </sch:pattern> <sch:pattern> <sch:rule context="app[count(rdgGrp) eq 1][count(descendant::rdg) eq 1]"> <sch:report test="count(preceding-sibling::app[1]/rdgGrp) eq 1 or count(following-sibling::app[1]/rdgGrp) eq 1 or last()" role="warning"> Here is a "singleton" app that may be best merged in with the preceding or following "unison" app as part of a new rdgGrp. </sch:report> </sch:rule> </sch:pattern> <sch:pattern> <sch:let name="delString" value="'<del'"/> <sch:rule context="rdg[@wit='fThomas']" role="error"> <sch:let name="textTokens" value="tokenize(text(), ' ')"/> <sch:let name="delMatch" value="for $t in $textTokens return $t[contains(., $delString)]"/> <sch:assert test="count($delMatch) mod 2 eq 0"> Unfinished deletion in the Thomas witness. We count <sch:value-of select="count($delMatch)"/> deletion matches. Make sure the Thomas witness deletion is completely encompassed in the app.</sch:assert> </sch:rule> </sch:pattern> <sch:pattern> <sch:rule context="rdgGrp[ancestor::rdgGrp]"> <sch:report test=".">A reading group must NOT be nested inside another reading group!</sch:report> </sch:rule> </sch:pattern> </sch:schema>When reviewing and editing collation
spine files, I would apply this schema to guide my work and stop me from making terrible mistakes in hand-correcting common collation output problems, such as pasting a
<rdgGrp>
element inside another <rdgGrp>
(permissible and useful in some TEI projects, but an outright mistake in our very
simple spine). Eye fatigue is a serious problem in just about every stage of the Gothenburg
model, and this Schematron did intervene to prevent clumsy mistakes. More importantly
for the long-range work of the project, this simple Schematron file helped me to begin
documenting and describing repeating patterns of error in the collation, most significantly
here the singleton <app>, containing only one witness inside, and the incomplete witness reading that should have encompassed a complete act of deletion.
XSLT for alignment correction
Revisiting the collation output in fall 2021 with two student research assistants,
we began identifying patterns that XSLT could correct. The output from collateX would frequently exhibit a problem we called loner dels: that is, a failure to align text containing flattened <del> markup with the corresponding content in other witnesses, despite the presence of related content in those witnesses. collateX would sometimes place these in an <app> with a single <rdgGrp> by themselves. Over a few Friday afternoon research team meetings in November we succeeded in correcting this, while my students learned about tunneling parameters in XSLT. Our stylesheet is short, but it represents a step forward in documentation as well as outright repair of collation alignment:
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:math="http://www.w3.org/2005/xpath-functions/math" xmlns:cx="http://interedition.eu/collatex/ns/1.0" exclude-result-prefixes="xs math" version="3.0"> <!--2021-09-24 ebb with wdjacca and amoebabyte: We are writing XSLT to try to move solitary apps reliably into their neighboring app elements representing all witnesses. --> <xsl:mode on-no-match="shallow-copy"/> <!-- ******************************************************************************************** LONER DELS: These templates deal with collateX output of app elements containing a solitary MS witness containing a deletion, which we interpret as usually a false start, before a passage. ********************************************************************************************* --> <xsl:template match="app[count(descendant::rdg) = 1][contains(descendant::rdg, '<del')]"> <xsl:if test="following-sibling::app[1][count(descendant::rdgGrp) = 1 and count(descendant::rdg) gt 1]"> <xsl:apply-templates select="following-sibling::app[1]" mode="restructure"> <xsl:with-param as="node()" name="loner" select="descendant::rdg" tunnel="yes" /> <xsl:with-param as="attribute()" name="norm" select="rdgGrp/@n" tunnel="yes"/> </xsl:apply-templates> </xsl:if> </xsl:template> <xsl:template match="app[preceding-sibling::app[1][count(descendant::rdg) = 1][contains(descendant::rdg, '<del')]]"/> <xsl:template match="app" mode="restructure"> <xsl:param name="loner" tunnel="yes"/> <xsl:param name="norm" tunnel="yes"/> <app> <xsl:apply-templates select="rdgGrp" mode="restructure"> <xsl:with-param as="node()" name="loner" tunnel="yes" select="$loner"/> </xsl:apply-templates> <xsl:variable name="TokenSquished"> <xsl:value-of select="$norm ! string()||descendant::rdgGrp[descendant::rdg[@wit=$loner/@wit]]/@n"/> </xsl:variable> <xsl:variable name="newToken"> <xsl:value-of select="replace($TokenSquished, '\]\[', ', ')"/> </xsl:variable> <rdgGrp n="{$newToken}"> <rdg wit="{$loner/@wit}"><xsl:value-of select="$loner/text()"/> <xsl:value-of select="descendant::rdg[@wit = $loner/@wit]"/> </rdg> </rdgGrp> </app> </xsl:template> <xsl:template match="rdgGrp" mode="restructure"> <xsl:param name="loner" tunnel="yes"/> <xsl:if test="rdg[@wit ne $loner/@wit]"> <xsl:copy-of select="current()" /> </xsl:if> </xsl:template> </xsl:stylesheet>
Next my little team and I began to tackle the noxious problem of empty tokens
, that is, spurious <app>
elements that would form around a single witness, the Shelley-Godwin Archive's manuscript
witness. Most frequently, these isolated <app>
elements would form around an empty token, "", representing the normalized form of a <lb/>
element. The empty token itself is quite correct for our collation normalization
process: The Shelley-Godwin Archive editors applied line-by-line encoding of the manuscript
notebook, and we need to preserve the information about each line in order to help
ensure that each passage is connected back to its source edition file. So in our project,
we must not ignore this markup, but we do normalize it for the purposes of the collation
since, in our document data model, it is not meaningful information about how the
manuscript notebooks compare with the printed editions. In pre-processing the source
files for collation, we converted the Shelley-Godwin Archive’s <line>
elements into empty milestone markers as <lb/>
marking the beginnings of each line, and these <lb>
elements with their informative attributes are output as contents of <rdg wit="fMS">
in the spine
file. Our normalizing algorithm converts them into empty tokens to indicate that
they are meaningless for the purposes of collation, but they are nevertheless meaningful
to the reconstruction of the 1816 manuscript edition information following collation.
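As a minimal sketch (again, not our production code) of the pre-processing just described, the template below replaces a TEI line wrapper with an empty <lb/> whose @n concatenates the surface identifier, zone type, and line number, the same pieces visible in values like n="c56-0045__main__2" in the spine samples above. In the real pipeline the surface identifier is abbreviated and the line count is computed somewhat differently; treat the exact expressions here as assumptions.

<!-- A minimal sketch: flatten S-GA <line> wrappers into <lb/> milestones that
     remember where the line came from (surface, zone, line number). -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:tei="http://www.tei-c.org/ns/1.0" version="3.0">
   <xsl:mode on-no-match="shallow-copy"/>
   <xsl:template match="tei:line">
      <!-- The milestone marks the start of the line; its text content follows. -->
      <lb n="{ancestor::tei:surface[1]/@xml:id}__{ancestor::tei:zone[1]/@type}__{count(preceding::tei:line intersect ancestor::tei:zone[1]//tei:line) + 1}"/>
      <xsl:apply-templates/>
   </xsl:template>
</xsl:stylesheet>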
In its most frequently occurring form, the empty-token problem
would appear in separate isolated <app>
elements in between two unison apps
(that is, interrupting a passage in which all witnesses should align in unison). Here
is an example of the pattern. Those reading vigilantly will note that the expected
octothorpe or sharp #
characters indicating that this is a pointer to a defined @xml:id
do not appear in the @wit
attribute values in this example. These are simply lacking at their point of output
from collateX, before it is reprocessed in the post-production pipeline as a TEI document.
<app> <rdgGrp n="['as', 'i', 'did', 'not', 'appear', 'to', 'know']"> <rdg wit="f1818">as I did not appear to know </rdg> <rdg wit="f1823">as I did not appear to know </rdg> <rdg wit="fThomas">as I did not appear to know </rdg> <rdg wit="f1831">as I did not appear to know </rdg> <rdg wit="fMS">as I did not appear to know </rdg> </rdgGrp> </app> <app> <rdgGrp n="['']"> <rdg wit="fMS"><lb n="c57-0119__main__14"/> </rdg> </rdgGrp> </app> <app> <rdgGrp n="['the']"> <rdg wit="f1818">the </rdg> <rdg wit="f1823">the </rdg> <rdg wit="fThomas">the </rdg> <rdg wit="f1831">the </rdg> <rdg wit="fMS">the </rdg> </rdgGrp> </app>The
empty token problem is evident here in the normalized form of an <lb/> element with all its attributes, being read properly as an array of one item, a null string, but somehow, despite being nullified, asserting its presence to interrupt a unison passage. Ideally, there would be only one <app> element here, which should appear like this:
<app> <rdgGrp n="['as', 'i', 'did', 'not', 'appear', 'to', 'know', 'the']"> <rdg wit="f1818">as I did not appear to know the </rdg> <rdg wit="f1823">as I did not appear to know the </rdg> <rdg wit="fThomas">as I did not appear to know the </rdg> <rdg wit="f1831">as I did not appear to know the </rdg> <rdg wit="fMS">as I did not appear to know <rdg wit="fMS"><lb n="c57-0119__main__14"/> the </rdg> </rdgGrp> </app>We did not complete the XSLT to resolve this particular problem because it is not outright disruptive to our collation pipeline. That is, the output collation shows two passages rather than one in which all witnesses are properly aligned. All collation problems are solved if we instruct collateX to completely ignore the
<lb/>
elements entirely, but we cannot do that. Our output collation files need to preserve
locational information stored in the @n
attributes, indicating the position of a passage in a specific zone of the source
TEI encoding of the manuscript notebook surface. The @n
attributes store information needed to reconstruct an XPath to point to a specific
location in the S-GA edition. We decided ultimately to live with this collation error
for the purposes of the edition production pipeline.
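To illustrate why those @n values are worth keeping, here is a small, hypothetical sketch of how the pieces of an @n value such as c56-0045__main__3 can be unpacked to rebuild an XPath into the S-GA source of the kind used in the standoff spine's string-range pointers. The fv function name, its namespace, and the ox-ms_abinger_ prefix are illustrative assumptions; the real pipeline also computes character offsets, which are omitted here.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:fv="https://frankensteinvariorum.github.io/ns" version="3.0">
   <!-- Hypothetical helper: turn an <lb> @n value back into an XPath string
        locating the line in the S-GA edition. -->
   <xsl:function name="fv:sga-line-path" as="xs:string">
      <xsl:param name="n" as="xs:string"/>
      <!-- e.g. $n = 'c56-0045__main__3' : surface id, zone type, line number -->
      <xsl:variable name="parts" select="tokenize($n, '__')"/>
      <xsl:sequence
         select="concat('//tei:surface[@xml:id=''ox-ms_abinger_', $parts[1],
                 ''']/tei:zone[@type=''', $parts[2], ''']//tei:line[', $parts[3], ']')"/>
   </xsl:function>
</xsl:stylesheet>

Calling fv:sga-line-path('c56-0045__main__3') returns the same path that appears inside the string-range() pointer shown earlier.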
The smashed-tokens problem
We discovered a much more pernicious problem, however, with our method of tokenizing and normalizing passages that contain angle-bracketed markup. This problem generates what we call smashed tokens: that is, two tokens losing their space separator and becoming one token. Here is
a very short passage of XML from the Shelley-Godwin Archive's encoding:
. . . the spot<add eID="c57-0117__main__d3e21951"/> & endeavoured . . .In this passage, we will be normalizing to ignore the
<add>
element. Notice the spaces separating the element tag from the ampersand. Now, here
is how collateX interprets the passage in its collation alignment:
<app> <rdgGrp n="['spot,', 'and', 'endeavoured,']"> <rdg wit="f1818">spot, and endeavoured, </rdg> <rdg wit="f1823">spot, and endeavoured, </rdg> <rdg wit="fThomas">spot, and endeavoured, </rdg> <rdg wit="f1831">spot, and endeavoured, </rdg> </rdgGrp> <rdgGrp n="['spotand', 'endeavoured']"> <rdg wit="fMS">spot& endeavoured </rdg> </rdgGrp> </app>Somewhere in the normalizing process, the two completely separate tokens
spot and and were spliced together into one token (spotand). We are not certain whether this problem lies more with Python's tokenization or with collateX's alignment of normalized empty strings representing markup irrelevant to the collation. It is as if the normalized markup refuses quite to disappear, haunting the collation machinery with a ghostly, persistent trace. We are also not certain at the date of this writing whether choosing a different alignment algorithm in collateX might resolve the problem, but I think not, since the error seems to lie in the interpretation of spaces following normalization. That is, I believe the problem could be in the Python script that handles the normalization.
At the date of this writing, we have not completed these XSLT post-processing tasks, in part because we all became busy with senior projects (for them, completing their projects, and for me, assisting with them). I have also paused a moment before continuing down this path because the more patterns we discovered to correct, the more I wanted to try another XSLT approach. If our normalization and alignment were so disjointed as to require so much XSLT post-processing, might XSLT simply be better at the task of normalizing markup and tokenizing and aligning around it? That is, would it be less brittle to use XSLT for all of our processing, including the collation itself?
XSLT for the win?
At the August 2021 Balisage conference, Joel Kalvesmaki introduced tan:diff
and tan:collate
. As I listened to the paper, I was struck by how clear, simple, and precise Kalvesmaki
made the process of string comparison seem, using XPath and XSLT:
tan:diff()
. . . extracts a series of progressively smaller samples from the shorter of the two texts and looks for a match in the longer one on the basis offn:contains()
. When a match is found, the results are separated byfn:substring-before()
andfn:substring-after()
[9]
Hearing this encouraged me to experiment with tan:diff() and tan:collate().
At the time of this writing, I have begun testing a collation unit from the Frankenstein Variorum, the same one featured above that produced the smashed tokens problem discussed in the previous section. Currently I am applying the software following
Kalvesmaki’s detailed comments within his sample Diff XSLT files guiding the user
to generate TAN Diff's native XML and HTML output. The XML output is like, and yet unlike, the TEI critical apparatus structure we have been using in the Frankenstein Variorum project. The markup is simply organized into <c> elements when all witnesses share a normalized string, and <u> elements when they vary from each other. There is nothing like the <app>
element to bundle together related groups. But this is the realm of XSLT and the
developers encourage their users to adapt the code to their needs. Alas, however,
at the time of this writing, it appears there may be no straightforward way for me,
without assistance from the developer, to output the original text in the format we
need for the spine
of the Frankenstein Variorum project. This remains an area for future development.
Nevertheless, I have been applying and updating my 25 normalizing replacement patterns via XSLT and find this much more legible than my handling of the same in my Python script (which sometimes contained duplicate passages). XSLT, by nature of being written in XML, is XPathable, and with a large quantity of replacement sequences I am able to check and sequence the process carefully in a way that is easier for me to document legibly.
<!-- Should punctuation be ignored? --> <xsl:param name="tan:ignore-punctuation-differences" as="xs:boolean" select="false()"/> <xsl:param name="additional-batch-replacements" as="element()*"> <!--ebb: normalizations to batch process for collation. NOTE: We want to do these to preserve some markup \\ in the output for post-processing to reconstruct the edition files. Remember, these will be processed in order, so watch out for conflicts. --> <replace pattern="(<.+?>\s*)>" replacement="$1" message="normalizing away extra right angle brackets"/> <replace pattern="&" replacement="and" message="ampersand batch replacement"/> <replace pattern="</?xml>" replacement="" message="xml tag replacement"/> <replace pattern="(<p)\s+.+?(/>)" replacement="$1$2" message="p-tag batch replacement"/> <replace pattern="(<)(metamark).*?(>).+?\1/\2\3" replacement="" message="metamark batch replacement"/> <!--ebb: metamark contains a text node, and we don't want its contents processed in the collation, so this captures the entire element. --> <replace pattern="(</?)m(del).*?(>)" replacement="$1$2$3" message="mdel-SGA batch replacement"/> <!--ebb: mdel contains a text node, so this catches both start and end tag. We want mdel to be processed as <del>...</del>--> <replace pattern="</?damage.*?>" replacement="" message="damage-SGA batch replacement"/> <!--ebb: damage contains a text node, so this catches both start and end tag. --> <replace pattern="</?unclear.*?>" replacement="" message="unclear-SGA batch replacement"/> <!--ebb: unclear contains a text node, so this catches both start and end tag. --> <replace pattern="</?retrace.*?>" replacement="" message="retrace-SGA batch replacement"/> <!--ebb: retrace contains a text node, so this catches both start and end tag. --> <replace pattern="</?shi.*?>" replacement="" message="shi-SGA batch replacement"/> <!--ebb: shi (superscript/subscript) contains a text node, so this catches both start and end tag. 
--> <replace pattern="(<del)\s+.+?(/>)" replacement="$1$2" message="del-tag batch replacement"/> <replace pattern="<hi.+?/>" replacement="" message="hi batch replacement"/> <replace pattern="<pb.+?/>" replacement="" message="pb batch replacement"/> <replace pattern="<add.+?>" replacement="" message="add batch replacement"/> <replace pattern="<w.+?/>" replacement="" message="w-SGA batch replacement"/> <replace pattern="(<del)Span.+?spanTo="#(.+?)".*?(/>)(.+?)<anchor.+?xml:id="\2".*?>" replacement="$1$3$4$1$3" message="delSpan-to-anchor-SGA batch replacement"/> <replace pattern="<anchor.+?/>" replacement="" message="anchor-SGA batch replacement"/> <replace pattern="<milestone.+?unit="tei:p".+?/>" replacement="<p/> <p/>" message="milestone-paragraph-SGA batch replacement"/> <replace pattern="<milestone.+?/>" replacement="" message="milestone non-p batch replacement"/> <replace pattern="<lb.+?/>" replacement="" message="lb batch replacement"/> <replace pattern="<surface.+?/>" replacement="" message="surface-SGA batch replacement"/> <replace pattern="<zone.+?/>" replacement="" message="zone-SGA batch replacement"/> <replace pattern="<mod.+?/>" replacement="" message="mod-SGA batch replacement"/> <replace pattern="<restore.+?/>" replacement="" message="restore-SGA batch replacement"/> <replace pattern="<graphic.+?/>" replacement="" message="graphic-SGA batch replacement"/> <replace pattern="<head.+?/>" replacement="" message="head batch replacement"/>The documentation-centered nature of the TAN stylesheets makes the XSLT function to document decisions and makes it easy to amend the process and resequence it. Contrast this with the brittle complexity of my Python normalization function and its dependencies:
regexWhitespace = re.compile(r'\s+') regexNonWhitespace = re.compile(r'\S+') regexEmptyTag = re.compile(r'/>$') regexBlankLine = re.compile(r'\n{2,}') regexLeadingBlankLine = re.compile(r'^\n') regexPageBreak = re.compile(r'<pb.+?/>') RE_MARKUP = re.compile(r'<.+?>') RE_PARA = re.compile(r'<p\s[^<]+?/>') RE_INCLUDE = re.compile(r'<include[^<]*/>') RE_MILESTONE = re.compile(r'<milestone[^<]*/>') RE_HEAD = re.compile(r'<head[^<]*/>') RE_AB = re.compile(r'<ab[^<]*/>') # 2018-10-1 ebb: ampersands are apparently not treated in python regex as entities any more than angle brackets. # RE_AMP_NSB = re.compile(r'\S&\s') # RE_AMP_NSE = re.compile(r'\s&\S') # RE_AMP_SQUISH = re.compile(r'\S&\S') # RE_AMP = re.compile(r'\s&\s') RE_AMP = re.compile(r'&') # RE_MULTICAPS = re.compile(r'(?<=\W|\s|\>)[A-Z][A-Z]+[A-Z]*\s') # RE_INNERCAPS = re.compile(r'(?<=hi\d"/>)[A-Z]+[A-Z]+[A-Z]+[A-Z]*') # TITLE_MultiCaps = match(RE_MULTICAPS).lower() RE_DELSTART = re.compile(r'<del[^<]*>') RE_ADDSTART = re.compile(r'<add[^<]*>') RE_MDEL = re.compile(r'<mdel[^<]*>.+?</mdel>') RE_SHI = re.compile(r'<shi[^<]*>.+?</shi>') RE_METAMARK = re.compile(r'<metamark[^<]*>.+?</metamark>') RE_HI = re.compile(r'<hi\s[^<]*/>') RE_PB = re.compile(r'<pb[^<]*/>') RE_LB = re.compile(r'<lb.*?/>') # 2021-09-06: ebb and djb: On <lb> collation troubles: LOOK FOR DOT MATCHES ALL FLAG # b/c this is likely spanning multiple lines, and getting split by the tokenizing algorithm. # 2021-09-10: ebb with mb and jc: trying .*? and DOTALL flag RE_LG = re.compile(r'<lg[^<]*/>') RE_L = re.compile(r'<l\s[^<]*/>') RE_CIT = re.compile(r'<cit\s[^<]*/>') RE_QUOTE = re.compile(r'<quote\s[^<]*/>') RE_OPENQT = re.compile(r'“') RE_CLOSEQT = re.compile(r'”') RE_GAP = re.compile(r'<gap\s[^<]*/>') # <milestone unit="tei:p"/> RE_sgaP = re.compile(r'<milestone\sunit="tei:p"[^<]*/>') . . . def normalize(inputText): # 2018-09-23 ebb THIS WORKS, SOMETIMES, BUT NOT EVERWHERE: RE_MULTICAPS.sub(format(re.findall(RE_MULTICAPS, inputText, flags=0)).title(), \ # RE_INNERCAPS.sub(format(re.findall(RE_INNERCAPS, inputText, flags=0)).lower(), \ return RE_MILESTONE.sub('', \ RE_INCLUDE.sub('', \ RE_AB.sub('', \ RE_HEAD.sub('', \ RE_AMP.sub('and', \ RE_MDEL.sub('', \ RE_SHI.sub('', \ RE_HI.sub('', \ RE_LB.sub('', \ RE_PB.sub('', \ RE_PARA.sub('<p/>', \ RE_sgaP.sub('<p/>', \ RE_LG.sub('<lg/>', \ RE_L.sub('<l/>', \ RE_CIT.sub('', \ RE_QUOTE.sub('', \ RE_OPENQT.sub('"', \ RE_CLOSEQT.sub('"', \ RE_GAP.sub('', \ RE_DELSTART.sub('<del>', \ RE_ADDSTART.sub('<add>', \ RE_METAMARK.sub('', inputText)))))))))))))))))))))).lower()I came to realize that in the more writerly environment of the XSLT stylesheets for TAN Diff, It is easier to review the code, which would be helpful to me setitng it down and returning to it after some months or years.
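One small illustration of that reviewability: because the batch replacements are themselves XML elements, an XPath pass can audit them. The sketch below (hypothetical, not part of TAN or our pipeline) reads the stylesheet that declares the <replace> rules as its input document and reports any pattern declared more than once, exactly the sort of duplication that crept into my Python script unnoticed.

<?xml version="1.0" encoding="UTF-8"?>
<!-- A hypothetical self-audit: run with the stylesheet that holds the
     <replace> rules as its source document. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   <xsl:template match="/">
      <!-- Group the replacement rules by their regex pattern and flag repeats. -->
      <xsl:for-each-group select="//replace" group-by="@pattern">
         <xsl:if test="count(current-group()) gt 1">
            <xsl:message select="'Duplicate replacement pattern: ' || current-grouping-key()"/>
         </xsl:if>
      </xsl:for-each-group>
   </xsl:template>
</xsl:stylesheet>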
More to the point of this experiment with a new alignment method, does it succeed in reducing the tokenization and normalization problems discussed in the previous section? So far, yes, often with identical correct alignments and usually with longer matches. For example, on the passage that illustrated the smashed tokens problem, here is tan:collate's output:
<u> <txt>spot,</txt> <wit ref="2-1818_fullFlat_C27" pos="1423"/> <wit ref="4-Thomas_fullFlat_C27" pos="1423"/> <wit ref="3-1823_fullFlat_C27" pos="1421"/> <wit ref="5-1831_fullFlat_C27" pos="1422"/> </u> <u> <txt>spot</txt> <wit ref="1-msColl_C27" pos="317"/> </u> <c> <txt> and </txt> <wit ref="2-1818_fullFlat_C27" pos="1428"/> <wit ref="4-Thomas_fullFlat_C27" pos="1428"/> <wit ref="3-1823_fullFlat_C27" pos="1426"/> <wit ref="5-1831_fullFlat_C27" pos="1427"/> <wit ref="1-msColl_C27" pos="321"/> </c> <u> <txt>endeavoured,</txt> <wit ref="2-1818_fullFlat_C27" pos="1433"/> <wit ref="4-Thomas_fullFlat_C27" pos="1433"/> <wit ref="3-1823_fullFlat_C27" pos="1431"/> <wit ref="5-1831_fullFlat_C27" pos="1432"/> </u> <u> <txt>endeavoured </txt> <wit ref="1-msColl_C27" pos="326"/> </u>I have found no problems with empty tokens created from markup, and overall the number of aligned groups is much smaller. Here is a summary of total aligned reading groups and unison aligned reading groups for the same collation unit as processed by Python-mediated CollateX vs. Tan Diff XSLT:
Table I
Alignment | CollateX | Tan Diff/Collate |
---|---|---|
Unison Alignments (all witnesses the same) | 1198 | 515 |
All Alignments (unison and divergent) | 2315 | 1932 |
The aligned passages are longer with tan:diff() (as expected), and the tokenization and normalization outcomes appear so far to be
reasonably reliable, though this is difficult to tell without being able to see the
original passages in the TAN collation XML output. When this functionality is achieved,
it would make sound sense to take a moment and adapt tan:collate()
to the pipeline of the Frankenstein Variorum project, since it appears even out of the box to produce fewer alignment problems. At the time of this writing, however, it is difficult to navigate the TAN library of included XSLT files and to fully understand how to intervene in the collation output process.
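To give a sense of what such an intervention might look like as pure post-processing of TAN's output (rather than modification of the TAN library itself), here is a rough, hypothetical sketch that recasts the <c> and <u> elements shown above as <app>/<rdgGrp> structures like our spine's. It deliberately ignores the harder questions: grouping adjacent <u> elements that compete at the same locus, recovering the original unnormalized readings, and whatever namespace TAN binds its output elements to (add an xpath-default-namespace if needed).

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   <xsl:output indent="yes"/>
   <xsl:mode on-no-match="shallow-skip"/>
   <xsl:template match="c | u">
      <!-- One apparatus entry per TAN alignment group; @n carries the shared text. -->
      <app>
         <rdgGrp n="{txt}">
            <!-- One <rdg> per witness siglum listed in the TAN group. -->
            <xsl:for-each select="wit">
               <rdg wit="{@ref}">
                  <xsl:value-of select="../txt"/>
               </rdg>
            </xsl:for-each>
         </rdgGrp>
      </app>
   </xsl:template>
</xsl:stylesheet>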
Just because a library is written in XSLT and is well documented for public usability does not mean it is any more or less adaptable to intensive customization than a library made of other forms of code. As I write this concluding paragraph in mid-July 2022, my student assistant Yuying Jin and I have made a significant breakthrough in modifying our pre-processing Python script for collateX, and we have succeeded in handling the problem of smashed tokens, though we have now multiplied the number of empty null-string tokens. Since it now appears easier to continue refining our Python script than to intervene in the labyrinthine XSLT library of functions, we will proceed in this direction for now, and look forward to future developments to improve the customization potential of the TAN library. The XSLT process promises to make the recursive re-programming cycles of the Gothenburg model more amenable to clear documentation, and the effort to investigate two pre-processing methods, via Python and via XSLT, has proven fruitful in resolving many problems in our original process. The TAN library holds great promise for handling collations in XSLT, and while customizing its output is for the moment no easy matter for an outsider to TAN, doing so seems a clear next step for this impressive XSLT library.
[1] Interedition Development Group, The Gothenburg Model
, 2010-2019. https://collatex.net/doc/. For more on the summit and workshop of collation software developers in 2009 that
formulated the Gothenburg model, see Ronald Haentjens Dekker, Dirk van Hulle, Gregor
Middell, Vincent Neyt, and Joris van Zundert. Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital
Manuscript Project.
Digital Scholarship in the Humanities 30:3 (December 2014) pp. 3-4. doi:https://doi.org/10.1093/llc/fqu007.
[2] Birnbaum, David J. and Elena Spadini. Reassessing the locus of normalization in machine-assisted collation.
DHQ: Digital Humanities Quarterly 14:3 (2020). http://www.digitalhumanities.org/dhq/vol/14/3/000489/000489.html.
[3] John Bradley. Text Tools
in A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, and John Unsworth (Wiley, 2004), pages 505-522.
[4] Beshero-Bondar, Elisa Eileen. Rebuilding a Digital Frankenstein by 2018: Reflections toward a Theory of Losses and
Gains in Up-Translation.
Presented at Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions,
Washington, DC, July 31, 2017. In Proceedings of Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Beshero-Bondar01.
[5] See Beshero-Bondar, Elisa E., and Raffaele Viglianti. Stand-off Bridges in the Frankenstein Variorum Project: Interchange and Interoperability within TEI Markup Ecosystems.
Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August
3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Beshero-Bondar01. I am also grateful to Michael Sperberg-McQueen and David Birnbaum for their guidance
of my efforts to raise
flattened markup into hierarchical markup leading to a Balisage paper evaluating
multiple methods for attempting this. See Birnbaum, David J., Elisa E. Beshero-Bondar
and C. M. Sperberg-McQueen. Flattening and unflattening XML markup: a Zen garden of XSLT and other tools.
Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August
3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Birnbaum01.
[6] On Trojan markup, see Steve DeRose, Markup Overlap: A Review and a Horse,
Extreme Markup Languages 2004. https://www.academia.edu/42096176/Markup_Overlap_A_Review_and_a_Horse;
and Sperberg-McQueen, C. M. Representing concurrent document structures using Trojan Horse markup.
Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August
3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01.
[7] On the parallel segmentation method for handling a critical apparatus in TEI P5, see
the TEI Guidelines: TEI Consortium, eds. 4.3.2 Floating Texts.
TEI P5: Guidelines for Electronic Text Encoding and Interchange. 4.4.0. April 19, 2022. TEI Consortium. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DS.html#DSFLT (2022-07-19).
[8] Our production pipeline unfolds in a series of GitHub repos at https://github.com/FrankensteinVariorum. The repo storing and documenting the post-collation XSLT files, not the focus of discussion for this paper but likely of interest to our readers is https://github.com/FrankensteinVariorum/fv-postCollation.
[9] Kalvesmaki, Joel. String Comparison in XSLT with tan:diff().
Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6,
2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). doi:https://doi.org/10.4242/BalisageVol26.Kalvesmaki01.