Beshero-Bondar, Elisa E. “Adventures in Correcting XML Collation Problems with Python and XSLT: Untangling the
Frankenstein Variorum.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Beshero-Bondar01.
Balisage: The Markup Conference 2022 August 1 - 5, 2022
Balisage Paper: Adventures in Correcting XML Collation Problems with Python and XSLT
Untangling the Frankenstein Variorum
Elisa E. Beshero-Bondar
Professor of Digital Humanities
Program Chair of Digital Media, Arts, and Technology
Elisa Beshero-Bondar explores and teaches document data modeling with the XML
family of languages. She serves on the TEI Technical Council and is the founder
and organizer of the Digital
Mitford project and its usually annual coding
school. She experiments with visualizing data from complex document
structures like epic poems and with computer-assisted collation of differently
encoded editions of Frankenstein. Her ongoing adventures with
markup technologies are documented on her
development site at newtfire.org.
The process of instructing a computer to compare texts, known as computer-aided collation,
might resemble trying to fix a power loom when the threads it is supposed to weave
together become tangled. The power of the automated weaving continues, with the threads
improperly aligned and the pattern broken in a way that can make it difficult to isolate
the cause of the problem. Automating a tedious process magnifies the complexity of
error-correction, sometimes calling for new tooling to help us perfect the weaving
or collating process.
The authors are attempting to refine a collation algorithm to improve its alignment
of variant passages in the Frankenstein Variorum project. We have begun with a Python script that tokenizes and normalizes the texts
of the editions and delivers them to collateX for processing the collation and delivering TEI-conformant output for our project.
In post-processing stages after running the collation, we apply a series of XSLT transformations
to the collation output. This post-collation XSLT pipeline publishes the digital variorum edition, preparing each output witness in TEI XML to store information about its own variance from the other editions. We have discussed that pipeline elsewhere, but our interest in this paper is in efforts to repair, correct, and improve the collation process.
We have applied Schematron and XSLT in post-processing to correct patterns of erroneous
alignments, but eventually realized that the problems we were trying to solve required
repairing the collation algorithm. We are now experimenting with revising the collation
algorithm in two ways: 1) by fine-tuning the text preparation algorithms we apply
in our Python file that delivers text to the collateX software, and 2) by attempting
to introduce those same text preparation algorithms entirely with XSLT using the Text
Alignment Network's XSLT application of tan:diff() and tan:collate(), introduced by Joel Kalvesmaki at the 2021 Balisage conference. In this paper we
discuss the challenges of figuring out where and how to intervene in the collation
process, and what we are learning about how far we can take XSLT and Schematron in
helping to automate the preparation, collation, and correction process.
Cycles of time in the Gothenburg model for computer-aided collation
A project applying computer-aided collation, by which we mean the guidance of a computational
system to compare texts, challenges our debugging capacities in ways that scale and
consume far more time than humans may expect at the outset. Scholarly editors working
on collation projects often do so in the context of preparing critical editions that
present a text-bearing cultural object in all its variations. These editors tend to
prioritize a high degree of accuracy in the marking of variants, and may come to a
computer-aided collation project prepared for saintly patience and endurance to spot
and correct each and every error. Yet endurance has its mortal limits, and computationally-generated
errors require the developers' patient scrutiny to be directed not to the outputs
but to the preparation of the inputs. Humanities scholars (the author included) may
be tempted to hand-correct errors as they spot them to perfect an output for a swift
project launch, but this is brittle and can compound errors, even when applied
by a computer script that finds and replaces every instance of a distinctive pattern
in the output. We should limit correcting the output, even computationally, in favor
of making changes at the guidance and preparation stages and re-running the collation
software. Taking cycles of time to study the errors generated by a collation process
should lead us to a state of enlightenment about precisely what we are comparing and
precisely how it is to be compared. Do we consider computer collation a form of machine
learning, in which we expend energy preparing the training set? But it is really the
scholars who learn over cycles of collation re-processing, with enhanced awareness
of their systems for comparison. Perhaps we should think of computer-aided collation
as machine-assisted human learning.
This paper marks a return to work on a digital collation project that began in the
years just before the pandemic, and paused a moment in 2020-2021 when several of the
original project members moved to new jobs and I, as the one leading the preparation
of the collation data, concentrated on teaching and running a program in a new university
during a time of global pandemic crisis disrupting university work. Returning to work
on the Frankenstein Variorum has necessitated reorienting myself to complicated processes and reviewing the problems
we identified as we last documented and left them. This has been a labor of reviewing
and sampling our past code, updating its documentation, and trying to find new and
better ways to continue our work. We set out to prepare a comparison of five distinct
versions of the novel Frankenstein, without taking any single edition as the default standard. Rather, we seek to provide
readers a reading view for each of the five versions that demonstrates which passages
were altered and how heavily they were altered. The Frankenstein Variorum should help readers to see a heavily revised text and understand the abandoned, discontinuous
branches of that revision.
Much of this project has investigated methods for moving between hierarchical and
flat text structures, moving back and forth between markup and strings as we thread our
texts into the machine power-loom of collation software, examine the woven output,
and contemplate how to untangle the snags and snarls we find in the woven textile of collation output. Our work is guided by the Gothenburg model, an elaboration of five distinct stages of work necessary for machine-assisted collation, established in 2009 by the developers of collateX and Juxta in Gothenburg, Sweden:
Tokenization (deciding on the smallest units of comparison)
Normalization/Regularization (deciding what distinct features in the source documents need to be suppressed from comparison. An easy example of a normalization step is indicating that ampersands stand in for the word and. A more difficult judgment call is the handling of notational metamarks or even editorial markup that assists reading in one document but should be skipped over by the collation software as not meaningful for comparison with other documents.)
Alignment (locating via software parallel equivalent passages and their moments of
unison and divergence)
Analysis/Feedback (reviewing, correcting, and adjusting the alignment algorithm as well as the normalization/regularization method)
Visualization (finding an effective way to show and share the collation results)[1]
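As an illustration of the first two stages (a sketch for illustration only, not the project's code), the following minimal XSLT 3.0 stylesheet tokenizes a single witness string on whitespace and pairs each token with a normalized shadow form: lowercased, with ampersands expanded to and and trailing punctuation dropped. The sample string and the t/n attribute names are assumptions for the example.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch (not project code) of Gothenburg stages 1 and 2:
     tokenize a witness string on whitespace, then pair each token with a
     normalized shadow form used only for comparison. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="3.0">
    <xsl:output method="xml" indent="yes"/>
    <!-- A sample witness string; in practice this would come from a flattened edition file. -->
    <xsl:param name="witness" as="xs:string"
        select="'IT was on a dreary night of November, &amp; I beheld'"/>
    <xsl:template name="xsl:initial-template">
        <tokens>
            <!-- Stage 1: tokenization -->
            <xsl:for-each select="tokenize($witness, '\s+')[. ne '']">
                <!-- Stage 2: normalization (lowercase, expand ampersand, drop trailing punctuation) -->
                <xsl:variable name="norm"
                    select="replace(lower-case(.), '&amp;', 'and') => replace('[.,;:!?]+$', '')"/>
                <token t="{.}" n="{$norm}"/>
            </xsl:for-each>
        </tokens>
    </xsl:template>
</xsl:stylesheet>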
These stages are usually numbered, but that numbering belies the cyclicality of the
Gothenburg process. We scholars who collate have to try and try again to retool, rethink,
re-run, and even this is not a steady process because we have to decide whether certain
kinds of problems are best resolved by changing the pre-processing to improve the normalization prior to collation, or by applying corrections to collation output errors in post-processing. Collation projects of significant length and complexity require computational assistance
to visualize not only the collation output but crucially to locate patterns in the
collation errors. The Analysis/Feedback stage is critical: without giving it procedural care, we risk inaccuracies or the human error introduced by hand-correcting outputs.
In returning to work on the Frankenstein Variorum, I have concentrated on the Analysis/Feedback stage to try to find patterns in the
snags of our collation weave, and to try different normalization and alignment methods
to improve the work. The work requires close observation of meticulous details (plenty
of myopic eye fatigue), but also a return to the long view of our research goals.
As David J. Birnbaum and Elena Spadini have discussed, normalization is interpretive,
reflecting on transcription decisions and deciding on what properties should and should
not be considered worthy of comparison.[2] In our project, the great challenge for the normalization process has been contending
with the diplomatic markup of the Shelley-Godwin Archive's manuscript notebook edition
of Frankenstein. The meticulous page-by-page markup structured on page surfaces and zones cannot
be compared with the semantic trees representing the other print-based editions in
the project, though deletions do matter, as do the insertion points of marginal additions
and corrections. Alignment of our variant passages is improved by revisiting our normalization
algorithms, and also by testing our software tools as soon as we recognize problems.
In this paper, I venture that the software we choose to assist us with aligning collations
matters not only for its effectiveness in applying an algorithm like Needleman-Wunsch
or Dekker or TAN Diff, but also for its capacity to guide the scholar in the documentation
of a highly complex, time-consuming, cyclical process. In the following sections I
first unfold the elaborate pipeline process of our collation and visualization, to
then introduce problems we have found and efforts to resolve them by exploring a new
method of alignment. XSLT has been vital to the pre-processing and post-processing
of our edition and collation data, but now we find it may be a benefit for handling
the entire process of tree flattening, string collation, and edition construction.
Long ago in the early years of XSLT, John Bradley extolled its virtues not only for
publishing documents but also for scholarly text analysis and research.[3] Though it is common in the Digital Humanities community now to think of Python or R Studio as the only tools one needs for text analysis, the strengths of
XSLT seem realized in its precision and elegance in transforming structured and unstructured
text and its particular ease of documentation for the humanities researcher. In this
paper, I contemplate a move from applying normalizing algorithms to stringified XML in Python to restating and revising those algorithms in XSLT.
The Frankenstein Variorum project and its production pipeline
The Frankenstein Variorum project (hereafter referred to as FV) is an ongoing collation project that began during the recent 1818-2018 bicentennial
celebrating the first publication of Mary Shelley's novel. We are constructing a digital
variorum edition that highlights alterations to the novel Frankenstein over five key moments from its first drafting in 1816 to its author’s final revisions
by 1831. If we think of each source edition for the collation as a thread, the collation algorithm can be said to weave five threads together into a woven
pattern that helps us to identify the moments when the editions align together and
where they diverge. The project applies automated collation, so far working with the
software collateX to create a weave of five editions of the novel, output in TEI-conformant XML. We have shared papers
at Balisage in previous years about our processing work, preparing differently-encoded source
editions for machine-assisted collation[4], and working with the output of collation to construct a spine file in TEI critical apparatus form, which coordinates the preparation of the edition
files that hold elements locating moments of divergence from the other editions.[5]
Our source threads for weaving the collation pattern include two well-known digital editions: The Pennsylvania Electronic Edition (PAEE), an early hypertext edition produced at the University of Pennsylvania in the mid
1990s by Stuart Curran and Jack Lynch that represents the 1818 and 1831 published
editions of the novel, and the Shelley-Godwin Archive's edition of the manuscript notebooks (S-GA) published in 2013 by the University of Maryland. We prepared two other editions in
XML representing William Godwin’s corrections of his daughter's edition from 1823,
and the Thomas copy, which represents Mary Shelley’s corrections and proposed revisions written in a
copy of the 1818 edition left in Italy with her friend Mrs. Thomas.
Here is a summary of our project’s production pipeline, discussed in more detail in
previous Balisage papers:
Preparing differently-encoded XML files for collation. This involves several stages
of document analysis and pre-processing:
Identifying markup that indicates structures we care about: letter, chapter, paragraphs,
and verse structures, for example. Where these change from version to version, we
want to be able to track them by including this markup with the text of each edition
in the collation.
Re-sequencing margin annotations coded in the S-GA edition of the manuscript notebook, as these margin annotations were encoded at the ends of each XML file and needed to be moved into reading order for collation. (For this resequencing, we wrote XSLT to follow the @xml:ids and pointers in each S-GA edition file.)
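As a generic illustration of this kind of pointer-following resequencing (the element and attribute names below are invented and do not reproduce the S-GA encoding), an XSLT identity transformation can index the out-of-place zones by @xml:id and pull each one into reading order at the point where the main text points to it:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Generic, hedged sketch of pointer-based resequencing, not the actual S-GA
     stylesheet: margin zones encoded at the end of the file are pulled into
     reading order wherever the main text points at them, and suppressed at
     their original position. Element and attribute names here are invented,
     and namespaces are omitted for brevity. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:mode on-no-match="shallow-copy"/>
    <!-- Index the out-of-order margin zones by their xml:id -->
    <xsl:key name="zone-by-id" match="zone[@type = 'margin']" use="@xml:id"/>
    <!-- At each pointer in the main text, copy in the content of the zone it targets -->
    <xsl:template match="ptr[@target]">
        <xsl:apply-templates select="key('zone-by-id', substring-after(@target, '#'))/node()"/>
    </xsl:template>
    <!-- Drop the margin zones from their original end-of-file position -->
    <xsl:template match="zone[@type = 'margin']"/>
</xsl:stylesheet>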
Determining and marking chunk boundaries (dividing the texts into units that mostly start and end the same way
to facilitate comparison): We divided Frankenstein into 33 chunks roughly corresponding with chapter structures. (Later, in refining the collation, we subdivided some of the more complicated chunks into 3-5 sub-chunks for a more granular comparison of shorter passages.)
Flattening the edition files' structural markup into Trojan milestone elements (converting structural elements that wrap chapters, letters, and paragraphs like <div xml:id="chap-ID"> into self-closed matched pairs of elements to mark the starts and ends of structures: <div sID="chap-ID"/> and <div eID="chap-ID"/>). The use of these empty milestones facilitates the inclusion of markup in the collation when, for example, a version inserts a new paragraph or chapter boundary within an otherwise identical passage in the other editions. Ultimately this flattening in pre-processing facilitates post-processing of the edition files from text strings to XML holding collation data.[6]
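A hedged sketch of this kind of flattening transformation, not the project's actual stylesheet, might look like the following: each wrapper element is replaced by an empty start marker and an empty end marker that share an identifier, so the structural boundary can travel through string-based collation (namespaces and the elements' other attributes are omitted for brevity).

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hedged sketch (not the project's actual stylesheet) of Trojan-milestone flattening:
     wrapper elements such as div, p, and head are replaced by empty start and end
     markers carrying sID and eID attributes so that structural boundaries survive
     string-based collation. Namespaces and other attributes are omitted for brevity. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="div | p | head">
        <!-- Use the element's @xml:id when present; otherwise generate an identifier. -->
        <xsl:variable name="id" select="string((@xml:id, generate-id())[1])"/>
        <xsl:element name="{name()}">
            <xsl:attribute name="sID" select="$id"/>
        </xsl:element>
        <xsl:apply-templates/>
        <xsl:element name="{name()}">
            <xsl:attribute name="eID" select="$id"/>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>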
Normalizing the text and processing the collation. Up to this point, we have been
applying a Python script to read the flattened XML as text, and to introduce a series
of about 25 normalizations via regular expression patterns that instruct collateX
to (among other things):
Convert & into and
Ignore some angle-bracketed material that is not relevant to the collation: For example,
convert <surface.+?/> and <zone.+?/> into "" (nothing) when comparing strings of the editions
Simplify to ignore the attribute values on Trojan milestone elements. For example,
remove the @sID and @eID attributes by converting (<p)\s+.+?(/>) into $1$2 (or simply <p/>) so that all paragraph markers are read as the same.
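For illustration only (this is not the project's Python script), the same style of ordered regex normalization can be expressed as an XSLT 3.0 function that folds a list of pattern/replacement pairs over the flattened witness text; the fv namespace and the abbreviated pattern list below are assumptions for the example.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hedged sketch, not the project's Python script: ordered regex normalization
     expressed as an XSLT 3.0 function. Each pattern/replacement pair is applied
     in sequence to the flattened witness text; order matters. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fv="https://frankensteinvariorum.example/ns"
    exclude-result-prefixes="xs fv" version="3.0">
    <xsl:output method="text"/>
    <!-- An abbreviated, invented subset of the project's roughly 25 normalizations. -->
    <xsl:variable name="replacements" as="element(replace)*">
        <replace pattern="&amp;" replacement="and"/>
        <replace pattern="&lt;surface.+?/>" replacement=""/>
        <replace pattern="&lt;zone.+?/>" replacement=""/>
        <replace pattern="(&lt;p)\s+.+?(/>)" replacement="$1$2"/>
    </xsl:variable>
    <xsl:function name="fv:normalize" as="xs:string">
        <xsl:param name="text" as="xs:string"/>
        <!-- fold-left applies each replacement in document order -->
        <xsl:sequence select="fold-left($replacements, $text,
            function($acc, $r) { replace($acc, $r/@pattern, $r/@replacement) })"/>
    </xsl:function>
    <xsl:template name="xsl:initial-template">
        <!-- Example: the ampersand becomes 'and'. -->
        <xsl:value-of select="fv:normalize('spot &amp; endeavoured')"/>
    </xsl:template>
</xsl:stylesheet>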
Work with the output of the collation to prepare, in a series of stages, a very important
file that we call our spine, which contains the collation data that organizes the variorum edition project. The
format of the output is a prototype of a TEI critical apparatus, and in preliminary
stages it is not quite valid against the TEI. Here is an example of a part of the
spine in an early post-processing stage, featuring a passage in which Victor Frankenstein
beholds his completed Creature in each of the five editions. Notice that the <rdg> elements contain mixed content: each is holding a passage from the source edition
that is crucially not normalized. This is our project’s particular approach to the parallel segmentation method of preparing a critical apparatus in TEI P5. Eventually the spine of the collation needs to contain data particular to each version in order to reconstruct
that version so that it stores information about its variations from the other versions.
In the not-quite-TEI stage of our post-collation process represented here, we can see a use of the Trojan milestone markers to indicate the starting points of paragraphs, with their attribute values indicating their location in the original source edition's XML hierarchy.[7]
<cx:apparatus xmlns:cx="http://interedition.eu/collatex/ns/1.0">
<app>
<rdgGrp n="['chapter']">
<rdg wit="f1818"><milestone n="4" type="start" unit="chapter"/>
<head sID="novel1_letter4_chapter4_div4_div4_head1"/>CHAPTER </rdg>
<rdg wit="f1823"><milestone n="4" type="start" unit="chapter"/>
<head sID="novel1_letter4_chapter4_div4_div4_head1"/>CHAPTER </rdg>
<rdg wit="fThomas"><milestone n="4" type="start" unit="chapter"/>
<head sID="novel1_letter4_chapter4_div4_div4_head1"/>CHAPTER </rdg>
<rdg wit="f1831"><milestone n="5" type="start" unit="chapter"/>
<head sID="novel1_letter4_chapter5_div4_div5_head1"/>CHAPTER </rdg>
<rdg wit="fMS"><lb n="c56-0045__main__1"/>
<milestone spanTo="#c56-0045.04" unit="tei:head"/>Chapter </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['7<shi rend="sup"></shi><p/>']">
<rdg wit="fMS">7<shi rend="sup"></shi>
<milestone unit="tei:p"/><lb n="c56-0045__main__2"/> </rdg>
</rdgGrp>
<rdgGrp n="['iv.<p/>']">
<rdg wit="f1818">IV.<head eID="novel1_letter4_chapter4_div4_div4_head1"/>
<p sID="novel1_letter4_chapter4_div4_div4_p1"/></rdg>
<rdg wit="f1823">IV.<head eID="novel1_letter4_chapter4_div4_div4_head1"/>
<p sID="novel1_letter4_chapter4_div4_div4_p1"/></rdg>
<rdg wit="fThomas">IV.<head eID="novel1_letter4_chapter4_div4_div4_head1"/>
<p sID="novel1_letter4_chapter4_div4_div4_p1"/></rdg>
</rdgGrp>
<rdgGrp n="['v.<p/>']">
<rdg wit="f1831">V.<head eID="novel1_letter4_chapter5_div4_div5_head1"/>
<p sID="novel1_letter4_chapter5_div4_div5_p1"/> </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['it']">
<rdg wit="f1818">I<hi sID="novel1_letter4_chapter4_div4_div4_p1_hi1"/>T
<hi eID="novel1_letter4_chapter4_div4_div4_p1_hi1"/> </rdg>
<rdg wit="f1823">I<hi sID="novel1_letter4_chapter4_div4_div4_p1_hi1"/>T
<hi eID="novel1_letter4_chapter4_div4_div4_p1_hi1"/> </rdg>
<rdg wit="fThomas">I<hi sID="novel1_letter4_chapter4_div4_div4_p1_hi1"/>T
<hi eID="novel1_letter4_chapter4_div4_div4_p1_hi1"/> </rdg>
<rdg wit="f1831">I<hi sID="novel1_letter4_chapter5_div4_div5_p1_hi1"/>T
<hi eID="novel1_letter4_chapter5_div4_div5_p1_hi1"/> </rdg>
<rdg wit="fMS">It </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['was', 'on', 'a', 'dreary', 'night', 'of']">
<rdg wit="f1818">was on a dreary night of </rdg>
<rdg wit="f1823">was on a dreary night of </rdg>
<rdg wit="fThomas">was on a dreary night of </rdg>
<rdg wit="f1831">was on a dreary night of </rdg>
<rdg wit="fMS">was on a dreary night of </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['november', '']">
<rdg wit="fMS">November <lb n="c56-0045__main__3"/> </rdg>
</rdgGrp>
<rdgGrp n="['november,']">
<rdg wit="f1818">November, </rdg>
<rdg wit="f1823">November, </rdg>
<rdg wit="fThomas">November, </rdg>
<rdg wit="f1831">November, </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['that', 'i', 'beheld']">
<rdg wit="f1818">that I beheld </rdg>
<rdg wit="f1823">that I beheld </rdg>
<rdg wit="fThomas">that I beheld </rdg>
<rdg wit="f1831">that I beheld </rdg>
<rdg wit="fMS">that I beheld </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['<del>the', 'frame', 'on', 'whic<del>',
'my', 'man', 'completeed,.', '<del>and<del>']">
<rdg wit="fMS"><del rend="strikethrough" sID="c56-0045__main__d2e9718"/>
the frame on whic<del eID="c56-0045__main__d2e9718"/> my man
comple<mdel>at</mdel>teed,.
<del rend="strikethrough" sID="c56-0045__main__d2e9739"/>
And<del eID="c56-0045__main__d2e9739"/><lb n="c56-0045__main__4"/> </rdg>
</rdgGrp>
<rdgGrp n="['the', 'accomplishment', 'of']">
<rdg wit="f1818">the accomplishment of my toils. </rdg>
<rdg wit="f1823">the accomplishment of my toils. </rdg>
<rdg wit="fThomas">the accomplishment of my toils. </rdg>
<rdg wit="f1831">the accomplishment of my toils. </rdg>
</rdgGrp>
</app>
Apply this collation data in post-processing to prepare the edition files:
Raise the flattened editions to their original form (converting Trojan milestones back into the original tree hierarchy).
With XSLT, pull information from the spine file, and insert <seg> elements in each of the distinct edition files to mark every point of variance captured
in the collation. Each <seg> element is mapped to a specific location in the collation spine file. This mapping is done by recording in the value of the seg @xml:id a modified XPath expression for the particular <app> element in the collation spine. (The modification to the XPath is necessary so that
the result is valid as an xml:id, which may not contain a solidus.) Here is an example of <seg> elements applied in the 1818 edition file, where the @xml:id attributes are keyed to the specific <app> elements in the portion of the spine representing collation unit 10. Where the collation
spine marks passages around chapter headings and paragraph boundaries, we apply a @part attribute to designate an initial and final portion within the structure of the edition
XML file:
<div type="collation">
<milestone n="4" type="start" unit="chapter"/>
<head xml:id="novel1_letter4_chapter4_div4_div4_head1">
CHAPTER <seg part="I" xml:id="C10_app2-f1818__I">IV.</seg>
</head>
<p xml:id="novel1_letter4_chapter4_div4_div4_p1">
<seg part="F" xml:id="C10_app2-f1818__F">
</seg>I<hi xml:id="novel1_letter4_chapter4_div4_div4_p1_hi1">T</hi>
was on a dreary night of <seg xml:id="C10_app5-f1818">November, </seg>
that I beheld <seg xml:id="C10_app7-f1818">the accomplishment of my toils. </seg>
With an anxiety that almost amounted to <seg xml:id="C10_app9-f1818">agony, </seg>
I collected <seg xml:id="C10_app11-f1818">the </seg>instruments of life around
<seg xml:id="C10_app14-f1818">me, that </seg>I
<seg xml:id="C10_app16-f1818">might infuse </seg>a spark of being
into the lifeless thing that lay at my feet. . . .
</p>
. . .
</div>
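One way to derive such xml:id-safe keys, offered here only as a hedged sketch rather than the project's actual code, is to rebuild the app's position as an ordinal rather than as an XPath step containing a solidus; the function below reproduces the visible pattern (for example, C10_app2-f1818) from the collation unit, the app's position in the spine, and the witness siglum (the fv namespace is invented).

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hedged sketch, not the project's stylesheet: one way to derive an xml:id-safe
     key for a <seg> from the position of its <app> in the collation spine. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fv="https://frankensteinvariorum.example/ns"
    exclude-result-prefixes="xs fv" version="3.0">
    <xsl:function name="fv:seg-id" as="xs:string">
        <xsl:param name="unit" as="xs:string"/>    <!-- e.g. 'C10' -->
        <xsl:param name="app" as="element(app)"/>  <!-- the app element in the spine -->
        <xsl:param name="wit" as="xs:string"/>     <!-- e.g. 'f1818' -->
        <!-- The app's ordinal position stands in for an XPath step; no solidus appears,
             so the result is a legal xml:id value. -->
        <xsl:sequence
            select="$unit || '_app' || (count($app/preceding-sibling::app) + 1) || '-' || $wit"/>
    </xsl:function>
    <!-- Usage inside the template that writes <seg> into an edition file:
         <seg xml:id="{fv:seg-id('C10', $app, 'f1818')}"> ... </seg>            -->
</xsl:stylesheet>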
Convert the text contents of the spine critical apparatus into stand-off pointers to the <seg> elements in the new edition files. Here is a passage from the converted standoff spine featuring only the
normalized tokens for each variant passage, and containing pointers to the locations
of <seg>
elements in each edition XML file corresponding with a reading witness:
Calculate special links and pointers to the original XML files of the Shelley-Godwin
Archive's manuscript edition of Frankenstein, adjusting for the resequencing prior
to collation.
For an unhappy time two years ago, pressed by a deadline to launch a proof-of-concept
partial version of the Variorum interface, I hand-corrected the outputs of collation
assisted by a Schematron file that would help me to identify common problems. The
Schematron was like a flashlight guiding me through a forest of code in efforts to
revise the contents of <rdg> elements and refine the organization of <app> elements and their constituent <rdgGrp> elements. The sheer volume of corrections was exhausting, and a short novel like
Frankenstein could seem infinitely long and tortuously slow to read while attempting to make spine adjustments. I knew this method was not sufficient for our needs. This Schematron file was merely my first attempt to address the problem systematically:
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
<sch:pattern>
<sch:rule context="app">
<sch:report test="not(rdgGrp)" role="error">Empty app element--missing rdgGrps! That's an error introduced from editing the collation.</sch:report>
<sch:report test="contains(descendant::rdg[@wit='fThomas'], '<del')" role="info">
Here is a place where the Thomas text contains a deleted passage. Is it completely encompassed in the app?</sch:report>
<sch:assert test="count(descendant::rdg/@wit) = count(distinct-values(descendant::rdg/@wit))" role="error">
A repeated rdg witness is present! There's an error here introduced by editing the collation.
</sch:assert>
</sch:rule>
</sch:pattern>
<sch:pattern>
<sch:rule context="app[count(rdgGrp) eq 1][count(descendant::rdg) eq 1]">
<sch:report test="count(preceding-sibling::app[1]/rdgGrp) eq 1 or count(following-sibling::app[1]/rdgGrp) eq 1 or last()" role="warning">
Here is a "singleton" app that may be best merged in with the preceding or following "unison" app as part of a new rdgGrp.
</sch:report>
</sch:rule>
</sch:pattern>
<sch:pattern>
<sch:let name="delString" value="'<del'"/>
<sch:rule context="rdg[@wit='fThomas']" role="error">
<sch:let name="textTokens" value="tokenize(text(), ' ')"/>
<sch:let name="delMatch" value="for $t in $textTokens return $t[contains(., $delString)]"/>
<sch:assert test="count($delMatch) mod 2 eq 0">
Unfinished deletion in the Thomas witness. We count <sch:value-of select="count($delMatch)"/> deletion matches.
Make sure the Thomas witness deletion is completely encompassed in the app.</sch:assert>
</sch:rule>
</sch:pattern>
<sch:pattern>
<sch:rule context="rdgGrp[ancestor::rdgGrp]">
<sch:report test=".">A reading group must NOT be nested inside another reading group!</sch:report>
</sch:rule>
</sch:pattern>
</sch:schema>
When reviewing and editing collation spine files, I would apply this schema to guide my work and stop me from making terrible
mistakes in hand-correcting common collation output problems, such as, for example,
pasting a <rdgGrp> element inside another <rdgGrp> (permissible and useful in some TEI projects, but an outright mistake in our very
simple spine). Eye fatigue is a serious problem in just about every stage of the Gothenburg
model, and this Schematron did intervene to prevent clumsy mistakes. More importantly
for the long-range work of the project, this simple Schematron file helped me to begin
documenting and describing repeating patterns of error in the collation, most significantly here the singleton <app>, containing only one witness inside, and the incomplete witness reading in which I wanted to encompass a complete act of deletion.
XSLT for alignment correction
Revisiting the collation output in fall 2021 with two student research assistants,
we began identifying patterns that XSLT could correct. The collateX output would frequently create a problem we called loner dels: that is, failing to align text containing flattened <del> markup with the corresponding passages in other witnesses, despite the presence of related content in those witnesses. collateX would sometimes place these in an <app> with a single <rdgGrp> by themselves. We succeeded over a few Friday afternoon research team meetings in
November to correct this while my students learned about tunneling parameters in XSLT.
Our stylesheet is short but represents a step forward in documentation as well as
outright repair of collation alignment:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math"
xmlns:cx="http://interedition.eu/collatex/ns/1.0"
exclude-result-prefixes="xs math"
version="3.0">
<!--2021-09-24 ebb with wdjacca and amoebabyte: We are writing XSLT to try to move
solitary apps reliably into their neighboring app elements representing all witnesses.
-->
<xsl:mode on-no-match="shallow-copy"/>
<!-- ********************************************************************************************
LONER DELS: These templates deal with collateX output of app elements
containing a solitary MS witness containing a deletion, which we interpret as usually a false start,
before a passage.
*********************************************************************************************
-->
<xsl:template match="app[count(descendant::rdg) = 1][contains(descendant::rdg, '<del')]">
<xsl:if test="following-sibling::app[1][count(descendant::rdgGrp) = 1 and count(descendant::rdg) gt 1]">
<xsl:apply-templates select="following-sibling::app[1]" mode="restructure">
<xsl:with-param as="node()" name="loner" select="descendant::rdg" tunnel="yes" />
<xsl:with-param as="attribute()" name="norm" select="rdgGrp/@n" tunnel="yes"/>
</xsl:apply-templates>
</xsl:if>
</xsl:template>
<xsl:template match="app[preceding-sibling::app[1][count(descendant::rdg) = 1][contains(descendant::rdg, '<del')]]"/>
<xsl:template match="app" mode="restructure">
<xsl:param name="loner" tunnel="yes"/>
<xsl:param name="norm" tunnel="yes"/>
<app>
<xsl:apply-templates select="rdgGrp" mode="restructure">
<xsl:with-param as="node()" name="loner" tunnel="yes" select="$loner"/>
</xsl:apply-templates>
<xsl:variable name="TokenSquished">
<xsl:value-of select="$norm ! string()||descendant::rdgGrp[descendant::rdg[@wit=$loner/@wit]]/@n"/>
</xsl:variable>
<xsl:variable name="newToken">
<xsl:value-of select="replace($TokenSquished, '\]\[', ', ')"/>
</xsl:variable>
<rdgGrp n="{$newToken}">
<rdg wit="{$loner/@wit}"><xsl:value-of select="$loner/text()"/>
<xsl:value-of select="descendant::rdg[@wit = $loner/@wit]"/>
</rdg>
</rdgGrp>
</app>
</xsl:template>
<xsl:template match="rdgGrp" mode="restructure">
<xsl:param name="loner" tunnel="yes"/>
<xsl:if test="rdg[@wit ne $loner/@wit]">
<xsl:copy-of select="current()" />
</xsl:if>
</xsl:template>
</xsl:stylesheet>
Next my little team and I began to tackle the noxious problem of empty tokens, that is, spurious <app> elements that would form around a single witness, the Shelley-Godwin Archive's manuscript witness. Most frequently, these isolated <app> elements would appear with an empty token, "", formed to represent the normalized form of a <lb/> element. The empty token itself is quite correct for our collation normalization
process: The Shelley-Godwin Archive editors applied line-by-line encoding of the manuscript
notebook, and we need to preserve the information about each line in order to help
ensure that each passage is connected back to its source edition file. So in our project,
we must not ignore this markup, but we do normalize it for the purposes of the collation
since, in our document data model, it is not meaningful information about how the
manuscript notebooks compare with the printed editions. In pre-processing the source
files for collation, we converted the Shelley-Godwin Archive’s <line> elements into empty milestone markers as <lb/> marking the beginnings of each line, and these <lb> elements with their informative attributes are output as contents of <rdg wit="fMS"> in the spine file. Our normalizing algorithm converts them into empty tokens to indicate that
they are meaningless for the purposes of collation, but they are nevertheless meaningful
to the reconstruction of the 1816 manuscript edition information following collation.
In its most frequently occurring form, the empty-token problem would appear in a separate, isolated <app> element in between two unison apps (that is, interrupting a passage in which all witnesses should align in unison). Here is an example of the pattern. Those reading vigilantly will note that the expected octothorpe or sharp # characters, indicating that these are pointers to defined @xml:id values, do not appear in the @wit attribute values in this example. These are simply lacking at their point of output from collateX, before the output is reprocessed in the post-production pipeline as a TEI document.
<app>
<rdgGrp n="['as', 'i', 'did', 'not', 'appear', 'to', 'know']">
<rdg wit="f1818">as I did not appear to know </rdg>
<rdg wit="f1823">as I did not appear to know </rdg>
<rdg wit="fThomas">as I did not appear to know </rdg>
<rdg wit="f1831">as I did not appear to know </rdg>
<rdg wit="fMS">as I did not appear to know </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['']">
<rdg wit="fMS"><lb n="c57-0119__main__14"/> </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['the']">
<rdg wit="f1818">the </rdg>
<rdg wit="f1823">the </rdg>
<rdg wit="fThomas">the </rdg>
<rdg wit="f1831">the </rdg>
<rdg wit="fMS">the </rdg>
</rdgGrp>
</app>
The empty-token problem is evident here in the normalized form of an <lb/> element with all its attributes, being read properly as an array of one item, a null string, but somehow, despite being nullified, asserting its presence to interrupt a unison passage. Ideally there would be only a single <app> element at this point, which should appear like this:
<app>
<rdgGrp n="['as', 'i', 'did', 'not', 'appear', 'to', 'know', 'the']">
<rdg wit="f1818">as I did not appear to know the </rdg>
<rdg wit="f1823">as I did not appear to know the </rdg>
<rdg wit="fThomas">as I did not appear to know the </rdg>
<rdg wit="f1831">as I did not appear to know the </rdg>
<rdg wit="fMS">as I did not appear to know
<rdg wit="fMS"><lb n="c57-0119__main__14"/> the </rdg>
</rdgGrp>
</app>
We did not complete the XSLT to resolve this particular problem because it is not
outright disruptive to our collation pipeline. That is, the output collation shows
two passages rather than one in which all witnesses are properly aligned. These collation problems would all be solved if we instructed collateX to ignore the <lb/> elements entirely, but we cannot do that. Our output collation files need to preserve
locational information stored in the @n attributes, indicating the position of a passage in a specific zone of the source
TEI encoding of the manuscript notebook surface. The @n attributes store information needed to reconstruct an XPath to point to a specific
location in the S-GA edition. We decided ultimately to live with this collation error
for the purposes of the edition production pipeline.
The smashed-tokens problem
We discovered a much more pernicious problem, however, with our method of tokenizing
and normalizing passages that contain angle-bracketed markup. This problem generates what we call smashed tokens, that is, two tokens losing their space separator and becoming one token. Here is
a very short passage of XML from the Shelley-Godwin Archive's encoding:
In this passage, we will be normalizing to ignore the <add> element. Notice the spaces separating the element tag from the ampersand. Now, here
is how collateX interprets the passage in its collation alignment:
<app>
<rdgGrp n="['spot,', 'and', 'endeavoured,']">
<rdg wit="f1818">spot, and endeavoured, </rdg>
<rdg wit="f1823">spot, and endeavoured, </rdg>
<rdg wit="fThomas">spot, and endeavoured, </rdg>
<rdg wit="f1831">spot, and endeavoured, </rdg>
</rdgGrp>
<rdgGrp n="['spotand', 'endeavoured']">
<rdg wit="fMS">spot& endeavoured </rdg>
</rdgGrp>
</app>
Somewhere in the normalizing process, the two completely separate tokens spot and and were spliced together into one token. We are not certain whether this problem is
more with Python's tokenization or with collateX's alignment of normalized empty strings
representing markup irrelevant to the collation. It's as if the normalized markup
refuses quite to disappear, haunting the collation machinery with a ghostly persistent
trace. We are also not certain at the date of this writing whether choosing a different
alignment algorithm in collateX might resolve the problem, but I think not, since the error seems to lie in the interpretation of spaces following normalization. That is, I believe the problem could be in the Python script that handles the normalization.
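The following stylesheet is a hedged illustration of the symptom rather than a diagnosis of our actual script: the sample reading and regexes are invented, but they show how a normalization pattern that consumes the whitespace adjacent to a stripped tag fuses the neighboring tokens, while replacing the markup with a space and collapsing whitespace preserves the separator.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hedged illustration of the smashed-tokens symptom, not the project's code.
     The sample reading and the regexes are invented: if the pattern that strips a
     start-tag also consumes the whitespace before it, the tokens on either side
     fuse; replacing markup with a space and collapsing whitespace keeps them apart. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="3.0">
    <xsl:output method="text"/>
    <!-- A flattened manuscript-style reading in which an add element wraps the ampersand. -->
    <xsl:param name="reading" as="xs:string"
        select="'spot &lt;add&gt;&amp;&lt;/add&gt; endeavoured'"/>
    <xsl:template name="xsl:initial-template">
        <!-- Start-tag deleted along with the preceding whitespace: 'spot' and 'and' fuse. -->
        <xsl:variable name="smashed" select="
            replace(replace(replace($reading, '\s*&lt;add.*?&gt;', ''), '&lt;/add&gt;', ''), '&amp;', 'and')"/>
        <!-- Tags replaced by a space, then whitespace collapsed: the separator survives. -->
        <xsl:variable name="kept" select="
            normalize-space(replace(replace($reading, '&lt;/?add.*?&gt;', ' '), '&amp;', 'and'))"/>
        <xsl:value-of select="'smashed: ' || $smashed || '&#10;kept:    ' || $kept"/>
    </xsl:template>
</xsl:stylesheet>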
At the date of this writing, we have not completed these XSLT post-processing tasks,
in part because we all became busy with senior projects (for them, completing their
projects and for me, assisting with them). I have also paused a moment before continuing
down this path because the more patterns we discovered to correct, the more I wanted
to try another XSLT approach. If our normalization and alignment were so disjointed as to require so much XSLT post-processing, might XSLT simply be better at the
task of normalizing markup and tokenizing and aligning around it? That is, would it
be less brittle to use XSLT for all of our processing, including the collation itself?
XSLT for the win?
At the August 2021 Balisage conference, Joel Kalvesmaki introduced tan:diff and tan:collate. As I listened to the paper, I was struck by how clear, simple, and precise Kalvesmaki
made the process of string comparison seem, using XPath and XSLT:
tan:diff() . . . extracts a series of progressively smaller samples from the shorter of the
two texts and looks for a match in the longer one on the basis of fn:contains(). When a match is found, the results are separated by fn:substring-before() and fn:substring-after()[9]
The application of XPath functions described here made the alignment process seem
more familiar and accessible, amenable to the kinds of modifications I regularly make
with my own XSLT. I wondered whether starting with attempts to match longer samples
and moving to match shorter ones might generate fewer aligned clusters and reduce
the spurious particulation I had been seeing in my Python-assisted process with collateX.
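To make that mechanism concrete, here is a minimal sketch of the underlying idea only, not TAN's implementation and without its optimizations: take progressively smaller leading samples of the shorter string until fn:contains() locates one in the longer string, then report the split produced by fn:substring-before() and fn:substring-after(). The ex namespace and the anchor element are invented for the example.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hedged sketch of the idea Kalvesmaki describes, not TAN's implementation:
     try progressively smaller leading samples of the shorter string until one is
     found in the longer string, then report the split around that anchor. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="https://example.org/diff-sketch"
    exclude-result-prefixes="xs ex" version="3.0">
    <xsl:output method="xml" indent="yes"/>
    <xsl:function name="ex:anchor" as="element(anchor)?">
        <xsl:param name="short" as="xs:string"/>
        <xsl:param name="long" as="xs:string"/>
        <!-- Try the whole short string first, then ever-shorter leading samples. -->
        <xsl:variable name="sample" as="xs:string?"
            select="(for $len in reverse(1 to string-length($short))
                     return substring($short, 1, $len))[contains($long, .)][1]"/>
        <xsl:if test="exists($sample)">
            <anchor sample="{$sample}"
                before="{substring-before($long, $sample)}"
                after="{substring-after($long, $sample)}"/>
        </xsl:if>
    </xsl:function>
    <xsl:template name="xsl:initial-template">
        <!-- Toy example: the sample 'dreary ' anchors the two strings. -->
        <xsl:sequence select="ex:anchor('dreary night', 'It was on a dreary evening')"/>
    </xsl:template>
</xsl:stylesheet>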
So I have lately begun experimenting with tan:diff() and tan:collate().
At the time of this writing, I have begun testing a collation unit from the Frankenstein Variorum, the same unit featured above that produced the smashed-tokens problem discussed in the previous section. Currently I am applying the software following Kalvesmaki’s detailed comments within his sample Diff XSLT files guiding the user to generate TAN Diff's native XML and HTML output. The XML output is similar to, but distinct from, the TEI critical apparatus structure we have been using in the Frankenstein Variorum project. The markup is simply organized in <c> elements when all witnesses share a normalized string, and <u> elements when they vary from each other. There is nothing like the <app> element to bundle together related groups. But this is the realm of XSLT and the
developers encourage their users to adapt the code to their needs. Alas, however,
at the time of this writing, it appears there may be no straightforward way for me,
without assistance from the developer, to output the original text in the format we
need for the spine of the Frankenstein Variorum project. This remains an area for future development.
Nevertheless, I have been applying and updating my 25 normalizing replacement patterns
via XSLT and find this much more legible than my handling of the same in my Python
script (which sometimes contained duplicate passages). XSLT, by nature of being written in XML, is XPathable, and with a large quantity of replacement sequences, I am able
to check and sequence the process carefully in a way that is easier for me to document
legibly.
<!-- Should punctuation be ignored? -->
<xsl:param name="tan:ignore-punctuation-differences" as="xs:boolean" select="false()"/>
<xsl:param name="additional-batch-replacements" as="element()*">
<!--ebb: normalizations to batch process for collation. NOTE: We want to do these to preserve some markup
in the output for post-processing to reconstruct the edition files.
Remember, these will be processed in order, so watch out for conflicts. -->
<replace pattern="(&lt;.+?>\s*)>" replacement="$1" message="normalizing away extra right angle brackets"/>
<replace pattern="&amp;" replacement="and" message="ampersand batch replacement"/>
<replace pattern="&lt;/?xml>" replacement="" message="xml tag replacement"/>
<replace pattern="(&lt;p)\s+.+?(/>)" replacement="$1$2" message="p-tag batch replacement"/>
<replace pattern="(&lt;)(metamark).*?(>).+?\1/\2\3" replacement="" message="metamark batch replacement"/>
<!--ebb: metamark contains a text node, and we don't want its contents processed in the collation, so
this captures the entire element. -->
<replace pattern="(&lt;/?)m(del).*?(>)" replacement="$1$2$3" message="mdel-SGA batch replacement"/>
<!--ebb: mdel contains a text node, so this catches both start and end tag.
We want mdel to be processed as <del>...</del>-->
<replace pattern="&lt;/?damage.*?>" replacement="" message="damage-SGA batch replacement"/>
<!--ebb: damage contains a text node, so this catches both start and end tag. -->
<replace pattern="&lt;/?unclear.*?>" replacement="" message="unclear-SGA batch replacement"/>
<!--ebb: unclear contains a text node, so this catches both start and end tag. -->
<replace pattern="&lt;/?retrace.*?>" replacement="" message="retrace-SGA batch replacement"/>
<!--ebb: retrace contains a text node, so this catches both start and end tag. -->
<replace pattern="&lt;/?shi.*?>" replacement="" message="shi-SGA batch replacement"/>
<!--ebb: shi (superscript/subscript) contains a text node, so this catches both start and end tag. -->
<replace pattern="(&lt;del)\s+.+?(/>)" replacement="$1$2" message="del-tag batch replacement"/>
<replace pattern="&lt;hi.+?/>" replacement="" message="hi batch replacement"/>
<replace pattern="&lt;pb.+?/>" replacement="" message="pb batch replacement"/>
<replace pattern="&lt;add.+?>" replacement="" message="add batch replacement"/>
<replace pattern="&lt;w.+?/>" replacement="" message="w-SGA batch replacement"/>
<replace pattern="(&lt;del)Span.+?spanTo=&quot;#(.+?)&quot;.*?(/>)(.+?)&lt;anchor.+?xml:id=&quot;\2&quot;.*?>"
replacement="$1$3$4$1$3" message="delSpan-to-anchor-SGA batch replacement"/>
<replace pattern="&lt;anchor.+?/>" replacement="" message="anchor-SGA batch replacement"/>
<replace pattern="&lt;milestone.+?unit=&quot;tei:p&quot;.+?/>" replacement="&lt;p/> &lt;p/>"
message="milestone-paragraph-SGA batch replacement"/>
<replace pattern="&lt;milestone.+?/>" replacement="" message="milestone non-p batch replacement"/>
<replace pattern="&lt;lb.+?/>" replacement="" message="lb batch replacement"/>
<replace pattern="&lt;surface.+?/>" replacement="" message="surface-SGA batch replacement"/>
<replace pattern="&lt;zone.+?/>" replacement="" message="zone-SGA batch replacement"/>
<replace pattern="&lt;mod.+?/>" replacement="" message="mod-SGA batch replacement"/>
<replace pattern="&lt;restore.+?/>" replacement="" message="restore-SGA batch replacement"/>
<replace pattern="&lt;graphic.+?/>" replacement="" message="graphic-SGA batch replacement"/>
<replace pattern="&lt;head.+?/>" replacement="" message="head batch replacement"/>
The documentation-centered nature of the TAN stylesheets makes the XSLT itself function as documentation of decisions and makes it easy to amend and resequence the process. Contrast this with the brittle complexity of my Python normalization function and its dependencies: I came to realize that in the more writerly environment of the XSLT stylesheets for TAN Diff, it is easier to review the code, which would be helpful to me in setting it down and returning to it after some months or years.
More to the point of this experiment with a new alignment method, does it succeed in reducing the tokenization and normalization misalignments discussed in the previous section? So far, yes, often with identical correct alignments and usually with longer matches. For example, on the passage that illustrated the smashed-tokens problem, here is tan:collate()'s output:
I have found no problems with empty tokens created from markup, and overall the number
of aligned groups is much smaller. Here is a summary of total aligned reading groups
and unison aligned reading groups for the same collation unit as processed by Python-mediated
CollateX vs. Tan Diff XSLT:
Table I. Alignments generated by CollateX vs. TAN Diff

Alignment                                      CollateX    TAN Diff/Collate
Unison alignments (all witnesses the same)     1198        515
All alignments (unison and divergent)          2315        1932
In general, the alignments were longer in tan:diff() (as expected), and the tokenization and normalization outcomes appear so far to be
reasonably reliable, though this is difficult to tell without being able to see the
original passages in the TAN collation XML output. When this functionality is achieved,
it would make sound sense to take a moment and adapt tan:collate() to the pipeline of the Frankenstein Variorum project, since it appears even out of the box to produce fewer alignment problems. At the time of this writing, however, it is
difficult to navigate the TAN library of included XSLT files and to fully understand
how to intervene in the collation output process.
Just because it is written in XSLT and is well-documented for public usability, does
not mean an XSLT library is any more or less adaptable to intensive customization
than a library made of other forms of code. As I write this concluding paragraph in
mid-July 2022, my student assistant Yuying Jin and I have made a significant breakthrough
in modifying our pre-processing Python script for collateX, and we have succeeded
in handling the problem of spliced tokens, though we have now multiplied the number of empty null-string tokens. Since it now appears easier to continue refining our
Python script than to intervene in the labyrinthine XSLT library of functions, we
will proceed in this direction for now, and look forward to future developments to
improve the customization potential of the TAN library. The XSLT process promises
to make the recursive pre-programming cycles in the Gothenburg model more amenable
to clear documentation, and the effort to investigate two pre-processing methods via
Python, and via XSLT, has proven fruitful to resolve many problems in our original
process. The TAN library holds great promise for handling collations in XSLT, and
while customizing its output proves for this moment no easy matter for an outsider
to TAN, it seems a clear next step for this impressive XSLT library.
[1] Interedition Development Group, The Gothenburg Model, 2010-2019. https://collatex.net/doc/. For more on the summit and workshop of collation software developers in 2009 that
formulated the Gothenburg model, see Ronald Haentjens Dekker, Dirk van Hulle, Gregor
Middell, Vincent Neyt, and Joris van Zundert. Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital
Manuscript Project. Digital Scholarship in the Humanities 30:3 (December 2014), pp. 3-4. doi:https://doi.org/10.1093/llc/fqu007.
[3] John Bradley. Text Tools in A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, and John Unsworth (Wiley, 2004), pages 505-522.
[4] Beshero-Bondar, Elisa Eileen. Rebuilding a Digital Frankenstein by 2018: Reflections toward a Theory of Losses and
Gains in Up-Translation. Presented at Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions,
Washington, DC, July 31, 2017. In Proceedings of Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Beshero-Bondar01.
[5] See Beshero-Bondar, Elisa E., and Raffaele Viglianti. Stand-off Bridges in the Frankenstein Variorum Project: Interchange and Interoperability within TEI Markup Ecosystems. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August
3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Beshero-Bondar01. I am also grateful to Michael Sperberg-McQueen and David Birnbaum for their guidance
of my efforts to raise flattened markup into hierarchical markup leading to a Balisage paper evaluating
multiple methods for attempting this. See Birnbaum, David J., Elisa E. Beshero-Bondar
and C. M. Sperberg-McQueen. Flattening and unflattening XML markup: a Zen garden of XSLT and other tools. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August
3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Birnbaum01.
[6] On Trojan markup, see Steve DeRose, Markup Overlap: A Review and a Horse, Extreme Markup Languages 2004. https://www.academia.edu/42096176/Markup_Overlap_A_Review_and_a_Horse;
and Sperberg-McQueen, C. M. Representing concurrent document structures using Trojan Horse markup. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August
3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01.
[7] On the parallel segmentation method for handling a critical apparatus in TEI P5, see
the TEI Guidelines: TEI Consortium, eds. 4.3.2 Floating Texts. TEI P5: Guidelines for Electronic Text Encoding and Interchange. 4.4.0. April 19, 2022. TEI Consortium. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DS.html#DSFLT (2022-07-19).
[9] Kalvesmaki, Joel. String Comparison in XSLT with tan:diff(). Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6,
2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). doi:https://doi.org/10.4242/BalisageVol26.Kalvesmaki01.