Introduction
Full-text search and text analysis often rely on tokenization—the careful division of textual content into smaller, discrete units (here, words). Many tools for search retrieval and natural language processing require plain text inputs, from which tokens are derived. Markup tags are not desired because these tools cannot parse them as annotations, only as a bizarre sort of plain text.
Note
The documentation for Apache Lucene states outright: "Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few. Lucene does not care about the Parsing of these and other document formats, and it is the responsibility of the application using Lucene to use an appropriate Parser to convert the original format into plain text before passing that plain text to Lucene." (Lucene)
One might next expect the argument that these tools—ranging from the search engine Lucene to the many word cloud generators currently in existence—should be able to parse markup. It may well be true that such applications could benefit from the nuance of a marked-up text, but that argument is beyond the scope of this paper. Rather than advocating for the creation of comprehensive, omni-input tools, I describe a modular, transparent approach: one which prioritizes both the need for tokenization and the need for informational markup, and one which allows for customization without requiring that the rubrics of a markup project be applied again and again.
Real text content
To extract tokens from the text content of an XML document, it is first necessary to determine what the document’s content is. A novice in XML encoding might expect that the text nodes are the real content of the document—to get text out of marked-up text, you just remove the markup.
This approach is reductive, but not absurdly so. The Fifth Edition of the XML 1.0 spec states that "All text that is not markup constitutes the character data of the document." (XML 1.0) Novices may be encouraged to read XML by moving the markup out of cognitive focus, and taking in only the character data. This method emphasizes the document’s linear progression of text nodes, mentally filtering out tags and attribute content.
However, not all character data are comparable, or even useful for all activities. The Text Encoding Initiative, for example, defines the <teiHeader> element for metadata, and <text> for the document itself. Both elements have use in discovery as well as in analysis. The <teiHeader> tends to play a contextual role, allowing one to filter a corpus or determine a document’s licensing information. The <text> element tends to house the words and the structure of the document in question, but calling <text>’s text nodes the real content would still be overly broad in several senses.
In the next sections, I will provide other examples from the Northeastern University Women Writers Project’s Women Writers Online (WWO), a corpus of works by women published before 1850.
Editorial notes
WWO is not intended to serve as commentary on the women’s writing made available to its subscribers.[1] WWO encoding has the goal of accurate representation of the original document,[2] placing an emphasis on structural and semantic tagging, rather than the presentational. However, there are times when the encoding fails to accurately capture some nuance of the original document, or something in the encoding will be lost when it is published to the web. In some of those cases, the encoder of the text is asked to write a public-facing description of the missing nuance. The description is tagged as a <note type="WWP">.
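The note itself is not reproduced in this copy of the paper, but a minimal, hypothetical sketch can suggest its shape; the element name and @type value come from the paragraph above, while the content and placement are invented:

   <note type="WWP">This passage plays on the name of Oliver Cromwell;
      the anagram is marked below.</note>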
This note by Women Writers Project (WWP) alumna Sarah Stanley captures several meanings invoked by Lady Eleanor Davies in The Benediction from the Almighty Omnipotent. The note is undoubtedly important and potentially useful data; someone searching Women Writers Online for mentions of Oliver Cromwell should be able to find Benediction, even though Davies rarely mentions Cromwell without employing some kind of coded wordplay. However, for programmatic analysis of the works authored and published by Davies, a note from the modern era would not be useful.
Alternate readings
Lady Eleanor Davies wrote pamphlets—short in page length, but made dense with added markup. In Benediction, the TEI’s <choice> element offers equally useful alternate readings: expansions (<expan>) are given for Davies’s abbreviations (<abbr>). The former would be of use for searchability;[3] the latter for analysis of character data collected from Benediction.
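The example block itself is missing from this copy of the paper. The following is an illustrative reconstruction based on the descriptions in this and the next two sections: the O:/Oliver <choice>, the <mcr>, and the thus/with one voice line break are attested below, while the elided strings and overall arrangement are guesses:

   <p>... <choice>
         <abbr>O:</abbr>
         <expan>Oliver</expan>
      </choice> <choice>
         <abbr>...</abbr>
         <expan>...</expan>
      </choice> ... <mcr><!-- anagram of the name of O: Cromwel --></mcr> ... thus
   <lb/>with one voice ...</p>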
The above example contains two <choice>s, with whitespace added for readability. The example also contains an instance of the WWP’s custom element <mcr> (meaningful change in rendition[4]) which here marks an anagram for a person mentioned earlier in Benediction, one O: Cromwel.
While the alternatives can be considered on equal footing, <choice> represents an open question for indexing and text analysis purposes. Which child of <choice> should be ignored? Or do we choose to let the character data assert that O: and Oliver occur sequentially?
Implied or insignificant whitespace
Here is the same excerpt, with different spacing:
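This example block is likewise missing; a plausible reconstruction, differing from the sketch above only in lineation and spacing (note the whitespace inside the first <choice>, and the absence of a newline before the <lb>):

   <p>... <choice><abbr>O:</abbr>
      <expan>Oliver</expan></choice> <choice><abbr>...</abbr><expan>...</expan></choice> ...
      <mcr><!-- anagram --></mcr> ... thus<lb/>with one voice ...</p>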
Despite some changes in lineation and white space, this encoding is functionally equivalent to the excerpt in the previous section. First, the TEI defines only elements as the children of <choice>, and so, whitespace-only text nodes are considered to be insignificant[5] when they are the children of <choice>. (TEI Guidelines) Even though the first <choice> has a space between O: and Oliver, a schema-aware processor might show O:Oliver, as if the newline and spaces weren’t there.
Second, the <lb>, or line beginning, implies the existence of a newline between thus and with one voice, regardless of whether or not the newline character is actually present in the previous or following text node. For reasons of formatting and readability, a newline is usually present immediately before the <lb>, but it does not have to be. WWO gives <lb> a default rendition of break(yes),[6] such that <lb> is treated as if it occurs after a newline character.
Tags and differing worldviews
It’s worth noting that the WWP uses extensive intra-word markup. Tags can and do occur in the midst of a word—meaning, one can assume that most elements in WWO imply no surrounding whitespace at all.[7] For example, the <wwp:vuji> tag is used as a convenient shorthand for a <choice> marking old-style letterforms and their regularizations.[8] Because <wwp:vuji> only marks one or two characters, the tag occurs most often inside words and never implies whitespace. We might prefer to read Prophet Ioel or the modernized Prophet Joel. No one would be happy with Prophet I oel.
Many XML-aware tools have a different understanding of implied whitespace. In XTF,[9] eXist-DB,[10] and Morphadorner,[11] every element—by default—implies that there is at least one whitespace character on either side.
I list these tools in particular because each has been used with WWO documents at one time or another. Women Writers Online runs on the XTF platform. Eventually, eXist will replace XTF as a platform for WWO publication.[12] As part of work on the Word Vector Interface, the WWP experimented with Morphadorner for regularization, especially on works from the early modern era. These tools do not share the WWP’s worldview on tags and whitespace, but we have been able to customize them to parse WWO documents with reasonable success.
Both eXist-DB and Morphadorner provide a configuration option which lets one define a list of tags which should be considered inline, that is, as implying no whitespace. (eXist-DB, Burns 2013) Configuration can be a humbling process when most tags must be exempted from the default behavior!
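In eXist-DB, for example, the declaration lives in the Lucene section of a collection.xconf index configuration. This is a minimal sketch, not the WWP’s actual configuration, and it uses TEI names for illustration (WWO documents use the WWP’s own customization namespace):

   <collection xmlns="http://exist-db.org/collection-config/1.0">
      <index xmlns:tei="http://www.tei-c.org/ns/1.0">
         <lucene>
            <!-- Index the body of each document. -->
            <text qname="tei:text"/>
            <!-- Declare elements as inline, i.e. implying no whitespace. -->
            <inline qname="tei:expan"/>
            <inline qname="tei:emph"/>
         </lucene>
      </index>
   </collection>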
XTF, on the other hand, can only parse tags as discrete terms. (XTF Users List) Until recently, WWO pre-publication processes resolved most intra-word ambiguity before XTF indexed the documents. For example, <wwp:vuji>s were transformed into the modern forms of their character data:
XTF’s parser does a fine job of telling Lucene to index the terms Prophet and Joel, setting aside the stopword The.
Recently, the WWP unveiled a new feature of the Women Writers Online interface which allows readers to toggle between the regularized and the original typography. To do so, the <wwp:vuji> tags were retained, although the character data was still modernized:
Also, <wwp:vuji> tags were introduced around each long-s character:
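Again a hypothetical sketch: a word printed with a long s (ſ) is regularized, and the new tag records where the change happened:

   Ble<vuji>s</vuji>sed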
By preserving the encoding, JavaScript can toggle the content of <xhtml:span class="vuji"> to match the reader’s set preferences.
The WWP staff soon discovered that XTF was displaying the content as expected, but it was also indexing the terms Prophet, J, and oel. In fact, XTF had always done this with non-<choice> markup, such as the relatively rare, intra-word <emph>, which has always retained its tags during indexing. We only found out when the abrupt increase in intra-word markup made XTF’s assumption a great deal more apparent.
Words and their boundaries
The concerns listed in previous sections are not new; they cannot be solved once, nor for all. Rather, they are confronted and addressed in XML database index configurations, XSLT stylesheets for publication formats, discussions of schema design, &c., &c. And because there are as many approaches as there are projects, Lucene and the word cloud generators of the world can perhaps be forgiven for sticking to plain text input with its single layer of data. These tools don’t have to interpret or reduce complexity beyond the character level—all content is “real” content.
In the following sections, I describe the fulltexting routines used by the Women Writers Project. The foundation of the routines is fulltext.xsl, also known as the fulltextBot, which defines steps for the creation of an intermediary, derived XML format. The XML intermediary can be (and has been) used for indexing, XPath queries, the extraction of plain text, and simple HTML display.
At the time of this writing, fulltextBot development has three guiding principles:
- no matter the reasons one has for needing reliable word boundaries in character data, some normalization processes will always be useful;
- to support as many applications as possible, the markup should be preserved for as long as it remains valuable; and
- it should always be possible to determine where and why a normalized document differs from the original.
Early versions of the fulltextBot favored human-readable output over verbosity,[13] and so it may come as no surprise that the fulltextBot creates an XML intermediary which can be read using the reductive premise described in the introduction—that one can determine the so-called real content of an XML document by focusing only on the text nodes. Alternate readings are removed from text nodes, leaving only regularized character data.
The original content is not lost, though. Whenever the fulltextBot decides that character data should be dropped from the document’s regularized content, it moves the string into a custom attribute called @read (as in, for this element, read this original character data). Examples are shown below.
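For instance, in these hypothetical sketches of fulltextBot output, the regularized reading stays in the text node while the original string is preserved on @read (exactly which element carries the attribute is an assumption here):

   <expan read="O:">Oliver</expan>
   <vuji read="I">J</vuji>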
Origins of the fulltextBot
In 2016, Syd Bauman and I started work on a small application to serve WWO data out of an XML database. The project ultimately didn’t go anywhere, but it did include an XSLT stylesheet intended to create index-friendly derivatives of WWO documents. This stylesheet, the fulltextBot, was also an experiment in soft hyphen processing.
Soft hyphens: an interlude
Soft hyphens are the hyphens which occur at the end of a printed line, in the middle of a word, where a hyphen would not normally occur.
Soft hyphens are also the most tenacious of intra-word markup. The soft hyphen phenomenon is encoded in WWO as the Unicode character U+00AD, SOFT HYPHEN. That is to say, unlike <wwp:vuji> or even <emph>, a soft hyphen occurs alongside other character data in a text node. The presence of a soft hyphen overrides any whitespace implied by the next printed line (<lb>). In fact, any whitespace should be considered insignificant if it occurs after the soft hyphen character and before the orphaned wordpart. Ideally, the wordpart before the soft hyphen should be joined up with the next eligible wordpart.
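For example, in a hypothetical WWO-style encoding (with &#xAD; standing in for the literal soft hyphen character), a word broken across printed lines might look like this:

   ... the Benedic&#xAD;
   <lb/>tion of the Almighty ...

Here, both the literal newline and the newline implied by <lb> are insignificant; the two wordparts should resolve to the single word Benediction.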
In 2016, WWP staff spent weeks debugging the soft hyphen processing in Women Writers Online stylesheets. Syd Bauman’s paper The Hard Edges of Soft Hyphens goes into great depth about the intricacies of whitespace and axis relationships, all of which make it difficult to obtain a single word from two parts separated by a soft hyphen. Syd writes of his experimental method:
Eventually it occurred to me that XSLT’s forte is processing trees of element nodes and their attributes, not text nodes. A large part of the problem I was having was needing to repeat a test performed in template A so that template B could figure out what template A had thought of a given node. Instead, if I processed in separate passes, template A could record what it thought of each node so that template B, running at a later pass, would know. Of course, one needs a place to record this information, and a text node doesn’t really have any convenient place.
I, in turn, wanted to reduce the cognitive load required for humans to parse and debug the XPaths needed for template A’s and template B’s tests.[14] I followed the status quo established in the WWO stylesheets: when a soft hyphen occurs, the XSLT moves the second wordpart to the first, and deletes the soft hyphen. When a text node has a soft hyphen in it, an XSLT template (A) must correctly identify the next part of the word, and copy that wordpart. Consequently, a template (B) must also be able to identify text nodes which contain the copied wordpart, and delete the string. As noted in Bauman 2016, a successful resolution can only occur when both text nodes are processed.
My one innovation in soft hyphen processing was to first group together sequences of elements which represent artifacts around pages, such as catchwords (<mw type="catch">), signatures (<mw type="sig">), and page beginnings (<pb>).[15] A template matches these elements, and determines if the current node has any other artifacts before it. If not, the current node is processed by the pbSubsequencer template. The pbSubsequencer recursively gathers up all pbGroup candidates which appear immediately after the triggering element. The resulting collection of elements and whitespace-only text nodes is contained within <ab type="pbGroup">. With the phenomena around page beginnings grouped together on a first pass, templates in the second pass—unifier mode—could safely ignore these pbGroups when deciding whether a text node is on either side of a soft hyphen.
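The result of the first pass might look something like this sketch; the wrapper and its @type are described above, while the member elements and their contents are invented for illustration:

   <ab type="pbGroup">
      <mw type="sig">B</mw>
      <mw type="catch">tion</mw>
      <pb n="5"/>
   </ab>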
An anxiety of soft hyphens
On November 29, 2016, I wrote an optimistic commit message: This should(?!) complete shy handling.[16] I was wrong, of course, and I knew it even then, even though my test data looked clean. Soft hyphens are the most volatile of intra-word markup because so much of their behavior depends upon: implied whitespace; elements with character data that should be ignored when looking for the next wordpart; elements which should halt shy processing (such as <gap>); how much node ancestry is shared by the affected wordparts; &c., &c.
Knowing this, I surveyed the WWO corpus for soft hyphens, looking for encoding which might cause bugs. In his paper, Syd states that it is trivially easy to find all the occurrences of soft hyphens that require resolution in WWO documents. (Bauman 2016) I found this to be accurate. On the other hand, it is much harder to classify the ways in which soft hyphens interact with the XML structures around them. It is even more difficult to do so programmatically, at scale.
For testing purposes, elements and/or attributes were introduced at the sites of the fulltextBot’s interventions. Besides @read, the fulltextBot would add @resp="fulltextBot", @type, and @subtype to communicate the kind of intervention made. The fulltextBot also would be able to recognize WWO elements which imply break behavior. If the element had no preceding whitespace delimiter, the fulltextBot would add one.
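A hypothetical sketch of one such marker, reusing the <vuji> example from earlier (the attribute names come from the text above; the @type token is invented):

   <vuji read="I" resp="fulltextBot" type="regularization">J</vuji>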
Beyond the fulltextBot XSLT, a companion XQuery, fulltext2table, was developed to gather regularized WWO content into a tab-separated values (TSV) file.[19] Each row represents a document from Women Writers Online. Besides a cell containing a plain text representation of the document, each row also contains metadata about the source material.
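The output might look something like this sketch, with invented column names and metadata:

   filename	author	date	fulltext
   davies.benediction.xml	Davies, Lady Eleanor	1651	The Benediction ...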
All-purpose fulltexting
By April 2017, general development on the sample WWO application had stopped. The only commits in the app repository were on the fulltextBot or fulltext2table. With the push to retain tags and to capture the provenance of interventions on WWO character data, the XSLT was becoming a transparent, open system. At this point, the fulltextBot and the XQuery were moved to the Women Writers Project Public Code Share as modular parts of a general-purpose toolset.[20]
By applying a baseline of normalization first, the toolset as a whole lowers the barrier to entry for creating plain text from WWO documents. The fulltext2table XQueries allow further customizations, and they free users to define for themselves what constitutes relevance in marked-up text.
The fulltexting routines have since been used for many purposes, mostly by WWP staff and encoders. These endeavors include: gathering data on the titles in WWO; providing regularized plain text to researchers; creating input files for training word embedding models;[21] and spellchecking WWO texts before publication.
The toolset has continued to grow in response to these endeavors. The processes already described continue to be fine-tuned as new bugs are discovered. In order to reduce the memory needed to run the original fulltext2table.xq, a new version called fulltext2table.enmasse.xq was created to produce one TSV file per XML document. The fulltextBot offered the option to move <note>s out of the <wwp:hyperDiv> and next to their anchors. Sarah Connell and I wrote a new XQuery to get plain text out of generic XML. Also, starting in fulltextBot version 2.0, I reworked soft hyphen handling—instead of moving wordparts around, the fulltextBot now deletes the whitespace that occurs between wordparts.[22]
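Applied to the hypothetical line-break example from earlier, the version 2.0 behavior looks like this (whether the soft hyphen itself is retained afterward is not shown):

   <!-- before: -->
   ... the Benedic&#xAD;
   <lb/>tion of the Almighty ...
   <!-- after "unifier" mode (intervening whitespace deleted; wordparts left in place): -->
   ... the Benedic&#xAD;<lb/>tion of the Almighty ...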
Customizable extraction of plain text
With some effort, the intermediary XML can be used to walk back from a plain text snippet to the original WWO XML. The first real use of the fulltextBot was to create an Inspectre report[23] on normalized <title>s which only appear once in WWO. For human readability, it was necessary for each <title> to be normalized... and, for actionability, it was necessary to be able to get back to the original node using XPath or XQuery.
I used the fulltextBot to create intermediary XML of each published WWO document. I then ran an XQuery script which calculated the number of times the content of each <title> appeared across the corpus, and inserted an @ft-match attribute on those which appeared only once. The singleton-intertextual-titles Inspectre report contained copies of the passages in which those <title>s appeared. The Inspectre application transformed the passages into HTML, and also provided an XML view and an XPath for cases where more context was needed.[24]
Once the Inspectre report was complete, I used another XQuery to insert bibliographic references (@ref) onto <title>. The nature of the intermediary XML allowed me to programmatically determine what the original text content of a given node would have been at the time of the report’s creation. The annotations told me which file the node appeared in, and the <milestone> preceding the node’s containing passage.
With one voice
In the intermediary form described above, WWO markup retains its value even when the character data is being prioritized. At a minimum, fulltextBot results provide a window into the original encoding. They can be queried just as regular WWO texts can, and they can allow one to answer questions with XPath that would ordinarily require XQuery or XSLT and, likely, a day of developer time. More than that, the XSLT—and the assumptions under its code—can be debugged by searching the output for the intervention-marking attributes and their fulltextBot-specific tokens.
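For example, a single XPath expression over the intermediary, assuming the @resp convention described earlier, surfaces every site of intervention:

   //*[@resp = 'fulltextBot']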
The fulltexting routines have been used on other TEI-based corpora with a change to the default namespace declarations, and with some document analysis to find any ignorable elements. Even so, I think the toolset’s most valuable asset is that it gives shape and context to the invisible[25] rules underlying WWO encoding. In short, the fulltextBot works best on WWO documents because it has been tailored to the dimensions of the WWO corpus.
As previously stated, there are as many approaches to tokenization as there are projects. But it is perhaps more useful to say that every project has baked-in assumptions about what textual content is important, and how XML nodes play off one another. And it is more important still to examine these assumptions, to test them, and to build a foundation on which common understanding can rest.
Acknowledgments
Thanks to all the encoders and staff at the Women Writers Project, for their time, their energy, and their thoughtfulness. They transcribe, edit, proof, correct, query, ask questions, do research, advocate for new encoding processes, find interesting phenomena, push WWO to break new ground, &c., &c. None of this would be possible without their painstaking work.
I owe a significant debt of gratitude to Syd Bauman for his support and for his work processing soft hyphens. The fulltextBot would not be nearly so comprehensive if Syd hadn’t pointed out many, many pitfalls to me.
I owe even more to Sarah Connell, who probably has a copy of almost every version of the fulltextBot. Her feedback and feature requests have indelibly shaped these tools, making them much more powerful and accessible than they would be otherwise.
Finally, a grateful thank you to the peer reviewers for Balisage, for all their suggestions, especially regarding the overall shape of this paper.
Any errors or missteps are mine and mine alone.
Appendix A. Further information
The Women Writers Project fulltext toolset can be found in the WWP Public Code Share on GitHub.
Appendix B. Processing in fulltext.xsl version 2.4
The fulltextBot at version 2.4 can be found at commit 556a8a of the WWP Public Code Share.
Pass 1: default mode
Most regularization takes place, including the following:
- long-s characters are changed to lower-case s characters;
- <choice>s are made;
- WWP-authored content is deleted;
- implied whitespace is made explicit;
- pbGroup members are wrapped together in an <ab> element.
Pass 2: unifier mode
Once whitespace is in a reliable state and metawork is dehydrated into values on @read, soft hyphens can be resolved. Whitespace is deleted if it occurs after a soft hyphen and before a subsequent wordpart.
If the parameter $move-notes-to-anchors is toggled on (it is off by default), unifier mode is first run on <note>s. The resulting <note>s are tunnelled through to their anchor points in the <text> proper. Notes are not inserted next to their anchors if the note would appear in the middle of a word.
Pass 3: noted mode
If $move-notes-to-anchors is toggled on and there exist <note>s which could not be placed with their anchor, those notes are returned to their original locations.
Note
This would be the pass where the remaining notes would be placed after the interrupting wordpart. However, this kind of manipulation is easier to do with XQuery Update, so I left it out of the XSLT stylesheet.
References
[Lucene] Apache Software Foundation. Lucene 8.0.0 documentation. Package org.apache.lucene.analysis. https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/package-summary.html#package.description. Accessed 2019-04-12.
[Bauman 2016] Bauman, Syd. “The Hard Edges of Soft Hyphens.” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2–5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). doi:https://doi.org/10.4242/BalisageVol17.Bauman01.
[Burns 2013] Burns, Philip R. 2013. “MorphAdorner v2: A Java Library for the Morphological Adornment of English Language Texts.” Northwestern University. https://morphadorner.northwestern.edu/morphadorner/download/morphadorner.pdf. Accessed 2019-07-05.
[Davies, The Benediction] Davies, Lady Eleanor. 2015. The Benediction, 1651. From the Women Writers Online XML, last modified 2019-02-10 (commit 36259). Published at https://www.wwp.northeastern.edu/texts/davies.benediction.html. (Requires subscription.)
[eXist-DB] eXist-db Project. Documentation. "Whitespace Treatment and Ignored Content." In Full Text Index. http://exist-db.org/exist/apps/doc/lucene.xml#D3.19.62. Accessed 2019-07-04.
[Jockers 2016] Jockers, Matthew L. 2016. "Text Quality, Text Variety, and Parsing XML." In Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer International.
[TEI Guidelines] TEI Consortium. "Appendix C: Elements." In P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.5.0. https://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ELEMENTS.html. Accessed 2019-07-04.
[XML 1.0] W3C. Extensible Markup Language (XML) 1.0 (Fifth Edition). Section 2.4, "Character Data and Markup." https://www.w3.org/TR/REC-xml/#syntax. Accessed 2019-04-12.
[XQuery and XPath Full Text 1.0] W3C. XQuery and XPath Full Text 1.0. https://www.w3.org/TR/xpath-full-text-10/. Accessed 2019-04-12.
[XTF Users List] XTF Users List. 2012-02-06 – 2012-05-04. Forum thread: "Tags that break up words." https://groups.google.com/forum/#!topic/xtf-user/hsvFOTM0b9E. Accessed 2019-07-04.
[1] The Women Writers Project does publish essays on the documents within WWO. These are encoded in a separate TEI customization and published as Women Writers in Context.
[2] By way of a facsimile.
[3] And for human comprehension!
[4] See the WWP Internal Documentation entry for <mcr> for more information on when the element is applied.
[5] The XML 1.0 specification states, "In editing XML documents, it is often convenient to use 'white space' (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, 'significant' white space that should be preserved in the delivered version is common, for example in poetry and source code." (XML 1.0)
[6] The WWO Internal Documentation includes a list of elements which break by default: https://wwp.northeastern.edu/research/publications/documentation/internal/#!/entry/break_narrative
[7] One can also assume that the presence of most encoded whitespace in WWO should be respected. In our transformations and queries, Syd Bauman and I tend to use a variation on normalize-space(), where one or more whitespace characters are normalized to a single space, even if the whitespace occurs at the beginning, at the end, or as the entirety of a string. (See, for example, Bauman 2016.)
[9] The eXtensible Text Framework (XTF) is a web publishing platform which includes Lucene for search and indexing, and a set of customizable XSLT stylesheets to parse, transform, and deliver web content. XTF is supported by the California Digital Library. https://xtf.cdlib.org/.
[10] eXist-DB is an XML database and application platform. It supports indexing via Lucene. http://exist-db.org/exist/apps/homepage/index.html.
[11] Morphadorner is a command line tool which features tokenization of plain text or XML content, and the adornment of tokens with lemmata, parts of speech, etc. http://morphadorner.northwestern.edu/morphadorner/.
[12] The WWP already uses eXist to power the WWP’s public access collections Women Writers in Context and Women Writers in Review.
[13] XML readability was a guiding principle up until about version 2.0 of the fulltextBot, when I reworked soft hyphen processing to remove a good deal of whitespace, including newlines. Instead of human-readable XML, I now aim for human-decipherable XML.
I consider this to be version 1.0 of the fulltextBot, although it is not marked as such: commit 370f4e of GitHub repository amclark42/xdb-app-central.
[14] As an example of the code’s complexity, here’s an //xsl:template/@match expression, which attempts to identify whether a text node should delete a wordpart:
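The expression itself is not preserved in this copy of the paper. The following is a simplified, hypothetical reconstruction of its general flavor only, not the WWP’s actual code:

   text()[matches(., '\S')]
         [preceding::text()[matches(., '\S')][1][matches(., '&#xAD;\s*$')]
         [not(ancestor::ab[@type eq 'pbGroup'])]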
[15] The WWO Internal Documentation entry Forme work and meta work goes into more detail.
[16] See commit 8779fd of amclark42/xdb-app-central. This very old version of the fulltextBot still regularizes documents by removing tags or entire elements.
[17] fulltext.xsl from 2016-12-15, commit 370f4ee of GitHub repository amclark42/xdb-app-central.
[18] fulltext.xsl from 2017-04-28, commit bd0968f in GitHub repository amclark42/xdb-app-central.
[19] The earliest version of fulltext2table.xq can be found in the GitHub repository amclark42/xdb-app-central.
[20] The Public Code Share is a GitHub repository and collection of open source, WWP-authored tools which could be of use to encoders, researchers, developers, and/or XML enthusiasts. Most of these tools are written in XSLT or XQuery. The fulltext code in particular can be found in the fulltext directory of the repository.
[21] The WWP’s word embedding models can be queried with the Word Vector Interface, itself available as part of the Women Writers Vector Toolkit. The Methodology page contains more information on the preparation, training, and testing of these models.
[22] I could probably write another paper on this, and maybe one day I will. Ultimately I decided that humans are not great at writing XSLT for moving content across variable markup structures, due to the aforementioned need to copy and delete in two different nodes. I also decided that leaving the wordparts where they are is truer to the original work and to the spirit in which the WWP encodes soft hyphens.
[23] For the origins of the Inspectre, see Meta(data)morphosis (Clark & Connell 2016): http://www.balisage.net/Proceedings/vol18/html/Clark01/BalisageVol18-Clark01.html.
[24] The singleton-intertextual-titles report is complete, and no longer has a web presence. Screenshots can be seen in Sarah Connell’s lecture notes from a panel at the 2017 Digital Humanities conference.
[25] Invisible in the XML document, at least. One might not know to check the WWP’s Internal Documentation, or the WWP editorial statement, or the ODD file.