With One Voice: A Modular Approach to Streamlining Character Data for
    Tokenization

Ashley M. Clark

Abstract

This article discusses the concerns that arise when deriving textual content from a marked-up corpus, for use in full-text search or natural language processing. Also discussed is one approach, in use by the Women Writers Project, which creates an intermediary XML document which reflects the encoding practices and assumptions made within the Project.

Introduction

Full-text search and text analysis often rely on tokenization—the careful division of textual content into smaller, discrete units (here, words). Many tools for search retrieval and natural language processing require plain text inputs, from which tokens are derived. Markup tags are not desired because these tools cannot parse them as annotations, only as a bizarre sort of plain text.

Note

The documentation for Apache Lucene states outright: Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few. Lucene does not care about the Parsing of these and other document formats, and it is the responsibility of the application using Lucene to use an appropriate Parser to convert the original format into plain text before passing that plain text to Lucene. (Lucene)

One might next expect the argument that these tools—ranging from the search engine Lucene to the many word cloud generators currently in existence—should be able to parse markup. It may in fact be true that such applications could benefit from the nuance of a marked-up text, but such an argument is beyond the scope of this paper. Rather than advocating for the creation of comprehensive, omni-input tools, I describe a modular, transparent approach; one which prioritizes the need for tokenization and the need for informational markup, one which allows for customization but not at the expense of applying the rubrics of a markup project again and again.

Real text content

To extract tokens from the text content of an XML document, it is first necessary to determine what the document’s content is. A novice in XML encoding might expect that the text nodes are the real content of the document—to get text out of marked-up text, you just remove the marked-up.

This approach is reductive, but not absurdly so. The Fifth Edition of the XML 1.0 spec states that All text that is not markup constitutes the character data of the document. (XML 1.0) Novices may be encouraged to read XML by moving the markup out of cognitive focus, and taking in only the character data. This method emphasizes the document’s linear progression of text nodes, mentally filtering out tags and attribute content.

However, not all character data are comparable, or even useful for all activities. The Text Encoding Initiative, for example, defines the <teiHeader> element for metadata, and <text> for the document itself. Both elements have use in discovery as well as in analysis. The <teiHeader> tends to play a contextual role, allowing one to filter a corpus or determine a document’s licensing information. The <text> element tends to house the words and the structure of the document in question, but calling <text>’s text nodes the real content is still be overly broad in several senses.

In the next sections, I will provide other examples from the Northeastern University Women Writers Project’s Women Writers Online (WWO), a corpus of works by women, which were published before 1850.

Editorial notes

WWO is not intended to serve as commentary on the women’s writing made available to its subscribers.^[1] WWO encoding has the goal of accurate representation of the original document,^[2] placing an emphasis on structural and semantic tagging, rather than the presentational. However, there are times when the encoding fails to accurately capture some nuance of the original document, or something in the encoding will be lost when it is published to the web. In some of those cases, the encoder of the text is asked to write a public-facing description of the missing nuance. The description is tagged as a <note type="WWP">.

Figure 1

<note xml:id="n003" target="#a003" type="WWP">
  <p>These characters symbolize several things. The first is the Sun and the Moon. 
    They also represent the “eye and the horn of the lamb.”
    The characters are also supposed to evoke “O C” for 
    <persName>Oliver Cromwell</persName>.
  </p>
</note>

Fig. 1. Example of a WWO note, from The Benediction from the Almighty Omnipotent by Lady Eleanor Davies. Here’s part of the sentence in which the note is anchored: witneſs ☉ ☾ their Golden Characters, ſtiled Eyes and Horns of the Lamb, &c. (Davies, The Benediction)

This note by Women Writers Project (WWP) alumnus Sarah Stanley captures several meanings invoked by Lady Eleanor Davies in The Benediction from the Almighty Omnipotent. The note is undoubtedly important and potentially useful data; someone searching Women Writers Online for mentions of Oliver Cromwell should be able to find Benediction, even though Davies rarely mentions Cromwell without employing some kind of coded wordplay. However, for programmatic analysis of the works authored and published by Davies, a note from the modern era would not be useful.

Alternate readings

Lady Eleanor Davies wrote pamphlets—short in page length, but made dense with added markup. In Benediction, the TEI’s <choice> element offers equally useful alternate readings: expansions (<expan>) are given for Davies’s abbreviations (<abbr>). The former would be of use for searchability;^[3] the latter for analysis of character data collected from Benediction.

Figure 2

Anagram, <mcr rend="slant(upright)" xml:id="cromwel1">Howl <placeName>Rome</placeName></mcr>: And thus
<lb/>with one voice, <said rend="slant(upright)">come and ſee, 
<persName><choice>
    <abbr>O:</abbr>
    <expan>Oliver</expan>
  </choice>
 
  <choice>
    <abbr>C:</abbr>
    <expan>Cromwell</expan>
  </choice></persName></said>

Fig. 2. Davies, The Benediction

The above example contains two <choice>s, with whitespace added for readability. The example also contains an instance of the WWP’s custom element <mcr> (meaningful change in rendition^[4]) which here marks an anagram for a person mentioned earlier in Benediction, one O: Cromwel.

While the alternatives can be considered on equal footing, <choice> represents an open question for indexing and text analysis purposes. Which child of <choice> should be ignored? Or do we choose to let the character data assert that O: and Oliver occur sequentially?

Implied or insignificant whitespace

Here is the same excerpt, with different spacing:

Figure 3

Anagram, <mcr rend="slant(upright)" xml:id="cromwel1">Howl <placeName>Rome</placeName></mcr>: 
And thus<lb/>with one voice, <said rend="slant(upright)">come and ſee, <persName><choice><abbr>O:</abbr>
      <expan>Oliver</expan></choice> <choice><abbr>C:</abbr><expan>Cromwell</expan></choice></persName></said>

Despite some changes in lineation and white space, this encoding is functionally equivalent to the excerpt in the previous section. First, the TEI defines only elements as the children of <choice>, and so, whitespace-only text nodes are considered to be insignificant^[5] when they are the children of <choice>. (TEI Guidelines) Even though the first <choice> has a space between O: and Oliver, a schema-aware processor might show O:Oliver, as if the newline and spaces weren’t there.

Second, the <lb>, or line beginning, implies the existence of a newline between thus and with one voice, regardless of whether or not the newline character is actually present in the previous or following text node. For reasons of formatting and readability, a newline is usually present immediately before the <lb>, but it does not have to be. WWO gives <lb> a default rendition of break(yes),^[6] such that <lb> is treated as if it occurs after a newline character.

Tags and differing wordviews

It’s worth noting that the WWP uses extensive intra-word markup. Tags can and do occur in the midst of a word—meaning, one can assume that most elements in WWO imply no surrounding whitespace at all.^[7] For example, the <wwp:vuji> tag is used as a convenient shorthand for a <choice> marking old-style letterforms and their regularizations.^[8]

Figure 4

The Prophet <persName rend="slant(upright)"><vuji>I</vuji>oel</persName>

The Prophet <persName rend="slant(upright)"><choice><orig>I</orig><reg>J</reg></choice>oel</persName>

The first snippet uses <wwp:vuji> to mark the character I, which would be written as J in modern usage. The second snippet shows the TEI-conformant version of the same content. (Davies, The Benediction)

Because <wwp:vuji> only marks one or two characters, the tag occurs most often inside words and never implies whitespace. We might prefer to read Prophet Ioel or the modernized Prophet Joel. No one would be happy with Prophet I oel.

Many XML-aware tools have a different understanding of implied whitespace. In XTF,^[9] eXist-DB,^[10] and Morphadorner,^[11] every element—by default—implies that there is at least one whitespace character on either side.

I list these tools in particular because each has been used with WWO documents at one time or another. Women Writers Online runs on the XTF platform. Eventually, eXist will replace XTF as a platform for WWO publication.^[12] As part of work on the Word Vector Interface, the WWP experimented with Morphadorner for regularization, especially on works from the early modern era. These tools do not share the WWP’s worldview on tags and whitespace, but we have been able to customize them to parse WWO documents with reasonable success.

Both eXist-DB and Morphadorner provide a configuration option which lets one define a list of tags which should be considered inline, or, as implying no whitespace. (eXist-DB, Burns 2013) Configuration can be a humbling process when most tags must be exempted from the default behavior!

XTF, on the other hand, can only parse tags as discrete terms. (XTF Users List) Until recently, WWO pre-publication processes resolved most intra-word ambiguity before XTF indexed the documents. For example, <wwp:vuji>s were transformed into the modern forms of their character data:

Figure 5

The Prophet <persName slant="upright"> Joel</persName>

XTF’s parser does a fine job of telling Lucene to index the terms Prophet and Joel, setting aside the stopword The.

Recently, the WWP unveiled a new feature of the Women Writers Online interface which allows readers to toggle between the regularized and the original typography. To do so, the <wwp:vuji> tags were retained, although the character data was still modernized:

Figure 6

The Prophet <persName slant="upright"> <vuji>J</vuji>oel</persName>

Also, <wwp:vuji> tags were introduced around each long-s character:

Figure 7

as fore<vuji>s</vuji>aw

By preserving the encoding, Javascript can toggle the content of <xhtml:span class="vuji"> to match the reader’s set preferences.

The WWP staff soon discovered that XTF was displaying the content as expected, but it was also indexing the terms Prophet, J, and oel. In fact, XTF had always done this with non-<choice> markup, such as the relatively rare, intra-word <emph>, which has always retained its tags during indexing. We only found out when the abrupt increase in intra-word markup made XTF’s assumption a great deal more apparent.

Figure 8

The <hit><term>Prophet</term></hit> J oel as fore s aw

Keyword-in-context snippet from the XTF’s raw XML results for a WWO search on the word prophet.

Words and their boundaries

The concerns listed in previous sections are not new; they cannot be solved once, nor for all. Rather, they are confronted and addressed in XML database index configurations, XSLT stylesheets for publication formats, discussions of schema design, &c., &c. And because there are as many approaches as there are projects, Lucene and the word cloud generators of the world can perhaps be forgiven for sticking to plain text input with its single layer of data. These tools don’t have to interpret or reduce complexity beyond the character level—all content is “real” content.

In the following sections, I describe the fulltexting routines used by the Women Writers Project. The foundation of the routines is fulltext.xsl, also known as the fulltextBot, which defines steps for the creation of an intermediary, derived XML format. The XML intermediary can be (and has been) used for indexing, XPath queries, the extraction of plain text, and simple HTML display.

At the time of this writing, fulltextBot development has three guiding principles:

no matter the reasons one has for needing reliable word boundaries in character data, some normalization processes will always be useful;
to support as many applications as possible, the markup should be preserved for as long as it remains valuable; and
it should always be possible to determine where and why a normalized document differs from the original.

Early versions of the fulltextBot favored human-readable output over verbosity,^[13] and so it may come as no surprise that the fulltextBot creates an XML intermediary which can be read using the reductive premise described in the introduction—that one can determine the so-called real content of an XML document by focusing only on the text nodes. Alternate readings are removed from text nodes, leaving only regularized character data.

The original content is not lost, though. Whenever the fulltextBot decides that character data should be dropped from the document’s regularized content, it moves the string into a custom attribute called @read (as in, for this element, read this original character data). Examples are shown below.

Origins of the fulltextBot

In 2016, Syd Bauman and I started work on a small application to serve WWO data out of an XML database. The project ultimately didn’t go anywhere, but it did include a XSLT stylesheet intended to create index-friendly derivatives of WWO documents. This stylesheet, the fulltextBot, was also an experiment in soft hyphen processing.

Soft hyphens: an interlude

Soft hyphens are the hyphens which occur at the end of a printed line, in the middle of a word, where a hyphen would not normally occur.

Figure 9

witneſs <seg xml:id="a003" corresp="#n003">☉ ☾</seg> their Golden Cha-
  <lb/>racters, ſtiled Eyes and Horns of the
  <lb/>Lamb

An example of a soft hyphen in WWO encoding. (Davies, The Benediction)

For display purposes, I have used a hard hyphen character (&#x2D) instead of the soft hyphen character ().

Soft hyphens are also the most tenacious of intra-word markup. The soft hyphen phenomenon is encoded in WWO as the Unicode character . That is to say, unlike <wwp:vuji> or even <emph>, a soft hyphen occurs alongside other character data in a text node. The presence of a soft hyphen overrides any whitespace implied by the next printed line (<lb>). In fact, any whitespace should be considered insignificant if it occurs after the soft hyphen character and before the orphaned wordpart. Ideally, the wordpart before the soft hyphen should be joined up with the next eligible wordpart.

In 2016, WWP staff spent weeks debugging the soft hyphen processing in Women Writers Online stylesheets. Syd Bauman’s paper The Hard Edges of Soft Hyphens goes into great depth about the intricacies of whitespace and axis relationships, all of which make it difficult to obtain a single word from two parts separated by a soft hyphen. Syd writes of his experimental method:

Eventually it occurred to me that XSLT’s forte is processing trees of element nodes and their attributes, not text nodes. A large part of the problem I was having was needing to repeat a test performed in template A so that template B could figure out what template A had thought of a given node. Instead, if I processed in separate passes, template A could record what it thought of each node so that template B, running at a later pass, would know. Of course, one needs a place to record this information, and a text node doesn’t really have any convenient place.

(Bauman 2016, emphasis mine)

I, in turn, wanted to reduce the cognitive load required for humans to parse and debug the XPaths needed for template A’s and template B’s tests.^[14] I followed the status quo established in the WWO stylesheets: when a soft hyphen occurs, the XSLT moves the second wordpart to the first, and deletes the soft hyphen. When a text node has a soft hyphen in it, an XSLT template (A) must correctly identify the next part of the word, and copy that wordpart. Consequently, a template (B) must also be able to identify text nodes which contain the copied wordpart, and delete the string. As noted in Bauman 2016, a successful resolution can only occur when both text nodes are processed.

My one innovation in soft hyphen processing was to first group together sequences of elements which represent artifacts around pages, such as catchwords (<mw type="catch">), signatures (<mw type="sig">), and page beginnings (<pb>).^[15]

Figure 11

<pb n="2"/>
      <milestone unit="sig" n="A1v"/>

WWO encoding indicating the beginning of page 2, which has an idealized signature of A1v. (Davies, The Benediction)

Figure 12

<ab type="pbGroup">
<pb n="1"/>
      <milestone unit="sig" n="A1r"/>

      </ab>

An early example of a pbGroup wrapper. (Derived from Davies, The Benediction.)

A template matches these elements, and determines if the current node has any other artifacts before it. If not, the current node is processed by the pbSubsequencer template. The pbSubsequencer recursively gathers up all pbGroup candidates which appear immediately after the triggering element. The resulting collection of elements and whitespace-only text nodes is contained within

<ab
                        type="pbGroup">

. With the phenomena around page beginnings grouped together on a first pass, templates in the second pass—unifier mode—could safely ignore these pbGroups when deciding whether a text node is on either side of a soft hyphen.

An anxiety of soft hyphens

On November 29, 2016, I wrote an optimistic commit message: This should(?!) complete shy handling.^[16] I was wrong, of course, and I knew it even then, even though my test data looked clean. Soft hyphens are the most volatile of intra-word markup because so much of their behavior depends upon: implied whitespace; elements with character data that should be ignored when looking for the next wordpart; elements which should halt shy processing (such as <gap>); how much node ancestry is shared by the affected wordparts; &c., &c.

Figure 13

witness <seg xml:id="a003" corresp="#n003">☉ ☾</seg> their Golden
 Cha<seg read="">racters,</seg><lb/> <seg read="racters,"/> stiled Eyes and Horns of the
  <lb/> Lamb,

The effect of a 2016 fulltextBot^[17] on wordparts separated by a soft hyphen. (Derived from Davies, The Benediction.)

Knowing this, I surveyed the WWO corpus for soft hyphens, looking for encoding which might cause bugs. In his paper, Syd states that it is trivially easy to find all the occurrences of soft hyphens that require resolution in WWO documents. (Bauman 2016) I found this to be accurate. On the other hand, it is much harder to classify the ways in which soft hyphens interact with the XML structures around them. It is even more difficult to do so programmatically, at scale.

For testing purposes, elements and/or attributes were introduced at the sites of the fulltextBot’s interventions. Besides @read, the fulltextBot would add @resp="fulltextBot", @type, and @subtype to communicate the kind of intervention made. The fulltextBot also would be able to recognize WWO elements which imply break behavior. If the element had no preceding whitespace delimiter, the fulltextBot would add one.

Figure 14

<ab type="pbGroup" subtype="add-element" resp="fulltextBot"><pb n="2"/>
      <milestone unit="sig" n="A1v"/>

     </ab>

witness <seg xml:id="a003" corresp="#n003">☉ ☾</seg> their Golden
 Cha<seg read="-" type="shy-part" subtype="add-element mod-content" resp="fulltextBot">racters,</seg><lb/>
 <seg read="racters," type="shy-part" subtype="add-element del-content" resp="fulltextBot"/> stiled Eyes and Horns of the
  <lb/>Lamb

The Prophet 
<persName rend="slant(upright)"
  ><vuji read="I" subtype="mod-content" resp="fulltextBot">J</vuji>oel</persName> 
as foresaw

Anagram, <mcr rend="slant(upright)" xml:id="cromwel1">Howl <placeName>Rome</placeName></mcr>: 
And thus <lb/>with one voice, <said rend="slant(upright)"><quote>come and see</quote>, 
<persName><choice
  ><abbr read="O:" type="choice" subtype="del-content" resp="fulltextBot"/>
  <expan>Oliver</expan></choice> 

<choice
  ><abbr read="C:" type="choice" subtype="del-content" resp="fulltextBot"
  /><expan>Cromwell</expan></choice></persName></said>

Some effects of a 2017 fulltextBot^[18] on WWO encoding. (Derived from Davies, The Benediction.)

Note that the ſ character has been silently regularized to a lower-cased s. For human readability, long-s regularization remains the only unmarked intervention type in the fulltextBot.

Beyond the fulltextBot XSLT, a companion XQuery fulltext2table was developed to gather regularized WWO content into a tab-separated values.^[19] Each row represents a document from Women Writers Online. Besides a cell containing a plain text representation of the document, each row also contained metadata about the source material.

All-purpose fulltexting

By April 2017, general development on the sample WWO application had stopped. The only commits in the app repository were on the fulltextBot or fulltext2table. With the push to retain tags and to capture the provenance of interventions on WWO character data, the XSLT was becoming a transparent, open system. At this point, the fulltextBot and the XQuery were moved to the Women Writers Project Public Code Share as modular parts of a general-purpose toolset.^[20]

By applying a baseline of normalization first, the toolset as a whole reduces the barrier of entry to creating plain text from WWO documents. The fulltext2table XQueries allow further customizations and free users to define for themselves what constitutes relevance in marked-up text.

The fulltexting routines have since been used for many purposes, mostly by WWP staff and encoders. These endeavors include: gathering data on the titles in WWO; providing regularized plain text to researchers; creating input files for training word embedding models;^[21] and spellchecking WWO texts before publication.

The toolset has continued to grow in response to these endeavors. The processes already described continue to be fine-tuned as new bugs are discovered. In order to reduce the memory needed to run the original fulltext2table.xq, a new version called fulltext2table.enmasse.xq was invented to create one TSV file per XML document. The fulltextBot offered the option to move <note>s out of the <wwp:hyperDiv> and next to their anchors. Sarah Connell and I wrote a new XQuery to get plain text out of generic XML. Also, starting in fulltextBot version 2.0, I reworked soft hyphen handling—instead of moving wordparts around, the fulltextBot now deletes the whitespace that occurs between wordparts.^[22]

Customizable extraction of plain text

With some effort, the intermediary XML can be used to walk back from a plain text snippet to the original WWO XML. The first real use of the fulltextBot was to create an Inspectre report^[23] on normalized <title>s which only appear once in WWO. For human readability, it was necessary for each <title> to be normalized... and, for actionability, it was necessary to be able to get back to the original node using XPath or XQuery.

I used the fulltextBot to create intermediary XML of each published WWO document. I then ran an XQuery script which calculated the number of times the content of each <title> appeared across the corpus, and inserted an @ft-match attribute on those which appeared only once. The singleton-intertextual-titles Inspectre report contained copies of the passages in which those <title>s appeared. The Inspectre application transformed the passages into HTML, and also provided an XML view and an XPath for cases where more context was needed.^[24]

Once the Inspectre report was complete, I used another XQuery to insert bibliographic references (@ref) onto <title>. The nature of the intermediary XML allowed me to programmatically determine what the original text content of a given node would have been at the time of the report’s creation. The annotations told me which file the node appeared in, and the <milestone> preceding the node’s containing passage.

With one voice

In the intermediary form described above, WWO markup retains its value even when the character data is being prioritized. At a minimum, fulltextBot results provide a window into the original encoding. They can be queried just as regular WWO texts can, and they can allow one to answer questions with XPath that would ordinarily require XQuery or XSLT and, likely, a day of developer time. More than that, the XSLT—and the assumptions under its code—can be debugged by searching the output for the intervention-marking attributes and their fulltextBot-specific tokens.

The fulltexting routines have been used on other TEI-based corpora with a change to the default namespace declarations, and with some document analysis to find any ignorable elements. Even so, I think the toolset’s most valuable asset is that it gives shape and context to the invisible^[25] rules underlying WWO encoding. In short, the fulltextBot works best on WWO documents because it has been tailored to the dimensions of the WWO corpus.

As previously stated, there are as many approaches to tokenization as there are projects. But it is perhaps more useful to say that every project has baked-in assumptions about what textual content is important, and how XML nodes play off one another. It is perhaps more important to examine these assumptions, to test them, and to build a foundation on which common understanding can rest.

Acknowledgments

Thanks to all the encoders and staff at the Women Writers Project, for their time, their energy, and their thoughtfulness. They transcribe, edit, proof, correct, query, ask questions, do research, advocate for new encoding processes, find interesting phenomena, push WWO to break new ground, &c., &c. None of this would be possible without their painstaking work.

I owe a significant debt of gratitude to Syd Bauman for his support and for his work processing soft hyphens. The fulltextBot would not be nearly so comprehensive if Syd hadn’t pointed out many, many pitfalls to me.

I owe even more to Sarah Connell, who probably has a copy of almost every version of the fulltextBot. Her feedback and feature requests have indelibly shaped these tools, making them much more powerful and accessible than they would be otherwise.

Finally, a grateful thank you to the peer reviewers for Balisage, for all their suggestions, especially regarding the overall shape of this paper.

Any errors or missteps are mine and mine alone.

Appendix A. Further information

The Women Writers Project fulltext toolset can be found in the WWP Public Code Share on GitHub.

Appendix B. Processing in fulltext.xsl version 2.4

The fulltextBot at version 2.4 can be found at commit 556a8a of the WWP Public Code Share.

Pass 1: default mode

Most regularization takes place, including the following:

long-s characters are changed to lower-case s characters;
<choice>s are made;
WWP-authored content is deleted;
implied whitespace is made explicit;
pbGroup members are wrapped together in an <ab> element.

Pass 2: unifier mode

Once whitespace is in a reliable state and metawork is dehydrated into values on @read, soft hyphens can be resolved. Whitespace is deleted if it occurs after a soft hyphen and before a subsequent wordpart.

If the parameter $move-notes-to-anchors is toggled on (it is off by default), unifier mode is first run on <note>s. The resulting <note>s are tunnelled through to their anchor points in the <text> proper. Notes are not inserted next to their anchors if the note would appear in the middle of a word.

Pass 3: noted mode

If $move-notes-to-anchors is toggled on and there exist <note>s which could not be placed with their anchor, those notes are returned to their original locations.

Note

This would be the pass where the remaining notes would be placed after the interrupting wordpart. However, this kind of manipulation is easier to do with XQuery Update, so I left it out of the XSLT stylesheet.

Otherwise, the results from unifier mode are returned.

References

[Lucene] Apache Software Foundation. Lucene 8.0.0 documentation. Package org.apache.lucene.analysis. https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/package-summary.html#package.description. Accessed 2019-04-12.

[Bauman 2016] Bauman, Syd. “The Hard Edges of Soft Hyphens.” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2–5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). doi:https://doi.org/10.4242/BalisageVol17.Bauman01.

[Burns 2013] Burns, Philip R. 2013. “MorphAdorner v2: A Java Library for the Morphological Adornment of English Language Texts.” Northwestern University. https://morphadorner.northwestern.edu/morphadorner/download/morphadorner.pdf. Accessed 2019-07-05.

[Davies, The Benediction] Davies, Lady Eleanor. 2015. The Benediction, 1651. From the Women Writers Online XML, last modified 2019-02-10 (commit 36259). Published at https://www.wwp.northeastern.edu/texts/davies.benediction.html. (Requires subscription.)

[eXist-DB] eXist-db Project. Documentation. Whitespace Treatment and Ignored Content. In Full Text Index. http://exist-db.org/exist/apps/doc/lucene.xml#D3.19.62. Accessed 2019-07-04.

[Jockers 2016] Jockers, Matthew L. 2016. Text Quality, Text Variety, and Parsing XML. In Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer International.

[TEI Guidelines] TEI Consortium. Appendix C Elements. In P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.5.0. https://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ELEMENTS.html. Accessed 2019-07-04.

[XML 1.0] W3C. Extensible Markup Language (XML) 1.0 (Fifth Edition). Section 2.4, Character Data and Markup. https://www.w3.org/TR/REC-xml/#syntax. Accessed 2019-04-12.

[XQuery and XPath Full Text 1.0] W3C. XQuery and XPath Full Text 1.0. https://www.w3.org/TR/xpath-full-text-10/. Accessed 2019-04-12.

[XTF Users List] XTF Users List. 2012-02-06 – 2012-05-04. Forum thread. Tags that break up words. https://groups.google.com/forum/#!topic/xtf-user/hsvFOTM0b9E. Accessed 2019-07-04.

^[1] The Women Writers Project does publish essays on the documents within WWO. These are encoded in a separate TEI customization and published as Women Writers in Context.

^[2] By way of a facsimile.

^[3] And for human comprehension!

^[4] See the WWP Internal Documentation entry for <mcr> for more information on when the element is applied.

^[5] The XML 1.0 specification states, In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code. (XML 1.0)

^[6] The WWO Internal Documentation includes a list of elements which break by default: https://wwp.northeastern.edu/research/publications/documentation/internal/#!/entry/break_narrative

^[7] One can also assume that the presence of most encoded whitespace in WWO should be respected. In our transformations and queries, Syd Bauman and I tend to use a variation on normalize-space(), where one or more whitespace characters are normalized to a single space, even if the whitespace occurs at the beginning, at the end, or as the entirety of a string. (See, for example, Bauman 2016.)

^[8] See the WWO Internal Documentation entry on <vuji>.

^[9] The eXtensible Text Framework (XTF) is a web publishing platform which includes Lucene for search and indexing, and a set of customizable XSLT stylesheets to parse, transform, and deliver web content. XTF is supported by the California Digital Library. https://xtf.cdlib.org/.

^[10] eXist-DB is an XML database and application platform. It supports indexing via Lucene. http://exist-db.org/exist/apps/homepage/index.html.

^[11] Morphadorner is a command line tool which features tokenization of plain text or XML content, and the adornment of tokens with lemmata, parts of speech, etc. http://morphadorner.northwestern.edu/morphadorner/.

^[12] The WWP already uses eXist to power the WWP’s public access collections Women Writers in Context and Women Writers in Review.

^[13] XML readability was a guiding principle up until about version 2.0 of the fulltextBot, when I reworked soft hyphen processing to remove a good deal of whitespace, including newlines. Instead of human-readable XML, I now aim for human-decipherable XML.

I consider this to be version 1.0 of the fulltextBot, although it is not marked as such: commit 370f4e of GitHub repository amclark42/xdb-app-central.

^[14] As an example of the code’s complexity, here’s an //xsl:template/@match expression, which attempts to identify whether a text node should delete a wordpart:

Figure 10

text()[
      preceding::text()
         [not(parent::sic)]
         [not(normalize-space(.)='')]
         [1]
         [contains(.,'&#xAD;')]
      ]
|
text()[
       preceding::*
         [1]
         [not( self::anchor | self::cb | self::mw | self::gb | self::lb | self::milestone | self::pb )]
       /
       preceding::text()
         [not(normalize-space(.)='')]
         [1]
         [contains(.,'&#xAD;')]
       ]
|
*
  [
    preceding::text()
      [not(normalize-space(.)='')]
      [1]
      [contains(.,'&#xAD;')]
  ]
  [
    following::node()
      [self::* or self::text()[not(normalize-space(.) eq '')]][1]
      [self::anchor or self::cb or self::mw or self::gb or self::lb or self::milestone or self::pb]
  ]
/text()

Excerpt from wwoPreFilter.xsl, as of Subversion commit 12337.

^[15] The WWO Internal Documentation entry Forme work and meta work goes into more detail.

^[16] See commit 8779fd of amclark42/xdb-app-central. This very old version of the fulltextBot still regularizes documents by removing tags or entire elements.

^[17] fulltext.xsl from 2016-12-15, commit 370f4ee of GitHub repository amclark42/xdb-app-central.

^[18] fulltext.xsl from 2017-04-28, commit bd0968f in GitHub repository amclark42/xdb-app-central.

^[19] The earliest version of fulltext2table.xq can be found in the GitHub repository amclark42/xdb-app-central.

^[20] The Public Code Share is a GitHub repository and collection of open source, WWP-authored tools which could be of use to encoders, researchers, developers, and/or XML enthusiasts. Most of these tools are written in XSLT or XQuery. The fulltext code in particular can be found in the fulltext directory of the repository.

^[21] The WWP’s word embedding models can be queried with the Word Vector Interface, itself available as part of the Women Writers Vector Toolkit. The Methodology page contains more information on the preparation, training, and testing of these models.

^[22] I could probably write another paper on this, and maybe one day I will. Ultimately I decided that humans are not great at writing XSLT for moving content across variable markup structures, due to the aforementioned need to copy and delete in two different nodes. I also decided that leaving the wordparts where they are is truer to the original work and to the spirit in which the WWP encodes soft hyphens.

^[23] For the origins of the Inspectre, see Meta(data)morphosis (Clark & Connell 2016): http://www.balisage.net/Proceedings/vol18/html/Clark01/BalisageVol18-Clark01.html.

^[24] The singloton-intertextual-titles report is complete, and no longer has a web presence. Screenshots can be seen in Sarah Connell’s lecture notes from a panel at the 2017 Digital Humanities conference.

^[25] Invisible in the XML document, at least. One might not know to check the WWP’s Internal Documentation, or the WWP editorial statement, or the ODD file.

Ashley M. Clark

Ashley M. Clark is XML Applications Developer for the Northeastern University Women Writers Project and the Digital Scholarship Group.

BalisageThe Markup Conference

Balisage Paper: With One Voice: A Modular Approach to Streamlining Character Data for Tokenization

Ashley M. Clark

Table of Contents

Introduction

Note

Real text content

Editorial notes

Alternate readings

Implied or insignificant whitespace

Tags and differing wordviews

Words and their boundaries

Origins of the fulltextBot

Soft hyphens: an interlude

An anxiety of soft hyphens

All-purpose fulltexting

Customizable extraction of plain text

With one voice

Acknowledgments

Appendix A. Further information

Appendix B. Processing in fulltext.xsl version 2.4

Pass 1: default mode

Pass 2: unifier mode

Pass 3: noted mode

Note

References

Balisage Series on Markup Technologies