Balisage Paper: With One Voice: A Modular Approach to Streamlining Character Data for Tokenization
July 30 - August 2, 2019
The materials listed below were provided by the speaker as supplements to a
presentation at Balisage. These materials may include the slides or visuals used in
the presentation; supplementary material, such as code samples or a demonstration
application; and/or the paper accompanying the presentation (if it has not been
provided in XML). These materials have been zipped for easy download and are
identified by a brief description of the contents. The materials themselves are
untouched, that is, they have not been tested or edited by Balisage: The Markup
Conference or by Mulberry Technologies, Inc. As such, they are included on this
website AS IS, i.e., as provided by the speaker, with no warranties, express or
otherwise, made by Balisage or Mulberry.
Slides and Materials
- Bal2019-Clark-slides.zip: Presentation slides in Adobe PDF
- Bal2019-Clark-sample.zip: A sample WWO document run through the fulltexting routines