Note to the reader
A link to an updated version (or even a newer edition) of this paper may be available on the WWP bibliography page.
Argument
In section “Introduction” this paper presents what soft hyphens are and how they are encoded, and then discusses the desired processing (called “resolution”). The next two sections (section “Seems easy …” and section “Further complications”) present an algorithm for how this might be done, followed by a somewhat detailed discussion of some of the features of TEI encoding that make this difficult, along with a few of the policies at the WWP that try to make it a bit easier. Lastly, section “Attempts” briefly discusses various attempts to perform this processing.
Introduction
Soft hyphens
In the modern post-Unicode era, a “soft hyphen” is typically defined as “a spot where you, the word processor, may break this word across a line break, if needed”.[1] But even as recently as ISO 8859 a soft hyphen was “for use when a line break has been established within a word”.[2] Although not called a “soft hyphen” back then, this use of the hyphen has been around for centuries. E.g., the OED cites NWEW as saying “Hyphen … is used … when one part of a word concludes the former Line, and the one begins the next.” It is this latter (older) definition with which we are concerned here: a computer character (or other XML construct) used in a transcription to indicate where an end-of-line hyphen was printed in the source text to indicate “this word is continued on the next line”.
The use of such characters (hyphen to indicate “word continued on next line”) is nearly ubiquitous in printed works (at least in English). For example, I searched Google Books for the word “balisage”, and looked at the first book listed.[3] Even though I cannot read it because it is in French, there are obviously four soft hyphens on the first page of printed prose alone (i.e., ignoring the title page, etc.); that page has just over 200 words spread over 20 lines.
In the first full chapter of Michael Kay’s
book[4] I counted 67 soft
hyphens in roughly 17,760 words over roughly 1410 lines.
Recording lineation and end-of-line hyphens
The Text Encoding Initiative Guidelines for Electronic Text Encoding and Interchange contain a discussion of how to handle these extant typographic indicators.[5] One common solution is to ignore the soft hyphens, and to simply transcribe the word that has been broken across a line break as a single word. Consider the following example.[6]
This passage might be encoded[7] as
so far they’d been smart enough to keep quiet about it. I’d never seen any
posts about the Tomb of Horrors on any gunter message boards. I realized,
of course, that this might be because my theory about the old D&amp;D
module was completely lame and totally off base.</p>
or even
so far they’d been smart enough to keep quiet about it. I’d never seen any posts about the Tomb of Horrors on any gunter message boards. I realized, of course, that this might be because my theory about the old D&amp;D module was completely lame and totally off base.</p>
or, if encoding original lineation
<lb/>so far they’d been smart enough to keep quiet about it. I’d never seen any
<lb/>posts about the Tomb of Horrors on any gunter message boards. I realized,
<lb/>of course, that this might be because my theory about the old D&amp;D
<lb/>module was completely lame and totally off base.</p>
In all three of the above encodings, the word “realized” has been silently reconstituted from its constituent parts, the initial portion immediately prior to the soft hyphen, and the final portion shortly after the soft hyphen. In the first and third examples the soft hyphen is resolved by moving the final portion of the word up from the beginning of its line to the end of the previous line (which I call “finalUp”). It could just as easily have been resolved by moving the initial portion of the word from the end of its line to the beginning of the next line (which I call “initDown”). For most of this paper I will discuss resolution in only the “finalUp” direction, but the issues generally apply equally well to both directions.
Personally, I do not like that third (last) encoding. It explicitly asserts there was a line break in the source document between “realized,” and “of course”, which is not true. But nonetheless, the practice is not uncommon.
The first encoded example is not nearly so bad, as its implication
that there was a line break at that spot is implicit, not
explicit. The middle of the three encoding possibilities is not
objectionable in assertion of line breaks at all, since it makes
no such assertions. (One could infer them, but they are not
implied.) However, many projects will find it a disadvantage to
transcribe prose completely irrespective of original lineation.
Keeping track of original lineation is very helpful when trying to
align the source document with the transcription (or the output of
processing the transcription). Even if a project does not think
the users of its transcribed texts will appreciate this alignment,
the project proofreaders will — a lot.
Another common approach is to explicitly record both the hyphen character and original lineation. Consider the following example.[8]
It turns out that the most important voice in the Su‐
preme Court nomination battle is not the American peo‐
ple’s, as Senate Republicans have insisted from the mo‐
ment Justice Antonin Scalia died last month. It is not even
that of the senators. It’s the National Rifle Association’s.
That is what the majority leader, Mitch McConnell,
said the other day when asked about the possibility of con‐
sidering and confirming President Obama’s nominee,
Judge Merrick Garland, after the November elections. “I
can’t imagine that a Republican majority in the United
States Senate would want to confirm, in a lame-duck ses‐
sion, a nominee opposed by the National Rifle Associa‐
tion,” he told “Fox News Sunday.”
This excerpt from a New York Times editorial might be encoded as follows.
<p>It turns out that the most important voice in the Su<pc force="weak">-</pc>
<lb break="no"/>preme Court nomination battle is not the American peo<pc force="weak">-</pc>
<lb break="no"/>ple’s, as Senate Republicans have insisted from the mo<pc force="weak">-</pc>
<lb break="no"/>ment Justice Antonin Scalia died last month. It is not even
<lb/>that of the senators. It’s the National Rifle Association’s.</p>
<p>That is what the majority leader, Mitch McConnell,
<lb/>said the other day when asked about the possibility of con<pc force="weak">-</pc>
<lb break="no"/>sidering and confirming President Obama’s nominee,
<lb/>Judge Merrick Garland, after the November elections. “I
<lb/>can’t imagine that a Republican majority in the United
<lb/>States Senate would want to confirm, in a lame-duck ses<pc force="weak">-</pc>
<lb break="no"/>sion, a nominee opposed by the National Rifle Associa<pc force="weak">-</pc>
<lb break="no"/>tion,” he told “Fox News Sunday.”</p>
Here it is explicit that the hyphen character is not a word separator (force="weak"), and that the line break does not imply the end of an orthographic token (break="no"). It is worth noting that many TEI projects choose to use either <pc force="weak"> or <lb break="no">, but not both.
At my project[9] we encode soft hyphens using the Unicode character SOFT HYPHEN (U+00AD). Given that this character is explicitly of the “a word processor may insert a hyphen here if needed” variety, in some sense it is technically incorrect to use it for this purpose. Furthermore, the TEI Guidelines do not recommend this use. In our defense, we chose this path back in the ISO 8859 days, and when &shy; was an SGML SDATA reference that did not necessarily mean code-point 0xAD. But more importantly, the detail of which character is used to represent the “this word is continued on the next line” glyph that was on the source page does not matter, so long as it is not also used for some other purpose in the same file. So, given that we are encoding early modern printed books, we could just as well have used the EURO SIGN (U+20AC) for this purpose. In either case it is character abuse; however, the abuse of SOFT HYPHEN seems much less dramatic than would be the abuse of EURO SIGN: this has something to do with hyphenation, and nothing to do with currency.
So our encoding of the excerpt from the New York Times editorial would be as follows.[10]
<p>It turns out that the most important voice in the Su­ <lb/>preme Court nomination battle is not the American peo­ <lb/>ple's, as Senate Republicans have insisted from the mo­ <lb/>ment Justice Antonin Scalia died last month. It is not even <lb/>that of the senators. It's the National Rifle Association's.</p> <p>That is what the majority leader, Mitch McConnell, <lb/>said the other day when asked about the possibility of con­ <lb/>sidering and confirming President Obama's nominee, <lb/>Judge Merrick Garland, after the November elections. <said>I <lb/>can't imagine that a Republican majority in the United <lb/>States Senate would want to confirm, in a lame-duck ses­ <lb/>sion, a nominee opposed by the National Rifle Associa­ <lb/>tion,</said> he told <title>Fox News Sunday.</title></p>
Desired output
Encoding texts serves little purpose unless some sort of analysis or output generation (or both) is undertaken. If all we wanted to do was read the text, scanned images of the pages would do.
Consider the following snippet of an encoded text:[11]
<p>Whatever has been ſaid
<lb n="14"/>by Men of more Wit than
<lb n="15"/>Wiſdom, and perhaps of
<lb n="16"/>more malice than either,
<lb n="17"/>that Women are natural­
<lb n="18"/>ly Incapable of acting Pru­
<lb n="19"/>dently, or that they are
<lb n="20"/>neceſſarily determined to
<lb n="21"/>folly, …
For most analyses we would prefer the words “naturally” and “Prudently” to occur in our data, and the tokens “ly”, “Pru”, and “dently” not to occur. That is, we would like the soft hyphens resolved. The exception is the physical bibliographer who is interested in the phenomenon of breaking a word across a line.
As with analyses, for most purposes we would prefer to read the text with as few interruptions to words as possible. The obvious exception is when we want to align reading of the processed output with the physical source page or a facsimile thereof. This alignment makes proofreading much easier.
Thus for proofreading we might like to see something like the following.
13: Whatever has been ſaid
14: by Men of more Wit than
15: Wiſdom, and perhaps of
16: more malice than either,
17: that Women are natural-
18: ly Incapable of acting Pru-
19: dently, or that they are
20: neceſſarily determined to
21: folly, …
Whereas for casual reading, we might prefer:
Whatever has been said by Men of more Wit than Wisdom, and perhaps of more malice than either, that Women are naturally Incapable of acting Prudently, or that they are necessarily determined to folly, …
The question is, of course, how to get that “resolved” output.
When I’m wrong, I can be really wrong
Famous last words: (figuratively, expressing sarcasm) A statement which is overly optimistic, results from overconfidence, or lacks realistic foresight.— FLWs
Like many people, I make good predictions and bad ones. But sometimes I make truly horrible predictions. E.g., in early 1995 or thereabouts I infamously said something like “remember, the web is not our friend, it is our enemy”. Hard to be more wrong than that. But when it came to soft hyphens, it may turn out I was. You see, sometime during the early days of the Women Writers Project I asserted that software could read our documents (in which soft hyphens were encoded using first the &shy. Waterloo Script set symbol (essentially a variable), and later the &shy; SGML SDATA entity reference), and “resolve” the soft hyphen for creating full-text searchable word lists or reading output for undergraduates. I said this with a “how hard can it be?” attitude, I’m sure.[12]
Well, as will be discussed in the rest of this paper, it has turned out to be quite hard.
Seems easy …
At first blush, this does not seem like it would be a difficult programming task. Basically, when you find a soft hyphen, drop it and replace it with the first token from the next line. Correspondingly, in order to avoid duplicating the first token from the next line,[13] when you find a text node whose immediately preceding text node ended in soft hyphen, drop the first token. For example, consider the following passage.[14]
Or, in modern typography,
If this passage is transcribed as
<lb/>procuring a ſpeedy adminiſtration of Juſ­
<lb/>tice for the impartiall puniſhment of all
<lb/>offenders, to the relief and comfort of the
then to “resolve” the soft hyphen, it needs to be replaced by the first text token of the line that immediately follows. In an XSLT context, this means that the template that matches the blue portion above needs to strip off the soft hyphen character and replace it with the red portion in the above; and the template that matches the text node that includes the red portion needs to strip off said red portion (since it has already been put into the output stream by the template that matched the blue portion).
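As a minimal sketch of the simple algorithm just described, here it is in Python rather than XSLT, operating on bare lines of text with no markup at all. Both of those are simplifying assumptions made only for illustration, and the function name `resolve_lines` is hypothetical.

```python
SHY = "\u00AD"  # SOFT HYPHEN, assumed to be the last character of a broken line


def resolve_lines(lines):
    """Naive finalUp resolution over a list of markup-free lines."""
    out = []
    drop_first = False  # True when the previous line ended in a soft hyphen
    for i, line in enumerate(lines):
        if drop_first:
            # The first token was already appended to the previous line.
            _, _, line = line.partition(" ")
            drop_first = False
        if line.endswith(SHY) and i + 1 < len(lines):
            # Replace the soft hyphen with the first token of the next line.
            first_token = lines[i + 1].split(" ", 1)[0]
            line = line[:-1] + first_token
            drop_first = True
        out.append(line)
    return out


passage = [
    "procuring a ſpeedy adminiſtration of Juſ\u00AD",
    "tice for the impartiall puniſhment of all",
    "offenders, to the relief and comfort of the",
]
for resolved in resolve_lines(passage):
    print(resolved)
```

The first output line ends in “Juſtice” and the second begins with “for”; everything that follows in this paper is about why real encoded text is not this tidy.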
Whitespace
That doesn’t sound too tough. Of course it is obviously a little harder than the diagrams above make it look, for they ignore the whitespace between the blue and red portions:
In order to handle that whitespace we need to
- ignore whitespace at the end-of-line when looking for soft hyphens
- ensure that the end-of-line whitespace is not inserted between the two parts of the broken word, either by stripping end-of-line whitespace off along with the soft hyphen character, or by carefully replacing only that character (such that the whitespace comes after the re-constituted word)
Of course we cannot just normalize whitespace using XPath’s built-in normalize-space() function, as in many cases leading and trailing space are important. E.g., given the following fragment,[15]
<p rend="first-indent(1)">How far the passages of scripture
<lb/>she mentions were applicable to the
<lb/>conduct of <persName>Mr B</persName> it is not our prov­
<lb/>ince to determine; but it is not
using normalize-space() on “␣it␣is␣not␣our␣prov­↲” would lead to “…conduct of Mr Bit is not our province to determine; …”, because the space in front of “it” would be lost.[16] [17] But even with this whitespace concern, this is not particularly difficult. And if that’s all there was to it, well, I wouldn’t be writing this paper.
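The whitespace point can be made concrete with a short Python sketch. The helper names (`normalize_space`, `regularize_space`, `join_broken_word`) are hypothetical, and `normalize_space` only imitates the XPath function of the same name; the aim is just to show the leading space being lost in one case and kept in the other.

```python
import re

SHY = "\u00AD"
XML_WS = r"[ \t\r\n]"  # the four XML whitespace characters


def normalize_space(s):
    """Imitate XPath normalize-space(): collapse runs AND trim both ends."""
    return re.sub(XML_WS + "+", " ", s).strip(" ")


def regularize_space(s):
    """Collapse each whitespace run to one space, keeping edge whitespace."""
    return re.sub(XML_WS + "+", " ", s)


def join_broken_word(before, after):
    """Move the first token of `after` up onto `before` (finalUp).

    `before` is a text node ending in a soft hyphen plus optional
    end-of-line whitespace; `after` is the text that starts the next line.
    """
    m = re.fullmatch("(.*?)" + SHY + XML_WS + "*", before, flags=re.DOTALL)
    if m is None:
        return before, after  # no soft hyphen at end of line
    token, rest = re.match("([^ \t\r\n]*)(.*)", after, flags=re.DOTALL).groups()
    return m.group(1) + token, rest


node = " it is not our prov\u00AD\n"
print(repr("Mr B" + normalize_space(node)))   # the space before "it" is gone
print(repr("Mr B" + regularize_space(node)))  # the space before "it" survives
print(join_broken_word(node, "ince to determine; but it is not"))
```

Here the end-of-line whitespace is stripped along with the soft hyphen, which is the first of the two options listed above; the space that separates the re-constituted word from what follows comes from the second line.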
Further complications
’Twixt
The first thing to keep in mind is that XML constructs other than just the <lb> element may come between the text node that ends in SOFT HYPHEN and the text node that contains the representation of the continued word. Besides the obvious (XML comments and XML processing instructions), first and foremost the feature that forced the typographer to break the word in the first place may have been a page break, not a line break. Page breaks usually have other information associated with them (page numbers, catch words, signature marks, running titles) that are generally encoded “where they lie” such that they further interrupt the word that has been split. E.g.[18][19]
<p>Whoever may come out in any society as Mis­
<pb n="247"/>
<milestone unit="sig" n="M4r"/>
<mw type="pageNum">247</mw>
<lb/>sionaries or teachers, whether here or at <placeName>Sierra-
<lb/>Leone</placeName>, had need to guard against assimilating too
<lb/>much in habit or sentiment with other <rs type="properAdjective">European</rs>
<lb/>residents, …
Notice that included among those things that follow the soft hyphen is a text node (“247”) which is not part of the split word “Missionaries”.[20] (Note also that the hyphen glyph in “Sierra-Leone” looks exactly the same in the source as the hyphen glyph in “Mis-sionaries”, but the encoding asserts it is a hard hyphen even though it occurs at end-of-line. This is because other occurrences of “Sierra-Leone” have a hyphen, even when it is in the middle of a typographic line. The hard hyphen is probably best encoded with a HYPHEN character (U+2010), but is typically recorded with a HYPHEN-MINUS character (U+002D).)
But sadly, it is not only the obvious and predictable (XML comments, XML processing instructions, line breaks, column breaks, and page breaks with their apparatuses) that may come between a soft hyphen and the final portion of a word. The most common culprits here are annotations and figures, but handwritten additions (either authorial or by a later hand) could also occur.
In the following example[21] an entire tipped-in plate sits between the soft hyphen and the final portion of the word.
<p> <label>I.</label> God spoke of Be-he-moth. What ani­ <pb n="facing 48"/> <pb n="facing 49"/> <figure> <figDesc>An engraving of a “behemoth” (resembles the elephant) standing on a grassy bank drinking from a body of water, vegitation in background</figDesc> <ab type="caption">To face page 49.</ab> </figure> <pb n="49"/> <milestone n="E5r" unit="sig"/> <mw rend="align(outside)" type="pageNum">49</mw> <lb/>mal is that?
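For the limited purpose of producing a plain-text stream (say, for a word list), the intervening material can be skipped rather than moved. The following Python sketch uses the standard library's `xml.dom.minidom`; the set `IGNORABLE` and the function names are hypothetical, un-namespaced element names are assumed, and which elements count as ignorable is a project decision, not something settled here.

```python
import re
from xml.dom import minidom

SHY = "\u00AD"
# Hypothetical list of elements whose text is not part of the running prose.
IGNORABLE = {"mw", "figure", "note"}


def running_text(node):
    """Concatenate text nodes in document order, skipping IGNORABLE subtrees.

    Comments and processing instructions have no child nodes in minidom,
    so they contribute nothing.
    """
    if node.nodeType == node.TEXT_NODE:
        return node.data
    if node.nodeType == node.ELEMENT_NODE and node.tagName in IGNORABLE:
        return ""
    return "".join(running_text(child) for child in node.childNodes)


def resolve_to_text(xml_string):
    """Return the prose of an XML fragment with soft hyphens resolved."""
    text = running_text(minidom.parseString(xml_string).documentElement)
    # Drop each soft hyphen together with the whitespace that follows it.
    return re.sub(SHY + r"\s*", "", text)


sample = ('<p>Whoever may come out in any society as Mis\u00AD <pb n="247"/> '
          '<milestone unit="sig" n="M4r"/> <mw type="pageNum">247</mw> '
          '<lb/>sionaries or teachers</p>')
print(resolve_to_text(sample))
```

This throws the markup away, so it is no help when the output must itself be XML; it only shows that the page apparatus and the tipped-in plate stop being obstacles once they are recognized as not part of the text flow.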
Sibling of Overlap
In all of the examples so far, the initial and final
portions of the word divided by a soft hyphen are at least at
the same hierarchical level of encoding. That is (in XPath
terms) from the text node that contains the soft hyphen, the
final portion of the word is on the
following-sibling::
axis, even if it is not the
first text node, or even the first non-whitespace-only text
node, on that axis.
However, we are not always so lucky. Here is a modern diplomatic transcription of a heading.[22]
The word “Honourable” is half in roman (or “upright”) type and half in italics. To account for this font shift, the encoding uses the TEI <hi> element and the global @rend attribute to indicate that while the entire heading is (in general) in italics, the first typographic line is highlighted by being in roman typeface.[23]
<head rend="slant(italic)"><hi rend="slant(upright)">To all vertuous Ladies Honou­</hi> <lb/>rable or Worſhipfull, and to all other <lb/>of <persName rend="slant(upright)">He<vuji>u</vuji>ahs</persName> ſex fearing God, and lo<vuji>u</vuji>ing their <lb/><vuji>i</vuji>uſt reputation, grace and peace through <lb/><persName>Chriſt</persName>, to eternall glory. </head>
It would be reasonable to think this phenomenon pernicious, not particularly important, and rare. But a different manifestation of the same hierarchical problem is anything but. When a book is damaged (e.g., by a coffee spill, or torn or mouse-eaten edges of pages), it is common for the damage to be on only one side of the page. Such damage will cause a problem reading either the initial portion (if it is on the right edge) or the final portion (if it is on the left edge) of a word split across a line break.
In the following example,[24] the encoder has indicated that she cannot read a few characters at the beginning of each of four lines due to damage, but that either from context alone or from looking at a different edition of the same book she has been able to surmise what must have been printed.
<lb/>not your own. It is a miſe­
<lb/><supplied reason="damaged">ra</supplied>ble thing for any Wo­
<lb/><supplied reason="damaged">ma</supplied>n, though never ſo great,
<lb/><supplied reason="damaged">not</supplied> to be able to teach her
<lb/><supplied reason="damaged">ſerv</supplied>ants; …
This is a particularly thorny case, because in order to resolve the soft hyphen, software will have to recognize that not only should the following <supplied> element be moved from the beginning of its line to the end of the previous line (replacing the soft hyphen character), but also the first token of the text node immediately following the <supplied> needs to move with it.
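One way a project might gauge how often this harder case occurs is to classify each soft hyphen by whether its continuation shares a parent with it. The Python sketch below is a diagnostic only, not a resolver; the function names and the three report labels are hypothetical, and it deliberately ignores the ’Twixt problem (it treats every non-whitespace text node as a candidate continuation).

```python
from xml.dom import minidom

SHY = "\u00AD"


def text_nodes(node):
    """Yield all text nodes beneath `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def classify_soft_hyphens(xml_string):
    """Report, per soft hyphen, where the rest of the word was found."""
    nodes = list(text_nodes(minidom.parseString(xml_string)))
    report = []
    for i, node in enumerate(nodes):
        if not node.data.rstrip().endswith(SHY):
            continue
        # The next text node with non-whitespace content holds the continuation.
        nxt = next((n for n in nodes[i + 1:] if n.data.strip()), None)
        if nxt is None:
            report.append("unresolvable")
        elif nxt.parentNode is node.parentNode:
            report.append("sibling")           # the easy case
        else:
            report.append("different parent")  # the hierarchical case
    return report


heading = ('<head><hi>To all vertuous Ladies Honou\u00AD</hi> '
           '<lb/>rable or Worſhipfull</head>')
print(classify_soft_hyphens(heading))
```

A “different parent” result covers both the font-shift heading and the <supplied> example; in the latter the word continues across two text nodes, which this sketch does not attempt to detect.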
Text that is not there
If the text that is damaged cannot be read at all, the TEI
Guidelines recommend using the
<gap>
element. While the <gap>
element may have content, if it does that content does not
provide a transcription of the source text, but rather provides
a description of or information about what was not transcribed
from the source text; and more often than not
<gap>
is empty. In the following example,[25] the
encoder is asserting that she could not read a significant
portion of the last line.
<lb/>And alſo ge<vuji>u</vuji>eth them grace to <vuji>v</vuji>ſe in his
<lb/>glorye, po<vuji>u</vuji>ertie, ignomine, infamie, in­
<lb/>firmitie, with all ad<vuji>u</vuji>erſitie, and the pri­
<lb/><gap extent="over one third of the line" reason="flawed-reproduction"/>tes, e<vuji>u</vuji>en to the death<unclear>,</unclear>
When a <gap> occurs after a soft hyphen, but before any non-ignorable content, we have a case for which it is particularly difficult to resolve the soft hyphen; thankfully, it is also a case for which it is particularly unimportant to do so.
It is difficult to do so for two main reasons. First,
because (unlike most other empty elements we would encounter
after a soft hyphen: <cb>
, <lb>
,
<milestone>
, and <pb>
) the
<gap>
represents content, it would have to be
moved as if it were the first token of content. Second, because
a <gap>
may represent less than a single word, a
single word, or more than a single word, the software will need
to parse its attributes (and perhaps content) to determine
whether or not the first token of an immediately following text
node (that does not start with whitespace) needs to be moved
along with the <gap>
.
It is unimportant because under no circumstances can the soft hyphen resolution process meet the goal of reconstituting the entire word. Whether for spell checking, for indexing for search, or for generating an easy-to-read display, having pri<gap extent="rest of word"/><lb/><gap extent="roughly one third of the line minus roughly one half of the first word"/> is no better than what you had to begin with.
Choosing the shy
The TEI uses a “parallel elements” mechanism for recording a variety of editorial interventions. Here I will discuss the correction of apparent errors (<choice>, <sic>, and <corr>), but the same issues hold true for the simple expansion of abbreviations (<choice>, <abbr>, and <expan>), the substitution of one bit of text for another (<subst>, <del>, and <add>), the regularization of archaic or eccentric spelling or typography (<choice>, <orig>, and <reg>), and the simultaneous encoding of multiple variant witnesses (<app>, <rdg>, and <lem>).
The following example[26] demonstrates two errors in one title, each of which is directly involved in the use of soft hyphens. I will discuss the second error here, and the first one in the next subsection.
If you look carefully at the end of the 3rd line, you will see that the soft hyphen character is not a hyphen at all. In this reproduction you may find it hard to figure out what it is, but in other editions (I am told) it is more obvious that the character there is a period.
Presuming the encoding project would like to record both the error as it appears in the source text and a modern correction of it, there are two likely TEI encodings of this: “letter-level” and “word-level”.
<lb/>Bench</placeName>, for the releaſing of all pri<choice><sic>.</sic><corr>­</corr></choice>
<lb/>ſoners for Debt, according to
The above letter-level encoding makes resolving the soft hyphen potentially quite a bit more difficult. The difficulty lies in the fact that if we were to apply the simple algorithm discussed above (namely, to replace the soft hyphen character with the first token of the following line), we would suddenly be asserting that the partial word “soners” was somehow a correction of a period:
<lb/>Bench</placeName>, for the releaſing of all pri<choice><sic>.</sic><corr>ſoners</corr></choice>
<lb/>for Debt, according to
In many, if not the vast majority, of situations this would not really be a problem. When performing soft hyphen resolution for the purpose of generating word lists or indices, we generally do not care about simultaneously handling both the source text and the editorial correction. We usually just want the corrected version, in which case the entire <choice> construct is itself resolved to the content of <corr>. Whether this is done before or after soft hyphen resolution, we end up with the desired words.
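A rough Python sketch of the “just want the corrected version” case follows, using the standard library's `xml.etree.ElementTree`. It resolves <choice> to <corr> first (by discarding <sic>) and the soft hyphen second; the function names are hypothetical, element names are assumed to be un-namespaced, and the result is a plain string suitable only for word lists.

```python
import re
import xml.etree.ElementTree as ET

SHY = "\u00AD"


def corrected_text(elem):
    """Flatten `elem` to a string, keeping <corr> and discarding <sic>."""
    if elem.tag == "sic":
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(corrected_text(child))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


def corrected_words(xml_string):
    """Corrected reading with soft hyphens resolved and whitespace collapsed."""
    text = corrected_text(ET.fromstring(xml_string))
    text = re.sub(SHY + r"\s*", "", text)
    return " ".join(text.split())


letter_level = ('<p>for the releaſing of all pri<choice><sic>.</sic>'
                '<corr>\u00AD</corr></choice> <lb/>ſoners for Debt</p>')
word_level = ('<p>for the releaſing of all <choice> <sic>pri.<lb/>ſoners</sic> '
              '<corr>pri\u00AD<lb/>ſoners</corr> </choice> for Debt</p>')
print(corrected_words(letter_level))
print(corrected_words(word_level))
```

Both encodings yield the same string, which is the point: for this purpose the letter-level versus word-level decision does not matter.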
In rare cases we might be interested in the uncorrected source text, in which case soft hyphen resolution software has to be smart enough to perform the resolution on the text in <sic> based on the content of <corr>. In theory a project may want to perform soft hyphen resolution in both the uncorrected source and the editorially corrected text. I do not address this particular situation here, as I have never even heard this idea entertained.
Word-level correction is a bit easier for soft hyphen resolution, as the simple algorithm yields a perfectly acceptable result.
<lb/>Bench</placeName>, for the releaſing of all <choice> <sic>pri.<lb/>ſoners</sic> <corr>pri­<lb/>ſoners</corr> </choice> for Debt, according to
However, it has a different drawback: with this system counting lines on the page (a common and important task) is harder, in that the counter has to know that the choice/sic/lb and the choice/corr/lb together need to be counted as only one line break.
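The line-counting adjustment can be sketched in a few lines of Python. This is an illustration under stated assumptions only: the function name is hypothetical, element names are un-namespaced, each <choice> has at most one <sic> and one <corr> child, and <choice> elements are not nested.

```python
import xml.etree.ElementTree as ET


def count_line_breaks(root):
    """Count physical line breaks, counting sic/lb + corr/lb pairs once."""
    total = len(list(root.iter("lb")))
    for choice in root.iter("choice"):
        sic, corr = choice.find("sic"), choice.find("corr")
        if sic is None or corr is None:
            continue
        in_sic = len(list(sic.iter("lb")))
        in_corr = len(list(corr.iter("lb")))
        # A break recorded in both alternatives is one break on the page.
        total -= min(in_sic, in_corr)
    return total


word_level = ET.fromstring(
    '<p><lb/>Bench, for the releaſing of all <choice> '
    '<sic>pri.<lb/>ſoners</sic> <corr>pri\u00AD<lb/>ſoners</corr> '
    '</choice> for Debt, according to</p>')
print(count_line_breaks(word_level))
```

The fragment contains three <lb> elements but only two physical line breaks, and the count reflects that. Where the correction elides the line break altogether, nothing is subtracted, since the break is recorded only once.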
Shy of the choice
Anything more than a cursory or rapid read of the first two lines reveals an egregious error, probably by the typesetter: the word “commanders” is spelled “commanmanders”, as the medial letters “man” are not only in the initial portion of the word, but are also repeated after the soft hyphen. Multiple possible encodings jump to mind. The editor may consider the “man” at the end of the first line as the correct one, and thus the “man” at the beginning of the second line as the one in error; or vice-versa. And in each case the encoder may use letter-level or word-level encoding.
<titlePart type="second">Alſo a Petition of divers Comman­ <lb/><choice><sic>man</sic><corr/></choice>ders, priſoners in the <placeName>Kings
<titlePart type="second">Alſo a Petition of divers <choice> <sic>Comman­<lb/>manders</sic> <corr>Comman­<lb/>ders</corr> </choice>, priſoners in the <placeName>Kings
<titlePart type="second">Alſo a Petition of divers Com<choice><sic>man</sic><corr/></choice>­ <lb/>manders, priſoners in the <placeName>Kings
<titlePart type="second">Alſo a Petition of divers <choice> <sic>Comman­<lb/>manders</sic> <corr>Com­<lb/>manders</corr> </choice>, priſoners in the <placeName>Kings
Furthermore, when using word-level encoding, project editorial policy may allow elision of the soft hyphen and line break in the corrected version:
<titlePart type="second">Alſo a Petition of divers <choice> <sic>Comman­<lb/>manders</sic> <corr>Commanders</corr> </choice>, priſoners in the <placeName>Kings
Saving graces
So we see that there are quite a few complications to soft hyphen resolution. Luckily, at least at the WWP, there are a few encoding practices we have put in place that ease the process, rather than interfere.
- Soft hyphens are consistently encoded
We never encode soft hyphen with anything else, ever.[27] That is (as demonstrated in section “Choosing the shy”), a U+00AD character is encoded at every soft hyphen even if the source text erroneously has a different character, or indeed no character at all, to represent the soft hyphen.
- U+00AD is unique to this purpose
We never use U+00AD for anything else, ever. This is a slight exaggeration, but foregrounds the important point. On rare occasion an actual U+00AD character will creep into the discussion about the encoding of a file in its metadata, e.g., in a change log entry that discusses fixing a soft hyphen. But this usage never occurs in the content. Furthermore, an actual soft hyphen never occurs in metadata. Thus all U+00AD within /TEI/text are soft hyphens, all U+00AD within /TEI/teiHeader are discussions about soft hyphen characters.
- U+00AD is always in element content
At the WWP our encoding is such that any U+00AD in an attribute value is in error; and, for this purpose, any U+00AD in an XML comment or processing instruction (or the <teiHeader>) is ignorable.
- Once you’ve seen one white space, you’ve seen ’em all
As with many text encoding projects, the WWP cares very much about the presence or absence of most whitespace in the encoded XML file, but we don’t care at all about the details of said whitespace, i.e. how many or which whitespace characters occur. We would consider the following three examples entirely equivalent (although obviously, humans prefer to work on the first).
<lg> <byline>To the tune of <title>Don’t Cry for me Argentina</title> by Andrew Lloyd Webber and Tim Rice</byline> <l>Don’t cry for me Charles Goldfarb,</l> <l>The truth is I do not miss them,</l> <l>All of those features,</l> <l>Because we’re lazy,</l> <l>To save us typing,</l> <l>They drove us crazy.</l> </lg>
<lg><byline>To the tune of <title>Don’t Cry for me Argentina</title> by Andrew Lloyd Webber and Tim Rice</byline> <l> Don’t cry for me Charles Goldfarb, </l><l>The truth is I do not miss them, </l><l> All of those features, </l><l> Because we’re lazy, </l><l> To save us typing, </l><l> They drove us crazy. </l> </lg>
<lg><byline>To the tune of <title>Don’t Cry for me Argentina</title> by Andrew Lloyd Webber and Tim Rice</byline> <l> Don’t cry for me Charles Goldfarb, </l><l> The truth is I do not miss them, </l><l> All of those features,</l><l>Because we’re lazy,</l><l>To save us typing,</l><l>They drove us crazy.</l></lg>
The results of these encoding practices are that it is
trivially easy to find all the occurrences of soft hyphens that
require resolution (without any false positives), and we can
regularize whitespace (even if we can’t use the
normalize-space()
function; see section “Whitespace”), making tokenization and reconstitution of strings
easier.
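To illustrate how these practices make soft hyphens easy to find, here is a small Python sketch that counts U+00AD in element content beneath a given element and flags any that stray into attribute values. The function name and the shape of its return value are hypothetical, and element names are assumed to be un-namespaced.

```python
import xml.etree.ElementTree as ET

SHY = "\u00AD"


def audit_soft_hyphens(root):
    """Return (count in element content, [(tag, attribute) with a stray U+00AD])."""
    count = 0
    stray = []
    for elem in root.iter():
        # ElementTree keeps character data in .text (before the first child)
        # and .tail (after the element's end tag).
        count += (elem.text or "").count(SHY) + (elem.tail or "").count(SHY)
        for name, value in elem.attrib.items():
            if SHY in value:
                stray.append((elem.tag, name))
    return count, stray


tei = ET.fromstring(
    '<TEI><teiHeader><change>fixed a \u00AD</change></teiHeader>'
    '<text><p n="1">natural\u00AD <lb/>ly Pru\u00AD '
    '<lb n="x\u00AD"/>dently</p></text></TEI>')
# Only /TEI/text is examined; the header's U+00AD is discussion, not content.
print(audit_soft_hyphens(tei.find("text")))
```

Comments and processing instructions are dropped by the default ElementTree parser, which happens to match the policy above that any U+00AD inside them is ignorable.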
Attempts
Early Days
Roughly speaking, in the 1980s the WWP used Waterloo Script; in the early 1990s we used Waterloo GML; in the mid 1990s we used Waterloo GML using “pointy brackets” (< and >) instead of the default tag delimiters : and . ; in the late 1990s we used SGML, but still did most processing with Waterloo Script; and in the early 21st century we switched to XML.
In mid-1991 the WWP embarked on a collaboration with Oxford University Press to publish a series of books based on our textbase files. Thus I went to work on a program to generate camera-ready PostScript output from our pseudo-SGML input, using Waterloo Script. I believe this was the first time we actually wrote code to resolve our soft hyphens. The snippet of code below is from a subroutine of that program written in 1991-10. The &*txt0. set symbol contains a line of text with each SPACE (U+0020) converted to a COMMERCIAL AT (U+0040) character. In the input files at that time a soft hyphen was encoded just like a hard hyphen, i.e. using the HYPHEN-MINUS character (U+002D).
. .*
. .* check the first character; if it is a blank ("@") AND our "we
. .* chopped a hyphen off last time we appended" flag is set, chop off
. .* the blank.
. .*
. .if "&'substr( &*txt0., 1, 1 )" = "@" & &nw_shyl. = 1
. .sr *txt1 = &'substr( &*txt0., 2 )
. .el .sr *txt1 = &*txt0.
. .*
. .*
. .* parse off the last character; if it is the CONTinuation character,
. .* chop it off. (For some reason in this context Script treats it as
. .* a text character.)
. .*
. .sr *len = &'length( &*txt1. )
. .sr *last = &'substr( &*txt1., &*len., 1 )
. .*
. .if "&*last." = "&$cont." .sr *txt2 = "&'substr(&*txt1.,1,&*len.-1 )"
. .el .sr *txt2 = "&*txt1."
. .*
. .*
. .* if there are still characters left, check the last one; if it is a
. .* hyphen, chop it off (Script will not treat it as a soft hyphen
. .* here!), and set a flag
. .*
. .sr *len = &'length( &*txt2. )
. .if &*len. gt 0 .do begin
. .sr *last = "&'substr( &*txt2., &*len., 1 )"
. .*
. .if "&*last." = "&shy." .th .do begin
. .sr *txt3 = "&'substr(&*txt2., 1, &*len.-1 )"
. .sr nw_shyl = 1
. .do end
. .el .do begin
. .sr *txt3 = &*txt2.
. .sr nw_shyl = 0
. .do end
. .do end
. .*
My vague recollection is that the above code worked reasonably well, but e-mail I sent in 1994-03 makes it clear it always had problems: “hyphens [are] top prio[rity], so I will be tackling that … My thinknig rightg now is that I've spent years trying to figure out how to get SCRIPT to handle this w/o success. But it would be trivial to massage the original file w/ Perl (or maybe even BBEdit) in order to remove soft hyphens, at least in simple <lb> case, and probably others”. I do not recall what the problems were. My vague recollection is that this system was able to handle the problems described in section “’Twixt” well, because this routine was not called on strings that were not part of the main text flow; i.e. it was not used for page apparatus, annotations, figure descriptions, etc.
Special-purpose: Perl version
It is clear from an e-mail exchange from mid 1994-03 that I wrote a special-purpose MacPerl program at that time to handle the “simple” soft hyphen cases, i.e. when an end-of-line hyphen was followed by a breaking element (<pgbk>, <lb>, or <cl>). I have not been able to find that original Perl program, but I believe that the soft hyphen handling portion of a later routine was based on it. In the following snippet, the entire input file is stored as one long string in the variable $in.
$in =~ s,­\s*<lb[^>]*>(<anchor[^>]*>)([^ \t\r\n<]*),\2\1,igs;
$in =~ s,­\s*(<anchor[^>]*>)\s*<lb[^>]*>([^ \t\r\n<]*),\2\1,igs;
$in =~ s,­\s*<lb[^>]*>,,igs;
This snippet of code does not handle <pgbk> or <cl> elements, because by the time this program, based on the original MacPerl program, was written they no longer existed in our encoding system. It does handle an empty <anchor> element, whether it is before (line 2) or after (line 1) the <lb> that follows the soft hyphen. I can only guess at the reason why it does not handle <pb>, the replacement for <pgbk>: handling the section “’Twixt” problem would be too difficult.
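For readers more comfortable with Python than Perl, the three substitutions can be sketched with the standard `re` module. This is only an illustrative transliteration, not the WWP's code: the function name `resolve_simple` is hypothetical, and it assumes the soft hyphen is the SOFT HYPHEN character (U+00AD).

```python
import re

SHY = "\u00ad"  # SOFT HYPHEN (assumption: this is the character being matched)
FLAGS = re.IGNORECASE | re.DOTALL  # counterparts of Perl's "i" and "s" modifiers


def resolve_simple(text: str) -> str:
    """Resolve soft hyphens followed by an <lb>, allowing one adjacent <anchor>."""
    # 1. shy, <lb>, <anchor>, rest-of-word  ->  rest-of-word, <anchor>
    text = re.sub(SHY + r"\s*<lb[^>]*>(<anchor[^>]*>)([^ \t\r\n<]*)",
                  r"\2\1", text, flags=FLAGS)
    # 2. shy, <anchor>, <lb>, rest-of-word  ->  rest-of-word, <anchor>
    text = re.sub(SHY + r"\s*(<anchor[^>]*>)\s*<lb[^>]*>([^ \t\r\n<]*)",
                  r"\2\1", text, flags=FLAGS)
    # 3. shy, <lb>  ->  nothing (the two halves of the word now abut)
    text = re.sub(SHY + r"\s*<lb[^>]*>", "", text, flags=FLAGS)
    return text
```

As in the Perl, the `<anchor>` is moved to just after the rejoined word, and the `<lb>` itself is discarded along with the soft hyphen.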
Special-purpose: CMS Pipelines
However, we found the MacPerl program to be too
cumbersome and slow.[28] Thus a few days later (1994-03-19)
I wrote a CMS Pipelines version of the same command. The
program is written in Rexx, but all the work is done by a
single call to the CMS pipe command. That main call follows.
/*
** Now do the real work in one big pipeline; it would be fast
** except that the SPILL stage is written in Rexx. Oh well.
*/
'pipe (long endchar #) <' fn ft fm,    /* read file in */
   '| nfind <pgbk',                    /* nuke page-break lines */
   '| join * /@/',                     /* now 1 line, remembering \n's */
   '| split after string />/',         /* chop into reasonable size parts */
   '| change /-@<lb/<NUKEME/',         /* mark hyphen-EOL-<lb */
   '| change /-@<cl>/<NUKEME>/',       /* mark hyphen-EOL-<cl> */
   '| change /-@<cl /<NUKEME /',       /* mark hyphen-EOL-<cl_ */
   '| join *',                         /* back into 1 line */
   '| split before string /</',        /* now chop up such as to separate */
   '| split after string />/',         /* tags onto lines of their own */
   '| nfind <NUKEME',                  /* and kill marked records */
   '| t: find <',                      /* take only the tags */
   '| change / /%/',                   /* and protect internal blanks */
   '| a: faninany',                    /* get back non-tags */
   '| join *',                         /* back into 1 line, again */
   '| split before string /@/',        /* cut into pieces at orginal \n's */
   '| change /@//',                    /* nuke our \n markers */
   '| spill 153 sep|',                 /* make sure not too long */
   ' change /%/ /',                    /* restore protected blanks-in-tags*/
   '| >' ofid,                         /* write to output file */
   '# t:',
   '| a:'
By the time this command is issued, the input file (fn ft fm) has been tested to ensure it has no @ or % characters, and the name of the output file (ofid) has been set up. The spill stage (which was added a month or so later) is not a standard CMS Pipelines stage, but rather was a pipeline stage written at Brown by James Mathiesen.[29] Its purpose was to “Spill lines at a particular column … to wrap one-paragraph-per-line input into a wrapped text”. This routine was clearly written back when soft hyphens were encoded just as hard hyphens, i.e. using the HYPHEN-MINUS character (U+002D).
Like the Perl program before it, this program was only designed to handle the “simple” soft hyphens that were followed immediately by a <cl>, <lb>, or <pgbk> element. These were, of course, the vast majority of cases. The problem described in section “’Twixt” is handled, at least for page breaks, in a novel way: the entire record is simply discarded. This worked because during this era it was policy to record all details about a page break on a single line.
It is worth mentioning that the program can match <lb> elements just by searching for the first three characters, which will always be <lb. However, the same shorthand does not work for <cl> elements, because the first characters of the element name are not unique: there were also <close>, <closer>, <closing>, and <clbk> elements at different times in our history.
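The mark-and-discard idea behind the pipeline can be sketched in Python. This is a loose emulation under stated assumptions, not a port: the function name `resolve_like_pipeline` is hypothetical, the line-wrapping `spill` stage and the protection of blanks inside tags are omitted, and (as in the original) the input is assumed to contain no `@` characters.

```python
import re


def resolve_like_pipeline(lines):
    """Loosely emulate the mark-and-discard pipeline on a list of input lines."""
    # "nfind <pgbk": discard every record that begins with a page-break tag
    lines = [ln for ln in lines if not ln.startswith("<pgbk")]
    # "join * /@/": one long record, with "@" remembering each original newline
    text = "@".join(lines)
    # Mark hyphen + end-of-line + tag by renaming the tag. "<lb" is a safe
    # prefix; "<cl" is not (<close>, <closer>, ...), so the character that
    # follows "cl" must be matched as well.
    text = text.replace("-@<lb", "<NUKEME")
    text = text.replace("-@<cl>", "<NUKEME>")
    text = text.replace("-@<cl ", "<NUKEME ")
    # Put each tag in a record of its own, then drop the marked records
    records = [r for r in re.split(r"(<[^>]*>)", text) if r]
    records = [r for r in records if not r.startswith("<NUKEME")]
    # Back to one record, then cut at the remembered newlines
    return "".join(records).split("@")
```

Because the hyphen and the newline marker are consumed when the tag is renamed, discarding the renamed tag leaves the two halves of the word adjacent, while a hyphen before `<closer>` is left alone.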
XSLT
And so it went: for years the WWP limped by on various hacks to resolve soft hyphens. Then in 2011 we began moving our publication to XTF, a system built almost entirely on XSLT.[30] Thus we attempted to resolve soft hyphens in XSLT.
First try: text nodes
My first crack at this was just a simple attempt to implement the algorithm loosely described in section “Seems easy …”. One template matched text()[contains(.,'&#xAD;')] and grabbed the first token of the next (i.e., closest following) non-all-whitespace text node that was not inside an <mw> element. Another template matched that closest following non-all-whitespace text node that was not inside an <mw>, and dropped the first token before spitting it into the output stream.
This code became quite thorny when I added the conditions to handle some of the complications mentioned above. But it is even thornier than you might imagine because any given text node may fall into both categories: it may end in &#xAD; and may also immediately follow a line that ends in &#xAD;. I was ending up with code that almost worked, but was horrible to read and maintain. Debugging was a nightmare.
Second try: decorated elements around those text nodes
Eventually it occurred to me that XSLT’s forte is processing trees of element nodes and their attributes, not text nodes. A large part of the problem I was having was needing to repeat a test performed in template A so that template B could figure out what template A had thought of a given node. Instead, if I processed in separate passes, template A could record what it thought of each node so that template B, running at a later pass, would know. Of course, one needs a place to record this information, and a text node doesn’t really have any convenient place.
So a first pass wraps all text nodes (other than those that need to be ignored, anyway) with a temporary element, <pcdata>. This element is given attributes that record useful information about the text node for later examination: e.g., whether or not it ends in a soft hyphen, whether or not it starts with whitespace, the first token parsed off, etc. The following is what the example from section “Seems easy …” looks like after the text nodes have been “wrapped”.
<lb/>
<pcdata xml:id="d2t6"
        endsInShy="true"
        multiWord="true"
        space1st="false"
        firstWord="procuring"
        restWords="a ſpeedy adminiſtration of Juſ&#xAD; ">procuring a ſpeedy adminiſtration of Juſ&#xAD; </pcdata>
<lb/>
<pcdata xml:id="d2t8"
        endsInShy="false"
        multiWord="true"
        space1st="false"
        firstWord="tice"
        restWords="for the impartiall puniſhment of all ">tice for the impartiall puniſhment of all </pcdata>
<lb/>
<pcdata xml:id="d2t10"
        endsInShy="false"
        multiWord="true"
        space1st="false"
        firstWord="offenders,"
        restWords="to the relief and comfort of the ">offenders, to the relief and comfort of the </pcdata>
A second pass further decorates the new <pcdata> elements with attributes that record information about other nodes. For example, whether or not a text node immediately follows a text node that ended in a soft hyphen is recorded on a new attribute, @immedFollowsShy, that is added to its wrapper <pcdata> element.
Given the easy access to information now associated with each pertinent text node, it should be much easier to resolve the soft hyphens by moving the first token following a soft hyphen to the end of the <pcdata> containing the soft hyphen (replacing the soft hyphen itself). And, of course, a final pass would “clean up” by removing the temporary <pcdata> elements.
And in fact, I did find it easier to think about and handle the various tests needed to see which bits should be moved to replace the soft hyphen. Nonetheless, I found this a daunting task and never got a fully working version.
Third try: decorated elements around tokens
Eventually it occurred to me that one of the problems I was facing was the difficulty presented by a single <pcdata>-wrapped text node that both immediately follows a soft hyphen and ends in a soft hyphen; and that another was that keeping track of which text nodes contained multiple tokens and which did not was, although not particularly difficult, an unneeded layer of complexity.
There is no such thing (in English) as a word that is long enough to wrap around more than one line. That is, a single token will never both immediately follow a soft hyphen and end with a soft hyphen. (Note to self: see FLWs.) Thus I am now taking the approach of wrapping each pertinent token of text in a temporary, decorated (i.e., information-rich) element.
<lb/>
<tmp:tok xml:id="d2t6.1" endsInShy="false" tmp:spaceBefore="false">procuring</tmp:tok>
<tmp:tok xml:id="d2t6.2" endsInShy="false">a</tmp:tok>
<tmp:tok xml:id="d2t6.3" endsInShy="false">ſpeedy</tmp:tok>
<tmp:tok xml:id="d2t6.4" endsInShy="false">adminiſtration</tmp:tok>
<tmp:tok xml:id="d2t6.5" endsInShy="false">of</tmp:tok>
<tmp:tok xml:id="d2t6.6" endsInShy="true">Juſ</tmp:tok>
<lb/>
<tmp:tok xml:id="d2t8.1" endsInShy="false" tmp:spaceBefore="false">tice</tmp:tok>
<tmp:tok xml:id="d2t8.2" endsInShy="false">for</tmp:tok>
<tmp:tok xml:id="d2t8.3" endsInShy="false">the</tmp:tok>
<tmp:tok xml:id="d2t8.4" endsInShy="false">impartiall</tmp:tok>
<tmp:tok xml:id="d2t8.5" endsInShy="false">puniſhment</tmp:tok>
<tmp:tok xml:id="d2t8.6" endsInShy="false">of</tmp:tok>
<tmp:tok xml:id="d2t8.7" endsInShy="false">all</tmp:tok>
<lb/>
<tmp:tok xml:id="d2t10.1" endsInShy="false" tmp:spaceBefore="false">offenders,</tmp:tok>
<tmp:tok xml:id="d2t10.2" endsInShy="false">to</tmp:tok>
<tmp:tok xml:id="d2t10.3" endsInShy="false">the</tmp:tok>
<tmp:tok xml:id="d2t10.4" endsInShy="false">relief</tmp:tok>
<tmp:tok xml:id="d2t10.5" endsInShy="false">and</tmp:tok>
<tmp:tok xml:id="d2t10.6" endsInShy="false">comfort</tmp:tok>
<tmp:tok xml:id="d2t10.7" endsInShy="false">of</tmp:tok>
<tmp:tok xml:id="d2t10.8" endsInShy="false">the</tmp:tok>
At the time of this writing, the program that uses this method runs, and handles the simple case well. It also has the added advantage that it will resolve soft hyphens in either direction, finalUp or initDown: either by moving the final portion of the word up to replace the soft hyphen, or by moving the initial portion of the word down to the beginning of the next line. It still has quite a few bugs, though. In particular, it does not handle the problem pointed out in section “Sibling of Overlap” well at all. However, I am still holding out hope.
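For the simple case only, the two directions can be sketched in Python. This is a toy model rather than the XSLT program: the `Tok` class and the `resolve` function are hypothetical, each line is assumed to be a flat list of tokens with nothing intervening between lines, and none of the complications discussed earlier are handled.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str                 # token content, without the soft hyphen itself
    ends_in_shy: bool = False


def resolve(lines, direction="finalUp"):
    """Rejoin each soft-hyphenated token with the first token of the next line."""
    if direction not in ("finalUp", "initDown"):
        raise ValueError(f"unknown direction: {direction}")
    # Work on a copy so the caller's tokens are left untouched
    out = [[Tok(t.text, t.ends_in_shy) for t in line] for line in lines]
    for cur, nxt in zip(out, out[1:]):
        if not cur or not nxt or not cur[-1].ends_in_shy:
            continue
        whole = cur[-1].text + nxt[0].text
        if direction == "finalUp":
            cur[-1] = Tok(whole)   # final portion moves up
            del nxt[0]
        else:
            nxt[0] = Tok(whole, nxt[0].ends_in_shy)  # initial portion moves down
            del cur[-1]
    return [" ".join(t.text for t in line) for line in out]
```

Because each token carries its own flag, a token never has to be examined both as "ends in a soft hyphen" and as "follows a soft hyphen" in the simple case, which is the simplification this third approach relies on.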
References
[NWEW] Edward Phillips, The New World of English Words, or, a General Dictionary, 4th edition; 1678.
[OED] The Oxford English Dictionary, online edition, accessed 2016-04-22.
[FLWs] “famous last words” in Wiktionary, accessed 2016-04-21.
[SH] “soft hyphen” in Wiktionary, accessed 2016-04-22.
[SHHP] Soft hyphen (SHY) – a hard problem?, accessed 2016-04-22.
[TEI] Burnard, Lou and Syd Bauman, eds. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.0.0, 2016-03. TEI Consortium. (2016-04-22).
[1] E.g.,
soft hyphen: (computing, typography) A generally invisible text character marking a point where hyphenation can occur without forcing a line break in an inconvenient place if the text is later re-flowed.— SH
[2] From SHHP, an excellent resource by Jukka Korpela. That said, I’m a little concerned because Mr. Korpela quotes clause 6.3.3 of ISO 8859-1. I have not yet gotten my hands on a copy of ISO 8859-1:1987 or earlier, but the 1998 edition does not seem to have a clause 6.3.3.
It is worth noting here that the “hard problem” Mr. Korpela discusses in his paper is not at all the same difficult problem I am trying to tackle in the current paper.
[3] Mémoire sur l'éclairage et le balisage des côtes de France, Volume 2 by Léonce Reynaud. Sadly, it is not about markup technologies. (Not surprising, though: it was published in 1864.)
[4] XSLT 2.0 and XPath 2.0, 4th edition.
[6] From Ready Player One by Ernest Cline. 1st edition, paperback, ISBN 978-0-307-88744-3, page 67.
[8] From “The Senate Defers to the N.R.A.”, The New York Times, page A24 (editorials), 2016-03-24. Although you can read the editorial online, it does not have the same lineation as the printed National Edition.
[9] Formerly the Brown University Women Writers Project, now the Women Writers Project, which is part of the Digital Scholarship Group in the Northeastern University Library.
[10] Here, as elsewhere, the hyphen glyph in the source is transcribed as a numeric character reference because an actual SOFT HYPHEN does not show up in a web browser.
[11] Copied from lines 697–705 of the WWP transcription of Mary Astell’s 1694 book A Serious Proposal to the Ladies as of revision r27555, last updated 2015-12-30; I then added the @n attributes to make talking about the lines easier.
[12] In my own defense, by summer 1994 I had posted to the internal WWP list that this was a difficult problem; see “Re: Missing hyphens and spaces”, posted 1994-07-21 to WWPTAG-L.
[13] The re-peat Pete, identi-cal Cal, or duplic-ate 8 problem.
[14] From page 5 of The petition of the Jewes for the repealing of the act of Parliament for their banishment out of England by Johanna Cartwright (with her son Ebenezer Cartwright), 1648. The image is from the Hathi Trust page image. The transcription is copied from the WWP transcription of the same edition as of revision r27244, last updated 2015-11-24.
[15] Adapted from the WWP transcription of Memoir of Mrs. Chloe Spear, a native of Africa, who was enslaved in childhood and died in Boston, January 3, 1815...aged 65 years by “A lady of Boston”, as of revision r27576, last updated 2016-01-04.
[16] One might imagine that a processor should know that a <persName> is always a word unto itself, and thus should be followed by whitespace. I.e., that the presence of the <persName> element should cause whitespace around its content, thus giving us “Mrs. B it” as opposed to “Mrs. Bit”. But this turns out not to be the case. Personal names are often immediately followed by a non-whitespace character. While these characters are most commonly punctuation (e.g., an apostrophe, a comma, or a period) that might be encoded inside the <persName>, there are cases where even such “white lie” encoding will not work. E.g., consider the following passage, copied from the WWP transcription of Lady Mary Chudleigh’s 1701 work The Ladies Defence as of r28816, last updated 2016-06-09.
<l><persName>Narciſſius</persName>-like, you your own Graces view,</l>
<l>Think none deſerve to be admir'd but you:</l>
<l>Your own Perfections always you adore,</l>
<l>And think all others deſpicably poor:</l>
[17] I often use a WWP function explicitly for this purpose:
<xsl:function name="wwp:regularize-space" as="xs:string">
  <!-- Collapse all strings of whitespace *including leading & trailing white- -->
  <!-- space* in the parameter (a string) to a single space (U+0020) character. -->
  <!-- Written long ago on a computer far away by Syd Bauman; copyleft. -->
  <xsl:param name="arg" as="xs:string"/>
  <xsl:variable name="intermediate" select="concat('␠', $arg, '␠')"/>
  <xsl:variable name="semifinal" select="normalize-space( $intermediate )"/>
  <xsl:value-of select="substring( $semifinal, 2, string-length( $semifinal ) -2 )"/>
</xsl:function>
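For readers who do not read XSLT, the effect of this function can be sketched in Python. The name `regularize_space` is a hypothetical stand-in, and the sketch assumes that "whitespace" means the four XML whitespace characters that XPath's normalize-space() acts on. Unlike normalize-space() alone, leading and trailing runs are reduced to a single space rather than removed, which is what the sentinel characters in the XSLT achieve.

```python
import re


def regularize_space(arg: str) -> str:
    """Collapse every run of XML whitespace, leading and trailing runs included,
    to a single SPACE (U+0020)."""
    return re.sub(r"[ \t\r\n]+", " ", arg)
```

Keeping one leading or trailing space matters for soft hyphen resolution, because whether a text node starts or ends with whitespace is exactly the evidence that a word boundary is present.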
[18] Adapted from the WWP transcription of Memoir of the late Hannah Kilham, 1837, as of revision r28478, last updated 2016-04-21.
[19] The <mw> element is the WWP’s version of the TEI <fw> element.
[20] An overall helpful anonymous reviewer suggested that page numbers should be encoded in an attribute value instead of in element content, and further suggested the TEI Guidelines recommend an attribute value. The reviewer is certainly correct that the process of soft hyphen resolution would be much easier if there were never any element content between the initial and final portions of a word broken across a line, column, or page break. And, indeed, in TEI 3.10.3 Milestone Elements the Guidelines say “The global @n attribute is used in each case to provide a value for the [page number].” However, this is a mechanism for recording what the page number is, not the page number as it is written on the page. The two may not match, and (in the general case) it is definitionally not possible to record what is written on the page in an attribute value, for two reasons:
- it may include characters outside of Unicode, which need to be represented using markup, in the TEI case the <g> element;
- it may require markup for other reasons, for example the correction of an apparent error, said correction made either by the current encoder (which would entail the use of the TEI <choice>, <sic>, and <corr> elements) or by an 18th century librarian (which would entail the use of, e.g., the TEI <subst>, <del>, and <add> elements).
The Guidelines provide the <fw> element precisely to record page numbers etc. actually present in the document being encoded (see TEI 11.6 Headers, Footers, and Similar Matter).
[21] Adapted from the WWP transcription of Favell Mortimer’s 1842 publication The History of Job, in Language Adapted to Children, as of revision r29046, last updated 2016-07-07.
[22] In particular, the heading at the top of page 1 of A Muzzle for Melastomus by Rachel Speght, published in 1617. The heading is actually the complete title of the book, and because there is a lot of front matter, occurs almost halfway through. This image is of Shirley Marc’s Renascence Editions edition, which can be found at the University of Oregon’s Scholars’ Bank.
[23] The encoding also uses the <vuji> element, which is not a TEI element. It is WWP shorthand for the typographic regularization of V, v, U, u, J, j, I, i, VV, and vv. E.g., the expanded TEI form of the WWP <vuji>u</vuji> would be <choice><orig>u</orig><reg>v</reg></choice>.
[24] Copied from the WWP transcription of The cook's guide: or, rare receipts for cookery by Hannah Wolley, 1664, as of r27331, last updated 2015-12-03.
[25] Copied from the WWP transcription of the second edition of Sermons of Barnardine Ochyne, (to the number of. 25.) concerning the predestination and election of god, translated by Ann Bacon in 1570, as of r27244, last updated 2015-11-24.
[26] From the title page of The petition of the Jewes for the repealing of the act of Parliament for their banishment out of England by Johanna Cartwright (with her son Ebenezer Cartwright), 1648. The image is from the Hathi Trust page image. The transcription is copied from the WWP transcription of the same edition as of revision r27244, last updated 2015-11-24.
[27] This is in accordance with the enthymeme I often give clients: “I’d prefer your encoding be consistently wrong than inconsistent.”
[28] Apparently a large part of the problem was that for reasons I do not know and may have never known, we could not run our preferred Mac↔mainframe transfer program at the same time as MacPerl.
[29] Written 1991-05-24. Interestingly, James and I shared an apartment at the time.
[30] The eXtensible Text Framework from the California Digital Library.