How to cite this paper
DeRose, Steven J. “The structure of content.” Presented at International Symposium on Quality Assurance and Quality Control in XML, Montréal, Canada, August 6, 2012. In Proceedings of the International Symposium on Quality Assurance and Quality Control
in XML. Balisage Series on Markup Technologies, vol. 9 (2012). https://doi.org/10.4242/BalisageVol9.DeRose01.
International Symposium on Quality Assurance and Quality Control in XML
August 6, 2012
Balisage Paper: The structure of content
Steven J. DeRose
Director of R&D
OpenAmplify
Steve DeRose has been working with electronic document systems since joining Andries
van Dam's FRESS project in 1979. He holds degrees in Computer Science and in Linguistics
and a Ph.D. in Computational Linguistics from Brown University. His development of
fast, accurate part-of-speech tagging methods for English and Greek corpora helped
touch off the shift from heuristic to statistical methods in computational linguistics.
He co-founded Electronic Book Technologies to build the first SGML browser and retrieval
system, "DynaText", and has been deeply involved in document standards including XML,
TEI, HyTime, HTML 4, XPath, XPointer, EAD, Open eBook, OSIS, NLM and others. He has
served as Chief Scientist of Brown University's Scholarly Technology Group and Adjunct
Associate Professor of Computer Science. He has written many papers, two books, and
eleven patents. Most recently he joined OpenAmplify, a text analytics company that
does very high-volume analysis of texts, mostly from social media.
Copyright © 2012 by the author. Used with permission.
Abstract
Text analytics involves extracting features of meaning from natural language texts
and making them explicit, much as markup does. It uses linguistics, AI, and statistical
methods to get at a level of "meaning" that markup generally does not: down in the
leaves of what to XML may be unanalyzed "content". This suggests potential for new
kinds of error, consistency, and quality checking. However, text analytics can also
discover features that markup is used for; this suggests that text analytics can also contribute to the markup process
itself.
Perhaps the simplest example of text analytics' potential for checking is xml:lang.
Language identification is well-developed technology, and xml:lang attributes "in
the wild" could be much improved. More interestingly, the distribution of named entities
(people, places, organizations, etc.), topics, and emphasis interacts closely with
documents' markup structures. Summaries, abstracts, conclusions, and the like all
have distinctive features which can be measured.
This paper provides an overview of how text analytics works, what it can do, and how
that relates to the things we typically mark up in XML. It also discusses the trade-offs
and decisions involved in just what we choose to mark up, and how that interacts with
automation. It presents several specific ways that text analytics can help create,
check, and enhance XML components, and exemplifies some cases using a high-volume
analytics tool.
Table of Contents
- Introduction
- Tedious and brief: Language vs. Markup?
- Text Analytics to the rescue?
  - Statistical methods
  - Intuitive methods
  - Accuracy?
- Text Analytics' Relation to Markup
- How shall we find the concord of this discord?
Introduction
The basic purpose of markup is to make the implicit structure of texts explicit; the
same is true of "text analytics" (or "TA"), a fast-growing application area involving
the automated extraction of (some) meaning from documents. In this paper I will try
to place markup and analytics in a larger frame, and discuss their relationship to
each other and to the notion of explicit vs. implicit information.
To begin, note that 'implicit' and 'explicit' are matters of degree. Language itself
is an abstract system for thought, entirely implicit in the relevant sense here. The
words and structures of natural language make thought more explicit, fixing it in
the tangible medium of sound (of course, some people cannot use that medium, but still
use language). Writing systems take us another step, particularly in their less familiar,
non-phonetic features: sentence-initial capitals, quotation marks, ellipses, etc.
all make linguistic phenomena explicit, and it has long been noted that punctuation
(not to mention layout) is a kind of "markup"[Coom87], albeit often ambiguous. In a yet broader sense, the "s" on the end of a plural
noun can be considered "markup" for a semantic feature we call "plural", and a space
is a convenient abbreviation for <word>. XML markup thus fills a "metalinguistic"
role analogous to punctuation, though (ideally) richer and less ambiguous.
'Documents' is also an imprecise term. Here I am focusing only on the more traditional
sense: human-readable discourses, in a (human) language, with typically near-total
ordering. This doesn't preclude including pictures, data tables, graphs, video, models,
etc.; but I am for now excluding documents that are mainly for machine consumption,
or are in a specialized "language" such as for mathematics, vector graphics, database
tables, etc.
What, then, can text analytics do for XML? What are some phenomena in texts that we
might make explicit, or check, using text analytics methods? How do these mesh with
how our linguistic and markup systems are organized? An anonymous reviewer raised the
valuable question of where text analytics can be effective for auto-tagging material;
which in turn poses an old problem in a new guise: If you can use a program to find
instances of some textual feature, why mark it up?
Tedious and brief: Language vs. Markup?
To explore these questions we may arrange reality into successive levels of abstraction
and explicitness. This is similar to the typical organization of layers in natural
language processing systems.
-
Semantic structure: implicit (in that we rarely "see" semantic structure directly); presumably relatively
unambiguous.
When representing semantics, we tend to use artificial, formal languages such as predicate
calculus, or model the real world with semantic structures such as Montague's[Mont73], involving
Entities (people, places, things);
Relationships (A does V to B with C near D because R...); and
Truth values (characterizing, say, the meaning of a sentence).
-
Syntactic structure: mostly implicit (though with some explicit markers); ambiguous. "The six armed men
held up the bank" is a classic example of syntactic and lexical ambiguity.
Typical syntactic units include Morphemes -> Words -> Phrases -> Clauses -> Sentences
-> Discourse units (such as requests, questions, and arguments), etc. At this level
we begin to see some explicit markers. In speech they include timing, tone contours,
and much more; in writing punctuation and whitespace often mark these kinds of units.
A full analysis of speech must also deal with many features we tend to ignore with
written language: incomplete or flatly incorrect grammar; back-tracking; background
noise that obscures the "text" much as coffee stains obscure manuscripts; even aphasic
speech; but we'll ignore most of that here.
-
Orthographic structure: explicit (by definition), but still ambiguous. Characters in character sets need
not map trivially to characters as linguistic units (ligatures, CJK unification, etc.).
More significantly, orthographic choices can communicate sociolinguistic messages:
Hiragana vs. Katakana; italics for foreign words; even the choice of fonts. Case distinctions
often indicate part of speech. And a given spelling may represent different words:
"refuse", "tear", "lead", "dove", "wind", "console", "axes"[het1, het2]. These are "heteronyms"; true "homonyms" (which also share pronunciation), such
as "bat", are rarer. Both are commonly misused, making them obvious candidates for
automatic checking.
Most English punctuation marks also have multiple uses. Ignoring numeric uses, periods
may end sentences, occur in their midst, or mark abbreviations. A colon can mark a
phrase, clause, or sentence boundary. The lowly apostrophe, introduced to English
in the 16th century, can mark an opening or closing quote, a possessive, a
contraction as in "fo'c's'le", or sometimes a plural as in P's and Q's. Punctuation
is, I think, underappreciated in both markup and text analytics.
All these levels involve both markup-like and linguistic-like features, so it seems
clear that there is much potential for synergy. At the same time, the nature of the
errors to be detected and of the markup to be automated differs at each level, and so the applications
of text analytics must be considered pragmatically.
-
Markup structure: explicit and unambiguous. Markup is rarely considered in natural-language processing,
even though it has obvious implications. Consider tags such as <del> and <ins>, <abbr>,
as well as tags that change appropriate syntactic expectations: <q>, <heading>, <code>,
and many others. We may hope that never anything can be amiss, when simpleness and
duty tender it. But markup directly affects the applicability and techniques of NLP.
In large-scale text projects, the creation of high-quality markup is one of the largest
expenses, whether measured by time, money, or perhaps even frustration. How do we
choose what to mark up? Can text analytics shift the cost/benefit calculus, so that
we can (a) mark up useful features we couldn't before, and (b) assure the quality
of the features we have marked up?
Text Analytics to the rescue?
Text analytics tries to extract specific features of meaning from natural language
text (i.e., documents). It mixes linguistics, statistical and machine learning, AI,
and the like, to do a task very much like the task of marking up text: searching through
documents and trying to categorize meaningful pieces. Like markup, this can be done
by humans (whose consensus, when achievable, is deemed the "gold standard" against
which algorithms are measured), or by algorithms of many types.
Text analytics today seeks very specific features; largely ones that are saleable;
it does not typically aim at a complete analysis of linguistic or pragmatic meaning.
Among the features TA systems typically seek are:
-
Language identification
-
Genre categorization: Is this reportage, advocacy, forum discussion, reviews, ads,
spam, ...?
-
Characteristics of the author: gender, age, education level,....
-
Topic disambiguation: Flying rodents vs. baseball implements; non-acids vs. other
baseball implements; misspelled photographs vs. baseball players vs. carafes; and so
on.
-
Topic weight: Is the most important topic here baseball, or sports, or the Acme Mark-12
jet-assisted fielding glove's tendency to overheat?
-
Sentiment: Does the author like topic X?
-
Intentions: Is the user intent on buying/selling/getting help?
-
Times and events: Is the text reporting or announcing some event? What kind? When?
-
"Style" or "tone" measures: Decisiveness, flamboyance, partisan flavor, etc.
Text analytics is related to many more traditional techniques, such as text summarization,
topical search, and the like. However, it tends to be more focused on detecting very
specific features, and operating over very large collections such as Twitter feeds,
Facebook posts, and the like. The OpenAmplify[oa] service, for example, regularly analyzes several million documents per day.
A natural thing for a text analysis system to do is to mark up parts of documents
with what features they were found to express: this "he" and that "our renowned Duke"
are references to the "Theseus" mentioned at the beginning; this paragraph is focused
on fuel shortages and the author appears quite angry about them; and so on. Of course,
some features only emerge from the text as a whole: the text may be about space exploration,
even though that topic is never mentioned by name.
Users of text analytics commonly want to process large collections of short social
media texts, searching for particular kinds of mentions. For example, Acme Mustardseed
Company may want to know whenever someone posts about them, and what attitude is expressed.
Beyond mere positive vs. negative, knowing that some posters are advocates or detractors
(advising others to buy or avoid), can facilitate better responses. Other companies
may want to route emails to tech support vs. sales, or measure the response to a marketing
campaign, or find out what specific features of their product are popular.
Text analytics algorithms are varied and sometimes complex, but there are two overall
approaches:
One method is almost purely statistical: gather a collection of texts that have been
human-rated for some new feature, and then search for the best combination of features
for picking out the cases you want. This "Machine Learning" approach allows for fast
bootstrapping, even for multiple languages, and is sometimes very accurate. However,
since it takes little account of language structure, it can be tripped up by complex
syntax, negation, and almost any other kind of subtlety. It's also very hard to fix manually
-- if you go tweaking the statistically-derived parameters, unintended consequences
show up.
The other general method is heuristic or intuitive: linguists analyze the feature
in question, and find vocabulary, syntactic patterns, and so on that express it. Then
you run a pattern-matching engine over documents. The pluses and minuses of this
approach are the opposite of machine learning's: it is much more time- and expertise-intensive,
but if the analysts are good the results can be amazing. It is hard, however, to do this for
100 languages. When problems crop up, the linguists can add new and better patterns,
add information to the lexicon, etc.
As Klavans and Resnik point out[Kla96], it can be very effective to combine these approaches. One way to do that is to
use statistical methods as a discovery tool, facilitating and checking experts' intuitions.
With either approach or a combination, you end up with an algorithm, or more abstractly
a function, that takes a text and returns some rating of how likely the text is to
exhibit the phenomenon you're looking for.
Statistical methods
These methods, as noted, involve combining many features to characterize when the sought-for
phenomenon is present. Usually the features are very simple, so they can be detected reliably
and very quickly: frequencies of words or specific function-words, character-pair
frequencies, sentence and word lengths, frequency of various punctuation, and so on.
The researcher may choose some features to try based on intuition, but it is also
common simply to throw in a wide variety of features, and see what works. The features
just listed turn out to be commonly useful as well as convenient.
Taking again the simple example of language identification, one might guess that the
overall frequencies of individual characters would suffice. To test this hypothesis,
one collects from hundreds to perhaps a million texts, categorized (in this example)
by language. Say, all the English texts in one folder, and all the French texts in
another. Generating such a "corpus" is usually the critical-path item: somehow you
must sort the texts out by the very phenomenon you don't yet have a tool for. This
can be done by the researcher (or more likely their assistants), by Mechanical Turk
users, or perhaps by a weak automatic classifier with post-checking; sometimes an
appropriate corpus can be found rather than constructed de novo. Constructing an annotated
corpus has much in common with the sometimes difficult and expensive task of doing
XML markup, particularly in a new domain where you must also develop and refine schemas
and best-practice tagging guidelines and conventions.
Given such a "gold standard" corpus, software runs over each text to calculate the
features (in this example by counting characters). This produces a list of numbers
for each text. The mathematically-inclined reader will have noticed that such a list
is a vector, or a position in space -- only the space may have hundreds or thousands
of dimensions, not merely 11. For ease of intuition, imagine using only 3 features,
such as the frequencies of "k" and "?" and a flag for whether the text is from Twitter.
Usually, frequency measures are normalized by text length, so that texts of widely
varying lengths are comparable; and binary features such as being from Twitter, are
treated as 0 or 1.
Each document's list of 3 numbers equates to a location in normal space: (x, y, z).
It is easy to calculate the distance between any two such points, and this is a measure
of "distance" or similarity between the documents those two points represent. The
"as the crow flies" distance is often used, but there are other useful measures such
as the "Manhatten distance", the "cosine distance", and others.
Software such as WEKA[WEKA] runs through all the vectors, and tries to find the best combination of features
to look at, in order to correctly assign documents to the target categories. This
is called "training". Of these 3 features, the frequency of "k" is typically much
higher in English texts than French texts, while the other 2 features don't contribute
much (that is, they are not distinctive for this purpose). Such software typically
results in a "weight" for each feature for each language: in this case "k" frequency
would have a substantial positive weight for English, and a negative one for French.
Intuitively, this training allows you to discard irrelevant features, and to weight
the relevant features so as to maximize the number of texts that will be accurately
categorized. The real proof comes when you try out the results on texts that were
not part of the original training corpus, and you discover whether the training texts
were really representative or not. With too small a corpus or with features that really
aren't appropriate, you may get seemingly good results from training that don't actually
work in the end.
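As a hedged illustration of the training step, the following sketch uses the scikit-learn
library in place of WEKA; the four-sentence "corpus" and the choice of logistic regression
are purely illustrative.

# Toy version of training: single-character features, a logistic-regression
# model, and a check on sentences the model has never seen.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["the cat sat on the mat", "where is the railway station",
         "le chat est sur le tapis", "où est la gare s'il vous plaît"]
labels = ["en", "en", "fr", "fr"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 1)),   # single characters as features
    LogisticRegression())
model.fit(texts, labels)

# The real test is data that was not part of the training corpus.
print(model.predict(["the dog is on the chair", "le chien est sur la chaise"]))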
This example is a bit too simple -- as it happens, counting sequences of 2 or 3 letters
characterizes specific languages far better than single letters. Abramson[Abr63] (pp. 33-38) presents text generated randomly, but in accordance with tables of such
counts (known as "Hidden Markov Models") based on various languages. For example,
if the last two letters generated were "mo", the next letter would be chosen according
to how often each letter occurs following "mo". Abramson's randomly generated texts
illustrate how well even so trivial a model distinguishes languages:
-
(1) jou mouplas de monnernaissains deme us vreh bre tu de toucheur dimmere lles mar
elame re a ver il douvents so
-
(2) bet ereiner sommeit sinach gan turhatt er aum wie best alliender taussichelle
laufurcht er bleindeseit uber konn
-
(3) rama de lla el guia imo sus condias su e uncondadado dea mare to buerbalia nue
y herarsin de se sus suparoceda
-
(4) et ligercum siteci libemus acerelin te vicaescerum pe non sum minus uterne ut
in arion popomin se inquenque ira
These texts are clearly identifiable as to the language on whose probabilities each
was based.
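The same idea can be turned around for identification rather than generation. The sketch
below trains letter-trigram counts per language and scores a new text by how probable
its trigrams are under each model. The add-one smoothing, the bag-of-trigrams simplification,
and the toy training sentences are stand-ins for what a real system would do with large
corpora and a proper Markov model.

from collections import Counter
from math import log

def trigram_counts(text):
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def score(text, model, total):
    """Log-probability of the text's trigrams under one language's counts,
    with add-one smoothing; a crude stand-in for a conditional Markov model."""
    return sum(log((model[tri] + 1) / (total + 1))
               for tri in trigram_counts(text).elements())

# Toy training data; a real system would use large corpora per language.
models = {lang: trigram_counts(corpus) for lang, corpus in {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "fr": "portez ce vieux whisky au juge blond qui fume dans le jardin",
}.items()}
totals = {lang: sum(m.values()) for lang, m in models.items()}

sample = "le chien dort dans le jardin"
# Picks the language whose model gives the higher score (here, "fr").
print(max(models, key=lambda lang: score(sample, models[lang], totals[lang])))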
Given some set of available features, there are many specific statistical methods
that programs like WEKA can use to derive an effective model. Among them are Support
Vector Machines (SVMs), simulated annealing, Bayesian modeling, and a variety of more
traditional statistics such as regressions. The first of these is graphically intuitive:
the program tries to position a plane in space, with all (or as many as possible)
of the French texts falling on one side, and the English texts on the other. Some
methods (such as SVM) can only learn "pairwise" distinctions (such as "English or
not" and "French or not"); others can distinguish multiple categories at once (English
or French or Spanish...). Sometimes a degree of certainty or confidence can be
assigned as well.
These methods often work quite well, and (given a training corpus) can be tried out
rapidly. If the method fails, adding new features lets you try again. Programs can
typically manage hundreds to a few thousand features. This seems generous, but remember
that if you want to track the frequencies of all English words, or even all letter-triples,
you'll quickly exceed that, so some restraint is necessary.
On the other hand, statistical methods are not very intuitive. Sometimes the results
are obvious: the frequency of accented vowels is vastly higher in French than English.
But often it is hard to see why some combination of features works, and this can make
it hard to "trust" the resulting categorizer. This may be reminiscent of the methods
used for norming some psychological tests, by asking countless almost random questions
of many people, and seeing which ones correlate with various diagnoses. This can work
very well, and is hard to accuse of bias; on the other hand, if it stops working in
a slightly larger sample space, that may be hard to notice or repair.
Intuitive methods
A more "traditional" approach to developing text analytics systems is for experts
to articulate how they would go about categorizing documents or detecting the desired
phenomena and then implementing those methods. This is usually done as an iterative
process, running the implementation against a corpus much as with statistical methods
and then studying the results and refining.
For example, a linguist might articulate a set of patterns to use to detect names
of educational institutions in text. One might be 'any sequence of capitalized words,
the first or last of which is one of "College", "University", "Institute", "Seminary",
or "Academy"'. This rule can easily be tried on some text, and two checks can be done:
1: checking what it finds, in order to discover false positives such as "College Station,
TX"; and
2: checking what is left over (perhaps just series of capitalized words in the leftovers),
in order to discover false negatives such as "State University of New York" or "College
of the Ozarks".
The rules are then refined, for example to allow uncapitalized words like "of", "at",
and "the"; and to allow "State", "City", and "National" preceding the already-listed
possible starter words. Extending rules by searching for words that occur in similar
contexts also helps.
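A minimal sketch of such a hand-built rule in Python follows; the word lists are abbreviated
and the test sentence is invented, but the check-and-refine cycle just described is visible
even at this scale.

import re

# Abbreviated word lists; real rule sets grow much longer.
KEYWORDS = {"College", "University", "Institute", "Seminary", "Academy"}
PRE = {"State", "City", "National"}   # refinement: allowed before a keyword

# Maximal runs of capitalized words, optionally joined by "of", "at", "the".
run = re.compile(r"[A-Z][a-z]+(?:\s+(?:of|at|the|[A-Z][a-z]+))*")

def institutions(text):
    for m in run.finditer(text):
        words = m.group(0).split()
        if (words[0] in KEYWORDS or words[-1] in KEYWORDS
                or (words[0] in PRE and len(words) > 1 and words[1] in KEYWORDS)):
            yield m.group(0)

text = ("She left the State University of New York and visited the College "
        "of the Ozarks, then drove through College Station, TX.")
# Finds both institutions, plus the false positive "College Station" --
# exactly the kind of case the checking steps above are meant to surface.
print(list(institutions(text)))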
Eventually this process should produce a pretty good set of rules, although in natural
language the rules will often have to include lists of idiosyncratic cases: "Harvard",
"Columbia", "McGill", "Brown" (with the difficulty of sorting out uses as surnames,
colors, place names, and the like).
Intuition-based methods tend to be far better when larger phenomena matter. For example,
the ever-popular "Sentiment" calculation is highly sensitive to negation, and natural
languages have a huge variety of ways to negate things. Besides the many ways overt
negatives like "not" can be used, there are antonyms, sarcasm, condemnation by faint
praise, and countless other techniques. Statistical methods are unlikely to work well
for negation, in part because "not" or other signals of negation may occur quite a
distance from what they modify; just having "not" in a sentence tells you little about
what specifically is being negated. "There's seldom a paucity of evidence that could
preclude not overlooking an ersatz case of negative polarity failing to be missing."
Intuitive approaches have the advantage of being more easily explained and more easily
enhanced when shortcomings arise. But building them requires human rather than machine
effort (especially difficult if one needs to support many languages).
Intuitive methods have the added cost and benefit that experts commonly refer to
high-level, abstract linguistic notions. In more realistic cases than the last example,
a linguist might want to create rules involving parts of speech, clause boundaries,
active vs. passive sentences, and so on. To do this requires a bit of recursion: how
do you detect *those* phenomena in the first place? That requires some tool that can
identify those constructs, and make them available as features for the next "level".
"Shallow parsing" is well understood and readily available (such as via [Gate] and [NLTK]), and can provide many of those features. "Shallow parsers" identify parts of speech
(using far more than the traditional 8 distinctions), as well as relatively small
components such as noun phrases, simple predicates, and so on, often using grammars
constructed from rules broadly similar to those in XML schemas. Shallow parsing draws
the line at roughly the clause level: subordinate clauses as in "She went to the bank that advertised
most" are very complex in the general case, and attaching them correctly to the major
clause they occur in is even more so.
The results of shallow parsing are usually strongly hierarchical, since they exclude
many of the complex long-distance phenomena in language (such as relationships between
non-adjacent nouns, pronouns, clauses, etc.). Because of this, XML is commonly used
to represent natural-language parser output. However, this is separate from potential
uses of text analytics on general XML. An example of the structures produced by a
shallow parser:
<s>
<nps>
<at base="the">The</at>
<nn base="course">course</nn>
</nps>
<pp>
<in base="of">of</in>
<np>
<jj base="true">true</jj>
<nn base="love">love</nn>
</np>
</pp>
<pred>
<rb base="never">never</rb>
<vb base="do">did</vb>
<vb base="run">run</vb>
</pred>
<jj base="smooth">smooth</jj>
<dl base=".">.</dl>
</s>
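For comparison, the following sketch produces a similar (if coarser) shallow parse with
the NLTK toolkit cited above; the three-rule chunk grammar is illustrative, not a production
rule set.

# Requires: pip install nltk, plus the tokenizer and tagger models via
# nltk.download() on first use (exact resource names vary by NLTK version).
import nltk

sentence = "The course of true love never did run smooth."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # Penn Treebank POS tags

# A tiny chunk grammar: noun phrases, prepositional phrases, simple predicates.
grammar = r"""
  NP:   {<DT>?<JJ>*<NN.*>+}
  PP:   {<IN><NP>}
  PRED: {<RB>?<VB.*>+}
"""
print(nltk.RegexpParser(grammar).parse(tagged))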
Many current text analytics systems use a lexicon, part of speech tagging, and shallow
parsing to extract such linguistic structures. While far from a complete analysis
of linguistic structure, these features permit much more sophisticated evaluation
of "what's going on" than strictly word-level features such as described earlier.
For example, knowing whether a topic showed up as a subject vs. an object or a mere
modifier is invaluable for estimating its importance. Knowing whether a given action
is past or future, part of a question, or subject to connectives such as "if" or "but"
(not to mention negation!) also has obvious value. Having a handle on sentence structure
also helps reveal what is said about a given topic, when a topic is referred to indirectly (by pronouns, generic labels
like "the actor", etc.).
Accuracy?
Users of text analytics systems always ask "how accurate is it?" Unlike XML validation,
this is not a simple yes/no question, and so using TA in creating or checking markup
is a probabilistic matter. As it turns out, even a "percent correct" measure is often
a misleading oversimplification. In the interest of brevity, I'll give a few examples
of the problems with a unitary notion of "accuracy", using Sentiment as an example
(since it is perhaps the most common TA measure in practice):
-
Let's say a system categorizes texts as having "Positive" or "Negative" sentiment
(leaving aside the precise definition of "Sentiment"), and gets the "right" answer
for 70% of the test documents. The first key question is how the desired answer came
to be considered "right" in the first place. Normally, accuracy is measured against
human ratings on a set of texts. Yet if one asks several people to rate a particular
text, they only agree with each other about 70% of the time. If texts only have one rater, 30% of the texts probably have debatable ratings.
If one throws out all the cases where people disagree, that unfairly inflates the
computer's score because all the hard/borderline cases are gone. Treating the humans'
ratings like votes, if the computer agrees with the consensus of 3 humans 70% of the
time, is it 70% accurate, or 100% as good as a human? If the algorithm does even better,
say 80%, what does that even mean? In considering applications to XML, it would be
interesting to know how closely human readers agree about the matters TA might be
called on to evaluate.
-
In practice, Sentiment requires a third category: "Neutral". A corpus that is 80%
neutral, 5% negative, and 15% positive is typical. That means a program can beat the
last example merely by always returning "Neutral": that's 80% accurate, right? This
illustrates the crucial distinction of precision versus recall: this strategy perfectly recalls (finds) 100% of the neutral cases; but it's not
very precise: 1/5 of the texts it calls "Neutral" are wrong. In addition, it has 0%
recall for positives and negatives (which are much more important to find for most
purposes).
-
At some level Sentiment is typically calculated as a real number, not just three categories;
say, ranging from -1.0 to +1.0. How close to 0.0 counts as Neutral? That decision
is trivial to adjust, and may make a huge difference in the precision and recall (and
note that the Neutral zone doesn't have to be symmetrical around 0).
-
For systems that generate quantities instead of just categories, should a near miss
be considered the same as a totally wild result? Depending on the user's goals and
capabilities, a close answer might (or might not) still be useful. Using TA methods
to evaluate XML document portions is a likely case of this: Although most abstracts
are likely to be broadly positive, only severe negativity would likely warrant a warning;
similarly, so long as an abstract is reasonably stylistically "abstract-ish", it's
probably ok (but not if it looks "bibliography-ish").
-
Many documents longer than a tweet have a mixture of positive, neutral, and negative
sentiments expressed toward their topics. Does a long document that's scrupulously
neutral deserve the same sentiment rating as one that describes two very strong opposing
views? They may average out the same, but that's not very revealing.
-
A subtler problem is that rating a document's sentiment toward a given topic depends
on rightly identifying the topic. What if the topic isn't identified right? Is "Ms.
Gaga" the same as "Lady Gaga"? Is the topic economics, the national debt, or fiscal
policy? Some systems avoid this problem by reporting only an "overall" sentiment instead
of sentiment towards each specific topic, but that greatly exacerbates the previous
problem.
-
Users' goals may dictate very different notions of what counts as positive and negative.
For market analysts or reporters, a big change in a company's stock price may be "just
data": good or bad for the company, but just a fact to the analyst. Political debate
obviously has similar issues.
Understanding those general methods and caveats, it's fair to generalize that text
analysis systems typically detect features with 70-80% accuracy, although some features
are far easier than others. Language identification is far more accurate; sarcasm
detection, far less. This means that such systems work best when there are many texts
to process -- the errors will generally come out in the statistical wash.
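To make the precision/recall distinction above concrete, here is a small sketch using
invented counts for the always-"Neutral" strategy on a corpus with the proportions given
earlier.

# Invented counts matching the example proportions: 1000 documents,
# 800 neutral, 150 positive, 50 negative; the system labels everything "Neutral".
gold = ["Neutral"] * 800 + ["Positive"] * 150 + ["Negative"] * 50
predicted = ["Neutral"] * 1000

def precision_recall(label, gold, predicted):
    true_pos = sum(g == label and p == label for g, p in zip(gold, predicted))
    found = sum(p == label for p in predicted)
    actual = sum(g == label for g in gold)
    precision = true_pos / found if found else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

for label in ("Neutral", "Positive", "Negative"):
    print(label, precision_recall(label, gold, predicted))
# "Accuracy" is 80%, yet recall for the categories that matter most is zero.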
Text Analytics' Relation to Markup
Markup has multiple purposes; among them are
-
Disambiguating structure (e.g., famous OED italics)
-
Controlling layout and other processing
-
Identifying things to search on
Markup makes aspects of document structure explicit. In principle, any phenomenon
that text analytics can identify can then be marked up, to a corresponding level
of accuracy. Exactly the same analytics can be used in checking: If a text is already
marked up for feature X, we need only run an auto-tagger for X and compare. This simultaneously
gives feedback on the text-analytic output's accuracy, and the prior markup's.
When two sources of data are available like this, they can be used to check each other.
In addition, the degree of overlap in what is "found" by each source, enables estimating
the number of cases not found by either. A simple statistic called the "Lincoln Index",
originating in species population surveys, provides this estimate. In the same way,
text analytics can be used to do XML markup de novo, or as a direct quality check
on existing markup.
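A minimal sketch of the Lincoln Index estimate; the counts of what each source found,
and what they found in common, are invented for illustration.

def lincoln_index(found_by_markup, found_by_ta, found_by_both):
    """Estimate the true number of instances from two independent 'surveys',
    using the classic Lincoln-Petersen estimator: N ~= (n1 * n2) / overlap."""
    return found_by_markup * found_by_ta / found_by_both

# Invented example: the markup tags 90 quotations, the text-analytics pass
# finds 80, and 72 instances appear in both.
estimate = lincoln_index(90, 80, 72)
print(round(estimate))        # about 100 quotations estimated overall
print(round(estimate) - 90)   # so roughly 10 likely missed by the markup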
Such comparative analysis may be one of the most useful applications of TA to XML
evaluation. In a text project where markup is not straightforward, how can one evaluate
how well the taggers are doing? Say a literary project is marking up subjective features
such as allusions, or sometimes-unclear features such as who the speaker is for each
beat in dialog. TA methods can learn how to characterize such features, and then be
run against the human-tagged texts. Disagreement can trigger a re-check, thus saving
time versus checking everything.
There seems to be an implicit "sweet spot" for markup use. We don't mark up sufficiently
obvious phenomena, such as word boundaries (except in special cases). Given that almost
every kind of processing needs to know where they are, why not? Probably because finding
word boundaries seems trivial in English. Yet word boundaries are also unlikely to be marked up in Asian languages, where identifying
them is far from trivial. Thus, simplicity can't be the whole story. Perhaps it is
that consciously or not we assume that most any downstream software will do this by
itself, so there would be no return on even a small investment in explicit markup.
Marking language use has a better ROI, for example enabling language-specific spell-checking
or indexing. Downstream software is perhaps less likely to "just handle it." Nevertheless,
it is not very common to see xml:lang
more than once in a document except in special cases such as bilingual dictionaries,
diglot literature editions, and the like.
TA systems can certainly add (or check) word-boundary and language-use markup, and
the most common related attributes, such as part of speech and lemma. Such markup
is perhaps of limited value except in special applications, such as text projects
in Classical languages or that contend with paleographic issues.
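As a sketch of what such a check might look like, the following walks an XML tree and
compares each declared xml:lang against a detected language. Here identify_language()
is a hypothetical hook -- it might wrap the trigram scorer sketched earlier or an
off-the-shelf language identifier -- and the 40-character threshold is arbitrary.

import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def identify_language(text):
    """Hypothetical hook: return an ISO language code for the given text."""
    raise NotImplementedError

def check_xml_lang(root):
    """Yield elements whose declared xml:lang disagrees with the detected language."""
    for elem in root.iter():
        declared = elem.get(XML_LANG)
        text = "".join(elem.itertext()).strip()
        if declared and len(text) > 40:        # skip tiny fragments
            detected = identify_language(text)
            if detected != declared.split("-")[0].lower():
                yield elem.tag, declared, detected

# Usage: for tag, declared, detected in check_xml_lang(ET.parse("doc.xml").getroot()): ...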
Marking up small-scope, very specific semantics such as emphasis and the role of particular
nouns is a traditionally awkward matter in markup. Some schemas merely provide tags
for italic, bold, and the like; using less font-oriented tags such as <emph> is considered
a step up, but often accomplishes little more than moving "italic" from element to
attribute. If more meaningful small-scale elements are not available, conventions
such as RDF[RDF] and microformats[micro] make it feasible to represent the results of text analytics or even linguistic parsing
in ways accessible to XML systems.
DocBook[docb] provides many more meaningful items at this level: guibutton, keycap,
menuchoice, etc. As with other fairly concrete features already described, and as an anonymous
referee pointed out, text analytics could be used to tag (or check) many such distinctions
automatically: "ENTER" is going to be a keycap, not a guiItem; in other cases nearby
text such as "press the ___ key" can help, as can context
such as being in a section titled "List of commands". This seems entirely tractable
for analytics, and could have significant value because such markup is valuable for
downstream processing (particularly search), but tedious for humans to create or
check, and therefore error-prone.
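A hedged sketch of one such rule: a regular expression that wraps the blank in "press
the ___ key" contexts in keycap tags. A real implementation would need to respect existing
markup and handle many more phrasings.

import re

# Deliberately simplistic: ignores existing tags and other ways of naming keys.
pattern = re.compile(r"(?i)(press(?:ing)?\s+the\s+)([A-Z0-9]+)(\s+key)")

def tag_keycaps(text):
    return pattern.sub(r"\1<keycap>\2</keycap>\3", text)

print(tag_keycaps("To confirm, press the ENTER key; then press the F5 key."))
# => To confirm, press the <keycap>ENTER</keycap> key; then press the <keycap>F5</keycap> key.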
However, even this level of markup can get subtle. In Making Hypermedia Work[DeRo94] David Durand and I decided to distinguish SGML element names from HyTime
architectural form names in the text, because the distinction is crucial but, at least
to the new user, subtle. We also decided, in many examples, to name elements the same
as the architectural form they represented. In most cases there was only one element
of a given form under discussion; and because elements are concrete while forms are
abstract, one cannot easily reify the latter without the former. In most cases this
was trivial; but in a few cases the decision seemed impossible. Examining those cases
via TA methods would likely reveal much about the distinction we were trying to achieve,
as well as no doubt reveal markup errors.
Bibliography entries are notoriously troublesome, whether in XML or any other representation.
They have many sub-parts, with complex rules for which parts can co-occur; the order
(not to mention formatting) of the parts varies considerably from one publisher to
another; and there are special cases that are difficult given the usual rules. PubMed Central
receives an extraordinary number of XML articles, often in a standard XML Schema[NCBI];
but usage varies significantly even in valid data, and the code to manage bibliographic
entries in the face of such variability is substantial. Many publishers opt for the
"free" style, in which most schema constraints are relaxed, and recovering the meaningful
parts of entries is a task worthy of AI.
At a higher or at least larger level, many schemas are heavy on tags for idiosyncratic
components with much linguistic content, which also have distinctive textual features:
Bibliography, Abstract, Preface, etc. For example, a Preface will likely use much
future tense, while a Prior Work section will use past. Text analytics can find and
quantify such patterns, and then find and report outliers, which might show up due
to tagging errors, or to writing which, whether through error or wisdom, does not
fit the usual patterns.
This provides a particularly promising area for applying text analytics. Although
the titles of such sections differ, and there may or may not be distinct tags for each
kind, a text analytics system could learn the distinctive form of such components,
and then evaluate how consistent the tagging and/or content of corresponding sections
in other documents are.
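One such measurable pattern, sketched with NLTK below, is the proportion of past-tense
verbs versus "will"/"shall" futures in a section's text; the resulting profiles could then
be compared across sections tagged Preface, Prior Work, and so on. The example sentences
and the crude counting are illustrative only.

# Requires the same NLTK resources as the earlier shallow-parsing sketch.
import nltk

def tense_profile(text):
    """Rough proportions of past-tense verbs and "will"/"shall" futures."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    verbs = [w for w, t in tagged if t.startswith("VB") or t == "MD"]
    past = sum(t in ("VBD", "VBN") for w, t in tagged)
    future = sum(w.lower() in ("will", "shall") for w, t in tagged if t == "MD")
    total = max(len(verbs), 1)
    return past / total, future / total

# A Preface should lean toward the future, a Prior Work section toward the past;
# sections far from their expected profile are worth a second look.
print(tense_profile("This paper will describe the method and will present results."))
print(tense_profile("Earlier work described the method and presented partial results."))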
We usually mark up things that are necessary for layout; the ROI is often obvious
and quick. But it takes a lot of dedication, sophistication, and money to, say, disambiguate
the many distinct uses of italics, bold, and other typography in the Oxford English
Dictionary[Tom91], or to characterize the implicitly-expected style for major components
such as those in front and back matter. Much of the implicit information described earlier
can be detected using text analytic methods, but using this to assist the markup process
has been little explored.
How shall we find the concord of this discord?
If you can find it reliably via some algorithm, why mark it up? In a sense, creating
markup via algorithms is kind of like the old saw about Artificial Intelligence: "as
soon as you know how to do it, it's not AI anymore." If text analytics (or any other
technology) could completely reliably detect some feature we used to mark up, we might
stop marking it up. But in reality, neither humans nor algorithms are entirely reliable
for marking up items of interest. The probability that the two will err in quite different
ways means there is synergy to be had.
Anyone who has tried to write a program to turn punctuated quotes into marked-up quote
elements has discovered that there are many difficult cases, at a variety of levels:
choice of character set and encoding, international differences in style, nested and/or
continued quotations in dialog, alternate uses of apostrophe and double quote, and
even quotations whose scope crosses tree boundaries[see DeRo04]. Would we bother marking quotations if the punctuation were unambiguous, or if we
had widespread text-analytics solutions that could always identify quotations for
formatting, search, and other common needs?
Typical XML schemas define some very common specific items: Bibliography, Table of
Contents, Preface; and some common generic items: Chapter, Appendix, etc. But (perhaps
pragmatically) we don't enumerate the many score front and back matter sections listed
in the Manual of Style, or the additional ones that show up in countless special cases
-- at some point we just say "front-matter section, type=NAME" and quit. Worse, we
sometimes cannot choose the "correct" markup: whether we are the original author or
a later redactor, we may simply be unable to say whether to mark up "Elizabeth" as
<paramour> or <tourist> in "Elizabeth went to Essex. She had always liked Essex."[TEI P3]
The short response to these issues, I think, is that markup is always a tradeoff;
there are levels we make explicit, and levels we don't. Perhaps it cannot be otherwise.
Intuitively, it seems that at least for authors many of the choices should always
be clear; and to that extent text analytics can also find many of these phenomena.
So why does the principle not work, that an author knows when component X is needed,
and so should have an easier time just naming X, than carrying out commands to achieve
a look that others will (hopefully) process back to X?[Coom87]
I think it is because people's interaction with language (not via language) is largely unconscious. We rarely think "the next preposition is important,
so I'm going to say it louder and slower"; we don't think "I've said everything that
relates to that topic sentence, so it's time for a new paragraph"; nor even "'Dr'
is an abbreviation, so I need a period". Expertise has been defined as the reduction
of more and more actions to unconsciousness -- that's how we (almost always) walk
and/or chew gum. Our understanding of language is often similarly tacit. As Dreyfus
and Dreyfus put it[Drey05, p. 788]: "If one asks an expert for the rules he or she is using, one will, in effect, force
the expert to regress to the level of a beginner and state the rules learned in school.
Thus, instead of using rules he or she no longer remembers, as the knowledge engineers
suppose, the expert is forced to remember rules he or she no longer uses."
The act of markup, whether automated or manual, seems similar: we know a paragraph
(or emphasis, or lacuna) when we see it, just as we know an obstacle on the sidewalk
when we see it; but neither often makes it to consciousness.
Text analytics and markup are very similar tasks, though they tend to identify different
things; it is rare for (say) a literary text project to mark up sentiment in novels,
while it is equally rare for text analytics to identify emphasis (although emphasis
might contribute to other features, such as topic weight).
Perhaps the most obvious place to start, beyond simple things like language-identification,
is checking whether existing markup "makes sense", at a higher level of abstraction
than XML schema languages -- a level closely involving the language rather than the
text. The usual XML schema languages do little or nothing with text-content; with
DTDs one can strip out *all* text content and the validity status of a document cannot
change. With a text analytics system in place, however, it is possible to run tests
related to the actual meaning of content. For example:
-
After finding the topics of each paragraph, one can estimate the cohesiveness of sections
and chapters, as well as check that section titles at least mention the most relevant
topics. Comparison between the topics found in an abstract, and the topics found in
the remainder of an article, could be quite revealing.
-
The style for a technical journal might require a Conclusions section (which might
want to be very topically similar to the abstract), and a future work section that
should be written in the future tense. Similarly, a Methods section should probably
come out low on subjectivity. In fiction, measuring stylistic features of each character's
speech could reveal either mis-attributed speeches, or inconsistencies in how a given
character is presented.
-
The distribution of specific topics can also be valuable: Perhaps a definition should
accompany the *first* use of a given topic -- this is relatively easy to check, and
a good text analytic system will not often be fooled by a prior reference not being
identical (for example, plural vs. singular references), or by similar phrases that
don't actually involve the same topic.
-
One important task in text analytics is identification of "Named Entities": Is this
name a person, organization, location, etc? Many XML schemas are rich in elements
whose content should be certain kinds of named entities: <author>, <editor>, <copyright-holder>,
<person>, <place>, and many more. These can easily be checked by many TA systems.
Since TA systems typically use large catalogs of entities, marked-up texts can also
contribute to TA systems as entity sources.
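As a sketch of the last of these checks, the following runs NLTK's named-entity chunker
over the content of person-like elements and flags any that do not appear to contain a
personal name. The element names listed, and the whole approach, are illustrative; a real
check would follow the project's own schema.

# Requires NLTK's tokenizer, tagger, and named-entity chunker models
# (downloadable via nltk.download(); exact resource names vary by version).
import nltk
import xml.etree.ElementTree as ET

PERSON_ELEMENTS = {"author", "editor", "person", "copyright-holder"}

def looks_like_person(text):
    """True if NLTK's NE chunker finds a PERSON entity in the text."""
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    return any(getattr(node, "label", lambda: None)() == "PERSON"
               for node in tree)

def suspicious_person_elements(root):
    """Yield person-like elements whose content does not look like a personal name."""
    for elem in root.iter():
        if elem.tag in PERSON_ELEMENTS:
            text = "".join(elem.itertext()).strip()
            if text and not looks_like_person(text):
                yield elem.tag, text

# Usage: list(suspicious_person_elements(ET.parse("article.xml").getroot()))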
Text analytics is strongest at identifying abstract/conceptual features, when those
features are not easily characterized by specific words or phrases, but emerge from
larger linguistic context. The most blatant example is the perennial problem with
non-language-aware search engines: negation. There are many ways to invert the sense
(or sentiment) of a text, some simple but many extremely subtle or complex. Tools
that do not analyze syntax, clause roles, and the like can't distinguish texts that
mean nearly the opposite of each other. Thus, at all levels from initial composition
and markup, through validation and production, to search and retrieval, text analytics
can enable new kinds of processes. Perhaps as such technology becomes widespread and
is integrated into our daily workflows, it may help us to reach more deeply into the
content of our texts.
Little has been published on the use of text analytics in direct relation to markup,
although text analytics tools often use XML extensively, particularly for the representation
of their results. However, TA has the potential to contribute significantly to our
ability to validate exactly those aspects of documents that markup does not help
with: namely, what's going on down in the leaves.
References
[Abr63] Norman Abramson. 1963. Information Theory and Coding. New York: McGraw-Hill.
[Coom87] James H. Coombs, Allen H. Renear, and Steven J. DeRose. 1987. "Markup systems and
the future of scholarly text processing." Communications of the ACM 30, 11 (November
1987), 933-947. doi:https://doi.org/10.1145/32206.32209.
[DeRo04] Steven DeRose. 2004. "Markup Overlap: A Review and a Horse." Extreme Markup Languages
2004, Montréal, Québec, August 2-6, 2004. http://xml.coverpages.org/DeRoseEML2004.pdf
[DeRo94] Steven DeRose and David Durand. 1994. "Making Hypermedia Work: A User's Guide to HyTime."
Boston: Kluwer Academic Publishers. doi:https://doi.org/10.1007/978-1-4615-2754-1.
[Drey05] Hubert L. Dreyfus and Stuart E. Dreyfus. 2005. "Peripheral Vision: Expertise in Real
World Contexts." Organization Studies 26(5): 779-792. doi:https://doi.org/10.1177/0170840605053102.
[Gate] Gate: General Architecture for Text Engineering http://gate.ac.uk
[het1] "Heteronym Homepage" http://www-personal.umich.edu/~cellis/heteronym.html
[het2] "The Heteronym Page" http://jonv.flystrip.com/heteronym/heteronym.htm
[Kla96] Judith Klavans and Philip Resnik. The Balancing Act: Combining Symbolic and Statistical
Approaches to Language. MIT Press 1996. 978-0-262-61122-0.
[micro] Microformats (home page) http://microformats.org
[Mont73] Richard Montague. 1973. "The Proper Treatment of Quantification in Ordinary English".
In: Jaakko Hintikka, Julius Moravcsik, Patrick Suppes (eds.): Approaches to Natural
Language. Dordrecht: 221–242. doi:https://doi.org/10.1007/978-94-010-2506-5_10.
[NCBI] National Center for Biotechnology Information, National Library of Medicine, National
Institutes for Health. "Journal Publishing Tag Set".
[NLTK] NLTK 2.0 documentation: The Natural Language Toolkit. http://www.nltk.org
[docb] OASIS Docbook TC. http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=docbook
[oa] OpenAmplify. http://www.openamplify.com
[RDF] Resource Description Format. http://www.w3.org/RDF/
[Stede] Manfred Stede and Arhit Suriyawongkul. "Identifying Logical Structure and Content
Structure in Loosely-Structured Documents." In Linguistic Modeling of Information
and Markup Languages: Contributions to Language Technology. Andreas Witt and Dieter
Metzing, eds., pp. 81-96.
[TEI P3] TEI Guidelines for the Encoding of Machine-Readable Texts. Edition P5.
[Tom91] Frank Wm. Tompa and Darrell R. Raymond. 1991. "Database Design for a Dynamic Dictionary."
In Susan Hockey and Nancy Ide (eds.), Research in Humanities Computing: Selected Papers
from the ALLC/ACH Conference, Toronto.
[WEKA] Machine Learning Group at University of Waikato. "Weka 3: Data Mining Software in
Java." http://www.cs.waikato.ac.nz/ml/weka/.