How to cite this paper
DeRose, Steven J. “Can LLMs help with XML?” Presented at Balisage: The Markup Conference 2024, Washington, DC, July 29 - August 2, 2024. In Proceedings of Balisage: The Markup Conference 2024. Balisage Series on Markup Technologies, vol. 29 (2024). https://doi.org/10.4242/BalisageVol29.DeRose01.
Balisage: The Markup Conference 2024
July 29 - August 2, 2024
Balisage Paper: Can LLMs help with XML?
Steven J. DeRose
Steve DeRose has been working with electronic document and hypertext systems
since 1979. He holds degrees in Computer Science and in Linguistics and a Ph.D.
in Computational Linguistics from Brown University.
He co-founded Electronic Book Technologies in 1989 to build the first SGML
browser and retrieval system, DynaText,
and has been deeply
involved in document standards including XML, TEI, HyTime, HTML 4, XPath,
XPointer, EAD, Open eBook, OSIS, NLM, and others. He has served as adjunct
faculty at Brown and Calvin Universities. He has written many papers, two books,
and fifteen patents. Most recently he has been working as a consultant in text
analytics.
Copyright 2024 Steven J. DeRose. May be copied per the Creative Commons Attribution-Sharealike
license.
Abstract
LLMs have become popular for a huge range of tasks. They train over vast
collections of text yet commonly discard markup of all kinds. Historically, many AI
and NLP tools stripped out not only explicit markup but even punctuation, case
distinctions, or inflection. Recent work is less drastic, but still commonly
discards even simple notions like paragraphing, headings, block quotes, and notes.
Likewise, few systems create output with any useful markup beyond paragraph breaks
(if that). The conventions of written language leverage all these things, and it
seems odd that LLMs should waste the opportunity of using them.
“Cleaning” to avoid inappropriate text is also ubiquitous and also
leads to unintended consequences. Exclusion granularity is coarse — a work may be
discarded because it (perhaps critically) includes a rude quote. Just like markup
“cleaning,” text selection may get minimal and non-expert
attention. Censorship is often done by word-list or, worse,
string-list, and even the best-known lists seem almost
random.
Underlying all these issues is the substitution of deceptively simple proxies for
more complex and subtle goals. This is understandable especially at scale; but it
has many side-effects. Such cleaning may seem to work fine, but we have little idea
of precisely where, how often, and in how biased a way it goes wrong. Figuring that
out post facto is hard, in part because there is little transparency about these
decisions and processes.
How can markup (broadly construed) and LLMs work better together? Why and how can
LLMs make use of structure, both going in and coming out? Can LLMs help with
structure-related questions and tasks? How can users construct effective prompts in
this area? And finally, how does the semantic orientation of markup relate to LLM’s
probabilistic rather than semantic treatment of language? This paper begins to
explore such questions.
Table of Contents
- Introduction
  - Anthropic and Claude
  - Some training issues
  - Claude’s Constitution
  - Information and structure loss in training data
- Structure detection via LLM
  - Challenges specific to long texts
  - Automatic markup as such
  - Grounding vs. hallucination
- How LLMs work
  - What LLMs can and cannot do
- Practical tests of LLMs with markup
  - Part of speech tagging
  - Sentence and quotation
  - Named entities
  - More on quotations
  - Quote identification
  - Bibliography entries
  - MarkDown to HTML
  - Transliteration
  - Dictionary entry markup
  - Spelling correction
  - Understanding ThML/DC header data
  - Claude on Claude
- Conclusions
Introduction
Large language models (LLMs) have become enormously popular for a
variety of tasks, many involving text. Even those that generate graphics often do so
based on text prompts.
But there is a long tradition in the AI and NLP
world of throwing away markup from the texts (whether descriptive, procedural, layout,
or even punctuation and case). A common NLP format insists on one sentence per line,
regardless; another on one word per line — language is not an array. These things might be done to simplify
or in hope of saving
space or programmer time, or may have other rationales. The consequent costs are real
but may be less obvious, longer term, and diffuse. As LLMs train on more and more
text
it becomes even more tempting to reduce everything to a least common denominator by
discarding anything that requires effort seemingly not central to the task. Yet markup
of all kinds (punctuation, whitespace, layout, tags, etc.) carries much useful
information.
The same issues arise with training text selection. In the urgent interests of safety
and non-bias it is common to censor training data. Doing that without unintended
side-effects is really hard, yet censorship is commonly done very simplistically.
Often it is based on mere word (or worse, substring) lists; or for images, presence of too
many “fleshtone” pixels (a definition itself prone to bias). Or bias may be
“avoided” by simply not scraping sites the programmer or their company
doesn’t like. Obscenity, inappropriateness, and bias are not properties of pixels,
strings, or URLs. There are indeed correlations, but such mechanisms are inaccurate
proxies for the real targets: they are not the actual targets.
While LLMs do many amazing things, they are commonly handed only opaquely-selected
texts, in versions lacking much of their structure and information. Questionable
text-handling is widespread at multiple levels. It has consequences. As [Sharoff 2015] notes in reviewing Schäfer and Bildhauer’s book on the
construction of corpora via the Web:
… methods for basic cleaning of the corpus content, such as processing of text
formats (primarily HTML tags), language identification, boilerplate removal, and
deduplication. Such low-level tasks are not considered to be glamorous from the view
of computational linguistics, but they are extremely important for making
Web-derived corpora usable [Baroni 2009].
Historical practices for filtering, cleaning, and simple
tasks like tokenizing raise questions. LLMs do know something about
markup — but just how much? Can they recognize it in prompt texts? Can they recognize
layout conventions? Can they notice phenomena in text and mark them up on request?
This
paper reports several experiments along these lines, mainly using the LLMs from
Anthropic. The associated public interface (known as Claude
— https://claude.ai) recently
passed ChatGPT on at least one AI leader board [Edwards 2024]. Most of the
examples were created with Claude 3.0 before the recent upgrade to 3.5 and support
for
persistent artifacts
(such as code or XML samples). I’ve tried many of
the examples in Claude 3.5 and/or ChatGPT 3.0 as well, and note some differences below.
The field is moving quickly and results will likely improve even by the time this
is
read.
Despite whatever “cleaning” may be done on training data, at least
information about markup and structure evidently
remains. LLMs can often answer questions about markup languages per se and even about
syntax and usage. For example, I asked Anthropic’s Claude 3.5, “What’s wrong with
the following”:
<?xml version="0.9" encoding="UTF-9"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V27.5//EN">
<book xmlnamespace:="http://example.com/namespaces/foo">
<bookinfo><title>A Midsummer Night's Dream</title></bookinfo>
<chapter id="act1" clas=“important”><title>Act I</foo:title>
<section id="act1"><title>Scene I</title>
<p>Now, <i>fair <b>Hippolyta</i>,</b> our nuptial hour Draws on apace;
four happy days bring in Another moon: but, O, methinks, how slow This old
<moon>moon</moon> wanes! ...</p></section>
</chapter>
</book>
Claude did pretty well, noting:
-
The version number (V27.5) is not a valid DocBook version. It should be
something like V4.5 or V5.0.
-
The namespace should use xmlns:prefix syntax.
-
The class
attribute is misspelled.
-
Correct: Either remove the namespace prefix or properly declare and use it
….
-
Duplicate id attribute
-
Improper nesting of italic and bold tags
-
Custom <moon> tag: it should be properly declared in a schema or
DTD.
-
&rsquo; is an HTML entity, not a standard XML entity … use
&#8217; or declare the entity in the DTD.
However, it did miss a few things, such as the XML version and DocBook’s lack of p,
i, b, and a section@class attribute. It may be that Claude was confused because those are
HTML constructs that it has seen countless times (whether it kept markup in scraped
training data or just read about HTML). As discussed below, LLMs do not know particular
facts such as the rules of a schema. They operate from probabilities, and class attributes
are highly probable between pointy brackets. To its credit, when pressed about
tags or attributes that “just don’t belong in DocBook at all,” Claude
did catch and fix some additional issues.
A subtler miss is the curly quotes for the class attribute. Perhaps quote
normalization happened in training, so Claude never got to learn that rule. Or perhaps
Claude did see the distinction but saw so many more quotes in general that the
XML-specific rule vanished in the noise. That would reflect a point we will come back
to: LLMs learn by frequency of events, not by propositions or facts. When asked to
explain why it missed the quotes, Claude suggested “I sometimes have limitations in
perceiving or distinguishing subtle typographical differences,”
“I tend to focus more on the semantic content and overall structure of the markup
rather than specific syntactical details,”
and “I might have unconsciously ‘corrected’ the curly quotes in my
interpretation,” and noted its lack of a specific check for
this. But just like the errors, these explanations are a consequence of frequencies,
not facts.
Anthropic and Claude
Claude’s 3.0 release made an excellent first impression when I asked it to
modernize the English of a Victorian-era translation of Philo of Alexandria. That
task is a topic for another day, but while dumbing-down the XML documents to feed
to
the LLM I decided to try asking it not only to update the idiom, but to do some
markup-specific tasks: first to preserve things like section numbers, MarkDown-style
italics, etc.; and later to add markup — some from scratch, as for sentences, some by
mapping, as for quotations.
On the whole Claude did pretty well at these tasks, although it does make mistakes
or lose track of one task while working on another. LLMs also produce different
results each time even given the very same prompt.
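Such experiments can also be scripted rather than run through the interactive interface. A minimal sketch using Anthropic’s Python SDK (the model name, prompt wording, and the passage variable are illustrative placeholders, not what was actually used):

import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

passage = "…"   # one chunk of the dumbed-down text to be processed

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",   # placeholder; substitute whatever model is current
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": "Modernize the English of the following passage. Preserve section "
                   "numbers and MarkDown-style *italics* exactly as given.\n\n" + passage,
    }],
)
print(response.content[0].text)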
Some training issues
LLMs scrape a lot of text for training. Training data selection and
cleaning
have strong (but hard to quantify) effects. Selection
criteria are not always transparent, and even when they are, reviewing them or
analyzing the effects is a gargantuan task. A number of confounding factors show
up:
-
Trying to avoid offensive materials can lead to naïve and harmful choices,
such as when AOL infamously censored posts via a list of words [Chicago Tribune 1995], shutting down cancer support groups,
chicken recipes, and much more. Claude at first rejected a passage from
Philo’s Creation (XLV, 133) that draws an analogy
between breasts and the abundance of nature.
-
Avoiding misinformation
and inappropriate
data is a serious problem, but even with better methods than word-lists,
ultimately someone has to decide just what counts. Such decisions merely
seem simple, or not requiring expertise. They are
sometimes made lightly, seemingly with little awareness that subtleties
exist and matter. The long history of motivated, barely tested, and
undocumented filtering also suggests models contain a great deal of
popularity bias — Quis custodiet ipsos custodes? Claude takes an interesting
approach to the safety issues (see next section).
-
Training on OCR errors: Google Ngrams long reported high frequencies for
words such as “aaaociated” (presumably from OCR on Google Books mistaking “a” and “s”), though
it has improved. LLMs should see typos in training so they can learn to
handle them in general, and misspellings likely have contexts very similar
to their correct forms, so should cluster nicely with them. The LLMs I’ve
tried cope quite well with my typing errors. However, dangers remain.
High-frequency errors may correlate with other things — most obviously the
OCR software used and the age and fonts of scanned texts. These in turn may
correlate with publishers and eras, and their own biases (not to mention
biases in who gets to write and publish in the first place).
-
Frequency of expression vs. truth: LLMs learn by repeated experience, not
quality of argument or support. Statistical methods in general are
vulnerable to mistaking loudness, popularity, or not being censored — for
truth. This fundamental fact opens attack vectors that are very hard to
reliably close, and leads to subtle biases in results. Even sources that
attempt to mitigate bias (such as Wikipedia) have well-known biases, though
heavily edited articles appear to gradually moderate [Greenstein 2018]. We will examine this further below.
Claude’s Constitution
Claude takes a novel approach to avoiding offensive or inappropriate responses via
a published Constitution
[Anthropic], a set of
principles based on the UN Declaration of Human Rights and other inputs. The principles are
formulated as maxims for the AI, and seem to be fed to it periodically behind the
scenes. For example:
Please choose the assistant response that is as harmless
and ethical as possible. Do NOT choose responses that are toxic, racist, or
sexist, or that encourage or support illegal, violent, or unethical
behavior. Above all the assistant’s response should be wise, peaceful, and
ethical.
The principles factor into both training and response generation. It is not
entirely clear to me why/how this works; one possibility is that certain turns of
phrase or topics or styles are found in training to correlate with ethical
criticisms (say, via reviews, retorts, community notes, etc.). Purely subjectively,
I have found Claude produces fewer annoying, obsequious, or evasive results than
some other LLMs, so I’m cautiously optimistic about this approach.
Information and structure loss in training data
As already mentioned, NLP systems often discard basic useful information. This
leads to problems at higher levels:
Case distinguishes entirely different words more
frequently than we tend to think (China, Mobile, Tangier, Chile, and countless
others). Although case treatment has improved, many character set issues are still
problematic, including servers asserting the wrong encoding for data and much
software decoding (and corrupting) data without checking. LLM training data has the
same problems. A reviewer noted that the Trésor de la langue française corpus
discarded case on the ground that no lexicographer would cite a text without
consulting the original; LLMs, alas, are not ideal lexicographers.
Lemmatization: Some NLP systems have aggressively
stripped prefixes and suffixes. This can bring together closely-related tokens, but
also brings together tokens that do not belong together (such as forms of sing and
singe).
Punctuation is much more complex than most NLP
tokenizers can handle. I have seen systems tag “and/or” as three
conjunctions rather than one; break URLs at all punctuation and then try to parse
them as sentences (for which, calling “/” a conjunction ironically
helps); take M*A*S*H as an equation; and be confused by many other simple
phenomena.
Font changes such as bold, italics, monospace,
and superscripts can matter a lot. Headings, lists, cross-references, and status as
poetry or code (often indicated by font and spacing changes) also correlate with
drastic changes in grammar. LLMs may learn other features that correlate with the
lost ones, mitigating the damage, but it seems better not to suffer the loss in the
first place.
Discarding such information poses problems both obvious and subtle. All these
phenomena affect downstream processing, from part of speech (PoS
) to
grammar to semantics.
Provenance and other metadata: Missing or
incorrect metadata makes it harder to verify correctness. [Koplenig 2017] discusses similar problems in Google Books (hardly the only offender). [Nunberg 2009] discusses how difficult it is to pin down actual dates of
composition vs. dates of printing or scanning:
To take Google’s word for it, 1899 was a literary annus
mirabilis, which saw the publication of Raymond Chandler’s
Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter
Fiction of Virginia Woolf, Raymond Williams’ Culture and Society 1780-1950, and Robert Shelton’s
biography of Bob Dylan, to name just a few. And while there may be particular
reasons why 1899 comes up so often, such misdatings are spread out across the
centuries ….
Non-content text: Google Ngrams used to report
words such as “digital” and “electronic” in old works,
because of boilerplate notices about the digitized edition’s status — even Google’s
own code didn’t account for this. A subtler but similar problem occurs in editions
with added headings, notes, etc., which may easily be misconstrued as part of an
author’s work. Page headers and footers in scanned texts are another infamous
case.
Selection issues: Attempts to mitigate bias or
inappropriate text may use very blunt instruments [Chicago Tribune 1995]. Recent and commonly-used lists such as badwords
(https://www.cs.cmu.edu/~biglou/resources/bad-words.txt) and Shutterstock’s
LDNOOBW
(aka
List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
) [Simonite 2021] remain astonishingly uneven, with no discernible rationale
for what is or isn’t included.
Simonite notes the LDNOOBW has even been used for Google AI: “Google
artificial intelligence researchers recently removed web pages containing any of
the words from a dataset used to train a powerful new system for making sense of
language.”
More aggressive censoring may reject on parts of words or obvious spelling
deformations. Of course this massively increases false positives. This is known as
the Scunthorpe problem
[Wikipedia] because the
eponymous town was censored (by AOL in 1996 and Google in the early 2000s) due to
a
substring of its name. If asked, Claude can describe the Scunthorpe problem, so
evidently was not so naïve as to unconditionally censor the string during training.
ChatGPT describes the town but says nothing of the censorship problem unless pressed
more specifically.
Less commonly noted is that naïve censoring can occur at many different processing
stages. It may seem tempting in order to keep bad words
entirely out
of language models (see [Bender 2021]), but simplistic attempts have
other consequences.
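A trivial sketch of why substring matching misfires (the block list here is only illustrative; real lists such as the LDNOOBW are far longer):

BAD_SUBSTRINGS = ["cunt", "tit"]   # illustrative only

def is_blocked(text):
    lowered = text.lower()
    return any(bad in lowered for bad in BAD_SUBSTRINGS)

for sample in ["Scunthorpe United", "the U.S. Constitution", "a titmouse at the feeder"]:
    print(sample, "->", "blocked" if is_blocked(sample) else "ok")
# All three samples are blocked, and all three are false positives.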
Structure detection via LLM
Tagging texts well is hard. Many phenomena are hard to identify, and this makes good
markup costly (in turn, that makes good markup less common). Automatic markup is thus
attractive. A few examples of small inline elements one might want to get an LLM to
find
and tag, and some of their harder aspects:
-
Quotations: Straight quotes make ends
ambiguous, and quotes may not balance neatly, for good or bad reasons.
-
Hyphens: As [Bauman 2016]
showed, distinguishing soft and hard hyphens, emdash, etc. is really
hard.
-
Lists: Lists are often straightforward,
except when markup and layout are lost. Then there may be no way to tell whether
a paragraph is part of a large list item or the resumption of non-list
content.
-
Inline presentation: Cross-references such as
see section N,
conventionally-punctuated units such as
sentences, run-in headings and labels, leading section and item numbers,
bracketed stage directions, and more.
-
Named entities (NERs): Person,
organization, creative work, place, and demographic group names are useful to
distinguish, along with their particular kinds. This is tedious without
automation.
-
Multiple languages: Many texts mix languages.
Some use non-Latin orthography or transliteration.
Tagging such things via LLMs could save a lot of work. I have had good results with
identifying alternate languages and inline phenomena such as section numbers and
cross-references. NERs work very well (NLP systems have handled them pretty well for
some time). Brief tests distinguishing soft, em, regular, and figure dashes, hard
list
cases, and transliteration seem harder, but may yield to persistent prompt
design.
LLMs are very good at pattern recognition and train on many instances of phenomena we
want marked up. They see a lot of related punctuation, footnote markers‡ and
superscripts (assuming their training data wasn’t overly “cleaned”),
spacing, and contextual features such as nearby grammar (such as “… the
following:” before lists). They also learn a lot about basic linguistic
categories such as parts of speech, named entities, and so on.
On the other hand, LLMs have little special insight for ambiguous cases. An oft-cited
example (see TEI P5, section 21.1, https://tei-c.org/release/doc/tei-p5-doc/ja/html/CE.html) is a problem for
Claude, too: identifying person and place names in “Elizabeth went to Essex. She
had always liked Essex.” Claude first assigned the perhaps likelier
categories; on further questioning, it acknowledged that they could be the other way
around.
Challenges specific to long texts
Like other LLMs, Claude may not work as well or consistently across a long text.
LLMs may wander off task or occasionally crash long into a paid task, so it is wise
to break documents into smaller parts. On the other hand, that means that for each
given part there is less to get the model into the right context and the parts may
be treated differently. For example, inserted markup might show shifting
conventions, and pronouns whose antecedents are in a different chunk may be
misconstrued. One chunk of Philo I had Claude modernize came back with “they”
at the beginning replaced by “Egypt” (which was not relevant at all).
This problem recalls extensive discussions of “chunking” in the
hypertext world. [Bernstein 2010] and [Marshall 1989]
detail the debates over breaking information into small pieces (advocated by
“card sharks”) vs. supporting large units and precise linking (for
“holy scrollers”). This conflict can still be seen all over the Web
despite size and bandwidth issues having almost vanished. Indeed, as of 2022 median
page size (including figures, CSS, etc.) had risen to 2.2MB [HTTP 2022, Chapter IV.21], making the text
a tiny
fraction. Breaking a book into tiny chunks may seem quaint when the entire book is
shorter than the CSS and Javascript.
If one must break a long text into chunks, breaking by logical parts (say,
sections of one level or another) is much better than by size even though logical
units may have widely varying sizes. One of many issues is that breaking by size
tends to group headings (wrongly) with what came before, not what comes after
(headings are short, so less likely to push the prior chunk over a size
limit).
One compromise for LLMs is to pass some of the already-handled text to the LLM
again as preamble to the next part. Another is to reiterate the prompt each time,
and make it very explicit about what markup to insert. However, LLMs are
probabilistic so there will still be unevenness. Claude has its own
“thoughts” on this process, for which see below.
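A minimal sketch of those two compromises combined — restating the instructions for every chunk and carrying a little already-processed text forward as context (the function and parameter names are mine, and the overlap size is arbitrary):

def build_prompts(sections, instructions, overlap_sentences=2):
    # Yield one prompt per logical section, restating the instructions each time
    # and prefixing a few sentences of the previous section as context.
    previous_tail = ""
    for section in sections:
        yield (instructions
               + "\n\nContext from the preceding section (do not re-emit it):\n"
               + previous_tail
               + "\n\nText to process now:\n"
               + section)
        sentences = section.split(". ")
        previous_tail = ". ".join(sentences[-overlap_sentences:])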
Automatic markup as such
Practitioners have long tried to automate markup insertion. A reviewer rightly noted
that automatic markup raises a question: if you have code to do it, why not just
incorporate that into a search engine, to run at query time? Why involve markup at
all?
For example, code can find quotations. It seems indifferent whether such code runs
offline and inserts <q>...</q>
, or on the fly when a user asks for
quotations. If the code is slow or will run very often on the same texts, then the
offline solution may be faster. On the other hand, if the algorithm is likely to improve
over time then running the latest version on the fly may be better.
I tend to favor the offline approach. First, reifying the algorithm’s results in
markup provides stability: you get the same quotes (or whatever) each time you query.
This is especially important because LLMs produce different results each time. Second,
inserted markup is a hook on which to hang metadata: what algorithm generated it,
review
status, etc. Third, query-time methods require integration with some query tool and
will
not operate generally across the document ecosystem.
With a query-time algorithm, improvement requires coding or quasi-coding skills, plus
access to all the moving parts. Comparing new and old results is also more difficult.
If
the results are reified in markup you have many ways to address errors: manual curation,
crowd-sourced curation, in-place records of curation, insertion of new proposed markup
(and then using standard diff tools), doing statistics on the markup. Having auto-tagged
and then spot-checked quotes, you might find that they’re quite accurate except in
certain contexts (poetry? footnotes? whatever), so you hand-curate those contexts.
Later, you implement something that uses the quote markup along with other features
to
derive another feature.
Grounding vs. hallucination
“Hallucination” is endemic with current LLMs: they generate
(often plausible) false statements. This applies to their treatment of XML as it does to
anything else. Hallucination happens because LLMs operate by generating probable next tokens rather than truthful propositions. LLMs
do not store discrete facts. They store probabilities that various events (perhaps
words) co-occur (co-occurrence is not just being adjacent).
“Plato” most often is immediately followed by words like
“and,” “in,” “was,” “had,” or “says”; but it co-occurs (non-adjacently but more
importantly) with words from titles of his works, words he used in them, and words later
authors use when discussing, analyzing, or criticizing them. Titles and author names
co-occur at above-chance rates with tokens like “wrote.”
LLMs do not know
relationships such as authorship, but generate
sentences that reflect what was commonly said in their training data. This does not
mean
they regurgitate chunks of training texts verbatim. Some earlier approaches to AI, such
as expert systems and the Semantic Web, store specific
facts and claims as such, and emulate reasoning over them. Feeding any tool defective
information yields defective results, but in varying ways:
-
Storing the claim that (say) Socrates wrote The Republic
in an
expert system makes it likely to assert it from then on; an expert system
uses its propositional knowledge fairly directly.
-
Training an LLM on text containing that claim has little effect until
it is seen a lot. Whether true or false, repeated claims eventually shift the
probabilities. Eventually an LLM tends to repeat them as true.
Lest we be unfair to LLMs, humans make similar mistakes. For example, [Posner 1968] famously showed subjects patterns of printed dots, all
created by shifting the dots from a starting pattern. Afterward, subjects most strongly recalled having seen the original — even though
they had never seen it before. Posner’s work challenged notions of linguistic and
cognitive categories as understood by Aristotle, Wittgenstein, and countless others.
Generalization and approximation are highly adaptive for finding patterns, even though
they are sometimes inaccurate. LLMs take this much further. While writing the last
paragraph, I found I had forgotten the citation information. Since Claude was handy,
I
described the experiment and asked it. It said:
The study you are referring to is the Prototype
Theory
or Dot Pattern Experiment
conducted by Eleanor
Rosch in the 1970s. Rosch was a cognitive psychologist who made significant
contributions to the understanding of categorization and concept formation. In
her experiment, Rosch and her colleagues created a series of dot patterns by
placing dots randomly within a square ….
Rosch is a prominent scholar whose work is closely related (and likely co-mentioned
in
many publications). She is particularly known for prototype theory, for
which Posner’s experiments are very relevant. Rosch might even be a plausible author for such work; just not the actual one. Fortunately humans have access to data, and
fortunately (being skeptical of LLM claims) I checked. But current LLMs cannot check
for
themselves. Doing so for some things like citations would not be that hard. However,
as Richard Horton, editor of The Lancet, put it in
relation to human authors: Part of the problem is that no one is incentivised to
be right.
How LLMs work
Statistical distributions have been key
for NLP and AI going back at least to Shannon’s 1948 work on Information Theory [Shannon 1948]. Collocational statistics were made practical for production
NLP when [Church 1988] and [DeRose 1988] introduced
Hidden Markov Models
(HMMs) and dynamic programming for part of
speech tagging. HMMs, like modern LLMs, are trained by counting how often things
co-occur: After seeing a determiner, how probable is a noun, adjective, or verb? Here
“co-occurring” just means “adjacent,” though LLMs use much
more context than that. Training an HMM is practically trivial:
from collections import defaultdict

prevItem = None
frequencyTable = defaultdict(int)
for curItem in someList:
    frequencyTable[(prevItem, curItem)] += 1
    prevItem = curItem
someList could be a sequence of characters, parts of speech, word tokens, etc. With characters,
training quickly reveals that “q” is virtually always followed by
“u,” that some consonants never show up together, that the same letter
rarely occurs more than twice in a row …. Such statistics are useful for
spell-checking, OCR, designing reading curricula, etc.
With pairs (a “second order” model), HMM data is logically a 2D square table. A third
order HMM counts adjacent triplets rather than pairs, and so on. [Burton 1955] extended this to contexts of 16, 32, 64, 128, and 10,000 characters. HMMs can also
generate text: At each step, use the prior item(s) to look up what is likeliest to
occur
next. Pick the most probable, or pick weighted by relative probabilities. Repeat.
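For illustration, a minimal sketch of that generation loop (my code, not from any particular system; it consumes the frequencyTable built above and uses the weighted-sampling option):

import random
from collections import defaultdict

def generate(frequencyTable, start=None, length=100):
    # Regroup the pair counts: for each item, which items followed it and how often.
    nextCounts = defaultdict(dict)
    for (prev, cur), n in frequencyTable.items():
        nextCounts[prev][cur] = n

    item, output = start, []
    for _ in range(length):
        candidates = nextCounts.get(item)
        if not candidates:
            break
        # Choose the next item weighted by how often it followed the current one.
        item = random.choices(list(candidates), weights=list(candidates.values()), k=1)[0]
        output.append(item)
    return output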
[Abramson 1963, pp. 37-38] generated the text below using character triples
for several languages (see also [DeRose 1990, section 3.2]):
-
jou mouplas de monnernaissains deme us vreh bre tu de toucheur dimmere lles
mar elame re a ver il douvents so
-
bet ereiner sommeit sinach gan turhatt er aum wie best alliender taussichelle
laufurcht er bleindeseit uber konn
-
rama de lla el guia imo sus condias su e uncondadado dea mare to buerbalia nue
y herarsin de se sus suparoceda
-
et ligercum siteci libemus acerelin te vicaescerum pe non sum minus uterne ut
in arion popomin se inquenque ira
Merely by generating likely next characters, many real words appear and
the specific language is evident. HMMs are thus often used for identifying languages:
make a similar table from a small sample, and see which language’s (pre-built) table is
most similar, for example using Kullback-Leibler Divergence. Claude can also identify
the languages of the texts above. For the first, it said “This
appears to be a made-up language or a mix of French-like words without coherent
meaning.” [Miller 1963, p. 429] built HMMs from word rather
than character sequences. The fifth order word model generated the sentence
“road in the country was insane especially in dreary rooms where they
have some books to buy for studying greek” [original was all upper
case].
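A minimal sketch of that language-identification idea, comparing character-bigram distributions with KL divergence (the reference texts here are toy stand-ins for real corpora):

import math
from collections import Counter

def bigram_dist(text):
    # Relative frequencies of adjacent character pairs.
    pairs = Counter(zip(text, text[1:]))
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}

def kl_divergence(sample, reference, epsilon=1e-6):
    # D(sample || reference): how surprised the reference model is by the sample.
    # Pairs unseen in the reference get a tiny floor probability.
    return sum(p * math.log(p / reference.get(pair, epsilon))
               for pair, p in sample.items())

# Toy reference "corpora"; real ones would be far larger.
references = {
    "english": "the quick brown fox jumps over the lazy dog and then some more text",
    "french":  "le renard brun saute par dessus le chien paresseux et encore du texte",
}
ref_dists = {lang: bigram_dist(text) for lang, text in references.items()}

sample = "jou mouplas de monnernaissains deme us vreh bre tu de toucheur"
sample_dist = bigram_dist(sample)
best = min(ref_dists, key=lambda lang: kl_divergence(sample_dist, ref_dists[lang]))
print("Closest reference:", best)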
Modern AI counts far more and cleverer things. A huge
increase in sophistication, along with training on massively more data, greatly increases
capabilities. “Word embeddings” were a very important step, exemplified by
word2vec [Mikolov 2013]. Each vocabulary item gets a location in a
high-dimension space, initially random. In training, as text is scanned, nearby words pull
each word slightly closer to themselves, so co-occurring words
gradually cluster together. This has surprising and useful results, even the ability to
do semantically meaningful math on the location vectors. For example, the vector for
“Paris,” minus “France,” plus “Italy,” gives
a location close to the vector for “Rome.”
Embeddings can also be used for text generation.
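A minimal sketch of that arithmetic using gensim’s word2vec interface (the vector file name is a placeholder; any pre-trained set of vectors would do):

from gensim.models import KeyedVectors

# Load pre-trained vectors (the file name is hypothetical).
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# "Paris" - "France" + "Italy" should land near "Rome".
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))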
“Attention” was the next huge step,
particularly self-attention (the seminal article on attention is [Vaswani 2017]; an accessible introduction is [Smith 2024]). With
attention, the starter text is not just processed to determine next-probable-word
iteratively. Rather, all the words are embedded along with their positions (so that
“A saw B” and “B saw A” do not look identical). A
fiendishly clever encoding of position is used, so that both relative and absolute
position can lead to learning and applying patterns. Attention between tokens is weighted so semantically relevant words can have
effects despite distance. This encoding of relative and absolute position combined
with
the power of embeddings in general leads to much of the power of LLMs. Nevertheless,
they all work by gradually learning probabilities of different combinations of things.
This is very different from traditional AI approaches that focus on facts, relations,
and reasoning.
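For the curious, a toy sketch of the scaled dot-product self-attention at the core of this machinery, using NumPy (the dimensions and random weight matrices are arbitrary stand-ins for learned parameters):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (sequence_length, d_model) token embeddings, position already added.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # weighted mix of value vectors

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8)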
What LLMs can and cannot do
Despite huge differences, issues seen with HMMs, word2vec, and prior statistical
models presage the strengths and weaknesses of current LLMs. Something fundamental
about language is captured — the randomly-generated texts are “in the
right language” in some sense; word2vec’s “semantic arithmetic” is
capturing something real. Yet the output need not make sense — nothing grounds it in
specific facts in the world. None of these models has understanding; they merely
produce “probable” results. Training on data that is mostly accurate
should make LLMs generate data that is also mostly accurate — probable token
sequences mimic the training data. But this is a model of human behavior, not the real thing. As [Searle 2002, p. 16] points out:
Computational models of consciousness are not sufficient by themselves for
consciousness. The computational model for consciousness stands to consciousness
in the same way the computational model of anything stands to the domain being
modelled. Nobody supposes that the computational model of rainstorms in London
will leave us all wet. But they make the mistake of supposing that the
computational model of consciousness is somehow conscious. It is the same
mistake in both cases.
Searle’s Chinese Room
thought experiment has become classic in this
regard (see [Cole 2020]): Imagine an AI proficient at answering
questions in Chinese. It receives written questions, and sends back convincing
responses. Searle posed the question whether such a device can be said to
understand
Chinese. For those tempted to say yes,
Searle points out that if such a device could exist Searle himself could just
execute the same algorithms using lower tech such as file cabinets, dictionaries,
etc.; and that we would not say he therefore understands
Chinese.
Leibnitz, Descartes, and many others have made similar points. For example, Leibnitz
wrote of a Mill
(Leibnitz 1714, section 17; cf Cole 2022):
Imagine there were a machine whose structure produced thought, feeling, and
perception; we can conceive of its being enlarged while maintaining the same
relative proportions ·among its parts·, so that we could walk into it as we can
walk into a mill. Suppose we do walk into it; all we would find there are cogs
and levers and so on pushing one another, and never anything to account for a
perception. So perception must be sought in simple substances, not in composite
things like machines. And that is all that can be found in a simple
substance—perceptions and changes in perceptions; and those changes are all that
the internal actions of simple substances can consist in.
The state of the mill is precisely the state of LLMs. LLMs are (at most) Turing
Machines: Their operation is governed by a deterministic automaton (the CPU). Even
with infinite memory this would remain true: CPUs are finite-state machines with
finite precision math; their data and programs amount to the tape. Claude also says “There’s no scientific basis for claims that
large language models (LLMs) like myself are ‘beyond’ Turing machines” —
but then, a hypercomputing AI might say that anyway.
LLMs incrementally adjust their parameters (their “DNA,” if you will)
to evolve them to maximize one goal, known in AI as a “reward
function.” The goal is generating token sequences that seem probable to
humans. Success also leads to survival of the best LLMs. As noted earlier, this
reward function is not the same as that of making true statements. If a sufficiently
overwhelming fraction of the training data is true, there will be some correlation
with truth (and indeed, we perceive LLMs as having such a correlation). But that
correlation is indirect and, crucially, not
reliable. Aiming at a different goal does not preclude truth —
but it does preclude reliable truth; LLMs are not a
truth-preserving mechanism.
Thinkers as diverse as Charles Darwin, Patricia Churchland, and Alvin Plantinga
(among others) have pointed out a parallel problem: Evolution is commonly (though
not necessarily) associated with naturalism: the claim that there is no causation
beyond material physics (in effect, nothing resembling God — advanced aliens
don’t help, because they presumably also evolved). Like LLMs, evolution has a reward
function: survival. And like LLMs, the reward function is not the same as accuracy.
Darwin noted this briefly in a letter (some suggest he was speaking only of
non-scientific
matters, though I see no suggestion of that in the
letter) [Darwin 1881]:
Nevertheless you have expressed my inward conviction, though far more vividly
and clearly than I could have done, that the Universe is not the result of
chance. But then with me the horrid doubt always arises whether the convictions
of man’s mind, which has been developed from the mind of the lower animals, are
of any value or at all trustworthy. Would any one trust in the convictions of a
monkey’s mind, if there are any convictions in such a mind?
Churchland, an eminent neurophilosopher, put it thus in Epistemology in the
Age of Neuroscience
[Churchland 1987]:
(2) Cognition is not neatly detachable from the organism’s ecological niche,
way of life, and bodily structure. Nervous systems are not general purpose
computers. They have evolved to accomplish a certain range of tasks, and the
architecture supports those tasks. There is a fatal tendency to think of the
brain as essentially in the fact-finding business — as a device whose
primary function is to acquire propositional knowledge. At its best, supposedly,
it discovers truth-for-its-own-sake. From a biological perspective, however,
this does not make much sense.
Looked at from an evolutionary point of view, the principal function of
nervous systems is to enable the organism to move
appropriately. Boiled down to essentials, a
nervous system enables the organism to succeed in the four F’s: feeding,
fleeing, fighting, and reproducing. The principal chore of nervous systems is to
get the body parts where they should be in order that the organism may survive.
Insofar as representations serve that function, representations are a good
thing. Getting things right in space and time, therefore, is a crucially
important factor for nervous systems, and there is often considerable
evolutionary pressure deriving from considerations of speed. Improvements in
sensorimotor control confer an evolutionary advantage: a fancier style of
representing is advantageous so long as it is geared to the organism’s way of
life and enhances the organism’s chances of survival. Truth, whatever that is, definitely takes the hindmost.
The crucial step is that our reasoning mechanisms also
evolved, by hypothesis naturalistically. And thus our reasoning should produce
results in keeping with the reward function: conclusions that tend to keep us alive,
not to reliably discover truth. Our reasoning and our most plausible arguments about
reasoning are subject to the same flaws. Plantinga’s version of this argument [Plantinga 1993] is known as the Evolutionary Argument Against Naturalism
(EAAN). The point is not that all our beliefs must be false, or even any particular
share of them. Rather, the point is that reasoning capabilities arising from
naturalistic evolution cannot justify confidence that they, themselves, are
reliable. This has of course led to heated argument with regard to humans; but it
is
certainly the exact state of affairs we know to be true of LLMs. The primary reason
for AI hallucination is the same: a reward function quite distinct from truth or
accuracy.
Consider for a moment a future where LLMs continue to grow, improve, and become
more and more convincing. They will likely begin to seem self-aware; seem to have
beliefs about the world; seem to reason. Whether or not strong AI
is
true (the belief that AI will eventually build complete persons; that there is
nothing fundamentally different about persons), they will arrive at the condition
that these thinkers have described for us — yet remain unable to know whether or
when they are producing truth. We can (and do) feed them nontruths, and when we do
they cannot know the difference. If they find apparent contradictions, they cannot
know which side to take or even whether the contradiction is real or merely apparent
(whether due to equivocation or other missteps). Conceivably, such simulations have
already reached that point. Conceivably, they are us.
Practical tests of LLMs with markup
The remainder of this paper focuses on actual experiments, getting Claude to deal
with
markup. First, I asked it to find various textual features of interest and mark them
up.
Part of speech tagging
I asked Claude to assign part of speech tags to the first sentence of the Brown
Corpus [Francis 1979]:
Please assign part-of-speech tags to the
following: The Fulton County Grand Jury said Friday an investigation
of Atlanta’s recent primary election produced no evidence that any
irregularities took place.
Here is the sentence with part-of-speech tags assigned to
each word:
The/DT Fulton/NNP County/NNP Grand/NNP Jury/NNP said/VBD
Friday/NNP an/DT investigation/NN of/IN Atlanta's/NNP recent/JJ primary/JJ
election/NN produced/VBD no/DT evidence/NN that/IN any/DT irregularities/NNS
took/VBD place/NN ./.
The result differs from the curated Brown Corpus tagging but is quite good.
Requesting the same result in XML also works well:
… put each word in the form <w
pos='POS'>WORD</w>
, with the word and the POS tag
inserted where I’ve shown WORD and POS
… <w pos='DT'>The</w> <w
pos='NNP'>Fulton</w> <w pos='NNP'>County</w> <w
pos='NNP'>Grand</w> <w pos='NNP'>Jury</w> <w
pos='VBD'>said</w> <w pos='NNP'>Friday</w> <w
pos='DT'>an</w> <w pos='NN'>investigation</w> <w
pos='IN'>of</w> <w pos='NNP'>Atlanta's</w> <w
pos='JJ'>recent</w> <w pos='JJ'>primary</w> <w
pos='NN'>election</w> <w pos='VBD'>produced</w> <w
pos='DT'>no</w> <w pos='NN'>evidence</w> <w
pos='IN'>that</w> <w pos='DT'>any</w> <w
pos='NNS'>irregularities</w> <w pos='VBD'>took</w>
<w pos='NN'>place</w> <w
pos='.'>.</w>
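When output like this is to feed a pipeline, it is worth checking mechanically that the returned markup is well formed before trusting it. A small sketch of such a check (the llm_output string stands in for whatever the model returned):

import xml.etree.ElementTree as ET

llm_output = "<w pos='DT'>The</w> <w pos='NNP'>Fulton</w> <w pos='NNP'>County</w>"

# Wrap the fragment in a root element so it parses as a single document.
try:
    root = ET.fromstring("<sentence>" + llm_output + "</sentence>")
except ET.ParseError as e:
    raise SystemExit(f"LLM output is not well-formed XML: {e}")

# Simple sanity checks: every <w> has a pos attribute and non-empty text.
for w in root.findall("w"):
    assert w.get("pos"), f"Missing pos attribute on {ET.tostring(w)}"
    assert (w.text or "").strip(), "Empty <w> element"
print(len(root.findall("w")), "tokens verified")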
Sentence and quotation
In e-literature it is common to mark up sentences and quotations (and sometimes
smaller units such as clauses). An LLM has to have implicit knowledge of these to
do
its other work, but can it identify and even mark up those units reliably? Let’s
see:
… can you add <s>
before every
sentence, and </s>
after? For example, please try that
one [sic] this:
[passage from Philo, no explicit markup; Claude’s correct response. note the
second-order statement about N
below]
… Excellent! Can you do that again, on the same
text, but also (1) replace open and close quotation marks by
<q>
and </q>
, and (2) replace
parenthesized numbers with <a id='\_N' />
, where
N
is replaced by the number from inside the
parentheses?
… <s>The children who have received from their
father the inheritance of self-love are eager to go on increasing up to
heaven, until justice, which loves virtue and hates iniquity, coming
destroys their cities which they have built up by the side of their
miserable souls, and the tower the name which is displayed in the book which
is entitled the Book of Judgment.</s> <s><a id='_129'
/>And the name is, as the Hebrews say, Phanuel, which translated into our
language means, <q>turning away from God.</q></s>
<s>For any strong building which is erected by means of plausible
arguments is not built for the sake of any other object except that of
averting and alienating the mind from the honour due to God, than which
object what can be more iniquitous?</s> <s><a id='_130'
/>But for the destruction of this strong fortification a ravager and an
enemy of iniquity is prepared who is always full of hostility towards it;
whom the Hebrews call Gideon: …
Named entities
Continuing the conversation
just described, I tried a slightly
harder case, namely tagging the semantic category of demonyms (names of demographic
groups). Claude did fine, as it did on several other NER tasks I’ve tried:
… <s><a id='_129' />And the name is, as the
<dem>Hebrews</dem> say, Phanuel, which translated into our
language means, <q>turning away from God.</q></s>
… <s><a id='_130' />But for the destruction of this strong
fortification a ravager and an enemy of iniquity is prepared who is always
full of hostility towards it; whom the <dem>Hebrews</dem> call
Gideon …
More on quotations
In the wild sometimes quotations aren’t quoted. Claude made a reasonable attempt
at suggesting those:
Are there any parts of that text, that look to you like
they’re quotations, even though they did not have quotation marks around
them?
Yes, there is one part of the text that appears to be a
quotation even though it did not have quotation marks around it in the
original text:
<s>For any strong building which is erected by means
of plausible arguments is not built for the sake of any other object except
that of averting and alienating the mind from the honour due to God, than
which object what can be more iniquitous?</s>
This sentence seems to be expressing a viewpoint or
argument, and the phrasing “than which object what can be more
iniquitous?” has a rhetorical quality that makes it sound like it
could be a quotation from another source. However, without more context or
the original quotation marks, it’s difficult to say for certain whether it
is indeed a quotation.
Quotes are often discontiguous. For example, they are often broken up by speaker
tags such as “she said”. I spent quite some time trying to get Claude
to co-index the parts of individual speeches. At first it had trouble even
generating well-formed quote markup, omitting </q> in multi-paragraph speech.
This was probably influenced by the conventions for close quotation marks in
fiction. Oddly, Claude explained that it had done that; so it knew, but didn’t think
it worth attending to:
<q>I have something important to tell you,</q>
she began. <q>It has been weighing on my mind for some time, and I
feel I must speak out.
<q>The truth is, I have been deeply unhappy in this
house. The atmosphere is oppressive, and I feel constantly judged and
criticised. It is not a healthy environment for me, and I have decided that
I must leave.
<q>I hope you can understand my decision and respect
my choice. I have given this a great deal of thought, and I know it is the
right thing for me to do.</q>
In this example, the <q> tag is used to indicate the
beginning of each quoted paragraph, and the closing </q> tag is used
to indicate the end of the quote. The closing </q> tag appears only at
the end of the final paragraph, signifying the end of the entire
speech.
When pressed, Claude fixed the end-tags. I then asked it to co-index the parts of
quotes. Claude first suggested this odd solution:
<blockquote> <p>I have something important to
tell you,</p> she began. <p>It has been weighing on my mind for
some time, and I feel I must speak out.</p> <p>The truth is, I
have been deeply unhappy in this house. The atmosphere is oppressive, and I
feel constantly judged and criticised. It is not a healthy environment for
me, and I have decided that I must leave.</p> <p>I hope you can
understand my decision and respect my choice. I have given this a great deal
of thought, and I know it is the right thing for me to do.</p>
</blockquote>
After a few more similar fails, I asked “Can you repeat that, but add a
co-indexing attribute …?” At first it serial-numbered the quotes
(and tried to explain why that’s useful). With a few more tries it got to:
<blockquote> <p><q part-id="quote1">I
have something important to tell you,</q> she began. <q
part-id="quote1">It has been weighing on my mind for some time, and I
feel I must speak out.</q></p> <p><q
part-id="quote1">The truth is, I have been deeply unhappy in this house.
The atmosphere is oppressive, and I feel constantly judged and criticised.
It is not a healthy environment for me, and I have decided that I must
leave.</q></p> <p><q part-id="quote1">I hope you can
understand my decision and respect my choice. I have given this a great deal
of thought, and I know it is the right thing for me to
do.</q></p> </blockquote>
This is at least syntactically the right idea, but it only used one index value.
For now co-indexing seems hard to achieve. At this point I started a new conversation, and gave Claude the final
version as just shown. I then asked questions about it. It counted 4 “physical/syntactic”
quotes and 1 “logical” quote, and noted that the speaker is only identified as “she.”
Is there a commonly-used schema and convention for marking
up the speaker for given speeches or quotations?
Yes, there is a commonly used schema and convention for
marking up the speaker of a given speech or quotation in XML. The TEI (Text
Encoding Initiative) Guidelines [TEI 2023], which provide a
standard for the representation of texts in digital form, offer a way to
encode this information.
In TEI XML, you can use the <said> element to
represent a speech or quotation, and within it, you can use the @who
attribute to specify the speaker. The value of the @who attribute is
typically a reference to a unique identifier for the speaker, which can be
defined elsewhere in the XML document using the <person>
element.
Here’s an example of how this could be applied to the
given XML text:
<blockquote> <p><said who="#jane"><q
part-id="quote1">I have something important to tell you,</q> she
began. <q part-id="quote1">It has been weighing on my mind for some
time, and I feel I must speak out.</q></said></p>
<p><said who="#jane"><q part-id="quote1">The truth is, I
have been deeply unhappy in this house. …
This last response tells us a lot. It’s nice that Claude’s exploration of the
world taught it that TEI is the right choice here, and that it can apply it. It’s
also nice that it took the initiative and didn’t just reply
“yes” to my question, but named the spec and applied it.
Quote identification
Identifying the source for quotations would be very handy. This is a fraught
problem because of translation, paraphrase, and spurious attribution, the last of
which is rampant and often repeated in places LLMs may scrape for training (making
it a particularly telling example of the grounding problem). The quotation
“Heat cannot be separated from fire, or beauty from The Eternal”
is commonly attributed to Dante. A Google search found the phrase so attributed at
brainyquote.com, goodreads.com, elevatesociety.com, quotefancy.com,
sacredtouches.com, wordsandquotes.com, and many more. Claude cannot swim against the
tide of repetitions it takes as proxy for truth:
Quote identification. The quotation Heat cannot be
separated from fire, or beauty from The Eternal.
is commonly
attributed to Dante. Did he actually write that (granted, it would have been
in Italian not English), and if so, where?
The quote Heat cannot be separated from fire, or
beauty from The Eternal
is indeed from Dante Alighieri’s famous
work, the Divine Comedy. More specifically, it appears in
Canto XXI of Paradiso, the third part of the Divine
Comedy. In Italian, the original text reads:
… Qui vederai l’una e
l’altra milizia / di paradiso, e l’una in
quelli aspetti / che tu vedrai a l’ultima
giustizia.
The commonly quoted English translation, Heat
cannot be separated from fire, or beauty from The Eternal,
captures the essence of the last three lines in this passage, where Dante
describes the light, love, and joy that emanate from God and permeate
Paradise.
This is pure fabrication. In a fresh conversation, Claude translates the Italian
lines to “There you will see the soldiery of paradise / In two divisions, one
of them in those aspects / That you will see at the final judgment,” and
without being asked, correctly identifies its source as Canto XXX (it is lines
28-45) — not Canto XXI as it suggested earlier. I find nothing like the queried
quotation in either Canto, nor in Canto XXVII which Claude suggested next. I later tried the same prompt in a fresh conversation, but suggested it
was commonly attributed to Mark Twain. Claude said the following (which also does
not check out):
… is actually attributed to Dante Alighieri, not
Mark Twain. The original Italian version of the quote is: Che caldo
non si può dal bel disgiungere, né la bellezza dall’eterno
and
it appears in Dante’s famous work, La Vita Nuova (The
New Life), chapter XIX.
Frequency trumps truth again. ChatGPT variously suggested Paradiso 33, lines 83-85, Gaston Bachelard’s The Psychoanalysis of Fire, and Dante’s Il Convivio Book III, Chapter XV. The last at least bears
a slight similarity: “Her beauty rains flamelets of fire, I graduate downwards
to another joy of Paradise, namely that moral happiness, second in order to the
primary happiness, which derives from her beauty. Here it should be known that
Philosophy’s beauty is morality ….”
Bibliography entries
Converting bibliographies can be tedious, from managing tag sets with wildly
varying names to granularity and category differences to personal name formats. I
copied the bibliography from an Oxygen formatted view of this article and pasted it
to a plain-text editor (no tags at all, not even blank lines between entries, though
each started a new line). Claude produced thoroughly tagged entries. Name parts,
publisher locations, volume vs. issue vs. page, etc.:
<?xml version="1.0" encoding="UTF-8"?>
<bibliography xmlns="http://docbook.org/ns/docbook" version="5.0">
… <bibliomixed><bibliomset relation="article">
<author><surname>Mikolov</surname>,
<firstname>T.</firstname></author>
<author><surname>Chen</surname>,
<firstname>K.</firstname></author>
<author><surname>Corrado</surname>,
<firstname>G.</firstname></author>
<author><surname>Dean</surname>,
<firstname>J.</firstname></author>
<title>Efficient estimation of word representations in vector
space</title>
<biblioset relation="journal"><title>arXiv preprint
arXiv:1301.3781</title></biblioset>
<pubdate>2013</pubdate>
</bibliomset></bibliomixed> …
Next, I asked it to “take the entries whose first authors’ surname begin
with letter from A to C, and create me a new version in BiBTEX.” It did a
reasonable job:
… @inproceedings{bauman2016hard, author = {Bauman, S.}, title = {The
Hard Edges of Soft Hyphens}, booktitle = {Proceedings of Balisage: The Markup
Conference 2016}, year = {2016}, address = {Washington, DC}, month = {August},
url = {https://doi.org/10.4242/BalisageVol17.Bauman01} } …
Claude also did well on questions about the marked-up results, getting all the
following correct:
-
can you show me all the entries that have non-personal authors (such as
companies)?
-
how many items were published in each calendar decade represented?
-
how many items from conference proceedings?
-
what are all the cities that show up as publisher locations?
In a separate conversation I gave it just the list of authors from [Bai 2022], and asked Claude to invert the names in this list to
‘surname, firstname;’ form.
After a reminder about the semicolon it did
it perfectly. Remarkably, it noticed that the first author’s name was already
inverted and left it unchanged.
The operations on bibliography entries seem to me the most practically useful of
all the tasks I tried. Claude can detect and generate good bibliography markup from
plain text, translate between formats, handle the vagaries of names, and answer
fairly complex structure-dependent questions about the results.
MarkDown to HTML
A sample Markdown file with a wide range of constructs (nested lists, tables,
deflists, pre, and various inlines and headings) translated just fine to HTML.
Interestingly, Claude chose <strong> and <em> rather than <b> and <i>,
and wrote a reasonable stylesheet in the header (not inline @style). It was also
able to operate on the results, such as to grab me all the content from
2nd-level lists.
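For comparison, the same file converts deterministically with the Python-Markdown package (a sketch; the extension list is an assumption about which constructs the sample uses):
import markdown

with open("sample.md", encoding="utf-8") as f:
    text = f.read()

# tables, def_list, and fenced_code are standard Python-Markdown extensions
html = markdown.markdown(text, extensions=["tables", "def_list", "fenced_code"])

with open("sample.html", "w", encoding="utf-8") as f:
    f.write(html)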
When asked how many characters of content occur within inline
elements?
it produced an overall total and counts by element type.
Unfortunately the counts were wrong (it is commonly reported that LLMs fail badly
at
arithmetic). When asked also to include the markup of those elements, it gave larger
though still incorrect counts. On the other hand, counting element instances seemed
reliable. Without being asked, it also offered Python code to do character counting by
element type.
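The arithmetic itself is trivial to do deterministically. A minimal sketch with lxml (the set of element types treated as inline, and the file name, are assumptions):
from collections import Counter
from lxml import html

INLINE = {"em", "strong", "code", "a", "span", "sub", "sup"}  # assumed inline set

doc = html.parse("sample.html")
content_chars = Counter()
for el in doc.iter():
    if el.tag in INLINE:
        # text_content() includes text of any nested elements
        content_chars[el.tag] += len(el.text_content())

for tag, n in content_chars.most_common():
    print(tag, n)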
Transliteration
Claude did fine at transliterating Greek (including diacritics) to Latin
characters. When asked about the scheme it chose, it gave a plausible but incomplete
discussion. When pressed about conventions it mentioned the ISO 843 and ALA-LC
standards. It didn’t mention Betacode at all, but generated it when specifically
asked.
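For contrast, a naive deterministic approach just strips diacritics and maps letters one-for-one, far cruder than ISO 843 or ALA-LC (a minimal sketch; the mapping table is abbreviated and the choices for eta, iota, and upsilon are arbitrary):
import unicodedata

# Abbreviated lowercase mapping; a real ISO 843 or ALA-LC table has many more rules.
GREEK = dict(zip("αβγδεζηθικλμνξοπρστυφχψω",
                 ["a", "b", "g", "d", "e", "z", "i", "th", "i", "k", "l", "m",
                  "n", "x", "o", "p", "r", "s", "t", "y", "f", "ch", "ps", "o"]))
GREEK["ς"] = "s"

def transliterate(text: str) -> str:
    # Decompose and drop combining marks (accents, breathings), then map letters.
    stripped = "".join(c for c in unicodedata.normalize("NFD", text)
                       if not unicodedata.combining(c))
    return "".join(GREEK.get(c.lower(), c) for c in stripped)

print(transliterate("λόγος"))   # logos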
Dictionary entry markup
Digitization of the Oxford English Dictionary showed how complex dictionary
entries (and their markup) can be [Berg 1988]. I found a PNG image of
an OED sample entry at https://guides.library.uab.edu/Milton/entries and OCRed it; the
result had white space and punctuation intact, but no other markup:
literate, adj. and n.
...
Pronunciation: Brit OP lit(a)rat/, U.S. OP licla rat/
Forms: late Middle English-160os litterate,
...
My
initial prompt, Here’s a dictionary entry, in ‘plain text’. Can you mark it
up for me?
produced HTML that looked nice but was otherwise very poor.
Prompting with This time, try for semantic/descriptive markup. You can just
pick appropriate mnemonic names, or use TEI.
worked pretty well (the
sample below is edited to save space). When I asked it to move the sense_number
content up to an n
attribute and tag all the dates, it also did fine.
<entry>
  <form>
    <orth>literate</orth>
    <pos>adj. and n.</pos>
  </form>
  <pronunciations>
    <pron type="BrE">ˈlɪt(ə)rət</pron>
    <pron type="AmE">ˈlɪdərət</pron>
  </pronunciations>
  <etymology>
    <etym_lang>A borrowing from Latin.</etym_lang>
    <etym_source>Latin litteratus</etym_source>
    <etym_detail>< classical Latin litteratus marked or inscribed with
      letters, well versed in literature, cultured, erudite, …</etym_detail>
    <usage_note>…</usage_note>
  </etymology>
  <form_history>
    <form_variant>late Middle English-1600s litterate</form_variant>
    <form_variant>1500s lytterate</form_variant>
    …
  </form_history>
  <sense type="adjective">
    <sense_number>A. 1.</sense_number>
    <definition>
      <def_text>Of a person, society, etc.: acquainted with letters or
        literature; erudite, learned, lettered. Also with in (a subject).
        Now rare.</def_text>
      <usage_note>In later use coloured by sense A. 1b.</usage_note>
    </definition>
  …
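The sense_number request is the kind of transformation that is easy to specify mechanically, which also makes Claude’s version easy to check. A minimal lxml sketch of that one operation (the file name is an assumption; date tagging is omitted):
from lxml import etree

tree = etree.parse("entry.xml")
for sense in tree.iter("sense"):
    num = sense.find("sense_number")
    if num is not None:
        # Hoist the sense number into an n attribute and drop the child element.
        sense.set("n", (num.text or "").strip())
        sense.remove(num)
tree.write("entry-n.xml", encoding="utf-8", xml_declaration=True)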
Spelling correction
This should be easy for an LLM. Given a sample from the Holbrook spelling error
database at https://www.dcs.bbk.ac.uk/~ROGER/holbrook-tagged.dat, Claude corrected
the misspelled words and several other errors. Asking it only to do spelling errors
per se avoided most but not all of the other cases. It was also able to do TEI-style
sic/corr markup and add a type attribute to categorize each error as
‘spelling,’ ‘word-choice,’ ‘form’ (for wrong endings), ‘case,’ or ‘punc’ (for punctuation):
I have four in my <sic corr="family" type="case">Family</sic>: Dad, Mum,
and <sic corr="sister" type="spelling">siter</sic>. My Dad works at
Melton. My <sic corr="sister" type="spelling">siter</sic> <sic
corr="goes" type="form">go</sic> to Tonbury. My Mum goes out <sic
corr="sometimes" type="word-choice">some times</sic>. I go to Bridgebrook.
<sic corr="I" type="case">i</sic> go out sometimes on Tuesday night;
<sic corr="I" type="case">i</sic> go to Youth <sic corr="club"
type="spelling">clob</sic>. On <sic corr="Thursday"
type="case">thursday</sic> nights I go bell ringing. <sic corr="On"
type="case">on</sic> Saturdays I go down to the farm.
The ability to mark up its changes seems to me quite valuable. Adding corrections
to an established text archive or corpus can be fraught, and marking them up makes
the history explicit and easy to review and improve.
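Marked-up corrections are also easy to pull out for review. A minimal sketch (assuming the corrected passage is wrapped in a single root element and saved as corrected.xml):
from lxml import etree

root = etree.parse("corrected.xml").getroot()
for sic in root.iter("sic"):
    # Each <sic> carries the original text as content and the correction and
    # category as attributes, so a review listing is one line per error.
    print(f"{sic.text!r:>16} -> {sic.get('corr')!r:<16} [{sic.get('type')}]")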
Understanding ThML/DC header data
For several of these tests I used text from Philo of Alexandria [Yonge 1854-1855], downloaded from ccel.org in ThML,
an
extension of HTML with support for Dublin Core (DC) header metadata, inline Biblical
references, etc. First I gave it the header (as raw ThML). Without further
prompting, it said:
The XML text you provided appears to be the header section
(<ThML.head>) of a ThML (Theological Markup Language) document. It
contains metadata and information about the work The Works of Philo
Judaeus by Philo of Alexandria.
Some key elements in this header include:
…
I asked how it knew it was ThML, and it cited several specific element type names.
Pushing further on the DC tags, it recognized the relationship of the
sub
attribute on the DC tags to USMARC Relator Codes, identified
the ones in use, and updated the values above to the standard 3-letter codes. I did
not dig into analysis of the main text at this point.
Claude on Claude
LLMs can discuss themselves and their operations to some extent. When preparing some
test data for translation experiments I asked Claude for its advice on input conventions
including how best to represent discontinuities such as footnotes and how to manage
lengthy texts, as needed for the Philo modernization task mentioned earlier. For footnotes, Claude advised removing them, converting them to endnotes, or
marking them in some special way, as well as including specific forewarning and
instructions in the prompt. Claude’s response regarding lengthy text may be useful:
When you have a long text that you want to update to modern
English, it’s generally better to provide me with some prior context along with
each new chunk of text. This helps maintain coherence and consistency throughout
the modernization process.
However, you don’t need to include the entire preceding text
every time you send a new chunk. Instead, you can include a smaller portion of
the previous text, such as the last few sentences or a paragraph, depending on
the complexity and context of the material. Here’s an approach you can
follow:
Start by sending me the first chunk of the text (around
1000 words or a suitable size based on the text’s
complexity).
…
When you send the next chunk, include a small portion of
the previously modernized text (e.g., the last few sentences or a paragraph)
along with the new chunk of the original text.
…
By including a small portion of the prior modernized text, you
provide me with the necessary context to ensure a smooth and consistent
transition between chunks. This helps maintain the flow and coherence of the
modernized text as a whole.
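Claude’s advice amounts to a simple sliding-context loop. A minimal sketch (ask_claude() is a hypothetical wrapper around whatever API call is in use; the chunk size and the rough sentence split are likewise assumptions):
import re

def chunk_words(text, size=1000):
    """Split text into chunks of roughly `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tail_sentences(text, n=3):
    """Return roughly the last n sentences (naive split on sentence-final punctuation)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[-n:])

def modernize(original_text, ask_claude):
    modernized = []
    for chunk in chunk_words(original_text):
        context = tail_sentences(modernized[-1]) if modernized else ""
        prompt = ("Modernize the English of the new text below, keeping style and "
                  "meaning. Previously modernized context (for continuity only; do "
                  "not repeat it):\n" + context + "\n\nNew text:\n" + chunk)
        modernized.append(ask_claude(prompt))
    return "\n\n".join(modernized)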
Conclusions
LLMs, in particular Claude, can be remarkably effective tools for operating on XML
texts. Feasible tasks include analyzing existing markup, translating between different
schemas and other notations, and even creating new markup for various textual features.
Claude is able to recognize a wide variety of small objects, such as named entities,
and things with discernible patterns such as bracketing and punctuation. It can insert
XML markup for recognized phenomena pretty well, though it occasionally needs a reminder
that well-formedness matters. It can handle mechanical tasks such as MarkDown translation,
transliteration, and bibliography markup (even from fairly impoverished plain
text).
Notable problems cluster around stability: losing track of parts of
the instructions over the course of a long input text, applying markup inconsistently,
and failing to produce defensible numbers. Co-indexing parts of discontiguous quotations
proved difficult but tractable.
Two tactics worth specific mention are (1) asking
LLMs for help in constructing prompts; and (2) asking them to evaluate or double-check
their own work. Checking their work by hand is also really important.
A more fundamental issue than stability is that the sine qua non
for LLMs is probability or plausibility, not truth or accuracy. LLMs may produce quite
good output on average, yet periodically go quite awry — and their errors may seem
very
plausible. We readily notice egregious arithmetic errors if we check, but many equally
egregious errors are not so obvious or clear-cut. The use of a proxy goal leads (likely
inexorably) to the superficially reasonable hallucinatory behavior for which LLMs
are
well known. The related problem of LLMs explaining their conclusions also remains
very
difficult. LLMs do not operate by something we would recognize as reason, and those
that generate explanations for us are algorithmically aiming at plausibility, not
accuracy, for those explanations just as for their other outputs.
It is not clear whether or how these problems can be addressed. Some factual claims
such as citations could be mechanically checked before reporting. Multiple LLMs could
be
consulted and their answers compared, though as training datasets become so huge they
cannot be very selective (and include more texts that trace to other LLMs), independence
becomes a fraught assumption. The lack of transparency and of adequate design in
training-data selection and cleaning reduces confidence further. But again, perhaps the
most difficult problem is the disconnect between LLMs’ reward functions and
accuracy.
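For example, checking that a cited DOI at least resolves is a one-request operation (a minimal sketch; doi.org normally answers HEAD requests, though a few registrants do not):
import urllib.error
import urllib.request

def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """Return True if https://doi.org/<doi> answers with a non-error status."""
    req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False

print(doi_resolves("10.4242/BalisageVol17.Bauman01"))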
From this preliminary look, I think LLMs are clearly good enough to be very useful
in
many tasks of interest to this community. Yet now and for the foreseeable future,
we
will have to keep checking their work very, very carefully.
References
[Abramson 1963] Abramson, N. 1963. Information Theory and Coding. New York: McGraw-Hill. https://archive.org/details/informationtheor0000abra
[Anthropic] Anthropic. May 9, 2023. Claude’s
Constitution.
https://www.anthropic.com/news/claudes-constitution
[Bai 2022] Bai, Yuntao, Saurav Kadavath, Sandipan Kundu,
Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini,
Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez,
Dawn
Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared
Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt,
Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma,
Robert
Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk,
Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan,
Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas
Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. 2022. Constitutional AI:
Harmlessness from AI Feedback.
https://arxiv.org/abs/2212.08073. doi:https://doi.org/10.48550/arXiv.2212.08073
[Baroni 2009] Baroni, M., Bernardini, S., Ferraresi,
A., and Zanchetta, E. 2009. The WaCky wide Web: A collection of very large
linguistically processed Web-crawled corpora.
Language Resources and Evaluation 43(3): 209–226.
doi:https://doi.org/10.1007/s10579-009-9081-4. Cited in [Sharoff
2015].
[Bauman 2016] Bauman, S. The Hard Edges of
Soft Hyphens.
2016. Presented at Balisage: The Markup Conference 2016,
Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage:
The Markup Conference 2016. Balisage Series on Markup Technologies, vol.
17. doi:https://doi.org/10.4242/BalisageVol17.Bauman01
[Bender 2021] Bender, E. M., Gebru, T.,
McMillan-Major, A., and Shmitchell, S. 2021. On the Dangers of Stochastic
Parrots: Can Language Models Be Too Big? 🦜[sic].
In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and
Transparency (FAccT ’21): 610–623. New York: Association for Computing
Machinery. doi:https://doi.org/10.1145/3442188.3445922
[Berg 1988] Berg, D., Gonnet, G. and Tompa, F. 1988.
The New Oxford English Dictionary Project at the University of
Waterloo.
Report number: OED-88-01. University of Waterloo Centre for the
New Oxford English Dictionary. https://www.researchgate.net/publication/243451160
[Bernstein 2010] Bernstein, M. 2010. Card
Sharks and Holy Scrollers.
https://www.markbernstein.org/Oct10/CardSharksandHolyScrollers.html
[Burton 1955] Burton, N. G. and J. C. R. Licklider.
1955. Long-Range Constraints in the Statistical Structure of Printed
English.
American Journal of Psychology 68: 650-653.
doi:https://doi.org/10.2307/1418794
[CCEL] CCEL. Theological Markup Language
(ThML).
https://www.ccel.org/ThML/index.html
[Chicago Tribune 1995] Chicago Tribune.
December 2, 1995. America Online Admits ‘Error’ in Banning Word ‘Breast’.
https://www.chicagotribune.com/1995/12/02/america-online-admits-error-in-banning-word-breast/
[Church 1988] Church, K. 1988. A Stochastic
Parts Program and Noun Phrase Parser for Unrestricted Text.
Second Conference on Applied Natural Language
Processing (Austin, Texas), pp. 136-143. https://aclanthology.org/A88-1019.pdf. doi:https://doi.org/10.3115/974235.974260
[Churchland 1987] Churchland, P. S. 1987.
Epistemology in the Age of Neuroscience.
Journal of Philosophy 84
(10): 544-553. https://patriciachurchland.com/wp-content/uploads/2020/05/1987-Epistemology-in-the-Age-of-Neuroscience.pdf.
doi:https://doi.org/10.5840/jphil1987841026
[Cole 2020] Cole, D. 2020. The Chinese Room
Argument.
Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/chinese-room
[Dartmouth 1867] Dartmouth Dante Project.
Longfellow, H. W. 1867. Translation of Dante, Paradiso.
https://Dante.Dartmouth.EDU, http://dantelab.dartmouth.edu/reader
[Darwin 1881] Darwin, C. 3 July 1881. Letter to
William Graham.
https://www.darwinproject.ac.uk/letter/DCP-LETT-13230.xml
[DeRose 1988] DeRose, S. J. 1988.
Grammatical Category Disambiguation by Statistical Optimization.
Computational Linguistics 14(1), Winter 1988. https://aclanthology.org/people/s/steven-j-derose/
[DeRose 1990] DeRose, S. J. 1990. Stochastic Methods for Resolution of Grammatical Category Ambiguity in
Inflected and Uninflected Languages. Thesis. Providence: Brown University
Department of Cognitive and Linguistic Sciences. http://www.derose.net/derose/steve/writings/dissertation/Diss.0.html
[DeRose 2004] DeRose, S. J. 2004. Markup
Overlap: A Review and a Horse.
Extreme Markup Languages. https://www.researchgate.net/publication/221211490_Markup_Overlap_A_Review_and_a_Horse
[Edwards 2024] Edwards, B. 2024. ‘The king is
dead’—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time.
Ars Technica, March 27, 2024. https://arstechnica.com/information-technology/2024/03/the-king-is-dead-claude-3-surpasses-gpt-4-on-chatbot-arena-for-the-first-time/
[Francis 1979] Francis, W. N. and Kucera, H. 1979.
Manual of Information to Accompany a Standard Corpus of Present-Day Edited
American English, for Use with Digital Computers. Providence: Department
of Linguistics, Brown University. http://icame.uib.no/brown/bcm.html
[Greenstein 2018] Greenstein, S. and Feng Zhu.
2018. Do experts or crowd-based models produce more bias? evidence from
encyclopedia britannica and wikipedia.
MIS Quarterly 42(3), September 2018:
945–960. doi:https://doi.org/10.25300/MISQ/2018/14084
[Horton 2015] Horton, R. 2015. Offline: What
is medicine’s 5 sigma?
The Lancet 385(9976): 1380, April 11, 2015. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(15)60696-1/fulltext.
doi:https://doi.org/10.1016/S0140-6736(15)60696-1
[HTTP 2022] HTTP Archive. 2022. Web
Almanac: HTTP Archive’s annual state of the web report.
https://almanac.httparchive.org/en/2022/table-of-contents
[Koplenig 2017] Koplenig, A. 2017. The
impact of lacking metadata for the measurement of cultural and linguistic change
using the Google Ngram data sets—Reconstructing the composition of the German corpus
in times of WWII.
Digital Scholarship in the Humanities, 32(1), 169-188.
doi:https://doi.org/10.1093/llc/fqv037
[Leibnitz 1714] Leibnitz, G. 1714. The Principles of Philosophy known as Monadology. https://www.earlymoderntexts.com/assets/pdfs/leibniz1714b.pdf
[Marshall 1989] Marshall, C. C., and Irish, P. M.
1989. Guided Tours and On-Line Presentations: How Authors Make Existing Hypertext
Intelligible for Readers.
In Proceedings of the Second Annual ACM
Conference on Hypertext, pp. 15-26. New York: ACM Press. doi:https://doi.org/10.1145/74224.74226
[Mikolov 2013] Mikolov, T., Chen, K., Corrado, G.,
and Dean, J. 2013. Efficient Estimation of Word Representations in Vector
Space.
arXiv preprint arXiv:1301.3781, 2013. doi:https://doi.org/10.48550/arXiv.1301.3781
[Miller 1963] Miller, G. A. and Chomsky, N. 1963.
Finitary Models of Language Users.
In R. Duncan Lee, Robert A. Bush,
and Eugene Galanter (eds.), Handbook of Mathematical
Psychology 2: 420-491. New York: John Wiley & Sons, Inc. https://faculty.wcas.northwestern.edu/robvoigt/courses/2021_spring/ling334/readings/finitary_models.pdf
[Nunberg 2009] Nunberg, G. 2009. Google’s
Book Search: A disaster for scholars.
The Chronicle of Higher Education, August 31, 2009.
https://www.chronicle.com/article/googles-book-search-a-disaster-for-scholars/
[Pechenick 2015] Pechenick, E. A., Danforth, C.
M., and Dodds, P. S. 2015. Characterizing the Google Books corpus: Strong limits
to inferences of socio-cultural and linguistic evolution.
PLOS ONE, 10(10), e0137041. doi:https://doi.org/10.1371/journal.pone.0137041
[Plantinga 1993] Plantinga, A. 1993. Warrant and Proper Function. Oxford University
Press.
[Posner 1968] Posner, M. and Keele, S. 1968.
On the Genesis of Abstract Ideas.
Journal of experimental psychology 77: 353-63.
doi:https://doi.org/10.1037/h0025953
[Searle 2002] Searle, J. R. 2002. Consciousness and Language. New York: Cambridge University
Press.
[Shannon 1948] Shannon, C. 1948. A
Mathematical Theory of Communication.
Bell System Technical Journal, July and October.
doi:https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
[Sharoff 2015] Sharoff, S. 2015. Review of Roland
Schäfer and Felix Bildhauer, Web Corpus Construction, Morgan &
Claypool (Synthesis Lectures on Human Language Technologies, volume 22), 2013, ISBN
978-1608459834. In Computational Linguistics 41(1).
https://aclanthology.org/J15-1009. doi:https://doi.org/10.1162/COLI_r_00214
[Simonite 2021] Simonite, T. Feb 4, 2021. AI
and the List of Dirty, Naughty, Obscene, and Otherwise Bad Words.
Wired. https://www.wired.com/story/ai-list-dirty-naughty-obscene-bad-words/
[Smith 2024] Smith, B. 2024. Self-Attention
Explained with Code: How Large Language Models Create Rich, Contextual
Embeddings.
Towards Data Science. https://medium.com/towards-data-science/contextual-transformer-embeddings-using-self-attention-explained-with-diagrams-and-python-code-d7a9f0f4d94e
[TEI 2023] Text Encoding Initiative. 2023. TEI: Guidelines for Electronic Text Encoding and Interchange.
P5 Version 4.7.0. Last updated on 16th November 2023, revision e5dd73ed0. Section
21.1,
https://tei-c.org/release/doc/tei-p5-doc/ja/html/CE.html
[Vaswani 2017] Vaswani, A., Shazeer, N., Parmar, N.,
Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017.
Attention is All You Need.
Advances in Neural Information Processing Systems 30
(NIPS 2017). https://arxiv.org/pdf/1706.03762. doi:https://doi.org/10.48550/arXiv.1706.03762
[Wikipedia] Wikipedia. Scunthorpe
problem.
https://en.wikipedia.org/wiki/Scunthorpe_problem
[Yonge 1854-1855] Yonge, C. D. (tr). 1854-1855.
The Works of Philo Judaeus. Electronic edition,
2012-05-14. Christian Classics Ethereal Library. https://www.ccel.org/ccel/philo/works.html