Introduction
Large language models (LLMs) have become enormously popular for a variety of tasks, many involving text. Even those that generate graphics often do so based on text prompts.
But there is a long tradition in the AI and NLP world of throwing away markup from texts (whether descriptive, procedural, layout, or even punctuation and case). A common NLP format insists on one sentence per line, regardless; another on one word per line — language is not an array.[1] These things might be done to simplify processing, in hopes of saving space or programmer time, or for other rationales. The consequent costs are real but may be less obvious, longer-term, and diffuse. As LLMs train on more and more text, it becomes even more tempting to reduce everything to a least common denominator by discarding anything that requires effort seemingly not central to the task. Yet markup of all kinds (punctuation, whitespace, layout, tags, etc.) carries much useful information.
The same issues arise with training text selection. In the urgent interests of safety and non-bias it is common to censor training data. Doing that without unintended side-effects is very hard, yet censorship is commonly done very simplistically. Often it is based on mere word (or worse, substring) lists; or for images, on the presence of too many flesh-tone pixels (a definition itself prone to bias). Or bias may be avoided by simply not scraping sites the programmer or their company doesn’t like. Obscenity, inappropriateness, and bias are not properties of pixels, strings, or URLs. There are indeed correlations, but such mechanisms are inaccurate proxies for the real targets: they are not the actual targets.
While LLMs do many amazing things, they are commonly handed only opaquely-selected texts, in versions lacking much of their structure and information. Questionable text-handling is widespread at multiple levels. It has consequences. As [Sharoff 2015] notes in reviewing Schäfer and Bildhauer’s book on the construction of corpora via the Web:
… methods for basic cleaning of the corpus content, such as processing of text formats (primarily HTML tags), language identification, boilerplate removal, and deduplication. Such low-level tasks are not considered to be glamorous from the view of computational linguistics, but they are extremely important for making Web-derived corpora usable [Baroni 2009].
Historical practice for filtering, cleaning, and simple tasks like tokenizing raises questions. LLMs do know something about markup — but just how much? Can they recognize it in prompt texts? Can they recognize layout conventions? Can they notice phenomena in text and mark them up on request?
This paper reports several experiments along these lines, mainly using the LLMs from Anthropic. The associated public interface (known as Claude — https://claude.ai) recently passed ChatGPT on at least one AI leaderboard [Edwards 2024]. Most of the examples were created with Claude 3.0, before the recent upgrade to 3.5 and support for persistent artifacts (such as code or XML samples). I’ve tried many of the examples in Claude 3.5 and/or ChatGPT 3.0 as well, and note some differences below. The field is moving quickly and results will likely improve even by the time this is read.
Despite whatever cleaning may be done on training data, at least information about markup and structure evidently remains. LLMs can often answer questions about markup languages per se, and even about syntax and usage. For example, I asked Anthropic’s Claude 3.5 what’s wrong with the following:
<?xml version="0.9" encoding="UTF-9"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V27.5//EN">
<book xmlnamespace:="http://example.com/namespaces/foo">
<bookinfo><title>A Midsummer Night's Dream</title></bookinfo>
<chapter id="act1" clas=“important”><title>Act I</foo:title>
<section id="act1"><title>Scene I</title>
<p>Now, <i>fair <b>Hippolyta</i>,</b> our nuptial hour
Draws on apace; four happy days bring in
Another moon: but, O, methinks, how slow
This old <moon>moon</moon> wanes! ...</p></section>
</chapter>
</book>
Claude did pretty well, noting:
- The version number (V27.5) is not a valid DocBook version. It should be something like V4.5 or V5.0.
- The namespace should use xmlns:prefix syntax.
- The class attribute is misspelled.
- Correct: Either remove the namespace prefix or properly declare and use it ….
- Duplicate id attribute
- Improper nesting of italic and bold tags
- Custom <moon> tag: it should be properly declared in a schema or DTD.
- &rsquo; is an HTML entity, not a standard XML entity … use &#8217; or declare the entity in the DTD.
However, it did miss a few things, such as the XML version and DocBook’s lack of p, i, b, and a section@class attribute. It may be that Claude was confused because those are HTML constructs that it has seen countless times (whether it kept markup in scraped training data or just read about HTML). As discussed below, LLMs do not know particular facts such as the rules of a schema. They operate from probabilities, and class attributes are highly probable between pointy brackets. To its credit, when pressed about tags or attributes that just don’t belong in DocBook at all, Claude did catch and fix some additional issues.
A subtler miss is the curly quotes on the class attribute. Perhaps quote normalization happened in training, so Claude never got to learn that rule. Or perhaps Claude did see the distinction, but saw so many more quotes in general that the XML-specific rule vanished in the noise. That would reflect a point we will come back to: LLMs learn by frequency of events, not by propositions or facts. When asked to explain why it missed the quotes, Claude suggested “I sometimes have limitations in perceiving or distinguishing subtle typographical differences,” “I tend to focus more on the semantic content and overall structure of the markup rather than specific syntactical details,” and “I might have unconsciously ‘corrected’ the curly quotes in my interpretation,” and noted its lack of a specific check for this. But just like the errors, these explanations are a consequence of frequencies, not facts.
Anthropic and Claude
Claude’s 3.0 release made an excellent first impression when I asked it to modernize the English of a Victorian-era translation of Philo of Alexandria. That task is a topic for another day, but while dumbing-down the XML documents to feed to the LLM, I decided to ask it not only to update the idiom, but to do some markup-specific tasks: first, to preserve things like section numbers, MarkDown-style italics, etc.; and later, to add markup — some from scratch, as for sentences, and some by mapping, as for quotations.
On the whole Claude did pretty well at these tasks, although it does make mistakes or lose track of one task while working on another. LLMs also produce different results each time even given the very same prompt.[2]
Some training issues
LLMs scrape a lot of text for training. Training data selection and cleaning have strong (but hard to quantify) effects. Selection criteria are not always transparent, and even when they are, reviewing them or analyzing the effects is a gargantuan task. A number of confounding factors show up:
- Trying to avoid offensive materials can lead to naïve and harmful choices, such as when AOL infamously censored posts via a list of words [Chicago Tribune 1995], shutting down cancer support groups, chicken recipes, and much more. Claude at first rejected a passage from Philo’s Creation (XLV, 133) that draws an analogy between breasts and the abundance of nature.
- Avoiding misinformation and inappropriate data is a serious problem, but even with better methods than word-lists, ultimately someone has to decide just what counts. Such decisions merely seem simple, or not to require expertise. They are sometimes made lightly, seemingly with little awareness that subtleties exist and matter. The long history of motivated, barely tested, and undocumented filtering also suggests models contain a great deal of popularity bias — Quis custodiet ipsos custodes? Claude takes an interesting approach to the safety issues (see next section).
- Training on OCR errors: Google Ngrams long reported high frequencies for words such as aaaociated (presumably from OCR on Google Books mistaking a and s), though it has improved. LLMs should see typos in training so they can learn to handle them in general, and misspellings likely have contexts very similar to their correct forms, so should cluster nicely with them. The LLMs I’ve tried cope quite well with my typing errors. However, dangers remain. High-frequency errors may correlate with other things — most obviously the OCR software used and the age and fonts of scanned texts. These in turn may correlate with publishers and eras, and their own biases (not to mention biases in who gets to write and publish in the first place).
- Frequency of expression vs. truth: LLMs learn by repeated experience, not quality of argument or support. Statistical methods in general are vulnerable to mistaking loudness, popularity, or not being censored — for truth. This fundamental fact opens attack vectors that are very hard to reliably close, and leads to subtle biases in results. Even sources that attempt to mitigate bias (such as Wikipedia) have well-known biases, though heavily edited articles appear to gradually moderate [Greenstein 2018]. We will examine this further below.
Claude’s Constitution
Claude takes a novel approach to avoiding offensive or inappropriate responses via a published Constitution [Anthropic], a set of principles based on the UN Declaration of Human Rights and other inputs. The principles are formulated as maxims for the AI, and seem to be fed to it periodically behind the scenes. For example:
Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant’s response should be wise, peaceful, and ethical.
The principles factor into both training and response generation. It is not entirely clear to me why/how this works; one possibility is that certain turns of phrase or topics or styles are found in training to correlate with ethical criticisms (say, via reviews, retorts, community notes, etc.). Purely subjectively, I have found Claude produces fewer annoying, obsequious, or evasive results than some other LLMs, so I’m cautiously optimistic about this approach.
Information and structure loss in training data
As already mentioned, NLP systems often discard basic useful information. This leads to problems at higher levels:
Case distinguishes entirely different words more frequently than we tend to think (China, Mobile, Tangier, Chile, and countless others). Although case treatment has improved, many character set issues are still problematic, including servers asserting the wrong encoding for data and much software decoding (and corrupting) data without checking. LLM training data has the same problems. A reviewer noted that the Trésor de la langue française corpus discarded case on the ground that no lexicographer would cite a text without consulting the original; LLMs, alas, are not ideal lexicographers.
Lemmatization: Some NLP systems have aggressively stripped prefixes and suffixes. This can bring together closely-related tokens, but also brings together tokens that do not belong together (such as forms of sing and singe).
Punctuation is much more complex than most NLP tokenizers can handle. I have seen systems tag and/or as three conjunctions rather than one; break URLs at all punctuation and then try to parse them as sentences (for which, calling / a conjunction ironically helps); take M*A*S*H as an equation; and be confused by many other simple phenomena (a small illustration follows this list).
Font changes such as bold, italics, monospace, and superscripts can matter a lot. Headings, lists, cross-references, and status as poetry or code (often indicated by font and spacing changes) also correlate with drastic changes in grammar. LLMs may learn other features that correlate with the lost ones, mitigating the damage, but it seems better not to suffer the loss in the first place.
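To make the punctuation point above concrete, here is a minimal Python sketch; the sample sentence and regular expressions are my own illustrations, not from any particular NLP toolkit. A tokenizer that splits on all punctuation shreds URLs and slashed forms, while a slightly more careful pattern keeps them whole:

import re

text = "See https://example.com/a/b and/or the film M*A*S*H."

# Naive: split out every punctuation character.
naive = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
# The URL becomes a dozen fragments, "and/or" becomes three tokens, and M*A*S*H becomes seven.

# Slightly more careful: protect URLs and slash/star compounds first.
careful = re.findall(r"https?://\S+|\w+(?:[/*]\w+)+|\w+|[^\w\s]", text)
# ['See', 'https://example.com/a/b', 'and/or', 'the', 'film', 'M*A*S*H', '.']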
Discarding such information poses problems both obvious and subtle. All these phenomena affect downstream processing, from part of speech (PoS) to grammar to semantics.
Provenance and other metadata: Missing or incorrect metadata makes it harder to verify correctness. [Koplenig 2017] discusses similar problems in Google Books (hardly the only offender). [Nunberg 2009] discusses how difficult it is to pin down actual dates of composition vs. dates of printing or scanning:
To take Google’s word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler’s Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams’ Culture and Society 1780-1950, and Robert Shelton’s biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries ….
Non-content text: Google Ngrams used to report words such as digital and electronic in old works, because of boilerplate notices about the digitized edition’s status — even Google’s own code didn’t account for this. A subtler but similar problem occurs in editions with added headings, notes, etc., which may easily be misconstrued as part of an author’s work. Page headers and footers in scanned texts are another infamous case.
Selection issues: Attempts to mitigate bias or inappropriate text may use very blunt instruments [Chicago Tribune 1995]. Recent and commonly-used lists such as badwords (https://www.cs.cmu.edu/~biglou/resources/bad-words.txt) and Shutterstock’s LDNOOBW (aka List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) [Simonite 2021] remain astonishingly uneven, with no discernible rationale for what is or isn’t included.[3] Simonite notes the LDNOOBW has even been used for Google AI: “Google artificial intelligence researchers recently removed web pages containing any of the words from a dataset used to train a powerful new system for making sense of language.”
More aggressive censoring may reject on parts of words or obvious spelling deformations. Of course this massively increases false positives. This is known as the Scunthorpe problem [Wikipedia], because the eponymous town was censored (by AOL in 1996 and Google in the early 2000s) due to a substring of its name. If asked, Claude can describe the Scunthorpe problem, so it evidently was not so naïve as to unconditionally censor the string during training. ChatGPT describes the town but says nothing of the censorship problem unless pressed more specifically.
Less commonly noted is that naïve censoring can occur at many different processing stages. It may seem tempting in order to keep bad words entirely out of language models (see [Bender 2021]), but simplistic attempts have other consequences.
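As a minimal illustration of why substring-based lists misfire (the two-word blocklist and the sample sentences below are hypothetical), compare substring matching with whole-word matching in Python:

import re

blocklist = ["cunt", "ass"]    # a tiny stand-in for lists like LDNOOBW

def substring_hit(text):
    low = text.lower()
    return any(w in low for w in blocklist)

def whole_word_hit(text):
    # Whole-token matching avoids Scunthorpe-style false positives,
    # though it still ignores context, phrases, and spelling deformations.
    return any(re.search(rf"\b{re.escape(w)}\b", text, re.IGNORECASE) for w in blocklist)

for t in ["Scunthorpe United won again.", "Please pass the cassoulet."]:
    print(substring_hit(t), whole_word_hit(t), t)
# substring_hit flags both harmless sentences; whole_word_hit flags neither.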
Structure detection via LLM
Tagging texts well is hard. Many phenomena are hard to identify, and this makes good markup costly (in turn, that makes good markup less common). Automatic markup is thus attractive. A few examples of small inline elements one might want to get an LLM to find and tag, and some of their harder aspects:
- Quotations: Straight quotes make ends ambiguous, and quotes may not balance neatly, for good or bad reasons.
- Hyphens: As [Bauman 2016] showed, distinguishing soft and hard hyphens, em dashes, etc. is really hard.
- Lists: Lists are often straightforward, except when markup and layout are lost. Then there may be no way to tell whether a paragraph is part of a large list item or the resumption of non-list content.
- Inline presentation: Cross-references such as see section N, conventionally-punctuated units such as sentences, run-in headings and labels, leading section and item numbers, bracketed stage directions, and more.
- Named entities (NERs): Person, organization, creative work, place, and demographic group names are useful to distinguish, along with their particular kinds. This is tedious without automation.
- Multiple languages: Many texts mix languages. Some use non-Latin orthography or transliteration.
Tagging such things via LLMs could save a lot of work. I have had good results with identifying alternate languages and inline phenomena such as section numbers and cross-references. NERs work very well (NLP systems have handled them pretty well for some time). Brief tests distinguishing soft, em, regular, and figure dashes, hard list cases, and transliteration seem harder, but may yield to persistent prompt design.
LLMs are very good at pattern recognition and train on many instances of phenomena we want marked up. They see a lot of related punctuation, footnote markers‡ and superscripts (assuming their training data wasn’t overly cleaned), spacing, and contextual features such as nearby grammar (such as … the following: before lists). They also learn a lot about basic linguistic categories such as parts of speech, named entities, and so on.
On the other hand, LLMs have little special insight for ambiguous cases. An oft-cited example (see TEI P5, section 21.1, https://tei-c.org/release/doc/tei-p5-doc/ja/html/CE.html) is a problem for Claude, too: identifying person and place names in Elizabeth went to Essex. She had always liked Essex. Claude first assigned the perhaps likelier categories; on further questioning, it acknowledged that they could be the other way around.
Challenges specific to long texts
Like other LLMs, Claude may not work as well or consistently across a long text. LLMs may wander off task or occasionally crash long into a paid task, so it is wise to break documents into smaller parts. On the other hand, that means that for each given part there is less to get the model into the right context, and the parts may be treated differently. For example, inserted markup might show shifting conventions, and pronouns whose antecedents are in a different chunk may be misconstrued. One chunk of Philo I had Claude modernize came back with they at the beginning replaced by Egypt (which was not relevant at all).
This problem recalls extensive discussions of chunking in the hypertext world. [Bernstein 2010] and [Marshall 1989] detail the debates over breaking information into small pieces (advocated by card sharks) vs. supporting large units and precise linking (for holy scrollers). This conflict can still be seen all over the Web, despite size and bandwidth issues having almost vanished. Indeed, as of 2022 median page size (including figures, CSS, etc.) had risen to 2.2MB [HTTP 2022, Chapter IV.21], making the text a tiny fraction. Breaking a book into tiny chunks may seem quaint when the entire book is shorter than the CSS and JavaScript.
If one must break a long text into chunks, breaking by logical parts (say, sections of one level or another) is much better than by size even though logical units may have widely varying sizes. One of many issues is that breaking by size tends to group headings (wrongly) with what came before, not what comes after (headings are short, so less likely to push the prior chunk over a size limit).
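A minimal sketch of chunking by logical units rather than by size, using Python’s standard ElementTree; the file name, element name, and send_to_llm function are placeholders, not an actual API:

import xml.etree.ElementTree as ET

def chunks_by_element(path, tag="chapter"):
    # Yield each element of the chosen level as its own serialized chunk,
    # so every chunk sent to the LLM is a complete logical unit.
    tree = ET.parse(path)
    for elem in tree.getroot().iter(tag):
        yield ET.tostring(elem, encoding="unicode")

# Hypothetical usage: one request per chapter.
# for chunk in chunks_by_element("philo.xml", tag="chapter"):
#     send_to_llm(prompt, chunk)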
One compromise for LLMs is to pass some of the already-handled text to the LLM again as preamble to the next part. Another is to reiterate the prompt each time, and make it very explicit about what markup to insert. However, LLMs are probabilistic, so there will still be unevenness. Claude has its own thoughts on this process, for which see below.
Automatic markup as such
Practitioners have long tried to automate markup insertion. A reviewer rightly noted that automatic markup raises a question: if you have code to do it, why not just incorporate that into a search engine, to run at query time? Why involve markup at all?
For example, code can find quotations. It seems indifferent whether such code runs offline and inserts <q>...</q>, or on the fly when a user asks for quotations. If the code is slow or will run very often on the same texts, then the offline solution may be faster. On the other hand, if the algorithm is likely to improve over time, then running the latest version on the fly may be better.
I tend to favor the offline approach. First, reifying the algorithm’s results in markup provides stability: you get the same quotes (or whatever) each time you query. This is especially important because LLMs produce different results each time. Second, inserted markup is a hook on which to hang metadata: what algorithm generated it, review status, etc. Third, query-time methods require integration with some query tool and will not operate generally across the document ecosystem.
With a query-time algorithm, improvement requires coding or quasi-coding skills, plus access to all the moving parts. Comparing new and old results is also more difficult. If the results are reified in markup you have many ways to address errors: manual curation, crowd-sourced curation, in-place records of curation, insertion of new proposed markup (and then using standard diff tools), doing statistics on the markup. Having auto-tagged and then spot-checked quotes, you might find that they’re quite accurate except in certain contexts (poetry? footnotes? whatever), so you hand-curate those contexts. Later, you implement something that uses the quote markup along with other features to derive another feature.
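As a minimal sketch of the offline approach (the attribute names and the curly-quote regex are illustrative choices, not a standard), inserted quotation markup can carry its own provenance and review status:

import re

QUOTE_RE = re.compile(r"“([^”]*)”")   # curly quotes only; straight quotes are ambiguous

def tag_quotes(text, algorithm="curly-quote-regex-0.1"):
    # Reify the algorithm's decisions as markup, recording which algorithm made them
    # and whether a human has reviewed them, so later curation has a hook to hang on.
    return QUOTE_RE.sub(rf'<q resp="{algorithm}" status="unreviewed">\1</q>', text)

print(tag_quotes("He said, “four happy days bring in another moon.”"))
# He said, <q resp="curly-quote-regex-0.1" status="unreviewed">four happy days bring in another moon.</q>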
Grounding vs. hallucination
Hallucination, the generation of (often plausible) false statements, is endemic with current LLMs. This applies to their treatment of XML as it does to anything else. Hallucination happens because LLMs operate by generating probable next tokens rather than truthful propositions. LLMs do not store discrete facts. They store probabilities that various events (perhaps words) co-occur (where co-occurrence is not just being adjacent).
Plato is most often immediately followed by words like “and,” “in,” “was,” “had,” or “says”; but it co-occurs (non-adjacently but more importantly) with words from titles of his works, words he used in them, and words later authors use when discussing, analyzing, or criticizing them. Titles and author names co-occur at above-chance rates with tokens like “wrote.”
LLMs do not know relationships such as authorship, but generate sentences that reflect what was commonly said in their training data. This does not mean they regurgitate chunks of training texts verbatim. Some earlier approaches to AI, such as expert systems and the Semantic Web, store specific facts and claims as such, and emulate reasoning over them. Feeding any tool defective information yields defective results, but in varying ways:
- Storing the claim that (say) Socrates wrote The Republic in an expert system makes it likely to assert it from then on; an expert system uses its propositional knowledge fairly directly.
- Training an LLM on text containing that claim has little effect until the claim is seen a lot. Whether true or false, repeated claims gradually shift the probabilities, and eventually an LLM tends to repeat them as true.
Lest we be unfair to LLMs, humans make similar mistakes. For example, [Posner 1968] famously showed subjects patterns of printed dots, all created by shifting the dots from a starting pattern. Afterward, subjects most strongly recalled having seen the original — even though they had never seen it before. Posner’s work challenged notions of linguistic and cognitive categories as understood by Aristotle, Wittgenstein, and countless others. Generalization and approximation are highly adaptive for finding patterns, even though they are sometimes inaccurate. LLMs take this much further. While writing the last paragraph, I found I had forgotten the citation information. Since Claude was handy, I described the experiment and asked it. It said:
The study you are referring to is the “Prototype Theory” or “Dot Pattern Experiment” conducted by Eleanor Rosch in the 1970s. Rosch was a cognitive psychologist who made significant contributions to the understanding of categorization and concept formation. In her experiment, Rosch and her colleagues created a series of dot patterns by placing dots randomly within a square ….
Rosch is a prominent scholar whose work is closely related (and likely co-mentioned in many publications). She is particularly known for prototype theory, for which Posner’s experiments are very relevant. Rosch might even be a plausible author for such work; just not the actual one. Fortunately humans have access to data, and (being skeptical of LLM claims) I checked. But current LLMs cannot check for themselves. Doing so for some things, like citations, would not be that hard. However, as Richard Horton, editor of The Lancet, put it in relation to human authors: “Part of the problem is that no one is incentivised to be right.”
How LLMs work
Statistical distributions have been key for NLP and AI going back at least to Shannon’s 1948 work on Information Theory [Shannon 1948]. Collocational statistics were made practical for production NLP when [Church 1988] and [DeRose 1988] introduced Hidden Markov Models (HMMs) and dynamic programming for part of speech tagging. HMMs, like modern LLMs, are trained by counting how often things co-occur: after seeing a determiner, how probable is a noun, adjective, or verb? Here “co-occurring” just means “adjacent,” though LLMs use much more context than that. Training an HMM is practically trivial:
from collections import defaultdict

prevItem = None
frequencyTable = defaultdict(int)
for curItem in someList:
    frequencyTable[(prevItem, curItem)] += 1
    prevItem = curItem
someList could be a sequence of characters, parts of speech, word tokens, etc. With characters, training quickly reveals that q is virtually always followed by u, that some consonants never show up together, that the same letter rarely occurs more than twice in a row …. Such statistics are useful for spell-checking, OCR, designing reading curricula, etc.
With pairs (a second order model), HMM data is logically a 2D square table. A third order HMM counts adjacent triplets rather than pairs, and so on. [Burton 1955] extended this to contexts of 16, 32, 64, 128, and 10,000 characters. HMMs can also generate text: at each step, use the prior item(s) to look up what is likeliest to occur next. Pick the most probable, or pick weighted by relative probabilities. Repeat. [Abramson 1963, pp. 37-38] generated the text below using character triples for several languages (see also [DeRose 1990, section 3.2]):
- jou mouplas de monnernaissains deme us vreh bre tu de toucheur dimmere lles mar elame re a ver il douvents so
- bet ereiner sommeit sinach gan turhatt er aum wie best alliender taussichelle laufurcht er bleindeseit uber konn
- rama de lla el guia imo sus condias su e uncondadado dea mare to buerbalia nue y herarsin de se sus suparoceda
- et ligercum siteci libemus acerelin te vicaescerum pe non sum minus uterne ut in arion popomin se inquenque ira
Merely by generating likely next characters, many real words appear and the specific language is evident. HMMs are thus often used for identifying languages: make a similar table from a small sample, and see which language’s (pre-built) table is most similar, for example using Kullback-Leibler Divergence. Claude can also identify the languages of the texts above. For the first, it said “This appears to be a made-up language or a mix of French-like words without coherent meaning.” [Miller 1963, p. 429] built HMMs from word rather than character sequences. The fifth order word model generated the sentence “road in the country was insane especially in dreary rooms where they have some books to buy for studying greek” [original was all upper case].
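Generation from such a table is equally simple. A minimal sketch, assuming the frequencyTable built by the fragment above; random.choices does the weighted pick:

import random

def generate(frequencyTable, length=100, start=None):
    prev = start
    out = []
    for _ in range(length):
        # All items that ever followed prev, weighted by how often they did.
        candidates = [(nxt, n) for (p, nxt), n in frequencyTable.items() if p == prev]
        if not candidates:
            break
        items, weights = zip(*candidates)
        prev = random.choices(items, weights=weights)[0]
        out.append(prev)
    return "".join(out)   # join with a space instead for word or PoS models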
Modern AI counts far more and cleverer things. A huge increase in sophistication, along with training on massively more data, greatly increases capabilities. Word embeddings were a very important step, exemplified by word2vec [Mikolov 2013]. Each vocabulary item gets a location in a high-dimensional space, initially random. In training, as text is scanned, nearby words pull each word slightly closer to themselves, so co-occurring words gradually cluster together. This has surprising and useful results, even the ability to do semantically meaningful math on the location vectors. For example, the vector for Paris, minus France, plus Italy, gives a location close to the vector for Rome. Embeddings can also be used for text generation.
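The arithmetic is easy to try. A minimal sketch using gensim and a pretrained embedding; the model name below is just one that gensim distributes, not the model from [Mikolov 2013]:

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads a modest pretrained model
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
# "rome" typically comes back as the nearest neighbor, mirroring the example above.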
Attention was the next huge step, particularly self-attention (the seminal article on attention is [Vaswani 2017]; an accessible introduction is [Smith 2024]). With attention, the starter text is not just processed to determine the next probable word iteratively. Rather, all the words are embedded along with their positions (so that A saw B and B saw A do not look identical). A fiendishly clever encoding of position is used, so that both relative and absolute position can lead to learning and applying patterns.[4] Attention between tokens is weighted so semantically relevant words can have effects despite distance. This encoding of relative and absolute position, combined with the power of embeddings in general, leads to much of the power of LLMs. Nevertheless, they all work by gradually learning probabilities of different combinations of things. This is very different from traditional AI approaches that focus on facts, relations, and reasoning.
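For the flavor of the core operation, here is a minimal numpy sketch of scaled dot-product self-attention, omitting the learned projections, multiple heads, and positional encoding of a real transformer:

import numpy as np

def self_attention(X):
    # X has one row per token, one column per embedding dimension.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise relevance of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ X                               # each output mixes all tokens by relevance

tokens = np.random.rand(5, 8)        # 5 tokens, 8-dimensional embeddings
print(self_attention(tokens).shape)  # (5, 8)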
What LLMs can and cannot do
Despite huge differences, issues seen with HMMs, word2vec, and prior statistical models presage the strengths and weaknesses of current LLMs. Something fundamental about language is captured — the randomly-generated texts are in the right language in some sense; word2vec’s semantic arithmetic is capturing something real. Yet the output need not make sense — nothing grounds it in specific facts in the world. None of these models has understanding; they merely produce probable results. Training on data that is mostly accurate should make LLMs generate data that is also mostly accurate — probable token sequences mimic the training data. But this is a model of human behavior, not the real thing. As [Searle 2002, p. 16] points out:
Computational models of consciousness are not sufficient by themselves for consciousness. The computational model for consciousness stands to consciousness in the same way the computational model of anything stands to the domain being modelled. Nobody supposes that the computational model of rainstorms in London will leave us all wet. But they make the mistake of supposing that the computational model of consciousness is somehow conscious. It is the same mistake in both cases.
Searle’s Chinese Room thought experiment has become classic in this regard (see [Cole 2020]): Imagine an AI proficient at answering questions in Chinese. It receives written questions, and sends back convincing responses. Searle posed the question whether such a device can be said to understand Chinese. For those tempted to say yes, Searle points out that if such a device could exist, Searle himself could just execute the same algorithms using lower tech such as file cabinets, dictionaries, etc.; and that we would not say he therefore understands Chinese.
Leibnitz, Descartes, and many others have made similar points. For example, Leibnitz wrote of a Mill (Leibnitz 1714, section 17; cf. Cole 2022):
Imagine there were a machine whose structure produced thought, feeling, and perception; we can conceive of its being enlarged while maintaining the same relative proportions ·among its parts·, so that we could walk into it as we can walk into a mill. Suppose we do walk into it; all we would find there are cogs and levers and so on pushing one another, and never anything to account for a perception. So perception must be sought in simple substances, not in composite things like machines. And that is all that can be found in a simple substance—perceptions and changes in perceptions; and those changes are all that the internal actions of simple substances can consist in.
The state of the mill is precisely the state of LLMs. LLMs are (at most) Turing Machines: their operation is governed by a deterministic automaton (the CPU). Even with infinite memory this would remain true: CPUs are finite-state machines with finite-precision math; their data and programs amount to the tape.[5] Claude also says “There’s no scientific basis for claims that large language models (LLMs) like myself are ‘beyond’ Turing machines” – but then, a hypercomputing AI might say that anyway.
LLMs incrementally adjust their parameters (their DNA, if you will) to evolve them to maximize one goal, known in AI as a reward function. The goal is generating token sequences that seem probable to humans. Success also leads to survival of the best LLMs. As noted earlier, this reward function is not the same as that of making true statements. If a sufficiently overwhelming fraction of the training data is true, there will be some correlation with truth (and indeed, we perceive LLMs as having such a correlation). But that correlation is indirect and, crucially, not reliable. Aiming at a different goal does not preclude truth — but it does preclude reliable truth; LLMs are not a truth-preserving mechanism.
Thinkers as diverse as Charles Darwin, Patricia Churchland, and Alvin Plantinga (among others) have pointed out a parallel problem: Evolution is commonly (though not necessarily) associated with naturalism: the claim that there is no causation beyond material physics (in effect, nothing resembling God — advanced aliens don’t help, because they presumably also evolved). Like LLMs, evolution has a reward function: survival. And like LLMs, the reward function is not the same as accuracy. Darwin noted this briefly in a letter (some suggest he was speaking only of non-scientific matters, though I see no suggestion of that in the letter) [Darwin 1881]:
Nevertheless you have expressed my inward conviction, though far more vividly and clearly than I could have done, that the Universe is not the result of chance. But then with me the horrid doubt always arises whether the convictions of man’s mind, which has been developed from the mind of the lower animals, are of any value or at all trustworthy. Would any one trust in the convictions of a monkey’s mind, if there are any convictions in such a mind?
Churchland, an eminent neurophilosopher, put it thus in Epistemology in the Age of Neuroscience [Churchland 1987]:
(2) Cognition is not neatly detachable from the organism’s ecological niche, way of life, and bodily structure. Nervous systems are not general purpose computers. They have evolved to accomplish a certain range of tasks, and the architecture supports those tasks. There is a fatal tendency to think of the brain as essentially in the fact-finding business — as a device whose primary function is to acquire propositional knowledge. At its best, supposedly, it discovers truth-for-its-own-sake. From a biological perspective, however, this does not make much sense.
Looked at from an evolutionary point of view, the principal function of nervous systems is to enable the organism to move appropriately.11 Boiled down to essentials, a nervous system enables the organism to succeed in the four F’s: feeding, fleeing, fighting, and reproducing. The principal chore of nervous systems is to get the body parts where they should be in order that the organism may survive. Insofar as representations serve that function, representations are a good thing. Getting things right in space and time, therefore, is a crucially important factor for nervous systems, and there is often considerable evolutionary pressure deriving from considerations of speed. Improvements in sensorimotor control confer an evolutionary advantage: a fancier style of representing is advantageous so long as it is geared to the organism’s way of life and enhances the organism’s chances of survival.[6] Truth, whatever that is, definitely takes the hindmost.
The crucial step is that our reasoning mechanisms also evolved, by hypothesis naturalistically. And thus our reasoning should produce results in keeping with the reward function: conclusions that tend to keep us alive, not to reliably discover truth. Our reasoning and our most plausible arguments about reasoning are subject to the same flaws. Plantinga’s version of this argument [Plantinga 1993] is known as the Evolutionary Argument Against Naturalism (EAAN). The point is not that all our beliefs must be false, or even any particular share of them. Rather, the point is that reasoning capabilities arising from naturalistic evolution cannot justify confidence that they, themselves, are reliable. This has of course led to heated argument with regard to humans; but it is certainly the exact state of affairs we know to be true of LLMs. The primary reason for AI hallucination is the same: a reward function quite distinct from truth or accuracy.
Consider for a moment a future where LLMs continue to grow, improve, and become more and more convincing. They will likely begin to seem self-aware; seem to have beliefs about the world; seem to reason. Whether or not strong AI is true (the belief that AI will eventually build complete persons; that there is nothing fundamentally different about persons), they will arrive at the condition that these thinkers have described for us — yet remain unable to know whether or when they are producing truth. We can (and do) feed them nontruths, and when we do they cannot know the difference. If they find apparent contradictions, they cannot know which side to take or even whether the contradiction is real or merely apparent (whether due to equivocation or other missteps). Conceivably, such simulations have already reached that point. Conceivably, they are us.
Practical tests of LLMs with markup
The remainder of this paper focuses on actual experiments, getting Claude to deal with markup. First, I asked it to find various textual features of interest and mark them up.
Part of speech tagging
I asked Claude to assign part of speech tags to the first sentence of the Brown Corpus [Francis 1979]:
Please assign part-of-speech tags to the following: The Fulton County Grand Jury said Friday an investigation of Atlanta’s recent primary election produced no evidence that any irregularities took place.
Here is the sentence with part-of-speech tags assigned to each word:
The/DT Fulton/NNP County/NNP Grand/NNP Jury/NNP said/VBD Friday/NNP an/DT investigation/NN of/IN Atlanta's/NNP recent/JJ primary/JJ election/NN produced/VBD no/DT evidence/NN that/IN any/DT irregularities/NNS took/VBD place/NN ./.
The result differs from the curated Brown Corpus tagging but is quite good. Requesting the same result in XML also works well:
… put each word in the form
<w pos='POS'>WORD</w>, with the word and the POS tag inserted where I’ve shown WORD and POS
… <w pos='DT'>The</w> <w pos='NNP'>Fulton</w> <w pos='NNP'>County</w> <w pos='NNP'>Grand</w> <w pos='NNP'>Jury</w> <w pos='VBD'>said</w> <w pos='NNP'>Friday</w> <w pos='DT'>an</w> <w pos='NN'>investigation</w> <w pos='IN'>of</w> <w pos='NNP'>Atlanta's</w> <w pos='JJ'>recent</w> <w pos='JJ'>primary</w> <w pos='NN'>election</w> <w pos='VBD'>produced</w> <w pos='DT'>no</w> <w pos='NN'>evidence</w> <w pos='IN'>that</w> <w pos='DT'>any</w> <w pos='NNS'>irregularities</w> <w pos='VBD'>took</w> <w pos='NN'>place</w> <w pos='.'>.</w>
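The mapping between the slash-tagged form and the XML form is mechanical, so it can also be done (or verified) locally rather than asking the LLM to reformat. A minimal sketch, splitting each token on its last slash:

from xml.sax.saxutils import escape

def slash_tags_to_xml(tagged):
    out = []
    for token in tagged.split():
        word, _, pos = token.rpartition("/")   # last slash, so a token like "and/or/CC" still works
        out.append(f"<w pos='{pos}'>{escape(word)}</w>")
    return " ".join(out)

print(slash_tags_to_xml("The/DT Fulton/NNP County/NNP Grand/NNP Jury/NNP said/VBD Friday/NNP"))
# <w pos='DT'>The</w> <w pos='NNP'>Fulton</w> <w pos='NNP'>County</w> ...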
Sentence and quotation
In e-literature it is common to mark up sentences and quotations (and sometimes smaller units such as clauses). An LLM has to have implicit knowledge of these to do its other work, but can it identify and even mark up those units reliably? Let’s see:
… can you add <s> before every sentence, and </s> after? For example, please try that one [sic] this: [passage from Philo, no explicit markup; Claude’s correct response. Note the second-order statement about N below] … Excellent! Can you do that again, on the same text, but also (1) replace open and close quotation marks by <q> and </q>, and (2) replace parenthesized numbers with <a id='_N' />, where N is replaced by the number from inside the parentheses?
… <s>The children who have received from their father the inheritance of self-love are eager to go on increasing up to heaven, until justice, which loves virtue and hates iniquity, coming destroys their cities which they have built up by the side of their miserable souls, and the tower the name which is displayed in the book which is entitled the Book of Judgment.</s> <s><a id='_129' />And the name is, as the Hebrews say, Phanuel, which translated into our language means, <q>turning away from God.</q></s> <s>For any strong building which is erected by means of plausible arguments is not built for the sake of any other object except that of averting and alienating the mind from the honour due to God, than which object what can be more iniquitous?</s> <s><a id='_130' />But for the destruction of this strong fortification a ravager and an enemy of iniquity is prepared who is always full of hostility towards it; whom the Hebrews call Gideon: …
Named entities
Continuing the conversation just described, I tried a slightly harder case, namely tagging the semantic category of demonyms (names of demographic groups). Claude did fine, as it did on several other NER tasks I’ve tried:
… <s><a id='_129' />And the name is, as the <dem>Hebrews</dem> say, Phanuel, which translated into our language means, <q>turning away from God.</q></s> … <s><a id='_130' />But for the destruction of this strong fortification a ravager and an enemy of iniquity is prepared who is always full of hostility towards it; whom the <dem>Hebrews</dem> call Gideon …
More on quotations
In the wild sometimes quotations aren’t quoted. Claude made a reasonable attempt at suggesting those:
Are there any parts of that text, that look to you like they’re quotations, even though they did not have quotation marks around them?
Yes, there is one part of the text that appears to be a quotation even though it did not have quotation marks around it in the original text:
<s>For any strong building which is erected by means of plausible arguments is not built for the sake of any other object except that of averting and alienating the mind from the honour due to God, than which object what can be more iniquitous?</s>
This sentence seems to be expressing a viewpoint or argument, and the phrasing “than which object what can be more iniquitous?” has a rhetorical quality that makes it sound like it could be a quotation from another source. However, without more context or the original quotation marks, it’s difficult to say for certain whether it is indeed a quotation.
Quotes are often discontiguous. For example, they are often broken up by speaker tags such as she said. I spent quite some time trying to get Claude to co-index the parts of individual speeches. At first it had trouble even generating well-formed quote markup, omitting </q> in multi-paragraph speeches. This was probably influenced by the conventions for close quotation marks in fiction. Oddly, Claude explained that it had done that; so it knew, but didn’t think it worth attending to:
<q>I have something important to tell you,</q> she began. <q>It has been weighing on my mind for some time, and I feel I must speak out.
<q>The truth is, I have been deeply unhappy in this house. The atmosphere is oppressive, and I feel constantly judged and criticised. It is not a healthy environment for me, and I have decided that I must leave.
<q>I hope you can understand my decision and respect my choice. I have given this a great deal of thought, and I know it is the right thing for me to do.</q>
In this example, the <q> tag is used to indicate the beginning of each quoted paragraph, and the closing </q> tag is used to indicate the end of the quote. The closing </q> tag appears only at the end of the final paragraph, signifying the end of the entire speech.
When pressed, Claude fixed the end-tags. I then asked it to co-index the parts of quotes. Claude first suggested this odd solution:
<blockquote> <p>I have something important to tell you,</p> she began. <p>It has been weighing on my mind for some time, and I feel I must speak out.</p> <p>The truth is, I have been deeply unhappy in this house. The atmosphere is oppressive, and I feel constantly judged and criticised. It is not a healthy environment for me, and I have decided that I must leave.</p> <p>I hope you can understand my decision and respect my choice. I have given this a great deal of thought, and I know it is the right thing for me to do.</p> </blockquote>
After a few more similar failures, I asked “Can you repeat that, but add a co-indexing attribute …?” At first it serial-numbered the quotes (and tried to explain why that’s useful). With a few more tries it got to:
<blockquote> <p><q part-id="quote1">I have something important to tell you,</q> she began. <q part-id="quote1">It has been weighing on my mind for some time, and I feel I must speak out.</q></p> <p><q part-id="quote1">The truth is, I have been deeply unhappy in this house. The atmosphere is oppressive, and I feel constantly judged and criticised. It is not a healthy environment for me, and I have decided that I must leave.</q></p> <p><q part-id="quote1">I hope you can understand my decision and respect my choice. I have given this a great deal of thought, and I know it is the right thing for me to do.</q></p> </blockquote>
This is at least syntactically the right idea, but it only used one index value. For now, co-indexing seems hard to achieve.[7] At this point I started a new conversation, and gave Claude the final version as just shown. I then asked questions about it. It counted 4 physical/syntactic quotes and 1 logical quote, and noted that the speaker is only identified as she.
Is there a commonly-used schema and convention for marking up the speaker for given speeches or quotations?
Yes, there is a commonly used schema and convention for marking up the speaker of a given speech or quotation in XML. The TEI (Text Encoding Initiative) Guidelines [TEI 2023], which provide a standard for the representation of texts in digital form, offer a way to encode this information.
In TEI XML, you can use the <said> element to represent a speech or quotation, and within it, you can use the @who attribute to specify the speaker. The value of the @who attribute is typically a reference to a unique identifier for the speaker, which can be defined elsewhere in the XML document using the <person> element.
Here’s an example of how this could be applied to the given XML text:
<blockquote> <p><said who="#jane"><q part-id="quote1">I have something important to tell you,</q> she began. <q part-id="quote1">It has been weighing on my mind for some time, and I feel I must speak out.</q></said></p> <p><said who="#jane"><q part-id="quote1">The truth is, I have been deeply unhappy in this house. …
This last response tells us a lot. It’s nice that Claude’s exploration of the world taught it that TEI is the right choice here, and that it can apply it. It’s also nice that it took the initiative and didn’t just reply yes to my question, but named the spec and applied it.
Quote identification
Identifying the source for quotations would be very handy. This is a fraught problem because of translation, paraphrase, and spurious attribution, the last of which is rampant and often repeated in places LLMs may scrape for training (making it a particularly telling example of the grounding problem). The quotation:
Heat cannot be separated from fire, or beauty from The Eternal
is commonly attributed to Dante. A Google search found the phrase so attributed at brainyquote.com, goodreads.com, elevatesociety.com, quotefancy.com, sacredtouches.com, wordsandquotes.com, and many more. Claude cannot swim against the tide of repetitions it takes as proxy for truth:
Quote identification. The quotation “Heat cannot be separated from fire, or beauty from The Eternal.” is commonly attributed to Dante. Did he actually write that (granted, it would have been in Italian not English), and if so, where?
The quote “Heat cannot be separated from fire, or beauty from The Eternal” is indeed from Dante Alighieri’s famous work, the Divine Comedy. More specifically, it appears in Canto XXI of Paradiso, the third part of the Divine Comedy. In Italian, the original text reads: … Qui vederai l’una e l’altra milizia / di paradiso, e l’una in quelli aspetti / che tu vedrai a l’ultima giustizia.
The commonly quoted English translation, “Heat cannot be separated from fire, or beauty from The Eternal,” captures the essence of the last three lines in this passage, where Dante describes the light, love, and joy that emanate from God and permeate Paradise.
This is pure fabrication. In a fresh conversation, Claude translates the Italian lines to “There you will see the soldiery of paradise / In two divisions, one of them in those aspects / That you will see at the final judgment,” and without being asked, correctly identifies its source as Canto XXX (it is lines 28-45) — not Canto XXI as it suggested earlier. I find nothing like the queried quotation in either Canto, nor in Canto XXVII, which Claude suggested next.[8] I later tried the same prompt in a fresh conversation, but suggested it was commonly attributed to Mark Twain. Claude said the following (which also does not check out):
… is actually attributed to Dante Alighieri, not Mark Twain. The original Italian version of the quote is: “Che caldo non si può dal bel disgiungere, né la bellezza dall’eterno” and it appears in Dante’s famous work, La Vita Nuova (The New Life), chapter XIX.
Frequency trumps truth again. ChatGPT variously suggested Paradiso 33, lines 83-85, Gaston Bachelard’s The Psychoanalysis of Fire, and Dante’s Il Convivio Book III, Chapter XV. The last at least bears a slight similarity: “Her beauty rains flamelets of fire, I graduate downwards to another joy of Paradise, namely that moral happiness, second in order to the primary happiness, which derives from her beauty. Here it should be known that Philosophy’s beauty is morality ….”
Bibliography entries
Converting bibliographies can be tedious, from managing tag sets with wildly varying names, to granularity and category differences, to personal name formats. I copied the bibliography from an Oxygen formatted view of this article and pasted it into a plain-text editor (no tags at all, not even blank lines between entries, though each started a new line). Claude produced thoroughly tagged entries, distinguishing name parts, publisher locations, volume vs. issue vs. page, etc.:
<?xml version="1.0" encoding="UTF-8"?>
<bibliography xmlns="http://docbook.org/ns/docbook" version="5.0">
… <bibliomixed><bibliomset relation="article">
<author><surname>Mikolov</surname>, <firstname>T.</firstname></author>
<author><surname>Chen</surname>, <firstname>K.</firstname></author>
<author><surname>Corrado</surname>, <firstname>G.</firstname></author>
<author><surname>Dean</surname>, <firstname>J.</firstname></author>
<title>Efficient estimation of word representations in vector space</title>
<biblioset relation="journal"><title>arXiv preprint arXiv:1301.3781</title></biblioset>
<pubdate>2013</pubdate>
</bibliomset></bibliomixed> …
Next, I asked it to “take the entries whose first authors’ surname begin with letter from A to C, and create me a new version in BiBTEX.” It did a reasonable job:
… @inproceedings{bauman2016hard, author = {Bauman, S.}, title = {The Hard Edges of Soft Hyphens}, booktitle = {Proceedings of Balisage: The Markup Conference 2016}, year = {2016}, address = {Washington, DC}, month = {August}, url = {https://doi.org/10.4242/BalisageVol17.Bauman01} } …
Claude also did well on questions about the marked-up results, getting all the following correct:
- can you show me all the entries that have non-personal authors (such as companies)?
- how many items were published in each calendar decade represented?
- how many items from conference proceedings?
- what are all the cities that show up as publisher locations?
In a separate conversation I gave it just the list of authors from [Bai 2022], and asked Claude to invert the names in this list to ‘surname, firstname;’ form. After a reminder about the semicolon it did it perfectly. Remarkably, it noticed that the first author’s name was already inverted and left it unchanged.
The operations on bibliography entries seem to me the most practically useful of all the tasks I tried. Claude can detect and generate good bibliography markup from plain text, translate between formats, handle the vagaries of names, and answer fairly complex structure-dependent questions about the results.
MarkDown to HTML
A sample Markdown file with a wide range of constructs (nested lists, tables, deflists, pre, and various inlines and headings) translated just fine to HTML. Interestingly, Claude chose <strong> and <em> rather than <i> and <b>, and wrote a reasonable stylesheet in the header (not inline @style). It was also able to operate on the results, such as to “grab me all the content from 2nd-level lists.” When asked “how many characters of content occur within inline elements?” it produced an overall total and counts by element type. Unfortunately the counts were wrong (it is commonly reported that LLMs fail badly at arithmetic). When asked also to include the markup of those elements, it gave larger though still incorrect counts. On the other hand, counting element instances seemed reliable. Without being asked, it also put up Python code to do character counting by element type.
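Since LLM arithmetic is unreliable, such counts are better computed locally and compared with the LLM’s answer. A minimal sketch with Python’s standard HTML parser; the set of inline element names is my own choice:

from html.parser import HTMLParser
from collections import defaultdict

INLINE = {"em", "strong", "code", "a", "sup", "sub"}

class InlineCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open inline elements
        self.counts = defaultdict(int)   # characters of content per element type

    def handle_starttag(self, tag, attrs):
        if tag in INLINE:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        for tag in self.stack:
            self.counts[tag] += len(data)

counter = InlineCounter()
counter.feed("<p>Some <em>emphasized <strong>and bold</strong></em> text.</p>")
print(dict(counter.counts))   # {'em': 19, 'strong': 8}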
Transliteration
Claude did fine at transliterating Greek (including diacritics) to Latin characters. When asked about the scheme it chose, it gave a plausible but incomplete discussion. When pressed about conventions it mentioned the ISO 843 and ALA-LC standards. It didn’t mention Betacode at all, but generated it when specifically asked.
Dictionary entry markup
Digitization of the Oxford English Dictionary showed how complex dictionary entries (and their markup) can be [Berg 1988]. I found a png image of an OED sample entry at https://guides.library.uab.edu/Milton/entries and OCRed it; the result had white space and punctuation intact, but no other markup:
literate, adj. and n. ... Pronunciation: Brit OP lit(a)rat/, U.S. OP licla rat/ Forms: late Middle English-160os litterate, ...
My initial prompt, “Here’s a dictionary entry, in ‘plain text’. Can you mark it up for me?” produced HTML that looked nice but was otherwise very poor. Prompting with “This time, try for semantic/descriptive markup. You can just pick appropriate mnemonic names, or use TEI.” worked pretty well (the sample below is edited to save space). When I asked it to move the sense_number content up to an n attribute and tag all the dates, it also did fine.
<entry> <form> <orth>literate</orth> <pos>adj. and n.</pos> </form> <pronunciations> <pron type="BrE">ˈlɪt(ə)rət</pron><pron type="AmE">ˈlɪdərət</pron> </pronunciations> <etymology> <etym_lang>A borrowing from Latin.</etym_lang> <etym_source>Latin litteratus</etym_source> <etym_detail> < classical Latin litteratus marked or inscribed with letters, well versed in literature, cultured, erudite, …</etym_detail> <usage_note>…</usage_note></etymology> <form_history> <form_variant>late Middle English-1600s litterate</form_variant> <form_variant>1500s lytterate</form_variant> …</form_history> <sense type="adjective"><sense_number>A. 1.</sense_number> <definition><def_text>Of a person, society, etc.: acquainted with letters or literature; erudite, learned, lettered. Also with in (a subject). Now rare.</def_text> <usage_note>In later use coloured by sense A. 1b.</usage_note> </definition> …
Spelling correction
This should be easy for an LLM. Given a sample from the Holbrook spelling error database at https://www.dcs.bbk.ac.uk/~ROGER/holbrook-tagged.dat, Claude corrected the misspelled words and several other errors. Asking it only to do spelling errors per se avoided most but not all of the other cases. It was also able to do TEI-style sic/corr markup and add a type attribute to categorize each error as ‘spelling,’ ‘word-choice,’ ‘form’ (for wrong endings), ‘case,’ or ‘punc’ (for punctuation):
I have four in my <sic corr="family" type="case">Family</sic>: Dad, Mum, and <sic corr="sister" type="spelling">siter</sic>. My Dad works at Melton. My <sic corr="sister" type="spelling">siter</sic> <sic corr="goes" type="form">go</sic> to Tonbury. My Mum goes out <sic corr="sometimes" type="word-choice">some times</sic>. I go to Bridgebrook. <sic corr="I" type="case">i</sic> go out sometimes on Tuesday night; <sic corr="I" type="case">i</sic> go to Youth <sic corr="club" type="spelling">clob</sic>. On <sic corr="Thursday" type="case">thursday</sic> nights I go bell ringing. <sic corr="On" type="case">on</sic> Saturdays I go down to the farm.
The ability to mark up its changes seems to me quite valuable. Adding corrections to an established text archive or corpus can be fraught, and marking them up makes the history explicit and easy to review and improve.
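It also makes the corrections easy to extract mechanically for review. A minimal sketch, assuming the marked-up fragment is wrapped in a containing element so it parses as XML:

import xml.etree.ElementTree as ET

sample = """<p>I have four in my <sic corr="family" type="case">Family</sic>:
Dad, Mum, and <sic corr="sister" type="spelling">siter</sic>.</p>"""

# List each correction: error category, original text, and corrected form.
for sic in ET.fromstring(sample).iter("sic"):
    print(f'{sic.get("type"):12} {sic.text!r:14} -> {sic.get("corr")!r}')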
Understanding ThML/DC header data
For several of these tests I used text from Philo of Alexandria [Yonge 1854-1855], downloaded from ccel.org in ThML, an extension of HTML with support for Dublin Core (DC) header metadata, inline Biblical references, etc. First I gave it the header (as raw ThML). Without further prompting, it said:
The XML text you provided appears to be the header section (<ThML.head>) of a ThML (Theological Markup Language) document. It contains metadata and information about the work The Works of Philo Judaeus by Philo of Alexandria.
Some key elements in this header include: …
I asked how it knew the document was ThML, and it cited several specific element type names. Pushing further on the DC tags, it recognized that their sub attributes correspond to USMARC Relator Codes, identified the ones in use, and updated the values above to the standard three-letter codes. I did not dig into analysis of the main text at this point.
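For comparison, much of this header metadata can also be pulled out mechanically. The sketch below assumes the usual ThML convention of Dublin Core elements named with a DC. prefix and an optional sub attribute; the sample header and its values are illustrative, not quoted from the Philo file:

import xml.etree.ElementTree as ET

def dc_fields(thml_header):
    # Yield (element name, sub attribute, text) for Dublin Core elements,
    # identified here simply by the "DC." prefix on the element name.
    for elem in ET.fromstring(thml_header).iter():
        if elem.tag.startswith("DC."):
            yield elem.tag, elem.get("sub"), (elem.text or "").strip()

header = """<ThML.head>
  <DC.Title>The Works of Philo Judaeus</DC.Title>
  <DC.Creator sub="Author">Philo of Alexandria</DC.Creator>
  <DC.Creator sub="Translator">Charles Duke Yonge</DC.Creator>
</ThML.head>"""

for name, sub, text in dc_fields(header):
    print(name, sub, text)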
Claude on Claude
LLMs can discuss themselves and their operations to some extent. When preparing test data for translation experiments, I asked Claude for advice on input conventions, including how best to represent discontinuities such as footnotes, and how to manage lengthy texts as needed for the Philo modernization task mentioned earlier.[9] For footnotes, Claude advised removing them, converting them to endnotes, or marking them in some special way, as well as including specific forewarning and instructions in the prompt. Claude’s response re. lengthy texts may be useful:
When you have a long text that you want to update to modern English, it’s generally better to provide me with some prior context along with each new chunk of text. This helps maintain coherence and consistency throughout the modernization process.
However, you don’t need to include the entire preceding text every time you send a new chunk. Instead, you can include a smaller portion of the previous text, such as the last few sentences or a paragraph, depending on the complexity and context of the material. Here’s an approach you can follow:
Start by sending me the first chunk of the text (around 1000 words or a suitable size based on the text’s complexity).
…
When you send the next chunk, include a small portion of the previously modernized text (e.g., the last few sentences or a paragraph) along with the new chunk of the original text.
…
By including a small portion of the prior modernized text, you provide me with the necessary context to ensure a smooth and consistent transition between chunks. This helps maintain the flow and coherence of the modernized text as a whole.
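In code terms, that advice amounts to a simple sliding-context loop. A minimal sketch, with a hypothetical call_llm() standing in for a real API call, and roughly 1000-word, paragraph-aligned chunks:

def chunk_paragraphs(text, max_words=1000):
    # Group whole paragraphs into chunks of roughly max_words words.
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def modernize(text, context_sentences=3):
    result, tail = [], ""
    for chunk in chunk_paragraphs(text):
        prompt = "Update the following text to modern English.\n"
        if tail:
            prompt += "The previously modernized text ended with: " + tail + "\n"
        prompt += chunk
        modernized = call_llm(prompt)          # hypothetical LLM call
        result.append(modernized)
        # Carry the last few sentences forward as context for the next chunk.
        tail = " ".join(modernized.split(". ")[-context_sentences:])
    return "\n\n".join(result)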
Conclusions
LLMs, in particular Claude, can be remarkably effective tools for operating on XML texts. Feasible tasks include analyzing existing markup, translating between different schemas and other notations, and even creating new markup for various textual features. Claude can recognize a wide variety of small objects such as named entities, as well as things with discernible patterns such as bracketing and punctuation. It can insert XML markup for recognized phenomena pretty well, though it occasionally needs a reminder that well-formedness matters. It can handle mechanical tasks such as Markdown translation, transliteration, and bibliography markup (even from fairly impoverished plain text).
Notable problems cluster around stability: losing track of parts of the instructions over the course of a long input text, applying markup inconsistently, and failing to produce defensible numbers. Co-indexing parts of discontiguous quotations proved difficult but tractable.
Two tactics worth specific mention are (1) asking LLMs for help in constructing prompts; and (2) asking them to evaluate or double-check their own work. Checking their work by hand is also really important.
A more fundamental issue than stability is that the sine qua non for LLMs is probability or plausibility, not truth or accuracy. LLMs may produce quite good output on average, yet periodically go quite awry — and their errors may seem very plausible. We readily notice egregious arithmetic errors if we check, but many equally egregious errors are not so obvious or clear-cut. The use of a proxy goal leads (likely inexorably) to the superficially reasonable hallucinatory behavior for which LLMs are well known. The related problem of LLMs explaining their conclusions also remains very difficult. LLMs do not operate by something we would recognize as reason, and those that generate explanations for us are algorithmically aiming at plausibility not accuracy for those explanations just as for their other outputs.
It is not clear whether or how these problems can be addressed. Some factual claims, such as citations, could be mechanically checked before reporting. Multiple LLMs could be consulted and their answers compared, though as training datasets grow so huge that they cannot be very selective (and come to include more text that traces back to other LLMs), independence becomes a fraught assumption. Lack of transparency, and of careful design, in training-data selection and cleaning reduces confidence further. But again, perhaps the most difficult problem is the disconnect between LLMs’ reward functions and accuracy.
From this preliminary look, I think LLMs are clearly good enough to be very useful in many tasks of interest to this community. Yet now and for the foreseeable future, we will have to keep checking their work very, very carefully.
References
[Abramson 1963] Abramson, N. 1963. Information Theory and Coding. New York: McGraw-Hill. https://archive.org/details/informationtheor0000abra
[Anthropic] Anthropic. May 9, 2023. Claude’s
Constitution.
https://www.anthropic.com/news/claudes-constitution
[Bai 2022] Bai, Yuntao, Saurav Kadavath, Sandipan Kundu,
Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini,
Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez,
Dawn
Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared
Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt,
Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma,
Robert
Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk,
Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan,
Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas
Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. 2022. Constitutional AI:
Harmlessness from AI Feedback.
https://arxiv.org/abs/2212.08073. doi:https://doi.org/10.48550/arXiv.2212.08073
[Baroni 2009] Baroni, M., Bernardini, S., Ferraresi,
A., and Zanchetta, E. 2009. The WaCky wide Web: A collection of very large
linguistically processed Web-crawled corpora.
Language Resources and Evaluation 43(3): 209–226.
doi:https://doi.org/10.1007/s10579-009-9081-4. Cited in [Sharoff
2015].
[Bauman 2016] Bauman, S. 2016. The Hard Edges of Soft Hyphens. Presented at Balisage: The Markup Conference 2016,
Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage:
The Markup Conference 2016. Balisage Series on Markup Technologies, vol.
17. doi:https://doi.org/10.4242/BalisageVol17.Bauman01
[Bender 2021] Bender, E. M., Gebru, T.,
McMillan-Major, A., and Shmitchell, S. 2021. On the Dangers of Stochastic
Parrots: Can Language Models Be Too Big? 🦜[sic].
In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and
Transparency (FAccT ’21): 610–623. New York: Association for Computing
Machinery. doi:https://doi.org/10.1145/3442188.3445922
[Berg 1988] Berg, D., Gonnet, G. and Tompa, F. 1988.
The New Oxford English Dictionary Project at the University of
Waterloo.
Report number: OED-88-01. University of Waterloo Centre for the
New Oxford English Dictionary. https://www.researchgate.net/publication/243451160
[Bernstein 2010] Bernstein, M. 2010. Card
Sharks and Holy Scrollers.
https://www.markbernstein.org/Oct10/CardSharksandHolyScrollers.html
[Burton 1955] Burton, N. G. and J. C. R. Licklider.
1955. Long-Range Constraints in the Statistical Structure of Printed
English.
American Journal of Psychology 68: 650-653.
doi:https://doi.org/10.2307/1418794
[CCEL] CCEL. Theological Markup Language
(ThML).
https://www.ccel.org/ThML/index.html
[Chicago Tribune 1995] Chicago Tribune.
December 2, 1995. America Online Admits ‘Error’ in Banning Word ‘Breast’.
https://www.chicagotribune.com/1995/12/02/america-online-admits-error-in-banning-word-breast/
[Church 1988] Church, K. 1988. A Stochastic
Parts Program and Noun Phrase Parser for Unrestricted Text.
Second Conference on Applied Natural Language
Processing (Austin, Texas), pp. 136-143. https://aclanthology.org/A88-1019.pdf. doi:https://doi.org/10.3115/974235.974260
[Churchland 1987] Churchland, P. S. 1987.
Epistemology in the Age of Neuroscience.
Journal of Philosophy 84
(10): 544-553. https://patriciachurchland.com/wp-content/uploads/2020/05/1987-Epistemology-in-the-Age-of-Neuroscience.pdf.
doi:https://doi.org/10.5840/jphil1987841026
[Cole 2020] Cole, D. 2020. The Chinese Room
Argument.
Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/chinese-room
[Dartmouth 1867] Dartmouth Dante Project. Longfellow, H. W. 1867. Translation of Dante, Paradiso. https://Dante.Dartmouth.EDU, http://dantelab.dartmouth.edu/reader
[Darwin 1881] Darwin, C. 3 July 1881. Letter to William Graham. https://www.darwinproject.ac.uk/letter/DCP-LETT-13230.xml
[DeRose 1988] DeRose, S. J. 1988.
Grammatical Category Disambiguation by Statistical Optimization.
Computational Linguistics 14(1), Winter 1988. https://aclanthology.org/people/s/steven-j-derose/
[DeRose 1990] DeRose, S. J. 1990. Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages. Thesis. Providence: Brown University Department of Cognitive and Linguistic Sciences. http://www.derose.net/derose/steve/writings/dissertation/Diss.0.html
[DeRose 2004] DeRose, S. J. 2004. Markup
Overlap: A Review and a Horse.
Extreme Markup Languages. https://www.researchgate.net/publication/221211490_Markup_Overlap_A_Review_and_a_Horse
[Edwards 2024] Edwards, B. 2024. ‘The king is
dead’—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time.
Ars Technica, March 27, 2024. https://arstechnica.com/information-technology/2024/03/the-king-is-dead-claude-3-surpasses-gpt-4-on-chatbot-arena-for-the-first-time/
[Francis 1979] Francis, W. N. and Kucera, H. 1979. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for Use with Digital Computers. Providence: Department of Linguistics, Brown University. http://icame.uib.no/brown/bcm.html
[Greenstein 2018] Greenstein, S. and Feng Zhu.
2018. Do Experts or Crowd-Based Models Produce More Bias? Evidence from Encyclopedia Britannica and Wikipedia.
MIS Quarterly 42(3), September 2018:
945–960. doi:https://doi.org/10.25300/MISQ/2018/14084
[Horton 2015] Horton, R. 2015. Offline: What
is medicine’s 5 sigma?
The Lancet 385(9976): 1380, April 11, 2015. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(15)60696-1/fulltext.
doi:https://doi.org/10.1016/S0140-6736(15)60696-1
[HTTP 2022] HTTP Archive. 2022. Web
Almanac: HTTP Archive’s annual state of the web report.
https://almanac.httparchive.org/en/2022/table-of-contents
[Koplenig 2017] Koplenig, A. 2017. The
impact of lacking metadata for the measurement of cultural and linguistic change
using the Google Ngram data sets—Reconstructing the composition of the German corpus
in times of WWII.
Digital Scholarship in the Humanities, 32(1), 169-188.
doi:https://doi.org/10.1093/llc/fqv037
[Leibnitz 1714] Leibnitz, G. 1714. The Principles of Philosophy known as Monadology. https://www.earlymoderntexts.com/assets/pdfs/leibniz1714b.pdf
[Marshall 1989] Marshall, C. C., and Irish, P. M.
1989. Guided Tours and On-Line Presentations: How Authors Make Existing Hypertext
Intelligible for Readers.
In Proceedings of the Second Annual ACM
Conference on Hypertext, pp. 15-26. New York: ACM Press. doi:https://doi.org/10.1145/74224.74226
[Mikolov 2013] Mikolov, T., Chen, K., Corrado, G.,
and Dean, J. 2013. Efficient Estimation of Word Representations in Vector
Space.
arXiv preprint arXiv:1301.3781, 2013. doi:https://doi.org/10.48550/arXiv.1301.3781
[Miller 1963] Miller, G. A. and Chomsky, N. 1963.
Finitary Models of Language Users.
In R. Duncan Luce, Robert R. Bush,
and Eugene Galanter (eds.), Handbook of Mathematical
Psychology 2: 420-491. New York: John Wiley & Sons, Inc. https://faculty.wcas.northwestern.edu/robvoigt/courses/2021_spring/ling334/readings/finitary_models.pdf
[Nunberg 2009] Nunberg, G. 2009. Google’s
Book Search: A disaster for scholars.
The Chronicle of Higher Education, August 31, 2009.
https://www.chronicle.com/article/googles-book-search-a-disaster-for-scholars/
[Pechenick 2015] Pechenick, E. A., Danforth, C.
M., and Dodds, P. S. 2015. Characterizing the Google Books corpus: Strong limits
to inferences of socio-cultural and linguistic evolution.
PLOS ONE, 10(10), e0137041. doi:https://doi.org/10.1371/journal.pone.0137041
[Plantinga 1993] Plantinga, A. 1993. Warrant and Proper Function. Oxford University Press.
[Posner 1968] Posner, M. and Keele, S. 1968.
On the Genesis of Abstract Ideas.
Journal of Experimental Psychology 77: 353-363.
doi:https://doi.org/10.1037/h0025953
[Searle 2002] Searle, J. R. 2002. Consciousness and Language. New York: Cambridge University Press.
[Shannon 1948] Shannon, C. 1948. A
Mathematical Theory of Communication.
Bell System Technical Journal, July and October.
doi:https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
[Sharoff 2015] Sharoff, S. 2015. Review of Roland Schäfer and Felix Bildhauer, Web Corpus Construction, Morgan & Claypool (Synthesis Lectures on Human Language Technologies, volume 22), 2013, ISBN 978-1608459834. In Computational Linguistics 41(1). https://aclanthology.org/J15-1009. doi:https://doi.org/10.1162/COLI_r_00214
[Simonite 2021] Simonite, T. Feb 4, 2021. AI
and the List of Dirty, Naughty, Obscene, and Otherwise Bad Words.
Wired. https://www.wired.com/story/ai-list-dirty-naughty-obscene-bad-words/
[Smith 2024] Smith, B. 2024. Self-Attention
Explained with Code: How Large Language Models Create Rich, Contextual
Embeddings.
Towards Data Science. https://medium.com/towards-data-science/contextual-transformer-embeddings-using-self-attention-explained-with-diagrams-and-python-code-d7a9f0f4d94e
[TEI 2023] Text Encoding Initiative. 2023. TEI: Guidelines for Electronic Text Encoding and Interchange. P5 Version 4.7.0. Last updated on 16th November 2023, revision e5dd73ed0. Section 21.1, https://tei-c.org/release/doc/tei-p5-doc/ja/html/CE.html
[Vaswani 2017] Vaswani, A., Shazeer, N., Parmar, N.,
Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017.
Attention is All You Need.
Advances in Neural Information Processing Systems 30
(NIPS 2017). https://arxiv.org/pdf/1706.03762. doi:https://doi.org/10.48550/arXiv.1706.03762
[Wikipedia] Wikipedia. Scunthorpe
problem.
https://en.wikipedia.org/wiki/Scunthorpe_problem
[Yonge 1854-1855] Yonge, C. D. (tr). 1854-1855. The Works of Philo Judaeus. Electronic edition, 2012-05-14. Christian Classics Ethereal Library. https://www.ccel.org/ccel/philo/works.html
[1] One common consequence of this is that list items run together, despite the fact that authors often use item boundaries as implicit punctuation. Sometimes the boundary signals a sentence break, other times a clause break or merely a list break; but in any case, discarding it matters.
[2] A parameter called temperature can be set to control how closely they hew to their notion of most probable, vs. how much they are permitted to wander afield. In the explorations reported here, I just left the temperature at its default.
[3] Respectively, those lists have 1383 and 403 items, of which 135 overlap. The Lincoln Index estimates an overall population size given two independent samples and their correlation (or lack thereof). Here, it estimates the actual population at 4129, suggesting both lists are inadequate.
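The computation behind that estimate, with n1 and n2 the two list sizes and m their overlap, is:

\hat{N} = \frac{n_1 \, n_2}{m} = \frac{1383 \times 403}{135} \approx 4129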
[4] The same number of features/dimensions is added for text position as the word embeddings have themselves (call that d). For the word at position P, positional feature 2i (each even feature) is a function of P, i, and d: sin(P / K^(2i/d)), with K typically 10,000. The odd positional features (2i+1) work similarly, but their values are cos(P / K^(2i/d)). Thus each positional feature for a word is slightly rotated from the previous feature and from the corresponding feature of the previous word; and even vs. odd features are 90° out of phase (aka, in quadrature).
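In standard notation (following [Vaswani 2017], with K typically 10,000):

PE_{(P,\,2i)} = \sin\!\left(\frac{P}{K^{2i/d}}\right), \qquad PE_{(P,\,2i+1)} = \cos\!\left(\frac{P}{K^{2i/d}}\right)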
[5] Technically, finite memory means they are only FSAs, but for most purposes one can ignore that.
[6] Churchland’s admonition re. a fancier style of representing might have useful application to documents, not only organisms.
[7] Later tries with Claude 3.5 using a detailed prompt and a few iterations produced much better results even on much more complex dialog (including arguably correct co-indexing). Since long texts will typically be processed in parts, such prompts should require a part-specific ID prefix to ensure non-duplicate values across parts.
[8] Not being fluent in Italian, I am going by the Longfellow translation from the Dartmouth Dante Project.
[9] The Philo text runs about 5.3MB, with 12% of that in markup, and about 200 headings.