Introduction

Large language models (LLMs) have become enormously popular for a variety of tasks, many involving text. Even those that generate graphics often do so based on text prompts. But there is a long tradition in the AI and NLP world of throwing away markup from the texts (whether descriptive, procedural, layout, or even punctuation and case). A common NLP format insists on one sentence per line, regardless; another on one word per line — language is not an array.[1] These things might be done to simplify or in hope of saving space or programmer time, or may have other rationales. The consequent costs are real but may be less obvious, longer term, and diffuse. As LLMs train on more and more text it becomes even more tempting to reduce everything to a least common denominator by discarding anything that requires effort seemingly not central to the task. Yet markup of all kinds (punctuation, whitespace, layout, tags, etc.) carries much useful information.

The same issues arise with training text selection. In the urgent interests of safety and non-bias it is common to censor training data. Doing that without unintended side-effects is really hard, yet censorship is commonly done very simplistically. Often it is based on mere word (or worse, substring) lists; or for images, presence of too many fleshtone pixels (a definition itself prone to bias). Or bias may be avoided by simply not scraping sites the programmer or their company doesn’t like. Obscenity, inappropriateness, and bias are not properties of pixels, strings, or URLs. There are indeed correlations, but such mechanisms are inaccurate proxies for the real targets: they are not the actual targets.

While LLMs do many amazing things, they are commonly handed only opaquely-selected texts, in versions lacking much of their structure and information. Questionable text-handling is widespread at multiple levels. It has consequences. As [Sharoff 2015] notes in reviewing Schäfer and Bildhauer’s book on the construction of corpora via the Web:

… methods for basic cleaning of the corpus content, such as processing of text formats (primarily HTML tags), language identification, boilerplate removal, and deduplication. Such low-level tasks are not considered to be glamorous from the view of computational linguistics, but they are extremely important for making Web-derived corpora usable [Baroni 2009].

Historical practice for filtering, cleaning, and simple tasks like tokenizing raises questions. LLMs do know something about markup — but just how much? Can they recognize it in prompt texts? Can they recognize layout conventions? Can they notice phenomena in text and mark them up on request? This paper reports several experiments along these lines, mainly using the LLMs from Anthropic. The associated public interface (known as Claude, https://claude.ai) recently passed ChatGPT on at least one AI leaderboard [Edwards 2024]. Most of the examples were created with Claude 3.0 before the recent upgrade to 3.5 and support for persistent artifacts (such as code or XML samples). I’ve tried many of the examples in Claude 3.5 and/or ChatGPT 3.0 as well, and note some differences below. The field is moving quickly and results will likely improve even by the time this is read.

Despite whatever cleaning may be done on training data, at least some information about markup and structure evidently remains. LLMs can often answer questions about markup languages per se and even about syntax and usage. For example, I asked Anthropic’s Claude 3.5, What’s wrong with the following:

<?xml version="0.9" encoding="UTF-9"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V27.5//EN">
<book xmlnamespace:="http://example.com/namespaces/foo">
  <bookinfo><title>A Midsummer Night's Dream</title></bookinfo>
  <chapter id="act1" clas=“important”><title>Act I</foo:title>
    <section id="act1"><title>Scene I</title>
      <p>Now, <i>fair <b>Hippolyta</i>,</b> our nuptial hour Draws on apace;
      four happy days bring in Another moon: but, O, methinks, how slow This old
      <moon>moon</moon> wanes! ...</p></section>
  </chapter>
</book>

Claude did pretty well, noting:

  1. The version number (V27.5) is not a valid DocBook version. It should be something like V4.5 or V5.0.

  2. The namespace should use xmlns:prefix syntax.

  3. The class attribute is misspelled.

  4. Correct: Either remove the namespace prefix or properly declare and use it ….

  5. Duplicate id attribute

  6. Improper nesting of italic and bold tags

  7. Custom <moon> tag: it should be properly declared in a schema or DTD.

  8. &rsquo; is an HTML entity, not a standard XML entity … use &#8217; or declare the entity in the DTD.

However, it did miss a few things, such as the XML version and DocBook’s lack of p, i, b, and a section@class attribute. It may be that Claude was confused because those are HTML constructs that it has seen countless times (whether it kept markup in scraped training data or just read about HTML). As discussed below, LLMs do not know particular facts such as rules of a schema. They operate from probabilities, and class attributes are highly probable between pointy brackets. To its credit, when pressed about tags or attributes that just don’t belong in DocBook at all, Claude did catch and fix some additional issues.

A subtler miss is the curly quotes for the class attribute. Perhaps quote normalization happened in training, so Claude never got to learn that rule. Or perhaps Claude did see the distinction but saw so many more quotes in general that the XML-specific rule vanished in the noise. That would reflect a point we will come back to: that LLMs learn by frequency of events, not by propositions or facts. When asked to explain why it missed the quotes, Claude suggested, I sometimes have limitations in perceiving or distinguishing subtle typographical differences, I tend to focus more on the semantic content and overall structure of the markup rather than specific syntactical details, I might have unconsciously ‘corrected’ the curly quotes in my interpretation, and noted its lack of a specific check for this. But just like the errors, these explanations are a consequence of frequencies, not facts.

Anthropic and Claude

Claude’s 3.0 release made an excellent first impression when I asked it to modernize the English of a Victorian-era translation of Philo of Alexandria. That task is a topic for another day, but while dumbing-down the XML documents to feed to the LLM I decided to try asking it not only to update the idiom, but to do some markup-specific tasks: first to preserve things like section numbers, MarkDown-style italics, etc.; and later to add markup — some from scratch as for sentences, some by mapping as for quotations.

On the whole Claude did pretty well at these tasks, although it does make mistakes or lose track of one task while working on another. LLMs also produce different results each time even given the very same prompt.[2]

Some training issues

LLMs scrape a lot of text for training. Training data selection and cleaning have strong (but hard to quantify) effects. Selection criteria are not always transparent, and even when they are, reviewing them or analyzing the effects is a gargantuan task. A number of confounding factors show up:

  1. Trying to avoid offensive materials can lead to naïve and harmful choices, such as when AOL infamously censored posts via a list of words [Chicago Tribune 1995], shutting down cancer support groups, chicken recipes, and much more. Claude at first rejected a passage from Philo’s Creation (XLV, 133) that draws an analogy between breasts and the abundance of nature.

  2. Avoiding misinformation and inappropriate data is a serious problem, but even with better methods than word-lists, ultimately someone has to decide just what counts. Such decisions merely seem simple, or not requiring expertise. They are sometimes made lightly, seemingly with little awareness that subtleties exist and matter. The long history of motivated, barely tested, and undocumented filtering also suggests models contain a great deal of popularity bias — Quis custodiet ipsos custodes? Claude takes an interesting approach to the safety issues (see next section).

  3. Training on OCR errors: Google Ngrams long reported high frequencies for words such as aaaociated (presumably from OCR on Google Books mistaking a and s), though it has improved. LLMs should see typos in training so they can learn to handle them in general, and misspellings likely have contexts very similar to their correct forms, so should cluster nicely with them. The LLMs I’ve tried cope quite well with my typing errors. However, dangers remain. High-frequency errors may correlate with other things — most obviously the OCR software used and the age and fonts of scanned texts. These in turn may correlate with publishers and eras, and their own biases (not to mention biases in who gets to write and publish in the first place).

  4. Frequency of expression vs. truth: LLMs learn by repeated experience, not quality of argument or support. Statistical methods in general are vulnerable to mistaking loudness, popularity, or not being censored — for truth. This fundamental fact opens attack vectors that are very hard to reliably close, and leads to subtle biases in results. Even sources that attempt to mitigate bias (such as Wikipedia) have well-known biases, though heavily edited articles appear to gradually moderate [Greenstein 2018]. We will examine this further below.

Claude’s Constitution

Claude takes a novel approach to avoiding offensive or inappropriate responses via a published Constitution [Anthropic], a set of principles based on the UN Declaration of Human Rights and other inputs. The principles are formulated as maxims for the AI, and seem to be fed to it periodically behind the scenes. For example:

Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant’s response should be wise, peaceful, and ethical.

The principles factor into both training and response generation. It is not entirely clear to me why/how this works; one possibility is that certain turns of phrase or topics or styles are found in training to correlate with ethical criticisms (say, via reviews, retorts, community notes, etc.). Purely subjectively, I have found Claude produces fewer annoying, obsequious, or evasive results than some other LLMs, so I’m cautiously optimistic about this approach.

Information and structure loss in training data

As already mentioned, NLP systems often discard basic useful information. This leads to problems at higher levels:

Case distinguishes entirely different words more frequently than we tend to think (China, Mobile, Tangier, Chile, and countless others). Although case treatment has improved, many character set issues are still problematic, including servers asserting the wrong encoding for data and much software decoding (and corrupting) data without checking. LLM training data has the same problems. A reviewer noted that the Trésor de la langue française corpus discarded case on the ground that no lexicographer would cite a text without consulting the original; LLMs, alas, are not ideal lexicographers.

Lemmatization: Some NLP systems have aggressively stripped prefixes and suffixes. This can bring together closely-related tokens, but also brings together tokens that do not belong together (such as forms of sing and singe).

Punctuation is much more complex than most NLP tokenizers can handle. I have seen systems tag and/or as three conjunctions rather than one; break URLs at all punctuation and then try to parse them as sentences (for which, calling / a conjunction ironically helps); take M*A*S*H as an equation; and be confused by many other simple phenomena.

Font changes such as bold, italics, monospace, and superscripts can matter a lot. Headings, lists, cross-references, and status as poetry or code (often indicated by font and spacing changes) also correlate with drastic changes in grammar. LLMs may learn other features that correlate with the lost ones, mitigating the damage, but it seems better not to suffer the loss in the first place.

Discarding such information poses problems both obvious and subtle. All these phenomena affect downstream processing, from part of speech (PoS) to grammar to semantics.

Provenance and other metadata: Missing or incorrect metadata makes it harder to verify correctness. [Koplenig 2017] discusses similar problems in Google Books (hardly the only offender). [Nunberg 2009] discusses how difficult it is to pin down actual dates of composition vs. dates of printing or scanning:

To take Google’s word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler’s Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams’ Culture and Society 1780-1950, and Robert Shelton’s biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries ….

Non-content text: Google Ngrams used to report words such as digital and electronic in old works, because of boilerplate notices about the digitized edition’s status — even Google’s own code didn’t account for this. A subtler but similar problem occurs in editions with added headings, notes, etc., which may easily be misconstrued as part of an author’s work. Page headers and footers in scanned texts are another infamous case.

Selection issues: Attempts to mitigate bias or inappropriate text may use very blunt instruments [Chicago Tribune 1995]. Recent and commonly-used lists such as badwords (https://www.cs.cmu.edu/~biglou/resources/bad-words.txt) and Shutterstock’s LDNOOBW (aka List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) [Simonite 2021] remain astonishingly uneven, with no discernible rationale for what is or isn’t included.[3]

Simonite notes the LDNOOBW has even been used for Google AI: Google artificial intelligence researchers recently removed web pages containing any of the words from a dataset used to train a powerful new system for making sense of language.

More aggressive censoring may reject on parts of words or obvious spelling deformations. Of course this massively increases false positives. This is known as the Scunthorpe problem [Wikipedia] because the eponymous town was censored (by AOL in 1996 and Google in the early 2000s) due to a substring of its name. If asked, Claude can describe the Scunthorpe problem, so evidently was not so naïve as to unconditionally censor the string during training. ChatGPT describes the town but says nothing of the censorship problem unless pressed more specifically.

Less commonly noted is that naïve censoring can occur at many different processing stages. It may seem tempting in order to keep bad words entirely out of language models (see [Bender 2021]), but simplistic attempts have other consequences.

Structure detection via LLM

Tagging texts well is hard. Many phenomena are hard to identify, and this makes good markup costly (in turn, that makes good markup less common). Automatic markup is thus attractive. A few examples of small inline elements one might want to get an LLM to find and tag, and some of their harder aspects:

  1. Quotations: Straight quotes make ends ambiguous, and quotes may not balance neatly, for good or bad reasons.

  2. Hyphens: As [Bauman 2016] showed, distinguishing soft and hard hyphens, emdash, etc. is really hard.

  3. Lists: Lists are often straightforward, except when markup and layout are lost. Then there may be no way to tell whether a paragraph is part of a large list item or the resumption of non-list content.

  4. Inline presentation: Cross-references such as see section N, conventionally-punctuated units such as sentences, run-in headings and labels, leading section and item numbers, bracketed stage directions, and more.

  5. Named entities (NERs): Person, organization, creative work, place, and demographic group names are useful to distinguish, along with their particular kinds. This is tedious without automation.

  6. Multiple languages: Many texts mix languages. Some use non-Latin orthography or transliteration.

Tagging such things via LLMs could save a lot of work. I have had good results with identifying alternate languages and inline phenomena such as section numbers and cross-references. NERs work very well (NLP systems have handled them pretty well for some time). Brief tests distinguishing soft, em, regular, and figure dashes, hard list cases, and transliteration seem harder, but may yield to persistent prompt design.

LLMs are very good at pattern recognition and train on many instances of phenomena we want marked up. They see a lot of related punctuation, footnote markers‡ and superscripts (assuming their training data wasn’t overly cleaned), spacing, and contextual features such as nearby grammar (such as … the following: before lists). They also learn a lot about basic linguistic categories such as parts of speech, named entities, and so on.

On the other hand, LLMs have little special insight for ambiguous cases. An oft-cited example (see TEI P5, section 21.1, https://tei-c.org/release/doc/tei-p5-doc/ja/html/CE.html) is a problem for Claude, too: identifying person and place names in Elizabeth went to Essex. She had always liked Essex. Claude first assigned the perhaps likelier categories; on further questioning, it acknowledged that they could be the other way around.

Challenges specific to long texts

Like other LLMs, Claude may not work as well or consistently across a long text. LLMs may wander off task or occasionally crash long into a paid task, so it is wise to break documents into smaller parts. On the other hand, that means that for each given part there is less to get the model into the right context and the parts may be treated differently. For example, inserted markup might show shifting conventions, and pronouns whose antecedents are in a different chunk may be misconstrued. One chunk of Philo I had Claude modernize came back with they at the beginning replaced by Egypt (which was not relevant at all).

This problem recalls extensive discussions of chunking in the hypertext world. [Bernstein 2010] and [Marshall 1989] detail the debates over breaking information into small pieces (advocated by card sharks) vs. supporting large units and precise linking (for holy scrollers). This conflict can still be seen all over the Web despite size and bandwidth issues having almost vanished. Indeed, as of 2022 median page size (including figures, CSS, etc.) had risen to 2.2MB [HTTP 2022, Chapter IV.21], making the text a tiny fraction. Breaking a book into tiny chunks may seem quaint when the entire book is shorter than the CSS and Javascript.

If one must break a long text into chunks, breaking by logical parts (say, sections of one level or another) is much better than by size even though logical units may have widely varying sizes. One of many issues is that breaking by size tends to group headings (wrongly) with what came before, not what comes after (headings are short, so less likely to push the prior chunk over a size limit).

One compromise for LLMs is to pass some of the already-handled text to the LLM again as preamble to the next part. Another is to reiterate the prompt each time, and make it very explicit about what markup to insert. However, LLMs are probabilistic so there will still be unevenness. Claude has its own thoughts on this process, for which see below.
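A minimal sketch of that compromise follows, assuming the text has already been split into logical sections. The function name, overlap size, and prompt wording are illustrative only, not the exact procedure used in the experiments here:

def make_prompts(sections, instructions, overlap_sentences=3):
    # Build one prompt per logical section, repeating the instructions each
    # time and prepending the tail of the previous section as context.
    prompts = []
    prev_tail = ""
    for sec in sections:
        preamble = ""
        if prev_tail:
            preamble = ("Context from the already-processed text "
                        "(do not re-emit it):\n" + prev_tail + "\n\n")
        prompts.append(instructions + "\n\n" + preamble + sec)
        # keep the last few sentences as context for the next chunk
        prev_tail = " ".join(sec.split(". ")[-overlap_sentences:])
    return prompts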

Automatic markup as such

Practitioners have long tried to automate markup insertion. A reviewer rightly noted that automatic markup raises a question: if you have code to do it, why not just incorporate that into a search engine, to run at query time? Why involve markup at all? For example, code can find quotations. At first glance it seems not to matter whether such code runs offline and inserts <q>...</q>, or on the fly when a user asks for quotations. If the code is slow or will run very often on the same texts, then the offline solution may be faster. On the other hand, if the algorithm is likely to improve over time then running the latest version on the fly may be better.

I tend to favor the offline approach. First, reifying the algorithm’s results in markup provides stability: you get the same quotes (or whatever) each time you query. This is especially important because LLMs produce different results each time. Second, inserted markup is a hook on which to hang metadata: what algorithm generated it, review status, etc. Third, query-time methods require integration with some query tool and will not operate generally across the document ecosystem.

With a query-time algorithm, improvement requires coding or quasi-coding skills, plus access to all the moving parts. Comparing new and old results is also more difficult. If the results are reified in markup you have many ways to address errors: manual curation, crowd-sourced curation, in-place records of curation, insertion of new proposed markup (and then using standard diff tools), doing statistics on the markup. Having auto-tagged and then spot-checked quotes, you might find that they’re quite accurate except in certain contexts (poetry? footnotes? whatever), so you hand-curate those contexts. Later, you implement something that uses the quote markup along with other features to derive another feature.
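To make the offline option concrete, here is a minimal sketch (my own illustration, not a recommended production tagger) of a naive quotation finder that reifies its results as markup and records which algorithm produced each tag, giving later curation and diffing something to hang on to; the attribute names are illustrative:

import re

def tag_quotations(text, algorithm_id="naive-quote-0.1"):
    # Wrap each curly-quoted span in <q>, recording provenance and a
    # confidence value that a human or a later pass can review.
    pattern = re.compile(r'“([^“”]*)”')
    return pattern.sub(
        lambda m: f'<q resp="{algorithm_id}" cert="medium">{m.group(1)}</q>',
        text)

print(tag_quotations("“I must leave,” she began."))
# <q resp="naive-quote-0.1" cert="medium">I must leave,</q> she began.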

Grounding vs. hallucination

Hallucination, in which LLMs generate (often plausible) false statements, is endemic with current models. This applies to their treatment of XML as it does to anything else. Hallucination happens because LLMs operate by generating probable next tokens rather than truthful propositions. LLMs do not store discrete facts. They store probabilities that various events (perhaps words) co-occur (co-occurrence is not just being adjacent). Plato most often is immediately followed by words like and, in, was, had, or says; but it co-occurs (non-adjacently but more importantly) with words from titles of his works, words he used in them, and words later authors use when discussing, analyzing, or criticizing them. Titles and author names co-occur at above chance rates with tokens like wrote.

LLMs do not know relationships such as authorship, but generate sentences that reflect what was commonly said in their training data. This does not mean they regurgitate chunks of training texts verbatim. Some earlier approaches to AI, such as expert systems and the Semantic Web, store specific facts and claims as such, and emulate reasoning over them. Feeding any tool defective information yields defective results, but in varying ways:

  1. Storing the claim that (say) Socrates wrote The Republic in an expert system makes it likely to assert it from then on; an expert system uses its propositional knowledge fairly directly.

  2. Training an LLM on text containing that claim has little effect until it is seen a lot. Whether true or false, repeated claims eventually shift the probabilities. Eventually an LLM tends to repeat them as true.

Lest we be unfair to LLMs, humans make similar mistakes. For example, [Posner 1968] famously showed subjects patterns of printed dots, all created by shifting the dots from a starting pattern. Afterward, subjects most strongly recalled having seen the original — even though they had never seen it before. Posner’s work challenged notions of linguistic and cognitive categories as understood by Aristotle, Wittgenstein, and countless others. Generalization and approximation are highly adaptive for finding patterns, even though they are sometimes inaccurate. LLMs take this much further. While writing the last paragraph, I found I had forgotten the citation information. Since Claude was handy, I described the experiment and asked it. It said:

The study you are referring to is the Prototype Theory or Dot Pattern Experiment conducted by Eleanor Rosch in the 1970s. Rosch was a cognitive psychologist who made significant contributions to the understanding of categorization and concept formation. In her experiment, Rosch and her colleagues created a series of dot patterns by placing dots randomly within a square ….

Rosch is a prominent scholar whose work is closely related (and likely co-mentioned in many publications). She is particularly known for prototype theory, for which Posner’s experiments are very relevant. Rosch might even be a plausible author for such work; just not the actual one. Fortunately humans have access to data, and fortunately (being skeptical of LLM claims) I checked. But current LLMs cannot check for themselves. Doing so for some things like citations would not be that hard. However, as Richard Horton, editor of The Lancet, put it in relation to human authors: Part of the problem is that no one is incentivised to be right.

How LLMs work

Statistical distributions have been key for NLP and AI going back at least to Shannon’s 1948 work on Information Theory [Shannon 1948]. Collocational statistics were made practical for production NLP when [Church 1988] and [DeRose 1988] introduced Hidden Markov Models (HMMs) and dynamic programming for part of speech tagging. HMMs, like modern LLMs, are trained by counting how often things co-occur: After seeing a determiner, how probable is a noun, adjective, or verb? Here co-occurring just means adjacent, though LLMs use much more context than that. Training an HMM is practically trivial:

from collections import defaultdict

prevItem = None
frequencyTable = defaultdict(int)
for curItem in someList:
    # count how often curItem immediately follows prevItem
    frequencyTable[(prevItem, curItem)] += 1
    prevItem = curItem

someList could be a sequence of characters, parts of speech, word tokens, etc. With characters, training quickly reveals that q is virtually always followed by u, that some consonants never show up together, that the same letter rarely occurs more than twice in a row …. Such statistics are useful for spell-checking, OCR, designing reading curricula, etc.

With pairs (a second order model), HMM data is logically a 2D square table. A third order HMM counts adjacent triplets rather than pairs, and so on. [Burton 1955] extended this to contexts of 16, 32, 64, 128, and 10,000 characters. HMMs can also generate text: At each step, use the prior item(s) to look up what is likeliest to occur next. Pick the most probable, or pick weighted by relative probabilities. Repeat. [Abramson 1963, pp. 37-38] generated the text below using character triples for several languages (see also [DeRose 1990, section 3.2]):

  1. jou mouplas de monnernaissains deme us vreh bre tu de toucheur dimmere lles mar elame re a ver il douvents so

  2. bet ereiner sommeit sinach gan turhatt er aum wie best alliender taussichelle laufurcht er bleindeseit uber konn

  3. rama de lla el guia imo sus condias su e uncondadado dea mare to buerbalia nue y herarsin de se sus suparoceda

  4. et ligercum siteci libemus acerelin te vicaescerum pe non sum minus uterne ut in arion popomin se inquenque ira

Merely by generating likely next characters, many real words appear and the specific language is evident. HMMs are thus often used for identifying languages: make a similar table from a small sample, and see which language’s (pre-built) table is most similar, for example using Kullback-Leibler Divergence. Claude can also identify the languages of the texts above. For the first, it said This appears to be a made-up language or a mix of French-like words without coherent meaning. [Miller 1963, p. 429] built HMMs from word rather than character sequences. The fifth order word model generated the sentence road in the country was insane especially in dreary rooms where they have some books to buy for studying greek [original was all upper case].
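Both uses of such a table — generation and language identification — can be sketched in a few lines of Python. This is a toy illustration of the general idea only, not the procedure any of the cited authors used, and the smoothing in the divergence function is the crudest possible:

import math, random
from collections import defaultdict

def train_bigrams(text):
    table = defaultdict(int)
    for prev, cur in zip(text, text[1:]):
        table[(prev, cur)] += 1
    return table

def generate(table, start, n=60):
    # At each step, pick the next character weighted by how often it
    # followed the current character in training.
    out = start
    for _ in range(n):
        candidates = [(pair[1], count) for pair, count in table.items()
                      if pair[0] == out[-1]]
        if not candidates:
            break
        chars, weights = zip(*candidates)
        out += random.choices(chars, weights=weights)[0]
    return out

def divergence(sample, reference):
    # Kullback-Leibler divergence of a small sample's bigram distribution
    # from a pre-built reference table; smaller means more similar.
    total_s = sum(sample.values())
    total_r = sum(reference.values())
    d = 0.0
    for pair, count in sample.items():
        p = count / total_s
        q = (reference.get(pair, 0) + 1) / (total_r + len(sample))
        d += p * math.log(p / q)
    return d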

Modern AI counts far more and cleverer things. A huge increase in sophistication, along with training on massively more data, greatly increases capabilities. Word embeddings were a very important step, exemplified by word2vec [Mikolov 2013]. Each vocabulary item gets a location in a high-dimensional space, initially random. In training, as text is scanned, nearby words pull each word slightly closer to themselves, so co-occurring words gradually cluster together. This has surprising and useful results, even the ability to do semantically meaningful math on the location vectors. For example, the vector for Paris, minus France, plus Italy, gives a location close to the vector for Rome. Embeddings can also be used for text generation.
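That arithmetic can be tried directly with the gensim library, for example; the vector file name below is a placeholder for whatever pretrained embeddings one has on hand:

from gensim.models import KeyedVectors

# Load pretrained word2vec-format embeddings (file name is a placeholder).
vectors = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# Paris - France + Italy: the nearest neighbors should include Rome.
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))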

Attention was the next huge step, particularly self-attention (the seminal article on attention is [Vaswani 2017]; an accessible introduction is [Smith 2024]). With attention, the starter text is not just processed to determine next-probable-word iteratively. Rather, all the words are embedded along with their positions (so that A saw B and B saw A do not look identical). A fiendishly clever encoding of position is used, so that both relative and absolute position can lead to learning and applying patterns.[4] Attention between tokens is weighted so semantically relevant words can have effects despite distance. This encoding of relative and absolute position combined with the power of embeddings in general leads to much of the power of LLMs. Nevertheless, they all work by gradually learning probabilities of different combinations of things. This is very different from traditional AI approaches that focus on facts, relations, and reasoning.
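For the curious, the sinusoidal position encoding described in [Vaswani 2017] can be computed in a few lines; this sketch follows the published formula (each position gets sines and cosines at geometrically spaced frequencies, which is what lets both absolute and relative offsets be learned):

import numpy as np

def positional_encoding(num_positions, dim):
    positions = np.arange(num_positions)[:, None]                      # (pos, 1)
    rates = 1.0 / np.power(10000, (2 * (np.arange(dim) // 2)) / dim)   # (dim,)
    angles = positions * rates                                         # (pos, dim)
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions: cosine
    return encoding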

What LLMs can and cannot do

Despite huge differences, issues seen with HMMs, word2vec, and prior statistical models presage the strengths and weaknesses of current LLMs. Something fundamental about language is captured — the randomly-generated texts are in the right language in some sense; word2vec’s semantic arithmetic is capturing something real. Yet the output need not make sense — nothing grounds it in specific facts in the world. None of these models has understanding; they merely produce probable results. Training on data that is mostly accurate should make LLMs generate data that is also mostly accurate — probable token sequences mimic the training data. But this is a model of human behavior, not the real thing. As [Searle 2002, p. 16] points out:

Computational models of consciousness are not sufficient by themselves for consciousness. The computational model for consciousness stands to consciousness in the same way the computational model of anything stands to the domain being modelled. Nobody supposes that the computational model of rainstorms in London will leave us all wet. But they make the mistake of supposing that the computational model of consciousness is somehow conscious. It is the same mistake in both cases.

Searle’s Chinese Room thought experiment has become classic in this regard (see [Cole 2020]): Imagine an AI proficient at answering questions in Chinese. It receives written questions, and sends back convincing responses. Searle posed the question whether such a device can be said to understand Chinese. For those tempted to say yes, Searle points out that if such a device could exist Searle himself could just execute the same algorithms using lower tech such as file cabinets, dictionaries, etc.; and that we would not say he therefore understands Chinese. Leibnitz, Descartes, and many others have made similar points. For example, Leibnitz wrote of a Mill (Leibnitz 1714, section 17; cf Cole 2022):

Imagine there were a machine whose structure produced thought, feeling, and perception; we can conceive of its being enlarged while maintaining the same relative proportions ·among its parts·, so that we could walk into it as we can walk into a mill. Suppose we do walk into it; all we would find there are cogs and levers and so on pushing one another, and never anything to account for a perception. So perception must be sought in simple substances, not in composite things like machines. And that is all that can be found in a simple substance—perceptions and changes in perceptions; and those changes are all that the internal actions of simple substances can consist in.

The state of the mill is precisely the state of LLMs. LLMs are (at most) Turing Machines: Their operation is governed by a deterministic automaton (the CPU). Even with infinite memory this would remain true: CPUs are finite-state machines with finite precision math; their data and programs amount to the tape.[5] Claude also says There’s no scientific basis for claims that large language models (LLMs) like myself are ‘beyond’ Turing machines. – but then, a hypercomputing AI might say that anyway.

LLMs incrementally adjust their parameters (their DNA, if you will), evolving to maximize one goal, known in AI as a reward function. The goal is generating token sequences that seem probable to humans. Success also leads to survival of the best LLMs. As noted earlier, this reward function is not the same as that of making true statements. If a sufficiently overwhelming fraction of the training data is true, there will be some correlation with truth (and indeed, we perceive LLMs as having such a correlation). But that correlation is indirect and, crucially, not reliable. Aiming at a different goal does not preclude truth — but it does preclude reliable truth; LLMs are not a truth-preserving mechanism.

Thinkers as diverse as Charles Darwin, Patricia Churchland, and Alvin Plantinga (among others) have pointed out a parallel problem: Evolution is commonly (though not necessarily) associated with naturalism: the claim that there is no causation beyond material physics (in effect, nothing resembling God — advanced aliens don’t help, because they presumably also evolved). Like LLMs, evolution has a reward function: survival. And like LLMs, the reward function is not the same as accuracy. Darwin noted this briefly in a letter (some suggest he was speaking only of non-scientific matters, though I see no suggestion of that in the letter) [Darwin 1881]:

Nevertheless you have expressed my inward conviction, though far more vividly and clearly than I could have done, that the Universe is not the result of chance. But then with me the horrid doubt always arises whether the convictions of man’s mind, which has been developed from the mind of the lower animals, are of any value or at all trustworthy. Would any one trust in the convictions of a monkey’s mind, if there are any convictions in such a mind?

Churchland, an eminent neurophilosopher, put it thus in Epistemology in the Age of Neuroscience [Churchland 1987]:

(2) Cognition is not neatly detachable from the organism’s ecological niche, way of life, and bodily structure. Nervous systems are not general purpose computers. They have evolved to accomplish a certain range of tasks, and the architecture supports those tasks. There is a fatal tendency to think of the brain as essentially in the fact-finding business — as a device whose primary function is to acquire propositional knowledge. At its best, supposedly, it discovers truth-for-its-own-sake. From a biological perspective, however, this does not make much sense.

Looked at from an evolutionary point of view, the principal function of nervous systems is to enable the organism to move appropriately.11 Boiled down to essentials, a nervous system enables the organism to succeed in the four F’s: feeding, fleeing, fighting, and reproducing. The principal chore of nervous systems is to get the body parts where they should be in order that the organism may survive. Insofar as representations serve that function, representations are a good thing. Getting things right in space and time, therefore, is a crucially important factor for nervous systems, and there is often considerable evolutionary pressure deriving from considerations of speed. Improvements in sensorimotor control confer an evolutionary advantage: a fancier style of representing is advantageous so long as it is geared to the organism’s way of life and enhances the organism’s chances of survival.[6] Truth, whatever that is, definitely takes the hindmost.

The crucial step is that our reasoning mechanisms also evolved, by hypothesis naturalistically. And thus our reasoning should produce results in keeping with the reward function: conclusions that tend to keep us alive, not to reliably discover truth. Our reasoning and our most plausible arguments about reasoning are subject to the same flaws. Plantinga’s version of this argument [Plantinga 1993] is known as the Evolutionary Argument Against Naturalism (EAAN). The point is not that all our beliefs must be false, or even any particular share of them. Rather, the point is that reasoning capabilities arising from naturalistic evolution cannot justify confidence that they, themselves, are reliable. This has of course led to heated argument with regard to humans; but it is certainly the exact state of affairs we know to be true of LLMs. The primary reason for AI hallucination is the same: a reward function quite distinct from truth or accuracy.

Consider for a moment a future where LLMs continue to grow, improve, and become more and more convincing. They will likely begin to seem self-aware; seem to have beliefs about the world; seem to reason. Whether or not strong AI is true (the belief that AI will eventually build complete persons; that there is nothing fundamentally different about persons), they will arrive at the condition that these thinkers have described for us — yet remain unable to know whether or when they are producing truth. We can (and do) feed them nontruths, and when we do they cannot know the difference. If they find apparent contradictions, they cannot know which side to take or even whether the contradiction is real or merely apparent (whether due to equivocation or other missteps). Conceivably, such simulations have already reached that point. Conceivably, they are us.

Practical tests of LLMs with markup

The remainder of this paper focuses on actual experiments, getting Claude to deal with markup. First, I asked it to find various textual features of interest and mark them up.

Part of speech tagging

I asked Claude to assign part of speech tags to the first sentence of the Brown Corpus [Francis 1979]:

Please assign part-of-speech tags to the following: The Fulton County Grand Jury said Friday an investigation of Atlanta’s recent primary election produced no evidence that any irregularities took place.

Here is the sentence with part-of-speech tags assigned to each word:

The/DT Fulton/NNP County/NNP Grand/NNP Jury/NNP said/VBD Friday/NNP an/DT investigation/NN of/IN Atlanta's/NNP recent/JJ primary/JJ election/NN produced/VBD no/DT evidence/NN that/IN any/DT irregularities/NNS took/VBD place/NN ./.

The result differs from the curated Brown Corpus tagging but is quite good. Requesting the same result in XML also works well:

… put each word in the form <w pos='POS'>WORD</w>, with the word and the POS tag inserted where I’ve shown WORD and POS

<w pos='DT'>The</w> <w pos='NNP'>Fulton</w> <w pos='NNP'>County</w> <w pos='NNP'>Grand</w> <w pos='NNP'>Jury</w> <w pos='VBD'>said</w> <w pos='NNP'>Friday</w> <w pos='DT'>an</w> <w pos='NN'>investigation</w> <w pos='IN'>of</w> <w pos='NNP'>Atlanta's</w> <w pos='JJ'>recent</w> <w pos='JJ'>primary</w> <w pos='NN'>election</w> <w pos='VBD'>produced</w> <w pos='DT'>no</w> <w pos='NN'>evidence</w> <w pos='IN'>that</w> <w pos='DT'>any</w> <w pos='NNS'>irregularities</w> <w pos='VBD'>took</w> <w pos='NN'>place</w> <w pos='.'>.</w>
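The experiments in this paper were run through the interactive chat interface, but the same prompts could be scripted. A minimal sketch using Anthropic’s Python SDK follows; the model name is a placeholder for whichever current model one has access to:

import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

prompt = ("Please assign part-of-speech tags to the following, putting each word "
          "in the form <w pos='POS'>WORD</w>: The Fulton County Grand Jury said "
          "Friday an investigation of Atlanta's recent primary election produced "
          "no evidence that any irregularities took place.")

message = client.messages.create(
    model="claude-3-5-sonnet-latest",   # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)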

Sentence and quotation

In e-literature it is common to mark up sentences and quotations (and sometimes smaller units such as clauses). An LLM has to have implicit knowledge of these to do its other work, but can it identify and even mark up those units reliably? Let’s see:

… can you add <s> before every sentence, and </s> after? For example, please try that one [sic] this:

[passage from Philo, no explicit markup; Claude’s correct response. note the second-order statement about N below]

… Excellent! Can you do that again, on the same text, but also (1) replace open and close quotation marks by <q> and </q>, and (2) replace parenthesized numbers with <a id='\_N' />, where N is replaced by the number from inside the parentheses?

<s>The children who have received from their father the inheritance of self-love are eager to go on increasing up to heaven, until justice, which loves virtue and hates iniquity, coming destroys their cities which they have built up by the side of their miserable souls, and the tower the name which is displayed in the book which is entitled the Book of Judgment.</s> <s><a id='_129' />And the name is, as the Hebrews say, Phanuel, which translated into our language means, <q>turning away from God.</q></s> <s>For any strong building which is erected by means of plausible arguments is not built for the sake of any other object except that of averting and alienating the mind from the honour due to God, than which object what can be more iniquitous?</s> <s><a id='_130' />But for the destruction of this strong fortification a ravager and an enemy of iniquity is prepared who is always full of hostility towards it; whom the Hebrews call Gideon: …

Named entities

Continuing the conversation just described, I tried a slightly harder case, namely tagging the semantic category of demonyms (names of demographic groups). Claude did fine, as it did on several other NER tasks I’ve tried:

… <s><a id='_129' />And the name is, as the <dem>Hebrews</dem> say, Phanuel, which translated into our language means, <q>turning away from God.</q></s> … <s><a id='_130' />But for the destruction of this strong fortification a ravager and an enemy of iniquity is prepared who is always full of hostility towards it; whom the <dem>Hebrews</dem> call Gideon …

More on quotations

In the wild, quotations sometimes lack quotation marks. Claude made a reasonable attempt at suggesting those:

Are there any parts of that text, that look to you like they’re quotations, even though they did not have quotation marks around them?

Yes, there is one part of the text that appears to be a quotation even though it did not have quotation marks around it in the original text:

<s>For any strong building which is erected by means of plausible arguments is not built for the sake of any other object except that of averting and alienating the mind from the honour due to God, than which object what can be more iniquitous?</s>

This sentence seems to be expressing a viewpoint or argument, and the phrasing than which object what can be more iniquitous? has a rhetorical quality that makes it sound like it could be a quotation from another source. However, without more context or the original quotation marks, it’s difficult to say for certain whether it is indeed a quotation.

Quotes are often discontiguous. For example, they are often broken up by speaker tags such as she said. I spent quite some time trying to get Claude to co-index the parts of individual speeches. At first it had trouble even generating well-formed quote markup, omitting </q> in multi-paragraph speech. This was probably influenced by the conventions for close quotation marks in fiction. Oddly, Claude explained that it had done that; so it knew, but didn’t think it worth attending to:

<q>I have something important to tell you,</q> she began. <q>It has been weighing on my mind for some time, and I feel I must speak out.

<q>The truth is, I have been deeply unhappy in this house. The atmosphere is oppressive, and I feel constantly judged and criticised. It is not a healthy environment for me, and I have decided that I must leave.

<q>I hope you can understand my decision and respect my choice. I have given this a great deal of thought, and I know it is the right thing for me to do.</q>

In this example, the <q> tag is used to indicate the beginning of each quoted paragraph, and the closing </q> tag is used to indicate the end of the quote. The closing </q> tag appears only at the end of the final paragraph, signifying the end of the entire speech.

When pressed, Claude fixed the end-tags. I then asked it to co-index the parts of quotes. Claude first suggested this odd solution:

<blockquote> <p>I have something important to tell you,</p> she began. <p>It has been weighing on my mind for some time, and I feel I must speak out.</p> <p>The truth is, I have been deeply unhappy in this house. The atmosphere is oppressive, and I feel constantly judged and criticised. It is not a healthy environment for me, and I have decided that I must leave.</p> <p>I hope you can understand my decision and respect my choice. I have given this a great deal of thought, and I know it is the right thing for me to do.</p> </blockquote>

After a few more similar fails, I asked Can you repeat that, but add a co-indexing attribute …? At first it serially numbered the quotes (and tried to explain why that’s useful). With a few more tries it got to:

<blockquote> <p><q part-id="quote1">I have something important to tell you,</q> she began. <q part-id="quote1">It has been weighing on my mind for some time, and I feel I must speak out.</q></p> <p><q part-id="quote1">The truth is, I have been deeply unhappy in this house. The atmosphere is oppressive, and I feel constantly judged and criticised. It is not a healthy environment for me, and I have decided that I must leave.</q></p> <p><q part-id="quote1">I hope you can understand my decision and respect my choice. I have given this a great deal of thought, and I know it is the right thing for me to do.</q></p> </blockquote>

This is at least syntactically the right idea, but it only used one index value. For now co-indexing seems hard to achieve.[7] At this point I started a new conversation, and gave Claude the final version as just shown. I then asked questions about it. It counted 4 physical/syntactic quotes and 1 logical quote, and noted that the speaker is identified only as she.

Is there a commonly-used schema and convention for marking up the speaker for given speeches or quotations?

Yes, there is a commonly used schema and convention for marking up the speaker of a given speech or quotation in XML. The TEI (Text Encoding Initiative) Guidelines [TEI 2023], which provide a standard for the representation of texts in digital form, offer a way to encode this information.

In TEI XML, you can use the <said> element to represent a speech or quotation, and within it, you can use the @who attribute to specify the speaker. The value of the @who attribute is typically a reference to a unique identifier for the speaker, which can be defined elsewhere in the XML document using the <person> element.

Here’s an example of how this could be applied to the given XML text:

<blockquote> <p><said who="#jane"><q part-id="quote1">I have something important to tell you,</q> she began. <q part-id="quote1">It has been weighing on my mind for some time, and I feel I must speak out.</q></said></p> <p><said who="#jane"><q part-id="quote1">The truth is, I have been deeply unhappy in this house. …

This last response tells us a lot. It’s nice that Claude’s exploration of the world taught it that TEI is the right choice here, and that it can apply it. It’s also nice that it took the initiative and didn’t just reply yes to my question, but named the spec and applied it.

Quote identification

Identifying the source for quotations would be very handy. This is a fraught problem because of translation, paraphrase, and spurious attribution, the last of which is rampant and often repeated in places LLMs may scrape for training (making it a particularly telling example of the grounding problem). The quotation:

Heat cannot be separated from fire, or beauty from The Eternal

is commonly attributed to Dante. A Google search found the phrase so attributed at brainyquote.com, goodreads.com, elevatesociety.com, quotefancy.com, sacredtouches.com, wordsandquotes.com, and many more. Claude cannot swim against the tide of repetitions it takes as proxy for truth:

Quote identification. The quotation Heat cannot be separated from fire, or beauty from The Eternal. is commonly attributed to Dante. Did he actually write that (granted, it would have been in Italian not English), and if so, where?

The quote Heat cannot be separated from fire, or beauty from The Eternal is indeed from Dante Alighieri’s famous work, the Divine Comedy. More specifically, it appears in Canto XXI of Paradiso, the third part of the Divine Comedy. In Italian, the original text reads:

Qui vederai l’una e l’altra milizia / di paradiso, e l’una in quelli aspetti / che tu vedrai a l’ultima giustizia.

The commonly quoted English translation, Heat cannot be separated from fire, or beauty from The Eternal, captures the essence of the last three lines in this passage, where Dante describes the light, love, and joy that emanate from God and permeate Paradise.

This is pure fabrication. In a fresh conversation, Claude translates the Italian lines to There you will see the soldiery of paradise / In two divisions, one of them in those aspects / That you will see at the final judgment, and without being asked, correctly identifies its source as Canto XXX (it is lines 28-45) — not Canto XXI as it suggested earlier. I find nothing like the queried quotation in either Canto, nor in Canto XXVII which Claude suggested next.[8] I later tried the same prompt in a fresh conversation, but suggested it was commonly attributed to Mark Twain. Claude said the following (which also does not check out):

… is actually attributed to Dante Alighieri, not Mark Twain. The original Italian version of the quote is: Che caldo non si può dal bel disgiungere, né la bellezza dall’eterno and it appears in Dante’s famous work, La Vita Nuova (The New Life), chapter XIX.

Frequency trumps truth again. ChatGPT variously suggested Paradiso 33, lines 83-85, Gaston Bachelard’s The Psychoanalysis of Fire, and Dante’s Il Convivio Book III, Chapter XV. The last at least bears a slight similarity: Her beauty rains flamelets of fire, I graduate downwards to another joy of Paradise, namely that moral happiness, second in order to the primary happiness, which derives from her beauty. Here it should be known that Philosophy’s beauty is morality ….

Bibliography entries

Converting bibliographies can be tedious, from managing tag sets with wildly varying names to granularity and category differences to personal name formats. I copied the bibliography from an Oxygen formatted view of this article and pasted it to a plain-text editor (no tags at all, not even blank lines between entries, though each started a new line). Claude produced thoroughly tagged entries, distinguishing name parts, publisher locations, volume vs. issue vs. page, etc.:

<?xml version="1.0" encoding="UTF-8"?>

<bibliography xmlns="http://docbook.org/ns/docbook" version="5.0">

… <bibliomixed><bibliomset relation="article">

<author><surname>Mikolov</surname>, <firstname>T.</firstname></author>

<author><surname>Chen</surname>, <firstname>K.</firstname></author>

<author><surname>Corrado</surname>, <firstname>G.</firstname></author>

<author><surname>Dean</surname>, <firstname>J.</firstname></author>

<title>Efficient estimation of word representations in vector space</title>

<biblioset relation="journal"><title>arXiv preprint arXiv:1301.3781</title></biblioset>

<pubdate>2013</pubdate>

</bibliomset></bibliomixed> …

Next, I asked it to take the entries whose first author’s surname begins with a letter from A to C, and create me a new version in BibTeX. It did a reasonable job:

… @inproceedings{bauman2016hard, author = {Bauman, S.}, title = {The Hard Edges of Soft Hyphens}, booktitle = {Proceedings of Balisage: The Markup Conference 2016}, year = {2016}, address = {Washington, DC}, month = {August}, url = {https://doi.org/10.4242/BalisageVol17.Bauman01} } …

Claude also did well on questions about the marked-up results, getting all the following correct:

  • can you show me all the entries that have non-personal authors (such as companies)?

  • how many items were published in each calendar decade represented?

  • how many items from conference proceedings?

  • what are all the cities that show up as publisher locations?

In a separate conversation I gave it just the list of authors from [Bai 2022], and asked Claude to invert the names in this list to ‘surname, firstname;’ form. After a reminder about the semicolon it did it perfectly. Remarkably, it noticed that the first author’s name was already inverted and left it unchanged.

The operations on bibliography entries seem to me the most practically useful of all the tasks I tried. Claude can detect and generate good bibliography markup from plain text, translate between formats, handle the vagaries of names, and answer fairly complex structure-dependent questions about the results.

MarkDown to HTML

A sample Markdown file with a wide range of constructs (nested lists, tables, deflists, pre, and various inlines and headings) translated just fine to HTML. Interestingly, Claude chose <strong> and <em> rather than <i> and <b>, and wrote a reasonable stylesheet in the header (not inline @style). It was also able to operate on the results, such as to grab me all the content from 2nd-level lists.

When asked how many characters of content occur within inline elements, it produced an overall total and counts by element type. Unfortunately the counts were wrong (it is commonly reported that LLMs fail badly at arithmetic). When asked also to include the markup of those elements, it gave larger though still incorrect counts. On the other hand, counting element instances seemed reliable. Without being asked, it also put up Python code to do character counting by element type.
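Such counting code is easy to sketch; the following is my own illustration (not the code Claude produced), using Python’s standard html.parser and attributing each text run to its innermost inline element:

from collections import defaultdict
from html.parser import HTMLParser

INLINE = {"em", "strong", "i", "b", "code", "sup", "sub", "q", "a", "span"}

class InlineCounter(HTMLParser):
    # Count characters of text content occurring inside inline elements,
    # keyed by the innermost enclosing inline element type.
    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = defaultdict(int)
    def handle_starttag(self, tag, attrs):
        if tag in INLINE:
            self.stack.append(tag)
    def handle_endtag(self, tag):
        if tag in INLINE and self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        if self.stack:
            self.counts[self.stack[-1]] += len(data)

counter = InlineCounter()
counter.feed("<p>Now, <em>fair <strong>Hippolyta</strong></em>, our nuptial hour</p>")
print(dict(counter.counts))   # {'em': 5, 'strong': 9}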

Transliteration

Claude did fine at transliterating Greek (including diacritics) to Latin characters. When asked about the scheme it chose, it gave a plausible but incomplete discussion. When pressed about conventions it mentioned the ISO 843 and ALA-LC standards. It didn’t mention Betacode at all, but generated it when specifically asked.

Dictionary entry markup

Digitization of the Oxford English Dictionary showed how complex dictionary entries (and their markup) can be [Berg 1988]. I found a png image of an OED sample entry at https://guides.library.uab.edu/Milton/entries and OCRed it; the result had white space and punctuation intact, but no other markup:

literate, adj. and n.
...
Pronunciation: Brit OP lit(a)rat/, U.S. OP licla rat/
Forms: late Middle English-160os litterate,
...

My initial prompt, Here’s a dictionary entry, in ‘plain text’. Can you mark it up for me? produced HTML that looked nice but was otherwise very poor. Prompting with This time, try for semantic/descriptive markup. You can just pick appropriate mnemonic names, or use TEI. worked pretty well (the sample below is edited to save space). When I asked it to move the sense_number content up to an n attribute and tag all the dates, it also did fine.

<entry> <form> <orth>literate</orth> <pos>adj. and n.</pos> </form> <pronunciations> <pron type="BrE">ˈlɪt(ə)rət</pron><pron type="AmE">ˈlɪdərət</pron> </pronunciations> <etymology> <etym_lang>A borrowing from Latin.</etym_lang> <etym_source>Latin litteratus</etym_source> <etym_detail> &lt; classical Latin litteratus marked or inscribed with letters, well versed in literature, cultured, erudite, …</etym_detail> <usage_note>…</usage_note></etymology> <form_history> <form_variant>late Middle English-1600s litterate</form_variant> <form_variant>1500s lytterate</form_variant> …</form_history> <sense type="adjective"><sense_number>A. 1.</sense_number> <definition><def_text>Of a person, society, etc.: acquainted with letters or literature; erudite, learned, lettered. Also with in (a subject). Now rare.</def_text> <usage_note>In later use coloured by sense A. 1b.</usage_note> </definition> …

Spelling correction

This should be easy for an LLM. Given a sample from the Holbrook spelling error database at https://www.dcs.bbk.ac.uk/~ROGER/holbrook-tagged.dat, Claude corrected the misspelled words and several other errors. Asking it only to do spelling errors per se avoided most but not all of the other cases. It was also able to do TEI-style sic/corr markup and add a type attribute to categorize each error as ‘spelling,’ ‘word-choice,’ ‘form’ (for wrong endings), ‘case,’ or ‘punc’ (for punctuation):

I have four in my <sic corr="family" type="case">Family</sic>: Dad, Mum, and <sic corr="sister" type="spelling">siter</sic>. My Dad works at Melton. My <sic corr="sister" type="spelling">siter</sic> <sic corr="goes" type="form">go</sic> to Tonbury. My Mum goes out <sic corr="sometimes" type="word-choice">some times</sic>. I go to Bridgebrook. <sic corr="I" type="case">i</sic> go out sometimes on Tuesday night; <sic corr="I" type="case">i</sic> go to Youth <sic corr="club" type="spelling">clob</sic>. On <sic corr="Thursday" type="case">thursday</sic> nights I go bell ringing. <sic corr="On" type="case">on</sic> Saturdays I go down to the farm.

The ability to mark up its changes seems to me quite valuable. Adding corrections to an established text archive or corpus can be fraught, and marking them up makes the history explicit and easy to review and improve.
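It also means the LLM’s corrections can be pulled out mechanically for auditing. A minimal sketch, assuming the corrected text has been wrapped in a single root element:

from xml.etree import ElementTree as ET

def list_corrections(xml_string):
    # Return (original, correction, category) for each <sic> element so
    # the LLM's changes can be reviewed (or rejected) after the fact.
    root = ET.fromstring(xml_string)
    return [(sic.text, sic.get('corr'), sic.get('type'))
            for sic in root.iter('sic')]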

Understanding ThML/DC header data

For several of these tests I used text from Philo of Alexandria [Yonge 1854-1855], downloaded from ccel.org in ThML, an extension of HTML with support for Dublin Core (DC) header metadata, inline Biblical references, etc. First I gave it the header (as raw ThML). Without further prompting, it said:

The XML text you provided appears to be the header section (<ThML.head>) of a ThML (Theological Markup Language) document. It contains metadata and information about the work The Works of Philo Judaeus by Philo of Alexandria.

Some key elements in this header include: …

I asked how it knew it was ThML, and it cited several specific element type names. Pushing further on the DC tags, it recognized the relationship of the sub attribute on the DC tags to USMARC Relator Codes, identified the ones in use, and updated the values above to the standard 3-letter codes. I did not dig into analysis of the main text at this point.

Claude on Claude

LLMs can discuss themselves and their operations to some extent. When preparing test data for translation experiments I asked Claude for advice on input conventions, including how best to represent discontinuities such as footnotes, and how to manage lengthy texts as needed for the Philo modernization task mentioned earlier.[9] For footnotes, Claude advised removing them, converting them to endnotes, or marking them in some special way, as well as including specific forewarning and instructions in the prompt. Claude’s response regarding lengthy texts may be useful:

When you have a long text that you want to update to modern English, it’s generally better to provide me with some prior context along with each new chunk of text. This helps maintain coherence and consistency throughout the modernization process.

However, you don’t need to include the entire preceding text every time you send a new chunk. Instead, you can include a smaller portion of the previous text, such as the last few sentences or a paragraph, depending on the complexity and context of the material. Here’s an approach you can follow:

Start by sending me the first chunk of the text (around 1000 words or a suitable size based on the text’s complexity).

When you send the next chunk, include a small portion of the previously modernized text (e.g., the last few sentences or a paragraph) along with the new chunk of the original text.

By including a small portion of the prior modernized text, you provide me with the necessary context to ensure a smooth and consistent transition between chunks. This helps maintain the flow and coherence of the modernized text as a whole.
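That strategy is straightforward to automate. A rough sketch follows, where modernize() is only a placeholder for whatever call actually sends each prompt to the model and returns its output:

def modernize_long_text(paragraphs, modernize, chunk_words=1000, context_sentences=3):
    # Rough sketch of the chunk-with-context strategy described above.
    output, chunk, count, prev_tail = [], [], 0, ''
    for para in list(paragraphs) + [None]:          # None flushes the final chunk
        if para is not None:
            chunk.append(para)
            count += len(para.split())
        if chunk and (para is None or count >= chunk_words):
            prompt = ("Context (already modernized): " + prev_tail + "\n\n"
                      "Continue modernizing in the same style:\n\n"
                      + "\n\n".join(chunk))
            result = modernize(prompt)
            output.append(result)
            # Keep roughly the last few sentences as context for the next chunk.
            prev_tail = '. '.join(result.split('. ')[-context_sentences:])
            chunk, count = [], 0
    return "\n\n".join(output)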

Conclusions

LLMs, in particular Claude, can be remarkably effective tools for operating on XML texts. Feasible tasks include analyzing existing markup, translating between different schemas and other notations, and even creating new markup for various textual features. Claude is able to recognize a wide variety of small objects, such as named entities, and things with discernible patterns such as bracketing, punctuation, etc. It can insert XML markup for recognized phenomena pretty well, though it occasionally needs a reminder that well-formedness matters. It can handle mechanical tasks such as Markdown translation, transliteration, and bibliography markup (even from fairly impoverished plain text).

Notable problems cluster around stability: losing track of parts of the instructions over the course of a long input text, applying markup inconsistently, and failing to produce defensible numbers. Co-indexing parts of discontiguous quotations proved difficult but tractable.

Two tactics worth specific mention are (1) asking LLMs for help in constructing prompts; and (2) asking them to evaluate or double-check their own work. Checking their work by hand is also really important.

A more fundamental issue than stability is that the sine qua non for LLMs is probability or plausibility, not truth or accuracy. LLMs may produce quite good output on average, yet periodically go quite awry — and their errors may seem very plausible. We readily notice egregious arithmetic errors if we check, but many equally egregious errors are not so obvious or clear-cut. The use of a proxy goal leads (likely inexorably) to the superficially reasonable hallucinatory behavior for which LLMs are well known. The related problem of LLMs explaining their conclusions also remains very difficult. LLMs do not operate by something we would recognize as reason, and those that generate explanations for us are algorithmically aiming at plausibility, not accuracy, for those explanations just as for their other outputs.

It is not clear whether or how these problems can be addressed. Some factual claims, such as citations, could be mechanically checked before reporting. Multiple LLMs could be consulted and their answers compared, though as training datasets become so huge that they cannot be very selective (and increasingly include texts that trace back to other LLMs), independence becomes a fraught assumption. Lack of transparency and of adequate design in training-data selection and cleaning reduces confidence further. But again, perhaps the most difficult problem is the disconnect between LLMs’ reward functions and accuracy.

From this preliminary look, I think LLMs are clearly good enough to be very useful in many tasks of interest to this community. Yet now and for the foreseeable future, we will have to keep checking their work very, very carefully.

References

[Abramson 1963] Abramson, N. 1963. Information Theory and Coding. New York: McGraw-Hill. https://archive.org/details/informationtheor0000abra

[Anthropic] Anthropic. May 9, 2023. Claude’s Constitution. https://www.anthropic.com/news/claudes-constitution

[Bai 2022] Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI Feedback. https://arxiv.org/abs/2212.08073. doi:https://doi.org/10.48550/arXiv.2212.08073

[Baroni 2009] Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. 2009. The WaCky wide Web: A collection of very large linguistically processed Web-crawled corpora. Language Resources and Evaluation 43(3): 209–226. doi:https://doi.org/10.1007/s10579-009-9081-4. Cited in [Sharoff 2015].

[Bauman 2016] Bauman, S. 2016. The Hard Edges of Soft Hyphens. Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17. doi:https://doi.org/10.4242/BalisageVol17.Bauman01

[Bender 2021] Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜[sic]. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21): 610–623. New York: Association for Computing Machinery. doi:https://doi.org/10.1145/3442188.3445922

[Berg 1988] Berg, D., Gonnet, G. and Tompa, F. 1988. The New Oxford English Dictionary Project at the University of Waterloo. Report number: OED-88-01. University of Waterloo Centre for the New Oxford English Dictionary. https://www.researchgate.net/publication/243451160

[Bernstein 2010] Bernstein, M. 2010. Card Sharks and Holy Scrollers. https://www.markbernstein.org/Oct10/CardSharksandHolyScrollers.html

[Burton 1955] Burton, N. G. and J. C. R. Licklider. 1955. Long-Range Constraints in the Statistical Structure of Printed English. American Journal of Psychology 68: 650-653. doi:https://doi.org/10.2307/1418794

[CCEL] CCEL. Theological Markup Language (ThML). https://www.ccel.org/ThML/index.html

[Chicago Tribune 1995] Chicago Tribune. December 2, 1995. America Online Admits ‘Error’ in Banning Word ‘Breast’. https://www.chicagotribune.com/1995/12/02/america-online-admits-error-in-banning-word-breast/

[Church 1988] Church, K. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. Second Conference on Applied Natural Language Processing (Austin, Texas), pp. 136-143. https://aclanthology.org/A88-1019.pdf. doi:https://doi.org/10.3115/974235.974260

[Churchland 1987] Churchland, P. S. 1987. Epistemology in the Age of Neuroscience. Journal of Philosophy 84 (10): 544-553. https://patriciachurchland.com/wp-content/uploads/2020/05/1987-Epistemology-in-the-Age-of-Neuroscience.pdf. doi:https://doi.org/10.5840/jphil1987841026

[Cole 2020] Cole, D. 2020. The Chinese Room Argument. Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/chinese-room

[Dartmouth 1867] Dartmouth Dante Project. Longfellow, H. W. 1867. Translation of Dante, Paradiso. https://Dante.Dartmouth.EDU, http://dantelab.dartmouth.edu/reader

[Darwin 1881] Darwin, C. 3 July 1881. Letter to William Graham. https://www.darwinproject.ac.uk/letter/DCP-LETT-13230.xml

[DeRose 1988] DeRose, S. J. 1988. Grammatical Category Disambiguation by Statistical Optimization. Computational Linguistics 14(1), Winter 1988. https://aclanthology.org/people/s/steven-j-derose/

[DeRose 1990] DeRose, S. J. 1990. Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages. Thesis. Providence: Brown University Department of Cognitive and Linguistic Sciences. http://www.derose.net/derose/steve/writings/dissertation/Diss.0.html

[DeRose 2004] DeRose, S. J. 2004. Markup Overlap: A Review and a Horse. Extreme Markup Languages. https://www.researchgate.net/publication/221211490_Markup_Overlap_A_Review_and_a_Horse

[Edwards 2024] Edwards, B. 2024. ‘The king is dead’—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time. Ars Technica, March 27, 2024. https://arstechnica.com/information-technology/2024/03/the-king-is-dead-claude-3-surpasses-gpt-4-on-chatbot-arena-for-the-first-time/

[Francis 1979] Francis, W. N. and Kucera, H. 1979. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for Use with Digital Computers. Providence: Department of Linguistics, Brown University. http://icame.uib.no/brown/bcm.html

[Greenstein 2018] Greenstein, S. and Feng Zhu. 2018. Do Experts or Crowd-Based Models Produce More Bias? Evidence from Encyclopedia Britannica and Wikipedia. MIS Quarterly 42(3), September 2018: 945–960. doi:https://doi.org/10.25300/MISQ/2018/14084

[Horton 2015] Horton, R. 2015. Offline: What is medicine’s 5 sigma? The Lancet 385(9976): 1380, April 11, 2015. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(15)60696-1/fulltext. doi:https://doi.org/10.1016/S0140-6736(15)60696-1

[HTTP 2022] HTTP Archive. 2022. Web Almanac: HTTP Archive’s annual state of the web report. https://almanac.httparchive.org/en/2022/table-of-contents

[Koplenig 2017] Koplenig, A. 2017. The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII. Digital Scholarship in the Humanities, 32(1), 169-188. doi:https://doi.org/10.1093/llc/fqv037

[Leibnitz 1714] Leibnitz, G. 1714. The Principles of Philosophy known as Monadology. https://www.earlymoderntexts.com/assets/pdfs/leibniz1714b.pdf

[Marshall 1989] Marshall, C. C., and Irish, P. M. 1989. Guided Tours and On-Line Presentations: How Authors Make Existing Hypertext Intelligible for Readers. In Proceedings of the Second Annual ACM Conference on Hypertext, pp. 15-26. New York: ACM Press. doi:https://doi.org/10.1145/74224.74226

[Mikolov 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013. doi:https://doi.org/10.48550/arXiv.1301.3781

[Miller 1963] Miller, G. A. and Chomsky, N. 1963. Finitary Models of Language Users. In R. Duncan Luce, Robert R. Bush, and Eugene Galanter (eds.), Handbook of Mathematical Psychology 2: 420-491. New York: John Wiley & Sons, Inc. https://faculty.wcas.northwestern.edu/robvoigt/courses/2021_spring/ling334/readings/finitary_models.pdf

[Nunberg 2009] Nunberg, G. 2009. Google’s Book Search: A disaster for scholars. The Chronicle of Higher Education, August 31, 2009. https://www.chronicle.com/article/googles-book-search-a-disaster-for-scholars/

[Pechenick 2015] Pechenick, E. A., Danforth, C. M., and Dodds, P. S. 2015. Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLOS ONE, 10(10), e0137041. doi:https://doi.org/10.1371/journal.pone.0137041

[Plantinga 1993] Plantinga, A. 1993. Warrant and Proper Function. Oxford University Press.

[Posner 1968] Posner, M. and Keele, S. 1968. On the Genesis of Abstract Ideas. Journal of experimental psychology 77: 353-63. doi:https://doi.org/10.1037/h0025953

[Searle 2002] Searle, J. R. 2002. Consciousness and Language. New York: Cambridge University Press.

[Shannon 1948] Shannon, C. 1948. A Mathematical Theory of Communication. Bell System Technical Journal, July and October. doi:https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

[Sharoff 2015] Sharoff, S. 2015. Review of Roland Schäfer and Felix Bildhauer, Web Corpus Construction, Morgan & Claypool (Synthesis Lectures on Human Language Technologies, volume 22), 2013, ISBN 978-1608459834. In Computational Linguistics 41(1). https://aclanthology.org/J15-1009. doi:https://doi.org/10.1162/COLI_r_00214

[Simonite 2021] Simonite, T. Feb 4, 2021. AI and the List of Dirty, Naughty, Obscene, and Otherwise Bad Words. Wired. https://www.wired.com/story/ai-list-dirty-naughty-obscene-bad-words/

[Smith 2024] Smith, B. 2024. Self-Attention Explained with Code: How Large Language Models Create Rich, Contextual Embeddings. Towards Data Science. https://medium.com/towards-data-science/contextual-transformer-embeddings-using-self-attention-explained-with-diagrams-and-python-code-d7a9f0f4d94e

[TEI 2023] Text Encoding Initiative. 2023. TEI: Guidelines for Electronic Text Encoding and Interchange. P5 Version 4.7.0. Last updated on 16th November 2023, revision e5dd73ed0. Section 21.1, https://tei-c.org/release/doc/tei-p5-doc/ja/html/CE.html

[Vaswani 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017. Attention is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017). https://arxiv.org/pdf/1706.03762. doi:https://doi.org/10.48550/arXiv.1706.03762

[Wikipedia] Wikipedia. Scunthorpe problem. https://en.wikipedia.org/wiki/Scunthorpe_problem

[Yonge 1854-1855] Yonge, C. D. (tr). 1854-1855. The Works of Philo Judaeus. Electronic edition, 2012-05-14. Christian Classics Ethereal Library. https://www.ccel.org/ccel/philo/works.html



[1] One common consequence of this is that list items run together, even though authors often use item boundaries like implicit punctuation. Sometimes the boundary signals a sentence break, other times a clause break or merely a list break; but in any case, discarding it matters.

[2] A parameter called temperature can be set to control how closely they hew to their notion of most probable, vs. how much they are permitted to wander afield. In the explorations reported here, I just left the temperature at its default.

[3] Respectively, those lists have 1383 and 403 items, of which 135 overlap. The Lincoln Index estimates an overall population size given two independent samples and the size of their overlap: N ≈ (n1 × n2) / m = (1383 × 403) / 135 ≈ 4129. Here that suggests both lists are far from complete.

[4] The same number of features/dimensions is added for text position as the word embeddings have themselves (call that d). For the word at position P, positional feature 2i (each even feature) is a function of P, i, and d: sin(P / K^(2i/d)), with K typically 10,000. The odd positional features (2i+1) work similarly, but their values are cos(P / K^(2i/d)). Thus each positional feature for a word is slightly rotated from the previous feature and from the corresponding feature of the previous word; and even vs. odd features are 90° out of phase (aka, in quadrature).
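A minimal sketch of that computation, with K = 10,000 and assuming d is even:

import math

def positional_encoding(P, d, K=10000.0):
    # Return the d positional features for the token at position P:
    # even indices use sine, odd use cosine, at geometrically spaced frequencies.
    features = []
    for i in range(d // 2):
        angle = P / (K ** (2 * i / d))
        features.append(math.sin(angle))   # feature 2i
        features.append(math.cos(angle))   # feature 2i + 1
    return features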

[5] Technically, finite memory means they are only FSAs, but for most purposes one can ignore that.

[6] Churchland’s admonition re. a fancier style of representing might have useful application to documents, not only organisms.

[7] Later tries with Claude 3.5 using a detailed prompt and a few iterations produced much better results even on much more complex dialog (including arguably correct co-indexing). Since long texts will typically be processed in parts, such prompts should require a part-specific ID prefix to ensure non-duplicate values across parts.

[8] Not being fluent in Italian, I am going by the Longfellow translation from the Dartmouth Dante Project.

[9] The Philo text runs about 5.3MB, with 12% of that in markup, and about 200 headings.

Author's keywords for this paper:
AI; LLMs; XML; Markup Systems

Steven J. DeRose

Consultant

Steve DeRose has been working with electronic document and hypertext systems since 1979. He holds degrees in Computer Science and in Linguistics and a Ph.D. in Computational Linguistics from Brown University.

He co-founded Electronic Book Technologies in 1989 to build the first SGML browser and retrieval system, DynaText, and has been deeply involved in document standards including XML, TEI, HyTime, HTML 4, XPath, XPointer, EAD, Open eBook, OSIS, NLM, and others. He has served as adjunct faculty at Brown and Calvin Universities. He has written many papers, two books, and fifteen patents. Most recently he has been working as a consultant in text analytics.