I have been working on a project over the past few years that has challenged me to
explore, test, and refine a machine-assisted method for comparing versions of a text.
The project involves comparing five versions of the novel
Frankenstein, and the basis for comparing these versions
includes the markup from editions that were coded differently. I have shared papers
about these adventures over the past few Balisage meetings, including the fun of
flattening and raising markup and the challenge of comparing strings that include
representations of the markup in the editions. Even when it is machine-assisted, document collation is tiring, tedious work.
It is one thing to prepare an algorithm for comparison and apply it to good, adaptable
software for the purpose, but it is quite another to have to correct the output. That
is where the real challenge begins: the intellectual challenge, the mental discipline, the
self-psych-out of machine-assisted collation. When do you give up trying to refine the
software algorithm, and when do you crack and resort to hand-correcting problematic
outputs? Giving up too soon can really slow down a project when it is still possible to
refine the method, but not giving up requires tinkering with the machinery and the
patience to continue testing. Would it not be wonderful for artificial intelligence to assist and accelerate
this painstaking effort, if there could be a reliable way to train a large language
model to process text
the way an editor of a digital scholarly edition would?
The sensational public launch of ChatGPT on November 30, 2022, has sparked excitement,
confusion, concern, and a new surge of publications across academia and the public
sector. ChatGPT has also given us much fun with prompt experiments in
my academic circles. We would ask for introductions of ourselves and our colleagues
suitable for use at conferences and laugh at how ludicrously wrong they were. (In
one
invented bio I am both a medievalist and a published poet, neither of which is true,
but
I could speculate that somehow the words associated with me in ChatGPT’s training
had
been in proximity with medievalist friends and people far more creative than me).
Of
course we also prompted ChatGPT to invent supposedly serious bibliographies to see
its
made-up citations and fake URLs. At universities, teaching faculty worry about our
students abusing generative AI to compose their papers, even as we explore new possibilities
for the writing process. Academics who write computer programs quickly
discovered a supremely helpful side of ChatGPT for debugging code or for a quick,
on-the-fly introduction to a software library. Once we have seen such
benefits, we tend to recognize that our students should also cultivate skills to prompt
AI—responsibly—for coding assistance. ChatGPT has saved time for me and my most
dedicated students, time that we might otherwise have spent combing through
documentation and Stack Overflow posts. Even when ChatGPT’s code plainly did not work,
the AI was leading us rapidly to the ad-hoc fixes we needed to
make, and particularly when we began with a foundation of knowledge, we could work
within a dialogue
and improve our understanding.
Encouraged by such time-saving coding assistance, my colleagues, students, and I have
been eagerly experimenting all year with something that seems to promise the ultimate
declarative possibility: to directly ask a machine to deliver an
output, without directing how it generates that output. The developers at OpenAI fondly
aspire for their chat model to become "a very eager junior programmer" to
"make completely new workflows effortless and efficient."
I wondered how well the new AI models might perform on the very task
that has taken me and my colleagues and students years to refine: the machine-assisted
collation of multiple versions of a work in the form of manuscripts and printed
documents. The processes that underlie this involve aligning chunks of text, and a
token-by-token processing of streams of text pulled from marked-up documents. The
large language models supporting generative AI are themselves based on word embeddings and
tokenized processing of text streams. Could the word arithmetic
we associate
with text-generative AI be applied to comparison algorithms? Would the language model
just
know
how to optimize the most reasonable alignments and outputs?
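By word arithmetic I mean the well-known property of word embeddings that analogies can be approximated with vector addition and subtraction, the textbook king minus man plus woman landing near queen. A toy sketch in Python with invented three-dimensional vectors (nothing like the high-dimensional embeddings a real model learns; the numbers here are made up purely for illustration) shows the kind of computation involved:

import numpy as np

# Invented toy "embeddings" (dimensions: royalty, maleness, femaleness).
# Real models learn hundreds or thousands of dimensions from training data.
vec = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The celebrated word arithmetic: king - man + woman lands nearest to queen.
target = vec["king"] - vec["man"] + vec["woman"]
print({word: round(cosine(target, v), 2) for word, v in vec.items()})

Whatever such arithmetic captures, it measures nearness in a learned vector space, which is a very different operation from deciding whether two sequences of tokens are identical.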
At various moments between January and July 2023, I began testing a hypothesis that
a
machine trained on tokenized text and word embeddings should excel at the task of
token-by-token comparison of strings. I will document some of my representative prompt
engineering experiments in the next section. My prompts could likely be improved,
but I do have
specific requirements for an optimal collation in mind that I have been attempting
to
declare to the AI, and the exercise has been instructive. Over the past several months
of prompt experiments engaging ChatGPT and Anthropic’s Claude in various permutations of
this task, the results have not only been wanting, but remarkably and consistently so. The
ways
in which generative AI has bungled my series of string-comparison challenges may tell
us
something interesting about the limits of current large language models to correctly
observe
the differences between strings, and to express those differences in structured forms.
The
errors are illuminating in a way that demonstrates something seriously lacking in
the text-generative
system in 2023, and that is a fully declarative method of working with input and generating
output.
What happens when we ask a large language model to compare strings?
To begin to estimate whether AI could assist my collation work, I needed to work
within the limits of ChatGPT’s input window. But just as I could ask the AI to proofread
a paragraph or two for errors, surely I could provide some input strings and ask the
AI to compare them,
and show me how they differ. In no case did I ever receive
reliable, accurate results, but the kinds of errors I saw were interesting to
diagnose. Here I share a representative sampling of the
experiments.
In some of my experiments, I simply requested that ChatGPT present me with a table
showing how two versions of a text align. In this prompt, I indicated that two
editions share a version, and a manuscript presents a different version. I was
curious to see how the AI would handle two versions shared by three documents.
Elisa:
Can you prepare a data structure to show where these texts are the same
and where they differ?
Manuscript: It was on a dreary night of November that I beheld the frame
on which my man compleated. And with an anxiety that almost amounted to
agony I collected instruments of life around me that I might infuse a spark
of being into the lifeless thing that lay at my feet.
1818 edition and 1831 edition: It was on a dreary night of November, that
I beheld the accomplishment of my toils. With an anxiety that almost
amounted to agony, I collected the instruments of life around me, that I
might infuse a spark of being into the lifeless thing that lay at my
feet.
ChatGPT:
To compare the texts and visualize the similarities and differences, you
can use a data structure such as a table or a side-by-side comparison.
Here's an example of how you can represent the texts:
+---------------------+---------------------------+-------------------------+
| Manuscript | 1818 edition | 1831 edition |
+---------------------+---------------------------+-------------------------+
| It was on a dreary | It was on a dreary | It was on a dreary |
| night of November | night of November | night of November |
| that I beheld the | that I beheld the | that I beheld the |
| frame on which my | accomplishment of my | accomplishment of my |
| man compleated. | toils. | toils. |
| And with an anxiety | With an anxiety that | With an anxiety that |
| that almost amounted| almost amounted to agony, | almost amounted to agony|
| to agony I collected| I collected the | I collected the |
| instruments of life | instruments of life | instruments of life |
| around me that I | around me, | around me, |
| might infuse a spark| that I might infuse a | that I might infuse a |
| of being into the | spark of being into the | spark of being into the |
| lifeless thing that | lifeless thing that lay | lifeless thing that lay |
| lay at my feet. | at my feet. | at my feet. |
+---------------------+---------------------------+-------------------------+
In this representation, each row corresponds to a line of text, and the
columns represent the different versions: Manuscript, 1818 edition, and 1831
edition. The matching text is aligned vertically, allowing you to see where
the texts are the same and where they differ.
By comparing the texts using this structure, you can easily identify the
variations between the Manuscript, 1818 edition, and 1831 edition.
Here I was pleased to see that ChatGPT understood that I wanted a
visual representation of three editions. But I was surprised that the AI did nothing
more than block the text into a single row. I asked ChatGPT to refine the table.
Elisa:
Let's try to refine this. Where the texts are identical, just show one
block of text on the screen. Then when the texts are different, use the
table to put the differences side by side.
ChatGPT:
Sure! Let's refine the representation to show the identical text blocks
once and use a table to display the differing portions side by side. Here's
an updated representation:
Identical text:
-------------------------
It was on a dreary night of November
Text differences:
+---------------------+---------------------------+-------------------------+
| Manuscript | 1818 edition | 1831 edition |
+---------------------+---------------------------+-------------------------+
| that I beheld the | that I beheld the | that I beheld the |
| accomplishment of my| accomplishment of my | accomplishment of my |
| toils. | toils. | toils. |
| With an anxiety that| With an anxiety that | With an anxiety that |
| almost amounted to | almost amounted to agony, | almost amounted to agony|
| agony I collected | I collected the | I collected the |
| instruments of life | instruments of life | instruments of life |
| around me, that I | around me, | around me, |
| might infuse a spark| that I might infuse a | that I might infuse a |
| of being into the | spark of being into the | spark of being into the |
| lifeless thing that | lifeless thing that lay | lifeless thing that lay |
| lay at my feet. | at my feet. | at my feet. |
+---------------------+---------------------------+-------------------------+
In this refined representation, the identical text block "It was on a
dreary night of November" is displayed once at the top, indicating that this
portion is the same in all versions. The table below then shows the
differing text portions side by side, highlighting the variations between
the Manuscript, 1818 edition, and 1831 edition.
Here I am applying underlined emphasis to show what ChatGPT delivered
in blue color-coded CSS to highlight what it calls variations between
the editions. A few strange things have happened. First, the language model does
correctly indicate that the first portion of the string is identical. But ChatGPT misses
the difference that there is a comma in the 1818 and 1831 versions, and simply does
not represent the comma, so we have dropped a small token of text. Next, in the
table, we completely miss the variant passage from the manuscript: "the frame
on which my man compleated. And". Instead ChatGPT represents the
manuscript text as the same block provided for the other two. Is the AI
forgetting the differences? Yet they seem to understand that some indication of
differences is expected, and they highlight words between two and four characters
long that are identical in the three blocked passages. This is distressingly
inaccurate, and raises questions about how ChatGPT can be determining what to
highlight.
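By way of contrast, the deterministic comparison I had in mind is easy to sketch outside the chat window. The few lines of Python below are only a minimal illustration using the standard-library difflib module, not the collation tooling my project actually uses; they tokenize the two passages on whitespace and report exactly where the streams agree and where they diverge.

from difflib import SequenceMatcher

ms = ("It was on a dreary night of November that I beheld the frame on which "
      "my man compleated. And with an anxiety that almost amounted to agony "
      "I collected instruments of life around me that I might infuse a spark "
      "of being into the lifeless thing that lay at my feet.")
ed = ("It was on a dreary night of November, that I beheld the accomplishment "
      "of my toils. With an anxiety that almost amounted to agony, I collected "
      "the instruments of life around me, that I might infuse a spark of being "
      "into the lifeless thing that lay at my feet.")

# Split on whitespace so punctuation stays attached to its word: a dropped
# comma then shows up as a token-level difference rather than vanishing.
ms_tokens, ed_tokens = ms.split(), ed.split()

for op, a1, a2, b1, b2 in SequenceMatcher(None, ms_tokens, ed_tokens).get_opcodes():
    left, right = " ".join(ms_tokens[a1:a2]), " ".join(ed_tokens[b1:b2])
    if op == "equal":
        print(f"SAME     {left}")
    else:
        print(f"{op.upper():8} MS: '{left}' | 1818/1831: '{right}'")

Even this naive tokenization surfaces the comma after November and the manuscript's "frame on which my man compleated" as explicit differences instead of silently dropping them.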
I tried again with this experiment. I understood that ChatGPT’s memory might well
be too short to handle the text processing completely, but I wondered whether they
could improve the response, much as we can prod them to help us debug our code. Perhaps
it would help ChatGPT if I provided the source texts again, restoring them to the context window of memory.
Elisa:
Your approach is what I had in mind, good. But you've forgotten where the
source texts are different. Can you redo this, taking another look at the
source texts?
Manuscript: It was on a dreary night of November that I beheld the frame
on which my man compleated. And with an anxiety that almost amounted to
agony I collected instruments of life around me that I might infuse a spark
of being into the lifeless thing that lay at my feet.
1818 edition and 1831 edition: It was on a dreary night of November, that
I beheld the accomplishment of my toils. With an anxiety that almost
amounted to agony, I collected the instruments of life around me, that I
might infuse a spark of being into the lifeless thing that lay at my
feet.
ChatGPT:
Apologies for the oversight. Let's revise the representation to include
the differing portions as well. Here's the updated version:
Text differences:
+---------------------+---------------------------+-------------------------+
| Manuscript | 1818 edition | 1831 edition |
+---------------------+---------------------------+-------------------------+
| It was on a dreary | It was on a dreary | It was on a dreary |
| night of November | night of November | night of November |
| that I beheld the | that I beheld the | that I beheld the |
| frame on which my | accomplishment of my | accomplishment of my |
| man compleated. | toils. | toils. |
| And with an anxiety | With an anxiety that | With an anxiety that |
| that almost amounted| almost amounted to agony, | almost amounted to agony|
| to agony I collected| I collected the | I collected the |
| instruments of life | instruments of life | instruments of life |
| around me that I | around me, | around me, |
| might infuse a spark| that I might infuse a | that I might infuse a |
| of being into the | spark of being into the | spark of being into the |
| lifeless thing that | lifeless thing that lay | lifeless thing that lay |
| lay at my feet. | at my feet. | at my feet. |
+---------------------+---------------------------+-------------------------+
In this refined representation, we have the text differences table with
the Manuscript, 1818 edition, and 1831 edition side by side. The portions
that are the same in all versions are omitted for brevity, and only the
differing sections are displayed.
With this response, I was baffled and simply left the conversation.
ChatGPT seems to have forgotten what I thought we established in the previous prompts.
There is a single accentuated highlight (in yellow against the blue) of the capitalized
word "And" from the manuscript, but the AI has otherwise simply highlighted
short words across all the witnesses. What most surprised me was ChatGPT's assertion
that they omitted all the passages that were the same, when they omitted
nothing. And the AI seemed to have no understanding of the concept of
differentiation.
I tried similar experiments this summer, wondering if the language model’s string-differentiation
powers
might have improved after the news that ChatGPT had a newly expanded context window
for
processing input. After June 21, developers have been able to prompt ChatGPT with
8,000-word prompts and to store longer passages in memory. My
prompts were not so long, but I suspect there is something challenging to the memory
processing about my requests to compare strings, and I was curious whether I might
see improvements, so I tried some similar experiments. Among them were
efforts to ask ChatGPT to provide responses not in an alignment table but in TEI
critical apparatus markup.
Here is a more recent experiment, requesting output in TEI critical apparatus
markup (with apologies to Samuel Taylor Coleridge for my flagrant abuse of
The Rime of the Ancient Mariner
).
Elisa:
Given these four text inputs, can you collate them and generate a TEI
critical apparatus?
- It was an ancient Marinere eftsoons there went he.
- He was an old mariner, erstwhile he went there.
- It was an old Marinere, eftsoons there went he.
- He was an ancient Marinere, eftsoons he went there.
ChatGPT:
Certainly! A critical apparatus in Text Encoding Initiative (TEI) format
typically consists of variant readings with corresponding witnesses. In this
case, the four text inputs are slightly different from each other, and we
can create a simple critical apparatus for them. I will assume that each
text is a separate witness (W1, W2, W3, and W4) in the critical
apparatus.
Here's the TEI critical apparatus:
<app>
<lem>
It was an ancient Marinere eftsoons there went he.
</lem>
<rdg wit="#W2">
He was an old mariner, erstwhile he went there.
</rdg>
<rdg wit="#W3">
It was an old Marinere, eftsoons there went he.
</rdg>
<rdg wit="#W4">
He was an ancient Marinere, eftsoons he went there.
</rdg>
</app>
In this representation:
- <lem> stands for the lemma (the base text).
- <rdg> stands for a variant reading.
- wit="#W1" represents the witness identifier for each variant reading.
Now, the critical apparatus shows that the lemma (base text) is "It was an
ancient Marinere eftsoons there went he," and three variants (W2, W3, and
W4) are provided with their corresponding readings.
ChatGPT’s response here has the distinct advantage over my other
experiments of being, quite simply, correct. Indeed, none of these four sentences is
the same, and from ChatGPT this is a fair representation of TEI critical apparatus
encoding, representing a single moment of variation in the
<app>
element, and showing each of the four versions inside. ChatGPT’s explanation is clear.
I
am not sure how they arrived at a concept of the lemma, but they seem to be
defaulting to the first string as being somehow preferred for the base text. But this
markup is too simple and not especially useful if I want to highlight
precisely where the texts align and where they differ. So I continued the
conversation.
Elisa:
Can you improve this by using the <app>
elements only
where there are forks or deltas in the text?
ChatGPT:
Certainly! In a critical apparatus, we can use the
<app>
element only where there are forks or deltas
(variants) in the text. This will help in presenting a more concise
representation of the textual differences. Here's the improved TEI critical
apparatus:
<app>
<lem>
It was an ancient Marinere eftsoons there went he.
</lem>
<rdg wit="#W2">
He was an old mariner, erstwhile he went there.
</rdg>
<rdg wit="#W3">
It was an old Marinere, eftsoons there went he.
</rdg>
<rdg wit="#W4">
He was an ancient Marinere, eftsoons he went there.
</rdg>
</app>
In this representation, the <app>
element encloses the
entire set of variant readings (rdg elements) because there are textual
differences among the witnesses at this point. Since the lemma (base text)
remains the same in all cases, we don’t need separate
<app>
elements for each witness.
This was amusing. ChatGPT does not change the code, and of course,
the witnesses each differ from one another. This
is a justification or rationalization.
If I want to approach the problem differently, perhaps I need a better prompt.
So I tried again. In the prompt below, I attempted to show ChatGPT what kind of output
I wanted to see, and gave the AI a start that I requested them to continue. Adding the
code freehand without a syntax checker, I inadvertently made two mistakes that mangled this code:
- I pasted in one of ChatGPT’s <lem> elements from the previous prompt, intending to modify it to an <rdg> element. But in haste, I only changed it to an <rdg> start tag and left the <lem> end tag.
- I left the second <app> element unclosed.
Elisa:
That is not what I meant. I'd like you to apply the critical apparatus
using inline parallel segmentation. This means you should show just one line
of text where all witnesses are the same, and then use the
<app>
element in the specific moments where the text
differs. Here is how you could begin:
<app>
<rdg wit="#W1 #W3">It</lem>
<rdg wit="#W2 #W4">He</rdg>
</app>
was an
<app>
<rdg wit="#W1 #W4">ancient</rdg>
<rdg wit="#W2 #W3>old</rdg>
Can you continue based on this start?
ChatGPT:
Apologies for the misunderstanding. I understand now that you want to
represent the critical apparatus using inline parallel segmentation. Here's
the continuation based on the start you provided:
<app>
<rdg wit="#W1 #W3">It</rdg>
<rdg wit="#W2 #W4">He</rdg>
</app>
was an
<app>
<rdg wit="#W1 #W4">ancient</rdg>
<rdg wit="#W2 #W3">old</rdg>
</app>
Marinere
<app>
<rdg wit="#W1 #W3 #W4">eftsoons there</rdg>
<rdg wit="#W2">erstwhile</rdg>
</app>
went he.
In this representation, the critical apparatus is represented using the
<app>
element, and within each
<app>
element, we use the <rdg>
elements to indicate the variant readings for different witnesses (W1, W2,
W3, and W4).
I hope this format better meets your requirements for inline parallel
segmentation. Let me know if you need any further adjustments!
At first glance this appeared to be what I wanted, particularly since ChatGPT fixed
my mangled input so that start tags properly matched end tags.
However, on reviewing the strings I had asked ChatGPT to collate, I recognized some significant
errors. Let’s review the original input strings that we asked ChatGPT to compare:
- It was an ancient Marinere eftsoons there went he.
- He was an old mariner, erstwhile he went there.
- It was an old Marinere, eftsoons there went he.
- He was an ancient Marinere, eftsoons he went there.
ChatGPT's critical apparatus markup declares that all of the texts
share the capitalized and distinct spelling "Marinere", missing the
difference of witness 2, which has "mariner". The AI also misses the
presence or absence of the comma (omitting it entirely in the critical apparatus).
Finally, there is the issue of representing transposition. Versions 2 and 4 show
"he went there" while Versions 1 and 3 have "there went he". ChatGPT almost
completely misses these differences in representing the ends of the passages.
The apparatus correctly represents Text 2 as the only variant holding "erstwhile",
but it incorrectly indicates Text 4 as showing "eftsoons there" along with Texts 1
and 3. ChatGPT should have stopped the alignment at "eftsoons" and created a new
<app> to hold the variation of "there went he" vs. "he went there". The AI was
doing pretty well, and if well-formed markup were all that was expected, it could
be said to write good XML. But if accurate content in the critical apparatus is
called for, this approximation drops information needed for that accuracy.
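To make the target output concrete, here is a rough sketch of how a deterministic, token-level comparison could emit inline parallel segmentation for these four witnesses. It is not my project's actual tooling, and it cheats: because these toy sentences happen to split into the same number of whitespace tokens, a naive position-by-position comparison suffices, whereas real collation requires genuine sequence alignment and a considered treatment of transposition. Still, it shows that grouping witnesses by exact token equality cannot overlook a comma or a variant spelling.

witnesses = {
    "#W1": "It was an ancient Marinere eftsoons there went he.",
    "#W2": "He was an old mariner, erstwhile he went there.",
    "#W3": "It was an old Marinere, eftsoons there went he.",
    "#W4": "He was an ancient Marinere, eftsoons he went there.",
}
tokenized = {sigil: text.split() for sigil, text in witnesses.items()}
length = len(next(iter(tokenized.values())))
# This naive approach assumes every witness has the same number of tokens.
assert all(len(tokens) == length for tokens in tokenized.values())

output = []
for i in range(length):
    # Group witnesses by the exact token each shows at this position.
    readings = {}
    for sigil, tokens in tokenized.items():
        readings.setdefault(tokens[i], []).append(sigil)
    if len(readings) == 1:
        # Every witness agrees: emit the token outside any <app>.
        output.append(next(iter(readings)))
    else:
        rdgs = "".join(f'<rdg wit="{" ".join(sigils)}">{token}</rdg>'
                       for token, sigils in readings.items())
        output.append(f"<app>{rdgs}</app>")

print(" ".join(output))

For these four strings the output begins exactly as my hand-keyed example did, separating It from He and ancient from old, and it necessarily registers "mariner," against "Marinere" and the presence or absence of the comma, because exact equality is the only criterion of sameness it knows. The transposition at the end comes out as two small substitutions rather than a marked transposition, which is one of the judgment calls a human editor, or better software, still has to make.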
Still, I wondered if I had given ChatGPT something too short and intricate, with too
many small variations. What if I added to the four strings something that should be
easy to align, a long stretch of identical tokens? Could ChatGPT show
improvement as we continued?
Elisa:
You're on the right track, but you're dropping some of the text in your
comparison. Let's try again with a longer set of texts to see if that helps.
- It was an ancient Marinere eftsoons there went he. He wanted to talk to the AI about collation and TEI.
- He was an old mariner, erstwhile he went there. He wanted to talk about TEI to the AI that day.
- It was an old Marinere, eftsoons there went he. He wanted to talk to the AI about TEI.
- He was an ancient Marinere, eftsoons he went there. He wanted to talk about TEI and AI.
ChatGPT:
Apologies for the oversight. Let's try again with the longer set of texts,
using the inline parallel segmentation for the TEI critical
apparatus:
<app>
<rdg wit="#W1 #W3">It was an ancient Marinere eftsoons there went he.</rdg>
<rdg wit="#W2">He was an old mariner, erstwhile he went there.</rdg>
<rdg wit="#W4">He was an ancient Marinere, eftsoons he went there.</rdg>
</app>
<app>
<rdg wit="#W1">He wanted to talk to the AI about collation and TEI.</rdg>
<rdg wit="#W2">He wanted to talk about TEI to the AI that day.</rdg>
<rdg wit="#W3">He wanted to talk to the AI about TEI.</rdg>
<rdg wit="#W4">He wanted to talk about TEI and AI.</rdg>
</app>
In this representation:
The first <app>
element contains the variant readings
for the first sentence: It was an ancient Marinere eftsoons there
went he.
(W1, W2, W3, and W4). The second
<app>
element contains the variant readings for the
second sentence: "He wanted to talk to the AI about collation and TEI."
(W1), "He wanted to talk about TEI to the AI that day." (W2), "He wanted to
talk to the AI about TEI." (W3), and "He wanted to talk about TEI and AI."
(W4).
This format shows the differences between the witnesses inline for each
sentence. Let me know if this meets your requirements, or if you need any
further adjustments!
Somehow in the series of prompts and responses, something I expect
ChatGPT to have learned seems to be forgotten. ChatGPT is returning
to their original preference for blocking long portions of text in the
<rdg> elements. The AI is nevertheless indicating, incorrectly,
that Texts 1 and 3 agree completely, when Text 3 contains a comma after
"Marinere". Is the experiment worth continuing? Reader, I did
continue, but saw similar patterns of error with dropping punctuation or variant
tokens, and with ignoring sameness in long blocks marked as variant. The last
response provided by ChatGPT in this series provides a good summary representation
of each of these problems:
ChatGPT:
<app>
<rdg wit="#W1 #W3">It was an</rdg>
<rdg wit="#W2 #W4">He was an</rdg>
</app>
<app>
<rdg wit="#W1">ancient</rdg>
<rdg wit="#W2">old</rdg>
<rdg wit="#W3 #W4">old</rdg>
</app>
Marinere eftsoons there went he. He wanted to talk to the AI about collation and TEI.
<app>
<rdg wit="#W1">He wanted to talk to the AI about collation and TEI.</rdg>
<rdg wit="#W2">He wanted to talk about TEI to the AI that day.</rdg>
<rdg wit="#W3">He wanted to talk to the AI about TEI.</rdg>
<rdg wit="#W4">He wanted to talk about TEI and AI.</rdg>
</app>
Now, the revised encoding correctly shows the common phrase "He wanted to
talk to the AI about collation and TEI." in the second text, which is
present across all four witnesses (W1, W2, W3, and W4).
ChatGPT trades one kind of problem for a new kind of error. There
are omissions and failures to recognize precisely what is common and what is not.
And now text that the AI claims to be the same across all versions is
also represented as variant. This is a rather spectacular
misrepresentation, and the kinds of errors make me wonder how the language model is
attempting to process my requests. Why is ChatGPT generating these errors?
I had learned that sometimes if you prompt the chatbot to pretend that they are an
expert in an area, this can improve their responses. In some of my
experiments this year I attempted this sort of "let's pretend" game, asking
ChatGPT to pretend they are an expert in the TEI who is preparing a digital
scholarly edition, but this did not help to improve the responses to my tasks.
The responses are predictably unreliable, and never once did I see a
response without errors. I also tried simplifying the task and asking the AI directly
only to "diff" some strings, wondering if that word might be more familiar to the
language model. But this made no difference, and I have yet
to see an accurate response to a prompt requesting a comparison of two or more
strings.
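For reference, this is the kind of answer the word diff ordinarily names: a deterministic, reproducible report. The sketch below uses Python's standard difflib as a stand-in for the command-line utility and compares two of the Mariner witnesses word by word.

from difflib import unified_diff

w1 = "It was an ancient Marinere eftsoons there went he."
w2 = "He was an old mariner, erstwhile he went there."

# One whitespace token per line, reported in familiar unified-diff form.
for line in unified_diff(w1.split(), w2.split(),
                         fromfile="W1", tofile="W2", lineterm=""):
    print(line)

Nothing in that report is invented, and nothing present in the input can silently disappear from it.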
Finally, I attempted an experiment with Anthropic’s Claude AI, which purports to
have a larger memory capacity and context window for prompt processing than
ChatGPT.
The results were slightly better out of the box. In this case I returned to
comparing two versions of the passage from Frankenstein:
Elisa:
You are preparing a critical edition in TEI using critical apparatus
markup. Can you prepare a sample of code for comparing two editions based on
the following three versions of the same passage?
Manuscript: It was on a dreary night of November that I beheld the frame
on which my man compleated. And with an anxiety that almost amounted to
agony I collected instruments of life around me that I might infuse a spark
of being into the lifeless thing that lay at my feet.
1818 edition and 1831 edition: It was on a dreary night of November, that
I beheld the accomplishment of my toils. With an anxiety that almost
amounted to agony, I collected the instruments of life around me, that I
might infuse a spark of being into the lifeless thing that lay at my
feet.
Claude.ai:
Here is a sample TEI encoding of the critical apparatus comparing those
three versions of the passage:
<p>It was on a dreary night of November, that I beheld the
<app>
<rdg wit="#MS">frame on which my man compleated</rdg>
<rdg wit="#1818 #1831">accomplishment of my toils</rdg>.
</app>
With an anxiety that almost amounted to agony, I collected the instruments of
life around me, that I might infuse a spark of being into the lifeless thing that
lay at my feet.</p>
This is
almost correct, and indeed, I
was surprised to find that Claude was unifying the text streams around the parallel
portions. Claude missed the differences in punctuation, however, and even when I
prompted the AI to correct for omissions they might have made, Claude failed to
identify the problems.
Experimenting with text-generative language models over the previous months has
illuminated some interesting problems in comparing text strings with models
that work on tokenized string inputs. Comparisons are fuzzy, overly broad, or
outright drop tokens, and they usually misrepresent the source texts in some way. Not only
were the results almost always inaccurate, but on repeated prompts and requests for
corrections, neither ChatGPT nor Claude could be said to improve significantly.
Why are the models so problematic at a task that involves identifying sameness and
difference in sequences of tokenized text? I can only speculate, since I lack the access
and expertise to comprehend the working systems under the hood of the
language models, but I suspect that the problems stem from a fundamentally
different way of reading tokens. That is, the method of reading input
and studying word embeddings in the language model is based on a more complicated
statistical process than the simple edit-distance calculation I requested. Perhaps
the AI can read
in no other way than by the (statistical)
books.
From stochastic parrot to reasoning intelligence?
As a human (read: limited, inconsistent) reader of texts, I am surprised that a
mathematically trained language model, a so-called artificial
intelligence, which I expect to be more consistent and adept at pattern
recognition than I am, consistently struggles with what seems to be a simple
comparison of strings. Is it a shortness of long short-term memory (LSTM) that
causes the models to lose track of the details in the process of calculating a
response? Perhaps ChatGPT’s current inability to analyze comparisons of strings has
something to do with its token-by-token generative stream. ChatGPT can correctly
tell me what Levenshtein (or edit) distance is:
ChatGPT:
The Levenshtein distance is a measure of the minimum number of
single-character edits (insertions, deletions, or substitutions) required to
change one string into another.
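The definition is correct, and the calculation it describes is entirely mechanical. For the record, a minimal dynamic-programming sketch of it (my own illustration, not code from ChatGPT or from my project) fits in a dozen lines of Python:

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Classic Wagner-Fischer dynamic-programming table, kept to one row.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]

print(levenshtein("Marinere", "mariner,"))  # 2: 'M'/'m' and the final 'e'/','

ChatGPT can recite the definition; performing the calculation reliably is another matter.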
The AI can also separate two different versions of a text in different
boxes. But it cannot accurately calculate the edit-distance and it cannot seem to
pinpoint variations. In the task of preparing an alignment table or a TEI critical
apparatus, the AI supplies the illusion of structure with some persistent blind
spots. Perhaps the model is failing to
see
what constitutes a reasonable,
meaningful basis for comparison of text streams. Perhaps this is due to a confusion
between the model’s understanding of tokens based on
relatedness, as opposed to their exact sameness or
difference. In other words, I suspect that something about semantics is getting in
the way: not literally the semantic meaning of the text, but rather the programming to
associate tokens with each other based on their co-occurrence, the vectorized word
embeddings that help the AI predict the next token and generate something that strikes
us as plausibly meaningful. The very capacities
that permit ChatGPT and Claude to converse in a convincingly reasonable voice interfere with
their capacity to calculate or to prepare an accurate programming script. We have
been learning these past several months simply to expect and live with AIs being
fast but fuzzy, to expect error, and certainly not to ask them to do our math homework.
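To put the contrast crudely: graded similarity and exact identity answer different questions. In the toy comparison below I use difflib's character-level ratio merely as a stand-in for whatever notion of relatedness the model encodes internally (that equivalence is my assumption, not a claim about the architecture). The point is only that two readings can be mostly alike and still be exactly what a collation must distinguish.

from difflib import SequenceMatcher

a, b = "Marinere", "mariner,"

# Graded similarity: the two readings are "mostly the same".
print(round(SequenceMatcher(None, a, b).ratio(), 2))  # 0.75

# Exact identity: the only question a critical apparatus can afford to ask.
print(a == b)  # False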
Perhaps we should not expect anything better. Today, dialogue with generative
language-based AI gives us the opportunity to declare and inquire with the voice of
reason, but the stochastic outputs we receive sometimes contradict themselves and
frequently miscalculate and misrepresent. We understand that the generation of a response is
based on statistical predictions of what might be the best-fit, reasonable next
tokens of text to supply in sequence, and that this makes generative language models
not intelligent at all but rather stochastic machines. Further, we understand that these
stochastic machines can do damage, because they amplify even the veiled language of racism
and sexism that often goes nearly unheard or unmarked in the everyday discourse of Wikipedia,
Reddit, and social media. In the year 2023, text-generative AI speaks with a language of authority
and confidence that amplifies normative values in its unstable predictions of the
best-fit content completion for a prompt. We know these limitations, and their
consequences are dire.
Reasoned authority is what we expect of optimal interactions with computers, and
this is a foundational value of our declarative markup community. In slides from a
2015 digital humanities conference presentation, Michael Sperberg-McQueen reminds us,
"Declarative semantics make it possible to reason about representations;
imperative semantics impede." Nothing in our modeling of texts escapes bias, but our capacity to
assert and test reasonable statements is a particular strength of declarative markup
and the tooling our community has developed for it. In that 2015 presentation,
Sperberg-McQueen also points out that hierarchical models are not neutral. The way
we organize document hierarchies, decide on markup representations, and create
schema rules to validate our models does not represent absolute
ground truth, but rather attempts to describe and define based on
what we choose to prioritize, whether that is the section headings of a legislative
memo or the page-by-page printing of a comic book. The models we create for
documents and the metadata we care about reflect the paradigms and priorities of the
humans who create them. Document historians of the future may come back to our XML
markup and find us benighted, but they could also research in our models the rules
of our publishing houses, the attitudes and expectations that prevailed in
understanding how to study language and archive our cultural heritage. The point is,
what we express in declarative semantics is fully visible, tractable, and usually
documented. Our ways of thinking about texts, however flawed, are explicitly and
deliberately marked. By stark contrast, the biases amplified by large language
models in our stochastic systems are revealed only by analyzing the outputs or
attempting to source their now-secret training data.
That seminal article from 1990, "What is Text, Really?", is striking in
the year 2023 for how current it remains as a critique of prevailing machine systems
for handling text. The authors' proposition, that text really is an Ordered Hierarchy of Content
Objects, came explicitly as a response (among other things) to concepts of text as a
stream of content objects, the gram particles and formatting instructions without
reference to structural context. In the year 2023, when we ask questions of large
language models, we are given to understand that text is generated in response to a
limited context window
that the language model uses for perceiving
the range of nearby tokens and their proximity to the current token as a frame of
reference. Is it then the case that context in the large language model is
determined by frames of reference in a stream of text? When an approximation
of semantics is coded and decoded in vector space by position and token
co-occurrence, this generates the appearance of meaningful output, which cannot be
said to represent truth but might sometimes approximate it. This is the argument of
the famous "Stochastic Parrots" article: that large language models
distort, exploit, and waste resources in an effort only to produce unreliable and
biased approximations.
Should we expect better of so-called artificial intelligence
in our
time, to attempt something more than formulaic approximations of reasoned responses?
Multiple researchers suggest that large language models could attempt a wider range
of learning methods, to associate metadata with data, or to attempt solving complex
problems using hierarchical decision trees. A group of researchers writing from the NIH, discussing the application
of AI in medical diagnostics, explicitly calls for declarative methods to improve
the learning algorithms. Because they historicize artificial intelligence to show
how declarative methods once were central in Lisp and Prolog, they provide a helpful
differentiation between AI then and now:
The main goal of AI has been to develop programs that make intelligent
decisions and solve real-world problems, possibly dealing with “messy” real
world input that could make it difficult to handle using “conventional”
programming. The earlier AI problem solvers were expert systems that
attempted to model the way experts reason and make decisions using a set of
logical rules. Programming languages like Lisp and Prolog were designed to
make programming such systems easy even for non-expert users. The idea was
to represent the domain knowledge using a set of logical rules, and use the
rules in a logical reasoning process hidden from the programmers.
From the traditional AI perspective, this is a declarative programming
paradigm where we program for the what and not the how. The expert programs
could go beyond an independent set of rules and turn to logical programs
with a Turing-complete expressivity, supporting logical inference, for
example, by unification and resolution. However, real-world problems are
complex and often involve many interdependent components … It has become
evident that formalizing complex problem solving using programming a finite
set of deterministic logic-based rules is not possible, nor is it possible
to write a conventional structured program, even with a Turing-complete
language, for supporting intelligent decision-making based on naturally
occurring data.
The virtues of declarative methods are here expressed in familiar
terms: domain experts can communicate needed information and apply programming to
validate according to logical constraints. The authors point out how insufficient
these machine learning techniques are and call for augmenting those techniques with
what they
call Systems AI using declarative methods.
We emphasize the need to use some fundamental declarative ideas such as
first-order query languages, knowledge representation and reasoning
techniques, programming languages for multi agent systems, database
management systems (DBMS), and deductive databases (DDB). We need to place
these ideas within and around ML formalisms including classical ML tools,
deep learning libraries and automatic differentiation tools, and integrate
them with innovative programming languages and software development
techniques, as a way to address complex real-world problems that require
both learning and reasoning models.
I have quoted this article at length to emphasize a point that should not be
forgotten about the origins of artificial intelligence work in declarative methods,
and the capacity of declarative methods to alter the grounds of what a machine
intelligence can reference. The language models with which we interact in 2023
cannot currently adjust their paradigms for how they evaluate language. If they are
to develop into precision instruments that do no harm, declarative methods may be the way
forward.