Introduction: Text tokenization and declarative methods in digital humanities
Many scholars in the digital humanities work on natural language processing projects
that involve tokenizing texts and applying software to study how they cluster and
co-occur. Studies of term/token frequency-inverse document frequency (TF-IDF) are
the
basis of experiments like topic modeling, an approximate statistical effort to survey
topics
based on clustering of tokens. The practice of tokenizing texts is also fundamental to stylometry, which seeks evidence of the patterns and clusterings that make a voice distinctive. Among digital humanists who have spoken at the ADHO conference from 2015 onward, my own work on digital scholarly editions involves me much more than is usual with declarative markup through XML and TEI.[1] This apparent marginality in my field means I have a slightly unusual and persistently useful toolkit, one I have been using for a decade, and it also makes me a resource for my colleagues when they need to learn these technologies. My work
with declarative methods also allows me to enter the world of natural language
processing and explore documents as my colleagues do where we share common areas of
interest. These days I regularly move data from my marked-up documents into the realm
of
natural language processing to identify and analyze patterns. Such work tends to cycle
from marked-up XML documents to strings for tokenization and processing, and then
back
into markup to frame the outputs for sharing and visualizing the data in SVG, XML, and XHTML5. The cycle from declarative markup, to string-processing algorithms, and back to output
declarative markup is a fundamental aspect of my research and teaching. But for many
in
my field, only the string processing portion of this is relevant, as they are not
regularly involved as I am with designing custom interfaces. When you work with digital
scholarly editions, you think of interfaces and representation, and you also tend
to
work closely with lots of textual data drawn from single works or small collections,
instead of at a distance from thousands or millions of documents treated as
unstructured
streams of tokenized data. I began my career in the
United States in the 1990s in a literary and cultural field called "English", and I know that many of my digital humanities colleagues invested in natural language processing share that background with me, and share much of my own migration experience into zones delineated "digital humanities" in our strange institutional adventures with computers. But I
think of texts quite differently than many of my digital humanist colleagues do, simply
because in my document modeling and research with digital scholarly editions, I am
processing texts in ways that seem to be out of scope in their work (at least for
now).
Some of my analytical colleagues opine that declarative markup seems unnecessary in
a
time of big text data analytics, particularly when the making of declarations about
the
document form and content seems to encumber or inhibit projects at scale. The classifications and categorizations made by AI, they suggest, should be able to do the work of markup or render it obsolete.[2] In this paper, I want to address the question of what declarative markup can
do for us who engage in textual scholarship in the digital humanities. More needs
to be
declared just now about what we do with declarative methods and semantics and how
they
might assist the efforts of AI in our time.
Because I am steeped in scholarly editing and a declarative markup
specialization
in the digital humanities, my efforts to test the chat
interfaces of large language models reflect my distinct research interests in comparing
and collating texts. Over the course of this year, I have realized something about
the
authority and efficacy of declarative methods in text analysis, and I hope this
realization is worth sharing with declarative markup specialists and their adjacent
text-scholarly friends in the digital humanities. In this paper I will discuss a form
of
token-based text analysis that consistently foils large language models, and I will
show
how declarative markup, interacting with imperative programming, simply and profoundly
makes that analysis possible to accomplish. This is a paper intended to demonstrate
the
importance of declarative authority in our text-analytic systems and to join a call
for
a more declarative approach to what is currently an unintelligent (so-called "AI") system for modeling language.
Can AI help with my document collation?
I have been working on a project over the past few years that has challenged me to
explore, test, and refine a machine-assisted method for comparing versions of a text.
The project involves comparing five versions of the novel
Frankenstein, and the basis for comparing these versions
includes the markup from editions that were coded differently. I have shared papers
about these adventures over the past few Balisage meetings, including the fun of
flattening and raising markup and the challenge of comparing strings that include
representations of the markup in the editions.[3] Even when it is machine-assisted, document collation is tiring, tedious work.
It is one thing to prepare an algorithm for comparison and apply it to good, adaptable
software for the purpose, but it is quite another to have to correct the output. That is
where the real challenge begins: the intellectual challenge, mental discipline, or "self-psych-out" of machine-assisted collation. When do you give up trying to refine the software algorithm, and when do you "crack" and resort to hand-correcting problematic outputs? Sometimes
giving up really slows down a project, when it is still possible to refine the method.
Not giving up on the process requires patience to keep tinkering with the machinery and to keep testing. Would it not be wonderful for artificial intelligence to assist and accelerate
this painstaking effort, if there could be a reliable way to train a large language
model to process text
the way an editor of a digital scholarly edition would?
The sensational public launch of ChatGPT on November 30, 2022 has sparked excitement, confusion, concern, and a new surge of publications across academia and the public sector.[4] ChatGPT has also given my academic circles much fun with prompt experiments. We would ask for introductions of ourselves and our colleagues suitable for use at conferences and laugh at how ludicrously wrong they are. (In one invented bio I am both a medievalist and a published poet, neither of which is true, but I could speculate that somehow the words associated with me in ChatGPT’s training had been in proximity with medievalist friends and people far more creative than me.) Of course we also prompted ChatGPT to invent supposedly serious bibliographies to see its made-up citations and fake URLs. At universities, teaching faculty worry about our students abusing generative AI to compose their papers, even as we explore new possibilities for the writing process. Academics who write computer programs quickly discovered a supremely helpful aspect of ChatGPT for debugging code or quickly introducing how to access a software library on the fly. Once we have seen such benefits, we tend to recognize that our students should also cultivate skills to prompt AI, responsibly, for coding assistance. ChatGPT has saved time for me and my most dedicated students, time that we might otherwise have spent combing through documentation and Stack Overflow posts. Even when ChatGPT’s code plainly did not work, the AI was leading us rapidly to the ad-hoc fixes we needed to make, and particularly when we began with a foundation of knowledge, we could work within a dialogue and improve our understanding.
Encouraged by such time-saving coding assistance, my colleagues, students, and I have
been eagerly experimenting all year with something that seems to promise the ultimate
declarative possibility: to directly ask a machine to deliver an
output, without directing how it generates that output. The developers at OpenAI fondly
aspire for their chat model to become "a very eager junior programmer" that will "make completely new workflows effortless and efficient".[5] I wondered how well the new AI models might perform on the very task
that has taken me and my colleagues and students years to refine: the machine-assisted
collation of multiple versions of a work in the form of manuscripts and printed
documents. The processes that underlie this involve aligning chunks of text, and a
token-by-token processing of streams of text pulled from marked-up documents. The
large
language models supporting generative AI are themselves based on word embeddings and the tokenized processing of text streams. Could the "word arithmetic" we associate with text-generative AI be applied to comparison algorithms? Would the language model just "know" how to optimize the most reasonable alignments and outputs?
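To make concrete what I mean by "word arithmetic", here is a rough sketch using pretrained embeddings via gensim's downloader; the model name ("glove-wiki-gigaword-50") and the download step are assumptions of this illustration, not part of my collation pipeline:

import gensim.downloader as api

# A small set of pretrained word vectors; the first call downloads them.
vectors = api.load("glove-wiki-gigaword-50")

# The classic "word arithmetic": king - man + woman is close to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

Relatedness of this kind is exactly what embeddings are optimized for; whether it translates into the exact, token-for-token discrimination that collation requires is the question my experiments set out to test.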
At various moments between January and July 2023, I began testing a hypothesis that a machine trained on tokenized text and word embeddings should excel at the task of token-by-token comparison of strings. I will document some of my representative prompt engineering experiments in the next section. My prompts could likely be improved, but I do have specific requirements for an optimal collation in mind that I have been attempting to declare to the AI, and the exercise has been instructive. Over the past several months in my prompt experiments to engage ChatGPT and Anthropic’s Claude in various permutations of this task, they have not only been wanting, but remarkably and consistently so. The ways in which generative AI has bungled my series of string-comparison challenges may tell us something interesting about the limits of current large language models to correctly observe the differences between strings, and to express those differences in structured forms. The errors are illuminating in a way that demonstrates something seriously lacking in the text-generative system in 2023, and that is a fully declarative method of working with input and generating output.
What happens when we ask a large language model to compare strings?
To begin to estimate whether AI could assist my collation work, I needed to work within the limits of ChatGPT’s input window. But just as I could ask the AI to proofread a paragraph or two for errors, surely I could provide some input strings and ask the AI to compare them and show me how they differ. In no case did I ever receive reliable, accurate results, but the kinds of errors I saw were interesting to diagnose. Here I share a representative sampling of the experiments.[6]
In some of my experiments, I simply requested that ChatGPT present me a table showing me how two versions of a text align. In this prompt, I indicated that two editions share a version, and a manuscript presents a different version. I was curious to see how the AI would handle two versions shared by three documents.
Elisa:
Can you prepare a data structure to show where these texts are the same and where they differ?
Manuscript: It was on a dreary night of November that I beheld the frame on which my man compleated. And with an anxiety that almost amounted to agony I collected instruments of life around me that I might infuse a spark of being into the lifeless thing that lay at my feet.
1818 edition and 1831 edition: It was on a dreary night of November, that I beheld the accomplishment of my toils. With an anxiety that almost amounted to agony, I collected the instruments of life around me, that I might infuse a spark of being into the lifeless thing that lay at my feet.
ChatGPT:
To compare the texts and visualize the similarities and differences, you can use a data structure such as a table or a side-by-side comparison. Here's an example of how you can represent the texts:
+---------------------+---------------------------+-------------------------+ | Manuscript | 1818 edition | 1831 edition | +---------------------+---------------------------+-------------------------+ | It was on a dreary | It was on a dreary | It was on a dreary | | night of November | night of November | night of November | | that I beheld the | that I beheld the | that I beheld the | | frame on which my | accomplishment of my | accomplishment of my | | man compleated. | toils. | toils. | | And with an anxiety | With an anxiety that | With an anxiety that | | that almost amounted| almost amounted to agony, | almost amounted to agony| | to agony I collected| I collected the | I collected the | | instruments of life | instruments of life | instruments of life | | around me that I | around me, | around me, | | might infuse a spark| that I might infuse a | that I might infuse a | | of being into the | spark of being into the | spark of being into the | | lifeless thing that | lifeless thing that lay | lifeless thing that lay | | lay at my feet. | at my feet. | at my feet. | +---------------------+---------------------------+-------------------------+In this representation, each row corresponds to a line of text, and the columns represent the different versions: Manuscript, 1818 edition, and 1831 edition. The matching text is aligned vertically, allowing you to see where the texts are the same and where they differ.
By comparing the texts using this structure, you can easily identify the variations between the Manuscript, 1818 edition, and 1831 edition.
Elisa:
Let's try to refine this. Where the texts are identical, just show one block of text on the screen. Then when the texts are different, use the table to put the differences side by side.
ChatGPT:
Sure! Let's refine the representation to show the identical text blocks once and use a table to display the differing portions side by side. Here's an updated representation:
Identical text: ------------------------- It was on a dreary night of November Text differences: +---------------------+---------------------------+-------------------------+ | Manuscript | 1818 edition | 1831 edition | +---------------------+---------------------------+-------------------------+ | that I beheld the | that I beheld the | that I beheld the | | accomplishment of my| accomplishment of my | accomplishment of my | | toils. | toils. | toils. | | With an anxiety that| With an anxiety that | With an anxiety that | | almost amounted to | almost amounted to agony, | almost amounted to agony| | agony I collected | I collected the | I collected the | | instruments of life | instruments of life | instruments of life | | around me, that I | around me, | around me, | | might infuse a spark| that I might infuse a | that I might infuse a | | of being into the | spark of being into the | spark of being into the | | lifeless thing that | lifeless thing that lay | lifeless thing that lay | | lay at my feet. | at my feet. | at my feet. | +---------------------+---------------------------+-------------------------+In this refined representation, the identical text block "It was on a dreary night of November" is displayed once at the top, indicating that this portion is the same in all versions. The table below then shows the differing text portions side by side, highlighting the variations between the Manuscript, 1818 edition, and 1831 edition.
ChatGPT claims that this representation highlights the variations between the editions. A few strange things have happened. First, the language model does correctly indicate that the first portion of the string is identical. But ChatGPT misses the difference that there is a comma in the 1818 and 1831 versions, and simply does not represent the comma, so we have dropped a small token of text. Next, in the table, we completely miss the variant passage from the manuscript: "the frame on which my man compleated. And". Instead, ChatGPT represents the manuscript text as the same block provided for the other two. Is the AI forgetting the differences? Yet they seem to understand that some indication of differences is expected, and they highlight words of between two and four characters that are identical in the three blocked passages. This is distressingly inaccurate, and it raises questions about how ChatGPT determines what to highlight.
I tried again with this experiment. I understood that ChatGPT’s memory might well be too short to handle the text processing completely, but I wondered whether they could improve the response, much as we can prod them to help us debug our code. Perhaps it would help to provide the source texts again to restore them to the context window of memory.
Elisa:
Your approach is what I had in mind, good. But you've forgotten where the source texts are different. Can you redo this, taking another look at the source texts?
Manuscript: It was on a dreary night of November that I beheld the frame on which my man compleated. And with an anxiety that almost amounted to agony I collected instruments of life around me that I might infuse a spark of being into the lifeless thing that lay at my feet.
1818 edition and 1831 edition: It was on a dreary night of November, that I beheld the accomplishment of my toils. With an anxiety that almost amounted to agony, I collected the instruments of life around me, that I might infuse a spark of being into the lifeless thing that lay at my feet.
ChatGPT:
Apologies for the oversight. Let's revise the representation to include the differing portions as well. Here's the updated version:
Text differences: +---------------------+---------------------------+-------------------------+ | Manuscript | 1818 edition | 1831 edition | +---------------------+---------------------------+-------------------------+ | It was on a dreary | It was on a dreary | It was on a dreary | | night of November | night of November | night of November | | that I beheld the | that I beheld the | that I beheld the | | frame on which my | accomplishment of my | accomplishment of my | | man compleated. | toils. | toils. | | And with an anxiety | With an anxiety that | With an anxiety that | | that almost amounted| almost amounted to agony, | almost amounted to agony| | to agony I collected| I collected the | I collected the | | instruments of life | instruments of life | instruments of life | | around me that I | around me, | around me, | | might infuse a spark| that I might infuse a | that I might infuse a | | of being into the | spark of being into the | spark of being into the | | lifeless thing that | lifeless thing that lay | lifeless thing that lay | | lay at my feet. | at my feet. | at my feet. | +---------------------+---------------------------+-------------------------+In this refined representation, we have the text differences table with the Manuscript, 1818 edition, and 1831 edition side by side. The portions that are the same in all versions are omitted for brevity, and only the differing sections are displayed.
The revised table does include the manuscript's variant "And", but the AI has otherwise simply highlighted short words across all the witnesses. What most surprised me was ChatGPT's assertion that they had "omitted" all the passages that were the same, when they omitted nothing. And the AI seemed to have no understanding of the concept of differentiation.
I tried similar experiments this summer, wondering if the language model’s string-differentiation powers might have improved after the news that ChatGPT had a newly expanded context window for processing input. After June 21, developers have been able to prompt ChatGPT with 8,000-word prompts, and it can store longer passages in memory. My prompts were not so long, but I suspect there is something challenging to the memory processing about my requests to compare strings, and I was curious whether I might see improvements. Among my experiments were efforts to ask ChatGPT to provide responses not in an alignment table but in TEI critical apparatus markup.[7]
Here is a more recent experiment, requesting output in TEI critical apparatus
markup (with apologies to Samuel Taylor Coleridge for my flagrant abuse of
The Rime of the Ancient Mariner
).
Elisa:
Given these four text inputs, can you collate them and generate a TEI critical apparatus?
It was an ancient Marinere eftsoons there went he.
He was an old mariner, erstwhile he went there.
It was an old Marinere, eftsoons there went he.
He was an ancient Marinere, eftsoons he went there.
ChatGPT:
Certainly! A critical apparatus in Text Encoding Initiative (TEI) format typically consists of variant readings with corresponding witnesses. In this case, the four text inputs are slightly different from each other, and we can create a simple critical apparatus for them. I will assume that each text is a separate witness (W1, W2, W3, and W4) in the critical apparatus.
Here's the TEI critical apparatus:
<app> <lem> It was an ancient Marinere eftsoons there went he. </lem> <rdg wit="#W2"> He was an old mariner, erstwhile he went there. </rdg> <rdg wit="#W3"> It was an old Marinere, eftsoons there went he. </rdg> <rdg wit="#W4"> He was an ancient Marinere, eftsoons he went there. </rdg> </app>In this representation:
<lem>
stands for the lemma (the base text).
<rdg>
stands for a variant reading.
wit="#W1"
represents the witness identifier for each variant reading.Now, the critical apparatus shows that the lemma (base text) is "It was an ancient Marinere eftsoons there went he," and three variants (W2, W3, and W4) are provided with their corresponding readings.
ChatGPT has wrapped the entire collation in a single <app> element, showing each of the four versions inside. ChatGPT’s explanation is clear. I am not sure how they arrived at a concept of the lemma, but they seem to be defaulting to the first string as being somehow preferred for the base text. But this markup is too simple and not especially useful if I want to highlight precisely where the texts align and where they differ. So I continued the
conversation.
Elisa:
Can you improve this by using the
<app>
elements only where there are forks or deltas in the text?
ChatGPT:
Certainly! In a critical apparatus, we can use the
<app>
element only where there are forks or deltas (variants) in the text. This will help in presenting a more concise representation of the textual differences. Here's the improved TEI critical apparatus:<app> <lem> It was an ancient Marinere eftsoons there went he. </lem> <rdg wit="#W2"> He was an old mariner, erstwhile he went there. </rdg> <rdg wit="#W3"> It was an old Marinere, eftsoons there went he. </rdg> <rdg wit="#W4"> He was an ancient Marinere, eftsoons he went there. </rdg> </app>In this representation, the
<app>
element encloses the entire set of variant readings (rdg elements) because there are textual differences among the witnesses at this point. Since the lemma (base text) remains the same in all cases, we don’t need separate<app>
elements for each witness.
If I want to approach the problem differently, perhaps I need a better prompt. So I try again. In the prompt below, I attempted to show ChatGPT what kind of output I wanted to see, and gave the AI a start that I requested them to continue. Adding the code freehand without a syntax checker, I inadvertently made two mistakes which mangled this code:
- I pasted in one of ChatGPT’s <lem> elements from the previous prompt, intending to modify it to an <rdg> element. But in haste, I only changed it to an <rdg> start tag and left the <lem> end tag.
- I left the second <app> element unclosed.
Elisa:
That is not what I meant. I'd like you to apply the critical apparatus using inline parallel segmentation. This means you should show just one line of text where all witnesses are the same, and then use the
<app>
element in the specific moments where the text differs. Here is how you could begin:<app> <rdg wit="#W1 #W3">It</lem> <rdg wit="#W2 #W4">He</rdg> </app> was an <app> <rdg wit="#W1 #W4">ancient</rdg> <rdg wit="#W2 #W3>old</rdg>Can you continue based on this start?
ChatGPT:
Apologies for the misunderstanding. I understand now that you want to represent the critical apparatus using inline parallel segmentation. Here's the continuation based on the start you provided:
<app> <rdg wit="#W1 #W3">It</rdg> <rdg wit="#W2 #W4">He</rdg> </app> was an <app> <rdg wit="#W1 #W4">ancient</rdg> <rdg wit="#W2 #W3">old</rdg> </app> Marinere <app> <rdg wit="#W1 #W3 #W4">eftsoons there</rdg> <rdg wit="#W2">erstwhile</rdg> </app> went he.In this representation, the critical apparatus is represented using the
<app>
element, and within each<app>
element, we use the<rdg>
elements to indicate the variant readings for different witnesses (W1, W2, W3, and W4).I hope this format better meets your requirements for inline parallel segmentation. Let me know if you need any further adjustments!
To evaluate the result, recall the four source texts:
- It was an ancient Marinere eftsoons there went he.
- He was an old mariner, erstwhile he went there.
- It was an old Marinere, eftsoons there went he.
- He was an ancient Marinere, eftsoons he went there.
ChatGPT's apparatus treats all four witnesses as agreeing on "Marinere", missing the difference of witness 2, which has "mariner". The AI also misses the presence or absence of the comma (omitting it entirely in the critical apparatus). Finally, there is the issue of representing transposition. Versions 2 and 4 show "he went there" while Versions 1 and 3 have "there went he". ChatGPT almost completely misses these differences in representing the ends of the passages. The apparatus correctly represents Text 2 as the only variant holding "erstwhile", but it incorrectly indicates Text 4 as showing "eftsoons there" along with Texts 1 and 3. ChatGPT should have stopped the alignment at "eftsoons" and created a new <app> to hold the variation of "there went he" vs. "he went there". The AI was doing pretty well, and if well-formed markup were all that was expected, it could be said to write good XML. But if accurate content in the critical apparatus is called for, this approximation drops the information needed for that accuracy.
Still I wondered if I had given ChatGPT something too short and intricate with too many small variations. What if I added a bit to the four strings that would give the AI something that should be easy to align, a long stretch of identical tokens? Could ChatGPT show improvement, continuing on?
Elisa:
You're on the right track, but you're dropping some of the text in your comparison. Let's try again with a longer set of texts to see if that helps.
It was an ancient Marinere eftsoons there went he. He wanted to talk to the AI about collation and TEI.
He was an old mariner, erstwhile he went there. He wanted to talk about TEI to the AI that day.
It was an old Marinere, eftsoons there went he. He wanted to talk to the AI about TEI.
He was an ancient Marinere, eftsoons he went there. He wanted to talk about TEI and AI.
ChatGPT:
Apologies for the oversight. Let's try again with the longer set of texts, using the inline parallel segmentation for the TEI critical apparatus:
<app> <rdg wit="#W1 #W3">It was an ancient Marinere eftsoons there went he.</rdg> <rdg wit="#W2">He was an old mariner, erstwhile he went there.</rdg> <rdg wit="#W4">He was an ancient Marinere, eftsoons he went there.</rdg> </app> <app> <rdg wit="#W1">He wanted to talk to the AI about collation and TEI.</rdg> <rdg wit="#W2">He wanted to talk about TEI to the AI that day.</rdg> <rdg wit="#W3">He wanted to talk to the AI about TEI.</rdg> <rdg wit="#W4">He wanted to talk about TEI and AI.</rdg> </app>In this representation:
The first
<app>
element contains the variant readings for the first sentence:It was an ancient Marinere eftsoons there went he.(W1, W2, W3, and W4). The second<app>
element contains the variant readings for the second sentence: "He wanted to talk to the AI about collation and TEI." (W1), "He wanted to talk about TEI to the AI that day." (W2), "He wanted to talk to the AI about TEI." (W3), and "He wanted to talk about TEI and AI." (W4).This format shows the differences between the witnesses inline for each sentence. Let me know if this meets your requirements, or if you need any further adjustments!
Whatever the AI appeared to have learned seems to be forgotten. ChatGPT is returning to their original preference for blocking long portions of text in the <rdg> elements. The AI is nevertheless indicating, incorrectly, that Texts 1 and 3 agree completely, when Text 3 contains a comma after "Marinere". Is the experiment worth continuing? Reader, I did continue, but I saw similar patterns of error, with dropped punctuation or variant tokens, and with sameness ignored in long blocks marked as variant. The last response ChatGPT provided in this series offers a good summary representation of each of these problems:
ChatGPT:
<app> <rdg wit="#W1 #W3">It was an</rdg> <rdg wit="#W2 #W4">He was an</rdg> </app> <app> <rdg wit="#W1">ancient</rdg> <rdg wit="#W2">old</rdg> <rdg wit="#W3 #W4">old</rdg> </app> Marinere eftsoons there went he. He wanted to talk to the AI about collation and TEI. <app> <rdg wit="#W1">He wanted to talk to the AI about collation and TEI.</rdg> <rdg wit="#W2">He wanted to talk about TEI to the AI that day.</rdg> <rdg wit="#W3">He wanted to talk to the AI about TEI.</rdg> <rdg wit="#W4">He wanted to talk about TEI and AI.</rdg> </app>Now, the revised encoding correctly shows the common phrase "He wanted to talk to the AI about collation and TEI." in the second text, which is present across all four witnesses (W1, W2, W3, and W4).
I had learned that sometimes if you prompt the chatbot to pretend that they are an
expert in an area, this can improve their responses. In some of my
experiments this year I attempted this sort of "let's pretend" game, to ask ChatGPT to pretend they are an expert in the TEI and preparing a digital
scholarly edition, but this did not help to
improve the responses to my tasks. They are predictably unreliable, and never once
did I see a
response without errors. I also tried simplifying the task and asking the AI directly
only to "diff" some strings, wondering if that word might be more familiar to the language model.
But this made no difference and I have yet
to see an accurate response to a prompt requesting a comparison of two or more
strings.
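For contrast, a deterministic diff of the openings of two witness strings is a one-import affair in ordinary Python; this sketch (my own illustration, not part of the prompt experiments above) uses the standard library's difflib to report word-level differences:

from difflib import SequenceMatcher

ms = ("It was on a dreary night of November that I beheld the frame "
      "on which my man compleated.")
ed = ("It was on a dreary night of November, that I beheld the "
      "accomplishment of my toils.")

# Compare word by word rather than character by character.
a, b = ms.split(), ed.split()
for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
    if tag != "equal":
        print(tag, "|", " ".join(a[i1:i2]), "->", " ".join(b[j1:j2]))

Nothing here is guessed: the same input always produces the same alignment, which is precisely the property the chat interfaces could not give me.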
Finally, I attempted an experiment with Anthropic’s Claude AI, which purports to have a larger memory capacity and context window for prompt processing than ChatGPT.[8] The results were slightly better out of the box. In this case I returned to comparing two versions of the passage from Frankenstein:
Elisa:
You are preparing a critical edition in TEI using critical apparatus markup. Can you prepare a sample of code for comparing two editions based on the following three versions of the same passage?
Manuscript: It was on a dreary night of November that I beheld the frame on which my man compleated. And with an anxiety that almost amounted to agony I collected instruments of life around me that I might infuse a spark of being into the lifeless thing that lay at my feet.
1818 edition and 1831 edition: It was on a dreary night of November, that I beheld the accomplishment of my toils. With an anxiety that almost amounted to agony, I collected the instruments of life around me, that I might infuse a spark of being into the lifeless thing that lay at my feet.
Claude.ai:
Here is a sample TEI encoding of the critical apparatus comparing those three versions of the passage:
<p>It was on a dreary night of November, that I beheld the <app> <rdg wit="#MS">frame on which my man compleated</rdg> <rdg wit="#1818 #1831">accomplishment of my toils</rdg>. </app> With an anxiety that almost amounted to agony, I collected the instruments of life around me, that I might infuse a spark of being into the lifeless thing that lay at my feet.</p>
Experimenting with text-generative language models in the previous months has
illuminated some interesting problems in comparing text strings in language models
that work with tokenized string inputs. Comparisons are fuzzy, overly broad, or
outright drop tokens and usually misrepresent the source texts in some way. Not only
were the results almost always inaccurate, but on repeated prompts and requests for
corrections, neither ChatGPT nor Claude could be said to improve significantly.
Why are the models so problematic at a task that involves identifying sameness and difference in sequences of tokenized text? I can only speculate, since I lack the access or expertise to comprehend the working systems "under the hood" of the language models, but the problems seem to stem from a fundamentally different way of "reading" tokens. That is, the method of reading input and studying word embeddings in the language model is based on a more complicated statistical process than the simple edit-distance calculation I requested. Perhaps the AI can "read" in no other way than by the (statistical) books.
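The calculation I had in mind is entirely mechanical; here is a minimal sketch of the classic dynamic-programming edit distance (my own illustration, not code produced by the chat models):

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Marinere", "mariner"))  # 2 (case-sensitive)

A deterministic procedure like this never forgets a comma; the chat interface, by design, predicts rather than computes.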
From stochastic parrot to reasoning intelligence?
As a human (read: limited, inconsistent) reader of texts, I am surprised that a mathematically trained language model, a so-called "artificial intelligence" which I expect to be more consistent and adept at pattern recognition than I am, consistently struggles with what seems to be a simple comparison of strings. Is it a shortage of long short-term memory (LSTM) that causes the models to lose track of the details in the process of calculating a response? Perhaps ChatGPT’s current inability to analyze comparisons of strings has something to do with its token-by-token generative stream. ChatGPT can correctly tell me what Levenshtein (or edit) distance is:
ChatGPT:
The Levenshtein distance is a measure of the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
Yet ChatGPT cannot seem to see what constitutes a reasonable, meaningful basis for comparison of text streams. Perhaps this is due to a confusion between the model’s understanding of tokens based on relatedness, as opposed to their exact sameness or difference. In other words, I suspect that something about semantics is getting in the way: not literally the semantic meaning of the text, but rather the programming to associate tokens with each other based on their co-occurrence, the vectorized word embeddings that help the AI predict the next token and generate something that strikes us as plausibly meaningful. The very capacities that permit ChatGPT and Claude to converse in a convincingly reasonable voice interfere with their capacities to calculate or to prepare an accurate programming script. We have been learning these past several months simply to expect and live with AIs being fast but fuzzy, to expect error, and certainly not to ask them to do our math homework.[9]
Perhaps we should not expect anything better. Today, dialogue with generative language-based AI gives us the opportunity to declare and inquire with the voice of reason, but the stochastic outputs we receive sometimes contradict themselves and frequently miscalculate and misrepresent. We understand that generation from a prompt is based on statistical predictions of what might be the best-fit, reasonable next tokens of text to supply in sequence, and that this makes generative language models not intelligent at all but rather stochastic machines.[10] Further, we understand these stochastic machines can do damage, because they amplify even the veiled language of racism and sexism that often goes nearly unheard or unmarked in the everyday discourse of Wikipedia and Reddit and social media.[11] In the year 2023, text-generative AI speaks with a language of authority and confidence that amplifies normative values in its unstable predictions of the best-fit content completion for a prompt. We know these limitations, and their consequences are dire.
Reasoned authority is what we expect of optimal interactions with computers, and
this is a foundational value of our declarative markup community. In slides from a
2015 digital humanities conference presentation, Michael Sperberg-McQueen reminds
us, "Declarative semantics make it possible to reason about representations; imperative semantics impede."[12] Nothing in our modeling of texts escapes bias, but our capacity to
assert and test reasonable statements is a particular strength of declarative markup
and the tooling our community has developed for it. In that 2015 presentation,
Sperberg-McQueen also points out that hierarchical models are not neutral. The way
we organize document hierarchies, decide on markup representations, and create schema rules to validate our models does not represent absolute "ground truth", but rather attempts to describe and define based on
what we choose to prioritize, whether that is the section headings of a legislative
memo or the page-by-page printing of a comic book. The models we create for
documents and the metadata we care about reflect the paradigms and priorities of the
humans who create them. Document historians of the future may come back to our XML
markup and find us benighted, but they could also research in our models the rules
of our publishing houses, the attitudes and expectations that prevailed in
understanding how to study language and archive our cultural heritage. The point is,
what we express in declarative semantics is fully visible, tractable, and usually
documented. Our ways of thinking about texts, however flawed, are explicitly and
deliberately marked. By stark contrast, the biases amplified by large language
models in our stochastic systems are revealed only by analyzing the outputs or
attempting to source their now-secret training data.
That seminal article from 1990, "What is Text, Really?", is striking in the year 2023 for how current it remains as a critique of prevailing machine systems for handling text.[13] The authors' proposition, that text really is an Ordered Hierarchy of Content Objects, came explicitly as a response (among other things) to concepts of text as a stream of content objects, the gram particles and formatting instructions without reference to structural context. In the year 2023, when we ask questions of large
language models we are given to understand that text is generated in response to a
limited context window
that the language model uses for perceiving
the range of nearby tokens and their proximity to the current token as a frame of
reference. Is it then the case that context in the large language model is
determined by frames of reference in a stream of text? When an approximation
of semantics is coded and decoded in vector space by position and token
co-occurrence, this generates the appearance of meaningful output, which cannot be
said to represent truth but might sometimes approximate it. This is the argument of
the famous Stochastic Parrots
article: that large language models
distort, exploit, and waste resources in an effort only to produce unreliable and
biased approximations.[14]
Should we expect better of so-called artificial intelligence
in our
time, to attempt something more than formulaic approximations of reasoned responses?
Multiple researchers suggest that large language models could attempt a wider range of learning methods, to associate metadata with data, or to attempt solving complex
problems using hierarchical decision trees.[15] A group of researchers writing from the NIH, discussing the application
of AI in medical diagnostics, explicitly calls for declarative methods to improve
the learning algorithms. Because they historicize artificial intelligence to show
how declarative methods once were central in Lisp and Prolog, they provide a helpful
differentiation between AI then and now:
The main goal of AI has been to develop programs that make intelligent decisions and solve real-world problems, possibly dealing with “messy" real world input that could make it difficult to handle using “conventional" programming. The earlier AI problem solvers were expert systems that attempted to model the way experts reason and make decisions using a set of logical rules. Programming languages like Lisp and Prolog were designed to make programming such systems easy even for non-expert users. The idea was to represent the domain knowledge using a set of logical rules, and use the rules in a logical reasoning process hidden from the programmers.
From the traditional AI perspective, this is a declarative programming paradigm where we program for the what and not the how. The expert programs could go beyond an independent set of rules and turn to logical programs with a Turing-complete expressivity, supporting logical inference, for example, by unification and resolution. However, real-world problems are complex and often involve many interdependent components … It has become evident that formalizing complex problem solving using programming a finite set of deterministic logic-based rules is not possible, nor is it possible to write a conventional structured program, even with a Turing-complete language, for supporting intelligent decision-making based on naturally occurring data.
We emphasize the need to use some fundamental declarative ideas such as first-order query languages, knowledge representation and reasoning techniques, programming languages for multi agent systems, database management systems (DBMS), and deductive databases (DDB). We need to place these ideas within and around ML formalisms including classical ML tools, deep learning libraries and automatic differentiation tools, and integrate them with innovative programming languages and software development techniques, as a way to address complex real-world problems that require both learning and reasoning models.[16]
How meaningful is markup, anyway?
When the text-generative AI models composed the markup shared in the previous section, their declarative semantics were unclear and often inconsistently applied. Perhaps the AI’s purpose became diluted by other imperatives of the generative language algorithm. We could describe such output as expressing a middle state, a string bearing the forms of markup but with dubious or disreputable declarative semantics. The way declarative logic is incorporated in a text-processing system matters. How deliberate can we be in applying it and generating the output we ask for? In their current form, large language models are limited in their capacities to handle complex tasks.
When handled deliberately, declarative logic can be processed in imperative programs without loss of its significance in the output. Transforming text into other formats shows us what is fluid and transferable in markup. We map its structures into forms that machines need to read, and when we do that they become moveable bins or containers of information. The process has been vital to our collation project for the Frankenstein Variorum, where a deliberately declared logic for controlling the basis for comparison has been key to consistent processing.
Markup as declarative intervention in an imperative process
Refining the collation process for the Frankenstein Variorum involved a serious challenge: stopping the collateX software from its default mechanism of always aligning the smallest particles of the same text. CollateX definitely tends to align the smallest irreducible units of text (tokens) that it reads as the same, like "a", "an", and "the", even in passages that are not meant to be associated across the texts.[17] One solution is to normalize the text stream in pre-processing prior to
collation, to exclude such articles or short punctuation, but this did not accord
with our methods of comparing the texts: the small words do matter to us when they
mark variations. Other solutions involve post-processing or editing the output
collation units by hand, which in our case was a brittle solution, prone to
introducing new errors. Some of the versions of Frankenstein
contain long inserted blocks, multi-paragraph inserted passages, and gaps in the
manuscript that make it tricky to prepare aligned units (or chunks
representing parallel passages across the five versions).
Over the last year, my student Yuying Jin and I established a reliable method for
bracketing off long divergent passages, a method that we call our
longToken
strategy. Here we lengthen the size of the smallest
particle of comparable text to the size of whatever we can express inside an XML
element <longToken>.....</longToken>
. We instructed our
Python script to isolate all tokens by newline characters, and set the entire length
of a longToken (which could be as small as a single character and as large as two
paragraphs of text, including flattened markup) all on one irreducible line. By
controlling the tokenization algorithm, we were able to control the mechanism of the
collation software, preventing it from making spurious alignments on small words in a passage that we would effectively bracket away from micro-comparisons.
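The principle can be sketched in a few lines of Python; this is a deliberately simplified illustration rather than our production tokenizer, and it assumes an upstream step has already placed each ordinary token, and each whole <longToken> element, on its own line:

def tokenize(masked_text: str) -> list:
    """Treat every newline-delimited chunk as one irreducible token, so a
    whole <longToken>...</longToken> element (set on its own line upstream)
    can never be split and matched word by word."""
    return [line.strip() for line in masked_text.split("\n") if line.strip()]

sample = ("that\nI\nbeheld\nthe\n"
          "<longToken>frame on which my man compleated. And</longToken>\n"
          "with\nan\nanxiety")
print(tokenize(sample))

The collation software then receives the whole flattened element as a single unit that it must align as one token.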
Our Python script is a place of negotiation between paradigms of structured markup
and so-called unstructured text. We use the XML Pulldom library to select which markup from the source documents we want to include in the string comparison process. That is, we mask away some elements, like
the <surface>
and <zone>
elements that
indicate page surfaces and locations on the pages from the Shelley-Godwin Archive
encoding, because we have decided that page position is not relevant to comparison
of the semantic text structure. But we want to preserve the element nodes that mark
paragraphs, and chapter structures, and we want to preserve the information about
deletion marks in the manuscript and from the Thomas copy. The word
mask
seems appropriate here: This is something like applying tape
to pieces of the file that we select. We continue to work with the markup, though,
in its meaningful form. In the Python script, we define variables containing lists
of element names that we will either mask away from the collation, or that we will
include:
ignore = ['sourceDoc', 'xml', 'comment', 'include', 'addSpan', 'handShift', 'damage', 'unclear', 'restore', 'surface', 'zone', 'retrace']
blockEmpty = ['p', 'div', 'milestone', 'lg', 'l', 'cit', 'quote', 'bibl']
inlineEmpty = ['mod', 'pb', 'sga-add', 'delSpan', 'anchor', 'lb', 'gap', 'hi', 'w', 'ab']
inlineContent = ['del-INNER', 'add-INNER', 'metamark', 'shi']
inlineVariationEvent = ['head', 'del', 'mdel', 'add', 'note', 'longToken']

The ignore variable contains everything we are screening away from the stream of text comparison. The other variables represent element types we will see in the input. This input contains some recognizable elements from the TEI, but <p>, <lg>, and <l> are defined in the blockEmpty list, along with <milestone>, which is the only element that those knowledgeable of the TEI would recognize as legitimately empty.
What have we done to the TEI? Perhaps a sacrilege, but we are meddling with TEI XML files as, after all, text files that hold meaningful declarative markup, and we have converted their element nodes into a format that allows us to compare texts on the basis of their original structures, even as we remove those structures to process the comparison.
There and back again: Transferring declared semantics
In preparing our editions for collation (as discussed in previous Balisage
papers), we have flattened
the original TEI structural elements, and
abstracted them away from their original document models. We do this on purpose to
represent the element tags as Trojan-style markers and to be able to work them into
a new XML file that stores a standoff critical apparatus in TEI. That file
represents the results of our collation pipeline, and it stores a flattened
representation of the tags from the source editions. The standoff critical apparatus
serves, also, as a basis for creating new edition files that store the collation
data, highlighting passages that vary with the other editions.
In this process, the semantics of the declarative markup from the source files are preserved even while that markup has undergone a complicated series of transformations. First it is transferred into strings or a stream of text in order to be collated. Then that stream of text is mapped back again into new XML structures to represent the meaningful data in the critical apparatus about how the texts compare to one another.
Moving in between text-processing paradigms illuminates a transfer of semantics into formatting. The logic of declarative markup is preserved in the Python function running the pull parser via the XML Pulldom library. This function delivers us a way to transfer the logic of the markup element nodes into the syntax used to prepare the tokens and normalized tokens to be delivered to the collateX software.
def extract(input_xml):
    """Process entire input XML document, firing on events"""
    doc = pulldom.parse(input_xml)
    output = ''
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.localName in ignore:
            continue
        # copy comments intact
        # if event == pulldom.COMMENT:
        #     doc.expandNode(node)
        #     output += node.toxml()
        # ebb: The following handles our longToken and longToken-style elements:
        # complete element nodes surrounded by newline characters to make a long complete token:
        if event == pulldom.START_ELEMENT and node.localName in inlineVariationEvent:
            doc.expandNode(node)
            output += '\n' + node.toxml() + '\n'
            # stops the problem of forming tokens that fuse element tags to words.
        elif event == pulldom.START_ELEMENT and node.localName in blockEmpty:
            output += '\n' + node.toxml() + '\n'
        # ebb: empty inline elements that do not take surrounding white spaces:
        elif event == pulldom.START_ELEMENT and node.localName in inlineEmpty:
            output += node.toxml()
        # non-empty inline elements: mdel, shi, metamark
        elif event == pulldom.START_ELEMENT and node.localName in inlineContent:
            output += '\n' + regexEmptyTag.sub('>', node.toxml())
            # output += '\n' + node.toxml()
        elif event == pulldom.END_ELEMENT and node.localName in inlineContent:
            output += '</' + node.localName + '>' + '\n'
        # elif event == pulldom.START_ELEMENT and node.localName in blockElement:
        #     output += '\n<' + node.localName + '>\n'
        # elif event == pulldom.END_ELEMENT and node.localName in blockElement:
        #     output += '\n</' + node.localName + '>'
        elif event == pulldom.CHARACTERS:
            # output += fixToken(normalizeSpace(node.data))
            output += normalizeSpace(node.data)
        else:
            continue
    return output

The "partial DOM tree" constructed by XML PullDom serializes something resonant with the semantics of explicit markup, allowing us in our project to hold the logic and even the structure of markup as a stream of text to be tokenized, chopped into the smallest fragments of meaningful variation. Thanks to the advantage of declarative markup, the scholarly editor gets to declare what that smallest fragment can be. A full element node marking an
inlineVariationEvent
surrounded by \n
newline
characters becomes an irreducible token, and this includes the
<longToken>
, <add>
, and
<del>
elements that in our project mark irreducible units of
comparison. We want an entire added or deleted passage to be lined up complete as
one action in the text. Such a longToken must be compared to a full comparable unit marked in the other documents, fully undeleted. That is a decision of our scholarly editing work in handling collation events, and it means that a deletion event followed by an insertion event in the Thomas text (where the author crossed out a passage and indicated another to add) effectively drives the collation software to generate a specially shaped entry in our critical apparatus. We have programmed our work to prepare this output:
<app>
    <rdgGrp n="['<del>to his statement, which was delivered</del>', 'to him with interest for he spoke']">
        <rdg wit="fThomas"><del rend="strikethrough">to his statement, which was delivered</del> <add>to him with interest for he spoke</add></rdg>
    </rdgGrp>
    <rdgGrp n="['to his statement, which was delivered']">
        <rdg wit="f1818"><longToken>to his statement, which was delivered</longToken></rdg>
        <rdg wit="f1823"><longToken>to his statement, which was delivered</longToken></rdg>
        <rdg wit="f1831"><longToken>to his statement, which was delivered</longToken></rdg>
    </rdgGrp>
</app>

This output from collateX is formatted according to the TEI XML critical apparatus to express the logic of elements storing variation information across five source edition files. A moment of meaningful variation is stored in an
<app>
element, and each of its
<rdg>
descendants stores a single token representing the text
of one witness, here a phrase that we purposefully blocked off to be an irreducible
unit of comparison using the <longToken>
element. You can see the
trace of the longToken elements in the <rdg>
text nodes, while
the @n
attribute on the <rdgGrp> elements serves to
elements serves to
express how the collation software normalizes each token to show on what basis the
<rdg>
elements inside are understood to agree. The trace of
the longToken element is removed there because it is not itself part of the strings
that we ask the collation software to compare.
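That removal can be handled by a simple regular-expression pass over each token before comparison; the helper below is a hypothetical sketch of the idea, not our exact normalization code:

import re

LONGTOKEN_TAGS = re.compile(r"</?longToken[^>]*>")

def normalized(token: str) -> str:
    """Strip the <longToken> wrapper and collapse whitespace so the
    collation software compares witnesses on the underlying string alone."""
    return " ".join(LONGTOKEN_TAGS.sub("", token).split())

print(normalized("<longToken>to his statement, which was delivered</longToken>"))
# to his statement, which was delivered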
In the example, the collation software follows our normalizing algorithm to
determine that three of the witnesses share one form, and the Thomas edition holds
the same passage crossed out together with its complete replacement. Here,
declarative markup provides a precise way to delineate information on what
constitutes an indivisible token to the software. With the demarcation of a
<longToken>
we can bracket off passages of text and force the
software to deal with them on our terms. This gives us control of a process of
pair-wise comparisons by allowing us to alter the usual definition of the smallest
unit of meaningful variation. Declarative markup thus permits us to express our
theory of textual variation in the logic of the programming pipeline. Certainly we
did something unorthodox with the machinery of comparing texts and we invented our
own markup outside the TEI to declare what a token could be. I want to call this
a deliberate, declarative intervention in the imperative
algorithms.
Our effort presupposes that the imperative tools can read the declarations and accommodate the markup. For this capacity, we flatten the element nodes into strings and identify those as regular-expression patterns for processing. Later in the pipeline we raise them again into elements with XSLT, at that point fully relying on declarative methods to communicate our desired output. But in the crucial stage of collation, when texts must be handled as sequential strings and imperative processes are required for calculations of alignment and comparison, the handling of markup as patterns in imperative programs offers a modicum of authority and control over the output.
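To illustrate what handling markup as patterns can look like, here is a small sketch; the sID/eID attribute names follow the familiar Trojan-milestone convention and, like the element names, are assumptions of this illustration rather than a transcript of our scripts:

import re

# Recognize flattened, Trojan-style start and end markers such as
# <p sID="d1p3"/> ... <p eID="d1p3"/> inside a token stream, so the
# imperative code can carry them through collation untouched.
TROJAN = re.compile(r'<(?P<name>\w+)\s+(?P<kind>[se])ID="(?P<id>[^"]+)"\s*/>')

for token in ['<p sID="d1p3"/>', "It", "was", '<p eID="d1p3"/>']:
    match = TROJAN.match(token)
    if match:
        print("marker:", match.group("name"), match.group("kind"), match.group("id"))
    else:
        print("text:  ", token)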
Conclusion
In this paper we looked first at the ways in which large language models fail repeatedly in the task of comparing and differentiating strings. We considered this a curious failing in a system that works with tokenized units of language, and also a sign of a limited basis for making reasonable statements. Perhaps this could be improved by implementing declarative methods. ChatGPT promised a kind of declarative affordance and declares to us its results, but its declarations are neither consistent nor reasonable. Perhaps a more adaptable algorithm for AI would allow for declarative mechanisms to control the logic of its operations.
We then turned to consider how declarative methods can be implemented in imperative programming. When scholars have the capacity to alter the scripts on which machines do their complex processing, they lend scholarly authority to what computing machines can process. What we require, and what declarative methods should afford, is control and validation of the grounds of knowledge, even when these are shifting.
In a moment of eager excitement, confusion, and fear about the disruptive
influences of generative language models, markup technologies provide reliable precision
and control. They allow us to intervene and guide computational processes within
declarative bounds of reason. Declarative methods can provide a foundation for a digital
humanities lab like mine to counterbalance the anxiety-ridden speculative work of
statistically-based distant reading
, to extend what digital resources we
can create, and to ground the authority of our research. When the calculations and
training capacities of a large language model are subject to rapid change with the
next
month's update, and when developers of generative language models conceal their sources
for commercial reasons and do not share their transformer architectures openly, we
would
do well to inspect our tools and research methods for brittle dependencies. Declarative
markup proves itself a precision instrument to guide computational processing and
may
help to address the ephemerality of unstable technology stacks. What might it mean
for a
text-generative AI
system to operate in a fully declarative
way?
[1] At least as of 2017, topics connected to digital scholarly editions (and, connected to them, declarative markup) seem to have been on the decline. See
Scott Weingart, What’s Under the Big Tent?: A Study of ADHO Conference
Abstracts
, Digital Studies/le Champ Numérique, 7(1), 6. DOI:http://doi.org/10.16995/dscn.284. See also the questions raised
about the ADHO conference and its representation of the practice of
digital humanities
in Laura Estill, Jennifer Giuliano, Élika
Ortega, Melissa Terras, Deb Verhoeven, and Glen Layne-Worthy, The circus
we deserve? A front row look at the organization of the annual academic
conference for the Digital Humanities
, DHQ: Digital Humanities
Quarterly 16(4). 2022. http://www.digitalhumanities.org/dhq/vol/16/4/000643/000643.html.
[2] Evidence of how my colleagues practicing digital humanities perceive declarative markup in the time of AI can be seen in the author's January 2023 conversation with Ted Underwood on Mastodon: https://sigmoid.social/@TedUnderwood/109730986869388754. The informal conversation was sparked by a declaration that ChatGPT’s capacity to take text input with instructions and output marked in TEI also meant that eventually there will be no more need for markup at all. In some ways this paper responds to the challenge of that conversation.
[3] See among others, Adventures in Correcting XML
Collation Problems with Python and XSLT
, Proceedings of Balisage:
The Markup Conference 2022, Balisage Series on Markup Technologies, vol. 27. 2022. DOI:https://doi.org/10.4242/BalisageVol27.Beshero-Bondar01.
[4] This article documents a significant body of publications aided by, applying,
or investigating ChatGPT published between December 2022 and February 2023, just
the beginning of a scholarly surge of interest across disciplines in AI:
Alin Zamfiroiu, Denisa Vasile, and Daniel Savu, ChatGPT – A Systematic
Review of Published Research Papers
,
Informatica Economica, vol. 27, no. 1, 2023, pp. 5-16.
http://revistaie.ase.ro/content/105/01%20-%20zamfiroiu,%20vasile,%20savu.pdf.
Accessed 2023-07-22.
[5] ChatGPT plugins: Code interpreter
, Chat GPT Blog, 2023.
[6] I accessed ChatGPT from January through July 2023 from https://chat.openai.com/.
[7] Timothy Mugayi, ChatGPT 16K Context Window A Step Closer to 100k: A look at
larger context windows and what it means for your
applications
, Medium: Level Up Coding. 21 June
2023.
[8] I accessed Claude in July 2023 at https://claude.ai.
[9] Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths,
Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, and Julius
Berner, Mathematical Capabilities of ChatGPT
, Arxiv: Computer
Science > Machine Learning. 2023. DOI:https://doi.org/10.48550/arXiv.2301.13867.
[10] For unsophisticated novices like me, these articles have been particularly
helpful for gaining a basic appreciation of key concepts in text generative
AI: Haomiao Huang, The generative AI revolution has begun—how did we get
here?
, Ars Technica, 30 January 2023; Jay Alammar,
The Illustrated Word2Vec
blog post, 27 March
2019.
[11] This Google Collab Notebook tutorial offers an accessible introduction to
gender and social class bias engrained in large language models: Shlomi Hod,
Tutorial // Exploring Gender Bias in Word Embedding
,
2018—. https://colab.research.google.com/github/ResponsiblyAI/word-embedding/blob/main/tutorial-bias-word-embedding.ipynb.
The following article details pervasive problems of gender bias in ChatGPT’s
translations from English to Farsi, Malay, Tagalog, Thai, and Turkish:
Sourojit Ghosh and Aylin Caliskan, ChatGPT Perpetuates Gender Bias in
Machine Translation and Ignores Non-Gendered Pronouns: Findings across
Bengali and Five other Low-Resource Languages
, Upcoming
Publication in AAAI/ACM Conference on AI, Ethics, and Society. 2023.
[12] Michael Sperberg-McQueen, What does
descriptive markup contribute to digital humanities?
,
Key Ideas and Concepts of Digital Humanities conference, Technische
Universitat Darmstadt, 26-28 October, 2015. Slide
8.
[13] Steven J. DeRose, David G. Durand, Elli Mylonas, and Allen H. Renear,
What is Text, Really?
, Journal of Computing in Higher Education. 1: 3-26. 1990. DOI:https://doi.org/10.1007/BF02941632.
[14] Emily Bender, Timnit Gebru et al., On the Dangers of Stochastic
Parrots: Can Language Models Be Too Big?
, Proceedings of the 2021
ACM Conference on Fairness, Accountability, and Transparency, March 2021,
pp. 610–623. DOI:https://doi.org/10.1145/3442188.3445922. For helpful
introductions to how large language models work to generate text based on
predictive algorithms, see Jay Alammar, The Illustrated Transformer
blog post, 27
June 2018. See also Molly Ruby, How ChatGPT Works: The Model Behind The Bot
,
in Towards Data Science. Medium. 30 January 2023.
[15] Here is a sampling of recent articles on how to improve problem solving
capacities in large language models: Cem Anil, Yuhuai Wu et al.,
Exploring Length Generalization in Large Language Models
,
36th Conference on Neural Information Processing Systems. 2022. https://openreview.net/pdf?id=zSkYVeX7bC4; Shunyu Yao, Dian Yu,
Jeffrey Zhao et al., Tree of Thoughts: Deliberate Problem Solving
with Large Language Models
, Arxiv: Computer Science >
Computation and Language, May 2023 preprint https://arxiv.org/pdf/2305.10601.pdf; Roie Schwaber-Cohen,
Vector Databases
, Pinecone https://www.pinecone.io/learn/vector-database/ n.d.
[16] Parisa Kordjamshidi, Dan Roth, and Kristian Kersting, Declarative Learning-Based
Programming as an Interface to AI Systems
, Frontiers in
Artificial Intelligence 5:755361, 2022 Mar 14. DOI:https://doi.org/10.3389/frai.2022.755361.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8967162/.
[17] See the CollateX documentation, section 3.1 Alignment Algorithms in https://collatex.net/doc/.
The Needleman-Wunsch algorithm appears to be a starting point that CollateX
has adapted and improved to the Dekker algorithm, but my
experience applying the default alignment algorithm is consistent with this
pronounced statement from Needleman and Wunsch when they introduced their
comparison algorithm for bioinformatics in 1970: Comparisons are made
from the smallest unit of significance.
This means that the
software consistently defaults to aligning the smallest units of agreement
at the token level, even when the context of the passage of text surrounding
them (say the unit paragraph or section) is completely different. Saul B.
Needleman and Christian D. Wunsch, A general method applicable to the
search for similarities in the amino acid sequence of two
proteins
, Journal of Molecular Biology 48:3, 28 March 1970, pp.
443-453. DOI:https://doi.org/10.1016/0022-2836(70)90057-4.