String Difference
Many algorithms have been written to calculate differences between two arbitrary strings, in various programming languages and with a variety of methods (see references at Fraser 2006 [fraser]; Jhaver et al. 2014 refs. [1]-[12] [jhaver]). In general, a string-comparison algorithm takes as input two strings and returns output that itemizes insertions and deletions. Such output is particularly conducive to being expressed in XML. If our two input strings are "fish cat bird" and "fish dog bird" the output might be represented as follows:
<diff>
   <common>fish </common>
   <a>cat</a>
   <b>dog</b>
   <common> bird</common>
</diff>
If we treat the two strings as two stages in an editorial process, every <a> represents a deletion (characters in A but not in B), and every <b>, an insertion. The output is lossless, and the original strings A and B can be restored easily via XPath, e.g., string-join($output/(a | common)) and string-join($output/(b | common)).
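Because the output is lossless, the round trip can be checked mechanically. Below is a minimal sketch, assuming the <diff> element above is bound to a variable $output; note that xsl:assert takes effect only when the processor has assertions enabled:

<!-- Round-trip check (sketch): both inputs must be recoverable
     verbatim from the diff tree. -->
<xsl:variable name="a-restored" select="string-join($output/(a | common))"/>
<xsl:variable name="b-restored" select="string-join($output/(b | common))"/>
<xsl:assert test="$a-restored eq 'fish cat bird'
   and $b-restored eq 'fish dog bird'"/>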
The classic approach to global string comparison frames the task as one of finding the longest common subsequence (LCS). LCS-based methods go back to 1970 and the Needleman-Wunsch algorithm [boes]. Through the 1970s and 1980s a number of models were proposed, collected in [stephen]. In a famous 1986 article, Eugene Myers described the LCS task as navigation through a matrix [myers]. Letters in the two strings mark points on the x and y axes (Fig. 1). Starting from the corner where each sequence begins, a series of paths is drawn, following downward and rightward steps. A third, diagonal path is drawn if it leads to a point whose x and y coordinates mark identical characters. Once the entire matrix is marked, every path is scored, and the lowest score determines the path or paths to an LCS (there may be more than one). Myers also observed that securing the LCS results in the shortest edit script, i.e., the least number of changes required to go from one string to the other.
The algorithm, refined, takes computing time of O(ND), where N is the sum of the lengths of the two strings and D is the size of the minimum edit script for the two strings.[1] A refinement proposed by Myers reduces the space requirement to linear, O(N).
A somewhat different approach was introduced by Ratcliff and Obershelp in 1988, and it forms the basis of the standard ndiff function in Python's difflib module [ratcliff]. That algorithm proceeds character by character through the two strings, looking for the longest match of consecutive characters. That is, it favors longest common substrings (LCGs) as a route to the LCS. (Unlike LCSs, LCGs consist of contiguous characters and so have no gaps.) Once an LCG is found, the process is repeated on either side of it. The procedure is normally quadratic in time complexity but cubic in the worst-case scenario.
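The divide-and-conquer structure is easy to sketch in XSLT, even though neither Ratcliff-Obershelp nor the function presented below is implemented this way. The following toy stylesheet is a naive illustration only (the LCG search enumerates every substring, so it is at best cubic and unusable on long input); applied to the strings from the opening example, it reproduces the <diff> tree shown there:

<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:local="tag:example,2021:ns"
   exclude-result-prefixes="#all">

   <!-- Naive longest common substring (LCG): enumerate substrings of $a
        from longest to shortest and keep the first one found in $b. -->
   <xsl:function name="local:lcg" as="xs:string?">
      <xsl:param name="a" as="xs:string"/>
      <xsl:param name="b" as="xs:string"/>
      <xsl:sequence select="
         (for $len in reverse(1 to string-length($a)),
              $pos in 1 to string-length($a) - $len + 1
          return substring($a, $pos, $len))[contains($b, .)][1]"/>
   </xsl:function>

   <!-- Divide and conquer: anchor on the LCG, then repeat on the
        remnants before and after it. -->
   <xsl:function name="local:diff" as="element()*">
      <xsl:param name="a" as="xs:string"/>
      <xsl:param name="b" as="xs:string"/>
      <xsl:variable name="lcg" select="local:lcg($a, $b)"/>
      <xsl:choose>
         <xsl:when test="exists($lcg)">
            <xsl:sequence select="local:diff(substring-before($a, $lcg),
               substring-before($b, $lcg))"/>
            <common><xsl:value-of select="$lcg"/></common>
            <xsl:sequence select="local:diff(substring-after($a, $lcg),
               substring-after($b, $lcg))"/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:if test="$a ne ''"><a><xsl:value-of select="$a"/></a></xsl:if>
            <xsl:if test="$b ne ''"><b><xsl:value-of select="$b"/></b></xsl:if>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:function>

   <xsl:template name="xsl:initial-template">
      <diff>
         <xsl:sequence select="local:diff('fish cat bird', 'fish dog bird')"/>
      </diff>
   </xsl:template>

</xsl:stylesheet>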
These well-known LCS and LCG processes assume a dynamic, stateful programming environment, in which variables can be updated in iterative steps. In functional programming languages such as XSLT, which is stateless and whose variables are immutable, the process is more challenging. A naive reproduction of Myers's matrix would require as much computational work to set up as to solve.
This challenge is illustrated by Birnbaum's 2020 XSLT implementation of the Needleman-Wunsch algorithm, building on some optimizations developed by Tennison in 2007 [birnbaum]. Birnbaum's work represents a significant advance in securing an XSLT solution to finding the LCS, because it shows for the first time how one might optimize the language to create a path through the matrix. But the results are not conducive to widespread implementation. In Birnbaum's testing, two texts of roughly three thousand words each (almost eighteen thousand characters) took more than two minutes to process.
Below I present tan:diff(), an XSLT 3.0 function that is part of the Text Alignment Network function library (namespace tag:textalign.net,2015:ns).
tan:diff() approaches the LCS task differently than either Needleman-Wunsch or Ratcliff-Obershelp, and achieves both excellent results and efficiency. For common string comparisons, it nearly always returns a minimal difference. XSLT's allowance for asynchronous computing, as well as its declarative structure, makes tan:diff() extremely efficient on very long strings. Empirical observation, corroborated by analysis of the algorithm, suggests that, given pairs of strings of roughly the same difference, tan:diff() runs in time complexity O(log(L)), where L is the summed length of the two inputs.
I have not found in the literature any description of an algorithm that uses the staggered-sample approach I present; Ratcliff and Obershelp's algorithm comes the closest. I suppose there are computer programs that adopt this strategy, but not having found any published accounts, I have chosen to use this article not only to present the theoretical underpinnings of the function but also to go into detail on tests that demonstrate its efficiency and accuracy.
String Difference through Staggered Samples
In what follows we assume two strings that are of arbitrary length and are roughly similar to each other, from any language or languages. Our concern is primarily with differences on the character level, because some languages (e.g., Chinese, Thai) do not normally use a space (U+0020) to differentiate words, and we cannot presume line breaks of any kind. Once results based on characters are secured, the output can later be adjusted into word-based differences, if we prefer.
The second string represents the outcome of changes applied to the first: insertions and deletions of characters at arbitrary places, with the quantity of inserted or deleted characters being no greater than one half the length of the original string. The algorithm works even when the two strings are more different than alike, but for this discussion it is best to think of two strings that resemble each other because they share some genetic relationship through an editorial process.
XSLT instructions are commonly executed asynchronously, permitting multiple processes to occur at the same time. Another benefit of XSLT is that the declarative syntax encourages a programmer to approach problems in a manner different from one shaped by imperative programming, thus expanding the developer's creative horizons. Declarative architecture is helpful particularly for string comparison, because it allows us to model methods we use in everyday life.
When I am presented two texts and asked to compare them, I first look at the big picture. If the two texts are arranged side-by-side on my computer in two windows of equal width, I eyeball a very large section of text, starting with the smaller of the two, and then look in the longer counterpart for a match. If none presents itself, I reduce the size of my sample, and I take several samples from different parts of the shorter text. For each sample pattern I look for a comparable match in the other, longer one. I may do this several times through the first text, using relatively large samples. If there still isn't an obvious match, I reduce the size of the sample yet again, and take more samples. Those samples tend to be staggered, and the amount of overlap decreases as the number of samples grows.
As suggested by Fig. 2, the process can be described spatially. Vertically, one proceeds sample size by sample size; horizontally, by the number of samples. Because each vertical step governs one or more horizontal steps, the two axes can also be described as an outer iteration and an inner one. Throughout this article I will use these terms interchangeably.
Table I
Vertical axis | Horizontal axis |
---|---|
vertical steps | horizontal steps |
sample size | number of samples |
outer iteration | inner iteration |
If this were a process that I were to follow myself, I would begin with only a few samples and increase the number gradually, in inverse proportion to the size of the sample. From one unfruitful vertical step to the next, I would become increasingly reluctant to go past some maximum number of samples, as I looked for a sizeable match, any match, that would allow me to establish a pair of anchors. The more vertical steps that passed, the more likely I would be to conclude that the two strings are unalike, and to settle for matches in individual sentences, clauses, or words.
The personal process described above provides the model for tan:diff(), which extracts a series of progressively smaller samples from the shorter of the two texts and looks for a match in the longer one on the basis of fn:contains(). When a match is found, the results are separated by fn:substring-before() and fn:substring-after().[2]
Let us suppose the smaller of the two strings is quite small, say six characters or fewer. The process is straightforward, consisting of rounds of samples of size 6, 5, and so forth down to a single character. That approach takes up to Σ(1..N) = N(N+1)/2 steps, the triangular number of N.
If the shorter string is about twenty characters or fewer, the sample size is crudely staggered at 100%, 80%, 60%, 40%, 30%, 20%, 10%, and 5%, with the number of samples (horizontal passes) increasing 1, 3, 9, and so on, up to a maximum of N.
Normally we are dealing with much larger strings, so we need to generalize the staggering principle. In tan:diff() the progressions of sample sizes and sample iterations are bound to parameters that can be adjusted.[3] The default sample sizes (the first sixteen) are 70.7%, 50.0%, 35.4%, 25.0%, 17.7%, 12.5%, 8.8%, 6.2%, 4.4%, 3.1%, 2.2%, 1.6%, 1.1%, 0.8%, 0.6%, and 0.4% of the shorter of the two strings. There are about one hundred such vertical steps, far more than is needed (the 80th step provides a sample size of one character for a string of length 1,000,000,000,000). The progression relies on the formula B^(V*E), where V is the number of the vertical step, B is a base (default 0.5), and E the exponent (default 0.5). B, V, and E are bound to adjustable global parameters, to allow any developer to redistribute sample sizes.
The number of samples (horizontal, inner steps) correspondingly increases with each vertical step, currently by default 5, 13, 21, 29, … 50. The progression follows the formula ((1 - p)^(1/F))*M, where p is the proportionate size of the sample, F is the exponent factor (default 0.5), and M is the maximum number of samples (default 50). Both F and M are bound, once again, to adjustable global parameters, to permit control over the maximum number of samples and the speed at which they increase.
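Both progressions are easy to reproduce. The sketch below assumes the math namespace (http://www.w3.org/2005/xpath-functions/math) is in scope and rounds the sample count up; rounding up reproduces the published defaults, though whether the library rounds exactly this way is an assumption:

<!-- Vertical progression: sample size as a proportion of the shorter
     string, B^(V*E) with B = E = 0.5. -->
<xsl:variable name="sample-size-ratios" select="
   for $v in 1 to 16 return math:pow(0.5, $v * 0.5)"/>
<!-- 0.707, 0.5, 0.354, 0.25, 0.177, ... -->
<!-- Horizontal progression: samples per vertical step,
     ((1 - p)^(1/F))*M with F = 0.5 and M = 50, rounded up. -->
<xsl:variable name="samples-per-step" select="
   for $p in $sample-size-ratios
   return xs:integer(ceiling(math:pow(1 - $p, 1 div 0.5) * 50))"/>
<!-- 5, 13, 21, 29, ... -->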
The process described above works well for strings of 200K characters or smaller, or for larger ones that are rather alike. But strings that are extremely large and quite different tend to raise problems when the algorithm reaches sample sizes of three characters or fewer. If the remaining pair of strings to be compared is still relatively lengthy, then we are hunting for scraps of commonality.
In most cases, though, a match will be found not too far into the process. The number of steps needed to reach the first match is unpredictable, a function of the dissimilarity of the two strings, the values assigned to the parameters, and the input itself.
If the sample matches more than once, whether in the long string, the short string, or both, the algorithm chooses the pair that shares the closest contextual position. For example, if the search string "abcd" appears in the short string at approximately 2%, 10%, and 45%, but in the long string at 55%, 75%, and 95%, the 45%-55% pair will be chosen. This decision follows the presumption that the two strings are genetically related to each other, and that proportionality is the best guide to correlating ambiguous matches. This design feature also helps to stabilize the search for differences in repeated text structures (e.g., tables with a great deal of repeated material).[4]
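One way to broker such ambiguity is sketched below; the variable names ($a-positions, $b-positions, $a-length, $b-length) are hypothetical, not taken from the library:

<!-- Sketch: among candidate match positions (integers into strings a
     and b), keep the pair whose proportional positions lie closest. -->
<xsl:variable name="candidate-pairs" as="array(xs:integer)*" select="
   for $pa in $a-positions, $pb in $b-positions return [$pa, $pb]"/>
<xsl:variable name="best-pair" as="array(xs:integer)?" select="
   sort($candidate-pairs, (), function($pair) {
      abs($pair(1) div $a-length - $pair(2) div $b-length)
   })[1]"/>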
After the first successful round of vertical steps, we have a pair of relatively long anchors, one in each text. This is not an LCG but a PLCG: a pretty long common substring. An attempt is then made to extend the boundaries of the PLCG by proceeding character by character on each side, adding to the PLCG as many contiguous matching pairs of characters as possible. We then have a complete and very long common substring. Because of our incremental approach, that first anchor is quite often the longest common substring.
Now the process bifurcates, starting anew on the two remnant strings that precede the anchor, and on the two that follow. Asynchronous processing helps reduce processing time.
The output is rendered in a tree, much like that illustrated at the beginning of this article: a <diff> wrapping interleaved <common>s (text shared by both strings), <a>s (text unique to the first string), and <b>s (text unique to the second).
The time complexity of tan:diff() is difficult to calculate. The number of operations depends largely not upon the size of the strings but upon the number of samples required. That number depends upon the similarity of the two strings, and the happenstance of finding a good match early on. After a match is found, the processing time depends upon fn:substring() and fn:contains(), functions in which string length may be a significant factor, depending upon how the XSLT processor handles them. Multiple matches may require some extra steps to resolve the ambiguity. Expanding each side of the PLCG depends upon tan:common-start-or-end-string(), which applies to remnant pairs of strings the functions fn:substring(), fn:string-to-codepoints(), and fn:codepoints-to-string(). The process then repeats on either side of the LCG. The number of times it must do so depends upon string length and similarity.
Tests presented below suggest that on average tan:diff() works in logarithmic time complexity. That is, every time the cumulative input string length increases tenfold, computational time merely doubles. This is far better performance than that of many other string-difference algorithms, which commonly run in quadratic time.
Extremely Long Strings
Suppose our input strings are quite long, say more than 200K characters (which is still a conservative threshold, as will be seen). On some long strings, the process described above might become inefficient or grind to a halt. The heart of tan:diff() is a tail-recursive function, whose repeated calls may cause a processor to suspect an endless loop and halt execution. For any function that runs in logarithmic time, the larger the input, the proportionately faster it runs; therefore preprocessing does not improve the speed of tan:diff(). But it can assist in memory management, and avoid problems in recursion.
tan:diff() provides two ways to handle extremely long strings, one based on tokenization and another on segmentation. Segmentation may be followed by tokenization, but not vice versa. Parameters allow a user to determine which preprocessing method should be applied, and at what thresholds.
Preprocessing via Tokenization
Some have approached the problem of comparing two long strings by first tokenizing them. Many diff algorithms, written for comparing computer code, tokenize the text into individual lines. This strategy cannot be used in tan:diff(), which is written for any kind of text; input may not have any carriage returns or line feeds.
Others have tokenized on the space character. This too is inadequate for tan:diff(), which aims to accommodate texts in languages, e.g., Chinese, that do not separate words with spaces. Even languages such as English that frequently use spaces pose problems. Space-tokenizing a long English string reduces the number of input items (tokens instead of characters) to about one-seventh, which only slightly reduces the complexity of the problem. Early versions of tan:diff() that preprocessed extremely long strings by tokenizing on spaces were terribly inefficient because of the enormous number of tokens, which had to be compared in a quadratic operation.
We are accustomed to tokenizing on lines and spaces because of how common they are in the languages we use. The problem becomes much more tractable when we tokenize progressively, starting with the least common characters. The space is commonly the most frequent character, so if it is reached, it normally comes last, and only in cases where the two strings are quite unalike.
The tokenizing preprocess used by tan:diff() begins by building a sorted sequence of strings along which iterations of tokenization should proceed. That sequence of tokenizers is built by grouping and sorting individual codepoints. Characters that are found in one string but not the other are specially marked: they are especially good candidates to start the tokenizing process, because they remove from consideration any characters that could never be matched. Next come the rarest characters, slowly building up to the most common.
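A frequency-based ordering can be sketched as follows (for brevity, the special marking of characters unique to one string is omitted; $a and $b stand for the two input strings):

<!-- Sketch: order candidate tokenizers from rarest codepoint to most
     common, across both strings combined. -->
<xsl:variable name="tokenizers" as="xs:string*">
   <xsl:for-each-group select="string-to-codepoints($a || $b)" group-by=".">
      <xsl:sort select="count(current-group())"/>
      <xsl:sequence select="codepoints-to-string(current-grouping-key())"/>
   </xsl:for-each-group>
</xsl:variable>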
A typical sequence of tokenizing characters might be as follows (space-delimited, based on a sample from the United States Code of Federal Regulations): + • ? @ é \ × Z % K X Q _ Y “ ” [ ] V H $ — J W j : / z L ' M G B k N ; 7 § - U q O 8 6 x 4 P I D 5 9 E C 0 R F 3 T 2 A S ) ( 1 w , v . b g y m u p f l h d c s r n o a i t e ....
Each time a tokenizer is applied, the result is a sequence of tokens for string A and another for string B. At this point, we need to find the longest common subsequence of tokens. Because we have started with the least common tokenizers, the number of tokens is small.
Initial results need to be streamlined. Suppose a tokenizer on the two texts results in the following two sequences: A, B, C, B, D, E, F, G and B, C, G, C, H, D, E, F. Technically, the longest common subsequence is B, C, D, E, F. But it is unclear which B in string 1 or which C in string 2 is to be preferred. Further, any one of these tokens might be a substring of any other. A clear resolution requires analyzing the relationships between these tokens, which could prove unnecessarily costly. It is faster simply to discard ambiguous entries and simplify the two sequences: A, C, D, E, F, G and B, G, H, D, E, F. Now the longest sequence is clearly D, E, F. In addition, a straightforward check ensures that D, E, and F appear only once in their respective strings.
The removal of duplicate tokens avoids ambiguous matches. The two sequences of remaining unique tokens are then passed through tan:collate-pair-of-sequences(), which collates them into a single sequence, from which the shared elements can be extracted (in our example above, D, E, and F). These provide our first anchors into the two texts. They are long and uniquely correspondent, so they become our first <common> elements in the output. The next step is to apply the same process to the paired fragments from strings A and B that appear before, between, or after the shared unique tokens. Asynchronous processing allows the paired fragments to be evaluated efficiently.
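The streamlining step can be sketched with the token sequences from the example above:

<!-- Sketch of the streamlining step: discard any token that appears
     more than once within its own sequence. -->
<xsl:variable name="a-tokens" select="('A','B','C','B','D','E','F','G')"/>
<xsl:variable name="b-tokens" select="('B','C','G','C','H','D','E','F')"/>
<xsl:variable name="a-unique" select="$a-tokens[count(index-of($a-tokens, .)) eq 1]"/>
<!-- A, C, D, E, F, G -->
<xsl:variable name="b-unique" select="$b-tokens[count(index-of($b-tokens, .)) eq 1]"/>
<!-- B, G, H, D, E, F -->
<!-- Collating the two unique sequences (in TAN, via
     tan:collate-pair-of-sequences()) then yields the anchors D, E, F;
     G is shared but out of order, so it drops out. -->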
If no shared unique tokens have been found from the first tokenizer, the process starts again, now adding the next tokenizing character. The process repeats until a unique token sequence is found. If the process completes, and successive tokenizing fails to produce any anchors, it is only because the two large strings are far more unlike than alike.
For two large texts that are more similar than not, tokenizer-based splits happen rather early. In the example that generated the tokenizer sequence given above, the first results came with the question mark (the third tokenizer), resulting in two unique anchors in the pair of texts, of lengths 593 and 642 characters, splitting both strings into three new string pairs of lengths (7214, 7215), (1, 1), and (491549, 483817). The (1, 1) pair happened to be the tokenizer itself, appearing in both texts, so our starting anchor (PLCG) is really of length 1,236.
Preprocessing via Segmentation
Preprocessing via tokenization requires substantial memory overhead, and one may run into cases of stack overflow. In a sample Oxygen XML Editor session allocated 3.5 GB of RAM, the tokenizing preprocess exhausted the memory allowance when applied to two relatively similar strings of 12M characters each.
Such gigantic strings can be broken up not by tokens but by cutting them into halves, thirds, fourths, or some other number of segments, to be processed pairwise through tan:diff().
The tan:diff() results for one pair of segments are checked before proceeding to the next pair. Fragmentary results at the end of one comparison are dropped from its output and prepended to the input for the next segment pair. This helps smooth the seams between the segments. For this operation, the XML result tree has immediate benefits for an XSLT-based approach to analyzing the seam and splitting any <diff> tree into two parts. This is made possible by the auxiliary function tan:chop-tree(), which takes as input any tree and a sequence of integers representing text positions where the input tree should be chopped (split). The output of tan:chop-tree() is a map; each map entry has as its key the integer where the chop occurred, and as its value the corresponding fragment of the input tree.
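For illustration only, a hypothetical call might look like the following; the parameter order is an assumption based on the description above, not a citation of the library's signature:

<!-- Hypothetical sketch: chop a diff tree at positions 1 and 500001.
     Map keys and values follow the description above. -->
<xsl:variable name="chopped" as="map(xs:integer, item()*)"
   select="tan:chop-tree($diff-tree, (1, 500001))"/>
<!-- $chopped(500001) would hold the fragment of the tree covering
     text position 500001 onward. -->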
After all pairs of segments are compared, the diff results are concatenated, and slightly adjusted, to produce the final output.
This particular strategy is useful only if the two extremely long texts are roughly coordinate with each other. Large common substrings that fall across seams may go undetected, resulting in unexpected differences, and transpositions that move substantive portions of text from one end of the string to the other may elude detection. Those cautions aside, segmentation-based preprocessing is much quicker and more efficient than preprocessing that depends upon tokenization.
Output
The output of tan:diff() is simple and pliable. It is easily queried to reconstruct the original strings A and B. For example, string-join(tan:diff($a, $b)/(tan:a | tan:common)) returns the original A. The output is also predictable. Any two nearby <common>s will be separated by one <a>, one <b>, or one of each. When an <a> and a <b> are immediately adjacent siblings, the <a> precedes the <b>.
Because tan:diff() focuses on characters, not words, its output may be difficult for humans to read and interpret. A post-processing option for the function allows one to specify whether the output should be snapped to the word level, which tends to be much easier to read. For example, the output...
<diff>
   <common>H</common>
   <a>e</a>
   <b>a</b>
   <common>re today, go</common>
   <b>o</b>
   <common>n</common>
   <a>e</a>
   <common> tomorrow.</common>
</diff>
...is much easier to read when results are snapped to the word level:
<diff>
   <a>Here</a>
   <b>Hare</b>
   <common> today, </common>
   <a>gone</a>
   <b>goon</b>
   <common> tomorrow.</common>
</diff>
The simple structure of the output is conducive to conversion to HTML and styling in CSS:
.a { color: red; text-decoration: line-through; }
.b { color: green; text-decoration: underline; }
<div xmlns="http://www.w3.org/1999/xhtml" class="diff">
   <div class="a">Here</div>
   <div class="b">Hare</div>
   <div class="common"> today, </div>
   <div class="a">gone</div>
   <div class="b">goon</div>
   <div class="common"> tomorrow.</div>
</div>
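A conversion along these lines takes only a couple of templates. The sketch below is one way to do it, not the TAN library's own converter; the wildcard *: prefix sidesteps the question of whether the diff elements are namespaced:

<!-- Minimal diff-to-HTML sketch: each child of the diff tree becomes
     a div whose class is the element's local name. -->
<xsl:template match="*:diff">
   <div xmlns="http://www.w3.org/1999/xhtml" class="diff">
      <xsl:apply-templates/>
   </div>
</xsl:template>
<xsl:template match="*:diff/*">
   <div xmlns="http://www.w3.org/1999/xhtml" class="{local-name()}">
      <xsl:value-of select="."/>
   </div>
</xsl:template>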
Sometimes the output is accurate but not optimal. For example, comparing 'aaa, bbb, ccc' to 'aaa, x, bbb, ccc' results in the following:
<diff>
   <common>aaa</common>
   <b>, x</b>
   <common>, bbb, ccc</common>
</diff>
We would prefer a different approach to the commas and spaces.
Such output can be passed through a post-processing function, tan:adjust-diff(), which provides the same accurate results but adjusts pliable borders in light of traditional word boundaries. It results in the following:
<diff>
   <common>aaa, </common>
   <b>x, </b>
   <common>bbb, ccc</common>
</diff>
There may be cases where one string contains substantive transpositions. An experimental function, tan:get-diff-output-transpositions(), returns passages where strings A and B have strong correspondence to each other, outside of the substantive <common> sections of the standard output of tan:diff(). Transposition is an advanced topic that deserves extended treatment elsewhere.
Overall, the results of tan:diff() are flexible, and especially conducive to creative and efficient transformations and queries.
A First Glance at Performance
tan:diff() performs very well against a variety of corpora. Four batteries of tests were applied to text pairs, in order of text similarity: Birnbaum's Darwin corpus; Code of Federal Regulations 22.1 from 2018 and 2019; CFR 22.1 from 1997 and 2019; and CFR 22.1 from 2019 against the King James version of the Old Testament.[5]
These diagnostic samples are illustrative of performance trends and the impact of preprocessing. For more rigorous benchmark tests see the next section.
Rows are sorted by per-character efficiency, starting with the least performant. Columns labeled "unique" specify what portion of string a or b was not found in the other; the lower the percentage for the same pair of texts, the better the quality of results. The last two columns show the amount of time needed and the rate per character (the lengths of a and b added together, divided by the time, after subtracting an estimated 0.5 seconds for Java warmup and compilation of the TAN function library). All tests were performed on a Dell Inspiron 5570 (Intel Core i5-8250U, 1.6 GHz, with 4 physical and 8 logical cores), running Saxon HE 9.9 within Oxygen XML Editor 23.0. Saxon's more advanced products were avoided, to illustrate the effectiveness of the algorithm alone. Recorded times reflect a single iteration.
Table II
segm. preprocess size | tok. preprocess? | string a length | string b length | unique a | unique b | seconds | 1K chars / sec |
— | — | 69066 | 69331 | 0.47% | 0.85% | 0.7 | 197.71 |
— | ✓ | 69066 | 69331 | 0.47% | 0.85% | 0.7 | 197.71 |
The Darwin sample was not demanding, and results arrived rather quickly. The manageable size of the texts precluded segmentation preprocessing. Applying tokenizing preprocessing neither helped nor harmed the process.
These results represent a set of tests that might be commonly encountered, and the rate of around 200K characters per second provides a good benchmark against which to compare the following batteries of tests.
Table III
segm. preprocess size | tok. preprocess? | string a length | string b length | unique a | unique b | seconds | 1K chars / sec |
— | ✓ | 4937847 | 4861503 | 2.06% | 0.52% | 50 | 195.99 |
2M | ✓ | 4937847 | 4861503 | 3.36% | 1.84% | 28.4 | 345.05 |
500K | ✓ | 4937847 | 4861503 | 8.33% | 6.89% | 15.5 | 632.27 |
— | — | 4937847 | 4861503 | 2.12% | 0.59% | 11 | 890.85 |
2M | — | 4937847 | 4861503 | 3.43% | 1.91% | 10.8 | 907.35 |
500K | — | 4937847 | 4861503 | 8.33% | 6.89% | 7.7 | 1272.64 |
The two texts here are rather similar, since they represent merely one year's worth of changes to this part of the federal regulations. Bear in mind, in these tables, the lower the percentage of difference (unique a, unique b), the more the results approach a minimal script, and therefore improve in quality. The qualitatively best results come from the test shown in the first row.
This battery of tests introduces the segmentation method. The 2M character segment threshold means that each text was divided into three segments before being processed pairwise. With the 500K setting, each text was broken into ten segments.
Avoiding the segmentation preprocess maximized the quality of the outcome, with the tokenizing preprocess slightly improving results. Nevertheless, tokenization always introduced a significant performance penalty. Segments of size 500K characters led to a significant decrease in the quality of results, but improved performance. But in all six cases, the performance rivalled or surpassed that of the Darwin corpus. The relative speed of tan:diff() improves as the size of input increases.
The 2018-to-2019 entries never took longer than 50 seconds, which is rather fast, generally speaking. The same two texts passed to Python's ndiff() took more than three hours to process.
In the comparison above, the two versions, one year apart, were rather similar. But a comparison across a greater range of time, twenty-two years, where the more recent version was nearly 30% larger, shows interesting contrasts:
Table IV
segm. preprocess size | tok. preprocess? | string a length | string b length | unique a | unique b | seconds | 1K chars / sec |
500K | ✓ | 3768303 | 4861503 | 57.85% | 67.33% | 82.2 | 104.99 |
— | ✓ | 3768303 | 4861503 | 37.23% | 51.34% | 82 | 105.24 |
— | — | 3768303 | 4861503 | 35.34% | 49.88% | 75.6 | 114.15 |
2M | ✓ | 3768303 | 4861503 | 37.49% | 51.55% | 68 | 126.91 |
2M | — | 3768303 | 4861503 | 36.29% | 50.62% | 66.2 | 130.36 |
500K | — | 3768303 | 4861503 | 59.23% | 68.39% | 42.4 | 203.53 |
In this battery of tests, where the pair of texts were roughly half alike, half unalike, processing took considerably more time than did processing for the Darwin texts.
The tokenizing preprocess resulted in a noticeable decrease in performance, and did not improve the quality of the results. The best quality came when preprocessing was avoided. Segments of size 500K resulted, as before, in noticeably inferior results, but unlike the previous battery of tests, segmentation was not a guarantor of improved performance.
The 1997-to-2019 comparison involved texts that were roughly only half alike. What happens when very large, very dissimilar texts are compared?
Table V
segm. preprocess size | tok. preprocess? | string a length | string b length | unique a | unique b | seconds | 1K chars / sec |
— | ✓ | 3189047 | 4861503 | 95.11% | 96.79% | 3688.8 | 2.18 |
2M | ✓ | 3189047 | 4861503 | 91.56% | 94.46% | 3472.2 | 2.32 |
500K | ✓ | 3189047 | 4861503 | 91.55% | 94.46% | 1498.3 | 5.37 |
500K | — | 3189047 | 4861503 | 82.99% | 88.84% | 162.7 | 49.48 |
2M | — | 3189047 | 4861503 | 86.18% | 90.93% | 157.7 | 51.05 |
— | — | 3189047 | 4861503 | 85.61% | 90.56% | 148.9 | 54.07 |
One might be surprised that there is as much as 17% agreement between the Old Testament and a section of the Code of Federal Regulations, but most of it is fragmentary, consisting of collocations and isolated words and characters. The three longest common substrings were (in the last test, without any preprocessing): " on the first day of the month", ", and the name of the ", and " the substance thereof".
In this battery of tests, performance suffered greatly when tokenization preprocessing was applied.
The results from the smallest level of segmentation, without tokenization, had the best quality, but one should question whether this is significant. It may have captured more common text, but it did not find a longer common substring than the first one cited above. Its top three results, " the place where the", " satisfied with the ", and " in the presence of ", suggest that attempting to secure the shortest edit script would be quixotic.
Overall, the tokenizing preprocess can bring with it a heavy penalty, but used in the right situations, it can help manage the process, and produce relatively good results. The segmentation preprocess can significantly assist the process, but also erode the quality of the results.
Accuracy and Quality: The Corruption Test
Variation in the results presented in the previous section might raise questions about the accuracy of tan:diff(). After all, tan:diff() applied to the same two strings produces different results under different settings.
As already noted, the output of tan:diff() permits anyone to reconstruct the original two strings, so the accuracy of results can always be verified. In every single test discussed in this paper, and in hundreds of day-to-day applications of tan:diff() on live work, the two input strings have been confirmed to have been preserved intact in the output. That is, even when tan:diff() returns different results for the same input, each set is always accurate. The real question is whether or not a set of results is optimal.
This brings us back to the goals raised at the beginning of this article. Many have sought output that is sometimes described as a minimal diff, i.e., a set of differences that reflects the shortest edit script needed to get from one string to the other. This entails searching for the longest common subsequence. The result tree of tan:diff() can be used as a metric: the minimal diff is the one where <common> wraps the greatest amount of text.
As will be seen below, tan:diff(), when run without preprocessing, normally returns a minimal diff. When it does not, the minimal diff is closely approximated.
This claim may be surprising, since an approach dependent upon arbitrary staggered samples would seem to be prone to erratic results. But tests prove otherwise, primarily through what I call the corruption test, presented here.
A sample text was chosen. A second copy was made, but with 1% of its characters, chosen at random, changed to characters that could not be found in the original text. That new version became the basis for another version, again with 1% of the characters changed. The process repeated. Thus, the first text had a randomized corruption of 1%; the second, 2%; and so on to the hundredth, which had no characters in common with the original text.
<corr n="0">Mar. 19, 2019 Title 22 Foreign Relations Parts 1 to 299...</corr> <corr n="1">Mar. 19, 2019 Title 22 Foreign Relations Parts 1 to 299...</corr> <corr n="2">Mar. 19, 201𐀍 Title 22 Foreign Relations Parts 1 to 299...</corr> <corr n="3">Mar. 19, 201𐀍 Title 22 Foreign Rel𐀣tions Parts 1 to 299...</corr> <corr n="4">Mar. 19, 201𐀍 Title 22 Foreign Rel𐀣tions Parts 1 to 299...</corr> <corr n="5">Mar. 19, 201𐀍 Title 22 Foreign Rel𐀣tions Parts 1 to 299...</corr> . . . . . <corr n="10">Mar. 19, 2𐀋1𐀍 Title 22 Foreig𐀞 Rel𐀣tions Parts 1 to 299...</corr> . . . . . <corr n="20">M𐀂r. 19, 2𐀋1𐀍 Title𐀔22 Foreig𐀞 𐀠el𐀣𐀤𐀥ons Parts 1 to𐀴29𐀷...</corr> . . . . . <corr n="30">M𐀂𐀃. 𐀆9, 2𐀋1𐀍𐀎Title𐀔𐀕2 Forei𐀝𐀞 𐀠𐀡l𐀣𐀤𐀥on𐀨 Parts𐀯1 to𐀴29𐀷...</corr> . . . . . <corr n="40">M𐀂𐀃. 𐀆9, 2𐀋1𐀍𐀎Title𐀔𐀕𐀖𐀗𐀘orei𐀝𐀞𐀟𐀠𐀡l𐀣𐀤𐀥on𐀨 Parts𐀯𐀰 to𐀴29𐀷...</corr> . . . . . <corr n="50">M𐀂𐀃. 𐀆𐀇, 2𐀋1𐀍𐀎T𐀐tle𐀔𐀕𐀖𐀗𐀘or𐀛i𐀝𐀞𐀟𐀠𐀡l𐀣𐀤𐀥o𐀨 Parts𐀯𐀰 to𐀴29𐀷...</corr> . . . . . <corr n="60">M𐀂𐀃𐀄 𐀆𐀇, 2𐀋1𐀍𐀎T𐀐tle𐀔𐀕𐀖𐀗𐀘or𐀛i𐀝𐀞𐀟𐀠𐀡l𐀣𐀤𐀥o𐀨 P𐀫rt𐀮𐀯𐀰 to𐀴2𐀶𐀷...</corr> . . . . . <corr n="70">M𐀂𐀃𐀄 𐀆𐀇𐀈 2𐀋1𐀍𐀎T𐀐tle𐀔𐀕𐀖𐀗𐀘or𐀛i𐀝𐀞𐀟𐀠𐀡l𐀣𐀤𐀥𐀦𐀨 P𐀫r𐀭𐀮𐀯𐀰𐀱to𐀴2𐀶𐀷...</corr> . . . . . <corr n="80">M𐀂𐀃𐀄𐀅𐀆𐀇𐀈𐀉2𐀋1𐀍𐀎T𐀐𐀑le𐀔𐀕𐀖𐀗𐀘𐀙r𐀛i𐀝𐀞𐀟𐀠𐀡𐀢𐀣𐀤𐀥𐀦𐀨 P𐀫𐀬𐀭𐀮𐀯𐀰𐀱𐀲𐀳𐀴𐀵𐀶𐀷...</corr> . . . . . <corr n="90">M𐀂𐀃𐀄𐀅𐀆𐀇𐀈𐀉2𐀋1𐀍𐀎T𐀐𐀑𐀒𐀓𐀔𐀕𐀖𐀗𐀘𐀙r𐀛i𐀝𐀞𐀟𐀠𐀡𐀢𐀣𐀤𐀥𐀦𐀨 𐀪𐀫𐀬𐀭𐀮𐀯𐀰𐀱𐀲𐀳𐀴𐀵𐀶𐀷...</corr> . . . . . <corr n="100">𐀁𐀂𐀃𐀄𐀅𐀆𐀇𐀈𐀉𐀊𐀋𐀍𐀎𐀏𐀐𐀑𐀒𐀓𐀔𐀕𐀖𐀗𐀘𐀙𐀚𐀛𐀜𐀝𐀞𐀟𐀠𐀡𐀢𐀣𐀤𐀥𐀦𐀨𐀩𐀪𐀫𐀬𐀭𐀮𐀯𐀰𐀱𐀲𐀳𐀴𐀵𐀶𐀷...</corr>
Because the amount of corruption in each version was carefully controlled and measured, the quality of the results of tan:diff() could be measured precisely. In any comparison of the original text against another version with X% corruption, we should expect to see exactly (100 - X)% of the characters wrapped by <common> elements. The remaining X% of each version should be found in pairs of an <a> and a <b> of equal length. Any comparison of the original text to one with 20% corruption, for example, should result in exactly 80% of the characters wrapped by <common> in the output. Anything greater is impossible, and anything less is a measure of a loss in quality.
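The quality metric is easily computed from the result tree. A sketch, assuming the output of the comparison is bound to $diff and the result elements are in the TAN namespace:

<!-- Proportion of string A wrapped by <common>: 1.0 means a maximal
     (perfect) result for the corruption test. -->
<xsl:variable name="quality" select="
   sum($diff/tan:common ! string-length(.))
   div string-length(string-join($diff/(tan:a | tan:common)))"/>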
Five batches of one hundred corruption files were created, of 100, 1K, 10K, 100K, and 1M characters in length, five hundred files in total. For this exercise, all preprocessing was suspended, to force tan:diff() to process each pair of strings alike. Then, for each of the five batches, and within each batch for every corrupted version, the original string was passed to tan:diff() as string A, and the corrupted version as B. The three parameters determining the distribution of samples were set to 0.5 (see above). The maximum number of horizontal passes was set to 50. As will be seen below, raising that number would not have improved results, and would only have added to the processing time; lowering it improved performance, at a very slight expense in quality.
For the batches of length 100, 1K, 10K, and 100K, tan:diff() produced output that matched exactly the highest expected quality, all the way from 1% corruption to 100% corruption. Presenting the results in a table would be pointless, because every row would simply register a quality of 100%. That is, tan:diff() produced four hundred minimal sets of differences, and the output was verified to have preserved the input. Only when the batches got as long as 1 million characters did the results begin to falter, and then only beginning with corruption levels of 15% and greater, and on average affecting only 0.00003% of the total character count.
The excellent quality of tan:diff()'s results can be contrasted with those of Python's ndiff(). The latter function was chosen as a point of comparison because it is one of the few string-difference algorithms to approach the task character by character, not line by line.
The corruption test was applied to ndiff(), with the same batches of texts. ndiff() produced suboptimal results quite early and often. For batches of length 100, the quality began to falter at 35% corruption. In the batch of length 1K, Python's ndiff() began to produce suboptimal results at only 7% corruption. At greater levels of corruption, unnecessary differences reported by ndiff() averaged thirty percent of the total character count. Results were far worse with batches of length 10K and 100K. Batches of 1M characters were not run, because each difference operation took hours.
The controlled corruption test described above is artificial, so expectations of tan:diff() when applied to normal text pairs should be tempered. Two caveats are in order.
The first caveat raises a point of caution. This particular test replaces every deleted character with a new one. Such one-to-one replacement is conducive to tan:diff() and other difference algorithms that resolve ambiguous matches in light of those matches' proportional positions within the string or substring. But in ordinary cases, editorial interventions do not consist of simple one-to-one changes. At various places, editors make wholesale insertions without compensatory deletions, or deletions without insertions. Even when there are paired deletions and insertions, they are normally not the same length as what they replace. So the position of unchanged text constantly shifts relative to the whole, and to other unchanged text. In some cases, strictly calculated proportional reassignment may be mistaken, or only contestably correct. So, even though the results of tan:diff() are rather spectacular for the corruption test, one should expect less optimal results for pairs of strings where the editing has unpredictably shifted the relative position of unchanged text.
The second caveat raises considerable promise. The random character changes in the corrupted files tended to be evenly distributed. Such changes are punishing on any string-comparison algorithm, because large common substrings are constantly being shattered into smaller and smaller shards that become almost impossible to correlate with the original. As noted in the Python ndiff tests above, such fragmentation is very challenging for most algorithms. But ordinarily, when a text is edited by humans, changes occur in clumps, not at random, and long common substrings are common even in two texts that are more dissimilar than similar. The corruption test poses the worst-case scenario for any text-difference algorithm. The very few qualitative flaws in the output of tan:diff() suggest it is an excellent tool for analyzing very messy differences.
A Second Glance at Performance
The second caveat for the corruption test speaks not only to the quality of tan:diff() but to its performance. Randomly distributed changes heavily penalize performance, because the number of steps required to find the LCG grows steeply. So the corruption tests provide a window into the efficiency of tan:diff().
Each of the five batches of corruption tests was run ten times, and the speeds were averaged, discounting the average time needed to load the Saxon engine and the package that delivers tan:diff() (normally about 0.1 and 0.3 seconds, respectively). In this case, tests were run from a Python script that passed the strings directly into Saxon HE 9.9 and then measured the speed of each operation in milliseconds. The results are shown below.
On the batch of length one million, the tests started to take excessively long at corruption level 67%, and had to be terminated. At this level, sample sizes were running out, and the fallback algorithm, which shifts toward quadratic time, took over. For the same reason, the batch of length 100K began to take noticeably longer when there was 98% difference between the two texts.
The two top bands provide some idea of the point at which preprocessing might be useful. Because of the rather even distribution of random changes, preprocessing via tokenization quickly became inefficient as the corruption level increased. The segmentation method, however, did rather well. Batches of length 1M were easily processed up to about 94% corruption, at which point the operations became excessively taxing. For the most part, such segmentation adds about 35% extra time. This is to be expected because of the logarithmic nature of the function. Below is a comparison of the batch of length 1M, comparing the non-preprocessed operations (time shown is the average of ten operations) to the preprocessed ones (time shown is for a single operation only).
Let us return to strings that are of more tractable size. The three other batches, of length 100, 1K, and 10K characters, are quite comparable and consistent in the time needed for execution. Because the relationship is not easily seen in Fig. 3, they are magnified in Fig. 5.
The amount of time required is erratic and unpredictable, although some general trends are evident. At any given band of corruption level, processing time was consistent with logarithmic time complexity. As corruption increases, the logarithmic model slowly breaks down.
This efficiency of tan:diff() contrasts with Python's ndiff(). For batches of length 100, 1K, and 10K characters, Python was much faster, normally requiring only a few milliseconds, and never more than two seconds. But at the batch of length 100K tan:diff() performed better: Python's algorithm needed between seven and twenty-six seconds. And at batches of length 1M, Python's ndiff() was prohibitive to use, needing more than seventeen minutes to run for the 1% corruption test.
Another window into tan:diff()'s time complexity is seen when the times are prorated, at seconds per thousand characters. That is, how long does the operation take per character?
Here the batch of 100 characters is the outlier, and is clearly the least efficient. The erratic path is indicative of the fortuitous nature of finding the first matching sample. As the length of the short string increases, the per-character speed of the algorithm improves. Once again, because of the outliers, the general pattern is best illustrated by presenting the same graph, but focusing only on the largest batches (the three lines nearly indistinguishable in Fig. 6):
The problems inherent in searching for tiny samples are illustrated at the end of the batch of length 100K. In future versions of tan:diff(), a better approach to handling small sample sizes may make these comparisons more feasible.
The Maximum Number of Samples
The default maximum number of horizontal samples on any vertical level, currently 50, may seem counterintuitive. Once the sample size diminishes to 2% of the shorter input string, the samples will no longer overlap. One might expect anomalous results. But the results of the corruption test on batches of 100K and 1M were of remarkable quality. One could turn the question around, of course, and ask why go as high as fifty? How low can the ceiling be set without compromising the quality of the results?
The answer, perhaps surprisingly, is that the threshold can be lowered to as few as three samples, even on very large strings. The size of the input string is seemingly immaterial.
To test this hypothesis, the corruption test was applied once again, this time adjusting only the maximum number of horizontal samples, and focusing on the batches of 100K and 1M, to find the threshold at which the quality of results deteriorates.
For the batch of 100K, when the setting was lowered to a maximum of 10 samples, the results remained perfect. When lowered to 3, the results were still virtually flawless. The only suboptimal results occurred at corruption levels of 13% and 14%, which fell short of the expected maximum by a single character. Furthermore, lowering the maximum number of samples to 10 and 3 improved the efficiency of the algorithm, by 6.7% and 6.4%, respectively.[6]
At the batch of 1M, results began to suffer slightly more. When the maximum was lowered to 10, corruption levels 15% through 33% and 43% through 55% were off by two or three characters, but never more than four. All other corruption levels had perfect results. When the threshold was lowered yet further to 3, the quality of results grew only very slightly worse. Corruption levels 11% through 56% were off by an average of 1.85 characters, and never more than six.
More significantly, lowering the maximum number of samples improved the efficiency of the algorithm. Recall above (see Fig. 3) that at corruption level 67% the standard algorithm became too time-consuming to run. But lowering the threshold to 10 and 3 improved performance and optimized memory management. When lowered to 10, there was on average a 17.3% improvement in speed; at 3, 20.6%.
Lowering the maximum threshold resulted in a slight gain in efficiency of about 9%. But the times tended to be more erratic, with some corruption levels resulting in significantly longer processing times, likely due to the less frequent samples.
So why not set the ceiling at 3 as a default? Tests from the Code of Federal Regulations suggest setting the ceiling so low in real-world texts may more significantly compromise quality. Consider the 2018-to-2019 CFR tests presented earlier. The following shows what happens when the maximum number of samples is changed (no preprocessing was applied):
Table VI
max. number of samples | string a length | string b length | unique a | unique b | seconds | 1K chars / sec |
200 | 4937847 | 4861503 | 2.12% | 0.59% | 28.1 | 348.73 |
100 | 4937847 | 4861503 | 2.12% | 0.59% | 16.6 | 590.32 |
50 | 4937847 | 4861503 | 2.12% | 0.59% | 11.0 | 890.85 |
25 | 4937847 | 4861503 | 2.12% | 0.59% | 8.5 | 1152.86 |
10 | 4937847 | 4861503 | 2.05% | 0.51% | 6.7 | 1462.59 |
3 | 4937847 | 4861503 | 2.93% | 1.41% | 6.9 | 1420.20 |
Clearly, performance improves as the ceiling is lowered. But surprisingly, the results improved when the threshold was set to 10, then deteriorated when it was lowered further. Other parameters, and the peculiarities of the texts chosen, may play a significant role in these alterations. Further testing and analysis is needed to better understand the effect of changing the maximum number of samples.
Code and Applications
tan:diff() is written in XSLT 3.0, and is a central part of the Text Alignment Network (TAN) function library, alpha release 2021, http://textalign.net. To encourage reuse and adaptation, all code is open source, released under a GNU General Public License. Previous versions of the function, provided in TAN alpha releases 2018 and 2020, were written in XSLT 2.0 and released under the same license. Whether imported directly into an XSLT project or invoked in XQuery via fn:transform(), the file that should serve as the main point of entry to access the functions is /functions/TAN-function-library.xsl.
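For direct import, a minimal harness might look like the following sketch; the import href is an assumption and must point to wherever the function library resides:

<!-- Minimal sketch of importing the TAN function library and calling
     tan:diff() on the opening example. -->
<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:tan="tag:textalign.net,2015:ns"
   exclude-result-prefixes="#all">
   <xsl:import href="functions/TAN-function-library.xsl"/>
   <xsl:template name="xsl:initial-template">
      <xsl:sequence select="tan:diff('fish cat bird', 'fish dog bird')"/>
   </xsl:template>
</xsl:stylesheet>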
tan:diff() has been enormously important in very practical applications, particularly in TAN's validation routine.
Consider two XML files that link to each other and claim to contain the same text, but are marked up differently. Such a textual redivision might be necessary if one wishes to structure the same text according to different hierarchies. In the TAN validation routine, when one of those files is validated, a check is made against the other version that is said to contain an identical transcription. Behind the scenes, tan:diff() compares the two texts, space-normalized, and the results are returned to a Schematron report, which tells the user exactly where the error occurs.
Below is a sample from an English translation of Aristotle's Categories, which has two competing reference systems, both commonly used. The top version uses a logical system based on chapters, paragraphs, and sections; the bottom version uses so-called Bekker numbers to point to the page, column, and line numbers of the 19th-century edition. The bottom version has had the word "MISTAKE" inserted. When the TAN file is validated in Oxygen, the user sees the following:
The TAN validation routines do not just complain about the mistake. Through Schematron Quick Fixes (SQF), one can fix the mistake with a couple of keystrokes within XML editors that support SQFs:
Thus, a user can easily coordinate two separate but identical texts.
For XML environments, which tend to be text-centric, tan:diff() has proven to be indispensable, not just for validation but for enterprise applications, such as measuring the amount of work in the stages of an editorial workflow, or evaluating OCR results from different engines at different settings. Many more practical applications could be mentioned.
To my knowledge, tan:diff() is the first XSLT function of its kind. It is highly efficient and highly accurate, especially on very large strings. The algorithm described above could be modeled in many other programming languages. But XSLT allows us to exploit the output in ways that are difficult or impossible in other languages. For example, if string A is compared to string B, and then string B to string C, one can use XSLT to merge the result trees, to measure or visualize the relative distance of A, B, and C. And one need not stop at just three strings. Why not a four-way comparison, or five-way? I hope to present and discuss such opportunities and challenges in future work.
References
[aldous] Aldous, David, and Persi Diaconis. “Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem.” Bulletin of the American Mathematical Society 36, no. 4 (July 21, 1999): 413–33. doi:https://doi.org/10.1090/S0273-0979-99-00796-X.
[altschul] Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215, no. 3 (October 5, 1990): 403–10. doi:https://doi.org/10.1016/S0022-2836(05)80360-2.
[birnbaum] Birnbaum, David J. “Sequence Alignment in XSLT 3.0.” In XML Prague 2020 – Conference Proceedings, 45–65, 2020.
[boes] Boes, Olivier. “Improving the Needleman-Wunsch Algorithm with the Dynamine Predictor.” University of Brussels, 2014.
[fraser] Fraser, Neil. “Writing: Diff Strategies.” Accessed February 20, 2021. https://neil.fraser.name/writing/diff/.
[heckel] Heckel, Paul. “A Technique for Isolating Differences Between Files.” Communications of the ACM 21, no. 4 (April 1, 1978): 264–68. doi:https://doi.org/10.1145/359460.359467.
[huitfeldt] Huitfeldt, Claus, and C. M. Sperberg-McQueen. “Document Similarity: Transcription, Edit Distances, Vocabulary Overlap, and the Metaphysics of Documents.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). doi:https://doi.org/10.4242/BalisageVol25.Huitfeldt01.
[hunt] Hunt, J. W., and M. D. McIlroy. “An Algorithm for Differential File Comparison.” Bell Laboratories Computing Science Technical Report 46, July 1976.
[jhaver] Jhaver, Shagun, Latifur Khan, and Bhavani Thuraisingham. “Calculating Edit Distance for Large Sets of String Pairs Using MapReduce.” In Proceedings of the ASE International Conference on Big Data, 2014.
[myers] Myers, Eugene W. “An O(ND) Difference Algorithm and Its Variations.” Algorithmica 1, no. 1–4 (November 1986): 251–66. doi:https://doi.org/10.1007/BF01840446.
[ratcliff] Ratcliff, John W., and David E. Metzener. “Pattern Matching: The Gestalt Approach.” Dr. Dobb’s. Accessed February 23, 2021. http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970.
[sankoff] Sankoff, David, and Joseph B. Kruskal. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, Mass.: Addison-Wesley Pub Co, Advanced Book Program, 1983.
[stephen] Stephen, Graham A. String Searching Algorithms. Lecture Notes Series on Computing 3. Singapore: World Scientific Publishing Co Pte Ltd, 1994. doi:https://doi.org/10.1142/2418.
[1] Although Myers does not say so, the size of the minimum edit script is equivalent to Levenshtein Distance, widely known and discussed elsewhere; see [huitfeldt] and references.
[2] Earlier published versions of tan:diff() relied on fn:matches() and fn:analyze-string(). Changing the function to the three functions mentioned above significantly improved performance.
[3] Parameter settings described here reflect default parameter settings in September 2021, in version 2021 of the TAN function library. These parameter settings may change in future releases.
[4] Brokering ambiguity based upon relative position may suffer in cases where there have been major transpositions in texts with repeated structures (e.g., tables). Such effects from transposition can be explored in post-processing functions. See below.
[5] Darwin corpus: https://github.com/djbpitt/xstuff/tree/master/nw. CFR: https://www.govinfo.gov/bulkdata/CFR. Old Testament: https://github.com/Arithmeticus/TAN-bible.
[6] By contrast, when the maximum number of samples was set to 250, tan:diff() took on average 62% longer to run on the batch of 100K, and of course without any improvement in results, which were already perfect.