How to cite this paper
Halpern-Hamu, Charlie. “TCS tcs: Tata Consultancy Services trash compactor script: Design considerations in
the implementation of a boil-this-corpus-down-to-a-sample-document tool.” Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). https://doi.org/10.4242/BalisageVol8.Halpern-Hamu02.
Balisage: The Markup Conference 2012
August 7 - 10, 2012
Balisage Paper:
TCS tcs: Tata Consultancy Services trash compactor script
Design considerations in the implementation of a boil-this-corpus-down-to-a-sample-document
tool
Charlie Halpern-Hamu
Senior Solutions Architect
Tata Consultancy Services
Charlie has been working with structured text since 1991.
During this time, he has acted as a content and systems architect, programmer, systems
integrator, consultant,
mentor, best-practices coordinator, trainer, book editor, project lead, department
manager, and vice president.
His consulting and training work has taken him all over North America,
with visits to South America, Europe, Australia and China.
Charlie has a PhD in Computer Science from the University of Toronto and an MBA from
Heriot-Watt University.
He's good at making complex systems easy to understand. Or so he claims.
© Copyright 2012, Tata Consultancy Services.
Disclaimer: All views expressed in the publication are those of the author, and Tata Consultancy
Services (TCS) does not warrant, either expressly or impliedly, the accuracy or appropriateness
of the information in the publication. TCS disclaims any responsibility for content
errors, omissions, and any responsibility associated with relying on the information
provided in the publication.
Abstract
Creation of representative sample(s) of a large document collection can be automated
using XSLT.
Such samples will be useful for analysis, as a preliminary document-analysis step
in vocabulary redesign
or conversion, and to guide the design of storage, editing, and transformation processing.
Design goals are: to work intuitively with default configuration and no schema, produce
plausible output,
and produce a range of outputs from a large representative set to a short but highly
complex sample document.
The technique can be conceptualized in passes: annotate structures as original or
redundant;
keep wrappers to accommodate original markup found lower in the hierarchy;
retain required children and attributes; and collapse similar structures.
Possible settings include redundancy thresholds, text compression techniques, target
length, schema-awareness,
schema intuitions, how much context to preserve around kept elements, and whether
similar structures should be collapsed (overlaid).
Table of Contents
-
Context
-
Alternatives
-
Goals
-
Algorithm
-
Efficiency
-
Parameters
-
Code
-
Discussion
Context
Sample documents play a critical role in the development of most XML systems.
But good sample documents are difficult to generate.
There seem to be three basic approaches, each with its difficulties:
-
Craft.
Artfully crafting sample documents by hand is usually a labor-intensive path
to an artificial-feeling result.
Hand-crafted samples can be especially frustrating when they go stale,
when even minor schema changes necessitate significant rework.
-
Mad Libs.™
Automatically-generated samples are often so random and disconnected from reality
that they are difficult for humans to comprehend and use.
Based on all the possibilities of a schema, they include many combinations
that simply would not occur in real life.
The schema provides no guidance about the kinds of content
that would make sense in each element, so contents are unfamiliar nonsense.
-
Curation.
Thoughtfully choosing some sample documents from a larger set of available documents
is the approach most similar to the approach presented by this paper.
The difficulty is that most XML corpora are like war:
long stretches of boredom punctuated by moments of mildly increased interest.
It's easy to miss those documents that have interesting, unusual markup.
And it's easy to miss the interesting, unusual markup even within the documents chosen.
There is often a tension between brevity and completeness.
It's easiest for a human to understand a shorter, simpler document.
Including more elements, in more contexts, requires longer, more complex documents.
Alternatives
Before this paper was presented on Wednesday of the
Balisage conference,
there was a discussion of approaches to creating sample documents
at the preceding
symposium on QA and QC
on the Monday.
The following alternatives were listed:
-
Dale Waldt
suggested starting with all XPaths
(presumably FQGI, fully qualified generic identifier, ancestry paths) in a larger
set,
and finding an example of each.
-
Paul Ryan
said he automated the extraction of all XPaths
(again, presumably FQGIs) and automatically created Lorem Ipsum text.
-
Murray Maloney
spoke of running all variations, as determined by the possibilities allowed by the
Schema.
-
John Cowan
talked about how, with LexisNexis, it was easy to get a firehose blast of volumes
of data.
He talked about crafting a sample document that contained actually true facts,
though it was a constructed sample.
-
Robert Stuart
advised against real, or real-looking data,
having once had an intelligence agency get upset with him
when he too closely simulated, using public sources, the look of secret documents.
-
Syd Bauman
said that when he needs a set of test documents to exercise a collection,
he typically starts by cutting out the uninteresting TEI header metadata,
and then keeping the first few paragraphs of each of several documents.
When he needs to generate test data about an instance,
he tends to need to be much pickier about which bits he keeps,
and, in general, he does this by hand.
For the cases where content could not be shared,
Syd also mentioned that Elliotte Rusty Harold
had presented an XML Randomizer that worked like a neutron bomb on XML documents
(my inappropriate analogy), scrambling the content but leaving the structures intact.
[Harold]
-
Debbie Lapeyre
warned against taking the first ten pages,
and relayed an anecdote in which the last thousand pages were very different
from the first ten.
-
Ian Gorman
described taking all available content as the universe from which to choose.
If I understood correctly, this is the approach automated by the current paper.
Goals
The primary goal of this project is a tool to boil large sets of XML data
down to smaller samples.
The samples should maintain as much as possible of the variety and complexity
of markup patterns, and remove as much as possible of the useless repetitions.
Subsidiary design goals include:
-
Plausible results.
The hope is that the results will feel natural: compact, but recognizable.
-
Good defaults.
The default parameter settings should hit some kind of sweet spot.
Is it presumptuous to imagine that such a sweet spot exists?
I imagine it to be something like this:
one can say that the sample is complete and that, at the same time, it is manageably short.
-
Single dial.
Given the trade-offs between brevity and completeness,
it would be nice to have a single parameter that simultaneously adjusted
all the other parameters towards one extreme or the other.
-
Ease over perfection.
If the occasional result is invalid, because of a wrong guess
about required elements, that's probably okay.
-
Use information from schema.
If there is a schema available, it would be nice to take advantage of it.
The goal would not be to include elements only because they exist in the schema,
but to avoid deleting elements that are required by the schema.
-
Intuit schema.
My guess is that in most real-life situations,
required elements can be easily intuited.
For example, if every section contains a title, it's reasonable to guess
that title is required in section (a sketch of this check in code follows this list).
Of course this rule of thumb can fail in one of two ways:
-
False positive.
It could be the case that every section in a sample contains a title, but only by chance.
But even in this case, there's no denying that a representative sample
will contain a title for every section.
-
False negative.
It could be that the content model is more complicated,
that either number or title is required on every section.
As a result, the intuition will fail, and conclude that both number and title are optional.
I haven't had enough real-life experience with the tool yet
to conclude how much trouble this will create.
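A minimal sketch of this guess, assuming elements are compared by name alone
(the function name tcs:guessed-required is mine and is not part of the script;
it would sit at the top level of a stylesheet like the one in the Code section below,
which declares the xs and tcs prefixes):
<!-- Hypothetical helper: guess that $child-name is required in $parent-name
     when no element with the parent's name anywhere in $corpus lacks such a child. -->
<function name='tcs:guessed-required' as='xs:boolean'>
  <param name='child-name' as='xs:string' />
  <param name='parent-name' as='xs:string' />
  <param name='corpus' as='node()' />
  <sequence select='exists($corpus//*[name() eq $parent-name])
                    and empty($corpus//*[name() eq $parent-name]
                                        [not(*[name() eq $child-name])])' />
</function>
For example, tcs:guessed-required("title", "section", /) is true exactly when every
section in the corpus has a title child, which is what produces both the false positives
and the false negatives described above.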
Algorithm
The algorithm, as implemented, involves multiple passes:
-
Annotate with signatures.
The present implementation uses an attribute in a special namespace,
tcs:signature, to record the characteristics we're using to determine
what counts as the same and what counts as different
(a worked example follows this list).
This is configurable, and includes some combination of ancestry,
previous siblings, element name, attributes specified and child elements.
Because of the choice to use an attribute,
it's easiest to annotate only elements,
not attributes, processing-instructions or comments.
In an attempt to increase efficiency, a signature-to-element key was introduced.
This introduced the possibility of using the same key mechanism
to annotate attributes, processing instructions and comments.
The current implementation uses this first pass to shorten the text nodes,
the reasoning being that less text would then pass through the subsequent passes.
In practice it actually seems to slow the process, so the real benefit is ease of debugging.
-
Mark unique elements.
The next pass is to indicate a small number of each signature to keep.
As implemented, the first one (or two or more, configurably)
of each unique-signature element is kept,
but there's no requirement that it be the first elements that are kept.
-
Mark wrapping elements.
If an element is to be kept, obviously all its ancestor elements must be kept.
The current implementation marks these ancestors in a separate pass.
-
Mark required children.
For all the elements that are to be kept,
certain child elements need to be kept, as they are required ...
or guessed to be.
In addition, this is a good time to mark inline elements
mixed in with the text of the elements that are being kept.
The reason for doing this is to maintain the sense of the retained text;
otherwise, it's likely we'll lose
the second and subsequent emphasized words in a sentence.
-
Prune unneeded elements.
The final pass removes those elements that haven't been marked to be kept.
The current implementation keeps all attributes
and also text, processing instruction and comment children of kept elements.
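To make the passes concrete, here is a small invented input (the element names are
illustrative, not from any real corpus) as it might look after the signing pass,
using the signature components of the reference code below:
ancestry, nearest preceding sibling, own name, attribute names and first child.
<book tcs:signature='book section' xmlns:tcs='mailto:charlie.hamu@tcs.com'>
  <section id='s1' tcs:signature='book section id title'>
    <title tcs:signature='book section title'>Compaction</title>
    <para tcs:signature='book section title para'>First paragraph.</para>
    <para tcs:signature='book section para para'>Second paragraph.</para>
    <para tcs:signature='book section para para'>Third paragraph.</para>
  </section>
</book>
The mark pass keeps the first element bearing each distinct signature, so only the third
para (whose signature repeats the second's) goes unmarked; the wrap, need and want passes
add nothing further in this tiny example, and the prune pass drops the third para.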
Efficiency
In the course of developing the current implementation,
several efficiency issues were encountered.
Not all have yet been resolved.
-
Regex text trimming.
The current implementation keeps the first few words and the last few words
of each text node. The regular expression processing to achieve this appears
to be slow. My guess is that it has to do with backtracking, and might be
improved by simply taking the first few words and not looking to keep the last.
-
Best time to trim text.
In the naïve expectation that reducing the amount of text
to be processed by subsequent passes would be an efficiency gain,
the first pass does the trimming of the text nodes.
But only a small number of text nodes are retained in the end,
so it turns out to be more efficient to prune the tree first
before abbreviating the text nodes.
The truth is that the most useful results might come from untrimmed text.
-
N-squared comparison to previous elements.
Unlike text trimming, which isn't essential to the process,
finding unique signatures is the purpose of the script, so it can't be skipped.
The first implementation of the script used a comparison of the current tcs:signature
with all the tcs:signatures on the preceding axis.
This obviously introduces an order n² inefficiency.
The second attempt involved using a key to map signatures to elements,
and an intersect to detect whether the set of similar signatures
includes elements that overlap with the preceding axis (sketched after this list).
This is somewhat more efficient, but on the same order.
In order to work with large data sets, this needs to be resolved.
The current solution is to use a non-XSLT streaming pass.
The current implementation of that pass is not XML-aware; it is programmed using Perl regular expressions.
It could be done using a streaming XML parser,
but as long as the documents contain no text that merely looks like tcs:signature attributes
without actually being markup, there will be no problem.
-
Multiple passes vs complicated XPath.
The current implementation consists of several passes as outlined above.
In order to accommodate a non-XSLT streaming pass in the middle,
the first pass is in a separate script from the last passes.
It's almost certainly the case that several of these calculations
could be strung together in a more complicated XPath expression.
This might be easier to understand, or not.
And it might be more efficient, or not.
I imagine it might be elegant, in a way.
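As a sketch of that second, key-based attempt (the key name is mine), the mark-pass test
consults a key instead of scanning every signature on the preceding axis, although the
intersect with the preceding axis keeps the overall order of growth the same:
<!-- Declared at the top level of the stylesheet. -->
<key name='tcs:by-signature' match='*' use='@tcs:signature' />
<!-- Inside the mode "mark" template: keep the element only if no element
     sharing its signature lies on the preceding axis. -->
<if test='empty(key("tcs:by-signature", @tcs:signature) intersect preceding::*)'>
  <attribute name='tcs:mark' select='"keep"' />
</if>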
Parameters
The following parameters have been implemented:
-
Ancestor count.
How much context should be included in an element's signature?
Measured on a scale of zero to infinity, with 0 representing no ancestry,
1 representing the parent of the current element,
2 representing the grandparent and parent,
7 presumed to be the largest number ever needed short of
8 which is taken to represent infinity (at a ninety-degree angle).
Infinity seems to be a good default,
as some XML processing operates on fully-qualified generic identifiers.
Also, deeply nested elements have a tendency
to stress stylesheets and other processing.
-
Preceding-sibling count.
How many preceding siblings should be included in an element's signature?
A good default seems to be 1, looking only at the closest preceding sibling.
-
Self count.
Should the element name itself be included in its own signature?
Surely the answer should be yes in most applications.
For consistency, this parameter is included, on a scale from 0 (no) to 1 (yes).
-
Attribute count.
Should the element's specified attributes count as part of its signature?
Measured on a scale of 0 (no) to 8 (all of them),
with no sensible meaning for the numbers in between.
The current script has a bug in that it doesn't sort the attribute names as it should.
A good default seems to be to include the attribute names.
-
Child count.
How many children should be included in an element's signature?
Again, this parameter is measured on a scale of 0 = none to 8 = infinity.
The children are counted out in document order.
The thought here is to distinguish a section that starts with a number
from a section that starts with a title.
An overlapping effect can sometimes also be achieved by using the preceding-sibling parameter:
including a section not on its own merits,
but because it contains a title that follows a number.
(The first five of these parameters, plus the repetition count below,
are sketched in code after this list.)
-
Start words count.
How many words should be kept at the start when abbreviating a text node?
-
Middle words count.
What is the minimum number of words to delete from the middle of a text node?
If there are fewer than this number of words in the middle, don't abbreviate.
-
End words count.
How many words should be kept at the end when abbreviating a text node?
So, a text node is not abbreviated unless it contains start + middle + end words.
Though these parameters are still available in the full code,
more readable results are found more efficiently
by simply taking the first sentence of each text node.
This is the approach taken in the simplified code below.
-
Repetition count.
How many times would we like to see each signature?
Since we may want to reduce a very large corpus to a still large sample,
it doesn't make sense to follow the silly pattern of the other parameters,
insisting that after 7 comes infinity.
One discovery, working with the tool,
is that the most effective way to get a bulkier sample
is to use higher values for the various context parameters.
This way, in addition to being simply bulkier,
the extra included elements will show more variety.
For this reason, the chosen default value is 1.
-
Show deletions.
Should comments be left in the sample to indicate where elements have been pruned?
It can be helpful, when working with a smaller sample,
to have indication of where elements have been deleted from the larger source.
As implemented, the integer parameter is interpreted as a Boolean,
but it might make sense to interpret it as a number:
showing some number of comments before stopping.
To avoid overwhelming results, the default value is 0 = no.
-
Debug signatures.
Should the intermediate element signatures be shown in the final output?
As with the parameters above and below,
it would make sense to interpret this value as an integer,
rather than as a Boolean.
The default value here is 0, meaning no tcs:signature attributes are shown.
-
Debug marks.
Should the intermediate tcs:mark attributes be shown?
Again, this integer parameter might make better sense interpreted as a count,
rather than interpreted as a Boolean.
The tcs:mark attributes are added to all elements that are to be kept,
with the value indicating the reason.
Reasons include that the element had a unique signature
(or was within the specified repetition count),
that the element had a descendant that was to be kept,
that the element was known or guessed to be a required child of a kept element,
or that the element was a textual inline in the middle of kept text.
The default value is 0, meaning no intermediate tcs:mark attributes are shown.
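By way of illustration, here is a hedged sketch of how the first five parameters above,
plus the repetition count, might drive the signing and marking templates from the Code
section (the parameter names are mine; 8 is read as infinity, as described above):
<!-- Hypothetical parameterization; attribute names are left unsorted here,
     as in the current script. -->
<param name='ancestor-count' select='8' />
<param name='preceding-sibling-count' select='1' />
<param name='self-count' select='1' />
<param name='attribute-count' select='8' />
<param name='child-count' select='1' />
<param name='repetition-count' select='1' />
<!-- Mode "sign", generalizing the fixed signature in the Code section. -->
<template match='*' mode='sign'>
  <copy>
    <attribute name='tcs:signature'
      select='ancestor::*[$ancestor-count ge 8 or position() le $ancestor-count]/name(),
              preceding-sibling::*[$preceding-sibling-count ge 8
                                   or position() le $preceding-sibling-count]/name(),
              name()[$self-count ge 1],
              @*/name()[$attribute-count ge 1],
              *[$child-count ge 8 or position() le $child-count]/name()' />
    <copy-of select='@*' />
    <apply-templates select='node()' mode='sign' />
  </copy>
</template>
<!-- Inside the mode "mark" template, honoring the repetition count instead
     of keeping only the first occurrence of each signature. -->
<if test='count(preceding::*[@tcs:signature = current()/@tcs:signature])
          lt $repetition-count'>
  <attribute name='tcs:mark' select='"keep"' />
</if>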
Further parameters have been designed, but not yet implemented.
Code
To illustrate the key concepts, a reference implementation is presented here.
This version of the script is stripped of comments, parameters and efficiencies.
Please contact the author for the current revision of the full script.
<?xml version='1.0' encoding='UTF-8'?>
<!-- TCS = Tata Consultancy Services; tcs = trash compactor script. -->
<!-- For brevity, this version is missing comments, parameters and efficiencies. -->
<transform xmlns='http://www.w3.org/1999/XSL/Transform' version='2.0'
xmlns:xs='http://www.w3.org/2001/XMLSchema'
xmlns:tcs='mailto:charlie.hamu@tcs.com'>
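<!-- Pipeline: sign every element, mark the first of each signature, wrap the
     ancestors of marked elements, guess needed children (applied twice, so that
     elements kept by the first pass can in turn pull in their own required
     children), keep wanted inlines, then prune everything unmarked. -->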
<template match='/'>
<variable name='signed'>
<apply-templates select='node()' mode='sign' />
</variable>
<variable name='marked'>
<apply-templates select='$signed' mode='mark' />
</variable>
<variable name='wrapped'>
<apply-templates select='$marked' mode='wrap' />
</variable>
<variable name='needed'>
<apply-templates select='$wrapped' mode='need' />
</variable>
<variable name='needed-2'>
<apply-templates select='$needed' mode='need' />
</variable>
<variable name='wanted'>
<apply-templates select='$needed-2' mode='want' />
</variable>
<variable name='pruned'>
<apply-templates select='$wanted' mode='prune' />
</variable>
<sequence select='$pruned' />
</template>
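<!-- Mode "sign": annotate every element with a signature built from its
     ancestry, nearest preceding sibling, own name, attribute names and
     first child element. -->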
<template match='*' mode='sign'>
<copy>
<attribute name='tcs:signature' select='ancestor::*/name(),
preceding-sibling::*[1]/name(),
name(),
@*/name(),
*[1]/name()' />
<copy-of select='@*' />
<apply-templates select='node()' mode='sign' />
</copy>
</template>
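<!-- Mode "mark": keep an element if no preceding element carries the same
     signature. -->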
<template match='*' mode='mark'>
<copy>
<if test='not(preceding::*[@tcs:signature = current()/@tcs:signature])'>
<attribute name='tcs:mark' select='"keep"' />
</if>
<copy-of select='@*' />
<apply-templates select='node()' mode='mark' />
</copy>
</template>
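<!-- Mode "wrap": keep any as-yet-unmarked element that has a marked
     descendant, so kept elements retain their original ancestry. -->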
<template match='*' mode='wrap'>
<copy>
<if test='not(@tcs:mark) and *//@tcs:mark'>
<attribute name='tcs:mark' select='"wrap"' />
</if>
<copy-of select='@*' />
<apply-templates select='node()' mode='wrap' />
</copy>
</template>
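<!-- Mode "want": keep an unmarked element that sits in the running text of a
     kept element (both it and its parent have non-whitespace text), so that
     retained text keeps its inline markup. -->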
<template match='*' mode='want'>
<copy>
<copy-of select='@*' />
<if test='not(@tcs:mark) and ../@tcs:mark
and text()[normalize-space() ne ""]
and ../text()[normalize-space() ne ""]'>
<attribute name='tcs:mark' select='"want"' />
</if>
<apply-templates select='node()' mode='want' />
</copy>
</template>
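<!-- Mode "need": under a kept element, keep the first child of each name if
     every element in the document with the parent's name has such a child
     (the guessed-required-child rule). -->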
<template match='*' mode='need'>
<copy>
<if test='not(@tcs:mark)
and ../@tcs:mark
and not(preceding-sibling::*[name() eq current()/name()])
and not(//*[name() eq current()/../name()
and not(*[name() eq current()/name()])])'>
<attribute name='tcs:mark' select='"need"' />
</if>
<copy-of select='@*' />
<apply-templates select='node()' mode='need' />
</copy>
</template>
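<!-- Mode "prune": emit only marked elements, dropping the tcs:* bookkeeping
     attributes. -->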
<template match='*' mode='prune'>
<if test='@tcs:mark'>
<copy>
<copy-of select='@* except @tcs:*' />
<apply-templates select='node()' mode='prune' />
</copy>
</if>
</template>
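<!-- Abbreviate each retained text node to its first sentence. -->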
<template match='text()' mode='prune'>
<value-of select='replace(.,"\.\s.+","...","s")' />
</template>
<template match='comment() | processing-instruction()' mode='#all'>
<copy />
</template>
</transform>
Discussion
This paper was presented at Balisage 2012.
[Balisage]
In the discussion afterwards, the following points were raised by attendees.
-
Pointers.
It would be nice to link the samples, element-by-element, back to the full corpus,
so that you could see the full context if curious.
-
Randomize.
Liam Quinn suggested that if you are taking a small sample from a larger set,
it would be nice to be able to randomize the selection,
rather than just taking the first examples of each markup combination.
-
Log the seed.
As a follow-on suggestion, Lee says to log the seed used for the random selection,
so that it can be recreated as required.
-
Redo from back.
Wendell Piez observed that, in order to avoid redundant elements
(where, for example, the first paragraph is included because we've never seen a paragraph before,
but the second paragraph is also included because it contains an inline),
we can run the algorithm twice: once front to back and then a second time back to front.
-
XQuery.
There seemed to be general consensus that multiple passes were the natural approach,
and that any attempt to collapse this into a single XPath expression or XQuery was
misguided.
-
Xmlsh.
David Lee said the individual passes could be run together with xmlsh.
This would be especially relevant if the inefficient have-I-seen-this-before pass
were recast using another programming language.
-
Efficiency.
Michael Sperberg-McQueen said that the n² inefficiency could be solved,
even in XSLT 1.0, using keys and the Muenchian method. [Tennison]
-
Accumulators.
Abel Braaksma said that XSLT 3.0 accumulators were designed to solve problems
just like the have-I-seen-this-before inefficiency, and took the example back to the
XSLT working group.
My initial intuition was to add the current element's signature to the accumulated set
as the element starts. But this is too soon, as we need to know about all preceding elements,
not including the one we're on. On the other hand, updating the accumulator at the end of the element
would be too late, as we would not have had its signature as we reviewed all of its descendants.
The naive fix would be to somehow put the current element's signature on deck
so that it can be folded into the accumulated list of all signatures
as soon as we hit the start of the next element.
Rather than building this data structure ourselves,
the elegant choice seems to be to avoid the temptation
to add the current element's signature to the accumulated set,
and rather to add the immediately preceding element's signature to the accumulator.
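A minimal XSLT 3.0 sketch of that last idea, as I read it (the accumulator name and the
wiring are mine, not part of the current script, and making accumulators available on the
temporary trees between passes may need extra care):
<transform xmlns='http://www.w3.org/1999/XSL/Transform' version='3.0'
xmlns:xs='http://www.w3.org/2001/XMLSchema'
xmlns:tcs='mailto:charlie.hamu@tcs.com'>
<!-- When an element starts, fold in the signature of the element most recently
     started before it (the last element inside its preceding sibling, or failing
     that its parent), never its own signature. The value at any element is
     therefore the signatures of all ancestors and all preceding elements. -->
<accumulator name='tcs:seen' as='xs:string*' initial-value='()'>
<accumulator-rule match='*'
select='$value,
(preceding-sibling::*[1]/descendant-or-self::*[last()], parent::*)[1]
/string(@tcs:signature)' />
</accumulator>
<!-- The mark pass then asks the accumulator instead of searching the whole
     preceding axis. -->
<template match='*' mode='mark'>
<copy>
<if test='not(accumulator-before("tcs:seen") = string(@tcs:signature))'>
<attribute name='tcs:mark' select='"keep"' />
</if>
<copy-of select='@*' />
<apply-templates select='node()' mode='mark' />
</copy>
</template>
</transform>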
References
[Balisage]
Balisage: The Markup Conference,
2012 August 7-10, Hotel Europa, Montreal, Canada,
http://balisage.net/Proceedings/vol8/contents.html
(accessed 2012 August 17).
[Harold]
Harold, Elliotte Rusty.
Paper:
Obscuring XML,
Proceedings of Extreme Markup Languages
2005 August 1-5, Montreal, Canada.
http://conferences.idealliance.org/extreme/html/2005/Harold01/EML2005Harold01.html
Presentation:
Randomizing XML,
http://cafeconleche.org/slides/extreme/randomizer
(accessed 2012 August 20).
[Mad_Libs]
Penguin Group (USA).
Mad Libs,
http://madlibs.com
(Accessed 2012 August 20).
[QA-QC]
International Symposium on Quality Assurance and Quality Control in XML,
2012 August 6, Hotel Europa, Montreal, Canada,
http://balisage.net/QA-QC
(accessed 2012 August 17).
[Tennison]
Tennison, Jeni.
Grouping using the Muenchian Method,
http://jenitennison.com/xslt/grouping/muenchian.html
(accessed 2012 August 17).