How to cite this paper
Nordström, Ari. “Up and Sideways: RTF to XML.” Presented at Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions, Washington, DC, July 31, 2017. In Proceedings of Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). https://doi.org/10.4242/BalisageVol20.Nordstrom01.
Balisage Paper: Up and Sideways
RTF to XML
Ari Nordström
Ari Nordström is a freelance markup
geek, based in Göteborg, Sweden, but offering his services across a number of
borders. He has provided angled brackets and such to a number of organisations
and companies over the years, with LexisNexis UK being the latest. His favourite
XML specification remains XLink, and so quite a few of his frequent talks and
presentations on XML include various aspects of linking.
Ari is the proud owner and head
projectionist of Western Sweden's last functioning 35/70mm cinema, situated in
his garage, which should explain why he once wrote a paper on automating
commercial cinemas using XML.
A conversion of hundreds of Rich Text Format documents to highly structured XML is
always going to be a challenge and a showcase of XML technologies, even if you are
excluded from a number of them. This paper is a case study of one such conversion,
dealing with migrating huge volumes of legal commentary, more specifically the
classic standard text Halsbury's Laws of England, from RTF to
XML so new editions can be authored and published in XML to various paper and online
publication targets.
While describing the migration approach in any detail would probably require a
book-length paper, this paper attempts to highlight some of the challenges and their
solutions.
This paper is about converting huge volumes of Rich Text Format (RTF) legal commentary
to XML. For those of you in the know, this is one of the most painful things an XML
geek will ever experience; it is always about infinite pain and constant regret. RTF is
seen by many as a bug, and for good reason.
On the other hand, the project had its upsides. It is sometimes immensely satisfying
to run a conversion pipeline of several dozens of steps over 104 RTF titles comprising
tens of megabytes each, knowing the process will take hours—sometimes days—and yet
end
up in valid and well-structured XML. IF that happens.
The Sources
The sources are legal documents, so-called commentary. Much
of this text concerns the standard text for legal commentary in England,
Halsbury's Laws of England (see id-halsbury), published
by LexisNexis, but some of the discussion also includes its sister publication for
Scottish lawyers, Stair Memorial Encyclopaedia, also known
simply as STAIR.
Halsbury consists of 104 titles, each divided
into volumes that in turn consist of several physical files. A
listing of the files in a single title might look like this:
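09_Children_10(1-324).xml
09_Children_11(325-634).xml
09_Children_12(635-704).xml
10_Children_13(705-1050).xml
(The names above are constructed examples following the convention; only the third
is taken from the actual sources, in section “Merging Title Files”.)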
Here, the initial number is the volume. It is followed by the name of the title,
an ordinal number for the physical file, and finally the range of volume
paragraphs contained within that particular part. Yes, the filenames
follow a very specific format, necessary to keep the titles apart and enable merging
together the RTF files when publishing them on paper or online in multiple
systems.
Each title covers what is known as a practice area, divided into
volume paragraphs, numbered units much like sections[1], each covering a topic within the area. A topic might look like
the example shown in Figure 1.
The volparas, as they are usually known, are used by lawyers to assist in their
work, ranging from drafting wills and arguing tax law to arguing cases in court.
They suggest precedents, highlight legal interpretations and generally offer
guidance, and as such, are littered with references to relevant caselaw or
legislation, sometimes in footnotes, sometimes inline.
When the legislation changes or when new caselaw emerges—which is often—the
commentary needs to change, too. This is done in several ways over a year: there are
online updates, so-called supplements, which are also edited
and published on paper, commonly known as looseleafs[2] or noterups.
The terminology is more complicated than the actual concept. A volume paragraph
that changes gets a supplement, added below the main text body of the para. For
example, the volume paragraph shown above gained a supplement of its own.
The supplements amend the original text, add new references to caselaw and
legislation, and sometimes delete content that is no longer applicable or correct.
Sometimes, the changes are big enough to result in the addition of new volume
paragraphs. These new volume paragraphs inherit the parent vol para's number followed
by a letter, very much in line with the looseleaf way of thinking.
Called A paras, they are published online and in the looseleaf
supplements on paper[3]. And once a year, the titles are edited to include the supplemental
information. The A paras are renumbered and made into ordinary vol
paras, and a new year of new supplements begins.
So?
The commentary titles have been produced from the RTF sources for decades, first
to paper and later to paper and several online systems, with increasingly clever—and
convoluted, and error-prone—publishing macros, each new requirement resulting in
further complications. Somewhere along the line, it was decided to migrate the
commentary, along with huge amounts of other documentation, to XML.
Some of the company's content has been authored in XML for years, with new content
constantly migrated to XML from various sources. The setup is what I'd label as
highly distributed, with no central source or point of origin, just an increasing
number of satellite systems. Similarly, there are a number of target publishing
systems.
Requirements
LexisNexis, of course, have been publishing from a number of formats for years. XML,
therefore, is not in any way new for them. The requirements, then, were surprisingly
clear:
The target schema is an established, proprietary XML DTD controlled by
LexisNexis.
The target system is a customisation on top of an established, proprietary
CMS, Contenta.
As we've seen, the source titles consist of multiple files. The target XML, on
the other hand, needs to be one single file per commentary title. There were a
number of reasons for this, with perhaps the most important being that the
target CMS has a chunking solution of its own, one with sizes and composition
that greatly differs from the RTF files[4].
As the number of sources is huge and the conversion project was expected to
take a significant amount of time and effort, a roundtrip back to RTF was
required for the duration of the project[5]. An existing XML to RTF conversion was already in place but is in
the process of being extended to handle the new content.
An all-important requirement was a substantial QA on all aspects of the content, from
the obvious is everything there?[6] to did the upconversion produce the desired semantics? and
beyond. This implies:
A pipelined conversion comprising multiple conversion steps, isolating
concerns and so being able to focus on isolated tasks per step.
Testing the pipeline, both for individual steps and for making sure that the
input matched the output, sometimes dozens of conversion steps later.
Validation of the output. DTD validation, obviously, but also Schematron
validation, both for development use and for highlighting possible problems to
the subject matter experts.
Generated HTML files listing possible issues. Here, footnotes provide a good
example as the source RTF markup was sometimes poor, resulting in
orphaned footnotes, that is, footnotes lacking a reference
or footnote references lacking a target.
And, of course, manual reviews of a conversion of a representative subset,
both by technical and legal experts, frequently aided by the above validation
reports.
Pipeline
Thankfully, rather than having to write an RTF parser from scratch, commercial
software is available to convert RTF to a structured format better suited for further
conversion, namely WordML. LexisNexis have been using Aspose.Words
for past conversions, so using it was a given. Aspose was run using Ant macros, with
the
Ant script also in charge of the pipeline that followed.
The basic idea is this:
Convert RTF to WordML.
Convert WordML to flat XHTML5.
Note
As RTF and WordML are both essentially event-based formats where any
structure is implied, this is replicated in an XHTML5 consisting of
p elements with an attribute stating the name of
the original RTF style.
Use a number of subsequent upconversion steps to produce a more structured
version of the XHTML5, for example by adding nested section
elements as implied by the RTF style names that identify headings, and so
on.
With a sufficiently enriched XHTML5, add a number of steps that first convert
the XHTML5 to the target XML format and then enrich it, until done.
A recent addition was the realisation that some of the titles contain equations,
resulting in several further steps. See section “Equations”.
Pipeline Mechanics
The pipeline consists of a series of XSLT stylesheets, each transforming a
specific subset of the document; one step might convert inline elements while
another wraps list items into list elements. The XSLTs are run by an
XProc script (see id-nicg-xproc-tools) that determines which XSLTs to run and in which
order by reading a manifest file:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../../../../Content%20Development%20Tools/DEV/DataModelling/Physical/Schemata/RelaxNG/production/pipelines/manifest.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<manifest
xmlns="http://www.corbas.co.uk/ns/transforms/data"
xml:id="migration.p1.p2"
description="migration.p1.p2"
xml:base="."
version="1.0">
<group
xml:id="p12p2.conversion"
description="p12p2.conversion"
xml:base="."
enabled="true">
<item
href="p2_structure.xsl"
description="Do some basic structural stuff"/>
<item
href="p2_orphan-supps.xsl"
description="Handle orphaned supps"/>
<item
href="p2_trintro.xsl"
description="Handle tr:intros"/>
<item
href="p2_volbreaks.xsl"
description="Generate HALS volume break PIs"/>
<item
href="p2_para-grp.xsl"
description="Produce vol paras and supp paras"/>
<item
href="p2_blockpara.xsl"
description="Add display attrs to supp blockparasw.
Add print-only supp blockparas."/>
<item
href="p2_ftnotes.xsl"
description="Move footnotes inline"/>
<item
href="p2_orphan-ftnotes.xsl"
description="Convert orphaned footnotes in supps to
paras starting with the footnote label"/>
<item
href="p2_removecaseinfo.xsl"
description="Remove metadata in case refs"/>
<item
href="p2_xpp-pi.xsl"
description="Generate XPP PIs"/>
<item
href="p2_xref-cleanup.xsl"
description="Removes leading and trailing whitespace from xrefs"/>
<item
href="p2_cleanup.xsl"
description="Clean up the XML, including namespaces"/>
</group>
</manifest>
Each step can also save its output in a debug folder, which is extremely useful
when debugging[7].
The above pipeline is relatively short, as it transforms an intermediate XML
format to the target XML format. The main pipeline for converting Halsbury's
Laws of England RTF to XML (the aforementioned intermediate XML
format) currently contains 39 steps.
The XProc is run using a configurable Ant build script[8] that also runs the initial Aspose RTF to WordML conversion, validates
the results against the DTD and any Schematrons, and runs the XSpec descriptions
testing the pipeline steps, among other things.
The pipeline code, including the XProc and its auxiliary XSLTs and manifest file
schema, is based on Nic Gibson's XProc Tools (see id-nicg-xproc-tools) but
customised over time to fit the evolving conversion requirements at
LexisNexis.
Note on ID Transforms
Any pipeline that wishes to only change a subset of the input will have to carry
over anything outside that subset unchanged so a later step can then take care of
the unchanged content at an appropriate time. This transform, known as the identity,
or ID, transform, will copy over anything not in scope:
This simple design pattern, used by every step in the pipeline, makes it very easy
to focus on specific tasks, from adding a single attribute (such as the example
above) to handling inline semantics.
The Fun Stuff
From a markup geek point of view, the conversion is actually a fascinating mix of
methods and tools, the horrors of the RTF format notwithstanding. This section attempts
to highlight some of the more notable ones.
Merging Title Files
The many RTF files comprising the volumes that in turn comprised a single title
needed to be converted and merged (stitched together) into a single
output XML file. The earlier publishing system had Word macros do this, but running
the macro was error-prone and half manual work; it was unsuitable for an automated
batch conversion of the entire set of commentary titles.
Instead, this approach emerged:
Convert all of the individual RTFs to matching raw XHTML files where the
actual content was all p and div elements inside
the XHTML body element.
Stitch together the files per commentary title[9] by adding together the contents of the XHTML body
elements into one big file.
Merging together files per title would have been far more difficult
without a filename convention used by the editors (also see the listing in
section “The Sources”):
09_Children_12(635-704).xml
This was expressed in a regular expression[10] (actually three, owing to how the file stitcher works):
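A sketch of one such pattern, reconstructed from the convention (the actual three
patterns differed in details):

^([0-9]+)_(.+?)_([0-9]+)\(([0-9]+[A-Z]?)-([0-9]+[A-Z]?)\)\.xml$

The groups capture the volume number, the title name, the physical file ordinal,
and the first and last volume paragraph numbers, with the optional trailing letter
accommodating the A paras.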
An XProc pipeline listed all the source XHTML in a folder and any
subfolders, and called an XSLT that did the actual work. The XSLT grouped the files
per title, naming each title according to an agreed-upon set of conventions,
merged each title's contents, saved the merged file in a secondary output, and
fed back a list of the original files, which were then deleted, leaving behind
the merged XHTML.
Implicit to Explicit Structure
The raw XHTML produced by the first step from the WordML is a lot like the RTF;
whatever structure there is, is implicit. Every block-level component is actually
a
p element, with the RTF style given in
data-lexisnexis-word-style attributes. Here, for example, is a
level two section heading followed by a volume paragraph with a heading, some
paragraphs and list items:
<p data-lexisnexis-word-style="vol-H2">
<span class="bold">(1) THE BENEFITS</span>
</p>
<p data-lexisnexis-word-style="vol-PH">
<span class="bold">1. The benefits.</span>
</p>
<p data-lexisnexis-word-style="vol-Para">Following a review of the social security benefits
system<sup>1</sup>, the government introduced universal credit, a new single payment for
persons looking for work or on a low income<sup>2</sup>.</p>
<p data-lexisnexis-word-style="vol-Para">Universal credit is being phased in<sup>3</sup> and
will replace income-based jobseeker’s allowance<sup>4</sup>, income-related employment and
support allowance<sup>5</sup>, income support<sup>6</sup>, housing benefit<sup>7</sup>,
child tax credit and working tax credits<sup>8</sup>.</p>
<p data-lexisnexis-word-style="vol-Para">Council tax benefit has been abolished and replaced by
council tax reduction schemes<sup>9</sup>.</p>
<p data-lexisnexis-word-style="vol-Para">In this title, welfare benefits are considered under
the following headings:</p>
<p data-lexisnexis-word-style="vol-L1">(1)<span class="tab"/>entitlement to universal
credit<sup>10</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(2)<span class="tab"/>claimant responsibilities,
including work related requirements<sup>11</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(3)<span class="tab"/>non-contributory benefits,
including carer’s allowance, personal independence payment, disability living allowance,
attendance allowance, guardian’s allowance, child benefit, industrial injuries benefit, the
social fund, state pension credit, age-related payments and income related benefits that are
to be abolished<sup>12</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(4)<span class="tab"/>contributions<sup>13</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(5)<span class="tab"/>contributory benefits, including
jobseeker’s allowance, employment and support allowance, incapacity benefit, state maternity
allowance and bereavement payments<sup>14</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(6)<span class="tab"/>state retirement
pensions<sup>15</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(7)<span class="tab"/>administration<sup>16</sup>;
and</p>
<p data-lexisnexis-word-style="vol-L1">(8)<span class="tab"/>European law<sup>17</sup>.</p>
<p data-lexisnexis-word-style="vol-PH">
<span class="bold">2. Overhaul of benefits.</span>
</p>
<p data-lexisnexis-word-style="vol-Para">In July 2010 the government published its consultation
paper <span class="italic">21</span>st<span class="italic"> Century Welfare</span> setting
out problems of poor work incentives and complexity in the existing benefits and tax credits
systems<sup>1</sup>. The paper considered the following five options for reform: (1)
universal credit<sup>2</sup>; (2) a single unified taper<sup>3</sup>; (3) a single working
age benefit<sup>4</sup>; (4) the Mirrlees model<sup>5</sup>; and (5) a single
benefit/negative income tax model<sup>6</sup>.</p>
The implied structure (a level two section containing a volume paragraph that in
turn contains a heading, a few paragraphs and a list) is made explicit using a
series of steps.
Inline Spans
RTF, as mentioned earlier, is a non-enforceable, event-based, flat format. It
lists things to do with the content in the order in which the instructions appear,
with little regard to any structure, implied or otherwise. The instructions happen
when the author inserts a style, either where the marker is or on a selected range
of text. This can be done as often as desired, of course, and will simply add to
existing RTF style instructions, which means that an instruction such as use
bold might be applied multiple times on the same, or mostly the same,
content.
The resulting raw XHTML converted from WordML might then look like this
(indentation added for clarity):
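A reconstruction of the kind of fragmented, nested markup this produces:

<p data-lexisnexis-word-style="vol-PH">
    <span class="bold">2. </span>
    <span class="bold"><span class="bold">Opening a </span></span>
    <span class="bold">childcare</span>
    <span class="bold"><span class="bold"> account</span></span>
</p>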
Simply mapping and converting this to a target XML format will not result in what
was intended (i.e. <core:para><core:emph>2. Opening a childcare
account</core:emph></core:para>) but instead a huge mess, so
cleanup steps are required before the actual conversion, merging spans, eliminating
nested spans, etc.
With just one intended semantics, such as mapping bold to an emphasis tag, the
cleanup can be relatively uncomplicated. When more than one style is present in the
sources, however[11], the raw XHTML is anything but straightforward. Heading labels (see
section “Labels in Headings, List Items, and Footnotes”),
cross-references and case citations (see section “Cross-references and Citations”) all have
problems in part caused by the inline span elements.
Labels in Headings, List Items, and Footnotes
The span elements cause havoc in headings and any kind of ordered
list, as the heading and list item labels use many different types of numbering in
legal commentary. A volpara sometimes includes half a dozen ordered lists, each of
which must use a different type of label (numbered, lower alpha, upper alpha, lower
roman, ...) so the items can be referenced later without risking confusion.
Here, for example, is a level one list item using small caps alphanumeric:
<p data-lexisnexis-word-style="vol-L1">(<span class="smallcaps">a</span>)<span class="tab"/>the
allowable losses accruing to the transferor are set off against the chargeable gains so accruing
and the transfer is treated as giving rise to a single chargeable gain equal to the aggregate of
the gains less the aggregate of the losses<sup>22</sup>;</p>
Note the tab character, mapped to a span[@class='tab'] element in the
XHTML, separating the label from the list contents, but also the parentheses
wrapping the smallcaps span. The code used to extract the list item contents, determine the
list type used, and extract the labels must take into account a number of
variations.
The source RTF list items all follow the same pattern: a list item label followed
by a tab and the item contents. In the XHTML, the result is this:
<p data-lexisnexis-word-style="vol-L1">(1)<span class="tab"/>protecting plants or wood or other
plant products from harmful organisms<sup>8</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(2)<span class="tab"/>regulating the growth of
plants<sup>9</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(3)<span class="tab"/>giving protection against harmful
creatures<sup>10</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(4)<span class="tab"/>rendering such creatures
harmless<sup>11</sup>;</p>
<p data-lexisnexis-word-style="vol-L1">(5)<span class="tab"/>controlling organisms with harmful
or unwanted effects on water systems, buildings or other structures, or on manufactured
products<sup>12</sup>; and</p>
Footnotes use a similar construct, separating the label from the contents with a
tab character. In both cases, the XSLT essentially attempts to determine the type of list by
analysing the content before the span[@class="tab"] to create a list item element with the list type
information placed in @type, and then includes everything
after the span as list item contents:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:when
test="@data-lexisnexis-word-style=('L1', 'vol-L1', 'vol-L1CL', 'vol-L1P', 'sup-L1', 'sup-L1CL')">
<!-- Note that the if test is needed to parse lists where the number and tab are in italics or similar -->
<!-- the span must be non-empty since the editors sometimes use a new list item but then remove the
numbering and leave the tab (span class=tab) to make it look as if it was part of the immediately
preceding list item -->
<xsl:element
name="core:listitem">
<xsl:attribute
name="type">
<xsl:choose>
<xsl:when
test="span[1][@class='smallcaps' and
matches(.,'\(?[a-z]+\)?')]">
<xsl:analyze-string
select="span[1]"
regex="^(\(?[a-z]+\)?)$">
<xsl:matching-substring>
<xsl:choose>
<xsl:when
test="regex-group(1)!=''">upper-alpha</xsl:when>
</xsl:choose>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:when>
<xsl:otherwise>
<xsl:analyze-string
select="if (node()[1][self::span and .!=''])
then (span[1]/text()[1])
else (text()[1])"
regex="^(\(([0-9]+)\)[\s]?)|
(\(([ivx]+)\)?[\s]?)|
(\(([A-Z]+)\))|
(\(([a-z]+)\))$">
<xsl:matching-substring>
<xsl:choose>
<xsl:when
test="regex-group(1)!=''">number</xsl:when>
<xsl:when
test="regex-group(3)!=''">lower-roman</xsl:when>
<xsl:when
test="regex-group(5)!=''">upper-alpha</xsl:when>
<xsl:when
test="regex-group(7)!=''">lower-alpha</xsl:when>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of
select="'plain'"/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:otherwise>
</xsl:choose>
</xsl:attribute>
<xsl:element name="core:para">
<xsl:copy-of
select="@*"/>
<!-- This does not remove the numbering of list items where the numbers
are inside spans (for example, in italics); that we handle later -->
<xsl:apply-templates
select="node()[not(following-sibling::span[@class='tab'])]"
mode="KEPLER_STRUCTURE"/>
</xsl:element>
</xsl:element>
</xsl:when>
The xsl:choose handles two cases. The first handles a case where the
list item label was in a small caps RTF style (here translated to
span[@class="smallcaps"] in a previous step), the second deals with
all remaining types of list item labels. The key in both cases is a regular
expression that relies on the original author writing a list item in the same way,
every time[12]. I've added line breaks in the above example to make the regex easier to
read; essentially, the different cases simply replicate the allowed list
types.
The overall quality of the RTF (list and footnote) sources was surprisingly good,
but since the labels were manually entered, this would sometimes break the
conversion.
Headings are somewhat different. In a level four heading,
there is no tab character separating the label from the heading contents, so we
are relying on whitespace rather than a mapped span element to separate
the label and the heading contents from each other. The basic heading label
recognition mechanism still relies on pattern-matching the label, however. The
difficulties here would usually involve the editor using a bold or smallcaps RTF
style to select the label, but accidentally marking up the space that followed as
well, necessitating extra handling of stray whitespace in the pattern matching.
Note
In these headings, the contents are in lower case only. The RTF vol-H4
style automatically provided the small caps formatting, so editors would simply
enter the text without bothering to use title caps. This resulted in a
conversion step that, given an input string, would convert that string to
heading caps, leaving prepositions in lower case and
adding all caps to a predefined list of keywords such as UK or
EU.
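A minimal sketch of such a step, written as an XSLT function (the function name
and its local namespace are mine, and both word lists are abbreviated
placeholders; the real lists were longer):

<xsl:function name="local:heading-caps" as="xs:string">
    <xsl:param name="str" as="xs:string"/>
    <!-- Words kept in lower case unless they start the heading -->
    <xsl:variable name="prepositions"
        select="('a', 'an', 'and', 'as', 'at', 'by', 'for', 'in', 'of', 'on', 'or', 'the', 'to', 'with')"/>
    <!-- Keywords always rendered in all caps -->
    <xsl:variable name="keywords" select="('UK', 'EU')"/>
    <xsl:value-of separator=" ">
        <xsl:for-each select="tokenize($str, '\s+')">
            <xsl:choose>
                <xsl:when test="upper-case(.) = $keywords">
                    <xsl:value-of select="upper-case(.)"/>
                </xsl:when>
                <xsl:when test="position() gt 1 and lower-case(.) = $prepositions">
                    <xsl:value-of select="lower-case(.)"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of
                        select="concat(upper-case(substring(., 1, 1)), substring(., 2))"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each>
    </xsl:value-of>
</xsl:function>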
The code to identify list item, footnote, and heading labels evolved over time,
recognising most variations in RTF style usage, but nevertheless, some problems were
only spotted in the QA that followed (see section “QA”).
Wrapping List Items in Lists
List items in RTF have no structure, of course. They are merely paragraphs with
style instructions that make them look like lists by adding a label before the
actual contents, separating the two with a tab character as seen in the previous
section.
That step does not wrap the list items together; it merely identifies the list
types and constructs list item elements. A later step adds list wrapper elements by
using xsl:for-each-group instructions such as this:
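A sketch of the grouping (the style list here is taken from the list item step
shown earlier; the real step listed more styles):

<xsl:for-each-group select="*"
    group-adjacent="boolean(self::core:listitem[core:para/@data-lexisnexis-word-style =
        ('L1', 'vol-L1', 'vol-L1CL', 'vol-L1P', 'sup-L1', 'sup-L1CL')])">
    <xsl:choose>
        <!-- Adjacent list items are wrapped in a single list element -->
        <xsl:when test="current-grouping-key()">
            <core:list type="{current-group()[1]/@type}">
                <xsl:apply-templates select="current-group()"/>
            </core:list>
        </xsl:when>
        <!-- Anything else is passed through untouched -->
        <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:for-each-group>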
Note the many different RTF styles taken into account; these do not all do
different things but are actually duplicates or near duplicates, the result of the
non-enforceable nature of RTF. Also note the boolean() expression in
@group-adjacent. The expression checks for matching attribute
values in the children of the list items, as these will still
have the style information from the RTFs.
The xsl:for-each-group instruction is frequently used in the pipeline
steps as it is perfect when grouping a flat content model to make any implied
hierarchies in it explicit.
Volume Paragraphs
The volume paragraphs provide another implicit section grouping. They are
essentially a series of block-level elements that always start with a numbered title
(see Figure 1). The raw XHTML looks something like this:
<p data-lexisnexis-word-style="vol-PH">
<span class="bold">104. Claimants required to participate in an interview.</span>
</p>
<p data-lexisnexis-word-style="vol-Para">...</p>
<p data-lexisnexis-word-style="vol-L1">...</p>
<p data-lexisnexis-word-style="vol-L1">...</p>
<p data-lexisnexis-word-style="vol-L1">...</p>
<p data-lexisnexis-word-style="vol-Para">....</p>
<p data-lexisnexis-word-style="sup-PH">
<span class="bold">104 </span>
<span class="bold">Claimants required to participate in an interview</span>
</p>
<p data-lexisnexis-word-style="sup-Para">...</p>
Using the kind of upconversions outlined above, the result is a reasonably
structured sequence of block-level elements:
<core:para edpnum-start="104">
<core:emph typestyle="bf">Claimants required to participate in an interview.</core:emph>
</core:para>
<core:para>...</core:para>
<core:list type="number">
<core:listitem type="number">
<core:para data-lexisnexis-word-style="vol-L1">...</core:para>
</core:listitem>
<core:listitem type="number">
...
</core:listitem>
...
</core:list>
<core:para>...</core:para>
With longer volume paragraphs, frequently with supplements added, processing them
becomes difficult and unwieldy.
We added semantics to the DTD to make later publishing and processing easier,
wrapping the volume paragraphs and the supplements inside them:
<core:para-grp>
<core:desig value="104">104.</core:desig>
<core:title>Claimants required to participate in an interview.</core:title>
<core:para>...</core:para>
<core:list type="number">
<core:listitem>
<core:para>...</core:para>
</core:listitem>
<core:listitem>
...
</core:listitem>
...
</core:list>
<core:para>...</core:para>
<su:supp pub="supp">
<core:no-title/>
<su:body>
<su:para-grp>
<core:desig value="104">104</core:desig>
<core:title>Claimants required to participate in an interview</core:title>
<core:para>...</core:para>
</su:para-grp>
</su:body>
</su:supp>
</core:para-grp>
This was achieved using a two-stage transform where the first template, matching
volume paragraph headings (para[@edpnum-start] elements) only, would
add content along the following-sibling axis until (but not including)
the next volume paragraph heading[13]:
<!-- Common template for following-sibling axis -->
<xsl:template name="following-sibling-blocks">
<xsl:param name="num"/>
<xsl:apply-templates
select="following-sibling::*[(local-name(.)='para' or
local-name(.)='list' or
local-name(.)='blockquote' or
local-name(.)='figure' or
local-name(.)='comment' or
local-name(.)='legislation' or
local-name(.)='endnotes' or
local-name(.)='supp' or
local-name(.)='generic-hd' or
local-name(.)='q-a' or
local-name(.)='digest-grp' or
local-name(.)='form' or
local-name(.)='address' or
local-name(.)='table' or
local-name(.)='block-wrapper') and
not(@edpnum-start) and
preceding-sibling::core:para[@edpnum-start][1][@edpnum-start=$num]]"
mode="P2_INSIDE_PARA-GRP"/>
</xsl:template>
This, of course, created duplicates of every block-level sibling in what
essentially is a top-down transform, so a second pattern was needed to eliminate the
duplicates in a matching child axis template:
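Something like this (a sketch; the actual match pattern listed the full set of
block-level element names, as in the template above):

<!-- Suppress blocks at their original position; they have already been
     copied into the preceding volume paragraph's para-grp -->
<xsl:template
    match="*[not(@edpnum-start) and
           preceding-sibling::core:para[@edpnum-start]]"/>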
The supplements were enriched using a similar pattern, including along the
following-sibling axis and deleting the resulting duplicates along
the descendant axis.
Cross-references and Citations
Perhaps the most significant case of upconversion came with cross-references and
citations (to statutes, cases, and so on).
Cross-references
A cross-reference would always be manually entered in the RTF sources[14].
The cross-reference is the keyword para followed by
a (volume paragraph) number. The problem here is that the only identifiable
component was the para (or paras, in case of multiple volume paragraph
references) keyword:
As to the meaning of allowable losses
see <span class="smallcaps">para</span> 609.
In some cases, the editor had used the small caps style on the number in
addition to the keyword, causing additional complications.
The reference might also be to a combined list of numbers and ranges of
numbers.
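A sketch of the construction (the element names follow the result example below;
the logic is simplified and omits, for instance, the list and range handling):

<xsl:template
    match="*:span[@class='smallcaps'][matches(., '^para[s]?[\s]*$')]"
    mode="KEPLER_CONSTRUCT-REFS">
    <core:emph typestyle="smcaps"><xsl:value-of select="."/></core:emph>
    <xsl:variable name="ref" select="following-sibling::node()[1][self::text()]"/>
    <xsl:choose>
        <!-- The reference follows directly after the keyword -->
        <xsl:when test="matches($ref, '^\s*[0-9]')">
            <!-- The letters after the numbers accommodate the A paras -->
            <xsl:analyze-string select="$ref" regex="[0-9]+[A-Z]*">
                <xsl:matching-substring>
                    <lnci:cite type="paragraph-ref">
                        <lnci:book>
                            <lnci:bookref>
                                <lnci:paragraph num="{.}"/>
                            </lnci:bookref>
                        </lnci:book>
                        <lnci:content><xsl:value-of select="."/></lnci:content>
                    </lnci:cite>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
                    <xsl:value-of select="."/>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$ref"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>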
The xsl:when shown here covers the case where the reference
follows after the keyword.
The regular expression includes letters after the numbers to accommodate
the so-called A paras.
This needed to be combined with a kill template for the same
text node but on a descendant axis. In other words, something like this:
<xsl:template
match="node()[self::text() and
preceding-sibling::*[1][self::*:span and @class='smallcaps' and
matches(.,'^para[s]?[\s]*$')]]"
mode="KEPLER_CONSTRUCT-REFS"/>
The end result would be something like this (indentation added for
readability):
see the Taxation of Chargeable Gains Act 1992 s 21(1); and
<core:emph typestyle="smcaps">para</core:emph>
<lnci:cite type="paragraph-ref">
<lnci:book>
<lnci:bookref>
<lnci:paragraph num="613"/>
</lnci:bookref>
</lnci:book>
<lnci:content>613</lnci:content>
</lnci:cite>.
If the reference was given to a list, each list item would be tagged in a
separate lnci:cite element, while a range would instead add a
lastnum attribute to the lnci:cite.
The following-sibling axis to match content, paired with a
descendant axis to delete duplicates is, as we have seen,
frequently used in the pipeline.
In some cases, the cross-reference would point to a volume paragraph in a
different title. Here, we'd have the target title name styled in an
*xtitle RTF style, followed by
text-only volume number information, the para keyword, and
the target volume paragraph number. This was handled much like the above, the
differences being an additional step to match the title and yet another to
combine the title with the cross-reference markup.
Citations
Halsbury's Laws of England contain huge numbers of
citations, but very few of them have any kind of RTF styling and were thus
mostly unidentifiable in the conversion. Instead, they will be handled later,
when the XML is uploaded into the target CMS, by using a cite pattern-matching
tool developed specifically for the purpose.
The sister publication for Scotland, on the other hand, had plenty of case
citations. In these, the case name uses the
*case RTF style, while the actual formal citation
uses the RTF style *citation.
The citation markup we want looks like this:
<lnci:cite>
<lnci:case>
<lnci:caseinfo>
<lnci:casename>
<lnci:text
txt="Bushell v Faith"/>
</lnci:casename>
</lnci:caseinfo>
<lnci:caseref
normcite="[1970] AC 1099[1970]1All ER 53, HL"
spanref="spd93039e7444"/>
</lnci:case>
<lnci:content>
<core:emph typestyle="it">Bushell v Faith</core:emph>
<lnci:span
spanid="spd93039e7444"
statuscode="citation">[1970] AC 1099,
[1970], 1, All ER 53, HL</lnci:span></lnci:content>
</lnci:cite>
Essentially, the citation consists of two parts, one formal part where the
machine-readable citation (in the normcite attribute) lives, along
with the case name, and another, referenced by the formal part (the
spanref/spanid is an ID/IDREF pair, in case you
didn't spot it), where the content visible to the end user lives.
My first approach was to convert the casename and citation parts in one step,
then merge the two and add the wrapper markup when done in another.
Unfortunately, there were several problems:
Neither the casename nor the citation was always present. Sometimes, a
case would be referred to only by its citation. Sometimes, a previously
referred case would be referred to again using only its name.
Multiple case citations might occur in a single paragraph, sometimes
in a single sentence.
Sometimes, there would be other markup between the
casename and its matching citation.
As the RTF style application was done manually, there were plenty of
edge cases where not all of the name or citation had been selected and
marked up. In quite a few, the unmarked text was then selected and
marked up separately, resulting in additional span elements
in the raw XHTML.
This resulted in the citation construction being divided into three separate
steps, beginning with a cleanup to find and merge span elements, a
second to handle the casenames and citations, and a third to construct the
wrapper markup with the two citation parts and the ID/IDREF pairs.
This sounds simple enough, but consider the following: In a paragraph containing
multiple citations, how does one know what span belongs to what
citation? How does one express that in an XSLT template? Here is a relatively
simple one:
<span data-lexisnexis-word-style="case">Secretary of State for Business,
Enterprise and Regulatory Reform v UK Bankruptcy Ltd</span>
<core:emph typestyle="it"> </core:emph>
<span data-lexisnexis-word-style="citation">[2010] CSIH 80</span>,
<span data-lexisnexis-word-style="citation">2011 SC 115</span>,
<span data-lexisnexis-word-style="citation">2010 SCLR 801</span>,
<span data-lexisnexis-word-style="citation">2010 SLT 1242</span>,
<span data-lexisnexis-word-style="citation">[2011] BCC 568</span>.
Do all citations belong to the same casename? Only the first? Here is another
one (note that it's all in a single sentence):
<span data-lexisnexis-word-style="case">Bushell v Faith</span>
<span data-lexisnexis-word-style="citation">[1970] AC 1099</span>,
<span data-lexisnexis-word-style="citation">[1970] </span>
<span data-lexisnexis-word-style="citation">1 </span>
<span data-lexisnexis-word-style="citation">All ER 53, HL</span>;
<span data-lexisnexis-word-style="case">Cumbrian Newspapers Group Ltd v
Cumberland and Westmorland Newspapers and Printing Co Ltd</span>
<span data-lexisnexis-word-style="citation">[1987] Ch 1</span>,
<span data-lexisnexis-word-style="citation">[1986] 2 All ER 816</span>.
Note the fragmentation of spans and the comma and semicolon separators,
respectively. When looking ahead along the following-sibling axis,
how far should we look? Would the semicolon be a good separator? The
comma?
The decision was a combination of asking the editors to update some of the
more ambiguous RTF citations and a relatively conservative approach where
situations like the above resulted in multiple case citation markup. There was
no way to programmatically make sure that a preceding case name is actually part
of the same citation.
Symbols
An unexpected problem was with missing characters: en dashes (U+2013) and em
dashes (U+2014) would mysteriously disappear in the conversion. After looking at the
debug output of the early steps, I realised that the characters were actually
symbols, inserted using Insert Symbol in Microsoft Word. In
WordML, the symbols were mapped to w:sym elements, but these were then
discarded.
When looking at the extent of the problem, it turned out that the affected
documents were all old, meaning an older version of Microsoft Word and implying that
the problem with symbols was fixed in later versions, which inserted Unicode
characters rather than (presumably) CP1252 characters. Furthermore, only two symbols
were used from the symbol map, the en and em dashes. This fixed the problem:
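The fix, sketched here, was a pair of templates run over the WordML (the
w:char values are assumptions based on the CP1252 code points for the two
dashes; the actual step matched whatever the old Word versions had emitted):

<!-- Assumes xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" -->
<xsl:template match="w:sym[@w:char = ('96', 'F096')]">
    <!-- en dash -->
    <xsl:text>&#x2013;</xsl:text>
</xsl:template>

<xsl:template match="w:sym[@w:char = ('97', 'F097')]">
    <!-- em dash -->
    <xsl:text>&#x2014;</xsl:text>
</xsl:template>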
Equations
A very recent issue, two weeks old as I write this, is the fact that a few of the
titles contain equations created in Microsoft Equation 3.0. The
equations would quietly disappear during our test conversions without me noticing,
until one of the editors had the good sense to check. What happened was that Aspose
converted the equations to uuencoded and gzipped Windows Meta Files and embedded them
in binary object elements that were then discarded.
Unfortunately for me, the requirements extended beyond simply treating the equations
as images, which required me to rethink the process. What I'm doing now is this:
Add placeholder processing instructions in an early step to mark where to
(re-)insert equations later. Finish converting the title to XML.
Convert the RTF to LaTeX. It turns out that there are quite a few
open-source converters available, including some that handle Microsoft
Equation 3.0. What I've decided on for now is
rtf2latex2e (see id-rtf2latex), as
it is very simple to run from an Ant script and provides reasonable-looking
TeX, meaning that the equations are handled. The process can also be
customised, mapping RTF styles to LaTeX macros, so some of the hidden styles
I need in order to identify title metadata are kept intact.
Convert the LaTeX to XHTML+MathML. Again, it turns out that there are
quite a few options available. I chose a converter called
TtM (see id-ttm). It
produces some very basic and very ugly XHTML, but the equations are pure
presentation MathML.
Extract the equations per title, in document order, and reinsert them in
the converted XML titles where the PIs are located.
This process is surprisingly uncomplicated and very fast. There are a few niggles
as I write this, most to do with the fact that I need to stitch together the
XHTML+MathML result files to match the converted XML, but I expect to have completed
the work within days.
QA
With a conversion as big as the Halsbury titles migration,
quality assurance is vital, both when developing the pipeline steps and after running
them. Here are some of the more important QA steps taken:
Most of the individual XSLT steps were developed using XSpec tests for unit
testing to make sure that the templates did what they were supposed to.
We also used XSpec tests to validate the content for key steps in the
pipeline. Typically, an XSpec test might perform node counts before and after a
certain step, making sure that nothing was being systematically lost (see the
sketch after this list).
Headings, list items and footnotes were particularly prone to problems, as the
initial identification of content as being a labelled content type rather than,
say, an ordinary paragraph relied solely on pattern matching (see section “Labels in Headings, List Items, and Footnotes”). A failed
list item conversion would usually result in an ordinary paragraph (a
core:para element) with a processing attribute
(@data-lexisnexis-word-style) attached, hinting at where the
problem occurred and what the nature of the problem was (the contents of the
processing attribute giving the name of the original RTF style).
Obviously, DTD validation was part of the final QA.
The resulting XML was also validated against Schematron rules, some of which
were intended for developers and others for the subject matter experts going
through the converted material. For example, a number of the rules highlighted
possible issues with citations and cross-references, due to the many possible
problems the pipeline might encounter because of source issues (see section “Cross-references and Citations”).
Other Schematron rules provided sanity checks, for example, that heading and
list item labels were in sequence and were being extracted correctly.
Some conversion steps were particularly error-prone because of the many
variations in the sources, so these steps included debug information inside XML
comments. These were then used to generate reports for the SME review.
Footnotes, for example, would sometimes have a broken footnote reference due
to a missing target or a wrongly applied superscript style or simply the wrong
number. All these cases would generate a debug comment that would then be
included in an HTML report to the SME review.
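As an illustration of the node-count checks mentioned earlier, a minimal XSpec
scenario might look like this (the file names, the core namespace URI, and the
counted element are hypothetical):

<x:description xmlns:x="http://www.jenitennison.com/xslt/xspec"
    xmlns:core="urn:example:core"
    stylesheet="p2_ftnotes.xsl">
    <x:scenario label="Moving footnotes inline loses no footnotes">
        <x:context href="debug/07_p2_blockpara.xml"/>
        <x:expect label="footnote count is unchanged"
            test="count($x:result//core:footnote) =
                  count(doc('debug/07_p2_blockpara.xml')//core:footnote)"/>
    </x:scenario>
</x:description>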
XSpec for Pipeline Transformations
XSpec, of course, is a testing framework for single
transformations, meaning one XSLT applied to input producing output,
not a testing framework for testing a pipeline comprising several XSLTs with
multiple inputs and outputs. Our early XSpec scenarios were therefore used for
developing the individual steps, not for comparing pipeline input and output, which
initially severely limited the usability of the framework in our
transformations.
To overcome this limitation, I wrote a series of XSLT transforms and Ant macros to
define a way to use XSpecs on a pipeline. While still not directly comparing
pipeline input and output, the Ant macro, run-xspecs, accepted an XSpec
manifest file (compare this to the XSLT manifest briefly described in section “Pipeline Mechanics”)
that declared on which pipeline steps to apply which XSpec tests and produce
concatenated reports. Here's a short XSpec manifest file:
<?xml version="1.0" encoding="UTF-8"?>
<tests
xmlns="http://www.sgmlguru.org/ns/xproc/steps"
manifest="xslt/manifest-stair-p1-to-p2.xml"
xml:base="file:/c:/Users/nordstax/repos/ca-hsd/stair">
<!-- Use paths relative to /tests/@xml:base for pipeline manifest, XSLT and XSpec -->
<test
xslt="xslt/p2_structure.xsl"
xspec="xspec/p2_structure.xspec"
focus="batch"/>
<test
xslt="xslt/p2_para-grp.xsl"
xspec="xspec/p2_para-grp.xspec"
focus="batch"/>
<test
xslt="xslt/p2_ftnotes.xsl"
xspec="xspec/p2_ftnotes.xspec"
focus="batch"/>
</tests>
The assumption here is that the pipeline produces debug step output (see section “Pipeline Mechanics”) so
the XSpec tests can be applied to the step debug inputs/outputs. The
run-xspecs macro includes a helper XSLT that takes the basic XSpecs
(three of them in the above example) and transforms them into XSpec instances for
each input and output XML file to be tested[16]. The Ant build script then runs each generated XSpec test and generates
XSpec test reports.
The run-xspecs code, while still in development and rather lacking in
any features we don't currently need, works beautifully and has significantly eased
the QA process.
Some Notes on Conversion Mechanics
Some notes on the conversion mechanics:
The conversions were run by Ant build scripts that ran all of the various
tasks, from Aspose RTF to WordML, the pipeline(s), validation, XSpec tests
and reporting.
The volumes were huge. We are talking about several gigabytes of
data.
The conversions, all of them in batch, were done on a file system. While I
don't recommend this approach (I would gladly have done conversion in an XML
database), it does work.
The pipeline enabled a very iterative approach.
It should be noted that while I keep talking about a
pipeline, several similar data migrations were actually done in parallel with other
products, each with similar pipelines and similar challenges. The techniques
discussed here apply to those other pipelines, of course, but they all had their
unique challenges. While the pipelines all had a common ancestor, a first pipeline
developed to handle forms publication, they were developed at different times by
different people on different continents.
Even so, yours truly did refactor, merge and rewrite two of his pipelines for two
separate legal commentary products into a single one where a simple reconfiguration
of the pipeline using build script properties was all that was needed to switch
between the two products.
Note
An alternative way of doing pipelines is explored in a 2017 Balisage paper I
had the good fortune to review before the conference:
Patterns and antipatterns in XSLT micropipelining by
David J. Birnbaum. It explores micropipelining, a
pipelining method where a pipeline is constructed inside a single XSLT by adding a
series of variables, each of them a step doing something to the input.
We used these techniques in some of our steps, essentially creating pipelines
within pipelines. Describing them would bring the size of this paper up to that
of a novel, so I recommend you read David's paper instead.
Preprocessing?
One of the reviewers of this paper wanted to know if preprocessing the content
would have helped. While his or her question was specifically made in the context
of
processing list items, footnotes and headings, the answer I wish to provide should
apply to everything in this paper:
First, yes, preprocessing helps! We did, and we do, a lot of
that. Consider that the RTFs were (and still are, at the time of this writing) being
used for publishing in print and online, and some of the problems encountered when
migrating are equally problematic when publishing. There are numerous Microsoft Word
macros in place to check, and frequently correct, various aspects of the content
before publication. To pick but one example, there is a macro that converts every
footnote created using Microsoft functionality to an inline superscripted label
matching its footnote body placed elsewhere in the document, as the MS-style
footnotes will break the publishing (and migration) process.
Second, what is the difference between a preprocess and a pipeline step? If it's
simply that the former is something done to the RTFs, before the pipelined
conversion proper, the line is already somewhat blurred. The initial conversion
takes the RTF first to docx and then to XHTML, but I would argue that the XHTML is a
reasonably faithful reproduction of the RTF's event-based semantics and so equally
well-suited for preprocessing, unless the problem we want to solve is
the lack of semantics, neatly bringing me to my final point.
Third, the kind of problem that cannot be solved by
preprocessing is an authoring mistake, typically ranging from not using a style to
using the wrong one. It is the very lack of semantics that is the problem. If the
style used was the wrong one, the usual consequence is a processing attribute left
behind and discovered during QA. This can be detected but it cannot be automatically
fixed.
That said, we did sometimes preprocess the RTF rather than adding a pipeline step,
mostly because there was already a macro to process the RTFs that did what we
wanted, not because the macro did something the pipeline couldn't.
Conclusions
Some conclusions I am willing to back up:
A conversion from RTF to XML is error-prone but quite possible to
automate.
The slightest errors in the sources will cause problems but they can be
minimised and controlled. At the very least, it is possible to develop
workarounds with relative ease.
A pipeline with a multitude of steps is the way to go, every step doing
one thing and one thing only.
It is easy for a developer to defocus ever so slightly and add more to a
step than intended (I'll just fix this problem here so I don't have
to do another step). This is bad. Remain focussed and your
colleagues will thank you.
Lastly
Here's where I thank my colleagues at LexisNexis, past and present, without whom I
would most certainly be writing about something else. Special thanks must go to
Shely Bhogal and Mark Shellenberger, my fellow Content Architects in the project,
but also to Nic Gibson who designed and wrote much of the underlying pipeline
mechanics.
Also, thanks to Fiona Prowting and Edoardo Nolfo, my Project Manager and Line
Manager, respectively, who sometimes believe in me more than I do.
End Notes
[1] And seen as such; the terminology may sometimes be confusing for structure
nazis like yours truly.
[2] Originally referring to literally loose leafs to be
added to binders.
[3] Sometimes the changes warrant whole new chapters or sections containing
the new A paras. These chapters and sections will then follow
the number-letter numbering conventions.
[4] Another being that the paper publishing system, SDL
XPP, a proprietary print solution for XML used by
LexisNexis, appears to require a single file as input.
[5] Another early motivation for the roundtrip was to have the in-house
editors perform QA on the converted files—by first converting them back
to RTF. Thankfully, we were able to show the client that there are
better ways to perform the QA.
[6] A surprisingly difficult question to answer when discussing several gigabytes
of data.
[7] Enabling the developer to run a step against the previous step's
output.
[8] Options include debug output, stitch patterns, validation, and much
more.
[9] There were more than a hundred titles, meaning more than a
thousand physical files.
[10] What's shown here are the default patterns in the file stitcher
XSLT. In reality, as several different commentary title sets were
converted, the calling XProc pipeline would add other
patterns.
[11] Legal documents tend to add small caps to their formatting, just to pick
one example.
[12] While the RTF template includes a number of basic list styles, the labels
in an ordered list are usually entered manually; a single volpara will have
a well-defined progression of allowed ordered list types so that each list
item can be referenced in the text.
[13] Or the last following sibling, if there were no more volume paragraphs to
add.
[14] That is, there was no actual linking support to be had.
[15] Or inside the span, or a combination of both.
[16] Most of our conversions ran in batch, sometimes with dozens or
hundreds of files in a single run.