Context and Goals

A recently completed data conversion project provided an opportunity to combine several different techniques for XML quality assurance and quality control.

The data conversion project was a significant cost and significant risk for the company. Data in two languages, conforming to several hundred SGML DTDs, was converted to a single XML DTD. The company was a legal publisher, and this data was the company's key money-making asset. The conversion project stretched over multiple years and multiple teams in multiple locations.

The time and money involved meant that it was crucial to specify clearly what was to be done, and to be able to estimate as accurately as possible the amount of effort involved to complete each subproject. It was critical that no data be lost or damaged, and that any errors be caught before the converted data went into production.

The nine techniques discussed in this paper can be grouped into three broad categories:

  1. Data Analysis and Estimation:

    • Count function points in source to estimate effort.

    • Count function points in target to estimate slope.

    • Autogenerate tight schemas to discover variation.

  2. Quality Assurance:

    • List parent-child pairs to guide specification.

    • Always program for context.

    • Always program for all content.

  3. Quality Control:

    • Compare source to target for lost or duplicated content.

    • Autogenerate word wheels to highlight anomalous data.

    • Use an XQuery-capable XML database to quickly review data.

Data Analysis and Estimation

The first steps toward a high-quality conversion project include getting a clear sense of the size of the job, the complexity of the conversion, and the variability of the data. Though hardly a complete list, the techniques described in this section address each of these concerns.

Count function points in source to estimate effort

The basic metric we use to estimate the size of a conversion project or subproject is to count the conversion function points in the input data, defined as follows:

  1. Each parent/child pair counts for one. So, for example, article/para/bold and section/para/bold together count for a single point (the pair para/bold), but section/title/bold merits a second point (the pair title/bold).

  2. Each element/attribute pair counts for one. So tr/@align and td/@align count as two function points.

  3. Text, processing instructions, and comments are ignored. This is based on the observation/assumption that most conversions pass these kinds of information through without complicated logic.

The function point count allows us to estimate the programming effort. In our experience, as a rule-of-thumb starting point, specification, programming and QC come out to about an hour per function point. Already-specified function points can usually be deducted from the estimate.
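
By way of illustration, the count itself reduces to a short query. Here is a minimal XQuery sketch, assuming the input files have been loaded into an XML database as a collection named 'source' (the collection name is an assumption for this example, not a project detail):

(: one conversion function point per distinct parent/child pair  :)
(: and per distinct element/attribute pair; text nodes,          :)
(: processing instructions and comments are ignored              :)
count(distinct-values((
  for $e in collection('source')//*
  return concat(name($e/parent::*), '/', name($e)),
  for $a in collection('source')//@*
  return concat(name($a/parent::*), '/@', name($a))
)))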

Obviously, other metrics could be used. In deciding how much context should count, experience suggested that no context (all bold elements considered the same) was too little, multiple levels of context (article/para/bold and section/para/bold considered different) was too much, and one level of context was about the right balance. By right balance we mean that the resulting function point count varied roughly in proportion to the programming effort required.

Benefit: This metric is objective, transparent, and repeatable.

Count function points in target to estimate slope

When available, a count of the function points in the target allows us to estimate the slope, or difficulty, of the conversion.

Example:

  • Input sample = 72 conversion function points

  • Corresponding output = 101 conversion function points

  • 101 / 72 ≈ 1.4, i.e. a 40% bulk-up

Note that conversion effort is sometimes more closely related to the number of output markup combinations that must be produced than to the number of input markup combinations. For example, it's much easier to convert both foreign and pub-title to italic than it is to map italic to either foreign or pub-title depending on other clues. So, with one-to-one (slope = 1) as the baseline, greater bulk-up factors typically represent greater complexity, difficulty and effort in the conversion.

But usually, by the time output is available for analysis, the conversion is done. Since target markup is often not available in advance, slope is more commonly estimated from the number of text-pattern-to-element rows in the specification. Alternatively, a smaller sample is converted, its slope is measured, and that slope is assumed to apply to the fuller data set.
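
When a converted sample is available, the same count wrapped in a function makes the slope calculation direct. A sketch only, with collection names assumed for illustration:

declare function local:fp($docs as node()*) as xs:integer {
  count(distinct-values((
    for $e in $docs//*  return concat(name($e/parent::*), '/', name($e)),
    for $a in $docs//@* return concat(name($a/parent::*), '/@', name($a)))))
};
(: e.g. 101 div 72 = 1.40..., the 40% bulk-up of the example above :)
local:fp(collection('target-sample')) div local:fp(collection('source-sample'))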

Recommendation: during conversion, keep the slope modest. Conversions whose slope is much greater than one can be defined to be data enhancements, rather than data conversions. As such, they may be more successfully undertaken once the basic conversion is complete. This is especially true in contexts such as the present case study, where the initial conversion collapsed hundreds of sometimes contradictory SGML DTDs into a single consistent XML DTD.

Autogenerate tight schemas to discover variation

Using free tools such as inst2xsd [inst2xsd], any number of XML files can be used to create a schema.

A schema can be generated for an initial set of files, and then this schema can be used to validate additional files. The resulting validation errors indicate new input variations that must be accommodated.

This same technique can equally well be applied to conversion outputs. New patterns in the outputs can indicate where downstream processes may need to be extended, or may be indications of a conversion process that's gone off the rails.
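
As a sketch of that validation loop, assume the additional files are loaded as a collection named 'new-input', the schema generated from the initial sample has been saved as initial-sample.xsd, and the XQuery processor offers an XSD validation function (here BaseX's validate:xsd-info is assumed; function names and availability vary by processor and version):

(: report each additional file that no longer fits the schema :)
(: generated from the initial sample of files                 :)
for $doc in collection('new-input')
let $problems := validate:xsd-info($doc, 'initial-sample.xsd')
where exists($problems)
return
  <variation file="{base-uri($doc)}">{
    for $problem in $problems
    return <message>{$problem}</message>
  }</variation>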

Benefit: Auto-generated tight schemas are one way to highlight variation in the data.

Quality Assurance

Our goal in quality assurance is to avoid introducing errors in the first place. Our techniques are a collection of best practices for specification and programming, three of which are highlighted here.

List parent-child pairs to guide specification

A simple list of all parent/child pairs in the input can be used as a framework for specification:

Input     Context     Output   Notes
@align    colspec     to-do    to-do
@align    entry       to-do    to-do
b         entry       to-do    to-do
b         paragraph   to-do    to-do

These parent/child pairs are the same ones used to define conversion function points for estimation purposes above, so we have already generated these lists.
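
If the list ever needs to be regenerated, a short XQuery along the following lines produces one specification row per pair (the 'source' collection name and the row format are illustrative, not the project's):

(: one row per distinct context/input pair found in the source :)
for $pair in distinct-values((
  for $e in collection('source')//*
  return concat(name($e/parent::*), ' ', name($e)),
  for $a in collection('source')//@*
  return concat(name($a/parent::*), ' @', name($a))))
order by substring-after($pair, ' '), substring-before($pair, ' ')
return
  <row context="{substring-before($pair, ' ')}"
       input="{substring-after($pair, ' ')}"
       output="to-do"
       notes="to-do"/>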

This same list can provide a simple starting skeleton for the conversion script:

<template match="colspec/@align | entry/@align">
 <call-template name="to-do"/>
</template>
<template match="entry/b | paragraph/b">
 <call-template name="to-do"/>
</template>

Always program for context

Mapping without context is deceptively fast:

<template match="b">
 <call-template name="map-to-bold"/>
</template>

But new contexts often require new consideration:

<template match="entry/b">
 <call-template name="map-to-heading-cell"/>
</template>
<template match="paragraph/b">
 <call-template name="map-to-bold"/>
</template>

Benefit: Defensive programming avoids accidents before they happen.

Always program for all content

We may script a paragraph by writing code for each of the following attributes:

  • paragraph/@font_size

  • paragraph/@keep-next

  • paragraph/@keep-previous

  • paragraph/@leading

  • paragraph/@no-keeps

  • paragraph/@type

But if we do that, we will lose:

  • paragraph/@prespace

...unless we script for it, and for all other attributes as well.
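
One way to avoid overlooking an attribute is to ask the data itself, rather than the DTD or our memory, which attributes actually occur. A small XQuery sketch, again assuming the input is loaded as a collection named 'source':

(: every attribute that actually occurs on paragraph in the source, :)
(: with a few sample values, so that each one, including rarities   :)
(: like @prespace, gets an explicit mapping decision                :)
for $name in distinct-values(collection('source')//paragraph/@*/name(.))
order by $name
return
  <attribute name="{$name}"
             sample="{distinct-values(
               collection('source')//paragraph/@*[name(.) = $name]
             )[position() le 5]}"/>

In XSLT terms, a catch-all template that flags any element or attribute not explicitly handled serves the same defensive purpose.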

Benefit: Avoid silent loss of data.

Quality Control

Our goal in quality assurance was to avoid introducing errors. Our goal now, in quality control, is to reassure ourselves that we were successful, and to find the errors we no doubt made despite our best efforts.

Compare source to target for lost or duplicated content

The goal in comparing conversion input to conversion output is to note where we've lost or duplicated content. The basic concept is easy: delete all the markup and compare the content that remains. The execution is rather more difficult:

  • Some input markup will become output content.

  • Some input content will become output markup.

  • Some input content will be purposely deleted.

  • Some input content will be purposely duplicated.

Nonetheless, we achieved good results by preprocessing both input and output files, rescuing some content from markup, and duplicating and deleting content as required. We then sorted runs of text (another problem: deciding what constitutes a run of text) in order to overcome issues with purposely reordered content. Finally, we used a side-by-side comparison tool to highlight mismatches between the massaged source and target.
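
A stripped-down XQuery sketch of the text-run side of this preprocessing, treating each text node as a run (itself a simplification) and leaving out the rescue, duplication and deletion steps; the file path is illustrative:

(: reduce a document to its normalized text runs, sorted so that :)
(: purposeful reordering does not register as a difference       :)
declare function local:runs($doc as document-node()) as xs:string* {
  for $text in $doc//text()[normalize-space(.)]
  let $run := normalize-space($text)
  order by $run
  return $run
};
string-join(local:runs(doc('source/file1.xml')), '&#10;')

Running the same function over the corresponding output file, and feeding both results to the comparison tool, highlights the mismatches.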

Benefit: Catch lost and duplicated content.

Autogenerate word wheels to highlight anomalous data

I was first introduced to the concept of a word wheel in the context of Folio Views™ in the early 1990s [Folio]. Folio Views was (and is) software for searching and browsing large sets of textual information, at the time typically delivered on removable media. Folio Views allowed searching of predefined fields (in XML terms, typically metadata elements or semantic inline elements) across the data. One special feature, when searching fields, was autocompletion. When the user's cursor was placed in the search-form box corresponding to a particular field, all the values for that field were shown in a sorted list of unique values, called the word wheel. Just as with modern autocompletion, this list was updated as characters were typed in the search box.

The word wheel was a wonderful searching enhancement when it was introduced. But it could be a terrible embarrassment to the publisher if the source data was not clean. If searching for provinces, it was a wonderful help to be provided with a short list starting with Alberta and British Columbia. But if Alberta were misspelled anywhere in the thousands of documents being searched, the misspelling would show up in the short autocompletion list: Alberta, Alberto, British Columbia. Worse, if there were numerical values in a text field, they would sort to the top, making themselves all the more obvious.

The idea of adapting this word wheel behavior to become a quality control tool is simple. For each text-containing element in the converted output, with the exception of free-text fields like paragraphs, produce a sorted list of the unique observed values. Then ask a human to review the resulting sorted lists, looking for anomalies.
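
A minimal XQuery sketch of such a word-wheel report, assuming the converted output is loaded as a collection named 'target' and that paragraph is the free-text element to skip (both assumptions for this example):

(: for each element that has text but no child elements, list its :)
(: distinct values in sorted order, ready for human review        :)
for $name in distinct-values(
  collection('target')//*[not(*)][normalize-space(.)]/name(.))
where $name != 'paragraph'
order by $name
return
  <word-wheel element="{$name}">{
    for $value in distinct-values(
      for $e in collection('target')//*[name(.) = $name][not(*)]
      return normalize-space($e))
    order by $value
    return <value>{$value}</value>
  }</word-wheel>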

This approach very quickly highlights anomalous data, making it jump out to the human reviewer. At the same time, it avoids the wasted effort of reviewing thousands of redundantly correct values.

In the discussion that followed, Steve DeRose suggested that, in addition to an alphabetized list, a ranked list would give an additional insight. A ranked list would have the most common values at the top: not very interesting. And at the bottom, it would have the one-off values: possibly errors but also possibly one-off oddities. In the middle, you would have a sweet spot where you might find repeated errors.

Benefit: Makes human QC efficient.

Use an XQuery-capable XML database to quickly review data

When a suspicion arises, perhaps as a result of an odd word-wheel entry or an unexplained mismatch between input and output content, it becomes very important that the offending data can be found quickly. And not only the specific example should be found: we should also review other places where similar behavior is likely to have occurred. Having all input and all output instantly searchable using XQuery is invaluable. Given the availability of free and easy tools such as BaseX [BaseX], there's really no excuse not to equip every team member with easy access to all the data.
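
For example, once a word wheel turns up a suspicious Alberto, a query along these lines finds every occurrence together with enough surrounding context to judge it (the province element and the 'target' collection are illustrative, not project specifics):

(: locate the suspect value and show its parent element for context :)
for $hit in collection('target')//province[. = 'Alberto']
return <hit file="{base-uri($hit)}">{$hit/parent::*}</hit>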

Tangible Benefits

Consistent application of these techniques resulted in the following benefits to this project, as well as to other projects undertaken by Tata Consultancy Services:

  • Objective, repeatable and reliable estimation from our conversion function point framework.

  • High quality results from programming best practices.

  • High productivity from reliable tools and techniques.

  • Scalability from a systematic development approach.

References

[BaseX] Wikipedia contributors, BaseX, Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/BaseX (accessed 2012 July 13).

[Folio] Wikipedia contributors, Folio Corporation, Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Folio_Corporation (accessed 2012 July 13).

[inst2xsd] inst2xsd (Instance to Schema Tool), part of the Apache Project XMLBeans Tools, http://xmlbeans.apache.org/docs/2.0.0/guide/tools.html#inst2xsd (accessed 2012 July 13).

Author's keywords for this paper:
function points; elements in context; parent/child pairs; estimating; converting; conversion slope; autogenerated schemas; programming best practices; automated testing; programming; quality; XQuery; XSLT; querying; validating

Charlie Halpern-Hamu

Senior Solutions Architect

Tata Consultancy Services

Charlie has been working with structured text since 1991. During this time, he has acted as a content and systems architect, programmer, systems integrator, consultant, mentor, best-practices coordinator, trainer, book editor, project lead, department manager, and vice president. His consulting and training work has taken him all over North America, as well as on visits to South America, Europe, Australia and China. Charlie has a PhD in Computer Science from the University of Toronto and an MBA from Heriot-Watt University. He's good at making complex systems easy to understand. Or so he claims.