How to cite this paper
La Fontaine, Robin. “Divide and Conquer: can we handle complex markup simply?” Presented at Symposium on Cultural Heritage Markup, Washington, DC, August 10, 2015. In Proceedings of the Symposium on Cultural Heritage Markup. Balisage Series on Markup Technologies, vol. 16 (2015). https://doi.org/10.4242/BalisageVol16.LaFontaine01.
Symposium on Cultural Heritage Markup
August 10, 2015
Balisage Paper: Divide and Conquer: can we handle complex markup simply?
Robin La Fontaine
Robin is the founder and CEO of DeltaXML. He holds an Engineering Science
degree from Oxford University and an MSc in Computer Science. His background
includes computer aided design software and he has been addressing the
challenges and opportunities associated with information change for many
years.
Copyright © 2015 DeltaXML Limited. All rights reserved.
Abstract
Cultural Heritage markup can quickly become complex because of the need to
represent multiple, and even overlapping, hierarchical structures. It can therefore
become very difficult to maintain correctly. This talk suggests that a better
approach is now possible: markup that is designed to represent different aspects of
a text could be handled separately from the point of view of checking and
maintenance, and then only combined into a single document when needed, e.g. for
some kind of analysis. Advances in comparison and merge tools for XML make this a
possibility.
Table of Contents
- Introduction and Background
- An example of Divide and Conquer
- Application to Cultural Heritage Markup
- Developments in XML-aware Comparison and Delta Representation
- Conclusions
Introduction and Background
Our cultural heritage is important, and we can learn from it. In looking at better
ways of handling cultural heritage documents using structured markup, there is an
opportunity also to learn from computer science ‘heritage’. Although many things in
computer science are changing very rapidly, lessons can be learned from past mistakes and experiences, and it is often the case that what is deemed to be a new approach is in fact an old approach revisited.
One of the purposes of cultural heritage markup is to have a representation of many
variants of a document all in one document. The variation may be in how it is marked
up,
or in the text itself. This can lead to very complex markup, and it can become extremely
difficult to manage without very good tools. Indeed, as the information content becomes
richer, so the difficulty of handling the complexity increases. This is very well described by Schmidt [1][2], who proposes that one way to solve this is to keep the variants separate and merge them as needed. He points out, however, that this is not a simple task.
The purpose of this short paper is twofold: firstly, to note that this divide-and-conquer approach has been used very successfully in similar situations; secondly, to summarize developments in XML comparison and merge, developed primarily for other purposes, that relate to and may help in this area.
An example of Divide and Conquer
The cultural heritage markup problem has similarities to the handling of multiple
versions in other areas of computer science. An example of this is a project for
handling the documentation of a complex data model, using a version controlled
relational database. Although this work was done some twenty-five years ago, the lessons
learned remain pertinent.
The purpose of this project was to document a complex data model, and have this
reviewed by subject matter experts. The problem was that these experts were, as they
always are, short of time, and therefore we wanted to ensure that their time was well
spent in review. However, the model had to be developed and reviewed over many different
versions in order to make sure that it was correct, and therefore we needed to present
the subject matter experts with successive versions as these were developed. The experts
clearly wanted to know what had changed, rather than reviewing the whole document
again.
This was in an era before tracked changes had been thought of, and good word-processing technology was certainly not widely available.
The documentation was therefore put into a relational database, which was versioned
so that each successive version was recorded and identified. Using what was called a fourth-generation language (4GL)[3], it was possible to write a report that generated the full documentation with an indication of which parts of it had been updated since the previous version. (As an aside, it turned out to be
impossible to parameterize the 4GL reports sufficiently, and therefore large sections
had to be duplicated and slightly modified, resulting in a very large number of lines
of
code, which eventually became impossible to maintain.) In terms of the result that
was
produced, the project was very successful and the subject matter experts were pleased
because they were able to review only the changes.
As more versions of the document were added to this database, it became more and more
difficult to maintain the integrity of the database. It was extremely difficult, for
example, to remove a particular version from the database, or even to make updates
to
the latest version. This was partly due to inadequate tools, but it was fundamentally
difficult because whenever something was changed, it had to be duplicated first and
all
the versioning information set up correctly.
To get round this problem, a new approach was adopted. Rather than working directly
on the versioned database, a new version of the documentation was created independently
from it. An automated script could then add this new version back into the versioned database; because the step was automated, correctness could be guaranteed. Using this approach, it became far
easier to create a new version of the document while at the same time being able to
maintain the versioned documentation that was required by the subject matter experts.
It was quite a simple idea, but it made an increasingly complex situation much easier
to handle. There are some parallels with cultural heritage markup, so this approach is worth pursuing in the light of that experience.
Application to Cultural Heritage Markup
We will now consider how this approach applies to cultural heritage markup. If we
could work on the representation of a particular variant of the document, this would
have relatively simple markup, which could be validated using conventional XML tools.
There would not be a need for overlapping hierarchy, and possibly not even text
variations. If we could then combine these simpler variants into a single document,
using markup to show structural and text variations, we would still be able to publish
the rich information that cultural heritage markup provides.
One of the advantages of this approach would be that we would not need to keep all
the variants together in a single document all the time, but rather we would combine
only those variants that were relevant to a particular publishing scenario. In addition,
we could combine two related variants in order to check their integrity with respect to each other.
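To make the idea concrete, here is a minimal sketch in Python of how two simple variants might be combined into a single merged document. The variant and rdg element names and the wit attribute are invented for illustration (loosely echoing the TEI apparatus); this is not the author's tooling, and a production merge would need to handle structure as well as plain text.

# A minimal sketch, not a real implementation: merge two plain-text variants of
# a passage into one XML document in which points of divergence are wrapped in
# a hypothetical <variant>/<rdg> structure (names invented for illustration).
import difflib
import xml.etree.ElementTree as ET

def merge_variants(text_a, text_b, wit_a="A", wit_b="B"):
    words_a, words_b = text_a.split(), text_b.split()
    merged = ET.Element("merged")
    merged.text = ""
    last = None  # last child element appended, so later text goes in its tail

    def emit_text(s):
        if last is None:
            merged.text += s
        else:
            last.tail = (last.tail or "") + s

    for op, a1, a2, b1, b2 in difflib.SequenceMatcher(
            None, words_a, words_b).get_opcodes():
        if op == "equal":
            emit_text(" ".join(words_a[a1:a2]) + " ")
        else:
            var = ET.SubElement(merged, "variant")
            ET.SubElement(var, "rdg", wit=wit_a).text = " ".join(words_a[a1:a2])
            ET.SubElement(var, "rdg", wit=wit_b).text = " ".join(words_b[b1:b2])
            var.tail = " "
            last = var
    return merged

doc = merge_variants("the quick brown fox jumps", "the quick red fox leaps")
print(ET.tostring(doc, encoding="unicode"))
# e.g. <merged>the quick <variant><rdg wit="A">brown</rdg><rdg wit="B">red</rdg></variant> fox ...</merged>

Each variant remains simple enough to validate on its own; the merged form is generated only when it is needed.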
Achieving this simplification raises significant challenges in performing the merge, as noted by Schmidt. The merge would need to be based on comparison,
but it would be important to align the text independently of the structural markup.
That
said, some of the markup may be important for alignment and therefore a flexible
comparison approach is needed. Traditional text comparison tools are line-based and do not understand the markup, so they are unsuitable for this work. XML comparison is
traditionally guided by the document structure and again this is not suitable unless
it
can be made more flexible. A prerequisite is therefore the ability to distinguish between structurally significant markup, i.e. markup that acts as an important divider for alignment, and structurally insignificant markup, i.e. markup that should be ignored for alignment.
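As a sketch of that prerequisite (the element classification below is assumed for illustration, not taken from any real tool), a comparison might flatten each document into a token stream in which structurally significant elements contribute hard boundaries and insignificant formatting elements contribute only their text:

# A sketch, not a real DeltaXML interface: flatten a document into a token
# stream for alignment. Elements named in SIGNIFICANT act as hard alignment
# boundaries; anything else (e.g. inline formatting) is ignored, leaving only
# its text. The element set here is invented for illustration.
import xml.etree.ElementTree as ET

SIGNIFICANT = {"p", "div", "head"}      # structural dividers (assumed)

def alignment_tokens(elem, tokens=None):
    if tokens is None:
        tokens = []
    if elem.tag in SIGNIFICANT:
        tokens.append(("BOUNDARY", elem.tag))
    if elem.text:
        tokens.extend(("WORD", w) for w in elem.text.split())
    for child in elem:
        alignment_tokens(child, tokens)
        if child.tail:
            tokens.extend(("WORD", w) for w in child.tail.split())
    return tokens

doc = ET.fromstring('<p>The <hi rend="italic">quick</hi> brown fox</p>')
print(alignment_tokens(doc))
# [('BOUNDARY', 'p'), ('WORD', 'The'), ('WORD', 'quick'),
#  ('WORD', 'brown'), ('WORD', 'fox')]

Two such token streams can then be aligned word by word, with the formatting markup reintroduced afterwards rather than being allowed to disturb the alignment.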
Once the text has been aligned, it is then necessary to represent the overlapping structural hierarchy in a form suitable for conversion into cultural heritage markup, e.g. TEI[4]. The representation of overlapping hierarchies is a difficult problem, and quite a number of papers on it have been presented at this conference and elsewhere [5].
Developments in XML-aware Comparison and Delta Representation
XML-aware comparison understands the structure of the XML, and therefore uses this
when aligning two documents. Where the XML elements represent structure that is
significant to the alignment process, this approach is appropriate. However, XML element
tags are also used to mark up formatting information, and it is usually desirable not to report a text change when only the formatting has changed. We therefore end up with
a mixture of XML structure: some elements are significant and need to be considered during alignment, while others are not significant and need to be ignored. The ignored elements must still be represented in the
final result and not lost. This, of course, typically leads to overlapping hierarchy.
Ignoring for a moment cultural heritage markup, for regular structured documents in
formats such as DITA[6] or DocBook[7], it
is not generally necessary to be able to represent overlapping hierarchy. However,
it is
often desirable to be able to distinguish between textual changes and formatting
changes, and a good representation of overlapping hierarchy enables such a distinction
to be made in a delta file[8].
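One simple way to make that distinction, sketched here in Python (illustrative only, and not the DeltaV2 representation itself), is to compare the markup-free text of two aligned fragments: if the text is identical, the change is formatting-only.

# A sketch, not part of any product: classify an aligned pair of fragments as a
# text change or a formatting-only change by comparing their markup-free content.
import xml.etree.ElementTree as ET

def plain_text(xml_fragment):
    # Strip all markup and normalise whitespace.
    return " ".join("".join(ET.fromstring(xml_fragment).itertext()).split())

def classify_change(frag_a, frag_b):
    if frag_a == frag_b:
        return "unchanged"
    if plain_text(frag_a) == plain_text(frag_b):
        return "formatting-only"
    return "text"

print(classify_change('<p>The <b>quick</b> fox</p>',
                      '<p>The <i>quick</i> fox</p>'))   # formatting-only
print(classify_change('<p>The quick fox</p>',
                      '<p>The swift fox</p>'))          # text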
Another requirement of conventional structured document comparison is the need to
control the alignment of specific elements within the document. This can be achieved
by
assigning keys to these elements and ensuring that keyed elements are aligned with one another in preference to any other alignment. The use of keys enables a very reproducible and controllable merge.
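A sketch of such key-based alignment, using an invented key attribute (a real vocabulary might use xml:id or a tool-specific attribute instead): keyed children of two versions are paired by key value before any heuristic alignment of the remainder is attempted.

# A sketch, assuming an invented 'key' attribute: pair up keyed child elements
# of two versions by key value before any heuristic alignment of the rest.
import xml.etree.ElementTree as ET

def pair_by_key(old_root, new_root, key_attr="key"):
    old_keyed = {c.get(key_attr): c for c in old_root if c.get(key_attr)}
    new_keyed = {c.get(key_attr): c for c in new_root if c.get(key_attr)}
    pairs = [(old_keyed[k], new_keyed[k]) for k in old_keyed if k in new_keyed]
    unkeyed_old = [c for c in old_root if not c.get(key_attr)]
    unkeyed_new = [c for c in new_root if not c.get(key_attr)]
    return pairs, unkeyed_old, unkeyed_new   # the rest goes to heuristic alignment

old = ET.fromstring('<doc><sec key="intro">Old intro</sec><sec>Notes</sec></doc>')
new = ET.fromstring('<doc><sec>Preface</sec><sec key="intro">New intro</sec></doc>')
pairs, rest_old, rest_new = pair_by_key(old, new)
print([(a.text, b.text) for a, b in pairs])   # [('Old intro', 'New intro')]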
We are therefore moving to a situation where generic XML delta formats are able to
represent not only changes to textual information and the simple addition or deletion of complete elements, but also the presence or absence of XML tags around portions of text. We are getting close to being able to take multiple variants of a document,
where the text is similar but not necessarily the same, and where markup may be
completely different, and merge these into a single document where the variants are
represented in XML in a generic form, without loss of information. To validate that
no
information is lost, it should be possible to generate all of the original documents
from the merged document.
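Using the same invented variant/rdg conventions as the earlier sketch, such a round-trip check might look like this: project the merged document back onto each witness and confirm that every original text can be regenerated.

# A sketch of round-trip validation for the invented <variant>/<rdg> merge:
# project the merged document back onto each witness and check that the
# original text is recovered, i.e. that the merge was lossless.
import xml.etree.ElementTree as ET

def project(merged, wit):
    parts = [merged.text or ""]
    for child in merged:
        if child.tag == "variant":
            for rdg in child.findall("rdg"):
                if rdg.get("wit") == wit:
                    parts.append(rdg.text or "")
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())   # normalise whitespace

def is_lossless(merged, originals):
    # originals maps a witness id to its original plain text
    return all(project(merged, w) == " ".join(t.split())
               for w, t in originals.items())

merged = ET.fromstring(
    '<merged>the quick <variant><rdg wit="A">brown</rdg>'
    '<rdg wit="B">red</rdg></variant> fox</merged>')
print(is_lossless(merged, {"A": "the quick brown fox",
                           "B": "the quick red fox"}))   # True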
The original purpose of this generic delta format was to enable derivatives to be generated for a variety of purposes. For example, it would be possible to
show where text has been changed, and distinguish this from where formatting has been
changed so that those who are interested in the one are not confused by the other.
It is
also possible to ignore certain types of change in an intelligent way.
These advances may have applications useful to the cultural heritage markup community.
Conclusions
This short paper has described how a divide-and-conquer approach was adopted for a
complex version-controlled relational database, designed to support documentation
of a
data model. The version-controlled database became too complex to manage, but success was achieved by working on each new version independently and only then adding it to the versioned database. Very useful results were achieved from this: results that were simply not possible at the time with any other approach.
The paper has explored parallels between this and the management of cultural heritage
markup and shown that advances in techniques to compare and merge structured XML
documents mean that a similar approach could be applied.
A purpose of these short talks is to explore different approaches to existing
problems. The question for discussion and feedback from the audience is how useful this approach would be to the cultural heritage markup community.
References
[1] Schmidt, Desmond. “The role of markup in the
digital humanities.” Historical Social Research 37 (2012), 3, pp. 125-146. URN:
http://nbn-resolving.de/urn:nbn:de:0168-ssoar-378369
[2] Schmidt, Desmond. “Merging Multi-Version Texts:
a Generic Solution to the Overlap Problem.” Balisage Series on Markup Technologies,
vol. 3 (2009),
http://www.balisage.net/Proceedings/vol3/html/Schmidt01/BalisageVol3-Schmidt01.html,
doi:https://doi.org/10.4242/BalisageVol3.Schmidt01
[3] Informix 4GL:
https://en.wikipedia.org/wiki/Informix-4GL
[4] TEI: Text Encoding Initiative,
http://www.tei-c.org/index.xml
[5] Marcoux, Yves, Michael Sperberg-McQueen, Claus Huitfeldt. “Modeling overlapping structures.”
Balisage Series on Markup Technologies, vol. 10 (2013),
http://www.balisage.net/Proceedings/vol10/html/Marcoux01/BalisageVol10-Marcoux01.html, doi:https://doi.org/10.4242/BalisageVol10.Marcoux01
[6] OASIS Darwin Information Typing Architecture (DITA)
TC, https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=dita
[7] OASIS DocBook TC,
https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=docbook
[8] Overlapping Hierarchies in DeltaV2
Format, http://www.deltaxml.com/support/documents/deltav21