How to cite this paper
Quin, Liam R. E. “Beyond Eighteen Wheels: Considerations in Archiving Documents Represented Using the
Extensible Markup Language.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation
of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the
Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). https://doi.org/10.4242/BalisageVol6.Quin01.
International Symposium on XML for the Long Haul: Issues in the Long-term Preservation
of XML
August 2, 2010
Balisage Paper: Beyond Eighteen Wheels
Considerations in Archiving Documents Represented Using the Extensible Markup Language
Liam R. E. Quin
XML Activity Lead
The World Wide Web Consortium
Liam Quin is the XML Activity Lead at the World Wide Web Consortium,
where he has worked since 2001; he also does consulting in his spare time.
Prior to working for W3C, Quin was a full-time consultant. He has worked with structured
markup
since the early 1980s, with SGML since 1987, and was an Invited Expert for the original
XML work at W3C.
Copyright © Liam Quin, 2010
Abstract
When documents are stored for any significant length of time,
or when they are used, whether continuously or occasionally, over
an extended period, the original people and culture and context
associated with their creation become unavailable. If the documents
are to remain useful, it is necessary to retain sufficient knowledge
about how they can be used that the future people involved can
still gain value from them.
This document is a position paper for discussion.
Table of Contents
- Introduction
- Definitions
- An XML Document
- A Long Time
- Storing
- Documents as Communication
- Document Context
- Navigation and Finding Aids
- The Politics of Selection
- How to Archive?
- The Physical Substrate
- The Logical Layers
- Coded Character Sets
- Fonts and Glyphs
- Extensible Markup Language (XML)
- Ancillary Formats
- Multiple Copies, Multiple Locations.
- Summary
- Designing XML-based Formats for Longevity
- Avoid Implicit Content
- Avoid Obscure Features
- Avoid Cryptic Names
- Document the significance of markup items
- Validate the Data
- Mean What You Mean To Mean
- Check Links
- Provide for Translations
- Provide for Contextualization
- Don't be Inventive
- Summary
- XML or Not XML?
- Textual Format
- Explicit End Markers
- Embedded Usage
- Device Independent
- Open Specification
- Conclusions
Introduction
The requirements for archiving a document for a hundred years are
very different from the requirements for archiving the same document for one year,
or for five years. As we try to prepare for even longer term storage, the number
of unknowns increases greatly.
This paper suggests ways to manage some of those unknowns, and to
prepare for as many of them as possible: in other words, ways to store XML-encoded
documents in such a way as to have some reasonable expectation that they can
be decoded at some unspecified time in the future and used.
Definitions
An XML Document
Before we can talk about storing documents for a long time, we must decide what we
mean by the term document.
This is not sophistry: an XML document generally consists of multiple parts, and,
as will be discussed further, we must
be careful in determining the boundary of the document for the purpose of preservation.
Let the term XML Document, then, in this
paper, denote a sequence of characters represented digitally on a computer such that
the sequence of
characters satisfies the productions and constraints
of the XML specification. We shall be neither more nor less precise, but rather shall
qualify or expand upon this base term as needed.
Our definition excludes non-digital representations: if one were to print out an XML
document onto paper, and then make a video of
the paper, the video would not, by our definition, constitute an XML document.
A Long Time
The overview for the Balisage pre-conference symposium suggests that a long time,
in terms of document storage, could be
for as long as a thousand
years. The author of this document, however, suggests that A Long Time,
however long it may actually be, in any case starts now, in the present, and gradually
lengthens. What is
significant is that the people who created the document are not the people who decode
it, and that the context of that decoding
is not necessarily the same as the social, technological or political context in which
the document was encoded.
Storing
By our definition an XML Document exists only within computer storage.
A document will be said to have been stored for a given period of time if, at the
end of that time,
the same sequence of characters can be retrieved. This definition permits changes
in the encoding of the document, for example
from UTF-8 to UTF-16, as long as the sequence of encoded characters remains the
same.
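The definition can be made concrete with a minimal sketch, under the assumption that "the same sequence of characters" is tested by decoding both byte sequences and comparing the results. The function name same_characters and the sample text are illustrative and are not part of any specification. The sample deliberately has no XML declaration, since an encoding label in the declaration would itself have to change when the document is re-encoded.

```python
def same_characters(stored: bytes, stored_encoding: str,
                    retrieved: bytes, retrieved_encoding: str) -> bool:
    """Return True if both byte sequences decode to the same characters."""
    return stored.decode(stored_encoding) == retrieved.decode(retrieved_encoding)


# Illustrative document text (hypothetical content).
text = '<poem xml:lang="fr">Montréal</poem>'

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")  # Python prepends a byte order mark here

# The stored bytes differ, yet by the definition above the document survives.
bytes_differ = as_utf8 != as_utf16
still_stored = same_characters(as_utf8, "utf-8", as_utf16, "utf-16")
```

A byte-for-byte comparison would report these two files as different; the character-level comparison is what the definition of Storing calls for.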
Our definition is also silent on whether
the document may be inaccessible at times during the stored period.
One example of a document becoming inaccessible might be
a so-called dark archive,
created to hold copyrighted information until such time
as the copyright expires.
That is a political or social inaccessibility: the document
might be accessible technically, but perhaps not legally.
Documents can easily become technically inaccessible.
A promise of XML is that the format is open, and, unlike,
say, a proprietary word processing format, will still be
readable in the future. The same promise was made of
SGML documents, but few people make SGML software today.
Fortunately, software exists to convert SGML documents to
XML, but doing so correctly requires human expertise
in both formats in order to select correct options.
In order to retain technical accessibility, then,
documents may need to be migrated between formats.
If this is not done, however, keeping the format definition
along with the documents may facilitate such work after
a Very Long Time.
Documents as Communication
The Extensible Markup Language sees a great many uses in a wide range of applications.
We may classify XML documents in many ways, but for the purpose of archiving, let
us consider
a document as a form of communication, with at least one speaker and zero or more
listeners.
Any combination of listener and speaker may be an automated process or may be a human
(or
perhaps some other sentient being). We shall use the term machine to denote an agent
that
is not sentient, and person to denote a sentient agent, regardless of whether the
person or
the machine is human, robotic, dolphin or even alien.
We can sort documents, then, by the creators and by the intended audience, in terms
of one's likely interest
in archiving the documents:
Table I
Creators | Audience | Interest
Machine | Machine | Very Low
Machine | Person | Low
Person | Machine | Low
Person | Person | High
For the purpose of this ranking, multiple agents are treated as equivalent to a single
agent, but
a heterogeneous group of agents containing at least one person is considered to be
a person.
The interest in archiving machine-machine communication is low because the (human)
programmers
of our computers decided that the messages were not of interest to humans, and because
instead of archiving
actual messages, it is customary to archive logs of many such messages. For example,
it is not usually of
interest to record the mouse pointer location on a computer screen when a particular
icon was clicked,
except in human-computer interaction usability research; archiving the details of
that event is unlikely
to be of use to anyone, even where, as with XCB [XCB],
it was in XML. Even a usability researcher would be unlikely to find it useful without
further information, such as what icon was displayed there, what task the user was
attempting, whether she was tired,
and so forth. The machine-to-machine message in this example is not complete without
its context, and does
not constitute a sustained rhetoric or literature. Most machine-to-machine communication
happens entirely
without human intervention, or with only indirect intervention. Note that a log file,
recording a summary of
such communications, is an instance of machine to person communication, and is in
a different category.
Machine to Person communication designates machine-generated documents, and these
constitute a broad range of literature, ranging from error
messages and log files to random poetry. In some cases it is more interesting to archive
the algorithms used
to generate the random poetry, perhaps with some examples. In this paper we will consider
randomly generated
literature to be a subset of person to person communication, with the creator being
the person who wrote the
program and therefore controlled the domain of discourse. Machine-generated documents
such as error
messages that are part of a larger context can be archived only as part of a larger
context, as we shall see.
It would make little sense to archive a document whose entire content was "File not
found," although files
containing such messages can be a great source of confusion to users.
Person to Machine communication might include computer programs and scripts, XSLT
transformations,
manually-generated SOAP messages, and much more. Computer programs can be archived,
but are mostly outside the
scope of this document. We shall consider XSLT documents in a later section.
Person to person communication, mediated through XML documents, includes all manner
of electronic
mail, instant message conversation, poetry, visual arts, music, virtual sculpture,
prosody and erotica, research and
reflection. This is the material that first comes to most people's mind when they
consider archiving beyond a short-term
computer backup. It is the stuff of libraries and of museums, the artifacts of our
culture.
For the sake of completeness, let us be clear that our primary concern is archiving
for subsequent
human (Person) retrieval and study.
Document Context
Documents do not exist in isolation. They are created, stored and retrieved using
digital computers.
When a Person reads a document, or part of a document, what is understood is, as with
any piece of
literature or cultural artifact, bounded on all sides by that culture. Once the context
of creation is lost,
understanding of the artifact is necessarily incomplete. Documents are part of a culture
that includes other documents
as well as social conventions and shared knowledge. We shall use the term External
Context to denote the environment,
political, social, technological and cultural, surrounding a document.
How an ancient object was used is often a mystery. Similarly, there are documents
in existence which
appear to be works of literature of some kind, but which cannot now be deciphered
at all. In some
cases, such as the Phaistos Disc, there is almost no surviving context at all. In
others, such as the one-time key
encryption used by Dee and Kelly in Bohemia [Liu2005], there is some knowledge about the purpose of the documents,
but not necessarily sufficient information to decode them. In other cases, knowledge
is incomplete.
Misunderstandings, corruption in copying, ignorance, politics and flawed textual theories
have
all come into play with documents such as Biblical translations. Famously, “Peace
on Earth and good will to all
men” is now considered to be more accurately rendered, “Peace on Earth to all men
of good will.” Was
the genitive case not noticed by the translators of the King James Bible, or did it
not exist in their primary texts, or did they choose a silent
emendation? We might like to imagine that when we archive computer texts, we will
not have problems with corruption. But in
practice it is not corruption but a lost context that is the problem. Consider the
way that the English language has
changed in as short a time as three hundred years: “indifference” was a word that in
the 1700s meant without
difference, impartial; today a statement that God metes out justice with indifference
would not be considered respectful.
In order to reconstruct the significance of a text, then, recipients of our putative
archived XML Document, a Very Long Time from now,
will need to understand not only the computer formats we have used but also the natural-language
parts of our text.
For XML documents, where it is common practice to embed natural-language terms in
markup as element names or identifiers,
natural language appears not only in document content, but also in the actual document
format, in the markup.
Language context is only one part of the wider document context. A funeral oration
might be perceived quite
differently from a shopping list; a parody differently from a news article. The expected
use and implicit shared understanding
between document creator and audience in these examples can be lost by the Very Long
Time; this tacit knowledge
must therefore be documented and made explicit if the archived document is to be interpreted
as it was intended.
As C. Michael Sperberg-McQueen pointed out in reviewing a draft of this paper, the
effort for an author
in adding extra background information may seem onerous, and may require a very different
sort of skill than
writing the document. In a corporate environment, sanctions are generally available
to require additional
information to be of at least minimal accuracy and completeness; in a university or
library environment, the
mandate for making tacit knowledge manifest as explicit knowledge (See Applen and McDaniel, chh. 1 and 3) may
fall to the archivist.
Socio-political contexts, organizational contexts, corporate cultures and fashion
can also all affect the
interpretation of documents. One cannot archive an entire culture in order to explain
a single document, but
neither can one understand an entire culture from a single document.
Every document necessarily stands in some relationship to other documents. That relationship
can be
implicit or explicit. An example of an implicit relationship might be that a dictionary
gives definitions or
explanations for words found in other documents. An explicit link might point from
a person's name mentioned
in one document to a biography of that person in another document.
Information about the external contexts in which a document was created, and was intended
to be understood, then,
is generally necessary in order
to understand that document correctly. These external contexts include language and
culture of the creators of documents
as well as of the intended recipients, the other documents created or preexisting
within those contexts, and also the
intended purposes and audiences of the documents.
Navigation and Finding Aids
Large modern research libraries and rare book libraries often store the books away
from the public: to see a book one must request it explicitly. In such a world
there is no browsing, and serendipitous discoveries have been outlawed.
One cannot discover tucked inside an otherwise uninteresting volume a transcription
of a poem whose only extant Anglo-Saxon manuscript copy had been destroyed in
a fire.
Forgotten poets remain forgotten, sometimes for the good of mankind and sometimes
regrettably.
A putative future patron of a digital library may be at a great disadvantage
compared to today's visitor to a rare book collection: that of expectation.
One might reasonably expect a rare book library to have a copy of Plato's Republic,
of Moxon on printing [Moxon1683], or of something printed by Caxton or Aldus Manutius. But
the digital collection might contain a thousand terabytes constituting an
ultra-high resolution scan of the skin of an earthworm, or the accumulated
income tax returns of retired Cornish clergymen, or the entirety of Moroccan
twentieth-century literature. What files to request?
This question of how to make a selection is not new to anyone working in
the fields of archiving, but it is certainly new to many computer engineers,
the people most likely to be constructing digital archives. An overview
of the contents of an archive is of critical importance and, in the end,
may be a significant deciding factor in which archives survive.
Controversial works might need to be hidden; at different times in history
many works have been defaced or destroyed for ideological reasons.
In a world of automated search it might seem that such strategies cannot
succeed; after a Very Long Time what seems today Controversial may in any
case seem banal or common-place. We should remember, however, that full-text
search is generally accomplished using software that, over time, will
probably cease to function unless it is actively maintained.
An archive, in any case, needs an overview, a Finding Aid, that gives the
reader an idea of the sorts of thing that one might find in the collection,
and perhaps delves down by category into subsections of the collection.
The Politics of Selection
George Landow writes (pp. 267ff) about the politics of hypertext; of particular relevance
here is the idea that providing
easier access to some documents implies harder access to others: the choice of which
documents to archive is (or
can be) a political decision every bit as much as decisions about which books to keep
on the shelves in a
public library.
It is not technically feasible to archive all documents. Even if we restrict ourselves
to the domain of
person-to-person communication, we still find that the sheer volume of electronic
mail, especially when
spam is included, simply makes it harder to find information later. In addition, privacy
concerns mean that it
is not always desirable to archive everything. Some public libraries now routinely
delete book borrowing
information, so that they cannot be required to identify which books a particular
individual may have read.
If not all documents are to be archived, some documents must be rejected. In a corporate
research environment it might
be that reports are archived, but not research notes, for example. Yet, in the future,
those notes might be
considered a highly valuable resource for understanding and validating (or otherwise)
the findings of the
reports.
When a Document is archived then, one should consider archiving secondary documents
in the same collection;
however, this increases the burden on Finding Aids and on Archive Structure.
How to Archive?
After selecting which documents are to be preserved, and (explicitly or implicitly)
which are to be
destroyed, or at best left to their own chances, after the decision to create an archive
has been made,
one must determine the methodologies to be employed. After the why and the what comes
the how.
The Physical Substrate
It is not reasonable to expect modern computer storage devices to remain functional
for A Very Long Time. Typical values for A Very Long Time for most computer equipment
today are
measured in thousands of hours, not
thousands of years. Magnetic tapes degrade over time, as do optical storage media
such as
compact discs. Active devices such as rotating hard drives have dependencies on voltage
and
current levels, on specific versions of software drivers for specific operating systems,
and, since
they contain firmware, may also fail after a specific date.
It is possible to run a digital archive in such a way that data is periodically migrated
to
newer media. Such a strategy assumes a continued supply of funding and replacement
media.
It is also possible for an organization to rely on external archiving services, but
this does not solve
the question of ensuring that A Very Long Time is sufficiently great.
Finding suitable physical media for long-term storage of digital data remains an unsolved
problem at this time.
The author of this document once unpacked a computer system; inside the box was also
a manual in
many ring-bound volumes, and it was necessary to open shrink-wrapped stacks of hole-punched
paper
and insert them into the proper binders. One of these manuals included a chapter explaining
how to off-load
the box with a fork-lift truck, and how to open the box. Of course, in order to discover
these instructions,
it was necessary to open the box. When archiving for a Very Long Time, it is important
to label the
archive in multiple languages, with a pen, on the outside of the box. Who, in a thousand
years from now,
would guess that an object clearly marked 90 Minutes Audio Cassette actually contained
a computer program?
If the instructions for unpacking the archive are inside the archive, how will they
be used?
The Logical Layers
Computer users are accustomed to metaphors presented by graphical user interfaces.
For example, a Folder is used as a metaphor for a group of documents. But the actual
implementation
of a File System on most operating systems today involves a list of hard disk block
numbers or storage extents.
It might be that a future data archaeologist will have to inspect those individual
disk blocks and piece
them together. This process is made considerably easier if files larger than a single
block are in plain text
wherever possible, rather than (for example) being compressed. In addition, the fewer
layers that must be
penetrated, the easier the task, so store individual files in folders (directories)
rather than in binary
formats such as zip or tar archives. The process of reconstructing data from a damaged
CD-ROM or
hard drive is tedious, but today at least it is a known skill; many skills fall into
disuse, and today
few people can repair a hole in a saucepan, sharpen a wooden ploughshare, or correctly
aim a ballista. If an important archive is to be stored for a long time, the layout
of the storage system's file systems must be documented on a separate physical medium.
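The advice to prefer plain files in folders over binary containers can be sketched as follows. This is a minimal illustration, assuming the material arrives as a zip file; the function name unpack_for_archiving and the file names are hypothetical, and only the Python standard library is used.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack_for_archiving(container: Path, destination: Path) -> list:
    """Expand a zip container into ordinary, uncompressed files in folders."""
    with zipfile.ZipFile(container) as bundle:
        bundle.extractall(destination)
    return sorted(p for p in destination.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as scratch_name:
    scratch = Path(scratch_name)

    # Build a small compressed container to stand in for delivered material.
    container = scratch / "delivery.zip"
    with zipfile.ZipFile(container, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr("poems/sonnet.xml", "<poem>Shall I compare thee</poem>")
        bundle.writestr("README.txt", "Unpacking notes for the collection.")

    # Store the expanded tree, not the container.
    plain = scratch / "plain"
    files = unpack_for_archiving(container, plain)
    names = [f.relative_to(plain).as_posix() for f in files]
    recovered = (plain / "poems" / "sonnet.xml").read_text(encoding="utf-8")
```

After unpacking, each file's text is present on the medium as readable characters, so a reader piecing disk blocks together has one layer fewer to penetrate.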
A text file
is in actual fact stored digitally in a way that could be thought of as a sequence
of integers,
with an implied mapping from integers to characters and from
character sequences to visible representations known as glyphs.
The mapping from integers to characters is known as an encoding;
the mapping from characters to glyphs is implemented by fonts.
Coded Character Sets
A Coded Character Set, or Encoding, is a mapping from integers (or, more properly,
codes of some sort) into characters.
Some encodings are context sensitive, so that the same integer may map to different
characters in different contexts; ISO-2022-JP is an example of such an encoding mechanism.
Others are context-free, so that the same integer always maps to the same logical
character.
Over time, character encodings tend to be modified, for example by introducing the
Euro sign,
or by fixing minor bugs. There is no general concept of version numbers for encodings,
however,
so that there is no way of determining which historical version of a given encoding
was in use
when a document was created. Some encodings (most notably IBM EBCDIC) also have many
variations, with no overall consistent, standard naming scheme.
For the purpose of archiving data for a long time, it is clearly essential to label
all character encodings used, and to include, along with the archived data, copies
of the specifications for the encodings. Note that in order to read these specifications,
people may need to decipher at least some of the encodings used!
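The need to label encodings can be illustrated with a minimal sketch. It uses two related but separately named standards, ISO 8859-1 and ISO 8859-15, as the closest easily reproducible case of an encoding family changing to admit the Euro sign; the sample bytes are illustrative.

```python
# Four stored integers; nothing in the file says how to interpret them.
raw = bytes([0x31, 0x30, 0x20, 0xA4])
integers = list(raw)

# The same integers decoded under two different labels.
as_latin1 = raw.decode("iso-8859-1")    # 0xA4 is the currency sign
as_latin9 = raw.decode("iso-8859-15")   # 0xA4 is the Euro sign

readings_differ = as_latin1 != as_latin9
```

Without a label stored alongside the data, a future reader has no way to tell which of the two readings the creator intended.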
Fonts and Glyphs
An encoding transforms the stored computer file, which we consider to be a sequence
of integers,
into a sequence of logical characters. However, in order for a person to be able to
make sense of
the information, the characters must be presented in some readable form. In the West
this is most often
done using alphabetic symbols from the Latin script. The software that controls
the mapping from characters to glyphs is a Text Layout Engine. This software generally
reads tables to
indicate that particular sequences of characters are to be displayed as particular
sequences
of character shapes; those replacement sequences and the corresponding
character shapes (glyphs) are defined in Fonts. In order to read a computer file then,
the integers
must be mapped to characters, the characters to glyphs, and the glyphs rendered on
a screen,
paper, or other device.
A font is really a piece of software that implements a
typeface design. Current font technology, especially
OpenType, includes procedural machine code in a language
called TrueType; it would be unreasonable to expect that software
written today will still be runnable twenty years from now. Therefore, as part
of archiving a document for A Very Long Time, we must also archive depictions of
the glyphs, perhaps as bitmap images, along with documentation of the bitmap
image format that was used.
Extensible Markup Language (XML)
Our subject is the archival storage of documents encoded in XML. This encoding should
not be confused with a coded character set: the XML encoding is defined as a formal
grammar whose input is a sequence of characters, not a sequence of integers. The
characters are defined to be in the Unicode character set, although the particular
version
of the Unicode character set is not clearly defined. We have already noted that information
on
the coded character set should be stored along with the document; we now note that
the
version of XML used, and the corresponding XML specification itself, must also be
stored. This is
not the definition of the actual markup used, but rather the specification for XML
itself.
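One way to record the XML version is to read it from the XML declaration when the document is archived. The following is a minimal sketch, assuming an ASCII-compatible encoding such as UTF-8 (it would not work on UTF-16 bytes); the function name describe_prolog and the dictionary keys are hypothetical. A document with no XML declaration is treated as XML 1.0.

```python
import re

# Matches an XML declaration at the very start of the byte stream.
XML_DECLARATION = re.compile(
    rb"""^<\?xml\s+version\s*=\s*["']([0-9.]+)["']"""
    rb"""(?:\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["'])?"""
)


def describe_prolog(data: bytes) -> dict:
    """Report the declared XML version and encoding label, if any."""
    match = XML_DECLARATION.match(data)
    if match is None:
        # No declaration: the document is XML 1.0, with no encoding label.
        return {"version": "1.0", "encoding": None}
    version, encoding = match.groups()
    return {
        "version": version.decode("ascii"),
        "encoding": encoding.decode("ascii") if encoding else None,
    }


labelled = describe_prolog(b'<?xml version="1.1" encoding="UTF-8"?><doc/>')
unlabelled = describe_prolog(b"<doc/>")
```

The resulting values could be written into a plain-text note stored beside the document, together with a copy of the corresponding XML specification.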
Ancillary Formats
Photographs, illustrations, sound clips, 3D models, digital scent definitions, video
and any other non-textual
information must be archived in a file format that is documented. Where possible,
declarative
formats are to be preferred over procedural, and open, documented formats preferred
over closed,
undocumented formats.
Declarative and Procedural Formats
A format may be said to be Procedural if instances of that format give a complete
specification of
an algorithm, and Declarative if instead the format describes a desired result without
giving a full
algorithm.
An example of a Procedural format for graphics might be a computer program in the
FORTRAN IV language
using the Graphics Kernel System (GKS) to draw a series of five rectangles. The program
might be several
hundred lines long, and would deal with initializing a device context, then with querying
which plotter
pen colours were available, then issuing an instruction to select (say) the red pen,
then telling the
robotic plotter to lift the pen, move to a particular place on the paper, lower the
pen, and move the
pen horizontally by a certain distance, and so on. Running such a program even five
or ten years
after it was written may be difficult, as it will probably contain code that is specific
to a particular
operating environment, and possibly to a particular device. Deducing that a particular
Calcomp plotter held the red pen in position five might or might not be trivial.
An example of a Declarative format might be an XML Rectangle Language, with
five elements called Rectangle, each with a colour="Coates3801"
attribute.
In this example, although the recipient of the document might not know what Coates3801
means,
it is not necessary to perform a computation or to run a program in order to comprehend
the
intent to draw five rectangles. The outcome has been described, and not the mechanism.
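A minimal sketch of such a document follows, held in a Python string so that it can be inspected. The Rectangle element and the colour attribute come from the description above; the enclosing Drawing element is a hypothetical addition, needed only because an XML document has a single root element.

```python
import xml.etree.ElementTree as ET

# A hypothetical instance of the XML Rectangle Language.
DRAWING = """<Drawing>
  <Rectangle colour="Coates3801"/>
  <Rectangle colour="Coates3801"/>
  <Rectangle colour="Coates3801"/>
  <Rectangle colour="Coates3801"/>
  <Rectangle colour="Coates3801"/>
</Drawing>"""

root = ET.fromstring(DRAWING)

# The intent is recoverable by inspection alone; nothing is executed.
rectangles = root.findall("Rectangle")
rectangle_count = len(rectangles)
colours = {rectangle.get("colour") for rectangle in rectangles}
```

The parser is a convenience here: a person reading the text directly could count the five Rectangle elements just as well, which is the point of the declarative form.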
Open And Closed Format Specifications
We shall denote by Open Format Specification a specification with the following characteristics,
listed with the most
important first, from the perspective of Very Long Time Archiving of Documents:
-
Conforming objects can be created and
manipulated freely, without needing permission
or payment of royalties;
-
Conforming implementations can be
created freely, without needing permission or
payment of royalties;
-
The specification itself is available
and can be copied freely, without needing
permission or payment of
royalties;
-
In addition to describing an Open
Format, the Specification itself is available
in a format defined by an Open Format
Specification.
We shall describe each of these characteristics in
turn. A specification which does not meet any of them,
or that meets only the first, we shall denote as a
Closed Format. It is necessary to consider that, after
a Very Long Time, the organization that issued the
Specification may or may not still exist. However, if
copyright still pertains, it might be that it is no
longer
possible to use the Specification until copyright
expires. Future changes in copyright law may mean that
copyright no longer ever expires.
Objects can be created freely
If this is not the case, then explicit permission
must be obtained from the controlling organization
to create the archive, and also to give permission
for the archive to be accessed and used. This
permission must of course
be stored along with the object.
Implementations can be created Freely
After A Very Long Time, it might be that no
implementation exists that can still be run. In
order to make use of the archive, a new
implementation will need to be written. For
example, digital hypertext literature written using
Hypercard can often no longer be run or
experienced; one way to preserve Hypercard-based
literature might be to create an open source
program to run them, but this is a difficult
proposition [Liu2005].
The specification can be copied freely
A Very Long Time from now, a commercially-old
specification may well be unavailable. For
example, ISO SQL 92 [ref] has been withdrawn after
less than 20 years, and is no longer for sale.
Therefore, copies of the specification should be
archived along with the documents, and that may
require permission. When the archive is used, the
copyright status of the specification must be very
clear.
The Specification itself is written using an Open Format
An example of a closed format electronic document
might be a Magic Wand file (Magic Wand was a word
processor in the 1970s and 80s). The format is
binary and proprietary, and is no longer available.
Magic Wand is also no longer available, and all
existing license keys will no doubt have expired.
So a specification for a graphics file format (say)
that was archived in Magic Wand format would not
now be very useful.
Where the format is not entirely open, text-based
formats are generally to be preferred over binary
formats, as being easier to pick apart byte by
byte, line by line, in the future.
In the case that the Specification is not
available in an open format, multiple formats
should be used, perhaps including a bitmap image
for each page of a document, to maximise the chance
that at least one of the formats can be read in the
future: a sort of digital Rosetta Stone.
Multiple Copies, Multiple Locations.
Some digital documents, like antiquarian books, are scarcer than others. With antiquarian
books,
commercial value is related both to scarcity and to interest. With digital documents,
commercial
value is determined by ease of availability and interest. Documents that are widely
copied are easier
to access. The license or copyright by which digital documents are released determines
how easy
it is for other people to share copies of them. However, as digital documents are
more widely disseminated,
there is a greater chance of corrupted or changed copies emerging. This can be counteracted
by
providing a digital fingerprint, known as a signature or checksum hash, along with
the file. This
does not prevent alterations, but makes it possible for people to test to see if the
file has been changed.
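The fingerprint test can be sketched as follows. This is a minimal illustration that assumes SHA-256 as the hash; the text above does not name an algorithm, and the function names and sample file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 digest of a file as hexadecimal text."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that very large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_fingerprint: str) -> bool:
    """Compare a file against the fingerprint published alongside it."""
    return fingerprint(path) == recorded_fingerprint


with tempfile.TemporaryDirectory() as scratch_name:
    document = Path(scratch_name) / "document.xml"
    document.write_bytes(b"<doc>archived text</doc>")
    recorded = fingerprint(document)          # published with the file

    intact = is_unchanged(document, recorded)

    document.write_bytes(b"<doc>altered text</doc>")
    alteration_detected = not is_unchanged(document, recorded)
```

As the text notes, the check detects a change but cannot prevent one, and the algorithm used must itself be documented in the archive if the fingerprint is to be verifiable later.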
Making archived documents widely available in unchanged form is sometimes referred
to as “mirroring.” A
collection of documents that is widely disseminated in this way can survive even if
only one of the mirror sites
survives. However, for this to happen, the mirrors must be funded.
A private organization might decide to have a distributed archive, with entire copies
of the archive
at several disparate geographical locations. However, if the organization ceased operations,
all of the
archives would probably be closed. Mirroring by multiple organizations is more robust,
but there has to be
a suitably sustainable funding source. Sometimes this can be provided by government
grants; in other
cases, if the documents are suitably licensed, and can be made public for commercial
use,
advertising on Web sites hosting the archive can suffice. In some communities there
are already
mirroring and archiving initiatives such as Lots of Copies Keep Stuff Safe [LOCKSS2008].
Summary
An archive that is expected to outlast the archivist must be self-contained: it must
contain
not only a single document of interest, but also information about everything needed
to decode and
use that document, at both physical and logical levels.
In order to facilitate future decoding of an archive, use separate uncompressed files
wherever possible.
A hierarchical folder or directory structure can help to keep ancillary documents
separate from
the main document.
There is no definite answer to physical storage formats at this time.
Where there is a choice of file formats, Open Formats should be used wherever possible.
Storing information in multiple parallel formats is wise for archiving, as is using
multiple locations.
Designing XML-based Formats for Longevity
In this
section we shall assume that the reader has some familiarity with
the Extensible Markup Language and associated terminology.
By the term XML-based Format we intend to denote not only an XML
Vocabulary or set of vocabularies, as might be specified in some
Schema language such as XSD or DTD or RelaxNG, but also to include
any usage guides, documentation, examples and associated social
culture.
We must not assume that XML Processing models
will remain the same for A Very Long Time. For example,
xml:id, xml:base, xinclude and
other low-level specifications have arisen within the past decade, and
there is no reason to suppose that new specifications will not
similarly come into being. We can create some guidelines that
follow from this observation.
Avoid Implicit Content
Some XML Schema languages,
including XSD and DTDs, have the ability to provide “default”
attribute values, or, in some cases, even default element content.
The document is augmented
by this content after validation against a schema by a software
process. Since we cannot assume that software processes will still
be runnable A Very Long Time from now, we should make sure that the
archived XML Document can be used without schema validation, and,
in XML terminology, is a stand-alone document.
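As a rough illustration of how implicit content might be detected before archiving, the following sketch parses a document twice with the expat parser from the Python standard library: once reporting all attributes, and once reporting only those actually written in the document. The function names are hypothetical, and the sketch sees only defaults declared in the internal DTD subset; defaults supplied by an external DTD or by an XSD schema would need other tools.

```python
import xml.parsers.expat


def _collect_attributes(xml_text, specified_only):
    """Parse xml_text and return a list of (element name, attributes) pairs."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # When true, expat reports only attributes written in the document
    # instance, omitting those supplied by ATTLIST defaults in the DTD.
    parser.specified_attributes = specified_only
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(xml_text, True)
    return seen


def implicit_attributes(xml_text):
    """List the attributes that exist only because of DTD defaults."""
    with_defaults = _collect_attributes(xml_text, False)
    explicit_only = _collect_attributes(xml_text, True)
    report = []
    # Both passes visit the same elements in the same order.
    for (name, all_attrs), (_, written) in zip(with_defaults, explicit_only):
        added = {key: value for key, value in all_attrs.items() if key not in written}
        if added:
            report.append((name, added))
    return report
```

An empty report suggests, within the limits stated above, that the document does not depend on attribute defaults.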
Avoid Obscure Features
Any feature of XML, or of any other format, that is not widely implemented, or whose
behaviour varies between implementations, or whose semantics are not clearly
documented,
should be avoided. For XML that might include, for example:
- Notation, a feature whose semantics are not well-defined;
- Use of parameter entities in the internal document type definition subset at the start
  of a document, as support for this feature is not required by the XML specification,
  and not all XML processors implement it;
- The use of inline general entities to introduce markup, as this is not always supported;
- Use of character encodings other than UTF-8 or UTF-16 or US-ASCII, the only
  encodings currently guaranteed to be supported;
- Reliance on “well-known” sets of entity definitions such as those provided
  by ISO 8879:SGML or by HTML; even if these
  entity sets are included with the document, an XML processor that does not support
  DTD processing will be unable to handle the document.
Avoid Cryptic Names
A necessary precondition of usability is comprehensibility.
If someone else is to make use of our markup they must
understand it.
Consider the example in the following listing, in which short element
names and a flat structure have been used for a novel
(the actual content has been reduced to numbers to avoid
accidentally introducing anything of interest into this paper):
<h>1</h><p>3</p><p>10</p><p>21</p><h>64</h><p>129</p>
In the listing,
the relationship of the elements is not explicit.
An improved version is given in listing 2:
<c><h>1</h><p>3</p><p>10</p><p>21</p></c><c><h>64</h><p>129</p></c>
Someone inspecting this document might not have enough information to deduce
the intent of the markup, and we could make the further improvement of listing 3:
<chapter>
<heading>1</heading>
<paragraph>3</paragraph>
<paragraph>10</paragraph>
<paragraph>21</paragraph>
</chapter>
<chapter>
<heading>64</heading>
<paragraph>129</paragraph>
</chapter>
Of course, such lengthy element names may be inconvenient when
editing a document.
Plausible compromises include converting to an archival format,
or making sure that each document includes sufficient information to
allow someone to reconstruct this meaning. Of these, the first approach
seems more likely to stand the test of time.
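Conversion to an archival format of this kind can be largely mechanical. The sketch below is one possible approach, not a recommended tool: the mapping table and the function name are hypothetical, and the enclosing novel element in the usage example is an assumption, since the listings above show fragments without a single root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the terse editing vocabulary to descriptive names.
LONG_NAMES = {"c": "chapter", "h": "heading", "p": "paragraph"}


def to_archival_form(xml_text, names=LONG_NAMES):
    """Return xml_text with each short element name replaced by its long form."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Names missing from the table are left untouched.
        element.tag = names.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")
```

For example, to_archival_form("<novel><c><h>1</h><p>3</p></c></novel>") yields a document using chapter, heading and paragraph. The mapping table itself is worth archiving alongside the result.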
Document the significance of markup items
Write clear descriptions of the purpose of each XML element and attribute;
be careful not to rely on the element name in the description. For example, do not
describe the bulletlist
element as containing a bulleted list; describe
it as containing a sequence of independent items that form a related group or
sequence, and give an example. The meaning and usage of terms such as Paragraph, Section,
List,
Title, Bullet and Flower, Folio and Explication have not remained constant over the
past few centuries, and even today
are not constant across all cultures.
A project to archive a body of documents over a Very Long Time might well ensure
that the documentation is translated into multiple languages, to increase the chance
that
a future user will be able to understand what is written. Of course, one might do
the same with
the actual documents as well as the documentation about the rhetorical and technical
contexts of the document. Changing the actual XML Document is outside the scope
of this paper,
but the possibility of creating an archival format alongside the original
document has already been suggested.
Validate the Data
If there is uncertainty about the ability of recipients and users of an archived document
to be able
to use the XML format, there must be considerably more uncertainty about their ability
to cope with
errors in the use of the format. Make sure that all XML data is well-formed: not only
the XML Document itself but all supporting information.
Validate against XML Schemas, DTDs or other test documents wherever possible. Where
data fails to validate, the archivist should
attempt to ask the supplier to provide corrected content; where that is not possible,
the original data must obviously be
archived, but the archivist should consider including a corrected version of
any documents that do not validate, along with information about the validity problems.
Including detailed notes about the format employed
is of little use if the documents do not correspond to the documentation.
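The first of these checks, well-formedness, needs very little machinery. The following minimal sketch uses only the Python standard library; the function name is hypothetical. It does not validate against a DTD or schema, for which a validating processor would be required.

```python
import xml.etree.ElementTree as ET


def well_formedness_problem(xml_text):
    """Return None if xml_text is well-formed, else a description of the first error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        line, column = error.position
        return f"line {line}, column {column}: {error}"
    return None
```

The returned description could be stored with the original data as part of the information about problems in a document.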
Mean What You Mean To Mean
There is no clear definition of Meaning that satisfies all users of a typical XML
document. Some people
are interested in denotational semantics from a semiotic viewpoint (what is signified?),
some in behavioural semantics (what does it do?), some merely in human understanding
(what is it?).
A consequence of this is that there is no single universal way to denote the meaning
of XML markup.
W3C XML Schema documents can contain annotation
elements for this purpose.
Check Links
If your document format has links, whether explicit or implicit, you should check
them before
committing the document or documents to the archive. As a minimum, make sure that
all link
targets exist. Better, make sure that the links go where they should. For example,
check the titles
of sections that are the targets of links, perhaps to the content of an element in
the link that exists
only for that purpose. For off-site links, for example to Web sites, consider archiving
a surrogate,
or including a description of the remote content. Archiving the actual remote Web
page may
require explicit permission, but for some projects this is worth while. Ensure that
you are checking
linked documents in your collection that you are about to archive, and that your link
checker
is not reaching out to a production server elsewhere in your organization! All of
this requires provision in
the design of an XML document format.
Although the World Wide Web Consortium has produced a specification for marking up
hypertext links in XML, xlink, this specification is not likely to
be sufficient even for explicit links, as it does not provide for a pattern to match
against a given target element in order to validate a link.
The xlink specification does
not attempt at all to support implicit links, where, for example, a part-number element
within a step of a repair procedure is treated as a link to a database of parts, and
for printing
is augmented by a description of the part, and for online publication also becomes
a hyperlink to an online catalogue. Like implicit content from default elements and
attributes,
implicit links are difficult to archive. The ISO SGML HyTime specification did have
support, using a
mechanism termed Architectural Forms, for at least some level of implicit linking,
but HyTime has
not been adopted by the XML community.
The best approach to linking in documents that are expected to be archived, then,
is to include in each document some information about link targets. This could be
a short
textual description or an entire resource. For the purpose of link checking, static
content
in the document can be used, such as an attribute whose value specifies an XPath expression,
and another attribute giving a regular expression (a text pattern) that the result
of evaluating
that XPath expression must match. For example, one might say that the nearest enclosing
section
title of the link target must contain the word Plastic. The utility of this is at
archive creation time,
and during the regular document maintenance life-cycle.
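The XPath-plus-pattern check described above might be sketched as follows. The element name link and the attribute names target-path and target-pattern are invented for this illustration and are not taken from xlink or any other specification; the standard-library ElementTree module supports only a subset of XPath, so a production checker would probably need a fuller implementation.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<manual>
  <section id="s1">
    <title>Plastic Housings</title>
    <para>Fit the cover (<link target-path=".//section[@id='s2']/title"
      target-pattern="Plastic">details</link>).</para>
  </section>
  <section id="s2">
    <title>Metal Fasteners</title>
    <para>See also <link target-path=".//section[@id='s1']/title"
      target-pattern="Plastic">housings</link>.</para>
  </section>
</manual>"""


def check_links(xml_text):
    """Return (path, problem) pairs for links whose targets fail their own test."""
    root = ET.fromstring(xml_text)
    problems = []
    for link in root.iter("link"):
        path = link.get("target-path")
        if path is None:
            continue  # not a checkable link in this hypothetical vocabulary
        # An absent pattern is treated as matching any target text.
        pattern = link.get("target-pattern", "")
        target = root.find(path)
        if target is None:
            problems.append((path, "target does not exist"))
        elif not re.search(pattern, "".join(target.itertext())):
            problems.append((path, "target text does not match " + repr(pattern)))
    return problems
```

In the sample, the first link claims that its target title contains the word Plastic, but the title of section s2 does not, so check_links(SAMPLE) reports that link and accepts the second.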
Provide for Translations
After A Very Long Time, language will have changed. Documents that have been translated
into multiple languages clearly have a better chance of being understood. A single
Rosetta Stone
is more likely to survive in one place than three separate stones: consider archiving
a single file
containing all of the translations, or at least a fragment from each language,
perhaps first a German paragraph and then a French one and then
a Swedish one (in alphabetical order by their own names, but that breaks down for
languages written
with non-Latin scripts). Remember that any piece of human-readable text may need to
contain markup
in some language or other, whether to delimit right-to-left and left-to-right components,
or for Ruby-style annotations, or for emphasis. As a result, all natural-language
fragments should
be in XML elements (not attributes), where they can be distinguished by xml:lang
values.
Designing XML documents for translation has been described elsewhere [e.g. ITS] in
detail.
It can be helpful to have block-level translations, so that (for example)
a paragraph or list item in one language can be immediately followed by
corresponding text in another language. This does not always work,
since the rhetorical structures in different cultures may suggest
that material be most naturally presented in different sequences.
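To show how xml:lang values let software distinguish parallel translations held in a single file, here is a small sketch. The notice and para element names and the sample sentences are invented for the illustration; the sketch looks only at elements carrying xml:lang directly and does not model inheritance of the attribute by descendants.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<notice>
  <para xml:lang="de">Bitte nicht öffnen.</para>
  <para xml:lang="fr">Ne pas ouvrir.</para>
  <para xml:lang="sv">Öppna inte.</para>
</notice>"""


def text_by_language(xml_text):
    """Group the text of every element that carries xml:lang by language code."""
    root = ET.fromstring(xml_text)
    found = {}
    for element in root.iter():
        language = element.get(XML_LANG)
        if language is not None:
            found.setdefault(language, []).append("".join(element.itertext()))
    return found
```

Because each fragment is an element rather than an attribute value, it remains free to contain further markup, as recommended above.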
Provide for Contextualization
Include a place for authors to make explicit not only the purpose
of the document, but how it is to be used and how it might fit into the
ecosystem of the wider context in which it is created. For example, a
dream journal might have an introduction that says the author wrote
down memories of dreams each day for a year, but the wider
context might include that this was part of a therapeutic exercise in
working out resentment towards alien visitors, and that, after
a year, the writer's perception about the visitors was changed. This
sort of explanation is not generally considered necessary for a work of fiction,
but the boundaries between perception and construction become
less clearly discernible over time. Harry Potter is a fictional boy,
but of course the train from King's Cross station in London, and
the oddly numbered platform there, really exist. They have been
created after the success of the books, but that small detail may no
longer be obvious a Very Long Time from now.
In a business context, the reasons behind particular decisions
and documents may not be public; it may be desirable to store two
versions of a document, one of which is to be made available only
fifty years (say) after its original creation. Some governments have
similar policies, although the details vary widely between countries.
As an example of contexts, consider the following amounts
of money: all except one represent the same value.
Table II
Amount |
1/3 |
6/8 |
0.33 |
80 |
327 |
0.17 |
0.19 |
Adding some context would make this clearer:
Table III
Amount |
£1/3 |
£-/6/8 |
£0.33 |
80d |
£327 |
$0.17 |
$0.19 |
To belabour the point, even here there is some difficulty. The second item,
£-/6/8, is a now-obsolete notation for six shillings and eight pence in the pre-decimal
British currency; with twelve pennies in
the shilling that made 80d (denarii, pennies). Italy, in pre-Euro days, used the Lira,
and used the same symbol for currency, although devaluation meant there were of the
order
of a thousand Italian Lira to the British Pound (at some point in history); similarly,
the two dollar
figures are from two different countries using the dollar as a currency symbol. Thus,
notations and symbols that we take for granted may lose their meaning over time,
or not be clear to future readers. There is no clear way to determine how
much context to retain, and yet we must persist in claiming that the more context
is recorded, the greater the chance of successful communication.
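The arithmetic behind the pre-decimal figures can be made explicit with a short sketch, assuming the pre-decimal British ratios of twelve pence to the shilling and twenty shillings to the pound. The function and variable names are hypothetical.

```python
from fractions import Fraction

PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20


def to_pence(pounds=0, shillings=0, pence=0):
    """Convert a pounds/shillings/pence amount to a number of pence."""
    return (pounds * SHILLINGS_PER_POUND + shillings) * PENCE_PER_SHILLING + pence


# Six shillings and eight pence: 6 * 12 + 8 pence.
six_and_eight = to_pence(shillings=6, pence=8)

# One third of a pound, with 240 pence to the pound.
third_of_a_pound = Fraction(1, 3) * to_pence(pounds=1)
```

Both values come to 80 pence, which is the kind of equivalence a future reader cannot recover without the conversion ratios being recorded somewhere.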
Don't be Inventive
Use existing specifications where possible; if that is impractical,
consider copying techniques from existing specifications, which can then
be included in the archive. The more people who use a specific technique,
the more likely it is to survive.
Summary
Use validated, well-documented XML that wherever possible
relies on widespread practices.
XML or Not XML?
The scope of this paper is intended to be considerations for
long-term storage of XML Documents. However, what if the best things to
store are not XML Documents at all? Such considerations could easily fill
volumes; in this paper, we have room only to delineate a small number
of features of XML Documents that make them suitable.
Textual Format
As we have already discussed, textual formats tend to be more robust in
the face of possible data corruption than compressed or binary formats. If a
single storage device block becomes unreadable, the rest of the document
after the lacuna will still be readable, albeit incomplete.
Explicit End Markers
Not all text formats have explicit markup to surround objects, or, if they
do, may use an ambiguous symbol to mark an ending, such as a close brace or
a closing parenthesis. The possibility of data corruption means that the additional
redundancy of repeating an element name can help to identify errors and limit
the scope of corruption.
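As an illustration of how named end markers limit the scope of corruption, the sketch below salvages complete paragraph elements from a damaged byte stream without attempting a full parse. It assumes the long element names used earlier in this paper and paragraphs containing only text; the function name is hypothetical, and a real recovery tool would need to be considerably more careful.

```python
import re

# A paragraph counts as intact only if its start tag, its text and the
# matching named end tag have all survived.
COMPLETE_PARAGRAPH = re.compile(r"<paragraph>([^<]*)</paragraph>")


def surviving_paragraphs(damaged_text):
    """Return the text of every paragraph element that is still complete."""
    return COMPLETE_PARAGRAPH.findall(damaged_text)


# Simulated damage: part of the heading and the start of the first
# paragraph have been overwritten with NUL characters.
DAMAGED = (
    "<chapter><heading>1</hea\x00\x00\x00\x00ragraph>3</paragraph>"
    "<paragraph>10</paragraph><paragraph>21</paragraph></chapter>"
)
```

Here surviving_paragraphs(DAMAGED) recovers the paragraphs containing 10 and 21 and discards the one that was cut into. With an anonymous closing brace in place of the named end tag, it would be harder to tell which structure a given closing marker belonged to.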
Embedded Usage
XML is used in devices ranging from automobile engines to television sets. Its use
is very widespread, and it is found in devices expected to last for thirty years or
more.
These devices cannot easily be changed if XML becomes obsolete and is replaced by
(say) Aldus PageMaker files.
This means that the chance of XML technology surviving, or being historically retrievable,
is statistically significant: 3% of 1000 years is 30 years.
Device Independent
XML Documents do not generally rely on specific hardware. For example, one does not
embed Epson dot-matrix printer control sequences in XML documents in order to generate
underlining. XML-based graphics formats are generally declarative, and do not generally
rely on
initializing a plotter or on the size of a sheet of paper.
Open Specification
The XML Specification meets all four of the criteria given in this paper for a specification
to be considered fully Open. This is not to say that no other format does: many do.
Conclusions
This paper has outlined some considerations for long-term storage of
XML documents.
A small amount of extra care and consideration may help to
provide a framework for document creation with a Very Long Time in
mind.
An archived document should have an overview document accompanying
it, that documents which specifications were used and why, gives a high-level summary
of the document itself, lists
all copyrights and trademarks that may apply (and patents, non-disclosure agreements,
licenses or other agreements or restrictions on republishing the document in the future),
and lists all associated files and their purpose.
Since there may be hundreds or even tens of thousands of ancillary files for a single
document, especially if specifications are included, use a hierarchical file structure
to give prominence to the
actual document.
Use open formats wherever possible, and prefer textual formats to binary formats.
Do not use complex compound file formats such as zip files, which are difficult or
impossible to
repair if they become corrupted. Similarly, store files uncompressed wherever possible.
For long-term archiving, use multiple organizations and multiple physical locations
for the data.
Do not store data in a vault covered by a large pyramid with a lidless eye carved
into it, as you will
attract aliens.
References
[Applen2009] Applen, J. D. and McDaniel, Rudy, “The Rhetorical Nature of XML,” Routledge, 2009.
(to be supplied; a number of references were consulted)
[Liu2005] Liu, Alan, Durand, David, Montfort, Nick, Proffitt, Merrilee, Quin, Liam R. E.,
Réty, Jean-Hugues, and Wardrip-Fruin, Noah, “Born-Again Bits: A Framework for Migrating
Electronic Literature,” 2005, online at www.eliterature.org/pad/bab.html, accessed July 2010.
[Moxon1683] Moxon, Joseph, “Mechanick exercises on the whole art of printing,” 1683/4.
[Wooley2002] Woolley, Benjamin, “The Queen's Conjurer: The Science and Magic of Dr. John Dee, Adviser
to Queen Elizabeth I,”
Holt, 2002 (not itself a scholarly book but a good and clear introduction to the topic).
[XCB] The X protocol C-language Binding (XCB), available at xcb.freedesktop.org,
accessed July 2010.