How to cite this paper
Cowan, John. “Transforming schemas: Architectural Forms for the 21st Century.” Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). https://doi.org/10.4242/BalisageVol10.Cowan01.
Balisage: The Markup Conference 2013
August 6 - 9, 2013
Balisage Paper: Transforming schemas
Architectural Forms for the 21st Century
John Cowan
Senior Content Architect
LexisNexis
John Cowan works for LexisNexis, which he likes to call "$EMPLOYER". On his 2011 tax
returns, he listed his occupation as "ontologist". He pushed both XML 1.1 and XML
1.0 Fifth Edition through the W3C XML Core Working Group, of which he somehow remains
a member. He also hangs out on numerous mailing lists and blogs, masquerading on the
A forum as the expert on B and on the B forum as the expert on A. His friends
say that he knows at least something about almost everything; his enemies, that he
knows far too much about far too much.
Copyright © 2013 by John Cowan
Abstract
The traditional approach to transforming XML documents is a three-step pipeline: validate,
transform, validate. The SGML feature called architectural forms combined enhancements
to DTDs with annotations in source documents to allow a valid SGML document to be
automatically transformed into another SGML document valid against a different DTD.
This permitted document creators to conform to a general document architecture without
having to constrain their own documents to every detail of a specific schema. In
the XML world, however, the emphasis has been on the creation of comprehensive schemas
rather than easy transformation, and the ideas behind architectural forms have mostly
been lost. This paper attempts to explain how to restore those ideas to XML practice.
Table of Contents
- 1. Introduction
- 2. Element Renaming and the Renaming Attribute
- 3. Attribute Defaulting
- 4. Element Reordering
- 5. Occurrences
- 6. Character Content
- 7. Mixed Content
- 8. Attribute Mapping
1. Introduction
The traditional approach to transforming XML documents is a three-step pipeline: validate,
transform, validate. (Sometimes, of course, one or both of the validation steps is
omitted.) Architectural forms, a feature first of the SGML-based hypermedia standard HyTime and then of SGML itself,
made use of a combination of enhancements to DTDs and annotations in source documents
to allow a two-step pipeline for certain simple transformations. In this pipeline,
a valid SGML document could be automatically transformed using a specialized SGML
parser, called an architectural engine (AE), into another SGML document valid against a more general DTD known as the meta-DTD.
This permitted document creators to conform to a general document architecture without
having to constrain their own documents to every detail of a specific schema.
However, DTDs have not seen wide uptake in the XML world, and the few XML architectural
engines that have been built have conformed more to the letter than to the spirit
of architectural forms. The emphasis has been on the creation of comprehensive and
complex schemas which attempt simultaneously to serve local needs and the needs of
interchange. Such schemas are usually arrived at by difficult, lengthy, and highly
political negotiations between interested parties, with victory often going to the
participants with the greatest weight of Sitzfleisch rather than the best ideas.
This paper describes an attempt to return to those thrilling days of yesteryear by
providing a modern equivalent of SGML architectural engines. In principle any grammar-based
schema language such as XML Schema or RELAX NG would be suitable for the methods outlined
here. However, the software development (still very much a work in progress as of
this writing) is using the much simpler Examplotron schema language. Examplotron
is not well-known or much used in the XML environment, but I believe it to be extremely
suitable to the stripped-down MicroXML environment in which I am now primarily interested.
Since most people don't know Examplotron, I have written the paper to be accessible
to anyone who can read simple DTD declarations.
In this paper, I will speak of the source document, which is the input to a schema-based transformation engine (TE), and of the target document, which is a TE's output. Additional inputs are the source schema and the target schema. In this paper the schemas are expressed as DTD fragments, but in actual use they
will be Examplotron 0.8 schemas. In addition, we may supply the TE with the transformation name of the particular transformation to be performed on the source, possibly one of many
such transformations. For clarity's sake, I will speak as if the various transformations
are made one by one, but except for attribute defaulting they are all made simultaneously.
For example, if all elements named foo are to be renamed bar, and all elements named
bar are to be renamed baz, that does not mean that both foo and bar elements wind up
being named baz: the original foo elements become bar, and the original bar elements
become baz.
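The difference between simultaneous and one-by-one renaming can be sketched in a few
lines of Python. This is only an illustration, not the TE's implementation; the
RENAMES table and both function names are hypothetical.

```python
# Hypothetical rename table for illustration: source name -> target name.
RENAMES = {"foo": "bar", "bar": "baz"}

def rename_all(names, renames):
    """Rename every name in a single pass.

    Each lookup uses the original source name only, so the result of one
    renaming is never fed into another.
    """
    return [renames.get(name, name) for name in names]

def rename_one_by_one(names, renames):
    """The sequential reading that the text rules out, shown for contrast."""
    for old, new in renames.items():
        names = [new if name == old else name for name in names]
    return names

print(rename_all(["foo", "bar"], RENAMES))         # ['bar', 'baz']
print(rename_one_by_one(["foo", "bar"], RENAMES))  # ['baz', 'baz']
```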
2. Element Renaming and the Renaming Attribute
The first and simplest kind of transformation to be performed is element renaming. A TE does this by looking at each element of the source document for an attribute
whose name is the same as the transformation name supplied to the TE. This attribute
is called the renaming attribute.
For example, suppose we have the following source document:
<limerick>
<title>Relativity</title>
<a>There was a young lady named Bright</a>
<a>Who could travel much faster than light.</a>
<b>She set out one day</b>
<b>In a relative way</b>
<a>And returned the previous night.</a>
</limerick>
If we wish to transform it from its limerick-specific schema to a more general stanza
schema, we might add a renaming attribute named stanza to every element, like this:
<limerick stanza="stanza">
<title stanza="title">Relativity</title>
<a stanza="line">There was a young lady named Bright</a>
<a stanza="line">Who could travel much faster than light.</a>
<b stanza="line">She set out one day</b>
<b stanza="line">In a relative way</b>
<a stanza="line">And returned the previous night.</a>
</limerick>
Running a TE on the above document, specifying stanza as the transformation name,
would produce the following target document:
<stanza>
<title>Relativity</title>
<line>There was a young lady named Bright</line>
<line>Who could travel much faster than light.</line>
<line>She set out one day</line>
<line>In a relative way</line>
<line>And returned the previous night.</line>
</stanza>
Note that all occurrences of the renaming attribute have been removed from the target
document.
What happens if an element doesn't have a renaming attribute? The answer is that
the element is dropped in its entirety. For example, suppose we did not have a stanza
attribute on the source document's title element. In that case, the target document
would contain only a stanza element with five line child elements.
If you don't provide a TE with a transformation name, there is no renaming attribute,
and rather than dropping all the elements, none of them are renamed. However, the
target document may still differ from the source document in other ways.
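The renaming rules of this section can be sketched with Python's standard
xml.etree.ElementTree module. This is a minimal illustration under my own assumptions,
not the TE software itself: the function name rename is hypothetical, and the sample
is a shortened limerick whose title deliberately lacks the renaming attribute.

```python
import xml.etree.ElementTree as ET

def rename(elem, name):
    """Return a renamed copy of elem, or None if elem must be dropped.

    name is the transformation name; None means no renaming attribute exists,
    in which case every element keeps its own name.
    """
    if name is None:
        target = ET.Element(elem.tag, dict(elem.attrib))
    elif name in elem.attrib:
        # The renaming attribute supplies the new name and is itself removed.
        attrs = {k: v for k, v in elem.attrib.items() if k != name}
        target = ET.Element(elem.attrib[name], attrs)
    else:
        return None  # no renaming attribute: dropped in its entirety
    target.text, target.tail = elem.text, elem.tail
    for child in elem:
        renamed = rename(child, name)
        if renamed is not None:
            target.append(renamed)
    return target

SOURCE = (
    '<limerick stanza="stanza">'
    "<title>Relativity</title>"
    '<a stanza="line">There was a young lady named Bright</a>'
    '<b stanza="line">She set out one day</b>'
    "</limerick>"
)

target = rename(ET.fromstring(SOURCE), "stanza")
print(ET.tostring(target, encoding="unicode"))
```

Here the title element is dropped because it carries no stanza attribute, while
passing None as the transformation name leaves all element names unchanged.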
Note
The concept of renaming attributes comes from AEs; however, AEs do not require the
name of the renaming attribute to be the same as the transformation name, and have
different and more flexible rules about processing elements without renaming attributes.
3. Attribute Defaulting
This business of adding renaming attributes directly to the source document is irritating,
and may be impossible if we aren't able to change the source document. Instead, we
can take advantage of attribute defaulting by specifying a source schema. Consider the following DTD fragment:
<!ATTLIST limerick stanza "stanza">
<!ATTLIST title stanza "title">
<!ATTLIST a stanza "line">
<!ATTLIST b stanza "line">
This says that in the limerick element, if no stanza attribute is supplied, its value
is assumed to be stanza. Likewise, for the title element, the default value of the
stanza attribute is title, and for the a and b elements, it is line. Now we no longer
have to alter our original limerick document when we want to transform it. If we
specify the transformation name as stanza, we will get the same target document that
we saw in the previous section.
What is more, we can provide more than one renaming attribute in the same source schema.
Suppose we add the following declarations to the above source schema:
<!ATTLIST limerick estrofa "estrofa">
<!ATTLIST title estrofa "título">
<!ATTLIST a estrofa "línea">
<!ATTLIST b estrofa "línea">
If we specify the transformation name as estrofa rather than stanza, we will generate
a target document whose element names are in Spanish rather than English. However,
the TE cannot automatically remove the defaulted stanza attribute when doing an
estrofa transformation, nor vice versa, because it does not know which attributes
might be used as renaming attributes in a different transformation run. To suppress
them from the target document, we must provide the TE with a list of the renaming
attributes that are not being used for the current transformation. In the rest of
this paper we will assume that this list has been provided.
Attribute defaulting is not restricted to renaming attributes. If any attribute is
given a default value by the source schema but does not appear in the source document,
it will be created, and by default will appear in the target document. Attribute
defaulting is done in advance of all other transformations; a default attribute may
have its name or value changed by a later transformation.
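Attribute defaulting and the suppression of unused renaming attributes can be sketched
as two small passes over the tree. The DEFAULTS dictionary below is a hypothetical
in-memory stand-in for the source schema's declarations, and the function names are
my own; the explicit stanza="verse" attribute is invented only to show that an
explicit value is never overridden by a default.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of the source schema's attribute defaults.
DEFAULTS = {
    "limerick": {"stanza": "stanza", "estrofa": "estrofa"},
    "title": {"stanza": "title", "estrofa": "título"},
    "a": {"stanza": "line", "estrofa": "línea"},
    "b": {"stanza": "line", "estrofa": "línea"},
}

def apply_defaults(root, defaults):
    """Add each defaulted attribute that an element does not already carry."""
    for elem in root.iter():
        for attr, value in defaults.get(elem.tag, {}).items():
            elem.attrib.setdefault(attr, value)

def suppress(root, unused):
    """Remove renaming attributes that belong to other transformations."""
    for elem in root.iter():
        for attr in unused:
            elem.attrib.pop(attr, None)

root = ET.fromstring(
    "<limerick>"
    "<title>Relativity</title>"
    '<a stanza="verse">There was a young lady named Bright</a>'
    "</limerick>"
)
apply_defaults(root, DEFAULTS)       # runs before any other transformation
explicit = root.find("a").get("stanza")
suppress(root, ["stanza"])           # preparing for an estrofa transformation
print(ET.tostring(root, encoding="unicode"))
```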
Note
Attribute defaulting is inherent to DTD processing. The version of Examplotron used
by TEs, Examplotron 0.8, allows the specification of default values for attributes,
and in fact for elements too.
4. Element Reordering
So far, we haven't had to deal with child elements appearing in a different order
in the source and target documents. However, this can often happen when the source
document is data-oriented rather than content-oriented. In order to know how to reorder
child elements, we must provide the TE with a target schema. Here's a simple target
schema specifying a document containing people's names:
<!ELEMENT people (person*)>
<!ELEMENT person (last, first)>
In this schema, we see that a people element contains zero or more person elements
and nothing else, and that each person element contains last and first elements in
that order.
Now here's a source document:
<people>
<person>
<first>John</first>
<last>Cowan</last>
</person>
<person>
<first>Dorian</first>
<last>Cowan</last>
</person>
</people>
Suppose we pass this source document and the target schema to a TE without specifying
a transformation name. In that case, there is no renaming attribute, and so no element
renaming is done. However, since the order of child elements for the person element
in the source document is not valid according to the target schema, they will be
reordered so as to be valid in the target document, producing this:
<people>
<person>
<last>Cowan</last>
<first>John</first>
</person>
<person>
<last>Cowan</last>
<first>Dorian</first>
</person>
</people>
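As a rough sketch of the reordering step, the following Python sorts each element's
children into the order named by the target content model. TARGET_ORDER is a
hypothetical representation of the two declarations above, and the sketch assumes
that every child's name appears in its parent's model.

```python
import xml.etree.ElementTree as ET

# Hypothetical representation of the target content models as name sequences.
TARGET_ORDER = {"people": ["person"], "person": ["last", "first"]}

def reorder(elem, order):
    """Sort each element's children into the sequence its content model names."""
    names = order.get(elem.tag)
    if names is not None:
        # sorted() is stable, so same-named siblings keep their source order.
        elem[:] = sorted(elem, key=lambda child: names.index(child.tag))
    for child in elem:
        reorder(child, order)

people = ET.fromstring(
    "<people>"
    "<person><first>John</first><last>Cowan</last></person>"
    "<person><first>Dorian</first><last>Cowan</last></person>"
    "</people>"
)
reorder(people, TARGET_ORDER)
print(ET.tostring(people, encoding="unicode"))
```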
Note
AEs do not perform element reordering.
5. Occurrences
Both source and target schemas can specify how many occurrences a child
element can have within its parent element. In DTDs, we can repeat the element name
to specify a fixed number of occurrences, as in this source schema for our limerick
document:
<!ELEMENT limerick (title, a, a, b, b, a)>
<!ATTLIST limerick index "poem">
<!ATTLIST a index "firstline">
Now suppose we run a TE, passing it the transformation name index, our original
limerick document, the above source schema, and the following target schema:
<!ELEMENT poem (firstline)>
The renaming attribute index will rename the limerick element to poem and the three
a elements to firstline, dropping the title and b elements altogether. But since the
target schema permits only a single firstline element in each poem element, the
second and third firstline elements will also be dropped, producing the following
target document:
<poem>
<firstline>There was a young lady named Bright</firstline>
</poem>
This is suitable for inclusion in an index of first lines.
On the other hand, if the target schema requires more occurrences
of an element than the source schema provides, sufficient elements are
created following the mapped elements. For an example of that process, consider this
source document with explicit renaming attributes:
<couplet limerick="limerick">
<line limerick="a">Go and tell the Spartans, passerby,</line>
<line limerick="b">That here, obedient to their laws, we lie.</line>
</couplet>
What happens if we transform this into a limerick using the limerick schema as the
target schema? (There is nothing inherent in a schema that says whether it is a source
or a target, only in how it is provided to a TE.) Limericks have to have a title
and five lines, but we have only two lines here, one mapped (for some unknown reason)
to an a element and one to a b element. Consequently, we get this target document:
<limerick>
<title/>
<a>Go and tell the Spartans, passerby,</a>
<a/>
<b>That here, obedient to their laws, we lie.</b>
<b/>
<a/>
</limerick>
Not very useful or pretty, perhaps, but certainly valid.
In this paper, newly created elements are shown as empty. However, if the Examplotron
schema provides a default value for them, it will be used.
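One way to picture the dropping and creating of elements is as filling the slots of a
fixed name sequence from the already-renamed source children. The sketch below is my
own reading of the behavior described above, not the TE's actual algorithm, and the
function name fit_sequence is hypothetical.

```python
import xml.etree.ElementTree as ET

def fit_sequence(children, sequence):
    """Fill each slot of a fixed name sequence from the source children, in order."""
    pool = list(children)
    result = []
    for name in sequence:
        match = next((c for c in pool if c.tag == name), None)
        if match is None:
            match = ET.Element(name)  # required but missing: create it empty
        else:
            pool.remove(match)        # each source child fills at most one slot
        result.append(match)
    return result                     # children still in pool are dropped

# The couplet after element renaming has already been applied.
couplet = ET.fromstring(
    "<limerick>"
    "<a>Go and tell the Spartans, passerby,</a>"
    "<b>That here, obedient to their laws, we lie.</b>"
    "</limerick>"
)
couplet[:] = fit_sequence(couplet, ["title", "a", "a", "b", "b", "a"])
print(ET.tostring(couplet, encoding="unicode"))
```

The same function reproduces the index example: given three firstline children and
the one-slot sequence for poem, only the first firstline survives.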
When specifying the content model of an element in a source or target schema, we can
follow the name of a child element with * to mean "zero or more occurrences", as shown
in the declaration of the people element. In the same way, ? means "zero or one
occurrence" and + means "one or more occurrences". All these indicators are respected
by a TE. So if two foo child elements appear in the source document, but the target
schema specifies foo?, then the second one will be dropped. A TE cannot construct
transformations based on more complex content models like ((a,b)+), in which the
occurrence indicator follows a sequence of child element names, except as noted under
the discussion of mixed content. However, technically ambiguous content models like
(line, line?, line?), meaning from one to three line elements, which are illegal in
DTDs, are supported in Examplotron schemas as well as by a TE.
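To make the supported subset concrete, here is a hedged sketch that parses a simple
DTD-style content model into name, minimum, and maximum triples and rejects anything
more complex. The names parse_model and kept_and_created are hypothetical, and mixed
content is assumed to be recognized elsewhere.

```python
import re

# Occurrence indicators as (minimum, maximum) pairs; None means unbounded.
INDICATORS = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}
PARTICLE = re.compile(r"([A-Za-z_][\w.-]*)([?*+]?)")

def parse_model(model):
    """Parse '(name, name?, ...)' into a list of (name, minimum, maximum)."""
    text = model.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError(f"unsupported content model: {model}")
    particles = []
    for part in text[1:-1].split(","):
        match = PARTICLE.fullmatch(part.strip())
        if match is None:  # e.g. a nested group such as (a,b)+
            raise ValueError(f"unsupported content model: {model}")
        name, indicator = match.groups()
        particles.append((name, *INDICATORS[indicator]))
    return particles

def kept_and_created(available, low, high):
    """How many source elements one particle keeps and how many it must create."""
    kept = available if high is None else min(available, high)
    return kept, max(0, low - kept)

print(parse_model("(line, line?, line?)"))
print(kept_and_created(2, *parse_model("(foo?)")[0][1:]))  # two foo children, foo? allowed
```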
Note
AEs neither drop unwanted elements nor create new ones, but report validation errors
instead.
6. Character Content
So far, the source and target schemas we have seen have been incomplete, because not
all the elements used in the documents have been mentioned in the schemas. In particular,
declarations for the elements whose only permitted content is characters, such as
a, firstline, and title, have been left out. Here's a complete version of the limerick
source schema with all three renaming attributes provided:
<!ELEMENT limerick (title, a, a, b, b, a)>
<!ATTLIST limerick stanza "stanza">
<!ATTLIST limerick estrofa "estrofa">
<!ATTLIST limerick index "poem">
<!ELEMENT title #PCDATA>
<!ATTLIST title stanza "title">
<!ATTLIST title estrofa "título">
<!ELEMENT a #PCDATA>
<!ATTLIST a stanza "line">
<!ATTLIST a estrofa "línea">
<!ATTLIST a index "firstline">
<!ELEMENT b #PCDATA>
<!ATTLIST b stanza "line">
<!ATTLIST b estrofa "línea">
And here is an erroneous target schema for stanza documents:
<!ELEMENT stanza (title, line*)>
<!ELEMENT title #PCDATA>
<!ELEMENT line EMPTY>
Let's see what happens if we do a stanza transformation using that target schema. We
get this target document:
<stanza>
<title>Relativity</title>
<line/>
<line/>
<line/>
<line/>
<line/>
</stanza>
Because the target schema specified the line element as empty (no child elements or
character content), the TE threw away the character content. Again, probably not
very useful, but again certainly valid.
Reordering and occurrence control are really two aspects of the same thing, and they
can both happen to the same children of an element at the same time. Here is a not-very-realistic
example. Given the source document
<root>
<a id="a1"/>
<b id="b1"/>
<a id="a2"/>
<b id="b2"/>
<a id="a3"/>
</root>
and the target schema
<!ELEMENT root (a, a, b, b, b)>
the target document will be
<root>
<a id="a1"/>
<a id="a2"/>
<b id="b1"/>
<b id="b2"/>
<b/>
</root>
That is, the a elements have been reordered before the b elements, the third a element
has been dropped as unwanted, and a third b element has been created.
Note
AEs allow greater control of what happens to character content when an element containing
it is dropped from the target document: it may be discarded or included as part of
the parent element. TEs always discard it unless the parent element's content model
is specified as mixed content.
7. Mixed Content
An element has mixed content when its content includes both child elements and characters.
Consider this limerick:
<limerick>
<title>Memory</title>
<a>There was an old man of Khartoum</a>
<a>Who kept two black sheep in his room.</a>
<b><quote>"They remind me,"</quote> he said,</b>
<b><quote>"Of two friends who are dead,</quote></b>
<a><quote>But I <em>cannot</em> remember of whom."</quote></a>
</limerick>
Because of the quote and em elements, this document isn't valid against our latest
limerick schema. Let's add the following declarations to our limerick schema,
replacing the existing declarations for the a and b elements:
<!ELEMENT em (#PCDATA|quote|em)*>
<!ELEMENT quote (#PCDATA|quote|em)*>
<!ELEMENT a (#PCDATA|quote|em)*>
<!ELEMENT b (#PCDATA|quote|em)*>
The meaning of these element declarations is that the specified child elements (quote
and em in this case) may appear in any order, any number of times, interleaved with
the character content if any. This is the only kind of mixed content that DTDs support.
Examplotron permits more restrictive sorts of mixed content, but a TE cannot handle
them. If we do a stanza transformation, then because the a and b elements are declared
to have mixed content, the quote and em elements are not simply dropped along with
their content, as you might expect. Instead, their tags are removed and their content
is preserved. The result, then, is the same as if no quotation or emphasis markup had
appeared in the source document.
What would happen if the target schema for stanzas allowed em elements but not quote
elements? Then the final line's content would become:
<line>But I <em>cannot</em> remember of whom."</line>
By definition, reordering is never done on mixed content. It is the presence of mixed
content in the source schema, not in the target schema, that triggers this style of
processing, although you usually want to specify mixed content in both schemas.
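The mixed-content behavior can be sketched as unwrapping: a child element that the
target does not allow loses its tags, while its content is spliced into the parent.
The code below is an illustration under my own assumptions. The names flatten, fill,
and unwrap are hypothetical, and element renaming is left out so that only the
unwrapping is visible.

```python
import xml.etree.ElementTree as ET

def flatten(elem, allowed):
    """Yield elem's content as strings and elements, unwrapping disallowed children."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tag in allowed:
            kept = ET.Element(child.tag, dict(child.attrib))
            fill(kept, flatten(child, allowed))
            yield kept
        else:
            yield from flatten(child, allowed)  # keep the content, lose the tags
        if child.tail:
            yield child.tail

def fill(target, items):
    """Attach a stream of strings and elements to target as mixed content."""
    last = None
    for item in items:
        if isinstance(item, str):
            if last is None:
                target.text = (target.text or "") + item
            else:
                last.tail = (last.tail or "") + item
        else:
            target.append(item)
            last = item

def unwrap(elem, allowed):
    """Return a copy of elem in which only the allowed child elements survive."""
    result = ET.Element(elem.tag, dict(elem.attrib))
    fill(result, flatten(elem, allowed))
    return result

line = ET.fromstring('<a><quote>But I <em>cannot</em> remember of whom."</quote></a>')
print(ET.tostring(unwrap(line, {"em"}), encoding="unicode"))
print(ET.tostring(unwrap(line, set()), encoding="unicode"))
```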
In summary, the content models that a TE supports are mixed content, character-only
content, empty content, and element content consisting of a simple sequence of child
element names, possibly decorated with occurrence indicators. All other content models
are unsupported for transformation, though they are permitted for validation.
8. Attribute Mapping
So far, the value of a renaming attribute has been a single token, an element name.
But if the renaming attribute contains multiple tokens separated by whitespace, the
first token is the element name for element mapping, and the rest of the tokens are
pairs of equivalent source and target attribute names. For example, here's a link
element that contains a renaming attribute to map it to an HTML a element:
<link target="http://examplotron.org"
html="a target href">
Examplotron
</link>
Running a TE on this source document and providing html as the transformation name
produces this target document:
<a href="http://examplotron.org">
Examplotron
</a>
TEs support three special cases of attribute mapping. If the target attribute name
is replaced by #NONE, then the source attribute will be omitted from the target
document. If the source attribute is #CONTENT, then the target attribute's value does
not come from any source attribute, but from the character content of the element;
likewise, if the target attribute is #CONTENT, then the source attribute is removed
and its value is used as character content of the target element. Here's an example
of all three special cases. The source element
<url purpose="linkage" label="Examplotron"
html="a purpose #NONE label #CONTENT #CONTENT href">
http://examplotron.org
</url>
is transformed by dropping the purpose attribute, putting the character content
http://examplotron.org into the href attribute, and putting the value of the label
attribute into the character content of the target element (an a element), thus
producing the same result (modulo whitespace) as the transformation of the link
element did.
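The attribute-mapping rules, including the three special cases, can be sketched for a
single element as follows. This is a minimal illustration rather than the TE's code:
the function name map_element is hypothetical, whitespace around character content
is stripped for simplicity, and attributes not mentioned in any pair are assumed to
pass through unchanged, which the text above does not specify.

```python
import xml.etree.ElementTree as ET

def map_element(elem, name):
    """Rename elem and map its attributes as directed by its renaming attribute."""
    tokens = elem.attrib[name].split()
    target = ET.Element(tokens[0])      # the first token is the new element name
    target.text = elem.text
    handled = {name}                    # the renaming attribute is never copied
    for source, dest in zip(tokens[1::2], tokens[2::2]):
        if source == "#CONTENT":
            value = (elem.text or "").strip()
        else:
            handled.add(source)
            if source not in elem.attrib:
                continue
            value = elem.attrib[source]
        if dest == "#NONE":
            continue                    # the source attribute is omitted
        if dest == "#CONTENT":
            target.text = value         # the attribute value becomes the content
        else:
            target.set(dest, value)
    for attr, value in elem.attrib.items():
        if attr not in handled:
            target.set(attr, value)     # assumption: unmapped attributes are kept
    return target

url = ET.fromstring(
    '<url purpose="linkage" label="Examplotron" '
    'html="a purpose #NONE label #CONTENT #CONTENT href">'
    "http://examplotron.org"
    "</url>"
)
anchor = map_element(url, "html")
print(ET.tostring(anchor, encoding="unicode"))
```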
As a further extension to attribute mapping, a source/target attribute name pair may
be followed by the token #MAPTOKEN, which is in turn followed by a source token and
a target token. The source attribute value is then divided into tokens by whitespace,
and if the source token appears in it, it is replaced by the target token. There may
be any number of such triples of #MAPTOKEN, source token, target token following a
source/target attribute pair.
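A possible way to parse such a renaming attribute value is sketched below. The
function names and the sample attribute value (a rel attribute whose tokens are
remapped) are hypothetical, invented only to exercise the syntax described above;
malformed input is not handled.

```python
def parse_mapping(value):
    """Split a renaming attribute value into an element name and attribute rules.

    Each rule is (source attribute, target attribute, token map).
    """
    tokens = value.split()
    name, rest = tokens[0], tokens[1:]
    rules = []
    i = 0
    while i < len(rest):
        source, dest = rest[i], rest[i + 1]
        i += 2
        token_map = {}
        # Any number of #MAPTOKEN triples may follow the attribute pair.
        while i < len(rest) and rest[i] == "#MAPTOKEN":
            token_map[rest[i + 1]] = rest[i + 2]
            i += 3
        rules.append((source, dest, token_map))
    return name, rules

def map_tokens(value, token_map):
    """Replace each whitespace-separated token that has an entry in token_map."""
    return " ".join(token_map.get(token, token) for token in value.split())

# Hypothetical renaming attribute value, for illustration only.
name, rules = parse_mapping(
    "a rel rel #MAPTOKEN next following #MAPTOKEN prev preceding target href"
)
print(name, rules)
print(map_tokens("next nofollow", rules[0][2]))
```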
Note
This mechanism is usable but crude, and should eventually be replaced by something
less hacky. In AEs the source/target attribute pairs and mapping-token triples are
in a separate attribute from the renaming attribute.