1. Introduction
The traditional approach to transforming XML documents is a three-step pipeline: validate, transform, validate. (Sometimes, of course, one or both of the validation steps is omitted.) Architectural forms, a feature first of the SGML-based hypermedia standard HyTime and then of SGML itself, made use of a combination of enhancements to DTDs and annotations in source documents to allow a two-step pipeline for certain simple transformations. In this pipeline, a valid SGML document could be automatically transformed using a specialized SGML parser, called an architectural engine (AE), into another SGML document valid against a more general DTD known as the meta-DTD. This permitted document creators to conform to a general document architecture without having to constrain their own documents to every detail of a specific schema.
However, architectural forms have not seen wide uptake in the XML world, and the few XML architectural engines that have been built have conformed more to the letter than to the spirit of architectural forms. The emphasis has been on the creation of comprehensive and complex schemas which attempt simultaneously to serve local needs and the needs of interchange. Such schemas are usually arrived at by difficult, lengthy, and highly political negotiations between interested parties, with victory often going to the participants with the greatest weight of Sitzfleisch rather than the best ideas.
This paper describes an attempt to return to those thrilling days of yesteryear by providing a modern equivalent of SGML architectural engines. In principle any grammar-based schema language such as XML Schema or RELAX NG would be suitable for the methods outlined here. However, the software development (still very much a work in progress as of this writing) is using the much simpler Examplotron schema language. Examplotron is not well-known or much used in the XML environment, but I believe it to be extremely suitable to the stripped-down MicroXML environment in which I am now primarily interested. Since most people don't know Examplotron, I have written the paper to be accessible to anyone who can read simple DTD declarations.
In this paper, I will speak of the source document, which is the input to a schema-based transformation engine (TE), and of the target document, which is a TE's output. Additional inputs are the source schema and the target schema. In this paper the schemas are expressed as DTD fragments, but in actual use they will be Examplotron 0.8 schemas. In addition, we may supply the TE with the transformation name of the particular transformation to be performed on the source, possibly one of many such transformations. For clarity's sake, I will speak as if the various transformations are made one by one, but except for attribute defaulting they are all made simultaneously. For example, if all elements named foo are to be renamed bar, and all elements named bar are to be renamed baz, that does not mean that both foo and bar elements wind up being named baz.
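The difference between simultaneous and one-by-one renaming can be illustrated with a minimal Python sketch. It is not taken from the TE software; the function names are hypothetical, and element names are modeled as plain strings.

```python
def rename_simultaneously(names, mapping):
    """Rename each name at most once, always consulting the original mapping."""
    return [mapping.get(name, name) for name in names]


def rename_sequentially(names, rules):
    """Apply each (old, new) rule in turn to the whole list: the behavior a TE avoids."""
    for old, new in rules:
        names = [new if name == old else name for name in names]
    return names


rules = [("foo", "bar"), ("bar", "baz")]
print(rename_simultaneously(["foo", "bar"], dict(rules)))  # ['bar', 'baz']
print(rename_sequentially(["foo", "bar"], rules))          # ['baz', 'baz']
```

Only the first function matches the behavior described above: foo becomes bar, and the original bar becomes baz.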
2. Element Renaming and the Renaming Attribute
The first and simplest kind of transformation to be performed is element renaming. A TE does this by looking at each element of the source document for an attribute whose name is the same as the transformation name supplied to the TE. This attribute is called the renaming attribute.
For example, suppose we have the following source document:
<limerick>
  <title>Relativity</title>
  <a>There was a young lady named Bright</a>
  <a>Who could travel much faster than light.</a>
  <b>She set out one day</b>
  <b>In a relative way</b>
  <a>And returned the previous night.</a>
</limerick>
If we wish to transform it from its limerick-specific schema to a more general stanza schema, we might add a renaming attribute named stanza to every element, like this:
<limerick stanza="stanza">
  <title stanza="title">Relativity</title>
  <a stanza="line">There was a young lady named Bright</a>
  <a stanza="line">Who could travel much faster than light.</a>
  <b stanza="line">She set out one day</b>
  <b stanza="line">In a relative way</b>
  <a stanza="line">And returned the previous night.</a>
</limerick>
Running a TE on the above document, specifying stanza as the transformation name, would produce the following target document:
<stanza>
  <title>Relativity</title>
  <line>There was a young lady named Bright</line>
  <line>Who could travel much faster than light.</line>
  <line>She set out one day</line>
  <line>In a relative way</line>
  <line>And returned the previous night.</line>
</stanza>
Note that all occurrences of the renaming attribute have been removed from the target document.
What happens if an element doesn't have a renaming attribute? The answer is that the element is dropped in its entirety. For example, suppose we did not have a stanza attribute on the source document's title element. In that case, the target document would contain only a stanza element with five line child elements.
If you don't provide a TE with a transformation name, there is no renaming attribute, and rather than dropping all the elements, none of them are renamed. However, the target document may still differ from the source document in other ways.
Note
The concept of renaming attributes comes from AEs; however, AEs do not require the name of the renaming attribute to be the same as the transformation name, and have different and more flexible rules about processing elements without renaming attributes.
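The renaming rules of this section can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree. This is only an illustration under stated assumptions, not the TE implementation: the function name rename_elements is hypothetical, a transformation name is always supplied, and the renaming attribute holds a single token.

```python
import xml.etree.ElementTree as ET


def rename_elements(source, transformation_name):
    """Return a renamed copy of `source`, or None if it lacks the renaming attribute."""
    new_name = source.get(transformation_name)
    if new_name is None:
        return None  # no renaming attribute: the element is dropped in its entirety
    target = ET.Element(new_name)
    # Copy every attribute except the renaming attribute itself.
    for key, value in source.attrib.items():
        if key != transformation_name:
            target.set(key, value)
    target.text = source.text
    target.tail = source.tail
    for child in source:
        renamed = rename_elements(child, transformation_name)
        if renamed is not None:
            target.append(renamed)
    return target


# The title element has no stanza attribute, so it does not survive.
source = ET.fromstring(
    '<limerick stanza="stanza">'
    "<title>Relativity</title>"
    '<a stanza="line">There was a young lady named Bright</a>'
    '<b stanza="line">She set out one day</b>'
    "</limerick>"
)
target = rename_elements(source, "stanza")
print(ET.tostring(target, encoding="unicode"))
```

Because each element is looked up once in the source and written once to a fresh target tree, the renaming is simultaneous in the sense described in the introduction.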
3. Attribute Defaulting
This business of adding renaming attributes directly to the source document is irritating, and may be impossible if we aren't able to change the source document. Instead, we can take advantage of attribute defaulting by specifying a source schema. Consider the following DTD fragment:
<!ATTLIST limerick stanza "stanza">
<!ATTLIST title stanza "title">
<!ATTLIST a stanza "line">
<!ATTLIST b stanza "line">
This says that in the limerick element, if no stanza attribute is supplied, its value is assumed to be stanza. Likewise, for the title element, the default value of the stanza attribute is title, and for the a and b elements, it is line. Now we no longer have to alter our original limerick document when we want to transform it. If we specify the transformation name as stanza, we will get the same target document that we saw in the previous section.
What is more, we can provide more than one renaming attribute in the same source schema. Suppose we add the following declarations to the above source schema:
<!ATTLIST limerick estrofa "estrofa">
<!ATTLIST title estrofa "título">
<!ATTLIST a estrofa "línea">
<!ATTLIST b estrofa "línea">
If we specify the transformation name as estrofa rather than stanza, we will generate a target document whose element names are in Spanish rather than English. However, the TE cannot automatically remove the defaulted stanza attribute when doing an estrofa transformation, nor vice versa, because it does not know which attributes might be used as renaming attributes in a different transformation run. To suppress them from the target document, we must provide the TE with a list of the renaming attributes that are not being used for the current transformation. In the rest of this paper we will assume that this list has been provided.
Attribute defaulting is not restricted to renaming attributes. If any attribute is given a default value by the source schema but does not appear in the source document, it will be created, and by default will appear in the target document. Attribute defaulting is done in advance of all other transformations; a default attribute may have its name or value changed by a later transformation.
Note
Attribute defaulting is inherent to DTD processing. The version of Examplotron used by TEs, Examplotron 0.8, allows the specification of default values for attributes, and in fact for elements too.
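As a rough illustration of defaulting followed by suppression of unused renaming attributes, here is a minimal Python sketch. The DEFAULTS table is a hypothetical in-memory stand-in for the ATTLIST declarations above, and both function names are assumptions made for this example rather than part of any TE.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of the declarations: element name -> {attribute: default}.
DEFAULTS = {
    "limerick": {"stanza": "stanza", "estrofa": "estrofa"},
    "title": {"stanza": "title", "estrofa": "título"},
    "a": {"stanza": "line", "estrofa": "línea"},
    "b": {"stanza": "line", "estrofa": "línea"},
}


def apply_defaults(root, defaults):
    """Add each defaulted attribute that the document does not already supply."""
    for element in root.iter():
        for name, value in defaults.get(element.tag, {}).items():
            if name not in element.attrib:
                element.set(name, value)
    return root


def suppress_unused(root, unused_renaming_attributes):
    """Remove renaming attributes that belong to other transformations."""
    for element in root.iter():
        for name in unused_renaming_attributes:
            element.attrib.pop(name, None)
    return root


doc = ET.fromstring('<limerick><title>Relativity</title><a stanza="verse">x</a></limerick>')
apply_defaults(doc, DEFAULTS)      # runs before any other transformation
suppress_unused(doc, ["estrofa"])  # we are about to do a stanza transformation
print(ET.tostring(doc, encoding="unicode"))
```

An explicitly supplied value (stanza="verse" here) wins over the default, which is the usual DTD behavior.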
4. Element Reordering
So far, we haven't had to deal with child elements appearing in a different order in the source and target documents. However, this can often happen when the source document is data-oriented rather than content-oriented. In order to know how to reorder child elements, we must provide the TE with a target schema. Here's a simple target schema specifying a document containing people's names:
<!ELEMENT people (person*)>
<!ELEMENT person (last, first)>
In this schema, we see that a people element contains zero or more person elements and nothing else, and that each person element contains last and first elements in that order.
Now here's a source document:
<people>
  <person>
    <first>John</first>
    <last>Cowan</last>
  </person>
  <person>
    <first>Dorian</first>
    <last>Cowan</last>
  </person>
</people>
Suppose we pass this source document and the target schema to a TE without specifying a transformation name. In that case, there is no renaming attribute, and so no element renaming is done. However, since the order of child elements for the person element in the source document is not valid according to the target schema, they will be reordered so as to be valid in the target document, producing this:
<people>
  <person>
    <last>Cowan</last>
    <first>John</first>
  </person>
  <person>
    <last>Cowan</last>
    <first>Dorian</first>
  </person>
</people>
Note
AEs do not perform element reordering.
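One simple way to picture the reordering step is as a stable sort of the children by their position in the target content model. The Python sketch below makes that idea concrete; reorder_children is a hypothetical name, and the sketch assumes that each child name appears only once in the target sequence and that children not named there simply sort last.

```python
import xml.etree.ElementTree as ET


def reorder_children(parent, target_order):
    """Stable-sort the children of `parent` by the position of their name in `target_order`."""
    rank = {name: index for index, name in enumerate(target_order)}
    children = list(parent)
    # list.sort is stable, so children with the same name keep their document order.
    children.sort(key=lambda child: rank.get(child.tag, len(rank)))
    parent[:] = children
    return parent


people = ET.fromstring(
    "<people>"
    "<person><first>John</first><last>Cowan</last></person>"
    "<person><first>Dorian</first><last>Cowan</last></person>"
    "</people>"
)
for person in people.findall("person"):
    reorder_children(person, ["last", "first"])  # from (last, first) in the target schema
print(ET.tostring(people, encoding="unicode"))
```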
5. Occurrences
Both source and target schemas can specify how many occurrences a child element can have within its parent element. In DTDs, we can repeat the element name to specify a fixed number of occurrences, as in this source schema for our limerick document:
<!ELEMENT limerick (title, a, a, b, b, a)>
<!ATTLIST limerick index "poem">
<!ATTLIST a index "firstline">
Now suppose we run a TE, passing it the transformation name index, our original limerick document, the above source schema, and the following target schema:
<!ELEMENT poem (firstline)>
The renaming attribute index will rename the limerick element to poem and the three a elements to firstline, dropping the title and b elements altogether. But since the target schema permits only a single firstline element in each poem element, the second and third firstline elements will also be dropped, producing the following target document:
<poem>
  <firstline>There was a young lady named Bright</firstline>
</poem>
This is suitable for inclusion in an index of first lines.
On the other hand, if the target schema requires more occurrences of an element than the source document provides, sufficient elements are created following the mapped elements. For an example of that process, consider this source document with explicit renaming attributes:
<couplet limerick="limerick">
  <line limerick="a">Go and tell the Spartans, passerby,</line>
  <line limerick="b">That here, obedient to their laws, we lie.</line>
</couplet>
What happens if we transform this into a limerick using the limerick schema as the target schema? (There is nothing inherent in a schema that says whether it is a source or a target, only in how it is provided to a TE.) Limericks have to have a title and five lines, but we have only two lines here, one mapped (for some unknown reason) to an a element and one to a b element. Consequently, we get this target document:
<limerick>
  <title/>
  <a>Go and tell the Spartans, passerby,</a>
  <a/>
  <b>That here, obedient to their laws, we lie.</b>
  <b/>
  <a/>
</limerick>
Not very useful or pretty, perhaps, but certainly valid.
In this paper, newly created elements are shown as empty. However, if the Examplotron schema provides a default value for them, it will be used.
When specifying the content model of an element in a source or target schema, we can follow the name of a child element with * to mean "zero or more occurrences", as shown in the declaration of the people element. In the same way, ? means "zero or one occurrences" and + means "one or more occurrences". All these indicators are respected by a TE. So if two foo child elements appear in the source document, but the target schema specifies foo?, then the second one will be dropped. A TE cannot construct transformations based on more complex content models like ((a,b)+), in which the occurrence indicator follows a sequence of child element names, except as noted under the discussion of mixed content.
However, technically ambiguous content models like (line, line?, line?), meaning from one to three line elements, which are illegal in DTDs, are supported in Examplotron schemas as well as by a TE.
Note
AEs neither drop unwanted elements nor create new ones, but report validation errors instead.
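The dropping and creating of elements described in this section can be sketched as a single pass over the target content model. The following Python is an illustration only, written under the assumption that the model is a flat, comma-separated sequence of names with optional ?, *, or + indicators; parse_model and fit_children are hypothetical names, and created elements are left empty.

```python
import xml.etree.ElementTree as ET
from collections import deque


def parse_model(model):
    """Turn 'title, a, b?' into [('title', 1, 1), ('a', 1, 1), ('b', 0, 1)]; None means unbounded."""
    bounds = {"?": (0, 1), "*": (0, None), "+": (1, None)}
    particles = []
    for token in model.split(","):
        token = token.strip()
        if token[-1] in bounds:
            low, high = bounds[token[-1]]
            particles.append((token[:-1], low, high))
        else:
            particles.append((token, 1, 1))
    return particles


def fit_children(parent, model):
    """Rebuild the children of `parent` so that they match a simple sequence model."""
    # Queue the existing children by name, preserving document order.
    available = {}
    for child in parent:
        available.setdefault(child.tag, deque()).append(child)
    fitted = []
    for name, low, high in parse_model(model):
        queue = available.get(name, deque())
        count = 0
        # Take existing children until the upper bound is reached.
        while queue and (high is None or count < high):
            fitted.append(queue.popleft())
            count += 1
        # Create empty elements until the lower bound is reached.
        while count < low:
            fitted.append(ET.Element(name))
            count += 1
    parent[:] = fitted  # anything still queued is dropped
    return parent


# The couplet after renaming, fitted to the limerick content model.
limerick = ET.fromstring(
    "<limerick>"
    "<a>Go and tell the Spartans, passerby,</a>"
    "<b>That here, obedient to their laws, we lie.</b>"
    "</limerick>"
)
fit_children(limerick, "title, a, a, b, b, a")
print([child.tag for child in limerick])  # ['title', 'a', 'a', 'b', 'b', 'a']
```

Because each particle draws from a per-name queue, the same pass also reorders children, and a model such as (line, line?, line?) needs no special treatment.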
6. Character Content
So far, the source and target schemas we have seen have been incomplete, because not all the elements used in the documents have been mentioned in the schemas. In particular, declarations for the elements whose only permitted content is characters, such as a, firstline, and title, have been left out. Here's a complete version of the limerick source schema with all three renaming attributes provided:
<!ELEMENT limerick (title, a, a, b, b, a)>
<!ATTLIST limerick stanza "stanza">
<!ATTLIST limerick estrofa "estrofa">
<!ATTLIST limerick index "poem">
<!ELEMENT title #PCDATA>
<!ATTLIST title stanza "title">
<!ATTLIST title estrofa "título">
<!ELEMENT a #PCDATA>
<!ATTLIST a stanza "line">
<!ATTLIST a estrofa "línea">
<!ATTLIST a index "firstline">
<!ELEMENT b #PCDATA>
<!ATTLIST b stanza "line">
<!ATTLIST b estrofa "línea">
And here is an erroneous target schema for stanza documents:
<!ELEMENT stanza (title, line*)>
<!ELEMENT title #PCDATA>
<!ELEMENT line EMPTY>
Let's see what happens if we do a stanza transformation using that target schema. We get this target document:
<stanza>
  <title>Relativity</title>
  <line/>
  <line/>
  <line/>
  <line/>
  <line/>
</stanza>
Because the target schema specified the line element as empty (no child elements or character content), the TE threw away the character content. Again, probably not very useful, but again certainly valid.
Reordering and occurrence control are really two aspects of the same thing, and they can both happen to the same children of an element at the same time. Here is a not-very-realistic example. Given the source document
<root>
  <a id="a1"/>
  <b id="b1"/>
  <a id="a2"/>
  <b id="b2"/>
  <a id="a3"/>
</root>
and the target schema
<!ELEMENT root (a, a, b, b, b)>
the target document will be
<root>
  <a id="a1"/>
  <a id="a2"/>
  <b id="b1"/>
  <b id="b2"/>
  <b/>
</root>
That is, the a elements have been reordered before the b elements, the third a element has been dropped as unwanted, and a third b element has been created.
Note
AEs allow greater control of what happens to character content when an element containing it is dropped from the target document: it may be discarded or included as part of the parent element. TEs always discard it unless the parent element's content model is specified as mixed content.
7. Mixed Content
An element has mixed content when its content includes both child elements and characters. Consider this limerick:
<limerick>
  <title>Memory</title>
  <a>There was an old man of Khartoum</a>
  <a>Who kept two black sheep in his room.</a>
  <b><quote>"They remind me,"</quote> he said,</b>
  <b><quote>"Of two friends who are dead,</quote></b>
  <a><quote>But I <em>cannot</em> remember of whom."</quote></a>
</limerick>
Because of the quote and em elements, this document isn't valid against our latest limerick schema. Let's add the following declarations to our limerick schema, replacing the existing declarations for the a and b elements:
<!ELEMENT em (#PCDATA|quote|em)*>
<!ELEMENT quote (#PCDATA|quote|em)*>
<!ELEMENT a (#PCDATA|quote|em)*>
<!ELEMENT b (#PCDATA|quote|em)*>
The meaning of these element declarations is that the specified child elements (quote and em in this case) may appear in any order, any number of times, interleaved with the character content if any. This is the only kind of mixed content that DTDs support. Examplotron permits more restrictive sorts of mixed content, but a TE cannot handle them. If we do a stanza transformation, then because the a and b elements are declared to have mixed content, instead of simply dropping the quote and em elements along with their content as you might expect, their content is preserved. The result, then, is the same as if no quotation or emphasis markup had appeared in the source document.
What would happen if the target schema for stanzas allowed em elements but not quote elements? Then the final line's content would become:
<line>But I <em>cannot</em> remember of whom.</line>
By definition, reordering is never done on mixed content. It is the presence of mixed content in the source schema, not in the target schema, that triggers this style of processing, although you usually want to specify mixed content in both schemas.
In summary, the content models that a TE supports are mixed content, character-only content, empty content, and element content consisting of a simple sequence of child element names, possibly decorated with occurrence indicators. All other content models are unsupported for transformation, though they are permitted for validation.
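The mixed-content behavior, in which a disallowed child element loses its tags but keeps its content, can be sketched in Python as follows. This is an illustrative sketch rather than the TE's actual algorithm; flatten and fill are hypothetical helper names, and the set of allowed child names is passed in directly instead of being read from a schema.

```python
import xml.etree.ElementTree as ET


def flatten(element, allowed):
    """Yield the strings and allowed child elements that make up the content of `element`."""
    if element.text:
        yield element.text
    for child in element:
        if child.tag in allowed:
            kept = ET.Element(child.tag, child.attrib)
            fill(kept, flatten(child, allowed))
            yield kept
        else:
            # Unwrap a disallowed element: keep its content, lose its tags.
            yield from flatten(child, allowed)
        if child.tail:
            yield child.tail


def fill(element, items):
    """Attach a sequence of strings and elements to `element` as text, children, and tails."""
    last = None
    for item in items:
        if isinstance(item, str):
            if last is None:
                element.text = (element.text or "") + item
            else:
                last.tail = (last.tail or "") + item
        else:
            element.append(item)
            last = item
    return element


a = ET.fromstring("<a><quote>But I <em>cannot</em> remember of whom.</quote></a>")
line = fill(ET.Element("line"), flatten(a, allowed={"em"}))
print(ET.tostring(line, encoding="unicode"))
# <line>But I <em>cannot</em> remember of whom.</line>
```

Passing an empty set for allowed gives the case where neither quote nor em is permitted in the target, leaving only the characters.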
8. Attribute Mapping
So far, the value of a renaming attribute has been a single token, an element name. But if the renaming attribute contains multiple tokens separated by whitespace, the first token is the element name for element mapping, and the rest of the tokens are pairs of equivalent source and target attribute names. For example, here's a link element that contains a renaming attribute to map it to an HTML a element:
<link target="http://examplotron.org" html="a target href"> Examplotron </link>
Running a TE on this source document and providing html as the transformation name produces this target document:
<a href="http://examplotron.org"> Examplotron </a>
TEs support three special cases of attribute mapping. If the target attribute name is replaced by #NONE, then the source attribute will be omitted from the target document. If the source attribute is #CONTENT, then the target attribute's value does not come from any source attribute, but from the character content of the element; likewise, if the target attribute is #CONTENT, then the source attribute is removed and its value is used as character content of the target element. Here's an example of all three special cases. The source element
<url purpose="linkage" label="Examplotron" html="a purpose #NONE label #CONTENT #CONTENT href"> http://examplotron.org </url>
is transformed by dropping the purpose attribute, putting the character content http://examplotron.org into the href attribute, and putting the value of the label attribute into the character content of the target element (an a element), thus producing the same result (modulo whitespace) as the transformation of the link element did.
As a further extension to attribute mapping, if a source/target attribute name pair is followed by the token #MAPTOKEN, it is then followed by a source token and a target token. The source attribute value is then divided into tokens by whitespace, and if the source token appears in it, it is replaced by the target token. There may be any number of such triples of #MAPTOKEN, source token, target token following a source/target attribute pair.
Note
This mechanism is usable but crude, and should eventually be replaced by something less hacky. In AEs the source/target attribute pairs and mapping-token triples are in a separate attribute from the renaming attribute.
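To make the token syntax of this section concrete, here is a minimal Python sketch that parses a multi-token renaming attribute and applies it to a single childless element. The names parse_renaming and map_element are hypothetical, and several details are assumptions not settled by the text above: attributes not mentioned in any pair are copied unchanged, character content is stripped of surrounding whitespace before being used as an attribute value, and mapped tokens are rejoined with single spaces.

```python
import xml.etree.ElementTree as ET


def parse_renaming(value):
    """Split a renaming attribute into (element name, [(source, target, {token map})])."""
    tokens = value.split()
    name, rest = tokens[0], tokens[1:]
    pairs = []
    i = 0
    while i < len(rest):
        if rest[i] == "#MAPTOKEN":
            # A triple extends the token map of the preceding attribute pair.
            pairs[-1][2][rest[i + 1]] = rest[i + 2]
            i += 3
        else:
            pairs.append((rest[i], rest[i + 1], {}))
            i += 2
    return name, pairs


def map_element(source, transformation_name):
    """Apply the renaming attribute of a single childless element."""
    name, pairs = parse_renaming(source.get(transformation_name))
    target = ET.Element(name)
    target.text = source.text
    mapped = {transformation_name}
    for src, dst, token_map in pairs:
        if src == "#CONTENT":
            value = (source.text or "").strip()  # assumption: trim whitespace
        else:
            value = source.get(src)
            mapped.add(src)
        if value is None or dst == "#NONE":
            continue
        if token_map:
            value = " ".join(token_map.get(token, token) for token in value.split())
        if dst == "#CONTENT":
            target.text = value
        else:
            target.set(dst, value)
    # Assumption: attributes not mentioned in any pair are copied unchanged.
    for key, value in source.attrib.items():
        if key not in mapped:
            target.set(key, value)
    return target


url = ET.fromstring(
    '<url purpose="linkage" label="Examplotron" '
    'html="a purpose #NONE label #CONTENT #CONTENT href"> http://examplotron.org </url>'
)
print(ET.tostring(map_element(url, "html"), encoding="unicode"))
# <a href="http://examplotron.org">Examplotron</a>
```

The #CONTENT source always reads from the source element's character content, so the two #CONTENT pairs in the example do not interfere with each other.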
References
International Organization for Standardization. SGML Extended Facilities, normative annex A to ISO/IEC 10744. "A.3 Architectural Form Definition Requirements (AFDR)." [online]. © 1992, 1997 [cited 12 July 2013]. http://www.pms.ifi.lmu.de/mitarbeiter/ohlbach/multimedia/HYTIME/ISO/clause-A.3.html.
van der Vlist, Eric. "Examplotron" [online]. © 2003 [cited 12 July 2013]. http://www.examplotron.org.