Transforming schemas

John Cowan

Abstract

The traditional approach to transforming XML documents is a three-step pipeline: validate, transform, validate. The SGML feature called architectural forms combined enhancements to DTDs with annotations in source documents to allow a valid SGML document to be automatically transformed into another SGML document valid against a different DTD. This permitted document creators to conform to a general document architecture without having to constrain their own documents to every detail of a specific schema. In the XML world, however, the emphasis has been on the creation of comprehensive schemas rather than easy transformation, and the ideas behind architectural forms have mostly been lost. This paper attempts to explain how to restore those ideas to XML practice.

1. Introduction

The traditional approach to transforming XML documents is a three-step pipeline: validate, transform, validate. (Sometimes, of course, one or both of the validation steps is omitted.) Architectural forms, a feature first of the SGML-based hypermedia standard HyTime and then of SGML itself, made use of a combination of enhancements to DTDs and annotations in source documents to allow a two-step pipeline for certain simple transformations. In this pipeline, a valid SGML document could be automatically transformed using a specialized SGML parser, called an architectural engine (AE), into another SGML document valid against a more general DTD known as the meta-DTD. This permitted document creators to conform to a general document architecture without having to constrain their own documents to every detail of a specific schema.

However, DTDs have not seen wide uptake in the XML world, and the few XML architectural engines that have been built have conformed more to the letter than to the spirit of architectural forms. The emphasis has been on the creation of comprehensive and complex schemas which attempt simultaneously to serve local needs and the needs of interchange. Such schemas are usually arrived at by difficult, lengthy, and highly political negotiations between interested parties, with victory often going to the participants with the greatest weight of Sitzfleisch rather than the best ideas.

This paper describes an attempt to return to those thrilling days of yesteryear by providing a modern equivalent of SGML architectural engines. In principle any grammar-based schema language such as XML Schema or RELAX NG would be suitable for the methods outlined here. However, the software development (still very much a work in progress as of this writing) is using the much simpler Examplotron schema language. Examplotron is not well-known or much used in the XML environment, but I believe it to be extremely suitable to the stripped-down MicroXML environment in which I am now primarily interested. Since most people don't know Examplotron, I have written the paper to be accessible to anyone who can read simple DTD declarations.

In this paper, I will speak of the source document, which is the input to a schema-based transformation engine (TE), and of the target document, which is a TE's output. Additional inputs are the source schema and the target schema. Here the schemas are expressed as DTD fragments, but in actual use they will be Examplotron 0.8 schemas. In addition, we may supply the TE with the transformation name of the particular transformation to be performed on the source, possibly one of many such transformations. For clarity's sake, I will speak as if the various transformations are made one by one, but except for attribute defaulting they are all made simultaneously. For example, if all elements named foo are to be renamed bar, and all elements named bar are to be renamed baz, that does not mean that both foo and bar elements wind up being named baz: the original foo elements become bar, and only the original bar elements become baz.
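The difference between simultaneous and one-by-one renaming can be illustrated with a minimal Python sketch. The rename table and both function names are hypothetical and are not part of any TE; the point is only that each original name is looked up exactly once.

```python
# Hypothetical rename table: foo -> bar and bar -> baz.
RENAMES = {"foo": "bar", "bar": "baz"}

def rename_simultaneously(names, renames):
    """Look up each original name exactly once; a result is never renamed again."""
    return [renames.get(name, name) for name in names]

def rename_one_by_one(names, renames):
    """The sequential reading, shown only for contrast."""
    for old, new in renames.items():
        names = [new if name == old else name for name in names]
    return names

print(rename_simultaneously(["foo", "bar"], RENAMES))  # ['bar', 'baz']
print(rename_one_by_one(["foo", "bar"], RENAMES))      # ['baz', 'baz']
```

The first function gives the behavior described above; the second shows the collapse into baz that simultaneous processing avoids.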

2. Element Renaming and the Renaming Attribute

The first and simplest kind of transformation to be performed is element renaming. A TE does this by looking at each element of the source document for an attribute whose name is the same as the transformation name supplied to the TE. This attribute is called the renaming attribute.

For example, suppose we have the following source document:

      <limerick>
        <title>Relativity</title>
        <a>There was a young lady named Bright</a>
        <a>Who could travel much faster than light.</a>
        <b>She set out one day</b>
        <b>In a relative way</b>
        <a>And returned the previous night.</a>
      </limerick>

If we wish to transform it from its limerick-specific schema to a more general stanza schema, we might add a renaming attribute named stanza to every element, like this:

      <limerick stanza="stanza">
        <title stanza="title">Relativity</title>
        <a stanza="line">There was a young lady named Bright</a>
        <a stanza="line">Who could travel much faster than light.</a>
        <b stanza="line">She set out one day</b>
        <b stanza="line">In a relative way</b>
        <a stanza="line">And returned the previous night.</a>
      </limerick>

Running a TE on the above document, specifying stanza as the transformation name, would produce the following target document:

      <stanza>
        <title>Relativity</title>
        <line>There was a young lady named Bright</line>
        <line>Who could travel much faster than light.</line>
        <line>She set out one day</line>
        <line>In a relative way</line>
        <line>And returned the previous night.</line>
      </stanza>

Note that all occurrences of the renaming attribute have been removed from the target document.

What happens if an element doesn't have a renaming attribute? The answer is that the element is dropped in its entirety. For example, suppose we did not have a stanza attribute on the source document's title element. In that case, the target document would contain only a stanza element with five line child elements.

If you don't provide a TE with a transformation name, there is no renaming attribute, and rather than dropping all the elements, none of them are renamed. However, the target document may still differ from the source document in other ways.
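The renaming rules of this section can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not the TE's implementation: the function name rename is hypothetical, the renaming attribute is assumed to hold a single token, and the whitespace that follows a dropped element is simply discarded.

```python
import xml.etree.ElementTree as ET

def rename(elem, name):
    """Return a renamed copy of elem, or None if elem must be dropped.

    `name` is the transformation name; None means no renaming is done.
    """
    if name is None:
        new_tag = elem.tag
    elif name in elem.attrib:
        new_tag = elem.attrib[name]
    else:
        return None  # no renaming attribute: drop the element entirely
    # The renaming attribute itself never reaches the target document.
    attrs = {k: v for k, v in elem.attrib.items() if k != name}
    out = ET.Element(new_tag, attrs)
    out.text, out.tail = elem.text, elem.tail
    for child in elem:
        renamed = rename(child, name)
        if renamed is not None:
            out.append(renamed)
    return out

# The title element deliberately lacks a stanza attribute here.
source = ET.fromstring(
    '<limerick stanza="stanza">'
    '<title>Relativity</title>'
    '<a stanza="line">There was a young lady named Bright</a>'
    '<b stanza="line">She set out one day</b>'
    '</limerick>'
)
target = rename(source, "stanza")
print(target.tag, [child.tag for child in target])  # stanza ['line', 'line']
```

Calling rename(source, None) returns a copy with every element name unchanged, matching the case in which no transformation name is supplied.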

Note

The concept of renaming attributes comes from AEs; however, AEs do not require the name of the renaming attribute to be the same as the transformation name, and have different and more flexible rules about processing elements without renaming attributes.

3. Attribute Defaulting

This business of adding renaming attributes directly to the source document is irritating, and may be impossible if we aren't able to change the source document. Instead, we can take advantage of attribute defaulting by specifying a source schema. Consider the following DTD fragment:

      <!ATTLIST limerick stanza "stanza">
      <!ATTLIST title stanza "title">
      <!ATTLIST a stanza "line">
      <!ATTLIST b stanza "line">

This says that in the limerick element, if no stanza attribute is supplied, its value is assumed to be stanza. Likewise, for the title element, the default value of the stanza attribute is title, and for the a and b elements, it is line. Now we no longer have to alter our original limerick document when we want to transform it. If we specify the transformation name as stanza, we will get the same target document that we saw in the previous section.

What is more, we can provide more than one renaming attribute in the same source schema. Suppose we add the following declarations to the above source schema:

      <!ATTLIST limerick estrofa "estrofa">
      <!ATTLIST title estrofa "título">
      <!ATTLIST a estrofa "línea">
      <!ATTLIST b estrofa "línea">

If we specify the transformation name as estrofa rather than stanza, we will generate a target document whose element names are in Spanish rather than English. However, the TE cannot automatically remove the defaulted stanza attribute when doing an estrofa transformation, nor vice versa, because it does not know which attributes might be used as renaming attributes in a different transformation run. In order to suppress them, we must provide the TE with a list of renaming attributes that are not being used for the current transformation, so that they can be suppressed from the target document. In the rest of this paper we will assume that this list has been provided.

Attribute defaulting is not restricted to renaming attributes. If any attribute is given a default value by the source schema but does not appear in the source document, it will be created, and by default will appear in the target document. Attribute defaulting is done in advance of all other transformations; a default attribute may have its name or value changed by a later transformation.
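As a rough illustration, attribute defaulting and the suppression of unused renaming attributes might look like the following Python sketch. The DEFAULTS dictionary is a hypothetical in-memory stand-in for the ATTLIST declarations above, and the names apply_defaults and suppress are likewise assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsed form of the source schema's defaults:
# element name -> {attribute name: default value}
DEFAULTS = {
    "limerick": {"stanza": "stanza", "estrofa": "estrofa"},
    "title": {"stanza": "title", "estrofa": "título"},
    "a": {"stanza": "line", "estrofa": "línea"},
    "b": {"stanza": "line", "estrofa": "línea"},
}

def apply_defaults(root, defaults):
    """Add each defaulted attribute unless the document already supplies it."""
    for elem in root.iter():
        for attr, value in defaults.get(elem.tag, {}).items():
            elem.attrib.setdefault(attr, value)

def suppress(root, unused):
    """Remove renaming attributes that the current transformation does not use."""
    for elem in root.iter():
        for attr in unused:
            elem.attrib.pop(attr, None)

doc = ET.fromstring(
    "<limerick><title>Relativity</title>"
    "<a>There was a young lady named Bright</a></limerick>"
)
apply_defaults(doc, DEFAULTS)   # done before any other transformation
suppress(doc, ["stanza"])       # we are about to run the estrofa transformation
print(doc.attrib, doc.find("title").attrib)
# {'estrofa': 'estrofa'} {'estrofa': 'título'}
```

After these two steps only the estrofa renaming attribute remains, so an estrofa transformation can proceed exactly as if the attributes had been written in the source document.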

Note

Attribute defaulting is inherent to DTD processing. The version of Examplotron used by TEs, Examplotron 0.8, allows the specification of default values for attributes, and in fact for elements too.

4. Element Reordering

So far, we haven't had to deal with child elements appearing in a different order in the source and target documents. However, this can often happen when the source document is data-oriented rather than content-oriented. In order to know how to reorder child elements, we must provide the TE with a target schema. Here's a simple target schema specifying a document containing people's names:

      <!ELEMENT people (person*)>
      <!ELEMENT person (last, first)>

In this schema, we see that a people element contains zero or more person elements and nothing else, and that each person element contains last and first elements in that order.

Now here's a source document:

      <people>
        <person>
          <first>John</first>
          <last>Cowan</last>
        </person>
        <person>
          <first>Dorian</first>
          <last>Cowan</last>
        </person>
      </people>

Suppose we pass this source document and the target schema to a TE without specifying a transformation name. In that case, there is no renaming attribute, and so no element renaming is done. However, since the order of child elements for the person element in the source document is not valid according to the target schema, they will be reordered so as to be valid in the target document, producing this:

      <people>
        <person>
          <last>Cowan</last>
          <first>John</first>
        </person>
        <person>
          <last>Cowan</last>
          <first>Dorian</first>
        </person>
      </people>
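The reordering just shown can be sketched in a few lines of Python. TARGET_ORDER is a hypothetical parsed form of the target schema, and the function name reorder is an assumption; the sketch handles only a plain sequence of distinct child names and drops children the sequence does not mention.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsed target schema: parent name -> child names in order.
TARGET_ORDER = {"person": ["last", "first"]}

def reorder(root, order):
    """Sort each parent's children into the order the target schema requires."""
    for parent in list(root.iter()):
        wanted = order.get(parent.tag)
        if wanted is None:
            continue  # no sequence declared for this element; leave it alone
        kept = [child for child in parent if child.tag in wanted]
        # list.sort is stable, so same-named children keep their source order.
        kept.sort(key=lambda child: wanted.index(child.tag))
        parent[:] = kept

doc = ET.fromstring(
    "<people>"
    "<person><first>John</first><last>Cowan</last></person>"
    "<person><first>Dorian</first><last>Cowan</last></person>"
    "</people>"
)
reorder(doc, TARGET_ORDER)
for person in doc:
    print([(child.tag, child.text) for child in person])
# [('last', 'Cowan'), ('first', 'John')]
# [('last', 'Cowan'), ('first', 'Dorian')]
```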

Note

AEs do not perform element reordering.

5. Occurrences

Both source and target schemas can specify how many occurrences a child element can have within its parent element. In DTDs, we can repeat the element name to specify a fixed number of occurrences, as in this source schema for our limerick document:

    <!ELEMENT limerick (title, a, a, b, b, a)>
    <!ATTLIST limerick index "poem">
    <!ATTLIST a index "firstline">

Now suppose we run a TE, passing it the transformation name index, our original limerick document, the above source schema, and the following target schema:

    <!ELEMENT poem (firstline)>

The renaming attribute index will rename the limerick element to poem and the three a elements to firstline, dropping the title and b elements altogether. But since the target schema permits only a single firstline element in each poem element, the second and third firstline elements will also be dropped, producing the following target document:

    <poem>
      <firstline>There was a young lady named Bright</firstline>
    </poem>

This is suitable for inclusion in an index of first lines.

On the other hand, if the target schema requires more occurrences of an element than the source document provides, sufficient elements are created following the mapped elements. For an example of that process, consider this source document with explicit renaming attributes:

    <couplet limerick="limerick">
      <line limerick="a">Go and tell the Spartans, passerby,</line>
      <line limerick="b">That here, obedient to their laws, we lie.</line>
    </couplet>

What happens if we transform this into a limerick using the limerick schema as the target schema? (There is nothing inherent in a schema that says whether it is a source or a target, only in how it is provided to a TE.) Limericks have to have a title and five lines, but we have only two lines here, one mapped (for some unknown reason) to an a element and one to a b element. Consequently, we get this target document:

      <limerick>
        <title/>
        <a>Go and tell the Spartans, passerby,</a>
        <a/>
        <b>That here, obedient to their laws, we lie.</b>
        <b/>
        <a/>
      </limerick>

Not very useful or pretty, perhaps, but certainly valid.

In this paper, newly created elements are shown as empty. However, if the Examplotron schema provides a default value for them, it will be used.

When specifying the content model of an element in a source or target schema, we can follow the name of a child element with * to mean "zero or more occurrences", as shown in the declaration of the people element. In the same way, ? means "zero or one occurrences" and + means "one or more occurrences". All these indicators are respected by a TE. So if two foo child elements appear in the source document, but the target schema specifies foo?, then the second one will be dropped. A TE cannot construct transformations based on more complex content models like ((a,b)+), in which the occurrence indicator follows a sequence of child element names, except as noted under the discussion of mixed content.

However, technically ambiguous content models like (line, line?, line?), meaning from one to three line elements, which are illegal in DTDs, are supported in Examplotron schemas as well as by a TE.
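One way to picture occurrence handling is as filling a row of slots, one per name in the target content model. The Python sketch below is only an illustration: the (name, minimum, maximum) slot representation, the LIMERICK constant, and the function name fit are all hypothetical, and the input is the couplet after renaming has already been applied.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, deque

def fit(parent, slots):
    """Rebuild parent's children so that they match `slots`.

    `slots` is a hypothetical parsed content model: a list of
    (name, minimum, maximum) triples, with maximum None for "unbounded".
    """
    pools = defaultdict(deque)
    for child in parent:
        pools[child.tag].append(child)      # source order kept per name
    fitted = []
    for name, low, high in slots:
        taken = 0
        while pools[name] and (high is None or taken < high):
            fitted.append(pools[name].popleft())
            taken += 1
        for _ in range(taken, low):
            fitted.append(ET.Element(name))  # create a missing element, empty
    parent[:] = fitted                       # anything left in a pool is dropped
    return parent

# (title, a, a, b, b, a): every slot occurs exactly once.
LIMERICK = [(n, 1, 1) for n in ("title", "a", "a", "b", "b", "a")]

couplet = ET.fromstring(
    "<limerick><a>Go and tell the Spartans, passerby,</a>"
    "<b>That here, obedient to their laws, we lie.</b></limerick>"
)
fit(couplet, LIMERICK)
print([child.tag for child in couplet])  # ['title', 'a', 'a', 'b', 'b', 'a']

twice = ET.fromstring("<x><foo>1</foo><foo>2</foo></x>")
fit(twice, [("foo", 0, 1)])              # the content model foo?
print([child.text for child in twice])   # ['1']
```

In this representation a model such as (line, line?, line?) is simply three slots for the same name, which is why its technical ambiguity causes no difficulty here.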

Note

AEs neither drop unwanted elements nor create new ones, but report validation errors instead.

6. Character Content

So far, the source and target schemas we have seen have been incomplete, because not all the elements used in the documents have been mentioned in the schemas. In particular, declarations for the elements whose only permitted content is characters, such as a, firstline, and title, have been left out. Here's a complete version of the limerick source schema with all three renaming attributes provided:

      <!ELEMENT limerick (title, a, a, b, b, a)>
      <!ATTLIST limerick stanza "stanza">
      <!ATTLIST limerick estrofa "estrofa">
      <!ATTLIST limerick index "poem">
      <!ELEMENT title #PCDATA>
      <!ATTLIST title stanza "title">
      <!ATTLIST title estrofa "título">
      <!ELEMENT a #PCDATA>
      <!ATTLIST a stanza "line">
      <!ATTLIST a estrofa "línea">
      <!ATTLIST a index "firstline">
      <!ELEMENT b #PCDATA>
      <!ATTLIST b stanza "line">
      <!ATTLIST b estrofa "línea">

And here is an erroneous target schema for stanza documents:

      <!ELEMENT stanza (title, line*)>
      <!ELEMENT title #PCDATA>
      <!ELEMENT line EMPTY>

Let's see what happens if we do a stanza transformation using that target schema. We get this target document:

      <stanza>
        <title>Relativity</title>
        <line/>
        <line/>
        <line/>
        <line/>
        <line/>
      </stanza>

Because the target schema specified the line element as empty (no child elements or character content), the TE threw away the character content. Again, probably not very useful, but again certainly valid.
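The effect of an EMPTY declaration can be sketched as follows. The function name empty_out and the set of element names passed to it are hypothetical; a real TE would derive that set from the target schema.

```python
import xml.etree.ElementTree as ET

def empty_out(root, empty_names):
    """Discard the content of every element the target schema declares EMPTY."""
    for elem in list(root.iter()):
        if elem.tag in empty_names:
            elem.text = None   # throw away character content
            del elem[:]        # throw away child elements; attributes are kept

doc = ET.fromstring(
    "<stanza><title>Relativity</title>"
    "<line>There was a young lady named Bright</line>"
    "<line>Who could travel much faster than light.</line></stanza>"
)
empty_out(doc, {"line"})
print([(child.tag, child.text) for child in doc])
# [('title', 'Relativity'), ('line', None), ('line', None)]
```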

Reordering and occurrence control are really two aspects of the same thing, and they can both happen to the same children of an element at the same time. Here is a not-very-realistic example. Given the source document

    <root>
      <a id="a1"/>
      <b id="b1"/>
      <a id="a2"/>
      <b id="b2"/>
      <a id="a3"/>
    </root>

and the target schema

    <!ELEMENT root (a, a, b, b, b)>

the target document will be

    <root>
      <a id="a1"/>
      <a id="a2"/>
      <b id="b1"/>
      <b id="b2"/>
      <b/>
    </root>

That is, the a elements have been reordered before the b elements, the third a element has been dropped as unwanted, and a third b element has been created.

Note

AEs allow greater control of what happens to character content when an element containing it is dropped from the target document: it may be discarded or included as part of the parent element. TEs always discard it unless the parent element's content model is specified as mixed content.

7. Mixed Content

An element has mixed content when its content includes both child elements and characters. Consider this limerick:

      <limerick>
        <title>Memory</title>
        <a>There was an old man of Khartoum</a>
        <a>Who kept two black sheep in his room.</a>
        <b><quote>"They remind me,"</quote> he said,</b>
        <b><quote>"Of two friends who are dead,</quote></b>
        <a><quote>But I <em>cannot</em> remember of whom."</quote></a>
      </limerick>

Because of the quote and em elements, this document isn't valid against our latest limerick schema. Let's add the following declarations to our limerick schema, replacing the existing declarations for the a and b elements:

      <!ELEMENT em (#PCDATA|quote|em)*>
      <!ELEMENT quote (#PCDATA|quote|em)*>
      <!ELEMENT a (#PCDATA|quote|em)*>
      <!ELEMENT b (#PCDATA|quote|em)*>

The meaning of these element declarations is that the specified child elements (quote and em in this case) may appear in any order, any number of times, interleaved with the character content if any. This is the only kind of mixed content that DTDs support. Examplotron permits more restrictive sorts of mixed content, but a TE cannot handle them. If we do a stanza transformation, then because the a and b elements are declared to have mixed content, instead of simply dropping the quote and em elements along with their content as you might expect, their content is preserved. The result, then, is the same as if no quotation or emphasis markup had appeared in the source document.

What would happen if the target schema for stanzas allowed em elements but not quote elements? Then the final line's content would become:

      <line>But I <em>cannot</em> remember of whom."</line>

By definition, reordering is never done on mixed content. It is the presence of mixed content in the source schema, not in the target schema, that triggers this style of processing, although you usually want to specify mixed content in both schemas.
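The mixed-content behavior amounts to unwrapping: a disallowed element is replaced by its own content. The Python sketch below uses xml.dom.minidom rather than ElementTree because the DOM represents text as ordinary nodes, which makes unwrapping straightforward. The function name unwrap_disallowed is hypothetical, and for simplicity a single set of allowed names is applied at every depth.

```python
from xml.dom import minidom

def unwrap_disallowed(node, allowed):
    """Replace each descendant element not in `allowed` with its own content."""
    for child in list(node.childNodes):
        if child.nodeType != child.ELEMENT_NODE:
            continue                          # text nodes are always kept
        unwrap_disallowed(child, allowed)     # deal with nested markup first
        if child.tagName not in allowed:
            while child.firstChild is not None:
                # insertBefore moves the node out of `child` and into `node`
                node.insertBefore(child.firstChild, child)
            node.removeChild(child)

doc = minidom.parseString(
    '<line><quote>But I <em>cannot</em> remember of whom."</quote></line>'
)
line = doc.documentElement
unwrap_disallowed(line, {"em"})   # the target schema allows em but not quote
line.normalize()                  # merge adjacent text nodes
print([n.nodeName for n in line.childNodes])  # ['#text', 'em', '#text']
```

Passing an empty set instead of {"em"} strips all inline markup and leaves a single text node, which corresponds to the stanza transformation described above.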

In summary, the content models that a TE supports are mixed content, character-only content, empty content, and element content consisting of a simple sequence of child element names, possibly decorated with occurrence indicators. All other content models are unsupported for transformation, though they are permitted for validation.

8. Attribute Mapping

So far, the value of a renaming attribute has been a single token, an element name. But if the renaming attribute contains multiple tokens separated by whitespace, the first token is the element name for element mapping, and the rest of the tokens are pairs of equivalent source and target attribute names. For example, here's a link element that contains a renaming attribute to map it to an HTML a element:

    <link target="http://examplotron.org"
     html="a target href">
      Examplotron
    </link>

Running a TE on this source document and providing html as the transformation name produces this target document:

    <a href="http://examplotron.org">
      Examplotron
    </a>

TEs support three special cases of attribute mapping. If the target attribute name is replaced by #NONE, then the source attribute will be omitted from the target document. If the source attribute is #CONTENT, then the target attribute's value does not come from any source attribute, but from the character content of the element; likewise, if the target attribute is #CONTENT, then the source attribute is removed and its value is used as character content of the target element. Here's an example of all three special cases. The source element

    <url purpose="linkage" label="Examplotron"
     html="a purpose #NONE label #CONTENT #CONTENT href">
       http://examplotron.org
    </url>

is transformed by dropping the purpose attribute, putting the character content http://examplotron.org into the href attribute, and putting the value of the label attribute into the character content of the target element (an a element), thus producing the same result (modulo whitespace) as the transformation of the link element did.
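The attribute-mapping rules, including the three special cases, can be sketched in Python as follows. The function name map_element is hypothetical. Two assumptions go beyond the text: character content is stripped of surrounding whitespace, and it is left in place unless some pair maps an attribute to #CONTENT. #MAPTOKEN is not handled.

```python
import xml.etree.ElementTree as ET

def map_element(elem, name):
    """Rename elem and remap its attributes as directed by renaming attribute `name`."""
    tokens = elem.attrib[name].split()
    new_tag, rest = tokens[0], tokens[1:]
    pairs = list(zip(rest[0::2], rest[1::2]))     # (source, target) name pairs
    src = {k: v for k, v in elem.attrib.items() if k != name}
    old_text = (elem.text or "").strip()          # assumption: whitespace trimmed
    new_text = old_text
    # Attributes that are not mapped pass through unchanged.
    mapped = {source for source, _ in pairs}
    out_attrs = {k: v for k, v in src.items() if k not in mapped}
    for source, target in pairs:
        value = old_text if source == "#CONTENT" else src.get(source)
        if value is None or target == "#NONE":
            continue                              # nothing to copy, or omitted
        if target == "#CONTENT":
            new_text = value                      # attribute value becomes content
        else:
            out_attrs[target] = value
    out = ET.Element(new_tag, out_attrs)
    out.text = new_text
    return out

url = ET.fromstring(
    '<url purpose="linkage" label="Examplotron" '
    'html="a purpose #NONE label #CONTENT #CONTENT href">'
    'http://examplotron.org</url>'
)
a = map_element(url, "html")
print(a.tag, a.attrib, a.text)
# a {'href': 'http://examplotron.org'} Examplotron
```

All values are read from the original attributes and content before anything is written, in keeping with the paper's statement that transformations are made simultaneously.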

As a further extension to attribute mapping, if a source/target attribute name pair is followed by the token #MAPTOKEN, it is then followed by a source token and a target token. The source attribute value is then divided into tokens by whitespace, and if the source token appears in it, it is replaced by the target token. There may be any number of such triples of #MAPTOKEN, source token, target token following a source/target attribute pair.
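To make the token grammar concrete, here is a small Python sketch that parses the tokens following the element name and applies a token map to an attribute value. The function names and the attribute and token names in the example (kind, class, warn, and so on) are invented for illustration; malformed input, such as a dangling #MAPTOKEN, would raise an IndexError in this sketch.

```python
def parse_mappings(tokens):
    """Parse the tokens after the element name into
    (source_attr, target_attr, {source_token: target_token}) triples."""
    mappings = []
    i = 0
    while i < len(tokens):
        source, target = tokens[i], tokens[i + 1]
        i += 2
        token_map = {}
        # Any number of "#MAPTOKEN source-token target-token" triples may follow.
        while i < len(tokens) and tokens[i] == "#MAPTOKEN":
            token_map[tokens[i + 1]] = tokens[i + 2]
            i += 3
        mappings.append((source, target, token_map))
    return mappings

def map_value(value, token_map):
    """Split value on whitespace and replace each token that has a mapping."""
    return " ".join(token_map.get(token, token) for token in value.split())

# Hypothetical renaming attribute value, minus the leading element name.
spec = "kind class #MAPTOKEN warn warning #MAPTOKEN info note id id".split()
mappings = parse_mappings(spec)
print(mappings[0])
# ('kind', 'class', {'warn': 'warning', 'info': 'note'})
print(map_value("warn big", mappings[0][2]))  # warning big
```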

Note

This mechanism is usable but crude, and should eventually be replaced by something less hacky. In AEs the source/target attribute pairs and mapping-token triples are in a separate attribute from the renaming attribute.

References

International Organization for Standardization. SGML Extended Facilities, normative annex A to ISO/IEC 10744. "A.3 Architectural Form Definition Requirements (AFDR)." [online]. © 1992, 1997 [cited 12 July 2013]. http://www.pms.ifi.lmu.de/mitarbeiter/ohlbach/multimedia/HYTIME/ISO/clause-A.3.html.

Author's keywords for this paper:

Architectural forms; Examplotron; Schema-driven transformation

John Cowan

Senior Content Architect

LexisNexis

`<cowan@ccil.org>`

John Cowan works for LexisNexis, which he likes to call "$EMPLOYER". On his 2011 tax returns, he listed his occupation as "ontologist". He pushed both XML 1.1 and XML 1.0 Fifth Edition through the W3C XML Core Working Group, of which he somehow remains a member. He also hangs out on numerous mailing lists and blogs, masquerading on the A forum as the expert on B and on the B forum as the expert on A. His friends say that he knows at least something about almost everything; his enemies, that he knows far too much about far too much.

Balisage: The Markup Conference

Balisage Paper: Transforming schemas

Architectural Forms for the 21st Century

Balisage Series on Markup Technologies