How to cite this paper

Quin, Liam. “DTD (document type definition) declarations exposed in XSLT: Parsing DTD files in XSLT to expose the definitions they contain.” Presented at Balisage: The Markup Conference 2024, Washington, DC, July 29 - August 2, 2024. In Proceedings of Balisage: The Markup Conference 2024. Balisage Series on Markup Technologies, vol. 29 (2024). https://doi.org/10.4242/BalisageVol29.Quin01.

Balisage: The Markup Conference 2024
July 29 - August 2, 2024

Balisage Paper: DTD (document type definition) declarations exposed in XSLT

Parsing DTD files in XSLT to expose the definitions they contain

Liam Quin

Liam Quin was stolen as a child by fae folk and raised in fairyland, where it developed into a monster that would eventually become capable of reading XML and writing XSLT.

It was part of the W3C group that created XML, and later worked at W3C where it was in charge of complaining about XML, as well as influencing XQuery, XPath, XSLT, and other magical creations, using its fairyland powers to try to defuse arguments.

It now runs delightfulcomputing.com, writes and maintains XSLT stylesheets and XQuery applications for people, edits specifications and proposals for the wise and adventurous, and gives training courses for would-be explorers.

Abstract

The XML specification defines a syntax for, and semantics for, a document type definition (DTD). An XML document can optionally reference a DTD, which it does by means of a document type declaration. The DTD may contain constraints on the elements that can be found in the XML document, and what attributes they may contain. It may also define entities, which associate names with strings or with external resources such as XML fragments or images. The constraints can be checked using a process termed validation.

The XML Data Model (XDM) does not expose declarations found in a document’s document type definition. As a result, XSLT stylesheets, XPath expressions, XQuery expressions, and anything else using the XDM, cannot reference them.

Sometimes one might want (or need) to access declarations from a DTD. The author in the past has written programs in procedural languages such as C or Perl in order to access such declarations.

This paper describes an XSLT stylesheet that parses DTD syntax and constructs a simple XML representation that can then be processed using XSLT or other XML-aware languages directly.

Some use cases and examples are given. The work is publicly available on gitlab.

Introduction

Use Cases

Comparing DTDs
Diagramming DTDs
Exploring

How to Cast the Magical Spell?

Matching Constructs

Handling a single construct (you may touch the exhibits)

External Entities

Results

Further work

An API

Conclusions

Introduction

The author developed a tool for working with XSLT transformations that map documents from one DTD to another, where the two DTDs are broadly similar. Examples might include two versions of DocBook, or from a generic JATS version to a customized one. This tool needed a list of all the elements defined in each DTD, along with their content models (the elements they can contain) and the attributes they might have.

The tool was written in Perl, because the Python module most commonly used for processing XML did not give the necessary access and the Perl one did. It could also have been written in Java or C, but no matter: it was not written in XSLT. Like fairy gold that crumbles into dust in the night, Perl has become unfashionable. Since the tool processed XML, people wanted an XSLT or XQuery version.

So a spell was cast (longum et longum servi laborem) and reading DTD files in XSLT became a thing.

This work enables an XSLT coverage checker and a Near and Far style diagram generator. Future work may also include document validation and possibly a small bush whose fruit hatch into sparkly pink unicorns.

This paper describes the ugly and evil methods used to parse DTD syntax, some challenges faced and how they were overcome in fierce magical battles, and how it might have been done better by a wiser fairy.

Use Cases

Some anticipated ways the XSLT module could be used:

Comparing DTDs

A developer or analyst may have two large document type definitions, perhaps each using dozens of files, and perhaps using nested parameter entity definitions for element names and content models in different ways, and be faced with the problem of determining whether the DTDs describe the same set of documents or, if not, enumerating the differences.

Eddie 2, the tool described in the Introduction, is one example of software to do this.

Diagramming DTDs

Some ways to draw diagrams representing DTDs were introduced in Quin 2015; to generate such diagrams using XSLT (or XQuery) requires access to the declarations in the DTD. A rather large overview of the history of information graphics can be found in Rendgen 2021.

Exploring

Having a DTD represented simply in XML, without the complexities of W3C XML Schema, enables one to ask questions such as, Which elements contain a vodka element but do not also contain lemonade; which elements have an iso-date attribute but do not allow text content; which attributes are declared as enumerations that include the value joyfulness? It is not difficult to formulate XPath or XQuery expressions to find answers to such questions, given a suitable data model.
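As an illustration only, here is how two of these questions might look in XPath. The sketch assumes a made-up XML representation of a DTD: the names dtd, decl, model, ref, attribute and value, and the type and element attributes, are invented for this example and are not the format the parser actually produces.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- ASSUMPTION: a hypothetical representation of three declarations. -->
  <xsl:variable name="dtd" as="element(dtd)">
    <dtd>
      <decl type="element" name="cocktail">
        <model><ref name="vodka"/><ref name="ice"/></model>
      </decl>
      <decl type="element" name="punch">
        <model><ref name="vodka"/><ref name="lemonade"/></model>
      </decl>
      <decl type="attlist" element="cocktail">
        <attribute name="mood" type="enumeration">
          <value>joyfulness</value>
          <value>gloom</value>
        </attribute>
      </decl>
    </dtd>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <!-- Which elements contain vodka but not also lemonade? -->
    <xsl:text>vodka without lemonade: </xsl:text>
    <xsl:value-of separator=", " select="
      $dtd/decl[@type = 'element']
               [model//ref/@name = 'vodka']
               [not(model//ref/@name = 'lemonade')]/@name"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Which enumerated attributes include the value joyfulness? -->
    <xsl:text>enumerations with joyfulness: </xsl:text>
    <xsl:value-of separator=", " select="
      $dtd/decl[@type = 'attlist']
          /attribute[@type = 'enumeration'][value = 'joyfulness']
          /concat(../@element, '/@', @name)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Run from the named initial template, this should report cocktail for the first question and cocktail/@mood for the second. Whatever the real representation looks like, the queries stay about this short.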

How to Cast the Magical Spell?

The main considerations for what parsing technology to use generally include:

Availability of tools;
Knowledge of the would-be spell-caster;
Limitations of available tools;
Time and system resources (material components).

In this case, although there are parsing tools available, the DTD syntax has a hidden complexity: parameter entities.

A parameter entity is a named string that can be interpolated anywhere inside a DTD. Here is an example:

<!ENTITY % model "(antennae, head, body, wings, feet)">

<!ELEMENT butterfly %model;>

This example produces the same effect as:

<!ELEMENT butterfly (antennae, head, body, wings, feet)>

So far so good. Here is a harder example:

<!--* the following appears to be legal, if obscure: *-->
<!ENTITY % er 'der"'>
<!ENTITY % type 'CDATA #FIXED "slen%er;'>
<!ATTLIST boy
    ankles %type;
>

Here, the closing double quote of the default value in the entity type is supplied by the parameter entity er. This example shows that you cannot simply apply the grammar from the XML specification without expanding parameter entities first. However, since parameter entity definitions can appear in included files, and the inclusion mechanism itself uses parameter entity references to include the file, parameter entity definitions and references must be processed as they are encountered.

The interaction between parameter entities and syntax means that one cannot simply expand all entities before parsing. A parameter entity might contain a definition of another parameter entity that in turn overrides a later one (the earliest definition wins):

<!ENTITY % A "<!ENTITY % B 'haha!'>">
%A;
<!ENTITY % B "this one is checked for syntax but never used">
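The two behaviours just described (references expanded as they are met, and the earliest definition winning) can be sketched with an XPath map. This is a minimal illustration, not the parser's actual code: the function names ex:define and ex:expand and the namespace URI are invented here, expansion is done lazily at the point of use, and there is no guard against an entity that refers to itself.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xmlns:ex="http://example.org/dtd-sketch"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Record a parameter entity; an existing definition is never replaced. -->
  <xsl:function name="ex:define" as="map(xs:string, xs:string)">
    <xsl:param name="entities" as="map(xs:string, xs:string)"/>
    <xsl:param name="name" as="xs:string"/>
    <xsl:param name="value" as="xs:string"/>
    <xsl:sequence select="
      if (map:contains($entities, $name)) then $entities
      else map:put($entities, $name, $value)"/>
  </xsl:function>

  <!-- Replace every %name; in $text, expanding nested references too. -->
  <xsl:function name="ex:expand" as="xs:string">
    <xsl:param name="entities" as="map(xs:string, xs:string)"/>
    <xsl:param name="text" as="xs:string"/>
    <xsl:variable name="parts" as="xs:string*">
      <xsl:analyze-string select="$text" regex="%([^%;\s]+);">
        <xsl:matching-substring>
          <xsl:variable name="name" select="regex-group(1)"/>
          <xsl:sequence select="
            if (map:contains($entities, $name))
            then ex:expand($entities, $entities($name))
            else error(QName('http://example.org/dtd-sketch', 'ex:undefined'),
                       'undefined parameter entity ' || $name)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($parts, '')"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="e1" select="ex:define(map{}, 'er', 'der&quot;')"/>
    <xsl:variable name="e2"
        select="ex:define($e1, 'type', 'CDATA #FIXED &quot;slen%er;')"/>
    <!-- A later definition of er is ignored: the earliest one wins. -->
    <xsl:variable name="e3" select="ex:define($e2, 'er', 'never used')"/>
    <xsl:value-of select="ex:expand($e3, 'ankles %type;')"/>
  </xsl:template>
</xsl:stylesheet>
```

For the attribute list example shown earlier, the expanded text should be ankles CDATA #FIXED "slender", with the closing quote arriving from inside the er entity.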

So we can take one of three technical approaches. The quickest is not supporting all the complicated uses of parameter entities and only allowing them in content models. We can then add them to the grammar for content models and move on with life.

Unfortunately, actual DTDs do use parameter entities to define sets of attributes, and they do override parameter entities. It would still be possible to cast the spell of parsing by adding parameter entities to the grammar wherever we want to allow them, but this approach would preclude full XML validation in the future, or at least would make it difficult.

Perhaps a tree-hanging grammar could be used, which would cope with parameter entity references by treating them as “errors” and handling them specially, but that would mean writing a tree-hanging parser in XSLT, a much longer and more involved magical spell than the fairy could justify attempting to cast.

XSLT does, however, have a built-in mechanism for parsing fragments of text. Eating of this food will doom us to live forever in fairyland, but we have already tasted it: the impenetrable stew of regular expressions. It is well known that attempting to parse XML documents using pure regular expressions will summon demons, but we are not doing that. If we come to doing validation, the XSLT processor will read the document itself for us, and we will validate based on the resulting XDM instance. And DTD syntax is easily amenable to regular expressions. What makes this possible is the rule that each top-level construct must begin and end in the same entity, so that this is illegal:

<!ENTITY % A '<!ELEMENT' >

<!ENTITY % B '>' >

%A;cuddles (pillow|arms|blanket)* %B;

We should note that there are also restrictions on open angle brackets in entity values and string literals: they must be escaped. They are shown expanded in this paper for readability.

So a DTD must be parsed from start to end, one construct at a time, because earlier constructs can modify the meaning of later ones. When parameter entity references are encountered they must be expanded before proceeding. The rule about constructs does mean, though, that the DTD can be parsed one top-level construct at a time.

The author decided to start out by trying a regular-expression-based approach that matches each construct and processes it. In retrospect, it would almost be sufficient to separate constructs before parsing them by scanning ahead for the closing greater-than sign, except that attribute defaults and entity replacements can contain unescaped greater-than signs (but not unescaped less-than signs). Since a string could be started or ended by a parameter entity reference, it is not possible to use this approach. The hybrid approach actually adopted does not handle all cases of parameter entity expansion, but could be extended to do so, as described below.

Matching Constructs

The constructs in a DTD are syntactically independent (disregarding parameter entity interactions for a moment). So it makes sense in XSLT to write a template or function to handle a single entire construct.

The approach taken, then, was to have a top-level parse-string function that is given the text of the DTD. This identifies one top-level construct, calls a function to handle it, and then recurses on the rest of the input.

This approach actually worked with some test DTDs, but when presented with the JATS DTD, the Java Virtual Machine fell onto its back and wiggled its little Corgi legs in the air, wailing in stack-overflown despair. Since no fairy wants to upset Corgis, and since this meant the parser was useless in practice, it had to be rewritten.

The problem turned out to be passing the text of the rest of the input, a new string each time, to the next iteration. Alas, taking the computer for walkies and giving it some water didn’t help.

The solution was to keep the input text unchanged, and to pass an integer, a cursor, pointing to the current parse location. Now the stylesheet worked on the JATS DTD.

So now, the parsing function recognizes the start of a construct (after expanding a parameter entity reference if that should happen to be the next input token) and hands off to a function to handle that construct. The handler function returns an array of two items: the number of characters it consumed in reading the entire construct, which is added to the cursor, and the new declaration, if any:

<xsl:when test="matches($input-no-leading-ws, '^&lt;!ELEMENT\s+')">
  <!--* The handler returns an array: (1) the number of characters
      * consumed, (2) the new declaration, if any. *-->
  <xsl:variable name="decl" as="array(*)"
    select="dc:handle-element-decl(
            $input,
            $results-so-far,
            $input-base-uri,
            $skipped)" />

  <!--* Now recurse for the rest of the input: *-->
  <xsl:call-template name="dc:parse-string-tmpl">
    <xsl:with-param name="input" select="$input" />
    <xsl:with-param name="results-so-far"
        select="($results-so-far,  $decl(2))" />
    <xsl:with-param name="input-base-uri"
            select="$input-base-uri" />
    <xsl:with-param name="cursor" select="$cursor + xs:integer($decl(1))" />
  </xsl:call-template>
</xsl:when>

Here, dc:handle-element-decl() will process one element declaration and, assuming it finds one, return the new decl element, which the caller appends to $results-so-far (a sequence of such decl elements). It also returns the number of characters it processed, so that we know where to resume parsing.

The astute, handsome, and knowledgeable reader will note that we could just use starts-with($input-no-leading-ws, '<!ELEMENT') rather than a regular expression match. An even more limber sage might observe that ELEMENT could be followed by a parameter entity reference instead of whitespace. And that is true. An earlier version of the parser followed the XML grammar more closely. A regular expression is used to match S from the XML grammar, and so that all of the xsl:when clauses use the same format, for readability. However, the expression as shown here is not correct.

What might really be wanted here would be to match <!ELEMENT when followed by spaces or by a %-sign. XPath regular expressions do not have a “when followed by” operator, more conventionally called a positive lookahead assertion. We can match <!ELEMENT[\s%]+ in this case, however. Really we just want to catch <!ELEMENTS as an error, and although the handle-element-decl function will do that, if it knows the first token is correct it can be simpler to write. Perl regular expressions have \b to match a word boundary, and vi (and grep) have \< and \> to match the start and end boundaries of a word respectively, but we lack those, too. So, for this version, there must be a space in the input after the keyword.
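A small sketch of the [\s%] variant, to show which inputs it accepts. The three test strings are invented for this illustration; note that inside an XSLT attribute the open angle bracket has to be written as &lt;.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template name="xsl:initial-template">
    <!-- keyword then space; keyword then %reference; misspelt keyword -->
    <xsl:for-each select="
        '&lt;!ELEMENT butterfly (wings)>',
        '&lt;!ELEMENT%name; (wings)>',
        '&lt;!ELEMENTS butterfly (wings)>'">
      <xsl:value-of separator="" select="
          ., ' : ', matches(., '^&lt;!ELEMENT[\s%]+'), '&#10;'"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The first two strings should match and the third should not, so the pattern rejects ELEMENTS without needing a lookahead or a word-boundary operator. The handler still has to cope with the character that was consumed along with the keyword.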

To return to the more general topic, the approach taken is to parse a single construct as determined by matching the first token of the remaining input, and then to parse the remaining constructs recursively. Whitespace and comments at the start are removed at each iteration to reduce the number of recursions; comments could easily be kept, though, given a use for them.

The next section (just through that arch in the hedge) describes what the individual handler functions look like.

Handling a single construct (you may touch the exhibits)

A handler function must do several things:

It must consume at least one character of input, or raise an error. In this way infinite recursion is avoided.
It must consume a single construct.
It must return the new construct, to be merged with the results so far, along with the number of (Unicode) characters that were read.
It should detect errors and report them.

Handler functions are given the following as input parameters:

the entire input;
the results so far as element(decl)*;
the current input base URI, for error reporting;
the current input position (the cursor) as xs:integer, at the start of the construct.

Most of the handler functions define a variable called regex to hold a regular expression that will match the construct; if the pattern does not match, an error is raised. If it does match, the regular expression is used to fish the construct out of the input, and that construct is then processed according to the grammar in the XML specification.
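The shape of such a handler might look like the following sketch. It is deliberately simplified and makes several assumptions: it takes only the input and the cursor rather than the four parameters listed above, the pattern ignores parameter entities, the function name ex:handle-element-decl and its namespace are invented, and the content of the returned decl element is a placeholder rather than the real representation.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:ex="http://example.org/dtd-sketch"
    exclude-result-prefixes="#all">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical handler: returns [characters consumed, new decl]. -->
  <xsl:function name="ex:handle-element-decl" as="array(*)">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="cursor" as="xs:integer"/>
    <!-- Simplified: keyword, name, then everything up to the next '>' -->
    <xsl:variable name="regex" as="xs:string"
        select="'^&lt;!ELEMENT\s+(\S+)\s+([^>]+)>'"/>
    <xsl:variable name="match" as="element(fn:match)?"
        select="analyze-string(substring($input, $cursor), $regex)
                /fn:match[1]"/>
    <xsl:choose>
      <xsl:when test="empty($match)">
        <!-- The pattern did not match: report an error. -->
        <xsl:sequence select="error(
            QName('http://example.org/dtd-sketch', 'ex:syntax'),
            'malformed element declaration at character ' || $cursor)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="decl" as="element(decl)">
          <decl type="element" name="{$match/fn:group[@nr = 1]}">
            <xsl:value-of
                select="normalize-space($match/fn:group[@nr = 2])"/>
          </decl>
        </xsl:variable>
        <!-- The length of the whole match is the number of characters read. -->
        <xsl:sequence select="[string-length($match), $decl]"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="result" select="ex:handle-element-decl(
        '&lt;!ELEMENT butterfly (antennae, head, body, wings, feet)> rest',
        1)"/>
    <result consumed="{$result(1)}">
      <xsl:sequence select="$result(2)"/>
    </result>
  </xsl:template>
</xsl:stylesheet>
```

Because the pattern is anchored and must match at least the keyword, a successful call always consumes at least one character, which is what keeps the driving recursion from looping forever.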

Conditional sections were more complex. In an ignored conditional section, each nested <![ must be matched up with its closing ]]>, which can’t easily be done with a standard regular expression. Perl expressions can do it with procedural depth counting, and Perl 6 (now called Raku) has a grammar facility to do it, but here in XSLT it’s done with a recursive function.
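A recursive function of this kind might look like the sketch below, which is an illustration written for this paper's readers rather than the code in the parser. The name ex:skip-ignored is invented; the function is assumed to be called with the cursor just past an opening <![ and a depth of 1, and it returns the position of the first character after the matching ]]>.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/dtd-sketch"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <xsl:function name="ex:skip-ignored" as="xs:integer">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="cursor" as="xs:integer"/>
    <xsl:param name="depth" as="xs:integer"/>
    <xsl:variable name="rest" select="substring($input, $cursor)"/>
    <!-- Distance to the next opener and the next closer, if any -->
    <xsl:variable name="to-open" as="xs:integer?" select="
        if (contains($rest, '&lt;!['))
        then string-length(substring-before($rest, '&lt;![')) else ()"/>
    <xsl:variable name="to-close" as="xs:integer?" select="
        if (contains($rest, ']]&gt;'))
        then string-length(substring-before($rest, ']]&gt;')) else ()"/>
    <xsl:choose>
      <xsl:when test="empty($to-close)">
        <xsl:sequence select="error(
            QName('http://example.org/dtd-sketch', 'ex:unterminated'),
            'unterminated conditional section')"/>
      </xsl:when>
      <!-- A nested section opens first: go one level deeper. -->
      <xsl:when test="exists($to-open) and $to-open lt $to-close">
        <xsl:sequence select="
            ex:skip-ignored($input, $cursor + $to-open + 3, $depth + 1)"/>
      </xsl:when>
      <!-- The outermost section closes: return the position after it. -->
      <xsl:when test="$depth eq 1">
        <xsl:sequence select="$cursor + $to-close + 3"/>
      </xsl:when>
      <!-- A nested section closes: come back up one level. -->
      <xsl:otherwise>
        <xsl:sequence select="
            ex:skip-ignored($input, $cursor + $to-close + 3, $depth - 1)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="input" as="xs:string" select="
        '&lt;![IGNORE[ a &lt;![INCLUDE[ b ]]&gt; c ]]&gt;&lt;!ELEMENT x EMPTY>'"/>
    <!-- Start at character 4, just past the first opener, at depth 1 -->
    <xsl:value-of select="substring($input, ex:skip-ignored($input, 4, 1))"/>
  </xsl:template>
</xsl:stylesheet>
```

For the sample input the remaining text should be the element declaration for x, the nested INCLUDE section having been skipped along with everything else inside the ignored section.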

External Entities

What’s that crashing through the forest? An elephant? No my fairy friends, an entity!

External parameter entities in an XML DTD can be referenced, in which case the reference is to be replaced by the content of the referenced resource. The rule that all constructs must begin and end in the same entity allows us to process a reference to an external resource (or file, if you prefer) as a top-level construct just like an element declaration.

Note that you could, if you wanted, put a commonly-used content model into a separate file and include it in the middle of an element declaration. The approach here is to expand all parameter entity references after picking out the overall top-level construct, so that case is handled. A declaration cannot start in one file and end in another, so we can always see the end of the declaration.

There are two ways to reference external resources: public identifiers and system identifiers. However, if a public identifier is given, a system identifier must also be given. So there is always a system identifier (which is a URI) and sometimes a public identifier (which can be pretty much any string). If there is a public identifier, some parsers may choose to use it rather than the system identifier, or might be configured to prefer one or the other.

Currently this parser only handles system identifiers, and does not support XML Catalogue files, which are often used to map a public identifier into a URI, or to re-map a system identifier URI into a different URI. Since XSLT and XPath do not give us access to the XML catalogue file being used by the XSLT engine, we cannot ask XSLT to resolve identifiers for us. An XSLT implementation of XML catalogue files is therefore planned, but today, only the system identifier is used and it must resolve either as a relative URI (for example a file alongside the current input file) or as an absolute one potentially to be fetched over the network.
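Resolving a system identifier without a catalogue needs only standard functions. The following is a sketch under stated assumptions, not the parser's own code: the function name ex:read-external-entity and the two stylesheet parameters are invented, and removing a leading text declaration with a regular expression is this sketch's choice.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/dtd-sketch"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Hypothetical parameters: the absolute URI of the file containing
       the reference, and the system identifier found in it. -->
  <xsl:param name="input-base-uri" as="xs:string" required="yes"/>
  <xsl:param name="system-id" as="xs:string" required="yes"/>

  <xsl:function name="ex:read-external-entity" as="xs:string">
    <xsl:param name="system-id" as="xs:string"/>
    <xsl:param name="base-uri" as="xs:string"/>
    <!-- A relative identifier resolves against the referring file;
         an absolute one is returned unchanged. -->
    <xsl:variable name="uri" as="xs:anyURI"
        select="resolve-uri($system-id, $base-uri)"/>
    <xsl:choose>
      <xsl:when test="unparsed-text-available($uri)">
        <!-- Drop a leading text declaration, if there is one. -->
        <xsl:sequence select="
            replace(unparsed-text($uri), '^&lt;\?xml\s[^?]*\?>', '')"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:sequence select="error(
            QName('http://example.org/dtd-sketch', 'ex:unreadable'),
            'cannot read external entity ' || $uri)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:value-of
        select="ex:read-external-entity($system-id, $input-base-uri)"/>
  </xsl:template>
</xsl:stylesheet>
```

The base URI must be absolute for resolve-uri() to succeed. A catalogue lookup, once written, would slot in before the call to resolve-uri().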

Results

The current status is that the parser returns a decl element for each input construct. Attribute lists are checked but not yet represented as structures of sub-elements, which is planned and partially completed at the time of writing. Element content models are parsed (simple bottom-up followed by recursive descent) and represented as XML elements.

The parser is slower than one might like. It takes Saxon 9 (EE version) 22 seconds to parse the JATS DTD for example (on a 2013 computer with a spinning disk) and to produce 4,336 decl elements. However, even a rain dance done slowly, barefoot on the soft green moss of a clearing in the magic forest, brings gentle rain and cool refreshing yoghurt.

The parser is not ready for use in Eddie 2 to replace Perl for testing XSLT for DTD coverage (does every element in the DTD have a template to match it, and if not, which do not), but it is close. Since it does not yet parse attribute lists completely (but might by the time you read this paper), it also cannot generate complete Near and Far diagrams. The author’s need for these went away, but a Web-based DTD explorer seems to have potential and could follow, possibly with Saxon-JS to run the parser in the browser.

The code will be available from gitlab.com/barefootliam/dtdeum in the near future.

Further work

In addition to the use cases described earlier, an XML instance validator, a DTD to XML Schema converter, and other tools may be created.

Currently, comments are discarded; some comments provide information about elements or attributes, and could usefully be preserved, and that would be a simple addition.

The tool as written could be improved in both readability and performance by refactoring. For example, a structure such as the following would simplify the code and make it easier to modify:

<xsl:variable name="parse-info" as="array(*)" select="
  [
     map {
       'regex' : '&lt;!ELEMENT',
       'handler' : dc:handle-element-decl#4
     },
     . . .
  ]
" />

An array of maps allows the maintainer control over the order in which the patterns are tried (important for entity declarations) and each map contains a regular expression and a function to handle matching constructs. A sequence of two items could also be used, but this method allows extra items to be added to the maps, if needed, such as a flags argument to pass to matches(). Then the xsl:choose becomes a single XPath expression:

array:filter($parse-info, function($item) as xs:boolean {
    matches($input, $item?regex, 'sx')
  })(1)?handler($input)

In practice not all visitors to the magical forest will find this part of the magical spell to be readable, but with some comments, and with expanding it to handle errors and to add the cursor-handling, it will become as clear as the water in the Magic Pond, which to those who know how to reach it is a refreshing ointment for the soul.

An API

In order to make DTDeum useful for third parties, there is a simple function-based API in development. The API shields the developer using DTDeum from the internal representations, so that the library implementation could change (for example, moving from decl elements to arrays or maps) without the calling XPath or XSLT needing to change.

The API is not described in detail in this paper; it contains functions such as dtd:get-element-declaration($dtd, $name) and dtd:get-elements-with-attribute($dtd, $name).
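To show how such functions can hide the internal representation, here is a sketch of the two functions named above. Everything beyond the two names is an assumption made for illustration: the namespace URI, the parameter and return types, and the dtd, decl and attribute element structure are all invented here and may differ from DTDeum.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dtd="http://example.org/dtd-api-sketch"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- ASSUMPTION: a placeholder representation, hidden behind the API. -->
  <xsl:variable name="sample" as="element(dtd)">
    <dtd>
      <decl type="element" name="event"/>
      <decl type="element" name="note"/>
      <decl type="attlist" element="event">
        <attribute name="iso-date" type="CDATA"/>
      </decl>
    </dtd>
  </xsl:variable>

  <!-- The declaration of the element called $name, if there is one -->
  <xsl:function name="dtd:get-element-declaration" as="element(decl)?">
    <xsl:param name="dtd" as="element(dtd)"/>
    <xsl:param name="name" as="xs:string"/>
    <xsl:sequence
        select="$dtd/decl[@type = 'element'][@name = $name][1]"/>
  </xsl:function>

  <!-- Declarations of all elements that have an attribute called $name -->
  <xsl:function name="dtd:get-elements-with-attribute" as="element(decl)*">
    <xsl:param name="dtd" as="element(dtd)"/>
    <xsl:param name="name" as="xs:string"/>
    <xsl:variable name="owners" select="
        $dtd/decl[@type = 'attlist'][attribute/@name = $name]/@element"/>
    <xsl:sequence
        select="$dtd/decl[@type = 'element'][@name = $owners]"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <!-- Callers never touch decl elements directly. -->
    <xsl:value-of separator=", " select="
        dtd:get-elements-with-attribute($sample, 'iso-date')/@name"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="
        exists(dtd:get-element-declaration($sample, 'vodka'))"/>
  </xsl:template>
</xsl:stylesheet>
```

If the representation later moved from decl elements to maps, only the two function bodies would need rewriting; the calls in the initial template would stay the same, apart from how a name is read from each result.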

Conclusions

It is possible to parse XML DTDs using XSLT. It may be possible to do so elegantly and efficiently, but the work being described in this paper does not do so. None the less, document type definitions are a part of XML, and the ability to process XML documents should therefore include the ability to process a DTD.

More work on external entity resolution is needed; however, the XSLT described in this paper can already parse the JATS and DocBook DTDs successfully.

The author hopes to make available versions of its diagramming tool (DTD Village) and its DTD-to-DTD conversion tool (Eddie 2) built on the DTDeum library.

References

[Aho and Ullman 1977] Alfred Aho and Jeffrey D. Ullman, Principles of Compiler Design (the Green Dragon Book), Addison-Wesley, 1977. There is also the Dragon Book, Compilers: Principles, Techniques, and Tools, by the same authors with others; there are many newer references on parsing, but these ones have nice covers.

[Quin 2015] Quin, Liam, Diagramming XML: Exploring Concepts, Constraints and Affordances, presented at Balisage: The Markup Conference 2015, Washington, DC, August 11 - 14, 2015. In Proceedings of Balisage: The Markup Conference 2015. Balisage Series on Markup Technologies, vol. 15 (2015). https://doi.org/10.4242/BalisageVol15.Quin01. This work was one of the motivations for the work in the current paper.

[Quin 2020] Quin, Liam, Analytical XSLT, presented at XML Prague 2020, February 13 - 15, 2020. In XML Prague 2020 Conference Proceedings, pp. 219-230. A paper on a tool to analyze two very similar DTDs and write XSLT to transform documents from one to the other; it was the other motivation for this work, besides diagramming.

[Rendgen 2021] Rendgen, Sandra, History of Information Graphics, Taschen, 2021 (XXL Edition). This is very large and heavy, but has a wider scope than books by Manuel Lima and others referenced in the Diagramming XML paper.
