The author developed a tool for working with XSLT transformations that map documents from one DTD to another, where the two DTDs are broadly similar. Examples might include two versions of DocBook, or a generic JATS version and a customized one. This tool needed a list of all the elements defined in each DTD, along with their content models (the elements they can contain) and the attributes they might have.
The tool was written in Perl, because the Python module most commonly used for processing XML did not give the necessary access and the Perl one did. It could also have been written in Java or C, but no matter: it was not written in XSLT. Like fairy gold that crumbles into dust in the night, Perl has become unfashionable. Since the tool processed XML, people wanted an XSLT or XQuery version.
So a spell was cast (longum et longum servi laborem) and reading DTD files in XSLT became a thing.
This work enabled an XSLT coverage checker and a Near and Far style diagram generator. Future work may also include document validation and possibly a small bush whose fruit hatch into sparkly pink unicorns.
This paper describes the ugly and evil methods used to parse DTD syntax, some challenges faced and how they were overcome in fierce magical battles, and how it might have been done better by a wiser fairy.
Use Cases
Some anticipated ways the XSLT module could be used:
Comparing DTDs
A developer or analyst may have two large document type definitions, perhaps each using dozens of files, and perhaps using nested parameter entity definitions for element names and content models in different ways, and be faced with the problem of determining whether the DTDs describe the same set of documents or, if not, enumerating the differences.
The Eddie 2 tool already mentioned is one example of software to do this.
Diagramming DTDs
Some ways to draw diagrams representing DTDs were introduced in Quin 2015; to generate such diagrams using XSLT (or XQuery) requires access to the declarations in the DTD. A rather large overview of the history of information graphics can be found in Rendgen 2021.
Having a DTD represented simply in XML, without the complexities of W3C XML Schema, enables one to ask questions such as: which elements contain a vodka element but do not also contain some other given element; which elements have an iso-date attribute but do not allow text content; which attributes are declared as enumerations that include the value joyfulness? It is not difficult to formulate XPath or XQuery expressions to find answers to such questions, given a suitable data model.
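As a minimal sketch of such queries, the stylesheet below asks the three questions above of a small invented data model. Everything about that model is an assumption made for illustration: the kind, name, mixed, and element attributes, the child, attribute, and value elements, and the sample element names (including olive as the second element) are hypothetical and are not the representation the parser actually produces.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical data model, invented for this sketch only. -->
  <xsl:variable name="dtd">
    <decl kind="element" name="martini" mixed="false">
      <child name="vodka"/>
      <child name="olive"/>
    </decl>
    <decl kind="element" name="screwdriver" mixed="false">
      <child name="vodka"/>
      <child name="orange-juice"/>
    </decl>
    <decl kind="element" name="party" mixed="true">
      <child name="martini"/>
    </decl>
    <decl kind="attlist" element="martini">
      <attribute name="iso-date" type="CDATA"/>
    </decl>
    <decl kind="attlist" element="party">
      <attribute name="iso-date" type="CDATA"/>
      <attribute name="mood" type="enumeration">
        <value>joyfulness</value>
        <value>gloom</value>
      </attribute>
    </decl>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <!-- Elements that may contain vodka but not olive (screwdriver here). -->
    <xsl:text>vodka without olive: </xsl:text>
    <xsl:value-of separator=", "
      select="$dtd/decl[@kind = 'element']
                       [child/@name = 'vodka']
                       [not(child/@name = 'olive')]/@name"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Elements with an iso-date attribute and no text content (martini here). -->
    <xsl:text>iso-date, no text: </xsl:text>
    <xsl:value-of separator=", "
      select="$dtd/decl[@kind = 'element'][@mixed = 'false']
                       [@name = $dtd/decl[@kind = 'attlist']
                                         [attribute/@name = 'iso-date']/@element]/@name"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Enumerated attributes that allow the value joyfulness (party/@mood here). -->
    <xsl:text>enumerations with joyfulness: </xsl:text>
    <xsl:value-of separator=", "
      select="$dtd/decl[@kind = 'attlist']
                /attribute[@type = 'enumeration'][value = 'joyfulness']
                /concat(../@element, '/@', @name)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The stylesheet needs no source document; it runs from the xsl:initial-template entry point. With a different data model only the path expressions would change, not the shape of the questions.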
How to Cast the Magical Spell?
The main considerations for what parsing technology to use generally include:
Availability of tools;
Knowledge of the would-be spell-caster;
Limitations of available tools;
Time and system resources (material components).
In this case, although there are parsing tools available, the DTD syntax has a hidden complexity: parameter entities.
A parameter entity is a named string that can be interpolated anywhere inside a DTD. Here is an example:
<!ENTITY % model "(antennae, head, body, wings, feet)">
<!ELEMENT butterfly %model;>
This example produces the same effect as:
<!ELEMENT butterfly (antennae, head, body, wings, feet)>
So far so good. Here is a harder example:
<!--* the following appears to be legal, if obscure: *-->
<!ENTITY % er 'der"'>
<!ENTITY % type 'CDATA #FIXED "slen%er;'>
<!ATTLIST boy ankles %type; >
Here, the closing double quote from the entity type is supplied from the parameter entity er. This example shows that you cannot simply apply the grammar from the XML specification without expanding parameter entities first. However, since parameter entity definitions can appear in included files, and the inclusion mechanism uses parameter entity references to include the file, parameter entity definitions and references must be processed as they are encountered.
The interaction between parameter entities and syntax means that one cannot simply expand all entities before parsing. A parameter entity might contain a definition of another parameter entity that in turn overrides a later one (the earliest definition wins):
<!ENTITY % A "<!ENTITY % B 'haha!'>">
%A;
<!ENTITY % B "this one is checked for syntax but never used">
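The "earliest definition wins" rule can be sketched with an XPath 3.1 map used as the entity table. This is an illustration under stated assumptions, not the parser's own code: the functions ex:declare and ex:expand and the ex namespace are hypothetical names, and the sketch expands references lazily at the point of use, ignoring both the literal-versus-declaration context and the spaces that XML adds around replacement text outside literals.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xmlns:ex="http://example.org/hypothetical"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Record a parameter entity; an existing definition is never replaced. -->
  <xsl:function name="ex:declare" as="map(xs:string, xs:string)">
    <xsl:param name="entities" as="map(xs:string, xs:string)"/>
    <xsl:param name="name" as="xs:string"/>
    <xsl:param name="value" as="xs:string"/>
    <xsl:sequence select="
      if (map:contains($entities, $name)) then $entities
      else map:put($entities, $name, $value)"/>
  </xsl:function>

  <!-- Replace each %name; reference, recursing into the replacement text.
       Unknown names are left untouched. Self-referential entities, which
       XML forbids, would recurse without end. -->
  <xsl:function name="ex:expand" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="entities" as="map(xs:string, xs:string)"/>
    <xsl:variable name="parts" as="xs:string*">
      <xsl:analyze-string select="$text" regex="%([^;%\s]+);">
        <xsl:matching-substring>
          <xsl:sequence select="
            if (map:contains($entities, regex-group(1)))
            then ex:expand($entities(regex-group(1)), $entities)
            else ."/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($parts, '')"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="e1" select="ex:declare(map{}, 'extras', 'wings')"/>
    <!-- The second definition of extras is ignored. -->
    <xsl:variable name="e2" select="ex:declare($e1, 'extras', 'feet')"/>
    <xsl:variable name="e3" select="ex:declare($e2, 'model', '(head, body, %extras;)')"/>
    <!-- Writes: <!ELEMENT butterfly (head, body, wings)> -->
    <xsl:value-of select="ex:expand('&lt;!ELEMENT butterfly %model;&gt;', $e3)"/>
  </xsl:template>
</xsl:stylesheet>
```

Because the map is immutable, each declaration yields a new table, which fits the one-construct-at-a-time style of parsing: the table in force for a construct is simply the one produced by everything before it.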
So we can take one of three technical approaches. The quickest is not supporting all the complicated uses of parameter entities and only allowing them in content models. We can then add them to the grammar for content models and move on with life.
Unfortunately, actual DTDs do use parameter entities to define sets of attributes, and they do override parameter entities. It would still be possible to cast the spell of parsing by adding parameter entities to the grammar wherever we want to allow them, but this approach would preclude full XML validation in the future, or at least would make it difficult.
Perhaps a tree-hanging grammar could be used, which would cope with parameter entity references by treating them as “errors” and handling them specially, but that would mean writing a tree-hanging parser in XSLT, a much longer and more involved magical spell than the fairy could justify attempting to cast.
XSLT does, however, have a built-in mechanism for parsing fragments of text. Eating of this food will doom us to live forever in fairyland, but we have already tasted it: the impenetrable stew of regular expressions. It is well known that attempting to parse XML documents using pure regular expressions will summon demons, but we are not doing that. If we come to doing validation, the XSLT processor will read the document itself for us, and we will validate based on the resulting XDM instance. And DTD syntax is easily amenable to regular expressions. What makes this possible is the rule that each top-level construct must begin and end in the same entity, so that this is illegal:
<!ENTITY % A '<!ELEMENT' >
<!ENTITY % B '>' >
%A;cuddles (pillow|arms|blanket)* %B;
We should note that there are also restrictions on open angle brackets in entity values and string literals: they must be escaped. They are shown expanded in this paper for readability.
So a DTD must be parsed from start to end, one construct at a time, because earlier constructs can modify the meaning of later ones. When parameter entity references are encountered they must be expanded before proceeding. The rule about constructs does mean, though, that the DTD can be parsed one top-level construct at a time.
The author decided to start out by trying a regular-expression-based approach that matches each construct and processes it. In retrospect, it would almost be sufficient to separate constructs before parsing them by scanning ahead for the closing greater-than sign, except that attribute defaults and entity replacements can contain unescaped greater-than signs (but not unescaped less-than signs). Since a string could be started or ended by a parameter entity reference, it is not possible to use this approach. The hybrid approach actually adopted does not handle all cases of parameter entity expansion, but could be extended to do so, as described below.
Matching Constructs
The constructs in a DTD are syntactically independent (disregarding parameter entity interactions for a moment). So it makes sense in XSLT to write a template or function to handle a single entire construct.
The approach taken, then, was to have a top-level function that is given the text of the DTD. This identifies one top-level construct, calls a function to handle it, and then recurses on the rest of the input.
This approach actually worked with some test DTDs, but when presented with the JATS DTD, the Java Virtual Machine fell onto its back and wiggled its little Corgi legs in the air, wailing in stack-overflown despair. Since no fairy wants to upset Corgis, and since this meant the parser was useless in practice, it had to be rewritten.
The problem turned out to be the text of the rest of the input passed to the next iteration. Alas, taking the computer for walkies and giving it some water didn’t help.
The solution was to keep the input text unchanged, and to pass an integer, a cursor, pointing to the current parse location. Now the stylesheet worked on the JATS DTD.
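The cursor idea can be sketched as follows. This is a toy under loud assumptions, not the actual parser: the template name ex:parse, its parameters, and the start attribute are hypothetical, and a construct is taken to end at the first greater-than sign, which (as noted earlier) is not sufficient for real DTDs. The point is only that the recursive call passes the unchanged input and an integer rather than the remaining text.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/hypothetical"
    exclude-result-prefixes="#all">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template name="ex:parse" as="element(decl)*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="cursor" as="xs:integer" select="1"/>
    <xsl:param name="results" as="element(decl)*" select="()"/>

    <!-- Look at the text from the cursor onwards; it is used locally
         and is never handed to the next call. -->
    <xsl:variable name="rest" select="substring($input, $cursor)"/>
    <xsl:variable name="trimmed" select="replace($rest, '^\s+', '')"/>
    <xsl:variable name="skipped"
      select="string-length($rest) - string-length($trimmed)"/>

    <xsl:choose>
      <!-- Nothing left: return what has been collected. -->
      <xsl:when test="$trimmed = ''">
        <xsl:sequence select="$results"/>
      </xsl:when>
      <!-- Toy rule: a construct runs from "<!KEYWORD " to the first ">". -->
      <xsl:when test="matches($trimmed, '^&lt;![A-Z]+\s')
                      and contains($trimmed, '&gt;')">
        <xsl:variable name="text"
          select="concat(substring-before($trimmed, '&gt;'), '&gt;')"/>
        <xsl:variable name="decl" as="element(decl)">
          <decl start="{$cursor + $skipped}">
            <xsl:value-of select="$text"/>
          </decl>
        </xsl:variable>
        <!-- Recurse with the same input and an advanced cursor. -->
        <xsl:call-template name="ex:parse">
          <xsl:with-param name="input" select="$input"/>
          <xsl:with-param name="cursor"
            select="$cursor + $skipped + string-length($text)"/>
          <xsl:with-param name="results" select="($results, $decl)"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:sequence select="error((),
          concat('Unrecognised input at offset ', $cursor + $skipped))"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template name="xsl:initial-template">
    <declarations>
      <xsl:call-template name="ex:parse">
        <xsl:with-param name="input" select="
          '&lt;!ELEMENT butterfly (head, wings)&gt; &lt;!ATTLIST butterfly colour CDATA #IMPLIED&gt;'"/>
      </xsl:call-template>
    </declarations>
  </xsl:template>
</xsl:stylesheet>
```

Each handled construct advances the cursor by at least one character, so the recursion terminates. The recursive call is the last instruction in its branch, which gives a processor the opportunity to treat it as a tail call, although whether it does so is up to the implementation.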
So now, the parsing function recognizes the start of a construct (after expanding a parameter entity reference if that should happen to be the next input token) and hands off to a function to handle that construct. The handler function returns an array of two items: the number of characters read in consuming the entire construct, which is added to the cursor, and the new declaration, if any:
<xsl:when test="matches($input-no-leading-ws, '^&lt;!ELEMENT\s+')">
  <xsl:variable name="decl" as="array(*)"
    select="dc:handle-element-decl(
      $input, $results-so-far, $input-base-uri, $skipped)" />
  <!--* Now recurse for the rest of the input: *-->
  <xsl:call-template name="dc:parse-string-tmpl">
    <xsl:with-param name="input" select="$input" />
    <xsl:with-param name="results-so-far"
      select="($results-so-far, $decl(2))" />
    <xsl:with-param name="input-base-uri" select="$input-base-uri" />
    <xsl:with-param name="cursor"
      select="$cursor + xs:integer($decl(1))" />
  </xsl:call-template>
</xsl:when>
Here, dc:handle-element-decl() will process one element declaration and, assuming it finds one, return $results-so-far (a sequence of such decl elements) with the new decl element appended. It also returns the number of characters it processed, so that we know where to resume parsing.
The astute, handsome, and knowledgeable reader will note that we could just use starts-with($input-no-leading-ws, '<!ELEMENT') rather than a regular expression match. An even more limber sage might observe that ELEMENT could be followed by a parameter entity reference instead of whitespace. And that is true. An earlier version of the parser followed the XML grammar more closely. A regular expression is used to match S from the XML grammar, and so that all of the xsl:when clauses use the same format, for readability. However, the expression as shown here is not correct.
What might really be wanted here would be to match <!ELEMENT when followed by spaces or by a %-sign. XPath regular expressions do not have a “when followed by” operator, more conventionally called a positive lookahead assertion. We can match <!ELEMENT[\s%]+ in this case, however. Really we just want to catch <!ELEMENTS as an error, and although the handle-element-decl function will do that, if it knows the first token is correct it can be simpler to write. Perl regular expressions have \b to match a word boundary, and vi (and grep) have \< and \> to match the start and end boundaries of a word respectively, but we lack those, too. So, for this version, there must be a space in the input after the keyword.
To return to the more general topic, the approach taken is to parse a single construct as determined by matching the first token of the remaining input, and then to parse the remaining constructs recursively. Whitespace and comments at the start are removed at each iteration to reduce the number of recursions; comments could easily be kept, though, given a use for them.
The next section (just through that arch in the hedge) describes what the individual handler functions look like.
Handling a single construct (you may touch the exhibits)
A handler function must do several things:
It must consume at least one character of input, or raise an error. In this way infinite recursion is avoided.
It must consume a single construct.
It must return the results so far merged with the new construct, along with the number of (Unicode) characters that were read.
It should detect errors and report them.
Handler functions are given the following as input parameters:
the entire input;
the results so far as element(decl)*;
the current input base URI, for error reporting;
the current input position (the cursor) as xs:integer, at the start of the construct.
Most of the handler functions define a variable called regex to hold a regular expression that will match the construct; if the pattern does not match, an error is raised. If it does match, the regular expression is used to fish the construct out of the input, and that construct is then processed according to the grammar in the XML specification.
Conditional sections were more complex. In an ignored conditional section, <![ must be matched up with ]]>, which can’t easily be done with a standard regular expression. Perl expressions can do it with procedural depth counting, and Perl 6 has a grammar facility to do it, but here in XSLT it’s done with a recursive function.
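A recursive function of this kind might look like the sketch below. It is an illustration only, not the function used in the parser: the name ex:section-end and its parameters are hypothetical. It relies on the rule in the XML specification that, inside an ignored section, nested <![ and ]]> pairs are counted purely as text.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/hypothetical"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Return the position just after the ]]> that closes the outermost
       open section. $cursor points just past an opening <![ and $depth
       is the number of sections currently open. -->
  <xsl:function name="ex:section-end" as="xs:integer">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="cursor" as="xs:integer"/>
    <xsl:param name="depth" as="xs:integer"/>
    <xsl:variable name="rest" select="substring($input, $cursor)"/>
    <!-- Distance to the next opener and closer, or () if there is none. -->
    <xsl:variable name="to-open" as="xs:integer?" select="
      if (contains($rest, '&lt;!['))
      then string-length(substring-before($rest, '&lt;![')) else ()"/>
    <xsl:variable name="to-close" as="xs:integer?" select="
      if (contains($rest, ']]&gt;'))
      then string-length(substring-before($rest, ']]&gt;')) else ()"/>
    <xsl:choose>
      <xsl:when test="empty($to-close)">
        <xsl:sequence select="error((), 'Unterminated conditional section')"/>
      </xsl:when>
      <!-- A nested section opens first: go one level deeper. -->
      <xsl:when test="exists($to-open) and $to-open lt $to-close">
        <xsl:sequence select="
          ex:section-end($input, $cursor + $to-open + 3, $depth + 1)"/>
      </xsl:when>
      <!-- The closer ends the outermost section: finished. -->
      <xsl:when test="$depth = 1">
        <xsl:sequence select="$cursor + $to-close + 3"/>
      </xsl:when>
      <!-- The closer ends a nested section: come up one level. -->
      <xsl:otherwise>
        <xsl:sequence select="
          ex:section-end($input, $cursor + $to-close + 3, $depth - 1)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="sample" as="xs:string" select="
      '&lt;![IGNORE[ a &lt;![INCLUDE[ b ]]&gt; c ]]&gt;&lt;!ELEMENT x EMPTY&gt;'"/>
    <!-- Start just after the first <![ (position 4) with one section open. -->
    <xsl:variable name="end" select="ex:section-end($sample, 4, 1)"/>
    <!-- Writes what follows the ignored section: <!ELEMENT x EMPTY> -->
    <xsl:value-of select="substring($sample, $end)"/>
  </xsl:template>
</xsl:stylesheet>
```

The recursion depth here is bounded by the number of section markers inside the ignored section rather than by the size of the DTD, so it is unlikely to upset any Corgis.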
External Entities
What’s that crashing through the forest? An elephant? No my fairy friends, an entity!
External parameter entities in an XML DTD can be referenced, in which case the reference is to be replaced by the content of the referenced resource. The rule that all constructs must begin and end in the same entity allows us to process a reference to an external resource (or file, if you prefer) as a top-level construct just like an element declaration.
Note that you could, if you wanted, put a commonly-used content model into a separate file and include it in the middle of an element declaration. The approach here is to expand all parameter entity references after picking out the overall top-level construct, so that case is handled. A declaration cannot start in one file and end in another, so we can always see the end of the declaration.
There are two ways to reference external resources: public identifiers and system identifiers. However, if a public identifier is given, a system identifier must also be given. So there is always a system identifier (which is a URI) and sometimes a public identifier (which can be pretty much any string). If there is a public identifier, some parsers may choose to use it rather than the system identifier, or might be configured to prefer one or the other.
Currently this parser only handles system identifiers, and does not support XML Catalogue files that are often used to map a public identifier into a URI, or to re-map a system identifier URI into a different URI. Since XSLT and XPath do not give us access to the XML catalogue file being used by the XSLT engine, we cannot ask XSLT to resolve identifiers for us. An XSLT implementation of XML catalogue files is therefore planned, but today, only the system identifier is used and it must resolve either as a relative URI (for example a file alongside the current input file) or as an absolute one potentially to be fetched over the network.
The current status is that the parser returns a decl element for each input construct. Attribute lists are checked but not yet represented as structures of sub-elements, which is planned and partially completed at the time of writing. Element content models are parsed (simple bottom-up followed by recursive descent) and represented as XML.
The parser is slower than one might like. It takes Saxon 9 (EE version) 22 seconds to parse the JATS DTD for example (on a 2013 computer with a spinning disk) and to produce 4,336 decl elements. However, even a rain dance done slowly, barefoot on the soft green moss of a clearing in the magic forest, brings gentle rain and cool refreshing yoghurt.
The parser is not ready for use in Eddie 2 to replace Perl for testing XSLT for DTD coverage (does every element in the DTD have a template to match it, and if not, which do not), but it is close. Since it does not yet parse attribute lists completely (but might by the time you read this paper), it also cannot generate complete Near and Far diagrams. The author’s need for these went away, but a Web-based DTD explorer seems to have potential and could follow, possibly with Saxon-JS to run the parser in the browser.
The code will be made available in the near future.
Further work
In addition to the use cases described in the previous section, an XML instance validator, a DTD to XML Schema converter, and other tools, may be created.
Currently, comments are discarded; some comments provide information about elements or attributes, and could usefully be preserved, and that would be a simple addition.
The tool as written could be improved in both readability and performance by refactoring. For example, a structure such as the following would simplify the code and make it easier to modify:
<xsl:variable name="parse-info" as="array(*)"
  select="
    [
      map {
        'regex' : '&lt;!ELEMENT',
        'handler' : dc:handle-element-decl#4
      },
      . . .
    ]
  " />
An array of maps allows the maintainer control over the order in which the patterns are tried (important for entity declarations) and each map contains a regular expression and a function to handle matching constructs. A sequence of two items could also be used, but this method allows extra items to be added to the maps, if needed, such as a flags argument to pass to matches(). Then the xsl:choose becomes a single XPath expression:
array:filter($parse-info,
  function($item) as xs:boolean {
    matches($input, $item?regex, 'sx')
  })(1)?handler($input)
In practice not all visitors to the magical forest will find this part of the magical spell to be readable, but with some comments, and with expanding it to handle errors and to add the cursor-handling, it will become as clear as the water in the Magic Pond, which to those who know how to reach it is a refreshing ointment for the soul.
In order to make DTDeum useful for third parties, there is a simple function-based API in development. The API shields the developer using DTDeum from the internal representations, so that the library implementation could change (for example, moving from elements to arrays or maps) without the calling XPath or XSLT needing to change.
The API is not described in detail in this paper; it contains functions like dtd:get-element-declaration($dtd, name), dtd:get-elements-with-attribute($dtd, name), and so on.
It is possible to parse XML DTDs using XSLT. It may be possible to do so elegantly and efficiently, but the work being described in this paper does not do so. None the less, document type definitions are a part of XML, and the ability to process XML documents should therefore include the ability to process a DTD.
More work on external entity resolution is needed; however, the XSLT described in this paper can already parse the JATS and DocBook DTDs successfully.
The author hopes to make available versions of its diagramming (DTD
Village) and DTD-to-DTD conversion tool (Eddie 2) built on the
