How to cite this paper
Quin, Liam. “DTD (document type definition) declarations exposed in XSLT: Parsing DTD files in XSLT to expose the definitions they contain.” Presented at Balisage: The Markup Conference 2024, Washington, DC, July 29 - August 2, 2024. In Proceedings of Balisage: The Markup Conference 2024. Balisage Series on Markup Technologies, vol. 29 (2024). https://doi.org/10.4242/BalisageVol29.Quin01.
Balisage: The Markup Conference 2024
July 29 - August 2, 2024
Balisage Paper: DTD (document type definition) declarations exposed in XSLT
Parsing DTD files in XSLT to expose the definitions they contain
Liam Quin
Liam Quin was stolen as a child by fae folk and raised in
fairyland, where it developed into a monster that would eventually
become capable of reading XML and writing XSLT.
It was part of the W3C group that created XML, and later worked at
W3C where it was in charge of complaining about XML, as well as
influencing XQuery, XPath, XSLT, and other magical creations, using
its fairyland powers to try to defuse arguments.
It now runs delightfulcomputing.com, writes and maintains XSLT
stylesheets and XQuery applications for people, edits specifications
and proposals for the wise and adventurous, and gives training courses
for would-be explorers.
Copyright Liam Quin, 2024
Abstract
The XML specification defines a syntax for, and semantics for, a
document type definition (DTD). An XML document can optionally reference
a DTD, which it does by means of a document type declaration. The DTD
may contain constraints on the elements that can be found in the XML
document, and what attributes they may contain. It may also define
entities, which associate names with strings or with external resources
such as XML fragments or images. The constraints can be checked using a
process termed validation.
The XML Data Model (XDM) does not expose declarations found in a
document’s document type definition. As a result, XSLT stylesheets,
XPath expressions, XQuery expressions, and anything else using the XDM,
cannot reference them.
Sometimes one might want (or need) to access declarations from a
DTD. The author in the past has written programs in procedural languages
such as C or Perl in order to access such declarations.
This paper describes an XSLT stylesheet that parses DTD syntax and
constructs a simple XML representation that can then be processed using
XSLT or other XML-aware languages directly.
Some use cases and examples are given. The work is publicly
available on gitlab.
Table of Contents
- Introduction
- Use Cases
  - Comparing DTDs
  - Diagramming DTDs
  - Exploring
- How to Cast the Magical Spell?
- Matching Constructs
- Handling a single construct (you may touch the exhibits)
- External Entities
- Results
- Further work
  - An API
- Conclusions
Introduction
The author developed a tool for working with XSLT transformations that
map documents from one DTD to another, where the two DTDs are broadly
similar. Examples might include two versions of DocBook, or from a generic
JATS version to a customized one. This tool needed a list of all the
elements defined in each DTD, along with their content models (the
elements they can contain) and the attributes they might have.
The tool was written in Perl, because the Python module most commonly
used for processing XML did not give the necessary access and the Perl one
did. It could also have been written in Java or C, but no matter: it was
not written in XSLT. Like fairy gold that crumbles into dust in the night,
Perl has become unfashionable. Since the tool processed XML, people wanted
an XSLT or XQuery version.
So a spell was cast (longum et longum servi laborem)
and reading DTD files in XSLT became a thing.
This work enabled an XSLT coverage checker and a Near and Far style
diagram generator. Future work may also include document validation and
possibly a small bush whose fruit hatch into sparkly pink unicorns.
This paper describes the ugly and evil methods used to parse DTD
syntax, some challenges faced and how they were overcome in fierce magical
battles, and how it might have been done better by a wiser fairy.
Use Cases
Some anticipated ways the XSLT module could be used:
Comparing DTDs
A developer or analyst may have two large document type definitions,
perhaps each using dozens of files, and perhaps using nested parameter
entity definitions for element names and content models in different
ways, and be faced with the problem of determining whether the DTDs
describe the same set of documents or, if not, enumerating the
differences.
The Eddie 2 tool, the DTD-to-DTD mapping tool described in the
introduction, is one example of software to do this.
Diagramming DTDs
Some ways to draw diagrams representing DTDs were introduced in
Quin 2015; to generate such diagrams using XSLT
(or XQuery) requires access to the declarations in the DTD. A rather
large overview of the history of information graphics can be found in
Rendgen 2021.
Exploring
Having a DTD represented simply in XML, without the complexities of
W3C XML Schema, enables one to ask questions such as: which elements
contain a vodka element but do not also contain lemonade; which elements
have an iso-date attribute but do not allow text content; which
attributes are declared as enumerations that include the value
joyfulness? It is not difficult to formulate XPath or XQuery expressions
to find answers to such questions, given a suitable data model.
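As an illustration only, here is how two of those questions might look in XPath. The paper does not specify the XML representation, so the dtd, decl, ref and value elements and their attributes below are invented for this sketch and should not be taken as the format the parser actually produces.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Hypothetical representation, invented for this example only. -->
  <xsl:variable name="dtd">
    <dtd>
      <decl type="element" name="cocktail">
        <ref name="vodka"/>
        <ref name="lemonade"/>
      </decl>
      <decl type="element" name="martini">
        <ref name="vodka"/>
        <ref name="vermouth"/>
      </decl>
      <decl type="attribute" element="mood" name="feeling">
        <value>joyfulness</value>
        <value>melancholy</value>
      </decl>
    </dtd>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <!-- Elements that may contain vodka but not lemonade. -->
    <xsl:value-of separator=", " select="
      $dtd/dtd/decl[@type = 'element']
                   [ref/@name = 'vodka']
                   [not(ref/@name = 'lemonade')]/@name"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Enumerated attributes that allow the value joyfulness. -->
    <xsl:value-of separator=", " select="
      $dtd/dtd/decl[@type = 'attribute'][value = 'joyfulness']/@name"/>
  </xsl:template>
</xsl:stylesheet>
```

Run from the initial template with an XSLT 3.0 processor, this should print martini on the first line and feeling on the second.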
How to Cast the Magical Spell?
The main considerations for what parsing technology to use generally
include:
- Availability of tools;
- Knowledge of the would-be spell-caster;
- Limitations of available tools;
- Time and system resources (material components).
In this case, although there are parsing tools available, the DTD
syntax has a hidden complexity: parameter entities.
A parameter entity is a named string that can be interpolated anywhere
inside a DTD. Here is an example:
<!ENTITY % model "(antennae, head, body, wings, feet)">
<!ELEMENT butterfly %model;>
This example produces the same effect as:
<!ELEMENT butterfly (antennae, head, body, wings, feet)>
So far so good. Here is a harder example:
<!--* the following appears to be legal, if obscure: *-->
<!ENTITY % er 'der"'>
<!ENTITY % type 'CDATA #FIXED "slen%er;'>
<!ATTLIST boy
ankles %type;
>
Here, the closing double quote of the attribute default in the entity
type is supplied by the parameter entity er. This example shows that you
cannot
simply apply the grammar from the XML specification without expanding
parameter entities first. However, since parameter entity definitions can
appear in included files, and the inclusion mechanism
uses parameter entity references to include the file, parameter entity
definitions and references must be processed as they are
encountered.
The interaction between parameter entities and syntax means that one
cannot simply expand all entities before parsing. A parameter entity might
contain a definition of another parameter entity that in turn overrides a
later one (the earliest definition wins):
<!ENTITY % A "<!ENTITY % B 'haha!'>">
%A;
<!ENTITY % B "this one is checked for syntax but never used">
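The earliest-definition-wins rule can be sketched in XSLT 3.0 with a map of entity names to values. This is a minimal illustration and not the code of the stylesheet described in this paper: the local namespace, the function name local:define-entity and the choice of a map are all assumptions made for the example.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: record a parameter entity definition, but only
       if the name has not been seen before (the earliest one wins). -->
  <xsl:function name="local:define-entity" as="map(xs:string, xs:string)">
    <xsl:param name="entities" as="map(xs:string, xs:string)"/>
    <xsl:param name="name" as="xs:string"/>
    <xsl:param name="value" as="xs:string"/>
    <xsl:sequence select="
      if (map:contains($entities, $name))
      then $entities
      else map:put($entities, $name, $value)"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <!-- Definitions in the order a parser would meet them. -->
    <xsl:variable name="first"
        select="local:define-entity(map{}, 'B', 'haha!')"/>
    <xsl:variable name="second"
        select="local:define-entity($first, 'B',
                  'this one is checked for syntax but never used')"/>
    <xsl:value-of select="$second?B"/>
  </xsl:template>
</xsl:stylesheet>
```

The second definition of B is ignored, so the stylesheet should print haha! when run from its initial template.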
So we can take one of three technical approaches. The quickest is not
supporting all the complicated uses of parameter entities and only
allowing them in content models. We can then add them to the grammar for
content models and move on with life.
Unfortunately, actual DTDs do use parameter entities to define sets of
attributes, and they do override parameter entities. It would still be
possible to cast the spell of parsing by adding parameter entities to the
grammar wherever we want to allow them, but this approach would preclude
full XML validation in the future, or at least would make it
difficult.
Perhaps a tree-hanging grammar could be used, which would cope with
parameter entity references by treating them as “errors” and handling them
specially, but that would mean writing a tree-hanging parser in XSLT, a
much longer and more involved magical spell than the fairy could justify
attempting to cast.
XSLT does, however, have a built-in mechanism for parsing fragments of
text. Eating of this food will doom us to live forever in fairyland, but
we have already tasted it: the impenetrable stew of regular expressions.
It is well known that attempting to parse XML documents using pure regular
expressions will summon demons, but we are not doing that. If we come to
doing validation, the XSLT processor will read the document itself for us,
and we will validate based on the resulting XDM instance. And DTD syntax
is easily amenable to regular expressions. What makes this possible is the
rule that each top-level construct must begin and end in the same entity,
so that this is illegal:
<!ENTITY % A '<!ELEMENT' >
<!ENTITY % B '>' >
%A;cuddles (pillow|arms|blanket)* %B;
We should note that there are also restrictions on open angle brackets
in entity values and string literals: they must be escaped. They are shown
expanded in this paper for readability.
So a DTD must be parsed from start to end, one construct at a time,
because earlier constructs can modify the meaning of later ones. When
parameter entity references are encountered they must be expanded before
proceeding. The rule about constructs does mean, though, that the DTD can
be parsed one top-level construct at a time.
The author decided to start out by trying a regular-expression-based
approach that matches each construct and processes it. In retrospect, it
would almost be sufficient to separate constructs before parsing them by
scanning ahead for the closing greater-than sign, except that attribute
defaults and entity replacements can contain unescaped greater-than signs
(but not unescaped less-than signs). Since a string could be started or
ended by a parameter entity reference, it is not possible to use this
approach. The hybrid approach actually adopted does not handle all cases
of parameter entity expansion, but could be extended to do so, as
described below.
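To make the expansion step concrete, here is a small sketch of replacing parameter entity references in a string, using xsl:analyze-string and a map of already-seen definitions. It is an assumption-laden illustration rather than the parser's own code: the function name local:expand-pe is invented, the name pattern is a simplified ASCII-only approximation of an XML Name, and the sketch neither adds the leading and trailing space that XML requires around replacement text outside entity values nor guards against self-referential entities (which XML forbids in any case).

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: expand %name; references, recursively. -->
  <xsl:function name="local:expand-pe" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="entities" as="map(xs:string, xs:string)"/>
    <xsl:variable name="parts" as="xs:string*">
      <xsl:analyze-string select="$text"
          regex="%([A-Za-z_][A-Za-z0-9._\-]*);">
        <xsl:matching-substring>
          <xsl:variable name="name" select="regex-group(1)"/>
          <xsl:choose>
            <xsl:when test="map:contains($entities, $name)">
              <!-- The replacement may itself contain references. -->
              <xsl:sequence
                  select="local:expand-pe($entities($name), $entities)"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:sequence select="error(
                  xs:QName('local:undefined-entity'),
                  'Undefined parameter entity: ' || $name)"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($parts, '')"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="entities" as="map(xs:string, xs:string)" select="
      map {
        'parts': 'head, body, wings',
        'model': '(antennae, %parts;, feet)'
      }"/>
    <xsl:value-of select="
      local:expand-pe('&lt;!ELEMENT butterfly %model;&gt;', $entities)"/>
  </xsl:template>
</xsl:stylesheet>
```

With these two made-up entities, the output should be the butterfly declaration shown earlier, with the content model (antennae, head, body, wings, feet) written out in full.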
Matching Constructs
The constructs in a DTD are syntactically independent (disregarding
parameter entity interactions for a moment). So it makes sense in XSLT to
write a template or function to handle a single entire construct.
The approach taken, then, was to have a top-level parse-string function
that is given the text of the DTD. This identifies one top-level
construct, calls a function to handle it, and then recurses on the rest
of the input.
This approach actually worked with some test DTDs, but when presented
with the JATS DTD, the Java Virtual Machine fell onto its back and wiggled
its little Corgi legs in the air, wailing in stack-overflown despair.
Since no fairy wants to upset Corgis, and since this meant the parser was
useless in practice, it had to be rewritten.
The problem turned out to be the text of the rest of the input passed
to the next iteration. Alas, taking the computer for walkies and giving it
some water didn’t help.
The solution was to keep the input text unchanged, and to pass an
integer, a cursor, pointing to the current parse location. Now the
stylesheet worked on the JATS DTD.
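The shape of a cursor-passing recursion can be sketched as follows. This is a deliberately naive illustration, not the parser itself: the function name local:scan is invented, comments end at the first comment terminator and every other construct is assumed to end at the first greater-than sign, which (as noted above) is not safe for real DTDs, and parameter entities are ignored altogether. The point is only that the recursive call receives the unchanged input and an integer, with the results carried in an accumulator.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Hypothetical scanner: returns the text of each construct found. -->
  <xsl:function name="local:scan" as="xs:string*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="cursor" as="xs:integer"/>
    <xsl:param name="results" as="xs:string*"/>
    <!-- Text from the cursor onwards, then without leading whitespace. -->
    <xsl:variable name="rest" select="substring($input, $cursor)"/>
    <xsl:variable name="trimmed" select="replace($rest, '^\s+', '')"/>
    <xsl:variable name="skipped" as="xs:integer"
        select="string-length($rest) - string-length($trimmed)"/>
    <!-- Comments end at their own terminator; all else at the first &gt; -->
    <xsl:variable name="terminator" select="
      if (starts-with($trimmed, '&lt;!--')) then '--&gt;' else '&gt;'"/>
    <xsl:choose>
      <xsl:when test="$trimmed = ''">
        <xsl:sequence select="$results"/>
      </xsl:when>
      <xsl:when test="starts-with($trimmed, '&lt;!')
                      and contains($trimmed, $terminator)">
        <xsl:variable name="construct" select="
          substring-before($trimmed, $terminator) || $terminator"/>
        <!-- Recurse with the same input and an advanced cursor. -->
        <xsl:sequence select="local:scan(
            $input,
            $cursor + $skipped + string-length($construct),
            ($results, $construct))"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:sequence select="error(xs:QName('local:syntax'),
            'Unrecognized input at offset ' || $cursor + $skipped)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="dtd" as="xs:string" select="
      '&lt;!-- c --&gt; &lt;!ELEMENT a (b)&gt; &lt;!ELEMENT b EMPTY&gt;'"/>
    <xsl:value-of select="string-join(local:scan($dtd, 1, ()), '&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

On the three-construct sample string this should print the comment and the two element declarations, one per line.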
So now, the parsing function recognizes the start of a construct
(after expanding a parameter entity reference if that should happen to be
the next input token) and hands off to a function to handle that
construct. The handler function returns an array of two items: the number
of characters consumed in reading the entire construct, which is added to
the cursor, and the new declaration, if any:
<xsl:when test="matches($input-no-leading-ws, '^&lt;!ELEMENT\s+')">
  <xsl:variable name="decl" as="array(*)"
      select="dc:handle-element-decl(
                $input,
                $results-so-far,
                $input-base-uri,
                $skipped)" />
  <!--* Now recurse for the rest of the input: *-->
  <xsl:call-template name="dc:parse-string-tmpl">
    <xsl:with-param name="input" select="$input" />
    <xsl:with-param name="results-so-far"
        select="($results-so-far, $decl(2))" />
    <xsl:with-param name="input-base-uri"
        select="$input-base-uri" />
    <xsl:with-param name="cursor"
        select="$cursor + xs:integer($decl(1))" />
  </xsl:call-template>
</xsl:when>
Here, dc:handle-element-decl() will process one element declaration
and, assuming it finds one, return $results-so-far (a sequence of such
decl elements) with the new decl element appended. It also returns the
number of characters it processed, so that we know where to resume
parsing.
The astute, handsome, and knowledgeable reader will note that we could
just use starts-with($input-no-leading-ws, '<!ELEMENT')
rather than a regular expression match. An even more limber sage might
observe that ELEMENT could be followed by a parameter entity reference
instead of whitespace. And that is true. An earlier version of the parser
followed the XML grammar more closely. A regular expression is used to
match S from the XML grammar, and so that all of the xsl:when clauses use
the same format, for readability. However, the expression as shown here is
not correct.
What might really be wanted here would be to match <!ELEMENT when
followed by spaces or by a %-sign. XPath regular expressions do not have a
“when followed by” operator, more conventionally called a positive
lookahead assertion. We can match <!ELEMENT[\s%]+ in this case,
however. Really we just want to catch <!ELEMENTS as an error, and
although the handle-element-decl function will do that, if it knows the
first token is correct it can be simpler to write. Perl regular
expressions have \b to match a word boundary, and vi (and grep) have \<
and \> to match the start and end boundaries of a word respectively, but
we lack those, too. So, for this version, there must be a space in the
input after the keyword.
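The difference between the two patterns is easy to try out. The sketch below wraps the [\s%] variant in a small function; the name local:starts-declaration is invented for the example, and the test anchors at the start of the string and looks only at the single character after the keyword.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: does the input start with the given keyword,
       followed by whitespace or by the start of a parameter entity
       reference? -->
  <xsl:function name="local:starts-declaration" as="xs:boolean">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="keyword" as="xs:string"/>
    <xsl:sequence
        select="matches($input, '^&lt;!' || $keyword || '[\s%]')"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:value-of separator=", " select="
      local:starts-declaration('&lt;!ELEMENT boy EMPTY&gt;', 'ELEMENT'),
      local:starts-declaration('&lt;!ELEMENT%name; EMPTY&gt;', 'ELEMENT'),
      local:starts-declaration('&lt;!ELEMENTS boy EMPTY&gt;', 'ELEMENT')"/>
  </xsl:template>
</xsl:stylesheet>
```

This should print true, true, false: the misspelled ELEMENTS keyword is rejected without needing a word-boundary operator.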
To return to the more general topic, the approach taken is to parse a
single construct as determined by matching the first token of the
remaining input, and then to parse the remaining constructs recursively.
Whitespace and comments at the start are removed at each iteration to
reduce the number of recursions; comments could easily be kept, though,
given a use for them.
The next section (just through that arch in the hedge) describes what
the individual handler functions look like.
Handling a single construct (you may touch the exhibits)
A handler function must do several things:
- It must consume at least one character of input, or raise an
  error. In this way infinite recursion is avoided.
- It must consume a single construct.
- It must return the results so far merged with the new construct,
  along with the number of (Unicode) characters that were read.
- It should detect errors and report them.
Handler functions are given the following as input parameters:
- the entire input;
- the results so far as element(decl)*;
- the current input base URI, for error reporting;
- the current input position (the cursor) as xs:integer, at the
  start of the construct.
Most of the handler functions define a variable called regex to hold a
regular expression that will match the construct; if the pattern does not
match, an error is raised. If it does match, the regular expression is
used to fish the construct out of the input, and that construct is then
processed according to the grammar in the XML specification.
Conditional sections were more complex. In an ignored conditional
section, <![ must be matched up with ]]>, which can’t easily be done
with a standard regular expression. Perl expressions can do it with
procedural depth counting, and Perl 6 has a grammar facility to do it, but
here in XSLT it’s done with a recursive function.
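A recursive depth-counting function of this kind might look like the sketch below. It is one possible shape and not necessarily the one used in the stylesheet: the function name local:end-of-section and its parameters are invented here. Starting just after an opening delimiter, it hops to whichever of the two delimiters comes next, adjusts the depth, and returns the position of the first character after the matching close.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Hypothetical helper. $cursor is a 1-based position inside the
       section; $depth is the number of sections currently open. -->
  <xsl:function name="local:end-of-section" as="xs:integer">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="cursor" as="xs:integer"/>
    <xsl:param name="depth" as="xs:integer"/>
    <xsl:variable name="rest" select="substring($input, $cursor)"/>
    <!-- Distance to the next opening and closing delimiter, or -1. -->
    <xsl:variable name="to-open" as="xs:integer" select="
      if (contains($rest, '&lt;!['))
      then string-length(substring-before($rest, '&lt;!['))
      else -1"/>
    <xsl:variable name="to-close" as="xs:integer" select="
      if (contains($rest, ']]&gt;'))
      then string-length(substring-before($rest, ']]&gt;'))
      else -1"/>
    <xsl:choose>
      <xsl:when test="$to-close lt 0">
        <xsl:sequence select="error(xs:QName('local:unterminated'),
            'Conditional section is not closed')"/>
      </xsl:when>
      <!-- A nested section opens first: go one level deeper. -->
      <xsl:when test="$to-open ge 0 and $to-open lt $to-close">
        <xsl:sequence select="local:end-of-section(
            $input, $cursor + $to-open + 3, $depth + 1)"/>
      </xsl:when>
      <!-- This close matches the outermost open: we are done. -->
      <xsl:when test="$depth eq 1">
        <xsl:sequence select="$cursor + $to-close + 3"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:sequence select="local:end-of-section(
            $input, $cursor + $to-close + 3, $depth - 1)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="dtd" as="xs:string" select="
      '&lt;![IGNORE[ a &lt;![INCLUDE[ b ]]&gt; c ]]&gt;&lt;!ELEMENT x EMPTY&gt;'"/>
    <!-- Position 4 is just after the first opening delimiter. -->
    <xsl:variable name="end" select="local:end-of-section($dtd, 4, 1)"/>
    <xsl:value-of select="substring($dtd, $end)"/>
  </xsl:template>
</xsl:stylesheet>
```

For the sample input, with one section nested inside the ignored one, the function should return 37 and the stylesheet should print the element declaration for x that follows the section.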
External Entities
What’s that crashing through the forest? An elephant? No my fairy
friends, an entity!
External parameter entities in an XML DTD can be referenced, in which
case the reference is to be replaced by the content of the referenced
resource. The rule that all constructs must begin and end in the same
entity allows us to process a reference to an external resource (or file,
if you prefer) as a top-level construct just like an element
declaration.
Note that you could, if you wanted, put a commonly-used content model
into a separate file and include it in the middle of an element
declaration. The approach here is to expand all parameter entity
references after picking out the overall top-level construct, so that case
is handled. A declaration cannot start in one file and end in another, so
we can always see the end of the declaration.
There are two ways to reference external resources: public identifiers
and system identifiers. However, if a public identifier is given, a system
identifier must also be given. So there is always a system identifier
(which is a URI) and sometimes a public identifier (which can be pretty
much any string). If there is a public identifier, some parsers may choose
to use it rather than the system identifier, or might be configured to
prefer one or the other.
Currently this parser only handles system identifiers, and does not
support XML Catalogue files that are often used to map a public identifier
into a URI, or to re-map a system identifier URI into a different URI.
Since XSLT and XPath do not give us access to the XML catalogue file being
used by the XSLT engine, we cannot ask XSLT to resolve identifiers for us.
An XSLT implementation of XML catalogue files is therefore planned, but
today, only the system identifier is used and it must resolve either as a
relative URI (for example a file alongside the current input file) or as
an absolute one potentially to be fetched over the network.
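Resolving a system identifier needs nothing beyond standard XPath functions. The sketch below is an assumption about how it could be done, not a copy of the stylesheet: the function names are invented, the URIs are made up, and a real implementation would also have to deal with a text declaration at the start of the fetched resource. Only the resolution step is exercised, so that the example runs without any files.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: make a system identifier absolute, relative
       to the URI of the entity in which the declaration was found. -->
  <xsl:function name="local:resolve-system-id" as="xs:anyURI">
    <xsl:param name="system-id" as="xs:string"/>
    <xsl:param name="input-base-uri" as="xs:string"/>
    <xsl:sequence select="resolve-uri($system-id, $input-base-uri)"/>
  </xsl:function>

  <!-- Hypothetical helper: fetch the replacement text as a string.
       Defined for illustration; not called in this example. -->
  <xsl:function name="local:read-external" as="xs:string">
    <xsl:param name="system-id" as="xs:string"/>
    <xsl:param name="input-base-uri" as="xs:string"/>
    <xsl:variable name="uri"
        select="local:resolve-system-id($system-id, $input-base-uri)"/>
    <xsl:sequence select="
      if (unparsed-text-available($uri))
      then unparsed-text($uri)
      else error(xs:QName('local:unresolved'),
                 'Cannot read external entity: ' || $uri)"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="base" select="'http://example.org/dtds/main.dtd'"/>
    <!-- A relative identifier: a file alongside the current input. -->
    <xsl:value-of
        select="local:resolve-system-id('modules/common.ent', $base)"/>
    <xsl:text>&#10;</xsl:text>
    <!-- An absolute identifier is left as it is. -->
    <xsl:value-of select="local:resolve-system-id(
        'http://example.net/shared/chars.ent', $base)"/>
  </xsl:template>
</xsl:stylesheet>
```

The relative identifier should resolve to http://example.org/dtds/modules/common.ent, while the absolute one comes back unchanged.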
Results
The current status is that the parser returns a decl element for each
input construct. Attribute lists are checked but not yet represented as
structures of sub-elements, which is planned and partially completed at
the time of writing. Element content models are parsed (simple bottom-up
followed by recursive descent) and represented as XML elements.
The parser is slower than one might like. It takes Saxon 9 (EE
version) 22 seconds to parse the JATS DTD for example (on a 2013 computer
with a spinning disk) and to produce 4,336 decl elements. However, even a
rain dance done slowly, barefoot on the soft green moss of a clearing in
the magic forest, brings gentle rain and cool refreshing yoghurt.
The parser is not ready for use in Eddie 2 to replace Perl for testing
XSLT for DTD coverage (does every element in the DTD have a template to
match it, and if not, which do not), but it is close. Since it does not
yet parse attribute lists completely (but might by the time you read this
paper), it also cannot generate complete Near and Far diagrams. The
author’s need for these went away, but a Web-based DTD explorer seems to
have potential and could follow, possibly with Saxon-JS to run the parser
in the browser.
The code will be available from gitlab.com/barefootliam/dtdeum in the
near future.
Further work
In addition to the use cases described earlier, an XML instance
validator, a DTD to XML Schema converter, and other tools may be
created.
Currently, comments are discarded; some comments provide information
about elements or attributes, and could usefully be preserved, and that
would be a simple addition.
The tool as written could be improved in both readability and
performance by refactoring. For example, a structure such as the following
would simplify the code and make it easier to modify:
<xsl:variable name="parse-info" as="array(*)" select="
  [
    map {
      'regex' : '&lt;!ELEMENT',
      'handler' : dc:handle-element-decl#4
    },
    . . .
  ]
" />
An array of maps allows the maintainer control over the order in which
the patterns are tried (important for entity declarations) and each map
contains a regular expression and a function to handle matching
constructs. A sequence of two items could also be used, but this method
allows extra items to be added to the maps, if needed, such as a flags
argument to pass to matches(). Then the xsl:choose becomes a single XPath
expression:
array:filter($parse-info, function($item) as xs:boolean {
    matches($input, $item?regex, 'sx')
})(1)?handler($input)
In practice not all visitors to the magical forest will find this part
of the magical spell to be readable, but with some comments, and with
expanding it to handle errors and to add the cursor-handling, it will
become as clear as the water in the Magic Pond, which to those who know
how to reach it is a refreshing ointment for the soul.
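For visitors who would like to paddle in that pond, a dispatch table of this kind can be tried out in a small self-contained stylesheet. Everything in it is a stand-in: the handlers take one argument and merely name what they saw, there is no cursor, and the local names are invented. One detail worth noting is that array:filter() returns an array, so the first matching entry is selected with the function-call form (1) rather than with a predicate.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:array="http://www.w3.org/2005/xpath-functions/array"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Stand-in handlers: real ones would parse the construct. -->
  <xsl:function name="local:handle-element" as="xs:string">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence select="'element declaration'"/>
  </xsl:function>

  <xsl:function name="local:handle-attlist" as="xs:string">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence select="'attribute list declaration'"/>
  </xsl:function>

  <!-- Patterns are tried in array order. -->
  <xsl:variable name="parse-info" as="array(map(*))" select="[
      map { 'regex': '^&lt;!ELEMENT[\s%]',
            'handler': local:handle-element#1 },
      map { 'regex': '^&lt;!ATTLIST[\s%]',
            'handler': local:handle-attlist#1 }
    ]"/>

  <xsl:function name="local:dispatch" as="xs:string">
    <xsl:param name="input" as="xs:string"/>
    <!-- Keep only the entries whose pattern matches the input. -->
    <xsl:variable name="candidates" as="array(map(*))" select="
      array:filter($parse-info,
        function($entry) { matches($input, $entry?regex) })"/>
    <xsl:sequence select="
      if (array:size($candidates) gt 0)
      then $candidates(1)?handler($input)
      else error(xs:QName('local:syntax'),
                 'No handler matches the input')"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:value-of select="
      local:dispatch('&lt;!ATTLIST boy ankles CDATA #IMPLIED&gt;')"/>
  </xsl:template>
</xsl:stylesheet>
```

For the sample ATTLIST input this should print attribute list declaration. Adding a construct then means adding one map to the array, in the position where its pattern should be tried.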
An API
In order to make DTDeum useful for third parties, there is a simple
function-based API in development. The API shields the developer using
DTDeum from the internal representations, so that the library
implementation could change (for example, moving from decl elements to
arrays or maps) without the calling XPath or XSLT needing to change.
The API is not described in detail in this paper; it contains functions
like dtd:get-element-declaration($dtd, name) and
dtd:get-elements-with-attribute($dtd, name) and so on.
Conclusions
It is possible to parse XML DTDs using XSLT. It may be possible to do
so elegantly and efficiently, but the work being described in this paper
does not do so. None the less, document type definitions are a part of
XML, and the ability to process XML documents should therefore include the
ability to process a DTD.
More work on external entity resolution is needed; however, the XSLT
described in this paper can already parse the JATS and DocBook DTDs
successfully.
The author hopes to make available versions of its diagramming tool (DTD
Village) and DTD-to-DTD conversion tool (Eddie 2) built on the DTDeum
library.
References
[Aho and Ullman 1977] Alfred Aho and Jeffrey D. Ullman,
Principles of Compiler Design (the Green Dragon Book),
Addison-Wesley, 1977. There is also the Dragon Book, Compilers:
Principles, Techniques, and Tools, by the same authors with
others; there are many newer references on parsing, but these ones have
nice covers.
[Quin 2015] Quin, Liam, Diagramming XML: Exploring Concepts,
Constraints and Affordances, presented at Balisage: The Markup
Conference 2015, Washington, DC, August 11 - 14, 2015. In Proceedings of
Balisage: The Markup Conference 2015. Balisage Series on Markup
Technologies, vol. 15 (2015). https://doi.org/10.4242/BalisageVol15.Quin01.
This work was one of the motivations for the work in the current paper.
[Quin 2020] Quin, Liam, Analytical XSLT, presented at XML Prague 2020,
February 13 - 15, 2020. In XML Prague 2020 Conference Proceedings, pp.
219-230. A paper on a tool to analyze two very similar DTDs and write
XSLT to transform documents from one to the other; it was the other
motivation for this work, besides diagramming.
[Rendgen 2021] Rendgen, Sandra, History of Information Graphics,
Taschen, 2021 (XXL Edition). This is very large and heavy, but has a
wider scope than books by Manuel Lima and others referenced in the
Diagramming XML paper.