The Desperate Perl Hacker featured often in the early days of XML. Designing a markup format that could be processed easily by ordinary programmers using their chosen languages was an explicit goal of XML: “4. It shall be easy to write programs which process XML documents.” This goal was achieved, at least for XML itself, if not for all of the subsequent specifications in the broader ecosystem, and as a consequence there are no significant, mainstream languages that are incapable of processing XML. There are probably none for which there isn't a choice of XML parsers. Any language built on top of the Java VM includes such a choice. Modern languages like Scala include features for the specific purpose of writing domain-specific language parsers. These allow XML, or subsets of XML, to be incorporated directly into the language itself.
It is straightforward to parse XML with more-or-less any programming language you care to use. The way, and the extent to which, XML coexists with those languages is largely a question of their design and the full range of language design is outside the scope of this paper.
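As a small illustration of how little ceremony is involved, the following sketch parses a document with Python's standard library. The sample document and its element names (order, item, unitprice) are invented for this example and are not taken from any figure in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the vocabulary is invented for illustration.
SAMPLE = """<order id="42">
  <item sku="A1"><unitprice>9.50</unitprice></item>
  <item sku="B2"><unitprice>20.00</unitprice></item>
</order>"""


def total_price(xml_text):
    """Parse the document and sum the text of every unitprice element."""
    root = ET.fromstring(xml_text)
    return sum(float(price.text) for price in root.iter("unitprice"))


def skus(xml_text):
    """Return the sku attribute of each item that is a child of the root."""
    root = ET.fromstring(xml_text)
    return [item.get("sku") for item in root.findall("item")]
```

Most mainstream languages offer an equivalent tree or streaming API; nothing here depends on features particular to Python.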
Within the XML community, many XML languages have been designed specifically for the purpose of processing XML. These include all of the usual suspects: validation languages, transformation languages, query languages, etc. These are languages designed by XML users for XML users to process XML. These are the languages that are the focus of this paper.
We are concerned mostly with the syntax of these languages, not their semantics. Of course, syntax and semantics are not wholly separable. A language whose semantics are nothing more than the expression of a single boolean value needs at most two tokens and so can be vastly simpler syntactically than a language with Turing-complete semantics. Nevertheless, we'll focus mostly on the syntax for syntax's sake.
The first, perhaps most obvious, question to ask about the syntax of an XML language is: to what extent is it XML itself? A brief survey of XML languages reveals that there is considerable variety on this point.
On one end of the spectrum, RELAX NG Compact Syntax has nothing that resembles XML to the untrained eye. See Figure 1.
On the other end of the spectrum, XQueryX is nothing but XML. See Figure 2.
Other XML languages fit between those two ends. XSLT has a mostly XML syntax; see Figure 3.
XQuery, by contrast, has a mostly non-XML syntax; see Figure 4.
Let's look a little more closely at the distinction between XQueryX and XSLT. On the one hand, XQueryX provides improved machine readability: there are no semantic elements not manifest in the XML. On the other hand, it gains this benefit by sacrificing human readability. These are two possible axes on which we can analyze a language syntax; we'll revisit them later.
In the meantime, let us distinguish a “practical” XML syntax as one that is concise enough for human comprehension (even if it relies on some non-XML syntax to aid readability).
How do XML languages stand up? See Table I.
Table I
Language | XML Syntax | Practical XML Syntax | Non-XML Syntax |
---|---|---|---|
Atom | ✓ | ✓ | |
DocBook, HTML, …[1] | ✓ | ✓ | |
MathML | ✓ | ✓ | |
RELAX NG | ✓ | ✓ | ✓ |
RDF | ✓ | ✓ | ✓ |
Schematron | ✓ | ✓ | |
SVG | ✓ | ✓ | |
XInclude | ✓ | ✓ | |
XLink | ✓ | | |
XML Schema | ✓ | ✓ | |
XPointer | ✓ | | |
XProc | ✓ | ✓ | |
XQuery | ✓ | ✓ | |
XSLT | ✓ | ✓ | |
There may be room for debate about some cells in that table. Evan Lenz's work on Carrot, for example, is moving in the direction of a more compact, non-XML syntax for XSLT. One could argue that TeX is a non-XML syntax for MathML. We might debate whether attribute-based languages like XLink are or are not XML. In addition, there may be other syntaxes for these languages of which the author is unaware. However, at a coarse level of granularity, what we can see is that there are languages all across the spectrum.
Syntactically: XML or not?
Seeing languages spread across a spectrum like this invites the question: why? What motivates a language designer to choose an XML syntax, or not? When both are provided, what motivates a user to choose an XML syntax, or not?
The case for XML syntaxes
Why choose XML?
- “Eat your own dogfood”/“Fly your own airplanes.” One school of thought says that XML languages should be expressed in XML simply because they are XML languages. Some XML developers find XML to be a clear and precise format for the expression of ideas.
- Extensibility. The XML syntax has natural extension points: attributes on start tags, for example, and namespaces. At a syntactic level, extending an XML language is an easily solved problem. Conversely, non-XML languages sometimes suffer from a dearth of extension points. Keeping a grammar for a complex language like XQuery free from ambiguity while simultaneously adding language features can be a real challenge.
Whether the accretion of language features through this form of ad-hoc extension, in either the XML or non-XML cases, produces a coherent and regular language over time is a separate question.
- Accessibility to XML tools. The fact that an XSLT stylesheet can be used to produce an XSLT stylesheet is not a feature that every XSLT user needs, but there are circumstances when it is a great boon.
- Documentation. The ability to inline documentation in an XML language is considered a great benefit in some environments. Expressing XML documentation in a non-XML language can have a deleterious effect on readability. Compare, for example, the non-XML representation of the unitprice pattern, Figure 5, with the equivalent XML representation, Figure 6.
- Syntactic conformance. Operating on XML with a language that has an XML syntax provides certain minimum assurances about the outputs. An XSLT stylesheet, which must itself be well formed, guarantees[2] that the resulting document will be well formed, by virtue of the nature of XSLT.
- Learnability? There's certainly anecdotal evidence that non-programmers can be taught to be productive with XSLT in ways that don't have parallels in non-XML languages. This may be because the structure of the XSLT stylesheet has a strong surface resemblance to the documents that are to be transformed. This is true both at the level of the surface syntax (they're both XML) and at a deeper level in that templates contain fragments of the documents in a very obvious and direct way.
- Declarativeness? There's a tendency for XML languages to have a more declarative nature than their non-XML counterparts. This can be seen particularly in the case of XSLT as compared to XQuery. The XSLT stylesheet in Figure 3 was written in a very “pull” fashion in order to have as much surface similarity to the XQuery example, Figure 4, as possible[3].
A more idiomatically natural XSLT solution for the problem is shown in Figure 7.
In the idiomatic, or “push”, style separate templates are declared for each component. This greatly increases the flexibility and reusability of XSLT.
- Familiarity. For users whose principal tasks involve editing, validating, transforming, or otherwise working with XML, a language that is itself expressed in XML has a certain familiarity. Languages like XSLT or RELAX NG can be edited in the same comfortable, understood environment used for other XML editing tasks.
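The syntactic conformance point in the list above is not peculiar to XSLT. As a rough analogy in Python (chosen only for illustration; the function names are hypothetical and this is not how any XSLT processor works), building output through a tree API yields well-formed results by construction, whereas assembling the same output by string concatenation does not.

```python
import xml.etree.ElementTree as ET


def build_with_tree(title):
    """Build the output as a tree; the serializer escapes markup characters."""
    root = ET.Element("doc")
    ET.SubElement(root, "title").text = title
    return ET.tostring(root, encoding="unicode")


def build_with_strings(title):
    """Build the output by pasting strings together; nothing is escaped."""
    return "<doc><title>" + title + "</title></doc>"


def is_well_formed(text):
    """Report whether the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

With a title such as "Q&A <draft>", the tree-built result parses and round-trips the title, while the concatenated string is rejected by the parser.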
The case for non-XML syntaxes
Why choose a non-XML syntax?
- Conciseness. One of the principal attractions of a non-XML syntax is that it's more compact, more concise. A concise syntax allows more information to fit on a screen or page and consequently provides the reader with a greater perspective on the language.
The compact schema in Figure 1 fits easily on a single page or screen and is completely straightforward to understand, assuming you're familiar with RELAX NG and its compact syntax.
The same schema expressed in the XML syntax, Figure 8, is twice as long as its compact counterpart. It's not manifestly more difficult to understand, assuming you're familiar with RELAX NG and its XML syntax, but it doesn't fit on a single page and contains a lot of syntactic “clutter” that one must learn to “look through”.
- Familiarity. For tasks, such as programming, that are most typically performed with non-XML languages, using a non-XML syntax for an XML language makes it more familiar and approachable for users who come from other backgrounds.
XQuery is arguably far more familiar, and consequently less threatening, more approachable, and easier to learn, for a programmer with a background in SQL or any of a host of common scripting languages.
- Accessibility to non-XML tools. Both familiarity and conciseness play into another strength for non-XML languages: support in tools and environments that programmers are used to. An XQuery or RELAX NG Compact Syntax plugin for the programmer's favorite IDE makes editing those files part of a comfortable, understood environment. Using an XML syntax may require a new editing tool.
- Syntactic expressiveness. An XML syntax imposes constraints on what characters may appear unescaped. Some of the characters that must be escaped are common in other contexts. For example, it's easy to argue that “$a <= 5” is easier to read and understand than “$a &lt;= 5”.
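The escaping cost mentioned in the last item can be seen with the escape and unescape functions in Python's standard xml.sax.saxutils module. This is only a sketch; the helper names to_xml_text and from_xml_text are made up for the example.

```python
from xml.sax.saxutils import escape, unescape


def to_xml_text(expression):
    """Escape &, < and > so the expression can appear as XML character data."""
    return escape(expression)


def from_xml_text(text):
    """Reverse the escaping to recover the expression as originally written."""
    return unescape(text)
```

Here to_xml_text("$a <= 5") yields "$a &lt;= 5", which is the form an author has to read and write whenever the expression is embedded in an XML syntax.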
Syntactically: Both?
Why choose if you can have both? RELAX NG is widely praised for having both an XML syntax and a compact syntax. Why not always take that approach?
One critical metric by which the success or failure of a dual-syntax approach will be judged is semantic compatibility. Arguably, the RELAX NG Compact Syntax has been successful not simply because it has the advantages of a non-XML syntax, but also because it describes exactly the same language as the XML syntax. There are no constructs that can be represented in the compact syntax that cannot be represented in the XML syntax, and vice versa. It is possible to translate every valid schema losslessly from one format to the other and back again.
In practice, this is a remarkably high bar. RELAX NG is a purely declarative language with no semantics for iteration or transformation. As such, it is burdened with far fewer semantics to express than a programming language like XSLT or XQuery. It is difficult to imagine finding a useful alternative syntax for either of those languages that expressed precisely the same underlying semantics.
Yet, the absolute syntactic isomorphism of the two syntaxes is considered in this paper to be an absolute requirement. Devising alternate syntaxes for subsets of a language is both much easier and much less useful. Every instance of the language that uses a construct not available in the alternate syntax is unavailable to the users who prefer the alternative, and to tools that are designed to work best with it.
It's also worth noting that even in the RELAX NG case, there are unusual artifacts in the non-XML syntax: square-bracketed notations placed in front of the constructs that they modify and a somewhat torturous representation of XML markup in such annotations. Luckily, and by design, these annotations are uncommon: the simplest of them are the most common and the most complicated are quite rare. Also, because of the syntactic isomorphism, it is possible to switch back and forth between the syntaxes, editing XML annotations in the XML syntax and content models in the compact syntax, for example.
Case studies: compact syntaxes for XProc
To explore these ideas further, for the balance of this paper, we will consider two alternative, compact syntaxes for XProc: An XML Pipeline Language.
XProc, for those unfamiliar with it, is a language “for describing operations to be performed on XML documents.” A pipeline accepts XML documents as input, performs an arbitrary series of operations on them, and produces XML documents as output. In the context of an XProc pipeline, an “operation” is one of a set of discrete steps. These steps perform tasks such as adding an attribute, counting nodes, deleting nodes, inserting nodes, performing XInclude, XSLT, or XQuery, and various forms of validation. XProc has about 40 such operations built in and may be extended with additional operations.
A simple XProc pipeline is shown in Figure 9.
This pipeline takes a single input document, performs XInclude processing, styles it using the “dbslides.xsl” stylesheet, and then produces as its output the result of that transformation. If the XProc processor serializes the result, it does so as indented XHTML.
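To make the pipeline model concrete, here is a loose Python analogy. It is not XProc itself. Each step is a function from one XML document to another, and a pipeline applies the steps in order. The step names echo a few of the operations listed above, but their signatures and behaviour are simplified assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET
from functools import reduce


def add_attribute(tag, name, value):
    """Return a step that sets name=value on every element called tag."""
    def step(doc):
        for element in doc.iter(tag):
            element.set(name, value)
        return doc
    return step


def delete(tag):
    """Return a step that removes every descendant element called tag."""
    def step(doc):
        for parent in list(doc.iter()):
            for child in list(parent):
                if child.tag == tag:
                    parent.remove(child)
        return doc
    return step


def count(tag):
    """Return a step that replaces the document with a count of tag elements."""
    def step(doc):
        result = ET.Element("result")
        result.text = str(sum(1 for _ in doc.iter(tag)))
        return result
    return step


def run_pipeline(doc, steps):
    """Feed the output of each step to the next, starting from doc."""
    return reduce(lambda current, step: step(current), steps, doc)
```

A call such as run_pipeline(doc, [delete("note"), add_attribute("p", "class", "body"), count("p")]) reads top to bottom in the same way the XML pipeline does. Real XProc steps have named input and output ports and options, which this sketch leaves out.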
Case study 1: A compact syntax for XProc
How might the pipeline in Figure 9 be represented in a compact, non-XML syntax? Where might we look for inspiration?
- Python? With significant whitespace?
- Pascal? With BEGIN/END and :=?
- Scheme? Because everything looks better with parentheses?
- Something from the C/Java/JavaScript family?
For our first attempt, we'll take the last option. Translating Figure 9 into a compact syntax along these lines produces Figure 10.
This is in many ways a very direct translation. Like RELAX NG's compact syntax and XQuery, we use curly braces to delimit the bodies of our semantic constructs. Each new construct is introduced by a new token. There are two syntactic extension points in the XML syntax that we must accommodate: the presence of arbitrary extension attributes on what are elements in the XML syntax, and the presence of arbitrary XML fragments.
The “with” keyword is used at the end of each construct in the compact syntax to introduce an unbounded list of name/value pairs. These map back to extension attributes in the XML syntax.
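A minimal sketch of how such a clause might be read, following the withExtra and attr productions in Appendix A. The name pattern is deliberately simplified to ASCII (the grammar's NameStartChar and NameChar ranges are much wider), and parse_with_clause is a hypothetical helper, not part of any implementation described in this paper.

```python
import re

# Simplified, ASCII-only stand-ins for the NCName and QName productions.
_NAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
_QNAME = rf"{_NAME}(?::{_NAME})?"
# quotedstr: a double-quoted or single-quoted string with no escapes.
_QUOTED = r"\"[^\"]*\"|'[^']*'"
# attr ::= QName '=' (QName | quotedstr)
_ATTR = re.compile(rf"\s*({_QNAME})\s*=\s*({_QUOTED}|{_QNAME})\s*")
_WITH = re.compile(r"\s*with\s+")


def parse_with_clause(text):
    """Parse "with a=b, c='d'" into a dict of attribute names to values."""
    head = _WITH.match(text)
    if head is None:
        raise ValueError("expected the 'with' keyword")
    pos = head.end()
    attrs = {}
    while True:
        match = _ATTR.match(text, pos)
        if match is None:
            raise ValueError(f"expected name=value at offset {pos}")
        name, value = match.group(1), match.group(2)
        if value[0] in "\"'":
            value = value[1:-1]  # strip the surrounding quotes
        attrs[name] = value
        pos = match.end()
        if pos == len(text):
            return attrs
        if text[pos] != ",":
            raise ValueError(f"expected ',' at offset {pos}")
        pos += 1  # skip the comma and read the next pair
```

Each key in the returned dictionary corresponds to one extension attribute on the equivalent element in the XML syntax. Resolving prefixes to namespace names is left out of the sketch.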
Where additional namespaces are required, as in the pipeline library in Figure 11, they're introduced in the compact syntax and CNames are allowed as tokens. The equivalent library in this compact syntax is shown in Figure 12.
This example shows the use of an extension attribute, cx:type, represented in the compact syntax.
The other challenge is representing arbitrary XML. In RELAX NG, arbitrary XML fragments are always annotations of one sort or another; they're both relatively uncommon and, to some extent, unimportant to the core grammar. Not so in XProc, where they appear not only in annotations, like p:documentation (Figure 13), but also as inline document content in the pipeline. Using a syntax as awkward as the approach in RNC seems like a bad choice.
However, in the context of parsing a non-XML syntax, it must be possible to recognize both where the XML begins and where it ends. The presence of, for example, a fragment of XProc compact syntax in a program listing in some XML must not be accidentally parsed as XProc. One approach would be to build a complete XML parser into the grammar of the compact syntax. But even this is tricky because a p:inline might include several consecutive sibling elements that each have to be recognized. If only there were some string of tokens that can't appear in XML…
In fact, such a sequence exists. Almost. The sequence “]]>” is forbidden in XML except when it ends a CDATA section. We can leverage this fact in our compact syntax to form delimiters for arbitrary XML: “<![xml[” and “]]>”. See Figure 14.
It's arguably a hack, but it allows us to satisfy the requirement that each syntax represent exactly the same underlying constructs.
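The delimiter idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation described in this paper: split_inline_xml and top_level_names are hypothetical helpers, the opening delimiter is spelled as in the text above (the grammar in Appendix A writes it in upper case), and the sample pipeline fragment is invented to resemble that grammar.

```python
import xml.etree.ElementTree as ET

OPEN = "<![xml["   # spelling used in the body text; Appendix A has '<![XML['
CLOSE = "]]>"


def split_inline_xml(source):
    """Split compact-syntax text into ("code", ...) and ("xml", ...) chunks."""
    chunks = []
    pos = 0
    while True:
        start = source.find(OPEN, pos)
        if start == -1:
            chunks.append(("code", source[pos:]))
            return chunks
        end = source.find(CLOSE, start + len(OPEN))
        if end == -1:
            raise ValueError("unterminated inline XML block")
        chunks.append(("code", source[pos:start]))
        chunks.append(("xml", source[start + len(OPEN):end]))
        pos = end + len(CLOSE)


def top_level_names(fragment):
    """Parse a fragment of sibling elements by wrapping it in a dummy root."""
    wrapper = ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    return [child.tag for child in wrapper]


# Hypothetical fragment in the style of the first compact syntax.
SAMPLE = 'identity { input "source" { inline <![xml[<a/><b>text</b>]]> } }'
```

The “almost” matters here: inline XML that itself contains a CDATA section would end the block early in this sketch, because the first “]]>” found is treated as the closing delimiter.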
This syntax has been implemented. The implementation strategy is to transform the compact syntax into the XML syntax as a pre-processing step and then process the resulting XML as usual.
How does this syntax stand up to the suggested benefits of non-XML syntaxes?
- Conciseness? A wash. It's not clearly shorter in terms of absolute number of lines.
- Familiarity? Not clear. It has the advantage of less visual clutter, but doesn't draw from the C/Java/JavaScript family in any significant regard beyond curly braces.
- Accessibility to non-XML tools? Probably an improvement. It's likely that a modern IDE could be customized with the EBNF (see Appendix A).
- Syntactic expressiveness? An improvement; outside of XML blocks, there are no characters that need to be explicitly escaped.
Case study 2: An alternate compact syntax for XProc
When I presented the first compact syntax in a lightning talk last year, Jeni Tennison observed that it could be made more compact, and perhaps more useful, if it were more idiomatically like other programming languages. She subsequently produced most of the “second compact syntax” language design.
Translating Figure 9 into this second compact syntax produces Figure 15.
Adopting a more “method call”-like syntax does make the pipelines shorter. The outputs of a step are treated in a similar way, but shown at the end of the body.
The most obvious example of an attempt to make the language more idiomatically like other programming languages can be seen in the handling of p:choose. Consider Figure 16.
Translating it into our initial compact syntax produces Figure 17.
This is clearly a non-XML syntax, but it retains all of the semantic flavor of the original. In the second XProc compact syntax, a choose statement is represented using an if/then/else construct that's likely to be more familiar to programmers; see Figure 18.
Again, this manages to be both shorter and possibly more familiar.
Whether or not either of these syntaxes would be markedly easier to use or would spur greater adoption of XProc is an open question.
Appendix A. Grammar for XProc Compact Syntax #1
document ::= xpcMarker namespace* ( declareStep | pipeline | library ) EOF
xpcMarker ::= 'xproc' version
version ::= '1.0'
namespace ::= ('namespace' prefix '=' quotedstr) | ('default' 'namespace' '=' quotedstr)
prefix ::= NCName
declareStep ::= 'declare-step' stepName? withExtra? pipelineBody
stepName ::= 'named' quotedstr
withExtra ::= 'with' attr (',' attr)*
attr ::= QName '=' (QName | quotedstr)
pipelineBody ::= '{' ( input | output | option | log | serialization )* ( declareStep | pipeline | imports )* subpipeline? '}'
input ::= 'input' quotedstr withExtra? ( '{' binding* '}' )?
output ::= 'output' quotedstr withExtra? ( '{' binding* '}' )?
option ::= 'required' 'option' QName withExtra? | 'option' QName withExtra?
log ::= 'log' quotedstr 'to' quotedstr
serialization ::= 'serialization' quotedstr withExtra?
imports ::= 'import' quotedstr
variable ::= 'variable' QName '=' quotedstr variableBody?
variableBody ::= '{' ( binding | namespaces )* '}'
namespaces ::= 'namespaces' withExtra? nsBody?
nsBody ::= '{' namespace '}'
binding ::= ( comment | pi )* ( emptyBinding | documentBinding | dataBinding | pipeBinding | inlineBinding )
emptyBinding ::= 'empty' withExtra?
documentBinding ::= 'document' quotedstr withExtra?
dataBinding ::= 'data' quotedstr withExtra?
pipeBinding ::= quotedstr 'on' quotedstr withExtra?
inlineBinding ::= 'inline' withExtra? inlineXML
inlineXML ::= '<![XML[' Char* ']]>'
subpipeline ::= ( variable | documentation | pipeinfo | forEachStep | viewportStep | chooseStep | tryStep | groupStep | atomicStep | comment | pi )+
documentation ::= 'documentation' withExtra? '{' inlineXML '}'
pipeinfo ::= 'pipeinfo' withExtra? '{' inlineXML '}'
named ::= 'named' quotedstr
forEachStep ::= 'for-each' named? withExtra? forEachBody
forEachBody ::= '{' ( iterationSource | output | log )* subpipeline '}'
iterationSource ::= 'iteration-source' withExtra? ( '{' binding* '}' )?
viewportStep ::= 'viewport' named? withExtra? viewportBody
viewportBody ::= '{' ( viewportSource | output | log )* subpipeline '}'
viewportSource ::= 'viewport-source' withExtra? ( '{' binding* '}' )?
chooseStep ::= 'choose' named? withExtra? chooseBody
chooseBody ::= '{' xpathContext? variable* whenStep* otherwiseStep? '}'
xpathContext ::= 'xpath-context' withExtra? ( '{' binding* '}' )?
whenStep ::= 'when' quotedstr withExtra? whenBody
whenBody ::= ( xpathContext | output | log )* subpipeline
otherwiseStep ::= 'otherwise' withExtra? otherwiseBody
otherwiseBody ::= ( output | log )* subpipeline
tryStep ::= 'try' named? withExtra? tryBody
tryBody ::= '{' variable* groupStep catchStep '}'
groupStep ::= 'group' named? withExtra? groupBody
groupBody ::= '{' ( output | log )* subpipeline '}'
catchStep ::= 'catch' named? withExtra? catchBody
catchBody ::= '{' ( output | log )* subpipeline '}'
atomicStep ::= ( 'add-xml-base' | 'add-attribute' | 'compare' | 'count' | 'delete' | 'directory-list' | 'error' | 'escape-markup' | 'exec' | 'filter' | 'hash' | 'http-request' | 'identity' | 'insert' | 'label-elements' | 'load' | 'make-absolute-uris' | 'namespace-rename' | 'pack' | 'parameters' | 'rename' | 'replace' | 'set-attributes' | 'sink' | 'split-sequence' | 'store' | 'string-replace' | 'unescape-markup' | 'unwrap' | 'uuid' | 'validate-with-relax-ng' | 'validate-with-schematron' | 'validate-with-xml-schema' | 'wrap' | 'wrap-sequence' | 'www-form-urldecode' | 'www-form-urlencode' | 'xinclude' | 'xquery' | 'xslt' | 'xsl-formatter' ) named? withExtra? atomicStepBody? | CName named? withExtra? atomicStepBody?
atomicStepBody ::= '{' ( input | withOption | withParam | log )* '}'
withOption ::= 'with-option' QName '=' quotedstr withExtra? withOptionBody?
withOptionBody ::= '{' ( binding | namespaces )* '}'
withParam ::= 'with-param' QName '=' quotedstr withExtra? withParamBody?
withParamBody ::= '{' ( binding | namespaces )* '}'
pipeline ::= 'pipeline' named? withExtra? pipelineBody
library ::= 'library' withExtra? libraryBody
libraryBody ::= '{' ( imports | declareStep | pipeline )* '}'
EOF ::= $
comment ::= '<!--' ( ( Char - '-' ) | '-' ( Char - '-' ) )* '-->'
pi ::= '<?' pitarget ( S ( [^?] | '?'+ [^?>] )* '?'* )? '?>' /* ws: explicit */
pitarget ::= NCName
S ::= ( #x0020 | #x0009 | #x000D | #x000A )+ /* ws: definition */
quotedstr ::= '"' ( [^"] )* '"' | "'" ( [^'] )* "'"
NameStartChar ::= [A-Z] | '_' | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD]
NameChar ::= NameStartChar | '-' | '.' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
NCName ::= NameStartChar NameChar*
CName ::= (NCName ':' NCName)
QName ::= NCName | CName
Char ::= [#x0021-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Appendix B. Implementation
XML Calabash implements both compact syntaxes in the same way.
- The EBNF for the compact syntax is compiled into an XQuery module using the REx Parser Generator. The XQuery module produces an XML parse tree for the input pipeline.
- An XSLT stylesheet is written which transforms the XML parse tree into standard XProc.
- These two steps are combined into a pipeline, Figure 19, which is used to transform the input document into XProc which is then executed normally.
This mechanism may not be particularly efficient, but it is quite easy to write as a proof-of-concept.
References
[carrot] Lenz, Evan. “Carrot: An appetizing hybrid of XQuery and XSLT.” Presented at Balisage: The Markup Conference 2011, Montréal, Canada, August 2 - 5, 2011. In Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7 (2011). doi:https://doi.org/10.4242/BalisageVol7.Lenz01.
[rex] Rademacher, Gunther. “REx Parser Generator”, http://www.bottlecaps.de/rex/
[xmlcalabash] Walsh, Norman. “XML Calabash”, http://xmlcalabash.com/
[1] …, DITA, TEI, etc. Markup languages for prose.
[2] “Guarantees” in the absence of features such as disable output escaping and character maps that are designed to subvert the serialization, in any event.
[3] Pulling the rows out of line and storing them in a variable is an awkward consequence of XQuery's completely broken semantics with respect to the default namespace.