On XML Languages…

Norman Walsh

Abstract

Some XML languages have an XML syntax, some have a non-XML syntax, and some have both. This paper explores the intersection of these languages and syntaxes. What are the advantages of an XML syntax? What are the advantages of a non-XML syntax? After discussing the general issues, the paper presents two alternative non-XML syntaxes for XProc as a case study to further explore the issues.

The Desperate Perl Hacker featured often in the early days of XML. Designing a markup format that could be processed easily by ordinary programmers using their chosen languages was an explicit goal of XML: “4. It shall be easy to write programs which process XML documents.”

This goal was achieved, at least for XML itself, if not for all of the subsequent specifications in the broader ecosystem, and as a consequence there are no significant, mainstream languages which are incapable of processing XML. There are probably none for which there isn't a choice of XML parsers. Any language built on top of the Java VM includes such a choice. Modern languages like Scala include features for the specific purpose of writing domain-specific language parsers. These allow XML, or subsets of XML, to be incorporated directly into the language itself.

It is straightforward to parse XML with more-or-less any programming language you care to use. The way, and the extent to which, XML coexists with those languages is largely a question of their design and the full range of language design is outside the scope of this paper.
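As a minimal illustration of that claim, the following Python sketch uses only the standard library's xml.etree.ElementTree module to read a small purchase order and total its line items. The sample document and the order_total function are hypothetical; the element names simply follow the purchase order vocabulary used in this paper's figures.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical sample document; element names follow the purchase
# order vocabulary used in the figures of this paper.
PO = """<po>
  <item>
    <itemno>AB123</itemno>
    <quantity>2</quantity>
    <description>Widget</description>
    <unitprice currency="USD">9.99</unitprice>
  </item>
  <item>
    <itemno>CD456</itemno>
    <quantity>1</quantity>
    <description>Gadget</description>
    <unitprice>20.00</unitprice>
  </item>
</po>"""


def order_total(xml_text):
    """Sum quantity * unitprice over every item in the purchase order."""
    root = ET.fromstring(xml_text)
    total = Decimal("0")
    for item in root.findall("item"):
        quantity = Decimal(item.findtext("quantity"))
        unit_price = Decimal(item.findtext("unitprice"))
        total += quantity * unit_price
    return total
```

Calling order_total(PO) walks the two item elements and returns their combined value as a Decimal. Nothing here is specific to Python; most mainstream languages offer an equivalent tree or streaming API.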

Within the XML community, many XML languages have been designed specifically for the purpose of processing XML. These include all of the usual suspects: validation languages, transformation languages, query languages, etc. These are languages designed by XML users for XML users to process XML. These are the languages that are the focus of this paper.

We are concerned mostly with the syntax of these languages, not their semantics. Of course, syntax and semantics are not wholly separable. A language whose semantics are nothing more than the expression of a single boolean value needs at most two tokens and so can be vastly simpler syntactically than a language with Turing-complete semantics. Nevertheless, we'll focus mostly on the syntax for syntax's sake.

The first, perhaps most obvious, question to ask about the syntax of an XML language is: to what extent is it XML itself? A brief survey of XML languages reveals that there is considerable variety on this point.

On one end of the spectrum, RELAX NG Compact Syntax has nothing that resembles XML to the untrained eye. See Figure 1.

Figure 1: RELAX NG Compact Syntax

namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"
namespace db = "http://docbook.org/ns/docbook"

start = purchaseOrder

purchaseOrder = element po { item+ }

item = element item { itemno, quantity, description, unitprice }

itemno =
  element itemno {
    xsd:string { pattern = "[A-Z]+[0-9]+" }
  }

quantity = element quantity { xsd:decimal }

description = element description { (text | emph)* }

emph = element emph { (text | emph)* }

unitprice =
    [
      db:para [
        "The unit price must have an associated currency.\x{a}" ~
        "If no currency is explicitly specified, the default\x{a}" ~
        "value of "
        db:literal [ "USD" ]
        "\x{a}"
        db:emphasis [ "must" ]
        " be assumed."
      ]
    ]
    element unitprice {
       [ a:defaultValue = "USD" ]
       attribute currency {
          ## US Dollars
          "USD"
        | ## Great British Pounds
          "GBP"
        | ## Euro
          "EUR"
       }?,
       xsd:decimal { fractionDigits = "2" }
    }

On the other end of the spectrum, XQueryX is nothing but XML. See Figure 2.

Figure 2: XQueryX

<?xml version="1.0"?>
<xqx:module xmlns:xqx="http://www.w3.org/2005/XQueryX"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.w3.org/2005/XQueryX
                                http://www.w3.org/2005/XQueryX/xqueryx.xsd">
  <xqx:versionDecl>
    <xqx:version>1.0</xqx:version>
    <!-- encoding: null -->
  </xqx:versionDecl>
  <xqx:mainModule>
    <xqx:prolog>
      <xqx:defaultNamespaceDecl>
        <xqx:defaultNamespaceCategory>function</xqx:defaultNamespaceCategory>
        <xqx:uri>http://www.w3.org/2005/xpath-functions</xqx:uri>
      </xqx:defaultNamespaceDecl>
    </xqx:prolog>
    <xqx:queryBody>
      <xqx:flworExpr>
        <xqx:letClause>
          <xqx:letClauseItem>
            <xqx:typedVariableBinding>
              <xqx:varName>rows</xqx:varName>
            </xqx:typedVariableBinding>
            <xqx:letExpr>
              <xqx:flworExpr>
                <xqx:forClause>
                  <xqx:forClauseItem>
                    <xqx:typedVariableBinding>
                      <xqx:varName>item</xqx:varName>
                    </xqx:typedVariableBinding>
                    <xqx:forExpr>
                      <xqx:pathExpr>
                        <xqx:rootExpr/>
                        <xqx:stepExpr>
                          <xqx:xpathAxis>child</xqx:xpathAxis>
                          <xqx:nameTest>po</xqx:nameTest>
                        </xqx:stepExpr>
                        <xqx:stepExpr>
                          <xqx:xpathAxis>child</xqx:xpathAxis>
                          <xqx:nameTest>item</xqx:nameTest>
                        </xqx:stepExpr>
                      </xqx:pathExpr>
                    </xqx:forExpr>
                  </xqx:forClauseItem>
                </xqx:forClause>
                <xqx:letClause>
                  <xqx:letClauseItem>
                    <xqx:typedVariableBinding>
                      <xqx:varName>itemno</xqx:varName>
                    </xqx:typedVariableBinding>
                    <xqx:letExpr>
                      <xqx:functionCallExpr>
                        <xqx:functionName>string</xqx:functionName>
                        <xqx:arguments>
                          <xqx:pathExpr>
                            <xqx:stepExpr>
                              <xqx:filterExpr>
                                <xqx:varRef>
                                  <xqx:name>item</xqx:name>
                                </xqx:varRef>
                              </xqx:filterExpr>
                            </xqx:stepExpr>
                            <xqx:stepExpr>
                              <xqx:xpathAxis>child</xqx:xpathAxis>
                              <xqx:nameTest>itemno</xqx:nameTest>
                            </xqx:stepExpr>
                          </xqx:pathExpr>
                        </xqx:arguments>
                      </xqx:functionCallExpr>
                    </xqx:letExpr>
                  </xqx:letClauseItem>
                </xqx:letClause>
                <xqx:letClause>
                  <xqx:letClauseItem>
                    <xqx:typedVariableBinding>
                      <xqx:varName>quant</xqx:varName>
                    </xqx:typedVariableBinding>
                    <xqx:letExpr>
                      <xqx:functionCallExpr>
                        <xqx:functionName xqx:prefix="xs">integer</xqx:functionName>
                        <xqx:arguments>
                          <xqx:pathExpr>
                            <xqx:stepExpr>
                              <xqx:filterExpr>
                                <xqx:varRef>
                                  <xqx:name>item</xqx:name>
                                </xqx:varRef>
                              </xqx:filterExpr>
                            </xqx:stepExpr>
                            <xqx:stepExpr>
                              <xqx:xpathAxis>child</xqx:xpathAxis>
                              <xqx:nameTest>quantity</xqx:nameTest>
                            </xqx:stepExpr>
                          </xqx:pathExpr>
                        </xqx:arguments>
                      </xqx:functionCallExpr>
                    </xqx:letExpr>
                  </xqx:letClauseItem>
                </xqx:letClause>
                <xqx:letClause>
                  <xqx:letClauseItem>
                    <xqx:typedVariableBinding>
                      <xqx:varName>desc</xqx:varName>
                    </xqx:typedVariableBinding>
                    <xqx:letExpr>
                      <xqx:pathExpr>
                        <xqx:stepExpr>
                          <xqx:filterExpr>
                            <xqx:varRef>
                              <xqx:name>item</xqx:name>
                            </xqx:varRef>
                          </xqx:filterExpr>
                        </xqx:stepExpr>
                        <xqx:stepExpr>
                          <xqx:xpathAxis>child</xqx:xpathAxis>
                          <xqx:nameTest>description</xqx:nameTest>
                        </xqx:stepExpr>
                        <xqx:stepExpr>
                          <xqx:xpathAxis>child</xqx:xpathAxis>
                          <xqx:anyKindTest/>
                        </xqx:stepExpr>
                      </xqx:pathExpr>
                    </xqx:letExpr>
                  </xqx:letClauseItem>
                </xqx:letClause>
                <xqx:letClause>
                  <xqx:letClauseItem>
                    <xqx:typedVariableBinding>
                      <xqx:varName>unitp</xqx:varName>
                    </xqx:typedVariableBinding>
                    <xqx:letExpr>
                      <xqx:functionCallExpr>
                        <xqx:functionName xqx:prefix="xs">decimal</xqx:functionName>
                        <xqx:arguments>
                          <xqx:pathExpr>
                            <xqx:stepExpr>
                              <xqx:filterExpr>
                                <xqx:varRef>
                                  <xqx:name>item</xqx:name>
                                </xqx:varRef>
                              </xqx:filterExpr>
                            </xqx:stepExpr>
                            <xqx:stepExpr>
                              <xqx:xpathAxis>child</xqx:xpathAxis>
                              <xqx:nameTest>unitprice</xqx:nameTest>
                            </xqx:stepExpr>
                          </xqx:pathExpr>
                        </xqx:arguments>
                      </xqx:functionCallExpr>
                    </xqx:letExpr>
                  </xqx:letClauseItem>
                </xqx:letClause>
                <xqx:returnClause>
                  <xqx:elementConstructor>
                    <xqx:tagName>tr</xqx:tagName>
                    <xqx:attributeList>
                      <xqx:namespaceDeclaration>
                        <xqx:uri>http://www.w3.org/1999/xhtml</xqx:uri>
                      </xqx:namespaceDeclaration>
                    </xqx:attributeList>
                    <xqx:elementContent>
                      <xqx:elementConstructor>
                        <xqx:tagName>td</xqx:tagName>
                        <xqx:elementContent>
                          <xqx:varRef>
                            <xqx:name>itemno</xqx:name>
                          </xqx:varRef>
                        </xqx:elementContent>
                      </xqx:elementConstructor>
                      <xqx:elementConstructor>
                        <xqx:tagName>td</xqx:tagName>
                        <xqx:elementContent>
                          <xqx:varRef>
                            <xqx:name>quant</xqx:name>
                          </xqx:varRef>
                        </xqx:elementContent>
                      </xqx:elementConstructor>
                      <xqx:elementConstructor>
                        <xqx:tagName>td</xqx:tagName>
                        <xqx:elementContent>
                          <xqx:varRef>
                            <xqx:name>desc</xqx:name>
                          </xqx:varRef>
                        </xqx:elementContent>
                      </xqx:elementConstructor>
                      <xqx:elementConstructor>
                        <xqx:tagName>td</xqx:tagName>
                        <xqx:elementContent>
                          <xqx:varRef>
                            <xqx:name>unitp</xqx:name>
                          </xqx:varRef>
                        </xqx:elementContent>
                      </xqx:elementConstructor>
                      <xqx:elementConstructor>
                        <xqx:tagName>td</xqx:tagName>
                        <xqx:elementContent>
                          <xqx:multiplyOp>
                            <xqx:firstOperand>
                              <xqx:varRef>
                                <xqx:name>quant</xqx:name>
                              </xqx:varRef>
                            </xqx:firstOperand>
                            <xqx:secondOperand>
                              <xqx:varRef>
                                <xqx:name>unitp</xqx:name>
                              </xqx:varRef>
                            </xqx:secondOperand>
                          </xqx:multiplyOp>
                        </xqx:elementContent>
                      </xqx:elementConstructor>
                    </xqx:elementContent>
                  </xqx:elementConstructor>
                </xqx:returnClause>
              </xqx:flworExpr>
            </xqx:letExpr>
          </xqx:letClauseItem>
        </xqx:letClause>
        <xqx:returnClause>
          <xqx:elementConstructor>
            <xqx:tagName>html</xqx:tagName>
            <xqx:attributeList>
              <xqx:namespaceDeclaration>
                <xqx:uri>http://www.w3.org/1999/xhtml</xqx:uri>
              </xqx:namespaceDeclaration>
            </xqx:attributeList>
            <xqx:elementContent>
              <xqx:elementConstructor>
                <xqx:tagName>head</xqx:tagName>
                <xqx:elementContent>
                  <xqx:elementConstructor>
                    <xqx:tagName>title</xqx:tagName>
                    <xqx:elementContent>
                      <xqx:stringConstantExpr>
                        <xqx:value>Purchase Order</xqx:value>
                      </xqx:stringConstantExpr>
                    </xqx:elementContent>
                  </xqx:elementConstructor>
                </xqx:elementContent>
              </xqx:elementConstructor>
              <xqx:elementConstructor>
                <xqx:tagName>body</xqx:tagName>
                <xqx:elementContent>
                  <xqx:elementConstructor>
                    <xqx:tagName>h1</xqx:tagName>
                    <xqx:elementContent>
                      <xqx:stringConstantExpr>
                        <xqx:value>Purchase Order</xqx:value>
                      </xqx:stringConstantExpr>
                    </xqx:elementContent>
                  </xqx:elementConstructor>
                  <xqx:elementConstructor>
                    <xqx:tagName>table</xqx:tagName>
                    <xqx:elementContent>
                      <xqx:varRef>
                        <xqx:name>rows</xqx:name>
                      </xqx:varRef>
                    </xqx:elementContent>
                  </xqx:elementConstructor>
                </xqx:elementContent>
              </xqx:elementConstructor>
            </xqx:elementContent>
          </xqx:elementConstructor>
        </xqx:returnClause>
      </xqx:flworExpr>
    </xqx:queryBody>
  </xqx:mainModule>
</xqx:module>

Other XML languages fit between those two ends. XSLT has a mostly XML syntax; see Figure 3.

Figure 3: XSLT

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="xs"
                version="2.0">

<xsl:template match="/">
  <xsl:variable name="rows">
    <xsl:for-each select="/po/item">
      <xsl:variable name="itemno" select="string(itemno)"/>
      <xsl:variable name="quant" select="xs:integer(quantity)"/>
      <xsl:variable name="desc" select="description/node()"/>
      <xsl:variable name="unitp" select="xs:decimal(unitprice)"/>
      <tr>
        <td><xsl:value-of select="$itemno"/></td>
        <td><xsl:value-of select="$quant"/></td>
        <td><xsl:copy-of select="$desc"/></td>
        <td><xsl:value-of select="$unitp"/></td>
        <td><xsl:value-of select="$quant * $unitp"/></td>
      </tr>
    </xsl:for-each>
  </xsl:variable>

  <html>
    <head>
      <title>Purchase Order</title>
    </head>
    <body>
      <h1>Purchase Order</h1>
      <table>
        <xsl:sequence select="$rows"/>
      </table>
    </body>
  </html>
</xsl:template>

</xsl:stylesheet>

XQuery, by contrast, has a mostly non-XML syntax; see Figure 4.

Figure 4: XQuery

xquery version "1.0";

declare default function namespace "http://www.w3.org/2005/xpath-functions";

let $rows := for $item in /po/item
             let $itemno := string($item/itemno)
             let $quant  := xs:integer($item/quantity)
             let $desc   := $item/description/node()
             let $unitp  := xs:decimal($item/unitprice)
             return
               <tr xmlns="http://www.w3.org/1999/xhtml">
                 <td>{ $itemno }</td>
                 <td>{ $quant }</td>
                 <td>{ $desc }</td>
                 <td>{ $unitp }</td>
                 <td>{ $quant * $unitp }</td>
               </tr>
return
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Purchase Order</title>
    </head>
    <body>
      <h1>Purchase Order</h1>
      <table>
        { $rows }
      </table>
    </body>
  </html>

Let's look a little more closely at the distinction between XQueryX and XSLT. On the one hand, XQueryX provides improved machine readability: there are no semantic elements not manifest in the XML. On the other hand, it gains this benefit by sacrificing human readability. These are two possible axes on which we can analyze a language syntax; we'll revisit them later.
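The machine-readability half of that trade-off can be sketched in a few lines of Python. Because every variable declaration and reference in XQueryX is an element, a generic XML API can list them without any knowledge of the XQuery grammar. The fragment below is hand-trimmed in the style of Figure 2 and is not a complete XQueryX module; the variables function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XQX = "http://www.w3.org/2005/XQueryX"

# Hand-trimmed fragment in the style of Figure 2; not a complete module.
FRAGMENT = f"""<xqx:flworExpr xmlns:xqx="{XQX}">
  <xqx:letClause>
    <xqx:letClauseItem>
      <xqx:typedVariableBinding>
        <xqx:varName>quant</xqx:varName>
      </xqx:typedVariableBinding>
      <xqx:letExpr>
        <xqx:varRef><xqx:name>item</xqx:name></xqx:varRef>
      </xqx:letExpr>
    </xqx:letClauseItem>
  </xqx:letClause>
  <xqx:returnClause>
    <xqx:varRef><xqx:name>quant</xqx:name></xqx:varRef>
  </xqx:returnClause>
</xqx:flworExpr>"""


def variables(xml_text):
    """Return (declared, referenced) variable names in document order."""
    root = ET.fromstring(xml_text)
    # Declarations are xqx:varName elements; references are xqx:varRef
    # elements whose xqx:name child carries the variable name.
    declared = [e.text for e in root.iter(f"{{{XQX}}}varName")]
    referenced = [
        ref.findtext(f"{{{XQX}}}name") for ref in root.iter(f"{{{XQX}}}varRef")
    ]
    return declared, referenced
```

Answering the same question for the XQuery text in Figure 4 would require a parser for the XQuery grammar itself, which is exactly the cost the XML syntax avoids.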

In the meantime, let us define a “practical” XML syntax as one that is concise enough for human comprehension (even if it relies on some non-XML syntax to aid readability).

How do XML languages stand up? See Table I.

Table I

XML Languages

Language	XML Syntax	Practical XML Syntax	Non-XML Syntax
Atom	✓	✓
DocBook, HTML, …^[1]	✓	✓
MathML	✓	✓
RELAX NG	✓	✓	✓
RDF	✓	✓	✓
Schematron	✓	✓
SVG	✓	✓
XInclude	✓	✓
XLink			✓
XML Schema	✓	✓
XPointer			✓
XProc	✓	✓
XQuery	✓		✓
XSLT	✓	✓

There may be room for debate about some cells in that table. Evan Lenz's work on Carrot, for example, is moving in the direction of a more compact, non-XML syntax for XSLT. One could argue that TeX is a non-XML syntax for MathML. We might debate whether attribute-based languages like XLink are XML at all. And, in addition, there may be other syntaxes for these languages of which the author is unaware. However, at a coarse level of granularity, what we can see is that there are languages all across the spectrum.

Syntactically: XML or not?

Seeing languages spread across a spectrum like this invites the question: why? What motivates a language designer to choose an XML syntax, or not? When both are provided, what motivates a user to choose an XML syntax, or not?

The case for XML syntaxes

Why choose XML?

“Eat your own dogfood”/“Fly your own airplanes.” One school of thought says that XML languages should be expressed in XML simply because they are XML languages. Some XML developers find XML to be a clear and precise format for the expression of ideas.

Extensibility. XML syntax has natural extension points: attributes on start tags, for example, and namespaces. At a syntactic level, extending an XML language is an easily solved problem. Conversely, non-XML languages sometimes suffer from a dearth of extension points. Keeping a grammar for a complex language like XQuery free from ambiguity while simultaneously adding language features can be a real challenge.

Whether the accretion of language features through this form of ad-hoc extension, in either the XML or non-XML cases, produces a coherent and regular language over time is a separate question.

Accessibility to XML tools. The fact that an XSLT stylesheet can be used to produce an XSLT stylesheet is not a feature that every XSLT user needs, but there are circumstances when it is a great boon.

Documentation. The ability to inline documentation in an XML language is considered a great benefit in some environments. Expressing XML documentation in a non-XML language can have a deleterious effect on readability. Compare, for example, the non-XML representation of the unitprice pattern, Figure 5, with the equivalent XML representation, Figure 6.

Figure 5: XML Documentation in RELAX NG Compact Syntax

unitprice =
    [
      db:para [
        "The unit price must have an associated currency.\x{a}" ~
        "If no currency is explicitly specified, the default\x{a}" ~
        "value of "
        db:literal [ "USD" ]
        "\x{a}"
        db:emphasis [ "must" ]
        " be assumed."
      ]
    ]
    element unitprice {
       [ a:defaultValue = "USD" ]
       attribute currency {
          ## US Dollars
          "USD"
        | ## Great British Pounds
          "GBP"
        | ## Euro
          "EUR"
       }?,
       xsd:decimal { fractionDigits = "2" }
    }

Figure 6: XML Documentation in RELAX NG XML Syntax

  <define name="unitprice">
    <element name="unitprice">
      <db:para>The unit price must have an associated currency.
      If no currency is explicitly specified, the default
      value of <db:literal>USD</db:literal>
      <db:emphasis>must</db:emphasis> be assumed.</db:para>

      <optional>
        <attribute name="currency" a:defaultValue="USD">
          <choice>
            <value>USD</value>
            <a:documentation>US Dollars</a:documentation>
            <value>GBP</value>
            <a:documentation>Great British Pounds</a:documentation>
            <value>EUR</value>
            <a:documentation>Euro</a:documentation>
          </choice>
        </attribute>
      </optional>

      <data type="decimal">
        <param name="fractionDigits">2</param>
      </data>
    </element>
  </define>

Syntactic conformance. Operating on XML with a language that has an XML syntax provides certain minimum assurances about the outputs. An XSLT stylesheet, which must itself be well formed, guarantees^[2] that the resulting document will be well formed, by virtue of the nature of XSLT.

Learnability? There's certainly anecdotal evidence that non-programmers can be taught to be productive with XSLT in ways that don't have parallels in non-XML languages. This may be because the structure of the XSLT stylesheet has a strong surface resemblance to the documents that are to be transformed. This is true both at the level of the surface syntax (they're both XML) and at a deeper level in that templates contain fragments of the documents in a very obvious and direct way.

Declarativeness? There's a tendency for XML languages to have a more declarative nature than their non-XML counterparts. This can be seen particularly in the case of XSLT as compared to XQuery. The XSLT stylesheet in Figure 3 was written in a very “pull” fashion in order to have as much surface similarity to the XQuery example, Figure 4, as possible^[3].

A more idiomatically natural XSLT solution for the problem is shown in Figure 7.

Figure 7: Idiomatic XSLT

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="xs"
                version="2.0">

<xsl:template match="/">
  <html>
    <head>
      <title>Purchase Order</title>
    </head>
    <body>
      <h1>Purchase Order</h1>
      <xsl:apply-templates/>
    </body>
  </html>
</xsl:template>

<xsl:template match="po">
  <table>
    <xsl:apply-templates select="item"/>
  </table>
</xsl:template>

<xsl:template match="item">
  <tr>
    <td>
      <xsl:value-of select="itemno"/>
    </td>
    <td>
      <xsl:value-of select="quantity"/>
    </td>
    <td>
      <xsl:apply-templates select="description"/>
    </td>
    <td>
      <xsl:value-of select="unitprice"/>
    </td>
    <td>
      <xsl:value-of select="xs:integer(quantity) * xs:decimal(unitprice)"/>
    </td>
  </tr>
</xsl:template>

<xsl:template match="description">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="emph">
  <em>
    <xsl:apply-templates/>
  </em>
</xsl:template>

</xsl:stylesheet>

In the idiomatic, or “push”, style separate templates are declared for each component. This greatly increases the flexibility and reusability of XSLT.

Familiarity. For users whose principal tasks involve editing, validating, transforming, or otherwise working with XML, a language that is itself expressed in XML has a certain familiarity. Languages like XSLT or RELAX NG can be edited in the same comfortable, understood environment used for other XML editing tasks.
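To make the point about accessibility to XML tools concrete, here is a minimal Python sketch in which a generic XML API, rather than XSLT itself, generates a stylesheet. Python's ElementTree stands in for whatever XML tool is at hand; the make_stylesheet function and the element mapping it takes are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL)


def make_stylesheet(element_map):
    """Build an XSLT stylesheet that renames elements per element_map.

    element_map maps a source element name to a result element name,
    for example {"emph": "em"}. The mapping is a hypothetical example.
    """
    stylesheet = ET.Element(f"{{{XSL}}}stylesheet", {"version": "2.0"})
    for source, result in element_map.items():
        # One template per source element, as in the "push" style.
        template = ET.SubElement(
            stylesheet, f"{{{XSL}}}template", {"match": source}
        )
        # Literal result element wrapping the processed children.
        literal = ET.SubElement(template, result)
        ET.SubElement(literal, f"{{{XSL}}}apply-templates")
    return stylesheet


# Serialize the generated stylesheet for inspection.
print(ET.tostring(make_stylesheet({"emph": "em"}), encoding="unicode"))
```

The generated tree is an ordinary XML document, so it can be validated, transformed, or stored with the same tools as any other. A stylesheet in a non-XML syntax would have to be produced by string manipulation or a purpose-built generator.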

The case for non-XML syntaxes

Why choose a non-XML syntax?

Conciseness. One of the principal attractions of a non-XML syntax is that it's more compact, more concise. A concise syntax allows more information to fit on a screen or page and consequently provides the reader with a greater perspective on the language.

The compact schema in Figure 1 fits easily on a single page or screen and is completely straightforward to understand, assuming you're familiar with RELAX NG and its compact syntax.

The same schema expressed in the XML syntax, Figure 8, is twice as long as its compact counterpart. It's not manifestly more difficult to understand, assuming you're familiar with RELAX NG and its XML syntax, but it doesn't fit on a single page and contains a lot of syntactic “clutter” that one must learn to “look through”.

Figure 8: RELAX NG

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns:db="http://docbook.org/ns/docbook"
         xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <ref name="purchaseOrder"/>
  </start>

  <define name="purchaseOrder">
    <element name="po">
      <oneOrMore>
        <ref name="item"/>
      </oneOrMore>
    </element>
  </define>

  <define name="item">
    <element name="item">
      <ref name="itemno"/>
      <ref name="quantity"/>
      <ref name="description"/>
      <ref name="unitprice"/>
    </element>
  </define>

  <define name="itemno">
    <element name="itemno">
      <data type="string">
        <param name="pattern">[A-Z]+[0-9]+</param>
      </data>
    </element>
  </define>

  <define name="quantity">
    <element name="quantity">
      <data type="decimal"/>
    </element>
  </define>

  <define name="description">
    <element name="description">
      <zeroOrMore>
        <choice>
          <text/>
          <ref name="emph"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>

  <define name="emph">
    <element name="emph">
      <zeroOrMore>
        <choice>
          <text/>
          <ref name="emph"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>

  <define name="unitprice">
    <element name="unitprice">
      <db:para>The unit price must have an associated currency.
      If no currency is explicitly specified, the default
      value of <db:literal>USD</db:literal>
      <db:emphasis>must</db:emphasis> be assumed.</db:para>

      <optional>
        <attribute name="currency" a:defaultValue="USD">
          <choice>
            <value>USD</value>
            <a:documentation>US Dollars</a:documentation>
            <value>GBP</value>
            <a:documentation>Great British Pounds</a:documentation>
            <value>EUR</value>
            <a:documentation>Euro</a:documentation>
          </choice>
        </attribute>
      </optional>

      <data type="decimal">
        <param name="fractionDigits">2</param>
      </data>
    </element>
  </define>
</grammar>

Familiarity. For tasks, such as programming, that are most typically performed with non-XML languages, using a non-XML syntax for an XML language makes it more familiar and approachable for users that come from other backgrounds.

XQuery is arguably far more familiar, and consequently less threatening, more approachable, and easier to learn, for a programmer with a background in SQL or any of a host of common scripting languages.

Accessibility to non-XML tools. Both familiarity and conciseness play into another strength for non-XML languages: support in tools and environments that programmers are used to. An XQuery or RELAX NG Compact Syntax plugin for the programmer's favorite IDE makes editing those files part of a comfortable, understood environment. Using an XML syntax may require a new editing tool.

Syntactic expressiveness. An XML syntax imposes constraints on what characters may appear unescaped. Some of the characters that must be escaped are common in other contexts. For example, it's easy to argue that “$a <= 5” is easier to read and understand than “$a &lt;= 5”.
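The escaping burden described under syntactic expressiveness is easy to see with the Python standard library. This is only a small sketch: escape and unescape are real functions in xml.sax.saxutils, while the expression is the example from the text.

```python
from xml.sax.saxutils import escape, unescape

expression = "$a <= 5"

# escape() replaces &, < and > with entity references, as XML content requires.
escaped = escape(expression)
print(escaped)  # $a &lt;= 5

# unescape() reverses the substitution, recovering the readable form.
assert unescape(escaped) == expression
```

An author working in an XML syntax writes and reads the escaped form; a non-XML syntax lets the comparison appear as it would in most programming languages.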

Syntactically: Both?

Why choose if you can have both? RELAX NG is widely praised for having both an XML syntax and a compact syntax. Why not always take that approach?

One critical metric by which the success or failure of a dual-syntax approach will be judged is semantic compatibility. Arguably, the RELAX NG Compact Syntax has been successful not simply because it has the advantages of a non-XML syntax, but also because it describes exactly the same language as the XML syntax. There are no constructs that can be represented in the compact syntax that cannot be represented in the XML syntax, and vice-versa. It is possible to translate every valid schema losslessly from one format to the other and back again.

In practice, this is a remarkably high bar. RELAX NG is a purely declarative language with no semantics for iteration or transformation. As such, it is burdened with far fewer semantics to express than a programming language like XSLT or XQuery. It is difficult to imagine finding a useful alternative syntax for either of those languages that expressed precisely the same underlying semantics.

Yet, the complete syntactic isomorphism of the two syntaxes is considered in this paper to be an absolute requirement. Devising alternate syntaxes for subsets of a language is both much easier and much less useful. Every instance of the language that uses a construct not available in the alternate syntax is unavailable to the users who prefer the alternative, and to tools that are designed to work best with it.

It's also worth noting that even in the RELAX NG case, there are unusual artifacts in the non-XML syntax: square bracketed notations placed in front of the constructs that they modify and a somewhat torturous representation of XML markup in such annotations. Luckily, and by design, these annotations are uncommon: the simplest of them are the most common, and the most complicated are quite rare. Also, because of the syntactic isomorphism, it is possible to switch back-and-forth between the syntaxes, editing XML annotations in the XML syntax, and content models in the compact syntax, for example.

Case studies: compact syntaxes for XProc

To explore these ideas further, for the balance of this paper, we will consider two alternative, compact syntaxes for XProc: An XML Pipeline Language.

XProc, for those unfamiliar with it, is a language “for describing operations to be performed on XML documents.” A pipeline accepts XML documents as input, performs an arbitrary series of operations on them, and produces XML documents as output. In the context of an XProc pipeline, an “operation” is one of a set of discrete steps. These steps perform tasks such as adding an attribute, counting nodes, deleting nodes, inserting nodes, performing XInclude, XSLT, or XQuery, and various forms of validation. XProc has about 40 such operations built in and may be extended with additional operations.

A simple XProc pipeline is shown in Figure 9.

Figure 9: Simple XProc Pipeline

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            version='1.0'>
<p:serialization port="result" method="xhtml" indent="true"/>

<p:xinclude/>

<p:xslt>
  <p:input port="stylesheet">
    <p:document href="dbslides.xsl"/>
  </p:input>
</p:xslt>

</p:pipeline>

This pipeline takes a single input document, performs XInclude processing, styles it using the “dbslides.xsl” stylesheet, and then produces as its output the result of that transformation. If the XProc processor serializes the result, it does so as indented XHTML.

Case study 1: A compact syntax for XProc

How might the pipeline in Figure 9 be represented in a compact, non-XML syntax? Where might we look for inspiration?

Python? With significant whitespace?
Pascal? With BEGIN/END and :=?
Scheme? Because everything looks better with parentheses?
Something from the C/Java/JavaScript family?

For our first attempt, we'll take the last option. Translating Figure 9 into a compact syntax along these lines produces Figure 10.

Figure 10: Simple XProc Pipeline, Compact Syntax #1

xproc 1.0

pipeline {
  serialization "result" with method="xhtml",
       indent="true"
  xinclude
  xslt {
    input "stylesheet" {
      document "dbslides.xsl"
    }
  }
}

This is in many ways a very direct translation. As in RELAX NG's compact syntax and XQuery, curly braces delimit the bodies of our semantic constructs. Each new construct is introduced by a new token. There are two syntactic extension points in the XML syntax that we must accommodate: the presence of arbitrary extension attributes on what are elements in the XML syntax, and the presence of arbitrary XML fragments.

The “with” keyword is used at the end of each construct in the compact syntax to introduce an unbounded list of name/value pairs. These map back to extension attributes in the XML syntax.

Figure 11: XProc Library

<p:library xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:cx="http://xmlcalabash.com/ns/extensions"
           version="1.0">

<p:declare-step type="cx:unzip">
  <p:output port="result"/>
  <p:option name="href" required="true"
            cx:type="xsd:anyURI"/>
  <p:option name="file"/>
  <p:option name="content-type"/>
</p:declare-step>

</p:library>

Where additional namespaces are required, as in the pipeline library in Figure 11, they're introduced in the compact syntax and CNames are allowed as tokens. The equivalent library in this compact syntax is shown in Figure 12.

Figure 12: XProc Library, Compact Syntax #1

xproc 1.0

namespace p = "http://www.w3.org/ns/xproc"
namespace cx = "http://xmlcalabash.com/ns/extensions"

library with version="1.0" {
  declare-step with type="cx:unzip" {
    output "result"
    required option href with cx:type="xsd:anyURI"
    option file
    option content-type
  }
}

This example shows the use of an extension attribute, cx:type, represented in the compact syntax.
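To make the mapping from a “with” list back to XML attributes concrete, here is a minimal Python sketch. It follows the attr rule from Appendix A (a QName, an equals sign, then a QName or quoted string), but it accepts only ASCII names, and the function names parse_with and to_xml_attrs are hypothetical, chosen for this illustration rather than taken from any implementation.

```python
import re
from xml.sax.saxutils import quoteattr

# ASCII-only approximation of QName (assumption made for this sketch;
# the grammar in Appendix A allows the full XML name character ranges).
_NAME = r"[A-Za-z_][\w.\-]*(?::[A-Za-z_][\w.\-]*)?"

# attr ::= QName '=' (QName | quotedstr)
_ATTR = re.compile(
    r"\s*(?P<name>" + _NAME + r")\s*=\s*"
    + r'(?:"(?P<dq>[^"]*)"'
    + r"|'(?P<sq>[^']*)'"
    + r"|(?P<bare>" + _NAME + r"))\s*"
)


def parse_with(text):
    """Parse the text that follows 'with' into (name, value) pairs."""
    pairs, pos = [], 0
    while True:
        m = _ATTR.match(text, pos)
        if m is None:
            raise ValueError(f"expected an attribute at offset {pos}")
        value = next(v for v in (m["dq"], m["sq"], m["bare"]) if v is not None)
        pairs.append((m["name"], value))
        pos = m.end()
        if pos == len(text):
            return pairs
        if text[pos] != ",":
            raise ValueError(f"expected ',' at offset {pos}")
        pos += 1  # skip the comma and read the next attribute


def to_xml_attrs(pairs):
    """Serialize the pairs as XML attributes, escaping values as needed."""
    return " ".join(f"{name}={quoteattr(value)}" for name, value in pairs)


print(to_xml_attrs(parse_with('method="xhtml",\n       indent="true"')))
```

Applied to the list in Figure 10, this yields the attributes that appear on p:serialization in Figure 9. Because the pairs keep their order and values, the reverse direction is equally mechanical, which is what the isomorphism requirement demands.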

The other challenge is representing arbitrary XML. In RELAX NG, arbitrary XML fragments are always annotations of one sort or another; they're both relatively uncommon and, to some extent, unimportant to the core grammar. Not so in XProc, where they appear both in annotations, like p:documentation (Figure 13), and as inline document content in the pipeline. Using a syntax as awkward as the approach in RNC seems like a bad choice.

Figure 13: XProc Library with Documentation

<p:library xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:cx="http://xmlcalabash.com/ns/extensions"
           version="1.0">

<p:documentation>
<div xmlns="http://www.w3.org/1999/xhtml">
<h1>XML Calabash Extension Library</h1>
<h2>Version 1.0</h2>
<p>The steps defined in this library are implemented in
<a href="http://xmlcalabash.com/">XML Calabash</a>.
</p>
</div>
</p:documentation>
…

However, in the context of parsing a non-XML syntax, it must be possible to recognize both where the XML begins and where it ends. The presence of, for example, a fragment of XProc compact syntax in a program listing in some XML must not be accidentally parsed as XProc. One approach would be to build a complete XML parser into the grammar of the compact syntax. But even this is tricky because a p:inline might include several consecutive sibling elements that each have to be recognized.

If only there were some string of tokens that can't appear in XML…

In fact, such a sequence exists. Almost. The sequence “]]>” is forbidden in XML except when it ends a CDATA section. We can leverage this fact in our compact syntax to form delimiters for arbitrary XML: “<![xml[” and “]]>”. See Figure 14.

Figure 14: XProc Library with Documentation, Compact Syntax #1

xproc 1.0

library with version="1.0" {

documentation {
<![xml[<div xmlns="http://www.w3.org/1999/xhtml">
<h1>XML Calabash Extension Library</h1>
<h2>Version 1.0</h2>
<p>The steps defined in this library are implemented in
<a href="http://xmlcalabash.com/">XML Calabash</a>.
</p>
</div>]]>
…

It's arguably a hack, but it allows us to satisfy the requirement that each syntax represent exactly the same underlying constructs.
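The delimiter approach can be sketched in a few lines of Python. This is an illustration only: the function name split_inline_xml is hypothetical, and the sketch matches the opening delimiter case-insensitively, which is an assumption made here because Figure 14 writes it in lower case while Appendix A writes it in upper case.

```python
import re

# Opening delimiter; case-insensitive matching is an assumption of this sketch.
_OPEN = re.compile(r"<!\[xml\[", re.IGNORECASE)
_CLOSE = "]]>"


def split_inline_xml(source):
    """Split compact-syntax text into ('code', ...) and ('xml', ...) chunks."""
    chunks, pos = [], 0
    while True:
        m = _OPEN.search(source, pos)
        if m is None:
            if pos < len(source):
                chunks.append(("code", source[pos:]))
            return chunks
        end = source.find(_CLOSE, m.end())
        if end == -1:
            raise ValueError("unterminated inline XML block")
        if m.start() > pos:
            chunks.append(("code", source[pos:m.start()]))
        # Everything between the delimiters is passed through untouched,
        # so compact syntax quoted inside the XML is never parsed as XProc.
        chunks.append(("xml", source[m.end():end]))
        pos = end + len(_CLOSE)


for kind, text in split_inline_xml(
    "documentation {\n<![xml[<p>pipeline { }</p>]]>\n}"
):
    print(kind, repr(text))
```

The “almost” matters: an XML fragment that itself contains a CDATA section would end the block early at the first “]]>”, so a scanner along these lines would need some further convention for that case.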

This syntax has been implemented. The implementation strategy is to transform the compact syntax into the XML syntax as a pre-processing step and then process the resulting XML as usual.

How does this syntax stand up to the suggested benefits of non-XML syntaxes?

Conciseness? A wash. It's not clearly shorter in terms of absolute number of lines.
Familiarity? Not clear. It has the advantage of less visual clutter, but doesn't draw from the C/Java/JavaScript family in any significant regard beyond curly braces.
Accessibility to non-XML tools? Probably an improvement. It's likely that a modern IDE could be customized with the EBNF (see Appendix A).
Syntactic expressiveness? An improvement; outside of XML blocks, there are no characters that need to be explicitly escaped.
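The point about escaping is easiest to see from the other direction. When a test expression written in the compact syntax is converted back to the XML syntax, the converter has to reintroduce the escapes. A small Python sketch, using the standard library's quoteattr and a hypothetical helper named when_start_tag:

```python
from xml.sax.saxutils import quoteattr


def when_start_tag(test):
    """Build the XML start tag for a compact-syntax 'when' test expression."""
    # quoteattr escapes &, < and > and picks a safe quote character.
    return f"<p:when test={quoteattr(test)}>"


print(when_start_tag("count(//item) < 3"))
```

In the compact syntax the author writes the less-than sign directly; only the generated XML carries the entity reference.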

Case study 2: An alternate compact syntax for XProc

When I presented the first compact syntax in a lightning talk last year, Jeni Tennison observed that it could be made more compact, and perhaps more useful, if it were more idiomatically like other programming languages. She subsequently produced most of the “second compact syntax” language design.

Translating Figure 9 into this second compact syntax produces Figure 15.

Figure 15: Simple XProc Pipeline, Compact Syntax #2

pipeline {
  xinclude
  xslt ( stylesheet = document 'dbslides.xsl' )
} => ( result serialized with [ method = 'xhtml', indent = 'true' ] )

Adopting a more “method call”-like syntax does make the pipelines shorter. The outputs of a step are treated in a similar way, but shown at the end of the body.

The most obvious example of an attempt to make the language more idiomatically like other programming languages can be seen in the handling of p:choose. Consider Figure 16.

Figure 16: XProc “Choose” Pipeline

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            xmlns:a="http://example.com/a"
            xmlns:b="http://example.com/b"
            version='1.0'>

<p:choose>
  <p:when test="/a:*">
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="a2html.xsl"/>
      </p:input>
    </p:xslt>
  </p:when>
  <p:when test="/b:*">
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="b2html.xsl"/>
      </p:input>
    </p:xslt>
  </p:when>
  <p:otherwise>
    <p:identity/>
  </p:otherwise>
</p:choose>

</p:pipeline>

Translating it into our initial compact syntax produces Figure 17.

Figure 17: XProc “Choose” Pipeline, Compact Syntax #1

xproc 1.0

namespace a='http://example.com/a'
namespace b='http://example.com/b'

pipeline {
  choose {
    when "/a:*" {
      xslt {
        input "stylesheet" {
          document "a2html.xsl"
        }
      }
    }
    when "/b:*" {
      xslt {
        input "stylesheet" {
          document "b2html.xsl"
        }
      }
    }
    otherwise {
      identity
    }
  }
}

This is clearly a non-XML syntax, but it retains all of the semantic flavor of the original. In the second XProc compact syntax, a choose statement is represented using an if/then/else construct that's likely to be more familiar to programmers; see Figure 18.

Figure 18: XProc “Choose” Pipeline, Compact Syntax #2

namespace a: 'http://example.com/a'
namespace b: 'http://example.com/b'

pipeline {
  if (/a:*) {
    xslt ( stylesheet = document 'a2html.xsl' )
  } else if (/b:*) {
    xslt ( stylesheet = document 'b2html.xsl' )
  } else {
    identity
  }
}

Again, this manages to be both shorter and possibly more familiar.
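Although the if/else form looks different, it carries exactly the information in p:choose: an ordered list of tests with bodies, plus a default. The following Python sketch rebuilds the structure of Figure 16 from such a list using the standard library's ElementTree. The helper names choose and xslt are hypothetical, and the sketch does not add the namespace declarations for the a and b prefixes that a complete pipeline would need.

```python
import xml.etree.ElementTree as ET

P = "http://www.w3.org/ns/xproc"
ET.register_namespace("p", P)


def xslt(stylesheet_href):
    """Build a p:xslt step whose stylesheet port reads the given document."""
    step = ET.Element(f"{{{P}}}xslt")
    port = ET.SubElement(step, f"{{{P}}}input", {"port": "stylesheet"})
    ET.SubElement(port, f"{{{P}}}document", {"href": stylesheet_href})
    return step


def choose(branches, otherwise):
    """branches: ordered (test, [steps]) pairs; otherwise: the default steps."""
    el = ET.Element(f"{{{P}}}choose")
    for test, steps in branches:
        when = ET.SubElement(el, f"{{{P}}}when", {"test": test})
        when.extend(steps)
    default = ET.SubElement(el, f"{{{P}}}otherwise")
    default.extend(otherwise)
    return el


# if (/a:*) {...} else if (/b:*) {...} else { identity }
tree = choose(
    [("/a:*", [xslt("a2html.xsl")]), ("/b:*", [xslt("b2html.xsl")])],
    [ET.Element(f"{{{P}}}identity")],
)
print(ET.tostring(tree, encoding="unicode"))
```

Each “if” or “else if” becomes one p:when in order, and the trailing “else” becomes p:otherwise, so nothing is lost in either direction.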

Whether or not either of these syntaxes would be markedly easier to use or would spur greater adoption of XProc is an open question.

Appendix A. Grammar for XProc Compact Syntax #1

document    ::= xpcMarker namespace* ( declareStep | pipeline | library ) EOF

xpcMarker   ::= 'xproc' version

version     ::= '1.0'

namespace   ::= ('namespace' prefix '=' quotedstr)
              | ('default' 'namespace' '=' quotedstr)

prefix      ::= NCName

declareStep ::= 'declare-step' stepName? withExtra? pipelineBody

stepName    ::= 'named' quotedstr

withExtra   ::= 'with' attr (',' attr)*

attr        ::= QName '=' (QName | quotedstr)

pipelineBody ::= '{'
     ( input | output | option | log | serialization )*
     ( declareStep | pipeline | imports )*
     subpipeline?
     '}'

input       ::= 'input' quotedstr withExtra? ( '{' binding* '}' )?

output      ::= 'output' quotedstr withExtra? ( '{' binding* '}' )?

option      ::= 'required' 'option' QName withExtra?
              | 'option' QName withExtra?

log         ::= 'log' quotedstr 'to' quotedstr

serialization ::= 'serialization' quotedstr withExtra?

imports     ::= 'import' quotedstr

variable    ::= 'variable' QName '=' quotedstr variableBody?

variableBody ::= '{' ( binding | namespaces )* '}'

namespaces  ::= 'namespaces' withExtra? nsBody?

nsBody      ::= '{' namespace '}'

binding     ::= ( comment | pi )*
                ( emptyBinding | documentBinding | dataBinding | pipeBinding | inlineBinding )

emptyBinding    ::= 'empty' withExtra?
documentBinding ::= 'document' quotedstr withExtra?
dataBinding     ::= 'data' quotedstr withExtra?
pipeBinding     ::= quotedstr 'on' quotedstr withExtra?
inlineBinding   ::= 'inline' withExtra? inlineXML

inlineXML       ::= '<![XML[' Char* ']]>'

subpipeline     ::= ( variable | documentation | pipeinfo | forEachStep | viewportStep
                     | chooseStep | tryStep | groupStep | atomicStep | comment | pi )+

documentation   ::= 'documentation' withExtra? '{' inlineXML '}'

pipeinfo        ::= 'pipeinfo' withExtra? '{' inlineXML '}'

named           ::= 'named' quotedstr

forEachStep     ::= 'for-each' named? withExtra? forEachBody

forEachBody     ::= '{' ( iterationSource | output | log )* subpipeline '}'

iterationSource ::= 'iteration-source' withExtra? ( '{' binding* '}' )?

viewportStep    ::= 'viewport' named? withExtra? viewportBody

viewportBody    ::= '{' ( viewportSource | output | log )* subpipeline '}'

viewportSource  ::= 'viewport-source' withExtra? ( '{' binding* '}' )?

chooseStep      ::= 'choose' named? withExtra? chooseBody

chooseBody      ::= '{' xpathContext? variable* whenStep* otherwiseStep? '}'

xpathContext    ::= 'xpath-context' withExtra? ( '{' binding* '}' )?

whenStep        ::= 'when' quotedstr withExtra? whenBody

whenBody        ::= '{' ( xpathContext | output | log )* subpipeline '}'

otherwiseStep   ::= 'otherwise' withExtra? otherwiseBody

otherwiseBody   ::= '{' ( output | log )* subpipeline '}'

tryStep         ::= 'try' named? withExtra? tryBody

tryBody         ::= '{' variable* groupStep catchStep '}'

groupStep       ::= 'group' named? withExtra? groupBody

groupBody       ::= '{' ( output | log )* subpipeline '}'

catchStep       ::= 'catch' named? withExtra? catchBody

catchBody       ::= '{' ( output | log )* subpipeline '}'

atomicStep      ::= ( 'add-xml-base' | 'add-attribute' | 'compare' | 'count' | 'delete'
                      | 'directory-list' | 'error' | 'escape-markup' | 'exec' | 'filter'
                      | 'hash' | 'http-request' | 'identity' | 'insert' | 'label-elements'
                      | 'load' | 'make-absolute-uris' | 'namespace-rename' | 'pack'
                      | 'parameters' | 'rename' | 'replace' | 'set-attributes' | 'sink'
                      | 'split-sequence' | 'store' | 'string-replace' | 'unescape-markup'
                      | 'unwrap' | 'uuid' | 'validate-with-relax-ng'
                      | 'validate-with-schematron' | 'validate-with-xml-schema'
                      | 'wrap' | 'wrap-sequence' | 'www-form-urldecode' | 'www-form-urlencode'
                      | 'xinclude' | 'xquery' | 'xslt' | 'xsl-formatter' )
                    named? withExtra? atomicStepBody?
                  | CName named? withExtra? atomicStepBody?

atomicStepBody  ::= '{' ( input | withOption | withParam | log )* '}'

withOption      ::= 'with-option' QName '=' quotedstr withExtra? withOptionBody?

withOptionBody  ::= '{' ( binding | namespaces )* '}'

withParam       ::= 'with-param' QName '=' quotedstr withExtra? withParamBody?

withParamBody   ::= '{' ( binding | namespaces )* '}'

pipeline        ::= 'pipeline' named? withExtra? pipelineBody

library         ::= 'library' withExtra? libraryBody

libraryBody     ::= '{' ( imports | declareStep | pipeline )* '}'



EOF ::= $

comment  ::= '<!--' ( ( Char - '-' ) | '-' ( Char - '-' ) )* '-->'
pi       ::= '<?' pitarget ( S ( [^?] | '?'+ [^?>] )* '?'* )? '?>' /* ws: explicit */
pitarget ::= NCName
S        ::= ( #x0020 | #x0009 | #x000D | #x000A )+ /* ws: definition */

quotedstr ::= '"' ( [^"] )* '"'
            | "'" ( [^'] )* "'"

NameStartChar
         ::= [A-Z]
           | '_'
           | [a-z]
           | [#x00C0-#x00D6]
           | [#x00D8-#x00F6]
           | [#x00F8-#x02FF]
           | [#x0370-#x037D]
           | [#x037F-#x1FFF]
           | [#x200C-#x200D]
           | [#x2070-#x218F]
           | [#x2C00-#x2FEF]
           | [#x3001-#xD7FF]
           | [#xF900-#xFDCF]
           | [#xFDF0-#xFFFD]
NameChar ::= NameStartChar
           | '-'
           | '.'
           | [0-9]
           | #x00B7
           | [#x0300-#x036F]
           | [#x203F-#x2040]
NCName   ::= NameStartChar NameChar*
CName    ::= (NCName ':' NCName)
QName    ::= NCName | CName

Char     ::= [#x0021-#xD7FF]
           | [#xE000-#xFFFD]
           | [#x10000-#x10FFFF]

Appendix B. Implementation

XML Calabash implements both compact syntaxes in the same way:

The EBNF for the compact syntax is compiled into an XQuery module using the REx Parser Generator. The XQuery module produces an XML parse tree for the input pipeline.
An XSLT stylesheet is written which transforms the XML parse tree into standard XProc.
These two steps are combined into a pipeline, Figure 19, which is used to transform the input document into XProc which is then executed normally.

This mechanism may not be particularly efficient, but it is quite easy to write as a proof-of-concept.

Figure 19: XProc Pipeline for Converting XPC to XPL

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

<p:xquery>
  <p:input port="query">
    <p:data href="xpc1.xqy"/>
  </p:input>
</p:xquery>

<p:xslt version="2.0">
  <p:input port="stylesheet">
    <p:document href="xpc1.xsl"/>
  </p:input>
</p:xslt>

</p:pipeline>
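The same two-stage shape (parse the compact syntax into a tree, then transform the tree into XProc) can be sketched in Python for readers who want to experiment without REx. This is not the XML Calabash implementation. It is a hand-written recursive-descent parser for a deliberately tiny subset of Compact Syntax #1 (a pipeline of steps, each with optional document inputs, no “with” lists, no “xproc 1.0” marker), and every name in it is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

P = "http://www.w3.org/ns/xproc"
ET.register_namespace("p", P)

_TOKEN = re.compile(
    r"""\s*(?:(?P<str>"[^"]*"|'[^']*')|(?P<name>[A-Za-z_][\w.\-]*)|(?P<punct>[{}]))"""
)


def tokenize(text):
    """Turn source text into (kind, value) tokens; names are ASCII-only here."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        if m["str"] is not None:
            tokens.append(("str", m["str"][1:-1]))  # drop the quotes
        elif m["name"] is not None:
            tokens.append(("name", m["name"]))
        else:
            tokens.append(("punct", m["punct"]))
        pos = m.end()
    return tokens


class Parser:
    """Stage 1: tokens to a parse tree of nested tuples."""

    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ("eof", "")

    def take(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r}")
        self.i += 1
        return tok[1]

    def block(self, item):
        """Parse '{' item* '}' and return the list of items."""
        self.take("punct", "{")
        items = []
        while self.peek() != ("punct", "}"):
            items.append(item())
        self.take("punct", "}")
        return items

    def pipeline(self):
        self.take("name", "pipeline")
        steps = self.block(self.step)
        self.take("eof")
        return ("pipeline", steps)

    def step(self):
        name = self.take("name")
        has_body = self.peek() == ("punct", "{")
        return ("step", name, self.block(self.input) if has_body else [])

    def input(self):
        self.take("name", "input")
        port = self.take("str")
        return ("input", port, self.block(self.document))

    def document(self):
        self.take("name", "document")
        return self.take("str")


def to_xproc(tree):
    """Stage 2: parse tree to XProc elements."""
    root = ET.Element(f"{{{P}}}pipeline", {"version": "1.0"})
    for _, name, inputs in tree[1]:
        step = ET.SubElement(root, f"{{{P}}}{name}")
        for _, port, hrefs in inputs:
            inp = ET.SubElement(step, f"{{{P}}}input", {"port": port})
            for href in hrefs:
                ET.SubElement(inp, f"{{{P}}}document", {"href": href})
    return root


def compile_xpc(text):
    return ET.tostring(to_xproc(Parser(tokenize(text)).pipeline()), encoding="unicode")


print(compile_xpc('pipeline { xinclude xslt { input "stylesheet" { document "dbslides.xsl" } } }'))
```

Keeping the parse tree separate from the XML generation mirrors the split between the generated XQuery parser and the XSLT stylesheet: the grammar can change without touching the mapping to XProc, and vice versa.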

References

[carrot] Lenz, Evan. “Carrot: An appetizing hybrid of XQuery and XSLT.” Presented at Balisage: The Markup Conference 2011, Montréal, Canada, August 2 - 5, 2011. In Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7 (2011). doi:https://doi.org/10.4242/BalisageVol7.Lenz01.

[rex] Rademacher, Gunther. “REx Parser Generator”, http://www.bottlecaps.de/rex/

[xmlcalabash] Walsh, Norman. “XML Calabash”, http://xmlcalabash.com/

^[1] …, DITA, TEI, etc. Markup languages for prose.

^[2] “Guarantees” in the absence of features such as disable output escaping and character maps that are designed to subvert the serialization, in any event.

^[3] Pulling the rows out of line and storing them in a variable is an awkward consequence of XQuery's completely broken semantics with respect to the default namespace.

Norman Walsh

Norman Walsh is a Lead Engineer at MarkLogic Corporation where he works with the Application Services team. Norm is also an active participant in a number of standards efforts worldwide: he is chair of the XML Processing Model Working Group at the W3C where he is also co-chair of the XML Core Working Group. At OASIS, he is chair of the DocBook Technical Committee.

With more than a decade of industry experience, Norm is well known for his work on DocBook and a wide range of open source projects. He is the author of DocBook: The Definitive Guide.

Balisage: The Markup Conference
