XML and Authoring

XML is a popular format. It is widely and successfully used for document and data storage, exchange and presentation. A major advantage of using XML is the toolchain and pipeline available for generic XML processing. You can easily use new formats within the generic framework.

However, for authoring purposes XML is seldom preferred over a notation more directly suited to the purpose. Few would prefer to write their CSS rules as

<rule><simple-selector name="body"/><block><property name="color" value="blue"/></block></rule>

to the more direct

body {color: blue}

and even less would prefer to write

<statement><if><condition><comparison name="&lt;"><var name="max"><var name="a"></comparison></condition><then><statement><assign><var name="max"/><expression><var name="a"/></expression></assign></statement></then></if></statement>

to the much more direct

if (max<a) then max=a;

And, of course it should be noted that even RELAX NG has both an XML syntax and a 'compact' syntax RELAX NG RELAX NG COMPACT.

In fact if we are to be brutally honest, even XML formats take short cuts for authoring ease. Take for instance an <a> element in XHTML:

<a href="http://www.w3.org/TR/1999/xhtml">XHTML</a>

This does not surface the real structure of the underlying data. If we were to be completely faithful to the principle of making all relevant structure explicit, we should really write something along the lines of

<a><href><method type="href"/><domain name="org"/><site name="w3"/><sub name="www"/><path><root><sub name="TR"><sub name="1999"><sub name="xhtml"</sub></sub></sub></root></path></href><text>XHTML</text></a>

You might argue about the details here, but this example is only to show that there are parts of XML documents that could be further structured, but that we choose not to, possibly for authoring ease, possibly for fear of being laughed out of town.

The reasons for this are obvious: despite the disadvantages of not being able to use the generic toolchain any more, or only to a lesser degree, the increased readability of the source, and its closer relation to the problem domain makes authoring so much easier.

Parsing and Parse trees

Part of the advantage of XML is that there is a single parser needed to be able to deal with any kind of document. This can be contrasted with for instance the situation for HTML, where you need a parser for the HTML, with separate parsers for CSS and Javascript at least, (and URLs), creating extra complexity and brittleness.

But looked at through a suitable pair of glasses, what is XML apart from a description of a parse tree for some format (with some special treatment for text nodes)? And frankly, what is so difficult about general-purpose parsing? It is a widely understood and easily solved problem. Is it not possible to combine the best of both worlds, and have authorable formats, that can still use the XML tool chain? Couldn't XML become the underlying format for everything?

The Approach

The approach presented here is to add one more step to the XML processing chain, an initial one. This step takes any textual document, and a (reference to) a suitable syntax description, parses the document using the syntax description, and produces as output a parse tree that can be treated as an XML document with no further parsing necessary (or alternatively, the document can be serialised out to XML).

In other words, the input document might be

body {color: blue}

but the result of the parse will be the same as if an XML parser had been presented with the XML document

<css>
   <rule><simple-selector name="body"/>
      <block><property name="color" value="blue"/></block>
   </rule>
</css>

We call this method Invisible XML, since the document is treated as XML, but it is not visibly an XML document.

Syntax Description

The requirement is to find a suitable way to describe the syntax of the input document so that the resultant parse-tree is of the form suitable for use in our XML chain. If we were to use BNF BNF, arguably the most well-known syntax-description format, it might look like this (in what follows "..." is used for parts of the definition that have been elided and will be defined later):

<css> ::= <rules>
<rules> ::= <rule> | <rules> <rule>
<rule> ::= <selector> <block>
<block> ::= "{" <properties> "}"
<properties> ::= <property> | <property> ";" <properties>
<property> ::= <name> ":" <value> | <empty>
<selector> ::= <name>

etc, etc. But it is quickly apparent that this has some shortcomings. Firstly a surface problem that since we are using this for XML, we could quickly go crazy with the use of angle brackets for two different purposes. Although there is a certain charm to defining the <css> element with a syntax rule whose name is <css>, let us rather use a different format. Therefore we shall use a variant of VWG format VWG. This looks like:

css: rules.
rules: rule; rules, rule.
rule: selector, block.
block: "{", properties, "}".
properties:  property; property, ";", properties.
property:  name, ":", value; empty.
selector: name.
name: ...
value: ...
empty: .

(We shall restrict ourselves to a simplified CSS grammar for the sake of this article).

Note that ";" signifies alternatives, and as is normal in syntax definitions, if one alternative is empty (or reduces to empty), the rule is optional.

If we parse the snippet of CSS above with this, and then represent the resulting parse tree in an XML style (so that each nonterminal is represented as an XML element), a second problem becomes apparent:

<css>
   <rules>
      <rule>
         <selector>body</selector>
         <block>
            <properties>
               <property>
                  <name>color</name>
                  <value>blue</value>
               </property>
            </properties>
         </block>
      </rule>
   </rules>
</css>

namely that there are certain elements in the tree (rules, properties) that we really aren't interested in. (You'll notice that some terminal symbols such as the brackets, colons and semicolons don't appear in the parse tree. This will be discussed later).

The problem becomes even more apparent with a CSS snippet like

body {color: blue; font-weight: bold}

since the content of the <block> element then becomes even more unwieldly:

<properties>
   <property>
      <name>color</name>
      <value>blue</value>
   </property>
   <properties>
      <property>
         <name>font-weight</name>
         <value>bold</value>
      </property>
   </properties>
</properties>

where we would prefer to see the much more direct

<property>
   <name>color</name>
   <value>blue</value>
</property>
<property>
   <name>font-weight</name>
   <value>bold</value>
</property>

The problem arises in this case because the syntax description method relies on recursion to deal with repetition. To that end, we shall introduce a specific notation for repetition. Zero or more repetitions:

(rule)*

and one or more repetitions:

(rule)+

In fact we shall extend these two postfix operators to also act as infix operators, to handle a commonly occurring case:

(property)*";"
(property)+";"

which respectively mean "zero or more, separated by semicolon" and "one or more, separated by semicolon" (there is no reason to restrict the separator to a terminal as here; it may also be a nonterminal).

Now we can specify our syntax as:

css: (rule)*.
rule: selector, block.
block: "{", (property)*";", "}".
property:  name, ":", value; .
name: ...
value: ...

and the parsetree will now look like this:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property>
            <name>color</name>
            <value>blue</value>
         </property>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

However, there is another reason why we might not want a syntax rule name to appear in the parse tree, and that is when we use a syntax rule as a refinement, that is to say, when the syntax rule doesn't represent anything of semantic importance, but has been defined so that we can use it in several places without having to repeat it. For instance, suppose we wanted to define a series of properties in a separate rule:

properties: (property)*";".

and use it:

block: "{", properties, "}".

but not want <properties> to appear in the final parse tree. What we define is that the use of any rule name preceded by a minus sign is only being used for refinement. So that would give us:

properties: (property)*";".
block: "{", -properties, "}".

and this would result in the same parse-tree as above. Note that this still allows a rule to be used in other places and appear in the parse tree if needed.

Also note that for simplicity we have ignored treating spaces in the syntax description, but that is also an example of something you would not want to have in the parse tree:

colon: -spaces, ":", -spaces.
spaces: " "*.

Similarly, we can use it to make empty alternatives more explicit:

property:  name, ":", value; -empty.
empty: .

Terminals

As alluded to above, in general, terminal symbols do not appear in the parse-tree, since most of them are only there to delimit structural elements in the source file. If you want them to show up, you can add an explicit rule for them:

colon: ":".

which will cause them to show up in the tree like this:

 <property>
     <name>color</name>
     <colon/>
     <value>blue</value>
 </property>

However, there are places where terminals have semantic meaning, and you do want them to appear in the parse-tree, for instance in our example the names and values of the properties. To achieve this we mark terminals that are to be copied to the parse tree specially:

name: (+"a"; +"b"; ...etc...; +"9"; +"-")+.

In other words, normally terminals are discarded, but if they are preceded with a + they are copied to the parse-tree.

Extensions

Strictly speaking, this would be enough to allow you to parse a document, and output it as an equivalent XML document. However, there are possible extensions that give you a little more control over the result. The most obvious is allowing the specification of attributes. This is simply done by marking the use of rules with at signs:

css: (rule)*.
rule: selector, block.
block: "{", (property)*";", "}".
property:  @name, ":", value.

A rule used like this may clearly not contain any structural elements (though it may contain terminals and refinements), since attributes are not structured, but this is an easy condition to check for. The parsetree will now look like this:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color">
            <value>blue</value>
         </property>
         <property name="font-weight">
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

If we changed the rule for property to look like this:

property:  @name, ":", @value.

then the resultant parse-tree would look like

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color" value="blue"/>
         <property name="font-weight" value="bold"/>
      </block>
   </rule>
</css>

Note that by marking the use of a syntax rule in this way, and not the definition, it allows the syntax rule to be used for structural elements (<name>color</name>) as well as for attributes (name="color").

Parsing Algorithms

Although it would be possible to require the syntax to be restricted to some class of language, such as LL(1) or LR(1) LL1 in order to make the parser faster, in practice it is easier for the author of the syntax if we make no such restriction, since it would require the author to understand the principles, and it would require the system to check that the syntax adhered to the requirement. In practise a parsing algorithm such as Earley's Earley is fast enough, and will treat all context-free languages. The only remaining problem is if the syntax author describes an ambiguous language. To that end we just define that the parser outputs one of the parses, and leave it at that. For instance, if expression were defined as:

expr: i; expr, plus, expr.
i: "i".
plus: "+".

then a string such as

i+i+i

could be parsed as both

<expr><i/></expr>
<plus/>
<expr>
   <expr><i/></expr>
   <plus/>
   <expr><i/></expr>
</expr>

and as

<expr>
   <expr><i/></expr>
   <plus/>
   <expr><i/></expr>
</expr>
<plus/>
<expr><i/></expr>

Delivery

To deliver a source document to be parsed by our system, we can use a media type Media type that supplies a reference to the required syntax description. For instance:

application/xml-invisible; syntax=http://example.com/syntax/css

Clearly a system can cache well-known syntax descriptions.

Using Invisible XML to define itself

It should go without saying that the syntax descriptions themselves are in Invisible XML (though in their case the syntax description must be cached to prevent an infinite loop of processing.)

The definition might look like this:

ixml: (rule)+.
rule: @name, -colon, -definition, -stop.
definition: (alternative)*-semicolon.
alternative: (-term)*-comma.
term: -symbol; -repetition.
repetition: one-or-more; zero-or-more.
one-or-more: -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star, separator.
separator: -symbol; -empty.
empty: .
symbol: -terminal; nonterminal; refinement.
terminal: explicit-terminal; implicit-terminal.
explicit-terminal: -plus, @string.
implicit-terminal: @string.
nonterminal: @name.
refinement: -minus, @name.
attribute: -at, @name.

string: -openquote, (-character)*, -closequote.
name: (-letter)+.
letter: +"a"; +"b"; ...
character: ...

colon: -S, ":", -S.
stop: -S, ".", -S.
semicolon: -S, ";", -S.
comma:  -S, ",", -S.
plus:  -S, "+", -S.
minus:  -S, "-", -S.
star:  -S, "*", -S.
open:  -S, "(", -S.
close:  -S, ")", -S.
at:  -S, "@", -S.
openquote: -S, """".
closequote: """", -S.
S: " "*.

This would then parse to the XML form:

<ixml>
   <rule name="ixml">
      <alternative>
         <one-or-more>
             <alternative>
                <nonterminal name="rule"/>
             </alternative><separator/>
         </one-or-more>
      </alternative>
   </rule>
   <rule name="rule">
      <alternative>
         <attribute name="name"/>
         <refinement name="definition"/>
      </alternative
   </rule>
   <rule name="definition">
      <alternative>
         <zero-or-more>
            <alternative>
               <nonterminal name="alternative"/>
            </alternative>
            <separator><refinement name="semicolon"/></separator>
         </zero-or-more>
      </alternative
   </rule>
   ... etc ...
   <rule name="separator">
      <alternative><refinement name="symbol"/></alternative>
      <alternative><refinement name="empty"/></alternative>
   </rule>
   ... etc ...
</ixml>

Thanks to Earley's parsing algorithm, we can remove the <alternative> elements when there is only one alternative in a rule, by redefining definition:

definition: -alternative; alternative, -semicolon, (alternative)+-semicolon.

Note how we have used the "-" character to prevent it being copied in the first case (when there is only one). You wouldn't be able to use such a rule as this if there were a requirement on the syntax to be LL(1) or LR(1), since the two parts of the rule start with the same symbols.

Similarly, we can get rid of empty <separators/> thusly:

one-or-more: -open, -definition, -close, -plus; -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star; -open, -definition, -close, -star, separator.
separator: -symbol.

We can move the value of the separator into an attribute with:

separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.

This would then generate:

<ixml>
   <rule name="ixml">
      <one-or-more>
         <nonterminal name="rule"/>
      </one-or-more>
   </rule>
   <rule name="rule">
      <attribute name="name"/>
      <refinement name="definition"/>
   </rule>
   <rule name="definition">
      <alternative>
         <refinement name="alternative"/>
      </alternative>
      <alternative>
         <nonterminal name="alternative"/>
         <one-or-more>
            <nonterminal name="alternative"/>
            <separator refinement="semicolon"/>
         </one-or-more>
      </alternative>
   </rule>
   ... etc ...
   <rule name="separator">
      <alternative><refinement name="symbol"/></alternative>
      <alternative><refinement name="empty"/></alternative>
   </rule>
   ... etc ...
</ixml>

(An observant reader will have spotted that we have allowed attributes to be defined by attributes here -- for instance with @refinement -- that is we treat an attribute within an attribute definition as if it were a refinement).

As yet another possibility, we can move the separator into an attribute of the one-or-more or zero-or-more elements:

one-or-more: -open, -definition, -close, -plus; -open, -definition, -close, -plus, -separator.
zero-or-more: -open, -definition, -close, -star; -open, -definition, -close, -star, -separator.
separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.

Alternative Representation

Although the syntax description so defined was developed iteratively based on the needs of the user, and is sufficient for its purpose, it is clear in the above example, that refinements occur far more frequently than true semantic rules. An alternative worth exploring would be to say that nothing is copied to the syntax tree unless specifically marked. Let us use the "^" character to mark items that are copied to the tree. The result is clearly much more restful on the eyes:

ixml: (^rule)+.
rule: @name, colon, definition, stop.
definition: alternative; ^alternative, semicolon, (^alternative)+semicolon.
alternative: (term)*comma.
term: symbol; repetition.
repetition: ^one-or-more; ^zero-or-more.
one-or-more: open, definition, close, plus; open, definition, close, plus, ^separator.
zero-or-more: open, definition, close, star; open, definition, close, star, ^separator.
separator: terminal; @nonterminal; @refinement.
symbol: terminal; ^nonterminal; ^refinement.
terminal: ^explicit-terminal; ^implicit-terminal.
explicit-terminal: up, @string.
implicit-terminal: @string.
nonterminal: up, @name.
refinement: @name.
attribute: at, @name.

string: openquote, (character)*, closequote.
name: (letter)+.
letter: ^"a"; ^"b"; ...
character: ...

colon: S, ":", S.
stop: S, ".", S.
semicolon: S, ";", S.
comma:  S, ",", S.
plus:  S, "+", S.
up:  S, "^", S.
star:  S, "*", S.
open:  S, "(", S.
close:  S, ")", S.
at:  S, "@", S.
openquote: S, """".
closequote: """", S.
S: " "*.

Extras

There are obvious extra odds and ends that need adding, such as sets of characters, to make terminal specification easier, for instance:

letter: ^["a"-"z", "A"-"Z", "-"].
S: [" ", "\t", "\n", ...]*.

but these are just details.

Restriction on the XML Produced

It should be noted in passing that in the form presented here, Invisible XML only works in one direction: you can turn any textual document into an equivalent XML document. However, it is not in general possible to turn a textual document into a particular XML form without more work. For instance, you could turn Wiki markup into an XML document, but not into XHTML in particular.

Roundtripping

Returning the resultant XML document to its original format is just a process of presentation, nothing that a suitable bit of XSLT couldn't do, or even CSS in some simple cases. In fact it should be apparent that from the Invisible XML syntax, it would be straightforward to automatically generate the required piece of XSLT directly.

Conclusion

There is really no reason why XML can't be more ubiquitous than it is, and similarly there is no reason why XML documents have to be written in an explicit XML format per se. Anything that can be parsed can be perceived as XML, since parsing is very easy, and parse-trees are really just XML documents in different clothing. Invisible XML allows a multitude of document formats to be authored in their traditional form, but be processed as XML, with the concomitant advantages of the XML toolchain.

References

[RELAX NG] James Clark, Makoto MURATA (eds.), 2001, RELAX NG Specification, https://www.oasis-open.org/committees/relax-ng/spec.html

[RELAX NG COMPACT] James Clark (ed.). 2002, RELAX NG Compact Syntax, https://www.oasis-open.org/committees/relax-ng/compact-20021121.html

[BNF] Backus-Naur Form, http://en.wikipedia.org/wiki/Backus-Naur_Form

[VWG] S. Pemberton, 1982, "Executable Semantic Definition of Programming Languages Using Two-level Grammars", http://www.cwi.nl/~steven/vw.html

[LL1] Alfred Aho and Jeffrey D. Ullman, 1977, "Principles of Compiler Design", Addison-Wesley, ISBN 0-201-00022-9.

[Earley] Earley, Jay (1970), "An efficient context-free parsing algorithm", Communications of the ACM 13 (2): 94-102, doi:https://doi.org/10.1145/362007.362035

[Media type] N. Freed et al., 1996, "Multipurpose Internet Mail Extensions, (MIME) Part Two: Media Types", http://www.ietf.org/rfc/rfc2046.txt