How to cite this paper
Pemberton, Steven. “Invisible XML.” Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). https://doi.org/10.4242/BalisageVol10.Pemberton01.
Balisage: The Markup Conference 2013
August 6 - 9, 2013
Balisage Paper: Invisible XML
Steven Pemberton
Researcher
CWI, Amsterdam
Copyright © Steven Pemberton 2013, all rights reserved.
Abstract
What if you could see everything as XML? XML has many
strengths for data exchange, strengths both inherent in the nature
of XML markup and strengths that derive from the ubiquity of tools
that can process XML. For authoring, however, other forms are
preferred: no one writes CSS or Javascript in XML. It does not
follow, however, that there is no value in representing such
information in XML. Invisible XML is a method
for treating non-XML documents as if they were XML, enabling
authors to write in a format they prefer while providing XML for
processes that are more effective with XML content. There is really
no reason why XML cannot be more ubiquitous than it is.
Table of Contents
- XML and
Authoring
- Parsing and Parse
trees
- The Approach
- Syntax
Description
- Terminals
- Extensions
- Parsing
Algorithms
- Delivery
- Using
Invisible XML to define itself
- Alternative
Representation
- Extras
- Restriction
on the XML Produced
- Roundtripping
- Conclusion
XML and
Authoring
XML is a popular format. It is widely and successfully used
for document and data storage, exchange and presentation. A major
advantage of using XML is the toolchain and pipeline available for
generic XML processing. You can easily use new formats within the
generic framework.
However, for authoring purposes XML is seldom preferred over
a notation more directly suited to the purpose. Few would prefer to
write their CSS rules as
<rule><simple-selector name="body"/><block><property name="color" value="blue"/></block></rule>
to the more direct
body {color: blue}
and even less would prefer to write
<statement><if><condition><comparison name="<"><var name="max"><var name="a"></comparison></condition><then><statement><assign><var name="max"/><expression><var name="a"/></expression></assign></statement></then></if></statement>
to the much more direct
if (max<a) then max=a;
And, of course it should be noted that even RELAX NG has both
an XML syntax and a 'compact' syntax RELAX NG RELAX NG COMPACT.
In fact if we are to be brutally honest, even XML formats
take short cuts for authoring ease. Take for instance an
<a>
element in XHTML:
<a href="http://www.w3.org/TR/1999/xhtml">XHTML</a>
This does not surface the real structure
of the underlying data. If we were to be completely faithful to the
principle of making all relevant structure explicit, we should
really write something along the lines of
<a><href><method type="href"/><domain name="org"/><site name="w3"/><sub name="www"/><path><root><sub name="TR"><sub name="1999"><sub name="xhtml"</sub></sub></sub></root></path></href><text>XHTML</text></a>
You might argue about the details here, but this example is
only to show that there are parts of XML documents that could be
further structured, but that we choose not to, possibly for
authoring ease, possibly for fear of being laughed out of
town.
The reasons for this are obvious: despite the disadvantages
of not being able to use the generic toolchain any more, or only to
a lesser degree, the increased readability of the source, and its
closer relation to the problem domain makes authoring so much
easier.
Parsing and Parse
trees
Part of the advantage of XML is that there is a single parser
needed to be able to deal with any kind of document. This can be
contrasted with for instance the situation for HTML, where you need
a parser for the HTML, with separate parsers for CSS and Javascript
at least, (and URLs), creating extra complexity and
brittleness.
But looked at through a suitable pair of glasses, what is XML
apart from a description of a parse tree for some format (with some
special treatment for text nodes)? And frankly, what is so
difficult about general-purpose parsing? It is a widely understood
and easily solved problem. Is it not possible to combine the best
of both worlds, and have authorable formats, that can still use the
XML tool chain? Couldn't XML become the underlying format for
everything?
The Approach
The approach presented here is to add one more step to the
XML processing chain, an initial one. This step takes any textual
document, and a (reference to) a suitable syntax description,
parses the document using the syntax description, and produces as
output a parse tree that can be treated as an XML document with no
further parsing necessary (or alternatively, the document can be
serialised out to XML).
In other words, the input document might be
body {color: blue}
but the result of the parse will be the same as if an XML
parser had been presented with the XML document
<css>
<rule><simple-selector name="body"/>
<block><property name="color" value="blue"/></block>
</rule>
</css>
We call this method Invisible XML, since
the document is treated as XML, but it is not visibly an XML
document.
Syntax
Description
The requirement is to find a suitable way to describe the
syntax of the input document so that the resultant parse-tree is of
the form suitable for use in our XML chain. If we were to use BNF
BNF, arguably the most well-known syntax-description format, it
might look like this (in what follows "..." is used for parts of
the definition that have been elided and will be defined
later):
<css> ::= <rules>
<rules> ::= <rule> | <rules> <rule>
<rule> ::= <selector> <block>
<block> ::= "{" <properties> "}"
<properties> ::= <property> | <property> ";" <properties>
<property> ::= <name> ":" <value> | <empty>
<selector> ::= <name>
etc, etc. But it is quickly apparent that this has some
shortcomings. Firstly a surface problem that since we are using
this for XML, we could quickly go crazy with the use of angle
brackets for two different purposes. Although there is a certain
charm to defining the <css>
element with a
syntax rule whose name is <css>
, let us
rather use a different format. Therefore we shall use a variant of
VWG format VWG. This looks like:
css: rules.
rules: rule; rules, rule.
rule: selector, block.
block: "{", properties, "}".
properties: property; property, ";", properties.
property: name, ":", value; empty.
selector: name.
name: ...
value: ...
empty: .
(We shall restrict ourselves to a simplified CSS grammar for
the sake of this article).
Note that ";" signifies alternatives, and as is normal in
syntax definitions, if one alternative is empty (or reduces to
empty), the rule is optional.
If we parse the snippet of CSS above with this, and then
represent the resulting parse tree in an XML style (so that each
nonterminal is represented as an XML element), a second problem
becomes apparent:
<css>
<rules>
<rule>
<selector>body</selector>
<block>
<properties>
<property>
<name>color</name>
<value>blue</value>
</property>
</properties>
</block>
</rule>
</rules>
</css>
namely that there are certain elements in the tree
(rules
, properties
) that we
really aren't interested in. (You'll notice that some terminal
symbols such as the brackets, colons and semicolons don't appear in
the parse tree. This will be discussed later).
The problem becomes even more apparent with a CSS snippet
like
body {color: blue; font-weight: bold}
since the content of the <block>
element then becomes even more unwieldly:
<properties>
<property>
<name>color</name>
<value>blue</value>
</property>
<properties>
<property>
<name>font-weight</name>
<value>bold</value>
</property>
</properties>
</properties>
where we would prefer to see the much more direct
<property>
<name>color</name>
<value>blue</value>
</property>
<property>
<name>font-weight</name>
<value>bold</value>
</property>
The problem arises in this case because the syntax
description method relies on recursion to deal with repetition. To
that end, we shall introduce a specific notation for repetition.
Zero or more repetitions:
(rule)*
and one or more repetitions:
(rule)+
In fact we shall extend these two postfix operators to also
act as infix operators, to handle a commonly occurring case:
(property)*";"
(property)+";"
which respectively mean "zero or more, separated by
semicolon" and "one or more, separated by semicolon" (there is no
reason to restrict the separator to a terminal as here; it may also
be a nonterminal).
Now we can specify our syntax as:
css: (rule)*.
rule: selector, block.
block: "{", (property)*";", "}".
property: name, ":", value; .
name: ...
value: ...
and the parsetree will now look like this:
<css>
<rule>
<selector>body</selector>
<block>
<property>
<name>color</name>
<value>blue</value>
</property>
<property>
<name>font-weight</name>
<value>bold</value>
</property>
</block>
</rule>
</css>
However, there is another reason why we might not want a
syntax rule name to appear in the parse tree, and that is when we
use a syntax rule as a refinement, that is to
say, when the syntax rule doesn't represent anything of semantic
importance, but has been defined so that we can use it in several
places without having to repeat it. For instance, suppose we wanted
to define a series of properties in a separate rule:
properties: (property)*";".
and use it:
block: "{", properties, "}".
but not want <properties>
to appear
in the final parse tree. What we define is that the use of any rule
name preceded by a minus sign is only being used for refinement. So
that would give us:
properties: (property)*";".
block: "{", -properties, "}".
and this would result in the same parse-tree as above. Note
that this still allows a rule to be used in other places and appear
in the parse tree if needed.
Also note that for simplicity we have ignored treating spaces
in the syntax description, but that is also an example of something
you would not want to have in the parse tree:
colon: -spaces, ":", -spaces.
spaces: " "*.
Similarly, we can use it to make empty alternatives more
explicit:
property: name, ":", value; -empty.
empty: .
Terminals
As alluded to above, in general, terminal symbols do not
appear in the parse-tree, since most of them are only there to
delimit structural elements in the source file. If you want them to
show up, you can add an explicit rule for them:
colon: ":".
which will cause them to show up in the tree like
this:
<property>
<name>color</name>
<colon/>
<value>blue</value>
</property>
However, there are places where terminals have semantic
meaning, and you do want them to appear in the
parse-tree, for instance in our example the names and values of the
properties. To achieve this we mark terminals that are to be copied
to the parse tree specially:
name: (+"a"; +"b"; ...etc...; +"9"; +"-")+.
In other words, normally terminals are discarded, but if they
are preceded with a + they are copied to the parse-tree.
Extensions
Strictly speaking, this would be enough to allow you to parse
a document, and output it as an equivalent XML document. However,
there are possible extensions that give you a little more control
over the result. The most obvious is allowing the specification of
attributes. This is simply done by marking the use of rules with at
signs:
css: (rule)*.
rule: selector, block.
block: "{", (property)*";", "}".
property: @name, ":", value.
A rule used like this may clearly not contain any structural
elements (though it may contain terminals and refinements), since
attributes are not structured, but this is an easy condition to
check for. The parsetree will now look like this:
<css>
<rule>
<selector>body</selector>
<block>
<property name="color">
<value>blue</value>
</property>
<property name="font-weight">
<value>bold</value>
</property>
</block>
</rule>
</css>
If we changed the rule for property
to
look like this:
property: @name, ":", @value.
then the resultant parse-tree would look like
<css>
<rule>
<selector>body</selector>
<block>
<property name="color" value="blue"/>
<property name="font-weight" value="bold"/>
</block>
</rule>
</css>
Note that by marking the use of a syntax
rule in this way, and not the definition, it allows the syntax rule
to be used for structural elements
(<name>color</name>
) as well as for
attributes (name="color"
).
Parsing
Algorithms
Although it would be possible to require the syntax to be
restricted to some class of language, such as LL(1) or LR(1) LL1
in order to make the parser faster, in practice it is easier for
the author of the syntax if we make no such restriction, since it
would require the author to understand the principles, and it would
require the system to check that the syntax adhered to the
requirement. In practise a parsing algorithm such as Earley's
Earley is fast enough, and will treat all context-free languages.
The only remaining problem is if the syntax author describes an
ambiguous language. To that end we just define that the parser
outputs one of the parses, and leave it at that. For instance, if
expression were defined as:
expr: i; expr, plus, expr.
i: "i".
plus: "+".
then a string such as
i+i+i
could be parsed as both
<expr><i/></expr>
<plus/>
<expr>
<expr><i/></expr>
<plus/>
<expr><i/></expr>
</expr>
and as
<expr>
<expr><i/></expr>
<plus/>
<expr><i/></expr>
</expr>
<plus/>
<expr><i/></expr>
Delivery
To deliver a source document to be parsed by our system, we
can use a media type Media type that supplies a reference to the
required syntax description. For instance:
application/xml-invisible; syntax=http://example.com/syntax/css
Clearly a system can cache well-known syntax
descriptions.
Using
Invisible XML to define itself
It should go without saying that the syntax descriptions
themselves are in Invisible XML (though in their case the syntax
description must be cached to prevent an
infinite loop of processing.)
The definition might look like this:
ixml: (rule)+.
rule: @name, -colon, -definition, -stop.
definition: (alternative)*-semicolon.
alternative: (-term)*-comma.
term: -symbol; -repetition.
repetition: one-or-more; zero-or-more.
one-or-more: -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star, separator.
separator: -symbol; -empty.
empty: .
symbol: -terminal; nonterminal; refinement.
terminal: explicit-terminal; implicit-terminal.
explicit-terminal: -plus, @string.
implicit-terminal: @string.
nonterminal: @name.
refinement: -minus, @name.
attribute: -at, @name.
string: -openquote, (-character)*, -closequote.
name: (-letter)+.
letter: +"a"; +"b"; ...
character: ...
colon: -S, ":", -S.
stop: -S, ".", -S.
semicolon: -S, ";", -S.
comma: -S, ",", -S.
plus: -S, "+", -S.
minus: -S, "-", -S.
star: -S, "*", -S.
open: -S, "(", -S.
close: -S, ")", -S.
at: -S, "@", -S.
openquote: -S, """".
closequote: """", -S.
S: " "*.
This would then parse to the XML form:
<ixml>
<rule name="ixml">
<alternative>
<one-or-more>
<alternative>
<nonterminal name="rule"/>
</alternative><separator/>
</one-or-more>
</alternative>
</rule>
<rule name="rule">
<alternative>
<attribute name="name"/>
<refinement name="definition"/>
</alternative
</rule>
<rule name="definition">
<alternative>
<zero-or-more>
<alternative>
<nonterminal name="alternative"/>
</alternative>
<separator><refinement name="semicolon"/></separator>
</zero-or-more>
</alternative
</rule>
... etc ...
<rule name="separator">
<alternative><refinement name="symbol"/></alternative>
<alternative><refinement name="empty"/></alternative>
</rule>
... etc ...
</ixml>
Thanks to Earley's parsing algorithm, we can remove the
<alternative>
elements when there is only
one alternative
in a rule
, by
redefining definition
:
definition: -alternative; alternative, -semicolon, (alternative)+-semicolon.
Note how we have used the "-" character to prevent it being
copied in the first case (when there is only one). You wouldn't be
able to use such a rule as this if there were a requirement on the
syntax to be LL(1) or LR(1), since the two parts of the rule start
with the same symbols.
Similarly, we can get rid of empty
<separators/>
thusly:
one-or-more: -open, -definition, -close, -plus; -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star; -open, -definition, -close, -star, separator.
separator: -symbol.
We can move the value of the separator into an attribute
with:
separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.
This would then generate:
<ixml>
<rule name="ixml">
<one-or-more>
<nonterminal name="rule"/>
</one-or-more>
</rule>
<rule name="rule">
<attribute name="name"/>
<refinement name="definition"/>
</rule>
<rule name="definition">
<alternative>
<refinement name="alternative"/>
</alternative>
<alternative>
<nonterminal name="alternative"/>
<one-or-more>
<nonterminal name="alternative"/>
<separator refinement="semicolon"/>
</one-or-more>
</alternative>
</rule>
... etc ...
<rule name="separator">
<alternative><refinement name="symbol"/></alternative>
<alternative><refinement name="empty"/></alternative>
</rule>
... etc ...
</ixml>
(An observant reader will have spotted that we have allowed
attributes to be defined by attributes here -- for instance with
@refinement
-- that is we treat an attribute
within an attribute definition as if it were a refinement).
As yet another possibility, we can move the separator into an
attribute of the one-or-more
or
zero-or-more
elements:
one-or-more: -open, -definition, -close, -plus; -open, -definition, -close, -plus, -separator.
zero-or-more: -open, -definition, -close, -star; -open, -definition, -close, -star, -separator.
separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.
Alternative
Representation
Although the syntax description so defined was developed
iteratively based on the needs of the user, and is sufficient for
its purpose, it is clear in the above example, that refinements
occur far more frequently than true semantic rules. An alternative
worth exploring would be to say that nothing
is copied to the syntax tree unless specifically marked. Let us use
the "^" character to mark items that are copied to the tree. The
result is clearly much more restful on the eyes:
ixml: (^rule)+.
rule: @name, colon, definition, stop.
definition: alternative; ^alternative, semicolon, (^alternative)+semicolon.
alternative: (term)*comma.
term: symbol; repetition.
repetition: ^one-or-more; ^zero-or-more.
one-or-more: open, definition, close, plus; open, definition, close, plus, ^separator.
zero-or-more: open, definition, close, star; open, definition, close, star, ^separator.
separator: terminal; @nonterminal; @refinement.
symbol: terminal; ^nonterminal; ^refinement.
terminal: ^explicit-terminal; ^implicit-terminal.
explicit-terminal: up, @string.
implicit-terminal: @string.
nonterminal: up, @name.
refinement: @name.
attribute: at, @name.
string: openquote, (character)*, closequote.
name: (letter)+.
letter: ^"a"; ^"b"; ...
character: ...
colon: S, ":", S.
stop: S, ".", S.
semicolon: S, ";", S.
comma: S, ",", S.
plus: S, "+", S.
up: S, "^", S.
star: S, "*", S.
open: S, "(", S.
close: S, ")", S.
at: S, "@", S.
openquote: S, """".
closequote: """", S.
S: " "*.
Restriction
on the XML Produced
It should be noted in passing that in the form presented
here, Invisible XML only works in one
direction: you can turn any textual document into an equivalent XML
document. However, it is not in general possible to turn a textual
document into a particular XML form without
more work. For instance, you could turn Wiki markup into an XML
document, but not into XHTML in particular.
Roundtripping
Returning the resultant XML document to its original format
is just a process of presentation, nothing that a suitable bit of
XSLT couldn't do, or even CSS in some simple cases. In fact it
should be apparent that from the Invisible XML
syntax, it would be straightforward to automatically generate the
required piece of XSLT directly.
Conclusion
There is really no reason why XML can't be more ubiquitous
than it is, and similarly there is no reason why XML documents have
to be written in an explicit XML format per
se. Anything that can be parsed can be perceived as XML,
since parsing is very easy, and parse-trees are really just XML
documents in different clothing. Invisible XML
allows a multitude of document formats to be authored in their
traditional form, but be processed as XML, with the concomitant
advantages of the XML toolchain.
References
[RELAX NG] James Clark, Makoto MURATA (eds.), 2001, RELAX NG Specification,
https://www.oasis-open.org/committees/relax-ng/spec.html
[RELAX NG COMPACT] James Clark (ed.). 2002, RELAX NG Compact Syntax,
https://www.oasis-open.org/committees/relax-ng/compact-20021121.html
[BNF] Backus-Naur Form,
http://en.wikipedia.org/wiki/Backus-Naur_Form
[VWG] S. Pemberton, 1982, "Executable Semantic Definition of
Programming Languages Using Two-level Grammars",
http://www.cwi.nl/~steven/vw.html
[LL1] Alfred Aho and Jeffrey D. Ullman, 1977, "Principles of
Compiler Design", Addison-Wesley, ISBN 0-201-00022-9.
[Earley] Earley, Jay (1970), "An efficient context-free
parsing algorithm", Communications of the ACM 13 (2): 94-102,
doi:https://doi.org/10.1145/362007.362035
[Media type] N. Freed et al., 1996, "Multipurpose Internet
Mail Extensions, (MIME) Part Two: Media Types",
http://www.ietf.org/rfc/rfc2046.txt
×Alfred Aho and Jeffrey D. Ullman, 1977, "Principles of
Compiler Design", Addison-Wesley, ISBN 0-201-00022-9.