How to cite this paper
Berjon, Robin. “Mending Fences and Saving Babies.” Presented at Symposium on HTML5 and XML, Washington, DC, August 4, 2014. In Proceedings of the Symposium on HTML5 and XML. Balisage Series on Markup Technologies, vol. 14 (2014). https://doi.org/10.4242/BalisageVol14.Berjon01.
Symposium on HTML5 and XML
August 4, 2014
Balisage Paper: Mending Fences and Saving Babies
Robin Berjon
Robin Berjon is a freelance consultant carrying out research, prototyping, and
standardisation in Web, mobile, and XML technologies. He has worked on both Web and
XML
standards for over a decade, and is currently trying to herd HTML5 to Recommendation
as
part of the W3C team. He lives in Paris, France, with his wife, two daughters, and
a
rather idiotic cat.
Copyright © 2014 Robin Berjon
Abstract
The harshest squabbles are fraternal, and as fraternal squabbles go, the one between
the
proponents of XML and HTML has been at times quite brutal. This kerfuffle has opened
large
rifts in what is largely a like-minded community. The differences between XML and
HTML are
genuine, especially when considering not just the markup but the full family of technologies
that have grown around them. But do they really justify animosity? Both XML and HTML
have
created strong solutions to varied problems often ignored by other angle bracketists.
Their
many commonalities mean that the XML and HTML communities need not throw away one
another’s
babies in a big slosh of bathwater. It is time for a candid conversation about flaws
and
limitations, and from there to mend fences.
In this paper we look at some mythical preconceptions that each community has about
the
other, go through a number of topics that show where there is value in “looking over
the
fence”, and reach what is hopefully a pragmatic conclusion as to what each community
needs
to do.
Table of Contents
- Introduction
- Myths of the Markup Community
- How JavaScript Is Saving the Document
- HTML in an XML pipeline
- Extending HTML: Web Components
- Conclusion
Introduction
One of the original hopes for XML and its family of technologies was that it would
be the
markup infrastructure for the Web. This goal was notably materialised in the suite
of XHTML
specifications, in SVG and MathML, as well as in the use cases considered for a number
of XML
technologies such as XSLT, XSL-FO, XLink, and many others.
This dream, however, has failed. XML is without a doubt a very successful set of technologies
and benefits from a healthy community and powerful tooling. It is nevertheless close
to absent
from Web content.
The failure of this dream, and the way it was brought about, has created a lot of
animosity in
the broader markup community. XML aficionados feel they have been cheated out of their
future
by HTML, browser vendors, and whoever is even remotely associated with today’s Web.
On the
other side, HTML people feel they had to fight XML bitterly in W3C and in their daily
jobs in
order for Web technology to have the properties they felt were needed for its success.
In the aftermath of this dispute, the two communities are largely estranged from one
another.
XML heads hold their noses at JavaScript and at HTML parsing; HTML people fashionably
disparage XML in much the same way that one makes fun of Java.
It is this paper’s position that this attitude is hurting both. While it seems unlikely
— and
not even desirable — that some form of grand merge of XML and HTML would take place,
there is
nevertheless value in opening up a bidirectional discussion between the two communities
so
that they may learn from one another’s tools and ideas, mending fences as it were
so that each
side stops throwing away babies with bathwater.
Myths of the Markup Community
Without getting into excessive detail about myths, rumours, and hearsay, it is useful
to take
a quick look at the myths that the XML and HTML communities entertain about one another.
If
nothing else, it tells us where each is coming from and can help avoid clichés as
well as map
out places of genuine contention.
To the HTML crowd, XML is essentially overly strict, full of overly complicated
solutions that are perceived to be either enterprise-like (invented to give Java a
reason to exist) or completely academic and impractical.
There is no doubt that technologies such as XML Schema, not to mention the whole SOAP
stack,
have a lot to do with this perception. But as we will see below, the famed Desperate
JavaScript Hacker (who in today’s world has come to replace the Desperate Perl Hacker)
does
not have a monopoly on useful technologies, and some tools that may be pitched in
a manner
reminiscent of enterprise-speak — likely because that is where they can be sold today
— can
have value in many other contexts.
Conversely, to XML people HTML is messy, hackish, cannot be parsed or processed
reliably, isn’t extensible, and is plagued with tools designed by and for amateurs,
chief amongst which stands JavaScript, often considered to be a toy language.
Again, there are genuine issues that brought about this perception. For the longest
time, HTML
was indeed impossible to parse properly, and it is only now acquiring extensibility.
The fact
that its tooling is accessible to beginners — a strength — entails that there are
many
beginners dabbling in it. But the sheer scale of the Web and its obvious ability to
deliver
complex, major, highly successful projects put together with the massive creativity
of its
communities of developers should give some indication that it may not be entirely stupid
and
unreliable.
Covering all of the ways in which one community could inform the other would require
more
space than there is here. We can, however, look at a few examples in the hope that
it will
whet the readers’ curiosity.
How JavaScript Is Saving the Document
There are few communities in which JavaScript is more reviled than amongst document
lovers.
By its very dynamic and interactive nature, it is seen to single-handedly destroy
a
document’s meaning and any hope for the processing of data outside the visual, interactive
mediation of the browser.
In many instances, and for a large class of documents, that has indeed been the case. Since
any
HTML page may be a document just as well as an application, and since the difference
between
the two is blurry at best, one cannot in general process HTML in a meaningful manner
outside
of the browser.
This can be contrasted with what was a large part of the vision of XML on the Web.
The idea,
at least that of many, was that one could produce purely semantic content in the form
of an
XML document, and then attach to it some XSLT that would transform it for clients
that
required such transformations. This was seen as providing a separation of concerns
superior to
that afforded by HTML+CSS since one can easily introduce elements more meaningful
than, say, div. (How the actual semantics of a given arbitrary markup
language were supposed to be conveyed to users, tools, or the accessibility layer
was,
however, largely swept under the rug as a problem to be solved at a later date.)
It may therefore come as a shock to some that, today, JavaScript is paving the way
for a
return of that very usage.
Single Page Applications (SPA) are Web applications in which all of the resources
that define
a page (the HTML chrome, JavaScript, CSS, etc.) are loaded once and then used as a
shell
inside of which content can change. A well-known advantage of SPAs is that they massively
increase performance and thereby provide users with a better, snappier experience.
They also
avoid having to deal with application logic that is split between client and server,
and are
therefore very much desirable for developers — even in cases where the application
is largely
content-oriented (e.g. a blog). Note that, contrary to still-popular belief, SPAs
can be made
URL-friendly through use of the History
API.
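As a hedged sketch of how this works (the content container and URLs below are invented for illustration), the SPA shell swaps content in, records each location with history.pushState, and restores content when the user navigates back:

    // A minimal sketch of URL-friendly navigation in an SPA using the History
    // API. The "#content" container and the URLs are purely illustrative.
    var content = document.querySelector('#content');

    // Fetch a content fragment and inject it into the page shell.
    function show(url) {
      return fetch(url)
        .then(function (response) { return response.text(); })
        .then(function (html) { content.innerHTML = html; });
    }

    // Render the new content, then record the location so that the address
    // bar, bookmarks, and the back button keep behaving as users expect.
    function navigate(url) {
      show(url).then(function () {
        history.pushState({ url: url }, '', url);
      });
    }

    // When the user presses back or forward, re-render the stored location.
    window.addEventListener('popstate', function (event) {
      if (event.state && event.state.url) {
        show(event.state.url);
      }
    });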
To date, SPAs have mostly been used in the production of very application-oriented
content.
The reason is that the robotic crawlers used by search engines have so
far been
unable to process them, and no one wants their content kept away from search engines.
This
situation is changing. Increasingly, crawlers are able to process JavaScript-heavy
pages, and
do so. This opens up the door to far broader deployment of SPAs, and since they are
often
more convenient for developers the odds are strong that they will come to dominate
even
content-based sites.
This essentially brings about the content/transformation distinction that XSLT and
XML were
aiming for on the Web. One can maintain (and make available) a set of “pure” documents
that
the SPA then renders on the client. The content may be simplified, semantic, possibly
enhanced
HTML that adequately captures the intended meaning (and is happily devoid of all the
navigation and useless paraphernalia that most pages would normally contain). It can
also be,
for more data-oriented content, JSON. And naturally, if desired, it can be XML. In
all cases,
XSLT cannot be expected to be natively available for transformation, but there exist
many
solutions that can be picked based on what best matches one’s needs. (The author routinely
uses jQuery as a transformation language precisely for this sort of task.) If one
prefers XML,
there are even XSLT and XQuery libraries available; one can treat the browser simply as a VM and deploy whichever technology one prefers.
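By way of illustration only, the following sketch assumes a hypothetical semantic vocabulary (title, para) and document URL, and shows the general shape of such a client-side transformation using nothing but the DOM; jQuery, or an XSLT or XQuery library, could fill the same role:

    // Sketch: fetch a "pure" semantic document and render it into the SPA
    // shell. The vocabulary (title, para) and the URL are hypothetical.
    function render(url, target) {
      fetch(url)
        .then(function (response) { return response.text(); })
        .then(function (text) {
          var doc = new DOMParser().parseFromString(text, 'application/xml');
          var out = document.createDocumentFragment();

          // A deliberately trivial mapping from semantic elements to
          // presentational HTML.
          var title = doc.querySelector('title');
          if (title) {
            var h1 = document.createElement('h1');
            h1.textContent = title.textContent;
            out.appendChild(h1);
          }
          doc.querySelectorAll('para').forEach(function (para) {
            var p = document.createElement('p');
            p.textContent = para.textContent;
            out.appendChild(p);
          });

          target.innerHTML = '';
          target.appendChild(out);
        });
    }

    render('article.xml', document.querySelector('#content'));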
The attentive reader will note that even if SPAs effectively make the client-side
XSLT
publishing workflow a reality, they still do not solve the problem of properly conveying
arbitrary semantics that was mentioned above. Hopefully, though, in being successful
they will
make the problem more salient and thereby bring about a solution.
HTML in an XML pipeline
HTML was initially supposed to be defined as an application of SGML. But few implementations
effectively followed that path, and it quickly grew to be defined solely as a set
of hacks
and bugs mimicked from others’ bugs, leading to the well-known “tag soup” situation
that
essentially made it scrapable at best only through regular expressions.
While that situation prevailed for a long time, it no longer reflects reality. The
HTML
parsing algorithm has now been fully defined, and is highly interoperable. It certainly
has
its complex, dirty corners, but those only need to be implemented once. And in many
ways, they
are no worse than some of the warts found in the likes of XML or SGML.
Today, when applying an HTML parser, you obtain real, usable DOMs that are guaranteed
to be
interoperably the same across implementations. The HTML DOM even benefits from a mapping
to
XML known as the “Infoset coercion” rules. As a result, largely any tool that you
can apply to
XML can be applied equally well to HTML provided you front your pipeline with an HTML
parser.
No need to even stick to the so-called “polyglot” syntax (which has issues of its
own).
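As a small sketch (the input string is just an illustration), any DOM-capable runtime can turn tag soup into a well-formed tree and hand it straight to XML tooling:

    // Parse arbitrary, unclosed "tag soup" into a real DOM, then serialise
    // the resulting tree as XML so that XSLT, XQuery, or an XML database
    // can take over from there.
    var soup = '<table><tr><td>cell one<td>cell two</table>';

    var doc = new DOMParser().parseFromString(soup, 'text/html');

    // The HTML parser has repaired the markup: a tbody has been inserted
    // and every td has been closed.
    var xml = new XMLSerializer().serializeToString(doc);

    console.log(xml);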
As things stand today, processing a large HTML corpus remains painful. There are full-text
indexers, but they rarely afford much flexibility in taking the structure into account.
One
can naturally parse the HTML and process the DOM, but doing that for every search
on a large
corpus is of course prohibitively expensive. It is possible to produce ad hoc indices
built
with such processing, but that removes the benefit of arbitrary querying. In other
words,
there is no such thing as an “HTML database” to match the existing XML databases.
In developing Web standards, we regularly need to look at large HTML corpora to determine
whether a given usage is common or how people actually use the technology (for instance,
a
dump of the front pages of the top million sites). The tool we use for this? Typically: grep.
Yet a lot of data is captured as HTML. Huge corpora contain a humongous amount of
information,
for instance in tables, that is effectively locked away there.
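The kind of one-off extraction one writes today looks roughly like the following sketch (the URL and the table layout are hypothetical); it works, but every new question means writing new code rather than posing a query:

    // Sketch: pull the rows of every table in a page out into plain arrays
    // of cell text. The URL is purely illustrative.
    fetch('https://example.org/report.html')
      .then(function (response) { return response.text(); })
      .then(function (html) {
        var doc = new DOMParser().parseFromString(html, 'text/html');

        doc.querySelectorAll('table').forEach(function (table) {
          var rows = [];
          table.querySelectorAll('tr').forEach(function (tr) {
            var cells = [];
            tr.querySelectorAll('th, td').forEach(function (cell) {
              cells.push(cell.textContent.trim());
            });
            rows.push(cells);
          });
          console.log(rows);
        });
      });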
That’s a situation for which something like XQuery could prove itself extremely useful.
There
is, in fact, very little that prevents one from loading HTML directly into an XML
database
and processing it. Yet few do, likely because the “X” in “XQuery” serves as a scarecrow. As Liam Quin recently put it, it may have been better for XQuery to be
called
something like “Fast Forest”.
Slightly to the side of HTML, but by and large in the same technological bucket, a
similar
situation applies to JSON. There do exist JSON databases — many of them actually —
but their
query abilities are often poor to laughable. Solutions built atop XQuery, such as
JSONiq,
would without a doubt solve many real problems that people are facing when managing
their
JSON data. Yet the mutual ignorance between our communities is such that these tools remain largely unknown.
Another great example of technology built for XML being applied to HTML comes from
DeltaXML.
Producing meaningful diffs of HTML content is, today, a largely painful undertaking, especially if the HTML is irregular, large, and heavily marked up. That problem is largely solved for XML, and HTML could benefit greatly from that solution becoming more widely available.
Extending HTML: Web Components
A strong point of contention between XML and HTML is the notion of extensibility,
or more
precisely of extensibility carried out by arbitrary third-parties with no requirement
to
work their way through a centralised standard. In other words, “distributed extensibility”.
XML’s solution to distributed extensibility is XML Namespaces. For all that they may
be
reviled, namespaces do work in bringing distributed extensibility to XML — but only
in a
limited sense.
Extensibility can happen at many levels: the syntax, the vocabulary, the meaning,
the styling,
the behaviour… Neither XML nor HTML has extensible syntax, and there seems to be only
limited
demand for that. Namespaces deliver vocabulary extensibility: you can create your
own
vocabulary easily, and if you’re not entirely daft you can do so in a manner that
won’t
conflict with anyone else’s. However, namespaces stop there. Even without considering
the
problems inherent in interactive behaviour, just discovering how two vocabularies
mixed
together need to be processed is an unsolved problem and requires resorting to ad
hoc
development. This situation is worsened by the fact that some schema languages, most
notably
XML Schema, don’t even consider XML to be extensible by default.
HTML does not have a real solution at the vocabulary level (unless you count prefixing
your
elements in a global namespace as a solution). It does, however, have an approach from
the other
end of the spectrum: Web Components.
There is not enough space in this paper to provide a full-fledged introduction to
Web
Components (for more on the topic, I recommend the
webcomponents.org website) but the part that
is essential for this discussion should be easy to grasp without a full understanding
of the
technology.
Essentially, the point at which HTML behaviour is integrated into a browser engine is the HTMLElement interface. That is where the common APIs hang, where CSS
applies, where integration with the class and ID system happens, and much more. What
Web
Components enable is essentially for developers to create their own arbitrary elements
by
subclassing HTMLElement
and providing their own implementation, injected into the
runtime.
Once that has been done, the new element is treated just like any other built-in element. What’s more, thanks to the concept of shadow trees (essentially subtrees of the DOM that can hide recursively behind regular DOM nodes) it is possible to intermix Web Component
content and regular HTML at will.
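A hedged sketch, using the current Custom Elements API and an invented x-note element, shows the general shape: subclass HTMLElement, register the class, and render internal structure into a shadow tree:

    // Sketch of a custom element: subclass HTMLElement, register it, and
    // hide its internal markup in a shadow tree. The "x-note" name and its
    // styling are purely illustrative.
    class XNote extends HTMLElement {
      constructor() {
        super();
        // The shadow root hides this internal markup from the regular DOM;
        // the element's own children show through the slot.
        var shadow = this.attachShadow({ mode: 'open' });
        shadow.innerHTML =
          '<style>.box { border: 1px solid #999; padding: 0.5em; }</style>' +
          '<div class="box"><slot></slot></div>';
      }
    }

    customElements.define('x-note', XNote);

    // Once registered, the element is used like any built-in one:
    //   <x-note>Remember to mend that fence.</x-note>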
It is therefore interesting to note that neither XML nor HTML has solved the distributed extensibility problem across the board. Each has solved it from the end that made the most sense for its most common usage. Because of this difference in use cases, depending on where
you stand
either one may be seen as extensible and the other not, or vice-versa.
But this difference of viewpoint need not lead to opposition; it points rather to complementarity. It is important to know and to understand both so that one may be
able to
rely on either when the applicable need arises rather than shun half of the solution
space.
In my XML Prague 2014 paper “Distributed
Extensibility: Finally Done Right?” I go so far as to point out how one could transform
XML-namespaced content into a syntax friendly to Web Components in order to implement
the
behaviour of an XML language; and provide indications as to how the two could be integrated
more closely together. Deciding whether that is wise or not is left as an exercise
for the
reader, but it does point to complementarity rather than opposition.
Conclusion
We hope to have shown through this overview that there is value for anyone sitting
on one side
of the fence to go look at what is going on on the other side, assuming that the “others”
may
in fact be smart people with somewhat different needs rather than dumb people who
just don’t
“get it”.
A good example here is SVG. While originally defined in XML, and in fact deeply steeped
in XML
technology throughout, it struggled for years to reach any decent level of usage.
At some
point came the realisation that “SVG isn’t about
XML, or even syntax, it’s about sassy, sexy, wicked cool graphics that make you go
wow.”
Ever since adopting the changes that make it usable equally well in XML and HTML contexts,
SVG has undergone a period of blooming and has grown to be a solid part of the Web
platform.
There are several more such stories waiting to be written.