How to cite this paper
Quin, Liam. “Extending Vocabularies: The Rack and the Weeds: Social Context and Technical Consequence.” Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). https://doi.org/10.4242/BalisageVol23.Quin01.
Balisage: The Markup Conference 2019
July 30 - August 2, 2019
Balisage Paper: Extending Vocabularies: The Rack and the Weeds
Social Context and Technical Consequence
Liam Quin
Visionary
Delightful Computing
Liam Quin runs an information design company, Delightful
Computing, and previously was XML Activity Lead at the World
Wide Web Consortium; before that they were involved in the creation of
XML itself and in SGML, most notably at SoftQuad Inc. in Toronto.
Their background is in digital typography, text processing and
computer science.
Abstract
In its simplest form a vocabulary is a
set of words and phrases with predefined meanings. In this paper the
term is used to mean a controlled vocabulary and,
in particular, a controlled vocabulary in the context of computer markup
languages such as XML or JSON or SGML.
Vocabularies are created in specific contexts and for specific
purposes. Like all human constructs they are flawed and need to be
repaired and changed over time; as people use vocabularies they also
gain understanding of the limitations in them and often want to extend
them. Understanding these processes involves an understanding of the
human needs involved: the social contexts in which people interact with
and around the vocabularies. This paper characterizes some of these
contexts and their properties, and in the light of this characterization
describes changes to vocabularies, both successful and
unsuccessful.
Table of Contents
- Introduction
- The Social Context of Vocabularies
- An Ontology For Extensions
- Planned-for Extensions
- Grammar Hooks
- Unchecked Islands
- Extension Names
- Unanticipated Extensions
- Altered Grammar
- Usage Conventions
- Unchecked Usage
- Hybrid and Absorbed Extensions
- Ambiguous Markup
- New Vocabulary Features
- Usage Conventions Adopted
- Internal and Interchange Formats
- Evaluating Extensions
- Vocabulary Life Cycle: the Birth of an Extension
- Committee Proposals
- Community Proposals
- Forks
- Merging
- After the Work Ends
- Characterizing Extensions
- Functional Extensions: New Behaviour
- Semantic Coverage: New Meanings
- Implicit Extensions
- Explicit Extensions
- Usage Conventions
- Methods of Extension
- Adding New Elements
- Adding New Attributes
- Adding New Content
- Adding New Values
- Subtractions
- Combining Vocabularies: Xreole
- Adapting Existing Markup
- Scripting
- Inhibiting Factors
- Encouraging Benevolent Extensions
- Version Numbering
- Allowing Mixed Namespaces
- Fallback
- Extension Attributes and Namespaces
- Communication
- Conclusions
Introduction
The SGML standard defines the following term:
4.279 SGML application: Rules
that apply SGML to a text processing application. An SGML application
includes a formal specification of the markup constructs used in the
application, expressed in SGML. It can also include a non-SGML
definition of semantics, application conventions, and/or
processing.
— ISO 8879:1986 SGML
The SGML standard attempted to give a formal definition for what today
might be called a markup vocabulary. When XML made the explicit document
type declaration optional and provided other ways to share
computer-processable specifications, such as XML Schema Documents, the
term Document Type Definition, or DTD, gradually gave way to the broader
and more informal term vocabulary.
The non-SGML part of an SGML application, as with vocabularies in
other systems such as XML, HTML or JSON, can include natural-language
prose that might add constraints not easily expressed in markup:
“the n attribute shall be a Mersenne Prime Number expressed in
Roman numerals”, for example. Such constraints can sometimes be
enforced, or violations detected, with a conformance
checker; often these are written in a special-purpose
language such as that of Schematron [Lubell, 2009]
and those Schematron tests in turn can be tested using frameworks such as
XSpec [Lizzi, 2017].
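As an illustration of how such a prose constraint can be checked by software, the following is a minimal sketch in Python rather than Schematron. The function names are hypothetical, and the sketch assumes that only standard subtractive Roman numerals from I to MMMCMXCIX are acceptable, which the example constraint does not itself specify.

```python
import re

# Assumption: only standard Roman numerals for 1..3999 are accepted.
ROMAN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text):
    """Return the integer value of a Roman numeral, or None if malformed."""
    if not text or not ROMAN.match(text):
        return None
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = VALUES[ch]
        # A smaller numeral before a larger one is subtracted (IV, XC, ...).
        total += -value if VALUES.get(nxt, 0) > value else value
    return total


def is_prime(n):
    """Trial division; adequate for values below 4000."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def is_mersenne_prime(n):
    """True when n is prime and n + 1 is a power of two."""
    return n is not None and n >= 3 and ((n + 1) & n) == 0 and is_prime(n)


def check_n_attribute(value):
    """Hypothetical conformance check for the 'n' attribute."""
    return is_mersenne_prime(roman_to_int(value))
```

With this sketch, the values III, VII, XXXI and CXXVII pass, while XV fails (15 is one less than a power of two but is not prime) and the Arabic numeral 7 fails because it is not a Roman numeral at all.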
Both the machine-processable part of a vocabulary definition and the
additional human-readable part (often much larger) must change over time:
at the very least, they change from not existing into existing, but almost
always they change through revision and, explicitly and implicitly,
through extension.
For the purpose of this paper, an extension to a
vocabulary is any change to the specification of that vocabulary, whether
in detail, in scope, or otherwise.
Before we can define implicit and
explicit extension, we must consider the wider
social context in which vocabularies are created and used. We can then
characterize the extensions more precisely and go on to suggest ways to
encourage what we will define as beneficial vocabulary
evolution.
The Social Context of Vocabularies
The context in which a vocabulary was first developed and the primary
contexts in which it is subsequently maintained are also the contexts
within which the maintainers will view extensions. Example contexts include:
-
An individual person inventing a vocabulary for their own
use;
-
A group of people working on a project, using a vocabulary
between them but with no wider usage outside the group;
-
An organization that publishes a vocabulary for use with
specific software or for some other specific purpose connected with
the organization;
-
Organizations whose staff work together to produce a shared
vocabulary;
-
International and national standards organizations such as ISO,
NISO, and ANSI; industry consortia such as W3C, WHATWG or OASIS
Open; each of these has produced specifications that define
vocabularies, primarily to standardize on behaviours between
implementations or to invent new solutions to problems.
When a specification for a vocabulary exists primarily for
interoperability between implementations, innovation is strictly limited.
In this case, it is usually clear to the vocabulary designers that each
vendor or implementer will need to extend the vocabulary to add support
for the features that make their implementation a special snowflake.
Equally, it will be clear to them that they must provide some way for
other vendors to process marked-up documents that use those
extensions.
The truth is rarely pure and never simple.
[Wilde, 1895]
An Ontology For Extensions
In order to characterize extensions we need to introduce some
descriptive terminology. The terms introduced in this section are a first
attempt to provide not only phrases but clearly separated concepts in the
area of vocabulary extensions.
Planned-for Extensions
The creators of a vocabulary foresaw a need but not the specifics,
and so provided mechanisms to allow the vocabulary to be
extended.
Grammar Hooks
Some vocabulary designers provide mechanisms for users to extend
the grammar used to validate instances; this can allow subtractions or
entire replacements, or may be restricted to adding extra terms, such
as adding an extra element to an XML content model for a bibliography
entry.
Unchecked Islands
A vocabulary might include a grammar for validation that
incorporates places where names from other vocabularies can be used,
or where validation is disabled. Example mechanisms for this are lax
validation in XML Schema, or extension elements with content models of
ANY in DTD-based validation.
Extension Names
Some vocabularies incorporate a convention that elements starting
with a specific prefix (x-socks) are extensions,
and the creators promise never to define meanings for such names. In
XML, a vocabulary might state that elements in a specific secondary
namespace, or any namespace but the primary one, are extension
elements, or, like XSLT, might allow arbitrary attributes on any
element as extension attributes. There is always
a risk of conflict with future versions of the specification when this
is done, however.
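A receiving program might apply such naming rules roughly as follows. This is only a sketch: the namespace URIs and the x- prefix rule are hypothetical stand-ins for whatever a particular vocabulary actually specifies.

```python
import xml.etree.ElementTree as ET

CORE_NS = "http://example.org/ns/core"   # hypothetical primary namespace
EXTENSION_PREFIX = "x-"                  # hypothetical reserved name prefix


def classify(element):
    """Return 'core' or 'extension' for one parsed element."""
    tag = element.tag
    if tag.startswith("{"):
        # ElementTree stores names as {namespace}local
        namespace, local = tag[1:].split("}", 1)
    else:
        namespace, local = "", tag
    if namespace != CORE_NS or local.startswith(EXTENSION_PREFIX):
        return "extension"
    return "core"


DOC = """\
<doc xmlns="http://example.org/ns/core"
     xmlns:acme="http://example.com/ns/acme">
  <para>Plain text.</para>
  <x-socks colour="red"/>
  <acme:widget/>
</doc>"""

root = ET.fromstring(DOC)
# Map each local name to its classification.
report = {el.tag.split("}")[-1]: classify(el) for el in root.iter()}
```

Here doc and para are classified as core, while x-socks (by prefix) and widget (by namespace) are classified as extensions. The rule is applied purely by name, which is why a later version of the vocabulary that defines a colliding name would cause a conflict.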
Unanticipated Extensions
The creators of the vocabulary did not foresee the need for
extensions, or not of the kinds that users of the vocabulary wanted or
needed.
Altered Grammar
Sometimes if the creators of a vocabulary did not supply a
mechanism to add or change names, people copy the grammar definition
and edit it in a text editor. The resulting vocabulary might in open
source terms be called a hostile fork. Documents
using this changed grammar might not work properly with tools for the
original vocabulary.
Usage Conventions
Users might assign their own meanings to vocabulary terms in
specific contexts. This is a very common way to extend any language.
For example, one might say that the HTML cite
element is to contain a footnote reference to a bibliography entry, or
that it contains quoted text but not the name of the quoted author. Or
if an XML vocabulary did not allow links, one might start using a
shoesize element and put a URL into its
USAorEuropean attribute. This is sometimes
(disparagingly) called tag abuse: if an XML
vocabulary, say, does not distinguish between italic for a foreign
phrase and italic for emphasis, and an author marks up a foreign
phrase as emphasis, people using a text-to-speech reader to interact with the
document will be forced to hear a rise in pitch as the foreign phrase
is read out loud. It can be better for vocabulary creators to provide
an italic element with a required
because attribute than to deny the possibility
of unforeseen italicized content, but no-one can anticipate
everything.
Unchecked Usage
Faced with needs not met by a vocabulary, some people give up on
grammars altogether and add terms as they see fit. This is similar to
the hostile fork described above, except that without formal
documentation there can be little hope that any other group will adopt
the extensions.
Hybrid and Absorbed Extensions
Extensions are sometimes adopted back into a vocabulary; in most
cases this is done in such a way that people previously using the
extension have to change their usage to conform, because people making
extensions usually do not share exactly the same constraints and
perspectives as the vocabulary’s creators.
An absorbed extension, then, is one that was
originally an extension but became part of the vocabulary. A
hybrid extension shares characteristics of
planned-for and unanticipated extensions and may also be, or become,
officially absorbed.
Ambiguous Markup
Declarative markup admits the possibility of multiple ways to
process a single document; ambiguous markup goes
one step further and admits the possibility that a term can be
interpreted in more than one way by the reader. An example in XML is the use of the
Chameleon XML Schema Pattern, in which a fragment of a grammar might
be included in multiple language definitions but, because of differing
prologues, have radically different interpretations, for example with
a different default namespace in use.
New Vocabulary Features
A new version of a vocabulary might incorporate new terms that
were previously an extension. The vocabulary itself might be said to
have been extended compared to previous versions, but the new terms or
features are no longer themselves considered an extension.
Usage Conventions Adopted
The creators of a vocabulary may decide that a usage convention is
reasonable and adopt it into their language. This is sometimes
referred to as paving the cowpaths, although
anyone who has lived around cows knows that they don’t always follow
very useful or wise routes. A common example here is languages that
assign special meanings to comments in a particular format, such as
Encapsulated PostScript using %%Page: at the start of a line; regular
comments in that language start with a %, but the convention is that
PostScript comments should not start with %% unless they conform to
the Encapsulated PostScript convention.
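The comment convention can be sketched as a small line classifier. The function name and the sample lines are illustrative only, and the sketch ignores details of the real convention such as continuation lines.

```python
def classify_line(line):
    """Classify one line of a PostScript file by its comment convention."""
    if line.startswith("%%"):
        # Reserved by convention for structured comments such as %%Page:
        keyword = line[2:].partition(":")[0].strip()
        return ("structured", keyword)
    if line.startswith("%"):
        # Any other comment is ignored by the interpreter and by tools.
        return ("comment", None)
    return ("code", None)


sample = [
    "%!PS-Adobe-3.0 EPSF-3.0",
    "%%BoundingBox: 0 0 100 100",
    "% an ordinary comment",
    "%%Page: 1 1",
    "newpath 0 0 moveto",
    "%%EOF",
]
kinds = [classify_line(line)[0] for line in sample]
```

To the PostScript interpreter every line starting with % is just a comment; the distinction between the two kinds exists only for tools that follow the adopted convention.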
Note that usage conventions, in the sense used in this paper, are
not themselves part of the vocabulary.
Internal and Interchange Formats
These are not strictly speaking a type of
extension, but rather a context and
situation: The context is one in which an
organization has needs not met by a vocabulary; the situation is one
where documents produced internally must be shared with other
organizations, and are transformed in some way at the institutional
boundaries so that what is shared is conformant.
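A boundary transformation of this kind might be sketched as follows, assuming (hypothetically) that all internal-only markup lives in a single house namespace. Real systems more often use XSLT for this step; Python is used here only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

INTERNAL_NS = "http://example.com/ns/internal"  # hypothetical house namespace
MARKER = "{" + INTERNAL_NS + "}"


def to_interchange(root):
    """Strip, in place, every element and attribute in INTERNAL_NS."""
    for parent in list(root.iter()):
        for name in [a for a in parent.attrib if a.startswith(MARKER)]:
            del parent.attrib[name]
        previous = None
        for child in list(parent):
            if child.tag.startswith(MARKER):
                # Keep the text that followed the removed element.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


INTERNAL_DOC = (
    '<article xmlns:h="http://example.com/ns/internal" h:reviewer="pat">'
    "<title>Report</title><h:todo>check figures</h:todo>"
    "<para>Final text.</para></article>"
)

shared = ET.tostring(to_interchange(ET.fromstring(INTERNAL_DOC)),
                     encoding="unicode")
```

The shared result contains only the title and para elements; the internal reviewer attribute and todo element never leave the organization, so recipients see a document that conforms to the unextended vocabulary.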
Evaluating Extensions
Extensions can have an effect wider than on a single individual or
organization. Some extensions become widely used, and these may be adopted
by the maintainers of the vocabulary, or they may be seen as
disruptive.
In either case, extensions that are in use in interchange between
organizations necessarily lead to fragmentation: any given tool or tool
chain may or may not be able to process the extension. For example, if one
were to share with someone else an XSLT transformation document that made
use of EXPath extension modules, the recipient would be unable to use the
transformation unless they had an XSLT implementation that supported the
extension. So, there would then be two languages: XSLT with EXPath and
XSLT without EXPath. But if there were three EXPath extension modules, and
implementations could support any combination of them, there would then be
eight different languages (two to the power of three), since the increase is
combinatorial.
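The count can be checked by enumerating every subset of a set of optional modules: each module is either present or absent, so n modules give 2^n combinations, including plain XSLT with none of them. The module names below are illustrative only.

```python
from itertools import combinations

modules = ["binary", "file", "zip"]  # illustrative module names

# Every possible choice of supported modules, from none to all three.
dialects = [
    set(chosen)
    for size in range(len(modules) + 1)
    for chosen in combinations(modules, size)
]
count = len(dialects)
```

For three modules the enumeration yields eight distinct dialects, and a fourth optional module would double that to sixteen.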
An extension, then, reduces interoperability. But when an extension is
widely implemented, it generally increases the scope, or applicability, of
the vocabulary, and gives an overall benefit. This would be a
beneficial extension.
An extension sometimes is created by people who are not well-connected
to the user community, or who have very different views from the majority
of the people creating the original vocabulary. Or, sometimes, one or more
of the original creators has a change of heart in some way. The extension
might violate what users perceive to be underlying principles, or might
feel out of place. For example, consider an extension to a declarative
content-oriented XML interchange vocabulary that introduces procedural
commands such as Switch to a larger type size until otherwise notified.
Such an extension changes the way that people think about the vocabulary,
and even though it may increase applicability, it can cause damage. The
extension doesn't fit in well, is harder to learn, and users become
confused by a new lack of orthogonality. So, this would be an example of a
harmful extension.
It should be admitted that there is no easy and clear-cut way to
determine whether an extension is beneficial or harmful. Sometimes it is
only apparent after several years. Sometimes the damage of an extension is
that it precluded a better solution being adopted.
Vocabulary Life Cycle: the Birth of an Extension
Vocabularies are created, born, grow, live and flourish, or wither and
are forgotten, but they rarely die. It is very difficult to withdraw
features from vocabularies once those features are in widespread use.
Furthermore, the slightest change to the specification may mean new
documents do not work correctly in existing implementations. Features can
be marked as deprecated, but both users and
implementers will have to deal with documents containing such deprecated
markup.
Some common reasons for extensions include:
-
As people started to use a vocabulary they found they needed (or
wanted) it to handle more cases than it already did: the vocabulary
grows;
-
The requirements changed, or priorities changed, and with it the
focus of usage. For example, rotary-dial telephones are no longer
ubiquitous, and a shared way to describe telephones for retailers to
choose items to stock no longer needs to mention the speed of the dial
return, but should probably mention whether a 3.5mm headphone socket
is provided. It might be that the vocabulary does not need to grow,
but rather that there will be increased detail in some areas and
perhaps reduced detail in others.
-
Someone involved in the vocabulary came up with an idea:
“this specification is fabulous; we could use it for
selling wedding cakes if only it had…”; or, “it’s
really useful that you can work with numbers and decimals but what
about fractions?” So an extension can be
systemic: adding fractions to every numeric
value, for example, or it might be modular,
offering a self-contained new facility such as the ability to
manipulate Zip archives in EXPath.
-
The way the vocabulary is used has varied over time. For example,
before the availability of CSS, the HTML blockquote
element was often used for indented text, regardless of the reason for
the indenting. Similarly, people using vocabularies with markup for
italics as rhetorical or grammatical emphasis but
without plain italics may find themselves marking up book titles or
phrases in foreign languages as emphasized rather than merely
differentiated. This is sometimes called tag abuse, but it is really a
symptom of needs not being met, and can healthily evolve into the
first case listed above.
-
Often people use built-in extension mechanisms, or invent their
own mechanisms, to remain within the broad communion, or user base, of
a particular vocabulary while supporting their own workflows.
Sometimes one sees Web pages that use a custom DTD, for example, or
DocBook articles with custom elements: the additional markup is
usually not intended to be public in these cases, but rather is a
symptom of a private extension.
-
Very occasionally, two or more vocabularies merge, or one subsumes
another. The individual vocabularies may continue to be maintained
separately, as with HTML 5 incorporating MathML and SVG. The original
vocabulary appears to grow in size and complexity but, since the most
common case of this is to absorb widely-used extensions, there may be
no increase in practice.
As with any change to a specification, whether explicit or implied,
changes can originate internally, from the people maintaining the
specification of the vocabulary, or can originate from sources external to
that group, as the next two sections describe.
Committee Proposals
Very often a new feature starts out as a proposal from someone
already participating in whichever group or committee maintains a
particular vocabulary. Such an extension may go into a future version of
the vocabulary, in which case people using that next version do not
generally consider it to be an extension. Sometimes the committee will
reject the proposal, and in that case it may later become part of some
third-party extension. Eventually it may return to become part of the
main specification, as SVG did with HTML 5.
The important thing about proposals from within the committee is
that because they are very often developed in a context of what Applen
and McDaniel refer to as tacit knowledge
[Applen & McDaniel, 2009], they tend to fit in well with the overall
design of the vocabulary or specification in question.
Community Proposals
Sometimes people who are on the periphery of a committee, whether
outside but following closely or inside but not part of the cognoscenti
or not well respected, will come up with a proposal; at other times the
proposal comes from committee members but falls outside the scope of the
committee work, or is of a nature that means the details could not
easily be agreed upon within the group.
At other times, a user or implementer group outside the original or
main committee decides to extend the specification. This can happen
through dissatisfaction with the main group (as, for example, with HTML
5 and the WHATWG) or a need for something faster than full consensus
allows, or sometimes simply because the outsiders did not understand
that they could have been more closely involved.
Forks
Strictly speaking a fork can happen from within
a committee or from the outside, or even a combination of both. The term
comes from open source programming: a fork of a
piece of code (whether a complete application or just a single library)
comes when someone copies the original, changes it, and starts
redistributing their changed version. This can be for several reasons:
the original maintainer might have wandered off, leaving the work
orphaned; the maintainer might have refused to make changes someone
wanted; sometimes the original maintainer passes on the flag to someone
else, or agrees there will be two versions with two different areas of
focus. Thus a fork can be amicable or can be hostile.
In the world of markup vocabularies and specifications a fork most
often happens when the original standards committee doesn’t recognise a
particular need as valid (rightly or wrongly). It can also happen if a
group needs a smaller subset of a specification, as happened with the
Mallard subset of DocBook for the GNOME project. Usually in the case
that the new fork is not intended to replace or supplant the original
specification there is no need for hostility: Mallard was made for use
by a specific community, for example.
Merging
Sometimes two specifications merge into a single larger one; usually the resulting vocabulary is the union of the
original specifications before the merge, often with some additions
since if one is revising a vocabulary it can be hard to argue with
people who want to add to it.
A merge can be done to include one vocabulary inside another, such
as HTML 5 incorporating MathML and SVG, rather in the manner of a shark
eating a jellyfish. The result can remain separate specifications or can
become one larger one. Another reason for a merger is when there are
variants of the original specification in use and incorporating the
variations seems best for everyone.
After the Work Ends
Sometimes a maintainer wanders off, loses interest, loses the
ability to continue the work, or even dies. An organization can be taken
over (such as Sun Microsystems by Oracle), or can cancel a project (such
as Oracle canceling Solaris). Sometimes the specification may remain
frozen, and may even become difficult or impossible to obtain. But if
the vocabulary is in widespread use then new needs will emerge, and a
new group will probably carry the torch forward.
Sometimes a committee will mark a particular vocabulary or version
as deprecated. On other occasions a specification
may be actively withdrawn by its publisher, for
example for legal reasons. And of course at times a specification is
perfect: the committee can be disbanded because the work is
finished.
Characterizing Extensions
The terms defined in this section are intended to be of use in
describing extensions to vocabularies and will be used in the rest of the
paper to characterize specific extensions and extension mechanisms.
Functional Extensions: New Behaviour
A behavioural extension is one that changes the
behaviour, or enables such changes, in software processing marked-up
documents.
In HTML, for example, code running in the client (that is, in a Web
browser) makes use of the extensions: for example, a browser might
interpret rel=toc to provide a toolbar button or
keyboard shortcut to access a table of contents entirely outside the
containing document. Markup to extend behaviour is often very specific
to the behaviour, however: early HTML examples included
blink and marquee elements
whose purpose was to affect the display of contents rather than to
indicate meaning.
A vocabulary or a system using a vocabulary might also be extended
by mixing in or adding entirely different languages. For example, one
might include OpenGraph or Schema.org features in an XML vocabulary, and
these might be expressed in a JSON-LD syntax inside XML elements or
attributes. Or, a system might be extended by supporting scripting, and
this might become visible inside documents. Such extensions can be
pernicious, tying documents down to use by specific software in specific
contexts and limiting reuse.
Semantic Coverage: New Meanings
In the semantic coverage case, the purpose of
the extension is to represent information in documents. For example, in
HTML, one might use a span element with a
class attribute value of
place to mark up places mentioned in a document.
Although this information might then enable new functionality, such as
connecting the prose to a map view, the markup is not tied to any
particular behaviour.
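One possible consumer of such markup is sketched below: it collects the text of every span whose class attribute includes place. The class name comes from the example above; the collector class and the sample sentence are hypothetical.

```python
from html.parser import HTMLParser


class PlaceCollector(HTMLParser):
    """Collect the text of every <span class="place"> element."""

    def __init__(self):
        super().__init__()
        self.places = []
        self._depth = 0      # > 0 while inside a place span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested spans so the matching end tag can be found.
        if self._depth or "place" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.places.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


HTML = (
    '<p>She left <span class="place">Toronto</span> for '
    '<span class="place city">Washington, DC</span> in '
    '<span class="date">July</span>.</p>'
)

collector = PlaceCollector()
collector.feed(HTML)
collector.close()
places = collector.places
```

The same markup could equally feed a map view, an index of places, or nothing at all: a browser that knows nothing about the place convention still displays the text unchanged.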
Implicit Extensions
Sometimes when we use a specification and share our documents or
data, we do not realize that we have created an extension. For example,
people marking up HTML documents might find they have used common
class attribute values, or people might take a
specification like SCXML and use it for subject domains that the Working
Group that developed it never envisioned, and to which the prose in the
specification applies at best poorly. Over time the result can be to
broaden the scope of the original language.
An implicit extension, then, is one where the
fact that a vocabulary has been extended is not necessarily obvious to
an observer.
Explicit Extensions
Many specifications provide methods for extension, some of which are
covered later in this paper. In most cases the fact that something is an
extension is made explicitly visible: for example, by the use of an XML
namespace, or by including a module or library.
An explicit extension, then, is one whose use
(and not just whose description or specification) makes it clear both to
software and to any human working with the vocabulary that an extension
has been used.
Usage Conventions
Sometimes it seems easier to decide on a particular way of using a
vocabulary than extending it. With a vocabulary that does not include
section titles, one might decide to use the first paragraph of each
section as a title, even if formatting software does not embolden it.
Strictly speaking a usage convention is not an extension to a
vocabulary, but it extends the scope of the vocabulary without
introducing any new terms or markup.
Methods of Extension
We have considered some of the contexts in which vocabularies are
commonly extended. We are now in a position to consider the methods by
which they are extended in those various contexts and to understand the
reasons for the technical design choices.
Some vocabularies provide explicit extension methods in which users
can add new elements or attributes, or can change what is allowed at any
given point, using extension points built in to the various schema
languages used to describe those vocabularies. For example, a DTD might
provide parameter entities included in each content model, so that a
document can override the definition of one of the parameter entities to
add a new element to the corresponding content model. This permits
extended documents to be validated, but software processing the documents
will still need to be modified appropriately to understand the new
markup.
Adding New Elements
One of the most obvious ways to extend an XML vocabulary, or one in
any similar language, is to add new terms to the vocabulary. We might,
for example, decide that the title element of an HTML
document is insufficient for our purposes because it does not allow
nested elements within it; instead, we add a pagetitle
element that’s richer.
One obvious problem with this is that existing software doesn’t
understand it. The second is that we still need a title
element in each document for existing software to use, so now there is
duplicated information. The new element reduces interoperability of
documents, but if the usage is confined to a well-defined group then
this is not a problem.
The use of XML namespaces is the most common way to identify
extension elements. Namespaces are a fragile mechanism, often failing
silently if there’s a typo in the namespace name (the URL) and in some
implementations even failing if a document uses a different prefix.
However, the fragility seems more than compensated for by avoiding
conflicts, where two groups add elements of the same name. One of the
original use cases for XML namespaces was to allow the mixing of
vocabularies in this way.
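The silent failure mode can be demonstrated with a short sketch. The namespace name and element name are hypothetical; the behaviour shown is that of Python's standard ElementTree parser, which matches on the namespace name and ignores the prefix.

```python
import xml.etree.ElementTree as ET

EXT_NS = "http://example.org/ns/ext"  # hypothetical extension namespace


def count_notes(xml_text):
    """Count extension 'note' elements, matching on the namespace name."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//{%s}note" % EXT_NS))


same_prefix = '<doc xmlns:ext="http://example.org/ns/ext"><ext:note/></doc>'
other_prefix = '<doc xmlns:e="http://example.org/ns/ext"><e:note/></doc>'
# The last two letters of the namespace name are transposed: "etx".
typo_in_name = '<doc xmlns:ext="http://example.org/ns/etx"><ext:note/></doc>'

results = [count_notes(d) for d in (same_prefix, other_prefix, typo_in_name)]
```

The first two documents each yield one note, since a conforming processor does not care which prefix is used. The third yields none and raises no error: the document is well formed, and the mistyped namespace simply names a different vocabulary.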
In all cases, anyone adding an element of their own to documents
that otherwise conform to someone else’s vocabulary needs to ensure that
receiving software that does not understand the new element will behave
sensibly: this is known as fallback. For example, a
Web browser receiving a document containing a dblookup
element will (in the absence of scripting) simply display the contents
of the element. The document author therefore needs to include
appropriate fallback contents. The design of extensibility in HTML
places burdens on document authors.
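The browser-style fallback described here, in which an unrecognized element is treated as transparent and its contents are shown, might be sketched like this. The renderer and its plain-text output conventions are hypothetical; dblookup is the extension element from the example above.

```python
import xml.etree.ElementTree as ET


def render(element):
    """Render to plain text; unknown elements fall back to their contents."""
    inner = (element.text or "") + "".join(
        render(child) + (child.tail or "") for child in element
    )
    if element.tag == "em":
        return "*" + inner + "*"   # the one element given special treatment
    return inner                   # everything else, known or not


DOC = (
    '<p>Stock level: <dblookup key="socks">see the '
    "<em>warehouse</em> list</dblookup>.</p>"
)

text = render(ET.fromstring(DOC))
```

Because the author supplied sensible contents inside dblookup, the fallback rendering still reads as a complete sentence. Had the element been left empty, the same renderer would have produced "Stock level: ." with no indication that anything was missing.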
Adding New Attributes
Very often, software processing a vocabulary will ignore attributes
that are not recognized. Schemas need to be modified, but that’s true of
any change. XML extension attributes can be associated with an XML
namespace to avoid conflicts, as with elements.
A benefit of using attributes for extensions is that they tend to be
less disruptive than elements. On the other hand they are restricted to
simple string content and cannot be marked for language or text
direction (e.g. RTL). Attributes are therefore not in general suitable
for human-readable content: for example, you can’t easily have Taiwanese
alt text for an SVG image in a Chinese HTML page, and
this matters because Unicode code points are shared between those
languages, so that language marking is needed for the text to be
readable. In addition, there can only be one attribute of a given name
on any particular element, limiting some sorts of extensibility.
The HTML 5 specification reserves attributes whose names begin with
data- to be extension attributes, but this naming
convention is not always acceptable to other groups extending
HTML.
Adding New Content
Additional content is not usually considered to
be an extension, since it does not affect the vocabulary itself. But
consider including multiple translations of each paragraph of a
document, one after the other; a usage convention
might be used to say that the first paragraph is Romanian and the second
the Italian translation, marked as Italian with
xml:lang but not otherwise as a
translation.
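The limits of such a convention show up when software tries to use it. In the sketch below, which assumes a hypothetical document with p elements and an xml:lang inherited only from the root element, a program can group paragraphs by language but has no markup telling it that one paragraph translates another.

```python
import xml.etree.ElementTree as ET

# ElementTree's name for the predeclared xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = (
    '<doc xml:lang="ro">'
    "<p>Bună ziua.</p>"
    '<p xml:lang="it">Buongiorno.</p>'
    "</doc>"
)


def paragraphs_by_language(root):
    """Group paragraph text by language (simplified: inherits from root only)."""
    default = root.get(XML_LANG, "")
    result = {}
    for para in root.iter("p"):
        lang = para.get(XML_LANG, default)
        result.setdefault(lang, []).append(para.text)
    return result


grouped = paragraphs_by_language(ET.fromstring(DOC))
```

The pairing of the Romanian paragraph with its Italian translation exists only in the usage convention (first the original, then the translation); it would be lost if paragraphs were reordered by a tool unaware of the convention.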
Adding New Values
New attribute values or element contents make an obvious way to
extend many specifications. Attributes with names like role
seem good candidates. It can be difficult to avoid collisions here,
however, and there can be problems with fallback.
Examples in HTML include adding new meta or
rel values, using data:*
attribute values instead of linking to external resources, or using
non-standard ARIA role attribute values. Note that this is different
from extending HTML using Custom Elements or new
class attribute values, because those are
intended to be used for customization.
Subtractions
It may seem odd to consider removing part of a
vocabulary as an extension. Such a change, however, can greatly
facilitate implementation and can also help with authoring (by reducing
choices). A diminished version of a vocabulary is sometimes known as a
subset and sometimes as a
profile, depending mostly on whether the speaker
approves of it or not. Subsets (or profiles) can reduce
interoperability, because an implementation might support one dialect
and not another. They are therefore most suited for well-targeted use
cases and communities.
One well-known example of a profile, or subset, is XML: every
well-formed and DTD-valid XML document is also a valid SGML document.
Admittedly this took a change to SGML to achieve, but the change (or
rather, set of changes) was not unreasonable. Although XML was
originally made by a group who did not think they could get the SGML
committee to make changes in a timely fashion, if at all, in the end the
committee turned out to be generally (overall) amenable to changes, and
the design of XML could have been somewhat simplified had this been
anticipated.
Combining Vocabularies: Xreole
Merging specifications to make a superset has already been discussed
above. Another possibility when merging is to pick and choose, resulting
in what is perhaps best considered to be an entirely new markup
vocabulary, a sort of XML Creole, that was influenced by its ancestors
but is not compatible with them.
This may sound like the province of Igor in the basement, but can have
the advantage of reduced training costs and sometimes even reduced
tooling costs. Consider a vocabulary that uses DocBook element names for
structure, HTML names for paragraphs and below, and DITA-style assembly
from fragments. We could call it DitaWebBook. The HTML names for italic
and bold, the accessibility attributes, and the p element all
add a (perhaps false and misleading) sense of familiarity. Authors may
then be surprised when MathML or SVG or JavaScript are not
supported.
Adapting Existing Markup
When you don’t have an element to mark up a foreign phrase that’s to
be italicized, and there’s no element for meaningless (semantically
unweighted) italics, what’s an author to do except look for some other
element that displays in italics? Emphasis, perhaps, resulting in
documents in which text-to-speech software reads out foreign phrases (or
book titles, perhaps) in a louder or higher-pitched voice as if they
were really important.
More pernicious than poorly accessible italics, however, are values
that are interpreted by software: “our vocabulary didn’t have an
element for postcode, so we used email address, because we aren’t
allowed to store those.”
This is payback for vocabulary
designers who did not allow for extensibility. The three-level postal
address that doesn’t work in other countries; the telephone number field
that doesn’t allow for an office extension number. Every example
represents a design failure.
Adapting markup is sometimes derogatorily called tag abuse, although
it can also be a form of usage convention.
Scripting
It is often tempting to make a system user-extensible by
incorporating a scripting language. The result, as suggested above, can
be that documents become tied to a particular system used in a
particular configuration, because they contain fragments of programs or
hooks for extension scripts to use.
An example is HTML Custom Elements, where the language is extended
not by editing the grammar in some way, but through a JavaScript API
which itself is subject to change.
Inhibiting Factors
Some vocabularies and languages have designs that make it harder to
evolve them over time.
HTML has always defined that an unknown element in the document body
should be rendered as if its tags were missing, which allows for
experimental elements to be added easily. Unfortunately there was also a
decision that the first unknown element would end the head, which
considerably complicated adding new metadata and which the IETF HTML
Working Group later regretted.
But inhibiting factors can come from other directions. For example,
the technique known as literate programming, in which
a program is intertwined with extensive documentation, can discourage many
programmers from making changes, especially if they are not comfortable
with writing prose. Or, they may make changes to the code but not update
the prose, which to them may be a harder task.
Literate programming is an extreme example, but any extension can make
existing documentation obsolete, because you wouldn’t do it that
way any more.
There can also be legislative inhibitors, for example if a specific
version of a vocabulary is required, and implementation inhibitors, for
example if a particular language version is very widely implemented, as
with XSLT 1. Infrastructure inhibitors can be very difficult to
surmount.
Sometimes incompatible changes in a new version of a vocabulary can
discourage or even prevent adoption; this was the case with XML 1.1, where
in some (admittedly obscure) cases existing documents could have their
meaning changed, and where existing XML processors were required to reject
XML 1.1 documents.
Encouraging Benevolent Extensions
There are a number of techniques that have emerged through experience
as ways to encourage extensions that improve an ecosystem. Even though the
combinatorial bifurcation problem is always present with extensions, the
techniques either mitigate this problem or give benefits that outweigh
it.
Version Numbering
If an XML vocabulary includes its version number in its namespace,
any change to the version number will generally break all processing
tool chains. This is appropriate if it would always be an error for
version N software to attempt to process version N + 1 input, but more
often there are compatible changes, or new features added to the
vocabulary such that every version N+1 document that does
not use the new features is also a conforming
version N document. This can be managed by separating the namespace (if
used) from a version attribute on the top-level
element, as is done by XSLT and DocBook 5.
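The separation described above can be sketched in a few lines of Python. This is only an illustration: the namespace URI, the recipe element and the read_version function are all hypothetical and do not come from any real vocabulary. The point is that the namespace identifies the vocabulary and stays fixed, while the version travels in an attribute on the top-level element.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace: it names the vocabulary, not one version of it.
RECIPE_NS = "http://example.org/ns/recipe"

def read_version(xml_text):
    """Return the version attribute of the top-level element, or None
    if the document is not in the expected namespace."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}recipe" % RECIPE_NS:
        return None
    return root.get("version")

# Two versions of the (invented) vocabulary share one namespace.
doc_v1 = ('<recipe xmlns="http://example.org/ns/recipe" version="1.0">'
          '<title>Soup</title></recipe>')
doc_v2 = ('<recipe xmlns="http://example.org/ns/recipe" version="1.1">'
          '<title>Soup</title><serves>4</serves></recipe>')

print(read_version(doc_v1))
print(read_version(doc_v2))
```

Because both documents use the same namespace, a tool chain written for version 1.0 still recognizes the 1.1 document and can decide for itself what to do about the version it finds.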
It also helps to use a version number scheme that says that minor
revisions are compatible in the way mentioned above; this is often done
using a decimal point in the version number, so that a processor for
version 3.2 of a vocabulary can process input marked as 3.* (where *
represents any number, such as 3.9), but would report an error if given
a version 4 document, where the first part of the number, before the
dot, was higher than the processor understood. The 4 here is called a
major revision number and the part after the dot
(or the entire number) a minor revision
number.
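A minimal sketch of such a compatibility rule follows. The function names are invented for illustration. The sketch also assumes that a document with a lower major revision number is rejected; the rule described above only requires rejecting a higher one, so a real processor might choose differently.

```python
def parse_version(text):
    """Split 'major.minor' into two integers; a missing minor part counts as 0."""
    major, _, minor = text.partition(".")
    return int(major), int(minor or 0)

def can_process(processor_version, document_version):
    """Accept a document only when its major revision number matches
    that of the processor; the minor revision number is ignored.
    (Rejecting lower major numbers is an assumption of this sketch.)"""
    processor_major, _ = parse_version(processor_version)
    document_major, _ = parse_version(document_version)
    return document_major == processor_major

print(can_process("3.2", "3.9"))  # same major revision
print(can_process("3.2", "4"))    # higher major revision
```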
Allowing Mixed Namespaces
Allowing foreign, or secondary, namespaces can help demarcate
extensions from the primary vocabulary, and can make sure there are no
conflicts. For example, both DocBook and SVG have
title elements, but DocBook documents that use
SVG elements associate them with the SVG namespace, so there is no
conflict. However, the DocBook 5 specification indicates where SVG
elements are allowed to appear.
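The title example can be illustrated with a short Python sketch. The sample document is invented for the purpose, and the placement of the SVG fragment inside imagedata is meant only as a plausible arrangement, not as a statement of what the DocBook schema requires. Because ElementTree keys element names by namespace, the two kinds of title never collide.

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"
SVG_NS = "http://www.w3.org/2000/svg"

# An invented DocBook 5 article containing an SVG island.
doc = """<article xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Drawing Circles</title>
  <mediaobject>
    <imageobject>
      <imagedata>
        <svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
          <title>A circle</title>
          <circle cx="5" cy="5" r="4"/>
        </svg>
      </imagedata>
    </imageobject>
  </mediaobject>
</article>"""

root = ET.fromstring(doc)

# The local name is the same, but the namespace tells the two apart.
docbook_titles = [e.text for e in root.iter("{%s}title" % DOCBOOK_NS)]
svg_titles = [e.text for e in root.iter("{%s}title" % SVG_NS)]

print(docbook_titles)
print(svg_titles)
```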
Fallback
One of the places where CSS design has improved upon HTML design is
the notion of fallback; that is, in considering
what an implementation will do if it encounters CSS it does not
understand, and making sure the base language is designed so that a
sensible fallback is always possible, meaning that the document should
always be readable even if some features (such as coloured borders, for
example) are not rendered.
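The behaviour described above, in which declarations that are not understood are skipped and the rest still apply, can be imitated with a toy Python sketch. The property list, the function name and the unknown property in the sample are all made up, and splitting on semicolons is far cruder than a real CSS parser.

```python
# Properties this toy renderer understands; a real list would be much longer.
KNOWN_PROPERTIES = {"color", "font-style", "margin"}

def apply_declarations(css_block):
    """Return the declarations a renderer would apply, silently
    skipping any whose property name it does not recognise."""
    applied = {}
    for declaration in css_block.split(";"):
        name, separator, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if separator and value and name in KNOWN_PROPERTIES:
            applied[name] = value
    return applied

# "border-colour-gradient" is an invented property the renderer does not know.
style = "color: black; border-colour-gradient: rainbow; font-style: italic"
print(apply_declarations(style))
```

The unknown declaration is dropped without disturbing its neighbours, so the text remains readable even though one decoration is lost.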
Constraints such as those that CSS fallback places on designers of
language extensions can be very helpful to user communities.
Extension Attributes and Namespaces
The HTML 5 specification allows any number of attributes whose names
start with data- to appear on any element as an
extension. In an XML environment one might supply a specific extension
namespace, or one might say, as XSLT says, that attributes in any
namespace other than that of XSLT are extension attributes. The goal is
to make sure there can never be conflict between extensions and the
original vocabulary as it grows and changes over time. Requiring people
writing extensions to use their own namespaces means that any two
different extensions will not conflict either.
The same techniques can be used with any names, including
elements.
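As a rough illustration of the namespace rule, the following Python sketch sorts the attributes of an element into those belonging to the main vocabulary and those that are extensions. The acme namespace and the split_attributes function are hypothetical. Attributes in no namespace are treated here as belonging to the vocabulary, which is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

XSLT_NS = "http://www.w3.org/1999/XSL/Transform"

def split_attributes(element, vocabulary_ns):
    """Separate an element's attributes into core ones and extensions.

    An attribute counts as an extension when it is in a namespace
    other than vocabulary_ns; unqualified attributes count as core.
    """
    core, extensions = {}, {}
    for name, value in element.attrib.items():
        # ElementTree writes qualified names as "{namespace}local".
        namespace = name[1:].split("}", 1)[0] if name.startswith("{") else None
        if namespace is None or namespace == vocabulary_ns:
            core[name] = value
        else:
            extensions[name] = value
    return core, extensions

# The acme namespace and its profile attribute are invented.
xml_text = ('<xsl:template xmlns:xsl="http://www.w3.org/1999/XSL/Transform" '
            'xmlns:acme="http://example.org/ns/acme" '
            'match="para" acme:profile="fast"/>')

core, extensions = split_attributes(ET.fromstring(xml_text), XSLT_NS)
print(core)
print(extensions)
```

A processor that does not know the acme namespace can ignore everything in the second dictionary and still handle the element correctly.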
It should be noted that a large proliferation of XML namespaces can
cause problems with implementations; there have been XSLT engines, for
example, with limits of 256 namespaces per element, or even per
document. In addition, users can find it confusing to remember which
namespace to use. A possible strategy is to stick to one for the main
vocabulary, and one for each organization making extensions, rather than
one per extension.
Communication
The single most important factor in writing a successful language
extension is to be in communication with both the original language
maintainers and the primary user community. Therefore, a wise vocabulary
designer will provide a place for people to get in touch at an early
stage both with the developers of the vocabulary and with users.
Conclusions
There are many ways to extend vocabularies, only a few of which were
covered in this paper. When vocabularies are not created with
extensibility in mind, a fist punched through the wall makes a new window
but it is not always pretty. Therefore, a combination of anticipation and
feedback from users is to be recommended. Fallback must always be
considered, along with accessibility and internationalization.
References
[ISO 8879:1986 SGML] ISO/IEC, Information processing — Text and office
systems — Standard Generalized Markup Language
(SGML).
[Applen & McDaniel 2009] Applen, J.D. and McDaniel, Rudy, The Rhetorical Nature of
XML, Routledge, 2009.
[Lizzi, 2017] Lizzi, Vincent M.,
Testing Schematron using XSpec.
Presented at
Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017.
In Proceedings of Balisage: The Markup Conference
2017. Balisage Series on Markup Technologies, vol. 19 (2017).
doi:https://doi.org/10.4242/BalisageVol19.Lizzi01.
[Lubell, 2009] Lubell,
Joshua, Documenting and Implementing Guidelines with
Schematron.
Presented at Balisage: The Markup Conference
2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009.
Balisage Series on Markup Technologies, vol. 3 (2009).
doi:https://doi.org/10.4242/BalisageVol3.Lubell01.
[Wilde, 1895] Wilde, Oscar,
The Importance of Being Earnest, A Trivial Comedy for Serious
People. First performed at St James’s Theatre in London in
1895 and in 1899 published from exile in Paris by Leonard
Smithers.