How to cite this paper

Quin, Liam. “Extending Vocabularies: The Rack and the Weeds: Social Context and Technical Consequence.” Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). https://doi.org/10.4242/BalisageVol23.Quin01.

Balisage: The Markup Conference 2019
July 30 - August 2, 2019

Balisage Paper: Extending Vocabularies: The Rack and the Weeds

Social Context and Technical Consequence

Liam Quin

Visionary

Delightful Computing

`<liam@fromoldbooks.org>`

Liam Quin runs an information design company, Delightful Computing, and previously was XML Activity Lead at the World Wide Web Consortium; before that they were involved in the creation of XML itself and in SGML, most notably at SoftQuad Inc. in Toronto. Their background is in digital typography, text processing and computer science.

Abstract

In its simplest form a vocabulary is simply a set of words and phrases with predefined meanings. In this paper the term is used to mean a controlled vocabulary and, in particular, a controlled vocabulary in the context of computer markup languages such as XML or JSON or SGML.

Vocabularies are created in specific contexts and for specific purposes. Like all human constructs they are flawed and need to be repaired and changed over time; as people use vocabularies they also gain understanding of the limitations in them and often want to extend them. Understanding these processes involves an understanding of the human needs involved: the social contexts in which people interact with and around the vocabularies. This paper characterizes some of these contexts and their properties, and in the light of this characterization describes changes to vocabularies, both successful and unsuccessful.

Introduction

The Social Context of Vocabularies

An Ontology For Extensions

Planned-for Extensions

Grammar Hooks
Unchecked Islands
Extension Names

Unanticipated Extensions

Altered Grammar
Usage Conventions
Unchecked Usage

Hybrid and Absorbed Extensions

Ambiguous Markup
New Vocabulary Features
Usage Conventions Adopted
Internal and Interchange Formats

Evaluating Extensions

Vocabulary Life Cycle: the Birth of an Extension

Committee Proposals
Community Proposals
Forks
Merging
After the Work Ends

Characterizing Extensions

Functional Extensions: New Behaviour
Semantic Coverage: New Meanings
Implicit Extensions
Explicit Extensions
Usage Conventions

Methods of Extension

Adding New Elements
Adding New Attributes
Adding New Content
Adding New Values
Subtractions
Combining Vocabularies: Xreole
Adapting Existing Markup
Scripting

Inhibiting Factors

Encouraging Benevolent Extensions

Version Numbering
Allowing Mixed Namespaces
Fallback
Extension Attributes and Namespaces
Communication

Conclusions

Introduction

The SGML standard defines the following term:

4.279 SGML application: Rules that apply SGML to a text processing application. An SGML application includes a formal specification of the markup constructs used in the application, expressed in SGML. It can also include a non-SGML definition of semantics, application conventions, and/or processing.

— ISO 8879:1986 SGML

The SGML standard attempted to give a formal definition for what today might be called a markup vocabulary. When XML made the explicit document type declaration optional and provided other ways to share computer-processable specifications, such as XML Schema Documents, the term Document Type Definition, or DTD, gradually gave way to the broader and more informal term vocabulary.

The non-SGML part of an SGML application, as with vocabularies in other systems such as XML, HTML or JSON, can include natural-language prose that adds constraints not easily expressed in markup: for example, “the n attribute shall be a Mersenne Prime Number expressed in Roman numerals”. Such constraints can sometimes be enforced, or violations detected, with a conformance checker; often these are written in a special-purpose language such as that of Schematron [Lubell, 2009], and those Schematron tests in turn can be tested using frameworks such as XSpec [Lizzi, 2017].
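As an illustration of a constraint that lives in prose rather than in the grammar, the following is a minimal sketch of a stand-alone checker for the example above, written in Python rather than Schematron. The function names are hypothetical, and the decision to reject non-canonical numerals such as IIIIIII is an assumption made for this sketch.

```python
# Hypothetical checker for the prose constraint on the n attribute.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
ROMAN_PAIRS = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
               (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
               (5, "V"), (4, "IV"), (1, "I")]

def roman_to_int(text):
    """Sum a Roman numeral, subtracting a symbol that precedes a larger one."""
    total = 0
    for index, symbol in enumerate(text):
        value = ROMAN_VALUES[symbol]  # KeyError for a non-Roman character
        following = ROMAN_VALUES.get(text[index + 1:index + 2], 0)
        total += -value if following > value else value
    return total

def int_to_roman(number):
    """Produce the canonical Roman numeral for a positive integer."""
    parts = []
    for value, symbols in ROMAN_PAIRS:
        while number >= value:
            parts.append(symbols)
            number -= value
    return "".join(parts)

def is_prime(number):
    if number < 2:
        return False
    divisor = 2
    while divisor * divisor <= number:
        if number % divisor == 0:
            return False
        divisor += 1
    return True

def is_mersenne_prime(number):
    # A Mersenne prime has the form 2**p - 1, so number + 1 is a power of two.
    successor = number + 1
    return number >= 3 and (successor & (successor - 1)) == 0 and is_prime(number)

def n_attribute_is_valid(value):
    try:
        number = roman_to_int(value)
    except KeyError:
        return False
    # Round-tripping rejects the empty string and non-canonical spellings.
    return bool(value) and int_to_roman(number) == value and is_mersenne_prime(number)
```

Under these assumptions, n_attribute_is_valid("XXXI") accepts 31, while "XV" (15, not prime) and "7" (not Roman) are rejected.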

Both the machine-processable part of a vocabulary definition and the additional human-readable part (often much larger) must change over time: at the very least, they change from not existing into existing, but almost always they change through revision and, explicitly and implicitly, through extension.

For the purpose of this paper, an extension to a vocabulary is any change to the specification of that vocabulary, whether in detail, in scope, or otherwise.

Before we can define implicit and explicit extension, we must consider the wider social context in which vocabularies are created and used. We can then characterize the extensions more precisely and go on to suggest ways to encourage what we will define as beneficial vocabulary evolution.

The Social Context of Vocabularies

The context in which a vocabulary was first developed and the primary contexts in which it is subsequently maintained are also the contexts within which the maintainers will view extensions. Example contexts include:

An individual person inventing a vocabulary for their own use;
A group of people working on a project, using a vocabulary between them but with no wider usage outside the group;
An organization that publishes a vocabulary for use with specific software or for some other specific purpose connected with the organization;
Organizations whose staff work together to produce a shared vocabulary;
International and national standards organizations such as ISO, NISO, and ANSI, and industry consortia such as W3C, WHATWG or OASIS Open; each of these has produced specifications that define vocabularies, primarily to standardize on behaviours between implementations or to invent new solutions to problems.

When a specification for a vocabulary exists primarily for interoperability between implementations, innovation is strictly limited. In this case, it is usually clear to the vocabulary designers that each vendor or implementer will need to extend the vocabulary to add support for the features that make their implementation a special snowflake. Equally, it will be clear to them that they must provide some way for other vendors to process marked-up documents that use those extensions.

The truth is rarely pure and never simple. [Wilde, 1895]

An Ontology For Extensions

In order to characterize extensions we need to introduce some descriptive terminology. The terms introduced in this section are a first attempt to provide not only phrases but clearly separated concepts in the area of vocabulary extensions.

Planned-for Extensions

The creators of a vocabulary foresaw a need but not the specifics, and so provided mechanisms to allow the vocabulary to be extended.

Grammar Hooks

Some vocabulary designers provide mechanisms for users to extend the grammar used to validate instances; this can allow subtractions or entire replacements, or may be restricted to adding extra terms, such as adding an extra element to an XML content model for a bibliography entry.

Unchecked Islands

A vocabulary might include a grammar for validation that incorporates places where names from other vocabularies can be used, or where validation is disabled. Example mechanisms for this are lax validation in XML Schema, or extension elements with content models of ANY in DTD-based validation.

Extension Names

Some vocabularies incorporate a convention that elements starting with a specific prefix (x-socks) are extensions, and the creators promise never to define meanings for such names. In XML, a vocabulary might state that elements in a specific secondary namespace, or any namespace but the primary one, are extension elements, or, like XSLT, might allow arbitrary attributes on any element as extension attributes. There is always a risk of conflict with future versions of the specification when this is done, however.
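Both naming conventions can be recognized mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the namespace URIs, the element names and the function names are all hypothetical, chosen only to mirror the examples in the text.

```python
import xml.etree.ElementTree as ET

CORE_NS = "http://example.org/ns/core"  # hypothetical primary namespace

def split_name(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

def extension_elements(xml_text):
    """List elements that are extensions under either convention."""
    found = []
    for element in ET.fromstring(xml_text).iter():
        uri, local = split_name(element.tag)
        # Extension if outside the primary namespace, or named with the x- prefix.
        if uri != CORE_NS or local.startswith("x-"):
            found.append((uri, local))
    return found

DOC = """<doc xmlns="http://example.org/ns/core"
     xmlns:v="http://example.org/ns/vendor">
  <para>Plain text</para>
  <x-socks>argyle</x-socks>
  <v:shoesize>43</v:shoesize>
</doc>"""

for uri, local in extension_elements(DOC):
    print(local, "in", uri)
```

Here x-socks is reported because of its name and shoesize because of its namespace; doc and para are treated as part of the core vocabulary.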

Unanticipated Extensions

The creators of the vocabulary did not foresee the need for extensions, or not of the kinds that users of the vocabulary wanted or needed.

Altered Grammar

Sometimes if the creators of a vocabulary did not supply a mechanism to add or change names, people copy the grammar definition and edit it in a text editor. The resulting vocabulary might in open source terms be called a hostile fork. Documents using this changed grammar might not work properly with tools for the original vocabulary.

Usage Conventions

Users might assign their own meanings to vocabulary terms in specific contexts. This is a very common way to extend any language. For example, one might say that the HTML cite element is to contain a footnote reference to a bibliography entry, or that it contains quoted text but not the name of the quoted author. Or if an XML vocabulary did not allow links, one might start using a shoesize element and put a URL into its USAorEuropean attribute. This is sometimes (disparagingly) called tag abuse: if an XML vocabulary, say, does not distinguish between italic for a foreign phrase and italic for emphasis, an author who needs to include a foreign phrase will mark it as emphasis, and people using a text-to-speech reader to interact with the document will be forced to hear a rise in pitch as the foreign phrase is read out loud. It can be better for vocabulary creators to provide an italic element with a required because attribute than to deny the possibility of unforeseen italicized content, but no-one can anticipate everything.

Unchecked Usage

Faced with needs not met by a vocabulary, some people give up on grammars altogether and add terms as they see fit. This is similar to the hostile fork described above, except that without formal documentation there can be little hope that any other group will adopt the extensions.

Hybrid and Absorbed Extensions

Extensions are sometimes adopted back into a vocabulary; in most cases this is done in such a way that people previously using the extension have to change their usage to conform, because people making extensions usually do not share exactly the same constraints and perspectives as the vocabulary’s creators.

An absorbed extension, then, is one that was originally an extension but became part of the vocabulary. A hybrid extension shares characteristics of planned-for and unanticipated extensions and may also be, or become, officially absorbed.

Ambiguous Markup

Declarative markup admits the possibility of multiple ways to process a single document; ambiguous markup goes one step further and admits the possibility that a term can be interpreted by the reader. An example in XML is the use of the Chameleon XML Schema Pattern, in which a fragment of a grammar might be included in multiple language definitions but, because of differing prologues, have radically different interpretations, for example with a different default namespace in use.

New Vocabulary Features

A new version of a vocabulary might incorporate new terms that were previously an extension. The vocabulary itself might be said to have been extended compared to previous versions, but the new terms or features are no longer themselves considered an extension.

Usage Conventions Adopted

The creators of a vocabulary may decide that a usage convention is reasonable and adopt it into their language. This is sometimes referred to as paving the cowpaths, although anyone who has lived around cows knows that they don’t always follow very useful or wise routes. A common example here is languages that assign special meanings to comments in a particular format, such as Encapsulated PostScript using %%page at the start of a line; regular comments in that language start with a %, but the convention is that PostScript comments should not start with %% unless they conform to the Encapsulated PostScript convention.

Note that usage conventions, in the sense used in this paper, are not themselves part of the vocabulary.

Internal and Interchange Formats

These are not strictly speaking a type of extension, but rather a context and situation: The context is one in which an organization has needs not met by a vocabulary; the situation is one where documents produced internally must be shared with other organizations, and are transformed in some way at the institutional boundaries so that what is shared is conformant.
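A transformation at the institutional boundary can be as simple as removing everything that is not in the interchange vocabulary. The following is a minimal sketch in Python, assuming (hypothetically) that internal extensions live in their own namespace and can simply be dropped; a real boundary transformation might instead map extensions onto conformant markup. Names without a namespace are treated as core in this sketch.

```python
import xml.etree.ElementTree as ET

CORE_NS = "http://example.org/ns/core"  # hypothetical interchange namespace

def is_core(name):
    # Names with no namespace are treated as core in this sketch.
    return not name.startswith("{") or name.startswith("{" + CORE_NS + "}")

def to_interchange(xml_text):
    """Drop extension elements and attributes, keeping the surrounding text."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for attribute in [a for a in parent.attrib if not is_core(a)]:
            del parent.attrib[attribute]
        previous = None
        for child in list(parent):
            if is_core(child.tag):
                previous = child
                continue
            # Preserve the text that followed the removed element.
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
    ET.register_namespace("", CORE_NS)  # serialize core names without a prefix
    return ET.tostring(root, encoding="unicode")

INTERNAL_DOC = (
    '<doc xmlns="http://example.org/ns/core" '
    'xmlns:int="http://example.org/ns/internal" int:reviewer="sam">'
    '<para>Ship <int:note>check price</int:note>on Friday.</para></doc>'
)

print(to_interchange(INTERNAL_DOC))
```

The internal document keeps its reviewer attribute and note element; what crosses the boundary contains only the core doc and para elements.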

Evaluating Extensions

Extensions can have an effect wider than on a single individual or organization. Some extensions become widely used, and these may be adopted by the maintainers of the vocabulary, or they may be seen as disruptive.

In either case, extensions that are in use in interchange between organizations necessarily lead to fragmentation: any given tool or tool chain may or may not be able to process the extension. For example, if one were to share with someone else an XSLT transformation document that made use of EXPath extension modules, the recipient would be unable to use the transformation unless they had an XSLT implementation that supported the extension. So, there would then be two languages: XSLT with EXPath and XSLT without EXPath. But if there were three EXPath extension modules, and implementations may have any combination, there would then be eight different languages, since each module may independently be present or absent and the increase is therefore combinatorial.
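The combinatorial growth is easy to check by enumeration: each optional module is either supported or not, so n modules give 2^n possible dialects, counting the one with no modules at all. A minimal sketch in Python follows; the module names are illustrative only.

```python
from itertools import combinations

MODULES = ["binary", "file", "archive"]  # illustrative module names only

def dialects(modules):
    """Return every subset of optional modules an implementation might support."""
    result = []
    for size in range(len(modules) + 1):
        for chosen in combinations(modules, size):
            result.append(frozenset(chosen))
    return result

all_dialects = dialects(MODULES)
print(len(all_dialects))  # 2 ** 3 == 8
```

Adding a fourth optional module would double the count again, to sixteen.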

An extension, then, reduces interoperability. But when an extension is widely implemented, it generally increases the scope, or applicability, of the vocabulary, and gives an overall benefit. This would be a beneficial extension.

An extension sometimes is created by people who are not well-connected to the user community, or who have very different views from the majority of the people creating the original vocabulary. Or, sometimes, one or more of the original creators has a change of heart in some way. The extension might violate what users perceive to be underlying principles, or might feel out of place. For example, consider an extension to a declarative content-oriented XML interchange vocabulary that introduces procedural commands such as Switch to a larger type size until otherwise notified. Such an extension changes the way that people think about the vocabulary, and even though it may increase applicability, it can cause damage. The extension doesn't fit in well, is harder to learn, and users become confused by a new lack of orthogonality. So, this would be an example of a harmful extension.

It should be admitted that there is no easy and clear-cut way to determine whether an extension is beneficial or harmful. Sometimes it is only apparent after several years. Sometimes the damage of an extension is that it precluded a better solution being adopted.

Vocabulary Life Cycle: the Birth of an Extension

Vocabularies are created, born, grow, live and flourish, or wither and are forgotten, but they rarely die. It is very difficult to withdraw features from vocabularies once those features are in widespread use. Furthermore, the slightest change to the specification may mean new documents do not work correctly in existing implementations. Features can be marked as deprecated, but both users and implementers will have to deal with documents containing such deprecated markup.

Some common reasons for extensions include:

As people started to use a vocabulary they found they needed (or wanted) it to handle more cases than it already did: the vocabulary grows;
The requirements changed, or priorities changed, and with it the focus of usage. For example, rotary-dial telephones are no longer ubiquitous, and a shared way to describe telephones for retailers to choose items to stock no longer needs to mention the speed of the dial return, but should probably mention whether a 3.5mm headphone socket is provided. It might be that the vocabulary does not need to grow, but rather that there will be increased detail in some areas and perhaps reduced detail in others.
Someone involved in the vocabulary came up with an idea: “this specification is fabulous; we could use it for selling wedding cakes if only it had…”; or, “it’s really useful that you can work with numbers and decimals but what about fractions?” So an extension can be systemic: adding fractions to every numeric value, for example, or it might be modular, offering a self-contained new facility such as the ability to manipulate Zip archives in EXPath.
The way the vocabulary is used has varied over time. For example, before the availability of CSS, the HTML blockquote element was often used for indented text, regardless of the reason for the indenting. Similarly, people using vocabularies with markup for italics as rhetorical or grammatical emphasis but without plain italics may find themselves marking up book titles or phrases in foreign languages as emphasized rather than merely differentiated. This is sometimes called tag abuse, but it is really a symptom of needs not being met, and can healthily evolve into the first case listed above.
Often people use built-in extension mechanisms, or invent their own mechanisms, to remain within the broad communion, or user base, of a particular vocabulary while supporting their own workflows. Sometimes one sees Web pages that use a custom DTD, for example, or DocBook articles with custom elements: the additional markup is usually not intended to be public in these cases, but rather is a symptom of a private extension.
Very occasionally, two or more vocabularies merge, or one subsumes another. The individual vocabularies may continue to be maintained separately, as with HTML 5 incorporating MathML and SVG. The original vocabulary appears to grow in size and complexity but, since the most common case of this is to absorb widely-used extensions, there may be no increase in practice.

As with any change to a specification, whether explicit or implied, changes can originate internally, from the people maintaining the specification of the vocabulary, or can originate from sources external to that group, as the next two sections describe.

Committee Proposals

Very often a new feature starts out as a proposal from someone already participating in whichever group or committee maintains a particular vocabulary. Such an extension may go into a future version of the vocabulary, in which case people using that next version do not generally consider it to be an extension. Sometimes the committee will reject the proposal, and in that case it may later become part of some third-party extension. Eventually it may return to become part of the main specification, as SVG did with HTML 5.

The important thing about proposals from within the committee is that because they are very often developed in a context of what Applen and McDaniel refer to as tacit knowledge [Applen & McDaniel 2009], they tend to fit in well with the overall design of the vocabulary or specification in question.

Community Proposals

Sometimes people who are on the periphery of a committee, whether outside but following closely or inside but not part of the cognoscenti or not well respected, will come up with a proposal; at other times the proposal comes from committee members but falls outside the scope of the committee work, or is of a nature that means the details could not easily be agreed upon within the group.

At other times, a user or implementer group outside the original or main committee decides to extend the specification. This can happen through dissatisfaction with the main group (as, for example, with HTML 5 and the WHATWG) or a need for something faster than full consensus allows, or sometimes simply because the outsiders did not understand that they could have been more closely involved.

Forks

Strictly speaking a fork can happen from within a committee or from the outside, or even a combination of both. The term comes from open source programming: a fork of a piece of code (whether a complete application or just a single library) comes when someone copies the original, changes it, and starts redistributing their changed version. This can be for several reasons: the original maintainer might have wandered off, leaving the work orphaned; the maintainer might have refused to make changes someone wanted; sometimes the original maintainer passes on the flag to someone else, or agrees there will be two versions with two different areas of focus. Thus a fork can be amicable or can be hostile.

In the world of markup vocabularies and specifications a fork most often happens when the original standards committee doesn’t recognise a particular need as valid (rightly or wrongly). It can also happen if a group needs a smaller subset of a specification, as happened with the Mallard subset of DocBook for the GNOME project. Usually in the case that the new fork is not intended to replace or supplant the original specification there is no need for hostility: Mallard was made for use by a specific community, for example.

Merging

Sometimes two specifications merge into a single larger one; usually the resulting vocabulary is the union of the original specifications before the merge, often with some additions, since if one is revising a vocabulary it can be hard to argue with people who want to add to it.

A merge can be done to include one vocabulary inside another, such as HTML 5 incorporating MathML and SVG, rather in the manner of a shark eating a jellyfish. The result can remain separate specifications or can become one larger one. Another reason for a merger is when there are variants of the original specification in use and incorporating the variations seems best for everyone.

After the Work Ends

Sometimes a maintainer wanders off, loses interest, loses the ability to continue the work, or even dies. An organization can be taken over (such as Sun Microsystems by Oracle), or can cancel a project (such as Oracle canceling Solaris). Sometimes the specification may remain frozen, and may even become difficult or impossible to obtain. But if the vocabulary is in widespread use then new needs will emerge, and a new group will probably carry the torch forward.

Sometimes a committee will mark a particular vocabulary or version as deprecated. On other occasions a specification may be actively withdrawn by its publisher, for example for legal reasons. And of course at times a specification is perfect: the committee can be disbanded because the work is finished.

Characterizing Extensions

The terms defined in this section are intended to be of use in describing extensions to vocabularies and will be used in the rest of the paper to characterize specific extensions and extension mechanisms.

Functional Extensions: New Behaviour

A behavioural extension is one that changes the behaviour, or enables such changes, in software processing marked-up documents.

In HTML, for example, code running in the client (that is, in a Web browser) makes use of the extensions: for example, a browser might interpret rel=toc to provide a toolbar button or keyboard shortcut to access a table of contents entirely outside the containing document. Markup to extend behaviour is often very specific to the behaviour however: early HTML examples included blink and marquee elements whose purpose was to affect the display of contents rather than to indicate meaning.

A vocabulary or a system using a vocabulary might also be extended by mixing or adding entirely different languages. For example, one might include OpenGraph or Schema.org features in an XML vocabulary, and these might be expressed in a JSON-LD syntax inside XML elements or attributes. Or, a system might be extended by supporting scripting, and this might become visible inside documents. Such extensions can be pernicious, tying documents down to use by specific software in specific contexts and limiting reuse.

Semantic Coverage: New Meanings

In the semantic coverage case, the purpose of the extension is to represent information in documents. For example, in HTML, one might use a span element with a class attribute value of place to mark up places mentioned in a document. Although this information might then enable new functionality, such as connecting the prose to a map view, the markup is not tied to any particular behaviour.
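Because the markup only records what the text is, any consumer can make use of it. The following is a minimal sketch in Python, using the standard html.parser module, that collects the places marked up this way; the class name PlaceCollector and the sample sentence are hypothetical.

```python
from html.parser import HTMLParser

class PlaceCollector(HTMLParser):
    """Collect the text of span elements whose class list includes 'place'."""

    def __init__(self):
        super().__init__()
        self.places = []
        self._depth = 0      # greater than zero while inside a place span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "span":
                self._depth += 1  # track nested spans so the right end tag closes
        elif tag == "span" and "place" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.places.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

SAMPLE = ('<p>From <span class="place city">Toronto</span> to '
          '<span class="place">Washington, <span>DC</span></span> by '
          '<span class="train">rail</span>.</p>')

collector = PlaceCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.places)
```

The same list could feed a map view, an index of places, or nothing at all; the document itself does not change.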

Implicit Extensions

Sometimes when we use a specification and share our documents or data, we do not realize that we have created an extension. For example, people marking up HTML documents might find they have used common class attribute values, or people might take a specification like SCXML and use it for subject domains that the Working Group that developed it never envisioned, and to which the prose in the specification applies at best poorly. Over time the result can be to broaden the scope of the original language.

An implicit extension, then, is one where the fact that a vocabulary has been extended is not necessarily obvious to an observer.

Explicit Extensions

Many specifications provide methods for extension, some of which are covered later in this paper. In most cases the fact that something is an extension is made explicitly visible: for example, by the use of an XML namespace, or by including a module or library.

An explicit extension, then, is one whose use (and not just whose description or specification) makes it clear both to software and to any human working with the vocabulary that an extension has been used.

Usage Conventions

Sometimes it seems easier to decide on a particular way of using a vocabulary than extending it. With a vocabulary that does not include section titles, one might decide to use the first paragraph of each section as a title, even if formatting software does not embolden it. Strictly speaking a usage convention is not an extension to a vocabulary, but it extends the scope of the vocabulary without introducing any new terms or markup.

Methods of Extension

We have considered some of the contexts in which vocabularies are commonly extended. We are now in a position to consider the methods by which they are extended in those various contexts and to understand the reasons for the technical design choices.

Some vocabularies provide explicit extension methods in which users can add new elements or attributes, or can change what is allowed at any given point, using extension points built in to the various schema languages used to describe those vocabularies. For example, a DTD might provide parameter entities included in each content model, so that a document can override the definition of one of the parameter entities to add a new element to the corresponding content model. This permits extended documents to be validated, but software processing the documents will still need to be modified appropriately to understand the new markup.

Adding New Elements

One of the most obvious ways to extend an XML vocabulary, or one in any similar language, is to add new terms to the vocabulary. We might, for example, decide that the title element of an HTML document is insufficient for our purposes because it does not allow nested elements within it; instead, we add a pagetitle element that’s richer.

One obvious problem with this is that existing software doesn’t understand it. The second is that we still need a title element in each document for existing software to use, so now there is duplicated information. The new element reduces interoperability of documents, but if the usage is confined to a well-defined group then this is not a problem.

The use of XML namespaces is the most common way to identify extension elements. Namespaces are a fragile mechanism, often failing silently if there’s a typo in the namespace name (the URL) and in some implementations even failing if a document uses a different prefix. However, the fragility seems more than compensated for by avoiding conflicts, where two groups add elements of the same name. One of the original use cases for XML namespaces was to allow the mixing of vocabularies in this way.
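Both kinds of fragility can be demonstrated in a few lines. The sketch below uses Python's standard xml.etree.ElementTree; the namespace URI, the dblookup element and the function names are hypothetical. A namespace-aware lookup does not care about the prefix but silently finds nothing when the URI contains a typo, while a naive prefix-based match gets both cases wrong.

```python
import xml.etree.ElementTree as ET

EXT_NS = "http://example.org/ns/ext"  # hypothetical extension namespace

def count_lookups(xml_text):
    """Namespace-aware count of dblookup extension elements."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//{%s}dblookup" % EXT_NS))

def naive_count(xml_text):
    """Prefix-based matching, as done by software that ignores namespaces."""
    return xml_text.count("<ext:dblookup")

EXPECTED_PREFIX = '<doc xmlns:ext="http://example.org/ns/ext"><ext:dblookup/></doc>'
OTHER_PREFIX = '<doc xmlns:q="http://example.org/ns/ext"><q:dblookup/></doc>'
TYPO_IN_URI = '<doc xmlns:ext="http://example.org/ns/extt"><ext:dblookup/></doc>'

for label, text in [("expected prefix", EXPECTED_PREFIX),
                    ("other prefix", OTHER_PREFIX),
                    ("typo in URI", TYPO_IN_URI)]:
    print(label, count_lookups(text), naive_count(text))
```

In the typo case no error is reported anywhere: the document is well-formed, and the extension element is simply in a different namespace from the one the software looks for.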

In all cases, anyone adding an element of their own to documents that otherwise conform to someone else’s vocabulary needs to ensure that receiving software that does not understand the new element will behave sensibly: this is known as fallback. For example, a Web browser receiving a document containing a dblookup element will (in the absence of scripting) simply display the contents of the element. The document author therefore needs to include appropriate fallback contents. The design of extensibility in HTML places burdens on document authors.
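The fallback behaviour described here, in which unrecognized elements are transparent and their contents are shown, can be sketched as a toy renderer. This is an illustration in Python of the general idea, not of how any particular browser is implemented; the MARKERS table and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Elements this toy renderer knows how to decorate; everything else falls back.
MARKERS = {"em": "*"}

def render(element):
    """Render to plain text; unknown elements contribute only their contents."""
    marker = MARKERS.get(element.tag, "")
    parts = [marker, element.text or ""]
    for child in element:
        parts.append(render(child))
        parts.append(child.tail or "")
    parts.append(marker)
    return "".join(parts)

DOC = ('<p>Stock level: <dblookup table="stock" key="socks">'
       'unknown (lookup unavailable)</dblookup>.</p>')

print(render(ET.fromstring(DOC)))
```

The renderer has never heard of dblookup, so the reader sees whatever fallback text the author placed inside it; an empty dblookup element would leave a gap in the sentence.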

Adding New Attributes

Very often, software processing a vocabulary will ignore attributes that are not recognized. Schemas need to be modified, but that’s true of any change. XML extension attributes can be associated with an XML namespace to avoid conflicts, as with elements.

A benefit of using attributes for extensions is that they tend to be less disruptive than elements. On the other hand they are restricted to simple string content and cannot be marked for language or text direction (e.g. RTL). Attributes are therefore not in general suitable for human-readable content: for example, you can’t easily have Taiwanese alt text for an SVG image in a Chinese HTML page, and this matters because Unicode code points are shared between those languages, so that language marking is needed for the text to be readable. In addition, there can only be one attribute of a given name on any particular element, limiting some sorts of extensibility.

The HTML 5 specification reserves attributes whose names begin with data- to be extension attributes, but this naming convention is not always acceptable to other groups extending HTML.
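Because the convention is purely a matter of naming, software can find such extension attributes without knowing anything about what they mean. A minimal sketch in Python, using the standard html.parser module, follows; the class name and the sample attributes are hypothetical.

```python
from html.parser import HTMLParser

class DataAttributeCollector(HTMLParser):
    """Record every attribute whose name uses the reserved data- prefix."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        for name, value in attrs:
            if name.startswith("data-"):
                self.found.append((tag, name, value))

SAMPLE = ('<p class="note" data-shoe-size="43">Socks</p>'
          '<img src="socks.png" alt="Socks" data-credit="LQ">')

collector = DataAttributeCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.found)
```

Note that nothing in the name says which group or tool an attribute belongs to, so two independent extensions can still collide on the same data- name.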

Adding New Content

Additional content is not usually considered to be an extension, since it does not affect the vocabulary itself. But consider including multiple translations of each paragraph of a document, one after the other; a usage convention might be used to say that the first paragraph is Romanian and the second the Italian translation, marked as Italian with xml:lang but not otherwise as a translation.

Adding New Values

New attribute values or element contents make an obvious way to extend many specifications. Attributes with names like role seem good candidates. It can be difficult to avoid collisions here, however, and there can be problems with fallback.

Examples in HTML include adding new meta or rel values, using data:* attribute values instead of linking to external resources, or using non-standard ARIA role attribute values. Note that this is different from extending HTML using Custom Elements or new class attribute values, because those are intended to be used for customization.

Subtractions

It may seem odd to consider removing part of a vocabulary as an extension. Such a change, however, can greatly facilitate implementation and can also help with authoring (by reducing choices). A diminished version of a vocabulary is sometimes known as a subset and sometimes as a profile, depending mostly on whether the speaker approves of it or not. Subsets (or profiles) can reduce interoperability, because an implementation might support one dialect and not another. They are therefore most suited for well-targeted use cases and communities.
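The interoperability cost of a profile can be seen by checking a document against the reduced set of names. The following is a minimal sketch in Python; the element names and the choice of what the profile removes are hypothetical, and a real profile would normally be expressed as a schema rather than as a set of names.

```python
import xml.etree.ElementTree as ET

# Hypothetical full vocabulary and a profile that subtracts two elements.
FULL = {"article", "section", "title", "para", "table", "equation"}
PROFILE = FULL - {"table", "equation"}

def outside(xml_text, allowed):
    """Return the element names used in the document that are not allowed."""
    root = ET.fromstring(xml_text)
    return sorted({element.tag for element in root.iter() if element.tag not in allowed})

DOC = ("<article><title>Socks</title><section><para>Sizes:</para>"
       "<table><para>43</para></table></section></article>")

print(outside(DOC, FULL))     # names unknown to the full vocabulary
print(outside(DOC, PROFILE))  # names the profile has removed
```

The document conforms to the full vocabulary but uses a table, so an implementation that supports only the profile cannot be relied on to process it.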

One well-known example of a profile, or subset, is XML: every well-formed and DTD-valid XML document is also a valid SGML document. Admittedly this took a change to SGML to achieve, but the change (or rather, set of changes) was not unreasonable. Although XML was originally made by a group who did not think they could get the SGML committee to make changes in a timely fashion, if at all, in the end the committee turned out to be generally amenable to changes, and the design of XML could have been somewhat simplified had this been anticipated.

Combining Vocabularies: Xreole

Merging specifications to make a superset has already been discussed above. Another possibility when merging is to pick and choose, resulting in what is perhaps best considered to be an entirely new markup vocabulary, a sort of XML Creole, that was influenced by its ancestors but is not compatible with them.

This may sound like the province of Igor in the basement, but it can have the advantage of reduced training costs and sometimes even reduced tooling costs. Consider a vocabulary that uses DocBook element names for structure, HTML names for paragraphs and below, and DITA-style assembly from fragments. We could call it DitaWebBook. The HTML names for italic and bold, the accessibility attributes, the p element, all add a (perhaps false and misleading) sense of familiarity. Authors may then be surprised when MathML or SVG or JavaScript are not supported.

Adapting Existing Markup

When you don’t have an element to mark up a foreign phrase that’s to be italicized, and there’s no element for meaningless (semantically unweighted) italics, what’s an author to do except look for some other element that displays in italics? Emphasis, perhaps, resulting in documents in which text-to-speech software reads out foreign phrases (or book titles, perhaps) in a louder or higher-pitched voice as if they were really important.

More pernicious than poorly-accessible italics, however, are values that are interpreted by software: our vocabulary didn’t have an element for postcode, so we used email address, because we aren’t allowed to store those. This is payback for vocabulary designers who did not allow for extensibility. The three-level postal address that doesn’t work in other countries; the telephone number field that doesn’t allow for an office extension number. Every example represents a design failure.

Adapting markup is sometimes derogatorily called tag abuse, although it can also be a form of usage convention.

Scripting

It is often tempting to make a system user-extensible by incorporating a scripting language. The result, as suggested above, can be that documents become tied to a particular system used in a particular configuration, because they contain fragments of programs or hooks for extension scripts to use.

An example is HTML Custom Elements, where the language is extended not by editing the grammar in some way, but through a JavaScript API which itself is subject to change.

Inhibiting Factors

Some vocabularies and languages have designs that make it harder to evolve them over time.

HTML has always defined that an unknown element in the document body should be rendered as if its tags were missing, which allows for experimental elements to be added easily. Unfortunately there was also a decision that the first unknown element would end the head, which considerably complicated adding new metadata and which the IETF HTML Working Group later regretted.

But inhibiting factors can come from other directions. For example, the technique known as literate programming, in which a program is intertwined with extensive documentation, can discourage many programmers from making changes, especially if they are not comfortable with writing prose. Or they may make changes to the code but not update the prose, which to them may be a harder task.

Literate programming is an extreme example, but any extension can make existing documentation obsolete, because you wouldn’t do it that way any more.

There can also be legislative inhibitors, for example if a specific version of a vocabulary is required, and implementation inhibitors, for example if a particular language version is very widely implemented, as with XSLT 1. Infrastructure inhibitors can be very difficult to surmount.

Sometimes incompatible changes in a new version of a vocabulary can discourage or even prevent adoption; this was the case with XML 1.1, where in some (admittedly obscure) cases existing documents could have their meaning changed, and where existing XML processors were required to reject XML 1.1 documents.

Encouraging Benevolent Extensions

There are a number of techniques that have emerged through experience as ways to encourage extensions that improve an ecosystem. Even though the combinatorial bifurcation problem is always present with extensions, the techniques either mitigate this problem or give benefits that outweigh it.

Version Numbering

If an XML vocabulary includes its version number in its namespace, any change to the version number will generally break all processing tool chains. This is appropriate if it would always be an error for version N software to attempt to process version N+1 input, but more often there are compatible changes, or new features added to the vocabulary such that every version N+1 document that does not use the new features is also a conforming version N document. This can be managed by separating the namespace (if used) from a version attribute on the top-level element, as is done by XSLT and DocBook 5.

It also helps to use a version number scheme that says that minor revisions are compatible in the way mentioned above; this is often done using a decimal point in the version number, so that a processor for version 3.2 of a vocabulary can process input marked as 3.* (where * represents any number, such as 3.9), but would report an error if given a version 4 document, where the first part of the number, before the dot, was higher than the processor understood. The 4 here is called a major revision number and the part after the dot (or the entire number) a minor revision number.
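The major and minor rule described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and rejecting documents with a lower major number than the processor's is an assumption made here, since the rule above only says what happens when the major number is higher.

```python
def parse_version(text):
    """Split a version string such as "3.2" into (major, minor) integers."""
    major, _, minor = text.strip().partition(".")
    # A bare "4" is treated as "4.0".
    return int(major), int(minor or 0)


def can_process(processor_version, document_version):
    """Return True if the processor should attempt to process the document.

    Minor revisions are assumed compatible, so only the major number is
    compared. Assumption: a different (higher or lower) major number is
    reported as an error.
    """
    processor_major, _ = parse_version(processor_version)
    document_major, _ = parse_version(document_version)
    return processor_major == document_major


if __name__ == "__main__":
    print(can_process("3.2", "3.9"))  # same major revision
    print(can_process("3.2", "4.0"))  # newer major revision
```

A processor that accepts a document with a higher minor number than its own still needs a fallback rule for the features it does not recognize.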

Allowing Mixed Namespaces

Allowing foreign, or secondary, namespaces can help demarcate extensions from the primary vocabulary, and can make sure there are no conflicts. For example, both DocBook and SVG have title elements, but DocBook documents that use SVG elements associate them with the SVG namespace, so there is no conflict. The mixing is not unconstrained, however: the DocBook 5 specification indicates where SVG elements are allowed to appear.
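As a rough illustration of how namespaces keep the two title elements apart, the following Python sketch uses the standard library's ElementTree to collect titles by namespace. The sample document and the function name are made up for this example, and the placement of the svg element is illustrative rather than a statement of what the DocBook schema permits.

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical sample: a DocBook article containing an inline SVG drawing.
SAMPLE = """<article xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Extending Vocabularies</title>
  <mediaobject>
    <imageobject>
      <imagedata>
        <svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
          <title>A small square</title>
          <rect width="10" height="10"/>
        </svg>
      </imagedata>
    </imageobject>
  </mediaobject>
</article>"""


def titles_by_namespace(xml_text):
    """Return the text of every title element, grouped by namespace URI."""
    root = ET.fromstring(xml_text)
    found = {}
    for ns in (DOCBOOK_NS, SVG_NS):
        # ElementTree spells qualified names as "{namespace-uri}local-name".
        found[ns] = [el.text for el in root.iter("{%s}title" % ns)]
    return found


if __name__ == "__main__":
    for ns, titles in titles_by_namespace(SAMPLE).items():
        print(ns, titles)
```

Because the lookup is by qualified name, a DocBook stylesheet that formats titles never sees the SVG one, and the reverse also holds.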

Fallback

One of the places where CSS design has improved upon HTML design is the notion of fallback: considering what an implementation will do if it encounters CSS it does not understand, and making sure the base language is designed so that a sensible fallback is always possible. The document should always be readable even if some features (such as coloured borders, for example) are not rendered.

Constraints such as those that CSS fallback places on designers of language extensions can be very helpful to user communities.
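The core of this kind of fallback is that anything unrecognized is skipped rather than treated as a fatal error. The Python sketch below models only that idea and is not a CSS parser; the set of known properties, the border-glow property and the function name are all hypothetical.

```python
# Hypothetical set of properties that this imaginary implementation understands.
KNOWN_PROPERTIES = {"color", "border-color", "font-style"}


def apply_declarations(declarations, known=KNOWN_PROPERTIES):
    """Apply (property, value) pairs in order, skipping unknown properties.

    Returns the applied properties and the names that were ignored.
    Later declarations of a known property override earlier ones.
    """
    applied = {}
    ignored = []
    for name, value in declarations:
        if name in known:
            applied[name] = value
        else:
            # Unknown property: drop this declaration and carry on.
            ignored.append(name)
    return applied, ignored


if __name__ == "__main__":
    # "border-glow" stands in for a feature from a later extension.
    rules = [("border-color", "red"), ("border-glow", "soft")]
    print(apply_declarations(rules))
```

An older implementation still draws the red border and the document stays readable; only the newer decoration is lost.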

Extension Attributes and Namespaces

The HTML 5 specification allows any number of attributes whose names start with data- to appear on any element as an extension. In an XML environment one might supply a specific extension namespace, or one might say, as XSLT says, that attributes in any namespace other than that of XSLT are extension attributes. The goal is to make sure there can never be conflict between extensions and the original vocabulary as it grows and changes over time. Requiring people writing extensions to use their own namespaces means that any two different extensions will not conflict either.

The same techniques can be used with any names, including elements.
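A processor following the XSLT-style rule might separate attributes along these lines. This is a minimal Python sketch using the standard library's ElementTree; the function name, the ex prefix and the http://example.org/ext namespace are invented for the example.

```python
import xml.etree.ElementTree as ET

XSLT_NS = "http://www.w3.org/1999/XSL/Transform"

# Hypothetical instruction carrying one extension attribute (ex:trace).
SAMPLE = (
    '<xsl:template xmlns:xsl="http://www.w3.org/1999/XSL/Transform" '
    'xmlns:ex="http://example.org/ext" match="para" ex:trace="yes"/>'
)


def split_attributes(element, core_ns=XSLT_NS):
    """Separate an element's attributes into standard and extension ones.

    An attribute counts as an extension if it is in a namespace other than
    core_ns. Everything else is left for the core vocabulary to interpret.
    """
    standard, extensions = {}, {}
    for name, value in element.attrib.items():
        if name.startswith("{"):
            # ElementTree spells qualified names as "{namespace-uri}local".
            ns, local = name[1:].split("}", 1)
            if ns != core_ns:
                extensions[(ns, local)] = value
                continue
        standard[name] = value
    return standard, extensions


if __name__ == "__main__":
    element = ET.fromstring(SAMPLE)
    print(split_attributes(element))
```

Keying extensions by namespace as well as local name means two organizations can each define a trace attribute without either clashing with the other or with a future version of the core vocabulary.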

It should be noted that a large proliferation of XML namespaces can cause problems with implementations; there have been XSLT engines, for example, with limits of 256 namespaces per element, or even per document. In addition, users can find it confusing to remember which namespace to use. A possible strategy is to stick to one for the main vocabulary, and one for each organization making extensions, rather than one per extension.

Communication

The single most important factor in writing a successful language extension is to be in communication with both the original language maintainers and the primary user community. Therefore, a wise vocabulary designer will provide a place for people to get in touch at an early stage both with the developers of the vocabulary and with users.

Conclusions

There are many ways to extend vocabularies, only a few of which were covered in this paper. When vocabularies are not created with extensibility in mind, a fist punched through the wall makes a new window but it is not always pretty. Therefore, a combination of anticipation and feedback from users is to be recommended. Fallback must always be considered, along with accessibility and internationalization.

References

[ISO 8879:1986 SGML] ISO/IEC, Information processing — Text and office systems — Standard Generalized Markup Language (SGML).

[Applen & McDaniel 2009] Applen, J.D. and McDaniel, Rudy, The Rhetorical Nature of XML, Routledge, 2009.

[Lizzi, 2017] Lizzi, Vincent M., Testing Schematron using XSpec. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Lizzi01.

[Lubell, 2009] Lubell, Joshua, Documenting and Implementing Guidelines with Schematron. Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:https://doi.org/10.4242/BalisageVol3.Lubell01.

[Wilde, 1895] Wilde, Oscar, The Importance of Being Earnest, A Trivial Comedy for Serious People. First performed at St James’s Theatre in London in 1895 and in 1899 published from exile in Paris by Leonard Smithers.

^* Specifications very rarely shrink except by virtue of splitting into several separate documents.

Author's keywords for this paper:

XML; vocabulary design; standards
