History
DocBook has a long history. It began in the early 90s as an SGML
DTD. At the time, there were several commercial Unix vendors and
O’Reilly Media (then O’Reilly & Associates) had built a successful
publishing business supplying technical books about the Unix
ecosystem.
One aspect of the Unix system, the man page, contributed
significantly to the documentation of Unix and its commands, tools,
APIs, and subsystems. Anecdotally, the origin story for DocBook begins
with some of the Unix vendors, in cooperation with O’Reilly and HaL
Computer Systems, getting together to build an interchange vocabulary
for man pages.
The man pages were not a differentiating factor for the vendors; they
didn’t need or want them to be proprietary. All Unix systems shipped
with largely the same man pages.
The idea was that each Unix vendor would have their own man pages (in
troff documents with their own custom macros) but when they wished to
exchange them, they would transform them into a common language and the
recipient would transform them from the common language back into the
local flavor of troff.
SGML was selected as the appropriate interchange format and
DocBook development began. Much of the early design of DocBook was the
result of extensive document analysis over the corpus of O’Reilly
Unix-related books (including several collections of “man pages”, formatted in
the house style, entire series of
X-Windows
books, and others).
Markup was invented for each significant feature of the corpus.
From the very beginning, DocBook was a descriptive format, not a
prescriptive one. If a structure could occur in a corpus of reasonable
documents, it was (to the maximum extent possible) allowed.
Early development
Once there were several players, a maintenance organization was
formed: the Davenport Group. Davenport met several times a year,
stewarding the development of DocBook.
The Davenport Group established that the focus of DocBook would be
computer hardware and software documentation: Unix man pages, software
documentation, documentation about networking hardware, etc. That
remains its focus today.
By the mid 90s DocBook had accumulated several years of managed but
somewhat ad hoc growth. When Eve Maler joined the design team, she
undertook to bring design principles to the structure of the DTD.
Maler and Jeanne El Andaloussi developed the markup design philosophy,
methodology, and techniques described in their book
Developing SGML
DTDs: From Text to Model to Markup during the development of DocBook
3.0.
The structure of DocBook still reflects that methodology, even though
much has evolved.
Growth and popularity
By the late 90s and early 2000s, DocBook was very popular. It
arrived on the SGML scene (relatively) early, and it had meaningful
element names, good documentation, and a reasonably complete set of
open source stylesheets for transforming it into HTML and print.
Ironically, the Unix vendors had all gone away by this time and
interchange had ceased to be the dominant use case for DocBook. The
most common use case became, and remains to this day, simply to use
DocBook as an authoring format, often with very little customization.
Versioning and copyright
From the very earliest days, the Davenport Group made distribution and
reuse of the DTD as easy as possible. It was freely available, it was
free to use, and organizations were free to customize it.
The rules that apply to customizations of DocBook are as simple as
possible: do anything you like, but if you change it, don’t call it
DocBook.
Many, many (many!) organizations began with DocBook and built
their own systems by extension and subsetting. Many more simply
cannibalized DocBook for the markup structures that they found
useful. For example, the elements and attributes for indexing, or the
table model (later formalized as the XML Exchange
Table Model Document Type Definition), were often excised out
of DocBook and used in other systems.
DTD structures
Anyone familiar with the Maler and El Andaloussi methodologies will
recognize the structure of the DocBook DTD. It makes extensive use of
parameter entities to make customization (i.e., changing the schema
without editing the original source files) as easy as possible.
Elements are divided into classes, with extensibility:
<!ENTITY % local.admon.class "">
<!ENTITY % admon.class
"Caution|Important|Note|Tip|Warning %local.admon.class;">
Classes are accumulated into mixtures, also with extensibility:
<!ENTITY % local.component.mix "">
<!ENTITY % component.mix
"%list.class; |%admon.class;
|%linespecific.class; |%synop.class;
|%para.class; |%informal.class;
|%formal.class; |%compound.class;
|%genobj.class; |%descobj.class;
|%ndxterm.class;
%local.component.mix;">
And the content models of individual elements are constructed
by combining elements and mixtures:
<!ENTITY % admon.elements "INCLUDE">
<![ %admon.elements; [
<!ELEMENT (%admon.class;) - - (Title?, (%admon.mix;)+) %admon.exclusion;>
<!--end of admon.elements-->]]>
DocBook provided the additional feature that elements could be
selectively excluded with a parameter entity (admon.elements in this
case).
The modern day
Many members of the Davenport Group became central figures in the
working groups at the W3C that led to the development of XML. In this
period, development of DocBook languished. It was revitalized in 1998
when maintenance moved to OASIS as the first Technical Committee.
Conversion to RELAX NG
As the XML ecosystem evolved, namespaces became a significant feature
of the landscape. The maintainers were eager that DocBook should
continue to evolve and participate fully with the emerging standards
(XLink, XInclude, etc.) that required namespaces.
DTDs are incapable of validating namespaced documents (in the general
case), so moving to a new validation technology was necessary.
Practically speaking, only two grammar-based choices presented
themselves: W3C XML Schemas and RELAX NG. There was unanimity among
the members of the Technical Committee that RELAX NG was a better fit
for modeling the structure of prose documents.
Over the course of a couple of years, through a series of experimental
releases and design reviews, a new set of RELAX NG based models was
developed. These debuted in DocBook V5.0 in 2008.
A primary goal of the conversion was that as many valid documents as
practical should remain valid. That is, a DTD-valid DocBook V4.5
document could be converted to a RELAX NG-valid DocBook V5.0 document
simply by removing any SGML features used in the document and adding a
namespace declaration (and normalizing mixed case, if necessary).
In converting the normative format from DTDs to RELAX NG, the
OASIS Technical Committee decided that the DocBook schema should take
advantage of RELAX NG features that would improve the constraints.
While on its face this seems a very sensible approach, this decision
casts a long shadow.
Unlike DTDs and W3C XML Schemas, RELAX NG allows ambiguity in content
models. This is useful in a prose schema because it allows the
vocabulary designer to more precisely model what users actually want
to do.
In particular, DocBook’s descriptive rather than prescriptive nature
leads to a lot of optionality. Consider a (highly!) simplified
description of a DocBook book. This simplified book could be described
as an optional table of contents, followed by optional chapters,
followed by an optional table of contents. (In French publishing,
tables of contents often come at the end.)
So the structure is:
toc?, chapter*, toc?
That’s perfectly reasonable and logical. No human being looking at
that is troubled by it. And yet neither DTDs nor W3C XML Schema
can express that content model because of its ambiguity: if you see
a toc, you can’t determine (without looking ahead) if it’s the first
one before chapters or the last one after absent chapters.
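As a rough sketch, the simplified model can be written down directly in RELAX NG compact syntax (the syntax is introduced in the tutorial that follows). The element names and the empty or text-only content are illustrative assumptions, not the real DocBook patterns:

```rnc
# A hypothetical, highly simplified "book"; not DocBook's actual patterns.
start = book
book = element book { toc?, chapter*, toc? }
toc = element toc { empty }
chapter = element chapter { text }
```

A book containing a single toc and no chapters matches this pattern in two different ways. RELAX NG accepts the document as long as at least one match exists, which is why the ambiguity is harmless here.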
One of the secondary goals of the transition to RELAX NG was that it
should be possible to generate useful (though not normative) DTD and
W3C XML Schema versions of the schema.
That turned out to be impractical.
What is RELAX NG: a brief tutorial
Very broadly speaking, RELAX NG is a language for performing
pattern matching on trees roughly analogous to the way a regular
expression is a language for performing pattern matching on
strings.
A RELAX NG schema (or grammar) defines a set of patterns. A document
is valid against that grammar if there exists a valid arrangement of
those patterns that matches the document.
There are two syntaxes for RELAX NG, an XML syntax and a compact
syntax. The two are entirely equivalent and it’s possible to translate
losslessly between them. In the interest of space, and because many
people find it more readable, this paper gives its examples in the
compact syntax.
Let’s consider something smaller than DocBook to explore the way a
RELAX NG grammar works.
Here are three patterns:
a = element a { empty }
b = element b { empty }
c = a|b
The first matches an empty a, the second an empty b, and the third
anything that matches a or anything that matches b. It’s important
to remember that validity is about pattern matching. Although it’s
convenient to name patterns after elements, technically what matches
c isn’t an a element or a b element, it’s an a pattern or a b
pattern.
If RELAX NG were limited to matching empty elements without
attributes, it wouldn’t be very useful! Let’s extend our example to
add some attributes and content.
One way to do this is to extend an existing pattern with a new
one. If you extend an existing pattern, you have to specify how your
extension should fit into the current pattern: is it a new choice, or
is it allowed to be interleaved anywhere in the existing pattern?
Here’s an example that extends the “a” pattern with
a choice (signaled with “|=”):
a = element a { empty }
a |= element a {
attribute priority { "high" | "highest" },
empty
}
b = element b { empty }
c = a|b
Now a matches either an a element with a “high” or
“highest” priority attribute or an a element
with no attributes.
Writing an easily customized RELAX NG grammar is, in part, about
making the patterns easily customizable. Making the a pattern an
explicit choice between two element patterns isn’t the best approach.
It would be easier to customize if we used different pattern names.
This grammar is equivalent:
ordinary = element a {
empty
}
important = element a {
attribute priority { "high" | "highest" },
empty
}
a = ordinary | important
b = element b { empty }
Now a customization layer has the freedom to adjust, in ways we’ll
come to in a moment, the ordinary and important patterns
independently.
As these patterns stand, we can match either a single a element
or a single b element. Let’s add a wrapper to hold a collection
of elements:
document = element doc { (a|b)* }
This pattern matches an element named doc that contains any number,
including none, of things that match the a pattern or things that
match the b pattern in any order.
The content model rules are straightforward, if you find regular
expressions straightforward, and will be familiar if you’ve written
DTDs.
- a matches exactly one a pattern.
- a? matches an optional (exactly 0 or 1) a pattern.
- a* matches zero or more a patterns.
- a+ matches one or more a patterns.
- (a,b), a sequence, matches an a pattern followed by a b pattern.
- (a|b), a choice, matches an a pattern or a b pattern.
- (a&b), an interleave, matches an a pattern and a b pattern, in any order.
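To see the operators working together, here is a small illustrative grammar. The c element and the particular content model are hypothetical, chosen only to exercise each operator once:

```rnc
# Illustrative only: a made-up wrapper that combines the operators above.
a = element a { empty }
b = element b { empty }
c = element c { empty }
start = element doc {
  c?,        # an optional c
  (a | b)+,  # then one or more a or b patterns, in any order
  (a & c)    # then exactly one a and one c, in either order
}
```

Read as a regular expression over child elements, this accepts a doc whose children are, for instance, c, b, a, c, a.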
Finally, RELAX NG requires that we enumerate the top level patterns
that our document must match. This is not possible in DTDs and
requires a certain amount of gymnastics in W3C XML Schema.
start = doc|a
Combining these patterns into a grammar, we get:
start = doc|a
doc = element doc { (a|b)* }
important = element a {
attribute priority { "high" | "highest" },
empty
}
ordinary = element a {
empty
}
a = ordinary | important
b = element b { empty }
This grammar matches documents whose root is a doc element or an a
element. Because the a pattern is a choice between ordinary and
important, a root a element is valid either with no attributes or with
a priority attribute whose value is “high” or “highest”.
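If the intent were instead to accept a root a element only when it carries a priority attribute, one way to say so (a sketch, not part of the grammar used in the rest of this section) would be to name the important pattern directly in start:

```rnc
# Variant: a root a element must match the important pattern.
# Inside doc, both forms of a are still allowed.
start = doc | important
doc = element doc { (a | b)* }
important = element a {
  attribute priority { "high" | "highest" },
  empty
}
ordinary = element a { empty }
a = ordinary | important
b = element b { empty }
```

This is the kind of distinction that cannot be drawn in a DTD, where an element name has a single declaration regardless of where it appears.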
With a schema this simple, it doesn’t seem impractical to stop here.
Extending the important or ordinary patterns by redefining the
entire element pattern wouldn’t be too burdensome.
That’s much less practical in a schema with hundreds of elements and
attributes containing complex content models. Let’s rewrite this grammar
in a way that more closely matches the overall pattern structure in the
DocBook schema.
start = doc|a
doc.contentmodel = (a|b)*
doc.attlist = empty
doc = element doc {
doc.attlist,
doc.contentmodel
}
high_priority = attribute priority { "high" | "highest" }
priority = high_priority
important.attlist = priority
important.contentmodel = empty
important = element a {
important.attlist,
important.contentmodel
}
ordinary.attlist = empty
ordinary.contentmodel = empty
ordinary = element a {
ordinary.attlist,
ordinary.contentmodel
}
a = ordinary | important
b.attlist = empty
b.contentmodel = empty
b = element b {
b.attlist,
b.contentmodel
}
This grammar validates exactly the same documents, but it’s much, much
easier to customize as we shall see below.
RELAX NG allows you to create a grammar with reference to another,
existing grammar. Suppose the schema above is accessible at the
URI “base.rnc”. I can write a new grammar by reference:
# My custom schema
include "base.rnc" {
}
We can add any patterns we like outside of the curly braces that
follow the filename, “base.rnc”. Within those curly braces, the
patterns that we specify entirely replace patterns with the same names
in the original, base schema. To augment a base pattern rather than
replace it, we add a combining definition (“|=” or “&=”) outside the
braces.
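A minimal sketch of the two positions, assuming the base grammar above is saved as base.rnc. The draft attribute and the text content for b are hypothetical, invented only for illustration:

```rnc
include "base.rnc" {
  # Inside the braces: this definition replaces the base definition of b.
  b = element b { text }
}

# Outside the braces: "|=" adds a new alternative to the base
# definition of a, leaving ordinary and important untouched.
a |= element a { attribute draft { "yes" | "no" }, empty }
```

A definition placed inside the braces removes every definition of the same name from the included grammar, so a combining definition written there would have nothing from the base grammar left to combine with.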
As it stands, this is an uninteresting grammar that matches
exactly the same things as the base grammar; all
I’ve introduced is a comment. But from here we can begin
to look at customizations.
First, observe that our base grammar allows an explicit high priority
attribute, but doesn’t allow an explicit low or medium priority
attribute. We can easily add such an attribute. Second, let’s add
the requirement that a high priority element must have an ID.
# My custom schema
ordinary_priority = attribute priority { "low" | "medium" }
id = attribute xml:id { xsd:ID }
include "base.rnc" {
ordinary.attlist = ordinary_priority?
important.attlist = priority & id
}
This grammar defines a new pattern to match a low or medium priority
attribute and extends the definition of ordinary.attlist to include it.
By making the pattern optional (“?”), we still allow ordinary elements
without the new attribute.
The important.attlist is defined to interleave a required ID
attribute. RELAX NG can use the W3C XML Schema datatypes,
so we can define it to be an xsd:ID. Since attributes are unordered
in XML, it’s natural to interleave them. (But putting a comma between them
would have the same effect: you cannot make order
matter even if you technically make the attributes a sequence in your grammar.)
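For comparison, here is a sketch of the same requirement written with a sequence instead of an interleave, again assuming the base grammar is available as base.rnc:

```rnc
id = attribute xml:id { xsd:ID }

include "base.rnc" {
  # A comma between attribute patterns does not impose an order;
  # this accepts the same elements as "priority & id".
  important.attlist = priority, id
}
```

The interleave form is arguably the clearer statement of intent, since it does not suggest an ordering that the validator will never enforce.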
Next, let’s imagine that we want to add an “emergency” priority.
There are, in fact, several ways that we could do this.
We could redefine the high_priority pattern:
include "base.rnc" {
high_priority = attribute priority { "high" | "highest" | "emergency" }
}
Or we could extend it:
emergency_priority = attribute priority { "emergency" }
include "base.rnc" {
priority = high_priority | emergency_priority
}
At this point, you might wish that the base schema had defined a
pattern for the list of values:
high_priorities = "high" | "highest"
high_priority = attribute priority { high_priorities }
Then our customization could simply be:
include "base.rnc" {
high_priorities = "high" | "highest" | "emergency"
}
Schema designers have to strike a balance between complexity (make
everything a pattern) and maintainability. Invariably, it will be
the case that for some customizations, you’ll wish there had been
another pattern in the base schema.
RELAX NG offers a special pattern called notAllowed that allows us to
remove things in a customization layer. Suppose, for example, that we
want to remove the notion of priority from this schema entirely:
include "base.rnc" {
priority=notAllowed
}
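To see what this achieves, it may help to write out a grammar that accepts the same documents as the customized one. This is a hand-derived sketch: with priority unmatchable, the important alternative can never match, so only the attribute-free form of a remains.

```rnc
# Roughly what the customized grammar accepts once priority is notAllowed.
start = doc | a
doc = element doc { (a | b)* }
a = element a { empty }
b = element b { empty }
```

Nothing was deleted from the base schema to get here. The unreachable patterns are still defined, which means a further customization could restore them.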
DocBook is a RELAX NG grammar
Customizing DocBook is, effectively, nothing more than applying the
RELAX NG grammar features to the particular set of patterns that
define the DocBook schema.
Removing elements or attributes
There may be nothing easier to do in a RELAX NG customization
layer than remove things. Suppose, for example, you wanted to remove
the revisionflag attribute. If you aren’t tracking changes
in your DocBook sources, you don’t need it. At the element level,
if your publishing system doesn’t support generated callouts, you don’t
need the area element.
namespace db = "http://docbook.org/ns/docbook"
default namespace = "http://docbook.org/ns/docbook"
# ======================================================================
# docbookxi.rnc is the flavor of DocBook that has
# XInclude mixed-in in appropriate places.
include "docbookxi.rnc" {
db.revisionflag.attribute = notAllowed
db.area.units-enum.attribute = notAllowed
db.area.units-other.attributes = notAllowed
}
Sometimes, as with the area element above, you
have to make a few patterns notAllowed to do the job
completely. In this case, there are two patterns, one with an enumerated
list of values for the units attribute and one with a value of “other”
for the units attribute and a required otherunits attribute.
Adding elements or attributes
To add something new, you have to do two things: create a
pattern to match the new item and add that pattern to the appropriate
mixtures. For example, here’s a customization layer that adds a new
inline element, port. The author uses this customization
in the documentation for XProc.
include "docbookxi.rnc"
# Combine outside the include so the base definition is extended, not replaced.
db.markup.inlines |= db.port
# ======================================================================
db.port.role.attribute = attribute role { text }
db.port.attlist =
db.port.role.attribute?
& db.common.attributes
& db.common.linking.attributes
db.port =
element port {
db.port.attlist, (db.programming.inlines | db._text)*
}
There is no requirement that you name patterns following the
conventions used in the DocBook schema, but doing so is likely to help
you keep them organized in a similar way and will make them easier for
other DocBook customizers to understand.
Make required elements optional
Changing content models is, necessarily, contextual. Making a title
optional, for example, is just a matter of changing the pattern used
for the relevant info element. This customization makes chapter titles
optional:
include "/projects/docbook/docbook/relaxng/schemas/docbookxi.rnc" {
# This pattern is db._info.title.req in the base schema
db.chapter.info = db._info
}
The descriptive nature of DocBook has led to a schema without a lot
of required elements. Books without chapters? Indexes without index
terms? Check and check.
To pick an example where a less elegant customization is necessary,
let’s consider ordered lists. In DocBook, they’re required to have at
least one list item. Suppose we wanted to relax that requirement?
It happens that there isn’t a “.contentmodel” pattern for the
content of orderedlist, so we’ll simply have to redefine the whole
thing.
include "docbookxi.rnc" {
db.orderedlist = element orderedlist {
db.orderedlist.attlist,
db.orderedlist.info,
db.all.blocks*,
db.listitem*
}
}
A more practical customization here might be to remove the
optional blocks from before the first list item, but that’s not an
example of making required elements optional.
Make optional elements required
Making optional elements required is very much the same as making
required elements optional. If you wish to require bibliography
elements to have a title, change the db.bibliography.info
so that it matches an info
element with a required,
db._info.title.req
.
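A minimal sketch of that change, using only the pattern names mentioned above and assuming the same docbookxi.rnc base schema as the other examples:

```rnc
include "docbookxi.rnc" {
  # Swap in the "title required" flavor of the info pattern so that
  # every bibliography must have a title.
  db.bibliography.info = db._info.title.req
}
```

This mirrors the chapter example in the previous section, with the substitution running in the opposite direction.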
For a more interesting example, consider that some style guides frown
on nested hierarchy elements without any intervening prose: a chapter
that begins immediately with a top-level section or a top-level
section that begins immediately with a second-level section.
If you examine the DocBook schema, you’ll find that the chapter
content model is defined by the db.chapter.contentmodel pattern.
That pattern is, in turn, defined as db.component.contentmodel, the
common content model for all “components” (roughly, elements at the
level of chapter).
The common content model for components is:
db.component.contentmodel =
db.navigation.components*,
db.toplevel.blocks.or.sections,
db.navigation.components*
This allows navigational components (indexes, tables of contents,
etc.) to appear at either the front or the back. Between them, “top
level blocks or sections”:
db.toplevel.blocks.or.sections =
(db.all.blocks+, db.toplevel.sections?) | db.toplevel.sections
DocBook has two independent section hierarchies, a numbered one
(sect1, sect2, …) and a recursive one (section). That’s captured
by the db.toplevel.sections pattern in a way that makes it easy to
choose either one or both.
Anyway, this deep in the maze, we can see that forbidding
immediately nested hierarchy elements for all components would require
a simple change to this pattern:
include "docbookxi.rnc" {
db.toplevel.blocks.or.sections =
db.all.blocks+, db.toplevel.sections?
}
If, for some reason, this change were necessary
only for chapters, more dramatic surgery would be
required. How we approach it depends on whether or not we
expect our customization layer to be further
customized.
The most direct method would be simply to redefine the content model
for chapters:
include "docbookxi.rnc" {
db.chapter.contentmodel =
db.navigation.components*,
db.all.blocks+,
db.toplevel.sections?,
db.navigation.components*
}
This is sufficient, but we’ve “unpicked” the pattern structure
significantly. We haven’t, for example, made it any easier to apply
this change to appendix elements later, if we need to.
Exercise for the reader: consider how you might make it easier for a
future customizer of your customization layer.
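One possible answer, offered as a sketch rather than a recommendation: give the stricter model its own named patterns so that later layers can reuse or adjust it. The local.* names are hypothetical and are not part of the DocBook schema; db.appendix.contentmodel is assumed to exist by analogy with db.chapter.contentmodel.

```rnc
# Hypothetical pattern names introduced by this customization layer.
local.blocks.then.sections = db.all.blocks+, db.toplevel.sections?
local.strict.component.contentmodel =
  db.navigation.components*,
  local.blocks.then.sections,
  db.navigation.components*

include "docbookxi.rnc" {
  db.chapter.contentmodel = local.strict.component.contentmodel
  # The same pattern could later be applied to other components, e.g.
  # (assuming the base schema names it this way):
  # db.appendix.contentmodel = local.strict.component.contentmodel
}
```

A future customizer can then redefine local.blocks.then.sections in one place instead of unpicking each component's content model again.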
Make optional elements forbidden
Structured editing tools that constrain authors to write valid
documents are wonderful. But one of the disadvantages of a broad,
standard schema is that editing tools will expose all of the
flexibility that the standard allows to your authors.
One of the easiest ways to make authoring easier is to remove all of the
things that you don’t want your authors to use. This is
straightforward in RELAX NG.
The flexibility to produce French books with tables of contents at the
back is wonderful. But if you don’t publish books in French, it’s just
extra cognitive load for your authors.
There are giant swaths of DocBook that you will probably never use
unless you write for a particular domain of hardware or software.
- Do you write about programming language APIs? No? Then you don’t need all the synopsis elements.
- Do you write about networking? No? Then you don’t need all the inlines about that.
- Do your documents have bibliographies? Both the raw and cooked forms? Ditch the one(s) you don’t use.
- Do you produce back-of-the-book indexes in markup? No? Then you don’t need indexentry and its descendants.
- Do your documents have mathematics? Flush the equation elements!
- Do your documents have admonitions? Q&A sets? Screenshots? Video? Audio? Drop all the blocks you don’t need.
- You don’t need msgset.
This’ll simplify your authoring environment:
include "docbookxi.rnc" {
db.synopsis.blocks = notAllowed
db.systemitem = notAllowed
db.biblioentry = notAllowed
db.indexdiv = notAllowed
db.indexentry = notAllowed
db.segmentedlist = notAllowed
db.equation = notAllowed
db.informalequation = notAllowed
db.inlineequation = notAllowed
db.math.inlines = notAllowed
db.admonition.blocks = notAllowed
db.videoobject = notAllowed
db.audioobject = notAllowed
db.screenshot = notAllowed
db.qandadiv = notAllowed
db.qandaentry = notAllowed
db.qandaset = notAllowed
db.msg = notAllowed
db.msgexplan = notAllowed
db.msgmain = notAllowed
db.msgrel = notAllowed
db.msgset = notAllowed
db.msgsub = notAllowed
}
Change the semantics of a component
DocBook, perhaps because of its history as an interchange
format, doesn’t attempt to bring a great deal of rigor to the
semantics of its elements. The reference documentation provides a
description of its intended semantics, as the DocBook designers
understood it, but those descriptions are often intentionally vague.
Saying that a productnumber is “a number assigned to a
product” is not especially precise.
There are a number of elements right down on the leaves of the
tree where a customization layer could impose stricter syntactic
constraints that would limit the opportunities for misunderstanding.
For example, pubdate could be restricted to an ISO 8601 date
or date-time. Similarly, if your organization has product numbers that follow
a predictable pattern, you could add a constraint to enforce that.
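A sketch of what such constraints might look like. It assumes the base schema defines db.pubdate.attlist and db.productnumber.attlist, following its usual naming convention, and the product number format shown is entirely hypothetical:

```rnc
default namespace = "http://docbook.org/ns/docbook"

include "docbookxi.rnc" {
  # Assumption: the .attlist patterns below exist in the base schema.
  db.pubdate = element pubdate {
    db.pubdate.attlist,
    (xsd:date | xsd:dateTime)
  }
  # Hypothetical format: three capital letters, a hyphen, four digits.
  db.productnumber = element productnumber {
    db.productnumber.attlist,
    xsd:string { pattern = "[A-Z]{3}-[0-9]{4}" }
  }
}
```

The W3C XML Schema date and dateTime types accept ISO 8601-style lexical forms, so they serve as a reasonable approximation of "an ISO 8601 date or date-time" without writing a regular expression by hand.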
Document the customizations
The RELAX NG grammar allows documentation to be combined with the
schema. Elements from other namespaces simply become ignored
annotations to the validator. In this way, DocBook prose, for example,
could be combined directly with the schema in a “literate
programming” style.
Unfortunately, this is fairly cumbersome in practice, in part because
the DocBook schema is authored in the compact syntax. The compact
syntax, as mentioned earlier, can be losslessly converted to and from
the XML syntax. However, the particular representation of arbitrary
XML in the compact syntax is, in a word, awful.
For example, consider this simple fragment of documentation in the XML
syntax:
<db:para>This is a <emphasis role="important">feature</emphasis>,
not a bug.</db:para>
In the compact syntax, it becomes this annotation:
db:para [
"This is a "
rng:emphasis [ role = "important" "feature" ]
",\x{a}" ~
"not a bug."
]
That’s not…practical.
As a result, the DocBook schema limits the embedded documentation
to the single-sentence summary of each pattern (its man page
refpurpose), and the description of enumerated attribute values.
For example, here are the patterns for the revisionflag attribute
and its enumerated values.
db.revisionflag.enumeration =
## The element has been changed.
"changed"
| ## The element is new (has been added to the document).
"added"
| ## The element has been deleted.
"deleted"
| ## Explicitly turns off revision markup for this element.
"off"
db.revisionflag.attribute =
[
db:refpurpose [ "Identifies the revision status of the element" ]
]
attribute revisionflag { db.revisionflag.enumeration }
The rest of the reference documentation is maintained separately, in
DocBook, and combined with the schema annotations through a fairly
complicated process of shaking and stirring.