History
DocBook has a long history. It began in the early 90s as an SGML DTD. At the time, there were several commercial Unix vendors and O’Reilly Media (then O’Reilly & Associates) had built a successful publishing business supplying technical books about the Unix ecosystem.
One aspect of the Unix system, the man page, contributed significantly to the documentation of Unix and its commands, tools, APIs, and subsystems. Anecdotally, the origin story for DocBook begins with some of the Unix vendors, in cooperation with O’Reilly and HaL Computer Systems, getting together to build an interchange vocabulary for man pages.
The man pages were not a differentiating factor for the vendors; they didn’t need or want them to be proprietary. All Unix systems shipped with largely the same man pages.
The idea was that each Unix vendor would have their own man pages (in troff documents with their own custom macros) but when they wished to exchange them, they would transform them into a common language and the recipient would transform them from the common language back into the local flavor of troff.
SGML was selected as the appropriate interchange format and DocBook development began. Much of the early design of DocBook was the result of extensive document analysis over the corpus of O’Reilly Unix-related books (including several collections of “man pages”, formatted in the house style, entire series of X-Windows books, and others). Markup was invented for each significant feature of the corpus.
From the very beginning, DocBook was a descriptive format, not a prescriptive one. If a structure could occur in a corpus of reasonable documents, it was (to the maximum extent possible) allowed.
Early development
Once there were several players, a maintenance organization was formed: the Davenport Group. Davenport met several times a year, stewarding the development of DocBook.
The Davenport Group established that the focus of DocBook would be computer hardware and software documentation: Unix man pages, software documentation, documentation about networking hardware, etc. That remains its focus today.
By the mid 90s DocBook had accumulated several years of managed but somewhat ad hoc growth. When Eve Maler joined the design team, she undertook to bring design principles to the structure of the DTD. Maler and Jeanne El Andaloussi developed the markup design philosophy, methodology, and techniques described in their book Developing SGML DTDs: From Text to Model to Markup during the development of DocBook 3.0.
The structure of DocBook still reflects that methodology, even though much has evolved.
Growth and popularity
By the late 90s and early 2000s, DocBook was very popular. It arrived on the SGML scene (relatively) early, it had meaningful element names, good documentation, and a reasonably complete set of open source stylesheets for transforming it into HTML and print.
Ironically, the Unix vendors had all gone away by this time and interchange had ceased to be the dominant use case for DocBook. The most common use case became, and remains to this day, simply to use DocBook as an authoring format, often with very little customization.
Versioning and copyright
From the very earliest days, the Davenport Group made distribution and reuse of the DTD as easy as possible. It was freely available, it was free to use, and organizations were free to customize it.
The rules that apply to customizations of DocBook are as simple as possible: do anything you like, but if you change it, don’t call it DocBook.
Many, many (many!) organizations began with DocBook and built their own systems by extension and subsetting. Many more simply cannibalized DocBook for the markup structures that they found useful. For example, the elements and attributes for indexing, or the table model (later formalized as the XML Exchange Table Model Document Type Definition), were often excised from DocBook and used in other systems.
DTD structures
Anyone familiar with the Maler and El Andaloussi methodologies will recognize the structure of the DocBook DTD. It makes extensive use of parameter entities to make customization (i.e., changing the schema without editing the original source files) as easy as possible.
Elements are divided into classes, with extensibility:
<!ENTITY % local.admon.class "">
<!ENTITY % admon.class
    "Caution|Important|Note|Tip|Warning %local.admon.class;">
Classes are accumulated into mixtures, also with extensibility:
<!ENTITY % local.component.mix "">
<!ENTITY % component.mix
    "%list.class; |%admon.class;
     |%linespecific.class; |%synop.class;
     |%para.class; |%informal.class;
     |%formal.class; |%compound.class;
     |%genobj.class; |%descobj.class;
     |%ndxterm.class;
     %local.component.mix;">
And the content models of individual elements are constructed by combining elements and mixtures:
<!ENTITY % admon.elements "INCLUDE">
<![ %admon.elements; [
<!ELEMENT (%admon.class;) - - (Title?, (%admon.mix;)+) %admon.exclusion;>
<!--end of admon.elements-->]]>
DocBook provided the additional feature that elements could be selectively excluded with a parameter entity (admon.elements in this case).
The modern day
Many members of the Davenport Group became central figures in the working groups at the W3C that led to the development of XML. In this period, development of DocBook languished. It was revitalized in 1998 when maintenance moved to OASIS as the first Technical Committee.
Conversion to RELAX NG
As the XML ecosystem evolved, namespaces became a significant feature of the landscape. The maintainers were eager that DocBook should continue to evolve and participate fully with the emerging standards (XLink, XInclude, etc.) that required namespaces.
DTDs are incapable of validating namespaced documents (in the general case), so moving to a new validation technology was necessary. Practically speaking, only two grammar-based choices presented themselves: W3C XML Schemas and RELAX NG. There was unanimity among the members of the Technical Committee that RELAX NG was a better fit for modeling the structure of prose documents.
Over the course of a couple of years, through a series of experimental releases and design reviews, a new set of RELAX NG based models was developed. These debuted in DocBook V5.0 in 2008.
A primary goal of the conversion was that as many valid documents as practical should remain valid. That is, a DTD-valid DocBook V4.5 document could be converted to a RELAX NG-valid DocBook V5.0 document simply by removing any SGML features used in the document and adding a namespace declaration (and normalizing mixed case, if necessary).
In converting the normative format from DTDs to RELAX NG, the OASIS Technical Committee decided that the DocBook schema should take advantage of RELAX NG features that would improve the constraints. While on its face this seems a very sensible approach, this decision casts a long shadow.
Unlike DTDs and W3C XML Schemas, RELAX NG allows ambiguity in content models. This is useful in a prose schema because it allows the vocabulary designer to more precisely model what users actually want to do.
In particular, DocBook’s descriptive rather than prescriptive nature leads to a lot of optionality. Consider a (highly!) simplified description of a DocBook book. This simplified book could be described as an optional table of contents, followed by optional chapters, followed by an optional table of contents. (In French publishing, tables of contents often come at the end.)
So the structure is:
toc?, chapter*, toc?
That’s perfectly reasonable and logical. No human being looking at that is troubled by it. And yet neither DTDs nor W3C XML Schema can express that content model because of its ambiguity: if you see a toc, you can’t determine (without looking ahead) if it’s the first one before chapters or the last one after absent chapters.
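As a rough sketch, the simplified book can be written directly in RELAX NG compact syntax (the syntax is introduced in the tutorial later in this paper). The content given to toc and chapter here is a placeholder invented for illustration; it is not DocBook’s real content model.

```rnc
# Hypothetical, highly simplified book; not the real DocBook model.
start = book
book = element book { toc?, chapter*, toc? }
toc = element toc { empty }
chapter = element chapter { text }
```

A RELAX NG validator accepts a book that contains a single toc and nothing else without having to decide which of the two optional positions that toc occupies. The corresponding DTD declaration, (toc?, chapter*, toc?), is the non-deterministic content model described above.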
One of the secondary goals of the transition to RELAX NG was that it should be possible to generate useful (though not normative) DTD and W3C XML Schema versions of the schema.
That turned out to be impractical.
DocBook is a RELAX NG grammar
DocBook is normatively defined by a RELAX NG grammar. The actual construction of the published schema from its sources is a somewhat complicated affair (Appendix A), but to the end user, DocBook is a monolithic RELAX NG grammar. The mechanisms that you have available to customize DocBook are precisely those afforded by RELAX NG.
What is RELAX NG: a brief tutorial
Very broadly speaking, RELAX NG is a language for performing pattern matching on trees roughly analogous to the way a regular expression is a language for performing pattern matching on strings.
A RELAX NG schema (or grammar) defines a set of patterns. A document is valid against that grammar if there exists a valid arrangement of those patterns that matches the document.
There are two syntaxes for RELAX NG, an XML syntax and a compact syntax. The two are entirely equivalent and it’s possible to translate losslessly between them. In the interest of space, and because many people find it more readable, this paper gives its examples in the compact syntax.
Let’s consider something smaller than DocBook to explore the way a RELAX NG grammar works.
Here are three patterns:
a = element a { empty }
b = element b { empty }
c = a|b
The first matches an empty a, the second an empty b, and the third anything that matches a or anything that matches b. It’s important to remember that validity is about pattern matching. Although it’s convenient to name patterns after elements, technically what matches c isn’t an a element or a b element, it’s an a pattern or a b pattern.
If RELAX NG were limited to matching empty elements without attributes, it wouldn’t be very useful! Let’s extend our example to add some attributes and content.
One way to do this is to extend an existing pattern with a new one. If you extend an existing pattern, you have to specify how your extension should fit into the current pattern: is it a new choice, or is it allowed to be interleaved anywhere in the existing pattern?
Here’s an example that extends the “a” pattern with a choice (signaled with “|=”):
a = element a { empty }
a |= element a {
  attribute priority { "high" | "highest" },
  empty
}
b = element b { empty }
c = a|b
Now a matches either an a element with a “high” or “highest” priority attribute or an a element with no attributes.
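The other way to combine definitions, interleave, is signaled with “&=”. The following is a minimal sketch only: the a.attlist pattern and the owner attribute are hypothetical names invented for this illustration and are not part of the running example.

```rnc
# Hypothetical names: a.attlist and the owner attribute are illustrations only.
start = a
a.attlist = attribute priority { "high" | "highest" }?
# Interleave a second optional attribute into the existing definition.
a.attlist &= attribute owner { text }?
a = element a { a.attlist, empty }
```

With this grammar an a element may carry either attribute, both, or neither. All the definitions of one name must be combined the same way; a single name cannot mix “|=” and “&=”.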
Writing an easily customized RELAX NG grammar is, in part, about making the patterns easily customizable. Making the a pattern an explicit choice between two element patterns isn’t the best approach. It would be easier to customize if we used different pattern names. This grammar is equivalent:
ordinary = element a { empty }
important = element a {
  attribute priority { "high" | "highest" },
  empty
}
a = ordinary | important
b = element b { empty }
Now a customization layer has the freedom to adjust, in ways we’ll come to in a moment, the ordinary and important patterns independently.
As these patterns stand, we can match either a single a element or a single b element. Let’s add a wrapper to hold a collection of elements:
doc = element doc { (a|b)* }
This pattern matches an element named doc that contains any number, including none, of things that match the a pattern or things that match the b pattern in any order.
The content model rules are straightforward, if you find regular expressions straightforward, and will be familiar if you’ve written DTDs.
- a matches exactly one a pattern.
- a? matches an optional (exactly 0 or 1) a pattern.
- a* matches zero or more a patterns.
- a+ matches one or more a patterns.
- (a,b), a sequence, matches an a pattern followed by a b pattern.
- (a|b), a choice, matches an a pattern or a b pattern.
- (a&b), an interleave, matches an a pattern and a b pattern, in any order.
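To see the operators side by side, here is a small sketch of a grammar that uses each of them once. The memo vocabulary is hypothetical, invented purely to exercise the operators; the start line names the permitted document element.

```rnc
# Hypothetical memo vocabulary, invented to exercise each operator.
start = memo
memo = element memo {
  title,                 # exactly one
  subtitle?,             # zero or one
  (note | warning)*,     # zero or more, freely mixed
  para+,                 # one or more
  (author & date)        # both required, in either order
}
title = element title { text }
subtitle = element subtitle { text }
note = element note { text }
warning = element warning { text }
para = element para { text }
author = element author { text }
date = element date { xsd:date }
```

The commas make the five groups a sequence, so the title must come first and the author and date last, while the interleave lets those final two elements appear in either order.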
Finally, RELAX NG requires that we enumerate the top level patterns that our document must match. This is not possible in DTDs and requires a certain amount of gymnastics in W3C XML Schema.
start = doc|a
Combining these patterns into a grammar, we get:
start = doc|a
doc = element doc { (a|b)* }
important = element a {
  attribute priority { "high" | "highest" },
  empty
}
ordinary = element a { empty }
a = ordinary | important
b = element b { empty }
This grammar matches documents that begin with a doc element or an a element. An a element may carry a priority attribute only if its value is “high” or “highest”.
With a schema this simple, it doesn’t seem impractical to stop here. Extending the important or ordinary patterns by redefining the entire element pattern wouldn’t be too burdensome.
That’s much less practical in a schema with hundreds of elements and attributes containing complex content models. Let’s rewrite this grammar in a way that more closely matches the overall pattern structure in the DocBook schema.
start = doc|a

doc.contentmodel = (a|b)*
doc.attlist = empty
doc = element doc { doc.attlist, doc.contentmodel }

high_priority = attribute priority { "high" | "highest" }
priority = high_priority

important.attlist = priority
important.contentmodel = empty
important = element a { important.attlist, important.contentmodel }

ordinary.attlist = empty
ordinary.contentmodel = empty
ordinary = element a { ordinary.attlist, ordinary.contentmodel }

a = ordinary | important

b.attlist = empty
b.contentmodel = empty
b = element b { b.attlist, b.contentmodel }
This grammar validates exactly the same documents, but it’s much, much easier to customize as we shall see below.
RELAX NG allows you to create a grammar with reference to another, existing grammar. Suppose the schema above is accessible at the URI “base.rnc”. I can write a new grammar by reference:
# My custom schema
include "base.rnc" { }
We can add any patterns we like outside of the curly braces that follow the filename, “base.rnc”. Within those curly braces, the patterns that we specify will either augment or entirely replace patterns with the same names in the original, base schema.
As it stands, this is an uninteresting grammar that matches exactly the same things as the base grammar; all I’ve introduced is a comment. But from here we can begin to look at customizations.
First, observe that our base grammar allows an explicit high priority attribute, but doesn’t allow an explicit low or medium priority attribute. We can easily add such an attribute. Second, let’s add the requirement that a high priority element must have an ID.
# My custom schema
ordinary_priority = attribute priority { "low" | "medium" }
id = attribute xml:id { xsd:ID }

include "base.rnc" {
  ordinary.attlist = ordinary_priority?
  important.attlist = priority & id
}
This grammar defines a new pattern to match a low or medium priority attribute and extends the definition of ordinary.attlist to include it. By making the pattern optional (“?”), we still allow ordinary elements without the new attribute.
The important.attlist is defined to interleave a required ID attribute. RELAX NG incorporates all of the W3C XML Schema data types, so we can define it to be an xsd:ID. Since attributes are unordered in XML, it’s natural to interleave them. (But putting a comma between them would have the same effect: you cannot make order matter even if you technically make the attributes a sequence in your grammar.)
Next, let’s imagine that we want to add an “emergency” priority. There are, in fact, several ways that we could do this.
We could redefine the high_priority pattern:
include "base.rnc" {
  high_priority = attribute priority { "high" | "highest" | "emergency" }
}
Or we could extend it:
emergency_priority = attribute priority { "emergency" }

include "base.rnc" {
  priority = high_priority | emergency_priority
}
At this point, you might wish that the base schema had defined a pattern for the list of values:
high_priorities = "high" | "highest"
high_priority = attribute priority { high_priorities }
Then our customization could simply be:
include "base.rnc" {
  high_priorities = "high" | "highest" | "emergency"
}
Schema designers have to strike a balance between complexity (make everything a pattern) and maintainability. Invariably, it will be the case that for some customizations, you’ll wish there had been another pattern in the base schema.
RELAX NG offers a special pattern called notAllowed that allows us to remove things in a customization layer. Suppose, for example, that we want to remove the notion of priority from this schema entirely:
include "base.rnc" {
  priority = notAllowed
}
Pattern annotations
It is possible to add annotations to patterns in RELAX NG. This can be
used to elaborate the grammar. For example, while RELAX NG has no
provision for default attribute values, there is a defined vocabulary
of annotations for this purpose. We can use these annotations to make
“medium” the default value of the priority attribute:
namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"

ordinary_priority =
  [ a:defaultValue = "medium" ]
  attribute priority { "low" | "medium" }

include "base.rnc" {
  ordinary.attlist = ordinary_priority?
}
Note that this default value will be added by a processor that understands and interprets the compatibility annotations in addition to performing RELAX NG validation. An “ordinary” validator will simply ignore them.
Another common use of annotations is for documentation. Also in aid of documentation, RELAX NG allows patterns to be grouped within div elements.
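As a sketch of both features, the compact syntax provides “##” documentation comments, which become documentation annotations on the definition that follows them, and a div block for grouping. The wording of the comments below is invented for illustration.

```rnc
start = a

# A div groups related definitions; it has no effect on validation.
div {
  ## An a element without attributes.
  ordinary = element a { empty }

  ## An a element that carries a priority attribute.
  important = element a {
    attribute priority { "high" | "highest" },
    empty
  }

  ## Either flavor of a.
  a = ordinary | important
}
```

An ordinary “#” comment is discarded when the grammar is converted to the XML syntax, whereas a “##” comment survives as an annotation that documentation tools can read.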
The DocBook RELAX NG grammar
The DocBook RELAX NG grammar continues to reflect the structure established in the DTD. Many patterns are defined simply to establish mixtures that will later be used in content models:
db.nopara.blocks =
  db.list.blocks
  | db.wrapper.blocks
  | db.formal.blocks
  | db.informal.blocks
  | db.publishing.blocks
  | db.graphic.blocks
  | db.technical.blocks
  | db.verbatim.blocks
  | db.bridgehead
  | db.remark
  | db.revhistory
db.para.blocks = db.anchor | db.para | db.formalpara | db.simpara
db.all.blocks = db.nopara.blocks | db.para.blocks | db.extension.blocks
A consistent arrangement of patterns is used to define each element.
Here, for example, is the set of patterns that define the para element:
[
  db:refname [ "para" ]
  db:refpurpose [ "A paragraph" ]
]
div {
  db.para.role.attribute = attribute role { text }
  db.para.attlist =
    db.para.role.attribute?
    & db.common.attributes
    & db.common.linking.attributes
  db.para.info = db._info.title.forbidden
  db.para =
    element para {
      db.para.attlist,
      db.para.info,
      (db.all.inlines | db.nopara.blocks)*
    }
}
It begins with a couple of annotations that are used for documentation purposes. What follows are typically:
- A pattern that defines the role attribute. In the base schema, these definitions are all the same. A distinct pattern is provided in each case because one common form of customization is to make the role attribute have a delimited set of values.
- A pattern that defines the attributes on the element. In the base schema this is often just a mixture of the role and common attributes.
- For block elements, a pattern for the info element. The info element is a generic wrapper for block-level metadata, things like title, author, and copyright. (Those of you with long memories may recall BookInfo and ChapterInfo from DocBook’s SGML DTD days. Those all got replaced with a single info element that has varying content models in the RELAX NG grammar.) Out of the box, this comes in several flavors. For para, the info element that cannot contain a title is used.
- Finally, there is the content model itself, most often a combination of elements and mixtures.
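To illustrate why each element gets its own role pattern, here is a sketch of a customization layer that limits the role attribute on para to a fixed list while leaving every other element’s role untouched. The three values are hypothetical, invented for this example.

```rnc
# The role values below are hypothetical; substitute your own house list.
include "docbookxi.rnc" {
  db.para.role.attribute = attribute role { "lead" | "summary" | "aside" }
}
```

Because db.para.attlist refers to the role pattern by name, nothing else in the definition of para needs to be restated.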
DocBook is a RELAX NG grammar
Customizing DocBook is, effectively, nothing more than applying the RELAX NG grammar features to the particular set of patterns that define the DocBook schema.
Removing elements or attributes
There may be nothing easier to do in a RELAX NG customization layer than remove things. Suppose, for example, you wanted to remove the revisionflag attribute. If you aren’t tracking changes in your DocBook sources, you don’t need it. At the element level, if your publishing system doesn’t support generated callouts, you don’t need the area element.
namespace db = "http://docbook.org/ns/docbook"
default namespace = "http://docbook.org/ns/docbook"

# ======================================================================
# docbookxi.rnc is the flavor of DocBook that has
# XInclude mixed-in in appropriate places.

include "docbookxi.rnc" {
  db.revisionflag.attribute = notAllowed
  db.area.units-enum.attribute = notAllowed
  db.area.units-other.attributes = notAllowed
}
Sometimes, as with the area element above, you have to make a few patterns notAllowed to do the job completely. In this case, there are two patterns, one with an enumerated list of values for the units attribute and one with a value of “other” for the units attribute and a required otherunits attribute.
Adding elements or attributes
To add something new, you have to do two things: create a pattern to match the new item and add that pattern to the appropriate mixtures. For example, here’s a customization layer that adds a new inline element, port. The author uses this customization in the documentation for XProc.
include "docbookxi.rnc"

# ======================================================================

db.markup.inlines |= db.port

db.port.role.attribute = attribute role { text }
db.port.attlist =
  db.port.role.attribute?
  & db.common.attributes
  & db.common.linking.attributes
db.port =
  element port {
    db.port.attlist,
    (db.programming.inlines | db._text)*
  }
There is no requirement that you name patterns following the conventions used in the DocBook schema, but doing so is likely to help you keep them organized in a similar way and will make them easier for other DocBook customizers to understand.
Make required elements optional
Changing content models is, necessarily, contextual. Making a title optional, for example, is just a matter of changing the pattern used for the relevant info element. This customization makes chapter titles optional:
include "/projects/docbook/docbook/relaxng/schemas/docbookxi.rnc" {
  # This pattern is db._info.title.req in the base schema
  db.chapter.info = db._info
}
The descriptive nature of DocBook has led to a schema without a lot of required elements. Books without chapters? Indexes without index terms? Check and check.
To pick an example where a less elegant customization is necessary, let’s consider ordered lists. In DocBook, they’re required to have at least one list item. Suppose we wanted to relax that requirement?
It happens that there isn’t a “.contentmodel” pattern for the content of orderedlist, so we’ll simply have to redefine the whole thing.
include "docbookxi.rnc" {
  db.orderedlist =
    element orderedlist {
      db.orderedlist.attlist,
      db.orderedlist.info,
      db.all.blocks*,
      db.listitem*
    }
}
A more practical customization here might be to remove the optional blocks from before the first list item, but that’s not an example of making required elements optional.
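For comparison, that more practical variant might be sketched as follows. It assumes the same attribute-list and info patterns as the redefinition above, and it declares the DocBook namespace so that the redefined element stays in it.

```rnc
default namespace = "http://docbook.org/ns/docbook"

include "docbookxi.rnc" {
  # Assumption: same attlist and info patterns as the redefinition above,
  # but no leading blocks and at least one list item.
  db.orderedlist =
    element orderedlist {
      db.orderedlist.attlist,
      db.orderedlist.info,
      db.listitem+
    }
}
```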
Make optional elements required
Making optional elements required is very much the same as making required elements optional. If you wish to require bibliography elements to have a title, change the db.bibliography.info pattern so that it matches an info element with a required title, db._info.title.req.
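A minimal sketch of that change, using only the two pattern names mentioned above:

```rnc
include "docbookxi.rnc" {
  # Require a title in the info element of every bibliography.
  db.bibliography.info = db._info.title.req
}
```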
For a more interesting example, consider that some style guides frown on nested hierarchy elements without any intervening prose: a chapter that begins immediately with a top-level section or a top-level section that begins immediately with a second-level section.
If you examine the DocBook schema, you’ll find that the chapter content model is defined by the db.chapter.contentmodel pattern. That pattern is, in turn, defined as db.component.contentmodel, the common content model for all “components” (roughly, elements at the level of chapter).
The common content model for components is:
db.component.contentmodel =
  db.navigation.components*,
  db.toplevel.blocks.or.sections,
  db.navigation.components*
This allows navigational components (indexes, tables of contents, etc.) to appear at either the front or the back. Between them, “top level blocks or sections”:
db.toplevel.blocks.or.sections =
  (db.all.blocks+, db.toplevel.sections?)
  | db.toplevel.sections
DocBook has two independent section hierarchies, a numbered one (sect1, sect2, …) and a recursive one (section). That’s captured by the db.toplevel.sections pattern in a way that makes it easy to choose either one or both.
Anyway, this deep in the maze, we can see that forbidding immediately nested hierarchy elements for all components would require a simple change to this pattern:
include "docbookxi.rnc" {
  db.toplevel.blocks.or.sections =
    db.all.blocks+, db.toplevel.sections?
}
If, for some reason, this change were necessary only for chapters, more dramatic surgery would be required. How we approach it depends on whether or not we expected our customization layer to be further customized.
The most direct method would be simply to redefine the content model for chapters:
include "docbookxi.rnc" {
  db.chapter.contentmodel =
    db.navigation.components*,
    db.all.blocks+,
    db.toplevel.sections?,
    db.navigation.components*
}
This is sufficient, but we’ve “unpicked” the pattern structure significantly. We haven’t, for example, made it any easier to apply this change to appendix elements later, if we need to.
Exercise for the reader: consider how you might make it easier for a future customizer of your customization layer.
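One possible answer, offered only as a sketch: name the stricter pieces once, outside the include, and point the chapter content model at them. The local.* pattern names are hypothetical, and the commented-out appendix line assumes that appendix has a contentmodel pattern named like the one for chapter.

```rnc
# Hypothetical pattern names: the local.* patterns are invented here.
local.blocks.then.sections = db.all.blocks+, db.toplevel.sections?
local.strict.contentmodel =
  db.navigation.components*,
  local.blocks.then.sections,
  db.navigation.components*

include "docbookxi.rnc" {
  db.chapter.contentmodel = local.strict.contentmodel
  # Later, assuming appendix follows the same naming convention:
  # db.appendix.contentmodel = local.strict.contentmodel
}
```

A further customization layer can then adjust local.blocks.then.sections in one place, or apply the strict model to other components with a single line each.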
Make optional elements forbidden
Structured editing tools that constrain authors to write valid documents are wonderful. But one of the disadvantages of a broad, standard schema is that editing tools will expose all of the flexibility that the standard allows to your authors.
One of the easiest ways to make authoring easier is to remove all of the things that you don’t want your authors to use. This is straightforward in RELAX NG.
The flexibility to produce French books with tables of contents at the back is wonderful. But if you don’t publish books in French, it’s just extra cognitive load for your authors.
There are giant swaths of DocBook that you will probably never use unless you write for a particular domain of hardware or software.
- Do you write about programming language APIs? No? Then you don’t need all the synopsis elements.
- Do you write about networking? No? Then you don’t need all the inlines about that.
- Do your documents have bibliographies? Both the raw and cooked forms? Ditch the one(s) you don’t use.
- Do you produce back-of-the-book indexes in markup? No? Then you don’t need indexentry and its descendants.
- Do your documents have mathematics? Flush the equation elements!
- Do your documents have admonitions? Q&A sets? Screenshots? Video? Audio? Drop all the blocks you don’t need.
- You don’t need msgset.
This’ll simplify your authoring environment:
include "docbookxi.rnc" {
  db.synopsis.blocks = notAllowed
  db.systemitem = notAllowed
  db.biblioentry = notAllowed
  db.indexdiv = notAllowed
  db.indexentry = notAllowed
  db.segmentedlist = notAllowed
  db.equation = notAllowed
  db.informalequation = notAllowed
  db.inlineequation = notAllowed
  db.math.inlines = notAllowed
  db.admonition.blocks = notAllowed
  db.videoobject = notAllowed
  db.audioobject = notAllowed
  db.screenshot = notAllowed
  db.qandadiv = notAllowed
  db.qandaentry = notAllowed
  db.qandaset = notAllowed
  db.msg = notAllowed
  db.msgexplan = notAllowed
  db.msgmain = notAllowed
  db.msgrel = notAllowed
  db.msgset = notAllowed
  db.msgsub = notAllowed
}
Change the semantics of a component
DocBook, perhaps because of its history as an interchange
format, doesn’t attempt to bring a great deal of rigor to the
semantics of its elements. The reference documentation provides a
description of its intended semantics, as the DocBook designers
understood it, but those descriptions are often intentionally vague.
Saying that a productnumber is “a number assigned to a product” is not especially precise.
There are a number of elements right down on the leaves of the tree where a customization layer could impose stricter syntactic constraints that would limit the opportunities for misunderstanding. For example, pubdate could be restricted to an ISO 8601 date or date-time. Similarly, if your organization has product numbers that follow a predictable pattern, you could add a constraint to enforce that.
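Both constraints might be sketched as follows. This is an illustration under stated assumptions: the pattern names db.pubdate, db.pubdate.attlist, db.productnumber, and db.productnumber.attlist are assumed to follow the naming convention shown earlier for para, the product number format is invented, and the XSD date and dateTime types (whose lexical forms are based on ISO 8601) stand in for “an ISO 8601 date or date-time”.

```rnc
default namespace = "http://docbook.org/ns/docbook"

include "docbookxi.rnc" {
  # Assumption: db.pubdate and db.pubdate.attlist follow the naming
  # convention shown earlier for para.
  db.pubdate =
    element pubdate {
      db.pubdate.attlist,
      (xsd:date | xsd:dateTime)
    }

  # Hypothetical house format: two capital letters, a hyphen, four digits.
  db.productnumber =
    element productnumber {
      db.productnumber.attlist,
      xsd:token { pattern = "[A-Z]{2}-[0-9]{4}" }
    }
}
```

Note that replacing mixed content with a datatype also forbids any inline markup inside these elements, which may or may not be what you want.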
Document the customizations
The RELAX NG grammar allows documentation to be combined with the schema. Elements from other namespaces simply become ignored annotations to the validator. In this way, DocBook prose, for example, could be combined directly with the schema in a “literate programming” style.
Unfortunately, this is fairly cumbersome in practice, in part because the DocBook schema is authored in the compact syntax. The compact syntax, as mentioned earlier, can be losslessly converted to and from the XML syntax. However, the particular representation of arbitrary XML in the compact syntax is, in a word, awful.
For example, consider this simple fragment of documentation in the XML syntax:
<db:para>This is a <emphasis role="important">feature</emphasis>,
not a bug.</db:para>
In the compact syntax, it becomes this annotation:
db:para [
  "This is a "
  rng:emphasis [ role = "important" "feature" ]
  ",\x{a}" ~
  "not a bug."
]
That’s not…practical.
As a result, the DocBook schema limits the embedded documentation to the single-sentence summary of each pattern (its man page refpurpose), and the description of enumerated attribute values. For example, here are the patterns for the revisionflag attribute and its enumerated values.
db.revisionflag.enumeration =
  ## The element has been changed.
  "changed"
  | ## The element is new (has been added to the document).
    "added"
  | ## The element has been deleted.
    "deleted"
  | ## Explicitly turns off revision markup for this element.
    "off"
db.revisionflag.attribute =
  [
    db:refpurpose [ "Identifies the revision status of the element" ]
  ]
  attribute revisionflag { db.revisionflag.enumeration }
The rest of the reference documentation is maintained separately, in DocBook, and combined with the schema annotations through a fairly complicated process of shaking and stirring.
Plus Schematron
One last observation. No set of grammatical constraints can conveniently capture all of the useful constraints of an authoring schema. DocBook uses Schematron rules, embedded in the RELAX NG grammar as annotations, to enforce a number of extra-grammatical constraints.
If you add structures that carry with them extra-grammatical constraints, you’d be wise to add Schematron rules for as many of them as practical.
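As a sketch of what such an embedded rule can look like in the compact syntax, the customization below adds one Schematron pattern as a grammar-level annotation. The house rule itself is hypothetical, the namespace shown is the ISO Schematron namespace, and how the annotations are extracted and run depends on your toolchain.

```rnc
namespace s = "http://purl.oclc.org/dsdl/schematron"

include "docbookxi.rnc"

# Bind the db prefix for use in the XPath expressions below.
s:ns [ prefix = "db" uri = "http://docbook.org/ns/docbook" ]

# Hypothetical house rule: a section may not have exactly one subsection.
s:pattern [
  s:title [ "Lone subsections" ]
  s:rule [
    context = "db:section"
    s:assert [
      test = "count(db:section) != 1"
      "A section must not contain exactly one subsection."
    ]
  ]
]
```

A plain RELAX NG validator ignores these annotations entirely, so the rule only takes effect when a Schematron processor is also part of the validation pipeline.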
Appendix A. Building DocBook
A reviewer commented that additional detail about the process by which a collection of source files becomes the DocBook RELAX NG grammar would be interesting. What follows is a summary of the process. If you want to see all the gory details, they’re publicly available in the DocBook repository at GitHub. The process begins down in the /relaxng/schemas/ directory.
The source files themselves are stored in a collection of logical modules (markup related to admonitions, sections, tables, technical content, general publishing information, etc.). The idea is that if you want to completely excise some module, you can simply construct your own driver file that omits it.
The goal of the assembly process is to transform a set of schema files organized to be convenient for authoring and generating documentation into a small, efficient RELAX NG grammar that can be used for validation. The process follows this basic plan:
- The RNC files are handy for authoring, but not actually useful for processing. The first step is to use trang to convert them all to XML RNG files.
- The set of schema files is composed into a single XML document. There’s provision at this level for a few special cases through the use of some custom “control” markup in another namespace in the RELAX NG grammars.
- Any patterns that are entirely unreferenced are excluded from the composed schema. Several published customization layers are subsets; removing the unused patterns from the base schema makes the customization layers smaller and easier to understand.
- Schematron validation is used for a number of extra-grammatical constraints. In the context of DocBook, many of these can be derived from the patterns themselves. The control markup has features for expressing this. For example:
  ctrl:exclude [ from="db.footnote" exclude="db.formal.blocks" ]
  This single control structure is transformed into a set of Schematron rules that forbid any element in the “db.formal.blocks” pattern from appearing as a descendant of (any element in the) “db.footnote” pattern.
- Another cleanup pass is performed to remove redundant inherited attributes and sort out namespace issues (the namespace for the control vocabulary is no longer needed at this point, for example).
- All of the markup annotations related to documentation are removed from the generated schema.
- Copyright messages are moved into the right places and updated with the build information (date, version number, etc.).
- The build process also runs a subset of the DocBook test suite. (Builds are published from an online continuous integration server and the full suite runs longer than the allowed time for jobs.) Locally, developers can and should run the whole test suite, of course.
That process produces docbook.rng ready for validation. Running trang again produces docbook.rnc.
A slightly longer and more elaborate process can be used to turn the sources into a fully elaborated, 24MB XML document that can be used to drive the process of building the documentation sources.
The current state of the art is that this process is neither well documented nor especially portable. It would be possible to leverage these tools for local customizations, but it would not be easy.