From clay tablet to PDF
There appears to be a set of structural features common to
the majority of text documents that have become a part of the
way the human race has recorded textual information over the
millennia. As I have shown elsewhere, it is apparent that from
clay tablets to PDFs, we have slowly evolved various models of
a document that have many features in common
[Flynn14, Ch.1]. Part of this may be due to
the need — until recently — to agree upon a generalized physical
representation for the document that others would recognize, but
this could not have been done without there being a mental model
of the document to start from. It is not known if anyone
actually sat down at the dawn of writing, or even at the dawn of
printing, to decide that certain features are what makes up a
document,[1] but we can see evidence of such decisions in the
design of commands and structures in older markup systems such
as RUNOFF, Scribe, [S]GML, LaTeX, and others which inherit
their paradigms.
Strictly speaking, a document grammar (in the case of XML, for example, a DTD, W3C Schema, or RNG Schema) is a set of definitions and declarations for modeling a class of text documents. It defines the components of the documents it describes, as well as the rules governing their presence in the documents of that class [Tekli11]; a similar application has been noted in linguistics [Power03]. However, we are more concerned here with the document components themselves, and with the rules governing their arrangement, than with the expressive power of the particular grammatical notation used to describe them.
Core features
In comparing the features of text document markup
vocabularies for earlier research, the existence of a core set
of features became evident because it recurred in one form or
another in virtually every system examined. Not only were the
functions replicated, but the associations between them, and
the rules under which they operated, were extremely similar.
These features have been observed and discussed many times,
and are used as examples in our theories of document grammars,
but they do not appear to have been codified across multiple
instances of their occurrence. To test the feasibility of
codification, an experimental “Table of the Elements” fragment was constructed from a small
sample of document types of varying age and popularity [Table I], looking principally for obvious evidence
of common requirements such as metadata (principally the
document identity), hierarchical structure, non-hierarchical
categorization, and object reference. Although incomplete and
unrefined, the table showed the existence of some common
features, as well as numerous gaps.
Table I: (Non-Periodic) Table of the Elements from selected XML grammars (LaTeX has been included for […])
Feature | HTML | DocBook | DITA | TEI | 12083 | JATS | Briefing | Bulletin | LaTeX |
title | title | title | title | title | title | article-title | title | title | \title |
author | author | author | author | author | briefeditors | author | \author | ||
summary | abstract | shortdesc | abstract | abstract | abstract | abstract | abstract | ||
preface | preface | front | preface | \frontmatter | |||||
part | part | section | div|div0 | part | sec | \part | |||
chapter | h1 | chapter | section | div|div1 | chapter | sec | report | \chapter | |
section | h2 | sect1 | section | div|div2 | section | sec | story | section | \section |
subsection | h3 | sect2 | section | div|div3 | subsect1 | level3 | sub.section | \subsection | |
subsubsection | h4 | sect3 | section | div|div4 | subsect2 | level4 | \subsubsection | ||
appendix | appendix | appendix | afterwrd | \appendix | |||||
bibliography | bibliography | listBibl | biblist | Ref-list | biblist | thebibliography | |||
index | index | index | index | index | |||||
glossary | glossary | glossary | glossary | glossary | glosslist | glossary | |||
paragraph | p | para | p | p | p | p | para | ptxt | \par |
quotation | blockquote | blockquote | lq | quote | bq | block.quote | quotation | ||
numbered list | ol | orderedlist | ol | list | list | list | numberlist | list | enumerate |
bulleted list | ul | itemizedlist | ul | list | list | list | bulletlist | itemize | |
dictionary list | dl | variablelist | dl | list | deflist | list | defnlist | description | |
figure | img | figure | fig | figure | fig | fig | illus | figure | figure |
table | table | table | table | table | table | table | table | table | table |
mathematics | equation | formula | formula | mml:math | formula | $$ | |||
cross-reference | a | xref | link | ref | secref | xref | eiro.ref | \ref | |
bibliographic reference | a | biblioref | cite | ref | citeref | biblio | \cite | ||
external link | a | link | xref | ptr | weblink | external.ref | \hyperref | ||
emphasis | em | emphasis | emph | emph | emph | emph1 | \emph | ||
language | lang | foreignphrase | foreign | language.phrase | \selectlanguage |
From this data the features of a common grammar begin to emerge:
document models provide for self-labelling: in concrete terms, titles, authors, and other [meta]data within the document;
the models provide for an ordered hierarchical division of the information;
within those divisions, there is a non-hierarchical sequence of text-bearing components (and some for graphical content);
at the level of the discourse itself (text), there may be interspersed identifiers which describe relationships between objects or which signify some special quality to be observed, and which may themselves contain further text, identifiers, or signifiers.
I have so far avoided assigning the
conventional labels of markup theory or the names used in any
specific system to these features (element, attribute,
environment, etc; or title, para, etc). However, for practicality and convenience
in discussion, the grouping of the features in Table I corresponds with terminology commonly used:
metadata, hierarchy, pool, and flow.[2]
Standard Average?
The human race seems to like to categorize things. We do it on the basis of perception (loud|quiet, bright|dark, hot|cold), cognition (cheap|expensive, fast|slow, wet|dry), and even guesswork (bull|bear [market]) — ultimately it’s a survival trait (dangerous|harmless) [Lakoff90]. More experienced humans have more points on their scales: flooded|sodden|wet|damp|moist|dry|bone-dry|parched|desert, because it’s more useful that way. It’s also possible to measure on a sliding scale, for example 100%=flooded and 0%=desert, or any point in-between. But as most of us live neither under water nor in a desert, neither in perpetual daylight nor perpetual night, neither on top of a mountain nor at the bottom of a canyon, there is a tendency for most humans to have an affinity for somewhere between the extremes. This clustering, or central tendency, is a hallmark of natural behavior, and has been known since antiquity, although formalized in statistics only since the late 1600s.[3]
“Average” therefore seems to be an
appropriate way to describe the clustering observed in the way
in which documents are constructed (at least in SGML/XML
and LaTeX), even if it is not used in the strictly
mathematical sense required by statistics. There is a cluster
of recognizable types of information around the title and
author, another around the hierarchy, another around the pool,
and another around the flow.
The standards we use daily, whether formalized by ISO or
just accepted as patterns of behavior, have been formed from a
similar principle to the average: a degree of genericness or
commonality has been seen to be useful as a model because it
is representative or descriptive of the whole. In effect, we
are unconsciously applying the “duck test” of
abductive reasoning: if it [repeatedly] looks useful, it
probably is.
The suggested term “Standard Average
Document Grammar” is derived from (but entirely
unassociated with) the linguistic term “Standard Average
European”, coined in the late 1930s to describe a set
of grammatical similarities which characterize Indo-European
languages.[4] The term “Standard Average” on its
own has to some extent become a portmanteau phrase in everyday
language for acceptably common behavior.[5]
Feature set
The set of features for the derived grammar is expanded below, but we should first deal with what it does not describe.
There are many classes of document structures that do not or cannot follow a generic model but have their own: those which are too short to exhibit much in the way of structure; those which are intended as ephemeral or singular; and those which by convention of their nature require a specialist structure. But even amongst these, some of the features may be present, even if (for example) in the metadata rather than the text body.
The point of “standard” as described above is that such a grammar
should be able to cover enough of the spectrum to be a useful
pattern or model in a majority of cases, and that this
should be generally accepted by the user
community. There will nevertheless be some specific factors
which must be considered in testing this acceptance:
there must be broad agreement between users on semantics;
not all features have to be present: there can be rules about requirement and optionality;
if features are present, then they must be used in the manner generally accepted.
Naming is also important, and has been the
topic of much discussion over the years on XML-related mailing
lists. Not only are names a prerequisite of any concrete
instantiation, but we need them informally as handles during
discussion, so they may as well be meaningful in the language of
that discussion. This raises other linguistic and cultural
questions, but in essence we are simply requiring agreement that
the feature we refer to as a “title” is in fact the
title of a document (or section, or whatever) as commonly
understood, and not a mosquito or a bottle of beer.
Because of the traditional separation of concerns between logical and physical in dealing with document markup, the visual appearance of a grammatical feature is not generally relevant. However, for the purposes of usability and — as here — illustration, when features are given an appearance, it is common to use one of the widely-accepted styles.
The salient features of a Standard Average Document Grammar are summarized in Figure 1 to Figure 4. There may be disagreement over the presence or absence of some specifics, but enough of these appear to occur in enough instances of otherwise disparate types of document to make them worth including.
Figure 1: Identification
The features in Figure 1 are often regarded as metadata, as they typically stand outside the running text. It nevertheless seems to be accepted as part of the function of the grammar that it should label the document (title), link to an authority outside the document (author), and provide an overview or synopsis (summary).
Figure 2: Formation
The core structure of a document appears most commonly as a hierarchical nesting of divisions, with each level able to reoccur as siblings (Figure 2). As encoders of documents are well aware, this does not hold true for many early documents, and even for some contemporary ones, but it is sufficiently true elsewhere for it to be useful as a model, and is sometimes imposed upon otherwise unstructured or semi-structured documents to make them usable in conventional modern contexts. In formally-published documents, especially books, there is usually material preceding and following the hierarchical structure (prefaces, forewords, indexes, appendices).
Figure 3: Text Content
While the function of a hierarchical structure is to provide a referential framework within which the author can develop or express an argument (at the least, something like introduction, exposition, analysis, and conclusion), the text itself uses a set of building-blocks to present that argument (Figure 3), of which a small subset seems to be widely used.
The most basic seems to be the paragraph (a novel consists largely just of these and nothing else apart from chapter headings).
A list is a collection of thoughts or topics in some way related by order or concept.
Tables and figures are ways of expressing or relating more complex collections of information in such a way that they do not interrupt the flow of the argument but remain available for consultation.
Images and other notations (mathematics, music) are specialist ways of presenting collections of information that cannot reasonably be given in normal textual form because they need their own language.
Quotations are arguably a form of external link (see Figure 4), but reproduce the content of the target verbatim so that it becomes part of the author’s argument.
The critical point about these building-blocks is that they occur and reoccur many times. While the components of the hierarchical structure which contain them may reoccur as often as needed as siblings (that is, at their own level), they cannot occur out of depth (that is, you cannot have a subsubsection as a child of a chapter), whereas the building-blocks of content can occur and reoccur at any level within the hierarchical structure. Whatever the constraints imposed by the hierarchical model, this distinction seems to be a key aspect of document grammars.
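The distinction between depth-bound divisions and freely occurring building-blocks can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the level names and the pool names are borrowed from the feature column of Table I rather than from any one grammar, the tree representation (a name paired with a list of children) and the function name check are hypothetical, and the rule that a division may only contain the next level down is a deliberately strict reading of the depth constraint.

```python
# Hypothetical level names, ordered from shallowest to deepest; they follow
# the feature names in Table I, not the element names of any one grammar.
LEVELS = ["part", "chapter", "section", "subsection", "subsubsection"]

# Pool (block) components, which may appear inside a division at any depth.
POOL = {"paragraph", "list", "table", "figure", "quotation"}


def check(node, parent_depth=-1):
    """Return a list of depth violations found in a (name, children) tree."""
    name, children = node
    if name in POOL:
        # A building-block is acceptable at whatever depth it occurs.
        return []
    depth = LEVELS.index(name)  # raises ValueError for an unknown name
    errors = []
    # Assumption: a division may only hold divisions exactly one level down.
    if parent_depth >= 0 and depth != parent_depth + 1:
        errors.append(f"{name} cannot be a child of {LEVELS[parent_depth]}")
    for child in children:
        errors.extend(check(child, depth))
    return errors
```

Under these assumptions, a chapter containing paragraphs and sections yields no violations, while a chapter with a subsubsection as a direct child is reported, even though paragraphs are accepted at every depth.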
Figure 4: Reference
Unlike the other features in Figure 1 to Figure 3, where at least one of them must be present, otherwise you have no document at all, the reference features are entirely optional, and are used at the author’s discretion according to sense (Figure 4).
In the detail of running text, there may be a need to link components within the document for reference or to link to other documents elsewhere. While these features perform a closely related function, an internal reference can be checked immediately, so it is dependent, whereas a link to another document is independent, as it cannot be known at the time of writing if the reader will have access to the document concerned.
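The sense in which an internal reference is dependent, because it can be checked immediately, can be sketched as follows. This is a minimal illustration using Python's standard xml.etree.ElementTree module; the element name xref, the attribute names target and id, and the function name unresolved_refs are assumptions made for the example and are not taken from any particular grammar. External links are deliberately left unchecked, since their targets lie outside the document.

```python
import xml.etree.ElementTree as ET


def unresolved_refs(xml_text, ref_tag="xref", ref_attr="target", id_attr="id"):
    """Return targets of internal references that match no id in the document."""
    root = ET.fromstring(xml_text)
    # Collect every identifier declared anywhere in the document.
    ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    # An internal reference is unresolved if its target is not among them.
    return [
        el.get(ref_attr)
        for el in root.iter(ref_tag)
        if el.get(ref_attr) not in ids
    ]


# Hypothetical sample: one resolvable reference, one dangling reference,
# and one external link, which this check does not attempt to verify.
SAMPLE = """<doc>
  <section id="intro"><p>See <xref target="method"/> and <xref target="results"/>.</p></section>
  <section id="method"><p>Text, with a <link href="http://example.org/">link</link>.</p></section>
</doc>"""
```

Calling unresolved_refs(SAMPLE) reports only the dangling internal target; whether the external link can be followed is not something the document itself can establish.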
Signifiers are ways to express some special nature of a feature, so that it takes on a quality which impresses itself on the reader. Emphasis and terminology are probably the most frequently used in continuous text; specifiers of sequence occur in structures like numbered lists and the titles of sections.
Adopt, Adapt, Build[6]
In this author’s experience, the core set of features, or one very similar, is where most concrete instantiations of document grammars appear to have started, as far back as the days of SGML DTDs. Additional features, and deviations from the norm, are legion, and may be specialist within a field or topic, or introduced for practical, technical, or political reasons — it is these which distinguish one implementation from another. The ease (or otherwise) with which a particular type of document can be modified seems to depend largely on the original authors’ intentions:
some structures are designed to be modified, and therefore provide facilities for doing so, such as parameterization;
some certainly can be modified, and occasionally are, but it’s a big effort and it’s usually easier to put up with the occasional semantic mismatch;
some are not intended to be modified at all.
Not all parts of a document grammar may be equal to the task: in some cases it may be hard to modify the metadata but easy to modify the hierarchy; in others the reverse. There is also significant debate (not a part of this analysis) about the extent to which modifications should allow or deny a user the right to continue to claim that they are [still] using the type of document they started with.
The simplest use case is no changes. This implies that the requirements of the documents to be created or encoded are identical to those envisaged by the creators of the grammar, or at least so similar that the differences can be ignored. Using an existing document grammar in this way, without any modification at all, seems to this author to be relatively rare in the long run, with some specific exceptions noted below; but collecting hard data on numbers would be difficult to undertake. Certainly it makes an excellent starting-point for those with no history of structured-document usage, but the process needs to be managed in order to avoid rejection because of unexpected conflicts between the provisions of the grammar and the view that users have of their own document types.
One obvious exception is a need to adhere to a de facto standard, and HTML is the most prominent example. It is something of a special case because it was implemented by software (browsers and editors) that ignored or even encouraged syntactic errors. While XHTML and HTML5 are sometimes now well-formed, the uncounted millions of earlier HTML web pages remain in use and are likely to do so for the foreseeable future. HTML itself has been adapted on occasions for specialist use, but usually just in restricted forms like the subset of XHTML used in EPUBs rather than extending the grammar in other directions; and this author (and separately, the ISO HTML committee) did produce versions which used a hierarchical structure in the body of the document.
Another exception is the mandated use of specialist document types in a vertical market such as a single industry. The success of many industrial document types relies either on agreement that their use between companies in their industry is, effectively, grammatically identical, or it relies on an obvious advantage such as common software.
JATS, for example, while parameterized and open to
modification, is seldom changed much except by very large
organizations (and even then mostly only in the metadata)
because significant change would break the shared model of
“an article” in journal publishing, as well as
the toolset. However, some extensive modification has been
done to produce BITS (book interchange) and NISO STS
(standards), but these are more in the nature of forks or
full-scale derivatives.
Three commonly-adapted grammars are TEI, DocBook, and DITA. All provide extensive facilities for adaptation, implemented in different ways, and all can generate DTDs, W3C Schemas, or RNG Schemas.
TEI is generated by the ODD system (One Document Does it all), and user modifications can be created via the Roma web tool by adding features to a minimal core or subtracting them from an all-inclusive version. More specialist modifications can also be done manually by creating customized ODD files and generating the schema afresh.
DocBook is maintained in RNG, and features (specified as RNG patterns) can be selectively disabled and enabled in a customization layer, and additional features introduced. The documentation is careful to distinguish between creating subsets, which remain valid DocBook instances, and extensions, which can no longer be called DocBook [Walsh16b].
DITA is maintained in RNG and allows for adding and removing topic or element types, as well as applying effectivities (conditionalizations). Specializations can be managed centrally by the sponsoring agency which maintains the standard (OASIS) or locally by users or industry groups.
Despite enquiry, I have failed to identify any modified
version of any of these three which has involved changing any
of the element type names shown in Table I, or their structure relative to one another.
Additions and exclusions occur in more specialist areas, as
noted above, but the basic grammar of a hierarchical structure
containing sequences of text blocks containing mixed text and
referential signifiers appears to satisfy that particular core
of demand for what constitutes a “document”.
However, from discussions among developers of document types and classes (for example, on the TEI, DocBook, HTML, XML, LaTeX, and other related forums), it is clear that there have been questions of structural relationships and content modeling in the grammar at the design level which appear largely to have been resolved, at least within the encoding communities served by each system. A few examples:
Should further discursive block-level content be permissible after the close of the last hierarchical child in a hierarchical container?
After the end of the last child division in a DocBook chapter? Yes, but limited to simplesect;
After the end of the last child division in a TEI div0? No, perhaps oddly, given that the TEI is designed to be able to model historical documents which often do not conform to rigid modern hierarchical structures;
After the end of the last child division in an HTML5 div? Sure, no problem.
Should hierarchical containers be numbered (by level) or not?
DocBook provides names for Parts and Chapters but sections within them are numbered by level; but there is an unnumbered section which can be used instead;
TEI provides level-numbered divisions and keeps naming to attributes; but it too provides an undistinguished div;
ISO 12083 names the components down to the section level but numbers the levels beneath;
HTML and others simply use recurrent containers of the same name at all depths.
To what extent should block-level (pool) components occur within themselves, alongside normal unmarked text?
Not at all: TEI (in SGML, one of the most notable victims of “pernicious mixed content”);
Within limits: DocBook (not those with complex internal structure);
Go for it: HTML (as implemented).
(Some systems — Microsoft Word, for example — go to extreme lengths to avoid mixed content entirely.)
Is it the responsibility of the grammar to describe or prescribe the possible types of content of a document?
TEI is largely descriptive, in that it was designed to cope with the planet’s literary, historical, and cultural Nachlaß;
DocBook is mildly prescriptive (no lists in an Abstract, for example);
Specialist grammars can be almost completely prescriptive in structure, although rarely in text content.
The degree to which the chosen grammar offers acceptable constraints, or fails to offer sufficient descriptive accuracy, will largely determine the level of adaptation needed. This is not a failing on either side, simply an acknowledgement that both sides are close enough to the standard average to get along together except for a few areas where they need to go their own way.
The decision to write your own document type or class — to design your own grammar, often from scratch — seems to me to be less common than before, when the public offerings were more limited, document-grammar analysis skills were rare, and a full understanding of ISO 8879 itself rarer still. Specialist requirements continue to mean that vertical-market document type grammars will still need to be written. Maler and el Andaloussi (1999) and others are clear about the commitment of time and effort required to undertake the task at an industrial level, but there must be many hundreds, possibly thousands, of personal or localized schemas originally written for ad hoc purposes which have become embedded into workflows and still continue to function.
In the original analysis for this paper, four small examples were used: EIRO Bulletin and Croner Briefing, which appear in Table I because they show some commonality with the rest; and BiBTeXML and Daybook, which have no correlation with the Standard Average Document Grammar.
Bulletin
This was written for the publishing workflow of a European Union labor research institution. The design is not easily extensible: it has an abbreviated hierarchy and pool, simply enough for the practicalities of publishing; and a curious selection of inline signifiers aimed at the requirements of the publishing process which needed to be able to identify many different aspects (locations, organizations, people, documents, and three different styles of emphasis) for indexing and retrieval as well as visual formatting.
Briefing
Croner Publications had this developed for a frequently-issued series of business briefings. There is a simple hierarchical structure, but it is remarkable for the pool having 12 different element types for lists (surely some kind of record). There is a significant amount of metadata for document control in a publishing workflow, even for a relatively small unit of writing. Some of the inlines are clearly designed to be retro-fitted after formatting (position and page number).
BiBTeXML
This shows one possible way of tackling the naming problem when the field is (by design) very narrow. It would, of course, have been perfectly possible to encode the referenced document types (eg […]). The designers opted for the more pragmatic route of constraining the content model with an element type for each referenced document type, so that the element types available within them reflect exactly those a user would expect from any other interface to a BiBTeX file. This is in some ways an exercise in obviousness: part of the solution in usability is sometimes making the affordances so obvious that training is minimized.
Daybook
This was designed for the transcription of parliamentary proceedings. Legislative records not only have to be exact (perhaps in some jurisdictions even when the truth has been redacted) but for retrieval, an attempt has to be made to represent the class of material being debated, so there are element types for General Debate, Oral Answers, Written Answers, and Private Notice Questions. They can be nested, so the structure is discrete; class within class, rather than hierarchical in the normal chapter–section–subsection manner.
Ultimately, the “write or adapt” decision has
to be made on many grounds: accuracy, practicality, security
(independence), ease of use, speed, convenience, software
availability, skill requirements, and others. Not all of these
can necessarily be measured directly with money: there may be
less-quantifiable aspects such as human relations and
organizational politics involved.
Drawing the line
If there is anything we can learn from a Standard Average Document Grammar, it seems to be that it’s a convenient term for a phenomenon which needs more accurate measurement. One way of looking at it would be to pursue the pseudo-statistical theme and construct values for concrete use cases, with their distance from the theoretical SADG as a measure of divergence.
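One way such a distance could be expressed is sketched below in Python. It is a hedged illustration rather than a proposed metric: the Jaccard distance between feature sets is simply one convenient choice, the reference set is a hypothetical stand-in for the SADG built from feature names in the first column of Table I, and the function name divergence is an assumption of the example.

```python
# Feature names from the first column of Table I, standing in for the SADG.
# Which features belong in the reference set is itself an assumption.
SADG = frozenset({
    "title", "author", "summary", "part", "chapter", "section", "subsection",
    "paragraph", "quotation", "numbered list", "bulleted list", "figure",
    "table", "cross-reference", "external link", "emphasis",
})


def divergence(features, reference=SADG):
    """Jaccard distance: 0.0 means identical feature sets, 1.0 means disjoint."""
    features = frozenset(features)
    union = features | reference
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(features & reference) / len(union)
```

On this measure a grammar that provides every reference feature scores 0.0, one that omits a single feature scores slightly above zero, and one that shares no features at all, as with a grammar built purely around classes of parliamentary business, scores 1.0. A more serious measure would presumably weight features and take account of their structural relationships, not merely their presence.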
When an organization or individual considers using an
existing document grammar, there will eventually be a pain point
at which they in effect say, “No, that really isn’t how we
see things here, we need something closer to how we […].”
From that point on, it’s a case of adaptation:
new names, perhaps, or a new structure, or an extended or
contracted content model. If such a fork is public, it may
attract additional users, particularly if it is designed for a
vertical market. Takeup and the amount of divergence from the
original can be measured.
Some will never get to that point, and will use an existing grammar unadapted, or perhaps with only the most trivial of changes to, say, attribute value lists. In these circumstances, we are effectively adding to the number of use cases at the mode (the most commonly-occurring value of an average).
Those who elect to build their own grammar are in effect initially located beyond some as yet undetermined measure of deviation, although if the resulting structures end up bearing enough similarity to the SADG, the grammar may be considered to have added to the base of contributory systems.
In this author’s experience, the adaptations of existing
grammars are undertaken for multiple reasons, but often related
to “not enough” or “too many” […] “not what we call it”:
insufficient or over-complex metadata requirements (some people need more, others need less);
too many or too few restrictions on the formation of the hierarchy: a modeling mismatch with the way the organization or individual works;
missing or excessive provision for pool components which lie at the heart of structured document writing and editing;
similar problems with the inline flow components.
Cutting back on the richness of some of the standard offerings is likely to ease editing complexity, but there can also be extra work if some components are named in a way that causes ambiguity or uncertainty in the circumstances of use. When this reaches frustration point among document users, there may be a rise in tag abuse or other inaccuracy, leading to calls for adaptation or writing a new grammar.
Given that the creation of a new document grammar and new document type or class is non-trivial, it would be useful to have some measure of how far off-piste you have to be to justify it.
[Mark Clifton’s 1952 story] Clifton, Mark (1952) “Star, Bright”. Galaxy Science Fiction, 4:4 (July), World Editions (Edizione Mondiale), New York, NY.
[Flynn14] Flynn, Peter (2014) Human Interfaces to Structured Documents, PhD Thesis, University College Cork, Cork, Ireland.
Kosek, Jirka (2017) “Improving validation of structured text”. In Proc. XML London 2017, University College London, June 11–12, pp.56–67.
[Lakoff90] Lakoff, George (1990) Women, Fire, and Dangerous Things. University of Chicago Press, Chicago, IL, 9780226468044.
[Maler and el Andaloussi (1999)] Maler, Eve; and el Andaloussi, Jeanne (1999) Developing SGML DTDs: from Text to Model to Markup. Prentice-Hall, Upper Saddle River, NJ, 0-13-309881-8.
[Oppenheim67] Oppenheim, A Leo (1967) Letters from Mesopotamia: Official, Business, and Private Letters on Clay Tablets from Two Millennia. University of Chicago Press, Chicago, IL.
[Power03] Power, Richard; Scott, Donia; and Bouayad-Agha, Nadjet (2003) “Document Structure”. In Computational Linguistics 29:2, p.223 et seq.
[Southall (1989)] Southall, Richard (1989) “Interfaces between the Designer and the Document”. In André, Jacques; Furuta, Richard; and Quint, Vincent; Structured Documents, CUP, Cambridge, England, pp.119–131.
[Tekli11] Tekli, Joe; Chbeir, Richard; Traina, Agma JM; and Traina Jr, Caetano (2011) “XML document-grammar comparison: related problems and applications”. In Central European Journal of Computer Science (Springer, Versita), 1:1, pp.117–136.
[OMalley64] Vesalius, Andreas (1554) Letter to Johannes Oporinus. In O’Malley, Charles Donald (1964) Andreas Vesalius of Brussels, 1514–1564. University of California Press, Berkeley CA (text at, retrieved May 2017).
[Walsh16a] Walsh, Norman (2016) “Underlying Technologies”. In XML and Publishing, XML Summer School, St Edmund Hall, Oxford, p.19.
[Walsh16b] Walsh, Norman (2016) “Customizing DocBook”. Ch.5 in Publishing DocBook Documents.
[1] Although in the first case, the authors of clay-tablet business documents do appear to have settled on shared modes of expression [Oppenheim67]; and in the second case, Vesalius came fairly close [OMalley64].
[2] The terms “pool” and “flow” are taken from the design conventions of Document Type
Definitions as used in SGML and XML: Maler and el Andaloussi (1999) derive them from an Open Software
Foundation DTD design committee. They are in widespread
use and occur in the specifications for both DocBook and
HTML, although they appear much earlier under the terms
[…], “containment”, and […] in Southall (1989). The terms “blocks”
(for pool) and “inlines” (for flow) are also
in common use.
[3] The word “average” derives from the Latin
havaria, which was the sharing of the
expense of lost cargoes between shipping merchants which
ultimately gave us the concept of insurance.
[5] As in Mark Clifton’s 1952 story about the father of
an exceptionally bright young daughter warning her
against feigning stupidity in order to be accepted in
school: “Now, look,” I cautioned, “don’t overdo it. That’s as bad as being too quick.
The idea is that everybody has to be just about standard
average. That’s the only thing we will
tolerate. […]”