How to cite this paper
Viglianti, Raffaele. “One Document Does-it-all (ODD): a language for documentation, schema generation, and
customization from the Text Encoding Initiative.” Presented at Symposium on Markup Vocabulary Customization, Washington, DC, July 29, 2019. In Proceedings of the Symposium on Markup Vocabulary Customization. Balisage Series on Markup Technologies, vol. 24 (2019). https://doi.org/10.4242/BalisageVol24.Viglianti01.
Symposium on Markup Vocabulary Customization
July 29, 2019
Balisage Paper: One Document Does-it-all (ODD): a language for documentation, schema generation, and
customization from the Text Encoding Initiative
Raffaele Viglianti
Research Programmer
Maryland Institute for Technology in the Humanities (MITH) at the
University of Maryland
Raffaele Viglianti is a TEI Technical Council member and Research Associate at
the Maryland Institute for Technology in the Humanities (MITH) at the University
of Maryland, where he works on a number of digital humanities projects and is
the Technical Editor for the
Shelley-Godwin Archive. Raffaele’s research revolves around digital
editions and textual scholarship, with a focus on editions of music
scores.
Copyright ©2019 by the author. Used with permission.
Abstract
TEI, the Text Encoding Initiative, was founded in 1987 to develop guidelines for encoding
machine-readable texts of interest to the humanities and social sciences. The TEI
is a text-centric community of practice in the academic field of digital humanities,
operating continuously since the 1980s. The community currently runs several mailing
lists, holds an annual conference, and maintains an eponymous technical standard,
an online journal, a wiki, a GitHub repository, and a toolchain. The TEI Guidelines,
which collectively define an XML format, are the defining output of the community
of practice. The format differs from other well-known open formats for text (such
as HTML and OpenDocument) in that its main mission is to encode “extant” texts
such that they are amenable to scholarly processing. After a brief introduction to
the TEI, we will discuss the mechanisms built into the TEI for customization.
Table of Contents
- Introduction
- What is in an ODD
- Documentation
- A schema specification (or more)
- Specifications: a brief overview
- Modules
- Model Classes
- Attribute Classes
- Elements
- Datatypes
- Macros
- Constraints
- More literate programming
- Processing ODDs: zapping, sourcing, chaining
- Tools
- ODD for TEI interchange and beyond
- Acknowledgements
Introduction
The Text Encoding Initiative (TEI) began as an international research project in 1987,
with the goal of creating guidelines for the representation of texts in digital form.
While these guidelines are still its main focus today, the TEI has since evolved into a
non-profit consortium with numerous members and an elected body of individuals (the
Technical Council) who maintain and expand the Guidelines in response to the needs of
the community. TEI’s broad
mission of representing “text” has resonated in particular with the academic community,
libraries, and cultural heritage institutions, who have widely applied the TEI—and
consequently shaped it—as an instrument for online research, teaching, and preservation.
In 2007, the TEI released version P5 of the Guidelines and with it introduced a complete
revision of ODD, or One Document Does-it-all, the system for its documentation, schema
generation, and customization. This system makes use of literate programming principles
in order to keep the documentation, grammar, and constraint rules of the TEI all
together in the same TEI XML document. To achieve this, the large documentation text of
the Guidelines is encoded in TEI and is peppered with references to formal declarations
of elements, attributes, modules, and classes. These formal declarations are themselves
expressed using TEI elements, which allows the TEI’s processing tools to generate both
human-readable documentation and schemas in a variety of formats. These elements,
described in Chapter 22 of the Guidelines, can be used both to define and to customize
the various components of the TEI; this allows users to define and document
customizations, and to generate human-readable and machine-readable output.
Customizing the TEI is an essential step in the creation of a TEI project: the
specification is very large and using it all at once is discouraged. Indeed, the TEI
offers customization “exemplars” to users, including TEI Lite, TEI for Manuscript
Description, and jTEI (a customization for articles for the Journal of the TEI). Researchers using the TEI are recommended (most
often via workshops and other training sessions) to let their research questions drive
their customization design and create a subset that most closely addresses their needs.
Besides selecting a subset, customization in ODD allows encoders to add constraints
(such as limiting open attribute values) and to introduce extensive prose documentation
tightly coupled with formal declarations.
What is in an ODD
Because ODD is used both to define and to customize a markup vocabulary, it takes two
ODD files to tango: one to define the vocabulary (e.g. the whole of the TEI) and one to
customize it (e.g. TEI Lite). Both kinds of operation are performed with the same
element set, but a @mode attribute determines whether something is being added, changed
only in part, replaced, or explicitly removed. The absence of @mode means that
something new is being declared. The following subsections introduce what is in an ODD
by way of a brief introduction to some of these elements.
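As a rough sketch of how the four modes look in practice, the fragments below apply each of them to an element specification. The choice of elements, the <opus> element, and the namespace http://example.org/ns/myTEI are illustrative assumptions and are not taken from any TEI exemplar.

```xml
<!-- Hypothetical fragments of a customization ODD; element choices are illustrative only -->

<!-- mode="add": declare something that does not exist in the source vocabulary -->
<elementSpec ident="opus" ns="http://example.org/ns/myTEI" mode="add">
  <desc>an opus number assigned to a musical composition</desc>
  <classes>
    <memberOf key="model.phrase"/>
  </classes>
  <content>
    <textNode/>
  </content>
</elementSpec>

<!-- mode="change": keep the source declaration, overriding only the parts given here -->
<elementSpec ident="title" mode="change">
  <desc>contains the title of a musical work</desc>
</elementSpec>

<!-- mode="replace": discard the source declaration and use this one in its entirety -->
<elementSpec ident="label" mode="replace">
  <desc>contains a short label for a list item</desc>
  <classes>
    <memberOf key="att.global"/>
  </classes>
  <content>
    <textNode/>
  </content>
</elementSpec>

<!-- mode="delete": remove the declaration from the customization -->
<elementSpec ident="gap" mode="delete"/>
```

In change mode only the children that are supplied are overridden, which is why the second fragment can consist of a single <desc>.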
Documentation
As a TEI document, an ODD can contain extensive prose describing either a new
markup language or a customization. This human-readable documentation is typically
contained within the TEI <text> element, and can make use of the standard TEI
elements including those for divisions, headings, paragraphs, and snippets of
computer code.
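To make the shape of such a document concrete, here is a minimal sketch of the documentation side of an ODD file. The title, header text, and prose are invented for illustration; only the element names come from the TEI.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A hypothetical project customization</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished sketch, for illustration only.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Encoding verse</head>
        <!-- Ordinary TEI prose; gi and att mark up element and attribute names -->
        <p>Line groups are encoded with <gi>lg</gi>, and every group carries
          a <att>type</att> attribute.</p>
        <!-- Formal declarations are placed alongside or after prose like this -->
      </div>
    </body>
  </text>
</TEI>
```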
A schema specification (or more)
Specification and customization elements are contained by <schemaSpec>, on which a
number of top-level options can be set, such as the schema name, language, namespace,
and the possible root or outermost elements.
<schemaSpec ident="myTEI" start="TEI" ns="http://www.tei-c.org/ns/1.0">
  <!-- specification and customization elements go here -->
</schemaSpec>
Specifications: a brief overview
The specifications introduced below are defined and referenced using ODD elements
that share a similar structure. First, the names of these elements are formed by a
term plus “Spec”, such as <elementSpec> or <classSpec>. Elements that express
references to these specifications end in “Ref”, such as <elementRef> or <classRef>.
Other shared features include the @ident attribute to indicate the name of the object
being specified (reference elements name the object referred to with @key), and
documentation elements for providing descriptions and usage examples.
<*Spec ident="name">
<gloss>An expansion of the name, if necessary</gloss>
<desc>A description of this specification</desc>
<!-- definitions depending on the type of specification -->
<exemplum>
<!-- Examples of usage -->
</exemplum>
<remarks>
<!--Any further notes or comments about this specification-->
</remarks>
</*Spec>
Modules
A module provides a name for a set of other formal declarations, which other
specifications use to indicate their membership in that module and that module
alone (a specification can belong to only one module).
<moduleSpec ident="namesdates">
<desc>Additional elements for names and dates</desc>
</moduleSpec>
Modules are rarely changed, but a customization ODD will use <moduleRef> to indicate
which of them are to be included in the customization. This element is also equipped
with attributes to exclude or include element members; the following example includes
the whole “namesdates” module, but without <event> and <listEvent>:
<moduleRef key="namesdates" except="event listEvent" />
This element can also be used to bring in external schemata if
necessary.
<moduleRef url="svg.rng" />
A TEI customization needs four modules to be functional: tei, core, header, and
textstructure. TEI’s ODD processor does not enforce the presence of these modules,
however, and it is left to the user to make sure they are included.
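A minimal sketch of a customization that selects only these four modules might look as follows. The identifier myMinimalTEI is a made-up name; the comments summarize what each module contributes.

```xml
<schemaSpec ident="myMinimalTEI" start="TEI">
  <!-- infrastructure: classes, datatypes, and macros used by the other modules -->
  <moduleRef key="tei"/>
  <!-- elements available in all kinds of TEI documents (p, list, note, ...) -->
  <moduleRef key="core"/>
  <!-- the TEI header (teiHeader, fileDesc, ...) -->
  <moduleRef key="header"/>
  <!-- default text structure (TEI, text, body, div, ...) -->
  <moduleRef key="textstructure"/>
</schemaSpec>
```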
Model Classes
Model classes work similarly to modules, but only accept memberships from element
declarations (elements may be members of multiple model classes). A model class can
be referenced from within a content model, allowing all members of the model class
to appear at that point in the content model. When the class is referenced, one can
indicate cardinality and the order (alternation or sequence) of any or all members
of the class (see Elements below).
<classSpec module="tei" type="model" ident="model.segLike">
<desc>groups elements used for arbitrary segmentation.</desc>
<classes>
<memberOf key="model.phrase"/>
</classes>
</classSpec>
Customizations may change model classes to fine-tune class dependencies. Here is
an example that allows members of the model.segLike class to appear wherever members
of the model.addrPart class are allowed.
<classSpec module="tei" type="model" ident="model.segLike" mode="change">
<classes>
<memberOf key="model.phrase"/>
<memberOf key="model.addrPart"/>
</classes>
</classSpec>
Attribute Classes
Attribute classes declare and provide documentation for a set of attributes.
Elements and other attribute classes can inherit from
them.
<classSpec module="verse" type="atts" ident="att.enjamb">
<attList>
<attDef ident="enjamb" usage="opt">
<desc>indicates whether the end of a verse line is marked by enjambement.</desc>
<datatype>
<dataRef key="teidata.enumerated"/>
</datatype>
<valList type="open">
<valItem ident="no">
<desc>the line is end-stopped </desc>
</valItem>
<valItem ident="yes">
<desc>the line in question runs on into the next </desc>
</valItem>
<valItem ident="weak">
<desc>the line is weakly enjambed </desc>
</valItem>
<valItem ident="strong">
<desc>the line is strongly enjambed</desc>
</valItem>
</valList>
</attDef>
</attList>
</classSpec>
Customizations may adjust dependencies on other attribute classes and will often
update and constrain attribute values. The example below makes the @enjamb attribute
required (by default it is optional), changes its values to a particular preferred
terminology, and closes the list of values, thus disallowing any value that is not from
the specified preferred terminology. All of these changes are well within the original
specification of the TEI, which is quite permissive, but this case supposes a situation
where a mandatory and stricter version of @enjamb is required by a text encoding
project. Note how children of <classSpec> that do not need to change are not included
(such as <desc>). Note the use of @mode="replace" to override the attribute declaration.
<classSpec module="verse" type="atts" ident="att.enjamb" mode="change">
  <attList>
    <attDef ident="enjamb" usage="req" mode="replace">
      <valList type="closed">
        <valItem ident="endstop">
          <desc>the line is end-stopped</desc>
        </valItem>
        <valItem ident="light">
          <desc>the line is lightly enjambed</desc>
        </valItem>
        <valItem ident="heavy">
          <desc>the line is heavily enjambed</desc>
        </valItem>
      </valList>
    </attDef>
  </attList>
</classSpec>
Elements
The definition of elements includes their memberships to modules and classes,
attributes, and a content model
declaration.
<elementSpec module="tagdocs" ident="code">
<desc>contains literal code</desc>
<classes>
<memberOf key="model.emphLike"/>
</classes>
<content>
<textNode/>
</content>
<attList>
<attDef ident="type" usage="opt">
<desc>the language of the code</desc>
<datatype>
<dataRef key="teidata.enumerated"/>
</datatype>
</attDef>
</attList>
</elementSpec>
Content models can be defined using RELAX NG, or (preferably) using dedicated ODD
elements. There are a number of features available to organize the content model,
such as <alternate> and <sequence> to determine how the referenced elements can be
combined, and the @minOccurs and @maxOccurs attributes to set cardinality. Note that
each specification element (<moduleSpec>, <classSpec>, <elementSpec>) has a
corresponding reference element (<moduleRef>, <classRef>, <elementRef>).
<content>
<alternate>
<classRef key="model.pLike" maxOccurs="unbounded"/>
<sequence>
<elementRef key="summary" minOccurs="0" maxOccurs="1"/>
<elementRef key="msItem" maxOccurs="unbounded"/>
</sequence>
</alternate>
</content>
Model classes group elements by membership and typically do not impose a specific
order. When referenced, however, the @expand attribute can be used to override this
behavior. For example, the following content model boils down to ( p*, ab* ) rather
than the usual ( p | ab ).
<content>
<classRef key="model.pLike" expand="sequenceOptionalRepeatable" />
</content>
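For comparison, the sketch below lists the values that @expand can take and the content model each one would produce, under the same assumption as above that model.pLike has exactly two members, <p> and <ab>. The fragments are alternatives, not parts of a single content model.

```xml
<!-- Assumes model.pLike contains only p and ab in the schema being built -->

<!-- ( p | ab ): any one member may appear -->
<classRef key="model.pLike" expand="alternate"/>

<!-- ( p, ab ): each member appears once, in sequence -->
<classRef key="model.pLike" expand="sequence"/>

<!-- ( p?, ab? ): each member is optional, in sequence -->
<classRef key="model.pLike" expand="sequenceOptional"/>

<!-- ( p*, ab* ): each member is optional and repeatable, in sequence -->
<classRef key="model.pLike" expand="sequenceOptionalRepeatable"/>

<!-- ( p+, ab+ ): each member is required and repeatable, in sequence -->
<classRef key="model.pLike" expand="sequenceRepeatable"/>
```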
In a customization, including or removing an element is typically done when
selecting a module via the @include and @except attributes on <moduleRef>. However,
an element can also be included explicitly using <elementRef> inside a <schemaSpec>.
For example, to add the element <msItem> (manuscript item) without including the
manuscript description module:
<elementRef key="msItem" />
Or, to remove the <p> (paragraph) element without removing the core module (without
which TEI would make little sense), an <elementSpec> in delete mode can be used, since
<elementRef> does not carry a @mode attribute:
<elementSpec ident="p" mode="delete" />
More minute changes to elements are quite common in a TEI customization, ranging from
adjusting the description, to adjusting attribute values, to changing class
memberships. A typical operation is constraining attributes, for example the @type
attribute on the <div> (textual division) element. The @type attribute is derived from
<div>’s membership in the attribute class att.typed. Note the use of @mode="replace" to
override the declaration of the @type attribute inherited from att.typed.
<elementSpec ident="div" mode="change">
<attList>
<attDef ident="type" mode="replace">
<valList type="closed">
<valItem ident="chapter"/>
<valItem ident="section"/>
</valList>
</attDef>
</attList>
</elementSpec>
Entirely new elements can be added as well, though when customizing TEI, the
Guidelines require that new elements and attributes be added under a new namespace.
Membership in classes will determine where the element can go; for example,
model.phrase groups “inline” elements, so a new inline element can simply declare its
membership in that class.
<elementSpec ident="opus" ns="myTEI.example.org" mode="add">
<desc>The opus number or "work number" that is assigned to a musical composition</desc>
<classes>
<memberOf key="model.phrase"/>
<memberOf key="att.global"/>
</classes>
<content>
<textNode/>
</content>
</elementSpec>
Likewise, a “block” element could be part of the same model class as <div>
(model.divLike) or as <p> (model.pLike). When the new element is meant to be a child of
another specific element, the parent element’s content model will need to be changed.
For example, this is how the new element <opus>, presuming it is not a member of
model.phrase, could be added to TEI’s <title> element only.
<elementSpec ident="title" mode="change">
<content>
<alternate minOccurs="0" maxOccurs="unbounded">
<macroRef key="macro.paraContent"/>
<elementRef key="opus" />
</alternate>
</content>
</elementSpec>
Datatypes
Datatypes for attributes and other string content can be specified and used by
multiple declarations. W3C XML Schema datatypes can be referred to directly by name
and need not be redefined.
<dataSpec ident="teidata.pointer">
<desc>defines the range of attribute values used to provide a single URI, absolute or relative,
pointing to some other resource, either within the current document or elsewhere.</desc>
<content>
<dataRef name="anyURI"/>
</content>
</dataSpec>
When referenced, datatypes can be restricted to match a given regular
expression.
<!-- a fraction: -->
<dataRef name="token" restriction="(\-?[\d]+/\-?[\d]+)"/>
Datatypes can be changed by customizations, though it is more common to add or
change restrictions or introduce entirely new datatypes.
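As a hedged illustration of introducing a new datatype, the sketch below wraps the fraction pattern shown earlier in a named datatype and then uses it for an attribute. The names mydata.fraction and ratio, and the namespace, are hypothetical and not part of the TEI.

```xml
<!-- A hypothetical project datatype; the name mydata.fraction is invented -->
<dataSpec ident="mydata.fraction" mode="add">
  <desc>defines attribute values expressed as a fraction, such as 3/4</desc>
  <content>
    <!-- @name points to a W3C datatype, narrowed here by a regular expression -->
    <dataRef name="token" restriction="(\-?[\d]+/\-?[\d]+)"/>
  </content>
</dataSpec>

<!-- A hypothetical attribute definition (to be placed inside an attList) -->
<attDef ident="ratio" ns="http://example.org/ns/myTEI" usage="opt" mode="add">
  <desc>gives a proportion as a fraction</desc>
  <datatype>
    <!-- @key points to a datatype declared with dataSpec -->
    <dataRef key="mydata.fraction"/>
  </datatype>
</attDef>
```

Naming the datatype once keeps the regular expression in a single place if several attributes need the same restriction.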
Macros
Macros are used to declare predefined strings or patterns. Content models can be
defined here just like they are in
elements.
<macroSpec module="tei" ident="macro.paraContent">
<content>
<alternate minOccurs="0" maxOccurs="unbounded">
<textNode/>
<classRef key="model.gLike"/>
<classRef key="model.phrase"/>
<classRef key="model.inter"/>
<classRef key="model.global"/>
<elementRef key="lg"/>
<classRef key="model.lLike"/>
</alternate>
</content>
</macroSpec>
Customizations may consider introducing new macros, or adding new classes and
elements to existing macros.
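A new macro could be sketched along the following lines. The macro name macro.opusContent and the <opus> element with its namespace are assumptions made for illustration; <num> is the existing TEI element for numbers.

```xml
<!-- A hypothetical macro: text optionally interspersed with num elements -->
<macroSpec ident="macro.opusContent" mode="add">
  <desc>defines the content of elements that hold a work number</desc>
  <content>
    <alternate minOccurs="0" maxOccurs="unbounded">
      <textNode/>
      <elementRef key="num"/>
    </alternate>
  </content>
</macroSpec>

<!-- A hypothetical element that reuses the macro as its content model -->
<elementSpec ident="opus" ns="http://example.org/ns/myTEI" mode="add">
  <desc>an opus number assigned to a musical composition</desc>
  <classes>
    <memberOf key="model.phrase"/>
  </classes>
  <content>
    <macroRef key="macro.opusContent"/>
  </content>
</elementSpec>
```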
Constraints
Other formal constraints can be documented and specified within the
<constraintSpec> element. These can be placed within other specification
elements, or elsewhere in the documentation text. The TEI source uses Schematron to
express constraints, for example:
<constraintSpec ident="activemutual" scheme="schematron">
<constraint>
<s:report test="@active and @mutual">Only one of the
attributes @active and @mutual may be supplied</s:report>
</constraint>
</constraintSpec>
Expressing constraints is a powerful tool for building customizations,
particularly when multiple encoders will be working with the schema. Schematron in
particular offers many levels of reporting for catching encoding errors and offering
suggestions to encoders.
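In a customization, a constraint of this kind might be attached to an element as in the sketch below. The rule itself is invented for illustration. The sketch assumes ISO Schematron bound to the sch prefix and assumes that the tei prefix is bound to the TEI namespace when the rules are extracted; the value of @scheme and the namespace binding should be checked against the version of the processor in use.

```xml
<!-- Hypothetical rule: a chapter-level division should contain a heading -->
<elementSpec ident="div" mode="change"
  xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <constraintSpec ident="chapterNeedsHead" scheme="schematron">
    <constraint>
      <!-- The tei: prefix is assumed to be declared for the extracted rules -->
      <sch:rule context="tei:div[@type='chapter']">
        <sch:assert test="tei:head">A chapter division should contain
          a heading.</sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```

An assert fires when its test is false, whereas the report used in the TEI source fires when its test is true; both end up as messages shown to the encoder during validation.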
More literate programming
A truly literate programming ODD will couple prose with specifications, yet the
elements introduced so far are children of <schemaSpec>, which is somewhat
divorced from the <text> element containing the bulk of the human-readable
documentation. It is possible, nonetheless, to refer from the prose to
specifications in the <schemaSpec>, which a processor will expand when generating
documentation. The TEI Guidelines use this
mechanism, which simplifies the maintenance of specifications organized into
multiple XML files.
<listRef>
<ptr target="#ID_OF_SPEC_ELEMENT"/>
</listRef>
This is hardly a tight coupling of prose and specification, but it works well for
a complex ecosystem such as the TEI Guidelines.
It is also possible, however, to do just the opposite: break up specifications into
groups to be included within the documentation prose. References within
<schemaSpec> can then take care of telling the processor how to put everything
back together. The more recent TEI customization for “Simple Print” documents
employs this strategy.
The following (abridged) bit of prose describes the selection of elements from the
TEI header for the Simple Print
customization:
<div>
<p>A subset of 45 elements is selected from the TEI header module. In addition, <!-- etc. --></p>
<specGrp xml:id="header">
<moduleRef key="header" include="abstract availability biblFull catDesc etc" />
<moduleRef key="corpus" include="particDesc settingDesc"/>
</specGrp>
</div>
Elsewhere in the document, the <schemaSpec> points back to this and other
<specGrp> elements.
<schemaSpec ident="teisimpleprint" start="TEI teiCorpus">
<specGrpRef target="#base"/>
<specGrpRef target="#header"/>
<!-- etc. -->
</schemaSpec>
Processing ODDs: zapping, sourcing, chaining
To generate documentation and schemata, a processor will merge together the source ODD
and the customization ODD, resulting in a compiled document containing everything that
the customization selected from the source, plus the instructions to perform the
additions and changes required. This first step is an opportunity to drop anything that
is not needed, which makes it possible to write fairly lean customizations. For example,
the TEI analysis module has among its members the global attribute class
att.global.analytic. In turn, the att.global class, which is referenced by every single
element in the TEI, inherits its attributes from this class. When a customization
excludes the analysis module, att.global.analytic will also be dropped without needing
to change att.global explicitly. Similarly, when selected classes or elements end up
not being referenced anywhere else in the compiled ODD, they get “zapped” to avoid
unreferenced declarations in the resulting schemata and to exclude unnecessary
documentation from the human-readable output.
In a typical TEI customization, only the customization ODD is supplied by the user and
the processor obtains the source ODD for the latest release of TEI P5 before
compilation. While an altogether different source can be passed to the processor, the
user can also indicate in the customization file that certain specifications should be
obtained from specific sources. <schemaSpec> and reference elements (e.g. <moduleRef>,
<elementRef>) can use the @source attribute to point the processor to a different ODD
in which to look for that specification. The TEI Guidelines specify a private URI
scheme (tei:x.y.z) for referring to specifications from older versions of the TEI.
Because @source can point to any ODD via a URI, it is possible to “chain” ODDs by
customizing an existing customization. This example from the TEI Guidelines shows how
to extend the customization “TEI Bare”, which doesn’t include <q> (quote), with <q>
from version 3.0.0 of the TEI.
<schemaSpec ident="Bare-plus" source="tei_bare.compiled.odd" start="TEI">
<moduleRef key="tei"/>
<moduleRef key="header"/>
<moduleRef key="core" include="p list item label head author title"/>
<elementRef key="q" source="tei:3.0.0"/>
<moduleRef key="textstructure"/>
</schemaSpec>
Tools
The only existing ODD processor is a set of XSLT scripts maintained by the TEI.
Besides obtaining and running these scripts directly, there are a number of ways to
process ODD to generate documentation and schemata.
- Command line: the TEI Stylesheets repository on GitHub includes a number of scripts
to perform transformations. The script bin/teitorelaxng compiles ODDs and transforms
them into RELAX NG. It allows the user to set a non-TEI source ODD as well as a number
of other options. Other scripts such as bin/teitohtml5 can generate documentation from
a compiled ODD.
- Oxygen XML editor: the TEI Oxygen plugin includes the TEI Stylesheets and routines
for generating documentation and schemata from TEI ODD customizations.
- OxGarage: this online service at https://oxgarage.tei-c.org/ provides both a
graphical interface and an API to the TEI Stylesheets. It can compile ODDs and generate
documentation and schemata in a number of formats.
Additionally, the TEI has created Roma (https://roma.tei-c.org/), an online tool to
create customizations via a user interface, which also interfaces with the TEI
Stylesheets to generate documentation and schemata. The interface does not cover the
full expressiveness of ODD, but it supports users with less schema design expertise.
An entirely new version of Roma is currently in beta (https://romabeta.tei-c.org/).
Besides a complete rewrite of the interface, the new version takes advantage of the
OxGarage API for processing ODD and covers a wider range of customization operations.
ODD for TEI interchange and beyond
ODD plays an important role in data interchange within the TEI ecosystem: because the
TEI is a large and highly adaptable format, TEI-encoded documents can look quite
different from one another. When carefully crafted, ODD customizations become the key
to facilitating TEI interchange because they contain human-readable documentation as
well as a formal description of how a schema differs from the whole TEI specification.
Finally, because ODD can be used both to express and to customize a markup
vocabulary, it has been adopted outside of the TEI. The most notable case is the Music
Encoding Initiative, a markup language for representing music notation targeted at
library and musicological research that shares many of the documentation and
customization principles and needs of the TEI. The Music Encoding Initiative, which also uses ODD for its source and customizations, provides a
transformation service online at http://customization.music-encoding.org/, which also applies the TEI
Stylesheets to process ODD and generate schemata. ODD has also been used for the
definition of the Internationalization Tag Set (ITS); and “various standards proposal designed within ISO committee TC 37 have
been totally or partially written in TEI/ODD: MLIF, MAF, ISO 16642 rev., ISOTimeML”. New applications of ODD are still underway, including Martin Holmes’
proposed use for HTML (Holmes 2018).
Acknowledgements
My thanks to Syd Bauman for his extensive feedback on this piece and to the organizers
of the pre-conference Symposium on Markup Vocabulary Customization for inviting me
to talk about TEI ODD.
References
[Bauman 2011]
Bauman, Syd. 2011. Interchange vs. Interoperability, in Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7 (2011). doi:https://doi.org/10.4242/BalisageVol7.Bauman01
[Bauman 2017]
Bauman, Syd. 2017. tei_customization: A TEI customization for writing TEI customizations (paper), in Proceedings of the Text Encoding Initiative Conference and Members Meeting, Victoria, British Columbia, Canada, November 11 - 15 2017. https://hcmc.uvic.ca/tei2017/abstracts/t_110_bauman_teicustomization.html
[Burnard 2000]
Burnard, Lou. 2000. Text Encoding for Interchange: a new Consortium, Ariadne, 24. http://www.ariadne.ac.uk/issue/24/tei/
[Burnard and Rahtz 2000]
Burnard, Lou and Sebastian Rahtz. 2000. Relax NG with Son of ODD, in Proceedings of Extreme Markup Languages 2000. https://ora.ox.ac.uk/objects/pubs:394056
[Cummings 2008]
Cummings, James. 2007. The text encoding initiative and the study of literature, A Companion to Digital Literary Studies, ed. Susan Schreibman and Ray Siemens. Oxford: Blackwell, 2008. http://www.digitalhumanities.org/companion/view?docId=blackwell/9781405148641/9781405148641.xml&chunk.id=ss1-6-6
[Holmes 2018]
Holmes, Martin. 2018. Using ODD for HTML, in Proceedings of the Text Encoding Initiative Conference and Members Meeting, Tokyo, Japan, September 9 - 13 2018. Pages 240 - 241. https://tei2018.dhii.asia/AbstractsBook_TEI_0907.pdf
[Vanhoutte 2004]
Vanhoutte, Edward. 2004. An Introduction to the TEI and the TEI Consortium, Literary and Linguistic Computing, Volume 19, Issue 1, April 2004, Pages 9–16. doi:https://doi.org/10.1093/llc/19.1.9
[Wittern et al. 2009]
Wittern, Christian, Arianna Ciula, Conal Tuohy. 2009. The making of TEI P5, Literary and Linguistic Computing, Volume 24, Issue 3, September 2009, Pages 281–296. doi:https://doi.org/10.1093/llc/fqp017