How to cite this paper
Haentjens Dekker, Ronald, David J. Birnbaum, Bram Buitendijk and Joris J. van Zundert. “A schema language and parser for next-generation markup languages.” Presented at Balisage: The Markup Conference 2025, Washington, DC, August 4 - 8, 2025. In Proceedings of Balisage: The Markup Conference 2025. Balisage Series on Markup Technologies, vol. 30 (2025). https://doi.org/10.4242/BalisageVol30.HaentjensDekker01.
Balisage: The Markup Conference 2025
August 4 - 8, 2025
Balisage Paper: A schema language and parser for next-generation markup languages
Ronald Haentjens Dekker
Researcher
DH Lab, Huygens Institute for the History of the Netherlands
Ronald Haentjens Dekker is a researcher and software architect at DHLab and the
Huygens Institute for the History of the Netherlands, part of the Royal Netherlands
Academy of Arts and Sciences. As a software architect, he is responsible for translating
research questions into technology or algorithms and explaining to researchers and
management how specific technologies will influence their research. He has worked
on
transcription and annotation software, collation software, and repository software,
and he
is the lead developer of the CollateX collation tool. He also conducts workshops to
teach
researchers how to use scripting languages in combination with digital editions to
enhance
their research.
David J. Birnbaum
Professor Emeritus
Department of Slavic Languages and Literatures, University of Pittsburgh
David J. Birnbaum is Professor Emeritus of Slavic Languages and Literatures at the
University of Pittsburgh. He has been involved in the study of electronic text technology
since the mid-1980s, has delivered presentations at a variety of electronic text
technology conferences, and has served on the board of the Association for Computers
and
the Humanities, the editorial board of Markup languages: theory and
practice, and the Text Encoding Initiative Technical Council. Much of his
electronic text work intersects with his research in medieval Slavic manuscript studies,
but he also often writes about issues in the philosophy of markup.
Bram Buitendijk
Software Engineer
Digital Infrastructure, KNAW Humanities Cluster
Bram Buitendijk is a software engineer in the Research and Development team at the
Humanities Cluster, part of the Royal Netherlands Academy of Arts and Sciences. He
has
worked on transcription and annotation software, collation software, and repository
software.
Joris J. van Zundert
Researcher
Computational Literary Studies, Huygens Institute for the History of the
Netherlands
Joris J. van Zundert is a senior researcher and developer in humanities computing
in
the department of literary studies and the Digital Humanities Lab at the Huygens
Institute. His research focuses on computational algorithms to analyze literary and
historical texts, and on aspects of humanities information and data modeling. His
PhD
research in the field of Science and Technology Studies focused on methodological
effects
of the interaction between software engineers and humanities scholars. He was awarded
the
title of doctor cum laude on the basis of his dissertation Scholarship in
interaction
in 2022. His computational analytic work focuses on the correlation
between text immanent features of texts and sociological processes around the concept
of
literature. He is also involved in developing computational approaches to stemmatology,
narratology, and scholarly editions.
Copyright by the authors.
Abstract
This report explores strategies for grammar-based parsing and processing of
next-generation markup languages, a term we use to describe markup
languages that are designed to be able to represent structural Overlap,
Discontinuity, and other features that are not easily managed in XML.
Because next-generation markup languages cannot be modeled fully as context-free grammars
(CFG), we employ a mildly context-sensitive grammar (MCSG) to validate documents with
next-generation markup against user schemas expressed through an extension of Relax
NG
compact syntax (based on Creole).
Table of Contents
- Introduction
  - Markup and markup languages
  - Tokenization, parsing, and validation
  - Schemas and grammars
  - Conclusion
- Properties of grammars
  - Left-hand side and right-hand side
  - Terminals and non-terminals
  - Parts of a grammar
  - Three classes of grammars
  - About context-free grammars (CFG) and XML
- Markup and schemas for XML and beyond
  - Schema languages for XML
  - Existing schema and grammar proposals for managing Overlap
  - Why Creole is different
  - Partially ordered content: how Interleave works
  - Grammar-based and constraint-based schema languages
  - What we adopt from existing schema languages
- Patterns and anti-patterns in Relax NG compact syntax
  - Attributes and patterns that resolve to attributes do not accept repeatable occurrence indicators (+, *)
  - text requires at least one textual character and does not accept repetition occurrence indicators (+, *)
  - Attributes are not content
  - Mixed content must be represented with the keyword mixed
- Overlap and Creole
  - About our notation
  - Example of Overlap in Creole
  - How Concur works in SGML and in Creole
  - Concur and Interleave
- Mildly context-sensitive grammars (MCSG)
  - Handling overlap and implementing the Concur operator
  - Handling discontinuity
  - Self-overlap with a mildly context-sensitive grammar
- Schema compiler
- Conclusion
Introduction
Markup and markup languages
Document markup can be understood as "a means of making explicit an interpretation of a text", that is, of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted. A markup language, in turn, is "a set of markup conventions used together for encoding texts" (TEI Gentle introduction). Markup languages may vary in the structures and interpretations that they are (or are not) able to represent, and, where they are able to represent a structure, in the quality of the representation. Assessing the adequacy and quality of a markup language inevitably has a subjective aspect, but however the following terms are understood, any assessment would value clarity and legibility (humans typically need to read and write markup), as well as suitability for any intended automated processing (computers need to be able to process markup efficiently).
An optimal markup language would be able to represent—clearly, legibly, and in a
machine-tractable way—any feature of a document that a user considers important. The
predominant ecosystem of document markup conventions and tools today, based on the
XML-related family of W3C (and other) specifications, struggles to support, in a clear,
legible, and machine-tractable way, many concepts that are required to represent the
expressive richness of documents. Although workarounds exist, structural document
features
that cannot easily or naturally be expressed in XML and managed with
existing XML tooling include Discontinuity, Overlap, Non-linearity, Multi-ordered content, and Unordered content. We refer to a markup language that retains the intellectual and engineering
virtues of XML while also supporting the preceding features as a next-generation
markup language.
Tokenization, parsing, and validation
A principal goal of the present contribution is to imagine, and to begin to implement,
a
document authoring and processing environment that meets the needs of a next-generation
markup language as naturally and effectively as the existing XML ecosystem meets the
needs
of XML markup. Similarly to the way the use of specific elements and attributes in
documents
tagged with XML can be represented by schema languages designed to support XML features,
our
goals for a next-generation markup language include that it be possible to parse and
validate documents tagged according to such a language against a user-defined schema
that
recognizes and manages all features, including next-generation features, similarly
to the
way the current XML ecosystem recognizes and manages features of XML.
Parsing, as a high-level term, is the process of recognizing the
subcomponents of a text stream. For example, parsing XML entails recognizing the different
types of nodes that make up an XML document instance. Parsing in this high-level sense
is
sometimes subdivided into smaller tasks, such as, for example, an operation to distinguish
markup from content and an operation to analyze the structural relationships among
the
markup and content components. In this two-part model, the part that involves recognizing
markup and content tokens is typically called tokenization or
lexing (lexical analysis) and the term parsing
is reserved for grammatical analysis, that is, analyzing the relationships among the
tokens
identified during lexing. It may be possible to tokenize an invalid document, so that, for example, a
tokenization process might be able to recognize an end-tag at the very beginning of
an XML
document, where it is not permitted, by rules of well-formedness, to appear. Detecting
the
well-formedness violation is the responsibility of the parser (in the narrow sense
of the
term); the lexer identifies the tokens (structurally meaningful units in the text
stream)
and the parser either succeeds or fails to make sense of them.
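The division of labor can be illustrated with a minimal sketch (hypothetical names and a deliberately simplified token grammar, not part of our implementation): a tokenizer happily emits a token for an end-tag at the very beginning of a document, and it is the parser, in the narrow sense, that rejects the structure.

```python
import re

# Minimal lexer: recognizes end-tags, start-tags, and text runs.
# Real XML lexing is far richer (attributes, comments, PIs, etc.).
TOKEN = re.compile(r"</(\w+)>|<(\w+)>|([^<]+)")

def tokenize(doc):
    """Lexing succeeds even on ill-formed input: it only recognizes tokens."""
    tokens = []
    for end, start, text in TOKEN.findall(doc):
        if end:
            tokens.append(("end", end))
        elif start:
            tokens.append(("start", start))
        else:
            tokens.append(("text", text))
    return tokens

def parse(tokens):
    """Parsing (narrow sense): analyze relationships among the tokens."""
    open_elements = []
    for kind, value in tokens:
        if kind == "start":
            open_elements.append(value)
        elif kind == "end":
            if not open_elements or open_elements[-1] != value:
                return False  # end-tag with no matching open element
            open_elements.pop()
    return not open_elements  # everything opened must also be closed

tokens = tokenize("</p>hello")  # lexes fine: an end-tag token, then text
ok = parse(tokens)              # but the parser rejects the structure
```

The lexer does not even notice that the end-tag appears where no element is open; detecting that is the parser's job.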
Common reasons for parsing documents include validation (see below), enhancing
information (some schema languages record datatype information that is not explicit
in a
document instance but can be accessed during parsing), or performing transformations
or
other operations. The opposite of parsing can be considered
serialization, where parsing translates a sequence (series) of
characters into a structure that is convenient for subsequent processing (e.g., a
tree in
XML) and serialization turns that type of structure into a series (thus the term
serialize) of characters that is convenient for storage or
transmission.
Validation is verification that the grammatical structure of a
specific document conforms to the types of structures permitted by a schema used for
that
purpose. Schemas describe a domain-specific language, that is, they are
authored to enable and enforce domain-specific structural requirements. XML distinguishes
validation against a schema from well-formedness checking, that is,
from verifying that the document observes requirements that must be met by all XML
documents, regardless of domain-specific requirements and regardless of whether they
are
also validated against a schema. Strictly speaking, a document with angle-bracketed
markup
that does not conform to XML well-formedness requirements is not XML, which means
that it
cannot be validated against a (domain-specific) schema. More informally, though, users
ultimately care whether documents do or do not meet expectations and requirements,
and may
therefore regard verifying well-formedness and verifying schema validity as parts
of a
single process of verifying the acceptability of the document markup.
A validator that simply accepts or rejects a document according to whether it conforms
to structures permitted by a schema is called a recognizer. Beyond
distinguishing valid from invalid documents, a validation process may emit error messages
of
varying granularity, which can be helpful for users who would want to correct structural
invalidities. Validation can be performed in real time, and real-time validation is
often
incorporated into code editors to guide users through the process of adding valid
markup.
Alternatively, validation can be performed on completed documents.
Schemas and grammars
One difference between XML and SGML is that SGML always required a schema (the only
type
of schema supported by SGML was DTD), while schema validation is optional for XML.
The
decision by the original developers of XML that schemas should be optional recognized
that a
schema may be important during or immediately after authoring, but once a document
is
complete, has been validated, and will not be changed again, there is no need to continue
to
check it for validity. SGML made use of markup features that could not be interpreted
without reference to a schema, which is one reason that a schema had to be obligatory
for
SGML, and those features were excluded from XML requirements to ensure that XML could
be
processed without reference to a schema.
In its current form our system requires a schema, that is, we integrate well-formedness
checking with schema validation. We do not preclude separating these operations later,
but
for now that separation is out of scope because it is not clear how we would reliably
analyze and interpret next-generation features like Overlap and Discontinuity without
the
guidance provided by schema rules.
Schemas that can be used to support parsing and validation of XML markup can be authored
in W3C XML Schema (XML Schema 1.1), Relax NG (Clark 2002, Compact), or DTD
(defined as part of XML 1.0), none of which is designed to represent the
next-generation features described above. One of our goals, then, is to begin to develop
a
schema language and associated tooling for next-generation markup languages that is
analogous to the schema languages and associated tooling currently available in the
XML
ecosystem.
The terms schema and grammar are sometimes
used interchangeably in the structured-text community, but we use them here as technical
terms with different meanings and properties.
-
We use the term schema to refer to a set of
productions (rules that define a non-terminal on the left-hand
side according to non-terminals and terminals on the right-hand side, similarly to
Relax
NG Compact Syntax) that constrains the use of specific elements, attributes, text
nodes,
etc. Schemas are typically developed to meet project-specific requirements. Because
schemas are typically developed by content creators, they prioritize human legibility.
With our next-generation schema language, as with schema languages for XML, a schema
rule cannot overrule or supersede a well-formedness constraint. That is, schema rules
provide a second layer of constraints beyond the universal and non-negotiable
constraints imposed by well-formedness.
A schema is both a human-oriented statement of what is and is not permitted in
associated documents and a starting point for a grammar, as described below.
-
We use the term grammar to refer to a constrained set of
productions, optimized for machine interpretation and processing, that is generated
programmatically by compiling the schema into a grammar. While schema design prioritizes
human legibility, grammar design prioritizes efficient computational
implementation.
As we explain below, complex tree-like operators such as OR,
OPTIONAL, OneOrMore, ZeroOrMore,
INTERLEAVE, and CONCUR may be rewritten as one or more linear
productions when a human-oriented schema is compiled into a machine-oriented grammar.
Familiar notations like Backus-Naur Form (BNF, see BNF) and its
extensions (e.g., EBNF, see EBNF), which are often considered grammar
notations, allow alternation and other complex operations, but we use the term
grammar in this communication to refer specifically to
lower-level, purely sequential notations.
Human developers interact directly with the schema, but not directly with the grammar,
much as XML developers author XML Schema or Relax NG or DTD schemas to meet project-specific
needs. A schema language is the declarative, human-facing component of formal document
modeling, which means that it describes what is and is not permitted in a valid, conformant
document instance, but it does not describe how parsing and validation is performed.
If a
schema is to be usable for parsing and validation, it must be amenable to compilation
into a
grammar that can perform the parsing and validation operations in a robust, efficient,
and
scalable way. The present contribution aims to:
-
Identify the intellectual and engineering requirements for implementing a
user-friendly declarative formal schema language for a next-generation markup
language, and
-
Contribute an implementation of such a system, including a machine-oriented
grammar into which a schema can be compiled that is sufficiently robust, efficient,
and scalable to be suitable for likely real-world use cases.
The three schema languages in wide use in the XML ecosystem, listed above, have mature
implementations that have been in use for many years, and we have no doubt that the next-generation schema language and implementation
introduced here will be improved as they are tested against a broad variety of real-world
documents and research needs. At the same time, the next-generation schema language
we
describe here, and our reference implementation of a validating parser for documents
tagged
in conformity with schemas expressed in that schema language, provide a viable starting
point for a broad user community that is eager for next-generation functionality that
is not
easily or naturally implemented purely with XML markup and the existing XML toolkit.
Current widely used parsing strategies for markup technologies are often based on
context-free grammars (CFGs), although the properties and
restrictions they aim to describe and validate cannot always be expressed without
recourse
to features that cannot be represented completely within a CFG. For example, the XML specification (XML 1.0) incorporates a
formal, context-free grammar in EBNF notation (EBNF) that seeks to
represent what is and is not permitted with XML markup, but that formal grammar must
be
supplemented, within the specification, with plain prose validity
constraints (VC, e.g., "An element type must not be declared more than once", §3.2) and well-formedness constraints (WFC, e.g., "The Name in an element's end-tag must match the element type in the start-tag", §3) because not all requirements for well-formed XML can be represented
completely by the features available within a CFG. XML validity and well-formedness
constraints can be implemented computationally in a robust, efficient, and scalable
manner,
but only through processing that transcends parsing and validation against the formal
CFG
grammar. Insofar as a grammar is a formal representation of a theory of the
text, the system we describe in this contribution aims to be sufficiently
expressive to admit all desired textual structures (as represented by a project-specific
document schema), to recognize any structure not licensed by the schema as invalid,
and to
do so without recourse to extra-grammatical annotations and implementation features.
Insofar
as a grammar is also an engineering resource, a priority for our work
is that our schema language, designed to maximize legibility for human schema designers,
should be able to be compiled (translated) by the computer into a formal grammar that
is
adequate for efficient, scalable automated parsing and validation.
We regard the general process by which a schema is compiled into a grammar against
which
a document instance can be parsed and validated as an abstract, high-level
metamodel, that is, as independent of any specific implementation in
any specific programming language. We provide a particular implementation, then, primarily
as a way of verifying whether the metamodel is implementable in principle. The distinction
between the abstract metamodel and a concrete implementation invites an eventual comparison
of implementations, whether with respect to performance, user experience, or other
criteria,
even as all implementations are required to emit what is functionally the same result
because all implementations must implement the shared metamodel.
Conclusion
Ultimately, the primary aim of this contribution is to elaborate a formal model that
is
able to express complex phenomena of texts that users might reasonably wish to record,
including those not easily represented with XML, and to do so in a way that is theoretically
coherent and formally computable. This goal engages with several hard challenges that
have
not yet been resolved in a satisfactory way in the domain of markup languages, as
is clear
from, for example, the way Overlap has been a recurrent theme in markup conferences
for
decades. Much of the present contribution engages explicitly with formal grammar theory,
but
we also introduce an implementation of a proof-of-concept parser-validator that goes
beyond
the XML toolkit in being able to process next-generation features not as secondary
or
external operations, but as part of both the schema language and the grammar to which
it
compiles. Although our proposal emerged from our work on TagML, it is intended to
be usable
with any markup language that supports the next-generation features we discuss.
Properties of grammars
Left-hand side and right-hand side
A grammar is primarily a set of productions (rules), each of which
maps from a left-hand side (LHS) to a right-hand
side (RHS). Grammatical productions are similar in some respects to statements
in Relax NG compact syntax, where, for example:
chapter = element chapter { title, paragraph+ }
has an LHS with the value chapter and an RHS with a value that defines chapter as an element type with the name chapter that contains a single obligatory title followed by one or more items called paragraph. title and paragraph are, in turn, defined in their own productions elsewhere in the schema. A Relax NG compact-syntax production is not exactly the same as a grammar production, but what they have in common is that a single item in the LHS (in this case chapter) is defined in terms of items in the RHS (in this case title and paragraph, along with the comma connector and the Kleene plus occurrence indicator, which describe syntactic relations among the content items).
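The correspondence between a schema statement and a production can be sketched as data (a hypothetical encoding chosen for illustration, not our implementation's internal form): each production pairs one LHS symbol with an RHS expression.

```python
# Hypothetical encoding of the Relax NG compact-syntax statement
#   chapter = element chapter { title, paragraph+ }
# as an (LHS, RHS) pair. The RHS is a sequence in which "paragraph"
# carries the Kleene-plus occurrence indicator.
production = (
    "chapter",                                  # LHS: non-terminal being defined
    ("element", "chapter",                      # RHS: element named "chapter" ...
     ("seq",
      ("ref", "title"),                         # ... containing one title ...
      ("plus", ("ref", "paragraph")))),         # ... then one or more paragraphs
)

lhs, rhs = production
```

The ("ref", ...) entries point at the non-terminals title and paragraph, which would be defined by their own productions elsewhere.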
Terminals and non-terminals
The items in the LHS and RHS are either terminals or
non-terminals, which can be understood as follows:
-
A terminal is an item that cannot be expanded further, such
as, in both Relax NG compact syntax and grammars, a literal string.
-
A non-terminal is an item that can be defined in terms of
terminals and other non-terminals.
For example, a Relax NG production like:
published = attribute published { "yes" | "no" }
defines the non-terminal published as an attribute with a value that entails a choice between two terminals, the string "yes" and the string "no". As above, a Relax NG production is not exactly the same as a grammatical production (in particular, a grammatical production does not allow alternatives, represented in Relax NG by the pipe connector), but the distinction between terminals and non-terminals applies to both grammars and Relax NG schemas.
Parts of a grammar
A complete grammar has 1) a starting point, 2) a set of non-terminals, 3) a set of
terminals, and 4) a set of productions that must ultimately define all non-terminals
in
terms of terminals. The LHS of a grammar production must always include a non-terminal; whether it
can also include terminals or other non-terminals depends on the type of grammar.
A schema language may come with built-in, predefined productions, so that, for example,
Relax NG compact syntax knows how to compile the grammatically non-terminal keyword
text into productions with only terminals (defined with reference to a set of
one-character string values) on the RHS. One difference between a schema and a grammar,
then, is that a grammar must be complete (that is, must ultimately define everything
using
only terminals), while a schema language can come with implicit productions that do
not need
to be spelled out in the schema itself. This is consistent with the fact that a schema
prioritizes human legibility because it is intended to be authored directly by humans,
while
a grammar prioritizes computational tractability because it is intended to be processed
by a
machine. Part of the parsing process involves compiling (transforming) human-oriented
schema
rules into machine-oriented grammar productions.
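The four parts of a grammar can be sketched as a small structure (hypothetical names, for illustration only), with a completeness check that enforces the requirement that all non-terminals be ultimately defined:

```python
from dataclasses import dataclass

@dataclass
class Grammar:
    start: str               # 1) starting point
    nonterminals: frozenset  # 2) set of non-terminals
    terminals: frozenset     # 3) set of terminals
    productions: dict        # 4) maps each non-terminal (LHS) to a list of
                             #    alternative RHS symbol sequences

    def is_complete(self):
        """Every non-terminal, including the start symbol, must have a
        production, and every RHS symbol must be a declared terminal or
        non-terminal, so everything bottoms out in terminals."""
        if self.start not in self.nonterminals:
            return False
        known = self.nonterminals | self.terminals
        return all(
            nt in self.productions
            and all(sym in known
                    for rhs in self.productions[nt] for sym in rhs)
            for nt in self.nonterminals
        )

g = Grammar(
    start="greeting",
    nonterminals=frozenset({"greeting"}),
    terminals=frozenset({"hello", "hi"}),
    productions={"greeting": [("hello",), ("hi",)]},
)
```

A schema, by contrast, could omit a production for a built-in keyword like text; completeness is only required of the grammar the schema compiles into.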
Three classes of grammars
This contribution discusses three types of grammars, which differ with respect to
what
is or is not permitted on the LHS and RHS. We say more about these types of grammars
below,
but we can distinguish them concisely now as follows:
-
A context-free grammar (CFG) contains productions that map
from a single non-terminal LHS to a single sequence of terminals and non-terminals
in
the RHS. The single item in the LHS has no access to contextual history.
-
A context-sensitive grammar (CSG) allows multiple symbols in
the LHS, at least one of which must be a non-terminal. The non-terminal is required
because, as noted above, the point of a production is to explain a non-terminal, while
terminals are self-explanatory and therefore not candidates for further explanation.
The other symbols (terminal and non-terminal) that are allowed on the LHS in a CSG
provide context, that is, they are how a CSG, unlike a CFG, has access to direct
contextual history.
-
A mildly context-sensitive grammar (MCSG) allows only a
single (non-terminal) symbol on the LHS, similarly to a CFG and unlike a CSG. But an
MCSG allows more than one group of terminals and non-terminals in the RHS, together
with rules that determine how they can be mixed (interleaved) in the final
production.
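The difference in LHS shape between the first two classes can be made concrete with a small classifier (a sketch; by assumption in this sketch, non-terminals are written in uppercase):

```python
def is_nonterminal(sym):
    # Assumption for this sketch only: non-terminals are uppercase symbols.
    return sym.isupper()

def classify(lhs, rhs):
    """Classify a production by the shape of its left-hand side."""
    if len(lhs) == 1 and is_nonterminal(lhs[0]):
        return "CFG-shaped"   # single non-terminal, no access to context
    if any(is_nonterminal(s) for s in lhs):
        return "CSG-shaped"   # extra LHS symbols supply the context
    return "invalid"          # a production must explain a non-terminal

shape1 = classify(("A",), ("b", "C"))                 # CFG-shaped
shape2 = classify(("a", "B", "c"), ("a", "x", "c"))   # CSG-shaped
```

An MCSG production would also be classified as CFG-shaped by this LHS test; what distinguishes it is on the RHS, where it permits multiple groups of symbols plus interleaving rules, which this simple classifier does not inspect.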
About context-free grammars (CFG) and XML
As noted above, parsers based entirely on context-free grammars (CFG) are unable to
model
not only next-generation markup languages, but also some features of XML. The term
context-free means, among other things, that CFG productions must be
static, that is, they must be based exclusively on information
available before looking at the document instance. Information acquired only once
parsing
begins is dynamic, and therefore cannot be part of a CFG
production.
It is possible to parse XML against an XML schema with a CFG, but this is not the
case in
the absence of a schema, which is why the XML spec consists of CFG productions supplemented
with WFCs. Two features of XML where the availability of a schema enables the compilation
of a
CFG that does not need to rely on WFCs are:
-
XML validation requires that every start-tag be followed, in a non-overlapping way,
by a corresponding end-tag for the same element type. If a schema is available, it
necessarily identifies all element types that are allowed in a valid document, which
means that the schema can be compiled into a grammar that includes separate explicit
productions for every valid element type. That is, instead of saying something like
"every start-tag must be followed by a matching end-tag", a CFG compiled
from a schema can say "XML that is valid against this schema can contain only
element types A, B, and C; a start-tag for element A must be followed by an end-tag
for element A, a start-tag for element B must be followed by an end-tag for element B,
and a start-tag for element C must be followed by an end-tag for element C".
These explicit rules account for all possible element types in any valid document to be
parsed against the schema, and therefore remove the need for a more general grammar rule
that every start-tag must be followed by a matching end-tag.
-
XML validation requires that attribute names not be repeated on an element. If a
schema is available, it necessarily identifies all attributes that are permitted on
an
element, which means that the schema can be compiled into a grammar that includes
productions for all combinations and orders of attributes. For example, a schema can
say that "element A must include values for attributes X and Y", which can be
translated into grammar rules that say that "the element name in the start-tag of
element A must be followed by either attribute X and then attribute Y or attribute Y
and then attribute X". Those rules do not allow either attribute to repeat, and
therefore remove the need for a general well-formedness constraint that prohibits that
type of repetition.
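The second compilation step above can be sketched as follows (hypothetical helper name): knowing from the schema that an element requires a fixed set of attributes, we can enumerate explicit alternatives, one per attribute ordering, rather than relying on a general no-repetition constraint.

```python
from itertools import permutations

def attribute_productions(element, required_attrs):
    """Expand a schema rule "this element must include these attributes"
    into explicit grammar alternatives, one per attribute ordering.
    Because each ordering uses every attribute exactly once, repetition
    is impossible by construction: no separate well-formedness
    constraint is needed."""
    return [
        (element,) + perm
        for perm in permutations(sorted(required_attrs))
    ]

rules = attribute_productions("A", {"X", "Y"})
# Two alternatives: A then X then Y, or A then Y then X.
```

As the prose notes, a real implementation would optimize rather than materialize all orderings (the number of permutations grows factorially), but the static principle is the same.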
Schema validation is likely to incorporate optimizations that reduce the computation
described in a brute-force way above. Nonetheless, the difference between validation
against a
schema vs well-formedness checking in the absence of a schema is that when parsing
against a
CFG all information must be static, that is, available to the parsing process before
it sees
the document instance. A schema makes all valid element and attribute names and all
relationships among them available, providing information that can be used when the
schema is
compiled into a grammar prior to parsing the document instance. In the absence of
a schema,
element and attribute names are available only from the document instance, which means,
by
definition, that they cannot be incorporated into CFG productions.
As noted above, a common workaround for the inability to specify all properties of
a
markup language in a CFG has been to supplement the grammar, which defines only some
features
of the markup language, with well-formedness constraints (WFCs), which must be validated
by
augmenting the processing of the CFG. This approach has at least two types of limitation:
-
WFCs are often rendered only in plain prose, and not in a formal,
machine-implementable way, much less in a way that can be integrated formally with
context-free productions. This is the case, for example, with well-formedness
constraints in the XML specification (XML 1.0), where features of the
markup language that cannot be expressed entirely with the resources provided by a
CFG
must be expressed in other ways. The absence of a machine-oriented formalization
introduces a risk that implementations may be based on inconsistent (and even erroneous)
human interpretations of the constraints.
-
Parsers that take into account WFCs may need to keep state, that is, they may need
to handle information encountered earlier in the parsing process that is not fully
represented by any grammar production. For example, in the absence of a schema, an
XML
parser needs to know the elements that remain open at a given moment because without
that information, it cannot verify that every start-tag has a corresponding end-tag
and
that the elements expressed by the tags are properly nested. Those requirements are
expressed in the XML specification as WFCs because in the absence of a schema they
cannot be expressed fully in a CFG alone.
In addition to looking backward (that is, keeping state), in cases involving ambiguous
grammars a parser may also need to look ahead in order to resolve an ambiguity. Unlike
keeping
state, look-ahead is not necessarily incompatible with CFGs, and we discuss look-ahead
operations in different types of grammars below.
The performance of CFGs benefits from not needing to save past information in memory
in
order to interpret an item that is currently in focus, but some next-generation markup
features cannot be recognized and interpreted without contextual information. In the
case of
Discontinuity, for example, without access to earlier parts of a document instance,
it may not
be possible for a parser to recognize when an element in focus is the continuation
of an
element that went before, that is, whether the new element is part of a valid discontinuous
construction.
In the case of CFGs with WFCs the challenges of keeping state effectively and efficiently
have been studied extensively, which has led to deep knowledge about high-performance
implementations. For example, an XML parser that is performing well-formedness checking
in the
absence of a schema does not need to hold everything before the current
location in memory at all times; it is sufficient to keep track of only a subset of
that
information. To verify that every start-tag has a corresponding end-tag the parsing
process
has to keep track of which elements have been opened but not yet closed, and in what
order.
Furthermore, because XML does not permit overlapping elements, the only part of that
information that needs to be accessible immediately is the most recently opened element.
That information can therefore be stored on a stack, which provides efficient access to
exactly the item that must be immediately accessible: the one opened most recently. The
situation is obviously more complicated once next-generation features, such as Overlap,
need to be accommodated.
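The stack discipline described above can be sketched in a few lines of Python. (This is our own illustration; the event representation is an assumption, not part of any XML parser API.)

```python
def well_formed(events):
    """Check that start- and end-tag events nest properly.

    `events` is a list of ("start", name) or ("end", name) pairs.
    Only the most recently opened element is ever inspected, so a
    stack suffices -- no other document history is needed.
    """
    open_elements = []
    for kind, name in events:
        if kind == "start":
            open_elements.append(name)
        else:
            # An end-tag must match the innermost open element.
            if not open_elements or open_elements[-1] != name:
                return False
            open_elements.pop()
    return not open_elements  # every start-tag must have been closed

# Properly nested, as in <a><b></b></a>:
print(well_formed([("start", "a"), ("start", "b"), ("end", "b"), ("end", "a")]))  # True
# Overlapping, as in <a><b></a></b>, is rejected by the stack discipline:
print(well_formed([("start", "a"), ("start", "b"), ("end", "a"), ("end", "b")]))  # False
```

As the second call shows, the stack discipline is exactly what rules out Overlap, which is why accommodating next-generation features requires something more than a single stack.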
While we know the limitations of CFGs with respect to representing next-generation
markup-language features, what to use instead is less clear. Compared to the markup
community’s extensive and deep knowledge of modeling and processing XML with a CFG
supplemented by WFCs that can be implemented efficiently, we have only limited and
largely
experimental knowledge of grammars for next-generation markup languages that might
be able to
represent all structural features of markup in a way that can be formalized for robust
and
efficient machine processing.
Insofar as a grammatically consistent and self-contained way of modeling a next-generation
markup language may need to maintain state more comprehensively than is required for
parsing
XML, and perhaps to look both backward and forward, one option might be to use some
kind of
context-sensitive grammar. At the same time, it is not immediately clear how a
context-sensitive grammar for that purpose should be constructed because context-sensitive
grammars have not been standardized and remain an under-researched concept, at least
in
comparison to CFGs. We do know, however, that any solution will at least have to provide
for
three functions that are fundamental to our aim:
-
A grammar-like function that enables the expression of a basic
syntax that includes all obligatory next-generation features, as enumerated above.
This
function is comparable to the role of well-formedness constraints in XML.
-
A schema-like function that supports additional domain-specific
constraints on the use of features permitted by the grammar. This function is comparable
to the role of schema languages in the XML ecosystem, where schema languages allow
users
to express, in a formal, machine-actionable way, the constraints that apply to
domain-specific document models. Ideally the schema-like function would prioritize
human
legibility, much as is the case with schema languages for XML.
-
A parser-like function that enables the validation of a document
instance with respect to well-formedness and, where provided, user-specified schema
rules
in a consistent and performant manner, emitting clear error messages when needed.
To avoid
the need to keep entire documents in memory during parsing and validation, this function
might be compared to the functions of XML SAX or StAX (but not DOM) parsers.
Any solution will have many moving and interlocking parts, all of which are known
as
desiderata for document modeling but most of which have not been explored thoroughly
as
potential parts of a grammar for a next-generation markup language. We explore these
considerations in the next several sections, but we first survey historically related
work and
state-of-the-art developments that are relevant as a background for understanding
the solution
we will then propose.
Markup and schemas for XML and beyond
Schema languages for XML
There is no need to tell a Balisage audience that there are three grammar-based schema
languages in wide use for modeling XML, as listed above: DTD, XML Schema, and Relax
NG. The three schema languages can all be considered tree
grammars (thus Manguinhas 2009, §2), and they are largely
comparable in the XML structures they are able to model, with two conspicuous exceptions:
-
DTDs lack support for namespaces, although namespace support can be simulated (XML Namespaces §5). The other two schema languages fully support
namespaces.
-
Only Relax NG supports Interleave.
This second difference, concerning Interleave patterns, is important for next-generation
features for two reasons:
-
Unlike most structures accommodated by all three schema languages for XML,
Interleave incorporates unordered and partially ordered properties, which cannot be
represented using only features of CFGs.
-
Concur and Discontinuity, as well as Mixed content, can be described and implemented
using features required for modeling Interleave.
We discuss the relationship between Interleave and next-generation markup features
later
in this report.
Existing schema and grammar proposals for managing Overlap
As we look beyond XML, data models and schema languages that have been proposed for
managing Overlap include SGML Concur and its XML variant XConcur; Rabbit/duck grammars;
and
Creole. Because Tennison 2007 includes detailed explanations, examples,
bibliography, and critiques of the multiple-hierarchy models Concur, XConcur (including
XConcur-Cl, which is the validation component of XConcur), and Rabbit/duck grammars,
we say
nothing more about those technologies here except to point out that none of them engages
explicitly with any next-generation feature other than Overlap. Luminescent, a LMNL
parser
implemented in XSLT, reflects a perspective similar to that of Rabbit/duck grammars
in that
it processes LMNL ranges with reference to XML milestone start- and end-markers. (Tennison and Piez 2002, Luminescent) But because LMNL lacked
a schema language at the time Luminescent was developed, Luminescent was designed
to perform
well-formedness checking, but not schema validation. Other proposals for managing
Overlap
have included the GODDAG data model and TexMECS markup language (Sperberg-McQueen and Huitfeldt 2004,
Huitfeldt and Sperberg-McQueen 2001), which seek to model and manage not only Overlap, but also
Discontinuity (Sperberg-McQueen and Huitfeldt 2008). Modeling and tooling based on
Rabbit/duck grammars, GODDAG, and TexMECS have been employed in projects implemented
by the
resource designers and developers (e.g., Sperberg-McQueen 2018), but have not been
adopted broadly elsewhere.
Schema-aware Extended Annotation Graphs offers a graph-based approach to validating
some
next-generation features, performing simulation-based validation by
construction. (SAEAG) The authors observe that
since Dataguides and Graph Schemas are inferred from the data, they cannot be used
as authoring tools
(§4), which makes the method unsuitable for our
purposes.
Why Creole is different
As far as we have been able to determine, Creole is the only detailed proposal for
managing Overlap that is designed to process all structural properties of an entire
document
with a single grammar instead of validating overlapping hierarchies in ways that are
partially or entirely separated from one another. This is consistent with our aim
of
managing Overlap as an integral feature of our document model. The one detailed publication
about Creole, Tennison 2007, describes the design principles, some of which we
adopt directly and some of which we either modify or replace with alternatives. Alongside
a
thorough explanation of the design of Creole and the strategy for processing it, Tennison 2007 reports having built a proof-of-concept implementation in XSLT 2.0 that
worked as expected, except that it was insufficiently performant for production use.
(25) We
discuss performance in greater detail below, when we introduce our own design and
implementation.
Partially ordered content: how Interleave works
Most authors of Relax NG schemas will have worked with both ordered and unordered
content,
but partially ordered content, which is vital for modeling
next-generation features, is likely to be less familiar. The Interleave feature of
Relax NG is
able to model partially ordered content, but even those who have used Interleave in
their own
schemas may have used it to represent content that they consider unordered, rather
than
partially ordered. For that reason, before we introduce a schema language and grammar
for a
markup language that supports next-generation features, we take a close look at the
meaning of
partially ordered content as represented by Relax NG Interleave.
Interleave of simple items: A common use of Interleave
involves unordered simple content items. For example, the content model for
<x> in the following schema:
x = element x { A & B & C }
A = element A { empty }
B = element B { empty }
C = element C { empty }
means that <x> contains exactly one instance each of A, B, and C in any order,
that is, any of the following six patterns: A,B,C; A,C,B; B,A,C; B,C,A; C,A,B; and
C,B,A. In this respect the unordered nature of Interleave might be compared to the
unordered nature of XML attributes.
Interleave of repeated simple items: Unlike attributes,
however, elements that participate in Interleave may be repeated, as in the following
content
model:
A* & B* & C*
This means, among other things, that Interleave allows its immediate operands (in this
example: zero or more instances of A, zero or more instances of B, and zero or more
instances of C) to appear in any order, licensing the same six patterns as the example
above, except with zero or more repetitions of each repeatable component, e.g.,
A*,B*,C*; A*,C*,B*, etc. But that isn’t the only thing allowed by Interleave with
Repetition, because the model above does not require the repeated items to cluster with
others of the same type. This means that in addition to the six patterns listed above,
it also allows any number of A, B, and C instances in any order, and is therefore
equivalent to:
( A | B | C )*
This, too, might be compared to unordered content.
Interleave of items with ordered content: The
repeatability of the simple content-item types in the example above obscures the way
Interleave models not only unordered content, but also partially ordered content.
Consider the
following Relax NG schema:
x = element x { a & b }
a = c, d
b = e, f
c = element c { empty }
d = element d { empty }
e = element e { empty }
f = element f { empty }
This schema says that an element of type <x> contains an interleave of patterns of type
a and b. This schema allows the two immediate operands of the Interleave operator to
appear in either order, that is, either all of a before all of b, as in:
<x>
<c/><d/><e/><f/>
</x>
or all of b before all of a:
<x>
<e/><f/><c/><d/>
</x>
Up to this point we might also consider this model unordered: x contains one instance
of a and one instance of b in either order.
Interleave of items with partially ordered content: The
fact that the content is partially ordered, rather than unordered, becomes visible
in the
remaining patterns allowed by this schema: the components of a (c
and d) may be interwoven (that is, interleaved) with those of b
(e and f). For example, in:
<x>
<c/><e/><d/><f/>
</x>
the two components of a occupy positions 1 and 3 and the two components of b occupy
positions 2 and 4. It would be a mistake, though, to think that the components of a and
b can be combined in any order, that is, it would be a mistake to
interpret the Interleave operator in this case as modeling unordered content. Because
the components of a and b are ordered (the comma connector means “in this order”), the
two components of a must occur with c before d and the two components of b must occur
with e before f. That is, the following document would be invalid:
<!-- invalid -->
<x>
<e/><d/><f/><c/>
</x>
Interleave, then, does not mean only that the immediate operands to the interleave
operator may appear in any order; that is permitted, but it is not the only thing
that is
permitted. Interleave also does not mean that the individual atoms in the expansion
of its
arguments may appear in any order; this is not permitted because any lower-level order
imposed
by the patterns of the interleaved items must be respected. In the example above,
c, d, e, and f may occur in any order only
as long as c precedes d and e precedes f
(not necessarily immediately).
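The partial-order semantics just described can be checked mechanically by projection: a candidate sequence is a valid interleaving if, for each operand, deleting all symbols that belong to other operands yields exactly that operand in its own order. A minimal Python sketch (our own illustration, using the element names from the schema above):

```python
def valid_interleave(sequence, operands):
    """Check `sequence` against an interleave of ordered operands.

    Each operand is a tuple of distinct element names that must all
    occur, once each, in the operand's own internal order; the
    operands themselves may interleave freely with one another.
    """
    for operand in operands:
        # Project: keep only the symbols that belong to this operand...
        projection = tuple(s for s in sequence if s in operand)
        # ...and require the operand's own order to survive intact.
        if projection != operand:
            return False
    # Every symbol must be accounted for by exactly one operand.
    return sorted(sequence) == sorted(s for op in operands for s in op)

# a = c, d   and   b = e, f   (the content of <x> in the schema above)
print(valid_interleave(("c", "e", "d", "f"), [("c", "d"), ("e", "f")]))  # True
print(valid_interleave(("e", "d", "f", "c"), [("c", "d"), ("e", "f")]))  # False
```

The second call corresponds to the invalid document above: d precedes c, so the projection onto the first operand violates its internal order.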
It’s easy to imagine realistic use cases for Interleave as a way of modeling unordered
content. For example, mixed content that allows text intermixed with any number of
<a> and <b> elements can be modeled in
several ways, both with and without the Interleave operator. In a schema that includes:
p = element p { mixed { ( a | b )* } }
a = element a { text }
b = element b { text }
the first line could, alternatively, be expressed with the Interleave operator as:
p = element p { mixed { a* & b* } }
We find it more challenging, though, to think of natural, real-world use cases for
partially ordered content. van der Vlist 2004 (54) gives an example of modeling groups
of tourists in a museum, but we struggle to imagine that as a sort of relationship
anyone might realistically want to model in a Relax NG schema. Consider, though, the
following XML:
<root>
<section>
<annotation-start id="0001"/>
<header>Hi, I'm a header!</header>
<p>Hi</p>
<ab>Bye</ab>
<ab>Hi again</ab>
<annotation-end id="0001"/>
<p>Bye again</p>
</section>
</root>
Ignoring for a moment the annotations, a <section> must start with a single <header>
element that is followed by zero or more instances of <p> and zero or more instances of
<ab> in any order (as long as there is at least one <p> or one <ab>). The preceding
example without the annotations might look like:
<root>
<section>
<header>Hi, I'm a header!</header>
<p>Hi</p>
<ab>Bye</ab>
<ab>Hi again</ab>
<p>Bye again</p>
</section>
</root>
Meanwhile, we want to be able to mark annotation ranges with milestone
<annotation-start/> and <annotation-end/> elements. (We simplify the example by
omitting the text of the annotation, which might, hypothetically, be represented by an
attribute on the <annotation-start/> element.) These elements must appear in paired
order, with exactly one start followed by exactly one end, and they must be coindexed
by a shared @id value (not an @xml:id or HTML @id because it begins with a digit and
repeats). The scope of an annotation has to be one or more sequential <p> and <ab>
elements. A Relax NG schema to represent this is:
start = root
root = element root { section+ }
section =
element section {
(header, (p | ab)+)
& annotation-pair*
}
header = element header { text }
ab = element ab { text }
p = element p { text }
annotation-pair = annotation-start, annotation-end
annotation-start = element annotation-start { id, empty }
annotation-end = element annotation-end { id, empty }
id = attribute id { text }
There are some limitations to the types of constraints that Relax NG can enforce:
-
Relax NG cannot require the @id values on pairs of annotation starts
and ends to agree.
-
Relax NG cannot prevent <annotation-end/> from following
<annotation-end/> immediately, without any intervening
<p> or <ab> elements.
Both the preceding limitations can be overcome with Schematron. What the Relax
NG schema can enforce, though, is partial order, as follows:
-
There must be exactly one <header> that precedes all
<p> and <ab> elements within a
section. If the <header> is missing or if it is preceded by any
<p> or <ab>, the document is
invalid.
-
<annotation-start> and
<annotation-end> elements may occur anywhere as children of a
<section> element (but not as children of
<header>, <p>, or
<ab>). They must occur in pairs, that is, every
<annotation-start/> element must precede a corresponding
<annotation-end/> element. This model does not allow a new
annotation to open before the first one closes, which would be a case of self-overlap,
about which please see below.
One way to think of what Interleave permits and prohibits is that it describes
partial order. As we will see below, in this respect it shares some
properties of Concur: each concurrent hierarchy under Concur must observe, when we
ignore the
components that are not part of that hierarchy, all hierarchy-specific constraints,
including
ordering constraints. Similarly, each operand of an Interleave operator must observe,
when we
ignore other immediate operands, all operand-specific constraints.
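The shared-document view of Concur can be illustrated with the same projection idea: each layer is validated after ignoring the markup tokens that belong to other layers. A simplified Python sketch (the token representation is our own assumption, and for brevity each layer is checked only for proper nesting, not against a full content model):

```python
def layers_well_formed(tokens, layers):
    """Validate each concurrent layer of a shared token stream.

    `tokens` is a list of ("start"/"end", name, layer) markup tokens
    and ("text", content, None) text tokens.  Each markup token
    belongs to one layer; text tokens are shared by all layers.  A
    layer is checked by ignoring the markup tokens of other layers.
    """
    for layer in layers:
        stack = []
        for kind, value, owner in tokens:
            if kind == "text" or owner != layer:
                continue  # projection: skip tokens of other layers
            if kind == "start":
                stack.append(value)
            elif not stack or stack.pop() != value:
                return False  # improper nesting within this layer
        if stack:
            return False  # unclosed element within this layer
    return True

# An overlapping pair of hierarchies over one shared text token:
# layer A's verse and layer B's page overlap, yet each layer is
# well-formed when projected on its own.
tokens = [
    ("start", "verse", "A"),
    ("start", "page", "B"),
    ("text", "In the beginning", None),
    ("end", "verse", "A"),
    ("end", "page", "B"),
]
print(layers_well_formed(tokens, ["A", "B"]))  # True
```

The projection step is the crux: constraints are enforced per hierarchy over a single shared document, not over separate documents.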
Finally, we can also consider mixed content a type of Interleave, where, for example:
p = element p { mixed { a, b } }
requires exactly one instance each of a and b in that order, with optional text nodes
before, between, or after. This can be considered a partially-ordered interleave of the
sequence a, b and text. Because the keyword text in Relax NG means zero or more textual
characters, it is implicitly optional, and because XML does not permit adjacent text
nodes (which it merges into a single text node) it is also implicitly repeatable. This
means that the same content constraints can be modeled alternatively with the
Interleave operator as:
p = element p { (a, b) & text }
SGML DTDs included an & operator that meant “in any order”. The SGML operator, then,
models only unordered content, which is different from the meaning of Relax NG
Interleave. XML DTDs do not include an & operator.
Grammar-based and constraint-based schema languages
XML schemas in XML Schema, Relax NG, or DTD are often supplemented by Schematron schemas,
that is, collections of Schematron rules. Schematron rules are expressed in XPath,
and can
therefore validate anything that can be expressed in XPath, which includes both structural
and
content-oriented features. A lot of document properties can be modeled in either a
grammar-based schema language or Schematron, but in general:
-
Schematron is better than grammar-based languages at
content-oriented validation. The inclusion of datatyping and facets in XML
Schema and Relax NG enables those grammar-based languages to validate some aspects
of
content, but that functionality does not include validating content-based relationships
across nodes either within a document or across documents.
-
Grammar-based languages are better than Schematron at complete
structural validation. A grammar-based schema language is intended to model
an entire document, which means, for example, that Relax NG will raise an error during
validation if there are parts of the document that are not represented in an associated
Relax NG schema.
The completeness of grammar-based schemas enables them to support reliable and
thorough content-completion hinting in an editing environment because real-time validation
knows at every location in a document what is and is not allowed. Schematron can provide
hints in response to specific rule violations (this is how the <oXygen/> Schematron
Quick-Fix feature works), but because Schematron validation does not require complete document coverage,
Schematron cannot provide consistent, reliable content-completion hinting.
The schema validation introduced in this report currently focuses entirely on validating
structural (syntactic) relations. We intend eventually to support datatypes and facets
at
least at the same level as Relax NG. We do not plan to support Schematron-like constraint
validation, but nothing in our design precludes performing Schematron-like constraint
validation with a different tool, much as users currently do for XML validation.
What we adopt from existing schema languages
We began our development of a next-generation schema language by adopting four features
of
existing schema languages for XML:
-
Grammar-based language: The grammar-based schema
languages for XML listed above (DTD, Relax NG, XML Schema), as well as Creole, are
designed to model entire documents, which is not the case with a constraint-based
language like Schematron. We opted for a grammar-based approach because modeling an
entire document was a priority. As noted above, nothing in our approach precludes
employing a constraint-based language like Schematron to validate features that cannot
be expressed naturally in a grammar, but that type of enrichment is not in scope for
the
present report.
-
Partially ordered content: Relax NG is the only
grammar-based schema language in current use for XML that is capable of representing
Interleave, which, as described above, can model and validate unordered and partially
ordered content. As we write above, we struggle to find a natural, real-world use
for
the partially ordered variant of Interleave, but we do find partial order important
for
other purposes. For that reason, we regard Relax NG as the existing schema language
for
XML that comes closest to supporting the features needed for next-generation markup
systems.
-
Compact syntax: We further adopt Relax NG compact
syntax as the formal syntax for our schema language because we find it more legible
than
the XML syntax and we regard human legibility as a high priority for a schema language.
Nothing in our proposal precludes expressing our schema language in other ways,
including using XML syntax, but such expressions are out of scope for the present
report.
-
Concur and Creole: Finally, we also adopt and
further develop features (in particular, a Concur reserved word) added to
Relax NG compact syntax in Creole. Creole lacks an available implementation, but the
design and implementation are documented clearly in Tennison 2007.
While we copy and extend syntactic features of Relax NG and Creole, we deviate from
those
schema languages in our implementation. In particular, both Relax NG and Creole use
parsing with Brzozowski derivatives to manage trees of expectations,
while we instead implement an MCSG. We discuss the motivations for our implementation
decisions below.
Patterns and anti-patterns in Relax NG compact syntax
Relax NG compact syntax permits several constructions that appear to allow markup
that is
either impossible to implement in XML (e.g., adjacent text nodes) or that, if implemented
in a
document, would result in well-formedness violations (e.g., repeated attributes).
Content
models in Relax NG compact syntax also fail to distinguish syntactically attributes
(properties of elements) from actual content items (elements, text nodes), although
the
distinction between attributes and content items is unambiguous in XML syntax.
These features of Relax NG compact syntax compromise the legibility of a schema without
improving its expressive power. For that reason we consider them
anti-patterns and we exclude them from our proposed schema language,
which otherwise aims for compatibility with Relax NG compact syntax notation. None
of the ways
in which our schema language is stricter than Relax NG compact syntax impinges on
the
expressive power of the schema because everything we prohibit can be expressed more
clearly in
both our schema language and Relax NG itself in other ways.
The features where our schema rules diverge from those of Relax NG compact syntax
are the
following:
Attributes and patterns that resolve to attributes do not accept repeatable occurrence
indicators (+, *)
Attributes are not repeatable either in XML or in the next-generation markup languages
that we aim to model. Relax NG compact syntax nonetheless allows attributes and
patterns that
that
resolve to attributes to be attached to the repetition indicators + and
*. Because a schema cannot override a well-formedness constraint and
well-formedness does not permit repetition of attributes, attempts to repeat an attribute
because the schema says that it is permitted will nonetheless raise a well-formedness
error.
Any use of + in association with an attribute is actually understood as “required but
not repeatable” (otherwise represented in Relax NG compact syntax by the absence of any
explicit occurrence indicator) and * is understood as “optional and not repeatable”
(otherwise represented by ?).
Allowing the schema to say that attributes are repeatable when well-formedness prohibits
their repetition means that the schema purports to license XML expressions that are
not, in
fact, allowed in XML. This means that the schema misrepresents what is possible in
the XML
it will be used to validate.
Because we take seriously the role of a schema as a human-readable statement of what
is
and is not permitted in associated document instances, our schema language does not
permit
the repeatable occurrence indicators + and * to be associated with
(non-repeatable) attributes. Because attributes may be optional, they do accept the
? occurrence indicator.
text requires at least one textual character and does not accept
repetition occurrence indicators (+, *)
The reserved word text in Relax NG compact syntax means “zero or more contiguous
textual characters”. This is potentially confusing for at least two reasons:
-
It is counter-intuitive that a pattern that appears to require text could be
satisfied by zero textual characters, which a human would understand as the absence
of
text.
-
An XML document instance cannot contain a text node that consists of zero
characters, so the Relax NG keyword says that it permits something that cannot occur.
In our schema language the keyword text means “one or more textual characters”. The
Relax NG definition of zero or more textual characters is implicitly optional, but ours
is not, which means that text in our system may be followed by ? to mean “either no
text node, and therefore no characters, or exactly one text node, and therefore one or
more characters”.
The next-generation markup languages we target continue the XML practice of not
permitting adjacent text nodes. In the case of both these next-generation languages
and XML,
then, any construction involving adjacent text nodes is interpreted as a single merged
text
node. For that reason we do not permit the repetition occurrence indicators +
and * after the keyword text.
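The merging of adjacent text nodes can be illustrated with a short Python sketch (our own illustration; the token representation is an assumption):

```python
def merge_adjacent_text(tokens):
    """Merge runs of adjacent text tokens into a single text token.

    Tokens are ("text", value) or ("tag", name) pairs; like XML, the
    resulting data model never exposes two text nodes side by side,
    which is why `text+` and `text*` would be meaningless.
    """
    merged = []
    for token in tokens:
        if token[0] == "text" and merged and merged[-1][0] == "text":
            # Fold this text token into the preceding text token.
            merged[-1] = ("text", merged[-1][1] + token[1])
        else:
            merged.append(token)
    return merged

print(merge_adjacent_text([("text", "Hi"), ("text", " there"), ("tag", "b")]))
# [('text', 'Hi there'), ('tag', 'b')]
```

Because any run of adjacent text tokens collapses to a single node, a repetition indicator on the keyword text could never be observed in a document instance.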
Finally, we do not allow the keyword text in Interleave
patterns. The only way to interleave text nodes with elements in our schema language
is by
using the mixed keyword. We require the mixed keyword because it
signals to the user that an element contains mixed content more clearly than a repeatable
or-group that includes the keyword text.
Attributes are not content
In Relax NG attributes and patterns that resolve to attributes are included in content
models alongside elements and the keyword text. Because attributes are
properties of elements, rather than content, this is misleading; a
Relax NG content model may appear to impose ordering relationships on attributes with
respect to both other attributes and actual element content items even though no such
ordering relationships are possible. Our schema language will represent attributes
separately from actual element content, but the exact mechanism for doing so is still
under development.
Mixed content must be represented with the keyword mixed
Relax NG compact syntax can represent mixed content with the keyword mixed,
as in
p = element p { mixed { ( a | b )* } }
but it can, alternatively, omit the keyword mixed and instead include the keyword text
in a repeatable or-group or interleave, e.g.:
p = element p { ( a | b | text )* }
or
p = element p { a* & b* & text }
Using the keyword mixed improves the human legibility of Relax NG content models
because it announces unambiguously that it represents mixed content. For that reason,
our schema language does not permit the last two models above, and instead requires the
keyword mixed for modeling mixed content.
Overlap and Creole
About our notation
In this report we use the term <element> differently than Tennison 2007. Creole was designed to meet the requirements of LMNL, although Creole
is intended to be data-agnostic and therefore applicable to parsing other markup languages
that support overlap (Creole is not designed to support discontinuity). (Tennison 2007 26) LMNL is based on ranges, which, unlike XML
elements, are generally (but not universally) allowed to overlap.
Tennison 2007 uses the following terms:
-
A LMNL <range> is a sequence of markup and text tokens.
Ranges have start- and end-tags and are generally allowed to overlap with other
ranges.
-
A LMNL <element> is a sequence of markup and text tokens that is not allowed to overlap
with other ranges. In Creole an <element> is equivalent to (synonymous with) a <range>
child of a <partition>, i.e., <element> is a notational variant (synonym) of
<partition><range> … </range></partition>.
In this report we use the term <element> where Tennison 2007 uses the term <range>,
and we use the term <partition> where Creole uses <element>. We do not use the term
<range>.
That is:
Table I
Comparison of terminology
| Creole | This report |
| <range> | <element> |
| <element> | <partition> |
Example of Overlap in Creole
Tennison 2007 uses a brief biblical example to review and critique the
representation of Overlap in SGML Concur and elsewhere and to
illustrate an alternative model and notation for Creole that address the perceived
limitations. The biblical example is illustrative and the point is more general: markup
hierarchies under a shared root (in this case a root element like
<bible>) can overlap in a variety of meaningful ways. This
particular example encodes four overlapping structures, as follows: (5–6)
-
A hierarchy of Bible, books, and chapters. Books begin with a title followed by
one or more chapters, which, in turn, contain one or more verses.
-
A hierarchy of Bible, books, and sections. Books (the same books as in the
preceding hierarchy) begin with a title (also the same as in the preceding hierarchy)
followed by one or more sections. Sections begin with section headers added by an
editor (e.g., “The flood and the tower of Babel”) followed by one or more paragraphs,
which, in turn, contain one or more sentences. Tennison 2007 says elsewhere on the same
page that every paragraph is made up of “one or more verses”, and because verses may
break in the middle of sentences and vice versa, paragraphs thus contain concurrent
tessellated hierarchies of verses and sentences.
-
A hierarchy of Bible and pages. Page breaks may occur within or between components
of the preceding two hierarchies (e.g., within or between books or chapters or verses
in the first hierarchy or books or sections or paragraphs or sentences in the second),
except that page breaks cannot appear within a book title (first, second, and third
hierarchy) or section heading (second hierarchy).
-
A set of index references that is not easily or naturally modeled as a hierarchy.
The role of index references in the markup is not explained in Tennison 2007, but the name suggests that they might be targets of entries/pointers in a
back-of-the-book index (or its digital-publication counterpart). Index references
may
overlap other index references (a relationship traditionally called
self-overlap), which means that some text tokens may be part of
multiple index references; some text tokens may not be part of any index reference;
and page breaks may fall within index references. The example says that a “sentence
or heading contains any number of index references”, but we can
easily imagine an index to the Bible in which some references might cross sentence
(and perhaps larger) boundaries.
The first two hierarchies are loosely coupled semantically in that books (also chapters)
and sections tend to be about something, and we would not expect a
titled, topical section to begin in one titled, topical book and end in another. One
way to
understand this is that a <book> in one hierarchy has a
<title> followed by <chapter> children,
while the same <book>, in another hierarchy, has the same
<title> followed by <section> children.
That is, the hierarchies are congruent at a higher (<book>) level
and then diverge partially (shared <title> but then
<chapter> vs <section>) below that.
Neither of these two hierarchies is subordinate to the other structurally because
sections
may start or end in the middle of chapters (and even, at least potentially, verses)
and vice
versa. A similar relationship obtains within paragraphs, where a paragraph is the
parent of
mutually independent tessellated sequences of sentences and of verses.
The example in Tennison 2007 models pages as milestone XML processing
instructions with the name new-page; alternatively, pages could be modeled with
tessellated milestone start- and end-markers, which is the structure used in the example
to
model index references. In either case we might consider pages to be subcomponents
of the
entire Bible (and not only of books or anything at a lower hierarchical level) insofar
as
new pages may begin either within or between books.
How Concur works in SGML and in Creole
In a traditional context-free production in a schema, the RHS is expanded in order,
except that, as described above, Interleave relationships are partially ordered
because the operands to the Interleave operator may occur in any order, with
the restriction that any order within any of the subcomponents must be
maintained. For example, if A, B is interleaved with C, D, the
four components may occur in any order only as long as A precedes
B and C precedes D. In this case, then, of the 24
possible permutations of the four components, only the following six are allowed:
A, B, C, D; A, C, B, D; A, C, D, B;
C, D, A, B; C, A, D, B; and C, A, B, D.
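One quick way to confirm this count is to filter all permutations against the two ordering constraints. The following Python sketch (not part of any schema formalism, purely illustrative) does exactly that:

```python
from itertools import permutations

def preserves_order(seq, sub):
    """True if the symbols of sub occur in seq in their given order."""
    it = iter(seq)
    return all(sym in it for sym in sub)  # membership consumes the iterator left to right

# A, B interleaved with C, D: keep only permutations in which
# A precedes B and C precedes D.
valid = ["".join(p) for p in permutations("ABCD")
         if preserves_order(p, "AB") and preserves_order(p, "CD")]
```

Sorting `valid` yields exactly the six sequences listed above.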
SGML Concur applies to entire documents, while Creole Concur
can apply to subtrees within a document. The Creole Concur operator is similar
to the Relax NG (and Creole) Interleave operator in that any order within a
model that participates in a Concur relationship must be respected. For
example, given the first two biblical hierarchies above, neither has a defined relative
order with respect to the other, but within the hierarchy of titled sections, each
<section> must contain exactly one
<header> element followed by one or more
<paragraph> elements.
In SGML Concur text tokens in the document instance, unlike element tags,
are allowed only in positions where they are simultaneously valid in
all concurrent content models. The same is true of Creole
Concur, but with an important exception involving
partitions that we discuss below. This distinction between markup
tokens (which may be part of some but not all models) and text tokens (which must
be part of
all models unless, under Creole, they are part of partitions) is consistent with our
understanding of Concur as concurrent markup over a single, shared
document, rather than as concurrent documents. At the same
time, the behavior of text tokens has important implications for how sequences can
be
combined with Concur. Consider, for example, the following Creole schema
production:
element root {
concur {
element section {
text
},
element chapter {
text
}
}
}
In Creole, constructions like
concur { A, B } mean that
A and
B overlap. The example above, then, means that a single
<section> element with plain text content overlaps with a single
<chapter> element that also has plain text content. The following
examples would all be valid against this production:
<root><section><chapter>This is an example text</section></chapter></root>
<root><chapter><section>This is an example text</chapter></section></root>
<root><chapter><section>This is an example text</section></chapter></root>
<root><section><chapter>This is an example text</chapter></section></root>
That is, the document tokens are partially ordered because the relative order of the
start- and end-tags for the <section> and
<chapter> elements is not informational, but the start-tag of each
element must precede its corresponding end-tag. The content model for
<section> elements does not allow a
<chapter> as a child of <section>, and
vice versa, but the last two examples are nonetheless valid and unambiguous because
the
schema defines the elements as concurrent, rather than nesting. It should be clear
from this
example that schema information may be required to distinguish concurrent relationships
from
hierarchical (nested) ones.
The following sequence, on the other hand, is not allowed under the production above:
<root><section>This is an </section><chapter>example text</chapter></root>
This example respects the relative order of the markup tokens (the start- and end-tags)
in both of the two concurrent sequences: either sequence, by itself, has the correct
root
element, the correct child element, and correct textual content. The reason this example
is
invalid is that the text tokens do not adhere to the rule that all text tokens in
concurrent
hierarchies, unlike markup tokens (start- and end-tags), must be valid in both hierarchies
simultaneously. In the example above, the text between the
<chapter> tags makes the <section> part of
the Concur relationship invalid because the <section>
part is defined in a way that does not permit text after the
</section> end-tag. Similarly, the text between the
<section> tags makes the <chapter> part of
the Concur relationship invalid because text is not permitted before the
<chapter> start-tag. We will see below that we can resolve this
issue by allowing text to appear more broadly in the concurrent content models.
A Concur relationship between two single elements that have textual content
is not very interesting, and we can enhance it by adding occurrence indicators that
allow
the components of the Concur relationship to repeat, as in the following
example:
element root {
concur {
element section+ {
text
},
element chapter+ {
text
}
}
}
The schema above allows for one or more sections and one or more chapters to overlap
in
any way as long as all text tokens in a valid document instance are allowed under
both
hierarchies. This means that:
-
Both the section and chapter markup need to start immediately after the
<root> start-tag, before any textual content;
-
Both the section and chapter markup need to end just before the
</root> end-tag, without any intervening text; and
-
No textual content is allowed between an end-tag of one element and the start-tag
of another.
Under the preceding schema the following sequence is now allowed:
<root><section><chapter>This is </section><section> an example text</chapter><chapter> And more text</chapter></section></root>
In this example all text tokens are between the start- and end-tags of both
<section> and <chapter> elements, which satisfies the
requirement that under
Concur all text tokens must be valid under all
concurrent hierarchies simultaneously. This example would not be valid if there were
text
immediately after the
<root> start-tag, immediately before the
</root> end-tag, or between adjacent tags, since in all of those
positions the text would not be allowed under one or both of the concurrent
hierarchies.
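The three conditions above amount to a simple necessary check on where text may appear. The following Python sketch (a simplification: it assumes tokens arrive as a flat list and checks only the placement of text, not the full operand matching) illustrates the rule:

```python
def text_positions_valid(tokens):
    """Check that every text token lies inside both an open <section>
    and an open <chapter>; markup tokens may interleave freely."""
    open_count = {"section": 0, "chapter": 0}
    for tok in tokens:
        if tok == "text":
            if not (open_count["section"] and open_count["chapter"]):
                return False          # text outside one of the hierarchies
        elif tok.startswith("/"):
            open_count[tok[1:]] -= 1  # end-tag closes one element
        else:
            open_count[tok] += 1      # start-tag opens one element
    return True
```

The valid sequence above passes this check, while the earlier invalid example (with text between `</section>` and `<chapter>`) fails it.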
Concur and mixed content
Because the Creole mixed keyword, borrowed from Relax NG, effectively makes
text optional before, between, and after the elements to which it is applied, it can
be used
to simplify the modeling of some Concur relationships. Consider, for example:
element root {
concur {
element section {
text
},
mixed {
element chapter {
text
}
}
}
}
The preceding model allows a single
<section>
element to concur (overlap) with a mixed model that includes a single
<chapter> element. The mixed model allows (but does not require)
text before or after the
<chapter> element, which means that the
following sequence is valid:
<root><section><chapter>This is an example text</chapter>more text</section></root>
In the example above, all text tokens are located between <section> start- and
end-tags, as required, since the <section> element does not allow mixed content.
The last two words are not between <chapter> tags, but the <chapter>
part of the
Concur relationship, modeled as mixed content, allows plain text in
this position, with the result that the text is valid under both parts of the
Concur relationship. On the other hand, the following example is not valid:
<root><section><chapter>This is an example text</section></chapter>more text</root>
Here the text at the end is not part of either of the
Concur components, which
is acceptable with respect to the <chapter> element, but not the
<section> element. Modeling both
<section> and
<chapter> elements as mixed content would make both of the
preceding examples valid.
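Under this model only the <section> operand constrains text placement, because the mixed <chapter> operand accepts text anywhere inside the concurrence. A minimal Python sketch of that asymmetry:

```python
def mixed_concur_text_valid(tokens):
    """<section> concurs with mixed { <chapter> }: every text token must
    fall inside an open <section>; the mixed operand imposes no constraint."""
    section_open = 0
    for tok in tokens:
        if tok == "section":
            section_open += 1
        elif tok == "/section":
            section_open -= 1
        elif tok == "text" and not section_open:
            return False  # text outside the non-mixed hierarchy
    return True
```

The first example above (text after `</chapter>` but inside `<section>`) passes; the second (text after `</chapter>` and outside `<section>`) fails.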
Using mixed content as described above can simplify modeling concurrent relationships,
but that simplification comes with the cost of allowing plain text in places where
it might
not be part of our mental model of one or another hierarchy in a concurrent relationship.
That is, although we can choose to ignore text that we don’t consider an informational
part
of one or another hierarchy during downstream processing, it is nonetheless part of
the
content according to the model. This is unlike the situation with markup tokens, as
illustrated in the example above, where we can include <chapter>
start- and end-tags inside a concurrent <section> element without
making the <chapter> part of the content of the
<section>.
Tennison 2007 recognized the requirement that all text tokens be part of all
overlapping hierarchies as a challenge for effective modeling. For example, the biblical
<section> hierarchy requires that a
<section> begin with a <header>, but the
section header is an editorial intervention that is not part of the text of the Bible,
and
therefore not part of the hierarchy of chapters and verses. Creole Concur
addresses this concern by diverging from SGML Concur in allowing
<element> structures alongside <range>
structures, where <element> is shorthand for wrapping a
<partition> around a <range>. (16) This
allows Creole to model text tokens that are not part of all concurrent hierarchies:
What about the <heading> element within the
<section>? Its text isn’t inside a
<verse>, so would normally be invalid. However, we can define
the <heading> as a partition—it can’t overlap anything else—and
this means that it’s ignored by any concurrent patterns.
In summary, then: Concur in Creole allows the symbols (tags and text
tokens) of multiple sequences to be intertwined at the token level as long as:
-
The relative order of the tokens in the individual sequences is maintained;
and
-
All text tokens must appear in positions that are valid for all sequences, except
that …
-
… text tokens within a <partition> are invisible to other
concurrent hierarchies.
Concur and Interleave
Although the Concur and Interleave operators are similar in
that both model partially ordered content, there are subtle but important differences:
-
In Interleave the interleaved components are subtrees, while in
Concur they are markup tokens and text tokens. As a consequence,
Interleave relationships result in a tree, while Concur
relationships result in a graph.
-
Concur allows elements of the same type to appear in multiple
concurrent models, while Interleave does not permit the same symbol to
occur in multiple interleaved sequences. This difference is a consequence of the
difference between trees and graphs; graphs allow nodes to be the targets of multiple
directed edges, while trees require that no node have more than one parent. Allowing
the same node type to appear in both parts of an Interleave relationship
would mean either multiple parentage (not permitted in a tree) or ambiguity (to which
of the interleaved components would the nodes belong?).
Mildly context-sensitive grammars (MCSG)
Handling overlap and implementing the Concur operator
Now that we have an understanding of the kinds of sequences that a Concur
operator in a schema can validate, we need to find equivalent grammar rules that we
can
create from those sequences. A CFG, which requires exactly one sequence on the right-hand
side of a production, is not capable of expressing a Concur relationship, but a
MCSG, which allows one or two sequences on the right-hand side, is. Within the grammar,
each
production of this type has reorder constraints associated with it that define all
possible
combinations of tokens of the sequences. All of this apparatus—the two sequences on
the
right-hand side and the reorder constraints—is invisible to the end user, who interacts
directly only with the schema syntax, including the Concur operator.
Grammars are typically written as expressions over symbols, and in our markup domain
we
use tokens as symbols. The set of token types is derived from the element names, which
for
any finite set of documents is also a finite set. The tokens can be categorized as:
-
Markup specific
-
start-tag
-
end-tag
-
empty-element tag
-
suspend-tag
-
resume-tag
-
Text
The following simplified biblical example, tagged with TexMECS Concur, is
based on a more elaborate example in Tennison 2007:
<genesis|
<chapter n="7"|
<section|
<verse n="24"|… a hundred and fifty days.|verse>
|chapter>
<chapter n="8"|
<verse n="1"|God thought of Noah …|verse>
|section>
|chapter>
|genesis>
The following schema for the biblical example above is based on features introduced
in
Tennison 2007. The ~ symbol is the Concur operator,
so the second line means that a <genesis> element contains a
concurrence of one or more <chapter> elements with one or more
<section> elements. Note that we use the term
<element> where Tennison 2007 uses
<range>:
start = genesis
genesis = chapter+ ~ section+
chapter = element chapter { attribute n, verse }
section = element section { verse }
verse = element verse { attribute n, text }
This can be compiled to a MCSG as follows:
start := <genesis| genesis_content |genesis>;
genesis_content := section optional_more_sections; , chapter optional_more_chapters;
section := <section| verse |section>;
chapter := <chapter attribute_n | verse |chapter>;
verse := <verse attribute_n "|" text |verse>;
attribute_n := "n" "=" string;
optional_more_sections := section optional_more_sections;
optional_more_sections := epsilon;
optional_more_chapters := chapter optional_more_chapters;
optional_more_chapters := epsilon;
In our notation a semicolon (;) in a production rule marks the end of a
single sequence of symbols on the right-hand side. In a MCSG the right-hand side
can
contain up to two sequences of symbols, and in such cases we use a comma (,) as
a separator between them.
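This notation can be read mechanically: split the right-hand side on the comma, then strip each semicolon-terminated sequence into its symbols. A small Python sketch (assuming whitespace-separated symbols, purely illustrative):

```python
def parse_rhs(rhs):
    """Split an MCSG right-hand side into one or two symbol sequences.
    ';' terminates a sequence; ',' separates the (at most two) sequences."""
    sequences = [part.strip().rstrip(";").split()
                 for part in rhs.split(",") if part.strip()]
    assert 1 <= len(sequences) <= 2, "a MCSG RHS holds one or two sequences"
    return sequences
```

For example, the right-hand side of the genesis_content production yields two sequences, one for sections and one for chapters.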
The following table shows the alignment of the input tokens with the tokens of operand
1
(<chapter>) and 2 (<section>) of the
Concur operator:
Table II
Overlap of <section> and <chapter>
using the Concur operator
| Input    | Operand 1 (chapter) | Operand 2 (section) |
| genesis  | X                   | X                   |
| chapter  | chapter             |                     |
| section  |                     | section             |
| verse    | verse               | verse               |
| text     | text                | text                |
| /verse   | /verse              | /verse              |
| /chapter | /chapter            |                     |
| chapter  | chapter             |                     |
| verse    | verse               | verse               |
| text     | text                | text                |
| /verse   | /verse              | /verse              |
| /section |                     | /section            |
| /chapter | /chapter            |                     |
| /genesis | X                   | X                   |
In this table, as in the schema above, sequences of <chapter>
and <section> elements observe a Concur relationship.
<verse> elements are part of both concurrent hierarchies. An
X (in the rows for the <genesis> start- and end-tags)
means that the element is not a participant in a Concur relationship. Text
nodes (except inside a <partition>, which is not relevant here) are
always part of both concurrent hierarchies.
We refer to the relationship between concurrent sequences as a reorder
constraint because the right-hand side of the production rule lists one entire
sequence before the other entire sequence, regardless of the order of the tokens in
the
input. When such a rule is used for validation, the goal is to confirm that the input
matches a concurrence of the two sequences on the right-hand side. That confirmation
requires knowing what is and is not permitted in a concurrence, and those rules are:
-
Each of the two sequences in a concurrent relationship, considered in isolation, is
fully ordered. That is, if operand sequence A requires the ordering A1, A2, A3 and
operand sequence B requires the ordering B1, B2, B3, the markup tokens of each of
those
two sequences, considered in isolation, must appear in the input in the specified
order.
-
Although the tokens of each of the two sequences have to observe any ordering
specified in the schema, the markup tokens of the two sequences do not have to be
aligned with each other. This means that the two sequences that make up the right-hand
side of the production rule are partially ordered: each sequence observes its own
ordering rules, but the markup tokens of the two sequences are not compared to each
other.
-
Unlike the markup tokens of the two concurrent sequences, the text tokens of the two
sequences do have to be aligned.
The reorder rule of the two sequences means that any implementation must adhere to
three rules:
-
All symbols of both operand sequences must be fully present in the input
data.
-
The relative order within each of the sequences must be observed.
-
When a text token is encountered in the input stream it must be present at the
current relative location in both operand sequences.
We have to determine whether the input sequence is valid for the operands of the
Concur operator, recalling that the text tokens of the two operands have to
line up. We do this by thinking of it as an alignment problem, which we address using
a
method called dynamic programming. We can approach the task by arranging the tokens in a matrix with the tokens of
the first operand serving as column labels and the tokens of the second operand serving
as
row labels. Initially we enter only the column and row labels; we explain below where
the
cell values (x) come from:
Table III
|          | chapter | verse | text | /verse | /chapter | chapter | verse | text | /verse | /chapter |
| (0,0)    | x       |       |      |        |          |         |       |      |        |          |
| section  | x       |       |      |        |          |         |       |      |        |          |
| verse    |         | x     |      |        |          |         |       |      |        |          |
| text     |         |       | x    |        |          |         |       |      |        |          |
| /verse   |         |       |      | x      | x        | x       |       |      |        |          |
| verse    |         |       |      |        |          |         | x     |      |        |          |
| text     |         |       |      |        |          |         |       | x    |        |          |
| /verse   |         |       |      |        |          |         |       |      | x      |          |
| /section |         |       |      |        |          |         |       |      | x      | x        |
Starting with just column and row labels, we will fill in the cells as we use the
sequence of input tokens to guide our traversal of the matrix from the upper left
to the
lower right. We can think of this traversal as moving two pointers, both of which
start at
0, so that the start coordinate would be (0, 0). When we say that a pointer is on
1, we mean that it is located after the first token. In other words, each pointer
is always between the most recently processed token and the next token to be expected
in the
input. Both pointers start on 0, that is, before the first token in the two sequences
has
been processed. Our traversal must observe the following output requirements:
-
The complete input sequence is matched by tokens in the operands.
-
The text tokens in both sequences line up.
-
If the input is valid, we can always advance one or both pointers.
-
If the input is valid, whenever a start-tag is matched simultaneously in both
sequences, the corresponding end-tag must also be matched simultaneously in both sequences.
-
If the input is valid, we end in the lower right.
Whenever a markup token comes in it must be allowed in the next position in either
or
both sequences, in which case we increment the respective pointer(s). If a markup
token is
not allowed at that location in either sequence, the input is not valid. For example:
-
The first token in our input example is a <chapter>
start-tag, which lets us increment (only) the row pointer but not the column pointer,
so we move from (0,0) to (1,0).
-
The next input token is a <section> start-tag, which lets
us increment only the column pointer, so we move from (1,0) to (1,1).
-
The next input token is a <verse> start-tag, which is
expected in both sequences at that location (immediately after a
<chapter> start-tag in one and immediately after a
<section> start-tag in the other), so we increment both
pointers at the same time, that is, we move diagonally in the matrix, from (1,1) to
(2,2).
Text tokens are handled differently. In case of text input, both pointers must point
to
a text token in their respective operand sequences of tokens. The unusual detail is
that
this can happen in two ways:
-
If a pointer is on a markup token, the next token must be a text token.
-
If a pointer is on a text token, it is not moved.
The first situation is similar to the behavior with markup tokens. The
second situation exists to allow multiple text tokens in one operand token sequence
to be
matched with a single text token in the other operand sequence. To illustrate why
this
second condition is necessary, consider the following TexMECS example, which contains
two text nodes (one with the value Hi and the other with the value Mom) and a
concurrence of <section> and <chapter> elements:
<genesis|
<section|
<chapter|Hi|chapter>
<chapter|Mom|section>
|chapter>
|genesis>
A schema that supports this markup is:
start = genesis
genesis = chapter+ ~ section+
chapter = element chapter { text }
section = element section { text }
A matrix that represents how we might traverse this concurrent input, based on the
method described above, is:
Table IV
|          | section | text | /section |
| (0,0)    | x       |      |          |
| chapter  | x       |      |          |
| text     |         | x    |          |
| /chapter |         | x    |          |
| chapter  |         | x    |          |
| text     |         | x    | x        |
| /chapter |         |      | x        |
Each of the two
<chapter> elements contains a one-word
text node and the single
<section> element contains a two-word text
node. When we reach the text token inside the second
<chapter>, the
input remains valid because of the second situation above: we are currently on a text
token
in the
<section>, and therefore do not need to advance the pointer
on that axis. Were we to encounter a text token in one hierarchy when there was not
a text
token in the other hierarchy (which would satisfy the first condition) and when we
were not
already on a text token in the other hierarchy (which would satisfy the second condition),
the input would not be valid against the schema.
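The pointer rules above can be sketched as a greedy two-pointer walk over the two operand token sequences. The following Python sketch is a simplification: it advances both pointers whenever both match and omits the backtracking and the paired start-/end-tag bookkeeping that a full dynamic-programming implementation would need:

```python
def concur_valid(input_tokens, op1, op2):
    """Greedy check of a token stream against two concurrent operand
    sequences. "text" is a text token; any other string is a markup token."""
    i = j = 0  # tokens of op1 / op2 consumed so far
    for tok in input_tokens:
        if tok == "text":
            # Each pointer must advance onto a text token, or already sit
            # on one (the rule that lets several text tokens in one operand
            # align with a single text token in the other).
            if i < len(op1) and op1[i] == "text":
                i += 1
            elif not (i > 0 and op1[i - 1] == "text"):
                return False
            if j < len(op2) and op2[j] == "text":
                j += 1
            elif not (j > 0 and op2[j - 1] == "text"):
                return False
        else:
            # A markup token must be expected next in at least one operand.
            in1 = i < len(op1) and op1[i] == tok
            in2 = j < len(op2) and op2[j] == tok
            if not (in1 or in2):
                return False
            if in1:
                i += 1
            if in2:
                j += 1
    return i == len(op1) and j == len(op2)  # both operands fully consumed
```

On the Genesis example both operands are fully consumed, corresponding to a traversal that ends in the lower right of the matrix; the Hi/Mom example also validates because the <section> pointer stays on its text token.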
Handling discontinuity
Discontinuity in TexMECS and TAGML is expressed through the use of suspend- and
resume-tags (see the example below), which are allowed only where either an
Interleave (recall that mixed content is an Interleave of an
element sequence with text) or a Concur relationship is active. This constraint
is expected because what discontinuity shares with Interleave and
Concur is that the content is partially ordered, that is, tokens that are part
of the discontinuous segments must appear in the correct order, but tokens that are
not part
of the discontinuity may appear before, between, and after them. Sperberg-McQueen and Huitfeldt 2008 explains that:
It seems clear that if we specify that the discontinuous element (q or B in our examples)
matches a normal (first-class) token in a content model, then the usual
rules for interpreting right-hand sides in the production rules of context-free grammars
do not apply: the usual interpretation is that the right-hand side specifies a sequence
of children for the left-hand side, but as described at some length above (section
“Goddag structures”) the discontinuous element itself cannot be part of any sequence which also contains the material which
interrupts it. We postulate, therefore, that discontinuous elements, taken as wholes,
do
not match first-class tokens in content models.
To accommodate elements that participate in a discontinuity relationship we adjust
our
MCSG slightly. Here is a sample input file with TexMECS tagging, taken from Sperberg-McQueen and Huitfeldt 2008:
<p|Alice
was beginning to get very tired of sitting
by her sister on the bank, and of having
nothing to do: once or twice she had peeped
into the book her sister was reading, but
it had no pictures or conversations in it,
<q|and what is the use of a book,|-q>
thought Alice
<+q|without pictures or conversation?|q>
|p>
In the example above, Alice’s single thought is tagged as a discontinuity of
<q> (quote) elements, interrupted by the phrase
thought Alice, which is in the narrator’s voice. The following schema
supports this markup:
start = paragraph
paragraph = element p { mixed { quote } }
quote = element q { text }
The schema compiler translates the preceding schema into the following MCSG:
start := <p| mixed_quote |p>;
mixed_quote := text; , quote;
quote := <q| text optional_discontinuity_quote |q>;
optional_discontinuity_quote := |-q> <+q| text optional_discontinuity_quote;
optional_discontinuity_quote := epsilon;
The table below illustrates how the input tokens (leftmost column) are allocated between
the two participants in a mixed-content (that is, Interleave) relationship, the
discontinuous quotation (operand 1) and the text with which it is mixed (operand 2):
Table V
Discontinuity of <q> within a
<p> with mixed content using the Interleave
operator
| Input | Operand 1 | Operand 2 (mixed) |
| p     | X         | X                 |
| text  |           | text              |
| q     | q         |                   |
| text  | text      |                   |
| -q    | -q        |                   |
| text  |           | text              |
| +q    | +q        |                   |
| text  | text      |                   |
| /q    | /q        |                   |
| /p    | X         | X                 |
The <p> tags are the parent of both parts of the
Interleave; they are marked with an X
to show that they are not
part of either of the interleaved sequences. When the <q> element
is paused, the content needs to come from the other operand until the
<q> resume-tag is reached. In this sense discontinuity is
comparable to a <partition> within a Concur
operator.
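The attribution of tokens in the table above can be mimicked with a small scanner. The following Python sketch (purely illustrative; it assumes a flat token stream in which -q suspends and +q resumes a single, non-nested quote) shows how content is routed to the quote or to the surrounding mixed text:

```python
def split_discontinuous(tokens):
    """Attribute text tokens to the quote (operand 1) or the surrounding
    mixed text (operand 2); return None if suspend/resume tags are misused."""
    in_q = suspended = False
    quote, mixed = [], []
    for tok in tokens:
        if tok == "q":
            if in_q:
                return None            # no nested quotes in this sketch
            in_q, suspended = True, False
        elif tok == "/q":
            if not in_q or suspended:
                return None            # cannot close a suspended quote
            in_q = False
        elif tok == "-q":
            if not in_q or suspended:
                return None            # suspend only inside an open quote
            suspended = True
        elif tok == "+q":
            if not (in_q and suspended):
                return None            # resume only a suspended quote
            suspended = False
        elif in_q and not suspended:
            quote.append(tok)          # content belongs to the quote
        else:
            mixed.append(tok)          # content belongs to the mixed text
    return quote, mixed
```

For the Alice paragraph, tokenized as text, q, text, -q, text, +q, text, /q, this yields two text tokens for each operand, matching Table V.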
Self-overlap with a mildly context-sensitive grammar
Creole introduces concurOneOrMore and concurZeroOrMore
operators to manage self-overlap, that is, situations where an element of a particular
type
overlaps with another element of the same type. Self-overlap requires special handling
for
two reasons:
-
In markup languages like MECS, TexMECS, LMNL, and TAGML, all of which support
self-overlap, start- and end-tags that are paired together use a shared identifier
on
the start- and end-tag. Without that sort of identifier, input like the following:
<x>A<x>B</x>C</x>
would be structurally
ambiguous. There are two <x> elements, each with a start- and
end-tag, but nothing in the markup tells us whether an inner
<x> element is a child of an outer
<x> element, on the one hand, or the two
<x> elements overlap and share the text token B.
We can disambiguate that situation by adding matching identifiers to both the start-
and end-tags, so that, for example, overlap, as in
<x n="1">A<x n="2">B</x n="1">C</x n="2">
would be distinct from containment, as in
<x n="1">A<x n="2">B</x n="2">C</x n="1">
-
Production rules in a grammar traditionally are completely static, which means
that the exact rules are known from information in the schema before parsing begins
(that is, without access to information in the document instance) and the production
rules do not change in response to anything encountered in the input.
The challenge is that the exact disambiguating identifiers that will occur in the
document instance cannot be specified in the schema if the schema is to be applicable
to
more than a single document, as schemas typically are. For that reason, processing
will need
to recognize when a start-tag with an identifier comes along and store it somewhere
until an
end-tag with a matching identifier is encountered, so that the parsing process can
verify
that each opened element is closed in a way that is consistent with schema requirements.
We
implement this functionality with an attribute grammar, which is an
extension on top of context-free grammars that allows production rules to capture
part of
the input stream, store it in what is called an attribute, and reuse that attribute
later in
the input stream to recognize whether an identifier given in an end-tag is valid at
the
location where it occurs.
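The capture-and-reuse behavior can be illustrated without the grammar machinery. The following Python sketch (not the attribute-grammar implementation itself) assumes tokens of the form (kind, name, identifier), as in <x~1| … |x~1>, and checks that identifiers pair start- and end-tags correctly:

```python
def tags_balanced(tokens):
    """Every end-tag must close a currently open element with the same
    name and identifier; the identifier, not the tag order, is what
    distinguishes overlap from containment."""
    open_elements = set()
    for kind, name, ident in tokens:
        key = (name, ident)
        if kind == "start":
            if key in open_elements:
                return False    # identifier reused while still open
            open_elements.add(key)
        else:  # kind == "end"
            if key not in open_elements:
                return False    # end-tag without a matching start-tag
            open_elements.remove(key)
    return not open_elements    # everything opened was closed
```

Both the overlapping reading (closing element 1 before element 2) and the nested reading validate, because the shared identifiers pair the tags.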
Consider the following TexMECS example:
<verse no="23"|God wiped out every living thing
that existed on earth, <index~1 ref="i0037"|man and
<index~2 ref="i0038"|beast|index~1>, reptile and
bird|index~2>; they were all wiped out over the whole
earth, and only Noah and his company in the
ark survived.|verse>
In the TexMECS example above, which has been simplified from an example in Tennison 2007, Verse 23 contains (self-)overlapping targets of index references,
that is, target spans of text to which entries in an index point. Index reference
i0037 refers to the string "man and beast" and index reference
i0038 refers to the string "beast, reptile and bird". The two
overlap in containing the same substring "beast". If this were XML and the
markup read:
<index ref="i0037">man and <index ref="i0038">beast</index>, reptile, and bird</index>
the markup would be well-formed but the parsing would be incorrect semantically. That
is, an
XML parser would recognize that there were two
<index> start-tags
and two
</index> end-tags, and that each end-tag follows a
start-tag. What an XML parser would fail to recognize, though, would be the overlap:
The XML
tagging above cannot represent overlap, and therefore says that indexed reference
i0038 is
wholly nested inside indexed reference
i0037. This is semantically
incorrect.
In TexMECS notation, a normal start-tag looks like <index| and
a normal end-tag looks like |index>. Self-overlap is a type of
concur relationship, and to avoid the misreading of the XML markup described above,
self-overlapping elements must record information that associates start-tags with
the
correct corresponding end-tags, so that, for example, <index~1| and
|index~1> are start- and end-tags for the same instance of an
<index> element. A schema used to author or validate this document
would recognize the two index reference targets (i0037 and
i0038) as instances of an element type <index>, which
is to say that the ~1 and ~2 are attributes
of the element instance, and not part of the element name. The same
relationship might be spelled, in a hypothetical alternative syntax, as
<index id="1" ref="i0037"> for the start-tag and
</index id="1"> for the end-tag.
Attribute grammars enhance a traditional grammar with the ability to perform (typically
small) computations. In the case of a CFG or a MCSG, the left-hand side of a production
is a
single non-terminal. The right-hand side contains an expansion of the left-hand side
into a
sequence of terminals and non-terminals (in the case of a CFG) or up to two such sequences
(MCSG), and with an attribute grammar the items on the right-hand side may be enriched
with
expressions that compute a value.
For example a schema containing self-overlapping <index> tags
could look like the following:
indexedText = concurOneOrMore { mixed { index* } }
index = element index { attribute ref { text }, text }
A set of grammar rules governing the self-overlapping <index>
elements looks like this:
open_index_tag := <index "~" identifier attribute_ref "|";
close_index_tag := |index "~" identifier ">";
identifier := string;
attribute_ref := "ref" "=" string;
indexed_text := interleave_text_index;
interleave_text_index := text; , optional_index;
optional_index := epsilon;
optional_index := open_index_tag concur_close_tag_new_index_tag; [concur_close_tag_new_index_tag.identifier = open_index_tag.identifier]
concur_close_tag_new_index_tag := text close_index_tag; , optional_index; [close_index_tag.identifier == concur_close_tag_new_index_tag.identifier]
An attribute on a grammar should not be confused with an attribute in the XML domain.
An
attribute on a grammar is implemented as an index number on non-terminals on the right-hand
side of a production rule. Next to the production rules are extra statements that
specify
whether attributes should capture incoming data or whether they should inherit data
from an
earlier non-terminal in the input stream in order to validate the current input
token.
Schema compiler
The present report focuses primarily on how to express next-generation features in
a
schema language. We have developed a prototype implementation of a schema compiler
and
accompanying parser/validator, the details of which will be the subject of a future
publication, but in this section we provide a brief overview of how the schema is
compiled
(that is, translated) into a grammar and how the grammar can be used to parse and
validate an
input document.
Our implementation compiles a schema expressed in the syntax described above into an MCSG, which we then parse with a modified LL parser. A standard LL parser has a set of production rules from a CFG and a stack as a memory for expectations. An LL parser also has a parsing table, which is generated from the production rules. In the standard case of a CFG, the expectations on the right-hand side of the production rules have to be matched in order. In the case of an MCSG, however, the content is matched in partial order as we determine whether an incoming token is allowed at that location in the data stream. For an incoming token, an LL parser can:
- Look at the top of the stack and use the table to determine whether a non-terminal / terminal combination is allowed.
- If so, identify the production rules allowed by the table.
- Put the symbols from the right-hand side of the applicable production rule on the stack.
The table provides an efficient method of determining which production rule needs to be applied according to a combination of the terminal token from the input and the non-terminal or terminal on the top of the stack.
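The steps above follow the standard LL(1) parse loop, which can be sketched as follows. This is a generic textbook illustration of the technique, not the authors' modified MCSG parser; the toy grammar and all names are invented for the sketch.

```python
# Generic LL(1) parse loop (textbook technique, not the authors' parser):
# a stack of expectations plus a table mapping (non-terminal, lookahead)
# to a production's right-hand side.
def ll_parse(tokens, table, start):
    stack = ["$", start]          # "$" marks the bottom of the stack
    tokens = tokens + ["$"]       # and the end of the input
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top == look:
            # Terminal (or "$") on top of the stack matches the input token.
            pos += 1
        elif (top, look) in table:
            # Expand the non-terminal: push the right-hand side of the
            # applicable production, leftmost symbol on top of the stack.
            stack.extend(reversed(table[(top, look)]))
        else:
            return False          # no production allowed here: reject
    return pos == len(tokens)


# Toy grammar: S -> "a" S | "b"
table = {("S", "a"): ["a", "S"], ("S", "b"): ["b"]}
print(ll_parse(["a", "a", "b"], table, "S"))  # True
print(ll_parse(["a", "a"], table, "S"))       # False
```

The table lookup makes each step constant-time in the number of input tokens, which is what allows the per-token cost of validation to stay low.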
In a schema, the number of production rules is relatively low compared to the number of documents and the size of each document (both theoretically unlimited) that a user might want to validate. This means that a schema compiler is able to improve efficiency by shifting the performance bottleneck away from the (potentially large) number of input tokens and onto the (relatively small) number of production rules in the schema.
Conclusion
This paper introduces a new method to model, parse, and validate complex, next-generation textual features such as overlap (including self-overlap), discontinuity, and partially ordered and unordered content. It builds on earlier work that forms part of Relax NG (compact syntax, the Interleave operator), Creole (the Concur and Partition operators), and the MLCD/MECS project (discontinuity). Where Relax NG and Creole use parsing with Brzozowski derivatives, our grammar-based technique relies on a mildly context-sensitive grammar combined with an attribute grammar. We have developed a prototype implementation that reads in a schema, compiles it into a mildly context-sensitive grammar, and then uses the grammar to validate markup files that contain overlapping and discontinuous markup. By generating a parser based on the schema, the cost of validation per token is kept low. The presence of overlapping or discontinuous markup in the source files does not cause ambiguity and thus does not negatively affect the performance of the validator.
In this article we have responded to the challenges we set out to solve, namely: 1) to propose a schema language that can be used to model complex, next-generation text features such as overlap (including self-overlap), discontinuity, and unordered and partially ordered content; 2) to propose a grammar technology that is able to express those features; and 3) to parse and validate document instances against a schema in a consistent and performant manner by compiling the schema into a mildly context-sensitive grammar and a parsing table before the actual validation starts.
References
[BNF] Janikow, Cezary Z. What is BNF?
http://www.cs.umsl.edu/~janikow/cs4280/bnf.pdf
[Cameron 1993] Cameron, Robert D. Extending
context-free grammars with permutation phrases.
ACM letters on programming
languages and systems vol 2, no. 1–4 (March–December 1993), pp.
85–94. doi:https://doi.org/10.1145/176454.176490
[Chomsky 1956] Chomsky, Noam. Three models for
the description of language.
IRE transactions on information theory, vol. 2, no. 3, pp. 113-124,
September 1956, doi:https://doi.org/10.1109/TIT.1956.1056813
[Clark 2002, Algorithm] Clark, James. An
algorithm for RELAX NG validation.
2002.
https://relaxng.org/jclark/derivative.html
[Clark 2002, Compact] Clark, James.
RELAX NG Compact Syntax. Committee Specification 21 November 2002.
https://relaxng.org/compact-20021121.html
[Clark, Design] Clark, James. The design of Relax
NG.
n.d. https://relaxng.org/jclark/design.html
[EBNF] ISO/IEC 14977:1996 Information technology —
Syntactic metalanguage — Extended BNF.
December 1996.
https://www.iso.org/standard/26153.html
[Huitfeldt and Sperberg-McQueen 2001] Huitfeldt,
Claus and C. M. Sperberg-McQueen. TexMECS. An experimental markup meta-language for
complex documents.
January 25, 2001, rev. October 5, 2003.
http://mlcd.blackmesatech.com/mlcd/2003/Papers/texmecs.html
[Knuth 1990] Knuth, Donald E. The genesis of attribute
grammars.
1990. In: Attribute grammars and their applications, ed. P.
Deransart and M. Jourdan. Vol. 461. Lecture Notes in Computer Science. Berlin, Heidelberg:
Springer, pp. 1–12. doi:https://doi.org/10.1007/3-540-53101-7_1.
https://link.springer.com/chapter/10.1007/3-540-53101-7_1 (visited on 2024-04-12).
[Luminescent] Piez, Wendell. Luminescent: a
LMNL syntax parsing pipeline in XSLT.
https://github.com/wendellpiez/Luminescent
[Manguinhas 2009] Manguinhas, Hugo.
Schema Languages for XML.
Actas do INForum-Simpósio de Informática 2009, pp. 299–310.
[Murata et al. 2005] Murata, Makoto, Dongwon Lee,
Murali Mani, and Kohsuke Kawaguchi. Taxonomy of XML schema languages using formal
language theory.
ACM Transactions on Internet Technology (TOIT), Volume 5, Issue 4, 2005,
pp. 660–704. doi:https://doi.org/10.1145/1111627.1111631. Available at
https://pike.psu.edu/publications/toit05.pdf. An earlier version by the first
three authors, presented at Extreme Markup Languages® 2000, is available at
https://cs.brown.edu/courses/cs196-9/murata00taxonomy.pdf
[Piez 2008] Piez, Wendell. LMNL in miniature. An
introduction.
Amsterdam Goddag Workshop, 1–5 December 2008.
http://piez.org/wendell/LMNL/Amsterdam2008/presentation-slides.html
[SAEAG] Barrellon, Vincent, Pierre-Edouard Portier,
Sylvie Calabretto, Olivier Ferret. Schema-aware Extended Annotation Graphs.
DocEng '16: Proceedings of the 2016 ACM Symposium on Document
Engineering, pp. 45–54.
doi:https://doi.org/10.1145/2960811.2960816
[Siegel 2022] Siegel, Erik. 2022.
Schematron. A language for validating XML. Denver: XML
Press.
[Sperberg-McQueen 2018] Sperberg-McQueen, C. M.
Representing concurrent document structures using Trojan Horse markup.
Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August
3, 2018.
In Proceedings of Balisage: The Markup Conference 2018. Balisage Series
on Markup Technologies, vol. 21 (2018).
doi:https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01
[Sperberg-McQueen and Huitfeldt 2004] Sperberg-McQueen,
C. M. and Claus Huitfeldt. GODDAG: A data structure for overlapping
hierarchies.
Springer-Verlag (2004). Preprint at
http://cmsmcq.com/2000/poddp2000.html
[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C. M., and Claus Huitfeldt. Markup discontinued: discontinuity in
TexMecs, Goddag structures, and rabbit/duck grammars.
Presented at Balisage: The
Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of
Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies,
vol. 1 (2008).
doi:https://doi.org/10.4242/BalisageVol1.Sperberg-McQueen01
[SQF] Schematron Quick Fixes (SQF).
https://www.oxygenxml.com/doc/ug-editor/topics/schematron-quick-fixes.html
[TagML 2017] Haentjens Dekker, Ronald, and David J.
Birnbaum. It's more than just overlap: Text As Graph.
Presented at Balisage:
The Markup Conference 2017, Washington, DC, August 1–4, 2017. In Proceedings of
Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies,
vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Dekker01
[TagML 2018] Haentjens Dekker, Ronald, Elli Bleeker,
Bram Buitendijk, Astrid Kulsdom and David J. Birnbaum. TAGML: A markup language of many
dimensions.
Presented at Balisage: The Markup Conference 2018, Washington, DC, July
31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference
2018. Balisage Series on Markup Technologies, vol. 21 (2018).
doi:https://doi.org/10.4242/BalisageVol21.HaentjensDekker01
[TEI Gentle introduction] A gentle introduction to
XML.
Chapter v. of the Text Encoding Initiative Guidelines, P5 Version 4.9.0. Last
updated on 24th January 2025, revision f73186978.
https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
[Tennison 2007] Tennison, Jeni. Creole: validating
overlapping markup.
XTech 2007.
https://www.princexml.com/howcome/2007/xtech/papers/output/0077-30.pdf
[Tennison 2009] Tennison, Jeni. Overlap,
containment and dominance.
Jeni’s musings, 2008-12-06.
http://www.jenitennison.com/2008/12/06/overlap-containment-and-dominance.html
[Tennison and Piez 2002] Tennison, Jeni and Wendell Piez.
2002. The Layered Markup and Annotation Language (LMNL).
Presented at Extreme
Markup Languages® 2002 (Montréal, Canada).
[Usdin 2005] Usdin, B. Tommie. W3C XML Schema; Relax NG; or DTD:
How’s a user to choose?
2005.
https://www.mulberrytech.com/papers/whichschema/index.html
[van der Vlist 2004] van der Vlist, Eric. 2004.
Relax NG. Sebastopol, CA: O’Reilly.
[XDM] XQuery and XPath Data Model 3.1. W3C
Recommendation 21 March 2017.
Norman Walsh, John Snelson, and Andrew Coleman, ed.
https://www.w3.org/TR/xpath-datamodel-31
[XML 1.0] Extensible Markup Language (XML) 1.0
(Fifth Edition). W3C Recommendation 26 November 2008.
Tim Bray et al., ed.
https://www.w3.org/TR/xml/
[XML Namespaces] Namespaces in XML 1.0
(Third Edition). W3C Recommendation 8 December 2009.
Tim Bray et al., ed.
https://www.w3.org/TR/xml-names/
[XML Schema 1.1] XML schema.
https://www.w3.org/XML/Schema