Introduction

Markup and markup languages

Document markup can be understood as a means of making explicit an interpretation of a text, that is, of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted. A markup language, in turn, is a set of markup conventions used together for encoding texts. (TEI Gentle introduction) Markup languages may vary in the structures and interpretations that they are (or are not) able to represent, and, where they are able to represent a structure, in the quality of the representation. Assessing the adequacy and quality of a markup language inevitably has a subjective aspect, but however these terms are understood, any assessment would value clarity and legibility (humans typically need to read and write markup), as well as suitability for any intended automated processing (computers need to be able to process markup efficiently).

An optimal markup language would be able to represent—clearly, legibly, and in a machine-tractable way—any feature of a document that a user considers important. The predominant ecosystem of document markup conventions and tools today, based on the XML-related family of W3C (and other) specifications, struggles to support, in a clear, legible, and machine-tractable way, many concepts that are required to represent the expressive richness of documents. Although workarounds exist, structural document features that cannot easily or naturally be expressed in XML and managed with existing XML tooling include Discontinuity,[2] Overlap,[3] Non-linearity,[4] Multi-ordered content,[5] and Unordered content.[6] We refer to a markup language that retains the intellectual and engineering virtues of XML while also supporting the preceding features as a next-generation markup language.

Tokenization, parsing, and validation

A principal goal of the present contribution is to imagine, and to begin to implement, a document authoring and processing environment that meets the needs of a next-generation markup language as naturally and effectively as the existing XML ecosystem meets the needs of XML markup. Just as the use of specific elements and attributes in XML documents can be constrained by schema languages designed to support XML features, our goals for a next-generation markup language include the ability to parse and validate documents tagged according to such a language against a user-defined schema that recognizes and manages all of its features, including next-generation features, much as the current XML ecosystem recognizes and manages the features of XML.

Parsing, as a high-level term, is the process of recognizing the subcomponents of a text stream. For example, parsing XML entails recognizing the different types of nodes that make up an XML document instance. Parsing in this high-level sense is sometimes subdivided into smaller tasks, such as, for example, an operation to distinguish markup from content and an operation to analyze the structural relationships among the markup and content components. In this two-part model, the part that involves recognizing markup and content tokens is typically called tokenization or lexing (lexical analysis) and the term parsing is reserved for grammatical analysis, that is, analyzing the relationships among the tokens identified during lexing.[7] It may be possible to tokenize an invalid document, so that, for example, a tokenization process might be able to recognize an end-tag at the very beginning of an XML document, where it is not permitted, by rules of well-formedness, to appear. Detecting the well-formedness violation is the responsibility of the parser (in the narrow sense of the term); the lexer identifies the tokens (structurally meaningful units in the text stream) and the parser either succeeds or fails to make sense of them.
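
The division of labor described above can be sketched in a few lines of Python (an illustrative toy for a simplified XML-like syntax, not the lexer of any real XML processor): the tokenizer happily recognizes an end-tag at the very start of the input; judging that token out of place is left to a later parsing stage.

```python
import re

# Token patterns for a simplified XML-like syntax (illustrative only;
# real XML lexing is considerably more involved).
TOKEN_RE = re.compile(r"""
    (?P<END_TAG>   </[A-Za-z][\w.-]*\s*> )
  | (?P<START_TAG> <[A-Za-z][\w.-]*\s*> )
  | (?P<TEXT>      [^<]+ )
""", re.VERBOSE)

def tokenize(stream):
    """Lexical analysis: recognize tokens without judging well-formedness."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(stream)]

# Tokenization succeeds even on input that no parser would accept:
# an end-tag at the very beginning of the document is still a token.
print(tokenize("</b><a>text</a>"))
# → [('END_TAG', '</b>'), ('START_TAG', '<a>'), ('TEXT', 'text'), ('END_TAG', '</a>')]
```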

Common reasons for parsing documents include validation (see below), enhancing information (some schema languages record datatype information that is not explicit in a document instance but can be made available during parsing), and performing transformations or other operations. Serialization can be considered the opposite of parsing: parsing translates a sequence (series) of characters into a structure that is convenient for subsequent processing (e.g., a tree in XML), while serialization turns that type of structure back into a series (thus the term serialize) of characters that is convenient for storage or transmission.

Validation is verification that the grammatical structure of a specific document conforms to the types of structures permitted by a schema used for that purpose. Schemas describe a domain-specific language, that is, they are authored to enable and enforce domain-specific structural requirements. XML distinguishes validation against a schema from well-formedness checking, that is, from verifying that the document observes requirements that must be met by all XML documents, regardless of domain-specific requirements and regardless of whether they are also validated against a schema. Strictly speaking, a document with angle-bracketed markup that does not conform to XML well-formedness requirements is not XML, which means that it cannot be validated against a (domain-specific) schema. More informally, though, users ultimately care whether documents do or do not meet expectations and requirements, and may therefore regard verifying well-formedness and verifying schema validity as parts of a single process of verifying the acceptability of the document markup.[8]

A validator that simply accepts or rejects a document according to whether it conforms to the structures permitted by a schema is called a recognizer. Beyond distinguishing valid from invalid documents, a validation process may emit error messages of varying granularity, which can help users correct structural errors. Validation can be performed in real time, and real-time validation is often incorporated into code editors to guide users through the process of adding valid markup; alternatively, validation can be performed on completed documents.

Schemas and grammars

One difference between XML and SGML is that SGML always required a schema (the only schema type SGML supported was the DTD), while schema validation is optional for XML. The decision by the original developers of XML to make schemas optional recognized that a schema may be important during or immediately after authoring, but once a document is complete, has been validated, and will not be changed again, there is no need to continue checking it for validity. SGML included markup features (such as tag omission) that could not be interpreted without reference to a schema, which is one reason a schema was obligatory for SGML; those features were excluded from XML to ensure that XML could be processed without reference to a schema.

In its current form our system requires a schema, that is, we integrate well-formedness checking with schema validation. We do not preclude separating these operations later, but for now that separation is out of scope because it is not clear how we would reliably analyze and interpret next-generation features like Overlap and Discontinuity without the guidance provided by schema rules.

Schemas that can be used to support parsing and validation of XML markup can be authored in W3C XML Schema (XML Schema 1.1), Relax NG (Clark 2002, Compact), or DTD (defined as part of XML 1.0), none of which is designed to represent the next-generation features described above. One of our goals, then, is to begin to develop a schema language and associated tooling for next-generation markup languages that is analogous to the schema languages and associated tooling currently available in the XML ecosystem.

The terms schema and grammar are sometimes used interchangeably in the structured-text community, but we use them here as technical terms with different meanings and properties.

  • We use the term schema to refer to a set of productions (rules that define a non-terminal on the left-hand side according to non-terminals and terminals on the right-hand side, similarly to Relax NG Compact Syntax) that constrains the use of specific elements, attributes, text nodes, etc. Schemas are typically developed to meet project-specific requirements. Because schemas are typically developed by content creators, they prioritize human legibility. With our next-generation schema language, as with schema languages for XML, a schema rule cannot overrule or supersede a well-formedness constraint. That is, schema rules provide a second layer of constraints beyond the universal and non-negotiable constraints imposed by well-formedness.

    A schema is both a human-oriented statement of what is and is not permitted in associated documents and a starting point for a grammar, as described below.

  • We use the term grammar to refer to a constrained set of productions, optimized for machine interpretation and processing, that is generated programmatically by compiling the schema into a grammar. While schema design prioritizes human legibility, grammar design prioritizes efficient computational implementation.

    As we explain below, complex tree-like operators such as OR, OPTIONAL, OneOrMore, ZeroOrMore, INTERLEAVE, and CONCUR may be rewritten as one or more linear productions when a human-oriented schema is compiled into a machine-oriented grammar. Familiar notations like Backus-Naur Form (BNF, see BNF) and its extensions (e.g., EBNF, see EBNF), which are often considered grammar notations, allow alternation and other complex operations, but we use the term grammar in this communication to refer specifically to lower-level, purely sequential notations.
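
As a rough illustration of that compilation step (our own sketch, with invented operator names, not the authors' implementation), the following Python function rewrites a small tree of operators into purely sequential productions, expressing choice as multiple productions that share a left-hand side, in the style of a Chomsky-type grammar:

```python
import itertools

def compile_pattern(lhs, pattern):
    """pattern is a tuple tree, e.g. ('SEQ', 'title', ('ONE_OR_MORE', 'paragraph'))."""
    productions = []
    counter = itertools.count(1)

    def flatten(pat):
        """Return a single grammar symbol standing for pat, emitting productions."""
        if isinstance(pat, str):                 # already a bare symbol
            return pat
        op, *args = pat
        nt = f"_{op.lower()}{next(counter)}"     # fresh non-terminal
        if op == 'SEQ':
            productions.append((nt, [flatten(a) for a in args]))
        elif op == 'OPTIONAL':                   # X? : nt -> X, nt -> (empty)
            productions.append((nt, [flatten(args[0])]))
            productions.append((nt, []))
        elif op == 'ONE_OR_MORE':                # X+ : nt -> X nt, nt -> X
            sym = flatten(args[0])
            productions.append((nt, [sym, nt]))
            productions.append((nt, [sym]))
        elif op == 'OR':                         # X | Y : one production per branch
            for a in args:
                productions.append((nt, [flatten(a)]))
        return nt

    productions.append((lhs, [flatten(pattern)]))
    return productions

for lhs, rhs in compile_pattern('chapter', ('SEQ', 'title', ('ONE_OR_MORE', 'paragraph'))):
    print(lhs, '->', ' '.join(rhs) or 'ε')
```

Each emitted production has a single non-terminal on the left and a pure sequence on the right; the tree-shaped operators survive only as generated helper non-terminals.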

Human developers interact directly with the schema, but not directly with the grammar, much as XML developers author XML Schema or Relax NG or DTD schemas to meet project-specific needs. A schema language is the declarative, human-facing component of formal document modeling, which means that it describes what is and is not permitted in a valid, conformant document instance, but it does not describe how parsing and validation are performed. If a schema is to be usable for parsing and validation, it must be amenable to compilation into a grammar that can perform the parsing and validation operations in a robust, efficient, and scalable way. The present contribution aims to:

  • Identify the intellectual and engineering requirements for implementing a user-friendly declarative formal schema language for a next-generation markup language, and

  • Contribute an implementation of such a system, including a machine-oriented grammar into which a schema can be compiled that is sufficiently robust, efficient, and scalable to be suitable for likely real-world use cases.

The three schema languages in wide use in the XML ecosystem, listed above, have mature implementations that have been in use for many years,[9] and we have no doubt that the next-generation schema language and implementation introduced here will be improved as they are tested against a broad variety of real-world documents and research needs. At the same time, the next-generation schema language we describe here, and our reference implementation of a validating parser for documents tagged in conformity with schemas expressed in that schema language, provide a viable starting point for a broad user community that is eager for next-generation functionality that is not easily or naturally implemented purely with XML markup and the existing XML toolkit.

Current widely used parsing strategies for markup technologies are often based on context-free grammars (CFGs), although the properties and restrictions they aim to describe and validate cannot always be expressed without recourse to features that cannot be represented completely within a CFG.[10] For example, the XML specification (XML 1.0) incorporates a formal, context-free grammar in EBNF notation (EBNF) that seeks to represent what is and is not permitted with XML markup, but that formal grammar must be supplemented, within the specification, with plain prose validity constraints (VC, e.g., An element type must not be declared more than once, §3.2) and well-formedness constraints (WFC, e.g., The Name in an element’s end-tag must match the element type in the start-tag, §3) because not all requirements for well-formed XML can be represented completely by the features available within a CFG. XML validity and well-formedness constraints can be implemented computationally in a robust, efficient, and scalable manner, but only through processing that transcends parsing and validation against the formal CFG grammar. Insofar as a grammar is a formal representation of a theory of the text, the system we describe in this contribution aims to be sufficiently expressive to admit all desired textual structures (as represented by a project-specific document schema), to recognize any structure not licensed by the schema as invalid, and to do so without recourse to extra-grammatical annotations and implementation features. Insofar as a grammar is also an engineering resource, a priority for our work is that our schema language, designed to maximize legibility for human schema designers, should be able to be compiled (translated) by the computer into a formal grammar that is adequate for efficient, scalable automated parsing and validation.

We regard the general process by which a schema is compiled into a grammar against which a document instance can be parsed and validated as an abstract, high-level metamodel, that is, as independent of any specific implementation in any specific programming language. We provide a particular implementation, then, primarily as a way of verifying whether the metamodel is implementable in principle. The distinction between the abstract metamodel and a concrete implementation invites an eventual comparison of implementations, whether with respect to performance, user experience, or other criteria, even as all implementations are required to emit what is functionally the same result because all implementations must implement the shared metamodel.

Conclusion

Ultimately, the primary aim of this contribution is to elaborate a formal model that is able to express complex phenomena of texts that users might reasonably wish to record, including those not easily represented with XML, and to do so in a way that is theoretically coherent and formally computable. This goal engages with several hard challenges that have not yet been resolved in a satisfactory way in the domain of markup languages, as is clear from, for example, the way Overlap has been a recurrent theme in markup conferences for decades. Much of the present contribution engages explicitly with formal grammar theory, but we also introduce an implementation of a proof-of-concept parser-validator that goes beyond the XML toolkit in being able to process next-generation features not as secondary or external operations, but as part of both the schema language and the grammar to which it compiles. Although our proposal emerged from our work on TagML, it is intended to be usable with any markup language that supports the next-generation features we discuss.[11]

Properties of grammars

Left-hand side and right-hand side

A grammar is primarily a set of productions (rules), each of which maps from a left-hand side (LHS) to a right-hand side (RHS). Grammatical productions are similar in some respects to statements in Relax NG compact syntax, where, for example:

chapter = element chapter { title, paragraph+ }
has an LHS with the value chapter and an RHS with a value that defines chapter as an element type with the name chapter that contains a single obligatory title followed by one or more items called paragraph. title and paragraph are, in turn, defined in their own productions elsewhere in the schema. A Relax NG compact-syntax production is not exactly the same as a grammar production, but what they have in common is that a single item in the LHS (in this case chapter) is defined in terms of items in the RHS (in this case title and paragraph, along with the comma connector and the Kleene plus occurrence indicator, which describe syntactic relations among the content items).

Terminals and non-terminals

The items in the LHS and RHS are either terminals or non-terminals, which can be understood as follows:

  • A terminal is an item that cannot be expanded further, such as, in both Relax NG compact syntax and grammars, a literal string.

  • A non-terminal is an item that can be defined in terms of terminals and other non-terminals.

For example, a Relax NG production like:

published = attribute published { "yes" | "no" }
defines the non-terminal published as an attribute with a value that entails a choice between two terminals, the string "yes" and the string "no". As above, a Relax NG production is not exactly the same as a grammatical production (in particular, a grammatical production does not allow alternatives, represented in Relax NG by the pipe connector),[12] but the distinction between terminals and non-terminals applies to both grammars and Relax NG schemas.

Parts of a grammar

A complete grammar has 1) a starting point, 2) a set of non-terminals, 3) a set of terminals, and 4) a set of productions that must ultimately define all non-terminals in terms of terminals.[13] The LHS of a grammar production must always include a non-terminal; whether it can also include terminals or other non-terminals depends on the type of grammar.[14]

A schema language may come with built-in, predefined productions, so that, for example, Relax NG compact syntax knows how to expand the keyword text (grammatically a non-terminal) into productions with only terminals (defined with reference to a set of one-character string values) on the RHS. One difference between a schema and a grammar, then, is that a grammar must be complete (that is, it must ultimately define everything using only terminals), while a schema language can come with implicit productions that do not need to be spelled out in the schema itself. This is consistent with the fact that a schema prioritizes human legibility because it is intended to be authored directly by humans, while a grammar prioritizes computational tractability because it is intended to be processed by a machine. Part of the parsing process involves compiling (transforming) human-oriented schema rules into machine-oriented grammar productions.
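
The four parts of a complete grammar can be sketched as a simple data structure with a completeness check (an illustration under our own naming conventions, not a normative representation):

```python
# The four parts of a grammar: a start symbol, non-terminals, terminals,
# and productions.  Completeness requires that every non-terminal be
# defined and that every symbol on a right-hand side be either a terminal
# or a defined non-terminal.

grammar = {
    "start": "chapter",
    "terminals": {"TITLE_TEXT", "PARA_TEXT"},
    "nonterminals": {"chapter", "title", "paragraphs", "paragraph"},
    "productions": [
        ("chapter",    ["title", "paragraphs"]),
        ("title",      ["TITLE_TEXT"]),
        ("paragraphs", ["paragraph", "paragraphs"]),
        ("paragraphs", ["paragraph"]),
        ("paragraph",  ["PARA_TEXT"]),
    ],
}

def is_complete(g):
    defined = {lhs for lhs, _ in g["productions"]}
    if g["nonterminals"] - defined:
        return False                       # some non-terminal is never defined
    for _, rhs in g["productions"]:
        for sym in rhs:
            if sym not in g["terminals"] and sym not in g["nonterminals"]:
                return False               # undefined symbol on an RHS
    return True

print(is_complete(grammar))   # → True
```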

Three classes of grammars

This contribution discusses three types of grammars, which differ with respect to what is or is not permitted on the LHS and RHS. We say more about these types of grammars below, but we can distinguish them concisely now as follows:

  • A context-free grammar (CFG) contains productions that map from a single non-terminal LHS to a single sequence of terminals and non-terminals in the RHS. The single item in the LHS has no access to contextual history.[15]

  • A context-sensitive grammar (CSG) allows multiple symbols in the LHS, at least one of which must be a non-terminal. The non-terminal is required because, as noted above, the point of a production is to explain a non-terminal, while terminals are self-explanatory and therefore not candidates for further explanation. The other symbols (terminals and non-terminals) that are allowed on the LHS in a CSG provide context; that is, they are how a CSG, unlike a CFG, gains access to direct contextual history.

  • A mildly context-sensitive grammar (MCSG) allows only a single (non-terminal) symbol on the LHS, similarly to a CFG and unlike a CSG. But an MCSG allows more than one group of terminals and non-terminals in the RHS, together with rules that determine how they can be mixed (interleaved) in the final production.
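
The LHS-based distinction between the first two classes can be sketched as follows (a toy illustration using our own casing convention for terminals and non-terminals; the distinguishing features of an MCSG lie in the RHS and are not captured by this test):

```python
# Hedged sketch (our own convention, not a standard one): distinguish
# CFG-shaped from CSG-shaped productions by the form of the left-hand side.
# Non-terminals are lower-case and terminals upper-case, purely as a local
# convention for this example.

def is_nonterminal(sym):
    return sym.islower()

def classify_lhs(lhs):
    """lhs is a list of symbols making up a production's left-hand side."""
    if len(lhs) == 1 and is_nonterminal(lhs[0]):
        return "CFG-shaped"        # a single non-terminal, no context
    if any(is_nonterminal(s) for s in lhs):
        return "CSG-shaped"        # a non-terminal with context symbols around it
    return "invalid"               # nothing to define: no non-terminal at all

print(classify_lhs(["chapter"]))       # → CFG-shaped
print(classify_lhs(["A", "b", "C"]))   # → CSG-shaped
```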

About context-free grammars (CFG) and XML

As noted above, parsers based entirely on context-free grammars (CFG) are unable to model not only next-generation markup languages, but also some features of XML. The term context-free means, among other things, that CFG productions must be static, that is, they must be based exclusively on information available before looking at the document instance. Information acquired only once parsing begins is dynamic, and therefore cannot be part of a CFG production.

It is possible to parse XML against an XML schema with a CFG, but this is not the case in the absence of a schema, which is why the XML spec consists of CFG productions supplemented with WFCs. Two features of XML where the availability of a schema enables the compilation of a CFG that does not need to rely on WFCs are:

  • XML validation requires that every start-tag be followed, in a non-overlapping way, by a corresponding end-tag for the same element type. If a schema is available, it necessarily identifies all element types that are allowed in a valid document, which means that the schema can be compiled into a grammar that includes separate explicit productions for every valid element type. That is, instead of saying something like every start-tag must be followed by a matching end-tag, a CFG compiled from a schema can say XML that is valid against this schema can contain only element types A, B, and C; a start-tag for element A must be followed by an end-tag for element A, a start-tag for element B must be followed by an end-tag for element B, and a start-tag for element C must be followed by an end-tag for element C. These explicit rules account for all possible element types in any valid document to be parsed against the schema, and therefore remove the need for a more general grammar rule that every start-tag must be followed by a matching end-tag.

  • XML validation requires that attribute names not be repeated on an element. If a schema is available, it necessarily identifies all attributes that are permitted on an element, which means that the schema can be compiled into a grammar that includes productions for all combinations and orders of attributes. For example, a schema can say that element A must include values for attributes X and Y, which can be translated into grammar rules that say that the element name in the start-tag of element A must be followed by either attribute X and then attribute Y or attribute Y and then attribute X. Those rules do not allow either attribute to repeat, and therefore remove the need for a general well-formedness constraint that prohibits that type of repetition.[16]
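
The compilation strategy sketched in the two bullets above can be illustrated as follows (a deliberately brute-force sketch with invented token names such as START_A and ATTR_X; it simplifies tag syntax and ignores the optimizations a real implementation would apply):

```python
from itertools import permutations

# Given a schema listing element types, their required attributes, and
# their content, emit explicit CFG productions: one production per
# element type, and one per permissible attribute ordering, so that no
# general "matching end-tag" rule and no general "no repeated attribute"
# rule is needed at parse time.

schema = {
    "A": {"attributes": ["X", "Y"], "content": ["B", "C"]},
    "B": {"attributes": [],         "content": []},
    "C": {"attributes": [],         "content": []},
}

def compile_schema(schema):
    productions = []
    for name, decl in schema.items():
        content = [f"elem_{c}" for c in decl["content"]]
        # One production per attribute order; repeating an attribute is
        # impossible because no production licenses the repetition.
        attr_orders = list(permutations(decl["attributes"])) or [()]
        for order in attr_orders:
            rhs = ([f"START_{name}"] + [f"ATTR_{a}" for a in order]
                   + content + [f"END_{name}"])
            productions.append((f"elem_{name}", rhs))
    return productions

for lhs, rhs in compile_schema(schema):
    print(lhs, "->", " ".join(rhs))
```

Note how every start-tag/end-tag pairing and every legal attribute combination is spelled out statically, before any document instance is seen.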

Schema validation is likely to incorporate optimizations that reduce the computation described in a brute-force way above. Nonetheless, the difference between validation against a schema and well-formedness checking in the absence of a schema is that when parsing against a CFG all information must be static, that is, available to the parsing process before it sees the document instance. A schema makes all valid element and attribute names, and all relationships among them, available in advance, providing information that can be used when the schema is compiled into a grammar prior to parsing the document instance. In the absence of a schema, element and attribute names are available only from the document instance, which means, by definition, that they cannot be incorporated into CFG productions.

As noted above, a common workaround for the inability to specify all properties of a markup language in a CFG has been to supplement the grammar, which defines only some features of the markup language, with well-formedness constraints (WFCs), which must be validated by augmenting the processing of the CFG. This approach has at least two types of limitation:

  1. WFCs are often rendered only in plain prose, and not in a formal, machine-implementable way, much less in a way that can be integrated formally with context-free productions. This is the case, for example, with well-formedness constraints in the XML specification (XML 1.0), where features of the markup language that cannot be expressed entirely with the resources provided by a CFG must be expressed in other ways. The absence of a machine-oriented formalization introduces a risk that implementations may be based on inconsistent (and even erroneous) human interpretations of the constraints.

  2. Parsers that take into account WFCs may need to keep state, that is, they may need to handle information encountered earlier in the parsing process that is not fully represented by any grammar production. For example, in the absence of a schema, an XML parser needs to know the elements that remain open at a given moment because without that information, it cannot verify that every start-tag has a corresponding end-tag and that the elements expressed by the tags are properly nested. Those requirements are expressed in the XML specification as WFCs because in the absence of a schema they cannot be expressed fully in a CFG alone.

In addition to looking backward (that is, keeping state), in cases involving ambiguous grammars a parser may also need to look ahead in order to resolve an ambiguity. Unlike keeping state, look-ahead is not necessarily incompatible with CFGs, and we discuss look-ahead operations in different types of grammars below.

The performance of CFGs benefits from not needing to save past information in memory in order to interpret an item that is currently in focus, but some next-generation markup features cannot be recognized and interpreted without contextual information. In the case of Discontinuity, for example, without access to earlier parts of a document instance, it may not be possible for a parser to recognize when an element in focus is the continuation of an element that went before, that is, whether the new element is part of a valid discontinuous construction.

In the case of CFGs with WFCs the challenges of keeping state effectively and efficiently have been studied extensively, which has led to deep knowledge about high-performance implementations. For example, an XML parser that is performing well-formedness checking in the absence of a schema does not need to hold everything before the current location in memory at all times; it is sufficient to keep track of only a subset of that information. To verify that every start-tag has a corresponding end-tag, the parsing process has to keep track of which elements have been opened but not yet closed, and in what order. Furthermore, because XML does not permit overlapping elements, the only part of that information that needs to be immediately accessible is the most recently opened element, which means that the information can be stored on a stack, a structure that provides efficient access to exactly that item. The situation is obviously more complicated once next-generation features, such as Overlap, need to be accommodated.
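
The stack discipline just described can be sketched as follows (our own minimal illustration, operating on pre-tokenized input):

```python
# Only the most recently opened, not-yet-closed element must be
# immediately accessible, so a stack of open element names suffices for
# checking that tags nest properly (no overlap).

def check_nesting(tokens):
    """tokens: list of (kind, name) pairs, kind in {'start', 'end', 'text'}."""
    open_elements = []                     # stack of open element names
    for kind, name in tokens:
        if kind == 'start':
            open_elements.append(name)
        elif kind == 'end':
            if not open_elements or open_elements[-1] != name:
                return False               # end-tag does not match top of stack
            open_elements.pop()
    return not open_elements               # everything opened was closed

print(check_nesting([('start', 'a'), ('start', 'b'), ('end', 'b'), ('end', 'a')]))  # → True
print(check_nesting([('start', 'a'), ('start', 'b'), ('end', 'a'), ('end', 'b')]))  # → False: overlap
```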

While we know the limitations of CFGs with respect to representing next-generation markup-language features, what to use instead is less clear. Compared to the markup community’s extensive and deep knowledge of modeling and processing XML with a CFG supplemented by WFCs that can be implemented efficiently, we have only limited and largely experimental knowledge of grammars for next-generation markup languages that might be able to represent all structural features of markup in a way that can be formalized for robust and efficient machine processing.

Insofar as a grammatically consistent and self-contained way of modeling a next-generation markup language may need to maintain state more comprehensively than is required for parsing XML, and perhaps to look both backward and forward, one option might be to use some kind of context-sensitive grammar. At the same time, it is not immediately clear how a context-sensitive grammar for that purpose should be constructed because context-sensitive grammars have not been standardized and remain an under-researched concept, at least in comparison to CFGs. We do know, however, that any solution will at least have to provide for three functions that are fundamental to our aim:

  1. A grammar-like function that enables the expression of a basic syntax that includes all obligatory next-generation features, as enumerated above. This function is comparable to the role of well-formedness constraints in XML.

  2. A schema-like function that supports additional domain-specific constraints on the use of features permitted by the grammar. This function is comparable to the role of schema languages in the XML ecosystem, where schema languages allow users to express, in a formal, machine-actionable way, the constraints that apply to domain-specific document models. Ideally the schema-like function would prioritize human legibility, much as is the case with schema languages for XML.

  3. A parser-like function that enables the validation of a document instance with respect to well-formedness and, where provided, user-specified schema rules in a consistent and performant manner, emitting clear error messages when needed. To avoid the need to keep entire documents in memory during parsing and validation, this function might be compared to the functions of XML SAX or StAX (but not DOM) parsers.

Any solution will have many moving and interlocking parts, all of which are known as desiderata for document modeling but most of which have not been explored thoroughly as potential parts of a grammar for a next-generation markup language. We explore these considerations in the next several sections, but we first survey historically related work and state-of-the-art developments that are relevant as a background for understanding the solution we will then propose.

Markup and schemas for XML and beyond

Schema languages for XML

There is no need to tell a Balisage audience that there are three grammar-based schema languages in wide use for modeling XML, as listed above: DTD, XML Schema, and Relax NG.[17] The three schema languages can all be considered tree grammars (thus Manguinhas 2009, §2), and they are largely comparable in the XML structures they are able to model, with two conspicuous exceptions:[18]

  • DTDs lack support for namespaces, although namespace support can be simulated (XML Namespaces §5). The other two schema languages fully support namespaces.

  • Only Relax NG supports Interleave.[19]

This second difference, concerning Interleave patterns, is important for next-generation features for two reasons:

  • Unlike most structures accommodated by all three schema languages for XML, Interleave incorporates unordered and partially ordered properties, which cannot be represented using only features of CFGs.

  • Concur and Discontinuity, as well as Mixed content, can be described and implemented using features required for modeling Interleave.

We discuss the relationship between Interleave and next-generation markup features later in this report.

Existing schema and grammar proposals for managing Overlap

As we look beyond XML, data models and schema languages that have been proposed for managing Overlap include SGML Concur and its XML variant XConcur; Rabbit/duck grammars; and Creole. Because Tennison 2007 includes detailed explanations, examples, bibliography, and critiques of the multiple-hierarchy models Concur, XConcur (including XConcur-CL, the validation component of XConcur), and Rabbit/duck grammars, we say nothing more about those technologies here except to point out that none of them engages explicitly with any next-generation feature other than Overlap. Luminescent, an LMNL parser implemented in XSLT (Tennison and Piez 2002, Luminescent), reflects a perspective similar to that of Rabbit/duck grammars in that it processes LMNL ranges with reference to XML milestone start- and end-markers.[20] Because LMNL lacked a schema language at the time Luminescent was developed, however, Luminescent performs well-formedness checking but not schema validation. Other proposals for managing Overlap include the GODDAG data model and the TexMECS markup language (Sperberg-McQueen and Huitfeldt 2004, Huitfeldt and Sperberg-McQueen 2001), which seek to model and manage not only Overlap but also Discontinuity (Sperberg-McQueen and Huitfeldt 2008). Modeling and tooling based on Rabbit/duck grammars, GODDAG, and TexMECS have been employed in projects implemented by the resource designers and developers (e.g., Sperberg-McQueen 2018), but have not been adopted broadly elsewhere.

Schema-aware Extended Annotation Graphs offers a graph-based approach to validating some next-generation features, performing simulation-based validation by construction. (SAEAG) The authors observe that since Dataguides and Graph Schemas are inferred from the data, they cannot be used as authoring tools (§4), which makes the method unsuitable for our purposes.

Why Creole is different

As far as we have been able to determine, Creole is the only detailed proposal for managing Overlap that is designed to process all structural properties of an entire document with a single grammar instead of validating overlapping hierarchies in ways that are partially or entirely separated from one another. This is consistent with our aim of managing Overlap as an integral feature of our document model. The one detailed publication about Creole, Tennison 2007, describes the design principles, some of which we adopt directly and some of which we either modify or replace with alternatives. Alongside a thorough explanation of the design of Creole and the strategy for processing it, Tennison 2007 reports having built a proof-of-concept implementation in XSLT 2.0 that worked as expected, except that it was insufficiently performant for production use. (25) We discuss performance in greater detail below, when we introduce our own design and implementation.

Partially ordered content: how Interleave works

Most authors of Relax NG schemas will have worked with both ordered and unordered content, but partially ordered content, which is vital for modeling next-generation features, is likely to be less familiar. The Interleave feature of Relax NG is able to model partially ordered content, but even those who have used Interleave in their own schemas may have used it to represent content that they consider unordered, rather than partially ordered. For that reason, before we introduce a schema language and grammar for a markup language that supports next-generation features, we take a close look at the meaning of partially ordered content as represented by Relax NG Interleave.

Interleave of simple items: A common use of Interleave involves unordered simple content items. For example, the content model for <x> in the following schema:

x = element x { A & B & C }
A = element A { empty }
B = element B { empty }
C = element C { empty }
means that <x> contains exactly one instance each of A, B, and C in any order, that is, any of the following six patterns: A,B,C; A,C,B; B,A,C; B,C,A; C,A,B; and C,B,A. In this respect the unordered nature of Interleave might be compared to the unordered nature of XML attributes.
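The membership test implied by this content model can be sketched in Python (the function name and token representation are our own illustration, not part of any schema tooling): a sequence matches an interleave of single-occurrence items exactly when it is a permutation of those items.

```python
from itertools import permutations

def valid_against_interleave_of_singletons(seq, items):
    """A sequence matches `items[0] & items[1] & ...` (one occurrence of each,
    in any order) exactly when it is a permutation of the items."""
    return sorted(seq) == sorted(items)

# All six orders listed in the text are licensed:
assert all(valid_against_interleave_of_singletons(list(p), ["A", "B", "C"])
           for p in permutations(["A", "B", "C"]))
# A missing or repeated item is rejected:
assert not valid_against_interleave_of_singletons(["A", "B"], ["A", "B", "C"])
assert not valid_against_interleave_of_singletons(["A", "A", "B", "C"], ["A", "B", "C"])
```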

Interleave of repeated simple items: Unlike attributes, however, elements that participate in Interleave may be repeated, as in the following content model:

A* & B* & C*
This means, among other things, that Interleave allows its immediate operands (in this example: zero or more instances of A, zero or more instances of B, and zero or more instances of C) to appear in any order, licensing the same six patterns as the example above, except with zero or more repetitions of each repeatable component, e.g., A*,B*,C*; A*,C*,B*, etc. But that isn’t the only thing allowed by Interleave with Repetition, because the model above does not require the repeated items to cluster with others of the same type. This means that in addition to the six patterns listed above, it also allows any number of A, B, and C instances in any order, and is therefore equivalent to:
( A | B | C )*
This, too, might be compared to unordered content.

Interleave of items with ordered content: The repeatability of the simple content-item types in the example above obscures the way Interleave models not only unordered content, but also partially ordered content. Consider the following Relax NG schema:

x = element x { a & b }
a = c, d
b = e, f
c = element c { empty }
d = element d { empty }
e = element e { empty }
f = element f { empty }
This schema says that an element of type <x> contains an interleave of patterns of type a and b. This schema allows the two immediate operands of the Interleave operator to appear in either order, that is, either all of a before all of b, as in:
<x>
  <c/><d/><e/><f/>
</x>
or all of b before all of a:
<x>
  <e/><f/><c/><d/>
</x>
Up to this point we might also consider this model unordered: x contains one instance of a and one instance of b in either order.

Interleave of items with partially ordered content: The fact that the content is partially ordered, rather than unordered, becomes visible in the remaining patterns allowed by this schema: the components of a (c and d) may be interwoven (that is, interleaved) with those of b (e and f). For example, in:

<x>
  <c/><e/><d/><f/>
</x>
the two components of a occupy positions 1 and 3 and the two components of b occupy positions 2 and 4. It would be a mistake, though, to think that the components of a and b can be combined in any order, that is, it would be a mistake to interpret the Interleave operator in this case as modeling unordered content. Because the components of a and b are ordered (the comma connector means in this order), the two components of a must occur with c before d and the two components of b must occur with e before f. That is, the following document would be invalid:
<!-- invalid -->
<x>
  <e/><d/><f/><c/>
</x>

Interleave, then, does not mean only that the immediate operands to the interleave operator may appear in any order; that is permitted, but it is not the only thing that is permitted. Interleave also does not mean that the individual atoms in the expansion of its arguments may appear in any order; this is not permitted because any lower-level order imposed by the patterns of the interleaved items must be respected. In the example above, c, d, e, and f may occur in any order only as long as c precedes d and e precedes f (not necessarily immediately).
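The partial-order semantics just described can be sketched as an ordinary interleaving-of-sequences check. The following Python function is our own illustration (not part of any Relax NG implementation): it tests whether a token sequence is a merge of two ordered sequences that preserves each sequence's internal order.

```python
from functools import lru_cache

def is_interleave(tokens, seq_a, seq_b):
    """True iff `tokens` is a merge of seq_a and seq_b that preserves the
    internal order of each operand (the classic interleaving-strings check)."""
    tokens, seq_a, seq_b = tuple(tokens), tuple(seq_a), tuple(seq_b)
    if len(tokens) != len(seq_a) + len(seq_b):
        return False

    @lru_cache(maxsize=None)
    def match(i, j):  # i tokens consumed from seq_a, j from seq_b
        if i + j == len(tokens):
            return True
        t = tokens[i + j]
        return ((i < len(seq_a) and seq_a[i] == t and match(i + 1, j)) or
                (j < len(seq_b) and seq_b[j] == t and match(i, j + 1)))

    return match(0, 0)

# The schema example above: a = c, d interleaved with b = e, f
assert is_interleave(["c", "e", "d", "f"], ["c", "d"], ["e", "f"])      # valid
assert is_interleave(["e", "f", "c", "d"], ["c", "d"], ["e", "f"])      # valid
assert not is_interleave(["e", "d", "f", "c"], ["c", "d"], ["e", "f"])  # invalid
```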

It’s easy to imagine realistic use cases for Interleave as a way of modeling unordered content. For example, mixed content that allows text intermixed with any number of <a> and <b> elements can be modeled in several ways, both with and without the Interleave operator. In a schema that includes:

p = element p { mixed { ( a | b )* } }
a = element a { text }
b = element b { text }
the first line could, alternatively, be expressed with the Interleave operator as:
p = element p { mixed { a* & b* } }
We find it more challenging, though, to think of natural, real-world use cases for partially ordered content. van der Vlist 2004 (54) gives an example of modeling groups of tourists in a museum, but we struggle to imagine that as a sort of relationship anyone might realistically want to model in a Relax NG schema. Consider, though, the following XML:
<root>
  <section>
    <annotation-start id="0001"/>
    <header>Hi, I'm a header!</header>
    <p>Hi</p>
    <ab>Bye</ab>
    <ab>Hi again</ab>
    <annotation-end id="0001"/>
    <p>Bye again</p>
  </section>
</root>
Ignoring for a moment the annotations, a <section> must start with a single <header> element that is followed by zero or more instances of <p> and zero or more instances of <ab> in any order (as long as there is at least one <p> or one <ab>). The preceding example without the annotations might look like:
<root>
  <section>
    <header>Hi, I'm a header!</header>
    <p>Hi</p>
    <ab>Bye</ab>
    <ab>Hi again</ab>
    <p>Bye again</p>
  </section>
</root>
Meanwhile, we want to be able to mark annotation ranges with milestone <annotation-start/> and <annotation-end/> elements. (We simplify the example by omitting the text of the annotation, which might, hypothetically, be represented by an attribute on the <annotation-start/> element.) These elements must appear in paired order, with exactly one start followed by exactly one end, and they must be coindexed by a shared @id value (not an @xml:id or HTML @id, since the value begins with a digit and is repeated on the paired elements). The scope of an annotation has to be one or more sequential <p> and <ab> elements. A Relax NG schema to represent this is:
start = root
root = element root { section+ }
section =
  element section {
    (header, (p | ab)+)
    & annotation-pair*
  }
header = element header { text }
ab = element ab { text }
p = element p { text }
annotation-pair = annotation-start, annotation-end
annotation-start = element annotation-start { id, empty }
annotation-end = element annotation-end { id, empty }
id = attribute id { text }

There are some limitations to the types of constraints that Relax NG can enforce:

  • Relax NG cannot require the @id values on pairs of annotation starts and ends to agree.

  • Relax NG cannot prevent <annotation-end/> from immediately following <annotation-start/>, without any intervening <p> or <ab> elements.

Both the preceding limitations can be overcome with Schematron. What the Relax NG schema can enforce, though, is partial order, as follows:
  • There must be exactly one <header> that precedes all <p> and <ab> elements within a section. If the <header> is missing or if it is preceded by any <p> or <ab>, the document is invalid.

  • <annotation-start> and <annotation-end> elements may occur anywhere as children of a <section> element (but not as children of <header>, <p>, or <ab>). They must occur in pairs, that is, every <annotation-start/> element must precede a corresponding <annotation-end/> element. This model does not allow a new annotation to open before the first one closes, which would be a case of self-overlap, about which please see below.
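As a sketch of the two checks that fall outside Relax NG (which in practice would be written as Schematron rules), the following Python fragment walks the children of each <section> and enforces id agreement and a non-empty annotation scope. The function name and error messages are our own invention.

```python
import xml.etree.ElementTree as ET

DOC = """<root><section>
  <annotation-start id="0001"/>
  <header>Hi, I'm a header!</header>
  <p>Hi</p><ab>Bye</ab><ab>Hi again</ab>
  <annotation-end id="0001"/>
  <p>Bye again</p>
</section></root>"""

def check_annotations(xml_text):
    """Report violations of the two constraints Relax NG cannot enforce:
    (1) paired @id values must agree; (2) at least one <p> or <ab> must
    fall between an annotation start and its end."""
    errors = []
    for section in ET.fromstring(xml_text).iter("section"):
        open_id = None   # @id of the currently open annotation, if any
        covered = 0      # number of <p>/<ab> seen inside the open annotation
        for el in section:
            if el.tag == "annotation-start":
                open_id, covered = el.get("id"), 0
            elif el.tag == "annotation-end":
                if el.get("id") != open_id:
                    errors.append("id mismatch: %r vs %r" % (open_id, el.get("id")))
                if covered == 0:
                    errors.append("annotation %r covers no <p> or <ab>" % el.get("id"))
                open_id = None
            elif el.tag in ("p", "ab") and open_id is not None:
                covered += 1
    return errors

assert check_annotations(DOC) == []  # the example document passes both checks
BAD = DOC.replace('annotation-end id="0001"', 'annotation-end id="0002"')
assert check_annotations(BAD) == ["id mismatch: '0001' vs '0002'"]
```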

One way to think of what Interleave permits and prohibits is that it describes partial order. As we will see below, in this respect it shares some properties of Concur: each concurrent hierarchy under Concur must observe, when we ignore the components that are not part of that hierarchy, all hierarchy-specific constraints, including ordering constraints. Similarly, each operand of an Interleave operator must observe, when we ignore other immediate operands, all operand-specific constraints.

Finally, we can also consider mixed content a type of Interleave, where, for example:

p = element p { mixed { a, b } }
requires exactly one instance each of a and b in that order, with optional text nodes before, between, or after. This can be considered a partially ordered interleave of the sequence a, b and text. Because the keyword text in Relax NG means zero or more textual characters, it is implicitly optional, and because XML does not permit adjacent text nodes (a parser merges adjacent text into a single text node), it is also implicitly repeatable. This means that the same content constraints can be modeled alternatively with the Interleave operator as:
p = element p { (a, b) & text }
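The token-level consequence of this equivalence can be sketched as follows (an illustration of the semantics just described, not of any actual validator): text nodes are optional before, between, and after a and b, the relative order of a and b is fixed, and adjacent text tokens cannot arise.

```python
def valid_mixed_a_b(tokens):
    """Token-level sketch of mixed { a, b } (equivalently (a, b) & text):
    optional text before, between, and after exactly one `a` followed by
    one `b`; because XML merges adjacent text, two TEXT tokens may never
    be adjacent."""
    if any(t == u == "TEXT" for t, u in zip(tokens, tokens[1:])):
        return False  # adjacent text nodes cannot occur in XML
    return [t for t in tokens if t != "TEXT"] == ["a", "b"]

assert valid_mixed_a_b(["TEXT", "a", "TEXT", "b", "TEXT"])  # text anywhere around a, b
assert valid_mixed_a_b(["a", "b"])                          # text is optional
assert not valid_mixed_a_b(["b", "a"])                      # order of a, b is fixed
assert not valid_mixed_a_b(["TEXT", "TEXT", "a", "b"])      # adjacent text impossible
```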

SGML DTDs included an & operator that meant in any order. The SGML operator, then, models only unordered content, which is different from the meaning of Relax NG Interleave. XML DTDs do not include an & operator.

Grammar-based and constraint-based schema languages

XML schemas in XML Schema, Relax NG, or DTD are often supplemented by Schematron schemas, that is, collections of Schematron rules. Schematron rules are expressed in XPath, and can therefore validate anything that can be expressed in XPath, which includes both structural and content-oriented features. Many document properties can be modeled in either a grammar-based schema language or Schematron, but in general:

  • Schematron is better than grammar-based languages at content-oriented validation. The inclusion of datatyping and facets in XML Schema and Relax NG enables those grammar-based languages to validate some aspects of content, but that functionality does not include validating content-based relationships across nodes either within a document or across documents.[21]

  • Grammar-based languages are better than Schematron at complete structural validation. A grammar-based schema language is intended to model an entire document, which means, for example, that Relax NG will raise an error during validation if there are parts of the document that are not represented in an associated Relax NG schema.

    The completeness of grammar-based schemas enables them to support reliable and thorough content-completion hinting in an editing environment because real-time validation knows at every location in a document what is and is not allowed. Schematron can provide hints in response to specific rule violations (this is how the <oXygen/> Schematron Quick-Fix feature works),[22] but because Schematron validation does not require complete document coverage, Schematron cannot provide consistent, reliable content-completion hinting.

The schema validation introduced in this report currently focuses entirely on validating structural (syntactic) relations. We intend eventually to support datatypes and facets at least at the same level as Relax NG. We do not plan to support Schematron-like constraint validation, but nothing in our design precludes performing Schematron-like constraint validation with a different tool, much as users currently do for XML validation.

What we adopt from existing schema languages

We began our development of a next-generation schema language by adopting four features of existing schema languages for XML:

  • Grammar-based language: The grammar-based schema languages for XML listed above (DTD, Relax NG, XML Schema), as well as Creole, are designed to model entire documents, which is not the case with a constraint-based language like Schematron. We opted for a grammar-based approach because modeling an entire document was a priority. As noted above, nothing in our approach precludes employing a constraint-based language like Schematron to validate features that cannot be expressed naturally in a grammar, but that type of enrichment is not in scope for the present report.

  • Partially ordered content: Relax NG is the only grammar-based schema language in current use for XML that is capable of representing Interleave, which, as described above, can model and validate unordered and partially ordered content. As we write above, we struggle to find a natural, real-world use for the partially ordered variant of Interleave, but we do find partial order important for other purposes. For that reason, we regard Relax NG as the existing schema language for XML that comes closest to supporting the features needed for next-generation markup systems.

  • Compact syntax: We further adopt Relax NG compact syntax as the formal syntax for our schema language because we find it more legible than the XML syntax and we regard human legibility as a high priority for a schema language. Nothing in our proposal precludes expressing our schema language in other ways, including using XML syntax, but such expressions are out of scope for the present report.

  • Concur and Creole: Finally, we also adopt and further develop features (in particular, a Concur reserved word) added to Relax NG compact syntax in Creole. Creole lacks an available implementation, but the design and implementation are documented clearly in Tennison 2007.

While we copy and extend syntactic features of Relax NG and Creole, we deviate from those schema languages in our implementation. In particular, both Relax NG and Creole use parsing with Brzozowski derivatives to manage trees of expectations, while we instead implement an MCSG. We discuss the motivations for our implementation decisions below.

Patterns and anti-patterns in Relax NG compact syntax

Relax NG compact syntax permits several constructions that appear to allow markup that is either impossible to implement in XML (e.g., adjacent text nodes) or that, if implemented in a document, would result in well-formedness violations (e.g., repeated attributes). Content models in Relax NG compact syntax also fail to distinguish attributes (properties of elements) syntactically from actual content items (elements, text nodes), although the distinction between attributes and content items is unambiguous in XML syntax.

These features of Relax NG compact syntax compromise the legibility of a schema without improving its expressive power. For that reason we consider them anti-patterns and we exclude them from our proposed schema language, which otherwise aims for compatibility with Relax NG compact syntax notation. None of the ways in which our schema language is stricter than Relax NG compact syntax impinges on the expressive power of the schema because everything we prohibit can be expressed more clearly in both our schema language and Relax NG itself in other ways.

The features where our schema rules diverge from those of Relax NG compact syntax are the following:

Attributes and patterns that resolve to attributes do not accept repeatable occurrence indicators (+, *)

Attributes are not repeatable in either XML or the next-generation markup languages that we aim to model. Relax NG compact syntax nonetheless allows attributes and patterns that resolve to attributes to be attached to the repetition indicators + and *. Because a schema cannot override a well-formedness constraint and well-formedness does not permit repetition of attributes, any attempt to repeat an attribute because the schema says that it is permitted will nonetheless raise a well-formedness error. Any use of + in association with an attribute is actually understood as required but not repeatable (otherwise represented in Relax NG compact syntax by the absence of any explicit occurrence indicator) and * is understood as optional and not repeatable (otherwise represented by ?). Allowing the schema to say that attributes are repeatable when well-formedness prohibits their repetition means that the schema purports to license XML expressions that are not, in fact, allowed in XML; that is, the schema misrepresents what is possible in the XML it will be used to validate.

Because we take seriously the role of a schema as a human-readable statement of what is and is not permitted in associated document instances, our schema language does not permit the repeatable occurrence indicators + and * to be associated with (non-repeatable) attributes. Because attributes may be optional, they do accept the ? occurrence indicator.

text requires at least one textual character and does not accept repetition occurrence indicators (+, *)

The reserved word text in Relax NG compact syntax means zero or more contiguous textual characters. This is potentially confusing for at least two reasons:

  • It is counter-intuitive that a pattern that appears to require text could be satisfied by zero textual characters, which a human would understand as the absence of text.

  • An XML document instance cannot contain a text node that consists of zero characters, so the Relax NG keyword says that it permits something that cannot occur.[23]

In our schema language the keyword text means one or more textual characters. The Relax NG definition of zero or more textual characters is implicitly optional, but ours is not, which means that text in our system may be followed by ? to mean either no text node, and therefore no characters, or exactly one text node, and therefore one or more characters.

The next-generation markup languages we target continue the XML practice of not permitting adjacent text nodes. In the case of both these next-generation languages and XML, then, any construction involving adjacent text nodes is interpreted as a single merged text node. For that reason we do not permit the repetition occurrence indicators + and * after the keyword text.

Finally, we do not allow the keyword text in Interleave patterns. The only way to interleave text nodes with elements in our schema language is by using the mixed keyword. We require the mixed keyword because it signals to the user that an element contains mixed content more clearly than a repeatable or-group that includes the keyword text.

Attributes are not content

In Relax NG attributes and patterns that resolve to attributes are included in content models alongside elements and the keyword text. Because attributes are properties of elements, rather than content, this is misleading; a Relax NG content model may appear to impose ordering relationships on attributes with respect to both other attributes and actual element content items even though no such ordering relationships are possible. Our schema language will represent attributes separately from actual element content, but the exact mechanism for doing so is still under development.

Mixed content must be represented with the keyword mixed

Relax NG compact syntax can represent mixed content with the keyword mixed, as in

p = element p { mixed { ( a | b )* } }
but it can, alternatively, omit the keyword mixed and instead include the keyword text in a repeatable or-group or interleave, e.g.:
p = element p { ( a | b | text )* }
or
p = element p { a* & b* & text }
Using the keyword mixed improves the human legibility of Relax NG content models because it announces unambiguously that it represents mixed content. For that reason, our schema language does not permit the last two models above, and instead requires the keyword mixed for modeling mixed content.

Overlap and Creole

About our notation

In this report we use the term <element> differently from Tennison 2007. Creole was designed to meet the requirements of LMNL, although Creole is intended to be data-agnostic and therefore applicable to parsing other markup languages that support overlap (Creole is not designed to support discontinuity). (Tennison 2007, 26) LMNL is based on ranges, which, unlike XML elements, are generally (but not universally) allowed to overlap.[24] Tennison 2007 uses the following terms:

  • A LMNL <range> is a sequence of markup and text tokens. Ranges have start- and end-tags and are generally allowed to overlap with other ranges.

  • A LMNL <element> is a sequence of markup and text tokens that is not allowed to overlap with other ranges. In Creole an <element> is equivalent to (synonymous with) a <range> child of a <partition>, i.e., <element> is a notational variant of <partition><range> … </range></partition>.

In this report we use the term <element> where Tennison 2007 uses the term <range> and we use the term <partition> where Creole uses <element>. We do not use the term <range>. That is:

Table I

Comparison of terminology

  Creole      This report
  <range>     <element>
  <element>   <partition>

Example of Overlap in Creole

Tennison 2007 uses a brief biblical example to review and critique the representation of Overlap in SGML Concur and elsewhere and to illustrate an alternative model and notation for Creole that address the perceived limitations. The biblical example is illustrative and the point is more general: markup hierarchies under a shared root (in this case a root element like <bible>) can overlap in a variety of meaningful ways. This particular example encodes four overlapping structures, as follows: (5–6)

  • A hierarchy of Bible, books, and chapters. Books begin with a title followed by one or more chapters, which, in turn, contain one or more verses.

  • A hierarchy of Bible, books, and sections. Books (the same books as in the preceding hierarchy) begin with a title (also the same as in the preceding hierarchy) followed by one or more sections. Sections begin with section headers added by an editor (e.g., The flood and the tower of Babel) followed by one or more paragraphs, which, in turn, contain one or more sentences. Tennison 2007 says elsewhere on the same page that every paragraph is made up of one or more verses, and because verses may break in the middle of sentences and vice versa, paragraphs thus contain concurrent tessellated hierarchies of verses and sentences.[25]

  • A hierarchy of Bible and pages. Page breaks may occur within or between components of the preceding two hierarchies (e.g., within or between books or chapters or verses in the first hierarchy or books or sections or paragraphs or sentences in the second), except that page breaks cannot appear within a book title (first and second hierarchies) or section heading (second hierarchy).

  • A set of index references that is not easily or naturally modeled as a hierarchy. The role of index references in the markup is not explained in Tennison 2007, but the name suggests that they might be targets of entries/pointers in a back-of-the-book index (or its digital-publication counterpart). Index references may overlap other index references (a relationship traditionally called self-overlap), which means that some text tokens may be part of multiple index references; some text tokens may not be part of any index reference; and page breaks may fall within index references. The example says that a sentence or heading contains any number of index references, but we can easily imagine an index to the Bible in which some references might cross sentence (and perhaps larger) boundaries.[26]

The first two hierarchies are loosely coupled semantically in that books (also chapters) and sections tend to be about something, and we would not expect a titled, topical section to begin in one titled, topical book and end in another. One way to understand this is that a <book> in one hierarchy has a <title> followed by <chapter> children, while the same <book>, in another hierarchy, has the same <title> followed by <section> children. That is, the hierarchies are congruent at a higher (<book>) level and then diverge partially (shared <title> but then <chapter> vs <section>) below that. Neither of these two hierarchies is subordinate to the other structurally because sections may start or end in the middle of chapters (and even, at least potentially, verses) and vice versa. A similar relationship obtains within paragraphs, where a paragraph is the parent of mutually independent tessellated sequences of sentences and of verses.

The example in Tennison 2007 models pages as milestone XML processing instructions with the name new-page; alternatively, pages could be modeled with tessellated milestone start- and end-markers, which is the structure used in the example to model index references. In either case we might consider pages to be subcomponents of the entire Bible (and not only of books or anything at a lower hierarchical level) insofar as new pages may begin either within or between books.

How Concur works in SGML and in Creole

In a traditional context-free production in a schema, the RHS is expanded in order, except that, as described above, Interleave relationships are partially ordered because the operands to the Interleave operator may occur in any order, with the restriction that any order within any of the subcomponents must be maintained. For example, if A, B is interleaved with C, D, the four components may occur in any order only as long as A precedes B and C precedes D. In this case, then, of the 24 possible permutations of the four components, only the following six are allowed: A, B, C, D; A, C, B, D; A, C, D, B; C, D, A, B; C, A, D, B; and C, A, B, D.[27]
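This count can be confirmed mechanically. The following Python sketch is illustrative only; because the four tokens are distinct, a simple recursion suffices in place of a general dynamic-programming check. It filters the 24 permutations down to the merges that preserve both internal orders.

```python
from itertools import permutations

def is_merge(seq, first, second):
    """True iff seq interleaves `first` and `second`, preserving the internal
    order of each (assumes distinct tokens, as in the A, B, C, D example)."""
    if not first and not second:
        return not seq
    if first and seq and seq[0] == first[0] and is_merge(seq[1:], first[1:], second):
        return True
    return bool(second and seq and seq[0] == second[0]
                and is_merge(seq[1:], first, second[1:]))

valid = {p for p in permutations("ABCD")
         if is_merge(list(p), ["A", "B"], ["C", "D"])}
assert len(valid) == 6  # exactly the six orders listed in the text
assert sorted("".join(p) for p in valid) == [
    "ABCD", "ACBD", "ACDB", "CABD", "CADB", "CDAB"]
```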

SGML Concur applies to entire documents, while Creole Concur can apply to subtrees within a document. The Creole Concur operator is similar to the Relax NG (and Creole) Interleave operator in that any order within a model that participates in a Concur relationship must be respected. For example, given the first two biblical hierarchies above, neither has a defined relative order with respect to the other, but within the hierarchy of titled sections, each <section> must contain exactly one <header> element followed by one or more <paragraph> elements.

In SGML Concur text tokens in the document instance, unlike element tags, are allowed only in positions where they are simultaneously valid in all concurrent content models. The same is true of Creole Concur, but with an important exception involving partitions that we discuss below. This distinction between markup tokens (which may be part of some but not all models) and text tokens (which must be part of all models unless, under Creole, they are part of partitions) is consistent with our understanding of Concur as concurrent markup over a single, shared document, rather than as concurrent documents. At the same time, the behavior of text tokens has important implications for how sequences can be combined with Concur. Consider, for example, the following Creole schema production:

element root {
  concur {
    element section {
      text
    },
    element chapter {
      text
    }
  }
}
In Creole, constructions like concur { A, B } mean that A and B overlap. The example above, then, means that a single <section> element with plain text content overlaps with a single <chapter> element that also has plain text content. The following examples would all be valid against this production:

<root><section><chapter>This is an example text</section></chapter></root>
<root><chapter><section>This is an example text</chapter></section></root>
<root><chapter><section>This is an example text</section></chapter></root>
<root><section><chapter>This is an example text</chapter></section></root>

That is, the document tokens are partially ordered because the relative order of the start- and end-tags for the <section> and <chapter> elements is not informational, but the start-tag of each element must precede its corresponding end-tag. The content model for <section> elements does not allow a <chapter> as a child of <section>, and vice versa, but the last two examples are nonetheless valid and unambiguous because the schema defines the elements as concurrent, rather than nesting. It should be clear from this example that schema information may be required to distinguish concurrent relationships from hierarchical (nested) ones.

The following sequence, on the other hand, is not allowed under the production above:

<root><section>This is an </section><chapter>example text</chapter></root>

This example respects the relative order of the markup tokens (the start- and end-tags) in both of the two concurrent sequences: either sequence, by itself, has the correct root element, the correct child element, and correct textual content. The reason this example is invalid is that the text tokens do not adhere to the rule that all text tokens in concurrent hierarchies, unlike markup tokens (start- and end-tags), must be valid in both hierarchies simultaneously. In the example above, the text between the <chapter> tags makes the <section> part of the Concur relationship invalid because the <section> part is defined in a way that does not permit text after the </section> end-tag. Similarly, the text between the <section> tags makes the <chapter> part of the Concur relationship invalid because text is not permitted before the <chapter> start-tag. We will see below that we can resolve this issue by allowing text to appear more broadly in the concurrent content models.
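The text-token rule can be sketched as a simple state check over a tagged token stream. This is our own illustration of the rule for the schema above, not an implementation of Creole: a TEXT token is licensed only while both a <section> and a <chapter> are open.

```python
def text_valid_under_both(tokens):
    """Sketch of the Concur text rule for the schema above: every TEXT token
    must be licensed by both concurrent content models simultaneously, i.e.
    fall inside an open <section> and an open <chapter>."""
    in_section = in_chapter = False
    for t in tokens:
        if t == "<section>":
            in_section = True
        elif t == "</section>":
            in_section = False
        elif t == "<chapter>":
            in_chapter = True
        elif t == "</chapter>":
            in_chapter = False
        elif t == "TEXT" and not (in_section and in_chapter):
            return False
    return True

# Valid: the text token lies inside both hierarchies, however the tags nest or overlap.
assert text_valid_under_both(
    ["<section>", "<chapter>", "TEXT", "</section>", "</chapter>"])
# Invalid (like the rejected example above): each text token lies inside only one hierarchy.
assert not text_valid_under_both(
    ["<section>", "TEXT", "</section>", "<chapter>", "TEXT", "</chapter>"])
```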

A Concur relationship between two single elements that have textual content is not very interesting, and we can enhance it by adding occurrence indicators that allow the components of the Concur relationship to repeat, as in the following example:

element root {
  concur {
    element section+ {
      text
    },
    element chapter+ {
      text
    }
  }
}

The schema above allows for one or more sections and one or more chapters to overlap in any way as long as all text tokens in a valid document instance are allowed under both hierarchies. This means that:

  • Both the section and chapter markup need to start immediately after the <root> start-tag, before any textual content;

  • Both the section and chapter markup need to end just before the </root> end-tag, without any intervening text; and

  • No textual content is allowed between an end-tag of one element and the start-tag of another.

Under the preceding schema the following sequence is now allowed:[28]

<root><section><chapter>This is </section><section> an example text</chapter><chapter> And more text</chapter></section></root>
In this example all text tokens are between the start- and end-tags of both <section> and <chapter> elements, which satisfies the requirement that under Concur all text tokens must be valid under all concurrent hierarchies simultaneously. This example would not be valid if there were text immediately after the <root> start-tag, immediately before the </root> end-tag, or between adjacent tags, since in all of those positions the text would not be allowed under one or both of the concurrent hierarchies.

Concur and mixed content

Because the Creole mixed keyword, borrowed from Relax NG, effectively makes text optional before, between, and after the elements to which it is applied, it can be used to simplify the modeling of some Concur relationships. Consider, for example:

element root {
  concur {
    element section {
      text
    },
    mixed {
      element chapter {
        text
      }
    }
  }
}
The preceding model allows a single <section> element to concur (overlap) with a mixed model that includes a single <chapter> element. The mixed model allows (but does not require) text before or after the <chapter> element, which means that the following sequence is valid:
<root><section><chapter>This is an example text</chapter>more text</section></root>
In the example above, all text tokens are located between <section> start- and end-tags, as required, since the <section> element does not allow mixed content. The last two words are not between <chapter> tags, but the <chapter> part of the Concur relationship, modeled as mixed content, allows plain text in this position, with the result that the text is valid under both parts of the Concur relationship. On the other hand, the following example is not valid:
<root><section><chapter>This is an example text</section></chapter>more text</root>
Here the text at the end is not part of either of the Concur components, which is acceptable with respect to the <chapter> element, but not the <section> element. Modeling both <section> and <chapter> elements as mixed content would make both of the preceding examples valid.

Using mixed content as described above can simplify modeling concurrent relationships, but that simplification comes with the cost of allowing plain text in places where it might not be part of our mental model of one or another hierarchy in a concurrent relationship. That is, although we can choose to ignore text that we don’t consider an informational part of one or another hierarchy during downstream processing, it is nonetheless part of the content according to the model. This is unlike the situation with markup tokens, as illustrated in the example above, where we can include <chapter> start- and end-tags inside a concurrent <section> element without making the <chapter> part of the content of the <section>.

Tennison 2007 recognized the requirement that all text tokens be part of all overlapping hierarchies as a challenge for effective modeling. For example, the biblical <section> hierarchy requires that a <section> begin with a <header>, but the section header is an editorial intervention that is not part of the text of the Bible, and therefore not part of the hierarchy of chapters and verses. Creole Concur addresses this concern by diverging from SGML Concur in allowing <element> structures alongside <range> structures, where <element> is shorthand for wrapping a <partition> around a <range>. (16) This allows Creole to model text tokens that are not part of all concurrent hierarchies:

What about the <heading> element within the <section>? Its text isn’t inside a <verse>, so would normally be invalid. However, we can define the <heading> as a partition—it can’t overlap anything else—and this means that it’s ignored by any concurrent patterns.

In summary, then: Concur in Creole allows the symbols (tags and text tokens) of multiple sequences to be intertwined at the token level as long as:

  • The relative order of the tokens in the individual sequences is maintained; and

  • All text tokens must appear in positions that are valid for all sequences, except that …

  • … text tokens within a <partition> are invisible to other concurrent hierarchies.

Concur and Interleave

Although the Concur and Interleave operators are similar in that both model partially ordered content, there are subtle but important differences:

  • In Interleave the interleaved components are subtrees, while in Concur they are markup tokens and text tokens. As a consequence, Interleave relationships result in a tree, while Concur relationships result in a graph.

  • Concur allows elements of the same type to appear in multiple concurrent models, while Interleave does not permit the same symbol to occur in multiple interleaved sequences. This difference is a consequence of the difference between trees and graphs; graphs allow nodes to be the targets of multiple directed edges, while trees require that no node have more than one parent. Allowing the same node type to appear in both parts of an Interleave relationship would mean either multiple parentage (not permitted in a tree) or ambiguity (to which of the interleaved components would the nodes belong?).

Mildly context-sensitive grammars (MCSG)

Handling overlap and implementing the Concur operator

Now that we have an understanding of the kinds of sequences that a Concur operator in a schema can validate, we need to find equivalent grammar rules that we can create from those sequences. A CFG, which requires exactly one sequence on the right-hand side of a production, is not capable of expressing a Concur relationship, but a MCSG, which allows one or two sequences on the right-hand side, is. Within the grammar, each production of this type has reorder constraints associated with it that define all possible combinations of tokens of the sequences. All of this apparatus—the two sequences on the right-hand side and the reorder constraints—is invisible to the end user, who interacts directly only with the schema syntax, including the Concur operator.

Grammars are typically written as expressions over symbols, and in our markup domain we use tokens as symbols. The set of token types is derived from the element names, which for any finite set of documents is also a finite set. The tokens can be categorized as:

  1. Markup specific[29]

    1. start-tag

    2. end-tag

    3. empty-element tag

    4. suspend-tag

    5. resume-tag

  2. Text

The following simplified biblical example, tagged with TexMECS Concur, is based on a more elaborate example in Tennison 2007:[30]

<genesis|
  <chapter n="7"|
    <section|
    <verse n="24"|… a hundred and fifty days.|verse>
  |chapter>
  <chapter n="8"|
  <verse n="1"|God thought of Noah …|verse>
    |section>
  |chapter>
|genesis>

The following schema for the biblical example above is based on features introduced in Tennison 2007. The ~ symbol is the Concur operator, so the second line means that a <genesis> element contains a concurrence of one or more <chapter> elements with one or more <section> elements. Note that we use the term <element> where Tennison 2007 uses <range>:

start = genesis   
genesis = chapter+ ~ section+
chapter = element chapter { attribute n, verse }
section = element section { verse }
verse = element verse { attribute n, text }

This can be compiled to a MCSG as follows:

start := <genesis| genesis_content |genesis>;
genesis_content := section optional_more_sections; , chapter optional_more_chapters;
section := <section| verse |section>;
chapter := <chapter attribute_n | verse |chapter>;
verse := <verse attribute_n "|" text |verse>;
attribute_n := "n" "=" string;
optional_more_sections := section optional_more_sections;
optional_more_sections := epsilon;
optional_more_chapters := chapter optional_more_chapters;
optional_more_chapters := epsilon;

In our notation a semicolon (;) in a production rule marks the end of a single sequence of symbols on the right-hand side. In a MCSG the right-hand side can contain up to two sequences of symbols, and in such cases we use a comma (,) as a separator between them.
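
For illustration only — this is our own sketch, not the authors' implementation — a production in this notation can be represented as a small data structure whose right-hand side holds one or two sequences:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Production:
    # One MCSG production: a left-hand side and up to two right-hand-side
    # sequences; a second sequence is present only for Concur-style rules.
    lhs: str
    rhs: List[str]
    rhs2: Optional[List[str]] = None

# The genesis_content rule above, with its two concurrent sequences:
genesis_content = Production(
    lhs="genesis_content",
    rhs=["section", "optional_more_sections"],
    rhs2=["chapter", "optional_more_chapters"],
)

# An ordinary CFG rule, such as the one for section, has a single sequence:
section = Production("section", ["<section|", "verse", "|section>"])
```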

The following table shows the alignment of the input tokens with the tokens of operand 1 (<chapter>) and 2 (<section>) of the Concur operator:

Table II

Overlap of <section> and <chapter> using the Concur operator

Input    | Operand 1 (chapter) | Operand 2 (section)
genesis  | X                   | X
chapter  | chapter             |
section  |                     | section
verse    | verse               | verse
text     | text                | text
/verse   | /verse              | /verse
/chapter | /chapter            |
chapter  | chapter             |
verse    | verse               | verse
text     | text                | text
/verse   | /verse              | /verse
/section |                     | /section
/chapter | /chapter            |
/genesis | X                   | X

In this table, as in the schema above, sequences of <chapter> and <section> elements observe a Concur relationship. <verse> elements are part of both concurrent hierarchies. An X (in the rows for the <genesis> start- and end-tags) means that the element is not a participant in a Concur relationship. Text nodes (except inside a <partition>, which is not relevant here) are always part of both concurrent hierarchies.

We refer to the relationship between concurrent sequences as a reorder constraint because the right-hand side of the production rule lists one entire sequence before the other entire sequence, regardless of the order of the tokens in the input. When such a rule is used for validation, the goal is to confirm that the input matches a concurrence of the two sequences on the right-hand side. That confirmation requires knowing what is and is not permitted in a concurrence, and those rules are:

  • Each of the two sequences in a concurrent relationship, considered in isolation, is fully ordered. That is, if operand sequence A requires the ordering A1, A2, A3 and operand sequence B requires the ordering B1, B2, B3, the markup tokens of each of those two sequences, considered in isolation, must appear in the input in the specified order.

  • Although the tokens of each of the two sequences have to observe any ordering specified in the schema, the markup tokens of the two sequences do not have to be aligned with each other. This means that the two sequences that make up the right-hand side of the production rule are partially ordered: each sequence observes its own ordering rules, but the markup tokens of the two sequences are not compared to each other.[31]

  • Unlike the markup tokens of the two concurrent sequences, the text tokens of the two sequences do have to be aligned.

The reorder rule of the two sequences means that any implementation must adhere to three rules:

  • All symbols of both operand sequences must be fully present in the input data.

  • The relative order within each of the sequences must be observed.

  • When a text token is encountered in the input stream it must be present at the current relative location in both operand sequences.

We have to determine whether the input sequence is valid for the operands of the Concur operator, recalling that the text tokens of the two operands have to line up. We do this by thinking of it as an alignment problem, which we address using a method called dynamic programming.[32] We can approach the task by arranging the tokens in a matrix with the tokens of the first operand serving as column labels and the tokens of the second operand serving as row labels. Initially we enter only the column and row labels; we explain below where the cell values (x) come from:

Table III

         | chapter | verse | text | /verse | /chapter | chapter | verse | text | /verse | /chapter
(0,0)    | x       |       |      |        |          |         |       |      |        |
section  | x       |       |      |        |          |         |       |      |        |
verse    |         | x     |      |        |          |         |       |      |        |
text     |         |       | x    |        |          |         |       |      |        |
/verse   |         |       |      | x      | x        | x       |       |      |        |
verse    |         |       |      |        |          |         | x     |      |        |
text     |         |       |      |        |          |         |       | x    |        |
/verse   |         |       |      |        |          |         |       |      | x      |
/section |         |       |      |        |          |         |       |      | x      | x

Starting with just column and row labels, we will fill in the cells as we use the sequence of input tokens to guide our traversal of the matrix from the upper left to the lower right. We can think of this traversal as moving two pointers, both of which start at 0, so that the start coordinate would be (0, 0). When we say that a pointer is on 1, we mean that it is located after the first token. In other words, each pointer is always between the most recently processed token and the next token to be expected in the input. Both pointers start on 0, that is, before the first token in the two sequences has been processed. Our traversal must observe the following output requirements:

  • The complete input sequence is matched by tokens in the operands.

  • The text tokens in both sequences line up.

  • If the input is valid, we can always advance one or both pointers.

  • If the input is valid, whenever a start-tag is matched simultaneously in both sequences, the corresponding end-tag must also be matched simultaneously in both sequences.[33]

  • If the input is valid, we end in the lower right.

Whenever a markup token comes in it must be allowed in the next position in either or both sequences, in which case we increment the respective pointer(s). If a markup token is not allowed at that location in either sequence, the input is not valid. For example:

  • The first token in our input example is a <chapter> start-tag, which lets us increment (only) the column pointer but not the row pointer, so we move from (0,0) to (1,0).

  • The next input token is a <section> start-tag, which lets us increment only the row pointer, so we move from (1,0) to (1,1).

  • The next input token is a <verse> start-tag, which is expected in both sequences at that location (immediately after a <chapter> start-tag in one and immediately after a <section> start-tag in the other), so we increment both pointers at the same time, that is, we move diagonally in the matrix, from (1,1) to (2,2).

Text tokens are handled differently. In case of text input, both pointers must point to a text token in their respective operand sequences of tokens. The unusual detail is that this can happen in two ways:

  • If a pointer is on a markup token, the next token must be a text token.

  • If a pointer is on a text token, it is not moved.

The first situation is similar to the behavior with markup tokens. The second situation exists to allow multiple text tokens in one operand token sequence to be matched with a single text token in the other operand sequence. To illustrate why this second condition is necessary, consider the following TexMECS example, which contains two text nodes (one with the value Hi and the other with the value Mom) and a concurrence of <section> and <chapter> elements:

<genesis|
  <section|
  <chapter|Hi|chapter>
  <chapter|Mom|section>
  |chapter>
|genesis>

A schema that supports this markup is:

start = genesis   
genesis = chapter+ ~ section+
chapter = element chapter { text }
section = element section { text }

A matrix that represents how we might traverse this concurrent input, based on the method described above, is:

Table IV

         | section | text | /section
(0,0)    | x       |      |
chapter  | x       |      |
text     |         | x    |
/chapter |         | x    |
chapter  |         | x    |
text     |         | x    | x
/chapter |         |      | x

Each of the two <chapter> elements contains a one-word text node and the single <section> element contains a two-word text node. When we reach the text token inside the second <chapter>, the input remains valid because of the second situation above: we are currently on a text token in the <section>, and therefore do not need to advance the pointer on that axis. Were we to encounter a text token in one hierarchy when there was not a text token in the other hierarchy (which would satisfy the first condition) and when we were not already on a text token in the other hierarchy (which would satisfy the second condition), the input would not be valid against the schema.
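
The two-pointer traversal described above can be sketched in Python. This is an illustrative reconstruction, not the authors' implementation: the token spellings and the TEXT marker are our own assumptions, and the rule that a start-tag matched simultaneously in both sequences must also be closed simultaneously is omitted for brevity.

```python
from functools import lru_cache

TEXT = "#text"  # our own marker for a text token (an assumption, not TexMECS syntax)

def concur_valid(input_tokens, op1, op2):
    """Is input_tokens a valid concurrence of operand sequences op1 and op2?

    A markup token advances the pointer of whichever operand expects it next
    (or both, when both expect it).  A text token must find both pointers on
    a text token: either the next operand token is text (the pointer
    advances) or the previous one was text (the pointer stays, so several
    text tokens on one side can match a single text token on the other).
    """
    input_tokens, op1, op2 = tuple(input_tokens), tuple(op1), tuple(op2)

    @lru_cache(maxsize=None)
    def step(k, i, j):
        # k: position in the input; i, j: pointers into op1 and op2.
        if k == len(input_tokens):
            return i == len(op1) and j == len(op2)   # must end in the lower right
        tok = input_tokens[k]
        if tok == TEXT:
            moves_i = ([i + 1] if i < len(op1) and op1[i] == TEXT else []) \
                    + ([i] if i > 0 and op1[i - 1] == TEXT else [])
            moves_j = ([j + 1] if j < len(op2) and op2[j] == TEXT else []) \
                    + ([j] if j > 0 and op2[j - 1] == TEXT else [])
            return any(step(k + 1, ni, nj) for ni in moves_i for nj in moves_j)
        # Markup token: try advancing both pointers, or just one.
        in1 = i < len(op1) and op1[i] == tok
        in2 = j < len(op2) and op2[j] == tok
        return (in1 and in2 and step(k + 1, i + 1, j + 1)) \
            or (in1 and step(k + 1, i + 1, j)) \
            or (in2 and step(k + 1, i, j + 1))

    return step(0, 0, 0)
```

With operand sequences built from the Hi/Mom schema above, the valid interleaving is accepted, while a text token that cannot be matched by a text token in both operands causes rejection.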

Handling discontinuity

Discontinuity in TexMECS and TAGML is expressed through the use of suspend- and resume-tags (see the example below), which are allowed only where either an Interleave (recall that mixed content is an Interleave of an element sequence with text) or a Concur relationship is active. This constraint is expected because what discontinuity shares with Interleave and Concur is that the content is partially ordered, that is, tokens that are part of the discontinuous segments must appear in the correct order, but tokens that are not part of the discontinuity may appear before, between, and after them. Sperberg-McQueen and Huitfeldt 2008 explains that:

It seems clear that if we specify that the discontinuous element (q or B in our examples) matches a normal (first-class) token in a content model, then the usual rules for interpreting right-hand sides in the production rules of context-free grammars do not apply: the usual interpretation is that the right-hand side specifies a sequence of children for the left-hand side, but as described at some length above (section “Goddag structures”) the discontinuous element itself cannot be part of any sequence which also contains the material which interrupts it. We postulate, therefore, that discontinuous elements, taken as wholes, do not match first-class tokens in content models.

To accommodate elements that participate in a discontinuity relationship we adjust our MCSG slightly. Here is a sample input file with TexMECS tagging, taken from Sperberg-McQueen and Huitfeldt 2008:

<p|Alice
was beginning to get very tired of sitting
by her sister on the bank, and of having
nothing to do: once or twice she had peeped
into the book her sister was reading, but
it had no pictures or conversations in it,
<q|and what is the use of a book,|-q>
thought Alice
<+q|without pictures or conversation?|q>
|p>

In the example above, Alice’s single thought is tagged as a discontinuity of <q> (quote) elements, interrupted by the phrase thought Alice, which is in the narrator’s voice. The following schema supports this markup:

start = paragraph   
paragraph = element p { mixed { quote } }
quote = element q { text }

The schema compiler translates the preceding schema into the following MCSG:

start := <p| mixed_quote |p>;
mixed_quote := text; , quote;
quote := <q| text optional_discontinuity_quote |q>;
optional_discontinuity_quote := |-q> <+q| text optional_discontinuity_quote;
optional_discontinuity_quote := epsilon;

The table below illustrates how the input tokens (leftmost column) are allocated between the two participants in a mixed-content (that is, Interleave) relationship, the discontinuous quotation (operand 1) and the text with which it is mixed (operand 2):

Table V

Discontinuity of <q> within a <p> with mixed content using the Interleave operator

Input | Operand 1 | Operand 2 (mixed)
p     | X         | X
text  |           | text
q     | q         |
text  | text      |
-q    | -q        |
text  |           | text
+q    | +q        |
text  | text      |
/q    | /q        |
/p    | X         | X

The <p> tags are the parent of both parts of the Interleave; they are marked with an X to show that they are not part of either of the interleaved sequences. When the <q> element is paused, the content needs to come from the other operand until the <q> resume-tag is reached. In this sense discontinuity is comparable to a <partition> within a Concur operator.
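
The suspend/resume pairing rules can be illustrated with a small checker. This is a sketch under our own simplifying assumptions (a single element type, no self-nesting, and our own token spellings), not the authors' implementation:

```python
def check_discontinuity(tokens, name="q"):
    # Token conventions (ours): name = start-tag, "/" + name = end-tag,
    # "-" + name = suspend-tag, "+" + name = resume-tag; any other token
    # (text, other tags) belongs to the interleaved operand and is skipped.
    is_open = False
    suspended = False
    for tok in tokens:
        if tok == name:                      # start-tag
            if is_open or suspended:
                return False                 # no self-nesting in this sketch
            is_open = True
        elif tok == "-" + name:              # suspend-tag
            if not is_open:
                return False
            is_open, suspended = False, True
        elif tok == "+" + name:              # resume-tag
            if not suspended:
                return False
            is_open, suspended = True, False
        elif tok == "/" + name:              # end-tag
            if not is_open:
                return False                 # cannot close while suspended
            is_open = False
    return not is_open and not suspended     # everything opened was closed
```

A suspended element must be resumed before its end-tag, so closing a <q> while it is suspended, or resuming a <q> that was never suspended, is rejected.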

Self-overlap with a mildly context-sensitive grammar

Creole introduces concurOneOrMore and concurZeroOrMore operators to manage self-overlap, that is, situations where an element of a particular type overlaps with another element of the same type. Self-overlap requires special handling for two reasons:

  • In markup languages like MECS, TexMECS, LMNL, and TAGML, all of which support self-overlap, start- and end-tags that are paired together share an identifier. Without that sort of identifier, input like the following:

    <x>A<x>B</x>C</x>
    would be structurally ambiguous. There are two <x> elements, each with a start- and end-tag, but nothing in the markup tells us whether, on the one hand, an inner <x> element is a child of an outer <x> element or, on the other, the two <x> elements overlap and share the text token B. We can disambiguate that situation by adding matching identifiers to both the start- and end-tags, so that, for example, overlap, as in
    <x n="1">A<x n="2">B</x n="1">C</x n="2">
    would be distinct from containment,[34] as in
    <x n="1">A<x n="2">B</x n="2">C</x n="1">

  • Production rules in a grammar traditionally are completely static, which means that the exact rules are known from information in the schema before parsing begins (that is, without access to information in the document instance) and the production rules do not change in response to anything encountered in the input.

The challenge is that the exact disambiguating identifiers that will occur in a document instance cannot be specified in the schema if the schema is to be applicable to more than a single document, as schemas typically are. For that reason, the parser must recognize a start-tag with an identifier when it arrives and store that identifier until an end-tag with a matching identifier is encountered, so that it can verify that each opened element is closed in a way that is consistent with schema requirements. We implement this functionality with an attribute grammar, an extension of context-free grammars that allows production rules to capture part of the input stream, store it in what is called an attribute, and reuse that attribute later in the input stream to determine whether an identifier given in an end-tag is valid at the location where it occurs.[35]
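
A minimal sketch of that bookkeeping (our own illustration; the event tuples and names are hypothetical, not the authors' API):

```python
def check_ids(events):
    # events: (kind, element_name, identifier) tuples, e.g.
    # ("start", "index", "1") for <index~1| and
    # ("end", "index", "1") for |index~1>.
    open_ids = set()
    for kind, name, ident in events:
        key = (name, ident)
        if kind == "start":
            if key in open_ids:
                return False   # identifier reused while still open
            open_ids.add(key)
        else:
            if key not in open_ids:
                return False   # end-tag without a matching open start-tag
            open_ids.remove(key)
    return not open_ids        # every opened element was closed
```

In an attribute grammar this storing and matching is expressed declaratively on the production rules rather than procedurally, but the effect is the same: a start-tag's identifier is captured and an end-tag is valid only where a matching identifier is open.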

Consider the following TexMECS example:

<verse no="23"|God wiped out every living thing
that existed on earth, <index~1 ref="i0037"|man and
<index~2 ref="i0038"|beast|index~1>, reptile and 
bird|index~2>; they were all wiped out over the whole 
earth, and only Noah and his company in the
ark survived.|verse>

In the TexMECS example above, which has been simplified from an example in Tennison 2007 (11), Verse 23 contains (self-)overlapping targets of index references, that is, target spans of text to which entries in an index point. Index reference i0037 refers to the string "man and beast" and index reference i0038 refers to the string "beast, reptile and bird". The two overlap in that they share the substring "beast". If this were XML and the markup read:

<index ref="i0037">man and <index ref="i0038">beast</index>, reptile, and bird</index>
the markup would be well-formed but the parsing would be incorrect semantically. That is, an XML parser would recognize that there were two <index> start-tags and two </index> end-tags, and that each end-tag follows a start-tag. What an XML parser would fail to recognize, though, would be the overlap: The XML tagging above cannot represent overlap, and therefore says that indexed reference i0038 is wholly nested inside indexed reference i0037. This is semantically incorrect.

In TexMECS notation, a normal start-tag looks like <index| and a normal end-tag looks like |index>. Self-overlap is a type of Concur relationship, and to avoid the misreading of the XML markup described above, self-overlapping elements must record information that associates start-tags with the correct corresponding end-tags, so that, for example, <index~1| and |index~1> are start- and end-tags for the same instance of an <index> element. A schema used to author or validate this document would recognize the two index reference targets (i0037 and i0038) as instances of an element type <index>, which is to say that the ~1 and ~2 are attributes of the element instance, and not part of the element name. The same relationship might be spelled, in a hypothetical alternative syntax, as <index id="1" ref="i0037"> for the start-tag and </index id="1"> for the end-tag.

Attribute grammars enhance a traditional grammar with the ability to perform (typically small) computations. In the case of a CFG or a MCSG, the left-hand side of a production is a single non-terminal. The right-hand side contains an expansion of the left-hand side into a sequence of terminals and non-terminals (in the case of a CFG) or up to two such sequences (MCSG), and with an attribute grammar the items on the right-hand side may be enriched with expressions that compute a value.

For example, a schema containing self-overlapping <index> tags could look like the following:

indexedText = concurOneOrMore { mixed { index* } }
index = element index { attribute ref { text }, text }

A set of grammar rules governing the self-overlapping <index> elements looks like this:

open_index_tag := <index "~" identifier attribute_ref "|";
close_index_tag := |index "~" identifier ">";
identifier := string;
attribute_ref := "ref" "=" string;
indexed_text := interleave_text_index;
interleave_text_index := text; , optional_index;
optional_index := epsilon;
optional_index := open_index_tag concur_close_tag_new_index_tag; [concur_close_tag_new_index_tag.identifier = open_index_tag.identifier]
concur_close_tag_new_index_tag := text close_index_tag; , optional_index; [close_index_tag.identifier == concur_close_tag_new_index_tag.identifier]

An attribute on a grammar should not be confused with an attribute in the XML domain. An attribute on a grammar is implemented as an index number on non-terminals on the right-hand side of a production rule. Next to the production rules are extra statements that specify whether attributes should capture incoming data or whether they should inherit data from an earlier non-terminal in the input stream in order to validate the current input token.

Schema compiler

The present report focuses primarily on how to express next-generation features in a schema language. We have developed a prototype implementation of a schema compiler and accompanying parser/validator, the details of which will be the subject of a future publication, but in this section we provide a brief overview of how the schema is compiled (that is, translated) into a grammar and how the grammar can be used to parse and validate an input document.

Our implementation compiles a schema expressed in the syntax described above into a MCSG, which we then parse with a modified LL parser. A standard LL parser has a set of production rules from a CFG and a stack as a memory for expectations. An LL parser also has a parsing table, which is generated from the production rules. In the standard case of a CFG the expectations on the right-hand side of the production rules of the grammar have to be matched in order. In the case of a MCSG, however, the content is matched in partial order as we determine whether an incoming token is allowed at that location in the data stream. In an LL parser, for each incoming token, we can:

  1. Look at the top of the stack and use the table to determine whether a non-terminal / terminal combination is allowed.

  2. If so, identify the production rules allowed by the table.

  3. Put the symbols from the right-hand side of the applicable production rule on the stack.

The table provides an efficient method of determining which production rule needs to be applied according to a combination of the terminal token from the input and the non-terminal or terminal on the top of the stack.
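
The three steps above can be sketched for the plain CFG case. This is an illustrative toy (the grammar, table, and names are ours, not the authors' implementation), and it omits epsilon productions and the MCSG extensions:

```python
def ll_parse(tokens, table, start):
    # table maps (non-terminal, lookahead token) -> right-hand-side sequence.
    stack = [start]
    toks = list(tokens) + ["$"]           # end-of-input marker
    pos = 0
    while stack:
        top = stack.pop()
        tok = toks[pos]
        if (top, tok) in table:           # steps 1-2: the table allows a rule
            stack.extend(reversed(table[(top, tok)]))  # step 3: push the RHS
        elif top == tok:                  # terminal on top of stack: match input
            pos += 1
        else:
            return False                  # no rule and no match: reject
    return pos == len(toks) - 1           # all real input tokens consumed

# Toy grammar: S -> "a" S | "b"
table = {("S", "a"): ["a", "S"], ("S", "b"): ["b"]}
```

For input like a a b the parser repeatedly expands S via the table and matches terminals off the stack; input that leaves unmatched tokens or an unexpandable stack is rejected.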

In a schema the number of production rules is relatively low compared to the number of documents and the size of each document (both theoretically unlimited) that a user might want to validate. A schema compiler can therefore improve efficiency by doing the expensive analysis once, over the (relatively small) set of production rules in the schema, rather than repeatedly over the (potentially large) stream of input tokens.

Conclusion

This paper introduces a new method to model, parse, and validate complex, next-generation textual features such as overlap (including self-overlap), discontinuity, and partially ordered and unordered content. It builds on earlier work that forms part of Relax NG (compact syntax, the Interleave operator), Creole (the Concur and Partition operators), and the MLCD/MECS project (discontinuity). Where Relax NG and Creole use parsing with Brzozowski derivatives, our grammar-based technique relies on a mildly context-sensitive grammar combined with an attribute grammar. We have developed a prototype implementation that reads in a schema, compiles it into a mildly context-sensitive grammar, and then uses the grammar to validate markup files that contain overlapping and discontinuous markup. By generating a parser based on the schema, the cost of validation per token is kept low. The presence of overlapping or discontinuous markup in the source files does not cause ambiguity and thus does not negatively affect the performance of the validator.

In this article we have responded to the challenges we set out to solve, namely: 1) to propose a schema language that can be used to model complex, next-generation text features such as overlap (including self-overlap), discontinuity, and unordered and partially ordered content; 2) to propose a grammar technology that is able to express those features; and 3) to parse and validate document instances against a schema in a consistent and performant manner by compiling the schema into a mildly context-sensitive grammar and a parsing table before the actual validation starts.

References

[BNF] Janikow, Cezary Z. What is BNF? http://www.cs.umsl.edu/~janikow/cs4280/bnf.pdf

[Cameron 1993] Cameron, Robert D. Extending context-free grammars with permutation phrases. ACM letters on programming languages and systems vol 2, no. 1–4 (March–December 1993), pp. 85–94. doi:https://doi.org/10.1145/176454.176490

[Chomsky 1956] Chomsky, Noam. Three models for the description of language. IRE transactions on information theory, vol. 2, no. 3, pp. 113–124, September 1956. doi:https://doi.org/10.1109/TIT.1956.1056813

[Clark 2002, Algorithm] Clark, James. An algorithm for RELAX NG validation. 2002. https://relaxng.org/jclark/derivative.html

[Clark 2002, Compact] Clark, James. RELAX NG Compact Syntax. Committee Specification 21 November 2002. https://relaxng.org/compact-20021121.html

[Clark, Design] Clark, James. The design of Relax NG. n.d. https://relaxng.org/jclark/design.html

[EBNF] ISO/IEC 14977:1996 Information technology — Syntactic metalanguage — Extended BNF. December 1996. https://www.iso.org/standard/26153.html

[Huitfeldt and Sperberg-McQueen 2001] Huitfeldt, Claus and C. M. Sperberg-McQueen. TexMECS. An experimental markup meta-language for complex documents. January 25, 2001, rev. October 5, 2003. http://mlcd.blackmesatech.com/mlcd/2003/Papers/texmecs.html

[Knuth 1990] Knuth, Donald E. The genesis of attribute grammars. 1990. In: Attribute grammars and their applications, ed. P. Deransart and M. Jourdan. Vol. 461. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, pp. 1–12. doi:https://doi.org/10.1007/3-540-53101-7_1. https://link.springer.com/chapter/10.1007/3-540-53101-7_1 (visited on 2024-04-12).

[Luminescent] Piez, Wendell. Luminescent: a LMNL syntax parsing pipeline in XSLT. https://github.com/wendellpiez/Luminescent

[Manguinhas 2009] Manguinhas, Hugo. Schema Languages for XML. Actas do INForum-Simpósio de Informática 2009, pp. 299–310.

[Murata et al. 2005] Murata, Makoto, Dongwon Lee, Murali Mani, and Kohsuke Kawaguchi. Taxonomy of XML schema languages using formal language theory. ACM Transactions on Internet Technology (TOIT), Volume 5, Issue 4, 2005, pp. 660–704. doi:https://doi.org/10.1145/1111627.1111631. Available at https://pike.psu.edu/publications/toit05.pdf. An earlier version by the first three authors, presented at Extreme Markup Languages® 2000, is available at https://cs.brown.edu/courses/cs196-9/murata00taxonomy.pdf

[Piez 2008] Piez, Wendell. LMNL in miniature. An introduction. Amsterdam Goddag Workshop, 1–5 December 2008. http://piez.org/wendell/LMNL/Amsterdam2008/presentation-slides.html

[SAEAG] Barrellon, Vincent, Pierre-Edouard Portier, Sylvie Calabretto, Olivier Ferret. Schema-aware Extended Annotation Graphs. DocEng '16: Proceedings of the 2016 ACM Symposium on Document Engineering, pp. 45–54. doi:https://doi.org/10.1145/2960811.2960816

[Siegel 2022] Siegel, Erik. 2022. Schematron. A language for validating XML. Denver: XML Press.

[Sperberg-McQueen 2018] Sperberg-McQueen, C. M. Representing concurrent document structures using Trojan Horse markup. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01

[Sperberg-McQueen and Huitfeldt 2004] Sperberg-McQueen, C. M. and Claus Huitfeldt. GODDAG: A data structure for overlapping hierarchies. Springer-Verlag (2004). Preprint at http://cmsmcq.com/2000/poddp2000.html

[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C. M., and Claus Huitfeldt. Markup discontinued: discontinuity in TexMecs, Goddag structures, and rabbit/duck grammars. Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:https://doi.org/10.4242/BalisageVol1.Sperberg-McQueen01

[SQF] Schematron Quick Fixes (SQF). https://www.oxygenxml.com/doc/ug-editor/topics/schematron-quick-fixes.html

[TagML 2017] Haentjens Dekker, Ronald, and David J. Birnbaum. It's more than just overlap: Text As Graph. Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1–4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Dekker01

[TagML 2018] Haentjens Dekker, Ronald, Elli Bleeker, Bram Buitendijk, Astrid Kulsdom and David J. Birnbaum. TAGML: A markup language of many dimensions. Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.HaentjensDekker01

[TEI Gentle introduction] A gentle introduction to XML. Chapter v. of the Text Encoding Initiative Guidelines, P5 Version 4.9.0. Last updated on 24th January 2025, revision f73186978. https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html

[Tennison 2007] Tennison, Jeni. Creole: validating overlapping markup. XTech 2007. https://www.princexml.com/howcome/2007/xtech/papers/output/0077-30.pdf

[Tennison 2009] Tennison, Jeni. Overlap, containment and dominance. Jeni’s musings, 2008-12-06. http://www.jenitennison.com/2008/12/06/overlap-containment-and-dominance.html

[Tennison and Piez 2002] Tennison, Jeni and Wendell Piez. 2002. The Layered Markup and Annotation Language (LMNL). Presented at Extreme Markup Languages® 2002 (Montréal, Canada).

[Usdin 2005] Usdin, B. Tommie. W3C XML Schema; Relax NG; or DTD: How’s a user to choose? 2005. https://www.mulberrytech.com/papers/whichschema/index.html

[van der Vlist 2004] van der Vlist, Eric. 2004. Relax NG. Sebastopol, CA: O’Reilly.

[XDM] XQuery and XPath Data Model 3.1. W3C Recommendation 21 March 2017. Norman Walsh, John Snelson, and Andrew Coleman, ed. https://www.w3.org/TR/xpath-datamodel-31

[XML 1.0] Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C Recommendation 26 November 2008. Tim Bray et al., ed. https://www.w3.org/TR/xml/

[XML Namespaces] Namespaces in XML 1.0 (Third Edition). W3C Recommendation 8 December 2009. Tim Bray et al., ed. https://www.w3.org/TR/xml-names/

[XML Schema 1.1] XML schema. https://www.w3.org/XML/Schema



[1] The four authors contributed as follows to the research described here:

  • Ronald Haentjens Dekker: Lead researcher and developer, also contributed to the written report.

  • Bram Buitendijk: Developer, also contributed to the written report.

  • Joris J. van Zundert: Lead author of the written report.

  • David J. Birnbaum: Author and editor of the written report.

[2] Sperberg-McQueen and Huitfeldt 2008 offer the following paragraph from Lewis Carroll’s Alice in Wonderland as an example of Discontinuity:

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book, thought Alice without pictures or conversation?

[3] A commonly cited type of Overlap is multiple concurrent hierarchies, that is, mutually independent fully tessellated structures that can be erected over a shared character stream. A canonical example of multiple concurrent hierarchies is a bound edition of a novel, which might be expressed as a logical hierarchy (e.g., chapters, paragraphs) and an overlapping physical hierarchy (e.g., pages, lines).

Overlap may also be occasional, that is, may not entail hierarchies that fully model an entire document. For example, an annotation system might allow readers to add occasional markup elements to a document that tag arbitrary strings of text without regard to existing markup boundaries.

[4] We use the term non-linearity to refer to text components that may not coexist synchronously with one another, such as a representation that combines different diachronic versions of a text within a single document. For example, an editorial intervention may transform an earlier version of a text to a later version without any real text containing simultaneously the earlier and the later readings.

[5] For example, consider an alignment of textual witnesses that differ only in the order of the tokens, e.g., Romeo and Juliet vs Juliet and Romeo. The text might be modeled as a graph with edges that connect the tokens in the order in which they appear in the witnesses, so that a collation would contain three shared tokens ordered in two different ways. This is different from simple variation (e.g., Romeo and Juliet vs Laurel and Hardy) because the text tokens that make up the content do not vary, but their order does.
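The graph model described here can be sketched in a few lines of Python. This is an illustration only, not drawn from any published collation tool; the names `tokens`, `witness_paths`, and `edges` are hypothetical. The point is that both witnesses share one set of token nodes but contribute different edge sets, i.e., different orders.

```python
# Multi-ordered content: three shared tokens whose order differs between
# two witnesses, modeled as one set of nodes plus one ordered edge path
# per witness.

tokens = {"Romeo", "and", "Juliet"}  # shared content, no inherent order

witness_paths = {
    "W1": ["Romeo", "and", "Juliet"],
    "W2": ["Juliet", "and", "Romeo"],
}

def edges(path):
    """Return the ordered edges contributed by one witness."""
    return list(zip(path, path[1:]))

# Both witnesses traverse the same three nodes ...
assert all(set(p) == tokens for p in witness_paths.values())

# ... but contribute different edge sets, i.e., different orders.
w1_edges = edges(witness_paths["W1"])  # [('Romeo', 'and'), ('and', 'Juliet')]
w2_edges = edges(witness_paths["W2"])  # [('Juliet', 'and'), ('and', 'Romeo')]
assert w1_edges != w2_edges
```

In a collation built over this structure, the three token nodes are shared, and only the edges record the witness-specific orders.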

[6] For example, textual variation can be recorded in the TEI as <rdg> children of an <app> element, and the child elements have an order that is informational in XML whether we care about that order or not. The editor’s choice of an encoding order may be meaningful (e.g., grouping or prioritizing witnesses to express a theory of the transmission of the text), but even when the user does not wish to assert an order, XML syntax is not designed to represent unordered sibling elements. Users can choose to ignore the order of sibling elements during processing, but two XML elements with the same child elements in different orders are not deep equal, unlike, for example, two XML elements with attributes recorded in different orders.

Because informationally unordered sibling elements are not part of the data model that XML syntax expresses, ignoring sibling order in XML ultimately becomes the responsibility of processing applications. Locating that responsibility at the processing level is undesirable because it means that each individual processing application must accommodate it. A data model that can represent and distinguish both ordered and unordered siblings, on the other hand, could make the distinction available to multiple processing operations while expressing it only once, as part of the markup. More generally, inherent properties of the data are best represented and expressed at the level of the data, and not of applications that process the data.

What we mean by unordered content, then, is the ability to distinguish, at the level of the model and not only during processing, whether sibling items of arbitrary complexity do or do not have an order that is informational.
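The deep-equality point can be illustrated with a Python analogy (plain lists stand in for sibling elements; this is not XML tooling): ordered comparison, like XML deep-equality, distinguishes the two sequences, while a multiset comparison does not.

```python
from collections import Counter

# Two elements with the same children in different orders: XML deep-equal
# treats them as different; an order-insensitive comparison does not.
app1 = ["rdg-A", "rdg-B", "rdg-C"]   # children of first <app>
app2 = ["rdg-C", "rdg-A", "rdg-B"]   # children of second <app>

ordered_equal = app1 == app2                      # what XML's data model sees
unordered_equal = Counter(app1) == Counter(app2)  # what an unordered model could see

assert ordered_equal is False
assert unordered_equal is True
```

A data model with unordered siblings would, in effect, make the second comparison available once, in the model, instead of re-implementing it in every processing application.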

[7] Tokenization can be performed at different levels of granularity. For example, given the following small but complete XML document:

<p month-name="November" month-ordinal="11">In the eleventh month … </p>
a simple tokenization might identify three tokens: start-tag, content, and end-tag. In this case, though, the start-tag includes attributes, and an alternative, more granular, tokenization might recognize the start-tag as a sequence of multiple tokens: start-tag-opener, element name, whitespace, attribute name, equal sign, quotation mark, attribute value, quotation mark, whitespace, attribute name, equal sign, quotation mark, attribute value, quotation mark, start-tag-closer. Different degrees of tokenization might be appropriate for different purposes.
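The two granularities described above can be sketched with regular expressions in Python. The patterns are illustrative only (they handle just this one well-formed input, not XML in general), but they show how the same document yields three coarse tokens or fifteen fine-grained tokens for the start-tag alone.

```python
import re

doc = '<p month-name="November" month-ordinal="11">In the eleventh month … </p>'

# Coarse tokenization: start-tag, content, end-tag.
coarse = re.findall(r'<[^/>][^>]*>|</[^>]*>|[^<]+', doc)
assert len(coarse) == 3

# Finer tokenization of the start-tag alone: tag opener, element name,
# whitespace, attribute names, equal signs, quotation marks, attribute
# values, and the start-tag closer.
start_tag = coarse[0]
fine = re.findall(r'<|>|=|"|\s+|[^<>="\s]+', start_tag)
assert len(fine) == 15  # matches the fifteen tokens enumerated above
```

A production tokenizer would of course need a principled grammar rather than ad hoc patterns, but the choice between the two granularities is the same.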

[8] In some authoring environments the extent to which a chosen schema can be considered domain specific may be debated. For example, the Balisage schema is a customization of Docbook that is specific to the domain of Balisage papers. Users might customize the schema further, whether by modifying the grammar-based schema or layering Schematron rules that are specific to the individual paper on top of the general Balisage schema. Domains, then, may have subdomains.

Domain-specific rules are sometimes called business rules (e.g., Siegel 2022, pp. 22–24). We do not find this term helpful for at least two reasons:

  • The word business has many meanings, including whatever a person is (properly) interested in (as in the phrase Mind your own business!), and in that sense it begins to overlap with the concept of domain-specific. Nonetheless, the primary semantic domain of the word business is commerce, which is not intrinsic to the more important domain-specific meaning.

  • As explained above, domains may have subdomains, and the term business rule does not clarify what does or does not count as a business rule. If we take the Balisage conference schema as an example, once we leave the domain of well-formedness, 1) choosing Docbook as a foundation instead of another schema type (e.g., TEI Lite) reflects a domain-driven decision; 2) the customizations of Docbook for Balisage-specific purposes reflect domain-driven decisions; and 3) further modifications by individual authors to accommodate their individual requirements reflect still other domain-driven decisions. All three of these levels are accommodations to domain-specific needs, with progressively narrower domains. It is not clear, at least to us, how the term business rule is useful for elucidating this tiered model.

[9] DTDs date to the days of SGML (ISO specification 1986), although XML DTDs (W3C specification 1998) differ in several ways from SGML DTDs. Relax NG dates to 2001 (OASIS specification), as does XML Schema (W3C specification).

[10] For context-free and other types of grammars see Chomsky 1956. The situation with Relax NG is more complicated, as we discuss below. See Murata et al. 2005, Clark 2002, Algorithm, Clark, Design.

[11] Concerning TagML see TagML 2017 and TagML 2018. Our goal of supporting markup languages other than TagML continues one of the goals of Creole, about which the designer writes: “Although designed to be used with LMNL, Creole is data-model agnostic” (Tennison 2007, p. 26).

[12] Grammars allow multiple productions with the same LHS, so instead of a single production like A = B | C, which is allowed in a Relax NG compact-syntax schema but not in a grammar, a grammar might contain two productions: A = B and A = C. That is, a grammar, like a Relax NG compact-syntax schema, allows the concept of alternation, but expresses it through multiple linear productions instead of by means of an alternation operator.
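The difference can be made concrete with a minimal Python representation (the `(LHS, RHS)` pairs and the `alternatives` helper are illustrative, not part of our implementation): the single schema rule A = B | C becomes two grammar productions with the same LHS.

```python
# A grammar expresses alternation through multiple productions with the
# same LHS; a Relax NG compact-syntax schema expresses it with "|".
# Each production here is a (LHS, RHS) pair, so "A = B | C" becomes two rules.

productions = [
    ("A", ["B"]),   # A = B
    ("A", ["C"]),   # A = C
]

def alternatives(grammar, lhs):
    """Collect the RHS alternatives for one non-terminal."""
    return [rhs for head, rhs in grammar if head == lhs]

assert alternatives(productions, "A") == [["B"], ["C"]]
```

Collecting the productions that share an LHS recovers exactly the alternation that the schema's "|" operator expressed in one rule.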

[13] By ultimately we mean that the definition (RHS) may include both terminals and non-terminals, but any non-terminals must themselves be defined elsewhere in the grammar. This means that if we replace every LHS by the corresponding RHS recursively, we eventually wind up with only terminals. A document instance to be parsed or validated against a schema contains only terminals, which means that ultimately a parser must determine whether the inventory and arrangement of terminals in a document instance are consistent with the inventory and arrangement of terminals permitted, once all non-terminals have been expanded, by the grammar.
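The recursive expansion can be sketched in a few lines of Python (a toy grammar with invented symbols; lowercase strings stand in for terminals): any symbol without a production is a terminal, and repeated substitution of non-terminals bottoms out in a sequence of terminals.

```python
# Recursively expanding non-terminals until only terminals remain.
grammar = {
    "S": ["NP", "VP"],   # non-terminals expand further
    "NP": ["alice"],     # lowercase strings are terminals here
    "VP": ["reads"],
}

def expand(symbol):
    """Replace a non-terminal by its RHS, recursively, until only terminals remain."""
    if symbol not in grammar:          # terminal: nothing to expand
        return [symbol]
    return [t for part in grammar[symbol] for t in expand(part)]

assert expand("S") == ["alice", "reads"]
```

Validation then amounts to comparing the terminals of a document instance against the terminal sequences the fully expanded grammar permits.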

[14] This requirement is to be expected because the point of a production is to define the LHS in terms of the RHS. Terminals, by definition, do not need to be defined, which means that there is no need for productions where the LHS consists only of a terminal.

[15] While a CFG has no access to history, it is allowed to look ahead, although the smaller the lookahead, the more efficiently the grammar can be parsed.

[16] See also Cameron 1993 for a discussion of extending CFGs with permutation phrases.

[17] For a comparative description of the three see Usdin 2005.

We exclude Schematron from this discussion because Schematron is a rule-based, rather than grammar-based, language, and we describe below why we favor a grammar-based approach. We also exclude TEI ODD, a higher-level language that, among other things, integrates into the syntactic modeling a class system (which can be used to model semantic relationships) and a literate-programming approach to documentation. Nothing about the schema language and associated grammar that we propose here precludes subsequently integrating Schematron-like or ODD-like methods, but those features are out of scope at this time.

[18] XML Schema and Relax NG support rich datatyping, while DTDs do not. We exclude datatyping from our discussion because it does not interact directly with the structural features on which we concentrate.

[19] The <xs:openContent mode="interleave"> structure in XML Schema 1.1 matches some, but not all, of the features of Relax NG Interleave. See the more detailed discussion of Interleave, below.

[20] We use the terms start-marker and end-marker to refer to coordinated milestone (self-closing, empty) elements that serve to mark the paired start and end positions of ranges within a LMNL or XML document.

[21] For example:

  • There is no natural way to say in Relax NG that the values of an @n attribute on list items must reset to 1 with every new list and increase by increments of 1 in consecutive items within the same list.

  • There is no natural way to say in Relax NG that @ref attributes on references to persons must point to (agree with) matching @xml:id values in a master list of persons while @ref attributes on references to places must point to matching @xml:id values in a master list of places. This is also the case if the master lists are in separate documents.

These types of constraints can be expressed easily and naturally in Schematron.
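To make the first constraint concrete, here is a Python sketch of the check itself (the function name and data shape are invented for illustration; lists of integers stand in for the @n values of consecutive list items): each list's numbering must restart at 1 and increase by 1.

```python
# A constraint that a grammar-based schema cannot state naturally:
# @n values on list items must restart at 1 in each list and increase
# by increments of 1 within the same list.

def n_values_valid(lists_of_n):
    """True iff every list numbers its items 1, 2, 3, ... from scratch."""
    return all(ns == list(range(1, len(ns) + 1)) for ns in lists_of_n)

assert n_values_valid([[1, 2, 3], [1, 2]])   # valid: both lists reset and increment
assert not n_values_valid([[1, 2], [2, 3]])  # invalid: second list fails to reset
```

The check is trivial to state procedurally, or as a Schematron assertion over sibling context; what a grammar lacks is exactly this access to values elsewhere in the document.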

[22] See SQF.

[23] See XDM, §6.7.1. If we create XML that includes a text node that consists of zero characters, it is removed. For example, the following XSLT:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  version="3.0">
  <xsl:variable name="stuff">
    <root>
      <e>Hi</e>
      <xsl:text></xsl:text>
      <e>Bye</e>
    </root>
  </xsl:variable>
  <xsl:template name="xsl:initial-template">
    <xsl:value-of select="count($stuff/node()/node())"/>
  </xsl:template>
</xsl:stylesheet>
returns 2 as the count of child nodes of the <root> root element.

[24] For information about LMNL see Piez 2008.

[25] Tessellated refers to structures that fully occupy their domains with no gaps and no overlaps. For example, the part of a book inside the binding might be said to consist of a tessellated sequence of pages, which means that 1) the first page begins at the very beginning of the contents within the binding, 2) each new page begins exactly after the end of the immediately preceding page, and 3) the last page ends at the very end of the contents within the binding. That is, there is no part of the contents within the binding that is not part of exactly one page.
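The three conditions on pages translate directly into a predicate over (start, end) ranges. The following Python sketch (the function name is ours, for illustration) checks that a set of ranges tessellates a domain of a given total length: no gaps, no overlaps, full coverage.

```python
# Tessellation check: ranges (start, end) over a domain [0, total) must
# begin at 0, end at total, and meet exactly end-to-start, like pages
# within a binding.

def is_tessellation(ranges, total):
    ordered = sorted(ranges)
    if not ordered or ordered[0][0] != 0 or ordered[-1][1] != total:
        return False  # coverage fails at the start or end of the domain
    return all(prev_end == next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))

assert is_tessellation([(0, 10), (10, 25), (25, 30)], 30)  # pages tessellate
assert not is_tessellation([(0, 10), (12, 30)], 30)        # gap at 10-12
assert not is_tessellation([(0, 15), (10, 30)], 30)        # overlap at 10-15
```

The index references of the following note fail this predicate in both directions: some positions fall in no range, and some fall in more than one.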

[26] Index references, then, are not hierarchical and not tessellated for two reasons:

  • There are text tokens that are not part of any index reference.

  • There are text tokens that are part of more than one index reference because index references may overlap with other index references.

[27] With respect to schema syntax, both Concur and Interleave can be applied to two or more operands, although when the schema is compiled into a grammar both Interleave and Concur take only two operands. That is, schema productions with more than two operands are translated during compilation into associative grammar productions. For example, a ternary schema Interleave or Concur relationship of A, B, and C might be compiled into a binary grammar relationship of A and B, the value of which then enters into a binary grammar relationship with C. This is similar to the way a ternary choice between A, B, and C in a schema production like A | B | C might be compiled, at the grammar level, to a choice between A and B, the value of which then enters into a choice with C. Interleave, Concur, and Choice relationships in schema rules are all both associative and commutative.
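The compilation step described here is a left fold of an n-ary operator into nested binary applications. The Python sketch below is illustrative only (nested tuples stand in for grammar nodes; it is not our implementation): interleave(A, B, C) becomes interleave(interleave(A, B), C).

```python
from functools import reduce

# Compiling an n-ary schema operator into binary grammar operators by
# left-folding: each step pairs the accumulated result with the next operand.

def binarize(operator, operands):
    """Left-fold an n-ary operator into nested binary applications."""
    return reduce(lambda left, right: (operator, left, right), operands)

assert binarize("interleave", ["A", "B", "C"]) == \
    ("interleave", ("interleave", "A", "B"), "C")
```

Because Interleave, Concur, and Choice are associative and commutative, the choice of a left fold (rather than a right fold or some other grouping) does not affect the language the compiled grammar accepts.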

[28] Whether whitespace in XML is significant or insignificant, at least in some positions, is widely regarded as difficult to recognize and is therefore a common cause of error during processing. We would expect the distinction between significant and insignificant whitespace to vary among next-generation markup languages. In the example below we might prefer to insert newlines and space characters after the <root> start-tag and a newline before the end-tags, but we could do that only if the specific markup language unambiguously recognizes those whitespace characters as insignificant.

[29] The spellings start-tag (instead of, for example, start tag, open-tag, etc.), end-tag, and empty-element tag are taken directly from XML 1.0. The spellings suspend-tag and resume-tag are taken directly from Sperberg-McQueen and Huitfeldt 2008.

[30] We’ve pretty-printed the example to improve the legibility, but in practice a markup language would have to specify how to distinguish significant from insignificant whitespace.

[31] In the example above, <verse> elements are part of both concurrent hierarchies because both <chapter> elements and <section> elements have <verse> children. Furthermore, this schema requires that every <verse> element be simultaneously a child of exactly one <chapter> element and exactly one <section> element. If, though, the models for <chapter> and <section> involved mixed content, that is, mixed { verse* }, it would be possible for a <verse> in one hierarchy to correspond to plain text in the other, in which case the <verse> tags at that location would be part of only one hierarchy, and therefore not aligned. Whether the <verse> tags are visible in a hierarchy must be resolved for matching start- and end-tags, which is easy to validate for each hierarchy in isolation (recall that each operand of a concur relationship, considered in isolation, is fully ordered), but determining whether a matching start- and end-tag pair in this mixed-content situation is part of a particular hierarchy is not trivial. We leave this edge case aside for the moment, while recognizing that it remains part of a full solution to the task.

[32] The dynamic programming example, with the matrix, is used for illustration; it is not how the validation is implemented in the code. Among other things, in this example both <chapter> elements and <section> elements are repeatable, which means that we cannot construct a useful matrix in advance based only on the schema rules. As a simplification, our sample matrix includes two <chapter> elements and one <section> element because that is what happens to occur in this particular input example.

[33] This requirement ensures that start- and end-tags will be balanced.

[34] Next-generation markup languages may distinguish two ways in which one element may be inside another: containment and dominance. See Tennison 2009 for more information.

[35] For more information about attribute grammars see Knuth 1990.

Author's keywords for this paper:
Grammar; Parsing; Overlap; Discontinuity; Next-generation markup language; TAGML; CREOLE

Ronald Haentjens Dekker

Researcher

DH Lab, Huygens Institute for the History of the Netherlands

Ronald Haentjens Dekker is a researcher and software architect at DHLab and the Huygens Institute for the History of the Netherlands, part of the Royal Netherlands Academy of Arts and Sciences. As a software architect, he is responsible for translating research questions into technology or algorithms and explaining to researchers and management how specific technologies will influence their research. He has worked on transcription and annotation software, collation software, and repository software, and he is the lead developer of the CollateX collation tool. He also conducts workshops to teach researchers how to use scripting languages in combination with digital editions to enhance their research.

David J. Birnbaum

Professor Emeritus

Department of Slavic Languages and Literatures, University of Pittsburgh

David J. Birnbaum is Professor Emeritus of Slavic Languages and Literatures at the University of Pittsburgh. He has been involved in the study of electronic text technology since the mid-1980s, has delivered presentations at a variety of electronic text technology conferences, and has served on the board of the Association for Computers and the Humanities, the editorial board of Markup languages: theory and practice, and the Text Encoding Initiative Technical Council. Much of his electronic text work intersects with his research in medieval Slavic manuscript studies, but he also often writes about issues in the philosophy of markup.

Bram Buitendijk

Software Engineer

Digital Infrastructure, KNAW Humanities Cluster

Bram Buitendijk is a software engineer in the Research and Development team at the Humanities Cluster, part of the Royal Netherlands Academy of Arts and Sciences. He has worked on transcription and annotation software, collation software, and repository software.

Joris J. van Zundert

Researcher

Computational Literary Studies, Huygens Institute for the History of the Netherlands

Joris J. van Zundert is a senior researcher and developer in humanities computing in the department of literary studies and the Digital Humanities Lab at the Huygens Institute. His research focuses on computational algorithms to analyze literary and historical texts, and on aspects of humanities information and data modeling. His PhD research in the field of Science and Technology Studies focused on methodological effects of the interaction between software engineers and humanities scholars. He was awarded the title of doctor cum laude on the basis of his dissertation Scholarship in interaction in 2022. His computational analytic work focuses on the correlation between text immanent features of texts and sociological processes around the concept of literature. He is also involved in developing computational approaches to stemmatology, narratology, and scholarly editions.