Introduction
Markup and markup languages
Document markup can be understood as a means of making explicit an interpretation of a text, that is, of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted. A markup language, in turn, is a set of markup conventions used together for encoding texts (TEI Gentle introduction). Markup languages may vary in the structures and interpretations that they are (or are not) able to represent, and, where they are able to represent a structure, in the quality of the representation. Assessing the adequacy and quality of a markup language inevitably has a subjective aspect, but however the following terms are understood, any assessment would value clarity and legibility (humans typically need to read and write markup), as well as suitability for any intended automated processing (computers need to be able to process markup efficiently).
An optimal markup language would be able to represent—clearly, legibly, and in a machine-tractable way—any feature of a document that a user considers important. The predominant ecosystem of document markup conventions and tools today, based on the XML-related family of W3C (and other) specifications, struggles to support, in a clear, legible, and machine-tractable way, many concepts that are required to represent the expressive richness of documents. Although workarounds exist, structural document features that cannot easily or naturally be expressed in XML and managed with existing XML tooling include Discontinuity,[2] Overlap,[3] Non-linearity,[4] Multi-ordered content,[5] and Unordered content.[6] We refer to a markup language that retains the intellectual and engineering virtues of XML while also supporting the preceding features as a next-generation markup language.
Tokenization, parsing, and validation
A principal goal of the present contribution is to imagine, and to begin to implement, a document authoring and processing environment that meets the needs of a next-generation markup language as naturally and effectively as the existing XML ecosystem meets the needs of XML markup. Just as the use of specific elements and attributes in XML documents can be described by schema languages designed to support XML features, our goals for a next-generation markup language include that documents tagged according to such a language can be parsed and validated against a user-defined schema that recognizes and manages all features, including next-generation features, much as the current XML ecosystem recognizes and manages features of XML.
Parsing, as a high-level term, is the process of recognizing the subcomponents of a text stream. For example, parsing XML entails recognizing the different types of nodes that make up an XML document instance. Parsing in this high-level sense is sometimes subdivided into smaller tasks, such as, for example, an operation to distinguish markup from content and an operation to analyze the structural relationships among the markup and content components. In this two-part model, the part that involves recognizing markup and content tokens is typically called tokenization or lexing (lexical analysis) and the term parsing is reserved for grammatical analysis, that is, analyzing the relationships among the tokens identified during lexing.[7] It may be possible to tokenize an invalid document, so that, for example, a tokenization process might be able to recognize an end-tag at the very beginning of an XML document, where it is not permitted, by rules of well-formedness, to appear. Detecting the well-formedness violation is the responsibility of the parser (in the narrow sense of the term); the lexer identifies the tokens (structurally meaningful units in the text stream) and the parser either succeeds or fails to make sense of them.
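To make the division of labor concrete, the following sketch separates lexing from parsing. The token names and regular expressions are of our own devising, not drawn from any specification; the point is that the tokenizer recognizes tokens even in a document that violates well-formedness, leaving detection of the violation to a later parsing stage.

```python
import re

# A minimal tokenizer (lexer) for angle-bracket markup, shown only to
# illustrate the lexing/parsing split described above. The token names
# and regexes are hypothetical, not part of any specification.
TOKEN_PATTERNS = [
    ("END_TAG",   re.compile(r"</([A-Za-z][\w.-]*)>")),
    ("START_TAG", re.compile(r"<([A-Za-z][\w.-]*)>")),
    ("TEXT",      re.compile(r"[^<]+")),
]

def tokenize(stream):
    """Recognize tokens without judging well-formedness."""
    pos, tokens = 0, []
    while pos < len(stream):
        for name, pattern in TOKEN_PATTERNS:
            m = pattern.match(stream, pos)
            if m:
                tokens.append((name, m.group(0)))
                pos = m.end()
                break
        else:
            raise ValueError(f"no token recognized at offset {pos}")
    return tokens

# The lexer happily tokenizes a document that begins with an end-tag;
# detecting that violation is the parser's job, not the lexer's.
print(tokenize("</p>hello<p>"))
# [('END_TAG', '</p>'), ('TEXT', 'hello'), ('START_TAG', '<p>')]
```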
Common reasons for parsing documents include validation (see below), enhancing information (some schema languages record datatype information that is not explicit in a document instance but can be accessed during parsing), or performing transformations or other operations. The opposite of parsing can be considered serialization, where parsing translates a sequence (series) of characters into a structure that is convenient for subsequent processing (e.g., a tree in XML) and serialization turns that type of structure into a series (thus the term serialize) of characters that is convenient for storage or transmission.
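The parse/serialize relationship can be illustrated with a minimal sketch. The nested-tuple tree encoding and the serialize function are hypothetical stand-ins for a real data model, not part of any markup toolkit.

```python
# Serialization turns a tree structure back into a series of characters,
# the inverse of parsing, as described above. Nodes are either strings
# (text nodes) or (name, children) tuples (elements).
def serialize(node):
    if isinstance(node, str):  # text node
        return node
    name, children = node
    inner = "".join(serialize(child) for child in children)
    return f"<{name}>{inner}</{name}>"

tree = ("chapter", [("title", ["Intro"]), ("paragraph", ["Hello."])])
print(serialize(tree))
# <chapter><title>Intro</title><paragraph>Hello.</paragraph></chapter>
```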
Validation is verification that the grammatical structure of a specific document conforms to the types of structures permitted by a schema used for that purpose. Schemas describe a domain-specific language, that is, they are authored to enable and enforce domain-specific structural requirements. XML distinguishes validation against a schema from well-formedness checking, that is, from verifying that the document observes requirements that must be met by all XML documents, regardless of domain-specific requirements and regardless of whether they are also validated against a schema. Strictly speaking, a document with angle-bracketed markup that does not conform to XML well-formedness requirements is not XML, which means that it cannot be validated against a (domain-specific) schema. More informally, though, users ultimately care whether documents do or do not meet expectations and requirements, and may therefore regard verifying well-formedness and verifying schema validity as parts of a single process of verifying the acceptability of the document markup.[8]
A validator that simply accepts or rejects a document according to whether it conforms to structures permitted by a schema is called a recognizer. Beyond distinguishing valid from invalid documents, a validation process may emit error messages of varying granularity, which can be helpful for users who would want to correct structural invalidities. Validation can be performed in real time, and real-time validation is often incorporated into code editors to guide users through the process of adding valid markup. Alternatively, validation can be performed on completed documents.
Schemas and grammars
One difference between XML and SGML is that SGML always required a schema (the only type of schema supported by SGML was DTD), while schema validation is optional for XML. The decision by the original developers of XML that schemas should be optional recognized that a schema may be important during or immediately after authoring, but once a document is complete, has been validated, and will not be changed again, there is no need to continue to check it for validity. SGML made use of markup features that could not be interpreted without reference to a schema, which is one reason that a schema had to be obligatory for SGML, and those features were excluded from XML requirements to ensure that XML could be processed without reference to a schema.
In its current form our system requires a schema, that is, we integrate well-formedness checking with schema validation. We do not preclude separating these operations later, but for now that separation is out of scope because it is not clear how we would reliably analyze and interpret next-generation features like Overlap and Discontinuity without the guidance provided by schema rules.
Schemas that can be used to support parsing and validation of XML markup can be authored in W3C XML Schema (XML Schema 1.1), Relax NG (Clark 2002, Compact), or DTD (defined as part of XML 1.0), none of which is designed to represent the next-generation features described above. One of our goals, then, is to begin to develop a schema language and associated tooling for next-generation markup languages that is analogous to the schema languages and associated tooling currently available in the XML ecosystem.
The terms schema and grammar are sometimes used interchangeably in the structured-text community, but we use them here as technical terms with different meanings and properties.
-
We use the term schema to refer to a set of productions (rules that define a non-terminal on the left-hand side according to non-terminals and terminals on the right-hand side, similarly to Relax NG Compact Syntax) that constrains the use of specific elements, attributes, text nodes, etc. Schemas are typically developed to meet project-specific requirements. Because schemas are typically developed by content creators, they prioritize human legibility. With our next-generation schema language, as with schema languages for XML, a schema rule cannot overrule or supersede a well-formedness constraint. That is, schema rules provide a second layer of constraints beyond the universal and non-negotiable constraints imposed by well-formedness.
A schema is both a human-oriented statement of what is and is not permitted in associated documents and a starting point for a grammar, as described below.
-
We use the term grammar to refer to a constrained set of productions, optimized for machine interpretation and processing, that is generated programmatically by compiling the schema into a grammar. While schema design prioritizes human legibility, grammar design prioritizes efficient computational implementation.
As we explain below, complex tree-like operators such as OR, OPTIONAL, OneOrMore, ZeroOrMore, INTERLEAVE, and CONCUR may be rewritten as one or more linear productions when a human-oriented schema is compiled into a machine-oriented grammar. Familiar notations like Backus-Naur Form (BNF, see BNF) and its extensions (e.g., EBNF, see EBNF), which are often considered grammar notations, allow alternation and other complex operations, but we use the term grammar in this communication to refer specifically to lower-level, purely sequential notations.
Human developers interact directly with the schema, but not directly with the grammar, much as XML developers author XML Schema or Relax NG or DTD schemas to meet project-specific needs. A schema language is the declarative, human-facing component of formal document modeling, which means that it describes what is and is not permitted in a valid, conformant document instance, but it does not describe how parsing and validation are performed. If a schema is to be usable for parsing and validation, it must be amenable to compilation into a grammar that can perform the parsing and validation operations in a robust, efficient, and scalable way. The present contribution aims to:
-
Identify the intellectual and engineering requirements for implementing a user-friendly declarative formal schema language for a next-generation markup language, and
-
Contribute an implementation of such a system, including a machine-oriented grammar into which a schema can be compiled that is sufficiently robust, efficient, and scalable to be suitable for likely real-world use cases.
The three schema languages in wide use in the XML ecosystem, listed above, have mature implementations that have been in use for many years,[9] and we have no doubt that the next-generation schema language and implementation introduced here will be improved as they are tested against a broad variety of real-world documents and research needs. At the same time, the next-generation schema language we describe here, and our reference implementation of a validating parser for documents tagged in conformity with schemas expressed in that schema language, provide a viable starting point for a broad user community that is eager for next-generation functionality that is not easily or naturally implemented purely with XML markup and the existing XML toolkit.
Current widely used parsing strategies for markup technologies are often based on context-free grammars (CFGs), although the properties and restrictions they aim to describe and validate cannot always be expressed without recourse to features that cannot be represented completely within a CFG.[10] For example, the XML specification (XML 1.0) incorporates a formal, context-free grammar in EBNF notation (EBNF) that seeks to represent what is and is not permitted with XML markup, but that formal grammar must be supplemented, within the specification, with plain prose validity constraints (VC, e.g., "An element type must not be declared more than once", §3.2) and well-formedness constraints (WFC, e.g., "The Name in an element’s end-tag must match the element type in the start-tag", §3) because not all requirements for well-formed XML can be represented completely by the features available within a CFG. XML validity and well-formedness constraints can be implemented computationally in a robust, efficient, and scalable manner, but only through processing that transcends parsing and validation against the formal CFG grammar. Insofar as a grammar is a formal representation of a theory of the text, the system we describe in this contribution aims to be sufficiently expressive to admit all desired textual structures (as represented by a project-specific document schema), to recognize any structure not licensed by the schema as invalid, and to do so without recourse to extra-grammatical annotations and implementation features. Insofar as a grammar is also an engineering resource, a priority for our work is that our schema language, designed to maximize legibility for human schema designers, should be able to be compiled (translated) by the computer into a formal grammar that is adequate for efficient, scalable automated parsing and validation.
We regard the general process by which a schema is compiled into a grammar against which a document instance can be parsed and validated as an abstract, high-level metamodel, that is, as independent of any specific implementation in any specific programming language. We provide a particular implementation, then, primarily as a way of verifying whether the metamodel is implementable in principle. The distinction between the abstract metamodel and a concrete implementation invites an eventual comparison of implementations, whether with respect to performance, user experience, or other criteria, even as all implementations are required to emit what is functionally the same result because all implementations must implement the shared metamodel.
Conclusion
Ultimately, the primary aim of this contribution is to elaborate a formal model that is able to express complex phenomena of texts that users might reasonably wish to record, including those not easily represented with XML, and to do so in a way that is theoretically coherent and formally computable. This goal engages with several hard challenges that have not yet been resolved in a satisfactory way in the domain of markup languages, as is clear from, for example, the way Overlap has been a recurrent theme in markup conferences for decades. Much of the present contribution engages explicitly with formal grammar theory, but we also introduce an implementation of a proof-of-concept parser-validator that goes beyond the XML toolkit in being able to process next-generation features not as secondary or external operations, but as part of both the schema language and the grammar to which it compiles. Although our proposal emerged from our work on TagML, it is intended to be usable with any markup language that supports the next-generation features we discuss.[11]
Properties of grammars
Left-hand side and right-hand side
A grammar is primarily a set of productions (rules), each of which maps from a left-hand side (LHS) to a right-hand side (RHS). Grammatical productions are similar in some respects to statements in Relax NG compact syntax, where, for example:
chapter = element chapter { title, paragraph+ } has an LHS with the value chapter and an RHS with a value that defines chapter as an element type with the name chapter that contains a single obligatory title followed by one or more items called paragraph. title and paragraph are, in turn, defined in their own productions elsewhere in the schema. A Relax NG compact-syntax production is not exactly the same as a grammar production, but what they have in common is that a single item in the LHS (in this case chapter) is defined in terms of items in the RHS (in this case title and paragraph, along with the comma connector and the Kleene plus occurrence indicator, which describe syntactic relations among the content items).
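One way to see the LHS/RHS relationship is to encode the production above as a data structure. The tuple-based encoding below (REF, SEQ, PLUS) is our own illustration, not a notation defined by Relax NG or by our schema language.

```python
# A hypothetical in-memory representation of the production discussed above:
#   chapter = element chapter { title, paragraph+ }
# "SEQ" encodes the comma connector, "PLUS" the Kleene plus occurrence
# indicator, and "REF" a reference to a non-terminal defined elsewhere.
production = {
    "lhs": "chapter",                    # a single non-terminal
    "rhs": ("SEQ", [                     # ordered sequence (the comma)
        ("REF", "title"),                # exactly one title
        ("PLUS", ("REF", "paragraph")),  # one or more paragraphs
    ]),
}

def referenced_nonterminals(rhs):
    """Collect the non-terminals that a right-hand side refers to."""
    op = rhs[0]
    if op == "REF":
        return {rhs[1]}
    if op == "PLUS":
        return referenced_nonterminals(rhs[1])
    if op == "SEQ":
        names = set()
        for item in rhs[1]:
            names |= referenced_nonterminals(item)
        return names
    return set()

print(sorted(referenced_nonterminals(production["rhs"])))
# ['paragraph', 'title']
```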
Terminals and non-terminals
The items in the LHS and RHS are either terminals or non-terminals, which can be understood as follows:
-
A terminal is an item that cannot be expanded further, such as, in both Relax NG compact syntax and grammars, a literal string.
-
A non-terminal is an item that can be defined in terms of terminals and other non-terminals.
For example, a Relax NG production like:
published = attribute published { "yes" | "no" } defines the non-terminal published as an attribute with a value that entails a choice between two terminals, the string "yes" and the string "no". As above, a Relax NG production is not exactly the same as a grammatical production (in particular, a grammatical production does not allow alternatives, represented in Relax NG by the pipe connector),[12] but the distinction between terminals and non-terminals applies to both grammars and Relax NG schemas.
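The terminal/non-terminal distinction can be made concrete with a small sketch. The CHOICE/LIT encoding and the helper functions are hypothetical, introduced only for illustration.

```python
# Hypothetical encoding of: published = attribute published { "yes" | "no" }.
# Literal strings ("LIT") are terminals; named symbols defined in the
# grammar dictionary are non-terminals.
grammar = {
    "published": ("CHOICE", [("LIT", "yes"), ("LIT", "no")]),
}

def is_terminal(pattern):
    """A terminal cannot be expanded further; here, only literals qualify."""
    return pattern[0] == "LIT"

def matches(pattern, value):
    """Check a candidate attribute value against a pattern."""
    op = pattern[0]
    if op == "LIT":
        return value == pattern[1]
    if op == "CHOICE":
        return any(matches(alternative, value) for alternative in pattern[1])
    return False

alternatives = grammar["published"][1]
print([is_terminal(alt) for alt in alternatives])  # [True, True]
print(matches(grammar["published"], "yes"))        # True
print(matches(grammar["published"], "maybe"))      # False
```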
Parts of a grammar
A complete grammar has 1) a starting point, 2) a set of non-terminals, 3) a set of terminals, and 4) a set of productions that must ultimately define all non-terminals in terms of terminals.[13] The LHS of a grammar production must always include a non-terminal; whether it can also include terminals or other non-terminals depends on the type of grammar.[14]
A schema language may come with built-in, predefined productions, so that, for example, Relax NG compact syntax knows how to compile the grammatically non-terminal keyword text into productions with only terminals (defined with reference to a set of one-character string values) on the RHS. One difference between a schema and a grammar, then, is that a grammar must be complete (that is, must ultimately define everything using only terminals), while a schema language can come with implicit productions that do not need to be spelled out in the schema itself. This is consistent with the fact that a schema prioritizes human legibility because it is intended to be authored directly by humans, while a grammar prioritizes computational tractability because it is intended to be processed by a machine. Part of the parsing process involves compiling (transforming) human-oriented schema rules into machine-oriented grammar productions.
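The four parts of a complete grammar, together with a simple completeness check, can be sketched as follows. The Grammar class and its field names are our own illustration, not part of any implementation discussed here.

```python
from dataclasses import dataclass

# A grammar as the four-part structure described above: a start symbol,
# a set of non-terminals, a set of terminals, and a set of productions
# that must ultimately define all non-terminals in terms of terminals.
@dataclass
class Grammar:
    start: str
    nonterminals: set
    terminals: set
    productions: dict  # non-terminal -> list of RHS symbol sequences

    def is_complete(self):
        """Every non-terminal must have a production, and every RHS symbol
        must be a declared terminal or non-terminal."""
        if self.start not in self.nonterminals:
            return False
        for nt in self.nonterminals:
            if nt not in self.productions:
                return False
        for rhs_list in self.productions.values():
            for rhs in rhs_list:
                for symbol in rhs:
                    if symbol not in self.terminals and symbol not in self.nonterminals:
                        return False
        return True

g = Grammar(
    start="greeting",
    nonterminals={"greeting", "name"},
    terminals={"hello ", "world"},
    productions={
        "greeting": [["hello ", "name"]],
        "name": [["world"]],
    },
)
print(g.is_complete())  # True
```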
Three classes of grammars
This contribution discusses three types of grammars, which differ with respect to what is or is not permitted on the LHS and RHS. We say more about these types of grammars below, but we can distinguish them concisely now as follows:
-
A context-free grammar (CFG) contains productions that map from a single non-terminal LHS to a single sequence of terminals and non-terminals in the RHS. The single item in the LHS has no access to contextual history.[15]
-
A context-sensitive grammar (CSG) allows multiple symbols in the LHS, at least one of which must be a non-terminal. The non-terminal is required because, as noted above, the point of a production is to explain a non-terminal, while terminals are self-explanatory and therefore not candidates for further explanation. The other symbols (terminal and non-terminal) that are allowed on the LHS in a CSG provide context, that is, they are how a CSG, unlike a CFG, has access to direct contextual history.
-
A mildly context-sensitive grammar (MCSG) allows only a single (non-terminal) symbol on the LHS, similarly to a CFG and unlike a CSG. But an MCSG allows more than one group of terminals and non-terminals in the RHS, together with rules that determine how they can be mixed (interleaved) in the final production.
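The differences in permitted LHS shapes can be sketched with a small classifier. Note that an MCSG production has the same single-non-terminal LHS shape as a CFG production, so this hypothetical sketch (with names of our own devising) distinguishes only CFG-style from CSG-style left-hand sides.

```python
# Classify a production by the shape of its left-hand side, following the
# distinctions above. The LHS is given as a list of symbols; an MCSG LHS
# is not distinguishable from a CFG LHS by shape alone.
def lhs_style(lhs, nonterminals):
    if len(lhs) == 1 and lhs[0] in nonterminals:
        return "CFG-style (single non-terminal, no context)"
    if any(symbol in nonterminals for symbol in lhs):
        return "CSG-style (non-terminal plus context symbols)"
    return "invalid (no non-terminal on the LHS)"

nts = {"A", "B"}
print(lhs_style(["A"], nts))            # CFG-style (single non-terminal, no context)
print(lhs_style(["a", "A", "b"], nts))  # CSG-style (non-terminal plus context symbols)
print(lhs_style(["a", "b"], nts))       # invalid (no non-terminal on the LHS)
```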
About context-free grammars (CFG) and XML
As noted above, parsers based entirely on context-free grammars (CFG) are unable to model not only next-generation markup languages, but also some features of XML. The term context-free means, among other things, that CFG productions must be static, that is, they must be based exclusively on information available before looking at the document instance. Information acquired only once parsing begins is dynamic, and therefore cannot be part of a CFG production.
It is possible to parse XML against an XML schema with a CFG, but this is not the case in the absence of a schema, which is why the XML spec consists of CFG productions supplemented with WFCs. Two features of XML where the availability of a schema enables the compilation of a CFG that does not need to rely on WFCs are:
-
XML validation requires that every start-tag be followed, in a non-overlapping way, by a corresponding end-tag for the same element type. If a schema is available, it necessarily identifies all element types that are allowed in a valid document, which means that the schema can be compiled into a grammar that includes separate explicit productions for every valid element type. That is, instead of saying something like "every start-tag must be followed by a matching end-tag", a CFG compiled from a schema can say "XML that is valid against this schema can contain only element types A, B, and C; a start-tag for element A must be followed by an end-tag for element A, a start-tag for element B must be followed by an end-tag for element B, and a start-tag for element C must be followed by an end-tag for element C". These explicit rules account for all possible element types in any valid document to be parsed against the schema, and therefore remove the need for a more general grammar rule that every start-tag must be followed by a matching end-tag.
-
XML validation requires that attribute names not be repeated on an element. If a schema is available, it necessarily identifies all attributes that are permitted on an element, which means that the schema can be compiled into a grammar that includes productions for all combinations and orders of attributes. For example, a schema can say that "element A must include values for attributes X and Y", which can be translated into grammar rules that say that "the element name in the start-tag of element A must be followed by either attribute X and then attribute Y or attribute Y and then attribute X". Those rules do not allow either attribute to repeat, and therefore remove the need for a general well-formedness constraint that prohibits that type of repetition.[16]
Schema validation is likely to incorporate optimizations that reduce the computation described in a brute-force way above. Nonetheless, the difference between validation against a schema vs well-formedness checking in the absence of a schema is that when parsing against a CFG all information must be static, that is, available to the parsing process before it sees the document instance. A schema makes all valid element and attribute names and all relationships among them available, providing information that can be used when the schema is compiled into a grammar prior to parsing the document instance. In the absence of a schema, element and attribute names are available only from the document instance, which means, by definition, that they cannot be incorporated into CFG productions.
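The attribute example above can be sketched as a compilation step: given the schema-level statement that element A must carry attributes X and Y, a grammar production can be generated for each permitted attribute order. The function and symbol names below are illustrative only.

```python
from itertools import permutations

# Compile "element A must include values for attributes X and Y" into
# explicit grammar alternatives, one per attribute order, as described
# above. Repetition of an attribute is impossible by construction,
# because each RHS lists every attribute exactly once.
def attribute_productions(element, attributes):
    return [
        [f"start-of-{element}"]
        + [f"attr-{name}" for name in order]
        + ["close-of-start-tag"]
        for order in permutations(attributes)
    ]

for rhs in attribute_productions("A", ["X", "Y"]):
    print(rhs)
# ['start-of-A', 'attr-X', 'attr-Y', 'close-of-start-tag']
# ['start-of-A', 'attr-Y', 'attr-X', 'close-of-start-tag']
```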
As noted above, a common workaround for the inability to specify all properties of a markup language in a CFG has been to supplement the grammar, which defines only some features of the markup language, with well-formedness constraints (WFCs), which must be validated by augmenting the processing of the CFG. This approach has at least two types of limitation:
-
WFCs are often rendered only in plain prose, and not in a formal, machine-implementable way, much less in a way that can be integrated formally with context-free productions. This is the case, for example, with well-formedness constraints in the XML specification (XML 1.0), where features of the markup language that cannot be expressed entirely with the resources provided by a CFG must be expressed in other ways. The absence of a machine-oriented formalization introduces a risk that implementations may be based on inconsistent (and even erroneous) human interpretations of the constraints.
-
Parsers that take into account WFCs may need to keep state, that is, they may need to handle information encountered earlier in the parsing process that is not fully represented by any grammar production. For example, in the absence of a schema, an XML parser needs to know the elements that remain open at a given moment because without that information, it cannot verify that every start-tag has a corresponding end-tag and that the elements expressed by the tags are properly nested. Those requirements are expressed in the XML specification as WFCs because in the absence of a schema they cannot be expressed fully in a CFG alone.
In addition to looking backward (that is, keeping state), in cases involving ambiguous grammars a parser may also need to look ahead in order to resolve an ambiguity. Unlike keeping state, look-ahead is not necessarily incompatible with CFGs, and we discuss look-ahead operations in different types of grammars below.
The performance of CFGs benefits from not needing to save past information in memory in order to interpret an item that is currently in focus, but some next-generation markup features cannot be recognized and interpreted without contextual information. In the case of Discontinuity, for example, without access to earlier parts of a document instance, it may not be possible for a parser to recognize when an element in focus is the continuation of an element that went before, that is, whether the new element is part of a valid discontinuous construction.
In the case of CFGs with WFCs the challenges of keeping state effectively and efficiently have been studied extensively, which has led to deep knowledge about high-performance implementations. For example, an XML parser that is performing well-formedness checking in the absence of a schema does not need to hold everything before the current location in memory at all times; it is sufficient to keep track of only a subset of that information. To verify that every start-tag has a corresponding end-tag, the parsing process has to keep track of which elements have been opened but not yet closed, and in what order. Furthermore, because XML does not permit overlapping elements, the only part of that information that needs to be immediately accessible is the most recently opened element, which means that the information can be stored on a stack, a structure that provides efficient access to exactly that item. The situation is obviously more complicated once next-generation features, such as Overlap, need to be accommodated.
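The stack-based bookkeeping described above can be sketched as follows. This is a simplified illustration (no attributes, comments, or self-closing tags), not a production well-formedness checker.

```python
import re

# A stack-based check that every start-tag has a matching, properly nested
# end-tag: only the most recently opened element needs to be immediately
# accessible, so a stack of open element names suffices.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)>")

def well_nested(document):
    open_elements = []
    for m in TAG.finditer(document):
        closing, name = m.group(1), m.group(2)
        if not closing:
            open_elements.append(name)          # push on start-tag
        elif open_elements and open_elements[-1] == name:
            open_elements.pop()                 # pop on matching end-tag
        else:
            return False                        # mismatch or stray end-tag
    return not open_elements                    # everything must be closed

print(well_nested("<a><b>text</b></a>"))  # True
print(well_nested("<a><b></a></b>"))      # False: overlap is not nesting
```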
While we know the limitations of CFGs with respect to representing next-generation markup-language features, what to use instead is less clear. Compared to the markup community’s extensive and deep knowledge of modeling and processing XML with a CFG supplemented by WFCs that can be implemented efficiently, we have only limited and largely experimental knowledge of grammars for next-generation markup languages that might be able to represent all structural features of markup in a way that can be formalized for robust and efficient machine processing.
Insofar as a grammatically consistent and self-contained way of modeling a next-generation markup language may need to maintain state more comprehensively than is required for parsing XML, and perhaps to look both backward and forward, one option might be to use some kind of context-sensitive grammar. At the same time, it is not immediately clear how a context-sensitive grammar for that purpose should be constructed because context-sensitive grammars have not been standardized and remain an under-researched concept, at least in comparison to CFGs. We do know, however, that any solution will at least have to provide for three functions that are fundamental to our aim:
-
A grammar-like function that enables the expression of a basic syntax that includes all obligatory next-generation features, as enumerated above. This function is comparable to the role of well-formedness constraints in XML.
-
A schema-like function that supports additional domain-specific constraints on the use of features permitted by the grammar. This function is comparable to the role of schema languages in the XML ecosystem, where schema languages allow users to express, in a formal, machine-actionable way, the constraints that apply to domain-specific document models. Ideally the schema-like function would prioritize human legibility, much as is the case with schema languages for XML.
-
A parser-like function that enables the validation of a document instance with respect to well-formedness and, where provided, user-specified schema rules in a consistent and performant manner, emitting clear error messages when needed. To avoid the need to keep entire documents in memory during parsing and validation, this function might be compared to the functions of XML SAX or StAX (but not DOM) parsers.
Any solution will have many moving and interlocking parts, all of which are known as desiderata for document modeling but most of which have not been explored thoroughly as potential parts of a grammar for a next-generation markup language. We explore these considerations in the next several sections, but we first survey historically related work and state-of-the-art developments that are relevant as a background for understanding the solution we will then propose.
Markup and schemas for XML and beyond
Schema languages for XML
There is no need to tell a Balisage audience that there are three grammar-based schema languages in wide use for modeling XML, as listed above: DTD, XML Schema, and Relax NG.[17] The three schema languages can all be considered tree grammars (thus Manguinhas 2009, §2), and they are largely comparable in the XML structures they are able to model, with two conspicuous exceptions:[18]
-
DTDs lack support for namespaces, although namespace use can be simulated (XML Namespaces §5). The other two schema languages fully support namespaces.
-
Only Relax NG supports Interleave.[19]
This second difference, concerning Interleave patterns, is important for next-generation features for two reasons:
-
Unlike most structures accommodated by all three schema languages for XML, Interleave incorporates unordered and partially ordered properties, which cannot be represented using only features of CFGs.
-
Concur and Discontinuity, as well as Mixed content, can be described and implemented using features required for modeling Interleave.
We discuss the relationship between Interleave and next-generation markup features later in this report.
Existing schema and grammar proposals for managing Overlap
As we look beyond XML, data models and schema languages that have been proposed for managing Overlap include SGML Concur and its XML variant XConcur; Rabbit/duck grammars; and Creole. Because Tennison 2007 includes detailed explanations, examples, bibliography, and critiques of the multiple-hierarchy models Concur, XConcur (including XConcur-Cl, which is the validation component of XConcur), and Rabbit/duck grammars, we say nothing more about those technologies here except to point out that none of them engages explicitly with any next-generation feature other than Overlap. Luminescent, a LMNL parser implemented in XSLT, reflects a perspective similar to that of Rabbit/duck grammars in that it processes LMNL ranges with reference to XML milestone start- and end-markers.[20] (Tennison and Piez 2002, Luminescent) But because LMNL lacked a schema language at the time Luminescent was developed, Luminescent was designed to perform well-formedness checking, but not schema validation. Other proposals for managing Overlap have included the GODDAG data model and TexMECS markup language (Sperberg-McQueen and Huitfeldt 2004, Huitfeldt and Sperberg-McQueen 2001), which seek to model and manage not only Overlap, but also Discontinuity (Sperberg-McQueen and Huitfeldt 2008). Modeling and tooling based on Rabbit/duck grammars, GODDAG, and TexMECS have been employed in projects implemented by the resource designers and developers (e.g., Sperberg-McQueen 2018), but have not been adopted broadly elsewhere.
Schema-aware Extended Annotation Graphs offers a graph-based approach to validating some next-generation features, performing simulation-based validation by construction. (SAEAG) The authors observe that since Dataguides and Graph Schemas are inferred from the data, they cannot be used as authoring tools (§4), which makes the method unsuitable for our purposes.
Why Creole is different
As far as we have been able to determine, Creole is the only detailed proposal for managing Overlap that is designed to process all structural properties of an entire document with a single grammar instead of validating overlapping hierarchies in ways that are partially or entirely separated from one another. This is consistent with our aim of managing Overlap as an integral feature of our document model. The one detailed publication about Creole, Tennison 2007, describes the design principles, some of which we adopt directly and some of which we either modify or replace with alternatives. Alongside a thorough explanation of the design of Creole and the strategy for processing it, Tennison 2007 reports having built a proof-of-concept implementation in XSLT 2.0 that worked as expected, except that it was insufficiently performant for production use. (25) We discuss performance in greater detail below, when we introduce our own design and implementation.
Partially ordered content: how Interleave works
Most authors of Relax NG schemas will have worked with both ordered and unordered content, but partially ordered content, which is vital for modeling next-generation features, is likely to be less familiar. The Interleave feature of Relax NG is able to model partially ordered content, but even those who have used Interleave in their own schemas may have used it to represent content that they consider unordered, rather than partially ordered. For that reason, before we introduce a schema language and grammar for a markup language that supports next-generation features, we take a close look at the meaning of partially ordered content as represented by Relax NG Interleave.
Interleave of simple items: A common use of Interleave
involves unordered simple content items. For example, the content model for
<x> in the following schema:
x = element x { A & B & C }
A = element A { empty }
B = element B { empty }
C = element C { empty }
means that <x> contains exactly one instance each of A, B, and C in any order, that is, any of the following six patterns: A,B,C; A,C,B; B,A,C; B,C,A; C,A,B; and C,B,A. In this respect the unordered nature of Interleave might be compared to the unordered nature of XML attributes.
Interleave of repeated simple items: Unlike attributes, however, elements that participate in Interleave may be repeated, as in the following content model:
A* & B* & C*
This means, among other things, that Interleave allows its immediate operands (in this example: zero or more instances of A, zero or more instances of B, and zero or more instances of C) to appear in any order, licensing the same six patterns as the example above, except with zero or more repetitions of each repeatable component, e.g., A*,B*,C*; A*,C*,B*, etc. But that isn’t the only thing allowed by Interleave with Repetition, because the model above does not require the repeated items to cluster with others of the same type. This means that in addition to the six patterns listed above, it also allows any number of A, B, and C instances in any order, and is therefore equivalent to:
( A | B | C )*
This, too, might be compared to unordered content.
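The six patterns for A & B & C, and the equivalence of A* & B* & C* to ( A | B | C )*, are easy to confirm mechanically. The following Python sketch is our own illustration and is not part of any Relax NG implementation:

```python
from itertools import permutations

# A & B & C: exactly one instance of each item, in any order, i.e. the
# six permutations enumerated in the text.
single = {"".join(p) for p in permutations("ABC")}
print(sorted(single))  # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']

# A* & B* & C*: repetition without required clustering, i.e. any
# sequence whatsoever over the alphabet {A, B, C}.
def matches_repeated_interleave(s):
    return all(ch in "ABC" for ch in s)

print(matches_repeated_interleave("ABCABA"))  # True: clustering not required
```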
Interleave of items with ordered content: The repeatability of the simple content-item types in the example above obscures the way Interleave models not only unordered content, but also partially ordered content. Consider the following Relax NG schema:
x = element x { a & b }
a = c, d
b = e, f
c = element c { empty }
d = element d { empty }
e = element e { empty }
f = element f { empty }
This schema says that an element of type <x> contains an interleave of patterns of type a and b. This schema allows the two immediate operands of the Interleave operator to appear in either order, that is, either all of a before all of b, as in:
<x> <c/><d/><e/><f/> </x>
or all of b before all of a:
<x> <e/><f/><c/><d/> </x>
Up to this point we might also consider this model unordered: x contains one instance of a and one instance of b in either order.
Interleave of items with partially ordered content: The
fact that the content is partially ordered, rather than unordered, becomes visible
in the
remaining patterns allowed by this schema: the components of a (c
and d) may be interwoven (that is, interleaved) with those of b
(e and f). For example, in:
<x> <c/><e/><d/><f/> </x>
the two components of a occupy positions 1 and 3 and the two components of b occupy positions 2 and 4. It would be a mistake, though, to think that the components of a and b can be combined in any order, that is, it would be a mistake to interpret the Interleave operator in this case as modeling unordered content. Because the components of a and b are ordered (the comma connector means in this order), the two components of a must occur with c before d and the two components of b must occur with e before f. That is, the following document would be invalid:
<!-- invalid --> <x> <e/><d/><f/><c/> </x>
Interleave, then, does not mean only that the immediate operands to the interleave
operator may appear in any order; that is permitted, but it is not the only thing
that is
permitted. Interleave also does not mean that the individual atoms in the expansion
of its
arguments may appear in any order; this is not permitted because any lower-level order
imposed
by the patterns of the interleaved items must be respected. In the example above,
c, d, e, and f may occur in any order only
as long as c precedes d and e precedes f
(not necessarily immediately).
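The set of orderings licensed by this Interleave is what formal language theory calls the shuffle product of the two sequences. As an illustration only (our own sketch, not part of any validator), it can be enumerated recursively in Python:

```python
def shuffles(xs, ys):
    """All interleavings of xs and ys that preserve each sequence's
    internal order (the shuffle product of the two sequences)."""
    if not xs:
        return [list(ys)]
    if not ys:
        return [list(xs)]
    return ([[xs[0]] + rest for rest in shuffles(xs[1:], ys)] +
            [[ys[0]] + rest for rest in shuffles(xs, ys[1:])])

# a = c, d interleaved with b = e, f licenses C(4, 2) = 6 orderings.
licensed = shuffles(["c", "d"], ["e", "f"])
print(len(licensed))                     # 6
print(["c", "e", "d", "f"] in licensed)  # True: c before d, e before f
print(["e", "d", "f", "c"] in licensed)  # False: c follows d, so invalid
```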
It’s easy to imagine realistic use cases for Interleave as a way of modeling unordered
content. For example, mixed content that allows text intermixed with any number of
<a> and <b> elements can be modeled in
several ways, both with and without the Interleave operator. In a schema that includes:
p = element p { mixed { ( a | b )* } }
a = element a { text }
b = element b { text }
the first line could, alternatively, be expressed with the Interleave operator as:
p = element p { mixed { a* & b* } }
We find it more challenging, though, to think of natural, real-world use cases for partially ordered content. van der Vlist 2004 (54) gives an example of modeling groups of tourists in a museum, but we struggle to imagine that as a sort of relationship anyone might realistically want to model in a Relax NG schema. Consider, though, the following XML:
<root>
<section>
<annotation-start id="0001"/>
<header>Hi, I'm a header!</header>
<p>Hi</p>
<ab>Bye</ab>
<ab>Hi again</ab>
<annotation-end id="0001"/>
<p>Bye again</p>
</section>
</root>
Ignoring for a moment the annotations, a <section> must start with a single <header> element that is followed by zero or more instances of <p> and zero or more instances of <ab> in any order (as long as there is at least one <p> or one <ab>). The preceding example without the annotations might look like:
<root>
<section>
<header>Hi, I'm a header!</header>
<p>Hi</p>
<ab>Bye</ab>
<ab>Hi again</ab>
<p>Bye again</p>
</section>
</root>
Meanwhile, we want to be able to mark annotation ranges with milestone <annotation-start/> and <annotation-end/> elements. (We simplify the example by omitting the text of the annotation, which might, hypothetically, be represented by an attribute on the <annotation-start/> element.) These elements must appear in paired order, with exactly one start followed by exactly one end, and they must be coindexed by a shared @id value (not an @xml:id or HTML @id because it begins with a digit and repeats). The scope of an annotation has to be one or more sequential <p> and <ab> elements. A Relax NG schema to represent this is:
start = root
root = element root { section+ }
section =
element section {
(header, (p | ab)+)
& annotation-pair*
}
header = element header { text }
ab = element ab { text }
p = element p { text }
annotation-pair = annotation-start, annotation-end
annotation-start = element annotation-start { id, empty }
annotation-end = element annotation-end { id, empty }
id = attribute id { text }
There are some limitations to the types of constraints that Relax NG can enforce:
-
Relax NG cannot require the @id values on pairs of annotation starts and ends to agree.
-
Relax NG cannot prevent <annotation-end/> from immediately following <annotation-start/>, without any intervening <p> or <ab> elements.
-
There must be exactly one <header> that precedes all <p> and <ab> elements within a section. If the <header> is missing or if it is preceded by any <p> or <ab>, the document is invalid.
-
<annotation-start> and <annotation-end> elements may occur anywhere as children of a <section> element (but not as children of <header>, <p>, or <ab>). They must occur in pairs, that is, every <annotation-start/> element must precede a corresponding <annotation-end/> element. This model does not allow a new annotation to open before the first one closes, which would be a case of self-overlap, about which please see below.
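Checks of the first two kinds are usually delegated to a rule-based layer such as Schematron. Purely as an illustration, using the element and attribute names of the hypothetical example above, the co-indexing and non-empty-scope constraints can be sketched procedurally in Python:

```python
import xml.etree.ElementTree as ET

DOC = """<root><section>
  <annotation-start id="0001"/>
  <header>Hi, I'm a header!</header>
  <p>Hi</p><ab>Bye</ab><ab>Hi again</ab>
  <annotation-end id="0001"/>
  <p>Bye again</p>
</section></root>"""

def check_annotations(section):
    """Supplementary checks the Relax NG schema cannot express: paired
    @id values must agree, and an annotation must span at least one
    <p> or <ab> element."""
    open_id, spanned = None, 0
    for child in section:
        if child.tag == "annotation-start":
            open_id, spanned = child.get("id"), 0
        elif child.tag == "annotation-end":
            if child.get("id") != open_id or spanned == 0:
                return False
            open_id = None
        elif child.tag in ("p", "ab") and open_id is not None:
            spanned += 1
    return open_id is None  # every start must have been closed

root = ET.fromstring(DOC)
print(all(check_annotations(s) for s in root.iter("section")))  # True
```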
One way to think of what Interleave permits and prohibits is that it describes partial order. As we will see below, in this respect it shares some properties of Concur: each concurrent hierarchy under Concur must observe, when we ignore the components that are not part of that hierarchy, all hierarchy-specific constraints, including ordering constraints. Similarly, each operand of an Interleave operator must observe, when we ignore other immediate operands, all operand-specific constraints.
Finally, we can also consider mixed content a type of Interleave, where, for example:
p = element p { mixed { a, b } }
requires exactly one instance each of a and b in that order, with optional text nodes before, between, or after. This can be considered a partially ordered interleave of the sequence a, b and text. Because the keyword text in Relax NG means zero or more textual characters, it is implicitly optional, and because XML does not permit adjacent text nodes (which it merges into a single text node) it is also implicitly repeatable. This means that the same content constraints can be modeled alternatively with the Interleave operator as:
p = element p { (a, b) & text }
SGML DTDs included an & operator that meant in any order. The SGML operator, then, models only unordered content, which is different from the meaning of Relax NG Interleave. XML DTDs do not include an & operator.
Grammar-based and constraint-based schema languages
XML schemas in XML Schema, Relax NG, or DTD are often supplemented by Schematron schemas, that is, collections of Schematron rules. Schematron rules are expressed in XPath, and can therefore validate anything that can be expressed in XPath, which includes both structural and content-oriented features. Many document properties can be modeled in either a grammar-based schema language or Schematron, but in general:
-
Schematron is better than grammar-based languages at content-oriented validation. The inclusion of datatyping and facets in XML Schema and Relax NG enables those grammar-based languages to validate some aspects of content, but that functionality does not include validating content-based relationships across nodes either within a document or across documents.[21]
-
Grammar-based languages are better than Schematron at complete structural validation. A grammar-based schema language is intended to model an entire document, which means, for example, that Relax NG will raise an error during validation if there are parts of the document that are not represented in an associated Relax NG schema.
The completeness of grammar-based schemas enables them to support reliable and thorough content-completion hinting in an editing environment because real-time validation knows at every location in a document what is and is not allowed. Schematron can provide hints in response to specific rule violations (this is how the <oXygen/> Schematron Quick-Fix feature works),[22] but because Schematron validation does not require complete document coverage, Schematron cannot provide consistent, reliable content-completion hinting.
The schema validation introduced in this report currently focuses entirely on validating structural (syntactic) relations. We intend eventually to support datatypes and facets at least at the same level as Relax NG. We do not plan to support Schematron-like constraint validation, but nothing in our design precludes performing Schematron-like constraint validation with a different tool, much as users currently do for XML validation.
What we adopt from existing schema languages
We began our development of a next-generation schema language by adopting four features of existing schema languages for XML:
-
Grammar-based language: The grammar-based schema languages for XML listed above (DTD, Relax NG, XML Schema), as well as Creole, are designed to model entire documents, which is not the case with a constraint-based language like Schematron. We opted for a grammar-based approach because modeling an entire document was a priority. As noted above, nothing in our approach precludes employing a constraint-based language like Schematron to validate features that cannot be expressed naturally in a grammar, but that type of enrichment is not in scope for the present report.
-
Partially ordered content: Relax NG is the only grammar-based schema language in current use for XML that is capable of representing Interleave, which, as described above, can model and validate unordered and partially ordered content. As we write above, we struggle to find a natural, real-world use for the partially ordered variant of Interleave, but we do find partial order important for other purposes. For that reason, we regard Relax NG as the existing schema language for XML that comes closest to supporting the features needed for next-generation markup systems.
-
Compact syntax: We further adopt Relax NG compact syntax as the formal syntax for our schema language because we find it more legible than the XML syntax and we regard human legibility as a high priority for a schema language. Nothing in our proposal precludes expressing our schema language in other ways, including using XML syntax, but such expressions are out of scope for the present report.
-
Concur and Creole: Finally, we also adopt and further develop features (in particular, a Concur reserved word) added to Relax NG compact syntax in Creole. Creole lacks an available implementation, but the design and implementation are documented clearly in Tennison 2007.
While we copy and extend syntactic features of Relax NG and Creole, we deviate from those schema languages in our implementation. In particular, both Relax NG and Creole use parsing with Brzozowski derivatives to manage trees of expectations, while we instead implement an MCSG. We discuss the motivations for our implementation decisions below.
Patterns and anti-patterns in Relax NG compact syntax
Relax NG compact syntax permits several constructions that appear to allow markup that is either impossible to implement in XML (e.g., adjacent text nodes) or that, if implemented in a document, would result in well-formedness violations (e.g., repeated attributes). Content models in Relax NG compact syntax also fail to distinguish syntactically attributes (properties of elements) from actual content items (elements, text nodes), although the distinction between attributes and content items is unambiguous in XML syntax.
These features of Relax NG compact syntax compromise the legibility of a schema without improving its expressive power. For that reason we consider them anti-patterns and we exclude them from our proposed schema language, which otherwise aims for compatibility with Relax NG compact syntax notation. None of the ways in which our schema language is stricter than Relax NG compact syntax impinges on the expressive power of the schema because everything we prohibit can be expressed more clearly in both our schema language and Relax NG itself in other ways.
The features where our schema rules diverge from those of Relax NG compact syntax are the following:
Attributes and patterns that resolve to attributes do not accept repeatable occurrence indicators (+, *)
Attributes are not repeatable in either XML or the next-generation markup languages that we aim to model. Relax NG compact syntax nonetheless allows attributes and patterns that resolve to attributes to be attached to the repetition indicators + and *. Because a schema cannot override a well-formedness constraint and well-formedness does not permit repetition of attributes, attempts to repeat an attribute because the schema says that it is permitted will nonetheless raise a well-formedness error. Any use of + in association with an attribute is actually understood as required but not repeatable (otherwise represented in Relax NG compact syntax by the absence of any explicit occurrence indicator), and * is understood as optional and not repeatable (otherwise represented by ?).
Allowing the schema to say that attributes are repeatable when well-formedness prohibits
their repetition means that the schema purports to license XML expressions that are
not, in
fact, allowed in XML. This means that the schema misrepresents what is possible in
the XML
it will be used to validate.
Because we take seriously the role of a schema as a human-readable statement of what
is
and is not permitted in associated document instances, our schema language does not
permit
the repeatable occurrence indicators + and * to be associated with
(non-repeatable) attributes. Because attributes may be optional, they do accept the
? occurrence indicator.
text requires at least one textual character and does not accept repetition occurrence indicators (+, *)
The reserved word text in Relax NG compact syntax means zero or more contiguous textual characters. This is potentially confusing for at least two reasons:
-
It is counter-intuitive that a pattern that appears to require text could be satisfied by zero textual characters, which a human would understand as the absence of text.
-
An XML document instance cannot contain a text node that consists of zero characters, so the Relax NG keyword says that it permits something that cannot occur.[23]
In our schema language the keyword text means one or more textual characters. The Relax NG definition of zero or more textual characters is implicitly optional, but ours is not, which means that text in our system may be followed by ? to mean either no text node, and therefore no characters, or exactly one text node, and therefore one or more characters.
The next-generation markup languages we target continue the XML practice of not
permitting adjacent text nodes. In the case of both these next-generation languages
and XML,
then, any construction involving adjacent text nodes is interpreted as a single merged
text
node. For that reason we do not permit the repetition occurrence indicators +
and * after the keyword text.
Finally, we do not allow the keyword text in Interleave
patterns. The only way to interleave text nodes with elements in our schema language
is by
using the mixed keyword. We require the mixed keyword because it
signals to the user that an element contains mixed content more clearly than a repeatable
or-group that includes the keyword text.
Attributes are not content
In Relax NG attributes and patterns that resolve to attributes are included in content
models alongside elements and the keyword text. Because attributes are
properties of elements, rather than content, this is misleading; a
Relax NG content model may appear to impose ordering relationships on attributes with
respect to both other attributes and actual element content items even though no such
ordering relationships are possible. Our schema language will represent attributes separately from actual element content, but the exact mechanism for doing so is still under development.
Mixed content must be represented with the keyword mixed
Relax NG compact syntax can represent mixed content with the keyword mixed,
as in
p = element p { mixed { ( a | b )* } }
but it can, alternatively, omit the keyword mixed and instead include the keyword text in a repeatable or-group or interleave, e.g.:
p = element p { ( a | b | text )* }
or
p = element p { a* & b* & text }
Using the keyword mixed improves the human legibility of Relax NG content models because it announces unambiguously that it represents mixed content. For that reason, our schema language does not permit the last two models above, and instead requires the keyword mixed for modeling mixed content.
Overlap and Creole
About our notation
In this report we use the term <element> differently than Tennison 2007. Creole was designed to meet the requirements of LMNL, although Creole
is intended to be data-agnostic and therefore applicable to parsing other markup languages
that support overlap (Creole is not designed to support discontinuity). (Tennison 2007 26) LMNL is based on ranges, which, unlike XML
elements, are generally (but not universally) allowed to overlap.[24]
Tennison 2007 uses the following terms:
-
A LMNL <range> is a sequence of markup and text tokens. Ranges have start- and end-tags and are generally allowed to overlap with other ranges.
-
A LMNL <element> is a sequence of markup and text tokens that is not allowed to overlap with other ranges. In Creole an <element> is equivalent to (synonymous with) a <range> child of a <partition>, i.e., <element> is a notational variant (synonym) of <partition><range> … </range></partition>.
We use the term <element> where Tennison 2007 uses the term <range>, and we use the term <partition> where Creole uses <element>. We do not use the term <range>.
That is:
Table I
Comparison of terminology
| Creole | This report |
|---|---|
| <range> | <element> |
| <element> | <partition> |
Example of Overlap in Creole
Tennison 2007 uses a brief biblical example to review and critique the
representation of Overlap in SGML Concur and elsewhere and to
illustrate an alternative model and notation for Creole that address the perceived
limitations. The biblical example is illustrative and the point is more general: markup
hierarchies under a shared root (in this case a root element like
<bible>) can overlap in a variety of meaningful ways. This
particular example encodes four overlapping structures, as follows: (5–6)
-
A hierarchy of Bible, books, and chapters. Books begin with a title followed by one or more chapters, which, in turn, contain one or more verses.
-
A hierarchy of Bible, books, and sections. Books (the same books as in the preceding hierarchy) begin with a title (also the same as in the preceding hierarchy) followed by one or more sections. Sections begin with section headers added by an editor (e.g., The flood and the tower of Babel) followed by one or more paragraphs, which, in turn, contain one or more sentences. Tennison 2007 says elsewhere on the same page that every paragraph is made up of one or more verses, and because verses may break in the middle of sentences and vice versa, paragraphs thus contain concurrent tessellated hierarchies of verses and sentences.[25]
-
A hierarchy of Bible and pages. Page breaks may occur within or between components of the preceding two hierarchies (e.g., within or between books or chapters or verses in the first hierarchy or books or sections or paragraphs or sentences in the second), except that page breaks cannot appear within a book title (first, second, and third hierarchy) or section heading (second hierarchy).
-
A set of index references that is not easily or naturally modeled as a hierarchy. The role of index references in the markup is not explained in Tennison 2007, but the name suggests that they might be targets of entries/pointers in a back-of-the-book index (or its digital-publication counterpart). Index references may overlap other index references (a relationship traditionally called self-overlap), which means that some text tokens may be part of multiple index references; some text tokens may not be part of any index reference; and page breaks may fall within index references. The example says that a sentence or heading contains any number of index references, but we can easily imagine an index to the Bible in which some references might cross sentence (and perhaps larger) boundaries.[26]
The first two hierarchies are loosely coupled semantically in that books (also chapters)
and sections tend to be about something, and we would not expect a
titled, topical section to begin in one titled, topical book and end in another. One
way to
understand this is that a <book> in one hierarchy has a
<title> followed by <chapter> children,
while the same <book>, in another hierarchy, has the same
<title> followed by <section> children.
That is, the hierarchies are congruent at a higher (<book>) level
and then diverge partially (shared <title> but then
<chapter> vs <section>) below that.
Neither of these two hierarchies is subordinate to the other structurally because
sections
may start or end in the middle of chapters (and even, at least potentially, verses)
and vice
versa. A similar relationship obtains within paragraphs, where a paragraph is the
parent of
mutually independent tessellated sequences of sentences and of verses.
The example in Tennison 2007 models pages as milestone XML processing
instructions with the name new-page; alternatively, pages could be modeled with
tessellated milestone start- and end-markers, which is the structure used in the example
to
model index references. In either case we might consider pages to be subcomponents
of the
entire Bible (and not only of books or anything at a lower hierarchical level) insofar
as
new pages may begin either within or between books.
How Concur works in SGML and in Creole
In a traditional context-free production in a schema, the RHS is expanded in order,
except that, as described above, Interleave relationships are partially ordered
because the operands to the Interleave operator may occur in any order, with
the restriction that any order within any of the subcomponents must be
maintained. For example, if A, B is interleaved with C, D, the
four components may occur in any order only as long as A precedes
B and C precedes D. In this case, then, of the 24
possible permutations of the four components, only the following six are allowed:
A, B, C, D; A, C, B, D; A, C, D, B;
C, D, A, B; C, A, D, B; and C, A, B, D.[27]
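This count is easy to verify by brute force. The following Python sketch (our own illustration) filters the 24 permutations down to those that preserve both internal orders:

```python
from itertools import permutations

def preserves(order, word):
    """True when the symbols of `order` appear in `word` in that
    relative order (not necessarily adjacently)."""
    positions = [word.index(ch) for ch in order]
    return positions == sorted(positions)

# Interleaving A, B with C, D: keep only permutations in which
# A precedes B and C precedes D.
allowed = ["".join(p) for p in permutations("ABCD")
           if preserves("AB", "".join(p)) and preserves("CD", "".join(p))]
print(len(allowed))  # 6
print(allowed)       # ['ABCD', 'ACBD', 'ACDB', 'CABD', 'CADB', 'CDAB']
```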
SGML Concur applies to entire documents, while Creole Concur
can apply to subtrees within a document. The Creole Concur operator is similar
to the Relax NG (and Creole) Interleave operator in that any order within a
model that participates in a Concur relationship must be respected. For
example, given the first two biblical hierarchies above, neither has a defined relative
order with respect to the other, but within the hierarchy of titled sections, each
<section> must contain exactly one
<header> element followed by one or more
<paragraph> elements.
In SGML Concur text tokens in the document instance, unlike element tags,
are allowed only in positions where they are simultaneously valid in
all concurrent content models. The same is true of Creole
Concur, but with an important exception involving
partitions that we discuss below. This distinction between markup
tokens (which may be part of some but not all models) and text tokens (which must
be part of
all models unless, under Creole, they are part of partitions) is consistent with our
understanding of Concur as concurrent markup over a single, shared
document, rather than as concurrent documents. At the same
time, the behavior of text tokens has important implications for how sequences can
be
combined with Concur. Consider, for example, the following Creole schema
production:
element root {
concur {
element section {
text
},
element chapter {
text
}
}
}
In Creole, constructions like concur { A, B } mean that A and B overlap. The example above, then, means that a single <section> element with plain text content overlaps with a single <chapter> element that also has plain text content. The following examples would all be valid against this production:
<root><section><chapter>This is an example text</section></chapter></root>
<root><chapter><section>This is an example text</chapter></section></root>
<root><chapter><section>This is an example text</section></chapter></root>
<root><section><chapter>This is an example text</chapter></section></root>
That is, the document tokens are partially ordered because the relative order of the
start- and end-tags for the <section> and
<chapter> elements is not informational, but the start-tag of each
element must precede its corresponding end-tag. The content model for
<section> elements does not allow a
<chapter> as a child of <section>, and
vice versa, but the last two examples are nonetheless valid and unambiguous because
the
schema defines the elements as concurrent, rather than nesting. It should be clear
from this
example that schema information may be required to distinguish concurrent relationships
from
hierarchical (nested) ones.
The following sequence, on the other hand, is not allowed under the production above:
<root><section>This is an </section><chapter>example text</chapter></root>
This example respects the relative order of the markup tokens (the start- and end-tags)
in both of the two concurrent sequences: either sequence, by itself, has the correct
root
element, the correct child element, and correct textual content. The reason this example
is
invalid is that the text tokens do not adhere to the rule that all text tokens in
concurrent
hierarchies, unlike markup tokens (start- and end-tags), must be valid in both hierarchies
simultaneously. In the example above, the text between the
<chapter> tags makes the <section> part of
the Concur relationship invalid because the <section>
part is defined in a way that does not permit text after the
</section> end-tag. Similarly, the text between the
<section> tags makes the <chapter> part of
the Concur relationship invalid because text is not permitted before the
<chapter> start-tag. We will see below that we can resolve this
issue by allowing text to appear more broadly in the concurrent content models.
A Concur relationship between two single elements that have textual content
is not very interesting, and we can enhance it by adding occurrence indicators that
allow
the components of the Concur relationship to repeat, as in the following
example:
element root {
concur {
element section+ {
text
},
element chapter+ {
text
}
}
}
The schema above allows for one or more sections and one or more chapters to overlap in any way as long as all text tokens in a valid document instance are allowed under both hierarchies. This means that:
-
Both the section and chapter markup need to start immediately after the <root> start-tag, before any textual content;
-
Both the section and chapter markup need to end just before the </root> end-tag, without any intervening text; and
-
No textual content is allowed between an end-tag of one element and the start-tag of another.
Under the preceding schema the following sequence is now allowed:[28]
<root><section><chapter>This is </section><section> an example text</chapter><chapter> And more text</chapter></section></root>In this example all text tokens are between the start- and end-tags of both
<section> and <chapter> elements, which satisfies the
requirement that under Concur all text tokens must be valid under all
concurrent hierarchies simultaneously. This example would not be valid if there were
text
immediately after the <root> start-tag, immediately before the
</root> end-tag, or between adjacent tags, since in all of those
positions the text would not be allowed under one or both of the concurrent
hierarchies.
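The rule that every text token must be licensed by both hierarchies lends itself to a simple mechanical check. The following Python sketch (the token representation and function name are our own, purely illustrative, and not the schema language described in this paper) scans a flattened token stream and confirms that each text token falls inside an open <section> and an open <chapter>:

```python
# Illustrative sketch: verify that every text token in a flattened token
# stream lies inside both concurrent hierarchies. Tokens are
# ("start", name), ("end", name), or ("text", content) tuples.

def text_in_both(tokens, names=("section", "chapter")):
    open_count = {n: 0 for n in names}
    for kind, value in tokens:
        if kind == "start":
            open_count[value] += 1
        elif kind == "end":
            open_count[value] -= 1
        elif kind == "text":
            # A text token is valid only if both hierarchies are open here.
            if not all(open_count[n] > 0 for n in names):
                return False
    return True

# The valid example from the text, with the <root> tags omitted:
valid = [("start", "section"), ("start", "chapter"), ("text", "This is "),
         ("end", "section"), ("start", "section"), ("text", " an example text"),
         ("end", "chapter"), ("start", "chapter"), ("text", " And more text"),
         ("end", "chapter"), ("end", "section")]
print(text_in_both(valid))   # True
print(text_in_both([("start", "section"), ("text", "stray")]))  # False
```

Note that this checks only the placement of text tokens, not the ordering constraints on the tags themselves, which the full Concur machinery described below handles.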
Concur and mixed content
Because the Creole mixed keyword, borrowed from Relax NG, effectively makes
text optional before, between, and after the elements to which it is applied, it can
be used
to simplify the modeling of some Concur relationships. Consider, for example:
element root {
    concur {
        element section {
            text
        },
        mixed {
            element chapter {
                text
            }
        }
    }
}

The preceding model allows a single <section>
element to concur (overlap) with a mixed model that includes a single
<chapter> element. The mixed model allows (but does not require)
text before or after the <chapter> element, which means that the
following sequence is valid:
<root><section><chapter>This is an example text</chapter>more text</section></root>

In the example above, all text tokens are located between <section> start- and end-tags, as required, since the <section> element does not allow mixed content. The last two words are not between <chapter> tags, but the <chapter> part of the
Concur relationship, modeled as mixed content, allows plain text in
this position, with the result that the text is valid under both parts of the
Concur relationship. On the other hand, the following example is not valid:
<root><section><chapter>This is an example text</section></chapter>more text</root>

Here the text at the end is not part of either of the
Concur components, which
is acceptable with respect to the <chapter> element, but not the
<section> element. Modeling both <section> and
<chapter> elements as mixed content would make both of the
preceding examples valid.
Using mixed content as described above can simplify modeling concurrent relationships,
but that simplification comes with the cost of allowing plain text in places where
it might
not be part of our mental model of one or another hierarchy in a concurrent relationship.
That is, although we can choose to ignore text that we don’t consider an informational
part
of one or another hierarchy during downstream processing, it is nonetheless part of
the
content according to the model. This is unlike the situation with markup tokens, as
illustrated in the example above, where we can include <chapter>
start- and end-tags inside a concurrent <section> element without
making the <chapter> part of the content of the
<section>.
Tennison 2007 recognized the requirement that all text tokens be part of all
overlapping hierarchies as a challenge for effective modeling. For example, the biblical
<section> hierarchy requires that a
<section> begin with a <header>, but the
section header is an editorial intervention that is not part of the text of the Bible,
and
therefore not part of the hierarchy of chapters and verses. Creole Concur
addresses this concern by diverging from SGML Concur in allowing
<element> structures alongside <range>
structures, where <element> is shorthand for wrapping a
<partition> around a <range>. (16) This
allows Creole to model text tokens that are not part of all concurrent hierarchies:
What about the <heading> element within the <section>? Its text isn’t inside a <verse>, so would normally be invalid. However, we can define the <heading> as a partition—it can’t overlap anything else—and this means that it’s ignored by any concurrent patterns.
In summary, then: Concur in Creole allows the symbols (tags and text tokens) of multiple sequences to be intertwined at the token level as long as:
- The relative order of the tokens in the individual sequences is maintained; and
- All text tokens must appear in positions that are valid for all sequences, except that …
- … text tokens within a <partition> are invisible to other concurrent hierarchies.
Concur and Interleave
Although the Concur and Interleave operators are similar in
that both model partially ordered content, there are subtle but important differences:
- In Interleave the interleaved components are subtrees, while in Concur they are markup tokens and text tokens. As a consequence, Interleave relationships result in a tree, while Concur relationships result in a graph.
- Concur allows elements of the same type to appear in multiple concurrent models, while Interleave does not permit the same symbol to occur in multiple interleaved sequences. This difference is a consequence of the difference between trees and graphs; graphs allow nodes to be the targets of multiple directed edges, while trees require that no node have more than one parent. Allowing the same node type to appear in both parts of an Interleave relationship would mean either multiple parentage (not permitted in a tree) or ambiguity (to which of the interleaved components would the nodes belong?).
Mildly context-sensitive grammars (MCSG)
Handling overlap and implementing the Concur operator
Now that we have an understanding of the kinds of sequences that a Concur
operator in a schema can validate, we need to find equivalent grammar rules that we
can
create from those sequences. A CFG, which requires exactly one sequence on the right-hand
side of a production, is not capable of expressing a Concur relationship, but a
MCSG, which allows one or two sequences on the right-hand side, is. Within the grammar,
each
production of this type has reorder constraints associated with it that define all
possible
combinations of tokens of the sequences. All of this apparatus—the two sequences on
the
right-hand side and the reorder constraints—is invisible to the end user, who interacts
directly only with the schema syntax, including the Concur operator.
Grammars are typically written as expressions over symbols, and in our markup domain we use tokens as symbols. The set of token types is derived from the element names, which for any finite set of documents is also a finite set. The tokens can be categorized as:
- Markup specific[29]
  - start-tag
  - end-tag
  - empty-element tag
  - suspend-tag
  - resume-tag
- Text
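These token categories can be recognized with a handful of patterns. The following Python sketch classifies TexMECS-style tags; the regular expressions are our own rough approximations for illustration, not a complete TexMECS lexer:

```python
import re

# Sketch: classify TexMECS-style markup tokens into the categories listed
# above. The patterns are illustrative approximations. Order matters:
# the suspend-tag pattern must be tried before the end-tag pattern,
# because |-q> would otherwise also match as an end-tag.

TOKEN_TYPES = [
    ("suspend-tag",       re.compile(r"\|-[^|<>]+>")),   # |-q>
    ("resume-tag",        re.compile(r"<\+[^|<>]+\|")),  # <+q|
    ("empty-element tag", re.compile(r"<[^|<>]+>")),     # sole tag <e>
    ("start-tag",         re.compile(r"<[^|<>]+\|")),    # <verse n="1"|
    ("end-tag",           re.compile(r"\|[^|<>]+>")),    # |verse>
]

def classify(token):
    for name, pattern in TOKEN_TYPES:
        if pattern.fullmatch(token):
            return name
    return "text"

print(classify("<verse|"))              # start-tag
print(classify("|-q>"))                 # suspend-tag
print(classify("<+q|"))                 # resume-tag
print(classify("God thought of Noah"))  # text
```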
The following simplified biblical example, tagged with TexMECS Concur, is
based on a more elaborate example in Tennison 2007:[30]
<genesis|
<chapter n="7"|
<section|
<verse n="24"|… a hundred and fifty days.|verse>
|chapter>
<chapter n="8"|
<verse n="1"|God thought of Noah …|verse>
|section>
|chapter>
|genesis>
The following schema for the biblical example above is based on features introduced
in
Tennison 2007. The ~ symbol is the Concur operator,
so the second line means that a <genesis> element contains a
concurrence of one or more <chapter> elements with one or more
<section> elements. Note that we use the term
<element> where Tennison 2007 uses
<range>:
start = genesis
genesis = chapter+ ~ section+
chapter = element chapter { attribute n, verse }
section = element section { verse }
verse = element verse { attribute n, text }

This can be compiled to a MCSG as follows:

start := <genesis| genesis_content |genesis>;
genesis_content := section optional_more_sections; , chapter optional_more_chapters;
section := <section| verse |section>;
chapter := <chapter attribute_n | verse |chapter>;
verse := <verse attribute_n "|" text |verse>;
attribute_n := "n" "=" string;
optional_more_sections := section optional_more_sections;
optional_more_sections := epsilon;
optional_more_chapters := chapter optional_more_chapters;
optional_more_chapters := epsilon;
In our notation a semicolon (;) in a production rule marks the end of a single sequence of symbols on the right-hand side. In a MCSG the right-hand side can contain up to two sequences of symbols, and in such cases we use a comma (,) as a separator between them.
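The semicolon/comma convention maps directly onto a data structure in which a right-hand side holds one or two sequences. A hypothetical Python encoding of part of the compiled grammar above (the representation is our own illustration, not our implementation's internal format):

```python
# Sketch: encode MCSG productions as (lhs, [sequence, ...]) pairs, where
# the right-hand side holds one sequence (CFG-style rule) or two
# (Concur-style rule with a reorder constraint). Symbol names follow the
# compiled grammar in the text.

productions = [
    # genesis_content carries two concurrent sequences (the "," above)
    ("genesis_content", [["section", "optional_more_sections"],
                         ["chapter", "optional_more_chapters"]]),
    # ordinary productions carry a single sequence
    ("optional_more_sections", [["section", "optional_more_sections"]]),
    ("optional_more_sections", [["epsilon"]]),
]

def is_concur(production):
    """A production carries a reorder constraint iff its right-hand side
    holds two sequences."""
    lhs, rhs_sequences = production
    return len(rhs_sequences) == 2

print(is_concur(productions[0]))  # True
print(is_concur(productions[1]))  # False
```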
The following table shows the alignment of the input tokens with the tokens of operand
1
(<chapter>) and 2 (<section>) of the
Concur operator:
Table II
Overlap of <section> and <chapter>
using the Concur operator
| Input | Operand 1 | Operand 2 |
|---|---|---|
| | (chapter) | (section) |
| genesis | X | X |
| chapter | chapter | |
| section | | section |
| verse | verse | verse |
| text | text | text |
| /verse | /verse | /verse |
| /chapter | /chapter | |
| chapter | chapter | |
| verse | verse | verse |
| text | text | text |
| /verse | /verse | /verse |
| /section | | /section |
| /chapter | /chapter | |
| /genesis | X | X |
In this table, as in the schema above, sequences of <chapter>
and <section> elements observe a Concur relationship.
<verse> elements are part of both concurrent hierarchies. An
X (in the rows for the <genesis> start- and end-tags)
means that the element is not a participant in a Concur relationship. Text
nodes (except inside a <partition>, which is not relevant here) are
always part of both concurrent hierarchies.
We refer to the relationship between concurrent sequences as a reorder constraint because the right-hand side of the production rule lists one entire sequence before the other entire sequence, regardless of the order of the tokens in the input. When such a rule is used for validation, the goal is to confirm that the input matches a concurrence of the two sequences on the right-hand side. That confirmation requires knowing what is and is not permitted in a concurrence, and those rules are:
- Each of the two sequences in a concurrent relationship, considered in isolation, is fully ordered. That is, if operand sequence A requires the ordering A1, A2, A3 and operand sequence B requires the ordering B1, B2, B3, the markup tokens of each of those two sequences, considered in isolation, must appear in the input in the specified order.
- Although the tokens of each of the two sequences have to observe any ordering specified in the schema, the markup tokens of the two sequences do not have to be aligned with each other. This means that the two sequences that make up the right-hand side of the production rule are partially ordered: each sequence observes its own ordering rules, but the markup tokens of the two sequences are not compared to each other.[31]
- Unlike the markup tokens of the two concurrent sequences, the text tokens of the two sequences do have to be aligned.
The reorder rule of the two sequences means that any implementation must adhere to three rules:
- All symbols of both operand sequences must be fully present in the input data.
- The relative order within each of the sequences must be observed.
- When a text token is encountered in the input stream it must be present at the current relative location in both operand sequences.
We have to determine whether the input sequence is valid for the operands of the
Concur operator, recalling that the text tokens of the two operands have to
line up. We do this by thinking of it as an alignment problem, which we address using
a
method called dynamic programming.[32] We can approach the task by arranging the tokens in a matrix with the tokens of
the first operand serving as column labels and the tokens of the second operand serving
as
row labels. Initially we enter only the column and row labels; we explain below where the cell values (x) come from:
Table III
| | chapter | verse | text | /verse | /chapter | chapter | verse | text | /verse | /chapter |
|---|---|---|---|---|---|---|---|---|---|---|
| (0,0) | x | | | | | | | | | |
| section | x | | | | | | | | | |
| verse | | x | | | | | | | | |
| text | | | x | | | | | | | |
| /verse | | | | x | x | x | | | | |
| verse | | | | | | | x | | | |
| text | | | | | | | | x | | |
| /verse | | | | | | | | | x | |
| /section | | | | | | | | | x | x |
Starting with just column and row labels, we will fill in the cells as we use the
sequence of input tokens to guide our traversal of the matrix from the upper left
to the
lower right. We can think of this traversal as moving two pointers, both of which
start at
0, so that the start coordinate would be (0, 0). When we say that a pointer is on 1, we mean that it is located after the first token. In other words, each pointer
is always between the most recently processed token and the next token to be expected
in the
input. Both pointers start on 0, that is, before the first token in the two sequences
has
been processed. Our traversal must observe the following output requirements:
- The complete input sequence is matched by tokens in the operands.
- The text tokens in both sequences line up.
- If the input is valid, we can always advance one or both pointers.
- If the input is valid, whenever a start-tag is matched simultaneously in both sequences, the corresponding end-tag must also be matched simultaneously in both sequences.[33]
- If the input is valid, we end in the lower right.
Whenever a markup token comes in it must be allowed in the next position in either or both sequences, in which case we increment the respective pointer(s). If a markup token is not allowed at that location in either sequence, the input is not valid. For example:
- The first token in our input example is a <chapter> start-tag, which lets us increment (only) the row pointer but not the column pointer, so we move from (0,0) to (1,0).
- The next input token is a <section> start-tag, which lets us increment only the column pointer, so we move from (1,0) to (1,1).
- The next input token is a <verse> start-tag, which is expected in both sequences at that location (immediately after a <chapter> start-tag in one and immediately after a <section> start-tag in the other), so we increment both pointers at the same time, that is, we move diagonally in the matrix, from (1,1) to (2,2).
Text tokens are handled differently. In the case of text input, both pointers must point to a text token in their respective operand sequences of tokens. The unusual detail is that this can happen in two ways:
- If a pointer is on a markup token, the next token must be a text token.
- If a pointer is on a text token, it is not moved.
Consider the following example, which contains two <chapter> elements (one with the value Hi and the other with the value Mom) and a concurrence of <section> and <chapter> elements:

<genesis|
<section|
<chapter|Hi|chapter>
<chapter|Mom|section>
|chapter>
|genesis>
A schema that supports this markup is:
start = genesis
genesis = chapter+ ~ section+
chapter = element chapter { text }
section = element section { text }
A matrix that represents how we might traverse this concurrent input, based on the method described above, is:
Table IV
| | section | text | /section |
|---|---|---|---|
| (0,0) | x | | |
| chapter | x | | |
| text | | x | |
| /chapter | | x | |
| chapter | | x | |
| text | | x | x |
| /chapter | | | x |
In this example each of the two <chapter> elements contains a one-word text node and the single <section> element contains a two-word text
node. When we reach the text token inside the second <chapter>, the
input remains valid because of the second situation above: we are currently on a text
token
in the <section>, and therefore do not need to advance the pointer
on that axis. Were we to encounter a text token in one hierarchy when there was not
a text
token in the other hierarchy (which would satisfy the first condition) and when we
were not
already on a text token in the other hierarchy (which would satisfy the second condition),
the input would not be valid against the schema.
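The pointer rules above can be sketched in a few lines of Python. The token representation and function name below are our own; this is an illustration of the traversal rules, not the authors' implementation, and it omits the simultaneous start-/end-tag requirement of note [33]:

```python
# Sketch of the matrix traversal: two pointers, one per operand sequence,
# validate a concurrent input stream. Tokens are plain strings; "text"
# stands for any text token.

def concur_valid(input_tokens, op1, op2):
    p1 = p2 = 0  # each pointer sits after the most recently matched token

    def on_text(op, p):
        return p > 0 and op[p - 1] == "text"

    for tok in input_tokens:
        if tok == "text":
            ok1 = ok2 = False
            if p1 < len(op1) and op1[p1] == "text":
                p1 += 1        # rule 1: advance onto the text token
                ok1 = True
            elif on_text(op1, p1):
                ok1 = True     # rule 2: already on a text token, stay put
            if p2 < len(op2) and op2[p2] == "text":
                p2 += 1
                ok2 = True
            elif on_text(op2, p2):
                ok2 = True
            if not (ok1 and ok2):
                return False   # text must line up in both operands
        else:
            moved = False
            if p1 < len(op1) and op1[p1] == tok:
                p1, moved = p1 + 1, True
            if p2 < len(op2) and op2[p2] == tok:
                p2, moved = p2 + 1, True
            if not moved:
                return False   # markup token expected in neither operand
    return p1 == len(op1) and p2 == len(op2)  # end in the lower right

# The Hi/Mom example above, with the <genesis> tags omitted:
op_chapter = ["<chapter|", "text", "|chapter>", "<chapter|", "text", "|chapter>"]
op_section = ["<section|", "text", "|section>"]
stream = ["<section|", "<chapter|", "text", "|chapter>",
          "<chapter|", "text", "|section>", "|chapter>"]
print(concur_valid(stream, op_chapter, op_section))  # True
```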
Handling discontinuity
Discontinuity in TexMECS and TAGML is expressed through the use of suspend- and
resume-tags (see the example below), which are allowed only where either an
Interleave (recall that mixed content is an Interleave of an
element sequence with text) or a Concur relationship is active. This constraint
is expected because what discontinuity shares with Interleave and
Concur is that the content is partially ordered, that is, tokens that are part
of the discontinuous segments must appear in the correct order, but tokens that are
not part
of the discontinuity may appear before, between, and after them. Sperberg-McQueen and Huitfeldt 2008 explains that:
It seems clear that if we specify that the discontinuous element (q or B in our examples) matches a normal (first-class) token in a content model, then the usual rules for interpreting right-hand sides in the production rules of context-free grammars do not apply: the usual interpretation is that the right-hand side specifies a sequence of children for the left-hand side, but as described at some length above (section “Goddag structures”) the discontinuous element itself cannot be part of any sequence which also contains the material which interrupts it. We postulate, therefore, that discontinuous elements, taken as wholes, do not match first-class tokens in content models.
To accommodate elements that participate in a discontinuity relationship we adjust our MCSG slightly. Here is a sample input file with TexMECS tagging, taken from Sperberg-McQueen and Huitfeldt 2008:
<p|Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, <q|and what is the use of a book,|-q> thought Alice <+q|without pictures or conversation?|q> |p>
In the example above, Alice’s single thought is tagged as a discontinuity of <q> (quote) elements, interrupted by the phrase thought Alice, which is in the narrator’s voice. The following schema supports this markup:
start = paragraph
paragraph = element p { mixed { quote } }
quote = element q { text }
The schema compiler translates the preceding schema into the following MCSG:

start := <p| mixed_quote |p>;
mixed_quote := text; , quote;
quote := <q| text optional_discontinuity_quote |q>;
optional_discontinuity_quote := |-q> <+q| text optional_discontinuity_quote;
optional_discontinuity_quote := epsilon;
The table below illustrates how the input tokens (leftmost column) are allocated between
the two participants in a mixed-content (that is, Interleave) relationship, the
discontinuous quotation (operand 1) and the text with which it is mixed (operand 2):
Table V
Discontinuity of <q> within a
<p> with mixed content using the Interleave
operator
| Input | Operand 1 | Operand 2 |
|---|---|---|
| | | (mixed) |
| p | X | X |
| text | | text |
| q | q | |
| text | text | |
| -q | -q | |
| text | | text |
| +q | +q | |
| text | text | |
| /q | /q | |
| /p | X | X |
The <p> element is the parent of both parts of the Interleave; its tags are marked with an X to show that they are not part of either of the interleaved sequences. When the <q> element
is paused, the content needs to come from the other operand until the
<q> resume-tag is reached. In this sense discontinuity is
comparable to a <partition> within a Concur
operator.
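The legal ordering of suspend- and resume-tags can be pictured as a small state machine. Here is a Python sketch for the single discontinuous <q> element above; the states and function name are our own illustration, not the authors' parser:

```python
# Sketch: a minimal state machine for one discontinuous element, checking
# that suspend- and resume-tags appear in a legal order. Non-q tokens
# ("text") pass through unchanged.

def q_states(tokens):
    state = "closed"  # closed -> open -> suspended -> open -> ... -> closed
    transitions = {
        ("closed", "<q|"): "open",
        ("open", "|-q>"): "suspended",  # suspend: content now belongs to
                                        # the surrounding mixed operand
        ("suspended", "<+q|"): "open",  # resume
        ("open", "|q>"): "closed",
    }
    for tok in tokens:
        if tok in ("<q|", "|-q>", "<+q|", "|q>"):
            if (state, tok) not in transitions:
                return False
            state = transitions[(state, tok)]
    return state == "closed"

# The Alice example, reduced to its token kinds:
alice = ["text", "<q|", "text", "|-q>", "text", "<+q|", "text", "|q>"]
print(q_states(alice))                            # True
print(q_states(["<q|", "<+q|", "text", "|q>"]))   # False: resume before suspend
```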
Self-overlap with a mildly context-sensitive grammar
Creole introduces concurOneOrMore and concurZeroOrMore
operators to manage self-overlap, that is, situations where an element of a particular
type
overlaps with another element of the same type. Self-overlap requires special handling
for
two reasons:
- In markup languages like MECS, TexMECS, LMNL, and TAGML, all of which support self-overlap, start- and end-tags that are paired together use a shared identifier on the start- and end-tag. Without that sort of identifier, input like the following:

  <x>A<x>B</x>C</x>

  would be structurally ambiguous. There are two <x> elements, each with a start- and end-tag, but nothing in the markup tells us whether an inner <x> element is a child of an outer <x> element, on the one hand, or the two <x> elements overlap and share the text token B. We can disambiguate that situation by adding matching identifiers to both the start- and end-tags, so that, for example, overlap, as in

  <x n="1">A<x n="2">B</x n="1">C</x n="2">

  would be distinct from containment,[34] as in

  <x n="1">A<x n="2">B</x n="2">C</x n="1">

- Production rules in a grammar traditionally are completely static, which means that the exact rules are known from information in the schema before parsing begins (that is, without access to information in the document instance) and the production rules do not change in response to anything encountered in the input.
The challenge is that the exact disambiguating identifiers that will occur in the document instance cannot be specified in the schema if the schema is to be applicable to more than a single document, as schemas typically are. For that reason, processing will need to recognize when a start-tag with an identifier comes along and store it somewhere until an end-tag with a matching identifier is encountered, so that the parsing process can verify that each opened element is closed in a way that is consistent with schema requirements. We implement this functionality with an attribute grammar, which is an extension on top of context-free grammars that allows production rules to capture part of the input stream, store it in what is called an attribute, and reuse that attribute later in the input stream to recognize whether an identifier given in an end-tag is valid at the location where it occurs.[35]
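The capture-and-compare behavior described above can be sketched without the full attribute-grammar machinery. In the following Python illustration (the (kind, name, identifier) token representation is our own, not the authors' attribute-grammar engine), each start-tag captures its identifier and each end-tag must match a captured, still-open identifier:

```python
# Sketch: the effect of the grammar attributes described above, namely
# capturing the identifier of a start-tag and checking it against the
# matching end-tag.

def identifiers_balanced(tokens):
    open_ids = {}  # identifier -> element name, for currently open elements
    for kind, name, ident in tokens:
        if kind == "start":
            if ident in open_ids:
                return False  # identifiers must be unique among open tags
            open_ids[ident] = name
        elif kind == "end":
            # the end-tag identifier must have been captured by a start-tag
            if open_ids.get(ident) != name:
                return False
            del open_ids[ident]
    return not open_ids  # every captured identifier must be consumed

# Self-overlap: <index~1| ... <index~2| ... |index~1> ... |index~2>
overlap = [("start", "index", "1"), ("start", "index", "2"),
           ("end", "index", "1"), ("end", "index", "2")]
print(identifiers_balanced(overlap))  # True
print(identifiers_balanced([("start", "index", "1"),
                            ("end", "index", "2")]))  # False
```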
Consider the following TexMECS example:
<verse no="23"|God wiped out every living thing that existed on earth, <index~1 ref="i0037"|man and <index~2 ref="i0038"|beast|index~1>, reptile and bird|index~2>; they were all wiped out over the whole earth, and only Noah and his company in the ark survived.|verse>
In the TexMECS example above, which has been simplified from an example in Tennison 2007 (p. 11), Verse 23 contains (self-)overlapping targets of index references,
that is, of target spans of text to which entries in an index point. Index reference
i0037 refers to the string "man and beast" and index reference
i0038 refers to the string "beast, reptile and bird". The two
overlap in containing the same substring "beast". If this were XML and the
markup read:
<index ref="i0037">man and <index ref="i0038">beast</index>, reptile, and bird</index>the markup would be well-formed but the parsing would be incorrect semantically. That is, an XML parser would recognize that there were two
<index> start-tags
and two </index> end-tags, and that each end-tag follows a
start-tag. What an XML parser would fail to recognize, though, would be the overlap:
The XML
tagging above cannot represent overlap, and therefore says that indexed reference
i0038 is
wholly nested inside indexed reference i0037. This is semantically
incorrect.
In TexMECS notation, a normal start-tag looks like <index| and
a normal end-tag looks like |index>. Self-overlap is a type of
concur relationship, and to avoid the misreading of the XML markup described above,
self-overlapping elements must record information that associates start-tags with
the
correct corresponding end-tags, so that, for example, <index~1| and
|index~1> are start- and end-tags for the same instance of an
<index> element. A schema used to author or validate this document
would recognize the two index reference targets (i0037 and
i0038) as instances of an element type <index>, which
is to say that the ~1 and ~2 are attributes
of the element instance, and not part of the element name. The same
relationship might be spelled, in a hypothetical alternative syntax, as
<index id="1" ref="i0037"> for the start-tag and
</index id="1"> for the end-tag.
Attribute grammars enhance a traditional grammar with the ability to perform (typically small) computations. In the case of a CFG or a MCSG, the left-hand side of a production is a single non-terminal. The right-hand side contains an expansion of the left-hand side into a sequence of terminals and non-terminals (in the case of a CFG) or up to two such sequences (MCSG), and with an attribute grammar the items on the right-hand side may be enriched with expressions that compute a value.
For example a schema containing self-overlapping <index> tags
could look like the following:
indexedText = concurOneOrMore { mixed { index* } }
index = element index { attribute ref { text }, text }

A set of grammar rules governing the self-overlapping <index> elements looks like this:

open_index_tag := <index "~" identifier attribute_ref "|";
close_index_tag := |index "~" identifier ">";
identifier := string;
attribute_ref := "ref" "=" string;
indexed_text := interleave_text_index;
interleave_text_index := text; , optional_index;
optional_index := epsilon;
optional_index := open_index_tag concur_close_tag_new_index_tag;
    [concur_close_tag_new_index_tag.identifier = open_index_tag.identifier]
concur_close_tag_new_index_tag := text close_index_tag; , optional_index;
    [close_index_tag.identifier == concur_close_tag_new_index_tag.identifier]
An attribute on a grammar should not be confused with an attribute in the XML domain. An attribute on a grammar is implemented as an index number on non-terminals on the right-hand side of a production rule. Next to the production rules are extra statements that specify whether attributes should capture incoming data or whether they should inherit data from an earlier non-terminal in the input stream in order to validate the current input token.
Schema compiler
The present report focuses primarily on how to express next-generation features in a schema language. We have developed a prototype implementation of a schema compiler and accompanying parser/validator, the details of which will be the subject of a future publication, but in this section we provide a brief overview of how the schema is compiled (that is, translated) into a grammar and how the grammar can be used to parse and validate an input document.
Our implementation compiles a schema expressed in the syntax described above into a MCSG, which we then parse with a modified LL parser. A standard LL parser has a set of production rules from a CFG and a stack as a memory for expectations. An LL parser also has a parsing table, which is generated from the production rules. In the standard case of a CFG the expectations on the right-hand side of production rules of the grammar have to be matched in order. In the case of a MCSG, however, the content is matched in partial order as we figure out whether an incoming token is allowed at that location in the data stream. For each incoming token an LL parser can:
- Look at the top of the stack and use the table to determine whether a non-terminal / terminal combination is allowed.
- If so, identify the production rules allowed by the table.
- Put the symbols from the right-hand side of the applicable production rule on the stack.
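These three steps can be seen in the main loop of a conventional table-driven LL(1) parser, sketched here in Python for a toy CFG (S -> "a" S | "b"); the partial-order extension for a MCSG is not shown:

```python
# Sketch: the stack-and-table mechanics of a standard LL(1) parser on a
# toy grammar. The table maps (non-terminal, lookahead token) to the
# right-hand side of the production to apply.

def ll_parse(tokens, table, start):
    stack = [start]  # expectations, top of stack last
    pos = 0
    while stack:
        top = stack.pop()
        tok = tokens[pos] if pos < len(tokens) else None
        if top == tok:              # terminal on top: must match the input
            pos += 1
        elif (top, tok) in table:   # non-terminal: consult the parsing table
            # push the chosen right-hand side, reversed so that the
            # leftmost symbol is expected first
            stack.extend(reversed(table[(top, tok)]))
        else:
            return False
    return pos == len(tokens)

# Parsing table for S -> "a" S | "b"
table = {("S", "a"): ["a", "S"], ("S", "b"): ["b"]}
print(ll_parse(["a", "a", "b"], table, "S"))  # True
print(ll_parse(["a", "a"], table, "S"))       # False
```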
In a schema the number of production rules is relatively low compared to the number of documents and the size of each document (both theoretically unlimited) that a user might want to validate. This means that a schema compiler is able to improve efficiency by shifting the performance bottleneck away from the (potentially large) number of input tokens and onto the (relatively small) number of production rules in the schema.
Conclusion
This paper introduces a new method to model, parse, and validate complex, next-generation
textual features such as overlap (including self-overlap), discontinuity, and partially
ordered and unordered content. It builds on earlier work that forms part of Relax
NG (compact
syntax, the Interleave operator), Creole (the Concur and
Partition operators), and the MLCD/MECS project (discontinuity). Where Relax NG
and Creole use parsing with Brzozowski derivatives, our grammar-based technique relies
on a
mildly context-sensitive grammar combined with an attribute grammar. We have developed
a
prototype implementation that reads in a schema, compiles it into a mildly context-sensitive
grammar, and then uses the grammar to validate markup files that contain overlapping
and
discontinuous markup. By generating a parser based on the schema, the cost of validation
per
token is kept low. The presence of overlapping or discontinuous markup in the source
files
does not cause ambiguity and thus does not negatively affect the performance of the
validator.
In this article we have responded to the challenges we set out to solve, namely: 1) to propose a schema language that can model complex, next-generation text features such as overlap (including self-overlap), discontinuity, and unordered and partially ordered content; 2) to propose a grammar technology that is able to express those features; and 3) to parse and validate document instances against a schema in a consistent and performant manner by compiling the schema into a mildly context-sensitive grammar and a parsing table before the actual validation starts.
References
[BNF] Janikow, Cezary Z. What is BNF?
http://www.cs.umsl.edu/~janikow/cs4280/bnf.pdf
[Cameron 1993] Cameron, Robert D. Extending
context-free grammars with permutation phrases.
ACM letters on programming
languages and systems vol 2, no. 1–4 (March–December 1993), pp.
85–94. doi:https://doi.org/10.1145/176454.176490
[Chomsky 1956] Chomsky, Noam. Three models for
the description of language.
IRE transactions on information theory, vol. 2, no. 3, pp. 113-124,
September 1956, doi:https://doi.org/10.1109/TIT.1956.1056813
[Clark 2002, Algorithm] Clark, James. An
algorithm for RELAX NG validation.
2002.
https://relaxng.org/jclark/derivative.html
[Clark 2002, Compact] Clark, James. RELAX NG Compact Syntax. Committee Specification 21 November 2002. https://relaxng.org/compact-20021121.html
[Clark, Design] Clark, James. The design of Relax
NG.
n.d. https://relaxng.org/jclark/design.html
[EBNF] ISO/IEC 14977:1996 Information technology —
Syntactic metalanguage — Extended BNF.
December 1996.
https://www.iso.org/standard/26153.html
[Huitfeldt and Sperberg-McQueen 2001] Huitfeldt,
Claus and C. M. Sperberg-McQueen. TexMECS. An experimental markup meta-language for
complex documents.
January 25, 2001, rev. October 5, 2003.
http://mlcd.blackmesatech.com/mlcd/2003/Papers/texmecs.html
[Knuth 1990] Knuth, Donald E. The genesis of attribute
grammars.
1990. In: Attribute grammars and their applications, ed. P. Deransart and M. Jourdan. Vol. 461. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, pp. 1–12. doi:https://doi.org/10.1007/3-540-53101-7_1. https://link.springer.com/chapter/10.1007/3-540-53101-7_1 (visited on 2024-04-12).
[Luminescent] Piez, Wendell. Luminescent: a
LMNL syntax parsing pipeline in XSLT.
https://github.com/wendellpiez/Luminescent
[Manguinhas 2009] Manguinhas, Hugo.
Schema Languages for XML.
Actas do INForum-Simpósio de Informática 2009, pp.299–310.
[Murata et al. 2005] Murata, Makoto, Dongwon Lee, Murali Mani, and Kohsuke Kawaguchi. Taxonomy of XML schema languages using formal
language theory.
ACM Transactions on Internet Technology (TOIT), Volume 5, Issue 4, 2005,
pp. 660–704. doi:https://doi.org/10.1145/1111627.1111631. Available at
https://pike.psu.edu/publications/toit05.pdf. An earlier version by the first
three authors, presented at Extreme Markup Languages® 2000, is available at
https://cs.brown.edu/courses/cs196-9/murata00taxonomy.pdf
[Piez 2008] Piez, Wendell. LMNL in miniature. An
introduction.
Amsterdam Goddag Workshop, 1–5 December 2008.
http://piez.org/wendell/LMNL/Amsterdam2008/presentation-slides.html
[SAEAG] Barrellon, Vincent, Pierre-Edouard Portier,
Sylvie Calabretto, Olivier Ferret. Schema-aware Extended Annotation Graphs.
DocEng '16: Proceedings of the 2016 ACM Symposium on Document
Engineering, pp. 45–54.
doi:https://doi.org/10.1145/2960811.2960816
[Siegel 2022] Siegel, Erik. 2022. Schematron. A language for validating XML. Denver: XML Press.
[Sperberg-McQueen 2018] Sperberg-McQueen, C. M.
Representing concurrent document structures using Trojan Horse markup.
Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August
3, 2018.
In Proceedings of Balisage: The Markup Conference 2018. Balisage Series
on Markup Technologies, vol. 21 (2018).
doi:https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01
[Sperberg-McQueen and Huitfeldt 2004] Sperberg-McQueen,
C. M. and Claus Huitfeldt. GODDAG: A data structure for overlapping
hierarchies.
Springer-Verlag (2004). Preprint at
http://cmsmcq.com/2000/poddp2000.html
[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C. M., and Claus Huitfeldt. Markup discontinued: discontinuity in
TexMecs, Goddag structures, and rabbit/duck grammars.
Presented at Balisage: The
Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of
Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies,
vol. 1 (2008).
doi:https://doi.org/10.4242/BalisageVol1.Sperberg-McQueen01
[SQF] Schematron Quick Fixes (SQF). https://www.oxygenxml.com/doc/ug-editor/topics/schematron-quick-fixes.html
[TagML 2017] Haentjens Dekker, Ronald, and David J.
Birnbaum. It's more than just overlap: Text As Graph.
Presented at Balisage:
The Markup Conference 2017, Washington, DC, August 1–4, 2017. In Proceedings of
Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies,
vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Dekker01
[TagML 2018] Haentjens Dekker, Ronald, Elli Bleeker,
Bram Buitendijk, Astrid Kulsdom and David J. Birnbaum. TAGML: A markup language of many
dimensions.
Presented at Balisage: The Markup Conference 2018, Washington, DC, July
31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference
2018. Balisage Series on Markup Technologies, vol. 21 (2018).
doi:https://doi.org/10.4242/BalisageVol21.HaentjensDekker01
[TEI Gentle introduction] A gentle introduction to
XML.
Chapter v. of the Text Encoding Initiative Guidelines, P5 Version 4.9.0. Last
updated on 24th January 2025, revision f73186978.
https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
[Tennison 2007] Tennison, Jeni. Creole: validating
overlapping markup.
XTech 2007.
https://www.princexml.com/howcome/2007/xtech/papers/output/0077-30.pdf
[Tennison 2009] Tennison, Jeni. Overlap,
containment and dominance.
Jeni’s musings, 2008-12-06.
http://www.jenitennison.com/2008/12/06/overlap-containment-and-dominance.html
[Tennison and Piez 2002] Tennison, Jeni and Wendell Piez.
2002. The Layered Markup and Annotation Language (LMNL).
Presented at Extreme
Markup Languages® 2002 (Montréal, Canada).
[Usdin 2005] Usdin, B. Tommie. W3C XML Schema; Relax NG; or DTD:
How’s a user to choose?
2005.
https://www.mulberrytech.com/papers/whichschema/index.html
[van der Vlist 2004] van der Vlist, Eric. 2004. Relax NG. Sebastopol, CA: O’Reilly.
[XDM] XQuery and XPath Data Model. 3.1 W3C
Recommendation 21 March 2017.
Norman Walsh, John Snelson, and Andrew Coleman, ed.
https://www.w3.org/TR/xpath-datamodel-31
[XML 1.0] Extensible Markup Language (XML) 1.0
(Fifth Edition). W3C Recommendation 26 November 2008.
Tim Bray et al., ed.
https://www.w3.org/TR/xml/
[XML Namespaces] Namespaces in XML 1.0
(Third Edition). W3C Recommendation 8 December 2009.
Tim Bray et al., ed.
https://www.w3.org/TR/xml-names/
[XML Schema 1.1] XML schema. https://www.w3.org/XML/Schema
[1] The four authors contributed as follows to the research described here:
-
Ronald Haentjens Dekker: Lead researcher and developer, also contributed to the written report.
-
Bram Buitendijk: Developer, also contributed to the written report.
-
Joris J. von Zundert: Lead author of the written report.
-
David J. Birnbaum: Author and editor of the written report.
[2] Sperberg-McQueen and Huitfeldt 2008 offer the following paragraph from Lewis Carroll’s Alice in Wonderland as an example of Discontinuity:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it,
“and what is the use of a book,” thought Alice, “without pictures or conversation?”
[3] A commonly cited type of Overlap is multiple concurrent hierarchies, that is, mutually independent fully-tessellated structures that can be erected over a shared character stream. A canonical example of multiple concurrent hierarchies is a bound edition of a novel, which might be expressed as a logical hierarchy (e.g., chapters, paragraphs) and an overlapping physical hierarchy (e.g., pages, lines).
Overlap may also be occasional, that is, may not entail hierarchies that fully model an entire document. For example, an annotation system might allow readers to add occasional markup elements to a document that tag arbitrary strings of text without regard to existing markup boundaries.
[4] We use the term non-linearity to refer to text components that may not coexist synchronously with one another, such as a representation that combines different diachronic versions of a text within a single document. For example, an editorial intervention may transform an earlier version of a text to a later version without any real text containing simultaneously the earlier and the later readings.
[5] For example, consider an alignment of textual witnesses that differ only in the
order of the tokens, e.g., Romeo and Juliet vs Juliet and Romeo. The text might be
modeled as a graph with edges that connect the tokens in the order in which they
appear in the witnesses, so that a collation would contain three shared tokens
ordered in two different ways. This is different from simple variation (e.g.,
Romeo and Juliet vs Laurel and Hardy) because the text tokens that make up the
content do not vary, but their order does.
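The multi-ordered model described in the preceding note can be sketched as a small Python example (our own illustration, not part of any existing implementation): three shared token nodes with one set of ordering edges per witness.

```python
# Hypothetical sketch: multi-ordered content as a directed graph.
# Three shared token nodes; each witness contributes its own ordering edges.
tokens = {"t1": "Romeo", "t2": "and", "t3": "Juliet"}

edges = {
    "witness-A": [("t1", "t2"), ("t2", "t3")],  # reads "Romeo and Juliet"
    "witness-B": [("t3", "t2"), ("t2", "t1")],  # reads "Juliet and Romeo"
}

def reading(witness):
    """Follow one witness's edges from its start token to recover its text."""
    order = dict(edges[witness])
    targets = set(order.values())
    # The start token is the one with no incoming edge in this witness.
    start = next(t for t in tokens if t not in targets)
    out, cur = [tokens[start]], start
    while cur in order:
        cur = order[cur]
        out.append(tokens[cur])
    return " ".join(out)
```

The collation stores each token exactly once; only the edge sets differ, which is precisely what distinguishes multi-ordered content from simple variation.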
[6] For example, textual variation can be recorded in the TEI as
<rdg> children of an <app> element, and those child elements have an order that
is informational in XML whether we care about that order or not. The editor’s
choice of an encoding order may be meaningful (e.g., grouping or prioritizing
witnesses to express a theory of the transmission of the text), but even when
the user does not wish to assert an order, XML syntax is not designed to
represent unordered sibling elements. Users can choose to ignore the order of
sibling elements during processing, but two XML elements with the same child
elements in different orders are not deep-equal, unlike, for example, two XML
elements with attributes recorded in different orders.
Because informationally unordered sibling elements are not part of the data model that XML syntax expresses, ignoring sibling order in XML ultimately becomes the responsibility of processing applications. Locating that responsibility at the processing level is undesirable because it means that each individual processing application must accommodate it. A data model that can represent and distinguish both ordered and unordered siblings, on the other hand, could make the distinction available to multiple processing operations while expressing it only once, as part of the markup. More generally, inherent properties of the data are best represented and expressed at the level of the data, and not of applications that process the data.
What we mean by unordered content, then, is the ability to distinguish, at the level of the model and not only during processing, whether sibling items of arbitrary complexity do or do not have an order that is informational.
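The deep-equality point in the preceding note can be illustrated with a short Python sketch (our own illustration; the order-insensitive comparison is not part of any XML specification): standard serialization distinguishes sibling order, while a canonical form built on multisets of children does not. For brevity the sketch ignores tail text.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def canonical(e):
    """Reduce an element to a hashable form in which sibling order is
    deliberately ignored: children become a multiset rather than a sequence."""
    kids = frozenset(Counter(canonical(c) for c in e).items())
    return (e.tag, frozenset(e.attrib.items()), (e.text or "").strip(), kids)

# Two <app> elements whose <rdg> children appear in different orders.
a = ET.fromstring('<app><rdg wit="A">x</rdg><rdg wit="B">y</rdg></app>')
b = ET.fromstring('<app><rdg wit="B">y</rdg><rdg wit="A">x</rdg></app>')
```

Here `a` and `b` serialize differently (sibling order is part of the XML data model) but compare equal under `canonical`, which treats children the way XML already treats attributes.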
[7] Tokenization can be performed at different levels of granularity. For example, given the following small but complete XML document:
<p month-name="November" month-ordinal="11">In the eleventh month … </p>
a simple tokenization might identify three tokens: start-tag, content, and end-tag. In this case, though, the start-tag includes attributes, and an alternative, more granular, tokenization might recognize the start-tag as a sequence of multiple tokens: start-tag-opener, element name, whitespace, attribute name, equal sign, quotation mark, attribute value, quotation mark, whitespace, attribute name, equal sign, quotation mark, attribute value, quotation mark, start-tag-closer. Different degrees of tokenization might be appropriate for different purposes.
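The two granularities described above can be sketched in a few lines of Python (an illustration of the idea only; the regular expressions are ad hoc and far short of a real XML tokenizer):

```python
import re

doc = '<p month-name="November" month-ordinal="11">In the eleventh month … </p>'

# Coarse tokenization: start-tag, content, end-tag.
coarse = re.findall(r'<[^/>]+>|</[^>]+>|[^<]+', doc)

# Finer tokenization of the start-tag alone: tag opener and closer, names,
# attribute values, and the delimiters between them.
fine = re.findall(r'<|>|=|"|/|[\w.:-]+|\s+', coarse[0])
```

The coarse pass yields three tokens; the fine pass decomposes the first of them into fifteen, matching the inventory listed in the note.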
[8] In some authoring environments the extent to which a chosen schema can be considered domain specific may be debated. For example, the Balisage schema is a customization of Docbook that is specific to the domain of Balisage papers. Users might customize the schema further, whether by modifying the grammar-based schema or layering Schematron rules that are specific to the individual paper on top of the general Balisage schema. Domains, then, may have subdomains.
Domain-specific rules are sometimes called business rules (e.g., Siegel 2022, pp. 22–24). We do not find this term helpful for at least two reasons:
-
The word business has many meanings, including whatever a person is (properly) interested in (as in the phrase Mind your own business!), and in that sense it begins to overlap with the concept of domain-specific. Nonetheless, the primary semantic domain of the word business is commerce, which is not intrinsic to the more important domain-specific meaning.
-
As explained above, domains may have subdomains, and the term business rule does not clarify what does or does not count as a business rule. If we take the Balisage conference schema as an example, once we leave the domain of well-formedness, 1) choosing Docbook as a foundation instead of another schema type (e.g., TEI Lite) reflects a domain-driven decision; 2) the customizations of Docbook for Balisage-specific purposes reflect domain-driven decisions; and 3) further modifications by individual authors to accommodate their individual requirements reflect still other domain-driven decisions. All three of these levels are accommodations to domain-specific needs, with progressively narrower domains. It is not clear, at least to us, how the term business rule is useful for elucidating this tiered model.
[9] DTDs date to the days of SGML (ISO specification 1986), although XML DTDs (W3C specification 1998) differ in several ways from SGML DTDs. Relax NG dates to 2001 (OASIS specification), as does XML Schema (W3C specification).
[10] For context-free and other types of grammars see Chomsky 1956. The situation with Relax NG is more complicated, as we discuss below. See Murata et al. 2005, Clark 2002, Algorithm, Clark, Design.
[11] Concerning TagML see TagML 2017 and TagML 2018.
Our goal of supporting markup languages other than TagML continues one of the goals
of
Creole, about which the designer writes that Although designed to be used with
LMNL, Creole is data-model agnostic
(Tennison 2007, p. 26).
[12] Grammars allow multiple productions with the same LHS, so instead of a single
production like A = B | C, which is allowed in a Relax NG compact-syntax
schema but not in a grammar, a grammar might contain two productions: A = B
and A = C. That is, a grammar, like a Relax NG compact-syntax schema,
allows the concept of alternation, but expresses it through
multiple linear productions instead of by means of an alternation operator.
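The rewriting described in the preceding note is mechanical, as a small Python sketch (our own illustration, with a toy string representation of productions) shows:

```python
# Hypothetical sketch: rewriting a schema-style production that uses an
# alternation operator into plain grammar productions, one per alternative.
def expand_alternation(lhs, rhs):
    """Split a RHS such as 'B | C' into separate productions (A, B) and (A, C)."""
    return [(lhs, alt.strip()) for alt in rhs.split("|")]
```

A single schema rule `A = B | C` thus becomes the two linear productions `A = B` and `A = C`, with no loss of expressive power.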
[13] By ultimately
we mean that the definition (RHS) may include both
terminals and non-terminals, but any non-terminals must themselves be defined elsewhere
in the grammar. This means that if we replace every LHS by the corresponding RHS
recursively, we eventually wind up with only terminals. A document instance to
be
parsed or validated against a schema contains only terminals, which means that
ultimately a parser must determine whether the inventory and arrangement of terminals
in
a document instance are consistent with the inventory and arrangement of terminals
permitted, once all non-terminals have been expanded, by the grammar.
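The recursive replacement of LHS by RHS can be sketched in Python (our own illustration; the toy grammar and the convention that only names defined in the grammar count as non-terminals are assumptions for the example):

```python
# Hypothetical sketch: a toy grammar in which uppercase names are defined
# non-terminals and anything not defined in the grammar is a terminal.
grammar = {
    "DOC": ["TITLE", "body"],
    "TITLE": ["title"],
}

def expand(symbol):
    """Expand a symbol to the sequence of terminals it ultimately derives."""
    if symbol not in grammar:      # terminals have no definition to expand
        return [symbol]
    out = []
    for s in grammar[symbol]:
        out.extend(expand(s))
    return out
```

Expanding `DOC` bottoms out in the terminal sequence `title body`, which is the inventory and arrangement against which a document instance would be checked.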
[14] This requirement is to be expected because the point of a production is to define the LHS in terms of the RHS. Terminals, by definition, do not need to be defined, which means that there is no need for productions where the LHS consists only of a terminal.
[15] While a CFG has no access to history, it is allowed to look ahead, although the smaller the lookahead, the more efficient the grammar.
[16] See also Cameron 1993 for a discussion of extending CFGs with permutation phrases.
[17] For a comparative description of the three see Usdin 2005.
We exclude Schematron from this discussion because Schematron is a rule-based, rather than grammar-based, language, and we describe below why we favor a grammar-based approach. We also exclude TEI ODD, a higher-level language that, among other things, integrates into the syntactic modeling a class system (which can be used to model semantic relationships) and a literate-programming approach to documentation. Nothing about the schema language and associated grammar that we propose here precludes subsequently integrating Schematron-like or ODD-like methods, but those features are out of scope at this time.
[18] XML Schema and Relax NG support rich datatyping, while DTDs do not. We exclude datatyping from our discussion because it does not interact directly with the structural features on which we concentrate.
[19] The <xs:openContent mode="interleave"> structure in
XML Schema 1.1 matches some, but not all, of the features of Relax NG Interleave.
See the more detailed discussion of Interleave, below.
[20] We use the terms start-marker and end-marker to refer to coordinated milestone (self-closing, empty) elements that serve to mark the paired start and end positions of ranges within a LMNL or XML document.
[21] For example:
-
There is no natural way to say in Relax NG that the values of an
@n attribute on list items must reset to 1 with every new list and increase by increments of 1 in consecutive items within the same list.
-
There is no natural way to say in Relax NG that
@ref attributes on references to persons must point to (agree with) matching @xml:id values in a master list of persons, while @ref attributes on references to places must point to matching @xml:id values in a master list of places. This is also the case if the master lists are in separate documents.
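The first of these constraints is easy to express procedurally, as Schematron or application code might. A Python sketch (with hypothetical sample documents of our own devising) of the @n check:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: @n must restart at 1 in every <list>
# and increase by 1 in consecutive <item> siblings.
doc = ET.fromstring(
    '<text>'
    '<list><item n="1"/><item n="2"/></list>'
    '<list><item n="1"/><item n="2"/><item n="3"/></list>'
    '</text>')
bad = ET.fromstring('<text><list><item n="2"/><item n="1"/></list></text>')

def n_values_ok(root):
    """Check that @n values in each list equal 1, 2, ..., len(items)."""
    return all(
        [int(i.get("n")) for i in lst.findall("item")]
        == list(range(1, 1 + len(lst.findall("item"))))
        for lst in root.iter("list")
    )
```

The point of the note stands: this kind of counting constraint is natural in procedural or rule-based validation but has no natural grammar-based expression in Relax NG.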
[23] See XDM, §6.7.1. If we create XML that includes a text node that consists of zero characters, it is removed. For example, the following XSLT:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
version="3.0">
<xsl:variable name="stuff">
<root>
<e>Hi</e>
<xsl:text></xsl:text>
<e>Bye</e>
</root>
</xsl:variable>
<xsl:template name="xsl:initial-template">
<xsl:value-of select="count($stuff/node()/node())"/>
</xsl:template>
</xsl:stylesheet>
returns 2 as the count of child nodes of the <root> element.
[25] Tessellated refers to structures that fully occupy their domains with no gaps and no overlaps. For example, the part of a book inside the binding might be said to consist of a tessellated sequence of pages, which means that 1) the first page begins at the very beginning of the contents within the binding, 2) each new page begins exactly after the end of the immediately preceding page, and 3) the last page ends at the very end of the contents within the binding. That is, there is no part of the contents within the binding that is not part of exactly one page.
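The three conditions in the preceding note translate directly into a check over spans, sketched here in Python (our own illustration, modeling pages as hypothetical half-open character-offset ranges):

```python
def tessellates(spans, start, end):
    """True iff the (start, end) spans fully cover [start, end)
    with no gaps and no overlaps."""
    spans = sorted(spans)
    if not spans or spans[0][0] != start or spans[-1][1] != end:
        return False  # conditions 1 and 3: flush with both boundaries
    # condition 2: each span begins exactly where its predecessor ends
    return all(a[1] == b[0] for a, b in zip(spans, spans[1:]))

# Pages as offset ranges over contents occupying offsets 0-300.
pages = [(0, 100), (100, 220), (220, 300)]
```

A gap (e.g., nothing covering offsets 100-120) or an overlap (two pages both claiming offset 50) each violates exactly one of the three conditions.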
[26] Index references, then, are not hierarchical and not tessellated for two reasons:
-
There are text tokens that are not part of any index reference.
-
There are text tokens that are part of more than one index reference because index references may overlap with other index references.
[27] With respect to schema syntax, both Concur and
Interleave can be applied to two or more operands, although when the schema
is compiled into a grammar both Interleave and
Concur take only two operands. That is, schema productions with more than
two operands are translated during compilation into associative grammar productions.
For
example, a ternary schema Interleave or Concur relationship of
A, B, and C might be compiled into a binary
grammar relationship of A and B, the value of which then
enters into a binary grammar relationship with C. This is similar to the
way a ternary choice between A, B, and C in a
schema production like A | B | C might be compiled, at the grammar level,
to a choice between A and B, the value of which then enters
into a choice with C. Interleave, Concur, and
Choice relationships in schema rules are all both associative and
commutative.
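The compilation of an n-ary schema operator into nested binary grammar relationships described above is a left fold, sketched here in Python (our own illustration, using tuples as a hypothetical representation of grammar nodes):

```python
from functools import reduce

# Hypothetical sketch: compiling an n-ary Interleave, Concur, or Choice
# into nested binary relationships by folding from the left.
def to_binary(op, operands):
    """('interleave', ['A','B','C']) -> ('interleave',
    ('interleave', 'A', 'B'), 'C')"""
    return reduce(lambda acc, x: (op, acc, x), operands)
```

Because the operators are associative and commutative, any bracketing (or operand order) yields an equivalent grammar; the left fold is simply a convenient canonical choice.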
[28] At least in some positions, whether whitespace in XML is significant or
insignificant is widely regarded as difficult to recognize and therefore a common
cause
of error during processing. We would expect the distinction of significant vs
insignificant whitespace to vary among next-generation markup languages. In the example
below we might prefer to insert newlines and space characters after the
<root> start-tag and a newline before the end-tags, but we
could do that only if the specific markup language unambiguously recognizes those
whitespace characters as insignificant.
[29] The spellings start-tag (instead of, for example, start tag, open-tag, etc.), end-tag, and empty-element tag are taken directly from XML 1.0. The spellings suspend-tag and resume-tag are taken directly from Sperberg-McQueen and Huitfeldt 2008.
[30] We’ve pretty-printed the example to improve the legibility, but in practice a markup language would have to specify how to distinguish significant from insignificant whitespace.
[31] In the example above, <verse> elements are part of both
concurrent hierarchies because both <chapter> elements and
<section> elements have <verse>
children. Furthermore, this schema requires that every
<verse> element be simultaneously a child of exactly one
<chapter> element and exactly one
<section> element. If, though, the models for
<chapter> and <section> involved
mixed content, that is, mixed { verse* }, it would be possible for a
<verse> in one hierarchy to correspond to plain text in the
other, in which case the <verse> tags at that location
would be part of only one hierarchy, and therefore not aligned. Whether the
<verse> tags are visible in a hierarchy must be resolved
for matching start- and end-tags, which is easy to validate for each hierarchy in
isolation (recall that each operand of a concur relationship, considered in
isolation, is fully ordered), but determining whether a matching start- and end-tag pair
in this mixed-content situation is part of a particular hierarchy is not trivial.
We
leave this edge case aside for the moment, while recognizing that it remains part
of
a full solution to the task.
[32] The dynamic programming example, with the matrix, is used for illustration; it is
not how the validation is implemented in the code. Among other things, in this example
both <chapter> elements and <section>
elements are repeatable, which means that we cannot construct a useful matrix in advance
based only on the schema rules. As a simplification, our sample matrix includes two
<chapter> elements and one <section>
element because that is what happens to occur in this particular input example.
[33] This requirement ensures that start- and end-tags will be balanced.
[34] Next-generation markup languages may distinguish two ways in which one element may be inside another: containment and dominance. See Tennison 2009 for more information.
[35] For more information about attribute grammars see Knuth 1990.