Stührenberg, Maik, and Christian Wurm. “Refining the Taxonomy of XML Schema Languages. A New Approach for Categorizing XML Schema Languages in Terms of Processing Complexity.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). https://doi.org/10.4242/BalisageVol5.Stuhrenberg01.
Balisage Paper: Refining the Taxonomy of XML Schema Languages. A New Approach for Categorizing XML Schema Languages in Terms of Processing Complexity
Maik Stührenberg
Maik Stührenberg studied Computational Linguistics at Bielefeld University. After working for four years as a research assistant at Giessen University in different text-technological projects, he is now a Ph.D. student and research assistant at Bielefeld University. His main research interests include XML schema languages and specifications for structuring and querying multi-dimensional annotated data.
Christian Wurm
Christian Wurm is a Ph.D. student in Computational and Mathematical Linguistics at the Cognitive Interaction Technology – Center of Excellence (CITEC), Bielefeld University.
This paper presents a refined taxonomy of XML schema languages based on the work by Murata et al., 2005. It can be seen as a first building block for a more elaborate formal analysis of XML and its accompanying specifications, in this case XML schema languages such as DTD, XSD and RELAX NG.
The authors would like to thank both reviewers for their constructive comments and our colleagues Marcus Kracht and Jens Michaelis, who provided additional insightful remarks.
Introduction
In this paper, we continue the fruitful research that has been performed on XML and formal language theory (see section “Formal Language Theory and XML”). As there is a close correspondence between schema languages and a hierarchy of tree languages discussed by Murata et al., 2005, to which we will refer as the Murata hierarchy, formal language theory has been very useful for determining and describing the expressiveness and computability of different XML schema languages. From the point of view of formal language theory (ignoring things such as user-friendliness or software support, amongst others), the question of which grammar formalism is most apt for defining a document grammar for a given XML markup language is determined by a trade-off between expressiveness on the one side and processing complexity on the other. The more expressive a grammar formalism is, the more resources we need for processing the corresponding languages, and the more likely they are to fall prey to ambiguity.
Expressiveness thus always comes at a cost; it is, however, not always quite clear at which cost. Making this clearer is one goal of this paper. We will look at some well-known classes of formal grammars/languages that are relevant for XML, and scrutinize their expressivity and processing cost. Pursuing this approach, we will see that there are some other interesting and relevant classes, which, to the best of our knowledge, have not been formally established yet, and which we will define in the sequel. These classes partly refine, partly complement the hierarchy of Murata et al., 2005, and, as we will see, are partly already tacitly in use. Our main focus will not be on (non-)determinism in content models, that is, on properties of regular expressions; rather, we will focus on determinism in tree structures, and try to formally clarify the relation between determinism and expressivity, as well as between locally and globally ambiguous grammars/trees. Along the way, we provide some new results on the problem of ambiguity, not only of grammars but also of languages: as we will see, there are languages for which there is no unambiguous grammar, and, as yet, no class of grammars which generates all and only the languages for which there is an unambiguous grammar.
The history of document grammars begins long before XML's success: the first published formal document type descriptions can be found in Goldfarb, 1978, while SGML Document Type Definitions (DTD, SGML) were established in 1986, followed by XML's DTD (XML 1.0) and other XML schema languages that were created during XML's ongoing success. The time line begins even earlier, in 1955, when Noam Chomsky published his theory of formal grammar (Chomsky, 1955, Chomsky, 1956).
One of the several benefits of XML markup languages is the possibility to use a schema
(a
document grammar) to assure not only well-formedness, but
also validity of an instance of a given markup language. The
XML specification defines well-formedness as follows:
A textual object is a well-formed XML document if:
Taken as a whole, it matches the production labeled document.
It meets all the well-formedness constraints given in this specification.
Each of the parsed entities which is referenced directly or indirectly within the
document is well-formed.
A valid instance in addition declares conformance and actually conforms to the
rules of a schema of a given markup language. Or, in a more general way:
The intention or purpose of validation is to subject a document or data set to a test,
to determine whether it conforms to a given set of external criteria.
A validating parser takes an instance document (and the corresponding schema) as
input and produces a validation report, which includes at least a return code reporting
whether [t]he document is valid and an optional Post Schema Validation Infoset (PSVI),
updating the original document's infoset (the information obtained from the XML document
by
the parser) with additional information (default values, datatypes, etc.) (van der Vlist, 2001). During this process different levels of validation may be
checked, depending on the XML schema language used: validation of the instance's structure
(i.e., the markup), datatyping (i.e., the content of individual leaf nodes), integrity
(in
terms of links, either between nodes within a document or between documents). In addition,
other tests, usually called business rules, may apply as well (see van der Vlist, 2001). While one may differentiate between the terms valid (as defined by the XML specification and therefore only referring to validity according to a Document Type Definition) and schema-valid (i.e., valid according to one of the externally defined XML schema languages), we will use the former term throughout this paper as a general term for conformance to a given set of external criteria, regardless of the schema language used. Furthermore, we will only discuss validation mechanisms based on schema languages; therefore, technologies such as the Content Assembly Mechanism (CAM, see Carey, 2009) or meta-validation techniques such as the Namespace-based Validation Dispatching Language (NVDL) are not considered any further. Since we will focus on grammar-based schema languages, rule-based[1] (or constraint-based) schema languages such as Schematron, the Constraint Language in XML (CLiX, Nentwich, 2005) or DSD2 will not be considered either.[2] For clarification, we follow the definitions of Costello and Simmons, 2008:
A grammar-based schema language specifies the structure and contents of elements and
attributes in an XML instance document. For example, a grammar-based schema language
can
specify the presence and order of elements in an XML instance document, the number
of
occurrences of each element, and the contents and datatype of each element and attribute.
A rule-based schema language specifies the relationships that must hold between the
elements and attributes in an XML instance document. For example, a rule-based schema
language can specify that the value of certain elements must conform to a rule or
algorithm.
When someone starts developing a new XML-based markup language, sooner or later the question of a formalism for defining the corresponding document grammar arises, since there is
there is
a variety of schema (definition) languages available. While a schema can be considered
as a
formal definition of a grammar of the XML-based markup language (e.g. as a set of
rules or
criteria), the schema language is a formal language for expressing schemas
(Møller and Schwartzbach, 2006). Usually, choosing a schema language depends on several
factors such as familiarity with a given formalism or support provided by the chosen
authoring
software or processing tools such as XSLT 2.0 or XQuery 1.0.
These factors are very specific to one's own needs and environment, and we will not give any advice regarding these topics. What we want to demonstrate in this paper, however, are the differences in terms of expressiveness and computability between the three most widely used XML schema languages: XML's inherent Document Type Definition (DTD, see XML 1.0), which is based on SGML's DTD (see SGML, Goldfarb, 1991, Maler and Andaloussi, 1995) and uses a non-XML syntax[3]; W3C's XML Schema Description Language (XML Schema 1.0 Part 0, XML Schema 1.0 Part 1, XML Schema 1.0 Part 2); and the formal language theory based RELAX NG (see RELAX NG, van der Vlist, 2003, RELAX NG (2nd Ed.)), a successor to both RELAX (RELAX Core) and TREX (Clark, 2001).
Typically, DTDs are characterized as extended context-free grammars (see Hopcroft et al., 2000 and Rizzi, 2001), that is, regular expressions are allowed on the right-hand side of a production rule. For the declaration of an element this means that its allowed content is described by a regular expression over other element names (i.e., referring to other or the very same globally declared elements) or reserved keywords such as #PCDATA or EMPTY. In current work, especially Murata et al., 2005 and Møller and Schwartzbach, 2006, a family of tree grammars is used to model XML schema languages; for example, DTDs are defined as local tree grammars, which can be considered strongly equivalent to CFGs, with the only difference that they allow non-finitary branching:
Ignoring the attributes for a moment, there is a simple but elegant connection between
DTDs and context-free grammars, namely, each DTD corresponds to an extended context-free grammar, where productions may have regular
expressions on their right-hand side. Then, an XML document is valid with respect
to the
DTD precisely when its associated tree is a correct derivation tree for that
grammar.
In addition to using local tree grammars for characterizing DTDs, Murata et al., 2005 construct a taxonomy of XML schema languages. The authors introduce single-type tree grammars as characterizing XML Schema, and use regular tree grammars to characterize RELAX NG. Although this work is quite extensive, the formal analysis can be further improved by clarifying some propositions. Given the (still growing) importance of XML and the broad range of tasks it is used for, a stronger theoretical background seems to be the best way to find new applications. Before we present our results, we have to introduce the formal concepts we are working with.
Refining the Taxonomy of XML Schema Languages
Introduction to the Formal Concepts
In this paper, we will mainly use tree grammars. Since the use of tree grammars is well established in the XML community, we will only briefly provide the necessary definitions; for a more explicit treatment and motivation, we refer the reader to Gécseg and Steinby, 1997 or Murata et al., 2005.
Definition 1 A regular tree grammar (RTG) is a 4-tuple (N,T,S,P), where N is a
finite set of nonterminals, T is a finite set of terminals, S is a set of start symbols,
which form a subset of N, P is a set of production rules, which have the form: A →
a(r), where A ∈ N, a ∈ T, and r is a regular expression over elements of N. We
call A the left hand side of a rule, a the terminal or label which is introduced by
the
rule, and r its content model.
We generally use uppercase letters for nonterminals, and lower-case letters for
terminals. Note that this grammar generates trees, not strings, and that the nonterminals
do
not remain the labels of the (non-leaf) nodes they introduce, but are substituted
by the
terminal labels. The class of languages generated by RTGs is called the regular tree
languages (RTLs). This is the most general class of languages we will consider here;
and we
now introduce various restrictions on this class of grammars, as they are defined
by Murata et al., 2005.
Definition 2 We call two rules of an RTG competing if they introduce the same terminal, but have different left-hand sides. Thus, A → a(r) and B → a(r') are competing.
In general, in an RTG we can merge any two rules which have the same left-hand side and introduce the same terminal, by merging their content models, because for any two regular expressions we can easily form a single expression which denotes the union of both. Therefore, we will generally assume that our grammars contain no two distinct rules with the same left-hand side and the same terminal. As a consequence, the concept of competing rules is the crucial point when we deal with determinism and ambiguity. For the same reason, we can speak of competing nonterminals almost in the same way as of competing rules: competing nonterminals are the left-hand sides of competing rules; a grammar has competing nonterminals exactly if it has competing rules, and exactly as many of them.
Definition 3 A local tree grammar (LTG) is an RTG with no competing rules.
In an LTG, we have thus a one-to-one correspondence of nonterminals and terminals,
which
makes them very similar to context-free grammars (though not identical, since regular
content models allow for non-finitary branching).
Definition 4 A single type tree grammar (STG) is an RTG where competing nonterminals must not occur in the same content model.
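To make Definitions 2, 3 and 4 concrete, the following short Python sketch encodes a regular tree grammar as a list of (nonterminal, terminal, content model) rules, with content models written as regular expressions over single-letter nonterminals, and tests the LTG and STG conditions via competing rules. The encoding, the function names and the toy grammar are our own illustrative assumptions, not an implementation taken from Murata et al., 2005.

from itertools import combinations

# A rule A -> a(r) is encoded as (lhs, terminal, content_model),
# where the content model is a regex over single-letter nonterminals.
def competing_pairs(rules):
    """Pairs of rules introducing the same terminal with different left-hand sides (Definition 2)."""
    return [(r1, r2) for r1, r2 in combinations(rules, 2)
            if r1[1] == r2[1] and r1[0] != r2[0]]

def is_ltg(rules):
    """Definition 3: a local tree grammar has no competing rules."""
    return not competing_pairs(rules)

def is_stg(rules):
    """Definition 4: competing nonterminals never occur in the same content model."""
    for r1, r2 in competing_pairs(rules):
        for _, _, cm in rules:
            if r1[0] in cm and r2[0] in cm:
                return False
    return True

# Toy grammar: S -> s(AB), A -> a(T), B -> a(U), T -> t(), U -> u()
rules = [("S", "s", "AB"), ("A", "a", "T"), ("B", "a", "U"),
         ("T", "t", ""), ("U", "u", "")]
print(is_ltg(rules))  # False: A and B compete (both introduce the terminal a)
print(is_stg(rules))  # False: the competing A and B both occur in the content model "AB"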
As Murata et al., 2005 point out, LTGs roughly correspond to DTDs, STGs correspond to XML Schema, and RTGs correspond to RELAX NG. Note that this correspondence is established by regarding only the "core" syntactic features of the schema languages; that is, XML's inherent reference mechanism is not taken into account. These are thus the most important grammar types for XML. Murata et al., 2005 add yet another type:
Definition 5 A restrained competition grammar (RCG) is an RTG where competing nonterminals must not occur in the same content model with the same prefix of nonterminals; within one and the same rule (identical left-hand side, terminal, and content model), we thus disallow instantiations of the form (Γ A Δ) and (Γ B Δ'), where A and B are competing nonterminals, and where uppercase Greek letters refer to possibly empty sequences of nonterminals.
Notably, the restriction concerns only the left context of the competing nonterminals.
Of course, there exists a parallel definition for the right context. The problem is
that
both definitions lack some generalization, as they both generate different classes
of
languages, and there is no inclusion in either direction. If we, however, generalize
the
restriction of competition to both the left and the right context (which weakens the
overall
restriction on the grammar), some problems arise. We will discuss possible generalizations
later on.
It is easily seen that there is a hierarchy of proper inclusion of the grammar types
presented: LTGs are always STGs, which are always RCGs, which are always RTGs, whereas
the
converse does not hold.
We furthermore define an interpretation of a given tree against a given grammar as
follows:
Definition 6 An interpretation I of a tree t against a grammar G is a mapping from each node of t, denoted by e, to a nonterminal of the grammar, such that
I(e) is a start symbol when e is the root of
t,
for each e and its daughter nodes e0,
e1,...,en, there is a production rule
A → a(r) in G, such that
I(e) is A,
the label of e is a,
I(e0),
I(e1),...,I(en) matches
r.
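As an illustration of Definition 6, the following self-contained Python sketch computes, for every node of a tree, the set of nonterminals under which it can be interpreted, and accepts the tree if the root can be interpreted as a start symbol. Trees are written as (label, children) pairs and content models as Python regular expressions over single-letter nonterminals; this encoding, the brute-force enumeration of child interpretations and the small example grammar are our own simplifying assumptions rather than the validation algorithms of Murata et al., 2005.

import re
from itertools import product

def interpretations(node, rules):
    """Nonterminals A such that some rule A -> label(r) matches the node and its children."""
    label, children = node
    child_sets = [interpretations(c, rules) for c in children]
    result = set()
    for lhs, terminal, cm in rules:
        if terminal != label:
            continue
        # Try every assignment of interpretations to the children (exponential,
        # but sufficient for a small illustration) and match it against cm.
        for assignment in product(*child_sets):
            if re.fullmatch(cm, "".join(assignment)):
                result.add(lhs)
                break
    return result

def valid(tree, rules, start):
    """A tree is valid iff its root can be interpreted as a start symbol."""
    return bool(interpretations(tree, rules) & start)

# Toy grammar: S -> s(A B?), A -> a(), B -> b()
rules = [("S", "s", "AB?"), ("A", "a", ""), ("B", "b", "")]
print(valid(("s", [("a", []), ("b", [])]), rules, {"S"}))  # True
print(valid(("s", [("b", [])]), rules, {"S"}))             # False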
As is easily seen, with this definition, the ease of interpretation directly interacts
with the Murata hierarchy. We will continue in this vein. To keep things clear, it
is
crucial to distinguish between the label of a node and its interpretation: the label
of a
node corresponds to its terminal in the production rule (recall that tree grammars
directly
generate trees, not strings via trees, as CFGs do), and it is immediately visible in the
tree.
By the interpretation of a node in turn we denote the nonterminal by which the node
label
has been produced. This nonterminal is not visible and has to be inferred. In addition,
we
have to distinguish between rule and rule instantiation: since content models are
regular
expressions over nonterminals, they denote sets of sequences of nonterminals, and
one member
of this set is an instantiation. An additional problem arises in interaction with
the fact
that there is no necessary one-to-one correspondence of nonterminals and terminals
(i.e.,
labels); a possible consequence is that the same sequence of labels can be produced
by
different instantiations of a content model (we will exhaustively discuss this source
of
ambiguity later on).
So far, we have introduced the main concepts which are well-known in the literature,
and
which we are going to use and elaborate in this paper.
An Example Grammar
We will use the following example to demonstrate some of our findings. We want to
define
a document grammar for a text. The text may contain an optional title, followed by
either at
least a single section or a single paragraph. An optional author entity (possibly
decoded
using an attribute) may contain information about the text's author. Inside a section
an
optional title followed by other (sub-)sections or paragraphs are allowed. The title
consists of raw text while a paragraph may contain raw text or a reference to other
paragraphs, since these may have an optional identifier (using XML's ID
type).
If we try to express these constraints more formally we might end up with a grammar
similar to the one shown in Figure 1. Again, nonterminals are printed
in capital letters, while node labels or terminals are printed in small letters. Note,
that
in this formulation elements, attributes and raw text are defined as terminals.
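For orientation, one possible rendering of this prose description in the rule notation introduced above is sketched below. It is our own approximation and not a reproduction of Figure 1; in particular, the optional author attribute of the text and the optional identifier of paragraphs are omitted, raw text is abbreviated as #PCDATA, and the exact occurrence indicators are a guess.

TEXT → text(TITLE? (SECTION+ | PARA))
SECTION → section(TITLE? (SECTION | PARA)+)
TITLE → title(#PCDATA)
PARA → para((#PCDATA | XREF)*)
XREF → xref(ε)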
The Example Grammar Realized by a DTD (Local Tree Grammar)
Following Murata et al., 2005, DTDs can be classified as local tree grammars. A possible realization of our example grammar can be seen in Figure 2.
Since local tree grammars and DTDs only support globally declared elements (and locally declared attributes), the content models of the text and the section elements share references to the same three elements (title, section and para) and contain both a sequence and a choice, together with the occurrence indicators + (at least one occurrence) and ? (optional). The content model of the para element contains mixed content, that is, both raw text and the xref element are allowed as children. Since DTDs force mixed content to be declared using the choice group (|) and the trailing asterisk (*) as occurrence indicator, there is no other way to define this specific content model using this schema language.
The Example Grammar Realized by an XSD (Single Type Tree Grammar)
An XML schema description (i.e., a single type tree grammar) of the same document grammar may look like the one in Figure 3.
Note that this XML schema description is only one possible realization out of a variety of different XML schema descriptions that would fit our needs. Although it may not be very human-readable, it was designed to show some features that are supported by XSD. The text element is derived by extension of the globally declared complexType textType, which itself refers to the globally declared model group sectOrPara. The schema contains both locally and globally declared attributes (author vs. id) and elements (xref as an example of a locally declared element). Apart from that, XSD supports XML Namespaces (Third Edition), which are not shown in the example above. As Martens et al., 2006 have already pointed out, the actual extra expressive power of XSDs over DTDs can only be used to a very limited extent due to the Element Declarations Consistent (EDC) constraint (see XML Schema 1.0 Part 1, Section 3.8.6).
The Example Grammar Realized by a RELAX NG Grammar (Regular Tree Grammar)
RELAX NG can be classified as a regular tree grammar according to Murata et al., 2005. A possible realization with RELAX NG is shown in Figure 4.
In contrast to DTD and XSD, RELAX NG is based both on the mathematical theory of regular expressions and on the concept of hedge grammars (van der Vlist, 2003 and Murata et al., 2005). As an XML schema language, RELAX NG has some advantages over other schema languages: while in DTDs and XSD mixed content models may contain child elements and text nodes in any arbitrary order, RELAX NG allows for ordering of the element child nodes (see van der Vlist, 2003, p. 57f.). Co-occurrence constraints can be used to specify the content model of an item according to the value of another item, allowing non-deterministic content models which cannot be realized in DTD or XSD (see van der Vlist, 2003, p. 62f, and section “Determinism, Algorithms and Local Ambiguities” for a discussion). In general, a co-occurrence constraint (or co-constraint, as they are called by Pawson, 2007) may be a constraint over multiple items, not just two, and may exist between XML structure components (elements, attributes) as well as between data values. One may differentiate between element-to-element, element-to-attribute, or attribute-to-attribute co-occurrence constraints, based on the items involved.
In addition, SGML's interleave operator & (see Goldfarb, 1991, p. 291), which is missing in XML DTD and XSD, can be used in RELAX NG as well, although this adds nothing to its expressive power. In contrast to the
to the
two other schema languages discussed in this paper, RELAX NG does not support default
values (which are supported for attributes in DTD and for attributes and elements
in XSD).
While both DTD and XSD support XML references via
ID/IDREF(S) attribute types, RELAX NG has no
included datatype library; however, as seen in the example grammar, it is possible
to
include the datatype library of XML Schema 1.0 Part 2.
The document instance given in Figure 5 would be valid according to all of the document grammars defined above (the example shows validation against the XML schema; adding a Doctype declaration and removing the noNamespaceSchemaLocation attribute of the root element would result in a valid instance according to the DTD). Note that RELAX NG does not provide a standard way to associate a RELAX NG schema with an XML instance, since it was designed as part of the ISO DSDL framework (in this framework, NVDL should be used as the general external mechanism for validating instances).
Determinism and Local Ambiguity
Determinism is an important property for XML documents, schema languages and interpretation. If a grammar is deterministic, parsing will be much more efficient, since in general we do not have to keep any information in mind, do not have to backtrack, and do not need any non-local information in case we search, etc.[4] The concept of determinism is closely related to the concept of local ambiguity:
do not need
any non-local information in case we search, etc.[4] The concept of determinism is closely related to the concept of local ambiguity:
if there is no local ambiguity, then at every point in the parsing process we know
the
structure (in this case: interpretation) of what we have seen so far, because there
is
only one possible local analysis. If there is some local ambiguity, non-determinism
arises: we
cannot assign a unique interpretation locally, for we would need information which
is not
available yet, and we need to apply some heuristics.
Determinism, Algorithms and Local Ambiguities
In this section we will review the concept of determinism, as opposed to local ambiguity, of a grammar. As an introductory issue, we show that determinism does not only depend on the grammar we use, but also on the algorithm. In regular tree languages, there can be no one-dimensional concept of determinism, as there is for regular string languages. Note that this is more than a metaphor, since we can conceive of regular string languages as one-dimensional structures; in order to talk about trees or the strings formed by their leaves, we need at least two dimensions (see Rogers, 2003). In the remainder, we will show how different grammar types provide determinism for all algorithms, for some particular class of algorithms, or for none.
Deterministic Content Models
Firstly, we consider the concept of deterministic content models. This draws on the
notion of deterministic (or 1-unambiguous) regular expressions (DREs), thoroughly
surveyed
by Brüggemann-Klein and Wood, 1997[5]. Assume we have a string s which is denoted by
some regular expression E. Assume furthermore we build our
expressions with letters from an alphabet Π, which is
identical to the alphabet Σ, except for the fact that
we have additional indices for the letters (taken here from the set of natural numbers).
If
we build an expression from Π, we have to make sure
that every index occurs at most once in it. Constructors for regular expressions
are as
usual, and indices are passed on to the letters of the strings the expression denotes.
We
say that a letter in s instantiates a letter in E, if the following holds:
Definition 7 A letter ai in s instantiates a letter aj from E if i = j.
The index ensures that for every string an expression denotes, there is a unique surjective mapping from letters in s to letters in E. We now define a mapping ♮ from strings over Π to strings over Σ, such that ♮(xi) := x, and ♮(xs) := ♮(x)♮(s), where s is a string and x a letter (the first of the string). This homomorphism simply deletes the indices and leaves everything else untouched. Note that for a given string ♮(s) and a given expression E, there might be several s ∈ L(E) such that their mapping under ♮ is identical. In this case we say that a letter in s might instantiate several letters in E. A deterministic regular expression over Π is then defined as follows:
Definition 8 E is deterministic, or one-unambiguous, if for all strings u, v, w over Π, and all letters x, y in Π, the following holds: if uxv and uyw ∈ L(E), and x ≠ y, then also ♮(x) ≠ ♮(y).
This means that we can skip the indices, and we still know which letter in E is instantiated by any letter in s, simply from knowing its left context. Formally, this means that the mapping
♮ is reversible. A regular language is deterministic if it is denoted by a
deterministic regular expression. A simple example of a non-deterministic expression is a*b*a*, as we can easily check with the string a, which might be the instantiation of either of the two a's in the expression.
As a consequence for processing, quite informally, we can state that when reading s ∈ L(E) from left to right, where E is a DRE, at any point in s we know at which point in E we find ourselves. In automata theory, DREs correspond to deterministic Glushkov automata.
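The following self-contained Python sketch tests one-unambiguity for small regular expressions given as syntax trees, using the standard first/follow-set construction of the Glushkov (position) automaton: an expression is one-unambiguous iff neither its first set nor any follow set contains two distinct positions carrying the same symbol (Brüggemann-Klein and Wood, 1997). The AST encoding and the helper names are our own assumptions, and epsilon is omitted for brevity; this is an illustration, not production code.

# Regular expressions as nested tuples:
# ('sym', 'a'), ('cat', e1, e2), ('alt', e1, e2), ('star', e)

def mark(expr, counter=None):
    """Attach a unique position index to every symbol occurrence."""
    if counter is None:
        counter = [0]
    if expr[0] == 'sym':
        counter[0] += 1
        return ('sym', expr[1], counter[0])
    if expr[0] in ('cat', 'alt'):
        return (expr[0], mark(expr[1], counter), mark(expr[2], counter))
    return ('star', mark(expr[1], counter))

def nullable(e):
    if e[0] == 'sym':
        return False
    if e[0] == 'cat':
        return nullable(e[1]) and nullable(e[2])
    if e[0] == 'alt':
        return nullable(e[1]) or nullable(e[2])
    return True  # star

def first(e):
    if e[0] == 'sym':
        return {(e[1], e[2])}
    if e[0] == 'cat':
        return first(e[1]) | (first(e[2]) if nullable(e[1]) else set())
    if e[0] == 'alt':
        return first(e[1]) | first(e[2])
    return first(e[1])

def last(e):
    if e[0] == 'sym':
        return {(e[1], e[2])}
    if e[0] == 'cat':
        return last(e[2]) | (last(e[1]) if nullable(e[2]) else set())
    if e[0] == 'alt':
        return last(e[1]) | last(e[2])
    return last(e[1])

def follow(e, table):
    """Follow sets of the Glushkov automaton, keyed by position."""
    if e[0] == 'cat':
        follow(e[1], table); follow(e[2], table)
        for pos in last(e[1]):
            table.setdefault(pos, set()).update(first(e[2]))
    elif e[0] == 'alt':
        follow(e[1], table); follow(e[2], table)
    elif e[0] == 'star':
        follow(e[1], table)
        for pos in last(e[1]):
            table.setdefault(pos, set()).update(first(e[1]))
    return table

def deterministic(expr):
    """True iff expr is one-unambiguous, i.e. its Glushkov automaton is deterministic."""
    marked = mark(expr)
    for positions in [first(marked)] + list(follow(marked, {}).values()):
        symbols = [sym for sym, _ in positions]
        if len(symbols) != len(set(symbols)):
            return False
    return True

a, b, c = ('sym', 'a'), ('sym', 'b'), ('sym', 'c')
print(deterministic(('cat', ('cat', ('star', a), ('star', b)), ('star', a))))  # a*b*a*: False
print(deterministic(('cat', a, ('star', ('alt', b, c)))))                      # a(b|c)*: True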
There is, however, a problem if we apply this concept to content models in regular tree grammars: in regular expressions, we can see from a letter in the string which type of letter in the expression it instantiates (thus, we have a unique letter in Σ, though not in Π). We can thus deduce from a letter in the string a letter type in the expression, though not a letter instantiation, if the expression is not deterministic. We cannot, however, deduce from a given tree node label a unique nonterminal: if we have competing rules, different nonterminals introduce identical labels, and we still have to keep them apart. Thus, even if the content models themselves are deterministic, this is of little use if we cannot infer from a given label the unique nonterminal it belongs to.
By way of example, consider the following grammar rule:
A → a(ABC|CBA)
Its content model is surely deterministic. However, if A and C are competing nonterminals (introduce identical labels), this is of little use. We have to check each subtree until we have its unique interpretation. This means that, in the worst case, we have to check both subtrees (as we will see, this is the case in which the trees generated by A (C) form a subset of the trees generated by C (A)).
The obvious reason for the fact that this concept of determinism falls short is that it originates in one-dimensional strings. As our trees are two-dimensional, we can define determinism only with respect to the direction in which our analysis proceeds. The main difference is, of course, the one between top-down and bottom-up processing. In this paper we will not consider the difference between depth-first and breadth-first parsing, though this is surely worthwhile.
In the next subsection, we will reformulate and complete the Murata hierarchy in a way that makes clear which kind of determinism is facilitated by which kind of grammar. In the sequel, we will disregard the one-dimensional problem of non-deterministic content models, since it has been thoroughly analyzed and we have nothing to add (see Brüggemann-Klein and Wood, 1997). At this point, our interest is the second dimension: importantly, this means that when talking about determinism, we implicitly always add: provided that content models are deterministic in the above sense. We thus exclude all problems which may arise from regular expressions.
The Murata Hierarchy as Hierarchy of Locality Conditions
As our main concern will be the formal properties of the grammar types as they affect processing, we will first reformulate the hierarchy. This reformulation aims at making clear which information we need in order to uniquely interpret a local node or subtree.
In a local tree grammar, for any node labeled a in any context, we know its unique interpretation. This is obvious, since for any node label there is only a single rule which generates it, by the very definition of a local tree grammar. As a consequence, parsing is deterministic for any algorithm (provided we have deterministic content models), and the problem of giving a certain node of a given tree its interpretation is solvable in a constant number of steps.
In a single type tree grammar, for any node label
a in any context, we can determine its unique
interpretation if we know the interpretation of its mother node. This follows directly
from the definition: if we know the interpretation of a node's mother node (rules
correspond to interpretations), we know its content model. Within a content model
there
must not occur any competing nonterminals.
Note, however, that it is not sufficient to know the mother node's label. We can easily construct an example to show this: we have two competing rules, A → a(C) and B → a(D), whose nonterminals do not occur in the same content model of any rule. Furthermore, we have the two rules C → b(r) and D → b(r'). Then both nodes, as introduced by C and D, have the label b, and their mother nodes both have the label a, yet they have different interpretations.
For processing, this has an immediate consequence: a top-down parser will at any point immediately know the interpretation of any node, whereas if we start interpretation from the bottom, in the worst case we will have to go up to the root in order to get the correct interpretation. The matter of providing the interpretation of a given node is nonetheless a linear search problem, since for a given tree and a given node, it is sufficient to follow a path from the root to that node.
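A minimal Python sketch of this top-down procedure for a single type tree grammar is shown below: once the mother's interpretation (and thus its content model) is known, a child's label alone determines the child's interpretation, because no content model contains two competing nonterminals. The (label, children) tree encoding, the rule triples and the example grammar are our own informal assumptions, and the separate check that the child sequence actually matches the (deterministic) content model is omitted.

def top_down_interpret(node, nonterminal, rules):
    """Interpretation tree of a node interpreted as `nonterminal` under an STG."""
    label, children = node
    # Content model of the rule nonterminal -> label(cm).
    cm = next(r for lhs, t, r in rules if lhs == nonterminal and t == label)
    interpreted_children = []
    for child in children:
        child_label = child[0]
        # Single-type property: at most one nonterminal occurring in cm
        # introduces child_label, so the child's interpretation is forced.
        (child_nt,) = {lhs for lhs, t, _ in rules if t == child_label and lhs in cm}
        interpreted_children.append(top_down_interpret(child, child_nt, rules))
    return (nonterminal, interpreted_children)

# An STG based on the example above: A -> a(C) and B -> a(D) compete,
# but never occur in the same content model (S vs. T).
rules = [("S", "s", "A"), ("T", "t", "B"), ("A", "a", "C"), ("B", "a", "D"),
         ("C", "b", ""), ("D", "b", "")]
print(top_down_interpret(("s", [("a", [("b", [])])]), "S", rules))
# ('S', [('A', [('C', [])])]): under t instead of s, the same subtree would get B and D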
In a restrained competition grammar (RCG), in order to give a node its unique interpretation, we must have the interpretation of its mother node and check its left siblings, in case it has any. Note that from how we defined RCGs, it follows that we only need the labels of the siblings, not their interpretation: because any two competing nonterminals within a single content model are distinguished by a unique prefix, this prefix itself must not consist of competing nonterminals only, and neither must it be empty. This keeps the grammar unambiguous and easy to process. However, it imposes some restrictions we do not necessarily want to make: maybe the unique interpretation of a label should not depend on a left sibling, but on a right sibling. For example, in RCGs we cannot have competing nonterminals as leftmost symbols in a content model, but we can have them as rightmost symbols, given some distinguishing left context. This causes an asymmetry which seems quite arbitrary. Of course we can equally define the asymmetric counterpart of RCGs, checking for unique suffixes instead of prefixes; but care must be taken: since we have to fix the type for the class of languages (i.e., document grammars) we define, we have the same problem. If we generalize the concept to both unique suffixes and prefixes, some problems arise, which can however be remedied.
We now define a generalized restrained competition
grammar (GRCG), as follows:
In a GRCG, for any two competing nonterminals A and
B within a single content model r, one of (Γ A Δ) and
(Γ B Δ) fails to match r (Greek variables range over possibly empty sequences of nonterminals).
We now have generalized the restriction from the left (right, respectively) to the
entire context. Note that we have relaxed the overall restriction on the grammar,
by
making the restriction on content models more specific (indeed, this type properly
includes the RCGs).
This little relaxation, however, causes a vast increase in processing complexity: now, in order to give its unique interpretation to any node a in any context, in the worst case one needs to know the interpretation of its mother, the interpretation of its siblings, and the interpretation of its subtrees as well. And even then, GRCGs might still be ambiguous, allowing more than one interpretation for an entire tree.
This needs some explanation. The first point is easy to see: that we need to know the interpretation of the mother node follows a fortiori from the preceding argument (single type grammars are properly included in restrained competition grammars). But this is insufficient, since competing nonterminals may occur within the same content model. We have to match all the sister labels to the nonterminals of the content model of the mother's interpretation in order to get a unique interpretation (according to the definition).
Note, however, that we need the interpretation of the sister nodes; it is not sufficient to have their labels. This we can easily verify with the following grammar rule: A → a(BC|B'C'), where B and B', and C and C', are competing nonterminals, introducing the labels b and c, respectively.
This satisfies all conditions on GRCGs. In order to get the correct interpretation for the labels b and c, it is not sufficient to know the sister node's label; we need its interpretation.
Things can thus get even worse if we consider the case where the interpretation of a sister node depends on the interpretation of the node under consideration itself. Look at the following example rule: A → a(BC|CB), where B and C are competing nonterminals. Obviously, they are both uniquely determined by their neighbor within the content model of A (B may only occur with C to its left or to its right, and vice versa). However, as they carry the same labels, it is insufficient to determine either of them if we just check the label of its sister (since it is identical). Furthermore, we might have the case where it is impossible to interpret one of the subtrees without its sister's interpretation (e.g., if one of the competing nonterminals generates a subset of the trees generated by the other).
We will consider this case more thoroughly in the next section, showing that there are globally ambiguous GRCGs, and that for every language that can be generated by an RTG, there is also a GRCG that generates the same language.
From the point of view of processing, we see that in GRCGs, expressive power comes at a high cost: neither a bottom-up nor a top-down parser is capable of assigning a unique interpretation locally, and maybe not even globally. The problem of giving a given node its unique interpretation might thus be an exponential search problem, and in the worst case not even decidable. We will therefore introduce a subtype of GRCG, which we will call unambiguous RCG. This type is properly included in the class of GRCGs, and properly includes the class of STGs as well as the class of RCGs, as can easily be seen.
We now introduce unambiguous restrained competition grammars (URCGs). What we want to eliminate is ambiguity, which can be caused by the fact that in a GRCG we might have entire competing contexts, or the interpretation of the context of a label in a content model might depend on the interpretation of the label itself. In the resulting grammar, it should be possible to obtain the unique interpretation of a node from the interpretation of its mother and the labels (not interpretations) of its sisters.
We characterize the grammar type in the following terms: We introduce an alphabet
of
meta-variables O, which we use in the following way:
We form the set of all sets of nonterminals from N which compete with each other; we call these sets the competition sets (which are possibly singletons). We are interested in collections of competition sets in which every nonterminal occurs in exactly one competition set, so that the whole collection forms a partition of N. For every such partition we iterate the following procedure: to every competition set, we assign a single symbol from O. We call this an O-assignment. Then, for all content models, we check for all nonterminals,
whether the content models still satisfy the GRCG condition, if we replace all other
nonterminals by the symbols from O they are assigned
to. In case there is more than one overall assignment, that is, a single nonterminal
belongs to more than one competition set, we have to iterate this for every possible
assignment. If for every assignment, nonterminal and content model, the resulting
grammar is a GRCG, then the original grammar is a URCG.
Note that the assignments are only introduced for this evaluation procedure. We will call contexts which are identical under the O-assignment similar. Accordingly, we define a URCG as a grammar where competing nonterminals must not occur in the same content model and in similar contexts (this obviously subsumes identical contexts). It is easy to see that we have now made sure that competing nonterminals must not occur within the same contexts of labels (as opposed to nonterminals).
Thus, a URCG is a GRCG where for every node we find its unique interpretation if we have the interpretation of the mother node and the labels of its sister nodes. In particular, we can interpret any node without having to resort to its sister nodes' interpretations: the content model (BC|CB), where B and C are competing nonterminals, does not satisfy the condition, because both B and C occur in the context __X or X__, respectively, where X is the O-assignment for both.
This kind of grammar is useful for the following reason: there is no global ambiguity in it (as we have eliminated the only source of ambiguity, namely that interpretations of labels may depend on each other), and it is the strongest of the non-ambiguous grammar types we have considered. However, it is not capable of generating every language which is not inherently ambiguous, as we will show later. Note that in order to provide the unique interpretation of a node, we still might have to check all of its sister nodes, but it is sufficient to check their labels.
It is easily seen that URCGs properly include RCGs, as both left and right context
can count as distinctive (we will make this more precise later on). As we will also
see
further down, there are languages which can be generated by GRCGs, but not by URCGs.
This will follow from the fact that actually every RTL can be generated by a GRCG,
but
there are languages for which there are no unambiguous grammars, and, obviously, URCGs
are always unambiguous.
Interestingly, the search problem for URCGs is still linear, since we only need to
go down the path from the root to a given node, and in addition check finitely many
sister labels (note that while regular expressions allow for arbitrary branching,
the
branching of a given tree is, of course, always finite).
An example of a URCG could be the use of attribute-based co-occurrence constraints or attribute-element constraints in the following RELAX NG declaration. We extend our example grammar by adding type information to the section element. If the type is set to the value "global", other section child elements are allowed as part of the content model; if its value is set to "sub", only para child elements are allowed (see Figure 6).
The interesting part is the definition of the element.section pattern. It allows for two different elements named "section" with different content models according to the value of the optional type attribute. The result is that the instance shown in Figure 7 is valid according to this RELAX NG grammar, while the one shown in Figure 8 is not.
Note: Without any type attribute, the instance shown in Figure 7 would still be valid.
Other widely deployed RELAX NG examples of attribute-based co-occurrence constraints can be found in van der Vlist, 2003, Chapter 7, in Clark et al., 2003, Section 15, or in the RELAX NG schema for the JLTF (Japanese Layout Taskforce) aligned document shown in Sasaki, 2010.
Importantly, co-occurrence constraints do not add anything to the formal generative capacity of the grammar. This is because attributes (or their values, respectively) add an additional specification to the terminals. Thereby we can convert competing terminals (or, equivalently, rules) into non-competing ones, but not vice versa. Any co-occurrence constraint thus gives us the possibility to distinguish otherwise possibly indistinguishable nonterminals, thereby at most keeping the complexity of the grammar constant, or even reducing it. Furthermore, as co-occurrence constraints only affect immediate subtrees (i.e., content models), their expressivity is entirely contained within the expressive capacities of standard regular tree rewriting rules; the only thing we might need to add to our formal grammar model is some additional specification on the terminals.
Neither DTD[6] nor XSD 1.0 supports such attribute-element constraints, although there are some workarounds or hacks that can be used in XML Schema to mimic co-occurrence constraints: either the use of the xsi:type attribute or xs:key[7]. Another option to realize this particular constraint is the use of embedded Schematron business rules, or conditional type assignment using type alternatives or assertions, which are introduced in XML Schema 1.1 Part 1 (for complex types) and XML Schema 1.1 Part 2 (for simple types). Figure 9 shows a possible XSD 1.1 realization.
The section element contains an XSD 1.1 assert element that uses an XPath expression in its test attribute to restrain the possible child elements according to the value of the section element's type attribute.
Note that when using XSD 1.1 assertions for expressing co-occurrence constraints, we are not limited to immediate subtrees: although the evaluation of the XPath expression is done in the context of the parent element (i.e., XSD 1.1's xs:assert element uses a tree fragment rooted at the element whose type is being tested), one could put the assertion at the level of the common ancestor, that is, the element that contains all the data needed to compute the assertion. The support for full XPath (i.e., axes such as ancestor, parent, preceding, preceding-sibling, following, and following-sibling) may be implementation-dependent. XSD 1.1 type alternatives are restricted to tests against constants or attributes on the element itself and cannot refer to ancestors, descendants, siblings or children or their attributes, while Schematron business rules are not restricted to an XPath subset.
There is one type we should add which cannot be assigned a place on the hierarchy from STGs to RTGs, but which is, however, weakly and strongly between LTGs and RTGs. Grammars of this type satisfy the following condition: we want to be able to assign a unique interpretation to any node a, provided we know the complete subtree it governs. This kind of grammar would facilitate deterministic parsing for bottom-up algorithms. In terms of grammar, this imposes the following restrictions on the formalism:
Every leaf terminal is introduced by a single nonterminal;
for every nonterminal N in a given grammar, there is at most one rule which has a given terminal a and in whose content model N appears.
We call this grammar type unique subtree grammar (USG). Note that this grammar type does not include STGs and URCGs, nor is it included by them. The restrictions do not restrain the occurrence of competing nonterminals in content models, but rather the labels which belong to the nonterminals in the content model. More precisely, whereas the former types restrain the occurrence of competing nonterminals within content models, USGs restrain the content models of competing rules themselves.
However, like all other types, they properly include the class of LTGs: every LTG is a USG, but not vice versa, and it is easy to find a language which is generated by a USG, but not by an LTG. Furthermore, they are properly included in the class of RTGs, in the strong and the weak sense (and weakly within the class of GRCGs, as we will see). As is also easy to see, the distinctive USG property provides deterministic bottom-up parsing. In order to get the interpretation of a given node, it is therefore sufficient to find a path from the leaves, which is a linear search problem.
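A rough Python sketch of this bottom-up pass is given below: leaves are interpreted by the unique rule for their terminal (first USG condition), and for an inner node its label together with the interpretation of one of its children picks out at most one rule (second USG condition). The encodings and the toy grammar are the same informal assumptions as in the earlier sketches, and matching the full child sequence against the content model is again omitted.

def bottom_up_interpret(node, rules):
    """Interpretation tree of a node under a unique subtree grammar (USG)."""
    label, children = node
    if not children:
        # Every leaf terminal is introduced by a single nonterminal.
        (lhs,) = {l for l, t, _ in rules if t == label}
        return (lhs, [])
    interpreted = [bottom_up_interpret(c, rules) for c in children]
    first_child_nt = interpreted[0][0]
    # For a given terminal, at most one rule mentions first_child_nt in its
    # content model, so the node's interpretation is forced bottom-up.
    (lhs,) = {l for l, t, cm in rules if t == label and first_child_nt in cm}
    return (lhs, interpreted)

# A USG that is not an LTG: A and B compete (both introduce a), but their
# content models are distinguished by the subtrees they govern.
rules = [("S", "s", "A|B"), ("A", "a", "C"), ("B", "a", "D"),
         ("C", "c", ""), ("D", "d", "")]
print(bottom_up_interpret(("s", [("a", [("c", [])])]), rules))
# ('S', [('A', [('C', [])])])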
Structure and Global Ambiguity
First, we will present some well-known results, which are important for our further
discussion.
Theorem 1 RTGs are weakly equivalent to
CFGs (that is, the set of strings of leaves generated by CFGs is equivalent to the
set of
strings of leaves generated by RTGs).
Theorem 2 RTGs are strongly equivalent
to graphs generated by CFGs, modulo a homomorphism of node labels (i.e., a homomorphism
which maps various node labels in a given tree onto a single one), provided the RTG
has
finitary branching.
Theorem 3 LTGs are strongly equivalent
to graphs generated by CFGs, provided finitary branching.
The proof is trivial: since in LTGs every node label is generated by exactly one nonterminal, and in CFGs nodes which are not leaves are labeled by nonterminals, there is a one-to-one correspondence. This has some importance for the relation of LTGs and RTGs. It follows that LTGs and RTGs are equivalent up to homomorphism, as are all grammar types in between. Every discussion we have about generative capacity therefore concerns only non-equivalence up to isomorphism. Since the strings of leaves are also identical for all grammar types, we can only be interested in the sets of trees. In the sequel, if we speak of the weak generative capacity of a tree grammar, we refer to the sets of trees it generates, not the strings of leaves. Since, to our knowledge, only the strong generative capacity of the grammar types of Murata et al., 2005 has been in the focus of research, we will now scrutinize their weak generative capacity.
Theorem 4 The
sets of trees generated by STGs form a proper subset of the sets of trees generated
by
RCGs.
For proof, consider the trees generated by the following grammar:
S → a(AB)
A → b(C)
B → b(D)
C → c(D)
D → c(ε)
This is an RCG, since the competing nonterminals A and B do not occur with the same prefix in the content model of S. In order to see that there is no way to generate this set of trees with an STG, consider the following fact: a governs two identical labels, which however govern different subtrees. It is therefore impossible to introduce them with identical rules, and (by the definition of STGs) forbidden to have two competing nonterminals in the content model of the first rule. This is sufficient for the proof, since all STG rules are also RCG rules, which gives the inclusion, and therefore the languages generated by STGs form a proper subset.
Theorem 5 The
sets of trees generated by LTGs form a proper subset of the sets of trees generated
by
STGs.
Consider the trees generated by the following grammar:
A → a(B)
B → b(C)
C → a(D)
D → c(ε)
This is a single type tree grammar, and no LTG is able to generate exactly this set of trees (remember that LTGs are strongly equivalent to graphs generated by CFGs, provided finitary branching).
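As a quick sanity check, the following self-contained Python snippet applies the competing-rules test to this grammar: it confirms that the grammar violates the LTG condition (A and C both introduce the terminal a) but satisfies the single type condition (the competing A and C never occur in the same content model). The triple encoding of rules is the same informal assumption used in the earlier sketches.

from itertools import combinations

rules = [("A", "a", "B"), ("B", "b", "C"), ("C", "a", "D"), ("D", "c", "")]

competing = [(r1[0], r2[0]) for r1, r2 in combinations(rules, 2)
             if r1[1] == r2[1] and r1[0] != r2[0]]
print(competing)  # [('A', 'C')]: competing rules exist, so this is not an LTG

single_type = all(not (x in cm and y in cm)
                  for x, y in competing for _, _, cm in rules)
print(single_type)  # True: no content model contains both A and C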
Restrained Competition Grammars and Variants
In this section we will scrutinize formal properties of the different types of
restrained competition grammars. We will show which kind of languages cannot be generated
by
RCGs; we will prove that there are GRCGs which are ambiguous, and that for every language
which can be generated by an RTG, there is a GRCG which generates the same language.
Finally, we will show that URCGs do not have these properties, are properly included
within
GRCGs and properly include RCGs.
The type of languages we cannot generate with RCGs is quickly described as follows: all grammars where a single content model contains competing nonterminals which can be uniquely distinguished only by their right context (or, for the mirror-image variant, only by their left context) are not RCGs. Consequently, we cannot generate sets of trees where a certain node has different subtrees depending on its right siblings. If we want to get rid of this asymmetry and allow for GRCGs, where competing nonterminals in a single content model are uniquely determined by their left or right context, we run into problems:
Theorem 6 There
are GRCGs which are ambiguous.
For proof, consider the following rule: S → a(AB|BA). Suppose A and B are competing nonterminals; suppose furthermore that there is some overlap between A and B, i.e., the nonterminals generate overlapping sets of trees. In particular, we may assume that the trees generated by A form a subset of the trees generated by B. For example, A and B generate identical trees up to depth n; B in addition generates a tree of depth n+1. In this case, the trees of the language have the root a, with two symmetrical sets of subtrees up to depth n, and possibly one subtree of depth n+1. It is easily seen that it is now impossible to merge A and B, for then we would be incapable of expressing the condition that at most one subtree has depth n+1. However, for the trees where the subtrees introduced by A and B have depth at most n, there is necessarily more than one analysis. The grammar we have described so far is, however, a GRCG, because A and B do not occur in identical contexts (though they do occur in similar contexts; remember the preceding section).
Theorem 7 For
every language which can be generated by an RTG, there is a GRCG which generates the
same
language.
To prove this theorem, we describe a simple procedure to convert any RTG into a GRCG which generates the same language. We define competing sequences of nonterminals of length n as follows: two sequences of nonterminals compete if, for every position i, the ith nonterminal of one sequence competes with the ith nonterminal of the other sequence. We have to assume a content model r which does not conform to the GRCG condition. Therefore, there have to be two competing nonterminals or competing sequences of nonterminals A and B in r, such that for possibly empty sequences of nonterminals Γ and Δ, both (Γ A Δ) and (Γ B Δ) match r.
Given this, we can be sure that in the instantiations of r which violate the GRCG condition, A and B occur in exactly the same global tree contexts. By global tree context we here mean that a tree with a governing the subtrees generated by A is part of the language iff a also governs the set of subtrees generated by B. Since this is the case, we can simply merge the two nonterminals into a new one, C, which is the union of the former two. This new nonterminal substitutes all instantiations of A and B which occur in the same global tree context. These, by definition, are the instantiations which violate the GRCG condition. This we can apply to all nonterminals which violate the GRCG condition. The only thing we have to take care of is that we apply this only to those instantiations of the content models where two competing nonterminals match equally (this might force us to change some regular expressions). We do not show an exact algorithm at this point, since it is clear that an equivalent GRCG exists, and the details of the construction are of no practical interest here.
We now show that there is a hierarchy of proper inclusions RCG ⊂ URCG ⊂ GRCG.
To show that RCG ⊂ URCG, consider the following: every rule which is admitted by an RCG is also admitted by a URCG, because if competing nonterminals in the same content model have a unique prefix, a fortiori they also have a unique context (we have already shown that a unique prefix of nonterminals is tantamount to a unique prefix of labels/siblings, by induction). Above, we have already shown that for an RCG it is impossible to generate languages such as the one generated by the following grammar, which is a URCG.
S → a(AC|BD)
A → b(C)
B → b(D)
C → c(ε)
D → d(ε)
A and B compete, but are
determined uniquely by their context.
This concludes the first part; the second part will be a corollary of the next section: we will show that some languages are inherently ambiguous, that is, there is no unambiguous grammar for them. By Theorem 7 we know that we can generate these languages with GRCGs, whereas URCGs cannot generate them:
Proposition 1 A URCG cannot be ambiguous.
This is easy to see: an ambiguous grammar assigns two different sequences of nonterminals to the daughters of some node (since root nodes are unambiguous). Then, however, there must be at least two competing nonterminals which occur in the same content model in similar contexts, which, by definition, is impossible.
Inherently Ambiguous Languages
As a corollary, we can show that there are regular tree languages for which there is no unambiguous grammar. There are sets of trees which are generated by an ambiguous GRCG, but by no URCG. We will call these languages inherently ambiguous.
Theorem 8 Some
regular tree languages are inherently ambiguous.
This can be seen easily if we spell out a grammar of the kind described in the above subsection. We will then show that there is no way to write an unambiguous grammar which generates the same language.
S → r(AB|BA)
A → a(C)
B → a(D)
C → b(ε)
D → b(ε|E)
E → c(ε)
There is no way to merge A and B, since they generate different sets of subtrees (we can write L(A)≠L(B)); but since they overlap (L(A)∩ L(B)≠ ∅), there is no way to have a unique
interpretation in the cases where the subtrees generated by the nonterminals are identical.
There will always be two ways to generate trees in this case.
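To make the ambiguity tangible, the following self-contained Python sketch counts the interpretations (in the sense of Definition 6) of the tree r(a(b), a(b)) under this grammar by brute-force enumeration; it reports two interpretations, one assigning the two a-nodes A and B, the other B and A. Content models are written as Python regular expressions over single-letter nonterminals; this encoding and the enumeration strategy are our own illustrative assumptions.

import re
from itertools import product

rules = [("S", "r", "AB|BA"), ("A", "a", "C"), ("B", "a", "D"),
         ("C", "b", ""), ("D", "b", "E?"), ("E", "c", "")]
nonterminals = {r[0] for r in rules}

def count(node, nonterminal):
    """Number of interpretations assigning `nonterminal` to `node`."""
    label, children = node
    total = 0
    for lhs, terminal, cm in rules:
        if lhs != nonterminal or terminal != label:
            continue
        # Try every assignment of nonterminals to the children that matches cm.
        for assignment in product(nonterminals, repeat=len(children)):
            if not re.fullmatch(cm, "".join(assignment)):
                continue
            ways = 1
            for child, nt in zip(children, assignment):
                ways *= count(child, nt)
            total += ways
    return total

tree = ("r", [("a", [("b", [])]), ("a", [("b", [])])])
print(count(tree, "S"))  # 2: the tree is in the language, but it has two interpretations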
We can, furthermore, precisely state the conditions under which a regular tree language is ambiguous. To this end, however, we need to introduce some notation. For simplicity, we now write trees as terms: a tree with root a and daughters b and c is denoted as a(b,c), etc. As a next step, we define a context as the position where certain subtrees occur within trees of a language.
Definition 9 A context C is a tree term with exactly one variable. We say that a set of subtrees α occurs in a context C in a language L if the following holds: we can instantiate the variable of C with any tree from α, and the resulting tree is in L.
Note that sets of subtrees correspond to nonterminals when we speak of languages rather than grammars. In the sequel, for simplicity, we will use lowercase Greek variables both for sets of subtrees and for ordered sequences of sets of subtrees. The definition of a context is easily accommodated to sequences. A set of sequences of trees of length n consists of ordered tuples of trees of length n, of the form (t0,...,tn-1). Sets of subtrees are then simply sets of one-tuples. Importantly, we will not provide a proof for the following proposition, and leave it open as a conjecture. However, we will sketch the argument. We now make the following conjecture:
Conjecture 1 A tree-language is inherently ambiguous iff at least one node fulfills all of the following conditions:
1. We need to have one node with an arbitrary label a, with at least two (sequences of) sets of subtrees α and β, such that α ∩ β ≠ ∅;
2. α ≠ β;
3. There is at least one context C in L, such that both a(Γ, (t1,...,tn), Δ, (u1,...,un), Θ) and a(Γ, (u1,...,un), Δ, (t1,...,tn), Θ) occur in C, for all (t1,...,tn) ∈ α and all (u1,...,un) ∈ β, where uppercase Greek letters designate possibly empty sequences of daughter subtrees; note that the sequences need to have equal length in order to meet condition 1.
Due to space restrictions, we leave the proof of this conjecture open here; it is,
however, reminiscent of a theorem in Ogden, 1968 for string languages. We will give a
rather informal discussion of the points in the next section. It is not hard to see that
this is merely a generalization of the cases we have described above. As we will see, we
can derive some useful facts from these properties of ambiguous languages even without a
general proof: we can show that we can construe grammars for languages which do not
fulfill one of the conditions, and, moreover, which type of grammar we can construe.
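For illustration (this worked instance is ours and not part of the original argument), the language of the grammar given above appears to meet all three conditions: take the root label r as the node label in question, with α = {a(b)} (the subtrees introduced by A) and β = {a(b), a(b(c))} (the subtrees introduced by B). Then α ∩ β = {a(b)} ≠ ∅, α ≠ β, and, taking C to be the trivial root context, both r(t, u) and r(u, t) are in the language for all t ∈ α and u ∈ β.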
Unambiguous Languages
Theorem 9 Of the grammar types sketched so far, there is none which
generates all and only the RTLs that are not inherently ambiguous.
We will demonstrate this by going through the three conditions mentioned in the preceding
section and looking at which unambiguous grammar we can construe if one condition is not
met. This is to be read as follows: if one condition is not met, then any node of the
tree language may meet the conditions not in question, but no node meets the one currently
under consideration.
If there is no intersection between the sets of subtrees of a given node (that is, the
first condition is not met), there is of course no ambiguity. We cannot, however,
necessarily construe a URCG for such a language, since in the content model of the
mother node there are competing nonterminals in similar contexts (recall the example
given above). We can, however, construe a USG for it, since the subtrees are uniquely
identifiable.
If the second condition is not met, this means that there are no two sets of subtrees
governed by the same node which are not identical. We can thus introduce them by the same
nonterminal and obtain a local tree grammar (having no different sets of subtrees governed
by the same node amounts to saying that we need no competing rules in the grammar, as
nonterminals correspond to sets of subtrees).
If the third condition is not met, then we can construe nonterminals (corresponding
to the sets of subtrees) such that for all of them the following holds: assuming they
compete (introduce identical labels), they either occur in different contexts, in which
case they are thereby distinguishable and no ambiguity arises; or they occur in identical
contexts, in which case we can use a unique nonterminal which is the merge of both (this
also holds for root nodes). The critical case, in which the content of one (sequence of)
set(s) of subtrees depends on the other one, so that they occur in similar contexts while
it is impossible to merge them, has been excluded by assumption.
Since this argument holds inductively from the root to all subtrees, we can construe
a URCG for a language where condition 3 is not met, but we cannot use any strictly
weaker type. All we have ensured is that if two sets of subtrees occur in similar
contexts (for the grammar we construe), then they actually occur in identical contexts;
it follows that we do not need competing nonterminals in similar contexts.
This shows that we have still not solved the problem of defining a canonical grammar type
which generates all and only the unambiguous languages: there are languages which are
generated by USGs but by no other canonical class that rules out ambiguity, and languages
which are generated by URCGs and by no other such type. So far, we are still lacking a
characterization of the unambiguous languages in terms of grammar rules.
Application and Future Research
When we speak of XML schema languages and applications, the first thing that comes to
mind is parsing an instance and validating it against a corresponding schema. Murata et al., 2005 have presented algorithms for parsing the three types of tree grammars we have
already discussed. A task which is still open, however, is to provide algorithms for the
new grammar types we have defined.
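As a point of reference for such algorithms, the simplest of the three classic cases can be sketched in a few lines: in a local tree grammar every label has exactly one content model, so validation is a single top-down pass. The following sketch is ours, not Murata et al.'s pseudo-code, and the grammar, with its single-character labels and its content models given as regular expressions over the string of daughter labels, is purely hypothetical:

import re

# Hypothetical local tree grammar: each label has exactly one content model,
# written as a regular expression over the concatenated daughter labels
# (single-character labels keep this string encoding trivial).
LOCAL_GRAMMAR = {
    "r": r"a+",  # r takes one or more a daughters
    "a": r"b?",  # a takes an optional b daughter
    "b": r"",    # b is empty
}

def validate(tree):
    """Validate a (label, [children]) tree in one top-down pass."""
    label, children = tree
    model = LOCAL_GRAMMAR.get(label)
    if model is None:
        return False
    daughters = "".join(child[0] for child in children)
    if re.fullmatch(model, daughters) is None:
        return False
    return all(validate(child) for child in children)

print(validate(("r", [("a", [("b", [])]), ("a", [])])))  # True
print(validate(("r", [("b", [])])))                      # False: b may not occur under r

For single-type and regular tree grammars, the pass additionally has to assign types to nodes, deterministically in the single-type case and possibly tracking sets of candidate types in the general regular case; working out corresponding procedures for USGs and URCGs is precisely the open task mentioned above.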
Mani and Lee, 2002 demonstrated the use of the theory of regular tree grammars for
XML-to-relational conversion as an additional application of formal language theory in the
XML context. Again, this work could be extended using the newly established grammar types.
Regarding future research, this paper may serve as a foundation in the fields of XML
applications and formal languages. New features introduced in XSD 1.1, such as
conditional type assignment, assertions and the openContent element, as well as
the relaxed Unique Particle Attribution rule (UPA, also known as the
determinism rule, see XML Schema 1.1 Part 1, Section 3.8.6.4) and the changed behavior
regarding wildcards, have effects on the expressiveness. Apart from these natural enhancements,
another focus may lie in examining the relationships between XPath and XQuery and formal
languages on the basis of the work undertaken in this paper; we expect to shed some light on
this topic in future research. In addition, a more formal approach to the analysis of
overlapping markup structures such as GODDAGs (Sperberg-McQueen and Huitfeldt, 2004) could
be an interesting field for future work.
Seen from a practical perspective and in view of the findings of Martens et al., 2006, a large portion of the XML document grammars found in
the wild are structurally equivalent to DTDs or specialized
DTDs (that is, regular DTDs extended with a mechanism to decouple element names from their
types, see Papakonstantinou and Vianu, 2000 and Balmin et al., 2004
– also called EDTDs by Martens et al., 2006), and hence use roughly the
expressiveness of local tree grammars. This is often due to nontransparent restrictions in the
XML Schema spec such as the already discussed Element Declarations
Consistent (EDC) constraint. Bex et al., 2009 and Martens et al., 2007 provide simplifications for XSDs and XSD authoring tools that should
relieve authors from the burden of these constraints by automatically transforming
nondeterministic expressions into concise deterministic ones (for instance, rewriting the
nondeterministic content model (a, b) | (a, c) as the equivalent deterministic a, (b | c)).
Regarding RELAX NG document grammars, we think that restraining their expressive power to
the class of URCGs would provide a feasible compromise. Until then, we hope that this more
fine-grained hierarchy may serve others as a guide for choosing a specific XML schema
language depending on the expressivity of the markup language that has to be developed.
References
[Abiteboul et al., 2000] Abiteboul, S., P.
Buneman, and D. Suciu (2000). Data on the Web: From Relations to Semistructured Data
and XML.
Morgan Kaufmann Publishers, San Francisco, California.
[Ansari et al., 2009] Ansari, M. S., Zahid, N., and
K.-G. Doh. A Comparative Analysis of XML Schema Languages. In Slezak, D., Kim, T.,
Zhang, Y.,
Ma, J., and K. Chung, eds., Database Theory and Application. International Conference,
DTA
2009, Held as Part of the Future Generation Information Technology Conference, FGIT
2009, Jeju
Island, Korea, December 10-12, 2009. Proceedings, volume 64, pages 41– 48. Springer,
Berlin,
Heidelberg, 2009. doi:https://doi.org/10.1007/978-3-642-10583-8_6.
[Bauman, 2008] Bauman, S., (2008). Freedom to
Constrain: where does attribute constraint come from, mommy? In Proceedings of Balisage:
The
Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:https://doi.org/10.4242/BalisageVol1.Bauman01.
[Balmin et al., 2004] Balmin, A., Papakonstantinou, Y., and V. Vianu (2004). Incremental
validation of XML documents. ACM Transactions on Database Systems (TODS), 29(4):710–751.
doi:https://doi.org/10.1145/1042046.1042050.
[Bex et al., 2009] Bex, G. J., Gelade, W., Martens, W.
and F. Neven (2009). Simplifying XML Schema: Effortless Handling of Nondeterministic
Regular
Expressions. In SIGMOD ’09: Proceedings of the 35th SIGMOD international conference
on
Management of data, pages 731–744, New York, NY, USA, ACM. doi:https://doi.org/10.1145/1559845.1559922.
[Brüggemann-Klein and Wood, 1992] Brüggemann-Klein, A., and D. Wood (1992). Deterministic Regular Languages. In Finkel,
A. and
M. Jantzen, eds., STACS 92. 9th Annual Symposium on Theoretical Aspects of Computer
Science
Cachan, France, February 13–15, 1992 Proceedings, volume 577 of Lecture Notes in Computer
Science, pages 173–184. Springer, Berlin, Heidelberg. doi:https://doi.org/10.1007/3-540-55210-3_182.
[Brüggemann-Klein, 1993] Brüggemann-Klein,
A. (1993). Formal Models in Document Processing. Habilitation, Albert-Ludwig-Universität
zu
Freiburg i. Br.
[Brüggemann-Klein and Wood, 1997] Brüggemann-Klein, A., and D. Wood (1997). One-unambiguous
regular languages. Information and Computation, 142:182–206. doi:https://doi.org/10.1006/inco.1997.2695.
[Brüggemann-Klein and Wood, 2002] Brüggemann-Klein, A., and D. Wood (2002). The parsing of extended context-free grammars.
HKUST Theoretical Computer Science Center Research Report HKUST-TCSC-2002-08, The
Hong Kong
University of Science and Technology Library.
[Brüggemann-Klein and Wood, 2004] Brüggemann-Klein, A., and D. Wood (2004). Balanced context-free grammars, hedge grammars
and
pushdown caterpillar automata. In Proceedings of Extreme Markup Languages, Montréal,
Québec.
[Buck et al., 2000] Buck, L., Goldfarb, C. F., and P.
Prescod (2000). Datatypes for DTDs (DT4DTD) 1.0. W3C Note 13 January 2000, World Wide
Web
Consortium.
[Chomsky, 1956] Chomsky, N. (1956). Three Models for the Description of Language. IRE
Transactions on Information Theory, 2:113–124. doi:https://doi.org/10.1109/TIT.1956.1056813.
[Clark, 2001] Clark, J. (2001). TREX – Tree Regular
Expressions for XML Language Specification. Technical report, Thai Open Source Software
Center
Ltd.
[Clark et al., 2003] Clark, J., J. Cowan, and M. Murata (2003). RELAX NG Compact Syntax
Tutorial. Working Draft 26 March 2003, OASIS – Organization for the Advancement of
Structured Information Standards. http://relaxng.org/compact-tutorial-20030326.html.
[Comon et al., 2008] Comon, H., M. Dauchet, R.
Gilleron, C. Löding, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi (2007). Tree
Automata
Techniques and Applications. Release November, 18th 2008. http://www.grappa.univ-lille3.fr/tata.
[DSD2] Møller, A. (2005). Document Structure
Description 2.0. Technical report, BRICS (Basic Research in Computer Science, Aarhus
University), 2005. http://www.brics.dk/DSD/dsd2.html.
[Fiorello et al., 2004] Fiorello, D., Gessa, N.,
Marinelli, P., and F. Vitali. DTD++ 2.0: Adding support for co-constraints. In Proceedings
of
Extreme Markup Languages, Montréal, Québec.
[Gelade et al., 2009] Gelade, W, Martens, W., and F.
Neven (2009). Optimizing Schema Languages for XML: Numerical Constraints and Interleaving.
SIAM Journal on Computing, 38(5):2021–2043. doi:https://doi.org/10.1137/070697367.
[Goldfarb, 1978] Goldfarb, C. F. (1978). DCF GML
User’s Guide (IBM SH20-9160). IBM, 1978.
[Goldfarb, 1991] Goldfarb, C. F. (1991). The SGML
Handbook. Oxford University Press, Oxford.
[Gécseg and Steinby, 1997] Gécseg, F., and M. Steinby
(1997). Tree languages. In Handbook of Formal Languages, volume 3, pages 1-68. Springer,
New
York.
[Hopcroft et al., 2000] Hopcroft, J., R. Motwani,
and J. Ullman (2000). Introduction to Automata Theory, Languages, and Computation.
2nd
edition. Addison Wesley Longman, Amsterdam.
[Kilpeläinen and Tuhkanen, 2007] Kilpeläinen, P., and R. Tuhkanen (2007). One-unambiguity of
regular expressions with numeric occurrence indicators. Information and Computation,
205(6):890–916. doi:https://doi.org/10.1016/j.ic.2006.12.003.
[Klarlund et al., 2003] Klarlund, N., T.
Schwentick, and D. Suciu (2003). XML: Model, Schemas, Types, Logics and Queries. In
Chomicki,
J., R. van der Meyden, and G. Saake, eds., Logics for Emerging Applications of Databases,
pages 1-41. Springer, Berlin, Heidelberg.
[Kracht, 2010] Kracht, M. (to appear). Modal Logic
Foundations of Markup Structures in Annotation Systems. In Mehler, A., Kühnberger,
K.-U.,
Lobin, H., Lüngen, H., Storrer, A., and A. Witt, eds., Modeling, Learning and Processing
of
Text Technological Data Structures, Studies in Computational Intelligence. Springer,
Dordrecht.
[Lee and Chu, 2000] Lee, D. and W. Chu (2000). Comparative Analysis of Six XML Schema
Languages. ACM SIGMOD Record, 29(3):76–87. doi:https://doi.org/10.1145/362084.362140.
[Maler and Andaloussi, 1995] Maler, E., and J. E.
Andaloussi (1995). Developing SGML DTDs: From Text to Model to Markup. Prentice Hall,
Upper
Saddle River, New Jersey
[Mani, 2001] M. Mani (2001). Keeping chess alive: Do we
need 1-unambiguous content models? In Proceedings of Extreme Markup Languages, Montréal,
Québec.
[Mani and Lee, 2002] Mani, M., and D. Lee (2002). XML
to Relational Conversion using Theory of Regular Tree Grammars. In Proceedings of
the 28th
VLDB Conference, Hong Kong, China.
[Marcoux, 2008] Marcoux, Y. (2008). Graph
characterization of overlap-only TexMECS and other overlapping markup formalisms.
In
Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies,
vol. 1. Montréal, Québec. doi:https://doi.org/10.4242/BalisageVol1.Marcoux01.
[Martens et al., 2005] Martens, W., Neven, F., and
T. Schwentick (2005). Which XML Schemas Admit 1-Pass Preorder Typing? In Eiter, T.,
and L.
Libkin, eds., Database Theory – ICDT 2005, volume 3363 of Lecture Notes in Computer
Science,
pages 68–82. Springer, Berlin, Heidelberg, 2005. doi:https://doi.org/10.1007/978-3-540-30570-5_5.
[Martens et al., 2006] Martens, W., Neven, F., Schwentick, T., and G. Bex (2006).
Expressiveness and Complexity of XML Schema. ACM Transactions on Database Systems (TODS),
31(3):770–813. doi:https://doi.org/10.1145/1166074.1166076.
[Martens et al., 2007] Martens, W., Neven, F. and T. Schwentick (2007). Simple off the shelf
abstractions for XML schema. SIGMOD Rec., 36(3):15–22. doi:https://doi.org/10.1145/1324185.1324188.
[Martens et al., 2009] Martens, W., Neven, F. and T.
Schwentick (2009). Complexity of Decision Problems for XML Schemas and Chain Regular
Expressions. SIAM Journal on Computing, 39(4):1486–1530. doi:https://doi.org/10.1137/080743457.
[Møller and Schwartzbach, 2006] Møller, A., and M.
Schwartzbach (2006). An Introduction to XML and Web Technologies, chapter Schema Languages,
pages 92–187. Addison-Wesley, Harlow, England.
[Murata et al., 2001] Murata, M., D. Lee, and M.
Mani (2001). Taxonomy of XML Schema Languages using Formal Language Theory. In Proceedings
of
Extreme Markup Languages, Montréal, Québec.
[Murata et al., 2005] Murata, M., D. Lee, M. Mani, and K. Kawaguchi (2005). Taxonomy of XML
Schema Languages Using Formal Language Theory. ACM Transactions on Internet Technology,
5(4):660–704. doi:https://doi.org/10.1145/1111627.1111631.
[Nentwich, 2005] Nentwich, C. (2005). CLiX – A
Validation Rule Language for XML. Presented by Anthony Finkelstein at W3C Workshop
on Rule
Languages for Interoperability, 27-28 April 2005, Washington D.C. http://www.w3.org/2004/12/rules-ws/paper/24/.
[NVDL] ISO/IEC 19757-4:2006. Information technology —
Document Schema Definition Languages (DSDL) — Part 4: Namespace-based Validation Dispatching
Language (NVDL), International Standard, International Organization for Standardization,
Geneva.
[Ogden, 1968] Ogden, W. (1968). A Helpful Result for Proving Inherent Ambiguity.
Mathematical Systems Theory, 2(3):191–194. doi:https://doi.org/10.1007/BF01694004.
[Papakonstantinou and Vianu, 2000] Papakonstantinou, Y., and V. Vianu (2000). DTD inference for views of XML data. In
PODS ’00:
Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of
database
systems, pages 35–46, New York, NY, USA, ACM. doi:https://doi.org/10.1145/335168.335173.
[Piez, 2001] Piez, W. (2001). Beyond the “descriptive vs. procedural” distinction. Markup
Languages – Theory & Practice, 3(2):141–172. doi:https://doi.org/10.1162/109966201317356380.
[RELAX Core] ISO/IEC TR 22250-1:2002. Information
technology – Document description and processing languages – Regular Language Description
for
XML – part 1: RELAX Core. International Standard, International Organization for
Standardization, Geneva.
[RELAX NG] ISO/IEC 19757-2:2008. Information technology –
Document Schema Definition Language (DSDL) – Part 2: Regular-grammar-based validation
– RELAX
NG (ISO/IEC 19757-2). International Standard, International Organization for Standardization,
Geneva.
[RELAX NG (2nd Ed.)] ISO/IEC 19757-2:2008. Information
technology – Document Schema Definition Language (DSDL) – Part 2: Regular-grammar-based
validation – RELAX NG (ISO/IEC 19757-2). Second Edition. International Standard, International
Organization for Standardization, Geneva.
[Rizzi, 2001] Rizzi, R. (2001). Complexity of
context-free grammars with exceptions and the inadequacy of grammars as models for
XML and
SGML. Markup Languages – Theory & Practice, 3(1):107–116. doi:https://doi.org/10.1162/109966201753537222.
[Rogers, 2003] Rogers, J. (2003). Syntactic Structures as Multi-dimensional Trees. Research
on Language and Computation, 1(3-4):265–305. doi:https://doi.org/10.1023/A:1024695608419.
[Sasaki, 2010] Sasaki, F. (2010). How to avoid
suffering from markup: A project report about the virtue of hiding xml. In XML Prague
2010
Conference Proceedings, pages 105–123, Prague, Czech Republic, March 13–14 2010. Institute
for
Theoretical Computer Science.
[Schematron] ISO/IEC 19757-3:2006 Information
technology — Document Schema Definition Languages (DSDL) — Part 3: Rule-based validation
—
Schematron. International Standard, International Organization for Standardization,
Geneva.
[SGML] ISO 8879:1986. Information Processing — Text and
Office Information Systems — Standard Generalized Markup Language. International Standard,
International Organization for Standardization, Geneva.
[Sperberg-McQueen, 2003] Sperberg-McQueen,
C. M. (2003). Logic grammars and XML Schema. In Proceedings of Extreme Markup Languages,
Montréal, Québec.
[Sperberg-McQueen and Huitfeldt, 2004] Sperberg-McQueen, C. M. and C.
Huitfeldt (2004). GODDAG: A Data Structure for Overlapping Hierarchies. In King, P.
and E. V.
Munson, eds. Proceedings of the 5th International Workshop on the Principles of Digital
Document Processing (PODDP 2000), volume 2023 of Lecture Notes in Computer Science,
pages
139–160. Springer, 2004
[Stührenberg and Jettka, 2009] Stührenberg, M.
and D. Jettka (2009). A toolkit for multi-dimensional markup: The development of SGF
to
XStandoff. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series
on Markup
Technologies, vol. 3 (2009). Montréal, Québec. doi:https://doi.org/10.4242/BalisageVol3.Stuhrenberg01.
[Vitali et al., 2003] Vitali, F., Amorosi, N., and N.
Gessa. Datatype- and namespace-aware DTDs: A minimal extension. In Proceedings of
Extreme
Markup Languages, Montréal, Québec.