How to cite this paper
Blažević, Mario. “Grammar-driven Markup Generation.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). https://doi.org/10.4242/BalisageVol5.Blazevic01.
Balisage: The Markup Conference 2010
August 3 - 6, 2010
Balisage Paper: Grammar-driven Markup Generation
Mario Blažević
Senior software developer
Stilo International plc.
The author has a Master's degree in Computer Science from the University of Novi Sad,
Yugoslavia. Since moving to Canada in 2000, he has been working for OmniMark
Technologies, later acquired by Stilo International plc., mostly in the area of markup
processing and on the development of the OmniMark programming language.
Copyright © 2010 by the author. Used with
permission.
Abstract
This paper defines the concept of grammar-driven normalization of incomplete instances,
sketches its implementation for RELAX NG schemas and XML documents, and presents an
example of its practical use for automated document conversion.
Table of Contents
- Introduction
- Example
- Design and Implementation
- Results
- Related Work
- Appendix A. Sample implementation
Note
I want to thank my colleagues Jacques Légaré, Joe Gollner, Patrick Baker, and Roy Amodeo for their
valuable comments, and Stilo International for providing the initial motivation for the work and giving me time to
do it properly.
Introduction
A grammar is a set of rules that describes the composition of a language. Grammars
have existed for at least two and a half millennia, but their use was mainly didactic.
The practical applications of grammars started with their mathematical formalization
in the 1950s [c56] and the emergence of computer programming languages [b59].
There are three ways to apply a grammar:
- Validate a given sentence to determine if it is grammatically correct.
- Generate a correct sentence according to the grammar.
- Parse a given sentence to determine its structure and constituent parts.
The historical motivation for the development of the theory of formal languages was to
explain the method of generation of valid sentences as well as their recognition and
parsing. Generative grammars are still an active area of research in linguistics. In
computer science, however, parsing is the primary role of grammars; their use for the
generation of valid sentences appears to be forgotten.
One reason for this lack of interest in grammar-driven generation is that the problem
is poorly defined. The
output of a generator is a grammatically valid sentence. But what is its input, other
than the grammar? The task of a
parser is clearer: while its output in practice can be anything, it is usually defined
as a parse tree of some
form.
While many kinds of generator inputs are possible and interesting to contemplate,
we shall concentrate on the
specific task of generating a grammatically valid sentence from a sequence of tokens
(i.e., terminals) allowed by the grammar. Furthermore, the generator must exactly reproduce
any
input that is already a grammatically valid sentence. We shall use the term normalization to refer to this form of sentence generation, and the term normalizer for the generator that performs it.
Given a normalizer that maps a set of token sequences T into a set of canonical,
grammatically valid sequences S, one can define a new language S' as the union of the
core language S and the non-core language T.
For an actual example of this kind of language definition, we need look no further
than SGML. An SGML parser
performs parsing, normalization, and validation at the same time. For example, given
the following SGML DTD that
declares the end tag of element a and both tags of element b to be omissible:
<!element a - o (b)>
<!element b o o (#pcdata)>
an SGML parser with the OMITTAG feature enabled will parse the instance <a>Hello, World!
into a syntax tree corresponding to <a><b>Hello, World!</b></a>. Any conformant SGML
parser with support for omissible tags will consider these two instances equivalent, as
long as the same DTD is used to parse them. If the DTD should change, however, an
instance that omits element tags may become not only invalid but unparsable.
For this reason and others, an XML parser does not perform any normalization. A well-formed
XML instance is
always parsed in the same way according to a fixed grammar, regardless of what DTD
or schema the instance happens to
also conform to. Various types of XML schemas are used only for validation, not for
parsing or normalization.
This paper will demonstrate how an XML schema (a RELAX NG schema specifically) can
be used to generate a valid
instance of the schema from an invalid but well-formed XML instance. In other words,
we shall normalize a well-formed
XML instance so it conforms to a given schema.
The only element tags allowed in the input are those that occur in the target schema.
If normalization succeeds
with no errors, the output of the normalizer will:
- be well-formed XML,
- conform to the target schema, and
- contain the entire input instance as its subsequence, interleaved with additional element tags.
The normalizer is much more forgiving than the SGML parser in what kinds of input
it can normalize. The target
grammar is allowed to be ambiguous, though this is not recommended for performance
reasons. There is no unique particle attribution constraint [w04], nor is the generation
restricted to contextually required elements [s86]. If the input
instance is ambiguous, i.e., there is more than one way to normalize it, the
normalization with the least amount of added markup will be chosen. This choice is
not made at the point when the
particle is encountered, because the subsequent content could make that choice invalid,
but deferred until the
enclosing element ends. The biggest constraint is that only plain element tags with
no attributes can be
inserted. Even this constraint could be overcome by automatically generating some
arbitrary grammatically correct
values for all required attributes, but we felt that this might cause more harm than
benefit in practice.
The primary purpose of SGML's minimization features was making SGML easier to author.
We do not recommend using
the XML normalizer in this way; we feel that it is important that every XML document
in permanent storage should
conform to a clearly expressed schema.
The legitimate place for the normalization process is inside a processing pipeline,
where it can usefully be
applied to or produce a transient XML document instance. Normalization can thus be
applied before another process in
order to transform the input document into a form easier to process. The target schema
for the normalizer may be just
a stricter form of the input document's schema, or it may specify additional disambiguating
markup.
Alternatively, a normalizer can appear at the opposite end of the processing pipeline,
where it ensures that the output of the preceding process conforms to the target schema.
We shall demonstrate this latter possibility in a later section.
The rest of the paper is organized as follows: the next section presents an example of
the normalizer's use, and the section after that explains its design and implementation.
The section following those explains the purpose and results of the normalizer in
production, and the paper concludes with a discussion of related work.
Example
This section will demonstrate the capabilities of our normalizer
through an example. Let us take the following RELAX NG schema as our target:
start = document
block = p | ol | ul
document = element document { title, block+, section* }
section = element section { title, block+, section* }
title = element title { text }
p = element p { text }
ol = element ol { li+ }
ul = element ul { li+ }
li = element li { block+ }
The goal is to convert our input to a valid instance of this grammar. As for the input,
it may start as a plain
text file like the following one:
Normalizer test input
This is a short test input for the normalizer.
Purpose
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
Constraints
The goal of the normalizer is to produce output that conforms to the schema, while
preserving complete input. It does not always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data.
These must be present in the input.
We first need to ensure that our input is a well-formed XML document, which can be
achieved simply by putting the entire text inside <document>...</document> tags. If we
then run this XML instance through the normalizer, it will be converted to
<document>
<title></title>
<p>
Normalizer test input
This is a short test input for the normalizer.
Purpose
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
Constraints
The goal of the normalizer is to produce output that conforms to the schema, while
preserving complete input. It does not always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data.
These must be present in the input.
</p>
</document>
As we can verify, the output indeed conforms to the target schema. The content model of
document requires a title element, for example, and this element has been inserted.
Since it has no content, though, it may not be all that useful.
In order to produce a more appropriate instance, we need to place certain constraints on
the solution. One way to do this would be to make the schema stricter. We could require
the title content to be non-empty, and the paragraph content to contain no line breaks.
Alternatively, we can specify the constraints by adding the desired markup to the input
document. For example, we can mark up all input lines that appear to be titles:
<document>
<title>Normalizer test input</title>
This is a short test input for the normalizer.
<title>Purpose</title>
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
<title>Constraints</title>
The goal of the normalizer is to produce output that conforms to the schema, while
preserving complete input. It does not always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data.
These must be present in the input.
</document>
Given this enriched input together with the original schema, the normalizer produces
the following
output:
<document>
<title>Normalizer test input</title>
<p>
This is a short test input for the normalizer.
</p>
<section>
<title>Purpose</title>
<p>
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
</p>
<section>
<title>Constraints</title>
<p>
The goal of the normalizer is to produce output that conforms to the schema, while
preserving complete input. It does not always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data.
These must be present in the input.
</p>
</section>
</section>
</document>
The good news is that the normalizer has not only respected and kept our title markup,
it has also inferred that every title but the first one has to start a new section. The
bad news is that the second section has been nested within the first one. While this
does conform to the target schema, the two sections belong at the same level.
Furthermore, we want every line to be a separate paragraph, and the apparent list at the
end of the document should be marked up as such.
In this simple example we could just visually determine where each section, paragraph,
or list item starts and
ends and mark them up with XML element tags. In practice, however, we'd rather have
this work done automatically. A
simple pattern-matching rule would be capable of inserting a start tag wherever there
is a hyphen at the beginning of
the line, but finding where it ends is much more difficult. To make this marking-up
process easier, the normalizer
recognizes a number of processing instructions as output constraints. Once we apply
them, our input document might
look like this:
<document>
<title>Normalizer test input</title>
<?derivative.start-anew <p>?>This is a short test input for the normalizer.
<?derivative.start-anew <section>?><title>Purpose</title>
<?derivative.start-anew <p>?>The purpose of this document is to demonstrate
the capabilities of the normalizer.
<?derivative.start-anew <p>?>The normalizer is schema-driven, and it tries
to fit its input into the target schema.
<?derivative.start-anew <section>?><title>Constraints</title>
<?derivative.start-anew <p>?>The goal of the normalizer is to produce output
that conforms to the schema, while preserving complete input. It does not always succeed:
<?derivative.proceed-with <ul>?><?derivative.start-anew <li>?>
A piece of input that cannot be fit into the schema will be dropped.
<?derivative.proceed-with <ul>?><?derivative.start-anew <li>?>
The normalizer does not attempt to conjure any required attribute values nor text data.
<?derivative.start-anew <p>?>These must be present in the input.
</document>
The processing instructions basically act as placeholders for element start tags,
letting the normalizer know which element is supposed to enclose the following content.
The exact location of the element's end tag will be inferred by the normalizer. The
details of the processing instructions and the inference algorithm will be explained in
the next section. The normalizer's output, based on our last enriched input above, looks
as follows:
<document>
<title>Normalizer test input</title>
<p>This is a short test input for the normalizer.</p>
<section>
<title>Purpose</title>
<p>The purpose of this document is to demonstrate the capabilities of the normalizer.</p>
<p>The normalizer is schema-driven, and it tries to fit its input into the target schema.</p>
</section>
<section>
<title>Constraints</title>
<p>The goal of the normalizer is to produce output that conforms to the schema, while
preserving complete input. It does not always succeed:</p>
<ul>
<li>
<p>A piece of input that cannot be fit into the schema will be dropped.</p>
</li>
<li>
<p>The normalizer does not attempt to conjure any required attribute values nor
text data.</p>
<p>These must be present in the input.</p>
</li>
</ul>
</section>
</document>
Design and Implementation
The basic idea that drove the normalizer design came after implementing the validator
automaton [c02] for RELAX NG [c01]. This automaton uses a flexible validation algorithm
based on Brzozowski derivatives [b64]. We shall now sketch the basic idea of this
validation algorithm; a more thorough treatment can be found in the literature [a72].
Informally, the derivative of a grammar G with respect to a sequence of tokens S is the
grammar G' such that a sequence S' is valid according to G' if and only if the
concatenation of S and S' is valid according to G.
We can think of the grammar G as the initial state of a validating automaton, and the
grammar derivative calculation as its transition function. A very simple and generic
validator algorithm can be built on top of this concept [s05], and one such validator is
the standard RELAX NG validation algorithm.
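To make the derivative idea concrete, here is a minimal, self-contained Haskell sketch
(our own illustration, not code from the validator) of derivative-based matching over
plain token sequences; the RELAX NG algorithm applies the same recursion to the richer
pattern type shown below.
module DerivSketch where
-- A token-sequence analogue of the core RELAX NG pattern constructors.
data Re = NotAllowed       -- matches nothing
        | Empty            -- matches the empty sequence
        | Tok Char         -- matches a single token
        | Choice Re Re     -- matches either alternative
        | Group Re Re      -- matches the first pattern followed by the second
        | OneOrMore Re     -- matches one or more repetitions
  deriving (Eq, Show)
-- Can the pattern match the empty remaining input?
nullable :: Re -> Bool
nullable NotAllowed    = False
nullable Empty         = True
nullable (Tok _)       = False
nullable (Choice r s)  = nullable r || nullable s
nullable (Group r s)   = nullable r && nullable s
nullable (OneOrMore r) = nullable r
-- The Brzozowski derivative with respect to one token: the pattern matching
-- exactly the remainders of the original matches that start with c.
deriv :: Char -> Re -> Re
deriv _ NotAllowed    = NotAllowed
deriv _ Empty         = NotAllowed
deriv c (Tok d)       = if c == d then Empty else NotAllowed
deriv c (Choice r s)  = Choice (deriv c r) (deriv c s)
deriv c (Group r s)
  | nullable r        = Choice (Group (deriv c r) s) (deriv c s)
  | otherwise         = Group (deriv c r) s
deriv c (OneOrMore r) = Group (deriv c r) (Choice (OneOrMore r) Empty)
-- The validator: the grammar is the initial automaton state, deriv is the
-- transition function, and nullable is the acceptance test.
match :: Re -> String -> Bool
match r = nullable . foldl (flip deriv) r
-- For example: match (Group (Tok 'a') (OneOrMore (Tok 'b'))) "abb" == True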
While working on the implementation of a RELAX NG validator library [s09b] for OmniMark
[s09a], it became clear that the automaton could be relatively easily extended in two
orthogonal ways: with additional pattern constructs, and with additional actions to
perform. In particular, the validating automaton could be modified to produce arbitrary
output, in the same way a deterministic finite automaton can be extended to a Moore
machine [m56].
The original RELAX NG validation algorithm [c02] represents the grammar as a tree data
structure. Here is its definition in Haskell [h02]:
data Pattern = Empty
| NotAllowed
| Text
| Choice Pattern Pattern
| Interleave Pattern Pattern
| Group Pattern Pattern
| OneOrMore Pattern
| List Pattern
| Data Datatype ParamList
| DataExcept Datatype ParamList Pattern
| Value Datatype String Context
| Attribute NameClass Pattern
| Element NameClass Pattern
| After Pattern Pattern
The constructors of this data type, with the exception of the last one, correspond to
the constructs of the RELAX NG schema language. The last constructor, After, appears
when a start tag of an element is consumed, and matches the remainder of the XML input
together with the corresponding end tag.
We need two extensions to this data structure in order to enable the automaton to
perform normalization. First,
we need to be able to assume the presence of an omitted element tag. Secondly, we
must be able to output the
normalized version of our input, together with the omitted and inferred tags. So we
add two more data
constructors:
data Pattern = Empty
| ...
| Region Bool QName Pattern
| Result [ResultDelta] Pattern
data ResultDelta = StartTag ChildNode | Add ChildNode | EndTag | Inferred ResultDelta
The Result constructor keeps track of the output that should be emitted if the pattern
it wraps succeeds. Whenever the current pattern (i.e., the state of the automaton) has a
Result node at its root, the accumulated chunk of output is flushed and the root node is
replaced by its child pattern.
The Region constructor denotes an open element region, which may have come from the
input or may have been inserted in order to normalize it. In the former role, it acts as
a replacement for the After node. To accommodate this change in the representation of
open element regions, we have to make another adjustment to the Interleave constructor.
It needs an extra integer field to keep track of which of its patterns has consumed
unmatched start tags, and how many of them.
As long as the input conforms to the target schema, the normalizer acts the same way as
the RELAX NG validator, except it builds the Result nodes and flushes them at the
output. When a particle that doesn't match the current pattern is encountered, the
normalizer attempts to recover by descending into the content models of elements that
are allowed at the current point. The derivative of the current pattern is then the sum
of derivatives of all content models that accept the particle, with each content
derivative wrapped into a Region node and a Result node; these keep track of the
inferred element start tag.
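This recovery corresponds to the following clauses of the sample implementation in
Appendix A: when a text node arrives where only an element is allowed, the normalizer
records an inferred start tag, discards content models that would require attributes
(startTagCloseDeriv, since no attributes are ever inserted), and continues taking
derivatives inside the new inferred Region:
textDeriv cx inf (Element (Name uri local) p) s =
    let qn = QName uri local
    in region True qn (textDeriv cx inf (startRecovery qn cx p) s)
startRecovery :: QName -> Context -> Pattern -> Pattern
startRecovery qn cx p =
    result [Inferred (StartTag (ElementNode qn cx [] []))] (startTagCloseDeriv p)
The analogous clause of startTagOpenDeriv performs the same recovery when the
non-matching particle is itself an element start tag.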
This modification of the RELAX NG validation algorithm results in the normalizer whose
complete source code is listed in Appendix A. This simple normalizer is completely
correct, but far from perfect. It easily runs into a combinatorial explosion of the
pattern tree, and it cannot be customized to resolve ambiguity in a desired way. While
the available space is too short to describe all of the additional features and
optimizations, the role of the processing-instruction guides is too important to leave
out.
Many interesting target schemas contain a large number of elements which have nearly
identical content models and which are allowed to appear in the same places. Such is the
case, for example, with the DocBook 5 elements glossary, bibliography, index, toc,
dedication, acknowledgements, preface, chapter, appendix, article, and colophon. This
represents a problem for the normalizer because it must keep track of all the
alternative elements in case of recovery. If the user can specify which of the elements
should be used for recovery, this improves the output and the performance at the same
time.
At the moment, the normalizer accepts the following processing instructions:
- <?derivative.start-anew depth? start-tag?>
- <?derivative.start-nested depth? start-tag?>
- <?derivative.proceed-with depth? start-tag?>
- <?derivative.ensure-inside element-name?>
- <?derivative.ensure-outside element-name?>
where depth has the form region-identifier:integer.
Processing instructions are used instead of element tags because the input document
has to be well-formed XML,
so it cannot contain unmatched element start tags. Using empty element tags in a special
namespace could cause
difficulties because the target schema may or may not allow them in that position.
Finally, since these instructions
target one specific tool, the normalizer, this is an appropriate use of processing
instructions if there ever was
one.
The normalizer evaluates these processing instructions as follows:
- start-anew: Any existing element region that was opened with the same
region-identifier and a larger or equal integer depth will be closed. If depth is not
specified, any existing element region with the same element name will be closed. A new
element region will be started as if the specified start-tag appeared instead of the
instruction.
- start-nested: Any existing element region that was opened with the same
region-identifier and a larger integer depth will be closed. A new element region will
be started as if the specified start-tag appeared instead of the instruction.
- proceed-with: Any existing element region that was opened with the same
region-identifier and a larger integer depth will be closed. If there is an existing
element region with the same region-identifier and an equal integer depth, that region
is kept and nothing else happens. Otherwise a new element region will be started as if
the specified start-tag appeared instead of the instruction.
- ensure-inside: The current pattern is trimmed to leave only those possibilities which
assume the specified element-name is currently open (see the example after this list).
- ensure-outside: The pattern of the remainder of the document is trimmed to leave only
those possibilities which assume the specified element-name is not currently open.
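As a hypothetical illustration (not taken from the test document above), the final line
of our example input could be pinned inside the open li region, instead of merely
allowing the normalizer to place it there:
<?derivative.ensure-inside li?><?derivative.start-anew <p>?>These must be present in the input.
This trims away every normalization in which the li region has already been closed, so
the new paragraph can only be placed inside the second list item.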
Results
The schema-driven normalizer described above has been implemented in the OmniMark
programming language, to serve
as a component in various document conversion solutions. We shall now briefly describe
this environment in order to
illustrate the practical benefits of the approach.
The wider purpose is to convert documents from a legacy presentation-oriented format
into a semantically richer XML format. Examples of the source document formats include
FrameMaker, HTML, InDesign, and QuarkXPress. The target format of the conversion can be
DITA, DocBook, or a user-specified XML format.
One obvious simplification when faced with converting M source formats to N target
formats is to introduce an intermediate format, and thus replace M × N direct converters
with M input converters to the intermediate format plus N output converters from the
intermediate to the target format. Another benefit of having the intermediate document
is that it can be enriched with user annotations to guide the output converter.
Experience we gained in building specific output converters showed that hand-written
implementations were
difficult to verify and maintain, so we needed a more robust and more flexible solution.
Some customers required
outputs that conform to a specialization of DITA, others wanted DocBook with various
extensions.
The core of the output converters is the topic of this paper: the schema-driven
normalizer. Its draft implementation, given in the appendix, was coded in Haskell and
tested using the QuickCheck library [c10]. After this proof of concept, the production
implementation was developed in OmniMark. Once the normalizer was ready, the old
hand-coded output converters for DITA and DocBook were replaced with new converters in a
matter of days.
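The QuickCheck properties used to test the draft are not reproduced in this paper; the
following self-contained property (our own illustration, reusing the DerivSketch module
sketched in the design section) shows the flavour of such testing by checking the
defining law of the derivative on randomly generated patterns:
import Test.QuickCheck
import DerivSketch (Re (..), deriv, match, nullable)
-- Generate small random patterns over the two-token alphabet "ab".
instance Arbitrary Re where
  arbitrary = sized re
    where re 0 = oneof leaves
          re n = oneof (leaves ++
                        [ Choice    <$> sub <*> sub
                        , Group     <$> sub <*> sub
                        , OneOrMore <$> sub ])
            where sub = re (n `div` 2)
          leaves = [pure NotAllowed, pure Empty, Tok <$> elements "ab"]
-- The defining law of the derivative: matching the whole input is the same as
-- consuming its first token with deriv and matching the remainder.
prop_derivLaw :: Re -> Property
prop_derivLaw r = forAll (listOf (elements "ab")) check
  where check []       = match r [] === nullable r
        check (c : cs) = match r (c : cs) === match (deriv c r) cs
main :: IO ()
main = quickCheck prop_derivLaw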
Compared to the hand-written output converters they replaced, the converters based
on the schema-driven
normalizer provide several benefits:
- Assuming the normalizer itself is correct, the output of the converter is guaranteed
to conform to the target schema. The correctness of hand-written converters' output is
more difficult to verify.
- A new target format can be supported very quickly as long as we have a schema for it,
while before it would require rewriting the entire converter. If the new target language
is a strict subset of an existing one, no coding is required whatsoever. For a small
extension of an already supported schema, only a few lines of code must be added.
- Many users have various stylistic requirements that can be supported simply by
restricting the target schema, with no code modifications required.
- The new converters are capable of automatically reordering parts of the input to make
it conform to the target schema, with no user intervention required beyond the usual
annotations. One example that we have often encountered in practice is that of DITA task
topics: the DITA 1.1 schema requires the prereq, context, steps, result, example, and
postreq elements to be listed in this specific order. Most existing technical
documentation is not written this way.
One downside of the new converters is their performance, especially their worst-case
memory consumption. The
worst case in question is generally an input document which is annotated in a way
that makes it non-conformant with
the target schema. The hand-coded converters would in this situation either report
errors (which is the desired
behaviour) or generate invalid output (which is not). The schema-driven converters
try to recover by inserting the
missing element tags and by rearranging the input. The recovery eventually fails and
the normalizer produces an error
report, but its attempt costs a lot of time and memory. We are still looking for a
way to fix this issue.
Related Work
Automatic grammar-driven sentence normalization has been done in linguistics, for
the automatic correction of
grammatically incorrect English sentences v88.
For markup languages, particularly XML, there has been some related work in the area of
automatic markup [f93] [k97] [t98]. The main difference from our work is that the input
of an automatic markup system is typically plain text, usually produced by OCR. As a
consequence, even those automatic markup systems whose output schema is not fixed
require additional configuration in the form of pattern-matching rules. The normalizer
approach presented in this paper can be coupled with a pattern-matching engine as well,
but it does not depend on one.
The automatic markup system that seems nearest to ours is the Vasari project [a03],
which is also schema-driven to some extent. The target schema must be enriched using a
special pattern-matching language which determines where the required element tags can
be inserted. It is not clear whether the location of every tag required by the target
schema must be specified this way.
There is a large body of work on the automatic translation between structured documents
that conform to similar schemas [a72] [m96] [t03]. The approaches based on the concept
of a syntax-directed translation schema [l68] and its extensions, such as extended
syntax-directed translation schemas [k95] and tree transformation grammars [l97], seem
to be the most closely related to our work. Syntax-directed translation is simpler and
more direct than our approach to translation, but it requires both the source and the
target format to be well structured. Our problem, quite common in practice, is that the
source format is only weakly structured. The transformation process presented in this
paper therefore proceeds in two stages: the weakly structured source document is first
translated into another weakly structured document which uses the tags allowed by the
target schema; this intermediate document is then normalized to a valid schema instance.
Appendix A. Sample implementation
This appendix contains a sample normalizer implementation in Haskell. The implementation
depends on the sample
code for the RELAX NG validator c02, which it expects to find in module file
Validator.hs
.
module Normalizer where
import Control.Monad (liftM2)
import Control.Exception (assert)
import Data.List (minimumBy)
-- The RELAX NG validator automaton code from http://www.thaiopensource.com/relaxng/derivative.html
import Validator (contains, datatypeAllows, datatypeEqual, strip, whitespace,
Uri, LocalName, ParamList, Prefix, Context, Datatype,
NameClass(..), QName(..), ChildNode(..), AttributeNode(..))
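-- The validator's Pattern type from [c02], extended with Region (an open
-- element region, inferred or matched from the input) and Result (pending
-- output), as described in the Design and Implementation section.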
data Pattern = Empty
| NotAllowed
| Text
| Choice Pattern Pattern
| Interleave Int Pattern Pattern
| Group Pattern Pattern
| OneOrMore Pattern
| List Pattern
| Data Datatype ParamList
| DataExcept Datatype ParamList Pattern
| Value Datatype String Context
| Attribute NameClass Pattern
| Element NameClass Pattern
| Region Bool QName Pattern
| Result [ResultDelta] Pattern
deriving (Eq, Show)
data ResultDelta = StartTag ChildNode | Add ChildNode | EndTag | Inferred ResultDelta
deriving (Eq, Show)
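-- Can the pattern match the empty remaining input?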
nullable:: Pattern -> Bool
nullable (Group p1 p2) = nullable p1 && nullable p2
nullable (Interleave balance p1 p2) = balance == 0 && nullable p1 && nullable p2
nullable (Choice p1 p2) = nullable p1 || nullable p2
nullable (OneOrMore p) = nullable p
nullable (Element _ _) = False
nullable (Attribute _ _) = False
nullable (List _) = False
nullable (Value _ _ _) = False
nullable (Data _ _) = False
nullable (DataExcept _ _ _) = False
nullable NotAllowed = False
nullable Empty = True
nullable Text = True
nullable (Region False _ _) = False
nullable (Region True _ p) = nullable p
nullable (Result _ p) = nullable p
inferred :: ResultDelta -> Bool
inferred (Inferred _) = True
inferred _ = False
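-- All the ways the pattern can accept an empty remainder; each alternative
-- carries the output deltas (inferred end tags included) that completing it
-- would emit.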
nullableResults:: Pattern -> [[ResultDelta]]
nullableResults (Group p1 p2) = case nullableResults p2 of [] -> []
l -> assert (all null l) (nullableResults p1)
nullableResults (Interleave 0 p1 p2) = liftM2 (++) (nullableResults p1) (nullableResults p2)
nullableResults Interleave{} = []
nullableResults (Choice p1 p2) = nullableResults p1 ++ nullableResults p2
nullableResults (OneOrMore p) = nullableResults p
nullableResults (Element _ _) = []
nullableResults (Attribute _ _) = []
nullableResults (List _) = []
nullableResults (Value _ _ _) = []
nullableResults (Data _ _) = []
nullableResults (DataExcept _ _ _) = []
nullableResults NotAllowed = []
nullableResults (Region False _ _) = []
nullableResults (Region True q p) = map (++ [Inferred EndTag]) (nullableResults p)
nullableResults Empty = [[]]
nullableResults Text = [[]]
nullableResults (Result r p) = map (r ++) (nullableResults p)
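-- Of all the ways to complete the current pattern, choose the one with the
-- least amount of inserted markup.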
bestNullableResult :: Pattern -> [ResultDelta]
bestNullableResult p = if nullableResults p == []
then error ("bestNullableResult: " ++ show p)
else minimumBy innermostLength (nullableResults p)
where innermostLength r1 r2 = compare (length r1) (length r2)
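-- The derivative of a pattern with respect to a single child node: for an
-- element, consume the start tag, attributes, content, and end tag in turn.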
childDeriv :: Context -> Pattern -> ChildNode -> Pattern
childDeriv cx p (TextNode s) = textDeriv cx False p s
childDeriv _ p (ElementNode qn cx atts children) =
let p1 = startTagOpenDeriv p (ElementNode qn cx atts [])
p2 = attsDeriv cx p1 atts
p3 = startTagCloseDeriv p2
p4 = childrenDeriv cx p3 children
in endTagDeriv p4
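-- The derivative with respect to a text node; the Bool argument records
-- whether the text is inferred rather than read from the input.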
textDeriv :: Context -> Bool -> Pattern -> String -> Pattern
textDeriv cx inf (Choice p1 p2) s = choice (textDeriv cx inf p1 s) (textDeriv cx inf p2 s)
textDeriv cx False (Interleave 0 p1 p2) s =
choice (interleave 0 (textDeriv cx False p1 s) p2)
(interleave 0 p1 (textDeriv cx False p2 s))
textDeriv cx False (Interleave balance p1 p2) s = if balance < 0
then interleave balance (textDeriv cx False p1 s) p2
else interleave balance p1 (textDeriv cx False p2 s)
textDeriv cx True (Interleave 0 p1 p2) s = choice
(interleave 0 (textDeriv cx True p1 s) p2)
(interleave 0 p1 (textDeriv cx True p2 s))
textDeriv cx True (Interleave balance p1 p2) s = if balance < 0
then interleave balance (textDeriv cx True p1 s) p2
else interleave balance p1 (textDeriv cx True p2 s)
textDeriv cx inf (Group p1 p2) s = let p = group (textDeriv cx inf p1 s) p2
in if nullable p1
then choice p (result (bestNullableResult p1) (textDeriv cx inf p2 s))
else p
textDeriv cx inf (OneOrMore p) s = group (textDeriv cx inf p s) (choice (OneOrMore p) Empty)
textDeriv cx inf Text s = result [(if inf then Inferred else id) (Add (TextNode s))] Text
textDeriv cx1 inf (Value dt value cx2) s = if datatypeEqual dt value cx2 s cx1
then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
else NotAllowed
textDeriv cx inf (Data dt params) s = if datatypeAllows dt params s cx
then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
else NotAllowed
textDeriv cx inf (DataExcept dt params p) s =
if datatypeAllows dt params s cx && not (nullable (textDeriv cx inf p s))
then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
else NotAllowed
textDeriv cx inf (List p) s = if nullable (listDeriv cx p (words s))
then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
else NotAllowed
textDeriv cx inf (Element (Name uri local) p) s =
let qn = QName uri local
in region True qn (textDeriv cx inf (startRecovery qn cx p) s)
textDeriv cx inf (Element (NameClassChoice n1 n2) p) s =
choice (textDeriv cx inf (Element n1 p) s) (textDeriv cx inf (Element n2 p) s)
textDeriv cx inf (Element _ _) s = NotAllowed
textDeriv cx inf (Region inferred qn p) s = region inferred qn (textDeriv cx inf p s)
textDeriv cx inf (Result r p) s = result r (textDeriv cx inf p s)
textDeriv _ _ _ _ = NotAllowed
listDeriv :: Context -> Pattern -> [String] -> Pattern
listDeriv _ p [] = p
listDeriv cx p (h:t) = listDeriv cx (textDeriv cx False p h) t
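-- Assume an omitted start tag of the given element: record it as inferred
-- output and discard content models that would require attributes.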
startRecovery :: QName -> Context -> Pattern -> Pattern
startRecovery qn cx p = result [Inferred (StartTag (ElementNode qn cx [] []))] (startTagCloseDeriv p)
region :: Bool -> QName -> Pattern -> Pattern
region inferred qname NotAllowed = NotAllowed
region inferred qname p = Region inferred qname p
result :: [ResultDelta] -> Pattern -> Pattern
result _ NotAllowed = NotAllowed
result r (Choice p1 p2) = Choice (result r p1) (result r p2)
result r1 (Result r2 p) = Result (r1 ++ r2) p
result [] p = p
result r p = Result r p
addResult :: Pattern -> [ResultDelta] -> Pattern
addResult NotAllowed _ = NotAllowed
addResult (Choice p1 p2) r = Choice (addResult p1 r) (addResult p2 r)
addResult (Group p1 p2) r = group (addResult p1 r) p2
addResult p@(Interleave balance p1 p2) r = interleave balance (addResult p1 r) p2
addResult (Result r1 p) r2 = result r1 (addResult p r2)
addResult (Region inferred qn p) r = Region inferred qn (addResult p r)
addResult p r = Result r p
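-- Smart constructors that simplify patterns as they are combined.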
choice :: Pattern -> Pattern -> Pattern
choice p NotAllowed = p
choice NotAllowed p = p
choice Empty Text = Text
choice Text Empty = Text
choice p1 p2 | p1 == p2 = p1
choice p1 p2 = Choice p1 p2
group :: Pattern -> Pattern -> Pattern
group p NotAllowed = NotAllowed
group NotAllowed p = NotAllowed
group p Empty = p
group Empty p = p
group Text Text = Text
group _ (Result _ _) = error "Can't have Result on the RHS of Group."
group (Result r p1) p2 = result r (group p1 p2)
group p1 p2 = Group p1 p2
interleave :: Int -> Pattern -> Pattern -> Pattern
interleave _ p NotAllowed = NotAllowed
interleave _ NotAllowed p = NotAllowed
interleave _ p Empty = p
interleave _ Empty p = p
interleave _ Text Text = Text
interleave balance p1 p2 = Interleave balance p1 p2
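-- The derivative with respect to an element start tag; a start tag that does
-- not match the required element is recovered by inferring the required
-- element's start tag and descending into its content model.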
startTagOpenDeriv :: Pattern -> ChildNode -> Pattern
startTagOpenDeriv (Choice p1 p2) node = choice (startTagOpenDeriv p1 node) (startTagOpenDeriv p2 node)
startTagOpenDeriv (Element (NameClassChoice n1 n2) p) node =
choice (startTagOpenDeriv (Element n1 p) node) (startTagOpenDeriv (Element n2 p) node)
startTagOpenDeriv (Element nc@(Name uri local) p) node@(ElementNode qn cx _ _) =
if contains nc qn
then Region False qn (result [StartTag node] p)
else let qn = QName uri local
in region True qn (startTagOpenDeriv (startRecovery qn cx p) node)
startTagOpenDeriv (Element nc p) node@(ElementNode qn cx _ _) =
if contains nc qn then Region False qn (result [StartTag node] p) else NotAllowed
startTagOpenDeriv (Interleave 0 p1 p2) node =
choice
(interleave (-1) (startTagOpenDeriv p1 node) p2)
(interleave 1 p1 (startTagOpenDeriv p2 node))
startTagOpenDeriv (Interleave balance p1 p2) node =
if balance < 0
then interleave (balance-1) (startTagOpenDeriv p1 node) p2
else interleave (balance+1) p1 (startTagOpenDeriv p2 node)
startTagOpenDeriv (OneOrMore p) node = group (startTagOpenDeriv p node) (choice (OneOrMore p) Empty)
startTagOpenDeriv (Group p1 p2) node =
let x = group (startTagOpenDeriv p1 node) p2
in if nullable p1
then choice x (result (bestNullableResult p1) (startTagOpenDeriv p2 node))
else x
startTagOpenDeriv (Region inferred qn p) node = region inferred qn (startTagOpenDeriv p node)
startTagOpenDeriv (Result r p) node = result r (startTagOpenDeriv p node)
startTagOpenDeriv _ _ = NotAllowed
attsDeriv :: Context -> Pattern -> [AttributeNode] -> Pattern
attsDeriv cx p [] = p
attsDeriv cx p ((AttributeNode qn s):t) = attsDeriv cx (attDeriv cx p (AttributeNode qn s)) t
attDeriv :: Context -> Pattern -> AttributeNode -> Pattern
attDeriv cx (Choice p1 p2) att = choice (attDeriv cx p1 att) (attDeriv cx p2 att)
attDeriv cx (Group p1 p2) att = choice (group (attDeriv cx p1 att) p2) (group p1 (attDeriv cx p2 att))
attDeriv cx (Interleave balance p1 p2) att =
choice (interleave balance (attDeriv cx p1 att) p2) (interleave balance p1 (attDeriv cx p2 att))
attDeriv cx (OneOrMore p) att = group (attDeriv cx p att) (choice (OneOrMore p) Empty)
attDeriv cx (Attribute nc p) (AttributeNode qn s) = if contains nc qn && valueMatch cx p s
then Empty else NotAllowed
attDeriv cx (Region inferred qn p) att = Region inferred qn (attDeriv cx p att)
attDeriv cx (Result r p) att = result r (attDeriv cx p att)
attDeriv _ _ _ = NotAllowed
valueMatch :: Context -> Pattern -> String -> Bool
valueMatch cx p s = (nullable p && whitespace s) || nullable (textDeriv cx False p s)
startTagCloseDeriv :: Pattern -> Pattern
startTagCloseDeriv (Choice p1 p2) = choice (startTagCloseDeriv p1) (startTagCloseDeriv p2)
startTagCloseDeriv (Group p1 p2) = group (startTagCloseDeriv p1) (startTagCloseDeriv p2)
startTagCloseDeriv (Interleave balance p1 p2) = interleave balance (startTagCloseDeriv p1) (startTagCloseDeriv p2)
startTagCloseDeriv (OneOrMore p) = oneOrMore (startTagCloseDeriv p)
startTagCloseDeriv (Attribute _ _) = NotAllowed
startTagCloseDeriv (Region inferred qn p) = Region inferred qn (startTagCloseDeriv p)
startTagCloseDeriv (Result r p) = result r (startTagCloseDeriv p)
startTagCloseDeriv p = p
oneOrMore :: Pattern -> Pattern
oneOrMore NotAllowed = NotAllowed
oneOrMore Empty = Empty
oneOrMore Text = Text
oneOrMore p@OneOrMore{} = p
oneOrMore (Choice p Empty) = Choice (oneOrMore p) Empty
oneOrMore (Choice Empty p) = Choice Empty (oneOrMore p)
oneOrMore (Result r p) = result r (oneOrMore p)
oneOrMore p = OneOrMore p
childrenDeriv :: Context -> Pattern -> [ChildNode] -> Pattern
childrenDeriv _ NotAllowed _ = NotAllowed
childrenDeriv cx p [] = let p1 = textDeriv cx True p ""
in choice (addResult p [Inferred (Add (TextNode ""))]) p1
childrenDeriv cx p [(TextNode s)] = let p1 = childDeriv cx p (TextNode s)
in if whitespace s
then choice (addResult p [Add (TextNode s)]) p1
else p1
childrenDeriv cx p children = stripChildrenDeriv cx p children
stripChildrenDeriv :: Context -> Pattern -> [ChildNode] -> Pattern
stripChildrenDeriv _ p [] = p
stripChildrenDeriv cx p (h:t) = stripChildrenDeriv cx (if strip h
then (addResult p [Add h])
else childDeriv cx p h) t
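-- The derivative with respect to an end tag from the input: complete the
-- matching input region with its cheapest completion, closing any inferred
-- regions opened inside it along the way.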
endTagDeriv :: Pattern -> Pattern
endTagDeriv (Choice p1 p2) = choice (endTagDeriv p1) (endTagDeriv p2)
endTagDeriv (Region False qn p) = if nullable p
then result (bestNullableResult p ++ [EndTag]) Empty
else region False qn (endTagDeriv p)
endTagDeriv (Region True qn p) = region True qn (endTagDeriv p)
endTagDeriv (Result r p) = result r (endTagDeriv p)
endTagDeriv (Group p1 p2) = group (endTagDeriv p1) p2
endTagDeriv (Interleave balance p1 p2) | balance < 0 = interleave (balance+1) (endTagDeriv p1) p2
| balance == 0 = NotAllowed
| balance > 0 = interleave (balance-1) p1 (endTagDeriv p2)
endTagDeriv _ = NotAllowed
References
[a72]
AV Aho, JD Ullman, 1972.
The Theory of Parsing, Translation, and Compiling.
Prentice Hall
[a03]
Mohammad Abolhassani, Norbert Fuhr and Norbert Gövert,
Information extraction and automatic markup for XML documents,
In Blanken et al, 2003, 159--174,
Springer. doi:https://doi.org/10.1007/978-3-540-45194-5_11
[b59]
Backus, J.W.,
The Syntax and Semantics of the Proposed International Algebraic Language of Zürich
ACM-GAMM Conference,
Proceedings of the International Conference on Information Processing, UNESCO,
1959, pp.125-132.
[b64]
Brzozowski, J. A. 1964. Derivatives of Regular Expressions. J. ACM 11,
4 (Oct. 1964), 481-494.
doi:https://doi.org/10.1145/321239.321249
[c56]
Chomsky, Noam (1956). "Three models for the description of language".
IRE Transactions on Information Theory 2: 113–124.
doi:https://doi.org/10.1109/TIT.1956.1056813
[c01]
James Clark and Makoto Murata. RELAX NG Specification.
http://relaxng.org/spec-20011203.html, 2001. ISO/IEC 19757-2:2003.
[c02]
James Clark. An algorithm for RELAX NG validation
http://www.thaiopensource.com/relaxng/derivative.html
[c10]
QuickCheck: Automatic testing of Haskell programs
http://hackage.haskell.org/package/QuickCheck-2.1.0.3
[f93]
Peter Fankhauser and Yi Xu,
MarkItUp! - An incremental approach to document structure recognition,
Electronic Publishing, 1993, pages 447-456
[k95]
Eila Kuikka and Martti Penttonen,
Transformation of Structured Documents,
Electronic Publishing Origination, Dissemination and Design, 8(4), 1995.
[k97]
Bertin Klein and Peter Fankhauser,
Error tolerant Document Structure Analysis,
International Journal on Digital Libraries,
1997, volume 1, pages 344-357.
doi:https://doi.org/10.1007/s007990050028
[l68]
Lewis, P. M. and Stearns, R. E. 1968. Syntax-Directed Transduction.
J. ACM 15, 3 (Jul. 1968), 465-488.
doi:https://doi.org/10.1145/321466.321477
[l97]
Greger Lindén,
Structured Document Transformations,
1997
[m56]
Moore, E. F., [1956]. Gedanken experiments on sequential machines,
Automata Studies, Princeton Univ. Press,
Princeton, New Jersey, pp. 129-153.
[m96]
Makoto Murata,
Transformation of Documents and Schemas by Patterns and Contextual Conditions,
Proceedings of the Third International Workshop on Principles of Document Processing
(PODP 96),
1997, pages 153-169, Springer-Verlag. doi:https://doi.org/10.1007/3-540-63620-X_61
[s05]
Sperberg-McQueen, C. M. Applications of Brzozowski derivatives to XML schema processing.
In Extreme Markup Languages 2005, page 26, Internet, 2005. IDEAlliance.
[t98]
Kazem Taghva, Allen Condit, and Julie Borsack,
Autotag: A tool for creating structured document collections from printed materials,
Electronic Publishing, Artistic Imaging, and Digital Typography, Proc. of the EP ’98
and RIDT
’98 Conferences, 1998, pages 420-431,
Springer-Verlag
[t03]
Tang, X. 2003 A High-Level Specification Language for Structured Document Transformation.
Doctoral Thesis.
UMI Order Number: AAINQ84932., University of Waterloo.
[v88]
Dénes Vargha,
Schema method: a framework for correcting grammatically ill-formed input
Proceedings of the 12th conference on Computational linguistics - Volume 1
Computer and Automation Institute, Hungarian Academy of Sciences
Pages 341 - 347
Association for Computational Linguistics Morristown, NJ, USA ©1988
ISBN: 9638431563. doi:https://doi.org/10.3115/991635.991705
[h02]
Haskell 98 Language and Libraries, the Revised Report. December 2002.
http://haskell.org/onlinereport/
[s86]
Standard Generalized Markup Language (SGML)
International Organization for Standardization ISO 8879:1986
[s09a]
OmniMark language documentation
http://developers.omnimark.com/docs-extract/html/index.htm
[s09b]
OmniMark RELAX NG (OMRELAXNG) library documentation
http://developers.omnimark.com/docs-extract/html/library/125.htm
[w04]
XML Schema Part 1: Structures Second Edition, Analysis of the Unique Particle Attribution
Constraint
W3C Recommendation 28 October 2004
http://www.w3.org/TR/xmlschema-1/#non-ambig