Note
I want to thank my colleagues Jacques Légaré, Joe Gollner, Patrick Baker, and Roy Amodeo for their valuable comments, and Stilo International for providing the initial motivation for the work and giving me time to do it properly.
Introduction
A grammar is a set of rules that describes the composition of a language. Grammars have existed for at least two and a half millennia, but their use was mainly didactic. The practical application of grammars began with their mathematical formalization in the 1950s c56 and the emergence of computer programming languages b59.
There are three ways to apply a grammar:
- Validate a given sentence to determine if it is grammatically correct.
- Generate a correct sentence according to the grammar.
- Parse a given sentence to determine its structure and constituent parts.
The historical motivation for the development of the theory of formal languages was to explain how valid sentences are generated as well as how they are recognized and parsed. Generative grammars are still an active area of research in linguistics. In computer science, however, parsing is the primary role of grammars; their use for generating valid sentences appears to be forgotten.
One reason for this lack of interest in grammar-driven generation is that the problem is poorly defined. The output of a generator is a grammatically valid sentence. But what is its input, other than the grammar? The task of a parser is clearer: while its output in practice can be anything, it is usually defined as a parse tree of some form.
While many kinds of generator inputs are possible and interesting to contemplate, we shall concentrate on the specific task of generating a grammatically valid sentence from a sequence of tokens (i.e., terminals) allowed by the grammar. Furthermore, the generator must exactly reproduce any input that is already a grammatically valid sentence. We shall use the term normalization to refer to this form of sentence generation, and the term normalizer for the generator that performs it.
Given a normalizer that maps a set of token sequences T into a set of canonical, grammatically valid sequences S, one can define a new language S' as the union of the core language S and the non-core language T.
For an actual example of this kind of language definition, we need look no further than SGML. An SGML parser performs parsing, normalization, and validation at the same time. For example, given the following SGML DTD that declares the end tag of element a and both tags of element b to be omissible:
<!element a - o (b)>
<!element b o o (#pcdata)>
an SGML parser with the OMITTAG feature enabled will parse the instance <a>Hello, World! into a syntax tree corresponding to <a><b>Hello, World!</b></a>. Any conformant SGML parser with support for omissible tags will consider these two instances equivalent, as long as the same DTD is used to parse them. If the DTD should change, however, an instance that omits element tags may become not only invalid but unparsable.
For this reason and others, an XML parser does not perform any normalization. A well-formed XML instance is always parsed in the same way according to a fixed grammar, regardless of what DTD or schema the instance happens to also conform to. Various types of XML schemas are used only for validation, not for parsing or normalization.
This paper will demonstrate how an XML schema (a RELAX NG schema specifically) can be used to generate a valid instance of the schema from an invalid but well-formed XML instance. In other words, we shall normalize a well-formed XML instance so it conforms to a given schema.
The only element tags allowed in the input are those that occur in the target schema. If normalization succeeds with no errors, the output of the normalizer will:
- be well-formed XML,
- conform to the target schema, and
- contain the entire input instance as its subsequence, interleaved with additional element tags. [1]
The normalizer is much more forgiving than the SGML parser in what kinds of input it can normalize. The target grammar is allowed to be ambiguous, though this is not recommended for performance reasons. There is no unique particle attribution constraint w04, nor is the generation restricted to contextually required elements s86. If the input instance is ambiguous, i.e., there is more than one way to normalize it, the normalization with the least amount of added markup is chosen. This choice is not made at the point where the particle is encountered, because subsequent content could invalidate it; instead it is deferred until the enclosing element ends. The biggest constraint is that only plain element tags with no attributes can be inserted. Even this constraint could be overcome by automatically generating arbitrary grammatically correct values for all required attributes, but we felt that this would cause more harm than benefit in practice.
The primary purpose of SGML's minimization features was to make SGML easier to author. We do not recommend using the XML normalizer in this way; we feel it is important that every XML document in permanent storage conform to a clearly expressed schema.
The legitimate place for the normalization process is inside a processing pipeline, where it can usefully consume or produce a transient XML document instance. Normalization can thus be applied before another process in order to transform the input document into a form that is easier to process. The target schema for the normalizer may be just a stricter form of the input document's schema, or it may specify additional disambiguating markup.
Alternatively, a normalizer can appear at the opposite end of the processing pipeline, where it ensures that the output of the preceding process conforms to the target schema. We shall demonstrate the practical use of this latter possibility in a later section.
The rest of the paper is organized as follows: the next section presents an example of the normalizer's use, and the section after it explains the normalizer's design and implementation. The following section describes the purpose and results of the normalizer in production, and the paper concludes with a discussion of related work.
Example
This section will demonstrate the capabilities of our normalizer through an example. Let us take the following RELAX NG schema as our target:
start = document
block = p | ol | ul
document = element document { title, block+, section* }
section = element section { title, block+, section* }
title = element title { text }
p = element p { text }
ol = element ol { li+ }
ul = element ul { li+ }
li = element li { block+ }
The goal is to convert our input to a valid instance of this grammar. As for the input, it may start as a plain text file like the following one:
Normalizer test input
This is a short test input for the normalizer.
Purpose
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
Constraints
The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data. These must be present in the input.
We first need to ensure that our input is a well-formed XML document, which can be achieved simply by putting the entire text inside <document>...</document> tags. If we then run this XML instance through the normalizer, it will be converted to
<document>
  <title></title>
  <p>
    Normalizer test input
    This is a short test input for the normalizer.
    Purpose
    The purpose of this document is to demonstrate the capabilities of the normalizer.
    The normalizer is schema-driven, and it tries to fit its input into the target schema.
    Constraints
    The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed:
    - A piece of input that cannot be fit into the schema will be dropped.
    - The normalizer does not attempt to conjure any required attribute values nor text data. These must be present in the input.
  </p>
</document>
As we can verify, the output indeed conforms to the target schema. The content model of document requires a title element, for example, and this element has been inserted. Since it has no content, though, it may not be all that useful.
In order to produce a more appropriate instance, we need to place certain constraints on the solution. One way to do this would be to make the schema stricter: we could require the title content to be non-empty, and the paragraph content to contain no line breaks. Alternatively, we can specify the constraints by adding the desired markup to the input document. For example, we can mark up all input lines that appear to be titles:
<document>
<title>Normalizer test input</title>
This is a short test input for the normalizer.
<title>Purpose</title>
The purpose of this document is to demonstrate the capabilities of the normalizer.
The normalizer is schema-driven, and it tries to fit its input into the target schema.
<title>Constraints</title>
The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed:
- A piece of input that cannot be fit into the schema will be dropped.
- The normalizer does not attempt to conjure any required attribute values nor text data. These must be present in the input.
</document>
Given this enriched input together with the original schema, the normalizer produces the following output:
<document>
  <title>Normalizer test input</title>
  <p>
    This is a short test input for the normalizer.
  </p>
  <section>
    <title>Purpose</title>
    <p>
      The purpose of this document is to demonstrate the capabilities of the normalizer.
      The normalizer is schema-driven, and it tries to fit its input into the target schema.
    </p>
    <section>
      <title>Constraints</title>
      <p>
        The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed:
        - A piece of input that cannot be fit into the schema will be dropped.
        - The normalizer does not attempt to conjure any required attribute values nor text data. These must be present in the input.
      </p>
    </section>
  </section>
</document>
The good news is that the normalizer has not only respected and kept our title markup, it has inferred that every title but the first must start a new section. The bad news is that the second section has been nested within the first. While this does conform to the target schema, the two sections belong at the same level. Furthermore, we want every line to be a separate paragraph, and the apparent list at the end of the document should be marked up as such.
In this simple example we could just visually determine where each section, paragraph, or list item starts and ends, and mark them up with XML element tags. In practice, however, we would rather have this work done automatically. A simple pattern-matching rule would be capable of inserting a start tag wherever a line begins with a hyphen, but finding where the element should end is much more difficult. To make this marking-up process easier, the normalizer recognizes a number of processing instructions as output constraints. Once we apply them, our input document might look like this:
<document>
<title>Normalizer test input</title>
<?derivative.start-anew <p>?>This is a short test input for the normalizer.
<?derivative.start-anew <section>?><title>Purpose</title>
<?derivative.start-anew <p>?>The purpose of this document is to demonstrate the capabilities of the normalizer.
<?derivative.start-anew <p>?>The normalizer is schema-driven, and it tries to fit its input into the target schema.
<?derivative.start-anew <section>?><title>Constraints</title>
<?derivative.start-anew <p>?>The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed:
<?derivative.proceed-with <ul>?><?derivative.start-anew <li>?> A piece of input that cannot be fit into the schema will be dropped.
<?derivative.proceed-with <ul>?><?derivative.start-anew <li>?> The normalizer does not attempt to conjure any required attribute values nor text data.
<?derivative.start-anew <p>?>These must be present in the input.
</document>
The processing instructions essentially act as placeholders for element start tags, letting the normalizer know which element is supposed to enclose the following content. The exact location of the element's end tag will be inferred by the normalizer. The details of the processing instructions and the inference algorithm will be explained in the next section. The normalizer's output, based on our last enriched input above, looks as follows:
<document>
  <title>Normalizer test input</title>
  <p>This is a short test input for the normalizer.</p>
  <section>
    <title>Purpose</title>
    <p>The purpose of this document is to demonstrate the capabilities of the normalizer.</p>
    <p>The normalizer is schema-driven, and it tries to fit its input into the target schema.</p>
  </section>
  <section>
    <title>Constraints</title>
    <p>The goal of the normalizer is to produce output that conforms to the schema, while preserving complete input. It does not always succeed:</p>
    <ul>
      <li>
        <p>A piece of input that cannot be fit into the schema will be dropped.</p>
      </li>
      <li>
        <p>The normalizer does not attempt to conjure any required attribute values nor text data.</p>
        <p>These must be present in the input.</p>
      </li>
    </ul>
  </section>
</document>
Design and Implementation
The basic idea that drove the normalizer design came after implementing the validator automaton c02 for RELAX NG c01. This automaton uses a flexible validation algorithm based on Brzozowski derivatives b64. We shall now sketch the basic idea of this validation algorithm; a more thorough treatment can be found in the literature a72.
Informally, the derivative of a grammar G with respect to a sequence of tokens S is the grammar G', such that a sequence S' is valid according to G' if and only if the concatenation of S and S' is valid according to G.
We can think of the grammar G as the initial state of a validating automaton, and the grammar derivative calculation as its transition function. A very simple and generic validator algorithm can be built on top of this concept s05, and one such validator is the standard RELAX NG validation algorithm.
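To make this concrete, here is a minimal sketch of derivative-based matching, written against a toy regular-expression type rather than the full RELAX NG pattern language; the names are ours, not the paper's:

module DerivSketch where

-- A toy regular-expression type over tokens of type t.
data Re t = Fail | Eps | Tok t
          | Alt (Re t) (Re t) | Seq (Re t) (Re t) | Star (Re t)

-- Does the expression match the empty sequence?
nullable :: Re t -> Bool
nullable Eps       = True
nullable (Star _)  = True
nullable (Alt a b) = nullable a || nullable b
nullable (Seq a b) = nullable a && nullable b
nullable _         = False

-- The derivative with respect to one token: the language of
-- everything that may legally follow that token.
deriv :: Eq t => Re t -> t -> Re t
deriv Fail      _ = Fail
deriv Eps       _ = Fail
deriv (Tok t)   c = if t == c then Eps else Fail
deriv (Alt a b) c = Alt (deriv a c) (deriv b c)
deriv (Seq a b) c
  | nullable a    = Alt (Seq (deriv a c) b) (deriv b c)
  | otherwise     = Seq (deriv a c) b
deriv (Star a)  c = Seq (deriv a c) (Star a)

-- A sequence is valid iff the accumulated derivative is nullable,
-- e.g. matches (Seq (Tok 'a') (Star (Tok 'b'))) "abb" == True.
matches :: Eq t => Re t -> [t] -> Bool
matches re = nullable . foldl deriv re

Here the expression plays the role of the automaton state and deriv plays the role of the transition function, exactly as described above for grammars.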
While working on the implementation of a RELAX NG validator library s09b for OmniMark s09a, it became clear that the automaton could be extended relatively easily in two orthogonal ways: with additional pattern constructs, and with additional actions to perform. In particular, the validating automaton could be modified to produce arbitrary output, in the same way a deterministic finite automaton can be extended into a Moore machine m56.
The original RELAX NG validation algorithm c02 represents the grammar as a tree data structure. Here is its definition in Haskell h02:
data Pattern = Empty
             | NotAllowed
             | Text
             | Choice Pattern Pattern
             | Interleave Pattern Pattern
             | Group Pattern Pattern
             | OneOrMore Pattern
             | List Pattern
             | Data Datatype ParamList
             | DataExcept Datatype ParamList Pattern
             | Value Datatype String Context
             | Attribute NameClass Pattern
             | Element NameClass Pattern
             | After Pattern Pattern
The constructors of this data type, with the exception of the last one, correspond to the constructs of the RELAX NG schema language. The last constructor, After, appears when a start tag of an element is consumed, and matches the remainder of the XML input together with the corresponding end tag.
We need two extensions to this data structure in order to enable the automaton to perform normalization. First, we need to be able to assume the presence of an omitted element tag. Secondly, we must be able to output the normalized version of our input, together with the omitted and inferred tags. So we add two more data constructors:
data Pattern = Empty
             | ...
             | Region Bool QName Pattern
             | Result [ResultDelta] Pattern

data ResultDelta = StartTag ChildNode
                 | Add ChildNode
                 | EndTag
                 | Inferred ResultDelta
The Result constructor keeps track of the output that should be emitted if the pattern it wraps succeeds. Whenever the current pattern (i.e., the state of the automaton) has a Result node at its root, the accumulated chunk of output is flushed and the root node is replaced by its child pattern.
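A minimal sketch of this flushing step, using the types from Appendix A (the appendix listing only accumulates Result nodes and leaves any incremental flushing to a driver loop):

-- If the automaton's current state has a Result node at its root,
-- emit its accumulated deltas and continue with the wrapped pattern.
flushResult :: Pattern -> ([ResultDelta], Pattern)
flushResult (Result deltas p) = (deltas, p)
flushResult p                 = ([], p)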
The Region constructor denotes an open element region, which may have come from the input or may have been inserted in order to normalize it. In the former role, it acts as a replacement for the After node. To accommodate this change in the representation of open element regions, we have to make another adjustment, to the Interleave constructor: it needs an extra integer field to keep track of which of its patterns has consumed unmatched start tags, and how many of them.
As long as the input conforms to the target schema, the normalizer acts the same way as the RELAX NG validator, except that it builds the Result nodes and flushes them to the output. When a particle that doesn't match the current pattern is encountered, the normalizer attempts to recover by descending into the content models of elements that are allowed at the current point. The derivative of the current pattern is then the sum of the derivatives of all content models that accept the particle, with each content derivative wrapped in a Region node and a Result node; these keep track of the inferred element start tag.
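The relevant recovery fragment from Appendix A, here for a text node that the current Element pattern does not accept, with added comments:

-- Excerpt from Appendix A: a text node that the Element pattern cannot
-- accept is recovered by descending into the element's content model.
textDeriv cx inf (Element (Name uri local) p) s =
  let qn = QName uri local
  in  region True qn                  -- wrap in an inferred Region...
        (textDeriv cx inf (startRecovery qn cx p) s)

-- ...after recording the inferred start tag in a Result node.
startRecovery :: QName -> Context -> Pattern -> Pattern
startRecovery qn cx p =
  result [Inferred (StartTag (ElementNode qn cx [] []))]
         (startTagCloseDeriv p)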
This modification of the RELAX NG validation algorithm results in a normalizer whose complete source code is listed in Appendix A. This simple normalizer is completely correct, but far from perfect: it easily runs into a combinatorial explosion of the pattern tree, and it cannot be customized to resolve ambiguity in a desired way. While the available space is too short to describe all of the additional features and optimizations, the role of the processing instruction guides is too important to leave out.
Many interesting target schemas contain a large number of elements which have nearly identical content models and which are allowed to appear in the same places. Such is the case, for example, with the DocBook 5 elements glossary, bibliography, index, toc, dedication, acknowledgements, preface, chapter, appendix, article, and colophon. This represents a problem for the normalizer because it must keep track of all the alternative elements in case of recovery. If the user can specify which of the elements should be used for recovery, this improves both the output and the performance.
At the moment, the normalizer accepts the following processing instructions:
- <?derivative.start-anew depth? start-tag?>
- <?derivative.start-nested depth? start-tag?>
- <?derivative.proceed-with depth? start-tag?>
- <?derivative.ensure-inside element-name?>
- <?derivative.ensure-outside element-name?>
where depth has the form region-identifier:integer.
Processing instructions are used instead of element tags because the input document has to be well-formed XML, so it cannot contain unmatched element start tags. Using empty element tags in a special namespace could cause difficulties because the target schema may or may not allow them in that position. Finally, since these instructions target one specific tool, the normalizer, this is an appropriate use of processing instructions if there ever was one.
The normalizer evaluates these processing instructions as follows:
start-anew
    Any existing element region that was opened with the same region-identifier and a larger or equal integer depth will be closed. If depth is not specified, any existing element region with the same element name will be closed. A new element region is then started as if the specified start-tag appeared instead of the instruction.
start-nested
    Any existing element region that was opened with the same region-identifier and a larger integer depth will be closed. A new element region is then started as if the specified start-tag appeared instead of the instruction.
proceed-with
    Any existing element region that was opened with the same region-identifier and a larger integer depth will be closed. If there is an existing element region with the same region-identifier and an equal integer depth, this region is kept and nothing else happens. Otherwise a new element region is started as if the specified start-tag appeared instead of the instruction.
ensure-inside
    The current pattern is trimmed to leave only those possibilities which assume the specified element-name is currently open.
ensure-outside
    The existing schema of the remainder of the document is trimmed to leave only those possibilities which assume the specified element-name is not currently open.
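To illustrate the role of depth, consider this hypothetical variation of the earlier example (the Details title and the sect identifier are invented; we assume the schema's nested section content model). By the rules above, the second start-anew closes the open sect region of equal depth, making the two sections siblings, while the third leaves it open and nests a subsection within it:

<?derivative.start-anew sect:1 <section>?><title>Purpose</title>
...
<?derivative.start-anew sect:1 <section>?><title>Constraints</title>
...
<?derivative.start-anew sect:2 <section>?><title>Details</title>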
Results
The schema-driven normalizer described above has been implemented in the OmniMark programming language, to serve as a component in various document conversion solutions. We shall now briefly describe this environment in order to illustrate the practical benefits of the approach.
The wider purpose is to convert documents from a legacy presentation-oriented format into a semantically richer XML format. Some examples of the source document format are FrameMaker, HTML, InDesign, and QuarkXPress. The target format of the conversion can be DITA, DocBook, or a user-specified XML format.
One obvious simplification when faced with converting M source formats to N target formats is to introduce an intermediate format, and thus replace M×N direct converters by M input converters to the intermediate format plus N output converters from the intermediate to the target format (with four source and three target formats, for example, seven converters instead of twelve). Another benefit of having the intermediate document is that it can be enriched with user annotations to guide the output converter.
Experience we gained in building specific output converters showed that hand-written implementations were difficult to verify and maintain, so we needed a more robust and more flexible solution. Some customers required outputs that conform to a specialization of DITA, others wanted DocBook with various extensions.
The core of the output converters is the topic of this paper: the schema-driven normalizer. Its draft implementation, given in the appendix, was coded in Haskell and tested using the QuickCheck library c10. After this proof of concept, the production implementation was developed in OmniMark. Once the normalizer was ready, the old hand-coded output converters for DITA and DocBook were replaced with new converters in a matter of days.
Compared to the hand-written output converters they replaced, the converters based on the schema-driven normalizer provide several benefits:
- Assuming the normalizer itself is correct, the output of the converter is guaranteed to conform to the target schema. The correctness of hand-written converters' output is more difficult to verify.
- A new target format can be supported very quickly as long as we have a schema for it, whereas before it would have required rewriting the entire converter. If the new target language is a strict subset of an existing one, no coding is required whatsoever. For a small extension of an already supported schema, only a few lines of code must be added.
- Many users have various stylistic requirements that can be supported simply by restricting the target schema, with no code modifications required.
- The new converters are capable of automatically reordering parts of the input to make it conform to the target schema, with no user intervention required beyond the usual annotations. One example that we have often encountered in practice is that of DITA task topics: the DITA 1.1 schema requires the prereq, context, steps, result, example, and postreq elements to be listed in this specific order. Most existing technical documentation is not written this way. [2]
One downside of the new converters is their performance, especially their worst-case memory consumption. The worst case in question is generally an input document which is annotated in a way that makes it non-conformant with the target schema. The hand-coded converters would in this situation either report errors (which is the desired behaviour) or generate invalid output (which is not). The schema-driven converters try to recover by inserting the missing element tags and by rearranging the input. The recovery eventually fails and the normalizer produces an error report, but its attempt costs a lot of time and memory. We are still looking for a way to fix this issue.
Related Work
Automatic grammar-driven sentence normalization has been done in linguistics, for the automatic correction of grammatically incorrect English sentences v88.
For markup languages, particularly XML, there has been some related work in the area of automatic markup f93 k97 t98. The main difference from our work is that the input of an automatic markup system is typically plain text, usually produced by OCR. As a consequence, even those automatic markup systems whose output schema is not fixed require additional configuration in the form of pattern-matching rules. The normalizer approach presented in this paper can be coupled with a pattern-matching engine as well, but it does not depend on one.
The automatic markup system that seems nearest to ours is the Vasari project a03, which is also schema-driven to some extent. The target schema must be enriched using a special pattern-matching language which determines where the required element tags can be inserted. It is not clear if the location of every tag required by the target schema must be specified this way.
There is a large body of work on the automatic translation between structured documents that conform to similar schemas a72 m96 t03. The approaches based on the concept of syntax-directed translation schema l68 and its extensions, such as extended syntax-directed translation schema k95 and tree transformation grammars l97, seem to be most related to our work. The syntax-directed translation is simpler and more direct than our approach to translation, but it requires both the source and target format to be well structured. Our problem, quite common in practice, is that the source format is only weakly structured. The transformation process presented in this paper proceeds in two stages: the weakly structured source document is first translated into another weakly structured document which uses the tags allowed by the target schema. This intermediate document is then normalized to a valid schema instance.
Appendix A. Sample implementation
This appendix contains a sample normalizer implementation in Haskell. The implementation depends on the sample code for the RELAX NG validator c02, which it expects to find in the module file Validator.hs.
module Normalizer where

import Control.Monad (liftM2)
import Control.Exception (assert)
import Data.List (minimumBy)

-- The RELAX NG validator automaton code from
-- http://www.thaiopensource.com/relaxng/derivative.html
import Validator (contains, datatypeAllows, datatypeEqual, strip, whitespace,
                  Uri, LocalName, ParamList, Prefix, Context, Datatype,
                  NameClass(..), QName(..), ChildNode(..), AttributeNode(..))

data Pattern = Empty
             | NotAllowed
             | Text
             | Choice Pattern Pattern
             | Interleave Int Pattern Pattern
             | Group Pattern Pattern
             | OneOrMore Pattern
             | List Pattern
             | Data Datatype ParamList
             | DataExcept Datatype ParamList Pattern
             | Value Datatype String Context
             | Attribute NameClass Pattern
             | Element NameClass Pattern
             | Region Bool QName Pattern
             | Result [ResultDelta] Pattern
             deriving (Eq, Show)

data ResultDelta = StartTag ChildNode
                 | Add ChildNode
                 | EndTag
                 | Inferred ResultDelta
                 deriving (Eq, Show)

-- Can the pattern match the empty sequence?
nullable :: Pattern -> Bool
nullable (Group p1 p2) = nullable p1 && nullable p2
nullable (Interleave balance p1 p2) = balance == 0 && nullable p1 && nullable p2
nullable (Choice p1 p2) = nullable p1 || nullable p2
nullable (OneOrMore p) = nullable p
nullable (Element _ _) = False
nullable (Attribute _ _) = False
nullable (List _) = False
nullable (Value _ _ _) = False
nullable (Data _ _) = False
nullable (DataExcept _ _ _) = False
nullable NotAllowed = False
nullable Empty = True
nullable Text = True
nullable (Region False _ _) = False
nullable (Region True _ p) = nullable p
nullable (Result _ p) = nullable p

inferred :: ResultDelta -> Bool
inferred (Inferred _) = True
inferred _ = False

-- All possible outputs with which the pattern can match the empty sequence.
nullableResults :: Pattern -> [[ResultDelta]]
nullableResults (Group p1 p2) = case nullableResults p2 of
                                  [] -> []
                                  l -> assert (all null l) (nullableResults p1)
nullableResults (Interleave 0 p1 p2) = liftM2 (++) (nullableResults p1) (nullableResults p2)
nullableResults Interleave{} = []
nullableResults (Choice p1 p2) = nullableResults p1 ++ nullableResults p2
nullableResults (OneOrMore p) = nullableResults p
nullableResults (Element _ _) = []
nullableResults (Attribute _ _) = []
nullableResults (List _) = []
nullableResults (Value _ _ _) = []
nullableResults (Data _ _) = []
nullableResults (DataExcept _ _ _) = []
nullableResults NotAllowed = []
nullableResults (Region False _ _) = []
nullableResults (Region True q p) = map (++ [Inferred EndTag]) (nullableResults p)
nullableResults Empty = [[]]
nullableResults Text = [[]]
nullableResults (Result r p) = map (r ++) (nullableResults p)

-- Choose the nullable result with the least amount of inserted markup.
bestNullableResult :: Pattern -> [ResultDelta]
bestNullableResult p = if nullableResults p == []
                       then error ("bestNullableResult: " ++ show p)
                       else minimumBy innermostLength (nullableResults p)
  where innermostLength r1 r2 = compare (length r1) (length r2)

-- The derivative of a pattern with respect to a complete child node.
childDeriv :: Context -> Pattern -> ChildNode -> Pattern
childDeriv cx p (TextNode s) = textDeriv cx False p s
childDeriv _ p (ElementNode qn cx atts children) =
  let p1 = startTagOpenDeriv p (ElementNode qn cx atts [])
      p2 = attsDeriv cx p1 atts
      p3 = startTagCloseDeriv p2
      p4 = childrenDeriv cx p3 children
  in endTagDeriv p4

textDeriv :: Context -> Bool -> Pattern -> String -> Pattern
textDeriv cx inf (Choice p1 p2) s = choice (textDeriv cx inf p1 s) (textDeriv cx inf p2 s)
textDeriv cx False (Interleave 0 p1 p2) s =
  choice (interleave 0 (textDeriv cx False p1 s) p2)
         (interleave 0 p1 (textDeriv cx False p2 s))
textDeriv cx False (Interleave balance p1 p2) s =
  if balance < 0
  then interleave balance (textDeriv cx False p1 s) p2
  else interleave balance p1 (textDeriv cx False p2 s)
textDeriv cx True (Interleave 0 p1 p2) s =
  choice (interleave 0 (textDeriv cx True p1 s) p2)
         (interleave 0 p1 (textDeriv cx True p2 s))
textDeriv cx True (Interleave balance p1 p2) s =
  if balance < 0
  then interleave balance (textDeriv cx True p1 s) p2
  else interleave balance p1 (textDeriv cx True p2 s)
textDeriv cx inf (Group p1 p2) s =
  let p = group (textDeriv cx inf p1 s) p2
  in if nullable p1
     then choice p (result (bestNullableResult p1) (textDeriv cx inf p2 s))
     else p
textDeriv cx inf (OneOrMore p) s =
  group (textDeriv cx inf p s) (choice (OneOrMore p) Empty)
textDeriv cx inf Text s = result [(if inf then Inferred else id) (Add (TextNode s))] Text
textDeriv cx1 inf (Value dt value cx2) s =
  if datatypeEqual dt value cx2 s cx1
  then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
  else NotAllowed
textDeriv cx inf (Data dt params) s =
  if datatypeAllows dt params s cx
  then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
  else NotAllowed
textDeriv cx inf (DataExcept dt params p) s =
  if datatypeAllows dt params s cx && not (nullable (textDeriv cx inf p s))
  then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
  else NotAllowed
textDeriv cx inf (List p) s =
  if nullable (listDeriv cx p (words s))
  then result [(if inf then Inferred else id) (Add (TextNode s))] Empty
  else NotAllowed
textDeriv cx inf (Element (Name uri local) p) s =
  let qn = QName uri local
  in region True qn (textDeriv cx inf (startRecovery qn cx p) s)
textDeriv cx inf (Element (NameClassChoice n1 n2) p) s =
  choice (textDeriv cx inf (Element n1 p) s) (textDeriv cx inf (Element n2 p) s)
textDeriv cx inf (Element _ _) s = NotAllowed
textDeriv cx inf (Region inferred qn p) s = region inferred qn (textDeriv cx inf p s)
textDeriv cx inf (Result r p) s = result r (textDeriv cx inf p s)
textDeriv _ _ _ _ = NotAllowed

listDeriv :: Context -> Pattern -> [String] -> Pattern
listDeriv _ p [] = p
listDeriv cx p (h:t) = listDeriv cx (textDeriv cx False p h) t

-- Start recovery by assuming an inferred start tag for the given element.
startRecovery :: QName -> Context -> Pattern -> Pattern
startRecovery qn cx p =
  result [Inferred (StartTag (ElementNode qn cx [] []))] (startTagCloseDeriv p)

-- Smart constructors that keep the pattern tree in normal form.
region :: Bool -> QName -> Pattern -> Pattern
region inferred qname NotAllowed = NotAllowed
region inferred qname p = Region inferred qname p

result :: [ResultDelta] -> Pattern -> Pattern
result _ NotAllowed = NotAllowed
result r (Choice p1 p2) = Choice (result r p1) (result r p2)
result r1 (Result r2 p) = Result (r1 ++ r2) p
result [] p = p
result r p = Result r p

addResult :: Pattern -> [ResultDelta] -> Pattern
addResult NotAllowed _ = NotAllowed
addResult (Choice p1 p2) r = Choice (addResult p1 r) (addResult p2 r)
addResult (Group p1 p2) r = group (addResult p1 r) p2
addResult p@(Interleave balance p1 p2) r = interleave balance (addResult p1 r) p2
addResult (Result r1 p) r2 = result r1 (addResult p r2)
addResult (Region inferred qn p) r = Region inferred qn (addResult p r)
addResult p r = Result r p

choice :: Pattern -> Pattern -> Pattern
choice p NotAllowed = p
choice NotAllowed p = p
choice Empty Text = Text
choice Text Empty = Text
choice p1 p2 | p1 == p2 = p1
choice p1 p2 = Choice p1 p2

group :: Pattern -> Pattern -> Pattern
group p NotAllowed = NotAllowed
group NotAllowed p = NotAllowed
group p Empty = p
group Empty p = p
group Text Text = Text
group _ (Result _ _) = error "Can't have Result on the RHS of Group."
group (Result r p1) p2 = result r (group p1 p2)
group p1 p2 = Group p1 p2

interleave :: Int -> Pattern -> Pattern -> Pattern
interleave _ p NotAllowed = NotAllowed
interleave _ NotAllowed p = NotAllowed
interleave _ p Empty = p
interleave _ Empty p = p
interleave _ Text Text = Text
interleave balance p1 p2 = Interleave balance p1 p2

startTagOpenDeriv :: Pattern -> ChildNode -> Pattern
startTagOpenDeriv (Choice p1 p2) node =
  choice (startTagOpenDeriv p1 node) (startTagOpenDeriv p2 node)
startTagOpenDeriv (Element (NameClassChoice n1 n2) p) node =
  choice (startTagOpenDeriv (Element n1 p) node) (startTagOpenDeriv (Element n2 p) node)
startTagOpenDeriv (Element nc@(Name uri local) p) node@(ElementNode qn cx _ _) =
  if contains nc qn
  then Region False qn (result [StartTag node] p)
  else let qn = QName uri local
       in region True qn (startTagOpenDeriv (startRecovery qn cx p) node)
startTagOpenDeriv (Element nc p) node@(ElementNode qn cx _ _) =
  if contains nc qn
  then Region False qn (result [StartTag node] p)
  else NotAllowed
startTagOpenDeriv (Interleave 0 p1 p2) node =
  choice (interleave (-1) (startTagOpenDeriv p1 node) p2)
         (interleave 1 p1 (startTagOpenDeriv p2 node))
startTagOpenDeriv (Interleave balance p1 p2) node =
  if balance < 0
  then interleave (balance-1) (startTagOpenDeriv p1 node) p2
  else interleave (balance+1) p1 (startTagOpenDeriv p2 node)
startTagOpenDeriv (OneOrMore p) node =
  group (startTagOpenDeriv p node) (choice (OneOrMore p) Empty)
startTagOpenDeriv (Group p1 p2) node =
  let x = group (startTagOpenDeriv p1 node) p2
  in if nullable p1
     then choice x (result (bestNullableResult p1) (startTagOpenDeriv p2 node))
     else x
startTagOpenDeriv (Region inferred qn p) node = region inferred qn (startTagOpenDeriv p node)
startTagOpenDeriv (Result r p) node = result r (startTagOpenDeriv p node)
startTagOpenDeriv _ _ = NotAllowed

attsDeriv :: Context -> Pattern -> [AttributeNode] -> Pattern
attsDeriv cx p [] = p
attsDeriv cx p ((AttributeNode qn s):t) =
  attsDeriv cx (attDeriv cx p (AttributeNode qn s)) t

attDeriv :: Context -> Pattern -> AttributeNode -> Pattern
attDeriv cx (Choice p1 p2) att = choice (attDeriv cx p1 att) (attDeriv cx p2 att)
attDeriv cx (Group p1 p2) att =
  choice (group (attDeriv cx p1 att) p2) (group p1 (attDeriv cx p2 att))
attDeriv cx (Interleave balance p1 p2) att =
  choice (interleave balance (attDeriv cx p1 att) p2)
         (interleave balance p1 (attDeriv cx p2 att))
attDeriv cx (OneOrMore p) att = group (attDeriv cx p att) (choice (OneOrMore p) Empty)
attDeriv cx (Attribute nc p) (AttributeNode qn s) =
  if contains nc qn && valueMatch cx p s then Empty else NotAllowed
attDeriv cx (Region inferred qn p) att = Region inferred qn (attDeriv cx p att)
attDeriv cx (Result r p) att = result r (attDeriv cx p att)
attDeriv _ _ _ = NotAllowed

valueMatch :: Context -> Pattern -> String -> Bool
valueMatch cx p s = (nullable p && whitespace s) || nullable (textDeriv cx False p s)

startTagCloseDeriv :: Pattern -> Pattern
startTagCloseDeriv (Choice p1 p2) = choice (startTagCloseDeriv p1) (startTagCloseDeriv p2)
startTagCloseDeriv (Group p1 p2) = group (startTagCloseDeriv p1) (startTagCloseDeriv p2)
startTagCloseDeriv (Interleave balance p1 p2) =
  interleave balance (startTagCloseDeriv p1) (startTagCloseDeriv p2)
startTagCloseDeriv (OneOrMore p) = oneOrMore (startTagCloseDeriv p)
startTagCloseDeriv (Attribute _ _) = NotAllowed
startTagCloseDeriv (Region inferred qn p) = Region inferred qn (startTagCloseDeriv p)
startTagCloseDeriv (Result r p) = result r (startTagCloseDeriv p)
startTagCloseDeriv p = p

oneOrMore :: Pattern -> Pattern
oneOrMore NotAllowed = NotAllowed
oneOrMore Empty = Empty
oneOrMore Text = Text
oneOrMore p@OneOrMore{} = p
oneOrMore (Choice p Empty) = Choice (oneOrMore p) Empty
oneOrMore (Choice Empty p) = Choice Empty (oneOrMore p)
oneOrMore (Result r p) = result r (oneOrMore p)
oneOrMore p = OneOrMore p

childrenDeriv :: Context -> Pattern -> [ChildNode] -> Pattern
childrenDeriv _ NotAllowed _ = NotAllowed
childrenDeriv cx p [] =
  let p1 = textDeriv cx True p ""
  in choice (addResult p [Inferred (Add (TextNode ""))]) p1
childrenDeriv cx p [(TextNode s)] =
  let p1 = childDeriv cx p (TextNode s)
  in if whitespace s
     then choice (addResult p [Add (TextNode s)]) p1
     else p1
childrenDeriv cx p children = stripChildrenDeriv cx p children

stripChildrenDeriv :: Context -> Pattern -> [ChildNode] -> Pattern
stripChildrenDeriv _ p [] = p
stripChildrenDeriv cx p (h:t) =
  stripChildrenDeriv cx (if strip h then (addResult p [Add h]) else childDeriv cx p h) t

endTagDeriv :: Pattern -> Pattern
endTagDeriv (Choice p1 p2) = choice (endTagDeriv p1) (endTagDeriv p2)
endTagDeriv (Region False qn p) =
  if nullable p
  then result (bestNullableResult p ++ [EndTag]) Empty
  else region False qn (endTagDeriv p)
endTagDeriv (Region True qn p) = region True qn (endTagDeriv p)
endTagDeriv (Result r p) = result r (endTagDeriv p)
endTagDeriv (Group p1 p2) = group (endTagDeriv p1) p2
endTagDeriv (Interleave balance p1 p2)
  | balance < 0 = interleave (balance+1) (endTagDeriv p1) p2
  | balance == 0 = NotAllowed
  | balance > 0 = interleave (balance-1) p1 (endTagDeriv p2)
endTagDeriv _ = NotAllowed
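As a usage illustration, here is a hypothetical driver around the appendix code (not part of the original listing): it normalizes a fully parsed element against a start pattern and extracts the best, i.e. least-markup, result, if one exists.

-- Consume a whole parsed element and return the chosen result deltas.
normalizeNode :: Context -> Pattern -> ChildNode -> Maybe [ResultDelta]
normalizeNode cx start node =
  let p = childDeriv cx start node      -- derivative w.r.t. the whole element
  in if null (nullableResults p)        -- no way to finish a valid instance
     then Nothing
     else Just (bestNullableResult p)   -- least amount of added markup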
References
[a72] AV Aho, JD Ullman, 1972. The Theory of Parsing, Translation, and Compiling. Prentice Hall
[a03] Mohammad Abolhassani, Norbert Fuhr and Norbert Gövert, Information extraction and automatic markup for XML documents, In Blanken et al., 2003, pages 159-174, Springer. doi:https://doi.org/10.1007/978-3-540-45194-5_11
[b59] Backus, J.W., The Syntax and Semantics of the Proposed International Algebraic Language of the Zürich ACM-GAMM Conference. In Proceedings of the International Conference on Information Processing, UNESCO, 1959, pp. 125-132.
[b64] Brzozowski, J. A. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 (Oct. 1964), 481-494. doi:https://doi.org/10.1145/321239.321249
[c56] Chomsky, Noam (1956). "Three models for the description of language". IRE Transactions on Information Theory 2: 113–124. doi:https://doi.org/10.1109/TIT.1956.1056813
[c01] James Clark and Makoto Murata. RELAX NG Specification. http://relaxng.org/spec-20011203.html, 2001. ISO/IEC 19757-2:2003.
[c02] James Clark. An algorithm for RELAX NG validation http://www.thaiopensource.com/relaxng/derivative.html
[c10] QuickCheck: Automatic testing of Haskell programs http://hackage.haskell.org/package/QuickCheck-2.1.0.3
[f93] Peter Fankhauser and Yi Xu, MarkItUp! - An incremental approach to document structure recognition, Electronic Publishing, 1993, pages 447-456
[k95] Eila Kuikka and Martti Penttonen, Transformation of Structured Documents, Electronic Publishing Origination, Dissemination and Design, 8(4), 1995.
[k97] Bertin Klein and Peter Fankhauser, Error tolerant Document Structure Analysis, International Journal on Digital Libraries, 1997, volume 1, pages 344-357. doi:https://doi.org/10.1007/s007990050028
[l68] Lewis, P. M. and Stearns, R. E. 1968. Syntax-Directed Transduction. J. ACM 15, 3 (Jul. 1968), 465-488. doi:https://doi.org/10.1145/321466.321477
[l97] Greger Lindén, Structured Document Transformations, 1997
[m56] Moore, E. F., 1956. Gedanken-experiments on sequential machines. In Automata Studies, Princeton Univ. Press, Princeton, New Jersey, pp. 129-153.
[m96] Makoto Murata, Transformation of Documents and Schemas by Patterns and Contextual Conditions, Proceedings of the Third International Workshop on Principles of Document Processing (PODP 96), 1997, pages 153-169, Springer-Verlag. doi:https://doi.org/10.1007/3-540-63620-X_61
[s05] Sperberg-McQueen, C. M. Applications of Brzozowski derivatives to XML schema processing. In Extreme Markup Languages 2005, page 26, Internet, 2005. IDEAlliance.
[t98] Kazem Taghva, Allen Condit, and Julie Borsack, Autotag: A tool for creating structured document collections from printed materials, Electronic Publishing, Artistic Imaging, and Digital Typography, Proc. of the EP ’98 and RIDT ’98 Conferences, 1998, pages 420-431, Springer-Verlag
[t03] Tang, X. 2003 A High-Level Specification Language for Structured Document Transformation. Doctoral Thesis. UMI Order Number: AAINQ84932., University of Waterloo.
[v88] Dénes Vargha. Schema method: a framework for correcting grammatically ill-formed input. In Proceedings of the 12th Conference on Computational Linguistics, Volume 1, pages 341-347. Association for Computational Linguistics, Morristown, NJ, USA, 1988. ISBN: 9638431563. doi:https://doi.org/10.3115/991635.991705
[h02] Haskell 98 Language and Libraries, the Revised Report. December 2002. http://haskell.org/onlinereport/
[s86] Standard Generalized Markup Language (SGML). International Organization for Standardization, ISO 8879:1986.
[s09a] OmniMark language documentation http://developers.omnimark.com/docs-extract/html/index.htm
[s09b] OmniMark RELAX NG (OMRELAXNG) library documentation http://developers.omnimark.com/docs-extract/html/library/125.htm
[w04] XML Schema Part 1: Structures, Second Edition; Analysis of the Unique Particle Attribution Constraint. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/xmlschema-1/#non-ambig
[1] There is one exception: the normalizer may reorder those parts of the input content that the target schema models as interleaved. In essence, its output will conform to the target schema made stricter by replacing every interleave pattern node with a group node.
[2] Note that this reordering transformation applies only where the target content model specifies that its parts can appear in any order. In the case of DITA task topics, this means that we have to modify the target schema so it accepts the elements listed above in any order. The normalizer will output them in the canonical order.