How to cite this paper
Sperberg-McQueen, C. M. “Translating imperative algorithms into declarative, functional terms: towards Earley
parsing in XSLT and XQuery.” Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). https://doi.org/10.4242/BalisageVol19.Sperberg-McQueen01.
Balisage: The Markup Conference 2017
August 1 - 4, 2017
Balisage Paper: Translating imperative algorithms into declarative, functional terms
towards Earley parsing in XSLT and XQuery
C. M. Sperberg-McQueen
Founder and principal
Black Mesa Technologies LLC
C. M. Sperberg-McQueen is the founder and
principal of Black Mesa Technologies, a consultancy
specializing in helping memory institutions improve
the long term preservation of and access to the
information for which they are responsible.
He served as editor in chief of the TEI
Guidelines from 1988 to 2000, and has also served
as co-editor of the World Wide Web Consortium's
XML 1.0 and XML Schema 1.1
specifications.
Copyright © 2017 by the author.
Abstract
In building up subroutine libraries for XSLT and
XQuery, it is sometimes useful to re-implement standard
algorithms in the new language. Such re-implementation can
be challenging, because standard algorithms are often
described in imperative terms; before being
reimplemented in XSLT or XQuery, the algorithm must first be
re-understood in a declarative and functional way.
Some of the challenges which arise in this process can be
illustrated by the example of Earley parsing. Earley’s
algorithm can parse an input string against any context-free
grammar in Backus-Naur Form. Unlike recursive-descent or
table-driven LALR(1) parsers it is not limited to
well-behaved
grammars. Unlike other
general context-free parsing algorithms such as CYK, it does
not devote time and space to operations which can be seen in
advance to have no possible use in a full parse. Earley’s
procedural description involves successive changes to a
small set of data structures representing sets of
Earley items; these procedural changes cannot
be translated directly into a functional language lacking
assignment.
But Earley’s data-structure updates can be
understood as defining relations among Earley items, and the
algorithm as a whole can be interpreted as calculating the
smallest set of Earley items which contains a given starter
item and is closed over a small number of relations on
items. Re-thinking the Earley algorithm in this way not only
makes it easier to implement it in XSLT and
XQuery, but helps make it clear why the parser is both
complete (it will always find a parse if there is one) and
correct (any parse it finds will be a real parse).
Table of Contents
- Introduction
- Notation
- Earley items
- Earley trees
- Brief digression
- Properties and relations of Earley items
-
- Expectation
- Testability; winning and losing
-
- The advance() function; continuations
- The scan() relation between items
- Testing and advance() function for non-terminals
-
- The advance() function again
- Continuations
- Useful items
- Relevant items
- Some consequences of the definition of relevance
- The pred() and comp() relations among items
- The Earley algorithm
- Two technical excursus
-
- Calculating transitive closures
- Extension to EBNF
-
- Operators on locations
- Notes on individual operators
- Properties of items and relations on items
- The algorithm
- Follow-on work
- Concluding remarks
Note
This paper makes no overt reference to markup; perhaps
some rationale is due for submitting it Balisage.
Balisage is about the use of markup to meet challenges
in information manangement. Many, perhaps most, users of
descriptive markup use XML. Many (although not all) users of
XML find it more convenient to process XML using XSLT and
XQuery than using most other programming languages; some would
like to XSLT and XQuery developed as general-purpose
programming languages. To be useful as general-purpose
programming languages, however, they need libraries providing
standard functionality and often implementing standard
algorithms.
This paper describes the reimplementation of one
standard in XQuery / XSLT terms. Its interest, if it has any,
is first in the functional, declarative reformulation of the
interesting (even beautiful) Earley algorithm and second in
the way that that reformulation has been achieved. Perhaps
this example will be helpful to others seeking to reimplment
standard functionality in these languages.
Introduction
Among XML users, it’s not unusual to find XSLT and
XQuery being used as general-purpose, not special-purpose,
programming languages. It is easier to use a language for
general-purpose programming across a wide range of application
areas if it has libraries of subroutines for different
domains. To build up larger libraries of subroutines for
general-purpose languages, it’s sometimes useful to
re-implement standard algorithms in the new language.
One of the challenges of reimplementing standard algorithms in
XSLT and XQuery is that standard algorithms are often
described in imperative terms which cannot be translated
directly into declarative functional languages. If we can
understand the algorithms in a more declarative way, however,
it may be possible to implement them straightforwardly in XSLT
or XQuery. The difficulty and one path to resolving it are
illustrated here with the parsing algorithm described by Jay
Earley in 1970 (Earley 1970).
Earley parsing is a well-known technique for parsing input
using context-free languages; unlike techniques like recursive
descent and standard table-driven parsing it does not require
special properties in the grammar, like being LL(1) or
LALR(1): it will work for any input and any grammar, whether
the grammar requires lookahead or not. But the algorithm as
Earley describes it is quite low-level: routines called the
predictor, the scanner, and the completer take turns changing
global data structures until it can be seen from the data
structures that the parse has succeeded or failed. In the
case of success, one or more parse trees can be extracted from
the data structure; in failures, one can identify the point at
the input at which the attempt failed and generate diagnostic
messages. It is not always immediately obvious to the reader,
however, exactly why the predictor, the scanner, and the
completer behave as they do, or why their interaction is
guaranteed to find any and all legitimate parses of the input.
And since XSLT and XQuery don’t allow the assignment of
new values to variables, the data-structure manipulations used
by Earley cannot be implemented as described.
The discussion below tries to step back and think about what
the data structures described by Earley mean, in order to
understand the algorithm in a purely declarative way. We
proceed roughly as follows:
-
Earley defines an important data structure called
the Earley item and the Earley
algorithm initializes and manages a set of such items,
adding new elements to the set when certain conditions are
met.
-
Since the conditions to be met typically involve the
presence of particular items in the set, and since the new
items to be added are typically constructed on the basis
of those existing items, we can consider Earley’s
operations as defining several relations on items. A
relation R holds between two Earley
items X and Y
just in case Earley’s algorithm will add
Y to the set whenever
X is present.
Some relations used by Earley are ternary, not binary;
in those cases,
item Z will be added to the
set just in case X and
Y are in the set and
R(X,
Y,
Z) holds.
-
The algorithm's updates to the set of items can then
all be described in the following terms: if an item
X is in the set (or, for
some steps, if items X
and Y are in the set), and the relation
R holds between
X (or X and
Y) and another item
Z, then Z is
added to the set.
-
It is then possible to see that Earley’s
algorithm calculates the smallest set of Earley items
which (a) contains a standard starter item
and (b) is closed over a number of relations defined as
holding (or not holding) for pairs or triples of Earley
items.
Re-thinking the Earley algorithm in this way not only makes it
easier to implement it in XSLT and XQuery, but
helps make it clear why the parser is both complete (it will
always find a parse if there is one) and correct (any parse it
finds will be a real parse).
The next two sections of this paper (section “Notation” and section “Earley items”) introduce some
basic notation and the concept of the Earley item.
We then (section “Earley trees”) define
Earley trees; these are not defined by Earley,
but prove useful in the later discussion. The nodes of Earley trees
are sets of Earley items with certain parsing-relevant
properties. We then proceed (section “Properties and relations of Earley items”)
to identify some interesting properties of
Earley items (truth, expectation, testability, usefulness,
relevance), and some useful relations between items (advance,
continuation, scan, prediction). Supplied with these relations,
we can define the Earley algorithm using them and show how they
guarantee that it is complete and correct (section “The Earley algorithm”).
The initial description of the algorithm applies to grammars
given in Backus-Naur Form; the penultimate section (section “Two technical excursus”) contains discussions of two
technical topics of practical importance: how to calculate the
required sets in XSLT and XQuery, and how to extend the
treatment of Earley parsing here to handle regular-right-part
grammars like those specified for Invisible XML (section “Extension to EBNF”). The final section (section “Concluding remarks”) tries to draw some lessons for XSLT and
XQuery translations of procedural algorithms from this
example.
Notation
We assume a grammar G and an input string I of length n.
A grammar is a four-tuple (VN, VT, P, S), where
-
VN is a set of non-terminal symbols.
-
VT is a set of terminal symbols,
disjoint from VN (i.e.
VN ∩ VT = ∅).
-
P is a set of production rules
of the form N → α, where
-
N ∈ VN.
-
α is a sequence of zero or more symbols
from (VN ∪ VT).
-
S ∈ VN is the start symbol of
the grammar.
The
vocabulary of
G is the set of terminal and
non-terminal symbols
V =
VN ∪
VT.
When we need to mention the start symbol of
G, we call it
S.
Non-terminal and terminal symbols in
G will be called
N,
T,
N1,
N2, ... and so on. Sequences of terminals and
non-terminals will be called
α,
β,
γ ...
For purpose of our exposition, we often wish to deal not with
the grammar G as is, but with an augmented grammar
G′ = (VN ∪ Goal, VT,
P ∪ Goal → S, Goal),
where Goal ∉ VN. This allows us to know that
the start-symbol Goal has exactly one rule, whose form
we know, and that it never appears on the right-hand side of any rule
in G′.
It should be clear that G and G′ define the same
language.
We use the term string or string of
characters when referring to I or substrings of I,
or other sequences of terminal symbols.
We use the term sequence when referring
to sequences of symbols in V. There is no technical
distinction between the terms string and
sequence other than the restriction of
strings to terminal symbols; they are distinguished here only
in an attempt to make the argument easier to follow.
For any two sequences or strings α and β,
α β
denotes the concatenation of α and
β. Where simple juxtaposition might
be ambiguous, or where it is desired to emphasize the
concatenation
operator, the form
α || β
is used with the same meaning.
We write the empty string ε and the empty sequence
either ε or ().
The expression I[i] refers to the character at position i in
I; the expression I[x,y] refers to the substring from
(just after) position x up to and including the
character at position y. We use 0 to refer to the position
before the first character.
For example, if I is the string a+b
, then
-
I[1], I[2], I[3] refer to the characters a
,
+
, and b
,
respectively.
-
I[0,1], I[0,2], I[2,3] refer to the substrings
a
, a+
, and
b
respectively; I[0,0] and
I[3,3] refer to the empty strings immediately before
a
and after b
,
respectively.
Note that for any string I and any non-negative integers
x, y, z all ≤ n, we have I[x, y] || I[y,
z] = I[x, z], and for any x (0 < x ≤ n),
I[x] = I[x-1, x].
The discussion which follows tries to make the line of
argument as explicit and easy to follow as possible, but some
basic knowlege of context-free grammars is inevitably assumed.
(Ironically, any attempt to make an argument easier to follow
by making the inference steps as small as possible has the
follow-on effect of multiplying the number of formulas and
such, so that at first glance the text will appear less
accessible, not more accessible. Readers interested but uneasy
with formalisms are encouraged to take a deep breath and
persevere.)
Earley items
An Earley item is a triple (x, y, N → α · β), where
-
x is a number between 0 and n, inclusive, referred
to as the start value
-
y is a number between x and n, inclusive,
referred to as the end value
-
α and β are sequences in V*,
i.e. sequences of
zero or more symbols (terminals or non-terminals)
-
N → α β is either a production rule in G or the special
rule Goal → S described above.
The third item in the triple is a location
(sometimes referred to in parsing literature simply as an
item); if we don’t care about the details
of the location, we may write an Earley item (x, y,
L).
If β is the empty sequence, so that the item has the form
(x, y, N → α ·), then the item is
called a completed item (or a
completion) for N.
For example, given I as above (a+b
), if G is
the tuple (VN, VT, P, E), where
-
VN = { E, T, P }
-
VT = { *
,
+
,
a
,
b
,
(
,
)
,
[
,
]
}
-
P =
-
E → T
-
E → E *
T
-
T → A
-
T → T +
P
-
A → a
-
A → b
-
A → (
E )
-
A → [
E ]
then possible Earley items include
-
(0, 0, E → · T)
-
(2, 3, T → T +
· A)
-
(2, 2 E → · T)
-
(2, 2 T → · T +
A)
-
(3, 3, A → (
E )
·)
-
(0, 3, E → · T)
-
(0, 3, E → T ·)
-
(0, 3, A → (
E · )
)
-
(0, 0, Goal → · E)
-
(0, 0, Goal → E ·)
Informally, we understand an Earley item to mean that in
grammar G, the string of symbols α generates the
string I[x, y]. (A sequence of symbols
generates a string if and only if the string
can be produced from the symbols by applying a series of
zero or more rewrites using rules of G. Since generation
involves zero or more applications of the rewrite relation
written →, it is not infrequently denoted ⇒*.)
So the
examples just given can be interpreted as saying
-
The string
immediately
before a
is generated by the empty
sequence of symbols.
-
The string b
is generated by
the symbol sequence T +
.
-
The string
immediately after
+
is generated by the empty
sequence of symbols.
-
The string
immediately after
+
is generated by the empty
sequence of symbols.
-
The string
immediately after
b
is generated by the sequence
(
E )
.
-
The string a+b
is generated
by the empty sequence of symbols.
-
The string a+b
is generated
by the sequence T.
-
The string a+b
is generated
by the sequence (
E.
-
The string
immediately
before a
is generated by the empty
sequence of symbols.
-
The string
immediately
before a
is generated by the
sequence of symbols E.
Note that since we normally use Earley items to keep track
of where we are in recognizing input against a grammar,
informally an Earley item may also suggest that we are
expecting to see an instantiation of the sequence
β that follows the dot. (See also the discussion of
expectation, below.) But expectations are not relevant for
the truth or falsehood of the item, and thus not part of the
meaning of the item, narrowly construed.
Of these examples, (5), (7), and (10) are completed items.
Note that because they have a meaning, Earley items can be true or false.
For example, in the list just given, items (1), (3), (4),
(7), and (9) are true
statements, and the others are false.
Note that if x = y and α is the empty string, then
we have an Earley item of the form (x, x, N →
· β). The meaning of such an item is always that
the empty sequence of grammar symbols derives the empty
sequence of terminal symbols, and the item is accordingly
always true.
Earley trees
These are not defined by Earley, but they prove helpful for
later discussion.
An Earley tree for the start symbol S in a given grammar
G, for a given input I, is an ordered tree of Earley items with
the following properties:
-
Each node in the tree is either an Earley item
or an expression of the form
I[y]
or an expression of the form
I[y, y].
-
The children of any node are ordered.
-
The root node has the form (0, n, Goal → S
·), where S is the start-symbol of G and Goal
is the start symbol of the augmented grammar G′
described above.
-
For any node K of the form (x, y, N1 →
α N2 · β), (i.e. in which the last item
before the dot is a non-terminal N2), the node’s
children
will take the form:
-
(w, w, N2 → · R1 R2 R3
... Rk)
-
(w, z1, N2 → R1 · R2 R3
... Rk)
-
(w, z2, N2 → R1 R2 · R3
... Rk)
-
...
-
(w, y, N2 → R1 R2 R3 ... Rk
·)
where
-
k is the length of the rule’s
right-hand side.
-
w ≤ z1 ≤ z2 ≤ ... ≤ zn-1 ≤
y.
-
w is equal to the end value of node K’s left
sibling, if K has a left sibling.
Otherwise, w = x (the start value of K).
Note that in either case we have x ≤ w ≤ y.
Recall that by the definition of Earley item, N2
→ R1 R2 R3 ... Rk is a rule in G.
By this rule, all siblings in an Earley tree have
locations in the same production rule of G, all have
the same start value w, and they have non-decreasing
end values. The start value of all children is the same
as the end value of the parent’s left sibling,
when it has one; otherwise it’s the same as the
start value of the parent. (The second case arises for
unit rules and for the children of the root node.) The
end value of the last child is the end value of the
parent.
-
Every node K of the form (x, x, N →
·), (i.e. a completion for non-terminal
N in which the production has an
empty right-hand side) has a single child node, of the
form I[x, x].
-
For any node of the form (x, y, N →
α T · β), (i.e. in which the last item
before the dot is a terminal symbol T), the node’s
sole child is the terminal symbol I[y], and I[y] =
T.
-
Every node of the form
I[y] is a leaf
node (has no children).
Like an Earley item, an tree of Earley items has a meaning, and
can be true or false.
-
The root node has the form (0, n, Goal → S
·). Informally, this means means S generates
the string I[0,n]
, and since I[0,n] = I,
this amounts to saying that
S generates I. Since S is the start symbol of the
grammar G this in turn means that I is a sentence in
the language defined by G.
-
For any node K of the form (x, y, N1 →
α N2 · β), the meaning is that the
substring of I from x to y, i.e. I[x,y], is
generated by the sequence α N2. This will be
true if and only if (a) α generates I[x,w]
and (b) N2 generates I[w,y].
K’s first child has the form
(w, w, N2 → · α) and means
that the empty sequence of symbols generates the
zero-length (empty) string I[w,w]. Like
all Earley items of similar form, it is true.
Each subsequent child of K has the meaning that
a longer and longer prefix of the right-hand side of the rule
N2 → R1 R2 ... Rk generates a longer and
longer substring of I beginning at w. The final
child, which has the form
(w, y, N2 → R1 R2 ... Rk ·),
means that the sequence of symbols
R1 R2 ... Rk generates
I[w,y], the substring of I from
w to y.
It follows that N2 also generates the string I[w,y].
If the sequence of symbols α
generates I[x,w], and the symbol N2 generates
I[w,y], then it follows that
the sequence α N2 generates I[x,y].
So if (a) the Earley item in K’s left sibling is true,
or K has no left sibling, and (b) all of K’s children
are true, then K is also true.
-
For any node K of the form (x, y, N → α T · β),
(i.e. in which the last item before the dot is a terminal symbol T),
the meaning of the Earley item is that
the sequence of symbols α T generates
I[x,y], the substring of I from
x to y.
The left sibling of K, if it exists, will be an
Earley item of the form (x, w, N → α
· T β), where w = y - 1. If the Earley
item in the left sibling is true, and if I[y] = T,
then node K is true.
It follows from the rules that all leaves in the tree are
either of the form
I[j]
or else of the form
I[j, j],
for 1 ≤ j ≤ n.
It also follows that for the leaves, taken in tree order,
j is monotonically non-decreasing.
For any two leaf of the form
I[j],
the immediately following leaf in tree order may be
either
I[k]
or
I[k,
k],
where
k = j + 1.
For a leaf of the form
I[j, j],
the next leaf in tree order may be either
I[j,
j],
or
I[k],
again with
k = j + 1.
The Earley tree is true (or correct) if and only if the
concatenation of all the leaves, in order, is the input I.
We say that such an Earley tree exhibits a
derivation of I from G.
It should be evident that a conventional parse tree for the
input I against the grammar G can be constructed from a
correct Earley tree by
(1) deleting each node (x,
x,
N → ·
α)
for non-empty α;
(2) replacing each node (x, y, N1 → α N2
· β) with
N2;
(3) replacing pre-terminal nodes (nodes with single leaf
children) with their children;
(4) replacing each leaf node with the string it denotes
(I[j]
with character j of
I,
I[j,
j]
with the empty string);
and (5) retaining the parent-child relations and sequence of children.
It should also be evident that the parse tree implied by an
Earley tree for a given I and G is correct if and only
if every Earley item in the tree is true.
Brief digression
We note in passing that the job of any parser is to
calculate an Earley tree for a given G and I; one way to
think of it is as starting with the given information (I
and G), calculating a set of Earley items, and then
determining whether we can build an Earley tree from them.
The job of a correct parser is to ensure that
all the items in the Earley tree are true (and thus that the
tree is correct).
The job of an efficient parser is to do so with
as little effort as possible (in particular, considering as
few items as possible, and testing as few as possible for
truth).
The job of an unrestricted parser is to calculate a correct
Earley tree for any arbitrary G and I; restricted
parsers (e.g. LL(1) or LR(k) parsers) do so only for some
G.
Grune and Jacobs observe that the problem with some general
parsing methods is that in the worst case they end up doing
a lot of unnecessary work, such as calculating the
truth of a lot of Earley items that turn out irrelevant to
the construction of a correct parse tree.
The Earley algorithm is a way to calculate Earley trees for
arbitrary context-free grammars while reducing the amoung of
unnecessary work. To summarize the algorithm, we need to
define some properties of Earley items, beyond truth and
falsehood.
Properties and relations of Earley items
The first important property of Earley items has already
been mentioned several times: Earley items can be true or
false. This section describes some additional properties and
relations.
Expectation
An Earley item (x, y, N → α · X
β), i.e. an item with the symbol X (terminal or
non-terminal) right after the dot, is said to expect an
X
.
Note: We are restricting ourselves here to CFGs given in
BNF. The extension to EBNF and regular right part grammars
is not too difficult, but complicates the account. See below
(section “Extension to EBNF”) for a discussion of extended BNF.
Testability; winning and losing
An Earley item (x, y, N → α · T
β), which expects a terminal symbol T, can be
directly tested against the input. If I[y,y+1] = T,
then the item wins; otherwise it
loses.
If the item just described is true and wins, then the
related item (x, y, N → α T · β),
which moves the dot one symbol to the right, is also true.
That is, given the truth of:
-
(x, y, N → α · T β)
-
I[y,y+1] = T
we can infer the truth of:
The advance() function; continuations
We define the function advance() to describe this
relation; it maps from an Earley item i and a terminal
symbol T to an Earley item j or to nothing.
Specifically, if
then
-
advance(i,T) = j if and only if i wins
-
advance(i,T) = ∅ if and only if i loses
For convenience, we may also want to be able to apply the
function to a set of items, not just a single item. If iii
is a set of Earley items, and T is a terminal symbol, then
advance*(iii, T) is the set {j | j =
advance(i, T) for some i ∈ iii}. We refer to
j as a continuation of i. Continuation is
transitive: any continuation of j is also a continuation
of i.
The scan() relation between items
Looking ahead, we say that the relation scan(i, j)
holds between Earley items i and j if and only if i expects
terminal T, i wins, and j = advance(i, T).
Testing and advance() function for non-terminals
By contrast, an Earley item (x, y, N1 → α
· N2 β) which expects a non-terminal symbol
N2 cannot be directly tested against the input. It can
however can be tested indirectly. If N2 generates a
substring of I beginning at y (i.e. if for some z we
have N2 ⇒* I[y, z]), then the item wins;
otherwise it loses.
The advance() function again
Again we define advance() appropriately: given
-
i = (x, y, N1 → α · N2 β)
-
j = (x, z, N1 → α N2 · β)
then
-
advance(i, N2) = j
if and only if N2 ⇒* I[y, z] for some z
-
advance(i, N2) = ∅ otherwise
Continuations
As with terminals, so also with non-terminals; if j =
advance(i, N) for some non-terminal N, then j
is a continuation of i, and any continuation of j is
also a continuation of i.
Useful items
An Earley item is useful if and only if it
appears in a correct Earley tree for the grammar G and
the input I. If an Earley item is not useful, it is
useless.
If parsing algorithms could distinguish between useful
and useless items before spending time on them, they
could be very efficient.
But at the beginning of a parse, it’s impossible
to know which items will be useful and which
won’t. For example, if I is not in the language
L(G) defined by G, then all Earley items for I and
G are useless, but we can’t know that without
calculating and testing at least a few.
Note in passing that if any item i = (x, y, N
→ α · β) is useful, then some
completion of i, i.e. some item of the form (x, z,
N → α β ·) (with z ≥ y) will
also be useful; the two will be siblings in the Earley
tree. There might be more than one true completion of
i; if I has more than one derivation in G, more than
one of them may be useful.
The converse is also true in most cases: if (x, z, N
→ α β ·) is useful, then there will be
at least one true item of the form (x, y, N →
α · β), and at least one such form will be
useful (more than one, in cases of ambiguity).
Relevant items
In a particular state of partial knowledge, when we have
examined part of the input but not all, an Earley item is
relevant if, based on what we know, it
could be useful. (We will make this more
precise in a moment.)
Also any Earley item (x, x, N → · α
β) is relevant if for some y (≠ x) the item
(x, y, N → α · β) is relevant.
Given a particular state of partial knowledge, an Earley
item is irrelevant if we know, based on
that partial knowledge, that it is useless. Any item
known to be false (in that same state of partial knowlege)
is thus irrelevant. I believe but have not attempted to
prove that in any given state of partial knowledge, any
Earley item that is not relevant is
irrelevant. This might depend on the nature
of the partial knowledge available.
The Earley algorithm is an online
algorithm, which reads the input left to right (or:
front to back, if one wishes to avoid fatal confusion in
cases of right-to-left scripts). At any given point in
time, we have read the first y characters in the
input, for some number y such that 0 ≤ y ≤
n, so we have knowledge of the substring I[0, y],
and no knowledge of the substring
I[y,
n].
We also have full knowledge of the grammar G.
Given knowledge of I[0, y], an Earley item of the
form (x,y,L) is relevant if it could
be part of an Earley tree for G and some input
which starts with the string I[0, y]. Or more
formally, the Earley item is relevant if and only if
there is some string I′ which shares the prefix
I[0, y] with I, for which the item is useful
(appears in some correct Earley tree).
Some consequences of the definition of relevance
It follows from the definition of relevance just given
that the item (0, 0, Goal → · S) is relevant
for all grammars and inputs.
It also follows from the definition of relevance that if
an item i = (x, y, L) expects terminal T, and
i is relevant, then advance(i, T) is relevant if
it exists (i.e. if i wins; if i loses,
advance(i, T) does not denote an item).
It further follows from the definition of relevance that
if an item i = (x, y, L) expects non-terminal N,
and i is relevant, then for every production rule N
→ α in G, the item (y, y, N →
· α) is relevant.
If i is useful as well as relevant, then we know
we’ll need an item (y, z, N → α
·) in the tree as well (since i is expecting an N,
and appears in a derivation tree, there will
be an N in the derivation tree at the appropriate
location). So at least one of the production rules for N
will be useful as well, but if there is more than one we
do not yet know which.
If the grammar is unambiguous, at most one production rule will be
relevant, but G could be ambiguous, so more than one
could be useful.
We don’t know, however, how N might be
instantiated in I (that’s essentially a
consequence of the nature of CFGs), so any of the
production rules for N could be the one needed for the
tree. That means every possible item (y, y, N →
· α) is relevant if N → α is in
G.
The pred() and comp() relations among items
Looking ahead, we say that the relation pred(i, j)
holds between items i and j if and only if item i = (x,
y, L) expects non-terminal N and item j has the
form (y, y, N → · α). Here pred
is short for predicts
.
It also follows from the definition of relevance that if we have
-
item i =
(x, y, N1 → α · N2 β)
expecting a non-terminal N2
-
item j =
(y, z, N2 → γ ·)
is a completion for N1.
and both are relevant (if
i is relevant, then
j is also relevant
unless
j is false), then the item
advance(
i,
N2),
i.e.
(
x,
z,
N1 →
α N2 ·
β),
is also relevant.
Looking ahead, we say that the relation comp(i, j, k) holds
if and only if
-
item i = (x, y, L1)
expects some non-terminal N
-
item j = (y, z, L2) is a completion for N
-
k = advance(i, N)
The Earley algorithm
The Earley algorithm (as described in Earley 1970)
is a procedural way of calculating a set of Earley
items which are all true and are all relevant at the time
they are added to the set. I believe (but have not proved)
that it is in fact the set of all items which are both true
and relevant by the definitions above.
The algorithm calculates a set of Earley items which obeys
the following rules:
-
The item (0, 0, Goal → · S) is a member of the set.
-
If i = (x, y, L) is a member of the set which
expects terminal symbol T, and I[y+1] = T, then
advance(i, T) is also a member of the set.
If (x, y, L) is a member of the set which expects
non-terminal symbol N, and N → α is a
production rule in G, then (y, y, N → ·
α) is also a member of the set.
-
If i = (x, y, L) is a member of the set which
expects non-terminal symbol N, and j = (y, z,
N → α ·) is a member of the set, then
advance(i,N) is also a member of the set.
-
Nothing else is a member of the set.
Or, more concisely, the set is the smallest set that contains (0, 0,
Goal → · S) and is closed over the relations
scan(i, j), pred(i, j),
and comp(i, j, k).
Earley defines things procedurally, in terms of the
following steps:
-
He starts by initializing the set of items to (0, 0, Goal→ · S).
-
He then applies rule 3 above (our pred() relation) until it produces
no new items. At this point all items in the set have x = y =
0.
-
Then he iterates over the following procedure.
-
First
apply rule 2 (Earley’s
scanner
, our scan() relation),
which generates new items of the form (x, y+1, L).
If no new items are produced by this step, we are done;
I is not a member of L(G).
(Digression: Why are we done? If applying rule 2 produces no
new items, then one of the following is true:
-
(a) The set of items includes no items which expect a terminal.
In this case, we have reached a point in the grammar where no
terminal is predicted, which means that no terminal can possibly
advance the parse, which means the parsing situation is
hopeless.
-
(b) The set of items includes some items which expect a
terminal, but none of them win. In this case, the grammar does
make testable predictions for this point, but the input
satisfies none of them, which means that the input definitely
fails to match the grammar.
Either way, there is no parse tree for I as a sentence of G,
so we are done.
End of digression.)
-
Then apply rule 3 (Earley’s
predictor
, our pred() relation)
and rule 4 (Earley’s completer
,
our comp() relation) until they produce no new items.
-
If the highest end-point y in our items (x, y,
L) is less than n, then start another round of
this procedure (go back to finding instances of the
scan() relation).
Otherwise, the highest end-point y = n and we
have reached the end of the input and are done.
-
If the set contains (0, n, Goal → S ·),
we have recognized the sentence and can construct an
Earley tree from the set. If the set doesn’t
contain that item, I is not a member of L(G).
Each round of this procedure uses progressively higher
values for the end-point in the search for instances of
the scan() relation, so there are at most n rounds
before we reach the end of the input.
It seems clear from this way of putting it that the
procedure described by Earley calculates the transitive
closure of (0, 0, Goal → · S) over the
relations scan(), pred(), and comp().
It’s pretty clear, intuitively, that the starter item
(0, 0, Goal → · S) is relevant and true,
and that all of the relations scan(), pred(), and
comp() produce new items for the set which are also
relevant and also true. Producing only relevant items is
what distinguishes Earley from some other algorithms, which
work purely bottom-up and essentially try to calculate all
true items with x ≠ y. I believe (but hesitate to say
it’s intuitively obvious) that the set described
contains all relevant items, as that term is
defined above; that property distinguishes the Earley
algorithm from things like recursive-descent and LR(1)
parsing, which in the interests of simplicity and/or speed
ignore some relevant items (which is why they can’t
handle all grammars).
Earley describes his algorithm in terms of very small
mechanical tasks; it’s obvious how to implement it,
especially in a procedural language with mutation.
It’s not at all obvious to me from E’s
description how to do the same thing in a declarative
language without mutation.
But by describing the set in terms of the relations, and defining the
relations in terms of their properties, instead of in the imperative,
procedural-pseudocode style used by Earley and by other descriptions,
I think I have succeeded in defining the relevant sets and properties
declaratively, which will make it easier to calculate them in a
functional language.
Two technical excursus
Calculating transitive closures
For implementing this declarative functional version
of the algorithm, it would obviously be handy to have
a language with a built-in transitive-closure
operator. But with higher-order functions, we can
build our own, I think.
(: to take the transitive closure of some
function f on some set $s, first define an
equality function $eq, then specify a maximum
number of recursive calls $n, and finally call
my:transclos($eq, f, $n, $s)
:)
declare function my:transclos(
$eq as function(*), (: equality test :)
$r as function(*), (: unary function, for binary relation :)
$maxcycles as xs:integer,
$set as item()*
) as item()* {
local:tc-helper($eq, $r, $maxcycles, $set, $set)
};
declare function local:tc-helper(
$eq as function(*), (: equality test :)
$r as function(*),
(: unary function to calculate a binary
relation :)
$ttl as xs:integer,
$queue as item()*,
$accum as item()*
) as item()* {
if (empty($queue)) then
$accum
else if ($ttl le 0) then
error("Could not complete calculation, out of cycles.")
else
let $this := $queue[1],
$rest := $queue[position() gt 1],
$candidates := $r($this),
$new := $candidates
[no $x in $accum
satisfies
$eq(., $x)]
return local:tc-helper($eq, $r, $ttl - 1,
($rest, $new), ($accum, $new))
};
Unfortunately, I really don’t know
a good way to generalize this for
ternary relations like comp. But the
pattern is simple enough to replicate.
Extension to EBNF
The account of the algorithm given above exploits the
simplicity of production rules in BNF, which have a flat
sequence of symbols in the right hand side of each rule. But
many grammars extend BNF in ways which vary but typically
amount to allowing not just a sequence of symbols, but a
regular expression over V, on the right-hand side of a
rule. The notations are typically called
extended BNF, or EBNF; grammars written using
such notations are regular-right-part grammars
(because the right-hand part of each rule is a regular
expression).
Among the notations for regular-right-part grammars of
interest to XML users is the variant of van Wijngaarden
grammars specified by Steven Pemberton for use in
invisible XML
(Pemberton 2013).
To apply the Earley algorithm to regular-right-part
grammars, it is necessary to generalize the account above,
which can be done as follows. In particular, we will need
more abstract and general notions of (1) location,
(2) position in a production rule,
(3) expectation, and
(4) the advance function.
As before, an Earley item is a triple (x, y, L),
interpreted in the context of a grammar G and an input
string I, where x and y are the start and end
points of a substring of the input I, and L is a
location inside some rule in the productions P of the
grammar G. But instead of taking the form N →
α, where α is a sequence of symbols in V, the
location now has a structure whose details are not
specified. Instead, we define several operations on L,
which provide the information we need for the algorithm.
We would like the definition of these operations to be
general enough to be consistent with any of several
possible implementations of locations:
-
For a BNF, we can use a sequence of symbols in
V and a non-negative integer representing an index into
that sequence.
-
For an EBNF, we might use a pointer to a rule in
the grammar, together with a derivative expression
identifying what is still required or allowed to
appear.
-
For any grammar, we could use a triple consisting
of a non-terminal N, a finite state automaton (FSA)
representing the right-hand side of a rule for N, and an
identifier denoting the current state in the
FSA.
Operators on locations
The following operations on locations are useful in
describing Earley parsing for regular-right-part grammars.
-
rule(L) denotes the production rule within which
the location L appears.
-
nt(L) denotes the non-terminal on the left-hand side of
rule(L).
-
rhs(L) denotes the right-hand-side of rule(L)
-
seen-so-far(L) denotes a sequence of symbols in V (all
of which appear in rhs(rule(L))), which form a
prefix of some word recognized by rhs(rule(L)).
-
remains-to-be-seen(L) is an expression describing the part of
rhs(L) which remains to be matched after
seen-so-far(L) has been seen.
In an EBNF it will always be (equivalent to) the
derivative of rhs(L) with respect to seen-so-far(L).
-
completed(L) is true if and only if
remains-to-be-seen(L) is nullable, i.e. it is matched
by (accepts) the empty sequence.
For any item (x, y, L), if completed(L)
then nt(L) ⇒* I[x, y].
Such an item is a completed item,
and a completion of nt(L).
-
expects(L) denotes a set of symbols in
rhs(L). In a BNF, we get exactly one symbol, or none:
the symbol to the right of dot in the item. In an EBNF, we
get the set of symbols (terminal or non-terminal) which
can follow the current state, or equivalently the set of
symbols which are accepted as initial symbols by
remains-to-be-seen(L).
For any two locations L1 and L2, if rule(L1) =
rule(L2) and seen-so-far(L1) = seen-so-far(L2), then L1 and
L2 are equivalent. Let R = rule(L1) = rule(L2)
and α = seen-so-far(L1) = seen-so-far(L2); then we have:
-
nt(L1) = nt(L2) = left-hand side of R
-
rhs(L1) = rhs(L2) = right-hand side of R
-
remains-to-be-seen(L1) is equivalent to remains-to-be-seen(L2) in the
sense that they accept the same language; in each case,
the expression will be the derivative of rhs(R) with
respect to α, or another expression equivalent to
that derivative. (In practice, it is helpful to simplify
derivative expressions, but a weak and easy simplification
may not reduce all equivalent expressions to the same
form. See Brzozowski 1964.)
-
completed(L1) = completed(L2) =
nullable(remains-to-be-seen(L1)) = nullable(remains-to-be-seen(L2)).
An expression is nullable if the language it defines
includes the empty string as a sentence.
For definitions of nullability and explanations of
the relation between nullability of a derivative and
satisfaction of the original expression,
see
Brzozowski 1964 or
Brüggemann-Klein 1993a or
Brüggemann-Klein/Wood 1998 or
Sperberg-McQueen 2005.
-
expects(L1) = expects(L2) = the
set {X | X ∈ V and the derivative of remains-to-be-seen(L1)
with respect to X ≠ ∅}.
A consequence of the equivalence of
L1 and
L2, given the
identify of
rule() and
seen-so-far() for them, is that we can
specify any location down to equivalence by specifying a
rule and a sequence of symbols.
Notes on individual operators
The following subsections provide some examples
and further discussion of some of the operators
defined above.
Note that seen-so-far(L) is not necessarily
or usually a prefix of some sequence of characters
generated by nt(L); it is a prefix of some
sequence of terminal and non-terminal symbols in V
which matches the regular expression rhs(L).
If an item (x, y, L) is in fact true, then
I[x, y] is generated by seen-so-far(L).
For example, consider L1 and L2 such that
-
rule(L1) = A → B C D
-
rule(L2) = A → (B | C)*, D
In this case,
seen-so-far(
L1) may be any
of
-
() (i.e. the empty sequence)
-
(B)
-
(B C)
-
(B C D)
seen-so-far(
L2)
may be any of
-
()
-
(B)
-
(C)
-
(D)
-
(B B)
-
(B C)
-
(B D)
-
(C B)
-
(C C)
-
(B D)
-
(B B B)
-
...
This operator is not actually used in calculating the
set of Earley items we are interested in; the only
reason for defining seen-so-far(L) is to make it possible
to say (as we did above) that any item (x, y, L)
is true if and only seen-so-far(L) ⇒* I[x, y].
We could do without seen-so-far(L) if we said that the
interpretation of an Earley item (x, y, L) is not
the string I[x, y] matches (is generated by) the
part of the rule to the left of the dot
but the
equivalent claim for all strings s generated by
remains-to-be-seen(L), the concatenation of I[x, y] with s is
generated by nt(L).
Unlike seen-so-far, which returns a sequence of symbols in V,
remains-to-be-seen returns a regular expression over V.
For L1 and L2 as described above, then
remains-to-be-seen(L1) may be any of
remains-to-be-seen(
L2) may be any of
If the grammar is in BNF, then completed(L) if and only if
remains-to-be-seen(L) = empty-sequence.
Properties of items and relations on items
Provided with the operations defined above, we
can now redefine the crucial terms introduced earlier
for BNF (section “Properties and relations of Earley items”) so that
they also apply when G is a regular-right-part grammar.
Expectation, winning, and losing in EBNF grammars
An item (x, y, L) expects a
symbol X if and only if X ∈ expects(L).
Given an item i = (x, y, L) and a symbol X (X ∈ V),
such that i expects X, i wins on X if either
-
X is a terminal symbol T, and
-
I[x, y+1] = T,
or
-
X is a non-terminal symbol N, and
-
N ⇒* I[y, z], for some z (y ≤ z ≤ n).
Otherwise
i loses on
X.
Note that since i can expect more than one symbol, it
may win on some symbols and lose on others. Without
respect to any particular symbol, i wins if
there exists as least one symbol X for which i wins on
X; i loses if there is no such symbol.
The advance function for EBNF grammars
For any i = (x, y, L1), j = (x, z, L2), and
X ∈ V,
if (1) i expects X and
(2) rule(L1) = rule(L2) and
(3) seen-so-far(L1) || X = seen-so-far(L2),
then advance(i, X) = j if and only if i wins on X;
otherwise advance(i, X) = ∅.
The definitions of advance* and
continuation are as given above: if iii is
a set of items and X is a symbol, advance*(iii, X)
= {j | j = advance(i, T) for some i ∈
iii }, and j is a continuation of i if j =
advance(i, X) for some symbol X, or if
j is a continuation of some continuation of i.
The scan relation for EBNF grammars
The relation scan(i, j) holds between items i and j if and only if
there exists some T such that
i expects T,
i wins on T,
and j = advance(i, T).
Inferences regarding relevance for EBNF grammars
The definitions of useful, useless, relevant, and irrelevant items are as given above
(section “Useful items”). Some of the inferences given above can
be generalized for EBNF grammars.
Any item i = (x, y, L) may have more than one
continuation which is a completion for nt(L). If i
is useful, at least one of those completions will be
useful. (If more than one is useful, they will be useful
in different Earley trees, and G will be ambiguous.)
If any item i of the form (x, y, L) is relevant
at the point where we have knowledge of I[0, y] and
expects a symbol X and wins with X, then
advance(i, X) is also relevant, at the point where
we have read X. (For terminal symbols, that will be at
point I[0, y+1]; for non-terminals, it may be at any
point between I[0, y] and I[0, n], inclusive.) If
expects(L) ⊆ VT (i.e. every symbol expected by
i is a terminal symbol), and i loses, then at point
I[0, y+1] i is known to be false and thus known to
be useless, so at that point it is no longer relevant.
If any of the symbols expected by i are non-terminals,
it is not usually known how long a string any of those
non-terminals might generate; it is accordingly somewhat
harder to establish with certainty that any such items
have lost definitively and have become irrelevant.
At any point when we have read I[0, y], if item i
= (x, y, L) is relevant, then for every non-terminal
N in expects(L), and for every rule R defining N
in the grammar G, the item (y, y, L) with
rule(L) = R and seen-so-far(L) = ε (the empty
sequence) is relevant.
The pred and comp relations for EBNF
grammars
The relation pred(i, j)
(prediction
) holds between any two items
i and j if and only if item i = (x, y, L1)
expects some non-terminal N, and j = (y, y,
L2) with nt(L2) = N and seen-so-far(L2) = ε.
The definition of relation comp(i,
j, k)
(completion
) applies to EBNF grammars in the form defined
above (section “The pred() and comp() relations among items”).
The algorithm
Provided with the operations defined above, we
can now describe the Earley algorithm in a form that
applies to EBNF as well as to BNF.
Actually, the only change needed is that instead of
saying that the set is initialized with the starter item
(0, 0, Goal → · S), we say that it is
initialized with a starter item item of the form (0, 0,
L) with nt(L) = Goal and seen-so-far(L) = ε.
Follow-on work
Future work should include formal proofs of one or more of
the following propositions stated or suggested informally
above.
In them, E is the set calculated by the
algorithm in Earley’s paper, and S is the smallest set
which contains the starter item and is closed over the
relations scan, pred, and comp (i.e. the set
described in the declarative functional restatement of the
algorithm).
-
The set E and the set S are
identical.
-
S contains no false items.
-
Every item i of the form (x, y, L)
which is relevant with respect to the partial knowledge
available from having read I[0, x] is in
S.
-
Every item in S of the form (x, y, L)
is relevant with respect to the partial knowledge
available from having read I[0, x].
The first of these amounts to showing the correctness of
my claim to have translated Earley’s algorithm into
functional terms.
The second would establish that the functional algorithm
described here is correct, in the sense that if for a
given I and G it produces a set from which an Earley
tree can be constructed, then I is in fact a sentence
of the language defined by G.
The third would establish that the functional algorithm
given here is complete, in the sense that if a given I
is a sentence of the language defined by a given grammar
G, the algorithm will produce a set from which an
Earley tree can be constructed.
The fourth would suggest that the functional algorithm
given here does not do any work which can be seen in
advance to be pointless.
Concluding remarks
What can be learned from the example of Earley
parsing for the general problem of translating procedural
algorithms into XSLT and XQuery?
Three basic ideas seems to stand out.
-
Update operations define relations.
Each of the basic operations performed by the Earley
operation consists in adding an Earley item to a set,
given certain conditions. The functional paraphrase given
here works by interpreting each of these operations semantically
as identifying a relation that holds between one or more items
already in the set and the new item being calculated.
(The dependency of the operations on G and I complicates
matters very slightly, because it makes scan and
pred, which would otherwise be binary relations,
into relations with four arguments. But since G and I don’t change,
we can treat them as binary relations, for which transitive closure
is easily defined, by restricting our universe of discourse to
contexts in which G and I are present. Earley achieves the same
effect by treating G and I as global variables.)
-
Iteration can calculate a transitive closure.
Earley describes the algorithm as running until either
the input is exhausted or the parse fails. We can capture
this in a non-temporal way by making the algorithm
calculate the transitive closure of one or more
relations.
Transitive closure is not a primitive operation in XSLT or
XQuery, except in the special cases of the axes ancestor,
descendant, following[-sibling],
etc. But any transitive closure can be calculated without
great difficulty using a recursive function.
-
Sets may be easier than more complex structures.
We are able to use the concept of transitive closure here
because we are operating on sets of Earley items; if instead we
were building a set of parse trees, both the relations among objects
and the calculation of the transitive closure would almost certainly
be more complicated.
References
[Brüggemann-Klein 1993a]
Brüggemann-Klein, Anne.
1993.
Regular expressions into finite automata.
Theoretical Computer Science
120.2 (1993): 197-213. doi:https://doi.org/10.1016/0304-3975(93)90287-4.
[Brüggemann-Klein/Wood 1998]
Brüggemann-Klein, Anne,
and
Derick Wood.
1998.
One-unambiguous regular languages.
Information and computation
140 (1998): 229-253. doi:https://doi.org/10.1006/inco.1997.2688.
[Brzozowski 1964]
Brzozowski, Janusz A.
Derivatives of regular expressions.
Journal of the ACM
11.4 (1964): 481-494. doi:https://doi.org/10.1145/321239.321249.
[Earley 1970]
Earley, Jay.
1970.
An efficient context-free parsing algorithm.
CACM
13.2 (1970): 94-102. doi:https://doi.org/10.1145/362007.362035.
[Grune/Jacobs 1990/2008]
Grune, Dick, and Ceriel J. H. Jacobs.
1990/2008.
Parsing techniques: a practical guide.
First edition New York et al.: Ellis Horwood, 1990.
Second edition [New York]: Springer, 2008.
[Pemberton 2013]
Pemberton, Steven.
Invisible XML.
Presented at Balisage: The Markup Conference 2013,
Montréal, Canada, August 6 - 9, 2013.
In
Proceedings of Balisage: The Markup Conference 2013.
Balisage Series on Markup Technologies, vol. 10 (2013).
doi:https://doi.org/10.4242/BalisageVol10.Pemberton01.
[Sperberg-McQueen 2005]
Sperberg-McQueen, C. M.
2005.
Applications of Brzozowski derivatives
to XML Schema processing.
Paper given at Extreme Markup Languages 2005, Montréal,
sponsored by IDEAlliance.
Available on the Web at
http://www.mulberrytech.com/Extreme/Proceedings/html/2005/SperbergMcQueen01/EML2005SperbergMcQueen01.html,
http://cmsmcq.com/2005/abdxsp.unicode.html, and
http://cmsmcq.com/2005/abdxsp.ascii.html.
×
Grune, Dick, and Ceriel J. H. Jacobs.
1990/2008.
Parsing techniques: a practical guide.
First edition New York et al.: Ellis Horwood, 1990.
Second edition [New York]: Springer, 2008.
×
Pemberton, Steven.
Invisible XML.
Presented at Balisage: The Markup Conference 2013,
Montréal, Canada, August 6 - 9, 2013.
In
Proceedings of Balisage: The Markup Conference 2013.
Balisage Series on Markup Technologies, vol. 10 (2013).
doi:https://doi.org/10.4242/BalisageVol10.Pemberton01.