Note

This paper makes no overt reference to markup; perhaps some rationale is due for submitting it Balisage.

Balisage is about the use of markup to meet challenges in information manangement. Many, perhaps most, users of descriptive markup use XML. Many (although not all) users of XML find it more convenient to process XML using XSLT and XQuery than using most other programming languages; some would like to XSLT and XQuery developed as general-purpose programming languages. To be useful as general-purpose programming languages, however, they need libraries providing standard functionality and often implementing standard algorithms.

This paper describes the reimplementation of one standard in XQuery / XSLT terms. Its interest, if it has any, is first in the functional, declarative reformulation of the interesting (even beautiful) Earley algorithm and second in the way that that reformulation has been achieved. Perhaps this example will be helpful to others seeking to reimplment standard functionality in these languages.

Introduction

Among XML users, it’s not unusual to find XSLT and XQuery being used as general-purpose, not special-purpose, programming languages. It is easier to use a language for general-purpose programming across a wide range of application areas if it has libraries of subroutines for different domains. To build up larger libraries of subroutines for general-purpose languages, it’s sometimes useful to re-implement standard algorithms in the new language.

One of the challenges of reimplementing standard algorithms in XSLT and XQuery is that standard algorithms are often described in imperative terms which cannot be translated directly into declarative functional languages. If we can understand the algorithms in a more declarative way, however, it may be possible to implement them straightforwardly in XSLT or XQuery. The difficulty and one path to resolving it are illustrated here with the parsing algorithm described by Jay Earley in 1970 (Earley 1970).

Earley parsing is a well-known technique for parsing input using context-free languages; unlike techniques like recursive descent and standard table-driven parsing it does not require special properties in the grammar, like being LL(1) or LALR(1): it will work for any input and any grammar, whether the grammar requires lookahead or not. But the algorithm as Earley describes it is quite low-level: routines called the predictor, the scanner, and the completer take turns changing global data structures until it can be seen from the data structures that the parse has succeeded or failed. In the case of success, one or more parse trees can be extracted from the data structure; in failures, one can identify the point at the input at which the attempt failed and generate diagnostic messages. It is not always immediately obvious to the reader, however, exactly why the predictor, the scanner, and the completer behave as they do, or why their interaction is guaranteed to find any and all legitimate parses of the input. And since XSLT and XQuery don’t allow the assignment of new values to variables, the data-structure manipulations used by Earley cannot be implemented as described.

The discussion below tries to step back and think about what the data structures described by Earley mean, in order to understand the algorithm in a purely declarative way. We proceed roughly as follows:

  1. Earley defines an important data structure called the Earley item and the Earley algorithm initializes and manages a set of such items, adding new elements to the set when certain conditions are met.

  2. Since the conditions to be met typically involve the presence of particular items in the set, and since the new items to be added are typically constructed on the basis of those existing items, we can consider Earley’s operations as defining several relations on items. A relation R holds between two Earley items X and Y just in case Earley’s algorithm will add Y to the set whenever X is present. Some relations used by Earley are ternary, not binary; in those cases, item Z will be added to the set just in case X and Y are in the set and R(X, Y, Z) holds.

  3. The algorithm's updates to the set of items can then all be described in the following terms: if an item X is in the set (or, for some steps, if items X and Y are in the set), and the relation R holds between X (or X and Y) and another item Z, then Z is added to the set.

  4. It is then possible to see that Earley’s algorithm calculates the smallest set of Earley items which (a) contains a standard starter item and (b) is closed over a number of relations defined as holding (or not holding) for pairs or triples of Earley items.

Re-thinking the Earley algorithm in this way not only makes it easier to implement it in XSLT and XQuery, but helps make it clear why the parser is both complete (it will always find a parse if there is one) and correct (any parse it finds will be a real parse).

The next two sections of this paper (section “Notation” and section “Earley items”) introduce some basic notation and the concept of the Earley item. We then (section “Earley trees”) define Earley trees; these are not defined by Earley, but prove useful in the later discussion. The nodes of Earley trees are sets of Earley items with certain parsing-relevant properties. We then proceed (section “Properties and relations of Earley items”) to identify some interesting properties of Earley items (truth, expectation, testability, usefulness, relevance), and some useful relations between items (advance, continuation, scan, prediction). Supplied with these relations, we can define the Earley algorithm using them and show how they guarantee that it is complete and correct (section “The Earley algorithm”).

The initial description of the algorithm applies to grammars given in Backus-Naur Form; the penultimate section (section “Two technical excursus”) contains discussions of two technical topics of practical importance: how to calculate the required sets in XSLT and XQuery, and how to extend the treatment of Earley parsing here to handle regular-right-part grammars like those specified for Invisible XML (section “Extension to EBNF”). The final section (section “Concluding remarks”) tries to draw some lessons for XSLT and XQuery translations of procedural algorithms from this example.

Notation

We assume a grammar G and an input string I of length n. A grammar is a four-tuple (VN, VT, P, S), where

  1. VN is a set of non-terminal symbols.

  2. VT is a set of terminal symbols, disjoint from VN (i.e. VNVT = ∅).

  3. P is a set of production rules of the form Nα, where

    1. NVN.

    2. α is a sequence of zero or more symbols from (VNVT).

  4. SVN is the start symbol of the grammar.

The vocabulary of G is the set of terminal and non-terminal symbols V = VNVT. When we need to mention the start symbol of G, we call it S. Non-terminal and terminal symbols in G will be called N, T, N1, N2, ... and so on. Sequences of terminals and non-terminals will be called α, β, γ ...

For purpose of our exposition, we often wish to deal not with the grammar G as is, but with an augmented grammar G′ = (VNGoal, VT, PGoalS, Goal), where GoalVN. This allows us to know that the start-symbol Goal has exactly one rule, whose form we know, and that it never appears on the right-hand side of any rule in G′. It should be clear that G and G′ define the same language.

We use the term string or string of characters when referring to I or substrings of I, or other sequences of terminal symbols. We use the term sequence when referring to sequences of symbols in V. There is no technical distinction between the terms string and sequence other than the restriction of strings to terminal symbols; they are distinguished here only in an attempt to make the argument easier to follow.

For any two sequences or strings α and β, α β denotes the concatenation of α and β. Where simple juxtaposition might be ambiguous, or where it is desired to emphasize the concatenation operator, the form α || β is used with the same meaning.

We write the empty string ε and the empty sequence either ε or ().

The expression I[i] refers to the character at position i in I; the expression I[x,y] refers to the substring from (just after) position x up to and including the character at position y. We use 0 to refer to the position before the first character.

For example, if I is the string a+b, then

  1. I[1], I[2], I[3] refer to the characters a, +, and b, respectively.

  2. I[0,1], I[0,2], I[2,3] refer to the substrings a, a+, and b respectively; I[0,0] and I[3,3] refer to the empty strings immediately before a and after b, respectively.

Note that for any string I and any non-negative integers x, y, z all ≤ n, we have I[x, y] || I[y, z] = I[x, z], and for any x (0 < xn), I[x] = I[x-1, x].

The discussion which follows tries to make the line of argument as explicit and easy to follow as possible, but some basic knowlege of context-free grammars is inevitably assumed. (Ironically, any attempt to make an argument easier to follow by making the inference steps as small as possible has the follow-on effect of multiplying the number of formulas and such, so that at first glance the text will appear less accessible, not more accessible. Readers interested but uneasy with formalisms are encouraged to take a deep breath and persevere.)

Earley items

An Earley item is a triple (x, y, Nα · β), where

  1. x is a number between 0 and n, inclusive, referred to as the start value

  2. y is a number between x and n, inclusive, referred to as the end value

  3. α and β are sequences in V*, i.e. sequences of zero or more symbols (terminals or non-terminals)

  4. N → α β is either a production rule in G or the special rule GoalS described above.

The third item in the triple is a location (sometimes referred to in parsing literature simply as an item); if we don’t care about the details of the location, we may write an Earley item (x, y, L).

If β is the empty sequence, so that the item has the form (x, y, Nα ·), then the item is called a completed item (or a completion) for N.

For example, given I as above (a+b), if G is the tuple (VN, VT, P, E), where

  1. VN = { E, T, P }

  2. VT = { *, +, a, b, (, ), [, ] }

  3. P =

    • ET

    • EE * T

    • TA

    • TT + P

    • Aa

    • Ab

    • A( E )

    • A[ E ]

then possible Earley items include
  1. (0, 0, E → · T)

  2. (2, 3, TT + · A)

  3. (2, 2 E → · T)

  4. (2, 2 T → · T + A)

  5. (3, 3, A( E ) ·)

  6. (0, 3, E → · T)

  7. (0, 3, ET ·)

  8. (0, 3, A( E · ))

  9. (0, 0, Goal → · E)

  10. (0, 0, GoalE ·)

Informally, we understand an Earley item to mean that in grammar G, the string of symbols α generates the string I[x, y]. (A sequence of symbols generates a string if and only if the string can be produced from the symbols by applying a series of zero or more rewrites using rules of G. Since generation involves zero or more applications of the rewrite relation written →, it is not infrequently denoted ⇒*.) So the examples just given can be interpreted as saying

  1. The string immediately before a is generated by the empty sequence of symbols.

  2. The string b is generated by the symbol sequence T +.

  3. The string immediately after + is generated by the empty sequence of symbols.

  4. The string immediately after + is generated by the empty sequence of symbols.

  5. The string immediately after b is generated by the sequence ( E ).

  6. The string a+b is generated by the empty sequence of symbols.

  7. The string a+b is generated by the sequence T.

  8. The string a+b is generated by the sequence ( E.

  9. The string immediately before a is generated by the empty sequence of symbols.

  10. The string immediately before a is generated by the sequence of symbols E.

Note that since we normally use Earley items to keep track of where we are in recognizing input against a grammar, informally an Earley item may also suggest that we are expecting to see an instantiation of the sequence β that follows the dot. (See also the discussion of expectation, below.) But expectations are not relevant for the truth or falsehood of the item, and thus not part of the meaning of the item, narrowly construed.

Of these examples, (5), (7), and (10) are completed items.

Note that because they have a meaning, Earley items can be true or false.

For example, in the list just given, items (1), (3), (4), (7), and (9) are true statements, and the others are false.

Note that if x = y and α is the empty string, then we have an Earley item of the form (x, x, N → · β). The meaning of such an item is always that the empty sequence of grammar symbols derives the empty sequence of terminal symbols, and the item is accordingly always true.

Earley trees

These are not defined by Earley, but they prove helpful for later discussion.

An Earley tree for the start symbol S in a given grammar G, for a given input I, is an ordered tree of Earley items with the following properties:

  1. Each node in the tree is either an Earley item or an expression of the form I[y] or an expression of the form I[y, y].

  2. The children of any node are ordered.

  3. The root node has the form (0, n, GoalS ·), where S is the start-symbol of G and Goal is the start symbol of the augmented grammar G′ described above.

  4. For any node K of the form (x, y, N1α N2 · β), (i.e. in which the last item before the dot is a non-terminal N2), the node’s children will take the form:

    • (w, w, N2 → · R1 R2 R3 ... Rk)

    • (w, z1, N2R1 · R2 R3 ... Rk)

    • (w, z2, N2R1 R2 · R3 ... Rk)

    • ...

    • (w, y, N2R1 R2 R3 ... Rk ·)

    where
    1. k is the length of the rule’s right-hand side.

    2. wz1z2 ≤ ... ≤ zn-1y.

    3. w is equal to the end value of node K’s left sibling, if K has a left sibling.

      Otherwise, w = x (the start value of K).

      Note that in either case we have xwy.

    Recall that by the definition of Earley item, N2R1 R2 R3 ... Rk is a rule in G.

    By this rule, all siblings in an Earley tree have locations in the same production rule of G, all have the same start value w, and they have non-decreasing end values. The start value of all children is the same as the end value of the parent’s left sibling, when it has one; otherwise it’s the same as the start value of the parent. (The second case arises for unit rules and for the children of the root node.) The end value of the last child is the end value of the parent.

  5. Every node K of the form (x, x, N → ·), (i.e. a completion for non-terminal N in which the production has an empty right-hand side) has a single child node, of the form I[x, x].

  6. For any node of the form (x, y, Nα T · β), (i.e. in which the last item before the dot is a terminal symbol T), the node’s sole child is the terminal symbol I[y], and I[y] = T.

  7. Every node of the form I[y] is a leaf node (has no children).

Like an Earley item, an tree of Earley items has a meaning, and can be true or false.

  1. The root node has the form (0, n, GoalS ·). Informally, this means means S generates the string I[0,n], and since I[0,n] = I, this amounts to saying that S generates I. Since S is the start symbol of the grammar G this in turn means that I is a sentence in the language defined by G.

  2. For any node K of the form (x, y, N1α N2 · β), the meaning is that the substring of I from x to y, i.e. I[x,y], is generated by the sequence α N2. This will be true if and only if (a) α generates I[x,w] and (b) N2 generates I[w,y].

    K’s first child has the form (w, w, N2 → · α) and means that the empty sequence of symbols generates the zero-length (empty) string I[w,w]. Like all Earley items of similar form, it is true.

    Each subsequent child of K has the meaning that a longer and longer prefix of the right-hand side of the rule N2R1 R2 ... Rk generates a longer and longer substring of I beginning at w. The final child, which has the form (w, y, N2R1 R2 ... Rk ·), means that the sequence of symbols R1 R2 ... Rk generates I[w,y], the substring of I from w to y. It follows that N2 also generates the string I[w,y].

    If the sequence of symbols α generates I[x,w], and the symbol N2 generates I[w,y], then it follows that the sequence α N2 generates I[x,y]. So if (a) the Earley item in K’s left sibling is true, or K has no left sibling, and (b) all of K’s children are true, then K is also true.

  3. For any node K of the form (x, y, Nα T · β), (i.e. in which the last item before the dot is a terminal symbol T), the meaning of the Earley item is that the sequence of symbols α T generates I[x,y], the substring of I from x to y.

    The left sibling of K, if it exists, will be an Earley item of the form (x, w, Nα · T β), where w = y - 1. If the Earley item in the left sibling is true, and if I[y] = T, then node K is true.

It follows from the rules that all leaves in the tree are either of the form I[j] or else of the form I[j, j], for 1 ≤ j ≤ n. It also follows that for the leaves, taken in tree order, j is monotonically non-decreasing. For any two leaf of the form I[j], the immediately following leaf in tree order may be either I[k] or I[k, k], where k = j + 1. For a leaf of the form I[j, j], the next leaf in tree order may be either I[j, j], or I[k], again with k = j + 1.

The Earley tree is true (or correct) if and only if the concatenation of all the leaves, in order, is the input I. We say that such an Earley tree exhibits a derivation of I from G.

It should be evident that a conventional parse tree for the input I against the grammar G can be constructed from a correct Earley tree by (1) deleting each node (x, x, N → · α) for non-empty α; (2) replacing each node (x, y, N1α N2 · β) with N2; (3) replacing pre-terminal nodes (nodes with single leaf children) with their children; (4) replacing each leaf node with the string it denotes (I[j] with character j of I, I[j, j] with the empty string); and (5) retaining the parent-child relations and sequence of children.

It should also be evident that the parse tree implied by an Earley tree for a given I and G is correct if and only if every Earley item in the tree is true.

Brief digression

We note in passing that the job of any parser is to calculate an Earley tree for a given G and I; one way to think of it is as starting with the given information (I and G), calculating a set of Earley items, and then determining whether we can build an Earley tree from them.

The job of a correct parser is to ensure that all the items in the Earley tree are true (and thus that the tree is correct).

The job of an efficient parser is to do so with as little effort as possible (in particular, considering as few items as possible, and testing as few as possible for truth).

The job of an unrestricted parser is to calculate a correct Earley tree for any arbitrary G and I; restricted parsers (e.g. LL(1) or LR(k) parsers) do so only for some G.

Grune and Jacobs observe that the problem with some general parsing methods is that in the worst case they end up doing a lot of unnecessary work, such as calculating the truth of a lot of Earley items that turn out irrelevant to the construction of a correct parse tree.

The Earley algorithm is a way to calculate Earley trees for arbitrary context-free grammars while reducing the amoung of unnecessary work. To summarize the algorithm, we need to define some properties of Earley items, beyond truth and falsehood.

Properties and relations of Earley items

The first important property of Earley items has already been mentioned several times: Earley items can be true or false. This section describes some additional properties and relations.

Expectation

An Earley item (x, y, Nα · X β), i.e. an item with the symbol X (terminal or non-terminal) right after the dot, is said to expect an X.

Note: We are restricting ourselves here to CFGs given in BNF. The extension to EBNF and regular right part grammars is not too difficult, but complicates the account. See below (section “Extension to EBNF”) for a discussion of extended BNF.

Testability; winning and losing

An Earley item (x, y, Nα · T β), which expects a terminal symbol T, can be directly tested against the input. If I[y,y+1] = T, then the item wins; otherwise it loses.

If the item just described is true and wins, then the related item (x, y, Nα T · β), which moves the dot one symbol to the right, is also true. That is, given the truth of:

  • (x, y, Nα · T β)

  • I[y,y+1] = T

we can infer the truth of:
  • (x, y + 1, Nα T · β)

The advance() function; continuations

We define the function advance() to describe this relation; it maps from an Earley item i and a terminal symbol T to an Earley item j or to nothing. Specifically, if

  • i = (x, y, Nα · T β)

  • j = (x, y + 1, nα T · β)

then
  • advance(i,T) = j if and only if i wins

  • advance(i,T) = ∅ if and only if i loses

For convenience, we may also want to be able to apply the function to a set of items, not just a single item. If iii is a set of Earley items, and T is a terminal symbol, then advance*(iii, T) is the set {j | j = advance(i, T) for some iiii}. We refer to j as a continuation of i. Continuation is transitive: any continuation of j is also a continuation of i.

The scan() relation between items

Looking ahead, we say that the relation scan(i, j) holds between Earley items i and j if and only if i expects terminal T, i wins, and j = advance(i, T).

Testing and advance() function for non-terminals

By contrast, an Earley item (x, y, N1α · N2 β) which expects a non-terminal symbol N2 cannot be directly tested against the input. It can however can be tested indirectly. If N2 generates a substring of I beginning at y (i.e. if for some z we have N2 ⇒* I[y, z]), then the item wins; otherwise it loses.

The advance() function again

Again we define advance() appropriately: given

  • i = (x, y, N1α · N2 β)

  • j = (x, z, N1α N2 · β)

then
  • advance(i, N2) = j if and only if N2 ⇒* I[y, z] for some z

  • advance(i, N2) = ∅ otherwise

Continuations

As with terminals, so also with non-terminals; if j = advance(i, N) for some non-terminal N, then j is a continuation of i, and any continuation of j is also a continuation of i.

Useful items

An Earley item is useful if and only if it appears in a correct Earley tree for the grammar G and the input I. If an Earley item is not useful, it is useless.

If parsing algorithms could distinguish between useful and useless items before spending time on them, they could be very efficient.

But at the beginning of a parse, it’s impossible to know which items will be useful and which won’t. For example, if I is not in the language L(G) defined by G, then all Earley items for I and G are useless, but we can’t know that without calculating and testing at least a few.

Note in passing that if any item i = (x, y, Nα · β) is useful, then some completion of i, i.e. some item of the form (x, z, Nα β ·) (with zy) will also be useful; the two will be siblings in the Earley tree. There might be more than one true completion of i; if I has more than one derivation in G, more than one of them may be useful.

The converse is also true in most cases: if (x, z, Nα β ·) is useful, then there will be at least one true item of the form (x, y, Nα · β), and at least one such form will be useful (more than one, in cases of ambiguity).

Relevant items

In a particular state of partial knowledge, when we have examined part of the input but not all, an Earley item is relevant if, based on what we know, it could be useful. (We will make this more precise in a moment.)

Also any Earley item (x, x, N → · α β) is relevant if for some y (≠ x) the item (x, y, Nα · β) is relevant.

Given a particular state of partial knowledge, an Earley item is irrelevant if we know, based on that partial knowledge, that it is useless. Any item known to be false (in that same state of partial knowlege) is thus irrelevant. I believe but have not attempted to prove that in any given state of partial knowledge, any Earley item that is not relevant is irrelevant. This might depend on the nature of the partial knowledge available.

The Earley algorithm is an online algorithm, which reads the input left to right (or: front to back, if one wishes to avoid fatal confusion in cases of right-to-left scripts). At any given point in time, we have read the first y characters in the input, for some number y such that 0 ≤ yn, so we have knowledge of the substring I[0, y], and no knowledge of the substring I[y, n]. We also have full knowledge of the grammar G.

Given knowledge of I[0, y], an Earley item of the form (x,y,L) is relevant if it could be part of an Earley tree for G and some input which starts with the string I[0, y]. Or more formally, the Earley item is relevant if and only if there is some string I′ which shares the prefix I[0, y] with I, for which the item is useful (appears in some correct Earley tree).

Some consequences of the definition of relevance

It follows from the definition of relevance just given that the item (0, 0, Goal → · S) is relevant for all grammars and inputs.

It also follows from the definition of relevance that if an item i = (x, y, L) expects terminal T, and i is relevant, then advance(i, T) is relevant if it exists (i.e. if i wins; if i loses, advance(i, T) does not denote an item).

It further follows from the definition of relevance that if an item i = (x, y, L) expects non-terminal N, and i is relevant, then for every production rule Nα in G, the item (y, y, N → · α) is relevant.

If i is useful as well as relevant, then we know we’ll need an item (y, z, Nα ·) in the tree as well (since i is expecting an N, and appears in a derivation tree, there will be an N in the derivation tree at the appropriate location). So at least one of the production rules for N will be useful as well, but if there is more than one we do not yet know which.

If the grammar is unambiguous, at most one production rule will be relevant, but G could be ambiguous, so more than one could be useful.

We don’t know, however, how N might be instantiated in I (that’s essentially a consequence of the nature of CFGs), so any of the production rules for N could be the one needed for the tree. That means every possible item (y, y, N → · α) is relevant if Nα is in G.

The pred() and comp() relations among items

Looking ahead, we say that the relation pred(i, j) holds between items i and j if and only if item i = (x, y, L) expects non-terminal N and item j has the form (y, y, N → · α). Here pred is short for predicts.

It also follows from the definition of relevance that if we have

  • item i = (x, y, N1α · N2 β) expecting a non-terminal N2

  • item j = (y, z, N2γ ·) is a completion for N1.

and both are relevant (if i is relevant, then j is also relevant unless j is false), then the item advance(i, N2), i.e. (x, z, N1α N2 · β), is also relevant.

Looking ahead, we say that the relation comp(i, j, k) holds if and only if

  1. item i = (x, y, L1) expects some non-terminal N

  2. item j = (y, z, L2) is a completion for N

  3. k = advance(i, N)

The Earley algorithm

The Earley algorithm (as described in Earley 1970) is a procedural way of calculating a set of Earley items which are all true and are all relevant at the time they are added to the set. I believe (but have not proved) that it is in fact the set of all items which are both true and relevant by the definitions above.

The algorithm calculates a set of Earley items which obeys the following rules:

  1. The item (0, 0, Goal → · S) is a member of the set.

  2. If i = (x, y, L) is a member of the set which expects terminal symbol T, and I[y+1] = T, then advance(i, T) is also a member of the set.

    If (x, y, L) is a member of the set which expects non-terminal symbol N, and Nα is a production rule in G, then (y, y, N → · α) is also a member of the set.

  3. If i = (x, y, L) is a member of the set which expects non-terminal symbol N, and j = (y, z, Nα ·) is a member of the set, then advance(i,N) is also a member of the set.

  4. Nothing else is a member of the set.

Or, more concisely, the set is the smallest set that contains (0, 0, Goal → · S) and is closed over the relations scan(i, j), pred(i, j), and comp(i, j, k).

Earley defines things procedurally, in terms of the following steps:

  1. He starts by initializing the set of items to (0, 0, Goal→ · S).

  2. He then applies rule 3 above (our pred() relation) until it produces no new items. At this point all items in the set have x = y = 0.

  3. Then he iterates over the following procedure.

    1. First apply rule 2 (Earley’s scanner, our scan() relation), which generates new items of the form (x, y+1, L). If no new items are produced by this step, we are done; I is not a member of L(G).

      (Digression: Why are we done? If applying rule 2 produces no new items, then one of the following is true:

      1. (a) The set of items includes no items which expect a terminal. In this case, we have reached a point in the grammar where no terminal is predicted, which means that no terminal can possibly advance the parse, which means the parsing situation is hopeless.

      2. (b) The set of items includes some items which expect a terminal, but none of them win. In this case, the grammar does make testable predictions for this point, but the input satisfies none of them, which means that the input definitely fails to match the grammar.

      Either way, there is no parse tree for I as a sentence of G, so we are done. End of digression.)

    2. Then apply rule 3 (Earley’s predictor, our pred() relation) and rule 4 (Earley’s completer, our comp() relation) until they produce no new items.

    3. If the highest end-point y in our items (x, y, L) is less than n, then start another round of this procedure (go back to finding instances of the scan() relation).

      Otherwise, the highest end-point y = n and we have reached the end of the input and are done.

  4. If the set contains (0, n, GoalS ·), we have recognized the sentence and can construct an Earley tree from the set. If the set doesn’t contain that item, I is not a member of L(G).

Each round of this procedure uses progressively higher values for the end-point in the search for instances of the scan() relation, so there are at most n rounds before we reach the end of the input.

It seems clear from this way of putting it that the procedure described by Earley calculates the transitive closure of (0, 0, Goal → · S) over the relations scan(), pred(), and comp().

It’s pretty clear, intuitively, that the starter item (0, 0, Goal → · S) is relevant and true, and that all of the relations scan(), pred(), and comp() produce new items for the set which are also relevant and also true. Producing only relevant items is what distinguishes Earley from some other algorithms, which work purely bottom-up and essentially try to calculate all true items with xy. I believe (but hesitate to say it’s intuitively obvious) that the set described contains all relevant items, as that term is defined above; that property distinguishes the Earley algorithm from things like recursive-descent and LR(1) parsing, which in the interests of simplicity and/or speed ignore some relevant items (which is why they can’t handle all grammars).

Earley describes his algorithm in terms of very small mechanical tasks; it’s obvious how to implement it, especially in a procedural language with mutation. It’s not at all obvious to me from E’s description how to do the same thing in a declarative language without mutation.

But by describing the set in terms of the relations, and defining the relations in terms of their properties, instead of in the imperative, procedural-pseudocode style used by Earley and by other descriptions, I think I have succeeded in defining the relevant sets and properties declaratively, which will make it easier to calculate them in a functional language.

Two technical excursus

Calculating transitive closures

For implementing this declarative functional version of the algorithm, it would obviously be handy to have a language with a built-in transitive-closure operator. But with higher-order functions, we can build our own, I think.


  (: to take the transitive closure of some
     function f on some set $s, first define an
     equality function $eq, then specify a maximum
     number of recursive calls $n, and finally call
     my:transclos($eq, f, $n, $s)
     :)
  declare function my:transclos(
    $eq as function(*), (: equality test :)
    $r as function(*), (: unary function, for binary relation :)
    $maxcycles as xs:integer,
    $set as item()*
  ) as item()* {
    local:tc-helper($eq, $r, $maxcycles, $set, $set)
  };

  declare function local:tc-helper(
    $eq as function(*), (: equality test :)
    $r as function(*),
        (: unary function to calculate a binary
           relation :)
    $ttl as xs:integer,
    $queue as item()*,
    $accum as item()*
  ) as item()* {
    if (empty($queue)) then
       $accum
    else if ($ttl le 0) then
       error("Could not complete calculation, out of cycles.")
    else
       let $this := $queue[1],
           $rest := $queue[position() gt 1],
           $candidates := $r($this),
           $new := $candidates
	           [no $x in $accum
	           satisfies
	           $eq(., $x)]
       return local:tc-helper($eq, $r, $ttl - 1, 
                   ($rest, $new), ($accum, $new))
  };

Unfortunately, I really don’t know a good way to generalize this for ternary relations like comp. But the pattern is simple enough to replicate.

Extension to EBNF

The account of the algorithm given above exploits the simplicity of production rules in BNF, which have a flat sequence of symbols in the right hand side of each rule. But many grammars extend BNF in ways which vary but typically amount to allowing not just a sequence of symbols, but a regular expression over V, on the right-hand side of a rule. The notations are typically called extended BNF, or EBNF; grammars written using such notations are regular-right-part grammars (because the right-hand part of each rule is a regular expression).

Among the notations for regular-right-part grammars of interest to XML users is the variant of van Wijngaarden grammars specified by Steven Pemberton for use in invisible XML (Pemberton 2013).

To apply the Earley algorithm to regular-right-part grammars, it is necessary to generalize the account above, which can be done as follows. In particular, we will need more abstract and general notions of (1) location, (2) position in a production rule, (3) expectation, and (4) the advance function.

As before, an Earley item is a triple (x, y, L), interpreted in the context of a grammar G and an input string I, where x and y are the start and end points of a substring of the input I, and L is a location inside some rule in the productions P of the grammar G. But instead of taking the form Nα, where α is a sequence of symbols in V, the location now has a structure whose details are not specified. Instead, we define several operations on L, which provide the information we need for the algorithm.

We would like the definition of these operations to be general enough to be consistent with any of several possible implementations of locations:

  1. For a BNF, we can use a sequence of symbols in V and a non-negative integer representing an index into that sequence.

  2. For an EBNF, we might use a pointer to a rule in the grammar, together with a derivative expression identifying what is still required or allowed to appear.

  3. For any grammar, we could use a triple consisting of a non-terminal N, a finite state automaton (FSA) representing the right-hand side of a rule for N, and an identifier denoting the current state in the FSA.

Operators on locations

The following operations on locations are useful in describing Earley parsing for regular-right-part grammars.

  1. rule(L) denotes the production rule within which the location L appears.

  2. nt(L) denotes the non-terminal on the left-hand side of rule(L).

  3. rhs(L) denotes the right-hand-side of rule(L)

  4. seen-so-far(L) denotes a sequence of symbols in V (all of which appear in rhs(rule(L))), which form a prefix of some word recognized by rhs(rule(L)).

  5. remains-to-be-seen(L) is an expression describing the part of rhs(L) which remains to be matched after seen-so-far(L) has been seen.

    In an EBNF it will always be (equivalent to) the derivative of rhs(L) with respect to seen-so-far(L).

  6. completed(L) is true if and only if remains-to-be-seen(L) is nullable, i.e. it is matched by (accepts) the empty sequence.

    For any item (x, y, L), if completed(L) then nt(L) ⇒* I[x, y]. Such an item is a completed item, and a completion of nt(L).

  7. expects(L) denotes a set of symbols in rhs(L). In a BNF, we get exactly one symbol, or none: the symbol to the right of dot in the item. In an EBNF, we get the set of symbols (terminal or non-terminal) which can follow the current state, or equivalently the set of symbols which are accepted as initial symbols by remains-to-be-seen(L).

For any two locations L1 and L2, if rule(L1) = rule(L2) and seen-so-far(L1) = seen-so-far(L2), then L1 and L2 are equivalent. Let R = rule(L1) = rule(L2) and α = seen-so-far(L1) = seen-so-far(L2); then we have:

  1. nt(L1) = nt(L2) = left-hand side of R

  2. rhs(L1) = rhs(L2) = right-hand side of R

  3. remains-to-be-seen(L1) is equivalent to remains-to-be-seen(L2) in the sense that they accept the same language; in each case, the expression will be the derivative of rhs(R) with respect to α, or another expression equivalent to that derivative. (In practice, it is helpful to simplify derivative expressions, but a weak and easy simplification may not reduce all equivalent expressions to the same form. See Brzozowski 1964.)

  4. completed(L1) = completed(L2) = nullable(remains-to-be-seen(L1)) = nullable(remains-to-be-seen(L2)).

    An expression is nullable if the language it defines includes the empty string as a sentence. For definitions of nullability and explanations of the relation between nullability of a derivative and satisfaction of the original expression, see Brzozowski 1964 or Brüggemann-Klein 1993a or Brüggemann-Klein/Wood 1998 or Sperberg-McQueen 2005.

  5. expects(L1) = expects(L2) = the set {X | XV and the derivative of remains-to-be-seen(L1) with respect to X ≠ ∅}.

A consequence of the equivalence of L1 and L2, given the identify of rule() and seen-so-far() for them, is that we can specify any location down to equivalence by specifying a rule and a sequence of symbols.

Notes on individual operators

The following subsections provide some examples and further discussion of some of the operators defined above.

Note that seen-so-far(L) is not necessarily or usually a prefix of some sequence of characters generated by nt(L); it is a prefix of some sequence of terminal and non-terminal symbols in V which matches the regular expression rhs(L).

If an item (x, y, L) is in fact true, then I[x, y] is generated by seen-so-far(L).

For example, consider L1 and L2 such that

  1. rule(L1) = AB C D

  2. rule(L2) = A → (B | C)*, D

In this case, seen-so-far(L1) may be any of
  1. () (i.e. the empty sequence)

  2. (B)

  3. (B C)

  4. (B C D)

seen-so-far(L2) may be any of
  1. ()

  2. (B)

  3. (C)

  4. (D)

  5. (B B)

  6. (B C)

  7. (B D)

  8. (C B)

  9. (C C)

  10. (B D)

  11. (B B B)

  12. ...

This operator is not actually used in calculating the set of Earley items we are interested in; the only reason for defining seen-so-far(L) is to make it possible to say (as we did above) that any item (x, y, L) is true if and only seen-so-far(L) ⇒* I[x, y].

We could do without seen-so-far(L) if we said that the interpretation of an Earley item (x, y, L) is not the string I[x, y] matches (is generated by) the part of the rule to the left of the dot but the equivalent claim for all strings s generated by remains-to-be-seen(L), the concatenation of I[x, y] with s is generated by nt(L).

Unlike seen-so-far, which returns a sequence of symbols in V, remains-to-be-seen returns a regular expression over V.

For L1 and L2 as described above, then remains-to-be-seen(L1) may be any of

  1. (B C D)

  2. (C D)

  3. (D)

  4. ()

remains-to-be-seen(L2) may be any of
  1. ((B|C)*, D)

  2. (D)

  3. ()

If the grammar is in BNF, then completed(L) if and only if remains-to-be-seen(L) = empty-sequence.

Properties of items and relations on items

Provided with the operations defined above, we can now redefine the crucial terms introduced earlier for BNF (section “Properties and relations of Earley items”) so that they also apply when G is a regular-right-part grammar.

Expectation, winning, and losing in EBNF grammars

An item (x, y, L) expects a symbol X if and only if Xexpects(L).

Given an item i = (x, y, L) and a symbol X (XV), such that i expects X, i wins on X if either

  1. X is a terminal symbol T, and

  2. I[x, y+1] = T,

or
  1. X is a non-terminal symbol N, and

  2. N ⇒* I[y, z], for some z (yzn).

Otherwise i loses on X.

Note that since i can expect more than one symbol, it may win on some symbols and lose on others. Without respect to any particular symbol, i wins if there exists as least one symbol X for which i wins on X; i loses if there is no such symbol.

The advance function for EBNF grammars

For any i = (x, y, L1), j = (x, z, L2), and XV, if (1) i expects X and (2) rule(L1) = rule(L2) and (3) seen-so-far(L1) || X = seen-so-far(L2), then advance(i, X) = j if and only if i wins on X; otherwise advance(i, X) = ∅.

The definitions of advance* and continuation are as given above: if iii is a set of items and X is a symbol, advance*(iii, X) = {j | j = advance(i, T) for some iiii }, and j is a continuation of i if j = advance(i, X) for some symbol X, or if j is a continuation of some continuation of i.

The scan relation for EBNF grammars

The relation scan(i, j) holds between items i and j if and only if there exists some T such that i expects T, i wins on T, and j = advance(i, T).

Inferences regarding relevance for EBNF grammars

The definitions of useful, useless, relevant, and irrelevant items are as given above (section “Useful items”). Some of the inferences given above can be generalized for EBNF grammars.

Any item i = (x, y, L) may have more than one continuation which is a completion for nt(L). If i is useful, at least one of those completions will be useful. (If more than one is useful, they will be useful in different Earley trees, and G will be ambiguous.)

If any item i of the form (x, y, L) is relevant at the point where we have knowledge of I[0, y] and expects a symbol X and wins with X, then advance(i, X) is also relevant, at the point where we have read X. (For terminal symbols, that will be at point I[0, y+1]; for non-terminals, it may be at any point between I[0, y] and I[0, n], inclusive.) If expects(L) ⊆ VT (i.e. every symbol expected by i is a terminal symbol), and i loses, then at point I[0, y+1] i is known to be false and thus known to be useless, so at that point it is no longer relevant. If any of the symbols expected by i are non-terminals, it is not usually known how long a string any of those non-terminals might generate; it is accordingly somewhat harder to establish with certainty that any such items have lost definitively and have become irrelevant.

At any point when we have read I[0, y], if item i = (x, y, L) is relevant, then for every non-terminal N in expects(L), and for every rule R defining N in the grammar G, the item (y, y, L) with rule(L) = R and seen-so-far(L) = ε (the empty sequence) is relevant.

The pred and comp relations for EBNF grammars

The relation pred(i, j) (prediction) holds between any two items i and j if and only if item i = (x, y, L1) expects some non-terminal N, and j = (y, y, L2) with nt(L2) = N and seen-so-far(L2) = ε.

The definition of relation comp(i, j, k) (completion) applies to EBNF grammars in the form defined above (section “The pred() and comp() relations among items”).

The algorithm

Provided with the operations defined above, we can now describe the Earley algorithm in a form that applies to EBNF as well as to BNF.

Actually, the only change needed is that instead of saying that the set is initialized with the starter item (0, 0, Goal → · S), we say that it is initialized with a starter item item of the form (0, 0, L) with nt(L) = Goal and seen-so-far(L) = ε.

Follow-on work

Future work should include formal proofs of one or more of the following propositions stated or suggested informally above. In them, E is the set calculated by the algorithm in Earley’s paper, and S is the smallest set which contains the starter item and is closed over the relations scan, pred, and comp (i.e. the set described in the declarative functional restatement of the algorithm).

  1. The set E and the set S are identical.

  2. S contains no false items.

  3. Every item i of the form (x, y, L) which is relevant with respect to the partial knowledge available from having read I[0, x] is in S.

  4. Every item in S of the form (x, y, L) is relevant with respect to the partial knowledge available from having read I[0, x].

The first of these amounts to showing the correctness of my claim to have translated Earley’s algorithm into functional terms.

The second would establish that the functional algorithm described here is correct, in the sense that if for a given I and G it produces a set from which an Earley tree can be constructed, then I is in fact a sentence of the language defined by G.

The third would establish that the functional algorithm given here is complete, in the sense that if a given I is a sentence of the language defined by a given grammar G, the algorithm will produce a set from which an Earley tree can be constructed.

The fourth would suggest that the functional algorithm given here does not do any work which can be seen in advance to be pointless.

Concluding remarks

What can be learned from the example of Earley parsing for the general problem of translating procedural algorithms into XSLT and XQuery?

Three basic ideas seems to stand out.

  1. Update operations define relations.

    Each of the basic operations performed by the Earley operation consists in adding an Earley item to a set, given certain conditions. The functional paraphrase given here works by interpreting each of these operations semantically as identifying a relation that holds between one or more items already in the set and the new item being calculated.

    (The dependency of the operations on G and I complicates matters very slightly, because it makes scan and pred, which would otherwise be binary relations, into relations with four arguments. But since G and I don’t change, we can treat them as binary relations, for which transitive closure is easily defined, by restricting our universe of discourse to contexts in which G and I are present. Earley achieves the same effect by treating G and I as global variables.)

  2. Iteration can calculate a transitive closure.

    Earley describes the algorithm as running until either the input is exhausted or the parse fails. We can capture this in a non-temporal way by making the algorithm calculate the transitive closure of one or more relations.

    Transitive closure is not a primitive operation in XSLT or XQuery, except in the special cases of the axes ancestor, descendant, following[-sibling], etc. But any transitive closure can be calculated without great difficulty using a recursive function.

  3. Sets may be easier than more complex structures.

    We are able to use the concept of transitive closure here because we are operating on sets of Earley items; if instead we were building a set of parse trees, both the relations among objects and the calculation of the transitive closure would almost certainly be more complicated.

References

[Brüggemann-Klein 1993a] Brüggemann-Klein, Anne. 1993. Regular expressions into finite automata. Theoretical Computer Science 120.2 (1993): 197-213. doi:https://doi.org/10.1016/0304-3975(93)90287-4.

[Brüggemann-Klein/Wood 1998] Brüggemann-Klein, Anne, and Derick Wood. 1998. One-unambiguous regular languages. Information and computation 140 (1998): 229-253. doi:https://doi.org/10.1006/inco.1997.2688.

[Brzozowski 1964] Brzozowski, Janusz A. Derivatives of regular expressions. Journal of the ACM 11.4 (1964): 481-494. doi:https://doi.org/10.1145/321239.321249.

[Earley 1970] Earley, Jay. 1970. An efficient context-free parsing algorithm. CACM 13.2 (1970): 94-102. doi:https://doi.org/10.1145/362007.362035.

[Grune/Jacobs 1990/2008] Grune, Dick, and Ceriel J. H. Jacobs. 1990/2008. Parsing techniques: a practical guide. First edition New York et al.: Ellis Horwood, 1990. Second edition [New York]: Springer, 2008.

[Pemberton 2013] Pemberton, Steven. Invisible XML. Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). doi:https://doi.org/10.4242/BalisageVol10.Pemberton01.

[Sperberg-McQueen 2005] Sperberg-McQueen, C. M. 2005. Applications of Brzozowski derivatives to XML Schema processing. Paper given at Extreme Markup Languages 2005, Montréal, sponsored by IDEAlliance. Available on the Web at http://www.mulberrytech.com/Extreme/Proceedings/html/2005/SperbergMcQueen01/EML2005SperbergMcQueen01.html, http://cmsmcq.com/2005/abdxsp.unicode.html, and http://cmsmcq.com/2005/abdxsp.ascii.html.

C. M. Sperberg-McQueen

Founder and principal

Black Mesa Technologies LLC

C. M. Sperberg-McQueen is the founder and principal of Black Mesa Technologies, a consultancy specializing in helping memory institutions improve the long term preservation of and access to the information for which they are responsible.

He served as editor in chief of the TEI Guidelines from 1988 to 2000, and has also served as co-editor of the World Wide Web Consortium's XML 1.0 and XML Schema 1.1 specifications.