Tricolor automata

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

7 August 2015

http://blackmesatech.com/2015/08/tricolor-automata/

Cover page image (Black Mesa)

Overview

Want to look carefully at the details?

Two elevator pitches

Vocabulary designers / users:
  • Draw a picture
  • of two content models
  • showing what is valid in both models
  • what is valid in one but not the other
Computer scientists:
  • A simple, practical application of
  • Brzozowski derivatives and
  • Glushkov automata.

Introduction

Context

The Igel project is sponsored by the Institut für Deutsche Sprache, Mannheim.
Igel is building tools for comparing document vocabularies.

A problem

Given two related DTDs,
  • How do they differ?
  • How are they the same?
  • Have any old documents become invalid?
E.g. XHTML 1.0 Strict vs. Transitional.
E.g. HTML 3.2 vs HTML 4.0.
E.g. JATS 3.1 Archiving vs JATS 4.0 Authoring
E.g. TEI Lite and XCES (or ...)
E.g. Two customizations of Docbook.
E.g. Two versions of MathML.

Displaying declarations

If we have translated the DTD into XML, we can expand the parameter entities and label the expansions:
Now A has:
<!ELEMENT bibl ( %phrase.seq{{ #PCDATA 
               | %m.phrase{{ 
                   %m.token{{ abbr | date | num 
                    | measure | name | term | time 
                    | }}m.token;
                   foreign | mentioned | distinct 
                   | title | hi | list | corr 
                   | gap | reg | ptr | ref | 
                 }}m.phrase; 
              }}phrase.seq; 
              author)* >
And B has:
<!ELEMENT bibl ( %phrase.seq{{ #PCDATA
               | %m.phrase{{
                   %m.token{{ 
                     %x.token{{ }}x.token;
                     abbr | date | num 
                     | dateRange | numRange | timeRange
                     | measure | name | term | time 
                     | w | }}m.token;
                   corr | distinct | foreign 
                   | gap | hi | list | mentioned
                   | orig | q | ref | reg | s
                   | stage | title | table 
                   | xref | 
                 }}m.phrase; 
               }}phrase.seq; 
               author | 
               %ids.milestones{{ 
                 pb | lb | ptr | xptr 
               }}ids.milestones;)* >
A little better, perhaps, but ...

Sketching a solution

Basic ideas: Glushkov automata

The simplest method of making an FSA is to construct the Glushkov automaton.
  • Each primitive token becomes a state (the state of just having seen that token).
  • Define transitions to agree with the regex. (Glushkov, Brüggemann-Klein, Wood).
  • Plus one more for start state.
  • Mark the final states.

Example: TEI <biblFull>

Note single start state; paths describe possible sequence of children; double outlines mark final states (if we reach them, we win).

Example: A and B

Content models ((a|x|y)*, z) and ((b|x|y)+, z):

Basic ideas: the Glushkov property

In a Glushkov automaton, every path into a state has the same label.
So we can label the states, not the arcs. (Easier to read and understand.)

Reduced or-groups

One state per token clutters up fast. Let's reduce or-groups into single states, but preserve the Glushkov property:

Desiderata

An initial sketch of the goal:
  • clarity,* simplicity
  • what's valid in both: black/white
  • what's valid in just one: red or blue
  • determinism is good
  • the Glushkov property is good
  • each token maps to one state* — no clever stuff

What we want / Example

Black/white: paths that work in both FSAs.
Red, blue: paths that work in only one FSA.

Basic ideas: intersection of FSA

The standard way to calculate the intersection of two FSAs:
  • Run them both in parallel.
  • Do they both accept the string?

At any given point, we are in one* state in A, and one* in B.

That means, each state in the intersection represents a pair (A-state, B-state).

Complications: nondeterminism; error states. (Explain.)

Intersection of A and B (step 1)

Start with the two start states: i.e. state (StartA, StartB).
Calculate the next state pair for each symbol.
  • (StartA, StartB) × a ⇒ ((a|x|y)A, ErrorB). Omit (error state).
  • (StartA, StartB) × b ⇒ (ErrorA, (b|x|y)B). Omit (error state).
  • (StartA, StartB) × x ⇒ ((a|x|y)A, (b|x|y)B). New state.
  • (StartA, StartB) × y ⇒ ((a|x|y)A, (b|x|y)B). Same new state.
  • (StartA, StartB) × z ⇒ (zA, ErrorB). Omit (error state).

Intersection of A and B (step 2)

Continue the same process with any new states.
  • ((a|x|y)A, (b|x|y)B) × a ⇒ ((a|x|y)A, ErrorB). Omit (error state).
  • ((a|x|y)A, (b|x|y)B) × b ⇒ (ErrorA, (b|x|y)B). Omit (error state).
  • ((a|x|y)A, (b|x|y)B) × x ⇒ ((a|x|y)A, (b|x|y)B). Known state.
  • ((a|x|y)A, (b|x|y)B) × y ⇒ ((a|x|y)A, (b|x|y)B). Known state.
  • ((a|x|y)A, (b|x|y)B) × z ⇒ (zA, zB). New state.

Intersection of A and B (step 3)

One more state (all errors — no exit from state):
  • (zA, zB) × a ⇒ (ErrorA, ErrorB).
  • (zA, zB) × b ⇒ (ErrorA, ErrorB).
  • (zA, zB) × x ⇒ (ErrorA, ErrorB).
  • (zA, zB) × y ⇒ (ErrorA, ErrorB).
  • (zA, zB) × z ⇒ (ErrorA, ErrorB).

Result: intersection

Extending the result

We don't want just the intersection.
We also want A and B to be visible.
So instead of discarding any pair containing either ErrorA or ErrorB, let's discard only the pair (ErrorA, ErrorB).
If we do that, we get a larger FSA ...

Keeping the A- and B-only states

N.B. the McCarthy principle or tainted string approach: once red, always red.
N.B. proliferation of states.

Reduced or-groups

A first attempt at keeping the reduced or-groups of A and B.
We have lost determinism and Glushkov Property.
Can we find another way to reduce state count?

Rehabilitation

Can we overcome McCarthyism and allow ‘recovery’ from errors?
[N.B. name change: in proceedings, this is color-filter approach.]

Merger with rehabilitation

Now reduce or-groups

Except that strictly speaking, we cannot reduce or-groups now.

Merging states

Strictly speaking, we cannot reduce or-groups.
All or-groups are long gone!
We are merging states with
  • same set of predecessor states;
  • same set of successor states;
  • same color.
Preserves equivalence, deteminism, Glushkov Property.

Nailing it down

Let's get more specific (and formal).
  • definition
  • interpretation
  • construction
  • modifications

Informal characterization

A tricolor automaton is an FSA whose
  • states,
  • arcs, and
  • final-state markers
are (independently) colored
  • red,
  • white, or
  • blue

Definition

A tricolor automaton is a triple (A, Χ, Φ), where
  • A is the underlying finite state automaton (Q, Σ, δ, q0, F), where
    • Q is a set of states
    • Σ is an input alphabet
    • δ is a transition function mapping from Q × Σ to Q
    • q0Q is the start state
    • FQ is the set of final states
  • Χ is a chromatic function mapping from Q ∪ δ to the colors {red, white, blue}.
  • Φ is a chromatic function mapping from F to the colors {red, white, blue}.

Interpretation

But what does it mean?
  • black/white paths = paths legal in both A and B
  • red/white paths = paths legal in A
  • at least one red state or arc = legal in A, not in B
  • blue/white paths = paths legal in B
  • at least one blue state or arc = legal in B, not in A
N.B.
  • red/white/blue paths ≠ union of A and B (but ...)

Some sanity checks

The definitions allows some things that don't make ‘sense’:
  • red arc to/from blue state (or vice versa)
  • arc from red to blue state (or vice versa)
  • red final marker on blue state (or vice versa)
N.B. If these things happen, the rules of interpetation still hold.

That is, they do make sense; they just don't do anything ‘useful’.

Constructing the tainted-string tricolor automaton

States: (StartA, StartB) plus all triples (σ, PA, QB) where PAStartA and QBStartB.
Alphabet: union of ΣA and ΣB.
Transitions:
  • (x, PA, QB) × σ ⇒ (σ, PA′, QB′) iff A has PA × σ ⇒ PA′ and A has QB × σ ⇒ QB′.
  • (x, PA, QB) × σ ⇒ (σ, PA′, ErrorB) iff A has PA × σ ⇒ PA′ and σ ∉ ΣB.
  • (x, PA, QB) × σ ⇒ (σ, ErrorA, QB′) iff B has QB × σ ⇒ QB′ and σ ∉ ΣA.
Start state: (StartA, StartB).
Final states: union of FinalA and FinalB.

Construction (2)

State coloring::
  • red for (σ, PA, ErrorB)
  • blue for (σ, ErrorA, QB)
  • otherwise white.
Arc coloring::
  • red if source or target state is red
  • blue if source or target state is blue
  • otherwise white.
Finality coloring::
  • white if PAFinalA and QBFinalB.
  • red if PAFinalA and QBFinalB.
  • blue if PAFinalA and QBFinalB.

Construction example

Steps as before, for intersection, but don't throw away partial errors.
Start with (StartA, StartB); calculate next state for each symbol.
  • (StartA, StartB) × a ⇒ (a, (a|x|y)A, ErrorB). Red state.
  • (StartA, StartB) × b ⇒ (b, ErrorA, (b|x|y)B). Blue state.
  • (StartA, StartB) × x ⇒ (x, (a|x|y)A, (b|x|y)B). White state.
  • (StartA, StartB) × y ⇒ (y, (a|x|y)A, (b|x|y)B). White state.
  • (StartA, StartB) × z ⇒ (z, zA, ErrorB). Red state.

After first step

Construction example, step 2

For each new state; calculate next state for each symbol.
Omitting transitions to (σ, ErrorA, ErrorB):
  • (a, (a|x|y)A, ErrorB) × a ⇒ (a, (a|x|y)A, ErrorB).
  • (a, (a|x|y)A, ErrorB) × x ⇒ (x, (a|x|y)A, ErrorB). New red state.
  • (a, (a|x|y)A, ErrorB) × y ⇒ (y, (a|x|y)A, ErrorB). New red state.
  • (a, (a|x|y)A, ErrorB) × z ⇒ (z, zA, ErrorB).
  • (b, ErrorA, (b|x|y)B) × b ⇒ (b, ErrorA, (b|x|y)B).
  • (b, ErrorA, (b|x|y)B) × x ⇒ (x, ErrorA, (b|x|y)B). New blue state.
  • (b, ErrorA, (b|x|y)B) × y ⇒ (y, ErrorA, (b|x|y)B). New blue state.
  • (b, ErrorA, (b|x|y)B) × z ⇒ (z, ErrorA, zB). New blue state.
  • (x, (a|x|y)A, (b|x|y)B) × a ⇒ (a, (a|x|y)A, ErrorB).
  • (x, (a|x|y)A, (b|x|y)B) × b ⇒ (b, ErrorA, (b|x|y)B).
  • (x, (a|x|y)A, (b|x|y)B) × x ⇒ (x, (a|x|y)A, (b|x|y)B).
  • (x, (a|x|y)A, (b|x|y)B) × y ⇒ (y, (a|x|y)A, (b|x|y)B).
  • (x, (a|x|y)A, (b|x|y)B) × z ⇒ (z, zA, zB). New white state.
  • (y, (a|x|y)A, (b|x|y)B) × a ⇒ (a, (a|x|y)A, ErrorB).
  • (y, (a|x|y)A, (b|x|y)B) × b ⇒ (b, ErrorA, (b|x|y)B).
  • (y, (a|x|y)A, (b|x|y)B) × x ⇒ (x, (a|x|y)A, (b|x|y)B).
  • (y, (a|x|y)A, (b|x|y)B) × y ⇒ (y, (a|x|y)A, (b|x|y)B).
  • (y, (a|x|y)A, (b|x|y)B) × z ⇒ (z, zA, zB). New white state.

After step 2

Shaded states are old; new ones need to be processed.

Construction example, step 3

For each new state; calculate next state for each symbol.
Omitting transitions to (σ, ErrorA, ErrorB):
  • (x, (a|x|y)A, ErrorB) × a ⇒ (a, (a|x|y)A, ErrorB).
  • (x, (a|x|y)A, ErrorB) × x ⇒ (x, (a|x|y)A, ErrorB).
  • (x, (a|x|y)A, ErrorB) × y ⇒ (y, (a|x|y)A, ErrorB).
  • (x, (a|x|y)A, ErrorB) × z ⇒ (z, zA, ErrorB).
(y, (a|x|y)A, ErrorB) similarly ...
  • (x, ErrorA, (b|x|y)B) × b ⇒ (b, ErrorA, (b|x|y)B).
  • (x, ErrorA, (b|x|y)B) × x ⇒ (x, ErrorA, (b|x|y)B).
  • (x, ErrorA, (b|x|y)B) × y ⇒ (y, ErrorA, (b|x|y)B).
  • (x, ErrorA, (b|x|y)B) × z ⇒ (z, ErrorA, zB).
(y, ErrorA, (b|x|y)B) similarly ...

After step 3

Now we're done.

Simplifying the labels

Replace bookkeeping information with simpler labels:

(A back reference)

We've seen this before.

Rehabilitation of error states

For every state (σ, PA, QB):
  • if state (σ, PA, ErrorB) exists, or
  • state (σ, ErrorA, QB) exists,
  • with same* successors,
then merge the states. Repeat.

N.B. Such mergers preserve equivalence, determinism, and the Glushkov Property.

(If more than one error-free state is a candidate, choose arbitrarily?)

Rehabilitation example

In our example:
  • (x, (a|x|y)A, (b|x|y)B) ∼ (x, (a|x|y)A, ErrorB), (x, ErrorA, (b|x|y)B), but successors differ
  • (y, (a|x|y)A, (b|x|y)B) ∼ (y, (a|x|y)A, ErrorB), (y, ErrorA, (b|x|y)B), but successors differ
  • (z, zA, zB) ∼ (z, zA, ErrorB), (z, ErrorA, zB), with same successors (i.e. none).

Rehabilitation example, step 2

Now
  • (x, (a|x|y)A, (b|x|y)B) ∼ (x, (a|x|y)A, ErrorB), (x, ErrorA, (b|x|y)B), with same* successors
  • (y, (a|x|y)A, (b|x|y)B) ∼ (y, (a|x|y)A, ErrorB), (y, ErrorA, (b|x|y)B), with same* successors

Rehabilitation example, step 3

Now simplify the labels.

Reduction of or-groups

For each pair of states (σ1, PA, QB) and (σ2, PA, QB) with the same predecessors, successors, and color, merge the states.
Here, the x and y states.

Conclusion

We have a visualization which:
  • shows two content models at once
  • exhibits their intersection
  • exhibits their differences
  • is producible by machine
  • illustrates useful applications of techniques from Brzozowski and Glushkov

Future work

Acknowledgements