Tricolor automata
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
7 August 2015
http://blackmesatech.com/2015/08/tricolor-automata/
Overview
- A problem
- Sketching a solution
- Details
- Further work
Want to look carefully at the details?
Two elevator pitches
Vocabulary designers / users:
- Draw a picture
- of two content models
- showing what is valid in both models
- what is valid in one but not the other
Computer scientists:
- A simple, practical application of
- Brzozowski derivatives and
- Glushkov automata.
Introduction
- A problem: how do these schemas relate?
- Comparing element sets
- Comparing declarations (raw, expanded, doctored)
- Visualizing
- Visualizing with finite state automata
Context
The Igel project is sponsored by the
Institut für Deutsche Sprache, Mannheim.
Igel is building tools for comparing document vocabularies.
A problem
Given two related DTDs,
- How do they differ?
- How are they the same?
- Have any old documents become invalid?
E.g. XHTML 1.0 Strict vs. Transitional.
E.g. HTML 3.2 vs HTML 4.0.
E.g. JATS 3.1 Archiving vs JATS 4.0 Authoring
E.g. TEI Lite and XCES (or ...)
E.g. Two customizations of Docbook.
E.g. Two versions of MathML.
Displaying declarations
If we have translated the DTD into XML, we can
expand the parameter entities and label the
expansions:
Now A has:
<!ELEMENT bibl ( %phrase.seq{{ #PCDATA
| %m.phrase{{
%m.token{{ abbr | date | num
| measure | name | term | time
| }}m.token;
foreign | mentioned | distinct
| title | hi | list | corr
| gap | reg | ptr | ref |
}}m.phrase;
}}phrase.seq;
author)* >
And B has:
<!ELEMENT bibl ( %phrase.seq{{ #PCDATA
| %m.phrase{{
%m.token{{
%x.token{{ }}x.token;
abbr | date | num
| dateRange | numRange | timeRange
| measure | name | term | time
| w | }}m.token;
corr | distinct | foreign
| gap | hi | list | mentioned
| orig | q | ref | reg | s
| stage | title | table
| xref |
}}m.phrase;
}}phrase.seq;
author |
%ids.milestones{{
pb | lb | ptr | xptr
}}ids.milestones;)* >
A little better, perhaps, but ...
Sketching a solution
- some basic ideas
- examples
- desiderata
Basic ideas: Glushkov automata
The simplest method of making an FSA is to construct
the
Glushkov automaton.
- Each primitive token becomes a state (the state of just having seen that token).
- Define transitions to agree with the regex.
(Glushkov, Brüggemann-Klein, Wood).
- Plus one more for start state.
- Mark the final states.
Example: TEI <biblFull>

Note single start state; paths describe possible
sequence of children; double outlines mark final states
(if we reach them, we win).
Example: A and B
Content models
((a|x|y)*, z) and
((b|x|y)+, z):


Basic ideas: the Glushkov property
In a Glushkov automaton, every path into a state
has the same label.
So we can label the states, not the arcs.
(Easier to read and understand.)


Reduced or-groups
One state per token clutters up fast.
Let's reduce or-groups into single states,
but preserve the Glushkov property:


Desiderata
An initial sketch of the goal:
- clarity,* simplicity
- what's valid in both: black/white
- what's valid in just one: red or blue
- determinism is good
- the Glushkov property is good
- each token maps to one state*
— no clever stuff
What we want / Example
Black/white: paths that work in both FSAs.
Red, blue: paths that work in only one FSA.
Basic ideas: intersection of FSA
The standard way to calculate the intersection of
two FSAs:
- Run them both in parallel.
- Do they both accept the string?
At any given point, we are in one* state in A,
and one* in B.
That means, each state in the intersection
represents a pair (A-state, B-state).
Complications:
nondeterminism; error states. (Explain.)
Intersection of A and B (step 1)
Start with the two start states: i.e. state
(StartA, StartB).
Calculate the next state pair for each symbol.
- (StartA, StartB) × a ⇒ ((a|x|y)A, ErrorB). Omit (error state).
- (StartA, StartB) × b ⇒ (ErrorA, (b|x|y)B). Omit (error state).
- (StartA, StartB) × x ⇒ ((a|x|y)A, (b|x|y)B). New state.
- (StartA, StartB) × y ⇒ ((a|x|y)A, (b|x|y)B). Same new state.
- (StartA, StartB) × z ⇒ (zA, ErrorB). Omit (error state).
Intersection of A and B (step 2)
Continue the same process with any new states.
- ((a|x|y)A, (b|x|y)B) × a ⇒ ((a|x|y)A, ErrorB). Omit (error state).
- ((a|x|y)A, (b|x|y)B) × b ⇒ (ErrorA, (b|x|y)B). Omit (error state).
- ((a|x|y)A, (b|x|y)B) × x ⇒ ((a|x|y)A, (b|x|y)B). Known state.
- ((a|x|y)A, (b|x|y)B) × y ⇒ ((a|x|y)A, (b|x|y)B). Known state.
- ((a|x|y)A, (b|x|y)B) × z ⇒ (zA, zB). New state.
Intersection of A and B (step 3)
One more state (all errors — no exit from state):
- (zA, zB) × a ⇒ (ErrorA, ErrorB).
- (zA, zB) × b ⇒ (ErrorA, ErrorB).
- (zA, zB) × x ⇒ (ErrorA, ErrorB).
- (zA, zB) × y ⇒ (ErrorA, ErrorB).
- (zA, zB) × z ⇒ (ErrorA, ErrorB).
Extending the result
We don't want just the intersection.
We also want A and B to be visible.
So instead of discarding any
pair containing either
ErrorA
or
ErrorB,
let's discard only
the pair (ErrorA, ErrorB).
If we do that, we get a larger FSA ...
Keeping the A- and B-only states

N.B. the
McCarthy principle
or
tainted string approach:
once red, always red.
N.B. proliferation of states.
Reduced or-groups

A first attempt at keeping the reduced or-groups of A and B.
We have lost determinism and Glushkov Property.
Can we find another way to reduce state count?
Rehabilitation
Can we overcome McCarthyism and allow
‘recovery’ from errors?
- For any states PA, QB,
merge (PA, QB) with
- (PA, ErrorB)
- (ErrorA, QB)
- No change in meaning for red strings, blue strings.
- Cost: we must remember what colors we've visited.
[N.B. name change: in proceedings, this is color-filter
approach.]
Merger with rehabilitation
Now reduce or-groups

Except that strictly speaking, we
cannot reduce
or-groups now.
Merging states
Strictly speaking, we cannot reduce or-groups.
All or-groups are long gone!
We are merging states with
- same set of predecessor states;
- same set of successor states;
- same color.
Preserves equivalence, deteminism, Glushkov Property.
Nailing it down
Let's get more specific (and formal).
- definition
- interpretation
- construction
- modifications
Informal characterization
A tricolor automaton is an FSA
whose
- states,
- arcs, and
- final-state markers
are (independently) colored
Definition
A tricolor automaton is a triple
(
A, Χ, Φ), where
A is the underlying finite state automaton
(
Q, Σ, δ,
q0,
F),
where
- Q is a set of states
- Σ is an input alphabet
- δ is a transition function mapping from Q × Σ to
Q
- q0 ∈ Q is the start state
- F ⊆ Q is the set of final states
Χ is a chromatic function
mapping from Q ∪ δ to the colors
{red, white, blue}.
Φ is a chromatic function
mapping from F to the colors {red, white, blue}.
Interpretation
But what does it
mean?
- black/white paths = paths legal in both A and B
- red/white paths = paths legal in A
- at least one red state or arc = legal in A, not in B
- blue/white paths = paths legal in B
- at least one blue state or arc = legal in B, not in A
N.B.
- red/white/blue paths ≠ union of A and B (but ...)
Some sanity checks
The definitions allows some things that
don't make ‘sense’:
- red arc to/from blue state (or vice versa)
- arc from red to blue state (or vice versa)
- red final marker on blue state (or vice versa)
N.B. If these things happen, the rules of interpetation still hold.
That is, they do make sense;
they just don't do anything ‘useful’.
Constructing the tainted-string tricolor automaton
States: (StartA, StartB) plus all triples (σ, PA, QB)
where PA ≠ StartA
and QB ≠ StartB.
Alphabet: union of ΣA and ΣB.
Transitions:
- (x, PA, QB) × σ ⇒ (σ, PA′, QB′)
iff A has PA × σ ⇒ PA′
and A has QB × σ ⇒ QB′.
- (x, PA, QB) × σ ⇒ (σ, PA′, ErrorB)
iff A has PA × σ ⇒ PA′
and σ ∉ ΣB.
- (x, PA, QB) × σ ⇒ (σ, ErrorA, QB′)
iff B has QB × σ ⇒ QB′
and σ ∉ ΣA.
Start state: (StartA, StartB).
Final states: union of FinalA and FinalB.
Construction (2)
State coloring::
- red for (σ, PA, ErrorB)
- blue for (σ, ErrorA, QB)
- otherwise white.
Arc coloring::
- red if source or target state is red
- blue if source or target state is blue
- otherwise white.
Finality coloring::
- white if PA ∈ FinalA and QB ∈ FinalB.
- red if PA ∈ FinalA and QB ∉ FinalB.
- blue if PA ∉ FinalA and QB ∈ FinalB.
Construction example
Steps as before, for intersection, but don't throw away partial errors.
Start with (StartA, StartB); calculate next state for each symbol.
- (StartA, StartB) × a ⇒ (a, (a|x|y)A, ErrorB). Red state.
- (StartA, StartB) × b ⇒ (b, ErrorA, (b|x|y)B). Blue state.
- (StartA, StartB) × x ⇒ (x, (a|x|y)A, (b|x|y)B). White state.
- (StartA, StartB) × y ⇒ (y, (a|x|y)A, (b|x|y)B). White state.
- (StartA, StartB) × z ⇒ (z, zA, ErrorB). Red state.
Construction example, step 2
For each new state; calculate next state for each symbol.
Omitting transitions to (σ,
ErrorA,
ErrorB):
- (a, (a|x|y)A, ErrorB) × a ⇒ (a, (a|x|y)A, ErrorB).
- (a, (a|x|y)A, ErrorB) × x ⇒ (x, (a|x|y)A, ErrorB). New red state.
- (a, (a|x|y)A, ErrorB) × y ⇒ (y, (a|x|y)A, ErrorB). New red state.
- (a, (a|x|y)A, ErrorB) × z ⇒ (z, zA, ErrorB).
- (b, ErrorA, (b|x|y)B) × b ⇒ (b, ErrorA, (b|x|y)B).
- (b, ErrorA, (b|x|y)B) × x ⇒ (x, ErrorA, (b|x|y)B). New blue state.
- (b, ErrorA, (b|x|y)B) × y ⇒ (y, ErrorA, (b|x|y)B). New blue state.
- (b, ErrorA, (b|x|y)B) × z ⇒ (z, ErrorA, zB). New blue state.
- (x, (a|x|y)A, (b|x|y)B) × a ⇒ (a, (a|x|y)A, ErrorB).
- (x, (a|x|y)A, (b|x|y)B) × b ⇒ (b, ErrorA, (b|x|y)B).
- (x, (a|x|y)A, (b|x|y)B) × x ⇒ (x, (a|x|y)A, (b|x|y)B).
- (x, (a|x|y)A, (b|x|y)B) × y ⇒ (y, (a|x|y)A, (b|x|y)B).
- (x, (a|x|y)A, (b|x|y)B) × z ⇒ (z, zA, zB). New white state.
- (y, (a|x|y)A, (b|x|y)B) × a ⇒ (a, (a|x|y)A, ErrorB).
- (y, (a|x|y)A, (b|x|y)B) × b ⇒ (b, ErrorA, (b|x|y)B).
- (y, (a|x|y)A, (b|x|y)B) × x ⇒ (x, (a|x|y)A, (b|x|y)B).
- (y, (a|x|y)A, (b|x|y)B) × y ⇒ (y, (a|x|y)A, (b|x|y)B).
- (y, (a|x|y)A, (b|x|y)B) × z ⇒ (z, zA, zB). New white state.
After step 2

Shaded states are old; new ones need to be processed.
Construction example, step 3
For each new state; calculate next state for each symbol.
Omitting transitions to (σ,
ErrorA,
ErrorB):
- (x, (a|x|y)A, ErrorB) × a ⇒ (a, (a|x|y)A, ErrorB).
- (x, (a|x|y)A, ErrorB) × x ⇒ (x, (a|x|y)A, ErrorB).
- (x, (a|x|y)A, ErrorB) × y ⇒ (y, (a|x|y)A, ErrorB).
- (x, (a|x|y)A, ErrorB) × z ⇒ (z, zA, ErrorB).
(y,
(a|x|y)A,
ErrorB) similarly ...
- (x, ErrorA, (b|x|y)B) × b ⇒ (b, ErrorA, (b|x|y)B).
- (x, ErrorA, (b|x|y)B) × x ⇒ (x, ErrorA, (b|x|y)B).
- (x, ErrorA, (b|x|y)B) × y ⇒ (y, ErrorA, (b|x|y)B).
- (x, ErrorA, (b|x|y)B) × z ⇒ (z, ErrorA, zB).
(y,
ErrorA,
(b|x|y)B) similarly ...
After step 3
Now we're done.

Simplifying the labels
Replace bookkeeping information with simpler labels:

(A back reference)
We've seen this before.

Rehabilitation of error states
For every state (σ,
PA,
QB):
- if state (σ, PA, ErrorB) exists, or
- state (σ, ErrorA, QB) exists,
- with same* successors,
then merge the states. Repeat.
N.B. Such mergers preserve equivalence, determinism,
and the Glushkov Property.
(If more than one error-free state is a candidate, choose arbitrarily?)
Rehabilitation example
In our example:
- (x, (a|x|y)A, (b|x|y)B) ∼ (x, (a|x|y)A, ErrorB), (x, ErrorA, (b|x|y)B), but successors differ
- (y, (a|x|y)A, (b|x|y)B) ∼ (y, (a|x|y)A, ErrorB), (y, ErrorA, (b|x|y)B), but successors differ
- (z, zA, zB) ∼ (z, zA, ErrorB), (z, ErrorA, zB), with same successors (i.e. none).
Rehabilitation example, step 2
Now
- (x, (a|x|y)A, (b|x|y)B) ∼ (x, (a|x|y)A, ErrorB), (x, ErrorA, (b|x|y)B), with same* successors
- (y, (a|x|y)A, (b|x|y)B) ∼ (y, (a|x|y)A, ErrorB), (y, ErrorA, (b|x|y)B), with same* successors
Rehabilitation example, step 3
Now simplify the labels.
Reduction of or-groups
For each pair of states
(σ1, PA, QB)
and
(σ2, PA, QB)
with the same predecessors, successors, and color,
merge the states.
Here, the x and y states.
Conclusion
We have a visualization which:
- shows two content models at once
- exhibits their intersection
- exhibits their differences
- is producible by machine
- illustrates useful applications of techniques from Brzozowski and Glushkov
Future work
- Better integration into Igel
- Proofs of correctness. (Or ...)
- Proofs of uniqueness (or non-).
- Clear statement of tradeoffs among:
- minimal number of states
- minimal number of arcs
- Glushkov property
- simple relation between tokens and states
- determinism