Sperberg-McQueen, C. M., Yves Marcoux and Claus Huitfeldt. “Document lattices: Equivalence, compatibility, and contradiction in document markup.” Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13 (2014). https://doi.org/10.4242/BalisageVol13.Sperberg-McQueen01.
Balisage: The Markup Conference 2014 August 5 - 8, 2014
Balisage Paper: Document lattices:
Equivalence, compatibility, and contradiction in document markup
C. M. Sperberg-McQueen
Founder and principal
Black Mesa Technologies LLC
C. M. Sperberg-McQueen is the founder and
principal of Black Mesa Technologies, a consultancy
specializing in helping memory institutions improve
the long term preservation of and access to the
information for which they are responsible.
He served as editor in chief of the TEI
Guidelines from 1988 to 2000, and has also served
as co-editor of the World Wide Web Consortium's
XML 1.0 and XML Schema 1.1
specifications.
Yves Marcoux Yves Marcoux
Associate Professor (Professeur agrégé)
École de bibliothéconomie et des
sciences de l'information, Université de Montréal
Yves Marcoux has been a faculty member at EBSI,
University of Montréal, since 1991.
He is mainly involved in teaching, research, standardization,
and international cooperation activities
in the field of document informatics.
Prior to his appointment at EBSI,
Dr. Marcoux worked for 10 years
in systems maintenance and development,
in Canada, the U.S., and Europe.
He obtained his Ph.D. in theoretical computer science
from Université de Montréal in 1991.
His main research interests are intertextual semantics,
the design of communication, markup languages
and digital humanities.
Claus Huitfeldt
Associate Professor (førsteamanuensis)
Department of Philosophy, University of Bergen
Claus Huitfeldt is Associate Professor at the Department
of Philosophy of the University of Bergen, Norway. He was
founding Director (1990-2000) of the Wittgenstein Archives at
the University of Bergen, for which he developed the text
encoding system MECS as well as the editorial methods for the
publication of Wittgenstein's Nachlass - The Bergen Electronic
Edition (Oxford University Press, 2000).
If the information conveyed by the markup in a document
can be identified with the set of inferences we can draw from
that markup, as has been proposed in earlier work, then the sets
of inferences licensed by documents form an infinitely large
lattice, by means of which the relative information content of
any two documents (equivalence, subsumption, contradiction,
consistency) can be displayed visually. The sets of inferences
licensed by markup can be used to test translations from one
markup language to another for equivalence or information loss;
a simple example using XHTML and CALS table markup illustrates
the process.
Practitioners of descriptive markup rely on the ability of
markup in a document to convey information; earlier work
has attempted to characterize the nature of that
information and describe ways to make it manifest in ways
beyond what is done by conventional processing of SGML, XML,
and similar systems.[1]
Among the methods so far suggested is to capture the
information carried by the markup of the document in symbolic
logic. It is possible in principle to enumerate the crucial
inferences licensed by the markup in a document; these
sentences, together with all the other sentences that can be
inferred from them, are held to constitute the set of
inferences licensed by the markup (which in turn is held by
some to constitute the meaning of the markup).
We offer here an application of that idea to the problem of
document comparison. Given the sets of inferences licensed by the
markup in two documents, we suggest a method for determining
whether the markup in the two documents is semantically
equivalent, semantically compatible (but richer in one than the
other), or contradictory.
One consequence of this is that the information in documents
can usefully be taken to form a lattice, which can in turn be used
to illustrate the relation among documents.
Related work
It should be noted at the outset that the kind of document
comparison we describe here has nothing to do with the problem
of XML file comparison or diff tools.
One way in which one might motivate the choice of one of these
target syntaxes over the others would be to show that it makes
it easier to perform useful tasks with the information carried
by the markup in the documents. The proposal made here exploits
the fact that inference is well understood for first-order
predicate calculus in many different variants, and uses (a very
simple form) of mechanical theorem proving to perform
interesting work.
In 2011, Sperberg-McQueen proposed to use the sets of inferences
licensed by markup in the field of digital preservation, as a
way to test the information equivalence of documents before and
after conversion [Sperberg-McQueen 2011]. We provide a
concrete illustration of the technique described there,
generalize the test from equivalence to subsumption,
compatibilty, and contradiction, and discuss some implications
of the technique.
Document comparison
Relations among documents
We claim that with respect to the information carried by
the markup in any two documents
D1 and
D2, exactly one of the
following states of affairs will apply:
D1 and
D2 are
equivalent: all of the information in
D1 is present in
D2 and vice
versa.
D1subsumes or is an
abstraction ofD2: all of the
information in
D1 is present in
D2, but the
reverse is not true. Equivalently, we say that
D2is
a refinement of
D1.
D2 subsumes
D1.
Neither D1
nor D2 subsumes
the other, but they are logically consistent with each
other.
D1 and
D2 contradict each
other.
An algorithm for document comparisons
To compare two documents
D1 and
D2, we first enumerate
a set of key inferences from the markup for each document. Let
us call these two sets
S1 and
S2. These take the
form of sentences in some suitable form of symbolic logic. (In
the example below, we use a more or less standard first-order
predicate calculus; in demonstrating that the comparisons can be
implemented in software, we also use the syntaxes of Prolog and
Alloy.)
Note that we do not ask for an enumeration of all the
inferences licensed by the markup in either document; that set
is by definition closed under inference, and by all conventional
accounts will be infinite. By an enumeration of key
inferences we mean some finite set of sentences inferred
from the document, which suffice to allow the inference of all
the others (sometimes called a basis for
the infinite set of inferences). We will refer to the closure
of S1 under logical
inference as *S1, and
the closure of S2
under inference will be
*S2.
In the simple case,
S1 and
S2 will use the same
predicates, assume the existence of the same individuals, and
use the same names for the individuals in the universe of
discourse. Then establishing the informational equivalence of
the two documents is simply a case of checking that every
sentence in S1 is
present in S2, and
vice versa. The subsumption relation can similarly be
established by checking for a subset relation between
S1 and
S2.
In the general case, however, none of these will be true:
S1 and
S2 may use different
predicates; they may assume the existence of different
individuals in the universe of discourse; even when they
assume the same predicates or individuals, they
may use different names for them. A prerequisite for
comparing the sets
S1 and
S2 in practice is
thus the preparation of rules of inference that allow
statements in the vocabulary of
S1 to be inferred
from statements in the vocabulary of
S2, and vice-versa.
We will call these the translation inference
rules, since their goal is to translate information
from one vocabulary to another. Some translation inference
rules map from S1 to
S2: they allow us to
infer statements in the vocabulary of
S2, given other
statements in the vocabulary of
S1. We refer to
these as S1 →
S2; we refer to the
translation inference rules mapping in the other direction as
S2 →
S1.
Note that in what follows we silently assume that the
translation inference rules
S1 →
S2 and
S2 →
S1 are given, and are
included in the closures
*S1 and
*S2, along with other
general world knowledge.
In this more general situation, we can establish the
equivalence of D1 and
D2 by checking that
every sentence in S1
is present in, or follows from,
S2, and vice versa.
Subsumption, consistency, and inconsistency relations can
similarly be established on the basis of logical
implication.
We provide an algorithm for determining which state of
affairs applies between documents
D1 and
D2:
From the set
S1 and the translation
rules S1 →
S2, we attempt to
infer each sentence in
S2.
If
we succeed for all sentences in
S2, then
S2 is contained within
the logical closure of
S1
(i.e. S2 ⊆
*S1, and by definition
also *S2 ⊆
*S1). Less formally:
all of the information in
S2 is present in
S1 (and similarly for
D1 and
D2).
If
we succeed for some but not all sentences in
S2, then some but not
all of the information in
S2 (and
D2) is present in
S1
(D1).
Conversely, from the set
S2 and the translation
rules S2 →
S1, we attempt to
infer each sentence in
S1. (Or, more
formally, we test whether
*S1 ⊆
*S2.)
From
the set S1 and the
translation rules S1
→ S2, we
attempt to infer the negation of each sentence in
S2.
If
we succeed for any sentence in
S2, then
S1 and
S2 contradict each
other (as do D1 and
D2).
If
we fail for all sentences in
S2, then
S1 (and
D1) are compatible
(logically consistent) with
S2
(D2).
Again, we perform the same
task in the other direction, seeking to infer negations of
sentences in S2 from
the set S1 and the
translation rules S1
→
S2.
The relation of the documents' information content (as
given by the markup) is determined by the results of this
exercise.
D1
subsumes D2 if and
only if each sentence in
S1 can be inferred
from S2 and
S2 →
S1.
D2
subsumes D1 if and
only if each sentence in
S2 can be inferred
from S1 and
S1 →
S2.
D1
and D2 are equivalent
if and only if D1
subsumes D2 and
D2 subsumes
D1.
D1
and D2 contradict each
other if and only if
¬s1 follows, for
some sentence s1 in
S1, from
S2 and
S2 →
S1, or
¬s2 follows, for
some sentence s2 in
S2, from
S1 and
S1 →
S2.
D1
and D2 are compatible
(logically consistent) with each other if and only if neither
contradicts the other.
That is, the relation between documents
D1 and
D2 is determined by
the subset/superset relations between
*S1 and
*S2.
The document lattice
The subset/superset relations among sets of inferences
licenced by documents constitute a partial order over all
documents. It is a consequence of this fact that the universe
of documents forms a lattice, in the mathematical sense of the
term.
A simple lattice formed by the subset relation over
the subsets of the {a, b, c} is:
A similar lattice formed by the subset relation over
the subsets of the {a, b, c, d, e} is:
Any two points a and b in a lattice (and thus any two
documents in a lattice of documents) have both a greatest
lower bound (a point c in the lattice such that c ≤
a and c ≤ b, and x ≤ c for all x such that
x ≤ c and x ≤ b) and a greatest lower bound,
known in lattice contexts respectively as the
meet and the join of a and b.
Informally, the meet is the highest point where downward
paths from a and b meet, the join is the lowest point
where climbing paths meet.
Since the set of documents is infinite, so is the universal
document lattice (as we will call the lattice formed
from the sets of sentences entailed by documents); we will not
attempt to provide an image of this infinite lattice.
Instead, we will illustrate document lattices by
considering the lattice formed by the sets *S1, *S2,
*(S1 ∪ S2), *(S1 ∩ S2), *(S1 \ S2),
*(S2 \ S1), and the top and bottom nodes (⊤ and
⊥) of the universal document lattice.[2]
Note that all illustrations assume that neither
D1 nor D2 is self-contradictory (so neither
*S1 nor *S2 is equal to ⊤)
and that neither is vacuous (so neither
*S1 nor *S2 is equal to ⊥).
Every document is represented by a point in the lattice.
If and only if two documents D1 and D2 are equivalent,
then D1 and D2 map to the same point in the lattice.
If D1 subsumes D2 but the two documents are not equivalent,
then D1 is below D2 in the lattice.
Informally: we can reach D1 from D2 by going
down in the graph (or D2 from D1 by climbing).
If neither D1 nor D2
subsumes the other, then neither is above or below the other
in the lattice.
For the document lattice (based as it is upon the subset
relation), the meet of two documents D1 and D2 is
represented by the set *S1 ∩ *S2, their join by *S1
∪ *S2. (In the figure, the nodes a and b are
colored yellow and blue; their meet is colored gray,
their joint is colored green. As the bold arrows show,
there may be more than one path connecting a node to its
meet or its join with another node, but the meet and join
are nevertheless each guaranteed unique.)
We have the following relations in the lattice for the
various possible relations of D1 and D2, which we
illustrate on the finite lattice described earlier:
If D1 and D2 are equivalent, then they map
to the same point in the lattice (as do their union and
intersection).
If D1 subsumes D2 but they are not
equivalent, then D1 is below D2 in the lattice and can
be reached by a sequence of downward arcs.
If D2 subsumes D1 but they are not
equivalent, then the reverse is true.
If D1 and D2 contradict each other, then
their join is the topmost point in the lattice (the
set of all possible sentences).[3]
If neither D1 nor D2 subsumes the other, but
they are logically consistent with each other, then
they have a join somewhere other than the
top of the lattice.
Most of what has been said so far is a straightforward
account of the relation of arbitrary sets of sentences in a
logical notation, when the sets are closed under logical
inference. It is not uniquely true of sets of sentences
derived from the markup in documents. Readers thus may well
be asking themselves where some application to document
processing comes in.
The document lattice we have described makes explicit
some facts about information in documents that is obvious to
markup practitioners (but often disappointingly non-obvious
to clients and novices). The kinds of translation processes
referred to as up-translations and down-translations
correspond directly to relations in the lattice: an
up-translation involves mapping from some document
D1 to another document D2 above D2 on the lattice,
a down translation similarly involves moving downwards
in the lattice.
If our task is to translate from
document D1 in one markup vocabulary (say, Docbook) into
some document D2 in another vocabulary (say, XHTML) by
fully automatic means (e.g. an XSLT stylesheet), then either
D2 must subsume D1, or our task is impossible: fully
automatic transforms can in principle map only from one
document into another document reachable by zero or more
downward steps. (D2 is reachable in zero downward steps
if it is logically equivalent to D1; this is possible in
principle but often quite difficult in practice.)
If a conversion from one vocabulary to another is
intended to have no information loss, then the requiremet
is that for any D1 in the source vocabulary, the
conversion produce a D2 which occupies the same point
in the lattice.
Example
A simple example may help to illustrate the operation
of the algorithm we have given above.
High-level description
Consider the following simple table:
Let us imagine that we have two versions of this table,
each in a different markup language with a possibly different
table model, and we wish to know whether the table
markup in the two documents is equivalent, or at least
logically consistent.
One table, let us suppose, is tagged in XHTML:
<table border="1">
<tr>
<th>Année</th>
<th>Événement</th>
</tr>
<tr>
<td>1969</td>
<td>Création d'ARPANET, le premier réseau
national américain d'ordinateurs, par le
<em>Defense Department's Advanced
Research Projects Agency</em> (DARPA)</td>
</tr>
<tr>
<td>1992</td>
<td>Mise en service du <em>World Wide Web</em>
par le CERN (Centre européen de recherche
nucléaire), en Suisse</td>
</tr>
</table>
The other table, intended to be equivalent, is tagged
using the SGML Open exchange subset of the CALS table model
[Bingham 1995],
[Severson and Bingham 1995].
<table colsep="1" rowsep="1">
<tgroup cols="2">
<colspec colnum="1" colname="annee" colwidth="1*"/>
<colspec colnum="2" colname="evenement" colwidth="4*"/>
<tbody>
<row>
<entry>Année</entry>
<entry>Événement</entry>
</row>
<row>
<entry colname="evenement">Création d'ARPANET,
le prémier réseau américain
d'ordinateurs, par le Defense Department's
Advanced Research
Projects Agency (DARPA)</entry>
<entry colname="annee">1969</entry>
</row>
<row>
<entry colname="annee">1992</entry>
<entry>Mise en service du World Wide Web
par le CERN
(Centre européen de recherche nucléaire),
en Suisse</entry>
</row>
</tbody>
</tgroup>
</table>
In first-order predicate calculus
From each marked up table we derive a set of
sentences.[4]
Set S1
In more or less conventional first-order predicate calculus
notation,[5]
the first set of inferences is as follows.[6]
Table T is 3 by 2. In consequence, cell (3, 2) exists in table T,
but cells (4, 1) and (1, 3) do not.)
table_dimensions(T, 3, 2)
cell(T, 3, 2) ∧ ¬cell(T, 4, 1) and ¬cell(T, 1, 3)
The two heading cells in row 1, columns 1 and 2,
are identified as headings.
isTableHeader(T, 1, 1)
isTableHeader(T, 1, 2)
Next, the contents of the various cells are given.
tableCellContent(T, 3, 2, "Mise en service du...")
The table has some styling information, given in CSS and not
shown above: the table
width is 80% of its parent element's width, and it has a 10% left margin.
tableWidth(T, "80%")
tableMarginLeft(T, "10%")
The background color of the heading cells is #CCCCCC, and all cells
are vertically aligned to the top of the cell.
(∀ x ∈ ℕ, y ∈ ℕ)[isTableHeader(T, x, y) ⇒ tableCellBackgroundColor(T, x, y, "#CCCCCC")]
(∀ x ∈ ℕ, y ∈ ℕ)[cell(T, x, y) ⇒ tableCellVerticalAlign(T, x, y, "top")]
Table borders are thin, solid lines.
(∀ x ∈ ℕ, y ∈ ℕ)[cell(T, x, y) ⇒ tableCellBorderWidth(T, x, y, "thin")]
(∀ x ∈ ℕ, y ∈ ℕ)[cell(T, x, y) ⇒ tableCellBorderStyle(T, x, y, "solid")]
Finally, we give some rules which apply to all tables and not just to this one.
These can be used for sanity checking of sets of sentences, to ensure that we have
specified
content for all the cells that need to be described, and so on.
In any table, only those cells exist which have content.
(∀ t, x ∈ ℕ, y ∈ ℕ)[cell(t, x, y) ⇔ (∃ c)[tableCellContent(t, x, y, c)]]
All tables are finite and have no holes. (Literally: for all tables,
there are maximum dimensions x and y such that all cells have row and
column coordinates less than or equal to x and y, respectively.)
No table is empty. (So all tables have at least one row and one column and thus a
cell in position (1, 1).)
(∀ t)[cell(t, 1, 1)]
In a table, only existing cells can have interesting properties.
(Literally, we postulate isTableHeader, or a background color, or
vertical alignment, or border width and style, of some triple T, x, y only if
there is a cell T, x, y.)
(∀ t, x ∈ ℕ, y ∈ ℕ)[
(
isTableHeader(t, x, y) ∨
(∃ c)[tableCellBackgroundColor(t, x, y, c) ∨
tableCellVerticalAlign(t, x, y, c) ∨
tableCellBorderWidth(t, x, y, c) ∨
tableCellBorderStyle(t, x, y, c)]
) ⇒ cell(t, x, y)
]
Set S2
The description of the other table is this; notice both that
it takes a rather different view of which individuals need to exist to
enable the description of the table, and that the markup from
which it started said nothing about vertical alignment or
table borders.
This set of sentences begins by identifying the individuals
in the universe of discourse and saying what kinds of things they are:
c11 and c12 are heading cells, various other individuals are
data cells, rows, columns, or tables.
headingcell(c11)
headingcell(c21)
datacell(c12)
datacell(c22)
datacell(c13)
datacell(c23)
row(r1)
row(r2)
row(r3)
column(c1)
column(c2)
table(t1)
The table t1 contains a particular set of rows in a particular
order, and a particular set of columns in a particular order.
(We write sequences as comma-separated lists in angle brackets,
as is common in some fields.)
table_rowsequence(t1, 〈 r1, r2, r3 〉)
table_colsequence(t1, 〈 c1, c2 〉)
row_cellsequence(r1, 〈 c11, c21 〉)
row_cellsequence(r2, 〈 c12, c22 〉)
row_cellsequence(r3, 〈 c13, c23 〉)
col_cellsequence(c1, 〈 c11, c12, c13 〉)
col_cellsequence(c2, 〈 c21, c22, c23 〉)
The individual cells each contain text, represented
here as a simple string of characters.[7]
cell_text(c11, "Année")
cell_text(c21,"Evénement")
cell_text(c12,"1969")
cell_text(c22,"Création d'ARPANET, le prémier réseau national américain d'ordinateurs ...")
cell_text(c13,"1992")
cell_text(c23,"Mise en service du World Wide Web pare le CERN ...")
Ensuring the identity of individuals (digression)
The next step is to formulate translation inference rules
for mapping from the vocabulary of S1 to that of S2 and
vice versa.
There is a complication here, however, which requires a brief
digression. The issue has no particular interest from the
markup point of view but is crucial to make the inference
process work smoothly. It is perhaps best illustrated if
we imagine a two-player game similar to Twenty Questions.
Player One is equipped with sentences S1, which Player
Two cannot see, while Player Two has set S2, which is
invisible to Player One; both have access to the
translation inference rules. The players take turns
challenging each other to say whether a given sentence is
or is not present in (or inferrable from) the challenger's
set of sentences.
In our example Player One might ask whether
table_dimensions(t, 2, 3) is in set S1. Player
Two knows that the table has three rows and two columns,
so the correct form of such a sentence should be
table_dimensions(t, 3, 2), so Player Two
correctly answers no.
The complication arises when Player 1 asks whether the
sentence table_dimensions(t, 3, 2) is in S1.
It could be; the order of arguments is correct. But does
S1 refer to the table in question as t? Or as t1?
Or by some other name? There is no way for Player Two to
find out: in symbolic logic, the identifiers used to denote individuals are
arbitrary and do not in themselves carry information.
If Player Two is required to guess how
S1 spells the identifier, then the game quickly becomes
uninteresting. We need some way to make such guessing
unnecessary. Perhaps the players could agree in advance
on the names of all the individuals to be postulated in
the universe of discourse. That would simplify life a
great deal, but it is not always possible: S1 and S2
do not necessarily agree on the number or nature of the
individuals to be postulated in the universe of
discourse.
We therefore impose a rule that the arbitrary
identifier used for each individual must be discoverable
from the essential properties of that individual. For
each individual with an arbitrary identifier, the set of
sentences referring to that individual must include some
predicate which is true of that individual and of no other
individual in the universe of discourse.[8]
Set S1, for example, assigns the arbitrary identifier
T to the table being described. Since for purposes of
the example there is never more than one table in the
universe of discourse, a predicate like
this_table(T) can be used to identify the
table uniquely, make the identifier T discoverable,
and satisfy the rule. We therefore add that predicate
to our sketch of S1:
this_table(T)
The natural numbers 1, 2, and 3, and various
strings of Unicode characters are referred to using
standard well known notations and not using arbitrary
identifiers, so we do not need such uniquely identifying
predicates for them.[9]
Set S2 identifies a larger set of individuals, but we
can easily use the positions of rows, columns, and cells
within a table to uniquely identify them. The individuals
postulated in the universe of discourse by S2 can all be
identified with the following set of uniquely identifying
predicates:
table_row(r1, t1, 1)
table_row(r2, t1, 2)
table_row(r3, t1, 3)
table_column(c1, t1, 1)
table_column(c2, t1, 2)
table_cell(c11, t1, 1, 1)
table_cell(c21, t1, 1, 2)
table_cell(c12, t1, 2, 1)
table_cell(c22, t1, 2, 2)
table_cell(c13, t1, 3, 1)
table_cell(c23, t1, 3, 2)
When such uniquely identifying predicates are available,
we no longer have to wonder if a cell whose existence
is predicated by set S1 and the S1 → S2 rules is
supposed to represent cell c11 or cell c21 or one of the
others.
The rules of the game can now be refined: players are
forbidden to ask about sentences involving arbitrary
identifiers. So Player One cannot pose the sentence
table_dimensions(t, 3, 2) as a challenge.
Instead, all sentences must use bound variables; the
uniquely identifying predicates make it possible to bind
variables reliably to any chosen individual. So Player
One can usefully ask:
(∃ x)[this_table(x) ∧ table_dimensions(x, 3, 2)]
The form of the query is superficially more complicated, but the
outer structure and first part of the sentence (namely
(∃ x)[this_table(x) ∧ ...])
serve merely to set up the real question, namely
table_dimensions(x, 3, 2).[10]
The S1 → S2 translation inference rules
We can translate from the vocabulary of S1 into that
of S2 by these rules:
We begin with existential claims for the individuals in S2,
beginning with the table and continuing with the rows, columns,
and cells.
(∃ x)[this_table(x)]
⇒
(∃ y)[table(y)]
(∀ x, y)(∀ n ∈ ℕ, m ∈ ℕ, i ∈ ℕ)
[(this_table(x)
∧ table(y)
∧ table_dimensions(x, n, m)
∧ 1 ≤ i
∧ i ≤ n)
⇒
(∃1r)[table_row(r, y, i)]]
(∀ x, y)(∀ n ∈ ℕ, m ∈ ℕ, i ∈ ℕ)
[(this_table(x)
∧ table(y)
∧ table_dimensions(x, n, m)
∧ 1 ≤ i
∧ i ≤ m)
⇒
(∃1c)[table_column(c, y, i)]]
(∀ x, y)(∀ n ∈ ℕ, m ∈ ℕ, i ∈ ℕ, j ∈ ℕ)
[(this_table(x)
∧ table(y)
∧ table_dimensions(x, n, m)
∧ 1 ≤ i ∧ i ≤ n
∧ 1 ≤ j ∧ j ≤ m)
⇒
(∃1c)[table_cell(c, y, i, j)]]
Next, we specify that the rows are rows, the columns are columns,
etc.
(∀ x, y, r)(∀ n ∈ ℕ, m ∈ ℕ, i ∈ ℕ)
[(this_table(x)
∧ table(y)
∧ table_dimensions(x, n, m)
∧ (1 ≤ i ∧ i ≤ n)
∧ table_row(r, y, i)
⇒
row(r)]
(∀ x, y, c)(∀ n ∈ ℕ, m ∈ ℕ, i ∈ ℕ)
[(this_table(x)
∧ table(y)
∧ table_dimensions(x, n, m)
∧ (1 ≤ i ∧ i ≤ m)
∧ table_column(c, y, i)
⇒
column(c)]
(∀ x, y, c)(∀ n ∈ ℕ, m ∈ ℕ)
[(this_table(x)
∧ table(y)
∧ isTableHeader(x, n, m)
∧ table_cell(c, y, n, m)
⇒
headingcell(c)]
(∀ x, y, c)(∀ n ∈ ℕ, m ∈ ℕ)
[(this_table(x)
∧ table(y)
∧ ¬ isTableHeader(x, n, m)
∧ table_cell(c, y, n, m)
⇒
datacell(c)]
(∀ x, y, c)(∀ n ∈ ℕ, m ∈ ℕ)
[(this_table(x)
∧ table(y)
∧ cell(cx, n, m)
∧ table_cell(c, y, n, m)
⇒
cell(c)]
The table_rowsequence, table_colsequence,
row_cellsequence, and column_cellsequence predicates
are a little more complicated, as they require
the construction of a sequence of objects.
While sequences are familiar enough to anyone involved
with XML or with contemporary programming languages, there
are a variety of ways they can be specified in logical
terms. One common approach[11] which
suffices for our present purposes is to say that a
sequence s of length n is a mapping from an initial
segment of the counting numbers (1, 2, ... n) to a
set.
To indicate that a variable s in a logical expression
denotes a sequence, we declare it as (s ∈ Seq).
A sequence can be written out as a whole
by listing its members, in order, between angle
brackets (as was done above in the discussion of set
S2); an individual member of the sequence can
be identified by writing the number for its position
and the element in that position, joined by
the symbol ↦, e.g. 1 ↦ A, 2 ↦ B,
... For clarity, such mappings may be parenthesized:
(1 ↦ A), (2 ↦ B), ...
We say that sequence s has element x at position
n by writing (n ↦ x) ∈ s.
Armed with this account of sequences, we can now
describe the sequences of rows and columns in the table
and the sequences of cells in columns and rows.
(∀ x, y, r)
(∀ n ∈ ℕ, m ∈ ℕ, i ∈ ℕ, s ∈ Seq)
[(this_table(x)
∧ table(y)
∧ row(r)
∧ table_dimensions(x, n, m)
∧ ((i ↦ r) ∈ s ⇔ (1 ≤ i
∧ i ≤ n ∧ table_row(r, y, i)))
∧ (i ≤ 0 ∨ n < i)
⇒
¬(∃ z)[(i ↦ z) ∈ s])
⇒
table_rowsequence(x, s)]
(∀ x, y, c)
(n ∈ ℕ, m ∈ ℕ, i ∈ ℕ, s ∈ Seq)
[(this_table(x)
∧ table(y)
∧ column(c)
∧ table_dimensions(x, n, m)
∧
((i ↦ c) ∈ s
⇔ (1 ≤ i ∧ i ≤ m ∧ table_column(c, y, i)))
∧ (i ≤ 0 ∨ m < i)
⇒
¬(∃ z)[(i ↦ z) ∈ s])
⇒
table_colsequence(x, s)]
(∀ x, y, r, c)
(n ∈ ℕ, m ∈ ℕ, i ∈ ℕ, j ∈ ℕ, s ∈ Seq)
[(this_table(x)
∧ table(y)
∧ table_dimensions(x, n, m)
∧ row(r)
∧ 1 ≤ i ∧ i ≤ n
∧ table_row(r, t, i)
∧ ((j ↦ c) ∈ s
⇔ 1 ≤ j ∧ j ≤ m ∧ table_cell(c, y, i, j))
∧ (j ≤ 0 ∨ m < j)
⇒
¬(∃ z)[(j ↦ z) ∈ s])
⇒
row_cellsequence(r, s)]
(∀ x, y, col, c)
(n ∈ ℕ, m ∈ ℕ, i ∈ ℕ, j ∈ ℕ, s ∈ Seq)
[(this_table(x)
∧ table(y)
∧ table_dimensions(x, n, m)
∧ column(col)
∧ 1 ≤ j ∧ j ≤ m
∧ table_column(col, t, j)
∧ ((i ↦ c) ∈ s
⇔ 1 ≤ i ∧ i ≤ n ∧ table_cell(c, y, i, j))
∧ (i ≤ 0 ∨ n < i)
⇒
¬(∃ z)[(i ↦ z) ∈ s])
⇒
col_cellsequence(col, s)]
Finally, we specify rules for the cell_text predicate.
(∀ x, y, c, s)
(i ∈ ℕ, j ∈ ℕ)
[(this_table(x) ∧ table(y) ∧ table_cell(c, y, i, j)
and tableCellContent(x, i, j, s))
⇒
cell_text(c, s)]
The S2 → S1 translation inference rules
And conversely, we can translate from the vocabulary of S2 into that
of S1 by these rules:
(∀ x, y)
[table(x) ⇒ this_table(y)]
(∀ x, y, c, i ∈ ℕ, j ∈ ℕ)
[(table(x) ∧ this_table(y)
∧ headingcell(c)
∧ table_cell(c, x, i, j))
⇒
isTableHeader(y, i, j)]
(∀ x, y, c, s, i ∈ ℕ, j ∈ ℕ)
[(table(x) ∧ this_table(y)
∧ cell(c)
∧ table_cell(c, x, i, j))
∧ cell_text(c, s))
⇒
tableCellContent(y, i, j, s)]
Note that no translation rules are listed which
allow us to infer any instances of the S1 predicates
tableWidth, tableMarginLeft, tableCellBackgroundColor,
tableCellVerticalAlign, tableCellBorderWidth, or
tableCellBorderStyles. This reflects the fact that
the set of enumerations S1 includes an analysis of
the CSS styling for the XHTML table, while
the Oasis Exchange Model (CALS) encoding of the table
lacks such style information.
As a consequence, here D1 and D2 are not equivalent;
instead, D2 subsumes D1.
In Prolog
The logic outlined above has been translated into Prolog
to demonstrate that the inferences required are automatically
derivable.[12]
The following files are available in the appendices to this paper:[13]
The S1
→ S2
translation inference rules are in
Appendix C.
The S2
→ S1
translation inference rules are in Appendix D.
The two sets of translation inference rules adopt the convention
of defining two predicates with standard names:
individuals asserts the existence of
all the individuals mentioned in the target set of
sentences, using a Skolem function (gensym)
to create a name for the individual, and asserting the
uniquely identifying predicate for that individual.
obligations consists of a conjunction
of all the ground facts of the target set of sentences,
using appropriately bound variables in place of the
identifiers actually used in the target set.
Further work
Several questions arise from the operational
definitions offered here of document equivalence,
subsumption, compatibility, and contradiction.
Can the constraints given above on the form
taken by the enumerations for a given document be
relaxed?
Given two compatible documents in a given vocabulary V,
is it always, sometimes, or never possible to generate
documents in V representing the meet and the join of the
two initial documents?
We conjecture that it is sometimes possible, depending on
properties of the vocabulary V and the specific information
in the two documents. This conjecture leads to another question:
Is it possible to design a vocabulary V so as
to ensure that the meet and the join of arbitrary sets of
documents is always representable?
If it is not possible to guarantee that the meet and join
are always representable, can we increase the likelihood
that it's representable for interesting cases?
Can the technique described here scale up to
full colloquial XML vocabularies like JATS, DocBook,
TEI, and HTML? Or even to full table markup in the
CALS and XHTML table models? We believe so, but
cannot exhibit a full formal description of any
colloquial XML vocabulary.
For two given vocabularies V1 and V2, is it
possible to generalize about the relative position in the
lattice of documents in V1 and V2? Is it
possible (and if so, how can it be done?) to define V1 and V2
such that
No document D1 in vocabulary V1 is equivalent
to any document D2 in V2.
Every document D1 in vocabulary V1 has at least
one equivalent document D2 in V2.
Every document
D1 in vocabulary
V1 can be
reached by a down-translation from some document
D2 in
V2 (i.e. for
every D1, there
exists some D2
such that the join of
D1 and
D2 is
D2).
Appendix A. Sentence set
S1
% Prolog equivalent of S1 sentences
%
% Revisions
%
% 2014-04-18 : YM : added y/m prefix to each (exported) predicate
% 2014-04-15 : YMA : reintroduced the module definition (no change)
% 2014-04-12 : YMA : various major revisions
% 2014-04-08 : CMSMcQ : supply uniquely identifying predicate for T
% extend with style rules
% remove general rules to y_general.pl
% 2014-04-03 : CMSMcQ : initial translation from table_trial_YM.txt
:- module(y, [y_this_table/1,
y_table_dimensions/3,
y_cell/3,
y_table_consistent/1,
y_tableWidth/2,
y_tableMarginLeft/2,
y_isTableHeader/3,
y_tableCellContent/4,
y_tableCellBackgroundColor/4,
y_tableCellVerticalAlign/4,
y_tableCellBorderWidth/4,
y_tableCellBorderStyle/4
]).
% Individuals: for each individual that needs one, we have a uniquely
% identifying predicate. (Here, we have only one individual needing
% such a predicate: the table.)
% t is a table; it is the only individual we identify
% apart from the natural numbers and the strings
y_this_table(t).
y_tableWidth(t,"80%").
y_tableMarginLeft(t,"10%").
y_isTableHeader(t, 1, 1).
y_isTableHeader(t, 1, 2).
y_tableCellContent(t, 1, 1, "Année").
y_tableCellContent(t, 1, 2, "Événement").
y_tableCellContent(t, 2, 1, "1969").
y_tableCellContent(t, 3, 1, "1992").
y_tableCellContent(t, 2, 2, "Création d'ARPANET, le premier réseau national américain d'ordinateurs, par le Defense Department's Advanced Research Projects Agency (DARPA)").
y_tableCellContent(t, 3, 2, "Mise en service du World Wide Web par le CERN (Centre européen de recherche nucléaire), en Suisse").
% Uncomment to test for incorrectness
% y_tableCellContent(t, 4, 4, "Extraneous").
% The style is consistent across cells in the table, so we can
% have some general rules.
y_cell(T,R,C) :- y_tableCellContent(T, R, C, _).
y_tableCellBackgroundColor(t,R,C,"#CCCCCC") :- y_isTableHeader(t,R,C).
y_tableCellVerticalAlign(t,R,C,"top") :- y_cell(t,R,C).
y_tableCellBorderWidth(t,R,C,"thin") :- y_cell(t,R,C).
y_tableCellBorderStyle(t,R,C,"solid") :- y_cell(t,R,C).
% General rules about tables. We formulate some of these as tests of
% consistent description for a table.
table_nonempty_finite_rectangular_and_dense(T) :-
y_cell(T, R, C),
\+ (
y_cell(T, Row, Col),
(Row > R;
Col > C)
;
between(1, R, Row),
between(1, C, Col),
(\+ y_cell(T, Row, Col))
).
table_properties_ok(T) :-
\+ (has_properties(T,Row,Col),
\+ cell(T,Row,Col)).
has_properties(T, R, C) :-
y_isTableHeader(T,R,C);
y_tableCellBackgroundColor(T,R,C,_);
y_tableCellVerticalAlign(T,R,C,_);
y_tableCellBorderWidth(T,R,C,_);
y_tableCellBorderStyle(T,R,C,_).
y_table_consistent(T) :-
table_nonempty_finite_rectangular_and_dense(T),
table_properties_ok(T).
y_table_dimensions(T,Row,Col) :-
% This is not a validity check, but rather a simple way
% to compute a table's dimensions.
% Note: Reliable only if table_nonempty_finite_rectangular_and_dense(T)
y_cell(T,Row,Col),
R is Row + 1,
C is Col + 1,
\+ y_cell(T, R, 1),
\+ y_cell(T, 1, C).
Appendix B. Sentence set
S2
% S2 Inferences from the XHTML table
% 2014-04-18 : YM : added y/m prefix to each (exported) predicate
% 2014-04-17 : YM : Corrected definitions of row(R) and col(C)
(R and C were at a wrong place in the predicate
on the right)
% 2014-04-08 : MSM : add uniquely identifying predicates for all
% individuals.
% 2014-03-21 : MSM : transcribed ---28.
:- module(m, [m_headingcell/1,
m_datacell/1,
m_cell/1,
m_row/1,
m_column/1,
m_this_table/1,
m_table_row/3,
m_table_column/3,
m_table_cell/4,
m_table_rowsequence/2,
m_table_colsequence/2,
m_row_cellsequence/2,
m_col_cellsequence/2,
m_cell_text/2,
m_table_row/2,
m_table_col/2,
m_table_cell/2,
m_row_cell/2,
m_col_cell/2]).
% Individuals
m_this_table(t1).
m_table_row(r1, t1, 1).
m_table_row(r2, t1, 2).
m_table_row(r3, t1, 3).
m_table_column(c1, t1, 1).
m_table_column(c2, t1, 2).
% convention: cell X, Y is cXY
m_table_cell(c11,t1,1,1).
m_table_cell(c21,t1,1,2).
m_table_cell(c12,t1,2,1).
m_table_cell(c22,t1,2,2).
m_table_cell(c13,t1,3,1).
m_table_cell(c23,t1,3,2).
% basic classes
m_row(R) :- m_table_row(R,_T,_N).
m_column(C) :- m_table_column(C,_T,_N).
m_headingcell(c11).
m_headingcell(c21).
m_datacell(c12).
m_datacell(c22).
m_datacell(c13).
m_datacell(c23).
% Relations
%
m_table_rowsequence(t1, [r1, r2, r3]).
m_table_colsequence(t1, [c1, c2]).
m_row_cellsequence(r1, [c11, c21]).
m_row_cellsequence(r2, [c12, c22]).
m_row_cellsequence(r3, [c13, c23]).
m_col_cellsequence(c1, [c11, c12, c13]).
m_col_cellsequence(c2, [c21, c22, c23]).
m_cell_text(c11, "Année").
m_cell_text(c21, "Événement").
m_cell_text(c12, "1969").
m_cell_text(c22, "Création d'ARPANET, le premier réseau national américain d'ordinateurs, par le Defense Department's Advanced Research Projects Agency (DARPA)").
m_cell_text(c13, "1992").
m_cell_text(c23, "Mise en service du World Wide Web par le CERN (Centre européen de recherche nucléaire), en Suisse").
% Synonyms / analytic sentences
% These are not essential, but may be convenient to have.
m_cell(X) :- m_headingcell(X).
m_cell(X) :- m_datacell(X).
m_table_row(T,R) :- m_this_table(T), m_row(R), m_table_rowsequence(T,Rs), member(R,Rs).
m_table_col(T,C) :- m_this_table(T), m_column(C), m_table_colsequence(T,Cs), member(C,Cs).
m_row_cell(R,C) :- m_row(R), m_cell(C), m_row_cellsequence(R,Cs), member(C,Cs).
m_col_cell(Col,Cell) :- m_column(Col),
m_cell(Cell),
m_col_cellsequence(Col, Cells),
member(Cell,Cells).
m_table_cell(T,C) :-
m_this_table(T),
m_cell(C),
m_row(R),
m_table_row(T,R),
m_row_cell(R,C).
m_table_cell(T,C) :-
m_this_table(T),
m_cell(C),
m_column(C2),
m_table_col(T,C2),
m_col_cell(C2,C).
Appendix C. The S1
→ S2
translation inference rules
% S1 to S2: translation rules from Y sentences to M sentences.
%
% 2014-04-18 : YM : added y/m prefix to each (exported) predicate
% 2014-04-17 : YMA : added M goals at the end
% 2014-04-08 : CMSMcQ : revise for new y.pl and new m.pl
% 2014-04-03 : CMSMcQ : first version
%
% Our job is to define the predicates exported from m.pl
% in terms of the vocabulary in y.pl. Or at least the
% predicates used for ground facts.
%
% So: to define are first the uniquely identifying predicates:
%
% m_this_table/1,
% m_table_row/3,
% m_table_column/3,
% m_table_cell/4
%
% And then the other predicates:
%
% m_headingcell/1,
% m_datacell/1,
% m_cell/1,
% m_row/1,
% m_column/1,
% m_table_rowsequence/2,
% m_table_colsequence/2,
% m_row_cellsequence/2,
% m_col_cellsequence/2,
% m_cell_text/2,
% m_table_row/2,
% m_table_col/2,
% m_table_cell/2,
% m_row_cell/2,
% m_col_cell/2.
% And we are given:
% y_this_table/1, y_cell/3, y_table_dimensions/3, y_table_consistent/1,
% y_tableWidth/2, y_tableMarginLeft/2, y_isTableHeader/3, y_tableCellContent/4,
% y_tableCellBackgroundColor/4, y_tableCellVerticalAlign/4,
% y_tableCellBorderWidth/4, y_tableCellBorderStyle/4
% First, assert the existence of appropriate individuals.
individuals :-
abolish(m_this_table/1),
known_table,
abolish(m_table_row/3),
known_rows,
abolish(m_table_column/3),
known_columns,
abolish(m_table_cell/4),
known_cells.
known_table :-
gensym('m', T),
assertz(m_this_table(T)).
known_rows :- y_this_table(T0), m_this_table(T),
y_table_dimensions(T0, RowCount, _ColCount),
( between(1,RowCount,RowNum),
gensym('m',R),
assertz(m_table_row(R, T, RowNum)),
fail
)
;
true.
known_columns :- y_this_table(T0), m_this_table(T),
y_table_dimensions(T0, _RowCount, ColCount),
( between(1,ColCount,ColNum),
gensym('m',C),
assertz(m_table_column(C, T, ColNum)),
fail
)
;
true.
known_cells :- y_this_table(T0), m_this_table(T),
y_table_dimensions(T0, RowCount, ColCount),
( between(1, RowCount, RowNum),
between(1, ColCount, ColNum),
gensym('m',Cell),
assertz(m_table_cell(Cell, T, RowNum, ColNum)),
fail
)
;
true.
% m_headingcell/1
m_headingcell(H) :- y_this_table(T0),
y_isTableHeader(T0,Row,Col),
m_this_table(T1),
m_table_cell(H,T1,Row,Col).
% m_datacell/1,
m_datacell(D) :- y_this_table(T0),
y_cell(T0,Row,Col),
\+ (y_isTableHeader(T0,Row,Col)),
m_this_table(T1),
m_table_cell(D, T1, Row, Col).
% m_cell/1,
m_cell(Cell) :- y_this_table(T0),
y_cell(T0, Row, Col),
m_this_table(T1),
m_table_cell(Cell, T1, Row, Col).
% m_row/1,
m_row(R) :- y_this_table(T0),
y_table_dimensions(T0, RowCount, _ColCount),
between(1, RowCount, RowNum),
m_this_table(T1),
m_table_row(R, T1, RowNum).
% m_column/1,
m_column(R) :- y_this_table(T0),
y_table_dimensions(T0, _RowCount, ColCount),
between(1, ColCount, ColNum),
m_this_table(T1),
m_table_column(R, T1, ColNum).
% m_this_table/1,
% taken care of by the 'individuals' predicate.
% the following are all derivative. If we can prove the
% ground facts, they all follow.
% m_table_row/3,
% m_table_column/3,
% m_table_cell/4,
% m_table_rowsequence/2,
m_table_rowsequence(T, Rows) :-
y_this_table(T0), m_this_table(T),
y_table_dimensions(T0, RowCount, _ColCount),
aux_rowseq(T, 1, RowCount, Rows).
aux_rowseq(Table, RowNum, RowCount, [Row|Rows]) :-
RowNum < RowCount,
m_table_row(Row, Table, RowNum),
NextRow is RowNum + 1,
aux_rowseq(Table, NextRow, RowCount, Rows).
aux_rowseq(Table, N, N, [Row]) :-
m_table_row(Row, Table, N).
% m_table_colsequence/2,
m_table_colsequence(T, Cols) :-
y_this_table(T0), m_this_table(T),
y_table_dimensions(T0, _RowCount, ColCount),
aux_colseq(T, 1, ColCount, Cols).
aux_colseq(Table, ColNum, ColCount, [Col|Cols]) :-
ColNum < ColCount,
m_table_column(Col, Table, ColNum),
NextCol is ColNum + 1,
aux_colseq(Table, NextCol, ColCount, Cols).
aux_colseq(Table, N, N, [Col]) :-
m_table_column(Col, Table, N).
% m_row_cellsequence/2,
m_row_cellsequence(R,Cells) :-
m_table_row(R, Table, RowNum),
aux_rowcells(Table, RowNum, Cells).
aux_rowcells(Table, RowNum, Cells) :-
y_this_table(T0),
y_table_dimensions(T0, _RowCount, ColCount),
aux2_rowcells(Table, RowNum, 1, ColCount, Cells).
aux2_rowcells(Table, RowNum, ColNum, ColCount, [Cell|Cells]) :-
ColNum =< ColCount,
m_table_cell(Cell, Table, RowNum, ColNum),
NextCol is ColNum + 1,
aux2_rowcells(Table, RowNum, NextCol, ColCount, Cells).
aux2_rowcells(_Table, _RowNum, ColNum, ColCount, []) :-
ColNum > ColCount.
% m_col_cellsequence/2,
m_col_cellsequence(Col,Cells) :-
m_table_column(Col, Table, ColNum),
aux_colcells(Table, ColNum, Cells).
aux_colcells(Table, ColNum, Cells) :-
y_this_table(T0),
y_table_dimensions(T0, RowCount, _ColCount),
aux2_colcells(Table, ColNum, 1, RowCount, Cells).
aux2_colcells(Table, ColNum, RowNum, RowCount, [Cell|Cells]) :-
RowNum =< RowCount,
m_table_cell(Cell, Table, RowNum, ColNum),
NextRow is RowNum + 1,
aux2_colcells(Table, ColNum, NextRow, RowCount, Cells).
aux2_colcells(_Table, _ColNum, RowNum, RowCount, []) :-
RowNum > RowCount.
% m_cell_text/2,
m_cell_text(C, Chars) :-
y_this_table(T0), m_this_table(T1),
m_table_cell(C, T1, Row, Col),
y_tableCellContent(T0, Row, Col, Chars).
obligations :-
m_this_table(T),
m_table_row(R1, T, 1),
m_table_row(R2, T, 2),
m_table_row(R3, T, 3),
m_table_column(C1, T, 1),
m_table_column(C2, T, 2),
m_table_cell(C11, T, 1, 1),
m_table_cell(C21, T, 1, 2),
m_table_cell(C12, T, 2, 1),
m_table_cell(C22, T, 2, 2),
m_table_cell(C13, T, 3, 1),
m_table_cell(C23, T, 3, 2),
m_row(R1),
m_row(R2),
m_row(R3),
m_column(C1),
m_column(C2),
m_headingcell(C11),
m_headingcell(C21),
m_datacell(C12),
m_datacell(C22),
m_datacell(C13),
m_datacell(C23),
m_table_rowsequence(T, [R1, R2, R3]),
m_table_colsequence(T, [C1, C2]),
m_row_cellsequence(R1, [C11, C21]),
m_row_cellsequence(R2, [C12, C22]),
m_row_cellsequence(R3, [C13, C23]),
m_col_cellsequence(C1, [C11, C12, C13]),
m_col_cellsequence(C2, [C21, C22, C23]),
m_cell_text(C11, "Année"),
m_cell_text(C21,"Événement"),
m_cell_text(C12,"1969"),
m_cell_text(C22,"Création d'ARPANET, le premier réseau national américain d'ordinateurs, par le Defense Department's Advanced Research Projects Agency (DARPA)"),
m_cell_text(C13,"1992"),
m_cell_text(C23,"Mise en service du World Wide Web par le CERN (Centre européen de recherche nucléaire), en Suisse").
Appendix D. The S2 →
S1 translation
inference rules
% m_to_y.pl: translate from M sentences to Y sentences.
% Our job here is to define rules for
% all the predicates exported from y.pl, in
% terms of the rules in m.pl.
% 2014-04-18 : YM : added y/m prefix to each (exported) predicate
% 2014-04-17 : YM : added Y goals
% First, assert the existence of appropriate individuals
% (here, only one: the table).
individuals :-
abolish(y_this_table/1),
gensym('y',T),
assertz(y_this_table(T)).
% y_isTableHeader/3
% y_isTableHeader(Table, Rownum, Colnum)
y_isTableHeader(Table0,Row,Col) :-
m_this_table(Table1),
y_this_table(Table0),
m_headingcell(Cell),
table_cell_rownum(Table1, Cell, Row),
table_cell_colnum(Table1, Cell, Col).
% y_tableCellContent/4
% y_tableCellContent(Table, Row, Column, String)
y_tableCellContent(Table0, Row, Col, String) :-
m_this_table(Table1),
y_this_table(Table0),
m_cell(Cell),
table_cell_rownum(Table1, Cell, Row),
table_cell_colnum(Table1, Cell, Col),
m_cell_text(Cell, String).
table_cell_rownum(Table,Cell,Rownum) :-
m_this_table(Table),
m_row(Row),
m_table_rowsequence(Table,Rows),
nth1(Rownum,Rows,Row),
m_cell(Cell),
m_row_cellsequence(Row,Cs),
member(Cell,Cs).
table_cell_colnum(Table, Cell, Colnum) :-
m_this_table(Table),
m_column(Column),
m_table_colsequence(Table,Columns),
nth1(Colnum,Columns,Column),
m_cell(Cell),
m_col_cellsequence(Column, Cells),
member(Cell, Cells).
obligations :-
y_this_table(Ty),
y_isTableHeader(Ty, 1, 1),
y_isTableHeader(Ty, 1, 2),
y_tableCellContent(Ty, 1, 1, "Année"),
y_tableCellContent(Ty, 1, 2, "Événement"),
y_tableCellContent(Ty, 2, 1, "1969"),
y_tableCellContent(Ty, 3, 1, "1992"),
y_tableCellContent(Ty, 2, 2, "Création d'ARPANET, le premier réseau national américain d'ordinateurs, par le Defense Department's Advanced Research Projects Agency (DARPA)"),
y_tableCellContent(Ty, 3, 2, "Mise en service du World Wide Web par le CERN (Centre européen de recherche nucléaire), en Suisse").
[Marcoux 2009]
Marcoux, Yves.
Intertextual semantics generation for structured documents:
a complete implementation in XSLT.Actes du 12e Colloque international sur
le Document Électronique (CiDE.12),
Montréal, octobre 2009, pp. 159-170.
[Marcoux and Rizkallah 2009]
Marcoux, Yves, and Élias Rizkallah.
Intertextual semantics:
A semantics for information design.Journal of the American Society for
Information Science & Technology
60.9 (2009): 1895-1906.
doi:https://doi.org/10.1002/asi.21134.
[Sperberg-McQueen 2011] Sperberg-McQueen, C. M.
What constitutes successful format conversion?
Towards a formalization of ‘intellectual content’.International Journal of Digital Curation
6.1 (2011): 153-164. doi:https://doi.org/10.2218/ijdc.v6i1.179.
[Sperberg-McQueen et al. 2002]
Sperberg-McQueen, C. M., Renear, A., Huitfeldt, C., and
Dubin, D. Skeletons in the closet: Saying what markup means. Paper
given at ALLC/ACH, Tübingen, Germany, July 2002.
[Welty and Ide 1999]
Welty, Christopher, and Nancy Ide. 1999. Using
the Right Tools: Enhancing Retrieval from Marked-up Documents.Computers and the Humanities 1999: 59-84.
Originally delivered at TEI 10, Providence(1997). doi:https://doi.org/10.1023/A:1001800717376.
[Wetzel 2009]
Wetzel, Linda.
Types & tokens: On abstrct objects.
Cambridge, Mass.; London: MIT Press, 2009.
[2] In these expressions, * indicates closure
under logical inference, given some universal set of
assumptions independent of the documents under consideration
— what we elsewhere describe as world
knowledge. The node ⊤ is
the universal set of all sentences; note that ⊤ is
self-contradictory: for every
sentence s included in ⊤, the negation of s is also
included.
The node
⊥ is not the empty set, as might be expected,
but the set of sentences which can be inferred
without reference to any information in any document:
that is, the set of tautologies, the set of sentences
representing world knowledge, and the set of sentences
inferrable from the tautologies together with world
knowledge.
And finally \ is the set difference operator:
(S1 \ S2) contains those sentences of S1 which
are not in S2.
There is a certain notational tension in the fact
that lattices based on the subset/superset relation
are typically drawn with the ⊤ as the universal set
and ⊥ as the empty set (or, as in our case, a
set minimal in some way), while discussions of logic
sometimes use the symbol ⊤ to denote truth
and the symbol ⊥ to denote falsehood, or
contradiction. Perhaps we should draw our lattice
in the other direction, with subsets above not below
their supersets.
[3]
The topmost point in any lattice of sets is taken
by the universal set; the bottom point in such a lattice
is taken by the empty set. The sets we are concerned
with are all closed under logical inference, and
the most striking characteristic of a logical
contradiction is that it allows absolutely any
sentence to be inferred. So any set that contains
a contradiction automatically also contains all possible
sentences.
[4] The processes by which
these inferences may be enumerated has been described
elsewhere; we won't expound it again here.
[5]
Since notations for symbolic logic vary widely, it may
be helpful to summarize here the essentials of our
notation.
The symbols used for variables and predicate names
follow, more or less, the rules for non-colonized names
in XML.
Atomic facts are written in the form
p(a1,
a2, ...,
an); here p is the
predicate symbol and
a1 through
an are its
arguments.
Sentences can be combined with the connectors
∧ (and), ∨ (or), ⇒ (implies, if-then),
and ⇔ (if and only if).
Sentences of the form (∀ x)[P(x)]
may be read For all x, P(x),
i.e. for everything that exists, the predicate P
holds.
Sentences of the form (∃ x)[P(x)]
may be read There exists some x such that P(x),
i.e. some thing exists (here identified with the variable
x) of which the predicate P
holds.
Sentences of the form (∃1x)[P(x)]
may be read There exists exactly one x such that P(x).
This is conventionally (following Russell) taken as
meaning (∃ x)[P(x) ∧ (∀ y)[P(y) ⇒
y =
x]].
For brevity, sentences of the form
(∃ x)[x ∈ ℕ ∧ P(x)]
may be abbreviated
(∃ x ∈ ℕ)[P(x)].
And similarly for other quantifiers (∀, ∃1).
ℕ here means the natural numbers, i.e.
the non-negative integers.
For brevity, multiple quantifiers may be written together:
(∃ x)(∃ y)[P(x, y)]
may be abbreviated
(∃ x, y)[P(x, y)].
And similarly for other quantifiers (∀, ∃1).
[6]
It is impossible to contemplate this list without
thinking about those who have argued in the past
that documents would be much easier to process if
instead of XML people would use some more semantic
notation, like symbolic logic or RDF or some knowledge-representation
scheme, to represent them. The only way we can imagine to
produce an actual printed table from these logical
sentences is to try to translate them back into
markup, and then use conventional XML processing
to display the table.
[7] That is,
for simplicity we here ignore the possibility of embedded markup
inside the table cells. Since table cells can typically include
more or less arbitrary paragraph- or phrase-level markup from
the host markup language, addressing the equivalence of cell
contents in the general case would involve a full account of
the host markup language(s), and would rather spoil the simplicity
of the example.
[8] The rule does not apply for individuals
with well-known names, such that the identifier used for
the individual is not arbitrarily chosen by the creator of
the set S. The natural numbers (for example) and the
set of strings of Unicode characters do not need such
uniquely identifying predicates.
[9] Before the
character set enthusiasts among our readership ask, we
assume that all strings have been normalized using some
appropriate form of Unicode normalization, so that
Année and Événement are always spelled the
same way. The more or less analogous issues of whitespace
normalization, by contrast, we simply ignore; they would
take us too far afield.
[10]
The notation becomes lighterweight when we allow functions
in our logical system and require not a uniquely identifying
predicate for each individual but a function call which
returns it. Then the challenge can be expressed
table_dimensions( this_table(), 3, 2).
This shorter form is often helpful in practice, but imposes
a heavier burden on the prose exposition, so we omit further
mention of it here.
[11] One
example of this common approach is the account of
sequences on pages 128ff of [Wetzel 2009], which in general we follow (except that
we number members of a sequence from one, not zero, to
minimize confusion for XPath users).
[12] It must be admitted that the
sentences enumerated in the Prolog are not quite identical
to those given in the predicate-calculus versions given
above, owing to changes during the preparation of the paper
in our understanding of the best way to formulate the
logical descriptions of tables. Of course, some changes
of detail are required by the nature of Prolog.
Marcoux, Yves.
Intertextual semantics generation for structured documents:
a complete implementation in XSLT.Actes du 12e Colloque international sur
le Document Électronique (CiDE.12),
Montréal, octobre 2009, pp. 159-170.
Marcoux, Yves, and Élias Rizkallah.
Intertextual semantics:
A semantics for information design.Journal of the American Society for
Information Science & Technology
60.9 (2009): 1895-1906.
doi:https://doi.org/10.1002/asi.21134.
Severson, Eric, and Harvey Bingham. TABLE INTEROPERABILITY:
Issues for the CALS Table ModelOASIS Technical
Research Paper 9501:1995.
https://www.oasis-open.org/specs/a501.htm
Sperberg-McQueen, C. M.
What constitutes successful format conversion?
Towards a formalization of ‘intellectual content’.International Journal of Digital Curation
6.1 (2011): 153-164. doi:https://doi.org/10.2218/ijdc.v6i1.179.
Sperberg-McQueen, C. M., Renear, A., Huitfeldt, C., and
Dubin, D. Skeletons in the closet: Saying what markup means. Paper
given at ALLC/ACH, Tübingen, Germany, July 2002.
Welty, Christopher, and Nancy Ide. 1999. Using
the Right Tools: Enhancing Retrieval from Marked-up Documents.Computers and the Humanities 1999: 59-84.
Originally delivered at TEI 10, Providence(1997). doi:https://doi.org/10.1023/A:1001800717376.