How to cite this paper
Huitfeldt, Claus, C. M. Sperberg-McQueen and Yves Marcoux. “Markup Meaning and Mereology.” Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). https://doi.org/10.4242/BalisageVol3.Huitfeldt01.
Balisage: The Markup Conference 2009
August 11 - 14, 2009
Balisage Paper: Markup Meaning and Mereology
Claus Huitfeldt
Associate professor
University of Bergen, Norway
Claus Huitfeldt is Associate Professor at the Department of Philosophy of the
University of Bergen. His research interests are within philosophy of language, philosophy
of technology, text theory, editorial philology and markup theory. He was founding
Director (1990-2000) of the Wittgenstein Archives at the University of Bergen, for which
he developed the text encoding system MECS as well as the editorial methods for the
publication of Wittgenstein's Nachlass - The Bergen Electronic Edition (Oxford University
Press, 2000). He has been active in the Text Encoding Initiative (TEI) since 1991, and was
centrally involved in the foundation of the TEI Consortium. Huitfeldt was Research
Director (2000-2002) of Aksis (Section for Culture, Language and Information Technology at
the Bergen University Research Foundation).
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
C. M. Sperberg-McQueen is an independent consultant for Black Mesa Technologies LLC.
He currently serves as an editor of the W3C XML Schema Definition Language (XSD) 1.1.
Yves Marcoux
Associate professor
Université de Montréal, Canada
Yves Marcoux has been a faculty member at EBSI, University of Montréal, since 1991. He is
mainly involved in teaching and research activities in the field of document informatics.
Prior to his appointment at EBSI, he worked for 10 years in systems maintenance and
development, in Canada, the U.S., and Europe. He obtained his Ph.D. in theoretical
computer science from University of Montréal in 1991. His main research interests are
document semantics, structured document implementation methodologies, and information
retrieval in structured documents. Through GRDS, his research group at EBSI, he has been
principal architect for the Governmental Framework for Integrated Document Management, a
project funded by the National Archives of Québec and by the Québec Treasury Board.
Copyright © 2009 by the authors. Used with
permission.
Abstract
When marking up a document we chop it up into elements. Elements are parts of the
document, some of which contain further elements, i.e., have parts of their own. Thus, the
part-whole relation is central to the way markup works.
Mereology is precisely the theory of part-whole relationships, but has not yet found
much application in markup theory. In this paper we provide a sketch of how mereology,
more specifically in the form of Nelson Goodman's Calculus of Individuals, might be applied
to markup.
We discuss ways of identifying the individuals of marked-up documents and of referencing
these individuals, and we sketch some ways of applying the calculus to the problem of
propagation of properties in documents.
Table of Contents
- Introduction
- The Calculus of Individuals
- The Calculus applied to XML
  - The element-as-individual approach
  - The tags and PCDATA approach
  - The character-atom approach
    - The approach
    - Examples
    - Statements and inferences
    - Conclusion
- Property Propagation — a Sketch
  - Dissective and anti-dissective properties
  - Expansive and anti-expansive properties
  - Collective and anti-collective properties
  - The HTML title element
  - The TEI sp, speaker and stage elements
  - The TEI docTitle, docDate and docAuthor elements
- Problems
  - Empty elements
    - Milestone elements
    - Other empty elements
  - Coextensive elements
- Conclusion and Future Work
Introduction
XML documents consist of marked elements, which may in turn contain sequences of marked
elements, etc. This hierarchy of elements is conveniently represented as a tree in which each
node stands for an element, in which each arc between elements stands for a parent-child
relationship, and in which the children of each node are ordered sequentially in accordance
with their document order.
While it is commonly the case that the generic identifier of an element is understood
to
ascribe a property to the element's content, that elements represented by nodes dominated
by
that element's node in the document tree are also understood to be contained by it,
and that
these nodes are understood to inherit the properties ascribed to their ancestor elements,
none
of this is always or necessarily the case.
As we have pointed out elsewhere [Sperberg-McQueen and Huitfeldt 2008], the parent-child
relationship may be taken to indicate either a containment relationship, or a dominance
relationship. Frequently these relationships coincide, and no harm is caused by not
distinguishing them. When they do not coincide, however, the result may easily be
confusing.
One view of the structure of XML documents emphasizing the part-whole relationship
is
this: A document contains elements, i.e., parts. Some of these parts contain further
elements,
i.e., have parts of their own. The generic identifiers of elements ascribe properties
to their
own content and/or to the content of elements related to them by part-whole relationships.
Mereology is precisely the theory of part-whole relationships. Even so, mereology
does not
seem to have found much application in markup theory until now. It may therefore be
interesting to investigate whether the application of mereology may give insights
relevant to
the understanding of interpretation and processing of marked-up documents.
It is sometimes said that XML provides a formal syntax for document representation,
but no
formal semantics for the interpretation or processing of this syntax. If mereology
can be
brought to bear on the ascription and propagation of properties and relations between
parts of
marked-up documents, it may help in providing a general approach to markup semantics.
For
example, the work presented here may turn out to be of direct relevance for the work
on formal
tag set descriptions and intertextual semantics specifications presented in [Marcoux et al. 2009] and [Sperberg-McQueen et al. 2009a].
Before we proceed, some words on the limitations of this paper are in order. First,
although our focus is on XML, and although we mention other markup languages in passing,
we
believe that mereology deserves to be studied in relation to markup languages in general
(such
as XML, SGML, TexMECS, LMNL, and others) rather than XML only. We think so partly
because
application of mereology may be equally or more profitable when it comes to some non-XML
markup systems, and partly because such broader studies might inspire modifications
of
— or alternatives to — any or all of these. We hope to come back to
applications of mereology to markup more generally in future work.
Second, the concept “XML document” as used in this paper refers almost
exclusively to XML in its serialized form. We do not explicitly attempt to apply mereology to
XML documents considered as graphs of XPath nodes, Infoset items, or the like.
Finally, we limit ourselves to an attempt to apply the so-called Calculus of Individuals,
a mereological system worked out by Nelson Goodman [Goodman 1977] (initially
in cooperation with Henry S. Leonard [Leonard and Goodman 1940]). As a further
simplification, and in order to ensure focus, we will ignore XML attributes, entities,
declarations, comments, processing instructions, and marked sections; in short, we will regard
XML documents as consisting of elements and their content only.
The Calculus of Individuals
The origins of mereology go back to ancient Greece, but it was taken up as a formal
study
and developed mathematically only early in the 20th century. Today, it is a well developed
formal discipline, and there are a number of different mereological systems. The term
mereology is sometimes used to refer to these formal calculi in particular, sometimes
to
formal as well as non-formalized theories of part-whole relationships in general [Libardi 1994, pp. 13–15].
Early developments of formal mereology were largely motivated by scepticism towards set
theory and the calculus of classes, and a desire to translate or reduce all
talk of abstract classes and their members to talk of concrete individuals and their parts.
Mereology therefore came to be associated with a particular ontological stance, nominalism,
and to be shunned by most adherents of other ontological views.
Such ontological considerations may or may not motivate, but do not in any way need
to
concern, our attempt to apply mereology to markup languages, however: later work in
the field
is generally taken to demonstrate that mereology and set theory may live merrily together,
that in fact the one may be seen as an extension of the other, and that the adoption
of
mereology does not by itself commit one to any particular ontological stance.
The part-whole relationships that mereology studies are relationships between entities
that are, in Goodman's terminology, called individuals. Generally speaking, an individual
may be any “thing” in a very wide sense of the word (a concrete, an abstract, a universal
or a particular), i.e., any object or entity of which something can be predicated. This is
admittedly still pretty general, and more specific talk may be in order: As examples of
individuals we may take stones, tables, chairs, animals and other medium-sized everyday
objects; but if we like we may also populate our world with individuals such as molecules,
atoms, electrons, quarks; or planets, stars and galaxies; or for that matter persons, visual
after-images, mental images or sense data. If we believe in abstract objects we may include
numbers, geometrical objects, concepts, etc., and according to some applications of mereology
there may also be temporal individuals such as processes, events, and snippets of time.
Individuals need not be contiguous, either in space or in time. This is one of the
principles of the Calculus of Individuals which has provoked some discussion. In its defence
one may point to the fact that we actually do employ the notion of at least some such
disconnected wholes in everyday language. Thus, to treat the land mass of Japan
(or any geographic entity which includes two or more islands) as an individual may seem
unobjectionable. However, according to another principle, the sum of any two individuals is
always also an individual. This seems to force us to accept as individuals, i.e., “wholes”,
sums of randomly scattered parts such as “Caesar's nose and the state of Utah”
[Goodman 1972, p. 37]. Goodman bites that bullet, while much of the ensuing debate has been
concerned with attempts to find ways of distinguishing such scattered and arbitrary sums from
more “cohesive” or “integral” individuals as wholes consisting of parts
in a more intuitively satisfactory sense.
A formal mereological theory takes conventional first-order predicate logic as its basis.
We will use conventional modern logical notation for quantifiers, operators, predicates,
variables and constants. More specifically, we will use (x) for universal and
(∃x) for existential quantification over x; ¬ for negation, →
for implication, ∨ for inclusive disjunction, ∧ for conjunction, ⇔ for
equivalence, and = for identity. We use the small roman letters a, b, c... for constants, x,
y, z... for variables, and upper roman letters A, B, C... for predicates. We will occasionally
use the conventional abbreviation “iff” for “if and only if”.
The extension which mereology makes to this basis is very modest: In fact the extension
consists in adding only one single primitive relation to the first-order system. This
“specifically mereological” primitive relation may be chosen from among the
relations “part of”, “proper part of”, “discrete from” or “overlapping with”. As each of
these relations may be defined in terms of any of the others, it does not matter much which
one we choose as our undefined primitive. With a hopefully obvious appeal to markup
theorists, we will follow [Goodman 1977] in choosing “overlap” for our primitive relation.
Variables are taken to range over individuals only, and predicates are taken to ascribe
properties to, or relations between, individuals.
From a mereological point of view, two individuals overlap iff they
have some content in common. One consequence of this definition may briefly confuse
markup
specialists: since in an XML document a child element and its parent element have
some content
in common (everything contained by the child is also contained by the parent), it
follows that
in the sense introduced here the child and the parent overlap. That is,
the term overlap, as used in the calculus of individuals, includes proper
nesting or normal part/whole relations.
Thus, if we think of XML elements as individuals consisting of stretches of consecutive
character occurrences, and if we consider the following four cases (strictly speaking,
the
first line is not well formed XML and is included only for purposes of illustration):
<s> <q> </s> </q>
<s> <q> </q> </s>
<q> <s> </s> </q>
<s> </s> <q> </q>
the first three cases exhibit an overlap between elements s and q.
Only in the last case do the two elements not overlap, i.e., they are discrete. In
contrast,
markup theorists would probably consider only the first case to be one of overlap.
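The four cases can be checked mechanically. The following Python sketch is an illustration only: it assumes that an element is modelled as the set of character offsets running from its start-tag through its end-tag in the serialization, and the helper names `extent` and `overlaps` are hypothetical.

```python
def extent(serialization, name):
    """Offsets of all characters from <name> through </name>, as a frozenset."""
    start = serialization.index(f"<{name}>")
    end_tag = f"</{name}>"
    end = serialization.index(end_tag) + len(end_tag)
    return frozenset(range(start, end))

def overlaps(x, y):
    """Mereological overlap: x and y have some content in common."""
    return bool(x & y)

cases = [
    "<s> <q> </s> </q>",   # not well-formed XML, illustration only
    "<s> <q> </q> </s>",
    "<q> <s> </s> </q>",
    "<s> </s> <q> </q>",
]
results = [overlaps(extent(c, "s"), extent(c, "q")) for c in cases]
print(results)
```

Under this assumption only the last case yields no overlap, in line with the reading given above.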
The overlap operator is written ∘. The following
condition on ∘ captures the intuitive notion of “having some content in
common,” and we thus take it as an axiom:
2.41 x ∘ y ⇔ (∃z)(w)((w ∘ z) → ((w ∘ x) ∧ (w ∘ y)))
Any relation satisfying this condition is necessarily reflexive and symmetric (but
not
necessarily transitive).
We now state further relation and operator definitions, theorems and axioms. Note
that not
all of them belong to all variants of mereological systems; they do, however, belong
to ours.
As already mentioned, the relations part of, proper part, and discrete
may all be defined in terms of the overlap relation.
Iff x is a part of y, then everything that overlaps x also overlaps
y:
D2.042 x < y =df (z)((z ∘ x) → (z ∘ y))
The part relation is reflexive, anti-symmetric and transitive.
Iff x is a proper part of y, then x is a part of y but y is not a
part of x:
D2.043 x ≪ y =df (x < y) ∧ ¬(y < x)
The proper part relation is irreflexive, asymmetric and transitive.
Iff x and y are discrete, then they have no part in common, i.e.,
they do not overlap:
D2.041 x ʅ y =df ¬(x ∘ y)
The discrete relation is irreflexive and symmetric (and thus, non-transitive).
It is worth noting that identity can be defined in terms of the
primitive relation:
D2.044 x = y =df (z)((z ∘ x) ⇔ (z ∘ y))
The product of x and y is the individual which exactly contains their
common part:
D2.045 x · y =df (℩z)(w)((w < z) ⇔ ((w < x) ∧ (w < y)))
The sum of x and y is the individual which contains exactly and
exhaustively both of them, or, in other words, the individual which overlaps all and
only
those individuals which overlap any of them:
D2.047 x + y =df (℩z)(w)((w ∘ z) ⇔ ((w ∘ x) ∨ (w ∘ y)))
The negate of an individual includes everything which does not
overlap with that individual (i.e., what is often called its “complement”, or
“the rest of the world”):
D2.046 –x =df (℩z)(y)((y ʅ x) ⇔ (y < z))
The difference between x and y is what remains of x after we
eliminate the parts it has in common with y:
x – y =df (x · –y)
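To make these definitions concrete, here is a minimal Python sketch of a finite atomistic model. It rests on assumptions that are not part of the calculus itself: individuals are modelled as non-empty sets of atoms, the three-atom universe is hypothetical, and product, sum, negate and difference are computed with set operations as a shortcut for the definite descriptions above. `None` stands for "no such individual".

```python
from itertools import combinations

ATOMS = ("a", "b", "c")      # hypothetical three-atom universe
W = frozenset(ATOMS)         # the universal individual

# Every non-empty set of atoms is an individual (there is no nil individual).
INDIVIDUALS = [frozenset(c) for n in range(1, len(ATOMS) + 1)
               for c in combinations(ATOMS, n)]

def overlap(x, y):           # primitive relation: some content in common
    return bool(x & y)

def part(x, y):              # D2.042: whatever overlaps x also overlaps y
    return all(overlap(z, y) for z in INDIVIDUALS if overlap(z, x))

def proper_part(x, y):       # D2.043
    return part(x, y) and not part(y, x)

def discrete(x, y):          # D2.041
    return not overlap(x, y)

def identical(x, y):         # D2.044
    return all(overlap(z, x) == overlap(z, y) for z in INDIVIDUALS)

def individual_or_none(atoms):
    """Return the individual, or None when the result would be nil."""
    return frozenset(atoms) if atoms else None

def product(x, y):           # D2.045
    return individual_or_none(x & y)

def msum(x, y):              # D2.047 (named msum to avoid the built-in sum)
    return x | y

def negate(x):               # D2.046
    return individual_or_none(W - x)

def difference(x, y):        # x - y, i.e. the product of x and the negate of y
    return individual_or_none(x - y)

ab, bc = frozenset("ab"), frozenset("bc")
print(sorted(product(ab, bc)), sorted(msum(ab, bc)), negate(W))
```

In this model the overlap-based definition of part coincides with the subset relation, and the overlap-based definition of identity coincides with set equality, which is one way of checking that the definitions behave as intended.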
There is considerable controversy in the literature over the nil
individual. The nil individual is the mereological analogue
of the empty class. If accepted, it is part of any individual. Most mereological systems
reject its existence, and we will do the same in this paper.
There is less controversy over the existence of the universal
individual, i.e., the one individual of which every other is a part: “the
world” or “the universe” as an individual. In our case, we are
not applying the Calculus of Individuals as a “Grand Theory of Everything”, but
limit its application to domains consisting of a single document, to collections (not to say
sets or classes) of documents, or perhaps to documents and whatever else we may need to take
into consideration to make sense of what these documents say. So we, too, will endorse the
existence of a universal individual, customarily denoted by the letter W:
W =df (℩x)(y)(y < x)
Note that, because there is no nil individual:
- the product of x and y can possibly exist only if x and y overlap,
- the difference between x and y can possibly exist only if x is not a part of y, and
- W (the universe) does not have a negate.
However, the following statements hold, either as axioms or theorems, depending on how one
elaborates the system:
- (x)(y)(∃z)(z = x + y), i.e., the sum of any two individuals exists (that is, is an individual),
- (x)(y)((x ∘ y) ⇔ (∃z)(z = x · y)), i.e., the product of any two individuals exists iff they overlap,
- (x)(¬(x = W) ⇔ (∃z)(z = –x)), i.e., the negate of an individual exists iff the individual is not the universe, and
- (x)(y)(¬(x < y) ⇔ (∃z)(z = x – y)), i.e., the difference between any individual x and any individual y exists iff x is not a part of y.
Do all individuals have parts, or are there some individuals which are not further
divisible into parts? Whether we take the one or the other position may have wide-reaching
consequences for other properties of a mereological system, and the literature abounds
with
discussion on the subject. Given our domain of application, however, we believe that
any
system will have to be atomistic — on none of our analyses will
documents have parts below character-level, or at least we foresee no need to talk
about parts
of characters.
So we may simply add the axiom of atomicity to our system right away:
(x)(∃y)((y < x) ∧ ¬(∃z)(z ≪ y)) [Casati and Varzi 1999, p. 61]
The Calculus applied to XML
What might it mean to apply the Calculus of Individuals to XML documents (or, for short,
to “XML”) and what purpose might such an application of the calculus serve? A
preliminary answer to the first question is that an application of the Calculus of Individuals
to XML would require us to decide which entities to count as individuals, to decide which of
these are to count as atomic individuals, as well as which properties they can have and which
relations hold between them. Given the Calculus of Individuals' rules of composition,
different decisions on these issues will bring us to recognize the existence of individuals
which may or may not coincide with established ways of viewing the structure of XML
documents.
Identifying rules which replicate such conventional views is, if possible, in itself of
interest. Identifying rules which provide alternative views of XML documents may be of even
greater interest, at least if they also suggest alternative and useful ways of analysing the
parts of a document, of addressing them, and of ascribing properties to and relations
between parts of a document.
A preliminary answer to the second question has thus already been suggested: We suspect
that an application of the Calculus of Individuals to XML might suggest ways of identifying
and addressing parts of a document which in some cases, or for some purposes, would be more
convenient or more powerful than existing methods such as SAX, DOM or XPath. We also suspect
that some application of the Calculus of Individuals to XML might suggest ways of dealing with
what is sometimes called the “semantics” of XML, i.e., how to understand XML
documents in terms of properties ascribed to and relations indicated between the various parts
of them indicated by the markup.
In what follows we have nothing but tentative answers to the general questions just
posed. Trying to answer the first question, we will present different ways of applying
the
Calculus of Individuals to XML. We will also explore some of their implications for
answers to
the second question. The explorative nature of our work should be emphasized: We do
not want
to suggest that these are the only, or the best, ways of applying the Calculus of
Individuals
to XML, nor do we suggest that we have identified all or even the most important implications
of the approaches that we consider.
Therefore, each of the following sections begins by suggesting a different answer to the
question “Which are the individuals of a marked-up document?” First, we consider
the possibility that the individuals simply are XML elements. Next, we go down one step in
level of granularity and identify tags and character strings as individuals. Finally, we
proceed to a still finer level of granularity in order to see what happens if we recognize
individual characters as atomic individuals, and distinguish between different kinds of
individuals built from these atoms.
The element-as-individual approach
What to count as individuals is a matter of choice, a choice which must be made on
the
basis of such criteria as naturalness, convenience, expressiveness, simplicity, etc.
We
begin by simply assuming a one-to-one matching between the elements of
an XML document and the individuals of our calculus. On this assumption, consider
the
following simple XML document:
(1) <para>A <quote>rose</quote> is <emph>a</emph> rose.</para>
If each element is an individual, then (1) itself, as well as the elements
(2) <quote>rose</quote>
(3) <emph>a</emph>
are individuals. Now, the sum of any two individuals must (by our mereological axioms)
be an
individual. Thus, the sum of (2) and (3) must be an individual and, by our hypothesis,
an
XML element. No matter what model we have in mind for XML elements and documents,
it is hard
to imagine a way in which the sum of (2) and (3) could be an XML element — it
would be at best two!
In fact, the goal we have set ourselves here turns out to be self-defeating: It is
not
possible to identify XML elements with individuals, without accepting as individuals
parts
of the document which are not XML elements. In other words, if all XML elements are
individuals, then some XML documents necessarily give rise to individuals which are
not XML elements.
An obvious fix would be to retain the decision that every element is an individual, but
allow for composite individuals having more than one element as their parts. This would
solve the problem of sums, but others would remain (e.g., what elements can the difference
(1) – (2) be the sum of?). Even taking the closure of elements under sum and
difference would still not solve a granularity issue in handling text content: Take, for
example, the strings “A”, “is”, and “rose.”; any given individual would contain either all
three or none. There would be no way to separate those strings.
Another issue is that the definition of parthood implies nothing about the ordering
of
parts, resulting in the fact that individuals are
unordered. Thus, there is no way in our approach to say, for example, that (2)
occurs before (3).
The Calculus of Individuals offers in itself no way of defining ordered pairs — and thus, relations — as individuals. However, relations
can be represented by predicates on individuals. Thus, we
can order (either totally or partially) our individuals by defining an appropriate
binary
predicate corresponding to the desired relation.
If we think of individuals as corresponding to objects in an XML data model, and if
that
model allows serializations in which no two distinct elements or characters start
at the
same offset in a serialization (we will need to deal with characters in later sections), then we can induce a
total ordering of the individuals that correspond to elements and characters, based
on the
total order among the offsets of their XML counterparts in the serialization. We call
that
order relation document order.
Throughout this paper, we assume that document order exists and is well
defined.
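As an illustration of how such an order can be induced, the following Python sketch derives document order for the elements of (1) from the offsets of their start-tags. It is a sketch under stated assumptions: start-tags carry no attributes (consistent with the simplifications adopted in this paper), and the regular expression and the names `start_offsets` and `precedes` are hypothetical.

```python
import re

DOC = "<para>A <quote>rose</quote> is <emph>a</emph> rose.</para>"

# The start offset of an element is the offset of its start-tag in the
# serialization. End-tags begin with "</" and are therefore not matched.
start_offsets = [(m.start(), m.group(1))
                 for m in re.finditer(r"<([A-Za-z][\w.-]*)>", DOC)]

def precedes(x, y):
    """Document order: x precedes y iff x starts at a smaller offset."""
    return x[0] < y[0]

document_order = [name for _, name in sorted(start_offsets)]
print(document_order)  # ['para', 'quote', 'emph']
```

Because no two elements start at the same offset, comparing offsets gives a total order, so that (2) can be said to occur before (3).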
So far we have assumed that XML elements containing no sub-elements have no parts,
i.e.,
that they are atoms in our system. A solution may perhaps be to recognize a more generous
set of individuals. But before we proceed to investigate this, we pause to make a
couple of
observations on other characteristics of the element-as-individual approach.
- The lack of a fine enough granularity prevents a satisfactory treatment of strings,
  let alone parts of strings.
  However, we could regard a string as a property of an individual. Thus, although we
  cannot strictly speaking say that in (1) the string “rose” is a part of the
  string “A rose is a rose.”, we could say that an individual having the
  string “rose” as a property is part of an individual having the string
  “A rose is a rose.” as a property. Note that the strings “rose is” or “ose i” would not
  be properties of any individual, and thus not a “part” of the document even in this
  extended sense.
- Building a tree structure in which each node is an individual (i.e., an element), in
  which each arc represents a whole-part relationship, and in which the children of each
  node are ordered in document order, produces a tree which is almost identical to the XML
  tree for the same document, except for PCDATA leaf nodes of mixed content elements,
  which would be lost. (However, empty element leaf nodes would appear in the tree.)
The tags and PCDATA approach
Moving one step down in level of granularity, we might take tags and PCDATA
strings delimited by tags as atomic individuals. Thus (1) would contain the
following 11 atomic individuals:
<para>
A
<quote>
rose
</quote>
is
<emph>
a
</emph>
rose.
</para>
From these, we might compose composite individuals such as, for example:
<para>
<para>A
<para>A <quote>
<para>A <quote>rose
A rose
A rose.
rose a
<para>A <quote>
A <quote>rose </quote> is <emph>
rose </quote> rose.</para>
As a matter of fact, (1) would give rise to no fewer than 2^11 - 1 = 2047
individuals on this account (minus one because there is no
nil individual); in the interest of the reader we do not list all of
them here. Only a handful of these individuals would be well-balanced XML fragments, of
course.
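The count can be reproduced with a short Python sketch. It assumes a simple regular-expression tokenizer, which is adequate for (1) but is not a general XML parser; the variable names are hypothetical.

```python
import re

DOC = "<para>A <quote>rose</quote> is <emph>a</emph> rose.</para>"

# Atomic individuals: each tag, and each PCDATA string delimited by tags.
atoms = re.findall(r"<[^>]+>|[^<]+", DOC)

# Every non-empty selection of atoms is an individual (no nil individual).
individual_count = 2 ** len(atoms) - 1

print(len(atoms), individual_count)  # 11 2047
```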
A total order relation on the atomic individuals based on document order could be
defined, as in the preceding section. Note that in this case, the sequence of ordered
atomic
individuals is isomorphic to the sequence of events identified by a SAX-like XML
tokenizer.
Observe that although many of the individuals
could be identified or referenced using XPath or similar XML-aware mechanisms, many of them
could not. In particular, tag atoms could not (or, at least, it is unclear how and in what
sense they could). However, the interest of being able to refer to tags individually is not
obvious. Also, since strings are atoms, it is still impossible to handle parts of strings:
“ose i” is still not an individual. Therefore, we do not pursue this avenue any
further.
The character-atom approach
The approach
Finally, and moving one further step down in the level of granularity, we take
character occurrences as the atomic individuals in our application
of the calculus. For the sake of conciseness, we will use “character” as a synonym for
“character occurrence”, except where confusion might arise.
The type of a character occurrence is represented in
our system by a property of that character occurrence. So any atom (i.e., character
occurrence) has the property of being an “a”, or a “b”, or a “c”, etc., thus populating our
vocabulary with one predicate for each of
the characters of the writing system at hand.
We define a total order relation on atoms, based on document order, represented by the
predicate PA(x, y), true iff x precedes y in
document order (“P” stands for “precedes” and “A” indicates it is a predicate on atoms).
The transitive reduction of PA is represented by the predicate NA(x, y), true iff x
immediately precedes y in document order
(“N” stands for “next” and “A” indicates it is a predicate on atoms).
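The relationship between the two predicates can be sketched in Python. The sketch assumes that atoms are identified by their position in the character content of (1) with the tags removed; representing PA and NA as sets of pairs is an illustrative choice, not something the calculus prescribes.

```python
DOC_TEXT = "A rose is a rose."   # character content of (1), tags removed

# Atoms are character occurrences, identified here by their position.
atoms = list(range(len(DOC_TEXT)))

# PA(x, y): x precedes y in document order.
PA = {(x, y) for x in atoms for y in atoms if x < y}

# NA is the transitive reduction of PA: keep (x, y) only if no atom z
# lies strictly between x and y.
NA = {(x, y) for (x, y) in PA
      if not any((x, z) in PA and (z, y) in PA for z in atoms)}

print(len(atoms), len(PA), len(NA))  # 17 136 16
```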
Since characters are atomic individuals, all sums which can be composed on the
basis of the characters of a document are also individuals, i.e., composite individuals.
Composite individuals of special interest for our purposes are
strings. We define strings as individuals which are either atoms, or
the sum of atoms consecutive in NA order. A string that consists of only one
character is (also) an atom. There is no such thing as an empty string
(which would have to be the nil individual). Note that
strings constitute a tiny fraction of all existing individuals.
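How tiny that fraction is can be seen from a small count. The Python sketch below assumes the 17 character occurrences of example (1) and identifies each string by the positions of its first and last atom, which is an illustrative encoding only.

```python
n = 17  # number of atoms (character occurrences) in example (1)

# A string is a non-empty run of NA-consecutive atoms: pick a first and a last atom.
strings = [(first, last) for first in range(n) for last in range(first, n)]

# Every non-empty set of atoms is an individual (there is no nil individual).
individuals = 2 ** n - 1

print(len(strings), individuals)  # 153 131071
```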
Some strings are of particular interest to us. We define a molecular
string (or molecule) as a string that is
delimited on both sides (in the serialization underlying document order) by a tag, with no
other tag intervening in between. A total ordering of molecular strings, represented by
the predicate P(x, y), is trivially derived from the ordering of atoms
(itself based on document order). The transitive reduction of P is
represented by the predicate N(x, y). (“P” stands for “precedes” and “N” for “next”.)
We define an elemental string as a string delimited by the
matching tags of an XML element (there may be intervening tags). We do not rely on
any
ordering of elemental strings.
For any given string x, we define (for convenience only) the
label of x as the sequence of the types of the atoms
composing x, in NA order. That is, for example, a string is
labelled “rose” (or has the label “rose”) iff it is the sum of
atoms of types “r”, “o”, “s”, and “e”,
and those atoms are NA-ordered so that the one of type “r” comes
first, the one of type “o” comes second, etc.
While it might have been plausible to treat tags as a special kind of strings, and
build elements and nodes with their ordering and parent-child relationship in a way
similar to that suggested in the tags and PCDATA approach above, instead, we shall
regard
tags simply as delimiting certain string individuals, and ascribing properties to
(or
relations between) those individuals.
We can now read (1) as follows:
- There are 17 atomic individuals. Their ordered sequence of types is:
  “A”, “ ”, “r”, “o”, “s”, “e”, “ ”, “i”, “s”, “ ”, “a”, “ ”, “r”, “o”, “s”, “e”, and “.”.
- There are five molecular string individuals. Their ordered sequence of labels
  is: “A ”, “rose”, “ is ”, “a”, and “ rose.”.
- There are three elemental string individuals, labelled “A rose is a rose.”, “rose” and “a”.
- The elemental string labelled “A rose is a rose.” has the property
  indicated by the generic identifier <para>.
- Note that this does not imply that any of its parts, such as the molecular
  strings labelled “A ”, “rose”, etc., has this property.
- The elemental string labelled “rose” has the property indicated by
  the generic identifier <quote>.
- The elemental string labelled “a” has the property indicated by the
  generic identifier <emph>.
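This reading can be reproduced with a small Python sketch. It is an illustration under assumptions: the regular-expression tokenizer handles (1) but is not a general XML parser, atoms are identified by their position in document order, and the names `atoms`, `molecules`, `elementals` and `label` are hypothetical.

```python
import re

DOC = "<para>A <quote>rose</quote> is <emph>a</emph> rose.</para>"

atoms = []          # (index, type) for each character occurrence
molecules = []      # lists of atom indices delimited by tags on both sides
elementals = []     # (generic identifier, list of atom indices)
open_elements = []  # stack of (generic identifier, index of first atom)

for token in re.findall(r"<[^>]+>|[^<]+", DOC):
    if token.startswith("</"):
        # End-tag: the elemental string runs from the first atom after the
        # matching start-tag up to the last atom seen so far.
        name, first = open_elements.pop()
        elementals.append((name, list(range(first, len(atoms)))))
    elif token.startswith("<"):
        open_elements.append((token[1:-1], len(atoms)))
    else:
        # Character data between two tags: one molecular string.
        start = len(atoms)
        atoms.extend((start + i, ch) for i, ch in enumerate(token))
        molecules.append(list(range(start, len(atoms))))

def label(indices):
    """Sequence of the types of the given atoms, in NA order."""
    return "".join(atoms[i][1] for i in indices)

print(len(atoms))
print([label(m) for m in molecules])
print([(name, label(e)) for name, e in elementals])
```

The sketch lists elemental strings in the order in which their end-tags occur; this ordering is incidental, since no ordering of elemental strings is relied upon here.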
We introduce the following predicates:
Table I
Predicate | Meaning | Range of x and y
NA(x,y) | next after x is y (or, x immediately precedes y) | atoms
PA(x,y) | x precedes y | atoms
N(x,y) | next after x is y (or, x immediately precedes y) | molecules
P(x,y) | x precedes y | molecules
A(x) | x is atomic | any
M(x) | x is molecular | any
E(x) | x is elemental | any
ccc(x) | x has the property assigned by ccc (where ccc is an XML generic identifier) | any
T("c",x) | x is of type c (where c is a character type) | atoms
L("ccc",x) | x is labelled ccc (where ccc is a sequence of character types) | any
The last two predicates (T and L) are to be regarded as notational convenience features. We are ignoring potential problems of name conflicts in this presentation
(which would arise e.g. in the case of a document containing XML generic identifiers
A, M or E).
Examples
We assign the identifiers i01, i02, i03, etc. to individuals of (1) and state some facts about them as follows:
Table II
T("A",i01) | A(i01) | NA(i01,i02)
T(" ",i02) | A(i02) | NA(i02,i03)
| |
T("r",i03) | A(i03) | NA(i03,i04)
T("o",i04) | A(i04) | NA(i04,i05)
T("s",i05) | A(i05) | NA(i05,i06)
T("e",i06) | A(i06) | NA(i06,i07)
| |
T(" ",i07) | A(i07) | NA(i07,i08)
T("i",i08) | A(i08) | NA(i08,i09)
T("s",i09) | A(i09) | NA(i09,i10)
T(" ",i10) | A(i10) | NA(i10,i11)
| |
T("a",i11) | A(i11) | NA(i11,i12)
| |
T(" ",i12) | A(i12) | NA(i12,i13)
T("r",i13) | A(i13) | NA(i13,i14)
T("o",i14) | A(i14) | NA(i14,i15)
T("s",i15) | A(i15) | NA(i15,i16)
T("e",i16) | A(i16) | NA(i16,i17)
T(".",i17) | A(i17) |
| |
i18=i01+i02 | M(i18) | N(i18,i19)
i19=i03+i04+i05+i06 | M(i19) | N(i19,i20)
i20=i07+i08+i09+i10 | M(i20) | N(i20,i11)
 | M(i11) | N(i11,i21)
i21=i12+i13+i14+i15+i16+i17 | M(i21) |
i22=i18+i19+i20+i11+i21 | |
| |
L("A ",i18) | |
L("rose",i19) | E(i19) | quote(i19)
L(" is ",i20) | |
T("a",i11) | E(i11) | emph(i11)
L(" rose.",i21) | |
L("A rose is a rose.",i22) | E(i22) | para(i22)
The same information may be presented more conspicuously in the following table,
listing for each individual its identifier, its type, its label, the kind of individual
it
is (A for atoms, M for molecular and E for elemental strings), its assigned properties
(i.e., properties assigned by an XML generic identifier), its next atom or molecular
string and its immediate proper parts.
Table III
Id | Type | Label | Kind | Assigned property | Next atom | Next molecule | Immediate parts |
i01 | "A" | | A | | i02 | | |
i02 | " " | | A | | i03 | | |
i03 | "r" | | A | | i04 | | |
i04 | "o" | | A | | i05 | | |
i05 | "s" | | A | | i06 | | |
i06 | "e" | | A | | i07 | | |
i07 | " " | | A | | i08 | | |
i08 | "i" | | A | | i09 | | |
i09 | "s" | | A | | i10 | | |
i10 | " " | | A | | i11 | | |
i11 | "a" | "a" | A M E | emph | i12 | i21 | |
i12 | " " | | A | | i13 | | |
i13 | "r" | | A | | i14 | | |
i14 | "o" | | A | | i15 | | |
i15 | "s" | | A | | i16 | | |
i16 | "e" | | A | | i17 | | |
i17 | "." | | A | | | | |
i18 | | "A " | M | | | i19 | i01, i02 |
i19 | | "rose" | M E | quote | | i20 | i03, i04, i05, i06 |
i20 | | " is " | M | | | i11 | i07, i08, i09, i10 |
i21 | | "rose." | M | | | | i12, i13, i14, i15, i16, i17 |
i22 | | "A rose is a rose." | E | para | | | i18, i19, i20, i11, i21 |
The elemental strings i22, i19 and i11 correspond to the XML elements (1)-(3) in a
fairly straightforward way, and can now be identified for example as follows:
i22 = (℩x)(para(x) ∧ E(x))
i19 = (℩x)(quote(x) ∧ E(x))
i11 = (℩x)(emph(x) ∧ E(x))
The non-elemental molecules i18, i20 and i21 can be identified for example as follows:
i18 = (℩x)(∃y)(quote(y) ∧ N(x,y))
i20 = (℩x)(∃y)(emph(y) ∧ N(x,y))
i21 = (℩x)(M(x) ∧ ¬(∃y)N(x,y))
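The identification of individuals above can be mimicked mechanically. The following Python sketch is an illustration only and not part of the Calculus. It assumes that example (1) is the fragment shown in `DOC` (reconstructed here from Table III), models every individual as the set of atom positions it covers, and uses hypothetical helper names (`individuals`, `iota`, `label`).

```python
import xml.etree.ElementTree as ET

# Assumption: example (1), reconstructed from Table III.
DOC = "<para>A <quote>rose</quote> is <emph>a</emph> rose.</para>"

def individuals(xml_text):
    """Return (atoms, molecules, elementals).

    atoms:      list of characters; atom number k sits at position k-1
    molecules:  list of frozensets of positions (maximal tag-free strings)
    elementals: dict mapping a frozenset of positions to its generic identifiers
    """
    atoms, molecules, elementals = [], [], {}

    def add_text(text):
        if text:
            start = len(atoms)
            atoms.extend(text)
            molecules.append(frozenset(range(start, len(atoms))))

    def walk(elem):
        start = len(atoms)
        add_text(elem.text)
        for child in elem:
            walk(child)
            add_text(child.tail)
        span = frozenset(range(start, len(atoms)))
        elementals.setdefault(span, set()).add(elem.tag)

    walk(ET.fromstring(xml_text))
    return atoms, molecules, elementals

def iota(candidates, predicate):
    """Definite description: the unique candidate satisfying the predicate."""
    hits = [c for c in candidates if predicate(c)]
    if len(hits) != 1:
        raise ValueError(f"description denotes {len(hits)} individuals")
    return hits[0]

atoms, molecules, elementals = individuals(DOC)
label = lambda ind: "".join(atoms[i] for i in sorted(ind))

# i19 = (℩x)(quote(x) ∧ E(x))
i19 = iota(elementals, lambda x: "quote" in elementals[x])
# i21 = (℩x)(M(x) ∧ ¬(∃y)N(x,y)): the molecule that contains the last atom
i21 = iota(molecules, lambda x: max(x) == len(atoms) - 1)
```

In this sketch `iota` raises an error when a description is not unique, which is one possible design choice for making ambiguous denoting expressions visible.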
Although in this particular case the denoting expressions identifying individuals
are
fairly simple, identifying individuals by means of denoting expressions may in general
become rather tedious. For example, in any document with more than one individual
assigned
the property quote, the denoting expression identifying individual i19 above would
return
the sum of all those individuals.
So although we have shown that all atoms, molecular and elemental strings
of (1) can be identified by our relatively straightforward
application of the Calculus, some of the above examples draw on the simplicity of
the
example and are rather ad hoc. Therefore, before we proceed to discuss how the Calculus
can be used to make statements and make inferences about a document, we introduce
a
slightly more complicated (and also more realistic) example.
Consider the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
A rule:
<list>
<item>First:</item>
<item>
<list>
<item>think,</item>
<item>decide.</item>
</list>
</item>
<item>Then:</item>
<item>
<list>
<item>act,</item>
<item>regret.</item>
</list>
</item>
</list>
</doc>
Once again we provide identifiers for individuals of the document and present their
properties and relations in tabular form, but this time we include only the molecular
and
elemental individuals:
Table IV
Id | Label | Kind | Assigned property | Next molecule | Immediate parts |
i01 | A rule: | M | | i02 | |
i02 | First: | M E | item | i03 | |
i03 | think, | M E | item | i04 | |
i04 | decide. | M E | item | i05 | |
i05 | Then: | M E | item | i06 | |
i06 | act, | M E | item | i07 | |
i07 | regret. | M E | item | | |
i08 | | E | list, item | | i03, i04 |
i09 | | E | list, item | | i06, i07 |
i10 | | E | list | | i02, i08, i05, i09 |
i11 | | E | doc | | i01, i10 |
Note that the individuals i08 and i09 are each represented as one individual with
two
assigned properties, rather than as two individuals each with one property. The difference
between this representation and the conventional XML representation can be illustrated
by
juxtaposing a conventional XML tree of the document (to the left) and what we might
call a
mereological graph (to the right):
Because of our decision not to count tags as part of the document, all coextensive
XML
elements will be represented as one elemental individual. The nesting order of these
elements in the XML document will not be preserved in this representation.
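The collapse of coextensive elements can be observed directly. The Python sketch below is an illustration under stated assumptions: whitespace-only text between tags is ignored, leading and trailing whitespace of text nodes is stripped, and the function name `elemental_individuals` is hypothetical. It maps every character span to the set of generic identifiers assigned to it.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>A rule:<list><item>First:</item><item><list><item>think,</item>
<item>decide.</item></list></item><item>Then:</item><item><list>
<item>act,</item><item>regret.</item></list></item></list></doc>"""

def elemental_individuals(xml_text):
    """Map each character span (start, end) to the generic identifiers on it."""
    pos = 0
    props = {}

    def walk(elem):
        nonlocal pos
        start = pos
        pos += len((elem.text or "").strip())      # assumption: ignore layout whitespace
        for child in elem:
            walk(child)
            pos += len((child.tail or "").strip())
        # Elements with the same span end up under the same key.
        props.setdefault((start, pos), set()).add(elem.tag)

    walk(ET.fromstring(xml_text))
    return props

props = elemental_individuals(DOC)
merged = {span: tags for span, tags in props.items() if len(tags) > 1}
```

The document contains twelve elements, but `props` holds only ten individuals; the two entries in `merged` correspond to i08 and i09, each carrying both `list` and `item`. The nesting order of the merged elements is not recorded.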
As before, we can use denoting expressions to refer to any part of the document, for
example:
i01 = (℩x)¬(∃y)N(y,x)
i02 = (℩x)(item(x) ∧ ¬(∃y)(item(y) ∧ P(y,x)))
i03 = (℩x)(∃y)(∃z)(w)(v)
((x ≪ y) ∧ list(y) ∧
(y ≪ z) ∧ list(z) ∧
(N(w,x) → ¬(w ≪ y)) ∧
(N(v,w) → ¬(v ≪ z)))
i09 = (℩x)(∃y)(∃z)
(list(x) ∧ (x ≪ y) ∧ list(y) ∧
list(z) ∧ (z ≪ y) ∧ ¬(x = z) ∧ P(z,x))
Statements and inferences
We can also use the Calculus to make statements about the document —
unquantified, such as (1)–(4), or quantified, such as (5)–(8):
(1) list(i09)
(2) item(i09)
(3) i07 ≪ i09
(4) i09 ≪ i10
(5) (x)(y)((list(x) ∧ item(x) ∧ (y ≪ x)) → item(y))
(6) (x)(y)((list(x) ∧ item(x) ∧ (x ≪ y)) → (list(y) ∨ doc(y)))
(7) (x)(item(x) → (∃y)((x ≪ y) ∧ list(y)))
(8) (x)(item(x) → (∃y)(∃z)
(item(y) ∧ list(z) ∧ (x ≪ z) ∧ (y ≪ z) ∧ ¬(x = y)))
In order to avoid unnecessary misunderstanding, it should be pointed out that
(1)–(8) are descriptive statements about this particular document. (In other
contexts, for example where we wanted to express general constraints on
document structure, we might of course also want to state facts about document
types, but that is not our issue here.)
From the statements we can make inferences, such as for example:
(9) item(i07)
[From (1), (2), (3) and (5).]
(10) list(i10) ∨ doc(i10)
[From (1), (2), (4) and (6).]
(11) (∃y)((i09 ≪ y) ∧ list(y))
[From (2) and (7).]
(12) (∃y)(∃z)(item(y) ∧ list(z) ∧ (i07 ≪ z) ∧ (y ≪ z) ∧ ¬(i07 = y))
[From (8) and (9).]
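Statements (5)–(8) can be checked against the individuals of Table IV by brute force. The following Python sketch is an illustration only. It assumes that the quantifiers range over the eleven individuals of Table IV alone (no atoms, no arbitrary sums), models each individual as the set of molecules it contains, and uses hypothetical names (`PARTS`, `PROPS`, `has`, `pp`).

```python
# Individuals of Table IV, each modelled as the set of molecules it contains.
PARTS = {
    "i01": {"i01"}, "i02": {"i02"}, "i03": {"i03"}, "i04": {"i04"},
    "i05": {"i05"}, "i06": {"i06"}, "i07": {"i07"},
    "i08": {"i03", "i04"}, "i09": {"i06", "i07"},
    "i10": {"i02", "i03", "i04", "i05", "i06", "i07"},
    "i11": {"i01", "i02", "i03", "i04", "i05", "i06", "i07"},
}
PROPS = {
    "item": {"i02", "i03", "i04", "i05", "i06", "i07", "i08", "i09"},
    "list": {"i08", "i09", "i10"},
    "doc": {"i11"},
}
D = sorted(PARTS)  # assumption: the domain of quantification

def has(prop, x):
    return x in PROPS[prop]

def pp(x, y):
    """x ≪ y: x is a proper part of y."""
    return PARTS[x] < PARTS[y]

s5 = all(has("item", y) for x in D for y in D
         if has("list", x) and has("item", x) and pp(y, x))
s6 = all(has("list", y) or has("doc", y) for x in D for y in D
         if has("list", x) and has("item", x) and pp(x, y))
s7 = all(any(pp(x, y) and has("list", y) for y in D)
         for x in D if has("item", x))
s8 = all(any(has("item", y) and has("list", z) and pp(x, z) and pp(y, z) and x != y
             for y in D for z in D)
         for x in D if has("item", x))

# (9): instantiate (5) with x = i09 and y = i07; the antecedent holds,
# so item(i07) follows.
antecedent_9 = has("list", "i09") and has("item", "i09") and pp("i07", "i09")
```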
Conclusion
We have shown that strings composed of characters defined as atomic individuals can
be
identified and referenced by denoting expressions, that the Calculus can be used to
describe the part-whole relationships and ordering relations between parts of the
document
as well as the properties ascribed by generic identifiers. We have also shown that
this
application of the Calculus can be used for making statements about documents and
for
drawing inferences from these statements.
The approach chosen here has at least two obvious problems, or shortcomings; one
concerns the representation of coextensive elements, one relates to the representation
of
empty elements. Before we discuss these problems, however, we would like to assess
one of
its possible merits. In the next section, we will therefore sketch how this application
of
the Calculus can be used for the formulation of rules for propagation of properties
among
the parts of a document.
Property Propagation — a Sketch
We have assumed that the generic identifier of an element may be seen as assigning
a
property to the PCDATA content of that element, and not to any proper part of that
PCDATA
content. But sometimes, the meaning of the markup is such that that property is not
assigned
(or not only assigned) to the contents of the element itself, but also to all or some
of its
descendants, or to all or some of its ancestors, or to one or more of its siblings,
or to only
specific other elements. Furthermore, what is assigned to the element or elements
in question
may be not a monadic property, but a relation of them to other elements in the same
document,
or even to document elements or other entities outside that document. Thus, the propagation
of
properties ascribed by the generic identifier of an element may follow a large diversity
of
patterns.
Using examples from the TEI and HTML encoding schemes, we will show that some of these
patterns can conveniently be described by means of our application of the Calculus.
We will
first address some of the general distribution patterns identified by Nelson Goodman,
which
seem to represent important aspects of the intended semantics of certain TEI or HTML
element
types. We will then proceed to more complicated examples.
Dissective and anti-dissective properties
As mentioned, in our application of the Calculus so far we have assumed that the
property designated by the generic identifier of an XML element is assigned exclusively
to
the individual delimited by the start and end tags of the element, and not to its
parts.
This seems plausible enough for a number of element types, such as paragraphs, list
items
and titles. For example, a part of a paragraph, a list item or a title is not in general
itself a paragraph, a list item or a title.
TEI element types such as <hi> (highlighting) or <add> (added), however, do not seem to follow this rule. Every
part of a highlighted or added element is itself presumably highlighted or added.
Other
examples may be <del> (deleted) and <foreign>. The HTML element
type <i> (italics) may provide an even clearer example here — every
part of an italicized element is itself in italics.
According to Goodman, "a ... predicate is ... dissective if
it is satisfied by every part of every individual that satisfies it"
[Goodman 1972, p. 38]. A dissective one-place predicate is defined as
follows:
F is dissective iff (x)(y)((F(x) ∧ (y < x)) → F(y))
Consider the following document fragment:
<s>We
<add>, as all
<del>purely <hi>human</hi> and</del>
finite beings,
</add>
are all fallible.</s>
As earlier, we represent the properties of this fragment in tabular form. From now
on, however, instead of indicating "assigned properties" for each individual we
will list relevant statements (some of which may be inferences from statements about
the properties of other individuals):
Table V
Id | Label | Kind | Statements | Next | Parts |
i01 | We | M | | i02 | |
i02 | , as all | M | | i03 | |
i03 | purely | M | | i04 | |
i04 | human | M E | hi(i04) | i05 | |
i05 | and | M | | i06 | |
i06 | finite beings, | M | | i07 | |
i07 | are all fallible. | M | | | |
i08 | | E | del(i08) | | i03, i04, i05 |
i09 | | E | add(i09) | | i02, i08, i06 |
i10 | | E | s(i10) | | i01, i08, i09, i07 |
However, if we add the following statements
to the effect that the properties add, del and hi are dissective:
(x)(y)((add(x) ∧ (y < x)) → add(y))
(x)(y)((del(x) ∧ (y < x)) → del(y))
(x)(y)((hi(x) ∧ (y < x)) → hi(y))
— then, we can infer additional properties, with the following result:
Table VI
Id | Label | Kind | Statements | Next | Parts |
i01 | We | M | | i02 | |
i02 | , as all | M | add(i02) | i03 | |
i03 | purely | M | del(i03), add(i03) | i04 | |
i04 | human | M E | hi(i04), del(i04), add(i04) | i05 | |
i05 | and | M | del(i05), add(i05) | i06 | |
i06 | finite beings, | M | add(i06) | i07 | |
i07 | are all fallible. | M | | | |
i08 | | E | del(i08), add(i08) | | i03, i04, i05 |
i09 | | E | add(i09) | | i02, i08, i06 |
i10 | | E | s(i10) | | i01, i08, i09, i07 |
(Note that this is the first example so far of non-elemental individuals
carrying assigned properties.)
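Applying the three dissectiveness statements to the individuals of Table V can be sketched in a few lines of Python. This is an illustration only: individuals are modelled as sets of the molecules they cover, the part relation `<` is read as the subset relation, and the names `PARTS`, `STATED` and `propagate_down` are hypothetical.

```python
# Table V: each individual as the set of molecules it covers.
PARTS = {
    "i01": {"i01"}, "i02": {"i02"}, "i03": {"i03"}, "i04": {"i04"},
    "i05": {"i05"}, "i06": {"i06"}, "i07": {"i07"},
    "i08": {"i03", "i04", "i05"},                                # <del>
    "i09": {"i02", "i03", "i04", "i05", "i06"},                  # <add>
    "i10": {"i01", "i02", "i03", "i04", "i05", "i06", "i07"},    # <s>
}
# Properties stated directly by the markup.
STATED = {"i04": {"hi"}, "i08": {"del"}, "i09": {"add"}, "i10": {"s"}}
DISSECTIVE = {"add", "del", "hi"}

def propagate_down(parts, stated, dissective):
    """Give every part y of x (y < x) each dissective property stated of x."""
    inferred = {x: set(stated.get(x, ())) for x in parts}
    for x, props in stated.items():
        for y in parts:
            if parts[y] <= parts[x]:
                inferred[y] |= props & dissective
    return inferred

result = propagate_down(PARTS, STATED, DISSECTIVE)
```

A single pass suffices here because the part relation is transitive and only directly stated properties act as sources. The property `s` is not declared dissective and therefore stays on i10 alone.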
Goodman observes that "In practice, we are usually concerned only with
dissectiveness under some special or systematic limitations..."
[Goodman 1972, p. 38]. This seems to be the case here, too: While the
TEI elements <hi>, <add> and <del> and the HTML
element <i> seem to apply all the way down to every atomic part of an
individual, an element type like <foreign> hardly applies below word-level.
Furthermore, there seem to be exceptions even in the case of <hi>,
<add> and <del>: In a transcription, a <note>
(note) element is normally not intended to inherit the property in question. A more
generally usable formula for dissectiveness may therefore be this:
(x)(y)(z)((F(x) ∧ (y < x) ∧
¬((z < x) ∧ (y < z) ∧ (G(z) ∨ H(z) ∨ ...)))
→ F(y))
where G, H,... indicate exceptions.
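Limited dissectiveness can be sketched in the same style. The Python fragment below is an illustration under assumptions: it reads the exception clause as "there is no z such that z is part of x, y is part of z, and z carries one of the exception properties", and both the sample fragment and the names (`propagate_with_exceptions`, `m1`, `e1`, and so on) are hypothetical.

```python
def propagate_with_exceptions(parts, stated, prop, exceptions):
    """Return the individuals that receive `prop`, skipping shielded parts."""
    carriers = [x for x, props in stated.items() if prop in props]
    blockers = [z for z, props in stated.items() if props & exceptions]
    result = set()
    for x in carriers:
        for y in parts:
            if not parts[y] <= parts[x]:          # y must be part of x
                continue
            shielded = any(parts[z] <= parts[x] and parts[y] <= parts[z]
                           for z in blockers)
            if not shielded:
                result.add(y)
    return result

# Hypothetical fragment: <hi>stressed <note>editor's aside</note> words</hi>
PARTS = {"m1": {"m1"}, "m2": {"m2"}, "m3": {"m3"}, "e1": {"m1", "m2", "m3"}}
STATED = {"e1": {"hi"}, "m2": {"note"}}

highlighted = propagate_with_exceptions(PARTS, STATED, "hi", {"note"})
```

Here the note m2 is shielded, while m1, m3 and the highlighted individual e1 itself carry the property.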
Let us define an anti-dissective one-place predicate as follows:
F is anti-dissective iff (x)(y)((F(x) ∧ (y ≪ x)) → ¬F(y))
The TEI element <docDate> (document date) and the TEI and HTML
<body> may serve as examples of anti-dissective properties: no
proper part of a <docDate> or a <body> element is itself a
<docDate> or a <body>. The HTML <p> (paragraph)
element is also clearly anti-dissective.
The TEI <p> element presents a complication. It would seem to be
anti-dissective, but unlike HTML, TEI allows <p>s nested within
<p>s. So
(x)(y)((p(x) ∧ (y ≪ x)) → ¬p(y))
is true in HTML, but not in TEI. The TEI <p> element can therefore not be said
to be either dissective or anti-dissective.
Expansive and anti-expansive properties
A one-place predicate is expansive if it is satisfied by
everything that has a part satisfying it.
[Goodman 1972,
p. 38]. An expansive one-place predicate can be defined as follows:
F is expansive iff (x)(y)((F(x) ∧ (x < y)) → F(y))
In more conventional XML terms, while dissective predicates propagate "down"
the document tree, expansive predicates propagate "upwards" in the tree, from
children to their parents. This might be thought to be unusual, and actually it is
difficult to find examples of such properties in the TEI and HTML encoding schemes.
Element types such as <docDate> and <docAuthor> may, as we shall see later, be said
to ascribe properties to individuals of which they are a part, but that does not make
these individuals themselves <docDate>s or <docAuthor>s. (Even so, it is
easy to think of expansive properties: for example, the property of
containing the word "Hamlet" would clearly be expansive.)
Let us define an anti-expansive property as follows:
F is anti-expansive iff (x)(y)((F(x) ∧ (x ≪ y)) → ¬F(y))
The TEI element <foreign> may be an example of a property which is
anti-expansive, at least up to a certain level, and at least insofar as it seems
reasonable to assume that if something is marked as foreign, then it is marked off
from something which is not in a foreign language.
Collective and anti-collective properties
That a one-place predicate is collective means that it is
satisfied by the sum of every two individuals (distinct or not) that satisfy it
severally
[Goodman 1972, p. 39]. A collective one-place
predicate can be defined as follows:
F is collective iff (x)(y)((F(x) ∧ F(y)) → F(x + y))
Dissective elements like the TEI elements <hi>, <add>,
<del> and <foreign> and the HTML element <i> seem
also to be collective: any sum of strings in italics would seem itself to be in italics,
etc. There probably are examples of expansive and non-dissective or anti-dissective
properties in TEI or HTML, but so far we have not found any.
Let us define an anti-collective property as follows:
F is anti-collective iff (x)(y)((F(x) ∧ F(y) ∧ (x ʅ y)) → ¬F(x + y))
Both the TEI and the HTML <div> (division) element types seem to be
anti-collective: no sum of <div>s is itself a <div>.
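The distribution patterns defined in this and the two preceding sections can be tested on a small finite model. The Python sketch below is an illustration only: the universe consists of all non-empty sums of three hypothetical atoms, parthood is the subset relation, and the three sample predicates (`italic`, `contains_a`, `is_whole`) are invented stand-ins for properties like those discussed above.

```python
from itertools import combinations

ATOMS = "abc"  # three hypothetical atoms
# Every non-empty sum of atoms is an individual.
DOMAIN = [frozenset(c) for r in range(1, len(ATOMS) + 1)
          for c in combinations(ATOMS, r)]

def dissective(F):
    return all(F(y) for x in DOMAIN if F(x) for y in DOMAIN if y <= x)

def expansive(F):
    return all(F(y) for x in DOMAIN if F(x) for y in DOMAIN if x <= y)

def collective(F):
    return all(F(x | y) for x in DOMAIN if F(x) for y in DOMAIN if F(y))

def anti_dissective(F):
    return all(not F(y) for x in DOMAIN if F(x) for y in DOMAIN if y < x)

italic = lambda x: x <= frozenset("ab")      # atoms a and b are in italics
contains_a = lambda x: "a" in x              # analogue of containing a given word
is_whole = lambda x: x == frozenset(ATOMS)   # analogue of being the whole body
```

In this model `italic` is dissective and collective but not expansive, `contains_a` is expansive and collective but not dissective, and `is_whole` is anti-dissective.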
The HTML title element
So far, we have been concerned only with one-place predicates. Many TEI and HTML elements ascribe properties according to more complicated
patterns which can more conveniently be accounted for by representing them as relations,
or
predicates with two or more places.
We begin with a simple example of an element expressing a two-place predicate, the
HTML
title element. From:
<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Simple HTML</title>
</head>
<body>
<p>First para</p>
<p>Second para</p>
</body>
</html>
we get:
Table VII
Id | Label | Kind | Statements | Next | Parts |
i01 | Simple HTML | M E | head(i01), title(i01) | i02 | |
i02 | First para | M E | p(i02) | i03 | |
i03 | Second para | M E | p(i03) | | |
i04 | | E | body(i04) | | i02, i03 |
i05 | | E | html(i05) | | i01, i04 |
We state the propagation rule that:
(x)(y)((title(x) ∧ (x < y) ∧ html(y)) → hasTitle(y,x))
and get for the last line of the previous table:
Table VIII
Id | Label | Kind | Statements | Next | Parts |
i05 | | E | html(i05), hasTitle(i05,i01) | | i01, i04 |
The fact that the propagation rule can be made so simple in this case is partly due
to
the fact that we are assuming that the document is valid, and that the relative structural
positions of the elements are constant. For example, there is no need to state that
the
title element has to be the child of a head element which in turn is directly succeeded
by a
body element etc.
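The propagation rule for the title element can be evaluated over the individuals of Table VII as follows. This Python fragment is a sketch only: individuals are modelled as sets of molecules, `<` is read as the subset relation, and the names `PARTS`, `STATED` and `has` are hypothetical.

```python
# Individuals of Table VII as sets of the molecules they cover.
PARTS = {
    "i01": {"i01"}, "i02": {"i02"}, "i03": {"i03"},
    "i04": {"i02", "i03"},
    "i05": {"i01", "i02", "i03"},
}
STATED = {
    "i01": {"head", "title"}, "i02": {"p"}, "i03": {"p"},
    "i04": {"body"}, "i05": {"html"},
}

def has(prop, x):
    return prop in STATED.get(x, set())

# (x)(y)((title(x) ∧ (x < y) ∧ html(y)) → hasTitle(y,x))
has_title = {(y, x) for x in PARTS for y in PARTS
             if has("title", x) and PARTS[x] <= PARTS[y] and has("html", y)}
```

The only pair produced is (i05, i01), matching the statement hasTitle(i05,i01).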
The TEI sp, speaker and stage elements
While it is quite legitimate to assume document validity when stating propagation
rules,
these rules tend to become more complex when more elements are involved, and/or the
rules
for the structural positions of the elements concerned are more complex.
The relation between the TEI elements <sp> (speech),
<speaker> and <stage> (stage direction) is that a
<sp> may contain a <speaker>, and if it does, the
<speaker> element contains the name of the speaker of the rest of the
<sp> element, except for any <stage>s (stage directions) it
might contain. From:
<sp>
<speaker>Peer</speaker>
Why
<stage>(hesitating)</stage>
swear?
</sp>
we get:
Table IX
Id | Label | Kind | Statements | Next | Parts |
i01 | Peer | M E | speaker(i01) | i02 | |
i02 | Why | M | | i03 | |
i03 | (hesitating) | M E | stage(i03) | i04 | |
i04 | swear? | M | | | |
i05 | | E | sp(i05) | | i01, i02, i03, i04 |
We state the following propagation rule:
(x)(y)((speaker(x) ∧ (x < y) ∧ sp(y)) →
(z)(((z < y) ∧ ¬(speaker(z) ∨ stage(z))) → saidBy(z,x)))
and get:
Table X
Id | Label | Kind | Statements | Next | Parts |
i01 | Peer | M E | speaker(i01) | i02 | |
i02 | Why | M | saidBy(i02,i01) | i03 | |
i03 | (hesitating) | M E | stage(i03) | i04 | |
i04 | swear? | M | saidBy(i04,i01) | | |
i05 | | E | sp(i05) | | i01, i02, i03, i04 |
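The speaker rule can be evaluated in the same way. The following Python sketch is an illustration under one additional assumption, labelled in the code: the variable z is restricted to the molecular strings, as in the table above, so that the speech as a whole is not itself reported as said by the speaker. All names are hypothetical.

```python
# Individuals of Table IX as sets of the molecules they cover.
PARTS = {
    "i01": {"i01"}, "i02": {"i02"}, "i03": {"i03"}, "i04": {"i04"},
    "i05": {"i01", "i02", "i03", "i04"},
}
STATED = {"i01": {"speaker"}, "i03": {"stage"}, "i05": {"sp"}}
MOLECULES = {"i01", "i02", "i03", "i04"}  # assumption: z ranges over these only

def has(prop, x):
    return prop in STATED.get(x, set())

# (x)(y)((speaker(x) ∧ (x < y) ∧ sp(y)) →
#   (z)(((z < y) ∧ ¬(speaker(z) ∨ stage(z))) → saidBy(z,x)))
said_by = {
    (z, x)
    for x in PARTS if has("speaker", x)
    for y in PARTS if has("sp", y) and PARTS[x] <= PARTS[y]
    for z in sorted(MOLECULES)
    if PARTS[z] <= PARTS[y] and not (has("speaker", z) or has("stage", z))
}
```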
The TEI docTitle, docDate and docAuthor elements
The TEI <docTitle> (document title) element may occur directly within
<titlePage> or <front> (front matter); <titlePage>
may occur directly within <front> or <back> (back matter), and
<front> and <back> may occur directly within
<text>. <docTitle> behaves very much like the HTML
<title> element:
(x)(y)((docTitle(x) ∧ (x < y) ∧ text(y)) → hasTitle(y,x))
<docTitle> assigns the property of
being a document title to its own content, and the property of having that title
to the individual which carries the property of being a text, and of which it is
itself a part.
Thus, while no other parts of the elemental text individual have any of these properties,
all its parts have the property of being the part of an individual
which carries the title in question.
The <docDate> (document date) element, in turn, behaves very much like the
<docTitle> element. Although it may occur in a larger variety of positions, it
assigns the property of being (or identifying) the date of the document
to its own content, and the property of having that date to the
individual which carries the property of being a text, and of which it is itself a
part.
We may assume, however, that the document date carries over to most or all the parts
of
the text, i.e., that all the parts of the element have the property of having that
date,
too.
If we are dealing with a transcription of an authorial document which according to
the
<docDate> element dates from a particular year, it may be the case that we
also know that all parts of the document marked by <add> contain corrections
in that document made by another person several years later, and that all
<note>s are editorial notes supplied even later than that, by the creator of
the electronic version. A propagation rule to this effect may be expressed for example
as
follows:
(x)(y)(z)(w)((docDate(x) ∧ (x < y) ∧ text(y)) →
(((z < y) ∧ ¬((z < w) ∧ (add(w) ∨ note(w)))) →
(hasDate(y,x) ∧ hasDate(z,x))))
Note, however, that in some situations the TEI <docDate> element gives the date of
the
first edition of the text, while the text actually transcribed by the document comes
from a later edition. In such situations
the semantics of the element is rather different, and the property of having the date
given may possibly not propagate to elements below <text> level at all.
The <docAuthor> (document author) element, again, behaves much like the
<docDate> element. It assigns the property of being the
name of the author of the document to its own content, and the property of
having the author of that name to the text of which it is a part.
In the example just discussed, we may again assume that the property, in this case
the
property of having the author in question, is not carried over to later additions
and notes.
Other element types, such as <q> (quote) and <cit> (citation), would
for more or less obvious reasons also have to be considered for exclusion. However,
there is
a further complication: If a person is considered the author of a document, he is
normally
also considered the author of parts of that document, such as its chapters, sections
and
paragraphs. Perhaps authorship may also be attributed to sentences or phrases, but
certainly
not to individual words or letters. Again we are faced with a property which propagates
down
to a certain level, but where it is unclear exactly where that level ends. And as
is so
often the case with markup, it does not help us much to become clear about the level
at
which the propagation ends, be it subparagraphs, sentences or phrases, if it turns
out that
the elements at that level have not been marked up.
Problems
We have mentioned that there are at least two serious problems with our application
of the Calculus. One problem, which has already been identified, relates to the
representation of coextensive elements. The other problem, which relates to the
representation of empty elements, has only been mentioned in passing. We believe the
latter is the less serious of the two, and we will therefore discuss it first.
Empty elements
For the purposes of this discussion, we may conveniently distinguish between milestone
elements and other empty elements.
Milestone elements
Milestones are empty elements which ascribe properties to parts of a document that,
for various reasons, are not marked up as the content of ordinary elements. The reason
why some textual phenomena are represented by milestones rather than ordinary elements
is often a need to overcome the XML constraint that element structure must be
hierarchical.
Typically, a milestone may be seen as assigning a property to the following parts
of
the document, up to the next milestone element of the same type, up to the occurrence
of
an element of some specific other type, or to the end of the document. We think we
have
already demonstrated that our application of the Calculus to XML documents can handle
such
property assignment.
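As an illustration of such property assignment, the Python sketch below assigns a page property to every character atom that follows a milestone, up to the next milestone of the same type. The fragment, the use of the TEI page-break element <pb> with an n attribute, and the function name `pages_by_atom` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; <pb/> is used here as a page-break milestone.
DOC = '<p><pb n="1"/>End of one page <pb n="2"/>start of the next.</p>'

def pages_by_atom(xml_text, milestone="pb", attr="n"):
    """Return one (character, page) pair per character atom."""
    out, current = [], None

    def emit(text):
        out.extend((ch, current) for ch in text or "")

    def walk(elem):
        nonlocal current
        if elem.tag == milestone:
            current = elem.get(attr)   # property holds until the next milestone
        emit(elem.text)
        for child in elem:
            walk(child)
            emit(child.tail)

    walk(ET.fromstring(xml_text))
    return out

atoms = pages_by_atom(DOC)
page1 = "".join(ch for ch, page in atoms if page == "1")
page2 = "".join(ch for ch, page in atoms if page == "2")
```

The two sums `page1` and `page2` are individuals that carry a property although no element delimits them.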
We believe that many of the other mechanisms proposed to handle so-called overlapping
hierarchies in XML (for example, Trojan Horse milestones [DeRose 2004] and fragmented
or virtual elements [TEI P4]) can be handled in similar ways, and therefore do not
constitute a serious problem for our application of the Calculus.
Other empty elements
Empty elements which are not milestones typically stand for and/or ascribe properties
to some part of the document which cannot straightforwardly be represented as a character
or string of characters. These empty elements are more difficult to deal with, because
according to our application of the Calculus something which cannot be said to consist
of
character atoms simply cannot be an individual. And if it is no individual there seems
to
be nothing to which properties can be ascribed; only individuals can have properties.
The TEI elements <ptr> (pointer), <anchor> (anchor point),
<index> (index entry) and <divGen> (automatically generated
text division) are some examples. Either they indicate a point in the document, i.e.,
they have no "extension" in the terms of our application of the Calculus and would
seem to have to be located in a position between two atoms, or they do not indicate
any point or extension in the document, but rather an instruction to generate strings
with certain properties at the position where they are located. In some cases, the problems outlined
here can be solved by replacing the empty element in question with a character string,
taken for example from an attribute value of the element in question. In cases where
the
element occupies or points to a location between characters, we might find a practical
workaround by letting it apply or point instead to the atom immediately before or
after
the relevant location in our model of the document.
A slightly different kind of problem is presented by the TEI <graphic>
(inline graphic, illustration, or figure) and HTML <img> elements. The
basic meaning of these elements is easy enough to grasp: The occurrence of the element
indicates that an illustration or a figure occurs at a specific location in the document.
Therefore, a more appropriate solution to this as well as to the previously mentioned
examples is probably to lift the requirement that all atoms should have a character
type as a property. A graphics element, for example, might simply be represented in our
model by a "graphics" atom.
More generally, this would be a model in which a document consists not of a sequence
of character atoms, but of a sequence of some more generic kind of atoms. We might,
for example, agree to call them "atomic content objects", and concede that such
atoms may or may not have a character property, an image property etc.
Although we have not investigated the matter, we believe that such a modification
would
not drastically change the application of the Calculus described above.
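One way such atoms could be modelled is sketched below in Python. This is an assumption-laden illustration rather than a worked-out proposal: the class `Atom`, its two optional fields and the helper `atoms_from` are hypothetical, and the image is identified by an invented file name.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Atom:
    """An atomic content object that may carry a character or an image property."""
    char: Optional[str] = None    # character type, if the atom is a character
    image: Optional[str] = None   # hypothetical: reference to a graphic

def atoms_from(parts):
    """Build an atom sequence from strings and ('graphic', reference) tuples."""
    seq = []
    for part in parts:
        if isinstance(part, str):
            seq.extend(Atom(char=c) for c in part)
        else:
            seq.append(Atom(image=part[1]))
    return seq

seq = atoms_from(["See ", ("graphic", "fig1.png"), "."])
```

The graphic occupies exactly one position in the sequence, so ordering relations such as NA can be stated for it just as for character atoms.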
Coextensive elements
We have already exemplified and briefly discussed the problem with coextensive elements:
If two or more nested elements have exactly the same content, i.e., share exactly
the same
leaf nodes in the XML tree, they will be represented in our application of the Calculus
as
one individual sharing all the properties ascribed by the nested XML elements. What
kind of
problem this is, and whether and how it can be solved, depends on the wider requirements
and
aims for our application of the Calculus to markup. Under certain requirements or
perspectives, it may cease to be a problem.
If our aim is to establish a representation from which the serialized form of an XML
document can be regenerated, we obviously have a problem: It is by no means obvious
if or
how this could be done. Likewise, if our aim is to establish a representation from
which the
XML DOM, the XDM or the XML Infoset representation can be generated, or which is isomorphic
to and/or contains (all) the information given in any of those, then it is perhaps
even more
obvious that we have a problem.
We have two responses to this: On the one hand, the value of the approach presented
here
does not depend on such capabilities. The value of the approach to property propagation,
for
example, may be simply as an ancillary representation of some of the features of marked-up
documents, a representation which is not intended to capture all
the
information present in XML documents but rather to assist in the processing of such
documents. Therefore, the problem discussed here is a problem only to the extent that
it
impedes our work to realize this more modest aim. So far, we have not found any indication
that it does.
On the other hand, we might want to use this representation in order to modify the
XML
documents so represented, and in that case we would clearly need to reserialize them
to XML
or generate an XML-conformant document model of them. For such purposes, we believe
that
information about the XML nesting order of coextensive elements could easily be stored
in
some ancillary data structure which would make reserialization etc. possible. It should
also
be mentioned that, although again we have not investigated the matter, it is not
unreasonable to assume that a representation of documents in the way proposed for
our
application of the Calculus might be a convenient step in the process of converting
XML
documents to certain other markup systems, such as TexMecs or LMNL.
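A minimal Python sketch of such an ancillary structure follows. It is an illustration only and rests on several assumptions: each individual is identified by its character span, the ancillary information is an ordered list of generic identifiers (outermost first) per span, spans nest properly and are non-empty, and the function name `serialize` is hypothetical.

```python
from xml.sax.saxutils import escape

def serialize(text, elements):
    """Rebuild XML from text and {(start, end): [tags, outermost first]}."""
    opens, closes = {}, {}
    # Wider spans must be opened first ...
    for (start, end), tags in sorted(elements.items(),
                                     key=lambda kv: kv[0][0] - kv[0][1]):
        opens.setdefault(start, []).extend(f"<{t}>" for t in tags)
    # ... and narrower spans must be closed first.
    for (start, end), tags in sorted(elements.items(),
                                     key=lambda kv: kv[0][1] - kv[0][0]):
        closes.setdefault(end, []).extend(f"</{t}>" for t in reversed(tags))
    out = []
    for i in range(len(text) + 1):
        out.extend(closes.get(i, []))
        out.extend(opens.get(i, []))
        if i < len(text):
            out.append(escape(text[i]))
    return "".join(out)

# One individual with two properties; the list order records the XML nesting.
p_outside = serialize("Fiat lux.", {(0, 9): ["p", "s"]})
s_outside = serialize("Fiat lux.", {(0, 9): ["s", "p"]})
```

The same individual serializes differently depending only on the stored order, which is exactly the information the mereological representation itself does not keep.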
Finally, if our aim is to offer an alternative representation based on a different
understanding of the structure and semantics of marked-up documents, then we have
a problem
only if it can convincingly be argued that our representation is in some respect inferior
to
these standard ways of modelling documents. We think such a discussion is premature
unless
and until the application sketched here is developed further, but at least two lines
of
argument seem to present themselves as possible responses to the challenge.
First, one might argue that the problem is with XML, and not with the approach discussed
here. For example, if a TEI <p> (paragraph) and <s> (s-unit,
sentence) element are coextensive, XML forces us to decide whether we are dealing
with a
paragraph containing a sentence, or a sentence containing a paragraph, and leaves
us no
other option. But we might just as well (or rather) want to say that we are dealing
with one
object which has two properties: that of being a paragraph and that of being a sentence.
The
part-whole relationship which seems forced upon us by XML is an artifact of the
serialization, a result of one of the limitations of embedded markup [Raymond et al. 1996].
Second, we might concede that the representation of coextensive elements as conceived
in
the present approach is a problem, and try to solve it by amending our mereological
system.
Part of the solution may be found in allowing a more generous set of atoms, as discussed
above in connection with the problem of empty elements. Another part of the solution might
be to replace the Calculus of Individuals with some other formal mereological system. For
example, there seem to be mereological systems which allow for the idea that one individual
may be part of another even in cases where we cannot identify any part which they do not
share. For options along these lines, see the discussion of supplementation and closure
principles in [Casati and Varzi 1999, p. 38 ff.].
Conclusion and Future Work
We have considered some possible applications of the Calculus of Individuals to XML,
whereof the so-called character-atom approach has seemed the most promising so far.
Strings
composed of characters defined as atomic individuals can be identified and referenced
by
denoting expressions. The part-whole relationships and ordering relations between
parts of the
document as well as the properties ascribed by generic identifiers can be described.
Statements about the individuals of documents and their properties can be made, and
inferences
can be drawn from these statements.
We have shown, by means of examples from the TEI and HTML encoding schemes, how this
application of the Calculus can be used for the formulation of rules describing the
propagation of properties among the parts of a document.
We have identified problems or shortcomings concerning the representation of empty
elements and coextensive elements, and suggested that these problems may be overcome
partly by
allowing a more generous set of atoms, and partly by replacing the Calculus of Individuals
with some other formal mereological system.
In order to assess whether the application of formal mereology to markup semantics
is worthwhile, we believe that continued work is required along several lines: The
application to XML should be extended beyond the limitations of the approach presented
here to include the full range of XML mechanisms, such as attributes, entities,
declarations, comments, processing instructions, and marked sections. While the approach
presented here is limited to the consideration of XML documents in serialized form,
i.e. as character streams, attempts should be made at applying formal mereology to XML
documents considered as graphs of XPath nodes, Infoset items, and the like.
Furthermore, and as already mentioned, mereological systems beyond the Calculus of
Individuals should be considered in order to overcome some of the problems encountered in the
approach presented here. Last, but not least, the application of formal mereological systems
should be extended to other markup systems such as SGML, TexMECS, LMNL, Goddag and others.
References
[Casati and Varzi 1999] Casati, Roberto and Varzi, Achille C. Parts and Places. The Structures of Spatial Representation. MIT Press, 1999.
[DeRose 2004] DeRose, Steven J. 2004. "Markup overlap: A review and a horse." In Proceedings of Extreme Markup Languages 2004.
[Fitzgerald 2003] Fitzgerald, Henry. "Nominalist things." Analysis 63.2, OUP, April 2003, pp. 170-71. doi:https://doi.org/10.1093/analys/63.2.170.
[Goodman 1972] Goodman, Nelson. Problems and Projects. Hackett, Indianapolis 1972.
[Goodman 1977] Goodman, Nelson. The structure of appearance. Third edition. Boston: Reidel, 1977.
[Leonard and Goodman 1940] Leonard, Henry S. and Goodman, Nelson. "The Calculus of Individuals and Its Uses." The Journal of Symbolic Logic Vol. 5, No. 2, pp. 45-55, June 1940. doi:https://doi.org/10.2307/2266169.
[Libardi 1994] Libardi, Massimo. "Applications and limits of mereology. From the theory of parts to the theory of wholes." Axiomathes, no. 1, April 1994, pp. 13-54.
[Marcoux et al. 2009] Marcoux, Yves, Michael Sperberg-McQueen, and Claus Huitfeldt. "Formal and informal meaning from documents through skeleton sentences: Complementing formal tag-set descriptions with intertextual semantics and vice-versa." Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:https://doi.org/10.4242/BalisageVol3.Sperberg-McQueen01.
[Pitkänen] Pitkänen, Risto. "Content Identity." Mind LXXXV (1976), pp. 262–268. doi:https://doi.org/10.1093/mind/LXXXV.338.262.
[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C. M., and Claus Huitfeldt. "Containment and dominance in Goddag structures." Talk given at Conference on Text Technology, Bielefeld, March 2008. Forthcoming.
[Raymond et al. 1996] Raymond, Darrell, Frank Wm. Tompa and Derick Wood. "From Data Representation to Data Model: Meta-Semantic Issues in the Evolution of SGML." Computer Standards and Interfaces 18, pp. 25-36 (1996). doi:https://doi.org/10.1016/0920-5489(96)00033-5.
[Sperberg-McQueen et al. 2009a] Sperberg-McQueen, C. M., Claus Huitfeldt and Yves Marcoux. "What is transcription? (Part 2)." Talk given at Digital Humanities 2009, Maryland, June 2009. Forthcoming.
[TEI P4] The TEI Consortium / The Association for Computers and the Humanities (ACH); The Association for Computational Linguistics (ACL); The Association for Literary and Linguistic Computing (ALLC). TEI P4: Guidelines for Electronic Text Encoding and Interchange XML-compatible edition. Ed. C. M. Sperberg-McQueen and Lou Burnard; XML conversion by Syd Bauman, Lou Burnard, Steven DeRose, and Sebastian Rahtz. Oxford, Providence, Charlottesville, Bergen: TEI Consortium, December 2001. http://www.tei-c.org/release/doc/tei-p4-doc/html/
[Varzi 2003] Varzi, Achille. "Mereology." Stanford Encyclopedia of Philosophy. http://plato.stanford.edu/entries/mereology/ First published Tue May 13, 2003; substantive revision Thu May 14, 2009.