How to cite this paper
Dombrowski, Andrew, and Quinn Dombrowski. “A formal approach to XML semantics: implications for archive standards.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation
of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the
Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). https://doi.org/10.4242/BalisageVol6.Dombrowski01.
International Symposium on XML for the Long Haul: Issues in the Long-term Preservation
of XML
August 2, 2010
Balisage Paper: A formal approach to XML semantics: implications for archive standards
Andrew Dombrowski
PhD student
Department of Slavic Languages and Literatures and Department of
Linguistics, University of Chicago
Andrew Dombrowski is a 4th year PhD student at the University of Chicago in
the Department of Slavic Languages and Literatures and the Department of
Linguistics. His research focuses on language change and contact between Slavic
and non-Slavic languages.
Quinn Dombrowski
Manager, Scholarly Technology
University of Chicago
Quinn Dombrowski is the manager of the Scholarly Technology group in the
University of Chicago's IT Services organization. She has an MA in Slavic
Linguistics from the University of Chicago, and an MLS from the University of
Illinois at Urbana-Champaign.
Abstract
Previous literature characterizing XML semantics (Sperberg-McQueen et al. 2000, Renear et al. 2002, Piez 2002) takes reasonably syntactically and semantically plausible
markup and/or schemas as a starting point. In contrast, for this paper we aim to work
towards such a schema as an idealized end goal, by characterizing the necessary (if not
sufficient) semantic constraints that differentiate a schema intended for archival
use from nonsense and implausible
schemas, as well as from schemas that fail to sufficiently take semantics into account.
In addition to the goal of providing a novel approach to the perennially thorny
problem of XML semantics, we are particularly concerned with the interaction between
archival goals and XML semantics.
Table of Contents
- 0. Introduction
- 1. Why semantics?
- 2. Syntax-Semantics Mismatches in XML
- 3. Formal Semantics of XML
  - 3.1 Semantic Types
  - 3.2 Semantic Coherence
  - 3.3 Semantic Hierarchies
- 4. Some Very Basic Features of Archive Standards
- 5. Conclusion
0. Introduction
In contrast to syntax, which is explicitly (and machine-readably) defined for XML
documents through use of a schema, XML semantics is notoriously difficult to pin down.
Sperberg-McQueen, et al. 2000
takes the approach of describing semantics by defining some of the processes one goes
through unconsciously when interpreting the semantics of XML: what meaning elements
and
attributes convey, how one makes sense of seemingly conflicting statements, the
different behavior of distributed and non-distributed features, etc. Renear, et al. 2002 presents the issue of XML
semantics in its historic context, identifies important aspects of semantics (class
relationships, feature propagation, context and reference, etc.) that are usually
only
specified in accompanying prose documentation—if at all, and argues for the value
of a
machine-readable representation scheme for markup semantics, which is one of the
research goals of the BECHAMEL Project. Piez
2002 takes a more philosophical approach to XML semantics, drawing on the
work of Ferdinand de Saussure and the Structuralist movement by describing markup
as a
layered sign system. All of these approaches take reasonably syntactically and
semantically plausible markup and/or schemas as their starting point. In contrast,
in this paper we aim to work
towards such a schema as an idealized end goal, by characterizing the necessary (if not
sufficient) semantic constraints that differentiate a schema intended for archival
use from nonsense and implausible
schemas, as well as from schemas that fail to sufficiently take semantics into account.
In addition to the goal of
providing a novel approach to the perennially
thorny problem of XML semantics, we are particularly concerned with the interaction
between archival goals and XML semantics.
We argue that for archival purposes, XML semantics are non-trivial - i.e., (1) that
the problem of XML semantics cannot be reduced to the set of all possible use cases,
(2)
that XML syntax and semantics differ with regard to crucial structural properties,
and
(3) that semantics and syntax impose independent well-formedness constraints on schemas.
We examine these properties in the context of a hypothetical long-haul archival situation
in which documentation may not have been preserved – and in which the agendas underpinning
the original markup may not be easy to reconstruct. In such circumstances, the interpretation
of a given XML markup schema will be facilitated by an ability to explicitly delineate
plausible markup schemas from non-plausible schemas independent of subject-specific
knowledge.
With this in mind, we provide a formal semantic characterization of traits found in
good (reasonably plausible, as
contrasted with merely syntactically valid) schemas, and finally propose a set of
properties that characterize such schemas in a way that incorporates both semantic
and syntactic considerations. We hope
that specifically considering what semantic characteristics should exclude a schema
from
consideration as a plausible archive standard will indirectly shed light on the nature
of XML semantics more broadly. However, it is not our goal in this paper to propose
an exhaustive treatment of XML semantics; rather, we aim to elucidate the bare minimum
necessary for a schema to be plausible. This paper is informed by linguistic methodology
in the
broad sense – i.e., the proposition that a characterization of the bare minimum of
“grammaticality” can yield insight of broader interest. In particular, we draw upon
notions developed in the modern school of semantics that began with Montague Grammar.
As
such, we hope that some of the developments in the field of linguistics in the last
50
years, as reflected herein, prove as insightful a lens onto markup as the earlier
Structuralist school.
1. Why semantics?
The characterization of archive-appropriate schemas necessitates separating "good"
(i.e.,
plausibly useful) schemas from the infinitely large space of valid XML schemas. At
any
given point in time, practical and case-specific evaluations of the utility of a given
schema should suffice for most purposes. However, long-term preservation also means
planning for environments in which a significant amount of case-specific detail may
have been lost. Lexical semantics are particularly mutable over time; the description of
"symbol" provided by TEI ("documents the intended significance of a particular
character or character sequence within a metrical notation, either explicitly or in
terms of other symbol elements in the same metDecl"; TEI P4) is easier to grasp
intuitively given the modern English meaning of the word than given the 15th-century
usage, meaning "creed, summary, religious belief" (Online Etymology Dictionary).
The assumptions underlying research programs are even less
stable than lexical semantics; the concern with structuralist semantics was superseded
in the 1960s by the controversial and short-lived generative semantics research program,
which was itself eventually superseded (in the 1980s and onward) by more modern schools
of semantics, beginning with Montague grammar, that have drawn on techniques of formal
logic for their basis. An illustrative thought experiment here is to imagine projecting
markup technologies into the past to be coextensive with literacy. What XML schemas
would have been created by, for instance: a Greek dramatist, St. Augustine, an early
medieval Chinese chronicler, and an alchemist? And how would these schemas differ
from,
say, TEI? While a single modern guideline such as TEI may be able to encode the written
records of this diverse group of individuals in a way that is meaningful to the modern
scholar, a TEI encoding of these texts informed by modern scholarly interests would
not
only fail to be interoperable with the schemas devised by the original authors, but
may
perhaps not even be comprehensible to them.
A rich knowledge of the specific situations (intended use, cultural context, concept
of authorship/citation, etc.) in which these hypothetical schemas were created would
ameliorate the situation. However, a goal of long-term preservation standards is to
allow a certain
degree of interoperability without crucial context-specific knowledge. One step in
doing
so is to separate out the relatively small set of plausibly useful schemas from the
potentially vast space of valid schemas; it is the goal of this paper to outline a
way
of doing so.
To illustrate this, we can consider example XML using completely ridiculous schemas
and some using merely implausible schemas. Examples using completely ridiculous schemas
are shown below (1-3). In each of these schemas, the permitted content type of each
element is the actual object, action, or part of speech specified by the name of the
element (e.g., <branch /> can only contain such a protrusion from a tree, <simplify />
can only contain the act of simplification, etc.).
- (1) the tree-list schema: <trunk /><oak /><maple /><branch />
- (2) the command-list schema: <simplify /><eat /><breathe />
- (3) the English conjunctions schema: <and /><but /><however />
Some structurally similar schemas are intuitively less ridiculous, although also
implausible. Examples of XML using implausible schemas are given below (4-7).
- (4) the word-length schema: <word length="x"/>, where x = # of letters in the word
- (5) the "broken clock is right twice a day" incorrect word-length schema: <word length="x*sin(n°)"/>, where x = # of letters in the n-th word in the document
- (6) the count-words-by-threes schema: <word1 /><word2 /><word3 /><word1 /><word2 /><word3 /><word1 />, etc.
- (7) the conspiracy-theorist schema: <word(n) /> <word(n+k)/> <word(n+2k)/>, etc., where n is the position of a word in the text and k is a number imbued with some significance (e.g. 666, 42, or, with a few tweaks, a succession of prime numbers)
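To make the two word-length schemas concrete, the following is a minimal Python sketch of how such markup could be generated. The function names are hypothetical, words are naively split on whitespace, and every character of a word is counted as a letter; none of this is part of the schemas themselves.

```python
import math

def word_length_markup(text: str) -> str:
    """Word-length schema: one <word/> per word, length = number of characters."""
    return "".join(f'<word length="{len(w)}"/>' for w in text.split())

def broken_clock_markup(text: str) -> str:
    """Incorrect word-length schema: the length of the n-th word times sin(n degrees)."""
    parts = []
    for n, w in enumerate(text.split(), start=1):
        value = len(w) * math.sin(math.radians(n))
        parts.append(f'<word length="{value:.3f}"/>')
    return "".join(parts)

# The second schema reports the true length only where sin(n degrees) = 1,
# for example at the 90th word.
sample = " ".join(["ab"] * 90)
correct = word_length_markup(sample)
distorted = broken_clock_markup(sample)
```

Both functions produce syntactically unobjectionable markup, which is the point: nothing in the syntax distinguishes the useful attribute values from the distorted ones.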
How, then, to distinguish between the ridiculous, the implausible, and the
plausible?
An immediate and intuitive objection to these schemas might be that they can be ruled
out on the grounds that no one would possibly be interested in them. However, that
explanation, which can be termed the "practical usability explanation" is not fully
adequate. First, it is not necessarily clear that this approach would capture the
difference between the ridiculous and the merely implausible. On a certain level,
the
English conjunctions schema could be thought to be more plausible than the "broken
clock
is right twice a day" incorrect word-length schema, insofar as it is much easier to
imagine why someone would be interested in conjunctions than in looking at the result
of
multiplying word-length figures by the sine function. However, the English conjunctions
schema is clearly bad in a way that the "broken clock is right twice a day" incorrect
word-length schema isn't. Intuitively speaking, we might say that conjunctions are
a
reasonable area of interest, but given an interest in conjunctions, the English
conjunctions schema is unlikely to be your choice. On the other hand, being interested
in
multiplying word-length figures by the sine function is bizarrely implausible, but
if
for some reason one wanted to do that, the "broken clock is right twice a day" incorrect
word-length schema would work.
The "practical usability explanation" is especially problematic in the context of
archival preservation. Part of the reason why long-term archival preservation of XML
is
a non-trivial task is precisely the fact that it is not always obvious what future
generations of researchers will find interesting or useful. Furthermore, the
establishment of practical usability will always to a certain extent be in the eye
of
the beholder. Schemas like our conspiracy-theorist schema could be of potential interest
- Dan Brown, for instance, could testify to the wide public appeal of conspiracy
theories. More seriously, debates about intuitive assessments of practical utility
are
unlikely to be a fundamentally productive line of discussion.
Another possible objection is that by definition XML markup is performed on text.
This renders the tree-list schema and the command-list schema impossible insofar as
it is a feature of the real world that tree parts and actions are not composed of
combinations of characters. While this is a reasonable objection, the degree to which
these assertions are based on potentially contestable real-world knowledge is problematic.
It may be difficult to imagine a situation in which a sane person would assert that
trees are composed out of characters in an ontologically real sense, but one can more
easily imagine a lively argument about whether actions can be expressed with words
in an ontologically real sense (e.g. performatives). Regardless, this line of reasoning
is only applicable with difficulty in a hypothetical long-haul preservation scenario,
since assumptions about real-world phenomena have been known to change over time.
What criteria, then, can we use to distinguish ridiculous, implausible, and plausible
schemas without reference to practical utility or related questions? Syntax could
help; an intuitive
observation about schemas (1) - (7) is that they are structurally flat, an observation
which leads to the suggestion that more elaborate syntactic structure may be
characteristic of plausible schemas. While this may be the case, it is also the case
that equally absurd examples could be constructed to an arbitrary degree of syntactic
nestedness, and not all flat schemas are absurd (e.g., Dublin Core). This illustrates
that syntactic considerations are not sufficient to the task at hand. The rest of
this
paper develops a proposal that employs semantics to characterize plausible schemas,
as
opposed to syntactically valid but ridiculous or implausible schemas.
2. Syntax-Semantics Mismatches in XML
A prerequisite to any discussion of XML syntax versus XML semantics is to determine
whether or not XML syntax and XML semantics are on some level equivalent. If a
generalization about XML semantics could be restated making reference only to XML
syntax, this would render any mention of semantics irrelevant. In this section, it
is
shown that there are at least two senses in which syntax and semantics are crucially
distinct in XML. (A note on representation: in the field of semantics, angled brackets
are used to refer to words, while double square brackets refer to what the words mean,
or their denotation. Therefore, in this context, <cat> refers to an element
that could be employed in a schema, [[cat]] refers to the furry animal, and
<cat> shown as literal markup refers to an XML representation.)
First, XML syntax is strictly hierarchical, but XML semantics does not have to be.
An example where both syntax and semantics are hierarchical can be seen in paragraph
structure: <sentence> ∊ <paragraph> (in XML,
<paragraph><sentence /></paragraph>) and [[sentence]] ⊂ [[paragraph]] (a
sentence is a subset of a paragraph). However, when elements refer to properties that
are not inherently hierarchical, this is not the case. For instance,
<damage> ∊ <sentence> but [[damage]] ⊄ [[sentence]] - i.e.,
the element <sentence> may be the parent element for
<damage>, but it does not make sense to say that the concept of damage
is a subset of the concept of sentence. This can be formalized as follows: if the
subset
relationship holds between the denotations of two or more elements (like [[sentence]]
and [[paragraph]]), let these elements be called semantically hierarchical. If not
(like
[[damage]] and [[sentence]]), then let these elements be called semantically
non-hierarchical. The semantic hierarchy can be captured by arranging semantically
hierarchical elements on the semantic levels s, s(1), s(2), ..., s(k) for k levels
of
specificity (proceeding from general to specific) - i.e., the semantic hierarchy
consists of semantically hierarchical elements, arranged accordingly.
As an aside, it can be noted that proposals have been made for XML syntax to be
non-strictly hierarchical in order to accommodate different kinds of structures in
a document (Renear, et al. 1993), which stands in
contrast to earlier conceptions of a document as containing a single logical hierarchy
of content objects (DeRose, et al. 1990).
Non-hierarchical syntax involves the use of different (concurrent) structures that
may overlap with one another but share the same content (Chatti, et al. 2007).
Syntactic non-hierarchicality applies only to
interactions between different syntactic levels of the schema (although, in extreme
cases, such as the Dublin Core, there may only be one level of syntax at all), and
does not obviate hierarchicality in the semantics.
Syntax and semantics also impose independent constraints on the well-formedness of
schemas (where well-formedness is understood as the property that characterizes
plausible schemas). The independence of syntactic and semantic constraints is
illustrated below; again, here the element <every> can only
contain the concept of every-ness:
- (8) good syntax + good semantics: <paragraph><sentence /></paragraph>
- (9) bad syntax + good semantics: <paragraph><sentence></paragraph></sentence>
- (10) good syntax + bad semantics: <paragraph><every /></paragraph>
- (11) bad syntax + bad semantics: <paragraph><every></paragraph></every>
These considerations demonstrate that XML syntax and semantics must be analyzed as
separate domains. The restrictions that hold on valid XML syntax have been well
documented (W3C 2008), whereas the restrictions
that must hold on the semantics of plausible schemas are less well described.
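The independence of the two kinds of constraint can be sketched in Python by checking them with two unrelated procedures. Syntax is checked with the standard library parser. The semantic check is only a stand-in: the table PLAUSIBLE_CHILDREN and both function names are assumptions made for illustration, not a proposal for how semantic knowledge should be encoded.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for semantic knowledge: which child elements denote
# something that can sensibly occur inside the parent element.
PLAUSIBLE_CHILDREN = {"paragraph": {"sentence"}}

def syntax_ok(markup: str) -> bool:
    """True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

def semantics_ok(parent: str, child: str) -> bool:
    """True if the child is a plausible constituent of the parent."""
    return child in PLAUSIBLE_CHILDREN.get(parent, set())

# The four combinations of good and bad syntax and semantics.
cases = [
    ("<paragraph><sentence /></paragraph>", "paragraph", "sentence"),
    ("<paragraph><sentence></paragraph></sentence>", "paragraph", "sentence"),
    ("<paragraph><every /></paragraph>", "paragraph", "every"),
    ("<paragraph><every></paragraph></every>", "paragraph", "every"),
]
results = [(syntax_ok(m), semantics_ok(p, c)) for m, p, c in cases]
```

Neither check consults the other, so each of the four outcomes can occur; a parser alone can only ever report the first component of each pair.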
3. Formal Semantics of XML
3.1 Semantic Types
In this section, we propose that attributes and elements in plausible XML schemas
must be of type <e, t>, where the notation <e, t> is
understood as indicating a function from individuals (<e>) onto truth
values (<t>). This is the semantic type generally postulated to
characterize common nouns and adjectives in English. For instance, [[dog]] can be
thought of as the set of all things that are dogs - i.e., a function f from
individuals (any and all conceivable entities in this world) onto truth values (1
=
true, 0 = false) such that f(x) = 1 iff [[x]] is a dog. One could object that it
would be simpler to state this proposal in terms of nouns and adjectives - i.e., to
propose that attributes and elements should be nouns and adjectives. However, it is
preferable to state this in terms of semantics, because we need to keep our terms
straight. "Nouns" and "adjectives" are terms taken from English syntax, which is not
optimal when what we really want to talk about is XML semantics - i.e., neither
English nor syntax. This proposal rules out absurd schemas (2) and (3) from
Section 1, and captures the intuition that attributes and elements should be
statements about things.
Beyond the intuitive appeal of this proposal, it can be derived in a bottom-up
fashion, based only on the assumptions that (1) texts are made up of things, and (2)
that markup says things about things. Assumption (1) shows that texts are made up
of
basic components of type <e>. Assumption (2) leads directly to a
semantic type of <e, t> for elements and attributes; i.e., something
is tagged <paragraph> only if it is true that it is a
paragraph, modulo whatever definition of paragraph is appropriate in context. A
formal definition of "tag abuse" also falls out from assumption (2): tag
abuse is the mapping of an individual onto a truth value of zero. In a situation
where <ship> is being used to cause some arbitrary text
(other than a ship name) to be rendered in italics (Piez 2001), the user has
overlooked the fact that the element
<ship> is a function that assigns the value 1 to its contents if and
only if it is true that the denotation of the contents is a ship.
Translated into the terms above, the element <paragraph> is a
function from individual bits of text onto truth values such that
<paragraph>(x) = 1 iff [[x]] is a paragraph. Assumptions (1) and (2)
should be basic for all archival purposes. Denying assumption (2) could lead to the
emergence of bizarre surrealist schemas, but it seems safe to conclude that ruling
out such schemas is precisely the goal for developing archival standards. It is not
clear what denying assumption (1) would even mean ontologically.
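As a rough illustration of elements as functions of type <e, t>, the sketch below models each element as a Python predicate over chunks of text. The toy definitions of is_paragraph and is_ship, the set KNOWN_SHIPS, and the name is_tag_abuse are all assumptions introduced for the example; a real denotation would depend on context.

```python
from typing import Callable, Dict

Entity = str                          # type <e>: a chunk of text
Predicate = Callable[[Entity], bool]  # type <e, t>: entity to truth value

# Toy denotations, assumed purely for illustration.
KNOWN_SHIPS = {"Titanic", "Beagle"}

def is_paragraph(x: Entity) -> bool:
    """A non-empty chunk of text with no blank line inside it."""
    return bool(x.strip()) and "\n\n" not in x

def is_ship(x: Entity) -> bool:
    """True if the text names a ship known to this toy model."""
    return x in KNOWN_SHIPS

DENOTATION: Dict[str, Predicate] = {"paragraph": is_paragraph, "ship": is_ship}

def is_tag_abuse(element: str, content: Entity) -> bool:
    """Tag abuse: the element maps its content onto the truth value 0."""
    return not DENOTATION[element](content)
```

Under these assumptions, is_tag_abuse("ship", "very important") holds, because the tag is being used only to obtain italics, whereas is_tag_abuse("ship", "Titanic") does not.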
More complicated functions are of course conceivable, but they are the domain
of the processing language rather than the XML itself. An example of this would be
a
function of the type <<e, t>, <e, t>> -
i.e., a function that takes one element/attribute and returns another. For instance,
one such function would take a nested element and return the element one level
higher.
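A function that returns the element one level higher is straightforward to write in a processing language, which supports the claim that it belongs there rather than in the XML. The following Python sketch uses the standard library; the helper name parent_of is hypothetical, and the lookup table is needed because this particular library does not store parent pointers.

```python
import xml.etree.ElementTree as ET

def parent_of(root):
    """Build a child-to-parent lookup for every element below root."""
    return {child: parent for parent in root.iter() for child in parent}

root = ET.fromstring("<paragraph><sentence><word/></sentence></paragraph>")
one_level_higher = parent_of(root)

word = root.find("./sentence/word")
sentence = one_level_higher[word]      # the element one level above <word>
paragraph = one_level_higher[sentence]  # the element one level above <sentence>
```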
It should be noted that in the above proposal XML schemas are not assumed to be compositional
semantically. To some extent, it is an open question whether or not a compositional
minimal semantics for XML is a desirable feature. Compositional semantics would inevitably
result in a proliferation of types, thereby obviating the proposed distinction between
<e, t> elements that belong to XML and other elements that are the domain of the processing
language. On the other hand, non-compositional semantics means that the concept of
function admissible in XML must be wide enough to include input from outside the local
domain of the element. For instance, the attribute lang = "en"
must be valued by referring to something beyond the string of characters "en". Similarly,
an element containing many sub-elements would have to be evaluable in terms of its
sub-elements. To a certain extent, it remains to be seen whether non-compositional
semantics makes undesirable predictions. Absent such evidence, the more parsimonious
option is not to include compositionality as an explicit requirement.
3.2 Semantic Coherence
The requirement that attributes and elements in plausible XML schemas be of type
<e, t> is necessary but not sufficient to the task of ruling in
plausible schemas while ruling out implausible schemas. To illustrate the point,
consider the XML in (12) and (13):
- (12) <title /><creator /><subject /><description /><publisher />
- (13) <title /><giraffe /><arsenic /><starvation /><King of France />
Example (12) is an excerpt from the well-known Dublin Core schema for marking up
metadata, while schema (13) is nonsense that satisfies the requirement that
attributes and elements be of semantic type <e, t>. How, then, to rule
out (13) as compared to (12)? In this section, we attempt to develop the intuition
that there exists a real-world object such that the traits [[title]], [[creator]],
[[subject]], [[description]], and [[publisher]] can be predicated of it or its constituent
parts with a truth
value of 1 (i.e., there exists at least one object that has all of these traits),
but there is no real-world object such that [[title]], [[giraffe]], [[arsenic]],
[[starvation]], and [[King of France]] can be predicated of it with a truth value
of
1. As a reminder, the notation [[title]] should be understood as meaning roughly "something
that is a title".
In order to formalize this insight, it is necessary to take a closer look at how
entities of type <e, t> operate. The denotation of such an entity
([[x]] where x is of type <e, t>) is either 1 or 0 (corresponding to
true or false). Such an entity must give a truth value based on an entity of type
<e> - i.e., a chunk of text. The only restriction on this process is
that it be a function, which for these purposes only means that some individual x
cannot be assigned to both true and false - i.e., it cannot be simultaneously true
and false that a chunk of text is a paragraph. Within this very wide scope, it is
possible to distinguish multiple types of functions. Structural-type functions
assign truth values based on whether or not the individual entity under evaluation
meets certain structural criteria; i.e., x is a paragraph if and only if x is a
paragraph. Predicative-type functions assign truth values based on a
non-definitional but inherent property of the entity under evaluation; i.e., x is
in
German if and only if x is in German (as distinct from being a sentence, a
paragraph, a word, etc.). Attributive-type functions assign truth values based on a
non-definitional and non-inherent property of the entity under evaluation - i.e.,
x
is the title if and only if x is the title, a bit of information that requires
specific real-world context to determine.
With this in mind, we can return to the main topic and provide a more precise
characterization of semantic coherence. A schema S is said to be semantically
coherent iff for the elements and attributes a1, a2, a3, ..., an ∊ S there exists
a set of entities (of type <e>) {x1, x2, x3, ..., xn} such that
[[ak(xk)]] = 1 for all ak ∊ S. The concrete interpretation of this will vary
depending on whether the elements or attributes in question are structural,
predicative, or attributive. This rules out example (13), because there are no
real-world objects such that each element could assign those objects a truth
value of 1 simultaneously (i.e., there is no thing that literally consists of or
contains a title, a giraffe, arsenic, starvation, and the King of France, all at
the same time).
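One possible reading of this definition can be sketched in Python over a toy model of the world. Each object is described by the set of traits that can be truthfully predicated of it or of its constituent parts; the dictionary WORLD, its contents, and the function name is_coherent are assumptions for illustration only, and real-world knowledge is of course not enumerable in this way.

```python
from typing import Dict, Set

# Hypothetical toy world: object name -> traits predicable of it or its parts.
WORLD: Dict[str, Set[str]] = {
    "library book": {"title", "creator", "subject", "description", "publisher"},
    "zoo guidebook": {"title", "giraffe"},
    "poison cabinet": {"arsenic"},
}

def is_coherent(schema: Set[str], world: Dict[str, Set[str]] = WORLD) -> bool:
    """True if some single object satisfies every element of the schema."""
    return any(schema <= traits for traits in world.values())

dublin_core_excerpt = {"title", "creator", "subject", "description", "publisher"}
nonsense = {"title", "giraffe", "arsenic", "starvation", "King of France"}
```

In this toy world the Dublin Core excerpt is coherent because one object carries all of its traits, while the nonsense schema is not, even though several of its elements are individually satisfiable.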
3.3 Semantic Hierarchies
At least one more issue must be discussed in order to fully characterize plausible
schemas. Compare (14) and (15):
- (14) <paragraph><sentence/></paragraph>
- (15) <sentence><paragraph/></sentence>
(14) obviously corresponds to a common schema while (15) is nonsense. Syntax
cannot help here, nor does it suffice to appeal to the claim that (15) is not
plausible because it is not plausible. The reason why (15) is not plausible is
because syntax is conflicting with semantics. In order to get a precise handle on
this, it is necessary to formalize the notion of semantic hierarchies.
The semantic representation of an XML tree may be considered to consist of the
linearly arranged denotations of the elements and attributes present within an XML
tree. In other words, <element> → [[element]] and
<attribute> → [[attribute]]. As applied to (14) and (15), this yields
the following table.
Table I
| | Linear Representation | Hierarchical Representation |
| Syntax of (14) | <paragraph><sentence/></paragraph> | <sentence> ∊ <paragraph> |
| Semantics of (14) | [[paragraph]][[sentence]] | [[sentence]] ⊂ [[paragraph]] |
| Syntax of (15) | <sentence><paragraph/></sentence> | <paragraph> ∊ <sentence> |
| Semantics of (15) | [[sentence]][[paragraph]] | [[sentence]] ⊂ [[paragraph]] |
Table I gives an indication of what the problem is with (15) - we can freely
change the syntax of (14), but as much as we change the syntax, we cannot change
what "paragraph" and "sentence" mean - in particular, we cannot change the fact that
[[paragraph]] and [[sentence]] are semantically hierarchical. The only remaining
step is to smooth over the notational discrepancy between hierarchical syntactic
relationships and hierarchical semantic relationships.
Below is a formal characterization of semantic hierarchies as conceived more
abstractly as ordering relationships: given a set of entities E = {e1, e2, e3, ...,
en}, a hierarchy can be defined as an ordered k-tuple (ei, ej, ek) made up of
elements of E. An XML schema S is then made up of both syntactic elements/attributes
and their denotations: S = {<e1>, [[e1]], <e2>, [[e2]],
<e3>, [[e3]], <e4>, [[e4]], ..., <en>,
[[en]]}. We can then state that any ordering that holds for a syntactic element
<ek> in S must also hold for its semantic correspondent [[ek]]. If the
above holds, we may then state that semantic hierarchies respect syntactic
hierarchies, i.e., while the syntactic and semantic hierarchies don’t need to
correspond, they can’t be contradictory. This rules out (15).
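The non-contradiction requirement can be sketched as a check over parent-child pairs in a document. In the Python sketch below, the set SEMANTIC_SUBSET is an assumed, hand-written semantic hierarchy, and the function name contradictions is hypothetical; only direct parent-child pairs are examined, which is a simplification.

```python
import xml.etree.ElementTree as ET

# Assumed semantic hierarchy: (narrower, broader) pairs,
# read as [[narrower]] ⊂ [[broader]].
SEMANTIC_SUBSET = {
    ("sentence", "paragraph"),
    ("word", "sentence"),
    ("word", "paragraph"),
}

def contradictions(markup: str):
    """List (parent, child) pairs whose syntactic nesting inverts the semantics."""
    root = ET.fromstring(markup)
    found = []
    for parent in root.iter():
        for child in parent:
            # A contradiction arises when the parent denotes something
            # narrower than the child it syntactically contains.
            if (parent.tag, child.tag) in SEMANTIC_SUBSET:
                found.append((parent.tag, child.tag))
    return found
```

A pair of semantically non-hierarchical elements, such as <damage> inside <sentence>, is never reported, which matches the requirement that the two hierarchies need not correspond but must not contradict one another.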
4. Some Very Basic Features of Archive Standards
In this section, we summarize the above points and add some other criteria that must
be met by plausible archive standards.
- Syntax is arbitrarily nested. If the most general level is p, let the successively more specific levels be denoted by p(1), p(2), ..., p(k) for k levels of specificity. It is not necessarily the case that one and only one element corresponds to each syntactic level. For instance, it is possible that elements like <sentence> and <metaphor> are on the same level.
- Elements and attributes are of semantic type <e, t>.
- Schemas must be semantically coherent.
- Syntactic hierarchies must respect semantic hierarchies.
- Elements and attributes are assigned at the highest possible level. This is an obvious insight that is not trivial to formalize, the insight being that elements and attributes should not be gratuitously repeated. Sometimes (in the case of structural elements), this is because to do otherwise would be semantically invalid (e.g., <paragraph><paragraph></paragraph></paragraph>). For transitive predicative attributes or elements, it would be redundant (i.e., not everything needs to be redundantly marked for language). Thus, for most elements and attributes, it is sufficient to state that an element or attribute that maps onto t = 1 (true) at level p(k) maps onto t = 0 (false) at level p(k - 1). This will handle structural elements like <paragraph> and predicative attributes like <language>. The situation is more complex with regard to attributive elements and attributes like <metaphor> or <damage>. One can imagine situations in which these elements might occur on two structurally contiguous levels: metaphors within metaphors or damage within damage. Ontologically, the situation could be saved by positing that underlyingly, different metaphors or different types of damage are being denoted. The details of how to formalize this are not entirely clear but would likely capitalize on the intuition that metaphors inside metaphors only work if the two metaphors are different.
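For predicative attributes, the "highest possible level" criterion can be approximated by flagging any element that repeats a value it would already inherit from an ancestor. The Python sketch below assumes a plain attribute named lang and a hypothetical function name gratuitous_repeats; it does not attempt the harder attributive cases such as metaphor or damage.

```python
import xml.etree.ElementTree as ET

def gratuitous_repeats(markup: str, attribute: str = "lang"):
    """Return tags of elements that restate an attribute value already in force."""
    root = ET.fromstring(markup)
    repeats = []

    def walk(node, inherited):
        value = node.get(attribute)
        if value is not None and value == inherited:
            repeats.append(node.tag)
        # Children inherit the nearest value set at or above this node.
        for child in node:
            walk(child, value if value is not None else inherited)

    walk(root, None)
    return repeats

doc = '<paragraph lang="en"><sentence lang="en"/><sentence lang="de"/></paragraph>'
flagged = gratuitous_repeats(doc)
```

Here only the first sentence is flagged: its language is already stated on the paragraph, whereas the second sentence legitimately overrides it.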
5. Conclusion
Starting our exploration of XML semantics from the perspective of all syntactically
valid schemas has allowed us to formalize some semantic traits shared by most
widely used schemas. These traits are easy to overlook, but are of great significance
when assessing how useful a schema might be for archival purposes, and when
reverse-engineering the interpretation of schemas that have been used for archival
purposes but for which adequate documentation
is lacking. This may also have implications for
ongoing work towards machine-interpretation of XML semantics. If an XML document uses
a
schema that conforms to our proposed archive standards, stronger statements can be
made
about the relationship between the elements in that document. The fact that the
syntactic hierarchy of elements is compatible with a real-world semantic hierarchy,
in
combination with the other generalizations that we have made about archive-appropriate
XML semantics, facilitates the development of automatizable processes of analysis,
and
enables developers to bring to bear existing tools used for classifying the real
world.
References
[chatti2007] Chatti, Noureddine; Suha Kaouk, Sylvie Calabretto and Jean
Marie Pinon. "MultiX: an XML based formalism to encode multi-structured documents"
In
Proceedings of Extreme Markup Languages 2007.
http://conferences.idealliance.org/extreme/html/2007/Chatti01/EML2007Chatti01.html
[derose1990] DeRose, S. J., Durand, D. G., Mylonas, E., and Renear A. H.
(1990), 'What is Text, Really?', Journal of Computing in Higher Education, 1.2:
3-26. doi:https://doi.org/10.1007/BF02941632
[onlinetym] Online Etymology Dictionary. "Symbol". Accessed 15 April
2010. http://www.etymonline.com/index.php?term=symbol
[piez2001] Piez, Wendell. "Beyond the “descriptive vs. procedural”
distinction." In Proceedings of Extreme Markup Languages 2001.
http://conferences.idealliance.org/extreme/html/2001/Piez01/EML2001Piez01.html
[piez2002] Piez, Wendell. "Human and Machine Sign Systems." In
Proceedings of Extreme Markup Languages 2002.
http://conferences.idealliance.org/extreme/html/2002/Piez01/EML2002Piez01.html
[renear1993] Renear, Allen; Elli Mylonas, and David Durand. "Refining
our Notion of What Text Really Is: The Problem of Overlapping Hierarchies."
http://www.stg.brown.edu/resources/stg/monographs/ohco.html
[renear2002] Renear, Allen; David Dubin, and C.M. Sperberg-McQueen.
"Towards a semantics for XML markup". Proceedings of the 2002 ACM symposium on Document
engineering. doi:https://doi.org/10.1145/585058.585081
[sperbergmcqueen2000] Sperberg-McQueen, C.M.; Claus Huitfeldt, Allen
Renear. "Meaning and interpretation of markup." Markup Languages: Theory &
Practice 2.3 (2000): 215-234. http://cmsmcq.com/2000/mim.html. doi:https://doi.org/10.1162/109966200750363599
[teip4] Text Encoding Initiative: The XML Version of the TEI Guidelines:
5 The TEI Header. Accessed 15 April 2010.
http://www.tei-c.org/cms/Guidelines/P4/html/HD.html
[W3C2008] Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C
Recommendation 26 November 2008
http://www.w3.org/TR/2008/REC-xml-20081126/