How to cite this paper
Sperberg-McQueen, C. M. “Text. You keep using that word ….” Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). https://doi.org/10.4242/BalisageVol19.Sperberg-McQueen02.
Balisage: The Markup Conference 2017
August 1 - 4, 2017
Balisage Paper: Text. You keep using that word …
C. M. Sperberg-McQueen
Founder and principal
Black Mesa Technologies LLC
C. M. Sperberg-McQueen is the founder and
principal of Black Mesa Technologies, a consultancy
specializing in helping memory institutions improve
the long term preservation of and access to the
information for which they are responsible.
He served as editor in chief of the TEI
Guidelines from 1988 to 2000, and has also served
as co-editor of the World Wide Web Consortium's
XML 1.0 and XML Schema 1.1
specifications.
Copyright © 2017 by the author.
Abstract
Every data representation constitutes a data interpretation. What are SGML, XML, and
other tools for descriptive markup telling us about the nature of text?
Table of Contents
- The OHCO model and the OHCO thesis
- OHCO and the SGML/XML specs
  - The privacy of your own CPU
  - A graph, not a tree
  - What about CONCUR?
- OHCO and Standard Average SGML/XML
  - So what's the alternative to OHCO?
  - OHCO as liberation
  - Those who have ears to hear, let them hear
- What goes in the model, what goes in the application?
Some of you will recognize my title Text. You keep using that word …
as an allusion to a line in the movie The Princess Bride. One character persistently misuses the word inconceivable
, and eventually elicits from the character Inigo Montoya, played by Mandy Patinkin,
the remark You keep using that word. I do not think it means what you think it means.
I've been thinking about the word text
a lot recently — more than usual, I mean — and about the model of text entailed by
SGML and XML and related technologies. One reason perhaps is that the paper by Ronald
Haentjens Dekker and David Birnbaum [Haentjens Dekker and Birnbaum 2017] refers back to a paper published a couple of decades ago (1990, to be exact) by
Steve DeRose and Allen Renear and a couple of other people under the title What is Text, Really?
[DeRose et al. 1990]. That paper comes up a lot in discussions of the textual model of SGML and XML.
And whenever I see discussions of the textual model of SGML and XML and of the way
this paper What is Text, Really?
defines it, I find myself thinking No, it's not really quite like that. It's a little more complicated. I don't think
that any of those words mean what you think they mean.
The OHCO model and the OHCO thesis
The paper is well worth reading even at the distance of 27 years. It describes a
model of text as an ordered hierarchy of content objects. The authors abbreviate this concept O-H-C-O,
so the paper is often referred to as the OHCO paper. The thesis that text is an ordered
hierarchy of content objects is frequently referred to as the OHCO thesis.
The paper defines — or at least sketches — the model; it briefly describes some
alternative models; it argues for the superiority of the OHCO model. The 1990 paper
does not go quite as far as the 1987 paper by some of the same authors [Coombs et al. 1987], in which the claim is made that SGML is not only the best available model of text,
but the best imaginable model of text. (Perhaps between 1987 and 1990 some of the
authors strengthened their imaginations? We all learn; we all improve.) But the
1990 paper does argue explicitly for the superiority of the OHCO model to the available
alternatives.
In the time since, it has frequently been claimed that OHCO is the philosophy underlying
SGML and XML and most applications of SGML and XML, so the OHCO paper comes up frequently
in critiques of the Text Encoding Initiative Guidelines [hereafter TEI
], sometimes in the form of a claim, more often in the form of a tacit assumption,
that the OHCO model underlies TEI. When I read such critiques I find myself asking
Is OHCO really the philosophy of the TEI? Is the OHCO model really the model entailed
by SGML and XML?
On mature consideration my answer to this question is: Yes. No. Maybe, in part. Yes, but …. No, but ….
Perhaps I should explain.
First, let's recall that yes
and no
are not the only possible answers to a yes/no question. Sometimes the line between
agreement and disagreement is complicated. Allen Renear once asked Can we disagree about absolutely everything? Or would that be like an airline so
small that it had no non-stop flights?
There are many possible stages between belief and disbelief, or assent and refutation.
A second difficulty is that the authors of the OHCO paper don't actually define the
terms hierarchy,
ordered hierarchy,
or content object.
Not every reader is certain that they understand what is meant by the claim that
text is an ordered hierarchy of content objects. And those readers — and they seem
to be the majority — who do feel confident that they understand what is meant clearly
don't always understand the same things.
I have held the lack of definitions against the authors for some time, but thinking
about it more recently, I have concluded there is probably good reason that they don't
nail the terms down any more tightly than they do. Hierarchy
and ordered hierarchy
seem reasonably clear, even if we can quibble a bit about the edges. Content object
is a pointedly vacuous term; to some readers it may suggest objects like chapters
and paragraphs and quotations, and those are in fact some sample content objects that
the authors mention. But we can also interpret a content object
as an object that constitutes the content of something else
in which case it is as pointedly vacuous and possibly as intentionally vacuous as
the SGML and XML technical terms element
or entity
. Those terms are vacuous for a reason, vague for a purpose, vague because it is
you, the user — or we, the users — who are to fill them with meaning by determining
for ourselves what we wish to treat as elements or entities. There is in SGML and
XML and perhaps in the OHCO model no a priori inventory of concepts which are held to be appropriate for inclusion in a marked up
document. The choice of concepts is left to the user. That is one of the crucial
points of SGML.
The third difficulty I have in deciding how to answer the question is that we need to decide what it means to say that OHCO is the philosophy underlying
SGML and XML. What do we mean by SGML or XML? There are at least three things we
could plausibly mean. In the narrow sense, when people talk about SGML I frequently
think that they mean the text of the specification itself, ISO 8879. And when they
talk about XML, I assume they mean the text of the XML specification as published
by the W3C. Sometimes, when I'm feeling particularly expansive, I may think they mean
the XML spec and related specs.
But (possibly because of my personal history) I do tend to assume that references
to SGML and XML mean (or perhaps I mean that they ought to mean) the specs and the specs' texts themselves, independent of implementations
and independent of related technologies.
But frequently, when people are talking about the textual model or anything else
implicit in SGML or XML, what they have in mind is not the specifications themselves
but what the creators of those specifications were trying to accomplish, or to be
more precise what we now think that they were then trying to accomplish. Our ideas
of what they were then trying to accomplish may or may not be based on asking them.
And if we do ask them today what they were trying to accomplish then, we must bear
in mind that they may not know. Or they may know and not want to tell us.
The third usage, probably the most frequent when people are talking about what is
implicit in the use of SGML and XML, is what one might call standard average SGML
, which includes but goes perhaps a little beyond the kind of commonalities that were
identified by Peter Flynn the other day [Flynn 2017]. By standard average SGML
I mean the set of beliefs, propositions, and practices that would be your impression
of what the technology entails if you hung around a conference like this one or like
the GCA SGML conferences of the late 1980s and early 1990s. It is quite clear that
not everybody attending or speaking at those conferences believed the same things,
but there was a certain commonality of ideas, and that commonality is not actually
too far from the beliefs and propositions expressed in the OHCO paper.
OHCO and the SGML/XML specs
But, of course, as I say, because of my personal history I keep coming back to the
text of ISO 8879 and the text of the XML spec, and in consequence I tend to resist
the idea that SGML (the SGML spec), or even XML (that is, the XML spec), entails the
idea that text consists of a single hierarchical structure. There are several reasons
for this. I don't resist the idea that there is a model of text intrinsic to the
specs. Specs do embody models, and we need careful hermeneutics of specs. But if
we want to interpret the worldview of a spec, we need to pay close attention to the
words used in the spec because words are what specs use to embody their views. It
is no part of careful hermeneutics to offer an interpretation of a text which ignores
the details of the text.
The privacy of your own CPU
ISO 8879 and XML define serialization formats. They define a data format; I would
say they defined a file format, except that ISO 8879 notoriously does not use the
term file
. (That's one of the things that made it so desperately difficult to understand.)
I sometimes suspect that ISO 8879 didn't use the word file
because the editor of the spec, who had spent years working on IBM mainframes, really
wanted to keep the door open to partitioned data sets. (Partitioned data sets may
be simplistically described as files within files; Unix tar files and [except for
the compression] ZIP files are roughly similar examples in other computing environments).
The avoidance of the term file
turned out to be very handy later, because when the Internet became more important
everything in the SGML spec could be applied without any change in the spec at all
to a stream of data coming in over a networking socket. If ISO 8879 had said this is what's in the file
there would certainly have been language lawyers who said Well, a socket is not really a file so it doesn't apply; you can't use SGML or XML
for that.
The very abstract terminology of ISO 8879 is perhaps a mixed blessing. It made
the spec much harder to understand in the first place, but it achieved a certain generality
that went beyond what some readers (me, for example) might have expected.
Now, the usual processing model for a format like that is that the spec defines
what goes over the wire, what comes in over the socket, or what's in the file, and
not processing. Software that processes the data is going to read the file, parse
it, build whatever data structures it wants to build, do whatever computations it
wants to do, and serialize its data structures in an appropriate way. Even if it is
going to write its output in the same format, the data structures used internally
will not necessarily be closely tied to the input or output format. There is no necessary
connection, and certainly no tight connection, between the grammar of the format and
the form of the data structures. What you do in the privacy of your own CPU is your
business.
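
As a concrete, if trivial, illustration of that processing model (parse the serialized form, build whatever private structures suit the task, compute, re-serialize), here is a minimal Python sketch; the sample document, the word-count structure, and the added attribute are all invented for the example.

    import xml.etree.ElementTree as ET

    xml_in = "<doc><p>one two</p><p>three</p></doc>"

    tree = ET.fromstring(xml_in)            # parse the wire/file format

    # What happens next is private to this program: any data structure will do.
    word_counts = {p: len(p.text.split()) for p in tree.iter("p")}

    for p, n in word_counts.items():        # arbitrary computation on the private structure
        p.set("words", str(n))

    xml_out = ET.tostring(tree, encoding="unicode")   # serialize back out
    print(xml_out)   # <doc><p words="2">one two</p><p words="1">three</p></doc>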
In this, be it noted, SGML is very different from the kind of database models that
were contemporary at the time it was being developed. Database models, as their name
suggests, specify a model, to which access is provided via an API. Implicitly,
or abstractly, database models define (abstract) data structures. SGML avoids doing
so. It provides no API, and it requires no particular abstract data structure. Both
specs are rather vague and general on the question of how the data are to be processed.
The XML spec does say that we (the authors of the spec) conceive of a piece of software
called an XML parser reading XML data and handing it in some convenient form to a
downstream consuming application. That's as concrete as we got. ISO 8879 is not
even that concrete; there is nothing in ISO 8879 that says there will be a general-purpose
SGML parser and that it will serve client application software. ISO 8879 is compatible
with that view, but it's also compatible with the view that guided by an SGML Declaration
and a Document Type Definition, users of that DTD will write software that reads data
that conforms to that DTD. If I remember correctly, this is how the Amsterdam SGML
Parser worked [Warmer / van Egmond 1989]. It did not parse arbitrary SGML data coming in; you handed it an SGML Declaration
and a Document Type Definition, and it handed you back a piece of software that would
parse documents valid against that Document Type Definition. There was no general
SGML application involved in the Amsterdam SGML Parser. It was a parser generator,
as it were, a DTD compiler; general-purpose SGML parsers could in contrast be classified
as DTD interpreters.
There is also no assertion in ISO 8879 that it should be used as a hub format. It's
entirely possible to view it as a carrier format, to use the distinction that Wendell
Piez introduced on Monday [Piez 2017]. This has both technical and political implications. For political reasons, it
is frequently easier to persuade people to tolerate a format if they think Well, it's just a carrier format. It doesn't compete with my internal format. I
have my own format which I use internally, and I can use this as a convenience to
simplify interchange with other people. It doesn't require that I change anything
at the core of my system, it only affects the periphery.
A graph, not a tree
The second reason to be skeptical of the claim that SGML embodies the OHCO thesis
is an insight I owe to Jean-Pierre Gaspart, who wrote probably the first widely available
SGML parser. He used to make emphatic reference to the idea that SGML does not define
S-expressions. I infer that he may have been a LISP programmer by background. Those
who aren't LISP programmers may want some clarification. An S-expression in LISP
is either an atomic bit of data (a number, a string, an identifier, ...) or a parenthesized
list of whitespace-separated S-expressions; since parentheses can nest, S-expressions
can nest, and if you restrict yourself to S-expressions you build trees … and nothing
but trees. Arbitrary directed graphs cannot be built with S-expressions because S-expressions
have no pointers. In Gaspart's account, SGML does not define a data format which
is isomorphic to S-expressions because SGML does have pointers; it has IDs and IDREFs.
It follows that if there is a single data structure intrinsic to ISO 8879 and XML,
it is the directed graph, not the tree. The portion of that graph captured by the
element structure of the document is indeed a tree, which means that the data structure
intrinsic to SGML and XML is a directed graph with an easily accessible spanning tree.
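
A small Python sketch of that claim, using generic id and target attributes as stand-ins for declared ID and IDREF attributes (the document and attribute names are invented for the example): the element structure yields the spanning tree, and the references add the remaining arcs of a directed graph.

    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        '<doc>'
        '<note id="n1">See <ref target="n2"/> below.</note>'
        '<note id="n2">See <ref target="n1"/> above.</note>'
        '</doc>')

    arcs = []

    # The element structure gives us a tree: one "contains" arc per parent/child pair.
    for parent in doc.iter():
        for child in parent:
            arcs.append(("contains", parent.tag, child.tag))

    # ID/IDREF-style links add further arcs, and with them cycles become possible.
    elements_by_id = {el.get("id"): el for el in doc.iter() if el.get("id")}
    for el in doc.iter():
        target = el.get("target")
        if target in elements_by_id:
            arcs.append(("refers-to", el.tag, elements_by_id[target].tag))

    print(arcs)
    # [('contains', 'doc', 'note'), ('contains', 'doc', 'note'),
    #  ('contains', 'note', 'ref'), ('contains', 'note', 'ref'),
    #  ('refers-to', 'ref', 'note'), ('refers-to', 'ref', 'note')]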
And (if I may refer to yet another old-timer) Erik Naggum, the genius loci of comp.text.sgml for many years, used to insist (in ways that I didn't always find
terribly helpful) that SGML does not define trees. He would deny with equal firmness
that it defines graphs; he meant that SGML defines a sequence of characters, and the
data structures you build from SGML input are completely independent of it, which, again,
is true enough.
What about CONCUR?
Finally, of course, if the view of text in SGML were that text has a single hierarchical
structure, there would be no explanation for the feature known as CONCUR
. CONCUR
, for those of you who haven't used it, allows a single SGML document to instantiate
multiple document types, each with its own element structure. With CONCUR, an SGML
document has, oversimplifying slightly, two or three or ninety-nine element-structure
trees and directed graphs drawn over the same sequence of character data.
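
To make the idea concrete without CONCUR syntax, here is a small Python sketch (the labels and offsets are invented) in which two independent hierarchies, one logical and one physical, are drawn as labelled spans over the same character data.

    text = "Rosencrantz and Guildenstern are dead."

    # Each "document type" is just its own set of labelled spans over the one text.
    logical  = [("sentence", 0, len(text)), ("name", 0, 11), ("name", 16, 28)]
    physical = [("line", 0, 25), ("line", 25, len(text))]   # an invented line break

    def open_spans(structure, position):
        """Which labelled spans of one structure contain a given character offset?"""
        return [label for (label, start, end) in structure if start <= position < end]

    print(open_spans(logical, 20), open_spans(physical, 20))
    # ['sentence', 'name'] ['line']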
For all of these reasons, I tend to bridle at the suggestion that ISO 8879 embodies
the thesis that text consists of a single ordered hierarchy of content objects. Certainly,
the text of ISO 8879 limits documents neither to a hierarchy (as opposed to a directed
graph) nor to a single structure (as opposed to multiple concurrent structures).
It is certainly true that for most of the goals that the SGML Working Group had and
discussed in public, a single hierarchy would probably suffice. And historically
it's well-attested that CONCUR
was a very late addition to the spec, added at a time when the Office Document Architecture
[hereafter ODA
], the big competition within ISO, was making great play of the fact that they could
describe both the logical structure of a document and its formatted structure. In
order not to be demonstrably less capable than the Office Document Architecture, it
was essential that the SGML spec have something analogous. At this point, it's worthwhile
to notice an important property of what the Working Group did and did not do.
They did not say Okay, you can have two structures, a logical structure and a physical structure.
In a move which I can only explain as a touching instance of blind faith in the
design principle of generality, the Working Group said you can have more than one
structure. They didn't say that one structure is logical and one is physical; they
didn't say anything about what one might want to use multiple structures for, and
they didn't supply an upper limit to the number of structures. (Even if all you want
is a logical structure and a physical structure, a document might have multiple physical
structures if it is sometimes rendered on A4 paper and sometimes on 5x8 book pages.)
In the same way that they avoided using the word file,
they avoided telling you what these multiple hierarchies were for.
The result, of course, is that the introductory part of the Office Document Architecture
spec was a joy to read. It was clear, it was concrete — I loved it. The authors
nailed things down, they were specific, they said exactly what they intended, they
didn't say some image format or other
, they specified what image format ODA processors would support. And so on. The
only problem was that by the time I had saved the money to buy a copy of the ODA spec
and was reading it and enjoying its concreteness, the spec was technologically obsolete.
No one in their right minds would by that time have chosen that format for graphics and that format for photographs; it was just crazy. The ODA group had driven stakes into
the ground, and then the tide had moved the shoreline, and they were high and dry,
and they were not where they wanted to be. There were a number of ODA implementation
efforts, but I'm not sure that any of them was ever completed, because by the time
the implementation was nearing completion, it was easy to lose interest, because no
user was going to want to use it.
SGML was wiser in a way. The SGML WG held things like graphics and photographic formats
at arm's length. Instead of prescribing photo formats for SGML processors to support
(which would seem to be a good idea for interoperability), they provided syntax to
allow the user to declare the notation being used, and syntax for referring to an
external non-SGML entity. The details of how an SGML system used your declaration
to find appropriate processing modules are completely out of scope for SGML, with the
consequence that SGML was compatible both with the formats that were contemporary
with its development and with all the many, many formats that came later.
OHCO and Standard Average SGML/XML
So what's the alternative to OHCO?
On the other hand, OHCO does seem a fairly good description of the view of
text that I remember from SGML conferences. That is, it matches up pretty well with
Standard Average SGML, even if not with the text of the spec. When I first started
working with SGML, I spent a lot of time consciously training myself to think about
documents as trees. Now, why did I do this? Because up until then I had used alternative
textual models, and it's worth pointing out that none of them was really a graph model.
One of the main alternative models of text available to us before ISO 8879 was text
as a series of punchcards or punchcard records. (I won't bother trying to explain
why I think that was a problematic model of text.) It was an improvement when someone
introduced the notion that text is simply a string of characters. But, you know,
text isn't really a string of characters; if you just have a string of characters,
you can't even get decent output.
If you want decent output, you have to control the formatting process. The next advance
in the modeling of text was that text is a string of characters with interspersed
formatting instructions. If you want a name for this model, I call it the text is one damn thing after another
(ODTAO) model. Imagine that you have indexed the text. Using the index, you search
for a word and find an instance of the word somewhere in the middle of the document.
In order to display the passage to the user, you want to know where you are in the
text, what page you are on, what act and scene you are in, what language the current
passage is in. And if we're talking about a text with embedded formatting instructions,
you also need to know what the current margins are and what the current page number
is. That means essentially that you want to know which of the instructions in the
document are still applicable at this point. There is a simple way to find out; it
is the only way to find out. You read the entire document from the beginning, making
note of the commands that are in effect at the moment and the moment at which they
are no longer in effect, and when you reach the point that your index pointed you
at, you know what commands are still in effect at this point. You have now lost every
advantage that using the index gave you in the first place because the whole point
of an index is to avoid having to read the entire document from the start up to the
point that you're interested in, in order to find things out.
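
A tiny Python sketch of that predicament, with an invented document and invented state-setting instructions: even after an index has told us where the interesting item is, finding the state in force there still means replaying everything before it.

    # The document is "one damn thing after another": text runs and state changes.
    document = [
        ("set", "act", "I"),
        ("text", "Scene the first ..."),
        ("set", "margin", "2cm"),
        ("set", "act", "II"),
        ("text", "the passage our index found"),   # suppose the index points at item 4
        ("text", "more text ..."),
    ]

    def state_at(doc, target):
        """Replay the document from the top to learn what is in effect at item target."""
        state = {}
        for position, item in enumerate(doc):
            if position == target:
                break
            if item[0] == "set":
                _, key, value = item
                state[key] = value
        return state

    print(state_at(document, 4))   # {'act': 'II', 'margin': '2cm'}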
I should mention one other alternative model that may explain why so many of us grasped
at SGML as if we were reaching for a life ring. I recently encountered a description
of a pre-SGML proprietary system for descriptive markup. Like SGML, it allowed you
to define your own fields. Unlike SGML, it was a binary format; there was no serialization
form. There was no validation mechanism, so it was essentially just a data structure.
I won't try to describe all the details of the data structure (and couldn't if I wanted
to), but anyone for whom the following sentence is meaningful will have an immediately
clear idea of the essentials: it was a MARC record with names instead of numbers
for the fields.
Now, those of you who don't know what MARC records are like need to know that the MARC
record was invented by a certifiable genius (Henriette Avram) who in the 1960s analysed
the exceptionally complicated structure of library catalog data and found a way to
represent it electronically in a machine-tractable form. Unfortunately, the machine-tractable
form that she came up with has made many a grown programmer gnash their teeth and
try to tear their own hair out. First of all, MARC defines units called records
; everything is either a record or a collection of records. A record consists of
a header, which has a fixed structure so it can be read automatically; you have a
general purpose header reader that reads the header, and it then knows what's in the
record. In the case of MARC, the header is a series of numbers which identify specific
fields (types of information), followed by pointers that identify the starting position
and length of the field within a data payload which follows the header and makes up
the rest of the record.
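
The following is not real MARC, but a deliberately simplified, MARC-like sketch of the arrangement just described: a header of (field, offset, length) entries pointing into a payload that follows. The record contents are invented.

    header  = [("title", 0, 18), ("author", 18, 16), ("date", 34, 4)]
    payload = "The Sun Also Rises" + "Ernest Hemingway" + "1926"

    def field(record_header, record_payload, name):
        """Use the header pointers to pull one field straight out of the payload."""
        for tag, offset, length in record_header:
            if tag == name:
                return record_payload[offset:offset + length]
        return None

    print(field(header, payload, "author"))   # Ernest Hemingway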
Now this arrangement has some beautiful properties; you can read a record in from
tape and process it easily. There are pointers, so it's easy to get at any portion
of the record that you're interested in to process it, and you never have to move/copy
things around in memory. If you're editing a MARC record, you can just add more data
to the end of the payload and change the pointers and leave the old data as dead data
(like junk DNA). And then if you want, you can clean it up later before you write
it back up to tape or to disk. (In fact most of the first MARC processors used tapes,
not disks, because disks weren't big enough.)
Notice that with such a record structure there is no problem with overlapping fields.
There's nothing to prevent two different items in the header from pointing at overlapping
ranges. I don't think anyone in their right minds ever tried to use overlapping fields,
but there's nothing in the specification itself to prevent it.
OHCO as liberation
Compared to these models, a model of text based on user-specified content objects
was almost guaranteed to feel like an improvement, especially if those content objects
were suitably descriptive and generic. A model based on hierarchical structure is
easier to work with, visualize, and reason about than the one-damn-thing-after-another
model, and hierarchical structures allow a richer description of textual structure
than a format based on flat fields within a record. If the hierarchy allows arbitrarily
deep nesting, it is a much better match for text as we see it in the wild than the
fixed-level document / paragraph / character hierarchy on which most word-processing
software is built. And of course a model which keeps track of ordering is essential
for text, in which order is by default always meaningful, in contrast to relational
database management systems in which the data are by default always unordered.
A tree or graph also makes it easier to identify and understand the relevant context
at any point in the document, if you jump into the document by following an index
pointer. In a tree or graph, you can just consult the nodes above you in the graph
to know your context. Each ancestor node may be associated with semantics of various
kinds (whether those are abstractions like act
and scene
or formatting properties like margin settings) and will tell you something about
the current environment. Assuming a reasonably coherent usage of SGML elements, that
will probably give you the information you're looking for. N.B. This is not the only
way to use SGML elements; if you're using milestone elements, you're back to reading
from the beginning of the document. But in most SGML applications, thinking of text
as a tree instead of as a set of flat fields or as a series of one damn thing after
another was a huge step forward.
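
A minimal sketch of that contrast in Python (the play, its attributes, and the plain lang attribute are invented stand-ins): once the text is a tree, the context of a node found through an index is just the chain of its ancestors.

    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        "<play><act n='1'><scene n='2'>"
        "<sp lang='es'><l>No hay banda!</l></sp>"
        "</scene></act></play>")

    # ElementTree keeps no parent pointers, so build a child-to-parent map once.
    parent_of = {child: parent for parent in doc.iter() for child in parent}

    def context(node):
        """Report each ancestor (including the node itself) and what it tells us."""
        chain = []
        while node is not None:
            chain.append((node.tag, dict(node.attrib)))
            node = parent_of.get(node)
        return chain

    line = doc.find(".//l")    # suppose an index search landed us here
    print(context(line))
    # [('l', {}), ('sp', {'lang': 'es'}), ('scene', {'n': '2'}),
    #  ('act', {'n': '1'}), ('play', {})]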
So, OHCO had a great appeal to me. But not really because I thought text was intrinsically
a single ordered hierarchy of content objects. I thought that a single ordered hierarchy
of content objects would be a good way to model text even though text by nature was
slightly more complicated. That is to say, I more or less agreed with the character
in Hemingway's The Sun Also Rises who, when someone suggests something to him, says Oh, wouldn't it be pretty to think so?
[Hemingway 1926] Text may not be an ordered hierarchy of content objects, but wouldn't it be pretty
to think so, or with the Italians se non è vero, è ben trovato.
Now, as formulated, the OHCO paper makes a clear and refutable claim: we think that our point can be scarcely overstated: text is an ordered hierarchy of content objects.
What really bothers me about that sentence is the singular article an
. I'm pretty sure, however, that this is a case of even forward thinkers not being
able to capture exactly what they mean. This can happen because, especially when
you're trying to persuade other people, there's a limit to exactly how far you can
go in your formulation. Some of you will remember the columnist Jerry Pournelle,
who in the 1980s promoted a shift away from mainframe computing to personal computing
with the slogan One user, one CPU.
Within ten years he found himself having to explain I didn't mean maximally one CPU; I meant at least one CPU.
I discovered, re-reading the OHCO paper recently, that the authors didn't actually
argue that text is at most one ordered hierarchy of content objects; they explicitly
claim, in a discussion of future developments, that many documents have more than one useful structure.
They observe that Some structures cannot be fully described even with multiple hierarchies, but require
arbitrary network structures.
And they point out that version control and the informative display of document version
comparisons will pose great challenges.
Those who have ears to hear, let them hear
I think the OHCO thesis can be regarded perhaps as an attempt to capture what was
new — not everything in ISO 8879, but what was new and liberating. There are many
things it doesn't address; I've mentioned some of them. I haven't mentioned something
that I never thought of as terribly important but which I know the editor of ISO 8879
felt was important: namely, the more-or-less complete orthogonality of the logical
structure from the physical organization of the document into entities. XML limited
that orthogonality; XML requires every element to start and end in the same entity.
That degree of harmony between logical and physical structure was not required in
ISO 8879; we lost a certain amount of flexibility at that point. I personally have
never missed it. I don't quite know why WG8 thought it was an advantage to have it
(or why that particular member of WG8 thought so), but anyone who prefers complete
orthogonality of storage organization and logical structure will think of that rule
of XML as a step backwards.
The things that I think the OHCO thesis captures well are
-
the focus on the notion of the essential parts of a document
-
the notion that those essential parts might vary with the application
-
the notion that it was the user's right and responsibility to decide what the essential
parts of the document are
-
the separation of document content and structure from processing, which led allegedly
to simpler authoring and certainly to simpler production work, simpler generic formatting,
more consistent formatting, better information retrieval,
easier compound documents, and some notion of data integrity.
Those are the kinds of things that SGML and XML have that serve to make it possible
to do the kinds of cool things we've heard about here in several talks: for example
Murray Maloney's multiple editions of Bob Glushko's book on information [Maloney 2017], or Pietro Liuzzo's project on Ethiopic manuscripts [Liuzzo 2017]. Anne Brüggemann-Klein and her students have shown us that a sufficiently powerful
model of text, as instantiated not just in XML but in the entire XML stack, can handle
things that are not at all what we normally think of as text [Al-Awadai et al. 2017]. Each project in its way, I think, is a tour de force.
But to be fair, not everybody cares about all of these things. Some people are not
impressed when things shift from being impossible to being possible; they want them
to be easy, or they're not interested. And that may be why some of the things that
we were excited about 30 years ago, we're not excited about now. Relatively few users
were worried then about overlap; very few people cared. They voted with their feet
for implementations of SGML that did not implement that optional feature. I sometimes
think that the reason the OHCO paper focuses where it does is that it was trying to
say things people could understand and trying to avoid saying things that people were
not ready to hear. It doesn't help much to say things that people are not ready to
understand, although occasionally someone will remember thirty years later, the way
I now remember Jean-Pierre Gaspart telling us all that SGML is not just an S-expression
with angle brackets instead of parentheses. It was thirty years before I understood
what he was talking about, but now I think I understand.
What goes in the model, what goes in the application?
If we step back and ask how various models of text compare with each other — OHCO,
what is actually in ISO 8879, hypertext the way Ted Nelson defines it, hypertext the
way HyperCard defines it, the one-damn-thing-after-another model — I think OHCO looks
pretty good compared to most of its contemporaries. It's probably true, however,
that a more general graph structure might be better; it would certainly be more expressive.
The design of a model is sometimes a tradeoff among competing factors like expressive
power, simplicity for the users, and simplicity for the implementors. There are reasons
to want to get as much information as possible into the model. If you can capture
a constraint in a model and if you have generic software to enforce it, then every
application built on top of the data gets the preservation of that constraint for
free. This is why it makes perfect sense for relational databases to allow you to
declare referential integrity constraints and to enforce them: if you declare such
constraints, then the database enforces them (unless you were using MySQL a few years
ago, when it didn't), and you don't have to ensure that every application program
that reads your database is careful about those constraints.
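
A small illustration of that division of labour, using Python's built-in sqlite3 module (the tables are invented; note that SQLite, like the MySQL of that era, leaves foreign-key enforcement off unless you ask for it):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # enforcement is off by default in SQLite

    conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT,"
                 " author_id INTEGER REFERENCES author(id))")

    conn.execute("INSERT INTO author VALUES (1, 'Hemingway')")
    conn.execute("INSERT INTO book VALUES (1, 'The Sun Also Rises', 1)")    # fine

    try:
        conn.execute("INSERT INTO book VALUES (2, 'Ghost Book', 99)")       # no such author
    except sqlite3.IntegrityError as error:
        print("rejected by the model, not by application code:", error)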
A richer model provides, in this way, a safer environment in which to program. Everything
that's not in the model, everything that's not expressed formally in an enforceable
way, is something that every application program you write has got to be careful about.
So the more tightly and formally your model can describe what is correct, the cleaner
your data can be. It may be true, as Evan Owens told us on Monday, that you will
never know 100% of the ways that authors can get things wrong, but over time you can
(if you pay attention) learn more and more of them, and if you can get them into the
schema, you can protect your downstream software … if you have formalisms that allow
you to check those things [Owens 2017].
Further, it has been demonstrated by several papers here, including the ones by David
Lee and Matt Turner, that the fuller the model is, the more information it has about
the data, the more interesting things it can do by itself without much input from
us [Lee 2017, Turner 2017]. And processors that know more about the data can optimize better and more safely
than processors that don't know anything about the data. So, it's better that things
go into the model — up to a certain limit. The countervailing argument is that sometimes
it's better to have less elaborate models, because simpler models are easier to use
and understand and easier to support in software, and many applications don't in fact
need complex constraints. If constraints are hard to implement or slow things down,
then implementors may omit them, or users may turn them off, the way some users turn
off referential integrity constraint checking even in databases that support it, because
they would rather have fast, wrong software than correct, slow software. That's a
choice they get to make.
I notice there's a relation here between modeling and whether the format being defined
is intended as a carrier or a hub. Formats that don't impose tight constraints may
be better for carrier format functions. If a formalism or model is opinionated and
says This is the way it's got to be,
it's going to make it easier to do interesting things with the data that obeys those
constraints, and it may be what you want for a hub format, whereas a more cynical
format that doesn't really have any strong convictions but just allows anything to
happen, like the carrier format that Wendell Piez was talking about (viz. HTML) [Piez 2017], may be better for carrier functions. Both SGML and many SGML applications like
TEI were kind of vague about whether they expected to be used as a hub format or a
carrier format. I think that was partly for political reasons, partly because the
distinction may not have been clear to us all at the time.
There's another instructive example that we can spend a moment on, I think. In the
1950s, programming languages were defined in prose. And writing a parser for a programming
language meant struggling with the prose of the spec and trying to figure out what
on earth it meant in the corner case that you were currently facing in your code.
And the nature of human natural-language prose being what it is, different readers
occasionally reached different conclusions. This is one reason that when in 1960
the Algol 60 Report came out and introduced the notation called Backus-Naur Form
[hereafter BNF
] to provide a formal definition of the syntax, computer scientists were, as far as
I can tell, immediately won over to the formalism. A concise formal definition of the syntax makes it possible to make inferences from the notation.
One could know how the corner case was supposed to be handled, assuming that the grammar
was correct — and the grammar was by definition correct, so you were home free. Computer
science spent the next ten or twenty years developing one method after another to
go systematically from a formal definition of a grammar in BNF, or later in extended
BNF [hereafter EBNF
], to a parser, eventually automatically, so you could just write
a grammar, run a program on that grammar, and have a parser. So pretty much every
programming language now provides a grammar of the language in BNF or EBNF.
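
As a reminder of how direct that path from grammar to parser can be, here is a hand-written recursive-descent parser for a tiny invented grammar (expr ::= term { "+" term }; term ::= NUMBER | "(" expr ")"), each rule becoming one function.

    import re

    def tokenize(source):
        return re.findall(r"\d+|[+()]", source)

    def parse_expr(tokens, i=0):
        # expr ::= term { "+" term }
        value, i = parse_term(tokens, i)
        while i < len(tokens) and tokens[i] == "+":
            right, i = parse_term(tokens, i + 1)
            value += right
        return value, i

    def parse_term(tokens, i):
        # term ::= NUMBER | "(" expr ")"
        if tokens[i] == "(":
            value, i = parse_expr(tokens, i + 1)
            return value, i + 1            # step over the closing ")"
        return int(tokens[i]), i + 1

    print(parse_expr(tokenize("(1+2)+3"))[0])   # 6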
But the formalism doesn't capture everything. Not every string of characters that's
legal against the Algol 60 grammar is a legal Algol 60 program. And the same is true
for any other programming language that's more than an intellectual curiosity, because
programming languages are not, in fact, context-free languages; they are context sensitive.
Algol 60 was typed, and if over here you had declared a certain variable as of type
Boolean, you were not allowed to assign it the value 42
over there. But that amounts to saying that the set of assignment statements legal
at any given point is dependent on the context, and context is precisely what a context-free
grammar cannot capture.
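
A minimal sketch of the kind of check that therefore has to live outside the context-free grammar (the declarations and the type rules here are invented): whether an assignment is legal depends on a symbol table built elsewhere, which is exactly the context a context-free rule cannot see.

    # Declarations gathered on an earlier pass over the (hypothetical) program.
    declared_types = {"flag": "Boolean", "count": "integer"}

    def check_assignment(variable, literal):
        declared = declared_types.get(variable)
        if declared is None:
            return f"error: {variable} is not declared"
        literal_type = "Boolean" if isinstance(literal, bool) else "integer"
        if literal_type != declared:
            return f"type error: {variable} is {declared}, cannot assign {literal!r}"
        return "ok"

    print(check_assignment("count", 42))    # ok
    print(check_assignment("flag", 42))     # type error: flag is Boolean, cannot assign 42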
In the preparation of Algol 68, the Dutch computer scientist Adriaan van Wijngaarden
made a concerted effort to fix this state of affairs. In Algol 68, he was determined
to push all those constraints into the formalism. To manage that, he needed, and
he duly invented, a stronger grammatical formalism (known today as two-level grammars
or van Wijngaarden grammars).
He noticed that if there were a finite number of legal identifiers, a context-free grammar
could actually capture the kinds of constraint mentioned above involving declaration
and typing of the variables. And if you have an infinite number of identifiers (as
you do in any realistic programming language), you can manage to express the constraint
if you allow yourself to imagine, not a finite context-free grammar, but an infinite
context-free grammar. So, van Wijngaarden invented infinite context-free grammars.
Now, he didn't want to try to write any infinite grammars down line by line, so he
invented two-level grammars. At one level there is a context-free grammatical base
that has notions called hyper-notions and meta-notions which in turn generate, at
the second level, an infinite number of rules. For any given Algol 68 program, you
can generate a finite subset of the infinite grammar of Algol 68 that suffices for
parsing the particular program before you.
It is a brilliant mechanism; its only flaw is that the grammar is now unreadable.
It is almost certainly impossible for anyone in that Working Group (including, I suspect,
van Wijngaarden himself) to look at the grammar and know for sure whether a given
formulation is or is not a correct expression of the design agreed by the WG on some
particular technical point — because the grammar is too complicated. It's like reading
source code for a parser. There is a good reason that most programming languages
are not defined by reference implementations: it is too hard to tell whether the
reference implementation is correct or not. Now, of course, a reference implementation
is correct by definition, but it's only correct by definition once the Working Group
has said it's correct.
And so there are really not many implementations of two-level grammars. I know of
exactly one, and I think it was a partial implementation. So, in a way, having too
strong a formalism is like going back to the 1950s: you have to study the spec and
try to figure out what it means. People wanted these constraints to be in the grammar
because experience had shown that grammars were easy to understand, but by the time
those constraints are pushed into the grammar — into the model — the model is no longer
easy to understand. There is a tradeoff. Most programming languages now define context-free
grammars, and then they define an additional list of context-sensitive constraints
that you have to meet. You can formalize that, too. Attribute grammars are a way
of formalizing that. Essentially, attribute grammars have a different kind of two-layer
formalism: a context-free grammar, together with a set of rules for assigning attributes
to each instance of a non-terminal and calculating the values for those attributes.
So, perhaps the solution is to have layered models in which each layer individually
is relatively simple, easy to understand, and easy to check, and in which the conjunction
of all layers and their constraints allows you to do things that are more complex.
That's the way programming languages work; Will Thompson showed us a nice example
of the kind of thing I have in mind [Thompson 2017]. The underlying model that he is working with doesn't know anything about redundancy;
he wants to introduce controlled redundancy, so he invents a way of marking the redundancy
and writes a little library that fits on top of the underlying engine and provides
a more complicated model, in which you can have the redundant version of the document
that's easy to retrieve or strip the redundancy for other purposes. It feels like
a very SGML/XML-like thing to do. If the off-the-shelf models and tools don't do
what we need, we can layer what we need on top of them.
Of course, sometimes layers of that kind just feel like work-arounds, like hacks.
Sometimes what you need to do is step back and think things through from the beginning.
The outstanding example at this conference is the paper by Ronald Haentjens Dekker
and David Birnbaum showing what things can look like if you step back and try to re-think
the model of text from the beginning [Haentjens Dekker and Birnbaum 2017]. Their notion of using hyper-graphs as a way to keep the model simpler than other
graph models — brilliant. I don't know how such a model can support the multiple
orderings you propose as a topic for future work; you might need to layer something
on top of it. But it's very exciting work.
Another open question there is how to match the capabilities offered by SGML and XML
that work together so well. SGML and XML formally define a serialization format,
but implicitly they suggest a data structure: an element tree with pointers. And
the element tree in turn suggests a validation mechanism. You can write document
grammars and treat the element structure as an abstract syntax tree for that document
grammar, so you can constrain your data in ways that help you find a number of mechanical
errors automatically. When you re-think things from the ground up, lots of interesting
things become possible. Mary Holstege provided a very challenging but rather exhilarating
example of the kind of considerations that need to go into the re-thinking of a model
or a language — lessons from long ago that may nevertheless still be useful [Holstege 2017].
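
A toy version of that validation idea in Python (the element names and content models are invented, in the spirit of a DTD): each element type gets a regular expression over its children's names, and the element tree is walked to find mechanical slips.

    import re
    import xml.etree.ElementTree as ET

    content_models = {
        "play":  r"act(,act)*",       # one or more acts
        "act":   r"scene(,scene)*",
        "scene": r"sp(,sp)*",
        "sp":    r"l(,l)*",
        "l":     r"",                 # no element children allowed
    }

    def validate(element, errors):
        children = ",".join(child.tag for child in element)
        model = content_models.get(element.tag, "")
        if re.fullmatch(model, children) is None:
            errors.append(f"<{element.tag}> has children [{children}], content model is [{model}]")
        for child in element:
            validate(child, errors)
        return errors

    doc = ET.fromstring("<play><act><scene><sp><l/></sp></scene></act><l/></play>")
    print(validate(doc, []))
    # ['<play> has children [act,l], content model is [act(,act)*]']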
Sometimes the model that we want to formalize is whatever we know how to model formally.
Sometimes it's where we think we found a sweet spot in the tradeoffs between expressive
power and simplicity. Sometimes the model expresses what you can get the people in
the room to agree on.
OHCO captures, I think, pretty well, at least in the wouldn't-it-be-pretty-to-think-so
sense, what most standard average users of SGML could agree on. They didn't all agree
on CONCUR
, or rather they did mostly agree on CONCUR
: they agreed they didn't want it. They didn't mostly think that ID and IDREF were
a fundamental part of the model even though Jean-Pierre Gaspart did. They did think
that trees were important, so all of the tutorials will talk about the tree structure
of XML. This is one reason that people believe that the OHCO model is what motivated
the tree structure in ISO 8879: they read the tutorials rather than ISO 8879 — can we blame
them?
But what we agree on, of course, varies with time and geography, and it changes when
we hear other people who think differently and we argue with them. Sometimes we persuade
each other. And to do that, to hear others and argue with them and persuade or be
persuaded, we come to conferences like this one. I have learned a lot at this year's
Balisage; I hope you have, too. I have enjoyed hearing from you in talks and during breaks
and arguing with some of you about this and that, including the right way to model
text. Thank you for coming to Balisage. Let's do it again sometime!
References
[Al-Awadai et al. 2017] Al-Awadai, Zahra, Anne Brüggemann-Klein, Michael Conrads, Andreas Eichner and Marouane
Sayih. XML Applications on the Web: Implementation Strategies for the Model Component in
a Model-View-Controller Architectural Style.
Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4,
2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Bruggemann-Klein01.
[Coombs et al. 1987] Coombs, J. H., A. H. Renear, and S. J. DeRose. Markup systems and the future of scholarly text processing.
Communications of the Association for Computing Machinery 30.11 (Nov. 1987): 933–947.
[DeRose et al. 1990] DeRose, Steven J., David G. Durand, Elli Mylonas and Allen H. Renear. What is text, really?
Journal of Computing in Higher Education 1, no. 2 (1990): 3-26. doi:https://doi.org/10.1007/BF02941632.
[Flynn 2017] Flynn, Peter. Your Standard Average Document Grammar: just not your average standard.
Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4,
2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Flynn01.
[Grune / Jacobs 2008] Grune, Dick, and Ceriel J. H. Jacobs. Parsing Techniques: A Practical Guide. New York: Ellis Horwood, 1990; Second edition [New York]: Springer, 2008.
[Haentjens Dekker and Birnbaum 2017] Haentjens Dekker, Ronald, and David J. Birnbaum. It's more than just overlap: Text As Graph.
Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4,
2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Dekker01.
[Hemingway 1926] Hemingway, Ernest. The Sun Also Rises. New York: Charles Scribner's Sons, 1926. Reprint. New York: Scribner, 2006.
[Holstege 2017] Holstege, Mary. The Concrete Syntax of Documents: Purpose and Variety.
Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4,
2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Holstege01.
[Lee 2017] Lee, David. The Secret Life of Schema in Web Protocols, API's and Software Type Systems.
Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4,
2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Lee01.
[Liuzzo 2017] Liuzzo, Pietro Maria. Encoding the Ethiopic Manuscript Tradition: Encoding and representation challenges
of the project Beta maṣāḥǝft: Manuscripts of Ethiopia and Eritrea.
Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4,
2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Liuzzo01.
[Maloney 2017] Maloney, Murray. Using DocBook5: To Produce PDF and ePub3 Books.
Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4,
2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Maloney01.
[Owens 2017] Owens, Evan. Symposium Introduction.
Presented at Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions,
Washington, DC, July 31, 2017. In Proceedings of Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Owens01.
[Piez 2017] Piez, Wendell. Uphill to XML with XSLT, XProc … and HTML.
Presented at Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions,
Washington, DC, July 31, 2017. In Proceedings of Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions. Balisage Series on Markup Technologies, vol. 20 (2017). doi:https://doi.org/10.4242/BalisageVol20.Piez02.
[YouTube Film Clip] You keep using that word. I do not think it means what you think it means.
The Princess Bride, YouTube video, 00:07. Clip from film released in 1987. Posted by Bob Vincent,
January 9, 2013. https://www.youtube.com/watch?v=wujVMIYzYXg.
[Thompson 2017] Thompson, Will. Automatically Denormalizing Document Relationships.
Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4,
2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Thompson01.
[Turner 2017] Turner, Matt. Entity Services in Action with NISO STS.
Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4,
2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:https://doi.org/10.4242/BalisageVol19.Turner01.
[Warmer / van Egmond 1989] Warmer, Jos, and Sylvia van Egmond. The implementation of the Amsterdam SGML Parser.
Electronic Publishing 2.2 (December 1989): 65-90. A copy is on the Web at http://cajun.cs.nott.ac.uk/compsci/epo/papers/volume2/issue2/epjxw022.pdf.
[van Wijngaarden et al. 1976] van Wijngaarden, A[driaan], et al. Revised Report on the Algorithmic Language Algol 68. Berlin, Heidelberg, New York: Springer, 1976.