How to cite this paper
Mason, James David. “Do we really want to see markup?” Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). https://doi.org/10.4242/BalisageVol23.Mason01.
Balisage: The Markup Conference 2019
July 30 - August 2, 2019
Balisage Paper: Do we really want to see markup?
James David Mason
James D. Mason, originally trained as a mediaevalist and linguist, is
retired from being a writer, publishing systems developer, and manufacturing
engineer at U.S. Department of Energy facilities in Oak Ridge, Tennessee. In
1981, he joined the ISO’s work on standards for document management and
interchange. He chaired ISO/IEC JTC1/SC34, which was responsible for SGML,
DSSSL, Topic Maps, and related standards, from 1985 until 2007. Dr. Mason
has been a frequent writer and speaker on standards and their applications.
For his work on SGML, Dr. Mason has received the Gutenberg Award from
Printing Industries of America and the Tekkie Award from the Graphic
Communications Association. He has also done research in horology and the
history of pipe organs.
Copyright ©2019 by the author. Used with permission.
Abstract
Markup fanatics have long cried, “We need to see the markup!” Yet since the
earliest stages of developing the SGML standard, there has been an urge even among
standards developers to avoid having to write tags everywhere. The recent urge to
create “Invisible XML” is but the latest symptom of a smoldering disease, from which
I too suffer.
Table of Contents
- Prologue
- ODA, SGML, and the First Hints of Invisible SGML
- Digression on Word Processors and Seeing Coding
- Early Invisibility in SGML
- To Be Seen or Not To Be Seen
- Appendix A. Markup Minimization
Prologue
Why do we want to see markup?
That's not a question I would have asked forty years ago when I started using
computers to process text. I first experienced document markup as an editor and writer
in the publishing organization at Oak Ridge National Laboratory: I taught myself the
coding for our typesetting system (developed in house by a physicist) so I could have
more control over my documents. No WYSIWYG was available to me then! I worked with
markup on hard copy and edited it using a line editor on a teletype terminal. Because
I
had done that typesetting and also some FORTRAN programming, I was picked to be the
guinea pig for our new UNIX-based publishing system and eventually to train the rest
of
our staff. I found myself with a full-screen editor on a CRT (much quieter than the
teletype), learning troff, tbl, and eqn. Basic troff typesetting wasn't all that different from what I knew
(the systems shared Runoff as a common ancestor), but Joe Ossanna and Brian Kernighan
had made troff programmable, and that meant that there
were macro packages, the abstractions of patterns in markup.
My life changed forever: I had encountered Generic Markup! This appealed to me. All
my
life I had been interested in patterns. I had encountered Joseph Campbell and The Hero with a Thousand Faces early in my college career,
studied Jungian archetypes, and written a dissertation on patterns in early Germanic
literature. Now I had found something based on patterns I could use in my work—and
get
paid for it.
I chose the MM (Bell Laboratories Memorandum Macros) package as being
most suited to our work at ORNL and set about adapting the package to our requirements.
I also rewrote parts of eqn. Then I started training
our composition staff and eventually the other editors. I attended one of the early
conferences in the Seybold series, where someone from IBM talked about something called
“Generic Markup Language.” I realized there was a kind of community;
other people were working on other types of generic markup.
My success with the project at ORNL led to my being asked to present it at a
Department of Energy conference. There I met Millard Collins, chairman of a new ANSI
committee (X3V1) working on how to make the new word-processing systems just becoming
popular communicate with each other. Since part of my job was to get text out of word
processors and into our UNIX system, I joined the committee at its organizational
meeting in the fall of 1981. At that meeting, I met Charles Card, who suggested I
join
his committee (X3J6), which was working on, among other things, a Standard Generalized
Markup Language. I first attended X3J6 in the spring of 1982, and there I had my first
encounter with Charles Goldfarb, the project editor and driving force behind SGML.
The first of these committees (and its ISO counterpart, ISO/IEC JTC1/SC18) started
work on something called Office Document Architecture (later Open Document
Architecture), ISO 8613 [ODA], now largely forgotten. SGML, ISO 8879 [SGML],
developed originally by X3J6 and its ISO counterpart, the
JTC1 Experts Group on Computer Languages for Processing Text, is still with us. The
two ISO committees eventually merged into SC18. In the fall of 1985, I became the convenor
of the ISO working group responsible for SGML and related projects (SC18/WG8). ODA
was managed by a parallel working group (SC18/WG3). After the demise of ODA, my SGML group
became the primary committee in 1998, just as XML was getting started. (As ISO/IEC
JTC1/SC34, it still exists [SC34].) The competition between SGML and ODA
went on for nearly eighteen years. While there were many technical issues (and much
“electro-politics”) involved, in many ways the competition was about
the difference between visible and invisible markup.
ODA, SGML, and the First Hints of Invisible SGML
Most of the people working on SGML came, like me, from the documentation and the
scientific and technical publishing industries. We prided ourselves on our connection
to
technology, and we were used to typing codes into computers. We were used to long,
highly structured documents—and lots of code.
Those who joined the world of descriptive markup only after the arrival of XML may
not
realize how endangered that world had been only a few years earlier. The “SGML/ODA
Wars” are, thankfully, long over and forgotten, except by those of us who
still have scars from them. In retrospect, I think SGML might have survived on its
own,
in a niche community; but if we had not survived the wars, we wouldn't have been able
to
build a support system for it. In particular, we wouldn't have had DSSSL (Document
Style
Semantics and Specification Language, ISO/IEC 10179), and without that we wouldn't
have
had the basis to build XSL and XQuery.
The ODA project was driven largely by makers of word-processing systems and also by
national telecommunications agencies that were looking to offer yet another tariffed
service. While they dreamed of WYSIWYG, the reality of their work was long constrained by
their hardware, particularly the inability to produce more than
typewriter-like output when the project began. What ODA seemed to desire most was a
system that offered a working screen free of codes. Nonetheless, ODA had a foundation
that was not so simple as its surface goals might suggest, and indeed it had
considerable influence on SGML and its approach to coding. From its beginnings in
Wolfgang Horak's dissertation, ODA had an implicit interest in generic structures
[Horak-Kroenert-83], and in the earliest ISO drafts, ODA proposed that
documents possessed two concurrent, interleaved, high-level document structures,
“layout” and “logical.” What these structures involved was
never made completely explicit, though “layout” obviously had to do with
rendition on the screen and page. The “logical” structure apparently dealt
with paragraph-like objects. ODA was the “cloud computing” of the 1980s: an
office was expected to rent an ODA terminal from their telephone company, and the
documents would reside on the company's mainframes. The ODA standards project was
eventually published in 14 volumes, with several supporting technical reports.
From the beginning, ODA assumed that the serialization of documents would be in binary
form, ODIF (Office Document Interchange Format). The notation selected was based on
ASN.1 (Abstract Syntax Notation One [ASN1]), though with
modifications because of the concurrent structures. Below the page level, the layout
structure was control codes for rendering devices, which amounted to invisible inline
procedural markup. For the logical structure, however, the developers turned to
type-length-value triplets, with byte count pointers as a kind of implied stand-off
generic markup.
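To make the idea of type-length-value triplets concrete, here is a minimal Python sketch. It is not ODIF and not real ASN.1 encoding: the one-byte type, the two-byte length field, and the type codes `TITLE` and `PARAGRAPH` are all assumptions chosen for illustration. What it does show is why such a stream can only reasonably be produced and read by machines: the boundaries between objects exist only as byte counts.

```python
import struct

# Hypothetical type codes, invented for this sketch (not taken from ODIF).
PARAGRAPH, TITLE = 1, 2


def encode_tlv(tag: int, value: bytes) -> bytes:
    """Pack one triplet: 1-byte type, 2-byte big-endian length, then the value."""
    return struct.pack(">BH", tag, len(value)) + value


def decode_tlv(buf: bytes):
    """Walk a buffer of triplets, using each length field to locate the next one."""
    items = []
    pos = 0
    while pos < len(buf):
        tag, length = struct.unpack_from(">BH", buf, pos)
        start = pos + 3                      # skip the 3-byte header
        items.append((tag, buf[start:start + length]))
        pos = start + length                 # the byte count acts as the pointer
    return items


stream = encode_tlv(TITLE, b"Memo") + encode_tlv(PARAGRAPH, b"Hello, world.")
```

Nothing in `stream` marks where the title ends except the length field, so a one-byte error in a count would misalign everything after it.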
During the earliest years of the ODA project, I attended their meetings and brought
back their discussions to the SGML committee. Most of the SGML team considered ODA
a
distraction, but it intrigued Goldfarb, who took it as a personal challenge to develop
an SGML representation for anything and everything proposed for ODIF. One of the first
results of this was the introduction of the CONCUR feature into SGML. Because ODA
never
developed an explicit schema mechanism, Goldfarb had to develop a mechanism for dealing
with ad hoc and implicit structures. The result was “Architectural Forms.”
Goldfarb's SGML rendering of something that began as binary and invisible into visible
markup was eventually folded back into the ODIF standard as an alternative
serialization.
In the two serializations of ODIF, we had (at least in theory) the materials for a
reversible transformation between a document whose only visible manifestation was
something that appeared on a presentation system and one that was encoded in
conventional, and readable, character markup. It was sufficiently interesting to
Goldfarb that he played with the idea of developing a binary version of the whole
SGML
design, on the assumption that it would be more compact and therefore easier to transmit
over a bandwidth-limited network. That came to an end when NIST calculated the relative
sizes of binary- and SGML-encoded ODA documents and found the latter to be more
compact.
Although the reversible transformation between visible and invisible markup was
defined, at least for the serialization of ODA, it never worked in
practice. While we all know SGML and its heirs, which have multiple implementations,
ODA
was never completely implemented and today is largely forgotten. It had, on paper,
a
bewildering number of options from which profiles could be extracted, only a few of
which had even trial laboratory implementations. Those of us who had to cope with
its
presence generally think of it as an expensive failure. Yet it influenced DSSSL, and
thus XSL, through its page model. And it started the debate of how to represent
overlapping structures that still intrigues participants in Balisage.
One of the things that killed the ODA project was visible markup. ODA was not intended
to be seen, even in the SGML encoding. ODA was not really even intended to be created
directly (though Philips did at one point attempt, unsuccessfully, to build an ODA
editor as a laboratory project). ODA was originally intended to be used in invisible
environments, for communication between systems. It was too hard for all but a few
specialists to comprehend its rather abstract model and its difficult binary
representation. ODIF could be generated only by machines, doing things like pointer
arithmetic. SGML markup, in contrast, was expected to be created by end users. It
turned
out as something we could—and did—create by hand, and we expected to see that which
was
both document markup and the interchange format. Yves Marcoux and Martin Sévigny
considered “eye-readability” to be the primary reason that SGML succeeded
where ODA did not [Marcoux].
I trace the last gasps of ODA to the SC18 plenary in 1995. The convenors of the
working groups were sitting together at the head table, and I was next to Steve Price,
the convenor of WG3 and the chief public advocate for ODA. I happened to look at his
laptop screen and saw he was taking notes in a text editor—in HTML. I leaned over
and
whispered to him, “I'm glad to see you've come over to our side.”
“What do you mean?” he asked. “You're taking notes in SGML,” I
replied. “No,” he shot back, “it's this new World Wide Web
thing.” “Yes, I can see it's HTML, and that's an SGML application.”
He was crushed.
His group, which had big money behind it, had spent years trying to compete with ours,
which had worked because of a passion for its project. All this time the ODA developers
had never really grasped what we were doing. Meanwhile, we sold our concept quietly,
planting it in places like CERN, where it spawned HTML, and the ODA team didn't realize
they had been subverted. They tried to keep their project going for another couple
of
years, but it was futile.
I don't think that it was merely the technical superiority of SGML that led to its
victory over ODA. The ODA developers had started with confidence that they had the
next
great thing. They were, after all, professional standards developers, backed by powerful
organizations, and they were working on something that would fit into Open Systems
Interconnect. The SGML developers knew little about standards development; we were
just
end users with a common interest. (As Sharon Adler remarked, “If we ever figure
out how this standards process works, it will be time for us to retire.”) In
the long run, it was probably to the advantage of the SGML developers that they were
working on something that they wanted and needed themselves, rather than something
that
corporate bodies expected to impose on end users. The design of SGML is
improvised—sometimes amateurish, sometimes obscure. The resulting application languages
are nonetheless something that can be seen and used directly by humans. The visibility
of SGML markup was part of what enabled Bill Tunnicliffe to sell it to the U.S.
Department of Defense in 1983, and that led to our going public with the GENCODE
standard later that year [GENCODE]. ODA, with its thousands of
permutations of options, was much harder to grasp—and to implement. All its advocates
could do was publish descriptive papers. You can write SGML in a simple text editor.
You
can't do that with ODA. So in the end, the leader of ODA development picked up on
the
utility of HTML and actually used it. Visible markup had won.
Digression on Word Processors and Seeing Coding
WYSIWYG is a seductive concept. The earliest stand-alone word-processing
systems—expensive, yet limited, behemoths—promoted it. But by the time SGML and its
offspring really gained traction, the stand-alone devices had been supplanted by
programs running on general-purpose personal computers. And in the end, the multitude
of
early applications had largely fallen by the wayside while two major competitors fought
to control the marketplace, Microsoft's Word and
Corel's Word Perfect. Word was based on work at Xerox PARC, and as a consequence it was
fundamentally object oriented. It understood units of text such as strings and
paragraphs and applied properties to them, and it understood generalized structure
and
inheritance of both structure and properties. That meant it could easily support
stylesheets with inheritable properties and things that depended on structure, like
outlining. Word Perfect, in contrast, just serialized
control functions in whatever order the user happened to insert them; there was no
overall concept of structure. (I thought of it as “one damn thing after
another.”) Stylesheets and outlining came only late to Word Perfect and were relatively weak, compared to those in Word.
Conceptually, Word was in closer sympathy with SGML,
while Word Perfect followed the layout structure of
ODA. (It is perhaps significant that Corel was one of the very few companies to attempt
an ODIF export filter for their product.) Word beat
Word Perfect to full WYSIWYG with Word for Windows (no surprise there), but my observation of
hundreds of users of these two products showed an interesting phenomenon: serious
Word Perfect users almost always ran the program in
split-screen mode, with “reveal codes” at the bottom of the editing screen.
Using “reveal codes” was important because the program enforced no
discipline about how codes were entered; users could do things in random order, and
just
seeing the cursor in the WYSIWYG screen gave few hints about what was actually going
on
in the procedural coding. Word users didn't need this
because the program managed the coding in a structured way, always told them what
object
they were in, and could also tell them what its properties were. So in a fully
structured environment, it was not necessary to look at coding; but in an undisciplined
one, visibility of coding was essential.
Early Invisibility in SGML
As proud as the hard-core SGML developers were of our ability to bang markup into
a
terminal, we were nonetheless practical—or lazy. Almost from the beginning we had
“markup minimization.” In the early days, before we had syntax-directed
editors designed for SGML, we took it on faith that the SGML Parser
(whatever that turned out to be) would be intelligent enough to keep track of the
current context and so save us the trouble of typing full tags. Goldfarb, of course,
had
to generalize that idea into the full scope of minimization options in the final
standard (see below, Appendix A).
I can remember the first SGML editor I used, from Datalogics: it was basically a text
editor, with an attached batch parser. I could type tags, attributes and all, and
end
tags; then I could check to see how many mistakes I'd made. Software Exoterica (later
known by the name of its primary product, OmniMark) came out with
Checkmark, based on a simple text editor for the Macintosh, but
with a live parser. The ability to get validation while a document was being created
was
so useful that I, like a number of other people, kept an ancient Mac alive for years
just to run Checkmark after Exoterica stopped updating it for later
systems.
XML, hoping to simplify life for the parser writer, decided to drop minimization.
Ironically, most of the problems with minimization had been solved by then, and
furthermore we had real SGML editors like SoftQuad's Author/Editor and Arbortext, so the
problem had ceased to be an issue. With the arrival of real SGML editors, users suddenly
had the option of deciding how much SGML they wanted to see. They could see full source
code, they could see schematic block tags, or they could see no tags at all. As I
write
this in <oXygen/>, I'm looking at a page very similar to what I
saw more than twenty years ago in Author/Editor, and
I'm switching between visible and hidden tags according to what tasks I'm performing
at
the moment. Even if I were still in Author/Editor,
there would be no minimization in my output document.
As I've looked at some recent papers on “Invisible XML,” I've kept
thinking, “We're back where I was about 1983.”
What was the state of SGML back then, and how does it lead to “Invisible
SGML,” if not to “Invisible XML”?
By 1982 our image of what an SGML document would look like would be largely
recognizable to an XML user today. A document would have tags with angle brackets,
and
the elements indicated by the tags would be in a hierarchy. Attributes would be
specified in start tags. What we lacked then was a formal way to define the tags and
hierarchy. In short, we needed a way to specify a schema, and developing such a
specification was harder than forming a basic expectation of what SGML would look
like.
In 1982 we were already thinking about specifications for content models that were
somehow related to regular expressions, but we did not yet have a settled syntax for
them. When we did start to develop a syntax for declarations in 1983, one of our first
drafts was actually a whitespace-delimited table inside a declaration (then called
STRUC, for structure), with columns for element names and models. Multiple elements
could be declared in a single table. We'd leave until later the problem of how to
parse
such a table and use the results.
Given this state of development, it was sometime in late 1982 that I inadvertently
launched an idea that would result in “Invisible SGML.” I had to do a
presentation about SGML, and I picked for my example a conventional memo, with
“From,” “To,” “Subject,” and other such
fields. Not yet having a real syntax for a schema, I wrote out a series of definitions
borrowing from regular expressions that included string literals as components of
content models. I don't have the original any longer, but it was something
like
memo: to, from, subject, date, body
to: "To: ", #PCDATA
from: "From: ", #PCDATA
etc.
Afterwards, I showed it to Goldfarb, who fired back that it was all wrong, that wasn't
what he intended to do at all, that he wasn't using full regular expressions, and
so
there could be no literals in the models. Content models included only element names
(plus reserved characters for grouping, sequencing, and occurrence indication).
But Goldfarb being Goldfarb, my error gave him a challenge. Rather than drop the idea
of literal strings in the input as replacements for tags, he decided to implement
it,
and the 1983 version of the STRUC declaration did include some limited cases of literals
in models for character strings. It also included the first cut at what became the
DATATAG option in an SGML configuration [GENCODE]. At the cost of adding
another delimiter role to separate them from element names, string literals came back
into content models as separators between elements. When a declared literal pattern
is
encountered in the source, it ends one element, forcing the start of the next in the
model, while at the same time being passed on as part of the source. With the final
DATATAG syntax of 1986, the declaration
<!ELEMENT row - o ([cell, ", ", " "], cell)>
describes a two-column table row to be made from a row in a comma-separated list, one line per
implied row, where the comma is followed by a space (", ") and then
followed by optional padding spaces (" "), then by the second cell.
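The effect of such a declaration can be imitated in a few lines of Python. This is a rough sketch under stated assumptions, not an SGML parser: the function name `parse_row`, the event tuples, and the decision to emit the matched separator as data between the two cells are all choices made here for illustration. The one behavior it tries to mirror is that the separator string both ends the first cell and is passed through as data.

```python
import re

# The template ", " followed by optional padding spaces, as in the data tag group.
SEPARATOR = re.compile(", " + " *")


def parse_row(line: str):
    """Turn one comma-separated line into a list of (kind, value) parse events."""
    events = [("start", "row"), ("start", "cell")]
    match = SEPARATOR.search(line)
    if match is None:
        # No separator found: treat the whole line as a single cell.
        events += [("data", line), ("end", "cell"), ("end", "row")]
        return events
    events += [
        ("data", line[:match.start()]),
        ("end", "cell"),              # the separator implies the end tag...
        ("data", match.group()),      # ...and is also passed through as data
        ("start", "cell"),
        ("data", line[match.end():]),
        ("end", "cell"),
        ("end", "row"),
    ]
    return events
```

Calling `parse_row("apples,   3")` yields two `cell` elements with the comma and its padding preserved as a data event between them.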
If strings (#PCDATA) can become markup, what about strings that change roles according
to context? Goldfarb did not stop with simple alternatives to tagging: he went on
to
generalize the concept of recognizing strings in situations such as “smart
quotes.” His solution, short references and short reference maps, cost two
more markup declarations (SHORTREF and USEMAP) and considerable indirection. When a
string that has been declared as a short reference is encountered, it is replaced by an
entity, which resolves to an element name together with an indication of whether it is to be used in a start
tag or an end tag. Furthermore, invoking an element (either by encountering it in
text
or by generating it from a short reference) can change the mapping from a short
reference to an entity. Thus encountering a quotation mark in text could start an
element and a new map; encountering another quotation mark under the new map could
end
the element and revert to the original map. (Handling nested quotes or cases like
single
quotes in English, which can have more than one role, requires complex patterns and
mappings.)
<!USEMAP textmap p>
<!-- In normal text, the "textmap" is active. -->
<!USEMAP quotemap quote>
<!-- In a quotation, the "quotemap" is active -->
<!ENTITY quotetag "<quote>" >
<!-- The "quotetag" entity is the start tag for a quotation. -->
<!ENTITY endquotetag "</quote>" >
<!-- The "endquotetag" entity is the end tag for a quotation. -->
<!SHORTREF textmap '"' quotetag>
<!-- Within the "textmap" a double quote resolves to the "quotetag" entity. -->
<!SHORTREF quotemap '"' endquotetag>
<!-- Within the "quotemap" a double quote resolves to the "endquotetag" entity. -->
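The map switching in these declarations can be traced with a small Python sketch. It is only an imitation of the mechanism, assuming a single-character short reference and the two maps declared above; the dictionaries `MAPS` and `USEMAP` and the function `expand` are names invented for this illustration, and a real SGML parser does considerably more.

```python
# Short reference maps: within each map, a string resolves to replacement markup.
MAPS = {
    "textmap": {'"': "<quote>"},
    "quotemap": {'"': "</quote>"},
}

# Which map is active inside which element (the role of the USEMAP declarations).
USEMAP = {"p": "textmap", "quote": "quotemap"}


def expand(text: str, element: str = "p") -> str:
    """Replace short references with tags, switching maps as elements open and close."""
    stack = [element]                     # currently open elements
    out = []
    for ch in text:
        current_map = MAPS[USEMAP[stack[-1]]]
        replacement = current_map.get(ch)
        if replacement is None:
            out.append(ch)                # ordinary data passes through
        elif replacement.startswith("</"):
            out.append(replacement)
            stack.pop()                   # closing the element restores the outer map
        else:
            out.append(replacement)
            stack.append(replacement[1:-1])  # opening the element activates its map
    return "".join(out)
```

With this sketch, `expand('He said "yes" twice.')` produces `He said <quote>yes</quote> twice.`: the same character resolves to a start tag or an end tag depending on which map is current, and the quotation marks themselves are consumed.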
DATATAG and SHORTREF are complementary techniques. DATATAG is a technique for markup
minimization; SHORTREF is an alternative method for entering markup and potentially
modifying its meaning. When DATATAG is enabled, a string that matches a pattern serves
as both data and end tag; the characters of the string are passed through to the output
at the same time that they cause a parsing event. The start tag that began the element
is generally assumed to be minimized. A string that matches a SHORTREF pattern is
just markup in “Invisible SGML”; it causes an event but is consumed in the
process.
For all his ingenuity in creating these techniques, Goldfarb still didn't give me
precisely what I was asking for: I wanted matching a pattern to create an implied
start
tag. In its first draft DATATAG supported both start and end tags, but the final version
provides implied end tags, or rather it provides element separators that involve an
implied end tag for one element and a start tag for the next. Perhaps SHORTREF could
be
stretched (Goldfarb seemed not to like long short references), rather than DATATAG,
to
get what I was looking
for:
<!SHORTREF memomap "&#RS;To: " to
"&#RS;From: " from>
<!ENTITY to "<to>">
<!ENTITY from "<from>">
<!ELEMENT to o o (%text;)>
<!ELEMENT from o o (%text;)>
So long as whatever %text; resolved to didn't include the string
“To:” or “From:”, that might work. ("&#RS;" is a
long-forgotten SGML predefined entity reference to the start of a data record; there
was a corresponding "&#RE;" for the end of a record.)
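What I was asking for, a pattern at the start of a record creating an implied start tag, can be sketched in Python as follows. This is a hypothetical illustration, not what any SGML parser does: the table `FIELD_PATTERNS` and the function `tag_memo` are invented here, and the sketch writes out the end tags that the declarations above leave to be implied.

```python
# Patterns recognized at the start of a record, mapped to element names.
FIELD_PATTERNS = {"To: ": "to", "From: ": "from"}


def tag_memo(source: str) -> str:
    """Replace record-initial field labels with start tags, closing the previous field."""
    out = []
    open_element = None
    for record in source.splitlines():
        for prefix, name in FIELD_PATTERNS.items():
            if record.startswith(prefix):
                if open_element:
                    out.append(f"</{open_element}>")
                out.append(f"<{name}>")
                open_element = name
                record = record[len(prefix):]   # the label is consumed, not passed on
                break
        out.append(record)
    if open_element:
        out.append(f"</{open_element}>")
    return "".join(out)
```

Given the two records `To: Alice` and `From: Bob`, the function returns `<to>Alice</to><from>Bob</from>`. As with SHORTREF, the matched label disappears from the output, which is the main difference from the DATATAG behavior.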
As the SGML standard makes explicit (Appendix C.1.3), one intent of these techniques
was to capture simple WYSIWYG data, as it was seen in the 1980s. In effect, we were
trying to capture typewriter-like markup, expressed largely through whitespace and
punctuation. This was about as much as the stand-alone word processors of the early
1980s were able to export. Given that the only output devices available to them, such
as
daisy-wheel printers, were only glorified typewriters, that's about as much as could
be
expected. The day of the stand-alone device was ending because they were beginning
to be
supplanted by programs running on personal computers. As laser printers arrived, with
new output capabilities, the programs also grew in flexibility and also in complexity
of
coding. With the new word-processing programs, it was often possible to extract more
coding data, though I saw little evidence of SGML users stretching these techniques
to
deal with extended coding. In the period when Word Perfect
was the dominant program, writing SHORTREF structures would have
offered even more challenges than dealing with multilingual quotes because there were
so
many codes and no programmatic discipline at all over the order in which they could
be
entered.
By the time I was building real SGML publishing systems, we had separate conversion
tools and then OmniMark to do the work for us. But the work was
still nontrivial.
The longest discussions of the DATATAG and SHORTREF techniques that I know, in
Appendix C of the ISO standard (and Goldfarb's annotation of it in The SGML Handbook
[Goldfarb-1990]) and Martin Bryan's book SGML: An
Author's Guide [Bryan-88], concentrate on techniques such
as turning vertical whitespace into new elements in a sequence, turning comma-separated
(or TAB-separated) data into tables, and handling quotations and similar constructs.
These discussions predate the rise of word-processing programs, so they did not deal
with translation of formatting codes.
All the mechanisms necessary to enable these techniques were dropped from XML:
- the SGML DECLARATION, necessary to enable minimization, DATATAG, and SHORTREF;
- markup minimization as a concept;
- the SHORTREF and USEMAP markup declarations;
- markup roles declared in ENTITY declarations; and
- predefined entities, especially the #RS and #RE, often used in short references for the concepts of “record start” and “record end.”
These techniques were not heavily used, and implementing them was
probably too much for the “desperate Perl hacker” envisioned as the
potential XML parser writer.
The absence of these features in XML has not prevented enthusiasts from trying to
reinvent them. Simon St. Laurent had a habit of showing up at the Montréal conference
that has since become Balisage and suggesting ways of resurrecting
things lost in XML. In 2001 his target was using textual patterns as markup [StLaurent].
To Be Seen or Not To Be Seen
So do we want to see markup?
At first glance, the current interest in “Invisible XML” suggests that we
don't want to see markup anymore [Pemberton-2013]. But I think that is
not really the case. Invisibility is not the goal in this effort; markup is. As Steven
Pemberton has said about his project, “Invisible XML is a technique for treating
any parsable format as if it were XML, and thus allowing any parsable object to be
injected into an XML pipeline” [Pemberton-2016]. In this
sense, “Invisible XML” is like a continuation of Goldfarb's demonstration
of how to generate SGML out of comma-delimited values, which can be traced back as
far
as the 1983 GENCODE standard.
I think that the greatest differences between “Invisible XML”
technologies and SGML technologies are the underlying assumptions and the technologies
available. In the 1980s we made few assumptions about the data, other than that we
could
find some patterns upon which to operate. The patterns might be complex, as in
Goldfarb's incomplete attempt to mark up sentences and words (ISO 8879, Appendix C,
p.
106) or Bryan's handling of multilingual quotation marks (Appendix A.3, pp. 274–286),
but they were derived simply from direct examination of documents. “Invisible
XML,” in contrast, treats documents from the beginning as though they were
expressions of a parse tree, with the expectation that “it must be possible to
describe the data using a context-free grammar”
[Sperberg-McQueen-2019] and to write out that grammar to drive a
processor. In the 1980s we had few tools available with which to ingest documents
into
SGML, so Goldfarb built requirements for the tools into the standard itself, hoping
that
some programmer would implement them. Since XML has omitted the basis on which Goldfarb
improvised his tools, we must now depend on something outside the XML parser.
Fortunately, we have other tools, many of them XML-aware, and so Sperberg-McQueen can
propose Aparecium as a library for XSLT or XQuery. The emphasis in
“Invisible XML” is, after all, not on “Invisible” but on
“XML.” And this is still the goal we had in the 1980s: How do we get
our data marked up so we can make further use of it? “Invisible XML,”
requiring an external processor, is more complex and more capable than the original set
of techniques, but the interest it has aroused suggests that we still need something to
do that work. So “Invisible XML” is a way of making the invisible
appear.
The techniques I have described that were built into SGML were originally a way of
making markup disappear. Everything grew out of minimization, and that started as
a way
of saving effort for users in the days when all the coding would have to be typed
in
manually, not inserted by a syntax-directed editor. While this was a labor-saving
technology, I suspect there was also an unconscious awareness that this new SGML
notation for markup was much more verbose than what our team had been used to in
Script, troff, and other systems. SGML,
before the final version, was actually much more verbose than we think of it now. There
were more delimiters and more delimiter roles: one reviewer accused the code of looking
like “chicken tracks”! That these techniques turned into a way of
simplifying the process of getting markup into documents that were being imported
was an
unintended consequence, though a fortunate one.
We put up with SGML because it was what we needed, what we had created, and we didn't
have much other choice. It was successful in spite of what some saw as flaws. We sold
it
to the Department of Defense, the European Union, CERN, the Association of American
Publishers, and dozens of other organizations. Major applications that we are still
discussing at Balisage this year, such as DocBook and TEI, started
out in SGML. Nonetheless, most of us were glad to see the arrival of applications
like
Author/Editor that disguised the chicken tracks
and allowed us to forget about minimization. Most of the time what we cared about
wasn't
so much what the markup looked like but that we knew it was there and we could get
at it
as needed. As I write this, most of the time I have tags hidden. I sometimes turn
them
on when I need to know where my selection cursor really is. And on occasion I go into
full code view because there are some things I just can't do any other way.
There is a difference between working with documents where there is no visible markup,
yet which you can treat as though they are marked up, and working with documents where
you make the markup that is present disappear because that helps your creative process.
In either case, the goal is to have information identified. Whether I am importing data
or creating it from scratch, what is important is that the markup is applied to the
data. What was on my mind in 1998, whether I just said it at a conference or wrote it
down, was that not only had visible markup helped the success of SGML over ODA, but
that, having vanquished what we had thought was a mortal threat, we could relax and make
SGML less overtly visible. Visibility, per se, is not a goal. I think that the core
issue is connected to the idea of ownership of data. Putting your mark on the data (or
rather in it) is an effective way of establishing that. The SGML/XML model of inline
markup has thus been vastly more successful in that respect than the ODA approach of
binary pointers.
Looking back over more than three decades of working with descriptive markup, I think
the issue is not just seeing markup but making markup comprehensible by humans. If
making markup visible is what it takes to do that, I'm all for visible markup.
Appendix A. Markup Minimization
With modern XML editors, markup minimization has ceased to be an issue. XML dropped
the whole concept as being irrelevant in a time of syntax-directed editors, as well as
being too difficult to implement in a parser.
But when SGML was under development, minimization was much desired, and much debated, in
our meetings. The final form of the ELEMENT declaration in the 1986 standard had two
fields for minimization between the element name and its model, one for start tags and
the other for end tags. Either, or both, could be declared omissible. The STRUC
declaration in the 1983 GENCODE draft of SGML had several other kinds of minimization,
and more than one kind of minimization could be specified in each of the two fields
(pp. 40–46, 64–65).
- | Tag is required.
O | Tag can be omitted.
C | A containing element can end elements within it.
E | The current element can be ended by its container.
N | Null tag: the current element type is the same as the previous. There are many variants on this, but in general they meant typing only delimiters, without including the whole generic identifiers within them.
D | Data tags: literal strings could serve for either open or close tags.
We eventually realized this was excessively complex. When we created so many conditions,
we didn't actually have an SGML parser with which to test minimization. As we gained
experience in parser design, we realized, for example, that ending a container element
naturally ended any contained elements on the stack. In the end, each field became
binary in the published standard: “-” (required) or “O” (omissible).
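The two-field form that survived into the published standard is simple enough to read
mechanically. As a rough illustration only, the following Python sketch pulls the two
minimization fields out of a simple declaration such as <!ELEMENT p - O (#PCDATA) >. The
function name read_minimization and the regular expression are assumptions made for this
sketch, not part of any SGML toolkit, and the sketch ignores name groups, comments, and
the other declaration forms that the standard allows.

```python
import re

# Assumption of this sketch: a single element name, then the start-tag and
# end-tag minimization fields, then everything else up to the closing ">".
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>[A-Za-z][A-Za-z0-9.-]*)\s+"
    r"(?P<start>[-Oo])\s+(?P<end>[-Oo])\s+(?P<rest>.*?)\s*>",
    re.DOTALL,
)


def read_minimization(declaration):
    """Return (name, start_tag_omissible, end_tag_omissible)."""
    match = ELEMENT_DECL.fullmatch(declaration.strip())
    if match is None:
        raise ValueError(f"not a recognised element declaration: {declaration!r}")
    return (
        match.group("name"),
        match.group("start").upper() == "O",  # "O" means the start tag may be omitted
        match.group("end").upper() == "O",    # "O" means the end tag may be omitted
    )


print(read_minimization("<!ELEMENT p - O (#PCDATA) >"))  # ('p', False, True)
```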
Planning minimization for an application required some skill: you had to think like a
parser and maintain a mental stack of contexts. Consider a document type that required
the title of a section to be followed by a paragraph and did not allow paragraphs to be
nested:
<!ELEMENT section - - (title, p+) >
<!ELEMENT title - - (#PCDATA) >
<!ELEMENT p O O (%text;) -p >
(For those who are not familiar with SGML DTDs, the -p is an SGML “exclusion”: even if
%text; includes p in its content model, p cannot appear within another p.) The result
might look like:
<section>
<title>A section title</title>
The first paragraph
<p>
A second paragraph
<p>
A third paragraph
</section>
Just such a model is what led Tim Berners-Lee to think that the <p> tag was just a
separator, analogous to a newline in typewriter text and not a container for text! The
mess that we recognize in HTML is a prime case of why markup should not be made
invisible.
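The mental stack that the section example demands can be made concrete. What follows is
a minimal Python sketch, not an SGML parser: the three tables at the top are
hypothetical, hard-coded stand-ins for what a real parser would derive from the DTD, and
the function name normalize is likewise an assumption of this sketch. It walks the
minimized text, keeps a stack of open elements, and writes out every tag that the author
was allowed to leave out.

```python
import re

# Hypothetical stand-ins for the DTD in the text (assumptions of this sketch).
OMISSIBLE_END = {"p"}                # elements whose end tag may be omitted
IMPLIED_BY_TEXT = {"section": "p"}   # bare text inside a section starts a p
NO_SELF_NESTING = {"p"}              # models the "-p" exclusion

# A token is a start tag, an end tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)>|([^<]+)")


def normalize(source):
    """Return the tokens of source with every omitted tag made explicit."""
    stack, out = [], []

    def open_element(name):
        stack.append(name)
        out.append(f"<{name}>")

    def close_element():
        out.append(f"</{stack.pop()}>")

    for closing, name, text in TOKEN.findall(source):
        if text:
            if not text.strip():
                continue                    # skip whitespace between tags
            implied = IMPLIED_BY_TEXT.get(stack[-1]) if stack else None
            if implied:
                open_element(implied)       # supply the omitted start tag
            out.append(text.strip())
        elif closing:
            # Ending a container ends the omissible elements still open in it.
            while stack and stack[-1] != name:
                if stack[-1] not in OMISSIBLE_END:
                    raise ValueError(f"</{name}> cannot close <{stack[-1]}>")
                close_element()
            if not stack:
                raise ValueError(f"unmatched </{name}>")
            close_element()
        else:
            # A p cannot nest inside an open p, so the open one must end here.
            if stack and stack[-1] == name and name in NO_SELF_NESTING:
                close_element()
            open_element(name)
    return out


SAMPLE = """<section>
<title>A section title</title>
The first paragraph
<p>
A second paragraph
<p>
A third paragraph
</section>"""

print("\n".join(normalize(SAMPLE)))
```

For the sample, the sketch supplies an explicit <p> before "The first paragraph" and an
explicit </p> before each following <p> and before </section>. Seen this way, <p> is
plainly the start of a container and not a separator.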
References
AT&T Bell Laboratories (and later modifiers). groff_mm man page. https://www.mankier.com/7/groff_mm.
[Bryan-88] Bryan, Martin. SGML: An Author's
Guide. New York: Addison-Wesley (1988).
[GENCODE] Graphic Communications Association. GCA Standard
101-1983, GENCODE and the Standard Generalized Markup
Language.
[Goldfarb-1990] Goldfarb, Charles, and Yuri Rubinsky. The SGML Handbook. Oxford: Oxford
University Press (1990).
[Horak-Kroenert-83] Horak, Wolfgang, and Guenther Kroenert (1983).
"Techniques for Preparing and Interchanging Mixed Text-Image Documents at
Multifunctional Workstations", Siemens Forschungs- und Entwicklungsberichte/Siemens
Research and Development Reports. 12. 61-69. https://www.researchgate.net/publication/282210430_TECHNIQUES_FOR_PREPARING_AND_INTERCHANGING_MIXED_TEXT-IMAGE_DOCUMENTS_AT_MULTIFUNCTIONAL_WORKSTATIONS.
[ODA] International Organization for Standardization/International
Electrotechnical Commission. ISO/IEC 8613-1:1994, Information
technology—Open Document Architecture (ODA) and interchange format: Introduction and
general principles, https://www.iso.org/standard/15928.html.
International Organization for Standardization/International Electrotechnical
Commission. ISO/IEC 8613-2:1994, Information technology—Open
Document Architecture (ODA) and interchange format: Open Document Interchange
Format, https://www.iso.org/standard/23410.html.
[SGML] International Organization for Standardization/International
Electrotechnical Commission. ISO/IEC 8879:1986, Information
processing—Text and office systems—Standard Generalized Markup Language
(SGML), https://www.iso.org/standard/16387.html.
[ASN1] International Telecommunication Union, Abstract Syntax Notation 1, ASN.1, X-680 series, https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One.
[SC34] International Organization for Standardization/International
Electrotechnical Commission. ISO/IEC JTC1/SC34, Document
description and processing languages, https://www.iso.org/committee/45374.html, https://en.wikipedia.org/wiki/ISO/IEC_JTC_1/SC_34.
[Marcoux] Marcoux, Yves, and Martin Sévigny. “Why SGML? Why Now?”. Journal of the
American Society for Information Science 48, No. 7, July 1997, p. 584.
[Pemberton-2013] Pemberton, Steven. “Invisible XML”. Presented at Balisage: The Markup
Conference 2013, Montréal, Canada, August 6–9, 2013. In Proceedings of Balisage: The
Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013).
https://doi.org/10.4242/BalisageVol10.Pemberton01.
[Pemberton-2016] Pemberton, Steven. “Data Just Wants to Be Format-Neutral”. Presented at
XML Prague, 2016, Prague, Czech Republic. Proceedings of XML Prague 2016, pp. 109–120.
http://archive.xmlprague.cz/2016/files/xmlprague-2016-proceedings.pdf, https://homepages.cwi.nl/%7Esteven/Talks/2016/02-12-prague/data.html.
[StLaurent] St. Laurent, Simon. “Regular fragmentations: Treating complex textual
content as markup”. Paper given at Extreme Markup Languages 2001, Montréal, sponsored by
IDEAlliance. Abstract on the Web at
http://conferences.idealliance.org/extreme/html/2001/StLaurent01/EML2001StLaurent01.html.
[Sperberg-McQueen-2019] Sperberg-McQueen, C. M. “Aparecium: An XQuery / XSLT library for
invisible XML”. Presented at Balisage: The Markup Conference 2019, Washington, DC,
July 30 – August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019.
Balisage Series on Markup Technologies, vol. 23 (2019).
https://doi.org/10.4242/BalisageVol23.Sperberg-McQueen01.