Prologue
Why do we want to see markup?
That's not a question I would have asked forty years ago when I started using computers to process text. I first experienced document markup as an editor and writer in the publishing organization at Oak Ridge National Laboratory: I taught myself the coding for our typesetting system (developed in house by a physicist) so I could have more control over my documents. No WYSIWYG was available to me then! I worked with markup on hard copy and edited it using a line editor on a teletype terminal. Because I had done that typesetting and also some FORTRAN programming, I was picked to be the guinea pig for our new UNIX-based publishing system and eventually to train the rest of our staff. I found myself with a full-screen editor on a CRT (much quieter than the teletype), learning troff, tbl, and eqn. Basic troff typesetting wasn't all that different from what I knew (the systems shared Runoff as a common ancestor), but Joe Ossanna and Brian Kernighan had made troff programmable, and that meant that there were macro packages, the abstractions of patterns in markup.
My life changed forever: I had encountered Generic Markup! This appealed to me. All my life I had been interested in patterns. I had encountered Joseph Campbell and The Hero with a Thousand Faces early in my college career, studied Jungian archetypes, and written a dissertation on patterns in early Germanic literature. Now I had found something based on patterns I could use in my work—and get paid for it.
I chose the MM (Bell Laboratories Memorandum Macros) package as being most suited to our work at ORNL and set about adapting the package to our requirements. I also rewrote parts of eqn. Then I started training our composition staff and eventually the other editors. I attended one of the early Seybold Conference series, where someone from IBM talked about something called Generic Markup Language. I realized there was a kind of community; other people were working on other types of generic markup.
My success with the project at ORNL led to my being asked to present it at a Department of Energy conference. There I met Millard Collins, chairman of a new ANSI committee (X3V1) working on how to make the new word-processing systems just becoming popular communicate with each other. Since part of my job was to get text out of word processors and into our UNIX system, I joined the committee at its organizational meeting in the fall of 1981. At that meeting, I met Charles Card, who suggested I join his committee (X3J6), which was working on, among other things, a Standard Generalized Markup Language. I first attended X3J6 in the spring of 1982, and there I had my first encounter with Charles Goldfarb, the project editor and driving force behind SGML.
The first of these committees (and its ISO counterpart, ISO/IEC JTC1/SC18) started work on something called Office Document Architecture (later Open Document Architecture), ISO 8613 [ODA], now largely forgotten. SGML, ISO 8879 [SGML], developed originally by X3J6 and its ISO counterpart, the JTC1 Experts Group on Computer Languages for Processing Text, is still with us. The two ISO committees eventually merged into SC18. In the fall of 1985, I became the convenor of the ISO working group responsible for SGML and related projects (SC18/WG8). ODA was managed by a parallel working group (SC18/WG3). After the demise of ODA, my SGML group became the primary committee in 1998, just as XML was getting started. (As ISO/IEC JTC1/SC34, it still exists. [SC34]) The competition between SGML and ODA went on for nearly eighteen years. While there were many technical issues (and much electro-politics) involved, in many ways the competition was about the difference between visible and invisible markup.
ODA, SGML, and the First Hints of Invisible SGML
Most of the people working on SGML came, like me, from the documentation and the scientific and technical publishing industries. We prided ourselves on our connection to technology, and we were used to typing codes into computers. We were used to long, highly structured documents—and lots of code.
Those who joined the world of descriptive markup only after the arrival of XML may not realize how endangered that world had been only a few years earlier. The SGML/ODA Wars are, thankfully, long over and forgotten, except by those of us who still have scars from them. In retrospect, I think SGML might have survived on its own, in a niche community; but if we had not survived the wars, we wouldn't have been able to build a support system for it. In particular, we wouldn't have had DSSSL (Document Style Semantics and Specification Language, ISO/IEC 10179), and without that we wouldn't have had the basis to build XSL and XQuery.
The ODA project was driven largely by makers of word-processing systems and also by national telecommunications agencies that were looking to offer yet another tariffed service. While they dreamed of WYSIWYG, the reality of their work was long constrained by their hardware, particularly its inability to produce more than typewriter-like output when the project began. What ODA seemed to desire most was a system that offered a working screen free of codes. Nonetheless, ODA had a foundation that was not so simple as their surface goals might suggest, and indeed they had considerable influence on SGML and its approach to coding. From its beginnings in Wolfgang Horak's dissertation, ODA had an implicit interest in generic structures, [Horak-Kroenert-83] and in the earliest ISO drafts, ODA proposed that documents possessed two concurrent, interleaved, high-level document structures, layout and logical. What these structures involved was never made completely explicit, though layout obviously had to do with rendition on the screen and page. The logical structure apparently dealt with paragraph-like objects. ODA was the cloud computing of the 1980s: an office was expected to rent an ODA terminal from their telephone company, and the documents would reside on the company's mainframes. The ODA standards project was eventually published in 14 volumes, with several supporting technical reports.
From the beginning, ODA assumed that the serialization of documents would be in binary form, ODIF (Office Document Interchange Format). The notation selected was based on ASN.1 (Abstract Syntax Notation One [ASN1]), though with modifications because of the concurrent structures. Below the page level, the layout structure was control codes for rendering devices, which amounted to invisible inline procedural markup. For the logical structure, however, the developers turned to type-length-value triplets, with byte count pointers as a kind of implied stand-off generic markup.
During the earliest years of the ODA project, I attended their meetings and brought back their discussions to the SGML committee. Most of the SGML team considered ODA a distraction, but it intrigued Goldfarb, who took it as a personal challenge to develop an SGML representation for anything and everything proposed for ODIF. One of the first results of this was the introduction of the CONCUR feature into SGML. Because ODA never developed an explicit schema mechanism, Goldfarb had to develop a mechanism for dealing with ad hoc and implicit structures. The result was Architectural Forms. Goldfarb's SGML rendering of something that began as binary and invisible into visible markup was eventually folded back into the ODIF standard as an alternative serialization.
In the two serializations of ODIF, we had (at least in theory) the materials for a reversible transformation between a document whose only visible manifestation was something that appeared on a presentation system and one that was encoded in conventional, and readable, character markup. It was sufficiently interesting to Goldfarb that he played with the idea of developing a binary version of the whole SGML design, on the assumption that it would be more compact and therefore easier to transmit over a bandwidth-limited network. That came to an end when NIST calculated the relative sizes of binary- and SGML-encoded ODA documents and found the latter to be more compact.
Although the reversible transformation between visible and invisible markup was defined, at least for the serialization of ODA, it never worked in practice. While we all know SGML and its heirs, which have multiple implementations, ODA was never completely implemented and today is largely forgotten. It had, on paper, a bewildering number of options from which profiles could be extracted, only a few of which had even trial laboratory implementations. Those of us who had to cope with its presence generally think of it as an expensive failure. Yet it influenced DSSSL, and thus XSL, through its page model. And it started the debate over how to represent overlapping structures that still intrigues participants in Balisage.
One of the things that killed the ODA project was visible markup. ODA was not intended to be seen, even in the SGML encoding. ODA was not really even intended to be created directly (though Philips did at one point attempt, unsuccessfully, to build an ODA editor as a laboratory project). ODA was originally intended to be used in invisible environments, for communication between systems. It was too hard for all but a few specialists to comprehend its rather abstract model and its difficult binary representation. ODIF could be generated only by machines, doing things like pointer arithmetic. SGML markup, in contrast, was expected to be created by end users. It turned out as something we could—and did—create by hand, and we expected to see that which was both document markup and the interchange format. Yves Marcoux and Martin Sévigny considered "eye-readability" to be the primary reason that SGML succeeded where ODA did not. [Marcoux]
I trace the last gasps of ODA to the SC18 plenary in 1995. The convenors of the working groups were sitting together at the head table, and I was next to Steve Price, the convenor of WG3 and the chief public advocate for ODA. I happened to look at his laptop screen and saw he was taking notes in a text editor—in HTML. I leaned over and whispered to him, "I'm glad to see you've come over to our side." "What do you mean?" he asked. "You're taking notes in SGML," I replied. "No," he shot back, "it's this new World Wide Web thing." "Yes, I can see it's HTML, and that's an SGML application." He was crushed. His group, which had big money behind it, had spent years trying to compete with ours, which had worked because of a passion for its project. All this time the ODA developers had never really grasped what we were doing. Meanwhile, we sold our concept quietly, planting it in places like CERN, where it spawned HTML, and the ODA team didn't realize they had been subverted. They tried to keep their project going for another couple of years, but it was futile.
I don't think that it was merely the technical superiority of SGML that led to its victory over ODA. The ODA developers had started with confidence that they had the next great thing. They were, after all, professional standards developers, backed by powerful organizations, and they were working on something that would fit into Open Systems Interconnection. The SGML developers knew little about standards development; we were just end users with a common interest. (As Sharon Adler remarked, "If we ever figure out how this standards process works, it will be time for us to retire.") In the long run, it was probably to the advantage of the SGML developers that they were working on something that they wanted and needed themselves, rather than something that corporate bodies expected to impose on end users. The design of SGML is improvised—sometimes amateurish, sometimes obscure. The resulting application languages are nonetheless something that can be seen and used directly by humans. The visibility of SGML markup was part of what enabled Bill Tunnicliffe to sell it to the U.S. Department of Defense in 1983, and that led to our going public with the GENCODE standard later that year. [GENCODE] ODA, with its thousands of permutations of options, was much harder to grasp—and to implement. All its advocates could do was publish descriptive papers. You can write SGML in a simple text editor. You can't do that with ODA. So in the end, the leader of ODA development picked up on the utility of HTML and actually used it. Visible markup had won.[1]
Digression on Word Processors and Seeing Coding
WYSIWYG is a seductive concept. The earliest stand-alone word-processing systems—expensive, yet limited, behemoths—promoted it. But by the time SGML and its offspring really gained traction, the stand-alone devices had been supplanted by programs running on general-purpose personal computers. And in the end, the multitude of early applications had largely fallen by the wayside while two major competitors fought to control the marketplace, Microsoft's Word and Corel's WordPerfect. Word was based on work at Xerox PARC, and as a consequence it was fundamentally object oriented. It understood units of text such as strings and paragraphs and applied properties to them, and it understood generalized structure and inheritance of both structure and properties. That meant it could easily support stylesheets with inheritable properties and things that depended on structure, like outlining. WordPerfect, in contrast, just serialized control functions in whatever order the user happened to insert them; there was no overall concept of structure. (I thought of it as "one damn thing after another.") Stylesheets and outlining came only late to WordPerfect and were relatively weak, compared to those in Word.
Conceptually, Word was in closer sympathy with SGML, while WordPerfect followed the layout structure of ODA. (It is perhaps significant that Corel was one of the very few companies to attempt an ODIF export filter for their product.) Word beat WordPerfect to full WYSIWYG with Word for Windows (no surprise there), but my observation of hundreds of users of these two products showed an interesting phenomenon: serious WordPerfect users almost always ran the program in split-screen mode, with "reveal codes" at the bottom of the editing screen. Using "reveal codes" was important because the program enforced no discipline about how codes were entered; users could do things in random order, and just seeing the cursor in the WYSIWYG screen gave few hints about what was actually going on in the procedural coding. Word users didn't need this because the program managed the coding in a structured way, always told them what object they were in, and could also tell them what its properties were. So in a fully structured environment, it was not necessary to look at coding; but in an undisciplined one, visibility of coding was essential.
Early Invisibility in SGML
As proud as the hard-core SGML developers were of our ability to bang markup into a terminal, we were nonetheless practical—or lazy. Almost from the beginning we had markup minimization. In the early days, before we had syntax-directed editors designed for SGML, we took it on faith that the SGML Parser (whatever that turned out to be) would be intelligent enough to keep track of the current context and so save us the trouble of typing full tags. Goldfarb, of course, had to generalize that idea into the full scope of minimization options in the final standard (see below, Appendix A).
I can remember the first SGML editor I used, from Datalogics: it was basically a text editor, with an attached batch parser. I could type tags, attributes and all, and end tags; then I could check to see how many mistakes I'd made. Software Exoterica (later known by the name of its primary product, OmniMark) came out with Checkmark, based on a simple text editor for the Macintosh, but with a live parser. The ability to get validation while a document was being created was so useful that I, like a number of other people, kept an ancient Mac alive for years just to run Checkmark after Exoterica stopped updating it for later systems.
XML, hoping to simplify life for the parser writer, decided to drop minimization. Ironically, most of the problems with minimization had been solved by then, and furthermore we had real SGML editors like SoftQuad's Author/Editor and Arbortext, so the problem had ceased to be an issue. With the arrival of real SGML editors, users suddenly had the option of deciding how much SGML they wanted to see. They could see full source code, they could see schematic block tags, or they could see no tags at all. As I write this in <oXygen/>, I'm looking at a page very similar to what I saw more than twenty years ago in Author/Editor, and I'm switching between visible and hidden tags according to what tasks I'm performing at the moment. Even if I were still in Author/Editor, there would be no minimization in my output document.
As I've looked at some recent papers on Invisible XML, I've kept thinking, "We're back where I was about 1983." What was the state of SGML back then, and how does it lead to Invisible SGML, if not to Invisible XML?
By 1982 our image of what an SGML document would look like would be largely recognizable to an XML user today. A document would have tags with angle brackets, and the elements indicated by the tags would be in a hierarchy. Attributes would be specified in start tags. What we lacked then was a formal way to define the tags and hierarchy. In short, we needed a way to specify a schema, and developing such a specification was harder than forming a basic expectation of what SGML would look like. In 1982 we were already thinking about specifications for content models that were somehow related to regular expressions, but we did not yet have a settled syntax for them. When we did start to develop a syntax for declarations in 1983, one of our first drafts was actually a whitespace-delimited table inside a declaration (then called STRUC, for structure), with columns for element names and models. Multiple elements could be declared in a single table. We'd leave until later the problem of how to parse such a table and use the results.
Given this state of development, it was sometime in late 1982 that I inadvertently launched an idea that would result in Invisible SGML. I had to do a presentation about SGML, and I picked for my example a conventional memo, with From, To, Subject, and other such fields. Not yet having a real syntax for a schema, I wrote out a series of definitions borrowing from regular expressions that included string literals as components of content models. I don't have the original any longer, but it was something like:
memo: to, from, subject, date, body
to: "To: ", #PCDATA
from: "From: ", #PCDATA
etc.
Afterwards, I showed it to Goldfarb, who fired back that it was all wrong, that wasn't what he intended to do at all, that he wasn't using full regular expressions, and so there could be no literals in the models. Content models included only element names (plus reserved characters for grouping, sequencing, and occurrence indication).
But Goldfarb being Goldfarb, my error gave him a challenge. Rather than drop the idea of literal strings in the input as replacements for tags, he decided to implement it, and the 1983 version of the STRUC declaration did include some limited cases of literals in models for character strings. It also included the first cut at what became the DATATAG option in an SGML configuration. [GENCODE] At the cost of adding another delimiter role to separate them from element names, string literals came back into content models as separators between elements. When a declared literal pattern is encountered in the source, it ends one element, forcing the start of the next in the model, while at the same time being passed on as part of the source. With the final DATATAG syntax of 1986, the declaration
<!ELEMENT row - o ([cell, ", ", " "], cell)>
describes a two-column table row to be made from a row in a comma-separated list, one line per implied row, where the comma is followed by a space (", ") and then followed by optional padding spaces (" "), then by the second cell.
If strings (#PCDATA) can become markup, what about strings that change roles according to context? Goldfarb did not stop with simple alternatives to tagging: he went on to generalize the concept of recognizing strings in situations such as "smart quotes". His solution, short references and short reference maps, cost two more markup declarations (SHORTREF and USEMAP) and considerable indirection. When a string that has been declared as a short reference is encountered, it is replaced by an entity, which typically resolves to markup: an element's start tag or end tag. Furthermore, invoking an element (either by encountering it in text or by generating it from a short reference) can change the mapping from a short reference to an entity. Thus encountering a quotation mark in text could start an element and a new map; encountering another quotation mark under the new map could end the element and revert to the original map. (Handling nested quotes or cases like single quotes in English, which can have more than one role, requires complex patterns and mappings.)
<!USEMAP textmap p>                   <!-- In normal text, the "textmap" is active. -->
<!USEMAP quotemap quote>              <!-- In a quotation, the "quotemap" is active. -->
<!ENTITY quotetag "<quote>" >         <!-- The "quotetag" entity is the start tag for a quotation. -->
<!ENTITY endquotetag "</quote>" >     <!-- The "endquotetag" entity is the end tag for a quotation. -->
<!SHORTREF textmap '"' quotetag>      <!-- Within the "textmap" a double quote resolves to the "quotetag" entity. -->
<!SHORTREF quotemap '"' endquotetag>  <!-- Within the "quotemap" a double quote resolves to the "endquotetag" entity. -->
DATATAG and SHORTREF are complementary techniques. DATATAG is a technique for markup minimization; SHORTREF is an alternative method for entering markup and potentially modifying its meaning. When DATATAG is enabled, a string that matches a pattern serves as both data and end tag; the characters of the string are passed through to the output at the same time that they cause a parsing event. The start tag that began the element is generally assumed to be minimized. A string that matches a SHORTREF pattern is just markup in Invisible SGML; it causes an event but is consumed in the process.
For all his ingenuity in creating these techniques, Goldfarb still didn't give me precisely what I was asking for: I wanted matching a pattern to create an implied start tag. In its first draft DATATAG supported both start and end tags, but the final version provides implied end tags, or rather it provides element separators that involve an implied end tag for one element and a start tag for the next. Perhaps SHORTREF could be stretched (Goldfarb seemed not to like long short references), rather than DATATAG, to get what I was looking for:
<!SHORTREF memomap "&#RS;To: " to "&#RS;From: " from>
<!ENTITY to "<to>">
<!ENTITY from "<from>">
<!ELEMENT to o o (%text;)>
<!ELEMENT from o o (%text;)>
So long as whatever %text; resolved to didn't include the string "To: " or "From: ", that might work. ("&#RS;" is a long-forgotten SGML predefined entity reference to the start of a data record; there was a corresponding "&#RE;" for the end of a record.)
As the SGML standard makes explicit (Appendix C.1.3), one intent of these techniques was to capture simple WYSIWYG data, as it was seen in the 1980s. In effect, we were trying to capture typewriter-like markup, expressed largely through whitespace and punctuation. This was about as much as the stand-alone word processors of the early 1980s were able to export. Given that the only output devices available to them, such as daisy-wheel printers, were only glorified typewriters, that's about as much as could be expected. The day of the stand-alone device was ending as such machines were supplanted by programs running on personal computers. As laser printers arrived, with new output capabilities, the programs grew in flexibility and also in complexity of coding. With the new word-processing programs, it was often possible to extract more coding data, though I saw little evidence of SGML users stretching these techniques to deal with extended coding. In the period when WordPerfect was the dominant program, writing SHORTREF structures would have offered even more challenges than dealing with multilingual quotes because there were so many codes and no programmatic discipline at all over the order in which they could be entered.
By the time I was building real SGML publishing systems, we had separate conversion tools and then OmniMark to do the work for us. But the work was still nontrivial.
The longest discussions of the DATATAG and SHORTREF techniques that I know of, in Appendix C of the ISO standard (and Goldfarb's annotation of it in The SGML Handbook [Goldfarb-1990]) and in Martin Bryan's book SGML: An Author's Guide [Bryan-88], concentrate on techniques such as turning vertical whitespace into new elements in a sequence, turning comma-separated (or TAB-separated) data into tables, and handling quotations and similar constructs. These discussions predate the rise of word-processing programs, so they did not deal with translation of formatting codes.
All the mechanisms necessary to enable these techniques were dropped from XML:
- the SGML DECLARATION, necessary to enable minimization, DATATAG, and SHORTREF;
- markup minimization as a concept;
- the SHORTREF and USEMAP markup declarations;
- markup roles declared in ENTITY declarations; and
- predefined entities, especially the #RS and #RE, often used in short references for the concepts of record start and record end.
All of these were judged too much for the "desperate Perl hacker" envisioned as the potential XML parser writer.
The absence of these features in XML has not prevented enthusiasts from trying to reinvent them. Simon St. Laurent had a habit of showing up at the Montréal conference that has since become Balisage and suggesting ways of resurrecting things lost in XML. In 2001 his target was using textual patterns as markup. [StLaurent]
To Be Seen or Not To Be Seen
So do we want to see markup?
At first glance, the current interest in Invisible XML suggests that we don't want to see markup anymore. [Pemberton-2013] But I think that is not really the case. Invisibility is not the goal in this effort; markup is. As Steven Pemberton has said about his project, "Invisible XML is a technique for treating any parsable format as if it were XML, and thus allowing any parsable object to be injected into an XML pipeline." [Pemberton-2016] In this sense, Invisible XML is like a continuation of Goldfarb's demonstration of how to generate SGML out of comma-delimited values, which can be traced back as far as the 1983 GENCODE standard.
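To make that continuity concrete, here is a rough sketch of how my 1982 memo example might be expressed today as an Invisible XML grammar. This is my own illustration, not taken from Pemberton's papers; the element names, the assumption that the memo consists of exactly three newline-terminated header lines, and the details of the ixml notation are simplified for the example:
{ sketch only: each header line parses into an element, with its label hidden }
memo: to, from, subject.
to: -"To: ", value.
from: -"From: ", value.
subject: -"Subject: ", value.
-value: ~[#a]*, -#a.   { the line's text, minus its terminating newline }
Given input such as "To: Staff", "From: The editor", and "Subject: Markup", the hidden labels and newlines are consumed, and the result is something like <memo><to>Staff</to><from>The editor</from><subject>Markup</subject></memo>, much like the structure I was sketching by hand in 1982.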
I think that the greatest differences between Invisible XML technologies and SGML technologies are the underlying assumptions and the technologies available. In the 1980s we made few assumptions about the data, other than that we could find some patterns upon which to operate. The patterns might be complex, as in Goldfarb's incomplete attempt to mark up sentences and words (ISO 8879, Appendix C, p. 106) or Bryan's handling of multilingual quotation marks (Appendix A.3, pp. 274–286), but they were derived simply from direct examination of documents. Invisible XML, in contrast, treats documents from the beginning as though they were expressions of a parse tree, with the expectation that it must be possible to describe the data using a context-free grammar [Sperberg-McQueen-2019] and to write out that grammar to drive a processor. In the 1980s we had few tools available with which to ingest documents into SGML, so Goldfarb built requirements for the tools into the standard itself, hoping that some programmer would implement them. Since XML has omitted the basis on which Goldfarb improvised his tools, we must now depend on something outside the XML parser. Fortunately, we have other tools, many of them XML-aware, and so Sperberg-McQueen can propose Aparecium as a library for XSLT or XQuery. The emphasis in Invisible XML is, after all, not on "Invisible" but on "XML". And this is still the goal we had in the 1980s: How do we get our data marked up so we can make further use of it? Invisible XML, requiring an external processor, is more complex and more capable than the original set of techniques, but the interest it has aroused suggests that we still need something to do that work. So Invisible XML is a way of making the invisible appear.
The techniques I have described that were built into SGML were originally a way of making markup disappear. Everything grew out of minimization, and that started as a way of saving effort for users in the days when all the coding would have to be typed in manually, not inserted by a syntax-directed editor. While this was a labor-saving technology, I suspect there was also an unconscious awareness that this new SGML notation for markup was much more verbose than what our team had been used to in Script, troff, and other systems. SGML, before the final version, was actually much more verbose than we think of it now. There were more delimiters and more delimiter roles: one reviewer accused the code of looking like "chicken tracks"! That these techniques turned into a way of simplifying the process of getting markup into documents that were being imported was an unintended consequence, though a fortunate one.
We put up with SGML because it was what we needed, what we had created, and we didn't have much other choice. It was successful in spite of what some saw as flaws. We sold it to the Department of Defense, the European Union, CERN, the Association of American Publishers, and dozens of other organizations. Major applications that we are still discussing at Balisage this year, such as DocBook and TEI, started out in SGML. Nonetheless, most of us were glad to see the arrival of applications like Author/Editor that disguised the "chicken tracks" and allowed us to forget about minimization. Most of the time what we cared about wasn't so much what the markup looked like but that we knew it was there and we could get at it as needed. As I write this, most of the time I have tags hidden. I sometimes turn them on when I need to know where my selection cursor really is. And on occasion I go into full code view because there are some things I just can't do any other way.
There is a difference between working with documents where there is no visible markup, yet which you can treat as though they are marked up, and working with documents where you make the markup that is present disappear because that helps your creative process. In either case, the goal is to have the information identified. Whether I am importing data or creating it from scratch, what is important is that the markup is applied to the data. What was on my mind in 1998, whether I just said it at a conference or wrote it down, was that not only had visible markup helped the success of SGML over ODA, but that, having vanquished what we had thought was a mortal threat, we could relax and make SGML less overtly visible.[1] Visibility, per se, is not a goal. I think that the core issue is connected to the idea of ownership of data. Putting your mark on the data (or rather in it) is an effective way of establishing that. The SGML/XML model of inline markup has thus been vastly more successful in that respect than the ODA approach of binary pointers.
Looking back over more than three decades of working with descriptive markup, I think the issue is not just seeing markup but making markup comprehensible by humans. If making markup visible is what it takes to do that, I'm all for visible markup.
Appendix A. Markup Minimization
With modern XML editors, markup minimization has ceased to be an issue. XML dropped the whole concept as being irrelevant in a time of syntax-directed editors, as well as being too difficult to implement in a parser.
But when SGML was under development, minimization was much desired—and debated—in our meetings. The final form of the ELEMENT declaration in the 1986 standard had two fields for minimization between the element name and its model, one for start tags and the other for end tags. Either, or both, could be declared omissible. The STRUC declaration in the 1983 GENCODE draft of SGML had several other kinds of minimization, and more than one kind of minimization could be specified in each of the two fields (pp. 40–46, 64–65).
-   Tag is required.
O   Tag can be omitted.
C   A containing element can end elements within it.
E   The current element can be ended by its container.
N   Null tag: the current element type is the same as the previous. There are many variants on this, but in general they meant typing only delimiters, without including the whole generic identifiers within them.
D   Data tags: literal strings could serve for either open or close tags.
These were reduced to just - (required) or O (omissible) in the published standard.
Planning minimization for an application required some skill: you had to think like a parser and maintain a mental stack of contexts. Consider a document type that required the title of a section to be followed by a paragraph and did not allow paragraphs to be nested:
<!ELEMENT section - - (title, p+) >
<!ELEMENT title - - (#PCDATA) >
<!ELEMENT p O O (%text;) -(p) >
(For those who are not familiar with SGML DTDs, the -(p) is an SGML exclusion: even if %text; includes p in its content model, p cannot appear within another p.) The result might look like:
<section>
<title>A section title</title>
The first paragraph
<p> A second paragraph
<p> A third paragraph
</section>
Just such a model is what led Tim Berners-Lee to think that the <p> tag was just a separator, analogous to a newline in typewriter text and not a container for text! The mess that we recognize in HTML is a prime case of why markup should not be made invisible.
References
AT&T Bell Laboratories (and later modifiers). groff_mm man page. https://www.mankier.com/7/groff_mm.
[Bryan-88] Bryan, Martin. SGML: An Author's Guide. New York: Addison-Wesley (1988).
[GENCODE] Graphic Communications Association. GCA Standard 101-1983, GENCODE and the Standard Generalized Markup Language.
[Goldfarb-1990] Goldfarb, Charles, and Yuri Rubinsky. The SGML Handbook. Oxford: Oxford University Press (1990).
[Horak-Kroenert-83] Horak, Wolfgang, and Guenther Kroenert (1983). "Techniques for Preparing and Interchanging Mixed Text-Image Documents at Multifunctional Workstations", Siemens Forschungs- und Entwicklungsberichte/Siemens Research and Development Reports. 12. 61-69. https://www.researchgate.net/publication/282210430_TECHNIQUES_FOR_PREPARING_AND_INTERCHANGING_MIXED_TEXT-IMAGE_DOCUMENTS_AT_MULTIFUNCTIONAL_WORKSTATIONS.
[ODA] International Organization for Standardization/International Electrotechnical Commission. ISO/IEC 8613-1:1994, Information technology—Open Document Architecture (ODA) and interchange format: Introduction and general principles, https://www.iso.org/standard/15928.html.
International Organization for Standardization/International Electrotechnical Commission. ISO/IEC 8613-2:1994, Information technology—Open Document Architecture (ODA) and interchange format: Open Document Interchange Format, https://www.iso.org/standard/23410.html.
[SGML] International Organization for Standardization/International Electrotechnical Commission. ISO/IEC 8879:1986, Information processing—Text and office systems—Standard Generalized Markup Language (SGML), https://www.iso.org/standard/16387.html.
[ASN1] International Telecommunication Union. Abstract Syntax Notation One (ASN.1), X.680 series. https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One.
[SC34] International Organization for Standardization/International Electrotechnical Commission. ISO/IEC JTC1/SC34, Document description and processing languages, https://www.iso.org/committee/45374.html, https://en.wikipedia.org/wiki/ISO/IEC_JTC_1/SC_34.
[Marcoux] Marcoux, Yves, and Martin Sévigny. "Why SGML? Why Now?" Journal of the American Society for Information Science 48, No. 7, July 1997, p. 584.
[Pemberton-2013] Pemberton, Steven. "Invisible XML." Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6–9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). doi:https://doi.org/10.4242/BalisageVol10.Pemberton01.
[Pemberton-2016] Pemberton, Steven. "Data Just Wants to Be Format-Neutral." Presented at XML Prague 2016, Prague, Czech Republic. Proceedings of XML Prague 2016, pp. 109–120. http://archive.xmlprague.cz/2016/files/xmlprague-2016-proceedings.pdf, https://homepages.cwi.nl/%7Esteven/Talks/2016/02-12-prague/data.html.
[StLaurent] St. Laurent, Simon. "Regular fragmentations: Treating complex textual content as markup." Paper given at Extreme Markup Languages 2001, Montréal, sponsored by IDEAlliance. Abstract on the Web at http://conferences.idealliance.org/extreme/html/2001/StLaurent01/EML2001StLaurent01.html.
[Sperberg-McQueen-2019] Sperberg-McQueen, C. M. "Aparecium: An XQuery / XSLT library for invisible XML." Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 – August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). doi:https://doi.org/10.4242/BalisageVol23.Sperberg-McQueen01.
[1] I said something about visibility/invisibility in a session at SGML/XML Europe 1998 in Paris, where I responded to a query by François Chahuneau with a comment that it was perhaps time to streamline SGML and that we no longer needed to be attached to the specifics of what SGML looked like. I have been convinced for some years that I had published somewhere not long afterwards an opinion piece on how visibility/invisibility affected the SGML/ODA Wars and what that meant for the future of markup. Diligent searching by several people has failed to discover a published article, and my wife has declared it to be a Fig Newton of my imagination. So now I am committing to text what I should have said then.