How to cite this paper
Wilmott, Sam. “Literate Programming: A Case Study and Observations.” Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). https://doi.org/10.4242/BalisageVol8.Wilmott01.
Balisage Paper: Literate Programming: A Case Study and Observations
Sam Wilmott
Sam Wilmott started using markup languages in the late '60s. Since then he has led
the development of typesetting/text-formatting systems for the Canadian Government
Printing Office and for a major real-estate company, implemented one of the first
SGML parsers (which was also the first pull-model markup parser), and is the originator
of the OmniMark programming language, with its strong support of SGML, XML, and text transformation.
More recently Sam has been working in the XSLT world: he has contributed to
the implementation of an XSLT compiler and currently works as an XSLT programmer and
analyst. As a side project, he is working on new programming language ideas for markup
language processing.
Copyright © 2012 Sam Wilmott and Stilo International plc.
Abstract
A newly revived interest in literate programming means that we need to look at what's
been done in the past. Literate Programming requires both the integration of computer
programming code with its documentation, and the elimination of duplicate information
between the code and the documentation. It's the latter that has been overlooked
in the past. This paper describes a project that integrated programming code with
its documentation, using a markup language, and discusses lessons that might be learned
from it. It also illustrates the use of, and discusses the advantages of, compact
markup (as exemplified by SGML short references and Wiki markup), especially as it
applies to using a markup language for literate programming.
Table of Contents
- Why Literate Programming?
- A Blast From The Past: A Case Study
- An Aside On Short References
- Another Aside, On The Kinds Of Documentation
- A Literate Programming Markup Language As A New Language
- Conclusions And Observations
Why Literate Programming?
First of all, I'd like to thank Stilo International for giving me access to some of their internal documents. This paper wouldn't exist
without their contribution.
Literate Programming, the integration of program code with its documentation, has been a feature of both
the programming and the documentation fields for over thirty years. It got off to
a good start with Knuth's WEB system, but there hasn't been a lot of new work in the field over most of the intervening
decades. However, now there seems to be a renewed interest in Literate Programming.
Why? you might ask.
Integrating programming code and its documentation isn't just about having them in
the same document, as was done in Knuth's WEB: it's about eliminating duplication
of information coded in both programming and documentation forms. In contrast, and
most commonly, programmers are continuing with the traditional model of completely
separating code and its documentation. The difficulty with this approach is that
it duplicates a lot of information: information coded in a programming language and
also written up in its documentation, as text or in tables. The result is costly
in three ways:
-
Doing so increases the amount of work required by the programmers and the documenters
(who may be the same or different folk) to initially create and to later update the
code and the documentation.
-
Organizing multiple copies of information can itself increase the cost of developing,
maintaining and managing programming code: there's more to do that way.
-
Most importantly, duplication increases the chances of error: rewriting a text description
into code can result in an error, as can describing code using text. Updating one
can easily cause it to be out of step with the other, even when they were previously
in step.
There have been a number of approaches taken to deal with these difficulties:
-
All production programming languages support integrating comments with code. Comments
are most commonly used to help the reader understand the details of why coding is
done in a particular way. Comments are also used to document how a program is to
be used, but they don't make for good reading for the users, and force the user to
read the code.
-
Many ways have been looked at for adding markup to program languages' comments, both
XML and non-XML (often wiki-like) markup. This certainly improves things. It means
that the user documentation can be extracted from the programming code and repurposed
for the user: a big help.
Marked-up comments still have a problem, however: a lot of information needs to be
duplicated, so that there's a "human" version of the information and a "computer"
version of it too. For example, the information in function/method headers needs
to be available to and understood by both the user and the computer (see the sketch
following this list).
-
Taking things a step further, there have been a few approaches to adding markup
to a programming language's code itself, so that it can be used within the documentation
without duplication. This is where the future lies; it's what I'll be talking about here,
and it's something still in development.
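As a familiar example of the second approach, here's what a Doxygen-style marked-up
comment looks like in C (a minimal sketch; the function and its names are hypothetical,
not taken from the parser described below). Note how the parameter and return
information is duplicated: once for the human reader, once for the compiler:
/**
 * Count the currently opened elements.
 *
 * @param parsing_state  the current parsing state
 * @return               the number of elements currently open
 */
int opened_element_count (const ParsingState *parsing_state);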
There are a number of good papers at this conference already covering different aspects
of this problem:
-
"Code Up: Marking up Programming Languages and the winding road to an XML Syntax" describes and analyzes various approaches, from simple commenting to a program that's
all XML.
-
"On XML Languages" describes both XML and "compact" (non-XML) syntaxes for existing W3C scripting languages,
discussing the advantages and disadvantages of each approach.
-
"Encoding Transparency: Literate Programming and Test Generation for Scientific Function
Libraries" describes an XML-based approach to duplicating what was achieved with Donald Knuth's
Literate Programming tools (his WEB targeting TeX).
This paper adds to the discussion in two ways:
-
It presents an existing system, integrating programming code and its documentation
in a practical way.
-
It discusses further issues that have to be dealt with when designing languages and
building tools for such a system.
A Blast From The Past: A Case Study
Back in the days when SGML was still new (when XML hadn't shown up yet), and when the C programming language was still a practical language of choice for cross-platform tool development (when
C was about the only language that ran uniformly on all major platforms, and when
there were a much larger variety of machine architectures than there are in our now
Intel-dominated world), I implemented one of the still-existing SGML parsers. Almost
uniquely, I think, the SGML parser is itself an SGML document. (It helped a lot that
it was the second SGML parser developed by the company, so that the first one could
be used to initially process the second one. Once the second SGML parser was well
developed, it took over and was used to help process itself.)
The following examples are taken from the code of the SGML parser used in Stilo International's
OmniMark programming language. This code has been in use for over twenty years, so it serves
as a good example of "real world" markup-based literate programming. The markup language
used to mark up the SGML parser's code is quite complex, but you'll get most of its
ideas from the following examples.
In practice, the following program-oriented markup elements are included in otherwise
common paragraph-level markup.
Here's the header of the module processing SGML declarations (other than ENTITY declarations):
<!-- xkdecl.doc:
Copyright (C) Stilo International plc, 1991 - 2011
All Rights Reserved
PROPRIETARY AND CONFIDENTIAL
-->
<chapter>Declarations
<revinfo>$Id: xkdecl.doc,v 1.83 2001/10/19 15:11:08 kernel Exp $
<system>kernel:XK
<module defined>decl <!-- Declarations;-->;
basic; mem; syn; lex; var; ent; mod; con; attr; edec; err; fsm1
<cinclude>xktypes.h
It contains:
-
Importantly, copyright and distribution information.
-
A chapter heading, both as a lead comment in the code and as a chapter start and title
in the user's and programmer's documentation.
-
Revision information for the revision control system used at the time. (For a stable
piece of software such as this, it doesn't get updated often, as you can see.)
-
Information about the system name, the module being defined, what other modules are used,
and what C include files are needed.
Data structures are documented rather than coded:
<struct external>document type definition
# The data structure which describes a document type definition (a "compiled"
DTD) and which points to all the data structures for the objects declared
for the document type definition. #
<comment>
The following fields provide information about specific features of
a document type.
</comment>
= document element: 'element definition'*
# Pointer to the definition for the document element. #
= default general entity: 'entity definition'*
# Pointer to the default general entity. #
...
</struct>
Documentation of a structure as a whole as well as of each field is required. The
markup used for the fields of a structure is exactly the same as for an ordered list
of labeled textual items: no distinction is made between the markup for documentation
and for code.
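To make the relationship to the generated code concrete, here's a minimal sketch of
the kind of C declaration the processing software might emit from the <struct> markup
above. The type names, field names and prefixing below are assumptions -- the actual
generated code isn't shown here:
/* Hypothetical generated code: the "xk" prefix and naming style are assumptions. */
typedef struct xkDocumentTypeDefinition {
    /* Pointer to the definition for the document element. */
    struct xkElementDefinition *document_element;
    /* Pointer to the default general entity. */
    struct xkEntityDefinition *default_general_entity;
    /* ... */
} xkDocumentTypeDefinition;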
Structures, functions and other constructs have attributes specified with them that
are meaningful to the target code, the programmer's documentation, the user's
documentation, or some combination of these.
Global names are marked up (surrounded by apostrophes in code and by "at" signs in
text), and are chosen to be appropriate for documentation. The processing software
replaces these names with the kinds of names required by the target language, together
with appropriate prefixing. This approach makes the code more portable between systems.
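For illustration only (the project's actual translation scheme isn't shown here, so
the prefix and casing are assumptions), a marked-up declaration and its generated C
form might pair up like this:
/* Marked-up source:  'document type definition'* dtd;   */
/* Generated C code:  xkDocumentTypeDefinition *dtd;     */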
Functions/methods have their interface information marked up as documentation:
<function external>initialize document type: 'boolean'
# Prepare for parsing a document instance. #
<inout setself>parsing_state: 'parsing state'*
<in>base_element: 'element definition'*
<comment>
This procedure prepares ~parsing_state~ to parse a document instance using the
current document type definition in ~parsing_state~. Three options are
available for selecting what is to be parsed, depending on the value of
~base_element~, as follows:
<ol>
= If ~base_element~ is the base document element of the document type (i.e.
the one named following the keyword DOCTYPE in the DTD), then the following
input text is parsed with that element as the document element (see
ISO 8879-1986, definition 4.99).
= If ~base_element~ is any other element in the DTD, then that other element
is treated as if it were the document element for the purposes of parsing the
following input text. This allows parts of documents to be parsed, such as a
single chapter.
= If ~base_element~ is "null" (@element definition.null@), then the following
text may consist of any sequence of elements defined in the DTD.
</ol>
In the first two cases, @initialize document type@ sets the number of opened
elements to zero (0).
...
<code>
<return/'initialize parsing state generally'
(parsing_state,
('document type definition'*) 'document type definition.null',
base_element,
('document syntax'*) 'document syntax.null',
('document syntax'*) 'document syntax.null',
('parsing state setup result'*) 0)/
</function>
Function arguments are documented both by text and by the markup of the argument.
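Continuing the hedged sketch of what the generator might emit (the names, prefix and
type mappings are assumptions), the interface markup above could compile into a C
prototype along these lines, with <in> and <inout> determining how each argument is
passed:
/* Hypothetical generated prototype; all names are assumptions. */
xkBoolean xkInitializeDocumentType (
    xkParsingState      *parsing_state,   /* <inout setself> */
    xkElementDefinition *base_element);   /* <in> */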
Code in the body of a function is the one place where program code is used in preference
to markup, for a number of reasons:
-
Dense code is generally easier to read in a more compact form.
-
The code is only used in two (potential) targets: the produced C code that was intended
to be compiled, and in the annotated code documentation.
That said, there are exceptions to using a non-SGML form for code:
-
Constructs that impact a function's interface, such as "return" (but not things like
"if") are marked up.
-
The big issue in choosing verbose markup or compact markup is in the trade-off of
readability and utility. This trade-off can be subjective -- different people will
come to different conclusions, depending largely on what markup and other notations
they are familiar with.
-
References to names in the software's interface, either of interest to the user or
of global interest to the software's developers, are marked up, so that they can be
easily found if needed. One use of marked-up names is that an index of all uses
of every name can be generated.
Using markup also means that things that are better coded as tables than as code,
but which need to be run as code, can be included. This was done in the SGML parser,
by coding the syntactic parsing logic as a finite state machine (FSM). For example,
here's the logic for parsing an SGML end tag (in a somewhat abbreviated form):
From Clause 7.5, End-tag:
<fsm>end-tag (TAG):
&more;
= name {end-tag}: +generic identifier specification
= tagc {back over lexeme; check end-tag shorttag}: +checked shorttag
= * {impossible}
# checked shorttag
{empty end-tag}: +generic identifier specification
# generic identifier specification (TAG):
&s;
= tagc {end of end-tag: other prolog; end of tag}: content
= stago no rhs, etago no rhs
{back over lexeme; report missing end tag tagc missing;
end of end-tag: other prolog; end of tag}: content
= * {backup needed; 'unrecognized item'}: +unrecognized
# unrecognized
{end of end-tag: other prolog; end of tag}: content
</fsm>
Each entry has four parts: the thing or things being recognized, the lexical context
in effect (i.e. what tokens are recognized, identified by a keyword such as "TAG"),
the action to be taken when recognizing that thing (in curly braces), and what state
in the state machine to go to next. In particular:
-
"#" introduces a sub-state and "+" prefixes a local reference to the next state.
Next states with no "+" prefix are major states, like "end-tag".
-
Substates need not recognize anything, but just do something, like "checked shorttag".
-
Groups of common actions are coded as entity references ("&more;" and "&s;").
Note that the above example is very heavily marked up: all of ( ) { } + = # ' and
; are compact markup (a.k.a. SHORTREFs).
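As a rough idea of what such FSM markup might compile into -- the real generator's
output isn't shown here, so the state names, token names and dispatch style below
are all assumptions -- consider a conventional switch-based state machine:
/* Hypothetical sketch of the generated dispatch code for the
   "end-tag" state; all names are assumptions. */
case STATE_END_TAG:
    switch (current_token) {
    case TOKEN_NAME:                          /* = name {end-tag}: ... */
        end_tag_action (parsing_state);
        state = STATE_GI_SPECIFICATION;       /* +generic identifier specification */
        break;
    case TOKEN_TAGC:                          /* = tagc {back over lexeme; ...}: ... */
        back_over_lexeme (parsing_state);
        check_end_tag_shorttag (parsing_state);
        state = STATE_CHECKED_SHORTTAG;       /* +checked shorttag */
        break;
    default:                                  /* = * {impossible} */
        impossible ();
        break;
    }
    break;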
Actions in the FSM are marked up specially:
<action value>end tag
# Process an end tag containing an element name and signal the
change of context to the application. #
<comment>
The current lexical item is the name of the element.
@parsing state.selected element@ is to be made the definition of the element.
to the previous state after closing one element, or go to the alternate
state after having reported an error.
</comment>
<local>element: 'element definition'*
<local>opened_element: 'opened element'*
<code>
if (parsing_state->'parsing state.opened element count' > 0)
opened_element = parsing_state->'parsing state.opened element stack';
else
opened_element = 'opened element.null';
if (!'look up element' (parsing_state,
parsing_state->'parsing state.opened entity stack'->
'opened entity.item start',
('integer')
(parsing_state->'parsing state.opened entity stack'->
'opened entity.item end' -
parsing_state->'parsing state.opened entity stack'->
'opened entity.item start'),
addr(element)))
{
parsing_state->'parsing state.selected element' =
'element definition.null';
'report error' (parsing_state,
'exception code.undefined element in end tag');
<return//
}
parsing_state->'parsing state.selected element' = element;
'create opened element' (parsing_state);
'initiate closing current element' (parsing_state, opened_element);
</action>
Actions can be compiled as functions, with calls to them included in the FSM code,
or they can be marked as a "macro", and included in-line. The "value" attribute indicates
that the action (potentially) returns a value to the invoking application. This illustrates
the use of markup not just for documentation purposes, but to make the coding simpler.
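The difference between the two compilation modes might look roughly like this (again
a sketch -- the emitted code is an assumption):
/* As a function, the action body is emitted once... */
static void end_tag_action (xkParsingState *parsing_state)
{
    /* ...the marked-up body of the action goes here... */
}
/* ...and each FSM transition naming the action emits a call: */
end_tag_action (parsing_state);
/* An action marked as a "macro" instead has its body pasted in-line
   at each such transition, trading code size for call overhead. */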
The FSM markup language made it easy to create program code, and was easy to work
with. It greatly shortened the time of creating a high-performance SGML parser.
There was nothing, of course, about this technique that was specific to the product
being an SGML parser: marked-up literate programming would have worked for any software.
However, it did help to speed up development of the product in an otherwise inappropriate
programming language: C. One could also argue that it took someone with expertise in
implementing and using SGML to perform both tasks.
It's unclear whether this use of literate programming was a success or not: the parser
itself has had a working life of over twenty years, but the approach was not taken up
more widely, and the language it amounted to ultimately failed to win over programmers
(as discussed below). So an argument could be made both for success and for failure.
An Aside On Short References
The work described in this paper makes extensive use of short references and illustrates
how they can be useful.
Another paper being presented at this conference describes a simplified mechanism for introducing the advantages of Wiki Markup and SGML short references into XML. As that paper correctly points out, it's not
easy to get SGML short references right. The difficulty is not so much compact markup
itself -- it's in the mechanism for defining it, in the tool support for such markup,
and in the quality of the documentation of such markup. (If anything, it's in the
latter that the use of SGML short references failed most notably.)
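For readers who haven't seen the mechanism, a short reference map is declared in an
SGML DTD roughly as follows -- a minimal sketch, not the declarations actually used
in this project:
<!ENTITY   item-start "<li>">
<!SHORTREF list-map   "="   item-start>
<!USEMAP   list-map   ol>
Within an "ol" element, each "=" short reference is then replaced by the entity's
replacement text -- here, an "li" start tag -- which is presumably how the "=" lines
in the earlier examples become field and FSM-entry elements.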
XML was designed and made different from SGML on the assumption that markup support
tools, such as XML editors and XML exporting support in word processors, had or would
develop to the point where users were no longer entering XML markup "by hand", but
would use semi-automated tools for doing so. This is true for a large class of users.
But there is also a large number of users entering XML tags using non-XML-specific
editors: one major category of such being in programming language environments, where
those languages have syntaxes in addition to that of XML. To be effective, user-helpful
tools need to support multiple syntaxes: not just that of the programming language
or languages used, and not just XML, but all of them.
One difficulty with using compact markup is that it's best used sparingly. That is,
only a small number of compact markup forms should be used in any particular context.
Successful Wiki Markup languages are a testament to this principle. Too many different
compact forms results in confusion. The classical paper on the subject is Miller's
The Magical Number Seven, which says that the limit on the number of usable forms (per context) is about 7
(plus or minus 2).
At this stage in the development of markup languages, it doesn't seem to be a particularly
controversial statement to say that the best use of fully-tagged and compact markup
is in some combination of the two -- with the balance chosen based on the needs of
a particular application. One size does not fit all. For an example, consider the
mixture of fully-tagged XML and compact XPath that appear in most XSLT programs.
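For instance, in a typical XSLT instruction the element-and-attribute structure is
fully-tagged XML while the select expression is compact XPath (the element names in
this example are illustrative):
<xsl:for-each select="chapter[title = 'Declarations']//function">
  <xsl:value-of select="@name"/>
</xsl:for-each>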
There are a number of ways in which the advantages of compact markup can be realized
in an XML context, including:
-
A general facility could be added to XML structure descriptors (DTD, schema, RELAX NG,
etc.) -- maybe some updated form of short references, as suggested in another paper here -- allowing markup language developers to define their own compact markup.
-
A similar facility could be created as a separate process, complementary to existing
XML structure descriptors and usable with any of them -- one that, for example,
adds further element structure to a previously parsed XML tree based on discovered
compact markup.
-
Some special-purpose compact markup could be supported as a separate process. This
approach would be appropriate if there were a limited number of applications of compact
markup -- only for literate programming applications, for example -- and no need for
a general approach.
The Literate Programming work described in this paper wouldn't really have been possible
without the use of some form of compact markup to complement the primary markup (SGML
or XML). The level of detail would make full XML markup, for example, difficult to
read, especially for programmers, whose primary interest is the programming code.
Another Aside, On The Kinds Of Documentation
The SGML/C project described above supported four kinds of documentation that could
be targeted by marked-up code and documentation:
-
User documentation: information for the end user of a software system.
-
Design documentation: information for helping maintain a software system, outlining
the structure of the software and how it works.
-
Fully annotated code: for use by those actually working with the code, detailing what
is actually done, how, and why.
These three categories of documentation are incremental: generally speaking, design
documentation includes everything the user is told, and annotated code includes all
the user and design information.
-
Comments: There is some documentation that falls outside of any of the above categories:
comments detailing the how and why of specific code snippets (rather than the more
general techniques that apply to whole methods or other segments of code). These
comments are inseparable from the code they annotate, and seem to be best entered
as language-specific comments rather than as marked-up documentation. Unlike the
above categories of documentation, these kinds of comments need no special handling.
And of course, there's the code itself: what the programming language's compiler needs
to be given. In practice there can be more than one kind of code:
-
The "production" code, that appears in the final product. There can be multiple products,
or multiple versions of a product, originating in one set of code.
-
In addition, code can exist as part of the software development process, with lots
of extra checks and reports.
Markup can effectively distinguish between different versions and kinds of code.
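For example (a hypothetical form -- this attribute is an assumption, not the project's
actual markup), a target attribute on the code markup could route each block to the
appropriate build:
<code>
<!-- appears in every build -->
'create opened element' (parsing_state);
<code target="development">
<!-- appears only in the development build, with its extra checks -->
'check parsing state invariants' (parsing_state);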
So there are at least four kinds of things created from marked-up code: user, design
and annotated code documentation, and the compiler's code.
A Literate Programming Markup Language As A New Language
Adding comments to program code doesn't change the programming language used in any
way. It remains the same programming language plus comments. But once major programming
language constructs, such as data structure declarations and function headings, are
replaced by documentation-friendly markup, we find ourselves looking at a different
programming language.
At what point changing the syntax of a programming language makes it a different language
depends largely on one's point of view. From the point of view of the programming
language designer, syntax is a minor issue: functionality is their focus. From the
point of view of the language user, syntax is just about everything: it matters
how an "if" statement is coded, even though its semantics is more-or-less the same
in every programming language. As a consequence, any useful definition of what constitutes
a programming language, and the extent to which two are the same, has got to take
syntax into account.
A major impediment to acceptance of a literate programming language is the fact that
it is a different language. It's not the programming language that a programmer knows,
and switching over is not a small job. And I'm afraid to say that I've found computer
programmers in general very conservative in what languages they are willing to work
with: they generally stick with what they know. A major selling job is needed to
convince programmers to switch.
Its being a different language than what programmers were used to seems to be a large
part of the reason that the SGML/C-based programming language described in this paper
failed. It may also have failed for other reasons: lack of promotion of the language, or a
well-established base of other software that management and the programmers didn't
want to change. These things have to be taken into consideration when developing
a new language, to ensure its better acceptance.
Conclusions And Observations
Literate Programming is something that clearly needs more work:
-
More use of Literate Programming needs to be undertaken so that useful ideas can be
developed. If nobody does it, it's not going to happen.
-
Markup conventions for Literate Programming need to be developed, either with respect
to a particular programming language, or which apply to a variety of programming languages.
There is not going to be general acceptance of Literate Programming if every language
or, worse yet, every system has its own set of conventions.
As noted earlier, the trade-offs between full and compact markup are somewhat subjective.
As a consequence, these conventions will need to be arbitrary. And that has to be
accepted.
-
Literate Programming tools need to be integrated into software development systems.
At present, Literate Programming is usually implemented as a preprocessor. But this
doesn't fit well with most visual software development systems, or with the expectations
of most programmers.
-
The use of compact markup in XML documents needs to be researched further. Whether
XML itself needs to be extended to support compact markup, whether that can best be done
outside of XML, or whether it's unwise to try either needs to be reexamined.
Markup-based Literate Programming gives us the opportunity to bring the advantages
of markup in general, and XML in particular, to a wider community. More than any
new programming language feature -- which language designers are always on the lookout
for -- better and more reliable documentation could make a difference to how computer
programmers work. But it's not a small task: it's as big as developing a whole new
programming language.
Standard Generalized Markup Language (SGML). International Organization for Standardization, ISO 8879:1986.