Kalvesmaki, Joel. “Serializing the Locator Format of the United States Government Publishing Office as
XML.” Presented at Balisage: The Markup Conference 2023, Washington, DC, July 31 - August 4, 2023. In Proceedings of Balisage: The Markup Conference 2023. Balisage Series on Markup Technologies, vol. 28 (2023). https://doi.org/10.4242/BalisageVol28.Kalvesmaki01.
Balisage Paper: Serializing the Locator Format of the United States Government Publishing Office as
XML
Joel Kalvesmaki
Joel Kalvesmaki is a software developer for the United States Government
Publishing Office, founder and director of the Text Alignment Network, and a
scholar specializing in early Christianity.
Copyright Joel Kalvesmaki, released under a Creative Commons Attribution 4.0 International
License
Abstract
The United States Government Publishing Office makes significant use of so-called
locators, a specialized file format for commercial
typography. In this article, I explain the origins of the locator format, discuss
its architecture, and explore some of the challenges inherent in understanding and
converting the format. I present Slax, Serializing Locators as XML, an expression
of
the locator format in XML and an application that makes the conversion. In telling
the story of the locator format, which is poorly documented, I show that the
specifications developed slowly over time as user expectations changed, and in
tandem with improvements to its primary processor, MicroComp. Slax also provides an
excellent example of where XSLT is an ideal technology for document conversion, and
where it is not, and provides a model for the symbiotic use of declarative and
imperative programming models.
The Government Publishing Office (GPO), which serves the legislative, executive, and
judicial branches of the United States Federal Government, makes significant use of
so-called locators, a specialized format for commercial typography.
At present, the format is actively used on a daily basis to typeset the
Congressional Record, the Federal
Register, and other significant publications.
At Balisage 2020, Andries and Wood presented an approach to converting locator files
to structured XML. (andries_and_wood) Their article focuses upon the U.S.
Code and its up-conversion to the semantically rich United States Legislative Markup
(USLM) format, and it provides an overview of a complete workflow.
In this paper, I present a generalized process that serializes the locator format
as
XML. Unlike Andries and Wood's application, which focused on the U.S. Code, this process
deals with all known locator publications. And whereas Andries and Wood's approach
utilized Java to deal with the conversion process, mine adopts a mix of two
technologies: XSLT to manage the sprawling data resources and a C# application to
replicate the byte handling of MicroComp, GPO's flagship desktop composition
application.
My paper offers two significant contributions. First, in telling the story of
locators, I argue that the format, which is poorly documented, developed slowly and
silently over time as user expectations changed, and as MicroComp evolved in the 1990s.
MicroComp, in fact, became a de facto black-box specification for a
format that was never fixed. Second, the serialization of locators into XML provides
an
excellent example of places where XSLT is an ideal technology for document conversion,
and places where it is not, and offers a model for the symbiotic use of declarative
and
imperative programming models.[1]
The Background of the Locator Format
The Government Printing Office (now Government Publishing Office, GPO) was created
by
an Act of Congress in 1860 to centralize and stabilize a relatively chaotic printing
system. It is one of the oldest agencies of the Federal Government, and has had storied
moments. GPO seconded its staff to support the defense of Washington, DC during the
Civil War, and was responsible for printing the Emancipation Proclamation.
GPO has been known for innovation and early adoption of new technology in printing.
Necessity was a driving force, given the volume of government output and the costs
of
labor and materials. Starting in 1963, GPO used electronic printing methods, beginning
with a video unit and two Linofilm keyboards created by the Mergenthaler Linotype
Corporation. GPO's first electronically produced publication, from March that year,
set
in 8/9 Bodoni Book, was “believed to be the first anywhere to use a pre-existing
tabulating card file to produce typographical quality output.” (uspo_1963)
Over the next several years, GPO took a small staff of programmers, added to their
ranks compositors taught to program, and collaborated with IBM and Mergenthaler to
create a digital typographic system for the IBM 1401. (boyle) Teletype
system (TTS) code was converted to binary-coded decimal (BCD), and divided into blocks,
each block being assigned an identification code. That prefatory code was critical
to
locating a particular block of text, so it was called a
locator.
In 1967, the Linotron 1010, a new photocomposition machine developed specially by
Mergenthaler and CBS Laboratories, was introduced to GPO production. (gpo_1967) The mechanical process involved directing the rays of a mercury
arc lamp through a grid of 256 characters on the cathodes of a vacuum tube. An external
logic signal controlling an electrostatic wire grid selection matrix ensured that
only
one of the 256 fragments of the beam—the chosen character—passed to an electron
multiplier. (mlc_1966) The Linotron magazine could hold up to four
grids for a total of 1,024 characters. The photocomposition system was operated by
a
Master Typography Program (MTP), a federated suite of small, independent programs
written for the IBM 360/50 and other IBM peripherals. (cavanaugh)
Customers would give GPO magnetic tape that had text or data, structured by a limited
set of characters and reference codes. These codes identified text block categories,
but
not typographic styles. For example, a first-level header might have been marked with
code 11, but the typographical specifications for block style 11 were defined separately
in files curated by staff of GPO's Electronic Photocomposition Unit. This approach
meant
that the same structured text could, without alteration, be slotted into different
publications, and not require redesign or new markup. That approach is still used
today:
the locator code for many legislative bills is copied without change into the
Congressional Record.
Typographical specifications rested in two different GPO libraries. The parameter
library (later called a format library) dictated page layouts, paragraph dimensions,
and
the behavior of lines. A second grid library coordinated input with the Linotron grid
magazine by supplying offset lookup tables, character width values, and so forth.
These
and other tape libraries were updated by GPO staff via cards. (jcp_1970,
90, 93–106) Thus, the Linotron system of the 1960s—a landmark in printing
history—indelibly influenced the terminology (“locators,” “formats,” “grids,”
“parameters”) and the architecture of GPO's electronic publication program.
The Linotron system was adequate for simple text-centric publications where digitized
data already existed, but not for more complex ones with graphics, tables, multiple
columns, and spanning heads, and not for publications where text had not yet been
digitized. These publications were still handled by the costly hot metal process.
GPO
was eager to bring its entire line of publications into an electronic workflow, but
as
it considered options beyond the Linotron, it soon realized that no outside contractor
would be able to write the software needed to publish complex documents. So in the
1970s
GPO decided that it would develop its own typesetting software in-house. To build
the
staff of its Electronic Photocomposition Division, GPO recruited craftsmen from Linotype
and Monotype.
GPO also sought to purchase different kinds of typesetting hardware. In 1974 GPO
acquired a system that combined optical character recognition (OCR) with text editing.
Designed by Compuscan and Atex, the system combined a model 170 scanner with a PDP
11/35
computer and two video display terminals (VDT). Its output was compatible with the
Linotron. (cavanaugh) Two years later GPO procured VideoComp 500
machines for the Atex systems. The MTP software, which had been designed specially
for
the Linotron, was now revised for the PDP 11/35 minicomputer hooked to the VideoComp
500.
Printing jobs were gradually transferred from the Linotron to the VideoComp systems.
New Atex systems were procured in 1978, the same year that GPO established its Graphic
Systems Development Division (GSDD), which was tasked with creating and supporting
typography software that was now distinctively its own. At the heart of the application
network was the Automated Composition System (AComp), the core composition
algorithm. (gpo_1978) GSDD developed new features for AComp. In 1979
it introduced functionality for tables, which it called subformats. (gpo_1979) The following year it introduced features to allow tables and
graphics to span columns, or to let tables flow across column or page breaks. (gpo_1980)
In the 1980s, GPO enhanced the existing VideoComp systems, but also looked ahead.
Experience had shown that a locator-based approach to typography was conceptually
sound
and remained independent of any particular system. The format had matured to the point
where it could handle all its publications, so GSDD prepared to make AComp and its
other
programs more widely available. In 1984, GSDD management announced that an unnamed
typesetting program had entered production on the local network. (rollert)
The shift to electronic publishing was complete, and in February 1985 GPO officially
closed down its hot metal operations. But software development continued. The aging
Atex
system was scheduled to be replaced by a VAX 6210, requiring patches and new builds,
and
there was a pressing need to repurpose the codebase to enable typesetting on the
microcomputer. In 1989, GSDD released for IBM-compatible systems an application it
called MicroComp, a name that paid homage to the older VideoComp systems (for both,
“Comp” was short for “composition”). (gpo_1989) By 1991 more than thirty
offices in Congress were using MicroComp. (gpo_1991)
Very quickly, MicroComp became the mainstay for typesetting at GPO, Congress, and
other parts of the Federal Government. Informally, the terms
locators and MicroComp became
interchangeable. Many aspects of MicroComp were advanced for the time. Despite the
256-character limit imposed by the byte, MicroComp gave users access to thousands
of
characters in dozens of styles. It allowed users control over a wide variety of
formatting and typographic features, some highly sophisticated. Paragraph types could
be
programmed with conditional leaders. Columns could be feathered (leading applied so
that
lines would vertically fill a column). It was extremely fast, typesetting many pages
per
second. AComp managed to do this in a mere eighty thousand lines of code, written
largely in C++, albeit with C-style, non-object architecture.
Despite these forward-looking accomplishments, MicroComp remained entrenched in 1960s
hardware and concepts. Like its predecessors, MicroComp was a federated suite of
programs that worked across a variety of computers and peripherals. The photomechanical
grids of the Linotron 1010 had been replaced by digital, conceptual ones that remained
agnostic of any meaning behind specific characters. MicroComp continued to rely upon
the
convention of locators (which now no longer located anything) and other terse codes
already familiar to its Federal customers. It retained jargon, lingo, and engineering
solutions rooted in 1960s paradigms.
MicroComp underwent upgrades and changes in the 1990s, although enhancements were
documented poorly, if at all. Most significantly, it began to support SGML input,
in
tandem with conversion of the Federal Register and the
Code of Federal Regulations. Gradual changes to byte handling
were introduced. Some new format features were silently introduced. Format and grid
libraries continued to evolve. Nevertheless, MicroComp continued to look backward
as it
progressed. GSDD's solution to integrating SGML with MicroComp was to have AComp first
convert the input to an internal locator byte stream, and then typeset that stream,
as
if the user had simply used locators, not SGML. Browsing the incremental changes in
MicroComp up to the turn of the millennium, one gets an eerie feeling that Unicode
is
distant and irrelevant, and that PostScript type 1 fonts represent the apex of font
technology.
In the early 2000s, a moratorium was placed on any but the most urgent changes to
MicroComp, while GPO leadership deliberated whether to pursue incremental development
of
the existing software, or to completely rearchitect it. The latter path was chosen.
In
2009 a bid went out for vendors to supply a composition system replacement. As of
September 2023, Congress is preparing to retire MicroComp, as it gradually switches
its
legislative workflow to XPub, GPO's next-generation, all-XML composition system. Much
could be said about that transition. But events of the last twenty years are stories
for
another day.
The MicroComp Locator Format
Specifications and Formats
The most comprehensive, detailed account of the locator system was published in the
early 1980s in a book with the arcane title Publishing from a Full Text
Data Base (2nd edition, 1983, herein
PFTDB). (gsdd_1983)
PFTDB is the closest approximation to a formal specification
for the locator format, because it documents the intentions of the designers of the
locator format. But it is imperfect. Some of the jargon is impenetrable. Some of
PFTDB's features were dropped or changed by MicroComp and,
of course, PFTDB excludes features that MicroComp introduced.
My description below draws from PFTDB and related documents
used by GPO Prepress staff for their daily work.[2] I have also relied upon experiments with current compiled MicroComp
executables, and upon selective study of the source code.
A typical locator file might look like this (control characters have been converted
to
Unicode control pictures):
␇S0634
␇I66F
␇I81SENATE RESOLUTION 77_DESIGNATING FEBRUARY 16, 2023, AS ``NATIONAL ELIZABETH PERATROVICH DAY''
␇I11Mr. ␇T4SULLIVAN␇T1 (for himself and Ms. ␇T4Murkowski␇T1) submitted the following resolution;
which was considered and agreed to:
␇I74S. Res. 77
␇I27Whereas Elizabeth Wanamaker Peratrovich, Tlingit, was a member of the Lukaa␇g840␇T5C␇K.aÿAE1di
clan in the Raven moiety with the Tlingit name of ␇g840␇T5B␇Kaa␇g840␇T5C␇Kgal.aat (referred to in
this preamble as ``Elizabeth'') who fought for social equality, civil liberties, and respect for
Alaska Native and Native American communities;
The code should not seem utterly strange. Quite a lot should be recognizable as
text. The rest is markup of one sort or another. A number of markup characters begin
with ␇, which is technically called the precedence code. More colloquially it is
called the bell code, because it is commonly (but not always) handled by codepoint
7
(now U+0007, ALERT). The character that follows the precedence code specifies a
particular action that should be taken, and it may be followed by other characters
that supply parameters for that action. The most common code is ␇I followed by two
digits. These are locators proper, tagging blocks of text by type.
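The parsing rules just described can be sketched as a small tokenizer. This is an illustrative sketch, not MicroComp's actual parser: the function name is invented, and it recognizes only the handful of codes this paper discusses (␇I, ␇T, ␇G, ␇g, ␇K); real locator streams carry many more.

```python
BELL = "\x07"  # the precedence ("bell") code, as typically typed

def tokenize(line):
    """Split a locator line into (kind, value) events.

    Hypothetical simplification: I = locator + two digits,
    T = typeface + one digit, G = relative grid + one digit,
    g = absolute grid + three digits, K = cancel override.
    """
    events = []
    i = 0
    while i < len(line):
        if line[i] == BELL and i + 1 < len(line):
            action = line[i + 1]
            if action == "I":            # locator proper: two-digit block type
                events.append(("locator", line[i + 2:i + 4]))
                i += 4
            elif action in "TG":         # typeface or relative grid: one digit
                events.append((action, line[i + 2]))
                i += 3
            elif action == "g":          # absolute grid: three digits
                events.append(("g", line[i + 2:i + 5]))
                i += 5
            elif action == "K":          # cancel grid/typeface override
                events.append(("K", None))
                i += 2
            else:                        # code not modeled here: record and move on
                events.append(("?", action))
                i += 2
        else:
            j = line.find(BELL, i)
            if j == -1:
                j = len(line)
            events.append(("text", line[i:j]))
            i = j
    return events
```

Run against the second line of the sample above, the tokenizer yields a locator event for ␇I11, typeface switches for ␇T4 and ␇T1, and plain text in between.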
Although the text looks like it might be ASCII or an ASCII-based codepage,
appearances are deceptive. The locator format is actually codepage agnostic, or, to
be more precise, the encoding is format dependent. Each MicroComp format allows a
designer to reassign any of the 256 input bytes as desired to any permitted text or
control character. More on that later.
The above ten lines were typeset by MicroComp as shown in Figure 1.
More than two thousand MicroComp formats have been developed over the years, for
various publications. Each publication is assigned one or more format numbers, and
the typographic specifications are tailored to the publication in question. The
format files that MicroComp uses are binary, and they are created by compiling a
source ASCII file, created and edited by GPO staff. Line 1 in the example above
invokes format number 0634. Here are some samples from the source ASCII file for
0634 (the full file is 439 lines):
/FORMAT NO./ /0634/ .PCF
/GRID TABLE/ /738/ /072/ /739/ /780/
/PAGE/
/00/ /003/ number of text columns
/01/ /177/ column width (including gutter)
/02/ /716/ column length--odd page column 1
/03/ /716/ column length--odd page column 2
/04/ /716/ column length--odd page column 3
/11/ /019/ column sink--odd page column 1
/12/ /019/ column sink--odd page column 2
/13/ /019/ column sink--odd page column 3
/20/ /716/ column length--even page column 1
/21/ /716/ column length--even page column 2
/22/ /716/ column length--even page column 3
/29/ /019/ column sink--even page column 1
/30/ /019/ column sink--even page column 2
/31/ /019/ column sink--even page column 3
/38/ /716/ column length--first page column 1
/39/ /716/ column length--first page column 2
/40/ /716/ column length--first page column 3
/47/ /019/ column sink--first page column 1
/48/ /019/ column sink--first page column 2
/49/ /019/ column sink--first page column 3
/56/ /.../ x-origin of the head in storage area 1
/57/ /.../ y-origin of the head in storage area 1
/76/ /001/ bottom leading for running (continued) heads
/77/ /010/ Footnote sep lding in pts--10-4-77
/78/ /002/ Footnote sep max carding allowed in pts--10-1-77
/79/ /001/ kill last page square-off
/80/ /1/ Rotation of the job and frame number and folio
/81/ /1/ Rotation of the page
. . . . . . .
/LOC 1/ FRSLN NEXT
LOC PTSZ LDING LNLG TPLD BTLD Y1LN PRIND SCIND PRIOR PRIOR
/10/ /008/ /009/ /168/ /009/ /000/ /008/ /000/ /000/ /3/ /2/
/11/ /008/ /009/ /168/ /009/ /000/ /008/ /008/ /000/ /3/ /2/
/12/ /008/ /009/ /168/ /009/ /000/ /008/ /000/ /016/ /1/ /1/
/13/ /008/ /009/ /168/ /009/ /000/ /008/ /000/ /024/ /1/ /1/
. . . . . . .
LOC MINSP MAXSP RHOD RHDE STORHD SETSZ GPLD PARLD
/10/ /006/ /009/ /.../ /.../ /.../ /.../ /010/ /009/
/11/ /006/ /009/ /.../ /.../ /.../ /.../ /012/ /009/
/12/ /006/ /009/ /.../ /.../ /.../ /.../ /012/ /009/
/13/ /006/ /009/ /.../ /.../ /.../ /.../ /012/ /008/
. . . . . . .
F L G T L T R L L L M H H S S S X F S S L C LOC
O D R F N S U D D D C D I E P N - T E C I N POR
T T N N T F L C R T L T E H C P S N P T N C TION
P P O O P D I H B F N P A D L G T O C F N S
D L R T H B
LOC X K Y E R
/10 .. . 1 1 J . . ... . . . . . . W . . . ... . . . .../
/11 .. . 1 1 J . . ... . . . . . . W . . . ... . . . .../
/12 .. . 1 1 J . . ... . . . . . . W . . . ... . . . .../
/13 .. . 1 1 J . . ... . . . . . . W . . . ... . . . .../
. . . . . . .
/CHRTBL UNSHIFT/
Displacement Table Displacement Table
Input Grid Input Grid
Char Code Pos. Char Code Pos.
bypass qc /004/ /000/
bypass qm /005/ /000/ open bracket /091/ /091/
small bullet ␁ /001/ /001/ /.../ /092/
small cap ampersand /224/ /002/ close bracket /093/ /093/
plus-minus /027/ /003/ minus /094/ /094/
lc mu sss0 /160/ /005/ open quote /096/ /096/
twist /006/ /006/ a /097/ /097/
lc phi sss1 /161/ /007/ b /098/ /098/
lc delta sss7 /167/ /008/ c /099/ /099/
/.../ /009/ d /100/ /100/
. . . . . . .
ampersand & /038/ /038/ SPECIAL CHARACTERS
apostrophe ' /039/ /039/
open paren ( /040/ /040/ The output code for all functions
close paren ) /041/ /041/ in the character table can be
asterisk * /042/ /042/ anything between 192 and 240.
plus + /043/ /043/ There should be no outputs in the
comma , /044/ /044/ character table between 128 to 191.
/.../ /045/
. /046/ /046/ hyphen /045/ /192/
/.../ /047/ shill /047/ /193/
0 /048/ /048/ em dash /095/ /194/
1 /049/ /049/ word space /032/ /195/
. . . . . .
/HEADS/
/ALL/
/C111012000522000000␇G3␇T2CONGRESSIONAL RECORD␇P/
/@/
/CONSTANT/
/CC/ /_Continued␇P/
/XXXXXXXXXXXXXXXXXXX/
Relevant data is wrapped in pairs of slashes. Nearly everything else is ignored, and
so can be treated as comments. Fields are identified by position within the source
document and by localized groups.
Every format file has four sections. The first deals with global type and page
settings. The second handles typographic rules for up to ninety-nine locators. This
section is lengthy because it allows numerous block/paragraph properties to be
defined, so it is broken up into three subsections. The portion above shows parts
of
the three subsections relevant to locators 10 through 13. The third major section
of
the format source file is a character displacement table, and the last handles
miscellanea such as running heads.
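The slash-field convention lends itself to a one-line extractor. A minimal sketch, assuming only the convention stated above (data sits between paired slashes; everything else is commentary); the function name is my own.

```python
import re

def slash_fields(line):
    """Extract the /…/-delimited fields from a format source line.

    Everything outside paired slashes is treated as a comment.
    """
    return re.findall(r"/([^/]*)/", line)
```

Applied to the sample, `slash_fields("/00/ /003/ number of text columns")` returns the key `"00"` and the value `"003"`, discarding the trailing description.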
For the purposes of this paper, the third section is critical, because it facilitates
byte remapping, and therefore defines the codepage. In the example above, the
hyphen, decimal 45, is targeted to be converted to 192. Incoming codepoint 160,
treated as a μ, is to be shifted to byte 5. Incoming 95 (the lowbar, _) is to be
shifted to byte 194, the em dash (see the locator sample plus figure 1 for an
example). Any input codepoint can be redirected to another value, as long as the
resulting codepoint is either from 0 to 127 inclusive (the character block), or 177
to 255 inclusive (the function code block). (The comment in the example above,
indicating that the permitted range is 192 through 240, reflects an earlier, more
restrictive scope. Function codes were added over time, some for internal use,
others for users.) Yes, the null byte is allowed, and can be either an input byte
or
a target one.
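The displacement step amounts to a byte-for-byte table lookup. The following sketch (function name and error handling are my own) enforces the constraint just stated: every target must land in 0 through 127 (characters) or 177 through 255 (function codes), and the null byte is fair game on either side.

```python
def apply_displacement(data, table):
    """Apply a format's character-displacement table to a byte stream.

    `table` maps input byte -> output byte; unmapped bytes pass
    through unchanged. Illustrative sketch only.
    """
    out = bytearray()
    for b in data:
        t = table.get(b, b)
        # Targets must be a character (0-127) or a function code (177-255).
        if not (0 <= t <= 127 or 177 <= t <= 255):
            raise ValueError(f"target byte {t} falls in the reserved range")
        out.append(t)
    return bytes(out)
```

With the mappings from format 0634's table (hyphen 45 to 192, em dash 95 to 194), an incoming hyphen or lowbar is rerouted into the function-code block while ordinary letters pass through untouched.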
As mentioned above, the precedence code is normally described as codepoint 7,
because that is what creators and users of locator files typically type and see. But
that is not completely accurate. The real precedence code, after byte displacement,
is actually 200. That shift, as well as the choice of codepoint 7, is arbitrary, and
not universal. In about five percent of GPO's two thousand formats some other
character is defined as the precedence code, and codepoint 7 is used for other
purposes. The ability to shift bytes howsoever one wishes makes the locator format
highly enabling, and highly unpredictable. In a study of the more than two thousand
formats, I have found some strange, surprising mappings. It is even possible to
write a format specification that completely rescrambles all the letters and digits
(fortunately, none of the existing standard formats do so).
When MicroComp's typographic core, AComp, receives an input byte stream, its first
order of business is to perform the appropriate character displacement described
just above. But before it can do that, it must perform another perfunctory initial
character displacement to eradicate noise from word processors. PFTDB
presumes the input to be a pure locator byte stream, and not the result
of any customer, off-the-shelf software. But with MicroComp (introduced after the
publication of PFTDB), users began to use XYWrite and
WordPerfect to create locator files, and GSDD staff had to deal with input that
included reserved codepoints. For example, XYWrite used codepoints 174 and 175 to
wrap excluded material. Word processors might include tabs, codepoint 9. That proved
to be something of a nuisance, because in most MicroComp formats codepoint 9 becomes
the en dash, a very common character. For such cases, MicroComp was written so as
to
strip away known XYWrite and WordPerfect reserved codepoints. If a user needed to
directly access those codepoints, they were given a workaround: codepoint 255
followed by a two-digit hexadecimal number. During this initial process, AComp would
convert those three-byte codes to the single, specified byte. Thus, even today, one
uses ÿ09 to get an en dash in MicroComp, because to encode it as a tab would
guarantee that it would be stripped out before processing begins.
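The ÿ-escape convention can be modeled as a small pre-pass over the byte stream. This sketch covers only the escape resolution described above; real AComp also strips the word-processor reserved codepoints, which is omitted here, and the function name is invented.

```python
def resolve_escapes(data):
    """Resolve MicroComp's byte-escape convention: codepoint 255 (ÿ)
    followed by two hex digits stands for the single byte they name.

    E.g. ÿ09 yields byte 0x09, which in most formats becomes the
    en dash after displacement.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0xFF and i + 2 < len(data):
            # Consume the three-byte escape and emit the named byte.
            out.append(int(data[i + 1:i + 3].decode("ascii"), 16))
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```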
Grids
After AComp has performed those two passes of byte displacement, it has a new byte
stream consisting of either characters (codepoints 0 through 127) or functions
(codepoints 177 and above). For now we ignore the function codepoints and look more
closely at the character codepoints.
AComp's next order of business is to connect a character to a font glyph, and
gather any associated properties. That connection is made through a grid
specification. Every format defines four default grids (grids 1 through 4). In the
format example above, the four grids are 738, 072, 739, and 780. All formats have
tacit access to a standard set of four more grids that feature commonly needed
specialized characters (relative grids 5 through 8, targeting absolute grid numbers
805 through 808).
A grid file, which is just as important as the format file already discussed, is
also a binary artefact generated by compiling an ASCII source file. Keeping with our
example, here are excerpts from the ASCII source file for format 0634's grid 1 (=
absolute grid number 738):
;JUNE 24, 1996
; Column
;Ionic/Century Schoolbook 1 Typeface
; 2 Displacement
;MICROGRD.738 3 Output Code
; 4 Font Number
;T1 Ionic Roman C&lc (052) 5 Synthetic font
;T2 Century Schoolbook Bold C&lc (038) 6 Postscript widths
;T3 Ionic Ital C&lc (054) 7 Hyphenation Code
;T4 Ionic Roman C&sc (052) 8 Entity Tag & Description
;T5 Ionic Roman C&sc (052) (full figs)
;
1 000 000 000 00 00000 000 00000000 ; [T1 Ionic Roman C&lc]
1 001 183 086 00 00556 000 sbull ; ␁ Small bullet
1 002 038 052 01 00833 000 scamp ; ␂ Small cap &
1 003 177 009 00 00549 000 plusmn ; ␃ Plus-minus
1 004 101 009 00 00439 000 epsiv | egr ; ␄ lc greek epsilon
1 005 109 009 00 00576 000 mu | micro | mgr ; ␅ lc greek mu
. . . . . .
1 097 097 052 00 00615 097 ; a Lowercase a
1 098 098 052 00 00615 098 ; b Lowercase b
1 099 099 052 00 00563 099 ; c Lowercase c
. . . . . .
5 121 089 052 01 00833 121 ; y Small cap Y
5 122 090 052 01 00729 122 ; z Small cap Z
5 123 189 086 00 00551 003 lparb ; { Black open paren
5 124 161 006 00 00333 000 iexcl ; | Upside down bang
5 125 190 086 00 00551 002 rparb ; } Black close paren
5 126 191 006 00 00611 000 iquest ; ÿ7E Upside down question
5 127 162 009 00 00247 000 prime | min | foot | quot ; ␡ Single prime
GGGGGGGGGGGGGGGGGGGGGG
␚
The grid file is equivalent to the grid tape library that controlled the Linotron
1010 grid magazine. But where the Linotron grid magazine allowed four different sets
of 256 characters each, the MicroComp grid supports five different sets of 128
characters each. A MicroComp grid is navigated by a pair of integers: a typeface
option (1 through 5) and a codepoint (0 through 127).[3] The typeface-codepoint pair calls up a particular row, which has six
pieces of information (identified in the example as nos. 3–8): (3) a glyph number,
(4) a font number, (5) a special code that indicates what kind of special
typesetting modifications should be applied, (6) a number specifying the relative
width of that font's glyph, (7) a hyphenation code, and (8) entity names that are
aliases for the codepoint (for conversion from SGML). The modifications (number 5)
can be quite significant: small caps, superior (superscript), double superior,
inferior (subscript), double inferior, cancel (strikethrough), underscore,
condensed, extra condensed, extended, or some combination thereof.
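Read as data, each grid row is a (typeface, codepoint) key mapped to the six fields enumerated above. Here is a hypothetical parser for the source format shown, with column meanings taken from the sample's numbered comment key; names and structure are my own.

```python
from collections import namedtuple

GridRow = namedtuple("GridRow", "glyph font synth width hyph entities")

def parse_grid_line(line):
    """Parse one grid source row into ((typeface, codepoint), GridRow).

    Columns, per the sample's key: typeface, displacement (input
    codepoint), output (glyph) code, font number, synthetic-font
    code, PostScript width, hyphenation code, then optional entity
    aliases separated by '|'. Text after ';' is a comment.
    """
    data = line.split(";", 1)[0].split()
    typeface, cp, glyph, font, synth, width, hyph = data[:7]
    entities = [e for e in data[7:] if e != "|"]
    return (int(typeface), int(cp)), GridRow(
        int(glyph), int(font), int(synth), int(width), int(hyph), entities)
```

Parsing the Greek mu row of grid 738 yields key (1, 5) with glyph 109 in font 9 (Symbol), width 576, and the SGML entity aliases mu, micro, and mgr.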
GPO has defined more than 180 grid files. Commonly, a single grid file targets one
or two faces, and one or more fonts from each face. In lines 7–11 in our example
above, the comment indicates that the second typeface option (labeled T2) targets
Century Schoolbook Bold. The other four options support the Ionic face. Two work
with caps and lowercase: one roman and one italic. Another two typeface options
target roman Ionic as caps and small caps, one with text figures (old-style
numerals) and the other with full or lining figures.
Although a given grid typeface option is declared to target a particular
font and font modification, it always mixes in glyphs from other fonts, to provide
access to characters not in the target font, and to fill up the entire set of 128
slots. In the grid extract above, typeface 1 codepoint 1, the small bullet, points
not to Ionic (font number 52) but BGsddV01 (font 86), an in-house custom font.
Typeface 1 codepoint 4, Greek epsilon, targets font 9, Symbol.
When processing a given byte of text, how does AComp determine which grid and
typeface option to use? As noted above, every format has eight predefined grids, and
the format defines a default grid and typeface number for each locator. As AComp
walks through the byte stream, it changes the default grid and typeface as it moves
from one locator to the next. Some special precedence codes can change the current
grid and typeface numbers in the middle of a locator. ␇G1 through ␇G8 changes the
relative grid number, and ␇gNNN changes to absolute grid number NNN. ␇T1 through ␇T5
changes to a particular typeface in the grid. ␇K cancels the special grid-typeface
call and returns AComp to the current locator's default grid and typeface number.
The sample locator text above provides examples of all these codes.
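The grid and typeface bookkeeping can be modeled as a tiny state machine. A sketch under the rules just stated; the class and method names are invented, and real AComp tracks considerably more state than this.

```python
class GridState:
    """Track the current grid and typeface as the byte stream is walked.

    Each locator sets a default grid/typeface; G selects a relative
    grid (1-8), g an absolute grid number, T a typeface (1-5), and
    K restores the current locator's defaults.
    """
    def __init__(self, default_grid, default_typeface):
        self.default = (default_grid, default_typeface)
        self.grid, self.typeface = self.default

    def enter_locator(self, grid, typeface):
        # Moving to a new locator resets the defaults and the state.
        self.default = (grid, typeface)
        self.grid, self.typeface = self.default

    def apply(self, code, arg=None):
        if code in ("G", "g"):          # relative or absolute grid call
            self.grid = int(arg)
        elif code == "T":               # typeface within the current grid
            self.typeface = int(arg)
        elif code == "K":               # cancel: back to locator defaults
            self.grid, self.typeface = self.default
```

Replaying the sample's ␇g840␇T5 … ␇K sequence against this state shows the override to absolute grid 840, typeface 5, then the snap back to the locator's defaults.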
So when AComp processes a text byte, it uses the immediate context to determine
which grid and typeface should be consulted. It notes the target font number and
glyph number. It stores the relative width of the character and any special
modifications, and uses that information to determine how to space words, and where
to place line breaks. (For better or worse, AComp performs no kerning.) And in one
final act of displacement, it shifts the current byte to the target glyph byte,
which is what will be written to the output PostScript file.
Fonts and Hyphenation
Unlike the older Linotron and VideoComp systems, which created output directly,
the MicroComp system's primary goal is to produce a sequence of one or more
PostScript files, one per printed page. Once that is finished, its job is done. It
is up to subsequent applications to handle the PostScript files as desired, whether
by passing them to a printer that has the fonts preinstalled, or by distilling the
files into a PDF, or something else.
The font library, however, poses its own challenges. The current library consists
of approximately 150 Type 1 PostScript fonts. Many of these fonts are in the
standard PostScript encoding (“standard” is a bit of a misnomer: some assignments
in
that encoding table are rather eclectic). But many of them are non-standard, e.g.,
fonts with the Cyrillic alphabet, the Greek alphabet, woodcut ornaments, box
drawing, wingdings, and symbols. Some fonts are in-house creations. GPO commonly
fulfills requests from customers to design new glyphs, so that unusual characters
are available at a keystroke (see Figure 1). GPO's piecemeal
typography has included Chinese ideographs, IPA symbols, mathematical equations, and
even the signature of sitting Presidents.
From the perspective of AComp's operations, the semantics of the target glyphs are
irrelevant. The only exception pertains to hyphenation, which is significant in
Federal publications. In legal documents such as legislative bills, line numeration,
and therefore line breaks, must be stable and deterministic. A word break dictionary
is shared across all MicroComp systems, and no user word break dictionary is
permitted. Any update to the word break dictionary requires MicroComp to be released
in a new major version. A user can force hyphenation in a word, but only by changing
the input locator file, a safeguard against tampering.
Challenges in Converting Locators
There is a significant need to provide a conversion service that universally converts
locators to some modern format, be it XML, JSON, HTML, or something else. The model
offered by Andries and Wood is both original and significant, but represents only
the
start, and is focused upon the U.S. Code. Around eighty government publications have
been published with MicroComp, and they need similar attention. If locator files have
been accumulating over the last fifty years, across many branches of government, it
is
in the interest of archivists and historians to be able to convert them. Equally
important to the past is the future. A conversion service should be able to handle
cases
when (not if) changes are made to GPO's format and grid specifications for publications
that still use locators.
My summary of the history of MicroComp, and my description of how it processes locator
files, show that there are several challenges inherent in the effort to convert locator
files universally. In this discussion I presume an interest in XML (not JSON or another
format) as the target format.
The locator format is not a context-free language. Quite the contrary, it is
heavily contextual. A given locator file cannot be interpreted without access to all
the
formats that are invoked, and to all the grids, even those that are not referenced
by
the formats, because the creator of a locator file can on a whim draw from any of
the
180 grids. When thinking about converting a locator file, one must deliberate over
how
much of that context needs to be placed within the target file.
The text itself poses problems. XML requires Unicode. Translating locator
characters to Unicode can pose a significant challenge. Many font glyphs require
interpretation. Sometimes, inspection of a particular glyph leaves mysteries about
the
intent of the font designer. For many glyphs, no Unicode value exists, or no Unicode
value ever will exist. Some glyphs are composites, and require resolution to multiple
Unicode characters. And even after determining a Unicode value, it is reasonable to
suspect that a character may have been introduced capriciously, used for multiple
functions, or simply misused. I have seen evidence of this in actual congressional
legislation, where the German eszett, ß, was used as a Greek beta, or an obscure line
drawing character was used instead of a hyphen or en dash.
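To make the composite and missing-value cases concrete, here is a minimal Python sketch of what a glyph-to-Unicode mapping table might look like. The font names, glyph codes, and mappings below are invented for illustration; they are not GPO's actual data.

```python
# Hypothetical glyph-to-Unicode mapping for locator fonts. Some glyphs
# map to a single character, some resolve to a composite of several
# Unicode characters, and some have no Unicode equivalent at all.
GLYPH_MAP = {
    # (font, glyph code) -> Unicode string, or None if no mapping exists
    ("Gpospec5", 102): None,        # decorative lozenge: graphic only
    ("MIonic", 65): "A",            # ordinary letter
    ("Gpospec5", 200): "x\u0331",   # composite: x + combining macron below
}

def to_unicode(font, code):
    """Resolve a font glyph to its Unicode value, or None if the glyph
    has no Unicode equivalent and must be rendered as a graphic."""
    return GLYPH_MAP.get((font, code))
```

A glyph that resolves to None would then be handled graphically, for example as an SVG path, as Slax does for the lozenge discussed below.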
Internally, locator files do not declare a version number, and they assume that
users know and have access to the format and grid files. But format files and grid
files
have changed little by little over the years. Some of the fonts have as well. The
great
majority of these changes have been made silently, without any versioning system,
so
that it is impossible to retrieve a previous version. For any given locator file,
one
would be hard-pressed to reproduce the original typeset artefact, because one would
have
to configure MicroComp so that it matched exactly the version in use at the time.
Some of the locator conventions are not well documented. They do not feature in
PFTDB, and they are not explained in user manuals. This
phenomenon is seen in some of the ASCII source files for the formats, where new
key-pairs suddenly appear in some of the formats, and are not explained. Only by
experimentation with MicroComp can one discover exactly what certain features were
designed to do. That can be true for some documented phenomena, as well. A feature
may
be explained in PFTDB quite sparingly, or in impenetrable jargon,
and only tests with MicroComp can bring to light what is meant. Sometimes those
experiments exhibit unexpected behavior that may or may not have been part of the
intended design. Such Easter eggs also highlight MicroComp's error handling. A number
of
types of syntax errors are forgiven, and some permitted syntactic constructions can
throw an error. The error messages returned by MicroComp are commonly inaccurate,
misplaced, or uninformative.
The peculiar dependency of the format upon its processor highlights another challenge
for conversion, that of replicating MicroComp's byte displacement rules. Locators
were
designed under an imperative programming paradigm. All AComp operations are applied
byte
by byte. It is common for AComp, at a given byte, to look behind or look ahead before
applying a rule. Determining what those rules are requires experimentation with
MicroComp and review of the source code, which has very little documentation. Variable
names are cryptic. Conditional jumps out of iterative routines to completely different
modules are frequent and unexplained. To describe AComp as spaghetti code would be
generous. The code has been difficult to maintain across the years, evidenced by
developers' comments, e.g., “these instructions make no sense to me,” “recursive
nonsense,” “don't know why did this . . . that goddamned unified agenda is the
reason.”
Some parts of the specifications appear to be ignored by MicroComp, or commented out.
Other parts of the specifications have unclear functionality. In many such situations,
I
have been unable to find archived locator files with the specific coding, to test
and
verify the intended function against actual past publications.
Serializing Locators as XML (Slax)
A conversion process for locators could target any one of a number of formats. USLM
XML, the target of Andries and Wood's effort, is only one of many possible destinations.
Users of GPO's official document repository GovInfo.gov exhibit a strong preference
to
access and read documents in HTML. Archivists or other practitioners may wish to develop
pipelines into Akoma Ntoso, a generalized XML format for legal documents. Parliamentary
proceedings in other countries have been of interest to historians and linguists,
and it
is likely that many U.S. documents would be ideal for a pipeline into TEI XML. And
we
should not forget the possibility that some government publications would be ideal
candidates for JATS XML.
Perhaps most important, GPO's next-generation composition service, XPub, uses XML
Publishing Professional (XPP) for its core composition engine. The input model GPO
has
designed for XPP (which can handle a variety of input formats) consists of relatively
flat XML files punctuated by processing instructions. In those XML files, typesetting
concerns are at the forefront. Conceivably, in the future, XPub may wish to incorporate,
or migrate to, another typographic engine (say InDesign) that supports
typographic-centered XML for input or output.
Every target format is lossy to one degree or another. Certainly, it is possible to
take, say, USLM, and pack it with new attributes or elements to try to stanch the
loss.
But that would result in a kludge, and wind up making USLM something it wasn't designed
to be.
Rather than try to force locators into an existing format, I asked myself what GSDD
designers in the 1970s would have done had they been given the opportunity to
serialize the format as XML. The term “serialize” here is important, because it signals
a change in perspective. My goal was not so much to convert locators to XML, but to
express them as XML. Securing an XML
expression of locators puts us in an excellent place. Locators serialized as XML could
be easily reverted to the original format, if one liked. And it could become a pivot
format, facilitating straightforward conversion to any other format.
The concept developed into a name, Slax, meaning Serializing Locators as XML. Slax
is
the name for both the XML format and the application that does the conversion.
But what to do about the many challenges (see previous section)?
Some of those challenges, such as misuse or creative use of glyphs, are intractable,
and not peculiar to locators themselves. They would be put to the side.
The sprawling format-grid-font context, it seemed to me, could be handled optimally
by
an XSLT-based workflow. The ASCII source files for formats and grids are structured
data. One might be tempted, given our current technology, to apply Invisible XML to
them. Personally, I wouldn't want to try. XSLT proved to be an excellent choice for
parsing the 2,000+ formats and 180+ grids. Certainly other languages could have been
used. But I found XSLT's <xsl:analyze-string> to be a powerful tool, and
its praises deserve to be sung more loudly. The construct lets the user losslessly
segment a string into principal parts, helps developers clearly visualize how the
input string will be segmented, and makes it easy to identify the relationship between
matches and non-matches and their roles within the result tree. I have found that the
task of losslessly consuming unparsed text to build a complex tree is much more
manageable with <xsl:analyze-string> than it is in other programming
language counterparts. The XSLT code is generally easier to read and maintain.
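For readers outside the XSLT world, the behavior of <xsl:analyze-string> can be approximated as follows. This is a Python analogue, not GPO's code: it segments a string into matching and non-matching parts so that nothing is lost.

```python
import re

def analyze_string(text, pattern):
    """Losslessly segment text into matching and non-matching parts,
    in the spirit of XSLT's <xsl:analyze-string>. Returns a list of
    ("match" | "non-match", substring) pairs whose concatenation
    reproduces the input exactly."""
    parts = []
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            parts.append(("non-match", text[pos:m.start()]))
        parts.append(("match", m.group()))
        pos = m.end()
    if pos < len(text):
        parts.append(("non-match", text[pos:]))
    return parts

# e.g., pull the parenthesized font number out of a grid header line
segments = analyze_string("T1 Ionic Roman C&lc (052)", r"\(\d+\)")
```

Because every non-match is retained alongside every match, concatenating the segments restores the original string, which is what makes round-tripping the ASCII source files possible.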
The grids were relatively easy to convert through this process. An XSLT application
of
about 200 lines was able to convert all the grids to something like this:
<?xml version="1.0" encoding="UTF-8"?>
<!--This data was generated by the stylesheet at
file:/D:/XPUB_UTILITIES/Slax/SlaxBuilders/grid%20source%20to%20xml.xsl
on 2022-09-20T21:59:09.01525-04:00.-->
<grid n="738">
<type>microcomp</type>
<!--JUNE 24, 1996-->
<!-- Column -->
<!--Ionic/Century Schoolbook 1 Typeface -->
<!-- 2 Displacement -->
<!--MICROGRD.738 3 Output Code -->
<!-- 4 Font Number -->
<!--T1 Ionic Roman C&lc (052) 5 Synthetic font -->
<!--T2 Century Schoolbook Bold C&lc (038) 6 Postscript widths -->
<!--T3 Ionic Ital C&lc (054) 7 Hyphenation Code -->
<!--T4 Ionic Roman C&sc (052) 8 Entity Tag & Description -->
<!--T5 Ionic Roman C&sc (052) (full figs)-->
<!---->
<typeface n="1">
<glyph cp="0">
<output-code>0</output-code>
<font-number>0</font-number>
<synthetic-font>0</synthetic-font>
<postscript-width>00000</postscript-width>
<hyphenation-code>0</hyphenation-code>
<entity>
<code>00000000</code>
</entity>
<!-- [T1 Ionic Roman C&lc]-->
</glyph>
<glyph cp="1">
<output-code>183</output-code>
<font-number>86</font-number>
<synthetic-font>0</synthetic-font>
<postscript-width>00556</postscript-width>
<hyphenation-code>0</hyphenation-code>
<entity>
<code>sbull</code>
</entity>
<!-- ␁ Small bullet-->
</glyph>
. . . . . .
The formats were more challenging, but a multiple-pass approach within a single XSLT
file (about 1,200 lines of code) was sufficient. Here is a sample from format 0634:
In places, the XML files are much more legible than their ASCII source counterparts.
Element and attribute names provide much better documentation than the abbreviated,
cryptic headers. And in the course of building a library of format and grid XML files,
it was easy to ferret out numerous typographical errors, some of which affected
MicroComp output.
For both the formats and grids, relatively succinct XSLT applications were able to
generate, very quickly, lossless XML serializations. XSLT's
<xsl:analyze-string> allowed me to capture and retain non-data
fields as comments, so that it was possible to reverse the process and rebuild the
source ASCII files nearly byte for byte (except where an occasional stray null byte
was
in the source file). With these XSLT applications in hand, whenever formats and grids
were updated, the Slax resources could be as well.
The fonts posed a distinct challenge. There, I had to get my hands dirty. I studied
the entire library, and focused on the idiosyncratic fonts. I did my best to provide
a
sensible mapping to the Unicode standard. Where such mapping was impossible,
undesirable, or arbitrary, I left notes. Because most fonts had the standard PostScript
encoding, I focused only on mapping the exceptions, which I did in a simple XML file,
of
course.
It seemed to me that at the very minimum, a user of a Slax XML file should not have
to
perform any more byte displacements. But to get there—to arrive at the proper Unicode
values—one must take a complicated journey, through a given format, through a given
grid, through a given typeface, to a particular font glyph, and finally to its Unicode
meaning. In building the pipeline to retrieve that information, I found that about
one third of the grid material had been mined and placed directly in the Slax XML file.
If
the grid files could be altogether dispensed with, users of the Slax format would
benefit. Hence the output indicates, for every locator, or for any special characters,
the remaining relevant information from the grid file: the target font name and family,
and any special modifications that should be applied.[4]
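The journey described above can be pictured as a chain of lookups. The following Python sketch is purely illustrative: the data structures, key names, and values are invented, and the real resolution involves far more state, but the shape of the walk from format to grid to typeface to glyph is the same.

```python
# Invented sample data modeling the lookup chain Slax must walk.
FORMATS = {634: {"locator_grid": {66: 738}}}
GRIDS = {
    738: {
        "typefaces": {
            1: {102: {"font": "Gpospec5", "unicode": None}}
        }
    }
}

def resolve(format_no, locator_no, typeface, codepoint):
    """Walk format -> grid -> typeface -> glyph to reach the font and
    Unicode information for one input codepoint."""
    grid_no = FORMATS[format_no]["locator_grid"][locator_no]
    return GRIDS[grid_no]["typefaces"][typeface][codepoint]
```

Slax performs this walk during conversion so that its users never have to; the relevant endpoint of the chain is written directly into the output.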
The format files, however, were a different story. They have such an abundance of
information (points, leading, line length, justification, indentation, margins, relative
padding, complementary formats, etc.) that to put any of that within a typical Slax
file
would result in a bloated, illegible, and repetitive format. It would be better, I
thought, to give Slax file users access to the library of format files, serialized
as
XML, so that they can query typesetting specifications of interest as needed.
I now had a set of XSLT tools that could place the vast mechanism of formats, grids,
and fonts at service for the XML stack, and could be responsive to any changes to
that
context. That left me with one final challenge: how to orchestrate the conversion
process, i.e., how to start with an actual locator file, trawl those XML resources,
and
return an XML file with the intended Unicode text.
I quickly ruled out XML technologies for this task. The conversion process would need
to handle control bytes (prohibited in XML 1.0) and null bytes (prohibited in both
XML
1.0 and 1.1). A proper handling of the character displacement process would require
bytewise operations with look-ahead and look-behind features. My previous experiments
with byte- and bit-wise functions, such as implementing the MD5 hash algorithm in
XSLT,
suggested that an XSLT or XQuery locator processor would not be performant.[5]
Andries and Wood opted for a Java-based declarative approach. I felt that I would
have
most success along a different route. The locator format was designed with a particular
outlook, and was deeply enmeshed with AComp. I thought that the possibility of error
and
omissions would be minimized if I developed a system that mimicked AComp's native
process.
My solution was to write the Slax application in C#. It was designed as a class
LocatorAnalyzer, whose primary method Load() processed an
arbitrary locator byte stream along a streamlined version of what I detected AComp
was
doing, or was meant to do. Load() walks the byte stream along the same
stages AComp does. At many of those stages, AComp consults formats and grids as compiled
binaries. Slax needed to do the equivalent, quickly looking up character displacement
maps, and related operations.
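The staged, bytewise character of such a pass can be suggested with a small Python sketch. This is not the actual C# LocatorAnalyzer; the command handling is reduced to the bare minimum (a BEL byte followed by a single command letter), purely to show the scan-with-lookahead pattern.

```python
BEL = 0x07  # control byte that introduces locator commands

def walk(data):
    """Minimal sketch of a Load()-style pass over a locator byte
    stream: scan byte by byte, looking ahead one byte after each BEL
    to classify the command, and collecting everything else as text.
    Real locator commands carry arguments; this sketch omits them."""
    events = []
    text = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == BEL and i + 1 < len(data):
            if text:
                events.append(("text", bytes(text)))
                text.clear()
            events.append(("command", chr(data[i + 1])))
            i += 2
        else:
            text.append(b)
            i += 1
    if text:
        events.append(("text", bytes(text)))
    return events
```

The essential point is that the stream is consumed imperatively, with lookahead, rather than parsed declaratively, which is why bytes prohibited in XML pose no difficulty at this stage.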
Because the formats, grids, and fonts were now available as XML, I developed another
XSLT application that built the tuples, records, and other low-level data structures
required for the C# code. In an early experiment with this model, the XSLT application
took the C# source code as unparsed text, located specially marked sections designed
for
fields, and populated them with the raw data serialized in C# syntax. This was an
effective but impractical solution. The very long C# code that resulted rendered Visual
Studio unresponsive. The technique also meant that any updates to the formats, grids,
or
fonts would require recompiling the application, and submitting it repeatedly for
security audits.
The more tractable approach was to write the data to stand-off text files, optimized
for rapid ingestion as tuples, hashsets, dictionaries, and other data types. This
approach was clean, and meant that any future updates to the grids and formats would
not
require recompiling the core executable. The automated XSLT-based process of grouping
and distilling the thousands of format, grid, and font files into small datasets proved
to be somewhat time-consuming, taking about ten minutes. But such updates are expected
to be infrequent, so the excess time is at present negligible, and represents work
that
the C# application does not have to perform.
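The stand-off file idea is simple: the XSLT stage writes plain delimited text, and the runtime ingests it into fast in-memory structures. Here is a hedged Python sketch of that ingestion step; the three-column layout shown is invented for illustration, not GPO's actual file format.

```python
import csv
import io

# Invented sample: "input code <TAB> font <TAB> output code" rows, as
# might be emitted by the XSLT distillation stage.
SAMPLE = "102\tGpospec5\t0\n65\tMIonic\t65\n"

def load_displacement_map(text):
    """Read delimited rows into a dictionary keyed by input code,
    mimicking rapid ingestion of stand-off data files at startup."""
    table = {}
    for code, font, out in csv.reader(io.StringIO(text), delimiter="\t"):
        table[int(code)] = (font, int(out))
    return table
```

Because the data lives outside the executable, updating a grid or format means regenerating a text file, not recompiling and re-auditing the application.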
Overall, both XSLT and C# performed their share of challenging work. XSLT parses and
manages the resources, and prepares the C# source code with the data it needs. The
C#
code approaches byte handling in an object-oriented environment with very fast low-level
operations. The result is that Slax is highly performant. A typical locator file
is
converted in a tenth of a second. A one megabyte file takes less than a second. Slax
is
one of GPO's fastest applications.
The Slax Format
The best way to illustrate the design of Slax XML is by example, using the locator
file presented at the beginning of this
article:
<slax ver="0.09" xmlns="http://gpo.gov/slax">
<format n="634">
<locator n="66" font-no="86" font-mod-no="0" font-name="BGsddV01" font-mod-name="none">
<c font-no="0" font-name="Gpospec5">
<svg-path for-cp="102"
d="M7870 228h2l-4 -3l-6 -6h2l-7 -3c-4 -4 -8 -6 -8 -6c-1 1 2 2 8 6c2 1 3 2 5 3l-1 1l-36
-27l-2700 -98l-1159 -49l-3937 154l-1 30c-28 7 2 30 21 31l3946 195l1208 -68l2665 -134c10
-9 8 -19 2 -26z"
/>
</c>
</locator>
<locator n="81" font-no="52" font-mod-no="0" font-name="MIonic" font-mod-name="none">SENATE
RESOLUTION 77—DESIGNATING FEBRUARY 16, 2023, AS “NATIONAL ELIZABETH PERATROVICH DAY” </locator>
<locator n="11" font-no="52" font-mod-no="0" font-name="MIonic" font-mod-name="none">Mr. <span
typeface="4" font-mod-no="1" font-mod-name="small-caps"><c font-mod-no="0"
font-mod-name="none">SULLIVAN</c></span><span typeface="1"> (for himself and Ms.
</span><span typeface="4" font-mod-no="1" font-mod-name="small-caps"><c font-mod-no="0"
font-mod-name="none">M</c>URKOWSKI</span><span typeface="1">) submitted the following
resolution; which was considered and agreed to: </span></locator>
<locator n="74" font-no="52" font-mod-no="1" font-name="MIonic" font-mod-name="small-caps"><c
font-mod-no="0" font-mod-name="none">S</c><c font-mod-no="0" font-mod-name="none">.</c>
<c font-mod-no="0" font-mod-name="none">R</c>ES<c font-mod-no="0" font-mod-name="none"
>.</c>
<c font-mod-no="0" font-mod-name="none">7</c><c font-mod-no="0" font-mod-name="none">7</c>
</locator>
<locator n="27" font-no="52" font-mod-no="0" font-name="MIonic" font-mod-name="none">Whereas
Elizabeth Wanamaker Peratrovich, Tlingit, was a member of the Lukaa<span font-no="0"
font-name="Gpospec5">x̱</span>.ádi clan in the Raven moiety with the Tlingit name of
<span font-no="0" font-name="Gpospec5">Ḵ</span>aa<span font-no="0" font-name="Gpospec5"
>x̱</span>gal.aat (referred to in this preamble as “Elizabeth”) who fought for social
equality, civil liberties, and respect for Alaska Native and Native American communities;
</locator>
The root element is <slax>, and it takes a sequence of
<format>s. Each format takes a sequence of block elements such as
<locator> and <table> (on which see below). Each
<locator> takes mixed content of text and inline
<span>s. Individual characters might be marked by
<c>s, which are always wrapped by <span>s.
Line 1 of the input code, ␇S0634, creates the first
<format>. And line 2, ␇I66F, creates the first
<locator> for locator 66. Information about the default font and
font modification is captured in attributes twice, once with the relevant code and
a
second time with a human-readable version. A future version of Slax may handle such
repetition differently.
Note, the F of locator file line 2 is not the letter F, but rather a
character from one of GPO's specialized fonts. This represents the elongated decorative
lozenge seen in figure 1. After all byte displacement has taken place, F does not
target
a Unicode character but a graphic, so it is replaced with an <svg-path>,
whose attribute d holds a SVG representation of the font Gpospec5's glyph
102.
The standard way in locators to express double quotation marks is through two separate
glyphs, as either two backticks, ``, or two straight apostrophes,
''. Slax converts these to typographic quotation marks, because that is
what the user intends, even though MicroComp would have typeset these as pairs of
single
quotation marks.
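The quotation-mark substitution is straightforward to state in code. A minimal Python sketch of the rule described above:

```python
def typographic_quotes(text):
    """Convert the two locator conventions for double quotation marks
    into typographic marks: paired backticks open a quotation, paired
    straight apostrophes close one."""
    text = text.replace("``", "\u201c")   # left double quotation mark
    return text.replace("''", "\u201d")   # right double quotation mark
```

This captures the user's intent even though MicroComp itself would have typeset each pair as two single quotation marks.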
In line 4, we have the following code: ␇I11Mr. ␇T4SULLIVAN␇T1 (for himself and
Ms. ␇T4Murkowski␇T1). That line invokes locator 11, whose default is typeface
1, which has no special font modifications. But when typeface 4 is specially invoked
(twice) it calls upon small caps as a modifier for the lowercase letters. Each word
governed by ␇T4 includes uppercase letters, which are not subject to the
small caps modification. Hence, the Slax representation results in some structures
where
font modification is toggled on and off. Note that Slax output has URKOWSKI
in all caps. That reflects the actual byte displacement process defined by MicroComp's
grid, which calls for the codepoints to be changed from lowercase to uppercase, and
also
instructs MicroComp to reduce their size. The glyph width information in the grid
specifications is used to specify the width of the letters rendered as small caps,
so
that they are smaller than the letters not so rendered.
The locator sample I have used throughout this paper has a relatively simple
structure, and the bell codes are used only to navigate formats, grids, and typefaces.
The locator format has other bell codes, for other components, and to describe them
would make this article overly long. However, at least something should be said about
tables, because of their complexity.[6] Here is a sample table in the locator format and its typeset form:
A table in locators is commonly called a subformat, because the traditional format
specifications continue to be used, but different settings are invoked. A table's
start
and end are signaled by ␇c and ␇e, respectively, with the
division between head and body signaled by ␇j. The first line has a complex line of
code
declaring the parameters for the table and the specifications for each column. There
are
other features, not illustrated above, that make tables one of the most daunting parts
of the locator format. Rather than point out every feature of the table, I provide
here
the Slax representation:
Although the Slax version is more verbose than the locator version, it is also more
informative, and a significant step in the direction of making the tables HTML- or
CALS-compliant. The locator version indirectly or implicitly points to settings that
are
expressed explicitly in the Slax version. For example, the first line in the locator
file
is used to populate various attributes for <table> and
<colspec>. Within the table body, ␇I01 in the locator
file invokes subformat specifications about whether a cell is leadered or not, or
what
level of indentation it should receive, expressed in the <entry>
attributes.
For both tables and running text, when Slax serializes locators, it may encounter
syntax errors. If they are not fatal, the location of the error, and its specific
message, are recorded. When the Slax XML file is written, each erroneous location
is
anchored by a comment whose content is an integer one greater than the previous anchor's
value. At the end of the file all errors are listed. A Slax Schematron schema associates
each error with the particular place in the file, so that users in Oxygen or other
XML
editors can quickly find errors. Schematron, of course, is not the main Slax schema,
which is written in RELAX NG, so if the locator serialization process results in
syntactic absurdities (some of which MicroComp might overlook), users have two registers
of validation.
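The error-anchoring scheme can be sketched briefly. The Python below is illustrative only (the real anchors are comments written into the Slax XML file, keyed to a listing at the end of the file), but it shows the incrementing-integer convention described above.

```python
from itertools import count

def anchor_errors(errors):
    """Sketch of Slax's error anchoring: each non-fatal error site gets
    a comment whose content is one greater than the previous anchor's
    value, and all errors are listed at the end of the file."""
    n = count(1)
    anchors = [(next(n), message) for message in errors]
    comments = ["<!--%d-->" % i for i, _ in anchors]
    listing = ["error %d: %s" % (i, msg) for i, msg in anchors]
    return comments, listing
```

A Schematron schema can then pair each listed error with its numbered anchor, which is what lets XML editors jump straight to the offending location.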
The Slax XML samples above are representative. Slax has been applied to tens of
thousands of locator files, and more often than not what emerges is a relatively flat
tree. The occasional <format> adds some depth to what is otherwise
normally an undifferentiated long sequence of <locator>s, whose
attributes give no sign of semantic function. Within the <locator>
blocks are inline elements that deal with character- or phrase-based exceptions to
the
block rules.
In looking at Slax XML I am reminded of Microsoft Word docx files, OpenOffice odt
files, and early word processor formats, which share a similar typology. I think of
them
not as trees but as hedgerows. The hierarchy is rarely deep, even though the printed
result might give the illusion of depth. Behind the scenes, the files consist of very
lengthy sequences of text blocks, kept all on the same level, like hedgerows along
a
road.
The exception to that general trend pertains to tables and lists. In the locator
format, tables never nest. So when a table is encountered, its depth always has a
maximum limit. Here the locator format is shallower than its word processor
counterparts, which normally allow tables and other structures to nest recursively.
In
the case of lists, the locator format makes no attempt to preserve the hierarchical
structure, but simply assigns each list item, regardless of its semantic depth, to
one
of a series of <locator>s. Whether the user creates deep, beautiful
lists, or illogical or inconsistent ones, it is all the same flat structure. The locator
format stays out of the business of monitoring the proper use of hierarchical
structures.
Conclusion and Further Work
Slax is still under development. Securing an adequate test suite is difficult. Many
of
the dark corners of AComp remain to be explored. Although much work has been done
testing the serialization, only some effort has been put into the next stage, namely,
using Slax XML as a pivot format for conversions to USLM, HTML, and XML for typesetting
with XPP. More work needs to be done to allow users to supply their own custom format
and grid specifications. That is critical for any attempts to reconstruct previous
versions of formats and grids, in order to accurately convert legacy locator
files.
Desiderata aside, Slax shows enormous promise. It demonstrates the power of XSLT for
lossless parsing of complex documents. It models an optimal balance between declarative
and imperative programming. And it results in a tool and an XML format I wish I could
gift to my predecessors.
References
[andries_and_wood] Andries, Patrick, and Lauren Wood. “Converting
Typesetting Codes to Structured XML.” Presented at Balisage: The Markup Conference
2020,
Washington, DC, July 27–31, 2020. In Proceedings of Balisage: The
Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25
(2020). doi:https://doi.org/10.4242/BalisageVol25.Wood01.
[boyle] Boyle, John J. “Electronic Composing System Applications.” In
Electronic Composition in Printing: Proceedings of a
Symposium, edited by Richard W. Lee and Roy W. Worral, pp. 90–93.
National Bureau of Standards Special Publication 295. Washington, DC: National Bureau
of
Standards, 1968. https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nbsspecialpublication295.pdf.
[cavanaugh] Cavanaugh, John F. “Text Handling at the United States
Government Printing Office.” Technical Communication
25, no. 3 (1978): 12–15.
[gsdd_1983] Graphic Systems Development Division. Publishing from a Full Text Data Base. 2nd ed. Government Printing
Office, 1983.
[hincherick] Hincherick, William. “Automated Composition: The Development and Utilization of
a Unique System for the U.S. Government Printing Office.” Government Publications Review 12, no. 3 (May 1, 1985): 215–25. doi:https://doi.org/10.1016/0277-9390(85)90024-X.
[3] In standard typography, typeface refers to a unified
ensemble of type design, e.g., Ionic, which is represented by one or more
fonts, e.g., Ionic roman, Ionic italic. Locator
documentation, however, uses typeface in an
idiosyncratic way, to refer to one of the five parts of a grid file, i.e., a
set of 128 characters. In this article when I use
typeface, it is in the locators sense; when I mean
the standard typographic term, I use simply face.
Thanks to Peter Binns for prompting me to be clear on terminology.
[4] A grid also lists the relative width of each glyph, and the hyphenation code,
but most users of Slax would have little if any use for that information. If
someone really wanted to know about the PostScript font glyphs, they could
access the grid specifications in XML, or the actual font tables, as separate
resources.
[6] I thank Mark Harcourt for prompting me to include this discussion. I wrote the
Slax conversion for locator tables independently of similar work performed by
Priscilla Walmsley and Martin Smith. I have benefitted from the output of their
conversion routines for testing Slax output. That we reached similar conclusions
on how the tables should be interpreted is gratifying.