Harvey, Betty. “SGML in the Age of XML.” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Harvey01.
As President of Electronic Commerce Connection, Inc. since 1995, Ms. Harvey
has led many federal government and commercial enterprises in planning and
executing their migration to the use of structured information for their
critical functions. She has helped develop strategic XML solutions for her
clients. Ms. Harvey has been instrumental in developing industry XML standards.
She is the co-author of "Professional ebXML Foundations" published by Wrox. Ms.
Harvey founded the Washington, DC Area SGML/XML Users Group. Ms. Harvey is a
member of "The XML Guild" and was a co-author of the book "Advanced XML
Applications From the Experts at The XML Guild" published by Thomson.
Today (2016!), there are organizations, especially in the military, who have SGML
documents and/or requirements to meet SGML-based specifications. Given the
unfashionability of SGML and the shrinking availability of SGML tools and SGML
expertise, these organizations face significant challenges. How can they best
approach the task of working with existing SGML document collections? What about a
requirement to create SGML that will integrate cleanly into existing SGML document
collections to be processed with existing SGML tools? What questions should someone
facing an SGML requirement ask? What resources are they going to need? How much can
they do with XML infrastructure to meet SGML requirements and where must they “cut
over” to SGML? How should they make SGML if they really need to? How can they
leverage XML tools while maintaining SGML source requirements?
This year we celebrate the 30th anniversary of Standard Generalized Markup Language
(SGML) becoming an international standard. Many reading this paper may never have heard
of SGML or the role it played in the acceptance and success of the World Wide Web (WWW).
In some cases it has also revolutionized the publishing of information by providing an
easy way to output complex information to different media channels.
Through HTML, SGML gave Internet browsers, for the first time, the capability to
display and publish information on the WWW. In December 1990, the Global Hypertext
Project at CERN, the European Laboratory for Particle Physics, under the direction of
Tim Berners-Lee, provided the capability of displaying and linking information across
the internet. At that time the internet was mainly available to educational and
government organizations. The Global Hypertext Project developed a very simple
SGML vocabulary for presenting and transporting information across the internet. Many
working in the SGML space knew the flexibility and usefulness of SGML, but it was HTML
that proved to the naysayers that there was both intellectual and monetary value in
using markup to describe information. HTML became, and still is, the largest use of SGML
(some nameless organizations dispute this fact but they haven't been able to prove it
yet).
There were many pitfalls along the way. The U.S. Department of Defense (DoD) was one of
the first adopters of SGML for technical publications. This encouraged other
organizations in manufacturing, data warehousing, publishing, etc. to adopt SGML.
The negative side of DoD jumping in so early was that it poured massive amounts of
money into companies (in many cases traditional DoD contractors) to develop software
applications to support DoD. As a result, many of the early SGML software
applications (authoring, databases, publishing, etc.) were too costly for small and
medium-sized organizations to adequately leverage the power of SGML.
With the success of HTML, visionaries saw that SGML really could be affordable to the
masses. SGML could be used by small and medium organizations to manage and disseminate
their information.
SGML wasn't without its problems. XML was designed to alleviate some of the inherent
problems and pain points that SGML had. XML was originally designed to be a subset of
SGML. Some of these problems of SGML, either real or perceived, were:
SGML Declarations: The SGML declaration was a complex file that relayed information to
the SGML application. A few of the parameters were:
Allowed length of element names and attributes. Some of the early SGML vocabularies
restricted element names to 2 characters. Most were restricted to 8 to 32 characters.
Tag minimization. You could say whether the opening tag, closing tag or both tags could
be omitted.
Delimiter characters. You could use other characters in place of the less-than and
greater-than signs (pointy brackets) in documents.
SGML DTD: SGML requires that validation against a DTD always be performed before any
application will process the information. Although validation is important when
developing and managing information for presentation and dissemination, it is not
important to the end user. The SGML DTD allows many constructs, such as inclusions,
exclusions, inline comments, tag minimization, etc., that caused inconsistencies in
tools and parsers.
Character entities: SGML used the ISO character entity sets for characters (for example,
&aacute; for á). XML uses native Unicode.
XML became a W3C Recommendation in February 1998. Organizations quickly jumped on board
and adopted XML for their data. Organizations that had originally adopted SGML were
slower to switch, but as XML software improved and became more affordable than SGML
software, and as the SGML tool market declined, they also moved to XML. An educated
guess is that 98-99% of organizations that use markup for their data use XML. Today,
even popular software such as Microsoft Word uses XML for its underlying data. Open a
Microsoft .docx or .xlsx file in a zip application and take a peek inside: all the
underlying data is XML.
However, there are a few organizations, mainly within DoD, that have not adopted XML and
have stuck with SGML almost 20 years later. This paper is designed to help organizations
navigate the necessity of delivering SGML in an XML world.
Authoring SGML Content
There are a few editors that still support SGML authoring. These editors originally
started in the SGML world. However, if you research their literature, you will find
very little information about their SGML authoring capability. Each of these authoring
tools supports full SGML editing, as well as authoring using native SGML DTDs. These
editors are:
JustSystems' XMetaL
Adobe's FrameMaker + XML
PTC's Arbortext Editor
If you do need to deliver SGML, the easiest and most efficient way is to use
one of the editors that support both SGML and XML authoring.
The same SGML editors can be used to create XML files, and they allow you to save a
file as either XML or SGML. The big differences between an XML and an SGML
instance are:
XML declaration vs. SGML declaration
DOCTYPE statement
Empty tags: <linebreak/> vs. <linebreak>. Normalizing an XML element
from <linebreak/> to <linebreak></linebreak> will result in validity in
both SGML and XML.
Case sensitivity. SGML is not case sensitive. This means that the elements
<TITLE> and <title> are exactly the same in an SGML document. This is not
true in XML, where they are treated as two different elements. SGML
editors handle case differently. For example, some editors use all capital
letters for element names whereas other editors use all lower case. If you are
going from XML to SGML this isn't important, but when moving from SGML to XML case
becomes significant. It is something to be aware of when deciding your authoring
process.
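The empty-tag normalization described above can be sketched in a few lines of Python. This is a minimal illustration using the standard library's ElementTree, not a complete converter: it handles only the empty-element difference, the sample element names are made up for the example, and the XML declaration and DOCTYPE statement would still need to be handled separately. Whether a particular SGML parser accepts the explicit end tag also depends on the DTD and SGML declaration in use, which this sketch does not check.

```python
import xml.etree.ElementTree as ET


def expand_empty_elements(xml_text: str) -> str:
    """Serialize XML so that empty elements use explicit start and end tags."""
    root = ET.fromstring(xml_text)
    # short_empty_elements=False writes <linebreak></linebreak>
    # instead of the XML-only short form <linebreak />.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


# Hypothetical sample document; <para> and <linebreak> are illustrative names.
sample = "<para>First line<linebreak/>Second line</para>"
print(expand_empty_elements(sample))
```

Note that ElementTree does not keep the DOCTYPE statement when it parses a document, so a real workflow would write the SGML DOCTYPE back out as a separate step.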
Document Declaration Subset
The document declaration subset is a construct that provides a mechanism at the
beginning of an SGML or XML document for creating both file entities and text
entities. The document declaration subset was a commonly used construction in SGML;
almost every file contained one. Document declaration subsets are still used in XML,
but not as commonly as they once were, because XSLT cannot process the information
in the document declaration subset.
The example below shows a document using a document declaration subset:
<!DOCTYPE poem SYSTEM "poem.dtd"[
<!ENTITY author SYSTEM "poepic.jpg" NDATA jpg>
]>
<poem id="poem1">
<title>The Raven</title>
<poet>Edgar Allan Poe</poet>
<author-picture src="author"/>
<stanza id="stanza1">
<line>Once upon a midnight dreary, while I pondered, weak and weary,</line>
<line>Over many a quaint and curious volume of forgotten lore-</line>
<line>While I nodded, nearly napping, suddenly there came a tapping</line>
<line>As of some one gently rapping, rapping at my chamber door.</line>
<line>"‘Tis some visitor," I muttered, "tapping at my chamber door-</line>
<line>Only this and nothing more."</line>
</stanza>
...
</poem>
If you are faced with this situation, establishing authoring rules can allow files
to be processed by both XML and SGML applications. For example, if you establish a
rule that the entity name is always the same as the file name, then XSLT can
determine the graphic without needing to look at the document declaration
subset.
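As a rough sketch of that naming rule, the Python below derives each graphic's file name directly from the entity name stored in the attribute, without reading the document declaration subset. The directory name and file extension are assumptions made for illustration; under this rule the entity in the poem example would be named poepic rather than author.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

GRAPHIC_DIR = "graphics"  # hypothetical location of the graphic files
GRAPHIC_EXT = ".jpg"      # hypothetical: assumes every graphic is a JPEG


def graphic_path(entity_name: str) -> str:
    """Apply the authoring rule: the entity name is the file's base name."""
    return str(PurePosixPath(GRAPHIC_DIR) / (entity_name + GRAPHIC_EXT))


def graphic_paths(xml_text: str, element: str = "author-picture",
                  attribute: str = "src") -> list:
    """Collect the file path for every graphic reference in the document."""
    root = ET.fromstring(xml_text)
    return [graphic_path(el.get(attribute))
            for el in root.iter(element) if el.get(attribute)]


sample = '<poem id="poem1"><author-picture src="poepic"/></poem>'
print(graphic_paths(sample))
```

An XSLT stylesheet would apply the same convention by concatenating the attribute value with the agreed directory and extension.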
Other organizations have used a metadata field in the XML document to place the
document declaration subset information in the file. The metadata field gets
stripped during the conversion to SGML. The document declaration subset is created
at the time of the conversion to SGML and included in the file.
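The metadata approach could look something like the following Python sketch. The graphics-metadata and graphic element names and their attributes are hypothetical, since the paper does not say how any organization actually structures this field. The sketch strips the metadata element, builds the document declaration subset from it, and writes the result with explicit end tags.

```python
import xml.etree.ElementTree as ET


def xml_to_sgml(xml_text: str, dtd_system_id: str) -> str:
    """Move entity information from a metadata element into the DOCTYPE."""
    root = ET.fromstring(xml_text)
    declarations = []
    # Hypothetical metadata structure:
    # <graphics-metadata><graphic entity=".." file=".." notation=".."/></graphics-metadata>
    for meta in root.findall("graphics-metadata"):
        for graphic in meta.findall("graphic"):
            declarations.append('<!ENTITY {} SYSTEM "{}" NDATA {}>'.format(
                graphic.get("entity"), graphic.get("file"), graphic.get("notation")))
        root.remove(meta)  # the metadata field is stripped from the SGML output
    doctype = '<!DOCTYPE {} SYSTEM "{}"[\n{}\n]>'.format(
        root.tag, dtd_system_id, "\n".join(declarations))
    body = ET.tostring(root, encoding="unicode", short_empty_elements=False)
    return doctype + "\n" + body


sample = ('<poem id="poem1">'
          '<graphics-metadata>'
          '<graphic entity="author" file="poepic.jpg" notation="jpg"/>'
          '</graphics-metadata>'
          '<title>The Raven</title><author-picture src="author"/></poem>')
print(xml_to_sgml(sample, "poem.dtd"))
```

The output is not validated here; it would still need to be parsed against the SGML DTD before delivery.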
Authoring in Native XML Editor
If you already have an XML editor in-house and prefer to keep using your favorite
editor, you can. You just need to be aware of the slight differences between it and a
combined SGML/XML editor.
SGML DTD to XML DTD
If you decide to author content in an XML editor you will need a valid XML DTD.
You will need to either obtain the XML version of the DTD or you will need to
convert the SGML DTD to a valid XML version. This can be a daunting task, especially
with large complex DTDs. There are some good articles on the modifications required
to convert an SGML DTD to an XML DTD. One such article was written by Norm Walsh in
1998 and is available at W3C [DTD].
If you need to convert a complicated and all-inclusive DTD such as MIL-STD-38784C,
it may be best to do a data analysis of your documents, determine which components
of the DTD are required for your set of documents, and develop a subset of the DTD.
This approach has several advantages:
best defines your documents
makes authoring documents easier by reducing the number of unnecessary elements.
In some cases organizations have requirements to deliver their data in multiple
SGML/XML formats for multiple clients. This happens all the time in the
manufacturing world. In this case organizations usually find it cost effective to
develop their own XML DTD and/or schema and convert the document to multiple formats
based on business requirements. Their DTD/schema may be based upon an industry
standard.
Many open source standards that started in the SGML world have SGML DTD, XML DTD and
XML Schema versions available. DocBook (http://docbook.org/) and the Text Encoding
Initiative (TEI) (http://www.tei-c.org/index.xml) are two initiatives that provide both
SGML and XML DTDs.
Converting and Parsing XML Native Data to SGML Native Data
Converting an XML document to an SGML document is trivial. By normalizing the XML
file you will have an SGML document that you can parse against the SGML DTD. You
will want to parse the document against the DTD before any delivery of data. One of
the best tools for parsing an SGML document is James Clark's SP. SP requires a little
knowledge of the SGML application but is one of the best SGML parsers.
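A validation step with SP might be scripted roughly as follows. This sketch assumes SP's nsgmls command-line parser is installed and on the PATH (OpenSP installs the same parser as onsgmls), and the file names are hypothetical. The -s option suppresses the parser's normal output so that only error messages are reported. Interpreting the exit status and messages is left to the caller.

```python
import shutil
import subprocess


def build_sp_command(sgml_file, sgml_declaration=None, parser="nsgmls"):
    """Build the command line for a validation-only parse."""
    command = [parser, "-s"]  # -s: suppress normal output, report errors only
    if sgml_declaration:
        # An SGML declaration file, if needed, is listed before the document.
        command.append(sgml_declaration)
    command.append(sgml_file)
    return command


def run_sp(sgml_file, sgml_declaration=None, parser="nsgmls"):
    """Run the parser and return its exit status and error messages."""
    if shutil.which(parser) is None:
        raise FileNotFoundError(parser + " was not found on the PATH")
    result = subprocess.run(build_sp_command(sgml_file, sgml_declaration, parser),
                            capture_output=True, text=True)
    return result.returncode, result.stderr


# Hypothetical file names, shown only to illustrate the command that would run.
print(build_sp_command("manual.sgm", "manual.dcl"))
```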
Native SGML Publishing and Specifications
Early SGML publishing was accomplished using proprietary publishing systems. These
systems were very expensive. Several of these publishing systems are still available
and are still quite costly. In the late 1980s DoD started developing a specification
for publishing SGML. The specification was called Formatting Output Specification
Instance (FOSI).
The specification MIL-PRF-28001 (MARKUP REQUIREMENTS AND GENERIC STYLE SPECIFICATION FOR
EXCHANGE OF TEXT AND ITS PRESENTATION) was originally published in 1992. The last
revision was MIL-PRF-28001C, which was published in 1997. MIL-PRF-28001 specified the
use of SGML for all new technical manuals within DoD. Each branch of DoD (Army, Navy,
Air Force) developed DTDs for use within their individual organizations based on their
specific requirements. These DTDs adhered to the requirements in MIL-PRF-28001.
In addition to specifying the SGML constructs for developing DTDs, MIL-PRF-28001
provided a DTD and specification for applying styling to the SGML: the FOSI. Appendix B
of the specification contained the DTD that supported the use of FOSI for presentation.
Several SGML vendors were part of the working group and worked toward developing a
FOSI-based publishing system. Two vendors, Datalogics and Arbortext, successfully
developed FOSI-based formatting within their products. However, the implementations
differed slightly because of different interpretations of the specification and
ambiguity in the DTD. The result was that a FOSI developed for one system could not
be used in the other system.
FOSIs are still used today by Datalogics and Arbortext. Arbortext has slowly tried to
replace the FOSI with its own style specification called Styler. Styler uses both FOSI
constructs and its own style constructs.
SGML and Loose-Leaf Publishing
When SGML was a new standard, large publishing environments required loose-leaf
publishing. In the 'olden days' large organizations published administrative and
technical manuals on paper. When modifications to the manuals were made, only the
pages that were modified were printed and sent to the users. A manifest was sent
with the 'change pages' which told the manual administrators or librarians which
pages to remove and add to the paper document. The manifests were often printed on
blue paper and the manifest and change pages were called 'blue pages'. Many
organizations still use these manifests in their XML, as well as SGML, publishing
and still call them 'blue pages'. An example of a manifest document is available at
the Patent and Trademark Office.
The users would remove old pages and insert new pages into binders. If pages ran
longer than the original page then the page numbers reflected the new page with a
different numbering sequence. For example, if page 1200-2 is modified and the
revised page resulted in running over to the next page the users would receive page
1200-2 and page 1200-2a.
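The change-page numbering in that example can be sketched as a small Python function. The source only shows the first overflow page (1200-2a); continuing with b, c and so on for longer overflows is an assumption made for illustration, and real numbering rules will depend on the governing specification.

```python
from string import ascii_lowercase
from typing import List


def change_page_labels(original_label: str, pages_needed: int) -> List[str]:
    """Label the pages that replace one original page after a revision.

    The first page keeps the original label; each overflow page gets the
    same label plus a letter suffix (assumed sequence: a, b, c, ...).
    """
    if pages_needed < 1:
        raise ValueError("a revised page needs at least one page")
    overflow = pages_needed - 1
    if overflow > len(ascii_lowercase):
        raise ValueError("this sketch supports at most 26 overflow pages")
    return [original_label] + [original_label + suffix
                               for suffix in ascii_lowercase[:overflow]]


# The revised page 1200-2 runs over onto one additional page.
print(change_page_labels("1200-2", 2))
```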
SGML publishing systems that support loose-leaf publishing compose the SGML
first and place processing instructions at the point of each page break in the SGML.
When revision elements are placed in the SGML, the publishing system makes a second
run through the document to determine where page breaks occur and then calculates
the correct page breaks and page numbers.
Modern technology has negated the need for loose-leaf publishing because manuals
can be disseminated in total without the necessity to print and disseminate single
paper pages. Most of today's workforce are used to on-line, PDF or e-book technology
and prefer electronic dissemination of information to paper.
However, there are still pockets of organizations that are stuck in the 1970s and
still require loose-leaf publishing capability. Suppliers to these organizations
therefore have a requirement to provide information as antiquated paper pages,
including change pages.
If you are required to support loose-leaf publishing, it can still be done with
XML but will require out-of-the-box thinking and additional processes in order to
emulate loose-leaf publishing. XSL-FO does not support loose-leaf publishing.
Creating Published Documents from Native SGML
There are multiple publishing formats for SGML/XML documents. Most organizations
are looking to create print (PDF), HTML and/or e-books. It is possible to get all
of these outputs from your SGML documents. In some cases you will need to convert
the SGML to XML.
SGML had two main specifications for publishing. These two specifications
were FOSI (Formatting Output Specification Instance) and DSSSL (Document Style
Semantics and Specification Language). The FOSI came first, then DSSSL followed.
FOSI (Formatting Output Specification Instance)
Shortly after SGML became a standard the U.S. Department of Defense decided to
adopt SGML as the standard architecture for developing technical manuals and
IETMs (Interactive Electronic Technical Manuals). They needed a way to create
printed output from the SGML documents. Initially the DoD initiated a project
called CALS (initially Computer Aided Logistics Support, then Continuous
Acquisition and Lifecycle Support and lastly Commerce at Lightspeed). DoD needed
a mechanism to produce printed documents from the SGML they were
creating.
Industry was slow in developing a standard for publishing SGML. There were
pockets of proprietary software. DoD started an industry initiative to develop a
standard for publishing SGML. The partnership between industry and DoD resulted
in the FOSI specification. The FOSI specification was incorporated in the DoD
MIL-PRF-28001 [MIL-PRF-28001] specification. Ultimately there
were two vendors that supported the FOSI specification, Arbortext (now PTC) and
Datalogics. Both Arbortext Editor and Datalogics DL Composer still support FOSIs
for publishing SGML data.
FOSI is actually an SGML document controlled by an SGML DTD. It was a bold
concept. Newer SGML and XML specifications continued this practice of using
markup to write output specifications.
The last update to MIL-PRF-28001 was almost 20 years ago in May 1997.
DSSSL (Document Style Semantics and Specification Language)
DSSSL came chronologically after FOSI. DSSSL became an ISO (International
Organization for Standardization) standard [ISO] in 1996. As with FOSI, few
products adopted DSSSL. However, DSSSL can be considered the mother of the
W3C XSLT specification.
DSSSL, like XSLT, had 2 parts. The first part was transformation. The
transformation specification provided the standard for how to convert the
document. The second part provided formatting information on how the elements
should be transformed in order to obtain the presentation of the data.
Creating Published Documents from XML
There are several possibilities for creating PDF output from the native SGML. As
previously discussed, you can use one of the tools that support the SGML style
specifications. There are also proprietary publishing systems that support SGML. For
most organizations proprietary publishing software is cost-prohibitive. Organizations
that can't afford the software and/or the expertise to develop the stylesheets will use
a third-party composition company, which can also be expensive.
However, XSLT and XSL-FO are obvious choices for creating PDF and printable files. As
stated previously, converting the SGML files to XML or vice versa is relatively trivial.
XSLT is a relatively easy skillset to obtain internally or externally. XSL-FO expertise
is a little harder to obtain, but it should be easy either to train individuals
internally or to obtain outside help.
There are also standard XML vocabularies that have standard stylesheets available.
DocBook and DITA are two that are commonly used. If the presentation of the SGML is
relatively straightforward, it might be worthwhile to convert the document to DocBook or
DITA and modify the stylesheets that come as part of these specifications.
DITA would be the best choice for DoD technical manuals; however, extensive
modification of the stylesheets would be required, so it wouldn't be the easiest approach.
Some DoD contractors have been able to negotiate with their DoD customers to
deliver a complete PDF book as a 'new book' without the loose-leaf publishing
requirement. In this case XSL-FO is used to create the PDF of the technical manual.
Others have negotiated to supply the SGML to the DoD customer as well as an HTML
rendition of the document [IETM]. In this case the DoD facility has the
capability to publish the SGML in-house using the JCALS (Joint Computer Aided Logistics
Support) system that it still has available. JCALS was a joint program of the Army,
Navy and Air Force that, in the mid-1990s, developed a publishing system that included
custom and proprietary software. Arbortext Editor is the editor and Datalogics DL
Composer is the composition software.
Conclusion
In conclusion, 30 years after SGML became an international standard it is still being
created and used. If you and/or your organization find yourself in a position where you
need to deliver SGML documents, it is still possible. It will take careful thought to
develop the document constructs and the workflow. Hopefully, in the not too distant
future, organizations will eventually move from SGML to XML.