Nordström, Ari. “Topic-based SGML? Really?” Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). https://doi.org/10.4242/BalisageVol26.Nordstrom01.
Balisage: The Markup Conference 2021 August 2 - 6, 2021
Balisage Paper: Topic-based SGML? Really?
Ari Nordström
Ari is an independent markup geek based in Göteborg, Sweden. He has provided
angled brackets to many organisations and companies across a number of borders
over the years, some of which deliver the rule of law, help dairy farmers make a
living, and assist in servicing commercial aircraft. And others are just for
fun.
Ari is the proud owner and head projectionist of Western Sweden's last
functioning 35/70mm cinema, situated in his garage, which should explain why he
once wrote a paper on automating commercial cinemas using XML.
Topic-based technical documentation is all the rage these days, made popular by
DITA and others. Topics can be integrated with the engineering data for the products
both describe using nifty Product Life Management (PLM) tools that make this easier
than ever. But what if you're stuck with SGML, voluntarily or involuntarily? Can
you, too, bring your content into the topic-based paradigm or should you rather
not?
This paper explores your options and the state of SGML in the PLM world today. It
nose-dives into the ATA iSpec 2200 SGML, discusses some of the pains of implementing
it, and finally converts the ATA to DITA.
Topic-based content has been a thing for a number of years by now, with standards
such as DITA [DITA Version 1.3] making it increasingly popular in the techpub world I
frequently inhabit. The idea is to split your technical documentation, for example, a
User Guide or a Programmer's Manual, into smaller topics that by themselves only
describe a single subject or task. This makes it easy to reuse and process them in a
variety of contexts and output media, from multiple User Guide variants to describe a
range of products to context-sensitive help texts online.
The topics are marked up with profiling information that identifies their valid
contexts. For example, a topic may be applicable to products A and B, but only partially
to C and D, while another may apply to the entire product range. Different properties
are made available, for example, to state that the content is applicable only to specific
serial number ranges or when the product is equipped with a specific module.
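As a minimal illustration of profiling, the Python sketch below filters a set of topics by product applicability. The `Topic` class, the `products` field, and the sample titles are hypothetical and only stand in for whatever profiling attributes a real vocabulary defines.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A topic plus the (hypothetical) set of products it applies to."""
    title: str
    products: set = field(default_factory=set)  # empty set = entire product range


def topics_for(product, topics):
    """Return the titles of the topics that apply to the given product."""
    return [t.title for t in topics if not t.products or product in t.products]


topics = [
    Topic("Replacing the filter", {"A", "B"}),
    Topic("Safety precautions"),              # applies to every product
    Topic("Optional heater module", {"C"}),
]
```

Calling `topics_for("A", topics)` selects the filter and safety topics, while a product without any specific topics only receives the ones that apply to the whole range.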
The individual topics seldom have a section hierarchy, instead leaving that job to
map-like constructs that link to the topics and define whatever chapter, section,
etc. structures that particular topic assembly requires.
DITA in particular has been pushing the idea for the last decade or two, and DITA
implementations are now available out of the box for many editors and other tools.
There's now a host of DITA solutions out there for your content, regardless of your
business area, including software that helps you integrate your topics with your
engineering data, and more. Topic-based authoring is entering the mainstream in
techpub.
There are plenty of non-DITA topic-based solutions, too. Among them is S1000D [S1000D], a standard
originally intended to cater for the maintenance documentation needs of military
aircraft but now greatly expanded to describe the operation and maintenance of any
land, sea, and air vehicle[1]. S1000D topics are known as data modules, but the idea is the same: one
module describes one aspect of the product, be it a description, parts list or
maintenance task, to enable reuse in multiple contexts. S1000D has also spawned a
variant specifically for seagoing vessels, known as Shipdex.
Note
S1000D is very much meant for information exchange, and so businesses agreeing on
exchanging information using the standard will often define business rules to detail
which parts of the standard are used, and how.
The aerospace industry has produced other standards to help maintain their products.
Airlines for America (A4A), formerly known as Air
Transport Association of America (ATA) [ATA], started publishing
aviation technical documentation guidelines in the 1950s and has updated them ever
since, eventually including SGML tag sets (DTDs) alongside the spec itself. The ATA
iSpec 2200 standard [ATA iSpec 2200], a 3,000+-page book describing all aspects of aircraft
maintenance documentation, and with accompanying SGML DTDs, is still in active development[2]. ATA is an industry standard; their systems-oriented maintenance numbering
system [ATA iSpec 100] is
used by the majority of aviation industry manufacturers, regardless of what
documentation format they use.
The ATA iSpec 2200 SGML DTDs are typical SGML-age creations, with typically
monolithic, book-like approaches to the information. For example, documents using
their Aircraft Maintenance or Engine Manual DTDs tend to be huge, with thousands of
illustrations, 50+ chapters, and text content weighing in at dozens of megabytes.
You'd be hard-pressed to label them as topic-based.
Or can you?
Marrying Engineering Data and Techpub
I currently work in the aerospace techpub industry, as my current client is a PLM
(Product Life Management) service provider for many of the big aerospace manufacturers.
I was brought in as the markup guy, the go-to guy for anything related to angled
brackets but also older standards, including SGML. It's interesting work, but just
between you and me, I'd sometimes kill for a modern XML standard.
My client has invested heavily in services around Siemens
Teamcenter, a suite of product lifecycle management computer
software applications. [Teamcenter on Wikipedia] [Siemens Teamcenter]
I would describe Teamcenter, or TC, as we tend to call it, like this
(but I'm a layman and will probably get it wrong):
Imagine your product - be it an alarm clock, a coffee maker or a jet engine - as a
3D model. The model is complete; it contains every single part, spacer, and screw. You
can view the model from any angle, disassemble and explode any portion of it, and assemble
it again. And you can edit and change it, and base all of your product design and
engineering on it. This is what TC manages. Some of it you'll need other software
for, but you get the idea.
Mind, there are a number of these around; TC is by no means alone in this space. CAD
has changed engineering far beyond recognition since I went to school[3].
From a documentation point of view, there are obvious advantages. A 3D model can
provide parts lists, generate 2D illustrations of the parts, and express the disassembly
and assembly procedures built into it as maintenance tasks, but also update both
as soon as the engineering data is updated.
Topic-based authoring is very much the standard to aim for in the PLM space.
Topic-based profiling and reuse à la DITA is perfect if you need to add documentation
to your 3D model.
Authoring
Enter Cortona3D RapidAuthor [RapidAuthor], a suite of
products to manipulate the engineering data that can be integrated with Teamcenter.
RapidAuthor can then be used to author techpub content, from multimedia 3D content
to illustrated parts catalogues to assembly and disassembly tasks, all based on the
engineering data.
Given the engineering data and a procedure defined in RapidAuthor, it can
generate a disassembly/assembly task in any format known to
it, from DITA to S1000D. The generated markup is not perfect but can be edited
manually, using an embedded instance of XMAX, the XMetaL Author ActiveX plug-in,
including adding 2D images from the 3D model the procedure is based on. It's a powerful
way to create documentation, and it's inherently topic-oriented, as the focus is on
assemblies and disassemblies of parts of the product[4]. Alternatively, you can configure an external editor to handle the
markup instead.
Content Management
Teamcenter (TC) also offers topic-based documentation support with their Content
Management module, be it DITA or some other markup vocabulary. There's support for
gathering the topics into DITA-like relational maps to build up what is known as a
publication structure.
A publication structure is a gathering of nested topics and multimedia that
together comprise a manual. The approach is to express the individual topics as
markup while leaving the rest to a relational hierarchy of headings. It's very much
DITA-inspired, of course, and can be exported as a DITA map. They claim to be
vocabulary-agnostic, though, and provide support for S1000D and other DTDs and XML
Schemas, allowing you to split the input document into topic-sized chunks when
importing it to TC by defining XML Attribute Mapping[5] rules, essentially XPath-like expressions[6].
For example, given an XML document containing chapters, sections, and subsections,
you might define rules that break it down into topics on those levels, storing each
topic separately and representing the hierarchy in the relational database,
including using the titles of each topic as topic heading properties in the database
(see Figure 3).
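The breakdown idea can be sketched in a few lines of Python. This is not Teamcenter's rule syntax, which is not reproduced here; the element names in `TOPIC_LEVELS` and the row layout are assumptions chosen to show how a hierarchy of chapters and sections can end up as rows with parent links and title properties.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: element names that start a new topic.
TOPIC_LEVELS = {"chapter", "section", "subsection"}


def break_down(elem, parent_id=None, rows=None):
    """Walk the tree and record one row per topic-level element.

    Each row holds an id, the parent topic's id, and the title, roughly
    what a relational publication structure would need to store.
    """
    if rows is None:
        rows = []
    for child in elem:
        if child.tag in TOPIC_LEVELS:
            topic_id = len(rows) + 1
            rows.append({"id": topic_id, "parent": parent_id,
                         "title": child.findtext("title", default="")})
            break_down(child, topic_id, rows)
        else:
            break_down(child, parent_id, rows)
    return rows


doc = ET.fromstring(
    "<manual><chapter><title>Engine</title>"
    "<section><title>Removal</title><p>Remove it.</p></section>"
    "<section><title>Installation</title></section>"
    "</chapter></manual>"
)
rows = break_down(doc)
```

Here `rows` contains the chapter as a top-level topic and both sections as its children, with the titles stored as properties rather than being looked up in the markup later.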
Mapping rules can also represent other properties, from IDs to cross-references to
content transclusions and image references.
If you're thinking "just like DITA maps but in a relational
database", you're not wrong.
But here's the thing: they also claim to support SGML.
The Power of a Good Sales Team
Enter the topic[7] at hand. The end customer is a large manufacturer in the aerospace industry.
They've been using ATA iSpec 2200 for years, their partners and subcontractors have
been using it, they all require it when exchanging components and documentation, and getting
from SGML to the 21st century isn't easy. Not that they haven't tried; some time ago,
when introducing S1000D XML for some deliveries, they made an attempt at moving
everything to XML but failed. Now, though, the software and tools they have been using
are being phased out, and they've had to start looking elsewhere.
Enter an enterprising Teamcenter sales team. TC has been part of the customer's
engineering data setup for years, but recently, it was pointed out to them that
Teamcenter Content Management can also handle SGML. They were shown flashy presentations
of engineering data married to content, document breakdown into individual topics
for storage and editing, export of composed documents, and flawless PDF and HTML output,
all based on engineering data. Yes, they could have it all, too, and everything could
be based on their ATA SGML.
The Proof of Concept
A Proof of Concept (PoC) was agreed on, comprising a single product and two document
types, Engine Manuals and Service Bulletins. Both would be imported into Teamcenter
and broken down into bite-sized chunks for editing, and then reassembled into suitable
publications. And it would all be ATA SGML.
Service Bulletins are short documents, usually no more than a handful of published
pages with few or no images, and no breakdown into smaller topics is actually required.
The Engine Manuals, however, are a different story.
ATA SGML Engine Manuals
The Engine Manual ATA DTD is a child of its time, a rather typical SGML DTD
representing a monolithic printed book. It has chapters and sections and subsections
(known as subjects), it has markup for individual maintenance tasks and subtasks,
and there are front matters inserted into all of its main parts.
The content is authored from pgblk (pageblock) and
down, and as the element name suggests, its origins really are the printed page.
Binders of them, to be exact. The chapter, section, and subject hierarchy is a
systems-oriented breakdown of aircraft maintenance content, specified in ATA iSpec
100 [ATA iSpec 100] and,
to some extent, in ATA iSpec 2200 [ATA iSpec 2200]. The headings are mostly predetermined, and there is
no content beyond revision markup and change descriptions.
For block-level purposes, there is the usual array of text paragraphs, lists,
graphics, and tables (CALS). The lists in particular have multiple levels, and are
sometimes used for content I'd today label as procedures. Inline elements include
cross-references, subscript and superscript, and some domain-specific semantics for
part numbers and such.
Being an SGML DTD, it uses inclusions and exclusions for some semantics — see the
inclusions from the root element in Figure 5. Notably, it allows revision and effectivity[8] markup all over the place, both of which are modelled with
EMPTY elements. The revision markup in particular is implemented as
standoff markup - a revst (revision start) element is
inserted somewhere in the structure, followed by a revend
(revision end) element elsewhere. The two together mark a change:
from here to here, we've changed something. Since both are
EMPTY elements, they can be inserted anywhere without regard to
well-formedness.
I should also mention that there is a certain degree of freedom for ATA adopters,
as the DTDs can be adapted with custom markup, depending on the level of compliance
aimed for as defined in the spec.
PoC Implementation Notes
I was introduced to the proof of concept project right around the time I
started my new contract, a few days after I first encountered TC. Beyond some
tests with ATA SGML, all of them unsuccessful but initially attributed to my
inexperience, I spent the first several months of my contract on the DITA
implementation. Among other things, I drafted an ATA iSpec 100 system and subsystem-based[9] DITA solution in TC for another aerospace customer, and while that
experience was not without its horrors, it seemed to me that the topic-based
approach was clearly what TC did best.
Cut back to the PoC, several months later. I created a few XML Attribute Mapping
rules in TC[10] to import an SGML Engine Manual, and failed miserably. No matter what I
tried, the SGML import would fail.
A tiny test SGML DTD and instance also failed.
Siemens suggested changes to my import rules, and I was able to import my test
SGML, breaking down the contents into topics where I wanted them. However, when
checking out a topic to edit in XMetaL, the editor chosen for the PoC[11], the topics came out as XML, without any graphics and without a
DOCTYPE declaration to put the graphic entity declarations in.
Oops.
Eventually Siemens acknowledged that their SGML implementation had a few bugs,
assigned a developer to fix them, and thus began a cycle of frequent DLL deliveries
and tests, alongside daily meetings.
DOCTYPE and SGML
Among the first issues fixed was the missing DOCTYPE declaration
and the little matter of TC outputting XML rather than SGML. It turned out,
unsurprisingly, that internally it handled everything in XML, using OpenSP
[OpenJade]
as a parser and as the main software for conversions between SGML and XML. They
had never broken down SGML into smaller topics and so had never realised that
this particular issue existed.
The fixes brought back SGML for the edited topics, and with the SGML the
DOCTYPE declaration. This is where I discovered the next set of
problems.
We Don't Do Entities
The graphics used in the Engine Manual were declared as graphic entities, like
so:
<!ENTITY name SYSTEM "filename.suffix" NDATA CGM>
In other words, the graphic file, filename.suffix, is given an
alias, name, that is then referenced in ENTITY-type
attributes in the content to insert the image. The NDATA bit is
there to explain to the SGML application how the entity should be processed. The
entire Engine Manual contained 6,000 of these, one for each image
inserted.
Teamcenter requires you to first import any images, using mapping rules to
name the images in the system so that name can then be used to associate the
graphics with the content. When I did this and then imported the document, I
realised that only some graphics were associated with the content as expected,
namely those where the graphic entity declarations followed this pattern:
<!ENTITY name SYSTEM "name.suffix" NDATA CGM>
That is, the entity name must be the same as the file name, minus the suffix.
Siemens confirmed, adding that "we don't do entities". They had no
intention of changing this, either; SGML and entities were dead technologies and
would go away.
I added an Ant script to update graphic file names accordingly — I much wanted
a solution where I didn't have to change anything other than the
DOCTYPE in the SGML; writing a script that would rename a
SYSTEM identifier and the corresponding file was far easier
than changing entity references inside the SGML.
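The renaming logic can be sketched as follows. The actual fix described above was an Ant script; this Python version is only an illustration, and the regular expression and function name are assumptions. It assumes SYSTEM identifiers are plain file names without a directory part.

```python
import re
from pathlib import PurePosixPath

# Matches declarations of the form shown above:
# <!ENTITY name SYSTEM "filename.suffix" NDATA CGM>
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+(?P<name>\S+)\s+SYSTEM\s+"(?P<file>[^"]+)"'
    r'\s+NDATA\s+(?P<notation>\w+)\s*>'
)


def align_system_ids(doctype_text):
    """Rewrite each SYSTEM identifier to '<entity name><original suffix>'.

    Returns the updated declarations plus a list of (old, new) file names
    that a separate step would use to rename the files on disk.
    """
    renames = []

    def fix(match):
        name, old = match.group("name"), match.group("file")
        new = name + PurePosixPath(old).suffix
        if new != old:
            renames.append((old, new))
        return f'<!ENTITY {name} SYSTEM "{new}" NDATA {match.group("notation")}>'

    return ENTITY_DECL.sub(fix, doctype_text), renames
```

The entity references in the content stay untouched; only the declarations and the files change. Note that an entity name that already carries a suffix produces a doubled suffix in the file name with this approach.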
Here's where I discovered the next problem. The customer had obviously had
entity naming problems and their solution had been to use the file name, with
the suffix, as both the entity name and the file name:
<!ENTITY name.suffix SYSTEM "name.suffix" NDATA CGM>
This was a no go with Teamcenter, but, as I discovered, so was my scripted fix
that added an extra suffix to appease Teamcenter:
<!ENTITY name.suffix SYSTEM "name.suffix.suffix" NDATA CGM>
For some reason, the software doesn't like multiple suffixes, and nothing I
did helped. For the PoC, I've simply updated the test documents semi-manually,
using regular expressions.
We Don't Do Entities, Part Two
Having finally imported the full Engine Manual and its 6,000 graphics, I
discovered that Siemens' new DLL added all 6,000 to each and every pageblock
being checked out and edited. A pageblock might only use four graphics, yet the
DOCTYPE would contain all 6,000.
Thankfully, all 6,000 graphics are not exported for every checkout.
Notable Differences Between SGML and XML
Remember the revision markup I mentioned earlier, with EMPTY
elements being used to mark the start and end of a revision? Something like
this:
The above is valid SGML, of course, since EMPTY elements in SGML
look like start tags. In the DTD, they use the OMITTAG feature
(- O) and, as they are declared as EMPTY, that's
all there can ever be.
XML did away with this when it introduced the concept of well-formedness, and
for a good reason; trying to figure out where an end tag should go was always a
problem for SGML tools, not to mention browsers. It did define shorthand for
EMPTY elements, though, so in XML,
<REVST></REVST> is equal to
<REVST/>.
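To make the difference concrete, here is a rough Python sketch that rewrites SGML-style EMPTY elements using the XML shorthand. It is a regular-expression illustration rather than a parser: the list of EMPTY element names is an assumption that would normally come from the DTD, and attribute values containing `>` are not handled. In a real conversion a DTD-aware tool does this job, for example osx with its `-xempty` option.

```python
import re

# Assumed list of elements declared EMPTY in the DTD; extend as needed.
EMPTY_ELEMENTS = {"REVST", "REVEND", "COLSPEC"}


def close_empty_elements(sgml_text, empty=EMPTY_ELEMENTS):
    """Rewrite <REVST ...> as <REVST .../> for elements known to be EMPTY."""
    names = "|".join(sorted(map(re.escape, empty)))
    # The lookbehind skips tags that already end in "/>".
    pattern = re.compile(rf"<({names})(\s[^<>]*?)?\s*(?<!/)>", re.IGNORECASE)
    return pattern.sub(lambda m: f"<{m.group(1)}{m.group(2) or ''}/>", sgml_text)
```

The function leaves tags that are already self-closing alone, so running it twice gives the same result as running it once.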
Any translation between SGML and XML and back needs to address this problem.
This is what Teamcenter exported, however:
Note the REVEND end tag. The XML handled internally must be
converted to SGML on the fly when checking out or exporting. Clearly, the code
thinks the REVEND tag was meant to be the REVST end
tag. Why REVEND is still there is a mystery to me. Other
EMPTY elements, for example, COLSPEC in CALS
tables, are handled similarly and can produce similar weirdness on
export.
The SGML DTD includes the REVST and REVEND elements
from the root and down, the idea being to be able to use them anywhere, and I
very much doubt Teamcenter's SGML to XML and back conversions are schema aware.
Yet sometimes there is an end tag where it simply cannot be, such as the above,
which, to me, suggests a streaming parser artifact.
SGML in the Editor
The SGML editor used in the PoC is JustSystems' XMetaL Author[12]. It's one of the very few editors to support SGML today, and it does
it well. ATA iSpec 2200 caused no issues in XMetaL, and implementing a working
authoring environment was a matter of tweaking some display CSS, adding SGML
templates to help authors create new tasks, and a couple of macros to support
viewing CGM images directly in the editor — XMetaL allows you to add in-place
controls to handle graphics, and I was able to add Cortona2D
Viewer with a few lines of JavaScript code[13], the most difficult part being to find the Cortona method giving me
access to the graphic URL.
Publishing
The PoC also requires us to show that the SGML managed by Teamcenter can be published.
HTML output was deemed sufficient.
I had no wish to implement DSSSL[14] in TC, so I decided that the easiest approach would be to first convert the SGML to XML[15], convert that XML to some XML-based industry standard, and then use
ready-made stylesheets to produce the HTML. And what could be better to represent
topic-based SGML than topic-based XML? I decided to go for DITA.
Conversion Outline
The ATA iSpec 2200 SGML to DITA 1.3 XML conversion has a number of distinct
parts:
1. Normalise the SGML to include default attribute values, general entities, etc.
2. Convert the normalised SGML to XML syntax.
3. Insert XML versions of the ISO character entity file declarations and calls to the internal subset.
4. Do away with the graphic entities and add @href attributes to the graphic elements to point at the files directly.
5. Convert the XML to a sort of normalised DITA.
6. Break down the normalised DITA into maps and topics to produce valid DITA.
7. Publish the DITA in HTML format using default DITA HTML stylesheets.
The first two steps above are doable with OpenSP, the first with
ospam and the second with osx. Steps three and four
are easily doable with XSLT, as are five and six. And since TC likes to do its
publishing with Ant scripts, orchestrating everything in Ant would do[16].
Normalising the SGML
Normalising the SGML with ospam was easy, even trivial. This does the
trick:
This simply calls ospam (essentially James Clark's
spam [SP 1.3.4]) to bring everything into a single SGML file.
SGML to XML
Converting the SGML to XML with osx is just as trivial. This does
it:
<target name="sx" depends="prepare" description="Convert file from SGML to XML">
  <!-- Output file is XML, not SGML, so change file suffix -->
  <local name="local.xml"/>
  <propertyregex
      property="local.xml"
      input="${file}"
      regexp="${regex.filename}\.sgm"
      replace="\2.xml"
      global="true"/>
  <exec executable="${sp-loc.path.sx}" output="${base.sx}/${local.xml}">
    <arg value="-c"/>
    <arg value="${sgml-catalogs.output}"/>
    <arg value="-xndata"/>
    <arg value="-xnotation"/>
    <arg value="-xlower"/>
    <arg value="-xempty"/>
    <arg value="${file}"/>
  </exec>
  <concat destfile="${base.reports}/sx-report.txt" append="true">
    <filelist dir="." files="errsx.txt"/>
  </concat>
  <echo message="Converted ${file} to XML"/>
</target>
This calls osx (essentially the same as James Clark's sx
syntax [SP 1.3.4]) to
convert the SGML to XML. This is little more than rewriting the SGML using XML
syntax with the help of an SGML declaration for XML. I really like OpenSP (and James
Clark's SP that it is based on), I have to say. The syntax is weird and the
documentation not the clearest I have seen, but it's an incredibly reliable parser
package.
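For readers who do not use Ant, the same osx invocation can be assembled in a few lines of Python. This is a sketch only: the options are the ones passed in the Ant target above, while the function name and the example paths are assumptions. The command is built but not executed here.

```python
from pathlib import PurePath


def osx_command(osx_path, catalog, sgml_file):
    """Return the osx argument list and the name of the XML output file."""
    args = [osx_path, "-c", catalog,
            "-xndata", "-xnotation", "-xlower", "-xempty",
            sgml_file]
    # The output is XML, not SGML, so swap the file suffix.
    xml_name = PurePath(sgml_file).with_suffix(".xml").name
    return args, xml_name


args, xml_name = osx_command("osx", "catalog", "manuals/em.sgm")
```

The argument list could then be handed to `subprocess.run`, with standard output redirected to the file named by `xml_name`, mirroring the `output` attribute of the Ant `exec` task.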
Character Entities
Those of you who've done the move from SGML to XML will remember the many
character entities used to add special characters. A character entity declaration
looked like this:
<!ENTITY cularr SDATA "[cularr]"--/curvearrowleft A: left curved arrow -->
These were all found in files grouping related entities, with the file declared
and invoked like so:
<!ENTITY % ISOamsa PUBLIC "ISO 8879-1986//ENTITIES Added Math Symbols Arrow Relations//EN">
%ISOamsa;
SDATA entities were done away with in XML, so the easy solution to
map them to Unicode and UTF-8 was to replace the SDATA files with
equivalent Unicode declarations:
<!ENTITY cularr "↶"> <!-- ANTICLOCKWISE TOP SEMICIRCLE ARROW -->
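Declarations like the one above can be generated from a table of entity names and code points. The Python sketch below assumes such a table exists; the two-entry sample and the function name are illustrative only, and the character is written as a numeric character reference so that the file stays plain ASCII.

```python
import unicodedata

# Tiny illustrative subset of an ISO entity set: entity name -> code point.
ISO_AMSA_SAMPLE = {"cularr": 0x21B6, "curarr": 0x21B7}


def xml_entity_decl(name, codepoint):
    """Return an XML entity declaration using a numeric character reference."""
    comment = unicodedata.name(chr(codepoint))
    return f'<!ENTITY {name} "&#x{codepoint:04X};"> <!-- {comment} -->'


declarations = [xml_entity_decl(n, cp) for n, cp in sorted(ISO_AMSA_SAMPLE.items())]
```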
DocBook used to include most ISO character entity files when they still used SGML
DTDs, and so to handle the SGML to XML character entity conversion for the PoC, I
added every single one in a text file like so:
<!-- This maps SGML character entities to their Unicode equivalents -->
<!-- Based on DocBook 4.1.2 XML entities -->
<!-- ISO 8879 official entity sets -->
<!ENTITY % iso-amsa PUBLIC "ISO 8879:1986//ENTITIES Added Math Symbols: Arrow Relations//EN" "xml-entities/iso-amsa.ent">
%iso-amsa;
<!ENTITY % iso-amsb PUBLIC "ISO 8879:1986//ENTITIES Added Math Symbols: Binary Operators//EN" "xml-entities/iso-amsb.ent">
%iso-amsb;
<!ENTITY % iso-amsc PUBLIC "ISO 8879:1986//ENTITIES Added Math Symbols: Delimiters//EN" "xml-entities/iso-amsc.ent">
%iso-amsc;
...
I then added a target to my Ant script that added the text file contents to the
DOCTYPE internal subset of the XML-syntax ATA.
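The insertion itself is simple string surgery. The Python sketch below shows one way to do it, as an illustration rather than the Ant target used in the PoC; the regular expression is an assumption and does not cope with a `]` followed by `>` inside a declaration.

```python
import re

DOCTYPE = re.compile(
    r"<!DOCTYPE\s+(?P<head>[^\[>]+?)\s*(?:\[(?P<subset>.*?)\])?\s*>",
    re.DOTALL,
)


def add_to_internal_subset(xml_text, declarations):
    """Put declarations first in the DOCTYPE internal subset, creating it if absent."""
    def rebuild(match):
        parts = [declarations.strip(), (match.group("subset") or "").strip()]
        body = "\n".join(p for p in parts if p)
        return f"<!DOCTYPE {match.group('head')} [\n{body}\n]>"

    return DOCTYPE.sub(rebuild, xml_text, count=1)
```

Existing declarations, such as the graphic entities, are kept after the inserted text, so nothing already in the subset is lost.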
Graphic Entities
Like Siemens, I didn't particularly want to deal with entities and decided to add
@href markup inline instead. This, again, was trivial. I used an
XSLT stylesheet and unparsed-entity-uri() to resolve the graphic entity
declarations to get the SYSTEM identifiers, and added base URIs to them.
The whole thing was an Ant target calling Saxon.
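The same idea, expressed in Python instead of XSLT, might look like the sketch below. Python's standard library has no equivalent of unparsed-entity-uri(), so the declarations are read with a regular expression. The attribute name `gnbr`, the sample document, and the base URI are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NDATA_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]+)"\s+NDATA\s+\w+\s*>')


def entities_to_href(xml_text, base_uri, attr="gnbr"):
    """Add an href to every element whose `attr` names a declared graphic entity."""
    files = dict(NDATA_DECL.findall(xml_text))   # entity name -> SYSTEM identifier
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        name = elem.get(attr)
        if name in files:
            elem.set("href", urljoin(base_uri, files[name]))
    return root


sample = """<!DOCTYPE em [
<!NOTATION CGM SYSTEM "cgm">
<!ENTITY fig1 SYSTEM "fig1.cgm" NDATA CGM>
]>
<em><graphic><sheet gnbr="fig1"/></graphic></em>"""
root = entities_to_href(sample, "file:///data/graphics/")
```

After the call, the `sheet` element carries an absolute `href`, and the entity declarations are no longer needed to locate the image.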
ATA XML to Normalised DITA XML
I'll freely admit that the idea of converting everything to DITA originated from
me being slightly contrarian. You want topics? I'll give you topics!
Having said that, the DITA stylesheets we tend to use for other customers are quite
pretty, and I wanted to take advantage of that.
The ATA to DITA conversion isn't trivial, but it also doesn't have to be perfect —
this is for a proof of concept, not a production-setting conversion — so various ATA
inline elements for things like part numbers and such I could simply convert to DITA
ph (phrase) elements. Similarly, simple
div wrappers were enough for block-level constructs with no obvious
equivalent.
There were plenty of decisions to make and lots of places where things could go
wrong, so in the interest of speedy development and easy refactoring, I decided to
go with an XProc pipeline running XSLT stylesheets in sequence [Pipelined XSLT Transformations].
I made sure to write XSpec [XSpec] tests for every single XSLT, again to speed up development,
which saved me more than once[17].
The EM pipeline took a few days to write; the SB pipeline was finished in one day.
Pipelines pay off.
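The pipeline principle, small steps run in sequence with each step consuming the previous step's output, can be shown without XProc. The Python sketch below is an analogy only; the element names `partno` and `specblock` are hypothetical stand-ins for ATA elements, and each step is a plain function instead of an XSLT stylesheet.

```python
import copy
import xml.etree.ElementTree as ET
from functools import reduce


def rename(old, new):
    """Build a step that renames every `old` element to `new`."""
    def step(root):
        root = copy.deepcopy(root)      # each step returns a fresh tree
        for elem in root.iter(old):
            elem.tag = new
        return root
    return step


def run_pipeline(root, steps):
    """Feed the output of each step to the next one."""
    return reduce(lambda tree, step: step(tree), steps, root)


# Hypothetical mappings: inline semantics become ph, unknown blocks become div.
STEPS = [rename("partno", "ph"), rename("para", "p"), rename("specblock", "div")]

source = ET.fromstring(
    "<task><para>Remove <partno>123</partno></para><specblock/></task>")
result = run_pipeline(source, STEPS)
```

Because every step is small and leaves its input untouched, each one can be tested in isolation, which is the same property that makes per-stylesheet XSpec tests practical.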
ATA Topics?
The ATA EM DTD is seemingly well equipped for a topic-based approach, with
pageblocks stating where topics should be. This, I guess, is what the Siemens
sales team recognised and successfully sold to the customer. The entire
structure above the pageblocks is a systems-oriented book skeleton as defined by
the ATA iSpec specs [ATA iSpec 100].
The pageblock isn't even defined in the ATA numbering system, however. Here's
an example of the Task Oriented Support System Numbering:
The pageblock wrappers in the DTD are a remnant of page-oriented
documentation; they are inserted into binders on a level suitable for grouping
functional tasks. A complex product may have any number of tasks and subtasks
inside a subject, with each task being grouped inside pageblocks to fit
functional code (see above), but you can also use it to group any variant of the
tasks.
What a task actually looks like is very much up to the author and the context,
so the topic breakdown is not as clear-cut as the initial sales pitch would have
us believe. It would seem that a more suitable topic breakdown is just as
dynamic as in DITA.
Notice how the content model mixes subtasks and block-level elements, and how
it allows the latter to occur after a subtask — this is not
a section hierarchy per se. Thankfully, the only block-level elements to occur
between subtasks in the PoC SGML are graphic elements:
The graphic element is a wrapper for one or more links to the
actual images and again very much page-oriented; ATA EM graphics tend to
illustrate an entire task or subtask, the idea being to insert the graphics
separately in a binder, before or after the task or subtask. For this
conversion, I decided to label graphic elements as topics,
alongside tasks and subtasks.
My idea was to first convert to a kind of normalised DITA, a
format where everything is in a single file, like this:
There are no DITA topics anywhere before the pageblock level[18], just nested topicref elements to define the chapter,
section, and subject ATA structure. The content in an ATA Engine Manual is
written from pageblock level and down.
ATA has a number of attributes to identify the system breakdown as specified
in both the iSpec 100 and the iSpec 2200. As seen above, I chose to convert
those to a DITA-style list of @props tokens:
For those of you not into DITA, this is a DITA notation listing properties and
their values: chapnbr="05", sectnbr="00", etc. Here,
it is a useful mechanism to preserve the ATA attributes and their values[19].
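A small Python sketch of that attribute conversion follows. It assumes the `name(value)` token form of DITA's generalised attribute syntax; the attribute `subjnbr` and the function name are assumptions added for the example.

```python
def to_props(attributes, keep=("chapnbr", "sectnbr", "subjnbr")):
    """Turn selected ATA attributes into a space-separated list of name(value) tokens."""
    return " ".join(
        f"{name}({attributes[name]})" for name in keep if name in attributes
    )


props = to_props({"chapnbr": "05", "sectnbr": "00", "key": "ignored"})
```

Attributes that are not listed in `keep` are dropped, and the order of the tokens follows `keep` rather than the order in the source markup.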
Pageblocks were too big to be DITA topics, so my first few XSLT pipeline steps
examined the contents of tasks inside them. An ATA task without a subtask became
a topic, like so:
Mostly, if the ATA SGML task had subtasks, there would be no content beyond
titles on task level. In the few cases there were, I'd simply add a topic on
task level:
ATA front matters received a similar treatment. In addition to various
metadata, they tend to contain lists of applicable service bulletins and
temporary revisions, both of which are essentially lists well suited to be
topics.
Topic Breakdown
The final breakdown of the normalised DITA into maps and topics is done with a
single XSLT 3.0 stylesheet, run after the pipeline has completed and its results
have been written to disk. Basically, the XSLT iterates through the XML, acting on
every topicref containing an @href (that is, a
topicref that actually references a topic rather than only being
used for indicating the structure through nesting).
For the most part, this is a trivial exercise.
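As an illustration of the split, the Python sketch below separates a single-file structure into a map and a set of topics. It assumes that each `topicref` with an @href holds its `topic` as a direct child, which is a guess at the normalised format made for the example, and it returns the pieces in memory instead of writing files.

```python
import copy
import xml.etree.ElementTree as ET


def split_topics(normalised_map):
    """Return (map, topics): the map without inline topics, and a dict of
    href -> topic element for every topicref that carries an @href."""
    ditamap = copy.deepcopy(normalised_map)
    topics = {}
    for ref in list(ditamap.iter("topicref")):
        href = ref.get("href")
        inline = ref.find("topic")
        if href and inline is not None:
            topics[href] = inline
            ref.remove(inline)
    return ditamap, topics


sample = ET.fromstring(
    '<map><topicref><topicref href="t1.dita">'
    '<topic id="t1"><title>Removal</title></topic>'
    '</topicref></topicref></map>'
)
ditamap, topics = split_topics(sample)
```

Structural `topicref` elements without an @href are left alone, so the nesting that expresses chapters, sections, and subjects survives in the map.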
Note
You may ask why I didn't split the ATA XML into smaller parts first. The main
reason is how the XProc XSLT pipeline works and how the output of each XSLT is
serialised, which allows me to run the XSLTs in sequence, taking the output of
one XSLT as the input to the next. Essentially, every output in the pipeline is
on the step's secondary port, which is also what you'd use to grab
result-document output in the XSLT.
DITA to HTML
The DITA to HTML conversion is an Ant script that runs a series of XSLT 1.0[20] stylesheets using msxsl.exe[21], and is added to Teamcenter as a zip package. I run it as a
subant script, which essentially means that it inherits its input
properties from my script and runs just like any other Ant target would, in the way
and order I define. We changed very little in it, beyond tweaking the CSS to better
match the customer's brand.
I have to say, once I had written the XSLT 3.0 and the pipeline, I was tempted to
add an XSLT 2.0 stylesheet somewhere, just so I could claim to have used all three
XSLT versions in a single project.
Was It A Good Idea?
What did I learn? Was it a good idea to push the ATA SGML into TC, break it down to
DITA-sized topics, and then spit it out again for editing and publishing?
Honestly? No.
So please don't.
Breaking Down, Decomposing, Splitting
Teamcenter and many others like it can break down (split, decompose, pick your
catchphrase) XML documents into smaller chunks and manage them in a database.
They'll add code to handle those chunks, and they will either support a standard
like DITA or S1000D rather than inventing something of their own, or they'll just
market their product as do-all, end-all, and say they can do it with
anything.
They can't.
Topic-based authoring, again, is all the rage in technical publishing, mostly
because it's a good idea. It fits. But not every standard was developed with this
in mind. ATA, mostly, wasn't. It lacks the standardisation of content into common
denominators, as DITA and S1000D both do, which is why many aerospace
manufacturers are now using them instead of ATA (while still keeping ATA's
brilliant, system-oriented breakdown of functions rather than
topics).
Let me offer you a simple example: The SGML Engine Manual document includes a
number of ID/IDREF pairs. This, in a single SGML file, is
fine. But as soon as you break down the Engine Manual document into topics, you risk
invalidating every topic with an IDREF pointing at an ID
that happens to be in a different topic. An Engine Manual is basically a book; it's
designed to be a single, large unit. You may not have to read it from cover to
cover, but its whole organisation is that of a book. And not every book-type
dependency is something that a parser can catch.
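The ID/IDREF problem is easy to demonstrate. The Python sketch below reports, per topic, the references that no longer resolve once the topics are validated in isolation. The attribute names `id` and `refid` and the element names in the usage example are assumptions, not taken from the ATA DTD.

```python
import xml.etree.ElementTree as ET


def dangling_refs(topics, id_attr="id", ref_attr="refid"):
    """Map each topic name to the IDREF values that do not resolve inside it."""
    report = {}
    for name, root in topics.items():
        ids = {e.get(id_attr) for e in root.iter() if e.get(id_attr)}
        refs = [e.get(ref_attr) for e in root.iter() if e.get(ref_attr)]
        missing = [r for r in refs if r not in ids]
        if missing:
            report[name] = missing
    return report


topics = {
    "t1": ET.fromstring('<task id="a"><refint refid="b"/></task>'),
    "t2": ET.fromstring('<task id="b"><refint refid="b"/></task>'),
}
report = dangling_refs(topics)
```

In the sample, the reference in the first topic was valid while both tasks lived in one document, but points nowhere after the split.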
Unless you're prepared to move away from your book-based content paradigm in
action as well as (sales pitch-induced) spirit, don't.
SGML in the 21st Century
Mind, that necessary move away from the book-oriented paradigm I was discussing is
just that, a paradigm shift; no SGML is involved, just XML. XML is mature because
everything it's implemented in is. There is no need for an SGML declaration to save
memory by limiting attribute content, and there is a wealth of related standards and
tools to support whatever you need to do. We're only discussing how you're
authoring your content.
SGML predates all that. It came about when disk space was at a premium and a
project to implement it would start with a six-month prestudy and another six months
to start developing the necessary tool support. There was no SGML document model,
no
SGML transformation language, no wealth of ready-to-rock tools to choose from. There
still isn't, because all that was developed later, for XML.
The tools today all focus on XML, not SGML. The very few that claim to support
SGML (for example, I know of three production-quality SGML editors still around) no
longer do any SGML-related development. If there's a bug related to SGML, you're
out of luck.
So if a product claims to support SGML, that support is either a leftover or hubris, or both.
In many cases, such as with Teamcenter, the SGML support now involves a conversion
to XML and back, and any processing is done in XML. If that's what the tools do anyway, it's what everyone should do.
SGML, as much as it hurts to admit it, is dead.
The aerospace industry may still be using ATA iSpec 2200, but it is my firm belief
that the hidden costs of staying in SGML by far outweigh the cost of a proper XML
migration project. Even moving to XML yourself and converting to SGML and back for those of
your partners that insist on staying in SGML will be cheaper because it will be more
controllable. See section “Postscript”.
My silly little SGML to DITA publishing pipeline added an extra XML conversion —
TC handles everything in XML internally, remember, so when it exports SGML to my
pipeline, it's been to XML and back already — which is just bizarre because it
exposed all kinds of errors and bugs and issues stemming from that internal XML
conversion.
So, SGML in the 21st century? Nah.
End Notes
Allow me to end with a few top tips and conclusions.
Don't promise your product has SGML support unless you've tested and know it
to be true.
Graphic entities are dead. Why anyone would use them is beyond me. However,
having said that, it's not hard to process them if you're leaving them
behind.
While the ATA iSpec 2200 DTDs are very much page-oriented, redefining them in
topic-oriented fashion — not necessarily DITA, mind — makes sense because I
think it's a good idea for any technical documentation. The basic idea of TC's
structure view is sound, even though their approach can be a tad
simplistic.
I love SGML for its many weird features[22] and because it was my introduction to markup, but the XML
infrastructure — the many sister standards beginning with X, the
tool support, the fact that people actually know about it — is awesome. You
don't know how much you miss them until you try to do without.
Plus, I'd just like to say that I'd still like James Clark to sign the SP
package for me. It's a great package.
Postscript
After the first draft of this paper was submitted, I was finally able to convince
the involved parties to move ahead with an ATA-like XML approach
instead. I wrote XML versions of the ATA DTDs[23], tweaked the ATA SGML to DITA conversion pipeline to ignore the DITA
bits and add a few steps to generate valid XML instances from the source SGML, and
wrote a quick oXygen Author framework for the ATA-like XML.
I also wrote a pipeline that converted the XML back to ATA SGML for those cases
when the end customer is required to deliver SGML to its partners and customers;
this pipeline, just as the SGML to XML one, relies heavily on doing the prep work
using a sequence of XSLT 3.0 stylesheets[24], and then runs the tweaked XML through SP to generate valid SGML.
As I write this, the PoC is finally scheduled to start within the next week or
two.
[SP 1.3.4] James Clark, SP, An SGML System
Conforming to International Standard ISO 8879 - Standard Generalized Markup
Language, http://www.jclark.com/sp/
[1] The S1000D example documentation kit describes a bicycle.
[2] ATA is producing XML-based standards, too, these days. SGML has a way of
sticking around, though.
[3] I studied some basic engineering before venturing into physics.
[4] But in no way is it a guarantee for properly sized topics; the author can
always include too much.
[5] No, it's not about XML attributes at all, which is quite confusing for a
markup person to begin with. Think of the term attribute as
equivalent to a property.
[6] Unfortunately mostly without the expressive power of XPath.
[10] Nothing to do with attributes or XML, remember; these are topic breakdown
rules for Teamcenter.
[11] Essentially, there are only two or three production-quality editors
capable of editing SGML available today. The customer was moving away from
one, so our choices were limited. Not that XMetaL is a bad editor; it's
not.
[12] The full editor rather than RapidAuthor was necessary, because of the
limitations posed by SGML.
[13] No small feat since I don't do JavaScript. The time-honoured
Copying and Pasting from Stack Overflow
development method was my friend.
[14] One of the things you might use to publish your SGML with, back in the
day.
[15] While Teamcenter represents SGML as XML internally, they've not provided a way
to access that XML.
[16] I did consider XProc 3.0 for all of the above, but the only available
XProc 3.0 processor as I write this, Morgana XProc III, couldn't run OpenSP
in my tests.
[17] XSpec, in case you haven't looked into it, is a way to unit test your XSLT
(or XQuery or Schematron) by defining the source and the expected output of
a structure. It's brilliant.
[23] There are many papers out there about this sort of thing; I can highly
recommend Betty Harvey's 2016 Balisage paper [SGML in the Age of XML].
[24] Mostly to generate an SGML DOCTYPE declaration that contains
ENTITY and NOTATION declarations, but also to
convert processing instructions in the XML to SGML inclusion
elements.
Harvey, Betty. “SGML in the Age of XML.” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5,
2016. In Proceedings of Balisage: The
Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Harvey01