How to cite this paper

Nordström, Ari. “Topic-based SGML? Really?” Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). https://doi.org/10.4242/BalisageVol26.Nordstrom01.

Balisage: The Markup Conference 2021
August 2 - 6, 2021

Balisage Paper: Topic-based SGML? Really?

Ari Nordström

Ari is an independent markup geek based in Göteborg, Sweden. He has provided angled brackets to many organisations and companies across a number of borders over the years, some of which deliver the rule of law, help dairy farmers make a living, and assist in servicing commercial aircraft. And others are just for fun.

Ari is the proud owner and head projectionist of Western Sweden's last functioning 35/70mm cinema, situated in his garage, which should explain why he once wrote a paper on automating commercial cinemas using XML.

Abstract

Topic-based technical documentation is all the rage these days, made popular by DITA and others. Topics can be integrated with the engineering data for the products both describe using nifty Product Life Management (PLM) tools that make this easier than ever. But what if you're stuck with SGML, voluntarily or involuntarily? Can you, too, bring your content into the topic-based paradigm or should you rather not?

This paper explores your options and the state of SGML in the PLM world today. It nose-dives into the ATA iSpec 2200 SGML, discusses some of the pains to implement it, and finally converts the ATA to DITA.

Intro

Marrying Engineering Data and Techpub

Authoring
Content Management

The Power of a Good Sales Team

The Proof of Concept

ATA SGML Engine Manuals

PoC Implementation Notes

DOCTYPE and SGML
We Don't Do Entities
We Don't Do Entities, Part Two
Notable Differences Between SGML and XML
SGML in the Editor

Publishing

Conversion Outline

Normalising the SGML

SGML to XML

Character Entities

Graphic Entities

ATA XML to Normalised DITA XML

ATA Topics?

Topic Breakdown

DITA to HTML

Was It A Good Idea?

Breaking Down, Decomposing, Splitting
SGML in the 21st Century

End Notes

Postscript

Intro

Topic-based content has been a thing for a number of years by now, with standards such as DITA [DITA Version 1.3] making it increasingly popular in the techpub world I frequently inhabit. The idea is to split your technical documentation, for example, a User Guide or a Programmer's Manual, into smaller topics that by themselves only describe a single subject or task. This makes it easy to reuse and process them in a variety of contexts and output media, from multiple User Guide variants to describe a range of products to context-sensitive help texts online.

The topics are marked up with profiling information that identifies their valid contexts. For example, a topic may be applicable to products A and B, but only partially C and D, while another may apply to the entire product range. Different properties are made available, for example, to state that the content is applicable only to specific serial number ranges or when the product is equipped with a specific module.

The individual topics seldom have a section hierarchy, instead leaving that job to map-like constructs that link to the topics and define whatever chapter, section, etc structures that particular topic assembly requires.

DITA in particular has been pushing the idea for the last decade or two, and DITA implementations are now available out of the box for many editors and other tools. There's now a host of DITA solutions out there for your content, regardless of your business area, including software that helps you integrate your topics with your engineering data, and more. Topic-based authoring is entering the mainstream in techpub.

There are plenty of non-DITA topic-based solutions, too. Among them is S1000D [S1000D], a standard originally intended to cater for the maintenance documentation needs of military aircraft but now greatly expanded to describe the operation and maintenance of any land, sea, and air vehicle^[1]. S1000D topics are known as data modules, but the idea is the same: one module describes one aspect of the product, be it a description, parts list or maintenance task, to enable reuse in multiple contexts. S1000D has also spawned a variant specifically for seagoing vessels, known as Shipdex.

Note

S1000D is very much meant for information exchange, and so businesses agreeing on exchanging information using the standard will often define business rules to detail which parts of the standard are used, and how.

The aerospace industry has produced other standards to help maintain their products. Airlines for America (A4A), formerly known as Air Transport Association of America (ATA) [ATA], started publishing aviation technical documentation guidelines in the 1950s and has updated them ever since, eventually including SGML tag sets (DTDs) alongside the spec itself. The ATA iSpec 2200 standard [ATA iSpec 2200], a 3,000+-page book describing all aspects of aircraft maintenance documentation, and with accompanying SGML DTDs, is still in active development^[2]. ATA is an industry standard; their systems-oriented maintenance numbering system [ATA iSpec 100] is used by the majority of aviation industry manufacturers, regardless of what documentation format they use.

The ATA iSpec 2200 SGML DTDs are typical SGML-age creations, with typically monolithic, book-like approaches to the information. For example, documents using their Aircraft Maintenance or Engine Manual DTDs tend to be huge, with thousands of illustrations, 50+ chapters, and text content weighing in at dozens of megabytes. You'd be hard-pressed to label them as topic-based.

Or can you?

Marrying Engineering Data and Techpub

I currently work in the aerospace techpub industry, as my current client is a PLM (Product Life Management) service provider for many of the big aerospace manufacturers. I was brought in as the markup guy, the go-to guy for anything related to angled brackets but also older standards, including SGML. It's interesting work, but just between you and me, I'd sometimes kill for a modern XML standard.

My client has heavily invested into services around Siemens Teamcenter, a suite of product lifecycle management computer software applications. [Teamcenter on Wikipedia] [Siemens Teamcenter]

I would describe Teamcenter, or TC, as we tend to call it, like this (but I'm a layman and will probably get it wrong):

Imagine your product - be it an alarm clock, a coffee maker or a jet engine - as a 3D model. The model is complete; it contains every single part, spacer, and screw. You can view the model from any angle, disassemble and explode any portion of it, and assemble it again. And you can edit and change it, and base all of your product design and engineering on it. This is what TC manages. Some of it you'll need other software for, but you get the idea.

Mind, there are a number of these around; TC is by no means alone in this space. CAD has changed engineering far beyond recognition since I went to school^[3].

From a documentation point of view, there are obvious advantages. A 3D model can provide parts lists, generate 2D illustrations of the parts, and express the disassembly and assembly procedures built into it as maintenance tasks, but also update both as soon as the engineering data is updated.

Topic-based authoring is very much the standard to aim for in the PLM space. Topic-based profiling and reuse à la DITA is perfect if you need to add documentation to your 3D model.

Authoring

Enter Cortona3D RapidAuthor [RapidAuthor], a suite of products to manipulate the engineering data that can be integrated with Teamcenter. RapidAuthor can then be used to author techpub content, from multimedia 3D content to illustrated parts catalogues to assembly and disassembly tasks, all based on the engineering data.

Once you define a procedure in RapidAuthor, based on the engineering data, it can generate a disassembly/assembly task in any format known to it, from DITA to S1000D. The generated markup is not perfect but can be edited manually using an embedded instance of XMAX, the XMetaL Author ActiveX plug-in; this includes adding 2D images from the 3D model the procedure is based on. It's a powerful way to create documentation, and it's inherently topic-oriented, as the focus is on assemblies and disassemblies of parts of the product^[4]. Alternatively, you can configure an external editor to handle the markup instead.

Content Management

Teamcenter (TC) also offers topic-based documentation support with their Content Management module, be it DITA or some other markup vocabulary. There's support for gathering the topics into DITA-like relational maps to build up what is known as a publication structure.

A publication structure is a gathering of nested topics and multimedia that together comprise a manual. The approach is to express the individual topics as markup while leaving the rest to a relational hierarchy of headings. It's very much DITA-inspired, of course, and can be exported as a DITA map. They claim to be vocabulary-agnostic, though, and provide support for S1000D and other DTDs and XML Schemas, allowing you to split the input document into topic-sized chunks when importing it to TC by defining XML Attribute Mapping^[5] rules, essentially XPath-like expressions^[6].

For example, given an XML document containing chapters, sections, and subsections, you might define rules that break it down into topics on those levels, storing each topic separately and representing the hierarchy in the relational database, including using the titles of each topic as topic heading properties in the database (see Figure 3). Mapping rules can also represent other properties, from IDs to cross-references to content transclusions and image references.
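A minimal sketch of this kind of breakdown, written in Python rather than in TC's own mapping rule syntax (which is not shown here), might look as follows. The element names, the `TOPIC_LEVELS` set, and the record layout are assumptions for illustration only; the records merely mimic rows in a relational publication structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical breakdown rule: these element names each start a new topic.
TOPIC_LEVELS = {"chapter", "section", "subsection"}


def break_down(elem, parent_id=None, store=None):
    """Walk the tree and store every topic-level element as its own record.

    Each record keeps the title as a heading property and points at its
    parent record, so the hierarchy lives in the records, not in the markup.
    """
    if store is None:
        store = []
    if elem.tag in TOPIC_LEVELS:
        topic_id = len(store) + 1
        # The topic body is whatever is neither the title nor a nested topic.
        body = [
            ET.tostring(child, encoding="unicode").strip()
            for child in elem
            if child.tag not in TOPIC_LEVELS and child.tag != "title"
        ]
        store.append({
            "id": topic_id,
            "parent": parent_id,
            "level": elem.tag,
            "title": elem.findtext("title", default=""),
            "body": body,
        })
        parent_id = topic_id
    for child in elem:
        break_down(child, parent_id, store)
    return store


SAMPLE = """<manual>
  <chapter><title>Engine</title>
    <section><title>Fan</title><para>Remove the fan.</para>
      <subsection><title>Blades</title><para>Inspect blades.</para></subsection>
    </section>
  </chapter>
</manual>"""

topics = break_down(ET.fromstring(SAMPLE))
for topic in topics:
    print(topic["id"], topic["parent"], topic["level"], topic["title"])
```

Each record can be stored and checked out on its own, while the `parent` links carry the chapter, section, and subsection hierarchy.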

If you're thinking just like DITA maps but in a relational database, you're not wrong.

But here's the thing: they also claim to support SGML.

The Power of a Good Sales Team

Enter the topic^[7] at hand. The end customer is a large manufacturer in the aerospace industry. They've been using ATA iSpec 2200 for years, their partners and subcontractors have been using it, they all require it when exchanging components and documentation, and getting from SGML to the 21st century isn't easy. Not that they haven't tried; some time ago, when introducing S1000D XML for some deliveries, they made an attempt at moving everything to XML but failed. Now, though, the software and tools they have been using are being phased out, and they've had to start looking elsewhere.

Enter an enterprising Teamcenter sales team. TC has been part of the customer's engineering data setup for years, but recently, it was pointed out to them that Teamcenter Content Management can also handle SGML. They were shown flashy presentations of engineering data married to content, document breakdown into individual topics for storage and editing, export of composed documents, and flawless PDF and HTML output, all based on engineering data. Yes, they could have it all, too, and everything could be based on their ATA SGML.

The Proof of Concept

A Proof of Concept (PoC) was agreed on, comprising a single product and two document types, Engine Manuals and Service Bulletins. Both would be imported into Teamcenter and broken down into bite-sized chunks for editing, and then reassembled into suitable publications. And it would all be ATA SGML.

Service Bulletins are short documents, usually no more than a handful of published pages with few or no images, and no breakdown into smaller topics is actually required. The Engine Manuals, however, are a different story.

ATA SGML Engine Manuals

The Engine Manual ATA DTD is a child of its time, a rather typical SGML DTD representing a monolithic printed book. It has chapters and sections and subsections (known as subjects), it has markup for individual maintenance tasks and subtasks, and there are front matters inserted into all of its main parts.

The content is authored from pgblk (pageblock) and down, and as the element name suggests, its origins really are the printed page. Binders of them, to be exact. The chapter, section, and subject hierarchy is a systems-oriented breakdown of aircraft maintenance content, specified in ATA iSpec 100 [ATA iSpec 100] and, to some extent, in ATA iSpec 2200 [ATA iSpec 2200]. The headings are mostly predetermined, and there is no content beyond revision markup and change descriptions.

For block-level purposes, there is the usual array of text paragraphs, lists, graphics, and tables (CALS). The lists in particular have multiple levels, and are sometimes used for content I'd today label as procedures. Inline elements include cross-references, subscript and superscript, and some domain-specific semantics for part numbers and such.

Being an SGML DTD, it uses inclusions and exclusions for some semantics — see the inclusions from the root element in Figure 5. Notably, it allows revision and effectivity^[8] markup all over the place, both of which are modelled with EMPTY elements. The revision markup in particular is implemented as standoff markup - a revst (revision start) element is inserted somewhere in the structure, followed by a revend (revision end) element elsewhere. The two together mark a change: from here to here, we've changed something. Since both are EMPTY elements, they can be inserted anywhere without regard to well-formedness.

I should also mention that there is a certain degree of freedom for ATA adopters, as the DTDs can be adapted with custom markup, depending on the level of compliance aimed for as defined in the spec.

PoC Implementation Notes

I was first introduced to the proof of concept project right around the time I first started my new contract, a few days after I first encountered TC. Beyond some tests with ATA SGML, all of them unsuccessful but initially attributed to my inexperience, I spent the first several months of my contract on the DITA implementation. Among other things, I drafted an ATA iSpec 100 system and subsystem-based^[9] DITA solution in TC for another aerospace customer, and while that experience was not without its horrors, it seemed to me that the topic-based approach was clearly what TC did best.

Cut back to the PoC, several months later. I created a few XML Attribute Mapping rules in TC^[10] to import an SGML Engine Manual, and failed miserably. No matter what I tried, the SGML import would fail.

A tiny test SGML DTD and instance also failed.

Siemens suggested changes to my import rules, and I was able to import my test SGML, breaking down the contents into topics where I wanted them to. However, when checking out a topic to edit in XMetaL, the editor chosen for the PoC^[11], the topics came out as XML, without any graphics and without a DOCTYPE declaration to put the graphic entity declarations in. Oops.

Eventually Siemens acknowledged that their SGML implementation had a few bugs, assigned a developer to fix them, and thus began a cycle of frequent DLL deliveries and tests, alongside daily meetings.

DOCTYPE and SGML

Among the first issues fixed was the missing DOCTYPE declaration and the little matter of TC outputting XML rather than SGML. It turned out, unsurprisingly, that internally it handled everything in XML, using OpenSP [OpenJade] as a parser and as the main software for conversions between SGML and XML. They had never broken down SGML into smaller topics and so had never realised that this particular issue existed.

The fixes brought back SGML for the edited topics, and with the SGML the DOCTYPE declaration. This is where I discovered the next set of problems.

We Don't Do Entities

The graphics used in the Engine Manual were declared as graphic entities, like so:

<!ENTITY name SYSTEM "filename.suffix" NDATA CGM>

In other words, the graphic file, filename.suffix, is given an alias, name, that is then referenced in ENTITY-type attributes in the content to insert the image. The NDATA bit is there to explain to the SGML application how the entity should be processed. The entire Engine Manual contained 6,000 of these, one for each image inserted.

Teamcenter requires you to first import any images, using mapping rules to name the images in the system so that name can then be used to associate the graphics with the content. When I did this and then imported the document, I realised that only some graphics were associated with the content as expected, namely those where the graphic entity declarations followed this pattern:

<!ENTITY name SYSTEM "name.suffix" NDATA CGM>

That is, the entity name must be the same as the file name, minus the suffix. Siemens confirmed, adding that "we don't do entities". They had no intention of changing this, either; SGML and entities were dead technologies and would go away.

I added an Ant script to update graphic file names accordingly — I much wanted a solution where I didn't have to change anything other than the DOCTYPE in the SGML; writing a script that would rename a SYSTEM identifier and the corresponding file was far easier than changing entity references inside the SGML.
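The renaming logic can be sketched in Python as below. This is not the Ant script itself, only an illustration under stated assumptions: declarations follow the exact `SYSTEM "..." NDATA` pattern shown above, and the function name `align_system_ids` is made up for the example. It rewrites each SYSTEM identifier to the entity name plus the original suffix and returns a rename map that a separate step would apply to the files on disk.

```python
import posixpath
import re

# Matches declarations of the form <!ENTITY name SYSTEM "file" NDATA notation>
ENTITY_RE = re.compile(
    r'<!ENTITY\s+(?P<name>\S+)\s+SYSTEM\s+"(?P<sysid>[^"]+)"'
    r'\s+NDATA\s+(?P<notation>\w+)\s*>'
)


def align_system_ids(doctype_text):
    """Make every SYSTEM identifier equal to '<entity name><suffix>'.

    Returns the rewritten declarations and an {old file: new file} map.
    Entity references in the content are left untouched.
    """
    renames = {}

    def fix(match):
        name = match.group("name")
        sysid = match.group("sysid")
        folder, filename = posixpath.split(sysid)
        suffix = posixpath.splitext(filename)[1]
        new_sysid = posixpath.join(folder, name + suffix)
        if new_sysid != sysid:
            renames[sysid] = new_sysid
        return '<!ENTITY {} SYSTEM "{}" NDATA {}>'.format(
            name, new_sysid, match.group("notation")
        )

    return ENTITY_RE.sub(fix, doctype_text), renames


DECLARATIONS = (
    '<!ENTITY fig1 SYSTEM "g0001.cgm" NDATA CGM>\n'
    '<!ENTITY fig2 SYSTEM "fig2.cgm" NDATA CGM>'
)
updated, rename_map = align_system_ids(DECLARATIONS)
```

Keeping the rename map separate from the rewritten declarations makes it possible to review the planned file renames before touching anything on disk.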

Here's where I discovered the next problem. The customer had obviously had entity naming problems and their solution had been to use the file name, with the suffix, as both the entity name and the file name:

<!ENTITY name.suffix SYSTEM "name.suffix" NDATA CGM>

This was a no go with Teamcenter, but, as I discovered, so was my scripted fix that added an extra suffix to appease Teamcenter:

<!ENTITY name.suffix SYSTEM "name.suffix.suffix" NDATA CGM>

For some reason, the software doesn't like multiple suffixes, and nothing I did helped. For the PoC, I've simply updated the test documents semi-manually, using regular expressions.

We Don't Do Entities, Part Two

Having finally imported the full Engine Manual and its 6,000 graphics, I discovered that Siemens' new DLL added all 6,000 to each and every pageblock being checked out and edited. A pageblock might only use four graphics, yet the DOCTYPE would contain all 6,000.

Thankfully, all 6,000 graphics are not exported for every checkout.

Notable Differences Between SGML and XML

Remember the revision markup I mentioned earlier, with EMPTY elements being used to mark the start and end of a revision? Something like this:

<PARA>S/B<SBNBR>73-0177</SBNBR>,<REVST>Revision 1<REVEND></PARA>

The above is valid SGML, of course, since EMPTY elements in SGML look like start tags. In the DTD, they use the OMITTAG feature (- O) and, as they are declared as EMPTY, that's all there can ever be.

XML did away with this when it introduced the concept of well-formedness, and for a good reason; trying to figure out where an end tag should go was always a problem for SGML tools, not to mention browsers. It did define shorthand for EMPTY elements, though, so in XML, <REVST></REVST> is equal to <REVST/>.

Any translation between SGML and XML and back needs to address this problem. This is what Teamcenter exported, however:

<PARA>S/B<SBNBR>73-0177</SBNBR>,<REVST>Revision 1</REVST></REVEND></PARA>

Note the REVEND end tag. The XML handled internally must be converted to SGML on the fly when checking out or exporting. Clearly, the code thinks the REVEND tag was meant to be the REVST end tag. Why REVEND is still there is a mystery to me. Other EMPTY elements, for example, COLSPEC in CALS tables, are handled similarly and can produce similar weirdness on export.

The SGML DTD includes the REVST and REVEND elements from the root and down, the idea being to be able to use them anywhere, and I very much doubt Teamcenter's SGML to XML and back conversions are schema aware. Yet sometimes there is an end tag where it simply cannot be, such as the above, which, to me, suggests a streaming parser artifact.
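To make the EMPTY element problem concrete, here is a small Python sketch of the SGML-to-XML direction. It is an illustration only and says nothing about how Teamcenter does it: it assumes the set of EMPTY element names is known up front (the `EMPTY_ELEMENTS` set below is a hand-picked assumption), that the input uses SGML syntax, and that attribute values contain no angle brackets. A real conversion should leave this to an SGML parser that reads the DTD.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: element names declared EMPTY in the DTD, listed by hand here.
EMPTY_ELEMENTS = {"REVST", "REVEND", "COLSPEC"}

# A start tag: name plus optional attributes, no slash before the closing '>'.
START_TAG_RE = re.compile(r"<([A-Za-z][\w.-]*)((?:\s[^<>]*)?)>")


def close_empty_elements(sgml):
    """Rewrite <NAME ...> as <NAME .../> for every known EMPTY element."""

    def fix(match):
        name, attrs = match.group(1), match.group(2)
        if name.upper() in EMPTY_ELEMENTS:
            return "<{}{}/>".format(name, attrs.rstrip())
        return match.group(0)

    return START_TAG_RE.sub(fix, sgml)


SAMPLE = "<PARA>S/B<SBNBR>73-0177</SBNBR>,<REVST>Revision 1<REVEND></PARA>"
converted = close_empty_elements(SAMPLE)

# Parsing succeeds only if the result is well-formed XML.
ET.fromstring(converted)
```

The point of the sketch is that the conversion needs the list of EMPTY elements from somewhere; without it, a tool has to guess where the missing end tag belongs.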

SGML in the Editor

The SGML editor used in the PoC is JustSystems' XMetaL Author^[12]. It's one of the very few editors to support SGML today, and it does it well. ATA iSpec 2200 caused no issues in XMetaL, and implementing a working authoring environment was a matter of tweaking some display CSS, adding SGML templates to help authors create new tasks, and a couple of macros to support viewing CGM images directly in the editor — XMetaL allows you to add in-place controls to handle graphics, and I was able to add Cortona2D Viewer with a few lines of JavaScript code^[13], the most difficult part being to find the Cortona method giving me access to the graphic URL.

Publishing

The PoC also requires us to show that the SGML managed by Teamcenter can be published. HTML output was deemed sufficient.

I had no wish to implement DSSSL^[14] in TC, so I decided that easiest would be to first convert the SGML to XML^[15], convert that XML to some XML-based industry standard, and then use ready-made stylesheets to produce the HTML. And what could be better to represent topic-based SGML than topic-based XML? I decided to go for DITA.

Conversion Outline

The ATA iSpec 2200 SGML to DITA 1.3 XML conversion has a number of distinct parts:

Normalise the SGML to include default attribute values, general entities, etc.
Convert the normalised SGML to XML syntax.
Insert XML versions of the ISO character entity file declarations and calls to the internal subset.
Do away with the graphic entities and add @href attributes to the graphic elements to point at the files directly.
Convert the XML to a sort of normalised DITA.
Break down the normalised DITA into maps and topics to produce valid DITA.
Publish the DITA in HTML format using default DITA HTML stylesheets.

The first two steps above are doable with OpenSP, the first with ospam and the second with osx. Steps three and four are easily doable with XSLT, as are five and six. And since TC likes to do its publishing with Ant scripts, orchestrating everything in Ant would do^[16].

Normalising the SGML

Normalising the SGML with ospam was easy, even trivial. This does the trick:

<target name="spam" depends="prepare" description="Normalise file">
    <local name="local.sgm"/>
    <propertyregex
        property="local.sgm"
        input="${file}"
        regexp="${regex.filename}\.sgm"
        replace="\2.sgm"
        global="true"/>
    
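    <exec executable="${sp-loc.path.spam}" output="${base.spam}/${local.sgm}" dir="." inputstring="&#10;">
        <!-- -p given twice: output the prolog and expand entity references between its declarations -->
        <arg value="-p"/>
        <arg value="-p"/>
        <!-- -x given twice: expand references to all entities that contain tags -->
        <arg value="-x"/>
        <arg value="-x"/>
        <!-- -l: prefer lower case for names inserted by spam -->
        <arg value="-l"/>
        <!-- -c: the SGML catalog to use -->
        <arg value="-c"/>
        <arg value="${sgml-catalogs.input}"/>
        <arg value="${file}"/>
    </exec>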
    
    <concat destfile="${base.reports}/spam-report.txt" append="true">
        <filelist dir="." files="errspam.txt"/>
    </concat>
    
    <echo>Normalised ${local.sgm}</echo>
</target>

This simply calls ospam (essentially James Clark's spam [SP 1.3.4]) to bring everything into a single SGML file.

SGML to XML

Converting the SGML to XML with osx is just as trivial. This does it:

<target name="sx" depends="prepare" description="Convert file from SGML to XML">
    
    <!-- Output file is XML, not SGML, so change file suffix -->
    <local name="local.xml"/>
    <propertyregex
        property="local.xml"
        input="${file}"
        regexp="${regex.filename}\.sgm"
        replace="\2.xml"
        global="true"/>
    
    <exec executable="${sp-loc.path.sx}" output="${base.sx}/${local.xml}">
        <arg value="-c"/>
        <arg value="${sgml-catalogs.output}"/>
        <arg value="-xndata"/>
        <arg value="-xnotation"/>
        <arg value="-xlower"/>
        <arg value="-xempty"/>
        <arg value="${file}"/>
    </exec>
    
    <concat destfile="${base.reports}/sx-report.txt" append="true">
        <filelist dir="." files="errsx.txt"/>
    </concat>
    
    <echo message="Converted ${file} to XML"/>
</target>

This calls osx (essentially the same as James Clark's sx syntax [SP 1.3.4]) to convert the SGML to XML. This is little more than rewriting the SGML using XML syntax with the help of an SGML declaration for XML. I really like OpenSP (and James Clark's SP that it is based on), I have to say. The syntax is weird and the documentation not the clearest I have seen, but it's an incredibly reliable parser package.

Character Entities

Those of you who've done the move from SGML to XML will remember the many character entities used to add special characters. A character entity declaration looked like this:

<!ENTITY cularr SDATA "[cularr]"--/curvearrowleft A: left curved arrow -->

These were all found in files grouping related entities, with the file declared and invoked like so:

<!ENTITY % ISOamsa PUBLIC "ISO 8879-1986//ENTITIES Added Math Symbols Arrow Relations//EN">
%ISOamsa;

SDATA entities were done away with in XML, so the easy solution to map them to Unicode and UTF-8 was to replace the SDATA files with equivalent Unicode declarations:

<!ENTITY cularr "&#x21B6;"> <!-- ANTICLOCKWISE TOP SEMICIRCLE ARROW -->
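As a rough illustration of that mapping, the Python sketch below rewrites SDATA declarations into XML declarations with numeric character references. The lookup table `UNICODE_FOR` is a tiny hypothetical excerpt, not a complete ISO 8879 mapping, and the function name is made up for the example.

```python
import re

# Hypothetical excerpt of an entity-name-to-code-point table; a usable one
# would have to cover every entity in the ISO 8879 sets.
UNICODE_FOR = {"cularr": 0x21B6, "curarr": 0x21B7}

# <!ENTITY name SDATA "[name]" optionally followed by an SGML comment, then '>'
SDATA_RE = re.compile(
    r'<!ENTITY\s+([\w.-]+)\s+SDATA\s+"\[[^\]]*\]"\s*(?:--.*?--\s*)?>',
    re.S,
)


def sdata_to_xml(declarations):
    """Replace SDATA entity declarations with XML character references."""

    def fix(match):
        name = match.group(1)
        if name not in UNICODE_FOR:
            # Leave unmapped entities as they are so they can be reviewed.
            return match.group(0)
        return '<!ENTITY {} "&#x{:04X};">'.format(name, UNICODE_FOR[name])

    return SDATA_RE.sub(fix, declarations)


SGML_DECL = (
    '<!ENTITY cularr SDATA "[cularr]"'
    "--/curvearrowleft A: left curved arrow -->"
)
xml_decl = sdata_to_xml(SGML_DECL)
```

Leaving unmapped declarations untouched is a deliberate choice in this sketch: anything still saying SDATA after the run is a gap in the table rather than a silent loss.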

DocBook used to include most ISO character entity files when they still used SGML DTDs, and so to handle the SGML to XML character entity conversion for the PoC, I added every single one in a text file like so:

<!-- This maps SGML character entities to their Unicode equivalents -->
<!-- Based on DocBook 4.1.2 XML entities --><!-- ISO 8879 official entity sets -->
<!ENTITY % iso-amsa PUBLIC "ISO 8879:1986//ENTITIES Added Math Symbols: Arrow Relations//EN" "xml-entities/iso-amsa.ent">
%iso-amsa;
<!ENTITY % iso-amsb PUBLIC "ISO 8879:1986//ENTITIES Added Math Symbols: Binary Operators//EN" "xml-entities/iso-amsb.ent">
%iso-amsb;
<!ENTITY % iso-amsc PUBLIC "ISO 8879:1986//ENTITIES Added Math Symbols: Delimiters//EN" "xml-entities/iso-amsc.ent">
%iso-amsc;
...

I then added a target to my Ant script that added the text file contents to the DOCTYPE internal subset of the XML-syntax ATA.
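What such a target has to do can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Ant target: the function name is hypothetical, and the string handling assumes a single DOCTYPE declaration with no `[` or `>` characters inside its public or system identifiers.

```python
import xml.etree.ElementTree as ET


def add_to_internal_subset(xml_text, declarations):
    """Insert declarations at the start of the DOCTYPE internal subset.

    If the DOCTYPE has no internal subset yet, one is created.
    """
    start = xml_text.find("<!DOCTYPE")
    if start == -1:
        raise ValueError("no DOCTYPE declaration found")
    close_angle = xml_text.find(">", start)
    if close_angle == -1:
        raise ValueError("unterminated DOCTYPE declaration")
    open_bracket = xml_text.find("[", start)
    if open_bracket != -1 and open_bracket < close_angle:
        # Existing internal subset: insert right after the opening bracket.
        pos = open_bracket + 1
        return xml_text[:pos] + "\n" + declarations + xml_text[pos:]
    # No internal subset: add one just before the closing '>'.
    return (
        xml_text[:close_angle].rstrip()
        + " [\n" + declarations + "\n]"
        + xml_text[close_angle:]
    )


DOC = '<!DOCTYPE em [\n<!NOTATION cgm SYSTEM "cgm">\n]>\n<em>&cularr;</em>'
patched = add_to_internal_subset(DOC, '<!ENTITY cularr "&#x21B6;">')

# With the declaration in place, an XML parser can resolve the reference.
root = ET.fromstring(patched)
```

Inserting at the start of the subset, rather than the end, means the added declarations come before anything the converter already wrote there.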

Graphic Entities

Like Siemens, I didn't particularly want to deal with entities and decided to add @href markup inline instead. This, again, was trivial. I used an XSLT stylesheet and unparsed-entity-uri() to resolve the graphic entity declarations to get the SYSTEM identifiers, and added base URIs to them. The whole thing was an Ant target calling Saxon.
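The same idea can be sketched outside XSLT. The Python below is a stand-in for illustration, not the stylesheet described above: since the standard library parser does not expose unparsed entities, it reads the declarations with a regular expression instead of unparsed-entity-uri(). The attribute name `gnbr`, the element name `sheet`, and the base URI are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Matches <!ENTITY name SYSTEM "file" NDATA notation> in the internal subset.
ENTITY_RE = re.compile(
    r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]+)"\s+NDATA\s+\w+\s*>'
)


def resolve_graphic_entities(xml_text, base_uri, entity_attr="gnbr"):
    """Replace ENTITY-type attributes with href attributes.

    entity_attr is the (assumed) name of the attribute holding the entity
    name; the SYSTEM identifier is resolved against base_uri.
    """
    entities = dict(ENTITY_RE.findall(xml_text))
    root = ET.fromstring(xml_text)  # the DOCTYPE is dropped on parsing
    for elem in root.iter():
        name = elem.get(entity_attr)
        if name in entities:
            del elem.attrib[entity_attr]
            elem.set("href", urljoin(base_uri, entities[name]))
    return root


DOC = (
    "<!DOCTYPE graphic [\n"
    '<!NOTATION cgm SYSTEM "cgm">\n'
    '<!ENTITY fig1 SYSTEM "fig1.cgm" NDATA cgm>\n'
    "]>\n"
    '<graphic><sheet gnbr="fig1"/></graphic>'
)
resolved = resolve_graphic_entities(DOC, "file:///data/graphics/")
```

After this step the content no longer depends on the DOCTYPE for its graphics, which is what makes the later topic breakdown straightforward.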

ATA XML to Normalised DITA XML

I'll freely admit that the idea of converting everything to DITA originated from me being slightly contrarian. You want topics? I'll give you topics! Having said that, the DITA stylesheets we tend to use for other customers are quite pretty, and I wanted to take advantage of that.

The ATA to DITA conversion isn't trivial, but it also doesn't have to be perfect — this is for a proof of concept, not a production-setting conversion — so various ATA inline elements for things like part numbers and such I could simply convert to DITA ph (phrase) elements. Similarly, simple div wrappers were enough for block-level constructs with no obvious equivalent.

There were plenty of decisions to make and lots of places where things could go wrong, so in the interest of speedy development and easy refactoring, I decided to go with an XProc pipeline running XSLT stylesheets in sequence [Pipelined XSLT Transformations]. I made sure to write XSpec [XSpec] tests for every single XSLT, again to speed up development, which saved me more than once^[17].

My Engine Manual pipeline ended at 28 steps:

xslt/em/ATA2DITA_main-structure.xsl
xslt/em/ATA2DITA_tasks.xsl
xslt/em/ATA2DITA_front-matter.xsl
xslt/em/ATA2DITA_tfmatr.xsl
xslt/em/ATA2DITA_prclists.xsl
xslt/em/ATA2DITA_figtopic.xsl
xslt/common/ATA2DITA_effectivity.xsl
xslt/em/ATA2DITA_chgdesc.xsl
xslt/em/ATA2DITA_dates.xsl
xslt/common/ATA2DITA_lxlists.xsl
xslt/common/ATA2DITA_table.xsl
xslt/common/ATA2DITA_lists.xsl
xslt/common/ATA2DITA_block-level.xsl
xslt/common/ATA2DITA_graphics.xsl
xslt/em/ATA2DITA_delete-ind.xsl
xslt/common/ATA2DITA_inline.xsl
xslt/common/ATA2DITA_xref.xsl
xslt/common/ATA2DITA_ata-inline.xsl
xslt/em/ATA2DITA_topic-ids.xsl
xslt/common/ATA2DITA_mtoss.xsl
xslt/em/ATA2DITA_misc-amattrs.xsl
xslt/common/ATA2DITA_revmarkers.xsl
xslt/common/ATA2DITA_attrs.xsl
xslt/common/ATA2DITA_id-href-consistency.xsl
xslt/common/ATA2DITA_ref-target.xsl
xslt/common/ATA2DITA_base-attrs.xsl
xslt/common/ATA2DITA_move-data-about.xsl
xslt/common/ATA2DITA_cleanup.xsl

Note the common XSLTs listed; these were also used in converting the Service Bulletins in 24 steps:

xslt/sb/SB-ATA2DITA_main-structure.xsl
xslt/sb/SB-ATA2DITA_add-topics.xsl
xslt/sb/SB-ATA2DITA_add-sections.xsl
xslt/sb/SB-ATA2DITA_legal-ntc.xsl
xslt/common/ATA2DITA_effectivity.xsl
xslt/common/ATA2DITA_lxlists.xsl
xslt/common/ATA2DITA_table.xsl
xslt/common/ATA2DITA_lists.xsl
xslt/common/ATA2DITA_block-level.xsl
xslt/common/ATA2DITA_graphics.xsl
xslt/common/ATA2DITA_inline.xsl
xslt/common/ATA2DITA_xref.xsl
xslt/common/ATA2DITA_ata-inline.xsl
xslt/sb/SB-ATA2DITA_sb-inline.xsl
xslt/sb/SB-ATA2DITA_topic-ids.xsl
xslt/common/ATA2DITA_mtoss.xsl
xslt/sb/SB-ATA2DITA_misc-sbattrs.xsl
xslt/common/ATA2DITA_revmarkers.xsl
xslt/common/ATA2DITA_attrs.xsl
xslt/common/ATA2DITA_id-href-consistency.xsl
xslt/common/ATA2DITA_ref-target.xsl
xslt/common/ATA2DITA_base-attrs.xsl
xslt/common/ATA2DITA_move-data-about.xsl
xslt/common/ATA2DITA_cleanup.xsl

The EM pipeline took a few days to write; the SB pipeline was finished in one day. Pipelines pay off.
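The reuse pattern behind those two lists can be shown with a small Python sketch. It is only an analogy to the XProc pipeline: each step is a plain function standing in for one XSLT stylesheet, the step contents and element names (`para`, `pnr`, `task`, `sb`, `chg`) are placeholders, and the common steps are shared between two pipelines in the same way.

```python
import xml.etree.ElementTree as ET


def rename(old, new):
    """Build a step that renames every <old> element to <new>."""
    def step(root):
        for elem in root.iter(old):
            elem.tag = new
        return root
    return step


def drop_attribute(name):
    """Build a step that removes the named attribute everywhere."""
    def step(root):
        for elem in root.iter():
            elem.attrib.pop(name, None)
        return root
    return step


# Steps shared by both document types, plus type-specific ones around them.
COMMON_STEPS = [rename("para", "p"), rename("pnr", "ph")]
EM_STEPS = [rename("task", "topic")] + COMMON_STEPS + [drop_attribute("chg")]
SB_STEPS = [rename("sb", "topic")] + COMMON_STEPS


def run_pipeline(root, steps):
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        root = step(root)
    return root


source = ET.fromstring('<task chg="R"><para>Remove <pnr>123</pnr></para></task>')
result = ET.tostring(run_pipeline(source, EM_STEPS), encoding="unicode")
```

Because every step takes a tree and returns a tree, each one can be tested in isolation, and a second pipeline is mostly a matter of listing existing steps in a new order.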

ATA Topics?

The ATA EM DTD is seemingly well equipped for a topic-based approach, with pageblocks stating where topics should be. This, I guess, is what the Siemens sales team recognised and successfully sold to the customer. The entire structure above the pageblocks is a systems-oriented book skeleton as defined by the ATA iSpec specs [ATA iSpec 100].

The pageblock isn't even defined in the ATA numbering system, however. Here's an example of the Task Oriented Support System Numbering:

The pageblock wrappers in the DTD are a remnant of page-oriented documentation; they are inserted into binders on a level suitable for grouping functional tasks. A complex product may have any number of tasks and subtasks inside a subject, with each task being grouped inside pageblocks to fit a functional code (see above), but you can also use it to group any variant of the tasks.

What a task actually looks like is very much up to the author and the context, so the topic breakdown is not as clear-cut as the initial sales pitch would have us believe. It would seem that a more suitable topic breakdown is just as dynamic as in DITA.

Notice how the content model mixes subtasks and block-level elements, and how it allows the latter to occur after a subtask — this is not a section hierarchy per se. Thankfully, the only block-level elements to occur between subtasks in the PoC SGML are graphic elements:

<subtask>...</subtask>
<subtask>...</subtask>
<graphic>...</graphic>
<subtask>...</subtask>
...

The graphic element is a wrapper for one or more links to the actual images and again very much page-oriented; ATA EM graphics tend to illustrate an entire task or subtask, the idea being to insert the graphics separately in a binder, before or after the task or subtask. For this conversion, I decided to label graphic elements as topics, alongside tasks and subtasks.

My idea was to first convert to a kind of normalised DITA, a format where everything is in a single file, like this:

<bookmap
    spl="07482"
    model="PRODUCT">

    <title>Engine Manual</title>
    
    <bookmeta role="generated">
        ...
    </bookmeta>
    
    <frontmatter role="mfmatr">
        <topicref
            href="#trlist">
            <topic>
                ...
            </topic>
        </topicref>
    </frontmatter>
    
    <chapter
        props="chapnbr(05)"
        navtitle="LIFE LIMITS">
        <topicmeta role="chapter">
            <navtitle role="title">LIFE LIMITS</navtitle>
        </topicmeta>
        
        <topicref 
            props="chapnbr(05) sectnbr(00)"
            navtitle="TIME LIMITS/MAINTENANCE CHECKS - GENERAL">
            <topicmeta role="section">
                ...
            </topicmeta>
            
            <topicref
                props="chapnbr(05) sectnbr(00) subjnbr(00)"
                navtitle="AIRWORTHINESS LIMITATIONS AND ENGINE SCHEDULING AND INSPECTION INFORMATION">
                <topicmeta role="subject">
                    ...
                </topicmeta>
                
                <topicref
                    props="chapnbr(05) sectnbr(00) subjnbr(00)"
                    navtitle="AIRWORTHINESS LIMITATIONS AND ENGINE SCHEDULING AND INSPECTION INFORMATION"
                    pgblknbr="00">
                    <topicmeta role="pgblk">
                        <navtitle role="title">AIRWORTHINESS LIMITATIONS AND ENGINE SCHEDULING AND INSPECTION INFORMATION</navtitle>
                        <critdates>
                            <revised date="20161031"/>
                        </critdates>
                        <metadata>
                            <data-about role="effect">
                                <data name="title">effect</data>
                                <data name="effrg">ALL</data>
                                <data name="efftext">ALL</data>
                            </data-about>
                        </metadata>
                    </topicmeta>
                    
                    
                    <topicref
                        props="chapnbr(05) sectnbr(00) subjnbr(00) func(870) seq(801) confltr(NA) varnbr(0)"
                        navtitle="Airworthiness Limitations General Description and Operation"
                        pgblknbr="00">
                        <topicmeta>
                            <navtitle role="title">Airworthiness Limitations General Description and Operation</navtitle>
                        </topicmeta>
                        
                        <topicref 
                            href="#tk05-00-00-870-801-001">
                            <topic>
                                <title>General.</title>
                                ...
                            </topic>
                        </topicref>
                        <topicref
                            role="generated"
                            href="#tk05-00-00-870-801-002">
                            <topic>
                                <title>Engine Parts Life Limits.</title>
                                ...
                            </topic>
                        </topicref>
                        
                        <topicref
                            href="#tk05-00-00-870-801-003">
                            <topic>
                                <title>Control System.</title>
                                ...
                            </topic>
                        </topicref>
                    </topicref>
                </topicref>
            </topicref>
        </topicref>
    </chapter>
</bookmap>

There are no DITA topics anywhere before the pageblock level^[18], just nested topicref elements to define the chapter, section, and subject ATA structure. The content in an ATA Engine Manual is written from pageblock level and down.

ATA has a number of attributes to identify the system breakdown as specified in both the iSpec 100 and the iSpec 2200. As seen above, I chose to convert those to a DITA-style list of @props tokens:

props="chapnbr(05) sectnbr(00) subjnbr(00) func(870) seq(801) confltr(NA) varnbr(0)"

For those of you not into DITA, this is a DITA notation listing properties and their values: chapnbr="05", sectnbr="00", etc. Here, it is a useful mechanism to preserve the ATA attributes and their values^[19].
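To make the notation concrete, here is a minimal Python sketch that splits such a string into name/value pairs and builds it again. It is only an illustration, not part of the conversion pipeline described here; the function names parse_props and format_props are made up for the example, and it assumes that no value contains whitespace or parentheses.

```python
import re

# Matches one "name(value)" token, for example "chapnbr(05)".
# Assumption: values contain neither whitespace nor parentheses.
TOKEN = re.compile(r"([A-Za-z_][\w.-]*)\(([^()]*)\)")


def parse_props(props):
    """Split a props string into a name -> value dict (hypothetical helper)."""
    result = {}
    for token in props.split():
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"not a name(value) token: {token!r}")
        name, value = match.groups()
        result[name] = value
    return result


def format_props(attrs):
    """Inverse of parse_props: rebuild the props string from name/value pairs."""
    return " ".join(f"{name}({value})" for name, value in attrs.items())


example = "chapnbr(05) sectnbr(00) subjnbr(00) func(870) seq(801) confltr(NA) varnbr(0)"
attrs = parse_props(example)
```

Since the values are kept as strings, leading zeros such as "05" survive the round trip, which matters if the ATA attributes are to be restored later.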

Pageblocks were too big to be DITA topics, so my first few XSLT pipeline steps examined the contents of tasks inside them. An ATA task without a subtask became a topic, like so:

<topicref navtitle="pageblock">
    <topicref navtitle="task" href="task.dita">
        <topic id="task">...</topic>
    </topicref>
</topicref>

If the task was split into subtasks, I'd use the subtasks as topics instead, wrapping ATA graphics on subtask level in topics, too:

<topicref navtitle="pageblock">
    <topicref navtitle="task">
        <topicref navtitle="subtask1" href="subtask1.dita">
            <topic id="subtask1">...</topic>
        </topicref>
        <topicref navtitle="graphic1" href="graphic1.dita">
            <topic id="graphic1">
                <title/>
                <body>
                    <fig>...</fig>
                </body>
            </topic>
        </topicref>
        <topicref navtitle="subtask2" href="subtask2.dita">
            <topic id="subtask2">...</topic>
        </topicref>
    </topicref>
</topicref>

Mostly, if the ATA SGML task had subtasks, there would be no content beyond titles on task level. In the few cases there were, I'd simply add a topic on task level:

<topicref navtitle="task" href="task-level.dita">
    <topic id="task-level">...</topic>
    <topicref navtitle="subtask1" href="subtask1.dita">
        <topic id="subtask1">...</topic>
    </topicref>
    ...
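The three cases above can be summarised in a short Python sketch. This is a reading of the rules as described, not the XSLT used in the PoC: the helper names are hypothetical, topic IDs are simply numbered by position, and copying the actual content into each topic is left out.

```python
import xml.etree.ElementTree as ET


def make_topicref(parent, navtitle, topic_id=None):
    """Append a topicref; with a topic_id it also gets an @href and an inline topic."""
    ref = ET.SubElement(parent, "topicref", navtitle=navtitle)
    if topic_id is not None:
        ref.set("href", f"{topic_id}.dita")
        ET.SubElement(ref, "topic", id=topic_id)  # content copying omitted
    return ref


def convert_task(task, parent, task_id):
    """Map one ATA task to nested topicrefs (hypothetical helper)."""
    children = list(task)
    subtasks = [c for c in children if c.tag == "subtask"]
    if not subtasks:
        # Case 1: a task without subtasks becomes a topic itself.
        return make_topicref(parent, "task", task_id)
    # Case 3: task-level content beyond the title gets its own topic.
    extra = [c for c in children if c.tag not in ("title", "subtask", "graphic")]
    ref = make_topicref(parent, "task", task_id if extra else None)
    # Case 2: subtasks and graphics become topics, in document order.
    counters = {"subtask": 0, "graphic": 0}
    for child in children:
        if child.tag in counters:
            counters[child.tag] += 1
            name = f"{child.tag}{counters[child.tag]}"
            make_topicref(ref, name, name)
    return ref


pageblock = ET.Element("topicref", navtitle="pageblock")
source = ET.fromstring(
    "<task><title>T</title><subtask/><subtask/><graphic/><subtask/></task>"
)
converted = convert_task(source, pageblock, "task")
```

Keeping the subtasks and graphics in document order preserves the interleaving allowed by the ATA content model.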

ATA front matters received a similar treatment. In addition to various metadata, they tend to contain lists of applicable service bulletins and temporary revisions, both of which are essentially lists well suited to be topics.

Topic Breakdown

The final breakdown of the normalised DITA into maps and topics is done with a single XSLT 3.0 stylesheet, run after the pipeline has completed and its results have been written to disk. Basically, the XSLT iterates through the XML, acting on every topicref containing an @href (that is, a topicref that actually references a topic rather than only indicating the structure through nesting).

For the most part, this is a trivial exercise.
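The idea can be sketched in a few lines of Python. This is not the XSLT 3.0 stylesheet itself, only an illustration of the same logic under simplifying assumptions: the function name split_normalised is hypothetical, each topicref with an @href is assumed to hold at most one inline topic as a direct child, and the detached topics are returned in a dict rather than written to disk.

```python
import xml.etree.ElementTree as ET


def split_normalised(root):
    """Detach inline topics from every topicref that carries an @href.

    Returns a dict mapping each href to its detached topic element.
    The tree under root is left behind as a plain map.
    """
    topics = {}
    # Collect the topicrefs first so the tree is not modified while iterating.
    for ref in list(root.iter("topicref")):
        href = ref.get("href")
        if href is None:
            continue  # structural topicref, only there for nesting
        topic = ref.find("topic")  # direct child only
        if topic is not None:
            ref.remove(topic)
            topics[href] = topic
    return topics


normalised = ET.fromstring(
    '<bookmap><chapter><topicref navtitle="pageblock">'
    '<topicref navtitle="task" href="task.dita"><topic id="task"/>'
    '<topicref navtitle="subtask1" href="subtask1.dita"><topic id="subtask1"/></topicref>'
    "</topicref></topicref></chapter></bookmap>"
)
topics = split_normalised(normalised)
```

Each entry in the returned dict could then be serialised to its own file with ET.ElementTree(topic).write(path), and what remains of the input tree serialised as the map.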

Note

You may ask why I didn't split the ATA XML into smaller parts first. The main reason is how the XProc XSLT pipeline works and how the output of each XSLT is serialised, which allows me to run the XSLTs in sequence, taking the output of one XSLT as the input to the next. Essentially, every output in the pipeline is on the step's secondary port, which is also what you'd use to grab result-document output in the XSLT.

DITA to HTML

The DITA to HTML conversion is an Ant script that runs a series of XSLT 1.0^[20] stylesheets using msxsl.exe^[21], and is added to Teamcenter as a zip package. I run it as a subant script, which essentially means that it inherits its input properties from my script and runs just like any other Ant target would, in the way and order I define. We changed very little in it, beyond tweaking the CSS to better match the customer's brand.

I have to say, once I had written the XSLT 3.0 and the pipeline, I was tempted to add an XSLT 2.0 stylesheet somewhere, just so I could claim to have used all three XSLT versions in a single project.

Was It A Good Idea?

What did I learn? Was it a good idea to push the ATA SGML into TC, break it down to DITA-sized topics, and then spit it out again for editing and publishing?

Honestly? No.

So please don't.

Breaking Down, Decomposing, Splitting

Teamcenter and many others like it can break down (split, decompose, pick your catchphrase) XML documents into smaller chunks and manage them in a database. They'll add code to handle those chunks, and they will either support a standard like DITA or S1000D rather than inventing something of their own, or they'll just market their product as do-all, end-all, and say they can do it with anything.

They can't.

Topic-based authoring, again, is all the rage in technical publishing, mostly because it's a good idea. It fits. But not every standard was developed with this in mind. ATA, mostly, wasn't. It lacks the standardisation of content into common denominators that DITA and S1000D both provide, which is why many aerospace manufacturers are now using them instead of ATA (while still keeping ATA's brilliant, system-oriented breakdown of functions rather than topics).

Let me offer you a simple example: The SGML Engine Manual document includes a number of ID/IDREF pairs. This, in a single SGML file, is fine. But as soon as you break down the Engine Manual document into topics, you risk invalidating every topic with an IDREF pointing at an ID that happens to be in a different topic. An Engine Manual is basically a book; it's designed to be a single, large unit. You may not have to read it from cover to cover, but its whole organisation is that of a book. And not every book-type dependency is something that a parser can catch.
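The parser-detectable part of the problem is easy to demonstrate. The Python sketch below lists references whose target ends up in a different topic once the document has been split. It is an illustration only: the attribute names id and refid and the function name dangling_refs are assumptions made for the example, not names taken from the ATA DTDs.

```python
import xml.etree.ElementTree as ET


def dangling_refs(topics, id_attr="id", ref_attr="refid"):
    """Return (topic name, target) pairs for references that leave their topic.

    topics maps a topic name to its root element. The attribute names are
    assumptions; substitute whatever ID/IDREF attributes your DTD declares.
    """
    problems = []
    for name, topic in topics.items():
        local_ids = {el.get(id_attr) for el in topic.iter() if el.get(id_attr)}
        for el in topic.iter():
            target = el.get(ref_attr)
            if target is not None and target not in local_ids:
                problems.append((name, target))
    return problems


split_topics = {
    "a": ET.fromstring('<topic id="a"><p id="p1"/><xref refid="p1"/><xref refid="p2"/></topic>'),
    "b": ET.fromstring('<topic id="b"><p id="p2"/></topic>'),
}
broken = dangling_refs(split_topics)
```

In the unsplit document both references resolve; after the split, the reference to p2 no longer has a target inside its own topic, so validating that topic on its own would fail.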

Unless you're prepared to move away from your book-based content paradigm in action as well as (sales pitch-induced) spirit, don't.

SGML in the 21st Century

Mind, that necessary move away from the book-oriented paradigm I was discussing is just that, a paradigm shift; no SGML is involved, just XML. XML is mature because everything it's implemented in is. There is no need for an SGML declaration to save memory by limiting attribute content, and there is a wealth of related standards and tools to support whatever you need to do. We're only discussing how you're authoring your content.

SGML predates all that. It came about when disk space was at a premium and a project to implement it would start with a six-month prestudy and another six months to start developing the necessary tool support. There was no SGML document model, no SGML transformation language, no wealth of ready-to-rock tools to choose from. There still isn't, because all that was developed later, for XML.

The tools today all focus on XML, not SGML. The very few that claim to support SGML (for example, I know of three production-quality SGML editors still around) no longer do any SGML-related development. If there's a bug related to SGML, you're out of luck.

So if a product claims to support SGML, it's either a leftover or hubris, or both. In many cases, such as with Teamcenter, the SGML support now involves a conversion to XML and back, and any processing is done in XML. So it's what everyone should do. SGML, as much as it hurts to admit it, is dead.

The aerospace industry may still be using ATA iSpec 2200, but it is my firm belief that the hidden costs of staying in SGML by far outweigh the cost of a proper XML migration project. Even moving to XML yourself and converting to SGML and back for those of your partners that insist on staying in SGML will be cheaper because it will be more controllable. See section “Postscript”.

My silly little SGML to DITA publishing pipeline added an extra XML conversion (TC handles everything in XML internally, remember, so when it exports SGML to my pipeline, it's been to XML and back already), which is just bizarre because it exposed all kinds of errors and bugs and issues stemming from that internal XML conversion.

So, SGML in the 21st century? Nah.

End Notes

Allow me to end with a few top tips and conclusions.

Don't promise your product has SGML support unless you've tested and know it to be true.
Graphic entities are dead. Why anyone would use them is beyond me. However, having said that, it's not hard to process them if you're leaving them behind.
While the ATA iSpec 2200 DTDs are very much page-oriented, redefining them in topic-oriented fashion — not necessarily DITA, mind — makes sense because I think it's a good idea for any technical documentation. The basic idea of TC's structure view is sound, even though their approach can be a tad simplistic.
I love SGML for its many weird features^[22] and because it was my introduction into markup, but the XML infrastructure — the many sister standards beginning with X, the tool support, the fact that people actually know about it — is awesome. You don't know how much you miss them until you try to do without.
Plus, I'd just like to say that I'd still like James Clark to sign the SP package for me. It's a great package.

Postscript

After the first draft of this paper was submitted, I was finally able to convince the involved parties to move ahead with an ATA-like XML approach instead. I wrote XML versions of the ATA DTDs^[23], tweaked the ATA SGML to DITA conversion pipeline to ignore the DITA bits and add a few steps to generate valid XML instances from the source SGML, and wrote a quick oXygen Author framework for the ATA-like XML.

I also wrote a pipeline that converted the XML back to ATA SGML for those cases when the end customer is required to deliver SGML to its partners and customers; this pipeline, just as the SGML to XML one, relies heavily on doing the prep work using a sequence of XSLT 3.0 stylesheets^[24], and then runs the tweaked XML through SP to generate valid SGML.

As I write this, the PoC is finally scheduled to start within the next week or two.

References

[DITA Version 1.3] OASIS Darwin Information Typing Architecture (DITA) TC, Darwin Information Typing Architecture (DITA) Version 1.3 Part 3: All-Inclusive Edition, http://docs.oasis-open.org/dita/dita/v1.3/os/part3-all-inclusive/dita-v1.3-os-part3-all-inclusive.html

[S1000D] S1000D, International specification for technical publications using a common source database, https://s1000d.org/

[ATA] Airlines for America (formerly ATA), https://www.airlines.org/#

[ATA iSpec 2200] ATA iSpec 2200, https://publications.airlines.org/CommerceProductDetail.aspx?Product=274

[ATA iSpec 100] ATA iSpec 100, https://publications.airlines.org/CommerceProductDetail.aspx?Product=33

[Teamcenter on Wikipedia] Teamcenter, https://en.wikipedia.org/wiki/Teamcenter

[Siemens Teamcenter] Siemens Teamcenter, https://www.plm.automation.siemens.com/global/en/products/teamcenter/

[RapidAuthor] Cortona3D RapidAuthor, https://www.cortona3d.com/en/rapidauthor

[OpenJade] OpenJade Distribution Page, also maintains OpenSP, http://openjade.sourceforge.net/

[SP 1.3.4] James Clark, SP, An SGML System Conforming to International Standard ISO 8879 - Standard Generalized Markup Language, http://www.jclark.com/sp/

[Pipelined XSLT Transformations] Nordström, Ari. “Pipelined XSLT Transformations.” Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Nordstrom01

[XSpec] XSpec Home, https://github.com/expath/xspec/wiki

[Marking up and marking down] Walsh, Norman. “Marking up and marking down.” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Walsh01

[SGML in the Age of XML] Harvey, Betty. “SGML in the Age of XML.” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Harvey01

^[1] The S1000D example documentation kit describes a bicycle.

^[2] ATA is producing XML-based standards, too, these days. SGML has a way of sticking around, though.

^[3] I studied some basic engineering before venturing into physics.

^[4] But in no way is it a guarantee for properly sized topics; the author can always include too much.

^[5] No, it's not about XML attributes at all, which is quite confusing for a markup person to begin with. Think of the term attribute as equivalent with a property.

^[6] Unfortunately mostly without the expressive power of XPath.

^[7] Pun intended.

^[8] Basically, applicability, what the content applies to.

^[9] ATA iSpec 100 chapters and sections, that is.

^[10] Nothing to do with attributes or XML, remember; these are topic breakdown rules for Teamcenter.

^[11] Essentially, there are only two or three production-quality editors capable of editing SGML available today. The customer was moving away from one, so our choices were limited. Not that XMetaL is a bad editor; it's not.

^[12] The full editor rather than RapidAuthor was necessary, because of the limitations posed by SGML.

^[13] No small feat since I don't do JavaScript. The time-honoured Copying and Pasting from Stack Overflow development method was my friend.

^[14] One of the things you might use to publish your SGML with, back in the day.

^[15] While Teamcenter represents SGML as XML internally, they've not provided a way to access that XML.

^[16] I did consider XProc 3.0 for all of the above, but the only available XProc 3.0 processor as I write this, Morgana XProc III, couldn't run OpenSP in my tests.

^[17] XSpec, in case you haven't looked into it, is a way to unit test your XSLT (or XQuery or Schematron) by defining the source and the expected output of a structure. It's brilliant.

^[18] Apart from the front matters, of course.

^[19] And if I had implemented DITA directly, I would have considered specialising the DTDs to include these attributes.

^[20] I told you; we already had this one, and you take what you get.

^[21] Microsoft's XSLT processor; I thought I had seen the last of it 15 years ago, but there you go.

^[22] See, for example, Norm Walsh's wonderful Balisage paper from 2016 [Marking up and marking down].

^[23] There are many papers out there about this sort of thing; I can highly recommend Betty Harvey's 2016 Balisage paper [SGML in the Age of XML].

^[24] Mostly to generate an SGML DOCTYPE declaration that contains ENTITY and NOTATION declarations, but also to convert processing instructions in the XML to SGML inclusion elements.

