Cuellar, Autumn, and Jason Aiken. “The Ugly Duckling No More: Using Page Layout Software to Format DITA Outputs.” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Cuellar01.
Autumn Cuellar has had a long and happy history with XML. Her first degree is in Biomedical
Engineering, which led to a role as a researcher at the University of Auckland in
New Zealand. There Autumn co-authored a metadata specification, explored the use of
ontologies for advancing biological research, and developed CellML, an XML language
for describing biological models. Since leaving the academic world, Autumn has been
delighted to share her enthusiasm for XML in technical and enterprise applications.
Previously at Design Science, her roles included MathML evangelism and working with
standards bodies to provide guidance for inclusion of MathML in such standards as
DITA and PDF/UA. Now at Quark Software, Autumn provides her XML expertise to organizations
seeking to hide the XML for a better non-technical user experience.
Jason Aiken manages Quark Enterprise Solutions, a platform for content automation that
streamlines the entire lifecycle of high-value content – from creation to delivery. He
coordinates with strategic partners and product engineering to help clients across
financial services, manufacturing, life sciences and government reinvent and modernize
their content strategies. With two decades of experience in technical publishing and
content management, including products and services for aerospace and biomedical devices,
Jason consistently advocates for technical solutions that improve user experience and
simplify business processes. Jason has an MS in IT System Design & Programming from
Capella University.
DITA is growing in popularity as a document standard and is now being used across a
range of industries. As DITA grows beyond the scope of technical publications and as
businesses become more concerned about branding documents across the organization, the
current methods of coding templates to format DITA output are no longer sufficient for
document production. We'll explore using page layout software to design complex, visually
rich templates for DITA and other XML document formats.
Many organizations around the world are automating their production of business-critical
content with great success. Much of the creation process can be automated by pulling
from
external sources such as stock databases, geolocation systems, and statistical analysis
reports. Translation memory databases are growing in popularity as a method for helping
automate localization of content. Publication and delivery of documents can often
be performed
without human intervention. Using advanced template structures, document assemblies
can be
pre-approved and generated at the push of a button with just-in-time resolution of
content.
DITA, an OASIS XML standard for documents and best practices, has helped pave the
way for
content automation. DITA supports foreign content, enabling the inclusion of data
from outside
sources, and its specialization architecture allows publishing channels to be built
on or
customized from existing publishing systems.
The initial application area of DITA was computer software documentation at IBM. Up
until
fairly recently, DITA remained, for the most part, in the realm of technical content[1]. However, content producers of all kinds are now finding DITA to be a useful
format for a wide range of applications. DITA is being used at universities, petroleum
companies (Chevron, Schlumberger), non-profit organizations (FamilySearch, HealthWise),
consortiums (World Agroforestry Centre) [Schengili-Roberts 2012], financial
services organizations (Mastercard[2]), and a number of non-technical publishing companies.
The main hurdle for the adoption of DITA in non-technical applications has been the
technical nature of DITA and the associated Open Toolkit, used for converting DITA XML to
other formats. One writer notes (emphasis mine), “DITA for non-technical writers is very
much a real option, with some planning and tweaking of tools and
workflows” [Samuels 2014]. However, the required planning and
tweaking can be a significant obstacle for resource-strapped organizations.
Among the difficulties facing non-technical content producers using DITA, perhaps
the most
challenging is the design of output layouts. In a recent survey conducted by SyncroSoft,
a
large number of respondents cited PDF customization as their biggest frustration in
working
with DITA [Coravu 2016]. As Hans Christian Andersen highlights in his acclaimed
1843 fairy tale “The Ugly Duckling”, some hatchlings are perceived very differently.
This
paper describes how page layout software can be used by non-technical designers to
add complex
design and organization to DITA hatchlings.
A Brief History of Page Layout
Layout design has for centuries been a visual, manual process. Books produced in
monasteries in medieval times generally featured a central block of text, surrounded
by an
artist’s ornamental design, or illumination. Even to this day, through the invention
of the
printing press and later computers and printers, page layouts are sometimes modeled
on these
early manuscript layouts [Novin 2010].
As soon as the graphics capabilities of computers could support it, layout design
moved to
the territory of software. High quality page production was opened to the masses through
WYSIWYG applications ranging from word processors to desktop publishing software.
One of the
earliest desktop publishing programs was PageMaker, originally produced by Aldus and
later
acquired by Adobe. PageMaker made it possible for designers to quickly compose text
and images
in eye-catching layouts and then send those layouts to printers.
Computers also enable the automation of publishing, but in order to fulfill this promise,
page design concepts had to be translated to a programming language to support precise
replication of a design. To this end, languages such as TeX and troff were created
early on,
even prior to WYSIWYG design software. As various digital document formats have emerged,
so
too has stylesheet support for these formats, allowing templates for design elements
such as
paragraph and line spacing, font families, and colors to be applied uniformly to documents.
Two stylesheet languages in particular are used frequently for providing templates
for DITA
outputs: Cascading Style Sheets (CSS) and XSL Formatting Objects (XSL-FO).
The Current Landscape for PDF Output of DITA
The DITA Open Toolkit (DITA-OT) is maintained separately from the DITA specification; it
is an open source toolset for converting DITA to a variety of other formats including
PDF and
HTML, the most popular output formats for DITA. As most of the popular DITA outputs
are
XML-based and not difficult to produce, the rest of this paper will focus on PDF output,
which
gives users of DITA the most headaches. Print continues to be an important delivery
channel.
Many organizations still rely on PDF for pre-press printing. Additionally, PDF is
a convenient
and simple distribution channel for branded layout of longer documents intended for
anyone to
print. For these reasons and others, PDF garnered the top spot as respondents' most
important
output format for DITA in the previously mentioned SyncroSoft survey [Coravu 2016].
For PDF output, the DITA-OT uses XSL-FO as an intermediate step. As the Open Toolkit
is
free and an active open source project, the DITA-OT is widely used for producing DITA
outputs,
especially as it is built in to many applications offering DITA support. Therefore,
XSL-FO is
the primary path by which PDF output is achieved. However, some tools use CSS for
templating, and others use a proprietary approach. Finally, some DITA implementers
have chosen
to convert DITA to HTML or Word as the intermediate step before publishing PDF output.
XSL-FO
Out of the box, the DITA-OT is set up to use the Apache Formatting Objects Processor
(FOP) publishing engine but can be configured to use the Antenna House Formatter or
the
RenderX XEP engine for producing PDF. The advantages of using XSL-FO for PDF output
are
three-fold: the DITA-OT is already set up to produce XSL-FO, an XSL-FO formatting
engine
(Apache FOP) is freely available, and XSL-FO is intended for paginated outputs and
can
reliably handle more complex layouts than CSS. These advantages are hard to overstate.
With almost no work required, resource-strapped organizations can be up and running with
PDF output quickly, and basic customization can be performed by modifying the XSLT files
that ship with the DITA-OT.
However, once an organization's needs go beyond basic customization of the DITA-OT,
the costs in time and money of working with XSL-FO increase dramatically. WYSIWYG XSL-FO
software typically offers only basic functionality. Therefore, in most cases, customizing
PDF output requires a skilled developer, preferably one who knows XSL-FO, and such
developers are not common.
Furthermore, while DITA promises interoperability and the DITA-OT offers much faster
ramp-up time than starting from scratch, differences in how the various rendering
engines
support the XSL-FO specification also require consideration. After investing in format
development for FOP, for example, significant testing and refactoring is required
when
switching to another engine for PDF production. Small differences in rendering output
matter
to demanding enterprise customers who must meet specific business requirements for
complex
and engaging layouts. For these reasons, the maintenance costs of XSL-FO can be high.
CSS
CSS is also designed for formatting and styling of content, and because CSS is widely
known and easy to use, some DITA implementers have chosen to rely on CSS instead of
XSL-FO.
SyncroSoft, the makers of the <oXygen/> XML editor, have developed an open source
DITA-OT plug-in[3] that can convert DITA to PDF using CSS and either Prince XML or Antenna House
Formatter, which can handle CSS as well as XSL-FO. Using a similar idea, some implementers
first convert DITA to HTML/XHTML and then generate the PDF from the HTML using one
of
several applications available for this purpose, such as Prince XML.
The problem with CSS is that it was originally designed for web pages, for which
pagination is not a priority. CSS2 does not have support for a number of features
that
XSL-FO supports, including multi-column layouts, items in margins such as footers
and
headers, page numbering, and cross-referencing particular page numbers. CSS3 introduced
a
Paged Media Module to help address some of these problems but not all [Harold & Means 2002]. Additionally, not all CSS formatting tools support the Paged
Media Module. Depending on how complex the requirements are, CSS may not be
sufficient.
Other Paths to PDF Output
Alternatives to XSL-FO and CSS do exist but are used less frequently. Some
implementations convert DITA to another intermediate format such as Microsoft Word
before publishing the document to PDF. The drawback here is that there are now two
transformation processes to manage and two processes during which artifacts may be
lost.
There are a few commercial PDF renderers on the market that do not rely on XSL-FO or CSS
for formatting, including TopLeaf XML Publisher and Adobe FrameMaker. Both provide a
WYSIWYG interface for designing the page layout of the output PDF, but for both, this
is a
secondary goal, and, therefore, design functionality is neither comprehensive nor
particularly easy to master. TopLeaf is built around XML; to customize DITA templates
in
TopLeaf, the designer must have some knowledge of DITA. FrameMaker is targeted to
technical
content, and as one blogger notes in comparing FrameMaker to Adobe’s page layout application
InDesign, “The key question here is: How important is great, typographically-sophisticated,
cool-looking, creative design to communicating technical information?” [Gold 2013]. The answer is “not very,” which is why Adobe maintains two different
products and why designers seeking “cool-looking, creative design” functionality do not
turn to FrameMaker.
Handing DITA Output Design Back to Designers
Given the tools currently available, most design of DITA outputs is performed in code, by
modifying XSLT or CSS files to produce the correct look for a set of documents. This is
not ideal for a visual process dating back hundreds of years, but it has been a tolerable
state of affairs because DITA, until recently, has been used primarily for technical
content, and technical content has traditionally not required particularly creative
design. However, two trends are changing the landscape. First, DITA’s popularity is
growing for non-technical content produced by non-technical contributors, resulting in an
increased demand for WYSIWYG design tools. Second, branding and user experience are
becoming important priorities for businesses [Goodson 2012].
Branding touches all aspects of a business, including its technical publications,
and user
experience includes design. Incorporating high quality design into branding efforts
creates a
competitive advantage that businesses are using with success.
Complex page layout design is already available in desktop publishing software. Most
rich-layout publications such as magazines and catalogs are built in InDesign or QuarkXPress,
two tools that have carried on PageMaker’s legacy. Since today’s content automation
world is
built on XML, it shouldn’t be surprising that both InDesign and QuarkXPress have support
for
XML. This is our path forward for handing DITA layout design back to designers for
producing
complex, beautiful layouts that can be published to PDF and other outputs. As the authors
are familiar with QuarkXPress, the process will be described using Quark software, but
a similar process can be applied using Adobe’s InDesign and InDesign Server.
The Process with Page Layout Software
QuarkXPress allows designers to place any number of design elements in a page layout
in
containers called Boxes. All Boxes in a layout have an associated unique identifier;
the
designer has the option to attach an easily recognizable name to the identifier.
Additionally, a QuarkXPress project can include a number of other named variables
and design
elements. Variables can be used for static content, such as a copyright statement,
or
dynamic content, such as the publication date. All style preferences can be set in
the
project and named, from color palettes to table and list styles.
Beyond letting designers set and name design and style preferences, a QuarkXPress project
can be represented almost entirely as XML. QuarkXPress’ XML doctype is known as Modifier.
Modifier can be used to create or delete Boxes, change the properties of Boxes, such as
shape or position, change the content of Boxes, change the style of the content of Boxes,
and so on.
Putting the Modifier and a QuarkXPress project together is where the magic happens.
The
QuarkXPress project provides the template that guides the Modifier. For example, the
QuarkXPress project might include a Box with the name of “Title” and two character
styles,
one with the name of “Main Title” setting the font size to 48pt and the other named
“Subtitle” setting the font size to 36pt. The Modifier will then specify the strings
to
write to the “Title” Box with instructions of when to use the “Main Title” character
style
and when to use the “Subtitle” character style.
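As a rough illustration of the “Title” example above, the following Python sketch builds a small Modifier-like instruction document. The element and attribute names (`PROJECT`, `BOX`, `RICHTEXT`, `NAME`, `CHARSTYLE`) are hypothetical placeholders chosen for readability and are not taken from the actual Modifier schema; only the Box name and the two character style names come from the example.

```python
import xml.etree.ElementTree as ET


def build_title_modifier(main_title, subtitle):
    """Return an illustrative Modifier-like XML string targeting the "Title" Box."""
    # Hypothetical element and attribute names, not the real Modifier schema.
    project = ET.Element("PROJECT")
    box = ET.SubElement(project, "BOX", {"NAME": "Title"})
    # Each text run names the character style that the template already defines.
    for text, style in ((main_title, "Main Title"), (subtitle, "Subtitle")):
        run = ET.SubElement(box, "RICHTEXT", {"CHARSTYLE": style})
        run.text = text
    return ET.tostring(project, encoding="unicode")


xml_out = build_title_modifier("The Ugly Duckling No More", "Formatting DITA Outputs")
print(xml_out)
```

The sketch carries no font sizes: the 48pt and 36pt settings live in the template, and the XML only refers to the styles by name.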
Because the Modifier schema is so closely related to the QuarkXPress project and because
QuarkXPress is a mature design package, the use of Modifier with QuarkXPress enables
organizations to create complex sets of documents. A project might consist of different
layouts for different targets or various page designs for parts of chapters or articles
(for
example, first and last pages may use different design elements than middle pages). All
of this remains under the designer’s control because the QuarkXPress templates dictate the
boundaries within which the Modifier operates.
The engine that puts the Modifier and QuarkXPress project together and converts the
XML
to a new output format is the QuarkXPress Server. QuarkXPress Server can be used to
automate
conversion of large volumes of documents to a variety of different output formats
including
PDF and HTML. All that’s left in the pipeline is mapping DITA to Modifier, and given
that
both are XML languages for describing documents, this is a straightforward XSLT conversion.
This conversion process is made even easier if implemented as a DITA-OT plug-in to
leverage
the DITA-OT's ability to process DITA maps, links, and references.
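The mapping step can be pictured with a short sketch. The paper describes an XSLT conversion; the Python below is only a stand-in for that idea, using the standard library's ElementTree. The mapping table, the “Body” Box, the “Body Text” style, and the output element names are all assumptions made for illustration, not the real Modifier vocabulary.

```python
import xml.etree.ElementTree as ET

DITA_TOPIC = """<topic id="duckling">
  <title>The Ugly Duckling No More</title>
  <shortdesc>Formatting DITA outputs with page layout software.</shortdesc>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</topic>"""

# Hypothetical mapping: DITA element name -> (Box name, style name) in the template.
MAPPING = {
    "title": ("Title", "Main Title"),
    "shortdesc": ("Title", "Subtitle"),
    "p": ("Body", "Body Text"),
}


def dita_to_modifier(dita_xml):
    """Map a simple DITA topic onto an illustrative Modifier-like tree."""
    topic = ET.fromstring(dita_xml)
    project = ET.Element("PROJECT")
    boxes = {}
    for elem in topic.iter():  # document order
        target = MAPPING.get(elem.tag)
        if target is None:
            continue  # unmapped elements are skipped in this sketch
        box_name, style = target
        if box_name not in boxes:
            boxes[box_name] = ET.SubElement(project, "BOX", {"NAME": box_name})
        run = ET.SubElement(boxes[box_name], "RICHTEXT", {"STYLE": style})
        run.text = "".join(elem.itertext()).strip()
    return project


project = dita_to_modifier(DITA_TOPIC)
print(ET.tostring(project, encoding="unicode"))
```

A real conversion would also need to handle maps, links, and references, which is the work the DITA-OT preprocessing is said to take care of.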
Analysis of Approach
As we’ve seen, several different paths can be used for formatting PDF output of DITA
content – each has its advantages and drawbacks. Let’s highlight some of the strengths
and
weaknesses of using page layout software.
Because there needs to be a link between the XML and the design project, designers
will
either need to stay within the confines of the design project or know enough about
the
implementation to design around it. Following on the above example, if a designer
creates a
new design template, s/he will need to know that the project must have a Box with
the name
of “Title” or the title of the document will not appear in the output, or if the designer
is
modifying an existing template, s/he will need to know that the Box with the name
of “Title”
can be modified but not deleted. This might restrict how a layout designer normally
works.
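One way to soften this restriction is to check a template against the Box names the XML mapping expects before it goes into production. The sketch below is a minimal example of such a check; the Box names and the idea of obtaining them as plain lists are assumptions, since the paper does not describe how a project's Boxes would be enumerated.

```python
def missing_boxes(template_box_names, required_box_names):
    """Return required Box names that the template does not define, sorted."""
    return sorted(set(required_box_names) - set(template_box_names))


# Hypothetical data: Boxes the XML mapping writes to, and Boxes a designer named.
REQUIRED = {"Title", "Body"}
template = ["Title", "Sidebar", "Footer"]

problems = missing_boxes(template, REQUIRED)
print(problems)
```

An empty result means the designer is free to restyle or add Boxes without breaking the link between the XML and the layout.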
On the other hand, page layout software does provide a powerful mechanism for designers
to add pizazz to DITA outputs through a WYSIWYG interface. Layout design is largely
a visual
process that depends on seeing how elements of a layout relate to other design elements
on
the page, and since InDesign and QuarkXPress are both mature applications, they have
extensive functionality for making this process easier for designers, from providing
color
pickers for matching colors to Bezier pen tools for creating interesting shapes.
Additionally, in certain areas, functionality of page layout software goes where XSL-FO
cannot, such as with running text along odd shapes and curves.
Finally, high quality desktop publishing systems are commercial applications – this
can
be either a strength or weakness depending on your point of view. Some organizations
will
not want to spend the money on upfront software costs and instead prefer to use their
own
development resources to build on open source applications like Apache FOP. Others
prefer to
invest in tested, supported commercial products.
Advantages and Challenges of Supporting DITA
The main advantage of supporting DITA (beyond its widespread adoption) is the existence
of the DITA-OT. Thanks to a large and active open source community, the DITA-OT is
already
set up to process large and complex DITA documents. In preprocessing steps the DITA-OT
handles such tasks as validating the XML, applying filters, resolving references,
and moving
metadata. The DITA-OT is then able to pass an intermediate, simplified DITA file to
an
external rendering process, such as the QuarkXPress Server.
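To give a feel for one of these preprocessing tasks, here is a simplified sketch of conditional filtering on the `audience` attribute. It is not the DITA-OT implementation; it only mirrors the general DITA rule that an element is excluded when every value of a filtered attribute is excluded. The sample content and function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

SOURCE = """<body>
  <p audience="expert">Tune the engine settings.</p>
  <p audience="novice expert">Open the template.</p>
  <p>Publish the document.</p>
</body>"""


def filter_by_audience(root, excluded):
    """Drop elements whose every audience value is in `excluded` (in place)."""
    for parent in list(root.iter()):
        for child in list(parent):
            values = child.get("audience", "").split()
            if values and all(v in excluded for v in values):
                # Note: ElementTree also discards the removed element's tail text.
                parent.remove(child)
    return root


body = filter_by_audience(ET.fromstring(SOURCE), {"expert"})
remaining = [p.text for p in body.findall("p")]
print(remaining)
```

Because filtering of this kind happens before rendering, the layout engine only ever sees the content that should appear in the output.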
The primary challenge of supporting DITA is its sheer breadth. The All-Inclusive DITA
1.3 Specification, which includes the Technical Content and Learning & Training
specializations, lists over 600 elements, and this does not include the elements allowed
through foreign XML languages SVG and MathML. Many of the elements are specialized
off of
existing DITA base elements, which means that out-of-the-box support of these elements
comes
with DITA’s typing architecture. However, to consider a rendering engine to have full
support of DITA 1.3, the rendering engine should distinguish specialized elements
from the
base elements.
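The fallback behavior described here rests on the DITA `class` attribute, which records each element's specialization ancestry from base type to most specialized type. The sketch below shows how a renderer might use it to prefer a rule for the specialized element and fall back to the base element otherwise. The rule table and style names are hypothetical.

```python
def parse_class(class_attr):
    """Split a DITA @class value into (module, element) pairs, base first."""
    tokens = class_attr.split()
    # The first token is "-" or "+"; the rest are module/element pairs.
    return [tuple(token.split("/", 1)) for token in tokens[1:]]


def pick_rule(class_attr, rules):
    """Return the rule for the most specialized type that has one, else None."""
    for pair in reversed(parse_class(class_attr)):
        if pair in rules:
            return rules[pair]
    return None


# @class value of a task <step>, which is specialized from the base <li>.
STEP_CLASS = "- topic/li task/step "
base_only = {("topic", "li"): "List Item"}  # hypothetical style names
with_step = {**base_only, ("task", "step"): "Numbered Step"}

print(pick_rule(STEP_CLASS, base_only))
print(pick_rule(STEP_CLASS, with_step))
```

With only the base rule, a step is still rendered, just as a generic list item; adding a specific rule is what lets the output distinguish the specialized element.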
Neither QuarkXPress nor InDesign has native support for SVG or MathML. These XML formats
can be converted to static images for use within these page layout applications, but then
the inherent advantages of using these formats in the first place, including accessibility
and interactivity, are lost.
These challenges are certainly not insurmountable, and DITA support by page layout
software will continue to improve. The more difficult branding requirements introduce
challenges when DITA content is reused for other business units (e.g. marketing). Many
design obstacles are handled better by products that support high-fidelity page layout,
but traditionally at the cost of automation. For example, the layouts most challenging to
conventional XML-based publishing engines include features like irregular-sized graphics
with text wraparound and multi-column layouts with callouts anchored to relevant text
content. These difficulties can now be handled automatically with XML-aware layout
engines, such as InDesign Server and QuarkXPress Server, used in concert with the DITA-OT.
Conclusion
As DITA continues to grow in popularity as a document format for non-technical industries
and as branding and user experience become important priorities for organizations,
demand
grows for tools to make DITA easier to use and implement for non-technical authors
and
contributors. Chief among the pain points for DITA implementers is PDF customization:
working with code is not always feasible for layout design, which for centuries has been a
visual, manual process, nor does it allow for the rich design that is key to a great user
experience. Using desktop publishing software such as QuarkXPress or InDesign, mature
products whose primary application is layout design, is one way to produce high-quality,
rich-layout templates for use in PDF and other outputs.
Furthermore, because QuarkXPress and InDesign both support XML, a similar process
can be
used for other XML-based document formats. Smart Content [White 2015] has an
extensible typed architecture like DITA but is simpler in nature (only a couple dozen
elements
in comparison to the 600+ elements in the DITA standard) and arguably more approachable
to
developers familiar with HTML-related standards. Smart Content is used extensively
with
QuarkXPress templates and has proven immensely successful for a broad range of enterprise
organizations needing content automation coupled with engaging page layout. Using
page layout
software, all XML documents can become swans.
Harold, Elliotte Rusty, and W. Scott Means. “13.5 Choosing Between CSS and XSL-FO.” In XML in a Nutshell, 2nd Edition. Sebastopol: O'Reilly & Associates, Inc., 2002.