Nordström, Ari. “The Dream of a CMS.” Presented at Balisage: The Markup Conference 2023, Washington, DC, July 31 - August 4, 2023. In Proceedings of Balisage: The Markup Conference 2023. Balisage Series on Markup Technologies, vol. 28 (2023). https://doi.org/10.4242/BalisageVol28.Nordstrom01.
Balisage Paper: The Dream of a CMS
Ari Nordström
Ari is an independent markup geek based in Göteborg, Sweden. He has provided
angled brackets to many organisations and companies across a number of borders
over the years, some of which deliver the rule of law, help dairy farmers make a
living, and assist in servicing commercial aircraft. And others are just for
fun.
Ari is the proud owner and head projectionist of Western Sweden's last
functioning 35/70mm cinema, situated in his garage, which should explain why he
once wrote a paper on automating commercial cinemas using XML.
An XML-first content management system — XML technologies handling XML content in
an XML database — has been the author's dream for the last decade and a half, ever
since he first found a way to break free from the shackles of non-XML technologies
limiting what he could do.
A portal for an automotive client provided an interesting, XML-centric case study
using quite a few X technologies and, just as importantly, a way forward to
finally implementing that system.
This paper started life around fifteen years ago, when I was spending my professional
life designing a document management system based on a then-popular XML editor and
a SQL database. The system had a document management layer with full integration
between the editor and the database, exact versioning and traceability, translation
handling, and, of course, XSL for publishing.
It had all kinds of bells and whistles — and yet, adding any kind of new processing
capability, be it editor functionality or additional publishing targets, was a
nightmare because the system was based on technologies and languages far removed from
XML. As a markup person, I relied on programmers specialising in .NET for even the
slightest change.
XML First
That all changed with XProc 1.0. As long as I had something in the DMS calling an
XProc engine and a pipeline of my choice, adding a new publishing format or an
import function was a matter of writing the stylesheets and an associated pipeline.
It was brilliant!
And it was to get even better — not only was I able to write an XProc pipeline for
the processing but I could also generate a matching user interface in XForms using
XProc! [Using XML to Implement XML]
Some time later, I helped design another content management and (web-based)
publishing system to produce regulatory checklists for farmers seeking national and
EU funding, this time using oXygen and eXist-db [XML Solutions for Swedish Farmers: A Case Study].
Importantly, beyond a few web technologies and design choices outside of my control,
it was all implemented using XML technologies — XProc, XSL, XLink, XQuery, and a few
other things beginning with X — so rather than having to deal with
binary blobs in a SQL database, I could query and process the XML directly.
My next step was to test the waters by combining the aforementioned XProc and
XForms publishing approach with eXist-db [ProXist]. It was a bit
clunky but it worked; I was able to output a UI, published by an XProc pipeline and
associated framework, based on the type of output pipeline I had.
I still missed version management and full traceability — what I had in that SQL
server-based system — but implemented in an XML database using XML technologies. I
came up with VML [Multilevel Versioning],
an XML vocabulary for version management, and an outline suggesting how to implement
it in eXist-db.
A Question and a Devious Plan
The XML-first content management approach stayed with me. I would
occasionally suggest solutions along those lines for my clients while keeping busy
with migration pipelines, publishing stylesheets, and DITA customisations, but
surprisingly no-one wanted to sponsor my XML-first system. I did write another
eXist-db and XForms project, a registration app for Balisage's sister conference,
Markup UK [Eating Your Own Dog Food],
but that was about as close as I got.
But then, in 2022, a client asked me how I would go about implementing a portal
for publishing DITA and S1000D content. The portal would publish the service
documentation of a car manufacturer about to launch their very first model, but also
fit into my client's larger strategy of providing an entire service lifecycle
management chain, from 3D CAD data to service and end user documentation, parts
catalogues, and so on.
I proposed a system where the XML content is stored in an XML database as-is,
without pre-converting anything, filtered and queried as-is, and finally published
to HTML on the fly. I had a devious plan.
The Portal
The portal happened in a specific context, namely the publishing of web-based
documentation within the automotive and aerospace industries — user guides, workshop
manuals, bulletins, parts catalogues, etc., much of which is accompanied by 3D graphics
and animations produced directly from the product CAD data — using XML vocabularies
such as DITA and S1000D.
It's just a portal, though; the content is authored elsewhere, in an unrelated system
that doesn't know about the portal's existence. Similarly, the portal does not care
where the content comes from.
Content
DITA [DITA Specification] and
S1000D [S1000D Specifications],
while very different at the markup level, have similar approaches to content: firstly,
there are the topics, that is, the reusable blocks of information where the actual
content lives, and secondly, there are the publications or structure descriptions
(maps in DITA, publication modules in S1000D) that
combine the topics into actual documents with chapter and section hierarchies.[1]
The portal had to be able to display both — individual topics would be enough for,
say, a disassembly task or a function description for a single component, while the
structures would be needed to browse through the full publications.
A structure view helps illustrate the overall document structure
and functions as a table of contents, but it also helps highlight the reuse of
common components — for example, see the arrows pointing to Warning 2
in Figure 1.
Both vocabularies employ what is known as conditional
processing (some vocabularies, including DITA, call it profiling),
basically declaring applicabilities (which is really S1000D
terminology) for the content: this topic applies to products A and B,
this applies to regions APAC and EU, and so on. By profiling
content from individual paragraphs to entire topics, reuse becomes easier; a
single topic can be reused in multiple contexts, spanning multiple products, product
variants, audiences, and so on.
The filtering is, of course, useful not only when searching for content but also when
publishing on the fly; an end user can select the exact model and variant so that only
relevant information is included when publishing to HTML.
Browsing and Filtering
In the portal, we store the source content as-is, and that content needs to drive
everything. For example, we only list the topic types and profile values actually
in use in the database, not everything that is possible. The UI should always reflect
the actual content.
While XML databases use XML as their primary format, we still want to generate
basic resource lists only once, when initialising, rather than every time those
resources are queried.[2] This makes most operations far quicker, especially with a large enough
database.
The initialise operation generates several lists, all in XML format:
A list of profiling attributes and values in use.
Elements in use.
Resources considered to be part of the portal content.
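As a minimal sketch of the kind of query involved, the first of these lists might be
generated along these lines, assuming eXist-db, a content collection at
/db/portal/content, and DITA-style profiling attributes (the names are illustrative,
not the portal's actual code):

xquery version "3.1";

(: Sketch: collect the distinct values of each profiling attribute
   actually in use in the content collection. :)
declare function local:profiling-values($atts as xs:string*) as element(profiling) {
  <profiling>{
    for $att in $atts
    return
      <attribute name="{$att}">{
        for $value in distinct-values(
          collection('/db/portal/content')//@*[local-name() = $att]
            ! tokenize(., '\s+'))
        order by $value
        return <value>{$value}</value>
      }</attribute>
  }</profiling>
};

local:profiling-values(('product', 'audience', 'platform'))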
The resource list is especially interesting. It is currently limited to
content and structure XML only, so while there are
plenty of other files, XML and binary both, they are not used directly by the main
UI. For each listed XML resource, we add basic information such as a database URI and a
title (extracted from the file contents), but also profiling and topic/resource type
information. The result is a (very long) list of file elements.
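A minimal sketch of what such entries might look like (the attribute names are
illustrative; the actual list carries more information):

<files>
  <file uri="/db/portal/content/topics/brake-disassembly.dita"
        title="Disassembling the front brakes"
        type="task" outputclass="service"
        product="model-a" expanded="false"/>
  <file uri="/db/portal/content/maps/workshop-manual.ditamap"
        title="Workshop Manual" type="map" expanded="false">
    <file uri="/db/portal/content/topics/brakes-overview.dita"
          title="Brakes Overview" type="concept"/>
  </file>
</files>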
The View button starts a publishing process that converts the
selected topic to HTML and opens it in a separate tab. For DITA maps, that button
becomes a Browse button and presents the structure as a tree,
with expandable subtrees and, of course, viewable topics.
The UI is generated from the file list XML using an initial XQuery script that
calls some XSLT, allowing us to not only localise the UI (all UI labels and text are
stored in language- and country-specific XML files) but also configure how the UI is
presented, using a configuration XML file to define what features are being used.
For example, one setting allows us to select one or more resources for later
processing or filtering.
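The initial XQuery script that publishes the UI might look roughly like this,
assuming eXist-db's transform module (the paths and stylesheet names are
illustrative):

xquery version "3.1";
declare namespace transform = "http://exist-db.org/xquery/transform";

(: Sketch: transform the file list with the UI stylesheet, passing the
   configuration and localisation documents along as parameters. :)
transform:transform(
  doc('/db/portal/data/file-list.xml'),
  doc('/db/portal/xslt/ui.xsl'),
  <parameters>
    <param name="config-uri" value="/db/portal/config/ui-config.xml"/>
    <param name="labels-uri" value="/db/portal/config/labels-en-GB.xml"/>
  </parameters>)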
Additional controls allow us to filter the file list by applying profiling
information. When applying filters, we simply add the currently selected values to
the file list XML.
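A sketch of the idea, with illustrative names (the portal's actual markup may
differ):

<files>
  <profiles>
    <profile att="product" val="model-a"/>
    <profile att="audience" val="workshop"/>
  </profiles>
  <file>...</file>
</files>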
This is used by the XForm controls, of course, but also for publishing. When a
topic is published (see section “Publishing”), the profiles element is converted to a
DITAVAL filter [TBA ref] and sent to the publishing process as a parameter.
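For instance, the profiles element sketched above might translate to a DITAVAL
filter roughly like this (DITAVAL itself is standard DITA; the values are
illustrative):

<val>
  <prop att="product" val="model-a" action="include"/>
  <prop att="product" action="exclude"/>
</val>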
Most of the file browsing and filtering features are implemented with XForm
controls that either act on the file list XML directly or call XQuery
functions.
A configuration XML file controls most aspects of the UI, including styling,
scripts, and so on, so changing the appearance of the UI is a matter of tweaking the
CSS stylesheet(s) and updating the UI configuration.
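A minimal sketch of what such a configuration file might hold (all names
illustrative):

<config>
  <css href="/db/portal/resources/css/portal.css"/>
  <script href="/db/portal/resources/js/3d-viewer.js"/>
  <feature name="multi-select" enabled="true"/>
  <feature name="profile-filters" enabled="true"/>
</config>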
Finally, note that the file list XML acts as an XForms instance, so in addition to
the resource metadata, we frequently set attribute values for XForms-related
processing. For example, in the nested file elements shown above, there
is an @expanded attribute to expand and collapse the tree
representation. Other processing includes the resource type and the DITA
outputclass, both of which are useful when filtering and publishing.
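As a sketch of the XForms side of this, a tree toggle might flip @expanded in the
file list instance like so (assuming the usual xf and ev namespace bindings and an
XPath 2.0-capable XForms processor; this is not the portal's actual code):

<xf:trigger>
  <xf:label>+</xf:label>
  <xf:action ev:event="DOMActivate">
    <xf:setvalue ref="@expanded"
                 value="if (. = 'true') then 'false' else 'true'"/>
  </xf:action>
</xf:trigger>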
Publishing
The first portal version ended up being a DITA implementation because that's what
the customer, an automotive manufacturer, is using. DITA isn't without its
advantages; most aspects of DITA, from authoring to publishing, are well supported
today, meaning that we wouldn't have to write publishing stylesheets from scratch.
Rather than using the DITA Open Toolkit, a first choice for many implementers, we
chose XMLmind's DITAC framework [XMLmind DITA Converter] because it was far better suited to being integrated with
eXist-db.
Standard DITA functionality, from profiling using DITAVAL filters to conref links,
is provided by DITAC out-of-the-box, which is a huge help: DITAC includes default
stylesheets for HTML, PDF, and a few other formats. This means that a DITAC
stylesheet to output a specific layout, known as a plugin, is really
just an extension and therefore much faster to write.
The portal DITAC implementation brings the functionality to eXist-db and deals
mostly with database URIs and packaging, but also includes JavaScript code to handle 3D.[3]
Most aspects of the portal are XML technologies that the author feels quite
comfortable with; this is not one of them.
What's in a CMS? Asking for a Friend.
What's in a CMS, really? What does it take? And mind, while I am a proponent of
open-source software[4], this exercise is not about that; rather, it's about yours truly arriving at
a point where document (content) management can be had without a year-long project[5] or an expensive third-party system offered by consulting firms who really
want to make money from additional services and support[6], or both. Some people build boats or cars from scratch just because they
can, even though money can buy both. Me, I want to build a document (content) management
system and then take it for a spin.
So, again, what does it take?
Storage. Being a pointy brackets person[7], I'm partial to XML databases, and there are several alternatives
out there.
Authoring. There are plenty of alternatives, from open-source editors to
commercial products with everything supported out-of-the-box.
Publishing. Again, plenty of alternatives. XSLT and FO are no-brainers if you
want to write the stylesheets yourself.
Management. This is the heart of the matter, really, isn't it? It's about
listing whatever resources you store and author and publish, about searching and
filtering their contents, and about keeping track of them.
Do-It-Yourself Document Management
The alert reader will have noticed where this is going, of course. The portal is
half-way to a CMS:
Storage? Yes, we already store the content in an XML database so all kinds
of things become possible.
Authoring? No, we don't have that per se but read
on.
Publishing? Check.
Management? Yes, we have some of it. Read on.
Authoring
Adding authoring is easy, and easiest by far is to connect oXygen XML Editor to
eXist-db[8], with the connector available out-of-the-box. Other editors require more
work. If all you want is an integration with the (XML) database without versioning,
then we're done.
Note
Obviously, if you're integrating Emacs, you'll have more
to do.
Actually, even with versioning, I'd argue we're only ever going to edit the latest
version of any document. If you check out an earlier one, you still don't actually
edit that version; you create a fork and edit that instead.[9]
In a proper version management system with check-out and check-in, your editor
integration will require additions to do that.[10] I'd argue that the check-out function is a matter of locking the file
and reasonably easy to achieve. There are probably also a number of convenience
functions, but for a bare-bones authoring environment, this should be enough.
A reviewer rightly pointed out that the editor is frequently a stumbling block for
non-markup people. While I do agree, addressing that particular problem must come
later. oXygen, for example, makes an excellent effort towards user-friendliness for
many types of authors[11], with or without an associated CMS, and so I would argue that a
well-designed CMS (which this one aims to be) can only help.
Document Management
And we arrive at the heart of the matter. You'll note that we already have
browsing and filtering capabilities; I'd argue that the portal's browsing UI isn't
needed for basic editing when using oXygen (see section “Authoring”) since we'll
always edit the latest version and the whole versioning business is handled in the
CMS.
Other editors might have to either implement proper in-editor integration — which
to me sounds like a fairly difficult thing to do — or add a UI trigger to open a selected
resource in the XForm, much like the buttons shown in Figure 4. That
function, of course, would be a little something written in XQuery, perhaps made
slightly more complex with check-out/check-in functionality.
The check-out/check-in functionality requires a suitable flag, of course, but it
is easy enough to add one to the file list XML as an attribute (with the flag
controlled by XQuery functions behind the scenes):
<file checkout="true">...</file>
Other flags (and associated XQuery functions) are easily implemented, of
course.
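A minimal sketch of such a function, assuming eXist-db's update extension and a
file list at /db/portal/data/file-list.xml (the names are illustrative, and a real
implementation would also lock the resource itself):

xquery version "3.1";

declare function local:check-out($uri as xs:string) as xs:boolean {
  let $file := doc('/db/portal/data/file-list.xml')//file[@uri = $uri]
  return
    if ($file/@checkout = 'true')
    then false()  (: already checked out :)
    else (update insert attribute checkout { 'true' } into $file, true())
};

local:check-out('/db/portal/content/topics/brake-disassembly.dita')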
Versions and Workflow
This is not a paper on versions or workflow (which, by the way, are
not the same or even close), but I do believe that the
portal's basic approach to file listing and browsing is well suited to being
expanded for the purpose.
I've long advocated an approach that centres on identifying resources using
URNs (Uniform Resource Names). The URN identifies a resource on multiple
levels: a base identifier followed by a point in time and a rendition, something
like this:
urn:x-example:thing:123456:<version>:<xmllang>
If all you want is to identify an abstract notion of that resource, without a
point in time (version) or a specific rendition (language and locale), you are
left with a base identifier:
urn:x-example:thing:123456
Again, this is a base identifier, identifying the
resource in its purest and most abstract form. My paper on versioning [Multilevel Versioning] explains how it
all works and suggests how to handle versioning and workflows, so what is left
here is to connect that with the portal's file list XML:
<file urn="urn:x-example:thing:123456">...</file>
This means that the base resource is identified by @urn but
that's it; any decision on which version should be used is left to other
business logic. Of course, given the nature of the file list XML, we might
reasonably expect it to use the latest version (and perhaps whatever language
and locale we're displaying at the moment). Alternatively, we might just provide
a specific value when publishing.[12]
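For example (a sketch with illustrative attribute names, following the URN
structure above):

<file urn="urn:x-example:thing:123456"
      version="3"
      xml:lang="sv-SE">...</file>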
Notably, though, the file list XML is in no way a master list. It's for
presentational and editing purposes only, and the actual master is a VML XML
instance, with XQuery code keeping track of the two using the base URNs as
keys.
In Closing
There is some way to go still, but I do think that an XML-first CMS based on the
portal and some other bits and pieces is a distinct possibility. Some of it may happen
well before Balisage[13], but I doubt I'll have full-blown versioning before the end of this
year.
But What About...?
While I've thought about the subject for years[14], this paper simply cannot reflect everything. Also, I could be missing
something obvious. The latter I can't do much about, but here's to addressing at
least some of the outstanding questions:
Yes, the portal does handle localised content. Think of it as profiling —
we're filtering on @xml:lang identifying 4-position
language/country codes.
Version management is a really big subject and impossible to fully address
here. You'll need to read the paper[15] I wrote on the subject [Multilevel Versioning]. It's
certainly not definitive but I will be more than happy to argue my
points.
Workflow management in a CMS, similarly, is not possible to address
here.
Binary (non-XML) files are manageable in the same way as the XML files.
The reason the portal limits the file list to XML is simply that we don't
currently support doing anything with the non-XML files beyond displaying
them if linked to from the XML.
This is easy enough to change: there is currently a function that
specifically removes non-XML files (and a few XML files, too, because not all XML
is content) from the file list. We'll need to decide what to do with
them first.
The first implementation is somewhat DITA-centric, yes. Later versions
will generalise much of the code to handle other vocabularies (S1000D and
ATA XML to start with, but I'd assume DocBook will also follow).
As for schema languages, S1000D is at the top of the list, so we'll have at
least one XML Schema implementation. The current DITA implementation uses
DTDs, although RELAX NG can be added with very minor updates.
The First Live Version
The portal went live before Christmas 2022 in what the author regards as an
unqualified success. The database contains 100+ GB of data, much of it 3D content
but also thousands of XML resources, and the performance is surprisingly good. The
XForms UI, in particular, works amazingly well.
[10] I once worked on a DMS project without explicit check-ins and check-outs.
They called it optimistic check-out, and essentially they
relied on being able to manage conflicts when two authors were editing the
same file at the same time.
[11] I've been involved in a number of projects where the aim was to make
structured authoring available and easy for non-markup authors. We've come a
very long way.