Intro
This is a paper about that twilight zone beyond schemas, the place where style guides, those arcane instructions to authors about house style and how to produce content that is not only valid but stylistically consistent, are supposed to kick in but, these days, increasingly don't.
It's a soapbox paper, basically, a result of years of irritation, agitation, and random shouting.
Some Examples
Allow me to begin with a couple of examples to illustrate where I come from.
Here is a favourite pastime of Brits and Swedes:
This, of course, is a queue. For us markup folks, it's basically a list. The semantics are clear, right?
But let's have a look at a more scary example:
This is McDonald's on Fleet Street in London during lunch hour. There are a number of cash registers but, for the most part, no clear queues and no help in sight. People are literally all over the place. What are the semantics here? How do you get your food? Clearly, there should be queues, but none is readily apparent[1].
Let's do a more markup-centric example:
<para>Here are my favourite films:
  <list>
    <item>Close Encounters of the Third Kind</item>
    <item>2001</item>
    <item>Amadeus</item>
  </list>
</para>
This, a list construction recognisable from schemas such as DocBook, has, on the surface of it, clear semantics. There's an introductory sentence and a couple of list items. A lot clearer than the McDonald's chaos, above, right? There is a problem, though.
If your world is like mine, there are a couple of usual suspects when it comes to what are known as block-level elements. Paragraphs, notes, admonishments, tables... and lists. And on the surface of it, this would qualify as a list, except I've always instinctively read the DocBook-style lists as inline because they are inside a paragraph. Something to be presented like this:
Here are my favourite films: Close Encounters of the Third Kind, 2001, and Amadeus.
Here's how it's usually presented, though, with everything on block level:
Here are my favourite films:
* Close Encounters of the Third Kind
* 2001
* Amadeus
Looking at the schema, how does one know what is actually meant here?
In practice, since DocBook and others allow lists both (seemingly) inline and on block level, I've had plenty of authors write
<para> Intro to list: <list>...</list> </para>
But just as many write
<para>Intro to list:</para> <list>...</list>
And many write lists in both ways, frequently in the same document, seemingly oblivious to the difference, or the pain they cause me.
That intro text is part of the list, of course; if you remove the list during some processing, that processing should remove the intro, too. To illustrate this problem:
<para>Here are my favourite films:</para>
<list>
  <item>Close Encounters of the Third Kind</item>
  <item>2001</item>
  <item>Amadeus</item>
</list>
Here, the introductory para clearly belongs to the list if you bother to read the contents, but isn't actually part of it. The quick fix in a schema would be to add a para to the list model and use that:
<!ELEMENT list (para, item+)>
(Yes, required. I hate lists with neither motivation nor explanation[2].)
In real life, there are any number of reasons to want to use a list but not introduce it with a para, so making the para optional is a prudent first modelling step. In a lot of content, though, people like to precede the items with a title rather than a para, or a title and a para, and possibly other elements, all of them part of the list group. Adding all of those to the content model (and making most of it optional) results in a large model:
<!ELEMENT list (title?, (para|note|admonishment|figure)*, item+)>
Chances are that if your list model looks like this, then quite a few of your other block-level elements will, too; they'll be complex because you need to cover all the use cases. For the author, though, this will increase the risk of markup errors, with content ending up in the wrong place, or simply result in a (mostly) unused model. Or both.
The intended meaning behind the model is far from clear, even though the literal semantics may be.
Or, to take a different kind of example, have a look at this ATTLIST depicting the allowed attributes of a list in legal commentary:
<!ATTLIST core:list
  type (bullet | check-box | lower-alpha | lower-roman | mdash | ndash |
        number | plain | upper-alpha | upper-roman | upper-alpha-alpha |
        lower-alpha-alpha | smallcaps-alpha-alpha) #REQUIRED
  restart (yes | no) 'yes'
  source-pnum CDATA #IMPLIED
  %display-atts;
  lni CDATA #IMPLIED
>
Most importantly, there is a type attribute offering 13 (!) different list types. There's probably[3] no way for you to know what's going on merely by reading the DTD. In fact, even if you decide to study their use by looking at actual documents, you'd probably still miss the point (note the two ordered list types, number and lower-alpha):
<core:para-grp>
  <core:desig value="17">17.</core:desig>
  <core:title>General financial arrangements.</core:title>
  <core:para>The following are to be paid out ...:</core:para>
  <core:list type="number">
    <core:listitem>
      <core:para>contributory benefit...;</core:para>
    </core:listitem>
    <core:listitem>
      <core:para>guardian’s allowance...;</core:para>
    </core:listitem>
    ...
  </core:list>
  <core:para>The following are to be paid out ...:</core:para>
  <core:list type="lower-alpha">
    <core:listitem>
      <core:para>any administrative expenses of the Secretary of State...</core:para>
    </core:listitem>
    ...
  </core:list>
  ...
</core:para-grp>
At a quick glance, this might suggest that an ordered list type is all you need, and that the other types happened because someone thought they would be pretty. It's what I, rather lazily, assumed at first.
Not so. The different types are there because in a single unit of the law, what is known as a paragraph, you are not allowed to use the same type of list more than once. If you think about it, it makes perfect sense; if you refer to the second item of a list in a (law) paragraph, the reader will only find the right item if the list type used is unique within that paragraph.
Nowhere in the schema is any of this apparent, however, and there was no style guide available to me.
List types in legal documents are easy to misunderstand, especially if you don't use them daily and there's no documentation to guide you. Some authors have enough difficulties understanding the difference between the different types to begin with.
Which is why it is not uncommon to see a step-by-step instruction that looks like this:
Follow these steps: * Do this. * Then do this. * Also do this.
This, of course, is simply bad form stemming from inadequate understanding of semantics. A bulleted list is an unordered list, which is pretty much the opposite of a step-by-step instruction. The former is a list of things where the order is of no importance, while the latter is a set of instructions where order (presumably) matters a lot.
Why Does This Happen?
The question we need to answer first is why does this happen? And to answer that, we need to define exactly what it is that happens.
Think about the list intro, above, the one that grew to an overcomplicated mess. This tends to happen because the sources lack consistency[4]. For example, ordered lists are used as procedures and vice versa, and to cover all the use cases, the schema grows unnecessarily[5] big to allow for cases that should have been identified as edge cases to begin with. Or, in cases where an existing schema is expanded with new models, the requirements process suffers from a poor understanding of the style the content should follow.
Note
Also, sometimes duplication or near duplication of an existing model happens when a schema is updated, again because the style that the content should follow is poorly understood or the sources were inconsistent and poorly modelled to begin with.
The irony, of course, is that the content resulting from the shiny new (or updated) schema will rarely or never need everything the schema offers, so all those models remain either inconsistently used (with one document using one model and another a different one) or not used at all.
On the other hand, an overly complex model might actually be correct but poorly understood by its users. Think of those 13 different list types (or rather, formats; to me, type implies semantics). It's all too easy to dismiss most of those types, again because the intended style of the content is poorly understood.
The documentation there is will probably not tell you enough; most schema documentation I've seen is half auto-generated, the other half not up to date. None of it explains the use cases (sometimes because there is not enough room, more often because the information analysis that resulted in the model wasn't properly documented).
Looking Pretty
The legal lists above are seemingly about being pretty, but as we saw, their reasoning was actually far more than that. Maybe it's because so many schemas do this sort of thing:
<emphasis type="bold italic">emphasised content</emphasis>
We markup folks see this sort of thing so often that we become jaded. Yes, you haven't bothered about the semantics here (is it a GUI object? a spare part? an important word?), just provided the author the means to have the text look pretty. We see it and lazily assume that formatting in markup is either down to lazy modelling or a sign that no actual semantics were needed[6]. The opposite is sometimes true, as seen in the ATTLIST example, above, but how are we to know without enough information?
I have no problems with using this sort of thing, mind; sometimes it's what you need. What I don't accept is just letting it all out there. When you say “bold italic”, what do you mean? And pretty doesn't count.
It's about consistency. If you do this now, do what you've done before, and what your co-workers have done before. But, if you've all done it before, what do you actually mean?
And is it too much to ask that you document what you mean?
What To Do About It
Some of the practical-minded and result-oriented markup folks will now be saying things like “Add Schematron rules! Add Schematron Quick Fixes!”
This is true but not nearly enough, in my ever-so-humble opinion. By themselves, Schematron rules are merely painkillers.
Schematron rules check for patterns, relying on XPath expressions to match a pattern and offer appropriate messages. Sometimes, these are merely informational, sometimes they warn against a practice or report an error a schema either can't or shouldn't warn about. Sounds useful, right?
But where should the patterns come from? Why do they happen to begin with? Some developers will now reiterate the last paragraph, emphasising the parts about a schema being unable to check for condition A or warn against error B. Yes, but what a Schematron should really check for is adherence to a house style. In olden days, this style was described in a style guide, and so that's what you need to look for.
So, what should we really do about the mess outlined in the previous sections? Locate the style guide, see what it says, and act accordingly. And if there is no style guide, then write one![7]
Um, What Is A Style Guide?
When I started writing this paper, the conclusion was to use a style guide, and that's pretty much it. Maybe a little sugar on top (tools such as Schematrons), but essentially, the paper concluded with “use a style guide”, without any explanation of what a style guide is.
So, what is a style guide?
Think of it as a poet's schema. There are rules, such as how many section levels to use, or how to describe a procedure, including things like what a single step is and what kinds of things warrant a procedure. But a style guide will also explain how to write[8] (passive vs active voice, gerunds in headings, that sort of thing) and what to include in a certain document type. And once upon a time, it would explain what an index needs to contain; today, of course, people increasingly equate indices with search engines, which is just not the same, but search boxes are what we have, rather than indices.
There was a time when most technical writing departments had a style guide detailing how their documentation was written, but these days, style guides tend to only be used by newspapers (although this practice is also disappearing) and publishers. The reasons, I imagine, are much the same as with indices — for some reason, the thinking is that just as search engines can replace indices, schemas can replace style guides. Ugh.
Style Guide Examples
In a former life, I worked as an editor (as opposed to author; see section “Roles”) at a global telecommunications company. Among other things, I was responsible for editing and updating their Style Guide[9]. The company produced most of their documents in unstructured FrameMaker format, but with well-defined paragraph and character formats, a style guide, and an actual editor (me!) to enforce the content styles[10]. That's a subject for a different paper, or perhaps my memoirs. Suffice to say that the content produced at the time was more consistent than a lot of the XML content I see these days, and it was easy to convert to SGML when the time came.
I do want to highlight some of the instructions in that long-forgotten book, though, as I feel it still illustrates my points rather well[11]. For example, here's a screenshot from a section that deals with ordered lists:
Note how ordered sublists should be avoided if at all possible. This was about keeping ordered lists simple enough to process and fit onto a low-res screen (this was in the 90s), among other things.
Procedures (not to be confused with ordered lists) had different style instructions:
There were a lot of different procedures in the documentation, and they all had their own style. The following example is rather long, but should illustrate how style ties into structure:
As should be apparent above, there is an overlap between the style guide and the structure, which worked quite well for FrameMaker-based content. Also, when the time came, the SGML DTD did complement the style guide quite well.
Today, some twenty years after the fact, the style guide is all but forgotten, and the editors have all left.
Similar Models, No Way To Share
To illustrate how important style is, let me tell you another story. Some years after the demise of the style guide at the big telecom company, above, I was tasked with creating an XML production DTD, an exchange format that would allow two car manufacturers to exchange service information. The two were already sharing a lot of the hardware; both manufacturers shared platforms, engines, gearboxes and more to make many of their car models.
The production DTD itself was easy enough to create. There were a couple of differences in the respective DTDs, but most differences were about trivialities like cardinality and different element names, and so the production DTD that resulted was a superset of the respective DTDs used by each manufacturer. I think that DTD took me a few days to do, all in all.
But looking at the actual contents from the respective manufacturer, it became clear that sharing information would be a lot trickier than sharing hardware. Manufacturer A used a text-based approach to write their service information, adding a few illustrations where necessary. Manufacturer B, however, used a comic-book approach — very little or no text, but at least one image per step.
This was not a modelling problem at all, this was purely a style problem, and neither side would give up their way of producing content. They never did share their service information with each other.
Neither manufacturer used a style guide, and it certainly never occurred to them to ask the other how they wrote their information. DTDs were sent back and forth during early decision-making[12], but that was about it.
Lists Revisited
So, to return to the list problems that followed the queues, here's a style guide excerpt that addresses my list examples (drawn from memory; I don't have the actual pages):
Always introduce a list with a paragraph that explains what is listed. The introductory paragraph is not a title; rather, it is a qualifier, giving the list its proper context. It, just as the list, is an integral part of the text flow, and should, just as the list, be written to fit the surrounding text.
Never use an ordered list when you are writing a procedure (and don't even consider writing it using an unordered list).
Never insert a list or its introductory text inside a paragraph unless you intend to present your list inline.
...
That last bit I added here and now; my style guide did not discuss markup.
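As a rough illustration of how two of the excerpt's rules could be echoed as soft reminders, here is a minimal Schematron sketch. It assumes the un-namespaced para and list elements from the earlier film examples; the pattern id, the roles, and the message wording are all made up for illustration and come from no real schema.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- Hypothetical pattern: element names follow the paper's simplified examples -->
  <sch:pattern id="list-style">
    <sch:rule context="list">
      <!-- Style rule: a list inside a paragraph is read as an inline list -->
      <sch:report test="parent::para" role="info">
        This list sits inside a paragraph and will be presented inline.
        For a block-level list, move it after the paragraph.
      </sch:report>
      <!-- Style rule: a block-level list should directly follow an introductory paragraph -->
      <sch:report test="not(parent::para) and not(preceding-sibling::*[1][self::para])"
                  role="warning">
        This list has no introductory paragraph immediately before it.
      </sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Note what the sketch cannot do: it only checks that a paragraph precedes the list, not that the paragraph actually introduces it, and it has no way of telling whether an ordered list is really a procedure in disguise. Those judgements stay with the style guide and whoever enforces it.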
OK, So Where (How) Do I Get One?
If you don't have a style guide but have a lot of XML, plus some schemas and Schematrons, chances are that your documents are inconsistent and would need that style guide. Is it too late?
Ideally, I think a style guide should be the first result of the information analysis that will later lead to the schema(s) when starting out with structured information. This, of course, may not be possible, so I'd settle for the next best thing: do a new information analysis by looking at the current XML sources and the Schematron schemas, figure out what the problems are (I'm guessing that looking at the more common Schematron errors will point you in the right direction), and then have a think about what the content should look like, in terms of style. Define a desired house style, in other words. Once there (and this is just as iterative a process as writing a schema), you should formalise your findings in a style guide.
This will result in better semantics and more consistent content. Chances are that you'll be able to tighten the schema(s) and get rid of unused models while improving the ones you keep. This will help you create better, more focussed, Schematron rules and achieve a separation of concerns — let the schema enforce the structure and the Schematron suggest a style defined in the style guide.
Yes, I do think it's worth your while.
Roles
Authors are opinionated people. They care very much about their content, and they all have very definite ideas about what makes it good. This, sometimes, can be bad, because when authors are allowed to do what they want, their documents will differ from one another, and the reader will suffer.
This is why publishers used to have editors.
Some years ago, before the true state of things was readily apparent to me, I innocently asked a client of mine if they had editors. Yes, they had a whole department of them, why? It took me a few moments to realise that they were talking about authors. Writers. They had no editors, and hadn't had them for years. That's why they moved to structured information, right?
An editor, of course, is the person who makes sure that everyone follows the style guide, is the final arbiter of all things style, and frequently the one who edits the style guide[13].
So, does it make sense to have an editor on the staff, in addition to authors? Aren't there tools that can do the job, these days?
Tools
The obvious tool beyond a schema is a Schematron: those XPath-based, context-sensitive “soft” rules that go beyond what schemas can express, and what schemas should express.
A Schematron rule can, with a few well-expressed XPaths, make sure that every ordered list in a law paragraph uses a different list type (see section “Some Examples”). It can suggest that a list should have an introductory paragraph if it lacks one, and, in a similar way, help out with most other rules. What it can't do is explain what a complete procedure should or shouldn't look like. Schematrons are not instructions; they are a help when validating, and if you don't know how you should write your content, they won't help you, only point out what's wrong with what you've already written[14].
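To make the law paragraph case concrete, here is a hedged sketch of such a rule. It assumes that core:para-grp is the unit that corresponds to a law paragraph, that every core:list inside it counts (nested ones included), and that para-grps do not nest; the namespace URI is a placeholder, since the real one is not given in the ATTLIST example.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- Placeholder namespace URI; replace with the one the real DTD or schema uses -->
  <sch:ns prefix="core" uri="urn:example:core"/>

  <sch:pattern id="unique-list-type">
    <sch:rule context="core:list[ancestor::core:para-grp]">
      <!-- The nearest enclosing para-grp, assumed here to be the law paragraph -->
      <sch:let name="grp" value="ancestor::core:para-grp[1]"/>
      <sch:let name="type" value="@type"/>
      <!-- Exactly one list of this type may occur in the group: the current one -->
      <sch:assert test="count($grp//core:list[@type = $type]) = 1" role="warning">
        List type "<sch:value-of select="$type"/>" is used more than once in this
        paragraph. A reference to "the second item" would be ambiguous;
        choose a list type not yet used here.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

The message does the work a style guide would otherwise do, which is rather the point: the XPath is trivial, but nobody can write it without first knowing why the 13 list types exist.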
Schematrons — and certainly Schematron Quick Fixes — are great for context-sensitive reminders of what's in a style guide, but they can't replace one. Nor can they replace an editor — an editor is the guy who will look through your content and explain, in broad strokes, what doesn't comply with the style guide and why. If you've created content consistently and with consistent errors, Schematron warnings could be numerous and therefore overwhelming; an editor will be able to summarise.
Of course, with enough time and code, there's a lot you can do to convert your numerous Schematron warnings into summaries, say, by eliminating duplicate errors, but in the end, an editor will be able to do that much more quickly while also being able to explain further if you don't understand the finer points.
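As a sketch of what that summarising code might look like, the following XSLT 2.0 stylesheet groups the messages in an SVRL validation report and prints each distinct message once, with a count, most frequent first. It assumes that your Schematron implementation writes standard SVRL output and that repeated problems produce identical message text; messages containing per-node values will not collapse into one group.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:svrl="http://purl.oclc.org/dsdl/svrl"
    exclude-result-prefixes="svrl">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Group failed asserts and fired reports by their normalised message text -->
    <xsl:for-each-group select="//svrl:failed-assert | //svrl:successful-report"
                        group-by="normalize-space(svrl:text)">
      <!-- Most frequent messages first -->
      <xsl:sort select="count(current-group())" data-type="number" order="descending"/>
      <!-- One line per distinct message: count, then the message itself -->
      <xsl:value-of select="concat(count(current-group()), ' x ',
                                   current-grouping-key(), '&#10;')"/>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

Grouping on the id attribute of the failed asserts would be more robust where the rules carry ids, but ids are optional in Schematron, so the sketch falls back on the message text. Either way, the output is a tally, not an explanation.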
And perhaps more importantly, if the style guide changes, the editor can take this into account without any coding whatsoever, and also spot why the style guide needs to change.
A Schematron, then, is a tool that aids rules expressed in style guides and enforced by an editor.
Queues Reinvented
So, what to do about the long queue and the chaos at McDonald's on Fleet Street I started this paper with? Well, if you haven't thought about it already, this is what everyone should do:
This is a fairly advanced queue numbering system display for a waiting room. Once you've picked a queue number from the machine, all you have to do is to wait for your turn. It's multiple lists merged into a single one, really — you won't ever risk picking the wrong queue, and you won't miss your turn. The semantics are clear and reasonably unambiguous.
I'm betting that a lot of thought and careful analysis went into designing this display and its underlying system. Instead of the long line or the chaos that is McDonald's on Fleet St during lunch hour, this simplifies the model (multiple lists are merged into a single one) and allows for a separation of concerns where the business rules help the end user to complete his or her tasks (waiting for your turn and finding the right counter) while being able to relax.
This, of course, is a paradoxical example, considering that it's a (mostly) technological solution to the queue problem opening this paper. Where is the style guide in all this? Glad you asked; it would have been easy to present the whole thing as a straight list[15]:
148 (6)
293 (8)
774 (3)
694 (4)
616 (10)
102 (9)
X (5)
602 (2)
X (7)
X (1)
This is a made-up example, of course, but my point should be clear. The style guide is involved:
- Don't display any unmanned counters.
- Show the latest update in a larger font.
- Limit the number of counters shown.
- ...
See how this works? Yes, it is probably entirely possible to check the above rules in, um, a Schematron and then enforce the findings by adding some XSLT and CSS[16], but the Schematron only checks what's already been done rather than telling you what to do before you start. We want to prevent the bad habits rather than catch them later!
Conclusions
You need to start with the style guide. The style guide should be an organic part of your information analysis (if you're starting out, it should be the first thing produced by the analysis) and later allow you to make informed choices when writing the schema. That, in turn, should allow the authors to use the new schema in the right way and in the correct style.
Ideally, this is how it should be done:
- Information analysis
- Style guide produced
- Schema produced (enforce structure)
- Schematron(s) produced (enforce style)
- Rinse and repeat until done
Authors can then produce content in the style prescribed by the style guide, the structure as described by the schema, and with Schematron rules highlighting problems with both. And ideally, with an editor making sure that it's all done properly.
End Note
Hoping to find a few examples of modern style guides, I searched Google for “online style guide”. The first several results were all about web design. I rest my case.
References
The Chicago Manual of Style. [online]. The University of Chicago Press. http://www.chicagomanualofstyle.org/home.html
William Strunk Jr. and E.B. White. Elements of Style, 3rd Edition. Simon & Schuster.
[1] The answer is that there are almost no queues. The people waiting have already ordered; they are waiting for their burgers to be ready. If it's your first time eating at McD, Fleet St, there's no way to know without pushing your way through to a counter. If you're like me, this is very disconcerting.
[2] This, actually, is the kind of thing that belongs in a style guide, not the schema. Let the para be optional but stress its importance in the style guide. But I'm getting ahead of myself.
[3] Unless you've worked in legal publishing.
[4] I'm not saying there's never a reason for complex models. Of course there is. It's just that in my experience, overmodelling is more common.
[5] In my experience, FrameMaker sources are especially vulnerable, paradoxically because FrameMaker templates can be used in a semi-structured way, thanks to how paragraph and character formats are defined.
[6] Although “bold italic” in a single emphasis type always made me suspicious.
[7] In a way, this is the easiest paper I've ever written. The one-stop solution is actually to write a style guide!
[8] How you write content will influence the schema, too, but above all, it's the kind of thing best explained in a style guide.
[9] This also led to me setting requirements for, and eventually writing, their SGML DTDs.
[10] Yes, I did use a red marker, and yes, the authors hated me.
[11] I'm not taking credit for all of it; we did work I'm very proud of to this day, but we also borrowed heavily from other style guides, such as Chicago Manual of Style, Strunk & White's Elements of Style, and many others.
[12] I was not part of this — I would have asked for style guides then, and most of the misery that followed would have been avoided. When I did come aboard, I asked for them, got some puzzled looks, and was eventually given the DTDs instead.
[13] And is at least partly responsible for the schema, if you're lucky.
[14] This is not entirely true; a clever Schematron can make things a lot easier if you have an inkling of the direction in which you need to go.
[15] Yes, the irony does not escape me.
[16] Schematron Quick Fixes for the win?