How to cite this paper
Nordström, Ari. “Tracking Toys (and Documents).” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Nordstrom01.
Balisage: The Markup Conference 2016
August 2 - 5, 2016
Balisage Paper: Tracking Toys (and Documents)
Ari Nordström
Ari Nordström is a freelance markup
geek, based in Göteborg, Sweden, but offering his services across a number of
borders. He has provided angled brackets and such to a number of organisations
and companies over the years, with LexisNexis UK being the latest. His favourite
XML specification remains XLink, and so quite a few of his frequent talks and
presentations on XML include various aspects of linking.
Ari is the proud owner and head
projectionist of Western Sweden's last functioning 35/70mm cinema, situated in
his garage, which should explain why he once wrote a paper on automating
commercial cinemas using XML. He now realises it's too late, however.
Copyright © Ari Nordström 2016
Abstract
This paper is about naming and tracking documents. It's about identifying changes
to them, and about the contexts where they are intended to be used. Formally, the
paper suggests that three characteristics, a base identifier, a version and a
rendition, are enough to sufficiently identify a document so its lifecycle can be
fully tracked.
The paper addresses some of the concerns and issues commonly used to argue against
this view, also taking a look at a real-life system that has put the principles into
practice.
Table of Contents
- Toys (and What Happens to Them)
- Documents (and What Happens to Them)
-
- Identifying the Box
- Identifying Change
- Identifying Context
- Contexts as Renditions
- The Semantic Document
-
- The Semantic Document
-
- Base
- Version
- Rendition
- A Few Asides
-
- IDs
- Reasons for Change
- Links
- Addressing Problems
-
- Versions
- Translations
-
- EU Regulations Example
- Modularisation
- Too Many Versions?
- Multi-level Versioning and Business Rules
- EU Example - Minimising Work
- Another Aside on IDs
- A Case Study
-
- Weaknesses
-
- The Many Versions Problem
- The “Use Latest” Problem
- The Master Language Problem
- Improvements
-
- Handling Versions
- Linking to Later Versions
- Why A Master Language?
- End Notes
-
- Profiling
- Passive Version Tracking
- FRBR
- Why This Paper?
- So?
Toys (and What Happens to Them)
Consider a cardboard box containing a boy's collection of toy cars. He had started
small, with half a dozen nondescript plastic cars that he loved playing with, and
scribbled “CARS” and “KEEP AWAY” all over the cardboard lid.
Whenever he had a friend over, he'd tell the friend to fetch the box labelled “CARS”,
and they would pour the contents on the floor and start playing.
Over time, his car collection grew and so, when playing, they would discard the oldest
cars as “childish”, plastic and large and obviously fake, and pick some
newer and more detailed die-cast ones instead. Later, they'd move on to models of
large trucks, and later still, to the cool Formula One models with spring-loaded engines.
And so on.
Eventually the boy's interests expanded to toy planes. The first few went into the
“CARS” box, but soon they moved to a new box, labelled “PLANES” (and
“KEEP AWAY”). Cars were still cool, though, and with that collection
still growing, a second cars box (“CARS 2”) was added.
With the growing collections of cars and planes, still more boxes were required. There
were smaller boxes in addition to the larger ones, with labels such as
“TRUCKS” and “FORMULA ONE” and “WAR PLANES”,
and it was now easier than ever to quickly pick the right toy set for the right buddy.
Mum, though, working quietly in the background as mums often do, decided that the
oldest plastic cars were no longer of interest and so they ended up in yet another box,
“OLD TOYS”, discreetly hidden in the attic.
See where this is heading yet? This paper is about naming and identifying things;
specifically, it's about naming documents, and the toy box metaphor seems like a good
idea.
Documents (and What Happens to Them)
A document, just like the cardboard box with the toy cars, is essentially just a
container, labelled to differentiate it from other containers. And just like the box,
its contents change and evolve over time. Is there a way to label the document, so
that it remains usable throughout the lifespan of its contents?
Identifying the Box
Like the cardboard boxes with toys, documents need to be named so their contents
can be found later. Usually, the name tells us what the contents are about, so
it is reasonable to include the name of the product being described. If there are
variants or models (car models, say), those might need to be included, too.
And, just as with the cardboard boxes, a type identifier might prove useful.
“Cars”, “planes”, “formula one”, “user guide”, and “spare parts”
are all reasonable labels.
But is the maker or manufacturer
important to include in a name? Quite a few documents out there probably say things
like “my file”, but is that a good idea?
Surely, a document should have a description of what it is about in its name. A
manual for assembling a toy car, for example, should probably have that in the name:
“Jaguar E-Type Model, Assembly Instructions”. But what if the
manual is produced in a modern fashion, in modularised form, where the toy
manufacturer writes everything in reusable modules, so that the section about the
necessary tools is common not only to model cars but probably also to model ships,
planes and tanks?
Identifying Change
What if the assembly manual was regularly updated to include the changes to the
model car? How would those changes be identified? Should they be in the name?
Identifying change is about helping readers to know which version of a document to
get. Would it therefore make sense to include a version label in the name? Would it
be enough to use ordinal numbers like “Manual 1”, “Manual 2”, and so on? Or would
it be better to date the new versions and include that?
Personally, I get nervous if I see a version label on a cover. Do I have the right
version? Do I have the latest? What version is the product? I always end up thinking:
wouldn't it make more sense to make sure that the reader got the right
version to begin with? Versions are needed by the writers to know
which version to attach to which product, and which version to translate; the
readers have no real use for the information.
But on the other hand, consider that box of old cars mum put in the attic: if it
only said “OLD TOYS” on the lid and there were others like it, some
considerable work would be required if you ever wanted to get rid of some of your
old stuff. Rummaging through the attic is never any fun.
So the answer is both yes and no. It's useful to know which one of the two
versions of a document you have in front of you is the old one, but at the same
time, there is no way for you, the reader, to know if there is a still later version
than the two that you have.
Identifying Context
If the model car assembly manual was translated to a number of languages, should
the target languages be included in the name? Is it relevant to include, say,
“Swedish”, in the name?
The language identifies a context in which the manual is
intended to be used; often, a language and country combination is needed to identify
not only the language but also the intended market. Swedish, for example, is a
language spoken in both Sweden and Finland, and while the language is mostly the
same in both cases, the countries are not.
Or consider Märklin model train sets. They come in different sizes, so getting the
right size for your model railway is of vital importance.
Unlike the version labels, I would argue that identifying the context is important
information to the user and should be included.
Contexts as Renditions
Consider an image of that E-Type Jaguar model. If printed, the image's visual
contents and its name are one and the same if you know your cars, but a short,
descriptive caption (“Jaguar E-Type”) would still be needed to identify
it for someone who doesn't.
Similarly, a version (“1962” or “1970”, say) will be
useful for anyone without the necessary knowledge.
If you are a kid and want it up on the wall, however, the context needs to be
“paper” and that's about it. But what if you need to print it from
your computer? Obviously, if you are a kid, you won't care much about the file
format—the contents are important, not how they are rendered—but to get a
sufficiently high resolution using the right software, in other words, when you need
to achieve something specific with the contents, the rendition becomes
crucial.
Note
I'm borrowing the term “rendition” from the world of graphics,
where the same image might be rendered as a JPG or a PNG and defined as
equivalent in terms of content.
All this is equally true with documents. The name describes the abstract contents
of the container, the version how they change over time, and the language (and,
frequently, country, region or other parameter affecting the exact context) how they
are rendered.
The Semantic Document
So, how would we actually go about naming the containers? I'd think the above suggests
something like this:
- Name the container with some kind of base identification so we can find it.
- Add metadata that formalises how changes to the contents can be described, so
we can keep track of them.
- Include a category or rendition (“H0 and Z Gauge” or “Swedish”) that makes the
contents usable for a specific audience or situation.
The Semantic Document
Or, expressing the above in a more concise manner:
BASE:VERSION:RENDITION
- BASE is the basic label on the box. It might be a simple
category (“CARS”) or convey some meaning, such as a brand name
and model (“Ford Mondeo”), but by itself, it is really just an
identifier of the contents. We know, sort of, what the contents are, in a
comfortingly abstract way.
- VERSION gives those abstract contents a defined shape at
some point in time (such as “Model Year 2008”); in short, it
describes the change of the semantic document contents over time.
- RENDITION is the type of output, the rendition required to
make the versioned semantic document usable for a specific audience (“EU
Market, left-hand drive”).
There is no right or wrong here. It could, for example, be argued that
“Mondeo” is a Ford version rather than part of the base name.
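As a minimal sketch, the three-part name could be modelled as a small value type. The class name DocumentName, the colon-separated string form, and the second rendition in the example are my assumptions for illustration only; nothing here prescribes a concrete syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentName:
    """A BASE:VERSION:RENDITION name (hypothetical helper for illustration)."""
    base: str
    version: str
    rendition: str

    @classmethod
    def parse(cls, text: str) -> "DocumentName":
        # Assumption: none of the three parts contains a colon.
        parts = text.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected BASE:VERSION:RENDITION, got {text!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.base}:{self.version}:{self.rendition}"

    def same_document(self, other: "DocumentName") -> bool:
        # Same box, regardless of version or rendition.
        return self.base == other.base

    def equivalent_to(self, other: "DocumentName") -> bool:
        # Renditions of the same base and version are defined as equivalent.
        return self.base == other.base and self.version == other.version


eu = DocumentName.parse("Ford Mondeo:Model Year 2008:EU Market, left-hand drive")
# The right-hand drive rendition below is made up for the example.
uk = DocumentName("Ford Mondeo", "Model Year 2008", "UK Market, right-hand drive")
print(eu.same_document(uk), eu.equivalent_to(uk))
```

Keeping the three parts separate, rather than as one opaque string, is what lets a system ask "same box?" and "equivalent rendition?" as two different questions.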
Base
The BASE:VERSION:RENDITION notation suggests a minimum
set of semantics, but there is nothing wrong with providing additional intelligence
to the base identifier. For example:
A product called “123” might be described using three different
document types: user guide, service guide, and spare parts. BASE,
in this case, could then identify the product AND the document type, giving three
base identifiers: “123UG”, “123SG”, and “123SP”.
Version
Adding a version can be a question of using an ordinal number (1, 2, 3, 4, 5,
...); after all, they are very easy to understand for man and machine alike. In
many cases, a multilevel construction as used in software (1.1, 1.2, 1.3, 1.4,
2.0, ...) might be better.
A date is probably not a good idea as it is nearly impossible to know if
“2008-11-11” is the first version or the tenth, at least
without a context.
Rendition
The most obvious example of a rendition for documents is
the language (“sv”, “en”, “de”, ...) a
manual is published in, but another perfectly good example is provided by the
“EU Market, left-hand drive” example, above.
Or going back to toys, a rendition can be a size for a specific use case.
Märklin model trains, for example, come in three sizes, “Z Gauge”,
“H0”, and “1 Gauge”. Or to put it differently: as
we all know, it's no fun when little brother wants to add his big plastic Volvo
to your race track day.
A Few Asides
Before I move to addressing some of the more obvious issues with this naming schema,
allow me to quickly discuss a few related topics.
IDs
The observant reader will no doubt have thought about identifiers beyond those
naming the container by now. What about identifying specific parts of the contents?
In toy parlance, the box “CARS” might contain smaller boxes, say, one
for each car, and so there would be a “Ford Mondeo MY2008 EU Market, left-hand
drive” box, a “Jaguar E-Type MY 1961 Roadster” box, and so on.
I'd argue that the definition of a container is mostly an arbitrary one and will
likely change over time. The contents are important and should be considered, but
above all, we should always be able to uniquely identify the container, whenever and
however it changes. What's inside is uniquely identifiable by first identifying the
container and then the particular object of interest inside (the Jaguar
E-Type stored in “CARS”).
Jaguar E-Type is what we call a structural identifier; if the
boy's got several E-Types but only one in “CARS”, then the structural
identifier doesn't need to be unique across the toy collection, only inside
“CARS”; the context is enough. Also, with only one E-Type inside
“CARS”, we don't actually have to use the full identifier to find
it even if it does have one.
On the other hand, if that E-Type proves to be the boy's favourite toy, he might
well want to store it in its own box, separately from the others. Once outside, it
would have to be referred to using its full identifier, especially if there were
other Jaguars. The boy, of course, would probably just refer to it as “the Jag”.
And if the boy's list of favourites grows over time, he might want to create a
“favourite cars” box and put it there, in which case a structural
identifier combined with the unique container name would again be enough.
Reasons for Change
A new version signifies a change and should happen whenever there is a
significant change to the contents (begging the question
“what is a significant change?”). When the E-Type toy is moved to
its own box, or perhaps to that box with favourites, the old box changes and gets
a new version.
The definition of “change” is a bit like the definition of
“done” to agile practitioners or starving artists, and so hard to
offer here. I tend to go with anything beyond pretty-printing a document.
My definition of change, of course, means that there will be a lot of versions,
most of which are either significant only to me or not significant at all, beyond
me hitting “Save”. A precious few are of interest to others, and this is
where workflows come into play.
A change in the workflow is marked by a status identifier: “ready to be
reviewed”, “approved”, “rejected”, “ready to be translated”, “published”,
etc. A document could move from “draft” to “published” without change and thus in a
single version, but on the other hand, reaching “published” might just
as well require two dozen versions. None of the versions
between the two workflow stages is of any interest to anyone other than those
involved. Workflows, then, can be used to scope a name (“only use published
versions”).
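Scoping a name by workflow status can be sketched as a simple filter over a version history. The StoredVersion record and the latest_with_status function are hypothetical names of my own, and the history is invented for the example.

```python
from typing import NamedTuple, Optional, Sequence


class StoredVersion(NamedTuple):
    version: int   # ordinal version number
    status: str    # workflow status, e.g. "draft" or "published"


def latest_with_status(history: Sequence[StoredVersion],
                       status: str) -> Optional[StoredVersion]:
    """Return the highest-numbered version carrying the given workflow status."""
    matching = [v for v in history if v.status == status]
    return max(matching, key=lambda v: v.version, default=None)


# Invented history: six saves, only one of which is of interest to others.
history = [
    StoredVersion(1, "draft"),
    StoredVersion(2, "ready to be reviewed"),
    StoredVersion(3, "rejected"),
    StoredVersion(4, "approved"),
    StoredVersion(5, "published"),
    StoredVersion(6, "draft"),
]
print(latest_with_status(history, "published"))
```

The point of the sketch is that every version stays in the history, while a rule such as "only use published versions" decides which of them a name actually resolves to.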
Links
Finding the E-Type Jag is achieved using a link. The boy will
not think of it as such, but when warning his friend to “not touch the Jag in
the CARS box”, a link is what he uses to get his message across.
A link, of course, would build on the same rules for applicability and scoping as
the name itself. However, a link used in an editing situation (“don't touch
the Jag on the floor”) would depend on context while a more stringent
link to an outside observer (“mum, don't touch the E-Type Jaguar in the CARS
box if you go into my room”) is meant to be persistent and needs to be
better defined.
Addressing Problems
Naming happens on several levels. A semantic document would keep its base identifier
for as long as the contents remain related to the original. For example, for as long
as a user manual is about “Volvo V70”, the new versions (model years) would happen to
the same base document.
If there was a significant change (“Volvo 850” to “Volvo V70”), the change might
result in either what in software circles is known as a
“fork” (a new version 1 based on a specific version; in other words,
something new based on an existing version but going off in a new direction) or an
entirely new document (for obvious reasons).
Versions
Versions indicate change, so in the car world, this would actually correspond to
manufacturing weeks in addition to model years (as opposed to actual years). For
example, MY2003 expressed using this schema might be MY2003 w46 (in reality a model
that started production in 2002), MY2003 w12, MY2003 w24, and MY2003 w38. MY2004
would happen in 2003, week 46.
A versioning schema for the Volvo V70 might then be defined as 2003:46, 2003:12,
2003:24, etc. Of course, what the end user sees is still “2003”.
What this means is that the version label seen by the
customer need not be the same as the version seen by the service technician.
Versions, then, are better off scoped. Workflow stages (see section “Reasons for Change”) can help, but so can multilevel
versioning (see section “Multi-level Versioning and Business Rules”) and naming abstractions (see id-naming-abstractions).
Translations
Frequently, problems with translations really are about the lack of proper
versioning and/or naming. Consider, for example, multinational companies producing
content for multiple markets and multiple languages. A manual may be first written
and published in one language, say, Swedish, but then translated to another language
and market, say, China and APAC, where the contents are further developed before
publication to accommodate product changes specific to that market.
After some time, the product is then sold to a third market. It is translated from
the second, and customised once again to meet the requirements of that
market.
Meanwhile, the original Swedish version has been updated, independently from the
other markets, and if they, at that point, decide to use product features developed
for the other markets, the documentation from the other market(s) will not be
reusable.
This is where the naming schema and regarding translations as renditions will
help. Returning to the car model year and manufacturing week example, translations
would be based on a specific exact version (rather than a scoped one), so
2004:46:sv-SE should be equivalent to 2004:46:en-GB rather than simply MY2004 sv-SE
to MY2004 en-GB. The exact version schema would ensure equivalence. These would be different renditions of the same basic abstract
manual.
Note that the rendition is simply defined as such; a GB
manual would describe a right-hand drive car and therefore not be an exact match.
Similarly, the GB model might have any number of standard features not present in
the SE version or vice versa (a Spanish car would, for example, probably not have
heated seats as standard).
EU Regulations Example
While translations are renditions, there might be a need to develop a
translation more or less independently from the original language. EU
regulations for different member countries provide a good example of this. If a
French-language translation for France from an English-language original
requires content development, it could either be a development within the same
version or an independently developed fork.
Here, the English-language original identified as DocA is translated to French
and developed as a variant that is still seen as DocA:
DocA en-GB v1 written  =>  DocA fr-FR translation of v1
                           DocA fr-FR v1-1
                           DocA fr-FR v1-2
                           DocA fr-FR v1-3
                           ...
DocA en-GB v2
...
Here, on the other hand, the English-language DocA is translated but then
developed independently as DocB:
DocA en-GB v1 written  =>  DocA fr-FR translation of v1
                           DocB fr-FR v1
                           DocB fr-FR v2
                           DocB fr-FR v3
                           ...
DocA en-GB v2              DocB fr-FR v9
In the case of EU, the former variant is more likely. A regulation is adapted
and developed for each member country until a new version comes out, at which
point the translations need to be "merged" with the main version line.
While DocB starts out as a fork of DocA, this relationship is never explicit
in the base identifier. Instead, it might be a relationship kept track of by a
system.
Modularisation
Let's say that DocA:2:en-GB (see above) is modularised into a root file and two
modules, like so:
DocA:3:en-GB (root; modularisation is a change so a new version)
|
|--DocC:1:en-GB
|--DocD:1:en-GB
When the new DocA is translated to French, this happens:
DocA:3:fr-FR
|
|--DocC:1:fr-FR
|--DocD:1:fr-FR
In other words, the root module and both of the new modules are translated. The
root, of course, might simply comprise links to the modules and so the translation
might happen automatically.
If a new document, DocE, is written and reuses modules like this:
DocE:1:en-GB
|
|--DocC:1:en-GB
|--DocF:1:en-GB
The translation to French would be
DocE:1:fr-FR
|
|--DocC:1:fr-FR
|--DocF:1:fr-FR
The root, if comprising links only, would be automatically translatable, simply by
changing the links from the en-GB renditions to the fr-FR renditions; DocC:1:fr-FR
would already exist, as it was translated in the previous run; and so the only thing
remaining would be to translate the new DocF:1:en-GB module.
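The DocE case can be sketched in a few lines: rewrite the root's links to the target rendition, then see which targets do not exist yet. The function names to_rendition and plan_translation, and the idea of keeping existing renditions in a plain set, are assumptions of mine for illustration.

```python
def to_rendition(name: str, rendition: str) -> str:
    """Swap the RENDITION part of a BASE:VERSION:RENDITION name."""
    base, version, _ = name.split(":")
    return f"{base}:{version}:{rendition}"


def plan_translation(module_links, target, existing):
    """Rewrite a root's module links to the target rendition.

    Returns the rewritten links and the source modules that still
    have no rendition in the target language.
    """
    rewritten = [to_rendition(link, target) for link in module_links]
    missing = [src for src, dst in zip(module_links, rewritten)
               if dst not in existing]
    return rewritten, missing


# Renditions that exist after DocA was translated in the previous run.
existing = {"DocA:3:fr-FR", "DocC:1:fr-FR", "DocD:1:fr-FR"}

links, todo = plan_translation(["DocC:1:en-GB", "DocF:1:en-GB"], "fr-FR", existing)
print(links)  # the root's links, now pointing at fr-FR renditions
print(todo)   # source modules still waiting for a translator
```

Because the exact version is part of every name, "already translated" is a simple membership test rather than a judgement call.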
Too Many Versions?
As mentioned above, a new version happens when there is a significant change in
contents. Would this mean that a document that was properly modularised from the
start would be hell to keep updated and translated? After all, if DocC:1:en-GB is
updated to DocC:2:en-GB, doesn't that mean that DocE needs to be updated so the link
points to the new version (with the same applicable to every other document using
DocC)?
So if this happens:
DocE:1:en-GB
|
|--DocC:1:en-GB
|--DocF:1:en-GB
is reworked to
DocE:192:en-GB
|
|--DocC:28:en-GB
|--DocF:45:en-GB
How many times does the root document need to be updated? Translated? 2? 5? 40?
100?
There are several problems in play here:
-
Poor modularisation (and configuration management)
-
Translations happen before everything is ready
-
Automated systems bump versions for everything (or there is not enough
systems support)
The real problem here is the assumption that every version is important. Most
aren't, as mentioned in section “Versions”. Let's have a closer look.
Multi-level Versioning and Business Rules
Remember, a translation is a predefined rendition. That is, we
are saying “this content is equivalent to that content”, not
“this content is identical to that content”. So use several levels of versioning to
scope a link. Assuming these versions:
1.0.0
1.0.1
1.0.2
1.1.0
1.1.1
1.2.0
1.2.1
1.2.2
2.0.0
Define business rules saying that a new translation, or link, is only required for
integer versions (or decimal versions, or ...).
This, of course, is equally applicable to modularisation without translation. We
could simply say that a link made to DocC v 1.0.2 would be valid for all 1.x.x
versions and the business logic could then automatically use the latest 1.x.x
version. The same would be applicable for translations.
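Such a business rule might look like the following sketch, which uses the version list above. The function names and the levels parameter (how many leading components define the scope) are hypothetical; a real system would presumably also take workflow status into account.

```python
def parse_version(text: str) -> tuple:
    """Turn '1.2.0' into (1, 2, 0) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def resolve_link(linked: str, available: list, levels: int = 1) -> str:
    """Return the latest available version that shares the first `levels`
    components with the version the link was originally made to."""
    scope = parse_version(linked)[:levels]
    in_scope = [v for v in available if parse_version(v)[:levels] == scope]
    if not in_scope:
        raise LookupError(f"no available version in scope for {linked}")
    return max(in_scope, key=parse_version)


def needs_new_translation(translated: str, current: str, levels: int = 1) -> bool:
    """A new translation is only required when the scoped part changes."""
    return parse_version(translated)[:levels] != parse_version(current)[:levels]


AVAILABLE = ["1.0.0", "1.0.1", "1.0.2", "1.1.0", "1.1.1",
             "1.2.0", "1.2.1", "1.2.2", "2.0.0"]

print(resolve_link("1.0.2", AVAILABLE))            # scoped to 1.x.x
print(resolve_link("1.0.2", AVAILABLE, levels=2))  # scoped to 1.0.x
print(needs_new_translation("1.0.2", "2.0.0"))
```

Choosing levels=1 corresponds to the "integer versions" rule; levels=2 corresponds to the stricter "decimal versions" variant.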
EU Example - Minimising Work
The multilevel versioning provides an easy solution to the EU translation example,
while retaining the all-important link between renditions. A business rule could
simply say that every 1.x rendition is equivalent to every other 1.x
rendition.
A clever modularisation of the document could then minimise the impact in terms of
the number of member country-dependent changes by placing the contents requiring
updates into a single module.
Let's say we have this slightly more complex document:
DocG:3:en-GB
|
|--DocC:1:en-GB
|  |--DocX:1:en-GB
|
|--DocD:1:en-GB
|
|--DocH:5:en-GB
|
|--DocM:7:en-GB
|  |--DocX:1:en-GB
|
|--DocP:3:en-GB
   |--DocX:1:en-GB
Here we have DocC, DocM and DocP all linking to the common module, DocX. Of
course, the links would not reuse the entire document, only
selected parts, expressed as fragment identifiers like so:
DocC:1:en-GB => DocX:1:en-GB#id1
DocM:7:en-GB => DocX:1:en-GB#id2
DocP:3:en-GB => DocX:1:en-GB#id3
This results in limiting the member country-dependent changes to a single module,
DocX, that could be updated when required. The links from the source documents would
all point at the integer version 1 of DocX, meaning that every update within the 1.x
series could be defined as an equivalent link in the business rules.
Note
Of course, a decent system would be able to keep track of DocX 1.x during
development, assigning workflow statuses to the development versions so a draft
would never find its way to a published document.
Another Aside on IDs
Document IDs need to be unique, of course. How else would we find the right box?
Structural IDs, however, do not. Internal links (#id) would be unique inside the
module, but there is no need to enforce uniqueness outside it since any pointers
from outside include the document ID that is unique:
DocC:1:en-GB => DocX:1:en-GB#id1
DocM:7:en-GB => DocX:1:en-GB#id2
DocP:3:en-GB => DocX:1:en-GB#id3
Problems may ensue when merging modules or when modularising documents; in both
cases, there is a need to check for uniqueness and possibly handle broken links from
elsewhere.
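A rough sketch of both points follows: a link resolves through the document ID first and the structural ID second, and a merge needs a clash check. The store layout, the function names, and the DocY module are invented for the example.

```python
def split_link(link: str):
    """Split 'DocX:1:en-GB#id1' into the document ID and the structural ID."""
    document_id, _, fragment = link.partition("#")
    return document_id, fragment


def resolve(link: str, store: dict) -> str:
    """Find the container first, then the object of interest inside it."""
    document_id, fragment = split_link(link)
    return store[document_id][fragment]


def id_clashes(ids_a, ids_b) -> list:
    """Structural IDs that would no longer be unique if two modules were merged."""
    return sorted(set(ids_a) & set(ids_b))


# Hypothetical content store; DocY is made up to show a harmless duplicate ID.
store = {
    "DocX:1:en-GB": {"id1": "para for DocC", "id2": "para for DocM", "id3": "para for DocP"},
    "DocY:1:en-GB": {"id1": "unrelated para"},
}

print(resolve("DocX:1:en-GB#id1", store))
print(resolve("DocY:1:en-GB#id1", store))   # same structural ID, different box
print(id_clashes(store["DocX:1:en-GB"], store["DocY:1:en-GB"]))
```

The duplicate "id1" is harmless while the two modules stay separate; it only becomes a problem that needs handling at the moment they are merged.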
A Case Study
The principles outlined in this paper have been put into practice far beyond my son's
toy collection. This section outlines a system I've been involved in developing.
The basic system was conceived from the start to offer full traceability in large,
modularised technical documents that are translated to a dozen or so languages. The
system therefore keeps a full version history of each and every resource in the system,
which means that the document modules are linked to using specific versions of each
module. Exact versions are also used when translating the modules, which means that
in theory, reuse is very efficient.
The system identifies its resources using a URN schema that formalises the
BASE:VERSION:RENDITION concept, but also generates structural
IDs for the document contents. Thus, every link to structured content uses the form
URN#ID, which means that reuse is allowed on document fragments, too.
The structural IDs are kept intact when translating the modules, so a link to a Swedish
document fragment URN:sv-SE#ID is equivalent to an English-language one,
URN:en-GB#ID. All that is needed is to replace the language-country
code (and, obviously, to translate the contents).
Note
The Swedish and English versions are defined to be
equivalent, not identical, which means that apart from the key nodes (with the IDs),
some differences may exist.
Images and media are also linked to using URNs, and thus translated in the same way,
by changing the language-country code (and the contents), unless they are
language-neutral.
Weaknesses
In spite of the precise versioning capabilities, the system had several main
weaknesses when it was built:
The Many Versions Problem
First and foremost, while the user was able to decide when to check in (and
out) a module and thereby create a new version, this would nevertheless result
in a lot of versions; it was not unusual for frequently
revised modules to have dozens or even hundreds of versions.
While the system had some degree of scoping in that it allowed the user to
define workflow statuses for each stored resource, which in theory should have
been used to limit linking to approved versions only, the workflow feature was
manual and underused. This resulted in links frequently pointing to old versions
of modules because even a slight change to a module would result in a new
version, which, because specific versions of modules were always linked to and
translated, would require the link to the module to be updated and the module
itself to be translated again.
The “Use Latest” Problem
The very fact that links were made to a specific version also made it
difficult to always link to the latest version, something frequently asked for
by the users. For example, if an image link should always be the latest
(approved) version, the way the system was built meant that if the image link
was updated, the link to the module with the image would have to be updated, and
so on, all the way to the root.
This was doable, of course, if none of the ancestors to
the image had been reused elsewhere, in an incompatible way. For example, let's
say we started with a document, RootX, linking to modules
as follows, with ModuleA in version 5 as its immediate
child:
RootX-v2 => ModuleA-v5 => ModuleB-v9 => Image-v1
The child of RootX, ModuleA, was
then modified to version 8 and used in another document,
RootY:
RootY-v1 => ModuleA-v8 => ModuleB-v9 => Image-v1
As we can see, the modules linked to by ModuleA are
unchanged; only the contents of ModuleA have been
updated.
If RootX required a new version of the image, however,
this would cause a problem since updating the image to v2 would mean updating
ModuleB, which was fine, but also
ModuleA (and RootX).
But ModuleA had already been updated so it could be used
by RootY, and so the next available version
would be version 9. This is fine, of course, but would then create a
similar situation for RootY if it was updated.
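The cascade can be sketched as follows. This is only an illustration of the bookkeeping, not the system's actual code; the bump_chain function and the dictionary of highest existing versions are my own assumptions.

```python
def bump_chain(chain, latest):
    """Give every resource on the path from root to leaf its next free version.

    chain:  (name, linked_version) pairs, root first, leaf last.
    latest: highest version number that already exists for each resource.
    Returns the new chain and the updated 'highest existing version' table.
    """
    next_free = dict(latest)          # work on a copy, leave the input untouched
    bumped = []
    for name, _linked_version in chain:
        next_free[name] += 1          # exact-version links force a new version
        bumped.append((name, next_free[name]))
    return bumped, next_free


# RootX-v2 => ModuleA-v5 => ModuleB-v9 => Image-v1
chain_x = [("RootX", 2), ("ModuleA", 5), ("ModuleB", 9), ("Image", 1)]
# ModuleA already exists in version 8 because of RootY.
latest = {"RootX": 2, "ModuleA": 8, "ModuleB": 9, "Image": 1}

new_chain, latest_after = bump_chain(chain_x, latest)
print(new_chain)
```

One changed image thus produces four new versions, and the version RootX ends up linking to skips past the versions of ModuleA that were created for RootY.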
The Master Language Problem
Finally, the system had the concept of a “master language”, a
language used when producing new content. The system had a feature to build a
“translation package” from a master, which essentially meant
creating a zip package of the modules to be translated (minus those that had
already been translated), copying them verbatim in every respect but the xml:lang
attribute in the root element, and logging the event in the database so
the system would know it had translations to be uploaded back at some point in
the future.
This seemed like a good idea at the time but quickly became difficult to
handle. For example, if the master language was set to Swedish, that was the
language the translation packages sent to the translation agency had. There was
no way to base a translation on, say, English, which would have made
translations cheaper and easier.
The master language concept, obviously, also made it more difficult to
collaborate in more than one language, something that rather unexpectedly became
a deal-breaker.
Improvements
The system, then, actually had issues with several concepts introduced above. Does
this mean that the idea of a semantic document, labelled using a
BASE:VERSION:RENDITION schema, doesn't work after
all?
Handling Versions
We never managed to address the many versions problem fully, but did handle
the resulting “retranslation required because of a version bump”
problem in a rather simple but quite effective way:
For the case where a module had been updated, thus breaking the coupling
with the existing translations, we added a feature allowing the user to see the
master language and its available, but old, translations in a table, one column
per language, select a target version and an old translation, and then
investigate if the old translation was possible to “bump” to the
selected later version.
A few business rules were tested to make sure that the “bump” was
structurally feasible, and the system then ran an XProc pipeline that updated any links
inside the module to match the master language's. The pipeline would also
attempt to include any missing content from the master language, but using some
very restrictive business rules.
The idea was simple and based on the fact that any translation is
defined as being equivalent with the original; it does
not have to be identical. This allowed the user to assume responsibility for
the bumped translation being equivalent to the selected later version, and in most cases,
the users would then update the translation manually for the usually minor bits
that did not match.
The system's major weakness, where the versioning was straight
and not scoped in any way, was never addressed, however. For a comprehensive
solution to this problem, see id-ml-paper.
Linking to Later Versions
The “use latest” problem was never fully addressed either. We
side-stepped the problem by adding the notion of “forks” where a
given module version would be copied and used as the basis of a new module, with
that module's versioning then moving in a separate direction from the original.
The forking functionality did take care of some of the problems but did not
really fully address the original problem, namely to be able to easily update a
leaf node such as an image to the latest version without having to update the
versions of every ancestor accordingly.
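The ripple effect behind the original problem can be illustrated with a small sketch. It assumes, for the sake of the example, that every link names a specific version of its target and that each node has a single parent; the function name and the data layout are hypothetical.

```python
def bump_leaf(versions, parents, leaf):
    """Bump a leaf and every ancestor that links to it by version.

    versions: dict mapping a base identifier to its current version
    parents:  dict mapping a child's base identifier to its parent's
    Returns the updated versions and the list of nodes that changed.
    """
    updated = dict(versions)
    touched = []
    node = leaf
    while node is not None:
        # A new version of the child changes the link in the parent,
        # which in turn makes a new version of the parent necessary.
        updated[node] += 1
        touched.append(node)
        node = parents.get(node)
    return updated, touched


versions = {"manual": 4, "chapter": 7, "section": 2, "image": 1}
parents = {"image": "section", "section": "chapter", "chapter": "manual"}
updated, touched = bump_leaf(versions, parents, "image")
```

One changed image results in four new versions in this example, which is exactly the cost the users wanted to avoid.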
A more sensible approach would have been to use the kind of scoped versioning
discussed in section “Multi-level Versioning and Business Rules” (with a more comprehensive solution
suggested in [id-ml-paper]).
Why A Master Language?
The “master language” problem, finally, was not a problem with
the approach but the system itself. The system was simply built that way,
assuming that content production would always be in a
single language and the translations in all of the others. The functionality to
create translation packages, essentially copies of the originals but with
updated language attributes and information, was then designed upon that first
decision.
In other words, there is nothing in the naming principles discussed here that
even hints at a single master language. Quite the opposite; the situation in the
table below actually makes more sense. “P” stands for “production” while the
empty cells mean that the content is translated to that language.
Table I
sv-SE | en-GB | en-US | pl-PL | fr-FR |
P     |       |       |       |       |
      | P     |       |       |       |
      |       | P     |       |       |
P     |       |       |       |       |
      |       |       |       | P     |
Of course, a system built on these principles would allow creating a
translation package from any existing language.
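As a rough sketch of that last point, a translation package can be produced from whichever language the content happens to exist in. The example assumes that modules are XML with `xml:lang` attributes; the `translated-from` attribute and the function name are invented for the illustration and are not taken from the system described above.

```python
import copy
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_translation_package(source, target_lang):
    """Copy a module and relabel its language, whatever the source language is."""
    package = copy.deepcopy(source)
    # Hypothetical bookkeeping attribute recording the production language.
    package.set("translated-from", source.get(XML_LANG, ""))
    for element in package.iter():
        if XML_LANG in element.attrib:
            element.set(XML_LANG, target_lang)
    return package


polish = ET.fromstring('<module xml:lang="pl-PL"><title>Instrukcja</title></module>')
french_package = make_translation_package(polish, "fr-FR")
```

Nothing in the function knows or cares which language is the “master”; the source is simply the rendition that was produced rather than translated.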
End Notes
Profiling
Efficient modularisation of documents requires being able to
profile the contents, marking up sections to be about the
different variants covered by the module and then showing and hiding them when
publishing the document, depending on the publishing context. This practice also
includes features like variable text strings so a variant name can be included in
the content when using a specific profile.
Profiling is beyond the scope of this paper, but I presented a fuller treatment of
how to manage profiles a number of years ago. See [id-naming-abstractions].
Passive Version Tracking
The ideas described in this paper are all about labelling and annotating
containers, a play with the labels that identify those toy boxes with cars (or the
boxes inside the larger boxes). The cardboard box metaphor is best visualised by the
labels taped on the boxes, but since it's just a metaphor, there's nothing to stop
us from moving the labels to a more convenient location. In other words, indirection,
a sort of “out-of-line versioning”.
We can place imaginary labels on the boxes, on the cars, on groups of cars, and so
on (think registers), so we can also remotely track the changes. If you think about
the ideas presented here, they all observe and react to the
changes to the content; they do not actually cause them.
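A very small sketch of such a register might look like this. It is an illustration under loud assumptions: change is detected by comparing content digests, every observed change counts as a new version (in practice a user decides what is significant), renditions are left out, and the class and method names are hypothetical.

```python
import hashlib


class LabelRegister:
    """Out-of-line labels: the register observes content, it never edits it."""

    def __init__(self):
        # base identifier -> list of content digests, one per observed version
        self._history = {}

    def observe(self, base, content):
        """Record the content if it changed and return its BASE:VERSION label."""
        digest = hashlib.sha256(content).hexdigest()
        history = self._history.setdefault(base, [])
        if not history or history[-1] != digest:
            history.append(digest)
        return f"{base}:{len(history)}"


register = LabelRegister()
first = register.observe("toy-car", b"red paint")
same = register.observe("toy-car", b"red paint")
changed = register.observe("toy-car", b"blue paint")
```

The content itself carries no label at all; the register alone knows that the second observation is still version 1 and that the third is version 2.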
FRBR
Two reviewers suggested I should not finish this paper without considering
Functional Requirements for Bibliographic Records (FRBR) (see [id-frbr]). I must
confess that FRBR is mostly new to me. I have heard of it before, somewhere, but I
never made the connection.
There are similarities between what I propose and FRBR, in that there is a concept
that reminds me of the abstract, rendition-less content that is my base document,
and there is a concept similar to a rendition (called a “manifestation”, a
phrase I like). There is also what is to some extent
an equivalent, or at least similar, to a version (called an “expression”, a
phrase I don't like), but while I think I
understand where the authors of FRBR are coming from, I don't think FRBR is an
equivalent, and here is why:
FRBR also describes what they call “derivative works”. These may
include, say, annotated versions of a movie script but also any other version of the
“base work”, and while there is technically nothing to stop
an author from including a “derivative work” such as an annotated
script as a new version in what I propose here, my proposal is not about including
any kind of derivative work as a version. My versions are very strictly about change
to the document, not interpretations of it.
My proposal (and practice; I have implemented what I describe here in several live
systems) is about a base document: an abstract document with neither a context
requiring a specific rendition nor a version identifying its change over time.
That base document is then changed over time, in increments where each
increment is defined by a user as a significant (and therefore
saved) new version, and where any of those user-identified versions can then be
rendered in a specific way, depending on context.
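The three levels can be written down as a tiny data model. The sketch below assumes a colon-separated `BASE:VERSION:RENDITION` label with integer versions and no colons inside the parts; the `DocumentId` type and `parse_label` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class DocumentId(NamedTuple):
    base: str                        # the abstract document
    version: Optional[int] = None    # a user-defined, saved increment
    rendition: Optional[str] = None  # a context-specific rendering of a version

    def __str__(self):
        return ":".join(str(part) for part in self if part is not None)


def parse_label(label):
    """Parse "BASE", "BASE:VERSION" or "BASE:VERSION:RENDITION"."""
    parts = label.split(":")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"not a valid label: {label!r}")
    version = int(parts[1]) if len(parts) > 1 else None
    rendition = parts[2] if len(parts) > 2 else None
    return DocumentId(parts[0], version, rendition)


abstract = parse_label("manual")
saved = parse_label("manual:3")
rendered = parse_label("manual:3:sv-SE")
```

A rendition only ever appears together with a version in this model, and versions are whole numbers, so there is no label for anything “between” two versions.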
It's a bit like quantum physics for versioning; I am talking about incremental
changes where each change can be rendered according to (again) some user-defined
context, with each rendition being defined as equivalent to another, which means
that there can be no middle ground between two versions. In FRBR, however, at least
in how I interpret it, there is a fluidity between
manifestations and expressions,
leaving the standard open to connecting not just versions and renditions of the
document itself but also any derivative work it might have.
In my “quantum versioning”, that annotated version could be
defined as a fork, which would be perfectly fine, but I
would probably not want to mix my semantics in that manner, as an annotated work
would be a single version (in some rendition or renditions) where some selected
nodes had been linked to external content.
In other words, while I can certainly appreciate the similarities between my work
and FRBR, I do think the approaches are different.
Why This Paper?
I presented the passive version tracking idea at XML Prague 2016 (see [id-xmlprague]), expecting
questions and objections on the idea itself. It didn't happen. Instead, people
doubted the translations-as-renditions definition, both immediately after the talk
and over beers. The EU translations example was brought up as proof against the idea
and so triggered this paper.
So?
The basic principles outlined in this paper are already in use. The passive
tracking system is a proposed proof of concept for a client of mine. Plus, I can't
think of a good alternative. Can you?
References
[id-ml-paper] Nordström, Ari. “Multilevel Versioning.” Presented at
Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings
of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13
(2014). http://balisage.net/Proceedings/vol13/html/Nordstrom01/BalisageVol13-Nordstrom01.html. https://doi.org/10.4242/BalisageVol13.Nordstrom01
[id-naming-abstractions] Nordström, Ari. “Semantic Profiling Using
Indirection.” Presented at Balisage: The Markup Conference 2013, Montréal, Canada,
August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage
Series on Markup Technologies, vol. 10 (2013). http://balisage.net/Proceedings/vol10/html/Nordstrom01/BalisageVol10-Nordstrom01.html. https://doi.org/10.4242/BalisageVol10.Nordstrom01
[id-xmlprague] Nordström, Ari. “Virtual Document Management.” Presented
at XML Prague 2016, Prague, The Czech Republic, February 11 - 13, 2016.
[id-scales] Märklin Trains: Scales; see https://www.marklin.com/scales/
[id-frbr] Functional Requirements for Bibliographic Records; see http://www.ala.org/alcts/sites/ala.org.alcts/files/content/events/pastala/annual/04/Tillett.pdf