How to cite this paper

Nordström, Ari. “Tracking Toys (and Documents).” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Nordstrom01.

Balisage: The Markup Conference 2016
August 2 - 5, 2016

Balisage Paper: Tracking Toys (and Documents)

Ari Nordström

Ari Nordström is a freelance markup geek, based in Göteborg, Sweden, but offering his services across a number of borders. He has provided angled brackets and such to a number of organisations and companies over the years, with LexisNexis UK being the latest. His favourite XML specification remains XLink, and so quite a few of his frequent talks and presentations on XML include various aspects of linking.

Ari is the proud owner and head projectionist of Western Sweden's last functioning 35/70mm cinema, situated in his garage, which should explain why he once wrote a paper on automating commercial cinemas using XML. He now realises it's too late, however.

Abstract

This paper is about naming and tracking documents. It's about identifying changes to them, and about the contexts where they are intended to be used. Formally, the paper suggests that three characteristics, a base identifier, a version and a rendition, are enough to identify a document so that its lifecycle can be fully tracked.

The paper addresses some of the concerns and issues commonly used to argue against this view, also taking a look at a real-life system that has put the principles into practice.

Toys (and What Happens to Them)

Documents (and What Happens to Them)

Identifying the Box
Identifying Change
Identifying Context
Contexts as Renditions

The Semantic Document

Base
Version
Rendition

A Few Asides

IDs
Reasons for Change
Links

Addressing Problems

Versions

Translations

EU Regulations Example

Modularisation

Too Many Versions?

Multi-level Versioning and Business Rules

EU Example - Minimising Work

Another Aside on IDs

A Case Study

Weaknesses

The Many Versions Problem
The Use Latest Problem
The Master Language Problem

Improvements

Handling Versions
Linking to Later Versions
Why A Master Language?

End Notes

Profiling
Passive Version Tracking
FRBR
Why This Paper?
So?

Toys (and What Happens to Them)

Consider a cardboard box containing a boy's collection of toy cars. He had started small, with half a dozen nondescript plastic cars that he loved playing with, and scribbled CARS and KEEP AWAY all over the cardboard lid. Whenever he had a friend over, he'd tell the friend to fetch the box labelled CARS, and they would pour the contents on the floor and start playing.

Over time, his car collection grew and so, when playing, they would discard the oldest cars as childish, plastic and large and obviously fake, and pick some newer and more detailed die-cast ones instead. Later, they'd move on to models of large trucks, and later still, to the cool Formula One models with spring-loaded engines. And so on.

Eventually the boy's interests expanded to toy planes. The first few went into the CARS box but soon to a new box, labelled PLANES (and KEEP AWAY). Cars were still cool, though, and with that collection still growing, a second cars box (CARS 2) was added.

With the growing collections of cars and planes, still more boxes were required. There were smaller boxes in addition to the larger ones, with labels such as TRUCKS and FORMULA ONE and WAR PLANES, and it was now easier than ever to quickly pick the right toy set for the right buddy. Mum, though, working quietly in the background as mums often do, decided that the oldest plastic cars were no longer of interest and so they ended up in yet another box, OLD TOYS, discreetly hidden in the attic.

See where this is heading yet? This paper is about naming and identifying things; specifically, it's about naming documents, and the toy box metaphor seems like a good idea.

Documents (and What Happens to Them)

A document, just like the cardboard box with the toy cars, is essentially just a container, labelled to differentiate it from other containers. And just like the box, its contents change and evolve over time. Is there a way to label the document, so that it remains usable throughout the lifespan of its contents?^[1]

Identifying the Box

Like the cardboard boxes with toys, documents need to be named so their contents can be found later. Usually, the name tells us what the contents are about, so it is reasonable to include the name of the product being described. If there are variants or models (car models, say), those might need to be included, too.

And, just as with the cardboard boxes, a type identifier might prove useful. Cars, planes, formula one, user guide, and spare parts are all reasonable labels.

But is the maker or manufacturer important to include in a name? Quite a few documents out there probably say things like my file, but is that a good idea?

Surely, a document should have a description of what it is about in its name. A manual for assembling a toy car, for example, should probably have that in the name: Jaguar E-Type Model, Assembly Instructions. But what if the manual is produced in a modern fashion, in modularised form, where the toy manufacturer writes everything in reusable modules, so that the section about the necessary tools is common not only to model cars but probably also to model ships, planes and tanks?

Identifying Change

What if the assembly manual was regularly updated to include the changes to the model car? How would those changes be identified? Should they be in the name?

Identifying change is about helping readers to know which version of a document to get. Would it therefore make sense to include a version label in the name? Would it be enough to use ordinal numbers like Manual 1, Manual 2, and so on? Or would it be better to date the new versions and include that?

Personally, I get nervous if I see a version label on a cover. Do I have the right version? Do I have the latest? What version is the product? I always end up thinking: wouldn't it make more sense to make sure that the reader got the right version to begin with? Versions are needed by the writers to know which version to attach to which product, and which version to translate; the readers have no real use for the information.

But on the other hand, consider that box of old cars mum put in the attic: if it only said OLD TOYS on the lid and there were others like it, some considerable work would be required if you ever wanted to get rid of some of your old stuff. Rummaging through the attic is never any fun.

So the answer is both yes and no. It's useful to know which of the two versions of a document in front of you is the old one, but at the same time, there is no way for you to know whether there is a still later version than the two you have.

Identifying Context

If the model car assembly manual was translated to a number of languages, should the target languages be included in the name? Is it relevant to include, say, Swedish, in the name?

The language identifies a context in which the manual is intended to be used; often, a language and country combination is needed to identify not only the language but also the intended market. Swedish, for example, is a language spoken in both Sweden and Finland, and while the language is mostly the same in both cases, the countries are not.

Or consider Märklin model train sets. They come in different sizes, so getting the right size for your model railway is of vital importance.

Unlike the version labels, I would argue that identifying the context is important information to the user and should be included.

Contexts as Renditions

Consider an image of that E-Type Jaguar model. If printed, the image's visual contents and its name are one and the same if you know your cars, but a short, descriptive caption (Jaguar E-Type) would still be needed to identify it for someone who doesn't.

Similarly, a version (1962 or 1970, say) will be useful for anyone without the necessary knowledge.

If you are a kid and want it up on the wall, however, the context needs to be paper and that's about it. But what if you need to print it from your computer? Obviously, if you are a kid, you won't care much about the file format—the contents are important, not how they are rendered—but to get a sufficiently high resolution using the right software, in other words, when you need to achieve something specific with the contents, the rendition becomes crucial.

Note

I'm borrowing the term rendition from the world of graphics, where the same image might be rendered as a JPG or a PNG and defined as equivalent in terms of content.

All this is equally true with documents. The name describes the abstract contents of the container, the version how they change over time, and the language (and, frequently, country, region or other parameter affecting the exact context) how they are rendered.

The Semantic Document

So, how would we actually go about naming the containers? I'd think the above suggests something like this:

Name the container with some kind of base identification^[2] so we can find it.
Add metadata that formalises how changes to the contents can be described, so we can keep track of it.
Include a category or rendition (H0 and Z Gauge or Swedish) that makes the contents usable for a specific audience or situation.

Or, expressing the above in a more concise manner:

BASE:VERSION:RENDITION

BASE is the basic label on the box. It might be a simple category (CARS) or convey some meaning, such as a brand name and model (Ford Mondeo), but by itself, it is really just an identifier of the contents. We know, sort of, what the contents are, in a comfortingly abstract way.
VERSION gives those abstract contents a defined shape at some point in time (such as Model Year 2008); in short, it describes the change of the semantic document contents over time.
RENDITION is the type of output, the rendition required to make the versioned semantic document usable for a specific audience (EU Market, left-hand drive).

There is no right or wrong here. It could, for example, be argued that Mondeo is a Ford version rather than part of the base name.
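To make the notation concrete, here is a minimal Python sketch that splits a BASE:VERSION:RENDITION name into its three parts. The names `DocName`, `parse_name` and `format_name` are hypothetical, and the sketch assumes that none of the three parts contains a colon itself; a real schema would need its own escaping or delimiter rules.

```python
from typing import NamedTuple


class DocName(NamedTuple):
    """One BASE:VERSION:RENDITION name, split into its three parts."""
    base: str
    version: str
    rendition: str


def parse_name(name: str) -> DocName:
    """Split a colon-separated name; assumes no part contains a colon."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected BASE:VERSION:RENDITION, got {name!r}")
    return DocName(*parts)


def format_name(doc: DocName) -> str:
    """Join the three parts back into the colon-separated form."""
    return ":".join(doc)


doc = parse_name("Ford Mondeo:MY2008:EU Market, left-hand drive")
print(doc.base)       # Ford Mondeo
print(doc.version)    # MY2008
print(doc.rendition)  # EU Market, left-hand drive
```

Whether Mondeo belongs in the base or in the version is a modelling decision; the sketch only fixes the number of parts, not what goes into them.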

Base

The BASE:VERSION:RENDITION notation suggests a minimum set of semantics^[3], but there is nothing wrong with providing additional intelligence to the base identifier. For example:

A product called 123 might be described using three different document types: user guide, service guide, and spare parts. BASE, in this case, could then identify the product AND the document type, so three base identifiers, 123UG, 123SG, and 123SP.

Version

Adding a version can be a question of using an ordinal number (1, 2, 3, 4, 5, ...); after all, they are very easy to understand for man and machine alike. In many cases, a multilevel construction as used in software (1.1, 1.2, 1.3, 1.4, 2.0, ...) might be better.

A date is probably not a good idea as it is nearly impossible to know if 2008-11-11 is the first version or the tenth, at least without a context.
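One practical detail with multilevel versions is that they have to be compared level by level as numbers, not as text. The following is a small Python sketch of that idea; `version_key` is a hypothetical helper and the sample versions are made up for illustration.

```python
from typing import Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '1.10' into (1, 10) so versions compare numerically, level by level."""
    return tuple(int(level) for level in version.split("."))


versions = ["1.2", "1.10", "2.0", "1.9", "1.1"]

# Plain string sorting puts 1.10 before 1.2, which is not what we mean.
print(sorted(versions))                   # ['1.1', '1.10', '1.2', '1.9', '2.0']

# Sorting on the numeric key gives the intended order.
print(sorted(versions, key=version_key))  # ['1.1', '1.2', '1.9', '1.10', '2.0']
```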

Rendition

The most obvious example of a rendition for documents is the language (sv, en, de, ...) a manual is published in, but another perfectly good example is provided by the EU Market, left-hand drive example, above.

Or going back to toys, a rendition can be a size for a specific use case. Märklin model trains, for example, come in three sizes, Z Gauge, H0, and 1 Gauge. Or to put it differently: as we all know, it's no fun when little brother wants to add his big plastic Volvo to your race track day.

A Few Asides

Before I move to addressing some of the more obvious issues with this naming schema, allow me to quickly discuss a few related topics.

IDs

The observant reader will no doubt have thought about identifiers beyond those naming the container by now. What about identifying specific parts of the contents? In toy parlance, the box CARS might contain smaller boxes, say, one for each car, and so there would be a Ford Mondeo MY2008 EU Market, left-hand drive box, a Jaguar E-Type MY 1961 Roadster box, and so on.

I'd argue that the definition of a container is mostly an arbitrary one and will likely change over time. The contents are important and should be considered, but above all, we should always be able to uniquely identify the container, whenever and however it changes. What's inside is uniquely identifiable by first identifying the container and then the particular object of interest inside (the Jaguar E-Type stored in CARS).

Jaguar E-Type is what we call a structural identifier; if the boy's got several E-Types but only one in CARS, then the structural identifier doesn't need to be unique across the toy collection, only inside CARS; the context is enough. Also, with only one E-Type inside CARS, we don't actually have to use the full identifier to find it even if it does have one.

On the other hand, if that E-Type proves to be the boy's favourite toy, he might well want to store it in its own box, separately from the others. Once outside, it would have to be referred to using its full identifier, especially if there were other Jaguars. The boy, of course, would probably just refer to it as the Jag^[4].

And if the boy's list of favourites grows over time, he might want to create a favourite cars box and put it there, in which case a structural identifier combined with the unique container name would again be enough.

Reasons for Change

A new version signifies a change and should happen whenever there is a significant change to the contents (begging the question what is a significant change?). When the E-Type toy is moved to its own box, or perhaps to that box with favourites, the old box changes and gets a new version^[5].

The definition of change is a bit like the definition of done to agile practitioners or starving artists, and so hard to offer here. I tend to go with anything beyond pretty-printing a document.

My definition of change, of course, means that there will be a lot of versions, most of which are either significant only to me or not significant at all, beyond me hitting Save. A precious few are of interest to others, and this is where workflows come into play.

A change in the workflow is a status identifier: ready to be reviewed, approved, rejected, ready to be translated, published, etc. A document could move from draft to published without change and thus in a single version, but on the other hand, reaching published might just as well require two dozen versions. None of the versions between the two workflow stages is of any interest to anyone other than those involved. Workflows, then, can be used to scope a name (only use published versions).
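Scoping a name by workflow status can be sketched as a simple filter over a version history. The history below and the function name `latest_with_status` are hypothetical, and the status labels are only examples of what a workflow might define.

```python
from typing import List, Optional, Tuple

# Hypothetical version history for one document: (version number, workflow status).
history = [
    (1, "draft"),
    (2, "draft"),
    (3, "published"),
    (4, "draft"),
    (5, "approved"),
]


def latest_with_status(history: List[Tuple[int, str]], status: str) -> Optional[int]:
    """Return the highest version number carrying the given workflow status."""
    matching = [version for version, state in history if state == status]
    return max(matching) if matching else None


# "Only use published versions": versions 4 and 5 exist but are out of scope.
print(latest_with_status(history, "published"))  # 3
```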

Links

Finding the E-Type Jag is achieved using a link. The boy will not think of it as such, but when warning his friend to not touch the Jag in the CARS box, a link is what he uses to get his message across.

A link, of course, would build on the same rules for applicability and scoping as the name itself. However, a link used in an editing situation (don't touch the Jag on the floor) would depend on context while a more stringent link to an outside observer (mum, don't touch the E-Type Jaguar in the CARS box if you go into my room) is meant to be persistent and needs to be better defined.

Addressing Problems

Naming happens on several levels. A semantic document would keep its base identifier for as long as the contents remain related to the original. For example, for as long as a user manual is about "Volvo V70", the new versions (model years) would happen to the same base document.

If there was a significant change (Volvo 850 to Volvo V70^[6]), the change might result in either what in software circles is known as a fork (a new version 1 based on a specific version; in other words, something new based on an existing version but going off in a new direction) or an entirely new document (for obvious reasons).

Versions

Versions indicate change, so in the car world, this would actually correspond to manufacturing weeks in addition to model years (as opposed to actual years). For example, MY2003 expressed using this schema might be MY2003 w46 (in reality a model that started production in 2002), MY2003 w12, MY2003 w24, and MY2003 w38. MY2004 would happen in 2003, week 46.

A versioning schema for the Volvo V70 might then be defined as 2003:46, 2003:12, 2003:24, etc. Of course, what the end user sees is still 2003.

What this means is that the version label seen by the customer need not be the same as the version seen by the service technician. Versions, then, are better off scoped. Workflow stages (see section “Reasons for Change”) can help, but so can multilevel versioning (see section “Multi-level Versioning and Business Rules”) and naming abstractions (see id-naming-abstractions).

Translations

Frequently, problems with translations really are about the lack of proper versioning and/or naming. Consider, for example, multinational companies producing content for multiple markets and multiple languages. A manual may be first written and published in one language, say, Swedish, but then translated to another language and market, say, China and APAC, where the contents are further developed before publication to accommodate product changes specific to that market.

After some time, the product is then sold to a third market. It is translated from the second, and customised once again to meet the requirements of that market.

Meanwhile, the original Swedish version has been updated, independently from the other markets, and if they, at that point, decide to use product features developed for the other markets, the documentation from the other market(s) will not be reusable.

This is where the naming schema and regarding translations as renditions will help. Returning to the car model year and manufacturing week example, translations would be based on a specific exact version (rather than a scoped one), so 2004:46:sv-SE should be equivalent to 2004:46:en-GB rather than simply MY2004 sv-SE to MY2004 en-GB. The exact version schema would ensure equivalence^[7]. These would be different renditions of the same basic abstract manual.

Note that the rendition is simply defined as such; a GB manual would describe a right-hand drive car and therefore not be an exact match. Similarly, the GB model might have any number of standard features not present in the SE version or vice versa (a Spanish car would, for example, probably not have heated seats as standard).
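The rule that translations are renditions of one exact version can be written down as a small check. This is a sketch only: the class `DocName`, the function `are_equivalent_renditions` and the base label "V70 manual" are all hypothetical, and "equivalent" here means nothing more than "declared to be renditions of the same exact version".

```python
from typing import NamedTuple


class DocName(NamedTuple):
    base: str
    version: str    # exact version, e.g. "2004:46" (model year and manufacturing week)
    rendition: str  # e.g. "sv-SE"


def are_equivalent_renditions(a: DocName, b: DocName) -> bool:
    """True if a and b share base and exact version but differ in rendition."""
    return a.base == b.base and a.version == b.version and a.rendition != b.rendition


swedish = DocName("V70 manual", "2004:46", "sv-SE")
british = DocName("V70 manual", "2004:46", "en-GB")
later_british = DocName("V70 manual", "2004:12", "en-GB")

print(are_equivalent_renditions(swedish, british))        # True
print(are_equivalent_renditions(swedish, later_british))  # False
```

Note that the check compares names only; it says nothing about whether the contents match word for word, which is exactly the point of defining renditions as equivalent rather than identical.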

EU Regulations Example

While translations are renditions, there might be a need to develop a translation more or less independently from the original language. EU regulations for different member countries provide a good example of this. If a French-language translation for France from an English-language original requires content development, it could either be a development within the same version or an independently developed fork.

Here, the English-language original identified as DocA is translated to French and developed as a variant that is still seen as DocA:

DocA en-GB v1 written =>    DocA fr-FR translation of v1
                            DocA fr-FR v1-1
                            DocA fr-FR v1-2
                            DocA fr-FR v1-3
                            ...

DocA en-GB v2
...

Here, on the other hand, the English-language DocA is translated but then developed independently as DocB:

DocA en-GB v1 written =>    DocA fr-FR translation of v1
                            DocB fr-FR v1
                            DocB fr-FR v2
                            DocB fr-FR v3
                            ...

DocA en-GB v2               DocB fr-FR v9

In the case of the EU, the former variant is more likely. A regulation is adapted and developed for each member country until a new version comes out, at which point the translations need to be "merged" with the main version line.

While DocB starts out as a fork of DocA, this relationship is never explicit in the base identifier. Instead, it might be a relationship kept track of by a system.

Modularisation

Let's say that DocA:2:en-GB (see above) is modularised into a root file and two modules, like so:

DocA:3:en-GB (root; modularisation is a change so a new version)
|
|--DocC:1:en-GB
|--DocD:1:en-GB

When the new DocA is translated to French, this happens:

DocA:3:fr-FR
|
|--DocC:1:fr-FR
|--DocD:1:fr-FR

In other words, the root module and both of the new modules are translated. The root, of course, might simply comprise links to the modules and so the translation might happen automatically.

If a new document, DocE, is written and reuses modules like this:

DocE:1:en-GB
|
|--DocC:1:en-GB
|--DocF:1:en-GB

The translation to French would be:

DocE:1:fr-FR
|
|--DocC:1:fr-FR
|--DocF:1:fr-FR

The root, if comprising links only, would be automatically translatable, simply by changing the links from the en-GB renditions to the fr-FR renditions; DocC:1:fr-FR would already exist, as it was translated in the previous run; and so the only thing remaining would be to translate the new DocF:1:en-GB module.
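The DocE translation described above can be sketched in a few lines of Python. The functions `retarget` and `plan_translation` are hypothetical, and the sketch assumes that links are plain BASE:VERSION:RENDITION strings with exactly three colon-separated parts.

```python
def retarget(link: str, rendition: str) -> str:
    """Swap the rendition part of a BASE:VERSION:RENDITION link."""
    base, version, _old_rendition = link.split(":")
    return f"{base}:{version}:{rendition}"


def plan_translation(root_links, target, already_translated):
    """Return the rewritten links and the source modules still to be translated."""
    new_links = [retarget(link, target) for link in root_links]
    todo = [
        source
        for source, translated in zip(root_links, new_links)
        if translated not in already_translated
    ]
    return new_links, todo


# DocE's root links, and the French modules produced when DocA was translated.
doc_e_links = ["DocC:1:en-GB", "DocF:1:en-GB"]
existing_fr = {"DocC:1:fr-FR", "DocD:1:fr-FR"}

new_links, todo = plan_translation(doc_e_links, "fr-FR", existing_fr)
print(new_links)  # ['DocC:1:fr-FR', 'DocF:1:fr-FR']
print(todo)       # ['DocF:1:en-GB']
```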

Too Many Versions?

Mentioned above: a new version happens when there is a significant change in contents. Would this mean that a document that was properly modularised from the start would be hell to keep updated and translated? After all, if DocC:1:en-GB is updated to DocC:2:en-GB, doesn't that mean that DocE needs to be updated so the link points to the new version (with the same applicable to every other document using DocC)?

So if this happens:

DocE:1:en-GB
|
|--DocC:1:en-GB
|--DocF:1:en-GB

is reworked to

DocE:192:en-GB
|
|--DocC:28:en-GB
|--DocF:45:en-GB

How many times does the root document need to be updated? Translated? 2? 5? 40? 100?

There are several problems in play here:

Poor modularisation (and configuration management)
Translations happen before everything is ready
Automated systems bump versions for everything (or there is not enough systems support)

The real problem here is the assumption that every version is important. Most aren't, as mentioned in section “Versions”. Let's have a closer look.

Multi-level Versioning and Business Rules

Remember, a translation is a predefined rendition. That is, we are saying that this content is equivalent to that content, not that this content is identical to that content. So use several levels of versioning to scope a link. Assuming these versions:

1.0.0
1.0.1
1.0.2
1.1.0
1.1.1
1.2.0
1.2.1
1.2.2
2.0.0

Define business rules saying that a new translation, or link, is only required for integer versions (or decimal versions, or ...).

This, of course, is equally applicable to modularisation without translation. We could simply say that a link made to DocC v 1.0.2 would be valid for all 1.x.x versions and the business logic could then automatically use the latest 1.x.x version. The same would be applicable for translations.
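As an illustration, a business rule of the kind "a link to 1.0.2 is valid for all 1.x.x versions, use the latest" might be sketched as follows. The function names are hypothetical, and the rule itself (scope by the first version level) is just one possible choice.

```python
def parse(version: str):
    """Turn '1.2.10' into (1, 2, 10) for numeric comparison."""
    return tuple(int(level) for level in version.split("."))


def resolve_latest_in_major(linked: str, available):
    """Assumed rule: a link is valid for every version sharing its first level."""
    major = parse(linked)[0]
    candidates = [v for v in available if parse(v)[0] == major]
    if not candidates:
        raise LookupError(f"no version {major}.x.x available")
    return max(candidates, key=parse)


available = [
    "1.0.0", "1.0.1", "1.0.2",
    "1.1.0", "1.1.1",
    "1.2.0", "1.2.1", "1.2.2",
    "2.0.0",
]

# A link made to 1.0.2 resolves to the latest 1.x.x, never to 2.0.0.
print(resolve_latest_in_major("1.0.2", available))  # 1.2.2
```

Changing the rule to scope by the first two levels instead would only mean comparing `parse(v)[:2]`, which is why the scoping is better kept in business logic than in the link itself.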

EU Example - Minimising Work

The multilevel versioning provides an easy solution to the EU translation example, while retaining the all-important link between renditions. A business rule could simply say that every 1.x rendition is equivalent to every other 1.x rendition.

A clever modularisation of the document could then minimise the impact in terms of the number of member country-dependent changes by placing the contents requiring updates into a single module.

Let's say we have this slightly more complex document:

DocG:3:en-GB
|
|--DocC:1:en-GB
|   |--DocX:1:en-GB 
|
|--DocD:1:en-GB
|
|--DocH:5:en-GB
|
|--DocM:7:en-GB
|   |--DocX:1:en-GB 
|
|--DocP:3:en-GB
    |--DocX:1:en-GB

Here we have DocC, DocM and DocP all linking to the common module, DocX. Of course, the links would not reuse the entire document, only selected parts, expressed as fragment identifiers like so:

DocC:1:en-GB => DocX:1:en-GB#id1
DocM:7:en-GB => DocX:1:en-GB#id2
DocP:3:en-GB => DocX:1:en-GB#id3

This results in limiting the member country-dependent changes to a single module, DocX, that could be updated when required. The links from the source documents would all point at the integer version 1 of DocX, meaning that every update within the 1.x series could be defined as an equivalent link in the business rules.

Note

Of course, a decent system would be able to keep track of DocX 1.x during development, assigning workflow statuses to the development versions so a draft would never find its way to a published document.

Another Aside on IDs

Document IDs need to be unique, of course. How else would we find the right box? Structural IDs, however, do not. Internal links (#id) would be unique inside the module, but there is no need to enforce uniqueness outside it since any pointers from outside include the document ID, which is unique:

DocC:1:en-GB => DocX:1:en-GB#id1
DocM:7:en-GB => DocX:1:en-GB#id2
DocP:3:en-GB => DocX:1:en-GB#id3

Problems may ensue when merging modules or when modularising documents; in both cases, there is a need to check for uniqueness and possibly handle broken links from elsewhere.
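The uniqueness check needed before a merge can be sketched like this. The function `colliding_ids` and the second module's IDs are hypothetical; a real check would read the IDs from the modules themselves.

```python
from collections import Counter


def colliding_ids(*modules):
    """Return the structural IDs that occur in more than one of the given modules."""
    counts = Counter(id_ for ids in modules for id_ in set(ids))
    return sorted(id_ for id_, n in counts.items() if n > 1)


doc_x_ids = ["id1", "id2", "id3"]
doc_y_ids = ["id2", "intro", "id9"]  # hypothetical second module

# id2 is fine while the modules are separate, but clashes once they are merged.
print(colliding_ids(doc_x_ids, doc_y_ids))  # ['id2']
```

Any ID reported here would have to be renamed in one of the modules, and every link pointing at the renamed ID from elsewhere would then need to be updated.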

A Case Study

The principles outlined in this paper have been put into practice far beyond my son's toy collection. This section outlines a system I've been involved in developing.

The basic system was conceived from the start to offer full traceability in large, modularised technical documents that are translated to a dozen or so languages. The system therefore keeps a full version history of each and every resource in the system, which means that the document modules are linked to using specific versions of each module. Exact versions are also used when translating the modules, which means that in theory, reuse is very efficient.

The system identifies its resources using a URN schema that formalises the BASE:VERSION:RENDITION concept, but also generates structural IDs for the document contents. Thus, every link to structured content uses the form URN#ID, which means that reuse is allowed on document fragments, too. The structural IDs are kept intact when translating the modules, so a link to a Swedish document fragment URN:sv-SE#ID is equivalent to an English-language one, URN:en-GB#ID. All that is needed is to replace the language-country code (and, obviously, to translate the contents).

Note

The Swedish and English versions are defined to be equivalent, not identical, which means that apart from the key nodes (with the IDs), some differences may exist.

Images and media are also linked to using URNs, and thus translated in the same way, by changing the language-country code (and the contents), unless they are language-neutral.

Weaknesses

In spite of the precise versioning capabilities, the system had several main weaknesses when it was built:

The Many Versions Problem

First and foremost, while the user was able to decide when to check in (and out) a module, thereby creating a new version, this would nevertheless result in a lot of versions; it was not unusual for frequently revised modules to have dozens or even hundreds of versions.

While the system had some degree of scoping in that it allowed the user to define workflow statuses for each stored resource, which in theory should have been used to limit linking to approved versions only, the workflow feature was manual and underused. This resulted in links frequently pointing to old versions of modules because even a slight change to a module would result in a new version, which, because specific versions of modules were always linked to and translated, would require the link to the module to be updated and the module itself to be translated again.

The Use Latest Problem

The very fact that links were made to a specific version also made it difficult to always link to the latest version, something frequently asked for by the users. For example, if an image link should always be the latest (approved) version, the way the system was built meant that if the image link was updated, the link to the module with the image would have to be updated, and so on, all the way to the root.

This was doable, of course, if none of the ancestors to the image had been reused elsewhere, in an incompatible way. For example, let's say we started with a document, RootX, linking to modules as follows, with ModuleA in version 5 as its immediate child:

RootX-v2 => ModuleA-v5 => ModuleB-v9 => Image-v1

The child of RootX, ModuleA, was then modified to version 8 and used in another document, RootY:

RootY-v1 => ModuleA-v8 => ModuleB-v9 => Image-v1

As we can see, the modules linked to by ModuleA are unchanged; only the contents of ModuleA have been updated.

If RootX required a new version of the image, however, this would cause a problem since updating the image to v2 would mean updating ModuleB, which was fine, but also ModuleA (and RootX).

But ModuleA had already been updated so it could be used by RootY, and so the next available version would be version 9. This is fine, of course, but would then create a similar situation for RootY if it was updated.
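The ripple effect is easy to reproduce in a few lines. This is a sketch of the behaviour described above, not of the actual system; `bump_chain` and the `latest` dictionary are hypothetical stand-ins for whatever the repository tracks.

```python
def bump_chain(chain, latest):
    """Bump the leaf and every ancestor to its next free version.

    chain  -- (name, version) pairs from root to leaf, as linked by one root
    latest -- highest existing version per name across the whole system;
              updated in place as new versions are created
    """
    new_chain = []
    for name, _linked_version in chain:
        latest[name] += 1  # next free version number for this resource
        new_chain.append((name, latest[name]))
    return new_chain


# RootX still links to ModuleA v5, but v8 already exists because of RootY.
root_x = [("RootX", 2), ("ModuleA", 5), ("ModuleB", 9), ("Image", 1)]
latest = {"RootX": 2, "ModuleA": 8, "ModuleB": 9, "Image": 1}

print(bump_chain(root_x, latest))
# [('RootX', 3), ('ModuleA', 9), ('ModuleB', 10), ('Image', 2)]
```

A single new image version produces four new versions, and ModuleA jumps from the linked v5 to v9, skipping over the changes made for RootY in between.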

The Master Language Problem

Finally, the system had the concept of a master language, a language used when producing new content. The system had a feature to build a translation package from a master, which essentially meant creating a zip package of the modules to be translated, copying them verbatim in every respect but the xml:lang attribute in the root element, minus those that had already been translated, and logging the event in the database so it would know it had translations to upload back to the system at some point in the future.

This seemed like a good idea at the time but quickly became difficult to handle. For example, if the master language was set to Swedish, that was the language the translation packages sent to the translation agency had. There was no way to base a translation on, say, English, which would have made translations cheaper and easier^[8].

The master language concept, obviously, also made it more difficult to collaborate in more than one language, something that rather unexpectedly became a deal-breaker.

Improvements

The system, then, actually had issues with several concepts introduced above. Does this mean that the idea of a semantic document, labelled using a BASE:VERSION:RENDITION schema, doesn't work after all?

Handling Versions

We never managed to address the many versions problem fully, but we did handle one of its consequences, the retranslation required because of a version bump, in a rather simple but quite effective way:

Whenever a module had been updated, thus breaking the coupling with the existing translations, we added a feature allowing the user to see the master language and its available, but old, translations in a table, one column per language, select a target version and an old translation, and then investigate if the old translation was possible to bump to the selected later version.

The system tested the bump against a few business rules to make sure that it was structurally feasible, and then ran an XProc pipeline that updated any links inside the module to match the master language's. The pipeline would also attempt to include any missing content from the master language, but using some very restrictive business rules.

The idea was simple and based on the fact that any translation is defined as being equivalent with the original; it does not have to be identical. This allowed the user to assume responsibility for the bumped translation being equivalent to a selected later version, and in most cases, the users would then update the translation manually for the usually minor bits that did not match.
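To give a feel for what a structural feasibility test might look like, here is a Python sketch that compares the structural IDs of a later master version with those of an old translation. It is an illustration under assumptions, not the system's actual rules: the real checks ran as business rules ahead of an XProc pipeline, and the function names, the plain `id` attribute and the sample markup below are all hypothetical.

```python
import xml.etree.ElementTree as ET


def structural_ids(xml_text: str):
    """Collect the values of all 'id' attributes in document order."""
    root = ET.fromstring(xml_text)
    return [el.get("id") for el in root.iter() if el.get("id") is not None]


def bump_report(master_xml: str, translation_xml: str):
    """List IDs the old translation lacks, and IDs it has that the master dropped."""
    master = structural_ids(master_xml)
    old = structural_ids(translation_xml)
    return {
        "missing": [i for i in master if i not in old],  # content to add from master
        "extra": [i for i in old if i not in master],    # content removed in master
    }


# Hypothetical later master version and an old English translation.
master_later = (
    '<section id="s1" xml:lang="sv-SE">'
    '<p id="p1">...</p><p id="p2">...</p>'
    '</section>'
)
old_translation = (
    '<section id="s1" xml:lang="en-GB">'
    '<p id="p1">...</p>'
    '</section>'
)

print(bump_report(master_later, old_translation))
# {'missing': ['p2'], 'extra': []}
```

An empty report would suggest that the bump is structurally safe; anything listed as missing or extra is what the user would then have to review and fix by hand.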

The system's major weakness, where the versioning was straight and not scoped in any way, was never addressed, however. For a comprehensive solution to this problem, see id-ml-paper.

Linking to Later Versions

The use latest problem was never fully addressed either. We side-stepped the problem by adding the notion of forks where a given module version would be copied and used as the basis of a new module, with that module's versioning then moving in a separate direction from the original.

The forking functionality did take care of some of the problems but did not really fully address the original problem, namely to be able to easily update a leaf node such as an image to the latest version without having to update the versions of every ancestor accordingly.

A more sensible approach would have been to use the kind of scoped versioning discussed in section “Multi-level Versioning and Business Rules” (with a more comprehensive solution suggested in id-ml-paper).

Why A Master Language?

The master language problem, finally, was not a problem with the approach but the system itself. The system was simply built that way, assuming that content production would always be in a single language and the translations in all of the others. The functionality to create translation packages, essentially copies of the originals but with updated language attributes and information, was then designed upon that first decision.

In other words, there is nothing in the naming principles discussed here that even hints at a single master language. Quite the opposite; the situation in the table below actually makes more sense. P stands for production while the empty cells mean that the content is translated to that language.

Table I

sv-SE	en-GB	en-US	fr-FR
P
	P
		P
P
			P

Of course, a system built on these principles would allow creating a translation package from any existing language.
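A translation package that can start from any language might look something like the following sketch. It assumes that the language is carried by `xml:lang` on the module's root element; the `translated-from` attribute and the function name are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def create_translation_package(source_root, target_lang):
    """Copy a module in any language and relabel the copy for translation."""
    source_lang = source_root.get(XML_LANG)
    if source_lang is None:
        raise ValueError("source module has no xml:lang")
    if source_lang == target_lang:
        raise ValueError("target language equals source language")
    package = copy.deepcopy(source_root)  # the source itself is not modified
    package.set(XML_LANG, target_lang)
    package.set("translated-from", source_lang)  # hypothetical bookkeeping attribute
    return package
```

Nothing in the function singles out one language as the master; whichever rendition happens to be the production language for a given module can be passed in as the source.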

End Notes

Profiling

Efficient modularisation of documents requires being able to profile the contents: marking up sections as applying to the different variants covered by the module and then showing or hiding them when publishing the document, depending on the publishing context. This practice also includes features like variable text strings so a variant name can be included in the content when using a specific profile.

Profiling is beyond the scope of this paper, but I presented a fuller treatment of how to manage profiles a number of years ago. See id-naming-abstractions.
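For readers who have not met profiling before, the show-and-hide part can be sketched as follows. The `profile` attribute holding a space-separated list of variant names is an assumption made for the example, not the markup described in id-naming-abstractions.

```python
import xml.etree.ElementTree as ET


def apply_profile(element, active):
    """Remove every descendant whose profile list does not include `active`.

    Elements without a profile attribute apply to all variants and are kept.
    Simplification: any text following a removed element (its tail) is
    dropped with it, so the sketch suits block-level sections only.
    """
    for child in list(element):
        profiles = child.get("profile")
        if profiles is not None and active not in profiles.split():
            element.remove(child)
        else:
            apply_profile(child, active)
```

Publishing the same module for another variant is then a matter of calling the function on a fresh copy with a different profile name.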

Passive Version Tracking

The ideas described in this paper are all about labelling and annotating containers, a play with the labels that identify those toy boxes with cars (or the boxes inside the larger boxes). The cardboard box metaphor is best visualised by the labels taped on the boxes but since it's just a metaphor, there's nothing to stop us from moving the labels to a more convenient location. In other words, indirection, a sort of out-of-line versioning.

We can place imaginary labels on the boxes, on the cars, on groups of cars, and so on (think registers), so we can also remotely track the changes. If you think about the ideas presented here, they all observe and react to the changes to the content; they do not actually cause them.
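A minimal sketch of such a register, assuming that a change is detected by comparing content digests (the class and method names are hypothetical, and this is not the proof of concept mentioned later):

```python
import hashlib


class VersionRegister:
    """Out-of-line labels: the register observes content but never modifies it."""

    def __init__(self):
        self._history = {}  # base identifier -> list of content digests

    def observe(self, base, content):
        """Record a new version if the content differs from the last one seen,
        and return the current version number for that base identifier."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        history = self._history.setdefault(base, [])
        if not history or history[-1] != digest:
            history.append(digest)
        return len(history)
```

The caller decides when to call `observe`, which corresponds to a user deciding that a change is significant enough to be saved; the content itself carries no label at all.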

FRBR

Two reviewers suggested I should not finish this paper without considering Functional Requirements for Bibliographic Records (FRBR) (see id-frbr). I must confess that FRBR is mostly new to me. I have heard of it before, somewhere, but I never made the connection.

There are similarities between what I propose and FRBR, in that there is a concept that reminds me of the abstract, rendition-less content that is my base document, and there is a concept similar to a rendition (called a manifestation, a phrase I like). There is also what is to some extent equivalent, or at least similar, to a version (called an expression, a phrase I don't like), but while I think I understand where the authors of FRBR are coming from, I don't think FRBR is an equivalent, and here is why:

FRBR also describes what they call derivative works. These may include, say, annotated versions of a movie script but also any other version of the base work. Technically, there is nothing to stop an author from including a derivative work such as an annotated script as a new version in what I propose here, but my proposal is not about including any kind of derivative work as a version. My versions are very strictly about change to the document, not interpretations of it.

My proposal (and practice; I have implemented what I describe here in several live systems) is about a base document: an abstract document with neither a context requiring a specific rendition nor a version identifying its change over time. That document is then changed over time in increments, where each increment is defined by a user as a significant (and therefore saved) new version, and where any of those user-identified versions can then be rendered in a specific way, depending on context.

It's a bit like quantum physics for versioning; I am talking about incremental changes where each change can be rendered according to (again) some user-defined context, with each rendition being defined as equivalent to another, which means that there can be no middle ground between two versions. In FRBR, however, at least in how I interpret it, there is a fluidity between manifestations and expressions, leaving the standard open to connecting not just versions and renditions of the document itself but also any derivative work it might have.

In my quantum versioning, that annotated version could be defined as a fork, which would be perfectly fine, but I would probably not want to mix my semantics in that manner as an annotated work would be a single version (in some rendition or renditions) where some selected nodes had been linked to external content.

In other words, while I can certainly appreciate the similarities between my work and FRBR, I do think the approaches are different.

Why This Paper?

I presented the passive version tracking idea at XML Prague 2016 (see id-xmlprague), expecting questions and objections on the idea itself. It didn't happen. Instead, people doubted the translations-as-renditions definition, both immediately after the talk and over beers. The EU translations example was brought up as proof against the idea and so triggered this paper.

So?

The basic principles outlined in this paper are already in use. The passive tracking system is a proposed proof of concept for a client of mine. Plus, I can't think of a good alternative. Can you?

References

[id-ml-paper] Nordström, Ari. “Multilevel Versioning.” Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13 (2014). http://balisage.net/Proceedings/vol13/html/Nordstrom01/BalisageVol13-Nordstrom01.html. https://doi.org/10.4242/BalisageVol13.Nordstrom01

[id-naming-abstractions] Nordström, Ari. “Semantic Profiling Using Indirection.” Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). http://balisage.net/Proceedings/vol10/html/Nordstrom01/BalisageVol10-Nordstrom01.html. https://doi.org/10.4242/BalisageVol10.Nordstrom01

[id-xmlprague] Nordström, Ari. “Virtual Document Management.” Presented at XML Prague 2016, Prague, The Czech Republic, February 11 - 13, 2016.

[id-scales] Märklin Trains: Scales; see https://www.marklin.com/scales/

[id-frbr] Functional Requirements for Bibliographic Records; see http://www.ala.org/alcts/sites/ala.org.alcts/files/content/events/pastala/annual/04/Tillett.pdf

^[1] Unlike the boy, whose naming conventions remained ad hoc because he had no way to know what would come, we can predict some of the things to come with most documents.

^[2] Possibly including the information type (trucks, GT cars, vans, formula one...) or some other relevant characteristics.

^[3] Of course, if there is a document that never changes, we can get rid of the VERSION part. How likely do you think that is?

^[4] Like most Jag owners.

^[5] The new box changes, too, and gets version 1 if it's the first, obviously, or a bump up if it already existed before the Jag arrived.

^[6] An intentionally questionable example.

^[7] Of course, some work in terms of systems support and common workflows would also be required so the different markets would use the same naming and versioning conventions.

^[8] A Swedish to Mandarin translator, unsurprisingly, costs a lot more than an English to Mandarin ditto. Provided you can find one.

