Huitfeldt, Claus. “UnderDok: XML Structured attributes, change tracking, and the metaphysics of documents.” Presented at Balisage: The Markup Conference 2015, Washington, DC, August 11 - 14, 2015. In Proceedings of Balisage: The Markup Conference 2015. Balisage Series on Markup Technologies, vol. 15 (2015).
XML Structured attributes, change tracking, and the metaphysics of documents
Claus Huitfeldt is Associate Professor at the Department of Philosophy of the
University of Bergen, Norway. He was founding Director (1990-2000) of the Wittgenstein
Archives at the University of Bergen, for which he developed the text encoding system
as well as the editorial methods for the publication of Wittgenstein's Nachlass -
Bergen Electronic Edition (Oxford University Press, 2000).
UnderDok is an XML system for publishing, quality assurance, and change tracking of
education course descriptions. The documents have a fixed structure, numerous cross-references,
and prose interspersed with standard phrases. Each document exists in a native language
form, in
an English translation, and sometimes in additional languages. Changes must be tracked
to the last authorized version. Up to now, documents have been produced in Microsoft
Word and
manually copied to a database, a process both labor-intensive and error-prone. UnderDok
offers solutions to many of these technical challenges, but may also inspire reflections
on the metaphysical status of documents. It is suggested that a course description, by
which an institution and its students are legally bound, is neither the source XML nor the
XHTML, but a visual object containing linguistic information that occurs in certain
situations and contexts. The legal stability of these documents, which was traditionally
provided by printed pages, is now provided by the reproducibility (standardization) of
document representation and presentation technology.
The basic goal of the Bologna process, which has been going on since 1999, is to make
education systems throughout Europe easier to compare and to assess relative to each
other, in order to facilitate the exchange of students and teachers as well as mobility
on the labour market. As part of this effort a number of steps have been taken to
standardize structures, procedures and quality measures used in education systems among
the currently 46 member countries.
The standardization of course description and program description documents[1]
is one central element in the Bologna process. The documents contain descriptions of,
among other things, the learning outcomes, learning activities, and assessment methods
of each course or program, so as to make them comparable across institutional as well as
national borders. Course descriptions contain information which is important to both
employers and educational institutions in their assessment of a student's or candidate's
knowledge, skills, and general competence. They are important to academic and
administrative staff as they set out requirements for the teaching to be offered. For
students they serve not only as a source of information, but also as a kind of contract
between each individual student and his or her institution. In other words, course
descriptions may also be seen to have a legal status.
Therefore, it is important for institutions to exert control over the contents of course
descriptions. And since course descriptions change, it is important to be able to
keep track
of the changes. This paper describes aspects of an XML-based document management system,
UnderDok, which has been developed in order to tackle some
of these challenges. Other work in this area has focussed on development of general,
cross-institutional and international ontologies and semantic web resources (see,
for example,
[Demartini et al. 2012], [Camarero et al. 2009], and [Amorim et al. 2006]). The aims of UnderDok are more modest, and limited to issues of institutional document management.
Editing cycle
Course descriptions go through a more or less continuous editing process involving
a large
number of persons with different roles in a particular cycle, ending up twice a year[2]
in a formally authorized version which is legally binding for the institutions'
relation to students admitted to the courses in question for the next two or three
years. Thus,
at any point of time, several different versions of a document describing the same
course may
apply for students enrolled in the course at different times.
Once authorized by the institution at the appropriate level, the course descriptions are
submitted to a national database including all higher educational institutions in
the country,
thus containing thousands of course descriptions. In the national database each document
is represented in the form of a record structured corresponding to the main document
fields (to be
described below), with only limited indication of lower-level structure.
The revision of each course description involves several people in different roles.
Typically, a revision will be initiated and formulated by someone teaching the course, checked
for adherence
to formal requirements by administrative staff, reviewed by a program committee (or
by several,
if the course is part of several programs), and then submitted to the department,
faculty or
university board for authorization, depending on the kind of changes made. All of
these steps may
be repeated when required. It is vital that all changes relative to the last previously
authorized version of the document are clearly indicated.
This process is mostly performed through the revision and exchange of Microsoft Word or
other proprietary format documents, using a variety of text processing and change
tracking tools. Once authorized, revisions are transferred to the national database by
manual copy and paste. With no support for control of cross-references, vocabulary or
standard phrases, and only weak and fragile change tracking, the process is not only
labor-intensive, but also extremely error-prone.
Document structure
Course description documents have a fixed overall structure consisting of twenty-one
fields of text.[3]
The contents of some of these fields are very specific (e.g. number of ECTS points,[4]
degree level, teaching term), varying normally only in which of a limited set of
values they may have. The contents of other fields are less specific, and yet other
fields may contain free prose which will normally be unique to each course description.
Thus, course descriptions are paradigmatic examples of semi-structured documents.
The documents frequently cross-reference each other, for example in referring to other,
prerequisite or overlapping courses or programs.
Some of the fields are required to appear in several languages. For example, course
titles should always be given in both official written forms of Norwegian as well as in
English. Learning outcomes should always appear in both English and Norwegian. In
addition, descriptions of courses taught in languages other than Norwegian will naturally
appear in extenso in that other language.
The documents contain a large number of standard phrases, ranging from a few words
to a few
paragraphs which should be identical across several documents. Typical examples are
statements about general requirements, deadlines, formal rules etc. which are common to some,
but not all,
courses or programs.
In the course of the editing cycle it is also necessary to exert a more general control
over vocabulary. This is so for two reasons. First, the fact that the documents are
authored and edited by a large number of people easily leads to inconsistencies in style.
Within each of the two official forms of Norwegian, there is ample room for choice
between various forms of spelling and grammar, but mixing different choices from document
to document, or even within the same document, gives a very bad impression and reflects
poorly on the institution. Similar
considerations apply to the choice of different forms of other languages, such as
British or
American English. Second, the documents contain references to technical or semi-technical
terminology which should be used consistently throughout the corpus. For example,
if the terms
"semester essay" and "supervised essay" refer to the same kinds of things, one should
use only
one of the terms.
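The terminology requirement can be illustrated with a minimal Python sketch. It is not part of UnderDok; the function name `find_deprecated_terms` and the choice of "semester essay" as the preferred term are assumptions made only for the sake of the example.

```python
import re


def find_deprecated_terms(text, preferred):
    """Report occurrences of terms that should be replaced.

    `preferred` maps each deprecated term to the term that should be used
    instead (a hypothetical table; the paper does not list one).
    Returns (offset, matched text, suggested replacement) tuples.
    """
    hits = []
    for deprecated, replacement in preferred.items():
        pattern = r"\b" + re.escape(deprecated) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), replacement))
    return sorted(hits)


# Assumption: "semester essay" is the preferred form of the two synonyms.
PREFERRED = {"supervised essay": "semester essay"}

sample = "Assessment is by a Supervised essay. The semester essay counts fully."
for offset, found, suggestion in find_deprecated_terms(sample, PREFERRED):
    print(f"offset {offset}: '{found}' should read '{suggestion}'")
```

A check of this kind only flags candidates; deciding which of two synonymous terms to keep remains an editorial decision.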
Appendix A gives an example of the English version of a course
description. Each of the twenty-one main fields of the document is indicated by a
headline. (Curiously, even the text of headlines is sometimes subject to change.)
Overall design of UnderDok
UnderDok is an XML-based document system designed to support editing, quality assurance
and change tracking of course descriptions as described above. It has also been an
important consideration that it should be suited for maintenance and adaptation to local
requirements by staff with only minimal training in XML. In particular, the system
provides means of letting users change schemas and influence the effects of stylesheets
without requiring knowledge of XSLT or XML schema languages.
A main DTD called System.dtd specifies elements for declaring XML
elements and attributes. An XML document, DTDkilde.xml, which conforms to
the main DTD, defines the basic element structure that all documents have to conform to.
Another XML document, AttOver.xml, declares attributes, legal attribute
values and optional headlines for elements declared in DTDkilde.xml. A
stylesheet, LagDtd.xsl, reads the two XML documents
(DTDkilde.xml and AttOver.xml) and generates the
local project DTD (UUI-dok.dtd). All course description documents as well
as AttOver.xml are of types declared in UUI-dok.dtd.
Figure 1
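The generation step can be sketched in a few lines of Python. This is an illustration only, not a rendering of LagDtd.xsl: the root element names `DTDkilde` and `AttOver`, the function name `generate_dtd`, and the miniature input documents are assumptions, while the `InnholdsElement`, `Element` and `Verdi` names and the shape of the generated declarations follow the fragments quoted elsewhere in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature inputs. The root element names are assumptions;
# the paper does not show them.
DTDKILDE = """
<DTDkilde>
  <InnholdsElement GI="Kontaktinfo">(Norsk | English | (Norsk,English))? </InnholdsElement>
</DTDkilde>
"""

ATTOVER = """
<AttOver>
  <Element id="Kontaktinfo">
    <Verdi id="FoF-kontakt"/>
    <Verdi id="Anna"/>
    <Verdi id="Ingen"/>
  </Element>
</AttOver>
"""


def generate_dtd(dtdkilde_xml, attover_xml):
    """Combine content models and attribute values into DTD declarations."""
    kilde = ET.fromstring(dtdkilde_xml)
    attover = ET.fromstring(attover_xml)

    # Legal values of the Er attribute, keyed by element name.
    values = {
        element.get("id"): [verdi.get("id") for verdi in element.findall("Verdi")]
        for element in attover.iter("Element")
    }

    lines = []
    for declaration in kilde.iter("InnholdsElement"):
        name = declaration.get("GI")
        model = (declaration.text or "").strip()
        lines.append(f"<!ELEMENT {name} {model} >")
        if values.get(name):
            alternatives = " | ".join(values[name])
            lines.append(f'<!ENTITY % {name}-verdier "( {alternatives} )" >')
            lines.append(f"<!ATTLIST {name} Er %{name}-verdier; #REQUIRED >")
    return "\n".join(lines)


print(generate_dtd(DTDKILDE, ATTOVER))
```

Because the attribute values live in an ordinary XML document, adding a `Verdi` element and regenerating the DTD is enough to make a new value legal in every course description.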
Course descriptions are represented as XML document instances with twenty-one main
elements, each element corresponding to one of the twenty-one main fields of course
descriptions as described earlier. Appendix B contains the XML source for the
document given in Appendix A.
Since course descriptions require only very trivial formatting, the output format
is basic XHTML (without scripts, CSS or the like). A number of different stylesheets are
available. What these stylesheets all have in common is that they collect data not only
from the source document, but also from other course descriptions and from the file
AttOver.xml. (In Appendix A, for example, all
headlines and much of the prose come from AttOver.xml, while the course
titles in the section "Recommended previous knowledge" are extracted from other course
descriptions.)
Figure 2
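The way cross-references draw on other course descriptions can be illustrated with a short Python sketch. It is not the UnderDok stylesheet: the wrapper element `Emne`, the placeholder titles for FIL120 and FIL121, the function names, and the output format "code followed by title" are all assumptions. Only the `E` and `Engelsk` element names and the FIL124 title come from Appendix B.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus: one tiny document per course code. The titles for
# FIL120 and FIL121 are placeholders, not the real course titles.
CORPUS = {
    "FIL120": "<Emne><Engelsk>Placeholder title A</Engelsk></Emne>",
    "FIL121": "<Emne><Engelsk>Placeholder title B</Engelsk></Emne>",
    "FIL124": "<Emne><Engelsk>Introduction to practical philosophy</Engelsk></Emne>",
}


def english_titles(corpus):
    """Extract the English title from every course description in the corpus."""
    return {
        code: ET.fromstring(source).findtext(".//Engelsk")
        for code, source in corpus.items()
    }


def resolve_references(fragment_xml, titles):
    """Replace each <E id="..."/> in a mixed-content fragment by code and title."""
    root = ET.fromstring(fragment_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "E":
            code = child.get("id")
            if code not in titles:
                # A dangling reference is reported instead of silently dropped.
                raise KeyError(f"unresolved cross-reference: {code}")
            parts.append(f"{code} {titles[code]}")
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


fragment = (
    '<Eng>Students are advised to take <E id="FIL124"/> in parallel with or '
    'after <E id="FIL120"/> and <E id="FIL121"/></Eng>'
)
print(resolve_references(fragment, english_titles(CORPUS)))
```

Resolving titles at presentation time means that a renamed course is reflected in every document that refers to it, and a reference to a non-existent course surfaces as an error.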
The advantage for users of UnderDok is that they can (indirectly) modify both the schema
and the effect of stylesheets without (directly) modifying either the schema or the
stylesheets, but only AttOver.xml, which is a plain XML
document. Thus, no knowledge of schema or stylesheet languages is required.
Structured attributes
Consider the main element Undervisningstad. In the XML source (Appendix B), it reads:
<Undervisningstad Er="Bergen"/>
This is an example of what we might, for the lack of a better name, call
"attribute elements". Attribute elements have one and only one required
attribute. In most occurrences the element will normally be empty. However, one of
the legal
values of the attribute may be used to indicate that it has content, and that the
value of the
attribute is given by the element content.
2) He may edit the relevant part of AttOver.xml, which
and then run a standard procedure to update the project DTD, which will now look like
<!ELEMENT Undervisningstad (Nor | Eng | (Nor,Eng))* >
<!ENTITY % Undervisningstad-verdier "( Bergen | Oslo | Anna | Ingen ) " >
<!ATTLIST Undervisningstad Er %Undervisningstad-verdier; #REQUIRED >
This allows him to change the XML source document to read:
<Undervisningstad Er="Oslo"/>
While method 1) will have an effect only for the edited document, method 2) will make
the value "Oslo" available for the @Er attribute of the Undervisningstad
element in all documents.
Let us assume that the encoder also wants to change the headline "Place of Teaching" to
read "Place of Instruction". This can be obtained by changing the
Overskrift (Headline) element of AttOver.xml to
<Eng>Place of Instruction</Eng>
The effect of this operation is that all documents will contain the
headline "Place of Instruction" instead of "Place of Teaching". With the modifications
described above, the source will now display as follows:
Figure 4
While new attribute values can either be specified in individual documents or be made
available for all documents, changes to headlines will always affect all documents.
What has all this got to do with structured attributes? Strictly speaking, the legal
values of
@Er are initially the strings "Bergen", "Anna" ("Other"), and "Ingen" ("None"); and
the only
thing that happens in the course of the process described above is that the string
"Oslo" is --
in a rather roundabout way, one might say -- added to the list of legal values.
This simple example may admittedly give an impression of much ado about (nearly) nothing.
However, the Verdi elements may contain subelements, marked sections,
comments, and any other construct which is allowed in the location from which they
are referenced
by an attribute element. Since AttOver.xml is itself of type
UUI-dok.dtd, validation ensures that the inclusion of such structured
content does not break the validity of the course description documents.
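The lookup that an attribute element implies can be sketched in Python as follows. This is not the UnderDok stylesheet code. It assumes that "Anna" ("Other") is the value signalling that the element carries its own content, that the root of AttOver.xml is called `AttOver`, and that values are wrapped in `Nor` and `Eng` children; the Norwegian text is left out of the sample for brevity.

```python
import xml.etree.ElementTree as ET

# Assumption: this @Er value means "use the element's own content".
OVERRIDE_VALUE = "Anna"

# Hypothetical excerpt of AttOver.xml with a single declared value.
ATTOVER = ET.fromstring("""
<AttOver>
  <Element id="Studierett">
    <Verdi id="Ope">
      <Eng>The course is open to students admitted at the University of Bergen</Eng>
    </Verdi>
  </Element>
</AttOver>
""")


def resolve(instance, attover, lang):
    """Return the text an attribute element stands for in one language."""
    value = instance.get("Er")
    if value == OVERRIDE_VALUE:
        # The value is given locally, by the content of the element itself.
        source = instance
    else:
        # The value is declared centrally and shared by all documents.
        path = f"./Element[@id='{instance.tag}']/Verdi[@id='{value}']"
        source = attover.find(path)
        if source is None:
            raise LookupError(f"{instance.tag}: undeclared value {value!r}")
    text = source.find(lang)
    if text is None:
        return ""
    # itertext() also collects text from subelements such as paragraphs.
    return " ".join("".join(text.itertext()).split())


shared = ET.fromstring('<Studierett Er="Ope"/>')
local = ET.fromstring('<Studierett Er="Anna"><Eng>Special rule</Eng></Studierett>')
print(resolve(shared, ATTOVER, "Eng"))
print(resolve(local, ATTOVER, "Eng"))
```

The sketch flattens structured content to plain text; a real presentation step would instead copy the subelements of `Verdi` into the output.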
A somewhat more complex example may illustrate this. One of the values of the @Er
attribute on the element Studierett, "Privatist", is
specified as follows:
<Element id="Studierett">
<Nor>Krav til studierett</Nor>
<Eng>Access to the Course Unit</Eng>
<Verdi id="Ope">
<Nor>Emnet er ope for studentar med studierett ved Universitetet i
<Eng>The course is open to students admitted at the University of Bergen </Eng>
<Verdi id="Privatist">
<Avsnitt>Emnet er ope for studentar med studierett ved Universitetet i
<Avsnitt>Personar utan studierett ved UiB kan søkje til Det humanistiske
fakultet om å få gå opp til eksamen i emnet. For meir informasjon, sjå: <Lenke>
<Anker>Særskilt eksamen ved Det humanistiske fakultet</Anker>
<Avsnitt>The course is open to students admitted at the University of
<Avsnitt>Persons who are not admitted as students at the University of
Bergen may apply to the Faculty of humanities to be admitted to
examination in the course. For more information, contact the <Lenke>
<Anker>Student Advisor</Anker>
In this respect, attribute elements have much in common with XML entity references.
There are two reasons why UnderDok does not use entity references: First, the occurrence
of entity references cannot be constrained with DTDs. Second, we have found that a need
to transform documents from one XML form to another with XSLT arises from time to time.
Although it is possible to do so without having entity references expanded, we found
that strategy less convenient than using attribute elements.
It is clear enough, however, that although attribute elements may serve some, they
cannot (at
least not conveniently) serve all the purposes which have been called for under the
name of
structured attributes in the XML literature. For example, Stefan Ram [Ram 2004]
suggests the following "hypothetical XML-variant":
< tower
height = <meterlength>40</meterlength>
Name = "miller tower"
On the assumption that one wants to be able to allow any number as element content of
the meterlength element in this example, attribute elements would be unsuited,
as one would have to declare one attribute value for each number. Similar remarks
go for the much
more powerful structured attributes discussed by Michael Kay in [Kay 2013].
Attribute elements are primarily useful in cases where one wants to exert control
over frequently
recurring element content.
Revision control
Revision control is essential to the quality of the work described above. At every
stage in a
cycle one needs to know exactly which changes have been made relative to the last
authorized version, rather than relative to the last previous edit session. For that
purpose, ordinary change tracking facilities as usually found in off-the-shelf word
processors are not very well suited. Such programs usually record every change that
has been made, including changes that have later been cancelled.
Furthermore, different changes are assigned different significance in the revision
process: revisions of certain fields require different treatment than others. For
example, changes in ECTS points, assessment methods, or compulsory requirements require
a more elaborate procedure than changing a typo in prose parts such as "Aim and Content".
Finally,
it is
important to ensure that a change in the Norwegian text of a document is accompanied
by a
corresponding change in the English text, or vice versa.
As observed by many who have worked on the subject, representing change tracking in
documents is a considerable technical challenge (see for example [La Fontaine 2014]
or [Nordström 2014]). This was therefore also expected to be a particular difficulty
with the UnderDok project, especially as there would be a need for representing changes
in a
format which was suitable for use also by readers without any knowledge of XML. Work
was started
to find suitable XML file comparison or change tracking tools, but halted when it
was observed
that change tracking on the XML documents was not really required.
The documents which are actually authorized in the institutions' work with course
descriptions are not the XML document instances, but the XHTML versions as they appear
on paper or on the screens of various kinds of devices. The XML versions contain no
information which is relevant to the revision and authorization process which is not
also available in the XHTML output. And since that output contains only standard and
very basic, static XHTML without scripts or other factors which might complicate issues,
it is considered stable enough to serve as an archival format.
Tools for comparing XHTML files are readily available. Even so, a customized file
comparison tool was developed in order to cater for the more specialized needs of this project,
such as
distinguishing between differences in particular parts of the documents, and aligning
changes in
different language versions of the same document.
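The two specialized needs mentioned above can be illustrated with a minimal Python sketch. It does not reproduce the customized tool itself. It assumes that each field in the XHTML output is an `h3` headline followed by `p` paragraphs, as in the snippet quoted in the next section; the function names, the Norwegian headline "Undervisningsstad" and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an XML namespace prefix such as the XHTML namespace from a tag."""
    return tag.rsplit("}", 1)[-1]


def _text(node):
    """Whitespace-normalized text content of a node."""
    return " ".join("".join(node.itertext()).split())


def split_fields(xhtml):
    """Map each h3 headline to the text of the paragraphs that follow it."""
    fields, current = {}, None
    for node in ET.fromstring(xhtml).iter():
        if _local(node.tag) == "h3":
            current = _text(node)
            fields[current] = []
        elif _local(node.tag) == "p" and current is not None:
            fields[current].append(_text(node))
    return {headline: "\n".join(paras) for headline, paras in fields.items()}


def changed_fields(authorized, draft):
    """Fields whose text differs from the last authorized version."""
    old, new = split_fields(authorized), split_fields(draft)
    return {
        headline: (old.get(headline), new.get(headline))
        for headline in sorted(set(old) | set(new))
        if old.get(headline) != new.get(headline)
    }


def unpaired_changes(changed_nor, changed_eng, headline_pairs):
    """Headline pairs where only one language version of a field has changed."""
    return [
        (nor, eng)
        for nor, eng in headline_pairs
        if (nor in changed_nor) != (eng in changed_eng)
    ]


AUTHORIZED = ("<body><h3>Contact Information</h3><p>Department of Philosophy</p>"
              "<h3>Place of Teaching</h3><p>Bergen</p></body>")
DRAFT = ("<body><h3>Contact Information</h3><p>Department of Philosophy</p>"
         "<h3>Place of Teaching</h3><p>Oslo</p></body>")

english_changes = changed_fields(AUTHORIZED, DRAFT)
print(english_changes)
# No change was made in the (hypothetical) Norwegian version, so the pair is flagged.
print(unpaired_changes({}, english_changes, [("Undervisningsstad", "Place of Teaching")]))
```

Comparing against the last authorized version field by field leaves out intermediate edits that were later cancelled, and the headline of a changed field can then be used to decide which authorization procedure applies.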
Note on the metaphysics of course descriptions
It is sometimes claimed that one of the virtues of generalized declarative markup
is that it
makes the underlying structure and meaning of documents explicit. Presentational markup
is claimed to be more prone to ambiguity and less suited for rigorous text processing
because it
merely represents visual cues tied to typographical conventions, the understanding
of which is
contextually and culturally determined.
Can the observations above, which led to the adoption of XHTML rather than XML as an
archival format for authorized versions of documents, be taken as evidence that XHTML
is a
more adequate representation of the structure and contents of documents? Yes and no.
The UnderDok XHTML representation looks like this:
<h3>Contact Information</h3>
<p>Department of Philosophy</p>
while the UnderDok XML representation looks like this:
<Kontaktinfo Er="FoF-kontakt"/>
The relationship between the snippet and the XHTML representation should, even without
natural language documentation, not be too hard to figure out. In order to figure
out the
relationship to the UnderDok representation, however, one
would need to consult, pace extensive natural language documentation, at least the following.
From System.dtd:
<!ELEMENT InnholdsElement (#PCDATA) >
from DTDkilde.xml:
<InnholdsElement GI="Kontaktinfo">(Norsk | English | (Norsk,English))? </InnholdsElement>
from UUI-dok.dtd:
<!ELEMENT Kontaktinfo (Norsk | English | (Norsk,English))? >
<!ENTITY % Kontaktinfo-verdier " ( FoF-kontakt | Anna | Ingen ) " >
<!ATTLIST Kontaktinfo Er %Kontaktinfo-verdier; #REQUIRED >
from AttOver.xml:
<Element id="Kontaktinfo">
<Eng>Contact Information</Eng>
<Verdi id="FoF-kontakt">
<Avsnitt>Institutt for filosofi og førstesemesterstudier</Avsnitt>
<Avsnitt>Department of Philosophy</Avsnitt>
and finally from Generelt.xsl (imported by Std.xsl):
Compared to the UnderDok XML representation, the XHTML
representation certainly seems to be at a lower level of indirection. Even so, they
may clearly
be said to represent the same document.
In view of this, the UnderDok XML and XHTML representations
may look most of all like parts of equivalent tools for the production of the object
that users
of the document relate to.
The authoritative version of a document in the pre-digital age consisted in a written or
printed document, which is an easily identifiable and (relatively) stable unique physical
object which does not rely on sophisticated technology for consultation. Its authority
was granted by its role in cultural and social practices, which to a large extent relied
on its
properties as a physical object.
The course descriptions discussed here, however, which institutions produce and convey
to their students and which both parties are legally bound by, are neither paper
documents, nor XML or XHTML representations. They are the transient visual objects
occurring under specific conditions, in certain situations and contexts. These objects
too can only be granted authority by cultural and social practices, which will naturally
favour considerations of ease of identification and reidentification.
So the reason for choosing XHTML as an archival format is not that XHTML constitutes a
more faithful representation of the documents than XML, but simply that XHTML is regarded
as the format that is easiest to access, which is most likely to give sufficiently similar
presentations across different hardware and software platforms, and which is also
assumed to
have acceptable archival life. All these assumptions (perhaps especially the latter)
may of
course be wrong.
Appendix A. English Version of a Course Description
This appendix presents the full English version of the course description FIL124.
Figure 6
Figure 7
Appendix B. XML Source for a Course Description
This appendix presents the XML source for the course description FIL124.
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../System/Std.xsl"?>
SYSTEM "../System/UUI-dok.dtd">
<Nynorsk>Introduksjon til praktisk filosofi</Nynorsk>
<Bokmål>Introduksjon til praktisk filosofi</Bokmål>
<Engelsk>Introduction to practical philosophy</Engelsk>
<KortNamn>Praktisk filosofi</KortNamn>
<Studiepoeng Er="10"/>
<E id="FIL124"/>
<Studienivå Er="Bachelor-nivå"/>
<Institutt Er="FoF"/>
<Studierett Er="Ope"/>
<Undervisningsspråk Er="Norsk-eller-engelsk"/>
<Avsnitt>Filosofi blir ofte delt inn i på den eine sida praktisk og den andre
sida teoretisk filosofi. Til praktisk filosofi reknar ein gjerne slike
område som etikk, estetikk, politisk og sosial filosofi, rettsfilosofi,
religionsfilosofi, feministisk filosofi, handlingsteori og verditeori. Men
skiljet mellom praktisk og teoretisk filosofi er ikkje alltid skarpt, det er
heller ikkje uomstritt, og problemstillingar innanfor praktisk filosofi er
ofte relevante for spørsmål i teoretisk filosofi. Difor er det viktig for
alle som studerer filosofi å ha kjennskap til praktisk filosofi, også om dei
i det vidare studiet vel å konsentrere seg om teoretisk filosofi. </Avsnitt>
<Avsnitt><E id="FIL124"/> skal gje studentane eit oversyn over viktige
grunnomgrep, argument og posisjonar i praktisk filosofi. Hovudvekta ligg på
tema i samtidsfilosofien, men det vil ofte være aktuelt å ta utgangspunkt i
filosofiske verk og posisjonar frå eldre tider. Etter fullført emne skal
studentane vere i stand til å formidle sentrale teoriar og problemstillingar
innanfor praktisk filosofi og sjå relevansen av desse i andre samanhengar.
Emnet skal gi grunnlag for vidare studiar i filosofi på
<Avsnitt>Philosophy is often divided into practical philosophy and
theoretical philosophy. Practical philosophy includes such areas as ethics,
aesthetics, political and social philosophy, philosophy of law, philosophy
of religion, feminist philosophy, action theory and value theory. The
distinction between practical philosophy and theoretical philosophy is,
however, not always clear and is a matter of debate, and problems within
practical philosophy are often relevant for questions in theoretical
philosophy. It is therefore important that all who study philosophy have a
solid knowledge of practical philosophy, even if they in their advanced
studies choose to concentrate on theoretical philosophy.</Avsnitt>
<Avsnitt><E id="FIL124"/> aims to give students an overview of important
basic concepts, arguments and positions in practical philosophy. Although
the main emphasis is on subjects from contemporary philosophy, it will often
be appropriate to start with philosophical works and positions from previous
time periods. After completion of the course, the students should be able to
demonstrate insight into central theories and problems from within practical
philosophy and to see their relevance and applicability for other contexts.
The course provides a foundation for further studies in philosophy at the
Bachelor level.</Avsnitt>
<Avsnitt>Etter fullført emne skal studentane ha god kjennskap til viktige
grunnomgrep, argument og posisjonar i praktisk filosofi. </Avsnitt>
<Avsnitt>After taking the course, the students should have a good knowledge
of important basic concepts, arguments and positions in practical
<Avsnitt>Studentane skal kunne kjenne att og formidle innsikt i
grunnleggjande problemstillingar og argument innan praktisk filosofi i
ulike samanhengar.</Avsnitt>
<Avsnitt>After taking the course, the students should be able to recognize
and demonstrate insight into basic problems and arguments within practical
philosophy in different contexts. </Avsnitt>
<Avsnitt>Emnet gir grunnlag for vidare studiar med sikte på bachelorgrad med
spesialisering i filosofi. I kombinasjon med andre emne og fag kan det
inngå i ei utdanning som kvalifiserer for undervisning i filosofi i
ungdomsskule eller videregåande skule. Emnet kan også vere eigna som støtte
til fordjuping i grunnlagsspørsmål i samband med studiet av andre fag.
<Avsnitt>The course provides a basis for further studies with the aim of
attaining a B.A. in philosophy. In combination with other subjects and
disciplines it can form part of an education which qualifies for teaching
philosophy in high schools. The course can also serve as support for a
deeper understanding of basic questions in connection with the study of
other disciplines.</Avsnitt>
<KravTilForkunnskapar Er="Ingen"/>
<AnnaUtdanning Er="Førstesemester"/>
<Framandspråk Er="Engelsk-kunnskapar"/>
<Nor><E id="FIL124"/> bør takast i samband med eller etter <E id="FIL120"/> og
<E id="FIL121"/></Nor>
<Eng>Students are advised to take <E id="FIL124"/> in parallel with or after
<E id="FIL120"/> and <E id="FIL121"/></Eng>
<Undervisningsform Er="Førelesingar-og-seminar-oppmøtekrav-100-nivå"/>
<Eksamen Er="Heimeeksamen-4-dagar-3-5000-ord"/>
<Eksamensemester Er="KvartSemester"/>
<Godkjenning Er="Arbeidskrav-godkjende-før-eksamen"/>
<SemesterForGodkjenning Er="Arbeidskrav-godkjende-i-semester-med-undervising"/>
<MunnlegFramlegg Er="Munnleg-gruppe"/>
<Deltaking Er="Delta-på-to-tredelar-av-seminara"/>
<Generelt Er="Arbeidskrav-gyldige-tre-semester"/>
<Avsnitt>Pensum er innføringsverk og utvalde tekstutdrag.</Avsnitt>
<Avsnitt>The reading list includes an introductory text book and selected
excerpts from other texts.</Avsnitt>
<Karakterskala Er="A-F"/>
<Undervisningssemester Er="Haust"/>
<Undervisningstad Er="Bergen"/>
<Emneevaluering Er="Jamne-mellomrom"/>
<Kontaktinfo Er="FoF-kontakt"/>
