How to cite this paper
La Fontaine, Robin. “Standard Change Tracking for XML.” Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13 (2014). https://doi.org/10.4242/BalisageVol13.LaFontaine01.
Balisage: The Markup Conference 2014
August 5 - 8, 2014
Balisage Paper: Standard Change Tracking for XML
Robin La Fontaine
Robin is the founder and CEO of DeltaXML. He holds an Engineering Science
degree from Oxford University and an MSc in Computer Science. His background
includes computer aided design software and he has been addressing the
challenges and opportunities associated with information change for many
years.
Copyright © 2014 DeltaXML Limited. All Rights Reserved.
Abstract
XML is generally accepted as the default markup language for structured document
and data management systems worldwide. But, in spite of the fact that XML document
standards have matured over the past decade and despite its widespread use, XML
still has a significant shortcoming that limits its usefulness in this role. It has
no native ability to track changes. There is rudimentary support for change tracking
in some document formats, but a full solution is not available. The consensus
emerging is that this is an XML problem rather than a DITA, DocBook or XHTML
problem.
A generic change-tracking standard would transform the utility of XML. It would
allow documents to be moved from one XML editor to another, complete with change
history and the ability to roll back to previous versions; it would allow editing
applications to track changes in any XML document type; and software designed to
handle change in XML could be applied to many different XML document types.
The W3C now has a Community Group (W3C Change Community Group
http://www.w3.org/community/change/) looking into developing a standard solution.
This paper outlines one proposed solution to this important problem.
The purpose of the proposed change tracking format is to represent successive
changes or edits to an XML document, typically in one or more editing sessions. This
paper describes how such changes may be represented in XML markup or in Processing
Instructions. The tracked changes are designed to be used either as an independent
addition to a file or integrated into the applicable schema.
Table of Contents
- Introduction and Background
- Status
- State-of-the-art for XML Change Tracking
- Line based diff
- Processing instructions
- Revision flags
- Tracked changes
- Generic XML deltas
- Approach to Standard Change Tracking for XML
- Validation
- Discarding Changes
- Complex Changes
- Overhead and Readability
- Representation in XML
- Definitions and underlying rules
- Atomic Change: the basic building block
- Change Transaction: a change from one valid state to another
- Change Transaction grouping: controlling interaction and dependency
- Final state of a document: discard all the tracked changes
- Validation
- Additions, deletions and moves
- Moves: one or more additions linked to a deletion
- Namespaces
- Change Transaction (CT) Structure
- Tracking Changes: Level 1
- Change Tracking attributes: Level 1
- Change Tracking Elements: Level 1
- Add an element and its content (insert-with-content)
- Description
- Example
- Comments and Rationale
- Delete an element and its content (remove-with-content)
- Description
- Example
- Comments and Rationale
- Add an attribute to an element
- Description
- Example
- Comments and Rationale
- Delete an attribute from an element
- Description
- Example
- Comments and Rationale
- Change the value of an attribute
- Description
- Example
- Comments and Rationale
- Move an element (move)
- Description
- Example
- Comments and Rationale
- Add text (PCDATA)
- Description
- Example
- Comments and Rationale
- Delete mixed or PCDATA content
- Description
- Comments and Rationale
- Integration with a host format
- Stand-alone use of 'XML Track Changes'
- Host-integrated 'XML Track Changes'
- Schema Integration
- Schema Integration Level 1
- Conclusions
Introduction and Background
The lack of any standardised change-tracking capability in XML document formats places
a real constraint on the potential of an otherwise universally accepted tool for
document and data management. While change tracking is commonly available in most
other document editing systems, the change-tracking capability of XML editors is
typically fairly basic: many do not track attribute changes, and there is no common
standard. The result is that documents with tracked changes cannot be moved between
XML editors unless some form of transformation is applied, and this can result in
loss of information.
There is a real opportunity to make XML much more powerful by creating a standard
way to track changes in XML documents, which would mean that:
- documents with tracked changes could be moved from one XML editor to another
- XML editors could track changes in any XML document type
- every XML document type could include a change history and the ability to roll
  back to previous versions
- software designed to handle change in XML could be applied to many different XML
  document types
Today, every XML document type takes its own approach to change tracking. For example,
OOXML is built on the underlying binary model within Microsoft Word; ODF has only a
limited capability to track some changes; DITA uses rev and status attributes to
indicate changes and DocBook similarly has a revisionflag attribute, but neither can
track attribute or structural changes.
XML editors track changes either by additional markup or by using Processing
Instructions (PIs). Additional markup has the advantage of structure but at the cost
of modifying the underlying schema. PIs have the advantage of preserving the latest
state of the document in valid XML markup, but the PIs do not have structure and so
are limited in the changes they can track.
This paper introduces a possible solution that takes into account the current
approaches, building on their strengths and addressing their weaknesses. The primary
use case here is tracking the successive changes that are made to a document by a
single editor over some period of time. The format does not address the issue of
merging changes from a number of different editors, or merging different versions of a
single document that have been independently edited. The format does cater for the
situation where there is a dependency between changes, for example modifying a word in
an inserted paragraph; in this situation, the word modification change depends on the
paragraph insertion change. It should always be possible to reject each change in
reverse order, i.e. starting with the last change and moving back through earlier
changes, and thus end up with the original document.
Status
This work was originally done to demonstrate how change tracking within the
OpenDocument Format (ODF) could be improved and extended. It was subsequently offered
to the wider XML community when the ODF Committee opted instead to track edit
operations rather than changes, using an Operational Transformation approach, a
solution that could not be applied in a generic way to benefit other XML groups.
This proposal was submitted to a Community Group established by the W3C in 2012 to
explore change tracking [1]. There is also a sandbox [2] that demonstrates
interactively how it works. The approach was implemented and a large number of
examples were created and validated against a Schematron rule set [3]. XSLT style
sheets were also written to extract the final version from a change-tracked document
and to undo the latest change in a change-tracked document.
The OpenDocument Format (ODF) is used in the worked examples in this paper.
The original work was supported with a grant from Stichting NLnet.
State-of-the-art for XML Change Tracking
There are many different approaches to tracking document changes in XML. A fuller
review of different approaches can be found in [5], which looks at different use cases for situations where change to XML is
important and reviews the different approaches used by some of the more popular
formats, including OpenDocument, Open XML, DocBook, DITA and editors including
XMetaL, oXygen and Xopus.
In this paper we will use some of those examples of change tracking in current
systems in order to provide some context to the later presentation of the approach
being considered within the W3C Community Group.
There are several different ways of representing changes in XML. Although in
general these are applied to the changes between two documents, some of them can be
extended to show or represent changes between multiple documents, or multiple
versions of the same document. It is interesting to note that all the examples use
an in-line representation of changes, i.e. the changes are represented within the
document itself, rather than as a separate file.
It is not possible, in this paper, to do a complete review of the capabilities of
all the existing change tracking systems, and the small example below does not do
justice to their capabilities. They each have different capabilities for
representing changes such as changes to attributes, changes to formatting, and
changes to structure. Perhaps the greatest challenge to any change tracking
mechanism is the ability to represent changes to structure, and this includes the
well-known problem of representing overlapping hierarchies [13], but with the added twist that the content has also
typically changed. Although a detailed discussion of this is beyond the scope of
this paper, the proposal does address this issue, which is important in the context
of document editing.
The example used here is to change "The very quick brown fox jumped over lazy dog."
to "The quick brown fox jumped over the lazy dog."; that is, the word "very" is
deleted and the word "the" is inserted. The examples have been shortened by removing
some information that is not relevant to the discussion, and have been
pretty-printed for clarity.
Line based diff
The traditional output of the UNIX diff utility shows changes between two text
documents on a line-by-line basis. It is obviously possible to show differences
between two XML documents in a similar way.
< <p>The very quick brown fox jumped over lazy dog.</p>
---
> <p>The quick brown fox jumped over the lazy dog.</p>
This representation has a number of limitations for XML because of its syntax and
tree structure, neither of which is reflected in the line-based structure. It may be
useful to accept or reject changes based on lines for a regular text document, but
this is unlikely to work for an XML document where the structure is often easily
destroyed by moving lines from one document to another.
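The limitation can be illustrated with a minimal Python sketch that uses the standard-library difflib module (our choice for illustration; it is not part of the proposal). A two-word edit is reported as the replacement of the whole line, and nothing in the output reflects the element structure.

```python
import difflib

old = ["<p>The very quick brown fox jumped over lazy dog.</p>"]
new = ["<p>The quick brown fox jumped over the lazy dog.</p>"]

# A line-based diff reports one replaced line, not two changed words.
# The file names are labels only; no files are read.
diff = list(difflib.unified_diff(old, new, "old.xml", "new.xml", lineterm=""))
for line in diff:
    print(line)
```

Accepting or rejecting this change can only be done for the line as a whole, which is too coarse when a single line holds several independent edits.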
Processing instructions
Because processing instructions are in effect external to the main structure
of an XML document, they are commonly used to mark additions and deletions in
XML editors. Examples include XMetaL [6], Xopus [7], and oXygen [8].
One of the great advantages of using processing instructions to represent changes
is that the underlying XML file can still be validated by ignoring the processing
instructions. This implies that any deleted content will be within a processing
instruction, and any added content will be marked by a start and end marker, each
of which is a processing instruction.
<topic id="topic-1">
<title>Topic title</title>
<body>
<p>The <?oxy_delete author="robin"
timestamp="20100113T140621+0000"
content="very"?> quick
brown fox jumped over
<?oxy_insert_start author="robin"
timestamp="20100113T140625+0000"?>the
<?oxy_insert_end?>lazy dog.</p>
</body>
</topic>
In this example, you can see that if all of the processing instructions are
removed, the result is a valid file which represents all the changes being
accepted.
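As a minimal sketch of this property, the Python fragment below parses a shortened copy of the paragraph with the standard-library ElementTree module, whose default parser discards processing instructions. The timestamps have been omitted for brevity, and whitespace is normalised only to make the result easy to read.

```python
import xml.etree.ElementTree as ET

tracked = (
    '<p>The <?oxy_delete author="robin" content="very"?> quick brown fox '
    'jumped over <?oxy_insert_start author="robin"?>the '
    '<?oxy_insert_end?>lazy dog.</p>'
)

# The default parser drops processing instructions, so the parsed tree is
# already the "all changes accepted" state of the paragraph.
accepted = ET.fromstring(tracked)

# Collapse runs of whitespace left behind where the PIs used to be.
accepted_text = " ".join("".join(accepted.itertext()).split())
print(accepted_text)
```

Rejecting the changes is harder: the deleted text has to be recovered from inside the PI and the inserted range removed, which requires PI-aware processing.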
Revision flags
Attributes are often used to show revisions to parts of an XML document in order
to generate output showing where a document has been revised. This mechanism is
built into the XML format itself, and any processor would need to know about this
in order to reflect the changes. Examples include DocBook and DITA.
<topic xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/">
<title>Topic title</title>
<body>
<p>The <ph rev="deltaxml-delete">very </ph>
quick brown fox jumped over
<ph rev="deltaxml-add">the </ph>lazy dog.</p>
</body>
</topic>
In this example, the revision flags have been inserted by comparing two versions
of a document, and putting revision flags around text that has been either added or
deleted. The added or deleted text can then be decorated in the publishing pipeline,
either to PDF or HTML.
Tracked changes
Some document formats use a more sophisticated version of revision flags to
show where text has been added or deleted. These tracked changes are represented
in the XML structure, and the editing system may enable the editor to accept or
reject them. Tracked changes are typically not able to represent all the
possible changes to a document, but will satisfy the needs of a typical editor.
Examples include Arbortext, OpenDocument Format (ODF) and Open XML.
The example below shows the Arbortext track-change format; unlike other XML
editors, Arbortext uses markup to show changes.
<para>
The
<atict:del user="deltaxml" time="1403627577">
very
</atict:del>
quick brown fox jumped over
<atict:add user="deltaxml" time="1403627577">
the
</atict:add>
lazy dog.
</para>
The example below is in the OpenDocument text format, ODT.
<office:body>
<office:text>
<text:tracked-changes>
<text:changed-region text:id="ct528047904">
<text:deletion>
<text:p text:style-name="Standard">
very
</text:p>
</text:deletion>
</text:changed-region>
<text:changed-region text:id="ct645104016">
<text:insertion />
</text:changed-region>
</text:tracked-changes>
<text:p text:style-name="Standard">
The
<text:change text:change-id="ct528047904" />
<text:s />
quick brown fox jumped over
<text:change-start text:change-id="ct645104016" />
the
<text:change-end text:change-id="ct645104016" />
lazy dog.
</text:p>
</office:text>
</office:body>
In this example of tracked changes, notice that the deleted text is held in a
separate place from the main body text. This means that the main body of the
document is very close to the new version of the document, i.e. with all the changes
accepted. However, it is not trivial to reinsert the deleted text in its correct
position, with all of the text decoration intact.
Microsoft Word also has a change tracking mechanism using a similar in-line
markup, though the deleted text is held in situ, as shown below.
<w:body>
<w:p w:rsidR="00D41E6C" w:rsidRDefault="005B7EF1">
<w:r>
<w:t xml:space="preserve">
The
</w:t>
</w:r>
<w:del w:id="0" w:author="Robin La Fontaine" w:date="2014-06-24T16:07:00Z">
<w:r w:rsidDel="005B7EF1">
<w:delText xml:space="preserve">
very
</w:delText>
</w:r>
</w:del>
<w:r>
<w:t xml:space="preserve">
quick brown fox jumped over
</w:t>
</w:r>
<w:ins w:id="1" w:author="Robin La Fontaine" w:date="2014-06-24T16:08:00Z">
<w:r>
<w:t xml:space="preserve">
the
</w:t>
</w:r>
</w:ins>
<w:r>
<w:t>
lazy dog.
</w:t>
</w:r>
</w:p>
</w:body>
Generic XML deltas
A delta file can be defined such that it represents the differences between
two arbitrary XML documents, in XML. Any XML format that is capable of updating
one XML document into another could be described as a generic delta, for example
XQuery Update Facility [9], XSLT [10]
or DeltaXML [11]. Typically this will operate in one
direction only (XQuery Update Facility, XSLT), though a symmetrical
representation is also possible (DeltaXML). The delta may be a transformation,
defining how to get from one document to another, or a data representation,
defining what is different between the documents.
XQuery Update Facility and XSLT are both declarative transformations and can
represent complex changes. They are intended to be executable transformations
between two XML documents and will, in conjunction with the execution engine,
convert one document into another. They are not intended to be used as a
change-tracking mechanism. They do not meet any of the needs of the use case
scenarios described here.
A delta file that is a data representation describes, in some way, the
differences between two documents. A useful derivation of such a generic XML
delta file would be one that contained not only the changes but also the
original data, all in XML. Both of the original documents could be generated
from such a delta representation. Ideally such a delta representation would not
duplicate content that is common to the two documents. It is possible to
transform this type of data representation into any of the other types of
representation listed above, although the reverse is in general not possible.
Therefore this generic delta representation is very versatile. An example of
this is the full-context delta used by DeltaXML, shown in the example
below.
<para deltaxml:deltaV2="A!=B"
deltaxml:version="2.0"
deltaxml:content-type="full-context">
The
<deltaxml:textGroup deltaxml:deltaV2="A">
<deltaxml:text deltaxml:deltaV2="A">
very
</deltaxml:text>
</deltaxml:textGroup>
quick brown fox jumped over
<deltaxml:textGroup deltaxml:deltaV2="B">
<deltaxml:text deltaxml:deltaV2="B">
the
</deltaxml:text>
</deltaxml:textGroup>
lazy dog.
</para>
In this example, all of the data from both documents, A and B, is present and has
the same look and feel as the original documents. The delta element wrappers and
attributes indicate where the documents differ. Either version of the document can
quite easily be extracted from this representation, for example using XSLT or
XQuery. Note that the actual delta file will not comply with the original DTD/schema
because of the additional delta wrapper elements and attributes, but each version
that is extracted will be valid against the DTD/schema. Although not shown in this
example, the format is capable of representing changes to attributes and elements as
well as text. The format also extends to represent changes between more than two
documents, for example the changes between two concurrent edits and the document
from which the edits are derived.
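The extraction step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than DeltaXML's own tooling: the namespace URI is assumed (the example above omits the declaration), the helper version_text is hypothetical, and only text content is recovered, not element structure.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; the example in the text omits the declaration.
DX = "http://www.deltaxml.com/ns/well-formed-delta-v1"
DELTA_ATTR = "{%s}deltaV2" % DX

delta_doc = (
    '<para xmlns:deltaxml="%s" deltaxml:deltaV2="A!=B">The '
    '<deltaxml:textGroup deltaxml:deltaV2="A">'
    '<deltaxml:text deltaxml:deltaV2="A">very </deltaxml:text>'
    '</deltaxml:textGroup>quick brown fox jumped over '
    '<deltaxml:textGroup deltaxml:deltaV2="B">'
    '<deltaxml:text deltaxml:deltaV2="B">the </deltaxml:text>'
    '</deltaxml:textGroup>lazy dog.</para>' % DX
)

def version_text(elem, version):
    """Collect the text of one version ("A" or "B") from a delta subtree."""
    parts = [elem.text or ""]
    for child in elem:
        # The marker lists the versions a node belongs to: "A", "B", "A=B"
        # or "A!=B". An unmarked child is treated as common to both.
        marker = child.get(DELTA_ATTR, "A=B")
        if version in marker.replace("!=", "=").split("="):
            parts.append(version_text(child, version))
        # Text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(delta_doc)
text_a = version_text(root, "A")
text_b = version_text(root, "B")
print(text_a)
print(text_b)
```

Because both versions are held in one tree, the same traversal serves either direction; only the version label passed in differs.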
However, this representation is not well-suited to the change tracking
scenario where a large number of small changes need to be represented. This is
because the representation has quite large overhead for small changes, and it is
focused on the problem of representing a large number of changes between a small
number of documents, rather than a large number of changes to a single
document.
Approach to Standard Change Tracking for XML
Generic change tracking for XML is a complex problem. It needs to cover not only
addition, deletion and modification of elements and attributes but also changes to
the XML structure, for example when a paragraph element is split into two, or a
<div/> element is wrapped around some other elements. A solution which would meet all
the requirements of XML change tracking would inevitably be complex, but it may not be
essential for many applications.
A simpler approach can produce significant and useful results, and this is covered
in the basic change tracking capability described as Level 1, which is the subject
of this paper. Moving to a more powerful solution will give a better user experience
but at the cost of increased complexity. This is covered in Level 2.
Level 1 provides the ability to modify attributes, add and delete elements, and add
and delete text. It also enables changes to be grouped into transactions where a single
transaction moves the document from one valid state to another. Changes can be
represented as markup or PIs in a way that allows lossless transformation between
them, thus gaining the advantages of both.
Level 2 adds to this the ability to add or delete element structure around existing
content and to split and merge elements in more complex ways. In terms of changes to
XML documents, the content (typically text) takes priority over the structure
(typically paragraphs, tables and text decoration). In other words, an editor does not
want to see change to content when only the structure or styling has been changed. As
an example of this, when a newline is inserted in the middle of a paragraph, i.e. to
split it into two paragraphs, the editor does not expect to see change to those
paragraphs but rather the insertion of a new line. This does not always fit well with
the underlying XML structure, and could not be represented in Level 1. Level 2
addresses this, but it is a complex issue and the solution is therefore more
complicated. We do not cover Level 2 in this paper; it is covered in detail in [4].
Validation
The proposed change tracking format avoids complex semantic rules relating to the
correctness of changes. Rather it takes a generic approach so that almost any change
can be represented, and then defines correctness in terms of the validity of the
state before and after the change. This enables a simple, intuitive and powerful
statement for the validity of a change (and this may include syntactic and semantic
validity): if the document before the change is valid and the document after the
change is valid, then the change is valid.
Discarding Changes
The format takes account of the task of a reader application that may not be able
to understand changes at all or in certain areas. For example, if a reader is unable
to represent any changes, it must be easy to read in the latest version of a
document. This would also apply to individual fragments or subtrees within the
document.
Complex Changes
A single action by an editor may generate changes in a number of different places
in the document. For example, a global change or 'replace all' will generate changes
throughout a document, or deleting a column in a table will generate several
disjoint changes in the underlying XML representation. Therefore there is clearly a
need to represent a number of small atomic changes as a single action. Also the
format provides some flexibility in the way these can be grouped, so that for
example a global change can be accepted/rejected in one action or each change
handled separately.
Overhead and Readability
A localised change should have a simple, intuitive, and localised representation
within a document. For example, when an attribute is changed the format should not
generate a large amount of structure to represent that change. On the other hand,
the format should not require a lot of parsing of attribute values or other
information in order to determine the nature of a change. These criteria may
conflict, and in such cases a balance between these issues should be sought and
explained.
Representation in XML
This proposal includes alternative representations for the changes in XML, i.e.
markup and Processing Instructions (PIs). The reason for this is that markup is the
proper native XML way to represent such information but has the significant
disadvantage that it affects the nature of the document. On the other hand, if all
changes are wrapped in PIs, then the document remains as a ‘normal’ document and can
be validated and processed as usual because the PIs are simply discarded on reading
the document. The approach taken is, in general, to convert the outermost change
tracking element into a PI and wrap everything else within it. Therefore the
conversion to and from PIs is quite simple and there is only one structure that
needs to be defined. This approach may be too simple in practice and further
development may be needed here. Attribute changes present a particular problem, and
a view must be taken on whether it is best to represent these before or after the
start tag of an element.
Definitions and underlying rules
Atomic Change: the basic building block
An Atomic Change is a change such as the addition of an element or removal of an
attribute, which represents a single syntactic change. The representation may
involve more than one element or attribute. It is not appropriate to limit an Atomic
Change to one that cannot be subdivided. For example, the deletion of an element and
its contents, i.e. its attributes and children, is considered to be atomic, whereas
in principle this could be split into a collection of atomic changes that removes
each leaf node in the XML structure. Further, even these leaf nodes could in
principle have their textual content removed one character at a time. Forcing
systems to record change at this level of detail is inappropriate.
Each atomic change is part of one and only one Change Transaction (CT), described
later. This is enforced because each atomic change references the ID of the CT to
which it belongs. This grouping is very important, because it means that we can form
a change out of any number of atomic changes. This implies that we only need a few
atomic change operations, and these can be combined in complex ways to create
CTs.
Change Transaction: a change from one valid state to another
A Change Transaction (CT) consists of one or more Atomic Changes, and is uniquely
identified by an identifier (ID). If more than one Atomic Change is involved, there
is no ordering among them; they are considered to happen as a single operation. A CT
is therefore an indivisible change, which is represented as a single
transaction.
One CT may depend on others. In other words, it may not be possible to apply a
particular CT unless some other CT on which it depends is applied first. For
example, if some text has been added, and then one of the words is deleted, it is
not possible to accept the deletion if the addition has been rejected.
Where a document has more than one CT, the order of the CTs must be defined. If we
want to support an undo operation, then the ordering would be important. In general,
changes made by an editor are done in a certain order, because a particular change
may depend on a previous change. This ordering therefore represents the default
dependency, i.e. by default each CT depends on all the previous CTs. A particular
application may be able to provide more intelligent information on the dependencies,
and this is achieved with grouping.
Change Transaction grouping: controlling interaction and dependency
CTs can be grouped either in a specific order (CT Stack) or as a set (CT Set).
These groupings are for convenience, for example to allow a global edit (change
all), or an editing session, to be undone in one operation. A CT group may only
reference previously-defined CTs or CT groups, to avoid circular definitions.
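The rule that a group may only reference previously-defined members can be sketched as a small registry. The class and method names below are hypothetical and chosen for illustration; the proposal itself only defines the XML representation.

```python
class ChangeRegistry:
    """Keeps CTs and CT groups in definition order (hypothetical helper)."""

    def __init__(self):
        self.order = []    # IDs in the order they were defined
        self.members = {}  # group ID -> list of member IDs; a CT maps to None

    def add_ct(self, ct_id):
        self._define(ct_id, None)

    def add_group(self, group_id, member_ids):
        # Members must already be defined, so a group can never refer to
        # itself, directly or through another group.
        unknown = [m for m in member_ids if m not in self.members]
        if unknown:
            raise ValueError("undefined members: %s" % ", ".join(unknown))
        self._define(group_id, list(member_ids))

    def _define(self, new_id, members):
        if new_id in self.members:
            raise ValueError("duplicate ID: %s" % new_id)
        self.members[new_id] = members
        self.order.append(new_id)

    def expand(self, some_id):
        """Return the CT IDs an ID stands for, flattening nested groups."""
        members = self.members[some_id]
        if members is None:
            return [some_id]
        return [ct for m in members for ct in self.expand(m)]


registry = ChangeRegistry()
for ct in ("ct1", "ct2", "ct3"):
    registry.add_ct(ct)
registry.add_group("cs4", ["ct2", "ct3"])   # e.g. one global edit
registry.add_group("cs5", ["ct1", "cs4"])   # e.g. a whole editing session
print(registry.expand("cs5"))
```

Since every reference points backwards, expanding a group always terminates, and undoing a group reduces to undoing the CTs it expands to.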
Final state of a document: discard all the tracked changes
The final version of a document is the final state of its root element. The final
state of an element is the element and its attributes and the final state of all its
content. When determining the final state, any deleted element is ignored, and the
change history of any attributes is ignored. The format is designed so that the
final state of a document can be determined by simply ignoring certain elements,
ignoring elements with particular attributes and ignoring some attributes. For the
PI representation, it is simply necessary to ignore all the PIs.
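For the markup representation, a final-state extraction might look like the following Python sketch. It is a simplification under stated assumptions: it covers Level 1 only, it treats every element in the delta namespace as ignorable, the function final_state is hypothetical, and the ac: attribute value is a placeholder rather than the real comma-separated syntax. The namespace URIs are the example ones used in this paper.

```python
import xml.etree.ElementTree as ET

DELTA = "http://www.deltaxml.com/ns/track-changes/delta-namespace"
AC = "http://www.deltaxml.com/ns/track-changes/attribute-change-namespace"

def _in_ns(name, ns):
    return name.startswith("{" + ns + "}")

def final_state(elem):
    """Return a copy of elem with the tracked-change markup discarded."""
    out = ET.Element(elem.tag)
    for name, value in elem.attrib.items():
        # Drop change-tracking attributes, keep everything else.
        if not (_in_ns(name, DELTA) or _in_ns(name, AC)):
            out.set(name, value)
    out.text = elem.text
    last = None
    for child in elem:
        if not _in_ns(child.tag, DELTA):
            last = final_state(child)
            out.append(last)
        # Text after a child survives even when the child itself is dropped.
        tail = child.tail or ""
        if last is None:
            out.text = (out.text or "") + tail
        else:
            last.tail = (last.tail or "") + tail
    return out

tracked = (
    '<doc xmlns:delta="%s" xmlns:ac="%s">'
    '<p delta:insertion-type="insert-with-content" '
    'delta:insertion-change-idref="ct1">New paragraph.</p>'
    '<delta:removed-content delta:removal-change-idref="ct2">'
    '<p>Old paragraph.</p></delta:removed-content>'
    '<p ac:change001="placeholder" class="body">Kept paragraph.</p>'
    '</doc>' % (DELTA, AC)
)

result = ET.tostring(final_state(ET.fromstring(tracked)), encoding="unicode")
print(result)
```

No knowledge of the host document type is needed: the traversal only inspects namespaces, which is what allows a reader that does not understand tracked changes to load the latest version.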
Validation
A CT is valid if the document before the CT is valid, and the document after the
CT is applied is valid. This is a very simple definition of semantic correctness,
and it means that we do not need a lot of complex rules about which combinations of
changes are correct.
Therefore we can say that a document is valid if its final state is valid and all
the CTs it contains are valid. We are using the term ‘valid’ here to mean whatever
validation is relevant to the document, including any relevant syntax and semantic
rules.
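Read this way, the definition amounts to requiring that every state between successive CTs is valid. A minimal sketch, assuming the successive document states are available as XML strings and using a stand-in validator (a real application would substitute its own schema and semantic checks):

```python
import xml.etree.ElementTree as ET

def validate(xml_text):
    # Stand-in rule for illustration: a <doc> must contain at least one <p>.
    root = ET.fromstring(xml_text)
    return root.tag == "doc" and root.find("p") is not None

def ct_is_valid(before, after, validate):
    """A CT is valid if the states on both sides of it are valid."""
    return validate(before) and validate(after)

def document_is_valid(states, validate):
    """states[0] is the original document, states[-1] the final state, and
    each adjacent pair of states is separated by exactly one CT."""
    cts_valid = all(
        ct_is_valid(a, b, validate) for a, b in zip(states, states[1:])
    )
    return validate(states[-1]) and cts_valid

states = [
    "<doc><p>One</p></doc>",
    "<doc><p>One</p><p>Two</p></doc>",   # after ct1: paragraph added
    "<doc><p>Two</p></doc>",             # after ct2: paragraph removed
]
print(document_is_valid(states, validate))
print(document_is_valid(states + ["<doc/>"], validate))
```

The second call fails because its last CT leaves the document in an invalid state, without any rule having to describe that CT itself.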
Additions, deletions and moves
An element can only come into existence once and go out of existence once. Once an
element has gone out of existence, or died, no further changes can be made to that
element or its content. This is an important (but also intuitive) simplification
because it means we do not need to cater for elements going out of existence and
then coming back into existence again, which would make the format much more
complex.
Moves: one or more additions linked to a deletion
Text and/or elements may be moved to one or more other locations in a document.
This is represented as an element being deleted from one place and added in one or
more other places in the document. The change history of an element is not moved
with the element.
Content that has been moved from position A to position B can be moved again from
B but it is deleted from A and so cannot be moved from A in a later
operation.
Namespaces
The namespaces are defined as follows (the deltaxml.com namespace is only used as
an example):
xmlns:delta="http://www.deltaxml.com/ns/track-changes/delta-namespace"
xmlns:ac="http://www.deltaxml.com/ns/track-changes/attribute-change-namespace"
Change Transaction (CT) Structure
There must be a position in the document where the change transactions are defined,
each being identified by an identifier (ID). Each will have some associated meta
information such as the name of the author who made the change, and the date.
The ordering of the change transactions is important. If a user wishes to undo the
changes one by one, then this can be achieved by undoing the change transaction at
the end of the list and then moving up the list.
As mentioned above, it is also possible to group CTs in a change transaction group
(CT group). This will have similar meta information to a CT, and will reference CTs
or other CT groups that it groups together, i.e. that are its members. Again, all the
members must be previously-defined CTs or CT groups. The effect of undoing a CT group
will be to undo a number of CTs, which would then be removed from the list.
A software application that does not understand this grouping can ignore the groups,
and the result will be some loss of structure but no effect on the underlying tracked
changes. It is only a CT that has an effect on the document; the CT groups merely
provide structure for user convenience.
A CT group may be ordered (CT stack, delta:change-transaction-stack) or unordered
(CT set, delta:change-transaction-set). The members of a CT set can be accepted or
rejected in any order. The members of a CT stack must be accepted or rejected in the
defined order, i.e. undo last member first.
Example:
<delta:tracked-changes>
<delta:change-transaction delta:change-id="ct1">
<delta:change-info>
<dc:creator>Robin</dc:creator>
<dc:date>2010-06-02T15:48:00</dc:date>
</delta:change-info>
</delta:change-transaction>
<delta:change-transaction delta:change-id="ct2"
delta:edit-operation="make-bold">
<delta:change-info>
<dc:creator>Robin</dc:creator>
<dc:date>2010-06-02T15:48:01</dc:date>
</delta:change-info>
</delta:change-transaction>
<delta:change-transaction delta:change-id="ct3"
delta:edit-operation="text-edit">
<delta:change-info>
<dc:creator>Robin</dc:creator>
<dc:date>2010-06-02T15:48:01</dc:date>
</delta:change-info>
</delta:change-transaction>
<delta:change-transaction-set delta:change-group-id="cs4">
<delta:change-info>
<dc:creator>Robin</dc:creator>
<dc:date>2010-06-02T15:48:01</dc:date>
</delta:change-info>
<delta:change-log>Global edit</delta:change-log>
<delta:change-references>
<delta:change-ref delta:change-idref="ct2"/>
<delta:change-ref delta:change-idref="ct3"/>
</delta:change-references>
</delta:change-transaction-set>
...
</delta:tracked-changes>
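As an illustration of how an application might read this structure, the Python sketch below parses a trimmed copy of the example (the change-info metadata is omitted), lists the CTs in definition order, derives the undo order, and resolves the members of the CT set. The namespace URI is the example one given under Namespaces; the variable names are ours.

```python
import xml.etree.ElementTree as ET

DELTA = "http://www.deltaxml.com/ns/track-changes/delta-namespace"
NS = {"delta": DELTA}

tracked_changes = (
    '<delta:tracked-changes xmlns:delta="%s">'
    '<delta:change-transaction delta:change-id="ct1"/>'
    '<delta:change-transaction delta:change-id="ct2" '
    'delta:edit-operation="make-bold"/>'
    '<delta:change-transaction delta:change-id="ct3" '
    'delta:edit-operation="text-edit"/>'
    '<delta:change-transaction-set delta:change-group-id="cs4">'
    '<delta:change-references>'
    '<delta:change-ref delta:change-idref="ct2"/>'
    '<delta:change-ref delta:change-idref="ct3"/>'
    '</delta:change-references>'
    '</delta:change-transaction-set>'
    '</delta:tracked-changes>' % DELTA
)

root = ET.fromstring(tracked_changes)

# CTs appear in the order in which they were made.
ct_ids = [
    ct.get("{%s}change-id" % DELTA)
    for ct in root.findall("delta:change-transaction", NS)
]

# Undo starts with the CT at the end of the list and moves up.
undo_order = list(reversed(ct_ids))

# Each CT set maps to the IDs it references.
groups = {
    group.get("{%s}change-group-id" % DELTA): [
        ref.get("{%s}change-idref" % DELTA)
        for ref in group.iterfind(".//delta:change-ref", NS)
    ]
    for group in root.findall("delta:change-transaction-set", NS)
}

print(undo_order)
print(groups)
```

An application that does not understand groups can stop after computing ct_ids; the groups add convenience but do not change what each CT does to the document.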
Tracking Changes: Level 1
This section details the attributes and elements needed to support the representation
of atomic changes, which are the lowest level changes that can be represented. All
changes can be represented using these atomic changes.
It is possible to move back from the final version of a document through successive
changes to previous versions of a document. It may not be easy to extract an arbitrary
version, but it is always possible to undo the last CT and thus work back through
versions, i.e. the state between each edit action or CT.
Change Tracking attributes: Level 1
Attribute | Values | Description |
delta:insertion-type | 'insert-with-content' | Indicates how an element was created. Absence means the element existed in the oldest version of the document. |
delta:insertion-change-idref | References a delta:change-id | References the CT that brought this element into existence. Present on all elements with a delta:insertion-type attribute. |
delta:removal-change-idref | References a delta:change-id | References the CT that removed some content from the document. Can appear on a delta:removed-content element. |
ac:XXX | Details of the attribute change, comma separated | ac: is a defined namespace, XXX is a generated attribute name, each new XXX represents a change to one attribute. |
delta:move-id | Defines an ID for a move | Can appear on a delta:removed-content or delta:merge element. |
delta:move-idref | References an ID for a move | Can appear on an element with delta:insertion-type='insert-with-content' to indicate the element and content was moved from elsewhere to this place. Can appear on delta:inserted-text-start to indicate the text was moved from elsewhere to this place. |
delta:change-id | Defines an ID for a CT | Identifies a CT |
delta:inserted-text-end-id | Defines an ID for a delta:inserted-text-end | Identifies the end element of a text insertion. |
delta:inserted-text-end-idref | Reference to delta:inserted-text-end-id | Identifies the end element for some inserted text. |
delta:edit-operation | Values defined in the standard or by a particular editing application | Optional on CT, CT set and CT stack to identify the type of edit-operation that this represents, e.g. text-to-table, global-replace, make-bold, libreOffice:macro23 |
Change Tracking Elements: Level 1
Element | Description |
delta:removed-content | Contains element, PCDATA or mixed content that has been removed. |
delta:inserted-text-start | Identifies the start point of some inserted text. |
delta:inserted-text-end | Identifies the end point of some inserted text. |
Add an element and its content (insert-with-content)
Description
The whole element is added with its content.
Example
Addition of a paragraph.
<text:p delta:insertion-type="insert-with-content"
delta:insertion-change-idref='ct1234'>
This paragraph is inserted.</text:p>
Example PI:
<text:p><?delta-tracked-change-attributes delta:insertion-type="insert-with-content"
delta:insertion-change-idref='ct1234'?>
This paragraph is inserted.</text:p>
Comments and Rationale
An added item may contain changes within it, but the changes must all be after
it was added.
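Undoing an insert-with-content change amounts to removing every element that references the CT. The following Python sketch shows one way this might be done for the attribute form (not the PI form); the namespace URIs are placeholders, since none are given in the text, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, used only for this illustration.
DELTA = "urn:example:delta"
TEXT = "urn:example:text"


def remove_keeping_tail(parent, child):
    """Remove child from parent but keep the text that follows it."""
    index = list(parent).index(child)
    tail = child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def undo_insertions(root, ct_id):
    """Drop every element that was inserted by the change transaction ct_id."""
    attr = f"{{{DELTA}}}insertion-change-idref"
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get(attr) == ct_id:
                remove_keeping_tail(parent, child)


doc = ET.fromstring(
    f'<body xmlns:delta="{DELTA}" xmlns:text="{TEXT}">'
    '<text:p>Kept.</text:p>'
    '<text:p delta:insertion-type="insert-with-content" '
    'delta:insertion-change-idref="ct1234">This paragraph is inserted.</text:p>'
    '</body>'
)
undo_insertions(doc, "ct1234")
print(len(doc))
```

Because an added item may only contain changes made after it was added, removing the element also discards any later changes nested inside it, which is consistent with undoing the most recent CT first.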
Delete an element and its content (remove-with-content)
Description
The whole element is deleted with its content.
Example
Deletion of a paragraph.
<delta:removed-content delta:removal-change-idref='ct456'>
<text:p>
This paragraph is deleted.
</text:p>
</delta:removed-content>
Addition and deletion of a paragraph is shown like this:
<delta:removed-content delta:removal-change-idref='ct456'>
<text:p delta:insertion-type="insert-with-content"
delta:insertion-change-idref='ct1234'>
This paragraph is added then later deleted.
</text:p>
</delta:removed-content>
Deletion of a paragraph – PI example.
<?delta:removed-content delta:removal-change-idref='ct456'>
<text:p>
This paragraph is deleted.
</text:p>
?>
Addition and deletion of a paragraph is shown like this – PI example:
<?delta:removed-content delta:removal-change-idref='ct456'>
<text:p delta:insertion-type="insert-with-content"
delta:insertion-change-idref='ct1234'>
This paragraph is added then later deleted.
</text:p>
?>
Comments and Rationale
A deleted item may contain changes within it, but the changes must all be
before its deletion.
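Undoing a deletion is the reverse operation: the delta:removed-content wrapper is replaced by its own content. A minimal Python sketch follows, again assuming a placeholder namespace URI and handling only the element form, not the PI form. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

DELTA = "urn:example:delta"  # hypothetical namespace URI
REMOVED = f"{{{DELTA}}}removed-content"
REMOVAL_REF = f"{{{DELTA}}}removal-change-idref"


def unwrap(parent, wrapper):
    """Replace wrapper by its content, preserving the surrounding text."""
    index = list(parent).index(wrapper)
    children = list(wrapper)
    lead = wrapper.text or ""
    trail = wrapper.tail or ""
    if children:
        children[-1].tail = (children[-1].tail or "") + trail
    else:
        lead += trail
    if index == 0:
        parent.text = (parent.text or "") + lead
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + lead
    parent.remove(wrapper)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)


def undo_removal(root, ct_id):
    """Restore all content that was removed by the change transaction ct_id."""
    while True:
        # Search again after each change, because unwrapping alters the tree.
        hit = next(((p, c) for p in root.iter() for c in p
                    if c.tag == REMOVED and c.get(REMOVAL_REF) == ct_id), None)
        if hit is None:
            return
        unwrap(*hit)


body = ET.fromstring(
    f'<body xmlns:delta="{DELTA}">'
    '<delta:removed-content delta:removal-change-idref="ct456">'
    '<p>This paragraph is deleted.</p>'
    '</delta:removed-content><h>Heading</h></body>'
)
undo_removal(body, "ct456")
print([child.tag for child in body])
```

The same function also covers removed PCDATA, because the wrapper's text is merged back into the text around it.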
Add an attribute to an element
Description
This construct provides the ability to add a new attribute to an
element.
Example
If a fragment starts as
<text:p text:style-name="Standard">
How an attribute is added
</text:p>
and goes to
<text:p text:style-name="Standard" text:outline-level="3">
How an attribute is added
</text:p>
then this is represented as
<text:p text:style-name="Standard" text:outline-level="3"
ac:change001="ct1,insert,text:outline-level">
How an attribute is added
</text:p>
where change001 is a generated attribute name and the name is not significant
– it must be different for each attribute change recorded for this element. The
content is a comma separated list of:
- The change transaction (CT) ID. This is a reference to the ID.
- The type of change: insert, remove, modify
- The name of the attribute that is changed
- The old value of the attribute – this is not needed for an added
attribute because the value will either be in the element or, if the
attribute is later deleted, be recorded in that later change.
PI example:
<text:p text:style-name="Standard" text:outline-level="3">
<?attribute-change "ct1,insert,text:outline-level" ?>
How an attribute is added
</text:p>
Comments and Rationale
All information on the change is local to the element changed. The attribute
local name is generated because multiple changes are possible, and this avoids
adding to a string (value) of some attribute and then parsing it. Minimal
parsing of the ac:change001 attribute value is needed. The latest attributes are
always listed in full, making extraction of the latest version simple.
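The minimal parsing mentioned above might look like the following Python sketch. The class and function names are illustrative. One assumption is made that the text does not address: the value is split at most three times, so that an old value which itself contains commas is kept whole.

```python
from typing import NamedTuple, Optional


class AttributeChange(NamedTuple):
    ct_id: str                 # reference to the change transaction
    kind: str                  # insert, remove or modify
    name: str                  # name of the changed attribute
    old_value: Optional[str]   # absent for an inserted attribute


def parse_attribute_change(value):
    """Parse the comma separated content of an ac:XXX attribute."""
    # Assumption: split at most three times so commas in the old value survive.
    parts = value.split(",", 3)
    if len(parts) < 3:
        raise ValueError(f"too few fields in {value!r}")
    ct_id, kind, name = parts[:3]
    if kind not in ("insert", "remove", "modify"):
        raise ValueError(f"unknown change type {kind!r}")
    old_value = parts[3] if len(parts) == 4 else None
    return AttributeChange(ct_id, kind, name, old_value)


print(parse_attribute_change("ct1,insert,text:outline-level"))
```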
Delete an attribute from an element
Description
This construct provides the ability to delete an attribute from an
element.
Example
If a fragment starts as
<text:p text:style-name="Standard" text:outline-level="3">
How an attribute is deleted
</text:p>
and goes to
<text:p text:style-name="Standard" >
How an attribute is deleted
</text:p>
then this is represented as
<text:p text:style-name="Standard"
ac:change001="ct1,remove,text:outline-level,3">
How an attribute is deleted
</text:p>
PI example:
<text:p text:style-name="Standard" >
<?attribute-change "ct1,remove,text:outline-level,3" ?>
How an attribute is deleted
</text:p>
Comments and Rationale
This follows the same principles as an inserted attribute.
Change the value of an attribute
Description
This construct provides the ability to change the value of an attribute on an
element.
Example
If a fragment starts as
<text:p text:style-name="Standard">
The style on the paragraph will be changed.
</text:p>
and goes to
<text:p text:style-name="Code">
The style on the paragraph will be changed.
</text:p>
then this is represented as
<text:p text:style-name="Code"
ac:change001="ct1,modify,text:style-name,Standard">
The style on the paragraph will be changed.
</text:p>
PI example:
<text:p text:style-name="Code">
<?attribute-change "ct1,modify,text:style-name,Standard" ?>
The style on the paragraph will be changed.
</text:p>
Comments and Rationale
This follows the same principles as an added or deleted attribute.
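Since all three kinds of attribute change are recorded in the same way, they can be rolled back by one routine. The Python sketch below is one possible reading of the representation, not a reference implementation: the namespace URIs and the prefix map are placeholders, and only the attribute form (not the PI form) is handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs and prefix map, for illustration only.
AC = "urn:example:ac"
TEXT = "urn:example:text"
PREFIXES = {"text": TEXT}


def clark(qname):
    """Turn a prefixed name such as text:style-name into ElementTree form."""
    prefix, _, local = qname.rpartition(":")
    return f"{{{PREFIXES[prefix]}}}{local}" if prefix else local


def undo_attribute_changes(element, ct_id):
    """Reverse every attribute change on element made by the CT ct_id."""
    for key, value in list(element.attrib.items()):
        if not key.startswith(f"{{{AC}}}"):
            continue
        parts = value.split(",", 3)
        if parts[0] != ct_id:
            continue
        kind, name = parts[1], clark(parts[2])
        if kind == "insert":
            # The attribute did not exist before this CT.
            element.attrib.pop(name, None)
        else:
            # remove and modify both carry the old value in the fourth field.
            element.set(name, parts[3])
        del element.attrib[key]


p = ET.fromstring(
    f'<text:p xmlns:text="{TEXT}" xmlns:ac="{AC}" text:style-name="Code" '
    'ac:change001="ct1,modify,text:style-name,Standard"/>'
)
undo_attribute_changes(p, "ct1")
print(p.get(f"{{{TEXT}}}style-name"))
```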
Move an element (move)
Description
This construct describes the origin and the destination of content that is
moved from one position in a document to another. Move provides a link between
some removed content (move-from) and some inserted content (move-to), but this
link simply provides additional information about the change transaction. If an
application does not understand the concept of move, the move information can be
ignored without compromising the content and structure of the document before
the move or the content and structure of the document after the move.
The move representation allows content to be deleted and then inserted in one
or more other positions in the document. A delta:move-id attribute must have one
or more delta:move-idref references to it.
Example
If a fragment is moved from this position
<text:p>
This paragraph will be moved.
</text:p>
<text:h text:style-name="Heading_20_1" text:outline-level="1">
This is the heading for the paragraph
</text:h>
to this
<text:h text:style-name="Heading_20_1" text:outline-level="1">
This is the heading for the paragraph
</text:h>
<text:p>
This paragraph will be moved.
</text:p>
then this is represented as
<delta:removed-content delta:removal-change-idref="ct123" delta:move-id="mv33" >
<text:p >
This paragraph will be moved.
</text:p>
</delta:removed-content>
<text:h text:style-name="Heading_20_1" text:outline-level="1">
This is the heading for the paragraph
</text:h>
<text:p delta:insertion-type="insert-with-content" delta:move-idref="mv33"
delta:insertion-change-idref="ct123">
This paragraph will be moved.
</text:p>
PI example
<?delta:removed-content delta:removal-change-idref="ct123" delta:move-id="mv33" >
<text:p >
This paragraph will be moved.
</text:p>
?>
<text:h text:style-name="Heading_20_1" text:outline-level="1">
This is the heading for the paragraph
</text:h>
<text:p >
<?delta-tracked-change-attributes delta:insertion-type="insert-with-content"
delta:move-idref="mv33" delta:insertion-change-idref="ct123"?>
This paragraph will be moved.
</text:p>
Comments and Rationale
Move from and move to are linked by the delta:move-id attribute and
delta:move-idref attributes. When content is moved, all its change history is
reset: the move-from paragraph keeps the change history and the move-to paragraph
has none, as if it had been newly added. This avoids duplicating history
(causing ID duplicates etc.) and the history is not lost because it is still there in
the original position.
The delta:move-id attribute appears on a delta:removed-content or delta:merge
element and therefore there is no 1:1 relationship between a move-from element,
whose parent has a delta:move-id attribute, and the move-to element. It would be
possible to specify this relationship to a finer level of granularity by using
multiple delta:removed-content elements rather than one.
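The requirement that a delta:move-id has one or more delta:move-idref references lends itself to a simple consistency check. The Python sketch below uses a placeholder namespace URI and inspects only the element and attribute form; a real checker would also need to read the PI form.

```python
import xml.etree.ElementTree as ET

DELTA = "urn:example:delta"  # hypothetical namespace URI
MOVE_ID = f"{{{DELTA}}}move-id"
MOVE_IDREF = f"{{{DELTA}}}move-idref"

SAMPLE = f"""
<body xmlns:delta="{DELTA}">
  <delta:removed-content delta:removal-change-idref="ct123" delta:move-id="mv33">
    <p>This paragraph will be moved.</p>
  </delta:removed-content>
  <h>This is the heading for the paragraph</h>
  <p delta:insertion-type="insert-with-content" delta:move-idref="mv33"
     delta:insertion-change-idref="ct123">This paragraph will be moved.</p>
</body>
"""


def check_moves(root):
    """Return (move IDs that nothing references, references with no move ID)."""
    ids = {el.get(MOVE_ID) for el in root.iter() if el.get(MOVE_ID)}
    refs = {el.get(MOVE_IDREF) for el in root.iter() if el.get(MOVE_IDREF)}
    return sorted(ids - refs), sorted(refs - ids)


print(check_moves(ET.fromstring(SAMPLE)))
```

Both lists are empty for a consistent document. An application that does not understand moves can skip this check entirely, since the removal and the insertion remain valid on their own.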
Add text (PCDATA)
Description
This construct allows the insertion of text. It is similar to the existing
mechanism. This construct shall only be used within an element that allows
PCDATA content.
Example
If a fragment starts as
<text:p>
How text is added.
</text:p>
and goes to
<text:p>
How text is very easily added.
</text:p>
then this is represented as
<text:p>
How text is <delta:inserted-text-start delta:inserted-text-id="it632507360"
delta:insertion-change-idref="ct1"/>very easily
<delta:inserted-text-end delta:inserted-text-idref="it632507360"/>added.
</text:p>
PI example:
<text:p>
How text is <?delta:inserted-text-start delta:inserted-text-id="it632507360"
delta:insertion-change-idref="ct1" ?>very easily
<?delta:inserted-text-end delta:inserted-text-idref="it632507360" ?>added.
</text:p>
Second example: If a fragment starts as
<text:p>
How text is
</text:p>
and goes to
<text:p>
How text is very easily added.
</text:p>
<text:p>
And the addition is into a second paragraph.
</text:p>
then this is represented as
<text:p>
How text is <delta:inserted-text-start delta:inserted-text-id="it123" delta:insertion-change-idref="ct3"/>very easily added.<delta:inserted-text-end delta:inserted-text-idref="it123"/>
</text:p>
<text:p delta:insertion-type="insert-with-content" delta:insertion-change-idref="ct3">
And the addition is into a second paragraph.
</text:p>
Comments and Rationale
Additions may not always be within a single element, but the
delta:inserted-text-start and delta:inserted-text-end must both have the same
parent element when they are created, and the content between them must be
PCDATA only. Therefore when a second paragraph is added as per the second
example, the first atomic change terminates and the paragraph is added in the
normal way. The CT reference provides a link to indicate these occur at the same
time as a single addition. This avoids having two ways to add an element and
avoids the need to track across the element hierarchy to find the corresponding
end of an addition.
Additions must therefore always be non-overlapping, and the start and end of a
change must be within a single element when they are formed. Of course they may
not be within a single element at some later stage due to other changes, but in
this case it would not be possible to 'undo' it. This rule adds clarity, at a
slight cost to the writer application and a considerable gain for the reader.
Since any number of atomic changes can be associated with a single CT, there is
no loss of information.
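The rule that start and end markers share a parent, with only PCDATA between them, can be checked mechanically. The Python sketch below tests the rule as it holds at the time of creation, so it would report markers that were legitimately separated by later changes. It uses a placeholder namespace URI and the attribute names that appear in the examples above (delta:inserted-text-id and delta:inserted-text-idref).

```python
import xml.etree.ElementTree as ET

DELTA = "urn:example:delta"  # hypothetical namespace URI
START = f"{{{DELTA}}}inserted-text-start"
END = f"{{{DELTA}}}inserted-text-end"
TEXT_ID = f"{{{DELTA}}}inserted-text-id"
TEXT_IDREF = f"{{{DELTA}}}inserted-text-idref"


def check_inserted_text(root):
    """Return a list of problems found with the text insertion markers."""
    problems = []
    for parent in root.iter():
        open_id = None  # ID of the insertion currently open in this parent
        for child in parent:
            if child.tag == START:
                if open_id is not None:
                    problems.append(f"{open_id}: insertions overlap")
                open_id = child.get(TEXT_ID)
            elif child.tag == END:
                if open_id is None or child.get(TEXT_IDREF) != open_id:
                    problems.append(
                        f"{child.get(TEXT_IDREF)}: end without matching start")
                open_id = None
            elif open_id is not None:
                problems.append(f"{open_id}: element between the markers")
        if open_id is not None:
            problems.append(f"{open_id}: start without end in the same parent")
    return problems


good = ET.fromstring(
    f'<p xmlns:delta="{DELTA}">How text is '
    '<delta:inserted-text-start delta:inserted-text-id="it1" '
    'delta:insertion-change-idref="ct1"/>very easily '
    '<delta:inserted-text-end delta:inserted-text-idref="it1"/>added.</p>'
)
print(check_inserted_text(good))
```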
Delete mixed or PCDATA content
Description
This construct allows the deletion of text. It is similar to the existing
mechanism. This construct shall only be used within an element that allows
PCDATA content.
Example
If a fragment starts as
<text:p>
How text is deleted or removed from a paragraph.
</text:p>
and goes to
<text:p>
How text is removed from a paragraph.
</text:p>
then this is represented as
<text:p>
How text is <delta:removed-content delta:removal-change-idref="ct2">deleted or </delta:removed-content>removed from a paragraph.
</text:p>
Second example: If a fragment starts as
<text:p>
How text is deleted or <text:span text:style="bold">removed</text:span> like this from a paragraph.
</text:p>
and goes to
<text:p>
How text is deleted from a paragraph.
</text:p>
then this is represented as
<text:p>
How text is deleted
<delta:removed-content delta:removal-change-idref="ct2">
or
<text:span text:style="bold"> removed</text:span>
like this
</delta:removed-content>
from a paragraph.
</text:p>
Comments and Rationale
The deleted text is contained within a single element because it will never be
subdivided or added to after its deletion. The deleted text element contains at
least some deleted text, and may contain other elements.
A deleted item may contain changes within it, but the changes must all be
before its deletion.
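Because the deleted content sits in a single delta:removed-content element, the latest text of a paragraph can be read by skipping that element and keeping the text that follows it. A small Python sketch, assuming a placeholder namespace URI:

```python
import xml.etree.ElementTree as ET

DELTA = "urn:example:delta"  # hypothetical namespace URI
REMOVED = f"{{{DELTA}}}removed-content"


def latest_text(element):
    """Concatenate the text of element, leaving out removed content."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != REMOVED:
            parts.append(latest_text(child))
        # Text after a child belongs to the parent, even after a removal.
        parts.append(child.tail or "")
    return "".join(parts)


p = ET.fromstring(
    f'<p xmlns:delta="{DELTA}">How text is '
    '<delta:removed-content delta:removal-change-idref="ct2">deleted or '
    '</delta:removed-content>removed from a paragraph.</p>'
)
print(latest_text(p))
```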
Integration with a host format
We have identified two different ways of integrating this track change format,
discussed below.
Stand-alone use of 'XML Track Changes'
The format can be used as an independent addition to an existing XML host format.
In this scenario no changes are made to the schema of the host format, but the track
change elements and attributes are used to represent changes and edits to a
document. The following generic tools may be used to extract different versions of
the document, and to validate a version of the document that has tracked changes
represented.
- Schematron checker to check a change-tracked document (Schematron
Checker)
- XSLT stylesheet to extract the final document from a change-tracked
document (XSLT Extractor)
- XSLT stylesheet to roll back the last change transaction from a
change-tracked document (XSLT Roll-back)
These tools allow a complete integrity check as follows:
1. Execute the Schematron Checker to check the document.
2. Use XSLT Extractor to extract the last version of the document.
3. Check the last version of the document against the normal document schema
and/or other integrity checks.
4. If there are no Change Transactions in the document, the checking is
finished.
5. Use XSLT Roll-back to roll back the tracked change.
6. Return to step 1 to continue checking.
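The integrity check above is a loop over successively older versions of the document. The Python sketch below shows only the control flow. The five callables are hypothetical stand-ins for the Schematron Checker, the XSLT Extractor, the host schema validation, a test for remaining CTs and the XSLT Roll-back; their names and signatures are assumptions, not part of the proposal.

```python
def full_integrity_check(doc, check, extract_latest, validate,
                         has_transactions, roll_back):
    """Check every version of a change-tracked document, newest first.

    Returns the number of versions checked. The callables are expected
    to raise an exception when a check fails.
    """
    versions_checked = 0
    while True:
        check(doc)                     # check the tracked document
        validate(extract_latest(doc))  # extract and check the last version
        versions_checked += 1
        if not has_transactions(doc):  # no CTs left: finished
            return versions_checked
        doc = roll_back(doc)           # undo the last CT, then repeat


# Toy model: a document is just the tuple of its CT IDs.
seen = []
count = full_integrity_check(
    ("ct1", "ct2"),
    check=lambda d: None,
    extract_latest=lambda d: d,
    validate=seen.append,
    has_transactions=lambda d: len(d) > 0,
    roll_back=lambda d: d[:-1],
)
print(count, seen)
```

A document with n change transactions is therefore checked n + 1 times: once for each tracked state and once for the oldest version.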
An application reading the change-tracked document would need to recognise the
change tracking elements and treat these in a special way so that the final version
of the document ends up in memory with some ancillary in-memory data structure to
denote the changes.
Host-integrated 'XML Track Changes'
In this scenario a RelaxNG schema specifies the host format with the change
tracking schema integrated into it. The stand-alone testing described above
would still be valid, and in addition the change-tracked document could be
checked against a schema.
Note: More work is needed to develop this integration for a particular host format
such as ODF, and it has a significant impact on the RelaxNG schema.
Schema Integration
Integration of Level 1 is simpler than integration of Level 2.
It is possible to represent any change to a document at either level, but Level 2
provides a more natural representation of typical document editing actions. Level 2
seeks to make minimal changes to the document content or text while allowing complex
changes to the structure surrounding that textual content.
Schema Integration Level 1
This level is provided as a guide for other use cases of this tracked-change
representation. Level 1 has not been fully tested independently of Level 2.
Rule 1: The element delta:tracked-changes must be allowed at one point in the
document.
Rule 2: Any element in the host format that has one or more attributes which can
be added, deleted or have their values changed needs to allow attributes in the ac:
namespace.
Rule 3: All elements that can be added or deleted with their content (including
any element that allows no content, i.e. is always empty) need to allow the
attribute delta:insertion-type with value 'insert-with-content' and be permitted as
a child of delta:removed-content (unless this allows any element, see RelaxNG for
details). (Note that this is not necessarily all elements, for example an element
that is only used as a required item and never in a choice would not be in this
category.)
Rule 4: All elements that allow element content must have their content model
modified so that they allow delta:removed-content to appear anywhere as a child
element.
Rule 5: All elements that allow PCDATA content, including elements that allow
mixed content, need to allow for text content to be added (Rule 4 allows text to be
deleted).
Conclusions
The need to track changes to XML documents and data frequently extends beyond the
most common application of changes to documents. For example there is keen interest
from the Strategy Markup Language (StratML) [12] group responsible for
defining strategic plans in XML; these plans change, and the way they have changed
is important and needs to be recorded. But, if every group develops its own solution
then the advantage of common tools would be lost and each group would be left to find
its own way forward. It would be a considerable amount of work!
There have been comments that the full solution proposed, including both Level 1 and
Level 2, is complex. That is true and it probably makes sense to limit the initial
approach, a first or draft standard, to fairly simple changes as addressed by Level 1,
and then in the light of experience and implementations move on to tackle the more
complex area of structural changes. Level 1 can represent any change to any XML
document, but for structural changes some content may need to be duplicated. Level 2
is really only needed when content duplication is undesirable, for example when the
XML represents written documents.
This paper has shown that a generic approach to representing tracked changes in XML
is possible, and the advantages to having a standard XML solution are considerable. More
experimental work is needed to see if there are better approaches and a draft standard
needs to be developed. Then, in the light of experience and implementations, we can
move on to address the more complex area of structural changes.
References
[1] W3C Change Community Group, http://www.w3.org/community/change/
[2] XML Change Tracking Prototype: Sandbox, http://www.deltaxml.com/samples/track-changes/sandbox
[3] Robin La Fontaine, Nigel Whitaker and Tristan Mitchell, Representing Changes in Open Document Format: Worked examples and XSLT style sheets, July 2010, http://www.deltaxml.com/support/downloads/DeltaXML-TC4.tar.gz (5.1Mb)
[4] Robin La Fontaine, XML Change Tracking: Representing Change Tracking in any XML Document, DeltaXML Ltd., Draft 7, 2012, http://www.deltaxml.com/support/documents/articles-and-papers/XML-change-tracking.pdf
[5] Robin La Fontaine, Approaches to Change tracking in XML, XML Prague 2010 Prague CZ, http://www.deltaxml.com/support/documents/articles-and-papers/xml-change-tracking-review.pdf
[6] XMetaL authoring system, http://xmetal.com
[7] Xopus online editing, http://xopus.com/
[8] oXygen XML editor, http://www.oxygenxml.com/
[9] XQuery Update Facility 1.0, http://www.w3.org/TR/xquery-update-10/
[10] XSL Transformations (XSLT), http://www.w3.org/TR/xslt
[11] DeltaXML: Two and Three Document DeltaV2 Format, http://www.deltaxml.com/support/documents/deltav2
[12] Strategy Markup Language (StratML), http://xml.fido.gov/stratml/index.htm
[13] Yves Marcoux, Michael Sperberg-McQueen, Claus Huitfeldt, Modeling overlapping structures, http://www.balisage.net/Proceedings/vol10/html/Marcoux01/BalisageVol10-Marcoux01.html, doi:https://doi.org/10.4242/BalisageVol10.Marcoux01