Vion-Dury, Jean-Yves. “Stand-alone Encoding of Document History (or One Step Beyond XML Diff).” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). https://doi.org/10.4242/BalisageVol5.Vion-Dury01.
Balisage: The Markup Conference 2010 August 3 - 6, 2010
Balisage Paper: Stand-alone Encoding of Document History
Jean-Yves Vion-Dury holds an CS engineering degree from the "Conservatoire
National des Arts et Métiers, Paris" (1993) and graduated with a PhD in CS from
Université Joseph Fourier, Grenoble in 1999. He has been working at Xerox
Research Centre Europe (in Grenoble, France) since 1995, as a research
scientist; he has also been on a two year sabbatical with Vincent Quint's team
at INRIA in 2002-2004. His research interests relate to various aspect of XML
including models, the impact of standards, validation/transformation languages
and architectures, with theoretical background in programming languages,
compilation, type systems and formal logics.
Jean-Yves was Program Chair of DocEng (ACM Document Engineering Symposium ) in
2004, is member of its Program Committee since 2003, and member of its Steering
Committee since 2005.
This paper describes an approach to encapsulate together an XML document and its
history (i.e. the various significant states it adopted along its life cycle) into
a
single standalone XML document. Our proposal introduces an XML data model suited to
capture versioning information combined with an operational model to handle such
encapsulated data. We describe basic operators involved along the transformation
process of the document/history pair, mainly designed around the principle of
maintaining coherence properties.
XML being recognized today as the lingua franca of computerized information, some
key
basic functionalities become of universal interest and will predictably gain momentum
in
the upcoming XML related economy.
One of these functionalities is the computation of differences between two XML
documents, and more generally, the management of the history of a given document.
In
particular, this functions as a cornerstone of any system aiming at offering long
term
preservation.
Today, such functionality is achieved through content management systems or databases.
Models and operations are therefore hidden inside a black box, and up to the author's
knowledge, no standard mechanism makes this information explicit and accessible to
human
users neither to external algorithms and processors.
Our proposal addresses this issue by associating a target document instance together
with its history inside a standalone and consistent document, thus gaining strong
potential for current or future interoperability.
The notion of history, when applied to a document,
calls for a deeper understanding of the global context, intents and social processes
underpinning the creation, evolution and maintenance of documents (see pédauque03). Hence, the relevant structure of a meaningful document history highly depends on the document typology and
usages.
In addition, the criterion that founds any document history is the permanence of some
key deep invariants (for a general, ontological, reflection around this line, see
pédauque05). Those invariants define the identity of the document itself
and their loss inevitably ends the versioning process and calls for a new creation
cycle.
The method we describe in this paper takes into consideration the key points above,
especially through a flexible calculus of difference descriptors, and through offering
high level transformation operations that preserve the consistency of both the document
instance and its whole history.
Overview
The target document (also called body hereafter),
i.e. any XML document using any namespace, is encapsulated inside a dedicated packaging
XML document. The body always consistently relates to a given state of its history
(not
necessarily the most recent one) thanks to a dedicated attribute that refers to a
version label and makes this relation explicit.
The history itself is encoded as interrelated nodes (XML elements) and has the
structure of a directed acyclic graph (DAG); each node of this graph models a versioning
point, i.e. a particular significant state that the document reached during its lifetime
and for which a version has been recorded. The meaning of any versioning point with
respect to the document life cycle is application dependent, as discussed above, and
the
proposed approach abstracts over application semantics, just considering that a new
version is built when some actor in the document life cycle decided it makes sense
to do
so. Arcs that connect the nodes are decorated with deltas, and these arcs model the
transformation that allows changing the document from one versioning point to the
other.
Thus arcs are oriented. Deltas descriptors are combinations of low level operations
on
the document tree, mainly based on deletion and insertion of subtrees.
Thanks to a set of transformations qualitatively described in this paper, it is
possible to navigate inside the history of the document and to consistently extract
any
version of the document captured in the history graph. It is also possible to create
new
versioning points, new branches or to merge existing branches, all these operations
producing novel consistent encapsulations of the target document.
Modelling Abstract Differencing Operations
Document model
The encapsulating XML document can be represented (in an abstract way) as a tree
like this:
where d represent the target document and
vi a versioning point.
The full syntax of our encapsulation is available through a RelaxNG Schema (see
appendix). The examples Figure 26, Figure 27, Figure 28 and Figure 29 illustrate concretely the way we
propose to encode the history graph (DAG) in XML, and the way delta operations are
attached to transition arcs.
Diff Engine
We assume that a diff engine is able to operate as a black box function. Its
abstract signature could be for instance:
Figure 2: Abstracting over diff operation
diff(config, d, d’)
→
Δ
with config being some input information used to configure the processor (e.g.
filter to apply, mode commutative/non-commutative,…). The notation Δ represents a
set of basic delta operations, formally described hereafter.
Abstracting over Delta Operations
Figure 3: Syntax of delta operations
Δ
::=
{ }
no operation
| { δ1 … δi }
commutative snapshots
| Δ ; Δ
sequences
δ
::=
insert (pp, A)
subtree insertion
| insert-attr(pp/@nm, A)
attribute insertion
| delete(p)
subtree deletion
The delta operation use paths, noted p, to
designate the tree location where the modification should be applied. Those paths
are describe by the following grammar (i is a
positive integer, nm an attribute name, pp denotes "pure paths"):
Figure 4: Syntax of path description
p
::=
pp | pp/@nm
pp
::=
i/pp | i
These paths are interpreted as relative to the root of the encapsulated document,
and can be easily translated as XPath expressions. For example, 1/2/1 is translated
into *[1]/*[2]/*[1], and 1/3/@id into *[1]/*[3]/@id [1].
Each delta belonging to a snapshot must comply with a structural constraint that
ensures orthogonality (thus making the snapshot indeed commutative). In particular,
it is enough to verify that two paths in a snapshot do not designate sibling trees,
and one path does not designate a sibling tree of the parent node designated by the
other path. We assume that this constraint is part of the well-formedness condition
assured for all definitions [2].
The semantics of delta operations expresses changes that occur on the operand (an
XML tree) ; we note this transformation of d into
d’ by applying Δ as follows:
Figure 5: Transformation by Delta application
d › Δ › d’
More precisely, the previous notation is a logical assertion saying that a
well-formed document d is changed into a
well-formed document d’ after application of the
well-formed Δ operation.
Formally, for all subtree A, path p, documents d and
d’, deltas Δi and δi, the transformations verify the following
abstract properties:
Figure 6: Abstract properties of delta transformations
(a-seq)
d › Δ1;Δ2 › d’
⇔
∃ d’’ , d › Δ1 › d’’ ⋀ d’’›
Δ2 › d’
(a-snap)
d › { δ1 ... δi} › d’
⇔
d › {δπ1} ; {
δπi} › d’
for any permutation π defined over the sequence of indexes [1,…,i]
(a-void)
d › { } › d’
⇔
d ≈ d’
(a-ins)
d › { insert(pp,A)} › d’
⇒
get(d',pp)≈A ⋀ invar(d,d',pp,f)
with f = ζ(pp)
(a-ins-@)
d › { insert-attr(pp/@nm,A)} › d’
⇒
get(d',pp/@nm) = A ⋀ invar(d,d',pp/@nm)
(a-del)
d › { delete(pp)} › d’
⇒
get(d',pp)≈get(d,pp ⊕ f) ⋀ invar(d,d',pp,-f)
with f = ζ(pp)
(a-del-@)
d › { delete(pp/@nm)} › d’
⇒
get(d',pp/@nm) = ε
Note that the definition of insert operator makes
use of the function get, which extracts the subtree
rooted at a given location p, as well as a path
addition function ⊕ and a so called fingerprint extraction ζ . The invar operator
deals with locality of the transformation (see Invariance Property Fig Figure 11 and associated explanations). Moreover, ≈ (the
equivalence of trees) is defined extensionally using quantification of paths, and
is
insensitive to attribute order, according to path definition.
Figure 7: Tree equivalence
d ≈ d’
⇔
∀ p, get(d,p) = get(d’,p)
Path based, extensional equivalence over
trees
The path addition is defined over pure paths (paths designating element nodes
only, noted pp), can deal with operands of various
depth and is commutative:
Figure 8: Path addition
i/pp
⊕
j/pp’
=
(i + j)/(pp ⊕ pp’)
i/pp
⊕
j
=
(i + j)/pp
i
⊕
j
=
(i + j)
Also fingerprints capture some structural information, and more precisely, the
depth level of a path:
Figure 9: Fingerprint computation
ζ(i/pp)
=
0/ζ(pp)
ζ(i)
=
1
Thus ζ (1/2/3) = 0/0/1 and 1/2/3 ⊕ ζ(1/2/3) = 1/2/4 and 1/2/3/2 ⊕ ζ(1/2/3) =
1/2/4/2
The property (a-snap) holds for all permutations
over indexes. It simply means that the set of deltas must be commutative with
respect to their sequential application. In other words, all basic deltas of a
snapshot are pairwise orthogonal.
The property (a-del) involves a minus operator
over paths. This one is easily defined as inverting all integers found in the path.
Figure 10: Path inversion
-(i/pp)
=
(-i)/-(pp)
-(i)
=
-i
The invar property used in (a-ins), (a-ins-@) and
(a-del) expresses that other parts of the tree are not modified by the operation.
This is defined as follows:
Figure 11: Invariance property
invar(d, d’, pp/@nm)
≡
∀ nm' [ get(d, pp/@nm’) = get(d’, pp/@nm’) ]
invar(d, d’, pp, f)
≡
∀ p' [
p' ≪ pp
⇒
get(d, p’) = get(d’,p’)
⋀
pp ≪ p'
⇒
get(d, p’) = get(d’,p’ ⊕ f)
]
The relation ≪ is a strict total order over (pure) paths, which is sound with
respect to the standard total order over document nodes as defined by the XML
document model.
Figure 12: Strict total order on paths
for all i, j integers, pp, pp' pure paths,
Table I
i < j
i = j
i > j
i/pp
≪
j/pp'
true
pp ≪ pp'
false
i
≪
j/pp'
true
false
false
i/pp
≪
j
true
false
false
i
≪
j
true
false
false
The intuitive notion of path prefix relation captures the idea of tree embedding
Figure 13: Path prefix relation
pp1 ≺ pp2
iff
∃ pp | pp1 = pp2/pp
We now define the notion of path orthogonality as a binary boolean operator that
captures the notion that a path is not a subpath of another (or equivalently, a
subtree is not included into another subtree)
Figure 14: Path orthogonality
pp1 ⊥ pp2
iff
not [ pp1 ≺ pp2 ⋁
pp2 ≺ pp1]
Extension of basic vocabulary
In order to ease understanding and increase the generality of descriptions, we
propose now to extend the vocabulary through definitions based on the
fundamental operations described above: move,
copy and replace[3].
d › { insert
(pp2,get(d, pp1))};{delete(pp1) } ›
d’
pp1 ⊥ pp2
⇒
d › { copy(pp1,
pp2) } › d’
⇔
d › { insert
(pp2,get(d, pp1)) } › d’
d › { replace(pp,A) } › d’
⇔
d › { insert (pp,A)};{delete(pp ⊕ ζ(pp)) } › d’
d › { replace(pp/@nm,A) } › d’
⇔
d › {delete(pp/@nm) }; { insert-attr(pp/@nm,A)} › d’
Note that move and copy operations are only defined for orthogonal paths[4].
Inversion of basic operations and snapshots
Inverting a delta (i.e. computing the changes that will exactly bring the
operand in the previous state) requires knowing the original operand on which
changes will be applied.
The inversion function is inductively defined as follows:
Figure 16: Inversion Function ∘
∘{ }
=
{ }
d › {δ1 ... δi} › d’
⇒
∘{δ1 ...
δi}
=
{∘δ1 ...
∘δi}
d › Δ1;Δ2 › d’
⇒
∘(Δ1;Δ2)
=
∘(Δ2);∘(Δ1)
d › {delete(pp)} › d’
⇒
∘(delete(pp))
=
insert(pp,get(d,pp))
d › {delete(pp/@nm)} › d’
⇒
∘(delete(pp/@nm))
=
insert-attr(pp/@nm,get(d,pp/@nm))
d › {insert(pp,A)} › d’
⇒
∘(insert(pp,A))
=
delete(pp)
d › {insert-attr(pp/@nm,A)} › d’
⇒
∘(insert-attr(pp/@nm,A))
=
delete(pp/@nm)
Delta inversion is characterized by the following soundness property:
Figure 17: Proposition 1
d › Δ › d’
⇒
d’ › ∘Δ › d
Inversion of well-formed delta operations over
well-formed documents produces a well-formed reverse tree
transformations
The inversion of deltas is an important functionality which allows navigating
in the history graph. Moreover, it provides a more compact representation of
changes, especially when successive versions represent documents whose content
tends to increase incrementally (corresponds to the construction phase inside a
standard document life-cycle).
This transformation is beneficial because the subtrees A and B are redundantly
stored: once inside the history and once inside the encapsulated instance
itself. In case the focus is set to a non-terminal versioning point (e.g.
vj ), we also may have
vi
←delete(p)
[vj]
→insert(p',B)
vk
which is still quite meaningful as the subtree B is only stored once inside
the delta transition (indeed, the encapsulated document, consistent with focused
version vj, does not comprise the B subtree (see examples
3 et 4 below to grasp more concretely the point).
Note that inverting the operands of a diff operation should also lead to
reversed deltas [6]:
Figure 18: Strong soundness of delta inversion
diff(c, d, d’)=Δ
⇒
diff(c, d’, d) = ∘(Δ)
A weaker version of this property (see Figure 19 below)
requires defining an equivalence relation over deltas. This relation is based on
the effect of delta application rather than its syntactic structure.
Figure 19: Weak soundness of delta inversion
diff(c, d, d’)=Δ
⇒
diff(c, d’, d) ≈ ∘(Δ)
Figure 20: Equivalence of deltas
Δ ≈ Δ'
⇔
∀ d, d', d''
d › Δ › d' ⋀ d › Δ' › d''
⇒
d' ≈ d''
In the general case, it is much more interesting (from the performance point
of view) to compute inversion on deltas rather than to perform a reversed
diff.
Fundamental Operations over Encapsulated Documents
Creation of a history from an initial XML document
This operation has the following abstract signature:
Figure 21: Initial Encapsulation
create-history(d)
→
x-version[x-body v0[d], x-history[v0]]
This reflects that the document is encapsulated, and that an initial versioning
point v0 is created inside
the history. The link that relates the embedded document with the consistent
versioning point is inserted in the x-body
subtree.
Versioning
The operation requires two operands: the encapsulated document and another variant
of the document which is to be considered as the novel versioning point. What is
returned is a novel encapsulated document including a new versioning point, a
(consistent) link inside the body part, and a transition from the previous
versioning point to the new one. This transition expresses the delta operations
abstracted from the diff engine outputs.
Figure 22: Version Creation
create-version(x-version[x-bodyvi[d], x-history[...
vi ...]], d')
→
x-version[ x-bodyvj[d], x-history[... vi →
Δvj ]]
with
diff(c,d,d') = Δ
Extraction
Unfolds the embedded document. This is useful to work on the target document,
update or change it.
This operation allows modifying the embedded document d in order to be compliant to a given version stored in the history.
This requires first building the path that connects the current versioning point
vi to the novel
vj and second, applying
all deltas to obtain the new embedded document d'
Note that in the connection path, some vertices could have the reversed form
vn ←Δn
vm requiring the computation of inverse delta (indeed,
the connection graph is a DAG admitting branching). Note also that we assume that
the algorithm is able to choose a path when several possibilities are present in the
graph (it may decide which is the optimal path with respect to performance criteria
based on standard algorithm based on simple metrics).
Merging branches
This operation requires two operands: the embedded document and another existing
versioning point. An algorithm creates a novel versioning point and two transitions
that relate original versioning points to the new one. The idea is to perform a
merge with maximal preservation (no deletion operation in the respective deltas).
However, conflicts may arise, and can be signaled thanks to a dedicated annotation
inside the target document (this annotation is based on foreign namespace which
cannot conflict with namespaces used by target document)
In the above formula, Δa represents the concatenation of
(possibly reversed) deltas that allows to reach the new (merged) version starting
from version vi, and similarly for Δa
and vj .
Encoding Graphs of Deltas in XML
Each versioning point is captured using a dedicated element (e.g. version) associated with an id attribute that uniquely
identifies it. Our convention uses a naming scheme of the form v0…
v3 for instance. Names for diverging branches use a dot, e.g. v1.1, v1.2.
Each version element contains a sequence of delta
elements which capture the transition from this versioning point to another versioning
point. This information is conveyed by an attribute (e.g. fwd for forward links and bwd for backward
links). Moreover, a delta contains a non order-significant sequence of delta operations,
which captures the so-called snapshot. Several delta
elements are interpreted as sequences of snapshots (corresponds to the
Δ1 ; Δ2 syntactic form, where
Δi= {…}).
Each delta operation is described using a dedicated element name according to its
semantics (insert, insert-attr …). The path information
may be concisely encoded, e.g. through an “ipath” attribute. Copies of subtrees may
be
expressed through a “copy” attribute attached to the delta element. When such an
attribute is not defined, an extensive definition of the subtree is expected as the
content of the operation.
The example 1 below is a target document we want to track
After applying the create-history operation, the document of figure Figure 27 below is created.
In the following example, we assume three distinct versioning points have been
created.
The following example illustrate how the same information can be encoded differently,
featuring different properties.
We propose a particular optimisation that could be applied in linear mode: redundant
trees are eliminated tanks to an explicit copy instruction using an XPath expression
to
designate the subtree (thus the XPath selection refers to the focused version).
<delta fwd="v1">
<insert ipath=”1/2/2”>
<p class="verse" number="2">
Saying, Where is he that is born King of the Jews? For we have seen his star
in the east, and are come to worship him.
</p>
</insert >
</delta>
In the case of figure Figure 29, the node "insert", child of the forward
delta to v1, is suppressed from the history and its parent is changed into the
expression below:
This optimization applies only to versions belonging to past
history, but makes a lot of sense when the evolution of document
instances is best captured through using linear mode and comprises many incremental
additions.
State of the Art
XML diff operations are basically used to realize two fundamental operations on tree
structured documents: comparison and merging. Therefore tree based diff algorithms
are
central to solve various important problems related to XML document management and
editing processes (Chawathe96), among which:
checking modifications with respect to a reference document, typically, the
last version stored inside a repository or a file system (version management
Chien00)
synchronizing variants, that is, detection of changes that occurred
concurrently on two copies of a reference document, detecting potential
conflicts (LaFontaine02
merging variants, which is about fusing two variants into a unique document
integrating modifications of both instances (this may imply solving potential
conflicts when the underlying document management system applies a weak control
policy)
recovering anterior state(s) of a document
An abundant literature exists on this topic, including recent and excellent synthetic
work on an old topic, linear differencing (Khanna07). Most work covers
:
Algorithmic complexity of the differencing operation,
either for ordered or unordered tree models.
The problem was first approached as a particular case of string oriented diff
computation (see Khanna07 for a point on most recent
algorithms in this area), implying computing an adequate linearization of the
XML trees. Zhang and Shasha described a fast algorithm in 1989 for computing
tree edit distance Zha89 improving previous work conducted in
1979. Still the runtime and space complexity was higher than quadratic on
sequential architectures, but could drop to quadratic levels on parallel
architectures. In Wang03 an algorithm for unordered tree was
presented with a polynomial complexity for optimal differences, whereas previous
works established a NP-complete complexity for general cases. A survey (still of
interest 14 years after) and a comparative study on the topic can be found in
Bille95 and Cobe02. Most recent
algorithms for ordered diff computation are described in Lind06, Rönnau08, Rönnau09.
Optimality of editing scripts (also called
deltas)
The very notion of optimality is somehow controversial, because bound to
execution models one could consider either as over simplistic or too specific.
Of course execution models are needed to justify underlying cost functions
associated with deltas application. However, such an approach intrinsically
restricts the application field or the area of proposed results. In addition,
usages and document types deeply impact the structure of delta and thus the
runtime performance of algorithms. As a consequence, diff algorithms should be
chosen and parameterized on the basis of document type and related
activities.
Delta models and use of diff operations inside storage
architectures (typically databases).
In Martinez02, a versioning graph is proposed based on XLink
designation mechanism. But in this proposal, the graph relates two separate
documents that may exist in independent storage infrastructure. Hence there is
no notion of encapsulation and explicit consistency. In Arevalo09, an application is proposed that offers online
versioning services for XML documents. However, there is no notion of history
encapsulation, and the history itself cannot be exported or organized into an
explicit form (see Arevalo09b for an online demo).
A recent paper Rönnau08 proposes reversible deltas using a
straightforward mechanism as deletion operation conveys explicitly the deleted subtree
(this could quickly lead to enormous overhead, especially in the perspective of storing
the whole history). The authors focus on algorithmic and speed performance issues
of a
novel diff algorithm using a bottom-up approach (and also precomputed hash codes on
a
sliding window of adjacent nodes.)
Most work conducted on this area focused on related algorithm performance, time and
space complexity of the diff operation and also optimality of generated deltas, but
little attention has been paid to tools and models to make use of these in efficient
ways. From the practical point of view, no work considered the necessity and applicative
interest of abstracting over deltas and related operation models.
Recently, the XQuery Update working group from W3C (see XQueryUp)
published a candidate recommendation proposing to enrich XQuery with mechanisms intended
to modify XML documents. This impressive work largely focused on expressive power
and
consistent integration of the new constructs to the existing language. The patching
operations rely on "insert", "delete", "rename" and "replace" , using XPath to designate
the locations in the XML tree ; an innovative "transform" operator allows for combining
copy with on-the-fly modifications. One major difficulty was to deal with the extra
power raised by the selection semantic model of XPath, i.e sequences of nodes. As
a
consequence, many runtime error may occur depending on the evaluation of selectors
and
the semantics of the delta operation. Additionally the insert operator of the language
is particularly powerful since it can make explicit positional constraints ("first/last
position", "before/after A") whereas when implicit, positional constraints are
interpreted so that global ordering constraint are optimally satisfied (this remain
a
bit unclear how far this can be hard).
The transformation model of XQUpdate relies on a kind of transactional framework,
where changes are stored in a list of "pending modifications" only applied after calling
a dedicated primitive upd:applyUpdates. This is a pragmatic solution to the problem
of
path relocation (and variable scope validity!), but is a nightmare regarding static
analysis. Therefore it will be much more realistic to use the XQUpdate interpreter
as a
patch engine (i.e generating XQUpdate code from our descriptions), than considering
an
XQUpdate program as a description of differences.
Conclusion
The main contribution of this work is to propose a universal XML data structure and
related transformations which allow abstracting from underlying storage systems and
from
any execution model associated with the computation of XML versioning information.
This is in favor of long term preservation of XML documents, infrastructure and vendor
independence, and open the way to interoperable processing of XML versioning
information.
The method targets generic XML documents, and makes use of a particular namespace
to
embed any XML document (whatever vocabulary it may use) without any change in its
content and tag/attribute set. The history is encoded using a specific vocabulary
that
captures change operations in a formal and universal way, so that any XML diff enabled
processor can generate suitable deltas. Based on the properties of this change
description language, we propose a set of high level operators which allows exploiting
the historical information in order to navigate inside the history, compute new version
in the current branch, create new branches or suppress intermediate versioning points.
Various information may be attached to the versioning points, such as user meta
information, universal date/time of the creation of the versioning point, additional
data required to optimize or ameliorate the computation (hash code, digest number,...),
which are not central to this contribution but could substantially enhance any of
its
application.
The problem with existing diff related approaches is that the description model and
the transformation model are identical. In other words, they use the same objects
both
to describe operationally and declaratively the differences between trees. We outline
in
our proposal a calculus that formally allows making a distinction between modification
descriptors and modification operations. This calculus allows modeling "standard"
processing of deltas and allows transforming descriptions in order to gain compactness
or efficiency. We are not aware of any equivalent proposal.
The two transformation structures we described can lead to two broad classes of delta
interpretations:
Each modification step transforms the tree, so that the paths used inside
subsequent deltas should be defined on the modified tree. This model is used
inside database systems and DOM-like in memory tree processors (especially XUpdate, XQueryUp).
All modification steps are defined on the same reference tree, and all
modifications are realized in one global step. Thus the paths always refer to
the same tree operand and consequently, delta sequences are more easily
understood by a human reader, because they do not have to maintain the cognitive
load of mentally propagating tree modifications to paths. From a computational
point of view, this model is well adapted to stream based processing pipes in
which each operation can be performed concurrently (e.g. on a multi core
processor). As this approach doesn't need building a tree in memory, it is
particularly indicated for large and very large document processing.
The delta model we propose as foundational for history encoding integrates those two
transformation models, in the sense that both can be expressed explicitly. The first
one
is captured by sequences of singleton snapshots whereas the second may involve snapshots
including several deltas.
We identified 4 key issues when addressing the challenge of encapsulating document
history into a stand alone XML instance: mastering the potentially big data volume,
usability of the history model (human vs machine), flexibility of the history
representation and abstracting over diff operations (i.e. not being dependent of a
particular diff algorithm/engine behavior). The method described in this paper addresses
these four issues at various levels. We believe our approach therefore constitutes
a
significant contribution having a realistic applicability level, though the data volume
issue would still require large scale experiments.
Acknowledgment
This work is supported by the European project SHAMAN (FP7- ICT-216736,
http://shaman-ip.eu/shaman/ ). The author also would like to thanks
Jean-Pierre Chanod for his continuous support and helpful comments
We do a structural induction over Δ. Each sub-case is associated with a corresponding
well-formedness hypothesis; we use these together with the definitions of inversion.
We
show here the case for the Δ1;Δ2
composition:
Goal :
d › Δ1;Δ2 › d’
⇒
d’ › ∘(
Δ1;Δ2 ) › d
We introduce the left hand as a new hypothesis H and we get the new goal
d’ › ∘(
Δ1;Δ2 ) › d
By applying the definition of inversion (thanks to H), we get
d’ › ∘Δ2 ;
∘Δ1 › d
The we apply the property (a-seq)
∃ dd , d’ › ∘Δ2
› dd ⋀ dd › ∘Δ1
› d
We apply again the (a-seq) property to hypothesis H, and get a
particularisation for intermediate tree d''. Thus we choose d'' as an evidence
for the existence of dd, and get
d’ › ∘Δ2 › d''
⋀ d'' › ∘Δ1 ›
d
Now we split the goal in two subgoals and apply the induction hypothesis to
both.
Appendix B. An RelaxNG schema (compact syntax ) capturing our model
default namespace xversion = "XRCE::DS::X-Version"
datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"
start = x-version
anyElement = element * - xversion:* { anyAttribute*, anyElement*} | text
anyAttribute = attribute * { text}
x-version = element x-version {
element x-version-body {
attribute ref {V-REF},
attribute xml:space {"preserve" | "ignore"}?,
SUBTREE
}+,
element x-version-history {
attribute first {list {V-REF+}},
attribute last {list {V-REF+}},
# the version history encodes a DAG (Directed Acyclic Graph)
version*
}
}
version = element version {
attribute id { V-REF},
# allows zero or more
attribute conflicts {xsd:nonNegativeInteger}?,
a-time?,
delta*
}
a-target=
# this attribute defines a "one-step" next version
attribute to {V-REF}
delta = element delta { a-target, (rename | insert | append | move | copy | delete | replace)* }
a-here=
# the "here" attribute designates the insertion point (existing stuff is to be shift to right after insertion)
attribute here {IPath}
insert=
# "insert" only applies to elements, text, comment and PI
element insert {
a-here,
# the subtree to be inserted is here (but skipped thanks to a NVDL rule)
SUBTREE
}
append=
# "append" applies to everything including attributes
# for elements, text, comments, PI, the item is appended at the end of the sequence of children
element append {
a-here,
# the "attribute" attribute is used to specify attribute insertion; it must have the "QName=Value" syntactic structure
((attribute attribute {A-DEF}?, empty) | SUBTREE)
}
copy= element copy {
a-what, a-here,
attribute append {flag}?
}
move= element move {
a-what, a-here,
attribute append {flag}?
}
delete= element delete { a-what, SUBTREE }
replace= element replace { a-what, SUBTREE }
rename=
# works for elements, attributes, and PI as well
# the "as" attribute is a qualified name (that is, may refer to a given namespace through a declared prefix
element rename { a-what, attribute as {token}}
SUBTREE=anyElement | conflict
conflict=element conflict { item+ }
item = element item { attribute ref {V-REF}, anyElement }
attribute-conflict =element attribute-conflict { attribute ref {V-REF}, A-DEF}
tree-conflict = element tree-conflict { attribute ref {V-REF}, A-DEF}
V-REF= xsd:string { pattern="v\d+(\.\d+)*" }
A-DEF=string
flag = "yes" | "no"
a-what = attribute what {XPath}
a-time = attribute time {xsd:dateTime}
XPath = xsd:string
IPath = xsd:string { pattern="1(/\d+)*(/@([\c-[:]]+:)?[\c-[:]]+)?" }
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" queryBinding="xslt2">
<sch:ns uri="XRCE::DS::X-Version" prefix="xv"/>
<sch:pattern >
<sch:let name="versions" value="/*[1]/*[2]/xv:version"/>
<sch:let name="versions-id" value="for $i in $versions/@id return normalize-space($i)"/>
<sch:let name="Nversions-id" value="distinct-values($versions-id)"/>
<sch:let name="history" value="/*[1]/xv:x-version-history[1]"/>
<sch:let name="starting-vector" value="tokenize(normalize-space($history/@first),'\s')"/>
<sch:let name="ending-vector" value="tokenize(normalize-space($history/@last),'\s')"/>
<sch:let name="Nstarting-vector" value="distinct-values($starting-vector)"/>
<sch:let name="Nending-vector" value="distinct-values($ending-vector)"/>
<sch:rule context="/*[1]/xv:x-version-history[1]">
<sch:assert test="count($versions-id) ge count($Nversions-id)">Every version element must have an unique id attribute</sch:assert>
<sch:assert test="count($starting-vector) eq count($Nstarting-vector)">"first" attribute should refer to each relevant version only once</sch:assert>
<sch:assert test="count($ending-vector) eq count($Nending-vector)">"last" attribute should refer to each relevant version only once</sch:assert>
<sch:assert test="every $i in $Nstarting-vector satisfies index-of($Nversions-id,$i) gt 0 ">
the "first" attribute should point to existing version(s) (cf
"<sch:value-of select="string-join($Nstarting-vector,'/')"/>" versus "<sch:value-of select="string-join($Nversions-id,'/')"/>"
)
</sch:assert>
<sch:assert test="every $i in $Nending-vector satisfies index-of($Nversions-id,$i) gt 0 ">
the "last" attribute should point to existing version(s) (cf
"<sch:value-of select="string-join($Nending-vector,'/')"/>" versus "<sch:value-of select="string-join($Nversions-id,'/')"/>"
)
</sch:assert>
</sch:rule>
<sch:rule context="xv:version">
<sch:report test="count(index-of($versions-id,normalize-space(@id))) gt 1">
The "id" attribute must be unique ("<sch:value-of select="@id"/>")
</sch:report>
<sch:let name="my-id" value="@id"/>
<sch:assert test="(count(../xv:version/xv:delta[@to=$my-id]) gt 0) or (index-of($Nstarting-vector,$my-id) gt 0)">
missing back link to anterior version (version "<sch:value-of select="@id"/>")
</sch:assert>
<sch:assert test="(count(xv:delta[@to]) gt 0) or (index-of($Nending-vector,$my-id) gt 0)">
missing link to posterior version (version "<sch:value-of select="@id"/>")
</sch:assert>
</sch:rule>
<sch:rule context="xv:delta">
<sch:assert test="index-of($Nversions-id,@to) gt 0">
The link to version "<sch:value-of select="@to"/>" is dangling (no corresponding version found in the whole history)
</sch:assert>
<sch:let name="my-dest" value="@to"/>
<sch:report test="count(preceding-sibling::xv:delta[@to=$my-dest]) gt 0" >
Delta must be unique for one target version ("<sch:value-of select="@to"/>")
</sch:report>
</sch:rule>
<sch:rule context="xv:conflict">
<sch:let name="all-item-refs" value="for $i in xv:item/@ref return normalize-space($i)"/>
<sch:assert test="count(distinct-values($all-item-refs)) eq count($all-item-refs)">
Conflicting items must be uniquely defined for a given version
</sch:assert>
</sch:rule>
<sch:rule context="xv:item">
<sch:assert test="index-of($Nversions-id,@ref) gt 0">
The reference to version "<sch:value-of select="@ref"/>" is dangling (no corresponding version found in the whole history)
</sch:assert>
</sch:rule>
</sch:pattern>
</sch:schema>
References
[Zha89] Simple Fast Algorithms for the Editing Distance between Trees and
Related Problems, Kaizhong Zhang; Dennis Shasha, SIAM J. Comput., 1989
[Bille95] A survey on tree edit distance and related problems,
Philip Bille, Theoretical Computer Science , Volume 337 Issue 1-3 June 2005.
[Chawathe96] Change detection in hierarchically structured
information, Sudarshan S. Chawathe; Anand Rajaraman;Hector Garcia-Molina ;
Jennifer Widom, SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference
on Management of data, June 1996, doi:https://doi.org/10.1145/233269.233366
[Chien00] A Comparative Study of Version Management Schemes for XML
Documents, Shu-Yao Chien ; Vassilis J. Tsotras ; Carlo Zaniolo, 2000
[Cobe02] A comparative study for XML change detection, Grégory
Cobéna; Talel Abdessalem; Yassine Hinnach, Research Report, INRIA Rocquencourt, France,
2002 PDF
[LaFontaine02] Merging XML files: a new approach
providing intelligent merge of XML data sets, Robin La Fontaine, 2002
[Martinez02] A method for the dynamic generation of virtual versions of
evolving documents, Mercedes Martinez, Jean-Claude Derniame, Pablo de la
Fuente, SAC 2002, Madrid, Spain, 2002
[pédauque03] Document: Form, Sign and Medium, as Reformulated for Electronic
Documents, Roger T. Pédauque, collective writing, STIC-CNRS, 12 September
2003. PDF
[Wang03] X-Diff: A Fast Change Detection Algorithm for
XMLDocuments. Yuan Wang , David J. DeWitt , Jin-Yi Cai in Proceedings of the
International Conference on Data Engineering (ICDE'03) PDF
[pédauque05] Le texte en jeu (Permanence et transformations du
document), Roger T. Pédauque, collective writing, STIC-CNRS, 7 April 2005.
PDF
[Lind06] Fast and Simple XML Tree Differencing by Sequence
Alignment, Tancred Lindholm; Jaakko Kangasharju; Sasu Tarkoma, DocEng '06:
Proceedings of the 2006 ACM symposium on Document engineering October 2006, doi:https://doi.org/10.1145/1166160.1166183
[Khanna07] A Formal Investigation of diff3, Sanjeev Khanna; Keshav
Kunal ; Benjamin C. Pierce, In Arvind and Prasad, editors, Foundations of Software
Technology and Theoretical Computer Science (FSTTCS), December 2007 PDF
[Tata07] Tree automata techniques and
applications, Hubert Comon; Max Dauchet; Florent Jacquemard; Denis Lugiez;
Sophie Tison; Marc Tommasi , 2007 PDF
[Rönnau08] Merging Changes in XML Documents Using
Reliable Context Fingerprints, Sebastian Rönnau; Christian Pauli; Uwe M.
Borghoff, ACM Symposium on Document Engineering, September 2008, doi:https://doi.org/10.1145/1410140.1410151
[Rönnau09] Efficient Change Control of XML
Documents, Sebastian Rönnau; Christian Pauli; Uwe M. Borghoff, ACM
Symposium on Document Engineering, September 2009, doi:https://doi.org/10.1145/1600193.1600197
[Arevalo09] A Web-Based Version Editor for XML
Documents, Luis Arévalo Rosado, Antonio Polo Márquez and Miryam Salas
Sánchez, ACM Symposium on Document Engineering, September 2009, doi:https://doi.org/10.1145/1600193.1600249
[Schematron] ISO Schematron, a language for making assertions about patterns
found in XML documents, Topologi , web
site
[1] the XPath translation of 1/2/1 could also
be node()[1]/node()[2]/node()[1] in order
to consider all possible nodes as allowed by the XML document tree
model.
[2] Actually the calculus can be refined in such a way that this condition can
be relaxed thanks to a particular rewriting of conflicting deltas. This
suppose a particular semantic interpretation of conflicting
information.
[3] These operations are known to be non-trivial to compute by a diff
engine, can be non optimal, and are actually rarely produced for this
reason. However, other sources of delta computation such as smart
versioning oriented editors can produce very useful move and copy delta
operations.
[4] A theorem establishes that pp1 ⊥
pp2 ⇔ pp1≪
pp2 ⋁ pp2 ≪
pp1
[5] focus node vk is by convention identified
through surrounding brackets [.]
[6] Such an abstract property could be hardly met by a "black box" diff
operator, from the implementation point of view. However, we investigate
if a delta normalization procedure could reduce those cases, so that the
abstract property of Figure 18 would be always
verified.
Change detection in hierarchically structured
information, Sudarshan S. Chawathe; Anand Rajaraman;Hector Garcia-Molina ;
Jennifer Widom, SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference
on Management of data, June 1996, doi:https://doi.org/10.1145/233269.233366
A method for the dynamic generation of virtual versions of
evolving documents, Mercedes Martinez, Jean-Claude Derniame, Pablo de la
Fuente, SAC 2002, Madrid, Spain, 2002
X-Diff: A Fast Change Detection Algorithm for
XMLDocuments. Yuan Wang , David J. DeWitt , Jin-Yi Cai in Proceedings of the
International Conference on Data Engineering (ICDE'03) PDF
Fast and Simple XML Tree Differencing by Sequence
Alignment, Tancred Lindholm; Jaakko Kangasharju; Sasu Tarkoma, DocEng '06:
Proceedings of the 2006 ACM symposium on Document engineering October 2006, doi:https://doi.org/10.1145/1166160.1166183
A Formal Investigation of diff3, Sanjeev Khanna; Keshav
Kunal ; Benjamin C. Pierce, In Arvind and Prasad, editors, Foundations of Software
Technology and Theoretical Computer Science (FSTTCS), December 2007 PDF
Merging Changes in XML Documents Using
Reliable Context Fingerprints, Sebastian Rönnau; Christian Pauli; Uwe M.
Borghoff, ACM Symposium on Document Engineering, September 2008, doi:https://doi.org/10.1145/1410140.1410151
Efficient Change Control of XML
Documents, Sebastian Rönnau; Christian Pauli; Uwe M. Borghoff, ACM
Symposium on Document Engineering, September 2009, doi:https://doi.org/10.1145/1600193.1600197
A Web-Based Version Editor for XML
Documents, Luis Arévalo Rosado, Antonio Polo Márquez and Miryam Salas
Sánchez, ACM Symposium on Document Engineering, September 2009, doi:https://doi.org/10.1145/1600193.1600249