How to cite this paper
Durusau, Patrick. “Deferred Well-Formedness and Validity: Change.log, Collaboration, Immutability, XML,
UUIDs.” Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). https://doi.org/10.4242/BalisageVol26.Durusau01.
Balisage: The Markup Conference 2021
August 2 - 6, 2021
Balisage Paper: Deferred Well-Formedness and Validity
Change.log, Collaboration, Immutability, XML, UUIDs
Patrick Durusau
Patrick Durusau is the Chair of the OASIS Open Document Format for Office Applications (OpenDocument) TC and has been a member of that TC since its initial meeting on December 16, 2002.
His employer/sponsor has changed several times over the years and Patrick has been
a co-editor/editor of the OpenDocument Format (ODF) for the majority of that time.
Patrick is also the project editor for the ISO/IEC mirror of ODF as ISO/IEC 26300.
Patrick blogs about topic maps (being one of the co-editors of ISO 13250-5), other
semantic issues and of late, how irregular forces can leverage data for their causes
at Another Word for It.
Abstract
This proposal emerges out of conversations about introducing collaborative editing
into OpenDocument Format (ODF) applications, as a type of change tracking. Vis-a-vis a document, a lone author is a lesser and included case of collaborative
editing. In either case, changes have to be captured, along with their metadata, and
reconciled, in the case of conflicting edits.
Despite progress on the software side of collaborative editing for a variety of formats,
there has been no visible progress on capturing changes, or reconciling them,
in OpenDocument Format documents. Being habituated, not to say addicted, to markup
approaches, I understandably find the lack of format discussions disquieting.
It's all well and good to have change tracking/collaborative editing working successfully
in software, but what the hell am I going to write down in ODF?
How to capture changes, from one or many authors, and how to capture reconciliations are the focus of this proposal. That requires unique identification
of changes (one or many authors), identifying where changes may be applied, and recording
the application of changes (the resulting document).
Table of Contents
- Introduction
- Change Log
- Identification of Changes (2), Proposed Changes (5), Location of Proposed Changes (6)
- Acceptance or Denial of Change
- Well-formed and Valid (finally)
- Conclusion
Introduction
Thanks are usually consigned to a footnote, but I want to thank reviewers #1, #2, and #3 for saving
you from a poorly written and likely boring presentation. I had attempted to write in
the grandiose voice of tech papers instead of saying what I have found and why I find
it persuasive, without a lot of hand waving or convoluted arguments. Any hand waving or
convoluted arguments that remain in this paper and/or presentation remain because I failed
to take their advice. A large round of thanks for Balisage reviewers!
As I say in the abstract, I view the problem of collaborative editing as a superset
of change tracking for a single editor. That being the case, what works for the larger
use case should suffice for the lesser. Moreover, they should share a common syntax.
Exceptions, "or" statements, seem to trouble programmers, so one goal of the proposed
format is to treat all cases the same. No exceptions.
The requirements of change tracking in XML are well known enough to not require citation:
1. change log
2. identification of the change (for acceptance or denial)
3. author of the change
4. date of change
5. proposed change
6. location of proposed change
7. acceptance or denial of the change
8. date of acceptance or denial of change
9. acceptor or denier of change
10. a well-formed and, in the case of ODF, a valid XML document for presentation to the parser
Implementations may choose to optimize this information internally. What is presented
here assumes verbosity is not an issue.
Change Log
The idea of using a change log to capture proposed changes to an ODF
document came out of discussions of Operational Transformations (OT). Not that OT has a log such as proposed here, but capturing proposed changes requires
a means of recording them.
As we will see later, capturing proposed changes separately from the content.xml file allows us to avoid the question of how to capture changes while at the same time
maintaining well-formedness and validity. In fact, even conflicting changes can be captured
when held separately from the document instance. But that doesn't answer the question
of how to uniquely identify changes from random authors.
Identification of Changes (2), Proposed Changes (5), Location of Proposed Changes (6)
Identification of changes, proposed changes, and the location of proposed changes
all share one difficulty: how to coordinate the uncoordinated editing of documents.
That is to say, authors may be online simultaneously, online separately, or even offline
and still editing the same document. Before we even reach reconciliation, how do we
reliably distinguish edits, one from the other?
Fortunately, the problem of uncoordinated identification was solved outside the markup
world, under the unwieldy title: Information technology – Procedures for the operation of object identifier registration
authorities: Generation of universally unique identifiers and their use in object
identifiers, Recommendation ITU-T X.667.
Recommendation ITU-T X.667 defines the concept of generating "universally unique identifiers
(UUIDs)" and specifies procedures for their generation. The details of generation
need not delay us, but the introduction lays the groundwork for incorporation of UUIDs
as part of a change tracking log for ODF documents:
This Recommendation | International Standard standardizes the generation of universally
unique identifiers (UUIDs).
UUIDs are an octet string of 16 octets (128 bits). The 16 octets can be interpreted
as an unsigned integer encoding, and the resulting integer value can be used as the
primary integer value (defining an integer-valued Unicode label) for an arc of the
International Object Identifier tree under the Joint UUID arc. This enables users
to generate object identifier and OID internationalized resource identifier names
without any registration procedure.
...
If generated according to one of the mechanisms defined in this Recommendation | International
Standard, a UUID is either guaranteed to be different from all other UUIDs generated
before 3603 A.D., or is extremely likely to be different (depending on the mechanism
chosen).
No centralized authority is required to administer UUIDs. Centrally generated UUIDs
are guaranteed to be different from all other UUIDs centrally generated.
A UUID can be used for multiple purposes, from tagging objects with an extremely short
lifetime, to reliably identifying very persistent objects across a network, particularly
(but not necessarily) as part of an object identifier or OID internationalized resource
identifier value, or in a uniform resource name (URN).
With a near guarantee (check with your lawyers) of uniqueness until 3603 C.E. (that is
beyond the end of Unix time, if you are interested), the identification of changes
in a change log with a UUID (that's GUID for people from Redmond) looks good.
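As a rough illustration of what the quoted introduction describes, the sketch below uses Python's standard uuid module to view one UUID as 16 octets, as an unsigned integer, as a URN, and as an object identifier. The arc value 2.25 for the joint UUID arc and the helper name uuid_to_oid are my additions, not part of this proposal; the sample UUID is the one embedded in the United States Code id quoted in the conclusion.

```python
import uuid

# Sample UUID taken from the United States Code example in the conclusion.
SAMPLE = uuid.UUID("92be36ef-db2e-11eb-bf11-e2f53ffbac53")


def uuid_to_oid(u: uuid.UUID) -> str:
    """Hypothetical helper: the UUID as an OID under the joint UUID arc.

    Assumes the joint UUID arc is 2.25 and that the arc below it is the
    UUID read as an unsigned 128-bit integer.
    """
    return "2.25." + str(u.int)


octets = SAMPLE.bytes        # the 16 octets (128 bits)
as_integer = SAMPLE.int      # the same octets as an unsigned integer
as_urn = SAMPLE.urn          # the "urn:uuid:..." form
as_oid = uuid_to_oid(SAMPLE)

print(len(octets), as_integer, as_urn, as_oid, sep="\n")
```

Nothing here requires a registration authority, which is the point of the quoted passage: any author, online or offline, can mint these values independently.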
But it's not just the identification of changes. What of the identification of elements
within a change? And the poor editor who is editing offline, how do they align their
changes against an ever changing XML tree?
What if all ODF elements used their xml:ids to hold UUIDs, prefixed by "odf" so as
to be a valid xml:id (an xml:id must be an XML name, which cannot begin with a digit, as a bare UUID may)?
The author of any ODF element, and anyone with whom that element
has been shared, knows a unique xml:id for addressing that element, whether to put material
before or after it or to delete it. What's more, offline editors are generating
their own unique xml:ids, enabling them to edit both the XML elements known to
them and the XML elements they have created.
That scenario presumes that xml:ids are immutable and change logs are append only,
but why not? Memory is, for all practical intents and purposes, unlimited, so we need
not keep acting like we are all editing XML on XT clones. Not to mention that databases
(I know, document crowd, but you have heard of databases, yes?) nearly universally use
UUIDs. If anything, we are behind the curve on using them in connection with XML documents.
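A minimal sketch of minting such identifiers, in Python, might look like the following. It assumes random (version 4) UUIDs and a prefix concatenated directly onto the UUID string; the proposal itself fixes neither choice, and the names new_odf_id and is_odf_id are hypothetical.

```python
import re
import uuid

# Assumption: "odf" is concatenated directly onto the hyphenated UUID.
# The prefix matters because an xml:id must not start with a digit,
# and a bare UUID often does.
ODF_ID = re.compile(
    r"^odf[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def new_odf_id() -> str:
    """Mint an identifier with no coordination between authors."""
    return "odf" + str(uuid.uuid4())  # version 4 is an assumption


def is_odf_id(value: str) -> bool:
    """Check that a value has the prefixed-UUID shape sketched above."""
    return ODF_ID.match(value) is not None


# Two editors, online or offline, each mint ids on their own machine.
alice_id = new_odf_id()
bob_id = new_odf_id()
print(alice_id, bob_id, is_odf_id(alice_id))
```

Because the identifiers are never reused or rewritten, an edit recorded against one of them by an offline editor still has something to point at when that editor reconnects.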
Acceptance or Denial of Change
To just rough in the syntax of a changelog.txt file, at this point we have:
odfUUID author date insert (node|nodes) items before location
or
odfUUID author date insert (node|nodes) items after location
or
odfUUID author date delete (node|nodes) location
In order to avoid creating unnecessary difficulties, insert and delete operations
should be at element boundaries. If a deletion crosses a paragraph boundary, for example,
the deletion should be of the text nodes and not the beginning paragraph element.
In terms of representation, not as a constraint on execution, I propose the use of
XQuery 3.1 Update primitives, but only insert and delete.
A pattern for recording acceptance or denial, following the format used for recording changes, could be:
odfUUID author date accept/deny odfUUID (of change accepted or denied)
That serves to identify the acceptor or denier of an insertion or deletion, separate from its original
author.
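The patterns above leave the concrete serialization open. As one possible reading, the Python sketch below stores each entry as a single tab-separated line and keeps the log append only. The field order, the tab delimiter, the restriction that fields contain no tabs or newlines, and every name (Entry, ChangeLog, new_id, now) are assumptions made for illustration, not part of the proposal.

```python
from dataclasses import astuple, dataclass
from datetime import datetime, timezone
import uuid


def new_id() -> str:
    """Hypothetical helper: an "odf"-prefixed UUID for an entry or node."""
    return "odf" + str(uuid.uuid4())


def now() -> str:
    """Hypothetical helper: the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class Entry:
    entry_id: str       # odfUUID of this log entry
    author: str
    date: str
    op: str             # "insert", "delete", "accept" or "deny"
    items: str = ""     # serialized node(s), inserts only
    position: str = ""  # "before" or "after", inserts only
    target: str = ""    # location, or the odfUUID being accepted or denied

    def to_line(self) -> str:
        fields = astuple(self)
        if any("\t" in f or "\n" in f for f in fields):
            raise ValueError("fields must not contain tabs or newlines")
        return "\t".join(fields)

    @classmethod
    def from_line(cls, line: str) -> "Entry":
        return cls(*line.rstrip("\n").split("\t"))


class ChangeLog:
    """Append only: entries are added, never edited or removed."""

    def __init__(self) -> None:
        self._entries = []

    def append(self, entry: Entry) -> None:
        if any(e.entry_id == entry.entry_id for e in self._entries):
            raise ValueError("entry ids must be unique")
        self._entries.append(entry)

    def lines(self) -> list:
        return [e.to_line() for e in self._entries]


log = ChangeLog()
change = Entry(new_id(), "alice", now(), "insert",
               "<p>New text</p>", "after", new_id())
log.append(change)
# The acceptor is recorded separately from the original author.
log.append(Entry(new_id(), "bob", now(), "accept", target=change.entry_id))
print("\n".join(log.lines()))
```

Treating acceptance and denial as ordinary entries keeps to the goal stated in the introduction: one syntax, no exceptions.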
Well-formed and Valid (finally)
Assuming we have an append only change log and immutable xml:ids, how does that get
us to a well-formed and valid XML document to feed to an XML parser? Good question!
The beginning state of the XML document is recorded, just like any other change, following
the pattern:
odfUUID author date insert (node|nodes) items
except that it has no "location" value. It is the starting state of the document and
all changes will be recorded against the nodes in that starting state.
A version of the document is captured in the change log as follows:
odfUUID author date odfUUIDs (separated by commas)
The list of odfUUIDs, when those operations are performed, results in a well-formed
and valid ODF document for presentation to an XML parser.
There is no constraint on changelog.txt to prevent there being multiple versions of
the same document, representing differing decisions about what changes to accept or
reject.
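To make the replay idea concrete, here is a small Python sketch that rebuilds a document from a list of change identifiers using the standard xml.etree.ElementTree module. It is only one way to read the proposal: the dictionary layout of the operations, the function names, the short stand-in identifiers (real ones would be "odf" plus a UUID), and the unnamespaced doc and p elements are all assumptions, and it ignores mixed content and validity checking.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_with_parent(root, xml_id):
    """Return (parent, index) of the element carrying the given xml:id."""
    for parent in root.iter():
        for index, child in enumerate(parent):
            if child.get(XML_ID) == xml_id:
                return parent, index
    raise KeyError(xml_id)


def replay(ops, version):
    """Apply the listed operations in order and return the resulting root.

    Assumes the first listed operation is the initial insert, the one
    with no location. The root element itself cannot be a target here.
    """
    root = None
    for op_id in version:
        op = ops[op_id]
        if op["op"] == "insert" and "location" not in op:
            root = ET.fromstring(op["items"])
        elif op["op"] == "insert":
            parent, index = find_with_parent(root, op["location"])
            if op["position"] == "after":
                index += 1
            parent.insert(index, ET.fromstring(op["items"]))
        elif op["op"] == "delete":
            parent, index = find_with_parent(root, op["location"])
            del parent[index]
        else:
            raise ValueError("unknown operation: " + op["op"])
    return root


# Stand-in ids for readability; real ones would be "odf" + UUID.
ops = {
    "odf-c1": {"op": "insert",
               "items": '<doc xml:id="odf-doc"><p xml:id="odf-p1">One</p></doc>'},
    "odf-c2": {"op": "insert", "items": '<p xml:id="odf-p2">Two</p>',
               "position": "after", "location": "odf-p1"},
    "odf-c3": {"op": "insert", "items": '<p xml:id="odf-p0">Zero</p>',
               "position": "before", "location": "odf-p1"},
    "odf-c4": {"op": "delete", "location": "odf-p1"},
}

# Two versions of the same document, reflecting different decisions.
version_a = ["odf-c1", "odf-c2", "odf-c3"]
version_b = ["odf-c1", "odf-c2", "odf-c4"]

for version in (version_a, version_b):
    print(ET.tostring(replay(ops, version), encoding="unicode"))
```

Both versions are produced from the same append-only set of operations; only the list of identifiers differs.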
Conclusion
The immutability of xml:ids used in this proposal introduces several advantages that
may not be immediately evident. One of the primary ones is that any editor capable
of producing a pointer to an xml:id can submit both edits and comments to
a document, so long as it exists in digital form. In cases where public comment is
sought but later not included in final publication, the attachment of that content
is never lost.
The same longevity of annotations and comments is true when a document is purged of
such notes when shared with others, but you want to restore the notes when the document,
perhaps edited, returns to your possession.
Horror stories of editorial comments leaking out can be avoided automatically with
this proposal because the changelog will be zero bytes for a document with no changes.
Pristine for distribution, as it were.
There are whispers that programmers don't like to preserve xml:ids, but we know in
fact that quite large document systems can and do, every day. Consider this example:
<section style="-uslm-lc:I80" id="id92be36ef-db2e-11eb-bf11-e2f53ffbac53" identifier="/us/usc/t26/s107"><num
value="107">§ 107.</num><heading> Rental value of parsonages</heading>
Part of title 26, Internal Revenue Code, from the Office of the Law Revision Counsel, United States Code
Serious publishers have no objections to UUIDs; why should you?
References
[Schubert 2014] Svante Schubert, Sebastian Rönnau, and Patrick Durusau. 2014. Interoperable Document
Collaboration. In Proceedings of the 2nd International Workshop on (Document) Changes:
modeling, detection, storage and visualization (DChanges '14). Association for Computing
Machinery, New York, NY, USA, Article 6, 1–4. https://doi.org/10.1145/2723147.2723155.
[Schubert 2019] The Next Millennium Document Format. In Proceedings of the ACM Symposium on Document Engineering 2019 (DocEng '19). Association
for Computing Machinery, New York, NY, USA, Article 40, 1–4. https://doi.org/10.1145/3342558.3345419.
[ITU-T X.667 2012] Recommendation ITU-T X.667. http://handle.itu.int/11.1002/1000/11746