Beck, Jeff. “The False Security of Closed XML Systems.” Presented at Balisage: The Markup Conference 2011, Montréal, Canada, August 2 - 5, 2011. In Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7 (2011). https://doi.org/10.4242/BalisageVol7.Beck01.
Balisage: The Markup Conference 2011 August 2 - 5, 2011
Balisage Paper: The False Security of Closed XML Systems
Jeff Beck
Technical Information Specialist
National Center for Biotechnology Information (NCBI), National Library of
Medicine (NLM), National Institutes of Health (NIH)
Jeff has been involved in the PubMed Central project at NLM since 2000. He has
been working in journal publishing since the early 1990s.
Author's contribution to the Work was done as part of the Author's official duties
as an NIH employee and is a Work of the United States Government. Therefore, copyright
may not be established in the United States. 17 U.S.C. § 105. If Publisher intends
to disseminate the Work outside the U.S., Publisher may secure copyright to the extent
authorized under the domestic laws of the relevant country, subject to a paid-up,
nonexclusive, irrevocable worldwide license to the United States in such copyrighted
work to reproduce, prepare derivative works, distribute copies to the public and perform
publicly and display publicly the work, and to permit others to do so.
Abstract
Creating and using XML documents in a closed system gives a false sense of “All’s
Well”. A Closed XML system is one where a single party creates and uses the XML
documents. Certainly this happens all of the time. We all create little XML
documents with one-off throw-away models for one task or another that only we use.
These documents will not be the focus here. When documents are created in a closed
XML workflow and sent to another entity - such as an archive - for reuse, publishers
can quickly learn that the content they have been accumulating is not useful for
anything outside of their system and may not be useful if their own systems are
upgraded or changed.
The checks and procedures that must be put in place to support XML interchange
between entities should be applied within closed XML systems to keep the content
usable and useful.
I have been working on the PubMed Central [PMC01] repository at the National Library of
Medicine for 11 years. My role there is to ingest XML from different publishers and
transform it into the JATS format [JATS01] for inclusion in our database. In this role,
I have seen a lot of article SGML and XML content and had to make decisions on whether
it was of a consistent quality to be included in the PMC database or not.
Sometimes we have problems with content that is submitted to PMC. We see content that
is not well-formed, content that is not valid to the schema that is being used, and
content with tag abuse and other inconsistencies in how the elements and attributes of
the XML model have been applied.
PMC accepts content in many different XML formats, but well over 50% of the content
being supplied currently is in one of the JATS models. Both the Archiving and Interchange
and the Journal Publishing models use the XHTML table model; the CALS table model is
supplied in the Tag Suite but not used in these two models. A wonderful example of
content no longer being valid outside of a closed system involved a modification of the
DTD to call in the CALS table model. The DTD was expanded correctly, and everything ran
fine on the publisher's system. However, the DTD files were not given a new name, and the
PUBLIC and SYSTEM IDs used in the DOCTYPE declaration in the instances were those that
were defined for the standard Journal Publishing model.
When this content arrived in PMC, the PUBLIC ID was resolved to the standard Journal
Publishing DTD (without the extra table model), and all of the instances were invalid.
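The mismatch can be sketched in a few lines of Python. This is only an illustration: the version 3.0 PUBLIC ID is used as an example (the version actually involved is not stated here), and the two catalog dictionaries are hypothetical stand-ins for how each party resolves that identifier to a DTD and its table models.

```python
import xml.parsers.expat

# Example identifier only; the version involved in the anecdote is not known.
PUBLIC_ID = "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN"

# An instance whose DOCTYPE still names the standard DTD.
INSTANCE = (
    '<!DOCTYPE article PUBLIC "' + PUBLIC_ID + '" "journalpublishing3.dtd">'
    "<article><body><p>Text</p></body></article>"
).encode("utf-8")


def doctype_ids(xml_bytes):
    """Return (public_id, system_id) read from the DOCTYPE declaration."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["ids"] = (public_id, system_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_bytes, True)
    return found.get("ids", (None, None))


# Hypothetical catalogs: the same PUBLIC ID resolves to different table models.
PUBLISHER_CATALOG = {PUBLIC_ID: {"xhtml", "cals"}}  # locally modified DTD
RECEIVER_CATALOG = {PUBLIC_ID: {"xhtml"}}           # standard DTD

public_id, system_id = doctype_ids(INSTANCE)
missing_at_receiver = PUBLISHER_CATALOG[public_id] - RECEIVER_CATALOG[public_id]
print(public_id, system_id, missing_at_receiver)
```

Nothing in the instance itself signals that the model was changed; the difference only shows up when the two resolutions of the same identifier are compared.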
When we provide feedback on the sample XML supplied during the PMC evaluation process
[Beck01], the response we most often hear from the publishers is "We paid a lot of money
to get this XML. It works on our website, so we know it is good."
If inconsistent and invalid content works in their system, it is obvious that they are
creating and using their content in a Closed XML System.
What Is a Closed XML System?
A closed XML system is a system where the XML files never leave the system or are
never used by anyone other than the creator. We have all created little XML documents
with one-off throw-away models for one task or another that only we use.
An example would be if you were creating a To Do list application to run
in XML for your own use. First you would figure out what you want to track and probably
start with a sample document.
<todolist>
  <todo>
    <due month="07" day="08" year="2011"/>
    <first-reminder month="07" day="01" year="2011">
      <message>Just one week left to finish the paper.</message>
    </first-reminder>
    <second-reminder month="07" day="01" year="2011">
      <message>Time to call Tommie to get an extension.</message>
    </second-reminder>
    <task>Finish Balisage Paper.</task>
  </todo>
</todolist>
As this simple model has evolved, you've kept up with it with your XSLT or XQuery
that you are
using to process the To Do list. You can handle a <reminder>, a <first-reminder>,
and
a <second-reminder>, both with and without <message>. The To Do list works fine because,
although you have inconsistent data, you were able to make allowances for it in your
processor
when you made the changes to the model.
This is not a problem, because you control both ends of the process and nothing is
showing
up unexpectedly. Confusion could arise if you shared your data with someone else who
had to figure
out what the difference was between a <reminder> and a <first-reminder> and what to
do
with those that have messages and those that don't.
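To make the "allowances" concrete, here is a minimal sketch of such a processor. It is written in Python rather than the XSLT or XQuery mentioned above, and the sample data, the REMINDER_NAMES tuple, and the reminders function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data mixing the element names the model has accumulated.
TODO_XML = """<todolist>
  <todo>
    <due month="07" day="08" year="2011"/>
    <reminder month="06" day="24" year="2011"/>
    <first-reminder month="07" day="01" year="2011">
      <message>Just one week left to finish the paper.</message>
    </first-reminder>
    <task>Finish Balisage Paper.</task>
  </todo>
</todolist>"""

# Every variant must be listed here, or it is silently ignored.
REMINDER_NAMES = ("reminder", "first-reminder", "second-reminder")


def reminders(todo):
    """Collect (element name, date, message) for each reminder variant."""
    found = []
    for child in todo:
        if child.tag in REMINDER_NAMES:
            date = "{year}-{month}-{day}".format(**child.attrib)
            # A reminder without a <message> yields an empty string.
            found.append((child.tag, date, child.findtext("message", default="")))
    return found


result = reminders(ET.fromstring(TODO_XML).find("todo"))
print(result)
```

The knowledge of which names count as reminders, and what a missing message means, lives only in this code. Anyone receiving the data without it has to guess.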
But these little
documents are not the focus of this paper.
Coming from the document publishing side of the XML world, I am going to concentrate
on document content XML: journal articles, books, book chapters, reports. But these
'rules' apply to any XML that is intended to be saved, used or reused. Certainly this
would apply to both document and data XML applications.
XML Interchange
A not-closed XML system is one where there is some interchange of XML. This could be
interchange between organizations, interchange between departments in an organization,
or between individuals. Interchange can be a sharing of content between entities, but
it could also be between steps in an XML workflow.
The submission of papers for this conference is an example of XML interchange. The
author creates XML to be used by the conference committee for peer review and
(hopefully) publication in the conference proceedings.
A Classic Communication Model
Wiener's modification of Shannon's classic communication model (see Fig. 1; Wiener01, Wiener02, Foulger01) can be applied to XML interchange.
In the communication model, there are two actors, the sender (information source)
and
the receiver (destination).
The communication in Fig. 1 contains the following steps:
1. The Information Source creates a Message.
2. The Message is converted into a Signal and sent by the Transmitter.
3. The Signal may be acted upon or interfered with by Noise - some third party or
   environmental activity.
4. The Received Signal is converted into a Message by the Receiver.
5. The Destination receives the Message.
6. The Destination provides feedback to the Information Source.
Of course, this feedback is another message, with the original Destination as the
Information Source, but what is important here is that the Receiver acknowledges or
confirms the Message. (At this point, the Balisage audience should all be nodding
their
heads in agreement.) The feedback is a critical element of communication. It allows
the
sender to know whether the message is getting through and to make adjustments necessary
to make the communication successful.
This model can be applied to any communication. For example, a telephone conversation:
1. Person A (Information Source) says "XML is great!" (Message) into his telephone.
2. The telephone (Transmitter) converts the sound to an electrical Signal.
3. The cell drops out (Noise).
4. Person B's telephone (Receiver) converts the Received Signal into sound (Message).
5. Person B hears, "XML is gray---".
6. Person B provides feedback: "Gray?"
What can go wrong with XML - The four layers of "bad"
XML can go bad on several levels. These levels were beautifully and simply illustrated
by [Bauman01] in "The 4 'Levels' of XML
Rectitude". The TEI examples in this section are his.
The first thing that can go wrong is that the XML is not well-formed. Simply put, the
basic rules of XML are not followed.
If the document is well-formed, the next potential problem is validity. Does the
syntax of the XML match the schema?
Assuming that the document is well-formed and valid, next you have to worry about
Sensibility. XML constructions that do not make any sense are no good to anyone.
Finally, if the XML is well-formed, valid, and sensibly constructed, the content
may just be wrong.
Applying the communication model to XML Interchange
We can apply the classic communication model to interchange of XML between
entities.
Fig. 6 shows how the communication model can be applied to XML
interchange between parties. The steps in this communication are:
1. The Information Source creates some Content.
2. The Content is encoded into a file based on an XML Model and sent.
3. The file may be acted upon or interfered with by Noise.
4. The file is converted into Content with the XML Model.
5. The Destination receives the Content.
6. The Destination provides feedback to the Information Source.
For our purposes, we can simplify this model somewhat by removing the Noise. Certainly
there can be noise in XML interchange, but I see noise that occurs in the transfer
of
files as a Systems problem and outside the scope of this discussion.
There is another change we need to make, which is at the root of our discussion. Just
as there may have been problems encoding and decoding the Message into and out of
the
Signal in the communication model because the Transmitter and Receiver are not the
same
entity, we need to note here that the XML is encoded with the Sender's XML model and
decoded with the Receiver's XML model (see Fig. 7).
So, if the Sender's XML Model is not exactly the same as the Receiver's XML model,
there will be distortion of the Content, just as there is distortion of the Message
if
the Receiver is not decoding the signal as the Transmitter encoded it.
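A small sketch of this distortion, reusing the To Do list vocabulary from earlier: the encode and decode functions below are hypothetical, and the only difference between the two models is the name of the reminder element.

```python
import xml.etree.ElementTree as ET


def encode(task, reminder_message):
    """Sender's model: a reminder is a <first-reminder> element."""
    todo = ET.Element("todo")
    reminder = ET.SubElement(todo, "first-reminder")
    ET.SubElement(reminder, "message").text = reminder_message
    ET.SubElement(todo, "task").text = task
    return ET.tostring(todo, encoding="unicode")


def decode(xml_text):
    """Receiver's model: a reminder is a <reminder> element."""
    todo = ET.fromstring(xml_text)
    return {
        "task": todo.findtext("task"),
        "reminders": [r.findtext("message") for r in todo.findall("reminder")],
    }


sent = encode("Finish Balisage Paper.", "Just one week left to finish the paper.")
received = decode(sent)
print(received)
```

The file is well-formed and the receiver raises no error, yet the reminder never arrives. Without feedback, the sender has no way of knowing.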
Because XML is intended to be machine-processed content, we can run some tests after
the XML has been received. First, we can test Well-Formedness with any XML parser.
Well-formedness is defined by the XML Specification.
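As a minimal illustration, the well-formedness test can be run with the parser in the Python standard library; the is_well_formed helper and the two sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = is_well_formed("<todo><task>Finish Balisage Paper.</task></todo>")
# The <task> element is never closed, so this is not well-formed.
bad = is_well_formed("<todo><task>Finish Balisage Paper.</todo>")
print(good, bad)
```

This test needs no agreement between Sender and Receiver, which is why it is the one check that tends to happen even in closed systems.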
The rest of the tests require some agreement between Sender and Receiver, either
explicitly ("I am going to send you this article in DocBook 5 format.") or implicitly,
where the XML file identifies itself. Either way, we can test for validity by processing
the file with the agreed-upon schema.
Next, content Sensibility can be checked with a content-application-level tool such as
Schematron or another application-specific checking tool like the PMC Stylechecker. For
example, it would be trivial to write a Schematron rule to check the schema-valid but
Nonsense XML in Fig. 4. If the application thought that a <height> element with
units but no value was not "correct", a test could be added for that; similarly, if
<catchwords> was not something you wanted to see in <name>, you could have a
test for that. But Sensibility checking at this level comes with a price, which is even
greater communication between Sender and Receiver.
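The two tests just described could look something like the sketch below. It uses plain Python instead of Schematron, and the sample fragment, the unit attribute name, and the sensibility_problems function are assumptions made for illustration, not a reproduction of Fig. 4.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: schema-valid in spirit, but not sensible.
SAMPLE = """<msDesc>
  <height unit="cm"/>
  <height unit="cm">21</height>
  <name>Syd <catchwords>Bauman</catchwords></name>
</msDesc>"""


def sensibility_problems(root):
    """Apply two application-level rules and return the problems found."""
    problems = []
    for height in root.iter("height"):
        # Rule 1: a unit with no measured value makes no sense.
        if height.get("unit") and not (height.text or "").strip():
            problems.append("height has a unit but no value")
    for name in root.iter("name"):
        # Rule 2: catchwords are not expected anywhere inside a name.
        if name.find(".//catchwords") is not None:
            problems.append("catchwords appears inside name")
    return problems


problems = sensibility_problems(ET.fromstring(SAMPLE))
print(problems)
```

Each rule encodes a tagging decision that Sender and Receiver had to agree on first, which is where the extra communication cost comes from.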
All of this Sender/Receiver communication is for one goal: to get the Sender's XML
model and the Receiver's XML model to be as closely aligned as possible. This works
pretty well in XML systems where content is transferred between entities, because
the
receiver is accustomed to running at least well-formedness and validity checks on
incoming content.
Closed Systems
In a closed XML system, the Sender and Receiver are the same entity. This greatly
simplifies the XML interchange model (see Figs 9 and 10).
In Fig. 10, things have gotten very simple with the Information Source
and Destination collapsed. And in some cases things are quite simple here. The danger
of
a closed XML system is that it gives a false sense of "All's Well." There are several
things that happen in closed systems. First, one entity controls both ends of the
pipe.
For example, one person can be responsible for tagging articles for a magazine and
building the rendering software that renders the articles on the web in HTML. When
a new
object appears in an article, the only requirement for the XML tagging is that it
works
in the renderer.
Generally in these systems, the only test that things are OK is that the XML is
working in the system; that is, in a system that was created to fit each twist and
turn
in the evolving XML model. If XML tools are used, then well-formedness tests come
along
for the ride, but validation against a schema (if there is one) is deemed unnecessary
and complicated. After all, "Our XML works".
In a closed system, "Garbage In" is OK, because you control both ends of the pipe,
know the garbage is coming through, and build something to deal with it when it comes
out the other end. Sometimes Garbage In is OK.
And if it works, that is great. The real price for this closed system won't come due
until you either have to send your XML to someone for reuse or reuse it yourself.
Switching to an XML Interchange workflow from a closed system can be humbling and
expensive. Any Destination that will be taking your content will expect it to pass
all
four levels of XML Rectitude, will actively check well-formedness and validity against
a
schema, and will seek information on tagging conventions (sensibility) expecting (or
at
least hoping for) some consistency in the tagging.
If the XML corpus has not been subjected to these checks throughout, there will be big
problems with reuse of the content by an outside entity.
Similarly, when everything about your XML system is in one geek's head, you will have
trouble maintaining the system, let alone changing it or upgrading it when that geek
moves on to greener pastures.
Standards are the answer?
Actually not really. Using a standard model like the TEI, DocBook, or the JATS will
get you schemas and some information on best practices for tagging, but there is no
forced validation. Also, in closed XML Systems, there is no penalty for Tag Abuse. That
is, if the standard schema does not have an element for an object, you can just tag your
content any way you like, because you control both ends of the pipe.
Requirements for XML Interchange
There are two requirements for any XML Interchange, and they both come from an
agreement between the Information Source and the Destination.
Validation - XML files must be well-formed and valid against an agreed-upon
schema.
Defined tagging practices - XML files will be tagged consistently in a manner
that makes sense.
These two requirements for interchange are the same ones you will need to run a
consistent, sane system over time.
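Even inside a closed system, a rough audit of the corpus can show how far it is from meeting these requirements. The sketch below is hypothetical throughout (the CORPUS data and the audit function are invented for illustration). It covers only well-formedness and a crude inventory of element names; real validation against the agreed-upon schema would need a validating parser, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical corpus: file name -> XML text.
CORPUS = {
    "a.xml": "<todo><reminder/><task>One</task></todo>",
    "b.xml": "<todo><first-reminder/><task>Two</task></todo>",
    "c.xml": "<todo><reminder/><task>Three</todo>",
}


def audit(corpus):
    """Count how many files use each element name; list unparseable files."""
    tag_files = Counter()
    not_well_formed = []
    for name, text in sorted(corpus.items()):
        try:
            root = ET.fromstring(text)
        except ET.ParseError:
            not_well_formed.append(name)
            continue
        for tag in {element.tag for element in root.iter()}:
            tag_files[tag] += 1
    return tag_files, not_well_formed


tag_files, not_well_formed = audit(CORPUS)
# Element names seen in only one file hint at inconsistent tagging.
rare = sorted(tag for tag, count in tag_files.items() if count == 1)
print(not_well_formed, rare)
```

Names that appear in only a handful of files are good candidates for a conversation about tagging practices before the content ever has to leave the system.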
If XML interchange is like a conversation, then a closed system is like listening to
the voices in your head.
[Bauman01] Bauman, Syd. (2010) "The 4 Levels of XML Rectitude", Balisage
2010, poster.
[Beck01] Beck, Jeff. “Report from the Field: PubMed Central, an
XML-based Archive of Life Sciences Journal Articles.” Presented at International
Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML,
Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on
XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on
Markup Technologies, vol. 6 (2010). https://doi.org/10.4242/BalisageVol6.Beck01. http://www.balisage.net/Proceedings/vol6/html/Beck01/BalisageVol6-Beck01.html