We live in a time of Tag Set extensions.
There was a time when organizations planning a conversion to XML, or planning to move a new document type to XML, assumed that the process would involve creating a tag set for that document type. The costs of creating that new tag set usually included an outside expert to create and document the tag set, internal subject experts to assist in document analysis, and programmers to customize the editing, database, and formatting tools to work with the new tag set.
Now, the assumption is that for any new XML application there is an existing public tag set that meets their needs, or meets them closely enough. Most organizations don’t consider a new bespoke tag set, and some consider the choice of public tag set so obvious that they don’t waste time exploring other options. Even among those who do explore their options, the default assumption seems to be that there is a model they can adopt to meet their needs. Many publishers with older bespoke tag sets have converted to a public one.
There are a lot of good reasons to adopt instead of developing from scratch. The most important are:
Cost to Develop and Document |
Vocabulary development in real-life complex domains is a multi-year multi-person project that requires the time and skills of subject matter experts as well as XML expertise. Costs include identifying not only the structures and types of information that are key to the expected data usage of this community, but also structures that are common in documents and needed for the applications and publications to be made from these documents. A group developing a subject-specialized vocabulary is likely to do a better job modeling aspects relating to the subject matter than modeling normal prose structures, partly because specialists are more interested in their own subject matter and partly because modeling common prose structures is likely to feel like a waste of time to them. We have seen subject matter experts sigh, turn on their phones, or even leave the room when a lively discussion of the metadata needed to identify the subject of one of their reports turned to a discussion of what types of lists they would need in the prose portions of the same documents. Also surprising are the costs and time required to document a vocabulary well enough that tagging and usage will be consistent. Adoption of a vocabulary out of the box enables a community to avoid all of these costs. Adoption and adaptation enable the community to spend its energy (and time and money) modeling only those structures that are unique to the community and to document only the new or revised structures. |
Cost of tool customization |
While it is possible to create XML documents using XML editing tools out of the box, and it is possible to store, search, and retrieve XML documents using an XML database as it is shipped, neither of these provides an attractive user experience, especially for people who are not very comfortable with the syntax of XML. There is a significant investment in customizing tools to work with XML documents. Some of these customizations are specific to the type of document, but many are specific to each element, element in context, or element with attribute value. Users of a new tag set can save a significant amount of time and money if they do not have to tell their editing tool which elements should be displayed to an editor as blocks and which as in-line, which are list items, and what text should be generated on display. Similarly, a lot of set-up time can be saved if they do not have to tell their database which elements contain non-textual material (such as TeX) and which should be considered higher value for search result ranking (perhaps titles and table column heads). |
Cost of formatting and display development |
Technically, it is also possible to format an XML document for human consumption without customizing the formatting software, but it is unlikely that the resulting documents will be recognizable or useful. One of the major advantages people hope for when adopting a vocabulary is to be able to use, or at least start from, existing formatting applications to make common display formats such as HTML and PDF. |
Availability of experienced staff and vendors |
It is far easier to work in an environment in which one can hire experienced staff and in which service vendors are familiar with your requirements. Of course, you could train all of your staff members from scratch, but that takes time and resources and significantly increases the loss when they leave. Similarly, if you develop an XML vocabulary from the bottom up, you will be able to find vendors to create, manage, and host your documents, but you will have to pay them to learn your vocabulary and needs, pay them to train their staff, and pay them to customize their tools and processes. If you adopt an existing vocabulary, you will only have to work with your staff and vendors on any variations you prefer and teach them about any customizations you have made. |
Pressure from tool vendors, service suppliers, and XML community |
XML is rarely created and used strictly in-house any longer. There are numerous partners who will be involved in creating and using it including: tagging vendors; publishing partners; and aggregators. Using a tag set that is familiar to these partners simplifies these relationships and may significantly reduce costs and errors because there is less need to explain the XML model and how it is used and less need for exception processing. (Many organizations choose a particular vocabulary because a particular vendor requires it or a particular tool creates or ingests it.) |
Adopt and Adapt
There are some situations in which users, and whole user sectors, can adopt an XML model and use it comfortably. However, in many cases, it is more accurate to describe the process as "Adopt and Adapt" than simply "Adopt".
Adoption
A user who has exactly the situation envisioned when a tag set was developed may well be able to simply use it. A user who wants to encode their system manuals in XML may find DocBook works well for them as published, and they will gain the added value of being able to use existing user interface layers on tools and formatting stylesheets.
Similarly, a user who wants to send their journal articles to an archive or document repository may be required to use JATS (ANSI/NISO Z39.96-2015), and may even be provided with guidelines that specify which of the JATS tag sets to use and how to use its optional features.
A user who wants to participate in an existing data interchange process may be required to use the tag set used by the existing participants regardless of comfort. For example, a user who wants to include their poster and pamphlet content in a publication locator service based on XML-tagged technical reports will have to find a way to tag those posters and pamphlets using the vocabulary used for the technical reports.
Adaptation
A community that wants to begin interchanging XML documents may find that there is no existing tag set and community of practice that exactly meets their needs. Some members of the community may be using XML, but if they have not worked together when developing their practices it is likely that they have different approaches. Even if the individual members have adopted public models, they may not have adopted the same public model.
The Standards Community is an example of a community that is currently working on developing a shared XML model for interchange of documents among the participants. For an excellent overview of this process, see "NISO STS Project Overview and Update" [Wheeler et al. 2016]. In this case, various members of the community already use DocBook-, DITA-, XHTML-, and JATS-based models, and at least one has done a TEI-based pilot project. None found that any of the public models met their needs out-of-the-box; all adapted the models they had adopted. This community is now working to create an interchange tag set that will serve all of their needs. They are starting with a tag set created by one of the participants (ISO) that was developed by adopting and adapting JATS [ISO 2016]. This process is, we believe, typical of the way shared tag sets are being developed now.
Public models have been developed and documented with the assumption that they will be adapted. NIEM describes itself as a "framework" and provides tools for "domains" to use to develop "information exchange packages" [NIEM 2016]. DITA includes the "Specialization" feature, which enables users to extend the tag set and use DITA processors that are unaware of the extension [Eberlein et al. 2010]. The Text Encoding Initiative Guidelines describe "clean" and "unclean" modifications and provide a tool for creating extended TEI-based vocabularies [TEI 2016]. JATS documents how the tag sets can be modified [NCBI 2015] and provides terminology to identify and distinguish between JATS-Based and JATS-Conforming extensions [ANSI/NISO 2015].
As users adopt (by choice or fiat) a customizable model and begin to adapt that model to meet their needs, they are faced with decisions that may have far-reaching consequences. It is not uncommon for users to come to regret customization decisions made early in the adaptation process. In some cases, there is considerable discussion of options, and a choice is made between what are known to be imperfect options. In other cases, however, the customizers do not even know that they are creating problems for themselves and their users down the line.
If your adapted tag set is for use in isolation, most of these guidelines are irrelevant to your project and usage. If you intend to craft or customize tools as needed and are unconcerned about how your adapted tag set will work with existing tools, others of these guidelines are irrelevant. If you are going to train all of the people who will create, manage, use, and archive your documents, others of these guidelines are irrelevant. If you and your documents are on a technologically isolated deserted island and expect to remain so, none of this matters to you; do what you want as you want.
Most tag set adapters want the documents that use their adopted/adapted tag set to play nicely with others. They want to be able to store their documents in databases alongside documents tagged with the source tag set or other adaptations of it and to be able to search them all as one coherent collection. They want to be able to use tools such as editors with customized user interfaces by adding only those features needed for the new structures in their documents. They want to be able to use formatting and display tools for the existing documents by adding handling for any new structures, if that. (With a DITA specialization, even that should be unnecessary.)
JATS Compatibility Guidelines
We, the authors of this paper, have been inspired by the ways in which JATS is being extended, and we are occasionally surprised by the problems that people who have adapted JATS have reported. We have been drafting a set of Guidelines [Usdin et al. 2016] for people extending JATS to help them understand which adaptations will integrate gracefully into existing JATS environments and how to tell if an adaptation might bite them later. To our surprise, this was not always obvious. Many types of adaptation that we initially assumed would be problematic seem to be fine, and a few types of changes that seem innocuous can create significant surprises at later stages of the document life cycle.
The principles articulated in this paper are based on the work done to develop the JATS Compatibility Guidelines [Usdin et al. 2016], and many of the examples are taken from JATS and the JATS Compatibility Guidelines. However, readers who intend to create a JATS-compatible tag set are referred to those Guidelines; this paper is not a substitute for them. We also hope that the JATS work and the thought that went into creating those Guidelines are more widely applicable.
Things that Must Match to Maintain Compatibility
Respect the Semantics
Starting from first principles, when using or extending a tag set, respect the semantics of the starting structures. This should be obvious, but an amazing number of XML users think that they are doing no harm by repurposing an element or attribute they would not use for the original purpose.
They don’t call it tag abuse, but that is what it is. Sometimes blatant, sometimes with a story justifying "bending" the meaning of a structure for convenience, tag abuse is rarely a good short term strategy and virtually always a bad long term strategy.
"Tag abuse" is using an element or attribute for content for which it was not intended. Tags are often abused when users are trying to control display. For example, it is common to use several empty <p> elements in HTML to produce some blank space on the screen. There are not several empty logical paragraphs in the document; this is tag abuse to achieve screen formatting. Similarly, using a block-quote element to emphasize instructions, making them stand out from the prose around them, may achieve an acceptable display at the cost of junking up searches for block-quotes and hiding the content from a search for instructions.
If you need to store the country in which some people live and you don’t use the phone number element for foreigners you could put their country names in the phone number element. We have seen this done. So, what happens when you start to validate phone numbers? Or when you decide that you can make phone calls across state lines and need a place to put the phone numbers for those people? Can your database list the countries for all authors? What about when a formatting engine inserts the usual punctuation for a phone number into those country names and displays them?
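The validation question can be made concrete with a minimal sketch. The element names (`author`, `name`, `phone`), the sample data, and the deliberately loose phone-number pattern are all hypothetical, chosen only for illustration; they are not taken from any particular tag set.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: the second author's <phone> holds a country name.
DOC = """
<authors>
  <author><name>A. Smith</name><phone>+1 301 555 0100</phone></author>
  <author><name>B. Jones</name><phone>Canada</phone></author>
</authors>
"""

# Assumed pattern: optional "+", a digit, then at least five digits or separators.
PHONE_PATTERN = re.compile(r"^\+?[0-9][0-9 ()./-]{5,}$")


def invalid_phones(xml_text):
    """Return the text of every <phone> element that does not look like a phone number."""
    root = ET.fromstring(xml_text)
    values = [(phone.text or "").strip() for phone in root.iter("phone")]
    return [value for value in values if not PHONE_PATTERN.match(value)]


print(invalid_phones(DOC))
```

As soon as a check like this is switched on, every repurposed phone number element becomes a validation failure, even though the documents were considered correct the day before.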
If your starting tag set has a tag called <state> for "state or province", do not create an attribute called @state with the possible values "solid", "liquid", "gas", or "plasma". Your "state" does not technically infringe on the original "state", but it will confuse people. Call your attribute @state-of-matter or some such.
Sometimes tag abuse happens from a coincidence of names, when a new user does not check the semantics and is misled by a homonym. "Oh," they think, "I need an element for what the witness said at the trial and there is a <statement> element", not noticing that <statement> is defined as a logical proof or hypothesis.
Use the Same Style of Nesting/Recursion for Sections
There are, generally speaking, three styles of modeling nested sections in XML:
- Recursive
- Nested with explicit levels
- Non-nested with explicit levels
In the recursive model, sections contain sections, which can contain sections, which can contain sections. Display styling of the section headers is based on analysis of the location of the section in the section hierarchy.
In the nested-with-explicit-levels model, sections level 1 may contain sections level 2 which may contain sections level 3, etc.
In the non-nested with explicit levels model, sections level 1 may be followed by sections level 2 which may be followed by sections level 3, but these may come in any order and are not nested.
The section logic is fundamental to complex prose documents, and mixing section logic in the same environment creates the opportunity for significant confusion. People, and software, can get very confused if it is not clear whether sections are nested or not, or whether the level of a section should be computed from the number of sections in which it is contained or derived from the name of the section. Worst of all is a model in which sections that have explicitly named levels are sometimes nested at other levels. (Yes, this does occur in real documents.)
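The two ways of deriving a section level can be sketched as follows. This is an illustration only: the element names (`sec` for the recursive style, `sec1`, `sec2`, and so on for the explicit-level styles) and the sample documents are assumptions, not taken from any specific tag set.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical naming convention: "sec" (recursive) or "sec1", "sec2", ... (explicit levels).
SECTION_TAG = re.compile(r"^sec(\d*)$")

RECURSIVE = "<body><sec><title>A</title><sec><title>B</title></sec></sec></body>"
MIXED = "<body><sec1><title>A</title><sec3><title>B</title></sec3></sec1></body>"


def levels_by_depth(xml_text):
    """Level = number of section elements that enclose the section, plus one."""
    found = []

    def walk(elem, depth):
        if SECTION_TAG.match(elem.tag):
            depth += 1
            found.append((elem.findtext("title"), depth))
        for child in elem:
            walk(child, depth)

    walk(ET.fromstring(xml_text), 0)
    return found


def levels_by_name(xml_text):
    """Level = the number that is part of the element name."""
    found = []
    for elem in ET.fromstring(xml_text).iter():
        match = SECTION_TAG.match(elem.tag)
        if match and match.group(1):
            found.append((elem.findtext("title"), int(match.group(1))))
    return found


print(levels_by_depth(RECURSIVE))
print(levels_by_depth(MIXED), levels_by_name(MIXED))
```

For the `MIXED` sample, where a level 3 section sits directly inside a level 1 section, the two functions disagree about section B. A formatter written for one style will pick the wrong heading style when handed documents in the other.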
Maintain Distinction Between Elements and Attributes
In the XML world there are people who argue that the distinction between elements and attributes is arbitrary and that, since it is easy to transform one to the other using XSLT, vocabulary developers should feel free to use either at any time for any purpose. This may be so if the vocabulary is being developed in a vacuum, but if a new or modified vocabulary is intended to interoperate with another vocabulary, this is very much not so! While attributes are often used to control display, and their values may be used either to prompt selection of generated text or be displayed, their use in display is significantly different from element content. Similarly, while there are times when element content is not displayed, the default in most (text-based) applications is that element content is displayed to the reader. In most databases, attribute values are indexed, searched, and displayed differently from element content. Also, in most XML editing systems, attribute values are entered and displayed differently from element content.
If content in the source vocabulary is element content, keep it as element content. If it is attribute content, keep it as attribute content. If there is a need in a new vocabulary to change the form of content in a source vocabulary from element to attribute or vice versa, we recommend using a different name for the new structure and documenting its relationship to the content in the source vocabulary.
Whitespace Handling
In XML, some whitespace is significant and some is insignificant. How whitespace is handled has serious impact on the ability to re-use tools among documents in a heterogeneous collection. If elements in a tag set extension do not have the same whitespace handling properties as the display tools were developed to expect, there will be unfortunate (and in some cases surprising) effects on the display of the document content.
Three whitespace handling types are listed below. A compatible tag set extension must not change the whitespace handling type for any existing element.
Element-like whitespace
Content models that contain only elements (no characters) have insignificant whitespace. That is, XML tools may create or destroy whitespace in these models with, by definition, no effect on the document, how it is handled, or how it is displayed.
Data-like whitespace
Content models that contain character data or mixed content contain significant whitespace. That is, XML tools may fold the whitespace (collapse multiple whitespace characters into a single space character), but they may not create or destroy any whitespace nodes.
Preserved whitespace
Content models defined as preserve whitespace are character or mixed content models where the whitespace nodes must not be folded. Each whitespace character in the XML must be preserved. Usually this is used for alignment of code or other preformatted content.
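A minimal sketch of the three handling types, applied to a single text node, may help show why changing the type of an existing element is disruptive. The function names and the type labels used as string keys are assumptions made for this illustration.

```python
import re


def fold_whitespace(text):
    """Collapse each run of whitespace characters into a single space."""
    return re.sub(r"\s+", " ", text)


def handle_text(text, handling):
    """Return the text a tool may pass on, given the whitespace handling type."""
    if handling == "element-like":
        # Element-only content: whitespace between child elements is insignificant.
        if text.strip():
            raise ValueError("character data is not allowed in element-only content")
        return ""
    if handling == "data-like":
        # Significant whitespace that may be folded but not created or destroyed.
        return fold_whitespace(text)
    if handling == "preserved":
        # Every whitespace character must survive unchanged.
        return text
    raise ValueError(f"unknown whitespace handling type: {handling}")


code_sample = "if x:\n    y = 1"
print(repr(handle_text(code_sample, "preserved")))
print(repr(handle_text(code_sample, "data-like")))
```

If an extension changed a code-like element from preserved to data-like handling, the line break and indentation in `code_sample` would be folded away, which is exactly the kind of surprising display effect described above.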
ID, IDREF, and IDREFS
Rendering and behavior, especially the rendering and behavior of links, are often dependent on the ID/IDREF relationship. If an attribute that has a type of ID in the source vocabulary is changed to any other type, rendering tools may not process the links appropriately.
Changing from IDREF to IDREFS or vice versa is not a concern. The number of pointers will not affect compatibility. Changing the direction of the pointer or obscuring the pointer is the concern here.
We have actually seen one instance in which a user reversed the uses of IDs and IDREFs, creating documents that looked similar to those in the source vocabulary. The result was chaotic; it turned out that the XSLT that created the HTML version of these documents relied on the ID/IDREF mechanism MOST of the time, but occasionally simply treated the attribute values as values. So, SOME of the links worked as expected and some did not. (On further thought, this is as much the fault of an inconsistent transformation as a surprising document; all of these links should probably have failed!)
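The kind of dependency that link rendering has on the ID/IDREF relationship can be sketched as a simple consistency check. The attribute names `id` and `rid` are modeled on JATS usage, but the sample document and the function are invented for illustration. Because ElementTree does not read attribute types from a DTD, the names of the ID and IDREF(S) attributes are passed in explicitly as assumptions.

```python
import xml.etree.ElementTree as ET

# Invented sample: one reference resolves, one does not.
DOC = """
<article>
  <p>See <xref rid="fig1"/> and <xref rid="tab9"/>.</p>
  <fig id="fig1"/>
</article>
"""


def dangling_references(xml_text, id_attr="id", idref_attr="rid"):
    """Return every reference value that does not match an ID in the document."""
    root = ET.fromstring(xml_text)
    ids = {e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None}
    dangling = []
    for elem in root.iter():
        # An IDREFS value is a space-separated list; a single IDREF is the one-item case.
        for target in (elem.get(idref_attr) or "").split():
            if target not in ids:
                dangling.append(target)
    return dangling


print(dangling_references(DOC))
```

Note that the check treats IDREF and IDREFS alike, which is consistent with the observation that the number of pointers does not matter. If the roles of the two attributes were reversed in an adapted tag set, a check like this would report nearly every link as broken.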
Alternatives or Media-specific Content
In the world of prose documents, it is assumed that the reader should have access to all content. However, there are situations in which that is not the case. For example, it is common to provide several versions of the same graphical object: one for high resolution or full-screen display, one for display on small devices such as hand-helds, a thumbnail for navigation, and perhaps a very high resolution or black & white version for print. In counting the number of figures in a document, this figure should be counted once — not as many times as there are media- or use-specific versions — and only the most appropriate for the display media should be rendered. Similarly, it is becoming common for journals to publish author names both in the language and script of the journal and in the language and script of the author’s home environment. This person should be counted only once in specifying the number of authors of the paper and, more importantly, this paper should only count once when calculating the author’s influence.
Any structure in the original vocabulary that is provided to wrap two or more alternative structures must be used in the same way in all compatible vocabularies.
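The following sketch shows how a tool might count and render figures when alternatives are wrapped. The `alternatives` wrapper name echoes JATS, but the `use` attribute, the file names, and the fallback rule (take the first alternative when none matches the medium) are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: figure f1 has three media-specific versions, f2 has one graphic.
DOC = """
<fig-group>
  <fig id="f1">
    <alternatives>
      <graphic href="f1-print.tif" use="print"/>
      <graphic href="f1-web.png" use="web"/>
      <graphic href="f1-thumb.png" use="thumbnail"/>
    </alternatives>
  </fig>
  <fig id="f2"><graphic href="f2.png" use="web"/></fig>
</fig-group>
"""


def count_figures(root):
    """Count logical figures, not media-specific versions."""
    return len(root.findall(".//fig"))


def graphics_to_render(root, medium):
    """Pick one graphic per figure: the one for the medium, else the first available."""
    chosen = []
    for fig in root.iter("fig"):
        wrapper = fig.find("alternatives")
        candidates = list(wrapper) if wrapper is not None else fig.findall("graphic")
        matching = [g for g in candidates if g.get("use") == medium]
        pick = matching[0] if matching else (candidates[0] if candidates else None)
        if pick is not None:
            chosen.append(pick.get("href"))
    return chosen


root = ET.fromstring(DOC)
print(count_figures(root), len(root.findall(".//graphic")))
print(graphics_to_render(root, "web"))
```

A tool that counted `graphic` elements instead of figures would report four figures rather than two. A compatible vocabulary that used the wrapper for some other purpose would break both the count and the selection.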
Things that Don’t Seem to Matter in Compatible Modeling
In drafting the JATS Compatibility Meta-Model Description [Usdin et al. 2016] we considered quite a few areas of conformance that, on further examination, proved to be unnecessary to create document models that were compatible for our purposes. There are recognizable, classifiable distinctions that just turn out not to matter for these purposes.
EMPTY Elements versus Content-containing Ones
One obvious element differentiator was EMPTY elements versus those with #PCDATA, element, or mixed content. Element content is indeed unique, but data characters, mixed content, and EMPTY are all the same, since characters are, by definition, optional in XML. An element with a #PCDATA model or mixed content may have nothing in it, and will look the same as an EMPTY element in the document. Thus, the following categories are uninteresting in this context:
Structures that contain character data only |
Elements that may not have internal markup. In many tag sets, Date may not have internal markup. |
Structures that contain character data and phrase-like structures |
Paragraph is often allowed to contain character data and phrase-like structures such as Italic, Place Name, or Cross Reference, but not allowed to contain larger nesting structures such as lists and figures. |
Structures that contain character data, phrase-like structures, and block-level objects |
In some tag sets there are structures that may contain character data, phrases, and block-like structures. For example, paragraphs may be allowed to contain lists, boxed text, display equations, block quotes, tables, or figures. |
Has Metadata
Some structures (whole documents, authors, boxed-text, appendices) may have metadata, and there are other structures that are unlikely to have metadata (italic, break, address-line). However, on analysis, we found that there are circumstances in which almost any structure could have metadata (at least an ID or IDREF that associates this structure with others), and that this does not affect interoperability as we were looking at it.
Is Metadata
In many tag sets, some elements are only used in the metadata of a document (journal in which published) while others are only used in the narrative text (figure). But in most tag sets there are many elements that can be used both to describe the document in which they occur and to describe other documents (copyright, digital identifiers, publication date), so this distinction is not just unimportant, it often changes over time.
Sections and Section-like Structures
It seemed intuitively obvious that an element that had the section structure in one vocabulary should have a section structure in a compatible vocabulary: for example, that if a Boxed-text could contain not only paragraph-like structures but also nested headed sections in a source vocabulary, it should be able to do so in any compatible vocabulary. But since those nested sections are, or could be, optional in the source vocabulary, documents without them can clearly be handled by the tools and formatter, because we believe that a subset of an element model is always conforming. Thus, it is not necessary that compatible vocabularies allow nested sections in all of the places that the source vocabulary does.
Conversely, we considered requiring that nested sections be allowed only in the places where they are allowed in the source vocabulary, and found that this, too, is not a requirement. If a tool or format is data driven (what in XSLT-speak is called push-processed), it should be able to accommodate sections in new locations, as long as they have the same style of sections as those already present in the vocabulary.
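Why a data-driven tool tolerates sections in new locations can be shown with a small sketch of push-style processing, in which each element is dispatched to a handler by name and unknown elements simply pass their children along. The element names, the handler table, and the HTML-like output are assumptions made for illustration; this is not how any particular formatter is implemented.

```python
import xml.etree.ElementTree as ET


def render(elem, depth=0):
    """Dispatch on the element name; unknown elements fall through to their children."""
    handler = HANDLERS.get(elem.tag, render_children)
    return handler(elem, depth)


def render_children(elem, depth):
    return "".join(render(child, depth) for child in elem)


def render_sec(elem, depth):
    # Recursive section style: entering a section increases the heading level.
    return render_children(elem, depth + 1)


def render_title(elem, depth):
    return f"<h{depth}>{elem.text or ''}</h{depth}>"


def render_p(elem, depth):
    return f"<p>{elem.text or ''}</p>"


HANDLERS = {"sec": render_sec, "title": render_title, "p": render_p}

# Hypothetical document: a section appears inside boxed-text, which has no handler.
DOC = (
    "<body><sec><title>Intro</title>"
    "<boxed-text><sec><title>Note</title><p>Hi</p></sec></boxed-text>"
    "</sec></body>"
)
print(render(ET.fromstring(DOC)))
```

The renderer was never told that a section can occur inside `boxed-text`, yet the nested section is rendered with the correct heading level because the section handler depends only on the section style, not on where the section occurs.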
Role in the Document
Structures can easily be grouped by their role in a document, and it is tempting to think that structures must play the same role in all document types in order to be compatible. We found that this is not so, and that while it might be interesting to group structures by their roles in documents, these roles do not seem to affect interoperability:
Paragraph-like |
Elements that may be used at the same structural level as a Paragraph ( |
Preformat-like |
Elements that have the Preserve whitespace model, which is often used for Code and sometimes for poetry. |
Emphasis-like with Toggle |
Inline elements that may be toggled on and off with recursion. In some tag sets, Italic toggles. That is, if an Italic tagged phrase appears in a context that would be displayed in italic anyway, the Italic tagged phrase is NOT displayed in italics to retain the typographic emphasis. |
Emphasis-like without Toggle |
Inline elements that do not toggle on and off with recursion. Some structures must be displayed as tagged even if the context they are in would have that display. For example, Sans Serif often does not toggle. |
Bibliographic Identifier-like |
Structures that identify the document, such as ISSNs, ISBNs, author names, or volume and issue numbers |
Grouping-structures |
Structures that contain several related structures but that have no formatting consequences themselves. For example, Article Metadata may be grouped separately from Issue Metadata, and Keywords may be grouped into a Keyword Group |
Footnote-like |
Structures that are generally displayed as footnotes may include Footnote, Author Note, Funding Source, and Corresponding Author Address. |
Milestone-like |
Elements that are used to identify locations in the document or that are used in pairs to indicate the start and end of some portion of a document, typically that cannot be simply wrapped in an element because of overlap problems. Milestones may be Revision Start and Revision End, or simply Pull Quote. |
Structures that can have labels and/or titles |
Many, but not all, block type structures can have labels and/or titles. For example, Block Quotes, Boxed Text, Sections, Bibliographies, Lists, and Figures can have labels and titles in many tag sets. |
|
Elements that mark a location in the document or that may have attributes but no element content. |
Structures that have accessibility data |
The ability to provide alternate text or long descriptions may be available for Figures, Graphics, Equations, Tables, and a variety of other structures. |
Structures that have attribution and/or permissions or licensing data |
Structures such as Articles, Boxes, Sections, Tables, and Appendices may have information about who wrote them or who may use them and under what conditions. |
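The difference between the two Emphasis-like roles in the list above can be sketched as a small decision function. The function name and parameters are hypothetical, and the rule is only one plausible reading of the toggling behavior described for Italic.

```python
def is_italic(context_italic, nested_italic_count, toggles=True):
    """Decide whether text inside nested Italic elements is displayed in italics.

    context_italic: True if the surrounding context is already set in italics.
    nested_italic_count: number of Italic elements enclosing the text.
    toggles: True for the toggling style, False for the non-toggling style.
    """
    if toggles:
        # Each enclosing Italic element flips the current state.
        return context_italic != (nested_italic_count % 2 == 1)
    # Non-toggling: tagged content is always displayed as tagged.
    return context_italic or nested_italic_count > 0


print(is_italic(context_italic=True, nested_italic_count=1))
print(is_italic(context_italic=True, nested_italic_count=1, toggles=False))
```

With toggling, an Italic phrase inside an italic context is displayed upright so that the emphasis remains visible. Without toggling, the phrase stays italic regardless of its context.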
Attribute Value Types (other than ID and IDREF)
Even in a DTD, it is possible to type attribute values, and in XSD and RNG attribute value types can be quite strongly specified. We know (see above) that it is critical that attributes of type ID remain of type ID and that attributes of type IDREF or IDREFS remain of type IDREF or IDREFS in order for documents to be compatible. However, that leaves many other attribute types.
Some processing may be tied to specific values of attributes, and if none of the expected values are present the processing may fail. For example, if a formatter renders <styled-content view="GrIt"> as green italic and that value is not present, the formatter will not render the content in green italic. However, we see no disruption from:
- Adding or removing items from a specified value list
- Changing a CDATA attribute to one with a specified value list, or vice versa
- Changing a NMTOKEN or NMTOKENS attribute to CDATA or vice versa
- Changing the value of a #FIXED attribute or changing a #FIXED attribute to CDATA or a specified value list
We came to the conclusion that most attribute typing is useful in the creation of correct documents as specified by the content creator, but is not essential to the storage, management, or rendering of the documents.
Conclusions
The first public draft of the JATS Compatibility Meta-Model Description [Usdin et al. 2016] was released in July 2016. We anticipate that the assumptions we have made in this work will be tested through the process of public review and comment. We hope that we will be prompted to improve the content of the guidelines to make them more effective and to improve the descriptions of them to make them clearer and easier to implement.
Although some of the compatibility principles we describe, such as whitespace handling, ID/IDREF consistency, and maintaining the meaning of object names, are applicable for testing tag set compatibility in general, we were working specifically on compatibility of extensions to ANSI/NISO Z39.96-2015 JATS.
We welcome your comments on this conference paper and, more importantly, on the document at the NISO site.
References
[Wheeler et al. 2016] Wheeler, Robert, Bruce Rosenblum, and Lesley West. 2016. "NISO STS Project Overview and Update." In Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016. Bethesda (MD): National Center for Biotechnology Information (US). http://www.ncbi.nlm.nih.gov/books/NBK350146/.
[ISO 2016] International Organization for Standardization (ISO). 2016. "Welcome to the ISO Standards Tag Set (ISOSTS)." Accessed April 19. http://www.iso.org/schema/isosts/.
[NIEM 2016] National Information Exchange Model (NIEM). 2016. NIEM. Accessed April 19. https://www.niem.gov/.
[Eberlein et al. 2010] Eberlein, Kristen James, Robert D. Anderson, and Gershon Joseph, eds. December 2010. Darwin Information Typing Architecture (DITA) Version 1.2. Organization for the Advancement of Structured Information Standards (OASIS) Standard. http://docs.oasis-open.org/dita/v1.2/os/spec/DITA1.2-spec.html.
[TEI 2016] Text Encoding Initiative (TEI). 2016. "Personalization and Customization." In P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.0.0. Last modified March 29, revision 89ba24e. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/USE.html#MD.
[NCBI 2015] National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM). 2015. "Modifying This Tag Set." Journal Archiving and Interchange Tag Library NISO JATS Version 1.1 (ANSI/NISO Z39.96-2015). Last modified December. http://jats.nlm.nih.gov/archiving/tag-library/1.1/chapter/implementor.html.
[ANSI/NISO 2015] American National Standards Institute/National Information Standards Organization (ANSI/NISO). 2015. ANSI/NISO Z39.96-2015, JATS: Journal Article Tag Suite, version 1.1. Baltimore: National Information Standards Organization. http://www.niso.org/apps/group_public/download.php/15933/z39_96-2015.pdf.
[Usdin et al. 2016] Usdin, B. Tommie, Deborah A. Lapeyre, Laura Randall, and Jeffrey Beck. 2016. JATS Compatibility Meta-Model Description. Draft Version 0.7. 32 p. http://www.niso.org/apps/group_public/document.php?document_id=16764&wg_abbrev=jats-sc.