Graceful Tag Set Extension

B. Tommie Usdin; Deborah A. Lapeyre; Laura Randall; Jeffrey Beck

Abstract

Tag Sets, or XML Vocabularies, are often created from other Tag Sets or Vocabularies. Users expect significant efficiencies from using derived vocabularies, including the ability to intermingle the documents in databases, to use tools created for the original Tag Set with minimal additional work, and to adopt rendering/formatting applications and change only those aspects specific to the new vocabulary. Some model changes create compatible documents, which can interoperate gracefully with documents tagged to the source specification. Some model changes are disruptive. We discuss what types of changes can be integrated into existing XML environments and which may be disruptive.

We live in a time of Tag Set extensions.

There was a time when organizations planning a conversion to XML, or planning to move a new document type to XML, assumed that the process would involve creating a tag set for that document type. The costs of creating that new tag set usually included an outside expert to create and document the tag set, internal subject experts to assist in document analysis, and programmers to customize the editing, database, and formatting tools to work with the new tag set.

Now, the assumption is that for any new XML application there is an existing public tag set that meets their needs, or meets them closely enough. Most organizations don’t consider a new bespoke tag set, and some consider the choice of public tag set so obvious that they don’t waste time exploring other options. Even among those who do explore their options, the default assumption seems to be that there is a model they can adopt to meet their needs. Many publishers with older bespoke tag sets have converted to a public one.

There are a lot of good reasons to adopt instead of developing from scratch. The most important are:

Cost to Develop and Document	Vocabulary development in real-life complex domains is a multi-year multi-person project that requires the time and skills of subject matter experts as well as XML expertise. Costs include identifying not only the structures and types of information that are key to the expected data usage of this community, but also structures that are common in documents and needed for the applications and publications to be made from these documents. A group developing a subject-specialized vocabulary is likely to do a better job modeling aspects relating to the subject matter than to normal prose structures, partly because specialists are more interested in their own subject matter and partly because modeling common prose structures is likely to feel like a waste of time to them. We have seen subject matter experts sigh, turn on their phones, or even leave the room when a lively discussion of the metadata needed to identify the subject of one of their reports turned to a discussion of what types of lists they would need in the prose portions of the same documents. Also surprising are the cost and time required to document a vocabulary well enough that tagging and usage will be consistent. Adoption of a vocabulary out of the box enables a community to avoid all of these costs. Adoption and adaptation enable the community to spend its energy (and time and money) modeling only those structures that are unique to the community and to document only the new or revised structures.
Cost of tool customization	While it is possible to create XML documents using XML editing tools out of the box, and it is possible to store, search, and retrieve XML documents using an XML database as it is shipped, neither of these provides an attractive user experience, especially for people who are not very comfortable with the syntax of XML. There is a significant investment in customizing tools to work with XML documents. Some of these customizations are specific to the type of document, but many are specific to each element, element in context, or element with attribute value. Users of a new tag set can save a significant amount of time and money if they do not have to tell their editing tool which elements should be displayed to an editor as blocks and which as in-line, which are list items, and what text should be generated on display. Similarly, if they do not have to tell their database which elements contain non-textual material (such as TeX) and which should be considered higher value for search result ranking (perhaps titles and table column heads), a lot of set-up time can be saved.
Cost of formatting and display development	Technically, it is also possible to format an XML document for human consumption without customizing the formatting software, but it is unlikely that the resulting documents will be recognizable or useful. One of the major advantages people hope for when adopting a vocabulary is to be able to use, or at least start from, existing formatting applications to make common display formats such as HTML and PDF.
Availability of experienced staff and vendors	It is far easier to work in an environment in which one can hire experienced staff and in which service vendors are familiar with your requirements. Of course, you could train all of your staff members from scratch, but that takes time and resources and significantly increases the loss when they leave. Similarly, if you develop an XML vocabulary from the bottom up, you will be able to find vendors to create, manage, and host your documents, but you will have to pay them to learn your vocabulary and needs, pay them to train their staff, and pay them to customize their tools and processes. If you adopt an existing vocabulary, you will only have to work with your staff and vendors on any variations you prefer and teach them about any customizations you have made.
Pressure from tool vendors, service suppliers, and XML community	XML is rarely created and used strictly in-house any longer. There are numerous partners who will be involved in creating and using it including: tagging vendors; publishing partners; and aggregators. Using a tag set that is familiar to these partners simplifies these relationships and may significantly reduce costs and errors because there is less need to explain the XML model and how it is used and less need for exception processing. (Many organizations choose a particular vocabulary because a particular vendor requires it or a particular tool creates or ingests it.)

Adopt and Adapt

There are some situations in which users, and whole user sectors, can adopt an XML model and use it comfortably. However, in many cases, it is more accurate to describe the process as Adopt and Adapt than simply Adopt.

Adoption

A user who has exactly the situation envisioned when a tag set was developed may well be able to simply use it. A user who wants to encode their system manuals in XML may find DocBook works well for them as published, and they will gain the added value of being able to use existing user interface layers on tools and formatting stylesheets.

Similarly, a user who wants to send their journal articles to an archive or document repository may be required to use JATS (ANSI/NISO Z39.96-2015), and may even be provided with guidelines that specify which of the JATS tag sets to use and how to use its optional features.

A user who wants to participate in an existing data interchange process may be required to use the tag set used by the existing participants regardless of comfort. For example, a user who wants to include their poster and pamphlet content in a publication locator service based on XML-tagged technical reports will have to find a way to tag those posters and pamphlets using the vocabulary used for the technical reports.

Adaptation

A community that wants to begin interchanging XML documents may find that there is no existing tag set and community of practice that exactly meets their needs. Some members of the community may be using XML, but if they have not worked together when developing their practices it is likely that they have different approaches. Even if the individual members have adopted public models, they may not have adopted the same public model.

The Standards Community is an example of a community that is currently working on developing a shared XML model for interchange of documents among the participants. For an excellent overview of this process, see NISO STS Project Overview and Update [Wheeler et al. 2016]. In this case, various members of the community already use DocBook-, DITA-, XHTML-, and JATS-based models, and at least one has done a TEI-based pilot project. None found any of the public models met their needs out-of-the-box; all adapted the models they had adopted. This community is now working to create an interchange tag set that will serve all of their needs. They are starting with a tag set created by one of the participants (ISO) that was developed by adopting and adapting JATS [ISO 2016]. This process is, we believe, typical of the way shared tag sets are being developed now.

Public models have been developed and documented with the assumption that they will be adapted. NIEM describes itself as a framework and provides tools for domains to use to develop information exchange packages [NIEM 2016]. DITA includes the Specialization feature, which enables users to extend the tag set and use DITA processors that are unaware of the extension [Eberlein et al. 2010]. The Text Encoding Initiative Guidelines describe clean and unclean modifications and provide a tool for creating extended TEI-based vocabularies [TEI 2016]. JATS documents how the tag sets can be modified [NCBI 2015] and provides terminology to identify and distinguish between JATS-Based and JATS-Conforming extensions [ANSI/NISO 2015].

As users adopt (by choice or fiat) a customizable model and begin to adapt that model to meet their needs, they are faced with decisions that may have far-reaching consequences. It is not uncommon for users to come to regret customization decisions made early in the adaptation process. In some cases, there is considerable discussion of options, and a choice is made between what are known to be imperfect options. In other cases, however, the customizers do not even know that they are creating problems for themselves and their users down the line.

If your adapted tag set is for use in isolation, most of these guidelines are irrelevant to your project and usage. If you intend to craft or customize tools as needed and are unconcerned about how your adapted tag set will work with existing tools, others of these guidelines are irrelevant. If you are going to train all of the people who will create, manage, use, and archive your documents, others of these guidelines are irrelevant. If you and your documents are on a technologically isolated deserted island and expect to remain so, none of this matters to you; do what you want as you want.

Most tag set adapters want the documents that use their adopted/adapted tag set to play nicely with others. They want to be able to store their documents in databases alongside documents tagged with the source tag set or other adaptations of it and to be able to search them all as one coherent collection. They want to be able to use tools such as editors with customized user interfaces by adding only those features needed for the new structures in their documents. They want to be able to use formatting and display tools for the existing documents by adding handling for any new structures, if that. (With a DITA specialization, even that should be unnecessary.)

JATS Compatibility Guidelines

We, the authors of this paper, have been inspired by the ways in which JATS is being extended, and we are occasionally surprised by problems people who have adapted JATS have reported. We have been drafting a set of Guidelines [Usdin et al. 2016] for people extending JATS to help them understand which adaptations will integrate gracefully into existing JATS environments and how to tell if an adaptation might bite them later. To our surprise, this was not always obvious. Many types of adaptation that we initially assumed would be problematic seem to be fine, and a few types of changes that seem innocuous can create significant surprises at later stages of the document life cycle.

The principles articulated in this paper are based on the work done to develop the JATS Compatibility Guidelines [Usdin et al. 2016], and many of the examples are taken from JATS and the JATS Compatibility Guidelines. However, readers who intend to create a JATS-compatible tag set are referred to those Guidelines; this paper is not a substitute for them. We also hope that the JATS work and the thought that went into creating those Guidelines are more widely applicable.

Things that Must Match to Maintain Compatibility

Respect the Semantics

Starting from first principles, when using or extending a tag set, respect the semantics of the starting structures. This should be obvious, but an amazing number of XML users think that they are doing no harm by repurposing an element or attribute they would not use for the original purpose.

They don’t call it tag abuse, but that is what it is. Sometimes blatant, sometimes with a story justifying bending the meaning of a structure for convenience, tag abuse is rarely a good short term strategy and virtually always a bad long term strategy.

Tag abuse is using an element or attribute for content for which it was not intended. Tags are often abused when users are trying to control display. For example, it is common to use several empty <p> elements in HTML to produce some blank space on the screen. There are not several empty logical paragraphs in the document; this is tag abuse to achieve screen formatting. Similarly, using a block-quote element to emphasize instructions, making them stand out from the prose around them, may achieve an acceptable display at the cost of junking up searches for block-quotes and hiding the content from a search for instructions.

If you need to store the country in which some people live and you don’t use the phone number element for foreigners you could put their country names in the phone number element. We have seen this done. So, what happens when you start to validate phone numbers? Or when you decide that you can make phone calls across state lines and need a place to put the phone numbers for those people? Can your database list the countries for all authors? What about when a formatting engine inserts the usual punctuation for a phone number into those country names and displays them?

If your starting tag set has a tag called <state> for state or province do not create an attribute called @state with the possible values solid, liquid, gas, or plasma. Your state does not technically infringe on the original state, but it will confuse people. Call your attribute @state-of-matter or some such.

Sometimes tag abuse happens from a coincidence of names, when a new user does not check the semantics and is misled by a homonym. Oh, they think, I need an element for what the witness said at the trial and there is a <statement> element, not noticing that <statement> is defined as a logical proof or hypothesis.

Use the Same Style of Nesting/Recursion for Sections

There are, generally speaking, three styles of modeling nested sections in XML:

Recursive
Nested with explicit levels
Non-nested with explicit levels

In the recursive model, sections contain sections, which can contain sections, which can contain sections. Display styling of the section headers is based on analysis of the location of the section in the section hierarchy.

In the nested-with-explicit-levels model, sections level 1 may contain sections level 2 which may contain sections level 3, etc.

In the non-nested with explicit levels model, sections level 1 may be followed by sections level 2 which may be followed by sections level 3, but these may come in any order and are not nested.

The section logic is fundamental to complex prose documents, and mixing section logic in the same environment creates the opportunity for significant confusion. People, and software, can get very confused if it is not clear whether sections are nested or not, or whether the level of nesting should be computed from the level of sections in which a section is contained or derived from the name of the section. Worst of all is a model in which sections that have explicitly named levels are sometimes nested at other levels. (Yes, this does occur in real documents.)
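The difference between the two ways of deriving a section level can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names `sec`, `sec1`, `sec2`, `sec3`, `title`, and `body` are hypothetical and are not taken from any particular tag set.

```python
import re
import xml.etree.ElementTree as ET

def recursive_levels(root):
    """Recursive style: the level is computed from how deeply <sec> is nested."""
    def walk(elem, depth):
        for child in elem:
            if child.tag == "sec":
                yield (child.findtext("title"), depth + 1)
                yield from walk(child, depth + 1)
            else:
                yield from walk(child, depth)
    return list(walk(root, 0))

def explicit_levels(root):
    """Explicit style: the level is read from the element name (sec1, sec2, ...)."""
    levels = []
    for elem in root.iter():
        match = re.fullmatch(r"sec(\d+)", elem.tag)
        if match:
            levels.append((elem.findtext("title"), int(match.group(1))))
    return levels

def mismatched_levels(root):
    """Titles of explicitly named sections whose name disagrees with their nesting depth."""
    mismatches = []
    def walk(elem, depth):
        for child in elem:
            match = re.fullmatch(r"sec(\d+)", child.tag)
            if match:
                if int(match.group(1)) != depth + 1:
                    mismatches.append(child.findtext("title"))
                walk(child, depth + 1)
            else:
                walk(child, depth)
    walk(root, 0)
    return mismatches

recursive_doc = ET.fromstring(
    "<body><sec><title>A</title><sec><title>B</title></sec></sec>"
    "<sec><title>C</title></sec></body>"
)
flat_doc = ET.fromstring(
    "<body><sec1><title>A</title></sec1><sec2><title>B</title></sec2>"
    "<sec1><title>C</title></sec1></body>"
)
# A level-3 section nested directly inside a level-1 section.
mixed_doc = ET.fromstring(
    "<body><sec1><title>A</title><sec3><title>B</title></sec3></sec1></body>"
)
```

A tool written around `recursive_levels` finds no sections at all in `flat_doc`, and a tool written around `explicit_levels` finds none in `recursive_doc`; `mismatched_levels(mixed_doc)` flags the section for which the two kinds of logic would give different answers.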

Maintain Distinction Between Elements and Attributes

In the XML world there are people who argue that the distinction between elements and attributes is arbitrary and that, since it is easy to transform one to the other using XSLT, vocabulary developers should feel free to use either at any time for any purpose. This may be so if the vocabulary is being developed in a vacuum, but if a new or modified vocabulary is intended to interoperate with another vocabulary, this is very much not so! While attributes are often used to control display, and their values may be used either to prompt selection of generated text or be displayed, their use in display is significantly different from element content. Similarly, while there are times when element content is not displayed, the default in most (text-based) applications is that element content is displayed to the reader. In most databases, attribute values are indexed, searched, and displayed differently from element content. Also, in most XML editing systems, attribute values are entered and displayed differently from element content.

If content in the source vocabulary is element content, keep it as element content. If it is attribute content, keep it as attribute content. If there is a need in a new vocabulary to change the form of content in a source vocabulary from element to attribute or vice versa, we recommend using a different name for the new structure and documenting its relationship to the content in the source vocabulary.
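As a small illustration of why the distinction matters to tools, the Python sketch below extracts the text a simple display or indexing routine would see. The `fig`, `label`, and `caption` names, and the idea of moving the label into a `label` attribute, are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Source vocabulary: the label is element content.
as_element = ET.fromstring(
    "<fig><label>Figure 1</label><caption>Growth curve</caption></fig>"
)
# Hypothetical extension: the same information moved into an attribute.
as_attribute = ET.fromstring(
    '<fig label="Figure 1"><caption>Growth curve</caption></fig>'
)

def displayed_text(elem):
    """Collect element content only, as a default text renderer or indexer might."""
    return " ".join(t.strip() for t in elem.itertext() if t.strip())

print(displayed_text(as_element))
print(displayed_text(as_attribute))
```

The first call includes the label text and the second does not, because attribute values are not part of element content. A tool built for the source vocabulary would silently lose the label in documents from the extension.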

Whitespace Handling

In XML, some whitespace is significant and some is insignificant. How whitespace is handled has serious impact on the ability to re-use tools among documents in a heterogeneous collection. If elements in a tag set extension do not have the same whitespace handling properties as the display tools were developed to expect, there will be unfortunate (and in some cases surprising) effects on the display of the document content.

Three whitespace handling types are listed below. A compatible tag set extension must not change the whitespace handling type for any existing element.

Element-like whitespace

Content models that contain only elements (no characters) have insignificant whitespace. That is, XML tools may create or destroy whitespace in these models with, by definition, no effect on the document, how it is handled, or how it is displayed.

Data-like whitespace

Content models that contain character data or mixed content contain significant whitespace. That is, XML tools may fold the whitespace (collapse multiple whitespace characters into a single space character), but they may not create or destroy any whitespace nodes.

Preserved whitespace

Content models defined as preserve whitespace are character or mixed content models where the whitespace nodes must not be folded. Each whitespace character in the XML must be preserved. Usually this is used for alignment of code or other preformatted content.
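The three types can be made concrete with a short Python sketch. The mapping of the element names `list`, `p`, and `preformat` to whitespace types is a hypothetical assumption for illustration, and for simplicity the function looks only at the text directly inside the element, not at text following child elements.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical assignment of element names to whitespace handling types.
WHITESPACE_TYPE = {
    "list": "element-like",
    "p": "data-like",
    "preformat": "preserved",
}

def normalized_text(elem):
    """Return the leading text of an element as a tool of each type may treat it."""
    kind = WHITESPACE_TYPE[elem.tag]
    text = elem.text or ""
    if kind == "element-like":
        # Whitespace between child elements carries no information.
        return ""
    if kind == "data-like":
        # Runs of whitespace may be folded to one space, but not removed.
        return re.sub(r"[ \t\r\n]+", " ", text)
    # Preserved: every whitespace character is kept as written.
    return text

element_like = ET.fromstring("<list>\n  <item>a</item>\n</list>")
data_like = ET.fromstring("<p>two   words\n here</p>")
preserved = ET.fromstring("<preformat>x  =  1\n  y</preformat>")
```

If an extension changed `preformat` to the data-like type, or `p` to the preserved type, the same function (and any formatter built on the same expectations) would produce visibly different output for content that looks identical in the XML.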

`ID`, `IDREF`, and `IDREFS`

Rendering and behavior, especially the rendering and behavior of links, is often dependent on the ID/IDREF relationship. If an attribute that has a type of ID in the source vocabulary is changed to any other type, rendering tools may not process the links appropriately.

Changing from IDREF to IDREFS or vice versa is not a concern. The number of pointers will not affect compatibility. Changing the direction of the pointer or obscuring the pointer is the concern here.

We have actually seen one instance in which a user reversed the uses of IDs and IDREFs, creating documents that looked similar to those in the source vocabulary. The result was chaotic; it turned out that the XSLT that created the HTML version of these documents relied on the ID/IDREF mechanism MOST of the time, but occasionally simply treated the attribute values as values. So, SOME of the links worked as expected and some did not. (On further thought, this is as much the fault of an inconsistent transformation as a surprising document; all of these links should probably have failed!)
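A minimal sketch of link resolution shows why the direction of the pointer matters while the number of pointers does not. The attribute names `id` and `rid` and the element names are assumptions for this example; `xml.etree.ElementTree` does not read attribute types from a DTD, so the roles are assigned by name here.

```python
import xml.etree.ElementTree as ET

def resolve_links(root, id_attr="id", ref_attr="rid"):
    """Return (source tag, reference, target tag) triples; target is None if dangling."""
    targets = {
        e.get(id_attr): e for e in root.iter() if e.get(id_attr) is not None
    }
    links = []
    for elem in root.iter():
        refs = elem.get(ref_attr)
        if refs is None:
            continue
        # Splitting on whitespace handles one pointer (IDREF) or several (IDREFS).
        for ref in refs.split():
            target = targets.get(ref)
            links.append((elem.tag, ref, target.tag if target is not None else None))
    return links

source_doc = ET.fromstring(
    "<article><p>See <xref rid='f1'/> and <xref rid='f2 t1'/></p>"
    "<fig id='f1'/><fig id='f2'/></article>"
)
# Hypothetical extension that swaps the roles of the two attributes.
reversed_doc = ET.fromstring(
    "<article><p>See <xref id='f1'/></p><fig rid='f1'/></article>"
)
```

In `source_doc` every link runs from an `xref` to a `fig`, and the one unmatched reference is reported as dangling. In `reversed_doc` the only link found runs from the `fig` to the `xref`, so a renderer that expects cross-references to point at figures would build its links backwards.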

Alternatives or Media-specific Content

In the world of prose documents, it is assumed that the reader should have access to all content. However, there are situations in which that is not the case. For example, it is common to provide several versions of the same graphical object: one for high resolution or full-screen display, one for display on small devices such as hand-helds, a thumbnail for navigation, and perhaps a very high resolution or black & white version for print. In counting the number of figures in a document, this figure should be counted once — not as many times as there are media- or use-specific versions — and only the most appropriate for the display media should be rendered. Similarly, it is becoming common for journals to publish author names both in the language and script of the journal and in the language and script of the author’s home environment. This person should be counted only once in specifying the number of authors of the paper and, more importantly, this paper should only count once when calculating the author’s influence.

Any structure in the original vocabulary that is provided to wrap two or more alternative structures, must be used in the same way in all compatible vocabularies.
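The counting and selection behavior described above can be sketched as follows. The wrapper name `alternatives` and the `specific-use` attribute follow JATS naming, but the plain `href` attribute, the file names, and the fallback rule (use the first alternative when nothing matches) are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<body>
  <fig id="f1">
    <alternatives>
      <graphic specific-use="print" href="f1-print.tif"/>
      <graphic specific-use="web" href="f1-web.png"/>
      <graphic specific-use="thumbnail" href="f1-thumb.png"/>
    </alternatives>
  </fig>
  <fig id="f2">
    <graphic href="f2.png"/>
  </fig>
</body>
""")

def count_figures(root):
    """Each figure counts once, however many alternative graphics it wraps."""
    return len(list(root.iter("fig")))

def graphics_to_render(root, medium):
    """Pick one graphic per alternatives wrapper; render unwrapped graphics as-is."""
    chosen = []
    for fig in root.iter("fig"):
        wrapper = fig.find("alternatives")
        if wrapper is None:
            chosen.extend(g.get("href") for g in fig.findall("graphic"))
            continue
        candidates = wrapper.findall("graphic")
        match = next(
            (g for g in candidates if g.get("specific-use") == medium),
            candidates[0],  # assumed fallback: first alternative
        )
        chosen.append(match.get("href"))
    return chosen
```

If an extension used the wrapper for something other than alternatives, for example to group three distinct images, this logic would count and render only one of them.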

Things that Don’t Seem to Matter in Compatible Modeling

In drafting the JATS Compatibility Meta-Model Description [Usdin et al. 2016] we considered quite a few areas of conformance that, on further examination, proved to be unnecessary to create document models that were compatible for our purposes. There are recognizable, classifiable distinctions that just turn out not to matter for these purposes.

`EMPTY` Elements versus Content-containing Ones

One obvious element differentiator was EMPTY elements versus those with #PCDATA, element, or mixed content. Element content is indeed unique, but data characters, mixed content, and EMPTY are all the same, since characters are, by definition, optional in XML. An element with a #PCDATA model or mixed content may have nothing in it, and will look the same as an EMPTY element in the document. Thus, the following categories are uninteresting in this context:

Structures that contain character data only	Elements that may not have internal markup. In many tag sets, Date may not have internal markup.
Structures that contain character data and phrase-like structures	Paragraph is often allowed to contain character data and phrase-like structures such as Italic, Place Name, or Cross Reference, but not allowed to contain larger nesting structures such as lists and figures.
Structures that contain character data, phrase-like structures, and block-level objects	In some tag sets there are structures that may contain character data, phrases, and block-like structures. For example, paragraphs may be allowed to contain lists, boxed text, display equations, block quotes, tables, or figures.
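The observation that an element declared EMPTY and an empty element with a character-data model look the same in a document can be checked with a short Python sketch. The element names are hypothetical, and the declarations mentioned in the comments are imagined; no DTD is read.

```python
import xml.etree.ElementTree as ET

# Imagine <!ELEMENT break EMPTY> in one vocabulary ...
declared_empty = ET.fromstring("<doc><break/></doc>")
# ... and <!ELEMENT break (#PCDATA)> in another, with nothing typed inside.
pcdata_but_empty = ET.fromstring("<doc><break></break></doc>")

def shape(root):
    """Summarize each element as (name, text, number of child elements)."""
    return [(e.tag, e.text, len(e)) for e in root.iter()]

print(shape(declared_empty) == shape(pcdata_but_empty))
```

Both documents produce the same parsed tree and serialize identically, so a tool working from the instance alone cannot tell which declaration was in force.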

Has Metadata

Some structures (whole documents, authors, boxed-text, appendices) may have metadata, and there are other structures that are unlikely to have metadata (italic, break, address-line). However, on analysis, we found that there are circumstances in which almost any structure could have metadata (at least an ID or IDREF that associates this structure with others), and that this does not affect interoperability as we were looking at it.

Is Metadata

In many tag sets, some elements are only used in the metadata of a document (journal in which published) while others are only used in the narrative text (figure). But in most tag sets there are many elements that can be used both to describe the document in which they occur and to describe other documents (copyright, digital identifiers, publication date), so this distinction is not just unimportant, it often changes over time.

Sections and Section-like Structures

It seemed intuitively obvious that an element that had the section structure in one vocabulary should have a section structure in a compatible vocabulary: that, for example, if a Boxed-text could contain not only paragraph-like structures but also nested headed sections in a source vocabulary, it should be able to in any compatible vocabulary. But since those nested sections are, or could be, optional in the source vocabulary, documents without them can clearly be handled by the tools and formatter, because we believe that a subset of an element model is always conforming. Thus, it is not necessary that compatible vocabularies allow nested sections in all of the places that the source vocabulary does.

Conversely, we considered requiring that nested sections be allowed only in the places where they are allowed in the source vocabulary, and found that this, too, is not a requirement. If a tool or format is data driven (what in XSLT-speak is called push-processed), it should be able to accommodate sections in new locations, provided they have the same style of sections as are already present in the vocabulary.
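A data-driven renderer of this kind can be approximated in Python with a recursive function that handles each element wherever it happens to occur. This is a sketch, not an XSLT equivalent; the element names and the assumption that the source vocabulary does not allow `sec` inside `boxed-text` are hypothetical.

```python
import xml.etree.ElementTree as ET

def render(elem, depth=0):
    """Handle each element where it is found, tracking section depth on the way down."""
    if elem.tag == "sec":
        depth += 1
    if elem.tag == "title":
        return [f"h{max(depth, 1)}: {elem.text}"]
    if elem.tag == "p":
        return [f"p: {elem.text}"]
    lines = []
    for child in elem:
        # Unknown wrappers are simply passed through to their children.
        lines.extend(render(child, depth))
    return lines

# An extension document with a section inside boxed-text, a location the
# renderer's author never planned for.
extended_doc = ET.fromstring(
    "<body><sec><title>Methods</title><p>Intro</p>"
    "<boxed-text><sec><title>Caveats</title><p>Note</p></sec></boxed-text>"
    "</sec></body>"
)
print(render(extended_doc))
```

Because the heading level is derived from the recursive nesting of `sec`, the section in the new location is rendered as a second-level heading without any change to the code. A renderer that instead reached for sections only at fixed paths would miss it.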

Role in the Document

Structures can easily be grouped by their role in a document, and it is tempting to think that structures must play the same role in all document types in order to be compatible. We found that this is not so, and that while it might be interesting to group structures by their roles in documents, these roles do not seem to affect interoperability:

Paragraph-like	Elements that may be used at the same structural level as a Paragraph (`<p>`), for example, inside of a section. This would include many block-level structures such as figures and tables.
Preformat-like	Elements that have the Preserve whitespace model, which is often used for Code and sometimes for poetry.
Emphasis-like with Toggle	Inline elements that may be toggled on and off with recursion. In some tag sets, Italic toggles. That is, if an Italic tagged phrase appears in a context that would be displayed in italic anyway, the Italic tagged phrase is NOT displayed in italics to retain the typographic emphasis.
Emphasis-like without Toggle	Inline elements that do not toggle on and off with recursion. Some structures must be displayed as tagged even if the context they are in would have that display. For example, Sans Serif often does not toggle.
Bibliographic Identifier-like	Structures that identify the document, such as ISSNs, ISBNs, author names, or volume and issue numbers.
Grouping-structures	Structures that contain several related structures but that have no formatting consequences themselves. For example, Article Metadata may be grouped separately from Issue Metadata, and Keywords may be grouped into a Keyword Group.
Footnote-like	Structures that are generally displayed as footnotes may include Footnote, Author Note, Funding Source, and Corresponding Author Address.
Milestone-like	Elements that are used to identify locations in the document or that are used in pairs to indicate the start and end of some portion of a document, typically that cannot be simply wrapped in an element because of overlap problems. Milestones may be Revision Start and Revision End, or simply Pull Quote.
Structures that can have labels and/or titles	Many, but not all, block type structures can have labels and/or titles. For example, Block Quotes, Boxed Text, Sections, Bibliographies, Lists, and Figures can have labels and titles in many tag sets.
`EMPTY` elements	Elements that mark a location in the document or that may have attributes but no element content.
Structures that have accessibility data	The ability to provide alternate text or long descriptions may be available for Figures, Graphics, Equations, Tables, and a variety of other structures.
Structures that have attribution and/or permissions or licensing data	Structures such as Articles, Boxes, Sections, Tables, and Appendices may have information about who wrote them or who may use them and under what conditions.
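The toggling behavior mentioned in the list above can be sketched in Python for readers who have not met it. The `italic` and `p` element names are assumptions, and the function returns (text, is-italic) pairs rather than producing real formatted output.

```python
import xml.etree.ElementTree as ET

def styled_runs(elem, italic=False):
    """Split mixed content into (text, is_italic) runs, toggling at each <italic>."""
    if elem.tag == "italic":
        italic = not italic  # toggle relative to the surrounding context
    runs = []
    if elem.text:
        runs.append((elem.text, italic))
    for child in elem:
        runs.extend(styled_runs(child, italic))
        if child.tail:
            # Text after a child belongs to this element's own context.
            runs.append((child.tail, italic))
    return runs

para = ET.fromstring("<p>The genus <italic>Homo</italic> is</p>")
nested = ET.fromstring("<italic>a<italic>b</italic>c</italic>")
```

Calling `styled_runs(para)` marks only the tagged phrase as italic, while `styled_runs(para, italic=True)`, which simulates a context already set in italic, marks the tagged phrase as upright so that it still stands out.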

Attribute Value Types (other than `ID` and `IDREF`)

Even in a DTD, it is possible to type attribute values, and in XSD and RNG attribute value types can be quite strongly specified. We know (see above) that it is critical that attributes of type ID remain of type ID and that attributes of type IDREF or IDREFS remain of type IDREF or IDREFS in order for documents to be compatible. However, that leaves many other attribute types.

Some processing may be tied to specific values of attributes, and if none of the expected values are present the processing may fail. For example, if a formatter renders <styled-content view="GrIt"> as green italic, if that value is not present the formatter will not render the content in green and italic. However, we see no disruption from:

Adding or removing items from a specified value list
Changing a CDATA attribute to one with a specified value list, or vice versa
Changing a NMTOKEN or NMTOKENS attribute to CDATA or vice versa
Changing the value of a #FIXED attribute or changing a #FIXED attribute to CDATA or a specified value list

We came to the conclusion that most attribute typing is useful in the creation of correct documents as specified by the content creator, but is not essential to the storage, management, or rendering of the documents.
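The kind of value-dependent processing described above, and the way it degrades when an expected value is absent, can be sketched as a simple lookup. The `styled-content` element and the `view="GrIt"` value come from the example in the text; the style table and the other values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical formatter table keyed on the attribute values it knows about.
STYLES = {"GrIt": {"color": "green", "font-style": "italic"}}

def style_for(elem):
    """Look up the style for an element; unknown or missing values get no styling."""
    return STYLES.get(elem.get("view"), {})

known = ET.fromstring('<styled-content view="GrIt">leaf</styled-content>')
unknown = ET.fromstring('<styled-content view="BlBd">leaf</styled-content>')
missing = ET.fromstring("<styled-content>leaf</styled-content>")
```

Under this sketch, a value the formatter has never seen loses its special rendering, but the content is still stored, retrieved, and displayed, which is why changes to value lists did not seem disruptive to us.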

Conclusions

The first public draft of the JATS Compatibility Meta-Model Description [Usdin et al. 2016] was released in July 2016. We anticipate that the assumptions we have made in this work will be tested through the process of public review and comment. We hope that we will be prompted to improve the content of the guidelines to make them more effective and to improve the descriptions of them to make them clearer and easier to implement.

Although some of the compatibility principles we describe, such as whitespace handling, ID/IDREF consistency, and maintaining the meaning of object names, are applicable for testing tag set compatibility in general, we were working specifically on compatibility of extensions to ANSI/NISO Z39.96-2015 JATS.

We welcome your comments on this conference paper and, more importantly, on the document at the NISO site.

References

[Wheeler et al. 2016] Wheeler, Robert, Bruce Rosenblum, and Lesley West. 2016. NISO STS Project Overview and Update. In Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016. Bethesda (MD): National Center for Biotechnology Information (US). http://www.ncbi.nlm.nih.gov/books/NBK350146/.

[ISO 2016] International Organization for Standardization (ISO). 2016. Welcome to the ISO Standards Tag Set (ISOSTS). Accessed April 19. http://www.iso.org/schema/isosts/.

[NIEM 2016] National Information Exchange Model (NIEM). 2016. NIEM. Accessed April 19. https://www.niem.gov/.

[Eberlein et al. 2010] Eberlein, Kristen James, Robert D. Anderson, and Gershon Joseph, eds. December 2010. Darwin Information Typing Architecture (DITA) Version 1.2. Organization for the Advancement of Structured Information Standards (OASIS) Standard. http://docs.oasis-open.org/dita/v1.2/os/spec/DITA1.2-spec.html.

[TEI 2016] Text Encoding Initiative (TEI). 2016. Personalization and Customization. In P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.0.0. Last modified March 29, revision 89ba24e. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/USE.html#MD.

[NCBI 2015] National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM). 2015. Modifying This Tag Set. Journal Archiving and Interchange Tag Library NISO JATS Version 1.1 (ANSI/NISO Z39.96-2015). Last modified December. http://jats.nlm.nih.gov/archiving/tag-library/1.1/chapter/implementor.html.

[ANSI/NISO 2015] American National Standards Institute/National Information Standards Organization (ANSI/NISO). 2015. ANSI/NISO Z39.96-2015, JATS: Journal Article Tag Suite, version 1.1. Baltimore: National Information Standards Organization. http://www.niso.org/apps/group_public/download.php/15933/z39_96-2015.pdf.

[Usdin et al. 2016] Usdin, B. Tommie, Deborah A. Lapeyre, Laura Randall, and Jeffrey Beck. 2016. JATS Compatibility Meta-Model Description. Draft Version 0.7. 32 p. http://www.niso.org/apps/group_public/document.php?document_id=16764&wg_abbrev=jats-sc.

B. Tommie Usdin

Mulberry Technologies, Inc.

B. Tommie Usdin is President of Mulberry Technologies, Inc., a consultancy specializing in XML and SGML. Ms. Usdin has been working with SGML since 1985 and has been a supporter of XML since 1996. She chairs Balisage: The Markup Conference. Ms. Usdin has developed DTDs, Schemas, and XML/SGML application frameworks for applications in government and industry. Projects include reference materials in medicine, science, engineering, and law; semiconductor documentation; and historical and archival materials. Distribution formats have included print books, magazines, and journals, and both web- and media-based electronic publications. She is co-chair of the NISO Z39.96, JATS: Journal Article Tag Suite Working Group. You can read more about her at http://www.mulberrytech.com/people/usdin/index.html

Deborah A. Lapeyre

Mulberry Technologies, Inc.

Ms. Lapeyre is an XML architect; a teacher of XML, XSLT, and Schematron; and an expert in XML vocabulary design and DTD and schema development. She has been developing systems that manipulate tagged documents since 1980, working with SGML since before it was standardized, and with XML from the beginning. Ms. Lapeyre was one of the principal architects and the lead writer of the NLM DTDs and now plays that role for JATS, BITS, and NISO STS. She has designed tag sets for encyclopedias, semiconductor specifications, collections of historical materials, and technical documentation for tractors and heavy equipment.

Laura Randall

NCBI/NLM/NIH

Laura Randall is a Technical Information Specialist at the National Center for Biotechnology Information at the US National Library of Medicine. She has been involved with markup languages since late last century and currently spends her time on the PubMed Central project. Her most notable achievement of late is receiving the designation Bringer of Food from her three rescued black cats, Vader, Tater, and Spud.

Jeffrey Beck

NCBI/NLM/NIH

Jeff Beck is a Technical Information Specialist at the National Center for Biotechnology Information at the US National Library of Medicine. He has been involved in the PubMed Central project since it began in 2000. He has been working in print and then electronic journal publishing since the early 1990s. Currently he is co-chair of the NISO Z39.96 JATS Standing Committee and is a BELS-certified Editor in the Life Sciences.

Balisage: The Markup Conference

Balisage Paper: Graceful Tag Set Extension

Balisage Series on Markup Technologies
