The Importance of Content Structure and XML in Managing Document Collections
Incorporating structure and identifying elements in a large document collection facilitates the discoverability of content as well as information interchange between systems and platforms. For digital content management and data exchange, XML has become the standard approach for storing and transporting data in a format that is both human-readable and machine-readable, and JATS has become a standard in scientific and technical publishing.
Dual readability by both humans and machines is critical for academic research, libraries, public archives (e.g., PubMed Central), indexes, hosting platforms, and vendors. A consistent and accurate XML structure, conforming to the latest JATS standard, is essential for maintaining the integrity and usability of content collections and getting the promised results.
Discoverability
Structured XML allows readers to find content more precisely by enabling advanced searches beyond simple text queries. The structural cues in the content and the metadata enable precise searching. Well-formatted XML documents allow indexing tools and applications to parse content effectively and identify relevant data faster, which is particularly valuable in large datasets and digital libraries.
Content Interchange
Today, the same information is reused in many forms and is expected to transfer seamlessly between systems and platforms. Consistent and accurate XML structures are crucial for this content interchange. XML’s standardization allows disparate systems to communicate easily, ensuring that data retains its meaning and structure when transferred. Interoperability is fundamental for data sharing across organizational boundaries.
Accessibility
Accessibility to information for people with disabilities is important, and it is also often legally required. Accurate structuring and tagging of XML does much of the work to make content accessible and allows much more to be done. For example, images of tables and math that are not encoded are inaccessible to visually disabled users; XML provides the mechanisms needed to encode and structure tables and math so that they can be read by screen readers and other assistive software, ensuring they are accessible to those with visual and other disabilities.
Emerging and Future Applications
Content discoverability, interchange, and accessibility are well-established applications in the scholarly publishing arena; however, many new XML-dependent applications are currently in development, and we don't yet know all of those that are coming. Applications waiting in the wings include various AI applications, as well as tools to meet the challenges of research integrity.
How Good Digital Repositories Go Awry
It stands to reason that an organized document collection is good and that those assembling XML document sets would take pains to keep them in good shape. So, where do things go wrong? Even in the most carefully curated collections, changes, errors, and discrepancies will accumulate and, over time, reduce the system’s efficiency.
- In one large collection we worked on, it was revealed that 15% of articles in the collection had duplicate IDs.
- An analysis of another collection, one that contained all published content from one journal, revealed errors in <pub-date>s for 13% of the collection.
- Another publisher required updates to 73% of files for a seamless platform migration.
These percentages of errors and issues are not atypical. In fact, journal publishers are often surprised by the results of a deep analysis.
Changing Standards
Standards evolve. XML has been an industry standard in scholarly publishing since at least 2003, when the National Library of Medicine (NLM) introduced the NLM DTD Suite. [Rosenblum 2010] Focusing on the JATS standard alone, there have been seven core iterations since 2003 [Wikipedia], and several intermediate versions (Table I). That alone accounts for significant variation over time. Some of the earlier versions are not fully compatible with the current versions and are also missing desirable features that were implemented later.
Table I

NLM JATS, version 1
NLM JATS, version 2
NLM JATS, version 3
NISO JATS, version 1.0
NISO JATS, version 1.1
NISO JATS, version 1.2
NISO JATS, version 1.3
Interpretation of Standards and Errors in Execution
The various standard versions are complex, and not everyone is fully conversant with all the elements and attributes or even has the same understanding of how to interpret them. Because publishers often work with multiple parties and vendors to produce XML based on the standard at the time, variances in tagging interpretation and application, differences in the specifications, and human error all contribute to inconsistencies. Production errors, interpretations, and variations in production support can all lead to discrepancies in the XML files over decades of publishing.
Archival Content is Often Not Consistent with Current Content
Publishers with significant history have likely published content in various NLM/JATS specification iterations. Yet few publishers do a systematic review of their content to analyze and identify key metrics about their source files and corresponding digital assets (e.g., .jpg, .gif, .xlsx, .zip, .tif, .mov, etc.) to bring them up to a common standard; a variety of standards continue to proliferate.
Some of these earlier materials may be following obsolete standards. Publishers may be missing out on important functionality by not updating to the current standard. Understanding what is in a publisher’s legacy archives is a critical first step.
A major factor for many publishers is that not all their content is in XML. Depending on the investment in publishing technology, legacy content could comprise PDFs, PDFs with XML header information, or XML files in an outdated DTD, along with dozens of other possibilities.
Structural Issues in JATS XML and Corresponding Files
An XML file can parse cleanly and be technically valid but still contain issues that impact interoperability or cause problems for the end user accessing the file.
When analyzing large content collections, often spanning decades, more often than not we encounter numerous problems that limit how effectively organizations can use their content. The following are examples we've seen over the years.
Missing DOIs
Web hosting platforms such as Silverchair, HighWire, and Atypon use the Digital Object Identifier (DOI) as a unique identifier for the article or paper. If the DOI is missing or not valid, the content may not load properly, or may not be loaded to the platform at all. If the DOI is not unique, subsequent loads might overwrite earlier content or cause other problems. Missing DOIs found at load time can often delay the whole process.
Identifying these problems in advance provides time to fix them before the crunch. Missing identifiers can be found with lookups against publishers' records or standard databases like Crossref; or, if the DOIs do not exist, they can be created. But all this is best done early rather than at platform load time.
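As a rough illustration of such an advance check, the following sketch (in Python, assuming the lxml and requests libraries; the file name and function names are ours and purely illustrative) extracts an article's DOI and asks the public Crossref REST API whether it is registered.

import requests
from lxml import etree

CROSSREF_API = "https://api.crossref.org/works/"  # public Crossref REST endpoint

def extract_doi(xml_path):
    """Pull the article DOI from a JATS file, if present."""
    tree = etree.parse(xml_path)
    nodes = tree.xpath('//article-meta/article-id[@pub-id-type="doi"]/text()')
    return nodes[0].strip() if nodes else None

def doi_is_registered(doi):
    """Check whether a DOI resolves in Crossref (HTTP 200 means it is registered)."""
    resp = requests.get(CROSSREF_API + doi, timeout=30)
    return resp.status_code == 200

# Usage: flag files with missing or unregistered DOIs before platform load.
for path in ["article-0001.xml"]:          # illustrative file list
    doi = extract_doi(path)
    if doi is None:
        print(f"{path}: missing DOI")
    elif not doi_is_registered(doi):
        print(f"{path}: DOI {doi} not found in Crossref")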
Cross-Reference Errors
Missing and incorrect cross-references (<xref>s) are common. Examples are image files that are not referred to in the text, or references in the text to images that are not found. Incorrect and empty <xref> tags must be resolved, or they will not render properly on hosting platforms and will not find the items referred to.
Following are some examples from a recent analysis; a minimal check for these patterns is sketched after the list:
- Incorrect <xref> to "ref008" – Cross-references to something that does not exist. As a result, the XML would fail to parse in many platforms, and the file would fail to load onto a website.
- <xref> with no text: <xref ref-type="table" rid="fig"/> – <xref>s with the incorrect @ref-type, which requires correction for the link to work properly. This example shows that the reference is to a table, but the ID does not match because it points to a figure.
- <xref> with no text: <xref ref-type="fig" rid="F_EEMCS-04-2016-0059008"/> – <xref>s missing text that are either being used as placeholders for target objects to render or to represent a span.
- <xref> with no text: <xref ref-type="table" rid="tbl8"/> – <xref>s that exist for no discernible reason.
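The check behind findings like these can be quite small. Below is a minimal sketch (Python with lxml; the function and file names are illustrative) that flags <xref> elements whose @rid does not resolve to any @id in the file, as well as empty <xref>s.

from lxml import etree

def check_xrefs(xml_path):
    """Report <xref> elements whose @rid has no matching @id, plus empty xrefs."""
    tree = etree.parse(xml_path)
    ids = set(tree.xpath('//@id'))
    problems = []
    for xref in tree.xpath('//xref'):
        rid = xref.get('rid', '')
        for target in rid.split():            # @rid may hold multiple IDs
            if target not in ids:
                problems.append(f"rid '{target}' has no matching @id")
        if not (xref.text and xref.text.strip()) and len(xref) == 0:
            problems.append(f"empty <xref> rid='{rid}'")
    return problems

# Usage (illustrative path):
for msg in check_xrefs("article-0001.xml"):
    print(msg)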
Invalid Assets
Aside from the text, there are other objects (assets) like images, math, video, etc., which typically need to be incorporated into the displayed document. Requirements for assets vary across hosting platforms, which may have different file format requirements, naming conventions, and other specifics.
Following are common examples of issues with assets (a simple existence-and-size check is sketched after these examples). While these problems do not appear as XML errors, they demonstrate the necessity of analyzing archival and current content and the supporting files that make a journal article or book chapter complete:
- Mismatch between reported image type 'TXT' and file extension 'EPUB'
- File format error
- Asset is size 0
- Missing asset: <graphic xlink:href="09526860010336984-0620130402008.tif"/>
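A minimal existence-and-size check, assuming the assets sit alongside the XML in a known directory (Python with lxml; paths are illustrative), might look like this:

import os
from lxml import etree

XLINK = "http://www.w3.org/1999/xlink"

def check_assets(xml_path, asset_dir):
    """Verify that every <graphic>/<inline-graphic>/<media> href exists and is non-empty."""
    tree = etree.parse(xml_path)
    problems = []
    for node in tree.xpath('//graphic | //inline-graphic | //media'):
        href = node.get(f'{{{XLINK}}}href')
        if not href:
            problems.append(f"<{node.tag}> with no xlink:href")
            continue
        path = os.path.join(asset_dir, href)
        if not os.path.exists(path):
            problems.append(f"missing asset: {href}")
        elif os.path.getsize(path) == 0:
            problems.append(f"asset is size 0: {href}")
    return problems

# Usage (illustrative paths):
for msg in check_assets("article-0001.xml", "assets/"):
    print(msg)

A production check would also verify that each file's actual format matches its extension and the target platform's requirements.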
Other Structural Problems
XML files can be valid and parse cleanly and still contain structural issues that leave the content not fully usable or findable.
Incomplete Metadata
- A missing article title will cause load errors when loading content to a hosting platform.
- A missing chapter title will cause content to render incorrectly and reduce accessibility for visually disabled readers, because a screen reader will not find the chapter heading correctly.
- Invalid dates, such as dates missing a parameter or in an obsolete format, can cause load errors and warnings and will cause materials to be out of order and not organized correctly.
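A sketch of simple metadata checks along these lines (Python with lxml; the date pattern is a simplification of ISO 8601 and is only illustrative):

import re
from lxml import etree

ISO_DATE = re.compile(r'^\d{4}(-\d{2}){0,2}$')   # YYYY, YYYY-MM, or YYYY-MM-DD

def check_metadata(xml_path):
    """Flag missing titles and malformed pub-date values."""
    tree = etree.parse(xml_path)
    problems = []
    if not tree.xpath('//article-meta/title-group/article-title//text()'):
        problems.append("missing <article-title>")
    for date in tree.xpath('//article-meta/pub-date'):
        iso = date.get('iso-8601-date')
        if iso and not ISO_DATE.match(iso):
            problems.append(f"invalid @iso-8601-date: {iso}")
        if not date.xpath('year/text()'):
            problems.append("pub-date missing <year>")
    return problems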
Parsing Errors
Earlier materials prepared with older versions of DTDs might have been valid at one time but may not meet current requirements. When analyzing large content collections, we frequently need to identify and obtain earlier DTDs to parse and update the source content.
The following are common issues that may arise with tags and their attributes; these all happen to relate to errors in MathML, for which tagging might be particularly complex (a minimal validation sketch against a legacy DTD follows these examples):
- Attribute @columnspacing is not allowed on element <mstyle>
- Attribute @fontfamily is not allowed on element <mi>
- Attribute @other is not allowed on element <mrow>
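Once the relevant DTD (legacy or current) has been obtained, validation is straightforward. The sketch below (Python with lxml; file paths are illustrative) parses a file against a chosen DTD and reports validation errors of exactly this kind.

from lxml import etree

def validate_against_dtd(xml_path, dtd_path):
    """Validate a file against a specific DTD version, e.g., a locally archived NLM 2.x DTD."""
    dtd = etree.DTD(dtd_path)
    tree = etree.parse(xml_path)
    if dtd.validate(tree):
        return []
    return [str(err) for err in dtd.error_log.filter_from_errors()]

# Usage (illustrative paths): report errors such as disallowed MathML attributes.
for msg in validate_against_dtd("legacy-article.xml", "dtds/journalpublishing.dtd"):
    print(msg)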
Volume/Issue/Page Publication Metadata Differences
Discrepancies that were not noticeable in earlier lower-functionality systems might cause problems when moving to a system with more sophisticated requirements. A common example would be the non-matching publication dates for articles within a journal issue. Since one would expect print publication dates (ppub) within an issue to be consistent, this condition might indicate an error.
When volume and issue information is incorrect or inconsistent, the article will not live in the right place in the database. It may not be found after loading or may cause the creation of extraneous journal issues.
In any event this is the kind of information that would be brought to the attention of the client project manager, who can decide how to proceed. A typical approach would be to update the journal metadata at the article level to make sure it’s all uniform and to allow articles to be loaded independently when needed.
The following typical messages demonstrate issues that arise with problematic publication dates (a simple consistency check is sketched after these messages):
- Different pub-date-ppub values in unit: "2011-02-22" (6 files), "2011-08-02" (19 files)
- Different pub-date-ppub values in unit: "2020-01-02" (1 files), "2020-01-08" (1 files), "2020-03-20" (1 files), "2020-04-04" (2 files), "2020-04-09" (3 files), "2020-04-13" (1 files), "2020-04-27" (2 files), "2020-06-06" (1 files)
Unusual but True Findings in JATS
Until a collection is deeply mined and analyzed, it’s hard to imagine the issues that might be found in a large collection spanning many decades. Just when we think we’ve seen it all, we find something new:
- Bad dates, such as all 9s or 0s, for an @iso-8601-date
- Angle brackets around tags given as the hex Unicode character references for the less-than and greater-than symbols (i.e., < and >), making the tagging visible in the rendering
- Special characters given as both the UTF-8 character and the hex Unicode reference, one following the other
- DOCTYPE missing from the JATS XML file, so the file is not able to render
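Some of these oddities can be caught with very small checks once you know to look for them. The sketch below (Python; the regular expressions are illustrative heuristics) flags placeholder dates made of all 9s or all 0s, and escaped angle brackets in the raw file text that would leave tagging visible in the rendering.

import re

# Dates made entirely of 9s or 0s, e.g. "9999-99-99" or "0000-00-00".
PLACEHOLDER_DATE = re.compile(r'^(?:9{4}(?:-9{2}){0,2}|0{4}(?:-0{2}){0,2})$')

# Numeric character references for "<" and ">"; when these wrap element names,
# the tagging becomes visible in the rendered page.
ESCAPED_BRACKET = re.compile(r'&#x0*3[ce];|&#6[02];', re.IGNORECASE)

def is_placeholder_date(iso_date):
    """True for @iso-8601-date values that are all 9s or all 0s."""
    return bool(PLACEHOLDER_DATE.match(iso_date))

def has_visible_markup(raw_xml):
    """True when escaped angle brackets appear in the raw file text."""
    return ESCAPED_BRACKET.search(raw_xml) is not None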
Comprehensive Content Cleanup Process
Working at scale on cleaning up XML quality and compatibility in a large collection requires a methodology and automated tools.
Manual error detection in XML documents is time-consuming and error-prone. DCL's approach has been to use extensive automation and AI to identify and highlight structural problems. An automated large-scale analysis exposes data corruption, identifies errors that will prevent interoperability, and maintains the quality standards of valuable assets.
Identifying What You’ve Got
The first step sounds simple, but sometimes isn't – identify all your content and make sure it's available to analyze. The publisher's collection, organized in hierarchical folders and files, undergoes an initial processing phase to organize the content into units, extract relevant metadata from the files, and verify completeness. A unit might be an article, a journal issue, a manuscript, conference proceedings, or anything else. For books, either each chapter or an entire book might be considered a unit. This processing phase also inventories digital assets such as .jpg, .gif, .xlsx, .r, .json, and other file types. The technologies used include DOM, XPath, regular expressions, custom algorithms, Exif tools, PDF tools, and Ghostscript.
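The inventory step itself can start as simply as a directory walk. The sketch below (Python; the path is illustrative) counts digital assets by file extension across a collection.

import os
from collections import Counter

def inventory_assets(root_dir):
    """Walk a collection and count digital assets by (lower-cased) file extension."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "<no extension>"
            counts[ext] += 1
    return counts

# Usage (illustrative path): report extension counts for the whole collection.
for ext, n in sorted(inventory_assets("collection/").items()):
    print(f"{ext}: {n}")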
Architecting the Process
Because of the sheer amount of content that might need to be audited, often a million pages or more, it’s important that software is architected to be multithreaded, restartable, and incremental.
- Multithreading enables simultaneous execution of various parts of a program, allowing many parts of the data collection to be processed in parallel.
- A restartable application allows software to resume from the point of interruption after system downtime or failure. It can be frustrating to have to restart from the beginning after twelve hours of processing.
- Incremental processing minimizes overall processing demands by handling only new data partitions added to an already processed content set rather than reprocessing the entire content collection.
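A minimal shape for such a processing loop, with a checkpoint file providing the restartable and incremental behavior (Python; the checkpoint file name and the per-file analyze function are placeholders for the real analysis):

import concurrent.futures
import json
import os

CHECKPOINT = "processed.json"   # illustrative checkpoint file

def load_checkpoint():
    """Read the set of already-processed files so a rerun is restartable and incremental."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def save_checkpoint(done):
    with open(CHECKPOINT, "w") as f:
        json.dump(sorted(done), f)

def analyze(path):
    """Placeholder for the per-file analysis (parsing, validation, asset checks)."""
    return path, "ok"

def run(all_paths, workers=8):
    done = load_checkpoint()
    todo = [p for p in all_paths if p not in done]     # incremental: only new or unfinished work
    # ThreadPoolExecutor for parallel work; a process pool is a drop-in alternative
    # for CPU-bound analysis.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for path, result in pool.map(analyze, todo):
            print(f"{path}: {result}")
            done.add(path)
            save_checkpoint(done)                      # restartable after interruption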
XML Parsing and Validation
The next phase involves health checks on the digital assets and validation of the XML files. Software performs semantic checks on XML files at the unit level, at the journal title level, and across the collection. Additionally, the relationships between the XML files and the digital assets are analyzed and verified.
Meeting Business Rules: DCL’s Global QC Approach
Aside from the technical details defined by the DTD or Schema, there are other publisher- or journal-specific rules on content that organizations should check to ensure the content base is consistent across the collection. We call those business rules; these are rules that apply to a particular business or situation.
Following are just a few examples from DCL clients to illustrate the kinds of rules that might be applied:
- Figure and table labels should not have end punctuation: <label>Table 1</label>
- Inside a table cell, the element for a graphic should be <inline-graphic> rather than <graphic>
- <article-title> should not be in all caps
Business rules can vary from client to client and even between one project and another; thus, architecting a configurable system that defines and analyzes these rules across large content sets is essential.
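One simple way to make such rules configurable is to express each one as an XPath that selects violations, plus a severity level and message. The sketch below (Python with lxml) encodes simplified versions of the three example rules above; it is an illustration of the approach, not a description of DCL's Global QC implementation.

from lxml import etree

# Illustrative rule set: each rule is an XPath that selects *violations*,
# plus a severity and message, mirroring the example checks listed above.
BUSINESS_RULES = [
    # Simplified: flags labels ending in a period.
    {"xpath": '//label[substring(normalize-space(.), string-length(normalize-space(.))) = "."]',
     "severity": "warning", "message": "label ends with punctuation"},
    # Assumes the XHTML table model (<td>/<th>) used by JATS.
    {"xpath": '//td//graphic | //th//graphic',
     "severity": "error", "message": "use <inline-graphic> inside table cells"},
    # Heuristic: title contains no lowercase letters.
    {"xpath": '//article-title[normalize-space(.) != "" and '
              'translate(., "abcdefghijklmnopqrstuvwxyz", "") = .]',
     "severity": "warning", "message": "article title appears to be all caps"},
]

def apply_rules(xml_path, rules=BUSINESS_RULES):
    tree = etree.parse(xml_path)
    for rule in rules:
        for node in tree.xpath(rule["xpath"]):
            print(f'{xml_path}: [{rule["severity"]}] {rule["message"]} ({node.tag})')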
DCL’s Global QC is a platform to facilitate defining business rules without programming and applying them across a large corpus (Figure 1). Quality checks can be performed on individual XML files or throughout the corpus. The checks can be restricted to DTDs or XML schemas, configured to exclude/include specific nodes within the content, be conditional based on business scenarios, and easily be grouped together into sets for consistent handling. We’ve also built an intuitive user interface to create, modify, and delete verification checks and apply them to the content to address common scenarios and data anomalies. Each individual or related set of checks may be assigned severity levels and customized error messaging to enable efficient downstream error handling.
Content Structure and New Challenges
In addition to the benefits already discussed, improved content structure is essential for addressing the industry’s latest challenges. The creation and dissemination of scholarly content is accelerating at an unprecedented rate, bringing with it a host of new obstacles to overcome. Effective content structuring will be crucial in navigating and managing this explosive growth and solving the challenges that come with it.
Seamless Interchange
Quality XML is crucial for content access, interpretation, and exchange. If publishers routinely analyzed their collections, the anomalies, errors, and issues that have been lurking would be revealed earlier, improving interchange and discoverability. This part is not new; what is new is that we don't know all the places we want content to go or all the new ways we will want to interchange it. Having content rigorously standardized ensures that you will be ready for what comes next.
Research Integrity
Scholarly publishing is struggling with problematic papers getting into publishing workflows via paper mills and citation cartels. XML is increasingly used to identify potential issues related to research integrity by making it easier to verify the accuracy of references, author names, and affiliations against third-party databases (e.g., Crossref, Scopus, Web of Science, CCC, ORCID).
XML tagging and metadata provide a structured framework for analyzing journal articles at scale, enabling publishers to effectively identify potential issues related to research integrity. DCL is building a library of automated research integrity checks to flag and report suspected issues at scale. Some integrity checks involve traditional XML analysis and validation, but other tools are also used.
Adjacent technology, tools, and processes invoked for XML analysis, such as entity extraction, machine learning, and natural language processing, are employed for automated analysis, data mining, and reporting across an entire collection. Some integrity checks we've been applying or studying include:
- Author Names and Affiliations: Tag and verify author identification against third-party databases (CCC, ORCID, etc.).
- References and Citations: Tag and check references and citations against external databases to verify accuracy (Crossref, Scopus, Web of Science, etc.).
- Tortured Phrase Analysis: Analyze unstructured and structured content for tortured phrases or inconsistencies that warrant closer scrutiny. Tortured phrases are words and phrases that have been translated from English into a foreign language and then back into English; they are often an indication that a paper mill has been employed. A classic example is the phrase counterfeit consciousness instead of artificial intelligence. (A simple phrase-matching sketch follows this list.)
- iThenticate Similarity Score: Process large volumes of iThenticate reports and combine them with publisher-specific business rules to report potential plagiarism.
- Image Forensics: Apply digital forensics to identify potentially plagiarized or manipulated images.
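As a small illustration of the tortured-phrase check, the sketch below (Python) scans body text against a tiny lexicon of known substitutions; published tortured-phrase lists are far larger, and the entries here are only examples.

import re

# A tiny illustrative lexicon; real tortured-phrase lists are far larger.
TORTURED_PHRASES = {
    "counterfeit consciousness": "artificial intelligence",
    "profound learning": "deep learning",
    "flag to commotion": "signal to noise",
}

def find_tortured_phrases(text):
    """Return (phrase, expected term) pairs found in the body text."""
    hits = []
    lowered = text.lower()
    for phrase, expected in TORTURED_PHRASES.items():
        if re.search(r'\b' + re.escape(phrase) + r'\b', lowered):
            hits.append((phrase, expected))
    return hits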
AI Initiatives
The rapid growth of AI in the research community has given new importance to structured content. While AI development is still in its infancy, it's becoming clear that AI systems will produce better answers when they get the additional clues that well-structured data and metadata provide. Adopting a consistent and accurate XML structure ensures that an organization's content will be AI-ready and helps preserve the value of its content assets. This becomes even more important when combining Retrieval-Augmented Generation (RAG) with structured content in a large language model (LLM) to improve the accuracy, timeliness, and factual grounding of LLM output.
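As one hedged example of what "AI-ready" structure can provide, the sketch below (Python with lxml; the function is illustrative, not a description of any particular product) turns each top-level JATS <sec> into a retrieval chunk that carries the article's DOI, title, and section heading as metadata for a RAG pipeline.

from lxml import etree

def section_chunks(xml_path):
    """Yield (metadata, text) chunks for retrieval, one per top-level <sec>."""
    tree = etree.parse(xml_path)
    doi = "".join(tree.xpath('//article-meta/article-id[@pub-id-type="doi"]/text()'))
    title = " ".join(tree.xpath('//article-meta/title-group/article-title//text()'))
    for sec in tree.xpath('//body/sec'):
        heading = " ".join(sec.xpath('title//text()')).strip()
        text = " ".join(sec.xpath('.//p//text()'))
        yield {"doi": doi, "article_title": title, "section": heading}, text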
Structured content provides potential advantages for finetuning an LLM such as OpenAI’s GPT model to refine its knowledge based on a specific domain. [Abbey 2024] By training on verified and structured data, the AI model can better parse and process information and ultimately minimize hallucinations.
Structured content has many potential advantages over unstructured content when it comes to training and finetuning language models:
- Contextual understanding: structured content metadata provides valuable context and deeper semantic meaning for a language model.
- Easier data processing: the DTD improves how the algorithm parses, processes, and understands content. This leads to more efficient training and better results than with unstructured content. [Van Vulpen 2023]
- Improved accuracy: structured content is organized and follows a consistent hierarchy, which reduces ambiguity and noise in the content. This improves LLM performance, as a model can more easily identify patterns and relationships in content.
Navigating Toward the Future
The future depends on what you do today.
— Mahatma Gandhi
Scholarly journal and book publishers are in the unique position of publishing groundbreaking news while also preserving the record of the scientific method in its true form. Legacy journal and book content represent a snapshot in time of knowledge, methods, and discoveries that have shaped our understanding of the world.
As the pace of scholarly publishing accelerates and the number of published materials grows, maintaining the integrity and accessibility of both current and legacy content is more critical than ever. Publishers must ensure that information is preserved accurately and remains easily accessible to researchers, scholars, and the public, allowing future generations to build upon these foundations of past research and knowledge. This dual responsibility of disseminating new findings while safeguarding historical content highlights the importance of robust content management, industry standards, technologies, and practices in the scholarly publishing industry.
References
[Rosenblum 2010] Rosenblum, A. (2010). NLM Journal Publishing DTD Flexibility: How and Why Applications of the NLM DTD Vary Based on Publisher-Specific Requirements. Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010.

[Wikipedia] Wikipedia contributors. (2024, June 3). Journal Article Tag Suite. In Wikipedia, The Free Encyclopedia. Retrieved 17:41, June 25, 2024, from https://en.wikipedia.org/w/index.php?title=Journal_Article_Tag_Suite&oldid=1227032206

[Abbey 2024] Abbey, A. (2024, March 26). Structured content: A secret weapon for finetuning AI. Retrieved June 29, 2024, from https://www.rws.com/content-management/blog/structured-content-a-secret-weapon-for-finetuning-ai/

[Van Vulpen 2023] Van Vulpen, M. (2023, April 25). Unlocking the Secrets of GPT: How Structured Content Powers Next-Level Language Models. Retrieved June 29, 2024, from https://www.fontoxml.com/blog/unlocking-the-secrets-of-gpt-how-structured-content-powers-next-level-language-models/