How to cite this paper

Gross, Mark, and Charles O'Connor. “Pulling All Production Processes Together with an XML-First System.” Presented at Balisage: The Markup Conference 2023, Washington, DC, July 31 - August 4, 2023. In Proceedings of Balisage: The Markup Conference 2023. Balisage Series on Markup Technologies, vol. 28 (2023). https://doi.org/10.4242/BalisageVol28.Gross01.

Balisage: The Markup Conference 2023
July 31 - August 4, 2023

Balisage Paper: Pulling All Production Processes Together with an XML-First System

Mark Gross

President

Data Conversion Laboratory

Mark Gross, President of Data Conversion Laboratory, is a recognized authority on XML implementation, document conversion, and data mining. Prior to founding DCL in 1981, he was with the consulting practice of Arthur Young & Co. Mark has a BS in Engineering from Columbia University and an MBA from New York University, and has taught at the New York University Graduate School of Business, the New School, and Pace University.

Charles O’Connor

Lead Product Manager

Aries Systems

Charles O'Connor is a Lead Product Manager at Aries Systems. He has spent the last 15 years automating XML-based scholarly article production workflows. Currently working on Aries' LiXuid Manuscript solution, he previously led the team that built ArticleExpress for Sheridan (now KGL). His approach to creating tools is informed by his prior experience as an indexer, proofreader, copy editor, and production editor. He is a member of the NISO Information Creation and Curation Committee.

Abstract

Creating a seamless centralized workflow that starts with XML has long been the siren song of scholarly journal production workflows. Yet the definition of “start” is the critical piece in this publishing puzzle. For Aries Systems Corporation, innovating article production truly means starting with XML as soon as a manuscript is accepted after peer review. But how do you create a system for auto XML text conversion for Word files when you cannot control the creation of the Word file, nor force authors to follow any predefined template or complex instructions? Given all the ways that authors can use (and abuse!) the wide range of MS Word features, no automated system can produce good XML from 100% of author-supplied files. But Aries could not expect the users of Editorial Manager and ProduXion Manager to be detectives or tinkerers, figuring out what in a problem Word file needs to be fixed to get a good result, rerunning the file, QC’ing the result, maybe having to go back in and fix something else, rerunning the file, QC’ing the result. Editors and authors need to have confidence that when they submit a Word file for processing they will get a good XML file out every time.

Project Overview

Workflow Overview

Workflow Details

Receive
Pre-Process
Autostyle
Conversion

Confidence Analytics

Editorial

Quality Assurance

Final Packaging

Delivery to Aries

Conclusion

Project Overview

For the last two decades (and more), advocates have touted the potential benefits of XML-based workflows in scholarly publishing. XML would be the single source of truth. XML would allow multi-channel publishing. XML would give meaning to content through semantic tagging and rich metadata. XML would allow the mining of data and the linking of insights among different sources. If only we could rid ourselves of proprietary formats like MS Word documents and PDFs and start with structured content, the world would be a better place.

But as we got further into the new millennium, the promise of XML seemed to fade. For use cases outside the document space, JSON is a lighter format for conveying semantic information and, despite XML’s mature tech stack, easier for most programmers to handle (https://www.balisage.net/Proceedings/vol26/html/StLaurent01/BalisageVol26-StLaurent01.html). Within the document space, authoring and editing in XML was considered just too hard. Among other issues, critics pointed out that in XML, whitespace can be meaningless, and that to create a new paragraph you cannot just hit [Enter]; you must close the previous paragraph element and open a new one (https://contentwrangler.com/2016/02/23/why-does-xml-suck/). Efforts to put XML at the center of scholarly publishing workflows languished.

The value of XML to store and transmit semantic information was not entirely lost. Especially with the growth in importance of metadata such as funding information and persistent identifiers for everything from images to authors, the Journal Article Tag Suite (JATS) has become a vital tool in scholarly publishing. But its value has been primarily as a final delivery and archiving format. In most publishing workflows, the XML is either created at the end of the process, or its care and handling has been left to operators skilled in specialized software while content experts, the authors and editors, continue to work in Word and PDF.

However, as web-based, user-friendly XML editors were developed, new possibilities for XML-through workflows began to open, and Aries Systems saw an opportunity. Aries’ submission and peer review system Editorial Manager and its production tracking system ProduXion Manager already provided two of the necessary pieces of the puzzle: the metadata that belongs in the XML and the workflow engine necessary to assign and track tasks performed by all the different actors in the workflow. What was needed were the tools to create, edit, transform, and analyze the content.

Well, we’ve made progress. Aries worked with Fonto XML to develop a user-friendly but comprehensive JATS editing tool. We worked with Typefi to transform XML to PDF and other publication channels. We integrated Saxon to facilitate metadata exchange and quality analysis, but we still needed good XML at the beginning of the workflow. As much as we would love them to write in a structured format, authors, especially of unsolicited manuscripts, and publishers are still wedded to Word.

And not just good XML, but consistently good XML. Even with a user-friendly toolset, XML did not stop being hard. Aries could not expect editors, much less authors, to diagnose and fix problems arising from a suboptimal conversion. Given the almost absurd depth of possibility for author use and abuse of Word features, even the most sophisticated automated Word-to-XML conversion engine will have a fraction of articles it handles badly. What should editors do when the engine puts out poorly constructed XML? Should they fix the XML and compare it against the Word file to ensure nothing has been missed? Or should they try to figure out what the conversion didn’t like, fix it in the Word file, run it through again, and hope they got it right?

Convincing publishers to take control of their content rather than throwing it over a wall for a vendor to take care of is difficult enough. We could not expect them to also develop expertise in XML or the pitfalls of every feature in Word. Aries needed a conversion engine that quickly turns around good-quality JATS XML from the highest practical percentage of unstructured Word files but also recognizes when the results are not optimal so that the file can be diverted for manual review and remediation. Aries turned to Data Conversion Laboratory (DCL) for its support in creating this automated XML conversion engine.

Workflow Overview

DCL architected an automated workflow comprising nine key steps that the system must traverse from manuscript acceptance to delivery into Aries’ ProduXion Manager, with the goal of a 10-minute turnaround time per manuscript. DCL and Aries understood that not every article would pass straight through the automated system. They also understood that every manuscript that did successfully pass the nine workflow steps would produce XML that is not only valid but also good, conforming to the Aries-defined subset DTD based on JATS 1.2. The ultimate goal is to accurately convert at least 80% of reasonably conformant Word files.

The workflow steps of each article are listed below.

Receive manuscript
Preprocess
Autostyle
Conversion
Confidence Analytics
Editorial
QA
Final packaging
Delivery to Aries

The system is architected to

invoke or bypass activities based on the nature of the content
maintain metadata, status, metrics, issues, alerts, and updates
provide management reporting and full transparency

Manual intervention would be required for manuscripts that did not pass through the system. However, the more the system is used, the more rules for automation and processing anomalies can be identified and applied, thus improving performance over time.

Workflow Details

While nine workflow steps initially sound simple and streamlined, the complexities in each step are considerable. Every manuscript is nearly as unique as the author who created it. Each step in the workflow enables subsequent actions. If an action fails in a workflow step, the system will route the manuscript for human intervention.
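The routing behavior described above can be illustrated with a minimal sketch. It is not DCL's Production Control System; the names `Job`, `run_workflow`, and the status strings are hypothetical, and each step is assumed to be a function that reports success or failure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Job:
    """Hypothetical record for one manuscript moving through the workflow."""
    job_id: str
    status: str = "in_progress"
    log: List[str] = field(default_factory=list)


# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[Job], bool]]


def run_workflow(job: Job, steps: List[Step]) -> Job:
    """Run steps in order; stop and route to a human at the first failure."""
    for name, action in steps:
        try:
            ok = action(job)
        except Exception as exc:  # an unexpected error counts as a failed step
            job.log.append(f"{name}: error: {exc}")
            ok = False
        if not ok:
            job.status = "manual_review"
            job.log.append(f"{name}: routed for human intervention")
            return job
        job.log.append(f"{name}: ok")
    job.status = "delivered"
    return job
```

Because each step only runs when the previous one succeeded, a later step can rely on the normalization done by earlier ones, which is the dependency the paragraph above describes.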

Receive

Upon acceptance after peer review, Aries invokes the DCL submission web service API and passes a payload containing the zip file for the article, together with a callback URL and a job identifier that DCL uses when it is ready to return the final results to Aries.

This initial workflow step unpacks the files submitted and securely saves and registers the assets and corresponding metadata in DCL’s Production Control System (PCS). PCS provides workflow management and scheduling/reporting capabilities with comprehensive monitoring mechanisms that track timeliness and quality levels, generating alarms when requirements are at risk of not being met. It was developed and built specifically to handle multiple production sites and multiple workflows simultaneously and is time-tested to handle the volume and accuracy that would be needed to meet Aries requirements.

At this stage, the system immediately determines whether a document is seriously problematic, for example because it is password-protected or in a source format other than Word. Future stages in the project are intended to also support PDF source files.

Upon successful completion of all these steps, including the triage, the workflow is initiated.
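A triage check of this kind might look like the following sketch. It is an illustration under stated assumptions, not DCL's implementation: the function name `triage_source` and the returned messages are hypothetical. It relies on a .docx file being a zip package that contains `word/document.xml`, and it assumes that a password-protected Word file arrives as an OLE compound file (as does a legacy .doc), which is recognizable by its leading signature bytes.

```python
import io
import zipfile

# Signature bytes at the start of an OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def triage_source(data: bytes) -> str:
    """Classify an uploaded source file before the workflow starts."""
    if data.startswith(OLE_MAGIC):
        # Assumption: encrypted .docx and legacy .doc both land here.
        return "reject: OLE container (legacy .doc or encrypted file)"
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "reject: not a Word .docx package"
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        if "word/document.xml" not in package.namelist():
            return "reject: zip package without word/document.xml"
    return "accept"
```

A rejected file would be reported back rather than entering the automated steps, so that no later stage has to cope with an unreadable source.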

Pre-Process

The pre-process step normalizes the manuscript to enable subsequent autostyling. In some instances, normalization is straightforward, such as accepting all track change revisions. Other normalization tasks are complex, such as using the MathType API to convert non-native math created with MathType or older versions of Word to MathML.

Pre-processing is the first step in introducing industry standards to promote consistency in the content. It converts the content to the DCL Hub format, enabling connectors to other delivery formats such as XML, HTML, and XHTML.

Following are some of the individual actions taken in pre-processing:

Converting non-native math to MathML
Converting native Word math to MathML
Correcting malformed URLs
Normalizing lists
Accepting all track change revisions
Inserting page number bookmarks
Handling line breaks
Extracting and converting vector/raster images to PNG
Converting to HUB with the OASIS table model
Converting Word table properties to XHTML/CSS
Recognizing table headings
Representing tables as HTML
Handling inconsistent table columns, row separators
Normalizing fields such as symbols to XML style grammar
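As one concrete example from the list above, accepting tracked changes can be sketched directly against WordprocessingML, where tracked insertions are wrapped in `w:ins` and tracked deletions in `w:del`. This is a simplified illustration, not DCL's pre-processor: the function name is hypothetical, and the sketch ignores tracked moves, tracked formatting changes, and deleted paragraph marks.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
INS = f"{{{W}}}ins"
DEL = f"{{{W}}}del"


def accept_tracked_changes(element: ET.Element) -> ET.Element:
    """Accept revisions in place: keep inserted content, drop deleted content."""
    kept = []
    for child in list(element):
        accept_tracked_changes(child)  # handle nested revisions first
        if child.tag == DEL:
            continue  # a deletion is accepted by discarding its content
        if child.tag == INS:
            kept.extend(list(child))  # unwrap: keep the inserted runs
        else:
            kept.append(child)
    # Rebuild the child list in the original order.
    for child in list(element):
        element.remove(child)
    element.extend(kept)
    return element
```

Applied to the root of `word/document.xml`, this leaves only the text an author would see after choosing "Accept All Changes," which gives later steps a single unambiguous version of the content.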

Autostyle

The autostyle step applies consistency and styling that further enables consistent conversion to XML meeting the Aries specification. A major component of this step deals with images, tables, and headings:

Link graphics and graphic captions, handling layered graphics, styling captions, accommodating page boundaries, and logging anomalies such as unconnected drawings
Identify and tag all the elements in front matter, body, and back matter
Autostyle based on the content structure – heading levels, lists

Graphics can be complex depending on how authors prepare and supply images. This stage ensures that for every graphic/image called out in text, there is a corresponding asset file as well as a caption for the graphic/image in the narrative text. In some cases, images might have multiple layers in the Word manuscript, such as a grid with four panels labeled A, B, C, and D. The autostyle process extracts each panel into a separate asset, then tags and structures it to align the figure callout with the correct asset file and associated metadata.

DCL employs a series of techniques to carry out the specific actions in this workflow step. spaCy, an industrial-strength natural language processing library for Python, is used to detect and autostyle author names and affiliations. A complex layering of business intelligence and configurable, custom algorithms is used to detect, structure, and autostyle the document sections and the elements within the content, such as table captions, table citations, figure captions, figure citations, reference citations, and appendix citations, and to decompose and style the references.

Autostyling might seem straightforward, but the reality is that it needs to deal with many inconsistencies. For example, while we need to tag the reference section with the XML tag <ref>, authors often misspell “references.” And when citing references, authors often misspell the author name, cite the wrong publication year, or use unmatchable reference subscript links. Information is often missing and must be inferred, such as when author names and affiliations must be linked without any direct indicator, such as a subscript, to establish the relationship.

The system is designed to run through a series of common typos and repeated errors identified across a large sample content set. Business rules for autoconversion have been identified through natural language processing, and the system allows for ongoing inclusions of new rules as identified.
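To make the misspelled-heading problem concrete, the sketch below matches a heading against a list of expected section names using approximate string matching from Python's standard library. It is only an illustration of the idea; the heading list, the function name, and the 0.8 cutoff are assumptions, and DCL's actual rules are not described here.

```python
import difflib

# Hypothetical list of section headings the autostyler expects to find.
KNOWN_HEADINGS = [
    "abstract", "introduction", "methods", "results",
    "discussion", "conclusion", "acknowledgments", "references",
]


def normalize_heading(text, cutoff=0.8):
    """Return the expected heading closest to `text`, or None if none is close."""
    key = text.strip().lower().rstrip(":.")
    matches = difflib.get_close_matches(key, KNOWN_HEADINGS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For example, `normalize_heading("Refrences")` returns `"references"`, while a line such as "Figure 1" returns `None` and is left for other rules to classify.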

Table construction can, at times, be quite complex in these manuscripts. Specific font formats (italics, bold), shading in tables, and colors might all carry meaning that should not be lost. Every table must be tagged with attributes on each specific table element to denote the style represented in the source file.

For example, the author manuscript may be styled with a basic table, table body, table rows, and table cells. Adding attributes to the table elements that detail stylistic choices ensures that the manuscript is represented both physically and meaningfully with the author's intended choices:

Source:

XML Representation:

Conversion

While the conversion workflow may appear to be where the magic happens, it is an excellent illustration of how critical preceding steps are to the successful conversion to XML. Verifying autostyling actions and repairing anomalies is the first process.

The system converts special characters, such as currency symbols (£, ¢), legal symbols (©), and mathematical symbols (<, √, °), to Unicode to ensure consistency in the content ingested by downstream systems.

Bibliographic references and author affiliation linking are well known for their myriad complexities. For example, references to content other than journals and books may be a challenge to interpret, and the order of first and last names may vary by region, as in Chinese names, where the author’s last name is listed first and the first name last.

DCL decomposes bibliographic citations presented in straight text and applies detailed structure that allows for accurate search and discovery. It also detects and links bibliographic citations within an article to the corresponding bibliographic reference.

Following is an example of in-text bibliographic citations submitted in a manuscript:

The new system extracts the free-form bibliographic citation text and structures it within an <xref> tag to facilitate linking and provide clarity and structure.
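The linking step can be sketched for the simplest case, numeric citations in square brackets. This is an assumption-laden illustration rather than DCL's method: the function name and the `ref_ids` mapping are hypothetical, and author-year citations would need considerably more logic. The output uses the JATS `<xref>` element with `ref-type="bibr"` and an `rid` pointing at the reference.

```python
import re
from xml.sax.saxutils import escape


def link_numeric_citations(paragraph: str, ref_ids: dict) -> str:
    """Wrap citations such as [3] in <xref> when a matching reference exists."""
    def replace(match):
        number = match.group(1)
        rid = ref_ids.get(number)
        if rid is None:
            # No matching reference: leave the text untouched for review.
            return match.group(0)
        return f'[<xref ref-type="bibr" rid="{rid}">{number}</xref>]'

    # Escape XML special characters first, then add the markup.
    return re.sub(r"\[(\d+)\]", replace, escape(paragraph))
```

Leaving an unmatched citation as plain text, rather than guessing, keeps the anomaly visible to later quality checks.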

Other items in the conversion stage include

Autostyling verification and repair
Author/affiliation linking
Special characters handling
Table footnotes
Retain and propagate table colors, shading, bolding
Bibliographic reference decomposition
Figure citations
Reference citations
Appendix citations
Conversion to the Aries version of the JATS 1.2 XML schema
Populating metadata

Along with the Word file, Aries provides an XML file with metadata about the article from Editorial Manager. Journal metadata that is added to the article XML as part of the conversion step includes ISSN, journal title, and other publisher information. Article metadata includes received/revised dates, volume and issue, funding information, and, when author names can be unambiguously matched, ORCIDs.

Confidence Analytics

The DCL Global Quality Control System (GQC) maintains XML-based cross-project and project-specific quality control checks that are configured and executed at various steps in the processing workflow to ensure consistently high levels of quality.

Once the conversion is complete, an extensive set of automated QC checks is performed to assess source anomalies and the accuracy of the resulting XML. The results of the checks are analyzed and weighed using DCL’s custom algorithms to provide a confidence factor and an automated recommendation, either manual review or straight-through processing, for each document.

If the confidence factor indicates that manual review is required, the specific areas of the document to be reviewed and potentially repaired are identified and communicated to PCS to orchestrate the appropriate workflow steps.
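DCL's weighting algorithms are custom and are not specified here, so the following is only a stand-in that shows the general shape of such a decision: a weighted pass rate compared against a threshold, plus a list of the locations that failed. The class and function names, the weights, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CheckResult:
    """Outcome of one automated QC check (hypothetical structure)."""
    name: str
    passed: bool
    weight: float   # how much this check matters to overall confidence
    location: str   # area of the document the check examined


def assess(results: List[CheckResult], threshold: float = 0.9) -> Tuple[float, str, List[str]]:
    """Return (confidence, recommendation, areas needing review)."""
    total = sum(r.weight for r in results)
    earned = sum(r.weight for r in results if r.passed)
    confidence = earned / total if total else 0.0
    flagged = [r.location for r in results if not r.passed]
    route = "straight_through" if confidence >= threshold else "manual_review"
    return confidence, route, flagged
```

Returning the flagged locations alongside the recommendation is what allows a manual review to be targeted at specific elements instead of the whole article.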

Editorial

The editorial stage involves human intervention from both offshore and onshore resources. The combined onshore/offshore review supports time and budget concerns that are real factors when architecting a workflow that is in a near-constant state of movement.

The manual review is targeted and limited to the content elements that were highlighted for review and repair. Triage GQC is performed upon receipt of the updated XML to verify that only those identified elements were modified.
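A check that only the identified elements were modified could be sketched as below. The sketch makes simplifying assumptions that may not hold in the real system: the reviewable units are direct children of the root element, each carries a unique `id` attribute, and any difference in serialized form counts as a modification. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def changed_ids(before_xml: str, after_xml: str) -> set:
    """Return ids of top-level children that differ between two versions."""
    def fingerprints(xml_text):
        root = ET.fromstring(xml_text)
        # Serialize each child so that any change inside it is detected.
        return {child.get("id"): ET.tostring(child) for child in root}

    before, after = fingerprints(before_xml), fingerprints(after_xml)
    return {key for key in before.keys() | after.keys()
            if before.get(key) != after.get(key)}


def only_flagged_modified(before_xml: str, after_xml: str, flagged) -> bool:
    """True when every changed element was among those flagged for review."""
    return changed_ids(before_xml, after_xml) <= set(flagged)
```

If this returns `False`, the reviewer touched something outside the targeted scope, and the file can be sent back before it reaches quality assurance.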

Quality Assurance

In the past, content quality review was a manual project, and the resources were simply not available or would often be outside of budget when considering the volume and scale publishers deal with today. DCL’s proficiency in automating QA checks decreases conversion turnaround time and vastly improves the quality of the final product. By applying a trifold approach of automated differencing, error analytics, and XML-based QC checks using GQC, DCL provides a model of continuous improvement that both allows the correction of issues prior to final delivery and proactively prevents those issues in the future.

Final Packaging

The Word-to-XML full-text conversion produces an XML file that is valid to JATS 1.2 and adheres to all XML style points detailed in Aries’ Full-Text XML Tagging Guidelines.

Final quality control checks are executed during packaging, such as verifying that every image callout has a matching image and vice versa. Images are renamed as necessary, line breaks and unique identifiers are repaired as appropriate, and the final, verified files (XML and images) are zipped and prepared for delivery back to Aries.
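The two-way image check can be illustrated with a short sketch. It assumes that image callouts appear as JATS `<graphic>` elements with an `xlink:href` attribute naming the file; the function name is hypothetical, and a fuller version would also cover `<inline-graphic>`.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"


def check_image_callouts(article_xml: str, packaged_files):
    """Return (callouts with no file, files with no callout), each sorted."""
    root = ET.fromstring(article_xml)
    called_out = {g.get(f"{{{XLINK}}}href") for g in root.iter("graphic")}
    called_out.discard(None)  # ignore <graphic> elements lacking an href
    packaged = set(packaged_files)
    return sorted(called_out - packaged), sorted(packaged - called_out)
```

Both lists should be empty before the package is zipped; a non-empty list identifies exactly which callout or file needs attention.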

Delivery to Aries

DCL’s PCS calls the Aries API at the provided callback URL, passing the job identifier, to notify Aries that the packaged XML file is ready for delivery. Aries then invokes the DCL GetJob API and retrieves the results.

Additionally, a DCL daily reconciliation report confirms that all handshakes were successfully executed and that no files were dropped upon retrieval and return.
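At its core, a reconciliation of this kind is a comparison between the job identifiers received and those returned. The sketch below shows that comparison only; the function name and report keys are hypothetical, and the actual DCL report presumably carries more detail.

```python
def reconcile(received_job_ids, returned_job_ids):
    """Compare the day's received jobs against the jobs retrieved by Aries."""
    received, returned = set(received_job_ids), set(returned_job_ids)
    return {
        "not_returned": sorted(received - returned),  # possible dropped files
        "unexpected": sorted(returned - received),    # returned but never logged
        "balanced": received == returned,
    }
```

Any identifier listed under `not_returned` points to a manuscript that entered the workflow but has not yet completed the handshake.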

Conclusion

In the scholarly publishing workflow, good XML (not simply valid XML) is imperative for every person who works in an editorial and production capacity. Copyeditors, production editors, editorial directors, and ultimately authors benefit when content is structured early in the publishing workflow. The frictionless flow of content across a production workflow reduces frustrating delays and inaccuracies, resulting in faster turn-around times and improved quality of published scientific content.

Aries plans to move XML-centered workflows upstream from production to peer review as the system matures. In peer review, consistent quality of XML is even more vital, as it will be authors, instead of the production editors employed by the publisher, who will be the first to see the results.

While still in its early stages, the new XML-first workflow developed by Aries and DCL reduces frustrating delays and inaccuracies. It results in faster turn-around times, improved quality, better control and accessibility of key metadata, and, most importantly, an improved user experience for editors and production staff.
