Maloney, Chris, Alf Eaton and Jeff Beck. “A client-side JATS4R validator using Saxon-CE.” Presented at Balisage: The Markup Conference 2015, Washington, DC, August 11 - 14, 2015. In Proceedings of Balisage: The Markup Conference 2015. Balisage Series on Markup Technologies, vol. 15 (2015). https://doi.org/10.4242/BalisageVol15.Beck01.
Balisage: The Markup Conference 2015 August 11 - 14, 2015
Balisage Paper: A client-side JATS4R validator using Saxon-CE
Chris Maloney
NCBI/NLM/NIH
Chris Maloney is a web developer working for NCBI's PMC and Bookshelf
resources. He has worked with XML technologies for over ten years.
Alf Eaton
PeerJ
Jeff Beck
NCBI/NLM/NIH
Jeff is a Technical Information Specialist at the National Center for
Biotechnology Information at the US National Library of Medicine. He has been
involved in the PubMed Central project since it began in 2000. He has been
working in print and then electronic journal publishing since the early 1990s.
Currently he is co-chair of the NISO Z39.96 JATS Standing Committee and is a
BELS-certified Editor in the Life Sciences.
Authors’ contribution to the Work was done as part of the Authors’
official duties as NIH employees and is a Work of the United States
Government. Therefore, copyright may not be established in the United
States. 17 U.S.C. § 105. If Publisher intends to disseminate the Work
outside the U.S., Publisher may secure copyright to the extent authorized
under the domestic laws of the relevant country, subject to a paid-up,
nonexclusive, irrevocable worldwide license to the United States in such
copyrighted work to reproduce, prepare derivative works, distribute
copies to the public and perform publicly and display publicly the work,
and to permit others to do so.
Abstract
JATS4R (jats4r.org) is a group that provides guidelines for tagging scholarly
articles in JATS XML to maximize machine-readability and the potential for content
reuse. When the group formalizes a recommendation, we encode the rules in
Schematron. For checking instance documents against the rules, we have implemented a
validation tool (hosted at http://jats4r.org/validator/). When an instance document
is processed, it is first parsed with a JavaScript implementation of xmllint, then
validated against the DTD, if one is specified. The validator then checks the
document against the Schematron rules, and generates a report in Schematron
Validation Report Language XML (SVRL). To avoid the maintenance costs of hosting a
server-side tool, the validation tool is written in JavaScript, using an emscripten
port of libxml, and Saxon-CE as the client-side XSLT processor. This allows it to be
hosted on a static site and run entirely within the user’s web browser. The XSLT
files used for validation are generated from the Schematron rulesets offline, and an
HTML report is generated from the SVRL validation results using a further XSLT
transformation.
The National Center for Biotechnology Information of the National Library of Medicine
originally created the Journal Archiving and Interchange Tag Suite (JATS)[JATS01] with the intent of providing a common format in
which publishers and archives could exchange journal content.
It was developed in response to a Document Type Definition (DTD) used by the NCBI/NLM
PubMed Central project to archive life science journals from a variety of sources.
Input and support from Harvard University Libraries, as well as support from The
Andrew W. Mellon Foundation and collaboration with Inera, Inc. and Mulberry
Technologies, Inc., allowed the scope of the project to be broadened and resulted in
the NLM Journal Archiving and Interchange Tag Suite.
JATS is a NISO standard (Z39.96-2012)[NISO01]
that defines elements and attributes that describe metadata and full content of
scholarly journal articles. The Tag Suite is the complete set of elements and attributes
described in the standard. Along with these descriptions, the standard includes three
article models or Tag Sets: the Journal Archive and Interchange Tag Set, the Journal
Publishing Tag Set, and the Article Authoring Tag Set.
The intent of the Tag Suite is to preserve the intellectual content of journals
independent of the form in which that content was originally delivered. It enables an
archive to capture structural and semantic components of existing material without
modeling any particular sequence or textual format.
JATS is the XML model used for all content stored in PMC.[PMC01] All content in PMC is free to read, but the XML versions of the
truly open access articles are available through the PMC Open Access Subset.[PMC02]
The Problem
In 2012, an automated software tool, the Open Access Media Importer
(OAMI),[OAMI1] started using the articles in the
PMC Open Access Subset to find audio and video objects that could be uploaded to Wikimedia
Commons for use on Wikipedia and elsewhere. The OAMI used several JATS elements and
attributes, including those for licensing, keywords and media types. This use revealed
inconsistencies in the XML available from PMC.
Although JATS is a standard and PMC performs some standardization of the submitted XML
during ingest, JATS has had to allow for the fine nuances of publishing and the varying
requirements of different types of content and different publishers. As a result,
publishers use JATS inconsistently, which leads to problems when reusing the content.
These inconsistencies affected how the OAMI software could reuse the material.
Inconsistencies and ambiguities in license tagging were especially problematic: the
OAMI needed additional algorithms to determine whether the content was compatible with
reuse on Wikimedia Commons.
For example, in some instances, the license URI specified a “CC-BY” license, but the
human-readable text contradicted it, adding an extra non-commercial clause:
<license license-type="open-access"
xlink:href="http://creativecommons.org/licenses/by/2.5/">
<p>Re-use of this article is permitted in accordance with the Creative Commons Deed,
Attribution 2.5, which does not permit commercial exploitation.</p>
</license>
In other examples, articles had license information outside of any <license>
element:
<permissions>
<copyright-statement> Uosaki et al. This is an open-access article distributed under the
terms of the Creative Commons Attribution License, which permits unrestricted use,
distribution, and reproduction in any medium, provided the original author and
source are credited. </copyright-statement>
<copyright-year>2011</copyright-year>
</permissions>
Other articles were found to have different, contradictory license URIs within the
<permissions> element.
Similar problems arose in attempts to automatically extract metadata related to the
article and the accompanying media files, particularly in the tagging of subjects,
keywords and captions, the media types of those accompanying files, and other areas.
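The license inconsistencies illustrated above (a license URI contradicted by the human-readable text, license information that appears only in prose, and multiple differing license URIs) can be detected with fairly simple heuristics. The following JavaScript is a minimal sketch of such a check. It is not the OAMI's actual algorithm; the function name, the issue labels, and the regular expressions are all assumptions made for illustration, and it expects the license URIs and the license text to have already been extracted from the <permissions> element.

```javascript
// Illustrative heuristics only; these patterns are assumptions, not OAMI code.
const CC_BY_URI = /creativecommons\.org\/licenses\/by\/\d/i;
const NONCOMMERCIAL_TEXT = /non-?commercial|does not permit commercial/i;
const CC_MENTION = /creative commons/i;

// uris: license URIs found in <permissions>; text: the human-readable
// license or copyright statement. Returns a list of issue labels.
function findLicenseIssues({ uris = [], text = "" }) {
  const issues = [];
  const distinct = new Set(uris);

  // More than one distinct license URI within the same <permissions>.
  if (distinct.size > 1) {
    issues.push("multiple-license-uris");
  }
  // URI says plain CC BY, but the prose adds a non-commercial restriction.
  if (uris.some((u) => CC_BY_URI.test(u)) && NONCOMMERCIAL_TEXT.test(text)) {
    issues.push("text-contradicts-uri");
  }
  // No machine-readable URI at all, only a license mentioned in prose.
  if (distinct.size === 0 && CC_MENTION.test(text)) {
    issues.push("license-only-in-text");
  }
  return issues;
}
```

Applied to the first example above (a `by/2.5/` URI with text that "does not permit commercial exploitation"), this sketch would report the `text-contradicts-uri` issue; applied to the second, which has no license URI, it would report `license-only-in-text`.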
These problems led to a paper[MIET01] that was
presented at JATS-Con 2014 and triggered a call to action for the development of best
practices for tagging in JATS in a way that improves reusability.[BECK01]
JATS for Reuse—JATS4R
In June 2014, a group of publishers and aggregators met in Cambridge, UK, to discuss
JATS reusability issues. The group formed as "JATS for Reuse" and decided to publish
best tagging practices recommendations to improve the reusability of JATS-tagged article
content. The meeting resulted in a prioritization list of topics (elements). A
website[J4R01] and a public mailing
list[J4R02] soon followed.
The OAMI work revealed that many other JATS tags are used inconsistently, for example
those concerned with mathematical formulas, affiliations and contributor roles.
In January 2015, the group was expanded to include more publishers, representation
from online hosts, and also other interested parties such as content processing vendors.
The group is open to anyone interested in the creation of content in XML format using
the JATS DTD. Current JATS4R endorsing parties are listed on the JATS4R website.
As of this writing, the group has issued tagging recommendations in two areas:
Permissions and Licenses[J4R03] and
Mathematics.[J4R04] When the group formalizes a
recommendation, we encode the rules in Schematron. The Schematron files are available
from the GitHub repository.[J4R05]
Tool Details
To facilitate checking instance documents against the JATS4R rules without requiring
journal production editorial staff to set up their own Schematron validation service, we
have implemented a validation tool hosted at http://jats4r.org/validator/. When
run against an instance document, the tool parses the XML, validates it against the
correct NLM or JATS DTD, if one is specified, checks it against the JATS4R Schematron
rules, and presents a report to the user. The overall data flow of this tool is depicted
in Figure 1.
XML Parsing and DTD Validation
Saxon-CE[SAX01] uses the browser’s native XML
parser, which, for most browsers, does not read the DTD (if any) specified in the
doctype declaration. This resulted in the tool producing an error if the instance
document included entity references defined in the DTD. Therefore, a separate tool
was required to parse the documents using the DTD, and resolve those named entity
references with their corresponding replacement text. To accomplish this, we have
incorporated a JavaScript port of xmllint.
xml.js, a project on GitHub, is a port of libxml, including xmllint, to JavaScript
using Emscripten. Emscripten is a free tool for compiling C and C++ into optimized
JavaScript code. We forked the xml.js project, and made some changes to improve the
API.
Within the validator.js module, running on the client, xmllint is invoked to parse
and validate the instance document. The DTD is passed into the function via the
`schemaFiles` argument. The JavaScript implementation of xmllint uses the SYSTEM
identifier in the doctype declaration to find the DTD, and the validator ensures that
the dtd_filename variable matches that SYSTEM identifier.
NLM and JATS DTDs
As alluded to in the previous section, the JavaScript implementation of xmllint
does not use OASIS catalogs, and has no way of looking up the correct DTD via the
PUBLIC identifier in the doctype declaration. Therefore, before the XML is parsed,
the validator uses a regular expression to check for the presence of a doctype
declaration. If one is present, it extracts the PUBLIC and SYSTEM identifiers. It
uses the PUBLIC identifier to determine which specific NLM or JATS DTD to use to
parse the file (there are currently 62 variants). It then also records the SYSTEM
identifier as the aforementioned dtd_filename variable, and passes it into xmllint,
which dereferences that name to get the DTD contents.
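The doctype handling described above can be sketched as follows. This is a minimal illustration and not the validator's actual code: the function name `parseDoctype`, the regular expression, the lookup table, and the flattened-DTD path are all assumptions. Only one of the 62 variants is listed.

```javascript
// Assumed pattern: matches a doctype declaration that carries both a PUBLIC
// and a SYSTEM identifier, each quoted with either single or double quotes.
const DOCTYPE_RE =
  /<!DOCTYPE\s+[^\s\[>]+\s+PUBLIC\s+(?:"([^"]*)"|'([^']*)')\s+(?:"([^"]*)"|'([^']*)')/;

// Returns { publicId, systemId }, or null if no such declaration is found.
function parseDoctype(xmlText) {
  const m = DOCTYPE_RE.exec(xmlText);
  if (!m) return null;
  return {
    publicId: m[1] !== undefined ? m[1] : m[2],
    systemId: m[3] !== undefined ? m[3] : m[4],
  };
}

// Hypothetical lookup table from PUBLIC identifier to a flattened DTD file.
// The path on the right is invented for this example.
const DTD_BY_PUBLIC_ID = {
  "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN":
    "dtds/jats-archiving-1.0-flat.dtd",
};

// Chooses the DTD by PUBLIC identifier and records the SYSTEM identifier as
// the name under which the DTD contents are handed to xmllint.
function selectDtd(xmlText) {
  const doctype = parseDoctype(xmlText);
  if (!doctype) return null; // no doctype: skip DTD validation
  const dtdPath = DTD_BY_PUBLIC_ID[doctype.publicId];
  if (!dtdPath) return null; // unrecognized PUBLIC identifier
  return { dtdPath, dtdFilename: doctype.systemId };
}
```

Using the SYSTEM identifier as the file name means the document itself does not have to be rewritten before parsing; only the mapping from name to DTD contents is supplied by the validator.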
Because this implementation of xmllint has no direct access to a filesystem or to
the web, it was important that the DTDs be “flattened”. JATS, and the NLM DTDs
before them, are designed modularly; therefore, each variant of the DTD comprises
many individual files. We extended the NCBI DtdAnalyzer[NCBI01] by adding a new
utility, dtdflatten, and used it to produce flattened, single-file versions of all 62
variants of the NLM and JATS DTDs. That processing is done offline, when the
validator is deployed. We also took the NLM and JATS distributions from the NCBI FTP
site, normalized them a bit, and made them available on GitHub in two
repositories, nlm-dtd and niso-jats.
Schematron Representation of JATS4R Recommendations
As described above, the JATS4R recommendations are encoded into schema files in
Schematron format. The recommendations are broken down by two categories:
level - either errors, warnings, or info
topic - math or permissions (currently; more to come)
The Schematron rules for each of the combinations of level and topic are
encapsulated in their own files. This is shown schematically in Figure 2. The table
in the lower-right shows each of the (six) combinations of level and topic, and each
table cell corresponds to a Schematron file.
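The level-by-topic organization can be pictured as a small matrix in which each cell is one Schematron file, and a "phase" selects one row or one column. The sketch below enumerates that matrix. The file-naming pattern `topic-level.sch` is an assumption for illustration only; the actual file names in the repository may differ.

```javascript
const LEVELS = ["errors", "warnings", "info"];
const TOPICS = ["math", "permissions"];

// Hypothetical naming pattern for the per-cell Schematron files.
function schematronFile(topic, level) {
  return `${topic}-${level}.sch`;
}

// One entry per (topic, level) combination: six cells in total.
function buildMatrix() {
  const cells = [];
  for (const topic of TOPICS) {
    for (const level of LEVELS) {
      cells.push({ topic, level, file: schematronFile(topic, level) });
    }
  }
  return cells;
}

// Selecting by "level" yields one file per topic; selecting by "topic"
// yields one file per level.
function filesForPhase(kind, value) {
  return buildMatrix()
    .filter((cell) => cell[kind] === value)
    .map((cell) => cell.file);
}
```

Under this layout, adding a new topic adds one file per level, and no existing file needs to change.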
There are two "master" Schematron files which break down the tests in two
different ways:
jats4r-level.sch - phases for “errors”, “warnings”, and “info”
jats4r-topic.sch - phases for “math” and “permissions”
Another "master" Schematron file, which determines conformance or non-conformance
of an instance document, is jats4r.sch. This includes all topics, but only the
“error” level tests, and is equivalent to using the “errors” phase of
jats4r-level.sch.
Schematron Offline Processing
The generated-xsl subdirectory contains XSLT 2.0 files that have been generated from
the Schematron files, using the process-schematron.sh script. This script uses
conversions available from http://schematron.com.[SCH01]
This generates separate XSLTs: one for each level and one for each topic. A
level-specific XSLT (for example, jats4r-level-errors.xsl) includes the rules for
every topic; and conversely, a topic-specific XSLT (for example,
jats4r-topic-math.xsl) contains the rules for all of the levels.
The validator has an option box that allows the user to select one of the values
for level: errors, warnings, or info. Topic-specific validation using the
client-side validator is not available at this time, but could be added easily if
there is a demand for it.
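Mapping the user's level selection to one of the pre-generated stylesheets could look like the following sketch. The function name and the default base path are assumptions. The file-name pattern is extrapolated from the one example given above (jats4r-level-errors.xsl), so the names for the other levels are likewise assumed.

```javascript
const VALID_LEVELS = ["errors", "warnings", "info"];

// Returns the URL of the level-specific XSLT for the selected level.
// baseUrl defaults to the generated-xsl subdirectory (relative URL assumed).
function xsltUrlForLevel(level, baseUrl = "generated-xsl/") {
  if (!VALID_LEVELS.includes(level)) {
    throw new RangeError(`Unknown level: ${level}`);
  }
  return `${baseUrl}jats4r-level-${level}.xsl`;
}
```

Topic-specific validation could be supported in the same way, with a second function that builds names on the pattern of jats4r-topic-math.xsl.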
Client-side Schematron Validation
The validator runs the instance document through the appropriate XSLT, which
generates a report in Schematron Validation Report Language XML (SVRL).
The validator code invokes Saxon, passing the URL of the appropriate XSLT file.
The results, in SVRL format, are converted into an HTML report using a separate XSLT
transformation.
The HTML report is then inserted by Saxon-CE into the HTML DOM, and thus presented to
the user.
Future work
Validation against RELAX NG and/or XSD versions of JATS
This should also be possible without major modifications to the software, since
xmllint supports these types of validation.
Port to a server-side validator
Despite our success with implementing this tool on the client, we’ll probably
also implement a server-side validator. The main reason for wanting to do this is so
that it can be deployed as a web service.
[MIET01] Mietchen D, Maloney C, Moskopp ND. Inconsistent XML as a
Barrier to Reuse of Open Access Content. In: Journal Article Tag Suite Conference
(JATS-Con) Proceedings 2013 [Internet]. Bethesda (MD): National Center for Biotechnology
Information (US); 2013. Available from: http://www.ncbi.nlm.nih.gov/books/NBK159964/.
[NCBI01] Hess D, Maloney C, Hamelers A. DtdAnalyzer: A tool for
analyzing and manipulating DTDs. In: Journal Article Tag Suite Conference (JATS-Con)
Proceedings 2012 [Internet]. Bethesda (MD): National Center for Biotechnology
Information (US); 2012. Available from: http://www.ncbi.nlm.nih.gov/books/NBK100354/.