Morrissey, Sheila, John Meyer and Sushil Bhattarai. ““Be in the Room Where It Happens”: Digital Preservation at Portico and the JATS Ecosystem.” Presented at Symposium on Markup Vocabulary Ecosystems, Washington, DC, July 30, 2018. In Proceedings of the Symposium on Markup Vocabulary Ecosystems. Balisage Series on Markup Technologies, vol. 22 (2018). https://doi.org/10.4242/BalisageVol22.Morrissey01.
Symposium on Markup Vocabulary Ecosystems July 30, 2018
Balisage Paper: Be in the Room Where It Happens
Digital Preservation at Portico and the JATS Ecosystem
Sheila Morrissey is Senior Researcher at ITHAKA, where her role is to provide technological
perspective in researching the impact of the digital transition on the scholarly communications
ecosystem, on the sustainability of digital resources, on the scholarly use of digital
resources, on digital infrastructure in support of teaching and learning, and on collaborative
development of the technical infrastructure of the library of the future. Sheila has
worked on ITHAKA's Portico digital preservation service and has written extensively
on the complex interactions between digital formats and their mediating software,
as well as on the often subtle manner in which software engineering practice complicates
the use and intelligibility of digital artifacts, both in the present, and over the
very long term.
John Meyer is Director, Production and Content Technology, at ITHAKA. Since 2005,
John has played a key role in shaping and implementing technologies for managing content
streams into the Portico archive, from over 500 scholarly publishers, comprising over
29,000 journals, 1.2 million e-book titles, and 187 digital collections, in over 1000
XML and SGML vocabularies. He was a member of the NLM Working Group, and is a member
of the JATS Working Group, and has been a past contributor to Balisage and to JATS-CON.
Sushil Bhattarai is the Manager, Content Technology Group at ITHAKA. He is responsible
for the development and maintenance of software systems performing validation and
conversion of scholarly publications in electronic form that are being preserved in
the Portico archive. He is a past contributor to Balisage and JATS-CON.
Institutions such as Portico that are engaged in ensuring that the digital record
of our time is accessible, usable, discoverable, and verifiable for the very long
term continually face the challenge of processing and managing content at very large
scales, often with minimal, and sometimes diminishing, resources to accomplish the
task.
A key resource in meeting the challenge of preserving born-digital and digitized scholarly
literature has been the NLM and JATS standards, and the community of practice centered
on those standards. We will be talking about our shared experience in developing
those standards: what motivated our participation, what benefits we have seen, and
what challenges we still face.
Standardization is a social process by which humans come to take things for granted.
Through standardization, inventions become commonplace, novelties become mundane,
and the local becomes universal. It is a historical, and therefore contested process
whose success depends upon the obfuscation of its founding conflicts and contingencies.
Successful standards, if they are noticed at all, simply appear as authoritative,
objective, uncontroversial, and natural. Standards are, as other scholars have noted,
recipes for reality whose black boxes are rarely opened and whose subjectivity and contingency are rarely
revealed.
No one really knows how the game is played
The art of the trade
How the sausage gets made
We just assume that it happens
But no one else is in
The room where it happens.
Institutions such as Portico that are engaged in ensuring that the digital record
of our time is accessible, usable, discoverable, and verifiable for the very long
term, continually face the challenge of processing and managing content at very large
scales, often with minimal, and sometimes diminishing, resources to accomplish the
task.
A key resource in meeting the challenge of preserving born-digital and digitized scholarly
literature has been the NLM and JATS standards, and the community of practice centered
on those standards. This paper discusses our shared experience in developing those
standards: what motivated our participation, how we participated in the evolution
of JATS, what benefits we have seen, and what challenges we still face.
Motivations
What is Portico?
Portico is a community-supported digital preservation service for electronic journals, books,
and other content. Portico is a service of ITHAKA, a not-for-profit organization dedicated to helping the academic community use digital
technologies to preserve the scholarly record and to advance research and teaching
in sustainable ways. Portico understands digital preservation as the series of management
policies and activities necessary to ensure the enduring usability, authenticity,
discoverability, and accessibility of content over the very long term.
Portico serves as a permanent archive for the content of, at present, 553 publishers
(from 57 countries, and on behalf of over 2000 learned societies and associations),
with 29,068 committed electronic journal titles, 1,242,793 committed e-book titles,
and 187 committed digitized historical collections. The archive currently contains
over 93 million archival units (journal articles, e-books, etc.), comprising over
1.5 billion preserved files. Portico is sustained by the support of over 1000 libraries
in 22 countries.
Portico functions as a “dark archive”. While a limited number of credentialed users
at both depositing publishers’ and subscribing libraries’ sites can access the archive
content via an audit interface, subscribers to content in the archive generally continue
to access that content at the publishers’ host sites. Participating libraries, including
their students, faculty, and staff, gain direct access to archived content when specific
conditions or "trigger events" occur which cause titles no longer to be available
from the publisher or any other source.
From a technical perspective, the Portico archive is designed to preserve content
in an application-neutral manner. Each archived object is packaged in a ZIP file (more
precisely, in a ZIP file conforming to the BagIt specification), with all original publisher-provided files, along with any Portico-created digital
artifacts and metadata associated with the object. For each journal article in the
archive, for example, Portico preserves all original publisher-provided digital artifacts,
including PDF page images, along with any Portico-created digital artifacts associated
with the item. These latter include structural, technical, descriptive, and provenance
metadata, and a normalization of the publisher-provided SGML or XML journal article
files to JATS. The entire archive can be reconstituted as a file system object, using
non-platform-specific readers, completely independent of the Portico archive system.
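As a rough illustration of why this kind of packaging can be read back without the archive system, the sketch below checks a bag's payload files against a checksum manifest using only the Python standard library. It assumes a BagIt-style layout with a `manifest-sha256.txt` file and a `data/` payload directory. The helper names (`sha256_of`, `verify_manifest`, `build_demo_bag`) and the demo file are hypothetical, and nothing here describes Portico's actual tooling or package contents.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_dir: Path) -> List[str]:
    """Return the payload paths that are missing or whose checksum differs."""
    failures = []
    manifest = bag_dir / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, relative_path = line.split(maxsplit=1)
        target = bag_dir / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures


def build_demo_bag(root: Path) -> Path:
    """Create a tiny, hypothetical bag with one payload file (demo only)."""
    bag = root / "demo-bag"
    (bag / "data").mkdir(parents=True)
    article = bag / "data" / "article.xml"
    article.write_text("<article/>", encoding="utf-8")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        f"{sha256_of(article)}  data/article.xml\n", encoding="utf-8"
    )
    return bag


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = build_demo_bag(Path(tmp))
        print(verify_manifest(demo))
```

Because the check relies only on plain files and a widely implemented hash function, the same verification could be repeated with any other tool that reads the manifest format.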
Why standards in Preservation?
One of the three key aims of all standards, as Andrew Russell has noted [Russell], is compatibility. Digital preservation practitioners’ shorthand for what they
do is interoperability with the future – that is to say, compatibility over time [Paskin]. The preservation of digital artifacts is a relatively new endeavor, with, by definition,
an always-receding goal. Practitioners must act in the present to make provision
for unanticipated, perhaps not-yet-existing uses and contexts for those preserved
artifacts.
Standards are a hedge against this inherent uncertainty. They provide a means of
uncoupling content from single-vendor or proprietary tools or formats [Morrissey JEP]. And, potentially at least, they mitigate the risks associated with that uncertainty.
As Portico’s first CTO, Evan Owens, has pointed out, interoperability with the present
is at least a good first step towards ensuring interoperability with the future [Owens]. So digital preservation practitioners construct their repositories, services, and
systems informed by many different standard frameworks. A key framework is the Reference
Model for an Open Archival Information System (OAIS) [OAIS] and its companion standard for audit of trustworthy digital repositories, ISO 16363:2012, derived from the Center for Research Libraries (CRL) Trustworthy Repositories Audit
and Certification (TRAC) standard and checklist [TRAC]. National libraries and consortia of institutions engaged in preservation maintain
best-practice checklists for such things as archive replication (number of copies,
storage types), the frequency and algorithms for content check-summing, and criteria
of choice and recommendations for use of file formats. A great deal of work has gone
into defining what metadata (in OAIS terms, “representation information”) are essential
for ensuring there is sufficient context to make use of digital objects over the very
long term, as well as providing sufficient, and sufficiently reliable, provenance
to ensure the authenticity of those objects. Key instances here are PREMIS and METS, both of which have defined XML schemas. There is a considerable body of work on the
specification of persistent identifier schemes, including identifiers for digital
objects, people, and institutions, and on the community- and institution-building
necessary to ensure that, first, those identifiers are used, and second, that they
in fact continue to be persistently resolvable.
Why JATS in Preservation?
Portico’s original remit, fifteen years ago, was to develop a sustainable repository
for electronic scholarly journals, by acquiring publishers’ “original materials”,
from which various manifestations in print and online are derived, and rationalizing,
managing, and preserving those content streams. As Evan Owens noted [Owens], quite apart from issues such as non-standard practice in naming, packaging, handling
of author-supplied supplementary materials, the use of persistent identifiers, and versioning
of content:
Journal publishing models are still evolving: after ten years of delivery of e-journals
on the web, there is still wide variation in practice and online PDF and online HTML
are many; in effect, an e-journal article is a work with multiple “manifestations.”
This makes preservation an interesting challenge, particularly when the manifestation
delivered via the web (the HTML) is a subset of richer content and information resources
that exist behind the scenes, as it were.
The richer content and information resources were to be captured and preserved by Portico, who would then provide a “normalized”
view across all this variegated content for discovery, presentation, and archive management.
The tactic employed to accomplish this normalized view was the migration of publisher-provided
article metadata to a common journal article metadata vocabulary (while, of course,
retaining and preserving the original metadata supplied by the publisher).
Evolution
As it developed, there were a great many people in the room where, first, the NLM
Archiving and Interchange DTD, and, then, the JATS standard happened. And, with all
due respect to poor Aaron Burr, the people in the room, both at the time, and after,
have been very happy to detail how the sausage was being made. (For some samples,
see, first of all, the minutes of the NLM Working Group, the history and description of NLM and JATS by Jeff Beck [Beck JEP, Beck Balisage], and many reports in the JATS-Con Proceedings, which inform the narrative below.)
Both Portico and the NLM/JATS vocabulary share a common origin in a 2001-2002 set
of e-journal archive planning projects funded by the Andrew W. Mellon Foundation [Cantara]. One of the projects, at Harvard University Library (HUL), and including Blackwell
Publishing, the University of Chicago Press, and John Wiley and Sons, undertook to
investigate the issues presented by dynamic e-journals – ones whose content changes frequently. The HUL project report [HUL Report] effectively served as a blueprint when, supported by Mellon Foundation funding,
Portico (originally called the JSTOR Electronic Archive Project) was founded in 2003.
As part of its investigation, HUL commissioned Inera’s Bruce Rosenblum to study the
feasibility of creating a common e-journal archival DTD. One of the key artifacts
of the HUL project was Inera’s report [Inera], which, having surveyed 10 DTDs (from primarily scientific, technological, engineering,
and medical publishers), anatomized the key components and characteristics of what
a common archival DTD might be, and what likely issues would be encountered in transformations
from publisher-specific vocabularies to a single common vocabulary.
Intertwined with these developments in Portico’s history were developments at the National
Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM).
As Jeff Beck has described [Beck JEP], PubMed Central (PMC) was founded in 2000 in order to take full-text article submissions
from publishers and make them available through the PubMed Central database. As content
from more and more publishers was submitted, it was clear that the original PMC DTD
was insufficiently expressive to handle all the elements in incoming content. NCBI
engaged Mulberry Technologies to review the PMC DTD, and assist in the design of a
replacement. By this time, the HUL/Inera report was available. Its analysis and recommendations
were incorporated in the developing pmc-2.dtd.
That second-generation PMC DTD was submitted to Bruce Rosenblum for review, and was
the basis for a 2002 meeting with NCBI/NLM, HUL, the Mellon Foundation, Mulberry Technologies,
and Inera to formulate the project that would adapt pmc-2.dtd to a new DTD suitable
for general use for archiving any electronic journal article: the NLM Archiving and
Interchange Tag Suite. Version 1 of the NLM DTD, released in December, 2002, was
the outcome of that collaboration.
In 2003, the NLM Working Group was formed, hosted by NCBI, with participants (some
only occasionally) from Microsoft, BioMed Central, American Physical Society (APS),
Data Conversion Laboratories (DCL), IEEE, Portico, HighWire Press, Public Library
of Science, Mulberry Technologies (who served as secretariat), California Institute
of Technology, Griffin Brown, Cadmus, and Inera. The working group refined the tag
suite, shepherding it through Version 3, released in 2008. In 2009, the NLM Working
Group morphed into a formal NISO working group (with many continuing members), and
in August 2012 released NISO JATS Version 1.0. JATS 1.1 was released in November,
2015.
Both the NLM and NISO working groups have sought broad-based input to, and comment
on, proposed developments in the standards. Mulberry administers the JATS listserv.
Since 2010, JATS-CON has been hosted at NLM. The NLM maintains extensive documentation
about both NLM and JATS tag suites, including the NLM working group notes.
Benefits
Any markup tag scheme is an interpretation. Though the conventions and conventional
structures of the scholarly journal article are long established and broadly understood,
their detailed expression has been articulated and elaborated in many ways over the
more than two-decade long print-to-digital transition in scholarly publishing. As
an indication, for example, in its 15 years, Portico has processed over 1000 different
tag sets.
While accomplishing its goal of normalizing these vocabularies to a single tag set, Portico has to ensure that its transformation
of incoming content to a normalized format does not distort the original interpretation in the source document. This requirement for fidelity means, in turn, that broad
participation, key to the development of any consensus standard, was absolutely essential
for the formulation and refinement of an archival tag suite.
So the standards process itself was a first crucial benefit to Portico. It was not
just that there were lots of others in the room where it happened. Many of these people, besides representing their own institutions, brought broad-based,
in-depth experience with many publisher vocabularies and practices from beyond their
institution. Often their institutions were themselves aggregators of content from
many sources, in many vocabularies. As the Working Group meeting notes reflect, these
participants, as did Portico, brought to the working group meetings specific examples
of content that raised various issues, both philosophic and pragmatically gritty (boiler
plate text versus implied content, reference tagging, semantic significance of display elements such as bold and italic). These concrete examples challenged the working
group in developing and refining not just the tag set, but also well-articulated rationales
for the choices made, and rich documentation and detailed examples for guidance and
use.
The NLM DTD has modularization capabilities and a recommended process for creating a
custom profile of the tag set. In earlier years, Portico used a customization of
the earlier versions of the NLM DTD to implement normalization policies that had not
yet been incorporated into whatever was the current version of the public tag set
[Morrissey, Meyer, Bhatterai et al.]. For example, Portico transformations generate (sometimes boilerplate) text or
punctuation, titles, and labels that are only implicit in publisher markup, but that
appear in the display version of an article. As this is Portico-generated, rather
than publisher-supplied, content, there was a need for some form of markup to make
the origin of that content explicit, wherever in the document it occurs. Portico
created its customization, but at the same time shared its specific use cases and
examples with the working groups. Portico currently uses the latest version of the
JATS DTD without any customization to express these and other use cases.
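To make the idea of marking generated content concrete, here is a minimal sketch that inserts display punctuation between keywords and wraps each inserted separator in its own element, so that it stays distinguishable from publisher-supplied text. It assumes, purely for illustration, that the JATS `<x>` (generated text and punctuation) element carries that distinction; which markup Portico's transforms actually emit is not specified here, and the function name and sample keywords are hypothetical.

```python
import xml.etree.ElementTree as ET


def mark_generated_separators(kwd_group_xml: str, separator: str = ", ") -> str:
    """Insert an <x> element holding generated punctuation between <kwd> elements."""
    group = ET.fromstring(kwd_group_xml)
    children = list(group)
    # Detach the existing children so the group can be rebuilt in order.
    for child in children:
        group.remove(child)
    seen_keyword = False
    for child in children:
        if child.tag == "kwd":
            if seen_keyword:
                # Generated, not publisher-supplied: keep it in its own element.
                generated = ET.SubElement(group, "x")
                generated.text = separator
            seen_keyword = True
        group.append(child)
    return ET.tostring(group, encoding="unicode")


if __name__ == "__main__":
    source = "<kwd-group><kwd>preservation</kwd><kwd>JATS</kwd></kwd-group>"
    print(mark_generated_separators(source))
```

Keeping the generated text in a dedicated element means a later reader can strip it out and recover exactly what the publisher supplied.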
Others besides Portico, of course, also use one or another of the NLM/JATS/BITS tag
sets. Of those 1079 formats processed by and archived in Portico, 420 are from that
format family. While this content still requires some amount of normalization, the
effort to create normalizing XSL transforms is considerably less than for content
in other vocabularies. Typically, configuring normalization tools for header-only
content in the JATS family takes one-quarter to one-half the time required for proprietary header-only
content. Configuring tools for full-text articles marked up with JATS takes one-half to two-thirds
the time required for proprietary full-text content. These savings in resources are extremely
important for the sustainability of the not-for-profit Portico archive over the very
long term.
Though not, strictly speaking, a mark-up issue, the community of practice and discussion
fostered by the development of NLM and JATS, including with such groups as JATS4R, has in turn fostered an ecosystem of standards (such as those for the “packaging”
of supplementary journal article materials) and practices (especially for correct
and consistent use of persistent identifiers for digital objects such as articles
and data sets, for authors, for funders, for institutions) – all crucial components
for creating and maintaining the accessible, discoverable, navigable, and, in many
cases, machine-actionable context that ensures digital artifacts will retain their
meaning over the very long term.
Challenges
Few will be surprised to hear that no tag set, however elegantly crafted, by however
broad-based and well-informed a consensus, solves all problems in the interchange
of information between the systems of any two or more institutions.
There is inevitable variation in the use of these tag sets. The effort described
above to make JATS-to-JATS transformations at Portico is a sure indication that we
have not closed the interoperability gap between syntax and semantics. [Morrissey, Meyer, Bhatterai et al.]
While the JATS community certainly fosters best practices of many sorts, it can neither
ensure nor enforce them. We still see tag abuse. We still see content – including
JATS content – that is not well-formed and valid. Somewhat surprising to us is the
fact that roughly the same percentage (15-20%) of XML content is in some way defective
whether or not the provider has moved its processes to JATS.
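The first of these checks, well-formedness, can be sketched with the Python standard library alone. The function name below is hypothetical and the sample strings are invented for illustration. Validity against a DTD is a separate check that needs a validating parser, which the standard library's `xml.etree.ElementTree` does not provide, so this sketch does not attempt it.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def well_formedness_error(xml_text: str) -> Optional[str]:
    """Return None if the text parses as well-formed XML, else the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # A mismatched tag makes the second sample not well-formed.
    for sample in ("<article><front/></article>", "<article><front></article>"):
        print(repr(sample), "->", well_formedness_error(sample))
```

A check like this only establishes that the content can be parsed at all; it says nothing about whether the tags are used as the tag set intends.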
JATS is a living, evolving standard. Its tradition of broad-based, inclusive, responsive
development entails its own challenges, characteristic, as Andrew Russell has described
[Russell], of open standards development. He describes as well the characteristic potential
entropy of such processes. Tommie Usdin has described [Usdin] the particular pressures this has placed on the development of the JATS tag sets,
as well as guidelines and guidance for constructively meeting these pressures. This
process makes for some complications. Portico is receiving content from publishers
whose content conforms to a now superseded draft version (JATS V1.2D1) of the next,
upcoming release of JATS. Later draft versions are incompatible with that earlier
draft, as V1.2 is also expected to be when released.
Though JATS is elegant, flexible, and expressive, working natively in it is apparently still
not accessible to what the library community refers to as the long tail
of small, specialized, and typically non-STM scholarly journals. It is a markup scheme
developed by and large out of the experience and practice of relatively large-scale,
sophisticated publishers. Uptake by long-tail publications would require easily-accessible, broadly-used, and likely free authoring
tools that natively produce JATS.
Be in the room where it happens
As we have described, the articulation and on-going refinement of the NLM/JATS family
of tag sets by inclusive process, with broadly and deeply experienced participants,
has been crucial to Portico’s mission to preserve digital scholarly artifacts for
the long term.
Firmly committed to on-going active participation in this community, Portico hopes
sharing its experience and practice can help others in the community as we have been
helped and enriched by participation.
It has been our experience that we all benefit when all who wish to be are in the room
where JATS happens.
References
[Beck Balisage] Beck, Jeff. Report from the Field: PubMed Central, an XML-based Archive of Life Sciences Journal
Articles. Presented at International Symposium on XML for the Long Haul: Issues in the Long-term
Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the
Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). [online] [cited 27 July 2018]
doi:https://doi.org/10.4242/BalisageVol6.Beck01
[Cantara] Cantara, Linda, Ed. Archiving Electronic Journals: Research Funded by the Andrew W. Mellon Foundation
Edited, with an Introduction, by Linda Cantara, Indiana University. The Digital Library Federation Council on Library and Information Resources Washington,
DC. 2003. [online] [cited 27 July 2018] https://docplayer.net/19614542-Archiving-electronic-journals.html
[OAIS] Consultative Committee for Space Data Systems (CCSDS). Reference Model for an Open Archival Information System (OAIS). Recommended Practice, Issue 2 CCSDS 650.0-M-2. June 2012. [online] [cited 27 July
2018] https://public.ccsds.org/pubs/650x0m2.pdf
[HUL Report] Harvard University Library Mellon Project Steering Committee. Report on the Planning Year Grant for the Design of an E-journal Archive. Presented by: Harvard University Library Mellon Project Steering Committee Harvard
University Library Mellon Project Technical Team To: The Andrew W. Mellon Foundation.
April 1 2002. [online] [cited 27 July 2018] http://old.diglib.org/preserve/harvardfinal.html
[Inera] Inera, Inc. E-Journal Archive DTD Feasibility Study. Prepared for the Harvard University Library, Office of Information Systems, E-Journal
Archiving Project. 2001. [online] [cited 27 July 2018] http://old.diglib.org/preserve/hadtdfs.pdf
[Morrissey, Meyer, Bhatterai et al.] Morrissey S, Meyer J, Bhattarai S, et al. Portico: A Case Study in the Use of the Journal Archiving and Interchange Tag Set
for the Long Term Preservation of Scholarly Journals. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.
[online] [cited 27 July 2018] https://www.ncbi.nlm.nih.gov/books/NBK47087/
[Owens] Owens, E. Digital Preservation and Electronic Journals. Library and Information Services in Astronomy V: Common Challenges, Uncommon Solutions. ASP Conference Series, Vol. 377, proceedings of the conference held 18-21 June 2006
in Cambridge, Massachusetts, USA. Edited by Sandra Ricketts, Christina Birdie, and
Eva Isaksson., p.277. October 2007. [online] [cited 27 July 2018] http://www.aspbooks.org/publications/377/277.pdf
[Russell] Russell, Andrew L. Open Standards and the Digital Age: History, Ideology, and Networks. New York: Cambridge University Press, 2014. [Print]
[Usdin] Usdin B. T. Fighting the ‘Inevitable’ Expansion of the JATS Tag Sets. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2018 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2018.
[online] [cited 27 July 2018] https://www.ncbi.nlm.nih.gov/books/NBK493526/
Journal Article Tag Suite Conference (JATS-Con) Proceedings [Internet]. Bethesda (MD):
National Center for Biotechnology Information (US); 2010-. [online] [cited 27 July
2018] Available at https://www.ncbi.nlm.nih.gov/books/NBK65129/
[Paskin] Paskin, Norman. Digital Object Identifiers. ICSTI Seminar: Digital Preservation of the Record of Science, Feb 14/15 2002. [online]
[cited 27 July 2018] http://www.doi.org/topics/020210_CSTI.pdf
Author's keywords for this paper:
Digital Preservation; JATS; NLM; Standards; Portico