Introduction
The initial spark for this study was a discussion on the JATS mailing list [JATS-list 2020], when someone opposed the “watering down” of Blue, that is, vocabulary items that have only been available in Green are moving to Blue and there are multiple allowed ways to tag a given piece of content. Typically the requests to amend Blue go in the other direction, that is, to add items. But also the JATS dogma of backwards compatibility tends to favor additions as opposed to deletions of vocabulary items. “We have bent over backwards to be backward compatible for years!” [Lapeyre 2019]
Restricting choices for tagging content can be done by restricting the vocabulary or by limiting choice and optionality, depending on context. Apart from documenting allowed use and restrictions, the latter can be achieved more robustly by adapting the DTD itself or by superimposing Schematron rules [Schwarzman 2017], as JATS4R [JATS4R] does, for example.
Although subsetting a DTD may also impose context-dependent restrictions such as which elements may be present in which order, which attributes may be present and which values they may assume, this work views customizations only as a restriction of the vocabulary, that is, the elements and attributes that are permitted by the subset anywhere in the document. Although this is a very coarse and limited approach to customizations, a reduced vocabulary can be beneficial in several ways; in particular it reduces the effort necessary to develop and maintain renderers and editors for the given vocabulary. The JATS authoring subset (“Pumpkin”) has a significantly reduced vocabulary, but it is not designed for publishing since it omits most of the metadata vocabulary. Quoting [Lapeyre 2019]:
My Favorite Proposed Change
Make a new simplified JATS subset
Specifically designed for schema-driven editing and tool use
Intended audience: journal production workers and editors
For example, it will include journal metadata
Such an attempt has already been made for the Texture editor [Texture 2017]. The proposed customization removed 228 elements and 147 attributes from Green, among them question/answers, index terms, and MathML. As will be discussed later, the authors decided to treat the entirety of MathML as a single vocabulary item. Doing so reduces the number of elements omitted by Texture to 36 and the number of attributes to 30. This amounts to a vocabulary reduction of about 14%, far from the stated 40% reduction target.
The authors decided to analyze the actual usage frequencies of the JATS vocabulary items. They sent an email to the mailing list [JATS-list 2021, Reinhardt 2021a], calling on publishers either to supply representative full-text XML articles or to run an analysis XSLT on their articles that produces lists of the distinct element and attribute names used in the publications.
For a given collection of articles, the names of the elements and attributes that are used in the collection constitute a “de-facto customization.” This way, the authors hoped to be able to identify less popular vocabulary items and consider them for omission in a future “community consensus customization,” a synthetic customization that is both compact and well suited as a schema for creating, editing, and validating content, or as a starting point from which the other de-facto customizations can be derived without the need to add many items.
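The notion of a de-facto customization can be illustrated with a minimal Python sketch. It is not the authors' analysis XSLT; the function names and the two sample articles are hypothetical, and namespace prefixes are simply dropped, which is an assumption made for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Set, Tuple


def local_name(qname: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts before names."""
    return qname.rsplit("}", 1)[-1]


def vocabulary(xml_text: str) -> Tuple[Set[str], Set[str]]:
    """Return the distinct element and attribute names used in one article."""
    elements: Set[str] = set()
    attributes: Set[str] = set()
    for node in ET.fromstring(xml_text).iter():
        elements.add(local_name(node.tag))
        attributes.update(local_name(name) for name in node.attrib)
    return elements, attributes


def de_facto_customization(articles: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Union of the names used anywhere in a collection of articles."""
    all_elements: Set[str] = set()
    all_attributes: Set[str] = set()
    for xml_text in articles:
        elements, attributes = vocabulary(xml_text)
        all_elements |= elements
        all_attributes |= attributes
    return all_elements, all_attributes


# Two tiny, hypothetical articles standing in for a real collection.
sample_a = (
    '<article article-type="research-article"><front><article-meta/></front>'
    "<body><p>Text <italic>x</italic></p></body></article>"
)
sample_b = '<article><body><sec id="s1"><title>T</title><p/></sec></body></article>'

elements, attributes = de_facto_customization([sample_a, sample_b])
print(sorted(elements))
print(sorted(attributes))
```

The two resulting name sets correspond to the two lists (elements, attributes) that the study compares between collections and schemas.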
How does one measure well-suitedness though? Not using available vocabulary items will not hurt, therefore customization A will be better suited as a starting point to derive customization B from if A is a superset of B. So the best synthetic customization will be the union of all element and attribute names of all collections?
If all collected articles in fact used only the same 60% of the JATS vocabulary, the stated goal would have been reached trivially. It turns out, though, that the collected articles use approx. 82% of the JATS vocabulary.
In order to identify a smaller de-facto customization (for example the vocabulary used by a specific journal, a specific publisher, or a specific subject area) that may serve as a starting point from which many other customizations can be derived with as few vocabulary item additions as possible, the authors established a metric for the aptness of a given customization to serve as a starting point.
When considering vocabulary items on the basis of their popularity, one needs to be aware of the fact that items introduced in later JATS versions might not be widespread yet in published articles, or that other aspects such as promoting accessibility suggest that items be added to the consensus customization despite infrequent use in currently available articles. This is why the authors will suggest a minimal customization that comprises significantly more vocabulary items than the “naive minimal customization” that was determined as adequate for tagging many of the articles actually analyzed.
Sources
The authors have received or downloaded JATS articles from the following sources:
Table I
Source | Subject Area | Article Count |
---|---|---|
de Gruyter | STEM/HUM/ECON | 568 |
John Benjamins | HUM | n/a |
Optical Society of America | STEM | n/a |
Oxford University Press | n/a | n/a |
PMC comm_use A–B | STEM | 379,835 |
PMC non_comm_use O–Z | STEM | 249,341 |
PsychOpen | STEM | 45 |
Science Open | STEM/HUM | 210 |
The articles have been tagged according to diverse versions of JATS Blue or Green, sometimes with proprietary extensions.
When there is n/a in a table cell this means that the authors didn’t have access to full-text articles; therefore they were unable to count the articles or to classify them.
The subject classification was applied only roughly, to individual journals or to whole sources. The authors hoped that even such a rough classification would offer insights into varying tagging practices across the disciplines, but the analysis didn’t reveal large discrepancies. STEM (science, technology, engineering, mathematics and/or medicine) does use more vocabulary items than HUM (humanities and social sciences) or ECON (economics), but this is primarily due to the much larger sample size of STEM articles and to the wholesale categorization of all PMC articles as STEM, which might not be justified.
The PubMed Central bulk packages have been downloaded from their FTP site [PMC].
The Science Open input consists of open-access articles that have been downloaded randomly from scienceopen.com.
In the PMC archives, the authors ignored journals with 40 or fewer articles, which only marginally decreased the number of articles processed.
As schemas, the authors considered the JATS 1.3d2 Blue, Pumpkin, and Green customizations, and also the Green customization that provides OASIS tables. This DTD should be the ultimate superset, unless a publisher uses proprietary extensions. The authors also included the aforementioned Texture customization.
In addition, a synthetic schema “minimal” is included. It represents the proposed minimal Blue subset. It was only conceived after analyzing the collections and schemas according to the metrics and methods described in the next section, and after identifying an existing de-facto customization as a starting point.
A derivative of this minimal customization is the “naive minimal” customization that was prepared by omitting strategically important but infrequently used items from “minimal.”
Metrics and Methodology
“Supersetticity” and Aptness Metrics
A conceivable metric for the aptness of a given customization is to measure whether its vocabulary is sufficient to mark up 90% of all collected articles. Given that there are more than 1,000 de-facto customizations (when considering the vocabulary of each PMC journal as a de-facto customization on its own), the computational effort necessary to compare each of these 1,000 lists (one per journal) of several hundred items with approximately 630,000 lists (one for each article) of several hundred items seemed prohibitive.
Therefore the authors looked for a metric that was cheaper to apply while still making the adequacy of a customization measurable.
The difference is that each of the more than 1,000 customizations is compared not to each article, but to each of the other more than 1,000 customizations. If an item needs to be added to arrive at the other customization, this needs to be penalized more than if an item may be removed.
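To make the trade-off concrete, the following Python sketch shows what the per-article coverage check would look like and how the number of comparisons differs between the two approaches. The function name and the toy vocabularies are hypothetical; the counts use the round figures quoted above (1,000 customizations, 630,000 articles).

```python
from typing import FrozenSet, List


def article_coverage(customization: FrozenSet[str],
                     articles: List[FrozenSet[str]]) -> float:
    """Share of articles whose whole vocabulary is contained in `customization`."""
    covered = sum(1 for vocabulary in articles if vocabulary <= customization)
    return covered / len(articles)


# Toy data (hypothetical): four articles and one candidate customization.
articles = [
    frozenset({"article", "body", "p"}),
    frozenset({"article", "body", "p", "italic"}),
    frozenset({"article", "body", "sec", "title", "p"}),
    frozenset({"article", "body", "p", "index-term"}),
]
candidate = frozenset({"article", "body", "sec", "title", "p", "italic"})
share = article_coverage(candidate, articles)
print(f"{share:.0%} of the toy articles are covered")

# Rough number of list comparisons for each approach.
per_article_comparisons = 1_000 * 630_000  # every customization vs. every article
pairwise_comparisons = 1_000 * 1_000       # every customization vs. every other one
print(per_article_comparisons // pairwise_comparisons, "times fewer comparisons")
```

The pairwise approach needs a few hundred times fewer comparisons, at the price of measuring coverage of vocabularies rather than of individual articles.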
After four attempts at defining such a metric, the authors arrived at the following quantities and computation rules:
s |
The “supersetticity” defines the degree to which one customization is a superset of another. The maximum value is 2.0, which means that a customization is a complete superset of another. The supersetticity of a customization j with respect to a customization i is calculated as $s_{ij} = (1 + r_{ji}) / (1 + r_{ij})$, with $r_{ji} = a_{ji} / d_{ji}$ and $r_{ij} = a_{ij} / d_{ij}$, where $a_{ij}$ is the number of additions to j towards obtaining i, $a_{ji}$ is the number of additions to i towards obtaining j, and $d_{ij} = d_{ji} = a_{ji} + a_{ij}$ is the edit distance between i and j. |
q |
defines the aptness of a customization as a starting point for another customization. Being a superset of another customization should favor a customization. But of two customizations with the same supersetticity with respect to a third customization, the one with the smaller edit distance should be favored even more. Therefore we define $q_{ij} := 100 \cdot s_{ij} / d_{ij}$ as the aptness of customization j to serve as the starting point for deriving customization i.
Figure 1 shows that $q_{ij}$ will drop more quickly if items need to be added to j than if items need to be removed from j (= added to i in order to obtain j). It needs to be stressed that the q metric does not claim absolute truth: “the higher, the better the starting point in all circumstances” is not necessarily true, because other factors matter too; for example, it is easier to add the whole of MathML than to add MathML partially. The q function was modeled after the requirement that both a small edit distance and “subsetting over supersetting” should be honored. In some figures and tables the label q5 (or $q_5$) can be seen. The 5 stems from the fact that the metric described here was the authors’ fifth attempt at finding appropriate supersetticity and aptness metrics. |
p |
defines the percentage of aptness of a customization j to serve as a starting point relative to the best starting point’s aptness, which is arbitrarily set to 100. Let $q_{\max,i}$ be the maximum aptness of all other customizations k to serve as the starting point for i. Then the relative aptness of customization j with respect to i is $p_{ij} = 100 \cdot q_{ij} / q_{\max,i}$. |
Average p |
This value is shown for each customization in the last column of the table at sample-conf.xhtml. The individual p values for a given customization to serve as a starting point for all the other customizations can be read column-wise in the other, detailed table. These p values are averaged. The customization with the highest score is deemed to be most suitable to derive the other customizations from. |
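The quantities s, q, p, and average p can be sketched in a few lines of Python. This is an illustration of the formulas as stated above, not the authors' implementation. All function names and the three toy vocabularies are hypothetical, and returning `None` for two identical vocabularies (edit distance 0) is an assumption, since that case is not defined in the text.

```python
from typing import Dict, FrozenSet, List, Optional

Vocabulary = FrozenSet[str]


def supersetticity(i: Vocabulary, j: Vocabulary) -> Optional[float]:
    """s_ij: degree to which customization j is a superset of customization i."""
    a_ij = len(i - j)  # additions to j towards obtaining i
    a_ji = len(j - i)  # additions to i towards obtaining j
    d = a_ij + a_ji    # edit distance between i and j
    if d == 0:
        return None    # identical vocabularies: undefined here (assumption)
    return (1 + a_ji / d) / (1 + a_ij / d)


def aptness(i: Vocabulary, j: Vocabulary) -> Optional[float]:
    """q_ij = 100 * s_ij / d_ij: aptness of j as a starting point for i."""
    s = supersetticity(i, j)
    if s is None:
        return None
    return 100 * s / len(i ^ j)  # the symmetric difference has d_ij items


def average_relative_aptness(customizations: Dict[str, Vocabulary]) -> Dict[str, float]:
    """Average of p_ij over all targets i, for every starting point j."""
    p_values: Dict[str, List[float]] = {name: [] for name in customizations}
    for target_name, target in customizations.items():
        q = {}
        for start_name, start in customizations.items():
            if start_name == target_name:
                continue
            value = aptness(target, start)
            if value is not None:
                q[start_name] = value
        if not q:
            continue
        q_max = max(q.values())  # the best starting point for this target
        for start_name, value in q.items():
            p_values[start_name].append(100 * value / q_max)
    return {name: sum(v) / len(v) for name, v in p_values.items() if v}


# Three toy vocabularies (hypothetical item names).
toy = {
    "broad": frozenset("abcdef"),
    "compact": frozenset("abc"),
    "other": frozenset("abg"),
}
print(supersetticity(toy["compact"], toy["broad"]))  # broad is a full superset
print(average_relative_aptness(toy))
```

In this toy example the compact vocabulary obtains the highest average p even though the broad one is a superset of it, which reflects the reward for small edit distances that is built into q.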
The outcomes of these computation rules have been compared to what one would intuitively think is a good customization starting point, and the authors think they are appropriate in helping identify a few candidate de-facto customizations that may serve as a basis for the synthetic minimal customization.
For evaluating the true aptness of a given or putative customization, other factors need to be considered, too, such as compactness (small number of items), support for recent additions to JATS, or subjective factors, such as the conviction that <array> may never be substituted with a caption-less <table-wrap> or that <sans-serif> must never perish despite infrequent use and the availability of styled-content/@style for literal CSS.
There is no objective truth in this metric; it is just a means to reward customizations that may act as a starting point for other customizations without the need to add many items.
Data Acquisition and Normalization
In order to make DTD-based customizations and de-facto customizations comparable, all have been converted to HTML files that contain essentially two unordered lists: one with the element names and another one with the attribute names.
For the DTDs, the transformation starts with the corresponding, equivalent Relax NG versions. In a first XSLT pass, the includes are resolved. In a second pass, each define element backtraces its refs recursively until it ultimately reaches the start element; if it doesn’t reach start, the define is removed from the resulting transformed RNG. In a third pass, the HTML lists are populated from the remaining defines that define elements or attributes.
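The pruning step can be sketched in Python as a reachability search. Note that this is an equivalent forward formulation (collect every define that is reachable from start) rather than the backtracing XSLT described above. The function name and the tiny grammar are hypothetical, includes are assumed to be already resolved, and defines that share a name are not merged.

```python
import xml.etree.ElementTree as ET
from typing import Set

RNG = "{http://relaxng.org/ns/structure/1.0}"


def reachable_defines(rng_text: str) -> Set[str]:
    """Names of all define elements that are reachable from start via refs."""
    root = ET.fromstring(rng_text)
    # Map each define to the names it references anywhere in its content.
    refs_by_define = {
        define.get("name"): {ref.get("name") for ref in define.iter(RNG + "ref")}
        for define in root.iter(RNG + "define")
    }
    start = root.find(RNG + "start")
    pending = [ref.get("name") for ref in start.iter(RNG + "ref")]
    seen: Set[str] = set()
    while pending:
        name = pending.pop()
        if name in seen or name not in refs_by_define:
            continue
        seen.add(name)
        pending.extend(refs_by_define[name] - seen)
    return seen


# A tiny, hypothetical grammar: 'index-term' is defined but never referenced.
grammar = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="article"/></start>
  <define name="article"><element name="article"><ref name="body"/></element></define>
  <define name="body"><element name="body"><zeroOrMore><ref name="p"/></zeroOrMore></element></define>
  <define name="p"><element name="p"><text/></element></define>
  <define name="index-term"><element name="index-term"><text/></element></define>
</grammar>"""

kept = reachable_defines(grammar)
print(sorted(kept))
```

Only the defines in `kept` would contribute element and attribute names to the HTML lists; the unreferenced `index-term` define is dropped.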
Articles can be grouped in different ways. The granularity chosen for example 3 in section “‘Supersetticity’ and Aptness Metrics” was as follows: articles are grouped with their respective journals for de Gruyter, but the Science Open articles were put in three baskets, SO_Medicine, SO_Science, and SO_Humanities_SocialSciences.
A third kind of input is precompiled HTML lists that publishers supplied if they couldn’t send the actual articles. These are grouped by publisher or by journal.
A configuration file points to the different Relax NG sources, article XML collections, and precompiled HTML files. A sample configuration file is available in the GitHub repository that also holds the transformation code and the pre-cooked HTML lists as supplied by contributing publishers.
Virtual collections are created for each subject classification, that is, STEM, HUM, and ECON.
Analysis Choices
In order for the analyzed data to be more useful towards the goal, several choices have been made:
MathML |
MathML is included in the schemas as a black box. Although its usage is limited to a relatively small number of elements and attributes throughout the collections, the authors decided not to attempt to identify a popular MathML subset. The reasoning is that it is easier to include MathML wholesale in a customization than to cherry-pick individual elements and attributes. In order not to let the number of included or omitted MathML items influence scores,
the authors decided to include only |
Table model |
A similar argument was made for the HTML-informed table model of JATS. As opposed
to MathML, the
vocabulary items will be considered individually and the rarely used attributes |
Outlier Filtering |
For most collections, only items with a frequency of more than 0.01, that is, one occurrence in 100 articles, were considered. |
Journal Permanence |
Only journals with at least 40 articles were considered (where individual articles were available) |
Non-Blue Items |
Aptness analysis is carried out twice: Once with the original vocabulary of the schema or collection customization, and then with each vocabulary reduced to a subset of Blue. The reason is that customizations that also use Green or proprietary markup should have a chance to compete with their Blue subset. It turns out that the ignore-non-blue variants (Figure 12 and Figure 14) do not differ much from their peers that consider non-blue items (Figure 11 and Figure 13). |
PMC as individual journals or as a single collection |
Considering PMC as a single collection instead of more than 1,000 per-journal collections will accelerate computing, but will skew the results. Aptness analysis is carried out twice again, yielding the two variants depicted in Figure 13 and Figure 14 for single-collection PMC and Figure 11, Figure 12 for individual-journal PMC. |
Applying each of the latter two alternatives, four different analysis tables are created:
- all.xhtml; its average p values (last column) are visualized in Figure 11
- all_ignore-non-blue.xhtml, visualized in Figure 12
- all_single-PMC.xhtml, visualized in Figure 13
- all_single-PMC_ignore-non-blue.xhtml, visualized in Figure 14
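Two of the analysis choices described above, outlier filtering and ignoring non-Blue items, can be sketched as simple set operations in Python. The sketch assumes that “frequency” means the share of articles in which an item occurs at least once; the function names, the toy collection, and the excerpt of Blue item names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def frequent_items(articles: List[Set[str]], threshold: float = 0.01) -> Set[str]:
    """Keep items that occur in more than `threshold` of the articles."""
    counts = Counter(item for vocabulary in articles for item in vocabulary)
    return {item for item, n in counts.items() if n / len(articles) > threshold}


def restrict_to_blue(customization: Set[str], blue_items: Iterable[str]) -> Set[str]:
    """Drop every item that is not part of the Blue vocabulary."""
    return customization & set(blue_items)


# Toy collection of 200 articles (hypothetical vocabularies).
articles = (
    [{"p", "sec"}] * 195
    + [{"p", "x-proprietary"}] * 3  # 1.5% of the articles: kept
    + [{"p", "index-term"}] * 2     # exactly 1%: not more than the threshold
)
blue_excerpt = {"p", "sec", "index-term"}  # hypothetical excerpt of Blue

de_facto = frequent_items(articles)
de_facto_blue_only = restrict_to_blue(de_facto, blue_excerpt)
print(sorted(de_facto))
print(sorted(de_facto_blue_only))
```

Running the aptness analysis once on `de_facto` and once on `de_facto_blue_only` corresponds to the two variants (with and without non-Blue items) listed above.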
Note about including proposed customizations
These linked tables and figures already contain the proposed minimal and the naive minimal customizations. The tables and figures that were used for identifying promising candidates from which to derive the minimal customization necessarily lacked these data points. Because the aptness is averaged over all other customizations that may be derived from a given customization, the numbers were slightly different before minimal and naive minimal were added. The difference is not large though, in particular for the datasets with individual PMC journals. Because they look qualitatively and almost quantitatively the same, the authors have omitted the tables and figures without the minimal and naive minimal customizations from this paper.
Minimal Subset Criteria and Choices
A candidate customization with a relatively low item count and a relatively high average aptness p will be selected as a starting point for the minimal customization.
For synthesizing the proposed minimal subset, these choices have been made:
“Strategic” vocabulary items |
Although their frequency is low, vocabulary of certain important areas has been retained:
|
Potentially neglected subject area: Computer science |
Although they occur surprisingly infrequently in the data set, |
Frequency |
Otherwise, let the element and attribute frequencies (Figure 2 to Figure 10) guide the decision whether a given Blue vocabulary item is retained in the synthesized minimal customization. Apart from the frequency itself, a secondary criterion may be whether the item’s frequency is significant in more than one of the source collections considered. (Note that only collections for which full-text XML articles have been obtained could be analyzed by item frequency.) |
Table II
Newly introduced elements or attributes in… | |||||
---|---|---|---|---|---|
JATS version 1.2 | JATS version 1.3d1 | JATS version 1.3d2 | |||
|
|
|
|
|
|
Results
The analysis results presented in this section already contain the proposed minimal subset and its derivative, the naive minimal subset.
The authors selected a compact yet high-scoring de-facto customization, that of PMC Transfusion (items with less than 1% frequency ignored), as a starting point. Items that met the criteria presented in section “Minimal Subset Criteria and Choices” have been included. This led to a significantly larger and lower-scoring “minimal” customization. This customization still contains fewer items than the union of all items found in the collections, because more unpopular items were dropped than strategic items were adopted. The item count (324) is still significantly lower than Blue’s (453), and the average aptness is not much worse than that of other collection-based or schema customizations. The goal of reducing the vocabulary to 60% couldn’t be reached though. It is also questionable whether 90% of the analyzed articles are covered by this customization: it is a superset of 38% of the de-facto customizations in the “Individual PMC journals, ignore vocabulary items not in Blue” scenario. Unless the use of the vocabulary is distributed very unevenly among a collection’s articles, this means that the goal of 90% article coverage is also missed by a wide margin.
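As a quick check of the item counts given above, and assuming that the 60% goal is measured against the item count of Blue, the proposed minimal customization retains

$$
\frac{324}{453} \approx 0.715
$$

of the Blue vocabulary, that is, about 72%, whereas the goal would have required a customization of at most

$$
0.60 \times 453 \approx 272
$$

items.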
The situation looks different if the strategic additions are not considered. This is called the “naive minimal” customization and it comprises only 58% of Blue items. It is a superset of only 27% of the de-facto customizations, therefore the 90% goal is also missed significantly.
While the minimal subset has been prepared as a proper DTD customization, the naive minimal customization has been prepared manually by using the HTML vocabulary lists and commenting out strategically important yet empirically unpopular items again. This boosts the average aptness of minimal, in particular in the per-journal PMC case, to heights that only the all-used-vocabulary union customizations can reach, yet with about 25% less vocabulary.
Conclusion
Empirical JATS usage statistics have been analyzed. There is no clear-cut set of universally unpopular vocabulary items though. It is therefore arbitrary where to cut off the long tail. What the authors consider not arbitrary at all is the question whether newly added or otherwise strategically important items may be left out of a consensus customization; they may not. Only a few items have been considered possible signs of aquafication: index terms and questions/answers. These have also been left out in a previous effort by the Texture team.
Further restrictions with respect to canonical usage of the tags within the vocabulary that the proposed minimal customization provides may be imposed by fine-tuning the schema or by adding Schematron constraints, including existing ones from JATS4R.
The authors are not sure whether the proposed minimal customization, together with these Schematron rules, may develop into a valid alternative to mainstream Blue. This probably needs to be field-tested by production staff at publishing houses.
A longer-form version of this paper, Nina Reinhardt’s master’s thesis [Reinhardt 2021], will be available for download from HTWK Leipzig in August 2021.
References
[JATS-list 2020] diverse authors. JATS—Gripes and Suggestions. Discussion thread on the JATS mailing list. https://www.biglist.com/lists/lists.mulberrytech.com/jats-list/archives/202004/msg00030.html
[Lapeyre 2019] Lapeyre, Deborah A. What is JATS 2.0, and Should You be Worried? eXtyles User Group Meeting 2019, Boston, MA. https://inera.com/wp-content/uploads/XUG2019_Lapeyre_JATS-2-0.pdf
[Schwarzman 2017] Schwarzman, Alexander B. JATS Subset and Schematron: Achieving the Right Balance. In: Journal Article Tag Suite Conference (JATS-Con) Proceedings 2017. Bethesda (MD): National Center for Biotechnology Information (US); 2017. https://www.ncbi.nlm.nih.gov/books/NBK425543/
[JATS4R] JATS for Reuse (JATS4R). https://jats4r.org/
[Texture 2017] Aufreiter, Michael, Buchtala, Oliver. Texture-JATS. https://github.com/substance/texture-jats/
[JATS-list 2021] Imsieke, Gerrit, Reinhardt, Nina. Does Blue need a Lite version, to counter its creeping aquafication? Posting on the JATS mailing list. https://www.biglist.com/lists/lists.mulberrytech.com/jats-list/archives/202102/msg00015.html
[Reinhardt 2021a] Reinhardt, Nina. A JATS Customizing Analysis: Project Summary. https://docs.google.com/document/d/1jYDT0TkYP9Tg31Ldd9gFmdwSiu98Q2mg_qOuhgnxpRc
[PMC] PubMed Central Open Access Bulk Download. https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/ [packages retrieved between Feb. and March, 2021].
[Reinhardt 2021] Reinhardt, Nina. JATS Blue Lite – Analysen zur Definition eines minimalen Konsens-Customizings der Journal Article Tag Suite. https://nbn-resolving.org/urn:nbn:de:bsz:l189-qucosa2-757738