Randall, Laura. “"To those who startle at innovation...".” Presented at Symposium on Cultural Heritage Markup, Washington, DC, August 10, 2015. In Proceedings of the Symposium on Cultural Heritage Markup. Balisage Series on Markup Technologies, vol. 16 (2015). https://doi.org/10.4242/BalisageVol16.Randall01.
Symposium on Cultural Heritage Markup August 10, 2015
Balisage Paper: "To those who startle at innovation..."
Laura Randall has been working with markup languages longer than she cares to admit
and currently works for the PubMed Central project at the National Library of Medicine.
This work is in the public domain and may be freely distributed and copied. However,
it is requested that in any subsequent use of this work, the author be given appropriate
acknowledgment.
Abstract
In these times of electronic journal publishing, adopting a continuous publication
model is easy: Open an issue, publish articles electronically as they flow through
the pipeline, close an issue. Even print journals offer this quick access to the content,
publishing online before issuing the printed publication. The goal is clear: Provide
access to the information as soon as possible. These models incorporating quick electronic
access offer clear benefits to the community, so it's no wonder the model is so widely
adopted.
But these models aren't new to the digital age. They're not exclusive to electronic
publishing. Almost 200 years ago, at least one journal publisher was facing the same
struggle of how to get information to their readers quickly. In the editor's words,
from January 1828, "We only ask that those printed sheets which lie from one to thirteen
weeks in the printing-office...may appear...half-monthly.... To those who startle at
innovation, we put forth this plain question:—Can there be any objection, that each
packet...of this Journal should go forth to those who wish to have it every fifteen
days...?"
This publication model, familiar as it is, presents its own set of challenges to our
modern system. The journal is being digitized as part of a National Library of Medicine
(NLM) and Wellcome Library project to digitize NLM's collection and make it available
to the public through PubMed Central (https://www.nlm.nih.gov/news/welcome_library_agreement.html). So our challenge now is this: How do we integrate a 200-year-old publication model
into current vocabularies when we've re-invented the same model in a different medium?
PubMed Central (PMC) is a free archive of life science journal literature from the
National Library of Medicine (NLM) and is the digital counterpart to NLM's collection
of print journals. Currently PMC includes more than 3.5 million articles from approximately
1700 full participation journals and more than 3500 partial participation journals.
Full-text articles are archived in PMC with XML conforming to the JATS Archiving and
Interchange DTD.
Back Issue Digitization
From 2004 to 2010, the NLM in partnership with the Wellcome Trust and the U.K. Joint
Information Systems Committee (JISC) ran a project to digitize content of some PMC-participating
journals. The project used a destructive scanning process, and its output was high-resolution
page TIFFs, OCR full text, and XML metadata for approximately 1.3 million journal
articles from more than 160 journals. These records were added to the PMC archive
and are freely available. In 2014, the NLM and Wellcome Trust signed a memorandum
of understanding to begin a second project to digitize historical content to make
freely available in PMC.
Unlike the first project, which employed a destructive scanning process on donated
source material, this second project will use NLM's own collection and thus requires
a non-destructive scanning method. In addition, the current project focuses on
journals identified by NLM's History of Medicine Division as orphaned or out-of-copyright
material of historical significance.
These titles span approximately 200 years and while the very basic structure of journal
articles has remained largely unchanged in that time, the specifics of journal and article
citation information are very nuanced. The experience gained by NLM and PMC staff on the
first digitization project is certainly invaluable for handling this new project, but as
is to be expected with any project, there are exceptions.
In an attempt to minimize the surprises in the project, staff from NLM and PMC are
working together to review the material to be scanned and identify any anomalies
that do not conform to our typical or expected data formats.
The Fascicular Series
During the analysis of one of the titles chosen for scanning, NLM staff came across a
series in the Medico-Chirurgical Review (London: S. Highley, 1824-1847) titled the
Fascicular Series. As the name suggests, this series of the journal was published in
fascicles or small bundles. Up until January 1828, the journal regularly published
quarterly issues of 288 pages. In an address to the subscribers of the journal appearing
in January 1828 [Figure 1], the editorial staff of the journal explained that they would
be changing the publishing form and instead of a single 288-page issue once a quarter,
they would be issuing six 48-page fascicles that would be sent to the subscribers
half-monthly.
To those who startle at innovation, we would put this plain question:—Can there be
any objection, that each packet or fasciculus...of this Journal, should go forth to
those who wish to have it every fifteen days, (half-monthly) instead of remaining
in the printing office for the space of many weeks? [*]
The concern of the publisher that potentially timely material was lingering in a printing
office for weeks is one familiar to modern publishers. The time between acceptance of a
manuscript and publication of the final edited version can sometimes take months. Instead
of holding these publications, many publishers issue ahead-of-print or online-first
versions of articles. Additionally, many electronic-only publications have adopted a
continuous publication model whereby articles are published on a rolling basis and are
collected into issues which have publication dates that may be as broad as an entire year.
It would seem, then, that the publisher of the Medico-Chirurgical Review adopted a
continuous publication model in the print medium almost 200 years ago.
Structure of the Series
The first two issues published in the Fascicular Series (issues 16 and 17) included
very distinct headings identifying the fasciculus number and date [Figure 2].
Following the first two issues of the series, the publisher did not include the same
heading but continued to identify the fasciculus number in the footer of the first
page [Figure 3]. Because the bound volume in the collection does not include covers or
tables of contents, this note in the footer along with date information in the running
head [Figure 4] are our indicators that the issues were being released in groups of 48
pages twice a month.
Each issue of this series has two categorical sections: Analytical Reviews and Periscope.
Both sections appear in each individual fasciculus but are grouped together in the bound
volume of the collection. This results in the content in the bound volumes being out of
chronological order [Figure 4].
Beginning with issue 21, the dates in the running heads change to month only. The
fasciculus numbering continues with six per issue, but they are no longer printed
with the day in the running head. During this run of the series, the issue date is
expressed as a month range. This pattern continues through the end of 1833 when the
journal ceases the Fascicular Series and begins the Decennial Series. From the beginning
of the Decennial Series through the end of the publication, the journal returns to
a single quarterly issue.
Tagging Considerations
Fortunately for the project, the JATS Archiving model will natively handle all of
the structures included in this journal. There are, however, two distinct issues that
need to be addressed: article division and publication dates.
Article Division
For the back issue scanning project, PMC has outlined very specific instructions for how
to identify and group types of articles. They include consulting the tables of contents
and reviewing the content of the articles. If articles contain brief announcements or
news-type items, they are grouped as a single article. The Periscope section of the
journal falls into this classification as it includes items such as brief communications,
society announcements, and obituaries. Per the project rules, these brief announcements
would all be captured as a single article titled "Periscope".
The challenge with this decision, however, is that each individual fasciculus contained
a Periscope section. So for each issue in the Fascicular Series, six Periscope sections
were published. These divisions, however, do not exist in the bound volume, where the
Periscope material is presented as one continuous section following all of the
Analytical Reviews articles of the issue.
How do we address this? Should we stray from the specification and create a separate
Periscope article for each fasciculus, imposing a division where one does not necessarily
exist? If not, and we follow the general guidelines we laid out for ourselves, what,
then, is the article publication date for an article that was published in six installments?
Publication Dates
That challenge of addressing publication dates extends past the Periscope-specific
question. Since we have identified this as a print continuous publication model, the
first step would be to try to tag the dates with a method parallel to that of the
electronic continuous publication model we currently handle.
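A minimal sketch of what that parallel tagging could look like, using the JATS
publication-format and date-type attributes on pub-date. The helper name make_pub_date
is hypothetical, every date value is a placeholder rather than a date taken from the
journal, and the assumption that PMC would accept "print" in place of "electronic" is
exactly the point under discussion.

```python
import xml.etree.ElementTree as ET


def make_pub_date(publication_format, date_type, year, month=None, day=None):
    """Build a JATS <pub-date> element (hypothetical helper for illustration)."""
    pub_date = ET.Element(
        "pub-date",
        {"publication-format": publication_format, "date-type": date_type},
    )
    # JATS orders the children day, month, year; unknown parts are omitted.
    for tag, value in (("day", day), ("month", month), ("year", year)):
        if value is not None:
            ET.SubElement(pub_date, tag).text = str(value)
    return pub_date


# Electronic continuous publication as handled today (placeholder values).
electronic = [
    make_pub_date("electronic", "pub", 2015, 8, 10),
    make_pub_date("electronic", "collection", 2015),
]

# The parallel print pairing: same date types, only the format changes.
print_parallel = [
    make_pub_date("print", "pub", 1828, 1, 15),  # placeholder day and month
    make_pub_date("print", "collection", 1828),
]

for element in electronic + print_parallel:
    print(ET.tostring(element, encoding="unicode"))
```

The only difference between the two pairs is the publication-format value, which is
what makes the approach attractive on paper.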
For electronic continuous publication models, PMC requires both a collection date and an
electronic publication date which must include the day, month, and year. The JATS
attributes of publication-format and date-type allow separation of the publication format
from the event type, so one possible solution would be to use this same pairing of values
but just change "electronic" to "print".
There are, however, two issues with this potential solution for PMC's current system:
PMC style requires the date accompanying the collection date contain a day, month, and
year.
For the first two issues in the Fascicular Series, this is not an issue. In the next
three issues, the fasciculus day exists only in the running head. So it is possible to
identify the date, but NLM staff would need to inventory each issue and list the
divisions for the vendor. The later issues in the Fascicular Series, however, contain
only a month and year, not a day.
Since we can't impose a day where none was provided, we look at the reason behind that
rule. PMC requires that date to include a day so we can ensure that we have the correct
release date identified for the article. Since this content is more than a century old,
the release date is irrelevant, so this requirement could be modified.
PMC style requires an electronic publication date to be present if a collection date
exists.
Until this point, PMC has only ever encountered electronic collection dates. Since we
occasionally receive data that has incorrectly identified electronic dates as print
dates, this check catches the data in an early stage of processing and prevents the
incorrect information from being loaded to the database. Easing this restriction would
jeopardize our ability to identify these incorrect dates, which we receive frequently
enough for this not to be a viable option.
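The two style rules can be expressed as simple checks. The sketch below is not PMC's
actual validation code: the function name, the error wording, and the sample XML are all
hypothetical. It only illustrates why a print collection date paired with a
month-and-year fascicle date would fail both rules.

```python
import xml.etree.ElementTree as ET


def check_collection_rules(article_meta):
    """Return rule violations for the <pub-date> elements (illustrative only)."""
    dates = article_meta.findall("pub-date")
    collection = [d for d in dates if d.get("date-type") == "collection"]
    if not collection:
        return []  # the rules only apply when a collection date exists

    errors = []
    electronic_pub = [
        d for d in dates
        if d.get("date-type") == "pub"
        and d.get("publication-format") == "electronic"
    ]
    # Second rule: a collection date needs an electronic publication date.
    if not electronic_pub:
        errors.append("collection date without electronic publication date")

    # First rule: the accompanying date must carry day, month, and year.
    for d in dates:
        if d.get("date-type") != "pub":
            continue
        missing = [p for p in ("day", "month", "year") if d.find(p) is None]
        if missing:
            errors.append("publication date missing: " + ", ".join(missing))
    return errors


# Hypothetical later-series fascicle: print format, month and year only.
SAMPLE = """
<article-meta>
  <pub-date publication-format="print" date-type="pub">
    <month>3</month><year>1830</year>
  </pub-date>
  <pub-date publication-format="print" date-type="collection">
    <year>1830</year>
  </pub-date>
</article-meta>
"""

for problem in check_collection_rules(ET.fromstring(SAMPLE)):
    print(problem)
```

Relaxing the first check is harmless for century-old content, but relaxing the second
would also let through the mislabeled electronic dates the check exists to catch.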
With the current constraints of the PMC system, it does not seem that capturing these
fascicular dates as publication dates is feasible. The option we have remaining is
to tag the information as a more generic history date rather than an actual publication
date. There is more flexibility in the history dates as there is a lot of variety
in the kinds of event dates publishers capture for articles. This would allow PMC
to retain the information about the fascicular date in the XML, but the date would
not appear anywhere in the rendered content. This option also does not address the
question of the appropriate publication date for the Periscope articles.
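A sketch of the history-date option follows. JATS allows date elements inside history,
each with a date-type attribute. The value "fascicle-issued" used here is invented for
illustration and is not an established JATS or PMC value, the helper name is
hypothetical, and the month and year are placeholders.

```python
import xml.etree.ElementTree as ET


def add_history_date(article_meta, date_type, year, month=None, day=None):
    """Append a <date> to <history>, creating <history> if it is absent."""
    history = article_meta.find("history")
    if history is None:
        # A real record would need <history> placed in schema order;
        # this sketch simply appends it.
        history = ET.SubElement(article_meta, "history")
    date = ET.SubElement(history, "date", {"date-type": date_type})
    for tag, value in (("day", day), ("month", month), ("year", year)):
        if value is not None:
            ET.SubElement(date, tag).text = str(value)
    return date


article_meta = ET.Element("article-meta")
# "fascicle-issued" is a hypothetical date-type value; the date is a placeholder.
add_history_date(article_meta, "fascicle-issued", 1830, month=3)
print(ET.tostring(article_meta, encoding="unicode"))
```

Because the day is optional here, the same call works for the early fascicles that
carry a full date and for the later ones that carry only a month and year.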
Inconclusion
PMC's approach to archiving has always been focused more on preserving the intellectual
content than the physical format. Does information about this publishing model and the
physical form of the journal even belong in PMC? If not, where does it belong? And if
this information isn't captured now, when these already-fragile volumes are being
digitized, we run the risk of losing it completely when the physical copies are no longer
viable.
The original focus of this analysis was to figure out how to capture the fasciculus date as a publication date, but as it progressed, the question
has become whether or not we should. In addition to the specific tagging questions that we have not been able to answer,
the bound volume itself raises the question of whether or not the specific fasciculus
information is of significant importance to the content. If the party responsible
for binding the issues did not think the structure significant to preserve, should
PMC really depart from that?
References
[bib1] Address: To the Subscribers of the Medico-Chirurgical Review. Medico-Chirurgical Review.
London: S. Highley. January 1828, pp. 284-287.