David, Ravit H. “BITS for Government Information?” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.David01.
Balisage Paper: BITS for Government Information?
Ravit H. David
Scholars Portal, University of Toronto Libraries
Ravit H. David (she/her) serves as the Distinctive Collections Librarian at Scholars
Portal, Univ. of Toronto Libraries. In her current position, Ravit focuses on developing
Gov Info content, Accessible service (ACE), OA and Archive-It collections. Ravit is
keen on all aspects of digital access, from policy and copyright to metadata best
practices for discoverability.
Scholars Portal, a service of the Ontario Council of University Libraries, provides
multiple levels of access and preservation to scholarly packages of Ebooks. We have
created an in-house custom modification of the Book Interchange Tag Suite (BITS, a
sister vocabulary to the Journal Article Tag Suite, JATS) to describe Ebooks. A recent
strategic decision to host government information has posed several challenges with
our BITS modification and required a new in-house schema to accommodate specific metadata
requirements posed by the somewhat different nature of govinfo content. We’ll look
at some of these challenges, then examine ways BITS can accommodate metadata that
isn’t necessarily standard Ebook metadata.
This paper builds on the work done by the OCUL feedback group. I am deeply grateful
to all the members of the group for the opportunity to work with them and especially
for their insightful analysis of metadata for gov info: Frank van Kalmthout, Archives
of Ontario; Graeme Campbell, Queens U; Helene LeBlanc, Wilfrid Laurier U; Martha Murphy,
Ontario Workplace Tribunals Library; Sandra Craig, Legislative Assembly of Ontario;
Simone O’Byrne, Ministry of the Environment, Conservation and Parks.
Introduction
Scholars Portal (SP) was established in 2007 and is funded by the Ontario Council
of University Libraries (OCUL) consortium. It is the technological body of OCUL. Among
its primary services is an Ebook platform that provides a single interface for accessing
and preserving digital texts (licensed and digitized public domain materials) from
the world’s most important scholarly publishers. Publishers deliver full-text content
to SP, and the SP Ebook team uses programs, referred to as “loaders,” to ingest the
content within the SP platform. This process, which we refer to as “Ebook local loading,”
is designed to meet the agreements made on behalf of the
twenty-one university members of the consortium, including managing various levels
of access to the content. The SP Ebook platform is similar to a federated provider
or aggregator, such as OhioLINK and HathiTrust.
In 2018 we finished redesigning our Ebook platform, and one of the critical decisions
in the redesign was what metadata schema we would use as our target format: to what
metadata standard would we normalize or transform all the different source data we
get from publishers and aggregators? Searching for the Ebook standard
format made it clear that no book DTD/schema is dominant in the Ebook publishing industry.
The SP Ebook platform loads Ebooks from around 30 publishers. Each publisher delivers
the content in its own format and packaging system. The metadata can be MARC, ONIX, Excel,
MARC XML, TEI, or various other DTDs or schemas, some unique to and developed in-house
by the publisher. The full text can be XML or PDF. Some publishers deliver the Ebooks
as individual chapters, and some provide the book’s full text in one PDF or XML file.
Since our Ejournal platform maps all publishers’ data to JATS, we became interested
in its new sibling, BITS. For those who don’t know, the Book Interchange Tag Suite
(BITS) is an XML document model for STEM books based on JATS (the Journal
Article Tag Suite, ANSI/NISO Z39.96-2015). BITS is a named collection of XML elements
and attributes for describing the structural and semantic content of books and book
components, as well as a packaging element for interchange of book parts. BITS provides
a robust book model that is compatible with JATS, making it easy for publishers of
both journals and books to publish them using the same system. [Lapeyre 2019]
Due to the similarities between JATS and BITS, if you already have expertise in JATS,
getting into BITS is easy. You could easily add BITS books if your display system
were built for JATS articles. If your search system were built for JATS articles,
it would search BITS books with minor adjustments.
As part of redesigning the Ebook service, we developed an SP BITS profile as the destination
format for all the publishers’ data we get. Our BITS format was created with scholarly
publishers, collection development departments of academic libraries, and E-resources
workflows in mind. We didn’t want to create a tag for uncommon values or to miss information
that serves academic librarians in reviews of inventory or assessments. Our profile
took advantage of BITS being a flexible XML vocabulary. It aimed to capture rich descriptive
metadata at the book-meta level and minimal metadata at the book-part, i.e., chapter
level.
A word on metadata: quality and gaps
Although there are many definitions of quality metadata, I would like to focus on
one. Marieke Guy, Andy Powell and Michael Day state that “quality is about fitness
for purpose,” and this purpose may be internal and external. [Guy, Powell, and Day 2004]
This definition may lead us to discuss metadata gaps across various standards created
due to differences in the purpose of the standard.
By gaps, we mean areas in bibliographic metadata that are not transferable across
the various standards. For example, ONIX product types allow distinctions between 10- and 13-digit
ISBNs but not between paperback and online ISBNs. BITS enables you to define the type
of identifier, and at SP, it was essential for us to capture and differentiate between
print, online and Epub ISBNs.
However, this doesn’t mean that when we transfer data from ONIX to our custom BITS profile,
we can know which 13-digit ISBN is a print ISBN and which is an online ISBN. This
gap between the two metadata standards needs to be considered when we map data, but it
has nothing to do with our choice of BITS as the service’s metadata format. It’s essential
to remember that gaps between metadata standards are common and do not in any way
testify to the quality of the metadata or the appropriateness of the selected metadata
standard.
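As a rough illustration of this gap, the fragment below sketches how ISBNs might be recorded when one source states the format and another (such as an ONIX feed) does not. It uses the publication-format attribute and the "unknown" value from the SP profile; the ISBN values are placeholders, not real identifiers.

```xml
<!-- Sketch only: ISBN values are placeholders -->
<book-meta>
  <!-- The source data stated the format explicitly -->
  <isbn publication-format="print">978XXXXXXXXXX</isbn>
  <!-- The source data gave a 13-digit ISBN without saying print or online -->
  <isbn publication-format="unknown">978XXXXXXXXXX</isbn>
</book-meta>
```

Recording "unknown" keeps the mapping honest: the gap stays visible in the record instead of being resolved by a guess.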
SP BITS Profile: uses and challenges
SP BITS profile includes collection-meta, book-meta and book-part-meta sections. Still,
I will focus on the first two because gov info content usually doesn’t have the book-part
sections that scholarly monographs have. And already here, you may see that we have
a problem with our BITS profile: if we think about quality metadata as fitness for
purpose, why would we have book-parts in our target format if none of the source formats
contains chapters?
However, Debbie Lapeyre reminds us that “The BITS book models are not intended to
describe trade books, cookbooks, grade-school textbooks, legal works, historical editions,
or any of the wide variety of books outside the current scientific, technical, engineering,
and medical realms in which JATS is used for journals.” [Lapeyre 2019]
With that in mind, we can say that BITS is a perfect fit for our scholarly content
and Ebook service, and really, the model was never intended to be helpful for every
type of content. And yet, our Ebook service is a home for more than just scholarly
publications in the realm of JATS coverage and academic Ebooks. Scholars Portal has
a long history of collecting government documents through several channels. And since
most of these documents are in PDF format and are stored and accessible through the
Scholars Portal Books platform, the question of whether we could use BITS for these
collections came up from the inception of the SP profile. If you look at the profile
declaration of content, you can see that we tried to get ready for other types of
content:
When we first created the profile, we didn’t have scores but later added them since
the music librarian purchased scores that needed a local loading space. Did we map
the scores metadata to BITS? We did.
As for govdocs, this content lived on our Ebook platform long before we switched to
BITS. We have an agreement with the Ontario Legislative Library and regularly load
their content. We have a gov info OCUL community with a small annual budget to digitize
at-risk government info and load it on our platform. Then, there are specific requests
from Ontario institutions. Since the Ontario government doesn’t invest in libraries
or librarians, the remaining few found Scholars Portal a good ally and partnered
with us to load content from specific ministries or bodies, again, either at-risk
content or content that they thought needed to be preserved and accessible.
The unique nature of Government Information
Most of the gov info collections had some level of metadata, usually MARC records.
Since the Ebook industry also used MARC records for a long time, there wasn’t any
problem in mapping the MARC records for gov info into the new SP BITS profile. At
least not until we took a close look at the data. It started with one of the most
popular collections in the area of govdocs that we had, a public and health policy
collection that most of the universities in OCUL purchased. One year during renegotiation
before the renewal of this collection, it became clear that there were many challenges
on the way to renewal. OCUL wanted to evaluate the impact of losing this collection.
And how do you assess a specific collection? You look at usage statistics, of course. Which
public policy documents were in high demand? Which health organizations were covered
by the collection, and could a different licence recover this content?
So how do you learn such details? You go to the metadata. You hope that when you look
at all the values under the publisher field, you’ll discover which providers were
covered in the collection. If you check usage, you want to have parameters to look
at. Publication year, content from specific provinces? Do users prioritize provincial
or federal content in their searches?
At this point, it became clear that the regular bibliographic fields we consult to
assess or discover scholarly publications could become useless when we want to analyze
gov info content. It also became clear that metadata for govdocs needs to be more than
book-type="govdoc". At the same time, OCUL’s strategic decision to focus on gov info collections required
high-quality metadata that could tell us a complete story about our gov info collections.
If we have content from Statistics Canada, what portion of their publications do we
have? A specific time slice? Unlike scholarly Ebooks, gov info doesn’t come with ONIX,
catalogues or title lists. The only way to know what’s in a collection is to have
robust metadata to count on, so it can be queried and give us the details we are looking
for.
In October 2020, Scholars Portal asked for feedback on how best to present and arrange
government information and related content on our E-book platform to take a closer
look at the challenges mentioned above. The call for participation went out to the
OCUL-GIC mailing list and other OCUL forums. A small group of government information
librarians from OCUL and the Ontario government met regularly to discuss best practices
to describe government information content.
Since the SP Ebook service uses BITS, the conversation revolved mainly around possibilities
in the BITS standard. Still, the participants reviewed examples from MARC records
and consulted essential documents such as the Ontario metadata guidelines.
BITS for Government Information?
Metadata is one of the main issues with gov info content. When partnering with the
Ontario Government, federal departments, and other bodies and organizations to digitize
at-risk content, it is usually hard to find the resources to create metadata for the
digitized content. The Ebook service requires high-quality metadata to add collections.
Still, unlike content from scholarly publishers, where high-quality metadata is written
into the licence agreement, metadata for government information comes in various forms
and levels. The feedback group identified several metadata areas they thought would
be significant for the discovery of government information:
Book Type attribute: book-type is an attribute of <book>, the top-level BITS tag:
Are govdocs publications or information? Maybe both? The feedback group thought government
information would encompass all types of documents and publications in this area.
However, their idea of defining the book type was a bit more complex than what is
allowed by the book-type attribute in the image. Since the gov info collections live
in the same service as the scholarly monographs, the first level indeed should flag
the record as book-type="gov info".
However, there seems to be a need to be more specific and have a second level of book
type that will allow users to filter their searches on particular government publications.
These GovInfo Types are based on a list initially prepared by the Ontario Government
Libraries Council Working Group on Government Publications for describing types of
information posted on Ontario.ca. Examples from this list include Annual Reports,
Backgrounders, Budgets, Expenditure Estimates and Public Accounts, Bulletins and Notices,
Mandate Letters and News Releases.
This way, the first level separates gov info from other types of content loaded on
the platform. The second level helps the user identify what content they are dealing
with: at what level of authority was the document written? Is it a new obligatory
policy or a preliminary discussion with stakeholders? Users need this level of information
when it comes to gov info.
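One way the two levels described above could be sketched is shown below. This is an assumption for illustration, not the SP profile: the first level reuses the book-type value from the text, while the second level is carried in a <custom-meta> entry whose meta-name, "govinfo-type", is a hypothetical label. The value "Annual Reports" comes from the list of GovInfo Types mentioned earlier.

```xml
<!-- Sketch only: the meta-name "govinfo-type" is a hypothetical label -->
<book book-type="gov info">
  <book-meta>
    <custom-meta-group>
      <custom-meta>
        <!-- Second level: the specific kind of government publication -->
        <meta-name>govinfo-type</meta-name>
        <meta-value>Annual Reports</meta-value>
      </custom-meta>
    </custom-meta-group>
  </book-meta>
</book>
```

A platform could then facet first on book-type and, within gov info records, on the second-level value.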
Publisher vs. Corporate Author: As seen in the following image, BITS defines the publisher as the entity responsible
for the work. In government information, it is often the corporate author.
Sometimes, the publisher will legitimately be the same as the corporate author. And
if publisher and contributor have different values, BITS is flexible enough to let us
choose the contributor type, for example contrib-type="corporate author". Neither case
poses a problem, because BITS has both fields.
But some examples could complicate things. The first: “The case for Kyoto: the failure of voluntary corporate action,” by Matthew Bramley; prepared by Pembina Institute; in cooperation with David Suzuki
Foundation. Another example: “Metro Toronto remedial action plan: environmental conditions and problem definition.” Author: Canada-Ontario Agreement on Great Lakes Water Quality. Other Author(s):
International Joint Commission. Ontario. Ministry of Natural Resources.
Add a sponsoring body or two to the above examples, as often seen in Think Tank and
NGO publications. How do all the involved organizations fit into the contributor and
publisher elements in BITS? BITS has conference sponsors but not publication sponsors.
“Prepared by” and “in cooperation” are not regular contributors. The closest, perhaps,
is the option to use <collab> in <contrib> as BITS has the opportunity for an organization contributor:
But the more robust option for <collab> is in the reference or bibliography sections; thus, it might be suitable for the
flexibility needed around sponsors and other contributors.
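To make the difficulty concrete, here is a hedged sketch of how the Kyoto example might be forced into <contrib-group>. The contrib-type values "prepared-by" and "in-cooperation-with" are invented for this illustration; they are not defined by BITS or by the SP profile, which is exactly the problem a shared vocabulary would need to solve.

```xml
<!-- Sketch only: "prepared-by" and "in-cooperation-with" are hypothetical values -->
<contrib-group>
  <contrib contrib-type="author">
    <name>
      <surname>Bramley</surname>
      <given-names>Matthew</given-names>
    </name>
  </contrib>
  <contrib contrib-type="prepared-by">
    <collab>Pembina Institute</collab>
  </contrib>
  <contrib contrib-type="in-cooperation-with">
    <collab>David Suzuki Foundation</collab>
  </contrib>
</contrib-group>
```

The markup is easy to write, but without agreed values for contrib-type each loader would label these roles differently, and searches across collections would suffer.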
In many government documents, the value for the publisher could be “Queen’s Printer,”
typically a bureau of the national, state, or provincial government that is responsible
for producing official documents issued by the Queen-in-Council, Ministers of the
Crown, or other departments. The value identifies the responsible body under the Crown.
This, however, tells us very little about the ministry or department that created
the document or publication and relates more to copyright statements. If we want to
query the publisher field for a gov info collection, “Queen’s Printer” won’t give
us the information we want. Suppose we wanted to learn what portion of a given collection
was created by the Ontario Ministry of Education. While we could cross-search the
corporate author field, we could also allow more than one publisher in the BITS record, thus
having “Queen’s Printer” as one value and “Ministry of Education, Ontario” as the
second value. Either way, the world of Ebook publishing is very different from the
gov info meaning of “publisher.”
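The two-publisher idea could look roughly like the fragment below. It assumes that <publisher> may repeat inside <book-meta>, which should be checked against the BITS Tag Library and the local profile before relying on it.

```xml
<!-- Sketch only: assumes <publisher> is repeatable in <book-meta> -->
<book-meta>
  <publisher>
    <publisher-name>Queen’s Printer</publisher-name>
  </publisher>
  <publisher>
    <publisher-name>Ministry of Education, Ontario</publisher-name>
  </publisher>
</book-meta>
```

With both values present, a query on the publisher field could count the ministry’s documents even when every record also names the Queen’s Printer.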
Identifiers: Identifiers are highly significant for Ebooks, and BITS allows one to include any
identifier one wishes to capture:
The SP BITS profile captures identifiers that are commonly associated with Ebooks:
<!-- zero, one or many book-ids per document -->
<!-- doi: from crossref, use format 10.XXXXX/XXXXXXXX;
lcc: library of congress classification; docID: -->
<!-- publisherID unique ID from source data -->
<!-- ismn [print|online|obsolete]: these are used by the escore -->
<book-id book-id-type="{doi|lcc|docid|publisherid|ismn_print|ismn_online|ismn_obsolete|oclc|utlcat}">{book id}</book-id>
And, of course, it allows for capturing ISBNs and ISSNs:
<!-- optional: use either issn or isbn, depending on publication -->
<!-- remove all hyphens -->
<issn publication-format="online|print|unknown">{series issn}</issn>
<isbn publication-format="online|print|epub|unknown">{series isbn}</isbn>
However, when it comes to gov info, many documents don’t have ISSN or ISBN, and DOI
is even rarer. At the same time, some identifiers are significant for this type of
content. Since BITS leaves the type of book-id open, it is possible to add identifiers
such as “Government Document Classification Number” or “The CODOC classification system”
(both were explicitly developed for government documents).
A gap occurs when the identifier is meant to identify the body that created the metadata
record. With gov info, unlike content that comes directly from publishers and vendors, identifying
the origin is not always straightforward, particularly when the content is harvested from the
web. Including information about the record creator could help
users who want to trace the origin of a given document. The name of the organisation(s)
that created the original bibliographic record usually appears as a specific code.
In MARC records, it is common for government librarians to add the code of their unit
to the 040 field. Carrying this code over as one of the identifiers mentioned above would be
helpful for librarians who work with gov info.
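A sketch of what such identifiers might look like follows, in the same template style as the profile excerpt. The book-id-type values "govdoc-class", "codoc" and "record-source" are hypothetical names chosen for this illustration, and the braces mark placeholders to be filled from the source record.

```xml
<!-- Sketch only: all three book-id-type values are hypothetical -->
<book-meta>
  <book-id book-id-type="govdoc-class">{government document classification number}</book-id>
  <book-id book-id-type="codoc">{CODOC number}</book-id>
  <!-- Code of the unit that created the record, e.g. taken from MARC field 040 -->
  <book-id book-id-type="record-source">{organization code}</book-id>
</book-meta>
```

Strictly speaking, the third entry identifies the record rather than the book, which is the gap noted above; placing it in <book-id> is a pragmatic compromise, not a clean fit.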
Jurisdiction, level of jurisdiction and type of organization as attributes for the
corporate author: The feedback group agreed that the corporate author is more significant than the
publisher. So ideally, we would like to be able to describe the corporate author according
to 3 levels:
The type of organizational author: governmental, intergovernmental, nongovernmental
Level (of government): the level(s) of government describing the organizational author: country, province,
municipality (for counties, townships, cities, towns), nongovernmental (for organizational
authors that are outside of government)
Jurisdiction: the geographic region represented by the organizational author
(this is not necessarily the same as the Subject/Topic; this is also not the value
under “Publisher-Location,” which is mainly understood as the physical location of
the body responsible for the publication)
According to gov info librarians in the feedback group, users who visit the reference
desk often ask about a specific government level or jurisdiction. Adding the above
attributes could allow filtering on such parameters and give users the desired results.
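Since these three descriptors have no obvious home on the contributor itself, one possible workaround is to carry them as <custom-meta> entries next to the corporate author. The sketch below is an assumption, not a recommendation from the feedback group: the three meta-name labels are hypothetical, and the example values are illustrative.

```xml
<!-- Sketch only: the three meta-name labels are hypothetical -->
<book-meta>
  <contrib-group>
    <contrib contrib-type="corporate author">
      <collab>Ministry of Education, Ontario</collab>
    </contrib>
  </contrib-group>
  <custom-meta-group>
    <custom-meta>
      <meta-name>corporate-author-type</meta-name>
      <meta-value>governmental</meta-value>
    </custom-meta>
    <custom-meta>
      <meta-name>corporate-author-level</meta-name>
      <meta-value>province</meta-value>
    </custom-meta>
    <custom-meta>
      <meta-name>corporate-author-jurisdiction</meta-name>
      <meta-value>Ontario</meta-value>
    </custom-meta>
  </custom-meta-group>
</book-meta>
```

The weakness of this approach is that the values describe the book as a whole, so a record with several corporate authors would need some further convention to tie each set of values to the right organization.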
BITS <contrib-group> allows several attributes, though none of them is a perfect fit for describing
the type of organization or organizations we are dealing with in the corporate
author field:
To sum up, the model that separates publishers from other entities that participate
in the creation, presentation and sponsoring of government information is lacking.
The term “corporate author,” while it had a designated field in the MARC system, requires
some adaptation in BITS and specific attributes that are currently hard to fit under
the <contrib-group> but that, according to gov doc librarians, are very significant for searches.
Final Discussion
While BITS was not created with gov info in mind, it is a highly flexible
standard that can describe this content. Yet different types of content bring different
demands and challenges. If quality metadata needs to fulfill its purpose, users of
government information require metadata that differs from the descriptive metadata
we are accustomed to seeing for academic Ebooks. Hosting government information on
the Ebooks service of Scholars Portal could undoubtedly build on the bibliographic
metadata fields available on the SP BITS profile. And yet, if we wanted to serve users
of government information, developing a new BITS profile or custom schema to reflect
their needs might be the preferred solution.
Marieke Guy, Andy Powell and Michael Day, “Improving the Quality of Metadata in Eprint
Archives,” Ariadne 38 (2004), http://www.ariadne.ac.uk/issue/38/guy/ (Accessed July 14, 2022).