Clark, Ash, and Sarah Connell. “Meta(data)morphosis.” Presented at XML In, Web Out: International Symposium on sub rosa XML, Washington, DC, August 1, 2016. In Proceedings of XML In, Web Out: International Symposium on sub rosa XML. Balisage Series on Markup Technologies, vol. 18 (2016). https://doi.org/10.4242/BalisageVol18.Clark01.
Balisage Paper: Meta(data)morphosis
Ashley M. Clark
Ashley M. Clark serves as the XML Applications Developer for the Northeastern University
Libraries’
Digital Scholarship Group (DSG). She works on TAPAS and the Women Writers Project,
specializing in XML
databases, XQuery, and XSLT.
Sarah Connell
Sarah Connell is the Project Manager for the Women Writers Project and a member of
the Digital Scholarship Group at Northeastern University.
Her current research explores applications of literary theory through text encoding,
working with early modern national histories.
This article offers a case study in the creation and metamorphosis of a corpus of
transcriptions intended for web publication. It discusses a process for encoding,
proofing,
and publishing a collection of brief periodical documents (largely reviews) on the
subject
of authors published in the Women Writers Project’s established Women Writers Online
corpus, as part of
an initiative investigating the transatlantic reception of early
women’s texts. Both encoding and publication in the initiative, Cultures of Reception,
were driven by the particular
characteristics of this collection and the importance of establishing links to the
existing
materials in Women Writers Online. This article discusses steps that the Cultures
of
Reception team took to prepare the encoded texts for publication—including development
of a
web-based tool systematizing human intervention—and then explains the goals and design
of
Women Writers in Review, the interface that is used to publish these texts.
The Cultures of Reception initiative offers a case study in the migration, normalization,
enhancement, and transformation required to publish a set of TEI-encoded data. The
initiative’s corpus consists of almost 700 transcriptions demonstrating the cultural
engagements and influences of women writers in the eighteenth and nineteenth centuries.
The corpus is large enough and most of its transcriptions are short enough that publishing
this corpus required prioritizing the use of data in aggregate over individual transcriptions.
And, while the corpus is also small enough that it cannot be considered definitive,
cleaning and enhancing the transcriptions’ metadata enabled their publication in an
interface that allows users to discover, to explore, and to make connections among
a set of texts and authors that enact a key moment for women’s participation in transatlantic
print culture. The data enhancement and publication processes required, among other
things, the migration of data from JavaScript Object Notation (JSON) to XML encoded
following the Text Encoding Initiative (TEI)’s guidelines[1]; the rehabilitation of ill-formed transcriptions; the identification of entities
such as people and literary works; the formalization of metadata associated with texts
and publications; and, of course, the development of a web interface capable of accomplishing
the project’s goals for these materials.
The Women Writers Project (WWP)[2] began work on the Cultures of Reception initiative in 2010. To support research
into the transatlantic reception and readership of early texts by women, the WWP transcribed
close to 700 brief documents—such as literary reviews, publication announcements,
textual excerpts, and subscription notices—published in Great Britain, Ireland, and
America
between 1770 and 1830. This paper focuses on the tasks which preceded and enabled
the release
of an exploratory web interface, Women Writers in Review,[3] which makes the Cultures of
Reception corpus available for discovery and reading. As is evident in this project—in
which
transcriptions were created in one web interface, edited and transformed into well-formed
XML
through another interface, and then published through a third—the processes of authoring
XML
files for web publication and of developing web interfaces to publish XML files need
not be
entirely distinct; in fact, they can be productively linked. This paper discusses
how, by
focusing extensively on the corpus’s metadata, we at the WWP managed the complexity
of the Cultures of Reception
source materials and addressed the need to create a publication platform in which
even novice
users can navigate these materials.
Context for Cultures of Reception
The Cultures of Reception corpus represents a major encoding and publication endeavor
for the
WWP—but it was produced in the context of a much longer established collection of
TEI-encoded texts: Women Writers Online (WWO).[4] WWO is a full-text collection of early women’s writing in English; it includes
transcriptions of texts written by, translated by, or attributed to women, published
between
1526 and 1850, and it has been central to the WWP’s research output for close to thirty
years. The various reviews (a term we have broadly defined to include not only reviews, but
also publication notices, literary histories, and other texts discussing women’s writing)
that make up the Cultures of Reception corpus were selected because they respond to
texts by
authors who are published in WWO. Thus, while the two collections are distinct in
their contents,
encoding practices, and publication interfaces, they are closely linked to each
other through their shared focus on a set of authors and texts.
Although both Cultures of Reception and WWO are TEI projects, their encoding practices
differ significantly. The tagset for Cultures of Reception is fairly minimal, designed
to capture basic document structures, significant renditional details, and the evidence
of textual exchanges (such as quotations and references to titles) that are important
for the project’s concerns with early transatlantic review culture.[5] The WWO collection, by contrast, has a tagset that currently includes approximately
165 elements and that reflects decades[6] of development with a generically and chronologically diverse collection of texts.
Finally, while the two interfaces will be linked extensively, their purposes and display
needs are quite different. WWO is designed as a reading interface, making individual
texts easy to locate and read online. Cultures of Reception places a high priority
on aggregation; its texts are generally shorter than those in WWO and they are more
likely to be discovered through their periodicals of publication or subject matter
than by reviews’ titles or authors. For this reason, Women Writers in Review (WWR),
the web interface in which the Cultures of Reception collection is published,[7] foregrounds metadata much more than WWO does and fosters exploration by a wider range
of facets.
The Cultures of Reception transcriptions require significantly more context to be
useful and their
metadata reflects that. Where each WWO text only needs to track information about
the individual
publication and its author(s), a Cultures of Reception review needs to track information
about where the review was originally published, which publication was reviewed, which
edition is likely to be the one reviewed, and which authors were mentioned. We also
wanted the materials in Women Writers
in Review to enrich the experience of reading the texts in Women Writers Online, linking
between the two
corpora as much as possible. See Appendix A for a tabular comparison of metadata captured
in WWO versus in Cultures of Reception. While the complexity of the Reception data
is
specific to the project, we hope to demonstrate that our focus on metadata added a
great
deal of value to the web publication of our small- to mid-size corpus.
The First Phases
The initial phases of the Cultures of Reception initiative were completed between
2010
and 2013, during which time the Women Writers Project had its institutional home at
Brown
University. These phases included selecting, sourcing, and gathering data on the texts
to be
encoded; setting up a web interface connected to CouchDB[8] for a first pass of transcription and encoding; and performing a substantial
amount of the encoding work itself. The WWP’s project manager at that time, John Melson,
took the lead in creating the transcription interface and gathering metadata on the
reviews
and periodical sources. The bulk of the transcription and encoding work was performed
by the
WWP’s graduate and undergraduate student encoders.
It was after the WWP moved to Northeastern University in 2013 that this paper’s authors
became involved with the Cultures of Reception initiative. Sarah Connell, the WWP’s
current
project manager, directed a team of graduate students as they continued encoding in
the
transcription interface and worked with Ashley Clark, the WWP’s XML Applications Developer,
on proofing the corpus and preparing the transcribed texts for web publication. In
the
sections that follow, we discuss the decisions that we made and the work that we performed
after the first transcription phase had been completed and we began to focus extensively
on
publishing the collection.
Reports and Initial Cleanup
As transcription wrapped up, we began to take stock of the data stored in CouchDB
as JSON objects.[9] We used XQuery to pull records from CouchDB’s API. Examining the query results, we
realized that we could not yet create accurate reports on the writers, works, and
periodical
sources represented in the database. Most of the JSON objects’ values were inconsistently
formatted and difficult to parse programmatically. From a human standpoint, the sheer
volume
of reviews in the corpus made it extremely difficult to find specific transcriptions.
Browsing
the list of records was overwhelming even for those who knew the project.
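Pulling records this way amounts to issuing HTTP requests against CouchDB and examining the JSON that comes back. The following is a minimal sketch rather than the query we actually ran; it assumes the EXPath HTTP Client module as available in eXist, CouchDB’s default port, a hypothetical database named reviews, and an XQuery 3.1 processor for fn:parse-json().

    xquery version "3.1";
    import module namespace http = "http://expath.org/ns/http-client";

    (: Ask CouchDB for every document in a hypothetical "reviews" database,
       including each document's full contents in the response. :)
    let $request :=
      <http:request method="get" override-media-type="text/plain"
        href="http://localhost:5984/reviews/_all_docs?include_docs=true"/>
    let $response := http:send-request($request)
    (: http:send-request() returns the response metadata first, then the body. :)
    let $body := $response[2]
    return
      count(parse-json($body)?rows?*)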
These first queries shaped the rest of the pre-publication phase by helping us to
recognize that, in order to appeal to
the casual user, the corpus needed to be browsable. Moreover, in order to appeal to
the researcher,
records needed to be discoverable. We decided that the best way forward would be to
focus our
pre-publication efforts on consolidating and normalizing metadata and determined that
this would be easiest for
our team to do once the corpus was migrated out of JSON and into XML.
First, however, we needed to take care of the transcriptions.
The Ill-formed Transcriptions
From the transcription interface, we knew we had 690 completed transcriptions stored
in CouchDB. However, those transcriptions were stored as plain-text strings—pseudo-XML—in
JSON.[10] While CouchDB has built-in methods of validating its databases using JavaScript,
those methods could not tell us how many of those transcriptions could be exported
as well-formed XML, much less valid TEI. Instead, we turned to an instance of eXist-DB[11] already running on the same server as CouchDB. We wrote an XQuery library to let
eXist communicate with the Cultures of Reception database through CouchDB’s API. Using
that library, we attempted to serialize the JSON records as XML and save them into
eXist. But, while CouchDB didn’t care at all about the transcriptions’ XML-esque pointy
brackets, eXist cared a little too much. It refused to store ill-formed transcriptions
as XML because, of course, they weren’t XML. Forget running XSLT on the strings! A direct export to TEI was just not possible.
Rather than strong-arming the JSON into a binary format, we decided to keep the
transcribed reviews in CouchDB a little longer. eXist would serve as an intermediary
for the transcriptions, running XQueries and storing XML reports on the CouchDB documents
(rather
than storing the documents themselves).
From the first of these reports, we learned that of the 690
transcriptions, 556 were ill-formed. After a moment of horror, we were
relieved to discover that most of the 556 were actually just XML
fragments; they only needed a wrapper element to be well-formed. That left us with
167
ill-formed transcriptions.
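In outline, the check behind that report amounts to attempting to parse each transcription string. The following is an illustrative sketch rather than the code we ran; it assumes an XQuery 3.1 processor for fn:parse-xml() and fn:parse-xml-fragment().

    xquery version "3.1";

    (: Classify a transcription string: already well-formed, a parsable fragment
       that only needs a wrapper element, or genuinely ill-formed. :)
    declare function local:classify($transcription as xs:string) as xs:string {
      if ( exists(try { parse-xml($transcription) } catch * { () }) ) then
        'well-formed'
      else if ( exists(try { parse-xml-fragment($transcription) } catch * { () }) ) then
        (: Parsable as a sequence of nodes; it only needs a wrapper element. :)
        'fragment'
      else
        'ill-formed'
    };

    local:classify('<p>One paragraph</p><p>and another, with no wrapper.</p>')
    (: returns 'fragment' :)

Distinguishing the “fragment” case from true ill-formedness is what turned an apparent 556 problem files into a much more manageable 167.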
The Lack of Identifiers
Complicating this process, the records in CouchDB were not always consistent in terms
of
their metadata. There were no identifiers associated with the names of people, places,
or
publications. Instead, the transcription interface relied on alphabetical sorting
of strings
as a grouping mechanism. Because the names were typed in by hand, these fields varied
in
terms of punctuation, capitalization, and spelling—as one might expect—but also they
also
reflected variations in the amount of information and context included.
For example, one record might provide a reviewed author’s maiden name (Barbauld,
Anna Laetitia (Aikin)) and another might not. Mary Robinson’s works were listed
under both her real name and her pseudonym, Horace Juvenal. And in the reviewed works,
we
found that two conflicting titles, Harrington and Ormond,
Tales and Harrington, a Tale; and Ormond, a
Tale were in fact both correct. The title had changed in a
later edition.
The encoding interface was not flexible enough to handle the complex relationships
among
textual editions, or between two seemingly distinct persons. To address this limitation,
some encoders wrote in-depth notes, attempting to capture context that would not fit
in the
HTML form.[12] The transcription interface did not encourage these contextual notes, though, and
in any case, we
couldn’t reliably draw out those connections except with human intervention. Without
identifiers to link key persons and publications, there was no ready way to programmatically
categorize the 690 transcribed files by reviewed author or work. Additionally, it
was not always clear
which works referenced in the reviews corresponded to specific publications in
Women Writers Online. We could not link transcriptions to each other, nor link out
of the
Reception project. We had a context problem. We needed...
The Inspecter
The Reception corpus was not small enough to allow human intervention on each and
every
item inside it, but the problems we faced were too complex for a purely programmatic
approach. Instead, we used XPath and XQuery to create XML reports, then used a web
interface—named the Inspecter[13]—to limit human intervention to situations requiring judgement calls, such as
categorization or entity uniqueness. Any fixes would be ported back to CouchDB using
its
RESTful API.[14]
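Conceptually, the write-back step looks like the sketch below. The real workflow also handled authentication, conflict detection, and error reporting; the database name here is hypothetical, and CouchDB only accepts such an update when the JSON body carries the document’s current _rev value.

    xquery version "3.1";
    import module namespace http = "http://expath.org/ns/http-client";

    (: Send a corrected JSON document back to CouchDB under its identifier. :)
    declare function local:put-record($id as xs:string, $json as xs:string) as item()* {
      http:send-request(
        <http:request method="put"
          href="http://localhost:5984/reviews/{encode-for-uri($id)}">
          <http:body media-type="application/json">{ $json }</http:body>
        </http:request>
      )
    };

    (: Illustrative call only; the values are placeholders. :)
    local:put-record(
      'example-id',
      '{"_id": "example-id", "_rev": "1-abc123", "transcription": "<p>A review.</p>"}'
    )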
Formedness, Well-Done
For the ill-formed transcriptions, the XML report was simply a list of problematic
records by their CouchDB identifier. The web interface was simple, too—an Ace editor[15] set up to parse an input field as XML. When the transcription was submitted,
eXist would check it, only accepting the change if the XML fragment was
well-formed.
While Ace was an improvement over the original transcription interface, we quickly
found that Ace’s editing environment could barely keep up with the faux-XML. The editor
could tell the data was ill-formed, but its error messages were not helpful in conveying
why, nor in precisely locating problems within each document. Did
an element lack a closing tag? Was there overlap? In cases where there was more than
one
ill-formedness error,[16] oXygen’s interface proved more useful than Ace. By moving between oXygen and the
Inspecter as the complexity of the errors demanded, we were able to resolve all but
20
of the 167 ill-formed files relatively quickly.
Entity Recognition
The second report was generated by identifying unique strings for both authors and
the titles of works and then listing out corresponding reviews. The XQuery generating
the report
didn’t even try to ignore punctuation or normalize misspellings, relying instead on
the
web interface to highlight subtle differences for the human reviewer. The displayed
report
sorted data by author name, then by work title, which made it easy to tell when authors’
names or works’ titles needed to be normalized, and allowed us to port the corrected
string
value back to the affected CouchDB records. It also provided the ability to link those
reviewed works with an identifier for an edition in Women Writers Online, marking
the
transcriptions that did not have associated TR numbers[17] so that we could either add identifiers where they were missing, or mark
reviewed works as not yet having entries in the WWP’s records.
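A schematic version of the grouping behind that report is sketched below. The element names (review, author, workTitle, @id) are hypothetical stand-ins; the report we actually generated worked against the CouchDB records and carried more context for the human reviewer.

    xquery version "3.1";

    (: Reviews as XML, with hypothetical <author>, <workTitle>, and @id fields. :)
    declare variable $reviews := collection('/db/wwp/reception/reports')//review;

    (: Group reviews by the literal author and title strings, so that near-duplicates
       (differing only in punctuation, spelling, or added context) sort next to one
       another for the human reviewer. :)
    <report>{
      for $review in $reviews
      group by $author := string($review/author),
               $title  := string($review/workTitle)
      order by $author, $title
      return
        <entry author="{$author}" title="{$title}" count="{count($review)}">{
          for $r in $review return <ref id="{$r/@id}"/>
        }</entry>
    }</report>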
This Inspecter report was the first case in which the web interface gave us a
greater degree of power than specialized software like oXygen. The Inspecter’s strengths
lie in: (1) reducing a complex dataset into a human-readable report, and (2) changing
multiple records at once. The XML reports could be regenerated at the click of a button,
so fixes were never held back by inaccurate reports.
However, the Missing TRs report also showed the
value of simplicity. This report was used to trigger three separate
data munging tasks:
normalizing reviewed author names,
normalizing reviewed work titles, and
linking reviewed works to their WWO versions.
Given the hidden interdependencies between tasks (in that normalization for authors
and titles needed to be completed prior to entering links to WWO versions but the
report did not
enforce this order of completion), we initially
struggled to find a good workflow for this report. Since then, each
Inspecter report provides the structure to solve only one problem at a
time.
The JSON-to-XML Migration
With the first two Inspecter reports completed, converting the CouchDB records into
TEI-encoded XML was fairly simple. We used XQuery to request each transcribed, well-formed
record in the CouchDB database, then turned that record into XML using the XQJSON
library[18]. Each transcription was parsed as an XML fragment instead of a string representation.
This XML representation of the JSON record was then transformed into TEI using XSLT,
and stored in one of the WWP’s eXist-DB instances. For simplicity’s sake, the files
were named by their CouchDB identifiers.[20] We then compressed the results into a ZIP file, and added the new files to a Subversion
repository created and maintained by the Women Writers Project. Appendix B is a list
of the JSON keys stored in CouchDB on export. Appendix C has screenshots of selections
from the XSLT and XQuery files used to convert the JSON into TEI.
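In outline, the per-record migration looked something like the sketch below. The xqjson, transform, and xmldb modules are the tools named in this section, but the collection paths, stylesheet name, identifier, and the transcription JSON key shown here are placeholders rather than our actual values.

    xquery version "3.1";
    import module namespace xqjson    = "http://xqilla.sourceforge.net/lib/xqjson";
    import module namespace transform = "http://exist-db.org/xquery/transform";
    import module namespace xmldb     = "http://exist-db.org/xquery/xmldb";

    (: One CouchDB record, reduced to two placeholder keys for the example. :)
    let $id     := 'f3a9example'
    let $record := '{"transcription": "<p>A short review.</p>", "periodicalTitle": "The Example Review"}'
    (: Turn the JSON into an XML representation of its keys and values. :)
    let $xmlified := xqjson:parse-json($record)
    (: Parse the transcription value as an XML fragment rather than keeping it a string. :)
    let $fragment := parse-xml-fragment($xmlified/pair[@name eq 'transcription']/string())
    (: Hand both to an XSLT stylesheet (path is a placeholder) that builds the TEI document. :)
    let $tei := transform:transform(<record id="{$id}">{ $xmlified, $fragment }</record>,
                                    doc('/db/wwp/xslt/couch-to-tei.xsl'), ())
    return xmldb:store('/db/wwp/reception', $id || '.xml', $tei)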
As for the 20 transcriptions with particularly thorny ill-formedness errors, the same XQuery
was used to transform all of the JSON data except the transcriptions themselves.
Each transcription was then treated as a string and wrapped in the
<body>. The resultant files were also placed into the Subversion
repository, in a directory for sequestered reviews.
We hope to rehabilitate these in the future, when there is more time to tackle the
errors. In the meantime, the ill-formedness errors bothered our colleague Syd
Bauman enough that he added attribute pointers, and split wrapper elements into empty
ones.[21] The result is a collection of files in which the placement of the original
encoding is preserved, and the files are
not just recognizable to humans as XML, but parsable by XML
processors.
Preparing for Publication
Building Entity Records
With the transcriptions in XML format and entities’ names and titles normalized, we
were
able to query the data much more easily, which enabled us to discover that each transcription
differed in how much data was included on any given entity. We could depend on the
periodical title’s presence, but not on the title under which the review was published.
Some
entries on periodicals included the publication place of the periodical, while others
gave
only the date.
One solution would have been to compile information for each entity and then
propagate the condensed versions back to the reviews, as the Inspecter reports did
to the records in CouchDB. This would have worked in the sense that every review would
have access to the same information on its source periodicals and mentioned authors
and works. But it would also retain the fragility of the CouchDB database—the reviews
would easily get out of sync with each other as the collection continued to grow past
publication.
Instead, we decided to create TEI files to house the canonical records of these entities.
Previous WWP work had already established a research database of people and so we
were able to make use of the WWP’s personography[22] as context for any authors referenced in the reviews. For referenced works and periodicals,
we used XQuery to create stub <biblStruct>s with only the distinct titles and newly-generated identifiers. After we created
a TEI file for each set of bibliographic information, we used another XQuery to compile
all the TEI elements associated with the <title>s, then save them to the new file, under the new identifier. We were then able to
associate the reviews with identifier references to the canonical entries for their
source periodicals and for any works mentioned.
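The stub-generation step can be sketched as follows. The @type value used to find reviewed-work titles, the identifier scheme, and the collection path are illustrative stand-ins; the real query also built stubs for periodicals and carried over more bibliographic detail.

    xquery version "3.1";
    declare namespace tei = "http://www.tei-c.org/ns/1.0";

    (: Create one stub <biblStruct> per distinct reviewed-work title, each with a newly
       generated identifier that the reviews can then reference. :)
    declare function local:work-stubs($reviews as element(tei:TEI)*) as element(tei:listBibl) {
      <listBibl xmlns="http://www.tei-c.org/ns/1.0">{
        let $titles := distinct-values($reviews//tei:title[@type eq 'reviewedWork'])
        for $title at $pos in $titles
        return
          <biblStruct xml:id="{ 'work' || format-number($pos, '000') }">
            <monogr>
              <title>{ $title }</title>
            </monogr>
          </biblStruct>
      }</listBibl>
    };

    local:work-stubs(collection('/db/wwp/reception/reviews')//tei:TEI)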
The bibliography of mentioned works required a further step, since some reviews
discussed different editions of the same work. We needed a way to show the relationship
between different versions of the same work, not least because a goal of the project
has been to investigate the international scope and
cultural impact of women’s writing. We settled on a loose correspondence of TEI bibliographic
elements
with Functional Requirements for Bibliographic Records
(FRBR) entities.[23] The FRBR report conceptualizes intellectual products as four different types of
entities: works, expressions, manifestations, and items. Works can have multiple
expressions, and expressions can manifest in multiple ways:
Table I

FRBR Entity    | Description                                           | TEI Element
work           | the text as a distinct artistic creation              | <biblStruct>
expression     | the text as created or interpreted in different ways  | <monogr>
manifestation  | the text as embodied in a published edition           | <imprint>
We chose to identify works as having multiple expressions on the basis of how much
changed between
different versions. For example, Hannah Adams wrote A Summary History of New
England, and later reinterpreted her work for a younger audience: An
Abridgement of the History of New England for the Use of Young Persons. We
classified these two texts as expressions of the same work. Since the records were
all in
the same file and since intervention required a fair amount of bibliographic expertise,
much of this identification was done by hand.
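Concretely, the mapping yields records shaped roughly like the sketch below, based on the Hannah Adams example; the identifier and the empty imprint elements are placeholders rather than transcribed data.

    xquery version "3.1";

    (: A trimmed sketch of a work-level record: one <biblStruct> (work) holding two
       <monogr>s (expressions), each of which can carry an <imprint> (manifestation). :)
    <biblStruct xmlns="http://www.tei-c.org/ns/1.0" xml:id="work.adams.newengland">
      <monogr>
        <!-- expression: the original text -->
        <title>A Summary History of New England</title>
        <imprint>
          <!-- manifestation: details of a published edition would go here -->
          <pubPlace/><publisher/><date/>
        </imprint>
      </monogr>
      <monogr>
        <!-- expression: the version reinterpreted for a younger audience -->
        <title>An Abridgement of the History of New England for the Use of Young Persons</title>
        <imprint>
          <pubPlace/><publisher/><date/>
        </imprint>
      </monogr>
    </biblStruct>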
When we were satisfied with the bibliography of reviewed works, we used XQuery Update
to replace
the bibliographic metadata in each file with an identifier reference to the most applicable
version of
each work from the bibliography. This
was done by comparing the <biblStruct> parts listed in each review with the
condensed, canonical metadata in the bibliography. If the XQuery could not determine
the
most accurate FRBR entity, it used the work identifier with an attribute flag for
low
certainty. We were then able to search for that attribute and resolve the handful
of problem
cases.
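A simplified sketch of that update pass appears below. It assumes eXist’s XQuery Update extension (update replace ... with ...), hypothetical collection paths, a crude title-based comparison in place of our real matching logic, and @cert="low" standing in for the low-certainty flag.

    xquery version "3.1";
    declare namespace tei = "http://www.tei-c.org/ns/1.0";

    (: Swap each locally recorded <biblStruct> for a pointer to the canonical
       bibliography entry, flagging matches the query is unsure of. :)
    let $bibliography := doc('/db/wwp/reception/works-bibliography.xml')
    for $review in collection('/db/wwp/reception/reviews')//tei:TEI
    let $local := $review//tei:sourceDesc/tei:biblStruct[1]
    let $exact := $bibliography//tei:biblStruct
                    [tei:monogr/tei:title = $local/tei:monogr/tei:title[1]]
    let $loose := $bibliography//tei:biblStruct
                    [tei:monogr/tei:title[contains(., $local/tei:monogr/tei:title[1])]]
    return
      if ( exists($exact) ) then
        update replace $local with
          <bibl xmlns="http://www.tei-c.org/ns/1.0" corresp="#{$exact[1]/@xml:id}"/>
      else if ( exists($loose) ) then
        (: The precise FRBR entity could not be determined; point to the work and flag it. :)
        update replace $local with
          <bibl xmlns="http://www.tei-c.org/ns/1.0" corresp="#{$loose[1]/@xml:id}" cert="low"/>
      else ()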
Creating Display Titles
Another bibliographic issue we had to address resulted from the fact that very few
of
the items we transcribed had either titles or identifiable authors.[24] And, even for those that did, we knew that users
would be far more likely to look for items based on the works being reviewed or the
source
publications. That is, we expect that readers would be more interested in retrieving
the Edinburgh Magazine’s various articles on Maria Edgeworth than in
searching for "Review of New Publications" or
"To the Editors of the Virginia Religious Magazine" in particular. And so, we decided to label each review
according to the information that would be most useful to readers: its date, the source
in which it was published, and the author and work that are its primary subjects.
For example:
1817-09: The Edinburgh Magazine on Edgeworth’s
Comic Dramas.
We also knew that we would want to allow sorting and retrieval by source, reviewed
author, and reviewed work. However, many of the texts in the Reception corpus have
very
lengthy names, making them unwieldy for web display, and so it was necessary to create
display versions of these titles. In the relevant bibliographies for reviewed texts
and
sources, we added a second <title> with a @type of "display" and decided
on a shortened version of the title (for the rare cases where titles were short enough
already, we instead added a @subtype of "display" to the <title>). For
example, Hannah Adams’s
A View of Religions, in two Parts. Part i. Containing an
Alphabetical Compendium of the various Religious Denominations which have appeared
in
the World, from the beginning of the Christian Æra to the present Day. Part ii.
Containing a brief Account of the different Schemes of Religion now embraced among
Mankind. The whole collected from the best Authors, ancient and
modern
has a display title of A View of Religions, in two
Parts.
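In the bibliography files this looks roughly like the following trimmed sketch; the wrapping element and the selection expression are illustrative, and the full title is abbreviated here.

    xquery version "3.1";
    declare namespace tei = "http://www.tei-c.org/ns/1.0";

    (: A trimmed, hypothetical entry showing the two-title convention. :)
    let $entry :=
      <bibl xmlns="http://www.tei-c.org/ns/1.0">
        <title>A View of Religions, in two Parts. Part i. Containing an Alphabetical
          Compendium of the various Religious Denominations [...]</title>
        <title type="display">A View of Religions, in two Parts.</title>
      </bibl>
    (: Prefer a shortened display title; otherwise accept a full title flagged as
       displayable with @subtype; otherwise fall back to the full title. :)
    return
      ($entry/tei:title[@type eq 'display'],
       $entry/tei:title[@subtype eq 'display'],
       $entry/tei:title)[1]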
Establishing these display titles ensured that the Women Writers in Review interface
could
substitute the shortened form when the work or periodical source is referenced in
headings and
lists of results. In turn, users of the Women Writers in Review interface can scan
through the
significant amount of metadata necessary to locate individual reviews—and search
through the collection overall—efficiently and without having to parse verbose titles
such
as La Belle Assemblée, or, Bell’s Court and Fashionable Magazine
Addressed Particularly to the Ladies.
Reconciling Inconsistencies
Our data cleanup process also required that we address the inevitable encoding inconsistencies
that arise in projects on this scale.
One such issue was the encoding of notes, which were handled in a range of ways (often
reflecting the amount of experience the encoder had with WWO encoding practices);
some encoders were simply typing notes as
content where they happened to appear in the text while others were using <note> either at the location of the note or
in a separate <div> at the end of the text. Fortunately, once we had all of the reviews
well-formed and added to version control, it was easy to locate and reconcile the
different practices
so that all notes could be encoded in <note> elements at the end of each review. We were similarly able to address the occasional
cases
in which encoders both used the <quote> element and also typed quotation marks directly into
the text. In fact, we found that many of the consistency issues that were difficult
to enforce in
the transcription interface proved to be very simple fixes once we had the
files in XML because they could be identified, located, and addressed globally.
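Once the files were in XML, checks of this kind reduce to short XPath queries run across the whole collection. For example, a sketch with a hypothetical collection path and a guessed @type value for the notes <div>:

    xquery version "3.1";
    declare namespace tei = "http://www.tei-c.org/ns/1.0";

    let $reviews := collection('/db/wwp/reception/reviews')
    return
      (
        (: Quotations where the encoder both used <quote> and typed the marks. :)
        $reviews//tei:quote[matches(., '^\s*[“"]') or matches(., '[”"]\s*$')],
        (: Notes encoded inline rather than gathered in the final notes <div>. :)
        $reviews//tei:note[not(ancestor::tei:div[@type eq 'notes'])]
      )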
Another inconsistency we reconciled was, in fact, planned as part of the encoding
process. During the early stages of the project, the WWP had identified several key
themes
that we were interested in tracking across the corpus—including topics such as Nation
or empire, Gender identities, and Review culture. We
also wanted encoders to add their own thematic keywords, reflecting the specific content
of
different reviews and capturing themes that we hadn’t anticipated. And so, we ended
up with
a broad set of encoder-created tags, many of which were expressing essentially the
same concepts in slightly different language. For example, one encoder might mark
a review
as discussing Irish literature while another might use literature of
Ireland. We consolidated variations in the keywords, taking note of which
encoder-authored tags were appearing frequently. In this way, the review process was
also a method
for us to get a sense of the content and concerns of the corpus overall and to refine
the
list of corpus-wide thematic tags, which we expect will be one of the major ways that
readers discover content in the Women Writers in Review interface.
Women Writers in Review
Interface Goals
We had several key goals in mind when designing the corpus’s web presence: we wanted
to
make the materials easy to browse and search; we wanted to have linking among the
reviews
and to Women Writers Online; and we wanted to foster exploration of the materials
as a corpus. These goals were in many ways driven by the
transcribed texts themselves, which tend to be brief and are more likely to be of
interest
to users as they reveal information about other texts and the larger topic of transatlantic
reception culture.
Women Writers Online, by contrast, tends to treat each text on its own
merits; accordingly, the WWO interface is designed to make it easy to find and read
individual texts. We expect that users of Women Writers in Review will be asking questions
such as: How has Maria Gowen Brooks been reviewed in British and American
periodicals? or What changes are evident in the British Review
over time? or How did periodicals in this
period discuss questions of women’s authorship? (rather than Where is
Margaret Cavendish’s The Blazing World?—a
question that brings a great many readers to WWO).
That said, we also wanted to do justice to the very lively and interesting content
of
the reviews in presenting them to readers—we don’t expect that many readers will go
looking
for the British Review’s article on Maria Edgeworth’s
Harrington and Ormond, Tales, but once that reader has
found the article, we want to be sure it is easy to read online, despite the fact
that it is
one of the longer reviews in the collection. We also wanted to foster continued discovery
and exploration from each individual review. For example, if a reader lands on this
article
from the British Review, they could go on to explore other
literary reviews, other items tagged as dealing with nation and empire, other articles
from
the British Review, other reviews of Maria Edgeworth in
general or of Harrington and Ormond in particular—and so
on. For this reason, it was important that Women Writers in Review present all of
the
metadata associated with these items in intuitive and clearly organized ways.
Infrastructural Design
The web application for Women Writers in Review makes use of BackboneJS[25] to interface with eXist’s RESTXQ endpoint[26]—more specifically, with an XQuery designed to return JSON datasets and HTML
representations of the corpus. The web application was built out while the TEI files
were in
constant flux, undergoing the normalization tasks noted above, but because we
started
with the RESTXQ API, the BackboneJS app always had the latest data from eXist. Additionally,
within the eXist database, the transcription files are stored separately from the
EXPath
app. These abstractions allowed for flexibility in our workflows. An update to the
API did
not require all records to be saved back into eXist and an update to the transcriptions
did
not require the re-engineering of the BackboneJS app.
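A stripped-down endpoint of that kind is sketched below using standard RESTXQ annotations; the path, module namespace, collection location, and JSON field names are illustrative rather than the production API.

    xquery version "3.1";
    module namespace wwr = "http://example.org/ns/wwr-sketch";

    declare namespace rest   = "http://exquery.org/ns/restxq";
    declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
    declare namespace tei    = "http://www.tei-c.org/ns/1.0";

    (: Return lightweight JSON descriptions of every review for the BackboneJS
       collection to consume. :)
    declare
      %rest:GET
      %rest:path("/reviews")
      %rest:produces("application/json")
      %output:method("json")
    function wwr:reviews() {
      array {
        for $review in collection('/db/wwp/reception/reviews')//tei:TEI
        return map {
          'id'    : string($review/@xml:id),
          'title' : string(($review//tei:titleStmt/tei:title)[1])
        }
      }
    };

Returning XQuery maps and arrays, rather than hand-built strings, keeps the endpoint agnostic about serialization, so the same function can feed JSON to the client or be reused by server-side XQuery.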
To accomplish the goals outlined above, we created a clean, simple reading interface
using XSLT, and put much of our efforts into developing a series of landing pages,
or pivot
points, where users can view sets of items by reviewed author name, by reviewed work,
by
source, and by tags (such as format, genre, overall positive or negative evaluation,
and
thematic content). Each landing page is generated directly from the requested bibliographic
or biographic canonical entry, using XSLT to create a representation of the appropriate
context for that particular pivot. For example, the landing page for an author like
Maria
Edgeworth includes information on her pseudonym, her birth and death dates, her birth
and death locations
and the countries she resided in; it also links out to her entries in the
Library of Congress Name Authority[27], the Virtual International Authority File[28], and WorldCat Identities[29]. Similarly, in addition to imprint data (such as publishers and publication
dates and locations), reviewed works have a link to a version published in WWO, wherever
WWO
texts are available.[30]
Landing pages also contain lists of reviews related to that particular pivot,
and a facets sidebar. The sidebar is broken out by sources, tags, and mentioned authors
and
works. The sidebar distinguishes between facets that identify features of the
result set, and those which can be used to reduce results to a subset. The latter
are
collapsed by default, while the former can be seen at a glance. For example, as one
would expect,
every item listed on the landing page for The Literary Gazette; and
Journal of Belles Lettres, Arts, Sciences, &c. has that periodical
listed as a source, so the Source facet in this case is used to indicate the number of
results in the set, rather than to filter among those results. In a less predictable
(and more interesting!) usage of facets for
counting rather than filtering, the interface also shows that all results for this
publication are articles with very
positive reception of the primary author.
When users expand the actionable facets and select one, any items that do not match
that
facet are hidden, and items that do match are shown. Currently, only one facet can
be
selected at a time and the facets cannot be stacked or used to create a union of two
subsets.
As we continue creating features past initial publication, we plan to make
the faceting system a better tool for querying, manipulating, or exploring aggregations
of
data.
Both landing pages and reading views also include links within
Women Writers in Review, to the landing pages for related entities or to sets
of items with the same metadata. For example, one can link from the top of reading
views
to other reviews with the same tags; the reading view for one such review, for example, shows buttons linking to
all items that are tagged as literary review, article or essay, and
somewhat positive. We hope that these connections among texts with shared characteristics
will increase discoverability and make for an intuitive
browsing experience.
In Conclusion: Metadata Narratives
The Cultures of Reception grant work has ended, but the Women Writers Project still
has
aspirations for improving Women Writers in Review. In the future, we want to use the
refined metadata to
offer readers more ways of exploring these reviews and of considering the corpus as
a whole.
We are currently adding visualizations showing trends in the corpus across time and
location—such as positive and negative reviews of individual authors during the course
of
their writing careers—and we are working on making these visualizations themselves
a mechanism
that readers can use to explore the collection.
We plan to make geographic
information more visible—for example, plotting periodical sources and reviewed texts
onto maps
so that readers can investigate the roles that geographic space played in the circulation
of
these texts. While Women Writers in Review already has an API, we are in the process
of
expanding it and writing documentation to allow researchers to directly query the
corpus and
obtain results in JSON or XML.
Much of the framework is already in place for these goals. Metadata played a crucial
role
in the Cultures of Reception initiative, providing signposts from the earliest stages
of
transcription through the development of the Women Writers in Review publication interface.
In
the initial transcription phase, establishing a data capture mechanism that could
accommodate
our complex and highly structured textual and bibliographic data, while allowing a
necessary
degree of flexibility in the input, enabled us to complete a high volume of transcription
work
in a very short period of time.[31] Normalizing variations in that data early in the pre-publication process made it
possible for us to map out and navigate the remaining data enhancement tasks, despite
the
diversity and size of the corpus. And it was the experience of navigating the XML
files
through their bibliographic and biographic data that inspired the design of Women
Writers in
Review. The result is a publication with a contextual foundation sturdy enough to
build on.
More importantly, we believe that foundation will empower users to read and to question,
to
explore and to discover, to experiment with and to theorize about the collection we
have built.
Acknowledgments
The Cultures of Reception project was generously funded by a National Endowment for
the Humanities Collaborative
Research Grant.
The authors owe a considerable debt of gratitude to Julia Flanders, Syd Bauman, Steven
Braun, John Melson, and the encoding staff at the WWP for their expertise and enthusiasm
in
developing the project.
Appendix A: Metadata Comparison between Women Writers Project Publications
All Women Writers Project initiatives include references to the same WWP-canonical
personography—a TEI file with data on persons significant to the WWP. We at the WWP
collect
information on names (including birth names, married names, honorifics, pseudonyms,
and name
variants); geographic information (including birth and death locations as well as
residences);
dates (birth, death, and floruit); identifiers (including WWP and LCNA); and some
biographic
details (faith, marital status, parental status, and languages known). We also track
authors’
roles in the WWP’s information systems—such as their roles as authors of texts in
WWO, Women
Writers in Context, and Women Writers in Review—and we keep open fields for notes,
sources,
and queries.
Table II

Identifiers are marked "ID", and identifier references are marked "IDREF". Women Writers Online records transcription metadata only; Women Writers in Review records both transcription metadata and canonical lists (’ographies).

WWP publication
  WWO transcription metadata: file title; publication information (publisher, date, place); file ID
  WWR transcription metadata: file title; publication information (publisher, date, place); file ID; CouchDB IDREF

Work attributed to a woman
  WWO transcription metadata: work title; contributor IDREF(s); edition; publication information (publisher, date, place); physical characteristics, such as the number of pages
  WWR transcription metadata: IDREF(s) to any work reviewed, including the specific edition, if known
  WWR canonical lists: work ID(s); work title(s), optionally including a shortened title suitable for web display; contributor IDREF(s); contributor name(s); edition(s), if known; publication information (publisher, date, place) for each known edition; related WWO file IDREF, if one exists

Source periodical
  WWR transcription metadata: periodical IDREF; review title; review author, if known; periodical title; publication date; volume, issue
  WWR canonical lists: periodical ID; periodical title(s), optionally including a shortened title suitable for web display; publication date range, if known; publication place; publisher name(s), if known; editor name(s), if known; contributor name(s), if known; politics, if known; frequency of publication, if known

Categories
  WWO transcription metadata: work genre IDREF(s)
  WWR transcription metadata: review genre IDREF(s); review format IDREF; review theme IDREF(s); IDREF of the evaluated tone the review takes to the main received work
[5] Restricting the tagset enabled the project to complete a relatively high volume of
encoding work in a constrained period of time; it is possible that future research
at the WWP will entail enriching the Cultures of Reception corpus with more detailed
encoding.
[6] WWO itself was first published in 1999 and encoding work began years earlier.
[9] JSON objects are unordered lists of keys and their associated values.
[10] The plain-text looked like XML in that it had pointy brackets and element-like structures,
but it was not serialized as XML. Because it hadn’t been serialized as XML, there
was no guarantee that the text was hierarchical or had only one root element.
[12] For example, one note on a text discussing Elizabeth Ogilvy Benger’s Poems on the Abolition of the Slave Trade reads:
Elizabeth Benger is apparently only named in the work as E. Benger, and she is
referred to as Mr. Benger throughout the review. There is an interesting footnote
on
the bottom right of page 104, which speculates that E. Benger might be a
woman.
[13] The idea for the interface was conceived around Halloween. Its name combines inspector (for its task-based reports) with specter (for the ghostly way those reports were displayed: data unmoored from its context).
In the future, the name will be changed to inspectre to better embody the pun.
[17] The unique identifier for WWO texts (TR is short for
transcription).
[18] This helpful EXPath library[19] aids in JSON-XML translation, though conversion can certainly be accomplished using
other methods. https://github.com/joewiz/xqjson
[20] CouchDB identifiers are long and not particularly human-readable, but they had the
advantage of being unique. We later changed the filenames to use the primary reviewed
author’s surname, a key for the primary reviewed work, and a key for the periodical
in which the review was published. This yielded filenames that were at least human-readable,
though still lengthy.
[21] An unofficial lesson from this project: offending developers’ sensibilities can be
a
positive force for change.
[30] Reviews were selected for transcription by author, rather than by work, so some
reception items discuss works that are not themselves published in WWO, though other
works by that author are.
[31] For the WWP, at least, given the project’s long-term approach to research and publication.
Author's keywords for this paper:
metadata; XML; JSON; TEI; XQuery; XSLT; web applications