Gross, Mark, Tammy Bilitzky and Richard Thorne. “Extracting Funder and Grant Metadata from Journal Articles: Using Language Analysis
to Automatically Identify and Extract Metadata.” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Gross01.
Balisage: The Markup Conference 2016 August 2 - 5, 2016
Balisage Paper: Extracting Funder and Grant Metadata from Journal Articles
Using Language Analysis to Automatically Identify and Extract Metadata
Mark Gross
CEO
Data Conversion Laboratory
Mark Gross, CEO & founder of Data Conversion Laboratory, is a recognized authority
and speaker on XML implementation and document conversion. Prior to founding DCL
in 1981, he was with the consulting practice of Arthur Young & Co. Mark has a BS in
Engineering from Columbia University and an MBA from New York University, and has
taught at the New York University Graduate School of Business, the New School, and
Pace University.
Tammy Bilitzky
CIO
Data Conversion Laboratory
Tammy Bilitzky, Chief Information Officer of Data Conversion Laboratory, is responsible
for managing DCL's technology department. Tammy has extensive experience in leveraging
technology to deliver client value, supporting business process transformation and
managing complex, large-scale programs onshore and offshore. Tammy holds a BS in Computer
Science and Business Administration from Northeastern Illinois University. She is
a Project Management Professional, Six Sigma Green Belt, and Certified ScrumMaster.
Richard Thorne
Senior Consultant
Data Conversion Laboratory
Richard Thorne, Ph.D. is a software consultant working with Data Conversion Laboratory
in the fields of Java and machine learning. Richard has previously worked for Answers.com
and has consulted for a number of internet start-ups.
Scientific, Technical, and Medical (STM) journal articles are an explosively growing
data source with constantly increasing requirements for complex metadata. Newly important
has been the identification of funding and grant data. It has become critical that
publishers know the sources of funding for journal articles so that they can meet
the associated open source distribution obligations and so that funding organizations
can track publications associated with their funding. Managing this information manually
is costly and often inaccurate.
Over the past few years, we have been working on enhancing techniques to extract various
kinds of metadata automatically from scientific and legal documents. We report here
on recent work to extract funding source and grant information from within STM journal
articles.
Identifying funding sources has taken on particular importance as funders want to
better track what their funding is accomplishing, and scholarly journals want to better
track their open source distribution obligations. While extracting funding information
is a fairly new requirement, the need for extracting metadata of all kinds from within
documents is not new, and applies to a number of areas which we have been working
with. Examples include:
Cookbook recipes: Hidden away inside recipes is information on ingredients, procedures, and timings,
all of which could help you plan meals, figure out what you need in your fridge, and maybe
even automatically suggest other ingredients to help you make that perfect meal.
Leases: Terms, expiration dates, default clauses, renewal clauses, rights of first refusal.
Just think of the information you need when you manage a large portfolio of units
that you're leasing out; there must be a better way to track information than in a
thousand separate documents.
Loan & mortgage documents: There is a lot of information regarding valuations of property and the nature of
the borrower hidden inside forms and documents. It would be helpful when valuing a
mortgage portfolio to have this information pulled out and ready to use (dare I say,
something like this might have even prevented the last housing melt-down).
STM journals: Buried within these journals is information on authors, affiliations,
citations, etc.
While the methods we use apply to many other areas, for the purposes of this article,
we focus on one application - to identify and extract funder and grant metadata from
free-form article text, specifically the granting organizations (sponsors), grant
numbers, and grant recipients. This article describes how we've used rule-based processes
plus Natural Language Machine Learning (ML/NLP) to automate the process of getting
at this data.
At this point, you may be asking yourself, "Why not just ask the authors for this
information?" Maybe someday. Some publishers are starting to collect this information
but it will take time until it's implemented. And even when collected, not all authors
are providing this information accurately, especially when there are a number of ways
to refer to the funding source (refer to https://scholarlykitchen.sspnet.org/2015/01/28/what-has-fundref-done-for-me-lately/).
Specific problems can include: names are not normalized, the taxonomy is unclear (Federal
Highway Authority was not noted under U.S. Department of Transportation), information
is ambiguous (University of California - which one?), and mnemonics are ambiguous,
where a single mnemonic can refer to many different organizations (e.g. a search for
NSF in the Fundref database shows 8 entries - look it up on http://search.crossref.org/funding).
While some journals are starting to require authors to explicitly supply metadata,
it's still only for a minority of journals and will likely be a long time coming;
publishers are reluctant to rock the boat and trouble authors too much, as they might
risk losing authors to competing journals. And automated extraction would be quite
important for historical analysis; there are something like 50 million articles out
there according to at least one source (https://duncan.hull.name/2010/07/15/fifty-million).
Funding Data in STM Articles
Over the past few years we've done extensive work to develop techniques to identify
and extract metadata from inside various textual collections, including grant information
in STM (Scientific, Technical and Medical) journal articles.
While grant information might be found almost anywhere in an article, it is often
contained in either a section of several paragraphs identified as an acknowledgement
section or in footnotes near the beginning of articles. The work described in this
article assumes that such a section exists and has been identified. Following is an
example of a typical acknowledgement section:
<ce:section-title
id="st040">Acknowledgments</ce:section-title><ce:para id="p0255">This work was supported by grants of Saint Petersburg State University (11.38.271.2014
and 15.61.202.2015), Russian Foundation for Basic Research (RFBR) projects (Project
No. 13-02-91327) and was performed in the framework of a collaboration between the
Deutsche Forschungsgemeinschaft and RFBR (RA 1041/3-1). The authors acknowledge support
from Russian-German laboratory at BESSY II, the program "German-Russian Interdisciplinary
Science Center" (G-RISC) and the Resource Center of Saint-Petersburg State University
"Physical Methods of Surface Investigation".</ce:para>
It's typically in prose, may identify one or more funding sources, and may or may not identify
specific grants or other identifying characteristics. It may also acknowledge other
sources of assistance who were not necessarily funders of the project. While simpler
constructions refer to only one or two funding sources and follow regular grammar,
most such sections are more complex. Also, since many authors are not totally fluent
in English, the grammar is sometimes convoluted.
Humans are quite good at reading these sections and identifying the relevant information.
But the process is labor-intensive, expensive, and because it relies on judgement
calls, doesn't always provide consistent results. The increasing speeds required in
the publishing industry, and the constant pressure on costs, cry out for automation
to speed up the process, reduce manual costs, and provide consistent results. Automating
the process also holds the promise for continual improvement as the feedback is used
to improve the process.
Automating language analysis of course isn't easy to do reliably because of ambiguities
in the use of languages, lack of grammatical knowledge, and the absence of standardization
in how people use language ("if everyone followed the rules, this would be easy, alas...").
Also, while this article focuses on English language sources, the techniques we describe
can be adapted to other languages, though the analysis of grammatical constructs would
need to be adapted to the specifics of the language.
Worth noting is that an acknowledgement section could name multiple institutions, several
grants within an institution, and non-funding support from various sources. Let's
unpack this even more with the question, Why is extracting funder and grant information
important? Here are some answers:
There are emerging rules on open access that require publishers to monitor the
varying open-access posting and payment requirements of the wide variety of licenses.
According to the Registry of Open Access Repository Mandates and Policies (ROARMAP),
there are currently at least 764 access policies to track (http://roarmap.eprints.org/view/policymaker_type/). Compliance
requires knowing which institution funded the research.
Funders want a better way to track the yield of their funding. Automatic tracking
will help address this need.
Identification of conflict-of-interest situations, for example, where funders might be
beneficiaries of the research.
Let's define some terms in our process:
Tokenization: This is the act of breaking up a sequence of strings into pieces such as words,
keywords, phrases, symbols and other elements called 'tokens'. Tokens can be individual
words, phrases or even whole sentences. In the process of tokenization, some characters
like punctuation marks are discarded. The tokens become the input for another process
like parsing and text mining.
Extended Tokenization: As we use it, and as described by
Hassler and Fliedl (http://www.witpress.com/elibrary/wit-transactions-on-information-and-communication-technologies/37/16699), this is tokenization that takes advantage of domain-specific knowledge (also known
as context-dependent tokenization). In our case we use it to identify keywords
or phrases that have been determined to consistently indicate the start of a text
sequence, such as tokens that identify grant information, a granting organization, a
grant ID, or an author. Taking advantage of context allows us to more easily optimize
the analysis to provide meaningful results (a toy illustration follows this list).
Classification: Classification has its English meaning - to identify a
thing as belonging to one distinct type as opposed to another. In Machine-Learning
(ML) it goes beyond that definition to suggest a methodology - using features of the
text to separate classes. In this paper we are using classification to identify a
sequence of words as either being a candidate funder, or as not a funder (binary or logistic decision).
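As a toy illustration of the tokenization ideas above, the sketch below splits a sample acknowledgement sentence into plain tokens and flags those that match domain keywords. Both the sentence and the keyword list are illustrative assumptions, not the actual set used in our process.

# Toy illustration of extended (context-dependent) tokenization:
# plain tokenization plus flagging of domain keywords that signal grant text.
import re

# Illustrative keyword list only; the production set is larger.
KEYWORDS = {"supported", "grant", "grants", "funded", "award"}

sentence = "This work was supported by grants of Saint Petersburg State University."
tokens = re.findall(r"\w+", sentence)                   # punctuation is discarded
flagged = [(t, t.lower() in KEYWORDS) for t in tokens]  # domain-aware labels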
Methodology
The following provides an overview of our methodology; it's not intended as a machine
learning tutorial, but rather to provide the context for what we were able to accomplish,
and where we think there is room for further development.
Included is discussion of the following methods we applied in order to improve the
initial results:
Lexical Analysis: There is an expected sequence for the association between a grant-id and a grant-sponsor.
This suggests a 'grammar'
can be imposed on the output of the lexical analyzer through a parser.
Machine Learning/Natural Language Processing (ML/NLP): In order to reach higher accuracy, we are applying ML/NLP techniques to identify funding sources
which are not necessarily in the Funding Data database, or are in formats which don't
cleanly match. (A good overview is found in https://en.wikipedia.org/wiki/Natural_language_processing).
External verification: Use a table with
known granting organizations to verify the offered candidates. To date we have been
using the Funding Data Registry at CrossRef (http://search.crossref.org/funding, formerly known as FundRef).
The document set we used for this analysis consisted of a corpus of 1000 XML journal
articles in which we identified and segregated the sections most likely to contain
funding data, typically the Acknowledgement section or something similar. While
journals from various publishers have some variance in
how they handle this, most do have the information in some identified section.
Our approach, using context-dependent tokenization of the text, was to parse the token
stream to find sentences relating to grant funding. For each document, the sections
containing the words grant, trust, fund, or support were extracted, stripped of
any XML tags, and used to create a free-text data stream.[1][2]
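A minimal sketch of this step follows, assuming the acknowledgement paragraphs have already been located upstream. It strips tags with a simple regular expression rather than the lex-based tooling described in the notes, so it is an illustration, not our implementation; the helper name and trigger list are placeholders.

# Strip XML tags from acknowledgement paragraphs and keep only those that
# contain one of the trigger words, producing a free-text stream.
import re

TRIGGERS = ("grant", "trust", "fund", "support")
TAG_RE = re.compile(r"<[^>]+>")

def funding_text(paragraphs):
    """paragraphs: raw XML paragraph strings from the acknowledgement section."""
    kept = []
    for raw in paragraphs:
        text = TAG_RE.sub(" ", raw)                # drop the markup
        text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
        if any(word in text.lower() for word in TRIGGERS):
            kept.append(text)
    return " ".join(kept)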
Let's break down how we parse the token string. In the example shown in figure 1, having
identified the appropriate sentence because it contains the word "Grant", we would
then be looking for a start word (a token), like "supported", which possibly identifies
that a supporting organization follows. In the example we're looking at, the organization
providing the support follows somewhere down the line, and the funding instrument
and recipient follow in due course. Typical starting words include "grant", "support",
"project", "award", "contract", "agreement", and "scholarship". Once you've identified
the start word, it will define the context of the rest of the sentence and let you
parse the remainder of the sentence based on the following rules (a simplified sketch follows the list):
Find a start token based on the list of common start tokens.
Once you find a start token, search for the start of funding organizations by skipping
words until you find the first proper noun, which we assumed would be the first capitalized
word, see figure 2.
Next search for the Grant ID token:
A Grant ID must follow a grant start word (i.e., a start token).
All letters in a Grant ID must be capitalized.
The minimum length for a Grant ID is four characters.
It can include, but not end with, these characters: space, #, 0-9, /, -, _
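The sketch below is one simplified way to render these rules in code. The start-word list, the token cleanup, and the grant-ID pattern are assumptions filled in for illustration, and are looser than our production lexer.

# Simplified rule-based parse of one sentence: a start word opens a funding
# segment (and closes the prior one); capitalized words after it accumulate
# into the funder name; uppercase tokens containing digits are grant-ID candidates.
import re

START_WORDS = {"grant", "grants", "support", "supported", "project",
               "award", "contract", "agreement", "scholarship"}
# Rough rendering of the Grant ID rules: capitals/digits plus #, /, -, _,
# not ending in a separator character.
GRANT_ID_RE = re.compile(r"[A-Z0-9][A-Z0-9#/\-_]*[A-Z0-9]$")

def parse_sentence(sentence):
    tokens = re.split(r"[\s,;()]+", sentence)
    candidates, current = [], None
    for token in tokens:
        if not token:
            continue
        if token.lower().strip(".") in START_WORDS:
            if current:
                candidates.append(current)   # a new start word closes the prior segment
            current = {"funder": [], "grant_ids": []}
            continue
        if current is None:
            continue
        if len(token) >= 4 and GRANT_ID_RE.match(token) and any(c.isdigit() for c in token):
            current["grant_ids"].append(token)
        elif token[:1].isupper():
            current["funder"].append(token)  # capitalized words build the funder name
    if current:
        candidates.append(current)
    return candidates

A real implementation needs much more context handling than this; the classification step described below is what lets us improve on the roughly 50% yield that a rule-based pass achieves on its own.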
Our primary goal was to identify all grant funders and the grant numbers within the
text, with a secondary goal to locate grant recipients, which we searched for with
logic similar to the above. Internal divisions such as punctuation and parentheses
aided in the transition between "grant text" and "non-grant text". In addition to
looking for the first capitalized word, we did some more analysis using pattern matching
and textual context. We also used the start words as stop words for the prior segment.
This was to let us know that we had probably identified all the information for that
funding source and that we were probably now looking at another funding source. At
the end of this analysis we would expect to have found a set of
candidate phrases where each phrase had a candidate grant funder, zero or more grant
numbers, and zero or more recipients.
Classification
Classification is a key methodology we used next. The above process alone correctly
identified over 50% of the granting institutions, along with grant IDs, with a lower
yield for recipients. These results, while impressive, were not sufficient for a production
setting. Our goal was to get the automated process into the 80-90% area for the process
to be considered useful. Getting into the 90-95% range was the ultimate goal.
In order to improve precision we added a machine learning classifier trained on proper
names taken from our corpus of documents. The internal functioning of a classifier
is outside the scope of this article, but a good overview is available in this article:
https://en.wikipedia.org/wiki/Support_vector_machine. The basic idea was to identify which combination of words was likely to indicate
a funding source. This was done by working with training sets, which we
could then manually analyze to determine whether the selected items fit into the
category (valid funding source) or not. If the training set was sufficiently representative,
then the computed ML algorithms should predict the correct answers for the
other sets to be analyzed, which would then be an automated process.
The training of the classifier was based on a set of natural language processing tools
from R (a set of open-source software tools from The R Project for Statistical
Computing, www.r-project.org). The first step was to convert the phrase into a "bag of words" by dividing the textual
stream into a set of words. "Bag of Words" is a term used in ML when you want a list
of all words appearing in a set, but don't care what the order might be. The words
are stemmed to normalize and simplify the analysis (e.g. walk, walker, walking, walked
all have the same stem) and common stop-words (a, the, and, etc.) are removed. A term
frequency vector was calculated, and words with low frequency, in our case those occurring
in less than 10 documents, were removed from the analysis. Surprisingly, after this
analysis, on 1000 documents, only about 50 words remained.
The remaining words were used to build a document term matrix in which each row represented
a document, each column one of the roughly 50 words retained by the analysis, and each
cell was either 0 or 1 to indicate the absence or presence of that word. From the above data,
a Support Vector Machine (SVM) was trained. Background on SVM is available at https://en.wikipedia.org/wiki/Support_vector_machine.
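Although our implementation used R's text-mining tools, the following Python sketch shows the same pipeline shape under stated assumptions: stop-word removal and stemming, a binary presence/absence matrix with a minimum document frequency of 10, and a linear SVM. The names candidate_phrases and labels are hypothetical placeholders, and scikit-learn/NLTK stand in for the R packages.

# A sketch (Python/scikit-learn stand-in for the R tooling described above).
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

stemmer = PorterStemmer()
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(text):
    # Stem each token so walk/walker/walking/walked collapse to one feature.
    return [stemmer.stem(token) for token in base_analyzer(text)]

vectorizer = CountVectorizer(
    analyzer=stemmed_analyzer,
    binary=True,   # presence/absence rather than raw counts
    min_df=10,     # drop words occurring in fewer than 10 documents
)
model = make_pipeline(vectorizer, SVC(kernel="linear"))

# candidate_phrases: list of candidate-funder phrases; labels: 1 = funder, 0 = not.
# model.fit(candidate_phrases, labels)
# predictions = model.predict(new_phrases)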
The purpose of the classifier, once trained, is to return an evaluation, for each instance,
of whether the instance is a funding institution or not. Based on initial
testing, the model showed an F1-measure of .86; the F1-measure is considered a measure of "goodness
of fit", and without getting too deeply into the statistics, it tells us that the chance
that we guessed correctly is about 86%.
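For readers who want the statistic made concrete, the F1-measure is the harmonic mean of precision and recall; the snippet below computes it from illustrative precision and recall values, not figures reported in this work.

# F1 is the harmonic mean of precision and recall (illustrative values only).
precision, recall = 0.88, 0.84
f1 = 2 * precision * recall / (precision + recall)  # ~0.86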
Next Steps
To further improve the accuracy of our results, the next step would be to compare
them to a known list of granting organizations. If there was a complete, authoritative
database, we probably wouldn't need some of the sophisticated techniques we've discussed.
But that doesn't exist today. The Funder Registry maintained by Crossref (formerly
known as FundRef) is considered the most authoritative, and while it is being regularly
improved, it is not yet complete and there are some ambiguities.
By using the results of our analysis, along with the Funder Registry, we have gotten
results in the 90% range, and anticipate being able to improve on that.
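One hedged way to perform this comparison programmatically is to query the Crossref REST API's funder search. The endpoint, parameters, and response fields below reflect the public Crossref API as we understand it, not anything specified in this paper; a local copy of the registry would work equally well.

# Look up a candidate funder name in the Crossref Funder Registry and return
# the registry names it matches, for comparison with the extracted candidate.
import requests

def lookup_funder(candidate_name, rows=5):
    resp = requests.get(
        "https://api.crossref.org/funders",
        params={"query": candidate_name, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("message", {}).get("items", [])
    return [item.get("name") for item in items]

# Example: lookup_funder("Deutsche Forschungsgemeinschaft")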
In addition to improving automated results, there is further work we are doing to
disambiguate mnemonics, in which a designated funder could be one of many, and to
better identify grant recipients, who are often not fully identified in funding
acknowledgments.
The goal going forward will be to increase yield further, with better precision,
while reducing false positives. We plan to achieve this through better calibration with larger
test sets, improvements in machine learning and in our parsing algorithms, and improvements
in the Funder Registry and authority files.
[1] This type of extended-tokenization based on a domain specific language is described
by M. Hassler &
G. Fliedl (Text preparation through extended tokenization in Data Mining VII: Data,
Text and Web Mining and their Business Applications 13). Because the lexical tokenization
takes advantage of the surrounding text there is a greater confidence that the extracted
text relates to the domain of interest.
[2] The tokenizing software, or lexer, was based on lex, which was originally written
by Mike Lesk and Eric Schmidt (Lesk, M.E.; Schmidt, E.
(July 21, 1975). "Lex - A
Lexical Analyzer Generator". UNIX Time-Sharing System: UNIX
Programmer's Manual, Seventh Edition, Volume 2B. bell-labs.com. Retrieved Dec 20, 2011.)