Koidaki, Fotini, and Katerina Tiktopoulou. “Encoding semantic relationships in literary texts: A methodological proposal for linking
networked entities into semantic relations.” Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). https://doi.org/10.4242/BalisageVol26.Koidaki01.
Balisage: The Markup Conference 2021 August 2 - 6, 2021
Balisage Paper: Encoding semantic relationships in literary texts
A methodological proposal for linking networked entities into semantic relations
Fotini Koidaki is a PhD candidate at the Aristotle University of Thessaloniki,
where she studies a 19th-century literary corpus of prose fiction using NLP
methods in order to examine its literary consistency, as well as its narrative
patterns and themes. She is an encoding researcher in the research and
entrepreneurial project ECARLE – Exploitation of Cultural Assets with
computer-assisted Recognition, Labeling and meta-data Enrichment – on behalf of
the Laboratory of Philology and New Technologies. At the same time she is
working on the research project Semantic analysis of 19th-century modern Greek
fiction with text mining techniques. Since 2015 she has coordinated DigiNes, a
research group focused on Digital Literary Studies, Digital Scholarly
Publishing and Literary Text Mining. She also holds a BA in Modern Greek
Literature and an MA in General and Comparative Literature.
Katerina Tiktopoulou holds a BA (1986) and a PhD (2003) in Modern Greek
Literature from the School of Philology, Aristotle University of Thessaloniki,
and a BA (2010) in Italian Literature from the School of Philology, University
La Sapienza of Rome. She worked (1997-2007) as a researcher in the Centre for
the Greek Language (Department of Language and Literature). Since 2016 she has
been associate professor of Modern Greek Philology in the School of Philology
at the Aristotle University of Thessaloniki. Since 2018 she has been
responsible for the MA program in Modern Greek Studies, and since 2019 Head of
the Laboratory of Philology and New Technologies in the same School. Her
research interests in Modern Greek Philology include the study of vernacular
literature of the 16th century, romanticism, autobiography, editorial
intervention, text editing and on-line publishing, and digital humanities. Her
research projects focus on the digital edition of the manuscripts of Greece’s
national poet Dionysios Solomos (1798-1857) and on the study of 19th-century
literature, combining traditional and new technologies. Currently, she is the
supervisor of two funded research projects in digital humanities.
Encoding meaningful semantic relationships in literary texts is almost as
difficult as defining and identifying them. Defining the types and the components of
semantic relationships that can be extracted from literary texts is quite a
challenging task, because literature is full of implicit and oblique messages and
references. Identifying and encoding semantic relationships in literature is even
more challenging, because relations often have neither a clear nor a standard
linguistic form, and they usually overlap one another. This paper
discusses modeling and encoding issues concerning the mapping of relationships of
cultural content in literary and humanities texts, highlighted by the case of the
ECARLE project annotation campaign. To handle these modeling and encoding issues,
the paper proposes a methodology of minimalistic and flexible annotation techniques,
combined in order to generate human-annotated training data for a Relation
Extraction machine learning system. The proposed methodology utilizes the available
TEI tagset and, without any further customization, allows the mapping of relations
formed by named entities in a simple yet flexible way, open to reuse, interchange,
conversion and visualization.
Today, literary criticism frequently focuses on the quantitative aspect of modern
literary studies, talking about the primacy of big data over the qualitative and
interpretive goals of humanities research. However, computer-aided text analysis methods
have shown that the line between the quantitative and the qualitative is often blurred,
because determining even the simplest quantitative feature requires interpretative
skills and deep content knowledge. Likewise, modeling semantic relationships in literary
texts is quite a challenging endeavour, because it requires not only technical expertise
but also in-depth knowledge of the subject. Semantic relationships in literary texts
often have neither a clear nor a standard linguistic structure; they are usually highly
implicit, and they overlap one another. As a result, it is difficult not only to
design patterns of semantic relations and define their components, but also to identify
them in the literary text and encode them without any loss of information.
This paper examines issues related to modeling and encoding semantic relationships
through the case of the ECARLE[1] project annotation campaign, which aimed to generate a dataset for training a
classifier for Relation Extraction from literary texts. The technique developed
within the framework of the ECARLE project was specifically designed to
extract cultural knowledge from 19th-century Greek polytonic texts related to ancient,
medieval and modern Greek literature. The overall project objective was to develop an
integrated Software as a Service (SaaS) that will enable users to effortlessly digitize
Greek polytonic humanities texts and enrich their content with metadata about the
logical structure of the documents, the publishing details, and the included
named entities and semantic relations. In simplified terms, users will be able to upload
scanned images of printed sources and receive for each of them an XML file in which the
bibliographic description, the content structure, the named entities and their
relations have been labeled. All the information extracted and captured in the XML
files will be encoded according to the international markup standard of the Text
Encoding Initiative (TEI) consortium. However, the user will be able to choose between
various export formats, since the extracted information can be exported to CSV, RDF,
XLS, METS etc.
To this end, the annotation campaign was specifically designed to create the
human-labeled dataset on which the training and the evaluation of the Named Entity
Recognition and Relation Extraction systems were subsequently based.[2] Since the main objective of the project is the digitization of Greek 19th-century
humanities documents as well as the discovery of cultural and literature-related
knowledge, below we discuss the technique developed in order to obtain the most
useful dataset in terms of semantic content and data capturing, as well as the various
parameters that had to be considered and defined. In particular, we discuss the
research background, the construction of the corpus that served as the reservoir of
textual data, some challenging issues related to modeling and encoding, the proposed
methodology of combined techniques, the generated dataset and the possible outcomes.
The generated dataset as well as the materials for parsing and processing the encoded
texts (such as the Python script and the documentation) are available on GitHub.[3]
Related research
There is a considerable amount of literature on multi-layered XML annotations, and so
far, during the 23 years of XML technology, several multi-hierarchical markup formalisms
have been reported as XML alternatives for handling overlapping cases (e.g. CONCUR,
LMNL, MECS). However, the OHCO XML model has prevailed over the multi-hierarchical
markup languages, and now the most popular alternative to the XML standoff mechanism
seems to be graph-based formalisms, usually based on the RDF data model (GODDAGs,
EARMARK, TEILIX etc). Even though RDF is probably the most convenient data model for
capturing and integrating multiple levels of annotation of the same content, it is
still a data interchange format, whereas XML can be used more easily as a
document interchange format. This means that XML seems to be the most suitable format
for encoding the primary data in the humanities and, in particular, in literary studies,
since in these fields encoding mostly concerns books or other kinds of textual
documents. RDF could instead represent the content of the standoff XML annotations,
allowing flexible data modeling compositions. Meanwhile, a quite interesting graph-based
annotation formalism was published in 2018 under the name Text-As-Graph Markup Language
(TAGML), focusing mainly, but not exclusively, on manuscript research. TAGML is built
upon the LMNL[4] markup model, uses cyclic-graph theory and introduces the text as a
multi-layered, non-linear construct.[5] Even though TAGML opens up new perspectives for the representation of
textual data, it lacks editing and processing tools, while querying and interchange
become even more difficult.[6]
In the humanities, and especially in literary studies, the TEI tagset seems to be the
most preferred as well as the most suitable XML-based language for encoding, editing and
processing textual documents. The TEI community keeps working on multi-layered annotations[7] and on enriching the <standoff> container, which, among others, may
include interpretative annotations, feature structures (e.g. syntactic structure,
grammatical structure), groups of links, and lists of persons, locations, events or
relations.
The LIFT project, published by Francesca Giovannetti in 2019, uses the TEI tagset to
encode semantic relations in literary texts. The project does not make use of the
<standoff> container. Instead, it establishes links among
elements (persons, locations, events) by transcribing them into lists inside the
<teiHeader> section, which mainly serves as the
resource registry. It should be noted, though, that the intention of the LIFT project is
to demonstrate the flexibility of object-oriented programming in manipulating and
parsing TEI/XML files, as it offers a Python 2.7 script that uses the lxml and RDFLib
libraries to convert TEI-encoded relations into RDF triples. The LIFT project also invites
users to choose among a set of ontologies that may meet their relation mapping needs
(AgRelOn, CIDOC, DCMI, OWL, PRO etc). However, none of the ontologies listed is designed
to represent literary knowledge or the knowledge inherent in humanities texts,
which, on the other hand, is completely understandable, since there seems to be no model
available for mapping the types and the components of semantic relations occurring in
humanities or literary texts. For example, the AgRelOn model, which defines private,
social or working relations among persons, is reported as incomplete and data-dependent,
while the well-known CIDOC-CRM is focused on museums and other kinds of cultural
institutions and is highly generic and holistic. DCMI is a model for libraries designed
for bibliographic description, whereas PRO is a model written in OWL that
describes concepts involved in the publishing process. OWL, in turn, is a
family of languages which contains a variety of syntaxes (OWL abstract, XML, RDF) and
specifications that can be used to describe taxonomies, relations and properties of web
resources. A domain-specific ontology is something that is still missing from literary
studies.
In much the same way, the ECARLE annotation campaign used the TEI tagset in order to
extract relations from a corpus of polytonic Greek texts related to literary studies.
The relationships were extracted as standoff annotations in the form of binary links,
pointing to labeled entities of different types (e.g. Person + Title). The linkages between
entities were designed on the basis of the RE training needs and constraints; since
there is no domain-specific ontology to model literary relations, not even a combination
of the above-mentioned ontologies could have described the extracted data with the
desired accuracy. The annotated TEI documents were finally parsed with Python 3.7
(BeautifulSoup, Pandas etc.), which allowed groupings of relationships that formed bigger
networks based on annotated years, persons, titles and locations. The generated
relations produced lists of named entities and IDs, as well as XLS, CSV and RDF
representations of the relations.
Corpus design
When designing our annotation campaign we had to consider the possible usage scenarios
of the SaaS we were aiming at, as well as the features and constraints of the Relation
Extraction training process. The idea behind our supervised machine learning (ML)
strategy was to build a corpus of literary(-related) texts that would facilitate all
of
the ML tasks of our project. For example, the scanned pages of the corpus would serve
as
training and testing input, their human-corrected content would be used as the ground
truth data for the Greek-polytonic OCR pipeline, while their annotated content would
serve as the main training dataset, available also for the evaluation of the NER and
RE
systems.
The corpus building criteria established for this purpose were related to some of the
key features of the corpus, namely i) the language, ii) the font family, iii) the
publication date, iv) the textual type (narrative, non-fictional, poetry, dictionary
etc) and v) the structure. The first three criteria were defined in such a way as to
facilitate the development of an OCR system capable of digitizing Greek polytonic texts
printed during the 19th century. The other two criteria (the textual type and the
document structure) were defined to ensure, as far as possible, the adequacy of
training data for both the layout analysis and the information extraction systems. In
fact, we managed to retrain the grc-tesseract model and further improve it by deploying
a text-line detection algorithm that enhanced the accuracy of character recognition
(Tzogka et al. 2021). However, enhancing the layout recognition results required the
structural features of the texts to be annotated not only in the TEI/XML files but also
on the scanned image files, producing enough training data for every structural feature
we meant to be recognised. Therefore, the very type of the text was an important
selection criterion, because different text types can differ both in their textual
structure and in the named entities they include. For example, factual text types,
such as dictionaries or essays, and literary text types, such as poems and narratives,
–not to mention the numerous sub-types of each type– often include different quantities
of named entities and relations, and their content is structured completely differently.
Taking into account that in our case a considerable amount of annotated named entities
and semantic relations was required, we decided to focus on prose texts, and
particularly on essays related to literary criticism, as well as on correspondence and
memoirs, which provided us with the desired structural diversity.
Issues
Human labeling is a time-consuming and complicated process, which involves many
challenges that one comes to realize only through practice. Our annotation campaign
was carried out by a research team of 5 pre-trained encoders, who were provided with
extra training and project-specific guidelines, as well as with encoded samples. The
annotation campaign was launched with a pilot phase of labeling four small texts, during
which each text was assigned to two encoders, while a super-encoder (a sixth researcher)
checked the well-formedness, the TEI conformance, as well as the discrepancies between
the encoded versions. Based on the evaluation of the pilot, we enriched our guidelines
and samples, and we built xml:id inventories (XLSX) that we kept updating during
labeling. Every printed document was assigned to two encoders with different roles,
one of whom undertook the structural encoding and the other the encoding of entities
and relations. After both encoders had checked their annotated deliverable, a
super-encoder verified the well-formedness and the TEI compliance. For the
encoding process we preferred an XML editor, since editing the code directly allowed
accurate supervision of the markup and efficient error handling. Omitting
some other crucial issues (e.g. related to human resource management), we will focus on
the challenges of converting unstructured plain texts into a structured set of
information that includes annotated data regarding the logical structure of the
document, the cultural named entities and their semantic relations.
Since one of the main objectives of our project was to develop a service for the
automated extraction of relationships of cultural content from Greek literary texts, we
had to comply with the entity and relation constraints imposed by the ML model. In
fact, the entity constraints concerned (a) the number and the type of the
participating entities, while the relation constraints concerned (b) the distance
between them. In particular, a relation was considered to be a valid training instance
if it consisted of only two (2) entities, each belonging to a different entity type,
while both had to be detected within no more than three consecutive sentences. Taking
these constraints into account, we specified the types of relations and their
components that could be annotated so as to generate a sufficient training dataset.
However, in designing the annotation campaign we had to overcome difficulties related
to entity ambiguities and overlapping relations.
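These constraints can be condensed into a simple predicate. The following Python sketch is purely illustrative; the tuple representation of the entities and the function name are our assumptions, not part of the project code:

    # Illustrative sketch only: an entity is assumed to be a
    # (entity_id, entity_type, sentence_index) tuple.
    def is_valid_training_relation(e1, e2):
        _, type1, sent1 = e1
        _, type2, sent2 = e2
        # (a) the two participating entities must belong to different types
        if type1 == type2:
            return False
        # (b) both must occur within no more than three consecutive sentences
        return abs(sent1 - sent2) <= 2

    # e.g. a person and a title found two sentences apart form a valid instance
    is_valid_training_relation(("pOrwell", "person", 4), ("t1984", "title", 6))  # True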
Ambiguity Issue
When it comes to literary texts, the entities of cultural meaning as well as their
relations may be highly unclear, or suggestive, or, much of the time, fictional. For
example, entity types like persons, titles of artworks or locations may appear in
various written forms, namely as nicknames, allegories and abbreviations.
Animal Farm, for instance, is a book title (Orwell, 1945), a
socio-political symbol, a song title (Kinks, 1968), a TV film title (1999) and a
reference to livestock, while 1984 is a numeral that at the same
time stands for a date and a book title. It is also quite common for book titles
to coincide with character names (e.g. Anna Karenina), or they can be shortened to
the character’s name (e.g. The Tragedy of Hamlet, Prince of Denmark becomes
Hamlet), or even take a one-letter form.[8] So, to which category does the phrase The Little Prince
belong? It could be either a book title or a reference to the well-known fictional
persona, or it could be neither of the two.
Our labeling strategy attempted to deal with ambiguity issues by producing more
detailed encodings of the ambiguous types of entities and by using the ID mechanism to
record, track down and link entities. In the first place, therefore, we decided to
encode every occurrence of the selected entities as long as they were written as
named entities and directly expressed. This means, in particular,
that we annotated abbreviated titles and names like N.Y., but ignored indirect and
implicit references such as the city that never sleeps. Where needed,
we described with attributes the entities’ distinctive features, which allowed
us to overcome the markup polysemy. For example, using the attribute-value mechanism
we distinguished between real and fictional persons. Finally, we agreed to exclude one
type of entity, namely historic events, since it was located in highly ambiguous
linguistic contexts: generalized phrases or words that may have no capital
letters and often consist just of common nouns. Thus, the category of historical
events was omitted from the current annotation in order to reduce the noise in
the labeled data and benefit the training of the NER classifier.
Given the ambiguity constraints, we defined two thematic categories of relations
about cultural heritage, namely (a) artworks and (b) state institutions. Moreover,
each of these categories was divided into specific subcategories relative to the
dating, the participants and the geography of the relation. The core components of
the relations we defined are five different types of named entities, that is i)
person, ii) organization, iii) location, iv) date and v) title.
Table I. Semantic Relations of Cultural Content

Relation themes       Relation categories    Named entities
Artworks              Artwork creator        Person - Title
                      Artwork hero           Person - Title
                      Artwork dating         Title - Date
State institutions    Inst-org geography     Organization - Location
                      Inst-org worker        Organization - Person
                      Inst-org dating        Organization - Date
The elements belonging to the five entity types were each assigned a single unique ID,
which, no matter the entity type, allowed us to generate inventories of titles,
persons, organizations, and even dates. The IDs were generated according to rules we
had laid down in order to enable labelers to supervise the already generated IDs, so
as to prevent homonymy. In fact, we came up with two complementary ways of avoiding
the creation of several IDs for the same named entity, or vice versa. More precisely,
we placed the letter t at the beginning of the titles’ IDs and the letter y at the
beginning of the years’ IDs. For instance, the y1984 ID represents the year,
while the t1984 ID stands for the book title, whether it is written as
Nineteen Eighty-Four or as 1984. Likewise, the IDs t1984 and
t1Q84 refer to two different named entities that correspond to two different books.
In this way, we were able to produce more indicative IDs and, therefore, more detailed
annotations, which allowed us to prevent confusion as well as misconnection between
entities. Thus, two apparently similar entities, whether or not they are related, were
now assigned distinctive IDs, which could be linked together. With this in mind, we
created XLS catalogues where we logged every ID we produced, so that we could avoid
assigning more than one ID to the same entity, or vice versa. In addition, we
associated the person and location entities with their VIAF IDs, where available, in
order to broaden the scope of use of our data and open up
opportunities for interlinking with cross-lingual data.
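In TEI terms, such disambiguating markup might look like the following sketch (the element choices, the @ref pointing attribute and the @type value marking fictional persons are our illustrative assumptions; the t and y prefixes follow the rules just described):

    <!-- the numeral 1984 as a year vs. as a book title -->
    <date ref="#y1984">1984</date>
    <title ref="#t1984">Nineteen Eighty-Four</title>
    <title ref="#t1984">1984</title>
    <!-- a fictional person distinguished from a real one by an attribute -->
    <persName type="fictional" ref="#pHamlet">Hamlet</persName>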
Overlapping problem
Although designing an entity-relationship model is a challenging relational and
multifocal task, the major challenge of our project proved to be the annotation of
the training data, for it had to meet certain conditions and, at the same time, it had
to produce accurate and flexible representations of our data without missing any
information. Given the parameters and constraints posed by the RE
training technique that was used, in order for a relation to be considered a
valid training sample it had to consist of only two objects of different entity
types that had been detected within no more than three consecutive sentences.
As a consequence, directly annotating the textual space that contains
a relation by marking up its boundaries was not an option.
Therefore, in numerous cases, relations emerged entangled and highly complex. In
these cases, it was quite common for two or more relations to overlap each other or
to consist of more than two entities. Relations of more than two entities could of
course be divided into smaller pairs of entities that comply with the two-component
rule. This solution, however, can intensify the overlapping problem, because
the relations’ components become even more entangled with each other.
For instance, if we had a passage stating multiple author names, titles and dates,
we would have to split it into pairs, restating the relation for every
written version of a name. However, those bipolar fragments of text
not only overlap each other, but are also more difficult to parse and group into
semantic relations. The types of overlapping relations can be presented in two
different schematic representations. As shown in the figure below, the second
relation may either start inside the previous one and end after it (case 1), or it
may nest inside it perfectly (case 2).
Although the second kind of relation may seem a well-formed XML
structure, it is, like the first one, perceived as an overlapping case, and both are
equally problematic. In the first case we cannot get valid XML tagging, while in
the second case one relation rules the other in a parent-child hierarchy. It is also
a matter of data quality, as both kinds contain a lot of noise: a tagged
relation includes not only the related named entities but also their context, which
may contain entities that are not involved in the relationship.
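Schematically, with hypothetical <rel1> and <rel2> wrappers standing for two tagged relations (precisely the kind of inline tagging our encoding avoids):

    Case 1 (crossing): <rel1> ... <rel2> ... </rel1> ... </rel2>   (not well-formed XML)
    Case 2 (nesting):  <rel1> ... <rel2> ... </rel2> ... </rel1>   (well-formed, but imposes a false parent-child hierarchy)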
Proposed methodology
Our encoding technique addressed the overlapping issues by using the standoff XML
mechanism in a simple yet flexible way that allowed us to record more than 1055
relations in a corpus of 2,846 pages. The mechanism of standoff annotation proved to be
quite appropriate, since it allows multi-hierarchical structures to be encoded in XML.
In fact, the multiple levels of information are captured in different levels of
annotation that are linked together. In the standoff representation, the concurrent
structures may be kept separated in different XML files that are linked together, or
they may be stored in a single XML file as separate sections. Among these options, we
chose the second one, because keeping all the information stored in a single file
is more flexible and more in line with the document-processing logic of the SaaS we were
aiming at. Since every XML file will contain a single book, it is much more
convenient to gather all the extracted information in a single XML file. As a result,
these XML files, which are TEI-conformant documents, include not only the encoded
content of the book (<text>), but also the relations extracted
(<standoff>), as well as all the metadata about its bibliographic
description and its digitization (<teiHeader>). The core structure of
the produced XML file is illustrated in the figure below.
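In outline, it follows the sketch below (indicative only; element names and ordering as used in this paper):

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <!-- bibliographic description and digitization metadata -->
      </teiHeader>
      <standoff>
        <!-- relations declared as links between entity IDs -->
      </standoff>
      <text>
        <!-- the encoded content of the digitized book -->
      </text>
    </TEI>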
But what exactly does the standoff section include, and how are the relations tagged
in there, since all the content of the digitized book is included in the
<text> section? No content was transcribed in the standoff
section. In fact, the relations were not actually tagged; rather, they were
declared as links inside the <standoff> container. Every link is
an empty tag which establishes a binary relation between two different entities by
targeting their IDs.
The ID assignment mechanism
The entities that had been observed to participate at least once in a relationship
were marked up and assigned a single unique ID. Each ID was recalled on every
occurrence of its signified, thus bonding together all the different written versions
of a single named entity. This ID pointing mechanism allowed us to generate
inventories of recorded entities (persons, titles, locations) that enabled us to
supervise the ID assignment, to collect the variety of tokens referring to
the extracted entities, and to provide the NER system with a training dataset. For
example, the name of the poet Jean Moréas has been recorded in 8 different versions
written in two languages (Γιάγκου Παπαδιαμαντοπούλου, Παπαδιαμαντόπουλον, Jean Moréas,
Παπαδιαμαντοπούλου, Παπαδιαμαντόπουλος, Ἰωάννην Παπαδιαμαντόπουλον, Ζὰν Μωρεὰς,
Μωρεάς).
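A minimal sketch of this bonding (the ID pMoreas and the @ref pointing attribute are illustrative assumptions):

    <persName ref="#pMoreas">Jean Moréas</persName>
    ...
    <persName ref="#pMoreas">Ζὰν Μωρεὰς</persName>
    ...
    <persName ref="#pMoreas">Παπαδιαμαντόπουλος</persName>

All occurrences point to the same inventory entry, so a parser can collect the full set of written variants of an entity under a single ID.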
When two named entities of different types were observed to form a relation, at
least once, within no more than three consecutive sentences, that relationship was
recorded as a link between their IDs. Therefore, every relation was encoded as a
single <link>, which, as illustrated in the figure below, is an
empty element that carries all the information about the relation in attribute-value
pairs. The @type attribute indicates the thematic category of the relation, while
the @target attribute links the related entities by declaring their IDs. The
standoff section, therefore, contains just empty links that represent
the relations discovered inside the whole book, starting from the title page. From
the moment a relation had been identified and annotated in the standoff section, the
link created was enough to represent any other occurrence of the same relation, since
the entities’ unique IDs were recalled on every occurrence in which they were observed
to participate in that relation.
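For instance, an artwork-creator and an artwork-dating relation might be declared as in the following sketch (the @type values and the person ID pOrwell are illustrative; the t and y prefixes follow the ID convention described above):

    <standoff>
      <linkGrp>
        <link type="artwork_creator" target="#pOrwell #t1984"/>
        <link type="artwork_dating" target="#t1984 #y1949"/>
      </linkGrp>
    </standoff>

Each <link> is self-contained: no text is transcribed and no textual span is wrapped, so no overlap can arise.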
The binary links that represent the discovered relationships are components of a
group-of-links element (<linkGrp>) contained inside the standoff
section, but they were not grouped further, because after evaluation we decided that
they did not need to be further classified. This simple standoff linking
mechanism proved effective in meeting the needs and the objectives of the
annotation campaign. The available tagset, however, allows more detailed descriptions
and classifications of the relations. Since there was no further grouping,
the <linkGrp> and <link> tags could easily be
replaced by the <listRelation> and <relation>
tags without any other changes. In addition, it should be mentioned that the @target
attribute can be assigned more than two values, thus linking more than
two entities and allowing larger and more complex relations to be captured.
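With the richer tagset, the same relation might read as in the following sketch (the attribute choices are illustrative; the TEI <relation> element also offers @active and @passive for directed relations):

    <listRelation>
      <relation name="artwork_creator" active="#pOrwell" passive="#t1984"/>
    </listRelation>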
The Generated Dataset
The encoded XML/TEI documents were parsed both with Python 3.7 and with XSLT. Python
proved to be more efficient, because our objective was to extract and process
relationship-related data rather than to convert the XML documents to other formats.
Besides, Python enables coders to combine libraries that allow easier and
more flexible data processing and visualization. In our case we used the
BeautifulSoup library in order to parse the XML files, and then the Pandas library
in order to manipulate and further exploit the extracted data. A data frame was
created for each of the two thematic categories we defined, and the information
collected was organized in 5 columns. For example, as illustrated in the figure
below, which depicts a random sample of the artworks_relations frame, the encoded
relations were organized in columns such as title_id, titles_name (in listed
tokens), creators_id, creator_name (in listed tokens), hero_id, hero_names (in
listed tokens), the document_title and date. The conversion of the encoded
information into frames per thematic category allowed us to reshape the data through
further groupings and categorisations. In total, 1055 relationships were recorded, of
which 908 were artwork-related and 147 referred to Greek and European
organizations and institutions.
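The parsing step can be sketched roughly as follows; this is a minimal illustration only (the file name book.xml is hypothetical, and the actual script is available on GitHub):

    from bs4 import BeautifulSoup
    import pandas as pd

    # Parse a TEI document with BeautifulSoup's XML mode (requires lxml)
    with open("book.xml", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "xml")

    # Collect every standoff <link> into a row of (relation type, entity IDs)
    rows = []
    for link in soup.find_all("link"):
        ids = [t.lstrip("#") for t in link.get("target", "").split()]
        if len(ids) == 2:
            rows.append({"type": link.get("type"), "id_1": ids[0], "id_2": ids[1]})

    # Load into Pandas for grouping and reshaping, e.g. counting per relation type
    df = pd.DataFrame(rows)
    print(df.groupby("type").size())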
The 1055 human-labeled relations served as the knowledge base for distant
supervision, which revealed in total 1553 instances of the already encoded relations
in the same corpus, and which formed the final training dataset.[9]
Conclusions
By encoding relations as links between entities, we succeeded in producing a more
realistic and accurate representation of the relations between named entities, because
each relationship was represented as a linkage between the IDs of named entities. In
this case, there was no reason to transfer and transcribe content into the standoff
section. The empty links targeting binary values worked perfectly as pointers, and
since the standoff annotation technique was combined with the ID linking mechanism, we
also easily networked every entity with all its available written versions. The ID
linking mechanism allowed us to match an entity with all its written instances, not
just in a single document but throughout the entire corpus of texts. This means that
from the moment a relation had been annotated in the standoff section, this one
annotation was enough to represent any other occurrence of the same relation. The
valuable outcome of the proposed methodology was not just the generated list of the
1055 relations, but also the inventory of entity tokens that enhanced our data. The
extracted information was captured in two-dimensional frames in a way that allowed
further analysis and knowledge extraction through data reshaping and grouping. For
instance, we could observe the relations grouped by title, by author, or even by year,
merging together the entries that had the same title_id, the same creators_id or the
same dating.
Even though the proposed methodology of combined techniques served the goals of the
ECARLE project perfectly, it still leaves room for improvement. It could be said that
the standoff annotations of relations could have been even more detailed and further
organized, that they could include additional information about ontology models, or
even that they could be linked with completely external hierarchies of
data. The conversion of the annotated information into other formats (RDF, CSV etc) is
easy and less important, because there are several tools for this job. On the other
hand, the real challenge lies in capturing, processing and reshaping semantic
information in a meaningful way, as well as in enriching our data with domain-specific
ontologies.
In addition to the benefits of analysis and visualization, the relations enhanced by
the interlinked entities formed a training dataset as complete as possible, in order to
supply both the NER and the RE models with training and evaluation data. At the same
time, the annotation methodology, as well as the generated dataset containing titles of
artworks, names of creators and heroes, and dates, contributes to the automatic
identification of artwork references, which remains of the highest importance. Needless
to say, the proposed methodology and the generated dataset are available for reuse,
dissemination and optimization, as they aim to contribute to the discovery and the
interconnection of semantic relationships of literary content.
Acknowledgements
We wish to acknowledge and thank our colleagues, namely Maria Georgoula, Valando
Landrou, Markia Liapi, Thalia Miliou and Marianna Silivrili, who worked on the
implementation of the method and the encoding of the selected texts.
We also acknowledge the support of this work by the project "ECARLE (Exploitation of
Cultural Assets with computer-assisted Recognition, Labeling and meta-data Enrichment -
cultural reserve Utilization using assisted identification, analysis, labeling and
evidence enrichment)", which has been integrated with project code T1DGK-05580 in the
framework of the Action "RESEARCH - CREATE - INNOVATE", which is co-financed by National
(EPANEK-NSRF 2014-2020) and Community resources (European Union, European Regional
Development Fund).
References
[Putra 2019] Putra, Hadi Syah, Rahmad Mahendra and Fariz Darari.
BudayaKB: Extraction of Cultural Heritage Entities from Heterogeneous
Formats. In Proceedings of the 9th International
Conference on Web Intelligence, Mining and Semantics (WIMS2019).
Association for Computing Machinery, New York, NY, USA, Article 6,
(2019). doi:https://doi.org/10.1145/3326467.3326487.
[Bleeker 2020] Bleeker, Elli, Bram Buitendijk and Ronald Haentjens
Dekker. Marking up microrevisions with major implications: Non-linear text in
TAG. Presented at Balisage: The Markup Conference 2020, Washington, DC, July
27 - 31, 2020. In Proceedings of Balisage: The Markup Conference
2020. Balisage Series on Markup Technologies, vol. 25 (2020). doi:https://doi.org/10.4242/BalisageVol25.Bleeker01.
[Haentjens 2018] Haentjens Dekker, Ronald,
Elli Bleeker, Bram Buitendijk, Astrid Kulsdom and David J. Birnbaum. TAGML: A
markup language of many dimensions. Presented at Balisage: The Markup
Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on
Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.HaentjensDekker01.
[Kruengkrai 2012] Kruengkrai, Canasai, Virach Sornlertlamvanich,
Watchira Buranasing and Thatsanee Charoenporn. Semantic Relation Extraction from
a Cultural Database. In Proceedings of the 3rd
Workshop on South and Southeast Asian Natural Language Processing (SANLP),
Association for Computational Linguistics, Mumbai (2012), pp. 15-24.
https://www.aclweb.org/anthology/W12-5002.pdf.
[Barrio 2014] Barrio, Pablo, Gonçalo Simoes, Helena Galhardas and
Luis Gravano. REEL: A relation extraction learning framework. In
Proceedings of the 14th ACM/IEEE-CS Joint Conference on
Digital Libraries, London (2014), pp. 455-45.
http://www.cs.columbia.edu/~gravano/Papers/2014/jcdl2014.pdf.
[Schmidt 2016] Schmidt, Desmond Allan. Using standoff
properties for marking-up historical documents in the humanities. it - Information Technology, vol. 58, no. 2 (2016), pp.
63-69. doi:https://doi.org/10.1515/itit-2015-0030.
[Stührenberg 2014] Stührenberg, Maik. Extending standoff annotation. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14),
Reykjavik (2014), pp. 169-174. https://www.aclweb.org/anthology/L14-1274/.
[Pose 2014] Pose, Javier, Patrice Lopez and Laurent Romary. A
Generic Formalism for Encoding Stand-off annotations in TEI. HAL (2014). https://hal.inria.fr/hal-01061548.
[Spadini 2019] Spadini, Elena, and Magdalena Turska. XML-TEI
Stand-off Markup: One Step Beyond. Digital Philology: A Journal of Medieval Cultures, vol.
8 (2019), pp. 225-239. doi:https://doi.org/10.1353/dph.2019.0025.
[Cayless 2019] Cayless, Hugh.
Implementing TEI Standoff Annotation in the browser. Presented at
Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In
Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). doi:https://doi.org/10.4242/BalisageVol23.Cayless01.
[Beshero-Bondar 2018] Beshero-Bondar, Elisa E., and Raffaele
Viglianti. Stand-off Bridges in the Frankenstein Variorum Project: Interchange
and Interoperability within TEI Markup Ecosystems. Presented at Balisage:
The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage
Series on Markup Technologies, vol. 21 (2018). doi:https://doi.org/10.4242/BalisageVol21.Beshero-Bondar01.
[Banski 2010] Banski, Piotr. Why TEI Stand-off Annotation
Doesn’t Quite Work and Why You Might Want to Use It Nevertheless.
Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6,
2010.
In Proceedings of Balisage: The Markup Conference
2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:https://doi.org/10.4242/BalisageVol5.Banski01.
[van Hooland et al. 2015] van Hooland, Seth, Max De Wilde, Ruben
Verborgh, Thomas Steiner and Rik Van de Walle. Exploring Entity Recognition and
Disambiguation for Cultural Heritage Collections. Digital Scholarship in the Humanities, vol. 30, no. 2
(2015), pp. 262-279. doi:https://doi.org/10.1093/llc/fqt067.
[Tzogka et al. 2021] Tzogka, Christina, Fotini
Koidaki, Stavros Doropoulos, Ioannis Papastergiou, Efthymios Agrafiotis, Katerina
Tiktopoulou and Stavros Vologiannidis. OCR Workflow: Facing Printed Texts of
Ancient, Medieval and Modern Greek Literature. In QURATOR
2021 – Conference on Digital Curation Technologies.
http://ceur-ws.org/Vol-2836/qurator2021_paper_8.pdf.
[1] Exploitation of Cultural Assets with computer-assisted Recognition, Labeling
and meta-data Enrichment https://ecarle.web.auth.gr/en/
[2] Different types of datasets were used for the training of the OCR and the
Document Layout Recognition systems (Tzogka et al. 2021).
[6] Due to the limited tools available for editing and parsing LMNL/TAGML, and due
to the rarity of its utilization, we preferred XML, without excluding
the possibility of future conversion between the two markup formats.
[7] See Cayless 2019 or records from lingSIG 2019 and TEI-C
elections 2020
[9] For more information about the ML training process, see Christou, D.,
and Tsoumakas, G. (2021). Extracting semantic relationships in Greek
literary texts. Manuscript submitted for publication.