Introduction
In this paper we primarily consider what we can gain from enhancing TEI-encoded texts with RDF, though there are other choices of re-representation which could also be profitable in the future. We consider the use of OAC annotations as part of our work for the future. To illustrate our approach, we take as a case study the Sharing Ancient Wisdoms (SAWS)[1] project, which explores and analyses the tradition of wisdom literatures in ancient Greek, Arabic and other languages. Our methods for representing semantic links within and between specific sections of these texts, and describing the relationships that exist between them in a systematic way, are documented and explained. We consider that this approach has the potential to be used widely to link and describe related sections of a variety of different types of texts. Given the common practice of publishing TEI documents as part of Digital Humanities research output, our central contribution is to demonstrate how the usefulness of these TEI documents can be developed further in diverse directions, beyond their current application for digital edition publication.[2]
The Sharing Ancient Wisdoms (SAWS) use case: sources and materials
SAWS[3] [4] [5] is a key use case for this work, demonstrating a requirement for a markup approach that encapsulates various types of information, including structural markup and semantic annotation. The SAWS project aims to present its texts digitally in a manner that enables linking and comparisons within and between anthologies, their source texts, and the texts that draw upon them. We are also creating a framework through which other projects can link their own materials to these texts via the Semantic Web, thus providing a ‘hub’ for future scholarship on these texts and in related areas. The project is funded by HERA (Humanities in the European Research Area) as part of a programme to investigate cultural dynamics in Europe, and is composed of teams at the Department of Digital Humanities and the Centre for e-Research at King's College London, The Newman Institute Uppsala in Sweden, and the University of Vienna.
Throughout antiquity and the Middle Ages, anthologies of extracts from larger texts containing wise or useful sayings were created and circulated widely, as a practical response to the cost and inaccessibility of full texts in an age when these existed only in manuscript form.[6] SAWS focuses on gnomologia (also known as florilegia), which are manuscripts that collected moral or social advice, and philosophical ideas, although the methods and tools developed are applicable to other manuscripts of an analogous form (e.g. medieval scientific or medical texts).[7]
The key characteristics of these manuscripts are that they are collections of smaller extracts of earlier works, and that, when new collections were created, they were rarely straightforward copies. Rather, sayings were selected from various manuscripts, reorganised or reordered, and subtly (or not so subtly) modified or reattributed. The genre also crossed linguistic barriers, in particular being translated from Greek into Arabic, and again these were rarely a matter of straightforward translations; they tend to be variations. In later centuries, these collections were translated into western European languages, and their significance is underlined by the fact that Caxton’s first imprint (the first book ever published in England) was one such collection.[8] Thus the corpus of material can be regarded as a very complex directed network or graph of manuscripts and individual sayings that are interrelated in a great variety of ways, an analysis of which can reveal a great deal about the dynamics of the cultures that created and used these texts.
Identifying and extracting the required data for SAWS
TEI traditionally excels in areas such as text structure definition and document metadata, and although it possesses the means to identify and define semantic relationships between sections of text, none of these methods has, as far as we have been able to determine, been adopted widely or used as a standard mechanism for recording the nature of the relationship between texts. For instance we could use <ref target="..."> to point to another section of text, but we would need to modify the schema to require that @type should appear, allowing us to insert a description of the relationship between these two sections. Another possibility would be to use an <interp> element with an @xml:id attribute that contained the required relationship: its @inst attribute could then be used to point to another section of text. However, the insertion of an attribute detailing the source of the asserted relationship (i.e. the person or bibliographic source responsible for making the assertion) is also vital to SAWS: we need to be able to trace the scholarly source of that link. The <relation> element, which is a recent addition to the TEI, provides us with the ability to include all of the desired information within one element: the ID of the section of text being linked from; the ID of the section of text being linked to; the nature of the relationship between the two sections; and the identity of the source responsible for making the assertion. Our use of the <relation> element is discussed fully below (see ‘Use case implementation: illustrating the SAWS usage of TEI and RDF’), but it is worth noting in this introductory section the important point that the use of <relation> allows us to enter RDF directly into the TEI document (i.e. the triples we are defining about the sections of text and their relationships) and to combine this with information about scholarly responsibility, all within one element. This is particularly useful when the data is being entered by scholars who are familiar with TEI encoding and are marking up the rest of their documents in TEI, but who do not have any training in RDF. Being able to enter the RDF data directly into the TEI document means that they do not have to learn a second set of skills, while at the same time we can make use of the advantages of RDF (see below, ‘Resulting benefits for information exploration and retrieval in the SAWS project’).
These types of semantic relationships within and between texts are particularly important to the understanding of how themes and ideas were transmitted between cultures, and across languages and time. As an example use case, in the SAWS project a key point of interest to our manuscript scholars is to represent relationships within and between different collections of wise or moral sayings, and to investigate how these collections have been referred to, amended and/or passed on from manuscript to manuscript. We want to record and visualise the links within and between these collections; from these collections to their source texts (e.g. Aristotle’s writings); and from these collections to their recipient texts (e.g. the 11th-century Strategikon of Kekaumenos, as well as later texts). Critically, we want to do this in a way which can be repeated by others, so that our collection of texts acts as an example and starting point for a larger enterprise taking this approach beyond our project alone.
At the moment, scholars of gnomologia and their related texts tend to work from manuscripts and printed editions, and the links between the texts they are working on are recorded within commentaries and footnotes. Sometimes their editions will include studies of the relationships between specific manuscripts: for instance, a discussion of the transmission of a particular work through a number of different manuscripts. What SAWS will provide is the ability for scholars to investigate much more deeply the relationships between specific sayings within those texts, and to follow those links through a number of different variants and languages. This is achieved by enabling identification and annotation of relationships by different scholars within a ‘hub’ that will provide visualisations of those relationships as well as direct links to the texts concerned (or in the case of texts that are not digitised, a URI for that text). Scholars interested in a particular saying or set of sayings will immediately be able to see both the fact that the saying is related to sayings within other texts (each of these identifiers will be displayed to them, with a clickable link to that text), and will also see a description of the nature of the relationships that have been identified. They will also be able to view who has asserted that relationship, and can add their own assertions or notes as desired.
As an illustration of why this is important for textual scholars, consider this saying from Gnomologium Vaticanum (no. 87):
Ὁ αὐτὸς ἐρωτηθεὶς τίνα μᾶλλον ἀγαπᾷ, Φίλιππον ἢ Ἀριστοτέλην, εἶπεν· “ὁμοίως ἀμφοτέρους· ὁ μὲν γάρ μοι τὸ ζῆν ἐχαρίσατο, ὁ δὲ τὸ καλῶς ζῆν ἐπαίδευσεν.”
Alexander, asked whom he loved more, Philip or Aristotle, said: “Both equally, for one gave me the gift of life, the other taught me to live the virtuous life.”
We can identify that this saying (i.e. section of text) exists in various forms in earlier works, and that there are relationships that can be defined between our first example and those below (and indeed between the various examples below):
Plutarch, Life of Alexander 8.4.1:
Ἀριστοτέλην δὲ θαυμάζων ἐν ἀρχῇ καὶ ἀγαπῶν οὐχ ἧττον, ὡς αὐτὸς ἔλεγε, τοῦ πατρός, ὡς δι' ἐκεῖνον μὲν ζῶν, διὰ τοῦτον δὲ καλῶς ζῶν ...
Alexander admired Aristotle at the start and loved him no less, as he himself said, than his own father, since he had life through his father but the virtuous life through Aristotle …
Diogenes Laertius 5.19, Life of Aristotle:
Τῶν γονέων τοὺς παιδεύσαντας ἐντιμοτέρους εἶναι τῶν μόνον γεννησάντων· τοὺς μὲν γὰρ τὸ ζῆν, τοὺς δὲ τὸ καλῶς ζῆν παρασχέσθαι.
Aristotle said that educators are more to be honored than mere begetters, for the latter offer life but the former offer the good life.
Pythagoras? Selections from the Sayings of the Four Philosophers: (B) Pythagoras saying 18 (ed. Gutas):
وقال الآباء هم سبب الحياة والحكماء هم سبب صلاح الحياة
He said: Fathers are the cause of life, but philosophers are the cause of the good life.
We can see clearly that these four sayings are related to one another in various ways, but that there are complexities between these texts that need to be described and documented (and ideally visualised) if we are going to be able to trace these relationships in a systematic way.
In the last example above, we can see that the saying has been attributed to a different author (Pythagoras), rather than being associated with Aristotle or his pupil Alexander: alternative attributions are a common feature of this type of text, and they add another layer of complexity to the types of relationship that need to be defined.
In our TEI document, therefore, we need to be able to:
-
insert links between these sections of text (which may or may not already be published digitally);
-
make scholarly assertions in a systematic way about the nature of the (often complex) relationships between these texts.
In order to achieve these aims, we have chosen to enhance our TEI with RDF. RDF provides an ideal way to store and manipulate our relationship data: each of the sayings can be linked to other relevant sections of text by means of a subject-predicate-object relationship that is defined as part of an ontology, which acts as an authority list. One of the main advantages of the ontology for the SAWS project is that it ensures consistency of description across texts that can vary greatly in their nature, but interestingly it has also acted as a means of stimulating scholarly discussion about the nature of the relationships and the ways in which they should be described. The textual scholars involved in the project have found that the necessity to be completely explicit about their decision-making processes and definitions has prompted them to identify, and describe concisely, new relationships that exist within and between their texts.
The way in which we are implementing the use of RDF within our TEI documents will now be described, and will be followed by specific examples from our SAWS texts to illustrate how this is being put into practice.
Background: Previous TEI and RDF combinatory approaches
We would like to be able to use RDF-like syntax to mark up information of semantic interest, such as relations between sections of text and links to external entities, supported by a relevant vocabulary. Whilst RDFa allows RDF to be directly encoded in markup documents, it has been primarily deployed in XHTML documents to date. It would be desirable to extend the scope of RDF more widely, and particularly for our purposes (and others[9]) to TEI XML documents, without extensive changes being required to the variant of XML being used for the source document or to the skills and workflow being used in the markup process. This last point is of particular concern for non-technical users of TEI markup: an established and growing community, not least given the increasing adoption of TEI by humanities scholars for Digital Humanities research.[10] Keeping structural, syntactical and semantic information in the same documents where possible also makes the process of markup simpler and less error-prone for non-technical users who wish to mark up documents with their annotations, though it is acknowledged that this is not always possible. To date, no method for accommodating TEI and RDF in the same document has been adopted as standard by the TEI community, though several approaches have recently been offered.[11]
RDFTEF[12] [13] is a Java-based tool for converting TEI files to a form which can incorporate and output RDF/XML markup. Based around the Jena framework for semantic web applications,[14] RDFTEF implements a basic ontology for representing structural and syntactical elements and allows additional ontologies to be added as required. Though SPARQL queries can be fashioned to query the resulting RDF, these need to be relatively complex and standard XML tools cannot be deployed within the RDFTEF environment.[15] RDFTEF has been criticised as ‘[o]nly a “toy” experiment’[16] for these limitations and due to its lack of ongoing maintenance (last source code update 2007). Also, RDFTEF introduces a new stage of work to the existing editing workflow and requires extra software to be deployed for and learned by the users. Given the non-technical nature of the target audience who will be marking up the documents with this semantic information, this is a significant concern to the SAWS project and potentially hinders the adoption of our approach by our target users.
The issues for non-technical users also problematise other interesting approaches, where RDFa has been used to encode RDF in a TEI document.[17] [18] Although the markup process was relatively straightforward, specialised scripts had to be deployed to extract the RDF information in a form suitable for adding to a triple store. Deploying such scripts is non-trivial for non-technical users, both in setting up the appropriate environment and in executing the scripts. The scripts used in Jewell’s and Lawrence’s work were also highly specific to the type of information in those documents, rather than being more domain-general. The same problems of over-specific scripts and difficult implementation were also seen in a similar script-based approach to automated creation of RDF triples from TEI documents, in work performed by the SPQR project.[19] In terms of implementation and re-use, there is a more user-friendly alternative: transformations through XSLT stylesheets, the execution of which is incorporated into the user interface of tools like the Oxygen XML editor. To avoid or at least reduce over-specificity and encourage re-use of our materials, the adoption of a more generic underlying model for transformations is an interesting alternative, as is explored in this present paper.
Another tool is available to represent document structure(s) with RDF: the EARMARK OWL ontology.[20] [21]
The inclusion of RDF in TEI documents is a current area of interest in the TEI community. Members of the TEI-Ontologies Special Interest Group (SIG)[22] are using XSLTs to convert TEI to RDF, by relating TEI markup to vocabulary in the CIDOC-CRM cultural heritage model[23] (a recognised ISO standard: ISO 21127). The SIG has also discussed the inclusion of FRBRoo[24] (a bibliographical records model harmonised with CIDOC-CRM) in the base vocabulary;[25] however, work in this area has not progressed and development has been concentrated around a TEI-CIDOC harmonisation. This co-operation between TEI and CIDOC-CRM has been formally active since the formation of the SIG in 2004 and has seen regular but reasonably slow-paced development,[26] probably due to the other commitments and geographical dispersion of the researchers involved. Some mappings have been drafted[27] (last updated 2007/8) and stylesheets[28] (last updated 2011) and guidelines[29] (last updated 2010) have been published, but several issues exist that are hampering the SIG’s progress:
-
The approach taken by the SIG requires some changes to be made to TEI, with new elements to be added and others to be extended.[30] This raises questions as to the applicability of the resulting stylesheet to existing and legacy TEI documents.
-
The size of the current TEI P5 tagset, containing hundreds of elements, raises practical difficulties in providing a comprehensive mapping from TEI to alternative representations. The TEI ontologies SIG has identified a subset of TEI elements to map to CIDOC-CRM, choosing only elements which represent semantically meaningful elements within the text, “elements such as persons, places, dates and events”.[31] This approach is practical but disregards many triples of potential interest within the TEI markup such as document structure and metadata. It also limits the scope of output triples to only those elements encodable using TEI markup, such as names of places and people.
-
It is questionable whether CIDOC-CRM is the best choice of vocabulary to be used for modelling textual document information, especially as its only direct representation of lexical material is through one class (E33 Linguistic Object) and its two subclasses (E34 Inscription, E35 Title). This choice of CIDOC as base model is acknowledged to be influenced by the research interests of the SIG members in cultural heritage and museum documentation.[32] Particularly for metadata information such as that contained in the TEI Header, the Dublin Core model[33] seems a more natural choice and is a highly developed and widely adopted ontology. A mapping from TEI to DC has been tackled in stylesheets[34] but does not appear in their main approach or considerations.
It is desirable (e.g. for SAWS) to be able to mark up triple-like relations directly in TEI, particularly if those relations are specific to the subject domain of the original text and/or if the relations indicate semantic information which cannot currently be encoded using TEI markup. The <relation> element has recently[35] been recommended by the TEI for encoding RDF relations in a TEI document, representing the Subject-Predicate-Object triple format through the following attributes of <relation>: @active, @ref and @passive respectively. This has increased the expressiveness of standard TEI markup without requiring changes within TEI. Further, RDF can be included directly in TEI markup, allowing researchers to use the workflow and tools they are already accustomed to rather than introducing a requirement for new tools to be learnt and used, external to the existing workflow. This is of particular benefit for users of TEI who do not have a strong technical background.
Automatic extraction of information from TEI documents
Much information can be extracted from the markup already in a TEI document, particularly metadata and document structure. This ensures that the markup work already invested in texts can be reused and represented in alternative forms that are more amenable to querying and automated reasoning. For example, in SAWS, there is an interest in how the structure and ordering of wise sayings changes as they are copied from one manuscript to another.
Acknowledging the size of the TEI tagset and the associated practical difficulties in mapping, we take the minimal subset of TEI needed to encode a document in TEI markup, TEI-Bare. Work done with this schema serves as a basis for further extensions, for example to TEI-Lite, identified as “the most widely used TEI customization”.[36] The Dublin Core Metadata Initiative[37] forms the base model for the mappings from TEI.
The comparison of TEI and RDF is an oddly emotional topic. The strength of RDF lies in its apparent simplicity, and in its interoperability. RDF data is discoverable, and reusable. An OAC annotation for instance may have any number of targets of differing types. TEI allows for extremely granular expression with a context; RDF may often not require context to be meaningful.
Deceptively simple SPO assertions can be combined to tell complex stories. The following annotation is relatively terse, but conveys much information, all of it easily discoverable using either SOLR or SPARQL. There is considerable metadata surrounding the individual annotation indicating what standards were employed, how it was encoded, the creation date etc.
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:dms="http://dms.stanford.edu/ns/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:ore="http://www.openarchives.org/ore/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:oac="http://www.openannotation.org/ns/">
  <rdf:Description rdf:about="http://example.com/Development/fedora/repository/ilives:112490/AnnotationList">
    <rdf:rest rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#nil"/>
    <rdf:type rdf:resource="http://dms.stanford.edu/ns/AnnotationList"/>
    <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#List"/>
    <rdf:type rdf:resource="http://www.openarchives.org/ore/terms/Aggregation"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.com/Development/fedora/repository/ilives:112490/Canvas">
    <rdf:type rdf:resource="http://dms.stanford.edu/ns/Canvas"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.com/Development/emic/serve/ilives:112490/AnnotationList/AnnotationList.xml">
    <ore:describes rdf:resource="http://example.com/Development/fedora/repository/ilives:112490/AnnotationList"/>
    <rdf:type rdf:resource=""/>
    <dc:format>application/rdf+xml</dc:format>
    <dcterms:modified>2012-07-18T15:52:12-03:00</dcterms:modified>
  </rdf:Description>
  <rdf:Description rdf:about="urn:uuid:C5501895-BEA0-0001-DDE4-99D25F82B940">
    <rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
    <oac:hasBody rdf:resource="urn:uuid:C5501895-BEE0-0001-DE69-3CF318B030D0"/>
    <oac:hasTarget rdf:resource="urn:uuid:C5501895-BEF0-0001-3047-EF003BF91846"/>
    <dcterms:created>2012-07-18 18:52:12 UTC</dcterms:created>
    <dc:title>New Annotation</dc:title>
  </rdf:Description>
  <rdf:Description xmlns:cnt="http://www.w3.org/2008/content#" rdf:about="urn:uuid:C5501895-BEE0-0001-DE69-3CF318B030D0">
    <rdf:type rdf:resource="http://www.w3.org/2008/content#ContentAsText"/>
    <cnt:chars>Sample text for demo</cnt:chars>
    <cnt:characterEncoding>utf-8</cnt:characterEncoding>
  </rdf:Description>
  <rdf:Description rdf:about="urn:uuid:C5501895-BEF0-0001-3047-EF003BF91846">
    <rdf:type rdf:resource="http://www.openannotation.org/ns/ConstrainedTarget"/>
    <oac:constrains rdf:resource="http://example.com/Development/fedora/repository/ilives:112490/Canvas"/>
    <oac:constrainedBy rdf:resource="urn:uuid:C5501895-BEF0-0001-BD39-1FD48820108C"/>
  </rdf:Description>
  <rdf:Description xmlns:cnt="http://www.w3.org/2008/content#" rdf:about="urn:uuid:C5501895-BEF0-0001-BD39-1FD48820108C">
    <rdf:type rdf:resource="http://www.openannotation.org/ns/SvgConstraint"/>
    <rdf:type rdf:resource="http://www.w3.org/2008/content#ContentAsText"/>
    <cnt:chars><svg:rect xmlns:svg='http://www.w3.org/2000/svg' x='283.5' y='615.5' width='377' height='108' r='0' rx='0' ry='0' fill='#ffffff' stroke='#000000' style='opacity: 0.7; stroke-width: 2;' opacity='0.7' stroke-width='2' ></svg:rect></cnt:chars>
    <cnt:characterEncoding>utf-8</cnt:characterEncoding>
  </rdf:Description>
</rdf:RDF>
An advantage of OAC-style encoding is that embedded tags are not necessary for the designation of a target. A target may be defined either by SVG coordinates, as in the example above, or by start and end points given as line/character positions. These points may fall inside tagsets, allowing us to mimic overlapping tags without breaking XML validation. In the following example the RDF targets a span of text beginning at the 6th character and 11 characters long, and ties this back to an authority record.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:w="http://cwrctc.artsrn.ualberta.ca/#">
  <rdf:Description rdf:ID="ent_1">
    <w:id type="offset">ent_1</w:id>
    <w:parent type="offset">struct_02</w:parent>
    <w:offset type="offset">6</w:offset>
    <w:length type="offset">11</w:length>
    <w:type type="props">place</w:type>
    <w:content type="props">Jean Golfin</w:content>
    <w:term type="info">Golfin, Jean</w:term>
    <w:viafid type="info">2498498</w:viafid>
    <w:ptbnp type="info">243840</w:ptbnp>
    <w:bnf type="info">12092537</w:bnf>
    <w:lc type="info">n85202716</w:lc>
    <w:certainty type="info">definite</w:certainty>
  </rdf:Description>
</rdf:RDF>
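The offset-and-length targeting shown in this record can be illustrated with a minimal Python sketch. It is not taken from any OAC or annotation tooling: the OffsetTarget class, the resolve and overlaps functions, the sample sentence and the second target are all hypothetical, and the offset is assumed to be zero-based, which the record itself does not state.

```python
from dataclasses import dataclass


@dataclass
class OffsetTarget:
    """A stand-off target: a run of characters inside a parent unit."""
    parent: str   # id of the structural unit holding the text (w:parent)
    offset: int   # start position in the parent's text (w:offset)
    length: int   # number of characters covered (w:length)


def resolve(target, texts):
    """Return the characters that the target points at."""
    text = texts[target.parent]
    # Assumption: offsets are zero-based.
    return text[target.offset:target.offset + target.length]


def overlaps(a, b):
    """True if two targets in the same parent share at least one character."""
    return (a.parent == b.parent
            and a.offset < b.offset + b.length
            and b.offset < a.offset + a.length)


# Hypothetical text for the unit 'struct_02'; only the name comes from the record.
texts = {"struct_02": "About Jean Golfin and his work"}

name = OffsetTarget("struct_02", 6, 11)      # offset and length from the record
phrase = OffsetTarget("struct_02", 11, 14)   # invented second target

print(resolve(name, texts))
print(resolve(phrase, texts))
print(overlaps(name, phrase))
```

Because both targets live outside the text, they can overlap freely, which nested inline tags cannot do without breaking well-formedness.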
By moving our structural TEI encoding, still very valuable in its native form, to OAC/RDF equivalents, we expose relationships based either on the physical textual coordinates, x/y coordinates, or structural location.
Use case implementation: illustrating the SAWS usage of TEI and RDF
The requirements for the SAWS project have been described above; namely that we need to insert links between sections of text within and between documents (some of which exist in digital form, and some of which do not), and to make scholarly assertions in a systematic way about the nature of these often complex relationships between sections of text.
First of all, therefore, we must define the basic unit of interest (a ‘section’ or ‘segment’ of text), i.e. the saying (or part of the saying). The SAWS TEI schema, designed at King’s College London for the encoding of gnomologia, uses the <seg> element to mark up this unit of intellectual interest, such as a saying (statement) together with its surrounding story (narrative). For example:
Alexander, asked whom he loved more, Philip or Aristotle, said: “Both equally, for one gave me the gift of life, the other taught me to live the virtuous life”.[38]
This contains both a statement and a narrative:
<seg type="contentItem">
  <seg type="narrative"> Alexander, asked whom he loved more, Philip or Aristotle, said: </seg>
  <seg type="statement"> Both equally, for one gave me the gift of life, the other taught me to live the virtuous life. </seg>
</seg>
Each of these <seg> elements is given an @xml:id to provide a unique identifier (which is automatically generated using simple XSLT). This identifier differentiates one <seg> from all other examples of <seg>, for instance <seg type="statement" xml:id="K.al-Haraka_ci_s1">, where K.al-Haraka_ci_s1 is the unique identifier. In other words, it allows each intellectually interesting unit (as identified by our team’s scholars) to be distinguished from each other unit, thus providing the means of referring to a specific, often very brief, section of the text.
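The identifiers in SAWS are generated with XSLT, which is not reproduced here. As a rough illustration of the same idea in Python, the following sketch walks the <seg> elements in document order and gives each one a unique @xml:id. The naming scheme (a prefix, a type abbreviation and a counter), the abbreviation table and the prefix "Example" are assumptions made for this sketch and do not describe the project's actual scheme.

```python
import xml.etree.ElementTree as ET

# Attribute name that ElementTree uses for xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical abbreviations for the @type values used in the example above.
TYPE_ABBREV = {"contentItem": "ci", "narrative": "n", "statement": "s"}

SEG_SAMPLE = """<seg type="contentItem">
  <seg type="narrative">Alexander, asked whom he loved more, Philip or Aristotle, said:</seg>
  <seg type="statement">Both equally, for one gave me the gift of life, the other taught me to live the virtuous life.</seg>
</seg>"""


def assign_seg_ids(root, prefix):
    """Give every <seg> without an xml:id a unique one, in document order."""
    counters = {}
    for seg in root.iter("seg"):
        if seg.get(XML_ID) is not None:
            continue  # keep identifiers that are already present
        abbrev = TYPE_ABBREV.get(seg.get("type"), "seg")
        counters[abbrev] = counters.get(abbrev, 0) + 1
        seg.set(XML_ID, "{}_{}{}".format(prefix, abbrev, counters[abbrev]))
    return root


seg_root = assign_seg_ids(ET.fromstring(SEG_SAMPLE), "Example")
seg_ids = [seg.get(XML_ID) for seg in seg_root.iter("seg")]
print(seg_ids)
```

The sketch does not check generated identifiers against ones already present in the document, and it matches un-namespaced <seg> elements only; a real TEI file would need the TEI namespace in the element name.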
Secondly, we must have a systematic way of defining the relationship between one section of text and another. Using a systematic method is important for two reasons: to ensure consistency in the descriptive terms that we use across the SAWS project, and to develop a shared vocabulary between SAWS and other projects to which we want to make links (and which want to link their data to ours). We have therefore taken every possible opportunity to explore with other manuscript scholars the terms they need to use to describe the relationships that they can observe within, and between, their texts. Relationships identified include terms such as isCloseRenderingOf, isLooseTranslationOf, isVerbatimOf, and a variety of other terms that represent in an agreed form the different ways in which sections of text are connected to one another.
We are representing these relationships using an ontology that extends the FRBR-oo model[39] (the harmonisation of the FRBR model of bibliographic records[40] and the CIDOC[41] Conceptual Reference Model (CIDOC-CRM)). The SAWS[42] ontology, developed through collaboration between domain experts and technical observers, models the classes and links in the SAWS manuscripts. Basing the SAWS ontology around FRBR-oo provides most of the vocabulary needed for both the bibliographic (FRBR) and cultural heritage (CIDOC) aspects being modelled. Using this underlying ontology as a basis, relationships between (or within) manuscripts can be added to the TEI documents using RDF markup.[43]
To include RDF triples in TEI documents, three entities have to be represented for each triple: the subject being linked from, the object being linked to, and a description of the link between them. The subject and object entities in the RDF triple are represented by the @xml:id that has been given to each of the TEI sections of interest. We use the TEI element <relation/> (recently added to TEI) to place RDF markup in the SAWS documents, with four attributes as follows:
-
The value of @active is the @xml:id of the subject being linked from;
-
The value of @passive is the @xml:id or URI of the object being linked to;
-
The value of @ref is the description of the relationship, which is drawn directly from the list of relationships in the ontology;
-
The value of @resp is the name or identifier of a particular individual or resource (such as a bibliographic reference). Many of the links being highlighted are subjectively identified and are a matter of expert opinion, so it is important to record the identity of the person(s) responsible.
For example:
<seg type="statement" xml:id="K._al-Haraka_ci_s5"> برهان ثالث كل محرّك لذاته فهو راجع على ذاته </seg>
<seg type="contentItem" xml:id="Proclus_ET_Prop.17_ci1"> Πᾶν τὸ ἑαυτὸ κινοῦν πρώτως πρὸς ἑαυτό ἐστιν ἐπιστρεπτικόν. </seg>
<relation active="http://www.ancientwisdoms.ac.uk/mss/K._al-Haraka#ci_s5"
          ref="http://purl.org/saws/ontology#isCloseRenderingOf"
          passive="http://www.ancientwisdoms.ac.uk/mss/Proclus_ET_Prop.17#ci1"
          resp="http://purl.org/saws/people#wakelnig" />
This is equivalent to stating that the Arabic segment with the xml:id “ci_s5” in the K._al-Haraka document is a close rendering of the Greek segment identified as “ci1” in Proclus_ET_Prop.17, and that this relationship has been asserted by Elvira Wakelnig. The definition of ‘isCloseRenderingOf’ has been agreed upon and documented within the ontology, and the schema has been populated from the ontology so that a drop-down menu appears in the XML editor, from which the required value of @ref can be selected. The <relation/> element can be placed anywhere within the TEI document, or indeed in a separate document if required: for our own purposes we have found it useful to place it immediately after the closing tag of the <seg> identified as the “active” entity.
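To make the correspondence between <relation/> and an RDF triple concrete, the following Python sketch reads the four attributes and prints the statement as an N-Triples line. It is an illustration only, not the project's extraction code: the function names are hypothetical, the set of known relationships contains only the three names quoted earlier rather than the full ontology, and @resp is kept alongside the triple because a plain triple has no slot for it.

```python
import xml.etree.ElementTree as ET

# Relationship names quoted earlier in the text; the real authority list
# is the SAWS ontology, which is much larger.
KNOWN_RELATIONSHIPS = {"isCloseRenderingOf", "isLooseTranslationOf", "isVerbatimOf"}

RELATION_SAMPLE = """<body>
  <relation active="http://www.ancientwisdoms.ac.uk/mss/K._al-Haraka#ci_s5"
            ref="http://purl.org/saws/ontology#isCloseRenderingOf"
            passive="http://www.ancientwisdoms.ac.uk/mss/Proclus_ET_Prop.17#ci1"
            resp="http://purl.org/saws/people#wakelnig"/>
</body>"""


def local_name(tag):
    """Strip a '{namespace}' prefix so that namespaced TEI is matched too."""
    return tag.rsplit("}", 1)[-1]


def extract_relations(xml_text):
    """Collect subject, predicate, object and responsibility per <relation>."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        if local_name(element.tag) != "relation":
            continue
        found.append({
            "subject": element.get("active"),
            "predicate": element.get("ref"),
            "object": element.get("passive"),
            "resp": element.get("resp"),
        })
    return found


def uses_known_relationship(relation):
    """Check the predicate's fragment against the (partial) authority list."""
    return relation["predicate"].rsplit("#", 1)[-1] in KNOWN_RELATIONSHIPS


def to_ntriple(relation):
    """Serialise the subject-predicate-object part as one N-Triples line."""
    return "<{subject}> <{predicate}> <{object}> .".format(**relation)


relations = extract_relations(RELATION_SAMPLE)
for relation in relations:
    print(to_ntriple(relation), "# asserted by", relation["resp"])
```

Checking the predicate against a fixed list mirrors the role the text gives to the ontology as an authority list; in the project itself that constraint is enforced earlier, by the schema-driven drop-down menu in the XML editor.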
Some of the content of our texts could also be enhanced by being viewed in context by including information external to the XML document. For this purpose, the SAWS project will also use Linked Data principles to mark up our texts with semantic links to collections of data on the ancient world, such as the Pleiades[44] historical gazetteer of ancient places and the Pelagios collection of ancient data[45] interlinked through Pleiades references, and the Prosopography of the Byzantine World[46], which aims to document all the individuals mentioned in textual Byzantine sources from the seventh to thirteenth centuries. We also plan to mark up links to existing relevant documents such as those stored in the Perseus Digital Library[47] (which holds editions of some of the texts we identify as source texts for the gnomologia).
Examples of transformations from TEI to RDF for the SAWS use case
Taking the SAWS use case as an example, the TEI version of the Kitāb al-Ḥaraka (“Book of Motion”) held at Ankara Üniversitesi contains the following TEI-Bare-compliant information in its TEI header:
<teiHeader> <fileDesc> <titleStmt> <title>Hacı Mahmud Efendi 5683</title> </titleStmt> <publicationStmt> <publisher>Sharing Ancient Wisdoms</publisher> </publicationStmt> <sourceDesc> <msDesc> <msContents> <msItem> <author>(Pseudo-)Aristotle</author> <title>Kitab al-Haraka</title> </msItem> </msContents> </msDesc> </sourceDesc> </fileDesc> </teiHeader>
Applying the XSLT generates the following Dublin Core triples[48]:
<rdf:Description rdf:about="http://www.ancientwisdoms.ac.uk/mss/HacıMahmud5683#"> <dct:title>Hacı Mahmud Efendi 5683</dct:title> <dct:creator>(Pseudo-)Aristotle</dct:creator> <dct:type>TEI/XML</dct:type> <dct:conformsTo>http://www.tei-c.org/ns/1.0</dct:conformsTo> </rdf:Description>
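The header-to-metadata step can be approximated in Python as follows. This is a sketch under stated assumptions rather than a port of the XSLT: the function name header_to_dc is hypothetical, the root <TEI> element and its xml:id value are reconstructed from the rdf:about URI above, and the choice of which header elements feed dct:title and dct:creator is inferred from the sample output.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
BASE = "http://www.ancientwisdoms.ac.uk/mss/"

# Header from the example above, wrapped in a root <TEI> element whose xml:id
# is assumed from the rdf:about value in the generated RDF.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="HacıMahmud5683">
  <teiHeader><fileDesc>
    <titleStmt><title>Hacı Mahmud Efendi 5683</title></titleStmt>
    <publicationStmt><publisher>Sharing Ancient Wisdoms</publisher></publicationStmt>
    <sourceDesc><msDesc><msContents><msItem>
      <author>(Pseudo-)Aristotle</author>
      <title>Kitab al-Haraka</title>
    </msItem></msContents></msDesc></sourceDesc>
  </fileDesc></teiHeader>
</TEI>"""


def header_to_dc(tei_xml):
    """Return Dublin Core statements as (subject, property, value) tuples."""
    root = ET.fromstring(tei_xml)
    # The manuscript URI is built from the xml:id on the root <TEI> element.
    subject = BASE + root.get(XML_ID) + "#"
    title = root.find(f".//{TEI}titleStmt/{TEI}title")
    author = root.find(f".//{TEI}msItem/{TEI}author")
    return [
        (subject, "dct:title", title.text),
        (subject, "dct:creator", author.text),
        (subject, "dct:type", "TEI/XML"),
        (subject, "dct:conformsTo", "http://www.tei-c.org/ns/1.0"),
    ]


if __name__ == "__main__":
    for statement in header_to_dc(SAMPLE):
        print(statement)
```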
As an example of structural triples, take SAWS’ TEI version of the Corpus Parisinum manuscript as stored in the Digby collection in the Bodleian library, Oxford, UK, in which a <div xml:id="Aristippus01"> section is contained by its parent, <div xml:id="Part01">. From this we can derive the following two triples:
<rdf:Description rdf:about="http://www.ancientwisdoms.ac.uk/mss/Cod_Bodl_Dig_6#Aristippus01"> <dct:isPartOf>http://www.ancientwisdoms.ac.uk/mss/Cod_Bodl_Dig_6#Part01</dct:isPartOf> </rdf:Description> <rdf:Description rdf:about="http://www.ancientwisdoms.ac.uk/mss/Cod_Bodl_Dig_6#Part01"> <dct:hasPart>http://www.ancientwisdoms.ac.uk/mss/Cod_Bodl_Dig_6#Aristippus01</dct:hasPart> </rdf:Description>
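The derivation of structural triples from nested <div> elements can be sketched as below. The sketch is illustrative and makes several assumptions: the function name structural_triples is hypothetical, the document identifier Cod_Bodl_Dig_6 is taken from the URIs above and placed on the root <TEI> element, and only directly nested <div> elements that both carry an xml:id are considered.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Minimal stand-in for the manuscript markup: one <div> nested in another.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="Cod_Bodl_Dig_6">
  <text><body>
    <div xml:id="Part01">
      <div xml:id="Aristippus01"/>
    </div>
  </body></text>
</TEI>"""


def structural_triples(tei_xml, base="http://www.ancientwisdoms.ac.uk/mss/"):
    """Return dct:isPartOf / dct:hasPart triples for directly nested <div>s."""
    root = ET.fromstring(tei_xml)
    doc = base + root.get(XML_ID) + "#"
    triples = []
    for parent in root.iter(TEI + "div"):
        # findall with a bare tag name matches direct children only.
        for child in parent.findall(TEI + "div"):
            parent_id, child_id = parent.get(XML_ID), child.get(XML_ID)
            if parent_id and child_id:
                triples.append((doc + child_id, "dct:isPartOf", doc + parent_id))
                triples.append((doc + parent_id, "dct:hasPart", doc + child_id))
    return triples


if __name__ == "__main__":
    for triple in structural_triples(SAMPLE):
        print(triple)
```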
Resulting benefits for information exploration and retrieval in the SAWS project
We now have the capacity to extract many triples from our TEI document. The TEI-Bare XSLT[49] allows us to extract RDF triples representing information about the document structure and metadata about the markup, as encoded in the TEI markup. This XSLT can also now be simply extended to extract more semantics, by transforming the triples encoded through the <relation> element into RDF/XML syntax.[50]
Once information is available in RDF format, it can be queried and reasoned with. Critically, queries can be constructed based around the semantics encoded in the triples.[51] The distribution of knowledge across Linked Data means that logical inferences can be made to derive new knowledge from the facts, and also from the external data sources that have been referenced by the RDF triples.
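To make the idea of querying and simple inference concrete, the following sketch holds the triples from the earlier examples as plain Python tuples. It is a stand-in for a real triple store and query language, not a description of the SAWS infrastructure: the helper names match and infer_has_part are hypothetical, and the only inference shown is deriving dct:hasPart as the inverse of dct:isPartOf.

```python
SAWS = "http://purl.org/saws/ontology#"
MSS = "http://www.ancientwisdoms.ac.uk/mss/"

# Triples taken from the examples in this paper, written as plain tuples.
TRIPLES = [
    (MSS + "K._al-Haraka#ci_s5", SAWS + "isCloseRenderingOf",
     MSS + "Proclus_ET_Prop.17#ci1"),
    (MSS + "Cod_Bodl_Dig_6#Aristippus01", "dct:isPartOf",
     MSS + "Cod_Bodl_Dig_6#Part01"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def infer_has_part(triples):
    """Derive dct:hasPart statements not already present in the data."""
    known = set(triples)
    derived = [(o, "dct:hasPart", s) for s, p, o in triples if p == "dct:isPartOf"]
    return [t for t in derived if t not in known]


if __name__ == "__main__":
    # Which segments are close renderings of something, and of what?
    print(match(TRIPLES, p=SAWS + "isCloseRenderingOf"))
    # New statements obtained by inference.
    print(infer_has_part(TRIPLES))
```

The query is phrased in terms of the relationship type rather than any string in the text, which is what is meant here by querying on the encoded semantics.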
The ability to traverse links between sets of data and discover related information serendipitously is one of the major benefits of adopting linked data for the SAWS project. For the scholars working in SAWS, the study of the links between and within documents is a central part of the academic research underpinning this project.[52] Extra assistance in finding relevant information can help discover sources of interest that might otherwise have been missed, as many potential sources are geographically scattered, occasionally hard and/or time-consuming to access and may also be completely unknown outside of a handful of scholars. As an example, the Perseus Digital Library holds a collection of Classics-related documents which collectively contain over 68 million words, as well as an Arabic collection containing over 5 million words, and other collections.[53] Navigating such quantities of potential research material to find content of interest is one of the challenges faced by Classics researchers. Digitisation and cataloguing of the sources through projects like Perseus has been an important step in facilitating this research, and is being enhanced further by semantic navigation such as that undertaken in the SAWS project.
To illustrate ways in which linked data specifically assists scholars in the use case of SAWS, we look at how the scholars can discover information in new ways, draw from a broader set of sources and compile evidence for their research. If, say, a researcher is looking at how a particular place of interest is described across different manuscripts, information in the Pleiades historical gazetteer can be consulted when constructing queries. Researchers can ask to see, for example, all texts that refer to that particular geographical location, even if the texts use different place names to refer to that geographical location (as it was often the case that places were referred to by different names in different historical periods). For SAWS, this helps with the added complication of manuscripts in different languages, with different character sets (compare for example Ancient Greek, Arabic). This is possible through examining the place names mentioned in the SAWS manuscripts in the context of the information in the Pleiades ontology, which gives a precise geographical reference for each place.[54]
For example, the place “Aphrodisias” (URI http://pleiades.stoa.org/places/638753) was known by the names:
- Ninoe (in the Classical period),
- Aphrodeisias (Hellenistic-republican, Roman periods),
- Lelegon polis (unspecified period),
- Stauropolis (Late-antique period),
- Aphrodisias (Roman, Late-antique periods).
In Ancient Greek it is referred to as Ἀφροδισιάς (or Νινόη, Ἀφροδεισιάς, Λελέγων πόλις, Σταυρόπολις, respectively).
Developing this example, we can disambiguate between the Aphrodisias located in modern-day Turkey and the Aphrodisias located in modern-day Spain (URI http://pleiades.stoa.org/places/255978/), which the textual information alone would not allow us to distinguish.
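The effect of attaching a gazetteer URI to each place mention can be illustrated with a small Python sketch. The name variants and the two Pleiades URIs are those given above; everything else is hypothetical, including the segment identifiers in MENTIONS, the function names, and the assumption that the Spanish place is recorded only under the name “Aphrodisias”.

```python
TURKEY = "http://pleiades.stoa.org/places/638753"
SPAIN = "http://pleiades.stoa.org/places/255978/"

# Name variants per place, as listed in the text above.
PLACE_NAMES = {
    TURKEY: {
        "Ninoe", "Aphrodeisias", "Lelegon polis", "Stauropolis", "Aphrodisias",
        "Νινόη", "Ἀφροδεισιάς", "Λελέγων πόλις", "Σταυρόπολις", "Ἀφροδισιάς",
    },
    SPAIN: {"Aphrodisias"},
}

# Hypothetical place mentions: (segment identifier, name as written, place URI).
MENTIONS = [
    ("hypothetical_ms_A#seg1", "Ἀφροδισιάς", TURKEY),
    ("hypothetical_ms_B#seg7", "Stauropolis", TURKEY),
    ("hypothetical_ms_C#seg3", "Aphrodisias", SPAIN),
]


def candidate_places(name, names=PLACE_NAMES):
    """Return every place URI that a written name could refer to."""
    return sorted(uri for uri, variants in names.items() if name in variants)


def segments_for_place(uri, mentions=MENTIONS):
    """Return segments linked to a place, whatever name they use for it."""
    return [segment for segment, _name, place in mentions if place == uri]


if __name__ == "__main__":
    # The written name alone is ambiguous between the two places.
    print(candidate_places("Aphrodisias"))
    # The URI retrieves mentions across names, languages and scripts.
    print(segments_for_place(TURKEY))
```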
Returning to the issues of the SAWS manuscripts being written in various languages (Ancient Greek and Arabic being the two main languages, and some related documents in Spanish, Latin, and English, to date): Although the TEI documents contain transcriptions of manuscripts in the original language, the use of RDF and linking allows the manuscript information to transcend language boundaries to some extent, as parts of the text can be linked to resources which are more language-neutral (e.g. the person “Aristotle” can be represented by the URI http://dbpedia.org/resource/Aristotle independently of whether they are referred to as Aristotle, Ἀριστοτέλης, أرسطو , Aristoteles, Aristóteles or other alternative forms in the original document). This is particularly helpful in studying the transmission of information in the manuscripts across languages, especially if the researcher does not have sufficient language skills to navigate between the different languages.
Evaluation of the SAWS implementation
To evaluate the usefulness of this work, researchers on the SAWS project are currently encoding RDF information into existing TEI versions of manuscripts they are interested in. Once the research questions they would like to explore had been discussed, the TEI publications and the enhancements possible with the RDF information were demonstrated at a workshop in June 2012.[55] This highlighted several benefits, in particular the increased motivation that came from actually seeing how the manuscripts could be navigated in this format, both through exploring the TEI digital edition and through seeing the tangible benefits of a semantically enhanced approach.
The demo also prompted useful constructive feedback, leading to further relation types being identified for the SAWS model. This demo also prompted some interesting scholarly debates following the identification of different interpretations of the notion of translation (which would not necessarily have been noticed and acted upon, had the scholars not been required to collaboratively formalise their tacit knowledge). Following this demo, ongoing further consultation with manuscript scholars has provided, and will continue to provide, formative evaluative feedback for further developments.
Future work
With a basic TEI to RDF mapping in place, and using an easily extensible transformation mechanism such as XSLT, we have a firm basis for future development of mappings by both ourselves and others, to include more of the TEI tagset. More generally, the choice of TEI tags being included will be dictated by individual needs (for example, SAWS uses a specific customisation of the TEI schema, as mentioned above, so is concentrating on tags used in that schema). In particular, we are discussing with collaborators how FRBR-oo can be used to enhance the base ontological model for the TEI to RDF mapping, for a richer vocabulary which includes more detailed semantics than Dublin Core (given that Dublin Core concentrates on modelling metadata and basic structures). We hope to discuss this work with members of the Special Interest Group on TEI and ontologies and make contributions to this group’s work.
Upon determining our mappings, obtaining the data becomes a matter of simple extraction. The RDF in our example makes direct connections - A is a child of B. Having information available in RDF is useful not only for what can be done directly with RDF, but for the possible transformations from RDF to other data representations. One of this paper’s authors is working with the image-based manuscript annotation environment Shared Canvas[56], which makes use of Open Annotation Collaboration (OAC)[57] syntax for annotations. An OAC annotation maps neatly to an RDF triple, where an active/subject item has an annotation with a body of x (e.g. isCloseTranslationOf) and a target of y (e.g. xml:id=GV132874897).
OAC-RDF mappings are more complex, but more meaningful. Once our basic mappings are in place, we can spin off (or at least establish the framework for) more complex expressions. Relationships can build on relationships, attaching creators (with foaf tags) to annotations, which tie bodies of text (further identified by their character encoding) to the target being described. There is no real depth limit. The data is all there to be explored, and the framework exists to add many layers of metadata.
Islandora is an open source project to allow users to manage a Fedora Repository through PHP using a Drupal front end. Fedora Repositories are particularly adept at maintaining and versioning the metadata that accompanies scholarly objects. The Digital Humanities project is sponsored by EMiC to develop a suite of applications for the management and critical analysis of Canadian modernism. One of the authors of this paper is the lead programmer in both these projects, so will be able to incorporate these transformations into the workflow to expose the data publicly. Of particular interest to our team is the ability to extract data from the TEI stream to build and maintain authority lists.
We therefore have several possible avenues of work to explore in this area. Future development will both require, and foster, collaboration amongst those who are pursuing the question of what can be gained from the enhancement of TEI-encoded documents. It is envisaged that the outcomes of this research will be applicable across a wide variety of texts, and it is hoped that this paper will stimulate interest in new areas of future research into combining different types of markup.
Acknowledgements
This work partly results from collaborative development between two of the paper authors initiated at the Interedition 9th bootcamp, Leuven, Belgium, 2012, funded through COST action IS0704. The SAWS project is funded by HERA as project 09-HERA-JRP-CD-FP-152 and we acknowledge the benefits of this fruitful collaboration with our project partners. In preparing the final version of this paper we were assisted by the feedback from several anonymous reviewers.
References
W. Caxton, The Dictes and Wise Sayings of the Philosophers (originally published London, 1477), reprinted 1877 (Elliot Stock, London)
A. Dekhtyar and I. E. Iacob. A framework for management of concurrent XML markup. Data & Knowledge Engineering 52(2):185-208, 2005.
M. Doerr, “The CIDOC CRM - an Ontological Approach to Semantic Interoperability of Metadata”, AI Magazine, Vol. 24, No. 3 (2003)
M. Doerr, and P. LeBoeuf, “Modelling Intellectual Processes: The FRBR – CRM Harmonization” Digital Libraries: Research and Development, Vol. 4877, pp. 114-123. Springer (2007). doi:https://doi.org/10.1007/978-3-540-77088-6_11.
Ø. Eide, A. Felicetti, C. Ore, A. D'Andrea, and J. Holmen. Encoding Cultural Heritage Information for the Semantic Web. In EPOCH Conference on Open Digital Cultural Heritage Systems, Rome, Italy, 2008.
Hedges, Mark; Jordanous, Anna; Dunn, Stuart; Roueche, Charlotte; Kuster, Marc W.; Selig, Thomas; Bittorf, Michael; Artes, Waldemar; "New models for collaborative textual scholarship", Proceedings of the 6th IEEE International Conference on Digital Ecosystems Technologies (DEST), Campione d’Italia, Italy. 2012.
H. V. Jagadish, L. V. S. Lakshmanan, M. Scannapieco, D. Srivastava, and N. Wiwatwattana. Colorful XML: One Hierarchy Isn't Enough. In Proceedings of ACM SIGMOD International Conference on Management of Data, volume 1, pages 251-262. ACM Press, 2004. doi:https://doi.org/10.1145/1007568.1007598.
M. O. Jewell. Semantic Screenplays: Preparing TEI for Linked Data. In Proceedings of Digital Humanities, London, UK, 2010.
A. Jordanous, K. F. Lawrence, M. Hedges, and C. Tupman. Exploring manuscripts: sharing ancient wisdoms across the semantic web. In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics (WIMS '12), Craiova, Romania. 2012.
K. F. Lawrence. Wherefore Art Thou? - Crowdsourcing Linked Data from Shakespeare to Dr Who. In Proceedings of Web Science, Koblenz, Germany, 2011.
Christian-Emil Ore and Øyvind Eide. TEI and cultural heritage ontologies: Exchange of information? Literary and Linguistic Computing 24(2): 161-172, 2009. doi:https://doi.org/10.1093/llc/fqp010.
S. Peroni and F. Vitali. Annotations with EARMARK for arbitrary, overlapping and out-of order markup. In Proceedings of the 9th ACM symposium on Document engineering, pages 171-180, Munich, Germany, 2009. doi:https://doi.org/10.1145/1600193.1600232.
E. Pierazzo. A rationale of digital documentary editions. Literary and Linguistic Computing, 26(4):463-477, 2011. doi:https://doi.org/10.1093/llc/fqr033.
P. Portier, N. Chatti, S. Calabretto, E. Egyed-Zsigmond, and J. Pinon. Modeling, encoding and querying multi-structured documents. Information Processing & Management. Forthcoming.
M. Richard, “Florilèges grecs”, Dictionnaire de Spiritualité V (1962), cols. 475-512
F. Rodríguez Adrados, Greek wisdom literature and the Middle Ages: the lost Greek models and their Arabic and Castilian Translations (2001), English translation by Joyce Greer (2009), pp. 91-97 on Greek models; D. Gutas, “Classical Arabic Wisdom Literature: Nature and Scope”, Journal of the American Oriental Society, Vol. 101, No. 1, Oriental Wisdom (Jan. -Mar., 1981), pp. 49-86
Solomon, J. (ed)., Accessing antiquity: The computerization of classical studies. Tucson: University of Arizona Press. 1993.
Sanderson, R. Albritton, B. Schwemmer, R. Van de Sompel, H. "SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemination". Proceedings of the 11th ACM/IEEE Joint Conference on Digital Libraries, Ottawa, Canada, June 2011.
B. Tillett, “What is FRBR? A Conceptual Model for the Bibliographic Universe”, Library of Congress Cataloging Distribution Service, Library of Congress, Vol. 25, pp.1-8 (2004)
G. Tummarello, C. Morbidoni, and E. Pierazzo. Toward textual encoding based on RDF. In Proceedings of the 9th International Conference on Electronic Publishing (ELPUB 2005), Kath. Univ. Leuven, June, pages 57-63. 2005.
Tupman, Charlotte; Hedges, Mark; Jordanous, Anna; Lawrence, Faith; Roueche, Charlotte; Wakelnig, Elvira; Dunn, Stuart. Sharing Ancient Wisdoms: developing structures for tracking cultural dynamics by linking moral and philosophical anthologies with their source and recipient texts. In Proceedings of Digital Humanities (DH2012), Hamburg, Germany. 2012.
[1] http://www.ancientwisdoms.ac.uk. Last accessed 20th April 2012.
[2] John Unsworth. Tool-Time, or 'Haven't We Been Here Already?' Ten Years in Humanities Computing. Delivered as part of "Transforming Disciplines: The Humanities and Computer Science," Washington, DC, 2003. Available at: http://people.lis.illinois.edu/~unsworth/carnegie-ninch.03.html (last accessed 20th July 2012).
[3] Anna Jordanous, K. Faith Lawrence, Mark Hedges, and Charlotte Tupman. Exploring manuscripts: sharing ancient wisdoms across the semantic web. In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics (WIMS '12), Craiova, Romania. 2012.
[4] Tupman, Charlotte; Hedges, Mark; Jordanous, Anna; Lawrence, Faith; Roueche, Charlotte; Wakelnig, Elvira; Dunn, Stuart. Sharing Ancient Wisdoms: developing structures for tracking cultural dynamics by linking moral and philosophical anthologies with their source and recipient texts. In Proceedings of Digital Humanities (DH2012), Hamburg, Germany. 2012.
[5] Hedges, Mark; Jordanous, Anna; Dunn, Stuart; Roueche, Charlotte; Kuster, Marc W.; Selig, Thomas; Bittorf, Michael; Artes, Waldemar; "New models for collaborative textual scholarship", Proceedings of the 6th IEEE International Conference on Digital Ecosystems Technologies (DEST), Campione d’Italia, Italy. 2012.
[6] F. Rodríguez Adrados, Greek wisdom literature and the Middle Ages: the lost Greek models and their Arabic and Castilian Translations (2001), English translation by Joyce Greer (2009), pp. 91-97 on Greek models; D. Gutas, “Classical Arabic Wisdom Literature: Nature and Scope”, Journal of the American Oriental Society, Vol. 101, No. 1, Oriental Wisdom (Jan. -Mar., 1981), pp. 49-86
[7] M. Richard, “Florilèges grecs”, Dictionnaire de Spiritualité V (1962), cols. 475-512
[8] W. Caxton, The Dictes and Wise Sayings of the Philosophers (originally published London, 1477), reprinted 1877 (Elliot Stock, London)
[9] Ø. Eide, A. Felicetti, C. Ore, A. D'Andrea, and J. Holmen. Encoding Cultural Heritage Information for the Semantic Web. In EPOCH Conference on Open Digital Cultural Heritage Systems, Rome, Italy, 2008.
[10] E. Pierazzo. A rationale of digital documentary editions. Literary and Linguistic Computing, 26(4):463-477, 2011.
[11] A reviewer of this paper has drawn our attention to the DM2E Digital Manuscripts to Europeana project (http://dm2e.eu/). This project will be looking at how existing document formats can be translated into a form compatible with the Europeana Data Model and producing tools to support semantic annotation of documents; in particular, the work package on Interoperability infrastructure (WP2) will look at transformations to RDF from other formats. At the time of writing (July 2012) this project has only been running for six months, and has only very recently announced alpha releases of two tools; we look forward to trying these tools and tracking how they develop.
[12] G. Tummarello, C. Morbidoni, and E. Pierazzo. Toward textual encoding based on RDF. In Proceedings of the 9th International Conference on Electronic Publishing (ELPUB 2005), Kath. Univ. Leuven, June, pages 57-63. 2005.
[15] P. Portier, N. Chatti, S. Calabretto, E. Egyed-Zsigmond, and J. Pinon. Modeling, encoding and querying multi-structured documents. Information Processing & Management. Forthcoming.
[16] ibid, page 9.
[17] M. O. Jewell. Semantic Screenplays: Preparing TEI for Linked Data. In Proceedings of Digital Humanities, London, UK, 2010.
[18] K. F. Lawrence. Wherefore Art Thou? - Crowdsourcing Linked Data from Shakespeare to Dr Who. In Proceedings of Web Science, Koblenz, Germany, 2011.
[19] Blanke, Tobias; Bodard, Gabriel; Bryant, Michael; Dunn, Stuart; Hedges, Mark; Jackson, Michael; Scott, David; "Linked data for humanities research — The SPQR experiment," 2012 6th IEEE International Conference on Digital Ecosystems Technologies (DEST), Campione d’Italia, Italy, 2012
[20] S. Peroni and F. Vitali. Annotations with EARMARK for arbitrary, overlapping and out-of order markup. In Proceedings of the 9th ACM symposium on Document engineering, pages 171-180, Munich, Germany, 2009.
[21] P. Portier, N. Chatti, S. Calabretto, E. Egyed-Zsigmond, and J. Pinon. Modeling, encoding and querying multi-structured documents. Information Processing & Management. Forthcoming.
[23] Christian-Emil Ore and Øyvind Eide. TEI and cultural heritage ontologies: Exchange of information? Literary and Linguistic Computing 24(2): 161-172, 2009.
[27] http://www.edd.uio.no/artiklar/tekstkoding/tei_crm_mapping.html , http://www.edd.uio.no/tei/teiontsig/test_crm_model.graphml
[35] Sourceforge.net discussion: Encoding RDF relationships in TEI - ID: 3309894, at http://sourceforge.net/tracker/?func=detail&aid=3309894&groupid=106328&atid=644065
[38] Gnomologium Vaticanum 87
[39] M. Doerr, and P. LeBoeuf, “Modelling Intellectual Processes: The FRBR – CRM Harmonization” Digital Libraries: Research and Development, Vol. 4877, pp. 114-123. Springer (2007)
[40] B. Tillett, “What is FRBR? A Conceptual Model for the Bibliographic Universe”, Library of Congress Cataloging Distribution Service, Library of Congress, Vol. 25, pp.1-8 (2004)
[41] M. Doerr, “The CIDOC CRM - an Ontological Approach to Semantic Interoperability of Metadata”, AI Magazine, Vol. 24, No. 3 (2003)
[42] SAWS ontology: http://purl.org/saws/ontology. Last accessed 16th March 2012
[43] The SAWS ontology can be found in OWL format at the permanent URL: http://purl.org/saws/ontology/ and is documented in a more human-readable format at the SAWS website: http://www.ancientwisdoms.ac.uk .
[48] The manuscript ID is filled in from the xml:id given for the manuscript, as specified in the root <TEI>, the parent of the <teiHeader> element.
[49] The XSLT is available at https://github.com/ajstanley/TEI_to_RDF.git
[50] As <relation> is not part of the TEI-Bare schema, we do not include the <relation> mapping to RDF in the XSLT intended for TEI-Bare documents, but the mapping from <relation> to RDF is specified in this paper and further exemplified by TEI at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-relation.html
[51] For the SAWS project, a SPARQL endpoint to access the RDF data will be made available at the URL http://www.ancientwisdoms.ac.uk/sparql in late 2012/early 2013.
[52] Accessing antiquity: The computerization of classical studies. Solomon, J. (ed). Tucson: University of Arizona Press. 1993. The strong interest in relationships between manuscripts and collections has been confirmed more recently through personal communications with SAWS project partners and their colleagues.
[53] Statistics taken from http://www.perseus.tufts.edu/hopper/collections, last accessed 20th July 2012.
[54] Through the Pelagios networking tools, this geographical information can also be used to navigate external data relating to the same place.
[55] Demo available at http://www.ancientwisdoms.ac.uk/media/data/texts.html. Please note that as this tool is undergoing further development, it may occasionally be unavailable for short periods.
[56] Sanderson, R. Albritton, B. Schwemmer, R. Van de Sompel, H. "SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemination". Proceedings of the 11th ACM/IEEE Joint Conference on Digital Libraries, Ottawa, Canada, June 2011.