How to cite this paper
Gibson, Nathan P., Winona Salesky and David A. Michelson. “Encoding Western and Non-Western Names for Ancient Syriac Authors.” Presented at Symposium on Cultural Heritage Markup, Washington, DC, August 10, 2015. In Proceedings of the Symposium on Cultural Heritage Markup. Balisage Series on Markup Technologies, vol. 16 (2015). https://doi.org/10.4242/BalisageVol16.Gibson01.
Symposium on Cultural Heritage Markup
August 10, 2015
Balisage Paper: Encoding Western and Non-Western Names for Ancient Syriac Authors
Nathan P. Gibson
Postdoctoral Scholar in Syriac Studies and Digital Humanities (as of Fall
2015)
Vanderbilt University
Nathan P. Gibson is a researcher specializing in medieval Arabic and Syriac
and is one of the editors of the forthcoming Syriac Biographical
Dictionary, a born-digital reference publication produced by
Syriaca.org that will be an
authority record for names and biographic data of persons relevant to Syriac
studies.
Winona Salesky
Winona Salesky is an independent digital library consultant with 10 years’
experience building digital collections with XML technologies, including XQuery,
XSLT, and native XML databases. She is the Senior Programmer on the Syriaca.org
project and is working with the Library of Congress on their BIBFRAME
initiative. She was previously the Digital Initiatives Librarian at the
University of Vermont where she developed and deployed The Center for Digital
Initiatives, an entirely XML-based digital library project run on
eXist-db.
David A. Michelson
Assistant Professor of the History of Christianity
Vanderbilt University
David A. Michelson is Assistant Professor of the History of Christianity and
affiliate faculty in the Department of Classical Studies and the program in
Islamic Studies at Vanderbilt University. He serves as General Editor of Syriaca.org.
Copyright © 2015, Nathan P. Gibson, Winona Salesky, & David A. Michelson. Released
under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Abstract
One of the major digital challenges of the Syriaca.org research project has been to encode and visualize personal
names of authors in Middle Eastern languages (especially Syriac and Arabic). TEI-XML
and HTML are digital standards for the encoding and visualization of cultural
heritage data and have features for encoding names and displaying Middle Eastern
languages. Because these formats were developed primarily for Western cultural data,
however, representing our non-Western data in these formats has required complex
adaptation particularly in regard to marking up name parts, customizing search
algorithms, displaying bidirectional text, and displaying Syriac text with embedded
fonts. These requirements have led us to develop small-scale systems that may be of
use to other cultural heritage preservation projects involving names for ancient
and, especially, non-Western entities.
Table of Contents
- Introduction
  - Overview of Challenges
  - The Syriaca.org Project
  - The Goals of the Syriac Biographical Dictionary
- TEI Encoding Challenges
  - Labeling Name Parts
  - Determining the Sorting Priority of Name Parts
  - Making Names Accessible in Multiple Languages and Transliterations
- HTML Visualization Challenges & Solutions
  - Searching name variants
  - Visualizing RTL and LTR names
  - Embedding Syriac Fonts in HTML
- Conclusions
Introduction
Overview of Challenges
The central question we are addressing in this presentation is how to overcome the
inherent limitations of encoding cultural historical data using tools originally
developed for a different subset of cultures and time periods. There are few types
of data that have more cultural variation and more cultural significance than names.
So if you will bear with me for a rather whimsical illustration, I would like to
give you a sense for the kind of problems we are addressing in this paper when it
comes to encoding non-Western names.
Imagine tonight you get hungry for a late-night feast, so you pop open your laptop
and prepare to order some GrubHub. But first you have to sign up. Now suppose for a
moment that your name happens to be Mar Gregorius Abū al-Faraj the Melitene,
Maphrian, a.k.a. Bar ʿEbroyo. Hmm, what do you put down for “First Name”? “Last
Name”? Suppose further that your computer doesn’t even have a font for the script
your name is written in. You download one and get your order sent off, but when you
get your confirmation email, your name is written backwards. Instead of, “Hey Mar
Gregorius,” it says, “Hey Ram Suirogerg!” You’re now beginning to get a feel for
some of the challenges we faced in cataloging historical non-Western names in
obscure scripts in a variety of both left-to-right and right-to-left languages.
Admittedly, the markup language we have been using, TEI-XML, is designed to be
able to accommodate historical, multilingual data. But the creators of TEI
originally developed it primarily using data from Western cultural materials and
within the framework of XML, which is itself a construct of Western culture. In addition, for our project, we have to use other tools to get the data
into TEI and then out of TEI onto a human-readable web page. Before getting to those
issues, let me explain in broad terms (1) the nature of our project and (2) what we
are specifically trying to accomplish with the dataset of personal names. Then I
will explain (3) the specific challenges we faced representing that data in TEI, and
our programmer, Winona, will guide you through (4) the detailed solutions she
implemented for getting that TEI data into an HTML interface.
The Syriaca.org Project
Syriaca.org is a project to publish online reference works related to communities
that use or used the language known as Syriac, a cousin of Hebrew and Arabic. For several hundred years, starting close to the turn of the Common Era,
Syriac was used quite widely in places that include parts of modern-day Syria,
Lebanon, Iraq, Turkey, and Iran, among others. After the rise of Arabic as the
lingua franca of much of the Middle East, the prevalence of
Syriac faded, although it retains significance as a literary language and in
ceremonial settings. This makes the preservation of all things Syriac particularly
critical, since the Syriac heritage is large but its present-day heritage
communities small.
The Goals of the Syriac Biographical Dictionary
One of the building blocks for creating these online reference works is an
authority record of names and biographical data, which we are publishing online
(under a CC-BY 4.0 license) as the Syriac Biographical
Dictionary. The main purpose of the SBD is to
provide a resource for identifying any persons relevant to Syriac studies. We
anticipate two major use cases.
First, the SBD is a source of stable and
accurate URIs for Syriac persons that can be used for library cataloging and linked
data. Since Syriac names resist categorization, and disambiguating Syriac persons
with similar names is difficult without referring to specialized resources and
sometimes consulting texts in non-Western languages, it is Syriac specialists who
must provide the authority files for these persons. Second, researchers can use the SBD
to identify persons they encounter in Syriac texts or other Syriac-related
materials.
These use cases have implications for how we encode name data. The data must be
- transformable into formats used by library catalogs and projects outside Syriac studies,
- sortable by the most identifying portion of each name,
- searchable in a variety of languages and transliterations, and
- visualized in an easy-to-consult, human-readable format.
The first three of these requirements correspond to three encoding
challenges we faced in TEI, which I will illustrate shortly. The third
(searchability) also relates to the visualization challenges that Winona will
discuss.
TEI Encoding Challenges
TEI-XML is the format that provided us with the best balance of precision,
flexibility, and widespread usage by historians, but we still faced challenging
decisions. These included
- labeling name parts,
- determining the sorting priority of name parts, and
- making names accessible in multiple languages and transliterations.
I will explain each of these using as an example the person I mentioned earlier, Grigorios
Bar ʿEbroyo, a 13th-century author, religious leader, and polymath who wrote
in both Syriac and Arabic.
Labeling Name Parts
When encoding Syriac names in the authority file, we decided to mark up the
various parts of each name within the persName element, both for cataloging purposes
and to allow for further analysis later. TEI provides several different labels that
may be used:
- forename
- surname
- addName
- roleName
- genName
Certain of these have a stronger cultural bias than others. The term
“forename” implies that a person’s given name is actually the first name (a Western
construct), but since the TEI guidelines define the content of the “forename”
element as, “a forename, given or baptismal name,” we decided to mark up Syriac and
Arabic given names with the “forename” element TEI 2008a.
Family names posed a more difficult problem. TEI guidelines define the “surname”
element as containing “a family (inherited) name, as opposed to a given, baptismal,
or nick name” TEI 2008b. Near Eastern names, including Syriac ones, often contain markers of
familial relationships, but these are not necessarily inherited, nor do they have
the same role as Western surnames. The most common of these is “son of X,” or “bar
X” in Syriac, which can indicate one’s father or other ancestor, but might also
indicate some other association, such as with a place. In the case of our example,
scholars originally took the name Bar ʿEbroyo to mean, “Son of the Hebrew,” and
supposed he was of Jewish origin. This led to the Latinization of his name as “Bar
Hebraeus.” More recent research, however, suggests that ʿEbroyo was a geographical
term that became attached to the family Takahashi 2011.
The Arabic formula “Abū X” or “Umm X,” meaning “Father of X” or “Mother of X,” is
also ostensibly a familial marker, but functions as an honorific title and should
often not be taken literally. Grigorios Bar ʿEbroyo received the title Abū al-Faraj
despite the fact that he was a monk and is not known to have had any children.
Moreover, none of these familial or pseudo-familial markers consistently serve the
same role as an English surname, so to label them with the element “surname” might
confuse catalogers who are not Syriac specialists. In the end, we decided to use the
addName element for all of these, applying the “type='family'” attribute if the name
seemed to indicate a familial relationship, or the “type='untagged-title'” attribute
if some other association seemed to be in view. Most other types of name parts did
not create problems for us, falling rather cleanly under the guidelines of either
“addName” or “roleName.”
Determining the Sorting Priority of Name Parts
There was, however, another challenge associated with name parts. That was the
fact that there is no systematic way to determine which part of a Syriac or Arabic
name is the most important (or easily recognizable) identifier for the person, one that
might be analogous to a family name or “last name” in modern Western usage. Identity
markers, including given names, familial names, and various kinds of titles, are
included more or less fully in different texts, and historical circumstances lead to
a person’s being remembered using certain of these rather than others. Some people
are known primarily by their given name, others by a familial name or a title.
One of our purposes in tagging name parts was to be able to mark the part of the
name that best identifies each person. Librarians need to know whether to catalog
our example author under “Grigorios,” under “Bar ʿEbroyo,” or under “Maphrian.”
Users need to be able to peruse alphabetical lists and easily pick out the persons
they are looking for. In other words, it was our job to write the name on the cup so
that the Starbucks barista would know what to call out when a Syriac author’s drink
order is ready.
Fortunately for us, TEI has a “sort” attribute that takes numeric values and can
be included in name part elements. Also, a list of headwords for a recently
published Syriac encyclopedia included many of the names we were encoding. We
automatically tagged the first name part that the encyclopedia listed as the top
sort priority, and then applied the same sort priority to that name part in other
versions of the name we had collected. For example, from the encyclopedia listing
“Bar ʿEbroyo, Grigorios,” we were able to determine that the <addName
type="family"> should have a sort priority of “1” for all versions of the name,
whether Syriac, English, Arabic, or some other language.
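A minimal sketch of how a consumer of these records might turn the “sort” values into an alphabetical sort key. It is written in Python rather than the project's own XQuery/XSLT, it omits the TEI namespace for brevity, and the function name `sort_key` is hypothetical. Parts that share a priority are assumed to keep their document order.

```python
import xml.etree.ElementTree as ET

def sort_key(pers_name_xml):
    """Join the name parts of one persName, ordered by their sort attribute."""
    pers_name = ET.fromstring(pers_name_xml)
    parts = [el for el in pers_name if el.get("sort") is not None]
    # list.sort() is stable, so parts sharing a priority keep document order.
    parts.sort(key=lambda el: int(el.get("sort")))
    return " ".join((el.text or "").strip() for el in parts)

HEADWORD = """<persName xml:lang="en-x-gedsh">
  <addName type="family" sort="1">Bar ʿEbroyo </addName>
  <forename sort="2">Grigorios </forename>
</persName>"""

LATIN = """<persName xml:lang="en">
  <addName type="untagged-title" sort="2">Mar </addName>
  <forename sort="2">Gregorius </forename>
  <addName type="family" sort="1">Bar Hebraeus </addName>
</persName>"""

print(sort_key(HEADWORD))
print(sort_key(LATIN))
```

Because the familial name carries priority “1” in both records, both keys begin with it regardless of where it appears in the written form of the name.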
Making Names Accessible in Multiple Languages and Transliterations
This brings me to the final encoding challenge, that of making sure the names
could be properly accessed and searched in all of the different languages in which
we were collecting them. First, we found that the ISO 639 language codes were
inadequate and inaccurate in regard to Syriac, since they list two separate language
codes for Syriac even though scholars would not make such a distinction. The code “syr”
stands for “Syriac” as a macrolanguage, while “syc” is an unrelated code specifically for
“Classical Syriac,” which is not included under the “syr” macrolanguage grouping. This means that anything tagged as “syc” will not appear in searches for
“syr.” Even though some of our material could be considered “Classical Syriac,” the
diachronic nature of our dataset renders such a label an arbitrary judgment. If the
same name is used in the 5th century C.E. and also the 15th century, it is difficult
to distinguish one usage as “Classical Syriac” and the other as modern. We have
therefore opted to use the “syr” code for all of our Syriac encoding. Meanwhile, we
have formally petitioned the ISO 639 Registrar to place the “syc” code under the
larger “syr” macrolanguage grouping so that the two might be linked in searching.
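The search problem can be illustrated with a small sketch, assuming a search tool that compares language tags by simple prefix matching in the style of RFC 4647 basic filtering (the wildcard range is left out). The function name `matches` and the sample tag “syr-Syrc” are illustrative assumptions, not part of our data.

```python
def matches(language_range, tag):
    """Basic filtering: exact match, or the range followed by a hyphen."""
    language_range, tag = language_range.lower(), tag.lower()
    return tag == language_range or tag.startswith(language_range + "-")

tags = ["syr", "syr-Syrc", "syc", "en-x-gedsh"]

# A search for "syr" finds "syr" and its subtags but never "syc".
print([t for t in tags if matches("syr", t)])
```

Under this kind of matching, nothing about the two codes tells the tool that they are related, which is why a single code across the whole dataset is the safer choice.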
The other major challenge along these lines is that there is no single,
universally accepted standard for how to transliterate Syriac into Latin characters.
Thus, we had to decide on a standard to adopt for the headwords we display in Latin
script (we tagged this with an extension of the English language tag, “en-x-gedsh”),
but we also needed to include other English versions of the name from various
sources and even to generate some automatically. For example, since the English
version of the name “Grigorios” is “Gregory,” we added a persName element substituting
the name “Gregory” for “Grigorios” for each person named “Grigorios.”
The following code example illustrates the editorial decisions I have mentioned
above:
<persName xml:lang="en-x-gedsh" source="#bib239-1" syriaca-tags="#syriaca-headword">
  <addName type="family" sort="1">Bar ʿEbroyo </addName>
  <forename sort="2">Grigorios </forename>
</persName>
<persName xml:lang="en" source="#bib239-2">
  <addName type="untagged-title" sort="2">Mar </addName>
  <forename sort="2">Gregorius </forename>
  <addName type="family" sort="1">Bar Hebraeus </addName>
</persName>
<persName xml:lang="ar" source="#bib239-3">
  <addName type="untagged-title" sort="2">مار </addName>
  <forename sort="2">غريغوريوس </forename>
  <addName type="untagged-title" sort="2">ابو الفرج </addName>
  <addName type="untagged-title" sort="2">الملطي </addName>
  <addName type="untagged-title" sort="2">مفريان </addName>
  <addName type="untagged-title" sort="2">المشهور بابن العبري </addName>
</persName>
<persName xml:lang="syr" source="#bib239-4">
  <addName type="untagged-title" sort="2">ܡܪܝ </addName>
  <forename sort="2">ܓܪܝܓܘܪܝܘܣ </forename>
  <addName type="untagged-title" sort="2">ܡܦܪܝܢܐ </addName>
  <addName type="family" sort="1">ܒܪ ܥܒܪܝܐ </addName>
</persName>
<persName xml:id="name239-7" xml:lang="en" resp="http://syriaca.org" syriaca-tags="#syriaca-anglicized">
  <addName type="family" sort="1">Bar ʿEbroyo </addName>
  <forename sort="2">Gregory </forename>
</persName>
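The automatic generation of anglicized variants described above could be sketched roughly as follows. This is an illustration in Python, not the script we actually ran: the lookup table `ANGLICIZED`, the function name `anglicized_variant`, and the omission of the TEI namespace and of xml:id generation are all assumptions made for brevity.

```python
import copy
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical lookup table; only the Grigorios/Gregory pair comes from the text.
ANGLICIZED = {"Grigorios": "Gregory"}

def anglicized_variant(pers_name):
    """Return an anglicized copy of a persName element, or None if nothing changed."""
    variant = copy.deepcopy(pers_name)
    changed = False
    for forename in variant.iter("forename"):
        name = (forename.text or "").strip()
        if name in ANGLICIZED:
            forename.text = ANGLICIZED[name] + " "
            changed = True
    if not changed:
        return None
    # The variant is project-generated, so it drops the source pointer.
    variant.attrib.pop("source", None)
    variant.set(XML_LANG, "en")
    variant.set("resp", "http://syriaca.org")
    variant.set("syriaca-tags", "#syriaca-anglicized")
    return variant

headword = ET.fromstring(
    '<persName xml:lang="en-x-gedsh" source="#bib239-1">'
    '<addName type="family" sort="1">Bar ʿEbroyo </addName>'
    '<forename sort="2">Grigorios </forename>'
    '</persName>'
)
variant = anglicized_variant(headword)
print(ET.tostring(variant, encoding="unicode"))
```

Copying the whole element keeps the name-part labels and sort priorities of the headword, so the generated variant sorts under the same part as every other version of the name.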
HTML Visualization Challenges & Solutions
Searching name variants
Once names were properly encoded, the issues of search and display needed to be
addressed. Syriaca.org uses eXist-db, a native XML
database, for storing, processing, and searching our TEI files. eXist-db provides a
number of configurable indexing methods for searching XML documents, including a
full text search backed by the Apache
Lucene search framework. An advantage to using Lucene for full text
searching is the level of control it can give to the developer through a wide
variety of available text analyzers. Lucene also allows for the creation of custom
analyzers as needed, as well as customizable weighting of elements in the index. In
eXist-db, multiple analyzers may be defined and used with different indexes eXistdb 2014.
We found the Standard Analyzer sufficient for most of our needs, as it
provides non-language-specific text segmentation Lucene 2013.
In addition to the non-language-specific tokenizer, however, we needed
to better handle searches on names containing diacritics. For example, a user
entering the text “Abda” should get hits for “Abdā,” “Abda,” and “ʿAbdā.”
eXist-db provides a customization of the Standard Analyzer that allows
diacritic-insensitive searches; it is enabled by a simple flag (diacritics="no") in the
index configuration file. We found this implementation to be fairly robust, and it
satisfied most of our use cases. At present we have two outlying cases not handled
by this analyzer: the ʿ (left half ring) and ʾ (right half ring) characters,
representing guttural sounds in the Syriac language that Nathan will now attempt to
reproduce!
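One plausible reason the half rings fall outside diacritic folding can be shown with a short sketch. It assumes a folding strategy based on Unicode canonical decomposition, which is a common approach but is not necessarily how eXist-db or Lucene implement the flag; the function name `fold` is hypothetical.

```python
import unicodedata

def fold(text):
    """Case-fold text and drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

LEFT_HALF_RING = "\u02bf"   # the character written before "Abda" in the example
RIGHT_HALF_RING = "\u02be"

# The macron decomposes into a combining mark and is folded away.
print(fold("Abd\u0101") == fold("Abda"))

# The half ring is a standalone modifier letter, so it survives folding.
print(fold(LEFT_HALF_RING + "Abd\u0101") == fold("Abda"))
```

Under this strategy the first comparison succeeds and the second fails, because the half rings are separate letters rather than marks attached to a base letter.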
In the interest of efficiency, we decided to add an additional
tei:persName element, generated via an XQuery update script at load time, rather than
attempting to write a custom Lucene analyzer. This script adds name
variants in which the left and right half ring characters are stripped,
as well as variants with an apostrophe in their place. These name variants are flagged with the
syriaca-tags attribute (as @syriaca-tags="#syriaca-simplified-script") and are used
only for full-text searching. They are suppressed in the HTML view of the person
record Syriaca.org 2014.
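The string handling behind these simplified variants is straightforward. The following Python sketch shows the idea only; the production script is XQuery, and the function name `simplified_variants` is hypothetical.

```python
HALF_RINGS = ("\u02bf", "\u02be")  # left and right half ring

def simplified_variants(name):
    """Return [stripped form, apostrophe form], or [] if no half ring occurs."""
    if not any(ring in name for ring in HALF_RINGS):
        return []
    stripped, apostrophe = name, name
    for ring in HALF_RINGS:
        stripped = stripped.replace(ring, "")
        apostrophe = apostrophe.replace(ring, "'")
    return [stripped, apostrophe]

print(simplified_variants("Bar \u02bfEbroyo"))
print(simplified_variants("Ephrem"))
```

Each returned string would be written into its own persName element carrying the simplified-script flag, so that the full-text index can match what users are likely to type.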
Visualizing RTL and LTR names
Handling bidirectional text for web visualization was the next technical challenge.
Names can be in multiple languages, several of which use right-to-left
scripts. Generally speaking, Syriaca.org provides HTML pages with a base direction of
LTR (left-to-right). Within such a document there can be text blocks of Syriac or
Arabic that require RTL (right-to-left) rendering, and even inline interpolations
of RTL names within an LTR text block. The general recommendation of the W3C for
handling bidirectional text is to use markup rather than CSS to indicate text
direction W3C 2014. There are several options for marking up
bidirectional text, and it has been a challenge working out the kinks and quirks of
each mode. The HTML5 bdi element provides the best support for bidirectional text, as
it isolates the bidirectional text from the surrounding text and addresses some of
the odd display issues you can encounter with the dir attribute. However, it is not
well supported, with only Firefox 10.0 and Chrome 16.0 offering support w3school 2015. Similar results can be achieved by tightly wrapping
each section of bidirectional text within a span element with a dir attribute. As our
records make extensive use of the xml:lang attribute, it is easy to add specific dir
attributes as required.
<h1><span dir="ltr">Ephrem - </span><span lang="syr" dir="rtl">ܐܦܪܝܡ</span></h1>
This markup is generated via an XSLT stylesheet, which selects the appropriate
font face and text direction based on the xml:lang attribute of each element rendered.
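The selection logic can be sketched outside XSLT as well. The Python below is an illustration under stated assumptions: the set `RTL_LANGUAGES` and the function names `direction` and `wrap` are hypothetical, only the primary language subtag is inspected, and every span receives a lang attribute, which our actual output does not always do.

```python
from html import escape

# Assumed set of primary language subtags rendered right to left.
RTL_LANGUAGES = {"syr", "syc", "ar"}

def direction(lang):
    """Pick a text direction from the primary subtag of a language tag."""
    primary = lang.split("-")[0].lower()
    return "rtl" if primary in RTL_LANGUAGES else "ltr"

def wrap(text, lang):
    """Wrap text tightly in a span carrying lang and dir attributes."""
    return '<span lang="{0}" dir="{1}">{2}</span>'.format(
        escape(lang, quote=True), direction(lang), escape(text)
    )

print("<h1>" + wrap("Ephrem - ", "en") + wrap("ܐܦܪܝܡ", "syr") + "</h1>")
```

Looking only at the primary subtag keeps private-use tags such as “en-x-gedsh” left to right, but it would misjudge a tag that names a Latin transliteration of an RTL language, so a fuller version would also check for a script subtag.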
Embedding Syriac Fonts in HTML
Display of Syriac fonts presented an additional challenge, as they are not
natively supported in the browser. Initial development required users to download
and install the Meltho family of fonts from Beth
Mardutho to view Syriac. Drawbacks to this approach were the lack of
support in mobile devices and inconsistent browser support. To address this,
Syriaca.org now uses the CSS @font-face rule to embed the Meltho fonts on the web
site; @font-face works in all major browsers, including mobile browsers. The
appropriate font family is selected during XSLT conversion from TEI to HTML based on
the xml:lang attribute on each element. This attribute is translated into the appropriate lang
attribute on the HTML element, which is then used to select the correct font using
the lang selector rules defined in the CSS.
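To make the pairing of @font-face and lang-based rules concrete, the sketch below generates the two kinds of CSS rule from a small table. It is only an illustration: the font file path is hypothetical, the choice of the `:lang()` pseudo-class (rather than an attribute selector) is an assumption, and our stylesheet is maintained by hand rather than generated from Python.

```python
# Hypothetical mapping from language code to (font family, font file URL).
FONTS = {"syr": ("Estrangelo Edessa", "/fonts/EstrangeloEdessa.woff")}

def font_css(fonts):
    """Build one @font-face rule and one :lang() rule per language."""
    rules = []
    for lang, (family, url) in fonts.items():
        # Embed the font so that readers do not have to install it.
        rules.append(
            '@font-face {{ font-family: "{0}"; src: url("{1}") format("woff"); }}'
            .format(family, url)
        )
        # Apply the embedded font to any element in that language.
        rules.append(
            ':lang({0}) {{ font-family: "{1}", serif; }}'.format(lang, family)
        )
    return "\n".join(rules)

print(font_css(FONTS))
```

Because the font is chosen by language rather than by class name, the XSLT only has to copy xml:lang into lang for the right font to apply.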
Conclusions
When it comes to encoding names from the ancient and medieval Middle East using
modern, Western tools, we have found it possible to use only some features of those
tools as they were designed, whereas in other respects, we have had to use
customizations and workarounds to circumvent the limitations of TEI and HTML. Updating
TEI standards and ISO codes and improving browser support for certain HTML features
would make encoding projects like ours easier and more robust. Additionally, further
research into Lucene’s indexing and search technology, as well as into more complex
search syntax that takes advantage of the complexity of the TEI records, could
improve findability.
Takahashi, H. “Bar ʿEbroyo, Grigorios” pp. 54-56 in Sebastian P. Brock, et
al. (eds.), The Gorgias Encyclopedic Dictionary of the Syriac Heritage.
Piscataway, NJ: Gorgias Press, 2011.