Introduction
Overview of Challenges
The central question we are addressing in this presentation is how to overcome the inherent limitations of encoding cultural historical data using tools originally developed for a different subset of cultures and time periods. There are few types of data that have more cultural variation and more cultural significance than names. So if you will bear with me for a rather whimsical illustration, I would like to give you a sense for the kind of problems we are addressing in this paper when it comes to encoding non-Western names.
Imagine tonight you get hungry for a late-night feast, so you pop open your laptop and prepare to order some GrubHub. But first you have to sign up. Now suppose for a moment that your name happens to be Mar Gregorius Abū al-Faraj the Melitene, Maphrian, a.k.a. Bar ʿEbroyo. Hmm, what do you put down for “First Name”? “Last Name”? Suppose further that your computer doesn’t even have a font for the script your name is written in. You download one and get your order sent off, but when you get your confirmation email, your name is written backwards. Instead of, “Hey Mar Gregorius,” it says, “Hey Ram Suirogerg!” You’re now beginning to get a feel for some of the challenges we faced in cataloging historical non-Western names in obscure scripts in a variety of both left-to-right and right-to-left languages.
Admittedly, the markup language we have been using, TEI-XML, is designed to accommodate historical, multilingual data. But the creators of TEI originally developed it primarily using data from Western cultural materials and within the framework of XML, which is itself a construct of Western culture. [1] In addition, for our project, we have to use other tools to get the data into TEI and then out of TEI onto a human-readable web page. Before getting to those issues, let me explain in broad terms (1) the nature of our project and (2) what we are specifically trying to accomplish with the dataset of personal names. Then I will explain (3) the specific challenges we faced representing that data in TEI, and our programmer, Winona, will guide you through (4) the detailed solutions she implemented for getting that TEI data into an HTML interface.
The Syriaca.org Project
Syriaca.org is a project to publish online reference works related to communities that use or used the language known as Syriac, a cousin of Hebrew and Arabic. [2] For several hundred years, starting close to the turn of the Common Era, Syriac was used quite widely in places that include parts of modern-day Syria, Lebanon, Iraq, Turkey, and Iran, among others. After the rise of Arabic as the lingua franca of much of the Middle East, the prevalence of Syriac faded, although it retains significance as a literary language and in ceremonial settings. This makes the preservation of all things Syriac particularly critical, since the Syriac heritage is large but its present-day heritage communities are small.
The Goals of the Syriac Biographical Dictionary
One of the building blocks for creating these online reference works is an authority record of names and biographical data, which we are publishing online (under a CC-BY 4.0 license) as the Syriac Biographical Dictionary. [3] The main purpose of the SBD is to provide a resource for identifying any persons relevant to Syriac studies. We anticipate two major use cases.
First, the SBD is a source of stable and accurate URIs for Syriac persons that can be used for library cataloging and linked data. Since Syriac names resist categorization, and disambiguating Syriac persons with similar names is difficult without referring to specialized resources and sometimes consulting texts in non-Western languages, it is Syriac specialists who must provide the authority files for these persons. [4] Second, researchers can use the SBD to identify persons they encounter in Syriac texts or other Syriac-related materials.
These use cases have implications for how we encode name data. The data must be
- transformable into formats used by library catalogs and projects outside Syriac studies,
- sortable by the most identifying portion of each name,
- searchable in a variety of languages and transliterations, and
- visualized in an easy-to-consult, human-readable format.
TEI Encoding Challenges
TEI-XML is the format that provided us with the best balance of precision, flexibility, and widespread usage by historians, but we still faced challenging decisions. These included
- labeling name parts,
- determining the sorting priority of name parts, and
- making names accessible in multiple languages and transliterations.
I will explain each of these using as an example the person I mentioned earlier, Grigorios Bar ʿEbroyo, who was a 13th century author, religious leader, and polymath who wrote in both Syriac and Arabic.
Labeling Name Parts
When encoding Syriac names in the authority file, we decided to mark up the various parts of each name within the persName element, both for cataloging purposes and to allow for further analysis later. TEI provides several different labels that may be used:
- forename
- surname
- addName
- roleName
- genName
Family names posed a more difficult problem. TEI guidelines define the “surname” element as containing “a family (inherited) name, as opposed to a given, baptismal, or nick name” [TEI 2008b]. Near Eastern names, including Syriac ones, often contain markers of familial relationships, but these are not necessarily inherited, nor do they have the same role as Western surnames. The most common of these is “son of X,” or “bar X” in Syriac, which can indicate one’s father or other ancestor, but might also indicate some other association, such as with a place. In the case of our example, scholars originally took the name Bar ʿEbroyo to mean “Son of the Hebrew” and supposed he was of Jewish origin. This led to the Latinization of his name as “Bar Hebraeus.” More recent research, however, suggests that ʿEbroyo was a geographical term that became attached to the family [Takahashi 2011].
The Arabic formula “Abū X” or “Umm X,” meaning “Father of X” or “Mother of X,” is also ostensibly a familial marker, but it functions as an honorific title and often should not be taken literally. Grigorios Bar ʿEbroyo received the title Abū al-Faraj despite the fact that he was a monk and is not known to have had any children. Moreover, none of these familial or pseudo-familial markers consistently serves the same role as an English surname, so labeling them with the element “surname” might confuse catalogers who are not Syriac specialists. In the end, we decided to use the addName element for all of these, applying the “type='family'” attribute if the name seemed to indicate a familial relationship, or the “type='untagged-title'” attribute if some other association seemed to be in view. Most other types of name parts did not create problems for us, falling rather cleanly under the guidelines of either “addName” or “roleName.”
Determining the Sorting Priority of Name Parts
There was, however, another challenge associated with name parts: there is no systematic way to determine which part of a Syriac or Arabic name is the most important (or most easily recognizable) identifier for the person, analogous to a family or “last name” in modern Western usage. Identity markers, including given names, familial names, and various kinds of titles, are included more or less fully in different texts, and historical circumstances lead to a person’s being remembered using certain of these rather than others. Some people are known primarily by their given name, others by a familial name or a title.
One of our purposes in tagging name parts was to be able to mark the part of the name that best identifies each person. Librarians need to know whether to catalog our example author under “Grigorios,” under “Bar ʿEbroyo,” or under “Maphrian.” Users need to be able to peruse alphabetical lists and easily pick out the persons they are looking for. In other words, it was our job to write the name on the cup so that the Starbucks barista would know what to call out when a Syriac author’s drink order is ready.
Fortunately for us, TEI has a “sort” attribute that takes numeric values and can be included in name part elements. Also, a list of headwords for a recently published Syriac encyclopedia included many of the names we were encoding. We automatically tagged the first name part that the encyclopedia listed as the top sort priority, and then applied the same sort priority to that name part in other versions of the name we had collected. For example, from the encyclopedia listing “Bar ʿEbroyo, Grigorios,” we were able to determine that the <addName type="family"> should have a sort priority of “1” for all versions of the name, whether Syriac, English, Arabic, or some other language.
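To make the role of the “sort” attribute concrete, here is a minimal Python sketch that reads one persName and orders its parts by sort value. It is an illustration only, not the code the project runs: the function name sort_key and the fallback value of 99 for parts lacking a sort attribute are assumptions made for this example, and the sample record omits the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Sample record adapted from the Bar Hebraeus example in this paper
# (TEI namespace omitted for brevity).
SAMPLE = """
<persName xml:lang="en">
  <addName type="untagged-title" sort="2">Mar </addName>
  <forename sort="2">Gregorius </forename>
  <addName type="family" sort="1">Bar Hebraeus </addName>
</persName>
"""


def sort_key(pers_name_xml: str) -> str:
    """Return the name parts ordered by @sort; ties keep document order."""
    root = ET.fromstring(pers_name_xml)
    parts = [
        # 99 is an assumed fallback for parts that carry no sort attribute.
        (int(el.get("sort", "99")), (el.text or "").strip())
        for el in root
    ]
    # sorted() is stable, so parts sharing a sort value stay in document order.
    ordered = sorted(parts, key=lambda part: part[0])
    return " ".join(text for _, text in ordered)


print(sort_key(SAMPLE))  # prints "Bar Hebraeus Mar Gregorius"
```

Because the family name carries sort priority 1 in every language version, the same function would file the Syriac and Arabic forms under the corresponding name part as well.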
Making Names Accessible in Multiple Languages and Transliterations
This brings me to the final encoding challenge, that of making sure the names could be properly accessed and searched in all of the different languages in which we were collecting them. First, we found that the ISO 639 language codes were inadequate and inaccurate in regard to Syriac, since they list two separate language codes for Syriac even though scholars would not make such a distinction. The code “syr” stands for “Syriac” as a macrolanguage, while “syc” is an unrelated code specifically for “Classical Syriac,” which is not included under the “syr” macrolanguage grouping.[5] This means that anything tagged as “syc” will not appear in searches for “syr.” Even though some of our material could be considered “Classical Syriac,” the diachronic nature of our dataset renders such a label an arbitrary judgment. If the same name is used in the 5th century C.E. and also the 15th century, it is difficult to distinguish one usage as “Classical Syriac” and the other as modern. We have therefore opted to use the “syr” code for all of our Syriac encoding. Meanwhile, we have formally petitioned the ISO 639 Registrar to associate the “syc” code under the larger “syr” macrolanguage grouping so that they might be linked in searching.
The other major challenge along these lines is that there is no single, universally accepted standard for how to transliterate Syriac into Latin characters. Thus, we had to decide on a standard to adopt for the headwords we display in Latin script (we tagged this with an extension of the English language tag, “en-x-gedsh”), but we also needed to include other English versions of the name from various sources and even to generate some automatically. For example, since the English version of the name “Grigorios” is “Gregory,” we added a persName element substituting the name “Gregory” for “Grigorios” for each person named “Grigorios.”
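The automatic generation of anglicized variants can be sketched roughly as follows. This Python example is only an approximation of the idea, not our production script: the ANGLICIZED lookup table (with its single entry taken from the text above), the function name anglicize, and the choice to copy the headword record are all assumptions for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical lookup table; the only pair stated in the paper is Grigorios -> Gregory.
ANGLICIZED = {"Grigorios": "Gregory"}

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

HEADWORD = """
<persName xml:lang="en-x-gedsh" syriaca-tags="#syriaca-headword">
  <addName type="family" sort="1">Bar \u02bfEbroyo </addName>
  <forename sort="2">Grigorios </forename>
</persName>
"""


def anglicize(pers_name):
    """Return an anglicized copy of pers_name, or None if nothing changed."""
    variant = copy.deepcopy(pers_name)
    changed = False
    for forename in variant.iter("forename"):
        name = (forename.text or "").strip()
        if name in ANGLICIZED:
            forename.text = ANGLICIZED[name] + " "
            changed = True
    if not changed:
        return None
    # Mark the copy as a generated English variant rather than a headword.
    variant.set(XML_LANG, "en")
    variant.set("syriaca-tags", "#syriaca-anglicized")
    return variant


variant = anglicize(ET.fromstring(HEADWORD))
print(ET.tostring(variant, encoding="unicode"))
```

Leaving the original element untouched and emitting a tagged copy keeps the generated form distinguishable from names attested in a source.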
The following code example illustrates the editorial decisions I have mentioned above:
<persName xml:lang="en-x-gedsh" source="#bib239-1" syriaca-tags="#syriaca-headword">
    <addName type="family" sort="1">Bar ʿEbroyo </addName>
    <forename sort="2">Grigorios </forename>
</persName>
<persName xml:lang="en" source="#bib239-2">
    <addName type="untagged-title" sort="2">Mar </addName>
    <forename sort="2">Gregorius </forename>
    <addName type="family" sort="1">Bar Hebraeus </addName>
</persName>
<persName xml:lang="ar" source="#bib239-3">
    <addName type="untagged-title" sort="2">مار </addName>
    <forename sort="2">غريغوريوس </forename>
    <addName type="untagged-title" sort="2">ابو الفرج </addName>
    <addName type="untagged-title" sort="2">الملطي </addName>
    <addName type="untagged-title" sort="2">مفريان </addName>
    <addName type="untagged-title" sort="2">المشهور بابن العبري </addName>
</persName>
<persName xml:lang="syr" source="#bib239-4">
    <addName type="untagged-title" sort="2">ܡܪܝ </addName>
    <forename sort="2">ܓܪܝܓܘܪܝܘܣ </forename>
    <addName type="untagged-title" sort="2">ܡܦܪܝܢܐ </addName>
    <addName type="family" sort="1">ܒܪ ܥܒܪܝܐ </addName>
</persName>
<persName xml:id="name239-7" xml:lang="en" resp="http://syriaca.org" syriaca-tags="#syriaca-anglicized">
    <addName type="family" sort="1">Bar ʿEbroyo </addName>
    <forename sort="2">Gregory </forename>
</persName>
HTML Visualization Challenges & Solutions
Searching name variants
Once names were properly encoded, the issues of search and display needed to be addressed. Syriaca.org uses eXist-db, a native XML database, for storing, processing, and searching our TEI files. eXist-db provides a number of configurable indexing methods for searching XML documents, including a full-text search backed by the Apache Lucene search framework. An advantage of using Lucene for full-text searching is the level of control it gives the developer through a wide variety of available text analyzers. Lucene also allows for the creation of custom analyzers as needed, as well as customizable weighting of elements in the index. In eXist-db, multiple analyzers may be defined and used with different indexes [eXistdb 2014].
We found that the Standard Analyzer was sufficient for most of our needs, as it provides non-language-specific text segmentation [Lucene 2013]. However, in addition to the non-language-specific tokenizer, we needed better handling of searches on names containing diacritics. For example, a user entering the text “Abda” should get hits for “Abdā,” “Abda,” and “ʿAbdā.” eXist-db provides a customization of the Standard Analyzer that allows diacritic-insensitive searches; this is enabled by a simple flag (diacritics="no") in the index configuration file. We found this implementation to be fairly robust, and it satisfied most of our use cases. At present we have two outlying cases not handled by this analyzer: the ʿ (left half ring) and ʾ (right half ring) characters, representing guttural sounds in the Syriac language that Nathan will now attempt to reproduce!
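One plausible way to see why the half rings fall outside diacritic folding is to look at how Unicode classifies them. The Python sketch below is not the Lucene or eXist-db implementation; it is a generic folding routine (decompose, then drop combining marks), and the function name fold_diacritics is invented for this illustration. Whether the analyzer works exactly this way is an assumption.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "\u0101" is a with macron; "\u02bf" is the left half ring used for ayin.
for name in ["Abd\u0101", "Abda", "\u02bfAbd\u0101"]:
    print(name, "->", fold_diacritics(name))

# The half rings are standalone modifier letters, not combining marks,
# so a routine like this one leaves them in place.
print(unicodedata.category("\u02bf"), unicodedata.category("\u02be"))
```

The macron is removed because it decomposes into a base letter plus a combining mark, while the half ring survives as a separate character, which is consistent with the two outlying cases described above.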
In the interest of efficiency, we decided to add an additional tei:persName element, generated via an XQuery update script at load time, rather than attempting to write a custom Lucene analyzer. This script adds name variants with the left and right half ring characters stripped from them, as well as a variant with an apostrophe. These name variants are flagged with the syriaca-tags attribute (as @syriaca-tags="#syriaca-simplified-script") and are used only for full-text searching. They are suppressed in the HTML view of the person record [Syriaca.org 2014].
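The variant generation just described is done in XQuery in our setup; the following Python sketch only mirrors the logic for readers who want to experiment with it. The function name simplified_variants is hypothetical, and the assumption that the apostrophe variant puts a plain apostrophe where each half ring stood is ours for the purposes of this example.

```python
from typing import List

LEFT_HALF_RING = "\u02bf"   # transliterates ayin
RIGHT_HALF_RING = "\u02be"  # transliterates alaph / glottal stop


def simplified_variants(name: str) -> List[str]:
    """Return extra search-only variants for a name containing half rings."""
    if LEFT_HALF_RING not in name and RIGHT_HALF_RING not in name:
        return []  # nothing to simplify
    stripped = name.replace(LEFT_HALF_RING, "").replace(RIGHT_HALF_RING, "")
    # Assumption: the apostrophe variant substitutes "'" for each half ring.
    apostrophe = name.replace(LEFT_HALF_RING, "'").replace(RIGHT_HALF_RING, "'")
    return [stripped, apostrophe]


print(simplified_variants("Bar \u02bfEbroyo"))
```

Each returned string would then be wrapped in its own persName flagged as a simplified-script variant, so that it is indexed for search but never displayed.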
Visualizing RTL and LTR names
Handling bidirectional text for web display was the next technical challenge. Names could be in multiple languages, several of which use right-to-left scripts. Generally speaking, Syriaca.org provides HTML pages with a base direction of LTR (left-to-right). Within such a page there can be text blocks of Syriac or Arabic that require RTL (right-to-left) rendering, and even inline interpolations of RTL names within an LTR text block. The general recommendation of the W3C for handling bidirectional text is to use markup rather than CSS to indicate text direction [W3C 2014]. There are several options for marking up bidirectional text, and it has been a challenge working out the kinks and quirks of each. The HTML5 bdi element provides the best support for bidirectional text, as it isolates the bidirectional text from the surrounding text and addresses some of the odd display issues that can arise with the dir attribute. However, it is not well supported, with only Firefox 10.0 and Chrome 16.0 offering support [w3school 2015]. Similar results can be achieved by tightly wrapping each section of bidirectional text in a span element with a dir attribute. As our records make extensive use of the xml:lang attribute, it is easy to add specific dir attributes as required.
<h1><span dir="ltr">Ephrem - </span><span lang="syr" dir="rtl">ܐܦܪܝܡ</span></h1>
This markup is generated via an XSLT stylesheet, which selects the appropriate font face and text direction based on the xml:lang attribute of each element rendered.
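The mapping from language tag to text direction is simple enough to sketch outside XSLT. The Python example below is an approximation of that step, not our stylesheet: the RTL_LANGS set and the function name bidi_span are assumptions for illustration, and a fuller version would cover every right-to-left language in the dataset.

```python
from html import escape

# Assumed set of right-to-left primary language codes for this example.
RTL_LANGS = {"syr", "ar"}


def bidi_span(text: str, lang: str) -> str:
    """Wrap text in a span whose dir attribute follows its language tag."""
    # Only the primary subtag matters: "en-x-gedsh" is still left-to-right.
    primary = lang.split("-")[0].lower()
    direction = "rtl" if primary in RTL_LANGS else "ltr"
    return f'<span lang="{escape(lang)}" dir="{direction}">{escape(text)}</span>'


heading = bidi_span("Ephrem - ", "en") + bidi_span("\u0710\u0726\u072a\u071d\u0721", "syr")
print("<h1>" + heading + "</h1>")
```

Keeping the lang attribute on each span also gives the stylesheet a hook for choosing a font per language.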
Embedding Syriac Fonts in HTML
Displaying Syriac fonts presented an additional challenge, as they are not natively supported in the browser. Initial development required users to download and install the Meltho family of fonts from Beth Mardutho to view Syriac. Drawbacks to this approach were the lack of support on mobile devices and inconsistent browser support. To address this, Syriaca.org now uses the CSS @font-face rule to embed the Meltho fonts on the web site; @font-face works in all major browsers, including mobile browsers. The appropriate font family is selected during XSLT conversion from TEI to HTML based on the xml:lang attribute on each element. This attribute is translated into the appropriate lang attribute on the HTML element, which is then used to select the correct font using the lang selector rules defined in the CSS.
Conclusions
When it comes to encoding names from the ancient and medieval Middle East using modern, Western tools, we have found it possible to use only some features of those tools as they were designed, whereas in other respects, we have had to use customizations and workarounds to circumvent the limitations of TEI and HTML. Updating TEI standards and ISO codes and improving browser support for certain HTML features would make encoding projects like ours easier and more robust. Additionally, further research into leveraging Lucene’s indexing and search technology as well as more complex search syntax to take advantage of the complexity of the TEI records could be used to improve findability.
References
[eXistdb 2014] eXistdb. “eXist-db Documentation - Lucene Index Module.” 2014. http://exist-db.org/exist/apps/doc/lucene.xml.
[Soualah and Hassoun 2012] Soualah, Mohammed Ourabah and Mohamed Hassoun. “A TEI P5 Manuscript Description Adaptation for Cataloguing Digitized Arabic Manuscripts,” Journal of the Text Encoding Initiative. February 2012. http://jtei.revues.org/398; doi:https://doi.org/10.4000/jtei.398.
[Lucene 2013] Lucene. “org.apache.lucene.analysis.standard (Lucene 4.6.0 API).” 2013. http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/standard/package-summary.html.
[SIL International 2015] SIL International. “ISO 639 Code Tables.” 2015. http://www-01.sil.org/iso639-3/codes.asp.
[Syriaca.org 2014] Syriaca.org. “TEI Tag usage in Gazetteer.” 2014. http://syriaca.org/documentation/view-teiDocs.html.
[Takahashi 2011] Takahashi, H. “Bar ʿEbroyo, Grigorios” pp. 54-56 in Sebastian P. Brock, et al. (eds.), The Gorgias Encyclopedic Dictionary of the Syriac Heritage. Piscataway, NJ: Gorgias Press, 2011.
[TEI 2008a] TEI Consortium. “TEI Element forename.” 2008. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-forename.html.
[TEI 2008b] TEI Consortium. “TEI Element surname.” 2008. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-surname.html.
[W3C 2014] W3C. “Inline markup and bidirectional text in HTML.” 2014. http://www.w3.org/International/articles/inline-bidi-markup/.
[w3school 2015] W3School. “HTML bdi Tag.” 2015. http://www.w3schools.com/tags/tag_bdi.asp.
[1] Similar limitations of the TEI have been described in Soualah and Hassoun 2012.
[2] Syriaca.org has received funding from the National Endowment for the Humanities, The International Balzan Prize Foundation, and the Mellon Foundation. Additional funding has come from the university sponsors, including Vanderbilt University, the project host.
[3] Edited by David A. Michelson (General Editor) and Thomas A. Carlson, Nathan P. Gibson, and Jeanne-Nicole Mellon Saint-Laurent (Associate Editors).
[4] Along these lines, we have already been involved in helping OCLC, the makers of Worldcat.org, sanitize data for Syriac-related persons in their Virtual International Authority File (VIAF) by merging duplicate records, disambiguating conflated records, and providing names in Syriac script. The difficulty of curating URIs for Syriac persons is shown by the fact that we sometimes found 5-10 duplicate records for some of the authors listed in VIAF.
[5] See both codes under SIL International 2015.