Many Views of the Same Text
Linguistic Datasets for Greek and Hebrew
This paper is about the challenging data integration problems that Clear Bible had to solve when working with many analyses of the same biblical texts, each with its own use cases, reference systems, models, and perspectives, and sometimes based on different variants of the text. We focus on one particular data integration challenge: combining an analysis of the words in the Hebrew Old Testament (a morphological analysis) with an analysis of the sentences in the Hebrew Old Testament (a syntactic analysis, also known as a syntax tree). The syntax tree describes the relationships among words; for instance, it identifies the subject, object, and adjuncts of each verb. The morphological analysis explains the form of each word. For instance, in English, the pronoun "I" appears in that form when it functions as the subject of a verb, as "me" when it functions as the object of a verb, and as "my" or "mine" when used as a possessive. The morphology explains the word forms; the syntax tree relates these words to the overall structure of the sentence. If the word form is "me", the morphological analysis should say that the form is an object pronoun, and the syntax tree should say that it is the object of a particular verb.
Clear's own datasets include:
-
Morphology: Is it a verb, a noun, an adjective, an adverb, or something else? How is that word used?
-
Word senses: Which meanings does a Hebrew or Greek word have?
-
Synonyms: Which Hebrew and Greek words are related in meaning?
-
Syntax Trees: What are the relationships between words, phrases, and clauses?
-
Semantic Roles: Who does what to whom? (e.g., doers and receivers of actions)
-
Participant Referents: Who is “he,” “she,” or “it” in this sentence?
-
Similar Texts: Which phrases and clauses have “close relatives” elsewhere?
In our own datasets, we have been able to follow consistent conventions, but the other datasets we use have their own conventions. Here are some of the third-party datasets we are using. Most are already integrated; we hope to have fully integrated the rest in the coming months.
-
Hebrew morphology from the Open Scriptures Hebrew Bible
-
Word senses from the Semantic Dictionary of Biblical Hebrew and the Semantic Dictionary of Biblical Greek
-
Hebrew transliteration, glosses, and notes from SIL International
-
English and Mandarin glosses from Cherith, Inc.
-
Faith Comes by Hearing speaker identification data
-
Figure of Speech data from unfoldingWord
At Clear, we use these datasets in our own tools, including a dashboard for translation consultants to identify and address potential issues in a translation, an NLP engine for aligning translations to Greek or Hebrew, an environment for reading the biblical text in Hebrew, Aramaic, and Greek, and a syntax tree editor. We also align translations to the original Hebrew and Greek words so that images, maps, articles, and other resources can be associated with the original language text and used with translations. These alignments need to work with a wide variety of translations in thousands of languages, translations that may follow different canons and versification.
To make all of this possible, we have had to discover ways to integrate across datasets that were not designed to be used together. These datasets represent different ways of understanding the text, at various levels of analysis, using various linguistic and hermeneutical approaches. They were originally designed for a wide variety of purposes. Taken together, they provide a fuller understanding of the text. But in order to take them together, we need to find ways to make them interoperate in ways the original data creators did not foresee. Some of these ways involve tight integration; others involve loose coupling.
This paper discusses some of the major challenges we faced and the solutions we discovered. Although we are working with many datasets, each with its own challenges, this paper focuses on one integration: combining the Westminster Hebrew Syntax Without Morphology with the Open Scriptures Hebrew Bible.
Core MACULA Tree Structure
Before we describe the data integration challenges, we need to explain what we are building. In MACULA, a number of data sources are joined together in one enhanced syntax tree. This makes queries simpler and faster. Other datasets share reference systems so that they can interoperate with the main tree while still remaining loosely coupled. MACULA also has mappings from our internal dataset to reference systems used in other sources.
In this paper, we will focus on the core MACULA tree structure for Hebrew. Because these trees contain a fairly large amount of information, this section will illustrate the structure of the trees with a very small subset of a sentence, the Hebrew equivalent to "and there was evening" from Genesis 1:5[1]:
Table I
וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ פ |
And there was evening, and there was morning, the first day. |
Table II
וַֽ = "and" |
יְהִי = "there was" |
עֶ֥רֶב = "evening" |
In MACULA's Hebrew trees, the second token and the third token form a clause; the first token is outside the clause. Here is how this is represented in our markup[3] [4]:
<wg class="cjp" head="true" rule="cj2cjp"> <w ref="GEN 1:5!8" xml:id="o010010050081" mandarin="于是" english="and" greek="καὶ" strongnumberx="2050b" class="cj" unicode="וַֽ" morph="C" lang="H" lemma="c" pos="conjunction">וַֽ</w> </wg> <wg class="cl" head="true" rule="v-s"> <wg role="v" class="vp" head="true" rule="v2vp"> <w ref="GEN 1:5!8" xml:id="o010010050082" mandarin="有" english="there was" greek="ἐγένετο" strongnumberx="1961" class="verb" morph="Vqw3ms" lang="H" lemma="1961" pos="verb" gender="masculine" number="singular" stem="qal" person="third" after="־" <!-- Extra whitespace added to avoid BIDI problems - see footnote. --> >יְהִי</w> </wg> <wg role="s" class="np" head="true" rule="n2np"> <w ref="GEN 1:5!9" xml:id="o010010050091" mandarin="晚上" english="evening" domain="002002002010" sdbh="005645001001000" greek="ἑσπέρα" strongnumberx="6153" class="noun" morph="Ncmsa" lang="H" lemma="6153" pos="noun" gender="masculine" number="singular" state="absolute" after=" ">עֶ֥רֶב</w> </wg> </wg>
Two Independent Analyses
In this tree, the overall syntax tree structure is based on the analysis found in Westminster Hebrew Syntax Without Morphology. The morphological analysis that describes individual words comes from Open Scriptures Hebrew Bible, including the lemma and morph attributes and a set of attributes that interpret the morph code so that it is more easily read or queried (lang, pos, gender, state, etc.).
Here is the analysis of the phrase "and it was evening" as represented in the Groves trees:
<Node Cat="cjp" Start="13" End="13" Rule="Cj2Cjp" Head="0" Language="H" nodeId="010010050300011" Length="1"> <Node Cat="cj" Start="13" End="13" Length="1" morphId="010010050081" Language="H" Unicode="וַֽ" nodeId="010010050300010">WA75</Node> </Node> <Node Cat="CL" Start="14" End="15" Rule="V-S" Head="0" Language="H" nodeId="010010050310060" Length="6"> <Node Cat="V" Start="14" End="14" Rule="Vp2V" Head="0" Language="H" nodeId="010010050310032" Length="3"> <Node Cat="vp" Start="14" End="14" Rule="V2VP" Head="0" Language="H" nodeId="010010050310031" Length="3"> <Node Cat="verb" Start="14" End="14" Length="3" morphId="010010050082" Language="H" Unicode="יְהִי־" nodeId="010010050310030">Y:HIY-</Node> </Node> </Node> <Node Cat="S" Start="15" End="15" Rule="Np2S" Head="0" Language="H" nodeId="010010050340032" Length="3"> <Node Cat="np" Start="15" End="15" Rule="N2NP" Head="0" Language="H" nodeId="010010050340031" Length="3"> <Node Cat="noun" Start="15" End="15" Length="3" morphId="010010050091" Language="H" Unicode="עֶ֥רֶב" nodeId="010010050340030">(E71REB</Node> </Node> </Node> </Node>
Here is Genesis 1:5 as represented in the Open Scriptures Hebrew Bible:
<verse osisID="Gen.1.5">
  <w lemma="c/7121" morph="HC/Vqw3ms" id="01nAB">וַ/יִּקְרָ֨א</w>
  <w lemma="430" morph="HNcmpa" id="01kfX">אֱלֹהִ֤ים</w>
  <seg type="x-paseq">׀</seg>
  <w lemma="l/216" n="1.1.0" morph="HRd/Ncbsa" id="01Wkf">לָ/אוֹר֙</w>
  <w lemma="3117" n="1.1" morph="HNcmsa" id="01wrL">י֔וֹם</w>
  <w lemma="c/l/2822" n="1.0" morph="HC/Rd/Ncmsa" id="013TL">וְ/לַ/חֹ֖שֶׁךְ</w>
  <w lemma="7121" morph="HVqp3ms" id="01LeN">קָ֣רָא</w>
  <w lemma="3915" n="1" morph="HNcmsa" id="01sMn">לָ֑יְלָה</w>
  <w lemma="c/1961" morph="HC/Vqw3ms" id="01Y3z">וַֽ/יְהִי</w><seg type="x-maqqef">־</seg><w lemma="6153" morph="HNcmsa" id="01NQN">עֶ֥רֶב</w>
  <w lemma="c/1961" morph="HC/Vqw3ms" id="01uLf">וַֽ/יְהִי</w><seg type="x-maqqef">־</seg><w lemma="1242" n="0.0" morph="HNcmsa" id="01kA7">בֹ֖קֶר</w>
  <w lemma="3117" morph="HNcmsa" id="013TS">י֥וֹם</w>
  <w lemma="259" n="0" morph="HAcmsa" id="01NFp">אֶחָֽד</w><seg type="x-sof-pasuq">׃</seg>
  <seg type="x-pe">פ</seg>
</verse>
In the above, the analysis of "and it was evening" is shown on a single line:
<w lemma="c/1961" morph="HC/Vqw3ms" id="01Y3z">וַֽ/יְהִי</w><seg type="x-maqqef">־</seg><w lemma="6153" morph="HNcmsa" id="01NQN">עֶ֥רֶב</w>
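The slash-delimited encoding in the line above can be split mechanically into morphs. The following is a minimal Python sketch of that idea, not the XQuery the pipeline actually uses. The function name explode_word and the dictionary layout are hypothetical, and the sketch assumes that the first character of the morph attribute is a language code and that lemma may have fewer parts than morph (for example, when a suffix has no lemma of its own).

```python
def explode_word(lemma, morph, text):
    """Split one OSHB-style <w> into a list of morph dictionaries.

    Assumptions (not confirmed by the source): the first character of
    `morph` is a language code, and missing lemma parts are padded with None.
    """
    lang, codes = morph[0], morph[1:].split("/")
    parts = text.split("/")
    if len(codes) != len(parts):
        raise ValueError(f"{len(codes)} morph codes but {len(parts)} text parts")
    lemmas = lemma.split("/")
    # Pad so that a suffix without its own lemma still gets an entry.
    lemmas += [None] * (len(parts) - len(lemmas))
    return [
        {"lang": lang, "lemma": l, "morph": c, "text": t}
        for l, c, t in zip(lemmas, codes, parts)
    ]


if __name__ == "__main__":
    for m in explode_word("c/1961", "HC/Vqw3ms", "וַֽ/יְהִי"):
        print(m)
```

Raising an error when the number of morph codes and text parts disagree is a deliberate choice in this sketch: a mismatch usually signals a problem in the source that should be inspected rather than silently repaired.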
The Data Integration Challenge
At this point, we have set up the data integration challenge: we need to combine the morphology and the syntax tree to create the tree structure that MACULA uses. Here are some characteristics of these datasets that add to the challenge, and the approach we took to overcoming each challenge. A fuller discussion of each point occurs later in the paper.
-
The underlying Hebrew text for these analyses is not identical. We chose to use the text of the morphology as our text. The morphology actually had two different texts, so we chose the one that best fit the syntactic analysis.
-
These analyses have different models that disagree about what a word is, and that affects the units of analysis. They also make different choices with respect to compound nouns. To allow mapping to both models, we split to the most granular level, morphs, and chose a simple way to identify the "word" that each morph is part of.
-
Even when breaking down the same word into morphs, one analysis creates an additional morph for an implicit article and the other does not. They also sometimes disagree about whether an implicit article is present in the first place.
-
As in any analysis of this complexity, there are errors in each source that affect integration. We had to find ways to deal with these errors that did not require us to wait for upstream changes, but still allowed us to push corrections upstream.
Prepare with pipelines, then merge
Our basic approach was to start with the syntax trees, transforming them in various ways to a format we prefer, creating reference systems that can map to other sources, then create pipelines that adapt other resources so that they can easily be integrated into the tree. These pipelines need to be verifiable, and we need to be able to use them for updates as the underlying resources mature in their upstream sources.
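Once both sources share identifiers, the merge itself can be little more than a keyed join. The sketch below is an illustration in Python rather than the code used for MACULA. It assumes that terminal nodes in the syntax tree carry a morphId attribute and that prepared morphs carry the same value in an n attribute, as in the samples shown in this paper. The function name merge_morphology is hypothetical.

```python
import xml.etree.ElementTree as ET


def merge_morphology(tree_root, morph_root):
    """Copy morph attributes onto syntax-tree terminals that share an identifier.

    Returns the list of tree identifiers with no matching morph, so that
    discrepancies can be reviewed instead of being silently dropped.
    """
    morphs_by_id = {m.get("n"): m for m in morph_root.iter("m")}
    unmatched = []
    for node in tree_root.iter("Node"):
        morph_id = node.get("morphId")
        if morph_id is None:
            continue  # non-terminal node: nothing to merge
        morph = morphs_by_id.get(morph_id)
        if morph is None:
            unmatched.append(morph_id)
            continue
        for name, value in morph.attrib.items():
            if name != "n":  # the identifier is already on the node
                node.set(name, value)
    return unmatched


if __name__ == "__main__":
    tree = ET.fromstring(
        '<Node Cat="CL"><Node Cat="verb" morphId="010010050082">Y:HIY-</Node></Node>'
    )
    morphs = ET.fromstring(
        '<verse><m n="010010050082" lemma="1961" morph="Vqw3ms" pos="verb"/></verse>'
    )
    print(merge_morphology(tree, morphs))
    print(ET.tostring(tree, encoding="unicode"))
```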
The pipeline that prepares the morphology is implemented in BaseX. The command file, prepare-oshb-for-trees.bxs, copies the original OSHB database to a new one, then runs 11 updating queries. Each query makes one simple change. Four of these queries are 4 lines or fewer. The longest query contains 290 lines, almost entirely case statements that correspond to the tables in the documentation for the morphological analysis. No other query contains more than 50 lines. Here is the BaseX command file we use for this:
# prepare-oshb-for-trees.bxs
SET CHOP false
SET EXPORTER indent=no,omit-xml-declaration=no
XQUERY db:copy('oshb-morphology-raw', 'oshb-morphology')
OPEN oshb-morphology
RUN ./xquery/hoist-qere-reading.xq
RUN ./xquery/remove-medial-segs.xq
RUN ./xquery/add-after-attributes.xq
RUN ./xquery/strip-leading-h-from-morph.xq
RUN ./xquery/explode-word-parts.xq
RUN ./xquery/add-bcvwp-numbering.xq
RUN ./xquery/add-implicit-article.xq
RUN ./xquery/delete-ketiv-reading.xq
RUN ./xquery/remove-w-elements.xq
RUN ./xquery/mark-proper-nouns.xq
RUN ./xquery/expand-oshb-attributes.xq
OPTIMIZE
EXPORT ./out/

For debugging purposes, we can have a separate command file that adds an export command after each stage, storing the results of each in a separate directory. We can also run queries that evaluate the result of a given stage, placing output in the corresponding directory. The query names give some indication of what each step does. Some of these will be discussed later, but a brief explanation of the most important steps will give a sense of what they do.
-
hoist-qere-reading.xq and delete-ketiv-reading.xq involve two different readings of the text, which are named Ketiv and Qere. In the syntax trees, we use the Qere reading.
-
add-after-attributes.xq adds after attributes that contain the punctuation and whitespace needed to correctly join morphs to form a sentence.
-
explode-word-parts.xq converts each word into a container element with a series of morphs, each with its own morphological analysis.
-
add-bcvwp-numbering.xq inserts identifier attributes into each morph, using the same identifiers that we use in our syntax trees.
-
mark-proper-nouns.xq creates proper nouns like "Beth El" or "Tubal Cain" by merging the relevant morphs.
-
expand-oshb-attributes.xq converts a morph code like Ncmpa into a series of attributes like pos="noun" type="common" gender="masculine" number="plural" state="absolute".
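To make the last step concrete, here is a small Python sketch of expanding a morph code into attributes. It is not the 290-line XQuery described above: it handles noun codes only, and its lookup tables contain only the letter values that are attested in the examples in this paper. The names NOUN_FIELDS and expand_noun_code are hypothetical; a real implementation would follow the full tables in the OSHB documentation.

```python
# Positional fields of a noun code after the leading "N".
# Deliberately partial: only values attested in this paper's examples.
NOUN_FIELDS = (
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "b": "both"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("state", {"a": "absolute"}),
)


def expand_noun_code(code):
    """Expand a noun morph code such as 'Ncmpa' into readable attributes."""
    if not code.startswith("N"):
        raise ValueError(f"not a noun code: {code!r}")
    attrs = {"pos": "noun"}
    for letter, (name, table) in zip(code[1:], NOUN_FIELDS):
        if letter not in table:
            raise ValueError(f"unsupported {name} value {letter!r} in {code!r}")
        attrs[name] = table[letter]
    return attrs


if __name__ == "__main__":
    print(expand_noun_code("Ncmpa"))
    print(expand_noun_code("Np"))
```

Because the code is positional, a short code such as Np simply yields fewer attributes. Unknown letters raise an error instead of being guessed.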
<verse osisID="Gen.1.5"> <m n="010010050011" morph="C" lang="H" lemma="c" pos="conjunction">וַ</m> <m n="010010050012" morph="Vqw3ms" lang="H" lemma="7121" after=" " pos="verb" stem="qal" type="wayyiqtol" person="third" gender="masculine" number="singular">יִּקְרָ֨א</m> <m n="010010050021" lang="H" after="׀" lemma="430" morph="Ncmpa" id="01kfX" pos="noun" type="common" gender="masculine" number="plural" state="absolute">אֱלֹהִ֤ים</m> <seg type="x-paseq">׀</seg> <m n="010010050031" morph="Rd" lang="H" lemma="l" pos="preposition">לָ</m> <m n="010010050031ה" morph="Td" lemma="d" lang="H" pos="particle" type="definite article"/> <m n="010010050032" morph="Ncbsa" lang="H" lemma="216" after=" " pos="noun" type="common" gender="both" number="singular" state="absolute">אוֹר֙</m> <m n="010010050041" lang="H" after=" " lemma="3117" morph="Ncmsa" id="01wrL" pos="noun" type="common" gender="masculine" number="singular" state="absolute">י֔וֹם</m> <m n="010010050051" morph="C" lang="H" lemma="c" pos="conjunction">וְ</m> <m n="010010050052" morph="Rd" lang="H" lemma="l" pos="preposition">לַ</m> <m n="010010050052ה" morph="Td" lemma="d" lang="H" pos="particle" type="definite article"/> <m n="010010050053" morph="Ncmsa" lang="H" lemma="2822" after=" " pos="noun" type="common" gender="masculine" number="singular" state="absolute">חֹ֖שֶׁךְ</m> <m n="010010050061" lang="H" after=" " lemma="7121" morph="Vqp3ms" id="01LeN" pos="verb" stem="qal" type="qatal" person="third" gender="masculine" number="singular">קָ֣רָא</m> <m n="010010050071" lang="H" after=" " lemma="3915" morph="Ncmsa" id="01sMn" pos="noun" type="common" gender="masculine" number="singular" state="absolute">לָ֑יְלָה</m> <m n="010010050081" morph="C" lang="H" lemma="c" pos="conjunction">וַֽ</m> <m n="010010050082" morph="Vqw3ms" lang="H" lemma="1961" after="־" pos="verb" stem="qal" type="wayyiqtol" person="third" gender="masculine" number="singular">יְהִי</m> <seg type="x-maqqef">־</seg> <m n="010010050091" lang="H" 
after=" " lemma="6153" morph="Ncmsa" id="01NQN" pos="noun" type="common" gender="masculine" number="singular" state="absolute">עֶ֥רֶב</m> <m n="010010050101" morph="C" lang="H" lemma="c" pos="conjunction">וַֽ</m> <m n="010010050102" morph="Vqw3ms" lang="H" lemma="1961" after="־" pos="verb" stem="qal" type="wayyiqtol" person="third" gender="masculine" number="singular">יְהִי</m> <seg type="x-maqqef">־</seg> <m n="010010050111" lang="H" after=" " lemma="1242" morph="Ncmsa" id="01kA7" pos="noun" type="common" gender="masculine" number="singular" state="absolute">בֹ֖קֶר</m> <m n="010010050121" lang="H" after=" " lemma="3117" morph="Ncmsa" id="013TS" pos="noun" type="common" gender="masculine" number="singular" state="absolute">י֥וֹם</m> <m n="010010050131" lang="H" after="׃פ" lemma="259" morph="Acmsa" id="01NFp" pos="adjective" type="cardinal number" gender="masculine" number="singular" state="absolute">אֶחָֽד</m> <seg type="x-sof-pasuq">׃</seg> <seg type="x-pe">פ</seg> </verse>
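One use of the after attributes in output like this is to rebuild the running text of a verse from its morphs, which gives a simple check that the text has not been damaged. The Python sketch below assumes the element and attribute names shown in the sample above (m elements with an optional after attribute, and seg elements that repeat punctuation already captured in after). The function name verse_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def verse_text(verse):
    """Rebuild the text of a verse from its morphs and their after attributes.

    <seg> elements are skipped on the assumption that their content is
    already present in the preceding morph's after attribute.
    """
    pieces = []
    for m in verse.iter("m"):
        pieces.append((m.text or "") + m.get("after", ""))
    return "".join(pieces)


SAMPLE = """<verse osisID="Gen.1.5">
  <m n="010010050081" morph="C">וַֽ</m>
  <m n="010010050082" morph="Vqw3ms" after="־">יְהִי</m>
  <seg type="x-maqqef">־</seg>
  <m n="010010050091" morph="Ncmsa" after=" ">עֶ֥רֶב</m>
</verse>"""

if __name__ == "__main__":
    print(verse_text(ET.fromstring(SAMPLE)))
```

Empty morphs, such as the element that represents an implicit article, contribute nothing to the rebuilt text.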
The rest of this article discusses the approaches we took and the lessons we learned as we merged these two datasets, with illustrations drawn from the data itself.
Let the Text Drive
But first, you have to find the text ...
When combining multiple analyses of a given text, the analyses may have nothing in common except the text itself. Sometimes, even the text may vary. Regardless, the text is at the heart of any analysis of a text, and it is often the one thing you can rely on most. But sometimes you cannot rely even on that, or you need further knowledge in order to be able to identify a common text you can rely on.
Are we looking at the same text?
Before merging two analyses, always make sure that they are analyzing the same text, or that you have a strategy for handling the differences that are there. Most analyses include excerpts of the text. Use them. Is the entire text there, or are some parts missing? Are there unintentional duplicates of sections of the text? Are there variant readings, spelling variation, or Unicode encoding artifacts? All of these issues are commonly found in analyses we have worked with, including analyses from reputable sources. If some of the text is missing, we can often simply omit that analysis for the portion of the text it did not cover, hoping that the original authors or some other party can complete the analysis. If there are variants, it is important to have a clear understanding of which variants are being used and why. If your main task is to add an analysis to a text, you can ignore minor punctuation, spelling, and encoding issues, since you have already chosen a representation of the text.
"You can't hit what you can't see." This slogan originated with the boxer Muhammad Ali, but it also applies to complex data integration scenarios. If you want to understand the textual differences between two sources and establish a common text, you need an efficient way to see the differences that you care about and ignore everything else. When we integrated the syntax tree with the morphological analysis, we wrote code to strip away punctuation and diacritics and add a delimiter at morpheme boundaries, then applied that to each verse in the Hebrew Bible, reporting those verses that differ. And we learned some things that surprised us. Here is sample output from the query:
<mismatch verse="ju16:25" n="07016025" nmorphs="36 35">
  <a>ו|יהי|כי|כ|טוב|לב|מ|ו|יאמרו|קראו|ל|שמשונ|ו|ישחק|ל|נו|ו|יקראו|ל|שמשונ|מ|בית|ה|אסורימ|ו|יצחק|ל|פני|המ|ו|יעמידו|אות|ו|בינ|ה|עמודימ</a>
  <b>ו|יהי|כ|טוב|לב|מ|ו|יאמרו|קראו|ל|שמשונ|ו|ישחק|ל|נו|ו|יקראו|ל|שמשונ|מ|בית|ה|אסורימ|ו|יצחק|ל|פני|המ|ו|יעמידו|אות|ו|בינ|ה|עמודימ</b>
</mismatch>
<mismatch verse="gn18:10" n="01018010" nmorphs="27 26">
  <a>ו|יאמר|שוב|אשוב|אלי|כ|כ||עת|חיה|ו|הנה|בנ|ל|שרה|אשת|כ|ו|שרה|שמעת|פתח|ה|אהל|ו|הוא|אחרי|ו</a>
  <b>ו|יאמר|שוב|אשוב|אלי|כ|כ|עת|חיה|ו|הנה|בנ|ל|שרה|אשת|כ|ו|שרה|שמעת|פתח|ה|אהל|ו|הוא|אחרי|ו</b>
</mismatch>
<mismatch verse="gn28:5" n="01028005" nmorphs="22 21">
  <a>ו|ישלח|יצחק|את|יעקב|ו|ילכ|פדנ|ה|ארמ|אל|לבנ|בנ|בתואל|ה|ארמי|אחי|רבקה|אמ|יעקב|ו|עשו</a>
  <b>ו|ישלח|יצחק|את|יעקב|ו|ילכ|פדנה|ארמ|אל|לבנ|בנ|בתואל|ה|ארמי|אחי|רבקה|אמ|יעקב|ו|עשו</b>
</mismatch>
<mismatch verse="gn38:24" n="01038024" nmorphs="29 28">
  <a>ו|יהי|כ|מ|שלש|חדשימ|ו|יגד|ל|יהודה|ל|אמר|זנתה|תמר|כלת|כ|ו|גמ|הנה|הרה|ל|זנונימ|ו|יאמר|יהודה|הוציאו|ה|ו|תשרפ</a>
  <b>ו|יהי|כ|משלש|חדשימ|ו|יגד|ל|יהודה|ל|אמר|זנתה|תמר|כלת|כ|ו|גמ|הנה|הרה|ל|זנונימ|ו|יאמר|יהודה|הוציאו|ה|ו|תשרפ</b>
</mismatch>

This simplified form of the verse text was originally developed for software: it is easy to use in simple string comparisons, so only verses that differ need to be examined. It is also useful in unit tests to make sure that we do not corrupt the text when we add new analyses. But we also found that Hebrew experts who do not program could quickly spot the discrepancies by scanning these strings. We shared these files in Slack channels, the experts wrote notes in them, and we met to identify categories of discrepancies and how to address them. Before we started using this approach, we were spending many hours reading XML and looking for differences, or writing code to find differences in complex data structures. We found the new approach much more efficient.
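The comparison described here can be sketched in a few lines of Python. This is an illustration of the technique, not the query that produced the output above. It assumes each source has already been reduced to a mapping from verse identifier to a list of morph strings. Judging from the sample output, final letter forms appear to be folded to their regular forms, so the sketch does the same; that folding is an inference from the data. The names simplify and find_mismatches are hypothetical.

```python
# Map Hebrew final letter forms to their regular forms (an inference from
# the sample output, where final nun, mem, and kaf appear in regular form).
FINAL_FORMS = {
    0x05DA: 0x05DB,  # final kaf   -> kaf
    0x05DD: 0x05DE,  # final mem   -> mem
    0x05DF: 0x05E0,  # final nun   -> nun
    0x05E3: 0x05E4,  # final pe    -> pe
    0x05E5: 0x05E6,  # final tsadi -> tsadi
}


def simplify(morphs):
    """Keep only Hebrew letters in each morph and join morphs with '|'."""
    cleaned = []
    for morph in morphs:
        # U+05D0..U+05EA is the Hebrew letter range; vowel points,
        # accents, and punctuation fall outside it and are dropped.
        letters = "".join(ch for ch in morph if "\u05d0" <= ch <= "\u05ea")
        cleaned.append(letters.translate(FINAL_FORMS))
    return "|".join(cleaned)


def find_mismatches(source_a, source_b):
    """Return (verse, a, b) for every shared verse whose simplified text differs."""
    mismatches = []
    for verse in sorted(source_a.keys() & source_b.keys()):
        a, b = simplify(source_a[verse]), simplify(source_b[verse])
        if a != b:
            mismatches.append((verse, a, b))
    return mismatches
```

Verses that appear in only one source are ignored by this sketch; in practice they would be worth reporting separately, since missing text is one of the problems the comparison is meant to surface.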
When we did these comparisons, we found a variety of reasons for discrepancies, including:
-
Unexpected variant readings.
-
Differences in analysis that affect word boundaries or morph boundaries.
-
Spelling corrections to the original source.
-
Errors.
The most important textual difference we found was this: the syntax trees followed the Qere reading, while the morphological analysis followed the Ketiv reading but provided the Qere reading in notes. In the Masoretic text, there are two readings for many verses, and Jews consider both to be important. Qere means "what is read," and Jewish law says that the Qere should be read when the text is read out loud. But Jewish law also says that a Torah scroll must follow the Ketiv, which means "what is written." Most Bible translations follow the Qere. Here is a verse that contains a Qere reading in addition to the Ketiv, as represented in the Open Scriptures Hebrew Bible:
<verse osisID="Gen.8.17"> <w lemma="3605" morph="HNcmsc" id="01PUa">כָּל</w><seg type="x-maqqef">־</seg><w lemma="d/2416 c" morph="HTd/Ncfsa" id="01r3q">הַ/חַw>ָה<// <w lemma="834 a" morph="HTr" id="01i7e">אֲשֶׁר</w><seg type="x-maqqef">־</seg><w lemma="854" n="1.0.2.0" morph="HR/Sp2ms" id="01Mem"/ךָ֜</w> <w lemma="m/3605" morph="HR/Ncmsc" id="01wNK">מִ/כָּל</w><seg type="x-maqqef">־</seg><w lemma="1320" n="1.0.2" morph="HNcmsa" id="01">בָּשָׂר</w> <w lemma="b/5775" morph="HRd/Ncmsa" id="015e6">בָּ/ע֧וֹף</w> <w lemma="c/b/929" n="1.0.1" morph="HC/Rd/Ncfsa" id="01yr6">וּ/בַ/בְּהֵמָ֛ה</w> <w lemma="c/b/3605" morph="HC/R/Ncmsc" id="01ckn">וּ/בְ/כָל</w><seg type="x-maqqef">־</seg><w lemma="d/7431" n="1.0.0" morph="HTd/Nc" id="01vA4">הָ/רֶ֛מֶשׂ</w> <w lemma="d/7430" morph="HTd/Vqrmsa" id="01KEn">הָ/רֹמֵ֥שׂ</w> <w lemma="5921 a" morph="HR" id="01cPC">עַל</w><seg type="x-maqqef">־</seg><w lemma="d/776" n="1.0" morph="HTd/Ncbsa" id="01Eoc">הָ/ץ</w> <w type="x-ketiv" lemma="3318" morph="HVhv2ms" id="01Pdv">הוצא</w> <note type="variant"><catchWord>הוצא</catchWord><rdg type="x-qere"><w lemma="3318" morph="HVhv2ms" id="01S7t">הַיְצֵ֣א</w></rdg></note> <w lemma="854" n="1" morph="HR/Sp2fs" id="018F2">אִתָּ/ךְ</w> <w lemma="c/8317" morph="HC/Vqq3cp" id="01T2K">וְ/שֽׁרְצ֣וּ</w> <w lemma="b/776" n="0.1" morph="HRd/Ncbsa" id="01ouG">בָ/אָ֔רֶץ</w> <w lemma="c/6509" morph="HC/Vqq3cp" id="01xxG">וּ/פָר֥וּ</w> <w lemma="c/7235 a" n="0.0" morph="HC/Vqq3cp" id="01vin">וְ/רָב֖וּ</w> <w lemma="5921 a" morph="HR" id="01KSD">עַל</w><seg type="x-maqqef">־</seg><w lemma="d/776" n="0" morph="HTd/Ncbsa" id="01Eiv">הָ/אָ/w><seg type="x-sof-pasuq">׃</seg> </verse>

When we first started working with this source, Ketiv and Qere were not marked up in a way that was always easy to distinguish and there were some errors, but we have been able to work with the Open Scriptures group to make it easy to choose one reading or the other using their data.
We simply delete the Ketiv reading, then raise the Qere reading from the note into the main text:
let $oshb := db:open("oshb-morphology")
for $qere in $oshb//*:note[@type='variant']
return replace node $qere with $qere/*:rdg/*
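For readers who prefer Python, the same two steps (dropping the Ketiv word and raising the Qere reading out of its note) might be sketched with the standard library as follows. This is an illustration under stated assumptions, not the code in our pipeline: it matches elements by local name so that it works with or without the OSIS namespace, and it relies on the type="x-ketiv", type="variant", and type="x-qere" markers shown in the sample above. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without any namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def prefer_qere(verse):
    """Drop Ketiv words and replace variant notes with their Qere words, in place."""
    # Snapshot the elements first so the tree can be modified safely.
    for parent in list(verse.iter()):
        new_children = []
        for child in list(parent):
            name = local_name(child)
            if name == "w" and child.get("type") == "x-ketiv":
                continue  # delete the Ketiv reading
            if name == "note" and child.get("type") == "variant":
                qere_words = [
                    w
                    for rdg in child
                    if local_name(rdg) == "rdg" and rdg.get("type") == "x-qere"
                    for w in rdg
                ]
                if qere_words:
                    # Preserve whatever followed the note in the source.
                    qere_words[-1].tail = child.tail
                new_children.extend(qere_words)
                continue
            new_children.append(child)
        parent[:] = new_children
```

ElementTree has no parent pointers, which is why the sketch rebuilds each parent's child list instead of removing nodes directly.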
Are we looking at the same units?
Even if texts are identical, comparison depends on looking at the same units. For instance, our numbering system depends on concepts like "the third word in the verse" or "the second morpheme in the word," but this is problematic when the data sources we use have different versification schemes or different criteria for "word" or "morpheme." These differences can occur even for English, but they are much more acute for Hebrew. Consider the text we have used in many examples above:
וַֽיְהִי־עֶ֥רֶב
And there was evening
— Genesis 1:5
-
וַֽיְהִי־עֶ֥רֶב And it was evening
-
וַֽיְהִי And it was
-
עֶ֥רֶב Evening
-
וַֽ And
-
יְהִי it was
-
עֶ֥רֶב Evening
Because the concept of "word" depends on the analysis, and analyses vary, we wanted a definition that relies only on simple string operations. In our numbering system, an orthographic word is a sequence of letters, and any non-alphabetic character is treated as a delimiter when tokenizing to find orthographic words. For instance, וַֽיְהִי־עֶ֥רֶב, which means "and there was evening", is considered two orthographic words: וַֽיְהִי and עֶ֥רֶב. The first orthographic word, וַֽיְהִי, is word number 8 in the verse, and it contains two morphs: 010010050081 corresponds to וַֽ ("and"), and 010010050082 corresponds to יְהִי ("there was"). The second orthographic word, עֶ֥רֶב ("evening"), is word number 9 in the verse, and it contains only a single morph, identified by 010010050091.
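A tokenizer along these lines can be sketched in Python with Unicode general categories. This is a minimal illustration, not our production code. It makes one assumption that the definition above leaves implicit: combining marks (vowel points and accents) that follow a letter are treated as part of the word, while everything else, including the maqqef, is a delimiter. The function name orthographic_words is hypothetical.

```python
import unicodedata


def orthographic_words(text):
    """Split text into orthographic words.

    Letters (general category L*) start or continue a word. Combining
    marks (M*) continue a word that has already started. Any other
    character acts as a delimiter.
    """
    words, current = [], []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L") or (category.startswith("M") and current):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    for number, word in enumerate(orthographic_words("וַֽיְהִי־עֶ֥רֶב"), start=1):
        print(number, word)
```

Applied to the phrase discussed above, this yields two words, because the maqqef between them is punctuation and not a letter.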
But the number of morphemes in a single word also depends on the analysis. In the same verse, the third word, לָאוֹר, has a prefix and an implicit article. Some analyses treat this as two morphs; others treat it as three, creating a morph to represent the implicit article. Making matters worse, two Hebrew experts may not agree on whether an implicit article is present. If we change our mind, we do not want to renumber the rest of the morphs in a word. Therefore, we decided to number morphs by position without counting implicit articles. The identifier for an implicit article is formed by adding ה to the identifier of the morph where the implicit article occurs. If an implicit article occurs on a morph with the identifier 010010050031, the identifier for the implicit article is 010010050031ה, as you can see in the following example:
<m n="010010050031" morph="Rd" lang="H" lemma="l" pos="preposition">לָ</m>
<m n="010010050031ה" morph="Td" lemma="d" lang="H" pos="particle" type="definite article"/>
<m n="010010050032" morph="Ncbsa" lang="H" lemma="216" after=" " pos="noun" type="common" gender="both" number="singular" state="absolute">אוֹר֙</m>

These identifiers use a BBCCCVVVWWWP format, where BB is a two-digit number that identifies the book, CCC is a three-digit number that identifies the chapter, VVV is a three-digit number that identifies the verse, WWW is a three-digit number that identifies the word within the verse, and P is a single digit that identifies a given morph within the word.
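Because the format is fixed-width, an identifier can be decoded with simple slicing. The following Python sketch shows one way to do that, including the trailing ה that marks an implicit article. The MorphId class and the parse_morph_id function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

IMPLICIT_ARTICLE_MARK = "\u05d4"  # the Hebrew letter he, appended for implicit articles


@dataclass(frozen=True)
class MorphId:
    book: int
    chapter: int
    verse: int
    word: int
    part: int
    implicit_article: bool = False


def parse_morph_id(identifier):
    """Decode a BBCCCVVVWWWP identifier, with an optional implicit-article mark."""
    implicit = identifier.endswith(IMPLICIT_ARTICLE_MARK)
    digits = identifier[:-1] if implicit else identifier
    if len(digits) != 12 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a BBCCCVVVWWWP identifier: {identifier!r}")
    return MorphId(
        book=int(digits[0:2]),
        chapter=int(digits[2:5]),
        verse=int(digits[5:8]),
        word=int(digits[8:11]),
        part=int(digits[11]),
        implicit_article=implicit,
    )


if __name__ == "__main__":
    print(parse_morph_id("010010050082"))
```

For example, 010010050082 decodes to book 1, chapter 1, verse 5, word 8, morph 2, which matches the second morph of the eighth word of Genesis 1:5 discussed above.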
Compound nouns form another challenge. Consider this verse:
וְצִלָּ֣ה גַם־הִ֗וא יָֽלְדָה֙ אֶת־תּ֣וּבַל קַ֔יִן
Zillah also bore Tubal-cain
— Genesis 4:22
<m n="010040220051" lang="H" after="־" lemma="853" morph="To" id="01deG" pos="particle" type="direct object marker">אֶת</m> <seg type="x-maqqef">־</seg> <c> <m n="010040220061" lang="H" after=" " lemma="8423+" morph="Np" id="01Nvj" pos="noun" type="proper">תּ֣וּבַל</m> <m n="010040220071" lang="H" after=" " lemma="8423" morph="Np" id="01Gye" pos="noun" type="proper">קַ֔יִן</m> </c>
This example and some of the others illustrate the truism "Splitting is easy, lumping is hard". Lumping is hard because it requires a theory to explain what should be joined together. In this case, we need a list of compound nouns so that we know which ones should be combined. For data integration, splitting to a high degree of granularity makes it easier to map to other data sources that do the same, but we also need ways to lump again so that we can map to other sets of resources. We can use those resources to see which things should be lumped.
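Lumping from a list can be sketched as a greedy match over a sequence of tokens. The Python below is only an illustration of that idea: it works on plain strings, where our pipeline works on morph elements, and the names lump_compounds and compounds are hypothetical. The list of compounds is assumed to come from an external resource.

```python
def lump_compounds(tokens, compounds):
    """Group tokens into lists, joining runs that match a known compound.

    `compounds` is a set of tuples of tokens. Longer compounds are tried
    first, so the longest match at each position wins.
    """
    longest = max((len(c) for c in compounds), default=1)
    grouped, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in compounds:
                grouped.append(list(candidate))
                i += size
                break
        else:
            # No compound starts here: keep the token on its own.
            grouped.append([tokens[i]])
            i += 1
    return grouped


if __name__ == "__main__":
    print(lump_compounds(["et", "tubal", "qayin"], {("tubal", "qayin")}))
```

Keeping single tokens as one-element lists gives every item the same shape, so later stages do not need to distinguish compounds from ordinary words.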
Working with Hebrew has forced us to think differently about the relationship between words and morphemes and the relationship between morphology and syntax. Aligning biblical texts with translation languages has also forced us to do so. In general, a single "word" in some languages can translate to a phrase or a clause in English, and the things that are modeled by an English syntax tree may be required to represent the internal structure of a "word" in these languages. And the same overlapping hierarchy issues that are familiar to those who work with verses and paragraphs also occur at the word level. Consider the following text:
וַיְהִ֗י בִּימֵי֙ שְׁפֹ֣ט הַשֹּׁפְטִ֔ים וַיְהִ֥י רָעָ֖ב בָּאָ֑רֶץ וַיֵּ֨לֶךְ אִ֜ישׁ מִבֵּ֧ית לֶ֣חֶם יְהוּדָ֗ה לָגוּר֙ בִּשְׂדֵ֣י מוֹאָ֔ב ה֥וּא וְאִשְׁתּ֖וֹ וּשְׁנֵ֥י בָנָֽיו׃
In the days when the judges ruled there was a famine in the land, and a man from Bethlehem in Judah went to sojourn in the country of Moab, he and his wife and his two sons.
— Ruth 1:1
מִבֵּ֧ית לֶ֣חֶם
From Bethlehem
The general distinction between morphology and syntax is widely taken for granted, but it crucially depends on the notion of a cross-linguistically valid concept of "(morphosyntactic) word". I show that there are no good criteria for defining such a concept. I examine ten criteria in some detail (potential pauses, free occurrence, mobility, uninterruptibility, non-selectivity, non-coordinatability, anaphoric islandhood, nonextractability, morphophonological idiosyncrasies, and deviations from biuniqueness), and I show that none of them is necessary and sufficient on its own, and no combination of them gives a definition of "word" that accords with linguists' orthographic practice. "Word" can be defined as a language-specific concept, but this is not relevant to the general question pursued here. "Word" can be defined as a fuzzy concept, but this is theoretically meaningful only if the continuum between affixes and words, or words and phrases, shows some clustering, for which there is no systematic evidence at present. Thus, I conclude that we do not currently have a good basis for dividing the domain of morphosyntax into "morphology" and "syntax", and that linguists should be very careful with general claims that make crucial reference to a cross-linguistic "word" notion.
— The indeterminacy of word segmentation and the nature of morphology and syntax - Martin Haspelmath
Lessons Learned
We hope this paper has given a flavor of the work we do when integrating data sources that reflect a wide variety of designs. Now we would like to conclude by listing some of the lessons we have learned along the way.
We have learned that data integration is usually possible. If there is a dataset that provides important insights and you have the time to really understand the dataset, it can probably be integrated. And as you gain experience integrating new datasets, create useful reference systems and mappings, and design the tools you need, it becomes easier. Sometimes it can take significant time. Sometimes there are parts of the data that cannot be integrated. But in general, we are now able to integrate new datasets without inordinate effort. And the result is gratifying, allowing much richer queries and providing ways to create new resources or to view the text in new ways.
But we have also learned that language is hard, and that many things that seem simple turn out to be complex in unexpected ways. Simple concepts like "book", "chapter and verse", "word", "morpheme", and many others have all turned out to be much more complex in practice than many people would expect. Data integration involves a great deal of exploratory data analysis. XML simplifies this because we can easily put a variety of XML sources into an XML database and query them to see what individual sources contain, how that compares to data found in other sources, and whether a particular change makes them easier to use together. Or we can use Python and lxml to explore the data in similar ways.
We have also learned the value of a good hub architecture. In our world, we use the MACULA Greek and MACULA Hebrew trees as a hub for our data integration and data mapping. The reference systems used in the hub representation have become the basis for sophisticated mappings that significantly simplify integrating new sources or joining across sources. "There's nothing more practical than a good theory", and these trees have become a theory of the text that we can use to more easily understand other analyses of the same text.
We have also learned that building a community of data requires more than just putting your data on GitHub with an open license. Integrating with other datasets is already a significant contribution to the community, allowing others to leverage many insights at the same time without doing the hard work of data integration themselves. We provide our combined trees and mappings on GitHub under a free license. Beyond that, we are creating software for visualizing, editing, and curating these datasets and providing our source code on GitHub under a free license.
References
[ebibleEncoding] “Bible File Encoding for Bible Translators, Publishers, and Software Developers.” Accessed March 29, 2021. https://ebible.org/usfx/Bible-encoding.htm.
[usfm-grammar] GitHub. “Bridgeconn/Usfm-Grammar.” Accessed March 29, 2021. https://github.com/Bridgeconn/usfm-grammar.
[USFMtoOSIS] “Converting SFM Bibles to OSIS - CrossWire Bible Society.” Accessed April 2, 2021. https://wiki.crosswire.org/Converting_SFM_Bibles_to_OSIS.
[DBL] “Digital Bible Library.” Accessed March 25, 2021. https://app.thedigitalbiblelibrary.org/.
[DeRose] DeRose, Steven. “Markup Overlap: A Review and a Horse,” n.d., 17.
[EpiDoc] “EpiDoc: Epigraphic Documents in TEI XML / Home / Home.” Accessed March 27, 2021. https://sourceforge.net/p/epidoc/wiki/Home/.
[FieldLinguistsToolkbox] “Field Linguist’s Toolbox.” Accessed April 2, 2021. https://software.sil.org/toolbox/.
[Paratext] “Paratext.” paratext.org
[Fieldworks] “FieldWorks.” Accessed April 2, 2021. https://software.sil.org/fieldworks/.
[Glanz] Glanz, Oliver. “Bible Software on the Workbench of the Biblical Scholar: Assessment and Perspective.” Andrews University Seminary Studies (AUSS) 56, no. 1 (July 19, 2018): 5–45.
[Graham/Howe] Graham, Tony, and Mark Howe. “EPUB: Chapter and Verse (presentation slides).” Accessed April 1, 2021. https://archive.xmlprague.cz/2011/presentations/graham-howe-epub.pdf.
[Grassick/Wiens] Grassick, Clayton, and Hart Wiens. “Paratext: User-Driven Development.” The Bible Translator, April 1, 2011. https://doi.org/10.1177/026009351106200205.
[Haiola] “Haiola Scripture Publishing Software.” Accessed April 2, 2021. http://haiola.org/.
[Little] Little, Chris. Chrislit/Usfm2osis. Python, 2021. https://github.com/chrislit/usfm2osis.
[OSIS2.1.1] “OSIS 2.1.1 User Manual 06March2006.pdf.” Accessed April 1, 2021. https://crosswire.org/osis/OSIS%202.1.1%20User%20Manual%2006March2006.pdf.
[PTXprint] “PTXprint – Bible Layout For Everyone - SIL Language Technology.” Accessed March 27, 2021. https://software.sil.org/ptxprint/.
[PublishingAssistant] “Publishing Assistant.” Accessed March 27, 2021. https://pubassist.paratext.org/.
[Rapidwords] “Rapidwords.Net |.” Accessed April 2, 2021. https://rapidwords.net/.
[Regt/Kees 2011] Regt, Lénart J. de, and Kees de Blois. Of Translations, Revisions, Scripts and Software: Contributions Presented to Kees de Blois. Reading: United Bible Societies, 2011.
[u2o] Ryan. Adyeths/u2o. Python, 2021. https://github.com/adyeths/u2o.
[SBLStandards] “Biblical Scholars, Standards and the SBL,” SBL Publications. Accessed March 27, 2021. https://www.sbl-site.org/publications/article.aspx?ArticleId=45.
[Bosak97] Bosak, Jon. “SGML, Java, and the Future of the Web (1996.11.17).” Accessed April 2, 2021. https://www.ibiblio.org/pub/sun-info/standards/xml/why/xmlapps.961117.htm.
[ptx2pdf] GitHub. “Sillsdev/Ptx2pdf.” Accessed April 2, 2021. https://github.com/sillsdev/ptx2pdf.
[Proskomma] “The Challenges — Proskomma 0.1 Documentation.” Accessed April 1, 2021. https://doc.proskomma.bible/en/latest/big_idea/challenges.html#why-is-usfm-so-popular.
[OSIS] “The CrossWire Bible Society - OSIS - A Common Format for Multiple Visions.” Accessed April 1, 2021. https://crosswire.org/osis/.
[USFM 3.0] “USFM Documentation — Unified Standard Format Markers 3.0.0 Documentation.” Accessed March 25, 2021. https://ubsicap.github.io/usfm/.
“Usfm/Usfm.Sty at Master · Ubsicap/Usfm · GitHub.” Accessed March 29, 2021. https://github.com/ubsicap/usfm/blob/master/sty/
[USX 3.0] “USX Documentation — Unified Scripture XML 3.0.0 Documentation.” Accessed March 25, 2021. https://ubsicap.github.io/usx/.
Vries, Lourens de. “Paratext and Skopos of Bible Translations.” Paratext and Megatext as Channels of Jewish and Christian Traditions, December 20, 2003, 176–93. https://doi.org/10.1163/9789004421431_009.
[Copenhagen Workshop 2019] Winther-Nielsen, Nicolai. “Papers for the Copenhagen Workshop on Open Biblical Resources.” HIPHIL Novum 5, no. 2 (November 20, 2019): 1–5.
[XQuery 3.0] “XQuery 3.0: An XML Query Language.” Accessed April 2, 2021. https://www.w3.org/TR/xquery-30/.
[Semantic Dictionary of Biblical Hebrew] Semantic Dictionary of Biblical Hebrew, edited by Reinier de Blois, with the assistance of Enio R. Mueller, ©2000-2021 United Bible Societies. Available online at https://semanticdictionary.org/.
[Semantic Dictionary of Biblical Greek] Semantic Dictionary of Biblical Greek, Semantic Dictionary of the Greek New Testament, based on Louw & Nida's Greek-English Lexicon of the New Testament, ©1988-2021 United Bible Societies. Available online at https://semanticdictionary.org/.
[SIL Semantic Domains] SIL Semantic Domains. Available online at https://semdom.org/.
[Westminster Hebrew Syntax Without Morphology] The Westminster Hebrew Syntax without Morphology (version 4.20 as of 2018-04-11). Copyright (C) 1991-2018 by The J. Alan Groves Center for Advanced Biblical Research. Available online at https://github.com/Clear-Bible/macula-hebrew/tree/main/sources/GrovesCenter
[Open Scriptures Hebrew Bible] Open Scriptures Hebrew Bible. Available online at https://hb.openscriptures.org/. Data available at https://github.com/openscriptures/morphhb.
[Handling RTL in XHTML and HTML] Internationalization Best Practices: Handling Right-to-left Scripts in XHTML and HTML Content. Available online at https://www.w3.org/International/geo/html-tech/tech-bidi.html.
[MACULA Greek] MACULA Greek - Syntax trees, morphology, and linguistic annotations for the Greek New Testament. Clear Bible, Inc. Available at https://github.com/Clear-Bible/macula-greek.
[MACULA Hebrew] MACULA Hebrew - Syntax trees, morphology, and linguistic annotations for the Hebrew Bible. Clear Bible, Inc. Available at https://github.com/Clear-Bible/macula-hebrew.
[1] For the whole sentence, see Genesis 1 in the MACULA Hebrew repository: https://github.com/Clear-Bible/macula-hebrew/blob/main/lowfat/01-Gen-001-lowfat.xml.
[2] In this article, we use the term "morph" and avoid the term "word", which is hard to use precisely for reasons that will be explained later in this article.
[3] When the "after" attribute contains a character from the Hebrew code page, it can display in ways that are baffling and confusing if it is the last attribute in the attribute list. See [Handling RTL in XHTML and HTML] for details. In the example, we placed the end of the start tag on its own line to avoid these issues.
[4] At the time of writing, the "lemma" attribute is still just a Strong's number for Hebrew, but we hope to replace this with a proper lemma in the near future.