How to cite this paper
Robie, Jonathan. “Biblical Scholarship in the GitHub Jungle.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Robie01.
Balisage: The Markup Conference 2022
August 1 - 5, 2022
Balisage Paper: Biblical Scholarship in the GitHub Jungle
Jonathan Robie
In the XML world, Jonathan is best known as one
of the inventors of XQuery and an editor of W3C XQuery
specifications from the first Working Drafts
through XQuery 3.1. In the Bible translation
community, he is best known for his work on
Bible translation software and biblical datasets for
Greek and Hebrew. Jonathan is the Principal Engineer at Clear Bible, Inc.,
where he manages the MACULA team.
He is also co-chair of the Copenhagen Alliance for
Open Biblical Resources and chair for Distributed Text
Services, an API for TEI document
repositories. Previously, he was the Program Manager
for the Paratext ecosystem, used by over 9,000 Bible
translators in over 300 translation organizations
worldwide.
In a long and varied career, Jonathan has also served as Chair of the API
Governance Board at EMC’s Enterprise Content Division, a member of the AMQP
enterprise messaging team at Red Hat, and the architect of XML database systems
at Software AG, Progress Software, Texcel Incorporated, and POET
Software.
Abstract
Clear Bible's MACULA project is a major data integration challenge. Clear creates
freely licensed linguistic datasets for the entire Bible in the original Hebrew,
Aramaic, and Greek languages. We integrate these datasets with high quality datasets
created by others and align them with translations in many languages. Language is
complex and individuals have different ways of organizing and understanding texts,
so integrating these diverse datasets is not straightforward.
Few texts have been analyzed as thoroughly as the Bible, from many different
perspectives. A great deal of Biblical analysis is available on GitHub under open
licenses, including well-established reference systems for the verses in a Bible and
the words used in the original languages. But data integration is still problematic
since there are different traditions, with different sets of books and different
ways of dividing up individual books. For instance, Psalm 23 in a Protestant Bible
is called Psalm 22 in a traditional Catholic Bible. Even a concept as simple as
"what is a word" becomes complicated, since there is no clear universal distinction
between a word and a morpheme — linguists employ a range of different criteria,
which are not uniformly applicable across contexts and languages.
In this article, we would like to illustrate some of the challenges we have
encountered in the first year of our work on MACULA. We will also discuss the
approach we have taken in response to each of these challenges.
Table of Contents
- Many Views of the Same Text
- Core MACULA Tree Structure
- Two Independent Analyses
- The Data Integration Challenge
- Prepare with pipelines, then merge
- Let the Text Drive
- Are we looking at the same text?
- Are we looking at the same units?
- Lessons Learned
Many Views of the Same Text
Linguistic Datasets for Greek and Hebrew
This paper is about the challenging data integration
problems that Clear Bible had to solve when working with many
analyses of the same biblical texts, each with their own use cases,
reference systems, models, and perspectives, sometimes
based on different variants of the text. In this paper, we
will focus on one particular data integration challenge:
combining an analysis of the words in the Hebrew Old Testament
(a morphological analysis) with an analysis of the sentences in
the Hebrew Old Testament (a syntactic analysis, also known as a syntax tree).
The syntax tree describes the relationships among words; for instance, it identifies
the subject, object, and adjuncts of each verb. The morphological analysis explains
the form of each
word.
For instance, in English, the pronoun "I" appears in that form when it functions
as the subject of a verb, "me" when it functions as the object of a verb, and "my"
or "mine" when used as a possessive. The morphology explains the word forms; the
syntax tree relates these words to the overall structure of the sentence. If
the word form is "me", the morphological analysis should say that the form is
an object pronoun, and the syntax tree should say that it is the object of a
particular verb.
Clear's own datasets include:
- Morphology: Is it a verb, a noun, an adjective, an adverb, or something else? How is that word used?
- Word senses: Which meanings does a Hebrew or Greek word have?
- Synonyms: Which Hebrew and Greek words are related in meaning?
- Syntax Trees: What are the relationships between words, phrases, and clauses?
- Semantic Roles: Who does what to whom? (e.g., doers and receivers of actions)
- Participant Referents: Who is “he,” “she,” or “it” in this sentence?
- Similar Texts: Which phrases and clauses have “close relatives” elsewhere?
In our own datasets, we have been able to follow consistent conventions. But other
datasets we use have their own conventions. Here are some of the third-party datasets
we are using. Most are already integrated; we hope to have fully integrated the
rest in coming months.
At Clear, we use these datasets in our own tools,
including a dashboard for translation consultants to identify
and address potential issues in a translation, an NLP engine
for aligning translations to Greek or Hebrew, an environment
for reading the biblical text in Hebrew, Aramaic, and Greek,
and a syntax tree editor.
We also align translations to the original Hebrew and Greek
words so that images, maps, articles, and other resources can
be associated with the original language text and used with
translations.
These alignments need to work with a wide variety of
translations in thousands of languages, translations that may
follow different canons and versification.
To make all of this possible, we have had to discover
ways to integrate across datasets that were not designed to be
used together. These datasets represent different ways of
understanding the text, at various levels of analysis, using
various linguistic and hermeneutical approaches. They were
originally designed for a wide variety of purposes. Taken
together, they provide a fuller understanding of the text. But
in order to take them together, we need to find ways to make
them interoperate in ways the original data creators did not
foresee. Some of these ways involve tight integration; others
involve loose coupling.
This paper discusses some of the major challenges we faced and
the solutions we discovered. Although we are working with many
datasets, each with its own challenges, this paper focuses on one
integration: combining the Westminster Hebrew Syntax Without Morphology
with the Open Scriptures Hebrew Bible.
Core MACULA Tree Structure
Before we describe the data integration challenges, we need to explain what
we are building. In MACULA, a number of data sources are joined together in one enhanced
syntax tree. This makes queries simpler and faster. Other datasets share reference systems
so that they can interoperate with the main tree while still remaining loosely coupled.
MACULA also has mappings from our internal dataset to reference
systems used in other sources.
In this paper, we will focus on the core MACULA tree structure for Hebrew. Because
these trees contain a fairly large amount of information, this section will illustrate
the structure of the trees with a very small subset of
a sentence, the Hebrew equivalent to "and there was evening" from Genesis 1:5:
Table I
וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ פ |
And there was evening, and there was morning, the first day. |
We will focus on וַֽיְהִי־עֶ֥רֶב, which we parse into three morphs.
We break וַֽיְהִי
down into two morphs that mean "and" and "there was". עֶ֥רֶב is straightforward:
Table II
וַֽ = "and" |
יְהִי = "there was" |
עֶ֥רֶב = "evening" |
In MACULA's Hebrew trees, the second and third tokens form a clause; the first
token is outside the clause. Here is how this is represented in our markup:
<wg class="cjp" head="true" rule="cj2cjp">
<w ref="GEN 1:5!8"
xml:id="o010010050081"
mandarin="于是"
english="and"
greek="καὶ"
strongnumberx="2050b"
class="cj"
unicode="וַֽ"
morph="C"
lang="H"
lemma="c"
pos="conjunction">וַֽ</w>
</wg>
<wg class="cl" head="true" rule="v-s">
<wg role="v" class="vp" head="true" rule="v2vp">
<w ref="GEN 1:5!8"
xml:id="o010010050082"
mandarin="有"
english="there was"
greek="ἐγένετο"
strongnumberx="1961"
class="verb"
morph="Vqw3ms"
lang="H"
lemma="1961"
pos="verb"
gender="masculine"
number="singular"
stem="qal"
person="third"
after="־" <!-- Extra whitespace added to avoid BIDI problems - see footnote. -->
>יְהִי</w>
</wg>
<wg role="s" class="np" head="true" rule="n2np">
<w ref="GEN 1:5!9"
xml:id="o010010050091"
mandarin="晚上"
english="evening"
domain="002002002010"
sdbh="005645001001000"
greek="ἑσπέρα"
strongnumberx="6153"
class="noun"
morph="Ncmsa"
lang="H"
lemma="6153"
pos="noun"
gender="masculine"
number="singular"
state="absolute"
after=" ">עֶ֥רֶב</w>
</wg>
</wg>
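Because syntax and morphology sit in the same tree, one path expression can test both. The following is a minimal sketch, not taken from the MACULA code base. It runs against a trimmed copy of the fragment above (Hebrew text and most attributes omitted) and returns the English gloss of the subject of every clause whose verb has lemma 1961. The `sentence` wrapper element and the variable name `$tree` are assumptions made for illustration.

```xquery
xquery version "3.1";

(: Trimmed copy of the fragment above; only the attributes used below are kept.
   The <sentence> wrapper is hypothetical. :)
let $tree :=
  <sentence>
    <wg class="cjp" rule="cj2cjp">
      <w xml:id="o010010050081" english="and" lemma="c" pos="conjunction"/>
    </wg>
    <wg class="cl" rule="v-s">
      <wg role="v" class="vp" rule="v2vp">
        <w xml:id="o010010050082" english="there was" lemma="1961" pos="verb"/>
      </wg>
      <wg role="s" class="np" rule="n2np">
        <w xml:id="o010010050091" english="evening" lemma="6153" pos="noun"/>
      </wg>
    </wg>
  </sentence>

(: The syntactic role and the morphological lemma are tested in one expression. :)
for $clause in $tree//wg[@class = 'cl'][wg[@role = 'v']//w/@lemma = '1961']
return $clause/wg[@role = 's']//w/@english/string()
```

For this fragment the query returns the single string `evening`.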
Two Independent Analyses
In this tree, the overall syntax tree structure is based on the analysis found in
Westminster Hebrew Syntax Without Morphology.
The morphological analysis that describes individual words comes from
Open Scriptures Hebrew Bible, including the lemma and morph attributes and a set of
attributes that interpret the morph code so that it is more easily read or queried
(lang, pos, gender, state, etc.).
Here is the analysis of the phrase "and it was evening" as represented in the Groves
trees (the Westminster syntax trees):
<Node Cat="cjp" Start="13" End="13" Rule="Cj2Cjp" Head="0" Language="H" nodeId="010010050300011" Length="1">
<Node Cat="cj" Start="13" End="13" Length="1" morphId="010010050081" Language="H" Unicode="וַֽ" nodeId="010010050300010">WA75</Node>
</Node>
<Node Cat="CL" Start="14" End="15" Rule="V-S" Head="0" Language="H" nodeId="010010050310060" Length="6">
<Node Cat="V" Start="14" End="14" Rule="Vp2V" Head="0" Language="H" nodeId="010010050310032" Length="3">
<Node Cat="vp" Start="14" End="14" Rule="V2VP" Head="0" Language="H" nodeId="010010050310031" Length="3">
<Node Cat="verb" Start="14" End="14" Length="3" morphId="010010050082" Language="H" Unicode="יְהִי־" nodeId="010010050310030">Y:HIY-</Node>
</Node>
</Node>
<Node Cat="S" Start="15" End="15" Rule="Np2S" Head="0" Language="H" nodeId="010010050340032" Length="3">
<Node Cat="np" Start="15" End="15" Rule="N2NP" Head="0" Language="H" nodeId="010010050340031" Length="3">
<Node Cat="noun" Start="15" End="15" Length="3" morphId="010010050091" Language="H" Unicode="עֶ֥רֶב" nodeId="010010050340030">(E71REB</Node>
</Node>
</Node>
</Node>
Here is Genesis 1:5 as represented in the Open Scriptures Hebrew Bible:
<verse osisID="Gen.1.5">
<w lemma="c/7121" morph="HC/Vqw3ms" id="01nAB">וַ/יִּקְרָ֨א</w>
<w lemma="430" morph="HNcmpa" id="01kfX">אֱלֹהִ֤ים</w>
<seg type="x-paseq">׀</seg>
<w lemma="l/216" n="1.1.0" morph="HRd/Ncbsa" id="01Wkf">לָ/אוֹר֙</w>
<w lemma="3117" n="1.1" morph="HNcmsa" id="01wrL">י֔וֹם</w>
<w lemma="c/l/2822" n="1.0" morph="HC/Rd/Ncmsa" id="013TL">וְ/לַ/חֹ֖שֶׁךְ</w>
<w lemma="7121" morph="HVqp3ms" id="01LeN">קָ֣רָא</w>
<w lemma="3915" n="1" morph="HNcmsa" id="01sMn">לָ֑יְלָה</w>
<w lemma="c/1961" morph="HC/Vqw3ms" id="01Y3z">וַֽ/יְהִי</w><seg type="x-maqqef">־</seg><w lemma="6153" morph="HNcmsa" id="01NQN">עֶ֥רֶב</w>
<w lemma="c/1961" morph="HC/Vqw3ms" id="01uLf">וַֽ/יְהִי</w><seg type="x-maqqef">־</seg><w lemma="1242" n="0.0" morph="HNcmsa" id="01kA7">בֹ֖קֶר</w>
<w lemma="3117" morph="HNcmsa" id="013TS">י֥וֹם</w>
<w lemma="259" n="0" morph="HAcmsa" id="01NFp">אֶחָֽד</w><seg type="x-sof-pasuq">׃</seg>
<seg type="x-pe">פ</seg>
</verse>
In the above, the analysis of "and it was evening" is shown on a single line:
<w lemma="c/1961" morph="HC/Vqw3ms" id="01Y3z">וַֽ/יְהִי</w><seg type="x-maqqef">־</seg><w lemma="6153" morph="HNcmsa" id="01NQN">עֶ֥רֶב</w>
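In a line like the one above, the lemma, the morph code, and the text each use "/" to separate the parts of a word, and the morph code starts with a language letter. As a rough sketch of how such a word could be split into one element per part, consider the function below. The name `local:explode` is hypothetical, the sketch assumes the three slash-delimited lists always have the same length, and the sample `w` element is in no namespace, unlike the OSIS source.

```xquery
xquery version "3.1";

(: Hypothetical helper: split one OSHB <w> into one <m> per slash-delimited part.
   Assumes @lemma, @morph and the text all contain the same number of parts. :)
declare function local:explode($w as element(w)) as element(m)* {
  (: leading language letter, e.g. "H" :)
  let $lang := substring($w/@morph, 1, 1)
  (: "HC/Vqw3ms" becomes ("C", "Vqw3ms") :)
  let $morphs := tokenize(substring($w/@morph, 2), '/')
  let $lemmas := tokenize($w/@lemma, '/')
  for $text at $i in tokenize(string($w), '/')
  return
    <m lang="{ $lang }" lemma="{ $lemmas[$i] }" morph="{ $morphs[$i] }">{ $text }</m>
};

local:explode(<w lemma="c/1961" morph="HC/Vqw3ms" id="01Y3z">וַֽ/יְהִי</w>)
```

For the sample word this yields two `m` elements, one for the conjunction (lemma `c`, morph `C`) and one for the verb (lemma `1961`, morph `Vqw3ms`).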
The Data Integration Challenge
At this point, we have set up the data integration challenge: we need to combine the
morphology and the
syntax tree to create the tree structure that MACULA uses. Here are some characteristics
of these datasets
that add to the challenge, and the approach we took to overcoming each challenge.
A fuller discussion of
each point occurs later in the paper.
- The underlying Hebrew text for these analyses is not identical. We chose to use the text of the morphology as our text. The morphology actually had two different texts, so we chose the one that best fit the syntactic analysis.
- These analyses have different models that disagree about what a word is, and that affects the units of analysis. They also make different choices with respect to compound nouns. To allow mapping to both models, we split to the most granular level, morphs, and chose a simple way to identify the "word" that each morph is part of.
- Even when breaking down the same word into morphs, one analysis creates an additional morph for an implicit article and the other does not. They also sometimes disagree about whether an implicit article is present in the first place.
- As in any analysis of this complexity, there are errors in each source that affect integration. We had to find ways to deal with these errors that did not require us to wait for upstream changes, but still allowed us to push corrections upstream.
In this case, these texts use the same versification scheme, so a given verse refers
to the same
position in a text (even if the verse itself differs somewhat). This simplifies our
task significantly.
Prepare with pipelines, then merge
Our basic approach was to start with the syntax trees, transforming them in various
ways
to a format we prefer, creating reference systems that can map to other sources, then
create pipelines that adapt other resources so that they can easily be integrated
into
the tree. These pipelines need to be verifiable, and we need to be able to use them
for updates as the underlying resources mature in their upstream sources.
The pipeline that prepares the morphology is implemented in BaseX. The command file,
prepare-oshb-for-trees.bxs, copies the original OSHB database to a new one,
then runs 11 updating queries. Each query makes one simple change. Four of these
queries are four lines or fewer. The longest query contains 290 lines, almost entirely case
statements that correspond to the tables in the documentation for the morphological analysis.
No other query contains more than 50 lines. Here is the BaseX command file we use
for this:
# prepare-oshb-for-trees.bxs
SET CHOP false
SET EXPORTER indent=no,omit-xml-declaration=no
XQUERY db:copy('oshb-morphology-raw', 'oshb-morphology')
OPEN oshb-morphology
RUN ./xquery/hoist-qere-reading.xq
RUN ./xquery/remove-medial-segs.xq
RUN ./xquery/add-after-attributes.xq
RUN ./xquery/strip-leading-h-from-morph.xq
RUN ./xquery/explode-word-parts.xq
RUN ./xquery/add-bcvwp-numbering.xq
RUN ./xquery/add-implicit-article.xq
RUN ./xquery/delete-ketiv-reading.xq
RUN ./xquery/remove-w-elements.xq
RUN ./xquery/mark-proper-nouns.xq
RUN ./xquery/expand-oshb-attributes.xq
OPTIMIZE
EXPORT ./out/
For debugging purposes, we can have a separate command file that adds an export command
after each stage, storing the results of each in a separate directory. We can also run queries
that evaluate the result of a given stage, placing output in the corresponding directory.
The query names give some indication of what each step does. Some of these will be
discussed later, but a brief explanation of some of the most important steps will
give a sense for what they do.
- hoist-qere-reading.xq and delete-ketiv-reading.xq involve two different readings of the text, which are named Ketiv and Qere. In the syntax trees, we use the Qere reading.
- add-after-attributes.xq adds after attributes that contain punctuation and whitespace needed to correctly join morphs to form a sentence.
- explode-word-parts.xq converts each word into a container element with a series of morphs, each with its own morphological analysis.
- add-bcvwp-numbering.xq inserts identifier attributes into each morph, using the same identifiers that we use in our syntax trees.
- mark-proper-nouns.xq creates proper nouns like "Beth El" or "Tubal Cain" by merging the relevant morphs.
- expand-oshb-attributes.xq converts a morph code like Ncmpa into a series of attributes like pos="noun" type="common" gender="masculine" number="plural" state="absolute".
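To make the last step concrete, here is a small sketch of expanding a noun code into attributes. It is not the 290-line query mentioned above. The lookup tables cover only codes whose expansions appear in this paper, and the function name `local:expand-noun` and the variable names are hypothetical.

```xquery
xquery version "3.1";

(: Partial lookup tables, limited to codes whose expansions appear in this paper. :)
declare variable $noun-type := map { 'c': 'common', 'p': 'proper' };
declare variable $gender    := map { 'm': 'masculine', 'b': 'both' };
declare variable $number    := map { 's': 'singular', 'p': 'plural' };
declare variable $state     := map { 'a': 'absolute' };

(: Hypothetical helper: expand a noun code such as "Ncmpa" into attributes.
   Codes for other parts of speech yield the empty sequence, and unknown or
   missing letters are silently skipped. :)
declare function local:expand-noun($morph as xs:string) as attribute()* {
  let $c := string-to-codepoints($morph) ! codepoints-to-string(.)
  where $c[1] eq 'N'
  return (
    attribute pos { 'noun' },
    $c[2] ! $noun-type(.) ! attribute type   { . },
    $c[3] ! $gender(.)    ! attribute gender { . },
    $c[4] ! $number(.)    ! attribute number { . },
    $c[5] ! $state(.)     ! attribute state  { . }
  )
};

<m morph="Ncmpa">{ local:expand-noun('Ncmpa') }</m>
```

The table-driven shape keeps each position of the code independent, so a short code such as `Np` simply produces fewer attributes.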
When the pipeline has been run, we have a transformed morphological analysis that
can easily be merged into the syntax tree. It looks like this:
<verse osisID="Gen.1.5">
<m n="010010050011" morph="C" lang="H" lemma="c" pos="conjunction">וַ</m>
<m n="010010050012" morph="Vqw3ms" lang="H" lemma="7121" after=" " pos="verb" stem="qal" type="wayyiqtol" person="third" gender="masculine" number="singular">יִּקְרָ֨א</m>
<m n="010010050021" lang="H" after="׀" lemma="430" morph="Ncmpa" id="01kfX" pos="noun" type="common" gender="masculine" number="plural" state="absolute">אֱלֹהִ֤ים</m>
<seg type="x-paseq">׀</seg>
<m n="010010050031" morph="Rd" lang="H" lemma="l" pos="preposition">לָ</m>
<m n="010010050031ה" morph="Td" lemma="d" lang="H" pos="particle" type="definite article"/>
<m n="010010050032" morph="Ncbsa" lang="H" lemma="216" after=" " pos="noun" type="common" gender="both" number="singular" state="absolute">אוֹר֙</m>
<m n="010010050041" lang="H" after=" " lemma="3117" morph="Ncmsa" id="01wrL" pos="noun" type="common" gender="masculine" number="singular" state="absolute">י֔וֹם</m>
<m n="010010050051" morph="C" lang="H" lemma="c" pos="conjunction">וְ</m>
<m n="010010050052" morph="Rd" lang="H" lemma="l" pos="preposition">לַ</m>
<m n="010010050052ה" morph="Td" lemma="d" lang="H" pos="particle" type="definite article"/>
<m n="010010050053" morph="Ncmsa" lang="H" lemma="2822" after=" " pos="noun" type="common" gender="masculine" number="singular" state="absolute">חֹ֖שֶׁךְ</m>
<m n="010010050061" lang="H" after=" " lemma="7121" morph="Vqp3ms" id="01LeN" pos="verb" stem="qal" type="qatal" person="third" gender="masculine" number="singular">קָ֣רָא</m>
<m n="010010050071" lang="H" after=" " lemma="3915" morph="Ncmsa" id="01sMn" pos="noun" type="common" gender="masculine" number="singular" state="absolute">לָ֑יְלָה</m>
<m n="010010050081" morph="C" lang="H" lemma="c" pos="conjunction">וַֽ</m>
<m n="010010050082" morph="Vqw3ms" lang="H" lemma="1961" after="־" pos="verb" stem="qal" type="wayyiqtol" person="third" gender="masculine" number="singular">יְהִי</m>
<seg type="x-maqqef">־</seg>
<m n="010010050091" lang="H" after=" " lemma="6153" morph="Ncmsa" id="01NQN" pos="noun" type="common" gender="masculine" number="singular" state="absolute">עֶ֥רֶב</m>
<m n="010010050101" morph="C" lang="H" lemma="c" pos="conjunction">וַֽ</m>
<m n="010010050102" morph="Vqw3ms" lang="H" lemma="1961" after="־" pos="verb" stem="qal" type="wayyiqtol" person="third" gender="masculine" number="singular">יְהִי</m>
<seg type="x-maqqef">־</seg>
<m n="010010050111" lang="H" after=" " lemma="1242" morph="Ncmsa" id="01kA7" pos="noun" type="common" gender="masculine" number="singular" state="absolute">בֹ֖קֶר</m>
<m n="010010050121" lang="H" after=" " lemma="3117" morph="Ncmsa" id="013TS" pos="noun" type="common" gender="masculine" number="singular" state="absolute">י֥וֹם</m>
<m n="010010050131" lang="H" after="׃פ" lemma="259" morph="Acmsa" id="01NFp" pos="adjective" type="cardinal number" gender="masculine" number="singular" state="absolute">אֶחָֽד</m>
<seg type="x-sof-pasuq">׃</seg>
<seg type="x-pe">פ</seg>
</verse>
The rest of this article discusses approaches we took and lessons we learned as we
merged these two datasets, with illustrations drawn from the data itself.
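Once each prepared morph carries the same identifier as a leaf of the syntax tree, the merge can be thought of as a keyed join. The sketch below is an assumption about the general shape of such a join, not the MACULA merge code. It uses two morphs and two leaves trimmed from the examples above, and the `o` prefix on `xml:id` simply mirrors the MACULA fragment shown earlier.

```xquery
xquery version "3.1";

(: Two prepared morphs and the matching syntax-tree leaves, trimmed from the examples above. :)
let $morphology :=
  <verse osisID="Gen.1.5">
    <m n="010010050081" morph="C" lang="H" lemma="c" pos="conjunction">וַֽ</m>
    <m n="010010050091" morph="Ncmsa" lang="H" lemma="6153" pos="noun">עֶ֥רֶב</m>
  </verse>
let $leaves := (
  <Node Cat="cj" morphId="010010050081"/>,
  <Node Cat="noun" morphId="010010050091"/>
)
(: Join on the shared identifier and copy the morphology onto a new <w>. :)
for $leaf in $leaves
let $m := $morphology/m[@n = $leaf/@morphId]
return
  <w xml:id="o{ $leaf/@morphId }" class="{ $leaf/@Cat }">{
    $m/(@* except @n),
    $m/text()
  }</w>
```

A leaf with no matching morph would come out as an empty `w`, which makes gaps in either source easy to query for.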
Let the Text Drive
But first, you have to find the text ...
When combining multiple analyses of a given text, the analyses may have nothing in
common except the text itself.
Sometimes, even the text may vary. Regardless, the text is at the heart of any analysis
of a text, and it is often the one
thing you can rely on most. But sometimes you cannot rely even on that, or you need
further knowledge in order to
be able to identify a common text you can rely on.
Are we looking at the same text?
Before merging two analyses, always make sure that they are analyzing the same text,
or that you
have a strategy for handling the differences that are there. Most analyses include
excerpts of the text. Use them. Is the entire text there, or are some parts missing? Are there
unintentional duplicates of sections of the text? Are there variant readings, spelling variation, or Unicode
encoding artifacts?
All of these issues are commonly found in analyses we have worked with, including
analyses from reputable
sources. If some of the text is missing, we can often simply omit that analysis for
the portion of the
text they did not cover, hoping that they or some other party can complete the analysis.
If there are
variants, having a clear understanding of which variants are being used and why is
important.
If your main task is to add an analysis to a text, you can ignore minor punctuation,
spelling, and
encoding issues since you have already chosen a representation of the text.
"You can't hit what you can't see." This slogan originated with the boxer Muhammad Ali,
but it also applies to complex data integration scenarios. If you want to understand the textual
differences between two sources and establish a common text, you need an efficient way to see the
differences that you care about and ignore everything else. When we integrated the syntax tree
with the morphological analysis, we wrote code to strip away punctuation
and diacritics and add a delimiter at morpheme boundaries, then applied that to each verse
in the Hebrew Bible, reporting those verses that differ. We learned some things that surprised
us. Here is sample output from the query:
<mismatch verse="ju16:25" n="07016025" nmorphs="36 35">
<a>ו|יהי|כי|כ|טוב|לב|מ|ו|יאמרו|קראו|ל|שמשונ|ו|ישחק|ל|נו|ו|יקראו|ל|שמשונ|מ|בית|ה|אסורימ|ו|יצחק|ל|פני|המ|ו|יעמידו|אות|ו|בינ|ה|עמודימ</a>
<b>ו|יהי|כ|טוב|לב|מ|ו|יאמרו|קראו|ל|שמשונ|ו|ישחק|ל|נו|ו|יקראו|ל|שמשונ|מ|בית|ה|אסורימ|ו|יצחק|ל|פני|המ|ו|יעמידו|אות|ו|בינ|ה|עמודימ</b>
</mismatch>
<mismatch verse="gn18:10" n="01018010" nmorphs="27 26">
<a>ו|יאמר|שוב|אשוב|אלי|כ|כ||עת|חיה|ו|הנה|בנ|ל|שרה|אשת|כ|ו|שרה|שמעת|פתח|ה|אהל|ו|הוא|אחרי|ו</a>
<b>ו|יאמר|שוב|אשוב|אלי|כ|כ|עת|חיה|ו|הנה|בנ|ל|שרה|אשת|כ|ו|שרה|שמעת|פתח|ה|אהל|ו|הוא|אחרי|ו</b>
</mismatch>
<mismatch verse="gn28:5" n="01028005" nmorphs="22 21">
<a>ו|ישלח|יצחק|את|יעקב|ו|ילכ|פדנ|ה|ארמ|אל|לבנ|בנ|בתואל|ה|ארמי|אחי|רבקה|אמ|יעקב|ו|עשו</a>
<b>ו|ישלח|יצחק|את|יעקב|ו|ילכ|פדנה|ארמ|אל|לבנ|בנ|בתואל|ה|ארמי|אחי|רבקה|אמ|יעקב|ו|עשו</b>
</mismatch>
<mismatch verse="gn38:24" n="01038024" nmorphs="29 28">
<a>ו|יהי|כ|מ|שלש|חדשימ|ו|יגד|ל|יהודה|ל|אמר|זנתה|תמר|כלת|כ|ו|גמ|הנה|הרה|ל|זנונימ|ו|יאמר|יהודה|הוציאו|ה|ו|תשרפ</a>
<b>ו|יהי|כ|משלש|חדשימ|ו|יגד|ל|יהודה|ל|אמר|זנתה|תמר|כלת|כ|ו|גמ|הנה|הרה|ל|זנונימ|ו|יאמר|יהודה|הוציאו|ה|ו|תשרפ</b>
</mismatch>
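A report of this shape can be produced with a few string functions. The following is a sketch under stated assumptions rather than the query we ran. It removes every non-letter character and, since the sample output above appears to fold final letter forms into their regular forms, does the same with `translate`. The function names and the sample input are made up for illustration.

```xquery
xquery version "3.1";

(: Reduce one morph to bare consonants: drop every non-letter character
   (vowel points, accents, maqqef, sof pasuq) and fold final letter forms. :)
declare function local:simplify($morph as xs:string) as xs:string {
  translate(replace($morph, '\P{L}', ''), 'ךםןףץ', 'כמנפצ')
};

(: Join the simplified morphs of a verse with "|" so verses compare as plain strings. :)
declare function local:verse-string($morphs as xs:string*) as xs:string {
  string-join($morphs ! local:simplify(.), '|')
};

(: Hypothetical report: emit a <mismatch> only when the two sources disagree. :)
declare function local:compare(
  $verse as xs:string, $a as xs:string*, $b as xs:string*
) as element(mismatch)? {
  let $sa := local:verse-string($a)
  let $sb := local:verse-string($b)
  where $sa ne $sb
  return
    <mismatch verse="{ $verse }" nmorphs="{ count($a), count($b) }">
      <a>{ $sa }</a>
      <b>{ $sb }</b>
    </mismatch>
};

(: Made-up input: the same words, split into three morphs by one source and two by the other. :)
local:compare('gn1:5', ('וַֽ', 'יְהִי', 'עֶ֥רֶב'), ('וַֽיְהִי', 'עֶ֥רֶב'))
```

Because identical verses return the empty sequence, running the comparison over a whole book leaves only the verses that need a human decision.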
This simplified form of verse text was originally developed for software: it is easy to use
in simple string comparisons, so that only verses that differ need to be examined. It is also useful in unit
tests to make sure that we do not corrupt the text when we add new analyses.
But we also found that Hebrew experts who do not program could quickly spot the discrepancies
by reading these strings. We shared these files in Slack channels, the experts wrote notes
in them, and we met to identify categories of discrepancies and how to address them. Before we started
using this approach, we were spending hours reading XML and looking for differences or writing
code to find differences in complex data structures. We found the new approach much more efficient.
When we did these comparisons, we found a variety of reasons for discrepancies, including:
- Unexpected variant readings.
- Differences in analysis that affect word boundaries or morph boundaries.
- Spelling corrections to the original source.
- Errors.
Of course, this is only one useful view that we used. In general, though, generating
useful views and using queries
to examine the data is a very helpful way to get beyond "paralysis by analysis."
Views should be designed to allow
different kinds of experts to quickly see and evaluate things that require their judgement,
while ignoring things
that do not need their attention and time.
The most important textual difference we found was this: the syntax trees followed the Qere reading, while the morphological
analysis followed the Ketiv reading but provided the Qere reading in notes. In the Masoretic text, there are two readings
for many verses, and Jews consider both to be important. Qere means "it is said," and Jewish law says that the Qere should
be read when the text is read out loud. But Jewish law also says that a Torah scroll must follow the Ketiv, which means
"It is written." Most Bible translations follow the Qere. Here is a verse that contains a Qere reading in addition to the
Ketiv, as represented in Open Scriptures Hebrew Bible:
<verse osisID="Gen.8.17">
<w lemma="3605" morph="HNcmsc" id="01PUa">כָּל</w><seg type="x-maqqef">־</seg><w lemma="d/2416 c" morph="HTd/Ncfsa" id="01r3q">הַ/חַיָּ֨ה</w>
<w lemma="834 a" morph="HTr" id="01i7e">אֲשֶׁר</w><seg type="x-maqqef">־</seg><w lemma="854" n="1.0.2.0" morph="HR/Sp2ms" id="01Mem">אִתְּ/ךָ֜</w>
<w lemma="m/3605" morph="HR/Ncmsc" id="01wNK">מִ/כָּל</w><seg type="x-maqqef">־</seg><w lemma="1320" n="1.0.2" morph="HNcmsa" id="01">בָּשָׂר</w>
<w lemma="b/5775" morph="HRd/Ncmsa" id="015e6">בָּ/ע֧וֹף</w>
<w lemma="c/b/929" n="1.0.1" morph="HC/Rd/Ncfsa" id="01yr6">וּ/בַ/בְּהֵמָ֛ה</w>
<w lemma="c/b/3605" morph="HC/R/Ncmsc" id="01ckn">וּ/בְ/כָל</w><seg type="x-maqqef">־</seg><w lemma="d/7431" n="1.0.0" morph="HTd/Nc" id="01vA4">הָ/רֶ֛מֶשׂ</w>
<w lemma="d/7430" morph="HTd/Vqrmsa" id="01KEn">הָ/רֹמֵ֥שׂ</w>
<w lemma="5921 a" morph="HR" id="01cPC">עַל</w><seg type="x-maqqef">־</seg><w lemma="d/776" n="1.0" morph="HTd/Ncbsa" id="01Eoc">הָ/אָ֖רֶץ</w>
<w type="x-ketiv" lemma="3318" morph="HVhv2ms" id="01Pdv">הוצא</w>
<note type="variant"><catchWord>הוצא</catchWord><rdg type="x-qere"><w lemma="3318" morph="HVhv2ms" id="01S7t">הַיְצֵ֣א</w></rdg></note>
<w lemma="854" n="1" morph="HR/Sp2fs" id="018F2">אִתָּ/ךְ</w>
<w lemma="c/8317" morph="HC/Vqq3cp" id="01T2K">וְ/שֽׁרְצ֣וּ</w>
<w lemma="b/776" n="0.1" morph="HRd/Ncbsa" id="01ouG">בָ/אָ֔רֶץ</w>
<w lemma="c/6509" morph="HC/Vqq3cp" id="01xxG">וּ/פָר֥וּ</w>
<w lemma="c/7235 a" n="0.0" morph="HC/Vqq3cp" id="01vin">וְ/רָב֖וּ</w>
<w lemma="5921 a" morph="HR" id="01KSD">עַל</w><seg type="x-maqqef">־</seg><w lemma="d/776" n="0" morph="HTd/Ncbsa" id="01Eiv">הָ/אָֽרֶץ</w><seg type="x-sof-pasuq">׃</seg>
</verse>
When we first started working with this source, Ketiv and Qere were not marked up in a way that was always easy to distinguish,
and there were some errors, but we have been able to work with the Open Scriptures group to make it easy to choose one
reading or the other using their data. We simply raise the Qere reading from the note into the main text, then delete the
Ketiv reading. The query that raises the Qere reading looks like this:
(: Replace each variant note with the word(s) of the reading it contains. :)
let $oshb := db:open("oshb-morphology")
for $qere in $oshb//*:note[@type='variant']
return replace node $qere with $qere/*:rdg/*
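To see both steps together without a database, the same idea can be tried on an in-memory fragment with the XQuery Update `copy`/`modify`/`return` expression. This is a sketch only: in the pipeline the two steps are separate updating queries run against the database. The fragment below is trimmed from the verse above and is in no namespace, so the `*:` wildcards are not needed here.

```xquery
xquery version "3.1";

(: In-memory variant of the two steps: hoist the Qere, then drop the Ketiv. :)
copy $verse :=
  <verse osisID="Gen.8.17">
    <w type="x-ketiv" lemma="3318" morph="HVhv2ms" id="01Pdv">הוצא</w>
    <note type="variant"><catchWord>הוצא</catchWord><rdg type="x-qere"><w lemma="3318" morph="HVhv2ms" id="01S7t">הַיְצֵ֣א</w></rdg></note>
    <w lemma="854" n="1" morph="HR/Sp2fs" id="018F2">אִתָּ/ךְ</w>
  </verse>
modify (
  (: raise the Qere word(s) out of the note :)
  for $note in $verse//note[@type = 'variant']
  return replace node $note with $note/rdg[@type = 'x-qere']/*,
  (: drop the Ketiv word :)
  delete node $verse//w[@type = 'x-ketiv']
)
return $verse
```

The result contains two `w` elements: the Qere word with id `01S7t`, now in the main text, followed by the word with id `018F2`.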
Are we looking at the same units?
Even if texts are identical, comparison depends on looking at the same units.
For instance, our numbering system depends on concepts like "the third word in the
verse"
or "the second morpheme in the word," but this is problematic when the data sources
we use have different versification schemes or different criteria for "word" or
"morpheme." These differences can occur even for English, but they are much
more acute for Hebrew.
Consider the text we have used in many examples above:
וַֽיְהִי־עֶ֥רֶב
And there was evening
— Genesis 1:5
Some of the resources we use consider that one word:
- וַֽיְהִי־עֶ֥רֶב And it was evening
Some consider it two words:
- וַֽיְהִי And it was
- עֶ֥רֶב Evening
Some consider it three words:
- וַֽ And
- יְהִי it was
- עֶ֥רֶב Evening
Because the concept of "word" depends on the analysis, and analyses vary, we
wanted a definition that relied only on simple string operations. In our numbering
system, an orthographic word is a sequence of letters, and any non-alphabetic character
is treated as a delimiter when tokenizing to find orthographic words. For instance,
וַֽיְהִי־עֶ֥רֶב, which means "and there was evening," is considered two orthographic words:
וַֽיְהִי and עֶ֥רֶב. The first orthographic word, וַֽיְהִי, is word number 8 in the verse, and it
contains two morphs: 010010050081 corresponds to וַֽ ("and"), and 010010050082
corresponds to יְהִי ("there was"). The second orthographic word, עֶ֥רֶב ("evening"), is word
number 9 in the verse, and it contains only a single morph, identified by 010010050091.
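The definition above translates almost directly into a regular expression. The sketch below is illustrative rather than our production code. It treats letters together with the combining marks attached to them as word content, which is an assumption needed to keep vowel points and accents with their consonants, and treats every other character as a delimiter. The function name is hypothetical.

```xquery
xquery version "3.1";

(: Split a string into orthographic words: letters (\p{L}) and the combining
   marks attached to them (\p{M}) are kept; anything else is a delimiter. :)
declare function local:orthographic-words($text as xs:string) as xs:string* {
  tokenize($text, '[^\p{L}\p{M}]+')[. ne '']
};

(: The end of Genesis 1:5, as quoted in Table I. :)
for $word at $i in local:orthographic-words('וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃')
return $i || ': ' || $word
```

The maqqef and the sof pasuq are punctuation characters, so the excerpt splits into six orthographic words. The positions printed here count from the start of the excerpt, not from the start of the verse.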
But the number of morphemes in a single word also depends on the analysis. In the same verse, the
third word, לָאוֹר, has a prefix and an implicit article. Some analyses treat this as two morphs; others
treat it as three, creating a morph to represent the implicit article. Making matters worse, two
Hebrew experts may not agree whether an implicit article is present. If we change our mind, we do
not want to renumber the rest of the morphs in a word. Therefore, we number everything
except implicit articles by morph position, without counting implicit articles. The identifier
for an implicit article is formed by appending ה to the identifier of the morph where the implicit article
occurs. If an implicit article occurs on a morph with the identifier 010010050031, the identifier for
the implicit article is 010010050031ה, as you can see in the following example:
<m n="010010050031" morph="Rd" lang="H" lemma="l" pos="preposition">לָ</m>
<m n="010010050031ה" morph="Td" lemma="d" lang="H" pos="particle" type="definite article"/>
<m n="010010050032" morph="Ncbsa" lang="H" lemma="216" after=" " pos="noun" type="common" gender="both" number="singular" state="absolute">אוֹר֙</m>
These identifiers use a BBCCCVVVWWWP format, where BB is a two-digit number that identifies a book,
CCC is a three-digit number that identifies the chapter, VVV is a three-digit number that identifies the verse, WWW is
a three-digit number that identifies the word within the verse, and P is a single digit that identifies a given morph within a word.
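As an illustration of the format, the following sketch builds an identifier from its components, derives the identifier of an implicit article, and splits an identifier back into its parts. All three function names are hypothetical.

```xquery
xquery version "3.1";

(: Build a BBCCCVVVWWWP identifier from its numeric components. :)
declare function local:morph-id(
  $book as xs:integer, $chapter as xs:integer, $verse as xs:integer,
  $word as xs:integer, $part as xs:integer
) as xs:string {
  format-integer($book, '00') || format-integer($chapter, '000') ||
  format-integer($verse, '000') || format-integer($word, '000') ||
  format-integer($part, '0')
};

(: The implicit article attached to a morph reuses that morph's identifier plus ה. :)
declare function local:implicit-article-id($morph-id as xs:string) as xs:string {
  $morph-id || 'ה'
};

(: Split an identifier back into its zero-padded components. :)
declare function local:parse-id($id as xs:string) as map(xs:string, xs:string) {
  map {
    'book':    substring($id, 1, 2),
    'chapter': substring($id, 3, 3),
    'verse':   substring($id, 6, 3),
    'word':    substring($id, 9, 3),
    'part':    substring($id, 12, 1)
  }
};

let $id := local:morph-id(1, 1, 5, 3, 1)
return ($id, local:implicit-article-id($id), local:parse-id($id)?word)
```

For Genesis 1:5, word 3, morph 1, this returns `010010050031`, `010010050031ה`, and `003`. Because the suffix is appended after the final digit, adding or removing an implicit article never changes the identifier of any other morph.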
Compound nouns form another challenge. Consider this verse:
וְצִלָּ֣ה גַם־הִ֗וא יָֽלְדָה֙ אֶת־תּ֣וּבַל קַ֔יִן
Zillah also bore Tubal-cain
— Genesis 4:22
Let's focus on אֶת־תּ֣וּבַל קַ֔יִן, which means "(object marker) + Tubal-cain". In our numbering system, as described
above, we treat this as three orthographic words, without considering whether compound nouns are present. We
prefer to leave that for a higher level of analysis, and we would like to be able to add new compound nouns
without changing identifiers. For instance, OSHB identifies some proper nouns that are not identified as
such in the syntax trees, and we are likely to add these in the future. Therefore, we base our word
numbering on simple orthographic words and use markup to identify compound nouns. When we prepare
the morphology, it looks like this:
<m n="010040220051" lang="H" after="־" lemma="853" morph="To" id="01deG" pos="particle" type="direct object marker">אֶת</m>
<seg type="x-maqqef">־</seg>
<c>
<m n="010040220061" lang="H" after=" " lemma="8423+" morph="Np" id="01Nvj" pos="noun" type="proper">תּ֣וּבַל</m>
<m n="010040220071" lang="H" after=" " lemma="8423" morph="Np" id="01Gye" pos="noun" type="proper">קַ֔יִן</m>
</c>
This example and some of the others illustrate the truism "Splitting is easy, lumping
is hard". Lumping
is hard because it requires a theory to explain what should be joined together. In
this case, we need
a list of compound nouns so that we know which ones should be combined. For data
integration, splitting
to a high degree of granularity makes it easier to map to other data sources that
do the same, but we
also need ways to lump again so that we can map to other sets of resources. We can
use those resources
to see which things should be lumped.
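The lumping step can be sketched in Python under a few assumptions: the sample is an abridged copy of the markup above, the sample wrapper element and the nominal_units function are hypothetical names introduced here, and the standard library's xml.etree.ElementTree is used for parsing. Morphs inside a c element are joined into one unit, while every unit keeps the identifiers of the orthographic words it came from.

```python
import xml.etree.ElementTree as ET

# Abridged from the Genesis 4:22 markup above; <sample> is a hypothetical wrapper.
SAMPLE = """<sample>
<m n="010040220051" lemma="853" pos="particle">אֶת</m>
<seg type="x-maqqef">־</seg>
<c>
<m n="010040220061" lemma="8423+" pos="noun">תּ֣וּבַל</m>
<m n="010040220071" lemma="8423" pos="noun">קַ֔יִן</m>
</c>
</sample>"""


def nominal_units(root):
    """Return (text, [identifiers]) pairs, lumping morphs wrapped in <c>."""
    units = []
    for child in root:
        if child.tag == "c":
            # A compound: one unit that spans several orthographic words.
            morphs = list(child.iter("m"))
            text = " ".join(m.text or "" for m in morphs)
            units.append((text, [m.get("n") for m in morphs]))
        elif child.tag == "m":
            # A simple morph is a unit on its own.
            units.append((child.text or "", [child.get("n")]))
        # Other elements, such as <seg>, are punctuation and are skipped.
    return units


if __name__ == "__main__":
    for text, ids in nominal_units(ET.fromstring(SAMPLE)):
        print(ids, text)
```

Because each unit carries its original identifiers, the same data can be mapped to sources that split at the word level and to sources that treat the compound as a single item.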
Working with Hebrew has forced us to think differently about the relationship between
words and morphemes and the relationship between morphology and syntax. Aligning
biblical
texts with translation languages has also forced us to do so. In general, a single
"word"
in some languages can translate to a phrase or a clause in English, and the things
that
are modeled by an English syntax tree may be required to represent the internal structure
of a "word" in these languages. And the same overlapping hierarchy issues that are
familiar to those who work with verses and paragraphs also occur at the word level.
Consider the following text:
וַיְהִ֗י בִּימֵי֙ שְׁפֹ֣ט הַשֹּׁפְטִ֔ים וַיְהִ֥י רָעָ֖ב בָּאָ֑רֶץ וַיֵּ֨לֶךְ אִ֜ישׁ
מִבֵּ֧ית לֶ֣חֶם יְהוּדָ֗ה לָגוּר֙ בִּשְׂדֵ֣י מוֹאָ֔ב ה֥וּא וְאִשְׁתּ֖וֹ וּשְׁנֵ֥י
בָנָֽיו׃
In the days when the judges ruled there was a famine in the land, and a man from Bethlehem
in Judah went to sojourn in the country of Moab, he and his wife and his two sons.
— Ruth 1:1
Let's focus on this part of the text:
מִבֵּ֧ית לֶ֣חֶם
From Bethlehem
Neither מִבֵּ֧ית nor לֶ֣חֶם means "Bethlehem." מִבֵּ֧ית translates roughly to "from
Beth-" and לֶ֣חֶם
translates to "-lehem." We need to be able to represent both written words and nominal
units like compound nouns. To do this well across languages, we have to take the morphosyntax
of various languages into account. Martin Haspelmath describes these issues well:
The general distinction between morphology and syntax is widely taken for
granted, but it crucially depends on the notion of a cross-linguistically valid
concept of "(morphosyntactic) word". I show that there are no good criteria for
defining such a concept. I examine ten criteria in some detail (potential
pauses, free occurrence, mobility, uninterruptibility, non-selectivity,
non-coordinatability, anaphoric islandhood, nonextractability,
morphophonological idiosyncrasies, and deviations from biuniqueness), and I show
that none of them is necessary and sufficient on its own, and no combination of
them gives a definition of "word" that accords with linguists' orthographic
practice. "Word" can be defined as a language-specific concept, but this is not
relevant to the general question pursued here. "Word" can be defined as a fuzzy
concept, but this is theoretically meaningful only if the continuum between
affixes and words, or words and phrases, shows some clustering, for which there
is no systematic evidence at present. Thus, I conclude that we do not currently
have a good basis for dividing the domain of morphosyntax into "morphology" and
"syntax", and that linguists should be very careful with general claims that
make crucial reference to a cross-linguistic "word" notion.
— The indeterminacy of word segmentation and the nature of morphology and
syntax - Martin Haspelmath
Lessons Learned
We hope this paper has given a flavor of the work we do when integrating data sources
that reflect a wide variety of designs. Now we would like to conclude by listing some
of
the lessons we have learned along the way.
We have learned that data integration is usually possible. If there is a dataset
that provides important insights and you have the time to really understand the dataset,
it can probably be integrated. And as you gain experience integrating new datasets,
create useful reference systems and mappings, and design the tools you need, it becomes
easier. Sometimes it can take significant time. Sometimes there are parts of the
data
that cannot be integrated. But in general, we are now able to integrate new datasets
without inordinate effort. And the result is gratifying, allowing much richer queries
and providing ways to create new resources or to view the text in new ways.
But we have also learned that language is hard, and that many things that seem
simple turn out to be complex in unexpected ways. Simple concepts like "book", "chapter
and verse", "word", "morpheme", and many others have all turned out to be much more
complex in practice than many people would expect. Data integration involves a great
deal of exploratory data analysis. XML simplifies this because we can easily put
a variety of XML sources into an XML database and query them to see what individual
sources contain, how that compares to data found in other sources, and whether a
particular change makes them easier to use together. Or we can use Python and lxml
to explore the data in similar ways.
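As one illustration of this kind of exploration, the following sketch profiles which attributes occur on m elements and which values they take. It uses the standard library's xml.etree.ElementTree in place of lxml so that it has no dependencies. The attribute_profile function and the sample wrapper element are hypothetical, and the sample data is the three-morph example shown earlier.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# The three morphs from the earlier example; <sample> is a hypothetical wrapper.
SAMPLE = """<sample>
<m n="010010050031" morph="Rd" lang="H" lemma="l" pos="preposition">לָ</m>
<m n="010010050031ה" morph="Td" lemma="d" lang="H" pos="particle" type="definite article"/>
<m n="010010050032" morph="Ncbsa" lang="H" lemma="216" after=" " pos="noun" type="common" gender="both" number="singular" state="absolute">אוֹר֙</m>
</sample>"""


def attribute_profile(xml_text, tag="m"):
    """Count how often each attribute, and each of its values, occurs on a tag."""
    root = ET.fromstring(xml_text)
    names = Counter()   # attribute name -> number of elements that carry it
    values = {}         # attribute name -> Counter of the values seen
    for element in root.iter(tag):
        for name, value in element.attrib.items():
            names[name] += 1
            values.setdefault(name, Counter())[value] += 1
    return names, values


if __name__ == "__main__":
    names, values = attribute_profile(SAMPLE)
    for name, count in names.most_common():
        print(name, count, dict(values[name]))
```

Running the same profile over two sources and comparing the results is a quick way to see where their vocabularies differ, and an attribute name that appears only once or twice in a large file is often a misspelling.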
We have also learned the value of a good hub architecture. In our world,
we use the MACULA Greek and MACULA Hebrew trees as a hub for our data integration
and data mapping. The reference systems used in the hub representation have become
the basis for sophisticated mappings that significantly simplify integrating new
sources or joining across sources. "There's nothing more practical than a good
theory", and these trees have become a theory of the text that we can use to
more easily understand other analyses of the same text.
We have also learned that building a community of data requires more than
just putting your data on GitHub with an open license. Integrating with other
datasets is already a significant contribution to the community, allowing
others to leverage many insights at the same time without doing the hard
work of data integration themselves. We provide our combined trees
and mappings on GitHub under a free license. Beyond that, we are
creating software for visualizing, editing, and curating these datasets
and providing our source code on GitHub under a free license.
References
[ebibleEncoding] “Bible File Encoding for
Bible Translators, Publishers, and Software Developers.” Accessed March 29, 2021.
https://ebible.org/usfx/Bible-encoding.htm.
[usfm-grammar] GitHub.
“Bridgeconn/Usfm-Grammar.” Accessed March 29, 2021.
https://github.com/Bridgeconn/usfm-grammar.
[USFMtoOSIS] “Converting SFM Bibles to OSIS -
CrossWire Bible Society.” Accessed April 2, 2021.
https://wiki.crosswire.org/Converting_SFM_Bibles_to_OSIS.
[DBL] “Digital Bible Library.” Accessed March 25, 2021.
https://app.thedigitalbiblelibrary.org/.
[DeRose] DeRose, Steven. “Markup Overlap: A Review
and a Horse,” n.d., 17.
[EpiDoc] “EpiDoc: Epigraphic Documents in TEI XML /
Home / Home.” Accessed March 27, 2021. https://sourceforge.net/p/epidoc/wiki/Home/.
[FieldLinguistsToolkbox] “Field
Linguist’s Toolbox.” Accessed April 2, 2021.
https://software.sil.org/toolbox/.
[Paratext] “Paratext.” paratext.org
[Fieldworks] “FieldWorks.” Accessed April 2,
2021. https://software.sil.org/fieldworks/.
[Glanz] Glanz, Oliver. “Bible Software on the
Workbench of the Biblical Scholar: Assessment and Perspective.” Andrews University
Seminary Studies (AUSS) 56, no. 1 (July 19, 2018): 5–45.
[Graham/Howe] Graham, Tony, and Mark Howe. “EPUB: Chapter
and Verse (presentation slides).” Accessed April 1, 2021.
https://archive.xmlprague.cz/2011/presentations/graham-howe-epub.pdf.
[Grassick/Wiens] Grassick, Clayton, and Hart Wiens.
“Paratext: User-Driven Development:” The Bible Translator, April 1, 2011.
doi:https://doi.org/10.1177/026009351106200205.
[Haiola] “Haiola Scripture Publishing Software.”
Accessed April 2, 2021. http://haiola.org/.
[Little] Little, Chris. Chrislit/Usfm2osis.
Python, 2021. https://github.com/chrislit/usfm2osis.
[OSIS2.1.1] “OSIS 2.1.1 User Manual
06March2006.pdf.” Accessed April 1, 2021.
https://crosswire.org/osis/OSIS%202.1.1%20User%20Manual%2006March2006.pdf.
[PTXprint] “PTXprint – Bible Layout For Everyone -
SIL Language Technology.” Accessed March 27, 2021.
https://software.sil.org/ptxprint/.
[PublishingAssistant] “Publishing
Assistant.” Accessed March 27, 2021. https://pubassist.paratext.org/.
[Rapidwords] “Rapidwords.Net |.” Accessed April
2, 2021. https://rapidwords.net/.
[Regt/Kees 2011] Regt, Lénart J. de, and Kees de Blois.
Of Translations, Revisions, Scripts and Software: Contributions Presented to Kees
de
Blois. Reading: United Bible Societies, 2011.
[u2o] Ryan. Adyeths/u2o. Python, 2021.
https://github.com/adyeths/u2o.
[SBLStandards] “Biblical Scholars, Standards
and the SBL,” SBL Publications. Accessed March 27, 2021.
https://www.sbl-site.org/publications/article.aspx?ArticleId=45.
[Bosak97] Bosak, Jon. “SGML, Java, and the Future of the Web
(1996.11.17).” Accessed April 2, 2021.
https://www.ibiblio.org/pub/sun-info/standards/xml/why/xmlapps.961117.htm.
[ptx2pdf] GitHub. “Sillsdev/Ptx2pdf.” Accessed April
2, 2021. https://github.com/sillsdev/ptx2pdf.
[Proskomma] “The Challenges — Proskomma 0.1
Documentation.” Accessed April 1, 2021.
https://doc.proskomma.bible/en/latest/big_idea/challenges.html#why-is-usfm-so-popular.
[OSIS] “The CrossWire Bible Society - OSIS - A Common
Format for Multiple Visions.” Accessed April 1, 2021. https://crosswire.org/osis/.
[USFM 3.0] “USFM Documentation — Unified Standard
Format Markers 3.0.0 Documentation.” Accessed March 25, 2021.
https://ubsicap.github.io/usfm/.
“Usfm/Usfm.Sty at Master · Ubsicap/Usfm · GitHub.” Accessed March 29, 2021.
https://github.com/ubsicap/usfm/blob/master/sty/
[USX 3.0] “USX Documentation — Unified Scripture XML
3.0.0 Documentation.” Accessed March 25, 2021. https://ubsicap.github.io/usx/.
Vries, Lourens de. “Paratext and Skopos of Bible Translations.” Paratext and
Megatext as Channels of Jewish and Christian Traditions, December 20, 2003, 176–93.
doi:https://doi.org/10.1163/9789004421431_009.
[Copenhagen Workshop 2019] Winther-Nielsen,
Nicolai. “Papers for the Copenhagen Workshop on Open Biblical Resources.” HIPHIL Novum
5, no. 2 (November 20, 2019): 1–5.
[XQuery 3.0] “XQuery 3.0: An XML Query Language.”
Accessed April 2, 2021. https://www.w3.org/TR/xquery-30/.
[Semantic Dictionary of Biblical Hebrew] Semantic
Dictionary of Biblical Hebrew, edited by Reinier de Blois, with the assistance of
Enio
R. Mueller, ©2000-2021 United Bible Societies. Available online at
https://semanticdictionary.org/.
[Semantic Dictionary of Biblical Greek] Semantic
Dictionary of Biblical Greek, Semantic Dictionary of the Greek New Testament, based
on
Louw & Nida's Greek-English Lexicon of the New Testament, ©1988-2021 United Bible
Societies. Available online at https://semanticdictionary.org/.
[SIL Semantic Domains] SIL Semantic Domains.
Available online at https://semdom.org/.
[Westminster Hebrew Syntax Without Morphology] The Westminster Hebrew
Syntax without Morphology (version 4.20 as of 2018-04-11). Copyright (C) 1991-2018
by
The J. Alan Groves Center for Advanced Biblical Research. Available online at
https://github.com/Clear-Bible/macula-hebrew/tree/main/sources/GrovesCenter
[Open Scriptures Hebrew Bible] Open Scriptures Hebrew Bible. Available online at
https://hb.openscriptures.org/. Data available at
https://github.com/openscriptures/morphhb.
[Handling RTL in XHTML and HTML] Internationalization Best Practices: Handling Right-to-left Scripts in XHTML and HTML
Content. Available online at https://www.w3.org/International/geo/html-tech/tech-bidi.html.
[MACULA Greek] MACULA Greek - Syntax trees, morphology, and linguistic annotations for the Greek
New Testament. Clear Bible, Inc. Available at https://github.com/Clear-Bible/macula-greek.
[MACULA Hebrew] MACULA Hebrew - Syntax trees, morphology, and linguistic annotations for the Hebrew
Bible. Clear Bible, Inc. Available at https://github.com/Clear-Bible/macula-hebrew.