Durusau, Patrick. “Hypergraphs: Escaping the Surly Bonds of Syntax.” Presented at Balisage: The Markup Conference 2023, Washington, DC, July 31 - August 4, 2023. In Proceedings of Balisage: The Markup Conference 2023. Balisage Series on Markup Technologies, vol. 28 (2023). https://doi.org/10.4242/BalisageVol28.Durusau01.
Balisage: The Markup Conference 2023 July 31 - August 4, 2023
Balisage Paper: Hypergraphs, Escaping the Surly Bonds of Syntax
Patrick Durusau is the Co-Chair of the OASIS Open Document Format for Office Applications
(OpenDocument) TC and has been a member of that TC since its initial meeting on December
16, 2002. His employer/sponsor has changed several times over the years, and Patrick
has been a co-editor/editor of the OpenDocument Format (ODF) for the majority of that
time. Patrick is also the project editor for the ISO/IEC mirror of ODF as ISO/IEC
Patrick blogs about topic maps (being one of the co-editors of ISO 13250-5), other
semantic issues and of late, how irregular forces can leverage data for their causes
at Another Word for It.
Claus Huitfeldt responded to one of the many variants of What is a Text Really:
Finally, and most importantly, I am struck by the lack of imagination in this approach:
why on earth should texts by all means be hierarchies? No doubt, there are many hierarchical
structures, and no doubt this is important, but there are countless other relations
between text elements which are worth while finding and investigating- overlap, substitution,
discontinuity, parallel texts, cross-references, etc.
Huitfeldt and Sperberg-McQueen have labored for decades on the imaginative representation
of texts in markup. As have many others.
Humanities scholars are confronted with a bewildering array of markup languages and
techniques, should they not decide to invent their own for representing complex texts.
A number of those syntaxes are illustrated here, to lay the groundwork for this heretical
suggestion: Humanists should use any consistent method they choose for complex markup.
The burden of preparing texts for interchange should rest on technologists who have
loaded those texts into a hypergraph database. Let scholars be about what scholars do,
and technologists about aiding them in those tasks, not training them for new ones.
The Balisage archives hold forty (40) papers since 2008 addressing overlapping markup.
(See Appendix 1.) The literature beyond Balisage is vast and deep; I had two filing cabinets full
of such papers more than a decade ago. But the Balisage collection represents a fair
sampling of approaches and is likely familiar to you both as authors and listeners.
If you have been to one or more Balisage conferences, you will no doubt have heard
our host, Tommie Usdin, admonish us to be good listeners! For those of you not familiar
with the concept, it means hearing what others are saying without polishing your response
or slides as the case may be. While I try to listen, to the extent that I have at
Balisage, have I been listening to the wrong people?
That’s not a slur on the Balisage presenters, all of whom I value as friends and colleagues.
When I say the wrong people, I mean that while I enjoy the complexities of rabbit-duck grammars, will they help me capture the native language of users in other domains? For all
of my use of and appreciation for markup, I want to empower users, not myself.
Texts Are Not Discrete and Linear
A casual perusal of the previous Balisage papers on overlapping markup leaves no
doubt: the tree model is the exception rather than the rule for texts. If
we roll the clock back to Text Retrieval on a Microcomputer, we find a description of overlapping complexities in my domain, biblical texts:
The structure of spoken text has particular complexities which make it difficult to index with already available
software. Most computer-assisted indexing systems, including the recent ones, assume
that ideas are discrete and linear, hence sequentially indexable. In spoken text, however, ideas are rarely discrete and linear. Instead,
as interviewees recount a story or make a point, ideas and recollections are often
condensed and bundled. Often a block of text may contain a number of ideas that a
researcher would like to index. Just as often, a block of text containing a single
idea may overlap other blocks of text containing other ideas. Indeed, the taut structure
that one hopes find in formal written text rarely exists in spoken or informal text
because many speakers and writers think extemporaneously, without regard to structure
or polish.
When you think about either the Hebrew Bible or the New Testament, they are almost
completely spoken texts. People talk to each other; they talk to snakes, donkeys,
fig trees, rocks, divine beings, and government officials; they conspire to conceal
adultery, to name only a few of the spoken interactions.
The Hebrew Bible and the New Testament were transmitted over thousands of years through
thousands of witnesses, composed by authors lost to history, authors who are known, maybe, and those witnesses
each contend for particular content at a given location. Another set of conversations.
Biblical commentators have not been silent about the text, being in conversation (shouting?)
with each other and each succeeding generation creating new conversations (more shouting?)
about the text.
Modern scholars have a variety of languages to talk about the biblical text.
What is surprising is that, despite thousands of years of careful study, prior to 1988
no biblical scholar raised the issue of overlap. Not once. Whatever model of the text
they were using, overlap wasn’t an issue.
The Birth of and Solutions to Overlap
The problem of overlap came into being, at least for our purposes, with the publication of Standard Generalized
Markup Language (ISO 8879, SGML) in 1986. As was the default for software at the time,
SGML assumed that text to be encoded was, in the words of Giordano, discrete and linear. To be fair, SGML did have an optional feature, CONCUR, which enabled different discrete
and linear views of the same text, but only one could be active at any time. Being
an optional feature, it was only occasionally implemented.
For reasons that remain unclear, at least to someone who learned SGML from the SGML
Handbook (Goldfarb 1991), programmers wanted a simpler-to-use markup language, which we now know as XML.
What was an optional feature of SGML, CONCUR, was discarded as too hard for
the weekend programmer. That is a defect in XML that persists to this day, despite many labors
to repair it. (See Appendix 1.)
Examining only a few of the proposals to solve the overlap problem, which is a standards defect and not a feature of texts, or conversations
about them, shows languages strange to scholars, invented to solve a problem with
our standards.
Near the beginning of efforts to address the complexity of texts with markup is MECS - A MULTI-ELEMENT CODE SYSTEM. Its language isn’t as frightening as some we will see, but it is still daunting to scholars
who already possess languages to describe their texts:
MECS is a syntax for the design of text encoding systems. Documents which conform
to this syntax consist of text interspersed with codes, of which there may be seven
syntactically distinct types:
No-element codes: <s>
One-element codes: <a/ ... /a>
Poly-element codes: [a/2| ... /a| ... /a]
N-element codes: [s/2\ ... /s| ... /s]
Character representation codes: {a} or {"---"\a}
Character disambiguation codes: {a\a} or {"---"\a}
Comments: <| xxx |>
MECS and its successors were developed at The Wittgenstein Archives at the University
of Bergen (WAB), https://wab.uib.no/index.page, in a particularly fruitful collaboration between Claus Huitfeldt and Michael Sperberg-McQueen.
Another solution, championed by Henry Thompson for different markup systems for text
corpora, is standoff markup:
Adding markup from a distance
Consider marking sentence structure in a read-only corpus of text which is marked-up already with tags for words and punctuation, but nothing more:
. . .
<w id='w12'>Now</w><w id='w13'>is</w><w id='w14'>the</w>
. . .
<w id='w27'>the</w><w id='w28'>party</w><c id='c4'>.</c>
With an inclusion semantics, I can mark sentences in a separate document as follows:
. . .
<s xml-link='simple' href="#ID(w12)..ID(c4)"></s>
<s xml-link='simple' href="#ID(w29)..ID(c7)"></s>
. . .
Standoff markup does support arbitrary markup views on a text (so long as each instance
is well-formed XML), but it remains subject to the linear requirements in each instance.
It is subject to breaking should the target text change, but it escapes the one view of a text
mandated by XML. (http://xml.coverpages.org/thompson-sgmleu97.html)
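As a rough illustration of the inclusion semantics in the excerpt, the following Python sketch resolves a range reference such as `#ID(w12)..ID(c4)` against a base token sequence. The `BASE_TOKENS` list, the `RANGE_PATTERN` expression, and the `resolve_range` helper are assumptions made for this sketch; they are not taken from Thompson's paper or from any XLink processor, and the token list holds only the tokens actually shown in the excerpt.

```python
import re

# Base tokens in document order. Only the tokens shown in the excerpt appear
# here; the tokens elided there are simply left out of this toy list.
BASE_TOKENS = [
    ("w12", "Now"), ("w13", "is"), ("w14", "the"),
    ("w27", "the"), ("w28", "party"), ("c4", "."),
]

# Matches references of the form '#ID(start)..ID(end)'.
RANGE_PATTERN = re.compile(r"#ID\((?P<start>[^)]+)\)\.\.ID\((?P<end>[^)]+)\)")


def resolve_range(href, tokens):
    """Return the (id, text) pairs spanned by a '#ID(a)..ID(b)' reference."""
    match = RANGE_PATTERN.fullmatch(href)
    if match is None:
        raise ValueError(f"not a range reference: {href!r}")
    ids = [token_id for token_id, _ in tokens]
    try:
        first = ids.index(match["start"])
        last = ids.index(match["end"])
    except ValueError as exc:
        # The fragility of standoff links: if the base text changes or an
        # id is missing, the reference dangles.
        raise KeyError(f"dangling reference in {href!r}") from exc
    if first > last:
        raise ValueError(f"range runs backwards: {href!r}")
    return tokens[first:last + 1]


sentence = resolve_range("#ID(w12)..ID(c4)", BASE_TOKENS)
print(" ".join(text for _, text in sentence))
```

The second link in the excerpt, `#ID(w29)..ID(c7)`, raises `KeyError` against this toy list because those ids are not in it, which is the same failure a real standoff link meets when its target text is edited.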
While writing this paper I encountered a non-Balisage paper (it happens) on text and
hypergraphs: Texts as Hypergraphs: An Intuitive Representation of Interpretations of Text by Elli Bleeker, Ronald Haentjens Dekker, and Bram Buitendijk (https://doi.org/10.4000/jtei.3919). The abstract reads:
Over the past decades, the question of what text really is has been addressed by a
large number of conferences, workshops, articles, and blog posts. If there is one
thing that, taken together, those contributions illustrate, it is that our understanding
of text is—and has been—constantly in flux and open to many interpretations. Still,
there is often a gap between how an editor conceptualizes a source text and how this
text is encoded and stored on a computer: using TEI XML, editors are compelled to
model their text as a single tree (a hierarchy), whether this structure corresponds
with their intellectual understanding or not. Textual features that do not fit naturally
into the XML data model require additional layers of code, which hinders processing,
querying, and interchange.
The Text-As-Graph (TAG) data model and the associated syntax TAGML are developed to
express and store textual information as a network. To this end, TAG implements a
hypergraph model. In the present contribution, we illustrate the benefits of TAG’s
hypergraph for the modeling of features like nonlinearity, discontinuity, and overlap.
In contrast to a tree model, a hypergraph accommodates these nonhierarchical structures
naturally. By making them part of the data model and the syntax, a TAGML processor
can process the features without having to resort to workarounds or schema-aware tools.
This lowers the difficulty of working with digital editions and facilitates querying
and interchange.
That sounds like it answers all the questions for conversations in, about, and with
a text. Or does it?
Consider the formal grammar of TAGML:
1. document ::= documentHeader? richText*
2. documentHeader ::= namespaceDefinition*
3. namespaceDefinition ::= '[!ns ' namespaceIdentifier ' ' namespaceURI ']'
4. namespaceIdentifier ::= nameCharacter+
5. richText ::= ( textEnrichment | text )*
6. textEnrichment ::= ( markupStartTag | markupEndTag | markupMilestone | textVariation | comment )*
7. text ::= textCharacter*
8. textCharacter ::= [^[<\] | '\[' | '\<' | '\\' # For regular text, we only need to escape the 2 characters that start a markupStartTag, markupEndTag or markupMilestone, plus the escape character itself.
9. markupStartTag ::= '[' ( optional | resume )? tagIdentifier (' ' annotation)* '>'
10. markupEndTag ::= '<' ( optional | suspend )? tagIdentifier ']'
11. markupMilestone ::= '[' tagIdentifier (' ' annotation)* ']'
12. textVariation ::= '<|' richTextInTextVariation ( '|' richTextInTextVariation )+ '|>'
13. richTextInTextVariation ::= ( textEnrichment | textInTextVariation )*
14. textInTextVariation ::= textInTextVariationCharacter*
15. textInTextVariationCharacter ::= [^[<|\] | '\[' | '\<' | '\|' | '\\' # For text inside textVariation tags we also have to escape the variation divider character |
16. comment ::= '[!' commentCharacter* '!]'
17. commentCharacter ::= [^!\] | '\!' | '\\' # For text inside a comment we only have to escape the 2 characters that constitute the comment closing tag !], plus the escape character itself.
18. optional ::= '?'
19. resume ::= '+'
20. suspend ::= '-'
21. tagIdentifier ::= qualifiedMarkupName layerSuffix?
22. qualifiedMarkupName ::= ( namespaceIdentifier ':' )? localMarkupName
23. localMarkupName ::= nameCharacter+
24. layerSuffix ::= '|' layerInfo ( ',' layerInfo )*
25. layerInfo ::= ( parentLayerId? '+' )? layerId
26. parentLayerId ::= layerId
27. layerId ::= nameCharacter+
28. annotation ::= annotationName '=' annotationValue
29. annotationName ::= nameCharacter+
30. annotationValue ::= stringValue | numberValue | booleanValue | richTextValue | listValue | objectValue
31. stringValue ::= '"' doubleQuotedStringValueCharacter* '"' | "'" singleQuotedStringValueCharacter* "'"
32. singleQuotedStringValueCharacter ::= [^'] | "\'" '\\' # For text inside the stringValue delimiters, only the delimiter used needs to be escaped, plus the escape character itself.
33. doubleQuotedStringValueCharacter ::= [^"] | '\"' '\\'
34. numberValue ::= '-'? digits ('.' digits)? ([eE] [+-]? digits)?
35. booleanValue ::= 'true' | 'false'
36. richTextValue ::= '[>' richText '<]'
37. listValue ::= '[' annotationValue ( ',' ' '? annotationValue )* ']'
38. objectValue ::= '{' annotation+ '}'
39. digits ::= [0-9]+
40. nameCharacter ::= [a-zA-Z] | digits | '_' | '-'
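To see what productions 9 and 10 buy, here is a small Python sketch that scans only the simplest forms of `markupStartTag` (`[name>`) and `markupEndTag` (`<name]`) and reports ranges that overlap without nesting. It is a deliberately reduced reading of the grammar, not a TAGML parser: annotations, layers, milestones, text variation, comments, escapes, and the optional/suspend/resume prefixes are all ignored. The function names and the sample string are invented for illustration.

```python
import re

# Simplified forms of productions 9 and 10: '[name>' opens, '<name]' closes.
# The name character class follows production 40.
TAG = re.compile(r"\[(?P<open>[A-Za-z0-9_-]+)>|<(?P<close>[A-Za-z0-9_-]+)\]")


def markup_ranges(tagml):
    """Return (plain_text, [(name, start, end), ...]) for a TAGML-like string."""
    text_parts, open_tags, ranges = [], {}, []
    offset = 0   # length of plain text collected so far
    cursor = 0   # position in the tagged input
    for match in TAG.finditer(tagml):
        chunk = tagml[cursor:match.start()]
        text_parts.append(chunk)
        offset += len(chunk)
        cursor = match.end()
        if match["open"]:
            open_tags.setdefault(match["open"], []).append(offset)
        else:
            name = match["close"]
            if not open_tags.get(name):
                raise ValueError(f"end tag without start tag: {name}")
            ranges.append((name, open_tags[name].pop(), offset))
    text_parts.append(tagml[cursor:])
    return "".join(text_parts), ranges


def overlapping_pairs(ranges):
    """Pairs of ranges that intersect without one containing the other."""
    pairs = []
    for i, (name_a, start_a, end_a) in enumerate(ranges):
        for name_b, start_b, end_b in ranges[i + 1:]:
            first, second = sorted([(start_a, end_a, name_a),
                                    (start_b, end_b, name_b)])
            if first[0] < second[0] < first[1] < second[1]:
                pairs.append((first[2], second[2]))
    return pairs


# Invented sample: a quotation that starts in one line and ends in the next.
sample = "[line>He said, [q>come<line] [line>with me.<q]<line]"
text, ranges = markup_ranges(sample)
print(text)
print(overlapping_pairs(ranges))
```

Because start and end tags are matched by name rather than by a stack, the first `line` and the `q` range can cross freely, which is exactly what a single XML tree forbids.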
Considering these three examples, or any reported in the appendix, what is the one
thing they have in common (aside from the subject of overlapping markup)? (Sit with
that for a moment.)
Have you ever seen a Bible, a commentary on any book of the Bible, a critical edition
of a Bible, that uses any of these languages for consumption by the reader? And yet,
those texts embody all the richness of texts, without resort to such mechanisms. That
is to say, the languages of scholars aren’t broken or deficient, but we have rushed in
with repairs for our languages instead of listening for theirs.
Note: Why this paper is a mess
Gentle reader, this is where my paper blew up while writing my slides. I discovered I was committing
the same error I caution against, that is I was offering my language for a text model, which is the same error we as digital humanists have been committing
for decades. Apologies for the hasty citations, all will be repaired in the final
I encountered TypeDB during one of my irregular sweeps for hypergraph software. TypeDB
is of particular interest because of its use of an Entity-Relationship-Attribute model,
where attributes are first-class citizens and relationships have roles.
TypeDB has an impoverished definition of entity:
An entity may be defined as a thing capable of an independent existence that can be
uniquely identified. An entity is an abstraction from the complexities of a domain.
When we speak of an entity, we normally speak of some aspect of the real world that
can be distinguished from other aspects of the real world.
I prefer:
anything whatsoever, regardless of whether it exists or has any other specific characteristics,
about which anything whatsoever may be asserted by any means whatsoever (TMDM)
It doesn’t damage the model and does free up the use of entity-relationship modeling
for something more than the real world.
The focus in TypeDB development (and true for other hypergraph software) is on modeling
a domain using the language of users and not a language invented by developers, or even markup language specialists. That
is, we learn the language of the domain and use it to create labels for entities,
relationships between entities, along with attributes recognized by users for both.
Don’t be frightened; it has been done successfully in a number of domains.
Modeling the Greek New Testament, Without Syntax
My original demonstration was going to use a loader to take a CSV file with Greek
New Testament data and enter it into a TypeDB database. But unlike me, you have already
spotted the betrayal of the central theme of this paper. I don’t want to impose or
recommend a syntax, such as CSV, so much as advocate for abstract modeling of a text,
however it happens to be encoded. It’s the “my language versus your language” trap, the one that has kept so much information locked in free text form. (For consumption
by statistical idiots.)
For example, here is the first line of the Gospel of John, at least according to the
Nestle 1904 text (in part, there are many other attributes):
One way to model that single word as an entity would be:
Figure 1: JHN 1:1!1
A single entity representation of the first word in the Gospel of John
While that figure captures the word and location of it in the Gospel of John, it doesn’t
enable us to represent variations on that text. What witnesses support that reading
of the text? What has been said about witnesses to that text? Or a host of other details.
Compare the difference if we model the references to the biblical text, long held
standard by biblical scholars, as entities and then create an n-ary relationship
(permitted in hypergraphs) to represent the text as:
Figure 2: JHN 1:1!1
An n-ary representation of the first word in the Gospel of John, according to the Nestle1904
With the second representation, any number of n-ary relationships with distinct text
or witnesses components can all point at the entity representing the position of JHN
1:1!1. We can query for not only all the texts said to occur at that position, but
we can also find the witnesses for any particular text at that position. Or we can
ask for all the positions in the text where that term appears. To say nothing of other
relationships, being represented in the languages of other biblical disciplines, including
cognitive linguistics.
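The three queries just described can be sketched in plain Python, without committing to TypeDB's own schema or query language. Each `Reading` row below plays the part of one n-ary relationship (a hyperedge) joining a position, a text, and a witness. The class, the function names, the `WitnessX` rows, and the `EXAMPLE 0:0!1` position are all invented for illustration; only the position JHN 1:1!1 and the Nestle1904 witness come from the discussion above, and Ἐν is the opening word of John in that edition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One n-ary relationship: a witness attests a text at a position."""
    position: str   # role: location in the scholars' reference system
    text: str       # role: the word said to occur there
    witness: str    # role: the source attesting that word


READINGS = [
    Reading("JHN 1:1!1", "Ἐν", "Nestle1904"),
    # Placeholder rows, invented only to make the queries non-trivial.
    Reading("JHN 1:1!1", "<variant>", "WitnessX"),
    Reading("EXAMPLE 0:0!1", "Ἐν", "WitnessX"),
]


def texts_at(position, readings=READINGS):
    """All texts said to occur at a position."""
    return sorted({r.text for r in readings if r.position == position})


def witnesses_for(position, text, readings=READINGS):
    """All witnesses supporting a particular text at a position."""
    return sorted(r.witness for r in readings
                  if r.position == position and r.text == text)


def positions_of(text, readings=READINGS):
    """All positions where a given text appears."""
    return sorted({r.position for r in readings if r.text == text})


print(texts_at("JHN 1:1!1"))
print(witnesses_for("JHN 1:1!1", "Ἐν"))
print(positions_of("Ἐν"))
```

Adding a new witness or a competing reading is one more row; nothing about the existing rows or the position entity has to change, which is the practical difference from the single-entity model in Figure 1.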
That is to say that hypergraphs enable us to harken back to the TEI adage that DTDs
represent some view of a text, but never the only true view of a text. We extend that
by capturing the language and models used by users, not as specified in the arcane
dialect of DTDs.
Listening to Users
Confronting users with yet another language, a language not their own, isn’t a solution.
So, why not take a different tack? Ask users what they want to talk about, what properties
(think attributes) they have, and the relationships they have to other subjects? Including
roles in those relationships.
While that sounds attractive, how does that move data from users into a hypergraph
Rather than solving a problem of our own creation, overlap, we should be listening
to users to capture their vocabularies and models for texts. It’s at least as challenging
as overlap, and to actually listen, contrary to the claims of some programming paradigms,
will be a novelty among users. Who knows? Listening may catch on, even in the digital
