Ariadne's thread

C. M. Sperberg-McQueen

Abstract

Ariadne is a query language intended to be powerful enough to allow domain experts to find interesting passages in their documents, but simple enough for them to learn even if XPath and other expression languages are too complex. Its assumptions about document structure (elements have parents and are at least partially ordered) are compatible with XML and the XPath Data Model but are also compatible with many non-XML models of text; Ariadne could thus serve as a query language for documents with overlapping structures.

This paper describes the design of Ariadne, a small query language intended to make searching structured documents easier for non-technical users. The following sections will describe the origins, goals, and design decisions of the language; present its grammar and semantics; illustrate its application to documents; and describe progress towards a full implementation.

Ariadne may be of interest in part because it is designed to be easier to learn and understand (and to implement!) than other query languages for structured documents. Users who would not be willing to undertake to learn XQuery or even XPath may (it is hoped) find Ariadne understandable and usable. Though designed to be smaller and less capable than XPath, XSLT, or XQuery, Ariadne can be used for relatively powerful searches within structured documents. Because it is smaller and less capable than XPath, XSLT, or XQuery, an Ariadne language processor is readily implemented using XSLT or XQuery. And because Ariadne makes only relatively weak assumptions about document structure, it is applicable to non-XDM text models like concurrent hierarchies, Goddag structures, TAG (Text as graph), multi-colored trees, and even (with mild restrictions) non-graph-based models like (a simple version of) LMNL.

Origins, goals, design decisions

Ariadne has its roots in several places. Among the most prominent are the perceived need for a reasonably powerful search language easier for non-XML-oriented users to learn than XPath and XQuery or XSLT; a desire to recreate the functionality of two query language I used and liked some years ago; the desire for a user-accessible query language to exploit the rich markup in some of the projects I have been involved with; and the desire to enjoy myself by writing language processors in XSLT and XQuery.

Two earlier languages have exerted a strong influence on the functionality and syntax of Ariadne: a system called Arras, and the query language which formed part of the SGML publishing tool DynaText. The name Ariadne was chosen in part because the letter combinations ar and dn recall the names of those two systems.^[1]

The following sections describe this background in slightly more detail.

Arras

In 1980, an associate professor of English at Pennsylvania State University published a book about James Joyce's Portait of the Artist as a Young Man (Smith 1980). In order to identify, trace, and visualize the appearance, repetition, and recurrence of various complexes of images in the work, the author, John B. Smith, had created what might be called an interactive concordance system. Like a printed concordance, his tool allowed him to look for all occurrences of a given word and see them in context. Unlike a printed concordance, Smith's tool was interactive. It allowed the user to specify how much context to give when displaying a word: three words on either side? One word before and five after? The full sentence? The typographic line? Three lines? It also allowed the user to search for multiple words occurring in proximity: fire and ice in the same paragraph, for example.

Smith's system was initially called Rats (Random-Accessible Text System), but he later changed the name to Arras, for which the retronym ARchival Retrieval and Analysis System was formed.^[2] The Arras reference manual is available on the web site of the computer science department at the University of North Carolina at Chapel Hill, to which Smith moved from Penn State in the mid-1980s, and interested readers are encouraged to consult it (Smith 1985). A few sample queries, mostly drawn from the manual, will suffice to illustrate some of the program's salient capabilities.

The core functionality is to display occurrences of words, in context. The following command displays occurrences of the word fire in the default context of one sentence.

display concordance: fire.

In an attempt to make the system easier to use, Smith allows noise words, which are ignored, and allows keywords to be abbreviated. So the command just given could also be given in these and other alternative forms

please display a concordance for the word: fire.

disp conc: fire.

A different amount of context can also be specified, for example five words on either side:

display concordance: fire; context: -5 to 5 words.

As may be seen, left and right context are specified independently. Context may be specified in several different logical units (word, sentence, paragraph, chapter) as well as in several physical units (line, page, volume).

To allow the user to identify images like those of fire or water, Arras allows the user to define categories, which in the simple case are just sets of word forms. To find all images of fire, a user might specify a number of word forms related by etymology or only by meaning:

define category: enflaming, fire, fireconsumed,
  fires, flame, flamed, flames, flaming,
  heat, heated, hot, hotly;
  name: firecat.

The names of categories (here firecat) can be used in display concordance and other commands in the same places as words from the text.

Another facility is the ability to define a configuration, which matches a word occurring within a specified proximity of another word.

configuration:  fire & water; name: ffww.

This defines the configuration ffww as the subset of occurrences of fire which occur within the default context (one sentence, again) of water. The context can be varied, using the same units as mentioned above. The usual Boolean operators are allowed; in addition to and (seen above), or and not can also be used.

configuration:  fire &~ water; name:  ff.

configuration: fire | water; name: fw.

Names of categories and configurations can be used wherever word forms are called for in a command. This allows definitions of complex constellations of words to be built up step by step.

configuration: firecat & watercat;
  context: -50 to +50 words;
  name: fwcat.

There are also facilities for interrogating the list of word forms found in a text, excluding specific word occurrences from a category or configuration, displaying the distribution of a word, category, or configuration across the text (in the form of a bar graph for two-per-cent segments of the text), session management, and so on.

The Arras scanner relied on simple markup in the input to know where volume, chapter, and paragraph boundaries fall. Line and sentence boundaries it recognized by itself using simple rules, with markup available for overriding the default decisions.

The text model exposed by Arras is essentially that of a text as a sequence of words, with the sequence partitioned into sentences, paragraphs, chapters, lines, and volumes. As may be seen, these units form both a logical hierarchy (chapters, paragraphs, sentences) and a physical hierarchy (volumes, pages, lines). The ability to specify proximity using any of these structural units makes Arras very useful for those interested in the close study of a text.

It may be noted in passing that while Arras provides ways of looking for passages in a text which meet certain criteria, it does not provide any mechanism for creating new textual objects.

DynaText query language

The second major influence of Ariadne is the textual query language which formed part of DynaText. I won't rehearse in detail the history of that product in detail here, since many others who attend Balisage know it as well as I, and some know it better. I will only observe that DynaText was a system for publishing SGML documents in electronic form, which allowed users to switch among multiple stylesheets for different views of the data, provided good hypertext linking, and offered a simple and convenient textual search interface. The query language was used both as part of the end-user interface and as part of the infrastructure: as part of the creation of an electronic book, DynaText publishers could also provide simple search forms, to elicit values which could be plugged into a complex search expression as parameters. So power users could use the query language to create complicated queries which others could use without learning the query language.^[3]

I won't attempt a full account of the DynaText query language (which for brevity I'll call DTQL from now on), but some examples will illustrate key points.

A search can consist of a word or phrase. To search for a word which is also a keyword in the language, it may be enclosed in quotation marks:

programming language

"containing"

Equally important, since DynaText is designed to allow the exploitation of markup, a search can consist of an element type's generic identifier, distinguished from a word to be sought in the text by being enclosed in angle brackets:

<title>

Context can be specified using the keyword inside or in, optionally modified by directly:

programming language inside <title>

<title> directly inside <book>

The first finds occurrences of the phrase programming language occurring inside a title element. The second finds title elements occurring as children of a book element.

Conversely, content can be specified using the keyword containing, also optionally modified by directly:

<chapter> directly containing <title>
  containing syntax

All operators in DTQL are right-associative, so the search just given is equivalent to the following:

<chapter> directly containing
  (<title> containing syntax)

These searches find all chapter elements within which there occurs (at any nesting depth) a title element which contains the word syntax.

Note that while

<title> containing programming language

and

programming language in <title>

describe the same set of document structures, the value of the first is a set of title elements, while the value of the second is a set of word occurrences. In the interactive interface to DynaText, this difference was reflected in the highlighting of hits and in the hit counts displayed in the table of contents, which was itself typically displayed in a navigation bar and always visible.

Boolean operators are also allowed:

<bibl> not containing <title>

(<element-citation> or <mixed-citation>)
  containing <publisher-name> not containing
  "Elsevier"

<title> containing theory and practice

<title> containing practice
  and not theory

The last two examples illustrate a slightly curious aspect of DTQL's boolean operators: the one search finds title elements containing both the word theory and the word practice, the other finds title elements containing one word but not the other. The right-hand argument of the and is not interpreted as a full query, but as a second operand for the containing operator. This feels natural to native speakers of English (and perhaps other languages), but it is trickier to implement than a naive approach might expect. In particular the query

<title> containing practice and not theory

does not first evaluate the queries <title>, practice, and theory, then construct the complement of the third query (every word that is not theory), then take the set intersection of the occurrences of the word practice and the set of words which are not the word theory, and so on.^[4]

DTQL also has a proximity operator, for which distance is specified in words:

proximity within 5 words of search

markup within 10 words of descriptive or prescriptive

The samples do not exhaust the DynaText query language, although they come close: it is a useful language but gets its power from a very small number of constructs.

Why not XQuery? Or even just XPath?

An obvious reaction to the announcement of yet another query language for documents, especially at Balisage, is to ask why. Why do we need another language for document queries? Doesn't XQuery exist? Or if XQuery is deemed too complicated, why not just use XPath? There are two reasons one might want to design a new query language, even though XPath and XQuery exist.

The first is ease of acquisition by users.

Like many Balisage attendees, I often work with domain experts who are less technically inclined than I am and less tolerant of the pernickety syntax often characteristic of languages designed to be parsed by machine. The XML community is justly proud of the number of domain experts who do not self-identify as programmers but who are able to learn and use XPath and XSLT. But many domain experts never do learn XSLT, or even XPath.

When I build a web interface for a collection of XML documents, one of the first things I like to do is to provide a simple search interface that allows a user to type in a simple XPath expression and see a list of the XDM nodes matched by the expression.^[5] But often, my hopes that the domain experts with whom I work will learn enough about XML and XPath to be able to use it are dashed. As far as I know, when I have made XPath interfaces for collaborative projects, I am the only project participant who ever uses them. From time to time, others in a project have attempted to learn XML, and XPath, and XQuery. But often, domain experts find it unnerving to venture from a field in which they are experts into a field in which they are novices, and since they typically have achieved a full and rewarding professional life without knowledge of any computer languages, it is understandable both that they don't have a lot of time to focus on learning new languages and that the frustrations known to anyone learning something new and very different will often lead busy people to abandon the attempt.

It is unlikely that Ariadne is a silver bullet for this problem.

It is quite possible that some users will resist any query interface involving anything recognizable as syntax or keywords. The logical consequence of such syntax phobia among users is a search interface with a single box in which to type a query, and no visible rules for how to type it, such as is now familiar to any user of Google, Bing, DuckDuckGo, or almost any other web search engine.

But anyone who has used a single-box search interface to look for structural patterns in documents will remember how disappointing the results usually are. A search for a chapter whose title includes the word syntax, in a book whose title includes the phrase programming language, is one thing. A search for a chapter whose title includes the words programming language, in a book whose title includes the word syntax, is something rather different.^[6] A query interface based on a bag-of-words document model, with no knowledge of chapters, books, or titles, cannot comfortably capture the difference.

One of Ariadne's goals is to serve users who don't feel comfortable with most artificial languages but who do understand how their documents are tagged, and who have some interest in being able to use structural information in queries.

The second reason for not just using XQuery or XPath is: trees.

XPath and XQuery and XSLT are designed to work on XML documents, and specifically on XML documents viewed through the XPath Data Model (XDM), or more generally any information that can be viewed through that model, whether its serialized form is XML or something else. Since I was not involved in the initial design of any of those languages, I can say without immodesty that they do a good job.

But it is well known that documents often exhibit structural patterns that are not tree-shaped, or at least not shaped like a single tree. A bewildering number of alternate proposals have been made, for an equally bewildering variety of text models. (The number of models is slightly lower than the number of proposals, since in some proposals appear on closer examination to be variations on the same basic idea.)

It would, I think, be helpful for users of the text-as-graph model, the LMNL model, the Goddag model, the concurrent-hierarchies model, if it were possible to formulate queries against those models. Of course, if as sometimes happens the actual documents are stored in XML, the documents can be queried in their XML form using XPath and XQuery. But models are not just guides for implementors; they are tools for thinking about things. It would be better, when we are using a non-tree-shaped model of text, if we could formulate queries against the model we want to use, instead of translating our queries into some sort of slightly awkward XPath.

One possible caveat to this line of reasoning should be addressed. XPath is indeed defined in terms of the XPath Data Model, but nothing in its syntax requires that the parent axis return a single node. If we were to allow nodes to have multiple parents, we could apply XPath to any document model involving a directed acyclic graph. And if we relaxed other constraints like the rule that no node is its own ancestor, we could apply XPath to cyclic as well as to acyclic graphs. So if we wished to apply XPath to Goddag structures, or to concurrent hierarchies, or to text-as-graph instances, we could do so, at least in principle. And indeed there have been proposals in the past, some of them presented at Balisage or Extreme Markup Languages, to use XPath as an expression language for navigating arbitrary directed graphs.

To anyone who can easily implement the whole of XPath over a non-XDM model, there is clearly no pressing need for a different language. The problem for others is that even in its version 1.0, XPath was large enough to require non-trivial implementation effort. Now, in XPath 3.0 and 3.1, the language has become Turing-complete and implementing it from scratch looks rather daunting. Ariadne's advantage in this context is that in comparison with XPath, it's tiny, and it's possible to imagine a relatively straightforward implementation that does not require programmer-years of effort.

Origins of Ariadne

The project that has turned into the design and specification of Ariadne began life as a pair of low-priority free-time projects to re-implement both Arras and the DynaText query language, as an exercise in nostalgic retro-computing.^[7]

The feasibility of re-implementing Arras became clear to me when I reflected that its text model could be implemented as a single SQL table in which each row contains a token, a word form (the token, stripped of punctuation), a word number, a sentence number, a chapter number, a line number, a page number, and a volume number. Arras commands could in principle be translated into SQL queries. John B. Smith implemented his own inverted-file index, but I was willing to farm all of those details out to a SQL database.

Once the key idea of letting a database management system handle the boring bits like storage details became clear, it immediately became obvious that the Arras text model could also be implemented quite simply in an XQuery database, using either element containment or milestone elements to handle the structural units. Since in Arras all structural units partition the entire text, simple milestones suffice and the more complex mechanisms of Trojan-Horse markup are not needed.

Since DynaText is designed for search and retrieval in SGML documents, it seems self-evident that it should be relatively straightforward to implement a DynaText front end which passes queries to an XQuery back end for evaluation. Since DTQL seems very accessible even to non-technical users, the idea of re-implementing DTQL sometimes seemed more than an exercise in nostalgia: it could be useful to real projects I have been engaged with. But the idea never had high priority, so it never got beyond a few sketches.

Recently, however, whiling away an idle hour with an attempt to figure out a way to describe the behavior of and and or in DTQL cleanly enough to guide an implementation, I realized that DTQL requires a model of text in which two nodes can be in a parent/child relation (or not), or more generally an ancestor/descendant relation, but does not require that the model form a tree. Nothing breaks in the semantics of the query if an element (or a word) can have more than one parent.

At this point I began to think that an implementation might have more practical use than I had first imagined, and that a new query language inspired by Arras and DynaText should have a higher priority than historically accurate re-implementations of the two languages.

Design decisions

The basic design approach of Ariadne was summarized by Steven J. DeRose, the designer of the DynaText query language, thus:^[8]

Give users a very limited and very formulaic subset of English — it's pretty easy to implement (because so limited), yet users can read queries as if they were just English, and can form new ones by analogy — all they really have to learn is to avoid being creative (which, when they try it, provides negative reinforcement for free).

DynaText has four query operators, not counting Booleans, which require special treatment:

INSIDE, optionally modified by NOT and/or DIRECTLY
CONTAINING, optionally modified by NOT and/or DIRECTLY
WITHIN n WORDS OF
WITH (which searches the attribute axis of XDM)

The INSIDE and CONTAINING operators are of course inverses of each other, moving vertically in a document tree.

Ariadne extends this set of operators modestly to allow searching based on document order among siblings or among all nodes and to allow proximity to be measured, Arras-style, in any structural unit, not just words.

It should be noted that the only basic relations required to support these operators are parent-child, prec-next, and element-attribute-value. No assumption is made that a child can have only one parent, so essentially any binary relation on structures in a document can be used as the parent-child relation. If the relation is cyclic, some results may of course puzzle users used to thinking of the parent-child relation as acyclic.

Like Arras and DTQL, Ariadne lacks any means of constructing new values of any kind: all Ariadne expressions denote sets of locations within texts, where from the user's point of view a location is either a tagged element in the text (e.g. a paragraph, title, or chapter) or a word. From the implementation point of view, at document ingestion time the document is tokenized and all words are wrapped in w elements, so for the implementation, all values are sets of locations, and all locations are elements in the input document. This provides a simple uniform model and makes implementation on top of XSLT or XQuery or XPath itself conceptually very simple.

The absence of constructors makes Ariadne strictly weaker than XQuery or XSLT. It cannot be used to perform arithmetic: an expression like 2 + 3 is not an Ariadne query or expression. It cannot be used to make a list of all word forms: Ariadne has no equivalent to the XPath distinct-values() function, and indeed no user-callable functions at all. Its intended users are not programmers who need to be able to specify and perform arbitrary computations, but readers who want to be able to describe and find locations in texts.

For some situations, it is convenient to have some notion of a location which is an appropriate size to return in a list of search results. I'll call such locations chunks. In most modern prose, a sentence, a list item, a heading, or a line of verse will be a reasonably good chunk. Sometimes it might be better to use a larger chunk, for example a paragraph. The only constraint on chunks is that every word in the document must be in some chunk; it may sometimes be desirable to specify further that no chunk contains any other chunk, i.e. chunks must partition the document.

To simplify the use of Ariadne for document owners, an implementation of Ariadne is expected to define some notion of chunk without intervention by the document owner; a simple default is to accept any element containing at least two words as a chunk. Another possible default (up to the implementor) is to break the input document up into sentences or sentence-like objects based on punctuation and capitalization and use them as chunks. A document owner should however be able to specify explicitly which elements are to be treated as chunks. And of course, a document owner can explicitly mark sentences (or sentence-like objects) with an element such as the TEI s element, and thus override the default behavior of the implementation. The same goes for tokenization into words: if the default behavior is suitable, the user need do nothing special, but if the user wants another behavior, they can control the behavior of an Ariadne search engine by using markup to identify what should be treated as words.

The design principles of Ariadne are thus:

Queries should be a very limited, very formulaic subset of English.
Query operators should allow qualification of searches both vertically (by context and content) and horizontally (by relative position in document order).
Proximity searching should be possible using words or any structural elements as units.
The language must be simple to implement.^[9]
Ariadne should be applicable to any model of text in which structural units can be construed as having a parent-child relation and an ordering. Just as structural units may have multiple parents, they may also have multiple immediately following siblings.
It is acceptable for searching of a document to be possible only after that document has been ingested in some way (indexed, scanned, imported, ...); in practice it is in the process of import that the system will perform services like recognition of implicit markup of words, lines, and sentences.
Anything the indexing system does without markup (e.g. the identification of words or sentences) should be done by injecting markup in the document. The user should be able to change the relevant behavior by supplying explicit markup for the phenomena in question, for example by exporting the document, changing the markup, and re-importing the document.
The text model used by Ariadne should be compatible with (i.e. its assumptions and constraints should where possible be a subset of) as the most prominent non-XDM document models: multi-colored trees, concurrent hierarchies, text as graph, and Goddag structures. Compatibility with LMNL is a desideratum.

One topic which has been ruled out of scope for this version of Ariadne is the ability to qualify searches by a particular hierarchy, analogous to the qualification of XPath axes with colors in Colored XML. It may be helpful, but I do not currently understand the requirements well enough to design it. It is hoped that experience with Ariadne will help to indicate whether extension in this direction is needed and what kind of extension would be helpful.

Examples

At heart, Ariadne is an expression language, with basic expressions which can be combined and re-combined using different operators. The value of any expression is a set of locations in the text, which are here referred to sometimes from the user view as locations and sometimes from the formal view as nodes; in an Ariadne implementation those two terms are extensionally equivalent. The locations in the value of an expression are said to be matched by or found by the expression.

Since the purpose of formulating expressions is normally to find locations in the text which match the expression, expressions will in what follows sometimes be referred to as searches or queries. No difference in meaning is intended by this variation in terminology.

Structurally, some expressions are basic expressions and others are compound expressions. Compound expressions are formed using operators to combine subordinate expression. Ariadne has two kinds of operators: filters and Boolean operators. As their name suggests, filters are used to eliminate items from a search result. Their first or left-hand argument matches a set of locations in the text, and the effect of the operator is to select some of those locations and eliminate others, based on the value of the second or right-most argument. Filters may be said to qualify a search. Boolean operators combine sets of results in familar ways: an AND denotes the intersection of two sets, an OR the union, and NOT typically denotes set subtraction.

Basic expressions

Like DTQL, Ariadne has two fundamental forms of expression, one for words (or strings) and one for (structural) elements:

<author>

Heinlein

The first of these finds all author elements in the text; the second finds all occurrences of the token Heinlein.

Expressions using filters

Context filter (`INSIDE`)

Word and element queries can both be qualified by context:

ward inside <heading>

<head> directly inside <div> directly inside <body>

<p> inside <subsection>

These have the same value in Ariadne as in DTQL. The first finds occurrences of the word ward occurring anywhere inside a heading element; the second finds head elements which are directly inside div elements (i.e. with no intervening elements) which are themselves directly contained by body elements. The third finds p elements occurring withing subsection elements.

In an XML context, these will be roughly equivalent to the following XPath expressions.^[10]

//text()[contains(., 'ward')][ancestor::heading]

//head[parent::div[parent::body]]

//p[ancestor::subsection]

Content filter (`CONTAINING`)

As in DTQL, element queries (but not word queries) can be qualified by the content of the element. For example:

<chapter> directly containing
  <title> containing syntax

This is the equivalent of the XPath expression

//chapter[title[contains(., 'syntax')]]

Using qualification by context and by content, the two examples of structure-aware full-text search given earlier can be expressed readily in Ariadne or in DTQL:

programming language in <title> directly inside
  <chapter> inside <book> directly containing
  <title> containing syntax

syntax in <title> directly inside <chapter>
  inside <book> directly containing <title>
  containing programming language

Left-right filters (`PRECEDED BY`, `FOLLOWED BY`)

Just as the keywords INSIDE and CONTAINING search vertically on the equivalent of the XDM parent, child, ancestor, and descendant axes, the keywords FOLLOWED BY and PRECEDED BY search horizontally on the equivalent of the XDM axes preceding, preceding-sibling, following, and following-sibling.

<theorem> not preceded by <axiom>

This finds all theorem elements which occur in the document before the first occurrence of an axiom element.^[11]

<speech> directly followed
  by sibling <stage-direction>

This finds all speech elements whose next sibling is a stage-direction element.

The keywords PRECEDED BY and FOLLOWED BY are interpreted by reference to document order; unlike XDM, the Ariadne model does not require that the ordering be total or unique. There may be multiple orderings; the value of an expression of the form x FOLLOWED BY y is the set of all values of x which are followed in some ordering by some value of y. If multiple orderings apply in a document, it might be helpful to be able to specify which ordering to use, but as noted earlier this is out of scope for the current version of Ariadne.

As may be seen, these operators may be modified by DIRECTLY, as well as by NOT, which is not illustrated, and by SIBLING, which restricts the search to locations which share a parent.

Attribute filters (`WITH`)

As in DTQL, element searches can be qualified by reference to attributes on the element and their values.

<section> with n = 2

<section> with n > 2

<div> with id null

These search, respectively, for section elements with the attribute-value pair n="2", for section elements with an n attribute whose value, when coerced to a number, is greater than 2, and for div elements which have no id attribute.

Word proximity filter (`WITHIN` k `WORDS OF`)

Words can be qualified by proximity to other words:

challenge within 10 words of election

The order of the two words can be prescribed:

challenge followed within 10 words by election

challenge preceded within 10 words by election

As mentioned elsewhere, word proximity is treated formally as a special case of element proximity. In the queries just shown, the keyword WORDS is syntactic sugar for the expression <w> elements . So the queries shown are equivalent to the following, and they may indeed by translated into the following forms internally before evaluation:

challenge within 10 <w> elements of election

challenge followed within 10 <w> elements by election

challenge preceded within 5 <w> elements by election

Element proximity filter

As in Arras, proximity can be measured in structural units other than words:

fire within 5 <l> elements of ice

This query finds all occurrences of the word fire whose distance from some occurrence of the word ice is fewer than five l elements. Distance is measured by counting elements within the ordering assigned to elements in a purely mechanical way. If fewer than five l elements begin between some occurrence of fire and some occurrence of ice, measured in some ordering of l elements, then the filter retains that occurrence of fire.

This method of measuring proximity has the consequence that if two elements are within the same l element, they have distance 0, measured in l elements:^[12]

fire within 0 <l> elements of ice

This returns all occurrences of fire which occur within the same l element as an occurrence of ice. It also returns all occurrence of fire which are not inside an l element, as long as there is also an occurrence of ice which is also not inside an l element and as long as there is no l element between them.^[13]

The elements in question need not be contiguous:

Borges within 4 <footnote> elements of Barthes

This query includes all occurrences of the name Borges if some occurrence of the name Barthes is found with fewer than five footnote elements between them.

Parentheses

The qualifying operators are right-associative, so multiple qualifications on a single query require parentheses. The following two queries are equivalent: each filters the set of p elements to include only those inside a subsection element and only those containing the word appointed.

(<p> inside <subsection>) containing appointed

(<p> containing appointed) inside <subsection>

Note, however, that since containment is a transitive relation, the latter formulation is equivalent to the unparenthesized form

<p> containing appointed inside <subsection>

Expressions with Boolean operators

Boolean operators in right-hand argument of qualifier

In English and other natural languages, conjunctions are sometimes tricky to parse. Given a sentence beginning Alice invited Bob and Carol ..., it is not clear without further context whether the conjunction and joins two clauses (Alice invited Bob and Carol invited Dave) or two names (Alice invited Bob and Carol to dinner with her and Dave).

The goal of an English-like query syntax means that Ariadne should if possible match users' expectations. Given a query like the following, many users will not perceive the or as ambiguous.

(<p> inside <subsection>) containing
  appointed or appoint or appointment

They will (or so I conjecture) expect the result to contain the set union of the values of the following three queries:

(<p> inside <subsection>) containing appointed

(<p> inside <subsection>) containing appoint

(<p> inside <subsection>) containing appointment

They will emphatically not expect the query result to be a mixture containing all the occurrences of appoint and appointment together with the set of paragraphs inside subsections which contain the word appointed.

To achieve this interpretation, Ariadne treats Booleans slightly differently depending on where they appear. When Boolean operators are found in the right-hand argument of any filter, the qualification is applied to each argument of the Boolean and the results are combined as indicated. Grammatically, this is mostly a matter of ensuring that the Boolean operators bind more tightly than the filter operators, so that x CONTAINING y AND z is parsed like x CONTAINING (y AND z) and not like (x CONTAINING y) AND (z).

Semantically, it's a bit more complicated than that; the filter must be applied to each item in the compound argument separately, before the results are combined as indicated by the Boolean operator.^[14]

For example, the value of the query

<p> containing fire and water and not ice

is the set of p elements which (1) contain the word fire, and (2) contain the word water, and (3) do not contain the word ice. That is to say, it is the intersection of the values of the three queries

<p> containing fire

<p> containing water

<p> not containing ice

This treatment of Boolean operators follows that of DTQL.

Boolean operators between filters

If a Boolean connector is found between qualifiers, then the two qualifiers share the left-hand argument. The query

<p> inside <subsection> and containing appointed

finds all p elements which both (1) occur within a subsection element and (2) contain the word appointed. It is equivalent to the form

(<p> inside <subsection>) containing appointed

in which the CONTAINING filter is applied to the results of applying the CONTAINING filter.

The Boolean operator or can also be used.

(<p> inside <subsection>) containing appointed
  or containing appoint or containing appointment

This has the same meaning as the query above in which the second and third occurrences of containing are omitted.

This treatment of Boolean operators is intended as a generalization of the treatment of Booleans described in the previous section. It is not clear whether DTQL accepts Boolean operators between filters.

Boolean operators (top-level)

Word queries can be combined with boolean operators:

Heinlein and Robert

Heinlein or Bradbury

Heinlein and not Bradbury

It is slightly challenging to provide a plausible interpretation for these which is consistent both with normal usage of the Boolean operators and with user expectations formed by expressions like those given in the preceding sections. If AND is interpreted as denoting the intersection of the values of its two arguments, and NOT as denoting the complement of the value of its argument, then the first query shown must denote the empty set, since there are no words which are both the word Heinlein and the word Robert. The third query will, given these interpretations, denote the set of locations which are both (1) occurrences of the word Heinlein and (2) not occurrences of the word Bradbury. Neither of these is a satisfactory interpretation for expressions which appear perfectly idiomatic to many users.^[15]

Ariadne interprets top-level queries Heinlein and Robert Heinlein or Bradbury Heinlein and not Bradbury like those shown as abbreviated forms of x CONTAINING y op z, where x denotes some smallish region of text, like a paragraph. It is for this reason that Ariadne introduces the idea of a chunk. The searches shown above find chunks which contain both Heinlein and Robert; contain either Heinlein or Bradbury; or contain Heinlein but do not contain Bradbury. If all chunks are p or q or bibl elements, then these Boolean searches are equivalent to

(<p> or <q> or <bibl>) containing (<p> or <q> or <bibl>) containing (<p> or <q> or <bibl>) containing
                     This is one of two places where the semantics of
                        	    Ariadne appeals to the notion of
                        	    chunk. The other place is that in an
                        	    Arras-style interactive interface which displays lists of
                        	    hits, the default context for a hit will be one chunk.

Grammar

The basic form of the language is straightforward: a query is a basic query, followed by zero or more qualifications using the filter operators. Since some filters apply only to one kind of query, word queries and element queries are distinguished syntactically. Also, queries can be combined using Boolean operators, as can filters.

The grammar will be given in ixml notation, for which see Pemberton 2013 and Pemberton 2021.

Starting at the top: a query is either a word query, an element query, a set of basic word queries joined by Boolean operators, or a set of basic element queries joined by Boolean operators.

query:  single-query | multi-query.
single-query:  word-query | element-query.
multi-query: base-word+boolean
      | base-element+boolean.

Each kind of query consists of a base query followed by zero or more filters, connected by Booleans.

word-query:  base-word, (s, word-filter*boolean)?.
element-query:  base-element, (s, element-filter*boolean)?.

A basic word query is sequence of tokens, or a parenthesized word query:

base-word:  tokens
      | '(', s?, word-query, s?, ')'.
tokens: token+s.
token:  word | pattern | string.

A word is a sequence of word characters or a quoted string. Ambitious implementations may also provide either simple glob-style patterns using asterisk and question mark or regular expressions. Those are not described here, though they are likely to be important in practice.

A basic element query is a generic identifier enclosed in angle brackets. As in XML, we allow whitespace before the closing angle bracket.

base-element:  '<', name, s?, '>'
      | '(', s?, element-query, s?, ')'.

Word queries may be filtered by word proximity, by element proximity, horizontally (by left-right context), or vertically (by ancestor-descendant context). Element queries may be filtered by attribute, by element proximity, horizontally, or vertically.

word-filter:  word-proximity-filter 
      | element-proximity-filter
      | left-right-filter
      | context-filter. 
element-filter:  element-proximity-filter
      | left-right-filter
      | context-filter 
      | content-filter
      | attribute-filter.

Word proximity filters take the form w₁ within k words of w₂ , or w₁ followed within k words by w₂ , or w₁ preceded within k words by w₂ .

word-proximity filter:
        NOT?, WITHIN, n, WORDS, OF, word-query 
      | NOT?, PRECEDED, WITHIN, n, WORDS, BY, word-query 
      | NOT?, FOLLOWED, WITHIN, n, WORDS, BY, word-query.

Element proximity filters are essentially the same, but specify an element type instead of using the keyword WORDS.

element-proximity filter:
        NOT?, WITHIN, n, units, OF, single-query 
      | NOT?, PRECEDED, WITHIN, n, units, BY, single-query 
      | NOT?, FOLLOWED, WITHIN, n, units, BY, single-query.
	units:  base-element, ELEMENTS.

Left-right filters use the keywords PRECEDED and FOLLOWED, but no proximity measurements. They can be preceded by NOT and DIRECTLY.

left-right-filter:
        NOT?, DIRECTLY?, PRECEDED, BY, single-query
      | NOT?, DIRECTLY?, FOLLOWED, BY, query.

Context filters use the keyword INSIDE.

context-filter:  NOT?, DIRECTLY?, INSIDE, element-query.

Content filters use the keyword CONTAINING:

content-filter: NOT?, DIRECTLY?, CONTAINING, single-query.

Attribute filters use the keyword WITH.

attribute-filter:  WITH, attribute-test.
attribute-test:  name, NOT?, NULL
      | name, NOT?, comparator, value.
comparator: "=" | "<" | ">" | "<=" | ">=".
value: token.

Omitted here are rules for the keywords NOT, DIRECTLY, PRECEDED, etc., which are defined to accept either lowercase or uppercase spelling and to allow or require surrounding whitespace.

An XML representation follows directly from the ixml specification; experience may lead to a reformulation of the grammar in order to get a different XML representation.

Semantics

For each form of expression E, we give a rule for finding its value v(E), typically in terms of a set comprehension rule.

In describing the forms of expressions, we write

w for a base word,
gi for a base element,
x for an expression which may be either a base word or a base element,
wq for a word query,
eq for an element query,
sq for any simple query (word query or element query)

In describing values, we write n for locations (or nodes) in the document, which can be either word nodes or element nodes.

A variety of predicates and relations holding among nodes, strings, names, and integers will be introduced as needed.

The only function used is v(); all other terms are relations. For example, we write name(e, gi) to say that the name of a particular element instance e is gi. In XPath, we would normally write name(e) = gi (or name($e) = $gi), but the formulas are easier to follow if there is never any doubt about whether a given term is a function call or a relational predicate. If the functor is v, it's a function call denoting a set of nodes; otherwise, it's a predicate which is either true or false.

Basic searches

Base-word searches

A search for a single word form finds all occurrences of that form in the document.

v(w) = { n | word(n) ∧ form(n, w) }

That is: the value of a search for word w is the set of all locations n in the document such that n is a word node and w is the word-form of n.

The predicate word is true if and only if its argument is a word-node.

The relation form is true for a word-node and its word form (a string).

A search for multiple words (e.g. programming language) looks for adjacent tokens.

v(w₁ w₂) = v(w₁ DIRECTLY FOLLOWED BY w₂)

The pattern w₁ DIRECTLY FOLLOWED BY w₂ is defined below.

Basic searches connected by Booleans

As noted earlier, in a query like

<title> containing fire and ice

the Boolean operator can be thought of as connecting two arguments for the filter, not two full queries. (It may be thought of as analogous to the and connecting the parts of a compound direct object, rather than to the and connecting two clauses in a compound sentence.)^[16]

Some queries may connect basic searches with Booleans, like this one:

fire and ice

In order to provide some meaning for such queries, these are interpreted as denoting the set of chunks which contain both the form fire and the form ice. If chunks can nest, then only the smallest chunks which satisfy the constraint are returned.

This special treatment is applied only to Boolean combinations of basic queries, which the grammar identifies as multi-queries.

A multi-query on words finds all chunks in the document containing that combination of words; it can be understood as performing set operations on the sets of chunks identified by the individual base queries.

For any multi-query mq, we shift to a special evaluation procedure:

v(mq) = vmq(mq)
vmq(x AND mq) = v(CHUNK CONTAINING x) ∩ vmq(mq)
vmq(x OR mq) = v(CHUNK CONTAINING x) ∪ vmq(mq)
vmq(x NOT mq) = v(CHUNK CONTAINING x) \ vmq(mq)
vmq(x AND NOT mq) = v(CHUNK CONTAINING x) \ vmq(mq)

As a base case, when all the occurrences of AND, OR, and NOT have been dealt with, and we are dealing with the simple query which is the last (right-most) element of the multi-query, we interpret it as if it were a search for a chunk containing whatever is specified by the simple query. So Heinlein, for example, is interpreted as if it were CHUNK CONTAINING Heinlein, and <author> as if it were CHUNK CONTAINING <author>.

vmq(sq) = v(CHUNK CONTAINING sq)

This requires that we define the pattern CHUNK CONTAINING x:

v(CHUNK CONTAINING w) = { n₁ | chunk(n₁) ∧ (∃ n₂)(n₂ ∈ v(w) ∧ ancestor-descendant(n₁, n₂)) }

Base-element searches

A basic element search finds all elements in the document with that generic identifier.

v(gi) = { n | element(n) ∧ name(n, gi) }

Filtered searches

Word-proximity filters

A word proximity query finds all instances of the left-hand argument which are within the designated distance of some instance of the right-hand argument.

v(w WITHIN k WORDS OF wq) = { n | n in v(w) ∧ (∃ n₂) (n₂ ∈ v(wq) ∧ (∃ d₁, d₂) (distance(n₁, n₂, w, d₁) ∧ num-abs(d₁, d₂) ∧ d₂ ≤ k)) }

Here distance(n₁, n₂, gi, d) is a relation holding among two nodes n₁ and n₂, the generic identifier of an element type serving as a unit of measure (here the w element), and d, an integer denoting the directed distance between n₁ and n₂, measured in those units. If d is positive then n₁ precedes n₂, if negative then n₁ follows n₂. So d is the distance from n₁ to n₂, moving forward in the text.

The num-abs relation holds between a number and its absolute value.

In prose: a search for some word w within k words of some word found by word-query wq finds all nodes (or: locations) n, such that n is in the value of w and there exists some second node n₂ in the value of the word-query wq such that there is some distance d, measured between n and n₁ in w elements, such that d is less than or equal to k.

The reader may find the formulation slightly puzzling, expecting in the last part of the set comprehension formula to see a formulation more like

... abs(distance(n₁, n₂, w)) ≤ k

with a meaning like the absolute value of the distance from n₁ to n₂ is less than or equal to k .

The first reason for this is that in this alternate formulation, abs and distance are functions, not relations.

The more important reason is as mentioned earlier: We would like to avoid the assumption that there is always a single distance between two words. For example, when a note or note reference^[17] occurs immediately following a word w, what is the next word after w? Is it the word after the reference? or is it the first word of the note? Viewed another way: what is the distance between w and the word following the reference? In this paragraph, what is the distance between the word reference and the word occurs? The formulation given above is intended to allow for the possibility that there are two distances, which depend on the route taken: if the note is skipped and we remain in the main text, then the distance is 1. If we follow the reading route through the note, and it has (as here) three words, then the distance is 4. What is the distance between two words? It depends on which route you take. An Ariadne implementation may adopt a distance measure in which there is only ever a single distance between two points — prescribing a single route between them, so to speak — but it is not obligated to do so.

Non-deterministic proximity measures can be helpful not only in cases of paratextual structures like footnotes or running heads, but also in cases where textual variants are encoded in a single document, whether the document is a critical edition, a variorum edition, or a normative document (technical specification, legislation, ...) in which changes from a previous published version are marked to simplify displays that show all changes in context. In a single version of the text, two word tokens are within k words of each other if there are at most k - 1 words between them. In a multi-version representation the text, two words are within k words of each other if there is some version of the text in which they are within k words of each other.

The use of relational notation instead of functional notation makes it easier to exclude unconscious assumptions about the uniqueness of certain values.

The ordered proximity filters are similar but do not need to calculate the absolute value of the distance.

v(w PRECEDED WITHIN k WORDS BY wq) = { n₁ | n₁ in v(w) ∧ (∃ n₂)(n₂ ∈ v(wq) ∧ (∃ d)(distance(n₂, n₁, w, d) ∧ 0 < d ≤ k)) }

In prose: the value of a search matching the pattern w PRECEDED WITHIN k WORDS BY wq is the set of all word occurrences n₁ matching w, such that there is some other word occurrence n₂ and some distance d, such that n₂ matches the word query wq and d is a left-to-right distance (not the distance) from n₂ to n₁, measured in words, and d is positive and d is less than or equal to k.

The FOLLOWED keyword is the mirror image: the only difference is the order of arguments in the distance relation.

v(w FOLLOWED WITHIN k WORDS BY wq) = { n₁ | n₁ ∈ v(w) ∧ (∃ n₂)(n₂ ∈ v(wq) ∧ (∃ d)(distance(n₁, n₂, w, d) ∧ 0 < d ≤ k)) }

The set comprehensions for the NOT operator have a similar structure, but the existential quantifier over the second node n₂ is negated.

v(w NOT WITHIN k WORDS OF wq) = { n | n ∈ v(w) ∧ ¬(∃ n₂)(n₂ ∈ v(wq) ∧ (∃ d₁, d₂)(distance(n₁, n₂, w, d₁) ∧ num-abs(d₁, d₂) ∧ d₂ ≤ k)) }
v(w PRECEDED WITHIN k WORDS BY wq) = { n | n ∈ v(w) ∧ ¬ (∃ n₂)(n₂ ∈ v(wq) ∧ (∃ d)(distance(n₂, n₁, w, d) ∧ 0 < d ≤ k)) }
v(w FOLLOWED WITHIN k WORDS BY wq) = { n | n ∈ v(w) ∧ ¬(∃ n₂)(n₂ ∈ v(wq) ∧ (∃ d)(distance(n₁, n₂, w, d) ∧ 0 < d ≤ k)) }

These find the set of all words which match the left argument w for which some node in the value of the word query wq is within k words; when PRECEDED or FOLLOWED is used, the distance is measured only in the indicated direction.

Element-proximity filters

The element-proximity filters have a very similar structure to the word-proximity filters; the only difference is that the hardcoded w is replaced by a reference to the specified generic identifier.

v(x WITHIN k gi ELEMENTS OF sq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(sq) ∧ (∃ d₁, d₂)(distance(n₁, n₂ gi, d₁) ∧ num-abs(d₁, d₂) ∧ d₂ ≤ k)) }
v(x PRECEDED WITHIN k gi ELEMENTS BY sq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(sq) ∧ (∃ d)(distance(n₂, n₁, gi, d) ∧ 0 < d ≤ k)) }
v(x FOLLOWED WITHIN k gi ELEMENTS BY sq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(sq) ∧ (∃ d)(distance(n₁, n₂, gi, d) ∧ 0 < d ≤ k)) }
v(x FOLLOWED WITHIN k gi ELEMENTS BY sq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(sq) ∧ (∃ d)(distance(n₁, n₂, gi, d) ∧ 0 < d ≤ k)) }
v(x NOT WITHIN k gi ELEMENTS OF sq) = { n₁ | n₁ ∈ v(x) ∧ ¬(∃ n₂)(n₂ ∈ v(sq) ∧ (∃ d₁, d₂)(distance(n₁, n₂ gi, d₁) ∧ num-abs(d₁, d₂) ∧ d₂ ≤ k)) }
v(x NOT PRECEDED WITHIN k gi ELEMENTS BY sq) = { n₁ | n₁ ∈ v(x) ∧ ¬(∃ n₂)(n₂ ∈ v(sq) ∧ (∃ d)(distance(n₂, n₁, gi, d) ∧ 0 < d ≤ k)) }
v(x NOT FOLLOWED WITHIN k gi ELEMENTS BY sq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(sq) ∧ (∃ d)(distance(n₁, n₂, gi, d) ∧ 0 < d ≤ k)) }

Left-right filters

The filters PRECEDED BY and FOLLOWED BY appeal to an underlying relation preceding-following which holds for all pairs (x, y) such that x precedes y in the document.

In a totally ordered document like an XDM instance, for any pair (x, y) we have either preceding-following(x, y) or preceding-following(y, x) or x = y. In a partially ordered document, it can be the case that none of these apply. In a multiply ordered document (for example, a document that includes multiple orderings of its nodes), we may have both preceding-following(x, y) and preceding-following(y, x) — that is, x may both precede and follow y.

v(x PRECEDED BY sq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(sq) and preceding-following(n₂, n₁)) }
v(x FOLLOWED BY sq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(sq) and preceding-following(n₁, n₂)) }
v(x DIRECTLY PRECEDED BY sq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(sq) and prec-next(n₂, n₁)) }
v(x DIRECTLY FOLLOWED BY sq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(sq) and prec-next(n₁, n₂)) }
v(x NOT PRECEDED BY sq) = { n₁ | n₁ ∈ v(x) ∧ ¬(∃ n₂)(n₂ ∈ v(sq) and preceding-following(n₂, n₁)) }
v(x NOT FOLLOWED BY sq) = { n₁ | n₁ ∈ v(x) ∧ ¬ (∃ n₂)(n₂ ∈ v(sq) and preceding-following(n₁, n₂)) }
v(x NOT DIRECTLY PRECEDED BY sq) = { n₁ | n₁ ∈ v(x) ∧ ¬(∃ n₂)(n₂ ∈ v(sq) and prec-next(n₂, n₁)) }
v(x NOT DIRECTLY FOLLOWED BY sq) = { n₁ | n₁ ∈ v(x) ∧ ¬(∃ n₂)(n₂ ∈ v(sq) and prec-next(n₁, n₂)) }

Context filters

Analogously, the context and content filters rely on relations that we can call parent-child and ancestor-descendant.

v(x INSIDE eq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(eq) ∧ ancestor-descendant(n₂, n₁)) }
v(x DIRECTLY INSIDE eq) = { n₁ | n₁ ∈ v(x) ∧ (∃ n₂)(n₂ ∈ v(eq) ∧ parent-child(n₂, n₁)) }
v(x NOT INSIDE eq) = { n₁ | n₁ ∈ v(x) ∧ ¬(∃ n₂)(n₂ ∈ v(eq) ∧ ancestor-descendant(n₂, n₁)) }
v(x NOT DIRECTLY INSIDE eq) = { n₁ | n₁ ∈ v(x) ∧ ¬(∃ n₂)(n₂ ∈ v(eq) ∧ parent-child(n₂, n₁)) }

Content filters

The content filters are the converse of the context filters; they look down the tree, not up. So the main difference in the definitions is the order of arguments in the reference to the ancestor-descendant and parent-child relations.

v(e CONTAINING sq) = { n₁ | n₁ ∈ v(e) ∧ (∃ n₂)(n₂ ∈ v(sq) ∧ ancestor-descendant(n₁, n₂)) }
v(e DIRECTLY CONTAINING sq) = { n₁ | n₁ ∈ v(e) ∧ (∃ n₂)(n₂ ∈ v(sq) ∧ parent-child(n₁, n₂)) }
v(e NOT CONTAINING sq) = { n₁ | n₁ ∈ v(e) ∧ ¬(∃ n₂)(n₂ ∈ v(sq) ∧ ancestor-descendant(n₁, n₂)) }
v(e NOT DIRECTLY CONTAINING sq) = { n₁ | n₁ ∈ v(e) ∧ ¬(∃ n₂)(n₂ ∈ v(sq) ∧ parent-child(n₁, n₂)) }

Attribute filters

The attribute filters are straightforward.

The first filter tests for an attribute being present or absent.

v(e WITH an NULL) = { n | n ∈ v(e) ∧ ¬(∃ val)(e-an-av(n, an, val)) }
v(e WITH an NOT NULL) = { n | n ∈ v(e) ∧ ¬(∃ val)(e-an-av(n, an, val)) }

In prose: the filter WITH an NULL matches an element just in case it has no value associated with the attribute name an; the variant with NOT matches just in case it does have some value for that name.

v(e WITH an EQ val) = { n | n ∈ v(e) ∧ (∃ v)(e-an-av(n, an, v) ∧ (((∃ i₁, i₂)(str-num(val, i₁) ∧ str-num(v, i₂) ∧ i₁ = i₂)) ∨ ¬(∃ i₁, i₂)(str-num(val, i₁) ∧ str-num(v, i₂) ∧ v = val))) }

In prose: the filter WITH an = val matches an element just in case the element has a value associated with the attribute name an, and the value specified and the value of the attribute are equal either numerically, if both are convertible to numbers, or else as strings.

The other forms differ from the one just given only in using different operators instead of equality, or in adding negation. They will not be given here.

v(e WITH an LT val)
v(e WITH an GT val)
v(e WITH an LE val)
v(e WITH an GE val)
v(e WITH an NOT EQ val)
v(e WITH an NOT LT val)
v(e WITH an NOT GT val)
v(e WITH an NOT LE val)
v(e WITH an NOT GE val)

The Ariadne text model

It may be observed, by inspecting the semantic descriptions offered in the preceding section, that Ariadne searches may be performed on any collection of data that can be represented as a set of nodes, on which a certain number of relatively simple, relatively general relations are defined.

Nodes can be words, or elements. Relevant predicates are:

word(n): true if node n is a word
element(n): true if node n is an element
node(n): true if n is a node
chunk(n): true if node n is a chunk

The word and element predicates are not mutually exclusive: as mentioned earlier, internally Ariadne processors are expected to represent words as w elements. The chunk predicate holds of some elements.

In most obvious cases, all nodes will be either words or elements, but in principle other node types are possible, though they cannot be retrieved using Ariadne as currently defined.

Words have word forms, and elements have names:^[18]

form(n, w): true if node n is a word and w is the (or: a) word form of n
name(n, gi): true if node n is an element and gi is its name

One or more partial orders are defined on the nodes in the document. The prec-next relation holds for nodes which are immediately adjacent in some partial ordering; the preceding-following relation is its transitive closure.

prec-next(n₁, n₂): true iff n₂ immediately follows n₁ in some ordering; this can be thought of a recording a next-pointer from n₁ to n₂, not necessarily the only next-pointer from n₁.
preceding-following(n₁, n₂): true if n₂ can be reached by following a series of next-pointers beginning at n₁ and ending at n₂; the transitive closure of the next relation

Either of these may be derived from the other. If preceding-following is specified and is acyclic, then prec-next can be derived by taking its transitive reduction. If prec-next is specified, then preceding-following may be derived from it by taking its transitive closure; in this case, prec-next need not be acyclic.

A parent-child relation is also defined on the nodes of the document, together with its transitive closure ancestor-descendant. If these are acyclic, then either can be specified and the other derived from it; if they include cycles, then parent-child must be defined and ancestor-descendant derived from it.

parent-child(n₁, n₂): true if n₁ is a parent of n₂
ancestor-descendant(n₁, n₂): true if n₁ is an ancestor of n₂; this is the transitive closure of the parent-child relation

As in SGML or XML, elements may have attribute-value pairs; the comparison operators assume that values are represented as strings, but may denote numbers. Ariadne is agnostic on whether attributes are nodes (as in XDM) or simply properties attached to element nodes.

e-an-av(n, an, val): true if n₁ is an element with an attribute with the name an and the value val.

For proximity searching, the distance relation is crucial. It measures the distance between two nodes, using an arbitrary element type as the unit of measure.

distance(n₁, n₂, gi, d): true if the distance from n₁ to n₂, moving forward in the text, can be measured at d elements of type gi. If the second argument precedes the first, d is negative.

The semantic formulae given above do not assume any particular relationship between the distance relation and the next and preceding-following relations. In order to ensure that searches involving the filters DIRECTLY FOLLOWED BY and FOLLOWED WITHIN 1 ... BY produce plausibly related results, however, it is probably best if they are systematically related.

When all nodes in a document are totally ordered, as in XDM, one possible approach is to define distance in terms of next, along the following lines. One possible definition is as follows. For any nodes n₁ and n₂, and any generic identifier gi, the following hold:

The distance from any node to itself is 0. That is, distance(n₁, n₁, gi, 0).
If prec-next(n₁, n₂) and gi is the name of n₂ (name(n₂, gi)), then the distance is 1: distance(n₁, n₂, gi, 1).
If prec-next(n₁, n₂) and gi is not the name of n₂, then the distance is 0: distance(n₁, n₂, gi, 1).
If there exists some node n₃ such that prec-next(n₁, n₃), and gi is the name of n₃, and the distance from n₃ to n₂ is d, then the distance from n₁ to n₂ is 1 + d.
If there exists some node n₃ such that prec-next(n₁, n3), and gi is not the name of n₃, and the distance from n₃ to n₂ is d, then the distance from n₁ to n₂ is also d.

Operationally, this amounts to walking the path from n₁ to n₂ by following next-pointers, incrementing the count each time an element with the specified generic identifier is encountered.

In a text model with multiple paths from n₁ to n₂, a distance is measured along each path.

A few relations are defined on numbers, strings, and attribute values:

num-abs(num₁, num₂): true if num₂ is the absolute value of num₁.
str-num(v, num): true if string v is in the lexical space of some XSD numeric type, and num is the corresponding value.

If both arguments to the following relations are numbers, or strings that can be coerced to numbers, then the comparisons are numeric; otherwise, the comparisons are string comparisons using an implementation-defined Unicode collation sequence.

equal(v₁, v₂): true if v₁ and v₂ are the same value
lt(v₁, v₂): true if v₁ is less than v₂, numerically if they are both numbers, otherwise lexicographically

Current status, progress towards implementation, and future plans

Ariadne is intended to be simple to implement, but early implementation of a design can have unfortunate effects if properties of the implementation leak out and affect the design. In order to avoid this problem, the author is attempting to work out the design of Ariadne as completely as possible before actually implementing anything.^[19] This has drawbacks, but if all goes well it will help ensure that the query language can be understood without reference to any particular implementation or implementation approach.

At the time this version of the paper was prepared, the language existed only on paper, with some paper sketches of code for a toy implementation. It is expected that a toy proof-of-concept implementation will be completed soon, sufficient to allow experimentation with small documents using different text models and with different ways of reducing the Ariadne text model to a minimal set of relations.

References

[EBT 1996] EBT (Electronic Book Technologies)/Inso Providence Corporation. DynaText Publishing: Document Preparation. Release 3.0. Providence: EBT/Inso Providence Corporation, 1996.

[Pemberton 2013] Pemberton, Steven. Invisible XML. Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013.. Balisage Series on Markup Technologies, vol. 10 (2013). doi:https://doi.org/10.4242/BalisageVol10.Pemberton01.

[Pemberton 2021] Pemberton, Steven. Invisible XML Specification (Draft). 2021-01-28. On the Web at https://invisiblexml.org/ixml-specification.html.

[Smith 1980] Smith, John Bristow. Imagery and the Mind of Stephen Dedalus: A Computer-Assisted Study of Joyce's a Portrait of the Artist As a Young Man. Lewisburg, PA: Bucknell University Press, 1980. 294 pp.

[Smith 1985] Smith, John B. Arras User's Manual. TR85-036 1985. Chapel Hill: University of North Carolina at Chapel Hill, Dept. of Computer Science, 1985. Available online at http://www.cs.unc.edu/techreports/85-036.pdf.

^[1] The association of Ariadne with a method of finding one's way through a complex and confusing structure also doesn't hurt.

Ariadne is a reasonably popular name for technology projects, so it may be desirable to say explicitly here that the query language described in this paper has no relation to any other software, language, project, or firm using the name Ariadne, including but not limited to:

Ariadne Software (http://www.ariadnesoftware.com), the maker of Cool Spools and a provider of solutions for the IBM i (AS 400 iseries) platform;
Ariadne Solutions - Red Thread Software (https://www.ariadnesolutions.com/), the maker of Red Thread, a tool for bioanalytical data review and risk management;
the Ariadne project (https://www.ariadne-eu.org), a program funded by the European Union to integrate existing archeological data infrastructure across Europe, so researchers can use the various distributed datasets and new technologies to explore research methodologies.

^[2] Like many computer-related names at the time, the name Arras was typically written in all caps in printed documentation. For the sake of a visually quieter page I follow the rule of down-casing such names when they are pronounceable.

^[3] The description of the DynaText query language which follows is derived mostly from chapter 12 of the manual DynaText Publishing: Document Preparation (EBT 1996). DynaText was a commercial product and is as far as I know no longer commercially available. This has several consequences. First, I do not now have and have never had access to the source code of the program, and do not now have access to a running copy, so my description of corner cases in the query language cannot be tested against the behavior of the search interface and may need to be taken with a grain of salt. The information in the manual has been augmented in a few details by personal recollection and here and there also by information from Steven J. DeRose, who is thanked for his patience. Second, it must be assumed that someone, somewhere owns the intellectual property rights to the program. In the U.S., I believe my description of the DynaText query language and its influence on Ariadne fall into the category of fair use.

^[4] I'm just saying that someone might try to formalize the meaning of the language that way, and be surprised by the results, not that such a mistake happened to anyone I happen to know personally.

^[5] For example, the ancilla search interface for Trials in the Late Roman Republic (http://tlrr.blackmesatech.com/2016/04/ancilla.xhtml); the initial a of the name ancilla signals that this was the first search interface to be specified and named, though it was not in fact the first one implemented.

^[6] The first may be expected to locate chapters about the syntax rules in books about languages like Pascal, Javascript, or C; the second should find chapters discussing suitable programming languages, in books about the syntax of natural languages. When I first heard it, the example was attributed to the computational linguist Toshio Yokoi.

^[7] Actually, since I still miss Arras's proximity searching thirty-five years after last using it, an Arras implementation might be useful as well as fun. But it was fun, not functionality, that was the key motivator for keeping the project on my someday list. I reiterate my claim that independent re-implementations of Arras and DTQL would fall within the bounds of fair use under U.S. copyright law.

^[8] Steven J. DeRose, personal communication, 10 March 2021.

^[9] I mean, really simple to implement. I mean, simple enough for someone trained in comparative literature to implement.

^[10] The XPath expressions given are only roughly equivalent because the search for the word ward is not a string search, and should not find occurrences of the words reward or award. In an implementation which records its tokenization of words by marking them with w elements, the equivalent XPath will be:

//w[@form = 'ward'][ancestor::heading]

^[11] It may be less useful than it appears at first. If the user wants all the theorems which occur before the first theorem in the same chapter, this query won't find them. The search <theorem> not preceded by sibling <axiom> will work, but only if all the axioms and theorems we are looking for are siblings, and none of them are buried in subsections.

^[12] A case might be made for adding WITHIN SAME gi ELEMENT AS, as syntactic sugar for WITHIN 0 gi ELEMENTS OF.

^[13] Because of the way XDM orders nodes, a simple-minded implementation based on XDM will typically measure zero l elements between two locations if they are separated by an end-tag for an l element, but not by a start-tag. The asymmetry of start- and end-tags is imperfect, but for the moment it appears to be avoidable at the cost of increased implementation effort.

^[14] For OR, the results are the same whether the OR is evaluated before or after the filter, but for AND and NOT, the results are very different.

^[15] Some query systems are supposed to have concluded that a top-level query of the form x AND y expresses a desire to see all the records which match sub-query x and those which match sub-query y. That makes the operators AND and OR equivalent, which will work well for users who do in fact mean all the x and all the y, but which seems likely to puzzle at least some users who think that AND and OR are not synonyms.

^[16] If the and were thought of as connecting two queries, the obvious meaning would be the set of word nodes which are both instances of fire and instances of ice, which on any plausible formalization of word forms will be the empty set.

^[17] Like this one.

^[18] In its current form, Ariadne seems to be living in a markup paradise in which there are no complications involving the tangled relationships among local names, namespace prefixes, namespace names, expanded names, and so on. If the reader wishes to imagine that all names are unqualified, or that all names are given as expanded names in Q{namespace}local notation, they may do so.

^[19] I reject in the strongest possible terms any imputation that this approach has anything at all to do with procrastination or laziness. I am shocked at the suggestion.

C. M. Sperberg-McQueen

Founder and principal

Black Mesa Technologies LLC

C. M. Sperberg-McQueen is the founder and principal of Black Mesa Technologies, a consultancy specializing in helping memory institutions improve the long term preservation of and access to the information for which they are responsible.

He served as editor in chief of the TEI Guidelines from 1988 to 2000, and has also served as co-editor of the World Wide Web Consortium's XML 1.0 and XML Schema 1.1 specifications.

Balisage Paper: Ariadne's thread

A design for a user-facing query language for texts and documents