This paper describes the design of Ariadne, a small query language intended to make searching structured documents easier for non-technical users. The following sections will describe the origins, goals, and design decisions of the language; present its grammar and semantics; illustrate its application to documents; and describe progress towards a full implementation.
Ariadne may be of interest in part because it is designed to be easier to learn and understand (and to implement!) than other query languages for structured documents. Users who would not be willing to undertake to learn XQuery or even XPath may (it is hoped) find Ariadne understandable and usable. Though designed to be smaller and less capable than XPath, XSLT, or XQuery, Ariadne can be used for relatively powerful searches within structured documents. Because it is smaller and less capable than XPath, XSLT, or XQuery, an Ariadne language processor is readily implemented using XSLT or XQuery. And because Ariadne makes only relatively weak assumptions about document structure, it is applicable to non-XDM text models like concurrent hierarchies, Goddag structures, TAG (Text as graph), multi-colored trees, and even (with mild restrictions) non-graph-based models like (a simple version of) LMNL.
Origins, goals, design decisions
Ariadne has its roots in several places. Among the most prominent are the perceived need for a reasonably powerful search language easier for non-XML-oriented users to learn than XPath and XQuery or XSLT; a desire to recreate the functionality of two query language I used and liked some years ago; the desire for a user-accessible query language to exploit the rich markup in some of the projects I have been involved with; and the desire to enjoy myself by writing language processors in XSLT and XQuery.
Two earlier languages have exerted a strong influence on the functionality and syntax of Ariadne: a system called Arras, and the query language which formed part of the SGML publishing tool DynaText. The name Ariadne was chosen in part because the letter combinations ar and dn recall the names of those two systems.[1]
The following sections describe this background in slightly more detail.
Arras
In 1980, an associate professor of English at Pennsylvania State University published a book about James Joyce's Portait of the Artist as a Young Man (Smith 1980). In order to identify, trace, and visualize the appearance, repetition, and recurrence of various complexes of images in the work, the author, John B. Smith, had created what might be called an interactive concordance system. Like a printed concordance, his tool allowed him to look for all occurrences of a given word and see them in context. Unlike a printed concordance, Smith's tool was interactive. It allowed the user to specify how much context to give when displaying a word: three words on either side? One word before and five after? The full sentence? The typographic line? Three lines? It also allowed the user to search for multiple words occurring in proximity: fire and ice in the same paragraph, for example.
Smith's system was initially called Rats
(Random-Accessible Text System), but he later changed the
name to Arras, for which the retronym ARchival Retrieval
and Analysis System
was formed.[2] The Arras reference
manual is available on the web site of the computer science
department at the University of North Carolina at Chapel
Hill, to which Smith moved from Penn State in the mid-1980s,
and interested readers are encouraged to consult it (Smith 1985). A few sample queries, mostly drawn
from the manual, will suffice to illustrate some of the
program's salient capabilities.
The core functionality is to display occurrences of words, in context. The following command displays occurrences of the word fire in the default context of one sentence.
display concordance: fire.
In an attempt to make the system easier to use, Smith allows noise words, which are ignored, and allows keywords to be abbreviated. So the command just given could also be given in these and other alternative forms
please display a concordance for the word: fire.
disp conc: fire.
A different amount of context can also be specified, for example five words on either side:
display concordance: fire; context: -5 to 5 words.As may be seen, left and right context are specified independently. Context may be specified in several different logical units (word, sentence, paragraph, chapter) as well as in several physical units (line, page, volume).
To allow the user to identify images like those of fire
or water, Arras allows the user to define
categories
, which in the simple case
are just sets of word forms. To find all images of fire, a
user might specify a number of word forms related by
etymology or only by meaning:
define category: enflaming, fire, fireconsumed, fires, flame, flamed, flames, flaming, heat, heated, hot, hotly; name: firecat.The names of categories (here firecat) can be used in
display concordance
and other commands
in the same places as words from the text.
Another facility is the ability to define a configuration, which matches a word occurring within a specified proximity of another word.
configuration: fire & water; name: ffww.This defines the configuration ffww as the subset of occurrences of fire which occur within the default context (one sentence, again) of water. The context can be varied, using the same units as mentioned above. The usual Boolean operators are allowed; in addition to and (seen above), or and not can also be used.
configuration: fire &~ water; name: ff.
configuration: fire | water; name: fw.
Names of categories and configurations can be used wherever word forms are called for in a command. This allows definitions of complex constellations of words to be built up step by step.
configuration: firecat & watercat; context: -50 to +50 words; name: fwcat.There are also facilities for interrogating the list of word forms found in a text, excluding specific word occurrences from a category or configuration, displaying the distribution of a word, category, or configuration across the text (in the form of a bar graph for two-per-cent segments of the text), session management, and so on.
The Arras scanner relied on simple markup in the input to know where volume, chapter, and paragraph boundaries fall. Line and sentence boundaries it recognized by itself using simple rules, with markup available for overriding the default decisions.
The text model exposed by Arras is essentially that of a text as a sequence of words, with the sequence partitioned into sentences, paragraphs, chapters, lines, and volumes. As may be seen, these units form both a logical hierarchy (chapters, paragraphs, sentences) and a physical hierarchy (volumes, pages, lines). The ability to specify proximity using any of these structural units makes Arras very useful for those interested in the close study of a text.
It may be noted in passing that while Arras provides ways of looking for passages in a text which meet certain criteria, it does not provide any mechanism for creating new textual objects.
DynaText query language
The second major influence of Ariadne is the textual query language which formed part of DynaText. I won't rehearse in detail the history of that product in detail here, since many others who attend Balisage know it as well as I, and some know it better. I will only observe that DynaText was a system for publishing SGML documents in electronic form, which allowed users to switch among multiple stylesheets for different views of the data, provided good hypertext linking, and offered a simple and convenient textual search interface. The query language was used both as part of the end-user interface and as part of the infrastructure: as part of the creation of an electronic book, DynaText publishers could also provide simple search forms, to elicit values which could be plugged into a complex search expression as parameters. So power users could use the query language to create complicated queries which others could use without learning the query language.[3]
I won't attempt a full account of the DynaText query language (which for brevity I'll call DTQL from now on), but some examples will illustrate key points.
A search can consist of a word or phrase. To search for a word which is also a keyword in the language, it may be enclosed in quotation marks:
programming language
"containing"
Equally important, since DynaText is designed to allow the exploitation of markup, a search can consist of an element type's generic identifier, distinguished from a word to be sought in the text by being enclosed in angle brackets:
<title>
Context can be specified using the keyword
inside
or in
, optionally modified
by directly
:
programming language inside <title>
<title> directly inside <book>The first finds occurrences of the phrase programming language occurring inside a
title
element. The second finds title
elements occurring as children of a book
element.
Conversely, content can be specified using the keyword
containing
, also optionally modified by
directly
:
<chapter> directly containing <title> containing syntax
All operators in DTQL are right-associative, so the search just given is equivalent to the following:
<chapter> directly containing (<title> containing syntax)These searches find all
chapter
elements within
which there occurs (at any nesting depth) a title
element which contains the word
syntax.
Note that while
<title> containing programming languageand
programming language in <title>describe the same set of document structures, the value of the first is a set of
title
elements, while the value of
the second is a set of word occurrences. In the interactive
interface to DynaText, this difference was reflected in the
highlighting of hits and in the hit counts displayed in the
table of contents, which was itself typically displayed in a
navigation bar and always visible.
Boolean operators are also allowed:
<bibl> not containing <title>
(<element-citation> or <mixed-citation>) containing <publisher-name> not containing "Elsevier"
<title> containing theory and practice
<title> containing practice and not theory
The last two examples illustrate a slightly curious
aspect of DTQL's boolean operators: the one search finds
title
elements containing both the word theory and the
word practice, the other finds title
elements containing
one word but not the other. The right-hand argument of the
and
is not interpreted as a full query, but as a second
operand for the containing
operator. This feels natural to
native speakers of English (and perhaps other languages),
but it is trickier to implement than a naive approach might
expect. In particular the query
<title> containing practice and not theorydoes not first evaluate the queries
<title>,
practice, and
theory, then construct the complement of the third query (every word that is not theory), then take the set intersection of the occurrences of the word practice and the set of words which are not the word theory, and so on.[4]
DTQL also has a proximity operator, for which distance is specified in words:
proximity within 5 words of search
markup within 10 words of descriptive or prescriptive
The samples do not exhaust the DynaText query language, although they come close: it is a useful language but gets its power from a very small number of constructs.
Why not XQuery? Or even just XPath?
An obvious reaction to the announcement of yet another query language for documents, especially at Balisage, is to ask why. Why do we need another language for document queries? Doesn't XQuery exist? Or if XQuery is deemed too complicated, why not just use XPath? There are two reasons one might want to design a new query language, even though XPath and XQuery exist.
The first is ease of acquisition by users.
Like many Balisage attendees, I often work with domain experts who are less technically inclined than I am and less tolerant of the pernickety syntax often characteristic of languages designed to be parsed by machine. The XML community is justly proud of the number of domain experts who do not self-identify as programmers but who are able to learn and use XPath and XSLT. But many domain experts never do learn XSLT, or even XPath.
When I build a web interface for a collection of XML documents, one of the first things I like to do is to provide a simple search interface that allows a user to type in a simple XPath expression and see a list of the XDM nodes matched by the expression.[5] But often, my hopes that the domain experts with whom I work will learn enough about XML and XPath to be able to use it are dashed. As far as I know, when I have made XPath interfaces for collaborative projects, I am the only project participant who ever uses them. From time to time, others in a project have attempted to learn XML, and XPath, and XQuery. But often, domain experts find it unnerving to venture from a field in which they are experts into a field in which they are novices, and since they typically have achieved a full and rewarding professional life without knowledge of any computer languages, it is understandable both that they don't have a lot of time to focus on learning new languages and that the frustrations known to anyone learning something new and very different will often lead busy people to abandon the attempt.
It is unlikely that Ariadne is a silver bullet for this problem.
It is quite possible that some users will resist any query interface involving anything recognizable as syntax or keywords. The logical consequence of such syntax phobia among users is a search interface with a single box in which to type a query, and no visible rules for how to type it, such as is now familiar to any user of Google, Bing, DuckDuckGo, or almost any other web search engine.
But anyone who has used a single-box search interface to look for structural patterns in documents will remember how disappointing the results usually are. A search for a chapter whose title includes the word syntax, in a book whose title includes the phrase programming language, is one thing. A search for a chapter whose title includes the words programming language, in a book whose title includes the word syntax, is something rather different.[6] A query interface based on a bag-of-words document model, with no knowledge of chapters, books, or titles, cannot comfortably capture the difference.
One of Ariadne's goals is to serve users who don't feel comfortable with most artificial languages but who do understand how their documents are tagged, and who have some interest in being able to use structural information in queries.
The second reason for not just using XQuery or XPath is: trees.
XPath and XQuery and XSLT are designed to work on XML documents, and specifically on XML documents viewed through the XPath Data Model (XDM), or more generally any information that can be viewed through that model, whether its serialized form is XML or something else. Since I was not involved in the initial design of any of those languages, I can say without immodesty that they do a good job.
But it is well known that documents often exhibit structural patterns that are not tree-shaped, or at least not shaped like a single tree. A bewildering number of alternate proposals have been made, for an equally bewildering variety of text models. (The number of models is slightly lower than the number of proposals, since in some proposals appear on closer examination to be variations on the same basic idea.)
It would, I think, be helpful for users of the text-as-graph model, the LMNL model, the Goddag model, the concurrent-hierarchies model, if it were possible to formulate queries against those models. Of course, if as sometimes happens the actual documents are stored in XML, the documents can be queried in their XML form using XPath and XQuery. But models are not just guides for implementors; they are tools for thinking about things. It would be better, when we are using a non-tree-shaped model of text, if we could formulate queries against the model we want to use, instead of translating our queries into some sort of slightly awkward XPath.
One possible caveat to this line of reasoning should be
addressed. XPath is indeed defined in terms of the XPath
Data Model, but nothing in its syntax requires that the
parent
axis return a single node. If we were to
allow nodes to have multiple parents, we could apply XPath
to any document model involving a directed acyclic graph.
And if we relaxed other constraints like the rule that no
node is its own ancestor, we could apply XPath to cyclic as
well as to acyclic graphs. So if we wished to apply XPath to
Goddag structures, or to concurrent hierarchies, or to
text-as-graph instances, we could do so, at least in
principle. And indeed there have been proposals in the past,
some of them presented at Balisage or Extreme Markup
Languages, to use XPath as an expression language for
navigating arbitrary directed graphs.
To anyone who can easily implement the whole of XPath over a non-XDM model, there is clearly no pressing need for a different language. The problem for others is that even in its version 1.0, XPath was large enough to require non-trivial implementation effort. Now, in XPath 3.0 and 3.1, the language has become Turing-complete and implementing it from scratch looks rather daunting. Ariadne's advantage in this context is that in comparison with XPath, it's tiny, and it's possible to imagine a relatively straightforward implementation that does not require programmer-years of effort.
Origins of Ariadne
The project that has turned into the design and specification of Ariadne began life as a pair of low-priority free-time projects to re-implement both Arras and the DynaText query language, as an exercise in nostalgic retro-computing.[7]
The feasibility of re-implementing Arras became clear to me when I reflected that its text model could be implemented as a single SQL table in which each row contains a token, a word form (the token, stripped of punctuation), a word number, a sentence number, a chapter number, a line number, a page number, and a volume number. Arras commands could in principle be translated into SQL queries. John B. Smith implemented his own inverted-file index, but I was willing to farm all of those details out to a SQL database.
Once the key idea of letting a database management system handle the boring bits like storage details became clear, it immediately became obvious that the Arras text model could also be implemented quite simply in an XQuery database, using either element containment or milestone elements to handle the structural units. Since in Arras all structural units partition the entire text, simple milestones suffice and the more complex mechanisms of Trojan-Horse markup are not needed.
Since DynaText is designed for search and retrieval in SGML documents, it seems self-evident that it should be relatively straightforward to implement a DynaText front end which passes queries to an XQuery back end for evaluation. Since DTQL seems very accessible even to non-technical users, the idea of re-implementing DTQL sometimes seemed more than an exercise in nostalgia: it could be useful to real projects I have been engaged with. But the idea never had high priority, so it never got beyond a few sketches.
Recently, however, whiling away an idle hour with an
attempt to figure out a way to describe the behavior of
and
and or
in DTQL cleanly enough
to guide an implementation, I realized that DTQL requires a
model of text in which two nodes can be in a parent/child
relation (or not), or more generally an ancestor/descendant
relation, but does not require that the model form a tree.
Nothing breaks in the semantics of the query if an element
(or a word) can have more than one parent.
At this point I began to think that an implementation might have more practical use than I had first imagined, and that a new query language inspired by Arras and DynaText should have a higher priority than historically accurate re-implementations of the two languages.
Design decisions
The basic design approach of Ariadne was summarized by Steven J. DeRose, the designer of the DynaText query language, thus:[8]
Give users a very limited and very formulaic subset of English — it's pretty easy to implement (because so limited), yet users can read queries as if they were
just English, and can form new ones by analogy — all they really have to learn is to avoid being creative (which, when they try it, provides negative reinforcement for free).
DynaText has four query operators, not counting Booleans, which require special treatment:
-
INSIDE
, optionally modified byNOT
and/orDIRECTLY
-
CONTAINING
, optionally modified byNOT
and/orDIRECTLY
-
WITHIN
nWORDS
OF
-
WITH
(which searches the attribute axis of XDM)
INSIDE
and CONTAINING
operators are of course inverses of each other, moving
vertically in a document tree.
Ariadne extends this set of operators modestly to allow searching based on document order among siblings or among all nodes and to allow proximity to be measured, Arras-style, in any structural unit, not just words.
It should be noted that the only basic relations required to support these operators are parent-child, prec-next, and element-attribute-value. No assumption is made that a child can have only one parent, so essentially any binary relation on structures in a document can be used as the parent-child relation. If the relation is cyclic, some results may of course puzzle users used to thinking of the parent-child relation as acyclic.
Like Arras and DTQL, Ariadne lacks any means of
constructing new values of any kind: all Ariadne expressions
denote sets of locations within texts, where from the user's
point of view a location
is either a
tagged element in the text (e.g. a paragraph, title, or
chapter) or a word. From the implementation point of view,
at document ingestion time the document is tokenized and all
words are wrapped in w
elements, so for the
implementation, all values are sets of locations, and all
locations are elements in the input document. This provides
a simple uniform model and makes implementation on top of
XSLT or XQuery or XPath itself conceptually very simple.
The absence of constructors makes Ariadne strictly weaker than XQuery or XSLT. It cannot be used to perform arithmetic: an expression like 2 + 3 is not an Ariadne query or expression. It cannot be used to make a list of all word forms: Ariadne has no equivalent to the XPath distinct-values() function, and indeed no user-callable functions at all. Its intended users are not programmers who need to be able to specify and perform arbitrary computations, but readers who want to be able to describe and find locations in texts.
For some situations, it is convenient to have some notion
of a location which is an appropriate size to return in a
list of search results. I'll call such locations chunks
.
In most modern prose, a sentence, a list item, a heading, or
a line of verse will be a reasonably good chunk. Sometimes
it might be better to use a larger chunk, for example a
paragraph. The only constraint on chunks is that every
word in the document must be in some chunk; it may sometimes
be desirable to specify further that no chunk
contains any other chunk, i.e. chunks must partition the
document.
To simplify the use of Ariadne for document owners, an
implementation of Ariadne is expected to define some notion
of chunk
without intervention by the
document owner; a simple default is to accept any element
containing at least two words as a chunk. Another possible
default (up to the implementor) is to break the input
document up into sentences or sentence-like objects based on
punctuation and capitalization and use them as chunks. A
document owner should however be able to specify explicitly
which elements are to be treated as chunks. And of course, a
document owner can explicitly mark sentences (or
sentence-like objects) with an element such as the TEI
s
element, and thus override the default behavior
of the implementation. The same goes for tokenization into
words: if the default behavior is suitable, the user need do
nothing special, but if the user wants another behavior,
they can control the behavior of an Ariadne search engine by
using markup to identify what should be treated as
words.
The design principles of Ariadne are thus:
-
Queries should be a very limited, very formulaic subset of English.
-
Query operators should allow qualification of searches both vertically (by context and content) and horizontally (by relative position in document order).
-
Proximity searching should be possible using words or any structural elements as units.
-
The language must be simple to implement.[9]
-
Ariadne should be applicable to any model of text in which structural units can be construed as having a parent-child relation and an ordering. Just as structural units may have multiple parents, they may also have multiple immediately following siblings.
-
It is acceptable for searching of a document to be possible only after that document has been ingested in some way (indexed, scanned, imported, ...); in practice it is in the process of import that the system will perform services like recognition of implicit markup of words, lines, and sentences.
-
Anything the indexing system does without markup (e.g. the identification of words or sentences) should be done by injecting markup in the document. The user should be able to change the relevant behavior by supplying explicit markup for the phenomena in question, for example by exporting the document, changing the markup, and re-importing the document.
-
The text model used by Ariadne should be compatible with (i.e. its assumptions and constraints should where possible be a subset of) as the most prominent non-XDM document models: multi-colored trees, concurrent hierarchies, text as graph, and Goddag structures. Compatibility with LMNL is a desideratum.
One topic which has been ruled out of scope for this version of Ariadne is the ability to qualify searches by a particular hierarchy, analogous to the qualification of XPath axes with colors in Colored XML. It may be helpful, but I do not currently understand the requirements well enough to design it. It is hoped that experience with Ariadne will help to indicate whether extension in this direction is needed and what kind of extension would be helpful.
Examples
At heart, Ariadne is an expression language, with basic expressions which can be combined and re-combined using different operators. The value of any expression is a set of locations in the text, which are here referred to sometimes from the user view as locations and sometimes from the formal view as nodes; in an Ariadne implementation those two terms are extensionally equivalent. The locations in the value of an expression are said to be matched by or found by the expression.
Since the purpose of formulating expressions is normally to
find locations in the text which match the expression,
expressions will in what follows sometimes be referred to as
searches
or queries
.
No difference in meaning is intended by this variation in
terminology.
Structurally, some expressions are basic expressions and
others are compound expressions. Compound expressions are
formed using operators to combine subordinate expression.
Ariadne has two kinds of operators: filters and Boolean
operators. As their name suggests, filters are used to
eliminate items from a search result. Their first or left-hand
argument matches a set of locations in the text, and the
effect of the operator is to select some of those locations
and eliminate others, based on the value of the second or
right-most argument. Filters may be said to
qualify a search. Boolean operators combine sets
of results in familar ways: an AND
denotes the
intersection of two sets, an OR
the union, and
NOT
typically denotes set subtraction.
Basic expressions
Like DTQL, Ariadne has two fundamental forms of expression, one for words (or strings) and one for (structural) elements:
<author>
Heinlein
The first of these finds all author
elements in
the text; the second finds all occurrences of the token
Heinlein.
Expressions using filters
Context filter (INSIDE
)
Word and element queries can both be qualified by context:
ward inside <heading>
<head> directly inside <div> directly inside <body>
<p> inside <subsection>
These have the same value in Ariadne as in DTQL. The
first finds occurrences of the word
ward occurring anywhere inside a
heading
element; the second finds head
elements which are directly inside div
elements
(i.e. with no intervening elements) which are themselves
directly contained by body
elements. The third
finds p
elements occurring withing
subsection
elements.
In an XML context, these will be roughly equivalent to the following XPath expressions.[10]
//text()[contains(., 'ward')][ancestor::heading]
//head[parent::div[parent::body]]
//p[ancestor::subsection]
Content filter (CONTAINING
)
As in DTQL, element queries (but not word queries) can be qualified by the content of the element. For example:
<chapter> directly containing <title> containing syntaxThis is the equivalent of the XPath expression
//chapter[title[contains(., 'syntax')]]
Using qualification by context and by content, the two examples of structure-aware full-text search given earlier can be expressed readily in Ariadne or in DTQL:
programming language in <title> directly inside <chapter> inside <book> directly containing <title> containing syntax
syntax in <title> directly inside <chapter> inside <book> directly containing <title> containing programming language
Left-right filters (PRECEDED BY
, FOLLOWED BY
)
Just as the keywords INSIDE
and
CONTAINING
search
vertically
on the equivalent of the
XDM parent, child, ancestor, and descendant axes, the
keywords FOLLOWED BY
and PRECEDED
BY
search horizontally
on the
equivalent of the XDM axes preceding,
preceding-sibling,
following, and
following-sibling.
<theorem> not preceded by <axiom>This finds all
theorem
elements which occur in
the document before the first occurrence of an
axiom
element.[11]
<speech> directly followed by sibling <stage-direction>This finds all
speech
elements whose next
sibling is a stage-direction
element.
The keywords PRECEDED BY
and
FOLLOWED BY
are interpreted by reference to
document order; unlike XDM, the Ariadne model does not
require that the ordering be total or unique. There may be
multiple orderings; the value of an expression of the form
x
FOLLOWED BY
y is the set of all values
of x which are followed in some ordering by some value
of y. If multiple orderings apply in a document, it
might be helpful to be able to specify which ordering to
use, but as noted earlier this is out of scope for the
current version of Ariadne.
As may be seen, these operators may be modified by
DIRECTLY
, as well as by NOT
,
which is not illustrated, and by SIBLING
,
which restricts the search to locations which share a
parent.
Attribute filters (WITH
)
As in DTQL, element searches can be qualified by reference to attributes on the element and their values.
<section> with n = 2
<section> with n > 2
<div> with id null
These search, respectively, for section
elements with the attribute-value pair n="2"
,
for section
elements with an n
attribute whose value, when coerced to a number, is
greater than 2, and for div
elements which have
no id
attribute.
Word proximity filter (WITHIN
k
WORDS OF
)
Words can be qualified by proximity to other words:
challenge within 10 words of election
The order of the two words can be prescribed:
challenge followed within 10 words by election
challenge preceded within 10 words by election
As mentioned elsewhere, word proximity is treated
formally as a special case of element proximity. In the
queries just shown, the keyword WORDS
is
syntactic sugar for the expression
. So the queries shown are equivalent to
the following, and they may indeed by translated into the
following forms internally before evaluation:
<w>
elements
challenge within 10 <w> elements of election
challenge followed within 10 <w> elements by election
challenge preceded within 5 <w> elements by election
Element proximity filter
As in Arras, proximity can be measured in structural units other than words:
fire within 5 <l> elements of ice
This query finds all occurrences of the word
fire whose distance from some
occurrence of the word ice is fewer
than five l
elements. Distance is measured by
counting elements within the ordering assigned to elements
in a purely mechanical way. If fewer than five l
elements begin between some occurrence of
fire and some occurrence of
ice, measured in some ordering of
l
elements, then the filter retains that occurrence
of fire.
This method of measuring proximity has the consequence
that if two elements are within the same l
element,
they have distance 0, measured in l
elements:[12]
fire within 0 <l> elements of ice
This returns all occurrences of
fire which occur within the same
l
element as an occurrence of
ice. It also returns all occurrence
of fire which are not inside an
l
element, as long as there is also an occurrence
of ice which is also not inside an
l
element and as long as there is no l
element between them.[13]
The elements in question need not be contiguous:
Borges within 4 <footnote> elements of BarthesThis query includes all occurrences of the name Borges if some occurrence of the name Barthes is found with fewer than five
footnote
elements between them.
Parentheses
The qualifying operators are right-associative, so
multiple qualifications on a single query require
parentheses. The following two queries are equivalent:
each filters the set of p
elements to include
only those inside a subsection
element and only
those containing the word
appointed.
(<p> inside <subsection>) containing appointed
(<p> containing appointed) inside <subsection>
Note, however, that since containment is a transitive relation, the latter formulation is equivalent to the unparenthesized form
<p> containing appointed inside <subsection>
Expressions with Boolean operators
Boolean operators in right-hand argument of qualifier
In English and other natural languages, conjunctions
are sometimes tricky to parse. Given a sentence beginning
Alice invited Bob and Carol ...
, it is not clear
without further context whether the conjunction
and joins two clauses (Alice
invited Bob and Carol invited Dave
) or two names
(Alice invited Bob and Carol to dinner with her and
Dave
).
The goal of an English-like query syntax means that Ariadne should if possible match users' expectations. Given a query like the following, many users will not perceive the or as ambiguous.
(<p> inside <subsection>) containing appointed or appoint or appointmentThey will (or so I conjecture) expect the result to contain the set union of the values of the following three queries:
(<p> inside <subsection>) containing appointed
(<p> inside <subsection>) containing appoint
(<p> inside <subsection>) containing appointmentThey will emphatically not expect the query result to be a mixture containing all the occurrences of appoint and appointment together with the set of paragraphs inside subsections which contain the word appointed.
To achieve this interpretation, Ariadne treats Booleans
slightly differently depending on where they appear. When
Boolean operators are found in the right-hand argument of
any filter, the qualification is applied to each argument
of the Boolean and the results are combined as indicated.
Grammatically, this is mostly a matter of ensuring that
the Boolean operators bind more tightly than the filter
operators, so that
x
CONTAINING
y
AND
z
is parsed like
x
CONTAINING
(y
AND
z)
and not like
(x
CONTAINING
y) AND
(z).
Semantically, it's a bit more complicated than that; the filter must be applied to each item in the compound argument separately, before the results are combined as indicated by the Boolean operator.[14]
For example, the value of the query
<p> containing fire and water and not iceis the set of
p
elements which (1) contain the
word fire, and (2) contain the word
water, and (3) do not contain the
word ice. That is to say, it is the
intersection of the values of the three queries
<p> containing fire
<p> containing water
<p> not containing ice
This treatment of Boolean operators follows that of DTQL.
Boolean operators between filters
If a Boolean connector is found between qualifiers, then the two qualifiers share the left-hand argument. The query
<p> inside <subsection> and containing appointedfinds all
p
elements which both (1) occur within
a subsection
element and (2) contain the word
appointed.
It is equivalent to the form
(<p> inside <subsection>) containing appointedin which the
CONTAINING
filter is applied to
the results of applying the CONTAINING
filter.
The Boolean operator or
can also be used.
(<p> inside <subsection>) containing appointed or containing appoint or containing appointment
This has the same meaning as the query above in which
the second and third occurrences of
containing
are omitted.
This treatment of Boolean operators is intended as a generalization of the treatment of Booleans described in the previous section. It is not clear whether DTQL accepts Boolean operators between filters.
Boolean operators (top-level)
Word queries can be combined with boolean operators:
Heinlein and Robert
Heinlein or Bradbury
Heinlein and not Bradbury
It is slightly challenging to provide a plausible
interpretation for these which is consistent both with
normal usage of the Boolean operators and with user
expectations formed by expressions like those given in the
preceding sections. If AND
is interpreted as
denoting the intersection of the values of its two
arguments, and NOT
as denoting the complement
of the value of its argument, then the first query shown
must denote the empty set, since there are no words which
are both the word Heinlein and the
word Robert. The third query will,
given these interpretations, denote the set of locations
which are both (1) occurrences of the word
Heinlein and (2) not occurrences of
the word Bradbury. Neither of these
is a satisfactory interpretation for expressions which
appear perfectly idiomatic to many users.[15]
Ariadne interprets top-level queries
like those shown as abbreviated forms of x
CONTAINING
y
op
z, where
x denotes some smallish region of text, like a
paragraph. It is for this reason that Ariadne introduces
the idea of a chunk. The searches shown above
find chunks which contain both
Heinlein and
Robert; contain either
Heinlein or
Bradbury; or contain
Heinlein but do not contain
Bradbury. If all chunks are
p
or q
or bibl
elements, then
these Boolean searches are equivalent to
(<p> or <q> or <bibl>) containing Heinlein and Robert
(<p> or <q> or <bibl>) containing Heinlein or Bradbury
(<p> or <q> or <bibl>) containing Heinlein and not Bradbury
This is one of two places where the semantics of
Ariadne appeals to the notion of
chunk
. The other place is that in an
Arras-style interactive interface which displays lists of
hits, the default context for a hit will be one chunk.
Grammar
The basic form of the language is straightforward: a query is a basic query, followed by zero or more qualifications using the filter operators. Since some filters apply only to one kind of query, word queries and element queries are distinguished syntactically. Also, queries can be combined using Boolean operators, as can filters.
The grammar will be given in ixml notation, for which see Pemberton 2013 and Pemberton 2021.
Starting at the top: a query is either a word query, an element query, a set of basic word queries joined by Boolean operators, or a set of basic element queries joined by Boolean operators.
query: single-query | multi-query. single-query: word-query | element-query. multi-query: base-word+boolean | base-element+boolean.Each kind of query consists of a base query followed by zero or more filters, connected by Booleans.
word-query: base-word, (s, word-filter*boolean)?. element-query: base-element, (s, element-filter*boolean)?.A basic word query is sequence of tokens, or a parenthesized word query:
base-word: tokens | '(', s?, word-query, s?, ')'. tokens: token+s. token: word | pattern | string.
A word is a sequence of word characters or a quoted string. Ambitious implementations may also provide either simple glob-style patterns using asterisk and question mark or regular expressions. Those are not described here, though they are likely to be important in practice.
A basic element query is a generic identifier enclosed in angle brackets. As in XML, we allow whitespace before the closing angle bracket.
base-element: '<', name, s?, '>' | '(', s?, element-query, s?, ')'.
Word queries may be filtered by word proximity, by element proximity, horizontally (by left-right context), or vertically (by ancestor-descendant context). Element queries may be filtered by attribute, by element proximity, horizontally, or vertically.
word-filter: word-proximity-filter | element-proximity-filter | left-right-filter | context-filter. element-filter: element-proximity-filter | left-right-filter | context-filter | content-filter | attribute-filter.
Word proximity filters take the form
w1
, or
within
k
words of
w2
w1
, or followed within
k
words by
w2
w1
.
preceded within
k
words by
w2
word-proximity filter: NOT?, WITHIN, n, WORDS, OF, word-query | NOT?, PRECEDED, WITHIN, n, WORDS, BY, word-query | NOT?, FOLLOWED, WITHIN, n, WORDS, BY, word-query.
Element proximity filters are essentially the same, but specify an element type instead of using the keyword WORDS.
element-proximity filter: NOT?, WITHIN, n, units, OF, single-query | NOT?, PRECEDED, WITHIN, n, units, BY, single-query | NOT?, FOLLOWED, WITHIN, n, units, BY, single-query. units: base-element, ELEMENTS.
Left-right filters use the keywords PRECEDED
and FOLLOWED
, but no proximity measurements. They
can be preceded by NOT
and DIRECTLY
.
left-right-filter: NOT?, DIRECTLY?, PRECEDED, BY, single-query | NOT?, DIRECTLY?, FOLLOWED, BY, query.
Context filters use the keyword INSIDE
.
context-filter: NOT?, DIRECTLY?, INSIDE, element-query.
Content filters use the keyword CONTAINING
:
content-filter: NOT?, DIRECTLY?, CONTAINING, single-query.
Attribute filters use the keyword WITH
.
attribute-filter: WITH, attribute-test. attribute-test: name, NOT?, NULL | name, NOT?, comparator, value. comparator: "=" | "<" | ">" | "<=" | ">=". value: token.
Omitted here are rules for the keywords NOT
,
DIRECTLY
, PRECEDED
, etc., which are
defined to accept either lowercase or uppercase spelling and
to allow or require surrounding whitespace.
An XML representation follows directly from the ixml specification; experience may lead to a reformulation of the grammar in order to get a different XML representation.
Semantics
For each form of expression E, we give a rule for finding its value v(E), typically in terms of a set comprehension rule.
In describing the forms of expressions, we write
-
w for a base word,
-
gi for a base element,
-
x for an expression which may be either a base word or a base element,
-
wq for a word query,
-
eq for an element query,
-
sq for any simple query (word query or element query)
In describing values, we write n for locations (or nodes) in the document, which can be either word nodes or element nodes.
A variety of predicates and relations holding among nodes, strings, names, and integers will be introduced as needed.
The only function used is v(); all other terms are
relations. For example, we write name(e, gi) to say that
the name of a particular element instance e is gi. In
XPath, we would normally write name(e) = gi (or
name($e) = $gi
), but the formulas are easier to
follow if there is never any doubt about whether a given term
is a function call or a relational predicate. If the functor
is v, it's a function call denoting a set of nodes; otherwise,
it's a predicate which is either true or false.
Basic searches
Base-word searches
A search for a single word form finds all occurrences of that form in the document.
-
v(w) = { n | word(n) ∧ form(n, w) }
The predicate word is true if and only if its argument is a word-node.
The relation form is true for a word-node and its word form (a string).
A search for multiple words (e.g. programming language
)
looks for adjacent tokens.
-
v(w1 w2) = v(w1
DIRECTLY FOLLOWED BY
w2)
DIRECTLY FOLLOWED BY
w2 is defined
below.
Basic searches connected by Booleans
As noted earlier, in a query like
<title> containing fire and icethe Boolean operator can be thought of as connecting two arguments for the filter, not two full queries. (It may be thought of as analogous to the
and
connecting
the parts of a compound direct object, rather than to the
and
connecting two clauses in a compound
sentence.)[16]
Some queries may connect basic searches with Booleans, like this one:
fire and iceIn order to provide some meaning for such queries, these are interpreted as denoting the set of chunks which contain both the form fire and the form ice. If chunks can nest, then only the smallest chunks which satisfy the constraint are returned.
This special treatment is applied only to Boolean combinations of basic queries, which the grammar identifies as multi-queries.
A multi-query on words finds all chunks in the document containing that combination of words; it can be understood as performing set operations on the sets of chunks identified by the individual base queries.
For any multi-query mq, we shift to a special evaluation procedure:
-
v(mq) = vmq(mq)
-
vmq(x
AND
mq) = v(CHUNK CONTAINING
x) ∩ vmq(mq) -
vmq(x
OR
mq) = v(CHUNK CONTAINING
x) ∪ vmq(mq) -
vmq(x
NOT
mq) = v(CHUNK CONTAINING
x) \ vmq(mq) -
vmq(x
AND NOT
mq) = v(CHUNK CONTAINING
x) \ vmq(mq)
As a base case, when all the occurrences of
AND
, OR
, and NOT
have been dealt with, and we are dealing with the simple
query which is the last (right-most) element of the
multi-query, we interpret it as if it were a search for a
chunk containing whatever is specified by the simple
query. So Heinlein
, for example, is
interpreted as if it were CHUNK CONTAINING
Heinlein
, and <author>
as if it
were CHUNK CONTAINING <author>
.
-
vmq(sq) = v(
CHUNK CONTAINING
sq)
This requires that we define the pattern CHUNK CONTAINING
x:
-
v(
CHUNK CONTAINING
w) = { n1 | chunk(n1) ∧ (∃ n2)(n2 ∈ v(w) ∧ ancestor-descendant(n1, n2)) }
Base-element searches
A basic element search finds all elements in the document with that generic identifier.
-
v(gi) = { n | element(n) ∧ name(n, gi) }
Filtered searches
Word-proximity filters
A word proximity query finds all instances of the left-hand argument which are within the designated distance of some instance of the right-hand argument.
-
v(w
WITHIN
kWORDS OF
wq) = { n | n in v(w) ∧ (∃ n2) (n2 ∈ v(wq) ∧ (∃ d1, d2) (distance(n1, n2, w, d1) ∧ num-abs(d1, d2) ∧ d2 ≤ k)) }
Here distance(n1, n2, gi, d) is a relation
holding among two nodes n1 and n2, the generic
identifier of an element type serving as a unit of measure
(here the w
element), and d, an integer denoting
the directed distance between n1 and n2, measured in
those units. If d is positive then n1 precedes n2, if
negative then n1 follows n2. So d is the distance from
n1 to n2, moving forward in the text.
The num-abs relation holds between a number and its absolute value.
In prose: a search for some word w within k words
of some word found by word-query wq finds all nodes (or:
locations) n, such that n is in the value of w and
there exists some second node n2 in the value of the
word-query wq such that there is some distance d,
measured between n and n1 in w
elements, such
that d is less than or equal to k.
The reader may find the formulation slightly puzzling, expecting in the last part of the set comprehension formula to see a formulation more like
... abs(distance(n1, n2, w)) ≤ k
with a meaning like the absolute value of the
distance from n1 to n2 is less than or equal to
k
.
The first reason for this is that in this alternate formulation, abs and distance are functions, not relations.
The more important reason is as mentioned earlier: We
would like to avoid the assumption that there is always a
single distance between two words. For example, when a
note or note reference[17] occurs immediately following a word w,
what is the next word after w? Is it the word after the
reference? or is it the first word of the note? Viewed
another way: what is the distance between w and the word
following the reference? In this paragraph, what is the
distance between the word reference
and the word occurs? The
formulation given above is intended to allow for the
possibility that there are two distances, which depend on
the route taken: if the note is skipped and we remain in
the main
text, then the distance is
1. If we follow the reading route through the note, and it
has (as here) three words, then the distance is 4. What is
the distance between two words? It depends on which route
you take. An Ariadne implementation may adopt a distance
measure in which there is only ever a single distance
between two points — prescribing a single route
between them, so to speak — but it is not obligated
to do so.
Non-deterministic proximity measures can be helpful not only in cases of paratextual structures like footnotes or running heads, but also in cases where textual variants are encoded in a single document, whether the document is a critical edition, a variorum edition, or a normative document (technical specification, legislation, ...) in which changes from a previous published version are marked to simplify displays that show all changes in context. In a single version of the text, two word tokens are within k words of each other if there are at most k - 1 words between them. In a multi-version representation the text, two words are within k words of each other if there is some version of the text in which they are within k words of each other.
The use of relational notation instead of functional notation makes it easier to exclude unconscious assumptions about the uniqueness of certain values.
The ordered proximity filters are similar but do not need to calculate the absolute value of the distance.
-
v(w
PRECEDED WITHIN
kWORDS BY
wq) = { n1 | n1 in v(w) ∧ (∃ n2)(n2 ∈ v(wq) ∧ (∃ d)(distance(n2, n1, w, d) ∧ 0 < d ≤ k)) }
In prose: the value of a search matching the pattern
w
PRECEDED WITHIN
k
WORDS BY
wq is the set of all word occurrences n1 matching w,
such that there is some other word occurrence n2 and
some distance d, such that n2 matches the word query
wq and d is a left-to-right distance (not the
distance
) from n2 to n1, measured in words, and
d is positive and d is less than or equal to k.
The FOLLOWED
keyword is the mirror image:
the only difference is the order of arguments in the
distance relation.
-
v(w
FOLLOWED WITHIN
kWORDS BY
wq) = { n1 | n1 ∈ v(w) ∧ (∃ n2)(n2 ∈ v(wq) ∧ (∃ d)(distance(n1, n2, w, d) ∧ 0 < d ≤ k)) }
The set comprehensions for the NOT
operator have a similar structure, but the existential
quantifier over the second node n2 is negated.
-
v(w
NOT WITHIN
kWORDS OF
wq) = { n | n ∈ v(w) ∧ ¬(∃ n2)(n2 ∈ v(wq) ∧ (∃ d1, d2)(distance(n1, n2, w, d1) ∧ num-abs(d1, d2) ∧ d2 ≤ k)) } -
v(w
PRECEDED WITHIN
kWORDS BY
wq) = { n | n ∈ v(w) ∧ ¬ (∃ n2)(n2 ∈ v(wq) ∧ (∃ d)(distance(n2, n1, w, d) ∧ 0 < d ≤ k)) } -
v(w
FOLLOWED WITHIN
kWORDS BY
wq) = { n | n ∈ v(w) ∧ ¬(∃ n2)(n2 ∈ v(wq) ∧ (∃ d)(distance(n1, n2, w, d) ∧ 0 < d ≤ k)) }
These find the set of all words which match the left
argument w for which some node in the value of the word
query wq is within k words; when PRECEDED
or FOLLOWED
is used, the distance is measured
only in the indicated direction.
Element-proximity filters
The element-proximity filters have a very similar structure to the word-proximity filters; the only difference is that the hardcoded w is replaced by a reference to the specified generic identifier.
-
v(x
WITHIN
k giELEMENTS OF
sq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(sq) ∧ (∃ d1, d2)(distance(n1, n2 gi, d1) ∧ num-abs(d1, d2) ∧ d2 ≤ k)) } -
v(x
PRECEDED WITHIN
k giELEMENTS BY
sq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(sq) ∧ (∃ d)(distance(n2, n1, gi, d) ∧ 0 < d ≤ k)) } -
v(x
FOLLOWED WITHIN
k giELEMENTS BY
sq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(sq) ∧ (∃ d)(distance(n1, n2, gi, d) ∧ 0 < d ≤ k)) } -
v(x
FOLLOWED WITHIN
k giELEMENTS BY
sq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(sq) ∧ (∃ d)(distance(n1, n2, gi, d) ∧ 0 < d ≤ k)) } -
v(x
NOT WITHIN
k giELEMENTS OF
sq) = { n1 | n1 ∈ v(x) ∧ ¬(∃ n2)(n2 ∈ v(sq) ∧ (∃ d1, d2)(distance(n1, n2 gi, d1) ∧ num-abs(d1, d2) ∧ d2 ≤ k)) } -
v(x
NOT PRECEDED WITHIN
k giELEMENTS BY
sq) = { n1 | n1 ∈ v(x) ∧ ¬(∃ n2)(n2 ∈ v(sq) ∧ (∃ d)(distance(n2, n1, gi, d) ∧ 0 < d ≤ k)) } -
v(x
NOT FOLLOWED WITHIN
k giELEMENTS BY
sq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(sq) ∧ (∃ d)(distance(n1, n2, gi, d) ∧ 0 < d ≤ k)) }
Left-right filters
The filters PRECEDED BY
and FOLLOWED
BY
appeal to an underlying relation
preceding-following which holds for all pairs (x, y)
such that x precedes y in the document.
In a totally ordered document like an XDM instance, for any pair (x, y) we have either preceding-following(x, y) or preceding-following(y, x) or x = y. In a partially ordered document, it can be the case that none of these apply. In a multiply ordered document (for example, a document that includes multiple orderings of its nodes), we may have both preceding-following(x, y) and preceding-following(y, x) — that is, x may both precede and follow y.
-
v(x
PRECEDED BY
sq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(sq) and preceding-following(n2, n1)) } -
v(x
FOLLOWED BY
sq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(sq) and preceding-following(n1, n2)) } -
v(x
DIRECTLY PRECEDED BY
sq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(sq) and prec-next(n2, n1)) } -
v(x
DIRECTLY FOLLOWED BY
sq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(sq) and prec-next(n1, n2)) } -
v(x
NOT PRECEDED BY
sq) = { n1 | n1 ∈ v(x) ∧ ¬(∃ n2)(n2 ∈ v(sq) and preceding-following(n2, n1)) } -
v(x
NOT FOLLOWED BY
sq) = { n1 | n1 ∈ v(x) ∧ ¬ (∃ n2)(n2 ∈ v(sq) and preceding-following(n1, n2)) } -
v(x
NOT DIRECTLY PRECEDED BY
sq) = { n1 | n1 ∈ v(x) ∧ ¬(∃ n2)(n2 ∈ v(sq) and prec-next(n2, n1)) } -
v(x
NOT DIRECTLY FOLLOWED BY
sq) = { n1 | n1 ∈ v(x) ∧ ¬(∃ n2)(n2 ∈ v(sq) and prec-next(n1, n2)) }
Context filters
Analogously, the context and content filters rely on relations that we can call parent-child and ancestor-descendant.
-
v(x
INSIDE
eq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(eq) ∧ ancestor-descendant(n2, n1)) } -
v(x
DIRECTLY INSIDE
eq) = { n1 | n1 ∈ v(x) ∧ (∃ n2)(n2 ∈ v(eq) ∧ parent-child(n2, n1)) } -
v(x
NOT INSIDE
eq) = { n1 | n1 ∈ v(x) ∧ ¬(∃ n2)(n2 ∈ v(eq) ∧ ancestor-descendant(n2, n1)) } -
v(x
NOT DIRECTLY INSIDE
eq) = { n1 | n1 ∈ v(x) ∧ ¬(∃ n2)(n2 ∈ v(eq) ∧ parent-child(n2, n1)) }
Content filters
The content filters are the converse of the context filters; they look down the tree, not up. So the main difference in the definitions is the order of arguments in the reference to the ancestor-descendant and parent-child relations.
-
v(e
CONTAINING
sq) = { n1 | n1 ∈ v(e) ∧ (∃ n2)(n2 ∈ v(sq) ∧ ancestor-descendant(n1, n2)) } -
v(e
DIRECTLY CONTAINING
sq) = { n1 | n1 ∈ v(e) ∧ (∃ n2)(n2 ∈ v(sq) ∧ parent-child(n1, n2)) } -
v(e
NOT CONTAINING
sq) = { n1 | n1 ∈ v(e) ∧ ¬(∃ n2)(n2 ∈ v(sq) ∧ ancestor-descendant(n1, n2)) } -
v(e
NOT DIRECTLY CONTAINING
sq) = { n1 | n1 ∈ v(e) ∧ ¬(∃ n2)(n2 ∈ v(sq) ∧ parent-child(n1, n2)) }
Attribute filters
The attribute filters are straightforward.
The first filter tests for an attribute being present or absent.
-
v(e
WITH
anNULL
) = { n | n ∈ v(e) ∧ ¬(∃ val)(e-an-av(n, an, val)) } -
v(e
WITH
anNOT NULL
) = { n | n ∈ v(e) ∧ ¬(∃ val)(e-an-av(n, an, val)) }
In prose: the filter WITH
an
NULL
matches an element just in case it has
no value associated with the attribute name an; the
variant with NOT
matches just in case it does
have some value for that name.
-
v(e
WITH
anEQ
val) = { n | n ∈ v(e) ∧ (∃ v)(e-an-av(n, an, v) ∧ (((∃ i1, i2)(str-num(val, i1) ∧ str-num(v, i2) ∧ i1 = i2)) ∨ ¬(∃ i1, i2)(str-num(val, i1) ∧ str-num(v, i2) ∧ v = val))) }
In prose: the filter WITH
an = val
matches an element just in case the element has a value
associated with the attribute name an, and the value
specified and the value of the attribute are equal either
numerically, if both are convertible to numbers, or else
as strings.
The other forms differ from the one just given only in using different operators instead of equality, or in adding negation. They will not be given here.
-
v(e
WITH
anLT
val) -
v(e
WITH
anGT
val) -
v(e
WITH
anLE
val) -
v(e
WITH
anGE
val) -
v(e
WITH
anNOT EQ
val) -
v(e
WITH
anNOT LT
val) -
v(e
WITH
anNOT GT
val) -
v(e
WITH
anNOT LE
val) -
v(e
WITH
anNOT GE
val)
The Ariadne text model
It may be observed, by inspecting the semantic descriptions offered in the preceding section, that Ariadne searches may be performed on any collection of data that can be represented as a set of nodes, on which a certain number of relatively simple, relatively general relations are defined.
Nodes can be words, or elements. Relevant predicates are:
-
word(n): true if node n is a word
-
element(n): true if node n is an element
-
node(n): true if n is a node
-
chunk(n): true if node n is a chunk
The word and element predicates are not mutually exclusive:
as mentioned earlier, internally Ariadne processors are
expected to represent words as w
elements. The chunk
predicate holds of some elements.
In most obvious cases, all nodes will be either words or elements, but in principle other node types are possible, though they cannot be retrieved using Ariadne as currently defined.
Words have word forms, and elements have names:[18]
-
form(n, w): true if node n is a word and w is the (or: a) word form of n
-
name(n, gi): true if node n is an element and gi is its name
One or more partial orders are defined on the nodes in the document. The prec-next relation holds for nodes which are immediately adjacent in some partial ordering; the preceding-following relation is its transitive closure.
-
prec-next(n1, n2): true iff n2 immediately follows n1 in some ordering; this can be thought of a recording a
next-pointer
from n1 to n2, not necessarily the only next-pointer from n1. -
preceding-following(n1, n2): true if n2 can be reached by following a series of next-pointers beginning at n1 and ending at n2; the transitive closure of the next relation
Either of these may be derived from the other. If preceding-following is specified and is acyclic, then prec-next can be derived by taking its transitive reduction. If prec-next is specified, then preceding-following may be derived from it by taking its transitive closure; in this case, prec-next need not be acyclic.
A parent-child relation is also defined on the nodes of the document, together with its transitive closure ancestor-descendant. If these are acyclic, then either can be specified and the other derived from it; if they include cycles, then parent-child must be defined and ancestor-descendant derived from it.
-
parent-child(n1, n2): true if n1 is a parent of n2
-
ancestor-descendant(n1, n2): true if n1 is an ancestor of n2; this is the transitive closure of the parent-child relation
As in SGML or XML, elements may have attribute-value pairs; the comparison operators assume that values are represented as strings, but may denote numbers. Ariadne is agnostic on whether attributes are nodes (as in XDM) or simply properties attached to element nodes.
-
e-an-av(n, an, val): true if n1 is an element with an attribute with the name an and the value val.
For proximity searching, the distance relation is crucial. It measures the distance between two nodes, using an arbitrary element type as the unit of measure.
-
distance(n1, n2, gi, d): true if the distance from n1 to n2, moving forward in the text, can be measured at d elements of type gi. If the second argument precedes the first, d is negative.
The semantic formulae given above do not assume any
particular relationship between the distance relation and the
next and preceding-following relations. In order to ensure
that searches involving the filters DIRECTLY FOLLOWED
BY
and FOLLOWED
WITHIN
1 ...
BY
produce plausibly related results, however, it
is probably best if they are systematically related.
When all nodes in a document are totally ordered, as in XDM, one possible approach is to define distance in terms of next, along the following lines. One possible definition is as follows. For any nodes n1 and n2, and any generic identifier gi, the following hold:
-
The distance from any node to itself is 0. That is, distance(n1, n1, gi, 0).
-
If prec-next(n1, n2) and gi is the name of n2 (name(n2, gi)), then the distance is 1: distance(n1, n2, gi, 1).
-
If prec-next(n1, n2) and gi is not the name of n2, then the distance is 0: distance(n1, n2, gi, 1).
-
If there exists some node n3 such that prec-next(n1, n3), and gi is the name of n3, and the distance from n3 to n2 is d, then the distance from n1 to n2 is 1 + d.
-
If there exists some node n3 such that prec-next(n1, n3), and gi is not the name of n3, and the distance from n3 to n2 is d, then the distance from n1 to n2 is also d.
Operationally, this amounts to walking the path from n1 to n2 by following next-pointers, incrementing the count each time an element with the specified generic identifier is encountered.
In a text model with multiple paths from n1 to n2, a distance is measured along each path.
A few relations are defined on numbers, strings, and attribute values:
-
num-abs(num1, num2): true if num2 is the absolute value of num1.
-
str-num(v, num): true if string v is in the lexical space of some XSD numeric type, and num is the corresponding value.
If both arguments to the following relations are numbers, or strings that can be coerced to numbers, then the comparisons are numeric; otherwise, the comparisons are string comparisons using an implementation-defined Unicode collation sequence.
-
equal(v1, v2): true if v1 and v2 are the same value
-
lt(v1, v2): true if v1 is less than v2, numerically if they are both numbers, otherwise lexicographically
Current status, progress towards implementation, and future plans
Ariadne is intended to be simple to implement, but early implementation of a design can have unfortunate effects if properties of the implementation leak out and affect the design. In order to avoid this problem, the author is attempting to work out the design of Ariadne as completely as possible before actually implementing anything.[19] This has drawbacks, but if all goes well it will help ensure that the query language can be understood without reference to any particular implementation or implementation approach.
At the time this version of the paper was prepared, the
language existed only on paper, with some paper sketches of code
for a toy implementation. It is expected that a toy
proof-of-concept implementation will be completed
soon
, sufficient to allow experimentation with
small documents using different text models and with different
ways of reducing the Ariadne text model to a minimal set of
relations.
References
[EBT 1996] EBT (Electronic Book Technologies)/Inso Providence Corporation. DynaText Publishing: Document Preparation. Release 3.0. Providence: EBT/Inso Providence Corporation, 1996.
[Pemberton 2013]
Pemberton, Steven.
Invisible XML.
Presented at Balisage: The Markup Conference 2013,
Montréal, Canada, August 6 - 9, 2013.
In
Proceedings of Balisage: The Markup Conference 2013..
Balisage Series on Markup Technologies, vol. 10 (2013).
doi:https://doi.org/10.4242/BalisageVol10.Pemberton01.
[Pemberton 2021]
Pemberton, Steven.
Invisible XML Specification (Draft).
2021-01-28.
On the Web at
https://invisiblexml.org/ixml-specification.html.
[Smith 1980] Smith, John Bristow. Imagery and the Mind of Stephen Dedalus: A Computer-Assisted Study of Joyce's a Portrait of the Artist As a Young Man. Lewisburg, PA: Bucknell University Press, 1980. 294 pp.
[Smith 1985] Smith, John B. Arras User's Manual. TR85-036 1985. Chapel Hill: University of North Carolina at Chapel Hill, Dept. of Computer Science, 1985. Available online at http://www.cs.unc.edu/techreports/85-036.pdf.
[1] The association of Ariadne with a method of finding one's way through a complex and confusing structure also doesn't hurt.
Ariadne is a reasonably popular name for technology projects, so it may be desirable to say explicitly here that the query language described in this paper has no relation to any other software, language, project, or firm using the name Ariadne, including but not limited to:
-
Ariadne Software (http://www.ariadnesoftware.com), the maker of Cool Spools and
a provider of solutions for the IBM i (AS 400 iseries) platform
; -
Ariadne Solutions - Red Thread Software (https://www.ariadnesolutions.com/), the maker of Red Thread, a tool for
bioanalytical data review and risk management
; -
the Ariadne project (https://www.ariadne-eu.org), a program funded by the European Union
to integrate existing archeological data infrastructure across Europe, so researchers can use the various distributed datasets and new technologies to explore research methodologies
.
[2] Like many computer-related names at the time, the name Arras was typically written in all caps in printed documentation. For the sake of a visually quieter page I follow the rule of down-casing such names when they are pronounceable.
[3] The description of the DynaText query language which follows is derived mostly from chapter 12 of the manual DynaText Publishing: Document Preparation (EBT 1996). DynaText was a commercial product and is as far as I know no longer commercially available. This has several consequences. First, I do not now have and have never had access to the source code of the program, and do not now have access to a running copy, so my description of corner cases in the query language cannot be tested against the behavior of the search interface and may need to be taken with a grain of salt. The information in the manual has been augmented in a few details by personal recollection and here and there also by information from Steven J. DeRose, who is thanked for his patience. Second, it must be assumed that someone, somewhere owns the intellectual property rights to the program. In the U.S., I believe my description of the DynaText query language and its influence on Ariadne fall into the category of fair use.
[4] I'm just saying that someone might try to formalize the meaning of the language that way, and be surprised by the results, not that such a mistake happened to anyone I happen to know personally.
[5] For example, the ancilla search interface for Trials in the Late Roman Republic (http://tlrr.blackmesatech.com/2016/04/ancilla.xhtml); the initial a of the name ancilla signals that this was the first search interface to be specified and named, though it was not in fact the first one implemented.
[6] The first may be expected to locate chapters about the syntax rules in books about languages like Pascal, Javascript, or C; the second should find chapters discussing suitable programming languages, in books about the syntax of natural languages. When I first heard it, the example was attributed to the computational linguist Toshio Yokoi.
[7] Actually, since I still miss Arras's proximity searching thirty-five years after last using it, an Arras implementation might be useful as well as fun. But it was fun, not functionality, that was the key motivator for keeping the project on my someday list. I reiterate my claim that independent re-implementations of Arras and DTQL would fall within the bounds of fair use under U.S. copyright law.
[8] Steven J. DeRose, personal communication, 10 March 2021.
[9] I mean, really simple to implement. I mean, simple enough for someone trained in comparative literature to implement.
[10] The
XPath expressions given are only roughly equivalent
because the search for the word
ward is not a string search, and
should not find occurrences of the words
reward or
award. In an implementation which
records its tokenization of words by marking them with
w
elements, the equivalent XPath will be:
//w[@form = 'ward'][ancestor::heading]
[11] It may be
less useful than it appears at first. If the user wants
all the theorems which occur before the first theorem in
the same chapter, this query won't find them. The search
<theorem> not preceded by sibling
<axiom>
will work, but only if all the axioms
and theorems we are looking for are siblings, and none of
them are buried in subsections.
[12] A case might be made for adding WITHIN
SAME
gi
ELEMENT AS
, as
syntactic sugar for WITHIN 0
gi
ELEMENTS OF
.
[13] Because of the
way XDM orders nodes, a simple-minded implementation based
on XDM will typically measure zero l
elements
between two locations if they are separated by an end-tag
for an l
element, but not by a start-tag. The
asymmetry of start- and end-tags is imperfect, but for the
moment it appears to be avoidable at the cost of increased
implementation effort.
[14] For OR
, the results are the
same whether the OR
is evaluated before
or after the filter, but for AND
and
NOT
, the results are very different.
[15] Some query systems are supposed to have
concluded that a top-level query of the form x
AND
y expresses a desire to see all the
records which match sub-query x
and those
which match sub-query y. That makes the operators
AND
and OR
equivalent, which
will work well for users who do in fact mean
all the x
and all the y
,
but which
seems likely to puzzle at least some users who think
that AND
and OR
are not synonyms.
[16] If the and
were thought of as connecting two queries, the obvious
meaning would be the set of word nodes which are both
instances of fire and instances of
ice, which on any plausible
formalization of word forms will be the empty
set.
[17] Like this one.
[18] In its current form, Ariadne seems to be living in a markup paradise in which there are no complications involving the tangled relationships among local names, namespace prefixes, namespace names, expanded names, and so on. If the reader wishes to imagine that all names are unqualified, or that all names are given as expanded names in Q{namespace}local notation, they may do so.
[19] I reject in the strongest possible terms any imputation that this approach has anything at all to do with procrastination or laziness. I am shocked at the suggestion.