This paper describes the design of Ariadne, a small query
language intended to make searching structured documents easier
for non-technical users. The following sections will describe
the origins, goals, and design decisions of the language;
present its grammar and semantics; illustrate its application to
documents; and describe progress towards a full
implementation.
Ariadne may be of interest in part because it is designed to
be easier to learn and understand (and to implement!) than other
query languages for structured documents. Users who would not be
willing to undertake to learn XQuery or even XPath may (it is
hoped) find Ariadne understandable and usable. Though designed
to be smaller and less capable than XPath, XSLT, or XQuery,
Ariadne can be used for relatively powerful searches within
structured documents. Because it is smaller and less capable
than XPath, XSLT, or XQuery, an Ariadne language processor is
readily implemented using XSLT or XQuery. And because Ariadne
makes only relatively weak assumptions about document structure,
it is applicable to non-XDM text models like concurrent
hierarchies, Goddag structures, TAG (Text as graph),
multi-colored trees, and even (with mild restrictions)
non-graph-based models like (a simple version of) LMNL.
Origins, goals, design decisions
Ariadne has its roots in several places. Among the most
prominent are the perceived need for a reasonably powerful
search language easier for non-XML-oriented users to learn
than XPath and XQuery or XSLT; a desire to recreate the
functionality of two query language I used and liked some
years ago; the desire for a user-accessible query language to
exploit the rich markup in some of the projects I have been
involved with; and the desire to
enjoy myself by writing language processors in XSLT and
XQuery.
Two earlier languages have exerted a strong influence on
the functionality and syntax of Ariadne: a system called
Arras, and the query language which formed part of the SGML
publishing tool DynaText. The name Ariadne was chosen in part
because the letter combinations ar and
dn recall the names of those two
systems.
The following sections describe this background in slightly
more detail.
Arras
In 1980, an associate professor of English at
Pennsylvania State University published a book about James
Joyce's Portait of the Artist as a Young Man
(Smith 1980). In order to identify, trace, and
visualize the appearance, repetition, and recurrence of
various complexes of images in the work, the author, John B.
Smith, had created what might be called an interactive
concordance system. Like a printed concordance, his tool
allowed him to look for all occurrences of a given word and
see them in context. Unlike a printed concordance, Smith's
tool was interactive. It allowed the user to specify how
much context to give when displaying a word: three words on
either side? One word before and five after? The full
sentence? The typographic line? Three lines? It also allowed
the user to search for multiple words occurring in
proximity: fire and
ice in the same paragraph, for
example.
Smith's system was initially called Rats
(Random-Accessible Text System), but he later changed the
name to Arras, for which the retronym ARchival Retrieval
and Analysis System
was formed. The Arras reference
manual is available on the web site of the computer science
department at the University of North Carolina at Chapel
Hill, to which Smith moved from Penn State in the mid-1980s,
and interested readers are encouraged to consult it (Smith 1985). A few sample queries, mostly drawn
from the manual, will suffice to illustrate some of the
program's salient capabilities.
The core functionality is to display occurrences of
words, in context. The following command displays
occurrences of the word fire in the
default context of one sentence.
display concordance: fire.
In an attempt to make the system easier to use, Smith
allows noise words, which are ignored, and allows keywords
to be abbreviated. So the command just given could also be
given in these and other alternative forms
please display a concordance for the word: fire.
disp conc: fire.
A different amount of context can also be specified,
for example five words on either side:
display concordance: fire; context: -5 to 5 words.
As may be seen, left and right context are specified
independently. Context may be specified in several different
logical units (word, sentence, paragraph, chapter) as well
as in several physical units (line, page, volume).
To allow the user to identify images like those of fire
or water, Arras allows the user to define
categories
, which in the simple case
are just sets of word forms. To find all images of fire, a
user might specify a number of word forms related by
etymology or only by meaning:
define category: enflaming, fire, fireconsumed,
fires, flame, flamed, flames, flaming,
heat, heated, hot, hotly;
name: firecat.
The names of categories (here
firecat) can be
used in
display concordance
and other commands
in the same places as words from the text.
Another facility is the ability to define a
configuration, which matches a word occurring
within a specified proximity of another word.
configuration: fire & water; name: ffww.
This defines the configuration
ffww as the
subset of occurrences of
fire which
occur within the default context (one sentence, again) of
water. The context can be varied,
using the same units as mentioned above. The usual Boolean
operators are allowed; in addition to
and (seen above),
or and
not can
also be used.
configuration: fire &~ water; name: ff.
configuration: fire | water; name: fw.
Names of categories and configurations can be used
wherever word forms are called for in a command. This allows
definitions of complex constellations of words to be built
up step by step.
configuration: firecat & watercat;
context: -50 to +50 words;
name: fwcat.
There are also facilities for interrogating the list of word
forms found in a text, excluding specific word occurrences
from a category or configuration, displaying the
distribution of a word, category, or configuration across
the text (in the form of a bar graph for two-per-cent
segments of the text), session management, and so on.
The Arras scanner relied on simple markup in the input to
know where volume, chapter, and paragraph boundaries fall.
Line and sentence boundaries it recognized by itself using
simple rules, with markup available for overriding the
default decisions.
The text model exposed by Arras is essentially that of a
text as a sequence of words, with the sequence partitioned
into sentences, paragraphs, chapters, lines, and volumes. As
may be seen, these units form both a logical hierarchy
(chapters, paragraphs, sentences) and a physical hierarchy
(volumes, pages, lines). The ability to specify proximity
using any of these structural units makes Arras very useful
for those interested in the close study of a text.
It may be noted in passing that while Arras provides ways
of looking for passages in a text which meet certain
criteria, it does not provide any mechanism for creating new
textual objects.
DynaText query language
The second major influence of Ariadne is the textual
query language which formed part of DynaText. I won't
rehearse in detail the history of that product in detail
here, since many others who attend Balisage know it as well
as I, and some know it better. I will only observe that
DynaText was a system for publishing SGML documents in
electronic form, which allowed users to switch among
multiple stylesheets for different views of the data,
provided good hypertext linking, and offered a simple and
convenient textual search interface. The query language was
used both as part of the end-user interface and as part of
the infrastructure: as part of the creation of an electronic
book, DynaText publishers could also provide simple search
forms, to elicit values which could be plugged into a
complex search expression as parameters. So power users
could use the query language to create complicated queries
which others could use without learning the query
language.
I won't attempt a full account of the DynaText query
language (which for brevity I'll call DTQL from now on), but
some examples will illustrate key points.
A search can consist of a word or phrase. To search
for a word which is also a keyword in the language, it may
be enclosed in quotation marks:
programming language
"containing"
Equally important, since DynaText is designed to allow
the exploitation of markup, a search can consist of an
element type's generic identifier, distinguished from a word
to be sought in the text by being enclosed in angle
brackets:
<title>
Context can be specified using the keyword
inside
or in
, optionally modified
by directly
:
programming language inside <title>
<title> directly inside <book>
The first finds occurrences of the phrase
programming language occurring inside
a
title
element. The second finds
title
elements occurring as children of a
book
element.
Conversely, content can be specified using the keyword
containing
, also optionally modified by
directly
:
<chapter> directly containing <title>
containing syntax
All operators in DTQL are right-associative, so the
search just given is equivalent to the following:
<chapter> directly containing
(<title> containing syntax)
These searches find all
chapter
elements within
which there occurs (at any nesting depth) a
title
element which contains the word
syntax.
Note that while
<title> containing programming language
and
programming language in <title>
describe the same set of document structures, the value of
the first is a set of
title
elements, while the value of
the second is a set of word occurrences. In the interactive
interface to DynaText, this difference was reflected in the
highlighting of hits and in the hit counts displayed in the
table of contents, which was itself typically displayed in a
navigation bar and always visible.
Boolean operators are also allowed:
<bibl> not containing <title>
(<element-citation> or <mixed-citation>)
containing <publisher-name> not containing
"Elsevier"
<title> containing theory and practice
<title> containing practice
and not theory
The last two examples illustrate a slightly curious
aspect of DTQL's boolean operators: the one search finds
title
elements containing both the word theory and the
word practice, the other finds title
elements containing
one word but not the other. The right-hand argument of the
and
is not interpreted as a full query, but as a second
operand for the containing
operator. This feels natural to
native speakers of English (and perhaps other languages),
but it is trickier to implement than a naive approach might
expect. In particular the query
<title> containing practice and not theory
does
not first evaluate the queries
<title>
,
practice
, and
theory
, then
construct the complement of the third query (every word that
is not
theory), then take the set
intersection of the occurrences of the word
practice and the set of words which
are not the word
theory, and so
on.
DTQL also has a proximity operator, for which distance is
specified in words:
proximity within 5 words of search
markup within 10 words of descriptive or prescriptive
The samples do not exhaust the DynaText query language,
although they come close: it is a useful language but gets
its power from a very small number of constructs.
Why not XQuery? Or even just XPath?
An obvious reaction to the announcement of yet another
query language for documents, especially at Balisage, is to
ask why. Why do we need another language for document
queries? Doesn't XQuery exist? Or if XQuery is deemed too
complicated, why not just use XPath? There are two reasons
one might want to design a new query language, even though
XPath and XQuery exist.
The first is ease of acquisition by users.
Like many Balisage attendees, I often work with domain
experts who are less technically inclined than I am and less
tolerant of the pernickety syntax often characteristic of
languages designed to be parsed by machine. The XML
community is justly proud of the number of domain experts
who do not self-identify as programmers but who are able to
learn and use XPath and XSLT. But many domain experts never
do learn XSLT, or even XPath.
When I build a web interface for a collection of XML
documents, one of the first things I like to do is to
provide a simple search interface that allows a user to type
in a simple XPath expression and see a list of the XDM nodes
matched by the expression. But
often, my hopes that the domain experts with whom I work
will learn enough about XML and XPath to be able to use it
are dashed. As far as I know, when I have made XPath
interfaces for collaborative projects, I am the only project
participant who ever uses them. From time to time, others in
a project have attempted to learn XML, and XPath, and
XQuery. But often, domain experts find it unnerving to
venture from a field in which they are experts into a field
in which they are novices, and since they typically have
achieved a full and rewarding professional life without
knowledge of any computer languages, it is understandable
both that they don't have a lot of time to focus on learning
new languages and that the frustrations known to anyone
learning something new and very different will often lead
busy people to abandon the attempt.
It is unlikely that Ariadne is a silver bullet for this
problem.
It is quite possible that some users will resist any
query interface involving anything recognizable as syntax or
keywords. The logical consequence of such syntax phobia
among users is a search interface with a single box in which
to type a query, and no visible rules for how to type it,
such as is now familiar to any user of Google, Bing,
DuckDuckGo, or almost any other web search engine.
But anyone who has used a single-box search interface to
look for structural patterns in documents will remember how
disappointing the results usually are. A search for a
chapter whose title includes the word
syntax, in a book whose title
includes the phrase programming
language, is one thing. A search for a chapter
whose title includes the words programming
language, in a book whose title includes the
word syntax, is something rather
different. A query interface based on
a bag-of-words document model, with no knowledge of
chapters, books, or titles, cannot comfortably capture the
difference.
One of Ariadne's goals is to serve users who don't feel
comfortable with most artificial languages but who do
understand how their documents are tagged, and who have some
interest in being able to use structural information in
queries.
The second reason for not just using XQuery or XPath is:
trees.
XPath and XQuery and XSLT are designed to work on XML
documents, and specifically on XML documents viewed through
the XPath Data Model (XDM), or more generally any
information that can be viewed through that model, whether
its serialized form is XML or something else. Since I was
not involved in the initial design of any of those
languages, I can say without immodesty that they do a good
job.
But it is well known that documents often exhibit
structural patterns that are not tree-shaped, or at least
not shaped like a single tree. A bewildering number of
alternate proposals have been made, for an equally
bewildering variety of text models. (The number of models is
slightly lower than the number of proposals, since in some
proposals appear on closer examination to be variations on
the same basic idea.)
It would, I think, be helpful for users of the
text-as-graph model, the LMNL model, the Goddag model, the
concurrent-hierarchies model, if it were possible to
formulate queries against those models. Of course, if as
sometimes happens the actual documents are stored in XML,
the documents can be queried in their XML form using XPath
and XQuery. But models are not just guides for implementors;
they are tools for thinking about things. It would be
better, when we are using a non-tree-shaped model of text,
if we could formulate queries against the model we want to
use, instead of translating our queries into some sort of
slightly awkward XPath.
One possible caveat to this line of reasoning should be
addressed. XPath is indeed defined in terms of the XPath
Data Model, but nothing in its syntax requires that the
parent
axis return a single node. If we were to
allow nodes to have multiple parents, we could apply XPath
to any document model involving a directed acyclic graph.
And if we relaxed other constraints like the rule that no
node is its own ancestor, we could apply XPath to cyclic as
well as to acyclic graphs. So if we wished to apply XPath to
Goddag structures, or to concurrent hierarchies, or to
text-as-graph instances, we could do so, at least in
principle. And indeed there have been proposals in the past,
some of them presented at Balisage or Extreme Markup
Languages, to use XPath as an expression language for
navigating arbitrary directed graphs.
To anyone who can easily implement the whole of XPath
over a non-XDM model, there is clearly no pressing need for
a different language. The problem for others is that even in
its version 1.0, XPath was large enough to require
non-trivial implementation effort. Now, in XPath 3.0 and
3.1, the language has become Turing-complete and
implementing it from scratch looks rather daunting.
Ariadne's advantage in this context is that in comparison
with XPath, it's tiny, and it's possible to imagine a
relatively straightforward implementation that does not
require programmer-years of effort.
Origins of Ariadne
The project that has turned into the design and
specification of Ariadne began life as a pair of
low-priority free-time projects to re-implement both Arras
and the DynaText query language, as an exercise in nostalgic
retro-computing.
The feasibility of re-implementing Arras became clear to
me when I reflected that its text model could be implemented
as a single SQL table in which each row contains a token, a
word form (the token, stripped of punctuation), a word
number, a sentence number, a chapter number, a line number,
a page number, and a volume number. Arras commands could in
principle be translated into SQL queries. John B. Smith
implemented his own inverted-file index, but I was willing
to farm all of those details out to a SQL database.
Once the key idea of letting a database management system
handle the boring bits like storage details became clear, it
immediately became obvious that the Arras text model could
also be implemented quite simply in an XQuery database,
using either element containment or milestone elements to
handle the structural units. Since in Arras all structural
units partition the entire text, simple milestones suffice
and the more complex mechanisms of Trojan-Horse markup are
not needed.
Since DynaText is designed for search and retrieval in
SGML documents, it seems self-evident that it should be
relatively straightforward to implement a DynaText front end
which passes queries to an XQuery back end for evaluation.
Since DTQL seems very accessible even to non-technical
users, the idea of re-implementing DTQL sometimes seemed
more than an exercise in nostalgia: it could be useful to
real projects I have been engaged with. But the idea never
had high priority, so it never got beyond a few
sketches.
Recently, however, whiling away an idle hour with an
attempt to figure out a way to describe the behavior of
and
and or
in DTQL cleanly enough
to guide an implementation, I realized that DTQL requires a
model of text in which two nodes can be in a parent/child
relation (or not), or more generally an ancestor/descendant
relation, but does not require that the model form a tree.
Nothing breaks in the semantics of the query if an element
(or a word) can have more than one parent.
At this point I began to think that an implementation
might have more practical use than I had first imagined, and
that a new query language inspired by Arras and DynaText
should have a higher priority than historically accurate
re-implementations of the two languages.
Design decisions
The basic design approach of Ariadne was summarized by
Steven J. DeRose, the designer of the DynaText query
language, thus:
Give users a very limited and very formulaic subset of
English — it's pretty easy to implement (because so
limited), yet users can read queries as if they were
just English
, and can form new ones by analogy
— all they really have to learn is to avoid being
creative (which, when they try it, provides negative
reinforcement for free).
DynaText has four query operators, not counting Booleans,
which require special treatment:
-
INSIDE
, optionally modified by
NOT
and/or DIRECTLY
-
CONTAINING
, optionally modified by
NOT
and/or DIRECTLY
-
WITHIN
n
WORDS
OF
-
WITH
(which searches the attribute axis
of XDM)
The
INSIDE
and
CONTAINING
operators are of course inverses of each other, moving
vertically in a document tree.
Ariadne extends this set of operators modestly to allow
searching based on document order among siblings or among
all nodes and to allow proximity to be measured,
Arras-style, in any structural unit, not just words.
It should be noted that the only basic relations required
to support these operators are parent-child,
prec-next, and
element-attribute-value. No assumption is
made that a child can have only one parent, so essentially
any binary relation on structures in a document can be used
as the parent-child relation. If the relation is cyclic,
some results may of course puzzle users used to thinking of
the parent-child relation as acyclic.
Like Arras and DTQL, Ariadne lacks any means of
constructing new values of any kind: all Ariadne expressions
denote sets of locations within texts, where from the user's
point of view a location
is either a
tagged element in the text (e.g. a paragraph, title, or
chapter) or a word. From the implementation point of view,
at document ingestion time the document is tokenized and all
words are wrapped in w
elements, so for the
implementation, all values are sets of locations, and all
locations are elements in the input document. This provides
a simple uniform model and makes implementation on top of
XSLT or XQuery or XPath itself conceptually very simple.
The absence of constructors makes Ariadne strictly weaker
than XQuery or XSLT. It cannot be used to perform
arithmetic: an expression like 2 + 3 is not an Ariadne query
or expression. It cannot be used to make a list of all word
forms: Ariadne has no equivalent to the XPath
distinct-values() function, and indeed no user-callable
functions at all. Its intended users are not programmers who
need to be able to specify and perform arbitrary
computations, but readers who want to be able to describe
and find locations in texts.
For some situations, it is convenient to have some notion
of a location which is an appropriate size to return in a
list of search results. I'll call such locations chunks
.
In most modern prose, a sentence, a list item, a heading, or
a line of verse will be a reasonably good chunk. Sometimes
it might be better to use a larger chunk, for example a
paragraph. The only constraint on chunks is that every
word in the document must be in some chunk; it may sometimes
be desirable to specify further that no chunk
contains any other chunk, i.e. chunks must partition the
document.
To simplify the use of Ariadne for document owners, an
implementation of Ariadne is expected to define some notion
of chunk
without intervention by the
document owner; a simple default is to accept any element
containing at least two words as a chunk. Another possible
default (up to the implementor) is to break the input
document up into sentences or sentence-like objects based on
punctuation and capitalization and use them as chunks. A
document owner should however be able to specify explicitly
which elements are to be treated as chunks. And of course, a
document owner can explicitly mark sentences (or
sentence-like objects) with an element such as the TEI
s
element, and thus override the default behavior
of the implementation. The same goes for tokenization into
words: if the default behavior is suitable, the user need do
nothing special, but if the user wants another behavior,
they can control the behavior of an Ariadne search engine by
using markup to identify what should be treated as
words.
The design principles of Ariadne are thus:
-
Queries should be a very limited, very formulaic
subset of English.
-
Query operators should allow qualification of
searches both vertically (by context and content) and
horizontally (by relative position in document
order).
-
Proximity searching should be possible using
words or any structural elements as units.
-
The language must be simple to implement.
-
Ariadne should be applicable to any model of text
in which structural units can be construed as having a
parent-child relation and an ordering. Just as structural
units may have multiple parents, they may also have
multiple immediately following siblings.
-
It is acceptable for searching of a document to
be possible only after that document has been ingested in
some way (indexed, scanned, imported, ...); in practice it
is in the process of import that the system will perform
services like recognition of implicit markup of words,
lines, and sentences.
-
Anything the indexing system does without markup
(e.g. the identification of words or sentences) should be
done by injecting markup in the document. The user should
be able to change the relevant behavior by supplying
explicit markup for the phenomena in question, for example
by exporting the document, changing the markup, and
re-importing the document.
-
The text model used by Ariadne should be
compatible with (i.e. its assumptions and constraints
should where possible be a subset of) as the most
prominent non-XDM document models: multi-colored trees,
concurrent hierarchies, text as graph, and Goddag
structures. Compatibility with LMNL is a desideratum.
One topic which has been ruled out of scope for this
version of Ariadne is the ability to qualify searches by a
particular hierarchy, analogous to the qualification of
XPath axes with colors in Colored XML. It may be helpful,
but I do not currently understand the requirements well
enough to design it. It is hoped that experience with
Ariadne will help to indicate whether extension in this
direction is needed and what kind of extension would be
helpful.
Grammar
The basic form of the language is straightforward: a query
is a basic query, followed by zero or more qualifications
using the filter operators. Since some filters apply only to
one kind of query, word queries and element queries are
distinguished syntactically. Also, queries can be combined
using Boolean operators, as can filters.
The grammar will be given in ixml notation, for which see
Pemberton 2013 and Pemberton 2021.
Starting at the top: a query is either a word query, an
element query, a set of basic word queries joined by Boolean
operators, or a set of basic element queries joined by Boolean
operators.
query: single-query | multi-query.
single-query: word-query | element-query.
multi-query: base-word+boolean
| base-element+boolean.
Each kind of query consists of a base query followed by zero
or more filters, connected by Booleans.
word-query: base-word, (s, word-filter*boolean)?.
element-query: base-element, (s, element-filter*boolean)?.
A basic word query is sequence of tokens, or a parenthesized
word query:
base-word: tokens
| '(', s?, word-query, s?, ')'.
tokens: token+s.
token: word | pattern | string.
A word is a sequence of word characters or a quoted string.
Ambitious implementations may also provide either simple
glob-style patterns using asterisk and question mark or
regular expressions. Those are not described here, though they
are likely to be important in practice.
A basic element query is a generic identifier enclosed in
angle brackets. As in XML, we allow whitespace before the
closing angle bracket.
base-element: '<', name, s?, '>'
| '(', s?, element-query, s?, ')'.
Word queries may be filtered by word proximity, by element
proximity, horizontally (by left-right context), or vertically
(by ancestor-descendant context). Element queries may be
filtered by attribute, by element proximity, horizontally, or
vertically.
word-filter: word-proximity-filter
| element-proximity-filter
| left-right-filter
| context-filter.
element-filter: element-proximity-filter
| left-right-filter
| context-filter
| content-filter
| attribute-filter.
Word proximity filters take the form
w1
within
k
words of
w2
, or
w1
followed within
k
words by
w2
, or
w1
preceded within
k
words by
w2
.
word-proximity filter:
NOT?, WITHIN, n, WORDS, OF, word-query
| NOT?, PRECEDED, WITHIN, n, WORDS, BY, word-query
| NOT?, FOLLOWED, WITHIN, n, WORDS, BY, word-query.
Element proximity filters are essentially the same, but
specify an element type instead of using the keyword WORDS.
element-proximity filter:
NOT?, WITHIN, n, units, OF, single-query
| NOT?, PRECEDED, WITHIN, n, units, BY, single-query
| NOT?, FOLLOWED, WITHIN, n, units, BY, single-query.
units: base-element, ELEMENTS.
Left-right filters use the keywords PRECEDED
and FOLLOWED
, but no proximity measurements. They
can be preceded by NOT
and DIRECTLY
.
left-right-filter:
NOT?, DIRECTLY?, PRECEDED, BY, single-query
| NOT?, DIRECTLY?, FOLLOWED, BY, query.
Context filters use the keyword INSIDE
.
context-filter: NOT?, DIRECTLY?, INSIDE, element-query.
Content filters use the keyword CONTAINING
:
content-filter: NOT?, DIRECTLY?, CONTAINING, single-query.
Attribute filters use the keyword WITH
.
attribute-filter: WITH, attribute-test.
attribute-test: name, NOT?, NULL
| name, NOT?, comparator, value.
comparator: "=" | "<" | ">" | "<=" | ">=".
value: token.
Omitted here are rules for the keywords NOT
,
DIRECTLY
, PRECEDED
, etc., which are
defined to accept either lowercase or uppercase spelling and
to allow or require surrounding whitespace.
An XML representation follows directly from the ixml
specification; experience may lead to a reformulation of the
grammar in order to get a different XML representation.
The Ariadne text model
It may be observed, by inspecting the semantic descriptions
offered in the preceding section, that Ariadne searches may be
performed on any collection of data that can be represented as a
set of nodes, on which a certain number of relatively simple,
relatively general relations are defined.
Nodes can be words, or elements. Relevant predicates are:
-
word(n): true if node n is a word
-
element(n): true if node n is an element
-
node(n): true if n is a node
-
chunk(n): true if node n is a chunk
The word and element predicates are not mutually exclusive:
as mentioned earlier, internally Ariadne processors are
expected to represent words as w
elements. The chunk
predicate holds of some elements.
In most obvious cases, all nodes will be either words or
elements, but in principle other node types are possible,
though they cannot be retrieved using Ariadne as currently
defined.
Words have word forms, and elements have names:
-
form(n, w): true if node n is a word and w
is the (or: a) word form of n
-
name(n, gi): true if node n is an element and
gi is its name
One or more partial orders are defined on the nodes in the
document. The prec-next relation holds for nodes which are
immediately adjacent in some partial ordering; the
preceding-following relation is its transitive closure.
-
prec-next(n1, n2): true iff n2 immediately
follows n1 in some ordering; this can be thought of a
recording a next-pointer
from n1 to
n2, not necessarily the only next-pointer from
n1.
-
preceding-following(n1, n2): true if n2 can be
reached by following a series of next-pointers beginning at
n1 and ending at n2; the transitive closure of the next
relation
Either of these may be derived from the other. If
preceding-following
is specified and is
acyclic, then prec-next can be derived by taking its
transitive reduction. If prec-next is specified,
then
preceding-following
may be derived from it
by taking its transitive closure; in this case,
prec-next need not be acyclic.
A parent-child relation is also defined on the nodes of
the document, together with its transitive closure
ancestor-descendant. If these are acyclic, then either can
be specified and the other derived from it; if they include
cycles, then parent-child must be defined and
ancestor-descendant derived from it.
-
parent-child(n1, n2): true if n1 is a parent
of n2
-
ancestor-descendant(n1, n2): true if n1 is an
ancestor of n2; this is the transitive closure of the
parent-child relation
As in SGML or XML, elements may have attribute-value pairs;
the comparison operators assume that values are represented as
strings, but may denote numbers. Ariadne is agnostic on whether
attributes are nodes (as in XDM) or simply properties attached
to element nodes.
-
e-an-av(n, an, val): true if n1 is an
element with an attribute with the name an and the value
val.
For proximity searching, the distance relation is crucial.
It measures the distance between two nodes, using an arbitrary
element type as the unit of measure.
-
distance(n1, n2, gi, d): true if the
distance from n1 to n2, moving forward in the text, can
be measured at d elements of type gi. If the second
argument precedes the first, d is negative.
The semantic formulae given above do not assume any
particular relationship between the distance relation and the
next and preceding-following relations. In order to ensure
that searches involving the filters DIRECTLY FOLLOWED
BY
and FOLLOWED
WITHIN
1 ...
BY
produce plausibly related results, however, it
is probably best if they are systematically related.
When all nodes in a document are totally ordered, as in
XDM, one possible approach is to define distance in terms of
next, along the following lines. One possible definition is as
follows. For any nodes n1 and n2, and any generic
identifier gi, the following hold:
-
The distance from any node to itself is 0. That is,
distance(n1, n1, gi, 0).
-
If prec-next(n1, n2) and gi is the name of
n2 (name(n2, gi)), then the distance is 1:
distance(n1, n2, gi, 1).
-
If prec-next(n1, n2) and gi is not the name
of n2, then the distance is 0: distance(n1, n2,
gi, 1).
-
If there exists some node n3 such that
prec-next(n1, n3), and gi is the name of n3, and
the distance from n3 to n2 is d, then the distance
from n1 to n2 is 1 + d.
-
If there exists some node n3 such that
prec-next(n1, n3), and gi is not the name of n3, and
the distance from n3 to n2 is d, then the distance
from n1 to n2 is also d.
Operationally, this amounts to walking the path from n1 to
n2 by following next-pointers, incrementing the count each
time an element with the specified generic identifier is
encountered.
In a text model with multiple paths from n1 to n2, a
distance is measured along each path.
A few relations are defined on numbers, strings, and
attribute values:
-
num-abs(num1, num2): true if num2 is the
absolute value of num1.
-
str-num(v, num): true if string v is in the
lexical space of some XSD numeric type, and num is the
corresponding value.
If both arguments to the following relations are numbers, or
strings that can be coerced to numbers, then the comparisons are
numeric; otherwise, the comparisons are string comparisons using
an implementation-defined Unicode collation sequence.
-
equal(v1, v2): true if v1 and
v2 are the same value
-
lt(v1, v2): true if
v1 is less than v2, numerically if they are both numbers,
otherwise lexicographically