Quin, Liam R. E. “Text Retrieval for XML-Encoded Corpora: A Lexical Approach.” Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). https://doi.org/10.4242/BalisageVol1.Quin01.
Balisage: The Markup Conference 2008 August 12 - 15, 2008
Balisage Paper: Text Retrieval for XML-Encoded Corpora: A Lexical Approach
Mr Quin has been involved with declarative, descriptive markup
since the early 1980s. He wrote his open-source text retrieval system
and first distributed it in the late 1980s.
He has worked at the World Wide Web Consortium since 2001, where
he is XML Activity Lead, or, informally, Mrs XML.
This paper describes some modifications done to an open source
text retrieval package to make it XML-aware, and contrasts this lexical
approach, in which XML documents are primarily treated as sequences of
characters rather than trees, with the W3C XPath 1.0 and XQuery 2.0
Full-Text facility.
Specific usage scenarios are taken into consideration, including
World Wide Web publication and the searching and analysis of text
corpora for research purposes.
The W3C XML Query Working Group has published a specification for
performing full-text queries over instances of the XPath and XQuery Data
Model using an extension of the XQuery syntax. This is a text retrieval
facility that operates on an abstract representation of XML trees, rather
than on text files that happen to contain markup. Elements and their
attributes are reified into hierarchies of nodes, text leaps into the
lacunæ and swims between them, and not a pointy bracket in sight.
This paper compares the XQuery Full Text Facility with a more
traditional open source text retrieval system, lq-text, and also explores
the work done to make lq-text become more suitable to the processing needs
of people who work with XML.
Disadvantage and advantages of the two approaches are
discussed.
A Brief Description of the Full Text Facility
Although this paper is primarily concerned with a lexical approach,
an understanding of the XPath 2 and XQuery approach is useful, and will be
taken as a baseline for comparison.
Informally, a full text search is a search to find all documents in
a collection, or all elements of some specific type (for example)
containing one or more specific words. For example, one might want to find
all occurrences of the phrase “warm socks” in a multi-gigabyte corpus of
text. The underlying assumption of full text is that the implementation
uses an index that has been constructed separately in advance, although
this is not necessarily true.
Primary characteristics
XQuery 1.0 and XPath 2.0 Full-Text 1.0 [W3C Full-Text, 2007] extends XPath 2.0 (and XQuery 1.0 in turn,
which itself extends XPath 2.0) to add support for explicit syntax for
full text searches.
XPath 2.0 is node-based, matching text nodes which are contained
by element nodes in a collection of XML document trees. The result is a
Boolean value (when used in an XPath predicate) together with an
optional numerical score or ranking.
The Full-Text facility includes a large number of possible
modifiers, many of which are optional features and may or may or be
available in any given implementation. These include (for example) both
query expansion through a thesaurus and also query narrowing using a
different sort of thesaurus. One can search for two tokens (words, for
English) within a certain number of tokens, sentences or even
paragraphs. The optional features are marked as being “at risk” in W3C
parlance, meaning that unimplemented (or unimplementable) features will
be dropped from the draft specification before it is published as a W3C
Recommendation.
A Brief Description of lq-text
Lq-text is an open source text retrieval package that was first
released in 1989. It has had sporadic development since then. Its main
claims to fame are high precision, good performance (particularly when the
data does not fit into available virtual memory), flexible concordance
generation and an open, extensible, multi-process architecture.
Lq-text operates on text files. It makes an index to the files; this
index stores the location of each occurrence of each natural-language word
in all of the files. The resulting index is stored efficiently, and
generally takes between a quarter and three quarters of the storage size
of the original documents. The index is an adjunct; lq-text also refers to
the original files, although these can be compressed to save space if
needed. The package is designed to work best with many small files rather
than a few large ones.
When lq-text indexes files, it can run a format-specific filter on
each file before indexing it. The list of filters is currently built in to
the software (but since it is open source, you can in fact change it if
you wish).
A suite of separate Unix programs operate on the index for
retrieval; some of these will be described in this paper. They are used in
conjunction with each other, using a documented text-based format to
communicate.
It is this open architecture that can be exploited to enable
XML-specific searches, and that is the primary work described in this
paper.
Commonalities Between The Approaches
An underlying assumption is that some sort of indexing will have
been performed before queries are run; this is of course for all full-text
systems, and although in some cases the constructed indexes do not persist
between invocations of the query software, usually the indexes are kept
and re-used.
Although the Full-Text facility operates on trees and lq-text
operates on flat text files, in practice both systems are matching
sequence of tokens against an index, and returning matches based on text
content.
The XQuery Update Facility allows queries to update documents, and,
as a result, implementations must be able to re-index documents
efficiently. Lq-text can also re-index documents, most efficiently when
both the original and the new version are available.
Lq-text and XML: Objectives
The author wanted to experiment to understand what work would be
needed to make lq-text be useful for people working with XML documents.
Some goals of this work included:
Make minimal changes to the architecture and index and match
format, because of limited programming resources;
Retain a small index and efficient retrieval;
Solve common use cases rather than providing extensive and
general mechanisms.
Although lq-text was not (at the start of the work) XML-aware, it
has the ability to run a format-specific filter program when indexing any
given document. There was already an SGML filter, but all it did was
ensure that element and attribute names were not indexed. This filter was
re-used for XML, modified to allow indexing of elements and attributes.
But at that point the work had only begun.
The following use cases were determined sufficient for
experiments:
Identify all documents containing two or more phrases in the
same element, for any given element;
Refine the search to an element with a specific attribute set to
a given value;
Highlight the matches of the search in context;
For a given match, print the parent element and its content, or
the contents of the parent tag, or a given attribute value, or the
name of the parent element... possibly constrained to any named
ancestor element not just the parent.
This of course is much less than one might want in a full XML-aware
text retrieval system. On the other hand, the XPath-based approach taken
by the Full-Text facility does not support highlighting of matches or
generation of concordances, and the author felt this to be essential
functionality, both for research and for industrial or commercial
use.
The approach taken was to extend lqkwic, the concordance program, so the paper will
describe the lq-text architecture and then lqkwic, and then explain the extensions that were
added. After that, an example program will be shown that uses lqkwic to solve one of the use cases given above.
At that point we will be able to compare an XQuery or XSLT 2
solution.
Support for a subset of XML (“just enough XML, Eh?”) was
implemented; this subset will also be described, as it may be of interest
for other people considering adding XML support to older software.
Lq-text Architecture in Detail
Before explaining how lq-text was extended, it is necessary to give
at least an abbreviated account of how lq-text works.
Lq-text builds and maintains a separate index for each set of
documents, which it calls a database.
When building the index, lq-text applies simple stemming, by reducing words to a root. Currently,
only plural and possessive forms are recognised and recorded, and other
forms are indexed separately. This code is specific to the English
language, and may be removed in a future version, with stemming instead
being done by term expansion at query time.
Lq-text comprises a suite of separate programs, and each program
always uses a single database. For the sake of simplicity in this paper we
will assume that only a single lq-text database is in use at any time,
unless otherwise stated.
Some of the programs included with lq-text are listed for reference
in the table. Only a few of them will be discussed further in this paper,
but the table may give the reader a clearer sense of the software.
Table I
Lq-text Programs
Program
Purpose
lqaddfile
Used to add documents to the index, and to manipulate the
index.
lqunindexfile
removes a file from the index.
lqphrase
matches one or more exact phrases
lqquery
matches words or phrases, but supports wildcard expansion
lqrank
reorders results based on the number of documents matched
(quorum ranking)
lqsort
sorts matches by various criteria e.g. by the word before the
match
lqshow
text-terminal (curses) program to show matched text
lqsed
process documents, highlighting matches by insertion
lqkwic
the main keyword in context concordance program
Once an index is built (for example with lqaddfile), it can be used.
A sample search might be as follows:
$ lqquery "on his face" | lqkwic
For one small corpus (Brewer's Dictionary of Phrase and Fable, with
about 17,000 files) the results are as follows:
==== Document 1: xml/1251.xml: Balafré ====
1:t which left a frightful scar on his face (1550–1588). So Ludovic Lesly, an
==== Document 2: xml/3720.xml: Cloud ====
2: He [Antony] has a cloud on his face.
==== Document 3: xml/6070.xml F ====
3: F is written on his face. “Rogue” is written on his face
4: face. “Rogue” is written on his face. The letter F used to be branded n
==== Document 4: xml/8745.xml Ill Omens ====
5: he happened to trip and fall on his face. This would have been considered a
6: shore at Bulverhythe he fell on his face, and a great cry went forth that i
Here, the matched text is shown with a few words of context on
either side, giving rise to the term key word in
context, KWIC, index.
Two lq-text programs, lqquery and
lqkwic, were combined in the search,
using a Unix pipe; that is, both programs were run concurrently, with the
output of one being fed as the input to the other. This is a usual way of
working with lq-text, and although it sometimes requires some thought, it
does mean that lq-text exploits multi-processor systems well, and also
works well with Unix and Linux, which were designed to run pipelines of
small programs very efficiently.
This description begs the question, exactly what output is passed
from lqquery to lqkwic in the example? The answer to this question
exposes the underlying index architecture, and can be seen by running just
the first program without the second:
The format, as can be determined by inspection, is a sequence of
lines of text, and, in each line, a number of space-separated fields. Each
line represents a single match, and just as there were six results before,
there are six matches here. The fields are, from left to right, the number
of words matched, the block in the file, the word in the block, the file
number and (optionally) the filename.
The lq-text index does not store exact locations for matches.
Instead, the location to the nearest block number, and the word within the
block, are stored. Blocks are by default 128 bytes in size. The result of
this is that a match location within a file is usually represented by a
pair of fairly small integers, but that finding the actual intended words
to highlight requires accessing the file and counting words. This is a
trade-off: a lq-text index is often much smaller than the indexed files,
because the average English word is about 5 characters long (depending
somewhat on the corpus), and it only takes 2 bytes in most cases to store
the information about a match.
Lines in the match list starting with a # are
considered to be comments, and lines of the form
{ variable = value } are used by
lqkwic to set values that can be used
later, as we shall see.
Lq-text programs generally both accept this match format as input
and produce it as output, so that they can be combined. In particular, the
lqkwic program can both read and produce
this format, as we shall see in the next section.
The lq-text lqkwic program
The lqkwic program takes lq-text matches as input, and prints them
using a user-supplied format, or a built-in format. Matches are grouped by
file, and another format is used to print the start of each group of
documents, and yet another can be supplied to be used at the end of each
group.
The format takes the form of a string with embedded variables that
are interpolated each time the format is used. An example may clarify the
format:
$ lqquery "on his fa*" |
lqkwic -S '' -A '' -s '${MatchNumber} ${MatchedText}\n'
1 on his father
2 on his face
3 on his favourite
4 on his face
5 on his father
6 on his face
7 on his face
8 on his father
9 on his favourite
10 on his face
11 on his face
Here, the formats for the start and end of each group of matches
have been set to the empty string with -S '' and -A
'' respectively. The per-match format is set to a string in which
for each match the match number is printed, followed by a space, followed
by the matched text and (indicated by \n in the grand Unix
tradition) a newline. The single quotes are used to surround the strings
to prevent the Unix shell from seeing the dollar signs and treating them
as references to shell variables.
Although the MatchedText variable is obviously useful for testing,
one would normally use it in conjunction with other variables, such as
TextBefore and TextAfter. The purpose of this section is not to document
lqkwic, but to give the reader an understanding of the sorts of things one
can print, since lqkwic has uses that are far removed from concordance
generation, and since we will shortly be taking advantage of such
uses.
The following table shows some of the variables available. In many
cases, lqkwic must read the actual matched documents (or at least part of
them), in order to evaluate the variables.
Table II
The lqkwic formatting variables
Variable
Description
DocName
the name of the current document, as stored in the database
FileName
the absolute path corresponding to ${DocName}
DocTitle
the title of the document
FID
the File Identifier Number of the document (an integer)
FileNumber
starts at 1, increases for each new document in the output
BlockInFile, WordInBlock
these determine the location of the match
NumberOfWordsInPhrase
the length in words of the phrase matched
TextBefore
the text in the document immediately before the match
MatchedText
the document text that exactly matches the phrase
TextAfter
the text in the document immediately after the match
MatchNumber
starts at 1 and increases for each match
MatchWithinFile
like MatchNumber but reset for each new document
StartByte
the byte offset in the file at which the match begins
EndByte
the byte offset in the file at which the match ends
MatchLength
length in bytes of ${MatchedText} (EndByte - StartByte)
There are also constructs for formatting variables, for padding them
to a given width (measured in Unicode characters, not bytes), and for
filtering them through routines that delete punctuation, convert
punctuation to spaces, perform case conversion and so forth.
Extending lqkwic
The following XML-specific variables were added as an experiment to
try to understand how viable the approach would be:
Table III
XML-specific Variables
Variable
Description
XML.Parent.Tag
The content of the containing element's tag, between the
angle brackets
XML.ContentBefore
Content up to the > of the start tag of the immediately enclosing parent element (including
any tags and content that open and close entirely between the match and the parent
tag)
XML.Parent.Name
the name of the parent element
XML.Parent.EndTag
the content of the parent element's end tag
XML.ContentAfter
content up to the < of the parent's end tag
It is not clear that this is sufficient to answer our use case of
finding multiple phrases in the same XML element. To do that, we would
need a way to identify parent elements and compare them.
One could use the File number and the byte offset of the matched
text (${StartByte}), but this is not sufficient, because
there may be close and open tags between matches of two phrases.
One approach to finding phrases with a common containing element
named (of type) E would be to find all of the start and end tags for E,
and then use the file, block and word within block numbers to perform
range algebra.
But it would be more efficient if this were not needed. In a corpus
of many files, it is likely that the element E will occur in many files,
perhaps many times, and searching for them all will be too slow.
If lqkwic could print the location of the parent tag, a much simpler
faster algorithm would be possible.
The notation ->startbyte or
->endbyte was added; after any XML variable name, it
generates the corresponding byte offset in the matched file.
In addition, the notation XML.parent.Tag.e was added,
to be similar to the XPath notation ancestor::e; it is
possible that a future version of lqkwic
will use the XPath notation, as long as there is no danger that users will
be confused into thinking that lq-text is using a node-based model
internally.
The search for a parent tag is implemented by reading the matched
document at the block containing the match, and for some distance
beforehand. lqkwic then searches backwards from the match to find an open
tag which has no corresponding close tag in the intervening distance. It
is worth noting that this sort of approach is not generally possible with
SGML, where empty elements have no end tag. The syntactic innovation of
XML was to require empty tags to have a trailing slash, as in <p/>
or <p id="p301" />, and this enables the software to skip empty
elements reliably. Start and end tags can be skipped more easily of
course, although the algorithm used for backwards parsing does rely on
attributes not containing unquoted < or > signs.
Unfortunately, backwards parsing suffers from a major drawback: the
search for the parent tag will fail if it is too far away. Although lqkwic
could in theory read arbitrarily back in the file, this could mean that
presenting matches in a dictionary would be very expensive, with every
match processed necessitating a search back to the start of a large
document.
In practice, an in-memory cache may be sufficient to achieve
reasonable performance in most cases. Another possibility might be to
store parent pointers in the index. For now, lq-text is primarily intended
for working with many thousands of small files; use XSLT to split large
files before indexing them.
A sample program
We are now in a position to find all elements E that contain all of
a set of phrases P0 ... Pn, as follows:
First, match the phrases, and, for each match, use a format of the
form ${xml.contentbefore.E->endbyte} to find the end byte
of the start tag of the parent element of type E; that is, the location
just after the > at the end of the start tag. If two
matches have the same value for the start tag, and are in the same file,
then they share the same XML ancestor E.
We can match the phrases with a single invocation of lqrank except for one difficulty: there is no way
to determine, for a given match, to which phrase it corresponds, so we
cannot determine whether an element contains all of the phrases.
The lqrank program has the ability
(when instructed with the -g option) to output a line,
{ q = N } where N is an integer, to
identify to which result set the following matches correspond. This is
available to lqkwic formats as the
variable g.q (the g stands for glue, the unpublished and unfinished lq-text
integration language).
Using this, it becomes a relatively simple matter in a language such
as Perl, Python or even the Unix shell, to run
print phrases one per line |
lqrank -r all -g -F - |
lqkwic -s '${FID} ${g.q}
${xml.contentbefore.E->endbyte} ${Match}\n'
Each match is in this way prefixed by the numeric identifier of the
document in the index (FID), the phrase number and the byte offset of the
end of the nearest ancestor E element's end tag. The
-F - option makes lqrank read the list of phrases to match from its
input, rather than expecting them as command-line arguments; one could
also use the Unix xargs program for this
purpose.
Next we must group the matches by file identifier and startbyte, and
if every different phrase occurred at least once, we print all the matches
for that file identifier and startbyte.
The result can then be fed to lqkwic to generate a concordance, or perhaps to
fetch information about the parent element, or both.
The program outlined here (and given in full in the appendix, in the
Perl programming language) is intended as an example of the sort of
flexibility that might be achieved as lq-text becomes more XML
aware.
Unicode
In 1988, the use of 8-bit character sets was pretty usual; lq-text
is at least 8-bit clean for data, so that conversion to UTF-8 seemed a
simple matter, and also has some locale awareness. There were two tricky
parts to the process of adding UTF-8 support. The first was to ensure that
characters, rather than bytes, were counted when formatting, and of course
that a UTF-8 octet sequence was never split part-way through.
The second difficulty was much harder: making sure that combining
characters are never split from their corresponding base character. This
last is not yet complete, but initial work using the GNOME glibc library
is promising. This is the main issue preventing lq-text from being
shipped, at present, and may have been completed by the time this paper is
presented in August 2008.
Software cannot tell by inspecting a singly byte (or octet, as
standards people say, in case 9-bit systems should reoccur) whether that
octet forms part of a longer UTF-8 sequence. One needs to scan backwards
to check, because the first octet is the
one that indicates the number of octets to follow in the sequence that
constitutes a single character. This is of course easy to deal with as
long as one can scan backwards a little. For diacritical marks and other
combining characters, however, one must consult a database. The author
could not help but wish that a single bit in the character representation
could have been reserved for this purpose, but that would have prevented
Unicode from being backwards-compatible with ISO 8859-1, a goal at the
time Unicode was designed. A future version of lq-text may use its own
database, with only the character properties that lq-text needs, perhaps
created automatically at the same time as each database so as to take
locale information into account.
Comparing with XQuery 1.0 or XSLT 2 + Full Text
The published draft of Full-Text does not support concordance
generation, although some implementations in practice (such as MarkLogic)
do appear to offer the necessary functionality through product-specific
extensions. The author of this paper considers match highlighting to be
essential functionality in practice. A future version of Full-Text may
well include it.
Let us then assume, as we must, that we are using an XQuery or XSLT
implementation that supports in some way identifying match locations, and
hence allows highlighting.
Advantages of Full-Text
With Full-Text, XPath predicates and axes are available, so
that one can easily find ancestors, parents, position in the element
tree, and so forth. The lexical approach is very limited in this
regard.
Full-Text is (or probably will soon be) a standard, and one
can easily move between implementations. The necessity of using
vendor extensions for highlighting reduces this somewhat, but of
course there is only one implementation of lq-text, albeit with
source code freely available.
An XPath implementation with Full-Text might have indexes for
element location that enable higher performance, for example by
using one CPU to find elements and another to resolve the text
search. Although this sort of optimisation is largely at the
research level today, it is likely to find its way into products,
both closed and open source, in the near future. Lq-text uses
multiple programs, which can run on separate CPUs of course (and
will do so without any action from the user on a multi-CPU system)
but there are no plans for finer-grained parallelism.
The Full-Text facility is designed to work with Unicode and
XML-based language support, giving a high degree of
internationalisation. Although the author is adding Unicode support
to lq-text (which previously, because it predated Unicode, used
8-bit character sets and a locale-based mechanism), it is not yet
complete and pervasive.
Since lq-text is not tree-based, it does not currently have
any means to respect xml:lang, nor does it have any understanding of
namespaces. Prefixed elements and attribute names are not currently
handled. A solution involving the XML indexing filter is being
considered for both of these issues, but its effectiveness is as yet
unknown.
The Full_text XPath extension is already in wider use than
lq-text; training, support, books and forums are available for it,
but not for lq-text.
Advantages of a lexical approach
Open access to the match list supports flexibility and
extensibility. The use of separate programs also allows intermediate
results to be cached or stored and compared easily. By contrast,
XQuery (where Full-Text is most likely to be found) is a large
monolithic language. Open Source XQuery implementations are mostly
in Java, which does not lend itself to good performance if a JVM
must be started for each query, for example outside a servlet
environment. None the less it should be mentioned that the fastest
readily available indexed XQuery implementation in the
author's experience is in Java, and once
the JVM is started, is very fast.
Because the data is not forced into the shape of a tree, it is
possible to experiment, for example with overlapping markup. The
generation of results by lqkwic can
include a span from start element to corresponding end element,
regardless of other start or end tags. Although XQuery
and XPath 2.0 Full-Text allows for matching as if tags were
absent, it does not give good control over which tags are to
be treated as word boundaries and which not. But this is
a difficult thing to do at query-time in any case, and
neither system today has a complete answer for this.
Lq-text can be used to generate non-XML results, for example a
bitmap image representing a graph of word occurrence. XQuery and
XSLT are limited to text and XML, although one can certainly write
out SVG with them.
Experiments with Salton-style similarity functions,
clustering, and other Information Retrieval techniques might
eventually find their way back from work like this and into a future
Full-Text specification. See, for example Salton, 1989 or Konchady2006 for
descriptions of some applicable information retrieval
techniques.
Lq-text lies more in the world of traditional Unix text
processing than in the world of relational databases. If one is
primarily interested in finding content in a database, Full-Text is
a clear winner. If one is more interested in exploring or searching
text, perhaps lq-text has something to offer.
JEXE: Just Enough XML, Eh?
Some XML features were harder to see how to support than others.
The author has no intent to support all of XML at this time, but just
enough to be useful. This is regardless of how the XML is parsed. The
following features are not supported, and are unlikely to be
supported:
CDATA sections; you can use entities instead. This is because
the retrieval software does not scan the document from the start
each time, but from the middle, and cannot determine whether markup
is part of a marked section.
External general entities (and XInclude); it is more useful
for people working with XML as files to know the file than the
document; if you want to resolve included entities, use a
pre-processor such as xmllint
before indexing.
Arbitrary namespace support; the limit on namespaces is that
all xmlns declarations must come
before any regular attributes. In other words, the order of
attributes (or pseudo-attributes) is significant. This may change in
the future; it is because of a limitation in the indexer to do with
the amount of available look-ahead.
General entities; although support is planned for entities,
the plan is to read the replacement text from a per-database
configuration file. This is already done by lqkwic, but should also
be done by the indexer. This means that per-document entities are
not supported. External entities are not supported: the unit of
retrieval is the file, not the document.
The internal subset; currently lq-text can skip over an
internal subset correctly in most cases (it is possible to construct
an internal subset that will confuse it, I suspect, although this is
always detected and a warning issued), but it is not parsed.
Fixed and defaulted attribute values; without reading a DTD or
internal subset, there are no default values. This could be thought
of as a minimization feature of SGML that was overlooked during the
design of XML.
XML Notations; if the DTD were to be read, it might be
possible to associate a URI or a MIME content type with a different
tokenisation system, but the document author cannot know what MIME
type will be used if a file is served on the Web; the DTD is not
authoritative, and currently lq-text does not use HTTP to fetch
things, but only works with local files. If lq-text used HTTP,
behaviour would be based on the content type header for downloaded
entities, not on any notation value in the DTD. For a local file,
the notation value could be treated as a list of plausible content
types, perhaps, but in practice content sniffing is more likely to
work.
The result of this is that a JEXE document consists of an XML
declaration (the encoding, if given, must be in UTF-8, however), an
optional doctype declaration to point at an external DTD to be ignored,
and then one (or more) simple element trees. Elements may have
attributes, and may also declare namespaces. Namespace prefixes may be
“normalised” based on a per-database configuration file, with elements
in a default namespace that is associated with a URI given an explicit
prefix [This feature is not implemented at the time of writing]. Numeric
character references are expanded on indexing. Entity references are
replaced by their per-database string values on retrieval; the plan is
to index entity references both with their entity name and with their
expanded value.
The resulting XML can be parsed “from the middle out” for the
purposes of retrieval.
Although there is only support for a subset of XML, enough of the
syntax is understood that you can index any XML document. However, some
features, such as CDATA sections, permit the construction of documents
that will confuse retrieval, even though the actual CDATA sections will
be correctly parsed. A possible work-around is to process documents with
XSLT before indexing them, creating surrogate documents.
Future Work
The work has shown that adding some simple XML support to lq-text
is possible, but leaves a lot to be desired. For people already using
lq-text, the support described in this paper is useful, but it is
unlikely to persuade many people to try the package for the first
time.
Adding more support for “just enough XML” will make the package
more interesting. In the short term, extending the Unicode support is
necessary before a release, as is more thorough testing and (as always)
more documentation. After that, changes in the indexer to add support
for (just enough) namespaces, general text entities and numeric
character references have been sketched out.
There are no plans to use a full XML parser right now; although
the author had originally intended to do so, the difficulty in tracking
exact byte positions in the input delayed the work, and at this point
although it is now possible, it has become a matter of human
resources.
It is possible that the work here would be enough to enable
lq-text to be used by an implementor of XQuery, and the author would
like to do experiments in that area.
Searching a corpus of documents with disparate markup can be
difficult with either approach, because one tends to write patterns that
depend on the markup retrieved. One approach is to try to map queries at
runtime; this can be a difficult problem of matching incompatible
hierarchies of elements; see Euzenat and Shvaiko, 2007 on various
approaches to the problem of matching ontologies. A more pragmatic
approach is to re-write documents before indexing them, perhaps with
XSLT. This approach works with both approaches to text retrieval, but
can be tedious. An intermediate approach might be to define some XPath
expressions, or to use a W3C XML Schema to impost some specific types,
to identify sections, titles, paragraphs, and to mark which elements are
considered to break apart words, phrases and paragraphs. The index could
then include this information alongside the element structure. More
experiments in this area are planned.
Conclusions
The author's original goal in adding XML support to lq-text was to
use lq-text to help optimise an XQuery implementation. After
experimenting with an XQuery implementation that supported Full-Text,
the author decided instead to focus on enhancing lq-text to see if the
results would be useful. It turns out that they are indeed useful, and
development is continuing.
It must be admitted, however, that any advantage of lq-text over
sophisticated XQuery implementations is likely to diminish over
time.
The subset of XML supported (and with planned support), “just
enough XML, Eh?” (JEXE), may be worth documenting separately.
References
[Adolphs, 2006] Adolphs,
Svenja, “Introducing Electronic Text Analysis” (Routledge, 2006). A very
clear and impressively slender introduction to the application of
information retrieval, and especially the keyword-in-context list, to
literary and linguistic research.
[Baeza-Yates and Marais, 1999] Baeza-Yates, Ricardo, and Marais, H., “Modern
Information Retrieval” (ACM Press, 1999). Describes information retrieval
mostly from the perspective of a researcher in text retrieval rather than
a programmer or a user, and assumes more background knowledge,
particularly in mathematics, than Manu Konchady’s book, so may be best
read second.
[Euzenat and Shvaiko, 2007] Euzenat, Jérôme and Shvaiko, Pavel, “Ontology
Matching” (Springer, 2007); a surprisingly clear introduction to problems
such as relating two or more different classification schemes (such as XML
schemas) over the same subject matter, although the presentation uses a
mathematical notation, and some background in formal logic may be
helpful.
[Konchady2006] Konchady,
Manu, “Text Mining Application Programming” (Charles River Media, Boston
USA, 2006). A useful programmer-level introduction to topics relating to
implementing and using text retrieval, part-of-speech tagging, clustering
and other topics, together with just enough mathematics, but not specific
to any particular language. Includes CD-ROM with code samples in Perl,
however.
[Salton, 1989] Salton, Gerald,
“Automatic Text Processing” (Addison-Wesley, 1989). Perhaps a little
dated, but the late Dr. Salton was extremely influential in the field. His
earlier, 1983, book formed the basis for a single chapter of this work,
but the 1983 book is harder to find today.
[W3C Full-Text, 2007] Sihem
Amer-Yahia, Chavdar Botev, Stephen Buxton, Pat Case, Jochen Doerre, Mary
Holstege, Jim Melton, Michael Rys and Jayavel Shanmugasundaram (Editors),
“XQuery 1.0 and XPath 2.0 Full-Text 1.0” [online]. [cited 18th April
2008].
http://www.w3.org/TR/xpath-full-text-10/.
Adolphs,
Svenja, “Introducing Electronic Text Analysis” (Routledge, 2006). A very
clear and impressively slender introduction to the application of
information retrieval, and especially the keyword-in-context list, to
literary and linguistic research.
Baeza-Yates, Ricardo, and Marais, H., “Modern
Information Retrieval” (ACM Press, 1999). Describes information retrieval
mostly from the perspective of a researcher in text retrieval rather than
a programmer or a user, and assumes more background knowledge,
particularly in mathematics, than Manu Konchady’s book, so may be best
read second.
Euzenat, Jérôme and Shvaiko, Pavel, “Ontology
Matching” (Springer, 2007); a surprisingly clear introduction to problems
such as relating two or more different classification schemes (such as XML
schemas) over the same subject matter, although the presentation uses a
mathematical notation, and some background in formal logic may be
helpful.
Konchady,
Manu, “Text Mining Application Programming” (Charles River Media, Boston
USA, 2006). A useful programmer-level introduction to topics relating to
implementing and using text retrieval, part-of-speech tagging, clustering
and other topics, together with just enough mathematics, but not specific
to any particular language. Includes CD-ROM with code samples in Perl,
however.
Salton, Gerald,
“Automatic Text Processing” (Addison-Wesley, 1989). Perhaps a little
dated, but the late Dr. Salton was extremely influential in the field. His
earlier, 1983, book formed the basis for a single chapter of this work,
but the 1983 book is harder to find today.
Sihem
Amer-Yahia, Chavdar Botev, Stephen Buxton, Pat Case, Jochen Doerre, Mary
Holstege, Jim Melton, Michael Rys and Jayavel Shanmugasundaram (Editors),
“XQuery 1.0 and XPath 2.0 Full-Text 1.0” [online]. [cited 18th April
2008].
http://www.w3.org/TR/xpath-full-text-10/.
Author's keywords for this paper:
XML; Full Text; Information Retrieval; Natural Language Processing; Computational Linguistics