Briquet, Cyril, Pascale Renders and Etienne Petitjean. “A Virtualization-Based Retrieval and Update API for XML-Encoded Corpora.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). https://doi.org/10.4242/BalisageVol5.Briquet01.
Balisage Paper: A Virtualization-Based Retrieval and Update API for XML-Encoded Corpora[1]
Cyril Briquet
Postdoctoral Fellow in Digital Humanities and High Performance Computing
Dr. Cyril Briquet received the PhD degree in Computing Science from the University
of Liège in 2008.
His research interests include distributed systems and algorithmics, with a particular
focus on efficiency and scalability.
In 2007, he introduced the first fully Peer-to-Peer Grid middleware (i.e. supporting both P2P
computations and BitTorrent-based P2P data transfers).
During his stay at CNRS in Nancy in 2008-2009[2], he implemented algorithms for the automatic detection of fields of information in
XML-encoded corpora.
Now at McMaster University as a Postdoctoral Fellow in Digital Humanities and High
Performance Computing, he is working to scale out the analytics back-end of the Voyeur
Tools text analytics software.
Pascale Renders holds diplomas in both Romance languages and Classics.
She is currently completing her PhD studies at University of Liège, in partnership
with the ATILF lab (Nancy). Her doctoral research topic is the digitization of the Französisches
Etymologisches Wörterbuch (FEW), which is the reference etymological and historical
dictionary in the Gallo-Roman area. Besides Lexicography and Natural Language Processing,
her research interests also encompass Diachronic Linguistics of Romance languages,
specifically the key era where Latin evolved into French.
Providing support for flexible automated tagging of text-oriented XML documents (i.e.
text with interspersed markup) is a challenging issue.
This requires support for tag-aware full text search (i.e. the capability to skip
some tags or make invisible whole sections of the document), match points, and transparent
updates.
An API addressing this issue is described.
Based on the virtualization of selected sections of the XML document, the API produces
a tag-aware representation, backed by the document, that is transparently searchable
(using keyword search or regular expressions) and updatable, offering support for
natural linguistic reasoning.
XML-encoding a textual corpus is typically achieved through manual tagging,
i.e. by human operators using a text editor or specialized annotation software,
or through the application of algorithms that complete the tagging on the basis of
a known schema.
The problem we address in this paper is to provide an algorithmic substrate,
made accessible through an Application Programming Interface (API),
to achieve automated tagging [Renders2007a, Renders2009a, Briquet2009a]
of the digitized version of one of the most complex paper dictionaries.
This dictionary is the Französisches Etymologisches Wörterbuch (FEW) [ProjectFEW, Wartburg1922a].
The FEW is the historical and etymological dictionary of the languages of the gallo-roman
area
(i.e. French, Franco-Provençal, Occitan and Gascon in all of their diatopic variations),
written over the course of more than eighty years.
Each article of the FEW describes all derived words (i.e. lexemes) of a given root
word (i.e. etymon) [Buchi1996a],
instead of (as happens in most etymological dictionaries) describing the root word
of a given derived word.
Several fields of information are offered for each article, such as the
etymon and the language in which this etymon originates, the signature of the article,
and footnotes.
The structure of the article is emphasized through hierarchical numbering and labeling
of groups of paragraphs and cross-references to these.
Several fields of information are offered for each lexeme,
including but not limited to geolinguistic labels, bibliographical references to attestations
of the lexeme
and the definition of the lexeme.
All of these types of information are well-delineated, thus good candidates for tagging.
In a nutshell, each article of the FEW is a highly structured text that can be described
through a model [Buchi1996a] and thus algorithmically processed.
Besides these semantic types of information that are expected to be identified through
automated tagging, formatting markup (such as paragraph, bold, italic, exponent, break
tags such as end of line or end of column,...) is also available in each article following
the digitization process (the formatting markup is added either manually or through
OCR). This enables an on-screen visualization of articles that is faithful to the
paper version of the dictionary. The availability of paragraph tags is important not
only to properly format the articles, but to make accessible a first level of structuring
of the article. Likewise, the availability of break tags makes it possible to correctly
tag some semantic types of information. Indeed, many hyphens do not indicate that
a long word is being cut at the end of the line where they are found, but instead
convey semantic information (e.g. as part of affixes). And there is no straightforward
way to automatically disambiguate the syntactic uses from the semantic uses of hyphens
at the end of lines in the FEW dictionary.
To put the FEW dictionary [ProjectFEW, Wartburg1922a] in perspective, here is a quick comparison with another complex dictionary, the
well-known Oxford English Dictionary (OED) [OED]. Whereas the markup used to encode OED articles is proprietary SGML [OED], the markup used to encode FEW articles is standard XML complying with an XML Schema
[Briquet2009a]. The FEW markup is referred to as FFML (FEW Font-style Markup Language) for files
input to the automated tagging process. It is referred to as FSML (FEW Semantic Markup
Language) for files output by the automated tagging process. FSML is essentially a
superset of FFML, augmented with semantic markup. There currently
exists no in-depth comparison of the scopes of the OED and FEW markups, and it is
clear that there are differences if only because FEW entries are etymons instead of,
typically, lexemes. Nonetheless, in very general terms, both markups are intended
to encode dictionary articles and it is not surprising to see many common features.
The FEW dictionary, though, is considered to be one of the most complex dictionaries
in the world [Buchi1996a], notably because of the ever-present implicitness of many types of information.
The table here below provides a shallow quantitative comparison of both dictionaries.
Table I. Shallow quantitative comparison of OED [OED] and FEW [Buchi1996a]

Feature   | OED     | FEW         | Comment
Pages     | 21730   | 16865       |
Volumes   | 20      | 25          |
Entries   | 300000  | 20000       | FEW entries are etymons, not lexemes, thus fewer
Lexemes   | 600000  | 900000 (*)  | (*) back-of-the-envelope estimate
Importance of the problem
Semantically tagging the articles of the FEW is a multi-year linguistics research
project [Renders2007a] that is tremendously important to the historical linguistics community.
Merely digitizing the FEW, without further processing, would only enable full text
search.
Tagging the various fields of information would also enable so-called lateral search
over the whole dictionary.
Semantic tagging would disambiguate and resolve the implicitness of most of the fields
of information contained in the dictionary.
Given the extreme linguistic complexity of the FEW, only highly-trained human experts
could theoretically do the job. But there are not enough of them
to process the twenty thousand articles of the FEW
and this would be a tedious, error-prone job anyway.
It was thus decided to use algorithmic automation to tag the fields of information
of the FEW [Renders2009a].
Even though the articles of the FEW (implicitly) constitute a highly specialized textual
corpus,
we believe that the problem can be sufficiently abstracted
so that our proposed approach is also relevant to other endeavors,
particularly the processing of text-oriented XML documents (i.e. text with interspersed
markup) such as text corpora.
Scope of the problem
The input of our project is comprised of XML documents,
each of which represents an article of the FEW and includes basic layout tags (article,
paragraphs, bold, italic,...)
that were added by human encoders and/or by OCR software.
Obviously, all textual contents are eventually enclosed in paragraph and article tags.
Nonetheless, the element hierarchy is often not the prime model of choice when accessing
the FEW.
Identifying fields of information in the FEW indeed focuses on linear regions of text
where elements can be nested in many ways.
Thus a very important property is that an FEW article should be considered as a text-oriented
XML document,
i.e. a text with interspersed tags rather than a tree of elements with some textual
contents found in the leaves.
The objective of our project is ultimately to add semantic markup around fields of
information within a text-oriented XML document
(fields of information of interest are listed in the Problem domain section here above).
This requires first identifying these fields of information, then updating the XML
document accordingly.
Each article of the FEW dictionary typically features multiple occurrences of 25 or
so types of information [Buchi1996a], each of which is detected and tagged by a specific tagging algorithm (see Figure
1 here below).
As the dependency graph of tagging algorithms is acyclic, it is possible - though
not straightforward - to determine the sequence in which they should be applied (the
so-called retroconversion sequence, e.g. tag definitions, then geolinguistic labels,
then bibliographic labels,...) [Renders2009a, Briquet2009a]. An important property is that the order in which tags are inserted into the XML
document representing an FEW article is known. Therefore, when designing a tagging
algorithm, one knows which types of information have previously been tagged - and
thus which ones might interfere with information retrieval and updates.
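Because the dependency graph of tagging algorithms is acyclic, the retroconversion sequence can be computed with a topological sort. A minimal sketch follows; the algorithm names and dependencies used in the usage example are hypothetical stand-ins, not the actual FEW dependency graph:

```java
import java.util.*;

// Kahn's algorithm: orders tagging algorithms so that each one runs only
// after every algorithm it depends on. deps maps an algorithm to the list
// of algorithms that must be applied before it.
public class RetroconversionOrder {
    public static List<String> order(Map<String, List<String>> deps) {
        Map<String, Integer> inDegree = new HashMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (String algo : deps.keySet()) inDegree.putIfAbsent(algo, 0);
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            for (String prereq : e.getValue()) {
                inDegree.putIfAbsent(prereq, 0);
                inDegree.merge(e.getKey(), 1, Integer::sum);
                dependents.computeIfAbsent(prereq, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> sequence = new ArrayList<>();
        while (!ready.isEmpty()) {
            String algo = ready.remove();
            sequence.add(algo);
            for (String next : dependents.getOrDefault(algo, List.of()))
                if (inDegree.merge(next, -1, Integer::sum) == 0) ready.add(next);
        }
        if (sequence.size() != inDegree.size())
            throw new IllegalArgumentException("dependency cycle");
        return sequence;
    }
}
```

For example, if geolinguistic labels must be tagged after definitions, and bibliographical labels after geolinguistic labels, the computed sequence respects that partial order.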
Identification of fields of information in an FEW article by a tagging algorithm is
typically based on several retrieval primitives:
keyword search, matching of regular expressions (regexp),
as well as matching of contextual tag sequences.
Most algorithms rely on a combination of several of these primitives,
often in non-straightforward ways that involve secondary lookups in the contexts surrounding
initial matches.
The linguistic challenge resides in handcrafting combinations of retrieval primitives
that lead to accurate identification of types of information.
As the design and implementation of tagging algorithms is beyond the scope of this
paper,
we now concentrate on issues that arise from the stated information retrieval requirements.
Among the issues that make information retrieval in the FEW quite challenging, two
classes of issues stand out:
Firstly, some keyword search may be relevant only in some selected contexts, i.e.
would lead to false positives if applied to irrelevant contexts.
Secondly, the presence of some tags, such as exponent or end of line tags,
within words or groups of words may prevent the matching of keywords and regular expressions,
thus leading to false negatives.
Update of (the XML document representing) an FEW article by a tagging algorithm typically
includes the insertion, removal, displacement or update of text,
as well as the insertion, displacement or removal of tags.
The possible presence of many types of tags within updated text regions
increases the complexity of update operations,
as these should preserve the well-formedness of the XML document.
Moreover, the tagging of a text region might be decided based on the identification
of patterns in a neighboring text region. Support for match points is thus very important,
as the locus of where to insert a tag is most often decided based on offsets in the
textual representation of the XML document (these offsets being sometimes far away
from the XML node whose contents were matched in the identification phase), not relative
to tags of the XML document.
In addition to these retrieval and update challenges, one must keep in mind that
the semantic tagging of FEW is a research project [Renders2007a] whose object of study is a very complex dictionary. Concretely, another layer added
to these challenges resides in the lack of knowledge of where tags can appear in the
text. Even though it might be known that a given tag type is always included in a
specific parent element, this provides no information on where the given tag appears
among the textual content of its enclosing element. This is the kind of challenge
one must face when searching both markup and textual contents at the same time [StLaurent2001a].
This leads to the following additional requirement: A mechanism must be provided to
flexibly and easily configure
the contexts where retrieval and update operations will be applied. This is a challenging
requirement, as we desire to provide users of the mechanism
a way to think of the dictionary and express themselves in a familiar environment:
text without tags.
What is required is thus a mechanism that enables flexible, dynamic, tag-aware retrieval
and update in a text-oriented XML document,
in order to support natural linguistic reasoning.
Based on our review of the state of the art presented in the section of this paper
discussing related works (see below),
we are not aware of any existing XML technology that addresses all of these requirements,
specifically the provision of a familiar environment to users of the mechanism.
Our Approach
The key principle of our approach to the tag-aware retrieval and update problem
is the virtualization of the text-oriented XML document.
We propose a mechanism to allow one to easily construct,
through an Application Programming Interface (API),
a representation of selected sections of the XML document
from which types of tags leading to ambiguities - as well as their contents, if required
-
have been removed.
In practice, sections of the XML document are virtualized
into multiple virtual strings, separated by so-called visible tags,
according to dynamically-defined tag visibility rules.
A definition of the virtual string data structure is provided in a further section;
intuitively, a virtual string is the concatenation of adjacent text chunks,
except those within elements that are configured to not be added to the virtual string.
For example, consider this section of an XML document:
<X><Y>some nice text</Y> <Z>and text to be made invisible</Z> and <W>finally</W> <Y>nice text again</Y></X>
Suppose the following configuration: <X> and <Y> tags should stop the virtualization mechanism; <Z> tags, as well as their contents (if considered as elements), should be made invisible; and <W> tags (but not their contents, if considered as elements) should be made invisible. The section is then virtualized into these three virtual strings: (1) some nice text, (2) and finally, (3) nice text again.
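The virtualization just described can be sketched as follows. The class and method names are our own illustration, not the actual API, and the tag handling is deliberately simplified (no attributes, namespaces, or offset tracking):

```java
import java.util.*;
import java.util.regex.*;

// Splits a text-oriented XML fragment into virtual strings according to
// simple visibility rules (illustrative sketch, not the actual FEW API):
//  - stopper tags end the current virtual string,
//  - invisible elements are dropped together with their contents,
//  - any other tag (e.g. those configured as invisible tags) is dropped
//    but its textual contents are kept.
public class Virtualizer {
    public static List<String> virtualize(String xml, Set<String> stoppers,
            Set<String> invisibleElements, Set<String> invisibleTags) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = Pattern.compile("<(/?)([A-Za-z]+)[^>]*>").matcher(xml);
        int pos = 0;
        int skipDepth = 0;  // > 0 while inside an invisible element
        while (m.find()) {
            if (skipDepth == 0) current.append(xml, pos, m.start());
            pos = m.end();
            String name = m.group(2);
            boolean closing = !m.group(1).isEmpty();
            if (invisibleElements.contains(name)) {
                skipDepth += closing ? -1 : 1;
            } else if (stoppers.contains(name)) {
                flush(current, result);
            }
            // tags in invisibleTags fall through: dropped, contents kept
        }
        if (skipDepth == 0) current.append(xml.substring(pos));
        flush(current, result);
        return result;
    }
    private static void flush(StringBuilder sb, List<String> out) {
        String s = sb.toString().replaceAll("\\s+", " ").trim();
        if (!s.isEmpty()) out.add(s);
        sb.setLength(0);
    }
}
```

Applied to the example above with stoppers {X, Y}, invisible elements {Z} and invisible tags {W}, this sketch yields the three virtual strings listed in the text.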
How do API users benefit from the virtualization? With ambiguity abstracted away through
an appropriate configuration of what tags should be made invisible (with or without
their contents, if considered as elements), the representation of the XML document
as a sequence of virtual strings
is more easily searchable, simplifying retrieval (keyword search, regexp matching,
contextual tag sequence matching).
Importantly, each virtual string is backed by the underlying XML document. Updates
to a virtual string
are thus transparently propagated to the document.
Therefore, this constructed representation is also easily updatable.
Our approach shares the intent of the Regular Fragmentations
mechanism proposed by Simon St.Laurent [StLaurent2001a, RegFrag]
and of the XML Fragments mechanism introduced by Carmel et al. [Carmel2003a],
to the extent that we wish to devise a transparent mechanism that enables
mixed-information (text/markup) retrieval in a text-oriented XML document.
Our proposed virtualization mechanism
may be viewed as a dynamic, more flexible and - importantly - updatable version
of the Reading Context mechanism introduced by Xavier Tannier [Tannier2005a].
Mixed-information (text/markup) retrieval is provided by our proposed virtualization
mechanism
through the backing of virtual strings by the underlying XML document.
The tag visibility model on which our proposed virtualization mechanism relies
is an extended version (with slightly different semantics) of the tag visibility model
initially introduced by Lini et al. [Lini2001a]
and used by Tannier [Tannier2005a].
Our proposed API is comprised of exactly the operations that are required to digitize
the FEW dictionary [ProjectFEW, Wartburg1922a], no more, no less.
Nonetheless, we believe it is sufficiently general and extensible
to be useful to other text-oriented XML document processing needs.
The rest of this paper is structured as follows.
Typical use cases are first described in order to ground our work into real, concrete,
albeit basic examples.
Hypothetical use cases are also provided, in order to situate our work within a broader
context.
The proposed XML document retrieval and update API is then introduced.
Related works are subsequently reviewed and their relationships with our work are
explored.
Finally, concluding remarks put the proposed API into perspective.
Typical Use Cases
Four typical use cases gathered here below illustrate how our proposed API can help
design tagging algorithms [Renders2009a]. The use cases highlight two major classes of information retrieval issues encountered
in semantically tagging the FEW dictionary (false positives, false negatives).
Use case 1: avoidance of false positives for regular expression matching
In this use case, the objective is to tag dates. To search for dates, a regular expression
is applied to the full text version of the XML document.
Input (FEW 2, 982b, completus):
<geoling>Nfr.</geoling> <i>com-<lb merge="no"/>
plètement</i> <def>„action de mettre au complet“</def> (seit 1750,<lb merge="no"/>
text in <biblio>Fér 1787</biblio>).
Expected output, one <date> tag inserted:
<geoling>Nfr.</geoling> <i>com-<lb merge="no"/>
plètement</i> <def>„action de mettre au complet“</def> (seit <date>1750</date>,<lb merge="no"/>
text in <biblio>Fér 1787</biblio>).
If all tags were removed prior to full text search, the string 1787 would be matched as a licit date. This would lead to a false positive as it has already
been tagged as part of a bibliographical reference.
Instead, <biblio> tags (as well as others, in practice) should be made totally invisible, their contents
included, prior to full text search. The full text to which the search operation is
applied thus becomes:
Nfr. complètement „action de mettre au complet“ (seit 1750, text in ).
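A minimal sketch of this use case, assuming the virtualized full text has already been produced; the date pattern is a simplified stand-in for the real regular expression:

```java
import java.util.regex.*;

// Use-case-1 sketch: with <biblio> contents made invisible, a date regexp
// matches 1750 and can no longer produce a false positive on 1787.
public class DateSearch {
    // Four-digit year starting with 1, on word boundaries (simplified).
    private static final Pattern DATE = Pattern.compile("\\b(1[0-9]{3})\\b");

    public static String firstDate(String virtualText) {
        Matcher m = DATE.matcher(virtualText);
        return m.find() ? m.group(1) : null;
    }
}
```

Run on the full text without virtualization, the same pattern would also match 1787, i.e. the false positive described above.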
Use case 2: avoidance of false positives for keyword search
In this use case, the objective is to tag grammatical categories. To search for grammatical
categories, a keyword search is applied to the full text version of the XML document.
In practice, the string f. belongs to the keyword list of grammatical categories (it is an abbreviation
standing for: feminine substantive).
Input (FEW 2, 983a, completus):
<geoling>Nfr.</geoling> <i>fleur incomplète</i> <def>„f. dé-<lb merge="discard-hyphen"/>
pourvue de quelque organe, notamment de corolle“</def><lb merge="no"/>
(seit <biblio>Trév 1771</biblio>).
Expected output, no tag inserted:
<geoling>Nfr.</geoling> <i>fleur incomplète</i> <def>„f. dé-<lb merge="discard-hyphen"/>
pourvue de quelque organe, notamment de corolle“</def><lb merge="no"/>
(seit <biblio>Trév 1771</biblio>).
If all tags were removed prior to full text search, the string f. would be matched as a licit grammatical category. This would lead to a false positive
as f. is here an abbreviation of the word fleur (i.e. flower) and has already been tagged as part of the definition of a lexeme.
Instead, <def> tags (as well as others, in practice and for the relevant algorithm) should be made
totally invisible, their contents included, prior to full text search. The full text
to which the search operation is applied thus becomes:
Nfr. fleur incomplète (seit Trév 1771).
Use case 3: avoidance of false negatives for regular expression matching
In this use case, the objective is to tag dates. To search for dates, a regular expression
is applied to the full text version of the XML document.
The – character is not a hyphen, but an en dash. This character, as well as the spacing around it, is accounted for in the
regular expression used to match dates.
If <e> and <lb/> tags were not skipped, the date range 4e-6e s. (i.e. 4th-6th century) would be split into six fragments. This would lead to a false
negative, i.e. the date would not be matched. By virtually removing the <e> and <lb/>
tags (as well as others, in practice), the date can be matched by a regular expression.
The full text to which the search operation is applied thus becomes:
Emprunt de lttard. mlt. augmentator (4e– 6e s., ThesLL ;
Use case 4: avoidance of false negatives for keyword search
In this use case, the objective is to tag affixes (i.e. prefixes and suffixes). To
search for affixes, a keyword search is applied to the full text version of the XML
document.
Input (FEW 8, 237a, perfectus):
<pref id="I 2 a">a</pref> mit dem suffix -<i>ivus</i>
Expected output, one <affix> tag inserted:
<pref id="I 2 a">a</pref> mit dem suffix <affix>-<i>ivus</i></affix>
If <i> tags were not skipped, the hyphen of the suffix -ivus would be separated from the rest of the word. This would lead to a false negative,
i.e. the affix would not be matched. By virtually removing the <i> tags (as well as
others, in practice), the affix keyword can be matched.
The full text to which the search operation is applied thus becomes:
a mit dem suffix -ivus
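A minimal sketch of this use case; the tag-stripping helper is a crude stand-in for the virtualization mechanism, which in reality preserves offsets back into the underlying document:

```java
// Use-case-4 sketch: virtually removing <i> tags rejoins the hyphen with
// the rest of the suffix, so a keyword search for "-ivus" can succeed.
public class AffixSearch {
    public static boolean containsAffix(String text, String affix) {
        return text.contains(affix);
    }
    // Illustrative tag stripping; the real API keeps the mapping between
    // the virtual text and the nodes of the XML document.
    public static String stripTags(String xml) {
        return xml.replaceAll("<[^>]+>", "");
    }
}
```

On the raw markup the keyword fails to match (false negative); on the stripped text it matches.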
Hypothetical Use Cases
In a general sense, our proposed virtualization mechanism makes it possible to construct
views of an XML document dynamically, at any time during its processing.
The mechanism masks selected parts of an XML document and transparently applies updates
- through the constructed document view - to the XML document. The mechanism operates
on a representation of the XML document that does not rely on a DOM tree. The representation
of the XML document exhibits very lightweight storage requirements if one desires
to process only sections of the document at a time. Therefore, our proposed virtualization
mechanism is well-suited to process streamed data.
A hypothetical use case of growing importance would be the semantic analysis of web
documents intended to feed the ever-expanding index of a web search engine. Given
that web search is predicted to evolve towards a better understanding of what web
documents mean [Savage2010a], semantic analysis of a web document may well evolve to become a key problem in
web search. The ability to easily mask selected tagged regions of a web document to
try and derive different meanings would thus be an asset when engineering its semantic
analysis.
Document Retrieval and Update API
As most algorithms devised for the semantic tagging of types of information of the
FEW dictionary operate based on different contextual assumptions, one should be able
to flexibly and easily configure the retrieval and update operations for each algorithm.
Why did we choose to expose this mechanism through an Application Programming Interface
(API)?
Firstly, tools that can apply only one transformation to an XML document, however
powerful they are [StLaurent2001a, RegFrag, Tannier2005a, Tannier2006a, Tannier2006b], are not adapted because the semantic tagging of FEW requires each XML document
to be processed multiple times.
Secondly, domain-specific languages can serve their purposes well, but often lack
in generality.
Thirdly, embeddable domain-specific languages, such as XQuery [XQuery], deserve attention, but the depth of integration may (currently) sometimes be a limiting
factor.
Fourthly, an Application Programming Interface (API) offers great flexibility and
by design can be easily integrated with user code.
We have thus decided to expose our proposed virtualization mechanism through an Application
Programming Interface (API). A Java implementation of this API was completed during
the summer of 2009.
Our proposed document retrieval and update API is actually comprised of a set of six
distinct APIs (see Figure 2 here below, based on the example illustrating the Our
Approach section):
addressing API: provides a mechanism to locate target nodes;
text retrieval API: no specific document-level or node-level text retrieval API is
provided; user code is expected to rely on standard Java facilities or user-provided
custom packages;
text update API: provides a mechanism to update a single text node and insert tags
into a single text node;
text virtualization API: provides a mechanism to construct a sequence of virtual strings
over a section of the XML document potentially spanning many nodes, from which specified
types of tags are virtually removed;
virtual text retrieval API: provides a mechanism to retrieve data within a virtual
string, that also provides access to the underlying nodes of the XML document;
virtual text update API and virtual tag splicing API: provide a mechanism to update
a virtual string and insert tags into a virtual string, with the updates transparently
propagated to the underlying nodes of the XML document.
Before introducing each API, the in-memory representation of the XML document on which
they operate is first described.
In-memory Document Data Structure
An important design decision in XML document processing is the in-memory representation
of its markup and content. A typical choice is the DOM tree representation [DOM]. However, as we are considering text-oriented XML documents only, there are at least
two arguments against a tree representation.
Firstly, the structural information contained in the tree branches is practically
useless to our purposes as the element hierarchy of the FEW dictionary is not always
predictable (see Scope of the problem here above for the rationale). Therefore, the
memory footprint of the document representation can be greatly reduced by discarding
hierarchical relationships and storing only the list of leaves of the document.
Secondly, as the generated XML documents may not be well-formed (due to e.g. unexpected
textual data that cannot be recognized by the tagging algorithms), a mechanism to
deal with the lack of well-formedness, such as e.g. milestones [DeRose2004a], is required. A tree representation has thus little benefit.
We decide instead to represent an XML document as a list of nodes, with each node
being either an XML tag or a text chunk. This in-memory representation matches the
assumption of text-oriented XML documents. It allows straightforward access to contextual
relationships (i.e. is-left-of, is-right-of) between nodes, without precluding the
construction of additional data structures on top of it to access parent/child
relationships for specific sections of the document. In the rest of the paper, it
will be referred to as a node list.
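A minimal sketch of such a node list; the class and field names are illustrative, and the open/close distinction of tag nodes is omitted for brevity:

```java
import java.util.*;

// Node-list representation sketch: a document is a flat list of nodes,
// each node being either a tag or a text chunk. Attributes are stored as
// properties of tag nodes, not as nodes themselves.
public class NodeList {
    public enum Kind { TAG, TEXT }
    public static final class Node {
        public final Kind kind;
        public final String value;               // tag name or text chunk
        public final Map<String, String> attrs;  // tag attributes; empty for text
        Node(Kind kind, String value, Map<String, String> attrs) {
            this.kind = kind; this.value = value; this.attrs = attrs;
        }
    }
    private final List<Node> nodes = new ArrayList<>();
    public void appendTag(String name, Map<String, String> attrs) {
        nodes.add(new Node(Kind.TAG, name, attrs));
    }
    public void appendText(String chunk) {
        nodes.add(new Node(Kind.TEXT, chunk, Map.of()));
    }
    public Node get(int i) { return nodes.get(i); }
    public int size() { return nodes.size(); }
    // Contextual relationships are immediately accessible by construction.
    public Node leftOf(int i)  { return i > 0 ? nodes.get(i - 1) : null; }
    public Node rightOf(int i) { return i < nodes.size() - 1 ? nodes.get(i + 1) : null; }
}
```

A SAX parser can populate this structure by appending a node per startElement, endElement and characters callback.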
In practice, the node list can be easily constructed using a validating SAX parser
that appends text nodes and tag nodes to the node list as they are loaded to memory.
Tag attributes are not appended as nodes themselves, but instead are added as properties
of tag nodes. Comments and processing instructions are currently discarded from the
XML document (this is an implementation-specific limitation that can be removed through
additional engineering hours).
The existing implementation of the API only considers UTF-8-encoded XML documents.
Fragments of text of the XML document are normalized as they are inserted into the
node list as text nodes. A character map enables the API user to configure all licit
characters, character entity references and numeric character references. Character
sanity is enforced by checking all characters of the XML document against the character
map. Character entity references and numeric character references are all resolved
into one Unicode character each, possibly in a private use area. All spacing characters
are converted to the space character (U+0020), then multiple spaces are compacted
into one space only. Leading and trailing spaces within high-level structural tags
(dictionary, article, paragraph) are removed. Break tags (e.g. end of line, end of
column, end of page,...), despite being tags and not text, are also normalized, i.e.
redundant break tags are removed.
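The spacing normalization and character-map check can be sketched as follows (illustrative names; the real implementation also resolves character references and normalizes break tags):

```java
import java.util.*;

// Normalization sketch: all spacing characters are unified to U+0020 and
// runs of spaces are compacted into one; a character map whitelists the
// licit characters, and any other character is rejected.
public class TextNormalizer {
    public static String normalize(String chunk, Set<Character> characterMap) {
        String unified = chunk.replaceAll("\\s+", " ");
        for (char c : unified.toCharArray()) {
            if (c != ' ' && !characterMap.contains(c)) {
                throw new IllegalArgumentException("illicit character: " + c);
            }
        }
        return unified;
    }
}
```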
Addressing API
The addressing API takes one node as input, and provides a node as output.
Addressing of XML elements, tags or text chunks, in the node list is always relative
to a target node and always returns a single node (never a set of nodes).
No mechanism of absolute addressing of XML elements in the XML document is currently
provided because it was not a requirement for the tagging of the FEW dictionary. An
absolute addressing mechanism featuring XPath [XPath] syntax might be supported but at the expense of a severe performance penalty, as
the node list does not contain the hierarchical structure of a node tree (see In-memory
Document Data Structure here above for the rationale).
Relative addressing of XML elements can be done in three ways: by neighbor access,
by tag node search, by text node search.
The left and right neighbor of a target node are always immediately accessible, by
construction of the node list.
Several tag node search operations are provided by the addressing API: search forwards/backwards
for the next tag node matching a given set of tags and/or tag attributes, before the
first occurrence of a stopper tag (if specified). It must be noted that forward search
and backward search are not equivalent to descendant search or ancestor search [XPath] in a tree representation. Consider the following two examples. First example: <X><Y>some text</Y><Z>some text</Z></X>. Second example: <X><Y>some text</Y></X> <Z>some text</Z>. A forward search (in our proposed node list representation), relative to the target
node <X>, for a <Z> tag will find a match in both examples, whereas a similar descendant search (in a
tree representation) would not find a match in the second example.
Several text node search operations are provided by the addressing API: search forwards/backwards
for the next text node, before the first occurrence of a stopper tag (if specified).
Again, forward search and backward search are not equivalent to descendant search
or ancestor search [XPath] in a tree representation.
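A minimal sketch of forward tag search over a node list, using plain strings as stand-ins for nodes (names and signatures are illustrative, not the actual API). It reproduces the second example above: a forward search finds the <Z> tag even though <Z> is not a descendant of <X>:

```java
import java.util.*;

// Forward tag search sketch over a node list: starting after a target
// index, find the next node among a wanted set of tags, stopping early
// if an optional stopper tag is encountered first.
public class Addressing {
    public static int searchForwardTag(List<String> nodes, int from,
            Set<String> wanted, String stopper) {
        for (int i = from + 1; i < nodes.size(); i++) {
            String n = nodes.get(i);
            if (n.equals(stopper)) return -1;  // stopper reached first
            if (wanted.contains(n)) return i;  // match found
        }
        return -1;
    }
}
```

A backward search is symmetric; neither is equivalent to descendant or ancestor search in a tree representation.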
In theory, the cost of relative addressing is bounded by the number of nodes in the
node list. In practice, though, most addressing requests for the tagging of the FEW
dictionary are fairly local, e.g. within a paragraph of the text encoded into the
XML document. If the addressing API is considered for searches not exhibiting high
locality, implementation of auxiliary data structures should be considered to guarantee
lower bounds in the search cost. To this end, insights (though no direct reuse) could
certainly be drawn from the skip list [Pugh1990a] data structure. The availability of such auxiliary data structures would pave the
way for XPath addressing.
Text Retrieval API
Neither a dedicated node-level text retrieval API (i.e. within individual nodes only)
nor a document-level text retrieval API (i.e. within all individual nodes of the document,
considered individually) is provided, for two reasons.
Firstly, the text chunk of a text node is a classic Java string.
To achieve node-level text retrieval, it is thus straightforward for user code to
rely on standard Java facilities such as the operations on Java strings (java.lang.String
methods), the regular expression engine (java.util.regex package), or user-supplied
packages.
This also enables one to plug in advanced search mechanisms such as collocate search, or search for inflected forms (e.g. mouse/mice).
For instance, in the context of the FEW dictionary, several semantic tagging algorithms
depend on a keyword search seeking to match a list of keywords that may each exhibit
several nontrivial variations.
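For instance, node-level keyword search allowing for nontrivial variations can be expressed directly with the standard regular expression engine. The text chunk and the pattern below are invented for illustration; they are not taken from the actual FEW tagging rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NodeLevelSearchDemo {
    public static void main(String[] args) {
        // The text chunk of a text node is a plain java.lang.String.
        String textChunk = "ancien fr. mousse, apr. mosa";

        // A hypothetical keyword with several spelling variations,
        // expressed as a single regular expression.
        Pattern keyword = Pattern.compile("m[ou]+ss?[ea]");
        Matcher m = keyword.matcher(textChunk);
        while (m.find()) {
            // Print each matching variant and its offset within the chunk.
            System.out.println(m.group() + " @ " + m.start());
        }
    }
}
```

Since no dedicated retrieval API mediates the access, any user-supplied pattern matching package can be substituted for `java.util.regex` without changes elsewhere.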
The decoupling of node-level text retrieval operations (as well as virtual text retrieval
over selected sections of the document, which constitutes the core of this paper,
see below) from other core operations thus offers great flexibility to users.
Secondly, document-level retrieval can be achieved by using an additional retrieval
mechanism on top of the node list representation, such as one of those discussed in
the Related Work section of this article.
Text retrieval over sections of the document (i.e. of its node list representation),
though, is made possible using the virtual text retrieval API (see below).
Text Update API
The text update API takes one text node and an update specification as input, and
updates/inserts/deletes text nodes as output.
In the existing implementation, the updates are not in-place, i.e. text nodes are actually substituted with an updated copy of themselves.
Three operations applicable to a target text chunk are provided by the node-level
text update API:
Update (replacement) of a substring of the target text chunk with the specified string.
Tagging of a substring of the target text chunk with a given tag, i.e. insertion of an opening and a closing tag at specified insertion indexes.
Insertion of a list of tags at a specified insertion index of the target text chunk. Typical use case: inserting an empty element.
The node-level text update API is convenient but can update text or insert tags only
between a pair of existing tags.
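The three operations might look as follows in user-facing code. This is a hedged sketch (the type and method names are assumptions, not the actual API); it illustrates in particular that updates are not in-place: a text node is substituted with an updated copy of itself.

```java
public class TextUpdateDemo {

    // Text nodes are immutable in this sketch: an update substitutes the
    // node with an updated copy, mirroring the non-in-place behavior
    // described in the text.
    static final class TextNode {
        final String text;
        TextNode(String text) { this.text = text; }
    }

    // Operation 1: replace the substring [start, end) of the text chunk.
    static TextNode replaceSubstring(TextNode node, int start, int end, String replacement) {
        return new TextNode(node.text.substring(0, start) + replacement + node.text.substring(end));
    }

    // Operation 2: tag the substring [start, end), i.e. insert an opening
    // and a closing tag at the two insertion indexes (serialized form shown
    // here for readability; the real API inserts tag nodes, not characters).
    static String tagSubstring(TextNode node, int start, int end, String tag) {
        return node.text.substring(0, start)
                + "<" + tag + ">" + node.text.substring(start, end) + "</" + tag + ">"
                + node.text.substring(end);
    }

    public static void main(String[] args) {
        TextNode n = new TextNode("Once upon a time");
        System.out.println(replaceSubstring(n, 0, 4, "Twice").text); // Twice upon a time
        System.out.println(tagSubstring(n, 5, 9, "b"));              // Once <b>upon</b> a time
    }
}
```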
Text Virtualization API
The text virtualization API takes one node and a virtualization specification as input,
and provides a (possibly empty) sequence of virtual strings as output.
Just one operation is provided by the API: the construction of a sequence of virtual strings based on a specification of tag type visibility provided by the API user.
The text virtualization API addresses the issue of virtually removing specified types
of tag, as well as their contents (if specified), from a sequence of sections of a
text-oriented XML document. We define a section as the sequence of nodes starting
at a target node and extending immediately prior to (and excluding) the first tag
node matching a specified stopper tag type (if any), or the end of the document (in
the absence, after the target node, of any stopper tag matching the specified type).
The construction of a sequence of virtual strings thus virtualizes a sequence of sections
of the XML document.
We define a virtual string as the data structure:
based on a string resulting from the concatenation of adjacent text chunks (two text
chunks are considered as adjacent if no visible tag occurs between them),
that is backed by the underlying text nodes of the XML document,
that includes all the information necessary for retrieval and dynamic updates (see
Virtual Text Retrieval API and Virtual Text Update API in the following).
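A possible Java shape for this data structure is sketched below. The field names and per-character layout are assumptions; the definition above specifies the information content, not the representation.

```java
import java.util.ArrayList;
import java.util.List;

public class VirtualStringSketch {

    // One possible layout: the concatenated text plus, per character, the
    // index of the backing text node (-1 for a virtual space inserted
    // between two adjacent text chunks).
    static final class VirtualString {
        final String text;
        final int[] backingNode; // backingNode[i] == -1 for virtual spaces

        VirtualString(List<String> chunks) {
            StringBuilder sb = new StringBuilder();
            List<Integer> back = new ArrayList<>();
            for (int c = 0; c < chunks.size(); c++) {
                if (c > 0) { sb.append(' '); back.add(-1); } // virtual space
                for (int i = 0; i < chunks.get(c).length(); i++) back.add(c);
                sb.append(chunks.get(c));
            }
            text = sb.toString();
            backingNode = back.stream().mapToInt(Integer::intValue).toArray();
        }

        boolean isVirtualSpace(int index) { return backingNode[index] == -1; }
    }

    public static void main(String[] args) {
        VirtualString vs = new VirtualString(List.of("Once upon", "a time"));
        System.out.println(vs.text);                // Once upon a time
        System.out.println(vs.isVirtualSpace(9));   // true (between the two chunks)
    }
}
```

The per-character backing information is what makes retrieval and dynamic updates possible on the concatenated string.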
The construction of a sequence of virtual strings is controlled through the specification
of a visibility partition of all types of tags that can appear in the XML document.
It relies on a tag visibility model that is an extended version (with slightly different
semantics) of the tag visibility model initially introduced by Lini et al. [Lini2001a] and used by Tannier [Tannier2005a].
Our tag visibility model is comprised of five classes:
skipped tags: should be virtually removed from the document section under consideration;
invisible tags: should be virtually removed, as well as all of their contents (including
descendant tags), from the document section under consideration;
visible tags: signal the end of the document section under consideration;
terminal tags: signal the end of the document section under consideration, as well
as of the sequence of sections under consideration;
the availability of the visible tag class in addition to the terminal tag class makes it possible to construct a sequence of virtual strings using one API call only (e.g. a visible tag could be </li> and a terminal tag could be </ul>);
unexpected tags: should not be found within those areas of the document section under
consideration that are outside invisible tags.
A visibility partition is a partition of the set of all possible tag types that is
comprised of five sets, each of which represents a visibility class.
A partition is intrinsically comprehensive: each tag type must be assigned to exactly one visibility class.
At most four visibility classes can thus be empty.
When defining a visibility partition, the API user has the option to specify explicitly
only four visibility classes, with the remaining visibility class (whichever it is)
implicitly considered as including all tag types not yet assigned to a visibility
class.
As will be illustrated through examples, this can be very handy in practice.
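For illustration, a visibility partition with one implicit class might be declared as follows. This is a hedged sketch: the enum, method names and map-based representation are assumptions, not the actual API; the tag set is the one used in the example below.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class VisibilityPartitionDemo {

    enum Visibility { SKIPPED, INVISIBLE, VISIBLE, TERMINAL, UNEXPECTED }

    // Build a partition from four explicitly specified classes; every tag
    // type not yet assigned falls into the remaining (implicit) class.
    static Map<String, Visibility> partition(Set<String> allTags,
                                             Map<Visibility, Set<String>> explicit,
                                             Visibility implicitClass) {
        Map<String, Visibility> result = new HashMap<>();
        for (Map.Entry<Visibility, Set<String>> e : explicit.entrySet())
            for (String tag : e.getValue())
                result.put(tag, e.getKey());
        for (String tag : allTags)
            result.putIfAbsent(tag, implicitClass);
        return result;
    }

    public static void main(String[] args) {
        Set<String> tags = Set.of("i", "lb", "s", "t", "u", "v");
        Map<Visibility, Set<String>> explicit = Map.of(
            Visibility.INVISIBLE,  Set.of("i"),
            Visibility.VISIBLE,    Set.of("v"),
            Visibility.TERMINAL,   Set.of("t"),
            Visibility.UNEXPECTED, Set.of("u"));
        // <s> and <lb> are implicitly skipped.
        Map<String, Visibility> p = partition(tags, explicit, Visibility.SKIPPED);
        System.out.println(p.get("lb")); // SKIPPED
        System.out.println(p.get("i"));  // INVISIBLE
    }
}
```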
When invoked on a target node, the text virtualization API constructs a sequence of
virtual strings, virtual string by virtual string. The text virtualization algorithm
can be sketched as follows:
Each time a visible tag or terminal tag is encountered, a virtual string is produced and returned to the calling code (except when the visible tag is enclosed between a pair of matching invisible tags, cf. below). Virtual strings keep getting produced
until a terminal tag (or the end of the node list) is encountered, at which point
the sequence of virtual strings is complete.
If an unexpected tag is encountered, an exception is raised.
If a skipped tag is encountered, it is skipped, as can be expected.
When an opening invisible tag is encountered, all[3] subsequent nodes (both text nodes and tag nodes) are skipped, until a matching closing
invisible tag is encountered.
When a text node is encountered, its text is added to the virtual string under construction.
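The sketched algorithm can be rendered in simplified Java as follows. This is an assumption-laden illustration, not the actual implementation: node kinds are pre-resolved from the visibility partition, and the footnoted precedence rules for terminal tags encountered inside invisible tags are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class VirtualizationSketch {

    enum Kind { TEXT, SKIPPED, INVISIBLE_OPEN, INVISIBLE_CLOSE, VISIBLE, TERMINAL, UNEXPECTED }

    record Node(Kind kind, String text) {}

    // Walk the node list, accumulate text into the current virtual string,
    // emit it at each visible/terminal tag, and skip everything enclosed
    // between matching invisible tags.
    static List<String> virtualize(List<Node> nodes) {
        List<String> strings = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int invisibleDepth = 0;
        for (Node n : nodes) {
            switch (n.kind()) {
                case INVISIBLE_OPEN  -> invisibleDepth++;
                case INVISIBLE_CLOSE -> invisibleDepth--;
                case SKIPPED -> { /* skipped, as can be expected */ }
                case UNEXPECTED -> {
                    if (invisibleDepth == 0) throw new IllegalStateException("unexpected tag");
                }
                case TEXT -> {
                    if (invisibleDepth == 0) current.append(n.text());
                }
                case VISIBLE, TERMINAL -> {
                    if (invisibleDepth == 0) {
                        if (current.length() > 0) strings.add(current.toString());
                        current = new StringBuilder();
                        if (n.kind() == Kind.TERMINAL) return strings;
                    }
                }
            }
        }
        if (current.length() > 0) strings.add(current.toString()); // end of node list
        return strings;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node(Kind.TEXT, "first part"),
            new Node(Kind.INVISIBLE_OPEN, null),
            new Node(Kind.TEXT, "hidden"),
            new Node(Kind.INVISIBLE_CLOSE, null),
            new Node(Kind.VISIBLE, null),
            new Node(Kind.TEXT, "second part"),
            new Node(Kind.TERMINAL, null));
        System.out.println(virtualize(nodes)); // [first part, second part]
    }
}
```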
As an example, let us consider the following visibility partition over the tag set
{ <i>, <lb>, <s>, <t>, <u>, <v> }:
skipped tags = { <s>, <lb> }
invisible tags = { <i> }
visible tags = { <v> }
terminal tags = { <t> }
unexpected tags = { <u> }
Let the text virtualization API be invoked on the following section of an XML document
(the target node is the first node, i.e. the text node Once upon a time,), configured with the aforementioned visibility partition:
Once upon a time,<i /> there was a <s>sentence with an important</s> part,<lb merge="no" />
followed by an <i>irrelevant part</i>.<v />It was followed by<lb merge="no" />
a second sentence separated from the first by a visible tag.<lb merge="no" />
<v>A word near the end of the third sen-<lb merge="discard-hyphen" />
tence was split by a break tag.</v></t>
A sequence of three virtual strings is returned to the calling code:
Once upon a time, there was a sentence with an important part, followed by an .
It was followed by a second sentence separated from the first by a visible tag.
A word near the end of the third sentence was split by a break tag.
Each character of these virtual strings either is backed by a text node of the underlying
XML document, or is a virtual space.
A fourth virtual string would have been added to the sequence if there had been some
text between </v> and </t>; indeed, empty virtual strings are never added to the returned sequence.
The construction of a sequence of virtual strings could be formalized as a generalized
form of string projection [StringProjection] applied to a suitable representation of the selected sections of the XML document,
based on classes of nodes of the node list. The projection alphabet would be comprised
of one symbol only: text nodes that are not inside an invisible tag. The general alphabet
would also include as individual symbols the nodes containing standalone tags (either
opening, closing or empty) whose types match those included in the skipped, visible,
terminal and unexpected tag visibility classes. The general alphabet would further
include as individual symbols the sequences of nodes constituting complete elements
(i.e. opening tag, matching closing tag and the sequence of nodes in-between) whose
types match those included in the invisible tag visibility class.
Virtual Text Retrieval API
The virtual text retrieval API takes one virtual string and an information request
as input, and provides either the text of the virtual string, a node, an offset or
a stopper tag as output. Retrieval operations are independently performed by the API
user on the text of the virtual string provided by the API.
The virtual text retrieval API implicitly allows tag-aware search by letting the API
user perform full text search of a target virtual string. The virtual string exposes
the string that was virtually constructed from a section of the XML document. This
string is a regular string that can be searched like any other string. In particular,
the API user can apply her favorite regexp engine or pattern matching algorithm.
To obtain match points [Lini2001a] and otherwise access the nodes of the XML document backing the virtual string, all
the API user needs to provide is the index of the character of interest (from 0 to
N-1, if the length of the virtual string is N).
Five operations are provided by the virtual text retrieval API:
Get the text of the virtual string.
Get the text node backing the character at target index in the virtual string.
Get the offset, i.e. start index within the virtual string, of the text node backing
the character at target index in the virtual string.
Get the stopper tag (i.e. visible or terminal tag) delimiting (i.e. immediately following)
the virtual string, if any. The stopper tag qualifier can also be obtained.
Check whether the character at target index in the virtual string is a virtual space
or a regular character.
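These operations can be illustrated with a minimal stand-in for a virtual string over two backing text nodes joined by one virtual space. All names below are assumptions; the sketch shows how a full text match on the exposed string is mapped back to a backing node and its offset, i.e. how match points are obtained.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPointDemo {

    // Hypothetical stand-in: adjacent text chunks joined by virtual spaces.
    static final class VirtualString {
        final List<String> nodes;
        VirtualString(List<String> nodes) { this.nodes = nodes; }

        // Operation 1: get the text of the virtual string.
        String getText() { return String.join(" ", nodes); }

        // Operation 2: index of the backing node for the character at
        // 'index' (-1 for a virtual space between two nodes).
        int getBackingNode(int index) {
            int pos = 0;
            for (int n = 0; n < nodes.size(); n++) {
                if (index < pos + nodes.get(n).length()) return n;
                if (index == pos + nodes.get(n).length() && n < nodes.size() - 1) return -1;
                pos += nodes.get(n).length() + 1; // +1 for the virtual space
            }
            throw new IndexOutOfBoundsException();
        }

        // Operation 3: start index of that node's text in the virtual string.
        int getOffset(int nodeIndex) {
            int pos = 0;
            for (int n = 0; n < nodeIndex; n++) pos += nodes.get(n).length() + 1;
            return pos;
        }
    }

    public static void main(String[] args) {
        VirtualString vs = new VirtualString(List.of("Once upon", "a time"));

        // Full text search on the exposed string, then map the match back.
        Matcher m = Pattern.compile("time").matcher(vs.getText());
        if (m.find()) {
            int node = vs.getBackingNode(m.start());
            System.out.println("node " + node + ", local offset "
                    + (m.start() - vs.getOffset(node))); // node 1, local offset 2
        }
    }
}
```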
A virtual space is a space character in a virtual string that does not belong to any node of the underlying XML document. Virtual spaces are inserted instead of
break tags. If a break tag is preceded by a hyphen and if it contains a specific attribute
(i.e. merge), the break tag can be processed differently based on the value of the attribute.
Three behaviors are defined:
insert a virtual space (default behavior in the absence of the specific attribute);
merge the word preceding the break tag and the word following the break tag, preserving
the hyphen;
merge the word preceding the break tag and the word following the break tag, discarding
the hyphen.
The text virtualization API thus takes care of automatically adding virtual spaces
where specified, and otherwise virtually merges the two parts of hyphenated words,
preserving or discarding the hyphen as specified.
This mechanism lets the API user control how words split by a break tag are processed in the presence of a hyphen. All that is required of the API user is to set the value of a specific attribute for each word split by a break tag that should be virtually recomposed into one word when a virtual string is constructed.
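The three behaviors can be sketched as a decision on how to join the text before and after a break tag. The attribute values "no" and "discard-hyphen" are taken from the example earlier in this section; the keep-hyphen value is not shown there, so this sketch simply treats any other explicit value as keep-hyphen, which is an assumption.

```java
public class BreakTagMergeDemo {

    // Join the text preceding and following a break tag, based on the
    // value of the 'merge' attribute (null when the attribute is absent).
    static String join(String before, String after, String mergeAttr) {
        if (mergeAttr == null || mergeAttr.equals("no"))
            return before + " " + after;              // insert a virtual space
        boolean hyphenated = before.endsWith("-");
        if (mergeAttr.equals("discard-hyphen") && hyphenated)
            return before.substring(0, before.length() - 1) + after; // drop hyphen
        return before + after;                        // merge, keeping the hyphen
    }

    public static void main(String[] args) {
        System.out.println(join("sen-", "tence", "discard-hyphen")); // sentence
        System.out.println(join("It was followed by", "a second", "no"));
    }
}
```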
Virtual Text Update API and Virtual Tag Splicing API
The virtual text update API takes one virtual string and an update specification as
input, and updates/inserts/deletes text nodes as output.
The virtual tag splicing API takes one virtual string and a splicing specification
as input, and inserts tag nodes as output.
The virtual text update API allows the API user to modify the text of a virtual string. The virtual tag splicing API allows the API user to splice (i.e. insert) tags into a virtual string.
Both APIs allow updates to be specified in a straightforward manner and have these updates transparently propagated to the underlying XML document. This may involve, in addition to the update of nodes, the insertion of nodes into the node list as well as the deletion of some of its nodes. To specify updates to a virtual string, all the API user needs to provide, besides the update description, is the index of the character (or the pair of indexes of the characters) of interest (from 0 to N-1, if the length of the virtual string is N).
One operation is provided by the virtual text update API:
Updating a target substring of a virtual string: a pair of target indexes (i.e. start and end indexes) in the virtual string is specified, along with a replacement string. If the replacement string is shorter than the target substring, some of the rightmost backing text nodes might be completely deleted from the XML document.
Similarly to XQuery updates [XQueryUpdateIntro], virtual text updates are not applied immediately. Virtual text updates are accumulated
into a list of pending updates. The listed updates are all applied when the API user
explicitly invokes the execution of the updates.
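The deferred-execution scheme can be sketched as follows, under assumptions: the class and method names are invented, the sketch operates on the virtual string's text directly (ignoring propagation to backing nodes), and pending updates are applied right-to-left so that earlier indexes remain valid.

```java
import java.util.ArrayList;
import java.util.List;

public class PendingUpdatesDemo {

    record Update(int start, int end, String replacement) {}

    static final class VirtualStringUpdater {
        private final List<Update> pending = new ArrayList<>();
        private String text;
        VirtualStringUpdater(String text) { this.text = text; }

        // Updates are accumulated, not applied immediately.
        void updateSubstring(int start, int end, String replacement) {
            pending.add(new Update(start, end, replacement));
        }

        // All listed updates are applied on explicit request, from the
        // rightmost to the leftmost so that indexes stay valid.
        String execute() {
            pending.sort((a, b) -> Integer.compare(b.start(), a.start()));
            for (Update u : pending)
                text = text.substring(0, u.start()) + u.replacement() + text.substring(u.end());
            pending.clear();
            return text;
        }
    }

    public static void main(String[] args) {
        VirtualStringUpdater up = new VirtualStringUpdater("Once upon a time");
        up.updateSubstring(0, 4, "Twice"); // pending, not yet applied
        up.updateSubstring(12, 16, "while");
        System.out.println(up.execute()); // Twice upon a while
    }
}
```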
Two operations are provided by the virtual tag splicing API:
Inserting a pair of opening and closing tags around a target substring of a virtual string, specified with a pair of target indexes (start and end indexes).
Inserting a single (opening, closing or empty) tag at a specified index into a virtual
string.
Also similarly to XQuery updates [XQueryUpdateIntro], virtual tag splicing updates are not applied immediately, but accumulated into
a list of pending updates that are applied together when explicitly requested by the
API user.
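A flat sketch of the two splicing operations, expressed over the serialized text for readability (the method names are assumptions; the real API inserts tag nodes into the node list and transparently propagates the insertions across node boundaries, which this sketch ignores):

```java
public class TagSplicingDemo {

    // Operation 1: insert an opening/closing tag pair around the target
    // substring [start, end) of the virtual string.
    static String spliceAround(String text, int start, int end, String tag) {
        return text.substring(0, start) + "<" + tag + ">"
             + text.substring(start, end) + "</" + tag + ">" + text.substring(end);
    }

    // Operation 2: insert a single (here: empty) tag at the given index.
    static String spliceAt(String text, int index, String tag) {
        return text.substring(0, index) + "<" + tag + "/>" + text.substring(index);
    }

    public static void main(String[] args) {
        String s = "Once upon a time";
        System.out.println(spliceAround(s, 5, 9, "b")); // Once <b>upon</b> a time
        System.out.println(spliceAt(s, 0, "lb"));       // <lb/>Once upon a time
    }
}
```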
Related Work
Query and Update Frameworks
Information Extraction
Information Extraction (IE) [IE] consists of extracting structured information from unstructured documents. Therefore,
the semantic tagging of the FEW dictionary could theoretically be modelled as an IE
problem (named entity recognition, coreference resolution, terminology extraction,
relationship extraction). In practice, however, we hypothesize (but neither claim nor prove) that the quantity and complexity of the rules involved in identifying fields of information in the FEW dictionary together make it intractable to use existing general-purpose IE software [GATE, IBMIE].
For instance, the General Architecture for Text Engineering (GATE) [GATE] makes it straightforward to define a large number of pattern matching and text processing rules, albeit with limited access to context. From our reading of the online documentation
of GATE, we hypothesize that most rules for the semantic tagging of the FEW depend
so much on context that they cannot be expressed within GATE.
Whether the stated hypothesis is proved or disproved, it will be very interesting
to more deeply explore to what extent the semantic tagging of the FEW can be modelled
as an IE problem.
Regular Fragmentations and XML Fragments
The Regular Fragmentations retrieval language and API [StLaurent2001a, RegFrag] allows full text search in XML-encoded corpora (keyword search at least, as well
as regexp search). Regular Fragmentations does allow updates as it is a filter tool
made available through either an XML configuration file or an API. Its internal representation
is tree-based and can be modified based on an implied processing model. Matching on
both markup and textual content is (currently?) not supported, which makes it unsuitable for the semantic tagging of the FEW dictionary.
The XML fragments query mechanism [Carmel2003a] allows full text search (keyword search only, no regexp search) and structural search
in XML-encoded corpora. It does not allow updates. It does support tag-aware search.
Queries are described as exemplar fragments of the XML document, which allows tag-aware
search, albeit with limited flexibility. Lack of support for join operations is noted
to be a major limitation of [the] XML fragment model, and a major restriction of most IR systems
as compared to DB systems [Carmel2003a]. This remark also highlights that the major strength of our proposed API is its
flexibility: Full access to the underlying XML document is provided with the retrieved
results, enabling arbitrary supplementary processing of the retrieved results, as
well as arbitrary updates of the underlying XML document.
Reading Contexts
The XTReSy [Lini2001a] retrieval language allows full text search in XML-encoded corpora (keyword search,
probably not regexp search).
XTReSy does not allow updates. XTReSy supports tag-aware search and has the capability to return match points. Interestingly, text normalization can be flexibly defined and is part of the language. The construction of the in-memory representation of XML documents in our work also features text normalization, but as opposed to XTReSy, punctuation is preserved (as it can constitute useful hints to detect types of information) and case streamlining is left to the API user (see Virtual Text Retrieval API here above). XTReSy actually introduced the concept of tag-aware search (referred to as tag-dependent search [Lini2001a]). XTReSy also introduced a tag visibility model that is extended by our proposed model (see above).
The XTReSy [Lini2001a] tag visibility model is comprised of three classes:
soft tags: can be mapped to skipped tags in our tag visibility model;
jump tags: can be mapped to invisible tags in our tag visibility model;
hard tags: can be mapped to visible and terminal tags in our tag visibility model.
The XGTagger [Tannier2005a] retrieval tool allows full text search in XML-encoded corpora (keyword search, as
well as regexp search). XGTagger supports tag-aware search but cannot return match
points. XGTagger does not allow updates. Instead, an external tool can update (only
once) the virtualized XML document. XGTagger relies on the XTReSy [Lini2001a] tag visibility model to virtualize XML documents. Our proposed concept of virtual
string is conceptually close to Tannier's concept of reading context. Our virtualization
mechanism is more flexible than XGTagger [Tannier2006a, Tannier2006b]. Firstly, our API allows the API user to dynamically invoke an arbitrary number of virtualizations and updates on different sections of the XML document. Secondly, XGTagger cannot return
match points. Thirdly, though reading contexts are backed by the underlying XML document,
XGTagger does not expose the XML document to the external updating tool. Fourthly,
our tag visibility model is an extended version of the XTReSy [Lini2001a] model on which XGTagger relies to virtualize XML documents; the distinction between
visible and terminal tags in our model allows the API user to transparently obtain
a scoped iterator over multiple virtual strings.
XQuery
The XQuery [XQuery] retrieval language allows, among many powerful features, full text search in XML-encoded
corpora (keyword search as well as, to a certain extent, regexp search). The Full
Text extension [XQueryXPathFullText, XQueryFullTextIntro, Imhof2008a] adds full-text search capabilities to XQuery. The Full Text extension is a very
recently (as of January 2010) stabilized W3C Candidate Recommendation. The Full Text
extension does support tag-aware search, to a certain extent. Tags can be configured
to be made invisible during search through the use of the FTIgnore option. It must be noted that constructing an intermediate full text representation and searching this full text representation are tightly interwoven, in order to return to user code the XML elements that include contents matching the full text search query.
XQuery, through its Update Facility [XQueryUpdate, XQueryUpdateIntro], allows powerful forms of updates to the processed XML document. The Update Facility
is a recently (as of June 2009) stabilized W3C Candidate Recommendation. Its internal
representation is tree-based and XPath-addressable [XPath]. XQuery Full Text and XQuery Update Facility could thus be combined.
XQuery considered together with XQuery Full Text and the XQuery Update Facility is
a very powerful and versatile technology. What then could make it unsuitable for the
semantic tagging of the FEW dictionary? FEW algorithms often need to perform additional
search operations in the textual context around full text search matches. Results
returned to user code by XQuery, though, are XML elements. It is thus not straightforward
to perform subsequent search operations on these results, or obtain match points for
arbitrary characters in the context of the returned results.
We hypothesize that our proposed virtual text retrieval API might be implemented in
XQuery on top of Full Text. However, we also hypothesize that offering the same expressiveness
and flexibility would lead to a performance overhead. Indeed, XQuery returns XML elements
instead of virtual strings, which means a partial loss of intermediate information
(such as match points) that will have to be reconstructed over and over. It is possible,
however, that such performance overhead could be addressed through good software engineering
and extensions to the XQuery language.
Philosophically, the key feature of our proposed virtualization mechanism over XQuery
is to offer linguists a way to think of the dictionary and express themselves in a
familiar environment: text without tags.
Other relevant works
The lq-text [Quin2008a] retrieval tool allows full text search in XML-encoded corpora (both keyword search
as well as regexp search).
It does support a limited form of update, i.e. adding tags around keyword matches.
It supports useful Natural Language Processing features such as punctuation-aware
and plural-aware search. Its internal representation is not tree-based. It does not
support tag-aware search (as considered in this paper), as far as we can tell.
The Nexi [Kamps2006a] query language allows full text search in XML-encoded corpora (keyword search only).
It does not allow updates. It does support tag-aware search based on several expressive
query forms (it is claimed to be at least as expressive as XML Fragments [Carmel2003a], see above), but not to the extent required for the semantic tagging of the FEW
dictionary (see [Kamps2006a], Section 6, p. 433).
Semantic Tagging of XML-Encoded Corpora
The semantic tagging of the Dictionary of Middle Dutch [Voort2005a] as well as that of the TLFi dictionary [Dendien2003a, TLFi, TLFi04a] was partially automated. However, due to lack of support for virtualization of
the XML document, there was little flexibility in the retrieval of the fields of information.
Exceedingly complex and brittle regular expressions including hard-coded markup were
used. In practice, this certainly limited any changes to the sequence in which fields
of information were tagged, and also discouraged the use of attributes in the markup
(because tags, attributes and attribute values all had to be hard-coded statically
within the regular expressions).
Concluding Remarks
An API providing support for flexible tagging of text-oriented XML documents was introduced
in this paper. It is based on the virtualization of selected sections of the XML document.
It enables tag-aware full text search, the retrieval of match points, and transparent updates to the underlying XML document.
The API enabled successful, automated tagging [Renders2009a, Briquet2009a]
of a test corpus (146 articles, about 0.75% of the total number of articles) of the
digitized version of the FEW, which is one of the most complex paper dictionaries.
Coverage, defined here as the percentage of the length of relevant sections of the
XML document that are semantically tagged, consistently converges around 99.3%.
Precision and recall are not yet available. These metrics are very difficult to measure
in the context of the FEW: A highly-trained human operator would need several days
to manually add semantic tagging to the average FEW article.
A Java implementation of the API was completed during the summer of 2009. The API
itself is comprised of ~7500 lines of Java code, with additional dependencies on other
parts of the code base (~69000 lines) of the FEW semantic tagging project, notably
the node list data structure. It is clear that the current implementation, while 100%
operational and used in production for a real-world project, can be optimized for
speed (addressing, virtual text update and virtual tag splicing APIs) and for improved
memory management (text virtualization API). One of the logical next steps will be
to package (and optimize) the API implementation and the node list data structure
into a library, so that other projects can benefit from them as well. External feedback
on the syntax, semantics and expected performance of the operations provided by the
API will be valuable.
To conclude, the take-home message could be stated as follows. Our proposed virtual
text retrieval API might be implemented in XQuery on top of Full Text, although probably
at the expense of a performance overhead. The key feature of our proposed virtualization
mechanism is to offer linguists a way to think of the dictionary and express themselves
in a familiar environment: text without tags. Its flexibility and expressiveness are
derived from blending prior ideas, i.e. reading contexts [Lini2001a, Tannier2006a, Tannier2006b] and regular fragmentations [StLaurent2001a], with a good dose of dynamicity (although this blending was unconscious at the time the mechanism was designed). Given the scope and complexity of the FEW dictionary
[Renders2007a, Wartburg1922a], an environment or data processing abstraction that is flexible and familiar to
the API user might be more important than the power and breadth of features of successful,
general-purpose technologies such as XQuery [XQuery]. Support for natural linguistic reasoning, i.e. tag-aware mixed-information (text/markup) retrieval and update, is the main
contribution of our paper.
Acknowledgments
We would like to thank Xavier Dalem, Stéfan Sinclair and the anonymous reviewers for
their helpful suggestions that greatly improved the quality of the paper.
We also would like to thank Eva Buchi for her encouragement to prepare this paper,
as well as for a particularly relevant bibliographical hint that greatly helped us
identify related works.
Figure 1 includes icons from the Tango library, under Creative Commons Attribution Share-Alike license.
Finally, we would like to thank Chris Lilley for proofreading the final version of
the paper; of course, all remaining typos and grammatical mistakes are ours.
References
[Briquet2009a] Cyril Briquet and Pascale Renders. «Une approche reposante (RESTful) des aspects opérationnels de la rétroconversion
du Französisches Etymologisches Wörterbuch (FEW)». Proc. Liège Day in Processing of Gallo-Roman Sources (TraSoGal), May 2009.
[Buchi1996a] Éva Büchi. «Les Structures du /Französisches Etymologisches Wörterbuch/. Recherches métalexicographiques
et métalexicologiques», Niemeyer, Tübingen, 1996.
[Carmel2003a] David Carmel, Yoelle S. Maarek, Matan Mandelbrod, Yosi Mass and Aya Soffer. «Searching XML documents via XML fragments». Proc. SIGIR, Toronto, ON, 2003.
[Dendien2003a] Jacques Dendien and Jean-Marie Pierrel. «Le Trésor de la Langue Française informatisé. Un exemple d’informatisation d’un dictionnaire
de langue de référence». In Traitement Automatique des Langues 43 (2), 2003.
[Kamps2006a] Jaap Kamps, Maarten Marx, Maarten de Rijke and Börkur Sigurbjörnsson. «Articulating Information Needs in XML Query Languages». In ACM Transactions on Information Systems 24 (4), October 2006. doi:https://doi.org/10.1145/1185877.1185879
[Lini2001a] Luca Lini, Daniella Lombardini, Michele Paoli, Dario Colazzo and Carlo Sartiani. «XTReSy: A Text Retrieval System for XML documents». In D. Buzzetti, H. Short, and G. Pancalddella, editors, Augmenting Comprehension:
Digital Tools for the History of Ideas. Office for Humanities Communication Publications,
King's College, London, 2001.
[Renders2007a] Pascale Renders. «L’informatisation du Französisches Etymologisches Wörterbuch : quels objectifs, quelles
possibilités ?». Proc. Congrès International de Linguistique et de Philologie Romanes, Innsbruck,
Austria, September 2007.
[Renders2009a] Pascale Renders and Cyril Briquet. «Conception d’algorithmes de rétroconversion». Proc. Liège Day in Processing of Gallo-Roman Sources (TraSoGal), May 2009.
[Tannier2005a] Xavier Tannier, Jean-Jacques Girardot and Mihaela Mathieu. «Classifying XML Tags through Reading Contexts». Proc. ACM Symposium on Document Engineering, Bristol, UK, 2005. doi:https://doi.org/10.1145/1096601.1096638
[Voort2005a] John van der Voort van der Kleij. «Reverse Lemmatizing of the Dictionary of Middle Dutch (1885-1929) Using Pattern Matching». Proc. Conf. Computational Lexicography and Text Research, Budapest, Hungary, 2005.
[Wartburg1922a] Walther von Wartburg et al. «Französisches Etymologisches Wörterbuch. Eine darstellung des galloromanischen sprachschatzes», 25 volumes, Bonn/Heidelberg/Leipzig-Berlin/Bâle, Klopp/Winter/Teubner/Zbinden, 1922-2002.
[1] The research reported in this paper was supported by grant ANR-06-CORP-007-03.
[2] The majority of the research reported here was conducted while Cyril Briquet was at
CNRS in 2008-2009.
[3] After the encounter with an invisible tag, the encounter (immediately or a few nodes
further) with a terminal tag, as well as the encounter with the end of the node list,
triggers a virtual string to be returned to the calling code. This is not the case
when a visible tag is encountered. As invisible tags have precedence over visible
tags, the latter ones are also skipped until a matching closing invisible tag is encountered.