Rehm, Georg, Oliver Schonefeld, Thorsten Trippel and Andreas Witt. “Sustainability of Linguistic Resources Revisited.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation
of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the
Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010).
Balisage Paper: Sustainability of Linguistic Resources Revisited
Georg Rehm works at DFKI, the German Research Center for Artificial
Intelligence, where he coordinates META-NET, a strategic pan-European research project on
Machine Translation and multilingualism. He holds a PhD
in Computational Linguistics and has been working with SGML and
related technologies in the context of Natural Language Processing since 1995.
Oliver Schonefeld works at the Institut für Deutsche Sprache (Institute for
the German Language) in Mannheim and is involved in the projects TextGrid and Clarin.
He studied computer science with specialization in text technology at
Bielefeld University until 2005. After graduating he worked as a researcher at
Bielefeld University and later at Tübingen University's collaborative research
center Linguistic Data Structures.
His major research interests are the limitations of markup languages
(especially overlapping markup) and the use of markup languages in linguistic
description of language data.
Thorsten Trippel works at Tübingen University in a project on sustainability
of language resources called NaLiDa. This
national project aims at providing a platform for linguists to locate resources
they need and to enable them to produce long time usable data by introducing
them to relevant metadata descriptions and standards. He is part of national and
international standardization groups on language resources.
His major research interests are directed towards language resources in
general and specifically in terminology and lexicography/lexicon theory (PhD
Thesis: The Lexicon Graph Model: A generic Model for multimodal lexicon
development) including other types of resources such as speech corpora and
involving other modalities. He has conducted research in speech technology and
textual corpus linguistics, has been working with (XML-)databases for
information retrieval over highly structured data and run research projects on
interface design for such data.
Work at his previous affiliation Bielefeld University involved research
projects in Brazil, transforming archives of handwritten texts into web-usable
multi purpose sources for computational linguists and historians. Additionally
he taught at Bielefeld University, and various institutions and summer schools,
for example in introducing text technological and computational linguistic
backgrounds to field linguists and language documentarists in West-Africa.
Witt received his Ph.D. in Computational Linguistics and Text Technology from
the Bielefeld University in 2002 (dissertation title: “Multiple
Informationsstrukturierung mit Auszeichnungssprachen. XML-basierte Methoden und
deren Nutzen für die Sprachtechnologie”).
After graduating in 1996, he started as a researcher and instructor in
Computational Linguistics and Text Technology. He was heavily involved in the
establishment of the minor subject Text Technology in Bielefeld University's
Magister and B.A. program in 1999 and 2002 respectively. After his Ph.D. in 2002
he became an assistant lecturer, still at the Text Technology group in
Bielefeld. In 2006 he moved to Tübingen University, where he was involved in a
project on “Sustainability of Linguistic Resources” and in projects on the
interoperability of language data. Since 2009 he has been a senior researcher at the
Institut für Deutsche Sprache (Institute for the German Language) in Mannheim.
Witt is and was a member of several research organizations, amongst them the
TEI Special Interest Group on overlapping markup, for which he was involved in
the writing of the latest version of the chapter “Multiple Hierarchies”, which
is included in TEI-Guidelines P5.
Witt's major research interests deal with questions on the use and
limitations of markup languages for the linguistic description of language data.
Data providers, users, and funders alike want and need sustainability of language
resources (e.g. language corpora, grammars, etc.); sustainability requires making
the resources available according to defined processes, platforms, or archives in
a reproducible and reliable way. A three-year project on sustainability of linguistic
resources conducted at Tübingen, Hamburg, and Potsdam illuminates some of the difficulties:
the prevalence of stand-off markup (requiring a layer of specialized tools atop the
XML stack), machine-generated XML of low clarity, ad hoc non-standard tag sets, discoverability,
and selection criteria for long-term archiving. XML and other standards are necessary
but not sufficient ingredients in the mix.
This paper discusses work on the sustainability of linguistic resources as it was
conducted in various projects, including the work of a three-year project,
Sustainability of Linguistic Resources, which finished in
December 2008, a follow-up project, Sustainable linguistic data,
and initiatives related to the work of the International Organization for Standardization
(ISO) on developing standards for linguistic resources. The individual projects have
been conducted at German collaborative research centres at the Universities of Potsdam,
Hamburg and Tübingen, where the sustainability work was coordinated.
Today, most language resources are represented in XML. The representation of data in
XML is an important prerequisite for long-term preservation, but a reasonable
representation format such as XML alone is not sufficient. Though XML is said to
be human-readable, legibility is clearly a rather problematic notion in the case
of photos encoded in SVG, complex structures generated from data dumps of databases or
other applications, or even formats such as Office Open XML. In the linguistic data
community, various flavours of stand-off annotation also demonstrate the complexity
of the problem.
Usually these data formats are not meant to be read by humans, though the advantages
mentioned in XML introductions still hold: data modelled according to a
standardized and continuously maintained XML formalism can be read and analysed by
human users, who can re-engineer tools with simple parsers for validation and some mental effort.
Case Study: The Project “Sustainability of Linguistic Resources”
This section briefly presents SPLICR, the Web-based Sustainability Platform for
Linguistic Corpora and Resources aimed at researchers who work in Linguistics or
Computational Linguistics: a comprehensive database of metadata records can be explored
and searched in order to find language resources that could be appropriate for one’s
specific research needs. SPLICR also provides a graphical interface that enables users
to query and to visualise corpora.
The project in which SPLICR was developed aimed at sustainably archiving the language
resources that were constructed in three collaborative research centres. The groups in
Tübingen (SFB 441: “Linguistic Data Structures”), Hamburg (SFB 538: “Multilingualism”),
and Potsdam/Berlin (SFB 632: “Information Structure”) built a total of 56 resources
– corpora and treebanks mostly. According to our estimates it took more than one
hundred person years to collect and to annotate these datasets. The project had two
goals: (a) To process and to sustainably archive the resources so that they are still
available to the research community and other interested parties in five, ten, or
20 years' time. (b) To enable researchers to query the resources both on the level of
their metadata and on the level of linguistic annotations. In more general terms,
the main goal was to enable solutions that leverage the interoperability, reusability,
and sustainability of a large collection of heterogeneous language resources.
One of the obstacles we were confronted with was providing homogeneous means of
accessing a large collection of diverse and complex linguistic resources. For this
purpose we developed several custom tools in order to normalise the corpora and their
metadata records.
Normalization of Linguistic Resources
Language resources are nowadays usually built using XML-based representations and
contain several concurrent annotation layers that correspond to multiple levels of
linguistic description (e.g., part-of-speech, syntax, coreference). Our approach
included the normalization of XML-annotated resources, e.g., for cases in which
corpora use PCDATA content to capture both primary data (i.e., the original text or
transcription) as well as annotation information (e.g., POS tags). We used a set of
tools to ensure that only primary data is encoded in PCDATA content and that all
annotations proper are encoded using XML elements and attributes.
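The kind of normalization described above can be illustrated with a minimal sketch. It assumes a made-up input convention in which tokens and POS tags share PCDATA as `token/TAG` pairs; the element names `s` and `w`, the attribute `pos`, and the function `normalize_sentence` are hypothetical and are not taken from the project's actual tools.

```python
import xml.etree.ElementTree as ET


def normalize_sentence(sentence: ET.Element, sep: str = "/") -> ET.Element:
    """Return a copy of the sentence in which each 'token/TAG' pair
    becomes <w pos="TAG">token</w>, so PCDATA holds primary data only."""
    normalized = ET.Element(sentence.tag, dict(sentence.attrib))
    for item in (sentence.text or "").split():
        token, found, tag = item.rpartition(sep)
        if not found:  # no separator: the whole item is primary data
            token, tag = item, ""
        word = ET.SubElement(normalized, "w")
        word.text = token
        if tag:
            word.set("pos", tag)
    return normalized


# Hypothetical input mixing primary data and annotation in PCDATA.
source = ET.fromstring('<s id="s1">The/DT corpus/NN grows/VBZ</s>')
result = normalize_sentence(source)
print(ET.tostring(result, encoding="unicode"))
```

After the transformation the text nodes carry only the original words, while the POS information lives in attributes and can be validated or stripped independently.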
A second reason for the normalization procedure was that both hierarchical and
timeline-based corpora needed to be transformed into a common annotation approach,
because we wanted our users to be able to query both types of resources at the same
time and in a uniform way. The approach can be compared to the NITE Object Model
(Carletta et al. 2003): we developed tools that semiautomatically
split hierarchically annotated corpora that typically consist of a single XML
document instance into individual files, so that each file represented the
information related to a single annotation layer; this approach also guaranteed that
overlapping structures can be represented straightforwardly. Timeline-based corpora
were also processed in order to separate graph annotations. This approach enabled us
to represent arbitrary types of XML-annotated corpora as individual files, i.e.,
individual XML element trees. These were encoded as regular XML document instances,
but, as a single corpus comprises multiple files, there was a need to go beyond the
functionality offered by typical XML tools to enable us to process multiple files,
as regular tools work with single files only. The normalization process is
described in more detail in Witt et al. 2007.
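As a rough illustration of splitting a single hierarchically annotated document into separate layers, the following sketch records, for each layer, the character offsets its elements cover in the primary text. The mapping `LAYER_OF`, the sample document, and the function `split_layers` are assumptions made for this example; the actual tools are described in Witt et al. 2007 and may work quite differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical assignment of element names to annotation layers.
LAYER_OF = {"s": "sentences", "np": "syntax", "w": "tokens"}

DOC = "<s><np><w>The</w> <w>corpus</w></np> <w>grows</w></s>"


def split_layers(root, layer_of):
    """Return the primary text and, per layer, a list of
    (element name, start offset, end offset) triples."""
    text_parts = []
    layers = {}

    def visit(elem, offset):
        start = offset
        if elem.text:
            text_parts.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            offset = visit(child, offset)
            if child.tail:
                text_parts.append(child.tail)
                offset += len(child.tail)
        layer = layer_of.get(elem.tag)
        if layer is not None:
            layers.setdefault(layer, []).append((elem.tag, start, offset))
        return offset

    visit(root, 0)
    return "".join(text_parts), layers


primary, layers = split_layers(ET.fromstring(DOC), LAYER_OF)
for name, spans in layers.items():
    print(name, spans)
```

Each layer could then be serialized to its own file alongside the primary text. Because every layer refers to the text only through offsets, overlapping structures in different layers do not conflict.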
Figure 1: Resource normalization and SPLICR's staging area.
Normalization of Metadata Records
The separation of the individual annotation layers contained in a corpus has
serious consequences with regard to legal issues: due to copyright and personal
rights specifics that usually apply to a corpus’s primary data we provided a
fine-grained access control layer to regulate access by means of user accounts and
access roles. We had to be able to explicitly specify that a certain user only has
access to the set of, say, six annotation layers (in this example they might be
available free of charge for research purposes) but not to the primary data, because
they might be copyright-protected.
The generic metadata schema used for SPLICR, named eTEI, was
based on the TEI P4 header and extended by a set of additional requirements. We
decided to store both eTEI records and also the corpora in an XML database. The
underlying assumption was that XML-annotated datasets are more sustainable than, for
example, data stored in a proprietary relational DBMS. The main difference between
eTEI and other approaches is that the generic eTEI metadata schema, formalized as a
document type definition (DTD), can be applied to five different levels of
description. One eTEI file contains information on one of the following levels: (1)
setting (for recordings or transcripts of spoken language, this describes the situation in
which the speech or dialogue took place); (2) raw data (e.g., a book, a piece of
paper, an audio or video recording of a conversation etc.); (3) primary data
(transcribed speech, digital texts etc.); (4) annotations; (5) a corpus (consists of
primary data with one or more annotation levels). We devised a workflow that helps
users to edit eTEI records. The workflow’s primary components were the eTEI DTD and
the Oxygen XML editor. Based on structured annotations contained in the DTD we
automatically generate an empty XML document with embedded documentation and a
Schematron schema. The Schematron specification is used to check whether all
elements and attributes instantiated in an eTEI document conform to the current
level of metadata description.
The sustainability platform SPLICR consists of a front-end and a back-end. The
front-end is the part visible to the user and is realized using JSP (Java Server Pages) and
Ajax technology. It runs in the user’s browser and provides functions for searching
and exploring metadata records and corpus data. The back-end hosts the JSP files and
related data. It accesses two different databases, the corpus database and the
system database, as well as a set of ontologies and additional components. The
corpus database is an XML database, extended by AnnoLab, an XML/XQuery-based
corpus query and management framework that was specifically designed to deal with
multiple possibly concurrent annotation layers, in which all resources and metadata
are stored. The system database is a relational database that contains all data
about user accounts, resources (i.e., annotation layers), resource groups (i.e.,
corpora) and access rights. A specific user can only access a specific resource if
the permissions for this user/resource tuple allow it.
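The permission model sketched in this paragraph (resources as annotation layers, resource groups as corpora, access decided per user/resource tuple) might look roughly as follows. The class `AccessControl`, its methods, and all sample names are hypothetical; SPLICR's actual relational schema is not documented here.

```python
from dataclasses import dataclass, field


@dataclass
class AccessControl:
    # Resource groups (corpora) mapped to their resources (layers).
    groups: dict
    # Explicit grants stored as (user, resource) tuples.
    grants: set = field(default_factory=set)

    def grant(self, user, resource):
        self.grants.add((user, resource))

    def can_access(self, user, resource):
        return (user, resource) in self.grants

    def accessible_layers(self, user, corpus):
        """List the layers of a corpus that this user may read."""
        return sorted(r for r in self.groups.get(corpus, ())
                      if self.can_access(user, r))


# Hypothetical corpus with copyright-protected primary data.
acl = AccessControl(groups={"corpus-a": {"primary", "pos", "syntax"}})
acl.grant("alice", "pos")
acl.grant("alice", "syntax")
print(acl.accessible_layers("alice", "corpus-a"))
```

In this sketch access is denied by default, so a user sees the annotation layers that were granted but not the primary data unless it is granted separately.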
SPLICR: Concluding Remarks
The corpus normalization and preprocessing phase in this project started in early
2007 and was finished in May 2008; the process of transforming the existing metadata
records into the eTEI format was completed in June 2008. Work on the querying engine
and integration of the XML database, metadata exploration and on the graphical
visualization and querying front-end as well as on the back-end was carried out in
the summer of 2008; a first prototype of the platform was finished in October 2008.
Rehm et al. 2009 gives a more detailed description of the project.
XML and Sustainability: Problems and Solutions
Problem: Stand-off Annotation
Stand-off markup refers to the physical separation of annotations and
text. Piotr Bański described this technique thoroughly at Balisage 2010 (Bański 2010). Stand-off annotation allows for marking up text
without altering it by the inclusion of markup. It is the opposite approach to
inline or embedded markup, which was one of the principal ideas behind SGML and its
successor XML. The term stand-off annotation was introduced
by Henry Thompson and David McKelvie in 1997 (Thompson & McKelvie 1997); however, the principles of
this technique are even older, since, e.g., the linking mechanisms described in TEI
already allowed texts to be marked up by linking annotations to text regions. Within the
last couple of years the use of stand-off markup has become predominant, especially for
complex linguistic annotations.
Linguistically annotated corpora use stand-off markup extensively. Stand-off is
also predominant within the forthcoming ISO standard “Linguistic Annotation
Framework” (LAF, Ide & Romary 2007).
Figure 2: LAF based linguistic annotation
<!-- base segmentation -->
<region id="r42" a="24 35"/>
<!-- annotation over the base segmentation -->
<node id="n16">
  <f name="pos" value="NN"/>
</node>
<edge from="n16" to="r42"/>
<!-- annotation over another annotation -->
<node id="n23">
  <f name="synLabel" value="NP"/>
  <f name="role" value="-SBJ"/>
</node>
<edge from="n23" to="n16"/>
<!-- ... -->
Stand-off annotation has witnessed an increase in use due to the advantages of
this approach (see Bański 2010 and Bański & Przepiórkowski 2009),
but considering the sustainability and interoperability point of view, there are
quite a few disadvantages (see Witt 2004):
very difficult to read for humans
the information, although included, is difficult to access using generic methods
limited software support as standard parsing or editing software cannot be employed
standard document grammars can only be used for the level which contains both markup and textual data
new layers require a separate interpretation
layers, although separate, often depend on each other
Our solution to these problems is to process the stand-off annotations
and the annotated source text so that multiple annotations of the same text are created
and archived together with the original stand-off
resources. This approach achieves sustainability through redundancy.
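To make the extra interpretation step of stand-off data concrete, the sketch below resolves nodes of the kind shown in Figure 2 to the text spans they ultimately annotate. It is an illustration only: the wrapper element `graph`, the sample sentence, and the function `resolve` are assumptions, and the code presumes that every node has exactly one outgoing edge and that edge chains end in a region.

```python
import xml.etree.ElementTree as ET

# Figure 2 data wrapped in a hypothetical <graph> root element.
STANDOFF = """<graph>
  <region id="r42" a="24 35"/>
  <node id="n16"><f name="pos" value="NN"/></node>
  <edge from="n16" to="r42"/>
  <node id="n23">
    <f name="synLabel" value="NP"/>
    <f name="role" value="-SBJ"/>
  </node>
  <edge from="n23" to="n16"/>
</graph>"""

# Invented primary text; characters 24..35 hold one noun.
TEXT = "The team has archived a translation."


def resolve(graph_xml, text):
    """Map each node id to (covered text, feature dict)."""
    graph = ET.fromstring(graph_xml)
    regions = {r.get("id"): tuple(int(n) for n in r.get("a").split())
               for r in graph.iter("region")}
    targets = {e.get("from"): e.get("to") for e in graph.iter("edge")}
    resolved = {}
    for node in graph.iter("node"):
        target = node.get("id")
        while target not in regions:  # follow edges down to a region
            target = targets[target]
        start, end = regions[target]
        features = {f.get("name"): f.get("value") for f in node.iter("f")}
        resolved[node.get("id")] = (text[start:end], features)
    return resolved


annotations = resolve(STANDOFF, TEXT)
for node_id, (span, features) in annotations.items():
    print(node_id, repr(span), features)
```

A generic XML parser alone yields only identifiers and offsets; the dereferencing loop is the layer of specialized processing that a future user would have to reimplement if only the stand-off files survived.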
Problem: Machine-Generated XML
Today, a lot of XML data is generated by machines. Many of those XML documents
are used for machine-to-machine communication, e.g., as SOAP-messages in web
services. However, these messages are rather short-lived and will not be considered
in this paper.
A growing number of applications use XML to store documents.
These XML documents differ greatly from handcrafted XML and are rather complicated,
especially with respect to the semantics of their tag sets, structure and code layout,
and are therefore difficult for humans to comprehend. Since users usually do not work
with these documents directly, this issue is not a big concern. From a sustainability
point of view, however, these documents present a challenge.
Figure 3: Screenshot of Microsoft Word 2007
As an example, Figure 3 shows a conference paper created with Microsoft Word.
Since the 2007 version of Microsoft Office, documents are saved by default in the Office
Open XML format (OOXML) (see ISO/IEC 29500:2008) and are – as the name
suggests – encoded in XML. With regard to sustainability this is, in
principle, a step in the right direction, but OOXML itself is not sufficient. [1] Without the corresponding application the generated XML document is very hard to
understand or to use. Figure 4 shows an excerpt of the
resulting OOXML document for the first heading and paragraph. The document is mostly
structured by sections and paragraphs, but the OOXML markup does not show
this structure in a transparent way. The following can be noted:
There is no difference in the markup used for headings and paragraphs. Both
are encoded by w:p elements. A heading is distinguished from a
paragraph by further information added through the
w:pStyle element. Its w:val attribute
denotes whether the construct is a heading (“Heading1”) or a regular
paragraph (“Textkorper”). More style information is encoded
in additional XML files, but this still does not yield enough
properties to resolve their role in structuring the text.
The running text in the paragraph is heavily fragmented. For example,
the words “Referenzkorpus” and “established” are – for no apparent
reason – both fragmented into three parts, with a middle part that
contains only a single character. The fragmentation could be the result of
editing the document in MS Word's Track Changes mode.
The markup contains rather complex constructs, e.g., the handling of
italics. The words “Mannheimer Korpus 1” are set in italics. The
formatting is applied to a text run (w:r element) which has
formatting information applied to it by means of a w:rPr
element.
Figure 4: OOXML excerpt
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve=""
xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word"
<!-- ... -->
<w:p w:rsidR="00A77FB8" w:rsidRDefault="00A77FB8">
<w:pStyle w:val="Heading1"/>
<w:ilvl w:val="0"/>
<w:numId w:val="5"/>
<w:p w:rsidR="00A77FB8" w:rsidRDefault="00A77FB8" w:rsidP="007357D1">
<w:pStyle w:val="Textkorper"/>
<w:t>The Institute for the German Language (IDS) has a long tradition in building
corpora. DeReKo (Deutsches Refe</w:t>
<w:t xml:space="preserve">enzkorpus), the Archive of General Reference Corpora of
Contemporary Written German, has been set </w:t>
<w:r w:rsidR="003A1540">
<w:t xml:space="preserve"> as the </w:t>
<w:t>Mannheimer Korpus 1</w:t>
<w:t xml:space="preserve"> project in 1964. Paul Grebe and Ulrich Engel succeeded in
compiling a corpus of about 2.2 million running words of Written German by 1967. Since
then, further corpus acquisition projects esta</w:t>
<w:t>lished a ceaseless stream of electronic text documents and let the corpus to grow
steadily (Kupietz & Keibel, 2009).</w:t>
<!-- ... -->
An excerpt of an OOXML document produced by MS Word 2007 (the
document was reformatted for readability).
Just having data encoded in XML does not automatically make the data sustainable.
Very complex tag sets such as OOXML in particular are of very limited use if one does
not have an application which understands these formats. For almost any given application,
obtaining and using such a piece of software will most probably pose a big problem a few years later.
Sustainability of software is a whole different topic by itself and is not within
the scope of this paper.
As a possible solution for this problem we propose to provide the
machine-generated XML data in multiple formats. For example, the OOXML document
can be stored in its native format, in plain text, or in Portable
Document Format (PDF). Furthermore, filters can be used to remove those XML elements and
attributes from the machine-generated code that are not necessary. It would be even
better to transform the machine-generated XML data to established formats such as
TEI (TEI P5).
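One possible shape of such a filter is sketched below: it reduces WordprocessingML paragraphs to a style name plus plain text, rejoining fragmented text runs along the way. The sample document is a made-up miniature (its heading text is invented), and the function `paragraphs` is hypothetical. Real documents contain further constructs, such as tables, tabs and line breaks, that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML vocabulary.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Made-up miniature document with a fragmented word.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t>Introduction</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Textkorper"/></w:pPr>
  <w:r><w:t>DeReKo (Deutsches Refe</w:t></w:r>
  <w:r><w:t>r</w:t></w:r>
  <w:r><w:t>enzkorpus)</w:t></w:r></w:p>
</w:body></w:document>"""


def paragraphs(document_xml):
    """Return (style name, plain text) for every w:p element."""
    root = ET.fromstring(document_xml)
    result = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_name = style.get(f"{{{W}}}val") if style is not None else None
        # Concatenate all text runs; run boundaries carry no meaning here.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        result.append((style_name, text))
    return result


extracted = paragraphs(SAMPLE)
for style_name, text in extracted:
    print(style_name, "|", text)
```

The output of such a filter could serve as the plain-text copy, or as a starting point for a conversion into an established format, with the style name as the only hint at a paragraph's structural role.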
Other than that, one should provide various descriptions and a thorough
documentation of the data format, not only providing the schema but also tutorials,
conceptual descriptions or similar documents for human reimplementation of tools
operating on the machine-generated XML code.
Problem: Proprietary Tag Sets
In the document lifecycle, especially when taking long-term maintenance and
archiving into account, it is a common problem that XML tag sets and document
grammars are being used that are not well established outside the
group defining the tagset. The use of XML tags following the insights and beliefs of
the individual who wrote the schema as such does not pose the problem, but the
interpretation of the schema by somebody reviewing the material later may cause
problems, because no one else knows and understands the implicit logical constraints
of tag and attribute names as well as data structures.
The usual answer to the use of proprietary tag sets would be not to use them at
all and to replace them by standard annotation schemas and tagsets wherever
possible. For example, TEI (TEI P5), tagsets developed in the
context of the standardization processes of ISO TC 37 SC 4 (“Language Resources”), or
DocBook for technical articles and texts (see Walsh & Muellner 1999) come to
mind. However, these tagsets do not always fit the given problem very well and using
them often results in the well-known problem of tag abuse: tags are used in
unintended ways or – even worse – users confuse the semantics of tags
with their intended use. In these cases the results are bound to be more confusing
than starting from an idiosyncratic tagset. Therefore, if users decide to use one of
the established tagsets they should thoughtfully select the most appropriate one for
the given problem.
More critical are those cases in which for various reasons no established tagset
is used. Reasons for not selecting established tagsets range from not knowing about
tagsets, not understanding tagsets, via policy reasons to the unavailability of
appropriate tagsets. For example, commercial terminological applications may use a
data model that is consistent with established standards (such as ISO 16642:2003 in combination with ISO 30042:2008) but use a
native XML format that is very similar but utilises different generic identifiers (for
example, SDL Trados MultiTerm 2009 shows this behaviour). The reason for this does
not lie in the technology, but in management decisions. In each of these cases it is not
sufficient to include the document grammar only to achieve valid XML, but further
documentation is required. The basic idea is to document everything.
One way of approaching this problem is by providing a reference in the element
description to an ontology or some other form of knowledge representation to define
the data types with possible values. Data types here refer both to XML elements and
attributes, similar to data types used in XML schema. The reference to the external
definition of the elements allows for a human user to evaluate the correctness of
the semantic interpretation, possibly also to automatically evaluate the content
using a parser. With external definitions the data types are unambiguously
described according to available means.
The definition by reference is only one part of the definition; for human use it
is advisable to accompany the tag set with documentation that includes multiple examples.
Such a prototype semantics of a tagset is intended to explain the
meaning of tags and attributes as applied in a given domain or application. For
human use it is also recommended to use names that bear a certain meaning, i.e., which
are easily interpretable by a person reading them. Interpreting and understanding
element and attribute names and values depends on a common background of the creator
and user. For example, it is harder if both do not use the same script or language,
because mutual intelligibility is important.
In the field of language resources this method has been implemented with ISOcat (ISO 12620:2009).
ISOcat is a registry for data categories used in describing terminological databases
and language resources. All data categories needed in these fields can be
registered with a unique identifier, definition and name in various languages. Several
data categories have been defined, but the list is open, hence it is possible to
insert data categories that are needed but not available in the registry yet. The
registry consists of two parts, a private and a public section. Every data
category that is defined or used by a project or tagset is first defined in a
private workspace that is nevertheless part of the registry and can be reused and
referenced. Data categories that are important for various contexts can then be
moved from private workspaces to the public area by domain experts. This
promotion includes a quality assessment of the definitions as well as a check for
possible redundancy in the registry. By this means consistency and documentation of
data categories is fostered, together with persistent identifiers of the data
categories, even in the case of the renaming of elements.
Based on the idea of persistent category definitions, the Component Metadata Infrastructure
(CMDI, see Broeder et al. 2010) was designed. CMDI is intended for
describing language resources. These resources are of various types and require
different metadata schemas to appropriately describe the contents in a form that
allows a human user to understand what kind of resources they have to expect. Most
of these schemas are far more detailed than traditional archivists' metadata
schemas containing bibliographical data: they also contain keywords, abstracts,
subject fields, participants, annotation schemas, etc. For reusability reasons the
data categories are clustered into components, and components
are combined to other components or to a profile, which is more
or less a metadata schema for a specific type of resource, the components also
allowing the definition of a value schema for each data category. The data
categories which are used in the components do not provide their own description,
but refer to the data category registry, for example ISOcat or Dublin Core, using
URIs. By this procedure, the concrete tag name becomes language, script and
application independent, because the definition is given in a central repository.
User interfaces are provided with the component registry web tool and the Arbil
Metadata editor developed by the Max Planck Institute for Psycholinguistics (all
available at the CMDI site).
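The effect of defining tags by reference can be shown with a small sketch in which two tagsets, written in different languages, become comparable because their names point to the same data category identifiers. The registry URIs below are placeholders on example.org, not real ISOcat or Dublin Core identifiers, and the function `align` is hypothetical.

```python
# Placeholder identifiers standing in for registry entries.
POS = "http://registry.example.org/datcat/part-of-speech"
LEMMA = "http://registry.example.org/datcat/lemma"
SOURCE = "http://registry.example.org/datcat/source"

# Two hypothetical tagsets: local name -> data category URI.
TAGSET_A = {"pos": POS, "lemma": LEMMA}
TAGSET_B = {"Wortart": POS, "Grundform": LEMMA, "Quelle": SOURCE}


def align(a, b):
    """Pair the names of tagset a with the names of tagset b that
    refer to the same data category."""
    by_category = {uri: name for name, uri in b.items()}
    return {name: by_category[uri]
            for name, uri in a.items() if uri in by_category}


mapping = align(TAGSET_A, TAGSET_B)
print(mapping)
```

The local names never need to match; only the references do, which is why renaming an element does not affect its interpretation as long as the identifier stays the same.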
Problem: Availability and Findability
Many researchers creating language resources are more than willing to share their
resources with close colleagues upon request. However, for various reasons such as
personal, privacy or property rights they tend to restrict public access to these
resources. Furthermore, resources created in research contexts are usually designed for
specific purposes such as the analysis of specific linguistic phenomena. The resource
itself is mostly not visible, because research publications discuss the phenomena and their
analysis, but usually do not describe the resource in great detail. However, these
publications are often the only documentation for the existence of the resource and
describe the rationale behind its creation.
Hence, accessibility to language resources is a major problem to be dealt with.
Especially in fields with a large economic interest in linguistic resources, such
as statistical language processing and machine learning, data centres or distribution
agencies were created to address the problem of accessibility. These data centres handle
material in large quantities and they use rather flat structures for their data.
In contrast, resources created by individual research projects and researchers
are often deeply structured and tend to be much more detailed and complex.
Data centres have standard procedures for intellectual property rights handling and
cataloguing resources using bibliographical procedures. Language resources from
commercially less interesting areas or resources that are deeply structured, can
hardly be found in these data centres. Even if such resources are accessible
elsewhere, they cannot be reliably located by general search engines. Most often
they will only be part of the statistical noise of general search engine results.
There are some specialized search engines, such as ODIN for interlinear glossed text, but
they usually do not provide users with knowledge about the text type and what kind
of structures and content to expect in the resource.
The solution is well-known from the initial ideas around the semantic web: metadata
descriptions of resources should be used that are based on standards,
quasi-standards, and best practice, and that are used for specialized catalogues of
resources. Providing exhaustive metadata records enables a possible user to
understand the structures and content of a resource: not necessarily the document
grammar, but at least a fair idea of the theory behind it.
Providing metadata refers to issues of proprietary tagsets and controlled
vocabulary again. The keywords used to describe a resource ideally refer to a
conceptual space, in which all concepts are well defined and classified according to
superordinate and subordinate terms. The reference to the concept system or to an
ontology requires standardized values. Standardized values means that a central,
accessible structure needs to provide them, i.e., a kind of a registry such as ISOcat.
In the process of metadata creation different perspectives can be taken: the
perspective of the author of the resource, the software engineer, the publisher and
the person looking for a resource later, to name just a few.
These different roles in relation to a resource are not mutually exclusive in terms
of metadata categories, but in the creation process different areas are emphasized. For
example, the publisher will usually be more interested in
making sure that the copyright is explicitly defined than the user searching for a
specific resource to be employed for a specific use case. Software engineers will be
interested in technical features, while archivists require bibliographical data.
For the creation of metadata it is essential to use the perspective of
prospective users. Though it can be argued that not all possible users and their
requirements can possibly be anticipated, the perspective of users, especially with
other backgrounds, helps to include not only technically relevant metadata but also
descriptive metadata relevant for human users. Technical metadata here means those
bits of information required by someone implementing tools for processing the data,
while descriptive metadata refers to those classifications that help a possible user
to understand the content of a resource before actually seeing it.
Taking various user groups into account when selecting the descriptive detail of
metadata also allows the design of structured search engines. Structured search
engines here refers to search engines that not only interpret the textual content of
pages but also take into account the structure of the metadata. The intention behind
using the structure of metadata is to provide search results with higher
precision while maintaining high recall at the same time, which is not necessarily
achieved by full-text search engines.
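The difference between full-text and structured search can be illustrated with a minimal sketch. The record fields, the sample records and both search functions below are hypothetical and only serve to show why field-aware matching tends to improve precision; they do not describe any particular catalogue or search engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """A simplified, hypothetical metadata record for a language resource."""
    identifier: str
    title: str
    language: str
    resource_type: str
    description: str


def full_text_search(records, term):
    """Match the term anywhere in a record, ignoring which field it occurs in."""
    term = term.lower()
    hits = []
    for record in records:
        text = " ".join(
            [record.title, record.language, record.resource_type, record.description]
        ).lower()
        if term in text:
            hits.append(record.identifier)
    return hits


def structured_search(records, **criteria):
    """Match only records whose named fields equal the requested values."""
    return [
        record.identifier
        for record in records
        if all(
            getattr(record, field).lower() == value.lower()
            for field, value in criteria.items()
        )
    ]


# Invented sample data, for illustration only.
RECORDS = [
    MetadataRecord("r1", "Spoken German corpus", "German", "corpus",
                   "Transcribed dialogues."),
    MetadataRecord("r2", "English treebank", "English", "treebank",
                   "Includes a chapter translated from German."),
    MetadataRecord("r3", "German lexicon", "German", "lexicon",
                   "Morphological lexicon."),
]

if __name__ == "__main__":
    # Full-text search also returns r2, which merely mentions German.
    print(full_text_search(RECORDS, "german"))
    # Structured search restricts the match to the language field.
    print(structured_search(RECORDS, language="German"))
```

In this sketch the full-text query returns every record mentioning the word, while the structured query returns only records whose language field matches, which is the precision gain the paragraph above refers to.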
Additionally, it is recommended to make these datasets available through as many
national and international catalogues and initiatives in the respective field or sub-field
as possible (see section “Conclusions”) and also to enable harvesting of metadata sets using OAI-PMH. With the help of these
it is possible to announce the availability of the dataset to the scientific
community using websites, blogs, etc.
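As a rough illustration of what metadata harvesting involves on the harvester's side, the following sketch assembles an OAI-PMH ListRecords request. The verb and the argument names (metadataPrefix, set, from) are defined by the OAI-PMH protocol, and oai_dc is the metadata format every OAI-PMH repository has to support. The base URL and the function name are assumptions made for this example, and no request is actually sent.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint.

    base_url is the repository's OAI-PMH endpoint (hypothetical in the
    usage below); set_spec and from_date are optional selective-harvesting
    arguments.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    if from_date is not None:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # The endpoint is a placeholder, not a real repository.
    print(build_list_records_url("https://repository.example.org/oai",
                                 from_date="2010-01-01"))
```

A harvester would fetch this URL, parse the XML response and follow any resumption token the repository returns; those steps are omitted here.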
Problem: Selection and Qualification for Long-Term Archiving
In the past, resources were either available or not, and a lot of data was lost due
to conversion problems, technical failure, etc. For each of these there are
technical solutions, but a major problem remains in the question: what is worth
archiving? Resources undergo a life cycle and in general it is agreed that not every
step in the life cycle is worth being archived. In contrast to that, some resources
are supposed to be archived, even if they did not reach the archival phase of the
intended life cycle. Finding formal criteria for deciding upon archiving or not is
a major problem that still remains unsolved, one that might be unsolvable as such.
Criteria for deciding which resource should be archived fall into different
categories: status, technical quality, organizational and institutional
requirements, extent of use, quality evaluation and longevity. Some of these
criteria depend on each other, but can be evaluated independently and therefore be
used to measure the need for archiving a resource.
The status of a resource defines the formal editing status, starting from first
draft versions to released or published versions, etc. Projects that work with a
life cycle model in resource creation need to archive those documents that are in
the archiving phase. Naming conventions and value schemas for the different phases
vary greatly. However, the archive status cannot be the sole criterion,
because in some projects resources get stuck in an earlier state and do not reach
the publication phase, but considering other criteria, they nevertheless may qualify for
or even require long-term archiving.
Especially for technological applications the technical quality can be of prime
importance. For some testing environments it is sufficient to have a resource that
is technically adequate and has the correct size, so it can serve as a reference point
or for testing procedures, algorithms and technologies, even if the content and
status as such are incomplete and still pending improvement. Consequently, the
technical quality can be a decisive factor for long-term archiving.
Institutionalized requirements may force data providers to submit material, for
example close to the end of a project life, while others are hesitant in providing
data for various reasons, even if the quality is much higher. These requirements are
usually negotiated with archivists and partners, but often result in
archiving the resource regardless of other criteria.
A resource that is widely used by various groups needs to be archived regardless
of other factors, because it is used as a reference – ignoring other criteria
such as quality and status. One reason could be that it is the only resource
available or has unique properties. Though the use of a resource by a variety of
users is complex to evaluate, this criterion seems to be obvious.
Quality is another factor in an evaluation matrix. In contrast to an approach
which might be termed a take-whatever-you-can-get approach in archiving, archiving
material without prior evaluation is not desired, as the information flood becomes
unmanageable, if not for saving, then for retrieval and search. The assessment can
be either formal, by algorithmic processes that can also provide information on the
technical quality mentioned before, or by a peer review process. In the latter,
experts decide on the quality of a resource and based on this judgment a resource
is archived or disregarded.
Even more problematic but essential is the question of longevity of a resource. A
resource that is most likely to be usable for a long period of time is supposed to
be archived. The usability over a long period depends on the application of a
resource. If the resource answers to demands that are continuously present, then the
resource needs to be available, hence archived, even if the number of users might
be small.
When measuring all of these criteria separately it is comparatively easy to
define a threshold of criteria that need to be fulfilled in order for a resource to
be archived. The threshold is selected in a way that each criterion can serve as an
overriding criterion, that is, if one of these criteria mandates archiving, then the
resource will be archived. But if there is no criterion with this requirement, the
values can accumulate. If the threshold is not set too low, the resource will then
The ultimate goal for working with resources is of course to achieve a
high-quality resource that is highly regarded by experts, used and usable for many
years, and reaches a maturity level that is technically well established, etc.
However, for most resources there are limitations that are not supposed to interfere
as knock-out criteria for long-term archiving.
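The threshold scheme described in this section, with overriding criteria and accumulating values, can be sketched as follows. The six criteria names come from the text; the 0.0 to 1.0 score scale, the explicit override flag and the default threshold of 3.0 are assumptions made purely for illustration and would have to be set by the archiving institution.

```python
from dataclasses import dataclass
from typing import Mapping

# The six criteria named in the text.
CRITERIA = (
    "status",
    "technical_quality",
    "institutional_requirements",
    "extent_of_use",
    "quality_evaluation",
    "longevity",
)


@dataclass(frozen=True)
class Assessment:
    """Result of evaluating one criterion for one resource."""
    score: float                      # assumed scale: 0.0 (poor) to 1.0 (excellent)
    mandates_archiving: bool = False  # True if this criterion alone requires archiving


def should_archive(assessments: Mapping[str, Assessment],
                   threshold: float = 3.0) -> bool:
    """Decide on archiving from independently evaluated criteria.

    Any single criterion that mandates archiving overrides the rest;
    otherwise the scores accumulate and are compared to the threshold.
    """
    unknown = set(assessments) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    if any(a.mandates_archiving for a in assessments.values()):
        return True
    return sum(a.score for a in assessments.values()) >= threshold


if __name__ == "__main__":
    # A draft-quality resource that is nevertheless a widely used reference.
    reference = {name: Assessment(0.1) for name in CRITERIA}
    reference["extent_of_use"] = Assessment(0.9, mandates_archiving=True)
    print(should_archive(reference))
```

Keeping the override separate from the accumulated sum mirrors the point made above that limitations in some criteria should not act as knock-out criteria when another criterion clearly calls for archiving.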
Additional Pitfalls
Technical sustainability is one aspect of sustainability. Other major aspects are
organizational sustainability and legal issues – two issues not to be
underestimated. While the technical sustainability is an engineering task which
seems to be solved in most cases with semi-automatic migration procedures for
digital devices, this is not true for organizational and legal aspects.
Eide et al. 2008 claim that organizational sustainability may even be more
important than technical sustainability, because valuable resources can easily be
lost when an organization is shut down. They list several examples from cultural
heritage management, where shutting down museums almost led to the loss of
resources, e.g., the Newham case where data was only saved because the staff acted
quickly and dumped it to floppy disks. Sometimes, the resources also exist on paper
and could be digitized again, but as there is a movement away from paper, this option
will cease to exist soon.
Organizational sustainability is a rather fragile process because it correlates
with funding and institutional commitment, which are rather soft and fragile factors.
Due to the structure of funding organizations it is hardly
possible to receive a statement of commitment for a very long period of time. For
example, the duration of German collaborative research centres is limited to 12
years. Other long time programs exist, but it is virtually impossible to find a
commitment for more than 20 years. Therefore, ventures in sustainability also need
to consider the organizational aspect with a proper strategy on how to guarantee taking
care of resources in the years to come – either by securing continuity of
the organization itself or by preparing and implementing a proper migration plan for
resources to a different organization. Preparing for both cases would be even
better.
Legal aspects are another issue. Especially in the field of linguistics,
intellectual property rights create their own set of problems which have to be dealt
with when thinking about sustainability (see Lehmberg et al. 2008 and
Zimmermann et al. 2007). These issues are investigated in the context
of international projects such as CLARIN and META-NET; the current direction is
to work out licensing models (see Lindén et al. 2010 and Weitzmann et al. 2010). These intellectual property rights issues are
especially tricky as linguistic resources often cross political and cultural
borders, hence not only legal issues but also ethical implications are involved.
Conclusions
Sustainability of language resources is an aspect wanted and needed by data
providers, users and funders alike. To be able to speak of sustainable resources it
is necessary to make resources available according to defined processes, platforms or
archives in a reproducible and reliable way. To this end, XML is an essential part
of a complex approach which, additionally, also encompasses other standards on multiple
levels. These are requirements, but tools and systems, accessible in a reliable manner
and operating based on standards, are important as well.
With SPLICR there is a proof-of-concept implementation of large parts of the
functionality required for sustainability platforms. A platform alone is a node in
a sustainable web of trusted resource repositories, each repository providing
organizational support, technical infrastructure with archiving technology, and being
entrusted to use specified procedures to respect privacy and rights of data providers
while providing non-discriminatory access to the resources according to stated
procedures and rights holders' restrictions. Part of this network is also the cooperation
of various national and international initiatives. In cases of sustainability a certain
amount of overlap between these projects is desirable to further foster interoperation
and reliability of tools, data centres and increase redundant archives, avoiding major
problems in disaster scenarios.
All in all it can be said that with a number of international projects such as CLARIN
and META-NET along with its META-SHARE open resource exchange facility, together with
the initial implementations of various tools, the development of standards in the
Technical Committee 37, Subcommittee 4 “Language Resources” (see TC 37 SC 4) and establishment of
de-facto procedures, the sustainability of language resources is no longer something
that needs to be argued for. Instead, the situation has changed dramatically, as the
very real problem of providing sustainable data sets is, by now, firmly anchored in
academic as well as commercially oriented research centres. With raised awareness
in the community, the continuation of language resource distribution projects and
institutional support by academic libraries and institutions, chances are more than
promising for providing sustainable resources, using XML technology and state of the
