|
Balisage 2010 Program
Tuesday, August 3, 2010
|
Tuesday 9:15 am - 9:45 am
The high cost of risk aversion
Tommie Usdin, Mulberry Technologies
Avoiding risk is not always the way to minimize risk.
|
Tuesday 9:45 am - 10:30 am
Multi-channel eBook production as a function of diverse target
device capabilities
Eric Freese, Aptara
The challenge: develop an eBook that can demonstrate a number
of enhanced eBook capabilities (intelligent table of contents,
bidirectional linking, external links to study files and geospatial
data, hidden text, media support, epubcheck validation, etc.) that
will work on many “standard” eBook devices (such as
the Kindle, nook, Sony Reader, iPad, and eDGe platforms, and even
on smart phones).
The text: the World English Bible.
The talk: show-and-tell session and the sharing of lessons
learned.
|
Tuesday 11:00 am - 11:45 am
gXML, a new approach to cultivating XML trees in Java
Amelia A. Lewis
& Eric E. Johnson, TIBCO Software Inc.
Different XML tree-processing tasks may require tree models with
different design tradeoffs, producing problems of multiplicity,
interoperability, variability, and weight. It is no longer necessary
to use a different API for each different tree model. A
single unified Java-based API, gXML, can provide a
programming platform for all tree models for which a “bridge” has
been developed. gXML exploits the Handle/Body design pattern and
supports the XQuery Data Model (XDM).
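The Handle/Body split that gXML exploits is the classic Bridge pattern: a
stable handle interface delegating to interchangeable per-model bodies. A
minimal Java sketch of the pattern follows; the names (TreeBody, TreeHandle)
are invented for illustration and are not gXML's actual API.

    // Illustrative Handle/Body (Bridge) sketch; not the actual gXML API.
    // The handle exposes one stable interface; each body adapts a
    // concrete tree model (DOM, JDOM, AXIOM, ...) behind it.
    interface TreeBody {                     // implementor side
        String name(Object node);            // local name of a node
        Iterable<Object> children(Object node);
    }

    final class TreeHandle {                 // abstraction side
        private final TreeBody body;
        private final Object node;

        TreeHandle(TreeBody body, Object node) {
            this.body = body;
            this.node = node;
        }

        String name() { return body.name(node); }

        java.util.List<TreeHandle> children() {
            java.util.List<TreeHandle> kids = new java.util.ArrayList<>();
            for (Object child : body.children(node)) {
                kids.add(new TreeHandle(body, child)); // same body, new node
            }
            return kids;
        }
    }

Writing one such bridge per tree model is what lets a single API span all
of them.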
|
Tuesday 11:00 am - 11:45 am
Grammar-driven markup generation
Mario Blažević,
Stilo International
For use in document conversions, we have written a normalizer that
generates a grammatical element structure for incompletely tagged
document instances, guided by a RELAX NG schema. From a well-formed
but invalid instance that contains only tags that occur in the target
schema, the normalizer generates a document instance valid against the
grammar; the weakly structured input is first translated into elements
from the schema, then the instance is manipulated into validity. We
introduce a set of processing instructions that allow a user to
control how the normalizer resolves ambiguity in the instance.
|
Tuesday 11:45 am - 12:30 pm
Java integration of XQuery — an
information unit oriented approach
Hans-Jürgen Rennau
Need to process XML data in Java? Keen to let Java delegate to XQuery what
XQuery can do much better than Java, and to discover a novel pattern of
cooperation between XQuery and Java developers? A new API, XQJPLUS, makes it
possible to let XQuery build “information units” collected into
“information trays”. Tray design
is driven by Java-side requirements; the implementation is a pure XQuery task,
and using the trays requires no knowledge of XQuery.
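For readers new to Java/XQuery integration, here is a minimal sketch using
the standard XQJ API (JSR 225), on which an approach like XQJPLUS can build;
the Saxon data source named below is just one available XQJ implementation,
and XQJPLUS's own interfaces are not shown.

    import javax.xml.xquery.XQConnection;
    import javax.xml.xquery.XQDataSource;
    import javax.xml.xquery.XQPreparedExpression;
    import javax.xml.xquery.XQResultSequence;

    public class XQueryFromJava {
        public static void main(String[] args) throws Exception {
            XQDataSource ds = new net.sf.saxon.xqj.SaxonXQDataSource();
            XQConnection conn = ds.getConnection();
            // The query builds the "information units"; Java only consumes them.
            XQPreparedExpression expr = conn.prepareExpression(
                "for $i in 1 to 3 return <unit n='{$i}'/>");
            XQResultSequence result = expr.executeQuery();
            while (result.next()) {
                System.out.println(result.getItemAsString(null));
            }
            conn.close();
        }
    }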
|
Tuesday 11:45 am - 12:30 pm
Reverse modeling for domain-driven engineering of publishing
technology
Anne Brüggemann-Klein,
Tamer Demirel,
Dennis Pagano, &
Andreas Tai,
Technische Universität München
Our ultimate goal is to develop a meta-meta-modeling facility whose
instances are custom meta-models for conceptual document and data
models. Such models could drive development by being systematically
transformed into lower-level models and software artifacts
(model-driven architecture). In a step toward that goal, we present
“reverse modeling” that constructs a conceptual model by
working backwards from a pre-existing model such as an XML Schema or a
UML model. Starting with a schema, we abstract a custom,
domain-specific meta-model, which explicitly captures salient points
of the model bearing on system and interface design, and then we
re-formulate the original model as an instance of the new meta-model.
|
Tuesday 2:00 pm - 2:45 pm
Managing semantics in XML vocabularies: an experience
in the legal and legislative domain
Gioele Barabucci,
Luca Cervone,
Angelo Di Iorio,
Monica Palmirani,
Silvio Peroni, &
Fabio Vitali,
University of Bologna
Akoma Ntoso is an XML vocabulary for legal and legislative documents
sponsored by the United Nations for use in African and other countries.
Documents include concrete semantic information describing and identifying
the resource itself as well as the legal knowledge contained in it. This
paper shows how the Akoma Ntoso standard expresses these documents' multiple
independent conceptual layers and provides ontological structures on top of them. We
also describe features intended to allow future access to the
legal information represented in documents without relying on the future
availability of today's technology.
|
Tuesday 2:45 pm - 3:30 pm
XML pipeline processing in the browser
Vojtěch Toman
EMC Corporation
Powerful XML processing pipelines can be specified using the W3C
XProc language, but the currently available XProc implementations,
such as EMC's Java-based Calumet, are expected to run on servers,
not on client-side browsers. A client-side implementation could be
provided as a browser plug-in, but a JavaScript-based implementation
would offer comprehensive client-side portability for XML pipelines
specified in XProc. A JavaScript port of Calumet is in the works, and
the early results are encouraging.
|
Tuesday 4:00 pm - 4:45 pm
Extension of the type/token distinction to document structure
Claus Huitfeldt,
University of Bergen
Yves Marcoux,
Université de Montréal, &
C. M. Sperberg-McQueen,
Black Mesa Technologies
C. S. Peirce's type/token distinction can be extended beyond
words and atomic characters to higher-level document structures.
Also, mechanisms for handling tokens whose types are (perhaps
intentionally) ambiguous can be added. Thus fortified, the
distinction offers an intellectual tool useful for closer
examination of the relationships between XML element types and
their instances, and, more broadly, across the whole hierarchy of
character, word, element, and document types.
|
Tuesday 4:00 pm - 4:45 pm
A virtualization-based retrieval and update API for XML-encoded corpora
Cyril Briquet,
McMaster University & ATILF (CNRS & Nancy-Université);
Pascale Renders,
University of Liège & ATILF (CNRS & Nancy-Université);
Etienne Petitjean,
ATILF (CNRS & Nancy-Université)
Processing a large textual corpus with many XML tags is fraught with
difficulty for processes such as search and editing. Tags interleaved
with text may cause textual operations to return invalid results, such
as false positives or false negatives. Virtualization of XML documents
makes it possible to guarantee correct results by hiding selected tags,
text, or combinations thereof without invalidating the overall corpus.
A Java API that supports
virtualization has enabled automatic processing (retrieval and update)
of large and complex documents that contain multipurpose semantic
tags.
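As an illustration of the underlying idea (not the authors' actual API), a
small Java sketch can hide tags behind a virtual text view while keeping an
offset map back into the original, so that a search crossing a tag boundary
still succeeds:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper illustrating virtualization; invented for this
    // program, not the authors' API.
    public class VirtualView {
        private final StringBuilder text = new StringBuilder();
        private final List<Integer> originalOffsets = new ArrayList<>();

        /** Strip tags from a well-formed fragment, remembering offsets. */
        public VirtualView(String xml) {
            boolean inTag = false;
            for (int i = 0; i < xml.length(); i++) {
                char c = xml.charAt(i);
                if (c == '<') inTag = true;
                else if (c == '>') inTag = false;
                else if (!inTag) {
                    text.append(c);
                    originalOffsets.add(i); // virtual pos -> real pos
                }
            }
        }

        /** Search the virtual text; report the match in the original. */
        public int findInOriginal(String needle) {
            int at = text.indexOf(needle);
            return at < 0 ? -1 : originalOffsets.get(at);
        }
    }

Searching new VirtualView("<w>ab</w><w>cd</w>") for "bc" succeeds (the match
maps back into the original string), where a naive search over the raw XML
would report a false negative.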
|
Tuesday 4:45 pm - 5:30 pm
Discourse situations and markup interoperability
Karen Wickett,
University of Illinois Urbana-Champaign
Interoperability of markup across time and systems requires a
mapping from tags to the logical predicates associated with those
tags. The use of natural-language element names allows readers to
loosely interpret markup by exploiting the everyday resource
situations that support ordinary language-based communication, but (as
we demonstrate) the name of a tag alone does not convey everything
necessary to interpret the meaning of the markup. Misinterpretation
problems become obvious when the markup is used to derive erroneous
RDF statements. Semantic resolution requires sufficient access to
documentation. Without such support, interoperability across time and
systems is an unlikely prospect.
|
Tuesday 4:45 pm - 5:30 pm
(LB) XHTML Dialects: interchange over domain vocabularies
through upward expansion,
with examples of manifesting and validating microformats
Erik Hennum
The XML community exhibits a persistent tension between the value of sharing
(motivating standards) and the value of individuation
(motivating customization of those standards).
Some communities resolve this tension through particular emphasis
on customizations that produce subsets of base vocabularies.
Current practices for defining subset vocabularies, however, have limitations
that reduce the value of this approach.
This paper proposes enhancing the XML ecosystem with a general-purpose
mechanism for defining and managing subset extensions of a vocabulary.
The proposal makes use of Semantic Web strategies —
in particular, asserting new type relations for existing type definitions and
simplifying content models —
to identify commonality for variant vocabularies.
This approach has particular promise for extending XHTML
as illustrated with a few microformats.
|
Wednesday, August 4, 2010
|
Wednesday 9:00 am - 9:45 am
Where XForms meets the glass: bridging between data and
interaction design
Charlie Wiecha,
Rahul Akolkar, &
Andrew Spyker, IBM
XForms offers a model-view framework for XML applications. Some
developers take a data-centric approach, developing XForms applications
by first specifying abstract operations on data and then gradually
giving those operations a concrete user interface using XForms
widgets. Other developers start from the user interface and develop
the MVC model only as far as is needed to support the desired user
experience. Tools and design methods suitable for one group may be
unhelpful (at best) for the other. We explore a way to bridge this
divide by working within the conventions of existing Ajax frameworks
such as Dojo.
|
Wednesday 9:45 am - 10:30 am
I say XSLT, you say XQuery: let’s call the whole thing off
David J. Birnbaum,
University of Pittsburgh
XSLT and XQuery can both be used for extracting information from XML resources and transforming it for
presentation in a different form. The same task can be performed entirely with XSLT, entirely with
XQuery, or using a combination of the two, and there seems to be no general consensus or guidelines
concerning best practice for choosing among the available approaches. The author solved a specific
problem initially (and satisfactorily) with XSLT because XQuery was not a sufficiently mature
technology at the time the task first arose, but years later began to suspect that XQuery might be,
in some ineffable way, a better fit than XSLT for the data and the task. The exclusively-XSLT
and exclusively-XQuery approaches proved comparable in functionality, efficiency, ease of
development, and ease of maintenance, and they shared (of course) an XPath addressing component,
but they were nonetheless profoundly different in the way they interacted with the same source XML files.
The goal of this presentation is to consider why one or the other technology may be a better fit for a
particular combination of data and task, and to suggest guidelines for making decisions of that sort.
|
Wednesday 11:00 am - 11:45 am
Refining the taxonomy of XML schema languages: a new
approach for categorizing XML schema languages in terms of processing complexity
Maik Stührenberg, &
Christian Wurm, Bielefeld University
During the last decade, many researchers have worked in the fields of XML applications
(especially regarding schema languages) and formal languages. Among the results is the taxonomy
of XML schema languages described by Murata et al., comprising local
tree grammars (DTDs), single-type tree grammars (XSD schemas), and restrained-competition
grammars (RELAX NG schemas).
We refine and extend this hierarchy, using the concepts of determinism and of local and global
ambiguity. It turns out that there exist interesting grammar types which are not yet
captured formally, such as “unambiguous restrained-competition grammars” and “unique
subtree grammars”. In addition, we prove some interesting results regarding ambiguous
grammars and languages: if a tree language is inherently ambiguous (i.e., the ambiguity cannot
be eliminated), different interpretations of the same structure are isomorphic. This has
important consequences for the treatment of ambiguity in document grammars.
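For orientation, the competition-based definitions behind these grammar
classes, recalled here from the Murata et al. taxonomy rather than from the
paper itself, can be stated in a few lines:

    % Recalled from Murata et al.; not taken from this paper.
    % Non-terminals $A \neq B$ \emph{compete} when both expand the same
    % terminal label $a$:
    \[
      A \rightarrow a\,\langle r_A \rangle
      \qquad\text{and}\qquad
      B \rightarrow a\,\langle r_B \rangle .
    \]
    % Local tree grammars forbid competition entirely; single-type
    % grammars forbid competing non-terminals within one content model;
    % restrained-competition grammars forbid a content model from
    % accepting both $uAv$ and $uBw$ for competing $A$ and $B$;
    % regular tree grammars place no restriction on competition.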
|
Wednesday 11:45 am - 12:30 pm
Schema component paths for schema analysis
Mary Holstege, Mark Logic
An XPath-like syntax for XSD schema components allows sets of XSD
schema documents to be described and navigated in convenient and
familiar ways. Each component has a unique canonical path, which can
be used to identify the component; canonical paths are robust
against changes in the physical organization of the schema. A set of
canonical paths provides a sort of snapshot or signature of a schema,
which can provide a quick and simple summary of what has changed in a
new version of a familiar schema. Schema signatures may also be
helpful in the calculation of simple measures of schema
complexity.
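The signature-diff idea reduces to set arithmetic over canonical path
strings. A minimal Java sketch, with a hypothetical path syntax invented
for illustration:

    import java.util.Set;
    import java.util.TreeSet;

    public class SchemaDiff {
        public static void main(String[] args) {
            Set<String> v1 = new TreeSet<>(Set.of(
                "/schema/element::book",
                "/schema/element::book/type::anonymous"));
            Set<String> v2 = new TreeSet<>(Set.of(
                "/schema/element::book",
                "/schema/element::article"));

            Set<String> added = new TreeSet<>(v2);
            added.removeAll(v1);              // in v2 but not in v1
            Set<String> removed = new TreeSet<>(v1);
            removed.removeAll(v2);            // in v1 but not in v2

            System.out.println("added:   " + added);
            System.out.println("removed: " + removed);
        }
    }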
|
Wednesday 2:00 pm - 2:45 pm
A packaging system for EXPath
Florent Georges, H2O Consulting
EXPath provides a framework for collaborative community-based
development of extensions to XPath and XPath-based technologies (including
XSLT and XQuery), thus exploiting the built-in extensibility of those
technologies. But given multiple modules extending XPath, how can
a user conveniently manage installation and de-installation of
extension modules? How can developers make installation easy for users?
How can users and developers avoid being trapped in dependency
hell? These problems are familiar from other platforms, as are
potential solutions. We can adapt conventional ideas of
packaging to work well in the EXPath environment.
|
Wednesday 2:45 pm - 3:30 pm
A streaming XSLT processor
Michael Kay, Saxonica
XSLT transformations can refer to any information in the source
document from any point in the stylesheet, without constraint; XSLT
implementations typically support this freedom by building a tree
representation of the entire source document in memory and in
consequence can process only documents which fit in memory. But many
transformations can in principle be performed without storing the
entire source tree. The W3C XSL Working Group is developing a new
version of XSLT designed to make streamed implementations of XSLT
feasible. The author (editor of the XSLT 2.1 specification) has been
implementing streaming features in his Saxon XSLT processor; the paper
will describe how the implementation is organized and how far it has
progressed to date. The exposition is chronological to show how
the streaming features have developed gradually from small beginnings.
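The claim that many transformations need no source tree is easy to
illustrate outside Saxon: a SAX filter that renames elements, for example,
streams arbitrarily large input in constant memory. The Java sketch below
shows streamability in general, not Saxon's implementation.

    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stream.StreamResult;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;
    import org.xml.sax.helpers.XMLReaderFactory;

    // Renames para elements to p while the document streams past.
    public class StreamingRename extends XMLFilterImpl {
        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            super.startElement(uri, "para".equals(local) ? "p" : local,
                               "para".equals(qName) ? "p" : qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            super.endElement(uri, "para".equals(local) ? "p" : local,
                             "para".equals(qName) ? "p" : qName);
        }

        public static void main(String[] args) throws Exception {
            StreamingRename filter = new StreamingRename();
            filter.setParent(XMLReaderFactory.createXMLReader());
            // An identity transform over a SAXSource serializes the
            // filtered event stream; no tree is ever built.
            TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(filter, new InputSource(args[0])),
                new StreamResult(System.out));
        }
    }
|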
Thursday, August 5, 2010
|
Thursday 9:00 am - 9:45 am
Why TEI stand-off annotation doesn't quite work
and why you might want to use it nevertheless
Piotr Bański, University of Warsaw
Textual and linguistic analysis of corpora together awaken all the
sleeping dragons of markup overlap. The TEI, like many others with an
interest in markup, has taken up stand-off markup as one of its
weapons of choice. That choice has problems in both the technical and
sociological realms, however. Implementing extensions to XML tools to
support XInclude and XPointer would make life easier for OWLs
(ordinary working linguists).
|
Thursday 9:00 am - 9:45 am
DITA or Not?
Lynne A. Price, Text Structure Consulting
Use of DITA has become so pervasive that some users assume that anyone who inquires
about moving to an XML environment will use DITA. Often, the selection of DITA is
independent of DITA's strengths, such as ease of reuse, specialization, support of
distributed authoring, and availability of the Open Toolkit. While numerous DITA
case studies have been published, such reports tend to focus on what was accomplished
rather than how the approach was chosen, and typically reflect successful
implementations in large organizations. This study focuses on why end users,
consultants, and tool vendors have chosen to use or to avoid DITA. While this should not
be considered an unbiased or scientifically balanced survey, anecdotal evidence
such as this can be valuable to organizations faced with similar decisions.
|
Thursday 9:45 am - 10:30 am
Freestyle Markup Language
Denis Pondorf &
Andreas Witt
Institute for the German Language (IDS)
Freestyle Markup Language (FML) is a nascent generalized
descriptive markup language to describe polyhierarchical markup of
texts and data. FML is (we hope) the next generation in the evolution
of markup languages. By design, FML is described using a Type-2
grammar (production rules in EBNF) so that FML may be produced by a
context-free grammar and recognized by a nondeterministic pushdown
automaton. FML documents will be transformable into a semantically
unambiguous corresponding graph structure. By overcoming many of the
restrictions inherent in monohierarchical OHCO (ordered hierarchy of
content objects) structures, FML should overcome problems such as
congruence, interference, and content redundancy that result from
root- and hierarchy-bondage.
|
Thursday 9:45 am - 10:30 am
IPSA RE: A New Model of Data/Document
Management, Defined by Identity, Provenance, Structure, Aptitude,
Revision and Events
Walter E. Perry,
Fiduciary Automation
In private investment fund dealing, each transaction is a series of interactions between
parties transacting business at different granularities and often
with materially different understandings of the substance of the
transaction. Data records for private investment fund trading often
don't accurately reflect whose money has gone into, or come out of,
a given transaction or, conversely, in which particular transaction
an investor's stake in a fund was secured, and at what basis.
Investor skepticism in light of recent events, and government
insistence on regulation, necessitate transparency about whose
money is deployed in what exact amounts in which transactions for
which investment assets at what basis and through what chain of
provenance.
The design of Google BigTable and the API for Google App Engine
facilitate implementation of a "linksbase", which redefines a data
record as an instance aggregation of linkages or "extended arc" on
whose path may lie any number of instances identified by entity
types, each separately influencing the resultant arc. The instance
record is transactable across gross differences of granularity
separating transaction parties and widely different understandings
of the instance transaction.
|
Thursday 11:00 am - 11:45 am
Multi-structured documents and the emergence of annotation vocabularies
Pierre-Édouard Portier,
Sylvie Calabretto,
University of Lyon
Annotation vocabularies frequently need to grow and change as the
user's understanding of the documents being annotated grows. We have
developed methods to allow users to add new annotation terms while
keeping some control over the growth and change of the annotation
vocabularies; we use traces of user actions involving particular terms
to help document those terms for users. Our ideas are being tested in
a project involving the papers of Jean-Toussaint Desanti, the French
philosopher of mathematics.
|
Thursday 11:00 am - 11:45 am
Processing arbitrarily large XML using a persistent DOM
Martin Probst
Processing of large XML documents usually traps the user between the
memory constraints on DOM processing and the limitations on tree
traversal in streaming processes. Moving the DOM out of memory and
into persistent storage offers another processing option. Because
disk storage is much slower than memory access, an efficient binary
representation of the XML document has been developed, with a
supporting Java API. Results are promising for gigabyte-sized
documents that are not suitable for conventional DOM techniques.
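A sketch of the underlying technique, with an invented record layout that
is not the paper's binary format: fixed-width node records in a
memory-mapped file let DOM-style navigation run against disk rather than
the Java heap.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class NodeStore implements AutoCloseable {
        // record: kind(4) | nameIndex(4) | firstChild(4) | nextSibling(4)
        private static final int RECORD_SIZE = 16;
        private final RandomAccessFile file;
        private final MappedByteBuffer buf;

        public NodeStore(String path) throws Exception {
            file = new RandomAccessFile(path, "r");
            buf = file.getChannel()
                      .map(FileChannel.MapMode.READ_ONLY, 0, file.length());
        }

        public int kind(int node)        { return buf.getInt(node * RECORD_SIZE); }
        public int nameIndex(int node)   { return buf.getInt(node * RECORD_SIZE + 4); }
        public int firstChild(int node)  { return buf.getInt(node * RECORD_SIZE + 8); }
        public int nextSibling(int node) { return buf.getInt(node * RECORD_SIZE + 12); }

        @Override public void close() throws Exception { file.close(); }
    }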
|
Thursday 11:45 am - 12:30 pm
On Implementing string-range() for TEI
Hugh Cayless, NYU &
Adam Soroka, UVA
The long-standing argument over the theoretical validity of “embedded” XML
markup (particularly the TEI) flared again recently on the Humanist mailing list.
That discussion prompted a group of programmers (including the authors) to meet for
a session at THATCamp Prime in May to see whether anything practical could be done
to address the deficiencies of TEI-style embedded markup. The TEI guidelines contain
XPointer schemes which, if implemented, would allow the kinds of standoff markup and
annotation that the anti-embedded-markup camp want within the context of a widely used
standard. In the years since these (still unimplemented) pointer schemes were proposed,
there have been developments (one very recent) that might now make implementation
practical, so we decided to make one of these schemes, string-range(), actually work.
This paper will present our implementation and a discussion of how it might be used to
manage overlapping hierarchies of markup within a single TEI document.
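A minimal Java sketch of the character arithmetic behind string-range()
follows; it ignores the scheme's full semantics (a real implementation must
return the covered nodes and respect TEI's offset conventions), and the
code is ours, not the authors' implementation.

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public class StringRange {
        /** length characters starting at offset, ignoring element tags. */
        public static String stringRange(Document doc, String xpath,
                                         int offset, int length) throws Exception {
            Node context = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);
            // getTextContent() concatenates descendant text in document
            // order, which matches string-range()'s view of the text.
            String text = context.getTextContent();
            return text.substring(offset,
                                  Math.min(offset + length, text.length()));
        }

        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new java.io.ByteArrayInputStream(
                        "<p>over<hi>lapping</hi> markup</p>".getBytes("UTF-8")));
            System.out.println(stringRange(doc, "/p", 2, 6)); // "erlapp"
        }
    }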
|
Thursday 11:45 am - 12:30 pm
There are No Documents
Allen H. Renear &
Karen M. Wickett,
University of Illinois at Urbana-Champaign
Last year at Balisage (2009) we considered the claim that documents cannot
be modified. This consideration took the form of identifying and evaluating
possible responses to this inconsistent triad: 1) Documents are strings;
2) Strings cannot be modified; 3) Documents can be modified. Late this spring
we were surprised to realize that our survey of possible responses had overlooked
one: There are no documents. We turn to that neglected possible response now.
|
Thursday 2:00 pm - 3:30 pm
Panel Discussion. Greasing the Wheels:
Overcoming User Resistance to XML
Balisage may be filled with people who are not only comfortable working in XML
but who actually prefer it; many of us are more comfortable with XML than with
spreadsheets, word processors, or pens.
But many of the users we work with find XML confusing or intimidating and resist learning
about XML and using XML tools. This discussion will focus on how to overcome user resistance to
XML, including ways to hide the XML, or at least the full complexity of the XML, from end users.
|
Thursday 4:00 pm - 4:45 pm
XML essence testing
Abraham Becker &
Jeff Beck,
U.S. National Library of Medicine (NLM)
PubMed Central (PMC) is the U.S. National Institutes of Health free
digital archive, gathering together biomedical and life sciences
journal literature from diverse sources. When an article arrives at
PMC, it conforms to one of over 40 evolving DTDs. An ingestion
process applies appropriate “Modified Local” and “Production” XSLT
stylesheets to produce two instances of the common NLM Archiving and
Interchange DTD. In the “essence testing” phase, the essential nodes
of these instances, as specified by some 60 XPath expressions, are
compared. This method allows the reliable detection of unintentional
changes to an XSLT stylesheet with negative impacts on product
quality.
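The comparison step is easy to picture: evaluate the same XPath
expressions against the pre-change and post-change outputs and flag any
difference. A minimal Java sketch, with placeholder expressions standing
in for PMC's roughly sixty real ones:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class EssenceTest {
        public static void main(String[] args) throws Exception {
            String[] essentialPaths = {       // placeholders, not PMC's list
                "string(//article-title)",
                "string(//contrib-group)",
                "count(//xref)"
            };
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            Document before = dbf.newDocumentBuilder().parse(args[0]);
            Document after  = dbf.newDocumentBuilder().parse(args[1]);

            XPath xp = XPathFactory.newInstance().newXPath();
            for (String path : essentialPaths) {
                String a = xp.evaluate(path, before); // string value of result
                String b = xp.evaluate(path, after);
                if (!a.equals(b)) {
                    System.out.println("ESSENCE CHANGED: " + path);
                }
            }
        }
    }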
|
Thursday 4:45 pm - 5:30 pm
Automatic upconversion using XSLT 2.0 and XProc
Stefanie Haupt &
Maik Stührenberg,
University of Bielefeld
Upconversion of presentation-oriented HTML documents to a
data-centric XML form is a non-trivial but automatable process. Our
data is a corpus of video game reviews represented as (sometimes
invalid) HTML 4.01. Hidden in these reviews are useful pieces of
metadata such as genre, number of players, age ratings, difficulty,
and the pros and cons of the game. With a schema cleanly defining and
extending useful datatypes and an XSLT 2.0 stylesheet making heavy use
of regular expressions and string processing, we recursively process
the HTML documents using an XProc pipeline. Thus we transform tag soup into
fully structured (and valid!) XML instances that allow
semantically rich XQueries over the data.
|
Friday, August 6, 2010
|
Friday 9:00 am - 9:45 am
Stand-alone encoding of document history
Jean-Yves Vion-Dury,
Xerox Research Centre Europe
Tracking the change history of a document has frequently depended
either on external systems that atomize the document into databases or
on running differences over separately stored intermediate versions.
Why not encapsulate the entire history process in a single XML
document? Using appropriate namespaces, both an instance and its
history can be combined in a single construct. Unification of document
and history allows the use of XPath expressions to express delta
structures and the systematic distinction between modification
descriptors and modification operations, with gains in both
compactness and efficiency of storage.
|
Friday 9:45 am - 10:30 am
Scripting documents with XQuery: Virtual documents in TNTBase
Vyacheslav Zholudev &
Michael Kohlhase,
Jacobs University, Bremen
If x is to XQuery as views are to the
query language SQL, what is x? We present a
virtual-document facility integrated into TNTBase, an XML database
with support for versioning. Our virtual documents consist of document
skeletons with static text and parameterizable embedded XQuery
queries; they can be edited, and changes to elements drawn from the underlying
XML repository are propagated back to it automatically.
The ability to integrate computational tasks
into documents makes virtual documents an enabling technology with
far-reaching possibilities.
|
Friday 11:00 am - 11:45 am
XQuery design patterns
William Candillon,
Matthias Brantner,
Dennis Knochenwefel,
28msec Inc.
The idea that design patterns are identifiable, reusable, and
teachable is itself a (meta) design pattern whose applications and
benefits extend far beyond the field of object-oriented programming.
XQuery is an XML technology that, like OO technology, both suggests
design patterns and is being influenced by them. A working
AtomPub-based cloud application illustrates some XQuery design
patterns: “Chain of Responsibility”, “Pattern Matching”,
“Strategy”,
and “Observer”.
|
Friday 11:45 am - 12:30 pm
Platform independence 2010 -
Helping documents fly well in emerging architectures
Ann Wrightson,
NHS Wales Informatics Service
XML data structures, and those who design them, must adapt to the
reality that multiprocessing technologies, including multicore
processors, multichannel memory, multilevel caches, clouds, etc. are
now ubiquitous. What does this mean for the practices and design
patterns of the XML industry and for our instincts about design
tradeoffs, such as cascading defaults, our willingness to incur the
cost of data replication, or the addition of an optional detail to a
model?
|
Friday 12:30 pm - 1:15 pm
(FP) Stone soup
C. M. Sperberg-McQueen, Black Mesa Technologies
Reflections on making the best of unpromising situations.
|