Balisage 2015 Program

Monday, August 10, 2015
see: Cultural Heritage Markup

Tuesday, August 11, 2015

Tuesday 7:30am - 9:00am

Conference Registration & Continental Breakfast

Pick up your conference badge outside the White Flint Amphitheater and join us for a light breakfast.

Tuesday 9:00am - 9:15am

Welcome and Introductions

Tuesday 9:15am - 9:45am

The art of the elevator pitch

B. Tommie Usdin, Mulberry Technologies

Many of us at Balisage feel that the universe (or our organization, sponsor, client, or mother-in-law) doesn’t sufficiently appreciate or respect technologies we know could significantly improve the world. XSLT, techniques for processing overlap, DITA, XQuery, HTML5, even XML, are not given the attention they deserve. People aren't listening! This is our fault, at least in part. We as a community need to learn to say less and communicate more, and more persuasively.

Tuesday 9:45am - 10:30am

Markup as index interface: Thinking like a search engine

Mary Holstege, MarkLogic [Paper] [EPUB] [Slides and materials]

To a search engine, indexes are specified by the content: the words, phrases, and characters that are actually present tell the search engine what inverted indexes to create. External knowledge can add to this inventory of indexes. For example, knowledge of the document language can lead to indexes for word stems or decompounding. These can unify different content into the same index or split the same content into multiple indexes. That is, different words manifested in the content can be unified under a single search key, and the same word can have multiple manifestations under different search keys. Turning this around, the indexes represent the retrievable information content in the document. Full-text search is not an either/or, yes/no system, but one of relative fit (scoring). Precision balances against recall, mediated by scoring. What does this search engine perspective on markup mean, concretely? Can we use it to reframe some persistent conundrums, such as vocabulary resolution and overlap? Let's see.
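
A minimal sketch (not from the paper) of how stemming can unify different surface forms under one search key in an inverted index; the crude suffix-stripping stemmer below is purely illustrative:

    from collections import defaultdict

    def crude_stem(word):
        # Illustrative suffix stripping only; a real engine would use a
        # language-aware stemmer or decompounder.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def build_indexes(docs):
        # docs: mapping of document id -> text
        word_index = defaultdict(set)   # exact surface forms
        stem_index = defaultdict(set)   # stemmed forms unify variant words
        for doc_id, text in docs.items():
            for token in text.lower().split():
                word_index[token].add(doc_id)
                stem_index[crude_stem(token)].add(doc_id)
        return word_index, stem_index

    docs = {1: "indexing markup documents", 2: "the search engine indexes markup"}
    words, stems = build_indexes(docs)
    print(words["indexing"])   # {1}: the exact form occurs in only one document
    print(stems["index"])      # {1, 2}: stemming unifies 'indexing' and 'indexes'

The same content thus appears under several search keys (exact form and stem), and several forms collapse onto one key, which is the precision/recall trade-off that scoring then mediates.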

Tuesday 11:00am - 11:45am

Markup and meter: Using XML tools to teach a computer to think about versification

David J. Birnbaum & Elise Thorsen, University of Pittsburgh [Paper] [EPUB] [Slides and materials]

We are developing an XML-based system to identify and visualize the structures of natural-language poetry. Poetic texts are the poster children of overlapping hierarchies, since the organization of poems into stanzas, lines, and feet is largely independent of the sentences and words of the text. Foot boundaries and word boundaries are mutually independent, yet the implementation of caesura depends on their synchronization. Furthermore, the formal organization of poetry is not only overlapping, but also massively discontinuous in terms of how underlying formal structures like meter or rhyme are realized in natural orthography. In many poetic traditions, stress and pronunciation (including rhyme) are only implicit in written texts, and our first markup challenge is to identify such structures automatically and add markup to make them explicit. When stress and pronunciation are made explicit within the XML model, the most natural representations will involve mixed content, which poses special challenges for subsequent XML processing.

Tuesday 11:45am - 12:30pm

XML (almost) all the way: Experiences with a small-scale journal publishing system

Peter Flynn, University College Cork [Paper] [EPUB] [Slides and materials]

In 2006 my university academic IT support group was approached by an academic colleague wanting to start a new journal to be available in electronic form only. The technical capabilities of the author pool, the requirements of the discipline, and — unsurprisingly — the lack of financial resources, imposed restrictions on the design of the publishing system. We implemented a system using only open source software, building it largely from scratch as the existing open source journal publishing systems at the time, although comprehensive and well-established, seemed far too large and complex for the task. I describe the process, explain the background to the design decisions made, and attempt to draw some conclusions about the technical viability of creating a small-scale publishing system which attempted to retain XML throughout the workflow, and about the human factors which influenced the decisions.

Tuesday 2:00pm - 2:45pm

The state of MathML in K-12 educational publishing

Autumn Cuellar, Design Science
Jean Kaplansky, Safari Books Online [Paper] [EPUB] [Slides and materials]

K-12 publishers are moving towards modern, single-source XML production workflows for predictable reasons: delivery in print and electronic form, repurposing into supplemental materials, digital-only learning management systems, assignments, assessments, etc. Publishing mathematics in the K-12 environment poses special challenges. Although MathML is used to mark up scholarly and higher-education content, K-12 textbooks demand mathematical constructs not found in higher-level texts. Further, K-12 textbooks are subject to regulatory challenges and, of course, MathML support is inconsistent across browsers and devices.

Tuesday 2:45pm - 3:30pm

Diagramming XML: Exploring concepts, constraints and affordances

Liam R. E. Quin, W3C [Paper] [EPUB]

Visualization, or the graphical representation of data, is used to assist people in understanding complex structures, relationships, and trends. There have been several approaches to diagramming XML vocabularies (DTD and schemas) and to diagramming relationships in and among XML documents. People working with XML documents may have large amounts of data, in which case visualization of the relationships betwixt and between the documents may help them spot emergent patterns and understand the document collections in ways that are otherwise difficult to comprehend. Similarly, diagrams of document structure (as specified in a schema) may be effective ways to explore the constraints on the described document set.

Tuesday 4:00pm - 4:45pm

Implementing a system at US Patent and Trademark Office to fully automate the conversion of filing documents to XML

Terrel Morris, US Patent and Trademark Office
Mark Gross, Data Conversion Laboratory
Amit Khare, CGI Federal [Paper] [EPUB] [Slides and materials]

The US Patent and Trademark Office (USPTO) gathers massive collections of content, including legal documents, filings, and contracts. These become much more valuable if searchable and mineable. But at this scale even the most efficient conventional conversion techniques become economically infeasible. At USPTO, a fully automated conversion process with no human intervention has been operational and fully functional for over two years; it processes over one million pages per month, with turnaround often measured in minutes. Scanned pages are preprocessed to separate text from images, improve the quality of OCR, and clean up extraneous material. Extracted text is converted into XML, metadata is tagged automatically, and documents are re-assembled automatically from the XML and images. Automated quality analysis flags documents which may not have been processed correctly. We meet scalability challenges with a parallel processing architecture and continuous load balancing among documents and processes.
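
The abstract describes a pipeline of stages (preprocessing, OCR, XML conversion, tagging, reassembly, quality checks) run in parallel with load balancing. A rough, hypothetical sketch of that pattern in Python, with trivial placeholder stages standing in for the real USPTO components:

    from concurrent.futures import ProcessPoolExecutor, as_completed

    # Placeholder stages; the real preprocessing, OCR, conversion, and QA
    # components are of course far more involved.
    def preprocess(doc):
        return {"id": doc["id"], "pages": doc["pages"]}

    def run_ocr(doc):
        return {**doc, "text": "ocr text for %s" % doc["id"]}

    def to_xml(doc):
        return {**doc, "xml": "<doc id='%s'>%s</doc>" % (doc["id"], doc["text"])}

    def quality_check(doc):
        doc["flagged"] = "<doc" not in doc["xml"]   # trivial stand-in check
        return doc

    def convert(doc):
        # Each document flows through every stage; documents run in parallel.
        return quality_check(to_xml(run_ocr(preprocess(doc))))

    if __name__ == "__main__":
        batch = [{"id": n, "pages": 10} for n in range(100)]
        with ProcessPoolExecutor() as pool:   # the pool spreads work across CPUs
            futures = [pool.submit(convert, doc) for doc in batch]
            for future in as_completed(futures):
                result = future.result()
                if result["flagged"]:
                    print("needs review:", result["id"])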

Tuesday 4:45pm - 5:30pm

(LB) MML: Multi Markup Language - When one markup language just isn't enough

David A. Lee, MarkLogic

The fundamental arguments in the "XML" vs. "JSON" debate (or war) are irrelevant. True, XML can be extremely complicated and bloated for what should be a simple task. True, JSON lacks native support for mixed content and complex types and is difficult to hand edit. In complex environments it is necessary to exploit the advantages of each format and to mitigate the weaknesses of each. Using as an example a multi-year, ongoing project of authoring and managing the lifecycle of a particular document type, I will demonstrate how very minor differences in markup style (ironically, differences intentionally designed into JSON as a counterpoint to XML 'complexity') have a huge impact on the ability of software to assist in the editing process, and equally on the ability of humans to accurately author and modify even small documents. These little things, added together, can make even well-formedness validation impossible or useless, or worse: they can yield a valid document that is structurally different from what it appears to be.

I propose to solve this problem with "MML"; a hybrid approach for multi markup documents. "MML" allows multiple 'Native Markup' variants of the same document to co-exist. Simple transformations can produce variants of the document suitable for different tooling including JSON and XML formats, each of which is syntactically valid for the specific markup language. This is critical for early error detection and integration with existing tools.

Tuesday 8:00pm - 10:00pm

Balisage Hospitality

Stop in to the Balisage Coffee and Conversation room (Brookside A).

Wednesday, August 12, 2015

Wednesday 7:30am - 9:00am

Conference Registration & Continental Breakfast

Pick up your conference badge outside the White Flint Amphitheater and join us for a light breakfast.

Wednesday 9:00am - 9:45am

Spreadsheets - 90+ million end user programmers with no comment tracking or version control

Patrick Durusau
Sam Hunting [Paper] [EPUB]

Modern business runs on spreadsheets; modern business fails on errors in spreadsheets. Yet amazingly, typical spreadsheets have no good tools for auditing, revision tracking, or version control. Without the ability to alter the actual code of spreadsheet software, what protection can we build? Topic Maps are no insurance against stupidity, bad design, or faulty logic, but they do offer something that might be of use to spreadsheet sufferers: the ability to point into data and make connections. Using Topic-Map-spectacles to look into data associated with a major corporate failure shows some ways in which some problems could have been avoided.

Wednesday 9:45am - 10:30am

State chart XML as a modeling technique in web engineering

Anne Brüggemann-Klein, Marouane Sayih, & Zlatina Keskivov, Technische Universität München [Paper] [EPUB] [Slides and materials]

Domain-driven design is a methodology that attempts to improve software development by focusing on domain logic. It encourages collaboration between technical and domain experts by making an explicit technology-independent model that is iteratively refined. Many modeling languages exist. This paper examines the use of State Chart XML (SCXML) as a language for modeling a complex, reactive system. The use case presented in this paper is GameX, an interactive browser-based game where players operate on a map of towns and fields. We demonstrate that SCXML can be used to describe complex models during the software development process. Not only can these models be used to drive new development, but modeling already deployed systems in this way can provide a deeper understanding of their behavior.
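
For flavor, here is a hypothetical, much-simplified state chart for a GameX-like game, written in Python as a transition table and interpreter (the paper itself expresses such models declaratively in SCXML, not in Python):

    # (current state, event) -> next state
    TRANSITIONS = {
        ("lobby", "start"):           "playing",
        ("playing", "end_turn"):      "waiting",
        ("waiting", "opponent_done"): "playing",
        ("playing", "victory"):       "game_over",
        ("waiting", "victory"):       "game_over",
    }

    def run(events, state="lobby"):
        # Interpret a stream of events against the chart; an event with no
        # matching transition is simply dropped, as in SCXML.
        for event in events:
            state = TRANSITIONS.get((state, event), state)
        return state

    print(run(["start", "end_turn", "opponent_done", "victory"]))  # game_over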

Wednesday 11:00am - 11:45am

Accounting for Context in Markup: Which Situation, Whose Semantics?

Karen M. Wickett, University of Texas at Austin [Paper] [EPUB] [Slides and materials]

Situation semantics - as developed by Barwise and Perry - is a general theory of meaning for natural language, and can be used to understand the role of context in markup semantics. While the notion of a discourse situation provides many of the right hooks for accounting for contextual assignment of meaning to markup structures, there are still many open questions. One critical issue is that situation semantics itself is open enough to allow many different approaches to identifying the relevant discourse situation. Three core types of discourse situations for descriptive markup - documentary, transport, and discovery - lead to distinct features and strategies in the use of markup. But are discourse situations expressive enough for the work we want them to do? Beyond developing a fuller picture of the discourse situations that shape the meaning of markup, this exercise will help us evaluate the sufficiency of situation semantics for modeling markup semantics in general.

Wednesday 11:45am - 12:30pm

XML solutions for Swedish farmers: A case study

Ari Nordström, Creative Words [Paper] [EPUB] [Slides and materials]

The Federation of Swedish Farmers (LRF) provides its 170,000 members with a web-based service to check compliance with state and EU farming regulations. These checklists are also produced nightly both as generic checklists with more than 130 pages and as individualised checklists for registered members. The system consists of an eXist database coupled with oXygen Author. The checklists and their related contents are edited, stored, and processed, published as PDFs, and exported to the SQL database which stores member registration, feeds the website, and does various other tasks. The system uses XQuery, XSLT, XInclude modularization, an extended XLink linkbase, and other markup technologies. It currently handles more than 40,000 PDF documents a year and many more than that in the web-based forms.

Wednesday 2:00pm - 2:45pm

(LB) XSLT Pipelines in XSLT

Tomos Hillman, Oxford University Press

Pipelines consisting of XSLT transforms are in common use, particularly in the publishing industry. The transformations as well as the pipelines themselves are re-usable, and may need to be run in several environments. This paper proposes two possible pipeline implementations where only the existence of an XSLT processor is guaranteed at run time, and discusses their advantages and limitations.

Wednesday 2:45pm - 3:30pm

To be announced

An encore presentation from one of the attendees at Balisage. Details to be announced on site.

Wednesday 4:00pm - 4:45pm

XSDGuide — Automated generation of web interfaces from XML schemas: A case study for suspicious activity reporting

Fabrizio Gotti, Université de Montréal
Kevin Heffner, Pegasus Research & Technologies
Guy Lapalme, Université de Montréal [Paper] [EPUB] [Slides and materials]

Suspicious Activity Reports (SARs) conforming to NIEM-SAR fit into an architecture that is both complex and intricate. It is challenging to build a system which guides authors to produce only valid SARs. XSDGuide uses the underlying schemas to build such a system in the browser. Full schema validation in the back end is complemented by dynamic front-end JavaScript validation. Despite some limitations, the system supports the creation of suspicious activity reports by public safety and security experts as well as by members of the general public.

Wednesday 4:45pm - 5:30pm

Tricolor automata

C. M. Sperberg-McQueen, Black Mesa Technologies; Technische Universität Darmstadt [Paper] [EPUB] [Slides and materials]

In working with different versions of a vocabulary, we often need to understand how different definitions of the same element type relate to each other. Will every element valid against one version of the document grammar also be valid against the other? And vice versa? Tricolor automata are a technique for visualizing, in a single diagram, the relations among several languages: two content models A and B, their intersection, their differences, and their union. Given the tricolor automaton for two content models, it is obvious at a glance whether one defines a sublanguage of the other. Their construction involves a straightforward application of the concepts of Glushkov automata and Brzozowski derivatives, and thus illustrates the applicability of theory to practice.
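
A rough sketch (not the paper's code) of the Brzozowski-derivative half of that construction in Python: the derivative of a content model with respect to an element name is the content model of whatever may legally follow, and repeated derivation answers the validity questions the diagrams visualize.

    # Content models as nested tuples: ("seq", a, b), ("choice", a, b),
    # ("star", a), an element-name string, or EMPTY (the empty sequence).
    EMPTY, NULL = "empty", "null"   # NULL matches nothing at all

    def nullable(e):
        if e == EMPTY:
            return True
        if e == NULL or isinstance(e, str):
            return False
        op = e[0]
        if op == "star":
            return True
        if op == "seq":
            return nullable(e[1]) and nullable(e[2])
        return nullable(e[1]) or nullable(e[2])   # choice

    def deriv(e, name):
        # Brzozowski derivative of content model e with respect to 'name'.
        if e in (EMPTY, NULL):
            return NULL
        if isinstance(e, str):
            return EMPTY if e == name else NULL
        op = e[0]
        if op == "star":
            return ("seq", deriv(e[1], name), e)
        if op == "choice":
            return ("choice", deriv(e[1], name), deriv(e[2], name))
        first = ("seq", deriv(e[1], name), e[2])          # seq
        return ("choice", first, deriv(e[2], name)) if nullable(e[1]) else first

    def matches(e, names):
        for n in names:
            e = deriv(e, n)
        return nullable(e)

    model = ("seq", "title", ("star", "para"))        # title, para*
    print(matches(model, ["title", "para", "para"]))  # True
    print(matches(model, ["para"]))                   # False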

Wednesday 6:30pm - 8:00pm

Related Event
Data and documents using XML: Fits like a .gov
sponsored by the DC XML Users Group

Description and Reservations through Eventbrite. Balisage attendees are welcome and requested to sign up through the Eventbrite system. Attendance is free.

Wednesday 8:00pm - 10:00pm

Balisage Hospitality

Stop in to the Balisage Coffee and Conversation room (Brookside A).

Thursday, August 13, 2015

Thursday 7:30am - 9:00am

Conference Registration & Continental Breakfast

Pick up your conference badge outside the White Flint Amphitheater and join us for a light breakfast.

Thursday 9:00am - 9:45am

Two from three (in XSLT)

John Lumley, jωL Research / Saxonica [Paper] [EPUB]

XSLT 3.0 introduces a number of features that simplify stylesheet comprehension. Some (such as xsl:iterate) make the language more declarative, others (such as text value templates, let expressions, and maps) make the language more expressive, and still others (new functions and operators) make the language more concise. Programs upgraded to use these features can be clearer and cleaner than their 2.0 equivalents. These upgraded programs, however, may well need to be deployed in existing 2.0 systems. Some XSLT 3.0 language features can be translated into XSLT 2.0, at least under some circumstances. We discuss an implementation, in XSLT of course, of a system that does so.

Thursday 9:45am - 10:30am

XQuery as a data integration language

Hans-Jürgen Rennau, Traveltainment
Christian Grün, BaseX [Paper] [EPUB] [Slides and materials]

What is the potential for XQuery? What roles might it play in future projects? Pretty obviously, XQuery is not limited to querying; it has significant capabilities for processing (i.e., transforming, or constructing new) XML data. Less obvious are XQuery's capabilities of integration: accessing multiple resources adds no complexity to the code, and resource boundaries do not fragment information. With XQuery 3.1 and EXPath, we now gain access to all kinds of non-XML resources, such as JSON, SQL, plain text or binary data. Combining native and new capabilities, it may be time to rethink the scope of XQuery: is it a data integration language?
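
As a hypothetical illustration of that integration idea (in Python rather than XQuery; file names invented), consider combining an XML catalogue with a JSON price list in one place, the kind of thing XQuery 3.1 with the EXPath modules can do natively without leaving the query language:

    import json
    import xml.etree.ElementTree as ET

    def integrate(catalog_xml, prices_json):
        # Read product ids and names from XML...
        products = {p.get("id"): p.findtext("name")
                    for p in ET.parse(catalog_xml).getroot().findall("product")}
        # ...and prices from JSON, e.g. {"p1": 9.99, "p2": 14.5}
        with open(prices_json) as f:
            prices = json.load(f)
        # Resource boundaries disappear in the combined result.
        return [{"id": pid, "name": name, "price": prices.get(pid)}
                for pid, name in products.items()]

    print(integrate("catalog.xml", "prices.json"))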

Thursday 11:00am - 11:45am

Smart content for high-value communications

David White, Quark Software [Paper] [EPUB]

How does a publisher capture XML content from subject matter specialists who focus on their content rather than formal document structure? A new schema for Smart Content provides the structure for an editing tool that authors can use comfortably and that captures the content in a format that lends itself to later reprocessing. Smart Content, influenced by XHTML and DITA, uses a hierarchy of common structural archetypes from sections down through blocks to inline items. The focus of this approach is on capturing content quickly from non-technical authors without having to deal with issues like “you can't paste that here because the schema says you can't”. Judicious use of type attributes on generic blocks can aid in overcoming issues of ordering and occurrence, leading to greater comfort for authors.

Thursday 11:45am - 12:30pm

Vivliostyle: An open-source, web-browser based, CSS typesetting engine

Shinyu Murakami & Johannes Wilm, Vivliostyle [Paper] [EPUB] [Slides and materials]

Most current multimedia publishing workflows are labor intensive, with separate workflow steps needed for each output format. We believe that the future of book and eBook publishing should be a JavaScript solution with CSS and HTML united into a single publishing workflow, with the formatting handled entirely by CSS. For this to be viable, we need browser support for the CSS page-related specifications (such as the W3C CSS Paged Media, CSS Page Floats, and CSS Generated Content) and similar support in commercial typesetting engines (without vendor-proprietary extensions). New features in JavaScript (such as the ones under consideration by the W3C) will also be needed. We are working on a new browser-based typesetting engine to help test these specifications and to start the larger process of unifying web, ebook, and print publishing.

Thursday 1:15pm - 2:00pm

Balisage Bard: The markup verse and lyrics game

Lynne Price, Gamemaster
     Balisage will whisper
     Through the nodes of the tree
     Here am I your special conference
     Come to me, come to me

     Your own special tags
     Your own schema dreams
     Parse into node sets
     and merge in the streams

Now is your chance to exercise your creative skills in limericks, sonnets, haiku, ballads, blank verse, or other poetic forms. Parodies welcome. Earn extra points for using or rhyming specific words. Click for Detailed Rules.

Thursday 2:00pm - 2:45pm

(LB) A client-side JATS4R validator using Saxon-CE

Chris Maloney, NCBI/NLM/NIH
Alf Eaton, PeerJ
Jeff Beck, NCBI/NLM/NIH [Paper] [EPUB] [Slides and materials]

JATS4R (jats4r.org) is a group that provides guidelines for tagging scholarly articles in JATS XML to maximize machine-readability and the potential for content reuse. When the group formalizes a recommendation, we encode the rules in Schematron. For checking instance documents against the rules, we have implemented a validation tool (hosted at http://jats4r.org/validate/). When an instance document is validated, it generates a report in Schematron Validation Report Language XML (SVRL). To avoid the maintenance costs of hosting a server-side tool, the validation tool is written in JavaScript, using Saxon-CE as the client-side XSLT processor. This allows it to be hosted on a static site (currently GitHub Pages) and run entirely within the user's web browser. The XSLT files used for validation are generated from the Schematron rulesets offline, and an HTML report is generated from the SVRL validation results using a further XSLT transformation.

Thursday 2:45pm - 3:30pm

(LB) Publishing TEI documents with TEI Simple: a case study at the U.S. Department of State's Office of the Historian

Joseph Wicentowski, U.S. Department of State
Wolfgang Meier, eXist Solutions GmbH

TEI Simple addresses two challenges shared by many humanities-based text digitization projects that adopt the Text Encoding Initiative (TEI) Guidelines: (1) it defines a reduced tag set focused on the needs of encoding modern and early modern books to support interchange, and (2) it adds a processing model for TEI documents which can be expressed with the TEI vocabulary itself, and without requiring any knowledge about the specific target media. The result is a promising technology stack that empowers editors to control the output of their documents across media formats without reliance on stylesheet experts or programmers, and both simplifies and reduces the cost of application development. This presentation will demonstrate the Office of the Historian's recent adoption of TEI Simple for its publications using the TEI Simple Processing Model library for eXist.

Thursday 4:00pm - 4:45pm

Comparing and diffing XML schemas

Priscilla Walmsley, Datypic [Slides and materials]

Schemas evolve over time, and it is useful to be able to automatically compare versions of a schema in order to provide detailed, accurate documentation to implementers. Automatically "diffing" schemas is also an effective quality control technique, ensuring that inadvertent changes were not made, and that all changes made are backward compatible (if that is a goal).

When taking into account the variety of ways of expressing a content model, and the possibility that advanced schema features were used, it is necessary to go beyond simple text diffing or even XML diffing. By first "canonicalizing" schemas to make them easier to compare, and then cataloging the differences between them, we can answer questions like "Is this schema backward compatible?" and "Is this schema a subset or superset of another schema?"
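
As a crude illustration of the general idea (not the paper's method; file names hypothetical), one might start by diffing the global element declarations of two XSD files before moving on to canonicalized content models:

    import xml.etree.ElementTree as ET

    XSD = "{http://www.w3.org/2001/XMLSchema}"

    def global_elements(path):
        # Names and types of top-level element declarations in one schema.
        root = ET.parse(path).getroot()
        return {el.get("name"): el.get("type")
                for el in root.findall(XSD + "element")}

    def diff_schemas(old_path, new_path):
        old, new = global_elements(old_path), global_elements(new_path)
        added   = sorted(new.keys() - old.keys())
        removed = sorted(old.keys() - new.keys())
        changed = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
        return added, removed, changed

    added, removed, changed = diff_schemas("order-v1.xsd", "order-v2.xsd")
    print("added:", added)      # new globals alone do not break old documents
    print("removed:", removed)  # removed globals usually break backward compatibility
    print("changed:", changed)  # changed types call for content-model comparison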

Thursday 4:45pm - 5:30pm

Applying intertextual semantics to cyberjustice: Many reality checks for the price of one

Yves Marcoux, Université de Montréal [Paper] [EPUB]

Intertextual semantics (IS) is a way of documenting the meaning of an XML vocabulary. For each element, the modeler writes two text segments in natural-language: a text-before and a text-after, which can be inserted into the display of an XML document to make the meaning of that document (markup and textual contents) explicit to a human reader. As part of the Cyberjustice project, we applied IS in developing a tool for collaborative authoring of a particular type of legal document used in Québec: the “Agreement as to the conduct of the proceedings”. The rudimentary existing IS platform was extended to present domain experts (judges and lawyers) with the successive modeling hypotheses established by the modeler. The new IS system has richer possibilities for display and print and includes a mechanism for keeping track of questions, answers, decisions, and decision rationales in the development of the model. Some challenges faced in this project were quite general, including interactions between document validation and the semantic verification of peritexts for all possible elements and contexts. Other challenges were domain-specific, including the difficulty some participants had coming to grips with the role of fuzziness and ambiguity not only in the design of the XML vocabulary but in the legal domain being modeled. The use of IS provided a focal point which allowed some fundamental ambiguities in the purpose and goals of the project to be identified and resolved.

Thursday 8:00pm - 10:00pm

Balisage Hospitality

Stop in to the Balisage Coffee and Conversation room (Brookside A).

Friday, August 14, 2015

Friday 7:30am - 9:00am

Continental Breakfast

Join us for a light breakfast.

Friday 9:00am - 9:45am

UnderDok: XML structured attributes, change tracking, and the metaphysics of documents

Claus Huitfeldt, University of Bergen, Norway [Paper] [EPUB]

UnderDok is an XML system for publishing, QA, and change tracking of higher education course descriptions. The documents have a fixed structure, numerous cross-references, and prose interspersed with fixed phrases. Each document exists in a native language form, in an English translation, and possibly in additional languages. Changes must be tracked relative to the last authorized version. Formerly, documents were produced in Microsoft Word and manually copied to a database, a process both labor-intensive and error-prone. Our XML document production solution has many complex, interrelated pieces. An interesting philosophical complexity is that a course description, by which the institution and the students are legally bound, is neither the source XML nor the presentation XHTML, but a visual object containing linguistic information that occurs in certain situations and contexts. The legal stability for these documents, which was traditionally provided by printed pages, is now provided by the reproducibility (standardization) of document representation and presentation technology.

Friday 9:45am - 10:30am

Hyperdocument authoring link management using Git and XQuery in service of an abstract hyperdocument management model applied to DITA hyperdocuments

Eliot Kimber, Contrext [Paper] [EPUB] [Slides and materials]

The DITA architecture is supposed to improve life for everyone in the production chain by creating documents of easily constructed, reusable components: write once, use many times. However, the result may quickly migrate into multilayered hyperspace, with components lost in other dimensions. What version of a given component is used in a specific version of a particular document that references it? These issues have been addressed by theoretical hypertext models, such as those behind HyTime and its offshoot Topic Maps, but now XQuery, operating on top of git and a REST API, provides the tools for bringing the abstract models into practical use.

Friday 11:00am - 11:45am

Extending the cybersecurity digital thread with XForms

Joshua Lubell, National Institute of Standards and Technology [Paper] [EPUB] [Slides and materials]

A Digital Thread facilitates the creation, sharing and processing of data, allowing for dynamic and real-time assessment of a system's capabilities, performance, current state, and likelihood of future failure. Cybersecurity is in essence a cyclical sequence of steps whose common goal is to manage cyber-risk. The National Institute of Standards and Technology (NIST) has developed an infrastructure named SCAP (the Security Content Automation Protocol), which acts as an enabler of the digital thread with respect to a risk management framework defined in NIST Special Publication (SP) 800-53. A gap in the digital thread inhibits the selection and tailoring of NIST SP 800-53 security controls needed to meet an organization's specific requirements. My proposed solution simplifies the documentation of security control selections to ensure compliance with NIST SP 800-53 and facilitate the flow of digital data through the risk management framework.

Friday 11:45am - 12:30pm

Calling things by their true names: Descriptive markup and the search for a perfect language

C. M. Sperberg-McQueen, Black Mesa Technologies; Technische Universität Darmstadt [Paper] [EPUB]

One of Leibniz's many projects for improving the world involved the construction of an encyclopedia which would lay out the body of existing knowledge and enable the systematic development of more. Ideally, the encyclopedia should be formulated in a philosophical language and written in a real character (a set of symbols, a universal character set, whose symbols denote not the words of a natural language but the objects of the real world). Properly constructed, the real character would enable a calculus of reasoning: a set of mechanical rules for logical inference. We may smile at Leibniz's idealism; few modern minds can share his optimism that we can reduce all complex concepts to uniquely determined combinations of primitive atomic ideas. But there is a reason Leibniz's ideas continue to inspire modern readers, and many of the same ideals motivate some of our best work in markup languages.