
Balisage 2023 Program

Pre-conference Event: Sunday, July 30, 2023

Sunday 12:00 12:30 EDT

Dress Rehearsal & Social Time

Conference Attendees

Balisage is using the Whova Conference Portal, which is unfamiliar to some attendees and has changed since some of us used it last year at Balisage. In order to provide an opportunity for us all to figure out how the portal works, we will do a “Dress Rehearsal” on the Sunday before the conference. The Dress Rehearsal will start with some social time including coaching to help people get logged in to Whova, a bit of conference-lite content, a Q&A session, and some small group social time.

Sunday 12:30 13:00 EDT (+ Q&A 13:00 - 13:15)

The SGML/XML Approach to Document Processing: [an incomplete] History of Criticisms and Challenges

Allen H Renear School of Information Sciences / ischool.illinois.edu
University of Illinois at Urbana-Champaign

Like all SGML/XML enthusiasts I have spent considerable energy responding to objections to the SGML/XML approach to document processing. In this talk I list some objections to that approach that have been posed by authors, scholars, designers, philosophers, clients, programmers, web designers, and computer scientists, and indicate how I (and others) have responded — that includes responses that in the interest of sustaining friendships or job security were never quite articulated. I look forward to your additions in the discussion.

Sunday 13:30 14:00 EDT

Small Group Social Time

There will be several social spaces available throughout Balisage. This is a good time to take a look at them and chat with other conference attendees.

Monday, July 31, 2023

Monday 10:00 10:15 EDT

Welcome to Balisage 2023

Conference logistics, tips for attendees, and other getting started messages.

Monday 10:15 10:45 EDT

The Secret Garden

B. Tommie Usdin, Mulberry Technologies

We in the markup community have built ourselves a beautiful and ever-improving place to work. We can move content into markup, we have a variety of tools to manipulate marked-up content, we can move at will from tool to tool, we create a variety of products from that marked up content, and we believe our marked up content will be long lived. We frequently lament that most of the world doesn’t live in our techno-garden, and we occasionally admit that most of the world doesn’t even know it exists. At Balisage this year we will learn about ways in which our technology is improving. We will hear about some of the projects we are doing with markup and some of the problems we are having. And we will hear (a little) about how we are opening the gate to our garden and interacting with the outside world.

Monday 11:00 11:30 EDT (+ Q&A 11:30 - 11:45)

Hypergraphs: Escaping the Surly Bonds of Syntax

Patrick Durusau

Texts are not just hierarchies, there are countless other relations between text structures worth finding and investigating: overlap, substitution, discontinuity, parallel texts, cross-references, etc. Humanities scholars are confronted with a bewildering array of markup languages and techniques because when they encounter situations where current markup languages don’t meet their needs, they invent yet another new one. A number of those syntaxes are illustrated here, to lay the groundwork for an heretical suggestion: Humanists should use any consistent method they choose for complex markup. The burden of preparing texts for interchange should rest on technologists. (Hint: use a hypergraph database.) Let scholars do what scholars do and let technologists aid them in those tasks.

Monday 12:00 12:30 EDT (+ Q&A 12:30 - 12:45)

Ambiguity in iXML, and How to Control It

Norm Tovey-Walsh, Saxonica

Humans are really good at resolving ambiguities. Our senses are trained for it: is that pattern of shadows in the forest dappled sunlight, or a tiger waiting to pounce? Our minds quickly and almost effortlessly adjust interpretations based on contextual clues that change over time. Parsers? Not so much. Our everyday languages and formats (XML, JSON, JavaScript, Java, etc.) are rigorously defined to avoid ambiguity: you must put a quote here, a semicolon there. (Most) parsers reject anything that cannot be unambiguously identified within a small textual window. Invisible XML is an uncommon format in that it doesn’t reject grammars or parses that are ambiguous. That doesn’t mean ambiguity is a good thing, and it doesn’t mean authors wouldn’t like to control it.

Monday 13:30 14:00 EDT

Sponsor Presentation: Antenna House - Quality in Formatted Documents

Tony Graham, Antenna House

What does “quality” mean for markup when the markup is meant to be formatted for human consumption? When that markup is transformed from an original document in a completely different XML vocabulary? This presentation looks at different ways of assessing or assuring the quality of both the markup and the formatted document.

Monday 14:00 14:30 EDT (+ Q&A 14:30 - 14:45)

Privately Automating Common, Uncommon, and Surprising Markup Tasks using AI Large Language Models (LB)

Uche Ogbuji, Independent Consultant

Generative AI is everywhere, from DALL-E for image generation to ChatGPT for language tasks. The power of these models boggles the mind even for people who’ve been involved in AI for ages. Can they help us? It's been widely reported that they can be used to write rather sophisticated code in languages like Python and JavaScript, but how well do they work with markup? Can they work with XML properly, or can they only treat it as tag soup? Even without any specialized XML training, large language models (LLMs) prove to have some very interesting, and in some cases impressive, capabilities.

By using self-hosted LLMs rather than third-party services such as ChatGPT or Bard, we can exploit those capabilities even for private applications. There are certainly some clear limitations, but LLMs can also handle a surprising number of common markup tasks out of the box.

Monday 15:00 15:30 EDT (+ Q&A 15:30 - 15:45)

Turning a Battleship: Migrating ServiceNow Documentation to Use DITA Keys

Eliot Kimber, ServiceNow

Migration of ServiceNow's DITA-based product documentation from using unmanageable direct-URL references for all reuse references and hyperlinks to key-based indirect linking within and among publications required complex analysis, design, planning, and data processing. The technical solution uses DITA keys for all links, including use of DITA’s “cross deliverable” linking feature, which enables authoring links from one publication to another (similar to DocBook’s cross-book links but with different implementation challenges). This talk focuses on the XSLT and XQuery process used to perform the migration over a four-day period in the pause between semi-yearly product releases.

Monday 16:00

Birds of a Feather Discussion(s)

Discussion Leader(s) to be Announced

Balisage participants will choose topics we want to discuss and discussion leaders to keep the conversation on topic. These topics may be inspired by conference presentations or may be other subjects of interest to the markup community. Specific topics and discussion leaders will be announced during Balisage.

Tuesday, August 1, 2023

Tuesday 10:00 10:30 EDT (+ Q&A 10:30 - 10:45)

Markup in a Time of AI and Big Data Analytics

Elisa E. Beshero-Bondar, Penn State Erie, The Behrend College

Again and again, we move data from unstructured or semi-structured forms into XML documents governed by schemas. Our goal is the analysis of text — sometimes that means big-data analytics using R or Python libraries, and sometimes it means scholarly editions and preservation of cultural heritage. Markup allows us to use the same resources for both kinds of work. But does markup still make sense in the era of ChatGPT? In large volumes of data, occasional flaws seem negligible, just a little noise, and big data analytics are good at ignoring noise. For creators of scholarly editions, a different question arises: why do we frequently find that the markup we rely on proves a barrier to the sharing and reuse of our data?

Reusing data means looking at texts now from this angle, now from that. So we spend a lot of our data-wrangling effort on conversion from one format to another, filtering out information or enriching the markup. We would do well to consider our tools and research methods critically; descriptive markup can be both a way of limiting our dependence on ephemeral software stacks and a way of organizing critical self-reflection.

Tuesday 11:00 11:30 EDT (+ Q&A 11:30 - 11:45)

The Future Begins Tomorrow: Succession Planning for XML Infrastructure Resources

Jeffrey Beck, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

Much of the world’s scholarship, including most of the world’s scientific, technical, engineering, and medical discoveries, is published in journal articles. Journal publishing has changed rapidly and dramatically in the last 25 years. Current journal article publishing (both print and online), archiving, and interchange is done in XML using JATS (the Journal Article Tag Suite). JATS is supported and maintained by a NISO Standing Committee and a group of (aging) volunteers. How long can this work be supported by volunteers on standards committees? New technologies may transform journal modeling and related tasks but will not replace the work of skilled practitioners. I see document modeling work becoming a function of libraries. Libraries should support shared standards and document models, including training, tool creation, and documentation. Just as libraries store and preserve physical journals published when scholarship was ink on paper, their cultural heritage preservation role must expand to support the creation and preservation of XML journal articles.

Tuesday 12:00 12:30 EDT (+ Q&A 12:30 - 12:45)

Artificial Intelligence with XForms (LB)

John J Chelsom, Fordham University

Classical AI techniques (Bayesian networks, fuzzy logic, forward and backward chaining inference rules, etc.) have proved useful in providing clinical decision support. The Artificial Intelligence (AI) Workbench investigates how these might be used in an XRX (XForms, REST, XQuery) application with a view towards adding them to the cityEHR Electronic Health Records system.

Tuesday 13:30 14:00 EDT

Sponsor Presentation: Docugami, A Document Foundation Model Generating the Core XML Data Model

Jean Paoli & Zubin Rustom Wadia, both of Docugami

Docugami groups documents in “docsets” of semantically similar documents (that will be sharing the same tag set), and generates for each document a semantically rich hierarchical XML tree representing the entire document. Docugami is a proprietary Business Document Foundation Model, a family of Large Language Models (LLMs) and Vision Models trained on millions of Business Documents, ranging from 2.7B parameters to 20B parameters, with multimodal inputs in the vision and text domains. We will demonstrate our new Developer Playground and how to use our Generative AI APIs.

Tuesday 14:00 14:30 EDT (+ Q&A 14:30 - 14:45)

Building applications with generative AI (LB)

M. Joel Dubinko

Application builders are oft sought and seldom realized. An ideal application builder allows domain experts to make configuration choices, push a button, and generate a useful application. It’s a delicate balance. If the builder is too specific, it isn’t reusable. If it’s too generic, it requires intense customization which defeats the purpose. Can generative AI tools be used to analyze sample data and make application builders better? Let’s find out! Along the way, there will be surprises. What’s possible, and what isn’t?

Tuesday 15:00 15:30 EDT (+ Q&A 15:30 - 15:45)

Processing Lax XML Element Trees

Phil Fearon & Gursheen Kaur, DeltaXML

DeltaXML’s XML Compare tool finds and processes changes between two XML documents, a process made significantly more difficult by the complexity of table tagging (both HTML and CALS tables). The HTML table specification provides a fairly lax, loosely constrained structure. This laxity complicates processing, increases the number of logic paths deep within the comparison code, and makes it more difficult to compare different versions of the same table. We created an XSLT process to ‘normalize’ an HTML table, by transforming its structure to conform to a strict content model, fixing ‘problem nodes’ in the original tagging, such as lack of appropriate wrapper elements. This HTML Table Normalization fits within the XSLT pipeline of DeltaXML’s XML Compare product, just before the HTML Table Validation step. This allows the XSLT pipeline originally designed for the stricter CALS table model to rely on a standard hierarchy for HTML tables, even tables that started the comparison process with bad tag nesting.
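To make the idea of adding missing wrapper elements concrete, here is a minimal sketch in Python rather than XSLT. It is not DeltaXML's implementation: the function name `normalize_table` and the single rule it applies (wrapping runs of bare `tr` children in a `tbody`) are assumptions chosen only to illustrate one kind of "problem node" fix.

```python
import xml.etree.ElementTree as ET


def normalize_table(table):
    """Wrap each run of bare <tr> children of a table in a <tbody> element.

    Hypothetical illustration only: a real normalizer would handle many
    more cases (thead/tfoot ordering, colgroup, captions, and so on).
    """
    new_children = []
    current_tbody = None
    for child in list(table):
        if child.tag == "tr":
            # Start a new wrapper if the previous sibling was not a bare row.
            if current_tbody is None:
                current_tbody = ET.Element("tbody")
                new_children.append(current_tbody)
            current_tbody.append(child)
        else:
            # Any other element ends the current run of bare rows.
            current_tbody = None
            new_children.append(child)
    # Replace the table's children with the normalized sequence.
    table[:] = new_children
    return table


lax = ET.fromstring("<table><tr><td>a</td></tr><tr><td>b</td></tr></table>")
normalized = ET.tostring(normalize_table(lax), encoding="unicode")
print(normalized)
```

After a step like this, downstream code can assume that every row sits inside a row-group element, which is the kind of standard hierarchy the abstract describes.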

Tuesday 16:00

Birds of a Feather Discussion(s)

Discussion Leader(s) to be Announced

Balisage participants will choose topics we want to discuss and discussion leaders to keep the conversation on topic. These topics may be inspired by conference presentations or may be other subjects of interest to the markup community. Specific topics and discussion leaders will be announced during Balisage.

Wednesday, August 2, 2023

Wednesday 10:00 10:30 EDT (+ Q&A 10:30 - 10:45)

Adventures in Single-Sourcing XQuery and XSLT

Mary Holstege

XQuery and XSLT have a lot in common: they share XPath as an expression language, they were designed together by the same committee, and users can solve similar problems with them. They are not, however, “the same”. Suppose you have a lot of XQuery that you’d like to share with XSLT. Can you just import it? Can you convert XQuery to XSLT? Directly? Through intermediate forms? Yes. And yes. And yes. But are there inconveniences and tradeoffs? Yes, again. And pretty pictures, we bet.

Wednesday 11:00 11:30 EDT (+ Q&A 11:30 - 11:45)

Pulling All Production Processes Together With an XML-First System (LB)

Charles O’Connor, Aries Systems Corporation and Mark Gross, Data Conversion Laboratory

Creating a seamless centralized workflow that starts with XML has long been the siren song of scholarly journal production workflows. Yet the definition of “start” is the critical piece in this publishing puzzle. For Aries Systems Corporation, innovating article production truly means starting with XML as soon as a manuscript is accepted after peer review. But how do you create a system for auto XML text conversion for Word files when you cannot control the creation of the Word file, nor force authors to follow any predefined template or complex instructions? Given all the ways that authors can use (and abuse!) the wide range of MS Word features, no automated system can produce good XML from 100% of author-supplied files. But Aries could not expect the users of Editorial Manager and ProduXion Manager to be detectives or tinkerers, figuring out what in a problem Word file needs to be fixed to get a good result, rerunning the file, QC’ing the result, maybe having to go back in and fix something else, rerunning the file, QC’ing the result. Editors and authors need to have confidence that when they submit a Word file for processing they will get a good XML file out every time.

Wednesday 12:00 12:30 EDT (+ Q&A 12:30 - 12:45)

Keyboarding Frege’s Concept Writing

C. M. Sperberg-McQueen, Black Mesa Technologies LLC

Descriptive markup should reflect the structure of the information to which it’s being applied. But what if the structure of the information isn’t linear sequences of words in sentences, paragraphs, chapters, but a graphical two-dimensional layout? That’s the problem faced in trying to create an ebook of Gottlob Frege’s 1879 Begriffsschrift (Concept Writing). Simply scanning the pages won’t work: the essential axioms and propositions cannot be queried or manipulated. Capturing Frege’s text in XML is conceivable, but, as with other mathematical notation, the ratio of markup to text is very high. Invisible XML to the rescue! With it, we can create a context-free grammar which makes the formulas both more compact than normal XML and much easier to keyboard. Having Frege’s logical formulas in XML has many benefits, beginning with the ability to use XSLT to generate SVG diagrams. XML enables transformation of Frege’s formulas into modern linear logic notation, input to modern logic analyzers, searchable ebooks, and perhaps more to come.

Wednesday 13:30 14:45 EDT

Open Mic: Anything Goes

Conference Participants

Balisage short subject open microphone. All conference participants are invited to give a 2 to 10 minute presentation on ANY topic (within the limits of the conference Code of Conduct). Use video, sound, bullet point slides, cartoons, visualizations, SW demonstrations, or just yourself as a talking head. Anything goes!

Wednesday 15:00 15:30 EDT (+ Q&A 15:30 - 15:45)

Auto-Markup BenchMark: towards an industry-standard benchmark for Evaluating Automatic Document Markup (LB)

Paul Prescod, Document Minds; Ben Feuer, New York University; Andrii Hladkyi and Sean Paulk, Western Tidewater Community Services Board; and Arjun Prasad, BITS-Pilani, Hyderabad campus

As important as structured markup is to the publishing process, the difficulty of generating markup reliably in an automatic system has long been an impediment. With the arrival of large language models in artificial intelligence, it may be possible to use AI to bridge between unstructured text and structured publishing systems. Establishing a benchmark for the understanding and generation of structured documents and markup languages can drive innovation, standardize evaluations, identify algorithm strengths and weaknesses, clarify the state of the art, and foster interdisciplinary collaborations. A new test suite, combined with a new metric called XATER (for XML Automarkup Translation Edit Rate), should improve the speed with which new markup-generation systems can be developed and evaluated.

Wednesday 16:00

Birds of a Feather Discussion(s)

Discussion Leader(s) to be Announced

Balisage participants will choose topics we want to discuss and discussion leaders to keep the conversation on topic. These topics may be inspired by conference presentations or may be other subjects of interest to the markup community. Specific topics and discussion leaders will be announced during Balisage.

Thursday, August 3, 2023

Thursday 10:00 10:30 EDT (+ Q&A 10:30 - 10:45)

Serializing the Locator Format of the United States Government Publishing Office as XML

Joel Kalvesmaki, Government Publishing Office

Over half a century ago, the Government Printing Office (GPO) embarked on pioneering work in computerized typesetting. The system, which evolved over many years, employed “locator codes” to apply typographic formats to units of text. Locator codes still drive GPO’s composition process. In the early days, GPO often referred to this system as generic coding. But a locator code does not have a defined meaning in the way, for example, an element in Akoma Ntoso does: its interpretation is conditioned by context-sensitive references to supplementary resources concerned with formats, character mappings, and grids (which go back to the physical handling of character images in GPO’s first electronic phototypesetter).

Although SGML and XML have been used at GPO for many years, many pages are still set using the locator system. There have been projects to convert some individual locator applications to purpose-built XML applications, but Serializing Locators as XML (Slax) takes a different approach. Rather than convert locators to specific XML vocabularies, Slax expresses locators as XML. On the way, Slax has converted the resource files from ASCII to XML, then converted them into datasets that a C# program can apply to locator documents in order to generate the appropriate XML output. Although these XML documents are still intended to be used by the GPO composition system, having the files in XML opens them up to further processing by XML tools.

Thursday 11:00 11:30 EDT (+ Q&A 11:30 - 11:45)

Accumulators in XSLT and XSpec: Developing, Debugging, and Testing XSLT 3 Accumulators

Amanda Galtman

Accumulators in XSLT 3, while not necessarily an everyday staple of XSLT code bases, can provide elegant solutions to certain challenges in both streaming and non-streaming XSLT applications. The debugging and testing techniques you might use for accumulators are a little different from techniques you use for templates and functions. This paper shows examples where accumulators are or aren't a good fit. The paper then describes and compares several debugging and testing techniques for accumulators, especially in non-streaming applications. Whether you require a near-term bug diagnosis or a long-term automated XSpec test of the accumulator's functionality, you will learn ways to access, understand, and maintain your accumulator's behavior.

Thursday 12:00 12:30 EDT (+ Q&A 12:30 - 12:45)

Unveiling Linguistic Harmony: Asserting Interlingual Synchronicity in Documents (LB)

Geert Bormans, C-Moria BV and Srikanth Venkata Subramanian, Cognizone BV

Publishing texts in multiple languages always involves issues of coordination of both content and structure. The Swiss Chancellery must deal with documents in four official languages, and they face special problems in the production of compilations that merge amendments into existing legislation. Although the source documents are prepared in MS Word, they are issued in derivative products such as PDF, HTML, and Akoma Ntoso. To aid coordination between versions, the authors are developing a reporting system that does rule-driven sanity checks on parallel documents, starting with the XML of Akoma Ntoso. Each rule is represented by XSLT processes, and the selection of rules to be applied is controlled by XProc, with the results displayed as messages interpolated into an HTML rendition of the source document.

Thursday 13:30 14:00 EDT

Typefi Automated Publishing for XML

Guy van der Kolk, Typefi

Typefi’s publishing automation software makes it possible to produce accurate and well-designed outputs from your XML content incredibly quickly. It’s built on Adobe InDesign Server, so you get all the power of the world's leading layout design software and full control over your designs. In this presentation, Typefi Product Manager, Guy van der Kolk, will show a few demos of Typefi in action—including automated publishing using Adobe InDesign, Typefi’s Standards publishing capabilities, integrations with Oxygen XML Web Author and DeltaXML Compare for redlining, multi-lingual publishing, and Typefi's proprietary InDesign plug-ins.

Thursday 14:00 14:30 EDT (+ Q&A 14:30 - 14:45)

A Wonderful Historie of Intertextual Networks: Or, How Not to Index Your Data

Ash Clark, Northeastern University Women Writers Project

Being a learning experience: a tale told of TEI documents, EXPath, XQuery scripts, JSON serialized as XML, eXist-db, RelaxNG schemas for cached responses, maps, arrays, HTML, and more. Let me tell you about it! In 2022, the Women Writers Project (WWP) published ‘Women Writers Intertextual Network’ (WWIN) as an EXPath web app served out of eXist-db. The database contained Women Writers Online (WWO) documents encoded with ‘intertextual gestures’ (references within one work to a second work, for example, citations, quotes, parodies), separate bibliographic entries for each of the works referenced, and a taxonomy of topic and genre keywords. From the beginning, the WWIN site was plagued with connection issues and long load times. An important design goal was to give the website user a lot of power to control their experience of the WWIN content. Unfortunately, this level of control, combined with the application’s scale and choice of interface, meant that the site needed to aggregate a lot of data before being able to return results. The index-heavy application put a premium on processing as much data ahead of publication as possible, optimizing the indexing and retrieval of cached data, and making the indexes smaller and more precise.

Thursday 15:00 16:00 EDT

Balisage Bard

Lynne Price, Gamemaster

Once again, Balisage Bard gives you the opportunity to exercise your literary creativity with original poems, short stories, jokes, songs, photos, recipes, trivia questions, and other masterpieces. Subject matter must be related to Balisage—possibilities include markup, papers presented this or previous years, virtual conferences, attendees’ interests (whether or not pertinent to markup), and so forth. Read your effort, play it on video, or show photos or text during the game session. Translations of works in languages other than English are not required but will be appreciated. There is a two-minute time limit per presentation. Sign up by sending email to bard@txstruct.com. One contribution per person/team unless there is time for more at the end.

Thursday 16:00

Birds of a Feather Discussion(s)

Discussion Leader(s) to be Announced

Balisage participants will choose topics we want to discuss and discussion leaders to keep the conversation on topic. These topics may be inspired by conference presentations or may be other subjects of interest to the markup community. Specific topics and discussion leaders will be announced during Balisage.

Friday, August 4, 2023

Friday 10:00 10:30 EDT (+ Q&A 10:30 - 10:45)

Schema-Aware Conversion of XML to JSON (LB)

Michael Kay, Saxonica

There is, unsurprisingly, a high demand for good XML to JSON conversion tools. XML is widely used for intra-enterprise and inter-enterprise dataflows, and has many strengths that make it well suited to that role; but JSON is easier to consume in conventional programming languages, and most especially in JavaScript.

But most existing libraries, also unsurprisingly, do the job badly. There isn't a good one-to-one fit between the data models and there are real tensions between the requirements for good conversions. Different libraries have made different design compromises, and the JSON they produce tends to please no one.
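One such tension can be shown with a small sketch. The converter below is a deliberately naive, hypothetical one (the function `naive_to_json` and the `order`/`item` element names are invented for illustration and do not come from the talk or from any particular library). Without a schema it cannot know that `item` is repeatable, so the JSON shape changes with the data.

```python
import json
import xml.etree.ElementTree as ET


def naive_to_json(elem):
    """Naive conversion: leaf elements become strings, and a child name
    becomes a list only when it happens to occur more than once."""
    children = list(elem)
    if not children:
        return elem.text or ""
    result = {}
    for child in children:
        value = naive_to_json(child)
        if child.tag in result:
            # Second occurrence: promote the existing value to a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


one = naive_to_json(ET.fromstring("<order><item>pen</item></order>"))
two = naive_to_json(ET.fromstring("<order><item>pen</item><item>ink</item></order>"))
print(json.dumps(one))  # {"item": "pen"}
print(json.dumps(two))  # {"item": ["pen", "ink"]}
```

A consumer of this JSON has to handle both a string and a list for the same property. A schema that declares `item` as repeatable would let a converter emit a list in both cases.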

Schema-aware conversion can do a better job. In the XSLT/XPath world we have a real opportunity to deliver that, because we already have all the infrastructure for processing schema-aware XML. A function that performs schema-aware conversion has recently been specified for inclusion in XPath 4.0, and we have written a prototype implementation. It may be informative to see how it works and how its results compare to those produced by other conversion tools.

Friday 11:00 11:30 EDT (+ Q&A 11:30 - 11:45)

The Dream of a CMS

Ari Nordström

I’ve always wanted an XML-first content management system, built around XML technologies: XProc, XSLT, XQuery, XForms, the whole package. Over the years, I have managed to build some of the CMS of my dreams, but never all of it. Not, that is, until recently. Strictly speaking, what I’ve built for the client is a portal for viewing automotive documentation at all stages of the product life cycle. The portal does not manage document creation, but once you have on-the-fly document assembly, conditional processing (aka profiling), browsing, filtering, and integration of 3D models, you have a lot of what is needed for a complete content management system. Add some markup for workflow management and versioning, and I think I am very close to getting my wish.

Friday 12:00 12:30 EDT (+ Q&A 12:30 - 12:45)

Retractions and Corrections at Scholars Portal Journals

Jessica Hymers & Qinqin Lin, OCUL Scholars Portal

In order to provide accurate and up-to-date access to scholarly research, Scholars Portal is improving our handling of article corrections and retractions. We are an XML based repository that hosts e-journal content for universities in Ontario as a service of the Ontario Council of University Libraries (OCUL). Our new process uses the JATS metadata element <related-article> to link between articles and their corrections and retractions. This allows us to notify users immediately when there have been changes to the article they are viewing. This is not without challenges: handling articles that have not been registered with a Digital Object Identifier (DOI), for example, and publishers’ inconsistent use of attribute values.
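As a rough illustration of the linking described above, the sketch below reads `<related-article>` elements from a JATS-like fragment. It is not Scholars Portal's code: the sample document, the placeholder DOI, and the helper name `related_articles` are all assumptions, and the attribute values shown are only one of the variants that publishers use.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical sample; the DOI is a placeholder, not a real identifier.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta>
    <related-article related-article-type="retracted-article"
                     ext-link-type="doi" xlink:href="10.9999/example.123"/>
  </article-meta></front>
</article>"""


def related_articles(xml_text):
    """Return one dict per <related-article> element in the document."""
    root = ET.fromstring(xml_text)
    links = []
    for rel in root.iter("related-article"):
        links.append({
            "type": rel.get("related-article-type"),
            "id_type": rel.get("ext-link-type"),
            # xlink:href is namespaced, so it needs the {uri}name form.
            "target": rel.get(f"{{{XLINK}}}href"),
        })
    return links


links = related_articles(SAMPLE)
print(links)
```

Because attribute values vary between publishers, a real process would likely map the `type` values onto a controlled list before deciding what notice to show, and would need a fallback when no DOI is present.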

Friday 13:30 14:45 EDT

Knock Down This Wall

C. M. Sperberg-McQueen, Black Mesa Technologies

Life can be comfortable inside a walled garden. But if we wish to engage with the world, we need to knock down those walls.

Friday 15:00 15:30 EDT

Feedback

What did you like at Balisage 2023? What could have been better? What changes would you suggest for future Balisage conferences? Tell us what you think.
