Balisage 2023 Program
Pre-conference Event: Sunday, July 30, 2023
Dress Rehearsal & Social Time
Conference Attendees
Balisage is using the Whova Conference Portal, which is unfamiliar to some attendees and has changed since some of us used it last year at Balisage. In order to provide an opportunity for us all to figure out how the portal works, we will do a “Dress Rehearsal” on the Sunday before the conference. The Dress Rehearsal will start with some social time including coaching to help people get logged in to Whova, a bit of conference-lite content, a Q&A session, and some small group social time.
The SGML/XML Approach to Document Processing: [an incomplete] History of Criticisms and Challenges
Allen H Renear
School of Information Sciences /
ischool.illinois.edu
University of Illinois at Urbana-Champaign
Like all SGML/XML enthusiasts I have spent considerable energy responding to objections to the SGML/XML approach to document processing. In this talk I list some objections to that approach that have been posed by authors, scholars, designers, philosophers, clients, programmers, web designers, and computer scientists, and indicate how I (and others) have responded — that includes responses that in the interest of sustaining friendships or job security were never quite articulated. I look forward to your additions in the discussion.
Small Group Social Time
There will be several social spaces available throughout Balisage. This is a good time to take a look at them and chat with other conference attendees.
Monday, July 31, 2023
Welcome to Balisage 2023
Conference logistics, tips for attendees, and other getting started messages.
The Secret Garden
B. Tommie Usdin, Mulberry Technologies
We in the markup community have built ourselves a beautiful and ever-improving place to work. We can move content into markup, we have a variety of tools to manipulate marked-up content, we can move at will from tool to tool, we create a variety of products from that marked up content, and we believe our marked up content will be long lived. We frequently lament that most of the world doesn’t live in our techno-garden, and we occasionally admit that most of the world doesn’t even know it exists. At Balisage this year we will learn about ways in which our technology is improving. We will hear about some of the projects we are doing with markup and some of the problems we are having. And we will hear (a little) about how we are opening the gate to our garden and interacting with the outside world.
Hypergraphs: Escaping the Surly Bonds of Syntax
Patrick Durusau
Texts are not just hierarchies, there are countless other relations between text structures worth finding and investigating: overlap, substitution, discontinuity, parallel texts, cross-references, etc. Humanities scholars are confronted with a bewildering array of markup languages and techniques because when they encounter situations where current markup languages don’t meet their needs, they invent yet another new one. A number of those syntaxes are illustrated here, to lay the groundwork for an heretical suggestion: Humanists should use any consistent method they choose for complex markup. The burden of preparing texts for interchange should rest on technologists. (Hint: use a hypergraph database.) Let scholars do what scholars do and let technologists aid them in those tasks.
Ambiguity in iXML, and How to Control It
Norm Tovey-Walsh, Saxonica
Humans are really good at resolving ambiguities. Our senses are trained for it: is that pattern of shadows in the forest dappled sunlight, or a tiger waiting to pounce? Our minds quickly and almost effortlessly adjust interpretations based on contextual clues that change over time. Parsers? Not so much. Our everyday languages and formats: XML, JSON, JavaScript, Java, etc. are rigorously defined to avoid ambiguity: you must put a quote here, a semicolon there. (Most) parsers reject anything that cannot be unambiguously identified within a small textual window. Invisible XML is an uncommon format in that it doesn’t reject grammars or parses that are ambiguous. That doesn’t mean ambiguity is a good thing, and it doesn’t mean authors wouldn’t like to control it.
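To make the notion of an ambiguous grammar concrete, here is a minimal Python sketch that counts the distinct parse trees a toy grammar assigns to an input. It is not Invisible XML notation and not an iXML processor; the grammar (E -> E '+' E | digit) and the function name count_parses are assumptions chosen purely for illustration.

```python
from functools import lru_cache


def count_parses(tokens):
    """Count parse trees for the toy grammar  E -> E '+' E | digit."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways tokens[i:j] can be derived from E.
        if j - i == 1:
            return 1 if tokens[i].isdigit() else 0
        total = 0
        # Try every '+' strictly inside the span as the top-level operator.
        for k in range(i + 1, j - 1):
            if tokens[k] == "+":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parses("1+2"))      # 1: unambiguous
print(count_parses("1+2+3"))    # 2: (1+2)+3 and 1+(2+3)
print(count_parses("1+2+3+4"))  # 5
```

A parser that insists on a single result has to reject such a grammar or pick one tree silently; a parser that tolerates ambiguity can report that more than one parse exists.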
Sponsor Presentation: Antenna House - Quality in Formatted Documents
Tony Graham, Antenna House
What does “quality” mean for markup when the markup is meant to be formatted for human consumption? When that markup is transformed from an original document in a completely different XML vocabulary? This presentation looks at different ways of assessing or assuring the quality of both the markup and the formatted document.
Privately Automating Common, Uncommon, and Surprising Markup Tasks using AI Large Language Models (LB)
Uche Ogbuji, Independent Consultant
Generative AI is everywhere, from DALL-E for image generation to ChatGPT for language tasks. The power of these models boggles the mind even for people who’ve been involved in AI for ages. Can they help us? It's been widely reported that they can be used to write rather sophisticated code in languages like Python and JavaScript, but how well do they work with markup? Can they work with XML properly, or can they only treat it as tag soup? Even without any specialized XML training, large language models (LLMs) prove to have some very interesting, and in some cases impressive, capabilities.
By using self-hosted LLMs rather than third-party services such as ChatGPT or Bard, we can exploit those capabilities even for private applications. There are certainly some clear limitations, but LLMs can also handle a surprising number of common markup tasks out of the box.
Turning a Battleship: Migrating ServiceNow Documentation to Use DITA Keys
Eliot Kimber, ServiceNow
Migrating ServiceNow's DITA-based product documentation from unmanageable direct-URL references for all reuse references and hyperlinks to key-based indirect linking within and among publications required complex analysis, design, planning, and data processing. The technical solution uses DITA keys for all links, including use of DITA’s “cross deliverable” linking feature, which enables authoring links from one publication to another (similar to DocBook’s cross-book links but with different implementation challenges). The talk focuses on the XSLT and XQuery process used to perform the migration over a four-day period in the pause between semi-yearly product releases.
Birds of a Feather Discussion(s)
Discussion Leader(s) to be Announced
Balisage participants will choose topics we want to discuss and discussion leaders to keep the conversation on topic. These topics may be inspired by conference presentations or may be other subjects of interest to the markup community. Specific topics and discussion leaders will be announced during Balisage.
Tuesday, August 1, 2023
Markup in a Time of AI and Big Data Analytics
Elisa E. Beshero-Bondar, Penn State Erie, The Behrend College
Again and again, we move data from unstructured or semi-structured forms into XML documents governed by schemas. Our goal is the analysis of text — sometimes that means big-data analytics using R or Python libraries, and sometimes it means scholarly editions and preservation of cultural heritage. Markup allows us to use the same resources for both kinds of work. But does markup still make sense in the era of ChatGPT? In large volumes of data, occasional flaws seem negligible, just a little noise, and big data analytics are good at ignoring noise. For creators of scholarly editions, a different question arises: why do we frequently find that the markup we rely on proves a barrier to the sharing and reuse of our data?
Reusing data means looking at texts now from this angle, now from that. So we spend a lot of our data-wrangling effort on conversion from one format to another, filtering out information or enriching the markup. We would do well to consider our tools and research methods critically; descriptive markup can be both a way of limiting our dependence on ephemeral software stacks and a way of organizing critical self-reflection.
The Future Begins Tomorrow: Succession Planning for XML Infrastructure Resources
Jeffrey Beck, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
Much of the world’s scholarship, including most of the world’s scientific, technical, engineering, and medical discoveries, is published in journal articles. Journal publishing has changed rapidly and dramatically in the last 25 years. Current journal article publishing (both print and online), archiving, and interchange are done in XML using JATS (the Journal Article Tag Suite). JATS is supported and maintained by a NISO Standing Committee and a group of (aging) volunteers. How long can this work be supported by volunteers on standards committees? New technologies may transform journal modeling and related tasks but will not replace the work of skilled practitioners. I see document modeling work becoming a function of libraries. Libraries should support shared standards and document models, including training, tool creation, and documentation. Just as libraries store and preserve physical journals published when scholarship was ink on paper, their cultural heritage preservation role must expand to support the creation and preservation of XML journal articles.
Artificial Intelligence with XForms (LB)
John J Chelsom, Fordham University
Classical AI techniques: Bayesian networks, fuzzy logic, forward and backward chaining inference rules, etc. have proved useful in providing clinical decision support. The Artificial Intelligence (AI) Workbench investigates how these might be used in an XRX (XForms, REST, XQuery) application with a view towards adding them to the cityEHR Electronic Health Records system.
Sponsor Presentation: Docugami, A Document Foundation Model Generating the Core XML Data Model
Jean Paoli & Zubin Rustom Wadia, both of Docugami
Docugami groups documents in “docsets” of semantically similar documents (that will be sharing the same tag set), and generates for each document a semantically rich hierarchical XML tree representing the entire document. Docugami is a proprietary Business Document Foundation Model, a family of Large Language Models (LLMs) and Vision Models trained on millions of Business Documents, ranging from 2.7B parameters to 20B parameters, with multimodal inputs in the vision and text domains. We will demonstrate our new Developer Playground and how to use our Generative AI APIs.
Building applications with generative AI (LB)
M. Joel Dubinko
Application builders are oft sought and seldom realized. An ideal application builder allows domain experts to make configuration choices, push a button, and generate a useful application. It’s a delicate balance. If the builder is too specific, it isn’t reusable. If it’s too generic, it requires intense customization which defeats the purpose. Can generative AI tools be used to analyze sample data and make application builders better? Let’s find out! Along the way, there will be surprises. What’s possible, and what isn’t?
Processing Lax XML Element Trees
Phil Fearon & Gursheen Kaur, DeltaXML
DeltaXML’s XML Compare tool finds and processes changes between two XML documents, a process made significantly more difficult by the complexity of table tagging (both HTML and CALS tables). The HTML table specification provides a fairly lax, loosely constrained structure. This laxity complicates processing, increases the number of logic paths deep within the comparison code, and makes it more difficult to compare different versions of the same table. We created an XSLT process to ‘normalize’ an HTML table, by transforming its structure to conform to a strict content model, fixing ‘problem nodes’ in the original tagging, such as lack of appropriate wrapper elements. This HTML Table Normalization fits within the XSLT pipeline of DeltaXML’s XML Compare product, just before the HTML Table Validation step. This allows the XSLT pipeline originally designed for the stricter CALS table model to rely on a standard hierarchy for HTML tables, even tables that started the comparison process with bad tag nesting.
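As a rough illustration of the kind of normalization described above, the following Python sketch wraps bare tr children of a table in a tbody element so that later steps can rely on a fixed hierarchy. It is not DeltaXML's XSLT; the function name normalize_table and the single rule it applies are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def normalize_table(table: ET.Element) -> ET.Element:
    """Build a new table element in which every run of bare <tr>
    children is wrapped in a <tbody>. Hypothetical helper, one rule only."""
    result = ET.Element(table.tag, dict(table.attrib))
    result.text = table.text
    wrapper = None  # the <tbody> currently collecting bare rows
    for child in table:
        if child.tag == "tr":
            if wrapper is None:
                wrapper = ET.SubElement(result, "tbody")
            wrapper.append(child)
        else:
            wrapper = None  # a non-row child ends the current run of rows
            result.append(child)
    return result


lax = ET.fromstring(
    "<table><caption>Sizes</caption>"
    "<tr><td>a</td></tr><tr><td>b</td></tr></table>"
)
strict = normalize_table(lax)
print(ET.tostring(strict, encoding="unicode"))
```

A real normalizer would need many more rules (thead and tfoot placement, missing cells, spans); the point here is only that downstream code can then assume rows always sit inside a wrapper element.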
Birds of a Feather Discussion(s)
Discussion Leader(s) to be Announced
Balisage participants will choose topics we want to discuss and discussion leaders to keep the conversation on topic. These topics may be inspired by conference presentations or may be other subjects of interest to the markup community. Specific topics and discussion leaders will be announced during Balisage.
Wednesday, August 2, 2023
Adventures in Single-Sourcing XQuery and XSLT
Mary Holstege
XQuery and XSLT have a lot in common: they share XPath as an expression language, they were designed together by the same committee, and users can solve similar problems with them. They are not, however, “the same”. Suppose you have a lot of XQuery that you’d like to share with XSLT. Can you just import it? Can you convert XQuery to XSLT? Directly? Through intermediate forms? Yes. And yes. And yes. But are there inconveniences and tradeoffs? Yes, again. And pretty pictures, we bet.
Pulling All Production Processes Together With an XML-First System (LB)
Charles O’Connor, Aries Systems Corporation and Mark Gross, Data Conversion Laboratory
Creating a seamless centralized workflow that starts with XML has long been the siren song of scholarly journal production workflows. Yet the definition of “start” is the critical piece in this publishing puzzle. For Aries Systems Corporation, innovating article production truly means starting with XML as soon as a manuscript is accepted after peer review. But how do you create a system for auto XML text conversion for Word files when you cannot control the creation of the Word file, nor force authors to follow any predefined template or complex instructions? Given all the ways that authors can use (and abuse!) the wide range of MS Word features, no automated system can produce good XML from 100% of author-supplied files. But Aries could not expect the users of Editorial Manager and ProduXion Manager to be detectives or tinkerers, figuring out what in a problem Word file needs to be fixed to get a good result, rerunning the file, QC’ing the result, maybe having to go back in and fix something else, rerunning the file, QC’ing the result. Editors and authors need to have confidence that when they submit a Word file for processing they will get a good XML file out every time.
Keyboarding Frege’s Concept Writing
C. M. Sperberg-McQueen, Black Mesa Technologies LLC
Descriptive markup should reflect the structure of the information to which it’s being applied. But what if the structure of the information isn’t linear sequences of words in sentences, paragraphs, chapters, but a graphical two-dimensional layout? That’s the problem faced in trying to create an ebook of Gottlob Frege’s 1879 Begriffsschrift (Concept Writing). Simply scanning the pages won’t work: the essential axioms and propositions cannot be queried or manipulated. Capturing Frege’s text in XML is conceivable, but, as with other mathematical notation, the ratio of markup to text is very high. Invisible XML to the rescue! With it, we can create a context-free grammar which makes the formulas both more compact than normal XML and much easier to keyboard. Having Frege’s logical formulas in XML has many benefits, beginning with the ability to use XSLT to generate SVG diagrams. XML enables transformation of Frege’s formulas into modern linear logic notation, input to modern logic analyzers, searchable ebooks, and perhaps more to come.
Open Mic: Anything Goes
Conference Participants
Balisage short subject open microphone. All conference participants are invited to give a 2 to 10 minute presentation on ANY topic (within the limits of the conference Code of Conduct). Use video, sound, bullet point slides, cartoons, visualizations, SW demonstrations, or just yourself as a talking head. Anything goes!
Auto-Markup BenchMark: towards an industry-standard benchmark for Evaluating Automatic Document Markup (LB)
Paul Prescod, Document Minds; Ben Feuer, New York University; Andrii Hladkyi and Sean Paulk, Western Tidewater Community Services Board; and Arjun Prasad, BITS-Pilani, Hyderabad campus
As important as structured markup is to the publishing process, the difficulty of generating markup reliably in an automatic system has long been an impediment. With the arrival of large language models in artificial intelligence, it may be possible to use AI to bridge between unstructured text and structured publishing systems. Establishing a benchmark for the understanding and generation of structured documents and markup languages can drive innovation, standardize evaluations, identify algorithm strengths and weaknesses, clarify the state of the art, and foster interdisciplinary collaborations. A new test suite, combined with a new metric called XATER (for XML Automarkup Translation Edit Rate), should improve the speed with which new markup-generation systems can be developed and evaluated.
Birds of a Feather Discussion(s)
Discussion Leader(s) to be Announced
Balisage participants will choose topics we want to discuss and discussion leaders to keep the conversation on topic. These topics may be inspired by conference presentations or may be other subjects of interest to the markup community. Specific topics and discussion leaders will be announced during Balisage.
Thursday, August 3, 2023
Serializing the Locator Format of the United States Government Publishing Office as XML
Joel Kalvesmaki, Government Publishing Office
Over half a century ago, the Government Printing Office (GPO) embarked on pioneering work in computerized typesetting. The system, which evolved over many years, employed “locator codes” to apply typographic formats to units of text. Locator codes still drive GPO’s composition process. In the early days, GPO often referred to this system as generic coding. But a locator code does not have a defined meaning in the way, for example, an element in Akoma Ntoso does: its interpretation is conditioned by context-sensitive references to supplementary resources concerned with formats, character mappings, and grids (which go back to the physical handling of character images in GPO’s first electronic phototypesetter).
Although SGML and XML have been used at GPO for many years, many pages are still set using the locator system. There have been projects to convert some individual locator applications to purpose-built XML applications, but Serializing Locators as XML (Slax) takes a different approach. Rather than convert locators to specific XML vocabularies, Slax expresses locators as XML. On the way, Slax has converted the resource files from ASCII to XML, then converted them into datasets that a C# program can apply to locator documents in order to generate the appropriate XML output. Although these XML documents are still intended to be used by the GPO composition system, having the files in XML opens them up to further processing by XML tools.
Accumulators in XSLT and XSpec: Developing, Debugging, and Testing XSLT 3 Accumulators
Amanda Galtman
Accumulators in XSLT 3, while not necessarily an everyday staple of XSLT code bases, can provide elegant solutions to certain challenges in both streaming and non-streaming XSLT applications. The debugging and testing techniques you might use for accumulators are a little different from techniques you use for templates and functions. This paper shows examples where accumulators are or aren't a good fit. The paper then describes and compares several debugging and testing techniques for accumulators, especially in non-streaming applications. Whether you require a near-term bug diagnosis or a long-term automated XSpec test of the accumulator's functionality, you will learn ways to access, understand, and maintain your accumulator's behavior.
Unveiling Linguistic Harmony: Asserting Interlingual Synchronicity in Documents (LB)
Geert Bormans, C-Moria BV and Srikanth Venkata Subramanian, Cognizone BV
Publishing texts in multiple languages always involves issues of coordination of both content and structure. The Swiss Chancellery must deal with documents in four official languages, and they face special problems in the production of compilations that merge amendments into existing legislation. Although the source documents are prepared in MS Word, they are issued in derivative products such as PDF, HTML, and Akoma Ntoso. To aid coordination between versions, the authors are developing a reporting system that does rule-driven sanity checks on parallel documents, starting with the XML of Akoma Ntoso. Each rule is represented by XSLT processes, and the selection of rules to be applied is controlled by XProc, with the results displayed as messages interpolated into an HTML rendition of the source document.
Typefi Automated Publishing for XML
Guy van der Kolk, Typefi
Typefi’s publishing automation software makes it possible to produce accurate and well-designed outputs from your XML content incredibly quickly. It’s built on Adobe InDesign Server, so you get all the power of the world's leading layout design software and full control over your designs. In this presentation, Typefi Product Manager, Guy van der Kolk, will show a few demos of Typefi in action—including automated publishing using Adobe InDesign, Typefi’s Standards publishing capabilities, integrations with Oxygen XML Web Author and DeltaXML Compare for redlining, multi-lingual publishing, and Typefi's proprietary InDesign plug-ins.
A Wonderful Historie of Intertextual Networks: Or, How Not to Index Your Data
Ash Clark, Northeastern University Women Writers Project
Being a learning experience: a tale told of TEI documents, EXPath, XQuery scripts, JSON serialized as XML, eXist-db, RelaxNG schemas for cached responses, maps, arrays, HTML, and more. Let me tell you about it! In 2022, the Women Writers Project (WWP) published ‘Women Writers Intertextual Network’ (WWIN) as an EXPath web app served out of eXist-db. The database contained Women Writers Online (WWO) documents encoded with ‘intertextual gestures’ (references within one work to a second work, for example, citations, quotes, parodies), separate bibliographic entries for each of the works referenced, and a taxonomy of topic and genre keywords. From the beginning, the WWIN site was plagued with connection issues and long load times. An important design goal was to give the website user a lot of power to control their experience of the WWIN content. Unfortunately, this level of control, combined with the application’s scale and choice of interface, meant that the site needed to aggregate a lot of data before being able to return results. The index-heavy application put a premium on processing as much data ahead of publication as possible, optimizing the indexing and retrieval of cached data, and making the indexes smaller and more precise.
Balisage Bard
Lynne Price, Gamemaster
Once again, Balisage Bard gives you the opportunity to exercise your literary creativity with original poems, short stories, jokes, songs, photos, recipes, trivia questions, and other masterpieces. Subject matter must be related to Balisage—possibilities include markup, papers presented this or previous years, virtual conferences, attendees’ interests (whether or not pertinent to markup), and so forth. Read your effort, play it on video, or show photos or text during the game session. Translations of works in languages other than English are not required but will be appreciated. There is a two-minute time limit per presentation. Sign up by sending email to bard@txstruct.com. One contribution per person/team unless there is time for more at the end.
Birds of a Feather Discussion(s)
Discussion Leader(s) to be Announced
Balisage participants will choose topics we want to discuss and discussion leaders to keep the conversation on topic. These topics may be inspired by conference presentations or may be other subjects of interest to the markup community. Specific topics and discussion leaders will be announced during Balisage.
Friday, August 4, 2023
Schema-Aware Conversion of XML to JSON (LB)
Michael Kay, Saxonica
There is, unsurprisingly, a high demand for good XML to JSON conversion tools. XML is widely used for intra-enterprise and inter-enterprise dataflows, and has many strengths that make it well suited to that role; but JSON is easier to consume in conventional programming languages, especially in JavaScript.
But most existing libraries, also unsurprisingly, do the job badly. There isn't a good one-to-one fit between the data models, and there are real tensions between the requirements for good conversions. Different libraries have made different design compromises, and the JSON they produce tends to please no one.
Schema-aware conversion can do a better job. In the XSLT/XPath world we have a real opportunity to deliver that, because we already have all the infrastructure for processing schema-aware XML. A function that performs schema-aware conversion has recently been specified for inclusion in XPath 4.0, and we have written a prototype implementation. It may be informative to see how it works and how its results compare to those produced by other conversion tools.
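One of the tensions mentioned above can be sketched in a few lines of Python. Without schema knowledge, a converter cannot tell whether a single child element is "one value" or "a list that happens to have one member", so the JSON shape changes from instance to instance. The sketch below is not the XPath 4.0 function and not any existing library; the to_json helper and its repeatable parameter (a hypothetical stand-in for what a schema would tell us) are assumptions for illustration, and attributes and mixed content are ignored.

```python
import json
import xml.etree.ElementTree as ET


def to_json(elem, repeatable=frozenset()):
    """Convert an element to plain Python data.

    `repeatable` is a hypothetical stand-in for schema knowledge: the
    names of child elements that may occur more than once.
    """
    children = list(elem)
    if not children:
        return elem.text or ""
    out = {}
    for child in children:
        value = to_json(child, repeatable)
        if child.tag in repeatable:
            # Schema says "may repeat": always emit an array.
            out.setdefault(child.tag, []).append(value)
        elif child.tag in out:
            # Naive rule: switch to a list only once a repeat is seen.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out


one = ET.fromstring("<order><item>pen</item></order>")
two = ET.fromstring("<order><item>pen</item><item>ink</item></order>")

print(json.dumps(to_json(one)))            # {"item": "pen"}
print(json.dumps(to_json(two)))            # {"item": ["pen", "ink"]}
print(json.dumps(to_json(one, {"item"})))  # {"item": ["pen"]}
```

With the naive rule, a consumer has to handle both a string and an array for the same property; with the schema hint, the shape is stable regardless of how many items a particular document contains.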
The Dream of a CMS
Ari Nordström
I’ve always wanted an XML-first content management system, built around XML technologies: XProc, XSLT, XQuery, XForms, the whole package. Over the years, I have managed to build some of the CMS of my dreams, but never all of it. Not, that is, until recently. Strictly speaking, what I’ve built for the client is a portal for viewing automotive documentation at all stages of the product life cycle. The portal does not manage document creation, but once you have on-the-fly document assembly, conditional processing (aka profiling), browsing, filtering, and integration of 3D models, you have a lot of what is needed for a complete content management system. Add some markup for workflow management and versioning, and I think I am very close to getting my wish.
Retractions and Corrections at Scholars Portal Journals
Jessica Hymers & Qinqin Lin, OCUL Scholars Portal
In order to provide accurate and up-to-date access to scholarly research, Scholars Portal is improving our handling of article corrections and retractions. We are an XML-based repository that hosts e-journal content for universities in Ontario as a service of the Ontario Council of University Libraries (OCUL). Our new process uses the JATS metadata element <related-article> to link between articles and their corrections and retractions. This allows us to notify users immediately when there have been changes to the article they are viewing. This is not without challenges: handling articles that have not been registered with a Digital Object Identifier (DOI), for example, and publishers’ inconsistent use of attribute values.
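For readers unfamiliar with the element, the following Python sketch shows one way the links carried by <related-article> could be read out of a JATS file. It is not the Scholars Portal implementation; the sample document, its DOIs, and the function name related_articles are invented for illustration, and the attribute values shown are only examples of the kind publishers supply.

```python
import xml.etree.ElementTree as ET

# Attribute name for xlink:href in ElementTree's {namespace}local form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample; the DOIs are made up.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta>
    <related-article related-article-type="retracted-article"
                     ext-link-type="doi" xlink:href="10.9999/example.1"/>
    <related-article related-article-type="corrected-article"
                     ext-link-type="doi" xlink:href="10.9999/example.2"/>
  </article-meta></front>
</article>"""


def related_articles(xml_text):
    """Return (type, link type, target) for each <related-article>."""
    root = ET.fromstring(xml_text)
    return [
        (
            ra.get("related-article-type"),
            ra.get("ext-link-type"),
            ra.get(XLINK_HREF),  # None when no target is given
        )
        for ra in root.iter("related-article")
    ]


for kind, link_type, target in related_articles(SAMPLE):
    print(kind, link_type, target)
```

Entries whose target comes back as None, or whose type value is not one the repository recognizes, correspond to the challenges named above: articles without a DOI and inconsistent attribute values.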
Knock Down This Wall
C. M. Sperberg-McQueen, Black Mesa Technologies
Life can be comfortable inside a walled garden. But if we wish to engage with the world, we need to knock down those walls.
Feedback
What did you like at Balisage 2023? What could have been better? What changes would you suggest for future Balisage conferences? Tell us what you think.