How to cite this paper

Prescod, Paul, Ben Feuer, Andrii Hladkyi, Sean Paulk and Arjun Prasad. “Auto-Markup BenchMark: Towards an Industry-standard Benchmark for Evaluating Automatic Document Markup.” Presented at Balisage: The Markup Conference 2023, Washington, DC, July 31 - August 4, 2023. In Proceedings of Balisage: The Markup Conference 2023. Balisage Series on Markup Technologies, vol. 28 (2023). https://doi.org/10.4242/BalisageVol28.Prescod01.

Balisage: The Markup Conference 2023
July 31 - August 4, 2023

Balisage Paper: Auto-Markup BenchMark: towards an industry-standard benchmark for evaluating automatic document markup

Paul Prescod

President and Founder

Document Minds

Paul Prescod is the founder of Document Minds, a consultancy specializing in the application of Large Language Models to Markup Technologies.

Ben Feuer

Ph.D. Candidate, Deep Learning

New York University

Ben Feuer is a deep learning researcher in the lab of Prof. Chinmay Hegde. Current areas of interest include computer vision, NLP, and vision-language models like OpenAI’s CLIP.

In particular, Ben’s research focuses on real-world AI performance and consistency when data are scarce, inconsistently labeled or highly diverse.

Ben has a deep background in the arts and humanities, which can be helpful when attempting to communicate about highly complex and technical ideas.

Andrii Hladkyi

Software engineer

Andrii Hladkyi is a software engineer who develops NLP applications. His projects include applications in multiple popular languages as well as in his native Ukrainian.

Sean Paulk

Data Analyst

Western Tidewater Community Services Board

Sean Paulk is a data analyst trying to learn about new technologies in novel ways.

Arjun Prasad

Student

BITS-Pilani, Hyderabad campus

Arjun Prasad is a final year CS undergraduate. He has worked on projects involving NLP applications in software testing and bug detection, as well as ML applications in IoT, and is currently interning at a startup, where he is working with LLMs to build a chatbot.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Abstract

Recent large language models (LLMs) such as GPT, Bard and Claude have achieved impressive progress in comprehending and generating human language. We explore the potential for LLMs to automate markup of semi-structured texts into HTML, DITA, and other related languages. Automated markup could make it possible to move documents into XML at much lower cost. Currently, there is no standard benchmark for evaluating automatic markup systems. Establishing a benchmark for the understanding and generation of structured documents and markup languages can drive innovation, standardize evaluations, identify algorithm strengths and weaknesses, clarify the State of the Art and foster interdisciplinary collaborations. This paper introduces an early version of a benchmark for this purpose.

Table of Contents

Overview
Note on the use of AI in Writing
Prior Work
The Role of AI Benchmarks
The Project
What is Auto-Markup?
Use Cases
Jane’s Use Case - Meeting Notes to HTML for Confluence
Alex’s Use Case - Rough Notes to DITA
Paul’s Use Case: Google Drive to Balisage Paper
Conceptual Challenges
Ambiguity
Scale Challenges
Project Deliverables
Project Scope Restrictions
XML-centric AI Datasets
Metrics: Analogs from other fields
Trade-Offs to Consider
One Proposal: XATER = The XML Translation Edit Rate
XATER Examples
Companion Metric: Validation Error Metric
Experimental Trial Run
Future Project Governance
Other Applications of Markup AI
Conclusion

Overview

“In recent years, large language models (LLMs) and generative text technologies have made significant strides in natural language processing, yielding impressive results in understanding and generating human language.”

— ChatGPT’s humble opinion🤖

One of the barriers to entry for markup systems is the cost of getting documents from unstructured formats into structured formats, usually with a mix of programming and human authoring. AI in general, and Large Language Models in particular, may allow us to build systems that shift the balance further towards full automation.

Long term, the growth of large language models raises the question of whether marked-up documentation will actually decline in relevance. The main argument for that position is that markup exists to help machines understand complex human languages. If machines come to understand human documents as well as humans do, perhaps markup will no longer be necessary, just as it was not necessary before markup was invented, when human scribes transferred documents between contexts. The notations humans developed for their own use (italics, superscript, subscript) will presumably be sufficient for “truly intelligent” machines as well.

In the medium term, however, machine inference is still fairly expensive, slow and not entirely reliable. Programmatic and deterministic-declarative systems will still need structured inputs for many years or decades. Automated markup might play the same role in making data available to structured transformation systems and stylesheets that OCR or speech-to-text plays in making text available to digital computers. Automated markup can also generate a “first draft” for human authors to perfect.

People have implemented various forms of automatic markup for several decades, but there is no standardized way to evaluate them, and therefore no clear sense of what represents the State of the Art or whether we are actually making progress. This paper describes an early version of such a benchmark, created by a team of researchers collaborating as part of the RoundtableML community and releasing tools and datasets through Github.

Note on the use of AI in Writing

Consistent with the themes of this paper, certain (labeled) portions of the document were written by ChatGPT 4. For example, we used it for aspects that summarize the State of the Art, a task well-suited to a machine that has read almost all of the text on the Internet. All such text has been reviewed and edited by our own subject matter experts.

The 🤖 emoji is used to denote content produced by ChatGPT 4.

The XML version of this document was also created with a custom GPT 4-based Auto-Markup engine rather than with an XML editor or traditional transformation tool. Final markup tweaks were done in a text editor.

Prior Work

Researchers have applied machine learning techniques to the Automatic Markup problem since at least 2002 [Akhtar et al., 2002]. More recently, some organizations, such as Docugami [Paoli, 2019] and Innodata, have turned this into a commercial enterprise. So far, however, there has been no standardized way to compare these offerings to each other or to demonstrate progress in the State of the Art.

The Role of AI Benchmarks

“Are we there yet?” Are we even making progress? An AI sub-field lacking metrics is like a vehicle without a speedometer or odometer.

As described in Liao et al., 2021:

Benchmarking was popularized in machine learning in the 1980s through the UCI dataset repository and challenges sponsored by DARPA and NIST. Since then, benchmark evaluations have become the core of most empirical machine learning papers. The impact of benchmarking is illustrated by the ImageNet competition, which seeded much of the excitement in machine learning since 2010. Winning entries such as AlexNet and ResNets have become some of the most widely cited papers across all sciences.

These stories illustrate the influential role of benchmarks in the AI research community, pushing the boundaries of what is possible and fostering innovation across various domains.

Creating an AI benchmark for the understanding and generation of structured documents and markup languages would have several benefits:

  1. Encourage innovation: A benchmark would stimulate research and development in AI-driven solutions for processing, understanding, and generating structured documents and markup languages. The competitive nature of benchmarks often inspires researchers to develop novel techniques and algorithms to outperform existing solutions.

  2. Standardize evaluation: A benchmark would provide a standardized framework for evaluating and comparing the performance of AI models and systems in handling structured documents and markup languages. This would enable the research community to measure progress and identify the most effective approaches.

  3. Identify strengths and weaknesses: A benchmark would help researchers identify the strengths and weaknesses of various AI algorithms and techniques when applied to structured documents and markup languages. By understanding these strengths and weaknesses, researchers can focus on improving specific aspects of their models, leading to more robust and efficient solutions.

  4. Foster collaboration: A well-designed benchmark would facilitate collaboration among researchers, developers, and practitioners in the markup language community. Sharing ideas, techniques, and best practices would lead to the development of more effective AI solutions for structured documents and markup languages.

  5. Improve AI-driven applications: A benchmark would drive advancements in AI technology that can be applied to real-world use cases involving structured documents and markup languages. For example, AI models that excel in understanding and generating structured documents could enhance content management systems, data extraction tools, and document processing applications.

  6. Expand the AI research scope: Establishing a benchmark focused on structured documents and markup languages would broaden the scope of AI research, connecting the field with the Balisage community and other experts in markup languages. This interdisciplinary collaboration could lead to new insights and innovations in both AI and markup language domains.

The ideal situation would be if OpenAI, Google, Anthropic and other industry leaders would adopt our benchmarks as goals in the training of their Foundation Models, which would mean that GPT-5, Claude 2, etc. might be trained with a deep understanding of markup and schemas.

The Project

An international group of industry engineers and academics from the RoundtableML community has come together and initiated this research direction. Our work is open source and we look forward to collaborating with others.

We have undertaken the following steps, and this document represents a checkpoint in our progress.

  1. “Defined the tasks: We outlined the specific tasks related to structured document understanding and generation that the AI models should perform.

  2. Established evaluation metrics: We determined suitable evaluation metrics for each task to capture the accuracy, efficiency, and quality of the AI models’ performance.

  3. Created a standardized evaluation framework: This paper outlines a standardized evaluation framework that allows researchers to test their AI models against the tasks and datasets, and compare their performance using the established metrics. This framework should be easily accessible and widely adopted by the community to ensure meaningful comparisons and consistent progress tracking.”🤖

  4. Collected and curated datasets: We gathered a diverse set of structured documents, including various markup languages such as Docbook, DITA, TEI, JATS, XHTML and many others. The examples span different domains like academic papers, technical documentation, web pages, and more. We ensured that the dataset covers a wide range of complexities and styles. We also annotated and labeled the data as needed for the defined tasks. This includes providing the ground truth for content extraction or the correct markup code for generation tasks.

  5. Encourage participation and collaboration (ongoing): We are promoting the benchmark among AI researchers, developers, and practitioners working on structured documents and markup languages. We could encourage participation in the benchmark by hosting competitions, workshops, and conferences, and by providing incentives such as prizes, recognition, or publication opportunities.

  6. Update and maintain the benchmark: We should periodically update the benchmark to reflect advancements in AI models, markup languages, and real-world use cases. This may include adding new tasks, expanding the dataset, or refining the evaluation metrics. Ensuring the benchmark remains relevant and challenging will drive continuous innovation and progress in the field.

  7. Share results and findings: We will encourage participants to share their results, insights, and techniques, fostering collaboration and knowledge exchange within the community. This will help identify the most effective AI models and approaches for structured document understanding and generation, driving further advancements and practical applications.🤖

This paper discusses our progress on Steps 1 through 5. As indicated by the emojis, much of this project plan was formulated in collaboration with ChatGPT!

What is Auto-Markup?

We define Auto-Markup as an automated process that can consume documents and produce structured XML (or equivalent) markup in an autonomous or mostly autonomous fashion.

Given:

  • A declarative schema

  • An (X)HTML, Word, RTF or Plain Text document which semantically “matches” the schema

  • Zero or more example transformations

  • Optional prose English “guidance”

Generate:

  • An XML or equivalent markup document that applies the schema to the document in a manner comparable to what a human would do.

We call these tools Auto-Markup engines and we anticipate a few categories of them:

  • Narrow versus General Auto-Markup Engines: A narrow Auto-Markup engine might have innate knowledge of a specific schema. Such a narrow system might use human knowledge to out-perform a generalized Auto-Markup system in the short term, although Rich Sutton’s “The Bitter Lesson” suggests that such performance gaps will eventually disappear.

  • Trained Auto-Markup Engines might learn their tasks from a large number of examples of input and output text. This is typically done by fine-tuning an existing base model.

  • Few-shot or zero-shot engines might learn from instructions rather than voluminous examples. In the not too distant future, such a system might simply read the same schema documentation that a human does.

We anticipate that just as with tools that translate between human languages, different Auto-Markup tools will vary in their performance based on the input/output pair. For example, a tool that is specialized in HTML to DITA might do a poor job converting Plain Text to Docbook.

Building specific benchmarks for every variant described above would be a monumental task, so we assume that the model has already been trained, prompted or otherwise taught how to deal with specific input and output formats and we do not include the efficiency of training or prompting as part of the benchmark. This is similar to how human language translation metrics are used.

Use Cases

We identified several user stories and use cases to help us ground our thinking. Here are a few examples.

Jane’s Use Case - Meeting Notes to HTML for Confluence

Jane, a diligent project manager at a software company, had a crucial meeting with her team to discuss upcoming product features. She jotted down the meeting notes, but they were plain text and looked quite disorganized. Knowing she had to share these notes with higher-ups and other team members, she decided to convert them into HTML format for a professional appearance, which could be easily inserted into emails, uploaded to Confluence, or integrated into Google Docs.

Here are Jane’s meeting notes in Plain Text:

    Meeting Notes - 9 June 2023
    New Feature Discussion
    People: 
    Jane - Project Manager
    Tom - Developer
    Sarah - Designer
    Agenda:
    - Talk about new features
    - Finalize the design
    Discussion:
    1. Jane started the meeting.
    2. Tom suggested a new search bar - link: www.example.com/searchbar
    3. Sarah showed new design mockup - image at: C:/images/design.png
    4. Team agreed to work on search bar and new design.
    Next steps:
    - Tom to create a prototype.
    - Sarah to finalize designs.
    Meeting concluded at 3:00 PM.

Jane can use Auto-Markup to convert these notes into HTML with headings, bullet points, hyperlinks, and embedded images, ensuring that the information is well-structured and aesthetically appealing for the recipients.

<!DOCTYPE html>
<html>
  <head>
    <title>Meeting Notes - 9 June 2023</title>
  </head>
  <body>
    <h1>Meeting Notes - 9 June 2023</h1>
    <h2>New Feature Discussion</h2>
    <h3>People:</h3>
    <ul>
      <li>Jane - Project Manager</li>
      <li>Tom - Developer</li>
      <li>Sarah - Designer</li>
    </ul>
    <h3>Agenda:</h3>
    <ul>
      <li>Talk about new features</li>
      <li>Finalize the design</li>
    </ul>
    <h3>Discussion:</h3>
    <ol>
      <li>Jane started the meeting.</li>
      <li>Tom suggested a new search bar - <a href="http://www.example.com/searchbar">link</a></li>
      <li>Sarah showed new design mockup - image at: C:/images/design.png</li>
      <li>Team agreed to work on search bar and new design.</li>
    </ol>
    <h3>Next steps:</h3>
    <ul>
      <li>Tom to create a prototype.</li>
      <li>Sarah to finalize designs.</li>
    </ul>
    <p>Meeting concluded at 3:00 PM.</p>
  </body>
</html>

Alex’s Use Case - Rough Notes to DITA

Alex is a technical writer for a company that develops software for medical devices. Alex is working on a new user guide for a groundbreaking medical device that the company is about to launch. The company recently adopted the DITA (Darwin Information Typing Architecture) standard for their documentation to facilitate content reuse, improve consistency, and streamline the localization process.

He starts by writing the content in plain text, as it allows him to focus on the information without getting bogged down by the structure and formatting.

Here’s how a section of Alex’s user guide looks in plain text before converting it to DITA:

Title: Using the Heart Rate Monitor
Introduction:
The heart rate monitor helps doctors to accurately measure patients’ heart rates in real-time. 
Prerequisite: Ensure the device is sanitized before use.
Steps:
1. Turn on the device by pressing the power button.
2. Place the monitor on the patient’s finger.
3. Wait for the device to calibrate.
4. Read the heart rate on the device screen.

After writing the content, Alex uses a specialized tool to convert his plain text into DITA format.

Here’s how the content might look in DITA:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<task id="heart_rate_monitor">
  <title>Using the Heart Rate Monitor</title>
  <shortdesc>The heart rate monitor helps doctors to accurately measure patients’ heart rates in real-time.</shortdesc> 
  <taskbody>
    <prereq>Ensure the device is sanitized before use.</prereq>
    <steps>
      <step><cmd>Turn on the device by pressing the power button.</cmd></step>
      <step><cmd>Place the monitor on the patient’s finger.</cmd></step>
      <step><cmd>Wait for the device to calibrate.</cmd></step>
      <step><cmd>Read the heart rate on the device screen.</cmd></step>
    </steps>
  </taskbody>
</task>

By converting his content into DITA, Alex is able to structure the information in a way that’s consistent across all product documentation. Moreover, it facilitates content reuse, as parts of the documentation can be easily repurposed for other products or output formats. This streamlined approach is highly valued by his team and the company, as it greatly enhances the efficiency and quality of their technical documentation.

Paul’s Use Case: Google Drive to Balisage Paper

Paul is a research analyst specializing in document processing and structured content. He is passionate about automating the conversion of plain text documents into structured formats. Recently, he worked with a team of open source developers to create a benchmark tool named Auto-Markup BenchMark for evaluating different auto-markup software. His tool aims to standardize the evaluation process and help in making informed decisions regarding the selection of auto-markup tools. Paul decides to present his findings at the Balisage conference.

Paul initially drafts his paper in Google Drive because it allows him to easily structure the content, collaborate in real-time with co-authors, and make use of Google Drive’s version history to track changes. And it’s free!

Here’s how a section of Paul’s paper looks after exporting as plain text, but before converting it to the Balisage conference DTD:

    Auto-Markup BenchMark: towards an industry-standard benchmark for Evaluating Automatic Document Markup

    Abstract:

    Large language models (LLMs) have greatly advanced in comprehending and generating human language. Markup remains crucial for structured inputs in publishing systems due to the high costs and unreliability of machine inference. Automated markup could bridge AI systems and unstructured text systems with structured publishing software by making it possible to move documents into XML at much lower cost. Currently, there is no standard way to evaluate these automatic markup systems. Benchmarks are always vital in AI development as they provide standardized datasets and evaluation metrics. Establishing a benchmark for the understanding and generation of structured documents and markup languages can drive innovation, standardize evaluations, identify algorithm strengths and weaknesses, clarify the State of the Art and foster interdisciplinary collaborations. This paper introduces an early version of a benchmark for this purpose.

    Overview

    In recent years, large language models (LLMs) and generative text technologies have made significant strides in natural language processing, yielding impressive results in understanding and generating human language.

    In the very long run, …

To submit his paper to the Balisage conference, Paul converts his Google Drive document into the required Balisage conference tagset format.[1]

Conceptual Challenges

There are many challenges that arise in defining a metric for such a fuzzy task. If “correctness” were well-defined and clear, we probably would not need AI to help with it: we could simply use normal imperative code.

Here are some of the subjective issues we must address:

  1. What constitutes reasonable input? A string of random characters cannot be turned into any useful form of HTML or DITA. Neither a human nor a tool would know where to put the tags. So there must be some boundary around the input.

  2. What constitutes correct output? Putting an entire document in a paragraph would pass a validator, but it would not be useful output.

  3. Given that few answers will be exactly perfect, how do we represent imperfection numerically? If we use a 1-to-10 scale, what does “3” signify mathematically and in terms of real-world implications?

Ambiguity

A key challenge is the question of ambiguous labeling. What if two different tags are equally valid for a text run, or if one tool decides to insert an optional attribute that another might ignore? Options we explored are:

  • Simple grading such that a tool might be inappropriately penalized but hopefully not systematically enough to change its mark significantly

  • Fuzzy grading that tries to identify what constitutes “good enough”, e.g. with regular expressions, wildcards, optional alternatives

Human evaluation (rating) is another option, although it is too expensive to scale to the largest systems.

To a certain extent, this challenge can be addressed through normalization. For example, whitespace normalization is an obvious way to reduce the space of potential answers. One could also decide that certain tags are similar enough that they can be treated as equivalent in this context, although this requires schema-specific tweaking of the metrics, which introduces another point of subjectivity and perhaps disagreement.
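As a concrete illustration of the whitespace point, a normalization pass over text nodes takes only a few lines. The sketch below is our own illustration, not part of the benchmark tooling; it collapses runs of whitespace with lxml before documents are compared.

    from lxml import etree

    def normalize_whitespace(xml_bytes: bytes) -> bytes:
        """Collapse runs of whitespace in text nodes so that purely
        cosmetic differences do not count against an engine.
        (Illustrative sketch; real tooling may normalize more.)"""
        root = etree.fromstring(xml_bytes)
        for node in root.iter():
            if node.text is not None:
                node.text = " ".join(node.text.split()) or None
            if node.tail is not None:
                node.tail = " ".join(node.tail.split()) or None
        return etree.tostring(root)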

Supplying multiple “correct” reference documents is another standard way of dealing with this issue and is implemented in our tooling.

Scale Challenges

Running metrics on large documents presents two main challenges:

The simpler one is that many metrics depend on quadratic algorithms which scale poorly with document size.

The more subtle one is a combinatorial explosion of alternatives. If there are M legitimate ways to mark up the first section of a document, N ways for the second, and O, P and Q ways for the third, fourth and fifth, then the number of possible representations for the document is M*N*O*P*Q. Shorter documents are less prone to this explosion of alternatives.

For now, we have focused our attention on shorter documents.

Project Deliverables

Our project consisted of building a suite of open-source tools that can be used in the evaluation and training of markup engines:

  1. markup-metrics is a test harness for applying auto-markup engines and metric engines to test suites. It includes a small prototype open-source Auto-Markup Engine as well as some prospective metric engines.

  2. the-xml-document-stack is a virtual collection of many gigabytes of XML documents, together with tools for filtering them down to those that are useful for a given project and for transforming them.

  3. auto-markup-training-toolkit is a set of tools for generating sample document sets, including the Messy Markdown Generator, which is described below.

In order to make the project tractable, we have scoped it to a few kinds of inputs and outputs.

For our input we use plain text, because it is the most universal format and can be extracted from almost any other format. By “plain text” we mean the kind of text that humans might write in a text editor for the consumption of other humans rather than tools. We call this variant of Plain Text “Messy Markdown” because it is similar to Markdown (in fact it was the inspiration for Markdown). Messy Markdown is necessarily much fuzzier, broader and more mutable than “real” Markdown. After all, the process of converting real, strict Markdown into HTML is well-understood. It is the messy output of copy and paste into Notepad, or of typing directly into Notepad, which requires AI.

For our output we use HTML and DITA to take advantage of the varying knowledge sets of project participants.

Adding HTML as an input format is a logical next step in our project:

Table I

                     HTML Output    DITA Output
Plain Text Input     In Scope       In Scope
HTML Input           No-op          Future Work

Project Scope Restrictions

In order to keep the project tractable, and in recognition that Auto-Markup Engines need to walk before they run, there are certain aspects of the problem which we do not intend to address in the current project:

  • Componentization: this implies generating reusable content objects, or splitting topics into parts to be included in a DITA map or equivalent. This kind of work is certainly within the scope of Markup AI, but it is not part of our 2023 definition of “Auto-Markup”. As the field evolves, we may move this from being an optional add-on or post-processing feature into the heart of “Auto-Markup”.

  • Internal linking: it remains to be seen whether it is practical for current Auto-Markup tools to infer relationships and generate link elements. Obviously if one is converting from HTML to DITA then this would be expected, and perhaps even from Markdown. But if up-converting from raw text, this would require a form of whole-document semantic evaluation which may be beyond the capabilities of any 2023-era Auto-Markup tools, except as an optional add-on or post-processing feature.

  • Metadata: another whole-document analysis question would be document summarization into metadata or indexing elements.

  • Re-Ordering: re-ordering elements to match a schema is likely beyond what current or near-future Auto-Markup engines will be capable of.

As the field evolves, we will re-evaluate support for these features.

XML-centric AI Datasets

Our open source team has started the process of constructing a massive dataset of open source XML documents suitable for both training and evaluating XML AIs. It is frequently easier to generate an input/output pair for evaluation by starting with the output (e.g. DITA) because it is simple to generate plain text from DITA whereas the opposite is hard (until Auto-Markup is mainstream and reliable).
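The easy direction can be approximated with a few lines of standard tooling. The sketch below is ours, not the project’s actual generator (which emulates “Messy Markdown” more carefully); it simply strips the markup from a DITA file to yield plain text.

    from lxml import etree

    def dita_to_plain_text(path: str) -> str:
        """Crude text extraction: concatenate all text nodes of a DITA file
        and collapse whitespace. Real training-data generation is more
        elaborate, but this shows why this direction is the easy one."""
        tree = etree.parse(path)
        return " ".join("".join(tree.getroot().itertext()).split())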

Our dataset consists of a mix of newly-written documents and documents repurposed from other projects.

The repurposed portion consists of open source documents from Github via The Stack:

Table II

Directory      Number of Files    Data Size (MB)
xml/dita       67,303             237.03
xml/html       138                4.33
xml/jats       10,691             834.20
xml/tei        1,934              323.15
xml/docbook    16,969             301.15

We have filtered out non-document-oriented XML such as configuration files and Ant build files.

The code for assembling this data is at https://github.com/prescod/the-xml-document-stack

Some of the properties we will use to manage the dataset as it grows are:

  1. Relevance: The dataset should contain diverse examples of text that are representative of real-world scenarios and target the specific markup task, such as DITA, DocBook, business documents and transaction documents.

  2. Size and diversity: The dataset should be large enough to cover a wide range of markup scenarios, styles, and complexities. It should include a variety of document types, content domains, and structures, ensuring that the LLM can generalize across different markup tasks.

  3. Quality and correctness: The dataset must be accurate and free of errors in both the raw text and the corresponding markup. High-quality ground truth annotations are essential for training and evaluating the model.

  4. Balanced distribution: The dataset should provide a balanced distribution of examples, ensuring that it covers various markup patterns and does not over-represent certain features. This can help avoid biases in the evaluation process.

  5. Variability in difficulty: The dataset should include examples with different levels of difficulty, ranging from simple to complex markup scenarios. This helps in assessing the LLM’s capability to handle a wide range of markup challenges.

  6. Up-to-date and dynamic: The dataset should be regularly updated and expanded to reflect the evolving nature of markup tasks, as well as to maintain its relevance and usefulness in evaluating LLMs.

  7. Accessibility and openness: Ideally, the dataset should be publicly available and licensed for research purposes, facilitating comparisons across different LLMs and promoting collaboration among researchers.

  8. Comprehensive documentation: The dataset should be accompanied by comprehensive documentation that provides detailed information on its collection process, annotation guidelines, and any preprocessing steps. This ensures transparency and reproducibility in training and evaluation processes.

The project has not yet organized the dataset meaningfully. When it does, the dataset should be segmented into three parts:

  1. Training data: the biggest portion, available for training models

  2. Eval: a portion which is held back using social conventions, to evaluate models in a public and transparent way. This eval portion should be refreshed frequently to guard against memorization, whether deliberate or accidental. This will require us to implement a versioning strategy which allows access to old and new evaluations.

  3. Hidden: a portion which is impossible for models to accidentally crawl, and perhaps literally secret so that even most humans are restricted from seeing it, to minimize cheating. This hidden portion should also be refreshed frequently to guard against memorization

Metrics: Analogs from other fields

There are two fields of Artificial Intelligence that seem similar to the Auto-Markup problem. Studying these fields allows us to transfer decades of learning to our own problem.

The first is Named Entity Recognition, the field of Natural Language Processing that categorizes phrases in text. In our case, the entities would be identified so that they can be marked up.

The other is Machine Translation, wherein text is translated from one human language to another. In our case, the languages are “Unstructured Text” (or some more specific variant) and “Structured XML” (in some specific vocabulary).

“Named Entity Recognition (NER) is a subtask of information extraction in natural language processing (NLP) that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages.

Evaluating the performance of NER systems is crucial for understanding how well they’re functioning and where they might need improvement. Here are several common methods used for evaluating NER systems:

  • Precision is the ratio of correctly predicted positive observations to the total predicted positives. High precision indicates a low false-positive rate.

  • Recall (Sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual class. High recall indicates a low false-negative rate.

  • The F1 Score is the harmonic mean of Precision and Recall. It tries to find the balance between precision and recall.

For each of these metrics, you count the number of correct positive predictions (true positives), incorrect positive predictions (false positives), and incorrect negative predictions (false negatives), and then use these counts to calculate precision, recall, and F1 score.”🤖
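In code, with marked-up spans treated as predicted entities, these three scores reduce to simple counting. A generic sketch (not tied to any particular NER library):

    def precision_recall_f1(true_positives, false_positives, false_negatives):
        """Standard NER-style scores computed from raw counts."""
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # Hypothetical counts: 80 tags predicted correctly, 20 spurious, 10 missed.
    print(precision_recall_f1(80, 20, 10))  # (0.8, 0.888..., 0.842...)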

Rather than think of XML markup as an entity recognition task, we could instead think of it as a generative text task, akin to translating between languages or summarizing text: generating a new document from the old one.

Viewed this way, as a content generation or transformation problem, reference metrics could be repurposed from quantitative scores used to assess machine translations and summarization:

BLEU (Bilingual Evaluation Understudy):

“BLEU is an evaluation metric primarily used for assessing the quality of machine-generated translations. It compares the generated translation to one or more human reference translations, quantifying their similarity based on the presence of shared n-grams (sequences of n words). The BLEU score ranges from 0 to 1, with 1 representing a perfect match with the reference translation.”🤖

ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

“ROUGE is a recall-oriented metric that is often used in tasks like text summarization. It measures how many n-grams in the reference translations match with the n-grams in the machine-generated translations. There are several types of ROUGE scores, including ROUGE-N (which is similar to BLEU but focuses on recall rather than precision), ROUGE-L (which considers sentence level structure similarity based on longest common subsequence), and ROUGE-S (which considers skip-bigram statistics). Like BLEU, ROUGE scores also range from 0 to 1.”🤖

Translation Edit Rate:

“Translation Edit Rate (TER) is a metric for machine translation that measures the amount of editing that a human would have to perform to change a system output into a correct translation.

In other words, TER is the minimum number of edits required to change a hypothesis so that it exactly matches a reference. The edits can be insertion, deletion, substitution of words, or shifting of word sequences. The rate is then calculated as the number of edits divided by the total number of words in the reference. The lower the TER, the closer the machine-generated translation is to the reference translation, and hence, the better the translation quality is considered to be.”🤖

In code:

    ter = number_of_edits / length_of_reference_text_in_tokens
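For instance, a token-level TER can be computed with the pyter package, the same package our XATER metric builds on below; the sentences here are invented purely for illustration.

    import pyter

    # Hypothetical sentences; pyter.ter takes hypothesis tokens first,
    # mirroring the XATER calculation shown later in this paper.
    reference = "place the monitor on the patient's finger".split()
    hypothesis = "place the monitor onto the finger of the patient".split()

    # Edits needed to turn the hypothesis into the reference,
    # divided by the reference length. Lower is better.
    print(pyter.ter(hypothesis, reference))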

Levenshtein distance:

“The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. It is a very commonly used tool in computer science, beyond NLP.”🤖

Trade-Offs to Consider

None of these metrics is perfect even within its own domain, and the problems compound when they are imported into the domain of Auto-Markup.

When modeled as a classification problem, Auto-Markup for most commonly used languages is heavily class-imbalanced: certain tags occur much more often than others, and the rest form a “long tail”.

Therefore, metrics such as accuracy and weighted F1, which are dominated by the most frequent tags, can lead to unrealistically optimistic views of model performance. Macro-F1 addresses this limitation by weighting every class equally, but it can be unrealistically pessimistic if long-tail tags are less important when evaluating the overall performance of the system.
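The gap between the two averaging schemes is easy to reproduce with scikit-learn on an invented, imbalanced toy tag set (the labels and counts below are purely illustrative):

    from sklearn.metrics import f1_score

    # 18 common <p> tags and 2 rare <codeblock> tags; the hypothetical model
    # labels every span as <p>, so it misses both rare tags.
    y_true = ["p"] * 18 + ["codeblock"] * 2
    y_pred = ["p"] * 20

    # Harsh: the rare-class miss counts fully.
    print(f1_score(y_true, y_pred, average="macro", zero_division=0))
    # Rosy: the frequent class dominates the average.
    print(f1_score(y_true, y_pred, average="weighted", zero_division=0))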

Another challenge we encounter when modeling auto-markup as a classification problem is the existence of open-set attributes such as the href attribute of the a tag and the src attribute of the img tag in HTML. These attributes cannot be modeled using closed-set classification in a straightforward way (although classification error could be augmented with use of some secondary metric, such as character-wise Levenshtein distance).

The limitations of TER, BLEU and ROUGE depend on what we intend them to measure.

TER, BLEU and ROUGE are designed to work at the level of n-grams or complete entities; since most tags are singular entities, if you sum up the discrete measurements for each tag, these metrics will behave much the same as accuracy when applied to single words.

For example: If you are evaluating the BLEU score of a one-word sentence against a reference one-word sentence using unigrams (single words), then the BLEU score would be either one if the words match (i.e., the generated word is the same as the reference word) or zero if they do not match.

If the system output and the reference are the same word, no edits are required, so the TER would be zero (indicating a perfect match). If the words are different, one substitution is required, so the TER would be one (one edit / one word = 1).

If we use these metrics naively at the level of the document, then the metric’s mass will concentrate in the text (which is supposed to be copied from one document to another) rather than the markup (which is what we aim to measure). And yet, we cannot simply ignore the text, because there is no guarantee that the engine (especially an LLM) will not modify it in the course of generating the markup.

A prosaic concern is implementation efficiency. Many of these algorithms are designed for comparing sentences to other sentences and their implementations are quadratic or even cubic in time complexity. When applied to big documents, they can grind to a halt. This is unfortunate, because models themselves may have poor scaling behaviors which we would like to probe with our tests.

Finally, the efficacy of our auto-markup metric will depend upon the tokenization strategy we use. Because markup tags are not formatted like words, typical tokenization strategies like byte-pair encoding will break them up into small blocks of characters. But if we use TER on partial-tag tokens, we will wind up more heavily penalizing some tags than others, depending on how they are tokenized. For instance, if the tag <img> is tokenized as (<) (im) (g) (>), and GPT produces instead the tag <a>, which is then tokenized (<) (a) (>), this mistake will be penalized more heavily than if GPT produces a non-existent tag such as <ims>, tokenized (<) (im) (s) (>).
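The effect is easy to observe with any sub-word tokenizer. The probe below uses OpenAI's tiktoken library as an example; the library choice is ours and the exact splits depend on the encoding, so treat it as illustrative only.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4-family encoding
    for tag in ["<img>", "<a>", "<ims>"]:
        pieces = [enc.decode([token]) for token in enc.encode(tag)]
        print(tag, pieces)

    # A tag that splits into more sub-word pieces accumulates more edit
    # operations under a token-level TER, so a mistake on it is penalized
    # more heavily than a mistake on a tag that tokenizes compactly.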

One Proposal: XATER = The XML Translation Edit Rate

At the time of publishing, the metric that we have used most often is XATER, a custom variant of Translation Edit Rate that processes the output of a custom tokenizer. The XATER tokenizer normalizes away unambiguously meaningless, or likely meaningless, differences in markup.

In particular, our tokenizer recognizes the following types of tokens:

  • Start-tag

  • Attributes, as separate tokens, sorted alphabetically

  • The value of ID attributes is ignored so that all values are equally valid

  • End-tag

  • Whitespace-separated word

The handling of words is configurable: if you split text nodes on words, then markup engines that leave words alone get high marks and ones that mangle the words get low marks. But given that the default situation is that most auto-markup systems do NOT mangle words, this causes a kind of “grade inflation”.

XATER uses a different scale than TER. TER scores are normally in the range of 0 to 1 with 0 being better (no edits) and 1 being worse (virtually everything had to change). If the reference and hypothesis documents are extremely different in length then TER can even be greater than 1.

XATER flips the scale and shifts it to a percentage so that the best score is 100 and in extreme cases of mismatched length it can be a negative number.

The XATER calculation is merely:

    return 100 - pyter.ter(input.hypothesis_tokens, input.reference_tokens) * 100

The Tokenizer is a SAX ContentHandler containing about 30 lines of code.
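A stripped-down sketch of such a tokenizer, together with the surrounding score calculation, might look like the following. The names and details are ours and the project's actual ContentHandler differs, but the token categories match the list above.

    import xml.sax
    import pyter

    class XaterTokenizer(xml.sax.ContentHandler):
        """Emit one token per start-tag, per attribute (sorted, with ID
        values blanked), per end-tag, and per whitespace-separated word."""

        def __init__(self):
            super().__init__()
            self.tokens = []

        def startElement(self, name, attrs):
            self.tokens.append(f"<{name}>")
            for key in sorted(attrs.getNames()):
                value = "" if key.lower() == "id" else attrs.getValue(key)
                self.tokens.append(f"@{key}={value}")

        def endElement(self, name):
            self.tokens.append(f"</{name}>")

        def characters(self, content):
            self.tokens.extend(content.split())

    def tokenize(xml_text: str) -> list:
        handler = XaterTokenizer()
        xml.sax.parseString(xml_text.encode("utf-8"), handler)
        return handler.tokens

    def xater(hypothesis_xml: str, reference_xml: str) -> float:
        """100 means a perfect match; wildly mismatched lengths can push
        the score below zero, mirroring plain TER."""
        return 100 - pyter.ter(tokenize(hypothesis_xml), tokenize(reference_xml)) * 100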

XATER Examples

Consider the following example (based on a sample created by Jordan Stanchev).

Input:

    ** How to Start the Calculator App **
    Here you will find information on how to start the Calculator app on your mobile phone.
    Prerequisite: To use the calculator app, you need to unlock your phone first.
    Steps:
    1.  Navigate to the Calculator application icon.
    2.  Tap the icon.
    The application starts. You are now ready to calculate with it.

Human-Authored Reference Output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<task id="start-the-calculator">
    <title>How to Start the Calculator App</title>
    <shortdesc>Here you will find information on how to start the Calculator app on your mobile phone.</shortdesc>
    <taskbody>
        <prereq>
            <p>To use the calculator app, you need to unlock your phone first.</p>
        </prereq>
        <steps>
            <step>
                <cmd>Navigate to the Calculator application icon.</cmd>
            </step>
            <step>
                <cmd>Tap the icon.</cmd>
            </step>
        </steps>
        <result>
            <p>The application starts. You are now ready to calculate with it.</p>
        </result>
    </taskbody>
</task>

Using a prototype, open-source auto-markup engine based on GPT 4, we can get the following, very similar result:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<task id="start-calculator-app">
    <title>How to Start the Calculator App</title>
    <shortdesc>Here you will find information on how to start the Calculator app on your mobile phone.</shortdesc>
    <taskbody>
        <prereq>To use the calculator app, you need to unlock your phone first.</prereq>
        <steps>
            <step><cmd>Navigate to the Calculator application icon.</cmd></step>
            <step><cmd>Tap the icon.</cmd></step>
        </steps>
        <result>The application starts. You are now ready to calculate with it.</result>
    </taskbody>
</task>

The XATER metric gives this a score of 86.96%.

Here are some of the differences:

  • The ID is different, but that’s not counted against the engine.

  • Whitespace also varies, but the engine is not penalized for that.

  • The original uses paragraphs inside of the prereq. This is (arguably unfairly) counted against the model and is a good example where we might need more normalization or alternate outputs for comparison.

Still, the score is pretty good. Compare this with the result from a comparable GPT 3.5-based engine.

It generates this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="how-to-start-calculator-app">
    <title>How to Start the Calculator App</title>
    <shortdesc>Here you will find information on how to start the Calculator app on your mobile phone.</shortdesc>
    <body>
        <section>
            <title>Prerequisite</title>
            <p>To use the calculator app, you need to unlock your phone first.</p>
        </section>
        <section>
            <title>Steps</title>
            <ol>
                <li>
                    <p>Navigate to the Calculator application icon.</p>
                </li>
                <li>
                    <p>Tap the icon.</p>
                </li>
            </ol>
        </section>
        <section>
            <title>Result</title>
            <p>The application starts. You are now ready to calculate with it.</p>
        </section>
    </body>
</topic>

The metric scores this at 28.26% because it is dramatically different from the reference text and is a less reasonable encoding.

Companion Metric: Validation Error Metric

The Validation Error Metric is based on a normalized count of well-formedness and validation errors. It uses the popular open source lxml library to do the counting in a royalty-free and portable manner.

The scoring is as follows:

    # Calculate the score based on the percent of correct tags
    total_errors = num_wf_errors + num_dtd_errors
    good_elements = total_elements - total_errors
    # in case any elements generated multiple errors, 
    # we might have a negative number
    good_elements = max(0, good_elements)
    good_tags_ratio = good_elements / total_elements
    # flip it back to a zero to 100 score with better scores being better
    score = good_tags_ratio * 100

Roughly speaking it is the ratio of “correct” elements to “total” elements, but if each element generates more than one error then the algorithm could theoretically generate a score of less than zero. The code above limits the minimum score to zero, however.
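For reference, the raw counts that feed this formula can be gathered with lxml roughly as follows. This is a sketch under the assumption that a local DTD file is available; the project's metric code handles additional edge cases.

    from lxml import etree

    def count_errors(xml_bytes: bytes, dtd_path: str):
        """Return (well-formedness errors, DTD validation errors,
        total element count) for one candidate document."""
        parser = etree.XMLParser(recover=True)   # keep parsing past errors
        root = etree.fromstring(xml_bytes, parser)
        num_wf_errors = len(parser.error_log)

        dtd = etree.DTD(dtd_path)
        dtd.validate(root)
        num_dtd_errors = len(dtd.error_log)

        total_elements = len(root.xpath("//*"))
        return num_wf_errors, num_dtd_errors, total_elements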

Experimental Trial Run

We have implemented three Prototype Engines so far.

The dummy_automarkup engine serves to establish a lower bound: it preserves the text properly but gets all of the markup wrong.

The gpt3.5_am1_automarkup uses the gpt-3.5-turbo model through the OpenAI API and simplistic prompting to auto-markup simple DITA tasks. The gpt4_am1_automarkup uses the gpt-4 model.
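As an indication of what “simplistic prompting” means here, a minimal engine along these lines might look like the sketch below. It uses the pre-1.0 openai Python client; the prompt wording and function name are ours, not the actual gpt3.5_am1_automarkup code.

    import openai  # pre-1.0 client, i.e. openai.ChatCompletion

    def auto_markup(plain_text: str, model: str = "gpt-3.5-turbo") -> str:
        """Ask the model to wrap the given plain text in DITA task markup."""
        prompt = (
            "Convert the following plain text into a valid DITA task document. "
            "Preserve the wording exactly; only add markup.\n\n" + plain_text
        )
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response["choices"][0]["message"]["content"]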

Additionally, we assembled a test suite using sample Tasks from the DITA Toolkit. Each task was transformed into a variety of plain text formats, such as Markdown, Pandoc Plain Text, Emacs OrgMode and several variants of Messy Markdown, as described in section “Project Deliverables”. A total of 24 DITA tasks were represented in 288 plain-text-ish input files. The goal of each Auto-Markup engine was to re-create the DITA task given only a single plain text file.

Our three simplistic prototypes are each less than 60 lines of code and do not represent the State of the Art. Nevertheless, we can hypothesize that if they behave as we expect, and the metrics do as well, then gpt4_am1_automarkup should score better than gpt3.5_am1_automarkup, which should in turn beat dummy_automarkup.

This is in fact what we see in the scores:

Table III

Markup Engine            xater_metric    validation_error_metric
dummy_automarkup         14.94%          0.00%
gpt3.5_am1_automarkup    38.80%          98.98%
gpt4_am1_automarkup      70.56%          99.61%

As a reminder, xater_metric tests whether the engine matches a reference text, and validation_error_metric tests whether it generates well-formed and valid XML.

The GPT 4-based engine did a reasonable job, with a reference-matching score of 70.56% and 99.61% for DTD-validity. As an example of its flaws, it did not know how to mark images up, because its extremely limited training data did not have images.

Another failure mode is unique to Generative Large Language Models. When presented with the following ambiguous Task, the LLMs may sometimes 'hallucinate' or generate the 'missing' content (the steps)!

Workbench.txt:

    # Organizing the workbench and tools
    Keep your garage neat by organizing the things you keep in the garage.

In one case, the engine garnered a negative score, because it generated this text for the above:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<task id="workbench-organization">
    <title>Organizing the workbench and tools</title>
    <shortdesc>Keep your garage neat by organizing the things you keep in the garage.</shortdesc>
    <taskbody>
        <context>Keeping your garage organized not only makes it easier to find what you need, but it also creates a safer work environment.</context>
        <steps>
            <step><cmd>Sort all items in your garage into categories.</cmd></step>
            <step><cmd>Decide on a place for each category on your workbench or in your tool cabinet.</cmd></step>
            <step><cmd>Place each item in its designated spot.</cmd></step>
        </steps>
        <result>Your workbench and tools are now organized!</result>
        <postreq>Make sure to return each item to its designated spot after use to maintain organization.</postreq>
    </taskbody>
</task>

This demonstrates that a mere 60 lines of code are far from sufficient to fully solve the auto-markup problem. With investment, we can overcome these problems by training (“fine-tuning”) specialized language models. This training was outside of the scope of this project, but we laid the groundwork for it by curating large, complex datasets. A future paper will show that we can improve these scores using those techniques.

Future Project Governance

Although the project initially began as a volunteer-run open-source initiative, its significance might warrant institutional support from organizations like OASIS, TEI, or a similar new establishment.

Funding may be needed over the long term. In addition to future research, the benchmarks should evolve over time:

  • Test cases should be refreshed to avoid accidental or deliberate “hard coding” or “learning from the tests”

  • The tests should be made more and more ambitious over time.

Other Applications of Markup AI

We consider Auto-Markup as only a single application of a broader field of Markup AI. Other examples include:

  1. Automating content generation: Generating human-like text using LLMs for various applications, such as documentation, tutorials, or marketing materials.

  2. Metadata generation: Employing LLMs to generate rich and accurate metadata for XML-based documents, improving searchability, discoverability, and contextual understanding.

  3. Assisting in schema design: Utilizing LLMs to facilitate the creation of XML schemas, by analyzing existing data sets and offering suggestions based on patterns and best practices.

  4. Easier conversions and transformations: AI could convert prose descriptions of conversions, transformations, or style applications into code and stylesheets for traditional software applications.

  5. Componentizing content: recognizing component boundaries, or extracting reusable components.

  6. Document Correction: Interpret the output of Validation tools and make automated changes to documents to fit the Schema.

  7. Markup-Enhanced Q&A: Question and Answer bots which use the structure of the XML as part of their determination of what content is relevant.

We imagine new benchmarks or extensions to this benchmark to cover all of these use-cases in the future.

Conclusion

Auto-Markup is an exciting new direction for structured publishing workflows. We believe that empirical study of such engines can accelerate their development. Standardized metrics will allow us to compare tools and analyze their progress.

So far, our team has implemented some simple Auto-Markup engine prototypes to test our metrics and has concluded that they produce intuitive and useful evaluations.

We invite anyone interested in contributing to either join us at http://bit.ly/automarkup-metrics or to contact the lead researcher.

References

[Liao et al., 2021] Liao, Thomas, Rohan Taori, Inioluwa Deborah Raji, & Ludwig Schmidt (2021). Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning. Presented at Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021). https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/757b505cfd34c64c85ca5b5690ee5293-Paper-round2.pdf

[Akhtar et al., 2002] Akhtar, Shazia, Ronan Reilly, & John Dunnion (2002). AutoMarkup: A Tool for Automatically Marking up Text Documents. In Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol. 2276. pp. 433-435. doi:https://doi.org/10.1007/3-540-45715-1_46

[Paoli, 2019] Paoli, Jean (2019). We Created Document Dysfunction: It Is Time to Fix It. Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). doi:https://doi.org/10.4242/BalisageVol23.Paoli01



[1] Yes, you are reading the output of an Auto-Markup pipeline.
