How to cite this paper
Prescod, Paul, Ben Feuer, Andrii Hladkyi, Sean Paulk and Arjun Prasad. “Auto-Markup BenchMark: Towards an Industry-standard Benchmark for Evaluating Automatic
Document Markup.” Presented at Balisage: The Markup Conference 2023, Washington, DC, July 31 - August 4, 2023. In Proceedings of Balisage: The Markup Conference 2023. Balisage Series on Markup Technologies, vol. 28 (2023). https://doi.org/10.4242/BalisageVol28.Prescod01.
Balisage: The Markup Conference 2023
July 31 - August 4, 2023
Balisage Paper: Auto-Markup BenchMark: towards an industry-standard benchmark for evaluating automatic
document markup
Paul Prescod
President and Founder
Document Minds
Paul Prescod is the founder of Document Minds, a consultancy specializing in the application
of Large Language Models to Markup Technologies.
Ben Feuer
Ph.D. Candidate, Deep Learning
New York University
Ben Feuer is a deep learning researcher in the lab of Prof. Chinmay Hegde. Current
areas of interest include computer vision, NLP, and vision-language models like OpenAI’s
CLIP.
In particular, Ben’s research focuses on real-world AI performance and consistency
when data are scarce, inconsistently labeled or highly diverse.
Ben has a deep background in the arts and humanities, which can be helpful when attempting
to communicate about highly complex and technical ideas.
Andrii Hladkyi
Andrii Hladkyi is a software engineer who develops NLP applications. His projects
include applications in multiple popular languages as well as his own Ukrainian.
Sean Paulk
Data Analyst
Western Tidewater Community Services Board
Sean Paulk is a data analyst trying to learn about new technologies in novel ways.
Arjun Prasad
Student
BITS-Pilani, Hyderabad campus
Arjun Prasad is a final year CS undergraduate. He has worked on projects involving
NLP applications in software testing and bug detection, as well as ML applications
in IoT, and is currently interning at a startup, where he is working with LLMs to
build a chatbot.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International
License.
Abstract
Recent large language models (LLMs) such as GPT, Bard and Claude have achieved impressive
progress in comprehending and generating human language. We explore the potential
for LLMs to automate markup of semi-structured texts into HTML, DITA, and other related
languages. Automated markup could make it possible to move documents into XML at much
lower cost. Currently, there is no standard benchmark for evaluating automatic markup
systems. Establishing a benchmark for the understanding and generation of structured
documents and markup languages can drive innovation, standardize evaluations, identify
algorithm strengths and weaknesses, clarify the State of the Art and foster interdisciplinary
collaborations. This paper introduces an
early version
of a benchmark for this purpose.
Table of Contents

- Overview
  - Note on the use of AI in Writing
- Prior Work
- The Role of AI Benchmarks
- The Project
- What is Auto-Markup?
- Use Cases
  - Jane’s Use Case - Meeting Notes to HTML for Confluence
  - Alex’s Use Case - Rough Notes to DITA
  - Paul’s Use Case: Google Drive to Balisage Paper
- Conceptual Challenges
  - Ambiguity
  - Scale Challenges
- Project Deliverables
- Project Scope Restrictions
- XML-centric AI Datasets
- Metrics: Analogs from other fields
- Trade-Offs to Consider
- One Proposal: XATER = The XML Translation Edit Rate
  - XATER Examples
  - Companion Metric: Validation Error Metric
- Experimental Trial Run
- Future Project Governance
- Other Applications of Markup AI
- Conclusion
Overview
“In recent years, large language models (LLMs) and generative text technologies have
made significant strides in natural language processing, yielding impressive results
in understanding and generating human language.”
— ChatGPT’s humble opinion🤖
One of the barriers to entry for markup systems is the cost of getting documents
from unstructured formats into structured formats, usually with a mix of programming
and
human authoring. AI in general, and Large Language Models in particular, may allow
us to build systems that shift the balance further towards full automation.
In the long term, the growth of large language models raises the question of whether marked-up
documentation will actually decline in relevance. The main argument for that position
is that markup exists to help machines understand complex human languages. If machines
come to understand human documents as well as humans do, perhaps markup will no longer
be necessary, just as it was not necessary when human scribes transferred documents
between contexts before markup was invented. The notations humans developed for their
own use (italics, superscript, subscript) will presumably be sufficient for “truly
intelligent” machines as well.
In the medium term, however, machine inference is still fairly expensive, slow and
not entirely reliable. Programmatic and deterministic-declarative systems will still
need structured inputs for many years or decades.
Automated markup might play the same role in making data available to structured transformation
systems and stylesheets that OCR or speech recognition plays in making text available to digital computers.
Automated markup can also generate a “first draft” for human authors to perfect.
People have implemented various forms of automatic markup for several decades, but
there is no standardized way to evaluate them, and therefore no clear sense of what
represents the State of the Art or whether we are actually making progress. This paper
describes an early version of such a benchmark, created by a team of researchers
collaborating as part of the RoundtableML community and releasing tools and datasets
through GitHub.
Note on the use of AI in Writing
Consistent with the themes of this paper, certain (labeled) portions of the document
were written by ChatGPT 4. For example, we used it for aspects that summarize the
State of the Art, a task well-suited to a machine that has read almost all of the
text on the Internet. All such text has been reviewed and edited by our own subject
matter experts.
The 🤖 emoji is used to denote content produced by ChatGPT 4.
The XML version of this document was also created with a custom GPT 4-based Auto-Markup
engine rather than with an XML editor or traditional transformation tool. Final markup
tweaks were done in a text editor.
Prior Work
Researchers have applied machine learning techniques to the Automatic Markup problem
since at least 2002 [Akhtar et al., 2002].
More recently, some organizations, such as Docugami [Paoli, 2019] and Innodata, have
turned this into a commercial enterprise. So far, however, there has been no standardized
way to compare these offerings to each other or to demonstrate progress in the State
of the Art.
The Role of AI Benchmarks
“Are we there yet?” Are we even making progress? An AI sub-field lacking metrics is
like a vehicle
without a speedometer or odometer.
As described in Liao et al., 2021:
Benchmarking was popularized in machine learning in the 1980s through the UCI dataset
repository and challenges sponsored by DARPA and NIST. Since then, benchmark evaluations
have become the core of most empirical machine learning papers. The impact of benchmarking
is illustrated by the ImageNet competition, which seeded much of the excitement in
machine learning since 2010. Winning entries such as AlexNet and ResNets have become
some of the most widely cited papers across all sciences.
These stories illustrate the influential role of benchmarks in the AI research community,
pushing the boundaries of what is possible and fostering innovation across various
domains.
Creating an AI benchmark for the understanding and generation of structured documents
and markup languages would have several benefits:
-
Encourage innovation: A benchmark would stimulate research and development in AI-driven
solutions for processing, understanding, and generating structured documents and markup
languages. The competitive nature of benchmarks often inspires researchers to develop
novel techniques and algorithms to outperform existing solutions.
-
Standardize evaluation: A benchmark would provide a standardized framework for evaluating
and comparing the performance of AI models and systems in handling structured documents
and markup languages. This would enable the research community to measure progress
and identify the most effective approaches.
-
Identify strengths and weaknesses: A benchmark would help researchers identify the
strengths and weaknesses of various AI algorithms and techniques when applied to structured
documents and markup languages. By understanding these strengths and weaknesses, researchers
can focus on improving specific aspects of their models, leading to more robust and
efficient solutions.
-
Foster collaboration: A well-designed benchmark would facilitate collaboration among
researchers, developers, and practitioners in the markup language community. Sharing
ideas, techniques, and best practices would lead to the development of more effective
AI solutions for structured documents and markup languages.
-
Improve AI-driven applications: A benchmark would drive advancements in AI technology
that can be applied to real-world use cases involving structured documents and markup
languages. For example, AI models that excel in understanding and generating structured
documents could enhance content management systems, data extraction tools, and document
processing applications.
-
Expand the AI research scope: Establishing a benchmark focused on structured documents
and markup languages would broaden the scope of AI research, connecting the field
with the Balisage community and other experts in markup languages. This interdisciplinary
collaboration could lead to new insights and innovations in both AI and markup language
domains.
The ideal situation would be if OpenAI, Google, Anthropic and other industry leaders
would adopt our benchmarks as goals in the training of their Foundation Models, which
would mean that GPT-5, Claude 2, etc. might be trained with a deep understanding of
markup and schemas.
The Project
An international group of industry engineers and academics from the RoundtableML
community has come together to initiate this research direction. Our work is open
source and we look forward to collaborating with others.
We have undertaken the following steps, and this document represents a checkpoint
in our progress.
1.
“Defining the tasks: We outlined the specific tasks related to structured document
understanding and generation that the AI models should perform.
2.
Established evaluation metrics: We determined suitable evaluation metrics for each
task to capture the accuracy, efficiency, and quality of the AI models’ performance.
3.
Created a standardized evaluation framework: This paper outlines a standardized evaluation
framework that allows researchers to test their AI models against the tasks and datasets,
and compare their performance using the established metrics. This framework should
be easily accessible and widely adopted by the community to ensure meaningful comparisons
and consistent progress tracking.”🤖
4.
Collected and curated datasets: We gathered a diverse set of structured documents,
including various markup languages such as Docbook, DITA, TEI, JATS, XHTML and many
others. The examples span different domains like academic papers, technical documentation,
web pages, and more. We ensured that the dataset covers a wide range of complexities
and styles. We also annotated and labeled the data as needed for the defined tasks.
This includes providing the ground truth for content extraction or the correct markup
code for generation tasks.
5.
Encourage participation and collaboration (ongoing): We are promoting the benchmark
among AI researchers, developers, and practitioners working on structured documents
and markup languages. We could encourage participation in the benchmark by hosting
competitions, workshops, and conferences, and provide incentives such as prizes, recognition,
or publication opportunities.
6.
Update and maintain the benchmark: We should periodically update the benchmark to
reflect advancements in AI models, markup languages, and real-world use cases. This
may include adding new tasks, expanding the dataset, or refining the evaluation metrics.
Ensuring the benchmark remains relevant and challenging will drive continuous innovation
and progress in the field.
7.
Share results and findings: We will encourage participants to share their results,
insights, and techniques, fostering collaboration and knowledge exchange within the
community. This will help identify the most effective AI models and approaches for
structured document understanding and generation, driving further advancements and
practical applications.”🤖
This paper discusses our progress on Steps 1 through 5. As indicated by the emojis,
much of this project plan was formulated in collaboration with ChatGPT!
What is Auto-Markup?
We define Auto-Markup as an automated process that can consume documents and produce
structured XML, or equivalent, markup in an autonomous or mostly autonomous fashion.
Given:
-
A declarative schema
-
An (X)HTML, Word, RTF or Plain Text document which semantically “matches” the schema
-
Zero or more example transformations
-
Optional prose English “guidance”
Generate:
-
A valid XML document, conforming to the schema, that faithfully preserves the content
of the input and expresses its implied structure in markup
We call these tools Auto-Markup engines and we anticipate a few categories of them:
-
Narrow versus General Auto-Markup Engines: A narrow Auto-Markup engine might have
innate knowledge of a specific schema. Such a narrow system might use human knowledge
to out-perform a generalized Auto-Markup system in the short term, although
Rich Sutton’s “The Bitter Lesson”
suggests that such performance gaps will eventually disappear.
-
Trained Auto-Markup Engines might learn their tasks from a large number of examples
of input and output text. This is typically done by fine-tuning an existing base model.
-
Few-shot or zero-shot engines might learn from instructions rather than voluminous
examples. In the not too distant future, such a system might simply read the same
schema documentation that a human does.
We anticipate that just as with tools that translate between human languages, different
Auto-Markup tools will vary in their performance based on the input/output pair. For
example, a tool that is specialized in HTML to DITA might do a poor job converting
Plain Text to Docbook.
Building specific benchmarks for every variant described above would be a monumental
task, so we assume that the model has already been trained, prompted or otherwise
taught how to deal with specific input and output formats and we do not include the
efficiency of training or prompting as part of the benchmark. This is similar to how
human language translation metrics are used.
Use Cases
We identified several
user stories and use cases
to help us ground our thinking. Here are a few examples.
Jane’s Use Case - Meeting Notes to HTML for Confluence
Jane, a diligent project manager at a software company, had a crucial meeting with
her team to discuss upcoming product features. She jotted down the meeting notes,
but they were plain text and looked quite disorganized. Knowing she had to share these
notes with higher-ups and other team members, she decided to convert them into HTML
format for a professional appearance, which could be easily inserted into emails,
uploaded to Confluence, or integrated into Google Docs.
Here are Jane’s meeting notes in Plain Text:
Meeting Notes - 9 June 2023
New Feature Discussion
People:
Jane - Project Manager
Tom - Developer
Sarah - Designer
Agenda:
- Talk about new features
- Finalize the design
Discussion:
1. Jane started the meeting.
2. Tom suggested a new search bar - link: www.example.com/searchbar
3. Sarah showed new design mockup - image at: C:/images/design.png
4. Team agreed to work on search bar and new design.
Next steps:
- Tom to create a prototype.
- Sarah to finalize designs.
Meeting concluded at 3:00 PM.
Jane can use Auto-Markup to convert these notes into HTML with headings, bullet
points, hyperlinks, and embedded images, ensuring that the information is well-structured
and aesthetically appealing for the recipients.
<!DOCTYPE html>
<html>
<head>
<title>Meeting Notes - 9 June 2023</title>
</head>
<body>
<h1>Meeting Notes - 9 June 2023</h1>
<h2>New Feature Discussion</h2>
<h3>People:</h3>
<ul>
<li>Jane - Project Manager</li>
<li>Tom - Developer</li>
<li>Sarah - Designer</li>
</ul>
<h3>Agenda:</h3>
<ul>
<li>Talk about new features</li>
<li>Finalize the design</li>
</ul>
<h3>Discussion:</h3>
<ol>
<li>Jane started the meeting.</li>
<li>Tom suggested a new search bar - <a href="http://www.example.com/searchbar">link</a></li>
<li>Sarah showed new design mockup - image at: C:/images/design.png</li>
<li>Team agreed to work on search bar and new design.</li>
</ol>
<h3>Next steps:</h3>
<ul>
<li>Tom to create a prototype.</li>
<li>Sarah to finalize designs.</li>
</ul>
<p>Meeting concluded at 3:00 PM.</p>
</body>
</html>
Alex’s Use Case - Rough Notes to DITA
Alex is a technical writer for a company that develops software for medical devices.
Alex is working on a new user guide for a groundbreaking medical device that the company
is about to launch. The company recently adopted the DITA (Darwin Information Typing
Architecture) standard for their documentation to facilitate content reuse, improve
consistency, and streamline the localization process.
He starts by writing the content in plain text, as it allows him to focus on the information
without getting bogged down by the structure and formatting.
Here’s how a section of Alex’s user guide looks in plain text before converting it
to DITA:
Title: Using the Heart Rate Monitor
Introduction:
The heart rate monitor helps doctors to accurately measure patients’ heart rates in real-time.
Prerequisite: Ensure the device is sanitized before use.
Steps:
1. Turn on the device by pressing the power button.
2. Place the monitor on the patient’s finger.
3. Wait for the device to calibrate.
4. Read the heart rate on the device screen.
After writing the content, Alex uses a specialized tool to convert his plain text
into DITA format.
Here’s how the content might look in DITA:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<task id="heart_rate_monitor">
<title>Using the Heart Rate Monitor</title>
<shortdesc>The heart rate monitor helps doctors to accurately measure patients’ heart rates in real-time.</shortdesc>
<taskbody>
<prereq>Ensure the device is sanitized before use.</prereq>
<steps>
<step><cmd>Turn on the device by pressing the power button.</cmd></step>
<step><cmd>Place the monitor on the patient’s finger.</cmd></step>
<step><cmd>Wait for the device to calibrate.</cmd></step>
<step><cmd>Read the heart rate on the device screen.</cmd></step>
</steps>
</taskbody>
</task>
By converting his content into DITA, Alex is able to structure the information in
a way that’s consistent across all product documentation. Moreover, it facilitates
content reuse, as parts of the documentation can be easily repurposed for other products
or output formats. This streamlined approach is highly valued by his team and the
company, as it greatly enhances the efficiency and quality of their technical documentation.
Paul’s Use Case: Google Drive to Balisage Paper
Paul is a research analyst specializing in document processing and structured content.
He is passionate about automating the conversion of plain text documents into structured
formats. Recently, he worked with a team of open source developers to create a benchmark
tool named Auto-Markup BenchMark
for evaluating different auto-markup software. His tool aims to standardize the evaluation
process and help in making informed decisions regarding the selection of auto-markup
tools. Paul decides to present his findings at the Balisage conference.
Paul initially drafts his paper in Google Drive because it allows him to easily structure
the content, collaborate in real-time with co-authors, and make use of Google Drive’s
version history to track changes. And it’s free!
Here’s how a section of Paul’s paper looks after exporting as plain text, but before
converting it to the Balisage conference DTD:
Auto-Markup BenchMark: towards an industry-standard benchmark for Evaluating Automatic Document Markup
Abstract:
Large language models (LLMs) have greatly advanced in comprehending and generating human language. Markup remains crucial for structured inputs in publishing systems due to the high costs and unreliability of machine inference. Automated markup could bridge AI systems and unstructured text systems with structured publishing software by making it possible to move documents into XML at much lower cost. Currently, there is no standard way to evaluate these automatic markup systems. Benchmarks are always vital in AI development as they provide standardized datasets and evaluation metrics. Establishing a benchmark for the understanding and generation of structured documents and markup languages can drive innovation, standardize evaluations, identify algorithm strengths and weaknesses, clarify the State of the Art and foster interdisciplinary collaborations. This paper introduces an early version of a benchmark for this purpose.
Overview
In recent years, large language models (LLMs) and generative text technologies have made significant strides in natural language processing, yielding impressive results in understanding and generating human language.
In the very long run, …
To submit his paper to the Balisage conference, Paul converts his Google Drive document
into the required Balisage conference tagset format.
Conceptual Challenges
There are many challenges that arise in defining a metric for such a fuzzy task. If
“correctness” were well-defined and clear, we probably would not need AI to help with
it: we could simply use normal imperative code.
Here are some of the subjective issues we must address:
-
What constitutes reasonable input? A string of random characters cannot be turned
into any useful form of HTML or DITA. Neither a human nor a tool would know where
to put the tags. So there must be some boundary around the input.
-
What constitutes correct output? Putting an entire document in a paragraph would pass
a validator, but it would not be useful output.
-
Given that few answers will be exactly perfect, how do we represent imperfection numerically?
If we use a 1-to-10 scale, what does “3” signify mathematically and in terms of real-world
implications?
Ambiguity
A key challenge is the question of ambiguous labeling. What if two different tags
are equally valid for a text run, or if one tool decides to insert an optional attribute
that another might ignore? Options we explored are:
-
Simple grading such that a tool might be inappropriately penalized but hopefully not
systematically enough to change its mark significantly
-
Fuzzy grading that tries to identify what constitutes “good enough”, e.g. with regular
expressions, wildcards, optional alternatives
Human evaluation (rating) is another option, although it is too expensive to scale
to the largest systems.
To a certain extent, this challenge can be addressed through normalization. For example,
whitespace normalization is an obvious way to reduce the space of potential answers.
One could also decide that certain tags are similar enough that they can be treated
as equivalent in this context, although this requires schema-specific tweaking of
the metrics, which introduces another point of subjectivity and perhaps disagreement.
Supplying multiple “correct” reference documents is another standard way of dealing
with this issue and is implemented in our tooling.
Scale Challenges
Running metrics on large documents presents two main challenges:
The simpler one is that many metrics depend on quadratic algorithms which scale poorly
with document size.
The more subtle one is another kind of exponential explosion. If there are M legitimate
different ways to mark up the first section of a document and N ways for the second,
as well as O, P and Q ways for the third, fourth and fifth, then the number of possible
representations for the document is M*N*O*P*Q. Shorter documents are less prone to
this explosion of alternatives.
For now, we have focused our attention on shorter documents.
Project Deliverables
Our project consisted of building a suite of open-source tools that can be used in
the evaluation and training of markup engines:
-
markup-metrics
is a test harness for applying auto-markup engines and metric engines to test suites.
It includes a small, prototype, open source Auto-Markup Engine as well as some prospective
Metric engines.
-
the-xml-document-stack
is a virtual collection of many gigabytes of XML documents, along with tools for filtering
them down to those that are useful for a given project and for transforming them.
-
auto-markup-training-toolkit
is a set of tools for generating sample document sets, including the Messy Markdown
Generator, which is described below.
In order to make the project tractable, we have scoped it to a few kinds of inputs
and outputs.
For our input we use plain text, because it is the most universal format that can
be extracted from almost any other format. By “plain text” we mean the kinds of texts
that humans might write in a text editor for the consumption of other humans rather
than tools. We call this variant of Plain Text “Messy Markdown” because it is similar
to Markdown (in fact, such text was the inspiration for Markdown). Messy Markdown is necessarily
much fuzzier, broader and more mutable than “real” Markdown. After all, the process of
converting real, strict Markdown into HTML is well understood. It is the messy output
of copying and pasting into Notepad, or typing directly into Notepad, that requires AI.
For our output we use HTML and DITA to take advantage of the varying knowledge sets
of project participants.
Adding HTML as an input format is a logical next step in our project:
Table I

|                  | HTML Output | DITA Output |
| Plain Text Input | In Scope    | In Scope    |
| HTML Input       | No-op       | Future Work |
Project Scope Restrictions
In order to keep the project tractable, and in recognition that Auto-Markup Engines
need to walk before they run, there are certain aspects of the problem which we do
not intend to address in the current project:
-
Componentization: this implies generating reusable content objects, or splitting topics
into parts to be included in a DITA map or equivalent. This kind of work is certainly
within the scope of Markup AI, but not part of our 2023 definition of “Auto-Markup”.
As the field evolves, we may move this from being an optional add-on or post-processing
feature into the heart of “Auto-Markup”.
-
Internal linking: it remains to be seen whether it is practical for current Auto-Markup
tools to infer relationships and generate link elements. Obviously if one is converting
from HTML to DITA then this would be expected, and perhaps even from Markdown. But
if up-converting from raw text, this would require a form of whole-document semantic
evaluation which may be beyond the capabilities of any 2023-era Auto-Markup
tools except as an optional add-on, or post-processing feature.
-
Metadata: another whole-document analysis question would be document summarization
into metadata or indexing elements.
-
Re-Ordering: re-ordering elements to match a schema is likely beyond what current
or near-future Auto-Markup
engines will be capable of.
As the field evolves, we will re-evaluate support for these features.
XML-centric AI Datasets
Our open source team has started the process of constructing a massive dataset of
open source XML documents suitable for both training and evaluating XML AIs. It is
frequently easier to generate an input/output pair for evaluation by starting with
the output (e.g. DITA) because it is simple to generate plain text from DITA whereas
the opposite is hard (until Auto-Markup is mainstream and reliable).
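To illustrate that direction of derivation, here is a minimal Python sketch that flattens
a DITA task into rough plain text; the element-to-text mapping is invented for this example
and is not the actual code in auto-markup-training-toolkit:

from lxml import etree

def dita_task_to_plain_text(path: str) -> str:
    # Illustrative only: one plausible flattening of a DITA task into "messy" plain text.
    tree = etree.parse(path)
    lines, step_number = [], 0
    for element in tree.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        text = " ".join((element.text or "").split())
        if not text:
            continue
        if element.tag == "title":
            lines.append(f"# {text}")
        elif element.tag == "prereq":
            lines.append(f"Prerequisite: {text}")
        elif element.tag == "cmd":
            step_number += 1
            lines.append(f"{step_number}. {text}")
        else:
            lines.append(text)
    return "\n\n".join(lines)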
Our dataset consists of a mix of newly-written documents and documents repurposed
from other projects.
The repurposed portion consists of open source documents from GitHub via The Stack:
Table II

| Directory   | Number of Files | Data Size (MB) |
| xml/dita    | 67,303          | 237.03         |
| xml/html    | 138             | 4.33           |
| xml/jats    | 10,691          | 834.20         |
| xml/tei     | 1,934           | 323.15         |
| xml/docbook | 16,969          | 301.15         |
We have filtered out non-document-oriented XML such as configuration files and Ant
build files.
The code for assembling this data is at https://github.com/prescod/the-xml-document-stack
Some of the properties we will use to manage the dataset as it grows are:
-
Relevance: The dataset should contain diverse examples of text that are representative
of real-world scenarios and target the specific markup task, such as DITA, DocBook,
business documents and transaction documents.
-
Size and diversity: The dataset should be large enough to cover a wide range of markup
scenarios, styles, and complexities. It should include a variety of document types,
content domains, and structures, ensuring that the LLM can generalize across different
markup tasks.
-
Quality and correctness: The dataset must be accurate and free of errors in both the
raw text and the corresponding markup. High-quality ground truth annotations are essential
for training and evaluating the model.
-
Balanced distribution: The dataset should provide a balanced distribution of examples,
ensuring that it covers various markup patterns and does not over-represent certain
features. This can help avoid biases in the evaluation process.
-
Variability in difficulty: The dataset should include examples with different levels
of difficulty, ranging from simple to complex markup scenarios. This helps in assessing
the LLM’s capability to handle a wide range of markup challenges.
-
Up-to-date and dynamic: The dataset should be regularly updated and expanded to reflect
the evolving nature of markup tasks, as well as to maintain its relevance and usefulness
in evaluating LLMs.
-
Accessibility and openness: Ideally, the dataset should be publicly available and
licensed for research purposes, facilitating comparisons across different LLMs and
promoting collaboration among researchers.
-
Comprehensive documentation: The dataset should be accompanied by comprehensive documentation
that provides detailed information on its collection process, annotation guidelines,
and any preprocessing steps. This ensures transparency and reproducibility in training
and evaluation processes.
The project has not yet organized the dataset meaningfully. When it does, the dataset
should be segmented into three parts:
-
Training data: the biggest portion, available for training models
-
Eval: a portion which is held back using social conventions, to evaluate models in
a public and transparent way. This eval portion should be refreshed frequently to
guard against memorization, whether deliberate or accidental. This will require us
to implement a versioning strategy which allows access to old and new evaluations.
-
Hidden: a portion which is impossible for models to accidentally crawl, and perhaps
literally secret so that even most humans are restricted from seeing it, to minimize
cheating. This hidden portion should also be refreshed frequently to guard against
memorization.
Metrics: Analogs from other fields
There are two fields of Artificial Intelligence that seem similar to the Auto-Markup
problem. Studying these fields allows us to transfer decades of learning to our own
problem.
The first is Named Entity Recognition, the field of Natural Language Processing that
categorizes phrases in text. In our case, the entities would be identified so that
they can be marked up.
The other is Machine Translation, wherein text is translated from one human language
to another. In our case, the languages are “Unstructured Text” (or some more specific
variant) and “Structured XML” (in some specific vocabulary).
“Named Entity Recognition (NER) is a subtask of information extraction in natural
language processing (NLP) that seeks to locate and classify named entities mentioned
in unstructured text into predefined categories such as person names, organizations,
locations, medical codes, time expressions, quantities, monetary values, percentages.
Evaluating the performance of NER systems is crucial for understanding how well they’re
functioning and where they might need improvement. Here are several common methods
used for evaluating NER systems:
-
Precision is the ratio of correctly predicted positive observations to the total predicted
positives. High precision indicates a low false-positive rate.
-
Recall (Sensitivity) is the ratio of correctly predicted positive observations to
all observations in the actual class. High recall indicates a low false-negative rate.
-
The F1 Score is the harmonic mean of Precision and Recall. It tries to find the balance
between precision and recall.
For each of these metrics, you count the number of correct positive predictions (true
positives), incorrect positive predictions (false positives), and incorrect negative
predictions (false negatives), and then use these counts to calculate precision, recall,
and F1 score.”🤖
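As a small, self-contained illustration (the function name and counts below are invented
for the example), the three scores follow directly from those counts:

def precision_recall_f1(tp, fp, fn):
    # tp, fp, fn: true positive, false positive, and false negative counts
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g. 8 entities tagged correctly, 2 spurious, 4 missed:
print(precision_recall_f1(8, 2, 4))  # (0.8, 0.666..., 0.727...)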
Rather than think of XML markup as an entity recognition task, we could instead think
of it as a generative text task, akin to translating between languages or summarizing
text: generating a new document from the old one.
Viewed this way, as a content generation or transformation problem, reference metrics
could be repurposed from quantitative scores used to assess machine translations and
summarization:
BLEU (Bilingual Evaluation Understudy):
“BLEU is an evaluation metric primarily used for assessing the quality of machine-generated
translations. It compares the generated translation to one or more human reference
translations, quantifying their similarity based on the presence of shared n-grams
(sequences of n words). The BLEU score ranges from 0 to 1, with 1 representing a perfect
match with the reference translation.”🤖
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
“ROUGE is a recall-oriented metric that is often used in tasks like text summarization.
It measures how many n-grams in the reference translations match with the n-grams
in the machine-generated translations. There are several types of ROUGE scores, including
ROUGE-N (which is similar to BLEU but focuses on recall rather than precision), ROUGE-L
(which considers sentence level structure similarity based on longest common subsequence),
and ROUGE-S (which considers skip-bigram statistics). Like BLEU, ROUGE scores also
range from 0 to 1.”🤖
Translation Edit Rate:
“Translation Edit Rate (TER) is a metric for machine translation that measures the
amount of editing that a human would have to perform to change a system output into
a correct translation.
In other words, TER is the minimum number of edits required to change a hypothesis
so that it exactly matches a reference. The edits can be insertion, deletion, substitution
of words, or shifting of word sequences. The rate is then calculated as the number
of edits divided by the total number of words in the reference. The lower the TER,
the closer the machine-generated translation is to the reference translation, and
hence, the better the translation quality is considered to be.”🤖
In code:
ter = number_of_edits / length_of_reference_text_in_tokens
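For example, with purely illustrative numbers:

# Illustrative numbers only: 3 edits needed against a 20-token reference
number_of_edits = 3
length_of_reference_text_in_tokens = 20
ter = number_of_edits / length_of_reference_text_in_tokens  # 0.15; lower is better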
Levenshtein distance:
“The Levenshtein distance is a string metric for measuring the difference between
two sequences. Informally, the Levenshtein distance between two words is the minimum
number of single-character edits (insertions, deletions, or substitutions) required
to change one word into the other. It is a very commonly used tool in computer science,
beyond NLP.”🤖
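For concreteness, a standard dynamic-programming implementation takes only a few lines of Python:

def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character insertions, deletions or substitutions
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current_row = [i]
        for j, char_b in enumerate(b, start=1):
            current_row.append(min(
                previous_row[j] + 1,                       # deletion
                current_row[j - 1] + 1,                    # insertion
                previous_row[j - 1] + (char_a != char_b),  # substitution
            ))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein("kitten", "sitting"))  # 3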
Trade-Offs to Consider
None of these metrics is completely perfect even within their own domains but especially
when imported into the domain of Auto-Markup.
When modeled as a classification problem, Auto-Markup for most commonly used languages
is heavily class imbalanced. Certain tags happen much more often than others (those
in the “long tail”).
Therefore, metrics such as accuracy and Macro-F1 can lead to
unrealistically optimistic views
of model performance. Weighted F1 scores adapt to this limitation by biasing by class
frequency, but this approach can be unrealistically pessimistic if long-tail tags
are less important when evaluating the overall performance of the system.
Another challenge we encounter when modeling auto-markup as a classification problem
is the existence of open-set attributes such as the href
attribute of the a
tag and the src
attribute of the img
tag in HTML. These attributes cannot be modeled using closed-set classification in
a straightforward way (although classification error could be augmented with use of
some secondary metric, such as character-wise Levenshtein distance).
The limitations of TER, BLEU and ROUGE depend on what we intend them to measure.
TER, BLEU and ROUGE are designed to work at the level of n-grams or complete entities;
since most tags are singular entities, if you sum up the discrete measurements for
each tag, these metrics will behave much the same as accuracy when applied to single
words.
For example: If you are evaluating the BLEU score of a one-word sentence against a
reference one-word sentence using unigrams (single words), then the BLEU score would
be either one if the words match (i.e., the generated word is the same as the reference
word) or zero if they do not match.
If the system output and the reference are the same word, no edits are required, so
the TER would be zero (indicating a perfect match). If the words are different, one
substitution is required, so the TER would be one (one edit / one word = 1).
If we use these metrics naively at the level of the document, then the metric’s mass
will concentrate in the text (which is supposed to be copied from one document to
another) rather than the markup (which is what we aim to measure). And yet, we cannot
simply ignore the text, because there is no guarantee that the engine (especially
an LLM) will not modify it in the course of generating the markup.
A prosaic concern is implementation efficiency. Many of these algorithms are designed
for comparing sentences to other sentences and their implementations are quadratic
or even cubic in time complexity. When applied to big documents, they can grind to
a halt. This is unfortunate, because models themselves may have poor scaling behaviors
which we would like to probe with our tests.
Finally, the efficacy of our auto-markup metric will depend upon the tokenization
strategy we use. Because markup tags are not formatted like words, typical tokenization
strategies like byte-pair encoding will break them up into small blocks of characters.
But if we use TER on partial-tag tokens, we will wind up more heavily penalizing some
tags than others, depending on how they are tokenized. For instance, if the tag <img>
is tokenized as (<) (im) (g) (>)
, and GPT produces instead the tag <a>
, which is then tokenized (<) (a) (>)
, this mistake will be penalized more heavily than if GPT produces a non-existent
tag such as <ims>
, tokenized (<) (im) (s) (>)
.
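The exact splits above are illustrative and depend on which tokenizer is used. A few lines
of Python show how one could inspect the behaviour of a widely used BPE tokenizer (the
tiktoken library and encoding name here are an assumed choice, not part of our tooling):

import tiktoken  # one widely used BPE tokenizer; the encoding below is an assumed choice

encoding = tiktoken.get_encoding("cl100k_base")
for tag in ["<img>", "<a>", "<ims>"]:
    token_ids = encoding.encode(tag)
    pieces = [encoding.decode([token_id]) for token_id in token_ids]
    print(tag, "->", pieces)  # shows how many sub-tokens each tag occupies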
One Proposal: XATER = The XML Translation Edit Rate
At the time of publishing, the metric that we have used most often is XATER, a custom
variant of Translation Edit Rate that processes the output of a custom tokenizer. The
XATER tokenizer normalizes away changes in markup that are unambiguously meaningless
or likely to be meaningless.
In particular, our tokenizer recognizes the following types of tokens:
-
Start-tag
-
Attributes, as separate tokens, sorted alphabetically
-
The value of ID attributes is ignored so that all values are equally valid
-
End-tag
-
Whitespace-separated word
The handling of words is configurable: if you split text nodes on words, then markup
engines that leave words alone get high marks and ones that mangle the words get low
marks. But given that the default situation is that most auto-markup systems do NOT
mangle words, this causes a kind of “grade inflation”.
XATER uses a different scale than TER. TER scores are normally in the range of 0 to
1 with 0 being better (no edits) and 1 being worse (virtually everything had to change).
If the reference and hypothesis documents are extremely different in length then TER
can even be greater than 1.
XATER flips the scale and shifts it to a percentage so that the best score is 100
and in extreme cases of mismatched length it can be a negative number.
The XATER calculation is merely:
return 100 - pyter.ter(input.hypothesis_tokens, input.reference_tokens) * 100
The Tokenizer is a SAX ContentHandler containing about 30 lines of code.
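The listing below is a simplified sketch of such a tokenizer, together with the scoring
call shown above, written to match the token types listed earlier; it is not the exact
code in the markup-metrics repository:

import xml.sax
import pyter  # the TER implementation used in the scoring line above

class XaterTokenizer(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.tokens = []

    def startElement(self, name, attrs):
        self.tokens.append(f"<{name}>")                 # start-tag token
        for key in sorted(attrs.getNames()):            # attributes as separate, sorted tokens
            value = "*" if key == "id" else attrs[key]  # ID values are ignored
            self.tokens.append(f"@{key}={value}")

    def endElement(self, name):
        self.tokens.append(f"</{name}>")                # end-tag token

    def characters(self, content):
        self.tokens.extend(content.split())             # whitespace-separated words

def xater_tokens(xml_text: str) -> list:
    handler = XaterTokenizer()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens

def xater(hypothesis_xml: str, reference_xml: str) -> float:
    # Mirrors the calculation shown above
    return 100 - pyter.ter(xater_tokens(hypothesis_xml), xater_tokens(reference_xml)) * 100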
XATER Examples
Consider the following example (based on a sample created by
Jordan Stanchev).
Input:
** How to Start the Calculator App **
Here you will find information on how to start the Calculator app on your mobile phone.
Prerequisite: To use the calculator app, you need to unlock your phone first.
Steps:
1. Navigate to the Calculator application icon.
2. Tap the icon.
The application starts. You are now ready to calculate with it.
Human-Authored Reference Output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<task id="start-the-calculator">
<title>How to Start the Calculator App</title>
<shortdesc>Here you will find information on how to start the Calculator app on your mobile phone.</shortdesc>
<taskbody>
<prereq>
<p>To use the calculator app, you need to unlock your phone first.</p>
</prereq>
<steps>
<step>
<cmd>Navigate to the Calculator application icon.</cmd>
</step>
<step>
<cmd>Tap the icon.</cmd>
</step>
</steps>
<result>
<p>The application starts. You are now ready to calculate with it.</p>
</result>
</taskbody>
</task>
Using a prototype, open-source auto-markup engine based on GPT 4, we can get the following,
very similar result:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<task id="start-calculator-app">
<title>How to Start the Calculator App</title>
<shortdesc>Here you will find information on how to start the Calculator app on your mobile phone.</shortdesc>
<taskbody>
<prereq>To use the calculator app, you need to unlock your phone first.</prereq>
<steps>
<step><cmd>Navigate to the Calculator application icon.</cmd></step>
<step><cmd>Tap the icon.</cmd></step>
</steps>
<result>The application starts. You are now ready to calculate with it.</result>
</taskbody>
</task>
The XATER metric gives this a score of 86.96%.
Here are some of the differences:
-
The ID is different, but that’s not counted against the engine.
-
Whitespace also varies, but the engine is not penalized for that.
-
The original uses paragraphs inside of the prereq. This is (arguably unfairly) counted
against the model and is a good example where we might need more normalization or
alternate outputs for comparison.
Still, the score is pretty good. Compare this to a similar GPT 3.5-based engine, which
generates this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="how-to-start-calculator-app">
<title>How to Start the Calculator App</title>
<shortdesc>Here you will find information on how to start the Calculator app on your mobile phone.</shortdesc>
<body>
<section>
<title>Prerequisite</title>
<p>To use the calculator app, you need to unlock your phone first.</p>
</section>
<section>
<title>Steps</title>
<ol>
<li>
<p>Navigate to the Calculator application icon.</p>
</li>
<li>
<p>Tap the icon.</p>
</li>
</ol>
</section>
<section>
<title>Result</title>
<p>The application starts. You are now ready to calculate with it.</p>
</section>
</body>
</topic>
The metric scores this at 28.26% because it is dramatically different from the reference
text and not as reasonable an encoding.
Companion Metric: Validation Error Metric
The Validation Error Metric is based on a normalized count of well-formedness and
validation errors. It uses the popular open source LXML
engine to do the counting in a royalty-free and portable manner.
The scoring works as follows:
# Calculate the score based on the percent of correct tags
total_errors = num_wf_errors + num_dtd_errors
good_elements = total_elements - total_errors
# in case any elements generated multiple errors,
# we might have a negative number
good_elements = max(0, good_elements)
good_tags_ratio = good_elements / total_elements
# flip it back to a zero to 100 score with better scores being better
score = good_tags_ratio * 100
Roughly speaking it is the ratio of “correct” elements to “total” elements, but if
each element generates more than one error then the algorithm could theoretically
generate a score of less than zero. The code above limits the minimum score to zero,
however.
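The following condensed sketch shows one way the error and element counts could be gathered
with lxml; the function signature here is assumed for illustration, and the real metric
engine differs in its details:

from lxml import etree

def validation_error_metric(xml_bytes: bytes, dtd_path: str) -> float:
    parser = etree.XMLParser(recover=True)       # keep parsing so we can count errors
    root = etree.fromstring(xml_bytes, parser)
    num_wf_errors = len(parser.error_log)        # well-formedness errors
    if root is None:                             # nothing recoverable at all
        return 0.0

    dtd = etree.DTD(open(dtd_path, "rb"))
    dtd.validate(root)
    num_dtd_errors = len(dtd.error_log.filter_from_errors())

    total_elements = sum(1 for _ in root.iter()) or 1
    good_elements = max(0, total_elements - (num_wf_errors + num_dtd_errors))
    return good_elements / total_elements * 100  # 100 is best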
Experimental Trial Run
We have implemented three Prototype Engines so far.
The dummy_automarkup
engine serves to test the lower bound of auto-markup engines that properly preserve
text but get all markup wrong.
The gpt3.5_am1_automarkup
uses the gpt-3.5-turbo
model through the OpenAI API and simplistic prompting to auto-markup simple DITA
tasks. The gpt4_am1_automarkup
uses the gpt-4
model.
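For orientation, a stripped-down engine of this kind might look like the following sketch;
the prompt wording, function name and use of the current OpenAI Python client are
assumptions and do not reproduce the prototypes themselves:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "Convert the user's plain text into a valid DITA task. "
    "Return only the XML, beginning with the XML declaration and DOCTYPE."
)  # illustrative prompt, not the prototypes' actual wording

def auto_markup(plain_text: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep output as repeatable as possible for benchmarking
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": plain_text},
        ],
    )
    return response.choices[0].message.content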
Additionally, we assembled a test suite using sample Tasks from the DITA Toolkit.
Each task was transformed into a variety of plain text formats, such as Markdown,
Pandoc Plain Text, Emacs OrgMode and several variants of Messy Markdown, as described
in the section “Project Deliverables”. A total of 24 DITA tasks were represented in 288 plain-textish input files. The
goal of each Auto-Markup engine was to re-create the DITA task given only a single
plain text file.
Our three simplistic prototypes are each less than 60 lines of code and do not
represent the State of the Art at all. Nevertheless, we can hypothesize that if they
behave as we expect them to, and the metrics do as well, then gpt4_am1_automarkup
should get a better score than gpt3.5_am1_automarkup
which should in turn beat dummy_automarkup
.
This is in fact what we see in the scores:
Table III

| Markup Engine         | xater_metric | validation_error_metric |
| dummy_automarkup      | 14.94%       | 0.00%                   |
| gpt3.5_am1_automarkup | 38.80%       | 98.98%                  |
| gpt4_am1_automarkup   | 70.56%       | 99.61%                  |
As a reminder, xater_metric
tests whether the engine matches a reference text, and validation_error_metric
tests whether it generates well-formed and valid XML.
The GPT 4-based engine did a reasonable job, with a reference-matching score of 70.56% and
99.61% for DTD-validity. As an example of its flaws, it did not know how to mark up images,
because its extremely limited training data did not include images.
Another failure mode is unique to Generative Large Language Models. When presented
with the following ambiguous Task, the LLMs may sometimes 'hallucinate' or generate
the 'missing' content (the steps)!
Workbench.txt:
# Organizing the workbench and tools
Keep your garage neat by organizing the things you keep in the garage.
In one case, the engine garnered a negative score, because it generated this text
for the above:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<task id="workbench-organization">
<title>Organizing the workbench and tools</title>
<shortdesc>Keep your garage neat by organizing the things you keep in the garage.</shortdesc>
<taskbody>
<context>Keeping your garage organized not only makes it easier to find what you need, but it also creates a safer work environment.</context>
<steps>
<step><cmd>Sort all items in your garage into categories.</cmd></step>
<step><cmd>Decide on a place for each category on your workbench or in your tool cabinet.</cmd></step>
<step><cmd>Place each item in its designated spot.</cmd></step>
</steps>
<result>Your workbench and tools are now organized!</result>
<postreq>Make sure to return each item to its designated spot after use to maintain organization.</postreq>
</taskbody>
</task>
This demonstrates that a mere 60 lines of code are far from sufficient to fully solve
the auto-markup problem. With investment, we can overcome these problems by training
(“fine-tuning”) specialized language models. This training was outside of the scope
of this project, but we laid the groundwork for it by curating large, complex datasets.
A future paper will show that we can improve these scores using those techniques.
Future Project Governance
Although the project initially began as a volunteer-run open-source initiative, its
significance might warrant institutional support from organizations like OASIS, TEI,
or a similar new establishment.
Funding may be needed over the long term. In addition to future research, the benchmarks
should evolve over time.
Other Applications of Markup AI
We consider Auto-Markup as only a single application of a broader field of Markup
AI. Other examples include:
-
Automating content generation: Generating human-like text using LLMs for various applications,
such as documentation, tutorials, or marketing materials.
-
Metadata generation: Employing LLMs to generate rich and accurate metadata for XML-based
documents, improving searchability, discoverability, and contextual understanding.
-
Assisting in schema design: Utilizing LLMs to facilitate the creation of XML schemas,
by analyzing existing data sets and offering suggestions based on patterns and best
practices.
-
Easier conversions and transformations: AI could convert prose descriptions of conversions,
transformations, or style applications into code and stylesheets for traditional software
applications.
-
Componentizing content: recognizing component boundaries, or extracting reusable components.
-
Document Correction: Interpret the output of Validation tools and make automated changes
to documents to fit the Schema.
-
Markup-Enhanced Q&A: Question and Answer bots which use the structure of the XML as
part of their determination of what content is relevant.
We imagine new benchmarks or extensions to this benchmark to cover all of these use-cases
in the future.
Conclusion
Auto-Markup is an exciting new direction for structured publishing workflows. We believe
that empirical study of such engines can accelerate their development. Standardized
metrics will allow us to compare tools and analyze their progress.
So far, our team has implemented some simple Auto-Markup engine prototypes to test
our Metric, and we have concluded that it does produce intuitive and useful evaluations.
We invite anyone interested in contributing to either join us at
http://bit.ly/automarkup-metrics
or to contact the lead researcher.
References
[Liao et al., 2021]
Liao, Thomas, Rohan Taori, Inioluwa Deborah Raji, & Ludwig Schmidt (2021). Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning.
Presented at Thirty-fifth Conference on Neural Information Processing Systems Datasets
and Benchmarks Track (Round 2). In Proceedings of the Neural Information Processing Systems Track on Datasets and
Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021). https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/757b505cfd34c64c85ca5b5690ee5293-Paper-round2.pdf
[Akhtar et al., 2002]
Akhtar, Shazia, Ronan Reilly, & John Dunnion (2002). AutoMarkup: A Tool for Automatically Marking up Text Documents.
In Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol. 2276. pp.433-435. doi:https://doi.org/10.1007/3-540-45715-1_46
[Paoli, 2019]
Paoli, Jean (2019). We Created Document Dysfunction: It Is Time to Fix It.
Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August
2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). doi:https://doi.org/10.4242/BalisageVol23.Paoli01