Context: The Large-Scale Text Analysis Course at Penn State Behrend
The Large-Scale Text Analysis class (Digit 210) is taught (by me, Elisa Beshero-Bondar) in spring semesters in the Digital Media, Arts, and Technology major at Penn State Behrend (affectionately known to us as the DIGIT program). The course is part of a multiple-semester digital humanities core sequence that concentrates on text encoding and processing. Students usually come to this class with experience in XML encoding, transformation with XSLT, and web development with HTML and CSS from previous coursework in text encoding. Our university semesters are 15 weeks long with three 50-minute classes per week. These classes involve daily homework and a few tests, with an emphasis on students’ applying what they learn to develop semester projects in small teams.
Digit 210 is usually understood to be the “Regex-and-Python course.” A
typical semester would involve orienting students to natural language processing in
Python and preparing text corpora for analysis. In this context, XML was a helpful
(but
not completely necessary) option that contributed to more precise text curation and
analysis, and the expectation to prepare XML has provided a good means for students
to
learn regular expression search-and-replace operations in order to generate simply
but
meaningfully structured XML from regular patterns in text corpora. Students would
then
learn to write XQuery to output portions of the texts to analyze using natural language
processing libraries (spaCy
or NLTK). For an exemplary semester
project, a team of students scraped
collections of popular game script
files. On performing careful document analysis, they applied regular expression
search-and-replace operations to create a simple XML structure that helped them to
isolate (and mark off) spoken conversation between NPC characters from passages about
game items and optional actions. The team prepared XML and wrote XQuery to output
the
texts that described decision making forks, in order to find out how frequently certain
characters and items are mentioned at specific locations in the game. In other projects,
students developed text corpora from many seasons of available TV series like
The Simpsons. They prepared simple XML to identify dialogue,
speakers, and non-spoken descriptive passages, then applied XQuery to separate out
just
the portions of the texts they wished to analyze (e.g., just the spoken text) in
plain-text inputs to provide to Python. In previous semesters, student teams imported the Saxon-C package into their Python scripts so that they could apply XPath and XQuery directly as a pipeline process within their Python. Thus, Python has traditionally come first in this course and dominated the class experience of developing a pipeline algorithm for analyzing text corpora.
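The search-and-replace step described above, which turns regular patterns in a transcript into simple XML, can be sketched in Python as follows. This is only an illustration under stated assumptions: the transcript layout ("SPEAKER: words" on one line), the sample lines, and the element names `sp` and `desc` are hypothetical and are not taken from any student project.

```python
import re
from xml.sax.saxutils import escape

# Assumed (hypothetical) transcript layout: "SPEAKER: spoken words" on one
# line; anything else is treated as a non-spoken descriptive passage.
SPEECH = re.compile(r"^([A-Z][A-Za-z .']*):\s*(.+)$")


def mark_up(line: str) -> str:
    """Wrap one transcript line in a simple, hypothetical XML element."""
    line = line.strip()
    match = SPEECH.match(line)
    if match:
        speaker, words = match.groups()
        # The speaker pattern admits only letters, spaces, periods, and
        # apostrophes, so it is safe inside a double-quoted attribute.
        return f'<sp who="{speaker}">{escape(words)}</sp>'
    return f"<desc>{escape(line)}</desc>"


# Invented sample lines, for illustration only
sample = ["HOMER: Where are my keys?", "(Homer searches the couch)"]
marked = [mark_up(line) for line in sample]
```

Once speakers and descriptions are distinguished this way, selecting only the spoken text becomes a matter of selecting one element type.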
Perhaps the 2020s are a decade inviting us to resist complacency, particularly in
the
organizing of teaching syllabi. In Spring 2025, motivated greatly by the interesting
opportunities of Invisible XML and the creative affordances of SVG discussed in previous
years at Balisage, we changed direction and put XML technologies much more in the
foreground.[1] This year, our students began the course by learning to write SVG and think
about creative ways to visualize data by programmatically scripting SVG with XSLT.
The
SVG data visualization unit previously came at the end of the Python-dominated course,
as a consequence of preparing the XML and writing XQuery: students were prepared to
script SVG with XQuery with an emphasis on extracting data and providing an alternative
to visualization libraries available in the Python ecosystem. In the new experimental
course, we decided to begin with hand-encoding SVG as a rewarding
starting point for our digital creative students, and then review what they had learned
of XSLT in their Text Encoding course. In this version of our course, they reviewed
XSLT
by pulling data from texts they had encoded in order to make their own data
visualizations. This moved XSLT “front and center,” but also prevented us from covering XQuery, given the time we would now have to devote to new processing technologies. We introduced the regular expressions unit much as usual, with the same goals of preparing XML files from so-called plain text using search-and-replace operations in the oXygen XML Editor. However, this time we found an opportunity to return to XSLT to introduce stages of conversion from text to XML, using xsl:analyze-string to refine the students’ processing and create new element nodes based on regular expression matches and non-matches. This set the stage for the new unit on Invisible XML and XProc.
All this activity concentrated on XML processing before the students wrote any Python. The new priority on XML not only helped students review and develop XSLT skills introduced in the previous semester, but it also gave students an unusual experience with writing grammars and seeing the relationships and differences between regular expression matching, grammars, and schemas. Putting these XML, XSLT, and ixml experiences first, before introducing Python for natural language processing, changed the course experience significantly. Even though not all students applied ixml or XProc pipelines in their semester projects, the common experience of encountering these technologies certainly changed the way students encountered pipeline processing, improved their command line fluency, and introduced them to software development in alpha stages. Was Invisible XML ready for experimentation by undergraduates, and was the experience worth deferring their attention to Python processing?
We found that perhaps ixml was just ready enough, but it will surely be more ready for student experimentation next year! This semester everything was new and different, and here we break down the most challenging and learning-intensive experiences for students confronting Invisible XML and XProc for the first time.
Introducing the Battle-Testing Team
Preparing for a Cross-Platform Educational Experiment
Hello world! I’m Michael Simons, and I was one of the “battle-testers” for the new ixml and XProc unit in our Digit 210 class. I was one of three students who signed up to assist Dr. Beshero-Bondar and Dr. David J. Birnbaum in trying out some exciting new XML technologies to be taught to the rest of our class. Dr. Birnbaum was invited to teach a roughly one-week-long unit on these technologies, but the preparation was long and intense. I was joined by Dannika Love, who was taking the course with me, and Caleb King, who had taken the course the previous year in its previous form. Dannika is a student with careful attention to detail and strong leadership qualities, but she likely would not have gained as much from ixml and XProc on her own if she had not been a behind-the-scenes battle-tester. Caleb is also a highly motivated student and leader, and one of his main contributions was working through the challenges of installing ixml and XProc processors in a Windows environment, as Dr. Beshero-Bondar, Dannika, and I are all MacOS users.
Our group of battle-testers was largely responsible for getting ahead of the class and working our way through learning ixml and XProc before our peers, so that we could
- Write instructions for installation and configuration that we felt our peers would be able to easily follow, and
- Assist our peers when problems arose.
We hope that our battle-testing was helpful to the developers of ixml and XProc as
we figured out what documentation was needed to provide these technologies to a
group of undergraduates. While Dr. Birnbaum developed installation instructions, he
warned us that they were only for MacOS and for those with a purchased Saxon EE
license for use with XProc processing.[2] Knowing that our students would not be able to purchase a Saxon EE
license and would be using the HE (Home Edition) instead, this was one difference
that the battle-testing team needed to work on incorporating into our instructions.
As the number of differences between Dr. Birnbaum’s instructions and our experiences
grew rapidly, including the many differences between MacOS and Windows as discovered
by Caleb and Dr. Beshero-Bondar, we realized we would need separate sets of
instructions for each platform as well.[3] Our instructions also taught the students many things that would be
useful to them in general as DIGIT majors, including how to install a package
manager (Homebrew for MacOS or Chocolatey for Windows) for the
purpose of installing OpenJDK,[4] creating shell aliases and editing system dot-files,[5] and smoke-testing
their installations to ensure proper
configuration.
We are grateful to Dr. Birnbaum for joining us over Zoom for consultation sessions as we worked on this battle-testing phase, and we learned a lot from his whimsical sidenotes and extraordinary yet accessible knowledge. While individual members of the class might have been able to successfully prepare their environments, without our work ahead of time, things would have likely been very chaotic when Dr. Birnbaum arrived to guest-instruct our class. Additionally, this was a condensed experiment, as Dr. Birnbaum was allotted just three 50-minute class periods for teaching us both ixml and XProc. So, ensuring all students were ready to hit the ground running at the time of his guest appearance was crucial. He created a thorough lesson plan outlining this fast-tracked learning experience.[6] His lesson plan shared very important introductory readings by Norm Tovey-Walsh and Martin Kraetke, serving as an orientation to ixml and XProc respectively.[7] He also introduced us to John Lumley’s ixml workbench as a helpful resource to practice using ixml.[8]
Invisible XML and the Music Analysis Project
We hoped initially that students would be inspired to try out these technologies on their own projects. And some students did, but we learned some unexpected things about their potential application, especially to the field of music encoding.
A Project on Chord Chart Analysis: What ixml Looks Like in a Large-Scale Student Project
The Optimally Motivated Student Experience
Michael again! Dr. Beshero-Bondar deemed me an “optimally motivated student” because I did the work (relatively on time) and was able to grasp ixml and XProc well enough to take them further than most other students attempted to do. The big question for me was: Are these technologies worth incorporating into my group’s semester project?
The short-term answer, unfortunately, was “no,” because I held up my team working on something that I actually completed with regular expression search-and-replace operations in less than a quarter of the time I spent on ixml. However, the big-picture answer may in fact be “yes,” because here I am, reflecting on my learning experience in a professional setting, which is a great honor. I’m walking away from the semester with a larger perspective on the potential of ixml and a greater hopefulness for completing and expanding the project.
Introducing the GretaVanZeppelin Project

GretaVanZeppelin Project Logo
The GretaVanZeppelin Project is an analysis of musical artists, Greta Van Fleet and Led Zeppelin, using collections of chord charts encoded in XML and analyzed with various Python tools. While the project is still in what could be considered its early stages, there were many valuable discoveries and triumphs made along the way, as well as a strong setup for future research possibilities. One of the discoveries was finding more efficient ways to encode and analyze chord charts. The main triumph was being able to implement ixml and XProc on a fairly large scale.
The project goal for my team was to analyze the chord charts of two artists and determine if there were any meaningful comparisons between them, musically or lyrically. Spoiler alert: our findings were inconclusive! The data we extracted from the documents resulted in some nice data visualizations, but unfortunately, we didn’t take it far enough to draw any of those conclusions between the two chosen artists. So, for me, the main takeaway from this project was, unexpectedly, the process in which we obtained the data, including of course, ixml and an XProc pipeline.
The future of the project became clear to us (the project developers) during its showcase at our Digit program’s end-of-the-semester presentation day, called “DIGIT Works.” We received valuable feedback from peers and industry professionals about incorporating additional data into the project to achieve a more thorough comparison: specifically, analyzing the artists’ influences and genre blending, and using our analysis processes to study more than just two artists.
My main task for this project was the transformation of the raw text into something the rest of the team could process with various Python tools and display on our website. This became my task largely because of my role as a battle-tester of the new markup tools. Conversely, my battle-tester status motivated me further to include ixml and XProc in our project because I felt a strong desire to implement what I had learned.
Before discussing my specific processes and work, I feel it is important to share a brief overview of our preliminary research into music encoding, as well as what further research revealed after the conclusion of the semester project.
A Brief Background on Music Encoding
As well as some things we wished we knew when we started the project
None of us knew where to begin in the markup stages, so Dr. Beshero-Bondar
suggested that our team research the Music Encoding Initiative
(MEI). We discovered that the MEI is a community-driven,
open-source effort to define a system for encoding musical documents in a
machine-readable structure
,[9] but we possibly should have kept looking for other options for
encoding chord charts at that time. In our research, we only saw examples of
encoded sheet music in which all the notes are represented on the page on
musical staffs. Sheet music is not conducive to rock music which, in my
experience, is based on repeated grooves and phrases that are more effectively
communicated verbally/audibly. A chord chart serves as a sufficient guide to
show an outline of the structure of the song while displaying what chords need
to be played based on what lyrics are being sung. This means there is a lot of
implicit information about the song purposefully unmarked in these chord charts,
but what we were seeing of the MEI was that it was very much focused on
explicitly encoding every note of a piece of music. That, combined with the
limited amount of time we had for the project, steered us away from utilizing
the MEI guidelines and schema in our project.
Our Digit program has one past experience using the MEI: the Locke Anthology Project.[10] This project conformed to MEI and TEI Guidelines to preserve a digital version of the poetry and music in Alain Locke’s anthology The New Negro: An Interpretation (1925). The work includes short snippets of music between its stories and poems, and the team used the MEI’s structure to encode those snippets so that they could be converted to MIDI[11] and played back as audio files on the website. This use of the MEI was practical, not analytical like the GretaVanZeppelin Project’s focus was. So, while very interesting, the Locke Anthology Project was not something we could have used to help us with our project.
So, how do we approach a project that utilizes solely chord notation? There are other formats for the digitization of music, most notably MusicXML, which we discovered after the semester’s conclusion. This markup language is used and supported in many popular music notation, recording, and other music-related software.[12] Upon familiarizing myself with their tutorial,[13] I think we could have utilized it, but it would still have involved a steep and time-consuming learning curve.
We only discovered ChordPro after the project’s development. I actually knew about ChordPro existing as a way to write chord charts, but only in the context of my church’s music planning software.[14] I was unaware of the scope of ChordPro as an in-development open-source program[15] aimed at the markup of chord charts. This could very well be the future of the GretaVanZeppelin project and could also more deeply incorporate ixml, but I’ll discuss that after I’ve explained what the project already accomplished. (It sure would have been nice to know about ChordPro back in February 2025!)
All of this to say, we ultimately decided to make our own markup specifically for our analysis.
Document Analysis and Our Unique Markup
My implementation of raw text transformation began by sketching out what the process would look like as a pipeline. After some basic XML structure, I knew that I would first need to find the two blank lines that separated the beginnings and endings of each section (Verse, Chorus, etc.) in the song files. Then, I would need to separate chords from lyrics within those sections. Finally, some additional processing would be needed to identify individual chords for counting purposes and to extract either just the chords or just the lyrics for analysis.
The source of our files was Ultimate Guitar. It’s a community resource, so the proofreading process is uncertain. It became apparent that lines were not separated logically as they would be if I or a professional made the chord charts. Instead of the lines being divided by musical phrases/chord progressions, they seemed to be divided by simply what looked the best on the page. This limited our ability to analyze actual chord progressions, so instead we focused on chord usage per song/artist.
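Counting chord usage per song, as opposed to analyzing progressions, does not depend on how the lines were divided. A minimal sketch of such a count is given here, assuming the chord lines have already been separated from the lyrics; the function name `chord_usage` is hypothetical and is not the project's own analysis code.

```python
from collections import Counter


def chord_usage(chord_lines):
    """Count how often each chord symbol appears across chord-only lines."""
    counts = Counter()
    for line in chord_lines:
        # Chords on a line are separated by runs of spaces
        counts.update(line.split())
    return counts


# Chord lines from one chorus of "Flower Power" as given on Ultimate Guitar
chorus = ["A", "D", "A D", "A", "D", "A D", "F G"]
usage = chord_usage(chorus)
```

Because the count ignores line boundaries, it is unaffected by the inconsistent line divisions described above.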
An example of this inconsistent formatting is shown below in an excerpt from Greta Van Fleet’s song “Flower Power.” Each chord is one measure long, so a logical way of dividing the lines of lyrics and chords would be to have each line correspond with one measure. But the original formatting confusingly makes it seem more complex than that.
Figure 1: Original Sample
[Chorus]
A
Turn tonight, firelight
D
Star shines in her eye
A D
Makes me feel like I’m alive
A
She’s outta sight, yeah
D
Aw yeah
A D
She’s alright, she’s alright, she’s alright
F G
She’s outta sight, outta sight
Figure 2: Ideally Edited Sample (for the purpose of this example)
[Chorus]
A
Turn tonight, firelight
D
Star shines in her eye
A
Makes me feel like
D
I’m alive, she’s outta sight
A
Yeah
D
Aw yeah
A
She’s al - right, she’s alright
D
She’s al - right, she’s outta sight
F G
Outta sight
One more important note about the raw text: the chords are placed directly over the word, or syllable of the word (e.g., al - right), that is being sung when that chord is played. In the above example, the chords were placed properly above the words; the easiest way for me to check this is to listen to the song and pay attention to when the chord changes. In some of the chord charts, I noticed that less care was taken to preserve the placement of the chords. For this reason, as well as the sometimes confusing line divisions, I made the decision that we would not make the effort to preserve the chords’ placement with the lyrics. We later discovered a way to accomplish this (see section below on ChordPro), but we cannot see the benefits of it in this project other than beautifully and accurately displaying the final XML output. The objective of the project was not to improve the displaying of the chord charts but rather to analyze them.
There is, however, something to be said for noting the duration of the chords
in the markup. My preliminary research into the MEI showed that they use an
attribute @dur in their <chord> elements to
indicate the musical duration of the chord. This would allow for chord
progressions to be analyzed more completely by our Python analysis tools without
having to preserve the spacing between the chords and make the technology
understand that each line is, say, one bar. This, however, was not within the
scope of this project, because unless there is an AI tool that can accurately
listen to music and recognize chord changes, the durations of the chords would
have to be marked by hand for every chord in every song.
All of this raises the question: “Is chord notation declaring itself to be markup in its own right?” A form of music that is easier for many musicians to read is not as easy for a computer to understand, because of the implicit information that is purposely not marked in chord charts. Is it worth encoding chord charts more deliberately to support research and analysis? MEI preserves music well in academia; MusicXML outputs MIDI to be played as music; and ChordPro allows humans to write chord charts and accurately place the chords with the words to support performers. Can chord charts allow for a deeper analysis and comparison of music? Our experience with ixml suggests there is hope for this work!
Implementation and Reflection on ixml in the Project
As previously mentioned, this project began with the MEI in mind; the ixml reflects that. Below is an example of a song (as we saw it in oXygen Editor with visible space and newline characters), the full ixml grammar, and XML output after processing the song with the ixml grammar:
Figure 3: Raw Text Chord Chart for Greta Van Fleet’s Flower Power
Flower·Power↵ From·the·Fires↵ Greta·Van·Fleet↵ A↵ ↵ [Intro]↵ A·D·A·D·A·D·A·D↵ ·↵ ·↵ [Verse·1]↵ A····················D↵ ·She·is·a·lady,·comes·from·all·around↵ A····························D↵ ·She's·many·places,·but·she's·homeward·bound↵ ············A↵ And·now·she·walks·kinda·funny↵ I·think·she·knows↵ D↵ Day·by·day·by·day↵ Our·love·grows↵ A···················D↵ ·She's·a·lantern·in·the·night↵ She's·outta·sight↵ ·↵ ·↵ [Pre-Chorus]↵ A·······················D↵ ·Ma·ma·ma·ma·ma·ma·ma·ma·ma·ma·ma·ma↵ A·······························D↵ Ma·ma·ma·ma·ma·ma·ma·ma·ma·ma·ma·ma↵ Ma·ma↵ Hey↵ ·↵ ·↵ [Chorus]↵ A↵ Turn·tonight,·firelight↵ D↵ Star·shines·in·her·eye↵ A···················D↵ ·Makes·me·feel·like·I'm·alive↵ ···················A↵ She's·outta·sight,·yeah↵ D↵ Aw·yeah↵ ········A·····························D↵ She's·alright,·she's·alright,·she's·alright↵ ·························F····G↵ She's·outta·sight,·outta·sight↵ ·↵ ·↵ [Bridge]↵ A·D·A·D↵ ·↵ ·↵ [Verse·2]↵ A···························D↵ ·Electric·gold·our·love·with·tender·care↵ A·····················D↵ ·Hills·of·satin·grass·and·maidens·fair↵ ········A↵ Now·she·rides·through·the·night↵ On·a·silver·storm↵ D↵ Sword·in·hand↵ Our·fate's·torn↵ A···················D↵ ·She's·a·sparrow·of·the·dawn↵ Our·love·is·born↵ ·↵ ·↵ [Pre-Chorus]↵ A·······················D↵ ·Ma·ma·ma·ma·ma·ma·ma·ma·ma·ma·ma·ma↵ A·······························D↵ Ma·ma·ma·ma·ma·ma·ma·ma·ma·ma·ma·ma↵ Ma·ma↵ Hey↵ ·↵ ·↵ [Chorus]↵ A↵ Turn·tonight,·firelight↵ D↵ Star·shines·in·her·eye↵ A···················D↵ ·Makes·me·feel·like·I'm·alive↵ ···················A↵ She's·outta·sight,·yeah↵ D↵ Aw·yeah↵ ········A·····························D↵ She's·alright,·she's·alright,·she's·alright↵ ·························F····G↵ She's·outta·sight,·outta·sight↵ ·↵ ·↵ [Solo]↵ A···D↵ Yeah↵ A↵ Oh·yeah↵ D·A↵ ···Oh·yeah↵ Oh·yeah↵ ·······D·A·D·A······D↵ Oh·yeah·······papapa↵ A·D↵ ···Oh·yeah↵ ·↵ ·↵ [Verse·3]↵ A···········G···········D↵ ·As·the·days·pass·by·my·mind↵ A··············G↵ 
·Are·the·wrong,·the·right↵ ·······D↵ You·are·my·sunshine↵ A················G··········D↵ ·And·as·the·night·begins·to·die↵ A··············G···················D↵ ·We·are·the·morning·birds·that·sing·against·the·sky↵ ·↵ ·↵ [Interlude]↵ A·G·D·A·G·D↵ A·G·D·A·G·D↵ A·G·D·A·G·D↵ A·G·D·A·G·D↵ A·G·D·A·G·D↵ A·G·D·A·G·D↵ A·G·D·A·G·D↵ A·G↵
Figure 4: GretaVanZeppelin Project’s ixml
mei: music.
music: title, newline, album, newline, artist, newline, key, newline, newline*, section++newline.
title: ~[#d;#a]+.
album: ~[#d;#a]+.
artist: ~[#d;#a]+.
key: ~[#d;#a]+.
section: type, mdiv.
@type: -"[", ~[#22]+, -"]".
mdiv: ~[#22]+.
-newline: (#d?, #a).
-space: " ".
Figure 5: ixml Output File for Flower Power
<?xml version="1.0" encoding="UTF-8"?>
<mei ixml:state="ambiguous" xmlns:ixml="http://invisiblexml.org/NS"><music>
<title>Flower Power</title>
<album>From the Fires</album>
<artist>Greta Van Fleet</artist>
<key>A</key>
<section type="Intro"><mdiv>
A D A D A D A D
</mdiv></section>
<section type="Verse 1"><mdiv>
A D
She is a lady, comes from all around
A D
She's many places, but she's homeward bound
A
And now she walks kinda funny
I think she knows
D
Day by day by day
Our love grows
A D
She's a lantern in the night
She's outta sight
</mdiv></section>
<section type="Pre-Chorus"><mdiv>
A D
Ma ma ma ma ma ma ma ma ma ma ma ma
A D
Ma ma ma ma ma ma ma ma ma ma ma ma
Ma ma
Hey
</mdiv></section>
<section type="Chorus"><mdiv>
A
Turn tonight, firelight
D
Star shines in her eye
A D
Makes me feel like I'm alive
A
She's outta sight, yeah
D
Aw yeah
A D
She's alright, she's alright, she's alright
F G
She's outta sight, outta sight
</mdiv></section>
<section type="Bridge"><mdiv>
A D A D
</mdiv></section>
<section type="Verse 2"><mdiv>
A D
Electric gold our love with tender care
A D
Hills of satin grass and maidens fair
A
Now she rides through the night
On a silver storm
D
Sword in hand
Our fate's torn
A D
She's a sparrow of the dawn
Our love is born
</mdiv></section>
<section type="Pre-Chorus"><mdiv>
A D
Ma ma ma ma ma ma ma ma ma ma ma ma
A D
Ma ma ma ma ma ma ma ma ma ma ma ma
Ma ma
Hey
</mdiv></section>
<section type="Chorus"><mdiv>
A
Turn tonight, firelight
D
Star shines in her eye
A D
Makes me feel like I'm alive
A
She's outta sight, yeah
D
Aw yeah
A D
She's alright, she's alright, she's alright
F G
She's outta sight, outta sight
</mdiv></section>
<section type="Solo"><mdiv>
A D
Yeah
A
Oh yeah
D A
Oh yeah
Oh yeah
D A D A D
Oh yeah papapa
A D
Oh yeah
</mdiv></section>
<section type="Verse 3"><mdiv>
A G D
As the days pass by my mind
A G
Are the wrong, the right
D
You are my sunshine
A G D
And as the night begins to die
A G D
We are the morning birds that sing against the sky
</mdiv></section>
<section type="Interlude"><mdiv>
A G D A G D
A G D A G D
A G D A G D
A G D A G D
A G D A G D
A G D A G D
A G D A G D
A G
</mdiv></section>
</music></mei>
The way I hear and see the ixml in my head is, “[what’s on the left side of the colon] contains [what’s on the right side of the colon] followed by [anything else, separated by a comma].”
Beginning at the first line, the root element named <mei>
and the element following (<music>) are named as such simply
for the purpose of being adaptable to MEI guidelines in the future. Of course
now, we realize that may not be all that beneficial. All of the subsequent
elements are contained in the secondary root element <music>.
There are four lines of metadata which are there to identify the song. Our team
added the <key> metadata to the original documents so that
the chords could eventually be identified in a number system for more meaningful
analysis.[16]
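To illustrate why the key is useful metadata, here is a minimal sketch of one possible number system: each chord root is expressed as a scale degree relative to the key. The major-key assumption, the lookup tables, and the function names are all assumptions made for this illustration; the project may number chords differently.

```python
# Semitone offsets of note names within an octave
NOTE_TO_SEMITONE = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10,
    "Bb": 10, "B": 11,
}

# Interval above the key note, labeled as a degree of the major scale
INTERVAL_TO_DEGREE = {
    0: "1", 1: "b2", 2: "2", 3: "b3", 4: "3", 5: "4",
    6: "b5", 7: "5", 8: "b6", 9: "6", 10: "b7", 11: "7",
}


def chord_root(chord: str) -> str:
    """Return the root note: a letter plus an optional sharp or flat."""
    if len(chord) > 1 and chord[1] in "#b":
        return chord[:2]
    return chord[:1]


def degree(chord: str, key: str) -> str:
    """Express a chord's root as a scale degree relative to a major key."""
    interval = (NOTE_TO_SEMITONE[chord_root(chord)] - NOTE_TO_SEMITONE[key]) % 12
    return INTERVAL_TO_DEGREE[interval]


# In the key of A, the chords A and D become degrees 1 and 4
degrees = [degree(c, "A") for c in ["A", "D", "G", "F"]]
```

Expressed this way, songs in different keys could be compared by the function of each chord rather than by its letter name.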
Then, each <section> element is a different section of the
song (Verse, Chorus, etc.). These section headings are noted by square brackets
[] in the original documents. While the development of the ixml
was difficult as a beginner, it was nothing too challenging until the first
major setback: not recognizing the ends of sections. There were no square
brackets within the lyrics or the chords themselves—only the section
headings—so, logically, a section ends and a new section begins when an opening
square bracket [ is found. This is a concept we practiced in class
using regular expressions in a search-and-replace context. However, this concept
didn’t translate one-to-one with ixml, because we also had to account for the
two (rather, most of the time two) blank lines
between sections. Dr. Birnbaum’s
study of ambiguity in ixml helped us solve this problem of parsing
the sections, specifically his notes on the double plus sign (++) used to help
us define the separation between sections.[17]
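For comparison, the regular-expression view of the same idea (a section runs from its bracketed heading to the next heading, with a varying number of blank lines in between) might be sketched as follows. This is an illustrative assumption, not the class exercise or the project's code; the name `split_sections` is hypothetical.

```python
import re

# A heading is a line consisting only of text in square brackets
HEADING = re.compile(r"^\[([^\]\n]+)\][ \t]*$", re.MULTILINE)


def split_sections(text):
    """Return (heading, lines) pairs; a section ends where the next heading begins."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        # Discard blank or whitespace-only lines, however many there are
        lines = [line for line in body.splitlines() if line.strip()]
        sections.append((match.group(1), lines))
    return sections


sample = (
    "[Intro]\nA D A D\n \n \n"
    "[Verse 1]\nA      D\n She is a lady\n\n"
    "[Chorus]\nA\nTurn tonight, firelight\n"
)
sections = split_sections(sample)
```

The regular expression can simply skip the padding lines, whereas a grammar has to account for every character of input, which is where the separator between sections becomes important.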
We did, however, encounter an ambiguity problem that was beyond our capacity
to solve in ixml. This involved differentiating the multi-character chord
symbols from lyrics. These multi-character chords appear periodically in both
artists’ songs (as well as throughout Ultimate Guitar), and they provide
musicians who read chord charts with valuable information called “extensions” that make the chords more interesting and more accurately representative of how the artist originally played them. Below is a famous example of some of these chords:
Figure 6: Snippet from Led Zeppelin’s Stairway to Heaven
[Verse 1]
There's
Am Ammaj9
a lady who's sure
Am7 D/F#
All that glitters is gold
Fmaj7 G Am
And she's buying a stairway to heaven.
This example contains a good variety of the different possible chord extensions: a lowercase “m” for minor, “mmaj9” for a minor chord with a major 9th, and “D/F#” to indicate a D chord with an F# as the bass note of the chord.
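The parts of a chord symbol described above can be pulled apart with a regular expression once a token is known to be a chord. The sketch below is a hedged illustration: the list of quality keywords (`maj`, `min`, `dim`, `aug`, `sus`, `add`, `m`, digits, and accidentals) is an assumed vocabulary rather than a complete account of chord notation, and `parse_chord` is a hypothetical helper.

```python
import re

CHORD = re.compile(
    r"^(?P<root>[A-G][#b]?)"                                   # root note
    r"(?P<quality>(?:maj|min|dim|aug|sus|add|m|[0-9#b])*)"     # extensions
    r"(?:/(?P<bass>[A-G][#b]?))?$"                             # optional bass
)


def parse_chord(symbol: str):
    """Split a chord symbol into root, quality, and bass; None if no match."""
    match = CHORD.match(symbol)
    return match.groupdict() if match else None


parsed = [parse_chord(s) for s in ["Am", "Ammaj9", "Am7", "D/F#", "Fmaj7"]]
```

Even this narrower pattern cannot settle every case by itself: a token such as "A" or "Am" is a valid chord and also a possible lyric word, which is the ambiguity discussed in this section.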
It would be one thing if all the chords were one letter, but the complexity of the chords was too much for us to figure out how to represent in an ixml grammar. There is also the fact that the chord charts do not strictly follow the form of one line of chords followed by one line of lyrics. As seen in a previous figure of Greta Van Fleet’s “Flower Power,” there are sections that contain exclusively chords. There is also the possibility of sections of lyrics with no chords above them. To solve this problem, Dr. Beshero-Bondar developed a “monstrous Regex line,” which we implemented with xsl:analyze-string in a single XSLT template, run through our XProc pipeline following the ixml stage:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">
    <!-- Copy everything unchanged unless a template says otherwise -->
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:output method="xml" indent="yes"/>
    <xsl:template match="/">
        <xsl:apply-templates/>
    </xsl:template>
    <!-- Split the text of each mdiv into chord lines and lyrics -->
    <xsl:template match="mdiv">
        <xsl:analyze-string select="."
            regex="\n(\s*([A-Z][#ba-z/0-9]*) *([A-Z][#ba-z/0-9]*)?)*\n">
            <xsl:matching-substring>
                <chordLine>
                    <!-- Wrap each non-empty token in its own chord element -->
                    <xsl:for-each select="tokenize(., '\s+')">
                        <xsl:if test="current() ! matches(., '\S')">
                            <chord><xsl:value-of select="current()"/></chord>
                        </xsl:if>
                    </xsl:for-each>
                </chordLine>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <lyrics>
                    <xsl:value-of select=". ! normalize-space()"/>
                </lyrics>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
The reason this worked is that a chord line consists of one or more capitalized “constructions” (chords) in a row that can combine letters and numbers. This single XSLT template was all we needed to complete our transformation into full XML, and it represented a successful markup pipeline in our project. I continued by adding some additional attributes to the chords for further processing. To learn more about this, or about the entire process including the XProc pipeline, see my “Methods” page, which was perhaps our most significant development from this semester’s project.
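The same line-by-line decision can be sketched in Python for readers who want to experiment outside XSLT. Note one difference: the character class in the XSLT pattern includes a-z, so a line holding a single capitalized word (such as "Hey") can also match as a chord line. The sketch below instead uses a narrower chord vocabulary, which is an assumption for this illustration and not the project's method; `classify` is a hypothetical helper.

```python
import re

# Assumed chord vocabulary, narrower than the XSLT character class
CHORD_TOKEN = re.compile(
    r"[A-G][#b]?(?:maj|min|dim|aug|sus|add|m|[0-9#b])*(?:/[A-G][#b]?)?"
)


def classify(line: str):
    """Label a line as a chord line (list of chords) or lyrics (clean text)."""
    tokens = line.split()
    if tokens and all(CHORD_TOKEN.fullmatch(token) for token in tokens):
        return ("chordLine", tokens)
    # Joining the tokens normalizes whitespace, like normalize-space()
    return ("lyrics", " ".join(tokens))


body = [
    "A                   D",
    " Makes me feel like I'm alive",
    "Hey",
    "Am7     D/F#",
]
labeled = [classify(line) for line in body]
```

A lyric line made up only of tokens that are also valid chord names (for example a lone "A") would still be labeled as a chord line, so some manual checking remains necessary.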
What if We Had Used ChordPro? The Potential of ixml
That last step of separating lyrics and chord lines with XSLT and regular expressions may have been unnecessary had we used ChordPro. Created in 1991, ChordPro predates MEI and MusicXML by almost a decade.[18] It began as a chord notation system and has become a fully functional program for creating chord charts, using Perl to produce output.[19] Unlike human-readable chord charts that place the chords in a separate line above the lyrics, ChordPro places chords inline, in square brackets, directly in front of the word or even syllable of the word so that their position is preserved no matter the font or spacing. The nature of this inline style of chord placement would allow the chords to remain properly placed within the lyrics after it has been converted to markup.
Figure 7: Sample of a Song Written in ChordPro
Flower Power
From the Fires
Greta Van Fleet
A
{start_of_intro}
[A] [D][A][D][A][D][A][D]
{end_of_intro}
{start_of_verse}
[A] She is a lady, comes from[D]all around
[A] She's many places, but she's[Dmaj7add9#13]homeward bound
{end_of_verse}
This small example was slightly altered from the original text for debugging purposes to include a chord with many extensions and the end of the verse section.
During the week of Balisage 2025, the writers of this paper, along with Dr.
Birnbaum, set out to see whether a chord chart written in ChordPro would
indeed allow ixml to process it entirely (meaning that chords would also be marked up
separately from the lyrics), taking it further than I originally did in the
project’s initial stages. I can confirm that developing this version of the
ixml was not any faster than developing XSLT to do the same
thing (an initial concern with using ixml in the project at all). However, it
does, in fact, work! Where the previous ixml was only able to distinguish
metadata and song sections (leaving lines, chords, and lyrics as unmarked
blobs of text), the newly developed ixml, which is to be run
over chord charts in ChordPro format, processes lines, chords, and lyrics with no
ambiguity.
Figure 8: ixml for Songs in ChordPro
xml: metadata, music.
metadata: title, newline, album, newline, artist, newline, key, newline, newline+.
title: ~[#d;#a]+.
album: ~[#d;#a]+.
artist: ~[#d;#a]+.
key: ~[#d;#a]+.
music: section++(newline, newline+), newline?.
section: type, newline, line++newline, newline, outro.
@type: -"{start_of_", ~["}"]+, -"}".
-outro: -"{end_of_", -~["}"]+, -"}".
line: lineContent.
-lineContent: nullableText, (chord++nullableText, nullableText)?.
chord: -"[", ~["]"]+, -"]".
-nullableText: ~["[]{}";#a;#d]*.
-newline: (-#d?, -#a).
Many thanks to Dr. Birnbaum for figuring out how to make the mixed text
content of a ChordPro line unambiguous. The definition of
-lineContent and -nullableText was the secret:
(chord++nullableText, nullableText)?. It means that the text of a line may or
may not be present at all, and when chords appear, they may or may not be
bounded by text (which might not be there at all).
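To make the shape of that rule concrete, here is a rough Python analogue of what the grammar does to a single ChordPro line. It is a sketch under assumptions, not a substitute for the ixml: parse_line is a hypothetical helper, and unlike the grammar it does not reject stray braces or unbalanced brackets.

```python
import re
import xml.etree.ElementTree as ET

# A chord is anything between square brackets, as in the ixml rule for chord.
CHORD = re.compile(r"\[([^\]]+)\]")


def parse_line(line: str) -> ET.Element:
    """Turn one ChordPro line into a <line> element with mixed content."""
    # Splitting on a capturing group yields: text, chord, text, chord, ..., text.
    # Any of the text pieces may be empty, just like nullableText.
    parts = CHORD.split(line)
    element = ET.Element("line")
    element.text = parts[0] or None
    for chord_name, following_text in zip(parts[1::2], parts[2::2]):
        chord = ET.SubElement(element, "chord")
        chord.text = chord_name
        chord.tail = following_text or None
    return element
```

Calling ET.tostring(parse_line("[A] She is a lady, comes from[D]all around"), encoding="unicode") serializes the line with each chord wrapped in its own element and the lyric text left in place around it.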
Figure 9: Output of a Song Written in ChordPro Processed by Our New ixml
<xml>
<metadata>
<title>Flower Power</title>
<album>From the Fires</album>
<artist>Greta Van Fleet</artist>
<key>A</key>
</metadata>
<music>
<section type='intro'>
<line>
<chord>A</chord>
<chord>D</chord>
<chord>A</chord>
<chord>D</chord>
<chord>A</chord>
<chord>D</chord>
<chord>A</chord>
<chord>D</chord>
</line>
</section>
<section type='verse'>
<line>
<chord>A</chord> She is a lady, comes
<chord>D</chord>from all around</line>
<line>
<chord>A</chord> She's many places, but she's
<chord>Dmaj7add9#13</chord>homeward bound</line>
...
</section>
...
</music>
</xml>
With this breakthrough, the next steps for the project were to set up and run
this new ixml via XProc over the selected texts. The setup was simple:
reorganize the project’s GitHub repository into phase-1 and phase-2-ChordPro and start the process from scratch
again, beginning with new raw-text files in ChordPro format. This step had an
unforeseen issue: there was no proper ChordPro chord chart resource known to us.
The best solution we had discovered was this converter webpage, which
is designed to input chord charts from Ultimate Guitar and output them in
ChordPro format. Sounds perfect! However, besides the fact that the converter
only recognized Verse and Chorus and considered all
other section names comments, it appears that it also struggles with
the inherent ambiguity of chord charts! It works successfully when the input is
as follows: section title, line of chords, line of lyrics, line of chords, line
of lyrics, etc. But it struggles when the chart deviates from this pattern
(which 99% of our charts do). So, perhaps with our newfound logic for deciding
whether a line contains chords, lyrics, or both, we could create our own system
for accurately converting chord charts to ChordPro format, which would then prepare
them for use with our ixml. But, for now, I simply converted, then manually
edited, one album (eight songs) for the purpose of testing our new ixml on a
larger collection in our pipeline. After learning to add the
@serialization attribute to the <p:store>
step in the XProc pipeline so that we could properly indent and read each of the
outputted XML files, we successfully ran our new ixml on a collection of
resources! This also means that we eliminated one of the XSLT transformations
from the original pipeline and more accurately preserved the chords’
placement with the lyrics, which means more, and more accurate, data to
analyze.
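The idea of building our own converter, raised above, could begin from something like the following Python sketch. It is purely hypothetical (we have not built this, and merge_chords_into_lyrics is an invented name): it assumes a chord line has already been paired with the lyric line beneath it, and that the chart uses spaces rather than tabs so that column positions are meaningful.

```python
import re


def merge_chords_into_lyrics(chord_line: str, lyric_line: str) -> str:
    """Insert each chord, in ChordPro brackets, at its column in the lyric line."""
    # Pad the lyrics so that a chord placed past the last word still has a column.
    merged = lyric_line.ljust(len(chord_line))
    # Work from right to left so earlier column offsets stay valid after insertion.
    for match in reversed(list(re.finditer(r"\S+", chord_line))):
        column = match.start()
        merged = merged[:column] + "[" + match.group() + "]" + merged[column:]
    return merged.rstrip()
```

Deciding which lines are chord lines, and which lyric line each one belongs to, is the hard part that the existing converter struggles with; this sketch only covers the final merging step.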
Most significantly, it seems unlikely that any creators or users of ChordPro are currently aware of ixml and its ability to read ChordPro and turn it into XML. In our project, ChordPro allows for both a more accurate preservation of the chord charts and a more accurate representation of entire chord progressions. I certainly think it would be interesting to analyze something like the modern artist Greta Van Fleet directly copying a chord progression from a verse in a Led Zeppelin song.
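As a first step toward that kind of comparison, the XML shown in Figure 9 could be read with Python along these lines. This is a sketch only: the function names are hypothetical, the element and attribute names are taken from our ixml output, and an exact match of chord names is a much cruder test than a real analysis of borrowed progressions would need.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def progressions_by_section(xml_text: str) -> List[Tuple[str, List[str]]]:
    """List the chords of each <section>, in document order, with its type."""
    root = ET.fromstring(xml_text)
    result = []
    for section in root.iter("section"):
        chords = [chord.text or "" for chord in section.iter("chord")]
        result.append((section.get("type", ""), chords))
    return result


def contains_progression(progression: List[str], candidate: List[str]) -> bool:
    """Return True when candidate occurs as a contiguous run inside progression."""
    size = len(candidate)
    if size == 0:
        return False
    return any(
        progression[start:start + size] == candidate
        for start in range(len(progression) - size + 1)
    )
```

Converting the chords to Nashville numbers before comparing would let the same check work across songs in different keys.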
What Do We Gain From Learning and Teaching Invisible XML?
Was this worth the effort, and was Invisible XML ready for undergraduates in an
algorithmic text analysis course? Most student project teams were not motivated to
apply
Invisible XML in their projects, and we did not require them to do so. (We required only
that students give the technologies a try in their homework assignments.) For those
that
were motivated (particularly the authors of this paper), there was an interest in
the
technologies themselves cultivated by drafting new documentation and training resources
for the class, but also an awareness that Invisible XML was not strictly necessary
to
their projects: regular expression matching and XSLT could have sufficed more quickly.
Yet there is satisfaction in writing a successful grammar and a simplification in
the
documentation, as well. Invisible XML declares the patterns of a
text document to be the defining grammar of XML nodes, and this is both less
and more compared to scripting a process with a sequential set of regular
expression search-and-replace operations. It is more in the sense
of sheer effort to ensure the operations work, and also more
definitive as a grammar than a convenient string-match, and perhaps
more in the sense of requiring the installation of CoffeePot or
Markup Blitz to process it rather than built-in regular expression search features
in a
coding IDE like oXygen or the re library in Python. But it is
less in the precise elegance of a grammar that expresses an
expectation of its source documents, and (in our experience) fewer lines of code,
though
each line scripted with exacting care. Each approach puts students in the position
of
writing declaratively what needs to become their data structure, and perhaps the
thinking process required of Invisible XML might be different from writing sequential
regular expression recipes in the way that writing poetry differs from prose. Poetry,
like Invisible XML, attends to more dimensions of expression for every meaningful
and
resonant word and punctuation mark, by contrast with prose and step-by-step
search-and-replace operations.
Certainly, for all its expressive power, Invisible XML cannot replace the prosaic
versatility of regular expressions as a Swiss-army knife for many
different text analysis purposes. And we could have taught our course without it.
But
the experience may have enhanced other kinds of learning in the course. For example,
we
found that students approached Python with better understanding of algorithmic
pipelines, more clarity and less trouble with installation processes, and especially
greater comfort with command line processes. Perhaps there was something more: The
appreciation of declarative methods this year was
balanced differently with the imperative programming of
Python. Are students understanding the formatting and processing of text differently
when encountering Invisible XML? In making Invisible XML a topic for homework and
demonstration, students encountered the concept of grammars, in context with schema
validation and regular expression matching—related to these but fundamentally different.
The declaration of patterns, and the experience of writing grammars explicitly to
work
across platforms (Windows, Mac, Linux systems) may have enriched their experience
of
moving between declarative and imperative methods of text handling, and helped them
to
reflect on a certain fluidity of methods.
Our experience of this course helped to bridge distinct cultures in text analysis
that
are perhaps better connected than experienced in isolation. Perhaps we have found
a
place to assert the value of declarative methods in the 2020s, a time when natural
language processing of sequential strings dominates the development of Large Language
Models and artificial intelligence. The movement from raw text
to identifying structures as nodes, and then extracting text from nodes
for meaningful natural language processing, with findings expressed, visualized, and
output on websites, involves a round-trip adventure with text and code formats. Perhaps
the most significant application of Invisible XML is the agency it gives the student
coder in constructing their own pattern recognition as XML. The experience of developing
and processing Invisible XML in a text analysis course makes for a multi-dimensional
experience of texts, finding value in applying different methods to structured
architectures and unstructured sequences. Thus we find Invisible XML a worthwhile
experience for our digital humanities students and look forward to experimenting with
it
in future iterations of our text analysis course.
[1] Several papers in the Proceedings of Balisage: The Markup Conference vol. 29
(2024) demonstrated exciting applications of Invisible XML connected to careful
document planning and analysis with exciting implications for interface design.
See Joseph Michael Courtney and Michael Robert Gryk, “Pulse, Parse, and
Ponder: Using Invisible XML to Dissect a Scientific Domain Specific
Language,” https://doi.org/10.4242/BalisageVol29.Courtney01; Mary Holstege,
“Invisible Fish: API Experimentation with InvisibleXML,”
https://doi.org/10.4242/BalisageVol29.Holstege01; John Lumley,
“Variations on an Invisible Theme: Using iXML to produce XML to
produce iXML to produce ...,”
https://doi.org/10.4242/BalisageVol29.Lumley01; Ari Nordström,
“Adventures in Mainframes, Text-based Messaging, and iXML,”
https://doi.org/10.4242/BalisageVol29.Nordstrom01; C. M.
Sperberg-McQueen, “From Word to XML via iXML: a Word-first XML workflow in
the TLRR 2e project,”
https://doi.org/10.4242/BalisageVol29.Sperberg-McQueen01; Bethan
Tovey-Walsh, “When women do algorithms: a semi-generative approach to
overlay crochet with iXML and XSLT,”
https://doi.org/10.4242/BalisageVol29.Tovey-Walsh01.
[2] Dr. Birnbaum’s installation instructions, which he claimed were developed
for his primary audience (himself): “Configuring XProc and ixml
processors,” http://dh.obdurodon.org/ixml-and-xproc-configuration.xhtml
(2025). Last accessed 2025-07-02.
[3] Digit 210’s set of installation instructions was initially drafted by Dr. Beshero-Bondar; the battle-testing students made contributions as they ran into their own issues, and the final documents were proofread and heavily edited by Michael Simons. Version for MacOS: https://github.com/newtfire/textAnalysis-Hub/blob/main/Installations/InstallNotes-Mac.md (2025). Version for Windows: https://github.com/newtfire/textAnalysis-Hub/blob/main/Installations/InstallNotes-Win.md (2025).
[4] Preliminary set of student-developed instructions for installing OpenJDK via a package manager. Version for MacOS: https://github.com/newtfire/textAnalysis-Hub/blob/main/Installations/OpenJDK-mac.md (2025). Version for Windows: https://github.com/newtfire/textAnalysis-Hub/blob/main/Installations/InstallNotes-Win.md (2025).
[5] Digit 210 assignment: “Create Shell Aliases and System ‘Dot
Files,’” https://newtfire.org/courses/tutorials/command-line-aliases.html
(2025). Last accessed 2025-07-02.
[6] David J. Birnbaum, “Lesson plan: Invisible XML and XProc”
(hosted on Obdurodon.org), http://dh.obdurodon.org/ixml-and-xproc-lesson-plan.xhtml (2025).
Last accessed 2025-07-02.
[7] Norm Tovey-Walsh, “Writing Invisible XML grammars,”
https://www.xml.com/articles/2022/03/28/writing-invisible-xml-grammars/
(2022). Martin Kraetke, “XProc 3.0 Tutorial,”
https://xporc.net/xproc-tutorial/. Last accessed
2025-07-02.
[8] John Lumley, “jωiXML processor” (an online resource for
easily processing text with an iXML grammar), https://johnlumley.github.io/jwiXML.xhtml (2024). Last accessed
2025-07-02.
[9] Music Encoding Initiative, “An introduction to MEI,”
https://music-encoding.org/about/ (2025). Last accessed
2025-07-02.
[10] Locke Anthology Project, https://newtfire.github.io/locke-anthology2.0/music.html (2024). Last accessed 2025-07-02.
[11] For more information on MIDI, see https://midi.org/about-midi-part-1overview and https://midi.org/specs. Last accessed 2025-07-02.
[12] MusicXML, “MusicXML Software,” https://www.musicxml.com/software/. For more information on
how MusicXML compares to MEI, see “Music XML Introduction and Comparison”
(OpenSheetMusicDisplay.org, 2025). Last accessed 2025-07-02.
[13] W3C, “MusicXML 4.0 Tutorial: Chord Symbols and Diagrams,”
https://www.w3.org/2021/06/musicxml40/tutorial/chord-symbols-and-diagrams/
(2021). Last accessed 2025-07-02.
[14] Planning Center Services is the
industry-standard program for planning church worship services, and it
includes the ability to easily make and format chord charts using
ChordPro. See “Preventing charts from shifting using Chord Pro”
(updated 2025) to learn more about how they implement ChordPro, or watch
someone use ChordPro within Planning Center: https://www.youtube.com/watch?v=GS4GIw_0LQk (2018). Last
accessed 2025-07-02.
[15] ChordPro on GitHub, https://github.com/ChordPro/chordpro. Last accessed 2025-07-02.
[16] See “Converting Chords to Nashville Numbers” on the
GretaVanFleet Project’s Methods page for more information on this
numbering system: https://newtfire.github.io/GretaVanZeppelin/methods.html
(2025).
[17] David J. Birnbaum, “Invisible XML and ambiguity,”
http://dh.obdurodon.org/ixml-ambiguity.xhtml (2025). Last
accessed 2025-07-02.
[18] “History of Chord Pro,” https://www.chordpro.org/chordpro/chordpro-history/. The
authors find it disappointing that there seems to be no mention of
ChordPro in the documentation of either MEI or MusicXML. (See also
“An introduction to MEI” for a history of MEI and
“Music XML Introduction and comparison” for a
history of MusicXML.)
[19] A Perl Module file in ChordPro’s library for producing chord chart output, https://github.com/ChordPro/chordpro/blob/master/lib/ChordPro/Output/HTML.pm. Last accessed 2025-07-02.