Cayless, Hugh. “(Re)building the TEI Website: A Bit of History and New Directions.” Presented at Balisage: The Markup Conference 2025, Washington, DC, August 4 - 8, 2025. In Proceedings of Balisage: The Markup Conference 2025. Balisage Series on Markup Technologies, vol. 30 (2025). https://doi.org/10.4242/BalisageVol30.Cayless01.
Balisage: The Markup Conference 2025 August 4 - 8, 2025
Balisage Paper: (Re)building the TEI Website: A Bit of History and New Directions
Hugh Cayless
Hugh is a Senior Digital Humanities Developer at Duke University Libraries. He is
the
Treasurer and past Council Chair for the Text Encoding Initiative Consortium.
The Text Encoding Initiative has had a web presence for almost thirty years. It's
instructive to consider how a large, robust, and widely-used XML vocabulary defines
its presence on the web, how it has weathered the storms of change (management, institutional,
technological) to be where it is today, and how it imagines its future.
The Text Encoding Initiative has had an online presence since the early days of the
web. It
has progressed from old school static HTML, to dynamic XML processing systems, to
WordPress, and
most recently to a static site built via Continuous Integration using Eleventy. I
will survey
where the site has been over the years and then talk about how its most recent iteration
handles
some of its architectural quirks, including XML sources!
The Beginnings
The Text Encoding Initiative (TEI), which develops and promulgates a set of guidelines
for
the markup of cultural heritage texts for research purposes, began before the World
Wide Web
existed. Its origins date back to a meeting at Vassar College in Poughkeepsie, New
York in
November 1987.[1] It began as a joint effort of the Association for Computers and the Humanities, the
Association for Literary and Linguistic Computing, and the Association for Computational
Linguistics and was later organized (in 2001) as a consortium. The tei-c.org domain,
the TEI
Consortium's home on the web, was registered on March 22nd, 1999, and the first available
capture in the Internet Archive is from October 9th of that year (figure 1). Before
that, the
site was hosted by the University of Illinois at Chicago, maintained by Wendy Plotkin
and our
own, dearly missed, Michael Sperberg-McQueen (figure 2).[2] The TEI is one of the longest continuously-running Digital Humanities projects in
existence. It serves as the infrastructure for a great many text-based projects worldwide.
The
history of the organization created to support the development of the TEI Guidelines,
the TEI
Consortium (and along with it the website) overlaps with the period archived by the
Internet
Archive, and so we can observe its full history at the tei-c.org domain. It has transitioned
through management on a variety of academic hosts, by a variety of organizations,
and finally
to being self-managed. It contains (and always has) a variety of resource types, with
different and overlapping publishing pipelines. It therefore makes for an interesting
case
study of how scholarly web communication has evolved over the last few decades.
Figure 1: The earliest TEI homepage from 1999
Figure 2: The TEI page at UIC
From there, the TEI site moved to the Institute for Advanced Technology in the Humanities
at the University of Virginia (UVA). Development and site generation, however, were done by
Oxford
University Computing Services (OUCS) and mirrored to the host at UVA. It was in this period
that the
site began to be generated from XML sources, first using an XSLT 1.0 stylesheet and
the Saxon
XSLT processor (figs. 3-5).
Figure 3: DOCTYPE declaration and comment from the front page of the 2004 site
<!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<!--THIS FILE IS GENERATED FROM AN XML MASTER.
DO NOT EDIT-->
Figure 4: Comment from the 2004 site
<!--
Generated using an XSLT version 1 stylesheet
based on http://www.oucs.ox.ac.uk/stylesheets/teihtml.xsl
processed using SAXON 6.3 from Michael Kay-->
Figure 5: The site as it looked in 2004
Because so much of the TEI's early content was static HTML and other formats (including
GML, Waterloo SCRIPT, PostScript, and PDF), that content has for a long time
been stored
in a section of the website known as the Vault. The Vault contains content that may
or may not
be viewable with a web browser and that, even when it is, does not follow the formatting
conventions
(such as menus) used by the rest of the site. Published copies of the TEI Guidelines
go in
there, as well as old project outputs, meeting minutes, etc. A large portion of the
site has
thus always been static.
Dynamic, XML-driven Websites
In 2005, OUCS migrated the website to an Apache Cocoon[3] based system (see fig. 6). Cocoon was an XML-driven web publishing system that
allowed users to define transformation pipelines for a variety of routes and document
types.
It was extremely flexible and was, for a time, much in vogue for XML-based Digital
Humanities
projects. The Cocoon iteration lasted only until the end of 2007, however, when
the site was
migrated to an OpenCMS[4] instance managed by the University of Virginia and customized to process TEI-based
sources into HTML (fig. 7). The new setup promised to allow authenticated users to
directly
modify the site content via the web for the first time, without needing to have access
to
upload source files to the web server. A MediaWiki instance was also added in 2008,
which
allowed broader member participation in producing site content.
Figure 6: The tei-c.org site delivered by Cocoon (2006)
Figure 7: The tei-c.org site delivered by OpenCMS (2008)
The OpenCMS setup lasted a long time—longer than it should have, in fact. Page editing
was
done via a Java Applet and by the mid-2010s this had become very poorly supported
(not to
mention a security risk). In 2014, the site was moved from UVA to the Alliance of
Digital
Humanities Organizations' (ADHO) cluster in Hamburg. In mid-2016 Kevin Hawkins, then
the TEI
web administrator, announced an RFP[5] to migrate the site to WordPress. The plan was to work in two phases. The first
would create a WordPress site with the same look and feel as the OpenCMS site; the
second
would work on refactoring the site to improve its aesthetics and usability. Phase
1 took a
long time and was finally completed in 2018 (fig. 8). Phase 2 never began.
Figure 8: tei-c.org delivered by WordPress
WordPress
Because WordPress doesn't have native support for XML sources, the TEI files in OpenCMS
were converted to HTML as part of the migration. This conversion did not always result
in very
clean HTML. The TEI header information, for example, was dumped in a hidden HTML div
on each
page. Oddities like this did not always work well with WordPress's HTML editor. But
what was
worse, the switchover came scarcely a month before ADHO suffered a major disk failure,
which
disabled their services for an extended period. Since the TEI website was unavailable,
without
an estimated recovery time, we decided to temporarily relocate it to the University
of
Victoria, where our then webmaster, Luis Meneses, worked and therefore had access
to their
computing infrastructure. Luis accomplished this with remarkable efficiency, and the
TEI site
remained at UVic for about a year, but the announcement that Compute Canada would
be reorganized[6] prompted a search for a new host. Laurent Romary suggested Huma-Num[7], the French computing infrastructure for the Social Sciences and Humanities, and
helped arrange the transition. In August 2019 the site and other TEI services moved
to three
virtual machines hosted by Huma-Num, where they have been ever since. All of this
churn meant
that any immediate energy that might have gone into remediating the infelicities of
the new
site was diverted into rescuing it.
While WordPress, since it is a Content Management System like OpenCMS, seemed like
a good
fit for the TEI website, it was in fact not optimal. Authenticated users could edit
pages on
the site and add news articles, etc. but this was often an awkward process due to
the state of
the once-TEI HTML sources. In addition, since the sources had been migrated over in
totality
and not pruned or reorganized, the site's structure itself was extremely unwieldy.
As a
result, site maintenance slipped, and much-desired projects, such as an architectural
redesign
and translations of the site, seemed out of reach. Some of the website's
sources were
(and still remain) TEI XML documents, notably Technical Council documentation and
more
recently the TEI Bylaws. Displaying these in WordPress involved either keeping the
XML and
WordPress HTML versions in sync manually, developing a different publishing pipeline
(like the
one for the TEI Guidelines), or utilizing a custom-developed plugin. All of these
strategies
were used at one time or another but none of them were very satisfactory. OpenCMS
handled
content URLs by directly referencing the source filename, so for example, the homepage
resolved to http://tei-c.org/index.xml. This resulted in incompatibility with WordPress's
URL
conventions, wherein URLs generally end with a forward slash. This problem was solved
with a
1,825-item redirect list, which only added to the unmanageability of the overall site.
Moreover, since WordPress is notoriously vulnerable to being compromised, keeping
it and
its plugins patched was a headache and the fear of being hacked was a source of constant
stress. Another solution seemed called for, but it needed to be able to handle our
idiosyncratic mixture of sources, provide for easy content editing, and offer a much higher
level of
security.
Back to the Future
In 2023, I began an effort to reimagine the TEI website as a static site, with a new
design, and a mission to pare back the cruft. In such a setup, the sources could be
contained
in a GitHub repository, which could handle the CMS functions of the WordPress site
(user
authentication, online editing, etc.). The requirements were 1) support for source
files in
both Markdown and (crucially) TEI, 2) seamless integration with the TEI Guidelines[8], 3) a simpler editing workflow, and 4) a more attractive, up-to-date appearance
with less complexity.
After evaluating several static site generators, including Jekyll and Hugo, I settled
on
Eleventy, a very flexible JavaScript-based static site generator. The requirement
to handle
TEI XML as a source format meant some level of customization would be required, and
Eleventy
makes registering new source types very easy. Moreover, it is written in JavaScript,
a
language I am very familiar with.[9] To get the content out of WordPress, I exported it as an XML dump and wrote a
converter to extract pages by link pattern from the export. The converter, written
in Python,
parsed the XML, pulled out the HTML pages that matched the designated link patterns,
then
converted them to Markdown. This meant I could selectively extract content rather
than
blindly re-create the sprawling mess of the old site. The new site went live on September
3,
2024. The source code is located at https://github.com/TEIC/website and is directly editable by members with commit
access to the repository. Site rebuilds are managed by a GitHub Action and are triggered
by a
push to the main branch or to the Documentation repository[10], which contains the TEI XML sources for TEI Council Working Papers and the TEI
Consortium bylaws. The site typically builds and deploys to the TEI web server in
about 30 seconds.[11]
Balisage attendees and readers will no doubt be interested in the XML-processing pipeline,
which is, I'm afraid, somewhat boring. Eleventy allows developers to add new Template
types in
its configuration, so I created one that handles XML files and compiles them by piping
them
through CETEIcean, a JavaScript library for displaying TEI documents on the web. The
documentation files you see on the site are therefore static HTML with TEI Custom
Elements and
CSS to style them. The whole process is accomplished with fewer than 30 lines of code.
Figure 9: Configuring an XML-based template language
The getData function adds page metadata, including navigation data that
enables the display of links to Documentation pages on the Council Activity page.
The
compile function is the meat of the process, where the source document is loaded
into a DOM and converted to TEI-flavored Custom HTML Elements. These are styled in
the usual
way with CSS.
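As a rough illustration of how such a template type can be registered, the following sketch uses Eleventy's addTemplateFormats and addExtension configuration methods with the getData and compile options described above. It is not the site's actual code: teiToCustomElements is a hypothetical, dependency-free stand-in for the CETEIcean conversion, and the function name configureTeiTemplates and the title and section metadata keys are assumptions made for the example.

```javascript
// Hypothetical stand-in for the CETEIcean step. It naively renames every
// element to a lower-cased, "tei-"-prefixed custom element so that the
// sketch runs without dependencies. It ignores namespaces, comments, and
// self-closing tags, all of which a real conversion must handle.
function teiToCustomElements(xml) {
  return xml
    .replace(/<\?xml[^?]*\?>\s*/, "") // drop the XML declaration
    .replace(/<(\/?)([A-Za-z][\w.-]*)/g, (match, slash, name) =>
      `<${slash}tei-${name.toLowerCase()}`);
}

// In a real project this function would be the export of the Eleventy
// configuration file (the name used here is an assumption).
function configureTeiTemplates(eleventyConfig) {
  // Ask Eleventy to treat .xml files as templates.
  eleventyConfig.addTemplateFormats("xml");

  eleventyConfig.addExtension("xml", {
    // Supplies page metadata for the data cascade. The keys returned
    // here are illustrative, not the ones the TEI site uses.
    getData: async function (inputPath) {
      const title = inputPath.split("/").pop().replace(/\.xml$/, "");
      return { title, section: "Documentation" };
    },
    // Receives the raw file contents and returns a render function.
    compile: async function (inputContent, inputPath) {
      const html = teiToCustomElements(inputContent);
      return async function (data) {
        return html;
      };
    },
  });
}
```

In practice the stand-in would be replaced by a call into CETEIcean running against a DOM implementation available at build time, and the rendered Custom Elements would then be wrapped in the site's layout and styled with CSS.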
The TEI Guidelines are built from their own distinct sources which reside in their
own
GitHub repository. They use a custom XSLT pipeline which produces HTML pages; for
the
current release series, P5, all of them are available in the Vault under https://tei-c.org/Vault/P5/. Full
integration of the TEI Guidelines entails adjusting the CSS they use in the build
process so
that they match the look and feel of the main website. But also, importantly, current
versions
should support the same menus as the rest of the site. The WordPress version relied
on an API
call to fetch the menus as JSON and dynamically write them into the page using JavaScript.
The
new site does something very similar, but with a static JSON representation of the
menu data,
which is used as a data source in the website build procedure and is also made available
directly.
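The idea of driving the menus from a single JSON file can be sketched as follows. The shape of the menu data (label, url, and optional children) and the function names are assumptions for illustration only, since the structure of the site's real menu file is not shown here.

```javascript
// Hypothetical menu data; the real site's JSON structure may differ.
const menu = [
  {
    label: "Guidelines",
    url: "/guidelines/",
    children: [{ label: "P5", url: "/guidelines/p5/" }],
  },
  { label: "About", url: "/about/" },
];

// Escape text for safe inclusion in HTML content and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Recursively render menu items as nested unordered lists.
function renderMenu(items) {
  const listItems = items.map((item) => {
    const submenu =
      item.children && item.children.length ? renderMenu(item.children) : "";
    return `<li><a href="${escapeHtml(item.url)}">${escapeHtml(item.label)}</a>${submenu}</li>`;
  });
  return `<ul>${listItems.join("")}</ul>`;
}
```

Because the function only maps data to a string, the same code could plausibly run at build time for the main site and in the browser for the Guidelines pages, which would fetch the published JSON file and write the result into a navigation element.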
The question of editing workflow improvements is a little harder to quantify. Since
its
release, there have been 271 changes committed to the website repo; of those, 85 are
mine and
the rest by 15 other members of the TEI community.[12] For a similar period of time in 2023–2024, the WordPress site had about 40 page
creations or edits and 7 posts (news items) by about 5 editors. GitHub is likely a
much
friendlier environment for the sort of person who works with TEI, but this still represents
more than a
doubling of both the work done and the number of contributors doing it.
As for the site's appearance and usability, I will leave it to the audience to judge
whether the new site (fig. 12) is an improvement. There remain some broken links and
there is
still content from the old site that needs to be moved over, but that is being done
in
response to community needs and the old site remains available for the time being
in a
read-only state at https://old.tei-c.org/. The new site has been checked against the Web
Content Accessibility Guidelines 2.2 and passes with no violations.[13]
Figure 12: The current TEI website
The new website structure and workflow are also allowing us to develop internationalized
versions of parts of the site, something we have long wished to do, but which proved
difficult
in the WordPress régime. A contributor from Argentina has been adding translations
this
summer, and the site has been configured to deliver a Spanish version of the homepage
if a
user's language preferences have been set accordingly.
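How a reader's language preferences might be matched against the available translations can be sketched as below. This is only one possible approach, assuming the preference arrives as an HTTP Accept-Language header; the mechanism the site actually uses is not described here and could equally be content negotiation configured in the web server. The function names and the list of available languages are hypothetical.

```javascript
// Parse an Accept-Language header into lower-cased language tags,
// ordered from most to least preferred (by q-value, default 1).
function parseAcceptLanguage(header) {
  return header
    .split(",")
    .map((part) => {
      const [tag, ...params] = part.trim().split(";");
      const q = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      const weight = q ? parseFloat(q.slice(2)) : 1;
      return {
        tag: tag.trim().toLowerCase(),
        weight: Number.isNaN(weight) ? 0 : weight,
      };
    })
    .filter((entry) => entry.tag && entry.weight > 0)
    .sort((a, b) => b.weight - a.weight)
    .map((entry) => entry.tag);
}

// Return the first available language the reader prefers, falling back
// to the primary subtag (e.g. "es" for "es-ar") and then to a default.
function pickLanguage(header, available, fallback) {
  for (const tag of parseAcceptLanguage(header || "")) {
    if (available.includes(tag)) return tag;
    const primary = tag.split("-")[0];
    if (available.includes(primary)) return primary;
  }
  return fallback;
}
```

With available languages of "en" and "es", a browser configured for Argentine Spanish would be matched to the Spanish homepage through its primary subtag, while a reader whose preferences name neither language would receive the English default.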
Long-running projects tend to accrete a lot of information, and making that information
public-facing is a difficult job—even more so in an all-volunteer organization. The
TEI site
has evolved over time from a static HTML site, to one pre-generated from XML sources,
then
dynamically generated from those sources, then to an XML-based content management
system, an
HTML-based content management system, and now at last back to a static site, pre-generated
from
mixed sources and managed from an online Git repository. It has in some ways come
full circle,
even though the technologies employed have changed greatly. It is perhaps significant
that the
website management, which for most of the organization's existence was a DIY affair,
has
returned to those roots as well.
Ide, Nancy and C. M. Sperberg-McQueen, "The TEI: History, Goals, and Future,"
Computers and the Humanities 29, pp. 5–15 (1995). doi:https://doi.org/10.1007/BF01830313
[6] Compute Canada, which operated Canada's national advanced research computing platform,
ceased operations in 2022 and responsibility for the platform was handed over to the Digital
Research Alliance of Canada. See 2022 Resource Allocations Competition
Results. As it turned out, the reorganization was less impactful than seemed
likely in 2019.
[7] From the Huma-Num website: "La principale mission de l’IR* est de construire, avec les
communautés et à partir d’un pilotage scientifique,
une infrastructure numérique de niveau international (nœud français des ERIC DARIAH
et CLARIN)
pour les SHS." IR* signifies a 'star' Research Infrastructure, and SHS the Humanities
and
Social Sciences.
[8] Recall that the Guidelines are located in the Vault, but unlike the other content
there should present as part of the website, with the same menus and styling.
[9] And, importantly, enjoy. Jekyll is written in Ruby, a language I am very familiar
with
and hate. Hugo is written in Go, which I have only a passing knowledge of.
[11] The Eleventy build itself takes single-digit seconds to run, but the Action is also
spinning up a container, installing NodeJS, checking out repositories, etc.
[12] My changes are sometimes content and sometimes code edits. The other users' changes
are almost all content.
[13] The Guidelines do get flagged for a large number of violations, for the most part
because the HTML produced has not yet been upgraded to modern HTML5. Various checkers
were
used, including the IBM Equal Access Checker browser plugin.