From Word to XML via iXML

C. M. Sperberg-McQueen

1. Overview

Trials in the late Roman Republic, 149 BC to 50 BC (Alexander 1990) is, in the words of its author, a tabulation of “the known legal facts pertaining to the 391 trials and possible trials, criminal and civil, which date from the last century of the Roman Republic, and about which some information has survived.” For each trial, the book records (if known) the date, the charge or claim made, the defendant, their advocate(s), the prosecutor(s) or plaintiff(s), the presiding magistrate, the jurors, the witnesses, other individuals involved in the case, the verdict, and other salient information.

The TLRR2e project is creating a second edition of this reference work, to be published online. An earlier effort in this direction (here called TLRR2) was reported on by Sperberg-McQueen 2016 but bore no fruit; TLRR2e is a reboot of the effort with new editors.

The material consists of a series of distinct trial records and so invites treatment as a database. The information is fragmentary and highly variable in structure, which further suggests an XML database rather than a relational one.

This paper describes the workflow being developed for the project, with special emphasis on the use of invisible XML to assist the transformation from Microsoft Word to a conventional semantically rich XML vocabulary. The historians editing the second edition start with a copy of the first edition which has been transformed into Word format, make their changes, and submit them to the technical staff in batches of a few tens of trials for processing. The technical staff (that is, the author) is then responsible for transforming the material into XML suitable for querying and display.

The work is performed in several stages, each of which may involve multiple steps:

An XML document in WordML is extracted from the Word file.

An XSLT stylesheet transforms it into a rudimentary XML in a more conventional XML style, preserving paragraph breaks, boldface, and italics, but discarding other information in the Word file since it carries no meaning in these files.

For simplicity, this form is called RXML (for ‘rudimentary XML’).
An invisible-XML processor then parses the RXML data using a series of grammars which describe the organization of the material in ever finer detail, ending with XML in which trial records are represented as trial elements containing a flat sequence of fields with element names like date, charge, defendant, and so on.

The output of this process, with trials and data fields explicitly marked and labeled, we’ll call FXML (for ‘fielded XML’).
Another invisible-XML grammar describes the macro structure of a trial record; the macro structure groups fields into larger structures so that (for example) a defendant and the defendant’s advocate and witnesses are in one group, and the plaintiff and the plaintiff’s advocates are in a different group.

An iXML processor reads the FXML using this grammar and produces a form we’ll call ‘macro XML’ (MXML).
Within the macro XML, many fields contain micro-structures that need to be identified and marked up: names of people, references to ancient sources, bibliographic references to modern scholarship, hyperlinks to other trials, and so on.

For each of these, invisible XML grammars will be written to identify the micro-structures and capture them with XML markup.

The reader may correctly observe that all of these transforms could also be done without iXML, by writing a pipeline of XSLT or XQuery transformations. The author is well aware of this, having processed the data from the first edition of the work in precisely that way. Invisible XML grammars often allow us to describe the patterns to be recognized more simply and cleanly, although applying iXML grammars to XML, and especially to XML with mixed content, poses some challenges which will be mentioned here and there in the paper.

2. The starting point: TLRR in Word format

The editors have chosen to do their work in Word, keeping the layout and field labeling (that is, the presentational markup) of the first edition.

The following image shows the record for the first trial in Libre Office Writer, with changes from the first edition marked. The color coding provided by the word processor shows which editor made which changes.

As most readers of this paper are likely to know already, Word files with extension .docx are zipped archives with an internal file system containing a variety of sub-files with different information about the document. What most people will think of as the document “itself” is in one of those files (document.xml in this case), tagged in XML using the WordML / WordprocessingML vocabulary.
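
As a rough illustration of the archive structure just described (a sketch, not part of the project's tooling), the main document part can be pulled out of a .docx file with nothing more than a zip library. The path `word/document.xml` is an assumption that holds for typical Word output; the function names and the toy archive are hypothetical.

```python
import io
import zipfile


def read_document_xml(docx_bytes: bytes) -> str:
    """Return the main document part of a .docx archive as a string."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # Assumption: the main part lives at word/document.xml,
        # as it does in typical Word output.
        return archive.read("word/document.xml").decode("utf-8")


def make_toy_docx(body_xml: str) -> bytes:
    """Build a stand-in archive for demonstration (not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body_xml)
    return buffer.getvalue()
```

For a real file, the bytes would come from something like `open(path, "rb").read()`.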

The first few lines of Trial #1 in this form are shown below.

<w:p w14:paraId="00000004" w14:textId="77777777"
w:rsidR="001D55EB" w:rsidRDefault="00000000">
<w:pPr><w:rPr><w:b/></w:rPr></w:pPr>
<w:r><w:rPr><w:b/></w:rPr><w:t>No. 1</w:t></w:r>
</w:p> <w:p w14:paraId="00000005" w14:textId="77777777"
w:rsidR="001D55EB" w:rsidRDefault="00000000" w:rsidP="00AE75A0">
<w:pPr><w:jc w:val="both"/></w:pPr>
<w:r><w:t>date: 149 [1]</w:t></w:r>
</w:p> <w:p w14:paraId="00000006" w14:textId="77777777"
w:rsidR="001D55EB" w:rsidRDefault="00000000" w:rsidP="00AE75A0">
<w:pPr><w:jc w:val="both"/></w:pPr>
<w:r><w:t xml:space="preserve">charge:
</w:t></w:r>
<w:r><w:rPr><w:i/></w:rPr><w:t>quaestio
extraordinaria</w:t></w:r> <w:r><w:t
xml:space="preserve"> (proposed) [2] (misconduct as gov. Lusitania
150)</w:t></w:r> </w:p> <w:p
w14:paraId="00000007" w14:textId="77777777" w:rsidR="001D55EB"
w:rsidRDefault="00000000" w:rsidP="00AE75A0"> <w:pPr><w:jc
w:val="both"/></w:pPr> <w:r><w:t
xml:space="preserve">defendant: Ser. Sulpicius Galba (58) cos. 144
spoke </w:t></w:r>
<w:r><w:rPr><w:i/></w:rPr><w:t>pro
se</w:t></w:r> <w:r><w:t xml:space="preserve">
(</w:t></w:r> <w:r
w:rsidRPr="00B02312"><w:rPr><w:i/></w:rPr><w:t>ORF</w:t></w:r>
<w:r><w:t xml:space="preserve"> 19.II,
III)</w:t></w:r> </w:p> <w:p
w14:paraId="00000008" w14:textId="77777777" w:rsidR="001D55EB"
w:rsidRDefault="00000000" w:rsidP="00AE75A0"> <w:pPr><w:jc
w:val="both"/></w:pPr> <w:r><w:t>advocate:
Q. Fulvius Nobilior (95) cos. 153, cens. 136</w:t></w:r>
</w:p>

This is not the place for a serious introduction to WordML (nor am I competent to provide one), but some salient points should be noted. These apply to the data received from the editors, even if (as I have been warned by people I trust) not necessarily to all Word documents found in the wild.

Each paragraph is tagged as a w:p element (where w is bound to the namespace URI http://schemas.openxmlformats.org/wordprocessingml/2006/main) and contains a series of text runs tagged as w:r elements. A run is just a sequence of zero or more characters which share the same properties; if adjacent characters have distinct properties, they will be in different runs, but having the same properties does not guarantee that they will be in the same run.
Paragraph properties are recorded in a w:pPr element which is the first child of the w:p element; the properties of a run may (optionally) similarly be given in a first child named w:rPr. An italicized run, for example, has an empty w:i element as a flag in the run properties, as may be seen in the charge field of the record shown above for the string “quaestio extraordinaria”. When the entire paragraph is bold, an empty w:b element is found in the paragraph properties.
The characters actually displayed as part of the document occur as character data children of w:t elements within a w:r.

For a fuller introduction to the XML form of Word, the overview by Evan Lenz (Lenz) is useful, although in some respects now outdated by changes in the format.

It will be noted that there is no mixed content in this format. To put it in XPath terms: no text node has an element node as a sibling.
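
The absence of mixed content can be checked mechanically. The following sketch (hypothetical helper name; whitespace-only text between elements is treated as ignorable, which is an assumption) reports every element that has both element children and non-whitespace text.

```python
import xml.etree.ElementTree as ET


def mixed_content_elements(root: ET.Element) -> list:
    """Return the tags of elements with both child elements and real text."""
    offenders = []
    for elem in root.iter():
        children = list(elem)
        if not children:
            continue
        # Text nodes that are siblings of the child elements:
        # the parent's leading text and each child's tail.
        texts = [elem.text or ""] + [child.tail or "" for child in children]
        if any(t.strip() for t in texts):
            offenders.append(elem.tag)
    return offenders
```

Applied to a run-based structure, in which all character data sits in leaf elements, the function returns an empty list; applied to a paragraph with inline markup such as `<p>charge: <i>quaestio</i> (proposed)</p>` it reports the `p` element.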

When changes are marked, individual text runs are wrapped in w:ins and w:del elements. Inside a deletion, the text run contains not a w:t child but a w:delText child.

In the list of ancient citations for this trial, one of the editors has changed the reference to Cicero’s “de Orat. 1.40 227-28; 2.263” to read “De orat. 1.40 227-28; 2.263” by deleting the d and O and inserting D and o in their places. The corresponding XML is shown below; whitespace has been added to aid legibility.

<w:ins w:id="0" w:author="MANFREDI"
w:date="2024-01-24T09:41:00Z"> <w:r
w:rsidR="00B02312"><w:rPr><w:i/></w:rPr><w:t>D</w:t></w:r>
</w:ins> <w:del w:id="1" w:author="MANFREDI"
w:date="2024-01-24T09:41:00Z"> <w:r
w:rsidDel="00B02312"><w:rPr><w:i/></w:rPr><w:delText>d</w:delText></w:r>
</w:del>

<w:r><w:rPr><w:i/></w:rPr><w:t
xml:space="preserve">e </w:t></w:r>

<w:ins w:id="2" w:author="MANFREDI"
w:date="2024-01-24T09:41:00Z"> <w:r
w:rsidR="00B02312"><w:rPr><w:i/></w:rPr><w:t>o</w:t></w:r>
</w:ins> <w:del w:id="3" w:author="MANFREDI"
w:date="2024-01-24T09:41:00Z"> <w:r
w:rsidDel="00B02312"><w:rPr><w:i/></w:rPr><w:delText>O</w:delText></w:r>
</w:del>

<w:r><w:rPr><w:i/></w:rPr><w:t>rat</w:t></w:r>
<w:r><w:t xml:space="preserve">. 1.40, 227-28; 2.263;
</w:t></w:r>
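
To make the effect of this change markup concrete, here is a minimal sketch that recovers both readings from a cut-down copy of the fragment above (attributes omitted). The function name and the sample are illustrative only; the project pipeline does not work this way.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

SAMPLE = f"""<w:p xmlns:w="{W}">
<w:ins><w:r><w:t>D</w:t></w:r></w:ins>
<w:del><w:r><w:delText>d</w:delText></w:r></w:del>
<w:r><w:t xml:space="preserve">e </w:t></w:r>
<w:ins><w:r><w:t>o</w:t></w:r></w:ins>
<w:del><w:r><w:delText>O</w:delText></w:r></w:del>
<w:r><w:t>rat</w:t></w:r>
</w:p>"""


def reading(root: ET.Element, after_changes: bool) -> str:
    """Collect the text as it stands after (or before) the tracked changes."""
    parts = []

    def walk(node, inside_ins):
        if node.tag == f"{{{W}}}ins":
            inside_ins = True
        if node.tag == f"{{{W}}}t":
            # Inserted text belongs only to the revised reading.
            if after_changes or not inside_ins:
                parts.append(node.text or "")
        elif node.tag == f"{{{W}}}delText":
            # Deleted text belongs only to the earlier reading.
            if not after_changes:
                parts.append(node.text or "")
        for child in node:
            walk(child, inside_ins)

    walk(root, False)
    return "".join(parts)
```

With `root = ET.fromstring(SAMPLE)`, the revised reading is "De orat" and the earlier reading is "de Orat".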

3. The goal: TLRR in XML

In earlier work (reported in Sperberg-McQueen 2016), the author has translated the first edition of TLRR into XML, first converting the Waterloo Script files used to produce the camera-ready copy for the book into XML with identical semantics, and then using a pipeline of XSLT stylesheets to enrich the tagging. At the same time, the author devised an XML vocabulary tailored to the information in the work and designed to support useful searches and flexible display of results. Here are the first few lines of trial #1 in the first edition as represented in that vocabulary.

<trial id="ZAA" tlrr1="1">
<date>149<en> <p>On the date see
Cic. <i>Att.</i> 12.5b.</p> </en>
</date> <ccGrp> <charge> <procedure
pid="c-quaestio_extraordinaria" lang="lat">quaestio
extraordinaria</procedure> (proposed)<en> <p>See
Douglas, <i>Brutus</i> p. 77.</p> </en>
(misconduct as gov. Lusitania 150) </charge> </ccGrp>
<defGrp> <defendant> <namelist> <person-entry>
<person pid="pSulpicius58Ser.Galba" form="Sulpicius (+58),
Ser. Galba" >Ser. Sulpicius Galba (58)</person> cos. 144
spoke <i>pro se</i> (<i>ORF</i> 19.II, III)
</person-entry> </namelist> </defendant>
<advocate> <namelist> <person-entry> <person
pid="pFulvius95Q.Nobilior" form="Fulvius (+95), Q. Nobilior"
>Q. Fulvius Nobilior (95)</person> cos. 153, cens. 136
</person-entry> </namelist> </advocate>
</defGrp> ...  </trial>

The challenge to be met by the workflow described here is to get from the starting point to something resembling the tagging just shown. (At a first approximation, the goal is to use the same vocabulary, but there may be reasons to deviate from it.)

The salient properties of the TLRR XML format include these:

Each trial is divided into fields. Each field is tagged with a meaningful name: trial, date, charge, defendant, and so on.

In the pipeline described here, this markup is added in the second stage, which produces the FXML form of the data.
Fields are grouped in meaningful ways; by listing Q. Fulvius Nobilior (95) immediately after the defendant and before the prosecutors, TLRR signals that Fulvius was the advocate for the defendant; the XML records this by grouping them in a defGrp (defendant group) element.

In the pipeline being constructed, this is achieved in the MXML format in the third stage.
Person names are identified and hyperlinked to an index of persons.

In the pipeline described here, this markup is added in the fourth and final stage.
Not visible here, but ancient sources are recognized and tagged with the author, the work, and the passage (so that if desired they can be hyperlinked to online editions which support lookup by canonical references).

This markup, too, is added in the fourth stage.
Endnotes are given at the point of attachment to the base text.

This property of the initial XML form created from the first edition of TLRR may be replicated in the fourth stage, or it may be abandoned (in which case routines for checking and adjusting note numbers will be needed).

One approach to getting TLRR2e data into this format would be to start from this format — to produce the second edition by editing the existing XML representation of the first edition. That was the original plan (editors would provide a list of changes, technical staff would edit the XML), but the first batch of results from the editors showed that the volume of changes was high enough to make that approach error prone as well as tedious. So we want to start from the editors’ Word documents and produce XML marked up in the style shown.

We will proceed stage by stage.

4. Stage 1: WordML to Rudimentary XML

Experience with the XML conversion of the first edition showed that XSLT could recognize and process the structural patterns of the data, but that the expression of the patterns was often bulky and sometimes obscured by artifacts of XSLT. Work on the TLRR2e workflow starts from the observation that many of the relevant patterns are easy to express in grammatical form and with the conjecture that invisible-XML grammars may provide a more compact and more easily manageable way of bringing out the structure expressed by presentational markup like line breaks, in-text labels, font shifts, and the like.

So we would like to use an invisible-XML grammar to parse the input.

A complication immediately arises: Word files are XML documents using a WordprocessingML vocabulary, and invisible XML is designed for processing straight textual data, not XML. It is in principle possible to parse well-formed XML with an invisible-XML grammar (as illustrated in Hillman et al. 2022), and we will exploit that fact in later stages of the workflow, but the presentational markup of TLRR data is rather obscured, in the WordprocessingML format, by the very high incidence of markup conveying information that is important for Word but contributes no information of use to us. An invisible-XML grammar designed to read the Word file directly would devote most of its attention to throwing information away.

Most of the presentational markup relevant for our purposes would still be available in a text-only version of the file in which the paragraph boundaries of the Word document are mapped to line breaks in the textual form. That document would be far easier to parse with invisible XML.

Unless special steps are taken, however, the italics and bold of the Word document — which do convey meaning — would be lost along with all the information we would be happy to lose. A simple wiki-style markup could preserve paragraph breaks and font shifts and remove all the other complications. It might look like this (using *...* to mark bold and /.../ to mark italics):

*No. 1*
date: 149 [1]
charge: /quaestio extraordinaria/ (proposed) [2] (misconduct as gov. Lusitania 150)
defendant: Ser. Sulpicius Galba (58) cos. 144 spoke /pro se/ (/ORF/ 19.II, III)
advocate: Q. Fulvius Nobilior (95) cos. 153, cens. 136
...

This approach was contemplated for a while. But without a thorough search through the book — including the trials not yet delivered by the editors — it’s hard to be confident that there won’t be any asterisks or slashes in the input which do not mark italics or bold and need to be preserved. Rather than seek to invent some sort of escaping mechanism on the fly, it seems simpler to use simple XML markup to mark italic and bold. If an XSLT stylesheet is used to convert the Word file into a rudimentary XML format, any left angle brackets and ampersands will be escaped and all will be well.

If we are going to use a rudimentary XML format as our first step, we might as well use markup to capture the paragraph boundaries and record indentation as well. A document schema for the rudimentary XML (RXML) to be produced would be:

                  
<!ELEMENT TLRR-rudimentary-XML (p+) >
<!ELEMENT p (#PCDATA | i | b)* >
<!ATTLIST p indent CDATA #IMPLIED >
<!ELEMENT i (#PCDATA | b)* >
<!ELEMENT b (#PCDATA | i)* >

In this RXML format, the first few lines of Trial #1 will look like this:

<p><b>No. 1</b></p> <p>date: 149
[1]</p> <p>charge: <i>quaestio
extraordinaria</i> (proposed) [2] (misconduct as gov. Lusitania
150)</p> <p>defendant: Ser. Sulpicius Galba (58) cos. 144
spoke <i>pro se</i> (<i>ORF</i> 19.II,
III)</p> <p>advocate: Q. Fulvius Nobilior (95) cos. 153,
cens. 136</p> <p>prosecutors:</p> <p
indent="720">L. Cornelius Cethegus (91)</p> <p
indent="720">M. Porcius Cato (9) cos. 195, cens. 184
(<i>ORF</i> 8.LI)</p> <p
indent="720">L. Scribonius Libo (18) tr. pl. 149
(<i>promulgator</i>)</p> <p>outcome: proposal
defeated</p>

The value of the indent attribute is taken directly from the WordprocessingML, which records indentation in twentieths of a point. So the names of the prosecutors, with an indent of 720, are indented 36 points, or half an inch. (The exact amount does not matter, but if the editors produce any nested lists the relative amounts will be important.)
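
The arithmetic, and the point about relative amounts, can be sketched as follows. The function names are hypothetical, and ranking distinct indent values to obtain nesting levels is only one possible reading of "relative amounts".

```python
TWIPS_PER_POINT = 20   # indentation is recorded in twentieths of a point
POINTS_PER_INCH = 72


def twips_to_inches(twips: int) -> float:
    """Convert an indent value to inches."""
    return twips / (TWIPS_PER_POINT * POINTS_PER_INCH)


def nesting_levels(indents: list) -> list:
    """Rank each indent among the distinct values seen (0 = outermost)."""
    ranks = {value: rank for rank, value in enumerate(sorted(set(indents)))}
    return [ranks[value] for value in indents]
```

Here `twips_to_inches(720)` gives 0.5, and a hypothetical sequence of indents `[0, 720, 720, 1440, 0]` maps to levels `[0, 1, 1, 2, 0]`.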

It should be noted that another alternative was also considered: using the HTML export facility of the word processor and processing the HTML. The HTML thus produced is simpler than the WordprocessingML (and far simpler than HTML produced by word processors in years gone by), but it is also much more complex than the RXML form shown above, in part because it faithfully records a lot of information present in the Word file which has no significance for TLRR.

The main advantage to be gained by using the HTML export would be to make it unnecessary to write any code for this first stage of the pipeline. But the advantage is very small. The wordml-to-rxml.xsl stylesheet is simple enough that it took very little time to implement. It has only seven templates:

one for the document node, which inserts a TLRR-rudimentary-XML element;
one for w:p elements which writes out a p element, adding an indent attribute if indentation is detected in the paragraph properties;
three for runs, which detect the presence of the bold property, the italic property, or both, and insert a b element, an i element, or both;
two for w:ins and w:del elements, to map them into ins and del elements in the output. (This has thus far proved to be a dead end: the change markup complicates things enough that the current version of the pipeline simply ignores it: comparisons of the first and second editions can be performed later field by field.)

So the cost of using a bespoke transformation is in this case very low (much lower than the cost of complicating subsequent steps by requiring them to deal with the HTML exported by the word processor).
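
For readers who want a feel for how little logic this stage needs, here is a sketch of the same mapping in Python. It is not the wordml-to-rxml.xsl stylesheet. The function names are hypothetical, reading the indent from a `w:left` attribute on `w:ind` in the paragraph properties is an assumption about the input, and runs wrapped in w:ins or w:del are simply not handled.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def run_to_rxml(run: ET.Element) -> str:
    """Serialize one w:r as escaped text, wrapped in i and/or b as needed."""
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    out = escape(text)  # escapes &, < and >
    props = run.find("w:rPr", NS)
    if props is not None:
        if props.find("w:i", NS) is not None:
            out = f"<i>{out}</i>"
        if props.find("w:b", NS) is not None:
            out = f"<b>{out}</b>"
    return out


def paragraph_to_rxml(para: ET.Element) -> str:
    """Serialize one w:p as an RXML p element."""
    attr = ""
    # Assumption: indentation is given as w:pPr/w:ind/@w:left.
    indent = para.find("w:pPr/w:ind", NS)
    if indent is not None:
        left = indent.get(f"{{{W}}}left")
        if left:
            attr = f" indent={quoteattr(left)}"
    # Only runs that are direct children of the paragraph are considered.
    body = "".join(run_to_rxml(r) for r in para.findall("w:r", NS))
    return f"<p{attr}>{body}</p>"


def parse_paragraph(inner: str) -> ET.Element:
    """Helper for experiments: wrap run markup in a namespaced w:p."""
    return ET.fromstring(f'<w:p xmlns:w="{W}">{inner}</w:p>')
```

For example, `paragraph_to_rxml(parse_paragraph('<w:r><w:rPr><w:b/></w:rPr><w:t>No. 1</w:t></w:r>'))` yields `<p><b>No. 1</b></p>`.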

5. Stage 2: Rudimentary XML to Fielded XML

The next stage is to capture the overall structure of the input, and in particular to translate the presentational markup for labeled fields into corresponding XML markup. To take a concrete example: the RXML element <p>date: 149 [1]</p> should be transformed into <date>149 [1]</date>. Some of the microstructure can also be recognized easily, so in practice the reference to note 1 is recognized and tagged, and the element is rendered as <date>149 <ref>1</ref></date>.
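
The intended mapping for this one example can be sketched procedurally. The project performs it with invisible-XML grammars, so the regular expressions below are a stand-in, not the actual method. The sketch assumes the paragraph content is plain text without inline markup, and replacing blanks in a label with hyphens to obtain an element name is a hypothetical convention.

```python
import re
from xml.sax.saxutils import escape

FIELD = re.compile(r"(?P<label>[a-z ]+): (?P<value>.*)")
NOTE_REF = re.compile(r"\[(\d)\]")  # a single digit in square brackets


def field_to_xml(paragraph_text: str) -> str:
    """Turn 'label: value' into <label>value</label>, tagging note references."""
    match = FIELD.fullmatch(paragraph_text)
    if match is None:
        raise ValueError("not a labeled field: " + paragraph_text)
    # Hypothetical convention: blanks in labels become hyphens.
    name = match.group("label").replace(" ", "-")
    value = NOTE_REF.sub(r"<ref>\1</ref>", escape(match.group("value")))
    return f"<{name}>{value}</{name}>"
```

Calling `field_to_xml("date: 149 [1]")` returns `<date>149 <ref>1</ref></date>`.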

5.1. Goals of stage 2

More explicitly, the goals in this stage are:

to recognize the boundaries of trials and tag each trial as a trial element;
to recognize the list of references at the end of the batch of trials and tag the references as bibl elements;
within each trial, to recognize the boundaries of labeled fields and tag each field with an appropriate element; the element type name will usually but not always match the label found in the input, and when it doesn’t, the label should be recorded as an attribute;
within each trial, to recognize the (unlabeled) notes and references to ancient and modern sources;
in fields whose value is a list of names, to recognize the structure of the name list.

Recognizing the internal structure of bibliographic references, names, dates, etc. is not a goal for this stage.

It is neither a goal nor a non-goal to eliminate empty paragraphs. They may be retained if they seem likely to be useful, or dropped if no longer useful. (They are useful in recognizing the references to sources, because the references are consistently preceded by at least one blank line.)

It proved helpful to break this stage up into a sequence of three steps, each building on the work performed by the preceding. Two intermediate formats are thus introduced between the RXML and the FXML formats described above; since in them the markup identifies fields in a more generic way, these two intermediate formats are called gfxml0 and gfxml1, for ‘generic-fielded XML’ stages 0 and 1.

5.2. Step 1: categorize paragraphs

The rxml-to-gfxml0 grammar describes the input as a series of chunks (all tagged p in the input), some of which can be identified as (section) titles, some as trial numbers, some as labeled fields; some cannot be identified as a special kind of chunk at all, and appear as p elements in the output, unchanged from the input.

5.2.1. Top-level rule in the grammar

A simplified version of the top-level grammar rule in this grammar is:

                        
TLRR-generic-fields-0 =
    -"<TLRR-rudimentary-XML>", para-or-field+,
    -"</TLRR-rudimentary-XML>".

That is, the input consists of a start-tag for the element TLRR-rudimentary-XML, a series of things (in practice, all p elements) we will recognize as paragraphs or fields, and an end-tag for the TLRR-rudimentary-XML element. The heart of the grammar will lie in the rule para-or-field and its descendants.

In reality, however, as examination of the full grammar (given as an attachment) will show, the rule actually used is more complicated. Depending on the process used to produce the RXML input, the input may or may not have an XML declaration. And the elements in the input will typically be separated by whitespace, for legibility. We will need to add rules to recognize them, and refer to those rules from this one. And, finally, we would like to make the output more legible by injecting line breaks before the series of paragraphs-or-fields and after each individual paragraph-or-field. Making all these things explicit in the rule adds some visual clutter in the form of references to the nonterminals xml-declaration, s (for optional whitespace), and inject-NL (which matches the empty string and inserts a newline character into the output), so the rule takes the following form.

                        
TLRR-generic-fields-0 = s, (xml-declaration, s)?,
    -"<TLRR-rudimentary-XML>", s, inject-NL,
    (para-or-field, inject-NL) ++ s, s,
    -"</TLRR-rudimentary-XML>", s.

5.2.2. Recognizing paragraphs and fields

The grammar distinguishes four varieties of paragraph-or-field, as indicated in the following rule.

                        
-para-or-field = title | trial-number
| field | p .

The mark “-” on the left-hand side indicates that the nonterminal will not be serialized as an element, so there won’t be a series of elements named para-or-field. (A significant part of the design effort involved in developing an iXML grammar for data one wants to work with lies in choosing which nonterminals should be serialized as elements, which as attributes, and which should be hidden, like para-or-field.)

Trial numbers

A trial number (as may be seen in the examples given above) appears in RXML as a bolded paragraph with character data content like “No. 22”. This is easily represented as a grammar rule. Since only the number carries useful information and the string “No. ” can be supplied at display time, we signal with a “-” mark on the literal string that it should be omitted from the output.
```
trial-number = -"<p><b>No. ", [N]+, -"</b></p>".
```
The character class “[N]” matches any character in the Unicode general category N (Number), which includes the decimal digits of every script. In this input, only the Western form of the Indo-Arabic digits will appear, so this could also be written “["0"-"9"]”. The shorter and more general form is used for convenience in typing; since there is no likelihood that those preparing the input will inadvertently use any other form of digit, there is no point in checking that only Western digits are used.

Titles

Any other bolded paragraph we will tag as a title. The simple way to do this would be similar to the preceding:
```
title = -"<p><b>", ~["<>"]*, -"</b></p>".
```
This, however, would render the grammar ambiguous, since every paragraph that matches the rule for trial number would also match this rule for title. Unlike XSLT, formal grammars as defined in computer science have no system of priorities to specify which rule applies if more than one rule matches the input. Some grammar notations do have various more or less ad hoc mechanisms for resolving ambiguities, and some iXML processors have extensions for that purpose.

In this case, the task of rendering the grammar unambiguous is relatively simple; it requires only a little bit of boilerplate following a standard pattern. Every bolded paragraph should be classified as either a trial-number or a title — how can we decide which?
- If it begins with a nested element (the only candidates are b and i), then it is a title.
- Otherwise, if it does not begin with “N”, it is a title.
- Otherwise, if it begins with “N” but stops there or continues with a character other than “o”, then it is a title.
- Otherwise, if it begins with “No” but stops there or continues with a character other than “.”, then it is a title.
- Otherwise, if it begins with “No.” but stops there or continues with a character other than a space, then it is a title.
- Otherwise, if it begins with “No. ” but stops there or continues with a character other than a numeric digit, then it is a title.
- Otherwise, if it begins with “No. ” followed by one or more numeric digits, but does not stop there and continues with additional characters, then it is a title.
- Otherwise (as the reader will perceive), it contains the string “No. ” followed by a series of numeric digits, which means it matches the rule for trial-number, and it is a trial number and not a title.

This logic can be captured compactly in a grammatical rule. The nonterminal para-bit denotes phrase-level material (a small bit of a paragraph).
```
title = -"<p><b>", not-num, -"</b></p>".
-not-num = i, para-bit*
         | b, para-bit*
         | (~["N"; "<>"], para-bit*)?
         | ("N", (~["o<>"], para-bit*)?)
         | ("No", (~[".<>"], para-bit*)?)
         | ("No.", (~[" <>"], para-bit*)?)
         | ("No. ", (~[Nd], para-bit*)?)
         | ("No. ", [Nd]+, para-bit+) .
```
(In practice, the rule given also generally excludes angle brackets at various points. That’s probably unnecessary and inconsistent.)

Similar logic will be called for in any situation in which two distinct classes of inputs must be distinguished, and the rule for one of them (here title) is most simply expressed as anything that matches a fairly simple description, unless it also matches a competing description of higher priority (here trial-number). In language-theoretic terms, what we want is for title to be the set of all bolded paragraphs, after the set of all trial numbers is subtracted: in set notation bolded-paragraph \ trial-number. When both sets are regular languages, the set difference is also guaranteed to be a regular language, so in that case there is always a way to describe the desired set, even if (as shown above) the expression requires more machinery.

An extension to iXML to include a set-subtraction operator (and possibly also a set-intersection operator) would make this grammar somewhat simpler to write and understand. So would some mechanism for assigning priorities to nonterminals or rules, or more generally for resolving ambiguity.
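
By way of contrast, in a procedural setting the set difference bolded-paragraph \ trial-number reduces to testing the higher-priority pattern first. The sketch below is not part of the pipeline. Its names are hypothetical, and it does not verify that the b element really spans the whole paragraph.

```python
import re

BOLD_PARA = re.compile(r"<p><b>(?P<content>.*)</b></p>")
TRIAL_NUMBER = re.compile(r"No\. (?P<number>[0-9]+)")


def classify_bold_paragraph(paragraph: str):
    """Return ('trial-number', n), ('title', text), or None if not bolded."""
    bold = BOLD_PARA.fullmatch(paragraph)
    if bold is None:
        return None
    content = bold.group("content")
    # Priority: anything that is exactly 'No. ' plus digits is a trial number.
    number = TRIAL_NUMBER.fullmatch(content)
    if number is not None:
        return ("trial-number", number.group("number"))
    # Everything else in the set of bolded paragraphs is a title.
    return ("title", content)
```

The ordered test plays the role that a priority mechanism or a set-subtraction operator would play in a grammar: `<p><b>No. 1</b></p>` is classified as a trial number, while `<p><b>No. 22a</b></p>` and `<p><b>TLRR (2)</b></p>` fall through to title.
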
Labeled fields

A labeled field is a paragraph whose content begins with a label, which is a sequence of lower-case letters possibly italicized and possibly containing blanks, followed by a colon and then a value for the field. Sometimes a note-reference precedes the value, and must also be recognized. The grammatical rule is this:
```
field = -"<p>", label, s, (note-ref, s)?, value?, -"</p>" .
```
A note reference is just a number in square brackets.
```
 
note-ref = -"[", [N], -"]".
```
Only a single digit is allowed, because no individual trial has ten end-notes.

The definition of label is complicated a bit by the observation that in working with input produced in Word we cannot reliably assume either that the colon after an italicized label will always be italicized, or never be italicized; we need to accept either form.
```
label = label-content, -":" | italic-label .
-label-content = ["a"-"z"; " "]+.
italic-label > i = -"<i>", label-content, -":", s, -"</i>"
                 | -"<i>", label-content, -"</i>", -":".
```
In a labeled field, the colon after the label is followed by whitespace, so a simple definition of value would be any series of paragraph-bits that begins with a non-whitespace character.
```
 
value = ~[" <>"], para-bit+.
```
However, it is not unusual, in word-processing files, to see a blank preceding or following an italicized phrase marked as itself italic, as in the following field from trial #2:
```
<p>charge:<i> iudicium populi</i>, for <i>perduellio </i>[1] (failure as commander in Farther Spain)</p>
```
This could easily be saved for a later cleanup stage, but it broke the first attempt at a rule for value, so a more complicated set of rules was written to detect the case and ensure that the whitespace at the beginning of the value was suppressed. (The blank after the Latin term perduellio remains, awaiting that later cleanup stage.)
```
value = ~[" <>[]"], para-bit*
      | i-no-leading-blank, para-bit*
      | b-no-leading-blank, para-bit* .

i-no-leading-blank > i = -"<i>", s, ((~[" <>"]; i; b), para-bit*)?, -"</i>".
b-no-leading-blank > b = -"<b>", s, ((~[" <>"]; i; b), para-bit*)?, -"</b>".
```
The two rules at the end of that code block exploit a feature added to iXML after the publication of version 1.0: they specify that the nonterminals i-no-leading-blank and b-no-leading-blank should be serialized in the output as i and b elements. That allows us to write nonterminals to recognize special cases of italic and bold, which require special handling (here, the suppression of initial whitespace within the element), without having those nonterminals appear in the output as distinct element names.
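
The behaviour these rules aim at (accept the colon inside or outside the italics, and suppress a blank at the start of a leading italic or bold run) can be imitated in a few lines of Python. This is a hedged illustration with hypothetical names, operating on the content of a p element as a string; the project itself does this with the grammar.

```python
import re

LABEL = re.compile(
    r"(?:<i>(?P<inside>[a-z ]+):\s*</i>"   # colon inside the italics
    r"|<i>(?P<outside>[a-z ]+)</i>:"       # colon outside the italics
    r"|(?P<plain>[a-z ]+):)"               # label without italics
)


def split_field(content: str):
    """Split paragraph content into (label, value), or return None."""
    match = LABEL.match(content)
    if match is None:
        return None
    label = match.group("inside") or match.group("outside") or match.group("plain")
    value = content[match.end():].lstrip()
    # Drop whitespace at the start of a leading <i> or <b> element.
    value = re.sub(r"^<([ib])>\s+", r"<\1>", value)
    return label, value
```

A value that begins with an italic run starting with a blank, such as `charge:<i> iudicium populi</i>, for ...`, comes back as `("charge", "<i>iudicium populi</i>, for ...")`; a paragraph that does not start with a label yields `None`.
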
Other paragraphs

Any paragraph which is not one of the above should be copied to the output, tagged p. This includes empty elements, and indented paragraphs in the input.
```
p = -"<p>", unlabeled-content, -"</p>"
  ; -"<p/>"
  ; -"<p></p>"
  ; -"<p indent=", indent, -">", unlabeled-content, -"</p>"
  ; -"<p indent=", indent, -"/>" .

@indent = -"'", ~["'"]*, -"'" | -'"', ~['"']*, -'"'.
```
The definition of unlabeled-content is complicated by the need to exclude any sequence of characters beginning with a sequence of lowercase letters and blanks terminated by a colon, with or without italics. The interested reader will find it in the full grammar in the attachment.

5.2.3. The data after step 1

The result of parsing the RXML form of the data with the rxml-to-gfxml0 grammar can be illustrated by the beginning of the file, up through the trial number for trial #2. Line breaks have been introduced in long paragraphs, to simplify display. In the actual data, each field and each p element occupies a single line.

                        
<TLRR-generic-fields-0
xmlns:ixml="http://invisiblexml.org/NS"
ixml:state="version-mismatch"> <title>TLRR (2)</title>
<p/> <p/> <trial-number>1</trial-number>
<field><label>date</label><value>149
[1]</value></field>
<field><label>charge</label><value><i>quaestio
extraordinaria</i> (proposed) [2] (misconduct as gov. Lusitania
150)</value></field>
<field><label>defendant</label><value>Ser. Sulpicius
Galba (58) cos. 144 spoke <i>pro se</i>
(<i>ORF</i> 19.II, III)</value></field>
<field><label>advocate</label><value>Q. Fulvius
Nobilior (95) cos. 153, cens. 136</value></field>
<field><label>prosecutors</label></field>
<p indent="720">L. Cornelius Cethegus (91)</p> <p
indent="720">M. Porcius Cato (9) cos. 195, cens. 184
(<i>ORF</i> 8.LI)</p> <p
indent="720">L. Scribonius Libo (18) tr. pl. 149
(<i>promulgator</i>)</p>
<field><label>outcome</label> <value>proposal
defeated</value></field> <p/> <p>Cic.<i>
Div. Caec.</i> 66; <i>Mur</i>. 59;
<i>D</i><i>e
</i><i>o</i><i>rat</i>. 1.40, 227-28;
2.263; <i>Brut</i>. 80, 89; <i>Att</i>. 12.5b;
Liv. 39.40.12; <i>Per</i>. 49;
<i>Per</i>. <i>Oxy</i>. 49;
Quint. <i>Inst</i>. 2.15.8;
Plut. <i>Cat</i>. <i>Mai</i>. 15.5;
Tac. <i>Ann</i>. 3.66; App. <i>Hisp</i>. 60;
Fro. <i>Aur</i>. 1. p. 172 (56N); Gell. 1.12.17, 13.25.15;
see also Val. Max. 8.1. abs. 2; Ps. Asc. p. 203;
<i>Vir</i>. <i>Ill</i>. 47.7. [3]</p>
<p/> <p>[1] On the date see
Cic. <i>Att</i>. 12.5b.</p> <p>[2] See Brennan
(2000) 175. </p> <p>[3] Val. Max. wrongly portrays the
trial as a iudicium populi: see Briscoe (2019) 69.</p>
<p/> <p/> <trial-number>2</trial-number> ...

Several things are worth noting about this output:

Information about a single trial is not grouped into a single containing element; the fields pertaining to each trial still need to be gathered into a trial element.
The prosecutors field of trial #1 has a value consisting of a list of names, each in a separate paragraph in the WordprocessingML and RXML data. That list of names needs to be tagged and brought into the field element for the prosecutors.
In each trial, the list of ancient sources and the notes need to be recognized and tagged.
The lists of abbreviations and bibliography at the end of the input (not shown above, and not discussed in detail below) need to be wrapped in containers, and the individual items in each list need to be tagged appropriately.

The next step is devoted to addressing these issues. A further peculiarity visible in the data may be noted but will not be addressed yet:

At some points, a sequence of italic characters is not marked by an i element but by a sequence of two or more adjacent i elements; this will complicate later processing and will need to be cleaned up.
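
The clean-up itself is deferred to a later step, but the idea is simple enough to sketch here. The following Python fragment is an illustration only, not part of the project's workflow; the function name merge_adjacent is hypothetical, and the sketch assumes that the i elements to be merged carry no attributes and have no element children.

```python
import xml.etree.ElementTree as ET

def merge_adjacent(parent, tag="i"):
    """Merge runs of directly adjacent <tag> elements under parent (recursively)."""
    prev = None
    for child in list(parent):
        merge_adjacent(child, tag)
        if (prev is not None and prev.tag == tag and child.tag == tag
                and not prev.tail          # no text between the two elements
                and len(prev) == 0 and len(child) == 0):
            prev.text = (prev.text or "") + (child.text or "")
            prev.tail = child.tail         # keep the text that followed the run
            parent.remove(child)
        else:
            prev = child

p = ET.fromstring("<p><i>D</i><i>e </i><i>o</i><i>rat</i>. 1.40</p>")
merge_adjacent(p)
print(ET.tostring(p, encoding="unicode"))   # <p><i>De orat</i>. 1.40</p>
```

Elements separated by any text, such as the period and blank in "<i>Per</i>. <i>Oxy</i>", are left alone, since there the italics really are interrupted.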

5.3. Step 2: field-boundary fixup

5.3.1. Goals of step 2

The second step in this stage has the primary task of detecting cases where a given field takes up more than one paragraph in the RXML form. Since fields are recognized only by having a label at the beginning, any paragraphs beyond the first will have been left unchanged by the preceding step. Sometimes this is due to the value containing a list of names, as illustrated by the prosecutors field of trial #1; sometimes it’s the result of an erroneous hard return in a field value, as in the following example from trial #8:

                        
<field><label>charge</label><value><i>lex
</i>(<i>Calpurnia</i>?)  <i>de
repetundis</i> (misconduct as consul
and</value></field> <p>proconsul in Hither Spain)
[2]</p>

At the same time, the grammar recognizes the notes and the lists of ancient and modern sources appearing at the end of each trial, and groups and tags the back matter in the input.

5.3.2. Organization of the input, organization of individual trials

The key drivers here are the definition of the top level of the document and the definition of trial. The top level defines the body of the document as consisting of a work title followed by a series of trials, followed by a list of abbreviations and a bibliography, with sections optionally separated by blank lines.

                        
TLRR-generic-fields-1 = s, (xml-declaration, s)?,
     -"<TLRR-generic-fields-0", -~["<>"]*, -">", s, inject-NL,

     work-title, s, (blank-lines ++ s, s)?, inject-NLNL,

     ((trial, inject-NLNL) ++ s), s, inject-NLNL,

     abbreviations, s, (blank-lines ++ s, s)?, inject-NLNL,

     bibliography, s, inject-NL, -"</TLRR-generic-fields-0>", s.

Since the abbreviations and bibliography begin with distinctive titles, they are not recognized solely by position. But some material within a trial is recognized by position. Neither the sources of a trial nor the notes have distinctive markup (although notes do begin with note numbers, which proves important); the rule for a trial recognizes them, essentially, by their position. The input consistently (at least so far!) separates the sources from the preceding labeled fields with one or more blank lines (represented at this point by empty p elements).

                        
trial = -"<trial-number>", @trial-num, -"</trial-number>", inject-NL, s,
        (p, s)?,
        (field, inject-NL) ++ s, s,
        blank-lines, s,
        sources, inject-NL, s,
        (blank-lines, s, (notes, inject-NL, s, (blank-lines, s)?)?)? .
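
Recognition by position can be pictured procedurally. The following Python sketch is an illustration under simplifying assumptions, not the project's code: each paragraph of a trial is reduced to a plain string, an empty string stands for a blank line (an empty p element), and the name split_trial is hypothetical. As in the rule above, sources are required and notes are optional.

```python
import re
from itertools import groupby

NOTE_START = re.compile(r"\[\d+\]")   # notes begin with a note number such as [1]

def split_trial(paragraphs):
    """Split the paragraphs of one trial into (fields, sources, notes) by position."""
    # Runs of non-blank paragraphs, in document order; blank runs are separators.
    blocks = [list(run)
              for is_blank, run in groupby(paragraphs, key=lambda p: p.strip() == "")
              if not is_blank]
    if not 2 <= len(blocks) <= 3:
        raise ValueError("expected fields, sources, and optionally notes")
    fields, sources = blocks[0], blocks[1]
    notes = blocks[2] if len(blocks) == 3 else []
    if not all(NOTE_START.match(n) for n in notes):
        raise ValueError("a note does not begin with a note number")
    return fields, sources, notes
```

The check on note numbers reflects the remark above that the numbers at the start of each note prove important: they are the only internal evidence that the third block really is the notes.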

5.3.3. Three forms of field

In the input to this step, individual fields can take any of three forms. Most common is a field element and nothing more, as illustrated by the first right-hand side of the field rule below. The second form has a list-structured value, where the first paragraph has only the label (and optionally a note reference), and the actual value is recorded in a series of indented paragraphs immediately following. In the third form, the value spills to one (or possibly more) following p elements.

                        
field = -"<field>", label, note-ref?, value, -"</field>"
      | -"<field>", label, note-ref?, -"</field>", s, list-value
      | -"<field>", label, note-ref?, extended-value .

A note on the first line of the rule above: Defining the nonterminal field as a string beginning with an XML start-tag for a field element and ending with an end-tag for that element is the iXML equivalent of an identity transformation on the field element. (Unlike XSLT, iXML does not have the equivalent of a generic identity template, or a default mode="shallow-copy" treatment: every element to be treated must be given a rule. That means there are tradeoffs to consider in deciding whether to process XML data with iXML or with XSLT or XQuery. In the workflow described here, the markup is simple enough that providing identity rules is not a big burden. But as the markup becomes richer, the burden of providing the identity rules increases.)

The rules for label, note-ref, and value are also identity rules.

                        
label = -"<label>", phrases, -"</label>".
note-ref = -"<note-ref>", phrases, -"</note-ref>".
value = -"<value>", phrases, -"</value>".

For list-valued fields, we would like to wrap the value in a value element, as usual, and then within that wrap the list items in a list element. That leads to the following slightly indirect-looking set of rules. (Some inject-NL references have been deleted here to make the rules a little easier to read.) Any indented paragraph we see is a list item. The exact value of the indentation does not matter, because the input has no nested lists.

                        
list-value > value = value-list.
value-list > list = item++s.
item = -"<p indent=", -~["<>"]*, -">", phrases, -"</p>".

An ‘extended’ value, on the other hand, spills not to indented paragraphs but to unindented paragraphs, as shown in the following rule.

                        
extended-value > value =
-"<value>", phrases, -"</value></field>", s,
(-"<p>", +" ", phrases, -"</p>") ++ s.

Because the rule for trial specifies that the main body of a trial consists of a sequence of fields separated only by whitespace, every p element following a field element in the input will be absorbed into that field. (To do this in XSLT, it is necessary to perform what is sometimes called a right-sibling traversal, which many XSLT programmers report finding difficult to get right — or use grouping constructs, which are somewhat easier to get right.) An attempt was made to make the grammar of the preceding step detect such multi-paragraph fields and handle them correctly, but the attempt failed, because the higher-level nonterminal had defined the document as a sequence of fields and paragraphs, so every multi-paragraph field could be read as a single field or as a field followed by paragraphs. Both were allowed, so the grammar became ambiguous, and no way was found to force one reading over the other. (It is, probably, not impossible. But no solution was found in the time available.)
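
For readers who find a procedural description helpful, the absorption of trailing paragraphs can be sketched in Python. This is an illustration only, not the project's code (which does the work in iXML); the function names are hypothetical, and the sketch assumes that the fields and paragraphs of a trial are sibling elements and that an empty p element marks the end of the labeled fields.

```python
import xml.etree.ElementTree as ET

def append_content(target, source, sep=" "):
    """Append the mixed content of source to that of target."""
    text = sep + (source.text or "")
    if len(target):                        # text goes after the last child element
        target[-1].tail = (target[-1].tail or "") + text
    else:
        target.text = (target.text or "") + text
    target.extend(list(source))

def absorb_paragraphs(trial):
    """Fold non-empty p elements that follow a field into that field's value."""
    field = None
    for child in list(trial):
        if child.tag == "field":
            field = child
        elif child.tag == "p" and field is not None and (child.text or len(child)):
            value = field.find("value")
            if value is None:
                value = ET.SubElement(field, "value")
            if child.get("indent") is not None:    # indented: a list item
                items = value.find("list")
                if items is None:
                    items = ET.SubElement(value, "list")
                item = ET.SubElement(items, "item")
                item.text = child.text
                item.extend(list(child))
            else:                                   # unindented: spill-over text
                append_content(value, child)
            trial.remove(child)
        else:
            field = None    # a blank paragraph (or anything else) ends the fields
```

The sketch needs explicit state (the variable field) to remember which element the next paragraph belongs to; in the grammar, the same knowledge is implicit in the position of the nonterminals within the rule.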

5.3.4. Phrase-level information and well-formed XML

It may be worth spending a moment on the phrases nonterminal and on the handling of phrase-level content within fields and notes.

The nonterminal phrases matches zero or more occurrences of:

a character other than an angle bracket;
a note reference;
an i (italics) element;
a b (bold) element;
empty i and b elements which enclose no characters and thus serve no purpose other than to complicate downstream processing; like start- and end-tags that put whitespace that belongs outside an element inside that element, such empty elements are not uncommon in word-processor documents.

                        
-phrases = phrase*.
-phrase = ~["<>"] | note-reference | i | b | cruft.

note-reference > note-ref = "[", [N], "]".
i = -"<i>", phrases, -"</i>".
b = -"<b>", phrases, -"</b>".

-cruft = -"<i/>"; -"<b/>".

The grammar also includes rules for structuring the abbreviations and bibliography; they are present in the attached grammar but will not be discussed here.

5.3.5. The data after step 2

The beginning of the data now shows the restructuring of the information. (Reformatted for legibility.)

                        
<TLRR-generic-fields-1
xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous
version-mismatch"> <title>TLRR (2)</title>

  <trial n="1"> <field><label>date</label>
<value>149
<note-ref>[1]</note-ref></value></field>
<field><label>charge</label>
<value><i>quaestio extraordinaria</i> (proposed)
<note-ref>[2]</note-ref> (misconduct as gov.  Lusitania
150)</value> </field>
<field><label>defendant</label>
<value>Ser. Sulpicius Galba (58) cos. 144 spoke <i>pro
se</i> (<i>ORF</i> 19.II, III)</value>
</field> <field><label>advocate</label>
<value>Q. Fulvius Nobilior (95) cos. 153,
cens. 136</value> </field>
<field><label>prosecutors</label>
<value><list> <item>L. Cornelius Cethegus
(91)</item> <item>M. Porcius Cato (9) cos. 195, cens. 184
(<i>ORF</i> 8.LI)</item> <item>L. Scribonius
Libo (18) tr. pl. 149 (<i>promulgator</i>)</item>
</list></value></field>
<field><label>outcome</label> <value>proposal
defeated</value> </field> <sources>
<ancient>Cic.<i> Div. Caec.</i> 66;
<i>Mur</i>. 59; <i>D</i><i>e
</i><i>o</i><i>rat</i>. 1.40, 227-28;
2.263; <i>Brut</i>. 80, 89; <i>Att</i>. 12.5b;
Liv. 39.40.12; <i>Per</i>. 49;
<i>Per</i>. <i>Oxy</i>. 49;
Quint. <i>Inst</i>. 2.15.8;
Plut. <i>Cat</i>. <i>Mai</i>. 15.5;
Tac. <i>Ann</i>. 3.66; App. <i>Hisp</i>. 60;
Fro. <i>Aur</i>. 1. p. 172 (56N); Gell. 1.12.17, 13.25.15;
see also Val. Max. 8.1. abs. 2; Ps. Asc. p. 203;
<i>Vir</i>.
<i>Ill</i>. 47.7. <note-ref>[3]</note-ref>
</ancient> </sources>
    
    <notes> <note>[1] On the date see
Cic. <i>Att</i>. 12.5b.</note> <note>[2] See
Brennan (2000) 175. </note> <note>[3] Val. Max. wrongly
portrays the trial as a iudicium populi: see Briscoe (2019)
69.</note> </notes> </trial> ...

5.4. Step 3: retagging the fields

5.4.1. Goal of step 3

At this point, the field structure of the input has been cleanly recognized. For further work, it will be more convenient if each type of field has a distinctive element type, rather than being tagged generically as a field with a label and value. So the sole purpose of this step is to retag the fields, using the value of the label as a guide.

For some fields, this is a simple matter of using the value of the label as the generic identifier for the element. So the date field of trial #1 (reformatted here to shorten lines)

<field> <label>date</label>
<value>149 <note-ref>[1]</note-ref></value>
</field>

will be retagged more informatively and compactly as

<date>149 <note-ref>[1]</note-ref></date>

For other fields, the retagging is more complex. A survey of the first-edition data shows that the field which records the presiding magistrate in a trial, which will be tagged as a judge element, may in the displayed text be labeled in any of several different ways, often simply recording the Latin term used in the historical sources, and sometimes recording some uncertainty:

praetor
peregrine praetor
peregrine? praetor
urban praetor
(urban?) praetor
urban and peregrine praetor
quaesitor
quaesitores
iudex quaestionis
praetor or iudex quaestionis
aedile or iudex quaestionis
…

The variations in the label are, needless to say, important to preserve, even as all variations on the theme of “presiding magistrate” are merged into a single element type in order to support searches to find out whether a particular named individual is ever recorded as having presided over a trial.

So the field tagged generically as:

<field> <label>praetor</label>
<value>M. Popillius Laenas (22) (pr. by 142) cos. 139?
<note-ref>[1]</note-ref> </value> </field>

will be retagged as:

<judge label="praetor"> M. Popillius Laenas (22) (pr. by 142)
cos. 139? <note-ref>[1]</note-ref> </judge>

5.4.2. Recognizing fields

In this grammar, again the input is recognized as a series of trials, and each trial as a sequence of fields. The main work occurs in the rule describing fields, which classifies them by type.

                        
-field = date; charge; claim;
charge-or-claim; defendant; prosecutor; plaintiff; party; advocate;
judge; juror; witness; other; outcome .

For each type, there is a simple rule: it’s a field element with an appropriate label and other content.

                        
date = -"<field><label>date</label>", note-ref?, value, -"</field>".
charge = -"<field><label>charge</label>", note-ref?, value, -"</field>".
claim = -"<field><label>claim</label>", note-ref?, value, -"</field>".
charge-or-claim = -"<field><label>charge? claim?</label>", note-ref?, value, -"</field>".
defendant = -"<field>", label-defendant, note-ref?, value, -"</field>".
prosecutor = -"<field>", label-prosecutor, note-ref?, value, -"</field>".
plaintiff = -"<field><label>plaintiff</label>", note-ref?, value, -"</field>".
party = -"<field>", label-party, note-ref?, value, -"</field>".
advocate = -"<field>", label-advocate, note-ref?, value, -"</field>".
judge = -"<field>", label-judge, note-ref?, value, -"</field>".
juror = -"<field><label>juror</label>", note-ref?, value, -"</field>".
witness = -"<field>", label-witness, note-ref?, value, -"</field>".
other = -"<field>", label-other, note-ref?, value, -"</field>".
outcome = -"<field><label>outcome</label>", note-ref?, value, -"</field>".

As may be seen, for some fields, such as date, there is only one form of label, and it’s given as a literal. For others, such as judge, the description of matching labels is packaged in a separate nonterminal such as label-judge, which we give here as a typical example:

                        
-label-judge = -"<label>judge</label>"; judge-label.
@judge-label > label = -"<label>",
    ( "praetor"
    ; "peregrine praetor"
    ; "peregrine? praetor"
    ; "urban praetor"
    ; "(urban?) praetor"
    ; "urban and peregrine praetor"
    ; "judge (triumvir capitalis)"
    ; "praetor or iudex questionis"
    ; "aedile or iudex questionis"
    ; "iudex quaestionis"
    ; "duumvir perduellionis"
    ; -"<i>", "quaesitor", ("es")?, -"</i>"
    ; -"<i>", "iudices", -"</i>"
    ), -"</label>".

The effect of these grammar rules may be worth describing step by step in a procedural way. At any point where the grammar is expecting a field of any kind, the parser looks to see if the input matches the rule for field, which in turn translates to looking to see if the input matches the rule for date or charge or any of the other nonterminals on the right-hand side of the rule for field, including judge. If the input does match the rule for field, it will not be tagged as a field element in the output because the left-hand side of the rule is marked with a minus sign, meaning that the nonterminal field is to be hidden. Its children may be tagged, but not the nonterminal as a whole.

Matching the rule for judge requires a field element with contents matching the sequence of nonterminals label-judge, optionally note-ref, and value. If the input matches here, it will appear in the output as a judge element. Operationally speaking, we have thus used the iXML grammar to replace a field element with a corresponding judge element.

If the label child of the field we are processing has the form “<label>praetor</label>”, then it will match the nonterminal label-judge, which is hidden, and also the nonterminal judge-label, which is (a) marked with “@”, which means it will be serialized as an attribute, and (b) annotated with “> label”, which means that the attribute will be serialized with the attribute name “label”, not “judge-label”. The literal strings which match the tags for the label element in the input are marked with a minus sign to be hidden, so they will not be copied to the output. Only the content of the label element will be copied, as the value of the label attribute in the output.

If the input had matched the string “<label>judge</label>” instead, there would be no label attribute. So operationally we have used iXML to copy the string used as a label into a label attribute on the judge element, but not if the label used was the string “judge”.

A comparison with the XSLT code that performed this work on the first edition about ten years ago shows that the iXML form is somewhat more compact. (However, as the expression "quaesitor", "es"? shows, for individual regular expressions, the notation used in XSLT and XQuery is somewhat more compact than that of iXML.)

5.4.3. Handling well-formed XML (revisited)

Fields recognized by the grammar rules shown above get retagged as described. Everything else in the input, meanwhile, should be left unchanged. The discussion of step 2 (above) included a description of how the grammar matched well-formed XML in the phrases nonterminal. That part of the grammar was simple, because every phrase-level fragment was either a character, an i element, or a b element, so only a few rules were needed.

In the grammar for this step, however, the input has significantly richer markup (and we are working on larger structures, not just on phrase-level material). So the relevant part of the grammar containing the iXML equivalent of an identity transform is somewhat longer, as shown below.

                        
-wf-xml = wf-xml-bit*.
-wf-xml-bit = ~["<>"] | abbr | ancient | author | author-group | b | bibl
            | i | item | list | modern | note | note-ref | notes | p
            | primary-sources | secondary-sources | sources | title | work .

abbr = -"<abbr>", wf-xml, -"</abbr>".
abbreviations = -"<abbreviations>", wf-xml, -"</abbreviations>".
ancient = -"<ancient>", wf-xml, -"</ancient>".
author = -"<author>", wf-xml, -"</author>".
author-group = -"<author-group>", wf-xml, -"</author-group>".
b = -"<b>", wf-xml, -"</b>".
bibl = -"<bibl>", wf-xml, -"</bibl>".
bibliography = -'<bibliography>', wf-xml, -"</bibliography>".
i = -"<i>", wf-xml, -"</i>".
item = -"<item>", wf-xml, -"</item>".
list = -"<list>", wf-xml, -"</list>".
modern = -"<modern>", wf-xml, -"</modern>".
note = -"<note>", wf-xml, -"</note>".
note-ref = -"<note-ref>", wf-xml, -"</note-ref>".
notes = -"<notes>", wf-xml, -"</notes>".
p = -"<p>", wf-xml, -"</p>".
primary-sources = -"<primary-sources>", wf-xml, -"</primary-sources>".
secondary-sources = -"<secondary-sources>", wf-xml, -"</secondary-sources>".
sources = -"<sources>", wf-xml, -"</sources>".
title = -"<title>", wf-xml, -"</title>".
work = -"<work>", wf-xml, -"</work>".

At about this point, the absence of anything in iXML comparable to a default identity template in XSLT may begin to feel like a painful gap. The next section explores a potential work-around.

5.4.4. An alternative approach to handling well-formed XML

The rules given in the preceding section for the identity-transform behavior desired for much of the input are not very complicated, but they are very repetitive; if any of the elements involved carried attributes, writing the rules would be even more tedious and probably somewhat error-prone.

A generic form of the identity-transform part of the grammar would perhaps make things simpler. A simple version of the required rules was shown in Hillman et al. 2022. In the grammar fragment below, all nonterminals are written in all-caps and prefixed with two underscores, to reduce the likelihood of collision with other names in the surrounding iXML grammar and for reasons which will shortly become clear.

                        
  -wf-xml = __PCDATA?,
            ((__PI ; __COMMENT ; __ELEMENT)++(__PCDATA?),
            __PCDATA?)?.
   __PCDATA = (~["<>&"]; "&amp;"; "&lt;"; "&gt;"; "&apos;"; "&quot;")+.
       __PI = "<?", @__NAME, S, @__PI-DATA, "?>".
  __COMMENT = "<!--", __COMMENT-DATA, "-->".
  __ELEMENT = __STARTTAG, wf-xml, __ENDTAG; __SOLETAG .
-__STARTTAG = -"<", @__GI, (S, __ATTRIBUTE)*, s, -">".
  -__ENDTAG = -"</", @__GI2, s, -">".
 -__SOLETAG = -"<", @__GI, (S, __ATTRIBUTE)*, s, -"/>".

__ATTRIBUTE = @__NAME, s, -"=", s, @__ATT-VALUE.
     __NAME = [L; "_"], [L; "_-."; Nd]*.
       __GI = __NAME.
      __GI2 = __NAME.

@__ATT-VALUE > VALUE = ['"'], ~['"']*, ['"']
                     | ["'"], ~["'"]*, ["'"].
-__COMMENT-DATA = ~["-"]*.
-__PI-DATA = ~["?"]*.

A few lines of trial #1 will give an idea of the flavor of the result of parsing the data using this grammar. As usual, linebreaks have been introduced to make the lines shorter.

<trial n="1">
<date><__PCDATA>149 </__PCDATA
  ><__ELEMENT __GI="note-ref" __GI2="note-ref"
  ><__PCDATA>[1]</__PCDATA></__ELEMENT></date>
<charge><__ELEMENT __GI="i" __GI2="i"
  ><__PCDATA>quaestio extraordinaria</__PCDATA
  ></__ELEMENT><__PCDATA> (proposed) </__PCDATA
  ><__ELEMENT __GI="note-ref" __GI2="note-ref"
  ><__PCDATA>[2]</__PCDATA></__ELEMENT><__PCDATA
  > (misconduct as gov. Lusitania 150)</__PCDATA></charge>
<defendant><__PCDATA
  >Ser. Sulpicius Galba (58) cos. 144 spoke </__PCDATA
  ><__ELEMENT __GI="i" __GI2="i"
  ><__PCDATA>pro se</__PCDATA></__ELEMENT><__PCDATA
  > (</__PCDATA><__ELEMENT __GI="i" __GI2="i"
  ><__PCDATA>ORF</__PCDATA></__ELEMENT
  ><__PCDATA> 19.II, III)</__PCDATA></defendant>
<advocate><__PCDATA
  >Q. Fulvius Nobilior (95) cos. 153, cens. 136</__PCDATA
  ></advocate>
<prosecutor label="prosecutors"
  ><__ELEMENT __GI="list" __GI2="list"><__PCDATA>
</__PCDATA
  ><__ELEMENT __GI="item" __GI2="item"><__PCDATA
  >L. Cornelius Cethegus (91)</__PCDATA></__ELEMENT
  ><__PCDATA>
</__PCDATA
  ><__ELEMENT __GI="item" __GI2="item"><__PCDATA
  >M. Porcius Cato (9) cos. 195, cens. 184 (</__PCDATA
  ><__ELEMENT __GI="i" __GI2="i"><__PCDATA>ORF</__PCDATA
  ></__ELEMENT><__PCDATA> 8.LI)</__PCDATA></__ELEMENT><__PCDATA>
</__PCDATA
  ><__ELEMENT __GI="item" __GI2="item"><__PCDATA
  >L. Scribonius Libo (18) tr. pl. 149 (</__PCDATA
  ><__ELEMENT __GI="i" __GI2="i"><__PCDATA
  >promulgator</__PCDATA></__ELEMENT><__PCDATA
  >)</__PCDATA></__ELEMENT><__PCDATA>
</__PCDATA></__ELEMENT></prosecutor>
<outcome><__PCDATA>proposal defeated</__PCDATA></outcome>

This format is, of course, not particularly easy to read or convenient for further processing, but it does preserve all the information in the material parsed as wf-xml. And it does allow an arbitrary number of XML element types to be passed through without change using a small number of rules in the iXML grammar, and without requiring one rule per element type.

One hitch is that iXML cannot easily be used to transform this format back into normal XML, because the names of elements and attributes must be assigned in the grammar, not taken from the input data. Allowing names to be taken from the input is another extension to iXML which might make life easier in some cases.

Fortunately, a simple XSLT transformation with templates for the elements __ELEMENT, __ATTRIBUTE, __PCDATA, and so on can easily translate from this format back into conventional XML. Here are the templates for the three elements mentioned.

                        
<xsl:template match="__ELEMENT">
  <xsl:element name="{@__GI}">
    <xsl:apply-templates/>
  </xsl:element>    
</xsl:template>

<xsl:template match="__ATTRIBUTE">
  <xsl:attribute name="{@__NAME}"
                 select="@__VALUE"/>
</xsl:template>

<xsl:template match="__PCDATA">
  <xsl:sequence select="string()"/>
</xsl:template>

Since the material being processed has no XML comments or processing instructions, these suffice.
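
Because the generic format records the element name from both the start-tag (__GI) and the end-tag (__GI2), the fix-up step is also a natural place to check that the two agree, something a grammar of this kind does not enforce by itself. The following Python sketch of the same back-translation includes such a check. It is an illustration only, not the project's stylesheet: the function name restore is hypothetical, and for brevity the sketch ignores __ATTRIBUTE, comments, and processing instructions, none of which occur in the sample shown above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def restore(node):
    """Serialize node, replacing __ELEMENT and __PCDATA wrappers by ordinary markup."""
    if node.tag == "__PCDATA":
        return escape(node.text or "")
    if node.tag == "__ELEMENT":
        name, end = node.get("__GI"), node.get("__GI2")
        if end is not None and end != name:      # sole tags have no __GI2
            raise ValueError(f"start-tag <{name}> is closed by </{end}>")
        attrs = ""
    else:                                         # ordinary element: copy as is
        name = node.tag
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in node.attrib.items())
    body = escape(node.text or "") + "".join(
        restore(child) + escape(child.tail or "") for child in node)
    return f"<{name}{attrs}>{body}</{name}>"

generic = ET.fromstring(
    '<date><__PCDATA>149 </__PCDATA>'
    '<__ELEMENT __GI="note-ref" __GI2="note-ref">'
    '<__PCDATA>[1]</__PCDATA></__ELEMENT></date>')
print(restore(generic))   # <date>149 <note-ref>[1]</note-ref></date>
```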

5.4.5. Would this step be simpler in XSLT?

Some readers may be thinking at this point that performing this step with an iXML grammar may be pushing things a bit far. Would it not be just as simple, or perhaps simpler, to do it in XSLT? After all, XSLT’s system of templates and priorities is very well suited to near-identity transforms.

The answer, unsurprisingly, is that yes, XSLT can indeed be used for this step. The key template is shown below.

                        
<xsl:template match="field">
  <xsl:variable name="label" as="xs:string" select="label"/>
  <xsl:variable name="gi">
    <xsl:choose>
      <xsl:when test="$label = ('date', 'charge', 'claim',
                      'plaintiff', 'juror', 'outcome')">
        <xsl:sequence select="$label"/>
      </xsl:when>
      <xsl:when test="$label = ('defendant',
                      'defendants',
                      'defendants?')">
        <xsl:sequence select=" 'defendant' "/>
      </xsl:when>
      <xsl:when test='$label = ("prosecutor",
                      "prosecutors")'>
        <xsl:sequence select=' "prosecutor" '/>
      </xsl:when>
      <xsl:when test='$label = ("party",
                      "opposing party",
                      "parties")'>
        <xsl:sequence select=' "party" '/>
      </xsl:when>
      <xsl:when test='$label = ("judge",
                      "praetor",
                      "peregrine praetor",
                      "peregrine? praetor",
                      "urban praetor",
                      "(urban?) praetor",
                      "urban and peregrine praetor",
                      "judge (triumvir capitalis)",
                      "praetor or iudex quaestionis",
                      "aedile or iudex quaestionis",
                      "iudex quaestionis",
                      "duumvir perduellionis",
                      "quaesitor",
                      "quaesitores",
                      "iudices"
                      )'>
        <xsl:sequence select=' "judge" '/>          
      </xsl:when>
      <xsl:when test='$label = (
                      "advocate",
                      "advocates",
                      "advocate for defendant",
                      "advocates for defendant",
                      "advocate for defendants",
                      "advocates for defendants",
                      "advocate for plaintiff",
                      "advocates for plaintiff",
                      "advocate for plaintiffs",
                      "advocates for plaintiffs"
                      )'>
        <xsl:sequence select=' "advocate" '/>
      </xsl:when>
      <xsl:when test='$label = (
                      "witness",
                      "witnesses",
                      "cognitor",
                      "procurator"
                      )'>
        <xsl:sequence select=' "witness" '/>
      </xsl:when>
      <xsl:when test='$label = (
                      "other",
                      "legate",
                      "legates",
                      "in attendance",
                      "present for defense",
                      "present for defendants"
                      )'>
        <xsl:sequence select=' "other" '/>
      </xsl:when>

    </xsl:choose>
  </xsl:variable>

  <xsl:element name="{$gi}">
    <xsl:if test="$label ne $gi">
      <xsl:attribute name="label" select="$label"/>
    </xsl:if>
    <xsl:apply-templates select="child::node() except label"/>
  </xsl:element>
</xsl:template>

Because the XSLT stylesheet can pass over in silence all elements and attributes to be left unchanged, the XSLT stylesheet is about ten per cent more compact than either of the two iXML grammars for this step: 150 vs 167 or 168 lines (including blank lines and comments). (It is also much simpler than the corresponding XSLT step in the workflow for the first edition created ten years ago, probably because the task of assigning the correct element names to fields has been separated from the task of recognizing field boundaries.)

5.4.6. The data in FXML form

The form of the data in ‘fielded XML’ is shown below. All three versions of the previous step produce the same output, except for some variation in the namespace declarations and iXML attributes on the document element.

<TLRR-fielded xmlns:ixml="http://invisiblexml.org/NS" ixml:state="version-mismatch">
<title>TLRR (2)</title>

<trial n="1">
<date>149 <note-ref>[1]</note-ref></date>
<charge><i>quaestio extraordinaria</i
  > (proposed) <note-ref>[2]</note-ref
  > (misconduct as gov. Lusitania 150)</charge>
<defendant>Ser. Sulpicius Galba (58) cos. 144
  spoke <i>pro se</i> (<i>ORF</i> 19.II, III)</defendant>
<advocate>Q. Fulvius Nobilior (95)
  cos. 153, cens. 136</advocate>
<prosecutor label="prosecutors"><list>
  <item>L. Cornelius Cethegus (91)</item>
  <item>M. Porcius Cato (9) cos. 195, cens. 184 (<i>ORF</i> 8.LI)</item>
  <item>L. Scribonius Libo (18) tr. pl. 149 (<i>promulgator</i>)</item>
</list></prosecutor>
<outcome>proposal defeated</outcome>
<sources>
<ancient>Cic.<i> Div. Caec.</i> 66; <i>Mur</i>. 59;
  <i>D</i><i>e </i><i>o</i><i>rat</i>. 1.40, 227-28; 2.263;
  <i>Brut</i>. 80, 89; <i>Att</i>. 12.5b; Liv. 39.40.12;
  <i>Per</i>. 49; <i>Per</i>. <i>Oxy</i>. 49;
  Quint. <i>Inst</i>. 2.15.8;
  Plut. <i>Cat</i>. <i>Mai</i>. 15.5; Tac. <i>Ann</i>. 3.66;
  App. <i>Hisp</i>. 60; Fro. <i>Aur</i>. 1. p. 172 (56N);
  Gell. 1.12.17, 13.25.15; see also Val. Max. 8.1.
  abs. 2; Ps. Asc. p. 203; <i>Vir</i>.
  <i>Ill</i>. 47.7. <note-ref>[3]</note-ref></ancient></sources>
<notes>
<note>[1] On the date see Cic. <i>Att</i>. 12.5b.</note>
<note>[2] See Brennan (2000) 175. </note>
<note>[3] Val. Max. wrongly portrays the trial
  as a iudicium populi: see Briscoe (2019) 69.</note>
</notes>
</trial>
...

6. Stage 3: Fielded XML to Macro-structured XML

The main goal of this stage is to cluster fields together into appropriate groups marked with a container element. For example, the defendant and adjacent advocate(s) and witness(es) will be grouped into a defGrp (‘defendant group’) element, and similarly plaintiffs and their allies will go into a ppGrp (‘plaintiffs or prosecutors group’) element.

6.1. The grammar

The key grammar rule is that for trial, which describes the content of a trial as for the most part a series of grouping elements.

                     
trial = -"<trial n=", @n, -">", s,
        (NB, s)?,
        (date, s)? { date is missing in 21, 118, 383}, 
        (ccGrp, s)?,
        ((def-and-pp | parties | advGrp), s)?, 
        (magGrp, s)?, 
        (jurGrp, s)?, 
        (witGrp, s)?, 
        (other, s)?, 
        (outcome, s)?,
        (other, s)?,

        sources, s,
        (notes, s)?,
        -"</trial>".

The individual grouping elements vary somewhat in complexity. Some are just wrappers around one or two field elements; others may enclose several disparate fields.

                     
      ccGrp = charge; claim; charge-or-claim.
-def-and-pp = defGrp, (s, ppGrp)?
            | ppGrp
            .
     defGrp = defendant, (s, advocate)*, (s, witness)?
            | advocate, (s, witness)?
            | witness
            .
      ppGrp = (prosecutor | plaintiff), (s, advocate)?
            | advocate
            .

   -parties = partiesGrp, (s, partiesGrp)?.
 partiesGrp = party, (s, party)?, (s, advocate, (s, advocate)?)?.
     advGrp = advocate, (s, advocate)?.
     magGrp = judge.
     jurGrp = juror++s.
     witGrp = witness++s.

The individual fields each need to be identified by a distinctive nonterminal so that the rules just given can work. But nothing is changed inside any field, so the fields all have the equivalent of identity rules, some with special provision for a label attribute.

                     
             NB = -"<NB>", wf-xml, -"</NB>".
           date = -"<date>", wf-xml, -"</date>".
         charge = -"<charge>", wf-xml, -"</charge>".
          claim = -"<claim>", wf-xml, -"</claim>".
charge-or-claim = -"<charge-or-claim>", wf-xml, -"</charge-or-claim>".
      defendant = -"<defendant", label?, -">", wf-xml, -"</defendant>".
     prosecutor = -"<prosecutor", label?, -">", wf-xml, -"</prosecutor>".
      plaintiff = -"<plaintiff>", wf-xml, -"</plaintiff>".
          party = -"<party", label?, -">", wf-xml, -"</party>".
       advocate = -"<advocate", label?, -">", wf-xml, -"</advocate>".
          judge = -"<judge", label?, -">", wf-xml, -"</judge>".                  
          juror = -"<juror>", wf-xml, -"</juror>".
        witness = -"<witness", label?, -">", wf-xml, -"</witness>".
          other = -"<other", label?, -">", wf-xml, -"</other>".
        outcome = -"<outcome>", wf-xml, -"</outcome>".
        sources = -"<sources>", wf-xml, -"</sources>".
          notes = -"<notes>", wf-xml, -"</notes>".

         @label = S, -"label=", -__ATT-VALUE, s.

The nonterminal wf-xml is defined using a generic XML grammar, and the output from the iXML process is passed through the same XSLT fix-up transformation as was described above.

6.2. The macro-structured XML

The groupings detected and marked up in this step are visible in trial #1:

<trial n="1">
  <date>149 <note-ref>[1]</note-ref></date>
  <ccGrp>
    <charge><i>quaestio extraordinaria</i> (proposed)
    <note-ref>[2]</note-ref> (misconduct as gov. Lusitania
    150)</charge>
  </ccGrp>
  <defGrp>
    <defendant>Ser. Sulpicius Galba (58) cos. 144
      spoke <i>pro se</i> (<i>ORF</i> 19.II, III)</defendant>      
    <advocate>Q. Fulvius Nobilior (95) cos. 153, cens. 136</advocate>
  </defGrp>
  <ppGrp>
    <prosecutor label=" &#34;prosecutors&#34;"><list>
      <item>L. Cornelius Cethegus (91)</item>
      <item>M. Porcius Cato (9) cos. 195, cens. 184 (<i>ORF</i> 8.LI)</item>
      <item>L. Scribonius Libo (18) tr. pl. 149 (<i>promulgator</i>)</item>
    </list></prosecutor>
  </ppGrp>
  <outcome>proposal defeated</outcome>
  <sources>
    <ancient>Cic.<i> Div. Caec.</i> 66; <i>Mur</i>. 59;
    <i>D</i><i>e </i><i>o</i><i>rat</i>. 1.40, 227-28; 2.263;
    <i>Brut</i>. 80, 89; <i>Att</i>. 12.5b; Liv. 39.40.12;
    <i>Per</i>. 49; <i>Per</i>. <i>Oxy</i>. 49;
    Quint. <i>Inst</i>. 2.15.8; Plut. <i>Cat</i>. <i>Mai</i>. 15.5;
    Tac. <i>Ann</i>. 3.66; App. <i>Hisp</i>. 60;
    Fro. <i>Aur</i>. 1. p. 172 (56N); Gell. 1.12.17, 13.25.15; see
    also Val. Max. 8.1. abs. 2; Ps. Asc. p. 203;
    <i>Vir</i>. <i>Ill</i>. 47.7. <note-ref>[3]</note-ref></ancient>
  </sources>
  <notes>
    <note>[1] On the date see Cic. <i>Att</i>. 12.5b.</note>
    <note>[2] See Brennan (2000) 175. </note>
    <note>[3] Val. Max. wrongly portrays the trial as a
      iudicium populi: see Briscoe (2019) 69.</note>
  </notes>
</trial>

7. Future work: Recognizing micro-structures in mixed content

At this point, the fields and the macro-structure of each trial have been fully recognized and are tagged with characteristic element types. What remains to be done is to enrich the markup of individual fields by recognizing important structures: dates, individual references to ancient or modern sources (and their structure), references to named individuals, and so on. This will not be a single process defined by a single iXML grammar; it will use a series of independent grammars, each applied to individual fields as appropriate.

In order to operate in the simplest possible way on a specific field type (e.g. only on dates, or only on the sources element), at this point the pattern for invoking an iXML processor changes. In the earlier stages, the parser has been invoked from the command line to operate on a file or similar text stream. In this stage, the parser will be invoked from an XSLT or XQuery processor using an extension function. The processing of any given element E involves several steps:

E (or, more usually, its content) is serialized as XML and the result saved in a variable s of type xs:string.
An appropriate iXML grammar is used to parse s and return XML, which is stored in another variable (call it x). The grammar may have rules for every element type that may be found in the string, or it may use a generic XML grammar as illustrated in earlier stages.
Any generic XML (using the elements __ELEMENT etc.) in x is fixed up.
The contents of E are replaced with the value of x, or part of it.

The next steps in the development of the TLRR 2e workflow are thus to write iXML grammars to recognize and tag various micro-structures:

dates with indications of vagueness (date range, terminus ad quem, terminus a quo), uncertainty, and annotation (references to notes);
names of individuals, with annotation and indications of uncertainty;
references to ancient sources, identifying author, work, and locus for each reference (so that it can if desired be hyperlinked to an online edition);
bibliographic references to modern sources;
cross references to other trials (and supplying unique identifiers for trials, so that numbers can be regenerated in case trials are relocated and renumbered).
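
As a purely illustrative sketch of the first item, a grammar for date fields might look something like the following. The date forms it accepts (a single year, a year range, ‘before’ or ‘after’ a year, and a trailing question mark for uncertainty) and the names range, ante, post, year, and cert are assumptions made for the example. They are not taken from the project's grammars, and the real date fields are certainly more varied.

```ixml
{ Hypothetical sketch for date fields; not the project's grammar.
  All names except date and note-ref are invented for illustration. }

     date = -"<date>", s, dateval, (s, note-ref)*, s, -"</date>".
 -dateval = range | ante | post | year.
    range = year, -"-", year.
     ante = -"before ", year.    { terminus ad quem }
     post = -"after ", year.     { terminus a quo }
     year = digits, cert?.
    @cert = -"?", +"low".        { a trailing "?" becomes cert="low" }
  -digits = ["0"-"9"]+.
 note-ref = -"<note-ref>", ~["<"]*, -"</note-ref>".
       -s = -[" "; #9; #A; #D]*.
```

Applied to the date field of trial #1, a grammar along these lines should wrap 149 in a year element and pass the note-ref element through unchanged. Here the one embedded element type is given its own rule; a generic XML grammar of the kind used in earlier stages would be the alternative.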

8. Concluding remarks

Those interested in descriptive markup have been devising workflows beginning with word-processor documents since the late 1980s, if not earlier; there is nothing new about the idea of taking Word documents as input and producing useful XML with descriptive markup as output. The technical interest of the work described here, if it has any, is in the use of invisible XML to describe patterns of interest in the input and in the application of invisible-XML grammars not only to plain text but to XML documents, and further to mixed content within XML documents.

Patterns, of course, can be described in many ways. Some of the patterns recognized and marked up by the iXML grammars shown above can also be recognized using regular expressions or using ad hoc logic in a program.

Clarity, like beauty, may lie in the eye of the beholder, but it can easily be verified that the iXML grammars used in stages 2 and 3 of the processing are as a rule more compact than equivalent XSLT or XQuery programs. For example: the three iXML grammars used in stage 2 total 486 lines (including comments and blank lines); the XSLT stylesheet which did roughly the equivalent work on the first edition is 1043 lines long. For stage 3, the grammar is 120 lines and the rough equivalent in XSLT written for the first edition is 175 lines long.

Most iXML grammars are used to move non-XML data into the XML space; it is less usual to use them to process data that is already in XML. Some extensions to iXML (which have, for the most part, already been proposed and are still under consideration) might make workflows like that described here a little simpler:

A set-subtraction operator would make it much easier to eliminate ambiguity in cases like trial numbers and titles: the rule for titles could be written to say that it matches any bolded paragraph that does not match the description of a trial number. When both patterns are simple (technically: both regular languages), it's possible to work around the problem, but as shown above the solution is often complex and tedious to work out by hand.
Failing that, it would be helpful if one could manage ambiguity by specifying in the grammar which parse trees should be preferred when the input has more than one parse tree. A number of parser generators (e.g. yacc and its kin) provide annotations with this purpose; those mechanisms, however, offer little hope that an easily explained general approach to the problem can be found.
The ability to serialize XML with element and attribute names drawn not from the grammar but from the input would make it much easier to define a single rule for elements in the input which should reappear unchanged in the output.
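
The work-around mentioned in the first item can be illustrated with a deliberately simplified sketch. It assumes, hypothetically, that bolded paragraphs appear in the input as <p><b>…</b></p>, that a trial number consists solely of digits, and that bold text contains no nested markup; the rule names are invented for the example. Because iXML has no subtraction operator, ‘a bold paragraph that is not a number’ has to be restated positively as ‘a bold paragraph containing at least one non-digit’:

```ixml
{ Hypothetical sketch; the input markup and rule names are assumptions. }

 headings = (number | title)+.
   number = -"<p><b>", digit+, -"</b></p>".
    title = -"<p><b>", digit*, nondigit, ~["<"]*, -"</b></p>".
   -digit = ["0"-"9"].
-nondigit = ~["0"-"9"; "<"].
```

Even in this toy case the title rule has to be derived by hand from the number rule, and any change to the pattern for numbers means reworking the pattern for titles as well.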

Invisible XML processing is by no means a replacement for XSLT or XQuery; far from it. But as the examples above show, it is a useful complement to those languages, particularly useful in cases where the input XML is very simple and the patterns to be captured are more easily expressed in grammatical terms than in XPath.

Attachments

A Zip file containing the following attachments can be found in the Slides and Materials accompanying this paper.

wordml-to-rxml.xsl: transformation for stage 1.
rxml-to-gfxml0.ixml: grammar for step 2a.
gfxml0-to-gfxml1.ixml: grammar for step 2b.
gfxml1-to-fxml.ixml: grammar for step 2c.
gfxml1-to-fxml-generic.ixml: alternate grammar for step 2c.
gfxml1-to-fxml.xsl: XSLT stylesheet for step 2c.
gxml-to-xml.xsl: XSLT stylesheet for fixup after a ‘generic’ XML grammar is used.
fxml-to-mxml.ixml: grammar for stage 3.

Balisage: The Markup Conference

Balisage Paper: From Word to XML via iXML

a Word-first XML workflow in the TLRR 2e project

C. M. Sperberg-McQueen
