Overview
From time to time, it may be necessary or expedient to flatten our XML documents by replacing the start- and end-tags of conventional XML content elements with empty place-marker elements. These place-markers are variously known as milestone elements after the milestone technique described in the TEI Guidelines for page beginnings, column beginnings, line beginnings, etc. [TEI P5], or Trojan horse markup after the technique described by Steve DeRose [DeRose 2004]. To avoid over-stressing any particular syntax, we will use a more general term, markers. When we do flatten our documents, we will often wish, later, to convert markers back into their original form as content elements. The three co-authors discovered recently that we had each had occasion to perform this task, and that we had undertaken it using different techniques.[1]
The purpose of this late-breaking presentation is to survey multiple ways to perform the task of unflattening or raising, that is, restoring a conventional XML element structure of content elements from a flattened XML document instance. We will compare different solutions to see what we can learn from them. Nothing here is profoundly difficult or new, but each of us found it challenging and interesting enough that we think it may be worth while to share what we have learned with others.
In the following sections, we describe first a concrete instance of the task, with enough supporting detail to make clear that this is not an academic exercise but one that arose in a real project. We then present several approaches to solving the problem, including some false starts, which illustrate possible wrong turnings along the way. We then discuss and compare the different solutions with respect to coding difficulty and costs in space and time.
In the discussions that follow, we adopt the following terminology in an attempt to avoid unnecessary confusion. First, we distinguish content elements (marked by standard XML start- and end-tags) from virtual elements (indicated by markers).
content element |
a conventional XML element marked by start- and end-tags with (possibly empty) content between them, or by a sole-tag. Cf. marker. |
virtual element |
a |
marker |
an empty XML element serving to mark the start or end of a virtual element. |
We are interested in two processes that convert between content elements and virtual elements indicated by markers:
flattening |
the process of replacing the start- and end-tags of content elements with corresponding start- and end-markers. |
raising (aka unflattening) |
the process of replacing virtual elements with content elements by converting pairs of start- and end-markers into corresponding start- and end-tags. |
Our approaches for raising flattened XML may be categorized according to the following parameters:
-
Whether they read their input as XML or as a string.
-
Whether they construct their output as XML or as a string.
-
For those that read the input as XML, the order in which they raise the virtual elements.
The methods that work with the input and output as XML are typically recursive (whether the recursion is implemented through functions or templates), and the recursion follows one of three patterns:
-
The input is handled in a single pass and the virtual elements are raised left-to-right in start-tag/end-tag order.
We refer to the order as left-to-right or tag-order, rather than as depth-first, partly because in the input the markers and the content of the virtual elements they mark are all siblings. That is, the input document is typically a shallow hierarchy and there is no difference between depth- and breadth-first traversal of the input tree. With respect to the output tree, this pattern performs a depth-first pre- and post-order traversal: that is, the processing of an element starts before the processing of its children and ends afterwards.
We refer to the construction as tag-order because the construction of the virtual elements begins in the order of their start-markers and ends in the order of their end-markers.
-
The input is processed in multiple passes; on each pass the innermost or bottom-most virtual element(s) are raised to become content elements.
We refer to the construction as bottom-up because in each parent-child pair of the output, the construction of the child begins and ends before the construction of the parent begins.
As in the preceding case, because the input has no meaningful depth, it is more appropriate to describe the traversal without using hierarchical terms, although bottom-up is an appropriate way to refer to the order in which nodes of the final output tree are completed.
-
The input is processed with a series of recursive calls to a function or template to operate on a sequence of nodes. On each call, the left-most start-marker in the sequence and its matching end-marker are selected. Material which precedes the selected start-marker is returned without change; material between the selected markers is processed recursively as the content of a newly raised element; material which follows the selected end-marker is processed recursively as the following siblings of the newly raised element.[2]
We refer to the construction as outside-in because on each call (one of) the outermost element(s) in the parameter sequence is raised, and a recursive call operates on elements contained within that one.
With respect to those parameters, the methods discussed below may be classified as follows:
Table I
Method | Input | Output | Raising Order | Passes / Calls | Overlap handling |
---|---|---|---|---|---|
Right-sibling traversal | XML | XML | Left to right | Single | Silent error |
Inside-out recursion | XML | XML | Inside-out | Depth of tree | Partial raising, well formed |
Outside-in recursion | XML | XML | Outside-in | Number of virtual elements | Partial raising, well formed |
Accumulator | XML | XML | Left to right | Single | Silent error |
Regex replacement | String | String | Left to right | Single or double | Full raising, ill formed |
Python pull parser 1 | XML | String | Left to right | Single | Full raising, ill formed |
Python pull parser 2 | XML | XML | Left to right | Single | Silent error |
-
We have implemented the last three methods in Python and the others in XSLT (and experimentally in XQuery). The accumulator method requires XSLT 3.0; the inside-out, and outside-in methods can all be implemented with (recursive) function calls in XSLT 2.0 and 3.0 and with named templates in XSLT 1.0. Right-sibling traversal has been implemented using match templates; an experimental implementation using a recursive function is also contemplated.
-
Input and Output refer to whether the input is read as XML or as a string, and whether the output is constructed as a DOM tree or as a string.
-
Raising order describes the order in which virtual elements in the input are raised.
Left to right
,Inside-out
andoutside-in
are as described above. -
Passes / Calls describes how many passes over the input are made, or how many recursive calls to the core function are made.
Right-sibling traversal makes a single pass over the input, but because it does so using recursive calls to to
apply-templates
which select single nodes, the template stack will in theory grow until it has one template on the stack for each node in the flattened sequence of nodes in the input. (Since the calls are tail-recursive, implementations may optimize the calls and perform the traversal in constant stack space.)Regex replacement is easiest to implement in two passes: we replace start-markers globally and then end-markers globally in a pipeline (or vice versa). Matching both types of marker with a single regex is easy, but because the replacements are different (start-markers may contain non-Trojan attributes), the replacement logic may be more complex.
The other left-to-right methods will use template stack space proportional to the depth of the tree, as is common in XSLT stylesheets.
The inside-out method makes one pass over the input for each nesting level in the tree of virtual elements being raised. On each pass, the entire document is processed, although content elements can typically be handled with
copy-of
, which means their subtrees do not require template matching.In the outside-in method, each call to the function raises one virtual element (unless the arguments are all leaves, in which case it raises none), so the total number of calls is proportional to the number of virtual elements. Each node in the input is passed as (part of) an argument n + m times, where n is 1 plus the number of virtual ancestors the node has (for outermost elements, n= 1, for their children, 2, etc.), and m is 1 + the number of preceding siblings the node has in the final output tree.
-
Overlap handling describes what happens when this method is used on input with overlapping virtual elements, where an attempt to raise every virtual element would produce ill-formed output.
Partial raising, well formed means that the method raises some but not all virtual elements and produces well-formed XML by leaving some markers unraised.
Full raising, ill formed means that the method converts all start- and end-markers to start- and end-tags without any rearrangement or adjustment, even when the result is not well formed.
Silent error means that the method produces well formed XML output that is semantically incorrect: in some cases the method includes content within a raised element which does not belong there, in other cases content is omitted. In all three cases, the absence of any error signal is a characteristic of our simple proof-of-concept implementations, which were designed to handle input without overlapping virtual elements. All of these methods can be implemented with a more thorough check of the input, to raise an error, and optionally to recover from the situation, in cases involving overlapping virtual elements.
The problem
Raising Frankenstein
As a concrete example, we can consider the form taken by this task in the Variorum Frankenstein [Variorum Frankenstein] project edited by the second author. In this project, we collate different encodings of the novel deriving from five digital sources using the software CollateX [CollateX], which reads the input texts to locate their moments of alignment and variation. The process of collation compares XML documents as text files, which means that XML tags are treated as text so that strings of text that compare structural boundaries like paragraph breaks in one version of a document can be aligned with passages containing the same (or nearly the same) text lacking those boundaries. CollateX provides XML output that represents in one file the collation of all the variant input documents, and its output raises a new hierarchy made up of a root element and a flat sequence of TEI critical apparatus elements designed to mark where the documents align in comparable and variant passages. When XML elements are supplied in the input to the collation process, their tags are returned as text with angle brackets escaped, as shown in Figure 1 below.
To prepare the editions for collation we began by flattening the original structural markup, converting elements that wrapped volumes, chapters, paragraphs, and lines (among others) into self-closed start and end markers. We flattened these elements in anticipation of reconstructing them on the other side of the collation process. Because we need to retain original markup information in the eventual collation output for later use, flattening it allows us to preserve it without letting it interfere with the alignment process or the new hierarchical output of CollateX. The XML output produced by CollateX includes (as we expect and want) many fragmented start and end tags, showing us not only places where paragraph breaks occur in one witness to the text but not in the others, but also locations where part (but not all) of a deleted passage in the manuscript draft aligns with material in the published editions of the novel. In such situations, we wind up with an original start-tag inside one container element and its corresponding original end-tag inside another, and we cannot reconstruct that element without creating overlap. Knowing that the collation process will generate (by design) a hierarchy in conflict with the structural markup of our input documents, we prepared the input source documents in advance to flatten their hierarchies, because we intended to reconstruct the elements later building on the collation output.
Here is an example from the output of CollateX representing a single divergent
reading in the Thomas copy
of Frankenstein
before all witnesses align. (Some line breaks have
been introduced for legibility.)
In the example above, there is only one <rdg>
element inside
the first <app>
because at this point the Thomas edition contains
an inserted passage not present in the other editions, which have no material to
compare with it. Following this point, the witnesses all agree, as shown in the next
<app>
element. The output actually does not contain markers
for the virtual <del>
and <p>
elements but
rather these have been converted to a string carrying escape characters as a trace
of their prior existence as elements: <del/>
and
<p/>
.
Following the collation process, we use the XML output as the basis for
reconstructing the individual edition files so that they may individually indicate
hotspots
, or passages that vary in the other editions. We wrote
XSLT to run over the collation XML output to produce a separate file representing
each reading
witness (not shown here for space reasons). In this first stage, we leave all the
original marker elements of the edition as escaped strings and we mark each witness’s
hotspots with <seg>
elements.
Because these hotspots will sometimes overlap with structural markup from the input
editions, we must also insert these elements as flattened start and end markers.
In the next stage, shown in Figure 2, we apply an XSLT script with
<xsl:analyze-string>
to reconstruct the source edition’s
element markers from the strings.
The script above applies regular expressions to isolate the strings we need to reconstruct
the Trojan marker elements of the source edition that were reduced to text in the
collation process. Going into the collation none of our files contain namespaces,
but we apply them in the up-conversion process, so here we concatenate the th:
namespace prefix as we reconstruct our Trojan horse attributes.
Following this element reconstruction, our next
task will be to find a way to raise both the original markup we flattened from the
source editions with the conflicting hierarchy posed by the seg
elements derived from the collation process. Figure 3
shows a sample passage from the
Thomas copy edition file with the reconstructed but
flattened elements (again, white space has been added for legibility):
In the passage above, the empty <seg>
elements indicate the
start and end points of the variant passage in the Thomas copy. Their partially
shared @xml:id
values coindex them while also pointing to the collation
unit and numbered <app>
element in the collation output; the
trailing underscore separator and the string start
or end
distinguish start- from end-markers. In this example, <seg
xml:id="C10_app90-fThomas_start"/>
means that this is the start of a
variant from apparatus unit 90, where the Thomas copy diverges from a reading shared
by all of the other editions (in this case, the type of variation is that the other
witnesses have no corresponding reading).
Within the <seg>
virtual element, the first two
<del>
markers (start- and end-) frame a deleted portion of a
word (puls
), and the third marks the end of paragraph 133 in Chapter
4, followed by the beginning of paragraph 134 in the same chapter. The use of
attributes on <del>
and <p>
markers differs from
that with the <seg>
elements; in the case of
<del>
and <p>
, the Trojan horse attributes point
to the location in the source and coindexes the start- and end-markers.[4]
While we wish to raise both hierarchies, the primary use case on which we concentrate
in this paper is to raise the structural elements from the original edition (such
as <p>
and <del>
), while leaving the <seg>
elements flattened. The question of how best to accomplish this brought the three
co-authors together, and serves as a testing ground for the raising methods described
in this paper. The output will look something like Figure 4 for the end of paragraph 133 and the start of paragraph 134 in the Thomas edition:
A secondary use case is to raise the <seg/>
elements. When,
later in the production process, we also raise the <seg>
elements, we avoid overlap, or, rather we mark a moment of intersecting hierarchy,
by fragmenting the one that spans a paragraph boundary
into two parts:
The split <seg>
elements in the passage above now indicate
their association with each other with __Pt1
and __Pt2
appended to the original value of the apparatus and
reading witness location.
Our process of raising the new edition files thus entails the following steps:
-
Flatten all markup (prior to and following the collation process)
-
Reconstruct (raise) the structural elements from the source edition (e.g.,
<p>
,<del>
) -
Raise the
<seg>
elements, which indicate moments of variation in the collation, splitting the raised elements into parts where that is required to avoid creating overlap
Simplified sample data
In addition to exploring and reporting on the application of different raising methods to authentic data from Variorum Frankenstein, described above, for illustrative and development purposes when discussing program logic we use a small contrived hierarchical XML sample extracted from a short passage of poetry, derived from a quoted passage in the Frankenstein novel. The simplified data comes in three forms, which we call basic, extended, and overlap.[5]
Basic sample
A sample of the basic input format in its original form is:
Flattening
To test the method on our Simplified sample data, we first flatten the
original XML with the following XSLT, which converts all tags except the root
(which must be preserved as a container element to ensure that the XML is well
formed) to Trojan milestones. We modify the Trojan milestone markup method
described in DeRose 2004 by putting the @sID
and
@eID
attributes in a namespace, for which we bind the prefix
th:
to the URI
http://www.blackmesatech.com/2017/nss/trojan-horse
(following Sperberg-McQueen 2018). The
original generic identifier is retained, the start-tag is replaced by an empty
element that adds a @th:sID
attribute with a generated value, and
the end-tag is replaced by an empty element that adds a th:eID
attribute with the same generated value:
The flattened version looks roughly like the following (line breaks have been added within tags here, to shorten the lines:
The output of raising must match the original XML.
Extended basic sample
The basic sample contains no non-Trojan empty elements and no non-Trojan attributes. We can test whether those features are handled properly with the following slightly more complicated sample:
The extended basic sample adds the following features:
-
The
bibl
element a non-Trojan element, without Trojan attributes. -
Most elements have, in addition to their Trojan attributes,
@xml:id
attributes. The Trojan attributes must be removed during raising, but the non-Trojan attributes must be retained.
Overlapping markers
To test the behavior of the methods with input that cannot be fully raised without creating overlap, we use the following sample (some whitespace added for legibility):
This consists of a few lines of verse, marked up with both the metrical structure (verse lines) and the division into sentences; both the first and third sentences span verse boundaries.
Frankenstein markers
The Frankenstein Variorum encoding represents an overlapping virtual hierarchy, so
we need a way of distinguishing one set of elements participating in one hierarchy
from the other set. The virtual elements participating in the first hierarchy hold
trojan markers with attributes in the form of @th:sID
and @th:eID
. The second hierarchy is signalled with an alternative marker style used only on
the <seg>
elements. Here there is an @xml:id
bearing a flag indicating the start or end position of a variant passage. These markers
cannot be raised in exactly the way they were placed, since in some cases a variant
passage may start inside a paragraph or chapter, and end in the next one. In this
project, the decision to raise the <seg>
elements necessitates a process of splitting them into pieces within structural boundaries,
so that the original virtual element becomes two elements, with part one inside one
structural unit and part two inside the next. Different styles of markers here facilitates
handling these elements at different moments in the raising process.
Solutions to the problem
We describe several solutions to the problem, some formulated in XSLT 3.0 and some using other tools.
Right-sibling traversal
One way to construct content elements from flattened XML is to do roughly what a recursive-descent parser for XML does: when a recursive-descent parser sees the beginning of any construct, it calls a routine to handle that construct; if the construct contains other constructs, other routines are called recursively. The function handling a parent element starts before any function handling a child, and ends only after all children have been processed.
The standard idiom of including
xsl:apply-templates
similarly visits each
node in the input tree left to right, in order, but
provides no mechanism for passing results or state from
one sibling to the next. In order to perform the raising
task, we perform instead a right-sibling traversal of the
input tree, which proceeds one element at a time
processing along the following-sibling axis. Each marker
element calls <xsl:apply-templates>
only on its first child, and each template passes control
to its immediate right sibling.
Order of raising the elements
Figure 11 through Figure 20 illustrate some stages in the process of raising the virtual elements in the sample document by progressing through the flat sequence of siblings which makes up most of the input.
Between Figure 14 and Figure 15, we elide several stages. Figure 15 through Figure 17 illustrate the complete recognition of the second line of verse.
Figure 18 through Figure 20 illustrate the recognition of the last two end-markers and the final state of processing.
Outline of XSLT code
As described above, the right-sibling traversal idiom
in XSLT proceeds one element at a time along the following-sibling axis. In the
general case, each template for a content element calls
<xsl:apply-templates>
not on all its children but only on
its first child. Every template (for content elements, marker elements, text
nodes, comments, or processing instructions) concludes with a call to
<xsl:apply-templates
select="following-sibling::node()[1]"/>
, thus passing control to the
template for its immediate right sibling. In the flattening case, the outermost
element of the document may be the only content element and its template the
only one that applies templates to its children.
For our simple test cases, processing begins with the template for the outermost element.[6]
The template for start-markers constructs a content element for the corresponding virtual element; it has the following structure:
Note that there are two calls to <xsl:apply-templates>
. The
first call to <xsl:apply-templates>
occurs within an
<xsl:copy>
element (which constructs an element with the
name and attributes given by the start-marker); it selects the immediately
following sibling node, which will in the normal run of things become the first
child node of the new content element. The second call to
<xsl:apply-templates>
occurs after the new content
element and it does not select the start-marker’s immediately following sibling, but instead selects the
node immediately to the right of the end-marker.
Each child of the virtual element copies itself into the content element being
created. The templates for text()
nodes, comments, processing
instructions, and any content elements present in the input all have essentially the
same structure:
If content elements can contain further markers at other levels of the tree, the
<xsl:copy-of>
should be replaced by a shallow copy and a
recursive <xsl:apply-templates select="child::node()[1]>
. That is
not the case with our test data.
When an end-marker is encountered, the contents of the virtual element whose end it marks have now all been accumulated, and the right-sibling traversal of the input should stop. The template for end-markers will thus look like this:
The code shown here is simplified by assuming that the start- and end-markers in its input correspond to the start- and end-tags of a well-formed XML document. The template for end-markers, for example, does not check to see that the end-marker found is the one that matches the element on the top of the current element stack. When confonted with the overlapping virtual elements of some of our sample input, therefore, the code shown here will behave as if start- and end-markers were paired up by position, without regard for co-indexing or element type.
Better behavior in the presence of overlapping
virtual elements or errors in the input can be achieved
by passing parameters on each call to
apply-templates
in mode raising
,
which keep a stack of co-indexing IDs and element types.
Resource consumption
A few notes on resource consumption may be in order.
The templates shown each handle a single node and then call
<xsl:apply-templates>
, selecting the next node to be
processed. The template activation stack in the XSLT processor thus contains no
sets of nodes waiting to be processed, but the call stack may become rather
deep: in our sample data, the maximum depth of the template stack is the number
of nodes in the flattened portion of the input. An XSLT processor that does not
eliminate tail-recursive template calls may thus run out of stack space (at
least in theory—in practice, we have never seen a right-sibling recursion blow
the stack on real, well-formed input).
Because the right-sibling idiom visits each node in the flattened input just once, the cost of the process should be linear in the size of the input. (In practice, preliminary measurements suggest that with the Saxon HE processor, the rise in cost is sub-linear to the rise in input size.)
Despite its virtues, the right-sibling traveral idiom is needed rarely enough in XSLT programming that many XSLT authors have never used it; some find it difficult.[7] The inside-out algorithm described below (Inside-out recursion) can be regarded as an alternative which is closer to conventional XSLT usage. The implementation using XSLT 3.0 accumulators is another approach worth learning.
Raising Frankenstein
For raising Frankenstein using right-sibling traversal, we adapt the code lightly to apply to its project markup in the following template rules:
Inside-out recursion
Inside-out recursion works by finding all innermost pairs of start- and
end-markers, that is, those that mark virtual elements that do not contain any
markers or other virtual elements. They may contain anything else, including
text()
nodes, empty elements that are not markers, and container
elements. The transformation forms all innermost pairs that it finds into container
elements and passes the resulting new document back into the recursive function,
where the (new) innermost pairs of markers are now outside the newly created
container elements. The function recurs until there are no more markers. The maximum
possible depth of recursion is equal to the depth of the original (pre-flattening)
XML hierarchy.
The sequence in which virtual elements in our sample document are raised in this method is illustrated by the diagrams in Figure 26 to Figure 32.
Raising
The output of the preceding transformation is included in the Simplified sample data section, above. We then reconstruct the hierarchy
using the inside-out recursive function th:raise()
in the following
XSLT stylesheet:
We turn off indentation (line 5) to avoid deforming the whitespace and set up
a debug option (line 8), controlled by a
$debug
parameter.
The traditional identity template walks the entire tree, but because any
descendant nodes lower than children of the root are guaranteed not to be or
contain Trojan markers, we set up one identity template for those lower nodes,
which just copied them to the output (lines 16–18). The regular identity
template (lines 21–25) applies only to the root element and its children.
@exclude-result-prefixes="#all"
is not enough to avoid writing
the th:
namespace onto the root element of the output, even though
the namespace in question is not used in the output. An unused namespace
declaration is informationally harmless, but also needlessly distracting, so we
suppress it by specifying @copy-namespaces="no"
on
<xsl:copy>
in the identity template that deals with
Trojan markers (line 22).
Our recursive raising operation (the th:raise()
function, lines
31–52) operates on document nodes, and we need to process the original document
node of the input file differently from the new document nodes that we create on
each pass through the recursive function. For that reason, we match the original
document node in no mode (<xsl:template match="/">
, lines
54–56) and pass it into the raising function (<xsl:sequence
select="th:raise(.)"/>
, line
55).
The raising function checks for the presence of @th:sID
attributes in the input that are candidates for raising (<xsl:when
test="exists($input//*[@th:sID eq
following-sibling::*[@th:eID][1]/@th:eID])">
, line 137 see the
discussion of this test in Inside-out recursion challenges). If there
aren’t any (<xsl:otherwise>
, lines 46–50), the recursion is
finished, and the function returns the result (<xsl:sequence
select="$input"/>
, line 49). If there are still
@th:sID
attributes that can be raised in the text, we create a
variable $result
(lines 39–43) of type document and apply templates
inside the newly created document node (line 41). After the application of
templates is finished, we recur and pass the result into another invocation of
th:raise()
(<xsl:sequence
select="th:raise($result)"/>
, line
44).
The application of templates within the recursive function begins by applying
templates to the (newly created) document node in loop mode
(<xsl:apply-templates select="$input" mode="loop"/>
, line
41). The matching template (lines 58–60) simply applies templates to its
children, unlike the template that matches the original document node (in no
mode, lines 54–56), which passes the document into the th:raise()
function (line 55), a difference in mode that is needed to avoid an endless
loop. All other processing is the same for both the original document and the
interim documents created inside th:raise()
, so
<xsl:template match="/" mode="loop">
(lines 58–60) is the
only modal template, and it applies templates to
its children in no mode.
There are three templates that do the actual processing of the innermost
elements to be raised on each recursion: one that processes the start-marker,
one that processes the content of the newly raised container element, and one
that processes the corresponding end-marker. XSLT thinks that they have the same
priority as the identity templates, so we specify a higher priority explicitly
with priority="1"
.
-
start-marker: We match elements with a
@th:sID
attribute that has a value equal to the value of a@th:eID
attribute on their first following sibling element that has a@th:eID
attribute (line 33). This, then, matches only start-markers that contain nothing buttext()
nodes, non-Trojan empty elements, and container elements. In other words, it matches only the innermost flattened elements, those that do not contain any other empty flattened elements.[8] We process these hits by creating a container element with the same generic identifier as the start-marker; we instruct it not to copy unused namespaces, and we copy all non-Trojan attributes. We create the element content by copying all following-sibling nodes that precede the end-marker (which we find with the help of theend-markers
key) that matches the start-marker we’re processing at the moment (lines 66–72). In other words, we copy the content of the newly raised element into it. -
nodes inside the new wrapper: We have already copied the content of the newly raised element inside it, which means that we don’t want to process those nodes again, since that would create duplicates. For that reason, we suppress all nodes between the start- and end-markers that we’re processing at the moment by matching them inside an empty
<xsl:template>
element (lines 75–78). -
end-marker: Since we create real start- and end-tags when we match the flattened start-marker, we have no more use for the flattened end-marker, so we suppress it by matching it, too, inside an empty
<xsl:template>
element (line 80).
When the simplified original document is flattened and then raised, as described above, the output of the raising operation matches the original input.
Raising Frankenstein
To raise the Frankenstein data, our process is run recursively over a collection of 165 files in the Frankenstein Variorum. The files have a TEI header that for project maintenance purposes needs to be updated after a version change, and only a portion of the document is to be raised. Here the most significant adaptation is that we supply an element node rather than a document node as the input parameter for the th:raise function.
The adaptation to the raising function to process a container element node allows for us to make changes to our TEI header. Beyond this, there are few differences between our generic inside-out raising example and the Frankenstein stylesheet. One minor difference is that we use only one identity-transformation template, and have not set a priority on the template rule to process the innermost elements. We convert the Trojan marker attributes into an xml:id with an element constructor:
Outside-in processing
Once the inside-out algorithm has been defined and understood, it is natural to wonder whether a mirror-image version of the algorithm would be possible which works from the outside in. Once the pairs of outermost start- and end-markers have been identified, the function can be called recursively not on the entire document but only on the children of the outermost elements, in an instantiation of a divide-and-conquer strategy; this should (or so it seems) result in faster processing.
Our realization of this idea is imperfect
in that we have not found a way to find and raise
all outermost
elements in a sequence with a single call, the way
the template-matching rules of the inside-out
approach can raise multiple elements in a single
call to apply-templates
. Instead,
we find the left-most start-marker and process it,
then recur to continue processing the remainder
of the sequence after the matching end-marker.
This results in two recursive calls for each virtual
element: one to handle the children of the newly
raised element, and one for its right-siblings.
The order in which elements are raised is illustrated in the diagrams of Figure 36 to Figure 44. Some steps are elided to save space.
Code outline
The default handling for all nodes is a shallow copy, as specified in the mode declaration for the default mode:
For content nodes, the function th:raise-sequence()
is called on their
children. In the simple inputs we are dealing with, this template will match
only the document's outermost element; for the Frankenstein data, it will match
several others. At the time of this writing, we have not experimented to see
whether it would be faster to test for start- and end-marker children here.
The core of the implementation is the
th:raise-sequence
function.
The function begins by creating lists of start- and end-markers (identified by their co-indexing IDs), so that we can find not just the leftmost start-marker but the leftmost start-marker with a matching end-marker in the sequence. If there are no matching pairs in the sequence, the sequence is returned without change. (This makes this implementation slightly more robust in the presence of overlap than the implementation shown above of the right-sibling traversal approach; we have not yet had time to update our method of right-sibling traversal to rectify the inconsistency in implementation philosophy.)
In the normal case, however, the function
will select the leftmost start-marker with a
matching end-marker, find the positions of the
start- and end-markers in the input, and then
partition the input sequence into three parts.
A call to apply-templates
processes
nodes preceding the selected start-marker;
a recursive call to th:raise-sequence()
from within the raised elements processes the
nodes between the selected start- and end-markers;
a second recursive call processes the remainder
of the original sequence.
A number of ancillary functions are defined in the stylesheet and used in the code shown above. They are shown briefly here.
The th:start-marker()
and th:end-marker()
functions
encapsulate the recognition criteria for
markers. (These forms are redundant with
those given elsewhere in this paper; we have
not yet eliminated the redundancy in our
demonstration code.)
The th:id()
function
returns the co-indexing identifier used for
any given marker.
Resource consumption
The implementation shown calls the
th:raise-sequence()
function
once for each virtual element. The cost of
the process should thus be roughly linear
in the number of virtual elements in the input;
that seems consistent with our timings of
XSLT processors.[9]
Each node in the original flattened sequence is passed to the central as part of the argument sequence once for each level of its distance from the root of the virtual tree. In this, the outside-in approach is indeed the mirror image of the inside-out approach.
The implementation of outside-in shown here is not, however, demonstrably faster than our implementation of inside-out, despite the divide-and-conquer strategy. We have not been able to perform measurements to explain the differences in run time, but we can offer a couple of speculations. First, on our test data the inside-out approach reduces the number of nodes faster than the outside-in approach: in a typical XML tree there will be more leaf nodes than root nodes (and more than that: in our test data the fanout increases with distance from the root). Second, each call to the function traverses the input sequence several times: once to find all start-marker IDs, once to find all end-marker IDs, once to find the positions of the selected markers. Either or both of these could explain a slow run-time.
Accumulators
Another approach to the problem avoids recursion by performing a single pass over the input, maintaining a stack of partly-raised elements using the accumulator construction introduced in XSLT 3.0.
A bluffer's guide to accumulators
An accumulator is a sequence of values associated with the nodes of a tree. Each node has two values:
-
a before value (calculated without access to the values associated with any descendants, and thus calculable before descendants are visited in a depth-first traversal of the tree)
-
an after value (which may depend on the values associated with descendants and thus cannot be calculated until after they have been visited).
As a simple example, we can make an accumulator to keep track of the number of
open virtual elements at any given point in the flattened input document: when we
encounter a start-marker we will add one to the accumulated value, and when we
encounter an end-marker we will subtract one. An accumulator is declared using an
<xsl:accumulator>
declaration, which contains a sequence of
<xsl:accumulator-rule>
elements describing how the
accumulator’s values are calculated. Within an accumulator rule, the variable
$value
gives the value of the accumulator assigned by the
immediately preceding node.
The pattern in the accumulator rules here assumes that start- and end-markers use
the @th:sID
and @th:eID
attributes described above.
At any point in a tree to which this accumulator applies, we can refer to the
current node’s before value of the accumulator with the expression
accumulator-before('stack')
and to the after value with
accumulator-after('stack')
.[10] Without accumulators, we could keep track of the relevant information by
replacing references to accumlator-before('stack')
with references to
count(preceding::*[@th:sID]) -
count(preceding::[@th:eID])
.
Implementation strategy
To solve the raising problem with accumulators, we declare an accumulator whose value is an array (if the XPath 3.1 feature is supported) or a map, with which we maintain a stack of virtual elements that have been started, but not finished. The first member of the array will track the contents of the outermost element, the second member will track the contents of a second-level virtual element, and so forth. Each member of the array is a sequence of nodes containing a start marker (the first item in the sequence) and all the nodes seen so far that should be children of the raised content element.
The processing rules are relatively straightforward and may feel familiar to anyone who has ever constructed an in-memory tree from a SAX event stream:
-
Each time we see a start-marker, we will push a new sequence onto the stack.
-
Each time we see a text node, comment, processing instruction, or content element, we will append it to the sequence on the top of the stack.
-
Each time we see an end-marker, we will
-
Create a content element from the sequence on top of the stack: the element type and attributes come from the start-marker at
$stack(array:size($stack))[1]
, and the contents come from the rest of the sequence:$stack(array:size($stack))[position() gt 1]
. -
Pop the stack, i.e., discard the topmost member, whose purpose has been served now it has been turned into a content element.
-
Append the newly created content element to the sequence now at the top of the stack.
Special handling is required for the case in which the element just ending is at level 1 in the stack, because once the stack has been popped there is no top level to which to append the new element, which should be written to the output tree instead.
-
Some utility functions
In order to allow the stylesheet to be used with either the th:
attributes or with the ana="start|end"
convention, we isolate the
definition of start- and end-markers in two functions we place in the
Trojan-Horse namespace:
These functions rely on a static parameter that identifies the idiom in use for markers.
Marking the parameter as static allows the choice among definitions to be made
during static analysis of the stylesheet (informally, at compile time) and not
with a dynamic (run-time) choose/when
construction.
Another function constructs a content element from a sequence of nodes beginning with a start-marker:
The functions just presented are not specific to the use of accumulators; they or analogous functions could be used in any of the XSLT stylesheets described here. But most of the other code shown in the current version of this paper does not use them; they are described here because the other code in this section uses them and would be hard to understand if they were not presented.
Declaring the accumulator and making it applicable
The overall pattern of accumulator declarations was shown above, and the declaration for an accumulator to be used as a stack follows the same pattern, though the rules are slightly more complex.
Since accumulators were introduced largely to support the streaming processing
of very large inputs, and since the pointless calculation of accumulator values
would make that processing unnecessarily difficult, XSLT 3.0 requires that any
accumulators be associated more or less explicitly with an input tree or a mode.
If the accumulator is not made applicable to a given tree in this way,
references to the accumulator-before()
and
accumulator-after()
functions will raise errors.
We handle that by declaring the default mode of the stylesheet and adding a
@use-accumulators
attribute naming the accumulators to be
calculated:
Note that the streamable="yes"
is not required unless streaming
processing is to be requested.[11]
Templates
The templates in this stylesheet have very little work to do: their primary task is to override the default templates and ensure that nodes are not copied unnecessarily to the output tree.
The template for start-markers does nothing; all the work is handled by the declaration of the accumulator.
The template for end-markers is more complicated, because it must handle the case where the end-marker marks the end of an outermost virtual element. The normal accumulator rules cannot handle this case, because they write the newly constructed element into the new top member of the stack. But after we pop the stack at level one, the stack is empty and there is no top member. So we need to make the element and write it to the output tree ourselves. In all other cases, this template has no work to do.
The template for the document node is identical to the default template; it is
made necessary by the specification on-no-match="fail"
on the mode
declaration.
The templates for other nodes make the same distinction as that for end-markers: when the stack is empty, the node is copied to the output tree, and otherwise all the work is left to the accumulator.
Regex
It is not, in general, advisable to attempt to process arbitrary XML solely with regular expressions (regex), but it is not difficult to identify and process markers of the kind we are concerned with here.[12]
The following regex matches start-markers:
The regex works as follows (in dot-all mode, that
is, where dot also matches \n
):
-
The first capture group matches everything from the beginning of a tag that contains a
@th:sID
attribute until that attribute name. This necessarily includes the space that precedes the attribute name, as well as any attributes that might also precede it. -
We do not capture any part of the
@th:sID
attribute: the attribute name, the equal sign (with optional whitespace before or after), the quotation mark value delimiter (single or double), the attribute value (all characters up to the closing value delimiter), and the closing value delimiter. As long as the@th:sID
and@th:eID
values are created with the XPathgenerate-id()
function, they cannot contain single or double quotation marks (generate-id()
creates only values that are XML names), so we do not need to verify that the opening and closing delimiters match each other lexically.[13] -
The second capture group captures everything following the
@th:sID
attribute up to the/>
that marks the end of the tag. -
We do not capture the
/
before the closing>
.[14] -
The third capture group captures the closing
>
.
We replace all matches with the following replacement pattern:
The regex to match end-markers is similar to the one for start-markers, and
because real end-tags cannot contain attributes, we do not need to match or copy
them. We capture the opening <
separately from whatever follows it,
so that we can write a /
into the replacement after it. The regex
is:
and the replacement pattern is:
This method will incorrectly apply the replacement to matching patterns within XML comments and CDATA marked sections. With insincere apologies for disappointing Regex Edge-Case Bounty Hunters, coping with matches in these contexts, which would not naturally appear in our data, is not a goal in our work.[15]
A more serious limitation of this method is that because it is not XML-aware, it can be used only in situations where raising is guaranteed not to create overlap. For example, given input like:
the result of converting all markers to real start- and end-tags would be:
The resulting document is not well-formed because the pages and paragraphs overlap. It is, however, possible to raise just pages but not paragraphs, or just paragraphs but not pages, without creating overlap, and the result would be well-formed.
Pull parsing in Python
The Python xml.dom.pulldom
module can be used to stream an XML
document past an event handler that can be instructed to raise a flattened hierarchy.[16] The result of pull parsing can be fashioned into output in two ways,
as a string or as XML. With string output, it passes all events through unchanged
except for markers, which it replaces with regular XML start- and end-tags. Because
the output construction is not XML-aware, it has the same limitations as the regex
approach: most significantly, it is capable of creating output that includes
overlap, and that therefore is not well formed XML. With XML output, though, we use
xml.dom.minidom
to create XML elements within a DOM structure,
which means that the result is necessarily well-formed. Pull parsing operates as a
single traversal, which is to say that it begins at the document node and touches
each element only once. With string output, the procedure can create write output
as
it handles each event; with XML output it is necessary to maintain the current
context in a stack, which is similar to the use of a stack in the Accumulators method described above.[17] The maximum depth of the stack is equal to the maximum depth of nesting
in the hierarchical XML.
String output
The following Python 3 code replaces markers with strings equivalent to real XML start- and end-tags:
We create an empty list to hold the output, read in the source, and examine
each event. In this simplified example, we process only
START_ELEMENT
, END_ELEMENT
, and
CHARACTERS
, as follows:
-
START_ELEMENT.
Markers with@th:eID
attributes are flattened end-tags, so when we encounter one, we output</
; for otherSTART_ELEMENT
events we output only<
. We follow this with the gi (node.nodeName
) and then iterate over the attributes. For each non-Trojan attribute, we output a space, the attribute name, an equal sign, and the attribute value in quotation marks. Finally, we output the closing>
. -
END_ELEMENT.
Although markers are sole-tags, they fire bothSTART_ELEMENT
andEND_ELEMENT
events. Since we process all markers at theirSTART_ELEMENT
events, we ignore theirEND_ELEMENT
events. For otherEND_ELEMENT
events, we create a regular end-tag. -
CHARACTERS.
We add character data content to the output, explicitly replacing reserved characters with XML entities.
We are parsing the input XML in an XML-aware manner, which is reasonably robust, but we are constructing the output XML as a string, which is not. See below section “Pull parsing in Python” for a discussion of the limitations.
XML output
The general approach to parsing with pulldom
and constructing
output with minidom
is similar to the Accumulators method described above. Specifically, the flattened XML is traversed in a
single pass from left to right (technically a depth-first traversal, which is
how pulldom
sees it) and events (we focus on
START_ELEMENT
, END_ELEMENT
, and
CHARACTERS
) are handled as they occur. The output is assembled
in memory as a tree, using minidom
to create and attach nodes as
needed. Open elements are stored on a stack, the maximum depth of which is,
therefore, equal to the maximum depth of the unflattened XML. The code
is:
Parsing is the same as described above, so we import the same items from
pulldom
. From minidom
we import only the
Document
class, which we use to initialize our output XML
document. Our Stack
class is a Python list adorned with
stack-idiomatic terminology: we alias append()
as
push()
(lists already have a pop()
method), and we
add peek()
to provide access to the item at the top of the stack.
That top item is the most recently opened element, and we peek at it in order to
append child elements and child text()
nodes as we encouter them.
We begin by pushing the document node onto the
stack.
We handle START_ELEMENT
, END_ELEMENT
, and
CHARACTERS
events in an if/elif/elif
structure,
silently ignoring other events. There are three types of elements whose start
and end events we need to process: start-markers, end-markers, and regular
elements (whether empty or not).
-
In the case of start-markers and regular elements, we append the new node as a child of the item at the top of the stack (initially the document node; thereafter an element node) and then push the node onto the stack so that it becomes the new context element. The
appendChild()
andpush()
operations serve different but related purposes. Appending the new node to the current context (whatever is at the top of the stack) adds it to the correct location on the eventual output tree. Pushing it makes it the new current context, so that subsequent nodes will be added to it as children as long as it remains the current context. Nodes are passed by reference, rather than by value, so after these operations we point to the same new current node from three places:-
the original input document,
-
the top of the stack
-
the location in the tree we are assembling under the output document node, which is at the bottom of the stack.
We ignore
START_ELEMENT
events for end-markers; all of their processing happens on theirEND_ELEMENT
event. This is a somewhat arbitrary decision. -
-
The three mutually exclusive clauses in the
if/elif/elif
block that processesEND_ELEMENT
events work as follows:-
minidom
treats namespace declarations as attributes, so the presence of an attribute with the namexmlns:th
identifies the root note.[18] In our examples we are raising all markers, and therefore removing all attributes in the Trojan namespace, so we remove this declaration to avoid the clutter of an unneeded namespace declaration in the output. Because we identify Trojan attributes by their lexical shape (that is, by the presence of an explicitth:
prefix), removing the namespace declaration early in the processing does not impinge on our ability to identify attribute instances that use the prefix. Because we never pop the root element, it will be at the top of the stack when our processing concludes. -
We remove the Trojan
@th:sID
attribute from the start-marker. We want to remove it eventually because it is not needed in the output, but we cannot remove it until theEND_ELEMENT
event of the start-marker because its presence tells us not to pop the stack at thatEND_ELEMENT
event. In this way we push the element we are currently raising onto the top of the stack at its start-marker, but we remove its Trojan attributes only when we pop it at theEND_ELEMENT
event of the end-marker. -
The first two clauses process the root element and start-markers, both of which we leave on the stack after their
END_ELEMENT
events. In the remaining cases (end-markers and regular elements, whether empty of not), we pop the stack at theEND_ELEMENT
event, which makes its parent in the output tree where we are constructing the current context for the next event.
-
-
In the case of
CHARACTERS
, we create atext()
node and append it as a child of the current context, that is, the element at the top of the stack. When we wrote string output earlier we had to convert reserved characters to character entities ourselves, but in this case we are writing XML output, so the system automatically performs the entity encoding for us.
Conceptually, the entire start-marker functions as if it were the
START_ELEMENT
event for a regular element, and the entire
corresponding end-marker functions as if it were the END_ELEMENT
event for the same regular element. This means that we push a start-marker on
its START_ELEMENT
event (ignoring its END_ELEMENT
event, except to remove its Trojan attribute), and pop it on the corresponding
end-marker’s END_ELEMENT
event (ignoring its
START_ELEMENT
event). In tabular
form:
Table II
Element type |
START_ELEMENT
|
END_ELEMENT
|
---|---|---|
Regular | Push | Pop |
Start-marker | Push | Remove @th:sID |
End-marker | Ignore | Pop |
Some things that can go wrong
Each of the methods described above comes with its own complications, and it was in the discussion of these complications that this paper originated. Some of these are challenges that can be overcome, others are limitations in what the method can manage, and others are deal-breakers that show that the method is not ultimately suitable for realistic use cases. In this section we review briefly the complications for each method.
Right-sibling traversal
Two elusive bugs in an initial implementation of the right-sibling traversal algorithm took the better part of a day or two to identify:
-
It is remarkably easy to write
select="following-sibling::node()"
instead ofselect="following-sibling::node()[1]"
, and remarkably easy to overlook the error when scanning the code looking for the reason that the output is an order of magnitude larger than the input instead of approximately the same size. The symptom is that single nodes in the input appear more than once in the output. -
Similarly, it is remarkably easy to write
<xsl:apply-templates select="..."/>
instead of<xsl:apply-templates select="..." mode="raising"/>
. The symptom is the reverse of the preceding: some nodes in the input drop out of the output.
[Fuller examination of the version-control history of the stylesheets will probably reveal further errors, a discussion of which may be illuminating.]
Inside-out recursion
The two principal pitfalls with inside-recursion are double processing and endless recursion.
Double processing
When a start-marker is matched and raised, the nodes that belong inside the
new container are copied into it inside the template that matches the
start-marker. Because the nodes being copied are also candidates for the
application of templates in the current pass through the function, we need to
match them (along with the end-marker) in an empty
<xsl:template>
in order to avoid outputting them twice.
Otherwise they would be copied when the start-marker is matched and then
processed again when templates are applied to them in their own right.
Endless recursion
Recursion requires an exit condition to avoid falling into an endless loop. In
an early version of the code, the recursive function tested for the presence of
Trojan milestone attributes, and if there weren’t any, it concluded that all
raising had been completed and stopped the recursion. This test fails in
situations where there are Trojan elements that cannot be raised without
creating overlap. Avoiding the endless loop in such situations requires a more
complex test, not just for the presence of Trojan attributes, but for the
presence of those that can be raised without creating overlap. In our simplified
sample, instead of testing for exists($input//@th:sID)
, we test for
elements that can be raised without risk of overlap with:
Endless recursion is not an issue in situations where complete raising would not create overlap. This is the case with our original simplified sample, where the XML with Trojan markup was created by flattening original hierarchical XML, and since the original could not have had overlap, it can be reconstructed safely. Endless recursion is also not a problem when we raise only a subset of the markup that is guaranteed not to overlap. For example, if we have tessellated page and paragraph hierarchies over a prose text, where both pages and paragraphs have been flattened, we cannot fully raise all instances both types of elements if doing so would create overlap. But if our markup convention is that pages cannot overlap with pages and paragraphs cannot overlap with paragraphs, we can modify the raising routine to raise only the pages or only the paragraphs. See also Appendix A, where we discuss an alternative approach to raising with data of this type.
Accumulators
The primary problems we encountered in our implementation using accumulators were:
-
Failure to use a fully implemented XSLT 3.0 processor. Initial tests used Saxon HE 9.6.0.5, which does not complain or warn about
version="3.0"
, but which appears not to support all of the 3.0 constructs used. The main symptom was an error message reporting thatXTSE0010: Element xsl:mode must not appear directly within xsl:stylesheet
(which led to a wild goose chase through the 3.0 spec trying to locate constraints on where mode declarations are allowed), followed unobtrusively by the messageXTSE0010: Unknown XSLT element: mode
. Upgrading to Saxon 9.8.0.12 solved this problem. -
Failure to specify
@use-accumulators
.The next problem was the persistent repetition of the error message
Accumulator stack is not applicable to the current document
. Several attempts of increasing complexity (and, to be honest, decreasing plausibility) to make it applicable failed, until eventually it became clear that the only thing needed was to specifyuse-accumulators="stack"
on the mode declaration for the default unnamed mode. (The mode declaration was already present; had it not been, it would have needed to be introduced.)
Regex
The principal challenge to writing the regex (aside from the risk of inadvertently
creating XML that is not well-formed because of overlap, which is discussed above)
is anticipating variation in the markup. For example, an XML start-tag with a single
attribute looks like <gi attname="value">
, but it allows optional
whitespace around the equal sign and before the closing >
delimiter—but not between the opening <
delimiter and the generic
identifier, and not between the /
and the >
at the end
of a self-closing empty tag. Where it allows whitespace, it allows any amount of any
combination of whitespace characters.
Within the context of a single project, the easiest way to deal with the allowable variation is not to allow it, that is, to enforce rigorous consistency even where XML syntax does not require it. But because rigorous consistency is difficult to achieve without computational validation, we opted for a more robust regex—that is, one more accepting of variation, even though allowing for variation made the regex harder to read and develop.
The regex expressions needed to raise Frankenstein are slightly more complex because they may or may not coexist with other attributes on a given element node.
Pull parsing in Python
The Python pull parser reads the input XML as XML and responds to parse events. We
handle START_ELEMENT
, END_ELEMENT
, and
CHARACTERS
, and ignore other events, and we use XML-aware methods
to access attributes. With respect to managing the input, then, this is a reasonably
robust strategy. The quality of the output handling depends on whether we create
output as a string or as an XML DOM.
String output
Creating XML output as a string is a brittle strategy. Not only is it
susceptible to writing overlapping tags for the same reason as the regex method
described earlier, but we also explicitly wrap attribute values in double-quote
characters ("
), which will produce results that are not well-formed
if the attribute value happens to have contained the double-quote character
originally. Python has an escape mechanism that is capable of dealing with
awkwardly nested single- and double-quote characters in strings, as does XML,
but the Python escape strategy is different from the XML one, and the code to
perform the string manipulation needed to mediate between the two is difficult
to read and write. Furthermore, where output is created as a string, we have to
replace reserved characters with character entities explicitly ourselves, while
XML-aware output does that automatically. For those reasons, it is safer to use
XML methods to create the XML output as a (necessarily well-formed) DOM tree,
which can then be serialized, instead of creating the output directly as a
string. The use of pulldom
to parse the input XML also requires
accommodations to the peculiarities of pulldom
namespace handling,
which are discussed immediately below, under Python XML output challenges.
XML output
Generating XML output with Python minidom
is reasonably robust
and straightforward, but potentially confusing for the following reasons:
-
The pull parser responds not to start- and end-tags, but to
START_ELEMENT
andEND_ELEMENT
events. Although markers are single tags (sole-tags), they fire both events, one immediately after the other. Since a start-marker and an end-marker both fire both types of events, dealing with the two different senses of start and end (marker vs event) simultaneously is potentially confusing for the developer. We also need to decide whether to process the marker at theSTART_ELEMENT
orEND_ELEMENT
event, or to divide the processing of a marker over the two events. Since the two events always follow each other immediately, it might seem not to matter, but that assumption led us into error when dealing with removing Trojan attributes, about which see the following issue. -
Attributes in
minidom
are part of a dictionary-like property of the element node. We had decided (somewhat arbitrarily) to process start-markers at theSTART_ELEMENT
event, and we wanted to remove Trojan attributes from those elements when we raised them. Initially we removed them when processing theSTART_ELEMENT
event without noticing that we use them as part of the decision process onEND_ELEMENT
events. Becauseminidom
objects are mutable, once we had removed the attribute at theSTART_ELEMENT
event it was no longer available when we needed to test for it at theEND_ELEMENT
event, which led to errors in our stack management. We fixed the problem by removing the Trojan attributes only at theEND_ELEMENT
event. An alternative strategy, which also worked under testing, was to clone the node at theSTART_ELEMENT
event, remove the Trojan attributes from the clone, and add the clone to the output tree and the stack. Since this approach does not remove anything from the original node, any original Trojan attribute is still available for testing when theEND_ELEMENT
event fires. -
The relationships among namespaces, namespace prefixes, and namespace declarations in
minidom
are alien to the XML view of namespaces. Whileminidom
is namespace-aware at the level of the model, that awareness is not integrated with the serialization supported by thetoxml()
method (and some others). Here is a partial summary of the details:-
Whether
minidom
writes a namespace prefix into a serialization depends only on whether the prefix was specified when the element was created. It has nothing to do with whether the element is really in a namespace. It is possible to create and serialize an element with a namespace prefix that is not really in the namespace. It is also possible to create an element in a namespace without a prefix and when you serialize it, it will emerge without a prefix even if the namespace is not in scope, which means that when the serialization is parsed downstream, the element will be in whatever the default namespace is for its context. -
minidom
will not write namespace declarations (default or prefix-binding) unless you create them explicitly as attributes. This means that you can create an element with a prefix and serialize output where the prefix is not actually declared, which is not well-formed. -
minidom
will let you create and serialize an element with a namespace prefix withcreateElement()
even though the element is not really in a namespace. An element is in a namespace only if it is created withcreateElementNS()
.
Because
pulldom
andminidom
are essentially blind to namespace declarations, this also means that when we remove the declaration of the Trojan namespace while processing the root element (which changes the in-memory tree becausepulldom
andminidom
objects are mutable), we can nonetheless still find Trojan attributes by their lexical shape, that is, by their explicit prefixes. It also means that our code is depending on those lexical forms: we rely on the presence of an explicit declaration of the Trojan namespace prefix only on the root element, and we rely on the use of the stringth:
as the namespace prefix. This means that XML that expresses the same namespace information in different ways (e.g., by binding a different prefix to the namespace URI, or by using namespace declarations on individual elements, instead of a prefix) will not be processed correctly.These
pulldom
andminidom
behaviors are far from XML-idiomatic, but except for the syntactic dependencies they impose (where namespaces are declared, how the declarations are spelled, how they are bound to prefixes), they prove not to be much of a practical obstacle during processing. The required accommodation entails writing namespace prefixes and namespace declarations (using attribute-like syntax) explicitly where we want them in the output, and matching namespaced elements and attributes with the prefix where it is used in the serialized input and without the prefix where it is not. -
Comparison
The inside-out recursive approach (both the function-based XSLT 3.0 version and named-template-based XSLT 1.0 one) is tail-recursive, which means that an XSLT processor that performs tail-call optimization will not be at risk for running out of stack space. In cases where tail-call optimization is not available, the maximum depth of recursion is equal to the depth of the deepest marker in the input (pseudo-)hierarchy. The right-sibling traversal approach (both the XSLT 3.0 and the XSLT 1.0 versions) is also tail-recursive, and with an XSLT parser that does not perform tail-call optimization, it requires stack space equal to the maximum width of the widest hierarchical level.
Insofar as an open-source XSLT 3.0 processor that performs tail-call optimization is freely available in the open-source, platform-independent Saxon-HE product [Saxon-HE], the difference in stack requirements between the two methods has not been a consideration for our purposes. But insofar as XML documents of the sort that are of interest to digital humanists are typically wider than they are deep, users who are unable to employ an XSLT processor that performs tail-call optimization may favor inside-out processing over right-sibling traversal because inside-out processing is likely to require less stack space.
Developing the XSLT to reconstruct elements from the inside out may be more efficient than to do so from the outside in. Attempts to write code that does not control the order of processing must always account for the potential presence of intervening nodes to be reconstructed on the inside, that risk becoming duplicate (or triplicate, or quadruplicate, etc) nodes in the output depending on the depth of the hierarchy to be reconstructed.
[Comparison of other methods to be added]
Conclusion
None of the methods described here is new, but their explicit juxtaposition, comparison, and evaluation in a tutorial context based on real use cases has clarified much about micropipelining for the author, and, it is hoped, for the reader, as well.
Appendix A. Raising tessellated hierarchies
Raising tessellated hierarchies is a common requirement for hierarchy inversion.[19] Consider a simple printed document that can be understood as containing a
sequence of pages or a sequence of paragraphs, where each sequence fully covers the
content without self-overlap, but where the two different element types overlap in
a way
that makes it impossible to use container elements for both. One typical XML workflow
with such structures involves encoding one hierarchy with XML container elements and
the
other with milestones. For example, a novel transcribed from an original source and
encoded in TEI
might be structured with <div>
elements for chapters, which contain
<head>
elements for chapter titles and <p>
elements for paragraphs, with page beginnings encoded as empty <pb/>
elements. If it later becomes necessary to operate on the pages as units, we can
invert the markup by transforming the
page-beginning milestones into page container elements, while flattening the elements
that demarcate chapters, titles, and paragraphs.
The method described in this appendix differs from those described above because it
works only where the hierarchy being raised is tessellated. That outcome requirement
invites the use of <xsl:for-each-group>
, which partitions its entire
domain exhaustively into groups. The same property that makes
<xsl:for-each-group>
suitable for a situation where an entire
text may be partitioned into tessellated paragraphs or tessellated pages makes it
a poor
choice for the contexts described in the body of this report, where new container
elements must be created only around small islands of content.
References
[CollateX] CollateX: software for collating textual sources. https://collatex.net/-
[DeRose 2004] DeRose, Steve. 2004. Markup
Overlap: a review and a horse.
Presented at Extreme Markup Languages 2004. Montréal,
Québec, August 2-6, 2004.
http://xml.coverpages.org/DeRoseEML2004.pdf
[Sperberg-McQueen 2018]
Sperberg-McQueen,
C. M.
Representing concurrent document structures using Trojan Horse markup
.
To be
presented at Balisage: The Markup Conference 2018,
Washington, DC.
On the Web in the preliminary proceedings.
[TEI P5] P5: Guidelines for electronic text encoding and interchange. http://www.tei-c.org/guidelines/P5/
[Variorum Frankenstein] Frankenstein Variorum Project. https://github.com/PghFrankenstein/Pittsburgh_Frankenstein
[Saxon-HE] Saxon-HE (home edition). http://saxon.sourceforge.net/
[XPath functions] XPath and XQuery functions and operators 3.1 W3C recommendation 21 March 2017. https://www.w3.org/TR/xpath-functions-31/
[1] Community interest in flat-to-hierarchy and hierarchy-to-flat transformations is confirmed by their inclusion in Abel Braaksma’s 2014 proposal, on the XSLT mailing list, that they be included among the top 10 most used XSLT design patterns. See https://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/201404/msg00028.html.
[2] Ideally, all the outermost elements in the input sequence would be recognized and processed in a single call to the function, but we have not found a way to achieve this.
[3] This particular kind of “raising” from text strings is slightly tangential to the methods we concentrate on this paper, but we include some discussion of it because it is part of the practical use-case prompting our work on this paper, and is likely to be a stage in other projects that require elements to be flattened and raised. The complete XSLT stylesheet for this process is in the Frankenstein Variorum GitHub repository, and the stylesheet itself is part three series of transformations in our raising process that may be found in the parent directory, which also holds its input and output directories.
[4] Using different formats for the markup of different flattened elements
makes it easy to raise one group of flattened element types while leaving
another in its flattened stage. Alternatively, we could use the same
attributes for all flattened elements and specify which ones we wanted to
raise at the processing stage. Sperberg-McQueen 2018 proposes a
th:doc
attribute for identifying a start- or end-marker as
belonging to a particular hierarchy.
[5] All files discussed in this report, as well as the report itself, are available in https://github.com/djbpitt/raising.
[6] If the flat content is contained not in the outermost element but in some other container
element, all nodes outside the flattened region can be copied to the
output without change. In this case the template for the container
element will be responsible for shifting from the mode for identity
transformation to the mode for element raising, and will have a more
selective match
pattern.
[7] And even those who don’t admit to finding it difficult may find it easy to slip up in the details when implementing a right-sibling traversal; see below (section “Right-sibling traversal”).
[8] Attributes in the th:
namespace are removed
from markers when they are converted to tags during
raising.
[9] In an experimental implementation of inside-out and outside-in using tumbling windows in XQuery, on the other hand, the two approaches differ dramatically in run time, and the costs do not appear to be linear. We do not currently have good data on this or any explanation.
[10] In streaming processing, references to the after value can only be located
after any <xsl:apply-templates>
or other expression that
requires that descendants be processed.
[11] Despite Saxon’s apparent belief to the contrary, the declarative statement that the accumulator is streamable does not constitute a request that the XSLT processor perform streaming processing of the input. It is merely a claim that streaming processing is possible.
[12] In terms of the Chomsky hierarchy, XML is a Type 2 (context-free) grammar and regular expressions are a Type 3 (regular) grammar. Parsing a context-free grammar, which permits recursion, requires a stack, which is not available in regular grammars. While some modern regular expression implementations support back-referencing and recursion and thus go beyond the limitations of a Chomsky regular grammar, whether methods that rely on these extensions should be considered regular-expression parsing is unclear (this is a theoretical issue), as is the role of legibility in evaluating the suitability of the method to the task (this is a practical concern).
[13] “The returned identifier must consist of ASCII alphanumeric characters and must start with an alphabetic character. Thus, the string is syntactically an XML name.” [XPath functions, §14.5.4]
[14] Forward slash has to be escaped in some regex implementations, but not in others. Escaping where it is not required does no harm.
[15] Similarly, we rely on the use of th:
as the namespace prefix
and th:sID
and th:eID
as the attribute names for
our Trojan attributes. Changing these assumptions is not a problem as long
as the regex is changed to match.
[16] The authors are grateful to Ronald Haentjens Dekker for bringing this method to our attention. The example at his https://github.com/rhdekker/python_xml_pull_parser_example GitHub repository is a partial model for the examples here, and we appreciate his assistance in debugging the Trojan attribute pitfall described below under Python XML output challenges.
[17] We buffer output by writing it to a list, which we stringify and write all at once at the end, saving the overhead of repeated write operations. With large documents and low memory, though, the method can easily be modified to stream the output to the filesystem.
[18] It is legal to declare namespaces redundantly even
where their declarations are already in force because
they are inherited from a higher level in the tree. If
accommodating a redundant declaration of the
th:
namespace is required, the code
could be modified to do that by revising the
if/elif/elif
logic.
[19] Tessellated structures cover an entire domain with the same pattern without gaps or self-overlap, much as mosaic tiles might cover an entire floor.