How to cite this paper
Kay, Michael. “XSLT Extensions for JSON Processing.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Kay01.
Balisage: The Markup Conference 2022
August 1 - 5, 2022
Balisage Paper: XSLT Extensions for JSON Processing
Michael Kay
Founder and Director
Saxonica
Michael Kay is the lead developer of the Saxon XSLT and XQuery processor,
and was the editor of the XSLT 2.0 and 3.0 specifications. His company,
Saxonica, was founded in 2004 and continues the development of
products implementing the specifications in the XML family.
He is based in Reading, UK.
Copyright © 2022 Saxonica Ltd.
Abstract
XSLT 3.0 contains basic facilities for transforming JSON as well as XML.
But looking at actual use cases, it's clear that some things are a lot harder than
they need to be. How could we extend XSLT to make JSON transformations as easy as
XML transformations, using the same rule-based tree-walking paradigm? Some of these
extensions are already implemented in current Saxon releases, so we are starting to
get user feedback.
Table of Contents
- Introduction
- Constructing maps and arrays
  - Constructing maps
  - Constructing arrays
  - Array and Map Construction: Summary
- Map and Array Types
- Template-based Transformation
  - Type-based pattern matching
  - Context for match patterns
  - Built-in template rules
- Deep Tree Operations
- Use Cases
  - First use case: bulk update
  - Second Use Case: Hierarchic Inversion
- Conclusions
Introduction
JSON support in XSLT 3.0 (together with XPath 3.1 and related specifications) is actually
at two levels:
Firstly, there is some functionality that explicitly recognises JSON as a lexical
format: functions like parse-json() and json-doc(); the conversion functions
xml-to-json() and json-to-xml(); and the JSON serialization method.
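A quick illustration of this first level: parse-json() delivers maps and arrays directly,
while json-to-xml() delivers an XML representation using elements in the
http://www.w3.org/2005/xpath-functions namespace:

parse-json('{"name": "ice sculpture", "tags": ["cold", "ice"]}')
(: returns map{"name": "ice sculpture", "tags": ["cold", "ice"]} :)

json-to-xml('{"price": 12.5}')
(: returns a document node wrapping
   <map xmlns="http://www.w3.org/2005/xpath-functions">
      <number key="price">12.5</number>
   </map> :)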
Secondly, the data model has been extended with primitives — maps and arrays —
to allow faithful representation of JSON structures. These primitives are supported
by a range of language features providing capabilities for manipulating maps and arrays.
Although JSON support was a major motivation for introducing maps and arrays, they are
useful in their own right for a wide range of applications. For this reason, maps and
arrays have more flexibility than is needed only to support JSON: for example, map keys
can be of any atomic data type (not only xs:string), maps and arrays can include (XML)
nodes in their content, and in fact the entries in a map or array can contain arbitrary
XDM values, not only the values that can appear in JSON. (For example, an entry in an
array might be a sequence of three xs:date values.)
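To illustrate that extra generality, the following is a perfectly legal map even though it
could never be serialized directly as JSON (the document URI is invented for the illustration):

map{
  xs:date("2022-08-01") : "conference opens",            (: a non-string key :)
  "programme"           : doc("programme.xml")/*,        (: XML nodes as a value :)
  "dates"               : [ (xs:date("2022-08-01"),
                             xs:date("2022-08-02"),
                             xs:date("2022-08-05")) ]    (: an array with one member,
                                                            a sequence of three xs:date values :)
}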
In considering where XSLT 3.0 has limitations in its JSON support, we need to retain
the same dual perspective: we need both to think about the specifics of transforming
data that starts and/or ends as serialized JSON, and also about the more general problem
of transforming maps and arrays as defined in the data model.
There are two main areas I want to consider in this paper: firstly, the task of
constructing maps and arrays, and secondly, the processing of maps and arrays using
XSLT's
classic rule-based recursive descent design pattern involving template rules and match
patterns.
I'll also look at the requirements for "deep" processing of tree structures representing
JSON
data, for example deep searching and deep update.
I'm going to restrict my ambitions with a self-imposed constraint: I want to make no
changes to the data model. The reason for that is partly because I feel the data model
is rich enough already, and partly because experience tells us that any changes to the
data model inevitably create a requirement for a swathe of new supporting functionality:
we want to round off the specifications by supplying more complete functionality for the
existing data structures, not by opening the floodgates to new data structures.
This doesn't mean that I think the data model is perfect. On the contrary, it has
many
imperfections. The set of 19 primitive data types (derived from the XML Schema specifications,
which claim that they were chosen "judiciously") is clearly a rag-bag that could only
have
come from a large committee making piecemeal decisions. The two linear data structures,
namely arrays and sequences, have overlapping and non-orthogonal capabilities. The
fact that
(XML) nodes have identity and ancestry, while the pseudo-nodes in a JSON-derived tree
don't,
is clearly inconsistent and non-orthogonal. Both decisions (to give nodes identity,
and to
allow navigation upwards as well as downwards) have significant advantages as well
as significant
disadvantages; but supporting these features for one kind of node and not the other
makes the
language curiously asymmetrical. However, fixing these problems without major incompatibilities
is too large an undertaking for the author to contemplate. We have to live with our
past, even
when we regret it.
Constructing maps and arrays
XSLT is a two-language system (XPath nested as a sub-language within XSLT) and we
always
need to think about whether functionality belongs properly in XPath, in XSLT, or in
both.
Traditionally for XML nodes, the language constructs for building trees are provided
at the
XSLT level. For construction of maps and arrays, however, there are facilities in
both languages,
and they don't always match exactly.
There's a good reason for wanting to construct the result tree at the level of XSLT
instructions. XSLT can invoke XPath, but not (directly) the other way around, and
the normal
XSLT coding style is for XSLT instructions to follow the structure of the result tree
under
construction, using XPath expressions to pull data from the source tree as required.
That
is why the term "template" is used: a template in the stylesheet is a proforma for
the
structure of the result tree.
XSLT uses XML syntax, and that makes it difficult to follow this paradigm when constructing
non-XML output. But we can come close, using instructions such as <xsl:map>
and <xsl:array>
that
mimic the corresponding constructs in JSON, but with a different surface syntax.
However, there's a problem we need to think about. A sequence of instructions in XSLT
is
known as a "sequence constructor" because that's what it does: each instruction returns
one
or more items, and the items returned by the instructions in a sequence constructor
are
concatenated into a single result sequence. So the program structure of XSLT is intrinsically
bound up with the task of creating sequences, and doesn't naturally lend itself to
creating
other aggregates such as maps and arrays.
Let's look at the two cases separately.
Constructing maps
In XSLT 3.0, maps are constructed using the <xsl:map> instruction. What are the units
from which a map is constructed? They are "map entries", also known as key-value pairs,
which can be built using the instruction <xsl:map-entry>. A map entry or key-value pair
is not in fact recognised as a first-class object in the data model; rather, it is
represented simply as a singleton map. And in fact, the <xsl:map> instruction simply
combines multiple maps (which may or may not be singletons) into a single map.
As a reminder, the <xsl:map> instruction might be used like this:
<xsl:map>
<xsl:for-each select="employee">
<xsl:map-entry key="@ssn" select="firstName || ' ' || lastName"/>
</xsl:for-each>
</xsl:map>
The functionality is very close to that of the map:merge() function available at the
XPath level, with one notable exception: there is no control over handling of duplicate
keys. The map:merge() function offers a choice of five policies for handling duplicate
keys (reject, use-first, use-last, use-any, and combine), whereas the <xsl:map>
instruction offers only one: reject.
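For comparison, at the XPath level the policy is selected through the options map of
map:merge(), for example:

map:merge(
  ( map{'status': 'draft'}, map{'status': 'final'} ),
  map{'duplicates': 'use-last'}
)
(: returns map{'status': 'final'} :)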
In fact experience suggests that both these designs are inadequate. In use cases
I have encountered, I have wanted to handle duplicates in many different ways, and
this can easily be done by allowing the policy to be specified using a callback function.
When a duplicate key is encountered, the old value and the new value associated with
the
key can be combined by calling a user-supplied function that accepts both the old
and
new values as parameters. The five existing policies of map:merge()
can then be expressed
using the functions:
- reject: function($old, $new) { error(....) }
- use-first: function($old, $new) { $old }
- use-last: function($old, $new) { $new }
- use-any: function($old, $new) { $new }
- combine: function($old, $new) { $old, $new }
Other policies I have found useful are:
- concatenate: function($old, $new) { $old || ', ' || $new }
- total: function($old, $new) { $old + $new }
- array: function($old, $new) { array:append($old, $new) }
- map: function($old, $new) { map:merge(($old, $new)) }
This callback capability can readily be added to both the map:merge() function and the
<xsl:map> instruction. (Extending XSLT is always easier of course, in consequence of the
use of XML syntax: adding attributes to an existing instruction is always straightforward.)
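A sketch of how the callback might be supplied at the XSLT level; the on-duplicates
attribute name and the order-line structure are invented for the illustration, not
settled syntax:

<!-- combine duplicate entries for the same product by numeric addition -->
<xsl:map on-duplicates="function($old, $new) { $old + $new }">
  <xsl:for-each select="order-line">
    <xsl:map-entry key="string(@product)" select="xs:decimal(@price)"/>
  </xsl:for-each>
</xsl:map>

At the XPath level the same policy might be passed through the options map, for example
map:merge($maps, map{'duplicates': function($old, $new) { $old + $new }}).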
There's a bit of a problem with this model, however. If <xsl:map-entry>
creates a
key-value pair, then to transform maps into other maps it would also be useful to
have further functionality for operating on key-value pairs; and it would be useful
to have this in XPath. Specifically:
- Decomposing a map into a sequence of key-value pairs. We can do
  map:for-each($MAP, map:entry#2) but it's a little obscure.
  A function map:entries($MAP) would be cleaner.
- Extracting the key and value from a key-value pair. With a key-value pair
  represented as a singleton map, this is particularly clumsy. It would be nice
  to allow $KVP?key and $KVP?value, but that only works if a key-value pair is
  represented as a map with two entries named key and value rather than as
  a singleton map{key:value}. It would be nice to use that representation of a
  map entry, but it would break compatibility. Instead I propose a pair of functions
  map:single-key($MAP) and map:single-value($MAP) which only work on singleton maps.
- Constructing a singleton map. We can always write map{$key:$value}. But sometimes
  a function is more convenient, and map:entry($key, $value) works well.
By providing complementary facilities for composing and decomposing maps, operations
that transform maps immediately become simpler. For example filtering of maps can
be
done using
<xsl:map>
<xsl:for-each select="map:entries($input-map)">
<xsl:if test="map:single-key(.) => starts-with('2022-')">
<xsl:sequence select="."/>
</xsl:if>
</xsl:for-each>
</xsl:map>
Constructing arrays
There's no xsl:array
instruction in XSLT 3.0, which is a major gap. Anyone trying
to produce JSON output from a stylesheet will have come across this. The reason for
this is the unfortunate timing of the specs: although XPath 3.1 was published before
XSLT 3.0 (March 2017 versus June 2017), the XSL Working Group was reluctant to make
the language dependent on anything that wasn't in XPath 3.0 — which means that support
for arrays in XSLT 3.0 is rather limited.
There are two constructs in XPath for constructing arrays, and neither is completely
general:
- The "square array constructor", for example [1, (), 5 to 10], creates an
  array whose members can be arbitrary values, but the number of members in
  the array must be statically known: this example creates an array with three
  members, being the values (1), (), and (5, 6, 7, 8, 9, 10) respectively.
- The "curly array constructor", for example array{1, (), 5 to 10}, creates
  an array whose members are always singleton items: this example creates an
  array with seven members (1, 5, 6, 7, 8, 9, 10).
The only way to construct arrays where both the size of the array and the size of
each member vary dynamically is to use functions such as array:join() or array:fold-left(),
which can be rather clumsy.
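For example, the array built by the <saxon:array> fragment shown below has to be written
today as something like the following, which works but sits awkwardly inside a stylesheet
whose surrounding structure is built with instructions:

array:join(for $n in 1 to 5 return [0 to $n])
(: [(0, 1), (0, 1, 2), (0, 1, 2, 3), (0, 1, 2, 3, 4), (0, 1, 2, 3, 4, 5)] :)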
Saxon has attempted to fill the XSLT gap with an extension instruction, <saxon:array>.
Like <xsl:map>, which takes a sequence of "map entries" as its operand, <saxon:array>
takes a sequence of "array entries". Array entries can be constructed using an instruction
<saxon:array-member>, which is nicely symmetric with <xsl:map-entry>. For example the
following code fragment:
<saxon:array>
<xsl:for-each select="1 to 5">
<saxon:array-member select="0 to ."/>
</xsl:for-each>
</saxon:array>
constructs the array [(0, 1), (0, 1, 2), (0, 1, 2, 3), (0, 1, 2, 3, 4), (0, 1, 2, 3, 4, 5)]
.
But what exactly does the <saxon:array-member>
instruction return?
In the current Saxon implementation, <saxon:array-member> returns an "extension object" -
a Saxon extension to the XDM data model, as sanctioned by provisions in the XSLT
specification (see §24.1.3). This isn't a very satisfactory solution, but it works.
The underlying requirement is to deliver a sequence wrapped up as an item (it has
to be
an item, because as we've seen, it's intrinsic to the XSLT language that instructions
return items, and the results of separate instructions are concatenated to form a
single sequence). More recently [see https://dev.saxonica.com/blog/mike/2021/06/arrays.html]
I've coined the term "parcel" to describe this idea, which has applications that go
beyond
array construction. There are a number of ways parcels could be implemented:
- As an array
- As an entry in a singleton map, with an arbitrary key of say "value"
- As a zero-arity function, which delivers the wrapped value when called with no arguments
- As an opaque object, encapsulated in some way to hide its implementation
The choice is in many ways arbitrary. In this paper I have adopted the second option;
but experience
of writing transformations this way suggests that the fourth option gives better type
safety and
therefore easier debugging.
However we choose to represent a parcel, there are a number of simple operations that
can be defined:
- fn:parcel($sequence) wraps a sequence as a parcel
- fn:unparcel($parcel) unwraps a parcel to return the original sequence
- array:of($parcels) creates an array from a sequence of parcels
- array:parcels($array) returns the members of an array as a sequence of parcels
We'll see that these operations are useful not only when constructing arrays, but also
when processing them; one benefit is that all the XSLT and XPath machinery for processing
sequences (and in particular, <xsl:apply-templates>) immediately becomes available for
processing arrays.
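A small illustration of how these proposed functions fit together (none of them exist
in standard XPath 3.1):

let $a := array:of( (fn:parcel((0, 1)), fn:parcel(()), fn:parcel(5 to 7)) )
    (: $a is the array [(0, 1), (), (5, 6, 7)] :)
return array:parcels($a) ! fn:unparcel(.)
    (: delivers the flattened sequence 0, 1, 5, 6, 7 :)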
Array and Map Construction: Summary
At the XPath level, we're proposing four functions/operations for maps, and a
corresponding set of four operations for arrays:
- Construct an entry
- Construct a map/array from a sequence of entries
- Decompose a map/array into a sequence of entries
- Extract the contents of an entry
In both cases the first two operations have exact analogues in XSLT instructions,
but the second pair are XPath only, invokable from XSLT.
At one point I was recommending specific XSLT-level instructions for decomposing a map or
array, for example <xsl:for-each-entry> or <xsl:for-each-member>.
In fact, with the right decomposition functions available in XPath, such instructions aren't
needed: we can instead use <xsl:for-each select="map:entries($M)"> or
<xsl:for-each select="array:parcels($A)">. This immediately makes other decomposing operations
in XSLT available for maps and arrays, for example <xsl:iterate> and <xsl:for-each-group>.
Because composition and decomposition operations use the same primitives,
transformation of maps and arrays immediately becomes a great deal simpler:
the general pattern is to decompose; then use all the machinery for operating on
sequences to effect a transformation; then recompose.
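For example, a 10% price increase applied to every entry of a map might be written like
this (map:entries(), map:single-key(), and map:single-value() being the proposed functions,
and $prices an assumed variable):

<xsl:map>
  <xsl:for-each select="map:entries($prices)">
    <xsl:map-entry key="map:single-key(.)"
                   select="map:single-value(.) * 1.1"/>
  </xsl:for-each>
</xsl:map>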
Map and Array Types
XPath 3.1 defines some basic type syntax for constraining maps and arrays.
- For arrays, the type of the array members can be constrained, for example
  array(xs:integer) defines an array whose members are single integers.
- For maps, the types of both the keys and the values can be constrained, for example
  map(xs:anyURI, document-node()) defines a map whose keys are URIs and whose values are
  (XML) document nodes.
These types, especially the map types, are not very descriptive. Many (perhaps most)
maps used in practice will have a variety of value types depending on the key:
we want to be able to say, for example, that the map contains a "name" entry of
type xs:string, a "date-of-birth" entry of type xs:date, and a "phone-numbers" entry
whose value is a sequence whose members are maps themselves containing several defined
entries.
I have therefore proposed adding syntax for "record types".
An example of a record type would be
record(name: xs:string,
date-of-birth: xs:date,
phone-numbers: record(type: xs:string, number: my:phone-number-type)*
)
Record types do not introduce any new kind of value (they don't extend the data model);
the instances of a record type are maps, and record types serve only to provide a
more
precise description of the content of a map. Some features of record types include:
- The type of a field can be any sequence type, for example given-names: xs:string+
  indicates a value comprising one or more strings.
- Fields can be optional, for example middle-name?: xs:string* indicates
  that the field may be absent, but must be of type xs:string* if present.
- The name of a field can be any string. If it is not an NCName, it must be
  written in quotes: for example "date of birth": xs:date.
- A record type may be defined to be extensible, by adding ", *" to the
  list of fields. If a record type is defined to be extensible, it may contain
  other fields beyond those listed (otherwise, additional fields are not allowed).
- Record types may be recursive: the pseudo-type ".." is used to refer to the
  containing type. So a linked list may have the type record(value: item()*, next?: ..).
- The type of fields may be omitted if there are no constraints:
  record(latitude, longitude). (This is often useful in pattern matching: just
  because the longitude and latitude are always of type xs:double doesn't mean
  we need to make this an explicit constraint.)
In practice in a stylesheet that makes heavy use of maps, the same record types
will be used over and over again to describe the types of variables, parameters,
and function results. We therefore introduce the ability to name types:
<xsl:item-type name="person"
as="record(first: xs:string, middle?: xs:string*, last: xs:string, *)"/>
and a variable can then be declared as
<xsl:variable name="employees" as="person*"/>
Type names declared in an XSLT package may be public or private, but they cannot be
overridden.
The rules for subsumption of record types are complex to express in detail,
but they reduce to a simple principle: S is a subtype of T if every map that
matches S necessarily also matches T.
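For example, under this principle (field names invented for the illustration):

record(first: xs:string, last: xs:string)   (: S :)
record(first: xs:string, *)                 (: T :)
(: Every map matching S has a "first" entry of type xs:string, which is all that the
   extensible type T demands, so S is a subtype of T. The converse fails: a map
   matching T need not have a "last" entry at all. :)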
Template-based Transformation
One might argue that the classic XSLT design pattern in which trees are
transformed using a recursive-descent application of template rules is particularly
tailored to document processing (where the data structure is often highly polymorphic)
and is not so essential for processing the more rigidly structured data that is often
found in JSON files. However, experience shows that polymorphic structures also
arise commonly in JSON, and the rule-based processing model can be equally valuable.
One common reason is that file formats evolve over time, and there's a need to handle
different versions. For example, version 1 of a JSON format might allow someone to
have a single phone number expressed as a number:
"phone": 4518265
The designers then realise that using a number was a mistake, so version 2 allows
strings:
"phone": "+44 753 110 8561"
Then they realise that people can have more than one phone number, so version 3 allows:
"phone": ["+44 753 110 8561", "+44 118 943 2844"]
and finally in version 4 they recognize a need to distinguish these phone numbers
by role:
"phone": {"mobile": "+44 753 110 8561", "home": "+44 118 943 2844"}
A stylesheet that can handle all these variations can benefit from using template
rules to
handle the polymorphism, and to allow new rules to be added to accommodate new
variations as they are introduced. This is especially true because JSON is often
used (by comparison with XML) in environments where change control is relatively informal.
XSLT 3.0 does have some basic ability for template rules to process values other
than XML nodes, but the facilities are very limited. Among the limitations:
- The syntax for matching items by type is clumsy: match=".[. instance of xs:string]"
- Matching different kinds of maps is particularly awkward: we need record types
- Matching values by their context rather than by their content is impossible,
  because values other than XML nodes are parentless. Writing a rule for processing
  addresses that handles billing address and shipping address differently is therefore
  challenging.
- The entries that we get when we decompose a map or array into its parts are not
  themselves items, and cannot therefore be readily matched.
- The built-in template rules (for example, on-no-match="shallow-copy") were not
  designed with structures other than XML node trees in mind.
I'm proposing a set of language extensions to address these concerns, described in
the following sections.
Type-based pattern matching
We introduce pattern syntax to match items by their type.
- For atomic types, we use the syntax match="atomic(xs:integer)".
- For map and array types, we use the item type syntax directly: match="array(xs:integer)",
  match="map(xs:string, document-node())", match="record(first, middle, last, *)".
- For named item types (defined using <xsl:item-type>): match="type(person)".

In all cases this can be followed by predicates, for example
match="array(xs:integer)[array:length(.)=2]".
However we choose to represent parcels, it's useful to have a construct match="parcel(type)"
to match them, because we will be using <xsl:apply-templates select="array:parcels($array)"/>
to decompose an array, and each parcel then needs to be matched by a template rule.
(Here type is a sequence type that has to match the content of the parcel.)
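For example, a pair of rules that rebuilds an array of string sequences with each string
upper-cased might look like this (a sketch using the proposed parcel functions together
with the <saxon:array> instructions described earlier):

<xsl:template match="array(xs:string*)">
  <saxon:array>
    <xsl:apply-templates select="array:parcels(.)"/>
  </saxon:array>
</xsl:template>

<xsl:template match="parcel(xs:string*)">
  <!-- unwrap the member, process its items, and rewrap the result as one array member -->
  <saxon:array-member select="fn:unparcel(.) ! upper-case(.)"/>
</xsl:template>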
Context for match patterns
To compensate for the absence of a parent axis to match items by context,
we allow tunnel parameters to be referenced in match patterns. Specifically,
a tunnel parameter declared in an <xsl:param> element within the body of a template
rule may be referenced within the match pattern.
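For example (the field names and the role parameter are invented for the illustration),
a rule specific to billing addresses might be written:

<!-- Assumes the caller supplies the tunnel parameter, e.g.
     <xsl:apply-templates select="?billing-address">
       <xsl:with-param name="role" select="'billing'" tunnel="yes"/>
     </xsl:apply-templates> -->
<xsl:template match="record(street, city, postcode)[$role = 'billing']">
  <xsl:param name="role" as="xs:string" tunnel="yes"/>
  <!-- processing specific to billing addresses goes here -->
</xsl:template>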
In addition, a match pattern in a template rule can use the new function via() to refer
to items on the route by which the item was selected: specifically, via() returns a
sequence of items representing the stack of apply-templates calls by which the template
rule was reached, including both explicit template rules and implicit built-in rules.
This is in top-down order, so fn:via()[1] represents the context item of the immediate caller.
For example
<xsl:template match="record(address)[via()[1] instance of type(person)]">
matches an address only if the invoking <xsl:apply-templates> instruction had a
context item of type type(person).
Built-in template rules
The current built-in template rules do not handle maps and arrays particularly well.
In particular, if the processing mode has <xsl:mode on-no-match="shallow-copy"/>,
and the target item is a map or array, then it is deep-copied. (This is because shallow-copy
is defined in terms of the <xsl:copy> instruction, which performs a deep copy if applied
to a map or array.)
To tackle this we'll define a new option, provisionally written
on-no-match="shallow-copy-all", which differs from the existing shallow-copy in the way
maps and arrays are handled. For arrays, it does
<xsl:apply-templates select="array:parcels(.)" mode="#current"/>
that is, it wraps each of the array members into a parcel and processes each one separately.
For maps, it does:
<xsl:apply-templates select="map:entries(.)" mode="#current"/>
that is, it decomposes the map into a sequence of entries (represented as singleton maps)
and processes each one independently.
Because the action of the default template rule here is to construct a new array or
map from the
values returned by the selected template rules, there is a requirement that these
template rules return
values of an appropriate type for inclusion in an array or map. In particular, when
constructing an array
the template rules must return parcels, and when constructing a map they must return
maps (which are merged).
Similarly, we define on-no-match="shallow-skip-all", which is similar to shallow-skip,
but for arrays and maps it applies templates to the components of the array or map rather
than omitting them entirely.
Here is a simple example that transforms the input map
{'red':12, 'green':13, 'yellow':14, 'blue':15} to a new map
{"green":13, "orange":20, "lilac":21, "blue":25}:
<xsl:mode on-no-match="shallow-copy-all"/>
<xsl:template name="xsl:initial-template">
<xsl:apply-templates select="map{'red':12, 'green':13, 'yellow':14, 'blue':15}"/>
</xsl:template>
<xsl:template match="record(red)"/>
<xsl:template match="record(yellow)">
<xsl:map-entry key="'orange'" select="20"/>
<xsl:map-entry key="'lilac'" select="21"/>
</xsl:template>
<xsl:template match="record(blue)">
<xsl:map-entry key="'blue'" select="25"/>
</xsl:template>
The initial call on <xsl:apply-templates> selects a map which is not matched by any
template rule, so the built-in template kicks in. This constructs a new map and builds
its content by applying templates to each entry (key-value pair) in the input map.
The first entry (red) is matched by a rule that returns empty content, so this key is
effectively dropped. The second entry (green) is not matched, so it is copied unchanged
to the result. The third (yellow) is matched, and results in the addition of two new
entries to the result. The fourth (blue) results in an entry being added to the result
with the same key, but a different value.
Deep Tree Operations
The discussion in the previous section treats maps and arrays as one-dimensional
collections of entries (or members, respectively). This neglects the fact that in
a
structure derived from JSON, there will be a tree of maps and arrays. And more
generally, there will be a tree containing maps, arrays, sequences, nodes, and atomic
values.
There are very few operations defined in XPath 3.1 for processing such a tree.
The function map:find()
searches for a key value at any depth in a tree of maps
and arrays, but it is very rarely useful in practice, because it yields no
information about where the key was found. Because (unlike XML node trees) a
tree of maps and arrays has no parent or ancestor or following-sibling axis,
it's not possible to determine anything about the context of the value that was found.
A more powerful operation to search a tree would (a) provide a more flexible
way of matching entries than a simple equality on key values, and (b) would
return more information about where the match was found. I propose a function
fn:search($input, $predicate as function(item()) as xs:boolean)
which returns
all the "descendant" items that satisfy the supplied predicate. The search
logic would be similar to map:find, but matching on a general predicate rather
than purely on key values. So for example
search($input,
       function($x) {
         $x[. instance of record(first, middle, last)]?last = "Kay"
       })?first
would return the first names of everyone whose last name is "Kay", at any
depth in the tree.
I've experimented in Saxon with various more ambitious ways of doing deep
processing of maps and arrays.
The extension function saxon:tabulate-maps()
flattens the tree of maps and
arrays into a flat sequence of leaf nodes, each containing projected information
from every level of the tree. The result is a flat sequence, which makes all the
XPath machinery for processing sequences available. The function is inspired by
the way in which Jackson Structured Programming handles boundary clashes when
converting one hierarchical view of data into another (essentially by turning
the tree into a flat sequence of leaf nodes, each retaining full information
about the path by which it was reached in the tree). The complexity, however,
has meant there has been little interest in this function in practice.
Another attempt to tackle this problem is the saxon:pedigree()
extension function.
This essentially creates a copy of a tree (of maps and arrays) in which each pseudo-node
is augmented with information about its parentage. Specifically, every map reached
via
the tree structure will be augmented with an extra entry "container" whose value is
the immediately containing map or array. This map is infinite (because it contains
cycles): a situation which is not explicitly prohibited by the XDM model, but which
was not envisaged by the working group, and which can cause some operations (such
as
serialization) to be non-terminating. Again, the solution feels complex and rather
unwieldy.
A third tree-level operation is the XSLT extension instruction saxon:deep-update.
The design here is somewhat more intuitive, though formalising the specification is
far from easy. The instruction takes three operands:
- root: selects the set of root nodes to be processed
- select: starting at a root, selects the nodes to be updated
- action: a function that is applied to each selected node, and returns a replacement
  for that node
The result of the function is an updated tree with the specified changes. The original
tree is of course unchanged; an efficient implementation will use persistent data
structures
internally to ensure that unchanged nodes do not need to be physically copied. This
is far
easier to achieve with maps and arrays than with XML nodes, because maps and arrays
lack
identity and ancestry.
I am not 100% satisfied with the design of these extensions; they feel rather clumsy.
But they do fulfil a need. At the time of writing I don't have any better suggestions,
but it is certainly an area worth revisiting.
Use Cases
Some years ago I published a paper [Kay 2016] giving two use cases for JSON-based
transformations, and exploring solutions to the problems in XSLT 3.0.
It's worth revisiting these to see how well the solutions can take advantage of the
proposed extensions.
In that paper I explored two alternative ways of tackling the transformation tasks:
firstly by
native transformation of maps and arrays representing the JSON data, and secondly
by converting
the data to XML, transforming it, and then converting back. In both cases, the second
approach proved easier,
strongly suggesting that improved constructs for native transformation of maps and
arrays were needed.
In this section I shall explore how well the features proposed in this paper meet
this requirement.
First use case: bulk update
This JSON example was taken from json-schema.org:
[
{
"id": 2,
"name": "An ice sculpture",
"price": 12.50,
"tags": ["cold", "ice"],
"dimensions": {
"length": 7.0,
"width": 12.0,
"height": 9.5
},
"warehouseLocation": {
"latitude": -78.75,
"longitude": 20.4
}
},
{
"id": 3,
"name": "A blue mouse",
"price": 25.50,
"dimensions": {
"length": 3.1,
"width": 1.0,
"height": 1.0
},
"warehouseLocation": {
"latitude": 54.4,
"longitude": -32.7
}
}
]
The requirement here is: for all products having the tag "ice", increase the price
by 10%, leaving all other data unchanged.
The solution to this in my 2016 paper was very cumbersome; it required a helper
stylesheet with a range of supporting library functions.
With <saxon:deep-update>, this can be solved simply as
<saxon:deep-update
root="json-doc('input.json')"
select="?*[?tags = 'ice']?price"
action=". * 1.1"/>
I've already mentioned that I'm not entirely comfortable with this extension,
partly because formalising the semantics is challenging, and partly because it involves
some messy constraints on the form of the expression in the select attribute
(only downwards selections are allowed). However, it's reasonably usable.
How would it look as a rule-based transformation?
First, we define the default template rule to do a shallow-copy, JSON style:
<xsl:mode on-no-match="shallow-copy-all"/>
The map entries we need to change are handled by the rule:
<xsl:template match="record(price)[via()?tags = 'ice']">
<xsl:map-entry key="'price'" select="?price * 1.1"/>
</xsl:template>
and that's it. Vastly simpler than the pure XSLT 3.0 solution - and no more complex
than the call on <saxon:deep-update>.
The way this works is that data is processed by built-in template rules until we hit
something that matches the explicit rule. The built-in template rule for the top-level
map splits it into separate entries, one per key, and matches each one independently as
a singleton map. The only one of these maps that matches is the one that has the key
price, and is therefore matched by the pattern record(price).
The predicate on this pattern uses the via() function to match the relevant price entries
according to their context. The via() function returns all context items on the stack,
and the predicate matches so long as at least one of them is a map in which the atomized
value of the tags field contains the string "ice". (We're relying here on the way
that atomization works on arrays.) So long as all <xsl:apply-templates> calls are
making simple downward selections, the via() function has much the same effect as using
the ancestor axis when processing an XML node tree.
Second Use Case: Hierarchic Inversion
The second use case from my 2016 paper transforms a JSON document that lists
courses with the students enrolled on each, to create an inverted JSON document
that lists students with the courses they are enrolled on.
Here is the input dataset:
[{
"faculty": "humanities",
"courses": [
{
"course": "English",
"students": [
{
"first": "Mary",
"last": "Smith",
"email": "mary_smith@gmail.com"
},
{
"first": "Ann",
"last": "Jones",
"email": "ann_jones@gmail.com"
}
]
},
{
"course": "History",
"students": [
{
"first": "Ann",
"last": "Jones",
"email": "ann_jones@gmail.com"
},
{
"first": "John",
"last": "Taylor",
"email": "john_taylor@gmail.com"
}
]
}
]
},
{
"faculty": "science",
"courses": [
{
"course": "Physics",
"students": [
{
"first": "Anil",
"last": "Singh",
"email": "anil_singh@gmail.com"
},
{
"first": "Amisha",
"last": "Patel",
"email": "amisha_patel@gmail.com"
}
]
},
{
"course": "Chemistry",
"students": [
{
"first": "John",
"last": "Taylor",
"email": "john_taylor@gmail.com"
},
{
"first": "Anil",
"last": "Singh",
"email": "anil_singh@gmail.com"
}
]
}
]
}]
The goal is to produce a list of students, sorted by last name then
first name, each containing a list of courses taken by that student, like this:
[ {
"email": "ann_jones@gmail.com",
"courses": [
"English",
"History"
]
},
{
"email": "amisha_patel@gmail.com",
"courses": ["Physics"]
},
{
"email": "anil_singh@gmail.com",
"courses": [
"Physics",
"Chemistry"
]
},
{
"email": "mary_smith@gmail.com",
"courses": ["English"]
},
{
"email": "john_taylor@gmail.com",
"courses": [
"History",
"Chemistry"
]
}
]
The approach I used in the 2016 paper (constrained by the inability to access ancestors
in the JSON
tree) was a two pass approach: first flatten the data into a normalised table
(represented as a sequence of maps) containing (course, student) tuples; then
group these tuples by student name.
The basic approach with new language features remains the same, but I think we can
improve a little on
the detail.
First, we can collect together the "flattened" sequence of (course, student)
tuples like this:
<xsl:mode name="flatten" on-no-match="shallow-skip-all"/>
<xsl:variable name="flattened" as="record(course, first, last, email)*">
<xsl:apply-templates select="json-doc('input.xml')" mode="flatten"/>
</xsl:variable>
<xsl:template match="record(first, last, email, *)" mode="flatten">
<xsl:sequence select=". => map:put("course", via()?course)"/>
</xsl:template>
and then we build a new hierarchy over the flat sequence of records:
<xsl:template name="xsl:initial-template">
<xsl:array>
<xsl:for-each-group select="$flattened" group-by="?email">
<xsl:array-member select="map{'email': ?email,
'courses': array{current-group()?course}}"/>
</xsl:for-each-group>
</xsl:array>
</xsl:template>
Again, vastly simpler than the pure XSLT 3.0 solution in the 2016 paper.
Conclusions
This paper describes proposed extensions to XSLT 3.0 to make transformation of maps
and arrays (and therefore JSON)
much easier; in particular, making it convenient to use the familiar processing model
of recursive descent applying
matching template rules.
The main features added to the language are:
- Two new default ("on-no-match") template rules: shallow-copy-all and shallow-skip-all,
  designed to offer the same capability for maps and arrays as the current options offer
  for XML node trees.
- A new function via() giving access to the context items that were visited en route to
  the current template rule, designed to enable context-sensitive matching of maps and
  arrays in a way that compensates for the absence of an ancestor axis.
- Symmetric and orthogonal functions for decomposing maps and arrays into pieces that can
  be individually matched by template rules, and recombined to form new maps and arrays.
- New type syntax providing simpler matching of maps (record types), and integration of
  type syntax into pattern syntax to allow template rules to match maps and arrays more
  sensitively.
The benefits of these features are illustrated with the help of two use cases,
originally solved using XSLT 3.0 constructs alone.