How to cite this paper

Holstege, Mary. “Preprocessing XQuery using Custom Module URI Resolvers.” Presented at Balisage: The Markup Conference 2025, Washington, DC, August 4 - 8, 2025. In Proceedings of Balisage: The Markup Conference 2025. Balisage Series on Markup Technologies, vol. 30 (2025). https://doi.org/10.4242/BalisageVol30.Holstege01.

Balisage: The Markup Conference 2025
August 4 - 8, 2025

Balisage Paper: Preprocessing XQuery using Custom Module URI Resolvers

Mary Holstege

Mary Holstege spent decades developing software in Silicon Valley, in and around markup technologies and information extraction. She has most recently been pursuing artistic endeavours using the XML stack to construct generative art programs.

Copyright © 2025 Mary Holstege. All code snippets provided under CC-BY (https://creativecommons.org/licenses/by/4.0/).

Abstract

This paper explores using module URI resolvers to modify the source module at resolution time. The motivating use case is to provide type-based templates for function libraries. Other examples are given, followed by an examination of the advantages and pitfalls of this approach.

Table of Contents

Background and Motivation
The Mad Idea: Using a Custom Module URI Resolver
Implementation of modifying module URI resolver
Other Modifying Module URI Resolvers
Other Applications
Templating
Aesthetic modifications
Syntax Augmentation
Implementation Accommodation
Testing
Why is this a Terrible Idea?
Readability and sharing
Security issues
Interference with module caching
Misnesting
Operator precedence
Name clashes
Duplication of work
When to do it anyway
Conclusions and Summary

Background and Motivation

Compound data structures are the basis for the implementation of complex algorithms. In XQuery, the operations to support compound data structures can be corralled into function libraries. For example, vector operations such as vector addition, multiplication, normalization and so on provide the basis for various algorithms, such as graph layout or scene lighting models. It is straightforward enough to implement this library of operations using xs:double as the base type.

Figure 1: Snippet of function library for vector operations

(:~
 : Compute the dot product of two vectors
 :)
declare function this:dot($p1 as xs:double*, $p2 as xs:double*) as xs:double
{
  fn:sum(
    this:map2(
      function ($a as xs:double, $b as xs:double) as xs:double {$a * $b},
      $p1, $p2
    )
  )
};

(:~
 : Compute the cross product of two 3D vectors.
 :)
declare function this:cross($u as xs:double*, $v as xs:double*) as xs:double*
{
  this:py($u)*this:pz($v) - this:pz($u)*this:py($v),
  this:pz($u)*this:px($v) - this:px($u)*this:pz($v),
  this:px($u)*this:py($v) - this:py($u)*this:px($v)
};

(:~
 : Rotate the vector represented by the given angle (radians).
 :)
declare function this:rotate(
  $v as xs:double*,
  $radians as xs:double
) as xs:double*
{
  let $ca := math:cos($radians)
  let $sa := math:sin($radians)
  return (
    this:px($v) * $ca - this:py($v) * $sa,
    this:px($v) * $sa + this:py($v) * $ca
  )
};
      

Some functions in a vector function library operating over vectors of double values, represented as sequences. The actual library has dozens of operations.

The strong typing afforded by declaring the base type (xs:double in this case) in the function signatures allows for rapid error detection.

The problem arises when we want the same set of functionality, but over a different type, xs:integer for example. The quick and dirty approach is to copy all the code, put it into a different namespace, and replace xs:double by xs:integer as required. Some additional casts may also be warranted.[1] For example:

Figure 2: Snippet of function library for vector operations over integers

(:~
 : Compute the dot product of two vectors
 :)
declare function this:dot($p1 as xs:integer*, $p2 as xs:integer*) as xs:integer
{
  fn:sum(
    this:map2(
      function ($a as xs:integer, $b as xs:integer) as xs:integer {$a * $b},
      $p1, $p2
    )
  )
};

(:~
 : Compute the cross product of two 3D vectors.
 :)
declare function this:cross($u as xs:integer*, $v as xs:integer*) as xs:integer*
{
  (this:py($u)*this:pz($v) - this:pz($u)*this:py($v)) cast as xs:integer,
  (this:pz($u)*this:px($v) - this:px($u)*this:pz($v)) cast as xs:integer,
  (this:px($u)*this:py($v) - this:py($u)*this:px($v)) cast as xs:integer
};

(:~
 : Rotate the vector represented by the given angle (radians).
 :)
declare function this:rotate(
  $v as xs:integer*,
  $radians as xs:double
) as xs:integer*
{
  let $ca := math:cos($radians)
  let $sa := math:sin($radians)
  return (
    (this:px($v) * $ca - this:py($v) * $sa) cast as xs:integer,
    (this:px($v) * $sa + this:py($v) * $ca) cast as xs:integer
  )
};
      

The same functions as before, now operating over vectors of integer values.

The new library has all the functionality of the old one with strong typing for vectors of integers. When we want vectors of doubles we import the original library; when we want vectors of integers we import the new one. We'll give the two modules different namespaces. Using related namespaces to make the connection clear is wise.

However, we now have a maintenance headache: every fix to any type variant of this set of functions requires a fix to all the other semi-copies. It can be easy to overlook a necessary change. Can we make this more manageable somehow?

With the proposed XQuery 4.0 we can do a little better, using the item-type declaration. By changing the item type declaration from xs:double to xs:integer the base type for the vectors changes, and no other code need be touched.[2]

Figure 3: Snippet of function library for vector operations, genericized

declare item-type ᐸTᐳ as xs:double;

(:~
 : Compute the dot product of two vectors
 :)
declare function this:dot($p1 as ᐸTᐳ*, $p2 as ᐸTᐳ*) as xs:double
{
  fn:sum(
    this:map2(
      function ($a as ᐸTᐳ, $b as ᐸTᐳ) as ᐸTᐳ {$a * $b},
      $p1, $p2
    )
  )
};

(:~
 : Compute the cross product of two 3D vectors.
 :)
declare function this:cross($u as ᐸTᐳ*, $v as ᐸTᐳ*) as ᐸTᐳ*
{
  (this:py($u)*this:pz($v) - this:pz($u)*this:py($v)) cast as ᐸTᐳ,
  (this:pz($u)*this:px($v) - this:px($u)*this:pz($v)) cast as ᐸTᐳ,
  (this:px($u)*this:py($v) - this:py($u)*this:px($v)) cast as ᐸTᐳ
};

(:~
 : Rotate the vector represented by the given angle (radians).
 :)
declare function this:rotate(
  $v as ᐸTᐳ*,
  $radians as xs:double
) as ᐸTᐳ*
{
  let $ca := math:cos($radians)
  let $sa := math:sin($radians)
  return (
    (this:px($v) * $ca - this:py($v) * $sa) cast as ᐸTᐳ,
    (this:px($v) * $sa + this:py($v) * $ca) cast as ᐸTᐳ
  )
};
      

The same functions as before, now with the type name abstracted into the item-type declaration.

This is better: the body of the libraries is identical from one numerical type to the next. Maintenance can proceed with a direct cut and paste of the body from one copy to the others. It is easy to determine exactly what fix needs to be made and whether we have accomplished it: the only difference between the modules should be in the namespace and in the item type declaration.

There is something deeply unsatisfying about this approach. For data structures that may have a larger variety of types in play, such as a FIFO queue, we end up with a proliferation of substantially identical libraries. Furthermore, we may have additional degrees of variability. Perhaps we have more than one type in play (a key and a value, an input type and a result type). There is a multiplication of possibilities, exacerbating our maintenance headache.

If these were XSLT libraries, we could use XML entities to get the job done. With the proposed XSLT 4.0 we can use the xsl:item-type instruction, and reference the generic type name in the function bodies as above with XQuery. The body of the function library exists in an external entity. Normal XML mechanics come into play when the package is accessed to perform the appropriate expansion and substitution and the amount of duplication is small: most of the code exists in the external entity containing the genericized function definitions proper.

Figure 4: Snippet of XSLT function library for vector operations, using entities to genericize

<!DOCTYPE xsl:package [
  <!ENTITY body SYSTEM "vector-body.xsl">
  <!ENTITY type "xs:integer">
]>
<xsl:package xmlns:this="http://example.com/core/vector/integer" ...>
   ...
   <xsl:item-type name="ᐸTᐳ" as="&type;"/>
   &body;
</xsl:package>
      

Using entities to pull in a genericized set of function definitions and to define the appropriate type for the generic type name.

That's all well and good for XSLT packages, but what about for XQuery function libraries? XQuery does not have any such entity mechanism. We could abandon strong typing. That just shifts the maintenance burden onto the code calling the library. We could turn the XQuery library into a thin non-typed shim that calls XSLT functions via fn:transform() and rely on the XSLT entity support. That comes at a cost of tremendous runtime overhead. Or we could consider an approach to get the same effect via some sort of "compilation step" that creates final forms of the necessary modules from some sort of template. There are several options:

  1. Wrapping the XQuery in XML and using entities as in XSLT. The compilation step reads the XML and writes out the content with entities expanded.

  2. Using special markers for the variable parts (namespace and type). The compilation step applies text manipulation commands such as sed or awk to perform the substitution and write out the result.

  3. Writing custom XSLT or XQuery module generators for each generic library. The compilation step executes these generators with the appropriate parameters for a given type selection and writes out the result.

None of these is terribly complicated to implement. As long as we are careful to genericize the code and minimize the number of changes that need to get applied, and use markers from some interesting part of Unicode that doesn't appear in the code otherwise, even a simple script can do the trick.[3]

Figure 5: Using a script to generate a valid module

Snippet of template:

module namespace this="http://example.com/core/vector#«type»";

...
declare item-type ᐸTᐳ as «type»;

...
(:~
 : Compute the dot product of two vectors
 :)
declare function this:dot($p1 as ᐸTᐳ*, $p2 as ᐸTᐳ*) as xs:double
{
  fn:sum(
    this:map2(
      function ($a as ᐸTᐳ, $b as ᐸTᐳ) as ᐸTᐳ {$a * $b},
      $p1, $p2
    )
  )
};
...
      

Compilation script:

#!/bin/sh
template=$1
type=$2
sed -e "s/«type»(/(/g" -e "s/«type»/${type}/g" ${template}
      

Using a script to turn a template of a genericized set of function definitions into a valid module.

This solves our maintenance problem: there is one template with one set of function definitions to fix. The variant modules are regenerated ("compiled") as appropriate. There is still something unsatisfying about this approach. The charm of XQuery is precisely that it doesn't need to get compiled to be used, that we don't need complicated build systems just to get on with the job. What can we do if we don't want a compilation step? Is there some way we can get the replacement to happen more dynamically? It turns out there is.

The Mad Idea: Using a Custom Module URI Resolver

The basic notion is simple: instead of introducing a compilation step and a build process, apply the substitutions when the module is resolved, using URI parameters to define the appropriate substitutions dynamically. The module URI resolver hands the modified stream to the processor.

For example, suppose we use the same template as we did in Figure 5. We'll supply a custom module URI resolver that takes string/replacement pairs and applies them to the source module, providing the modified text as the source as far as the processor is concerned. In my implementation I use the parameters in the location URI to do the work. Here is a sample main module that imports both the double vector module and the integer vector module:

Figure 6: Using custom module URI resolver to generate variants

import module namespace vd="http://example.com/core/vector#xs:double"
       at "vector.txqy?«type»=xs:double";
import module namespace vi="http://example.com/core/vector#xs:integer"
       at "vector.txqy?«type»=xs:integer";

<test>{
let $v1 := (1.0, 2.4, 3.0)
let $v2 := (1, 2, 3)
return (
  <double>{vd:add($v1, $v2)}</double>,
  <error>{
    try { vi:add($v1, $v2) }
    catch * { "Error correctly generated" }
  }</error>,
  <integer>{vi:add($v1!xs:integer(.), $v2!xs:integer(.))}</integer>
)
}</test>
      

Main module importing the same function library template with different parameters. The custom module URI resolver turns the template into a valid module.

With the custom resolver in place, the correct type constraints are properly applied to the functions. The results are:

<test>
  <double>2 4.4 6</double>
  <error>Error correctly generated</error>
  <integer>2 4 6</integer>
</test>
      

Implementation of modifying module URI resolver

I implemented such a modifying module URI resolver with Saxon. Saxon expects a module URI resolver to provide a resolve() method that takes a module namespace URI, a base URI, and an array of location URIs and returns an array of StreamSource objects. Each StreamSource hands an input stream or a stream reader to the processor for it to parse.

There are a few fine points and choices to consider in the implementation.

  1. Should substitutions apply over the complete text of the module, or is line-by-line substitution sufficient? I chose to apply the substitutions line by line. That creates the limitation that patterns cannot include end of line characters, but that seems like a small sacrifice. I preferred to avoid the extra overhead of buffering the entire module text.

  2. As a consequence of line-by-line processing, I used readLine() to read each line, losing the end of line character. Unless it is restored to the modified module, error messages will reference the wrong line numbers. Note that using end-of-line characters in either the patterns or the replacement strings would also cause misalignments in error messages.

  3. Since the patterns and substitution strings are part of the location URI, certain characters may need to be escaped using percent-escaping. For example, a space in either the pattern or the replacement string must be represented as %20. The resolver will need to decode the strings in order to properly apply the mappings.

  4. Patterns could be treated as regular expressions rather than literal strings. The replacement string could then make use of capture expressions. This affords more power, but at a cost of more opportunities for unfortunate results. See section “Why is this a Terrible Idea?” for some of the possible downsides. Escaping the URIs becomes positively nightmarish.

  5. In what order should substitutions apply? Should it be guaranteed? Inasmuch as most systems do not guarantee that query parameters in URIs will be seen in order, I chose not to treat them as such, and the order of substitution is not specified. It is probably unwise to depend on order anyway. As a consequence, if patterns and replacements are entangled, results may differ depending on the actual order in which they are applied. For example, if we have a=bc and b=ad then "abc" could become "bcbc" and then "adcadc" or "aadc" and then "bcbcdc".

  6. Saxon does not expect to find a query string as part of a file URI. In addition, a bare file path as the name of a main module on the command line is treated as such. Again, a trailing query string is not expected. In both cases, the resolution will fail, in the latter case without the custom resolver even getting a chance. Therefore, if you want main modules to be subject to this kind of manipulation, they must be given using file URIs. The custom resolver needs to handle this case.
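The order sensitivity described in point 5 is easy to reproduce outside the resolver. The following is a minimal sketch, not part of the resolver itself: the class name SubstitutionOrderSketch and its helper methods are hypothetical, and a LinkedHashMap is used only to force a known order for the demonstration (the resolver in this paper uses a HashMap, whose iteration order is unspecified).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demonstration class; not part of the resolver.
final class SubstitutionOrderSketch {

  // Apply literal replacements one after another, in the map's iteration order.
  static String applyInOrder(String text, Map<String, String> mapping) {
    for (Map.Entry<String, String> e : mapping.entrySet()) {
      text = text.replace(e.getKey(), e.getValue());
    }
    return text;
  }

  // Build a two-entry mapping whose iteration order is the argument order.
  static Map<String, String> ordered(String k1, String v1, String k2, String v2) {
    Map<String, String> m = new LinkedHashMap<>();
    m.put(k1, v1);
    m.put(k2, v2);
    return m;
  }

  public static void main(String[] args) {
    // a=bc first, then b=ad: "abc" -> "bcbc" -> "adcadc"
    System.out.println(applyInOrder("abc", ordered("a", "bc", "b", "ad")));
    // b=ad first, then a=bc: "abc" -> "aadc" -> "bcbcdc"
    System.out.println(applyInOrder("abc", ordered("b", "ad", "a", "bc")));
  }
}
```

Swapping the HashMap for a LinkedHashMap in a resolver would make substitutions follow the order of the parameters in the location URI, but only if nothing upstream reorders the query string, which is exactly the guarantee that cannot be relied on.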

I implemented my resolver by subclassing StandardModuleURIResolver. The resolve method looks something like this:

Figure 7: The resolve() method

// Resolve the module URI
public StreamSource[] resolve(
  String moduleURI,
  String baseURI,
  String[] locations
) throws XPathException {
  StreamSource[] result = super.resolve(moduleURI, baseURI, locations);
  for (int i = 0; i < locations.length; i++) {
    int index = locations[i].indexOf('?');
    if (index >= 0) {
      String parameterString = locations[i].substring(index + 1);
      String[] parameters = parameterString.split("&");
      HashMap<String,String> mapping = new HashMap<>();
      boolean hasParameters = false;
      for (int j = 0; j < parameters.length; j++) {
        int pindex = parameters[j].indexOf('=');
        if (pindex >= 0) {
          try {
            String pname = URLDecoder.decode(
              parameters[j].substring(0, pindex), "UTF-8"
            );
            String pvalue = URLDecoder.decode(
              parameters[j].substring(pindex + 1), "UTF-8"
            );
            mapping.put(pname, pvalue);
          } catch (Exception e) {
            // Convert URI exceptions into XPathException
            throw new XPathException(e);
          }
          hasParameters = true;
        }
      }
      // Optimization: just use base stream if no patterns
      if (hasParameters) {
        InputStream istream = result[i].getInputStream();
        if (istream == null) {
          // Try without query string: for file: URIs this helps
          String[] bareLocations = new String[]{locations[i].substring(0, index)};
          StreamSource[] bareResult = super.resolve(moduleURI, baseURI, bareLocations);
          istream = bareResult[0].getInputStream();
        }
        if (istream != null) {
          InputStream stream = new TemplateFilterInputStream(istream, mapping);
          String systemId = result[i].getSystemId();
          String publicId = result[i].getPublicId();
          result[i] = new StreamSource(stream);
          // Carry over public and system IDs
          result[i].setPublicId(publicId);
          result[i].setSystemId(systemId);
        } // null stream; continue
      }
    }
  }
  return result;
}
        

The standard resolver is used when there are no query parameters on the location URI. Otherwise the parameters are parsed to create a mapping of patterns to replacements which are applied by a special stream class. The bulk of the work is done within the new TemplateFilterInputStream class. Here is what that looks like:

Figure 8: InputStream class that applies replacements

public class TemplateFilterInputStream extends InputStream {
  public TemplateFilterInputStream(
    InputStream stream,
    HashMap<String,String> mapping
  ) {
    // Construct a BufferedReader so we can read line by line
    _input = new BufferedReader(new InputStreamReader(stream, _encoding));
    _mapping = mapping;
  }

  // Read and process the next line
  private boolean nextLine() throws IOException {
    String line;
    while ((line = _input.readLine()) != null) {
      // Apply replacements
      line = filterLine(line);
      if (line != null) {
        // Keep line-numbering consistent with source
        line = line + '\n';
        _buffer = new ByteArrayInputStream(line.getBytes(_encoding));
        return true;
      }
    }
    return false;
  }

  // Replace patterns in the line
  private String filterLine(String line) {
    for (Map.Entry<String,String> e : _mapping.entrySet()) {
      line = line.replace(e.getKey(), e.getValue());
    }
    return line;
  }

  // Read next character
  @Override
  public int read() throws IOException {
    // Refill buffer as needed
    if (_buffer == null) { if (!nextLine()) return -1; return read(); }
    int ch = _buffer.read();
    if (ch == -1) { if (!nextLine()) return -1; return read(); }
    return ch;
  }

  // Make sure reader is closed
  @Override
  public void close() throws IOException { _input.close(); }

  private HashMap<String,String> _mapping;
  private BufferedReader _input;
  private ByteArrayInputStream _buffer;
  private Charset _encoding = StandardCharsets.UTF_8;
}
        

We read from a BufferedReader line by line, applying the replacements in the filterLine() method.
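The line-numbering concern can be checked in isolation. The sketch below is a simplified stand-in for the stream class, assuming literal replacements and string input; the names LineFilterSketch, filter, lineOf and TEMPLATE are hypothetical and exist only for this illustration. It applies the mapping line by line, restores the newline that readLine() strips, and confirms that a substituted declaration stays on its original line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for the filtering stream, operating on strings.
final class LineFilterSketch {

  // \u00AB and \u00BB are the « and » marker characters used in the templates.
  static final String TEMPLATE =
      "module namespace this=\"http://example.com/core/vector#\u00ABtype\u00BB\";\n"
    + "\n"
    + "declare variable $this:zero as \u00ABtype\u00BB := 0;\n";

  // Apply every literal replacement to each line, then restore the newline.
  static String filter(String source, Map<String, String> mapping) {
    StringBuilder out = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new StringReader(source))) {
      String line;
      while ((line = in.readLine()) != null) {
        for (Map.Entry<String, String> e : mapping.entrySet()) {
          line = line.replace(e.getKey(), e.getValue());
        }
        out.append(line).append('\n');
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString();
  }

  // 1-based number of the first line containing needle, or -1 if absent.
  static int lineOf(String text, String needle) {
    String[] lines = text.split("\n", -1);
    for (int i = 0; i < lines.length; i++) {
      if (lines[i].contains(needle)) return i + 1;
    }
    return -1;
  }

  public static void main(String[] args) {
    Map<String, String> mapping =
        Collections.singletonMap("\u00ABtype\u00BB", "xs:integer");
    String module = filter(TEMPLATE, mapping);
    System.out.print(module);
    System.out.println("declaration is on line " + lineOf(module, "as xs:integer"));
  }
}
```

One side effect of this approach is that a module whose last line lacks a terminating newline gains one; that does not shift any line numbers.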

Given the new resolver class, making use of it is easy: add the relevant classes to the class path and configure Saxon to use it. Configuration is also easy: either reference the class name in the configuration file as the value of the moduleUriResolver attribute for the xquery element, or set it on the command line with the -mr option.

Similar kinds of APIs are available with some other processors.

Other Modifying Module URI Resolvers

Simple string substitution is easy to implement, but there are other possibilities, with different strengths and weaknesses. I have experimented with implementations of many of these variants.

Non-modifying checker

Instead of modifying the text of the module when it is resolved, scan it and produce warnings or errors if the defined patterns appear. Patterns could be simple strings, as we have seen, or regular expressions, or more advanced parsing. The use case here is linting. Using patterns defined in some kind of global configuration seems most appropriate.

Regular expression replacer

Implementation of this variant is simple: replace the call to the replace() method in filterLine() by a call to replaceAll(). Regular expressions provide more power and allow for more elaborate code rewriting. The regular expressions themselves can be nightmarish to comprehend, especially after they have been escaped for placement in a URI. We also have to be careful (as in all these alternatives) not to run afoul of module caching (see section “Interference with module caching”). A variant where the regular expressions come from a configuration and are universally applied might be a better option.

XQuery generator

The idea here is to treat the target of the location URI as a main module to execute, and use its output as the text of the module. Query parameters on the location URI become external variable bindings. Implementation is only slightly more complicated than the simple string substitution resolver. The downside here is that XQuery is not a great language to use for generating XQuery. We also need to provide some way of distinguishing modules that should be taken verbatim and those that should be executed.

XSLT generator

Instead of using XQuery to generate code, use XSLT. There are two flavours here: the stylesheet is referenced in the location URI or an XML template is referenced in the location URI. In the first case the initial template is fixed or part of the query parameters. In the second case the stylesheet is fixed or part of the query parameters. In either case other query parameters can become stylesheet parameters. Both of these options are also not that complicated to implement.

Full macro preprocessor

Instead of going for half-measures, we could define a full macro preprocessing language and implement it. This is a great deal more complicated to do: we need to be able to parse the text of the module, define semantics around macro parameter binding and so forth. As we will see in section “Why is this a Terrible Idea?”, there are a lot of problems with this approach.
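Of the variants above, the non-modifying checker is the simplest to sketch. The following is one possible shape for its core, under the assumption that the patterns are regular expressions supplied from configuration; the class name ModuleLintSketch, the check() method and the message format are hypothetical. A resolver built on it would pass the module text through unchanged and report the warnings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical core of a linting resolver: scans, never modifies.
final class ModuleLintSketch {

  // Return one warning per (line, pattern) pair that matches.
  static List<String> check(String moduleText, List<Pattern> flagged) {
    List<String> warnings = new ArrayList<>();
    String[] lines = moduleText.split("\n", -1);
    for (int i = 0; i < lines.length; i++) {
      for (Pattern p : flagged) {
        if (p.matcher(lines[i]).find()) {
          warnings.add("line " + (i + 1) + ": matches " + p.pattern());
        }
      }
    }
    return warnings;
  }

  public static void main(String[] args) {
    String module =
        "module namespace this='http://example.com/test';\n"
      + "declare function this:f($f) { xdmp:function-name($f) };\n";
    // Assumed configuration: flag any use of a processor-specific prefix.
    List<Pattern> flagged = Arrays.asList(Pattern.compile("xdmp:"));
    for (String w : check(module, flagged)) {
      System.out.println(w);
    }
  }
}
```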

Other Applications

Now that I have fabricated a module-modifying hammer, let's see what else looks like a nail. What other kinds of parameterization or modifications might be possible?

There is an endless variety of possibilities, but they fall into two broad camps: module-specific selections, as we saw with the vector templates, or general-purpose modifications that apply across modules. We'll see some examples of these in a moment. When it comes to the latter class of cases, we are probably better off with a slightly different resolver implementation that applies a configured set of replacements universally.

Regardless of how broadly the effects apply, there are a few basic motivations:

  1. Templating or selection of alternative code

  2. Aesthetic modifications

  3. Augmenting the syntax of the language

  4. Rewriting to accommodate implementation limitations

  5. Testing changes before applying them

Templating

External variables in modules allow us to parameterize the behaviour of an XQuery module. But XQuery provides no mechanism for overriding the default value of an external variable. XQuery processors provide such mechanisms. In Saxon, an external variable can be set from the command line. With the modifying module URI resolver, we can set it from the import line and create a variant of the module with the appropriate value, or different values in different contexts.

Figure 9: Dynamic external variables

Import from calling module:

import module namespace ex="http://example.com/test/external#3"
       at "test/external.txqy?«max»=3";
        

Referenced module:

module namespace this='http://example.com/test/external#«max»';
declare variable $this:max as xs:integer := «max»;

declare function this:hello($n as xs:integer) as xs:string {
  if ($n > $this:max)
  then error(QName('http://example.com/errors','BAD-ARGS'), 'n is too big')
  else (),
  'Hello '||$n||' times!'
};
        

A common purpose for macro preprocessing in languages that have the capability is to enable or disable debugging code. The idea is to remove the code entirely, not just make it conditional on some debugging variable. In this example new character sequences (<~ and ~>) delimit the conditional code. Applying a mapping to spaces enables the code; applying a mapping to comment delimiter sequences comments it out.

Figure 10: Debugging code activation/inactivation

Calling module with code deactivated:

import module namespace ex="http://example.com/test/conditional"
       at "test/conditional.xqy?%3C~=%28:&amp;~%3E=:%29";
       (: <~ == comment start; ~> == comment end :)
ex:hello(4)
        

Calling module with code activated:

import module namespace ex="http://example.com/test/conditional"
       at "test/conditional.xqy?%3C~=%20&amp;~%3E=%20";
       (: <~ == space; ~> == space :)
ex:hello(4)
        

Module with debugging code:

module namespace this='http://example.com/test/conditional';
declare variable $this:max as xs:integer external := 3;

declare function this:hello($n as xs:integer) as xs:string {
  <~
  if ($n > $this:max)
  then error(QName('http://example.com/errors','BAD-ARGS'), 'n is too big')
  else (),
  ~>
  'Hello '||$n||' times!'
};
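To see what the two imports do to the module, it helps to decode the location URIs by hand. The sketch below mirrors the parameter parsing of the resolve() method in simplified form; the class name ConditionalMappingSketch and its methods are hypothetical. Note that &amp;amp; in the import is the XQuery string-literal escape for &amp;, so the location URI as the resolver sees it contains a plain ampersand, which is what the sketch uses.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper showing how the location URI parameters decode.
final class ConditionalMappingSketch {

  static final String DEACTIVATE = "test/conditional.xqy?%3C~=%28:&~%3E=:%29";
  static final String ACTIVATE = "test/conditional.xqy?%3C~=%20&~%3E=%20";

  // Percent-decode as UTF-8, converting the checked exception.
  static String decode(String s) {
    try {
      return URLDecoder.decode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e);
    }
  }

  // Split the query string on '&' and each parameter on its first '='.
  static Map<String, String> parse(String location) {
    Map<String, String> mapping = new HashMap<>();
    int q = location.indexOf('?');
    if (q < 0) return mapping;
    for (String p : location.substring(q + 1).split("&")) {
      int eq = p.indexOf('=');
      if (eq >= 0) {
        mapping.put(decode(p.substring(0, eq)), decode(p.substring(eq + 1)));
      }
    }
    return mapping;
  }

  // Apply the literal replacements; these two patterns do not interact.
  static String apply(String text, Map<String, String> mapping) {
    for (Map.Entry<String, String> e : mapping.entrySet()) {
      text = text.replace(e.getKey(), e.getValue());
    }
    return text;
  }

  public static void main(String[] args) {
    String body =
        "  <~\n"
      + "  if ($n > $this:max) then error() else (),\n"
      + "  ~>\n"
      + "  'Hello '||$n||' times!'\n";
    // Deactivated: the markers become (: and :), commenting out the check.
    System.out.print(apply(body, parse(DEACTIVATE)));
    // Activated: the markers become single spaces, leaving the check live.
    System.out.print(apply(body, parse(ACTIVATE)));
  }
}
```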
        

A less dangerous alternative is to use specially marked comments instead of new syntax. In this example I use << and >> as markers for these special comments. Mapping the full special comment sequences to spaces enables the debugging code which is otherwise inside a normal XQuery comment and is therefore ignored.

Figure 11: Debugging code activation/inactivation (alternative)

Module with debugging code:

module namespace this='http://example.com/test/conditional';
declare variable $this:max as xs:integer external := 3;

declare function this:hello($n as xs:integer) as xs:string {
  (:<<
  if ($n > $this:max)
  then error(QName('http://example.com/errors','BAD-ARGS'), 'n is too big')
  else (),
  >>:)
  'Hello '||$n||' times!'
};
        

Activating the debugging code in module:

import module namespace ex="http://example.com/test/conditional"
       at "test/conditional.xqy?(:%3C%3C=%20&amp;%3E%3E:)=%20";
       (:  (:<< == space; >>:) == space :)
ex:hello(4)
        

Aesthetic modifications

I came across someone suggesting this trick with the C++ macro preprocessor: defining certain Unicode box characters in such a way that code nesting could be represented as boxes. The trick doesn't actually work: box characters are not valid in identifiers, which the C++ macro preprocessor requires. However, we can play this trick with the modifying module URI resolver, because the pattern is not required to be an identifier.

Figure 12: Pretty comments

Import from calling module:

import module namespace ex="http://example.com/test/documentation"
       at "test/documentation.txqy?╔=(:&amp;╝=:)";
        

Referenced module:

╔══════════════════════════════════════════════════════════════════╗
║ Experiment with resolver-based modifications.                    ║
║ @since March 2025                                                ║
║ @custom:Status Bleeding edge                                     ║
╚══════════════════════════════════════════════════════════════════╝
module namespace this="http://example.com/test/documentation";

╔══════════════════════════════════════════════════════════════════╗
║ Compute the dot product of two vectors.                          ║
║                                                                  ║
║ @param $p1: one vector                                           ║
║ @param $p2: another vector                                       ║
║ @return p1·p2                                                    ║
╚══════════════════════════════════════════════════════════════════╝
declare function this:dot($p1 as xs:double*, $p2 as xs:double*) as xs:double
{
  fn:sum(
    this:map2(
      function ($a as xs:double, $b as xs:double) as xs:double {$a * $b},
      $p1, $p2
    )
  )
};
        

Module with block comments written out using Unicode box characters. Mapping the top-left and bottom-right corner characters (╔ and ╝) to comment start and end sequences converts this into a valid XQuery module.

In C++ and languages with similar macro preprocessors, macros can take parameters of sorts that define functional forms. With simple string substitution we can't really accomplish that, but if the module URI resolver performs regular expression matching and substitutions, we can, albeit with a certain amount of struggle. Escaping all the regular expression metacharacters becomes a real chore and the result is an unreadable and somewhat brittle mess. As nasty as this regular expression is, even unescaped, it can fail to properly match when a complex expression is used in the module text.

Figure 13: C++ style function macro

import module namespace ex="http://mathling.com/test/regex"
       at "test/rex.txqy?ceil_div%5C(([%5E,%5C)%5Cs]%2B),%5Cs?([%5E%5Cs%5C)]%2B)%5C)=($1%20%2B%20$2%20-%201)%20idiv%20$2";
       (: ceil_div\(([^,\)\s]+),\s?([^\s\)]+)\)=($1 + $2 - 1) idiv $2 :)

ex:test(10, 2)
        
module namespace this='http://mathling.com/test/regex';

declare function this:test(
  $a as xs:integer,
  $b as xs:integer
) as xs:integer
{
  ceil_div($a,$b)
};
        

A regular expression that matches a "function" call with two simple parameters and replaces it with the desired expression. The escaping of the regular expression makes it even harder to read than normal.
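The behaviour of this pattern, including the failure mode, can be tried directly with String.replaceAll(), which is the call the regular expression variant of the resolver would use. This is a sketch only: the class name CeilDivMacroSketch and the expand() method are hypothetical, and the second input is an invented example of a "complex expression".

```java
// Hypothetical harness for the unescaped pattern shown in the figure's comment.
final class CeilDivMacroSketch {

  // ceil_div\(([^,\)\s]+),\s?([^\s\)]+)\)
  static final String PATTERN = "ceil_div\\(([^,\\)\\s]+),\\s?([^\\s\\)]+)\\)";
  // $1 and $2 refer to the two captured "arguments".
  static final String REPLACEMENT = "($1 + $2 - 1) idiv $2";

  static String expand(String line) {
    return line.replaceAll(PATTERN, REPLACEMENT);
  }

  public static void main(String[] args) {
    // Simple arguments: expands to "  ($a + $b - 1) idiv $b"
    System.out.println(expand("  ceil_div($a,$b)"));
    // Nested call as first argument: the match stops at the inner comma and
    // parenthesis, producing "(f($a + 1 - 1) idiv 1,$b)", which is not the
    // intended expansion and is no longer the same computation.
    System.out.println(expand("ceil_div(f($a, 1),$b)"));
  }
}
```

The second case shows why a regular expression cannot stand in for a parser here: the character classes have no notion of balanced parentheses, so the first comma found is taken as the argument separator.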

Support for function-like macros more directly would require more work: both the patterns and the module text would need to be parsed to properly identify the parameter bindings, which would then need to be presubstituted.

Just defining a new function seems a whole lot easier.

Syntax Augmentation

Modifying the language syntax is a little dangerous. Let us live more dangerously still, and create custom operator symbols. Here I am using the symbol □ (U+25A1) to stand in for a function call to construct a square, ○ (U+25CB) to stand in for a function to construct a circle, and ↗ (U+2197) to stand in for a mapped function that performs a translation affine transformation.

Figure 14: Custom function and operator symbols

Mappings, unescaped and escaped:

□ == geom:box(0,0,10,10)     %E2%96%A1=geom:box(0,0,10,10)
○ == geom:circle((0,0),10)   %E2%97%8B=geom:circle((0,0),10)
↗ == =>geom:translate        %E2%86%97=%3D%3Egeom:translate
        

module namespace this="http://example.com/test/geometric";
import module namespace geom="http://example.com/geometric"
       at "../geo/euclidean.xqy";

declare function this:shapes(
  $location as xs:double*,
  $scaling as xs:double
) as map(xs:string,item()*)*
{
  (□,○)=>geom:scale($scaling)↗($location)
};
        

Module written with custom operator symbols. The mappings convert these into regular XQuery operators and functions. The mappings will need to be heavily escaped.
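To see why the "=>" in the last mapping must be escaped, here is a small Python sketch of how such mappings might be decoded and applied. It is an illustration only: joining the mappings with "&" is an assumption of this sketch, and parse_mappings and expand are hypothetical helpers, not part of the resolver described in this paper.

```python
from urllib.parse import unquote

# Escaped mappings from Figure 14. The "&" separator is an assumption.
QUERY = ("%E2%96%A1=geom:box(0,0,10,10)"
         "&%E2%97%8B=geom:circle((0,0),10)"
         "&%E2%86%97=%3D%3Egeom:translate")


def parse_mappings(query):
    """Split each pair at its first raw '=', then percent-decode both sides."""
    mappings = []
    for pair in query.split("&"):
        pattern, replacement = pair.split("=", 1)
        mappings.append((unquote(pattern), unquote(replacement)))
    return mappings


def expand(text, mappings):
    """Apply the mappings in order as plain string substitutions."""
    for pattern, replacement in mappings:
        text = text.replace(pattern, replacement)
    return text


BODY = "(□,○)=>geom:scale($scaling)↗($location)"
MAPPINGS = parse_mappings(QUERY)
EXPANDED = expand(BODY, MAPPINGS)
print(EXPANDED)
```

Because the separator between pattern and replacement is the first unescaped "=", the arrow operator in the replacement has to be written as %3D%3E so that it survives the split.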

Still, part of the point might be to experiment with new syntax. If we can try out new operator symbols before cracking open our favourite XQuery processor and doing a lot of complex implementation work, so much the better.

A more compelling example of using alternative symbols is for localization of an API. (See Swidan23 for a fascinating discussion of programming language localization in general.) Suppose we have some standard library, but want to make it more accessible to non-English speakers. Users of the library can operate more comfortably when the names are mapped. Such a use case is best served using configured mappings that apply universally. The imported module is the same in either case, so maintenance is not complicated by having multiple localized versions.

Figure 15: Localizing an API

Mappings:

:cross => :перекрестный
:rotate => :вращать
...
        

English version of the code:

let $vec := (1.0, 2.3, 4.0)=>v:cross((2.0, 1.0, -4.0))
let $angle := math:pi() div 3
return $vec=>v:rotate($angle)
        

Russian version of the code:

let $век := (1.0, 2.3, 4.0)=>в:перекрестный((2.0, 1.0, -4.0))
let $угол := math:pi() div 3
return $век=>в:вращать($угол)
        

Code written using localized function names. The source module is the same in either case.

Implementation Accommodation

When standards evolve, sometimes there are lags in which parts of that standard an implementation supports. Sometimes implementations get ahead of things, implementing something that isn't final and gets changed later. Perhaps a particular processor doesn't support a particular function or operator. Perhaps the specification isn't even close to final. The module rewrite can fix that. For example, the mapping %C3%B7=%20div%20 replaces the proposed ÷ operator symbol (XQuery 4.0) with the older div. The name get-years-from-yearMonthDuration (early XQuery draft) can be replaced by years-from-yearMonthDuration (final Recommendation). The processor specific xdmp:function-name can be replaced by fn:function-name. And so forth.

Testing

One class of use case for this pattern/replacement paradigm is to perform quick tests of changes before rewriting a raft of code. The intention is not to have a more permanent kind of change as with the templated vector example, but to just try things out in a lightweight fashion. In Figure 16 the rewrite rules modify the code for a data model change.

Figure 16: Quick test of data model changes

import module namespace ex="http://example.com/test/model"
       at "test/model.txqy?/root=/top/root";

ex:things(document { <top><root><thing>Thing</thing></root></top> } )
        

Source module we're testing out:

module namespace this="http://example.com/test/model";

declare function this:things($doc as document-node()) as xs:string
{
  string-join($doc/root/thing, " ")
};
        

In a similar vein, we can test out a replacement for some old function before we edit a bunch of code. Or we could have multiple alternative implementations, suitable for different environments, and select which one to use. In Figure 17 the function name ex:hello is replaced by the version we mean to try out.

Figure 17: API alternatives

Main module, selecting appropriate function:

import module namespace exc="http://example.com/test/api/caller"
       at "test/apicaller.txqy?:hello=:simple-hello";

exc:test()
        

Module using the call:

module namespace this="http://example.com/test/api/caller";
import module namespace ex="http://example.com/test/api"
       at "test/api.txqy";

declare function this:test() as xs:string
{
  ex:hello(3)
};
        

Module with variant implementations of the function:

module namespace this="http://example.com/test/api";

declare function this:simple-hello($n as xs:integer) as xs:string
{
  "Hello "||$n||" times!"
};

declare function this:multiple-hello($n as xs:integer) as xs:string
{
  string-join(
    for $i in 1 to $n return "Hello", " "
  )
};
        

In Figure 18 we are again replacing one function with another. Here we are looking at a particular annotation on the function (ann:deprecated) and using it as a trigger to change the name of the function to something else, in order to force an error. With simple string substitution this kind of replacement depends a great deal on certain coding conventions.

Figure 18: Deprecation check

import module namespace ex="http://example.com/test/deprecation"
       at "test/deprecation.txqy?%25ann:deprecated%20function%20this:=function%20this:deprecated-";

ex:new-hello(10),
ex:old-hello(10)
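The dependence on coding conventions is easy to demonstrate. In this Python sketch the two declarations are hypothetical (the text of deprecation.txqy is not shown here), and the substitution is modelled as a plain string replacement after percent-decoding.

```python
from urllib.parse import unquote

# Mapping from the location URI in Figure 18.
QUERY = "%25ann:deprecated%20function%20this:=function%20this:deprecated-"
PATTERN, REPLACEMENT = (unquote(part) for part in QUERY.split("=", 1))

# Hypothetical declarations: same function, two layout conventions.
ONE_LINE = "declare %ann:deprecated function this:old-hello($n) { () };"
SPLIT_LINES = "declare %ann:deprecated\nfunction this:old-hello($n) { () };"


def rewrite(text):
    """Plain string substitution, as a simple modifying resolver would do."""
    return text.replace(PATTERN, REPLACEMENT)


renamed = rewrite(ONE_LINE)       # the function is renamed, so old calls fail
missed = rewrite(SPLIT_LINES)     # the line break defeats the pattern
print(renamed)
print(missed)
```

With the annotation and the function keyword on one line separated by a single space, the function is renamed and the call to ex:old-hello raises an error as intended. With a line break in between, the declaration is left untouched and the deprecated call goes undetected.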
        

Why is this a Terrible Idea?

Bjarne Stroustrup lays out some of the issues with lexical macros:

The first rule about macros is: don’t use them if you don’t have to. Almost every macro demonstrates a flaw in the programming language, in the program, or in the programmer. Because they rearrange the program text before the compiler proper sees it, macros are also a major problem for many programming support tools. So when you use macros, you should expect inferior service from tools such as debuggers, cross-reference tools, and profilers.

C++, pg. 336

The modern consensus on programming language design largely agrees with this. Language designers have introduced templating and markers for constants, function inlining, and machine architecture to replace some of the common uses for lexical macros. Lexical macros can be dangerous and modifying module URI resolvers are essentially lexical macro processors. Let's look at some of the issues with module URI resolvers that modify the source.

Complaints about the dangers of C++ style lexical macros abound. For example, there is a whole section of the GCC documentation devoted to them. Many of these issues really only apply to the function-style macros, but let's look at them anyway. Let's suppose, for the moment, that we have a modifying module URI resolver that supports them. We'll consider these potential issues also.

Some of the examples are quite contrived, but the issues they expose are real enough. Caution is warranted.

Readability and sharing

Code will work incorrectly or even fail to parse unless you use the right module resolver. The more you monkey with syntax, the more completely you depend on it or on the "compiled" version of things. Other tools won't understand special symbols or unsubstituted values. For example, part of my toolchain relies on XQDoc to process structured comments. It does not use a resolver at all, and fails to parse modules with extended syntax.

Even if a debugging tool can use the modifying resolver, without being able to see the unescaped mappings or the expanded text, debugging can become quite difficult. If the modifications change line locations, error messages will point to the wrong parts of the code.

Idiosyncratic syntax extensions may not be comprehensible to other people. For example, the special operators in Figure 14 will not make sense without additional information.

Security issues

Code injection is a live concern. Here a substitution is used to reference a perfectly ordinary XQuery module and use it to exfiltrate data:

import module namespace ex="http://example.com/example" at "anodyne.xqy?};=,unparsed-text('/secret.txt')};";
        

The billion laughs attack can be used to perform a denial of service attack on the processor, as long as expansions happen in the right order.

Mappings in location URI:

«lol9»=«lol8»«lol8»«lol8»«lol8»«lol8»«lol8»«lol8»«lol8»«lol8»«lol8»
«lol8»=«lol7»«lol7»«lol7»«lol7»«lol7»«lol7»«lol7»«lol7»«lol7»«lol7»
«lol7»=«lol6»«lol6»«lol6»«lol6»«lol6»«lol6»«lol6»«lol6»«lol6»«lol6»
«lol6»=«lol5»«lol5»«lol5»«lol5»«lol5»«lol5»«lol5»«lol5»«lol5»«lol5»
«lol5»=«lol4»«lol4»«lol4»«lol4»«lol4»«lol4»«lol4»«lol4»«lol4»«lol4»
«lol4»=«lol3»«lol3»«lol3»«lol3»«lol3»«lol3»«lol3»«lol3»«lol3»«lol3»
«lol3»=«lol2»«lol2»«lol2»«lol2»«lol2»«lol2»«lol2»«lol2»«lol2»«lol2»
«lol2»=«lol1»«lol1»«lol1»«lol1»«lol1»«lol1»«lol1»«lol1»«lol1»«lol1»
       

Module content:

element content { «lol9» }
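The growth is easy to quantify. The following Python sketch models the resolver as applying the mappings one after another in the listed order, which is an assumption; build_mappings, expand, and expanded_tokens are hypothetical helpers. It expands only a four-level version so that it stays cheap to run.

```python
def build_mappings(top, fanout):
    """Mappings «lolN» -> fanout copies of «lol(N-1)», highest level first."""
    return [(f"«lol{n}»", f"«lol{n - 1}»" * fanout)
            for n in range(top, 1, -1)]


def expand(text, mappings):
    """Apply the mappings in order as plain string substitutions."""
    for pattern, replacement in mappings:
        text = text.replace(pattern, replacement)
    return text


def expanded_tokens(top, fanout):
    """Number of «lol1» tokens produced from a single «lolN» token."""
    return fanout ** (top - 1)


MODULE = "element content { «lol4» }"

# Highest level first: every level multiplies the text by the fan-out.
in_order = expand(MODULE, build_mappings(4, 10))

# Lowest level first: only the top-level token is ever expanded.
reversed_order = expand(MODULE, list(reversed(build_mappings(4, 10))))

print(in_order.count("«lol1»"), reversed_order.count("«lol3»"))
print(expanded_tokens(9, 10))
```

For the nine-level mappings shown above, the same arithmetic gives 10 to the power 8 copies of «lol1», which at six characters each is about 600 million characters of module text from a one-line module. Applied in the reverse order, each mapping finds nothing left to expand after the first level, which is why the order matters.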
       

Interference with module caching

Module identity is by namespace, so unless the namespace of the imported module is also modified, all modifications coming from the query parameters of the location will be baked in from first use. Consider the module from Figure 9 but without the numeric addendum to the namespace URI, so that it is just "http://example.com/test/external".

Suppose this module gets imported with different mappings for «max», say, once directly as 3 and once indirectly through some other imported module as 5. The first copy seen will be cached by Saxon and used for both variants. So the effect will be either as if the max variable was set to 3 in both cases, or as if it were set to 5 in both cases. If we need to have both variants in the same program, we need to also adjust the module namespace URI as we did in Figure 9.
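The effect can be modelled with a toy cache keyed by namespace. This Python sketch is only an analogy for the behaviour described, not a description of Saxon's internals; the one-line module body and the way the namespace is suffixed are assumptions made for illustration.

```python
class NamespaceKeyedCache:
    """Toy cache: the first expansion seen for a namespace wins."""

    def __init__(self):
        self._modules = {}

    def load(self, namespace, template, mappings):
        # Mappings are applied only on a cache miss.
        if namespace not in self._modules:
            text = template
            for pattern, replacement in mappings:
                text = text.replace(pattern, replacement)
            self._modules[namespace] = text
        return self._modules[namespace]


# Hypothetical one-line module body standing in for the real module.
TEMPLATE = "declare variable $this:max := «max»;"
NS = "http://example.com/test/external"

# Same namespace for both imports: the second mapping is silently ignored.
shared = NamespaceKeyedCache()
first = shared.load(NS, TEMPLATE, [("«max»", "3")])
second = shared.load(NS, TEMPLATE, [("«max»", "5")])

# Distinct namespaces (suffix format assumed): both variants coexist.
distinct = NamespaceKeyedCache()
three = distinct.load(NS + "3", TEMPLATE, [("«max»", "3")])
five = distinct.load(NS + "5", TEMPLATE, [("«max»", "5")])

print(first, second, three, five, sep="\n")
```

Which variant wins in the shared case depends purely on which import is resolved first, which is what makes the problem hard to spot.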

Misnesting

Since the substitutions are just applying to text, they do not obey any rules of program structure. They can include separators such as commas, which can make for some strange looking source code.

The mapping includes a stray comma:

import module namespace ex="http://example.com/example"
       at "ex.txqy?FORMAT='[Y0001]-[M01]-[D01] [H01]:[m01][Z01]', ";
      

The module source looks weird:

format-dateTime(current-dateTime(), FORMAT "en","AD","US")
      

But the expansion is valid:

format-dateTime(current-dateTime(), '[Y0001]-[M01]-[D01] [H01]:[m01][Z01]', "en","AD","US")
      

Extra separators can make the code incorrect. Here the replacement introduces an extra comma:

PAIR=$data("a"),$data("b")
      
Used in a functional context such as this:
max(PAIR)
      
The expansion converts what looks like a one-argument maximum to a two-argument maximum, where the second argument is a collation:
max($data("a"),$data("b"))
      

It is all rather unfortunate.

Operator precedence

Textual substitutions into expressions, if not handled with care, can run afoul of operator precedence rules. In XQuery the union operator | has lower precedence than the child path operator /. If we substitute in an expression using the union operator in a context with the path operator, results may not be what we expect.

Suppose we expand the following mappings in order:

TAG=((P)/c/@tag)
(P)=a|b
      

The source text looks fine:

$root/TAG!string(.)
      

The expansion will give us:

$root/((P)/c/@tag)!string(.) => $root/(a|b/c/@tag)!string(.)
      

It is unlikely this is what was intended. More likely we meant:

$root/((a|b)/c/@tag)!string(.)
        
which produces very different results.

This kind of example is more plausible with function-style macros. The mapping would be:

TAG(P)=(P/c/@tag)
      

The source would then provide the parameter:

$root/TAG(a|b)!string(.)
      

The source still looks OK, but the expansion is as before, with an unintended result.
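The two-step expansion with simple string mappings can be replayed in a few lines of Python. This is a sketch that models the resolver as ordered plain-text replacement; the SAFE variant, which parenthesizes the replacement for (P), is my own suggestion of a defensive convention rather than something the resolver enforces.

```python
def expand(text, mappings):
    """Apply each (pattern, replacement) pair in order as plain text."""
    for pattern, replacement in mappings:
        text = text.replace(pattern, replacement)
    return text


SOURCE = "$root/TAG!string(.)"

# Mappings as given above: the union ends up unparenthesized.
UNSAFE = [("TAG", "((P)/c/@tag)"), ("(P)", "a|b")]

# Defensive variant: the replacement carries its own parentheses.
SAFE = [("TAG", "((P)/c/@tag)"), ("(P)", "(a|b)")]

unsafe_text = expand(SOURCE, UNSAFE)
safe_text = expand(SOURCE, SAFE)
print(unsafe_text)
print(safe_text)
```

This is the same habit C programmers learn for macro bodies: wrap every substituted expression in parentheses, because the substitution knows nothing about precedence.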

Name clashes

Ill-advised match patterns can create name collisions in the XQuery text. This may cause syntax errors or incorrect code.

Here someone has attempted to create a shorter name for a function. They even attempted to limit the applicability with parentheses.

empty()=ex:set-matrix-all-to-zeroes()
      

It will end in tears, however, because of the clash with a standard built-in XQuery function in an expression like this:

if ($seq=>empty()) then ...
      

Similarly, binding of variables in the replacement text can create unintentional variable shadowing.

For example, this function code snippet looks like it is summing the numbers from 3 to 12, but with the mapping below it is actually summing the numbers from 1 to 10.

let $start := 3
let $end := 12
return «init»sum($start to $end)
      

Mapping causing the trouble:

«init»=let $start := 1 let $end := 10 return
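Both problems show up immediately when the substitutions are replayed as plain text. In this Python sketch the expand helper is hypothetical, and the trailing space in the «init» replacement is an assumption, added so that "return" and "sum" remain separate tokens.

```python
def expand(text, mappings):
    """Apply each (pattern, replacement) pair in order as plain text."""
    for pattern, replacement in mappings:
        text = text.replace(pattern, replacement)
    return text


# Function-name clash: the pattern also matches a call to the built-in empty().
CLASH = expand("if ($seq=>empty()) then ...",
               [("empty()", "ex:set-matrix-all-to-zeroes()")])

# Variable shadowing: the injected bindings hide the visible ones.
SNIPPET = "let $start := 3\nlet $end := 12\nreturn «init»sum($start to $end)"
SHADOWED = expand(SNIPPET,
                  [("«init»", "let $start := 1 let $end := 10 return ")])

print(CLASH)
print(SHADOWED)
```

In the expanded snippet the inner bindings are the ones in scope at the call to sum, so the result is 55 (the sum of 1 to 10) rather than the 75 (the sum of 3 to 12) that the visible source suggests.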
      

Duplication of work

In C++ with function-style macros there is the danger of unexpected duplication of side-effects. In a functional language like XQuery we don't have to worry about that, although duplication of effort could still be an issue.

Suppose we have a macro defined as follows:

square(x)=x*x
      
which will be substituted in the text:
square(ex:complex-calculation($data))
      

The substitution will result in the program text of:

ex:complex-calculation($data)*ex:complex-calculation($data)
      

While this will produce the correct result, if ex:complex-calculation takes a long time to compute, we may end up executing it twice, unless the processor optimizes the second call away.
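The difference between textual expansion and a true function call can be seen by counting evaluations. The sketch below is a Python analogy, not XQuery: Python evaluates eagerly and performs no common-subexpression elimination, so it shows the worst case. The function complex_calculation and its data are stand-ins invented for the illustration.

```python
calls = 0


def complex_calculation(data):
    """Stand-in for an expensive computation; counts how often it runs."""
    global calls
    calls += 1
    return sum(data)


def square(x):
    """A real function: the argument is evaluated once, then bound to x."""
    return x * x


DATA = [1, 2, 3]

# What the macro-expanded text computes: the argument appears twice.
calls = 0
macro_result = complex_calculation(DATA) * complex_calculation(DATA)
macro_calls = calls

# What a function call computes: the argument is evaluated once.
calls = 0
function_result = square(complex_calculation(DATA))
function_calls = calls

print(macro_result, macro_calls, function_result, function_calls)
```

Both forms give the same value, but the expanded form does the expensive work twice unless the processor recognizes the repeated expression.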

When to do it anyway

In the previous section we saw several serious arguments for why using a modifying module URI resolver is problematic, especially if it supports something beyond simple string substitutions. In this section we examine the inverse question: when is this a good idea (or the least bad idea) anyway?

  1. To overcome XQuery limitations.

    XQuery lacks templates. It leans heavily on the processor's optimizer instead of providing language support for things like assertions, function inlining, and excision of environment-specific or debugging code. It does not support polymorphism by type. As Stroustrup said, a macro use may indicate a flaw in the programming language, and like every language, XQuery is not without flaws.

  2. As a quality control step.

    The module URI resolver can be used to detect undesirable code patterns without modifying anything and report or raise errors to function as a linter. Or it can be used to inject errors to test error-handling code.

  3. For constrained substitutions, like templating.

    Most of the issues with the approach stem from the wilder kinds of substitutions. If the modifying resolver is used in more constrained circumstances, most of these harms do not arise.

  4. To avoid a compilation step during active development.

    For example, the vector templates we began with can be developed and debugged while still enjoying the simplicity of direct execution without having to use some kind of build system. Such concerns can be left until deployment.

    Similarly, debugging code can be marked and enabled during testing and development, and left as comments for deployment.

  5. When modifying the modules directly is impractical or imprudent.

    It may be that certain modifications are best accomplished by editing the modules in question. However, library modules are sometimes part of a shared resource, and editing them would impact others. One may also lack permission to modify them.

    This technique could offer a way to apply simple patches to code that would otherwise be untouchable, or when providing a modified version of the library is not practical.

  6. As a way of experimenting with syntactic devices that are not part of the language.

    The best way to test the usability and effectiveness of a syntactic device is to use it. Replacing it with some equivalent when the module is accessed can be easier than modifying the processor to support some experimental operator.

  7. When the processor doesn't support certain features present in the code.

    This is the flip side of the previous case: there may be newer features and functions that the processor doesn't have available, or we may need to handle other processor variations such as pre-defined namespaces.

  8. To provide a domain specific language for a particular community.

    Let's face it: XQuery isn't for everyone. Smoothing over some of the wrinkles for a limited use case or a limited community might be worthwhile. Maybe some more mathematically inclined folks would prefer not to have to sprinkle math:cos and div and math:pi about, and would be happier with cos, ÷, and π instead.

Many of these kinds of use cases would work better with a modifying module URI resolver that always applies some set of modifications rather than the one that gets the patterns and replacements from the query parameters of the location URI. That would also eliminate some concerns about escaping mappings and module caching.
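As a rough illustration of the configured alternative, here is a Python sketch of a rewrite step whose mappings come from a fixed table rather than from the location URI. The table reuses the implementation-accommodation examples given earlier; the names CONFIGURED and resolve are hypothetical, and a real resolver would read the table from a configuration file.

```python
# Mappings held in configuration: no percent-escaping is needed, and every
# import of a module sees the same rewritten text.
CONFIGURED = [
    ("÷", " div "),
    ("get-years-from-yearMonthDuration", "years-from-yearMonthDuration"),
    ("xdmp:function-name", "fn:function-name"),
]


def resolve(source_text, mappings=CONFIGURED):
    """Apply the configured mappings, in order, to the module text."""
    for pattern, replacement in mappings:
        source_text = source_text.replace(pattern, replacement)
    return source_text


average = resolve("$total÷$count")
years = resolve("fn:get-years-from-yearMonthDuration($d)")
print(average)
print(years)
```

Since the same table applies to every import, the cached copy of a module cannot disagree with what a later import expected.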

Conclusions and Summary

Custom module URI resolvers can be used to apply modifications to source modules in a dynamic fashion. Even simple string matching with replacement supports a variety of use cases. In particular it is possible to define function libraries for composite data structures with both strong typing for the base types and a minimal amount of code-copying in order to support different base types. That is to say: the XQuery equivalent of template classes.

There are dangers in going even this far: issues with module caching, toolchain support, and the ability to share code. Applying more elaborate modification regimes or less judicious changes is even more dangerous. While the technique is interesting and has its uses in constrained environments and circumstances, it is a sharp tool indeed and should be approached with caution.

Extending the idea to module URI resolvers that don't change the code, but that perform some other operation as the code is accessed avoids the dangers of dynamic code modification and supports some interesting use cases such as linting, auditing, and use metering.

The module URI resolver provides a place to hook these kinds of operations.
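A non-modifying hook of this kind might look like the following Python sketch. It is an assumption-laden illustration: the two lint rules, the location string, and the module text are invented, and resolve simply returns the source unchanged together with its findings.

```python
import re
from collections import Counter

# Hypothetical lint rules: (compiled pattern, message).
LINT_RULES = [
    (re.compile(r"%ann:deprecated\b"), "deprecated function declared"),
    (re.compile(r"\bxdmp:"), "processor-specific function used"),
]

# Use metering: how often each location has been resolved.
USE_COUNTS = Counter()


def resolve(location, source_text):
    """Return the module text unchanged, plus (line number, message) findings."""
    USE_COUNTS[location] += 1
    findings = [
        (lineno, message)
        for lineno, line in enumerate(source_text.splitlines(), 1)
        for pattern, message in LINT_RULES
        if pattern.search(line)
    ]
    return source_text, findings


MODULE = ("declare %ann:deprecated function this:old() { xdmp:function-name() };\n"
          "declare function this:new() { 1 };")

text, findings = resolve("test/example.xqy", MODULE)
print(findings, USE_COUNTS["test/example.xqy"])
```

Because the text handed to the processor is exactly the text on disk, none of the caching, debugging, or sharing concerns above arise.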

References

[XSLT] W3C: Michael Kay, editor. XSL Transformations (XSLT) Version 3.0 Recommendation. W3C, 8 June 2017. http://www.w3.org/TR/xslt-30/

[XSLT 4.0] W3C: Michael Kay, editor. XSL Transformations (XSLT) Version 4.0 Editor's Draft. W3C, 18 March 2025. http://www.w3.org/TR/xslt-30/

[XQuery 4.0] W3C: Michael Kay, editor. XQuery 4.0: An XML Query Language Editor's Draft. W3C, 18 March 2025. https://qt4cg.org/specifications/xquery-40/

[XQuery] W3C: Jonathan Robie, Michael Dyck, Josh Spiegel, editors. XQuery 3.1: An XML Query Language Recommendation. W3C, 21 March 2017. http://www.w3.org/TR/xquery-31/

[Saxon] Saxonica: Saxon. https://www.saxonica.com/products/products.xml

[C++] Bjarne Stroustrup. The C++ Programming Language, Fourth Edition. Addison-Wesley, 2013

[Swidan23] Alaaedin Swidan and Felienne Hermans. A Framework for the Localization of Programming Languages. Presented at SPLASH-E ’23, October 25, 2023, Cascais, Portugal. In Proceedings of the 2023 ACM SIGPLAN International Symposium on SPLASH-E, pp13-25. Available at https://dl.acm.org/doi/pdf/10.1145/3622780.3623645. doi:https://doi.org/10.1145/3622780.3623645



[1] Actually, we need to be a little more careful here. The value of 1.9 cast as xs:integer is 1 whereas 2 would be a better result. It is possible, even without type introspection (although I would argue for type introspection in XQuery), to write generic casting code for numeric types that behaves better for integer subtypes. My real template vector code does this, but details are beyond the scope of this paper.

[2] Here I am playing an unholy trick of using the Canadian Aboriginal Syllabics characters U+1438 and U+1433 to delimit the type name to make it clear this is a parameterized type. "ᐸTᐳ" is a perfectly acceptable name: those characters look like angle brackets, perhaps, but they are just letters. I find it makes the intentions behind the code clearer. Your mileage may vary.

[3] I am using guillemets (« U+00AB and » U+00BB) to delimit the pattern to be replaced here. I could have used the Canadian syllabic characters again. Since this is a part of the template that is not intended to be used directly, it isn't necessary to use letter characters for this purpose. Indeed, it may be preferable not to. Guillemets work well here, as do punctuation sequences such as @@. As long as you avoid the sequence in the actual code you want to preserve, all is well.

Author's keywords for this paper:
XQuery; Templates; Preprocessor; Saxon