How to cite this paper

van der Zander, Benito. “Extending XQuery with pattern matching over XML, HTML and JSON, and its usage for data mining.” Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13 (2014). https://doi.org/10.4242/BalisageVol13.Zander01.

Balisage: The Markup Conference 2014
August 5 - 8, 2014

Balisage Paper: Extending XQuery with pattern matching over XML, HTML and JSON, and its usage for data mining

Benito van der Zander

Institute for Theoretical Computer Science

Graduate School for Computing in Medicine and Life Sciences, University of Lübeck

Benito van der Zander is a graduate student at the University of Lübeck, working on the Ph.D. project "Algorithmics Of Causal Inference" investigating algorithms for Pearl's causality framework. Before his enrollment in that graduate school, he has developed the open-source XQuery engine Xidel. He received a computer science B.S. from the University of Düsseldorf, and a M.S. from the RWTH Aachen.

Copyright © 2014 Benito van der Zander

Abstract

Pattern matching in a broad sense is a common feature of modern functional programming languages, answering the question, if one complex structured object has a form that is the same as another complex structured object, for some definition of “the same”. In XQuery path expressions, switch, and typeswitch statements are often described as performing pattern matching, but these are merely impoverished flavors of matching when compared to the real thing. We describe a syntax for general pattern matching based on regular expressions for XML/HTML/JSONiq trees, how these patterns are matched against input data, and how this pattern matching can be integrated into the syntax and semantics of the XQuery language. At the end we summarize real-world experience using it for large-scale data mining of library webcatalogs.

Table of Contents

Introduction
The pattern syntax
Basic node matching
Selecting data
Advanced patterns
Optional elements
Repeated elements
Conditionals
Alternative elements
Default options
Matching JSONiq
The integration in XQuery
As function, without syntax modification
Extending Typeswitch-Expressions
Extending Flowers
Extending let
Extending for
Using pattern matching for data mining
Further work
Conclusion
Acknowledgment

Introduction

Modern XQuery is becoming a more and more functional programming language, with features like filter and map operators, inline functions, higher order functions and sequence heads/tails. However, it still does not support any kind of pattern matching, although pattern matching is often considered one of the most remarkable features of functional programming. Some of them, like Haskell, even base their entire syntax on it. In such languages a pattern matching algorithm takes an input value and one or more patterns, and searches for the pattern that has the same internal structure as the input value, possibly extracting some data from the input value. It is very similar to a regular expression matching a string that has a certain character structure, but applicable to all types of the language, not only to strings.

The most similar feature in XQuery is the switch expression that can be used for atomic values, which branches to a case clause of equal value. But, since it is limited to atomic values, it cannot be used for the majority of data used in XQuery, the XML infosets.

XML Schemas could be used for matching these XML infosets in path or typeswitch-expressions. However, this matching cannot directly return data from the matched nodes and defining an XML Schema for schema-less input data (like HTML pages) can be cumbersome, since its definition requires an additional file and cannot occur within the XQuery itself. Also the definition of an XML Schema is not implicit/intuitive like a pattern which consists of a sample structure that is compared with the input, but explicit by declaring every element and attribute by a special element, so it cannot really be considered pattern matching.

Sometimes the standard path expressions are called "pattern matching", but this is not correct either, since a path expression is also a list of explicitly applied filtering expressions, not an implicitly defined pattern.

In academics there are various different meanings of the term "pattern matching", often it is used to denote any specific querying on XML trees, like in [Hosoya2003] or [Yao2004]. Or in another direction, [Fischer2010] considers pattern matching as finding specially ordered subsequences in sequences, similarly to regular expressions on sequences of letters. None of this is pattern matching in our sense. Our patterns should find a matching part in the hierarchical structure of an XML (HTML, JSON) document, and be given as an exemplary sample document that is compared to the input document, detecting if the input document contains the pattern, and which parts of the input document were added.

More specifically, but also figuratively, a pattern can be considered a tree with holes, and during matching we must test if the tree of an input document contains the pattern tree as subtree, and which nodes of the document are matched to/fall into the "holes".

To see examples of such patterns, we need to look outside of XQuery and the common XML research:

One of the first query languages using true XML pattern matching was XML-QL [XMLQL], an early attempt to develop an XML query language which combines SQL with XML pattern matching. However it has never evolved beyond drafting state and appears to have been forgotten in favor of XQL the predecessor of XQuery. It is even unclear, if there has ever existed an implementation of XML-QL. Its pattern matching as such is more expressive than ours, but since it cannot use the matched values in a Turing-complete functional language, it is in total less powerful than our system.

Very similar to XML-QL is the graphical query language XGL [Fle2002], which uses graphs instead of XML structures as patterns. Thus it does not have exactly patterns in our sense, however you could consider it as working on a graphical representation of a pattern, and if you write one of its graphs in a serialized form as XML document, it would become our kind of pattern. Like XML-QL it also lacks the processing power of XQuery.

A final example is the language Scala [Scala], which combines functional and object-oriented programming concepts. It can perform true pattern matching on all of its objects, and has XML literals as abbreviations for XML elements, so it can naturally perform pattern matching on XML elements. However, it cannot do this on elements with attributes and is therefore of limited use. It is also not considered a query language, but a general purpose programming language.

This paper brings such concepts to XQuery by describing a possible syntax for XML pattern matching, how these patterns can return selected data and how these patterns can be used in XQuery in a natural way. We conclude with our experiences of applying these patterns to HTML pages, and possible future work. We do not investigate theoretical implications or efficient ways to implement it, instead we want to develop an easy and powerful syntax that can be practically used.

The pattern matching described here is an idealized variant of the actual pattern matching in our implementation[1]. On the one hand, some logical extensions explained here have not yet been implemented, and on the other hand, our implementation has features we do not mention, because they are deprecated or would cause confusion.

We assume the reader is familiar with the XML [XML], XQuery [XQuery] and JSONiq [JSONiq] standards, so terms and definitions given there will not be repeated.

We will often use the verb "match" as well as the noun "match", which have slightly different meanings. First, "X is matched against Y" means we call our pattern matching algorithm with X as pattern and Y as input data. It does not say anything about the output of that algorithm or the relationship between X and Y. Second, "X matches Y" and "Y is matched by X" means X is (a part of) a pattern and when X is matched against Y the matching algorithm succeeds (does not throw an exception). If the algorithm succeeds, it also associates each part of the pattern with a part of the data. In that specific case we say such a part of the data "is a match of/for" the associated part of the pattern. However, generally "X is a match of/for Y" can also just be another way to say "X matches Y". The adverb "matching" has the same two meanings as the verb, although it can refer to the algorithm itself, too (the "pattern matching algorithm", "matching process", ...).

The pattern syntax

Our goal is to develop a pattern syntax that fulfills the following three properties:

  • Intuitive:

    It should be obvious which data a pattern could match, even for someone who has never seen a pattern before. We try to realize this, firstly, by requiring that a (basic) pattern has to match itself, if the pattern is given as input. Therefore the pattern is an illustrative example for itself. And, secondly, by basing most of the syntax on regular expressions which should be familiar to any programmer.

  • Validating:

    When matched against not-matching, unexpected input data the pattern should raise an error, instead returning something arbitrary. (unlike a simple path expression that returns an empty sequence on failure. )

  • Minimal:

    There should be no pointless redundancy or data not relevant to the query within a pattern. This means that all additional data in the input should be ignored during matching, so we can exclude unimportant data from the pattern.

From the first and third property (and later given examples) it follows that it is also easy to create a pattern that matches a given input, since the pattern can be created by copying the input data, completely removing all data that should not be queried for, replacing the remaining data with annotations and finally calling it a pattern. We hope that this is even easy for people who have only knowledge of XML/HTML and no real programming language, since they do not have to write anything new, contrary to a path expression.

There is, however, a conflict between the second and third property, which cannot be resolved in general. By ignoring additional data a pattern might match a completely different input document successfully, which was not intended to be accepted as a match, but contains matching data somewhere within it. So for any actual pattern, it must be carefully decided what to include and exclude from the pattern. Nevertheless the third property is important, because the (HTML) documents we are processing contain far too much noise to include it all in the patterns.

Basic node matching

Due to the intuitiveness goal the pattern itself has to be a well-formed XML document (resp. fragment). We can therefore define the basic "matching" as recursive relation between a node of the pattern and a node of the input as follows:

  • A text node matches another text node, if they have matching string values.

  • An attribute matches another attribute, if they have the same name and matching string values.

  • An element E matches another element F, if they have the same name, every attribute of E matches an attribute of F and every child of E matches a descendant of F in order.

An exact definition for "matching string values" is given below. The property "in order" of the last point means: if a child C of E matches a descendant X of F and another child D of E matches a descendant Y of F, C precedes D in the pattern if and only if X precedes Y in the document.

Trivial examples are the pattern <foo/> matching an identical input document <foo/> or a text node pattern foo matching an identical text node foo. A more meaningful example is the pattern

<element foo="bar">cat<meow/></element>
that matches
<element foo="bar">cat<meow/></element>
or (ignoring additional data)
<element foo="bar" att="value"><p>cat<call><meow loudness="60 dB"/></call></p></element>
but neither (missing attribute)
<element>cat<meow/></element>
nor (wrong name(space))
<element foo="bar" xmlns="elsewhere">cat<meow/></element>
nor (wrong order of descendants)
<element foo="bar"><meow/>cat</element>

We have not considered a syntax to match comment or processing-instruction nodes, since such nodes rarely contain data that should be returned by a query and cannot contain a selector expression (see below) as a child to select this data. Also the most natural way to match a comment node would be to include an <!-- .. --> XML comment node in the pattern, which is problematic in our application, because we use standalone patterns outside of XQuery, where we prefer to use XML comments as actual comments that are ignored by the matching process. There is also no syntax to match document nodes, since it is always possible to match the root node of the document instead.

That a child in the pattern is allowed to match any descendant in the input, not only another child, follows the principle of ignoring additional data to keep the pattern minimal.

Requiring that the descendants have to be in the same order as the children, instead accepting any order, has a few benefits. First, it follows the idea of patterns as XML regular expressions, which also match their letters ordered. Second, XQuery itself is good at selecting unordered elements, but very difficult [Fischer2010] to use to select ordered elements[2], so the combination of XPath and patterns provides a simple and powerful solution for both cases. Third, the advanced pattern syntax contains natural, unrelated features that can directly be used for unordered matching. It is doubtful that unordered matching could be used for ordered matching in such a way.

The conditions for text node matching could be relaxed to accept any node that has a matching string value, not just other text nodes. This might be useful, if the input data could sometimes contain markups like <em> within the text and sometimes not. However, we have not seen such a use case, and therefore have implemented the faster way to match text nodes only to text nodes.

String values are matched against a string using one of six different modes: eq, matches, starts-with, ends-with, contains and list-contains. eq checks for an exact match, like the eq operator. list-contains treats the string value as space separated list (e.g. like the class attribute of HTML elements) and matches, if the value of the pattern occurs in that list. The other modes match, if the call to the corresponding XQuery function, e.g. fn:starts-with($value, $pattern), returns true. Additionally the matching can be case-sensitive or case-insensitive, in latter case, the values are first converted to lowercase resp. the i flag is passed to fn:matches.

Which mode is chosen depends on the node whose string value is matched and the local settings. Due to legacy reasons, the default modes of our implementation are starts-with for text nodes, list-contains for class-attributes and eq for all other attributes[3]. These modes can be changed within a pattern, similarly to the options described below.

There is no difference between matching a pattern to an XML document and matching it against a HTML document, except that node names are compared case sensitive for XML and case insensitive for HTML.

Selecting data

So far we have only described how a pattern can be used to test if an input document has a certain structure. To become a real replacement for standard XPath selectors, we need a way to select specific data from the matches in the input document. In regular expressions such selecting can be done with capture groups, an idea we need to transfer to the pattern syntax.

For this we allow the embedding of arbitrary XQuery expressions which will be evaluated in the context of local matches and whose return values become the return values of the pattern matching. So in case someone needs a query that does not just copy the matched data from the input document, but needs to perform certain calculations with the data, he can do it with all the power of XQuery.

We have decided to allow the following ways to embed a query in a pattern:

  • {expression}: A text node starting with { and ending with } is the shortest syntax to include an expression. The {} parentheses are used, because they correspond to the expression in a standard XQuery element constructor, and, more important, because every string starting with { is an invalid regular expression. So it unambiguous if a certain text node is supposed to be a selector expression or a regular expression used to match a text node in the input.

  • <template:s>expression</template:s>: A more standard syntax to include queries in an XML document. Although the {...} syntax does not collide with regex text matching, it is sometimes problematic to use. For example, if you want to have a text node for matching and a selector expression in the same parent node. So we have added the <s>-element in the namespace http://www.benibela.de/2011/templateparser (at default bound to the namespace prefixes template[4] and t) that can also contain an expression.

  • <t:s>expression</t:s>: This is actually identical to <template:s>expression</template:s> due to binding the same namespace to prefixes template and t, but faster to write being the shortest name a node outside the default namespace can have.

  • attribute="{expression}": The {} syntax can also be used in attributes. In that case the attribute is replaced by a t:condition attribute (see below) just checking the existence of the old attribute, and a <t:s>@attribute / (expression) </t:s> child node is inserted instead. E.g. <a href="{.}"/> to query for the destination URL of a link.

A selector expression is evaluated with the context item . set to the node matched by the parent node of the selector expression (more precisely the parent node of the {} text node or the parent node of the <t:s/> node containing the query). This definition is well-defined, since as so far described every node of the pattern has exactly one corresponding match in the data, once the matching has succeeded. The results of all evaluated selector expressions are returned as result of the pattern matching.

With this definition we can for example already write

<a><b>{.}</b></a>
as a pattern query for the first b-element contained in an a-element[5], e.g. it returns <b>foo<br/>bar</b> on input
<a><b>foo<br/>bar</b></a>
. Using the longer syntax, we get example patterns
<a><b><t:s>.</t:s></b></a>
and
<a><b><template:s>.</template:s></b></a>
that are matched in exactly the same way as the first example pattern.

An example for the attribute variant is

<a href="{.}"/>
which returns an attribute node href="http://balisage.net/" when matched against
<html>Some text<p><a href="http://balisage.net/">title</a></p></html>
.

A combined example is the pattern

<a href="{.}">{concat("=> ", .)}</a>
which would return an attribute node href="http://balisage.net/" as well as a string value "=> title".

Someone who is familiar with XQuery's handling of sequences might now wonder what happens when a pattern has multiple selector expressions that all return sequences. Does the pattern return multiple sequences? Is that even possible? Does it return a single sequence of all sequences flattened together?

We prefer former case, because it makes it possible to combine multiple patterns to a single one, regardless of the results of the combined patterns. However, multiple sequences cannot really be handled with standard XQuery data types, so we have decided to extend the XQuery syntax to create some kind of named return values.

Thus we have added variable assignments $variable := expression which evaluate the expression on the right hand side and assign it to a global variable on the left hand side. The priority of the := operator is set to be lower than the priority of every existing operator. More formally, we have introduced these extensions to the XQuery EBNF:

ExprSingle         ::= FLWORExpr | QuantifiedExpr | SwitchExpr | TypeswitchExpr | IfExpr | TryCatchExpr | VarAssignmentExpr
VarAssignmentExpr  ::= (VarRef ":=" )* OrExpr
VarRef             ::= "$" VarName   
Now we can return multiple sequences by assigning these sequences to variables, e.g. {$var1 := (1,2,3), $var2 := (4,5,6)}.

This leaves the question, what happens, when there are multiple assignments to the same variable. We will later see that in loops every selector expression is evaluated multiple times, so the most reasonable way is to return two different variables with the same name. Although in most of the XQuery syntax extension this is not possible, we prefer to do it, whenever possible. So e.g. {$x := (1,2), $x := (3,4)} also returns two different sequences. This is easier to understand, when the variable assignments are not considered as creating actual variables, but as creating a name+value pair (like a map with one element) and adding it to a sequence. Since the pairs are not sequences themselves, they are not flattened[6].

However, always writing {$var := expr} would be cumbersome, especially after seeing the {.} examples above. So we have added two abbreviations[7]:

  • {expression} not containing :=: An expression without assignment is replaced by an assignment {$result := expression} to a default variable $result. The name of the default variable is configurable, and later we will even extend it to assignments to the context item . itself.

  • {$variable}: A selector expression containing exactly one variable is replaced by {$variable := .}

With this definition, the meaning of {.} is implied and does not need to be defined explicitly, and the above example pattern <a><b>{.}</b></a> returns the b-element in the default variable (if the matching does not fail). Like the XQuery expression let $result := exactly-one(((//a)[1]//b)[1]) return ... (but we will omit the let $result := prefix in the remaining of the paper for brevity).

Another example is

<root><a>{$a}</a><b>{$b}</b></root>
returning the first a-element in $a and a following b-element in $b. It is equivalent to the XQuery expression
let $a := exactly-one(((//root)[1]//a)[1]),  
    $b := exactly-one(($a/following::b)[1] intersect (//root)[1]//b)
return ...
This example also shows how awkward it is to simultaneously encode the order requirement (b following a) as well as the descendant requirement (a and b being descendants of root) in standard XQuery[8].

So in the end the matching process either raises a matching error or returns a list of variable assignments

$var1 := value1
$var2 := value2
... 
. We call this list "assignment stream", since it is similar to a tuple stream of a flower expression except that it has different variables in each row.

Advanced patterns

The previous sections have explained the basic matching of individual elements, in which each node of the pattern is matched to exactly one node in the input data. But in practical applications it is necessary to return data that does not have a 1-1 relationship with the query itself. For this the patterns have various syntaxes which are described here:

Optional elements

An optional element is optional, i.e. if a match for the element exists in the input, the element is matched as usual, otherwise it is ignored.

There are two ways to mark an element as optional:

  • ?: The element is followed by a ?, e.g.: <element/>?

  • t:optional = "true": It has an optional attribute, e.g.: <element t:optional="true"/> or <element template:optional="true"/>

? as marker is not ambiguous in respect to text node matching, since regular expressions cannot start with a question mark.

A naive implementation can match optional elements first as non-optional, and if that fails with a matching error, remove the element and match the pattern again.

Repeated elements

With the patterns described so far, we can only apply a selector expression to a single element of the input document and therefore only return a single value with a simple selector like {.} or {$var}, although the input usually contains many elements, which all should be examined by a query. In fact returning multiple elements is considered so important that it is easier to select all elements than only the first element with standard path expressions. So the pattern syntax needs to be extended to be able to match multiple elements.

For this we introduce a notation similar to regular expressions:

  • *: The preceding element can be repeated arbitrary many times. E.g. <element/>*

  • +: The preceding element can be repeated once or more times. E.g. <element/>+

  • {min, max}: The preceding element is repeated between min and max times. E.g. <element/>{1,3}

  • {count}: The preceding element is repeated exactly count times. E.g. <element/>{3} is equivalent to <element/><element/><element/>

  • <t:loop [min=".."] [max=""]> ... </t:loop>: The child elements are matched at least @min and at most @max times. A missing min attribute is interpreted as 0, and a missing max attribute as infinite. E.g. <t:loop min="1"><tr/></t:loop>

Using these symbols is unambiguous like ? for optionality and {} for selector expressions, since no regular expression can start with + or *. {count} could be confused with a selector expression returning count, however a selector returning a single number is pointless, so we can assume every single number is not meant to be a selector. A single text node can contain a repetition marker and a selector expression.

The use of a minimal count can be demonstrated with the following example: each of the patterns <x>{.}</x>* and <x>{.}</x>+ applied to <root><x>1</x><x>2</x><x>3</x></root> returns <x>1</x>, <x>2</x> and <x>3</x>. However, if applied to <root/>, former succeeds and returns nothing, while latter raises a matching error.

Unlike the minimal count a maximal count does not raise matching errors, instead all elements after the first max elements are ignored, following the principle of ignoring additional data.

Although <x>{.}</x>* seems to be the same as //x, it is not. This becomes obvious, if we look at the expanded pattern <x>{.}</x><x>{.}</x><x>{.}</x><x>{.}</x>.... All the x-elements in the pattern are siblings and not descendants of each other, so the pattern only matches separated x-elements which do not contain each other. An equivalent XQuery would therefore be let $temp := count(ancestor-or-self::x) return .//x[count(ancestor::x) eq $temp] or shorter .//x except .//x//x .

Considering a loop to be a sequence of repeated elements also explains, how selectors are handled within the loop: Every selector is evaluated for every matched element and the result is returned as another assignment to the result variable.

For example

<a>{$var}<b/></a>+
applied to
<root> <a><b>1</b></a> <a>2</a> <a><b>3</b></a></root>
returns an assigment stream
$var := <a><b>1</b></a>
$var := <a><b>3</b></a>
. The second <a/> is skipped, because it does not a <b/> child.

<t:loop> is the most powerful variant of these five syntaxes, not only does it generalize the other notations[9], it is the only one that can repeat multiple elements e.g. to separate odd and even rows: <t:loop><tr>{$odd}</tr> <tr>{$even}</tr> </t:loop>

A possible way to implement repeated elements is to modify the matching process of the parent of the loop node to match all its children and children of t:loop-elements against all nodes in the input document. This leads to a list of possible matches for each node in the pattern, which can be used to find the actual matches by taking as many matches from that list as possible, such that all following nodes still have matches (by going backwards and choosing the last possible match before the one chosen for the next node, we know how many we can take at maximum). Or it could model the links between pattern nodes and possible matches as flow network and solve it with a standard max-flow algorithm.

Alternatively, an implementation can simply repeat the matching of the elements in the loop as long as possible, and when a matching error occurs, continue the matching with the next element. If during the continuation another error occurs, the last matching of the elements in the loop is rolled back, and the next element tried again. Repeatedly, till it succeeds, or cannot find enough possible matches.

Conditionals

In some cases it is necessary to test arbitrary conditions that cannot be expressed directly in a pattern, e.g. for conditions that are not local like two values depending on each other. For this we use an attribute t:condition, which stores an XQuery expression that an input element has to satisfy in order to be accepted as a valid match, similarly to the expression contained in a filter expression. The context item . is set to the node currently checked for a match.

For example the pattern <e t:condition="exists(@a) and @b eq ."/> matches <e a="" b="1">1</e>, but neither <e b="1">1</e> nor <e a="" b="1">2</e>.

Another use of conditionals is to skip parts of the pattern. E.g. if you have two data sources which are similar, but not identical, you might want to use the same pattern for both, ignoring the parts of the pattern which only apply to the respective other data source. This can be accomplished with the <t:if> element, whose children are only matched, if the XQuery expression given in the test attribute of the <t:if> element evaluates to true. During that evaluation the element that is the match of the parent element of the <t:if> is used as context item.

For example <x><y>{$y}</y><t:if test="$y eq 17"><z>{$z}</z></t:if></x> matched against <x><y>1</y><z>2</z></x> returns $x := <y>1</y> and matched against <x><y>17</y><z>2</z></x>, it returns $x := <y>17</y>, $y := <z>2</z>.

Most languages support an else statement to be used with their if statement, so we have added a similar <t:else/>-element to the pattern syntax. The children of a <t:else/> element are only considered for matching, if the children of the preceding <t:if/> element have been ignored. For example <root><t:if test="$check"><a>{.}</a></t:if><t:else><b>{.}</b></t:else></root> matched against <root><a>1</a><b>2</b></root> returns <a>1</a>, if $check is true, and <b>2</b> otherwise.

If the <t:if> element is used to give a condition for a single element, it is quite cumbersome to surround just that element with <t:if>..</t:if>. Therefore we allow the t:test attribute to be used directly on elements, as abbreviation for a surrounding <t:if>. E.g. <a t:test="$check"/> is equivalent to <t:if t:test="$check"><a/></t:if> and ignored if $check is false.

t:test might be confused with t:condition by someone just seeing a pattern without having read this paper. but they are very different from each other. t:test is a condition the element in the pattern (given the previous variables) has to satisfy, is checked (at most once, outside t:loop) before a match for this element is searched and only a positive return value can lead to a matching error. t:condition is a condition the element in the input data has to satisfy, might be checked for every element in the input data, and only a negative return value can lead to a matching error.

Alternative elements

Sometimes different elements may occur in the input data and should be accepted as matches, but it is not known a priori which one actually exists. It would be possible to handle this with optional elements and conditions, but that would be cumbersome. Instead the patterns support a <t:switch> element, which contains different alternative patterns, i.e. it matches an element in the input successfully, if any of its children match that element successfully.

For example <t:switch><a>{.}</a><b>{.}</b></t:switch> applied to either <a>1</a> or <b>1</b> returns the input node (with value 1).

There occurs a strange effect, if all the children exist as elements in the input data, but in a different order, like <root><x><b>B</b></x><a>A</a></root> for the previous example. The <t:switch/> element as described so far would return the b-element, because <t:switch/> is first matched against <root> and <x>, which are not accepted by any of its children, and then against <b>, which is matched by the second child. So <t:switch> always chooses the first matching element in the input.

However, it has turned out that we often want the element matching the earliest child that has a match, e.g. when combining multiple patterns to a single one[10] . So we have added an attribute prioritized, which enables "prioritized" matching, i.e. performing the matching of the children in order and accepting the first match of the earliest child. E.g. <t:switch prioritized="true"><a>{.}</a><b>{.}</b></t:switch> would return the a-element when matched against the previous example input document.

Another interesting use of the <t:switch> element is to perform unordered matching, when combined with a loop. For example <t:loop><t:switch><a>{.}</a><b>{.}</b></t:switch></t:loop> returns all (not nested) <a/> and <b/> nodes of the input document. So applied to the example input above it returns <b>B</b> and <a>A</a>. (without the loop and switch, it would raise a matching error due to the different order). This unordered matching occurs automatically, without handling this case in an implementation, since the switch-element is matched against every element in the input.

A possible abbreviation for a <t:switch> element could be the regex inspired (<a>{.}</a>|<b>{.}</b>). We have, however, not implemented this, because a regex is allowed to start with a | (matching an empty string), so it might not be clear, if a switch abbreviation or a regex for text node matching is meant[11].

Default options

The way string values are matched can be changed with the <t:meta/>-element, which sets the comparison function and case-sensitiveness for text nodes and attributes:

<t:meta [text-matching="eq|matches|starts-with|ends-with|contains|list-contains"] 
        [text-case-sensitive="true|false"] 
        [attribute-matching="eq|matches|starts-with|ends-with|contains|list-contains"] 
        [attribute-case-sensitive="true|false"] 
        />

For example <t:meta text-case-sensitive="false"><a>foobar</a></t:meta> matches <a>foobar</a> and <a>FOOBAR</a>, while <t:meta text-case-sensitive="true"><a>foobar</a></t:meta> matches only former input.

For individual attributes the string value matching can be controlled with the <t:meta-attribute/>-element:

<t:meta-attribute 
        name="..."
        [matching="eq|matches|starts-with|ends-with|contains|list-contains"] 
        [case-sensitive="true|false"] />

For example <t:meta-attribute name="x" case-sensitive="false"><a x="y"/></t:meta-attribute> matches <a x="y"/> and <a x="Y"/>, while <t:meta-attribute name="x" case-sensitive="true"><a x="y"/></t:meta-attribute> matches only former input.

A meta element changes the options for all its children, and, if it does not have children, for all following elements due to legacy reasons.

Some people might wonder why we have added an option for setting the case sensitiveness instead an option for setting a collation. The reason is that the most important text/attribute matching mode is matching with regular expressions, which cannot use a collation. And we do not want to have two different options for basically the same.

Matching JSONiq

After having specified a pattern syntax for XML data, it appears reasonable to extend it to other data storage formats, like JSON. Recently a JSON-query language called JSONiq [JSONiq] was announced as extension to XQuery, which we have implemented and plan to extend with pattern matching.

To perform pattern matching on JSONiq types, we need to define the matching process for all new types added by JSONiq: null, objects and arrays[12]. As well as for the standard atomic XQuery types derived from xs:anyAtomicType, because the previous sections only defined the matching for nodes.

A small problem is that none of the JSONiq types support namespaces, so we cannot do advanced matching with a t: prefix, like in the XML case, not even to include selector expressions. However, the important JSONiq types, arrays and objects, can contain XML nodes, so we can wrap the selector expressions in <t:s/>-elements and include them like that in the JSONiq item.

  • Defining a natural matching for atomic types and null is easy: These values matches another value, if the eq-operator returns true. Due to the definition of the equality of null in JSONiq, it follows null matches null and nothing else.

  • An object pattern matches a value, if that value is an object and the value of every property of the pattern matches the value of the corresponding property of the value.

  • For arrays there are three reasonable ways to define array matching: An array pattern $a matches an value $b, if

    • Exact matching: element $a[$i] matches $b[$i] for all $i in 1 to jn:size($a), and $a and $b have the same number of elements.

    • Exact prefix matching: element $a[$i] matches $b[$i] for all $i in 1 to jn:size($a). I.e. after the exactly matched beginning of the array $b there might follow arbitrary other elements.

    • Existence matching: element $a[$i] matches an $b[$j($i)], such that $j($i) < $j($i+1), for all $i in 1 to jn:size($a). I.e. every element of $a can be matched, but there might be arbitrary many elements between each match.

    Each way represents a different compromise between the goal 2 (validation) and goal 3 (minimality). We have decided to use the last way, giving priority to minimality[13].

An example of a pattern according to above definitions is

{"a": [1,2,3], "b": null, "c": <t:s>{.}</t:s>}
which matches
{"a": [1,2,3], "b": null, "c": "foobar"}
returning foobar or
{"a": [1,"u",2,"v",3], "b": null, "c": [7,8,9], "d": 17}
returning [7,8,9].

But it matches neither

{"a": [1,2,3], "b": [4,5], "c": "xyz"}
nor
{"a": [1,2,3], "c": "foobar"}

Since writing <t:s/> is extremely cumbersome in a JSONiq pattern, it might be beneficial to allow abbreviations like {"c": .}, {"c": $result} or {"c": $result := .} in an implementation, in which the <t:s> is omitted. This syntax is mostly unambiguous, however it cannot be evaluated and stored as JSONiq type. Therefore, such an implementation must be able to partially evaluate JSONiq types, keeping it in AST form or serializing the XQuery-AST back to a reparsable query wrapped in <t:s>. Another alternative might be to use inline functions instead <t:s>, e.g function(){$result := .} replacing <t:s>$result := . </t:s>, in which case no serialization of XQuery expressions is needed. But we will not consider such extensions in the remaining paper.

The integration in XQuery

As function, without syntax modification

Since every pattern is a valid XML element (or JSONiq item), it can be represented by valid values in the XQuery type schema, and it is possible to write a matching function that takes a pattern and input data as arguments and performs the matching in standard XQuery (or JSONiq). Such a function is useful to perform pattern matching in existing XQuery implementations without having to write a new implementation, and to match with dynamically constructed patterns that are not hardcoded in the query.

To use the full power of selector expressions within a pattern, the underlying implementation has to support some kind of eval function to evaluate an XQuery given as string. And of course, only the right-hand-side of := assignment can be passed to the eval-function and the variable names have to be tracked separately.

The type declaration of our matching function is:[14]

pxp:match($pattern as item(), $data as item()*) as item()*

The first argument is the pattern that should be matched against every item in the $data sequence, thereby creating multiple assignment streams which are concatenated. Since $pattern is an ordinary value, the {...} selector expressions have to be encoded as {{...}}, if the pattern is created within the query by an element constructor.

Principally the function should return the final assignment stream, but an assignment stream is not a valid XDM instance. So we wrap the result in an XQuery map which maps variable names to a sequence of values and which can be read like a function call. For example the map $map := {"foo": "bar", "x": ("y", "z")} contains two keys which can be accessed with $map("foo") = "bar" and $map("x") = ("y", "z"). In XQuery 3.0 such a map can be implemented as an anonymous function that returns a certain value if called with the corresponding key. A theoretical alternative to a map is to wrap the data in an XML element, however, that would destroy information like the parents of the returned items[15].

Since sequences cannot nest, assignments of sequences to variables are flattened in the returned sequence. e.g. the assignment stream $a = (1, 2), $a = 3, $a = 4 becomes a single sequence in the map {"a": (1,2,3,4)}. If the only assigment is to the default variable, we do not wrap it in a map and return the value directly.

For example

pxp:match(<root><foo>{{.}}</foo>+</root>, <root><foo>1</foo><foo>2</foo></root>)
has internally the assigment stream
. = <foo>1</foo>
. = <foo>2</foo>
and returns
(<foo>1</foo>, <foo>2</foo>)

Another example with variables is

pxp:match(<root attrib="{{$attrib}}"><foo>{{$var}}</foo>+</root>, <root attrib="ABC"><foo>1</foo><foo>2</foo></root>)
which has the assigment stream
$attrib = "ABC"
$var = <foo>1</foo>
$var = <foo>2</foo>
and returns
{"attrib": "ABC", "var": (<foo>1</foo>, <foo>2</foo>) }

Extending Typeswitch-Expressions

XQuery has two different kinds of "switch" expression, the switch expression to test if a value is equal to one of many cases and the typeswitch expression to test if a value has the same type as one of the cases. It is not clear which "switch" expression we should extend, if we want to create the most natural syntax. On the one hand typeswitch is often used to test if an element has a certain structure (in combination with an XML schema type), the same purpose a pattern has, on the other hand a pattern does not do any type checking and instead only cares about the value (like switch or deep-equal). In the end we have decided to extend the typeswitch expression, since there a pattern can be given in a case clause, because a direct element constructor is not allowed within a typeswitch case in standard XQuery, while it is allowed to use a direct element constructor in a switch case (denoting the atomization of the constructed element and comparing it to the atomized input of the switch expression). The formal grammar of this new typeswitch-expression with patterns in the case clauses is then:

TypeswitchExpr 	   ::=  "typeswitch" "(" Expr ")" CaseClause+ "default" ("$" VarName)? "return" ExprSingle
CaseClause 	   ::=   "case" CaseClauseOperand "return" ExprSingle
CaseClauseOperand  ::=  ("$" VarName "as")? SequenceTypeUnion | PatternExpr

A pattern given in a case clause is not evaluated before the matching. So writing <x>{$y}</x> is sufficient for a pattern that assigns <x/> to $y and it does not have to be encoded as <x>{{$y}}</x>. So the variables that could be created by the pattern are statically known, and can be added to the static context of the expression in the corresponding return clause.

We define PatternExpr to be either an XML pattern with an optional loop marker (e.g. +) or a JSONiq constructor:

LoopIndicator   := OccurrenceIndicator | "{" Digits "}" | "{" Digits "," Digits "}" 
PatternExpr     := DirectConstructor LoopIndicator? | ObjectConstructor | ArrayConstructor 

This syntactical definition is broader than the semantically allowed values (e.g. function calls would not be allowed), hence some expressions are rejected during evaluation.

The typeswitch expression is then evaluated as follows:

All case clauses are processed in order.

  • If a normal case clause is meet, it is "matched" like in standard XQuery.

  • If a case clause with a pattern is meet, its pattern is matched against the input as described in the previous sections.

    • If a matching error occurs, this clause is skipped (the error is not propagated upwards).

    • If the matching succeeds, the value of the typeswitch expression is the value of the current return expression. All variables that have been created by the pattern are available in the evaluation of the return expression and, if a value has been assigned to the default variable, it replaces the context item.

So for example, either

typeswitch (<y>123</y>) 
  case <x>{$data}</x> return "The x-data is " || $data
  case <y>{$data}</y> return "The y-data is " || $data
  default return "foo"
or
typeswitch (<y>123</y>) 
  case <x>{.}</x> return "The x-data is " || .
  case <y>{.}</y> return "The y-data is " || .
  default return "foo"
returns The y-data is 123.

If multiple values have been added to a single variable, like in <x>{$var}</x>+ they are all flattened to a variable containing a sequence. However, if the assignment was to the context item, an exception is thrown, since . cannot be a sequence.

So we can use the following example to count all links on a webpage:

typeswitch ($inputdata) 
  case <a>{$a}</a>+ return "There are " || count($a) || " links on the webpage"
  case <html/> return "There are no links on the webpage"
  default return "Invalid input"

Extending Flowers

Before we can extend the flower expressions, we need to extend the tuple stream, so it behaves more like an assignment stream. In XQuery implementations this stream assigns certain values to certain variables. Since our patterns also allow assignments to a default variable with {.} which is then mapped to the context item ., the tuple stream has to be able to not only store normal variables, but also this default variable. We will write such assignments

(. = 1)
(. = 2)
in this paper.

We then could write for . in $sequence return expression ... meaning the same as $sequence ! expression ... or let . := $value return expression meaning the same as exactly-one($value) ! expression ... in a query[16].

Extending let

We extend the let-expression by allowing patterns to be used as binding variable, i.e. by adding a MatchLetClause to the InitialClause-EBNF defined as follows:

MatchLetClause         ::=    	"let" PatternLetBinding ("," PatternLetBinding)*      
LetBinding             ::=    	PatternExpr ":=" ExprSingle

During evaluation this pattern is matched against the right-hand-side of := and the resulting variables are added to the tuple stream, so they can be used just like in the return clause of an extended switch-expression.

For example

let <a>{.}</a> := <root><a>123</a></root> return .
creates the tuple stream
(. = <a>123</a>)
and therefore return <a>123</a>.

As mentioned when extending the typeswitch-expression, multiple assignments are flattened to a sequence. E.g. let <a>{$a}</a>* := <root><a>1</a><a>2</a><a>3</a></root> creates the tuple stream ($a = (<a>1</a>, <a>2</a>, <a>3</a>)).

The practical usage of such an expression is to validate that (schema-less) data has a certain structure, without creating a new XML schema or using a lot of if-expressions. E.g. let <xml><a>{$a}/<a><b>{$b}</b><c>{$c}</c></xml> := <xml><a>{123}/<a><b>{456}</b><c>{789}</c></xml> return ($a, $b, $c) queries the three elements and throws an exception, if they do not occur or have a different order.

Extending for

Similarly to the let-expression we extend the for-expression by allowing a pattern as binding variable. Hence we add a new initial clause MatchForClause:

MatchForClause         ::=    	"for" PatternForBinding ("," PatternForBinding)*
PatternForBinding      ::=    	PatternExpr "in" ExprSingle

During evaluation this pattern is matched against the right-hand-side of in similar to the evaluation of the extended let. However, this time the resulting variables are not directly added to the tuple stream. Instead each assignment within the pattern creates a new row in the stream, in which only the currently assigned variable has a value. This also means that a sequence assigned to a variable is not flattened.

For example

for <a>{.}</a>+ in <root><a>1</a><a>2</a><a>3</a></root> return .
creates the tuple stream
(. = <a>1</a>)
(. = <a>2</a>)
(. = <a>3</a>)
and return every a-element.

And

for  <html><h2>Title A</h2><p>{.}</p>+<h2>Title B</h2></html> in $input-data return .
returns all paragraphs between Title A and Title B, a natural appearing query that would be complicated to express with standard XPath.

When multiple variables are used in the pattern, they all occur in the tuple stream, but only one is non-empty. For example:

for <root><a>{$a}</a><b>{$b}</b><a>{$a := (., .)}</a></root> in <root><a>1</a><b>2</b><a>3</a></root> return .
results in the tuple stream
($a = <a>1</a>,             $b = ()),
($a = (),                   $b = <b>2</b>)
($a = (<a>3</a>, <a>3</a>), $b = ())

Combinations of let and for as well as the intermediate clauses can be used as normally in standard XQuery, since they change the tuple stream in a certain way, independent of the source of that tuple stream.

Using pattern matching for data mining

In this section we summarize our experiences and observed issues when using the pattern matching for data mining from various webpages. (which is actually the purpose we originally designed the patterns for.)

The webpages in question are the catalogs of over 175 different libraries which were using 15 different library systems with, in total, around 100 individual HTML pages. We have created a pattern for each of these 100 pages, ranging from querying all the data on the page to querying a single link. With these patterns we mirror the entire functionality of the catalog, like searching for books, getting detailed bibliography data, ordering books for lending, showing the items lend by a patron, renewing loans... A list of these libraries is available on the German VideLibri webpage and the patterns itself can be found in its source repository.

We have observed the following positive aspects when applying the pattern matching:

  • The patterns have vastly simplified the task of mining data from a new webpage. As mentioned above a pattern "can be created by copying the input data, completely removing all data that should not be queried for, replacing the remaining data with annotations and finally calling it a pattern", so we can add a new webpage in a few minutes[17], the user account functionality (listing and renewing lendings ) in half an hour and all functionality of an entire new library system in three hours (Times as needed for the last added library. But often we discover corner cases/bugs after months of real world usage).

  • A more concise point is that the pattern is useful for "grounding" an XQuery expression. If we use path expressions to query data from the same element, we either have to prefix each expression with //all/ancestors/of/that/element/ (pointless redundancy) or create a new variable $var := //all/ancestors/of/that/element/ and prefix $var instead (many variables to name and keep track of). With patterns we create just one element <all><ancestors><of><that><element>{...}</element></that></of></ancestors></all> (or copy it from the webpage) and all expressions inside that element are relative to it. This is especially helpful, if there are additional expressions with partial identical ancestors like, //all/ancestors/of/that/element/foo/bar and //all/ancestors/of/x/y/z. The patterns make it obvious how they are related to each other and in which order they occur on the webpage, the path expressions not so much.

  • It is very easy to query for elements surrounded by certain siblings like <root><pre-sibling/><elements/>+<post-sibling/></root> or more specific <root><h2>Header 1</h2><p>{.}</p>+<h2>Header 2</h2></root> that often occur on webpages, and which is not easy in pure XPath.

  • Many people prefer to CSS selectors instead XPath/XQuery expressions, because they are better suited to process HTML, e.g. when querying for classes and ids, although those selectors cannot process the selected data. Our patterns are as easy to use as CSS selectors (especially due to special handling of class attributes) and provide a way to process the data with all the power of XQuery.

  • As expected the patterns are much shorter than queries in a non pattern based system. We do not know how many lines our implementation would have required without patterns, but we can compare it to another, independently developed, open-source project (using Java and CSS selectors) that attempts to mine data from some of these pages. For these pages, our pattern based implementation consists of 1641 lines (mostly copied from the original pages), while the other project had to use 3362 lines (assumed to be written manually).

  • The patterns are easy to unit test. For this we keep an archive of all ever mined webpages and match the corresponding pattern to each of them, testing if they still return the same assignment stream. This would be difficult to do for a bunch of path expressions, since these expressions are embedded within a query and do not return an assignment stream, so for every tested pattern there would be several path expressions to test.

We can conclude that the mining task would have been much more difficult and taken much more time without our patterns[18].

Further work

Nevertheless the practical application has shown that the patterns are still neither as easy and intuitive to use nor as powerful as we had hoped. The following list describes a few observed issues together with possible extensions that might solve them:

  • Within loops the validating-property works against its purpose, such that too strongly validating patterns in loops lead to skipped elements. For example, you can use the following pattern to read the 2nd column of all rows of a table that have an image in the first column:

    <tr><td><img/></td><td>{.}</td></tr>+
    But if (unexpectedly) a row does not contain an image
    <table>
      <tr><td><img></td>          <td>1</td></tr>
      <tr><td>img unavailable</td><td>2</td></tr>
      <tr><td><img></td>          <td>3</td></tr>
    </table>
    it is skipped, because the row does not match, which can be surprising to someone expecting either a correct matching of all rows or a matching error.

    In this example the issue is easily solved by marking the image as optional, either with a following ? or the t:optional attribute. However, someone creating the pattern has to remember to do that, and needs to know that there might be no image. But when patterns are created based on various samples of a webpage, we often cannot know which elements are optional and not.

    We have considered two possible extensions to the pattern syntax, which could solve the issue of skipped rows by requiring that all rows participate in the matching:

    • An option that all children of a certain element have to be matched, if one of them has been matched. (a ++ qualifier might fit well)

    • An option that disables the recursive matching, when searching the next match for a loop child, and only tests, if the loop child itself (and perhaps its n-th descendants for a fixed n) matches an element (as if the loop child had no children itself). If this first test succeeds, the recursive matching is performed to evaluate the contained selector expressions as usually, but if that recursive matching fails, the entire matching is aborted with a matching error. E.g. with this option, a pattern like <tr>...</tr>+ would first ignore the ... and just search the next tr-element. Afterward the ... would be matched against that element as usual, propagating exceptions.

  • There is no syntax to match or loop over table columns, instead rows. Although you can read the i-th column by prepending <td/>{i-1} to skip the previous i-1 columns, it is quite difficult to create a pattern that matches the i-th column on one page and the j-th column on another, or that handles different column orders on different pages[19]. This issue stems from the hierarchical structure of SGML and is problematic in every query language.

    To solve it, a pattern syntax not based on a tree structure is required. Maybe it is possible to add an option that connects the order of children of different elements in the pattern. Then the first row of the pattern could match the header, unordered, and the following rows were linked to the header row, such that the children are processed in the same order the header elements were assigned to their matches.

  • Some pages use a table layout with many nested tables. In these cases the table, tr, td-elements of a pattern that are supposed to match a single table might all match different tables. For example

    <table>
      <tr><td>Header</td></tr>
      <tr><td>{.}</td></tr>+
    </table>
    appears to be a good pattern to match all rows below a header row in a certain table. However, if table layouts are used like in
    <table>
      <tr><td><table>
        <tr><td>Header</td></tr>
        <tr><td>foo</td></tr>
      </table></td></tr>
      <tr><td>bar</td></tr>
    </table>
    the table-element matches the outer table (because it still contains a descendant Header text node), and the pattern selects all rows of that table, i.e. the last td-element containing bar. And not the td-element containing foo.

    A solution would be to add an option to only match direct children of certain elements, instead all descendants. Perhaps even enable it at default for tables, although this will cause new issues with table rows never being children of the table, but of a tbody-element. A possible notation for such an option would be an attribute t:max-delta containing a number that specifies the maximal accepted number of ancestors between the node matched by the node with the attribute and the node matched by its parent. It could lead to even more options like t:min-delta (minimal accepted number), t:delta (abbreviation for t:min-delta and t:max-delta together), t:max-sibling-delta (maximal accepted number of siblings between the node matched by the current node and the node matched by the previous sibling), ...

    However, such attribute options do not look like regular expressions and are not really "intuitive".

  • Despite our goal of minimality there is a lot of unnecessary redundancy in the pattern syntax. For example a frequently used pattern to read properties of a book is

    <table>
      <tr><td>Author:</td>   <td>{$book.author}</td></tr>
      <tr><td>Title:</td>    <td>{$book.title}</td></tr>
      <tr><td>Publisher:</td><td>{$book.publisher}</td></tr>
      <tr><td>Year:</td>     <td>{$book.year}</td></tr>
    </table>
    which contains many duplicated td and tr-elements. This is not a problem during the creation of the pattern, but if it needs to be adapted to changes (e.g. to read divs instead tds or have the value in the third column) each tag has to be updated, which is pointless work.

    So perhaps we should develop some kind of "meta-pattern" that can create a pattern programmatically within the pattern itself, like a macro in C++ or a function with access to the AST. Or just using pxp:match instead the hardcoded patterns in typeswitch/let/for.

  • The validation property was not as useful as expected to find errors. We assumed that an element throwing a matching error tells us, which element in the pattern was wrong/changed and should be adapted. But although this works in patterns like <a><b><c>{.}</c></b></a>, where we can get a matching error for a, b or c and know which element is missing in the input, in patterns containing multiple alternatives like <t:switch prioritized="true"><a>{.}</a><b>{.}/b> we get a matching error for the t:switch expression and cannot distinguish if the input was supposed to match a or b.

    Therefore the error reporting has to somehow consider all possible pattern branches and report the error for the most likely one.

  • Currently every pattern is represented by a valid XML file, which is useful for automatic parsing/processing and familiarity. However, the pattern syntax can be greatly simplified by allowing non-valid XML patterns. The first possible extension in this direction is to allow regular expressions as element and attribute names. A simple pattern like

    <a|b|c foo|bar="123" x.z="17" />
    could then replace this more complex, but equivalent pattern:
    <t:switch>
    <a foo="123" t:condition="exists(@*[matches(name(.), '^x.z$') and . eq '17'])" />
    <a bar="123" t:condition="exists(@*[matches(name(.), '^x.z$') and . eq '17'])" />
    <b foo="123" t:condition="exists(@*[matches(name(.), '^x.z$') and . eq '17'])" />
    <b bar="123" t:condition="exists(@*[matches(name(.), '^x.z$') and . eq '17'])" />
    <c foo="123" t:condition="exists(@*[matches(name(.), '^x.z$') and . eq '17'])" />
    <c bar="123" t:condition="exists(@*[matches(name(.), '^x.z$') and . eq '17'])" />
    </t:switch>

    The second extension is to allow other attribute value separators than the = sign which would specify the matching mode used for the string value. E.g. the symbols ==, ~=, ^=, $=, *= or |= could be used to specify the resp. matching mode eq, matches, starts-with, ends-with, contains or list-contains. For example foo ^= "bar" would match a foo attribute starting with "bar".

    The third is to allow omitted end tags. Currently half of the pattern consists of redunant end tags, which is not very useful. If we allow the old SGML syntax with closing short tags </> or null end tags <element/ ... / in the pattern, it could be much shorter. Since we already match the patterns against HTML as well, there is no point in keeping a pure XML syntax.

  • The patterns as defined so far only match against the string values of the nodes. It might be useful to combine them with XML schemas to match on typed values, distinguishing e.g. xs:integer(17) and xs:string("17").

  • Our current implementation performs naive backtracking, following exactly the recursive definition of the matching process, which is well suited for the fast testing of experimental syntaxes. It is, however, not very performant. It is therefore worthwhile to research more efficient ways to implement patterns. One faster way would be to use dynamic programming, using a n*m table tracking which pattern element can be matched to which input element. Another way might be to model the matching as a NFA like a regular expression. A third way would be an automated conversion of a pattern to normal path expressions. Such an approach could benefit from the existing research on query optimization, but might miss specific optimizations applicable only to the patterns. Especially problematic from an optimization POV is that variables created in a selector expression could later be used in a condition affecting the matching. This creates a basically infinite state space and makes most optimization techniques impossible to use. It might be a good idea to forbid the reading of the variables in the pattern, or implement the special case without such reading separately.

  • The basic node patterns do not have a syntax for back (like parent::) or forward (like following::) references. Therefore they seem to present a natural way to write queries with streaming parsers that read one element at the time and do not store preceding or following nodes. It is worthwhile to investigate how well and efficient they can be combined with such a streaming XML parser.

  • Once further extensions have been researched, it might be a good idea to convince the W3C to add patterns to the next XQuery standard.

Conclusion

We have presented a true pattern matching on XML and HTML nodes with a syntax based on regular expressions and XQuery, which provides an intuitive way to write queries for XML and HTML data. It was then successfully extended to matching on arbitrary JSONiq data, albeit with a less intuitive syntax.

It was then shown that it can be integrated naturally in XQuery typeswitch, let and for expressions, allowing one to use patterns in XQuery and XQuery expressions in patterns.

We conclude with the experiences of many years of pattern matching usage that such a matching is very useful to run queries on schema-less data like HTML pages, although there are still open issues and possible further syntax extensions that need to be researched.

Acknowledgment

This presentation and paper was supported by the Graduate School for Computing in Medicine and Life Sciences funded by Germany’s Excellence Initiative [DFG GSC 235/1].

References

[XMLQL] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. XML-QL: A Query Language for XML. 1998. http://www.w3.org/TR/1998/NOTE-xml-ql-19980819

[XML] W3C Extensible Markup Language (XML) 1.0. http://www.w3.org/TR/xml/

[XQuery] W3C XQuery 3.0: An XML Query Language. http://www.w3.org/TR/xquery-30/

[JSONiq] Jonathan Robie, Ghislain Fourny, Matthias Brantner, Daniela Florescu, Till Westmann, and Markos Zaharioudakis. The JSON Query Language. http://www.jsoniq.org/docs/JSONiqExtensionToXQuery/html-single/index.html

[Scala] Odersky, M., Altherr, P., Cremet, V., Emir, B., Maneth, S., Micheloud, S., ... & Zenger, M. An overview of the Scala programming language. 2004.

[Fischer2010] Peter Fischer, Aayush Garg, and Kyumars Sheykh Esmaili. Extending XQuery with a Pattern Matching Facility, in Database and XML Technologies, Lecture Notes in Computer Science. 2003.

[Hosoya2003] Haruo Hosoya and Benjamin C. Pierce. Regular expression pattern matching for XML, in Journal of Functional Programming. 2003. doi:https://doi.org/10.1017/S0956796802004410.

[Yao2004] J. T. Yao and Ming Zhang. A fast tree pattern matching algorithm for XML query, in Web Intelligence. IEEE. 2004. doi:https://doi.org/10.1109/WI.2004.10048.

[Fle2002] Flesca, Sergio, Filippo Furfaro, and Sergio Greco. A query language for XML based on graph grammars, in World Wide Web. 2002. doi:https://doi.org/10.1023/A:1019627607240.



[1] Available as standalone version under the name "Xidel" as command line tool and webservice, and under the name "Internet Tools" as GPL library for FreePascal.

[2] Many people do not even understand how to use the following or preceding axes, which becomes evident after watching XPath questions on stackoverflow for a while.

[3] However, in a new, not HTML-focused implementation matches as default for everything might be a more reasonable choice.

[4] We usually refer to the patterns as "templates" in our implementation. However, "pattern" is a more appropriate term.

[5] The content of the b-element, i.e. a sequence of all its children, might be more natural, but it cannot be assigned to the context item.

[6] It seems that with this behavior there is actually no point in having the variables anymore. But in many cases it is not possible to do this, and multiple assignments to the same variable are flattened.

[7] They also apply to <t:s>-expressions, but in the following we will restrict ourselves to {..} for breverity

[8] following-sibling:: cannot be used, since it encodes a sibling not a descendant relationship.

[9] In fact, our implementation converts everything to <t:loop> before matching.

[10] More specific example: We query a webpage that either contains certain data or an error message. Then we want a single pattern that returns the data, if there is some, and, only if that has failed, detects the error message. There could be a header before the data that looks like the error message, which it definitely should ignore, so the normal switch syntax would not work.

[11] Although one could say that a single | text node which would match everything or everything never makes sense and therefore can safely be considered an abbreviated switch expression.

[12] This section is purely theoretical, we have never used JSONiq matching in any practical application.

[13] Sequences could be handled similar to arrays, but we will not consider sequences here, since they do not occur as such in input data.

[14] The prefix pxp can be considered an abbreviation of "Pascal XQuery Project", another name of our XQuery implementation. Although the function pxp:match exists in our implementation it is there not written in XQuery and depends on our own special features.

[15] For the same reason we are talking about a "map" instead of a JSONiq "object". A standard JSONiq object would contain copies of the XML elements and therefore lose the parent relationships as well.

[16] Actually our current parser rejects these expressions, but they would be natural extensions. Also storing an assignment to . could not be distinguished from an assignment to $. in our implementation, so instead we store it as assignment to $$, which is an invalid variable and thus does not cause any ambiguousness.

[17] Outside the scope of this paper, we have simplified this creation process even further by developing a script that can create a pattern for a HTML file, by just selecting the relevant data with the mouse in Firefox.

[18] Although possibly not as much time, as we have spend developing a complete XQuery engine to integrate the patterns. Anyways, writing an XQuery engine is more fun than a library system client, since you can use the engine forever, while the libraries switch to another system, and you have to rewrite the client from scratch.

[19] Former case can be solved with optional columns, later only with both cases being included in the pattern as switchable subpatterns.

×

Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. XML-QL: A Query Language for XML. 1998. http://www.w3.org/TR/1998/NOTE-xml-ql-19980819

×

W3C Extensible Markup Language (XML) 1.0. http://www.w3.org/TR/xml/

×

W3C XQuery 3.0: An XML Query Language. http://www.w3.org/TR/xquery-30/

×

Jonathan Robie, Ghislain Fourny, Matthias Brantner, Daniela Florescu, Till Westmann, and Markos Zaharioudakis. The JSON Query Language. http://www.jsoniq.org/docs/JSONiqExtensionToXQuery/html-single/index.html

×

Odersky, M., Altherr, P., Cremet, V., Emir, B., Maneth, S., Micheloud, S., ... & Zenger, M. An overview of the Scala programming language. 2004.

×

Peter Fischer, Aayush Garg, and Kyumars Sheykh Esmaili. Extending XQuery with a Pattern Matching Facility, in Database and XML Technologies, Lecture Notes in Computer Science. 2003.

×

Haruo Hosoya and Benjamin C. Pierce. Regular expression pattern matching for XML, in Journal of Functional Programming. 2003. doi:https://doi.org/10.1017/S0956796802004410.

×

J. T. Yao and Ming Zhang. A fast tree pattern matching algorithm for XML query, in Web Intelligence. IEEE. 2004. doi:https://doi.org/10.1109/WI.2004.10048.

×

Flesca, Sergio, Filippo Furfaro, and Sergio Greco. A query language for XML based on graph grammars, in World Wide Web. 2002. doi:https://doi.org/10.1023/A:1019627607240.

Author's keywords for this paper:
Pattern matching; XQuery syntax extension; JSONiq syntax extension; Data mining