Holstege, Mary. “Application of Brzozowski Derivatives to JSON Schema Validation.” Presented at Balisage: The Markup Conference 2019, Washington, DC, July 30 - August 2, 2019. In Proceedings of Balisage: The Markup Conference 2019. Balisage Series on Markup Technologies, vol. 23 (2019). https://doi.org/10.4242/BalisageVol23.Holstege01.
Balisage: The Markup Conference 2019 July 30 - August 2, 2019
Balisage Paper: Application of Brzozowski Derivatives to JSON Schema Validation
Mary Holstege has been developing software in Silicon Valley for decades, in and around
markup technologies and information extraction. She currently works at MarkLogic Corporation,
where she mainly works on search. She holds a Ph.D. from Stanford University in Computer
Science, for a thesis on document representation.
Brzozowski derivatives are a technique for computing whether a string of symbols is
in the language defined by an extended regular expression. They have been applied
to content model validation in XML Schema, following the observation that a content
model is an extended regular expression over symbols in the vocabulary described by
the schema. This paper explores using an extension of Brzozowski derivatives to the
problem of model validation for JSON Schema. It turns out that this application requires
extending to "type-tagged" regular expressions, which provide an interesting way of
understanding certain matching problems outside of the problem of JSON Schema validation.
JSON Schema is a JSON vocabulary for defining valid JSON data. The data may be null,
simple – numbers, strings, or boolean values – or structured – arrays of JSON values
or objects consisting of key/value pairs. In JSON parlance the keys are called properties.
The JSON Schema specification defines a set of properties that define possible constraints
on the JSON values. In general these constraints are independent of one another.
Among the constraints are some that define what we might call, in the XML world, "content
models" or "content types". As such, we pose the question: can we apply similar techniques
as we use for XML Schema to implementing JSON Schema? In particular, can we use Brzozowski
derivatives to compute validity with respect to these "content models"?
Background and Related Work
Brzozowski64 defines the concept of a derivative of an extended regular expression with respect to a symbol, and to a sequence of
symbols. The derivative of an extended regular expression with respect to a symbol
is the extended regular expression that would need to be matched once that symbol
is consumed. By repeating the process as we walk through a sequence of symbols (the
input sequence) we end up with an extended regular expression that must match the
empty sequence once we have consumed the whole input sequence. An extended regular
expression is called nullable if it can be reduced to match the empty sequence. Brzozowski shows that the input
sequence is valid with respect to the initial extended regular expression if and only
if the derivative of that expression with respect to the input sequence is nullable.
Sperberg-McQueen (Sperberg-McQueen05) discusses the application of Brzozowski derivatives to various problems of XML Schema
processing, and offers a nice summary of some of the previous work in this area. Brzozowski
derivatives are reported [1] to be used in James Clark's JING Validator and Sun's MSV validator, and I know them
to be used in MarkLogic's validation support as well. There are no doubt others.
The essential idea in using Brzozowski derivatives for XML Schema content model validation
is to observe that, with a few extensions to the model as shown by Sperberg-McQueen,
XML Schema content models are extended regular expressions over the set of defined
element names (plus one "every other name" symbol). When the derivative of a sequence
of names from the input with respect to the declared content model is nullable, the
input matches the content model. Given the unique particle attribution property in
XML Schema, we can look up each symbol in the matching sequence, check simple type
constraints as necessary, and complete validation.
JSON Schema Overview
The definitions of JSON Schema (JSONSchema) and JSON Schema Validation (JSValidation) are evolving and managed for the most part outside of the scope of any standards
body (see https://json-schema.org/specification.html for the current state), although there are now IETF Internet-Drafts (cited above)
that track the changes. Here I describe draft 07.
The conceptual framework behind JSON Schema differs greatly from the traditions coming
from the XML community or the formal languages community and some of the vocabulary
is used differently as well. A "schema" is a JSON value: either an object or boolean
value. It defines a set of constraints against a JSON value, which may be an object
(a map of name/value pairs), an array of JSON values, or a simple number, string,
boolean, or null value. Certain properties of the schema object have defined meanings;
undefined properties are irrelevant and ignored. Certain properties contain schemas
as values. It is important to remember that the term "schema" does not mean the totality
of the set of constraints defined in a particular schema document and its referenced
schema documents but instead means the current set of constraints against a particular
JSON value.
A schema defines a set of constraints through a set of properties and their values.
Importantly, with a very few defined exceptions, the constraints apply independently.
Certain properties are ignored for certain values. A JSON value is valid with respect
to its schema when all the relevant properties are valid with respect to that value.
A few examples might help make this clear:
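A schema along the lines of the one discussed next (Figure 1) might be, with illustrative values:

{
  "minimum": 12,
  "pattern": "^a+$"
}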
Figure 1 shows a simple JSON Schema. The minimum property requires that the JSON value being validated be a numeric value greater
than twelve. The pattern property requires that the JSON value being validated be a string value consisting
of a sequence of 'a's. Is this an unsatisfiable schema? No. If the value presented happens to be a number,
it will be valid if it is more than twelve, because the pattern property will be ignored for numeric values. If the value presented happens to be
a string, it will be valid if it consists of a sequence of 'a's because the minimum property will be ignored for string values. What is more, if the value presented
happens to be a JSON object, it will also be valid, because both the minimum and pattern properties will be ignored!
We can fix the schema to disallow that last scenario by adding another property:
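For instance (again with illustrative values):

{
  "type": ["number", "string"],
  "minimum": 12,
  "pattern": "^a+$"
}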
In Figure 2 we have a schema that will match either a number greater than twelve or a sequence
of 'a's and nothing else.
Similar kinds of scenarios play out with constraints on JSON objects.
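A schema in the spirit of Figure 3, with the property names taken from the discussion below, would be:

{
  "type": "object",
  "required": ["a", "b", "c"],
  "properties": { "a": true, "b": true, "c": true },
  "additionalProperties": false
}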
The schema in Figure 3 needs all four[2] of the properties to ensure that the value is an object with exactly three properties
named a, b, and c. The type property excludes other kinds of values such as numbers. The required property excludes objects that do not have all of the indicated properties. The properties property defines constraints for a certain set of named properties (those constraints
being the boolean schema true which means anything is valid). And finally the additionalProperties property excludes objects that have other properties we haven't mentioned using a
boolean schema of false (always invalid). There are no ordering constraints or occurrence constraints: objects
are unordered maps in which keys can only occur once.
This is obviously a little more convoluted than just saying the content model is a & b & c. It is easy to imagine we might have a mapping that gets us such a content model,
however, and we will see that in a moment. Nevertheless, the reader may be forgiven
for suspecting that applying regular expression derivatives to validation here is
a dubious enterprise. Hold on to that thought! One more example:
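One way to write such a schema (the particular keywords chosen here are an illustrative sketch) is:

{
  "oneOf": [
    {
      "type": "object",
      "required": ["a", "b", "c"],
      "properties": {
        "a": { "type": "number", "exclusiveMinimum": 0 },
        "b": { "type": "number", "exclusiveMinimum": 0 }
      },
      "additionalProperties": { "type": "number" }
    },
    {
      "type": "object",
      "required": ["a", "b", "d"],
      "properties": {
        "a": { "type": "number", "exclusiveMaximum": 0 },
        "b": { "type": "number", "exclusiveMaximum": 0 }
      },
      "additionalProperties": { "type": "number" }
    }
  ]
}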
The schema in Figure 4 will successfully validate objects that have two numeric properties named a and b, where either both numbers are more than zero, or both numbers are less than zero,
together with an additional numeric property named either c or d, depending on whether we are in the positive or negative case. We could map this
to the "content model" (a & b & c) | (a & b & d) but verifying that the numeric constraints hold would require a separate check that
essentially performs all the content model checking anyway. The reader now has probably
concluded that regular expression derivatives for JSON Schema validation are doomed.
Not so fast, reader! We will be adding some extensions to the Brzozowski model to
rescue the situation.
Before we get to the extensions to the model required, let's enumerate the JSON Schema
validation properties. The classification is not part of the JSON Schema specification:
it is an informational organization that we will make use of in a minute.
Property name | Relevance | Meaning | Classification
type | all | Data value kind: object, array, number, integer, string, boolean, null | type (special)
format | string | Required format such as date-time (optional) | type
enum | all | Set of possible values | type
const | all | Fixed value | type
minLength | string | Minimum length | type
maxLength | string | Maximum length | type
pattern | string | String matches regular expression | type
multipleOf | number | Number is multiple of this | type
minimum | number | Inclusive minimum | type
exclusiveMinimum | number | Exclusive minimum | type
maximum | number | Inclusive maximum | type
exclusiveMaximum | number | Exclusive maximum | type
minProperties | object | Minimum number of properties | type
maxProperties | object | Maximum number of properties | type
minItems | array | Minimum number of array items | type
maxItems | array | Maximum number of array items | type
uniqueItems | array | Are values in array unique? | type
properties | object | Child properties | composition
additionalProperties | object | Constraints on unenumerated properties | composition
required | object | Names of required properties | composition
items (array value) | array | Sequence of constraints for items in array | composition
items (single value) | array | Constraint on every item in array | composition
additionalItems | array | Constraints on additional unenumerated items | composition
contains | array | At least one item is valid | composition
anyOf | all | Valid against at least one subschema | composition
oneOf | all | Valid against exactly one subschema | composition
allOf | all | Valid against all subschemas | composition
not | all | Valid if not valid against subschema | composition
if, then, else | all | Conditional validation | composition
propertyNames | object | Schema constraint on the names of properties | type (special)
dependencies | object | Additional constraints if certain child properties are present | composition (special)
$ref | all | References the schema via a path rather than in place | special
$id | all | Sets a base URI for resolution of $ref properties | special
Extending the Model
The approach taken to applying Brzozowski derivatives to JSON Schema validation is
to map the compositional properties to extended regular expressions, where the symbols
will be composites of the JSON property names tagged with a type identifier. A further
mapping will gather up bundles of type properties in the schema; these bundles will
get a type identifier. Call these bundles types and their constraints facets. We will show how we can define nullability for these extended expressions and derivatives
of them with respect to JSON values. Some additional machinery and notation will need
to be in play for unnamed values, either as top-level values or array items.
In sum, here is the program of work:
Define a mapping from JSON Schema properties to extended regular expressions
Define a mapping from JSON Schema properties to types and facets
Define extension operators to account for all the composition properties
Define extensions to tag property names with type identifiers in the expressions
Define a type matching function
Define extensions for type-only items (top level and array items)
Define derivatives with respect to JSON values
Define derivatives over extended expressions
Define nullability over extended expressions
Sip wine in celebration
Mapping
Mapping performs two operations: mapping type properties to faceted types, and mapping
compositional properties to extended regular expressions. We will denote this with
two separate functions, μ and μM respectively. To complete the picture, we will also
need a function that provides the type with its name, and a lookup function that can
obtain the facets for a given named type. I use the form f(A) => B to say that the definition of the result of applying function f to A is B.
Basic Notation
{}
a schema with no constraints
true
true schema, equivalent to a schema with no constraints
convenience notation: ε if condition is true; ∅ otherwise
p1:v1
a property:value pair
{p1:v1,...,pn:vn}
map of property:value pairs, JSON object
[v1,...,vn]
array of values, JSON array
Mapping function μ
μ(schema)
maps schema to type facets {p1:v1,...,pn:vn}
Model mapping function μM
μM(schema)
maps schema to extended regular expression E
Key function κ
κ({p1:v1,...,pn:vn})
maps set of type properties to key uniquely identifying type
κ(ε) => ε
special case ε
κ(∅) => ∅
special case ∅
Lookup function λ
λ(key)
maps type key back to type facets {p1:v1,...,pn:vn}
λ(κ(P)) = P
identity: lookup of the key of type properties gets the properties back again
λ(ε) => ε
special case ε
λ(∅) => ∅
special case ∅
The type mapping function is fairly straightforward, although capturing the interactions
with the compositional properties gets a little messy in a functional paradigm. In
the formulation here, the extended regular expressions from μM are expressed as the output for a particular property (model) in the output of μ.
For the most part type properties in the schema map directly to facets in the type.
If there is a type property in the schema, its value will constrain which properties map to facets;
if it is an array of values, we will treat this as creating a content model disjunction
between the alternatives; and if it is absent we will treat this as creating a content
model disjunction between all[4] alternatives (array, object, number, string, boolean, null). Each alternative will only gather properties relevant for that specific type value.
In what follows, P is a map of property-value pairs {p1:v1,...,pn:vn} where each property name is unique and the + operator performs a union of maps. The rules are generally written so that the set
of property names is disjoint.[5] In addition, s1, s2, etc. refer to (sub)schemas.
The definition below shows the mapping, broken down for each of the distinct type values. Most properties can be mapped independently, although there are some exceptions.
That is, unless specified otherwise: μ({p:v,P}) => μ({p:v})+μ(P) and similarly μM({p:v,P}) => μM({p:v})+μM(P). Where there is a rule for a specific combination of properties, that rule takes
precedence. There are a few edge cases to take care of: boolean schemas and $ref dereferencing.
It will be useful to define some standard bundles of properties. The function relevant(P,kind) will return just the properties listed for the specific kind from the map of property-value
pairs. For example, relevant({minLength:1,maximum:12},"string") is {minLength:1}:
number : multipleOf, minimum, exclusiveMinimum, maximum, exclusiveMaximum, and those for any
null : those for any
array : minItems, maxItems, uniqueItems, items, additionalItems, and those for any
object : propertyNames, minProperties, maxProperties, properties, patternProperties, additionalProperties, required, and those for any
Properties relevant to certain types
First, we define how some edge cases are handled:
Figure 6: μ edge cases
μ(true) => true
μ(false) => ∅
μ(∅) => ∅
μ({$ref:path}) => μ(dereference(path))
The boolean schemas map to true or empty set as appropriate, and references map to the referenced definition. Specifying
the details of how the JSON path gets dereferenced is beyond the scope of this paper.
In what follows assume we have a fully dereferenced schema in hand.[6] We are also assuming the schema as a whole (including the references) is valid.
The type property is the main driver for the mapping. In general it gets mapped to a kind facet plus a simpleType facet for references to simple values and occasionally other facets as well. The
type names have the same meaning as in XSD2. Here are the rules for mapping the type property:[7]
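By way of illustration (a hedged example; the facet names follow the τ rules given later, and the full rule set is more involved than this single case), a schema such as {type:"number", minimum:12} maps along the lines of:
μ({type:"number", minimum:12}) => {kind:number, minimumInclusive:12}
with μM only coming into play for the compositional properties.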
Before we get to model mapping, we need to introduce some additional notation. The
notation is similar to Sperberg-McQueen05, which is a little different from Brzozowski64, but with a few alterations and extensions. JSON Schema does not need general occurrence
counts, and the wildcards required are relative to a set of property names, not namespaces.
Basic Extended Regular Expressions
ε : empty string (universal match)
∅ : impossible match
n : property name
n:v : name/value pair
F : model F
F@T : F with type identifier T (thus F = F@ε, a model F with no type identifier)
F? : optional F
F* : zero or more F
F,G : F followed by G
F|G : F or G (non exclusive)
F&G : F and G in any order
!F : F invalid
F⊕G : F xor G
(F?G:H) : conditional: if F valid then G otherwise H
<F,G> : both F and G simultaneously on same input
wc(pattern) : any key matching pattern
wc(-N,-P) : any key that is not a name in the name set N and that does not match a pattern in the pattern set P
wc(-N,+P) : any key that is not a name in the name set N and that does match a pattern in the pattern set P
The type marking, exclusive or, conditional, simultaneous, and wildcards are extensions.
Negation is present in Brzozowski64 although with a slightly different notation.
It may seem that F&G and <F,G> mean the same thing, but the difference will become clearer when we get to the definition
of derivatives: F&G matches F and G against different segments of the input, whereas <F,G> matches F and G against the same segment of the input simultaneously. For example, a@T & a@S can never be satisfied because each symbol is unique in a JSON object, so there can
only be one a matched, but <a@T, a@S> may well be satisfied if both type constraints are satisfied on the value of property
a.
The mapping of the basic composition operators is fairly direct:
Figure 13: μ for basic composition properties
μ({anyOf: [s1,...,sn]}) => {model: μM({anyOf:[s1,...,sn]})}
μ({oneOf: [s1,...,sn]}) => {model: μM({oneOf:[s1,...,sn]})}
μ({allOf: [s1,...,sn]}) => {model: μM({allOf:[s1,...,sn]})}
μ({not:s}) => {model: μM({not:s})}
μ({if:s1,then:s2,else:s3}) => {model: μM({if:s1,then:s2,else:s3})}
Figure 14: μM for basic composition properties
μM({anyOf: [s1,s2,...,sn]}) => μM(s1)|μM(s2)|...|μM(sn)
μM({oneOf: [s1,s2,...,sn]}) => μM(s1)⊕μM(s2)⊕...⊕μM(sn)
μM({allOf: [s1,s2,...,sn]}) => <μM(s1),μM(s2),...,μM(sn)>
μM({not:s}) => !μM(s)
μM({if:s1,then:s2,else:s3}) => (μM(s1)?μM(s2):μM(s3))
μM({if:s1,then:s2}) => (μM(s1)?μM(s2):true)
μM({if:s1,else:s3}) => (μM(s1)?true:μM(s3))
Mapping of the object-related compositional properties requires dealing with the interactions
and dependencies between them. The two main compositional properties are properties and patternProperties. Each of these consists of a set of keys and an associated schema for each key. In
the case of properties the key is the property name to match in the instance; in the case of patternProperties the key is a regular expression that matches the name of the property in the instance.
All properties named are optional unless their name appears in the required array. Properties otherwise unaccounted for must conform to the schema given for
additionalProperties, which defaults to the true schema if absent.
To simplify things somewhat, let's first ensure that when the additionalProperties property was not specified, we treat it as an explicit true schema, that is, that all additional properties are OK. We also want to normalize
the additionalProperties property when it is the false schema to remove it, and to normalize the properties and patternProperties properties to replace absence with an empty object, and the required property to replace absence with an empty array. Assume the following normalizations
apply one after another:
Figure 15: Normalizations for interdependent object composition properties
If ∀pi | pi != additionalProperties: μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn,additionalProperties:true})
If {p1:s1,...,pn:sn} = {additionalProperties:false,P}: μ({p1:s1,...,pn:sn}) => μ({P})
If ∀pi | pi != properties: μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn,properties:{}})
If ∀pi | pi != patternProperties: μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn,patternProperties:{}})
If ∀pi | pi != required: μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn,required:[]})
This leaves us with two variations to map: the bundle of properties P with the properties named properties, patternProperties, and required, and the bundle of properties with those plus the property named additionalProperties as well. In either case, the mapping creates a model property with the property bundle mapped by μM: {model: μM(P)}.
Capturing the semantics of additionalProperties requires some trickiness, because if its value is not the false schema, then it applies to any properties that are not named in properties and that do not match a pattern in patternProperties. We capture this in the wildcard which takes those properties and patterns as its
sets. We also capture the required property by marking appropriate properties as optional or not in the model. This
is an optimization: we could instead have a simultaneous compositor where one branch
handles the required properties and the other handles the type information. Both branches end up being
relatively weak, in general.
The semantics of patternProperties is particularly tricky:
Pattern properties do not count as additional properties.
More than one pattern property can match a specific input property, and all of the matching constraints
need to be enforced.
A pattern property may or may not match a specifically named property; if it does
its constraint must be enforced.
We may lack named properties (properties) but still have a constraint on "additional" properties (additionalProperties).
In general, every property listed is optional, unless it is specifically mentioned
in the required array. Capturing the semantics of required in an effective way is a little awkward algebraically. Note that it is
perfectly acceptable in JSON Schema for a property to be in the required array but not mentioned by either properties or patternProperties.
For example the following schema is perfectly OK, and will not match an object {"c":1,"dd":2} as it lacks the required a and b properties:
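Something along these lines, where the specific subschemas are illustrative:

{
  "type": "object",
  "required": ["a", "b"],
  "properties": { "c": true },
  "patternProperties": { "^d+$": true }
}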
With all that said, the general rules for mapping the object-related composition properties
are:
<
(q1@κ(t1) & ... & qk@κ(tk) &
 ql@κ(tl)? & ... & qm@κ(tm)? &
 u1@κ(sp) & ... & uv@κ(sp) &
 wc(-[p1,...,pm],+[r1,...,ro])*
),
(wc(r1)@κ(pt1) | wc(-[],-[r1]))*,
(wc(r2)@κ(pt2) | wc(-[],-[r2]))*,
...
(wc(ro)@κ(pto) | wc(-[],-[ro]))*
>
where
l = k+1
for i <= k, qi = pj and ti = sj and ∃rqh | pj = rqh
for i > k, qi = pj and ti = sj and ∀rqh | pj != rqh
for i <= v, ui = rqj where ∀ph | ph != rqj
μM for object composition properties
This looks complicated, but all it is saying is that our mapped content model consists
of a mix, in any order, of: properties that are named and required (q1 through qk), properties that are named but not required (ql through qm), properties that are not named but are nevertheless required (u1 through uv), a wildcard to pick up properties that are not named that match none of the patterns
but satisfy the schema for additionalProperties, and a wildcard to pick up properties that are not named but do match one of the
patterns. The wildcard clauses in the second and subsequent branches of this simultaneous
model ensure that properties matching patterns satisfy their appropriate schema: everything
matches the pattern and its corresponding schema, or doesn't match the pattern at
all. Each pattern gets its own simultaneous branch because an input property may match
more than one of them, in which case it would need to satisfy constraints from each
match.
If patternProperties is empty, as is often the case, this simplifies a great deal. We can drop the wildcards
that match against patterns and simplify the simultaneous model with one branch to
just that branch:
Notice that we still get fairly weak content models here: where every property can
occur in any order, optional unless specifically required, with all kinds of wildcards
unless additionalProperties is explicitly false and there are no patternProperties. This has performance implications, as it creates larger and more complex derivatives.
In JSON objects, every property name is unique, however. We can use this knowledge
to strengthen the wildcards in the first branch by adding any required properties
to the excluded list. (In the examples we will assume we do this.) Other optimizations
are possible. We will return to this point later.
Mapping the array-relevant compositional properties follows a similar paradigm, but
an additional class of extension is in order, because array members do not have any
symbol associated with them, just values. However, given our type constraint scheme,
we still have something to say about such things; we just need a little more notational
machinery:
Non-symbol extensions
•@T : type-only, no symbol
• = •@ε : no symbol, no type; i.e. the value must exist
Given the type-only expressions, we are now in a position to map array properties.
The items property in JSON Schema comes in two varieties with different semantics: as a single
schema, or as an array of schemas. If it is a single schema, then that schema applies
to every member of the array, and additionalItems is irrelevant. If it is an array, then each member of the array applies one to one
to the values in the array in order, with any surplus values in the array matching
the additionalItems schema, if any:
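One plausible shape for these rules in the notation above (a sketch only; the treatment of optionality and of additionalItems needs more care than shown here):
μM({items: s}) => (•@κ(μ(s)))*
μM({items: [s1,...,sn]}) => •@κ(μ(s1))?, •@κ(μ(s2))?, ..., •@κ(μ(sn))?
μM({items: [s1,...,sn], additionalItems: s}) => •@κ(μ(s1))?, ..., •@κ(μ(sn))?, (•@κ(μ(s)))*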
The reason for the complexity of the rules for items is this: just because the property defines type constraints for three slots, that
doesn't mean a matching array must have three entries in it. That is handled by a
different constraint property, the minItems property. We could apply it here to remove some of the optionality.
In other words, array constraint properties map to type-only expressions. The additionalItems property is only relevant if the items property enumerates a sequence of type constraints for each item in the array.
Mapping the dependencies property is a little confusing because its value is a JSON object where each value
is either a subschema or an array of property names. The semantics are very different,
and the dependencies property is likely to bifurcate in future drafts of JSON Schema. Each item in the
JSON object maps separately, using a different rule depending on whether its value
is a schema or an array.
That is, if we have the property named p plus a bunch of other stuff not named p, the object must match either the given schema, or must include all the named properties
(depending on which form was used).
It should also be noted that although at first blush we have made so many extensions
to regular expressions that it seems likely that none of the supporting results would
hold, in fact we have hardly extended anything substantively. Instead of the finite
vocabulary consisting of the symbols in the schema we have a vocabulary consisting
of the symbols crossed with the type identifiers plus the • symbol. It is still a finite vocabulary.[9] The operators and functions we have added, being boolean operators, fit into the
rules defined by Brzozowski. True, we have made the rules for matching more complex,
but one could also think of them as just being a funny kind of equality. On the other
side, while our input can consist of an arbitrary number of distinct values, they
are all just annotations on that finite vocabulary. The infinitude of values will
concern us when we get to caching, however.
Derivatives
Given a JSON Schema mapped in this way, we can compute derivatives of those expressions
with respect to JSON values. Most of these rules come directly from Sperberg-McQueen05 or Brzozowski64 or can be derived by Brzozowski's observation that the derivative of any boolean
operator over some operands is the operator over the derivatives of those operands.
Algebraically: Δf(R)/Δa = f(ΔR/Δa) for any boolean function f. A key difference is that when we get to the leaves, we have to do type matching
as well as just symbol matching. Let us introduce a little more notation:
Derivative function Δ
ΔX/Δy : compute a new expression E, the derivative of expression X with respect to JSON value y
Nullability function ν
ν(model) : produces ε if ε is in the language of model; ∅ otherwise
Typematch function τ
τ(value,type) : produces ε if value matches the type constraints; ∅ otherwise
JSON values
{p1:v1,...,pn:vn} : a JSON object with property/value pairs
[v1,...,vn] : a JSON array with a sequence of values
n = n:• : a name with no particular value
v = •:v : a value with no name
Taking lowercase a and b as symbols and uppercase E, F, and G as expressions, the derivatives of basic expressions (without type matching) are
as follows:
Figure 23: Basic derivative rules
Δb/Δa => [a=b]
Δε/Δa => ∅
Δ∅/Δa => ∅
Δtrue/Δa => true
Δ(E?)/Δa => ΔE/Δa | ν(E)
Δ(E*)/Δa => ΔE/Δa, E*
Δ(E,F)/Δa => (ΔE/Δa),F | ν(E),ΔF/Δa
Δ(E|F)/Δa => ΔE/Δa | ΔF/Δa
Δ(E&F)/Δa => (ΔE/Δa & F) | (E & ΔF/Δa)
Δ(!E)/Δa => !ΔE/Δa
Δ(E⊕F)/Δa => (ΔE/Δa) ⊕ (ΔF/Δa)
Δ(E?F:G)/Δa => (ΔE/Δa ? ΔF/Δa : ΔG/Δa)
Δ<E,F>/Δa => <ΔE/Δa, ΔF/Δa>
Δwc(pat)/Δa => [matches(a,pat)]
Δwc(-[p1,...,pn],-[r1,...,rn])/Δa => [∀pj: a!=pj ∧ ∀rj: !matches(a,rj)]
Δwc(-[p1,...,pn],+[r1,...,rn])/Δa => [∀pj: a!=pj ∧ ∃rj: matches(a,rj)]
To take the derivative of an expression with respect to a JSON object, first note
the edge case for an empty sequence is defined so that:
ΔE/Δε => E
Then observe that for a sequence of symbols u followed by a symbol a:
ΔE/Δ(u,a) => Δ(ΔE/Δu)/Δa
Given the set of properties of an object p1, p2, etc., the derivative with respect to the object is the union of the derivatives of each possible sequence starting with some pj:
ΔE/Δ{p1:v1,...,pn:vn} => Δ(ΔE/Δp1:v1)/Δ{p2:v2,...,pn:vn} | ... | Δ(ΔE/Δpn:vn)/Δ{p1:v1,...,pn-1:vn-1}
Adding type matching to the picture doesn't change the rules much although they look
a little more complicated. All the rules are the same, just carrying through the type
annotation on the expressions, except for the leaf cases where we need to perform
a type check as well as a symbol matching:[10]
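The essential leaf rule, in a form consistent with the worked examples below (a sketch; the special cases mentioned in the footnote are elided), is that a type-tagged name consumes a property only when both the name matches and the value satisfies the type:
Δ(b@T)/Δ(a:v) => [a=b], τ(v,λ(T))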
Now that we have type-aware derivatives for expressions over symbols, we still need to
account for arrays and top-level values that have no symbols, and to compute derivatives
over these type-only expressions. Arrays unroll like sequences:
Figure 26: Derivatives for arrays and nameless values
ΔE@T/Δ•:[v1,v2,...,vn] => τ([v1,v2,...,vn],λ(T)), ΔE/Δ(•:v1,•:v2,...,•:vn)
ΔE/Δ(•:v1,•:v2,...,•:vn) => (...(((ΔE/Δ•:v1)/Δ•:v2)/...)/Δ•:vn)
Δ(•@T)/Δ•:v => τ(v,λ(T))
Δ•/Δb => ∅
Δa/Δ• => ∅
Δwc(pat)/Δ• => ∅
Δwc(-[p1,...,pn],-[r1,...,rn])/Δ• => ∅
Δwc(-[p1,...,pn],+[r1,...,rn])/Δ• => ∅
Δ•/Δ• => ε
Δε/Δ• => ∅
What this says is that the derivative of an array can be treated mostly as a sequence
of non-symbol values, although we need to consider the type constraints on the array
as a whole. A symbol cannot match a non-symbol, so derivatives of that kind are empty.
A derivative of a non-symbol with respect to a non-symbol value comes down to the
type matching function against that value.
Nullability
The nullability function decides whether an expression can be matched against nothing.
When the derivative of an expression with respect to the input is not nullable, that
means that either it would need additional input in order to match or it has already
run down a blind alley and there is no way to match what is there. Therefore the data
cannot be valid with respect to that expression. Conversely, when the derivative is
nullable, it means that the input has been successfully matched by the regular expression
(or consumed by the state machine, if you prefer) and no additional input is necessary
for a successful match.
The rules for the nullability function are as follows:
Figure 27: Rules for nullability function ν
ν(ε) => ε
ν(∅) => ∅
ν(true) => ε
ν(a) => ∅
ν(E*) => ε
ν(E,F) => <ν(E), ν(F)>
ν(<E,F>) => <ν(E), ν(F)>
ν(E|F) => ν(E) | ν(F)
ν(!E) => ε if ν(E) = ∅; ∅ if ν(E) = ε
ν(F&G) => <ν(F), ν(G)>
ν(F?G:H) => (ν(F)?ν(G):ν(H))
ν(F?) => ε
ν(F⊕G) => ν(F) ⊕ ν(G) = (ν(F)?ν(!G):ν(G))
ν(wc(pat)) => ∅
ν(wc(N,P)) => ∅
ν(F@T) => ν(F)
ν(•) => ∅
Type Matching
Type matching checks the type constraints of a type T against a particular value v. It returns ε if the value matches the type constraints of the type and ∅ otherwise.
Figure 28: τ basic rules
τ(•,T) => ε
τ(v,T) => ε if ∀f∈facets(T), τ(v,f)=ε; otherwise ∅
τ(v,ε) => ε
τ(v,∅) => ∅
Type-matching evaluates the various facets according to their semantics, something
like:
Figure 29: τ for type facets
τ(v,{kind:x}) => [v isa x]
τ(v,{simpletype:x}) => [v instanceof x]
τ(v,{enum:[x1,x2,...,xn]}) => [∃xi | v=xi]
τ(v,{const:x}) => [v=x]
τ(v,{minLength:x}) => [!(v isa string) ∨ string-length(v) >= x]
τ(v,{maxLength:x}) => [!(v isa string) ∨ string-length(v) <= x]
τ(v,{pattern:x}) => [!(v isa string) ∨ matches(v,x)]
τ(v,{multipleOf:x}) => [!(v isa number) ∨ (v mod x)=0]
τ(v,{minimumInclusive:x}) => [!(v isa number) ∨ v >= x]
τ(v,{minimumExclusive:x}) => [!(v isa number) ∨ v > x]
τ(v,{maximumInclusive:x}) => [!(v isa number) ∨ v <= x]
τ(v,{maximumExclusive:x}) => [!(v isa number) ∨ v < x]
τ(v,{minProperties:x}) => [!(v isa object) ∨ count(keys(v)) >= x]
τ(v,{maxProperties:x}) => [!(v isa object) ∨ count(keys(v)) <= x]
τ(v,{propertyNames:x}) => [!(v isa object) ∨ ∀k∈keys(v): τ(k,x)]
τ(v,{minItems:x}) => [!(v isa array) ∨ count(items(v)) >= x]
τ(v,{maxItems:x}) => [!(v isa array) ∨ count(items(v)) <= x]
τ(v,{uniqueItems:x}) => [!(v isa array) ∨ !x ∨ count(items(v))=count(distinct(items(v)))]
There is an additional rule for the τ function that handles subschemas. This will make more sense after we have looked
at some examples:
Figure 30: τ over model facet
τ(v,{model:E}) => ν(ΔE/Δ•:v)
Given this rule, we can observe that a JSON value is valid with respect to a JSON
schema if and only if the type matching function for the mapped JSON schema is ε.
Relations and Simplifications
There are a number of relations and simplifications of the derivative expressions
that come in useful in computing derivatives and keeping the size of the derivatives
in reasonable bounds.
There are straightforward simplifications involving ε, ∅, and redundant expressions. We will make use of these, generally without comment,
in the examples.
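For example (a representative rather than exhaustive selection):
ε,E = E    E,ε = E    ∅,E = ∅    E,∅ = ∅
E|∅ = E    ∅|E = E    E|E = E
E&ε = E    E&∅ = ∅    <∅,E> = ∅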
These identities mean that, given a function or operator f that can only take the values ε or ∅, such as the type matching function or the nullability function, an expression of the form f(x), E acts as a guard: it reduces to E when f(x) is ε and to ∅ when f(x) is ∅.
We have implicitly made use of this relation in the definitions of some of the derivatives,
above.
Negation behaves as expected with respect to boolean operators:
Finally, certain of the extension operators can be defined in terms of the others.
This can be useful for proving some of these derivations, but it also gives us some
comfort that we are not actually enhancing extended regular expressions beyond recognition:
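For example, exclusive or and the conditional can be expressed (as a sketch, in terms of simultaneous match and negation) as:
E⊕F = <E,!F> | <!E,F>
(E?F:G) = <E,F> | <!E,G>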
JSON Schema Validation
Again, following Sperberg-McQueen05 we can use derivatives for validation. Given a JSON value v and a JSON schema mapped to a type T, where T is a member of a set of mapped types accessible via λ, v is valid with respect to T if and only if the type check τ(v,T) returns ε. If the mapping includes a model facet {model:E}, then the calculation of τ will compute the nullability of the derivative of E with respect to the nameless value, ν(ΔE/Δ•:v), per the type matching rule given previously.
The overall procedure is to compute the mapping of the schema. This will give us a
type whose identifier (given by κ) will be the type to match. Where the mapping produces a model, computing the nullability
of the derivative of this model with respect to the nameless JSON value in hand tells
us whether that value is valid with respect to the schema.
Some examples might help:
Examples
Simple Type
Let's start with a very simple case. Is the value 47 valid with respect to this schema?
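Assume a schema along these lines (the specific constraint is an illustrative assumption):

{
  "type": "number",
  "minimum": 12
}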
We would certainly expect it to be. Let's walk through the mechanics.
We need to compute the type matching function, which is simple enough (here = shows the chain of derivations):
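With the assumed schema above, and writing T0 for the key of the mapped type, the chain runs roughly:
τ(47, λ(T0)) = τ(47, {kind:number, minimumInclusive:12}) = ε
since [47 isa number] = ε and [47 >= 12] = ε.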
Since the result is ε, the value is valid with respect to the schema.
More Complicated Example
Let's take a more complicated example, the schema from Figure 4, reproduced here:
Mapping gives us:
We will henceforth abbreviate wc(-[a,b],-[]) as just wc().
Let's see if this object is valid:
To do that, we need to compute the type function:
which is to say, determine the nullability of the derivative of the model on the top
level type.
To do that, we need to compute the following derivative:
This expression is equal to:
Let's unroll that a bit:
Let's consider the left side of the exclusive or (⊕):
Evaluating the first clause gives us ε because the symbol and type match:
Therefore:
So far so good! What about the other alternatives? These don't match (leading to a
derivative of ∅) because b and c do not match a and are not nullable.
Since we have
so
each alternative reduces to ∅ and we are still left with
On the other clause of the exclusive or (⊕) we have
Here the first alternative fails, because the type check fails:
which is all just ∅.
The other alternatives don't fare any better, for reasons similar to the case on the
other side (b and d do not match a and are not nullable):
so each of those alternatives nets out to ∅.
Putting this all together, we have got to
The clause coming from the required property is a little easier since it doesn't involve any types:
because Δb/Δa:1 is ∅ (eliminating the second alternative) and
because, recall, this wildcard excludes a (eliminating the third).
Whew! We've consumed the first property. On to b!
Type matching b fails. We have
since
and
we end up with
On the other half of simultaneous operator we have:
We're down to the final symbol, which is where things fail entirely because the exclusive
or fails, dragging everything else down with it:
Since ∅ is not nullable, the object is not valid with respect to the schema.
Array Example
Let's look at an array example to show how arrays unroll.
Is the array
valid with respect to this schema?
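As an illustration (the particular array and schema are assumptions chosen to be consistent with the walkthrough below), take the array

[1, "a", "b"]

and the schema

{
  "type": "array",
  "items": [ { "type": "number" }, { "const": "a" } ],
  "additionalItems": false
}

Under the mapping sketched earlier, this yields a model with two optional type-only slots and nothing that can absorb further items.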
The mapped schema:
We need to compute:
The kind check passes because the value is an array, leaving the derivative we need
to compute being
Starting with the first item:
The second item ("a") also works:
But things run aground on the third item ("b") because we run out of expressions to match against the value:
Here we have a non-nullable derivative, and the array is not valid with respect to
the schema.
Nested Object Example
Finally, let's look at an example where we have nested objects.
Here is the schema:
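Roughly this (the exact subschema is an illustrative sketch):

{
  "type": "object",
  "properties": {
    "a": {
      "type": "object",
      "properties": { "c": { "type": "number" } }
    }
  }
}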
This maps to:
Let's validate the JSON object {"a": {"c": false}}.
In the familiar way, we begin with the type matching of the top-level type
which nets out to determining the nullability of the following derivative:
Resolving this:
Here is where things get interesting, and we move down to the derivative of the subschema:
The derivative for the nested object fails:
because both those alternatives net out to ∅.
Therefore
making the derivative
Since this is non-nullable, it makes the whole of τ({a:{c:false}},λ(T0)) equal to ∅, so the object is not valid with respect to the schema.
Implementation Considerations
The rules given above for mapping object-related properties produce very weak expressions,
which require a lot of work to resolve. We could do slightly better if we recall that
properties are unique in a JSON object and assume we can always access those properties
in a particular order, say, in codepoint order. Then we can use the sequence operator
instead of the intermix operator when we don't have wildcards. We could also use the
sequence operator and just insert wildcards everywhere, but that is not better than
just using intermix. Because JSON property names are unique within a given object,
it is easy enough to prune the branches, especially if we recognize the pattern:
Furthermore, if x equals any pj we know this can only be nullable if we pick the branch where we take the derivative
of that property, because the derivative of any other pi or of the wildcard will be ∅ (the wildcard excludes all property names mapped from properties). That is:
and in fact if we iterate over the whole object:
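In the notation above, the effect is roughly this (a sketch; the wildcard shown is the strengthened one that excludes the named properties):
Δ(p1@T1 & ... & pn@Tn & wc(-[p1,...,pn],-[])*)/Δ(pj:v) => τ(v,λ(Tj)), (p1@T1 & ... & pj-1@Tj-1 & pj+1@Tj+1 & ... & pn@Tn & wc(-[p1,...,pn],-[])*)
Iterating over the whole object then peels off one named branch (or one wildcard match) per input property, with the τ checks accumulated along the way.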
It is possible to create yet another extension operator for this case and handle it
specially as a performance optimization.
To make this whole scheme worthwhile, we want to amortize the cost of computing derivatives,
both within a particular validation episode and across validation episodes. Indeed,
the major theme of Brzozowski's paper is that it can be used to construct a finite
state automaton for matching the complete extended regular expression. Having constructed
such a thing, it can be used again and again. We choose to be slightly more incremental
here, and only compute the parts of the machine we actually need, as we go. Experience
with this technique in XML Schema has shown that this is more cost effective, as there
can be large parts of the state machine that never get accessed by actual data, especially
with large schemas.
If we don't worry about type matching, it is pretty straightforward to create a cache
of derivative expressions keyed off an expression ID and a symbol.[13] The expression ID needs to be assigned by memoizing expressions, lest we keep storing the
same derivative computation over and over again. That is, from the state machine point
of view, we don't want to create lots of redundant states. It is reasonably straightforward
to canonicalize expressions for memoization.
It is helpful, in practice, to prune the computed derivatives as much as possible,
precomputing any of the predicate functions (such as [a=x] or ν), which lead to ε and ∅ values that can be used to prune further. Here is where type matching causes
us trouble: we could compute this and prune the derivative, but not if we wish to
cache the result, because the cache is keyed only off the symbol, and τ evaluation depends on the value as well.
What is to be done?
There are three basic options:
Don't cache derivatives at all. We can then prune based on τ results without constraint.
Include the value in the cache key, so that we cache the appropriate derivative for
the symbol:value pair.
Cache "partial" derivatives: derivatives where all τ calls are replaced with a marker to be evaluated using actual value presented. When
a derivative is pulled from the cache, if it has any unresolved markers, evaluate
them and return the pruned result.
Not caching means a lot of work calculating answers: we'd probably be better off just
using a more direct algorithm and not using derivatives at all. The whole point of
this technique was to compute the finite state machine (here implemented in pieces
via the cache). Including values in the cache key means that the derivative cache
will have a lot of duplicative entries and looks like an unpromising approach, especially
when you remember that some of these values are whole JSON objects and arrays. We
essentially sacrifice the finiteness of states, as a practical matter. A more promising
variation is to cache partial derivatives and add a secondary cache for the unresolved
τ calculations. The τ cache only needs to store one bit of information for each type plus value, and we
can make different decisions about the sizing and lifespan of entries for the two
caches. The partial derivative cache (∂ cache) can be readily shared across validation episodes, while the τ cache could live for only a single episode. We can tune whether the τ cache caches objects and arrays at all, or if so, how large they might be before
we don't bother.
Performance
Measure, because it is never what you think
This test performs a simple validation on a small document (~3K) with up to five levels
of nesting, a Medline citation, transcribed directly from its XML representation.
The JSON Schema imposes structural constraints (required properties, and sub-properties)
as well as some value and simple type constraints. The document is repeatedly validated
in a single-threaded loop under various regimes. As a baseline, an XML Schema validation
of the XML version of the document is also tested (also using derivatives and the
same baseline simple type facet testing). The documents are valid against their respective
schemas and both documents and schema files are pulled from a pre-warmed cache. The
table shows both the validations per second for the first validation, which for some
of the regimes is the one that does most of the work, and for subsequent validations.
Format | Cached schemas? | Derivatives? | ∂ cache? | τ cache? | V/s (non-initial) | V/s (initial)
XSD | yes | yes | N/A | N/A | 15088 | 678
JSON Schema | yes | yes | no | yes | 12618 | 610
JSON Schema | yes | yes | yes | yes | 11380 | 415
JSON Schema | yes | no | N/A | N/A | 9419 | 1071
JSON Schema | no | no | N/A | N/A | 2375 | 1720
JSON Schema | no | yes | no | yes | 650 | 617
JSON Schema | no | yes | yes | yes | 500 | 461
JSON Schema | yes | yes | no | no | 220 | 203
JSON Schema | yes | yes | yes | no | 207 | 617
JSON Schema | no | yes | no | no | 201 | 210
JSON Schema | no | yes | yes | no | 178 | 178
The variance in these numbers is quite high, although the relative ordering is generally
the same.
What is perhaps most interesting here are the negative results:
The simple recursive walk algorithm with no caching is not greatly worse than any
derivative algorithm, regardless of caching strategy. Given its substantially better
initialization cost compared to the derivative algorithms, this suggests that where
the same schema is not substantially reused, the derivative algorithms aren't worth
the bother. This is despite the fact that this test is something of a best case scenario
for the caching in the derivative algorithms.
Caching partial derivatives doesn't help performance in this scenario; in fact, it hurts slightly.
I believe the answer to the first negative result is that a lot of the work, at least
in this schema, is located in the τ checking, not in the property relation constraints, and that the models are simplistic
enough that the recursive walk doesn't have to do much beyond looking at the current
set of properties, so the work of computing, constructing, and caching derivatives
doesn't really pay.
As for the partial derivatives, I think what is going on here is (1) the τ cache is in play regardless, and in this scenario with the same document repeated
over and over, it provides a strong benefit and (2) resolving partial derivatives
requires construction of more objects representing the resolved derivative, and the
top line on performance here is time spent in allocation from the heap. We can see
the strong impact of the τ cache by seeing the two orders of magnitude hit to performance when we turn it off.
This scenario is clearly a best-case one for the τ cache, as we can see from these numbers.
The strongest performance win here is from simply having compiled schema objects cached,
which gives about a 5 times gain for the recursive walk algorithm, and a 20 times
gain for the derivative algorithm. The usefulness of the derivative algorithm is much
stronger for cached schemas, unsurprisingly, as the work of computing the derivatives
can be amortized over multiple validation episodes.
Similar results obtain from running multiple validations in parallel. This test is
a validate-on-load one, where 18 thousand Medline citations are loaded into a database
after being validated, using 20 threads. Here the performance is overall lower and
overall more similar, probably because the update operations dominate performance
overall. Here the data is more varied, so the impact of the τ cache is substantially reduced. This test doesn't separate out the costs of the initial
use of the schema, and in fact each client thread may end up doing some
of the initialization work redundantly in parallel, which impacts the amortization.
Format | Cached schemas? | Derivatives? | ∂ cache? | τ cache? | V/s
JSON Schema | yes | no | N/A | N/A | 868
JSON Schema | yes | yes | no | yes | 701
JSON Schema | yes | yes | no | no | 675
JSON Schema | yes | yes | yes | yes | 608
JSON Schema | yes | yes | yes | no | 588
Again we see the same fundamental negative results: derivatives aren't much better
(and here are a little worse, in fact) than a simple recursive walk and derivative
caching is not particularly helpful.
This is a somewhat disappointing discovery.
It is entirely possible that there are deficiencies in the specific implementation
strategy here. While tying model checking directly to τ simplifies the model of what is going on, I don't think it is a great implementation
strategy: you end up computing τ recursively on complex objects over and over as you move through branches of their
parent type's derivative. The τ cache helps here somewhat, but it might be better to deal with this as a local validation
instead. I believe the numbers from the XML Schema implementation are instructive
here because that is how it works: local validity worked from the bottom up. Another
approach that might work better is a scheme to cache the fully resolved derivatives
also, keyed off the partial derivative expression ID and the actual value of whatever
unevaluated τ's it might contain. That would avoid repeated construction of new expression objects
whenever we resolve from the ∂ cache.
It is possible to reclaim finiteness and perform full caching by vectorizing the values.
Take each distinct facet anywhere in the schema. Each slot in the vector is 1 or 0
depending on whether a particular value satisfies that facet. This then compresses
to a binary value of however many bits. While this recovers finiteness, evaluating
a lot more facets than necessary in the moment is also not without cost, so it isn't
clear what the conditions are where this could be a performance win.
It may well also be the case that the optimization of folding the required property into the mapping for object composition properties is no optimization at
all. Perhaps we'd be better served with a mapping like:
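Something along these lines, say, where the first branch only checks the presence of the required names and the second carries all of the type information (a sketch; the exact formulation is an assumption):
<
 (rq1 & ... & rqv & wc(-[rq1,...,rqv],-[])*),
 (q1@κ(t1)? & ... & qm@κ(tm)? & wc(-[p1,...,pm],-[])*)
>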
Although the models are weaker overall, the first branch is entirely independent of
type matching of specific values, and we can fail fast, pruning values that don't
match it by a pure cache lookup in the derivative cache without any additional resolution
required.
Other Applications of Type-Tagged Regular Expressions
Type-tagged regular expressions like this can be applied to other problems, or can
be used to recast other problems in different terms.
First, most obviously, we can recast XML Schema validation in these terms as well.
The mapping function is different (rather more straightforward). We do need additional
wildcard functions (see Sperberg-McQueen). The τ function evaluates the constraining facets of simple types, analogously to what we
have above, and looks at the nullability of the derivative of the content models of
complex types. These content models will have their type information retained, tagged
either with the type name itself (for named types), or with a generated stable ID.
We don't have a single top-level type, so we kick things off by identifying the appropriate
element declaration (if any), and computing the τ with respect to its type. Attribute checking folds into this model as well, through
appropriate mapping. The details of this are beyond the scope of this paper, and work
for another day.
Whether evaluated with derivatives or not, type-tagged regular expressions provide
a framework for clearer expression of certain kinds of patterns or for a more general
understanding of certain kinds of regular expressions. Consider, for example, a conventional
regular expression[14]:
A type-tagged expression of this could be something like:
with types
(In keeping with the spirit of abbreviation for this style of regular expression,
the type tagging uses \@{type} as a marker rather than the bare @type.)
I believe this is a clearer expression of the intention and meaning of the desired
pattern, and may be useful for expositional clarity, if nothing else.
There are extensions to regular expressions that capture a certain class of τ functions in this fashion already: Unicode character category and block escapes:
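For example, one might write something like:
\p{Lu}\p{Ll}*(_\p{Lu}\p{Ll}*)+
\p{IsGreek}+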
The first of these expressions matches camel-cased strings of letters with an underbar
separator (e.g. "Example_Match") and the second matches a sequence of Greek characters.
Rewriting them as type-tagged expressions in the notation of this paper:
Clearly this is not a compact rewrite! However, it does allow us to think about the
escape as less a special case than as a way of naming something whose significance
is defined by the τ function. Let us reimagine the \p escape as naming a type where the τ function in this case is either "matches a Unicode category with the given name"
or "matches a Unicode block with the same name as what follows the 'Is'".
One way to think about this is that we have a representation of a state machine with
τ representing a conditional function on the transitions. There is no reason (other
than computation efficiency) why τ needs to be a "type" function at all. What restrictions should we place on τ? Any?
Final Thoughts
This work followed a path of applying Brzozowski derivatives to JSON Schema validation,
following the pattern of using them for XML Schema validation. This led to the need
to extend regular expressions with type tagging, and adding a type checking function
that applied the appropriate matching constraints for the compound symbol, and to
allow for type-only symbols in the expressions. It turns out, this formulation also
works to provide a more uniform story for XML Schema validation beyond content model
checking. More generally, tagged symbols combined with a tag matching function can
be used more widely to improve the clarity of certain regular expressions and provide
a framework for expressing other kinds of matching problems. However, the current
implementation performance numbers are not particularly compelling: the need to match
types against an infinitude of specific values makes it difficult to find effective
caching strategies, undermining the point of using derivatives as an implementation
technique in the first place.
[Brzozowski64] Janusz A. Brzozowski. Derivatives of Regular Expressions. Journal of the ACM 11.4 (1964): 481-494.
[XSD2] W3C: David Peterson, Shudi (Sandy) Gao 高殊镝, Ashok Malhotra, C. M. Sperberg-McQueen, and Henry S. Thompson, editors. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C. April 2012. http://www.w3.org/TR/xmlschema11-2/
[1] The links to Clark's "An algorithm for RELAX NG validation" and Bob Foster's blog
referenced by Sperberg-McQueen are, alas, dead.
[2] Strictly speaking, in this case, where we don't have any actual constraints on a, b, and c, we could have dispensed with properties. If we wanted to ensure the properties were all numeric, we would need to have a
schema like {"type":"number"} instead of true for each of the properties, and properties would be necessary.
[3] The true schema is not actually equivalent to ε: derivative rules are different.
[4] Actually, integer is another alternative value for the type property, but as the facets for integer are the same as for number, we don't need to account for it separately here.
[5] There is a small detail to take care of: it is possible to end up with more than one
model property according to these rules, but we require the output to have unique keys.
In that case we use the simultaneous operator to combine them: {model:m1} + {model:m2} => {model:<m1,m2>}.
[6] You can't actually implement things that way as there may be circular references.
In practice, the usual kind of pre-pass or symbol table fix-up is required, if for
no other reason than to prevent problems. The JSON Schema specification does say that
schemas "should not" make use of infinitely recursive circular references as they
may blow up naive validators and specifies such behaviour as undefined.
[7] Support for the format property is optional, as are the specific kinds of formats that are supported. Here
I give an exemplar as to how it might be mapped. For other format kinds more elaborate
mappings would be required, or the use of special simpleType values that encapsulate any checks that can't be mapped to other facets.
[8] From an implementation perspective both the type matching rules of the propertyNames property and its mapping are a little strange and special. We store the mapping of
s1, but realistically what is stored is the key (κ(μ(s1))) with the actually mapped faceted type stored in a lookup table elsewhere.
[9] Once you follow the conventional dodge of adding a unique symbol for all non-matching
symbols encountered.
[10] Note: There are actually a couple of annoying special cases around empty objects
and handling subtype recursion that I have elided here. There is an asymmetry due
to the fact that top level values have no name, but values inside a JSON object (including
other objects) do.
[11] See Sperberg-McQueen05 for many of these; the logic for the others is similar. Since JSON objects have unique
keys and contain nothing but keyed items, the analysis of intermix with respect to
UPA applies here too.
[12]T0, T1, etc. are the identifiers given to us by κ. The set of properties on the other side of the equals sign is what is produced from
the mapping.
[13] Given JSON's proclivity for open-ended sets of properties, it is useful to be able
to group all symbols not found in the schema together as a single "unknown symbol"
here.
[14] For those unfamiliar with it: This grep-style regular expression uses a slightly different
notation than what we've used in this paper: The sequence operator (,) is assumed, [0-5] is an abbreviation for (0|1|2|3|4|5), \d is an abbreviation for a choice between all digits, and dot (.) is a wildcard.
[Sperberg-McQueen05] Michael Sperberg-McQueen. Applications of Brzozowski derivatives to XML Schema processing. Presented at Extreme Markup Languages, Montréal, Canada, August 1-5, 2005. Available at http://cmsmcq.com/2005/EML2005Sper0518.html