Introduction
JSON Schema is a JSON vocabulary for defining valid JSON data. The data may be null, simple – numbers, strings, or boolean values – or structured – arrays of JSON values or objects consisting of key/value pairs. In JSON parlance the keys are called properties. The JSON Schema specification defines a set of properties that express possible constraints on JSON values. In general these constraints are independent of one another.
Among the constraints are some that define what we might call, in the XML world, "content models" or "content types". As such, we pose the question: can we apply similar techniques as we use for XML Schema to implementing JSON Schema? In particular, can we use Brzozowski derivatives to compute validity with respect to these "content models"?
Background and Related Work
Brzozowski64 defines the concept of a derivative of an extended regular expression with respect to a symbol, and to a sequence of symbols. The derivative of an extended regular expression with respect to a symbol is the extended regular expression that would need to be matched once that symbol is consumed. By repeating the process as we walk through a sequence of symbols (the input sequence) we end up with an extended regular expression that must match the empty sequence once we have consumed the whole input sequence. An extended regular expression is called nullable if it can be reduced to match the empty sequence. Brzozowski shows that the input sequence is valid with respect to the initial extended regular expression if and only if the derivative of that expression with respect to the input sequence is nullable.
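To make the idea concrete, here is a minimal sketch (in Python; the class and function names are ours, not drawn from any library) of derivatives and nullability for ordinary regular expressions over single-character symbols:

```python
# A minimal sketch of Brzozowski derivatives for ordinary regular
# expressions; names are illustrative only.
from dataclasses import dataclass

class RE: pass

@dataclass(frozen=True)
class Empty(RE): pass          # matches nothing

@dataclass(frozen=True)
class Eps(RE): pass            # matches the empty sequence

@dataclass(frozen=True)
class Sym(RE): ch: str         # a single symbol

@dataclass(frozen=True)
class Seq(RE): left: RE; right: RE   # concatenation

@dataclass(frozen=True)
class Alt(RE): left: RE; right: RE   # choice

@dataclass(frozen=True)
class Star(RE): body: RE       # Kleene repetition

def nullable(e: RE) -> bool:
    """Can e match the empty sequence?"""
    if isinstance(e, (Eps, Star)): return True
    if isinstance(e, (Empty, Sym)): return False
    if isinstance(e, Seq): return nullable(e.left) and nullable(e.right)
    if isinstance(e, Alt): return nullable(e.left) or nullable(e.right)
    raise TypeError(e)

def deriv(e: RE, a: str) -> RE:
    """What must match the rest of the input once a is consumed."""
    if isinstance(e, (Empty, Eps)): return Empty()
    if isinstance(e, Sym): return Eps() if e.ch == a else Empty()
    if isinstance(e, Seq):
        d = Seq(deriv(e.left, a), e.right)
        return Alt(d, deriv(e.right, a)) if nullable(e.left) else d
    if isinstance(e, Alt): return Alt(deriv(e.left, a), deriv(e.right, a))
    if isinstance(e, Star): return Seq(deriv(e.body, a), e)
    raise TypeError(e)

def matches(e: RE, s: str) -> bool:
    for ch in s:               # derive as we walk the input
        e = deriv(e, ch)
    return nullable(e)         # valid iff the final derivative is nullable

# (ab)* matches "abab" but not "aba"
ab_star = Star(Seq(Sym("a"), Sym("b")))
assert matches(ab_star, "abab") and not matches(ab_star, "aba")
```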
Sperberg-McQueen (Sperberg-McQueen05) discusses the application of Brzozowski derivatives to various problems of XML Schema processing, and offers a nice summary of some of the previous work in this area. Brzozowski derivatives are reported [1] to be used in James Clark's JING Validator and Sun's MSV validator, and I know it to be used in MarkLogic's validation support as well. There are no doubt others.
The essential idea in using Brzozowski derivatives for XML Schema content model validation is to observe that, with a few extensions to the model as shown by Sperberg-McQueen, XML Schema content models are extended regular expressions over the set of defined element names (plus one "every other name" symbol). When the derivative of a sequence of names from the input with respect to the declared content model is nullable, the input matches the content model. Given the unique particle attribution property in XML Schema, we can look up each symbol in the matching sequence, check simple type constraints as necessary, and complete validation.
JSON Schema Overview
The definitions of JSON Schema (JSONSchema) and JSON Schema Validation (JSValidation) are evolving and managed for the most part outside of the scope of any standards body (see https://json-schema.org/specification.html for the current state), although there are now IETF Internet-Drafts (cited above) that track the changes. Here I describe draft 07.
The conceptual framework behind JSON Schema differs greatly from the traditions coming from the XML community or the formal languages community and some of the vocabulary is used differently as well. A "schema" is a JSON value: either an object or boolean value. It defines a set of constraints against a JSON value, which may be an object (a map of name/value pairs), an array of JSON values, or a simple number, string, boolean, or null value. Certain properties of the schema object have defined meanings; undefined properties are irrelevant and ignored. Certain properties contain schemas as values. It is important to remember that the term "schema" does not mean the totality of the set of constraints defined in a particular schema document and its referenced schema documents but instead means the current set of constraints against a particular JSON value.
A schema defines a set of constraints through a set of properties and their values. Importantly, with a very few defined exceptions, the constraints apply independently. Certain properties are ignored for certain values. A JSON value is valid with respect to its schema when all the relevant properties are valid with respect to that value.
A few examples might help make this clear:
Figure 1
{ "minimum": 12, "pattern": "a+" }
A schema for a simple value
Figure 1 shows a simple JSON Schema. The minimum property requires that the JSON value being validated be a numeric value of at least twelve. The pattern property requires that the JSON value being validated be a string value consisting of a sequence of 'a's. Is this an unsatisfiable schema? No. If the value presented happens to be a number, it will be valid if it is at least twelve, because the pattern property will be ignored for numeric values. If the value presented happens to be a string, it will be valid if it consists of a sequence of 'a's, because the minimum property will be ignored for string values. What is more, if the value presented happens to be a JSON object, it will also be valid, because both the minimum and pattern properties will be ignored!
We can fix the schema to disallow that last scenario by adding another property:
Figure 2
{ "type": ["number","string"] "minimum": 12, "pattern": "a+" }
A schema for an integer or string value
In Figure 2 we have a schema that will match either a number of at least twelve or a sequence of 'a's, and nothing else.
Similar kinds of scenarios play out with constraints on JSON objects.
Figure 3
{ "type": "object", "required": ["a","b","c"], "properties": { "a": true, "b": true, "c": true }, "additionalProperties": false }
A schema requiring an object with exactly three properties
The schema in Figure 3 needs all four[2] of the properties to ensure that the value is an object with exactly three properties named a, b, and c. The type property excludes other kinds of values such as numbers. The required property excludes objects that do not have all of the indicated properties. The properties property defines constraints for a certain set of named properties (those constraints being the boolean schema true, which means anything is valid). And finally the additionalProperties property excludes objects that have other properties we haven't mentioned, using a boolean schema of false (always invalid). There are no ordering constraints or occurrence constraints: objects are unordered maps in which keys can only occur once.
This is obviously a little more convoluted than just saying the content model is a & b & c. It is easy to imagine we might have a mapping that gets us such a content model, however, and we will see that in a moment. Nevertheless, the reader may be forgiven for suspecting that applying regular expression derivatives to validation here is a dubious enterprise. Hold on to that thought! One more example:
Figure 4
{ "type": "object", "required": ["a","b"], "oneOf": [ { "properties": { "a": { "type": "number", "minimum": 0 }, "b": { "type": "number", "minimum": 0 }, "c": { "type": "number" } }, "additionalProperties": false }, { "properties": { "a": { "type": "number", "maximum": 0 }, "b": { "type": "number", "maximum": 0 }, "d": { "type": "number" } }, "additionalProperties": false } ] }
A tricky schema
The schema in Figure 4 will successfully validate objects that have two numeric properties named a and b, where either both numbers are at least zero, or both numbers are at most zero, optionally together with an additional numeric property named either c or d, depending on whether we are in the positive or negative case. We could map this to the "content model" (a & b & c) | (a & b & d), but verifying that the numeric constraints hold would require a separate check that essentially performs all the content model checking anyway. The reader has now probably concluded that regular expression derivatives for JSON Schema validation are doomed. Not so fast, reader! We will be adding some extensions to the Brzozowski model to rescue the situation.
Before we get to the extensions to the model required, let's enumerate the JSON Schema validation properties. The classification is not part of the JSON Schema specification: it is an informational organization that we will make use of in a minute.
Property name | Relevance | Meaning | Classification |
---|---|---|---|
type | all | Data value kind: object, array, number, integer, string, boolean, null | type (special) |
format | string | Required format such as date-time (optional) | type |
enum | all | Set of possible values | type |
const | all | Fixed value | type |
minLength | string | Minimum length | type |
maxLength | string | Maximum length | type |
pattern | string | String matches regular expression | type |
multipleOf | number | Number is multiple of this | type |
minimum | number | Inclusive minimum | type |
exclusiveMinimum | number | Exclusive minimum | type |
maximum | number | Inclusive maximum | type |
exclusiveMaximum | number | Exclusive maximum | type |
minProperties | object | Minimum number of properties | type |
maxProperties | object | Maximum number of properties | type |
minItems | array | Minimum number of array items | type |
maxItems | array | Maximum number of array items | type |
uniqueItems | array | Are values in array unique? | type |
properties | object | Child properties | composition |
additionalProperties | object | Constraints on unenumerated properties | composition |
required | object | Names of required properties | composition |
items (array value) | array | Sequence of constraints for items in array | composition |
items (single value) | array | Constraint on every item in array | composition |
additionalItems | array | Constraints on additional unenumerated items | composition |
contains | array | At least one item is valid | composition |
anyOf | all | Valid against at least one subschema | composition |
oneOf | all | Valid against exactly one subschema | composition |
allOf | all | Valid against all subschemas | composition |
not | all | Valid if not valid against subschema | composition |
if,then,else | all | Conditional validation | composition |
propertyNames | object | Schema constraint on the names of properties | type (special) |
dependencies | object | Additional constraints if certain child properties are present | composition (special) |
$ref | all | References the schema via a path rather than in place | special |
$id | all | Sets a base URI for resolution of $ref properties | special |
Extending the Model
The approach taken to applying Brzozowski derivatives to JSON Schema validation is to map the compositional properties to extended regular expressions, where the symbols will be composites of the JSON property names tagged with a type identifier. A further mapping will gather up bundles of type properties in the schema; these bundles will get a type identifier. Call these bundles types and their constraints facets. We will show how we can define nullability for these extended expressions and derivatives of them with respect to JSON values. Some additional machinery and notation will need to be in play for unnamed values, either as top-level values or array items.
In sum, here is the program of work:
- Define a mapping from JSON Schema properties to extended regular expressions
- Define a mapping from JSON Schema properties to types and facets
- Define extension operators to account for all the composition properties
- Define extensions to tag property names with type identifiers in the expressions
- Define a type matching function
- Define extensions for type-only items (top level and array items)
- Define derivatives with respect to JSON values
- Define derivatives over extended expressions
- Define nullability over extended expressions
- Sip wine in celebration
Mapping
Mapping performs two operations: mapping type properties to faceted types, and mapping compositional properties to extended regular expressions. We will denote these with two separate functions, μ and μM respectively. To complete the picture, we will also need a function that provides the type with its name, and a lookup function that can obtain the facets for a given named type. I use the form f(A) => B to say that the definition of the result of applying function f to A is B.
Basic notation: f(A) => B defines the result of applying f to A as B. The mapping function is written μ(s); the model mapping function, which produces extended regular expressions, is written μM(s); the key function κ(s) yields the type identifier for a mapped schema; and the lookup function λ(T) yields the facets of the named type T.
The type mapping function is fairly straightforward, although capturing the interactions with the compositional properties gets a little messy in a functional paradigm. In the formulation here, the extended regular expressions from μM are expressed as the output for a particular property (model) in the output of μ.
For the most part, type properties in the schema map directly to facets in the type. If there is a type property in the schema, its value will constrain which properties map to facets; if it is an array of values, we will treat this as creating a content model disjunction between the alternatives; and if it is absent we will treat this as creating a content model disjunction between all[4] alternatives (array, object, number, string, boolean, null). Each alternative will only gather properties relevant for that specific type value.
In what follows, P is a map of property-value pairs {p1:v1,...,pn:vn} where each property name is unique, and the + operator performs a union of maps. The rules are generally written so that the sets of property names are disjoint.[5] In addition, s1, s2, etc. refer to (sub)schemas.
The definition below shows the mapping, broken down for each of the distinct type values. Most properties can be mapped independently, although there are some exceptions. That is, unless specified otherwise: μ({p:v,P}) => μ({p:v})+μ(P) and similarly μM({p:v,P}) => μM({p:v})+μM(P). Where there is a rule for a specific combination of properties, that rule takes precedence. There are a few edge cases to take care of: boolean schemas and $ref dereferencing.
It will be useful to define some standard bundles of properties. The function relevant(P,kind) will return just the properties listed for the specific kind from the map of property-value pairs. For example, relevant({minLength:1,maximum:12},"string") is {minLength:1}:

Figure 5 Properties relevant to certain types
Kind | Properties selected by relevant(P,kind)
---|---
any | enum, const, oneOf, anyOf, allOf, not, if, then, else
boolean | Those for any
string | minLength, maxLength, pattern, and those for any
integer | Same as number
number | multipleOf, minimum, exclusiveMinimum, maximum, exclusiveMaximum, and those for any
null | Those for any
array | minItems, maxItems, uniqueItems, items, additionalItems, contains, and those for any
object | propertyNames, minProperties, maxProperties, properties, patternProperties, additionalProperties, required, dependencies, and those for any
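In code, the bundles amount to a table lookup; a minimal sketch in Python, transcribing the table above:

```python
# Sketch of relevant(P, kind): select the schema properties that apply
# to a given kind, per the table above.
ANY = {"enum", "const", "oneOf", "anyOf", "allOf", "not", "if", "then", "else"}
RELEVANT = {
    "any":     ANY,
    "boolean": ANY,
    "null":    ANY,
    "string":  ANY | {"minLength", "maxLength", "pattern"},
    "number":  ANY | {"multipleOf", "minimum", "exclusiveMinimum",
                      "maximum", "exclusiveMaximum"},
    "array":   ANY | {"minItems", "maxItems", "uniqueItems",
                      "items", "additionalItems", "contains"},
    "object":  ANY | {"propertyNames", "minProperties", "maxProperties",
                      "properties", "patternProperties",
                      "additionalProperties", "required", "dependencies"},
}
RELEVANT["integer"] = RELEVANT["number"]   # integer selects the same as number

def relevant(P: dict, kind: str) -> dict:
    return {p: v for p, v in P.items() if p in RELEVANT[kind]}

# the worked example from the text
assert relevant({"minLength": 1, "maximum": 12}, "string") == {"minLength": 1}
```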
First, we define how some edge cases are handled:
Figure 6 μ edge cases
μ(true) => true
μ(false) => ∅
μ(∅) => ∅
μ({$ref:path}) => μ(dereference(path))
The boolean schemas map to true or the empty set as appropriate, and references map to the referenced definition. Specifying the details of how the JSON path gets dereferenced is beyond the scope of this paper. In what follows, assume we have a fully dereferenced schema in hand.[6] We are also assuming the schema as a whole (including the references) is valid.
The type property is the main driver for the mapping. In general it gets mapped to a kind facet plus a simpleType facet for references to simple values, and occasionally other facets as well. The type names have the same meaning as in XSD2. Here are the rules for mapping the type property:[7]

Figure 7 μ mappings for type
μ({type:[t1,...,tn], P}) => μ({oneOf: [{type:t1,P},...,{type:tn,P}]})
μ({type:boolean, P}) => μ({type:boolean}) + μ(relevant(P,"boolean"))
μ({type:string, format:f1, P}) => μ({type:string, format:f1}) + μ(relevant(P,"string"))
μ({type:string, P}) => μ({type:string}) + μ(relevant(P,"string"))
μ({type:integer, P}) => μ({type:integer}) + μ(relevant(P,"integer"))
μ({type:number, P}) => μ({type:number}) + μ(relevant(P,"number"))
μ({type:null, P}) => μ({type:null}) + μ(relevant(P,"null"))
μ({type:array, P}) => μ({type:array}) + μ(relevant(P,"array"))
μ({type:object, P}) => μ({type:object}) + μ(relevant(P,"object"))
μ({type:boolean}) => {kind:boolean, simpleType:"xs:boolean", enum:[true,false]}
μ({type:string, format:date-time}) => {kind:string, simpleType:"xs:dateTime"}
μ({type:string}) => {kind:string, simpleType:"xs:string"}
μ({type:number}) => {kind:number, simpleType:"xs:numeric"}
μ({type:integer}) => {kind:number, simpleType:"xs:integer"}
μ({type:null}) => {kind:null}
μ({type:array}) => {kind:array}
μ({type:object}) => {kind:object}
The enum and const properties apply to all kinds of values:

Figure 8 μ for enum and const

μ({enum:[s1,s2,...,sn]}) => {enum:[s1,s2,...,sn]}
μ({const:value}) => {fixed:value}
The string-specific properties are minLength, maxLength, and pattern:

Figure 9 μ for string properties

μ({minLength:value}) => {minLength:value}
μ({maxLength:value}) => {maxLength:value}
μ({pattern:value}) => {pattern:value}
Here are the numeric property mappings:
Figure 10 μ for numeric properties
μ({multipleOf:value}) => {multipleOf:value}
μ({minimum:value}) => {minimumInclusive:value}
μ({exclusiveMinimum:value}) => {minimumExclusive:value}
μ({maximum:value}) => {maximumInclusive:value}
μ({exclusiveMaximum:value}) => {maximumExclusive:value}
Type properties for objects:[8]
Figure 11 μ for object properties
μ({minProperties:value}) => {minProperties:value}
μ({maxProperties:value}) => {maxProperties:value}
μ({propertyNames:s1}) => {propertyNames:μ(s1)}
Array properties:
Figure 12 μ for array properties
μ({minItems:value}) => {minItems:value}
μ({maxItems:value}) => {maxItems:value}
μ({uniqueItems:value}) => {uniqueItems:value}
Before we get to model mapping, we need to introduce some additional notation. The notation is similar to Sperberg-McQueen05, which is a little different from Brzozowski64, but with a few alterations and extensions. JSON Schema does not need general occurrence counts, and the wildcards required are relative to a set of property names, not namespaces.
Basic extended regular expressions: ∅ (the empty set), ε (the empty sequence), true (matches anything), symbols such as a and b, sequence F,G, choice F|G, intermix F&G, optionality F?, repetition F*, negation !F, exclusive or F⊕G, conditional (F?G:H), simultaneous match <F,G>, type marking F@T, and wildcards wc(...).
The type marking, exclusive or, conditional, simultaneous, and wildcards are extensions. Negation is present in Brzozowski64 although with a slightly different notation.
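Rendered as data, the operators might look like the following sketch (Python; the constructor names are ours, extending the basic constructors of the earlier regular-expression sketch with the new operators). The later sketches in this paper build on these constructors:

```python
# Sketch: constructors for the extended regular expressions.
from dataclasses import dataclass

class Expr: pass

@dataclass(frozen=True)
class Empty(Expr): pass                        # ∅
@dataclass(frozen=True)
class Eps(Expr): pass                          # ε
@dataclass(frozen=True)
class Anything(Expr): pass                     # true
@dataclass(frozen=True)
class Sym(Expr): name: str                     # a property-name symbol
@dataclass(frozen=True)
class Dot(Expr): pass                          # • : a nameless value
@dataclass(frozen=True)
class Marked(Expr): base: Expr; type_id: str   # F@T : type marking
@dataclass(frozen=True)
class Opt(Expr): body: Expr                    # F?
@dataclass(frozen=True)
class Star(Expr): body: Expr                   # F*
@dataclass(frozen=True)
class Seq(Expr): left: Expr; right: Expr       # F,G : sequence
@dataclass(frozen=True)
class Alt(Expr): left: Expr; right: Expr       # F|G : choice
@dataclass(frozen=True)
class Inter(Expr): left: Expr; right: Expr     # F&G : intermix
@dataclass(frozen=True)
class Not(Expr): body: Expr                    # !F : negation
@dataclass(frozen=True)
class Xor(Expr): left: Expr; right: Expr       # F⊕G : exclusive or
@dataclass(frozen=True)
class Cond(Expr): test: Expr; then: Expr; els: Expr  # (F?G:H)
@dataclass(frozen=True)
class Simul(Expr): left: Expr; right: Expr     # <F,G> : simultaneous
@dataclass(frozen=True)
class Wild(Expr):                              # wc(-names, ±patterns)
    not_names: tuple = ()
    patterns: tuple = ()
    match_patterns: bool = False               # + (True) vs - (False)
```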
It may seem that F&G
and <F,G>
mean the same thing, but the difference will become clearer when we get to the definition
of derivatives, because F&G
matches F
and G
against different segments of the input, whereas with <F,G>
matches F
and G
against the same segment of the input simultaneously. For example, a@T & a@S
can never be satisfied because each symbol is unique in a JSON object so there can
only be one a
matched, but <a@T, a@S>
may well be satisfied if both type constraints are satisfied on the value of property
a
.
The mapping of the basic composition operators is fairly direct:
Figure 13 μ for basic composition properties

μ({anyOf: [s1,...,sn]}) => {model: μM({anyOf:[s1,...,sn]})}
μ({oneOf: [s1,...,sn]}) => {model: μM({oneOf:[s1,...,sn]})}
μ({allOf: [s1,...,sn]}) => {model: μM({allOf:[s1,...,sn]})}
μ({not:s}) => {model: μM({not:s})}
μ({if:s1,then:s2,else:s3}) => {model: μM({if:s1,then:s2,else:s3})}

Figure 14 μM for basic composition properties

μM({anyOf: [s1,s2,...,sn]}) => μM(s1)|μM(s2)|...|μM(sn)
μM({oneOf: [s1,s2,...,sn]}) => μM(s1)⊕μM(s2)⊕...⊕μM(sn)
μM({allOf: [s1,s2,...,sn]}) => <μM(s1),μM(s2),...,μM(sn)>
μM({not:s}) => !μM(s)
μM({if:s1,then:s2,else:s3}) => (μM(s1)?μM(s2):μM(s3))
μM({if:s1,then:s2}) => (μM(s1)?μM(s2):true)
μM({if:s1,else:s3}) => (μM(s1)?true:μM(s3))
Mapping of the object-related compositional properties requires dealing with the interactions and dependencies between them. The two main compositional properties are properties and patternProperties. Each of these consists of a set of keys and an associated schema for each key. In the case of properties the key is the property name to match in the instance; in the case of patternProperties the key is a regular expression that matches the name of the property in the instance. All properties named are optional unless their name appears in the required array. Properties otherwise unaccounted for must conform to the schema given for additionalProperties, which defaults to the true schema if absent.
To simplify things somewhat, let's first ensure that when the additionalProperties property was not specified, we treat it as an explicit true schema, that is, that all additional properties are OK. We also want to normalize the additionalProperties property when it is the false schema to remove it, to normalize the properties and patternProperties properties to replace absence with an empty object, and to normalize the required property to replace absence with an empty array. Assume the following normalizations apply one after another:

Figure 15 Normalizations for interdependent object composition properties
When ∀pi|pi!=additionalProperties: μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn,additionalProperties:true})
When {p1:s1,...,pn:sn}={additionalProperties:false,P}: μ({p1:s1,...,pn:sn}) => μ({P})
When ∀pi|pi!=properties: μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn,properties:{}})
When ∀pi|pi!=patternProperties: μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn,patternProperties:{}})
When ∀pi|pi!=required: μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn,required:[]})
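In code the normalizations amount to filling in defaults before mapping; a sketch, assuming schemas are represented as plain Python dicts:

```python
# Sketch: apply the normalizations above to an object schema,
# one after another, before mapping.
def normalize_object_schema(s: dict) -> dict:
    s = dict(s)                                 # don't mutate the input
    s.setdefault("additionalProperties", True)  # absent => true schema
    if s["additionalProperties"] is False:      # false schema => drop it
        del s["additionalProperties"]
    s.setdefault("properties", {})              # absent => empty object
    s.setdefault("patternProperties", {})
    s.setdefault("required", [])                # absent => empty array
    return s
```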
This leaves us with two variations to map: the bundle of properties P with the properties named properties, patternProperties, and required, and the bundle of properties with those plus the property named additionalProperties as well. In either case, the mapping creates a model property with the property bundle mapped by μM: {model: μM(P)}.
Capturing the semantics of additionalProperties requires some trickiness, because if its value is not the false schema, then it applies to any properties that are not named in properties and that do not match a pattern in patternProperties. We capture this in the wildcard, which takes those properties and patterns as its sets. We also capture the required property by marking appropriate properties as optional or not in the model. This is an optimization: we could instead have a simultaneous compositor where one branch handles the required properties and the other handles the type information. Both branches end up being relatively weak, in general.
The semantics of patternProperties is particularly tricky:

- Pattern properties do not count as additional properties.
- More than one pattern property can match a specific input property, and every matching constraint needs to be enforced.
- A pattern property may or may not match a specifically named property; if it does, its constraint must be enforced.
- We may lack named properties (properties) but still have a constraint on "additional" properties (additionalProperties).
In general, every property listed is optional unless it is specifically mentioned in the required array. The semantics of required is a little awkward to capture in an effective way algebraically. Note that it is perfectly acceptable in JSON Schema for a property to be in the required array but not mentioned by either properties or patternProperties.
For example, the following schema is perfectly OK, and will not match the object {"c":1,"dd":2}, as that object lacks the required a and b properties:

Figure 16 Schema with required property that is not otherwise described
{
"type": "object",
"required": ["a","b"],
"properties": {
"c": { "type": "number" }
},
"patternProperties": {
"d+": { "type": "number" }
},
"additionalProperties": true
}
With all that said, the general rules for mapping the object-related composition properties
are:
Figure 17 μM for object composition properties
μM({
 properties: {p1:s1,...,pm:sm},
 patternProperties: {r1:pt1,...,ro:pto},
 additionalProperties:sp,
 required:[rq1,...,rqn]
}) =>
 <
 (q1@κ(t1) & ... & qk@κ(tk) &
  ql@κ(tl)? & ... & qm@κ(tm)? &
  u1@κ(sp) & ... & uv@κ(sp) &
  (wc(-[p1,...,pm],-[r1,...,ro])@κ(sp) |
   wc(-[p1,...,pm],+[r1,...,ro]))*
 ),
 (wc(r1)@κ(pt1) | wc(-[],-[r1]))*,
 (wc(r2)@κ(pt2) | wc(-[],-[r2]))*,
 ...
 (wc(ro)@κ(pto) | wc(-[],-[ro]))*
 >

μM({
 properties: {p1:s1,...,pm:sm},
 patternProperties: {r1:pt1,...,ro:pto},
 required:[rq1,...,rqn]
}) =>
 <
 (q1@κ(t1) & ... & qk@κ(tk) &
  ql@κ(tl)? & ... & qm@κ(tm)? &
  u1@κ(sp) & ... & uv@κ(sp) &
  wc(-[p1,...,pm],+[r1,...,ro])*
 ),
 (wc(r1)@κ(pt1) | wc(-[],-[r1]))*,
 (wc(r2)@κ(pt2) | wc(-[],-[r2]))*,
 ...
 (wc(ro)@κ(pto) | wc(-[],-[ro]))*
 >

where
 l = k+1
 for i<=k, qi=pj and ti=sj and ∃rqh|pj=rqh
 for i>k, qi=pj and ti=sj and ∀rqh|pj!=rqh
 for i<=v, ui=rqj where ∀ph|ph!=rqj
This looks complicated, but all it is saying is that our mapped content model consists of a mix, in any order, of: properties that are named and required (q1 through qk), properties that are named but not required (ql through qm), properties that are not named but are nevertheless required (u1 through uv), a wildcard to pick up properties that are not named that match none of the patterns but satisfy the schema for additionalProperties, and a wildcard to pick up properties that are not named but do match one of the patterns. The wildcard clauses in the second and subsequent branches of this simultaneous model ensure that properties matching patterns satisfy their appropriate schema: everything either matches the pattern and its corresponding schema, or doesn't match the pattern at all. Each pattern gets its own simultaneous branch because an input property may match more than one of them, in which case it would need to satisfy the constraints from each match.
If patternProperties is empty, as is often the case, this simplifies a great deal. We can drop the wildcards that match against patterns and simplify the simultaneous model with one branch to just that branch:

Figure 18 Simplified μM for object composition properties
μM({
 properties: {p1:s1,...,pm:sm},
 additionalProperties:sp,
 required:[rq1,...,rqn]
}) =>
 (
 q1@κ(t1) & ... & qk@κ(tk) &
 ql@κ(tl)? & ... & qm@κ(tm)? &
 u1@κ(sp) & ... & uv@κ(sp) &
 wc(-[p1,...,pm],-[])@κ(sp)*
 )

μM({
 properties: {p1:s1,...,pm:sm},
 required:[rq1,...,rqn]
}) =>
 (
 q1@κ(t1) & ... & qk@κ(tk) &
 ql@κ(tl)? & ... & qm@κ(tm)? &
 u1@κ(sp) & ... & uv@κ(sp)
 )
Notice that we still get fairly weak content models here: every property can occur in any order, optional unless specifically required, with all kinds of wildcards unless additionalProperties is explicitly false and there are no patternProperties. This has performance implications, as it creates larger and more complex derivatives. In JSON objects, however, every property name is unique. We can use this knowledge to strengthen the wildcards in the first branch by adding any required properties to the excluded list. (In the examples we will assume we do this.) Other optimizations are possible. We will return to this point later.
Mapping the array-relevant compositional properties follows a similar paradigm, but an additional class of extension is in order, because array members do not have any symbol associated with them, just values. However, given our type constraint scheme, we still have something to say about such things, we just need a little bit more notational mechanism:
Non-symbol extensions: •@T matches a nameless value against the type T; • alone matches any nameless value; and •@ε matches a nameless value with no type constraint.
Given the type-only expressions, we are now in a position to map array properties.
The items property in JSON Schema comes in two varieties with different semantics: as a single schema, or as an array of schemas. If it is a single schema, then that schema applies to every member of the array, and additionalItems is irrelevant. If it is an array, then each member schema applies one to one to the values in the array in order, with any surplus values in the array matching the additionalItems schema, if any:

Figure 19 μ for array composition properties
μ({items:I,additionalItems:sa,contains:sc}) => {model: <μM({items:I,additionalItems:sa}),μM({contains:sc})>}
μ({items:I,additionalItems:sa}) => {model: μM({items:I,additionalItems:sa})}
μ({items:I,additionalItems:false}) => {model: μM({items:I})}
μ({items:I}) => {model: μM({items:I,additionalItems:true})}

Figure 20 μM for array composition properties

μM({contains:sc}) => (•@ε*,•@κ(sc),•@ε*)

μM({items:s1,additionalItems:sa}) => (•@κ(s1)*)

μM({items:[s1,s2,...,sn], additionalItems:sa}) =>
 (•@κ(s1),•@κ(s2),...,•@κ(sn),•@κ(sa)* |
  •@κ(s1),•@κ(s2),...,•@κ(sn)? |
  •@κ(s1),•@κ(s2),...,•@κ(sn-1)? |
  ...
  •@κ(s1),•@κ(s2)? |
  •@κ(s1)?
 )
 = ((•@κ(s1),
     (•@κ(s2),
      ...
      (•@κ(sn)?,•@κ(sa)*)
      ...
     )?
    )?
   )

μM({items:[s1,s2,...,sn]}) =>
 (•@κ(s1),•@κ(s2),...,•@κ(sn)? |
  •@κ(s1),•@κ(s2),...,•@κ(sn-1)? |
  ...
  •@κ(s1),•@κ(s2)? |
  •@κ(s1)?
 )
 = ((•@κ(s1),
     (•@κ(s2),
      ...
      •@κ(sn)?
      ...
     )?
    )?
   )
The reason for the complexity of the rules for items is this: just because the property defines type constraints for three slots, that doesn't mean a matching array must have three entries in it. That is handled by a different constraint property, minItems. We could apply it here to remove some of the optionality.
In other words, array constraint properties map to type-only expressions. The additionalItems property is only relevant if the items property enumerates a sequence of type constraints for each item in the array.
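A sketch of how the array-of-schemas form of items might be folded into the nested-optional model shown in Figure 20, using the constructors sketched earlier (kappa, the κ key function from a subschema to its type identifier, is assumed; the single-schema form of items maps separately, to a simple repetition):

```python
# Sketch: build the model for items given as an array of schemas,
# using the nested-optional form of Figure 20.
def items_model(item_schemas: list, additional=None):
    model = None
    for s in reversed(item_schemas):
        slot = Marked(Dot(), kappa(s))          # •@κ(si)
        if model is None:                       # innermost slot
            if additional is not None:          # (•@κ(sn)?,•@κ(sa)*)
                model = Seq(Opt(slot),
                            Star(Marked(Dot(), kappa(additional))))
            else:                               # •@κ(sn)?
                model = Opt(slot)
        else:                                   # (•@κ(si), ...)?
            model = Opt(Seq(slot, model))
    return model if model is not None else Eps()
```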
Mapping the dependencies property is a little confusing because its value is a JSON object where each value is either a subschema or an array of property names. The semantics are very different, and the dependencies property is likely to bifurcate in future drafts of JSON Schema. Each item in the JSON object maps separately, using a different rule depending on whether its value is a schema or an array. Here are the rules:
Figure 21 μ for dependencies

μ({dependencies: {p1:v1,p2:v2,...,pn:vn}}) =>
 {model: <
  μM({dependencies:{p1:v1}}),
  ...,
  μM({dependencies:{pn:vn}})
 >}

Figure 22 μM for dependencies

μM({dependencies:{p:S}}) => ((p & wc(-[p],-[])*) ? wc(*)@κ(S) : true)

μM({dependencies:{p:[q1,q2,...,qn]}}) =>
 ((p & wc(-[p],-[])*) ?
  (q1 & ... & qn & wc(-[q1,...,qn],-[])*) :
  true)

What these rules say is that if the object consists of the property p plus a bunch of other stuff not named p, the object must match either the given schema, or must include all the named properties (depending on which form was used).
It should also be noted that although at first blush we have made so many extensions to regular expressions that it seems likely that none of the supporting results would hold, in fact we have hardly extended anything substantively. Instead of a finite vocabulary consisting of the symbols in the schema, we have a vocabulary consisting of the symbols crossed with the type identifiers, plus the • symbol. It is still a finite vocabulary.[9] The operators and functions we have added, being boolean operators, fit into the rules defined by Brzozowski. True, we have made the rules for matching more complex, but one could also think of them as just being a funny kind of equality. On the other side, while our input can consist of an arbitrary number of distinct values, they are all just annotations on that finite vocabulary. The infinitude of values will concern us when we get to caching, however.
Derivatives
Given a JSON Schema mapped in this way, we can compute derivatives of those expressions with respect to JSON values. Most of these rules come directly from Sperberg-McQueen05 or Brzozowski64, or can be derived from Brzozowski's observation that the derivative of any boolean operator over some operands is the operator over the derivatives of those operands. Algebraically: Δf(R)/Δa = f(ΔR/Δa) for any boolean function f. A key difference is that when we get to the leaves, we have to do type matching as well as symbol matching. Let us introduce a little more notation:
Derivative function: ΔE/Δa is the derivative of the expression E with respect to a. Nullability function: ν(E) is ε if E is nullable, and ∅ otherwise. Typematch function: τ(v,T) is ε if the value v satisfies the constraints of the type T, and ∅ otherwise. JSON values: a:v denotes a property named a with value v, and •:v denotes a nameless value v.
Taking lowercase a and b as symbols and uppercase E, F, and G as expressions, the derivatives of basic expressions (without type matching) are as follows:

Figure 23 Basic derivative rules
Δb/Δa => [a=b]
Δε/Δa => ∅
Δ∅/Δa => ∅
Δtrue/Δa => true
ΔE?/Δa => ΔE/Δa | ν(E)
ΔE*/Δa => ΔE/Δa,E*
Δ(E,F)/Δa => (ΔE/Δa),F | ν(E),ΔF/Δa
Δ(E|F)/Δa => ΔE/Δa | ΔF/Δa
Δ(E&F)/Δa => (ΔE/Δa & F) | (E & ΔF/Δa)
Δ!E/Δa => !ΔE/Δa
Δ(E⊕F)/Δa => (ΔE/Δa) ⊕ (ΔF/Δa)
Δ(E?F:G)/Δa => (ΔE/Δa ? ΔF/Δa : ΔG/Δa)
Δ<E,F>/Δa => <ΔE/Δa, ΔF/Δa>
Δwc(pat)/Δa => [matches(a,pat)]
Δwc(-[p1,...,pn],-[r1,...,rn])/Δa => [∀pj:a!=pj ∧ ∀rj:!matches(a,rj)]
Δwc(-[p1,...,pn],+[r1,...,rn])/Δa => [∀pj:a!=pj ∧ ∃rj:matches(a,rj)]
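As a sketch, these rules transcribe almost mechanically into code, one case per operator, over the constructors sketched earlier. The type-marked leaves (F@T) also need the value, and are deferred to the type-matching rules below; nullable is the ν of the Nullability section, returning ε or ∅ as expressions:

```python
# Sketch: the symbol-only derivative rules of Figure 23.
import re

def deriv(e, a):
    """ΔE/Δa for a symbol a."""
    if isinstance(e, (Empty, Eps)):
        return Empty()
    if isinstance(e, Anything):
        return Anything()
    if isinstance(e, Sym):
        return Eps() if e.name == a else Empty()
    if isinstance(e, Dot):
        return Empty()                         # Δ•/Δb => ∅
    if isinstance(e, Opt):                     # ΔE?/Δa => ΔE/Δa | ν(E)
        return Alt(deriv(e.body, a), nullable(e.body))
    if isinstance(e, Star):                    # ΔE*/Δa => ΔE/Δa,E*
        return Seq(deriv(e.body, a), e)
    if isinstance(e, Seq):                     # (ΔE/Δa),F | ν(E),ΔF/Δa
        return Alt(Seq(deriv(e.left, a), e.right),
                   Seq(nullable(e.left), deriv(e.right, a)))
    if isinstance(e, Alt):
        return Alt(deriv(e.left, a), deriv(e.right, a))
    if isinstance(e, Inter):                   # (ΔE/Δa & F) | (E & ΔF/Δa)
        return Alt(Inter(deriv(e.left, a), e.right),
                   Inter(e.left, deriv(e.right, a)))
    if isinstance(e, Not):
        return Not(deriv(e.body, a))
    if isinstance(e, Xor):
        return Xor(deriv(e.left, a), deriv(e.right, a))
    if isinstance(e, Cond):
        return Cond(deriv(e.test, a), deriv(e.then, a), deriv(e.els, a))
    if isinstance(e, Simul):
        return Simul(deriv(e.left, a), deriv(e.right, a))
    if isinstance(e, Wild):
        hit = any(re.search(p, a) for p in e.patterns)
        ok = a not in e.not_names and (hit if e.match_patterns else not hit)
        return Eps() if ok else Empty()
    raise TypeError(e)  # Marked (F@T): see the type-matching rules below
```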
To take the derivative of an expression with respect to a JSON object, first note that the edge case for an empty sequence is defined so that:

ΔE/Δε => E

Then observe that for a sequence of symbols u followed by a symbol a:

ΔE/Δ(u,a) => (ΔE/Δu)/Δa

or, unrolling an entire sequence:

ΔE/Δ(p1,p2,...,pn) => (...(((ΔE/Δp1)/Δp2)/...)/Δpn)
Given the set of properties of an object p1, p2, etc., the derivative with respect to the set of properties is the union of the derivatives over each possible sequence starting with some pj:

Figure 24 Derivative with respect to a whole object

ΔE/Δ•:{p1:v1,p2:v2,...,pn:vn} =>
 (ΔE/Δp1:v1)/Δ•:{p2:v2,p3:v3,...,pn:vn} |
 (ΔE/Δp2:v2)/Δ•:{p1:v1,p3:v3,...,pn:vn} |
 ... |
 (ΔE/Δpn:vn)/Δ•:{p1:v1,p2:v2,...,pn-1:vn-1}
Adding type matching to the picture doesn't change the rules much, although they look a little more complicated. All the rules are the same, just carrying through the type annotation on the expressions, except for the leaf cases, where we need to perform a type check as well as symbol matching:[10]
Figure 25 Derivatives with type matching
Δb@T/Δa:v => [a=b],τ(v,λ(T))
Δwc(pat)@T/Δa:v => [matches(a,pat)],τ(v,λ(T))
Δwc(-[p1,...,pn],-[r1,...,rn])@T/Δa:v => [∀pj:a!=pj ∧ ∀rj:!matches(a,rj)],τ(v,λ(T))
Δwc(-[p1,...,pn],+[r1,...,rn])@T/Δa:v => [∀pj:a!=pj ∧ ∃rj:matches(a,rj)],τ(v,λ(T))
ΔE@T/Δ•:{p1:v1,p2:v2,...,pn:vn} =>
 τ({p1:v1,p2:v2,...,pn:vn},λ(T)),
 ((ΔE/Δp1:v1)/Δ•:{p2:v2,p3:v3,...,pn:vn} |
  (ΔE/Δp2:v2)/Δ•:{p1:v1,p3:v3,...,pn:vn} |
  ... |
  (ΔE/Δpn:vn)/Δ•:{p1:v1,p2:v2,...,pn-1:vn-1}
 )
Now that we have type-aware derivatives for expressions over symbols, we still need to account for arrays and top-level values that have no symbols, and to compute derivatives over these type-only expressions. Arrays unroll like sequences:
Figure 26 Derivatives for arrays and nameless values
ΔE@T/Δ•:[v1,v2,...,vn] => τ([v1,v2,...,vn],λ(T)), ΔE/Δ(•:v1,•:v2,...,•:vn)
ΔE/Δ(•:v1,•:v2,...,•:vn) => (...(((ΔE/Δ•:v1)/Δ•:v2)/...)/Δ•:vn)
Δ(•@T)/Δ•:v => τ(v,λ(T))
Δ•/Δb => ∅
Δa/Δ• => ∅
Δwc(pat)/Δ• => ∅
Δwc(-[p1,...,pn],-[r1,...,rn])/Δ• => ∅
Δwc(-[p1,...,pn],+[r1,...,rn])/Δ• => ∅
Δ•/Δ• => ε
Δε/Δ• => ∅
What this says is that the derivative of an array can be treated mostly as a sequence of non-symbol values, although we need to consider the type constraints on the array as a whole. A symbol cannot match a non-symbol, so derivatives of that kind are empty. A derivative of a non-symbol with respect to a non-symbol value comes down to the type matching function against that value.
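A sketch of these value-level rules, with deriv_pv (the symbol-plus-value leaf rules of Figure 25) and deriv_item (the nameless-value rules of Figure 26) assumed as helpers that otherwise follow the deriv combinator rules sketched earlier:

```python
# Sketch: derivatives with respect to whole JSON values (Figures 24-26).
def deriv_value(e, v):
    """ΔE/Δ•:v for a JSON value v."""
    if isinstance(v, dict):
        if not v:                         # ΔE/Δε => E
            return e
        items = list(v.items())
        branches = []
        for i, (name, val) in enumerate(items):
            rest = dict(items[:i] + items[i + 1:])
            branches.append(deriv_value(deriv_pv(e, name, val), rest))
        out = branches[0]                 # union over every choice of
        for b in branches[1:]:            # first property (Figure 24)
            out = Alt(out, b)
        return out
    if isinstance(v, list):
        for item in v:                    # arrays unroll like sequences
            e = deriv_item(e, item)       # of nameless values (Figure 26)
        return e
    return deriv_item(e, v)               # a lone simple value
```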
Nullability
The nullability function decides whether an expression can be matched against nothing. When the derivative of an expression with respect to the input is not nullable, that means that either it would need additional input in order to match or it has already run down a blind alley and there is no way to match what is there. Therefore the data cannot be valid with respect to that expression. Conversely, when the derivative is nullable, it means that the input has been successfully matched by the regular expression (or consumed by the state machine, if you prefer) and no additional input is necessary for a successful match.
The rules for the nullability function are as follows:
Figure 27 Rules for nullability function ν
ν(ε) => ε
ν(∅) => ∅
ν(true) => ε
ν(a) => ∅
ν(E*) => ε
ν(E,F) => <ν(E), ν(F)>
ν(<E,F>) => <ν(E), ν(F)>
ν(E|F) => ν(E) | ν(F)
ν(!E) => ε if ν(E) = ∅
ν(!E) => ∅ if ν(E) = ε
ν(F&G) => <ν(F), ν(G)>
ν(F?G:H) => (ν(F)?ν(G):ν(H))
ν(F?) => ε
ν(F⊕G) => ν(F) ⊕ ν(G) = (ν(F)?ν(!G):ν(G))
ν(wc(pat)) => ∅
ν(wc(N,P)) => ∅
ν(F@T) => ν(F)
ν(•) => ∅
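Transcribed into code over the constructors sketched earlier, with ε and ∅ returned as expressions so the results can be composed back into derivatives:

```python
# Sketch: the nullability function ν of Figure 27.
def nullable(e):
    t, f = Eps(), Empty()
    if isinstance(e, (Eps, Anything, Star, Opt)):
        return t
    if isinstance(e, (Empty, Sym, Dot, Wild)):
        return f
    if isinstance(e, (Seq, Inter, Simul)):
        # <ν(E),ν(F)> collapses to ε only when both sides are ε
        return t if (nullable(e.left), nullable(e.right)) == (t, t) else f
    if isinstance(e, Alt):
        return t if t in (nullable(e.left), nullable(e.right)) else f
    if isinstance(e, Not):
        return f if nullable(e.body) == t else t
    if isinstance(e, Xor):
        return t if nullable(e.left) != nullable(e.right) else f
    if isinstance(e, Cond):
        return nullable(e.then) if nullable(e.test) == t else nullable(e.els)
    if isinstance(e, Marked):                  # ν(F@T) => ν(F)
        return nullable(e.base)
    raise TypeError(e)
```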
Type Matching
Type matching checks the type constraints of a type T
against a particular value v
. It returns ε
if the value matches the type constraints of the type and ∅
otherwise.
Figure 28 τ basic rules

τ(v,T) expression | τ(v,T) result
---|---
τ(•,T) | ε
τ(v,T) | ε if ∀f∈facets(T) τ(v,f)=ε, otherwise ∅
τ(v,ε) | ε
τ(v,∅) | ∅
Type-matching evaluates the various facets according to their semantics, something like:
Figure 29 τ for type facets
τ(v,{kind:x}) => [v isa x]
τ(v,{simpleType:x}) => [v instanceof x]
τ(v,{enum:[x1,x2,...,xn]}) => [∃xi|v=xi]
τ(v,{fixed:x}) => [v=x]
τ(v,{minLength:x}) => [!(v isa string) ∨ string-length(v) >= x]
τ(v,{maxLength:x}) => [!(v isa string) ∨ string-length(v) <= x]
τ(v,{pattern:x}) => [!(v isa string) ∨ matches(v,x)]
τ(v,{multipleOf:x}) => [!(v isa number) ∨ (v mod x)=0]
τ(v,{minimumInclusive:x}) => [!(v isa number) ∨ v >= x]
τ(v,{minimumExclusive:x}) => [!(v isa number) ∨ v > x]
τ(v,{maximumInclusive:x}) => [!(v isa number) ∨ v <= x]
τ(v,{maximumExclusive:x}) => [!(v isa number) ∨ v < x]
τ(v,{minProperties:x}) => [!(v isa object) ∨ count(keys(v)) >= x]
τ(v,{maxProperties:x}) => [!(v isa object) ∨ count(keys(v)) <= x]
τ(v,{propertyNames:x}) => [!(v isa object) ∨ ∀k∈keys(v)|τ(k,x)=ε]
τ(v,{minItems:x}) => [!(v isa array) ∨ count(items(v)) >= x]
τ(v,{maxItems:x}) => [!(v isa array) ∨ count(items(v)) <= x]
τ(v,{uniqueItems:x}) => [!(v isa array) ∨ !x ∨ count(items(v))=count(distinct(items(v)))]
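A sketch of the facet checks in Python, with True and False standing in for ε and ∅. Only representative facets are shown, and Python's bool-is-an-int quirk is glossed over:

```python
# Sketch: τ evaluated facet by facet, per Figure 29.
import re

NUM = (int, float)

def check_facet(v, name, x) -> bool:
    if name == "kind":
        return {"object": lambda: isinstance(v, dict),
                "array":  lambda: isinstance(v, list),
                "string": lambda: isinstance(v, str),
                "boolean": lambda: isinstance(v, bool),
                "number": lambda: isinstance(v, NUM),
                "null":   lambda: v is None}[x]()
    if name == "enum":  return v in x
    if name == "fixed": return v == x
    if name == "minLength": return not isinstance(v, str) or len(v) >= x
    if name == "maxLength": return not isinstance(v, str) or len(v) <= x
    if name == "pattern":
        return not isinstance(v, str) or re.search(x, v) is not None
    if name == "multipleOf": return not isinstance(v, NUM) or v % x == 0
    if name == "minimumInclusive": return not isinstance(v, NUM) or v >= x
    if name == "minimumExclusive": return not isinstance(v, NUM) or v > x
    if name == "maximumInclusive": return not isinstance(v, NUM) or v <= x
    if name == "maximumExclusive": return not isinstance(v, NUM) or v < x
    raise KeyError(name)

def typematch(v, facets: dict) -> bool:
    """τ(v,T): every facet of the type must accept the value."""
    return all(check_facet(v, name, x) for name, x in facets.items())
```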
There is an additional rule for the τ function that handles subschemas, via the model facet. This will make more sense after we have looked at some examples:

Figure 30 τ over the model facet

τ(v,{model:E}) => ν(ΔE/Δ•:v)

Given this rule, we can observe that a JSON value is valid with respect to a JSON schema if and only if the type matching function for the mapped JSON schema is ε.
Relations and Simplifications
There are a number of relations and simplifications of the derivative expressions that come in useful in computing derivatives and keeping the size of the derivatives in reasonable bounds.
First, some of the operators are associative:[11]
Figure 31 Derivative associativity
(E|(F|G)) = ((E|F)|G) = (E|F|G)
(E,(F,G)) = ((E,F),G) = (E,F,G)
(E&(F&G)) = ((E&F)&G) = (E&F&G)
(F⊕G)⊕H = F⊕(G⊕H) = F⊕G⊕H
<<F,G>,H> = <F,<G,H>> = <F,G,H>
Some of the operators are also commutative:
Figure 32 Derivative commutativity
E|F = F|E
E&F = F&E
E⊕F = F⊕E
<E,F> = <F,E>
There are straightforward simplifications involving ε, ∅, and redundant expressions. We will make use of these, generally without comment, in the examples.

Figure 33 Derivative simplification identities
E|E = E
ε&E = E&ε = E
∅&E = E&∅ = ∅
∅|E = E|∅ = E
ε,E = E,ε = E
∅,E = E,∅ = ∅
E⊕E = ∅
E⊕!E = ε
∅⊕E = E⊕∅ = E
(∅?F:G) = G
(ε?F:G) = F
<E,E> = E
<E,!E> = ∅
<ε,E> = <E,ε> = E
<∅,E> = <E,∅> = ∅
These identities mean that, given a function or operator f that can only take the values ε or ∅, such as the type matching function or the nullability function:

Figure 34 Boolean function equivalence

<f,E> = f,E
Negation behaves as expected with respect to boolean operators:
Figure 35 Derivative negation identities
!(!E) = E
!<E,F> = !E|!F
!(E|F) = <!E,!F>
!(E⊕F) = <!E,!F>|<E,F>
Finally, certain of the extension operators can be defined in terms of the others.
This can be useful for proving some of these derivations, but it also gives us some
comfort that we are not actually enhancing extended regular expressions beyond recognition:
Figure 36 Relations between extension operators
F⊕G = <F|G,!F|!G>
(F?G:H) = <F,G>|<!F,H>
JSON Schema Validation
Again, following Sperberg-McQueen05, we can use derivatives for validation. Given a JSON value v and a JSON schema mapped to a type T, where T is a member of a set of mapped types accessible via λ: v is valid with respect to T if and only if the type check τ(v,T) returns ε. If the mapping includes a mapped {model:E}, then the calculation of τ will compute the nullability of the derivative of E with respect to the nameless value, ν(ΔE/Δ•:v), per the type matching rule given previously.
The overall procedure is to compute the mapping of the schema. This will give us a
type whose identifier (given by κ
) will be the type to match. Where the mapping produces a model, computing the nullability
of the derivative of this model with respect to the nameless JSON value in hand tells
us whether that value is valid with respect to the schema.
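Pulling the sketches together, the driver might look like this. mu and kappa (the μ and κ of the Mapping section) are assumed; check_facet, deriv_value, nullable, and Eps are from the sketches above:

```python
# Sketch of the overall procedure: map the schema once, then type-check
# the instance against the mapped root type.
def valid(instance, schema) -> bool:
    types = mu(schema)                    # mapping: type id -> facets
    return typematch_type(instance, types, kappa(schema))

def typematch_type(v, types, type_id) -> bool:
    """τ(v,λ(T)): every facet must accept v; a model facet is checked by
    taking the derivative with respect to the nameless value (Figure 30)."""
    facets = dict(types[type_id])         # λ lookup
    model = facets.pop("model", None)
    if not all(check_facet(v, n, x) for n, x in facets.items()):
        return False
    return model is None or nullable(deriv_value(model, v)) == Eps()
```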
Some examples might help:
Examples
Simple Type
Let's start with a very simple case. Is the value 47 valid with respect to this schema?

Figure 37 Trivial schema

{ type:"number" }
We would certainly expect it to be. Let's walk through the mechanics.
The mapping of the schema gives us just:[12]
Figure 38 Mapping of trivial schema
T0: { kind:number, simpleType:"xs:numeric" }
We need to compute the type matching function, which is simple enough (here = shows the chain of derivations):

Figure 39 Evaluation of τ
τ(47,λ(T0)) =
τ(47,{kind:number, simpleType:"xs:numeric"}) =
[47 isa number],[47 instanceof xs:numeric] = ε
Since the result is ε, the value is valid with respect to the schema.
More Complicated Example
Let's take a more complicated example, the schema from Figure 4, reproduced here:
Figure 40 A tricky schema
{
"type": "object",
"required": ["a","b"],
"oneOf": [
{
"properties": {
"a": { "type": "number", "minimum": 0 },
"b": { "type": "number", "minimum": 0 },
"c": { "type": "number" }
},
"additionalProperties": false
},
{
"properties": {
"a": { "type": "number", "maximum": 0 },
"b": { "type": "number", "maximum": 0 },
"d": { "type": "number" }
},
"additionalProperties": false
}
]
}
Mapping gives us:
Figure 41 Mapping of tricky schema
T0: {
kind:object,
model:
<
(a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?),
(a & b & wc(-[a,b],[])*)
>
}
T1: {
kind:number,
simpleType:"xs:numeric",
minimumInclusive:0
}
T2: {
kind:number,
simpleType:"xs:numeric"
}
T3: {
kind:number,
simpleType:"xs:numeric",
maximumInclusive:0
}
We will henceforth abbreviate wc(-[a,b],[]) as just wc().
Let's see if this object is valid:
Figure 42 JSON object to validate against tricky schema
{
a: 1,
b: -1,
c: 2
}
To do that, we need to compute the type function:
Figure 43 Computing τ
τ({a:1,b:-1,c:2}, λ(T0)) =
τ({a:1,b:-1,c:2},
{
kind:object,
model:
<
(a@T1? & b@T1? & c@T2?) ⊕
(a@T3? & b@T3? & d@T2?),
(a & b & wc()*)
>
}) =
ν(Δ<
(a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?),
(a & b & wc()*)
>/Δ•:{a:1,b:-1,c:2}
)
To do that, we need to compute the following derivative:
Figure 44 Derivative to be computed
Δ<
(a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?),
(a & b & wc()*)
>/Δ•:{a:1,b:-1,c:2}
This expression is equal to:
Figure 45 Derivative with respect to a:1

(
Δ<
(a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?),
(a & b & wc()*)
>/Δa:1
)/Δ•:{b:-1,c:2}
Let's unroll that a bit:
Figure 46 Moving a:1 inside simultaneous and exclusive or

(
Δ<
(a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?),
(a & b & wc()*)
>/Δa:1
)/Δ•:{b:-1,c:2} =
(
<
Δ(
(a@T1? & b@T1? & c@T2?) ⊕
(a@T3? & b@T3? & d@T2?)
)/Δa:1,
Δ(a & b & wc()*)/Δa:1
>
)/Δ•:{b:-1,c:2} =
(
<
Δ(a@T1? & b@T1? & c@T2?)/Δa:1 ⊕
Δ(a@T3? & b@T3? & d@T2?)/Δa:1,
Δ(a & b & wc()*)/Δa:1
>
)/Δ•:{b:-1,c:2}
Let's consider the left side of the exclusive or (⊕):
Figure 47 Moving a:1 inside intermix

Δ(a@T1? & b@T1? & c@T2?)/Δa:1 =
Δ(a@T1?)/Δa:1 & b@T1? & c@T2? |
a@T1? & Δ(b@T1?)/Δa:1 & c@T2? |
a@T1? & b@T1? & Δ(c@T2?)/Δa:1
Evaluating the first clause gives us ε because the symbol and type match:

Figure 48 Computing derivative of a@T1?

Δ(a@T1?)/Δa:1 = Δa@T1/Δa:1 | ν(a@T1) =
[a=a],τ(1,λ(T1)) | ν(a@T1) =
τ(1,
{
kind:number,
simpleType:"xs:numeric",
minimumInclusive:0
}
) = ε

Figure 49 Derivative of first full intermix clause

Δ(a@T1?)/Δa:1 & b@T1? & c@T2? = b@T1? & c@T2?
So far so good! What about the other alternatives? These don't match (leading to a derivative of ∅) because b and c do not match a and are not nullable. Since we have:

Figure 50 Computing derivatives of b@T1? and c@T2?

Δ(b@T1?)/Δa:1 = Δb@T1/Δa:1 | ν(b@T1) = ∅ | ∅ = ∅
Δ(c@T2?)/Δa:1 = Δc@T2/Δa:1 | ν(c@T2) = ∅ | ∅ = ∅

the other full intermix clauses are ∅:

Figure 51 Derivative of other full intermix clauses

a@T1? & Δ(b@T1?)/Δa:1 & c@T2? = a@T1? & ∅ & c@T2? = ∅
a@T1? & b@T1? & Δ(c@T2?)/Δa:1 = a@T1? & b@T1? & ∅ = ∅

and we are still left with:

Figure 52 Derivative of first clause of exclusive or

Δ(a@T1? & b@T1? & c@T2?)/Δa:1 = b@T1? & c@T2?
On the other clause of the exclusive or (⊕) we have:

Figure 53 Moving a:1 inside intermix

Δ(a@T3? & b@T3? & d@T2?)/Δa:1 =
Δ(a@T3?)/Δa:1 & b@T3? & d@T2? |
a@T3? & Δ(b@T3?)/Δa:1 & d@T2? |
a@T3? & b@T3? & Δ(d@T2?)/Δa:1
Here the first alternative fails, because the type check fails:

Figure 54 Computing derivative of a@T3?

Δ(a@T3?)/Δa:1 =
Δa@T3/Δa:1 | ν(a@T3) =
[a=a],
τ(1,
{
kind:number,
simpleType:"xs:numeric",
maximumInclusive:0
}
) = ∅
The other alternatives don't fare any better, for reasons similar to the case on the other side (b and d do not match a and are not nullable):

Figure 55 Computing derivatives of b@T3? and d@T2?

Δ(b@T3?)/Δa:1 = Δb@T3/Δa:1 | ν(b@T3) = ∅
Δ(d@T2?)/Δa:1 = Δd@T2/Δa:1 | ν(d@T2) = ∅

Both are ∅.
Putting this all together, we have got to
Figure 56 Summary: derivative of exclusive or clause in context
(
<
(b@T1? & c@T2?) ⊕ ∅,
Δ(a & b & wc()*)/Δa:1
>
)/Δ•:{b:-1,c:2}
The clause coming from the required property is a little easier since it doesn't involve any types:

Figure 57 Pushing a:1 inside intermix

Δ(a & b & wc()*)/Δa:1 =
(Δa/Δa:1 & b & wc()* |
a & Δb/Δa:1 & wc()* |
a & b & Δwc()*/Δa:1) =
b & wc()*

Δb/Δa:1 is ∅ (eliminating the second alternative) and, since the wildcard excludes a:

Figure 58 Derivative of wildcard

Δwc()*/Δa:1 = Δwc()/Δa:1,wc()* = ∅,wc()* = ∅

(eliminating the third).
Whew! We've consumed the first property. On to b!

Figure 59 Pushing b:-1 inside simultaneous and exclusive or

Δ<(b@T1? & c@T2?) ⊕ ∅, b & wc()*>/Δ•:{b:-1,c:2} =
<
Δ(b@T1? & c@T2?)/Δb:-1 ⊕ Δ∅/Δb:-1,
Δ(b & wc()*)/Δb:-1
>/Δc:2
Type matching b fails. We have:

Figure 60 Pushing b:-1 into intermix

Δ(b@T1? & c@T2?)/Δb:-1 =
(Δ(b@T1?)/Δb:-1 & c@T2?) | (b@T1? & Δ(c@T2?)/Δb:-1) =
([b=b],τ(-1,λ(T1)) & c@T2?) | (b@T1? & ∅)

Figure 61 Computing τ of -1

τ(-1,
{
kind:number,
simpleType:"xs:numeric",
minimumInclusive:0
}
) = ∅

Figure 62 Derivative of ∅ with respect to b:-1

Δ∅/Δb:-1 = ∅

Figure 63 Summary: derivative of exclusive or

Δ(b@T1? & c@T2?)/Δb:-1 ⊕ Δ∅/Δb:-1 = ∅ ⊕ ∅
On the other half of the simultaneous operator we have:

Figure 64 Derivative of intermix with respect to b:-1

Δ(b & wc()*)/Δb:-1 =
Δb/Δb:-1 & wc()* | b & Δwc()*/Δb:-1 =
wc()* | b & (Δwc()/Δb:-1,wc()*) =
wc()* | b & (∅,wc()*) =
wc()* | b & ∅ =
wc()*
We're down to the final symbol, which is where things fail entirely, because the exclusive or fails, dragging everything else down with it:

Figure 65 Derivative with respect to c:2

Δ<∅ ⊕ ∅, wc()*>/Δc:2 =
Δ<∅, wc()*>/Δc:2 =
Δ∅/Δc:2 =
∅

Since ∅ is not nullable, the object is not valid with respect to the schema.
Array Example
Let's look at an array example to show how arrays unroll. Is this array:

Figure 66 JSON array

[1,"a","b"]

valid with respect to this schema?

Figure 67 Array schema

{
type:"array",
items:[{type:"number"},{type:"string"}],
additionalItems:false
}
The mapped schema:
Figure 68 Mapped array schema
T0: {
kind:array,
model: •@T1,•@T2
}
T1: {
kind:number, simpleType:"xs:numeric"
}
T2: {
kind:string, simpleType:"xs:string"
}
We need to compute:
Figure 69 Evaluation of τ
τ([1,"a","b"],λ(T0)) =
τ([1,"a","b"],{kind:array}),τ([1,"a","b"], {model: •@T1,•@T2})
The kind check passes because the value is an array, leaving the derivative we need to compute being:

Figure 70 Unrolling •:[1,"a","b"]

Δ(•@T1,•@T2)/Δ•:[1,"a","b"] =
((Δ(•@T1,•@T2)/Δ•:1)/Δ•:"a")/Δ•:"b"
Starting with the first item:
Figure 71 Derivative with respect to •:1

Δ(•@T1,•@T2)/Δ•:1 =
Δ(•@T1)/Δ•:1,•@T2 | ν(•@T1),Δ(•@T2)/Δ•:1 =
τ(1,λ(T1)),•@T2 =
τ(1,{kind:number, simpleType:"xs:numeric"}),•@T2 =
•@T2
The second item ("a") also works:

Figure 72 Derivative with respect to •:"a"

Δ(•@T2)/Δ•:"a" =
τ("a",{kind:string, simpleType:"xs:string"}) = ε
But things run aground on the third item ("b") because we run out of expressions to match against the value:

Figure 73 Derivative with respect to •:"b"

Δε/Δ•:"b" = ∅
Here we have a non-nullable derivative, and the array is not valid with respect to the schema.
Nested Object Example
Finally, let's look at an example where we have nested objects.
Here is the schema:
Figure 74 Nested object schema
{
type: "object",
properties: {
"a": {
type: "object",
properties: {
"b": { type: "integer" }
},
additionalProperties: false
}
},
additionalProperties: false
}
This maps to:
Figure 75 Mapping of nested object schema
T0 = { kind:object, model: a@T1? }
T1 = { kind:object, model: b@T2? }
T2 = { kind:number, simpleType:"xs:integer" }
Let's validate the JSON object
Figure 76 JSON object with nesting
{
a: { c: false }
}
In the familiar way, we begin with the type matching of the top-level type:

Figure 77 Evaluation of τ

τ({a:{c:false}}, λ(T0)) =
τ({a:{c:false}}, {kind:object}),
τ({a:{c:false}}, {model: a@T1?})

The kind check passes, so the derivative to be computed is:

Figure 78 Derivative to be computed

Δ(a@T1?)/Δa:{c:false}
Resolving this:

Figure 79 Computing derivative of a@T1?

Δ(a@T1?)/Δa:{c:false} =
Δ(a@T1)/Δa:{c:false} | ν(a@T1) =
[a=a],τ({c:false},λ(T1)) =
τ({c:false},λ(T1))
Here is where things get interesting, and we move down to the derivative of the subschema:
Figure 80 Recursion to subschema
τ({c:false},λ(T1)) =
τ({c:false},{kind:object}),
τ({c:false}, {model: b@T2?}) =
ν(Δ(b@T2?)/Δc:false)
The derivative for the nested object fails:

Figure 81 Derivative of b@T2?

Δ(b@T2?)/Δc:false = Δb@T2/Δc:false | ν(b@T2) = ∅
Therefore:

Figure 82 Computing τ

τ({c:false},λ(T1)) = ∅

Figure 83 Popping the recursion

Δ(a@T1?)/Δa:{c:false} = ∅

This makes τ({a:{c:false}},λ(T0)) equal to ∅, so the object is not valid with respect to the schema.
Implementation Considerations
The rules given above for mapping object-related properties produce very weak expressions,
which require a lot of work to resolve. We could do slightly better if we recall that
properties are unique in a JSON object and assume we can always access those properties
in a particular order, say, in codepoint order. Then we can use the sequence operator
instead of the intermix operator when we don't have wildcards. We could also use the
sequence operator and just insert wildcards everywhere, but that is not better than
just using intermix. Because JSON property names are unique within a given object,
it is easy enough to prune the branches, especially if we recognize the pattern:
Figure 84 A common pattern arising from JSON objects
Δ(p1@T1? & p2@T2? & .. & pn@TN? & wc()@S*)/Δx:v =
(Δp1@T1?/Δx:v & p2@T2? & ... & pn@TN? & wc()@S*) |
(p1@T1? & Δp2@T2?/Δx:v & ... & pn@TN? & wc()@S*) |
...
(p1@T1? & p2@T2? & ... & Δpn@TN?/Δx:v & wc()@S*) |
(p1@T1? & p2@T2? & ... & pn@TN? & Δwc()@S*/Δx:v)
Furthermore, if x equals any pj, we know this can only be nullable if we pick the branch where we take the derivative of that property, because the derivative of any other pi or of the wildcard will be ∅ (the wildcard excludes all property names mapped from properties). That is:

Figure 85 Simplifying a common pattern

Δ(
p1@T1? & p2@T2? & ... & pn@Tn? & wc(-[p1,...,pn])@S*
)/Δx:v =
([∃pj|pj=x] ?
(τ(v,Tj),
(p1@T1? & ... & pj-1@Tj-1? & pj+1@Tj+1? & ...
& pn@Tn? & wc()@S*)
) :
(τ(v,S),
(p1@T1? & p2@T2? & ...
& pn@Tn? & wc(-[p1,...,pn])@S*)
)
)

Figure 86 Full simplification of a common pattern

Δ(
p1@T1? & p2@T2? & ... & pn@Tn? & wc(-[p1,...,pn])@S*
)/Δ{q1:v1,q2:v2,...,qm:vm} =
τ(q1,S1),τ(q2,S2),...,τ(qm,Sm),
(r1@V1? & r2@V2? & ... & wc(-[p1,...,pn])@S*)
where Sj=([∃i|pi=qj]?Ti:S) and
the rk@Vk are the pi@Ti where ∄j|pi=qj
It is possible to create yet another extension operator for this case and handle it specially as a performance optimization.
To make this whole scheme worthwhile, we want to amortize the cost of computing derivatives, both within a particular validation episode and across validation episodes. Indeed, the major theme of Brzozowski's paper is that it can be used to construct a finite state automaton for matching the complete extended regular expression. Having constructed such a thing, it can be used again and again. We choose to be slightly more incremental here, and only compute the parts of the machine we actually need, as we go. Experience with this technique in XML Schema has shown that this is more cost effective, as there can be large parts of the state machine that never get accessed by actual data, especially with large schemas.
If we don't worry about type matching, it is pretty straightforward to create a cache of derivative expressions keyed off an expression ID and a symbol.[13] The expression ID needs to be computed by memoizing it, lest we keep storing the same derivative computation over and over again. That is, from the state machine point of view, we don't want to create lots of redundant states. It is reasonably straightforward to canonicalize expressions for memoization.
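A sketch of such a cache, using structural equality on the (frozen) expression objects as a stand-in for full canonicalization:

```python
# Sketch: memoized expression IDs and a derivative cache keyed off
# (expression id, symbol), building the state machine incrementally.
_ids: dict = {}       # canonical expression -> memoized small-integer id
_cache: dict = {}     # (expression id, symbol) -> derivative expression

def expr_id(e) -> int:
    """Memoize expressions so equal derivatives share a state id."""
    return _ids.setdefault(e, len(_ids))

def cached_deriv(e, a):
    """Symbol-only derivative, computed at most once per (state, symbol)."""
    key = (expr_id(e), a)
    if key not in _cache:
        _cache[key] = deriv(e, a)   # deriv from the Derivatives section
    return _cache[key]
```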
It is helpful, in practice, to prune the computed derivatives as much as possible,
precomputing any of the predicate functions (such as [a=x]
or ν
which lead to ε
and ∅
values) that can be used to prune further. Here is where the type-matching causes
us trouble: we could compute this and prune the derivative, but not if we wish to
cache the result, because the cache is keyed only off the symbol, and τ
evaluation depends on the value as well.
What is to be done?
There are three basic options:
- Don't cache derivatives at all. We can then prune based on τ results without constraint.
- Include the value in the cache key, so that we cache the appropriate derivative for the symbol:value pair.
- Cache "partial" derivatives: derivatives where all τ calls are replaced with a marker to be evaluated using the actual value presented. When a derivative is pulled from the cache, if it has any unresolved markers, evaluate them and return the pruned result.
Not caching means a lot of work calculating answers: we'd probably be better off just
using a more direct algorithm and not using derivatives at all. The whole point of
this technique was to compute the finite state machine (here implemented in pieces
via the cache). Including values in the cache key means that the derivative cache
will have a lot of duplicative entries and looks like an unpromising approach, especially
when you remember that some of these values are whole JSON objects and arrays. We
essentially sacrifice the finiteness of states, as a practical matter. A more promising
variation is to cache partial derivatives and add a secondary cache for the unresolved
τ
calculations. The τ
cache only needs to store one bit of information for each type plus value, and we
can make different decisions about the sizing and lifespan of entries for the two
caches. The partial derivative cache (∂
cache) can be readily shared across validation episodes, while the τ
cache could live for only a single episode. We can tune whether the τ
cache caches objects and arrays at all, or if so, how large they might be before
we don't bother.
Performance
Measure, because it is never what you think
This test performs a simple validation on a small document (~3K) with up to five levels of nesting, a Medline citation, transcribed directly from its XML representation. The JSON Schema imposes structural constraints (required properties, and sub-properties) as well as some value and simple type constraints. The document is repeatedly validated in a single-threaded loop under various regimes. As a baseline, an XML Schema validation of the XML version of the document is also tested (also using derivatives and the same baseline simple type facet testing). The documents are valid against their respective schemas and both documents and schema files are pulled from a pre-warmed cache. The table shows both the validations per second for the first validation, which for some of the regimes is the one that does most of the work, and for subsequent validations.
Format | Cached schemas? | Derivatives? | ∂ cache? | τ cache? | V/s (non-initial) | V/s (initial) |
---|---|---|---|---|---|---|
XSD | yes | yes | N/A | N/A | 15088 | 678 |
JSON Schema | yes | yes | no | yes | 12618 | 610 |
JSON Schema | yes | yes | yes | yes | 11380 | 415 |
JSON Schema | yes | no | N/A | N/A | 9419 | 1071 |
JSON Schema | no | no | N/A | N/A | 2375 | 1720 |
JSON Schema | no | yes | no | yes | 650 | 617 |
JSON Schema | no | yes | yes | yes | 500 | 461 |
JSON Schema | yes | yes | no | no | 220 | 203 |
JSON Schema | yes | yes | yes | no | 207 | 617 |
JSON Schema | no | yes | no | no | 201 | 210 |
JSON Schema | no | yes | yes | no | 178 | 178 |
The variance in these numbers is quite high, although the relative ordering is generally the same.
What is perhaps most interesting here are the negative results:
- The simple recursive walk algorithm with no caching is not greatly worse than any derivative algorithm, regardless of caching strategy. Given its substantially better initialization cost compared to the derivative algorithms, this suggests that where the same schema is not substantially reused, the derivative algorithms aren't worth the bother. This is despite the fact that this test is something of a best case scenario for the caching in the derivative algorithms.
- Caching partial derivatives doesn't help performance in this scenario: it hurts slightly, in fact.
I believe the explanation for the first negative result is that a lot of the work, at least in this schema, is located in the τ checking, not in the property relation constraints, and that the models are simple enough that the recursive walk doesn't have to do much beyond looking at the current set of properties, so the work of computing, constructing, and caching derivatives doesn't really pay.
As for the partial derivatives, I think what is going on here is that (1) the τ cache is in play regardless, and in this scenario, with the same document repeated over and over, it provides a strong benefit, and (2) resolving partial derivatives requires constructing more objects to represent the resolved derivative, and the dominant cost here is time spent in heap allocation. We can see the strong impact of the τ cache from the nearly two-orders-of-magnitude hit to performance when we turn it off. This scenario is clearly a best case for the τ cache, as these numbers show.
The strongest performance win here is from simply having compiled schema objects cached, which gives about a 5 times gain for the recursive walk algorithm, and a 20 times gain for the derivative algorithm. The usefulness of the derivative algorithm is much stronger for cached schemas, unsurprisingly, as the work of computing the derivatives can be amortized over multiple validation episodes.
Similar results obtain from running multiple validations in parallel. This test is a validate-on-load one, where 18 thousand Medline citations are loaded into a database after being validated, using 20 threads. Here performance is lower and more uniform across regimes, probably because the update operations dominate overall. The data is also more varied, so the impact of the τ cache is substantially reduced. This test doesn't separate out the costs of the initial use of the schema; in effect, each client thread may end up doing some of the initialization work redundantly in parallel, which impacts the amortization.
Format | Cached schemas? | Derivatives? | ∂ cache? | τ cache? | V/s |
---|---|---|---|---|---|
JSON Schema | yes | no | N/A | N/A | 868 |
JSON Schema | yes | yes | no | yes | 701 |
JSON Schema | yes | yes | no | no | 675 |
JSON Schema | yes | yes | yes | yes | 608 |
JSON Schema | yes | yes | yes | no | 588 |
Again we see the same fundamental negative results: derivatives aren't much better (and here are a little worse, in fact) than a simple recursive walk and derivative caching is not particularly helpful.
This is a somewhat disappointing discovery.
It is entirely possible that there are deficiencies in the specific implementation strategy here. While tying model checking directly to τ simplifies the model of what is going on, I don't think it is a great implementation strategy: you end up computing τ recursively on complex objects over and over as you move through branches of their parent type's derivative. The τ cache helps here somewhat, but it might be better to deal with this as a local validation instead. I believe the numbers from the XML Schema implementation are instructive here, because that is how it works: local validity is worked out from the bottom up. Another approach that might work better is a scheme to cache the fully resolved derivatives as well, keyed off the partial derivative expression ID and the actual values of whatever unevaluated τ terms it might contain. That would avoid repeated construction of new expression objects whenever we resolve from the ∂ cache.
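A sketch of that variation, again with placeholder names (tau_terms, resolve, check_facets) rather than the actual implementation's API:

def tau_terms(partial):
    """Placeholder: the unevaluated tau terms in a partial derivative."""
    raise NotImplementedError

def resolve(partial, bits):
    """Placeholder: substitute tau answers into the partial derivative."""
    raise NotImplementedError

def check_facets(type_id, value):
    """Placeholder for the tau function."""
    raise NotImplementedError

resolved_cache = {}

def resolved_derivative(partial_id, partial, value):
    # Evaluate each outstanding tau term against the actual value. The
    # resulting bit tuple plus the partial's ID uniquely identifies the
    # resolved expression, so each distinct resolution is built at most once.
    bits = tuple(check_facets(t, value) for t in tau_terms(partial))
    key = (partial_id, bits)
    if key not in resolved_cache:
        resolved_cache[key] = resolve(partial, bits)
    return resolved_cache[key]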
It is possible to reclaim finiteness and perform full caching by vectorizing the values. Take each distinct facet anywhere in the schema: each slot in the vector is 1 or 0 depending on whether a particular value satisfies that facet, and the vector compresses to a binary value of however many bits. While this recovers finiteness, evaluating many more facets than necessary in the moment is not without cost, so it isn't clear under what conditions this could be a performance win.
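As a rough illustration, assuming a toy facet list collected from some hypothetical schema:

# One slot per distinct facet found anywhere in the schema. This facet
# list is invented for illustration; a real schema compiler would gather
# it during mapping.
FACETS = [
    lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    lambda v: isinstance(v, (int, float)) and not isinstance(v, bool) and v >= 0,
    lambda v: isinstance(v, str),
    lambda v: isinstance(v, str) and len(v) <= 10,
]

def facet_vector(value):
    # Pack one bit per facet into an integer. Values with equal vectors
    # are indistinguishable to the schema, so the state space is finite
    # again -- at the cost of evaluating every facet up front.
    bits = 0
    for i, facet in enumerate(FACETS):
        if facet(value):
            bits |= 1 << i
    return bits

assert facet_vector(5) == 0b0011     # number, non-negative
assert facet_vector("hi") == 0b1100  # string, short enough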
It may well also be the case that the optimization of folding the required property into the mapping for object composition properties is no optimization at all. Perhaps we'd be better served with a mapping like:
Figure 87 Alternative mapping for object properties
μM({
properties: {p1:s1,...pm:sm},
patternProperties: {r1:pt1,...,ro:pto},
required:[rq1,...,rqn]
}) =>
{model:
< rq1 & ... & rqn & wc(-[rq1,...,rqn])*,
p1@κ(s1)? & ... & pm@κ(sm)? &
wc(-[p1,...,pm],+[r1,...,ro])*,
(wc(r1)@κ(pt1) | wc(-[],-[r1]))*,
(wc(r2)@κ(pt2) | wc(-[],-[r2]))*,
...
(wc(ro)@κ(pto) | wc(-[],-[ro]))*
>
}
Other Applications of Type-Tagged Regular Expressions
Type-tagged regular expressions like this can be applied to other problems, or can be used to recast other problems in different terms.
First, most obviously, we can recast XML Schema validation in these terms as well. The mapping function is different (rather more straightforward). We do need additional wildcard functions (see Sperberg-McQueen). The τ function evaluates the constraining facets of simple types, analogously to what we have above, and looks at the nullability of the derivative of the content models of complex types. These content models will have their type information retained, tagged either with the type name itself (for named types) or with a generated stable ID. We don't have a single top-level type, so we kick things off by identifying the appropriate element declaration (if any) and computing τ with respect to its type. Attribute checking folds into this model as well, through appropriate mapping. The details of this are beyond the scope of this paper, and work for another day.
Whether evaluated with derivatives or not, type-tagged regular expressions provide a framework for clearer expression of certain kinds of patterns, or for a more general understanding of certain kinds of regular expressions. Consider, for example, a conventional regular expression[14]:
Figure 88 Regular expression
(([0-1]\d)|(2[0-4]))-([0-5]\d)-([0-5]\d)(.\d\d\d)?(.{3,3})
A type-tagged expression of this could be something like:
Figure 89 Tagged version of regular expression
(\d,\d)\@{HOUR}-(\d,\d)\@{MINUTE}-(\d,\d)\@{SECOND}(.,\d,\d,\d)?,(.{3,3})\@{TZ}
Figure 90 Associated type definitions
HOUR = {minimumInclusive:0, maximumInclusive:24}
MINUTE = {minimumInclusive:0, maximumExclusive:60}
SECOND = {minimumInclusive:0, maximumExclusive:60}
TZ = {enum:["PST","PDT",....]}
(Note that here I use \@{type} as a marker rather than the bare @type.)
I believe this is a clearer expression of the intention and meaning of the desired pattern, and may be useful for expositional clarity, if nothing else.
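For instance, here is a minimal sketch in Python of the Figure 90 definitions as plain τ predicates over the lexemes captured by the tagged subexpressions (the TZ enum is truncated here, as it is in the figure):

TAU = {
    "HOUR":   lambda s: 0 <= int(s) <= 24,
    "MINUTE": lambda s: 0 <= int(s) < 60,
    "SECOND": lambda s: 0 <= int(s) < 60,
    "TZ":     lambda s: s in {"PST", "PDT"},
}

def tau_check(tag, lexeme):
    # The structural part (two digits, three characters, ...) is assumed
    # to have been matched by the regular expression already; tau adds
    # only the value constraint attached to the tag.
    return TAU[tag](lexeme)

assert tau_check("HOUR", "23")
assert not tau_check("MINUTE", "75")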
There are extensions to regular expressions that capture a certain class of τ functions in this fashion already: Unicode character category and block escapes:
Figure 91 Regular expressions with Unicode escapes
((\p{Lu}\p{Ll}*)_)*
\p{IsGreek}*
The first of these expressions matches camel-cased strings of letters with an underbar
separator (e.g. "Example_Match") and the second matches a sequence of Greek characters.
Rewriting them as type-tagged expressions in the notation of this paper:
Figure 92 Reinterpreting Unicode escapes as type tags
((wc()@Lu,(wc()@Ll)*),_)*
(wc()@IsGreek)*
Clearly this is not a compact rewrite! However, it does allow us to think about the escape less as a special case than as a way of naming something whose significance is defined by the τ function. Let us reimagine the \p escape as naming a type, where the τ function in this case is either "matches a Unicode category with the given name" or "matches a Unicode block with the same name as what follows the 'Is'".
One way to think about this is that we have a representation of a state machine with τ representing a conditional function on the transitions. There is no reason (other than computational efficiency) why τ needs to be a "type" function at all. What restrictions should we place on τ? Any?
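As one way of exploring that question, here is a minimal sketch of a derivative-based matcher in which every symbol is an arbitrary predicate; Unicode category tests like \p{Lu} then need no special casing. This is purely illustrative (no expression simplification or caching), not the implementation measured above:

import unicodedata

class Pred:
    def __init__(self, f): self.f = f

class Empty: pass   # matches the empty sequence

class Fail: pass    # matches nothing

class Seq:
    def __init__(self, a, b): self.a, self.b = a, b

class Alt:
    def __init__(self, a, b): self.a, self.b = a, b

class Star:
    def __init__(self, a): self.a = a

def nullable(e):
    # Can the expression match the empty sequence?
    if isinstance(e, (Empty, Star)): return True
    if isinstance(e, Seq): return nullable(e.a) and nullable(e.b)
    if isinstance(e, Alt): return nullable(e.a) or nullable(e.b)
    return False  # Pred and Fail

def deriv(e, x):
    # Brzozowski derivative of e with respect to the input item x.
    if isinstance(e, Pred):
        return Empty() if e.f(x) else Fail()
    if isinstance(e, Seq):
        d = Seq(deriv(e.a, x), e.b)
        return Alt(d, deriv(e.b, x)) if nullable(e.a) else d
    if isinstance(e, Alt):
        return Alt(deriv(e.a, x), deriv(e.b, x))
    if isinstance(e, Star):
        return Seq(deriv(e.a, x), Star(e.a))
    return Fail()  # Empty and Fail

def matches(e, items):
    for x in items:
        e = deriv(e, x)
    return nullable(e)

# \p{Lu}\p{Ll}* as predicates on characters:
lu = Pred(lambda c: unicodedata.category(c) == "Lu")
ll = Pred(lambda c: unicodedata.category(c) == "Ll")
word = Seq(lu, Star(ll))
assert matches(word, "Example")
assert not matches(word, "example")

Nothing in the matcher knows about characters or types: the predicate could equally be a facet check on a JSON value.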
Final Thoughts
This work followed a path of applying Brzozowski derivatives to JSON Schema validation, following the pattern of their use for XML Schema validation. This led to the need to extend regular expressions with type tagging, to add a type-checking function that applies the appropriate matching constraints for the compound symbol, and to allow for type-only symbols in the expressions. It turns out that this formulation also works to provide a more uniform story for XML Schema validation beyond content model checking. More generally, tagged symbols combined with a tag-matching function can be used more widely to improve the clarity of certain regular expressions and to provide a framework for expressing other kinds of matching problems. However, the current implementation performance numbers are not particularly compelling: the need to match types against an infinitude of specific values makes it difficult to find effective caching strategies, undermining the point of using derivatives as an implementation technique in the first place.
References
[JSValidation] H. Andrews, Editor, and G. Luff. JSON Schema Validation: A Vocabulary for Structural Validation of JSON. IETF Internet-Draft, March 19, 2018. Available at https://tools.ietf.org/html/draft-handrews-json-schema-validation-01
[JSONSchema] H. Andrews, Editor. JSON Schema: A Media Type for Describing JSON Documents. IETF Internet-Draft, March 19, 2018. Available at https://tools.ietf.org/html/draft-handrews-json-schema-01
[Brzozowski64] Janusz A. Brzozowski. Derivatives of Regular Expressions. Journal of the ACM 11.4 (1964): 481-494.
[XSD2] W3C: David Peterson, Shudi (Sandy) Gao 高殊镝, Ashok Malhotra, C.M. Sperberg-McQueen, and Henry S. Thompson, editors. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C. April 2012. http://www.w3.org/TR/xmlschema11-2/
[Sperberg-McQueen05] Michael Sperberg-McQueen. Applications of Brzozowski derivatives to XML Schema processing. Presented at Extreme Markup Languages, Montréal, Canada, August 1-5, 2005. Available at http://cmsmcq.com/2005/EML2005Sper0518.html
[1] The links to Clark's "An algorithm for RELAX NG validation" and Bob Foster's blog referenced by Sperberg-McQueen are, alas, dead.
[2] Strictly speaking, in this case, where we don't have any actual constraints on a, b, and c, we could have dispensed with properties. If we wanted to ensure the properties were all numeric, we would need a schema like {"type":"number"} instead of true for each of the properties, and properties would be necessary.
[3] The true schema is not actually equivalent to ε: the derivative rules are different.
[4] Actually, integer is another alternative value for the type property, but as the facets for integer are the same as for number, we don't need to account for it separately here.
[5] There is a small detail to take care of: it is possible to end up with more than one model property according to these rules, but we require the output to have unique keys. In that case we use the simultaneous operator to combine them: {model:m1} + {model:m2} => {model:<m1,m2>}.
[6] You can't actually implement things that way as there may be circular references. In practice, the usual kind of pre-pass or symbol table fix-up is required, if for no other reason than to prevent problems. The JSON Schema specification does say that schemas "should not" make use of infinitely recursive circular references as they may blow up naive validators and specifies such behaviour as undefined.
[7] Support for the format property is optional, as are the specific kinds of formats that are supported. Here I give an exemplar as to how it might be mapped. For other format kinds more elaborate mappings would be required, or the use of special simpleType values that encapsulate any checks that can't be mapped to other facets.
[8] From an implementation perspective, both the type matching rules of the propertyNames property and its mapping are a little strange and special. We store the mapping of s1, but realistically what is stored is the key (κ(μ(s1))), with the actually mapped faceted type stored in a lookup table elsewhere.
[9] Once you follow the conventional dodge of adding a unique symbol for all non-matching symbols encountered.
[10] Note: there are actually a couple of annoying special cases around empty objects and handling subtype recursion that I have elided here. There is an asymmetry due to the fact that top-level values have no name, but values inside a JSON object (including other objects) do.
[11] See Sperberg-McQueen05 for many of these; the logic for the others is similar. Since JSON objects have unique keys and contain nothing but keyed items, the analysis of intermix with respect to UPA applies here too.
[12] T0, T1, etc. are the identifiers given to us by κ. The set of properties on the other side of the equals sign is what is produced from the mapping.
[13] Given JSON's proclivity for open-ended sets of properties, it is useful to be able to group all symbols not found in the schema together as a single "unknown symbol" here.
[14] For those unfamiliar with it: this grep-style regular expression uses a slightly different notation than what we've used in this paper: the sequence operator (,) is assumed, [0-5] is an abbreviation for (0|1|2|3|4|5), \d is an abbreviation for a choice between all digits, and dot (.) is a wildcard.