Introduction
JSON Schema is a JSON vocabulary for defining valid JSON data. The data may be null, simple – numbers, strings, or boolean values – or structured – arrays of JSON values or objects consisting of key/value pairs. In JSON parlance the keys are called properties. The JSON Schema specification defines a set of properties that define possible constraints on the JSON values. In general these constraints are independent of one another.
Among the constraints are some that define what we might call, in the XML world, "content models" or "content types". As such, we pose the question: can we apply techniques similar to those we use for XML Schema to the implementation of JSON Schema? In particular, can we use Brzozowski derivatives to compute validity with respect to these "content models"?
Background and Related Work
Brzozowski64 defines the concept of a derivative of an extended regular expression with respect to a symbol, and to a sequence of symbols. The derivative of an extended regular expression with respect to a symbol is the extended regular expression that would need to be matched once that symbol is consumed. By repeating the process as we walk through a sequence of symbols (the input sequence) we end up with an extended regular expression that must match the empty sequence once we have consumed the whole input sequence. An extended regular expression is called nullable if it can be reduced to match the empty sequence. Brzozowski shows that the input sequence is valid with respect to the initial extended regular expression if and only if the derivative of that expression with respect to the input sequence is nullable.
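The idea is compact enough to sketch in code. The following is a minimal illustration for ordinary (non-extended) regular expressions, not code from any particular validator; the class and function names are mine.

```python
from dataclasses import dataclass

# Expression forms: ∅, ε, a single symbol, concatenation, alternation, repetition.
class Empty: pass          # ∅: matches nothing
class Epsilon: pass        # ε: matches the empty sequence

@dataclass
class Sym:
    ch: str                # a single symbol

@dataclass
class Seq:                 # E,F
    left: object
    right: object

@dataclass
class Alt:                 # E|F
    left: object
    right: object

@dataclass
class Star:                # E*
    inner: object

def nullable(e) -> bool:
    """Can e match the empty sequence?"""
    if isinstance(e, (Epsilon, Star)):
        return True
    if isinstance(e, Seq):
        return nullable(e.left) and nullable(e.right)
    if isinstance(e, Alt):
        return nullable(e.left) or nullable(e.right)
    return False           # Empty and Sym are not nullable

def deriv(e, a):
    """The expression that remains to be matched once symbol a is consumed."""
    if isinstance(e, Sym):
        return Epsilon() if e.ch == a else Empty()
    if isinstance(e, Seq):
        first = Seq(deriv(e.left, a), e.right)
        return Alt(first, deriv(e.right, a)) if nullable(e.left) else first
    if isinstance(e, Alt):
        return Alt(deriv(e.left, a), deriv(e.right, a))
    if isinstance(e, Star):
        return Seq(deriv(e.inner, a), e)
    return Empty()         # derivatives of ε and ∅

def matches(e, symbols) -> bool:
    """Valid iff the derivative with respect to the whole input is nullable."""
    for a in symbols:
        e = deriv(e, a)
    return nullable(e)

# (a|b)*,c accepts "abac" but not "abab"
expr = Seq(Star(Alt(Sym("a"), Sym("b"))), Sym("c"))
assert matches(expr, "abac") and not matches(expr, "abab")
```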
Sperberg-McQueen (Sperberg-McQueen05) discusses the application of Brzozowski derivatives to various problems of XML Schema processing, and offers a nice summary of some of the previous work in this area. Brzozowski derivatives are reported [1] to be used in James Clark's JING validator and Sun's MSV validator, and I know them to be used in MarkLogic's validation support as well. There are no doubt others.
The essential idea in using Brzozowski derivatives for XML Schema content model validation is to observe that, with a few extensions to the model as shown by Sperberg-McQueen, XML Schema content models are extended regular expressions over the set of defined element names (plus one "every other name" symbol). When the derivative of a sequence of names from the input with respect to the declared content model is nullable, the input matches the content model. Given the unique particle attribution property in XML Schema, we can look up each symbol in the matching sequence, check simple type constraints as necessary, and complete validation.
JSON Schema Overview
The definitions of JSON Schema (JSONSchema) and JSON Schema Validation (JSValidation) are evolving and managed for the most part outside of the scope of any standards body (see https://json-schema.org/specification.html for the current state), although there are now IETF Internet-Drafts (cited above) that track the changes. Here I describe draft 07.
The conceptual framework behind JSON Schema differs greatly from the traditions coming from the XML community or the formal languages community and some of the vocabulary is used differently as well. A "schema" is a JSON value: either an object or boolean value. It defines a set of constraints against a JSON value, which may be an object (a map of name/value pairs), an array of JSON values, or a simple number, string, boolean, or null value. Certain properties of the schema object have defined meanings; undefined properties are irrelevant and ignored. Certain properties contain schemas as values. It is important to remember that the term "schema" does not mean the totality of the set of constraints defined in a particular schema document and its referenced schema documents but instead means the current set of constraints against a particular JSON value.
A schema defines a set of constraints through a set of properties and their values. Importantly, with a very few defined exceptions, the constraints apply independently. Certain properties are ignored for certain values. A JSON value is valid with respect to its schema when all the relevant properties are valid with respect to that value.
A few examples might help make this clear:
Figure 1 shows a simple JSON Schema. The minimum property requires that the JSON value being validated be a numeric value greater than twelve. The pattern property requires that the JSON value being validated be a string value consisting of a sequence of 'a's. Is this an unsatisfiable schema? No. If the value presented happens to be a number, it will be valid if it is more than twelve, because the pattern property will be ignored for numeric values. If the value presented happens to be a string, it will be valid if it consists of a sequence of 'a's, because the minimum property will be ignored for string values. What is more, if the value presented happens to be a JSON object, it will also be valid, because both the minimum and pattern properties will be ignored!
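(For concreteness, a schema along these lines fits the description, with the particular values chosen here purely for illustration: {"minimum": 12, "pattern": "^a*$"}.)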
We can fix the schema to disallow that last scenario by adding another property:
In Figure 2 we have a schema that will match either a number greater than twelve or a sequence of 'a's, and nothing else.
Similar kinds of scenarios play out with constraints on JSON objects.
The schema in Figure 3 needs all four[2] of the properties to ensure that the value is an object with exactly three properties named a, b, and c. The type property excludes other kinds of values such as numbers. The required property excludes objects that do not have all of the indicated properties. The properties property defines constraints for a certain set of named properties (those constraints being the boolean schema true, which means anything is valid). And finally the additionalProperties property excludes objects that have other properties we haven't mentioned, using a boolean schema of false (always invalid). There are no ordering constraints or occurrence constraints: objects are unordered maps in which keys can only occur once.
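(Again for concreteness, a schema of the kind described would be something like {"type": "object", "required": ["a", "b", "c"], "properties": {"a": true, "b": true, "c": true}, "additionalProperties": false}.)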
This is obviously a little more convoluted than just saying the content model is a & b & c. It is easy to imagine we might have a mapping that gets us such a content model, however, and we will see that in a moment. Nevertheless, the reader may be forgiven for suspecting that applying regular expression derivatives to validation here is a dubious enterprise. Hold on to that thought! One more example:
The schema in Figure 4 will successfully validate objects that have two numeric properties named a and b, where either both numbers are more than zero, or both numbers are less than zero, together with an additional numeric property named either c or d, depending on whether we are in the positive or negative case. We could map this to the "content model" (a & b & c) | (a & b & d), but verifying that the numeric constraints hold would require a separate check that essentially performs all the content model checking anyway. The reader has probably now concluded that regular expression derivatives for JSON Schema validation are doomed. Not so fast, reader! We will be adding some extensions to the Brzozowski model to rescue the situation.
Before we get to the extensions to the model required, let's enumerate the JSON Schema validation properties. The classification is not part of the JSON Schema specification: it is an informational organization that we will make use of in a minute.
Property name | Relevance | Meaning | Classification |
---|---|---|---|
type | all | Data value kind: object, array, number, integer, string, boolean, null | type (special) |
format | string | Required format such as date-time (optional) | type |
enum | all | Set of possible values | type |
const | all | Fixed value | type |
minLength | string | Minimum length | type |
maxLength | string | Maximum length | type |
pattern | string | String matches regular expression | type |
multipleOf | number | Number is multiple of this | type |
minimum | number | Inclusive minimum | type |
exclusiveMinimum | number | Exclusive minimum | type |
maximum | number | Inclusive maximum | type |
exclusiveMaximum | number | Exclusive maximum | type |
minProperties | object | Minimum number of properties | type |
maxProperties | object | Maximum number of properties | type |
minItems | array | Minimum number of array items | type |
maxItems | array | Maximum number of array items | type |
uniqueItems | array | Are values in array unique? | type |
properties | object | Child properties | composition |
additionalProperties | object | Constraints on unenumerated properties | composition |
required | object | Names of required properties | composition |
items (array value) | array | Sequence of constraints for items in array | composition |
items (single value) | array | Constraint on every item in array | composition |
additionalItems | array | Constraints on additional unenumerated items | composition |
contains | array | At least one item is valid | composition |
anyOf | all | Valid against at least one subschema | composition |
oneOf | all | Valid against exactly one subschema | composition |
allOf | all | Valid against all subschemas | composition |
not | all | Valid if not valid against subschema | composition |
if,then,else | all | Conditional validation | composition |
propertyNames | object | Schema constraint on the names of properties | type (special) |
dependencies | object | Additional constraints if certain child properties are present | composition (special) |
$ref | all | References the schema via a path rather than in place | special |
$id | all | Sets a base URI for resolution of $ref properties | special |
Extending the Model
The approach taken to applying Brzozowski derivatives to JSON Schema validation is to map the compositional properties to extended regular expressions, where the symbols will be composites of the JSON property names tagged with a type identifier. A further mapping will gather up bundles of type properties in the schema; these bundles will get a type identifier. Call these bundles types and their constraints facets. We will show how we can define nullability for these extended expressions and derivatives of them with respect to JSON values. Some additional machinery and notation will need to be in play for unnamed values, either as top-level values or array items.
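One way to picture the machinery before the formal definitions: the expressions are ordinary regular-expression trees whose leaves are property names paired with type identifiers, plus a nameless value symbol and wildcards. The representation below is only an illustration of that shape; the names are mine, not the paper's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedSym:
    """p@T: property name p whose value must satisfy the type named T."""
    name: str
    type_id: str

@dataclass(frozen=True)
class ValueSym:
    """•@T: a nameless value (top-level value or array item) of type T."""
    type_id: str

@dataclass(frozen=True)
class Wildcard:
    """wc(-names, ±patterns): any property name not listed, optionally
    required to match (or not match) one of the given name patterns."""
    excluded_names: tuple
    patterns: tuple = ()
    must_match_pattern: bool = False
    type_id: str = "true"       # schema the matched value must satisfy

# Compositors (sequence, choice, interleave &, simultaneous <...>, exclusive
# or ⊕, negation !, conditional ?:, optionality ?, repetition *) are further
# node types over these leaves, exactly as in an ordinary regex AST.
```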
In sum, here is the program of work:
- Define a mapping from JSON Schema properties to extended regular expressions
- Define a mapping from JSON Schema properties to types and facets
- Define extension operators to account for all the composition properties
- Define extensions to tag property names with type identifiers in the expressions
- Define a type matching function
- Define extensions for type-only items (top level and array items)
- Define derivatives with respect to JSON values
- Define derivatives over extended expressions
- Define nullability over extended expressions
- Sip wine in celebration
Mapping
Mapping performs two operations: mapping type properties to faceted types, and mapping compositional properties to extended regular expressions. We will denote these with two separate functions, μ and μM respectively. To complete the picture, we will also need a function that provides the type with its name, and a lookup function that can obtain the facets for a given named type. I use the form f(A) => B to say that the definition of the result of applying function f to A is B.
(Notation definitions: the basic notation, the mapping function μ, the model mapping function μM, the key function κ, and the lookup function λ.)
The type mapping function is fairly straightforward, although capturing the interactions with the compositional properties gets a little messy in a functional paradigm. In the formulation here, the extended regular expressions from μM are expressed as the output for a particular property (model) in the output of μ.
For the most part type properties in the schema map directly to facets in the type. If there is a type property in the schema, its value will constrain which properties map to facets; if it is an array of values, we will treat this as creating a content model disjunction between the alternatives; and if it is absent we will treat this as creating a content model disjunction between all[4] alternatives (array, object, number, string, boolean, null). Each alternative will only gather properties relevant for that specific type value.
In what follows, P is a map of property-value pairs {p1:v1,...,pn:vn} where each property name is unique and the + operator performs a union of maps. The rules are generally written so that the sets of property names are disjoint.[5] In addition, s1, s2, etc. refer to (sub)schemas.
The definition below shows the mapping, broken down for each of the distinct type values. Most properties can be mapped independently, although there are some exceptions. That is, unless specified otherwise: μ({p:v,P}) => μ({p:v})+μ(P) and similarly μM({p:v,P}) => μM({p:v})+μM(P). Where there is a rule for a specific combination of properties, that rule takes precedence. There are a few edge cases to take care of: boolean schemas and $ref dereferencing.
It will be useful to define some standard bundles of properties. The function relevant(P,kind) (Figure 5) will return just the properties listed for the specific kind from the map of property-value pairs. For example, relevant({minLength:1,maximum:12},"string") is {minLength:1}:

Kind | Properties selected by relevant(P,kind) |
---|---|
any | enum, const, oneOf, anyOf, allOf, not, if, then, else |
boolean | Those for any |
string | minLength, maxLength, pattern, and those for any |
integer | Same as number |
number | multipleOf, minimum, exclusiveMinimum, maximum, exclusiveMaximum, and those for any |
null | Those for any |
array | minItems, maxItems, uniqueItems, items, additionalItems, and those for any |
object | propertyNames, minProperties, maxProperties, properties, patternProperties, additionalProperties, required, and those for any |
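The bundling is simple enough to state as code; this sketch just transcribes the table above (the constant and function names are mine):

```python
# Which schema properties are relevant for which kind of value.
ANY = {"enum", "const", "oneOf", "anyOf", "allOf", "not", "if", "then", "else"}
KIND_PROPERTIES = {
    "any": ANY,
    "boolean": ANY,
    "null": ANY,
    "string": ANY | {"minLength", "maxLength", "pattern"},
    "number": ANY | {"multipleOf", "minimum", "exclusiveMinimum",
                     "maximum", "exclusiveMaximum"},
    "array": ANY | {"minItems", "maxItems", "uniqueItems",
                    "items", "additionalItems"},
    "object": ANY | {"propertyNames", "minProperties", "maxProperties",
                     "properties", "patternProperties",
                     "additionalProperties", "required"},
}
KIND_PROPERTIES["integer"] = KIND_PROPERTIES["number"]

def relevant(P: dict, kind: str) -> dict:
    """Keep only the schema properties relevant for the given kind."""
    return {k: v for k, v in P.items() if k in KIND_PROPERTIES[kind]}

assert relevant({"minLength": 1, "maximum": 12}, "string") == {"minLength": 1}
```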
First, we define how some edge cases are handled (Figure 6):

μ(true) => true
μ(false) => ∅
μ(∅) => ∅
μ({$ref:path}) => μ(dereference(path))

The boolean schemas map to true or the empty set as appropriate, and references map to the referenced definition. Specifying the details of how the JSON path gets dereferenced is beyond the scope of this paper. In what follows assume we have a fully dereferenced schema in hand.[6] We are also assuming the schema as a whole (including the references) is valid.
The type property (Figure 7) is the main driver for the mapping. In general it gets mapped to a kind facet, plus a simpleType facet for references to simple values, and occasionally other facets as well. The type names have the same meaning as in XSD2. Here are the rules for mapping the type property:[7]

μ({type:[t1,...,tn], P}) => μ({oneOf: [{type:t1,P},...,{type:tn,P}]})
μ({type:boolean, P}) => μ({type:boolean}) + μ(relevant(P,"boolean"))
μ({type:string, format:f1, P}) => μ({type:string, format:f1}) + μ(relevant(P,"string"))
μ({type:string, P}) => μ({type:string}) + μ(relevant(P,"string"))
μ({type:integer, P}) => μ({type:integer}) + μ(relevant(P,"integer"))
μ({type:number, P}) => μ({type:number}) + μ(relevant(P,"number"))
μ({type:null, P}) => μ({type:null}) + μ(relevant(P,"null"))
μ({type:array, P}) => μ({type:array}) + μ(relevant(P,"array"))
μ({type:object, P}) => μ({type:object}) + μ(relevant(P,"object"))
μ({type:boolean}) => {kind:boolean, simpleType:"xs:boolean", enum:[true,false]}
μ({type:string, format:date-time}) => {kind:string, simpleType:"xs:dateTime"}
μ({type:string}) => {kind:string, simpleType:"xs:string"}
μ({type:number}) => {kind:number, simpleType:"xs:numeric"}
μ({type:integer}) => {kind:number, simpleType:"xs:integer"}
μ({type:null}) => {kind:null}
μ({type:array}) => {kind:array}
μ({type:object}) => {kind:object}
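The simple-value end of this mapping is little more than a lookup; something along the following lines captures the kind and simpleType facets from the rules above (only a sketch, and only the date-time exemplar for format is shown; the names are mine):

```python
# Base facets produced for each value of the "type" property.
BASE_FACETS = {
    "boolean": {"kind": "boolean", "simpleType": "xs:boolean",
                "enum": [True, False]},
    "string":  {"kind": "string", "simpleType": "xs:string"},
    "number":  {"kind": "number", "simpleType": "xs:numeric"},
    "integer": {"kind": "number", "simpleType": "xs:integer"},
    "null":    {"kind": "null"},
    "array":   {"kind": "array"},
    "object":  {"kind": "object"},
}

def mu_type(t, fmt=None) -> dict:
    """Map a single type value (and, optionally, a format) to facets."""
    facets = dict(BASE_FACETS[t])
    if t == "string" and fmt == "date-time":
        facets["simpleType"] = "xs:dateTime"   # exemplar format mapping
    return facets

assert mu_type("integer") == {"kind": "number", "simpleType": "xs:integer"}
assert mu_type("string", "date-time")["simpleType"] == "xs:dateTime"
```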
The enum and const properties (Figure 8) apply to all kinds of values:

μ({enum:[s1,s2,...,sn]}) => {enum:[s1,s2,...,sn]}
μ({const:value}) => {fixed:value}
The string-specific properties (Figure 9) are minLength, maxLength, and pattern:

μ({minLength:value}) => {minLength:value}
μ({maxLength:value}) => {maxLength:value}
μ({pattern:value}) => {pattern:value}
Here are the numeric property mappings (Figure 10):

μ({multipleOf:value}) => {multipleOf:value}
μ({minimum:value}) => {minimumInclusive:value}
μ({exclusiveMinimum:value}) => {minimumExclusive:value}
μ({maximum:value}) => {maximumInclusive:value}
μ({exclusiveMaximum:value}) => {maximumExclusive:value}
Type properties for objects (Figure 11):[8]

μ({minProperties:value}) => {minProperties:value}
μ({maxProperties:value}) => {maxProperties:value}
μ({propertyNames:s1}) => {propertyNames:μ(s1)}
Array properties (Figure 12):

μ({minItems:value}) => {minItems:value}
μ({maxItems:value}) => {maxItems:value}
μ({uniqueItems:value}) => {uniqueItems:value}
Before we get to model mapping, we need to introduce some additional notation. The notation is similar to Sperberg-McQueen05, which is a little different from Brzozowski64, but with a few alterations and extensions. JSON Schema does not need general occurrence counts, and the wildcards required are relative to a set of property names, not namespaces.
(Basic extended regular expression notation: ε, ∅, true, symbols, sequence (E,F), choice (E|F), repetition (E*), optionality (E?), interleave (E&F), simultaneous matching (<E,F>), exclusive or (E⊕F), negation (!E), conditional (E?F:G), type marking (E@T), and wildcards (wc(...)).)
The type marking, exclusive or, conditional, simultaneous, and wildcards are extensions. Negation is present in Brzozowski64 although with a slightly different notation.
It may seem that F&G and <F,G> mean the same thing, but the difference will become clearer when we get to the definition of derivatives: F&G matches F and G against different segments of the input, whereas <F,G> matches F and G against the same segment of the input simultaneously. For example, a@T & a@S can never be satisfied, because each symbol is unique in a JSON object so there can only be one a matched, but <a@T, a@S> may well be satisfied if both type constraints are satisfied on the value of property a.
The mapping of the basic composition operators is fairly direct (Figure 13, Figure 14):

μ({anyOf: [s1,...,sn]}) => {model: μM({anyOf:[s1,...,sn]})}
μ({oneOf: [s1,...,sn]}) => {model: μM({oneOf:[s1,...,sn]})}
μ({allOf: [s1,...,sn]}) => {model: μM({allOf:[s1,...,sn]})}
μ({not:s}) => {model: μM({not:s})}
μ({if:s1,then:s2,else:s3}) => {model: μM({if:s1,then:s2,else:s3})}

μM({anyOf: [s1,s2,...,sn]}) => μM(s1)|μM(s2)|...|μM(sn)
μM({oneOf: [s1,s2,...,sn]}) => μM(s1)⊕μM(s2)⊕...⊕μM(sn)
μM({allOf: [s1,s2,...,sn]}) => <μM(s1),μM(s2),...,μM(sn)>
μM({not:s}) => !μM(s)
μM({if:s1,then:s2,else:s3}) => (μM(s1)?μM(s2):μM(s3))
μM({if:s1,then:s2}) => (μM(s1)?μM(s2):true)
μM({if:s1,else:s3}) => (μM(s1)?true:μM(s3))
Mapping of the object-related compositional properties requires dealing with the interactions and dependencies between them. The two main compositional properties are properties and patternProperties. Each of these consists of a set of keys and an associated schema for each key. In the case of properties the key is the property name to match in the instance; in the case of patternProperties the key is a regular expression that matches the name of the property in the instance. All properties named are optional unless their name appears in the required array. Properties otherwise unaccounted for must conform to the schema given for additionalProperties, which defaults to the true schema if absent.
To simplify things somewhat, let's first ensure that when the additionalProperties property was not specified, we treat it as an explicit true schema, that is, that all additional properties are OK. We also want to normalize the additionalProperties property when it is the false schema to remove it, to normalize the properties and patternProperties properties to replace absence with an empty object, and to normalize the required property to replace absence with an empty array. Assume the following normalizations (Figure 15) apply one after another:
If ∀pi: pi != additionalProperties, then μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn, additionalProperties:true})
If {p1:s1,...,pn:sn} = {additionalProperties:false, P}, then μ({p1:s1,...,pn:sn}) => μ({P})
If ∀pi: pi != properties, then μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn, properties:{}})
If ∀pi: pi != patternProperties, then μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn, patternProperties:{}})
If ∀pi: pi != required, then μ({p1:s1,...,pn:sn}) => μ({p1:s1,...,pn:sn, required:[]})
This leaves us with two variations to map: the bundle of properties P with the properties named properties, patternProperties, and required, and the bundle of properties with those plus the property named additionalProperties as well. In either case, the mapping creates a model property with the property bundle mapped by μM: {model: μM(P)}.
Capturing the semantics of additionalProperties requires some trickiness, because if its value is not the false schema, then it applies to any properties that are not named in properties and that do not match a pattern in patternProperties. We capture this in the wildcard, which takes those properties and patterns as its sets. We also capture the required property by marking appropriate properties as optional or not in the model. This is an optimization: we could instead have a simultaneous compositor where one branch handles the required properties and the other handles the type information. Both branches end up being relatively weak, in general.
The semantics of patternProperties is particularly tricky:

- Pattern properties do not count as additional properties.
- More than one pattern property can match a specific input property, and each matching constraint needs to be enforced.
- A pattern property may or may not match a specifically named property; if it does, its constraint must be enforced.
- We may lack named properties (properties) but still have a constraint on "additional" properties (additionalProperties).
In general, every property listed is optional, unless it is specifically mentioned in the required array. The semantics of required is a little awkward to capture effectively in algebraic form. Note that it is perfectly acceptable in JSON Schema for a property to be in the required array but not mentioned by either properties or patternProperties.
For example, the following schema is perfectly OK, and will not match an object {"c":1,"dd":2} as it lacks the required a and b properties:
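(A minimal schema of the kind meant here, assuming the simplest possible form, is just {"required": ["a", "b"]}: a and b are required without being named in properties or patternProperties, so {"c":1,"dd":2} fails.)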
With all that said, the general rules for mapping the object-related composition properties are (Figure 17):

μM({
  properties: {p1:s1,...,pm:sm},
  patternProperties: {r1:pt1,...,ro:pto},
  additionalProperties: sp,
  required: [rq1,...,rqn]
}) =>
<
  (q1@κ(t1) & ... & qk@κ(tk) &
   ql@κ(tl)? & ... & qm@κ(tm)? &
   u1@κ(sp) & ... & uv@κ(sp) &
   (wc(-[p1,...,pm],-[r1,...,ro])@κ(sp) |
    wc(-[p1,...,pm],+[r1,...,ro]))*
  ),
  (wc(r1)@κ(pt1) | wc(-[],-[r1]))*,
  (wc(r2)@κ(pt2) | wc(-[],-[r2]))*,
  ...
  (wc(ro)@κ(pto) | wc(-[],-[ro]))*
>

μM({
  properties: {p1:s1,...,pm:sm},
  patternProperties: {r1:pt1,...,ro:pto},
  required: [rq1,...,rqn]
}) =>
<
  (q1@κ(t1) & ... & qk@κ(tk) &
   ql@κ(tl)? & ... & qm@κ(tm)? &
   u1@κ(sp) & ... & uv@κ(sp) &
   wc(-[p1,...,pm],+[r1,...,ro])*
  ),
  (wc(r1)@κ(pt1) | wc(-[],-[r1]))*,
  (wc(r2)@κ(pt2) | wc(-[],-[r2]))*,
  ...
  (wc(ro)@κ(pto) | wc(-[],-[ro]))*
>

where

l = k+1
for i <= k, qi = pj and ti = sj and ∃rqh | pj = rqh
for i > k, qi = pj and ti = sj and ∀rqh | pj != rqh
for i <= v, ui = rqj where ∀ph | ph != rqj
This looks complicated, but all it is saying is that our mapped content model consists of a mix, in any order, of: properties that are named and required (q1 through qk), properties that are named but not required (ql through qm), properties that are not named but are nevertheless required (u1 through uv), a wildcard to pick up properties that are not named and match none of the patterns but satisfy the schema for additionalProperties, and a wildcard to pick up properties that are not named but do match one of the patterns. The wildcard clauses in the second and subsequent branches of this simultaneous model ensure that properties matching patterns satisfy their appropriate schema: everything either matches the pattern and its corresponding schema, or doesn't match the pattern at all. Each pattern gets its own simultaneous branch because an input property may match more than one of them, in which case it would need to satisfy constraints from each match.
If patternProperties is empty, as is often the case, this simplifies a great deal (Figure 18). We can drop the wildcards that match against patterns and simplify the simultaneous model with one branch to just that branch:

μM({
  properties: {p1:s1,...,pm:sm},
  additionalProperties: sp,
  required: [rq1,...,rqn]
}) =>
(
  q1@κ(t1) & ... & qk@κ(tk) &
  ql@κ(tl)? & ... & qm@κ(tm)? &
  u1@κ(sp) & ... & uv@κ(sp) &
  wc(-[p1,...,pm],-[])@κ(sp)*
)

μM({
  properties: {p1:s1,...,pm:sm},
  required: [rq1,...,rqn]
}) =>
(
  q1@κ(t1) & ... & qk@κ(tk) &
  ql@κ(tl)? & ... & qm@κ(tm)? &
  u1@κ(sp) & ... & uv@κ(sp)
)
Notice that we still get fairly weak content models here: every property can occur in any order, optional unless specifically required, with all kinds of wildcards unless additionalProperties is explicitly false and there are no patternProperties. This has performance implications, as it creates larger and more complex derivatives. In JSON objects, however, every property name is unique. We can use this knowledge to strengthen the wildcards in the first branch by adding any required properties to the excluded list. (In the examples we will assume we do this.) Other optimizations are possible. We will return to this point later.
Mapping the array-relevant compositional properties follows a similar paradigm, but an additional class of extension is in order, because array members do not have any symbol associated with them, just values. However, given our type constraint scheme, we still have something to say about such things; we just need a little bit more notational mechanism:
(Non-symbol extensions: •@T marks a nameless value constrained by type T; a bare • matches any nameless value.)
Given the type-only expressions, we are now in a position to map array properties. The items property (Figure 19, Figure 20) in JSON Schema comes in two varieties with different semantics: as a single schema, or as an array of schemas. If it is a single schema, then that schema applies to every member of the array, and additionalItems is irrelevant. If it is an array, then each member of the array applies one-to-one to the values in the array in order, with any surplus values in the array matching the additionalItems schema, if any:
μ({items:I, additionalItems:sa, contains:sc}) => {model: <μM({items:I, additionalItems:sa}), μM({contains:sc})>}
μ({items:I, additionalItems:sa}) => {model: μM({items:I, additionalItems:sa})}
μ({items:I, additionalItems:false}) => {model: μM({items:I})}
μ({items:I}) => {model: μM({items:I, additionalItems:true})}

μM({contains:sc}) => (•@ε*,•@κ(sc),•@ε*)

μM({items:s1, additionalItems:sa}) => (•@κ(s1)*)

μM({items:[s1,s2,...,sn], additionalItems:sa}) =>
(•@κ(s1),•@κ(s2),...,•@κ(sn),•@κ(sa)* |
 •@κ(s1),•@κ(s2),...,•@κ(sn)? |
 •@κ(s1),•@κ(s2),...,•@κ(sn-1)? |
 ...
 •@κ(s1),•@κ(s2)? |
 •@κ(s1)?
)
= ((•@κ(s1),
    (•@κ(s2),
     ...
     (•@κ(sn)?,•@κ(sa)*)
     ...
    )?
   )?
  )

μM({items:[s1,s2,...,sn]}) =>
(•@κ(s1),•@κ(s2),...,•@κ(sn)? |
 •@κ(s1),•@κ(s2),...,•@κ(sn-1)? |
 ...
 •@κ(s1),•@κ(s2)? |
 •@κ(s1)?
)
= ((•@κ(s1),
    (•@κ(s2),
     ...
     •@κ(sn)?
     ...
    )?
   )?
  )
The reason for the complexity of the rules for items is this: just because the property defines type constraints for three slots, that doesn't mean a matching array must have three entries in it. That is handled by a different constraint property, the minItems property. We could apply it here to remove some of the optionality.

In other words, array constraint properties map to type-only expressions. The additionalItems property is only relevant if the items property enumerates a sequence of type constraints for each item in the array.
Mapping the dependencies property is a little confusing because its value is a JSON object where each value is either a subschema or an array of property names. The semantics are very different, and the dependencies property is likely to bifurcate in future drafts of JSON Schema. Each item in the JSON object maps separately, using a different rule depending on whether its value is a schema or an array.
Here are the rules:
Figure 21, Figure 22:

μ({dependencies: {p1:v1,p2:v2,...,pn:vn}}) =>
{model: <
  μM({dependencies:{p1:v1}}),
  ...,
  μM({dependencies:{pn:vn}})
>}

μM({dependencies:{p:S}}) => ((p & wc(-[p],-[])*) ? wc(*)@κ(S) : true)

μM({dependencies:{p:[q1,q2,...,qn]}}) =>
((p & wc(-[p],-[])*) ?
 (q1 & ... & qn & wc(-[q1,...,qn],-[])*) :
 true)
In other words, if the object contains a property p plus a bunch of other stuff not named p, the object must match either the given schema, or must include all the named properties (depending on which form was used).
It should also be noted that although at first blush we have made so many extensions to regular expressions that it seems likely that none of the supporting results would hold, in fact we have hardly extended anything substantively. Instead of the finite vocabulary consisting of the symbols in the schema, we have a vocabulary consisting of the symbols crossed with the type identifiers, plus the • symbol. It is still a finite vocabulary.[9] The operators and functions we have added, being boolean operators, fit into the rules defined by Brzozowski. True, we have made the rules for matching more complex, but one could also think of them as just being a funny kind of equality. On the other side, while our input can consist of an arbitrary number of distinct values, they are all just annotations on that finite vocabulary. The infinitude of values will concern us when we get to caching, however.
Derivatives
Given a JSON Schema mapped in this way, we can compute derivatives of those expressions with respect to JSON values. Most of these rules come directly from Sperberg-McQueen05 or Brzozowski64, or can be derived from Brzozowski's observation that the derivative of any boolean operator over some operands is the operator over the derivatives of those operands. Algebraically: Δf(R)/Δa = f(ΔR/Δa) for any boolean function f. A key difference is that when we get to the leaves, we have to do type matching as well as just symbol matching. Let us introduce a little more notation:
(Notation: the derivative function Δ, the nullability function ν, the typematch function τ, and the notation for JSON values.)
Taking lowercase a and b as symbols and uppercase E, F, and G as expressions, the derivatives of basic expressions (without type matching) are as follows (Figure 23):

Δb/Δa => [a=b]
Δε/Δa => ∅
Δ∅/Δa => ∅
Δtrue/Δa => true
Δ(E?)/Δa => ΔE/Δa | ν(E)
Δ(E*)/Δa => ΔE/Δa,E*
Δ(E,F)/Δa => (ΔE/Δa),F | ν(E),ΔF/Δa
Δ(E|F)/Δa => ΔE/Δa | ΔF/Δa
Δ(E&F)/Δa => (ΔE/Δa & F) | (E & ΔF/Δa)
Δ(!E)/Δa => !ΔE/Δa
Δ(E⊕F)/Δa => (ΔE/Δa) ⊕ (ΔF/Δa)
Δ(E?F:G)/Δa => (ΔE/Δa ? ΔF/Δa : ΔG/Δa)
Δ<E,F>/Δa => <ΔE/Δa, ΔF/Δa>
Δwc(pat)/Δa => [matches(a,pat)]
Δwc(-[p1,...,pn],-[r1,...,rn])/Δa => [∀pj:a!=pj ∧ ∀rj:!matches(a,rj)]
Δwc(-[p1,...,pn],+[r1,...,rn])/Δa => [∀pj:a!=pj ∧ ∃rj:matches(a,rj)]
To take the derivative of an expression with respect to a JSON object, first note that the edge case for an empty sequence is defined so that:

ΔE/Δε => E

Then observe that for a sequence of symbols u followed by a symbol a:

ΔE/Δ(u,a) => (ΔE/Δu)/Δa

or, unrolling an entire sequence:

ΔE/Δ(p1,p2,...,pn) => (...(((ΔE/Δp1)/Δp2)/...)/Δpn)
Given the set of properties of an object p1, p2, etc., the derivative with respect to the set of properties is the union of the derivative of each possible sequence starting with some pj (Figure 24):

ΔE/Δ•:{p1:v1,p2:v2,...,pn:vn} =>
(ΔE/Δp1:v1)/Δ•:{p2:v2,p3:v3,...,pn:vn} |
(ΔE/Δp2:v2)/Δ•:{p1:v1,p3:v3,...,pn:vn} |
... |
(ΔE/Δpn:vn)/Δ•:{p1:v1,p2:v2,...,pn-1:vn-1}
Adding type matching to the picture doesn't change the rules much, although they look a little more complicated. All the rules are the same, just carrying through the type annotation on the expressions, except for the leaf cases, where we need to perform a type check as well as symbol matching (Figure 25):[10]

Δb@T/Δa:v => [a=b],τ(v,λ(T))
Δwc(pat)@T/Δa:v => [matches(a,pat)],τ(v,λ(T))
Δwc(-[p1,...,pn],-[r1,...,rn])@T/Δa:v => [∀pj:a!=pj ∧ ∀rj:!matches(a,rj)],τ(v,λ(T))
Δwc(-[p1,...,pn],+[r1,...,rn])@T/Δa:v => [∀pj:a!=pj ∧ ∃rj:matches(a,rj)],τ(v,λ(T))
ΔE@T/Δ•:{p1:v1,p2:v2,...,pn:vn} =>
τ({p1:v1,p2:v2,...,pn:vn},λ(T)),
((ΔE/Δp1:v1)/Δ•:{p2:v2,p3:v3,...,pn:vn} |
 (ΔE/Δp2:v2)/Δ•:{p1:v1,p3:v3,...,pn:vn} |
 ... |
 (ΔE/Δpn:vn)/Δ•:{p1:v1,p2:v2,...,pn-1:vn-1}
)
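The leaf rule is the heart of the extension, and it is worth seeing in executable form. The sketch below (names and helper signatures are mine, not the paper's) implements Δb@T/Δa:v = [a=b],τ(v,λ(T)): the symbol must match and the value must satisfy the named type.

```python
EPSILON, EMPTY = "ε", "∅"

def deriv_tagged_symbol(sym_name, type_id, prop_name, value, lookup, typematch):
    """Derivative of the tagged symbol sym_name@type_id with respect to the
    property prop_name:value, using lookup for λ and typematch for τ."""
    if sym_name != prop_name:
        return EMPTY                      # [a=b] fails
    facets = lookup(type_id)              # λ(T): fetch the faceted type
    return EPSILON if typematch(value, facets) else EMPTY

# With a toy λ and τ: a@T0 against a:5, where T0 requires a number >= 0.
types = {"T0": {"kind": "number", "minimumInclusive": 0}}
tau = lambda v, f: isinstance(v, (int, float)) and v >= f["minimumInclusive"]
assert deriv_tagged_symbol("a", "T0", "a", 5, types.get, tau) == EPSILON
assert deriv_tagged_symbol("a", "T0", "b", 5, types.get, tau) == EMPTY
```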
Now that we have type-aware derivatives for expressions over symbols, we still need to account for arrays and top-level values that have no symbols, and compute derivatives over these type-only expressions. Arrays unroll like sequences (Figure 26):

ΔE@T/Δ•:[v1,v2,...,vn] => τ([v1,v2,...,vn],λ(T)), ΔE/Δ(•:v1,•:v2,...,•:vn)
ΔE/Δ(•:v1,•:v2,...,•:vn) => (...(((ΔE/Δ•:v1)/Δ•:v2)/...)/Δ•:vn)
Δ(•@T)/Δ•:v => τ(v,λ(T))
Δ•/Δb => ∅
Δa/Δ• => ∅
Δwc(pat)/Δ• => ∅
Δwc(-[p1,...,pn],-[r1,...,rn])/Δ• => ∅
Δwc(-[p1,...,pn],+[r1,...,rn])/Δ• => ∅
Δ•/Δ• => ε
Δε/Δ• => ∅

What this says is that the derivative of an array can be treated mostly as a sequence of non-symbol values, although we need to consider the type constraints on the array as a whole. A symbol cannot match a non-symbol, so derivatives of that kind are empty. A derivative of a non-symbol with respect to a non-symbol value comes down to the type matching function against that value.
Nullability
The nullability function decides whether an expression can be matched against nothing. When the derivative of an expression with respect to the input is not nullable, that means that either it would need additional input in order to match or it has already run down a blind alley and there is no way to match what is there. Therefore the data cannot be valid with respect to that expression. Conversely, when the derivative is nullable, it means that the input has been successfully matched by the regular expression (or consumed by the state machine, if you prefer) and no additional input is necessary for a successful match.
The rules for the nullability function are as follows:
Figure 27

ν(ε) => ε
ν(∅) => ∅
ν(true) => ε
ν(a) => ∅
ν(E*) => ε
ν(E,F) => <ν(E), ν(F)>
ν(<E,F>) => <ν(E), ν(F)>
ν(E|F) => ν(E) | ν(F)
ν(!E) => ε if ν(E) = ∅
ν(!E) => ∅ if ν(E) = ε
ν(F&G) => <ν(F), ν(G)>
ν(F?G:H) => (ν(F)?ν(G):ν(H))
ν(F?) => ε
ν(F⊕G) => ν(F) ⊕ ν(G) = (ν(F)?ν(!G):ν(G))
ν(wc(pat)) => ∅
ν(wc(N,P)) => ∅
ν(F@T) => ν(F)
ν(•) => ∅
Type Matching
Type matching checks the type constraints of a type T against a particular value v. It returns ε if the value matches the type constraints of the type and ∅ otherwise (Figure 28):

τ(•,T) => ε
τ(v,T) => ε if ∀f∈facets(T): τ(v,f)=ε, otherwise ∅
τ(v,ε) => ε
τ(v,∅) => ∅
Type-matching evaluates the various facets according to their semantics, something like this (Figure 29):

τ(v,{kind:x}) => [v isa x]
τ(v,{simpleType:x}) => [v instanceof x]
τ(v,{enum:[x1,x2,...,xn]}) => [∃xi|v=xi]
τ(v,{fixed:x}) => [v=x]
τ(v,{minLength:x}) => [!(v isa string) ∨ string-length(v) >= x]
τ(v,{maxLength:x}) => [!(v isa string) ∨ string-length(v) <= x]
τ(v,{pattern:x}) => [!(v isa string) ∨ matches(v,x)]
τ(v,{multipleOf:x}) => [!(v isa number) ∨ (v mod x)=0]
τ(v,{minimumInclusive:x}) => [!(v isa number) ∨ v >= x]
τ(v,{minimumExclusive:x}) => [!(v isa number) ∨ v > x]
τ(v,{maximumInclusive:x}) => [!(v isa number) ∨ v <= x]
τ(v,{maximumExclusive:x}) => [!(v isa number) ∨ v < x]
τ(v,{minProperties:x}) => [!(v isa object) ∨ count(keys(v)) >= x]
τ(v,{maxProperties:x}) => [!(v isa object) ∨ count(keys(v)) <= x]
τ(v,{propertyNames:x}) => [!(v isa object) ∨ ∀k∈keys(v): τ(k,x)]
τ(v,{minItems:x}) => [!(v isa array) ∨ count(items(v)) >= x]
τ(v,{maxItems:x}) => [!(v isa array) ∨ count(items(v)) <= x]
τ(v,{uniqueItems:x}) => [!(v isa array) ∨ !x ∨ count(items(v))=count(distinct(items(v)))]
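In an implementation, τ reduces to a loop over the facets, with each facet vacuously true when the value is not of the kind it constrains. The sketch below shows a handful of facets in that style; the names are mine, and the remaining facets follow the same pattern.

```python
import re

def check_facet(value, name, arg) -> bool:
    """Evaluate one facet against a value; unconstrained kinds pass vacuously."""
    if name == "kind":
        kinds = {"string": str, "number": (int, float), "boolean": bool,
                 "array": list, "object": dict, "null": type(None)}
        return isinstance(value, kinds[arg]) and not (
            arg == "number" and isinstance(value, bool))
    if name == "minLength":
        return not isinstance(value, str) or len(value) >= arg
    if name == "maxLength":
        return not isinstance(value, str) or len(value) <= arg
    if name == "pattern":
        return not isinstance(value, str) or re.search(arg, value) is not None
    if name == "minimumInclusive":
        return not isinstance(value, (int, float)) or value >= arg
    if name == "maximumInclusive":
        return not isinstance(value, (int, float)) or value <= arg
    return True   # facets not shown in this sketch are ignored

def typematch(value, facets: dict) -> bool:
    """τ(v,T): ε (True) iff every facet of the type is satisfied."""
    return all(check_facet(value, n, a) for n, a in facets.items())

assert typematch(47, {"kind": "number", "minimumInclusive": 12})
assert not typematch("abc", {"kind": "string", "minLength": 5})
```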
There is an additional rule for the τ function that handles subschemas (Figure 30). This will make more sense after we have looked at some examples:

τ(v,{model:E}) => ν(ΔE/Δ•:v)

Given this rule, we can observe that a JSON value is valid with respect to a JSON schema if and only if the type matching function for the mapped JSON schema is ε.
Relations and Simplifications
There are a number of relations and simplifications of the derivative expressions that come in useful in computing derivatives and keeping the size of the derivatives in reasonable bounds.
First, some of the operators are associative:[11]
Some of the operators are also commutative:
There are straightforward simplifications involving ε, ∅, and redundant expressions. We will make use of these, generally without comment, in the examples.
These identities mean that, given a function or operator f that can only take the values of ε or ∅, such as the type matching function or the nullability function:
We have implicitly made use of this relation in the definitions of some of the derivatives,
above.
Negation behaves as expected with respect to boolean operators:
Finally, certain of the extension operators can be defined in terms of the others. This can be useful for proving some of these derivations, but it also gives us some comfort that we are not actually enhancing extended regular expressions beyond recognition:
JSON Schema Validation
Again, following Sperberg-McQueen05, we can use derivatives for validation. Given a JSON value v and a JSON schema mapped to a type T, where T is a member of a set of mapped types accessible via λ: v is valid with respect to T if and only if the type check τ(v,T) returns ε. If the mapping includes a mapped {model:E}, then calculation of τ will compute the nullability of the derivative of E with respect to the nameless value, ν(ΔE/Δ•:v), per the type matching rule given previously.
The overall procedure is to compute the mapping of the schema. This will give us a type whose identifier (given by κ) will be the type to match. Where the mapping produces a model, computing the nullability of the derivative of this model with respect to the nameless JSON value in hand tells us whether that value is valid with respect to the schema.
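To make the procedure concrete, here is a tiny self-contained rendering of it for an array case, implementing only the handful of expression forms needed, with the mapping done by hand: a schema with "items":[{"type":"number"},{"type":"string"}] yields (per the items rule above) the model ((•@T1,(•@T2)?)?), where T1 is a number type and T2 a string type, and the value is valid exactly when the derivative of that model with respect to the array is nullable. All names here are illustrative, not the paper's.

```python
from dataclasses import dataclass

class Empty: pass                    # ∅
class Epsilon: pass                  # ε

@dataclass
class ValueSym:                      # •@T
    type_id: str

@dataclass
class Opt:                           # E?
    inner: object

@dataclass
class Seq:                           # E,F
    left: object
    right: object

@dataclass
class Alt:                           # E|F
    left: object
    right: object

TYPES = {"T1": {"kind": "number"}, "T2": {"kind": "string"}}

def typematch(value, type_id) -> bool:            # τ(v,λ(T)), kind facet only
    kind = TYPES[type_id]["kind"]
    if kind == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, str)

def nullable(e) -> bool:                          # ν
    if isinstance(e, (Epsilon, Opt)):
        return True
    if isinstance(e, Seq):
        return nullable(e.left) and nullable(e.right)
    if isinstance(e, Alt):
        return nullable(e.left) or nullable(e.right)
    return False

def deriv(e, v):                                  # ΔE/Δ•:v
    if isinstance(e, ValueSym):
        return Epsilon() if typematch(v, e.type_id) else Empty()
    if isinstance(e, Opt):
        return deriv(e.inner, v)
    if isinstance(e, Alt):
        return Alt(deriv(e.left, v), deriv(e.right, v))
    if isinstance(e, Seq):
        first = Seq(deriv(e.left, v), e.right)
        return Alt(first, deriv(e.right, v)) if nullable(e.left) else first
    return Empty()                                # ε and ∅

def valid(model, array) -> bool:                  # ν(ΔE/Δ•:[...])
    e = model
    for item in array:
        e = deriv(e, item)
    return nullable(e)

model = Opt(Seq(ValueSym("T1"), Opt(ValueSym("T2"))))   # ((•@T1,(•@T2)?)?)
assert valid(model, []) and valid(model, [3]) and valid(model, [3, "a"])
assert not valid(model, ["a", 3]) and not valid(model, [3, "a", "b"])
```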
Some examples might help:
Examples
Simple Type
Let's start with a very simple case. Is the value 47 valid with respect to this schema?
We would certainly expect it to be. Let's walk through the mechanics.
The mapping of the schema gives us just:[12]
We need to compute the type matching function, which is simple enough (here = shows the chain of derivations):

Since the result is ε, the value is valid with respect to the schema.
More Complicated Example
Let's take a more complicated example, the schema from Figure 4, reproduced here:
Mapping gives us:
We will henceforth abbreviate wc(-[a,b],[]) as just wc().
Let's see if this object is valid:
To do that, we need to compute the type function: which is to say, determine the nullability of the derivative of the model on the top level type.
To do that, we need to compute the following derivative:
This expression is equal to:
Let's unroll that a bit:
Let's consider the left side of the exclusive or (⊕):
Evaluating the first clause gives us ε because the symbol and type match:

Therefore:

So far so good! What about the other alternatives? These don't match (leading to a derivative of ∅) because b and c do not match a and are not nullable.
Since we have
so
each alternative reduces to ∅
and we are still left with
On the other clause of the exclusive or (⊕) we have
Here the first alternative fails, because the type check fails:

which is all just ∅.
The other alternatives don't fare any better, for reasons similar to the case on the other side (b and d do not match a and are not nullable):

so each of those alternatives nets out to ∅.
Putting this all together, we have got to
The clause coming from the required property is a little easier since it doesn't involve any types:

because Δb/Δa:1 is ∅ (eliminating the second alternative) and

because, recall, this wildcard excludes a (eliminating the third).
Whew! We've consumed the first property. On to b!

Type matching b fails. We have
since
and
we end up with
On the other half of the simultaneous operator we have:
We're down to the final symbol, which is where things fail entirely because the exclusive or fails, dragging everything else down with it:
Since ∅ is not nullable, the object is not valid with respect to the schema.
Array Example
Let's look at an array example to show how arrays unroll.
Is the array valid with respect to this schema?
The mapped schema:
We need to compute:
The kind check passes because the value is an array, leaving the derivative we need to compute being
Starting with the first item:
The second item ("a"
) also works:
But things run aground on the third item ("b"
) because we run out of expressions to match against the value:
Here we have a non-nullable derivative, and the array is not valid with respect to the schema.
Nested Object Example
Finally, let's look at an example where we have nested objects.
Here is the schema:
This maps to:
Let's validate the JSON object
In the familiar way, we begin with the type matching of the top-level type which nets out to determining the nullability of the following derivative:
Resolving this:
Here is where things get interesting, and we move down to the derivative of the subschema:
The derivative for the nested object fails:
because both of those alternatives net out to ∅.
Therefore
making the derivative
Since this is non-nullable, it makes the whole of τ({a:{c:false}},λ(T0)) equal to ∅, so the object is not valid with respect to the schema.
Implementation Considerations
The rules given above for mapping object-related properties produce very weak expressions, which require a lot of work to resolve. We could do slightly better if we recall that properties are unique in a JSON object and assume we can always access those properties in a particular order, say, in codepoint order. Then we can use the sequence operator instead of the intermix operator when we don't have wildcards. We could also use the sequence operator and just insert wildcards everywhere, but that is not better than just using intermix. Because JSON property names are unique within a given object, it is easy enough to prune the branches, especially if we recognize the pattern:
Furthermore, if x equals any pj, we know this can only be nullable if we pick the branch where we take the derivative of that property, because the derivative of any other pi or of the wildcard will be ∅ (the wildcard excludes all property names mapped from properties). That is:
and in fact if we iterate over the whole object:
It is possible to create yet another extension operator for this case and handle it specially as a performance optimization.
To make this whole scheme worthwhile, we want to amortize the cost of computing derivatives, both within a particular validation episode and across validation episodes. Indeed, the major theme of Brzozowski's paper is that it can be used to construct a finite state automaton for matching the complete extended regular expression. Having constructed such a thing, it can be used again and again. We choose to be slightly more incremental here, and only compute the parts of the machine we actually need, as we go. Experience with this technique in XML Schema has shown that this is more cost effective, as there can be large parts of the state machine that never get accessed by actual data, especially with large schemas.
If we don't worry about type matching, it is pretty straightforward to create a cache of derivative expressions keyed off an expression ID and a symbol.[13] The expression ID needs to be computed by memoizing it, lest we keep storing the same derivative computation over and over again. That is, from the state machine point of view, we don't want to create lots of redundant states. It is reasonably straightforward to canonicalize expressions for memoization.
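A sketch of that arrangement, with names of my own choosing: the cache is keyed by a canonical expression ID plus the symbol, so that structurally identical derivatives collapse to one entry and the state machine is built only as far as the data actually drives it.

```python
class DerivativeCache:
    """Lazily built transition table: (canonical expression key, symbol) -> derivative."""
    def __init__(self, deriv_fn, canonical_key_fn):
        self._deriv = deriv_fn          # computes ΔE/Δsymbol
        self._key = canonical_key_fn    # memoized, stable ID for an expression
        self._table = {}

    def derivative(self, expr, symbol):
        k = (self._key(expr), symbol)
        if k not in self._table:
            self._table[k] = self._deriv(expr, symbol)
        return self._table[k]
```

Per the footnote above, symbols that never appear in the schema can all be folded into a single "unknown symbol" key, so that open-ended property sets do not defeat the cache.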
It is helpful, in practice, to prune the computed derivatives as much as possible, precomputing any of the predicate functions (such as [a=x] or ν) that lead to ε and ∅ values and that can be used to prune further. Here is where the type-matching causes us trouble: we could compute it and prune the derivative, but not if we wish to cache the result, because the cache is keyed only off the symbol, and τ evaluation depends on the value as well.
What is to be done?
There are three basic options:
- Don't cache derivatives at all. We can then prune based on τ results without constraint.
- Include the value in the cache key, so that we cache the appropriate derivative for the symbol:value pair.
- Cache "partial" derivatives: derivatives where all τ calls are replaced with a marker to be evaluated using the actual value presented. When a derivative is pulled from the cache, if it has any unresolved markers, evaluate them and return the pruned result.
Not caching means a lot of work calculating answers: we'd probably be better off just using a more direct algorithm and not using derivatives at all. The whole point of this technique was to compute the finite state machine (here implemented in pieces via the cache). Including values in the cache key means that the derivative cache will have a lot of duplicative entries, which looks like an unpromising approach, especially when you remember that some of these values are whole JSON objects and arrays. We essentially sacrifice the finiteness of states, as a practical matter. A more promising variation is to cache partial derivatives and add a secondary cache for the unresolved τ calculations. The τ cache only needs to store one bit of information for each type plus value, and we can make different decisions about the sizing and lifespan of entries for the two caches. The partial derivative cache (the ∂ cache) can be readily shared across validation episodes, while the τ cache could live for only a single episode. We can tune whether the τ cache caches objects and arrays at all, or if so, how large they might be before we don't bother.
Performance
Measure, because it is never what you think
This test performs a simple validation on a small document (~3K) with up to five levels of nesting, a Medline citation, transcribed directly from its XML representation. The JSON Schema imposes structural constraints (required properties, and sub-properties) as well as some value and simple type constraints. The document is repeatedly validated in a single-threaded loop under various regimes. As a baseline, an XML Schema validation of the XML version of the document is also tested (also using derivatives and the same baseline simple type facet testing). The documents are valid against their respective schemas and both documents and schema files are pulled from a pre-warmed cache. The table shows both the validations per second for the first validation, which for some of the regimes is the one that does most of the work, and for subsequent validations.
Format | Cached schemas? | Derivatives? | ∂ cache? | τ cache? | V/s (non-initial) | V/s (initial) |
---|---|---|---|---|---|---|
XSD | yes | yes | N/A | N/A | 15088 | 678 |
JSON Schema | yes | yes | no | yes | 12618 | 610 |
JSON Schema | yes | yes | yes | yes | 11380 | 415 |
JSON Schema | yes | no | N/A | N/A | 9419 | 1071 |
JSON Schema | no | no | N/A | N/A | 2375 | 1720 |
JSON Schema | no | yes | no | yes | 650 | 617 |
JSON Schema | no | yes | yes | yes | 500 | 461 |
JSON Schema | yes | yes | no | no | 220 | 203 |
JSON Schema | yes | yes | yes | no | 207 | 617 |
JSON Schema | no | yes | no | no | 201 | 210 |
JSON Schema | no | yes | yes | no | 178 | 178 |
The variance in these numbers is quite high, although the relative ordering is generally the same.
What is perhaps most interesting here are the negative results:
- The simple recursive walk algorithm with no caching is not greatly worse than any derivative algorithm, regardless of caching strategy. Given its substantially better initialization cost compared to the derivative algorithms, this suggests that where the same schema is not substantially reused, the derivative algorithms aren't worth the bother. This is despite the fact that this test is something of a best case scenario for the caching in the derivative algorithms.
- Caching partial derivatives doesn't help performance in this scenario: it hurts it slightly, in fact.
I believe the answer to the first negative result is that a lot of the work, at least in this schema, is located in the τ checking, not in the property relation constraints, and that the models are simplistic enough that the recursive walk doesn't have to do much beyond looking at the current set of properties, so the work of computing, constructing, and caching derivatives doesn't really pay.

As for the partial derivatives, I think what is going on here is that (1) the τ cache is in play regardless, and in this scenario, with the same document repeated over and over, it provides a strong benefit, and (2) resolving partial derivatives requires construction of more objects representing the resolved derivative, and the top line on performance here is time spent in allocation from the heap. We can see the strong impact of the τ cache in the two orders of magnitude hit to performance when we turn it off. This scenario is clearly a best-case one for the τ cache, as we can see from these numbers.
The strongest performance win here is from simply having compiled schema objects cached, which gives about a 5 times gain for the recursive walk algorithm, and a 20 times gain for the derivative algorithm. The usefulness of the derivative algorithm is much stronger for cached schemas, unsurprisingly, as the work of computing the derivatives can be amortized over multiple validation episodes.
Similar results obtain from running multiple validations in parallel. This test is a validate-on-load one, where 18 thousand Medline citations are loaded into a database after being validated, using 20 threads. Here the performance is lower overall and more similar across regimes, probably because the update operations dominate overall performance. Here the data is more varied, so the impact of the τ cache is substantially reduced. This test doesn't separate out the costs of the initial use of the schema, and in fact, in effect, each client thread may end up doing some of the initialization work redundantly in parallel, which impacts the amortization.
Format | Cached schemas? | Derivatives? | ∂ cache? | τ cache? | V/s |
---|---|---|---|---|---|
JSON Schema | yes | no | N/A | N/A | 868 |
JSON Schema | yes | yes | no | yes | 701 |
JSON Schema | yes | yes | no | no | 675 |
JSON Schema | yes | yes | yes | yes | 608 |
JSON Schema | yes | yes | yes | no | 588 |
Again we see the same fundamental negative results: derivatives aren't much better (and here are a little worse, in fact) than a simple recursive walk and derivative caching is not particularly helpful.
This is a somewhat disappointing discovery.
It is entirely possible that there are deficiencies in the specific implementation strategy here. While tying model checking directly to τ simplifies the model of what is going on, I don't think it is a great implementation strategy: you end up computing τ recursively on complex objects over and over as you move through branches of their parent type's derivative. The τ cache helps here somewhat, but it might be better to deal with this as a local validation instead. I believe the numbers from the XML Schema implementation are instructive here, because that is how it works: local validity worked from the bottom up. Another approach that might work better is a scheme to also cache the fully resolved derivatives, keyed off the partial derivative expression ID and the actual value of whatever unevaluated τ's it might contain. That would avoid repeated construction of new expression objects whenever we resolve from the ∂ cache.
It is possible to reclaim finiteness and perform full caching by vectorizing the values. Take each distinct facet anywhere in the schema. Each slot in the vector is 1 or 0 depending on whether a particular value satisfies that facet. This then compresses to a binary value of however many bits. While this recovers finiteness, evaluating a lot more facets than necessary in the moment is also not without cost, so it isn't clear what the conditions are under which this could be a performance win.
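A sketch of the vectorization idea (names are mine): evaluate every distinct facet in the schema against the value once and pack the outcomes into an integer, which can then stand in for the value itself as a finite cache key.

```python
def facet_vector(value, all_facets, check_facet) -> int:
    """all_facets: a fixed, ordered list of (facet_name, facet_argument) pairs
    drawn from the whole schema; check_facet evaluates one facet against a value."""
    bits = 0
    for i, (name, arg) in enumerate(all_facets):
        if check_facet(value, name, arg):
            bits |= 1 << i
    return bits
```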
It may well also be the case that the optimization of folding the required property into the mapping for object composition properties is no optimization at all. Perhaps we'd be better served with a mapping like:
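One form in the spirit described earlier (a simultaneous model with one branch carrying only the required names and one carrying only the per-property types) would be roughly:

<
(rq1 & ... & rqn & wc(-[rq1,...,rqn],-[])*),
(p1@κ(s1)? & ... & pm@κ(sm)? & wc(-[p1,...,pm],-[])@κ(sp)*)
>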
Although the models are weaker overall, the first branch is entirely independent of
type matching of specific values, and we can fail fast, pruning values that don't
match it by a pure cache lookup in the derivative cache without any additional resolution
required.
Other Applications of Type-Tagged Regular Expressions
Type-tagged regular expressions like this can be applied to other problems, or can be used to recast other problems in different terms.
First, most obviously, we can recast XML Schema validation in these terms as well. The mapping function is different (rather more straightforward). We do need additional wildcard functions (see Sperberg-McQueen). The τ function evaluates the constraining facets of simple types, analogously to what we have above, and looks at the nullability of the derivative of the content models of complex types. These content models will have their type information retained, tagged either with the type name itself (for named types) or with a generated stable ID. We don't have a single top-level type, so we kick things off by identifying the appropriate element declaration (if any) and computing τ with respect to its type. Attribute checking folds into this model as well, through appropriate mapping. The details of this are beyond the scope of this paper, and work for another day.
Whether evaluated with derivatives or not, type-tagged regular expressions provide a framework for clearer expression of certain kinds of patterns or for a more general understanding of certain kinds of regular expressions. Consider, for example, a conventional regular expression[14]:
A type-tagged expression of this could be something like:
with types
(In keeping with the spirit of abbreviation for this style of regular expression, the type tagging uses \@{type} as a marker rather than the bare @type.)
I believe this is a clearer expression of the intention and meaning of the desired pattern, and may be useful for expositional clarity, if nothing else.
There are extensions to regular expressions that capture a certain class of τ
functions in this fashion already: Unicode character category and block escapes:
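(Typical examples of such escapes, chosen here purely for illustration, would be something like \p{Lu}\p{Ll}*_\p{Lu}\p{Ll}* and \p{IsGreek}+.)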
The first of these expressions matches camel-cased strings of letters with an underbar separator (e.g. "Example_Match") and the second matches a sequence of Greek characters. Rewriting them as type-tagged expressions in the notation of this paper:
Clearly this is not a compact rewrite! However, it does allow us to think about the escape less as a special case than as a way of naming something whose significance is defined by the τ function. Let us reimagine the \p escape as naming a type where the τ function in this case is either "matches a Unicode category with the given name" or "matches a Unicode block with the same name as what follows the 'Is'".

One way to think about this is that we have a representation of a state machine with τ representing a conditional function on the transitions. There is no reason (other than computational efficiency) why τ needs to be a "type" function at all. What restrictions should we place on τ? Any?
Final Thoughts
This work followed a path of applying Brzozowski derivatives to JSON Schema validation, following the pattern of using them for XML Schema validation. This led to the need to extend regular expressions with type tagging, and adding a type checking function that applied the appropriate matching constraints for the compound symbol, and to allow for type-only symbols in the expressions. It turns out, this formulation also works to provide a more uniform story for XML Schema validation beyond content model checking. More generally, tagged symbols combined with a tag matching function can be used more widely to improve the clarity of certain regular expressions and provide a framework for expressing other kinds of matching problems. However, the current implementation performance numbers are not particularly compelling: the need to match types against an infinitude of specific values makes it difficult to find effective caching strategies, undermining the point of using derivatives as an implementation technique in the first place.
References
[JSValidation] H. Andrews, Editor, G. Luff. JSON Schema Validation: A Vocabulary for Structural Validation of JSON IETF Internet-Draft, March 19, 2018. Available at https://tools.ietf.org/html/draft-handrews-json-schema-validation-01
[JSONSchema] H. Andrews, Editor. JSON Schema: A Media Type for Describing JSON Documents IETF Internet-Draft, March 19, 2018. Available at https://tools.ietf.org/html/draft-handrews-json-schema-01
[Brzozowski64] Janusz A. Brzozowski. Derivatives of Regular Expressions. Journal of the ACM 11.4 (1964): 481-494.
[XSD2] W3C: David Peterson, Shudi (Sandy) Gao 高殊镝, Ashok Malhotra, C.M. Sperberg-McQueen, and Henry S. Thompson, editors. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C. April 2012. http://www.w3.org/TR/xmlschema11-2/
[Sperberg-McQueen05] Michael Sperberg-McQueen. Applications of Brzozowski derivatives to XML Schema processing. Presented at Extreme Markup Languages, Montréal, Canada, August 1-5, 2005. Available at http://cmsmcq.com/2005/EML2005Sper0518.html.
[1] The links to Clark's "An algorithm for RELAX NG validation" and Bob Foster's blog referenced by Sperberg-McQueen are, alas, dead.
[2] Strictly speaking, in this case, where we don't have any actual constraints on a, b, and c, we could have dispensed with properties. If we wanted to ensure the properties were all numeric, we would need to have a schema like {"type":"number"} instead of true for each of the properties, and properties would be necessary.
[3] The true schema is not actually equivalent to ε: the derivative rules are different.
[4] Actually, integer is another alternative value for the type property, but as the facets for integer are the same as for number, we don't need to account for it separately here.
[5] There is a small detail to take care of: it is possible to end up with more than one model property according to these rules, but we require the output to have unique keys. In that case we use the simultaneous operator to combine them: {model:m1} + {model:m2} => {model:<m1,m2>}.
[6] You can't actually implement things that way as there may be circular references. In practice, the usual kind of pre-pass or symbol table fix-up is required, if for no other reason than to prevent problems. The JSON Schema specification does say that schemas "should not" make use of infinitely recursive circular references as they may blow up naive validators and specifies such behaviour as undefined.
[7] Support for the format property is optional, as are the specific kinds of formats that are supported. Here I give an exemplar as to how it might be mapped. For other format kinds more elaborate mappings would be required, or the use of special simpleType values that encapsulate any checks that can't be mapped to other facets.
[8] From an implementation perspective, both the type matching rules of the propertyNames property and its mapping are a little strange and special. We store the mapping of s1, but realistically what is stored is the key (κ(μ(s1))) with the actually mapped faceted type stored in a lookup table elsewhere.
[9] Once you follow the conventional dodge of adding a unique symbol for all non-matching symbols encountered.
[10] Note: There are actually a couple of annoying special cases around empty objects and handling subtype recursion that I have elided here. There is an asymmetry due to the fact that top level values have no name, but values inside a JSON object (including other objects) do.
[11] See Sperberg-McQueen05 for many of these; the logic for the others is similar. Since JSON objects have unique keys and contain nothing but keyed items, the analysis of intermix with respect to UPA applies here too.
[12] T0, T1, etc. are the identifiers given to us by κ. The set of properties on the other side of the equals sign is what is produced from the mapping.
[13] Given JSON's proclivity for open-ended sets of properties, it is useful to be able to group all symbols not found in the schema together as a single "unknown symbol" here.
[14] For those unfamiliar with it: this grep-style regular expression uses a slightly different notation than what we've used in this paper: the sequence operator (,) is assumed, [0-5] is an abbreviation for (0|1|2|3|4|5), \d is an abbreviation for a choice between all digits, and dot (.) is a wildcard.
) is a wildcard.