How to cite this paper
Cagle, Kurt. “Schema and SHACL: Bridging the Gap Between XML and RDF Validation.” Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). https://doi.org/10.4242/BalisageVol19.Cagle01.
Balisage: The Markup Conference 2017
August 1 - 4, 2017
Balisage Paper: Bridging the Gap Between XML and RDF Validation
Kurt Cagle
Kurt Cagle has been kicking around the XML, RDF and JSON communities for the
last twenty years, exploring the bleeding edges, forgotten corners and horror
hotspots of the data world. He currently lives in Issaquah, WA, where he spends
most of his time staring at rain.
Copyright © 2017 Semantical LLC
Abstract
XSD Schema is a foundational technology for XML. In the RDF space, OWL has long
established constraints for the language, but with the advent of more sophisticated
SPARQL processes, OWL's weight and complexity are often seen as being too heavy for
many business applications. SHACL, the Shapes Constraint Language, is a SPARQL-friendly
validation language that bears a lot of resemblance to XSD. This paper
examines SHACL and its relationship to XSD, and maps out how this technology may be
an effective bridge not only between XML and RDF but between XML and JSON.
Table of Contents
- Introduction
- Creating Shapes
- Shape Validation and Reporting
- SHACL and System Generated User Interfaces
- Summary
They say that structure is freedom, and in a sense it is. When you're dealing with
multiple constraints, you have to figure out what you can get out of that.
— Dmitri Martin
Introduction
One can argue that the XML Schema Definition Language, or XSD, had a profound impact
upon the XML community. XML, coming from SGML, has long had a clear mechanism for
identifying the structure of a given document instance. However, it had only a limited
concept of type, which inhibited the adoption of XML among developers for whom type
declarations and bindings were often more important than class relationships. At the
same time, until XML, there were few mechanisms for the ad hoc assignment of type that
weren't intrinsically tied into the specific byte-level storage implementation.
Yet XSD by itself has also proven to be only part of a broader constraint validation
strategy. While XSD can, for a given element, identify the children of that element and
the data types for atomic data (including cardinality, regular expression patterns, min
and max values and enumerations), certain types of constraints fall outside of these
terms. The ISO Schematron standard that emerged in the wake of XSD specifically
addressed these relational constraints, making it possible to specify constraints such
as certain enumerations only being valid when an attribute has one type of value as
compared to another. These constraints are frequently specified as rules. The XSD 1.1
specification incorporated some of the features of Schematron, and the ability to
create constraints that span multiple nodes is proving to be one of the more desired
features for validation.
The Resource Description Framework (RDF) is younger than XML (and indeed, younger than
XSD) and because its initial focus was more atomic and assertional, it is perhaps not
surprising that RDF quickly evolved a set of schematic constraint languages - from RDF
Schema to the Web Ontology Language (OWL) to a whole collection of OWL profiles.
Arguably, because RDF works upon the open world assumption, constraining the language
has always been more complex, to the extent that much of the flexibility of the language
has been compromised because OWL itself evolved to be an internally consistent
constraint language with an extremely robust toolset for differentiating between
different types of constraints.
However, OWL also predated SPARQL, which is a language for both querying and
constructing RDF triples. OWL established constraints through the use of blank nodes -
an "open" slot in the triple - while SPARQL made it possible to inject some operational
semantics (variable names) into these slots as rules. The complexity of blank node
semantics tends to make OWL a major hurdle for even semanticists to master, and for
those more used to thinking in terms of SQL queries OWL seemed like expensive overkill,
especially when it often required the forward chaining of assertions and the
commitment of significant memory allocation for transactions that in general changed
at best slowly over time.
The idea that SPARQL could be used to perform validation and constraint has
consequently been floated for a while, and has given rise (through some interesting
historical stepping stones) to the notion of semantic shapes.
Creating Shapes
The distinction between a shape and a class is subtle but can best be stated in XSD
terms. A class can be thought of as analogous to an element declaration in XSD, while a
shape is the analog to a simple or complex type declaration - the first identifies the
existence of a given entity and identifies it as being the structure for all instances
of that type, while a shape defines what that structure is. In effect, a shape holds
roughly the same role as an abstract type in XSD.
For instance, consider a movie such as "Star Wars: A New Hope" by Walt Disney Studios,
which has the following (highly abbreviated) XML structure:
<packet>
    <movie id="star-wars-a-new-hope">
        <title>Star Wars: A New Hope</title>
        <productionDate>2015-11-25</productionDate>
        <franchiseRef ref="star-wars-franchise"/>
        <studioRef ref="the_walt-disney-company"/>
        <studioRef ref="lucasfilms"/>
    </movie>
    <franchise id="star-wars-franchise">
        <name>Star Wars</name>
    </franchise>
    <studio id="the_walt-disney-company">
        <name>The Walt Disney Company</name>
    </studio>
    <studio id="lucasfilms">
        <name>Lucas Films</name>
    </studio>
</packet>
This structure is normalized (broken down into pieces). A schema for it might look like this:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <!-- Global element declarations -->
    <xsd:element name="packet" type="packet_type"/>
    <xsd:element name="movie" type="movie_type"/>
    <xsd:element name="franchise" type="franchise_type"/>
    <xsd:element name="studio" type="studio_type"/>
    <xsd:element name="franchiseRef" type="reference"/>
    <xsd:element name="studioRef" type="reference"/>
    <xsd:complexType name="packet_type">
        <xsd:sequence>
            <xsd:element ref="movie" maxOccurs="unbounded"/>
            <xsd:element ref="franchise" minOccurs="0" maxOccurs="unbounded"/>
            <xsd:element ref="studio" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="movie_type">
        <xsd:sequence>
            <xsd:element name="title" type="xsd:string"/>
            <xsd:element name="productionDate" type="xsd:date" minOccurs="0"/>
            <xsd:element ref="franchiseRef" minOccurs="0"/>
            <xsd:element ref="studioRef" minOccurs="1" maxOccurs="unbounded"/>
        </xsd:sequence>
        <xsd:attribute name="id" type="xsd:ID"/>
    </xsd:complexType>
    <!-- An IDREF can point at any ID in the document; it cannot say which element type it must reach. -->
    <xsd:complexType name="reference">
        <xsd:attribute name="ref" type="xsd:IDREF"/>
    </xsd:complexType>
    <xsd:complexType name="franchise_type">
        <xsd:sequence>
            <xsd:element name="name" type="xsd:string"/>
        </xsd:sequence>
        <xsd:attribute name="id" type="xsd:ID"/>
    </xsd:complexType>
    <xsd:complexType name="studio_type">
        <xsd:sequence>
            <xsd:element name="name" type="xsd:string"/>
        </xsd:sequence>
        <xsd:attribute name="id" type="xsd:ID"/>
    </xsd:complexType>
</xsd:schema>
One problem that XML faces is that external references usually do not identify type
information outside of the document itself. Because RDF consists of normalized data
(increasingly the case for large XML structures), this constraint set is complex to
model with schema.
The same information can be rendered in RDF (here using Turtle, and again, being
rather cavalier about namespaces):
movie:ANewHope a class:Movie;
    movie:title "A New Hope"^^xsd:string;
    movie:productionDate "2015-11-25"^^xsd:date;
    movie:franchise franchise:StarWars;
    movie:studio studio:DisneyStudios,studio:Lucasfilms;
.
franchise:StarWars a class:Franchise;
    franchise:name "Star Wars"^^xsd:string;
.
studio:DisneyStudios a class:Studio;
    franchise:name "Disney Studios"^^xsd:string;
.
studio:Lucasfilms a class:Studio;
    franchise:name "Lucasfilms"^^xsd:string;
.
The RDF here is a bit friendlier for references, at the expense of being more complex
for atomic types. SHACL, the Shapes Constraint Language (defined at
https://www.w3.org/TR/shacl/), provides the language for establishing the constraint
models, along with a reporting vocabulary that provides the results of validation.
An example can showcase what such shapes are capable of specifying. The following
illustrates the shapes associated with this data:
shape:Movie
    a sh:NodeShape;
    sh:targetClass class:Movie; #Applies to all movies.
    sh:property [
        sh:name "Title"; #identifies the UX name of the property
        sh:path movie:title; #This property shape applies to the movie title.
        sh:datatype xsd:string; #title is a string.
        sh:minCount "1"^^xsd:integer; #title is a required property
        sh:maxCount "1"^^xsd:integer;
        sh:order "0"^^xsd:integer;
    ];
    sh:property [
        sh:name "Production Date"^^xsd:string; #identifies the UX name of the property
        sh:path movie:productionDate; #This property shape applies to the production date of the movie.
        sh:datatype xsd:date; #production date is a date.
        sh:minCount "0"^^xsd:integer; #production date is an optional property
        sh:maxCount "1"^^xsd:integer;
        sh:order "1"^^xsd:integer;
    ];
    sh:property [
        sh:name "Franchise"^^xsd:string; #identifies the UX name of the property
        sh:path movie:franchise; #This property shape applies to franchise.
        sh:nodeKind sh:IRI; #The franchise is given as an IRI link.
        sh:minCount "0"^^xsd:integer; #franchise is optional; with no sh:maxCount it is unbounded
        sh:class class:Franchise; #the object of the property is a franchise class
        sh:order "2"^^xsd:integer;
    ];
    sh:property [
        sh:name "Studio"^^xsd:string; #identifies the UX name of the property
        sh:path movie:studio; #This property shape applies to studio.
        sh:nodeKind sh:IRI; #The studio is given as an IRI link.
        sh:minCount "0"^^xsd:integer; #studio is optional; with no sh:maxCount it is unbounded
        sh:class class:Studio; #the object of the property is a studio class
        sh:order "3"^^xsd:integer;
    ];
.
shape:Franchise
    a sh:NodeShape;
    sh:targetClass class:Franchise; #Applies to all franchises.
    sh:property [
        sh:path franchise:name; #This property shape applies to the franchise name.
        sh:name "Name";
        sh:datatype xsd:string; #name is a string.
        sh:minCount "1"^^xsd:integer; #name is a required property
        sh:maxCount "1"^^xsd:integer;
    ];
.
shape:Studio
    a sh:NodeShape;
    sh:targetClass class:Studio; #Applies to all studios.
    sh:property [
        sh:path franchise:name; #Studios reuse the franchise:name predicate for their name.
        sh:name "Name";
        sh:datatype xsd:string; #name is a string.
        sh:minCount "1"^^xsd:integer; #name is a required property
        sh:maxCount "1"^^xsd:integer;
    ];
.
There are a few key predicates in the shape namespace that need to be explained. The
shape:Movie object is an instance of a node shape (as opposed to a property shape). It
has a target class of class:Movie - the shape describes the movie class. It has four
properties, each of which is here treated as a blank node.
Each property has a sh:path property which identifies the predicate that this
property applies to. This can be a single predicate, or a more complex predicate path
(such as the union of multiple predicates, or a predicate path such as that used for
constructing an RDF collection). This differs from XSD, in which any given element
reference must always be relative to its immediate parent. Similarly, the sh:class
property within a property definition identifies the target classes. Unlike in XML and
XSD, this can be used for constraining the result of an IDREF just to instances that are
of a given class. Note that this is basically the same kind of operation as rdfs:range,
while sh:targetClass performs much the same operation as rdfs:domain. In that regard, a
lot of what SHACL does is to pull together a minimal data-friendly ontology from the
rdfs+ and very basic OWL class sets.
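As a minimal sketch of the more complex paths just described, the following shape uses an alternative path (the union of two predicates) and a sequence path (one predicate followed by another). The predicate movie:alternateTitle and the shape name shape:MoviePaths are hypothetical, and the example.org namespace IRIs are placeholders, since the paper leaves its namespaces unspecified.

```turtle
# Namespace IRIs below are placeholders chosen for illustration only.
@prefix sh:        <http://www.w3.org/ns/shacl#> .
@prefix xsd:       <http://www.w3.org/2001/XMLSchema#> .
@prefix shape:     <http://example.org/shape/> .
@prefix class:     <http://example.org/class/> .
@prefix movie:     <http://example.org/movie/> .
@prefix franchise: <http://example.org/franchise/> .

shape:MoviePaths
    a sh:NodeShape ;
    sh:targetClass class:Movie ;
    # Alternative path: a value reached through either predicate counts.
    # movie:alternateTitle is a hypothetical predicate, not part of the data above.
    sh:property [
        sh:path [ sh:alternativePath ( movie:title movie:alternateTitle ) ] ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
    ] ;
    # Sequence path: follow movie:franchise, then franchise:name.
    # The constraint applies to the franchise names reached from the movie.
    sh:property [
        sh:path ( movie:franchise franchise:name ) ;
        sh:datatype xsd:string ;
        sh:maxCount 1 ;
    ] .
```

The second property shape reaches two steps away from the focus node, which is the kind of reference an XSD content model cannot express from the parent element alone.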
The sh:order component addresses another weakness of RDF. The framework generally
has no preferred order for output of properties (unlike XML, which defaults to an
xsd:sequence model from the schema). The use of sh:order establishes an ordering
for properties, making it easier to build interfaces that follow a specific
layout. Additionally, SHACL defines sh:group for gathering properties into
logical groups, then uses sh:order to determine the ordering within each group.
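A sketch of how grouping and ordering might be declared for the movie properties follows. The group names (shape:IdentityGroup, shape:RelationsGroup), their labels, and the shape name shape:GroupedMovie are hypothetical, and the namespace IRIs are placeholders.

```turtle
# Namespace IRIs below are placeholders chosen for illustration only.
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix shape: <http://example.org/shape/> .
@prefix class: <http://example.org/class/> .
@prefix movie: <http://example.org/movie/> .

# Groups are resources in their own right; sh:order on a group orders the groups.
shape:IdentityGroup
    a sh:PropertyGroup ;
    rdfs:label "Identity" ;
    sh:order 0 .

shape:RelationsGroup
    a sh:PropertyGroup ;
    rdfs:label "Relationships" ;
    sh:order 1 .

shape:GroupedMovie
    a sh:NodeShape ;
    sh:targetClass class:Movie ;
    sh:property [
        sh:path movie:title ;
        sh:name "Title" ;
        sh:group shape:IdentityGroup ;
        sh:order 0 ;                      # first within the Identity group
    ] ;
    sh:property [
        sh:path movie:productionDate ;
        sh:name "Production Date" ;
        sh:group shape:IdentityGroup ;
        sh:order 1 ;                      # second within the Identity group
    ] ;
    sh:property [
        sh:path movie:studio ;
        sh:name "Studio" ;
        sh:group shape:RelationsGroup ;
        sh:order 0 ;                      # first within the Relationships group
    ] .
```

Both sh:group and sh:order are non-validating: they carry no constraint, and exist to guide form and display builders.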
Shapes can define cardinality. sh:minCount and sh:maxCount determine the lower and
upper bounds respectively for cardinality, with the assumption that sh:minCount defaults
to "0" while sh:maxCount, when not included, gives the unbounded case. These can be used
together with sh:defaultValue to populate UI widgets or establish the behavior of "new"
components.
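The cardinality defaults and sh:defaultValue can be sketched as follows. The shape name shape:MovieDefaults and the choice of studio:Lucasfilms as a default are assumptions made for illustration, and the namespace IRIs are placeholders.

```turtle
# Namespace IRIs below are placeholders chosen for illustration only.
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix shape:  <http://example.org/shape/> .
@prefix class:  <http://example.org/class/> .
@prefix movie:  <http://example.org/movie/> .
@prefix studio: <http://example.org/studio/> .

shape:MovieDefaults
    a sh:NodeShape ;
    sh:targetClass class:Movie ;
    # Exactly one title: both bounds are stated explicitly.
    sh:property [
        sh:path movie:title ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] ;
    # Neither bound is stated, so zero or more studios are allowed.
    # sh:defaultValue is not checked during validation; a form builder
    # could use it to preselect a studio when a new movie is created.
    sh:property [
        sh:path movie:studio ;
        sh:class class:Studio ;
        sh:defaultValue studio:Lucasfilms ;
    ] .
```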
Finally, SHACL defines SPARQL-based constraints, attached to a shape with sh:sparql,
in which a SPARQL query performs additional validation. These constraints are
intriguing, because they can make for more sophisticated queries, can construct interim
results and can use those to provide internodal constraints.
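In the SHACL Recommendation, a SPARQL query of this kind is attached to a shape through sh:sparql. The sketch below shows one possible internodal rule: a movie that names a franchise must also name at least one studio. The rule itself, the shape name, the message text and the namespace IRIs are all assumptions made for illustration.

```turtle
# Namespace IRIs below are placeholders chosen for illustration only.
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix shape: <http://example.org/shape/> .
@prefix class: <http://example.org/class/> .

# Prefix declarations that the embedded query relies on.
shape:prefixes
    a owl:Ontology ;
    sh:declare [
        sh:prefix "movie" ;
        sh:namespace "http://example.org/movie/"^^xsd:anyURI ;
    ] .

shape:MovieFranchiseNeedsStudio
    a sh:NodeShape ;
    sh:targetClass class:Movie ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "A movie that belongs to a franchise must name at least one studio." ;
        sh:prefixes shape:prefixes ;
        # $this is pre-bound to each focus node; every row returned is a violation.
        sh:select """
            SELECT $this ?value
            WHERE {
                $this movie:franchise ?value .
                FILTER NOT EXISTS { $this movie:studio ?studio }
            }
            """ ;
    ] .
```

Because the query sees the whole data graph, it can relate two properties of the same node, or properties of different nodes, which sh:minCount and sh:class alone cannot do.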
Shape Validation and Reporting
Validation is a two-stage process. The first stage passes a node of a given type to the
validator along with the graph holding the shapes themselves, most likely with the
processor being a fairly complex SPARQL query. The output is a set of triples, which can
then be passed to a post-processor for conversion to an HTML or XML page of some sort.
The reports so generated are similar to those of XSD, in that passing validation
results in no reported violations. However, if even one property or component
relationship is not valid, then this will be output. For instance, suppose that the
above instance had franchise set to a studio value instead.
movie:ANewHope a class:Movie;
movie:title "A New Hope"^^xsd:string;
movie:productionDate "2015-11-25"^^xsd:date;
movie:franchise studio:DisneyStudios; # This line is invalid
movie:studio studio:DisneyStudios,studio:Lucasfilms;
.
The invalid line will then, when validated, return the following report.
[ a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:focusNode movie:ANewHope ;
        sh:resultPath movie:franchise ;
        sh:value studio:DisneyStudios ;
        sh:resultMessage "movie:franchise expects an IRI of type class:Franchise." ;
        sh:sourceConstraintComponent sh:ClassConstraintComponent ;
        sh:sourceShape _:franchiseProperty ; # the blank-node property shape for movie:franchise in shape:Movie
    ]
] .
The report will include one sh:result node for each violation found among the passed
nodes (or among the instances of the given target class in the triple store). These can
also be output as Turtle or similar files, or even mapped to JSON or XML files for
additional post-processing.
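A cardinality violation is reported in the same vocabulary. The sketch below assumes a hypothetical instance, movie:Untitled, that has no movie:title; the blank-node label and the message wording are illustrative and will vary between validators.

```turtle
# Namespace IRIs below are placeholders chosen for illustration only.
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix movie: <http://example.org/movie/> .

[ a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:focusNode movie:Untitled ;         # hypothetical movie with no title
        sh:resultPath movie:title ;
        # There is no sh:value here: the problem is an absent value.
        sh:resultMessage "movie:title is required." ;
        sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
        sh:sourceShape _:titleProperty ;      # the property shape for movie:title
    ]
] .
```

The sh:sourceConstraintComponent value is what a post-processor would most likely switch on when turning results into user-facing messages.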
SHACL and System Generated User Interfaces
A huge challenge exists for people working with large scale RDF triple databases.
Typically, there may be potentially thousands of different classes involved in such
databases, making the hand creation of user interfaces problematic - especially when
dealing with data hubs and similar aggregate enterprise systems. Generating user
interfaces from XSDs is a well-known process, but because of the complexities of OWL,
having systems create their own interfaces was simply out of bounds for all but the
simplest of models.
SHACL has the potential to change that. Because SHACL resources can be grouped and
ordered, there is generally enough information to build not only display-only but also
editable interfaces from SHACL graphs, as well as to support services for validation and
import processing of content when used in a RESTful architecture (which is typical for
RDF systems). This can also be supplemented with permission constraints to determine
editability of content at the property or record level. Because such information is just
RDF in a different graph in a federatable system, there is no real need to create
separate vocabularies as part of SHACL - these simply become other constraint
conditions.
Additionally, SPARQL can be used to determine what tools and widgets work best for
editing, and could potentially construct these as part of an output. A simple example
illustrates the concept:
# $node is assumed to be bound by the caller to the resource being edited.
select ?output where {
    $node a ?class .
    # The shapes live in their own graph.
    graph graph:sh {
        ?shape sh:targetClass ?class ;
               sh:property ?property .
        ?property sh:order ?order ;
                  sh:path ?path ;
                  sh:datatype ?datatype ;
                  sh:name ?label .
        optional { ?property sh:group ?group . }
    }
    # The instance data is read from outside the shapes graph.
    $node ?path ?value .
    # IF takes exactly three arguments, so the second test is nested in the else branch.
    bind(if(sameTerm(?datatype, xsd:string),
            concat('<div class="prop" id="', str($node), '"><span class="label">', str(?label),
                   '</span><input type="text" name="', str(?path),
                   '" value="', str(?value), '"/></div>'),
         if(sameTerm(?datatype, xsd:date),
            concat('<div class="prop"><span class="label">', str(?label),
                   '</span><input type="date" name="', str(?path),
                   '" value="', str(?value), '"/></div>'),
            "")) as ?output)
}
order by ?group ?order
This query would then generate an ordered sequence of HTML items for inputting string
or date content. Obviously, a real world scenario would be more complex, but not
dramatically so.
Summary
The graph model that describes RDF and the folded hierarchy of XML are readily
translatable, although there are assumptions made in each of these data representation
models as they currently exist in OWL that are simply too rich to capture in XSD.
However, SHACL, as a smaller, more data-centric format, may actually be a good tool for
managing equivalency in pipelines where large numbers of resources (millions or even
billions of data "documents") are involved. Because of the work done in reconciling the
core XDM and JSON-DM models, SHACL could act as a unifying bridge, a mechanism for
storing both normalized and denormalized content and providing both validation and
potentially visualization of interfaces moving forward.