How to cite this paper
Cagle, Kurt. “The Ontologist: Controlled Vocabularies and Semantic Wikis.” Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). https://doi.org/10.4242/BalisageVol8.Cagle01.
Balisage: The Markup Conference 2012
August 7 - 10, 2012
Balisage Paper: The Ontologist
Controlled Vocabularies and Semantic Wikis
Kurt Cagle
Information Architect
Avalon Consulting, Ltd.
Kurt Cagle is an author, editor, and information architect working for Avalon Consulting,
LLC.
He is the author of a number of books and articles focused on XML and web technologies,
and he has
been active as a W3C Invited Expert, technology evangelist and consultant in the XML
and development
communities for more than two decades. Clients have included the ObamaCare Health
Insurance Exchange system,
the US National Archives, the Library of Congress, Harvard Business School and the
Canadian Research
Consortium. He has most recently been focused on XQuery and Semantic Web Development.
Copyright © 2012 Kurt Cagle
Abstract
This paper covers the design details of the Ontologist, a MarkLogic based project
for using RDF triples, XQuery search capabilities and RESTful services to provide
controlled vocabularies, taxonomy management and semantic wikis.
Table of Contents
- Overview
- Controlled Vocabularies and Data Feeds
- URIs, CURIEs and Triplexes
- Terms and Assertions
- Faces, Presentations and Ingestors
- Summary
Overview
The Ontologist began its existence as a way of addressing one of the more vexing
problems of modern distributed applications: managing controlled vocabularies.
This paper and talk explore the design of the Ontologist at an application and
theoretical level, and illustrate how it provides a synergy between XQuery and SPARQL
development in order to build everything from list managers to knowledge management
systems.
Controlled Vocabularies and Data Feeds
A controlled vocabulary typically appears at first as an
enumerated set of terms (with the obvious temptation by schema designers to encode
these
terms in XSD schemas as enumerated simple types).
However, unlike a "normal" enumeration, controlled vocabulary terms have a number
of
properties which can make such a temptation a costly trap. In most cases such controlled
vocabularies have a number of key characteristics:
-
Distinct labels and codes. The displayed value of the
item in question will more than likely not be the same as the underlying
code value that references the term in question. For instance, a set of
colors might have the color name but use RGB notation for identifying the
resources at a system level.
-
Temporality. Items may be added or removed from the
underlying set over time, the values of given sets may change, or the
preferred ordering of the set may change depending upon context. For
instance, if your enumerated set contains the names of stores in a given
region, this list may change as stores are added or closed, or as regions
are redefined.
-
Subordinate Lists. The terms in a list may themselves
identify lists of terms. In this respect, there is a
relationship (formal or implicit) that connects the
parent term (or context) and the child terms. Each term in a list of
countries, for instance, may identify a list of states, regions or
provinces.
-
Uniqueness. The combination of label
and code acts as a de facto unique key for a given term.
-
Multiple representations. The set of metadata about a
term exists independently of the representations used for those terms - a
list of controlled vocabulary terms can be displayed as an XML construct of
various flavors, JSON, CSV, HTML lists or tables or any other underlying
format.
Significantly, these are also characteristics that are found within the notion of
both
Semantic and REST oriented resources. In effect, the terms of a
controlled vocabulary can be thought of as having an associated URI that identifies
each
term and that also identifies the context. Indeed, in all cases, the context for a
list
will be a term itself.
One other subtle point that brings the conflation of RESTful services and semantics
full circle is the fact that conceptually each term in the given list can consequently
be thought of as a reference to a resource, with an associated metadata bundle. That
is
to say, the URI (a semantic construct) for the term is also either itself or can be
associated one-to-one with a unique URL (a RESTful construct) for the term.
Put another way, the context term and its correlated "child" terms are in effect a
data feed. The term feed in this case is in fact intended to
conflate with the notion of "news feed" - an Atom feed and an RSS feed are in fact
both
subclasses of data feeds in which the child nodes are "articles" that are returned
in a
date descending order, are paged, and may actually contain summary content (or even
the
full content) of the associated article as part of each feed entry.
More formally, a data feed can be identified conceptually as the following:
-
Feed
-
Feed URI - This identifies the context term's "address" and
systemic or global name.
-
Feed Label - This gives the "user centric" label that identifies
the feed.
-
Feed Code - This is an identifier or "value" that establishes the
internal (or local) state for the context term.
-
Feed Description - This is used to provide a more detailed
description of the context term, and could be anything from a few
lines of text to the body of a web page.
-
Feed Timestamp - This provides one or more timestamp entities that
identify the age of the context term.
-
Context Content - This is effectively the document payload
corresponding to the term in question. This will be a representation
of the resource indicated by the context item, and may be optional
(not all terms necessarily have content payloads, and not all
representations need to transmit those payloads if they do).
-
Relationship - This is a (potentially implicit) term identifying
the relationship of the context item with its associated entryset.
In a folder/file relationship, this will usually be implicit, but in
a semantic relationship, this will typically be the predicate of a
triple.
-
Entry Set - This is a set of entries that satisfy the relationship
between the context term and the associated child items in the
ontological space.
-
Entry URI - This identifies the entry term's address and
systemic or global name.
-
Entry Label - This gives the "user-centric" label that
identifies the entry.
-
Entry Code - This provides the value for the entry
term.
-
Entry Description - This provides a detailed description
of the entry term.
-
Entry Timestamp - This provides one or more timestamp
entities that identify the age of the entry term.
-
Entry Content - This is a representation of the entry
term's corresponding resource.
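The feed structure listed above can be sketched as a pair of record types. This is only an illustration of the conceptual model, not the Ontologist's own representation; the class names `Feed` and `Entry`, the field names, and the sample color data are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    """One child term in a feed's entry set."""
    uri: str
    label: str
    code: str
    description: str = ""
    timestamps: List[str] = field(default_factory=list)
    content: Optional[str] = None  # payloads are optional


@dataclass
class Feed:
    """A context term plus the entries related to it by one relationship."""
    uri: str
    label: str
    code: str
    relationship: str  # e.g. the predicate of a triple; may be implicit
    description: str = ""
    timestamps: List[str] = field(default_factory=list)
    content: Optional[str] = None
    entries: List[Entry] = field(default_factory=list)


# Hypothetical data: a "color" context term with two entries.
colors = Feed(
    uri="/common/class/color",
    label="Color",
    code="color",
    relationship="app:rel:instanceOf",
    entries=[
        Entry(uri="/common/color/Blue", label="Blue", code="#0000FF"),
        Entry(uri="/common/color/Red", label="Red", code="#FF0000"),
    ],
)

# The label shown to users and the code used by the system stay distinct.
label_to_code = {e.label: e.code for e in colors.entries}
```

Because each `Feed` carries exactly one relationship, a context term with several relationships would be represented by several feeds that share the same URI.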
In theory, such a construct could be recursive, but ultimately it's worth
understanding that any context term may in fact have multiple relationships that
identify the subordinate entries, and that as some of these relationships are themselves
orthogonal the ultimate structure that is conceptually described here is not a recursive
descent tree, but is rather a graph that may ultimately end up resolving to multiple
"roots".
Note that while the above may be seen as a description of an Atom or RSS feed,
it is also, not accidentally, the structure of a typical "search" page, such as the
results of a Google search query. In this case, the "context" term is the query
expression passed. Typing in "Balisage", for instance, will establish the context
for the search to be the term "Balisage", while the relationship is implicitly
"those items that have some contextual relevance to the search". Each search result
will typically carry the name of the site as its label and a short snippet
summarizing the site as its description, with one or more hypertext links to
different representations of that same resource.
URIs, CURIEs and Triplexes
There is a subtle distinction here - the URI of the resource is effectively the
primary URL, but secondary links are still links to different representations of the
same conceptual resource. Similarly an "image" search is in fact a feed that establishes
its primary representation of resources as image URLs that are then rendered in a
table
or sequences of hyperlinked images. This only highlights the fact that the feed
structure is in fact orthogonal to its representation.
In a typical website, the relationships between pages (which can be thought of as
terms) have until fairly recently followed the folder/file model of containment that
reflected the underlying file system that stored the relevant web-page resources.
This
meant that URIs typically tended to follow a containment format as well. However,
over
the course of the last decade the balance of development has shifted from a file system
model to a database model for the storing and generation of web content, which has
in
turn shifted the URI structures for such resources to a more key oriented one.
This has led to some confusion about the construction of URIs and how they relate
to
REST. From a purely REST standpoint, a GUID can effectively identify a resource in
a
data-centric environment. The URL
http://www.myserver.com/term/AFD23C2F559A412C
has one major benefit - it is globally unique. A URL rewriter script could do a lookup
on the term and retrieve that particular key with very little parsing overhead. On
the
other hand, it tells the person requesting the term next to nothing about it, and
it is
arguable that, in many cases, being able to readily identify such a term by an easily
interpretable string outweighs the parsing cost of that string.
After considerable experimentation (which went into the design of the Ontologist),
all
URIs were designed with the following convention:
http://www.myserver.com/domainName/className/instanceName.faceName
The notation is broken into a domain, a class, and an instance, along with a
face that identifies which particular representation to use for
displaying the term feed. The class name establishes the category within a given domain,
while the instance name establishes a human-readable unique name for a given resource.
For instance, the color blue would be given as "color/blue". Because classes themselves
are also terms, they would be in the "class" category - "class/color".
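The URI convention can be illustrated with a small parser. This is a sketch under stated assumptions rather than the Ontologist's actual URL rewriter: the names `TermAddress` and `parse_term_url` are hypothetical, and the face is assumed to be everything after the first dot in the last path segment.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class TermAddress(NamedTuple):
    domain: str
    class_name: str
    instance: str
    face: Optional[str]


def parse_term_url(url: str) -> TermAddress:
    """Split /domainName/className/instanceName.faceName into its parts."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) != 3:
        raise ValueError(f"expected domain/class/instance, got {url!r}")
    domain, class_name, last = segments
    # Assumption: the face is everything after the first dot, so a
    # compound face such as "keys.xml" survives intact.
    instance, dot, face = last.partition(".")
    return TermAddress(domain, class_name, instance, face if dot else None)
```

For example, `parse_term_url("http://www.myserver.com/common/color/blue.html")` yields the domain `common`, the class `color`, the instance `blue` and the face `html`. A URL without an extension yields a face of `None`, leaving the choice of default representation to the caller.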
The domain name arose from a realisation after working with a client who wanted to
have classes that might be used by multiple different groups or customers. That is
to
say, there might be a "class/article" for multiple potential vendors, as well as the
possibility of common classes that could be used by all vendors. While it is possible
to
pass these in parametrically, it was such a useful term for a number of reasons that
it
was made part of the key description.
Consequently, the Ontologist makes use of a triplex notation of the form
domainName:className:instanceName
which uniquely identifies each term in the system, relative to
the server host. These were not universal because the application was ported over
various servers, from development to staging to deployment, and as a consequence the
specific URI could not correspond to a physical server.
It should be pointed out that these triplexes do have a direct correspondence with
a
qname. For server myserver.com, the corresponding qname for the color blue in the
common
domain would look like:
{http://myserver.com:8050/common/color/}blue
where everything within the brackets corresponds to the qualifying category term and
everything after is the instance term. Conceptually, common:color: represents the
qualifying category and blue represents the instance term. Similarly, the class of
color would be in common:class:, corresponding to
{http://myserver.com:8050/common/class/}color
This notation is odd at first glance (it's easy to get confused about the distinction
between triplexes and CURIEs), but if you realize that a CURIE such as
commonColor:blue corresponds to the triplex common:color:blue, it makes working with
these resources in a semantic Turtle notation somewhat more intuitive.
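The correspondence between triplexes, qnames and CURIEs can be made concrete with two small conversion functions. This is an illustrative sketch only: the function names are hypothetical, the base address is taken from the examples in this section, and the camel-cased CURIE prefix is inferred from the single commonColor:blue example rather than from a stated rule.

```python
def triplex_to_qname(triplex: str, base: str) -> str:
    """Render domain:class:instance as {base/domain/class/}instance."""
    domain, class_name, instance = triplex.split(":")
    return f"{{{base.rstrip('/')}/{domain}/{class_name}/}}{instance}"


def triplex_to_curie(triplex: str) -> str:
    """Render domain:class:instance as domainClass:instance."""
    domain, class_name, instance = triplex.split(":")
    # Assumption: the prefix is the domain followed by the capitalized class.
    prefix = domain + class_name[:1].upper() + class_name[1:]
    return f"{prefix}:{instance}"
```

With a base of `http://myserver.com:8050`, the triplex `common:color:blue` maps to the qname `{http://myserver.com:8050/common/color/}blue` and to the CURIE `commonColor:blue`. Only the base changes when the application moves between servers, which is why the triplex itself is kept relative to the host.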
Note
It's easy to worry about such "plexes" getting out of control, but in practice
none of the applications that have been built on top of the Ontologist has ever
needed more than three such terms. The combination of domain, class and instance,
along with the realization that everything else can be built with these, seems to
suffice in limiting the size of such plexes.
Terms and Assertions
Every term in the system is represented internally as an XML document with a structure
very similar to that of the feed format, albeit not quite identical. Listing 1 gives
an
example of such an entry.
<entry xmlns="http://www.avalonconsult.com/xmlns/entry">
  <metadata>
    <id>common:color:Blue</id>
    <url>/common/color/Blue</url>
    <class>common:class:color</class>
    <code>#0000FF</code>
    <domain>app:domain:Common</domain>
    <label key="label">Blue</label>
    <description key="description" xml:space="preserve">Item 'Blue' with code '#0000FF' is in class 'Color'.</description>
    <lastModified>2012-05-02T19:02:51.361676-07:00</lastModified>
  </metadata>
  <assertions>
    <assert subject="common:color:Blue" predicate="app:rel:instanceOf"
            object="common:class:Color"/>
    <assert subject="common:color:Blue" predicate="app:rel:domain" object="app:domain:Common"/>
    <assert subject="common:color:Blue" predicate="app:rel:termOf" object="app:class:Term"/>
    <assert subject="common:color:Blue" predicate="app:rel:workflow"
            object="app:workflow:Active"/>
    <assert subject="common:color:Blue" predicate="app:rel:label" object="#label"/>
    <assert subject="common:color:Blue" predicate="app:rel:description" object="#description"/>
  </assertions>
</entry>
This particular term establishes the color blue in the collection common:colors. The
metadata block holds the context information. If there were a need for a payload it
would be in a content element, though here there isn't any. Perhaps most important
are the assert elements in the assertions set.
Each assertion is, in effect, an RDF triple, with a subject, predicate and object.
In
most cases the subject will be the id of the term. The predicate will be a term as
well,
usually one of a class of relationships that are defined within either the application
domain (part of app:rel:) or in some other domain (such as geo:rel:). The object
will
be another term resource in the system (the assumption here is that the system is
self-contained).
The first assertion is especially worth studying. This defines the instanceOf
relationship between the color blue and the class color within the common domain.
In effect, this is a directed, named edge pointing to the common:class:Color term.
Relative to the color:Blue term, this is an outbound relationship; relative to the
class:Color term, it is an inbound relationship (that is to say, class:Color does
not have any corresponding outbound edge to the color blue). Thus, for the class:Color
object, there is a SPARQL query statement of the form
?color app:rel:instanceOf common:class:Color.
that identifies the set of subjects that have an outbound instanceOf assertion to
class:Color.
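To make the direction of these relationships concrete, the following sketch reads the assert elements from a trimmed copy of the entry above and finds the subjects that point at a given object. It stands in for the SPARQL pattern with plain list filtering; the function names `triples` and `inbound` are hypothetical and are not part of the Ontologist libraries.

```python
import xml.etree.ElementTree as ET

NS = {"e": "http://www.avalonconsult.com/xmlns/entry"}

# A trimmed copy of the entry shown earlier; only two assertions are kept.
ENTRY_XML = """
<entry xmlns="http://www.avalonconsult.com/xmlns/entry">
  <assertions>
    <assert subject="common:color:Blue" predicate="app:rel:instanceOf"
            object="common:class:Color"/>
    <assert subject="common:color:Blue" predicate="app:rel:domain"
            object="app:domain:Common"/>
  </assertions>
</entry>
"""


def triples(entry_xml: str):
    """Yield (subject, predicate, object) for each assert element."""
    root = ET.fromstring(entry_xml)
    for a in root.findall("e:assertions/e:assert", NS):
        yield (a.get("subject"), a.get("predicate"), a.get("object"))


def inbound(store, predicate, obj):
    """Subjects that have an outbound `predicate` edge pointing at `obj`."""
    return [s for (s, p, o) in store if p == predicate and o == obj]


store = list(triples(ENTRY_XML))
instances_of_color = inbound(store, "app:rel:instanceOf", "common:class:Color")
```

Asking the same question in the other direction, with common:color:Blue as the object, returns nothing, which reflects the point that the class term holds no outbound edge back to its instances.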
This is a fairly reasonable operation for a triple store. What makes the Ontologist
useful is that it provides a set of libraries for doing the same type of operations
within an XQuery context in an XML data store such as MarkLogic (though it could do
the
same in eXist or other fourth generation XML databases).
Note
A question that has been asked by reviewers of this paper is why RDF-XML wasn't employed
for this. As it turns out, RDF-XML was explored initially as a potential vehicle to
support this within the Ontologist, but its structure generally does not index well
within XML databases - the queries involved are not especially complex, but they are
involved enough that indexing optimizations become a major factor.
Additionally, there is a certain degree of deliberate redundancy in the system
because more traditional XQuery searches are also permitted on the Ontologist set,
both to order the sequence of returned objects and to create filtered subsets. Given
the advantages inherent for optimized searching and indexing, this generally
outweighs the cost of maintaining redundant data in multiple forms (and is typical
of XQuery applications).
The structure of a class term can provide some additional insight into the
application.
<entry xmlns="http://www.rtis.com/xmlns/vocabulary">
  <metadata>
    <qname>common:class:color</qname>
    <url>/common/class/color</url>
    <class>common:class:fda_list</class>
    <code>color</code>
    <domain>app:domain:Common</domain>
    <label key="label">Color</label>
    <description xml:space="preserve" key="description">A vocabulary of colors as defined by the FDA.</description>
    <lastModified>2012-05-02T19:02:51.361676-07:00</lastModified>
  </metadata>
  <assertions>
    <assert subject="common:class:color" predicate="app:rel:subClassOf"
            object="common:class:Class"/>
    <assert subject="common:class:color" predicate="app:rel:domain" object="app:domain:Common"/>
    <assert subject="common:class:color" predicate="app:rel:defaultRel"
            object="app:rel:instanceOf"/>
    <assert subject="common:class:color" predicate="app:rel:label" object="#label"/>
    <assert subject="common:class:color" predicate="app:rel:description" object="#description"/>
    <assert subject="common:class:color" predicate="app:rel:workflow"
            object="app:workflow:Active"/>
  </assertions>
</entry>
In this case, the term class:color is treated as a subclass of class:Class, which
can be thought of as a folder of classes in the common domain. The term is also an
item in the domain app:domain:Common (domain "sweeps" can include a fairly large
number of terms). The third assertion, however, is one of the most powerful. The
defaultRel predicate determines which predicate the Ontologist should use when no
formal predicate is specified. It is used as in the following SPARQL
query:
$term app:rel:defaultRel ?relationship.
?child-terms ?relationship $term.
$term here is the term from the XQuery invoking this call (such as
common:class:color), ?relationship is the predicate determined by the default
relationship and ?child-terms are the ones that satisfy this relationship.
A similar pattern can be used to walk "up" the tree of default terms as if it were an
inheritance tree (it typically is).
$child-term ?relationship ?parent-term.
?parent-term app:rel:defaultRel ?relationship.
It should be noted, however, that a given term may in fact have
more than one such relationship, meaning again that this system, while tending towards
hierarchies in its structures, is still very much graphlike in the bigger
picture.
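The two default-relationship patterns can be traced over a handful of in-memory triples. This is a minimal sketch that replaces the SPARQL engine with list comprehensions; the function names and the sample triples (including the color Red) are assumptions made for illustration.

```python
# Hypothetical triples, written as (subject, predicate, object).
STORE = [
    ("common:class:color", "app:rel:defaultRel", "app:rel:instanceOf"),
    ("common:color:Blue", "app:rel:instanceOf", "common:class:color"),
    ("common:color:Red", "app:rel:instanceOf", "common:class:color"),
    ("common:color:Blue", "app:rel:domain", "app:domain:Common"),
]


def default_rels(store, term):
    """Predicates that `term` declares as its default relationship."""
    return [o for (s, p, o) in store
            if s == term and p == "app:rel:defaultRel"]


def child_terms(store, term):
    """Walk down: subjects pointing at `term` through its default relationship."""
    rels = default_rels(store, term)
    return [s for (s, p, o) in store if o == term and p in rels]


def parent_terms(store, child):
    """Walk up: objects whose own default relationship `child` uses to reach them."""
    return [o for (s, p, o) in store
            if s == child and p in default_rels(store, o)]
```

Here `child_terms(STORE, "common:class:color")` returns both colors, while `parent_terms(STORE, "common:color:Blue")` returns only the class term: the domain assertion is skipped because app:domain:Common declares no default relationship in this sample. Both functions return lists, so a term with more than one qualifying relationship simply yields more than one result.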
Faces, Presentations and Ingestors
One consequence of the use of such XML embodiments of terms is that the entries involved can not only contain
assertion pointers, but can also identify and invoke the representations used to
both present content and to ingest it. An example can illustrate how this
works.
A face, as indicated earlier, is identified by an extension (although it can also
be
identified parametrically from a query string parameter). The face is also identified
as
being tied in with an HTTP verb - PUT, POST, GET, or DELETE. Within an entry there
is an
optional <pipelines> element that can include maps of the form:
<pipelines>
  <error href="/modules/app/class/Error/error.xq"/>
  <method match="GET">
    <face match="html">
      <action href="/path/to/class/get-html.xq">
        <search ref="s1"/>
      </action>
    </face>
    <face match="keys.xml">
      <action href="/path/to/class/get-html.xq">
        <search ref="s1"/>
      </action>
      <action href="/path/to/class/transform-to-keys-xml.xq">
        <search ref="s1"/>
      </action>
    </face>
    <face match="xml">
      <action href="/path/to/class/get-html.xq">
        <search ref="s1"/>
      </action>
    </face>
    ..
  </method>
</pipelines>
With the combination of method and face, it becomes possible to select a particular
pipeline of actions for processing, along with the ability to retrieve additional
parameters (not shown here). This uses a somewhat complex inheritance model. The
Ontologist processor looks first in the term's XML to see if it has a processor. If
it
does, this is used. If it doesn't (or if the processor passes the modified object
up the
chain) the class corresponding to that term is queried, then the default rel for that
term (in most cases, these are the same thing). This continues until either a match
is
found or the app:class:Term object (the root object in the system) is reached, which
contains defaults for all potential views.
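The lookup order described above can be sketched as a walk up a chain of terms. This is a deliberate simplification, not the Ontologist processor: it collapses the class and default-relationship steps into a single hypothetical `PARENT` link, it ignores processors that pass a modified object up the chain, and the registry contents and the `/path/to/default/...` module paths are invented for the example.

```python
# Hypothetical registry: term -> {(method, face): [action hrefs]}.
PIPELINES = {
    "common:class:color": {
        ("GET", "html"): ["/path/to/class/get-html.xq"],
    },
    "app:class:Term": {
        ("GET", "html"): ["/path/to/default/get-html.xq"],
        ("GET", "xml"): ["/path/to/default/get-xml.xq"],
    },
}

# Hypothetical lookup chain: each term names where to look next.
PARENT = {
    "common:color:Blue": "common:class:color",
    "common:class:color": "app:class:Term",
}

ROOT = "app:class:Term"


def resolve_pipeline(term, method, face):
    """Return (defining term, actions) for the nearest matching pipeline."""
    current = term
    while current is not None:
        actions = PIPELINES.get(current, {}).get((method, face))
        if actions is not None:
            return current, actions
        # Stop once the root term has been checked.
        current = None if current == ROOT else PARENT.get(current)
    return None
```

In this sample, a GET for the html face of common:color:Blue resolves at the color class, while a GET for the xml face falls through to the root term's default. A combination that even the root does not define returns `None`, which is where an error pipeline would presumably take over.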
A particularly significant result of this is that a generic viewer can be
created for terms, but that
more specialized viewers can be built for specific classes - for instance, everything
in a given domain may have different logos, arrangement of elements and even
interpretations of content, and dedicated classes may be able to provide customized
viewers or editors for specific types of payloads. This benefit extends to ingestors
as well - input formats can be customized for specific class content, so that what gets
saved may be processed from other sources. As an example, posting to a json face might
take JSON content and convert it into XML before saving the resource, while posting
to a zip face would unzip the content for that particular class and do some kind of
post-processing on the individual items in the file.
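As a rough illustration of the json face, the following sketch turns a flat JSON object into an XML fragment before it would be saved. The function name `ingest_json` and the choice of a metadata wrapper element are assumptions; a real ingestor would also need to validate that each JSON key is a legal XML element name and to handle nested values.

```python
import json
import xml.etree.ElementTree as ET


def ingest_json(body: str) -> str:
    """Convert a flat JSON object into a <metadata> XML fragment."""
    data = json.loads(body)
    metadata = ET.Element("metadata")
    for key, value in data.items():
        # Assumption: every key is usable as an XML element name.
        child = ET.SubElement(metadata, key)
        child.text = str(value)
    return ET.tostring(metadata, encoding="unicode")


fragment = ingest_json('{"label": "Blue", "code": "#0000FF"}')
```

The resulting fragment contains one child element per JSON key, in the order the keys appeared in the posted body.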
Summary
Combining such post-processing with the creation of relevant assertions (through
some editor interface) makes it possible for this system to function effectively as
a semantic wiki or knowledge management system, with each term effectively being the
same as one entry in the space. It is this piece which I plan on demonstrating at
Balisage.
The Ontologist was written on MarkLogic Server 5.0, but it could reasonably work on
any document-centric database, including XML databases such as eXist, JSON databases
such as CouchDB, and Semantic Triple Stores that combine SPARQL and XQuery. The more
important aspect of this is that it has the potential to be used in a wide variety
of
circumstances, from managing controlled vocabularies to hosting encyclopedic websites
to
acting as a Linked Data repository with queryable interfaces.
References
[Roy Fielding 2000] Fielding, Roy ThomasArchitectural Styles and the Design of Network-based Software Architectures,
PhD Dissertation Thesis, Chapter 5. Representational State Transfer. University of California, Irvine
© 2000. http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm.
[Semantics for the Working Ontologist 2011] Allemang, Dean and Hendler,
JamesSemantics for the Working Ontologists, Effective Modeling
in RDFS and OWL Morgan Kaufmann, 2nd Edition © 2011.
http://workingontologist.org.