How to cite this paper
Cooper, John. “SOCRview: a case study of RESTful web application development for publishing.” Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). https://doi.org/10.4242/BalisageVol19.Cooper01.
Balisage: The Markup Conference 2017
August 1 - 4, 2017
Balisage Paper: SOCRview: a case study of RESTful service development for publishing
John Cooper
Senior Content Systems Analyst
SAGE Publications, London
John has been working with structured text since 1994. Since then he has worn multiple hats: conversion programmer to OmniMark solutions architect, systems integrator, consultant. He has an Honours BSc in Physics and Computer Science from the University of Toronto.
Copyright © 2017 Sage Publications
Abstract
SOCRview is a RESTful HTTP service layer that exposes content--including transformed, packaged, listed or analyzed content--to other services, programmers writing ad hoc scripts and users through persistent, readable, meaningful and concise URIs. Lessons learned from the first proof-of-concept allowed expansion to include customization layers for commonly used delivery formats.
Table of Contents
- Context and Goals
  - Motivation
  - Goals
- XML Database Services
- URI - Initial Analysis
  - SOCR already had RESTful access to content
  - Examples from other publishers
  - Core URI design
  - URI for all objects
- SOCRview Proof Of Concept (POC)
  - POC URI processing
- Production SOCRview using RXQ
- Views
  - Standard extension based views
  - Deliveries: packages and report
    - Package delivery
    - Report Delivery
  - Filters
    - Multiple XSLT filters
    - XPath filters
- Summary
Context and Goals
SAGE is an academic publisher whose content is marked up in XML and stored in a Content Management System (CMS) known internally as SOCR (SAGE Online Content Repository). Many types of content are stored in SOCR but this paper will focus on journals.
SOCR is a CMS with typical characteristics: content comes in (ingestion, validation), goes out (reports, searches, delivery) and is stored/archived. At its core SOCR consists of two applications: a front-end that provides user access and also contains a workflow engine; and a back-end XML database.
This paper presents how content goes out: how it is accessed through a service, SOCRview, running on an XML database. Starting from the initial motivations and an XML database, services, URIs and a REST framework are added to the mix in turn.
Motivation
Content should not be hidden away, accessible only through expert database specialists. Storing content in a database system has advantages of consistency, scalability and security, but accessing the content often requires special knowledge and privileges. Wouldn't it be nice if typical access could be provided in a simple and intuitive way using, say, HTTP, and more advanced access made easier with a standardized configuration layer (ideally in XML)?
Goals
- Demonstrate the accessibility of content through a simple HTTP interface
- Design persistent, readable, meaningful and succinct URIs for content and use them to access content
- Use variations on the core URI, by adding extensions and postfixes, to access different views of the content including metadata, reports and transformations
- Create a customizable transformation layer to implement complex or non-standard views
Note
From the beginning, browser access was useful and important but the goal was not to create a web application. The goal was to provide simple and intuitive HTTP-URL based access to content that could be used by services, programmers writing ad hoc scripts or a web application. To date, there is no web application, just a very thin XSLT-to-HTML layer.
XML Database Services
To set the stage for what follows it is necessary to understand a little about services written in XQuery. A minimum configuration can consist of specifying a port and a location for XQuery files. The examples below demonstrate a simple content query and how to access the requesting URL.
A service is written like an XQuery program where the input context is all the documents in the database, as if the database were one giant root document and each actual document a child of that root. This example illustrates a service running on port 8123 that returns an arbitrary article. The following is placed in a file, one-article.xqy:
(/article)[1]
Opening the following URL in a browser returns one article:
http://localhost:8123/one-article.xqy
Returns
<article article-type="research-article" dtd-version="1.1d1" r:rsuiteId="6536723" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">EPM</journal-id>
...
</article>
There was one word excluded from the stated goal of accessing content using URIs:
"directly." We want to use the URI directly and not as a unique id in a parameter.
Not like this:
http://localhost:8080/goGoGadget.xqy?uri=/a/b/c
Yes, like this:
http://localhost:8080/a/b/c
This can be achieved by configuring the service to redirect requests. Below is an example that redirects all requests to an XQuery file, simple-service.xqy, containing the following:
let $url := xdmp:get-original-url()
return
if ( $url eq '/one-article' ) then
(/article)[1]
else
fn:concat("URL: ",$url)
To return one article:
http://localhost:8080/one-article
Otherwise, just echo the request URL:
http://localhost:8080/a/b/c
Returns
URL: /a/b/c
The XQuery program above, simple-service.xqy, shows how an HTTP service can interpret a request URL and access content. The next step is to use regular expressions to match the request URL, isolating the object URI from the modifiers that indicate which aspect of the object is to be returned (e.g. the object itself, its metadata, a transformed version of the object).
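As an illustration of that next step, the following Python sketch separates an object URI from a trailing view extension with a single regular expression. It is not SOCRview code: the function name split_request is hypothetical, and the list of extensions (.res, .lst, .html, .map) is borrowed from the views described later in this paper.

```python
import re

# Assumed view extensions; any other suffix (e.g. .xml, .pdf) stays part of the object name.
VIEW_EXTENSIONS = ("res", "lst", "html", "map")

REQUEST = re.compile(
    r"(?P<uri>/.*?)"                                         # object URI, matched lazily
    r"(?:\.(?P<view>" + "|".join(VIEW_EXTENSIONS) + r"))?"   # optional view modifier
)


def split_request(url):
    """Return (object_uri, view) where view is None when no modifier is present."""
    match = REQUEST.fullmatch(url)
    if match is None:
        raise ValueError(f"not a request path: {url!r}")
    return match.group("uri"), match.group("view")
```

For example, split_request("/journal/AJS.res") separates the object URI /journal/AJS from the modifier res, while a URI ending in .xml is returned whole because .xml is part of the object name.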
URI - Initial Analysis
The catalyst for introducing URI design for journals came from a 2012 MarkLogic Users Group London (MUGL) meeting where Jeni Tennison presented her technical approach and architecture for UK legislation [MUGL2012]; in particular the utility of meaningful persistent URIs and how modifiers could be applied to view different aspects of an object. This presentation led to an initial URI design that combined an analysis of our journal content with examples of how other systems provided URI-based access to journal content.
SOCR already had RESTful access to content
http://localhost:8080/rsuite/rest/v1/content/38024?skey=1345
Each document has a unique positive integer id [RSuiteAPI], which works well to identify content when a CMS is generalized to store anything. But the id is not meaningful and it is not persistent: if you delete a document and add it again it gets a new, different id.
Examples from other publishers
Legacy academic publishing, organized by volume and issue and published in print as well as online, lends itself to hierarchical URIs. A hierarchical URI can be seen on HighWire in the example below: the article on page 395 of journal aas, volume 25, issue 4. This is meaningful down to the issue level but less so at the article level, as a page number is less meaningful than an article DOI and is also tied to a particular PDF rendering.
http://aas.sagepub.com/content/25/4/395
All SAGE journal articles are identified with a Digital Object Identifier (DOI) [DOI]. HighWire used DOIs as an alternative way of directly accessing articles. Below is the HighWire URL for accessing the article with DOI 10.1177/009539979402500401 in journal aas.
http://aas.sagepub.com/lookup/doi/10.1177/009539979402500401
Some online-only journals use the DOI as the primary identifier to access content.
- BioMed Central, "Big Data Analytics" DOI 10.1186/s41044-017-0021-9
  https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-017-0021-9
- Public Library of Science, "PLOS ONE" DOI 10.1371/journal.pone.0127502
  http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127502
Core URI design
Given that articles can be uniquely identified through a DOI, why not stop there? Why include journal, volume and issue identifiers?
- In SOCR the DOI is not unique because we have parallel versions. For example, when an article first enters the system it is in the form of an Accepted Manuscript, which has not been assigned a volume or issue. There is a requirement to keep the Accepted Manuscript separately and not as a version of a single article.
- There is structure in the database for navigating journal content (journal, volume, issue and article container objects) and these must also have URIs.
- There is content at the issue level (e.g. cover images).
- It is useful to apply metadata at the journal, volume, issue and article levels (e.g. if you want to act on an issue as a whole).
- Finally, even if it were not necessary to have identifiers above the article level, it is meaningful to be able to know where an article belongs based on its URI.
So, given the examples above a reasonable base URI for a journal article might be:
/AAS/25/4/10.1177/009539979402500401
Except we want to use a normalized DOI where forward slash (and most non-alphanumeric
characters) is replaced by underscore:
/AAS/25/4/10.1177_009539979402500401
Normalizing was a natural step since the input files are named using the DOI. Later, when creating SOCRview, using the normalized DOI will simplify the regular expressions that parse the URI. Also, we have DOIs containing characters that are awkward in a URI, like this:
10.1597/1545-1569(1995)032<0206:pfaoat>2.3.co;2
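The normalization step can be sketched in a few lines of Python. The paper only says that forward slash and "most" non-alphanumeric characters become underscores, so the exact rule below (keep letters, digits, full stop and hyphen; replace everything else) is an assumption, and normalize_doi is a hypothetical name rather than a SOCR function.

```python
import re

# Assumption: full stop and hyphen survive normalization; every other
# non-alphanumeric character (including the forward slash) becomes "_".
_UNSAFE = re.compile(r"[^A-Za-z0-9.\-]")


def normalize_doi(doi):
    """Return a DOI that is safe to use as a single URI path segment."""
    return _UNSAFE.sub("_", doi)
```

Under this assumed rule the simple DOI 10.1177/009539979402500401 becomes 10.1177_009539979402500401, matching the URI shown above, and the more awkward DOI in the last example no longer contains brackets, colons or semicolons.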
Although this paper is focused on journals, there are other types of content in SOCR. In order to interpret the URI unambiguously--especially in a service, as described above, that will want to respond differently based on matching the URI with regular expressions--a namespace-like prefix indicates that this is a journal URI:
/journal/AAS/25/4/10.1177_009539979402500401
URI for all objects
A journal consists of one or more volumes, a volume of one or more issues, an issue of one or more articles and an article of at least the article XML with optional PDF and images. The listing below, though not complete, shows the hierarchical structure of content stored in SOCR as a nested series of containers and objects. Each node is an XML document: a container contains references to its children; graphics and PDF nodes contain references to files on a file system.
URI for objects inside an article container: why exclude the article container level from the URI? That is, why use
/journal/AJS/44/9/10.1177_0363546515618372.xml
rather than
/journal/AJS/44/9/10.1177_0363546515618372/10.1177_0363546515618372.xml
One of the goals was to have succinct URIs without unnecessary repetition. A restricted naming convention, requiring all objects belonging to an article to start with the normalized DOI, allowed the former approach, which does not repeat the DOI. Without this restriction the latter approach would have been needed, to allow for URIs such as
/journal/AJS/44/9/10.1177_0363546515618372/foobar.xml
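The trade-off can be made concrete with a small Python sketch. The function object_uri and its arguments are hypothetical; it simply shows that the short form is only unambiguous when the object name starts with the normalized DOI, and falls back to the longer form otherwise (a case the SOCR naming convention rules out).

```python
def object_uri(issue_uri, normalized_doi, object_name):
    """Build the URI of an object that belongs to an article container."""
    if object_name.startswith(normalized_doi):
        # Naming convention holds: the article is recoverable from the name itself.
        return f"{issue_uri}/{object_name}"
    # Convention broken: the container level is needed to say which article owns the object.
    return f"{issue_uri}/{normalized_doi}/{object_name}"
```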
The initial set of URIs implemented in the first version of SOCRview also had modifiers on the core URI to provide transformations and different views (e.g. /journal/AJS/44/9.zip would return a zip file of all content in the issue); this will be explored later when discussing the current version of SOCRview.
SOCRview Proof Of Concept (POC)
Full details of the POC would show little, but two aspects have bearing on what follows. Most importantly, it successfully accomplished most of the stated goals and demonstrated what was possible in a way that words by themselves did not. However, the approach taken to matching and parsing URIs was not ideal.
POC URI processing
The approach taken to processing the URI consisted of tokenizing the URI and then making decisions based on its decomposed parts (i.e. journal, volume, issue, article, extension, etc.). The service worked and was performant but it was difficult to understand and maintain; each additional endpoint increased the complexity of the code. Also, the approach was contrary to the spirit of having persistent URIs for objects: decomposing a URI into parts is a different philosophy from matching a URI against a given pattern and subsequently treating it as a single identifier.
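The POC code is not reproduced in this paper, so the following Python sketch is only a guess at the general shape of a tokenizing approach; the names LEVELS and tokenize are hypothetical. Even this small version needs a special case to stop the full stop inside a DOI from being read as a file extension, which hints at how quickly such code becomes hard to maintain.

```python
LEVELS = ("tla", "volume", "issue", "article")


def tokenize(url):
    """Decompose a journal URL into its levels plus an optional extension."""
    tokens = [t for t in url.split("/") if t]
    if not tokens or tokens[0] != "journal":
        raise ValueError(f"not a journal URL: {url!r}")
    parts = tokens[1:]
    if len(parts) > len(LEVELS):
        raise ValueError(f"too many path segments: {url!r}")
    extension = None
    if parts:
        stem, dot, suffix = parts[-1].rpartition(".")
        # Special case: "10.1177_..." contains a dot that is not an extension,
        # so only purely alphabetic suffixes count as extensions.
        if dot and suffix.isalpha():
            parts[-1], extension = stem, suffix
    fields = dict(zip(LEVELS, parts))
    fields["extension"] = extension
    return fields
```

Every caller then has to branch on which of the returned fields are present, and the URI itself is no longer handled as one identifier.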
Production SOCRview using RXQ
An alternative approach to matching and parsing URIs presented itself at another MUGL meeting where Jim Fuller presented his RESTXQ library, which uses XQuery function annotations to expose RESTful services in MarkLogic [MUGL2014]. Jim's RESTXQ library is based on Adam Retter's RESTXQ draft [RESTXQspec] presented at XML Prague 2012 [XMLPrague2012].
The RXQ library makes use of XQuery annotations [XQuery3] on function declarations. Every entry point (endpoint) of the service has a function declared as in the example below. This example shows the default behaviour when the root URI '/' is requested: return a static table of contents XML document:
declare
%rxq:produces('text/xml')
%rxq:GET
%rxq:path('/')
function toc() { static:toc() };
The above example shows three annotations used in SOCRview; this paper will focus on the rxq:path annotation, which contains a regular expression string.
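For readers more familiar with general-purpose languages, annotation-driven routing is loosely analogous to registering handlers with decorators. The Python sketch below is an analogy only, not how RXQ is implemented: the names ROUTES, path and dispatch, and the second route, are all hypothetical.

```python
import re

ROUTES = []  # (compiled pattern, media type, handler) in registration order


def path(pattern, produces="text/xml"):
    """Decorator playing the role of the path and produces annotations."""
    def register(handler):
        ROUTES.append((re.compile(pattern), produces, handler))
        return handler
    return register


@path("/")
def toc():
    return "<toc/>"


@path(r"/journal/([A-Z]+)")
def journal(tla):
    return f"<journal tla='{tla}'/>"


def dispatch(url):
    """Call the first handler whose pattern matches the whole URL."""
    for pattern, produces, handler in ROUTES:
        match = pattern.fullmatch(url)
        if match:
            # Capture groups become positional arguments of the handler.
            return produces, handler(*match.groups())
    return None
```

The point of the analogy is that each endpoint is declared next to the pattern it serves, so adding an endpoint does not touch any shared parsing code.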
Before showing the %rxq:path annotations that match URIs as proposed above, it is necessary to explain an enhancement made to RXQ. As ubiquitous and powerful as regular expressions are, they can be cryptic--especially for complex patterns--and difficult to understand or modify; moreover, a programmer should be able to look at a regular expression in an annotation and understand the URI it is intended to match. An abstraction layer was therefore added to support symbolic patterns (pattern variables). Pattern variables are defined in a map:
let $m := map:map()
let $_ := map:put($m,'$doi','(10\.\d{4,5}_[^/]+)')
let $_ := map:put($m,'$tla','([A-Z]*)')
let $_ := map:put($m,'$vol','([^/]+)')
let $_ := map:put($m,'$iss','([^/]+)')
let $_ := map:put($m,'$obj','([^/]+\.$objext)')
let $_ := map:put($m,'$objext','([a-z]+)')
...
Changes were made to the RXQ library to resolve these variables. Finally the variables are used in a function declaration. The following function will match any of the above object URIs and return the object:
declare
%rxq:GET
%rxq:path('(/journal(/$tla(/$vol(/$iss)?)?)?(/$obj|/$doi)?)($filter)?')
function jrnlObject(
$socrUri, $_1, $_tla, $_2, $_vol, $_3, $_iss, $_4, $_obj, $_objext, $_doi, $_5, $filter
)
{ uf:applyFilters($filter,_getObject($socrUri)) };
In the above function declaration:
- $tla - Three Letter Acronym - a journal code (e.g. AJS = "The American Journal of Sports Medicine")
- $vol - volume
- $iss - issue
- $obj - an object name - a file name
- $doi - DOI
- $filter - to be explained later
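How pattern variables might be resolved can be sketched in Python using the same patterns as the map above. This is not the modified RXQ code: resolve and compile_path are hypothetical names, and because the $filter pattern is not shown in this paper the value used for it here is a placeholder assumption. Variables are substituted as whole tokens and repeatedly, because $obj itself refers to $objext.

```python
import re

PATTERNS = {
    "$doi": r"(10\.\d{4,5}_[^/]+)",
    "$tla": r"([A-Z]*)",
    "$vol": r"([^/]+)",
    "$iss": r"([^/]+)",
    "$obj": r"([^/]+\.$objext)",
    "$objext": r"([a-z]+)",
    "$filter": r"(/__filter/.+)",  # ASSUMPTION: the real pattern is not given in the paper
}

PATH = "(/journal(/$tla(/$vol(/$iss)?)?)?(/$obj|/$doi)?)($filter)?"


def resolve(path, patterns, max_depth=10):
    """Replace pattern variables until none are left."""
    for _ in range(max_depth):
        # Matching whole variable names avoids "$obj" clobbering "$objext".
        resolved = re.sub(r"\$[a-z]+", lambda m: patterns[m.group(0)], path)
        if resolved == path:
            return resolved
        path = resolved
    raise ValueError("pattern variables are nested too deeply or are circular")


def compile_path(path, patterns=PATTERNS):
    return re.compile(resolve(path, patterns))
```

With the placeholder filter pattern the resolved expression has 13 capture groups, the same as the number of parameters in the function declaration above.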
Parentheses in regular expressions are used to isolate sub-expressions and capture text. These capture groups [Regular Expressions] are assigned, in order, to the corresponding parameters of the declared function. In the example most of the captured values are not used; only the URI and filter are used. A future enhancement could implement non-capture groups so that only required capture groups are assigned to variables. A future enhancement might also disallow capture groups inside pattern variables so that what is captured can be understood just from reading the %rxq:path.
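The effect of the proposed non-capture groups can be shown with Python's (?:...) syntax. In this sketch, which is an illustration and not part of SOCRview, every structural group is made non-capturing so that only the object URI and the filter are captured. The filter pattern is again a placeholder assumption and split_uri_and_filter is a hypothetical name.

```python
import re

FILTER = r"/__filter/.+"  # ASSUMPTION: stands in for the paper's unspecified $filter pattern

OBJECT_AND_FILTER = re.compile(
    r"(/journal(?:/[A-Z]*(?:/[^/]+(?:/[^/]+)?)?)?"   # journal, volume, issue: grouped, not captured
    r"(?:/[^/]+\.[a-z]+|/10\.\d{4,5}_[^/]+)?)"       # object name or normalized DOI
    r"(" + FILTER + r")?"                            # optional filter chain
)


def split_uri_and_filter(url):
    """Return (object_uri, filter) or None when the URL is not a journal URL."""
    match = OBJECT_AND_FILTER.fullmatch(url)
    return None if match is None else match.groups()
```

Only two values are now produced, so a handler written against this pattern would need two parameters instead of thirteen.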
Using RXQ allows for better organization and maintenance of service endpoints. Functions can be created that match URIs with complex patterns and act upon the URI, applying any modifiers.
Views
So far all examples of URIs and corresponding endpoints have corresponded to objects (container, non-XML or XML nodes); views are anything that can be derived from a URI and some modifying suffixes. Here are some examples of views:
- metadata associated with an object
- zip file containing all content in an issue
- the most recent cover image for a journal
- transformed XML
Standard extension based views
Simple file extensions (e.g. .html) are used to show structural aspects of an object. Structural aspects mean either
- resolving the internal integer-based linking to SOCR URIs to allow for simple rendering and navigation in a browser
- resources, metadata or the raw integer-based linking from container to child – mostly used by administrators or developers
To illustrate typical structural views, below are views for a container node:
- resource metadata -- every object (container, XML or non-XML) has a corresponding resource / metadata document that can be accessed by appending a .res extension
  /journal/AJS.res
- raw XML document container -- contains numerical ids pointing to its children
  /journal/AJS
- XML document listing the children, where the children are referenced by URI
  /journal/AJS.lst
- HTML view of the child list
  /journal/AJS.html
  this view converts the document list above into HTML by adding a processing instruction that will run an XSLT 1.0 program, pretty.xsl, in a browser:
<?xml-stylesheet type="text/xsl" href="/xslt/pretty.xsl"?>
- Map - an XML representation of the database structure starting at the given URI:
/journal/AAN/25/3/10.1177_0218492315603212.map
<container name="10.1177_0218492315603212" type="rs_ca" socrUri="/journal/AAN/25/3/10.1177_0218492315603212" id="167632462">
<title>10.1177_0218492315603212</title>
<meta name="tla">AAN</meta>
<meta name="volume">25</meta>
<meta name="issue">3</meta>
<meta name="year">2017</meta>
<meta name="doi">10.1177/0218492315603212</meta>
<meta name="articleType">case-report</meta>
<object name="10.1177_0218492315603212-fig2.tif" type="nonxml" socrUri="/journal/AAN/25/3/10.1177_0218492315603212-fig2.tif" id="167632519">
<title>10.1177_0218492315603212-fig2.tif</title>
<meta name="tla">AAN</meta>
...
<meta name="md5sum">4882ffc15e361c9bd5737ba1c5855372</meta>
<created>2017-03-21T16:04:59.004Z</created>
<modified>2017-03-21T16:04:59.243Z</modified>
</object>
<object name="10.1177_0218492315603212.xml" type="article" socrUri="/journal/AAN/25/3/10.1177_0218492315603212.xml" id="167632474">
<title>Angina in left main coronary artery occlusion by pulmonary artery aneurysm</title>
<meta name="tla">AAN</meta>
...
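A script consuming the map view might look like the following Python sketch, which uses only the standard library. The sample document is an abridged copy of the listing above (elided parts are simply left out), and the function names list_objects and container_meta are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged from the .map listing above; not a complete map document.
MAP_XML = """<container name="10.1177_0218492315603212" type="rs_ca" socrUri="/journal/AAN/25/3/10.1177_0218492315603212" id="167632462">
  <meta name="tla">AAN</meta>
  <meta name="doi">10.1177/0218492315603212</meta>
  <object name="10.1177_0218492315603212-fig2.tif" type="nonxml" socrUri="/journal/AAN/25/3/10.1177_0218492315603212-fig2.tif" id="167632519">
    <meta name="md5sum">4882ffc15e361c9bd5737ba1c5855372</meta>
  </object>
  <object name="10.1177_0218492315603212.xml" type="article" socrUri="/journal/AAN/25/3/10.1177_0218492315603212.xml" id="167632474"/>
</container>"""


def list_objects(map_xml):
    """Return (type, socrUri) for every object anywhere in the map."""
    root = ET.fromstring(map_xml)
    return [(obj.get("type"), obj.get("socrUri")) for obj in root.iter("object")]


def container_meta(map_xml):
    """Return the metadata attached directly to the root container."""
    root = ET.fromstring(map_xml)
    return {meta.get("name"): meta.text for meta in root.findall("meta")}
```

Because every object in the map carries its own socrUri, a client can follow up with further requests without knowing anything about the internal integer ids.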
Deliveries: packages and report
SOCR has over 100 delivery targets, the vast majority of which are simple: a zip file of some or all of the content of a journal issue. There are also some highly customized deliveries (e.g. Epub). Naturally there are deliveries that fall somewhere in between, and the challenge was to push as many of these as possible onto production, where users only need to copy a configuration file, change a few ids, and perhaps add or override XML transformations. But new requirements kept pushing up the complexity of the transformations specified in the delivery configuration file: multiple transformations, conditional transformations, etc. XProc was considered but was not a natural fit; XSLT was. Each delivery therefore has two levels of configuration requiring two levels of expertise: an XML delivery configuration file customizable by production users and an XSLT packaging program requiring a developer.
Deliveries are views where the URL consists of an object URI, a delivery identifier and an extension: .rpt, .dlvr or .zip. For example, the following creates a zip file of all content belonging to an issue:
/journal/AJS/44/9/localDelivery.zip
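The three parts of a delivery URL can be pulled apart with one regular expression, as in this Python sketch. It is illustrative only: parse_delivery is a hypothetical name, and the assumption that a delivery identifier starts with a letter and contains only letters and digits is mine, chosen so that object names such as a normalized DOI are not mistaken for delivery identifiers.

```python
import re

DELIVERY_URL = re.compile(
    r"(?P<uri>/.+)"                           # object URI the delivery is rooted at
    r"/(?P<delivery>[A-Za-z][A-Za-z0-9]*)"    # delivery identifier (assumed alphanumeric)
    r"\.(?P<ext>rpt|dlvr|zip)"                # delivery extension
)


def parse_delivery(url):
    """Return (object_uri, delivery_id, extension), or None if not a delivery URL."""
    match = DELIVERY_URL.fullmatch(url)
    if match is None:
        return None
    return match.group("uri"), match.group("delivery"), match.group("ext")
```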
The delivery identifier is "localDelivery"; every delivery identifier must resolve to a delivery configuration XML fragment. SOCRview first looks for the configuration in a static variable, for standard system deliveries, and then in an external document that can be customized by users, for bespoke deliveries. localDelivery is a system delivery with the following configuration:
<deliveryConfig id="localDelivery">
<pkgList type="xslt" uri="/deliver/localDelivery.xsl"/>
</deliveryConfig>
Package delivery
A package delivery assumes the content exists. It constructs a map of the content structure rooted at the given URI (see Standard extension based views, above) and runs an XSLT to transform the map into a package specification (XML), which is then interpreted to construct the final package, usually a zip file.
Map -> Packaging XSLT -> Package Specification -> Package
A request URL of
/journal/AJS/44/9/localDelivery.zip
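will process a map as listed above through an XSLT
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
  <!-- The util prefix used below must also be bound to a namespace; that declaration is not shown in this listing. -->
  <!-- A container becomes a zip package built from all of its descendant objects. -->
  <xsl:template match="container">
    <transform type="zip">
      <xsl:apply-templates select=".//object"/>
    </transform>
  </xsl:template>
  <!-- XML objects are passed through the serialize function before packaging. -->
  <xsl:template match="object[@type ne 'nonxml']">
    <transform name="{util:getObjName(.)}" type="xqyfn" fn="serialize">
      <param name="addHeader"/>
      <param name="removeRsuite"/>
      <object uri="{@socrUri}"/>
    </transform>
  </xsl:template>
  <!-- Non-XML objects (PDF, images) are referenced directly by name and URI. -->
  <xsl:template match="object[@type eq 'nonxml']">
    <object name="{@name}" uri="{@socrUri}"/>
  </xsl:template>
</xsl:stylesheet>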
to create a package specification
<transform type="zip">
<object name="10.1177_0218492315603212.pdf" uri="/journal/AAN/25/3/10.1177_0218492315603212.pdf"/>
<object name="10.1177_0218492315603212-fig3.tif" uri="/journal/AAN/25/3/10.1177_0218492315603212-fig3.tif"/>
<object name="10.1177_0218492315603212-fig2.tif" uri="/journal/AAN/25/3/10.1177_0218492315603212-fig2.tif"/>
<transform name="10.1177_0218492315603212.xml" type="xqyfn" fn="serialize">
<param name="addHeader"/>
<param name="removeRsuite"/>
<object uri="/journal/AAN/25/3/10.1177_0218492315603212.xml"/>
</transform>
<object name="10.1177_0218492315603212-fig1.tif" uri="/journal/AAN/25/3/10.1177_0218492315603212-fig1.tif"/>
</transform>
which will return a zip file, localDelivery.zip
Archive: localDelivery.zip
Length Date Time Name
--------- ---------- ----- ----
17111 07-09-2017 17:42 10.1177_0218492315603212.xml
1737310 07-09-2017 17:42 10.1177_0218492315603212-fig1.tif
303239 07-09-2017 17:42 10.1177_0218492315603212.pdf
5352824 07-09-2017 17:42 10.1177_0218492315603212-fig2.tif
1612126 07-09-2017 17:42 10.1177_0218492315603212-fig3.tif
--------- -------
9022610 5 files
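The last stage of the pipeline, interpreting a package specification to build the zip, can be sketched in Python with the standard library. This is not the SOCRview interpreter: the specification below is a shortened, hypothetical example in the same shape as the one above, STORE stands in for the database, and what the addHeader parameter does (prepend an XML declaration) is an assumption made for the sake of the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SPEC = """<transform type="zip">
  <object name="a.pdf" uri="/journal/AAN/25/3/a.pdf"/>
  <transform name="a.xml" type="xqyfn" fn="serialize">
    <param name="addHeader"/>
    <object uri="/journal/AAN/25/3/a.xml"/>
  </transform>
</transform>"""

# Hypothetical stand-in for content fetched from the database by URI.
STORE = {
    "/journal/AAN/25/3/a.pdf": b"%PDF-stub",
    "/journal/AAN/25/3/a.xml": b"<article/>",
}


def serialize(data, params):
    # ASSUMPTION: addHeader means "prepend an XML declaration".
    if "addHeader" in params:
        return b'<?xml version="1.0" encoding="UTF-8"?>\n' + data
    return data


FUNCTIONS = {"serialize": serialize}


def build_package(spec_xml, fetch):
    """Interpret a zip package specification and return the zip as bytes."""
    root = ET.fromstring(spec_xml)
    if root.tag != "transform" or root.get("type") != "zip":
        raise ValueError("expected a zip package specification")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for entry in root:
            if entry.tag == "object":
                # Plain objects are copied into the package unchanged.
                archive.writestr(entry.get("name"), fetch(entry.get("uri")))
            elif entry.tag == "transform":
                # Nested transforms run a named function over one source object.
                params = {p.get("name") for p in entry.findall("param")}
                source = fetch(entry.find("object").get("uri"))
                archive.writestr(entry.get("name"), FUNCTIONS[entry.get("fn")](source, params))
    return buffer.getvalue()
```

Keeping the specification as data means the packaging XSLT decides what goes into a package while a single generic interpreter decides how it is assembled.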
Report Delivery
A SOCRview report simply runs an XQuery function, passing it the URI and a report id; there are no other restrictions and the URI does not have to resolve to existing content.
The example below returns a list of all journal issues where the provided DOI is used, excluding the provided URI; if the DOI is unique the list will be empty; if the DOI is not unique the report returns the URIs where it is already used.
A request URI,
/journal/AAN/25/3/10.1177_0218492315603212/uniqueDoi.rpt
will use the internal delivery configuration,
<deliveryConfig id="uniqueDoi">
<report>
<function
fnName="uniqueDoi"
fnNamespace="http://sagepub.org/socrview/report"
fnLocation="/modules/report.xqy"/>
</report>
</deliveryConfig>
run the following XQuery function,
declare function uniqueDoi(
$socrUri as xs:string
, $refxml as node()
)
{...};
and return the following result,
<socrUris/>
indicating that the DOI is unique.
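The body of the report function is elided above, so the following Python sketch only illustrates the described behaviour, not the actual XQuery. The name unique_doi, the dictionary standing in for a database lookup, and the second URI in the usage example are all hypothetical.

```python
def unique_doi(socr_uri, uris_by_doi):
    """Return the other URIs that already use the DOI found at the end of socr_uri.

    uris_by_doi maps a normalized DOI to every URI stored under that DOI;
    it is a stand-in for a database query.
    """
    normalized_doi = socr_uri.rsplit("/", 1)[-1]
    # Exclude the URI being checked; an empty result means the DOI is unique.
    return sorted(uri for uri in uris_by_doi.get(normalized_doi, ()) if uri != socr_uri)
```

Note that the requested URI does not need to be present in the lookup at all, which mirrors the statement that a report URI does not have to resolve to existing content.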
Filters
The final type of view is a filter: a sequence of one or more XSLT, XQuery or XPath expressions run on the content obtained from a URI or URI view. Multiple filters can be executed, left to right. XSLT or XQuery expressions resolve to program files that form part of the SOCRview code. XPath expressions can be ad hoc and reference any namespaces or functions declared or visible in the code context where the filter is evaluated. Parameters can be used and will be supplied to every XSLT or XQuery module referenced in the filter; if a parameter is not declared by a module it will simply be ignored.
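To make the URL shape concrete, here is a Python sketch that splits a filter request into object URI, filter steps and parameters. It is inferred from the example URLs in the following sections rather than from SOCRview code: parse_filter is a hypothetical name, and the sketch handles only two cases, a chain of kind/name pairs (as in the XSLT examples) or a single expression ending in .xpath. How SOCRview itself combines XPath steps with other steps is not covered.

```python
from urllib.parse import parse_qsl

MARKER = "/__filter/"


def parse_filter(request):
    """Return (object_uri, steps, params) for a request path with optional query."""
    path, _, query = request.partition("?")
    params = dict(parse_qsl(query))
    object_uri, marker, chain = path.partition(MARKER)
    if not marker:
        return object_uri, [], params
    if chain.endswith(".xpath"):
        # A single ad hoc XPath expression; it may itself contain "/".
        return object_uri, [("xpath", chain[: -len(".xpath")])], params
    tokens = chain.split("/")
    if len(tokens) % 2:
        raise ValueError(f"expected kind/name pairs in filter chain: {chain!r}")
    # Pairs are kept in URL order, which is also the execution order.
    return object_uri, list(zip(tokens[0::2], tokens[1::2])), params
```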
Multiple XSLT filters
The example below applies two XSLT filters to an XML object.
wrapper-one.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xlink="http://www.w3.org/1999/xlink"
>
<xsl:template match="*">
<wrapperOne>
<xsl:copy-of select="."/>
</wrapperOne>
</xsl:template>
</xsl:stylesheet>
wrapper-id.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xlink="http://www.w3.org/1999/xlink"
>
<xsl:param name="id" select="'Default'"/>
<xsl:template match="*">
<xsl:element name="{concat('wrapper',$id)}">
<xsl:copy-of select="."/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
Applying wrapper-one.xsl
/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/xslt/wrapper-one.xsl
returns
<wrapperOne>
<article article-type="case-report" dtd-version="1.1d1" r:rsuiteId="167632474" xml:lang="en">
<front>
...
Applying wrapper-one.xsl, then wrapper-id.xsl
/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/xslt/wrapper-one.xsl/xslt/wrapper-id.xsl?id=Two&dummy=Null
returns
<wrapperTwo>
<wrapperOne>
<article article-type="case-report" dtd-version="1.1d1" r:rsuiteId="167632474" xml:lang="en">
<front>
...
Notice that parameter id was applied but the non-existent parameter, dummy, was ignored.
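The same left-to-right behaviour, including the silent dropping of undeclared parameters, can be imitated in Python with plain functions standing in for the two stylesheets. This is an analogy under stated assumptions, not how SOCRview invokes XSLT: wrapper_one, wrapper_id and apply_filters are hypothetical names.

```python
import inspect
import xml.etree.ElementTree as ET


def wrapper_one(doc):
    """Stand-in for wrapper-one.xsl: wrap the document in <wrapperOne>."""
    wrapper = ET.Element("wrapperOne")
    wrapper.append(doc)
    return wrapper


def wrapper_id(doc, id="Default"):
    """Stand-in for wrapper-id.xsl: wrap the document in <wrapper{id}>."""
    wrapper = ET.Element("wrapper" + id)
    wrapper.append(doc)
    return wrapper


def apply_filters(doc, filters, params):
    """Run filters left to right, passing each only the parameters it declares."""
    for run in filters:
        declared = inspect.signature(run).parameters
        accepted = {k: v for k, v in params.items() if k in declared and k != "doc"}
        doc = run(doc, **accepted)
    return doc
```

Applying wrapper_one and then wrapper_id with the parameters id=Two and dummy=Null yields a wrapperTwo element containing a wrapperOne element, mirroring the result shown above.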
XPath filters
Example, md5sum for a PDF:
/journal/AAN/25/3/10.1177_0218492315603212.pdf/__filter/xdmp:md5(binary()).xpath
12130105eaeaf74a21cbe457b8b70bd0
Example, byte count for XML:
/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/xdmp:binary-size(xdmp:unquote(xdmp:quote(.),(),"format-binary")/binary()).xpath
17061
Example, abstract from article XML:
/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/descendant::abstract.xpath
<abstract>
<p>A 51-year-old woman with exercise angina and a history of pulmonary artery hypertension ...</p>
<p>After a multidisciplinary evaluation,...</p>
</abstract>
Summary
The initial motivation of exposing SAGE's journal content through a simplified interface, and the goals of building this interface through an HTTP service utilizing persistent, readable, meaningful and succinct URIs, were achieved. The usefulness of the approach has so far mostly been seen in redesigning SOCR as multiple services, but the browser interface has also proven popular and useful for technical users in our publishing systems group--who created and maintain SOCR--and the production group--who use SOCR. There has also been a gradual increase in content access through scripts (e.g. data scientists using Python). URI design for journals has met all stated goals, but URI design for non-journal content has been less satisfying because of a tendency to view content based on its form (i.e. markup, e.g. TEI, DocBook, etc.) rather than its function (e.g. a book)--a salutary lesson that the time spent thinking about journal URI design was well spent. The use of RXQ has allowed easy addition of new, non-journal, content types. Having multiple levels of configuration for deliveries has allowed simple new deliveries to be created without the intervention of a developer or administrator and complex deliveries, requiring development, to be created faster.