Context and Goals
SAGE is an academic publisher whose content is marked up in XML and stored in a Content Management System (CMS) known internally as SOCR (SAGE Online Content Repository). Many types of content are stored in SOCR, but this paper focuses on journals.
SOCR is a CMS with typical characteristics: content comes in (ingestion, validation), goes out (reports, searches, delivery) and is stored/archived. At its core SOCR consists of two applications: a front-end that provides user access and also contains a workflow engine; and a back-end XML database.
This paper presents how content goes out: how it is accessed through a service, SOCRview, running on an XML database. Starting from the initial motivations and an XML database, services, URIs and a REST framework are added to the mix in turn.
Motivation
Content should not be hidden away, accessible only through database specialists. Storing content in a database system has advantages of consistency, scalability and security, but accessing the content often requires special knowledge and privileges. Wouldn't it be nice if typical access could be provided in a simple and intuitive way using, say, HTTP, and more advanced access made easier with a standardized configuration layer (ideally in XML)?
Goals
-
Demonstrate the accessibility of content through a simple HTTP interface
-
Design persistent, readable, meaningful and succinct URIs for content and use them to access content
-
Use variations on the core URI, by adding extensions and postfixes, to access different views of the content including metadata, reports and transformations
-
Create a customizable transformation layer to implement complex or non-standard views
Note
From the beginning, browser access was useful and important but the goal was not to create a web application. The goal was to provide simple and intuitive HTTP-URL based access to content that could be used by services, programmers writing ad hoc scripts or a web application. To date, there is no web application, just a very thin XSLT-to-HTML layer.
XML Database Services
To set the stage for what follows, it is necessary to understand a little about services written in XQuery. A minimum configuration can consist of specifying a port and a location for XQuery files. The examples below demonstrate a simple content query and how to access the requesting URL.
A service is written like an XQuery program where the input context is all the documents in the database, as if the database was one giant root document and each actual document a child of the root document. This example illustrates a service running on port 8123 that returns an arbitrary article. The following is placed in a file, one-article.xqy:
(/article)[1]
Opening the following URL in a browser will return one article:
http://localhost:8123/one-article.xqy
Returns
<article article-type="research-article" dtd-version="1.1d1" r:rsuiteId="6536723" xml:lang="EN">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">EPM</journal-id>
      ...
</article>
There was one word excluded from the stated goal of accessing content using URIs: "directly." We want to use the URI directly and not as a unique id in a parameter.
Not like this:
http://localhost:8080/goGoGadget.xqy?uri=/a/b/c
Yes, like this:
http://localhost:8080/a/b/c
This can be achieved by configuring the service to redirect requests. Below is an example of redirecting all requests to an XQuery file, simple-service.xqy, that contains the following:
let $url := xdmp:get-original-url()
return
  if ( $url eq '/one-article' )
  then (/article)[1]
  else fn:concat("URL: ",$url)
To return one article:
http://localhost:8080/one-article
Otherwise, just echo the request URL:
http://localhost:8080/a/b/c
Returns
URL: /a/b/c
The XQuery program above, simple-service.xqy, shows how an HTTP service can interpret a request URL and access content. The next step would be to use regular expressions to match the request URL, isolating the object URI from the modifiers that indicate which aspect of the object is to be returned (e.g. the object itself, its metadata or a transformed version of the object).
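The idea of separating an object URI from a trailing modifier can be sketched outside the database. The following Python fragment is only an illustration, not the SOCRview implementation: the function name `split_request` and the set of recognised modifiers are assumptions made for the example.

```python
import re

# Hypothetical set of view modifiers, assumed for this sketch only.
VIEW_EXTENSIONS = ("res", "lst", "html", "map")

# Lazily match the object URI, then an optional ".<modifier>" at the very end.
REQUEST = re.compile(
    r"^(?P<uri>/.*?)(?:\.(?P<view>" + "|".join(VIEW_EXTENSIONS) + r"))?$"
)


def split_request(path):
    """Return (object_uri, modifier); modifier is None when absent."""
    match = REQUEST.match(path)
    if match is None:
        raise ValueError("not an absolute request path: " + path)
    return match.group("uri"), match.group("view")


if __name__ == "__main__":
    print(split_request("/a/b/c"))
    print(split_request("/a/b/c.html"))
```

Because the modifier is matched only at the end of the path, dots that occur earlier in the URI are left untouched and the object URI is handed on as a single identifier.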
URI - Initial Analysis
The catalyst for introducing URI design for journals came from a 2012 MarkLogic Users Group London (MUGL) meeting where Jeni Tennison presented her technical approach and architecture for UK legislation [Tennison 2012], in particular the utility of meaningful persistent URIs and how modifiers could be applied to view different aspects of an object. This presentation led us to combine an analysis of our journal content with examples of how other systems provided URI-based access to journal content, producing an initial URI design.
SOCR already had RESTful access to content [RSI 2017]:
http://localhost:8080/rsuite/rest/v1/content/38024?skey=1345
Each document has a unique positive integer id, which works well for identifying content when a CMS is generalized to store anything. But it is not meaningful and the id is not persistent; if you delete a document and add it again it would get a new, different id.
Examples from other publishers
Legacy academic publishing, organized by volume and issue and published in print as well as online, lends itself to hierarchical URIs. A hierarchical URI can be seen on HighWire in the example below: the article on page 395 of journal aas, volume 25, issue 4. This is meaningful, but less so at the article level, as the page number is less meaningful than an article DOI and is also tied to a particular PDF rendering.
http://aas.sagepub.com/content/25/4/395
All SAGE journal articles are identified with a Digital Object Identifier (DOI) [wikipedia DOI]. HighWire used DOIs as an alternative way of directly accessing articles. The following shows the HighWire URL for accessing journal aas, article DOI 10.1177/009539979402500401.
http://aas.sagepub.com/lookup/doi/10.1177/009539979402500401
Some online-only journals use the DOI as the primary identifier to access content.
-
BioMed Central, "Big Data Analytics" DOI 10.1186/s41044-017-0021-9
https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-017-0021-9
-
Public Library of Science, "PLOS ONE" DOI 10.1371/journal.pone.0127502
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127502
Core URI design
Given that articles can be uniquely identified through a DOI, why not stop there? Why include journal, volume and issue identifiers?
-
In SOCR the DOI is not unique because we have parallel versions. For example when an article first enters the system it is in the form of an Accepted Manuscript which has not been assigned a volume or issue. There is a requirement to keep the Accepted Manuscript separately and not as a version of a single article.
-
There is structure in the database for navigating journal content (journal, volume, issue and article container objects) and these must also have URIs
-
There is content at the issue level (e.g. cover images)
-
It is useful to apply metadata at the journal, volume, issue and article levels (e.g. if you want to act on an issue as a whole)
-
Finally, even if it were not necessary to have identifiers above the article level, it is meaningful to be able to know where an article belongs based on its URI
So, given the examples above a reasonable base URI for a journal article might be:
/AAS/25/4/10.1177/009539979402500401
Except we want to use a normalized DOI where forward slash (and most non-alphanumeric characters) is replaced by underscore:
/AAS/25/4/10.1177_009539979402500401
Normalizing was a natural step since the input files are named using the DOI. Later, when creating SOCRview, using the normalized DOI simplifies the regular expressions that parse the URI. Also, we have DOIs like this:
10.1597/1545-1569(1995)032<0206:pfaoat>2.3.co;2
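The normalization step can be sketched as a single substitution. This is a minimal sketch only: the paper says that the forward slash and "most non-alphanumeric characters" become underscores but does not list the exceptions, so the set of characters kept here (letters, digits, "." and "-") and the function name `normalize_doi` are assumptions.

```python
import re

# Assumption: letters, digits, "." and "-" survive; everything else becomes "_".
_UNSAFE = re.compile(r"[^A-Za-z0-9.\-]")


def normalize_doi(doi):
    """Return a DOI made safe for use as a single URI path segment."""
    return _UNSAFE.sub("_", doi)


if __name__ == "__main__":
    print(normalize_doi("10.1177/009539979402500401"))
    # → 10.1177_009539979402500401
```

Under this assumed rule a DOI containing parentheses, angle brackets, colons or semicolons also collapses to a single path segment with no "/" in it, which is the property the later regular expressions rely on.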
Although this paper is focused on journals, there are other types of content in SOCR and in order to unambiguously interpret the URI--especially in a service as described above that will want to respond differently based on matching URI with regular expressions--a namespace-like prefix will indicate that this is a journal URI:
/journal/AAS/25/4/10.1177_009539979402500401
URI for all objects
A journal consists of one or more volumes, a volume of one or more issues, an issue of one or more articles, and an article of at least the article XML with optional PDF and images. Though not complete, the listing below shows the hierarchical structure of content stored in SOCR as a nested series of containers and objects. Each node is an XML document: a container contains references to its children; graphics and PDF objects contain references to files on a file system.
-
journal AJS
/journal/AJS
  -
  volume 44
  /journal/AJS/44
    -
    issue 9
    /journal/AJS/44/9
      -
      issue cover image
      /journal/AJS/44/9/AJS_44_9_cover.tif
      -
      article 10.1177_0363546515618372
      /journal/AJS/44/9/10.1177_0363546515618372
        -
        article XML
        /journal/AJS/44/9/10.1177_0363546515618372.xml
        -
        article PDF
        /journal/AJS/44/9/10.1177_0363546515618372.pdf
        -
        article graphic
        /journal/AJS/44/9/10.1177_0363546515618372-fig1.tif
URI for objects inside an article container: why exclude the article container level from the URI?
/journal/AJS/44/9/10.1177_0363546515618372.xml
rather than
/journal/AJS/44/9/10.1177_0363546515618372/10.1177_0363546515618372.xml
One of the goals was to have succinct URIs without unnecessary repetition. Restricted naming conventions, which require all objects belonging to an article to start with the normalized DOI, allowed the former approach, which avoids repeating the DOI. If this restriction were not present then the latter approach would have been used:
/journal/AJS/44/9/10.1177_0363546515618372/foobar.xml
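The naming restriction can be made concrete with a short sketch. The helper `article_object_uri` below is hypothetical and only illustrates the rule described above: an object belonging to an article sits directly under the issue, and its name must begin with the normalized DOI.

```python
def article_object_uri(tla, volume, issue, normalized_doi, object_name=None):
    """Build the URI of an article container or of an object inside it.

    Hypothetical helper: it enforces the naming convention instead of
    inserting an extra path level for the article container.
    """
    issue_uri = "/journal/{}/{}/{}".format(tla, volume, issue)
    if object_name is None:
        # The article container itself.
        return issue_uri + "/" + normalized_doi
    if not object_name.startswith(normalized_doi):
        raise ValueError(
            "object name must start with the normalized DOI: " + object_name
        )
    return issue_uri + "/" + object_name


if __name__ == "__main__":
    doi = "10.1177_0363546515618372"
    print(article_object_uri("AJS", 44, 9, doi))
    print(article_object_uri("AJS", 44, 9, doi, doi + "-fig1.tif"))
```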
The initial set of URI implemented in the first version of SOCRview also had modifiers on the core URI to provide transformations and different views (e.g. /journal/AJS/44/9.zip would return a zip file of all content in the issue); this will be explored later when discussing the current version of SOCRview.
SOCRview Proof Of Concept (POC)
Full details of the POC would add little, but two aspects have a bearing on what follows. Most importantly, it accomplished most of the stated goals and demonstrated what was possible in a way that words by themselves did not. However, the approach taken to matching and parsing URIs was not ideal.
POC URI processing
The approach taken to processing the URI consisted of tokenizing the URI and then making decisions based on its decomposed parts (i.e. journal, volume, issue, article, extension, etc.). The service worked and was performant but it was difficult to understand and maintain; each additional endpoint increased the complexity of the code. Also, the approach was contrary to the spirit of having persistent URIs for objects. A different philosophy is in play when a URI is matched against a given pattern but subsequently treated as a single identifier.
Production SOCRview using RXQ
An alternative approach to matching and parsing URIs presented itself at another MUGL meeting, where Jim Fuller presented his RESTXQ library, which uses XQuery function annotations to expose RESTful services in MarkLogic [Fuller 2014]. Jim's RESTXQ library is based on Adam Retter's RESTXQ draft [Retter 2016], presented at XML Prague 2012 [Retter 2012].
The RXQ library makes use of XQuery annotations [w3c 2014] on function declarations. Every entry point (endpoint) of the service has a function declared as in the example below. This example shows the default behaviour when no URI is provided (the root URI '/'): it returns a static table of contents XML document:
declare
  %rxq:produces('text/xml')
  %rxq:GET
  %rxq:path('/')
function toc() {
  static:toc()
};
The above example shows three annotations used in SOCRview; this paper will focus on the rxq:path annotation, which contains a regular expression string.
Before showing the %rxq:path annotations that would match the URIs proposed above, it is necessary to explain an enhancement made to RXQ. As ubiquitous and powerful as regular expressions are, they can be cryptic (especially for complex patterns) and difficult to understand or modify; moreover, a programmer should be able to look at a regular expression in an annotation and understand the URI it is intended to match. An abstraction layer was added to support symbolic patterns (pattern variables). Pattern variables are defined in a map:
let $m := map:map()
let $_ := map:put($m,'$doi','(10\.\d{4,5}_[^/]+)')
let $_ := map:put($m,'$tla','([A-Z]*)')
let $_ := map:put($m,'$vol','([^/]+)')
let $_ := map:put($m,'$iss','([^/]+)')
let $_ := map:put($m,'$obj','([^/]+\.$objext)')
let $_ := map:put($m,'$objext','([a-z]+)')
...
Changes were made to the RXQ library to resolve these variables. Finally, the variables are used in a function declaration. The following function will match any of the above object URIs and return the object:
declare
  %rxq:GET
  %rxq:path('(/journal(/$tla(/$vol(/$iss)?)?)?(/$obj|/$doi)?)($filter)?')
function jrnlObject(
  $socrUri, $_1, $_tla, $_2, $_vol, $_3, $_iss, $_4, $_obj, $_objext, $_doi, $_5, $filter
) {
  uf:applyFilters($filter,_getObject($socrUri))
};
In the above function declaration:
-
$tla - Three Letter Acronym - a journal code (e.g. AJS = "The American Journal of Sports Medicine")
-
$vol - volume
-
$iss - issue
-
$obj - an object name - a file name
-
$doi - DOI
-
$filter - to be explained later
Parentheses in regular expressions are used to isolate sub-expressions and capture text. These capture groups [wikipedia regex] are assigned to a corresponding variable in the declared function. In the example most of the captured values are not used; only the URI and filter are used. A future enhancement could implement non-capture groups so that only required capture groups are assigned to variables. A future enhancement might also disallow capture groups inside pattern variables so that what is captured can be understood just from reading the %rxq:path.
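How pattern variables expand and how capture groups line up with the function parameters can be checked with a small Python sketch. It is not the RXQ code: the resolver `resolve`, the use of a full match, and in particular the value given to `$filter` (which is elided in the map above) are assumptions made for the illustration.

```python
import re

# Pattern variables as listed in the paper. "$filter" is not shown there,
# so the value used here is a placeholder assumption.
PATTERN_VARS = {
    "$doi": r"(10\.\d{4,5}_[^/]+)",
    "$tla": r"([A-Z]*)",
    "$vol": r"([^/]+)",
    "$iss": r"([^/]+)",
    "$obj": r"([^/]+\.$objext)",
    "$objext": r"([a-z]+)",
    "$filter": r"(/__filter/.+)",  # assumption
}

_VAR = re.compile(r"\$[a-z]+")


def resolve(pattern, variables=PATTERN_VARS):
    """Expand pattern variables repeatedly until none are left."""
    previous = None
    while previous != pattern:
        previous = pattern
        pattern = _VAR.sub(lambda m: variables[m.group(0)], pattern)
    return pattern


PATH = "(/journal(/$tla(/$vol(/$iss)?)?)?(/$obj|/$doi)?)($filter)?"
JOURNAL_OBJECT = re.compile(resolve(PATH))

# One name per capture group, in the order of the jrnlObject parameters.
PARAM_NAMES = ["socrUri", "_1", "_tla", "_2", "_vol", "_3", "_iss",
               "_4", "_obj", "_objext", "_doi", "_5", "filter"]


def match_journal_object(uri):
    """Return a dict of parameter name to captured text, or None."""
    match = JOURNAL_OBJECT.fullmatch(uri)
    if match is None:
        return None
    return dict(zip(PARAM_NAMES, match.groups()))


if __name__ == "__main__":
    print(match_journal_object("/journal/AJS/44/9/10.1177_0363546515618372.xml"))
```

The expanded expression has thirteen capture groups, one for each parameter of jrnlObject, which is why the unused placeholders `$_1` to `$_5` have to be declared.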
Using RXQ allows for better organization and maintenance of service endpoints. Functions that match URI with complex patterns can be created that act upon the URI, applying any modifiers.
Views
So far all examples of URIs and corresponding endpoints have corresponded to objects (container, non-XML or XML nodes); views are anything that can be derived from a URI and some modifying suffixes. Here are some examples of views:
-
metadata associated with object
-
zip file containing all content in an issue
-
the most recent cover image for a journal
-
transformed XML
Standard extension based views
Simple file extensions (e.g. .html) are used to show structural aspects of an object. Structural aspects mean either
-
resolving the internal integer-based linking to SOCR URI to allow for simple rendering and navigation in a browser
-
resources, metadata or the raw integer based linking from container to child – mostly used by administrators or developers
To illustrate typical structural views, the following views are available for a container node:
-
resource metadata -- every object (container, XML or non-XML) has a corresponding resource / metadata document that can be accessed by appending a .res extension
/journal/AJS.res
-
raw XML document container -- contains numerical ids pointing to its children
/journal/AJS
-
XML document listing the children where the children are referenced by URI
/journal/AJS.lst
-
HTML view child list
/journal/AJS.html
This view converts the document list above into HTML by adding a processing instruction that runs an XSLT 1.0 program, pretty.xsl, in the browser:
<?xml-stylesheet type="text/xsl" href="/xslt/pretty.xsl"?>
-
Map - this is an XML representation of the database structure starting at the given URI:
/journal/AAN/25/3/10.1177_0218492315603212.map
<container name="10.1177_0218492315603212" type="rs_ca" socrUri="/journal/AAN/25/3/10.1177_0218492315603212" id="167632462">
  <title>10.1177_0218492315603212</title>
  <meta name="tla">AAN</meta>
  <meta name="volume">25</meta>
  <meta name="issue">3</meta>
  <meta name="year">2017</meta>
  <meta name="doi">10.1177/0218492315603212</meta>
  <meta name="articleType">case-report</meta>
  <object name="10.1177_0218492315603212-fig2.tif" type="nonxml" socrUri="/journal/AAN/25/3/10.1177_0218492315603212-fig2.tif" id="167632519">
    <title>10.1177_0218492315603212-fig2.tif</title>
    <meta name="tla">AAN</meta>
    ...
    <meta name="md5sum">4882ffc15e361c9bd5737ba1c5855372</meta>
    <created>2017-03-21T16:04:59.004Z</created>
    <modified>2017-03-21T16:04:59.243Z</modified>
  </object>
  <object name="10.1177_0218492315603212.xml" type="article" socrUri="/journal/AAN/25/3/10.1177_0218492315603212.xml" id="167632474">
    <title>Angina in left main coronary artery occlusion by pulmonary artery aneurysm</title>
    <meta name="tla">AAN</meta>
    ...
Deliveries: packages and report
SOCR has over 100 delivery targets, the vast majority of which are simple: a zip file of some or all of the content of a journal issue. There are also some highly customized deliveries (e.g. Epub). Naturally there are deliveries that fall somewhere in between, and the challenge was to push as many of these as possible onto production, where users need only copy a configuration file, change a few ids, and perhaps add or override XML transformations. But new requirements kept increasing the complexity of the transformations specified in the delivery configuration file: multiple transformations, conditional transformations, and so on. XProc was considered but was not a natural fit; XSLT was. Each delivery therefore has two levels of configuration requiring two levels of expertise: an XML delivery configuration file customizable by production users, and an XSLT packaging program requiring a developer.
Deliveries are views where the URL consists of an object URI, a delivery identifier and an extension: .rpt, .dlvr or .zip. For example, the following creates a zip file of all content belonging to an issue:
/journal/AJS/44/9/localDelivery.zip
The delivery identifier is "localDelivery"; every delivery identifier must resolve to a delivery configuration XML fragment. SOCRview will first look for the configuration in a static variable, for standard system deliveries, and then in an external document that can be customized by users, for bespoke deliveries. localDelivery is a system delivery with the following configuration:
<deliveryConfig id="localDelivery">
  <pkgList type="xslt" uri="/deliver/localDelivery.xsl"/>
</deliveryConfig>
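The way a delivery URL breaks down, and the order in which a configuration is looked up, can be sketched as follows. This is a hedged illustration rather than the SOCRview code: the regular expression, the function names and the dictionary standing in for the user-editable external document are all assumptions.

```python
import re

# Object URI, then "/<delivery id>.<ext>" where ext is rpt, dlvr or zip.
DELIVERY_URL = re.compile(
    r"^(?P<uri>/.+)/(?P<delivery>[^/.]+)\.(?P<ext>rpt|dlvr|zip)$"
)

# Stand-in for the static variable holding standard system deliveries.
SYSTEM_DELIVERIES = {
    "localDelivery": '<deliveryConfig id="localDelivery">'
                     '<pkgList type="xslt" uri="/deliver/localDelivery.xsl"/>'
                     '</deliveryConfig>',
}


def parse_delivery_url(path):
    """Split a delivery URL into (object URI, delivery id, extension)."""
    match = DELIVERY_URL.match(path)
    if match is None:
        raise ValueError("not a delivery URL: " + path)
    return match.group("uri"), match.group("delivery"), match.group("ext")


def find_delivery_config(delivery_id, custom_deliveries):
    """System deliveries are consulted first, then user-defined ones."""
    if delivery_id in SYSTEM_DELIVERIES:
        return SYSTEM_DELIVERIES[delivery_id]
    if delivery_id in custom_deliveries:
        return custom_deliveries[delivery_id]
    raise KeyError("unknown delivery identifier: " + delivery_id)


if __name__ == "__main__":
    uri, delivery, ext = parse_delivery_url("/journal/AJS/44/9/localDelivery.zip")
    print(uri, delivery, ext)
    print(find_delivery_config(delivery, {}))
```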
Package delivery
A package delivery assumes the content exists, constructs a map of the content structure rooted at the given URI (see Standard extension based views, above), and runs an XSLT to transform the map into a package specification (XML) that is then interpreted to construct the final package, usually a zip file.
Map -> Packaging XSLT -> Package Specification -> Package
A request URL of
/journal/AJS/44/9/localDelivery.zip
will process a map as listed above through an XSLT
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="2.0">
  <!-- The root container becomes a zip package holding every descendant object -->
  <xsl:template match="container">
    <transform type="zip">
      <xsl:apply-templates select=".//object"/>
    </transform>
  </xsl:template>
  <!-- XML objects are serialized; the util prefix is declared outside this excerpt -->
  <xsl:template match="object[@type ne 'nonxml']">
    <transform name="{util:getObjName(.)}" type="xqyfn" fn="serialize">
      <param name="addHeader"/>
      <param name="removeRsuite"/>
      <object uri="{@socrUri}"/>
    </transform>
  </xsl:template>
  <!-- Non-XML objects (PDF, images) are copied as they are -->
  <xsl:template match="object[@type eq 'nonxml']">
    <object name="{@name}" uri="{@socrUri}"/>
  </xsl:template>
</xsl:stylesheet>
to create a package specification
<transform type="zip">
  <object name="10.1177_0218492315603212.pdf" uri="/journal/AAN/25/3/10.1177_0218492315603212.pdf"/>
  <object name="10.1177_0218492315603212-fig3.tif" uri="/journal/AAN/25/3/10.1177_0218492315603212-fig3.tif"/>
  <object name="10.1177_0218492315603212-fig2.tif" uri="/journal/AAN/25/3/10.1177_0218492315603212-fig2.tif"/>
  <transform name="10.1177_0218492315603212.xml" type="xqyfn" fn="serialize">
    <param name="addHeader"/>
    <param name="removeRsuite"/>
    <object uri="/journal/AAN/25/3/10.1177_0218492315603212.xml"/>
  </transform>
  <object name="10.1177_0218492315603212-fig1.tif" uri="/journal/AAN/25/3/10.1177_0218492315603212-fig1.tif"/>
</transform>
which will return a zip file, localDelivery.zip
Archive:  localDelivery.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    17111  07-09-2017 17:42   10.1177_0218492315603212.xml
  1737310  07-09-2017 17:42   10.1177_0218492315603212-fig1.tif
   303239  07-09-2017 17:42   10.1177_0218492315603212.pdf
  5352824  07-09-2017 17:42   10.1177_0218492315603212-fig2.tif
  1612126  07-09-2017 17:42   10.1177_0218492315603212-fig3.tif
---------                     -------
  9022610                     5 files
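The last stage of the pipeline, interpreting a package specification to build the zip file, might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the SOCRview interpreter: `build_package`, the `fetch` callback that returns an object's bytes for a URI, and the `transforms` table of named functions are all hypothetical, and only the two element types shown in the specification above are handled.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_package(spec_xml, fetch, transforms):
    """Interpret a package specification and return the zip file as bytes.

    fetch(uri) returns the stored bytes for a URI; transforms maps a
    function name (the fn attribute) to a callable taking (bytes, flags).
    Both are hypothetical stand-ins for database access and XQuery functions.
    """
    root = ET.fromstring(spec_xml)
    if root.tag != "transform" or root.get("type") != "zip":
        raise ValueError("expected <transform type='zip'> as the root element")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for entry in root:
            if entry.tag == "object":
                # Plain object: copy the stored bytes under the given name.
                archive.writestr(entry.get("name"), fetch(entry.get("uri")))
            elif entry.tag == "transform":
                # Nested transform: fetch the source, then apply the named function.
                source = entry.find("object")
                flags = [p.get("name") for p in entry.findall("param")]
                function = transforms[entry.get("fn")]
                result = function(fetch(source.get("uri")), flags)
                archive.writestr(entry.get("name"), result)
    return buffer.getvalue()
```

Keeping the specification as data means the packaging XSLT only decides what goes into the package, while a single interpreter decides how it is assembled.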
Report Delivery
A SOCRview report simply runs an XQuery function, passing the URI and a report id; there are no other restrictions and the URI does not have to resolve to existing content.
The example below returns a list of all journal issues where the provided DOI is used, excluding the provided URI. If the DOI is unique the list will be empty; if it is not unique, the report returns the URIs where it is already used.
A request URI,
/journal/AAN/25/3/10.1177_0218492315603212/uniqueDoi.rpt
will use the internal delivery configuration,
<deliveryConfig id="uniqueDoi">
  <report>
    <function fnName="uniqueDoi" fnNamespace="http://sagepub.org/socrview/report" fnLocation="/modules/report.xqy"/>
  </report>
</deliveryConfig>
run the following XQuery function,
declare function uniqueDoi( $socrUri as xs:string , $refxml as node() ) {...};
and return the following result,
<socrUris/>
indicating that the DOI is unique.
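Since the body of uniqueDoi is elided, the logic of the report can only be sketched. The Python fragment below is an assumption about what such a check involves, run against an in-memory list of article container URIs rather than the database; the function name `unique_doi` and the sample URIs used to exercise it are hypothetical.

```python
def unique_doi(socr_uri, known_uris):
    """Return the other URIs whose last path segment (the normalized DOI)
    matches that of socr_uri; an empty list means the DOI is unique."""
    doi = socr_uri.rsplit("/", 1)[-1]
    return sorted(
        uri for uri in known_uris
        if uri != socr_uri and uri.rsplit("/", 1)[-1] == doi
    )


if __name__ == "__main__":
    known = [
        "/journal/AAN/25/3/10.1177_0218492315603212",
        "/journal/AJS/44/9/10.1177_0363546515618372",
    ]
    print(unique_doi("/journal/AAN/25/3/10.1177_0218492315603212", known))
```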
Filters
The final type of view is a filter: a sequence of one or more XSLT, XQuery or XPath expressions run on the content obtained from a URI or URI view. Multiple filters can be executed, left to right. XSLT or XQuery expressions will resolve to program files that form part of the SOCRview code. XPath expressions can be ad hoc and reference any namespaces or functions declared or visible in the code context where the filter is evaluated. Parameters can be used and will be supplied to every XSLT or XQuery module referenced in the filter; if a parameter is not declared by a module it will simply be ignored.
Multiple XSLT filters
The example below applies two XSLT filters to an XML object.
wrapper-one.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <xsl:template match="*">
    <wrapperOne>
      <xsl:copy-of select="."/>
    </wrapperOne>
  </xsl:template>
</xsl:stylesheet>
wrapper-id.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <xsl:param name="id" select="'Default'"/>
  <xsl:template match="*">
    <xsl:element name="{concat('wrapper',$id)}">
      <xsl:copy-of select="."/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
Applying wrapper-one.xsl
/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/xslt/wrapper-one.xsl
returns
<wrapperOne>
  <article article-type="case-report" dtd-version="1.1d1" r:rsuiteId="167632474" xml:lang="en">
    <front>
      ...
Applying wrapper-one.xsl, then wrapper-id.xsl
/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/xslt/wrapper-one.xsl/xslt/wrapper-id.xsl?id=Two&dummy=Null
returns
<wrapperTwo>
  <wrapperOne>
    <article article-type="case-report" dtd-version="1.1d1" r:rsuiteId="167632474" xml:lang="en">
      <front>
        ...
Notice that the parameter id was applied but the non-existent parameter, dummy, was ignored.
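The left-to-right chaining, and the rule that undeclared parameters are silently dropped, can be imitated in a few lines of Python. This is a sketch of the behaviour described above, not of SOCRview's uf:applyFilters: the two wrapper functions stand in for wrapper-one.xsl and wrapper-id.xsl, and `apply_filters` is a hypothetical name.

```python
import inspect


def wrapper_one(content):
    """Stand-in for wrapper-one.xsl: declares no parameters."""
    return "<wrapperOne>" + content + "</wrapperOne>"


def wrapper_id(content, id="Default"):
    """Stand-in for wrapper-id.xsl: declares one parameter, id."""
    return "<wrapper{0}>{1}</wrapper{0}>".format(id, content)


def apply_filters(content, filters, params):
    """Run the filters left to right, offering every request parameter to
    every filter; a filter receives only the parameters it declares."""
    for current in filters:
        # Skip the first argument, which is the content being filtered.
        declared = list(inspect.signature(current).parameters)[1:]
        accepted = {k: v for k, v in params.items() if k in declared}
        content = current(content, **accepted)
    return content


if __name__ == "__main__":
    request_params = {"id": "Two", "dummy": "Null"}
    print(apply_filters("<article/>", [wrapper_one, wrapper_id], request_params))
    # → <wrapperTwo><wrapperOne><article/></wrapperOne></wrapperTwo>
```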
XPath filters
Example, md5sum for a binary object (here a PDF):
/journal/AAN/25/3/10.1177_0218492315603212.pdf/__filter/xdmp:md5(binary()).xpath
12130105eaeaf74a21cbe457b8b70bd0
Example, byte count for XML:
/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/xdmp:binary-size(xdmp:unquote(xdmp:quote(.),(),"format-binary")/binary()).xpath
17061
Example, abstract from article XML:
/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/descendant::abstract.xpath
<abstract> <p>A 51-year-old woman with exercise angina and a history of pulmonary artery hypertension ...</p> <p>After a multidisciplinary evaluation,...</p> </abstract>
Summary
The initial motivation of exposing SAGE's journal content through a simplified interface, and the goals of building this interface as an HTTP service using persistent, readable, meaningful and succinct URIs, were achieved. The usefulness of the approach has so far mostly been seen in redesigning SOCR as multiple services, but the browser interface has also proven popular and useful for technical users in our publishing systems group (who created and maintain SOCR) and the production group (who use SOCR). There has also been a gradual increase in content access through scripts (e.g. by data scientists using Python). URI design for journals has met all stated goals, but URI design for non-journal content has been less satisfying because of a tendency to view content based on its form (i.e. markup, e.g. TEI or DocBook) rather than its function (e.g. a book): a salutary lesson that the time spent thinking about journal URI design was well spent. The use of RXQ has allowed new, non-journal content types to be added easily. Having multiple levels of configuration for deliveries has allowed simple new deliveries to be created without the intervention of a developer or administrator, and complex deliveries, which require development, to be created faster.
References
[RSI 2017] RSuite REST API Reference Version 1 (3.7.x, 4.1.x, 5.x)
[wikipedia DOI] Digital Object Identifier
[Retter 2016] RESTXQ 1.0: RESTful Annotations for XQuery
[Retter 2012] XML Prague 2012: RESTful XQuery - Standardised XQuery 3.0 Annotations for REST
[w3c 2014] XQuery 3.0: An XML Query Language - Annotations
[Tennison 2012] MarkLogic User Group London meeting, May 22, 2012: Jeni Tennison, Jeni's slides
[Fuller 2014] MarkLogic Users Group (MUGL) meeting, April 19, 2014: RXQ v1.0 - RESTXQ for MarkLogic; RXQ RESTXQ for MarkLogic on GitHub