Kimber, Eliot. “Project Mirabel: XQuery-Based Documentation Reporting and Exploration System.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Kimber01.
Balisage: The Markup Conference 2022 August 1 - 5, 2022
Balisage Paper: Project Mirabel: XQuery-Based Documentation Reporting and Exploration System
Eliot Kimber
Senior Product Content Engineer
ServiceNow
Eliot Kimber is a founding member of the W3C XML Working Group, the OASIS Open DITA
Technical Committee, a co-author of ISO/IEC 10744:1996, HyTime, and a long-time SGML
and
XML practitioner in a variety of domains. When not wrangling angle brackets Eliot
likes to
bake. Eliot holds a black belt in the Japanese martial art Aikido.
Describes the organic development of a multi-function XQuery-based system, Project
Mirabel, for capturing and reporting on the results of applying Schematron validation
to
large sets of DITA documents, growing from a simple reporting utility to a multi-function
platform for general reporting and exploration over large and complex sets of DITA
content,
representing non-trivial hyperdocuments, all within the span of three weeks.
Each version of ServiceNow's primary product, ServiceNow Platform, is documented by
a
collection of about 40,000 DITA topics, organized into about 40 different "bundles",
where a
bundle is a top-level unit of publishing, roughly corresponding to a major component
of the
Platform.
Bundles are DITA [DITA] maps, which use hyperlinks to organize topics into
sequences and hierarchies for publication. These topics are authored by about 200
writers in the
Product Content organization, spread around the globe. ServiceNow maintains four active
product
versions at any given time, where each version has its own set of topics and maps.
The content
is managed in git repositories.
This content has been under active development in its current DITA form for about
eight
years, during which time ServiceNow has experienced explosive growth. During this
time the
content accumulated hundreds of thousands of instances of various errors, as identified
by
Schematron rules developed over those eight years, reflecting editorial and terminology
concerns, markup usage rules, and content details, such as leading and trailing spaces
in
paragraphs.
In February of 2022 Product Content decided to hold a "fixit week" where all authoring
activity would be paused and writers would instead spend their time fixing errors
in the content
in an attempt to pay down this accumulated technical debt.
To support this task the Product Content Engineering team, which develops and maintains
the
tools and infrastructure for Product Content, needed to supply managers with data
about the
validation errors so that they could plan and direct the fix up work most effectively.
In
addition, we needed to capture validation results over time to then document and report
our
fixup achievements over the course of fixit week and beyond. This validation data
capture and
reporting ability did not exist before the start of fixit week.
This paper is a description of the technical details of the resulting Mirabel system
and the
story of how we developed the validation data capture and reporting tools we needed
in a very
short time and, by taking advantage of the general facilities of XQuery as a language
for
working with XML data and the BaseX XQuery engine [BASEX] as a powerful but
easy-to-implement and easy-to-deploy infrastructure component, established the basis
for a much
more expansive set of features for exploring and reporting on ServiceNow's large and
dynamic
corpus of product and internal documentation.
We named this system "Project Mirabel", reflecting a practice of naming internal projects
after characters from animation (we have Baymax and Voltron as notable examples).
Mirabel is the character Mirabel Madrigal from the 2021 Disney movie
Encanto. Mirabel's power is to see what is there, and that is exactly the
purpose of Project Mirabel: to give visibility and insight to all aspects of the XML
and
supporting data that documents ServiceNow's products, including validation reporting,
content
searching and analysis, version management and history, linking information, etc.
The work described in this paper was performed by the author, Eliot Kimber, with
contributions from Scott Hudson and Abdul Chaudry, all of ServiceNow.
How We Built Project Mirabel
The Mirabel system grew organically from some simple requirements. The initial
implementation occurred very rapidly, over the course of three weeks of intense coding.
It was
completely unplanned and unauthorized except to the degree that we had a "provide
a validation
dashboard" mandate. The use of XQuery and BaseX allowed us to develop this system
quickly
using an iterative approach combined with good engineering practices: modularity,
test-driven
development (as much as we could), separation of concerns, and so on.
At the start we knew what we wanted to do but we didn't really know how to do it at
the
detail level. It had been nearly ten years since we last worked heavily with BaseX
and XQuery,
XQuery 3 was new to us, and we had never tried to implement a robust, multi-user web
application using BaseX. We had experience with MarkLogic, which is a very different
environment for implementing large-scale applications, so that experience did not
translate
that well to BaseX.
This is a description of our experience implementing an entirely new system while
learning
a new XQuery engine and application implementation paradigm, making up the details
as we went
along. It is a testament to the utility of XQuery as a language, RESTXQ as a way to
produce
web applications and web pages, and BaseX as a well-thought-out and well-supported
XQuery
engine that we were able to do this at all let alone in the short time that we had.
During this time we were also responsible for supporting a small React-based documentation
delivery system for ServiceNow's Lightstep Incident Response product. Thus we had
a Gatsby-
and React-based web system to compare our XQuery- and RESTXQ-based implementation
to. We had
also recently investigated Hugo as a potential alternative to Gatsby and React for
web page
templating.
The Starting Point: Git and OxygenXML
At the time we implemented Project Mirabel there was no content or link management
system for Product Content beyond the features provided by git for version management
and
OxygenXML Author [OXY] for authoring and validation. We did have the
ability to do batch validation of the entire content set using Oxygen's DITA map validation
tool. Excel spreadsheets that summarize and organize the validation reports were then
manually created by Product Content Engineering and supplied to Product Content managers
more or less daily.
ServiceNow maintains three versions ("families") of the Platform in service at any
given
time, with new families released every six months. Thus there are always four versions
of
the documentation under active development: the version to be released next plus the
three
released versions. Platform versions are named for cities or regions, i.e., "Quebec",
"Rome", "San Diego", and "Tokyo", with Tokyo being the family being prepared for release
at
the time of writing.
The validation reports posed several practical challenges:
The validation process itself takes approximately 30 minutes to perform for each
platform version, so a total of two hours to validate all four versions.
Validation issues are reported against individual XML documents but managers need
to
have issues organized by bundle. However, there is only an indirect relationship between
documents and bundles through the hyperlinks from top-level bundle maps to subordinate
maps and then to individual DITA topics. There is no existing information system that
manages knowledge of the relationship of topics and maps to the bundles they are part
of. That is, given a topic or non-bundle DITA map, there is no quick way to know what
bundles it is a part of, if any.
Issues also need to be organized by error type or error code in order to provide
counts of each error type, both across the entire content set and on a per-bundle
basis.
The validation reports themselves are quite large, typically 50MB or more, making them
a challenge to work with and store.
The requirements for these reports appeared suddenly and our response was entirely
ad-hoc. The development process was not planned in any significant way.
Initially, the manager report spreadsheets were created by saving the validation reports
as HTML from Oxygen and then importing the resulting HTML tables into Excel. This
produced a
usable but sub-optimal result and was an entirely manual process that required doing
some
data cleanup in Excel. These spreadsheets used a pivot table to report on issues by
issue
code.
Our initial attempt to automate this process used an XSLT transform applied to the
XML
validation reports to group the issues by issue code and then generate CSV files with
the
relevant issue data to then serve as the basis for Excel spreadsheets that use pivot
tables
to summarize the issues by issue code and count. This improved the ease of generating
the
spreadsheet and provided cleaner data (for example, the transform could collapse issue
codes
that represented the same base issue type into a single group).
However, this did not satisfy the "issues by bundle" requirement as we did not have
the
topic-to-bundle mapping needed to group issues by bundle.
For that we turned to XQuery and BaseX.
From Ad-Hoc To Service, Phase 1
In the run up to fixit week we were given the requirement to provide some sort of
"validation dashboard" that would provide up-to-date information about the current
validation status of a given family's documentation with a historical trend report,
if
possible.
Our current "dashboard" was simply the spreadsheets we were preparing via the
partially-automated process combining an XSLT and an ad-hoc XQuery process. It became
clear
that this ad-hoc process was not a good use of our time and not sustainable. It was
also not
an ideal way to communicate the information (the spreadsheet was being emailed to
the
managers and stored in a SharePoint folder).
I knew from prior experience with BaseX's support for RESTXQ that it would be simple
to
set up a web application to provide a dashboard view of the validation report we were
already generating.
We already had the core of the data processing needed for the dashboard in the form
of
the XSLT transform and the ad-hoc XQuery script.
All it would require would be to reimplement the XSLT in XQuery and work out the HTML
details for the dashboard itself as well as a server to serve it from. While BaseX
(and most
other XQuery engines) can apply XSLT transforms from XQuery, in this case the XSLT
transform
was simple enough that it made more sense to convert it to XQuery for consistency
and ease
of integration as all it was really doing was grouping and sorting.
I also realized that if we could set up this simple validation report web application,
it would create the foundation of a general DITA-and-link-aware server that could be
be
adapted to a wide variety of requirements while satisfying the immediate validation
reporting requirements. A big and evil plan started to take shape in my brain.
As the first step we reimplemented the issue CSV data set generation in XQuery and
added
it to the ad-hoc script. This moved all of the data preparation needed by the existing
Excel
spreadsheets into XQuery and reduced its generation to a single script that loaded
the data
for a family, loaded the separately-generated Oxygen validation report for that family,
and
then generated the CSV files for the spreadsheet. Preparing the spreadsheet was still a
manual process but now simply required reloading two CSV data sets and refreshing
the pivot
tables.
To fully automate the dashboard we would need the following features:
A way to get the current content from git.
A way to get the validation report for each family.
Integration of the ad-hoc link record keeping functions into a reliable
service.
A web application to publish the validation dashboard.
For the git requirement we created a set of bash scripts that set up and pull the
appropriate repositories and branches for each product family. As currently organized,
each
product family's content is managed as a separate branch in one of two git repositories.
In order to correlate validation data to the content it reflects, we implemented XQuery
functions that use the BaseX proc extension functions to call the git command to get
information such as the branch name and commit hash. These functions also
set the
foundation for a deeper integration with git.
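For illustration, a minimal sketch of such a function (the function name and the returned map structure are invented here; the repository path is passed in as a parameter) using the BaseX proc module might look like this:
(: Illustrative sketch, not the production module: use the BaseX proc module to
   read basic git metadata for a cloned repository. :)
declare function local:gitInfo($repoDir as xs:string) as map(*) {
  let $commitHash as xs:string :=
    normalize-space(proc:execute('git', ('-C', $repoDir, 'rev-parse', 'HEAD'))/output)
  let $branch as xs:string :=
    normalize-space(proc:execute('git', ('-C', $repoDir, 'rev-parse', '--abbrev-ref', 'HEAD'))/output)
  return map {
    'commit-hash' : $commitHash,
    'branch'      : $branch
  }
};

local:gitInfo('/data/git/tokyo')   (: example usage; the path is illustrative :)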
To get the validation reports we use OxygenXML's scripting feature to run Oxygen's
map
validation against specific DITA maps. We were already using this on another server
to do
regular validation report generation so it was a simple matter to move it to the Project
Mirabel server. To make capturing a historical record of validations easier we set
up a
separate git repository that stores the validation reports. This repository uses the
git
large file support (LFS) facility to make storing the 50MB+ validation reports more
efficient. While these are XML documents, which would normally not be a candidate
for LFS
storage (because it bypasses git's normal line-based differencing features) this repository
is being used only for archiving so the lack of differencing is not a concern.
For the link record keeping service we reworked the ad-hoc scripts into XQuery modules
for link record keeping and validation reporting. We also adapted existing DITA support
XQuery code from the open-source DITA for Small Teams project [DFST],
which happened to have quite a bit of useful DITA awareness more or less
ready to use.
For the web application we quickly slapped together a small set of RESTXQ modules
that
provided a simple landing page and then a page for the validation dashboard, initially
just
providing a pair of tables, one of documents by issue and another of issues by bundle.
The
actual report generation is just a matter of iterating over the XQuery maps and generating
HTML table rows. We integrated the open-source sortable.js package, which makes tables
sortable by column simply by adding specific class values to the table markup.
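As a sketch of what that looks like from the XQuery side (the function name and row structure are invented, and the class value "sortable" is assumed to be the one the library keys off of):
(: Minimal sketch: generate an HTML table that sortable.js will make sortable
   by column once the script is loaded on the page. :)
declare function local:issuesByBundleTable($rows as map(*)*) as element(table) {
  <table class="sortable">
    <thead>
      <tr><th>Bundle</th><th>Issue count</th></tr>
    </thead>
    <tbody>{
      for $row in $rows
      return <tr><td>{$row?bundle}</td><td>{$row?count}</td></tr>
    }</tbody>
  </table>
};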
RESTXQ as implemented in BaseX is simple to use and BaseX provides many examples to
crib
from. But fundamentally it's as simple as implementing an XQuery function that returns
the
HTML markup for a given page, using RESTXQ-defined annotations that bind the page-generating
function to the URLs and URL parameters the page handles. Once you deploy the pages
to
BaseX's configured webapp directory they just work. It makes creating web services
and web
applications about as simple as it could be. There is none of the build overhead of
typical
JavaScript or Java-based web applications. BaseX provides its own built-in HTTP server
based
on Jetty so there is no separate setup or configuration task needed just to have a
running
HTTP server.
We were able to implement this initial server, running off a developer's personal
development machine, in about two days of effort. It was crude but it worked.
In the meantime we started the internal process of provisioning an internal server
on
which to run our new validation dashboard service.
At this point we had spent maybe four or five person days to go from "let me hack
an
XQuery to make that data for you" to "Look at the dashboard we made".
Validation Dashboard Implementation, Phase 2: Dark Night of the Implementor
We had the ability to produce useful reports for a single validation report but now
we
needed to provide a time series report showing the change over time in the number
of issues
of different types as the content is updated over the course of fixit week.
This required a couple of new features:
Capturing validation reports with appropriate time stamps and connection to the git
details of the content that was validated.
Generating a graph or other visualization of the time data itself.
Ensuring that the web page would remain responsive when multiple users were
accessing it.
For the validation reports we implemented a bash process that could be run via cron
jobs
to pull from the source repositories, run the validation process against each product
family, and then load the resulting validation reports into the appropriate validation
report databases.
To support the git association and time data the XQuery that loads the validation
reports adds the git commit hash for the commit the validation was done against, the
time
stamp of the commit (when it was committed) and the time stamp of when the validation
was
performed. It took us a few iterations to realize that we also needed structured filenames
for the validation reports themselves so that the filenames would provide the data
needed
for the reports as loaded (validation time, commit hash, commit time, and product
version)
so that reports could be loaded after the fact rather than requiring that the process
that
triggered the validation also do the loading into BaseX. Because the validation reports
take
a long time to produce and because they might be regenerated in the future against
past
content, we needed to be able to load reports given only the report file itself.
Each report filename reflects the document validated ("now-maintenance-errors.ditamap"), the product
version ("tokyo"), the time stamp of the validated commit, the commit hash of the validated
commit ("c136e447"), and the time when the validation was performed ("1656513278", a
seconds-since-epoch time stamp).
With each report having a time stamp, we implemented a general "get documents by time
stamp" facility that then enables queries like "get the most recent validation report",
"get
the most recent n reports", etc., and order them by time stamp. This facility then
enables
constructing time series reports for any data that includes a @timestamp attribute.
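A minimal sketch of the idea (the function name is invented; the production facility is more general): order a database's documents by a @timestamp attribute on the root element and take the first n.
(: Illustrative sketch: return the most recent $count documents from a database,
   ordered by a @timestamp attribute on each document's root element. :)
declare function local:mostRecentDocs(
  $database as xs:string,
  $count    as xs:integer
) as document-node()* {
  let $ordered :=
    for $doc in db:open($database)[*/@timestamp]
    order by string($doc/*/@timestamp) descending
    return $doc
  return subsequence($ordered, 1, $count)
};

(: e.g. the most recent validation report: :)
(: local:mostRecentDocs('_tokyo_validation_reports', 1) :)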
As part of this implementation activity we also worked out a general approach for
creating an HTML dashboard that reflects the current state of the validation and indicates
trend direction for each class of issues (info, warning, and error). This required
us to
come up to speed on HTML and CSS techniques that were new to us. Fortunately the how-to
information is readily accessible on the web. We found the Mozilla Developer Network
(MDN)
information the most useful and reliable when it came to learning the details of things
like
using the CSS flex facility.
The ease of generating web pages using RESTXQ made it easy to experiment and iterate
quickly as we developed the web page details.
Within a day or so we were able to get a credible dashboard with tabs for different
reports and visualizations working despite our lack of web UX implementation or design
skills. The resulting dashboard is shown below.
The visualization was produced using the open-source chart.js package, which was easy
to
integrate and easy to use from XQuery. It simply requires generating a JSON structure
with
the data for each series and the chart configuration details. The biggest challenge
was the
syntactic tangle that is using XQuery to generate HTML that includes inline
Javascript:
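A minimal sketch of what that looks like (not the production code; the function name, element id, and data shape are invented). Note the doubled braces needed to emit literal JavaScript braces from inside an XQuery direct element constructor:
(: Sketch: emit a <canvas> plus the inline Chart.js setup for one data series. :)
declare function local:issueTrendChart(
  $labels as xs:string*,
  $counts as xs:integer*
) as element()* {
  (: Serialize the series data as JSON so it can be dropped into the script. :)
  let $data := serialize(
    map { 'labels' : array { $labels }, 'data' : array { $counts } },
    map { 'method' : 'json' })
  return (
    <canvas id="issue-trend"/>,
    <script>
      const seriesData = {$data};
      new Chart(document.getElementById('issue-trend'), {{
        type: 'line',
        data: {{
          labels: seriesData.labels,
          datasets: [{{ label: 'Issues', data: seriesData.data }}]
        }}
      }});
    </script>
  )
};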
There is probably a cleaner way to manage the train wreck of syntaxes but this was
sufficient for the moment.
Because of the time it took to get everything else in place, the implementation of
the
chart generation, which was ultimately the one thing we had to provide to our new
Vice
President of Product Content, occurred in the final hours of the weekend before the
Monday
when the dashboard had to be available. It was, to say the least, a frantic bit of
coding.
But we did make it work.
However, there were still a number of practical and performance issues with the
dashboard as implemented: it took many seconds to actually construct the report, which
meant
that if more than one or two people made a request, the system would be unresponsive
while
the main BaseX server performed the data processing needed to render the dashboard.
In
particular, the issues-by-bundle report was being constructed dynamically for each
request.
Fortunately, we really only had one user, our Vice President, for this initial
rollout.
Validation Dashboard Implementation, Phase 3: Concurrency and Job Control
While we had the system working we were making a number of basic mistakes in how we
managed keeping the databases up to date.
Because the content is being constantly updated and we wanted to add new validation
reports every eight hours, we had to solve the problems of concurrency and background
update
so that the web site itself remained responsive.
For this we developed the job orchestration facility described in detail below and
a
general system of bash scripts that then invoke XQuery scripts via the BaseX command-line
API to run jobs in the background using cron jobs or manual script execution. This
worked
well enough to make the site reasonably stable and reliable. Our main limitation was
lack of
project scope to test it thoroughly.
Once fixit week was over we were directed to stop work because the activity had never
actually been approved or prioritized and was not the most important thing for us
to be
working on. While the organization recognized the value of Mirabel and the fact that we had
had met
our goal of providing a useful validation dashboard we had to accept the reality that
it was
no longer the most important thing.
Of course, we couldn't leave it entirely alone.
Additional Features: Fun With XQuery
While we didn't have approval to work on Mirabel in a fully-planned way, we were still
able to spend a little time on it.
One challenge I faced as the primary implementor was simply keeping up with all the
code
I was producing. While the BaseX GUI is an adequate XQuery editor it is not by any
stretch a
full-featured XQuery IDE. It does not provide any features for navigating the code
or
otherwise exploring it.
However, BaseX does provide an XQuery introspection module that provides XQuery access
to the structure and comments of XQuery modules. I realized that with this I could
quickly
implement RESTXQ pages that provide details of the XQuery code itself. The initial
implementation of the XQuery explorer is shown below.
Because this view is using the code as deployed it is always up to date, so it reflects
the code as it is being developed. The clipboard icon puts a call to the function
on the
clipboard ready to paste into an XQuery editor.
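The core of such a page can be built directly on the inspection module; a rough sketch (the function name and report markup are invented, and the structure of inspect:module()'s output is abbreviated):
(: Illustrative sketch: use inspect:module() to list the functions declared in a
   module on disk, e.g. a module deployed under the BaseX repo or webapp directory. :)
declare function local:moduleSummary($modulePath as xs:string) as element(ul) {
  <ul>{
    for $function in inspect:module($modulePath)/function
    order by string($function/@name)
    return
      <li>{string($function/@name)}#{count($function/argument)}</li>
  }</ul>
};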
Another challenge was developing and testing our Schematron rules. The bulk of the
validation is Schematron rules that check a large number of editorial and markup usage
rules. These Schematron rules are complex and difficult to test across the full scope
of the
Platform content. The fixit week validation activity highlighted the need to optimize
our
Schematron rules to ensure that they were accurate and useful. To assist with that
we
started implementing a Schematron Explorer that enabled interactive development and
testing
of Schematron rules against the entire database. Unfortunately we ran out of scope
to finish
the Explorer, having run into some practical challenges with the HTML for the test
results.
We will return to this at some point.
Another area of exploration was a more general Git Explorer.
With Mirabel's git integration it is easy to get the git log and report on it. As
a way
to simply test and validate Mirabel's ability to access and use information from the
git log
we implemented a Git Explorer that provides a report of the git log and demonstrates
access
to all the git log information. We did not have scope to do more with the git information
but it would be straightforward to get the per-commit and per-file history information,
allowing a direct connection between files in the content repository and their git
history
and status. There is also a general requirement to provide time sequence analysis
and
reporting on the git data, such as commits per unit time, commits per bundle, etc.
Mentions
of git commit hashes are made into links to those commits on our internal GitHub Enterprise
site.
New Requirements: Analytics and Metrics Reporting
After sitting idle for several months while we focused on higher priorities, Mirabel
found new life as the platform for delivering additional Product Content metrics and
analytics.
In the late spring of 2022 Product Content put a focus on analytics and metrics
reporting and created a new job role responsible for gathering and reporting analytics
and
metrics of all kinds with the general goal of supporting data-driven decision making
for
Product Content executives and managers, including metrics for the content itself
as well as
from other sources, such as the ServiceNow documentation server web analytics, customer
survey results, and so on.
The Analytics person immediately saw the potential in Mirabel as a central gathering
point and access service for these analytics and established an ambitious set of
requirements for Mirabel to eventually address. However, we were still facing the
reality of
limited resources and scope to pursue these new Mirabel features.
Fortunately, we hired an intern, Abdul, who was already familiar with Product Content
from his previous internship with us and who had a data science background. Abdul's
internship project was to implement a "key metrics" dashboard, which he did. This
required
doing additional work on Mirabel's core features to improve performance and reliability,
which we were able to do.
The resulting key metrics dashboard is shown below.
In the context of the key metrics implementation we were able to add a number of
important performance enhancements, including using the BaseX attribute index to
dramatically speed up queries, especially construction of the where-used and doc-to-bundle
indexes. We also put some effort into improving the look and feel of the web site
itself, as
best we could given that we are not UX designers.
While ServiceNow has many talented UX designers on staff, they are all fully occupied
on
ServiceNow products so we have not yet been able to get their assistance in improving
Mirabel's site design.
Project Mirabel Components and Architecture
The Mirabel system is delivered to users as a web application served using the BaseX
XQuery database. The web server uses RESTXQ as implemented by BaseX to serve the pages.
The
data processing for the services provided is implemented primarily as XQuery modules
run by
BaseX servers with the persistent results stored in BaseX databases. The system is
deployed to
a standalone Linux server dedicated to the Mirabel system and available to all users
within
the ServiceNow internal network.
The source XML content is accessed from git repositories cloned to the Mirabel server
machine. ServiceNow's internal GitHub system does not allow API-based access to the
repositories so streaming content directly from the GitHub Enterprise server was not
an
option. Bash scripts are used to manage cloning and pulling the repositories as required,
either on a regular schedule or on demand.
Validation of the DITA content is performed by OxygenXML using Oxygen's batch processing
features. The input is DITA maps as managed in the git repositories. The output is
XML
validation reports, stored in a separate git repository on the server file system.
Bash
scripts are used to manage running the validation processes, either on a regular schedule
or
on demand. In addition to simply validating the latest version of the content in a
given
repository, bash scripts also allow validating older versions, for example, to recreate
the
validation reports authors were actually seeing at the time by accessing the corresponding
versions-in-time of the Schematron rule sets used to do the validation. This allows
Mirabel to
change the details of how validation reports are stored and managed, as well as enabling
the
retroactive capture of historical data in order to deliver trends and analysis
information
about how errors changed over time. It also allows for validation of older versions
of the
content with specific versions of the validation rules, for example, to eliminate
rules that,
in retrospect, were not useful or were largely ignored by authors.
From this source DITA content and corresponding validation reports maintained in git
repositories on the file system, the main XQuery-based Mirabel server constructs a
set of
databases containing the source XML data, the corresponding validation reports, and
constructed indexes that optimize the correlation of validation issues to the documents
they
apply to as well as the DITA maps that directly or indirectly refer to those documents.
In BaseX, the only core indexing features are for indexes over the XML markup and
the text
content (full-text indexes). There is no more-general feature for creating indexes
as distinct
from whatever one might choose to store in a database. Instead, the BaseX technique
is to
simply create separate databases that contain XML representations of an "index". These
indexes
typically use BaseX-assigned persistent node IDs to relate index entries to nodes
held in
other databases.
In BaseX, databases are lightweight, meaning that they are quick to create or remove.
The
BaseX XQuery extensions make it possible to access content from different databases
in the
same query. Because databases are lightweight, it tends to be easier and more effective
to
use separate databases for specific purposes rather than putting all the data into
a single
database and using collections or object URIs to organize different kinds of content.
A single database may be read by any number of BaseX servers and may be written to
by
different servers as long as attempts to do concurrent writing are prevented, either
by using
write locks or by ensuring that concurrent writing is not attempted by the code as
designed.
To enable link resolution in order to then correlate individual documents to their
containing DITA maps, Mirabel maintains two linking-related indexes:
Document where used: For each document, records the direct references to it (cross
references, content references, and references from DITA maps).
Document-to-bundle-map: For each document, the bundle DITA maps that directly or
indirectly refer to the document. "Bundle" maps are DITA maps that are used as the
unit of
publication to ServiceNow's public documentation HTML server. Thus these DITA maps
play a
unique and important role in the ServiceNow documentation work flow.
The where-used index is used by the document-to-bundle-map constructor. The
document-to-bundle-map index then enables quick lookup of the bundles a given document
participates in. This then enables the organizing of validation issues by bundle,
a key
requirement for validation reporting. The where-used index also enables organizing
validation
issues by individual DITA map. DITA maps below the bundle level usually organize the
work of
smaller coordinated teams and thus represent another important grouping
for
reporting validation issues. More generally, the doc-to-bundle map allows quick filtering
of
data about individual documents by bundle. For example, the Mirabel key metrics report
presents counts of many different things (topics, maps, images, links, tables, etc.)
across
the content repository. These counts are captured on a per-bundle basis as well as
for the
entire content set. With the doc-to-bundle index it is quick to get the set of documents
in a
given bundle in order to then count things in that set.
Mirabel uses database naming conventions to enable associating the database for a
set of
DITA content to its supporting databases. The databases for the DITA content are named
for the
product release name the content is for (i.e., "rome", "sandiego", "tokyo"). Corresponding
supporting databases are named "_dbname_link_records",
"_dbname_validation_reports", etc. The leading "_" convention indicates
databases that are generated and that therefore can be deleted and recreated as needed.
Databases that serve as backup copies of other databases are named
"_backup_databaseName". Databases that are temporary copies of other
databases are named "_temp_databaseName".
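A few minimal helpers capture the convention (the function names are illustrative, not the production API):
(: Helpers for the database naming convention described above. :)
declare function local:linkRecordsDbName($contentDb as xs:string) as xs:string {
  '_' || $contentDb || '_link_records'
};

declare function local:backupDbName($database as xs:string) as xs:string {
  '_backup_' || $database
};

declare function local:isGeneratedDb($database as xs:string) as xs:boolean {
  starts-with($database, '_')
};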
For a given product release it takes about 30 seconds to construct the where-used
and
doc-to-bundle indexes. By contrast it takes about two minutes to simply load the DITA
XML
content into a database, bringing the total time needed to load a given release to
about three
minutes. This is fast enough to remove the need to implement some sort of incremental
update
of the content database and supporting indexes and still keep the Mirabel server reasonably
up
to date. At the time of writing the production server pulls from the content repositories
every 15 minutes and reloads the content databases every two hours. The more-frequent
git
pulls are required simply because the volume of updates from writers is such that
pulling
frequently avoids having to do massive pulls of hundreds of commits.
For other reports, Mirabel uses a similar pre-generation and caching strategy, where
the
raw XML for a given report is generated at content load and then used at display time
to
generate HTML tables, CSV for download, etc. For example, Mirabel produces a report
that lists
the images that are in one version but not in the previous version, representing the
"images
new since X" report. This is an expensive report to generate so it is pre-generated
and
cached.
Mirabel also relies heavily on the BaseX attribute index, which optimizes attribute-based
lookup. For example, using normal XPath lookups to construct the doc-to-bundle index
takes
about 30 minutes but using the attribute index takes less than 30 seconds.
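As a sketch of the difference (database and attribute names are illustrative):
(: The first expression is a generic XPath lookup over the elements; the second
   goes straight to the attribute index and then steps up to the owning elements. :)
let $topicId := 'some-topic-id'   (: illustrative value :)
return (
  (: generic XPath lookup :)
  db:open('tokyo')//*[@id = $topicId],
  (: explicit attribute-index lookup :)
  db:attribute('tokyo', $topicId, 'id')/..
)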
Managing Concurrency
One significant practical challenge in implementing Mirabel was managing the parallel
processing required to serve web pages and content while also constructing the supporting
indexes.
The BaseX server implementation is relatively simple, which helps make it small and
fast, but means that it lacks features found in other XQuery servers, in particular,
built-in and transparent concurrency.
While evaluating an XQuery, a given BaseX server instance will utilize all resources
available to it, leaving no cycles for serving web pages. In an application where
queries
are quick, which is most BaseX applications, this is not a problem. But in Mirabel,
where
report construction can take tens of seconds or more and the content is being constantly
updated, it is a serious problem. As a more general requirement, any BaseX application
that
needs to constantly ingest new content and construct indexes over it must address
this
concurrency challenge.
Fortunately, there is a simple solution: run multiple BaseX servers.
Because BaseX servers are relatively small and lightweight, it is practical to run
multiple BaseX instances, bound to different ports, and allocate different tasks to
them.
Because multiple servers can read from a single database, there is no need to worry
about
copying databases between different servers: all the BaseX instances simply pull from
a
shared set of databases. The main practical challenge is ensuring that only one server
is
writing to a given database at a time.
Each BaseX server is a separate Java process and thus can fully utilize one core of
a
multi-core server. Thus, on a four-core server such as is used for the Mirabel production
server at the time of writing, you can have one BaseX instance that serves web pages
and two
that do background processing, leaving one core for other tasks, such as OxygenXML
validation processing, which is also a Java process and thus also requires a dedicated
server core.
For Mirabel, the base configuration is three BaseX servers: a primary server that
serves
web pages and handles requests, and two secondary servers that manage the long-running
data
processing needed to create and update the linking indexes and load new validation
reports,
as well as managing updates to the main content databases as new content is pulled
into the
content git repositories.
The web-serving server is bound to the default BaseX ports and the worker servers
are
assigned ports using a simple convention of incrementing the first port number, i.e.,
9894
for the second server and 10894 for the third.
Another scaling option would be to use containers to run separate BaseX servers, where
the containers share one or more file systems with shared databases or use remote
APIs to
copy the results of long-running operations to a production server. Unfortunately,
at
ServiceNow we do not currently have the option of using containers, at least not in
a way
that is supported by our IT organization.
Managing Updating Operations
Another practical challenge is orchestrating update operations.
This presents an orchestration challenge when the data processing requirement is to
create a new database, add content to it, and then use that database as the source
for
another query, as BaseX does not allow a newly-created database to be written to and
read
from in the same query.
BaseX implements the XQuery update recommendation [XQUPDATE] and
imposes the recommendation's restriction that updating expressions cannot return values.
BaseX treats each updating XQuery as a separate transaction. BaseX updates are handled
in an
internal update queue that is processed after any non-updating queries in the queue
are
handled. In addition, updates are handled in parallel and order of processing (and
order of
completion) is not deterministic, so you cannot assume that updates will be handled
in the
order they are submitted or that one operation will have completed before another
starts.
This means, for example, that you cannot have a single updating XQuery expression
that
does a BaseX db:create() followed by db:load() to add documents
and then tries to act on that content. The database creation and loading must be in
one
expression and the access or subsequent updating in another.
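As a sketch, with an illustrative database name and content path, the two steps have to be submitted as separate queries:
(: These two steps must be run as separate queries: one query cannot both
   create/populate a database and then read from it. :)

(: --- Query 1 (updating): create the database and load documents into it --- :)
db:create('_temp_tokyo', '/data/git/tokyo/content')

(: --- Query 2, run after query 1 completes: read from the new database ---
   count(db:open('_temp_tokyo'))
:)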
In addition, indexes on databases, such as the attribute index, are not available for
use
until the database has been optimized after having been updated. This optimization
process
cannot be performed in the same transaction that creates the database.
The BaseX solution for general orchestration is its "jobs" feature, which allows for
the
creation of jobs, where a job is a single XQuery. Jobs are queued and run as resources
become available. A query that creates a job is not, by default, blocked by the job,
but it
can choose to block until the job completes. Jobs are tracked and managed by the BaseX
job
manager. Jobs may be run immediately or scheduled for future execution.
Jobs are not necessarily run in the order queued, so performing a complex task is
not
as simple as queuing a set of jobs in the intended sequence of execution.
The general solution is to have one job submit the next job in a sequence to the job
queue.
The Mirabel system implements this approach through a general-purpose orchestration
XQuery module that provides infrastructure for defining jobs and running them, including
logging facilities to make it easier to debug orchestrated jobs.
The orchestration module defines a "job definition" map structure that allows callers
to
define a sequence of jobs and submit them for execution. Individual XQuery modules
can
define "make job" functions that handle setting the module's namespace and otherwise
constructing jobs specific to that module. XQuery modules can also provide functions
that
construct ready-made job sets to implement specific sequences of actions. This
general-purpose orchestration facility is published as a standalone GitHub project:
basex-orch.
The job-running function is a recursive function that takes as input a sequence of
job
definitions, queues the head job, blocks until it finishes, and then calls itself
with the
remainder of the job queue. This effectively serializes job execution while letting
the
BaseX server manage the resources for each job in the context of the larger job queue
for
the server.
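A simplified sketch of that recursion (this is not the published basex-orch code; it assumes each job is supplied as an XQuery string and uses the BaseX jobs module directly):
(: Run a sequence of queries one after another: queue the head job, wait for it
   to finish, then recurse on the rest of the sequence. :)
declare function local:runJobs($queries as xs:string*) as empty-sequence() {
  if (empty($queries))
  then ()
  else
    let $jobId := jobs:eval(head($queries))
    return (
      (: block until this job has completed before starting the next one :)
      jobs:wait($jobId),
      local:runJobs(tail($queries))
    )
};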
Job sequences can be initiated from within BaseX, for example, in the context of a
RESTXQ handler function, or from outside BaseX using the BaseX command line to run
an XQuery
that constructs a new job. Mirabel uses this technique with simple bash scripts that
then
run XQuery expressions that call job-creating functions. For example, the
cleanup-backup-databases.sh script runs a simple XQuery that starts a set of
jobs to remove backup databases:
#!/usr/bin/env bash
# ==========================================
# ServiceNow Product Content Dashboard
#
# Cleans up any lingering backup databases
# ==========================================
scriptDir=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
xqueryScriptDir="${scriptDir}/xquery"
# Include function libraries:
source "${scriptDir}/bash-functions/validation-functions.sh"
basexuser="${1:-admin}"
basexpw="${2:-admin}"
basexClientPort=${3:-2984}
xqueryToRun="${xqueryScriptDir}/cleanup-backup-databases.xqy"
echo "[INFO] Cleaning up backup databases"
"$(getBasexBinDir)/basexclient" -U "${basexuser}" -P "${basexpw}" -p${basexClientPort} "$xqueryToRun"
The bash script selects the appropriate secondary server to perform the task, in this
case the server on port 2984. The current Mirabel implementation hard codes the ports
used
for different tasks in the scripts. It would be possible in bash to create a more
dynamic
server selection mechanism by getting the list of available servers and choosing one
that
has little or no load. For current Mirabel use cases this level of sophistication
is not
required.
The cleanup-backup-databases.xqy XQuery run by the bash script is:
(:
Runs the cleanup backup databases job
:)
import module namespace orch="http://servicenow.com/xquery/module/orchestration";
import module namespace dbadmin="http://servicenow.com/xquery/module/database-admin";
let $job as xs:string := dbadmin:makeJob('dbadmin:cleanupBackupDatabases', ())
let $result as item()* := orch:runJobs($job)
return (``[Cleanup backup databases job queued. Check BaseX log for details.
]``)
With the orchestration module in conjunction with multiple secondary servers, all
processing can be performed reliably while ensuring the responsiveness of the main
web
server BaseX instance.
Automated Testing
The Mirabel XQuery components are implemented as a set of modules, with each module
focused on a specific concern: link record keeping, database management and access,
validation reporting, git access and interaction, etc.
Each module has associated unit tests, implemented using BaseX's unit testing
extensions. One challenging aspect of the unit tests is testing the results of updating
operations, as the unit tests are subject to the constraints on updating functions,
which
means you cannot have a single test method that both performs an update and evaluates
the
result of the update.
BaseX's solution is to allow the specification of "before" functions that are run
before
a corresponding test-evaluation function, where the "before" function is run as a
separate
transaction. Likewise, an "after" function can be used to clean up the test.
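A minimal sketch of the pattern (module, database, and function names are invented): the updating setup runs in its own transaction via %unit:before, the assertion runs in the test function, and %unit:after drops the test database:
(: Sketch of an update-then-assert unit test using BaseX's unit annotations. :)
module namespace test = "http://servicenow.com/xquery/module/link-records-test";

declare %unit:before %updating function test:setup() {
  (: updating setup: create a throwaway database with one record :)
  db:create('_test_link_records',
    <records><record target="doc1.dita"/></records>, 'records.xml')
};

declare %unit:test function test:recordsAreStored() {
  unit:assert-equals(count(db:open('_test_link_records')//record), 1)
};

declare %unit:after %updating function test:teardown() {
  (: clean up the throwaway database :)
  db:drop('_test_link_records')
};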
Deployment
The server code is deployed from the Mirabel source code repository to the running
BaseX
XQuery repository using an Ant script. The script prepares the working XQuery modules
and
standalone scripts from the source versions (for example, to add version comments to
the
files) in the directory structure required by BaseX, zips the result, and then calls
the
BaseX command line to import the Zip file. BaseX provides a module repository mechanism
by
which it resolves module namespace URIs to module locations, removing the need for
separate
module paths in module import statements.
The web application modules are deployed to the BaseX webapp directory by the Ant
script. For web applications, BaseX automatically loads XQuery modules within the
configured
webapp directory and uses RESTXQ function annotations to map incoming URLs to the
functions
that handle them.
This makes deploying new code quick and simple: there is no need to restart the BaseX
server.
Web Application
Mirabel implements the web application using BaseX's RESTXQ support.
In RESTXQ you bind XQuery functions to URLs using annotations on the functions:
declare
  %rest:GET
  %rest:path('/now/dashboards/{$database}/dbstatus')
  %output:method('html')
function now:databaseStatusReportForFamily(
  $database as xs:string
) as element(html) {
  let $haveDatabase as xs:boolean := db:exists($database)
  let $databaseLabel as xs:string := util:getDatabaseLabel($database)
  return
    <html>
      <head>
        <title>Family {$databaseLabel} Dashboards</title>
        {webutils:makeCommonHeadElements()}
      </head>
      <body>
        ...
      </body>
    </html>
};
The function returns the HTML for the web page.
The HTML is constructed as you would construct any other literal result element.
Parameters can be extracted from the URL itself, as shown in this example, or passed
as
URL parameters.
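For URL parameters, RESTXQ provides the %rest:query-param annotation; a sketch (the path, function, and parameter names are invented):
declare
  %rest:GET
  %rest:path('/now/dashboards/{$database}/issues')
  %rest:query-param('severity', '{$severity}', 'error')
  %output:method('html')
function now:issuesBySeverity(
  $database as xs:string,
  $severity as xs:string
) as element(html) {
  (: ... build the page for the requested database and severity ... :)
  <html>
    <body>
      <h1>{$database}: issues with severity "{$severity}"</h1>
    </body>
  </html>
};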
BaseX's RESTXQ implementation makes it about as easy as it can be to implement web
applications. As a language for constructing XML results, XQuery is ideally suited
for the
type of templated HTML generation required for dynamic web applications.
For the Mirabel project, this ease of web application implementation made it remarkably
quick and easy to get a site up and running and to refine it quickly. Adding a new
page or
set of pages is as simple as implementing a new XQuery module to serve the pages and
deploying the module to the running BaseX server.
This does result in a web application where all the work is done on the server. However,
there is nothing in the RESTXQ implementation that prevents using JavaScript in the
browser
to implement single-page applications or otherwise interact with the BaseX web services
from
in-browser JavaScript. For example, BaseX could be used to quickly implement microservices
in support of a larger web application that depends on in-browser processing.
Processing DITA Without DTDs
DITA processing depends on using the value of the DITA @class attribute, which captures
the specialization hierarchy of every DITA element. For example, the specialized topic
type
"concept" is a specialization of the base element type "topic". Any concept can be
processed
usefully using the processing associated with the base <topic> element type.
The @class value for a concept topic is "- topic/topic concept/concept ", which
means that you can apply generic topic processing to concept elements by matching
on
"topic/topic" in the @class value, ignoring the element type name. For example,
the way to find all topics in your DITA content with XQuery would be
"//*[contains-token(@class, 'topic/topic')]".
However, this only works if your content has the @class attributes available. This
turns
out to be a problem with BaseX for the volume of content we have.
In normal DITA practice the @class attributes are defined with default values in the
governing grammars, i.e., in DTDs. Thus to do @class-aware processing you normally
need to
parse the documents with respect to their DTDs or other grammar.
However, the DITA DTDs are large.
We found that using BaseX's out-of-the-box DTD-aware parsing it took nearly two hours
to
load the content of a single product version, roughly 40,000 documents. This reflects
the
fact that BaseX as of version 10.0.0 does not make use of Xerces' grammar
cache
feature, so it has to re-parse the DTDs for every document. We investigated adding
use of
the grammar cache to BaseX but it was beyond what we could do at the time.
Instead we parse the documents without using the DTDs and add the DITA class awareness
after the fact.
Parsing without the DTDs takes about two minutes to load the 40,000 documents for
a
platform version.
To enable @class-aware processing we implemented a module that simply creates a static
mapping from element types to @class values. This is possible for a known content
set
because the @class value for a given element type name in DITA is (or should be) invariant,
meaning that the DITA element type "foo" should have the same @class value everywhere,
no
matter how many different document types that element type is used in. For all elements
defined in the DITA standard and DITA Technical Committee-defined vocabularies this
is
always true. Within the scope of a single set of DITA content managed as a unit it
must also
be true. While different DITA users could define their own specializations that happen
to
have the same element type name but different @class values, it is unlikely that those
two
vocabularies would be used together. Within an enterprise like ServiceNow we have
complete
control over the markup details of our content and do not have a requirement to integrate
content from third parties in a way that would require more flexible solutions. This
is true
for the vast majority of DITA users.
The solution to finding elements based on DITA @class value is to define a utility
function that takes as input an element and a @class value token to match on and returns
true or false based on a static mapping:
(:~
: Determine if an element is of the specified DITA class
: @param context Element to check the class of
: @param classSpec The class value to check (i.e., 'topic/p')
: @return True if the element is of the specified class.
:)
declare function dutils:class($context as element(), $classSpec as xs:string) as xs:boolean {
let $normalizedClass := normalize-space($classSpec)
let $classTokens as xs:string* := dutils:getClassTokens($context)
return $normalizedClass = $classTokens
};
(:~
: Gets the DITA @class value for the specified element.
: @param context Element to get the class value for
: @return The class value as a sequence of tokens, one for each module/element pair
:)
declare function dutils:getClassTokens($context as element()) as xs:string* {
let $classValue := $dutils:elementToClassMap(name($context))
return tail(tokenize($classValue, '\s+')) (: First token is always the - or + at the start of the @class value :)
};
The $dutils:elementToClassMap XQuery map is generated from the DITA RELAX
NG grammars:
(:~
: Use the RELAX NG grammars to generate a mapping from element types to their
: declared @class values.
: @param database The database that contains the RNG grammars, i.e., "rng". The RNG database
: must have an attribute index.
: @return A map of element type names to @class values.
:)
declare function dutils:generateElementToClassMap($database) as map(*) {
let $debug := prof:dump('dutils:generateElementToClassMap(): Using database ' || $database)
return
map:merge(
for $element in db:open($database)//rng:element[exists(@name)]
let $attlistName as xs:string? := $element//rng:ref[contains(@name, '.attlist')]/@name ! string(.)
let $classAttDecl as element()* := db:attribute('rng', $attlistName, 'name')/../self::rng:define[.//rng:attribute[@name eq 'class']]
where exists($classAttDecl)
return map {string($element/@name) : $classAttDecl[1]//rng:attribute[@name eq 'class']/@a:defaultValue ! string(.)}
)
};
Another option would be to use this mapping to add the @class attributes to the XML,
either as part of the initial ingestion process or as an update applied after the
content is
initially loaded from the file system. Having the @class attributes present on the
DITA
elements would enable using the attribute index to do class-based lookups, which would
be a
significant performance improvement.
In the process of preparing this paper we tried this experiment:
import module namespace dutils="http://servicenow.com/xquery/module/now-dita-utils";
import module namespace util="http://servicenow.com/xquery/module/utilities";
let $database := 'tokyo'
let $profile := prof:track(dutils:getAllMapsAndTopics($database))
let $docs := $profile('value')
return
for $e in $docs//*[empty(@class)]
let $classAtt := attribute {'class'} { $dutils:elementToClassMap(name($e)) }
return insert node $classAtt into $e
It took several hours to complete (exact time not determined because we had to run
errands while the process was running but it was no less than two hours for the 7950589
elements in the content database we tested with).
There may be a more efficient way to do this update but a better solution is probably
a
SAX filter used at initial parse time, which will add essentially no additional overhead
to
the parsing process. This would be relatively easy to implement and configure but
given the
existing speed and few active users it is hard to justify at this time.
Capturing and Reporting Link-Related Information: Where Used Indexes
The validation reporting depends on two indexes: document-where-used and
document-to-bundle-map.
Constructing the document-where-used index is relatively simple: For each element
that
is a referenced, create an XQuery map entry where the key is the target document and
the
value is the referencing element itself, then merge the entries, combining duplicates,
to
create a map of documents to the references to those documents. Because the referencing
elements retain their document contexts, the map provides quick lookup of what documents
point to any given document. By organizing the value of each entry by reference type
(cross
reference, content reference, and map reference) it is easy to evaluate links based
on link
type, for example, to determine what DITA maps refer to a given document as distinct
from
topics that refer to the same document via cross reference.
The current ServiceNow Platform documentation does not use DITA's indirect addressing
feature (keys and key references), which simplifies the process of constructing a
where used
index to the simple
algorithm:
let $whereUsedMap as map(*) := map:merge(
  for $ref in collection()//*[@href|@conref]
  let $target as element()? := local:resolveRefToDoc($ref)
  return
    if (exists($target))
    then map{ base-uri($target) : root($ref) }
    else ()
  ,
  map { 'duplicates' : 'combine' }
)
Where the local:resolveRefToDoc() function attempts to resolve a DITA
reference to the DITA map or topic element that is or contains the reference (references
may
be to elements within maps or topics).
This algorithm results in an XQuery map where the keys are the base URI of a map or
topic document and the values are the references to that topic. In practice, the value
of
each entry is actually a map of reference types to referencing elements as DITA defines
distinct types of reference:
Topic references (links from DITA maps to maps, topics, or non-DITA
resources)
Cross references (<xref> and specializations of <xref> and <link> and
specializations of <link>)
Content references (transclusions specified by the @conref attribute on any element
type).
This classification of references then allows for easy lookup of references by type,
i.e., "find all maps that use this topic" is simply the value of the "topicrefs" entry
in
the map that is the value of the where-used map entry.
The working code constructs the final map as a two-phase process:
Construct a map where each entry has the target document as the key and the value
is
a sequence of maps, one from each reference to the target document.
Construct a new map that combines the individual value maps into single values for
each link
type:
let $mergedMap as map(*) := map:merge($entries, map{ 'duplicates' : 'combine' })
let $whereUsedMap := map:merge(
  for $key in map:keys($mergedMap)
  let $entry := $mergedMap($key)
  let $newEntry :=
    map{
      'doc' : dutils:distinct-nodes($entry?doc),
      'topicrefs' : $entry?topicrefs,
      'xrefs' : $entry?xrefs,
      'conrefs' : $entry?conrefs,
      'referencing-maps' : $entry?referencing-maps
    }
  return map{ $key : $newEntry }
)
Note that this takes advantage of the "?" map lookup syntax, which has the effect
of
getting all the values for the specified key from all maps in the left-hand map set,
resulting in a single sequence for the new map entry value.
This where-used table serves as the underpinning for any link management features
that
might be needed.
This processing results in a single XQuery map with one entry for each document in
the
content set. It currently takes about a minute to construct the where-used index for
a
single product version. This time largely reflects the lack of @class attributes in
the
content, which makes use of the BaseX attribute index impossible. This is still an
acceptable level of performance for the load of a product version.
To persist the index, the XQuery map is converted to XML and stored in the link record
keeping database for the corresponding content database (i.e., "_lrk_tokyo_link_records").
By denormalizing the data stored in the XML, lookups can be optimized. Nodes in the
XQuery
map are represented by their BaseX node IDs. The node IDs are captured as attributes
on the
index elements, enabling optimized lookup using the BaseX attribute index.
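The stored XML format itself is not shown here, but a minimal sketch of the idea, using hypothetical element and attribute names, would record each map entry as an element whose children carry the BaseX node IDs (as reported by db:node-id()) of the referencing elements:
(: Hypothetical serialization of the where-used map for persistence; the element and
   attribute names are illustrative, not the format Mirabel actually stores. :)
let $records :=
  for $targetUri in map:keys($whereUsedMap)
  let $entry as map(*) := $whereUsedMap($targetUri)
  return
    <where-used-record target="{$targetUri}">{
      for $refType in ('topicrefs', 'xrefs', 'conrefs', 'referencing-maps')
      return
        <refs reftype="{$refType}">{
          for $ref in $entry($refType)
          return <ref node-id="{db:node-id($ref)}" doc="{base-uri($ref)}"/>
        }</refs>
    }</where-used-record>
return <where-used-index>{$records}</where-used-index>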
Doc-to-Bundle Map Construction
The document-to-bundle-map index construction requires walking up the map-to-document
reference chain for each target document to find the bundle maps that ultimately refer
to
the document.
Conceptually the algorithm is a recursive graph walk using the where-used table to
find
all the DITA maps that refer to a document, then the maps that refer to those maps,
and so
on, collecting any referencing maps that are bundle maps, until the set of referencing
maps
is exhausted. Cycles are not possible in DITA, or rather, a system of DITA maps containing a
cycle would fail to process, so if one were ever created it should never survive into the
committed content set. DITA maps represent strict hierarchies of maps.
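A minimal recursive sketch of this graph walk, using the where-used index from the previous section and a hypothetical local:isBundleMap() check, might look like this (an illustration of the concept rather than Mirabel's working code):
(: Return the base URIs of the bundle maps that ultimately use the given document. :)
declare function local:bundlesForDoc(
  $docUri as xs:string,
  $whereUsedMap as map(*)
) as xs:string* {
  distinct-values(
    for $refMap in $whereUsedMap($docUri)?referencing-maps
    let $mapUri as xs:string := string(base-uri(root($refMap)))
    return
      if (local:isBundleMap($mapUri))
      then $mapUri
      else local:bundlesForDoc($mapUri, $whereUsedMap)
  )
};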
Initial implementation of this algorithm resulted in wildly different execution times
for different data sets. In addition, the initial implementation did not use the BaseX
attribute index, so performance differences for different test data sets were quite
obvious.
Development was supported by three test data sets:
A small but realistic documentation set consisting of one bundle map, about 20
submaps, and about 400 topics.
A portion of the ServiceNow Platform content reflecting about 9000 topics, a quarter
of the total number of topics, but the total set of maps (we did not bother trying
to
eliminate maps that did not refer to any topics in the test set as the number of maps
is
very small relative to the number of topics).
The full content set for a single Platform release, reflecting the full 40,000
topics and all the maps.
When the document-to-bundle-map construction was run on the small test set, it ran
as
fast as expected, taking only a couple of seconds to construct the index.
However, when run with the one-quarter data set the construction took many minutes.
Why?
The initial implementation was naive: it took each map in turn and determined the bundle
that uses it. For a given map this means finding all its uses, recursing on those maps,
finding their uses, recursing on those, and so on until reaching maps with no uses or maps
that are known to be bundle maps (bundle maps have distinctive filenames).
If the number of uses of any map is small this algorithm will perform well, as it
did
with the single-bundle content set, where almost no map has more than one use.
However, our real content has a single DITA map, now-keys-common.ditamap, that is
used
by almost every other map (about 740 uses of this one map). This map is also referenced
as
the first map in every map that uses it, so when processing this map to see who references
it we find 740 references, which we then process to see who references them, etc.,
and we do
this for every map (because every map references this one map).
The first reaction was to process maps in reference-count order, from least to most used:
by the time we get to the now-keys-common DITA map we already know which bundles all the
other maps are in, so we only need one lookup for each using map to determine the set of
bundles that ultimately use the now-keys-common map.
We then realized that we could short circuit other lookups based on already-gathered
information, further optimizing the algorithm.
With these changes the process went from "never finishes" to about two minutes to
complete.
At this point in the processing we know what bundles use each map. The next step is
to
determine which bundles use each topic.
This process involves getting, for each topic document, the maps that reference it and
then looking up the bundles for those maps. It touches many more documents (36,000 or more)
but each lookup is much more efficient.
The initial implementation took about 20 minutes to complete (about 0.035 seconds
per
topic). Adding the use of the BaseX attribute index reduced that time to about 20
seconds.
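The per-topic lookup just described could be sketched roughly as follows, assuming $whereUsedMap is the where-used index and $mapToBundles is a hypothetical map from map URIs to the bundle sets computed in the previous step (again, an illustration rather than the working code):
(: The bundles that use a topic are the union of the bundles of the maps that reference it. :)
declare function local:bundlesForTopic(
  $topicUri as xs:string,
  $whereUsedMap as map(*),
  $mapToBundles as map(*)
) as xs:string* {
  distinct-values(
    for $topicref in $whereUsedMap($topicUri)?topicrefs
    return $mapToBundles(string(base-uri(root($topicref))))
  )
};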
Content Preview: Rendering Formatted DITA With Minimum Effort
Because the repository provides access to all the authored content, it follows that
it
should provide a way to preview topics and maps so users can see the content in a
useful
way.
DITA markup is complex and transforming it into non-trivial HTML is a tall order—DITA
Open Toolkit has thousands of lines of XSLT code to do just this. But we didn't have
time
for that.
Fortunately, in the context of my earlier DITA for Small Teams project, also based
on
BaseX and XQuery, I had stumbled on a remarkably simple technique for generating a
useful
preview of DITA content: just generate <div> and <span>
where the HTML @class attribute values reflect the tag name and DITA @class attribute
values
and use CSS to do all rendering based on @class values. Attributes are represented as spans
that record the attribute name and value. This transform can then be implemented easily
enough using XQuery typeswitch expressions:
declare function preview:nodeToHTML($node as node()) as node()* {
typeswitch ($node)
case text() return $node
case processing-instruction() return preview:htmlFromPI($node)
case comment() return ()
case element() return preview:htmlFromElement($node)
default return() (: Ignore other node types :)
};
declare function preview:htmlFromElement($elem as element()) as node()* {
let $effectiveElem as element() := lmutil:resolveContentReference($elem)
return
typeswitch ($effectiveElem)
case element(image) return
let $worktree as xs:string := db:name($elem)
let $imagePath as xs:string := relpath:resolve-uri(string($elem/@href), "/" || db:path($elem)) ! substring(.,2)
let $imageFilesystemPath as xs:string? := git:getFileSystemPath($worktree, $imagePath)
let $imageUrl as xs:string := string-join(('/now/content', $worktree, 'image', $imagePath), '/')
return
<a
class="image-link"
href="{$imageUrl}"
title="{$imagePath}"
target="_imageFullSize"><img src="{$imageUrl}" alt="{$imagePath}"
/></a>
default return
<div class="{(name($elem), dutils:getClassTokens($elem), string($elem/@outputclass)) => string-join(' ')}">
{for $att in $effectiveElem/@* except ($effectiveElem/@class)
return preview:attributeToHTML($att)
}
{for $node in $effectiveElem/node()
return preview:nodeToHTML($node)
}
</div>
};
declare function preview:attributeToHTML($att as attribute()) as node()* {
let $attDisplay as node()* :=
if (name($att) = ('href'))
then
let $worktree as xs:string := db:name($att)
let $targetPath as xs:string := relpath:resolve-uri(string($att), "/" || db:path($att)) ! substring(.,2)
let $targetUrl as xs:string := string-join(('/now/content', $worktree, $targetPath), '/')
return <a class="href-link" href="{$targetUrl}">{string($att)}</a>
else text{string($att)}
return
<span class="attribute" data-attname="{name($att)}"
><span class="attvalue">{$attDisplay}</span></span>
};
declare function preview:htmlFromPI($pi as processing-instruction()) as node()* {
let $name as xs:string := name($pi)
return
if ($name eq 'oxy_comment_start')
then preview:oxyCommentToHTML($pi)
else if ($name eq 'oxy_comment_end')
then <span class="comment-end">{string($pi)}</span>
else () (: Ignore :)
};
declare function preview:oxyCommentToHTML($pi as processing-instruction()) as node()* {
let $data as xs:string := $pi ! string(.)
let $elem as element() := parse-xml("<comment " || $data || "/>")/*
let $result as node()* :=
<span class="oxy_comment">{
if (exists($elem/@flag))
then attribute {'data-flag'} {string($elem/@flag)}
else ()
}{
for $att in $elem/@*
return <span class="comment-{name($att)}">{string($att)}</span>
}</span>
return $result
};
This process only needs to special-case elements that must become navigable links and
image references. Everything else can be handled in CSS.
Note that images are served directly from the file system; they are not stored in the
BaseX content database, although they could be. Because BaseX always has the git clones
available, the images can be served straight from the file system, avoiding the expense of
copying them into the database.
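A rough sketch of how such an image response could be constructed with the BaseX File and Web modules, reusing the git:getFileSystemPath() helper seen above (a RESTXQ function bound to the /now/content/.../image/... URL pattern would call something like this; the function name is hypothetical and error handling is omitted):
declare function preview:imageResponse(
  $worktree as xs:string,
  $imagePath as xs:string
) as item()* {
  (: Locate the image in the git clone on the file system :)
  let $fsPath as xs:string? := git:getFileSystemPath($worktree, $imagePath)
  return
    if (exists($fsPath) and file:exists($fsPath))
    then (
      (: Set the media type from the file extension, then stream the bytes :)
      web:response-header(map { 'media-type': web:content-type($fsPath) }),
      file:read-binary($fsPath)
    )
    else ()
};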
In addition to the formatted view of each document we also provide a view of the raw
XML. For that we just serialize the XML into an HTML <pre> element:
declare function preview:elementToHTML($element as element()) as node()* {
if (df:isTopic($element)) then preview:topicToHTML($element)
else if (df:isMap($element)) then preview:mapToHTML($element)
else preview:serializeXmlToHTML($element)
};
declare function preview:serializeXmlToHTML($element as element()) as node()* {
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>{document-uri(root($element))}</title>
</head>
<body>
<pre>
{serialize($element)}
</pre>
</body>
</html>
};
(The BaseX administration web app includes a better serialized-XML renderer, but we didn't
have time to integrate that into Mirabel, so we used this very quick solution.)
Conclusions and Future Work
The use of XQuery, RESTXQ, and BaseX enabled remarkably quick and easy implementation of a
highly functional system for holding and reporting on the large volume of content that makes
up ServiceNow's Platform documentation.
The ease of deployment and the features of the XQuery language itself coupled with
BaseX's
lightweight implementation, clear documentation, and ready community support made
it about as
easy as it could be to apply iterative development approaches. The ability to quickly
acquire
and integrate open-source components for things like data visualization and table
sorting made
things even easier. It's not news that the Internet provides all of this, but for someone who
can still remember when you had to go to a library or book store to learn about a new
technology, or buy a product and wait weeks to receive it in order to do something new, it's
still a marvel of quickness and ease.
The need to implement concurrency and job control added effort relative to using a product
like MarkLogic, but the implementation effort was not that great; once implemented, the
facility is available for future projects and is more than offset by the other advantages of
BaseX.
Future work includes:
Support for authenticated users in order to provide personalized results, options,
and
role-based features.
Support for DITA's indirect addressing feature (keys). Work has started on this in
the
context of an independent open-source project but still needs to be integrated with
the
Mirabel system.
A general "content explorer" that allows authors to explore and navigate the DITA
content with full access to all where-used information. The content explorer will
include
features for doing canned and ad-hoc XML queries over the content (for example, "how
many
steps within the ITSM bundle include a reference to the menu item 'new'?").
Finish out the git explorer to provide complete access to the git history at any desired
level of granularity (individual document, DITA map, bundle, user or set of users, etc.).
One challenge here is not allowing Mirabel to become too Orwellian in its use of git data.
Through git's "blame" feature Mirabel can know which author is responsible for any line in
any version of any file. This knowledge could be misused, or at least be presented in a way
that seems creepy and unwanted.
Finish out the Schematron explorer and testing facility.
Implement complex validation processes currently performed by ad-hoc scripts against
files on the file system, such as link validation processes. Mirabel should be able
to
perform such processing with greater speed and completeness than the current Python-based
implementations.
Consume and present analytics and metrics from other information systems, such as
web
site analytics for docs.servicenow.com and Jenkins build history metrics. In general,
if
the data can be captured, capture it in Mirabel with time stamps to enable time series
reporting.
Integrate with other ServiceNow information systems, especially the ServiceNow
instances ServiceNow uses to manage the work of Product Content.
Add @class attributes to the DITA source at parse time, or use Xerces' grammar cache for
DTD-aware parsing, in order to fully optimize DITA-aware element access based on @class
values.
Integrate more tightly with authoring tools as appropriate. For example, provide
Oxygen add-ons that can submit queries to Mirabel and show the results in the
editor.
Package Mirabel for use on individual users' machines so that they have immediate access
to uncommitted and unpushed changes in their working environment. This would require
implementing incremental update of the content database and supporting indexes. Because
BaseX is a simple-to-install package it would be easy enough to create a standalone Mirabel
package that could be installed and run on users' personal machines. This could even be done
through an OxygenXML add-on (Product Content Engineering already maintains a custom Oxygen
add-on that could be extended to manage a local Mirabel instance).
While the stated goal of Project Mirabel is to be a read-only source of information
about
the DITA source and related materials Product Content creates and works with, it's
not hard to
see how Mirabel starts to look a lot like a DITA-aware component content management
system.
For example, it would not be difficult to integrate Mirabel with Oxygen Web Author
to provide
a complete authoring environment on top of the existing git-based management
infrastructure.