Kimber, Eliot. “Project Mirabel: XQuery-Based Documentation Reporting and Exploration System.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Kimber01.
Balisage: The Markup Conference 2022 August 1 - 5, 2022
Balisage Paper: Project Mirabel: XQuery-Based Documentation Reporting and Exploration System
Eliot Kimber
Senior Product Content Engineer
ServiceNow
Eliot Kimber is a founding member of the W3C XML Working Group, the OASIS Open DITA
Technical Committee, a co-author of ISO/IEC 10744:1996, HyTime, and a long-time SGML
and
XML practitioner in a variety of domains. When not wrangling angle brackets Eliot
likes to
bake. Eliot holds a black belt in the Japanese martial art Aikido.
Describes the organic development of a multi-function XQuery-based system, Project
Mirabel, for capturing and reporting on the results of applying Schematron validation
to
large sets of DITA documents, growing from a simple reporting utility to a multi-function
platform for general reporting and exploration over large and complex sets of DITA
content,
representing non-trivial hyperdocuments, all within the span of three weeks.
Each version of ServiceNow's primary product, ServiceNow Platform, is documented by
a
collection of about 40,000 DITA topics, organized into about 40 different "bundles",
where a
bundle is a top-level unit of publishing, roughly corresponding to a major component
of the
Platform.
Bundles are DITA [DITA] maps, which use hyperlinks to organize topics into
sequences and hierarchies for publication. These topics are authored by about 200
writers in the
Product Content organization, spread around the globe. ServiceNow maintains four active
product
versions at any given time, where each version has its own set of topics and maps.
The content
is managed in git repositories.
This content has been under active development in its current DITA form for about
eight
years, during which time ServiceNow has experienced explosive growth. During this
time the
content accumulated hundreds of thousands of instances of various errors, as identified
by
Schematron rules developed over those eight years, reflecting editorial and terminology
concerns, markup usage rules, and content details, such as leading and trailing spaces
in
paragraphs.
In February of 2022 Product Content decided to hold a "fixit week" where all authoring
activity would be paused and writers would instead spend their time fixing errors
in the content
in an attempt to pay down this accumulated technical debt.
To support this task the Product Content Engineering team, which develops and maintains
the
tools and infrastructure for Product Content, needed to supply managers with data
about the
validation errors so that they could plan and direct the fix up work most effectively.
In
addition, we needed to capture validation results over time to then document and report
our
fixup achievements over the course of fixit week and beyond. This validation data
capture and
reporting ability did not exist before the start of fixit week.
This paper is a description of the technical details of the resulting Mirabel system
and the
story of how we developed the validation data capture and reporting tools we needed
in a very
short time and, by taking advantage of the general facilities of XQuery as a language
for
working with XML data and the BaseX XQuery engine [BASEX] as a powerful but
easy-to-implement and easy-to-deploy infrastructure component, established the basis
for a much
more expansive set of features for exploring and reporting on ServiceNow's large and
dynamic
corpus of product and internal documentation.
We named this system "Project Mirabel", reflecting a practice of naming internal projects
after characters from animation (we have Baymax and Voltron as notable examples).
Mirabel is the character Mirabel Madrigal from the 2021 Disney movie
Encanto. Mirabel's power is to see what is there, and that is exactly the
purpose of Project Mirabel: to give visibility and insight to all aspects of the XML
and
supporting data that documents ServiceNow's products, including validation reporting,
content
searching and analysis, version management and history, linking information, etc.
The work described in this paper was performed by the author, Eliot Kimber, with
contributions from Scott Hudson and Abdul Chaudry, all of ServiceNow.
How We Built Project Mirabel
The Mirabel system grew organically from some simple requirements. The initial
implementation occurred very rapidly, over the course of three weeks of intense coding.
It was
completely unplanned and unauthorized except to the degree that we had a "provide
a validation
dashboard" mandate. The use of XQuery and BaseX allowed us to develop this system
quickly
using an iterative approach combined with good engineering practices: modularity,
test-driven
development (as much as we could), separation of concerns, and so on.
At the start we knew what we wanted to do but we didn't really know how to do it at
the
detail level. It had been nearly ten years since we last worked heavily with BaseX
and XQuery,
XQuery 3 was new to us, and we had never tried to implement a robust, multi-user web
application using BaseX. We had experience with MarkLogic, which is a very different
environment for implementing large-scale applications, so that experience did not
translate
that well to BaseX.
This is a description of our experience implementing an entirely new system while
learning
a new XQuery engine and application implementation paradigm, making up the details
as we went
along. It is a testament to the utility of XQuery as a language, RESTXQ as a way to
produce
web applications and web pages, and BaseX as a well-thought-out and well-supported
XQuery
engine that we were able to do this at all let alone in the short time that we had.
During this time we were also responsible for supporting a small React-based documentation
delivery system for ServiceNow's Lightstep Incident Response product. Thus we had
a Gatsby-
and React-based web system to compare our XQuery- and RESTXQ-based implementation
to. We had
also recently investigated Hugo as a potential alternative to Gatsby and React for
web page
templating.
The Starting Point: Git and OxygenXML
At the time we implemented Project Mirabel there was no content or link management
system for Product Content beyond the features provided by git for version management
and
OxygenXML Author [OXY] for authoring and validation. We did have the
ability to do batch validation of the entire content set using Oxygen's DITA map validation
tool. Excel spreadsheets that summarize and organize the validation reports were then
manually created by Product Content Engineering and supplied to Product Content managers
more or less daily.
ServiceNow maintains three versions ("families") of the Platform in service at any
given
time, with new families released every six months. Thus there are always four versions
of
the documentation under active development: the version to be released next plus the
three
released versions. Platform versions are named for cities or regions, i.e., "Quebec",
"Rome", "San Diego", and "Tokyo", with Tokyo being the family being prepared for release
at
the time of writing.
The validation reports posed several practical challenges:
The validation process itself takes approximately 30 minutes to perform for each
platform version, so a total of two hours to validate all four versions.
Validation issues are reported against individual XML documents but managers need
to
have issues organized by bundle. However, there is only an indirect relationship between
documents and bundles through the hyperlinks from top-level bundle maps to subordinate
maps and then to individual DITA topics. There is no existing information system that
manages knowledge of the relationship of topics and maps to the bundles they are part
of. That is, given a topic or non-bundle DITA map, there is no quick way to know what
bundles it is a part of, if any.
Issues also need to be organized by error type or error code in order to provide
counts of each error type, both across the entire content set and on a per-bundle
basis.
The validation reports themselves are quite large, typically 50MB or more, making them
a challenge to work with and store.
The requirements for these reports appeared suddenly and our response was entirely
ad-hoc. The development process was not planned in any significant way.
Initially, the manager report spreadsheets were created by saving the validation reports
as HTML from Oxygen and then importing the resulting HTML tables into Excel. This
produced a
usable but sub-optimal result and was an entirely manual process that required doing
some
data cleanup in Excel. These spreadsheets used a pivot table to report on issues by
issue
code.
Our initial attempt to automate this process used an XSLT transform applied to the
XML
validation reports to group the issues by issue code and then generate CSV files with
the
relevant issue data to then serve as the basis for Excel spreadsheets that use pivot
tables
to summarize the issues by issue code and count. This improved the ease of generating
the
spreadsheet and provided cleaner data (for example, the transform could collapse issue
codes
that represented the same base issue type into a single group).
However, this did not satisfy the "issues by bundle" requirement as we did not have
the
topic-to-bundle mapping needed to group issues by bundle.
For that we turned to XQuery and BaseX.
From Ad-Hoc To Service, Phase 1
In the run up to fixit week we were given the requirement to provide some sort of
"validation dashboard" that would provide up-to-date information about the current
validation status of a given family's documentation with a historical trend report,
if
possible.
Our current "dashboard" was simply the spreadsheets we were preparing via the
partially-automated process combining an XSLT and an ad-hoc XQuery process. It became
clear
that this ad-hoc process was not a good use of our time and not sustainable. It was
also not
an ideal way to communicate the information (the spreadsheet was being emailed to
the
managers and stored in a SharePoint folder).
I knew from prior experience with BaseX's support for RESTXQ that it would be simple
to
set up a web application to provide a dashboard view of the validation report we were
already generating.
We already had the core of the data processing needed for the dashboard in the form
of
the XSLT transform and the ad-hoc XQuery script.
All it would require would be to reimplement the XSLT in XQuery and work out the HTML
details for the dashboard itself as well as a server to serve it from. While BaseX
(and most
other XQuery engines) can apply XSLT transforms from XQuery, in this case the XSLT
transform
was simple enough that it made more sense to convert it to XQuery for consistency
and ease
of integration as all it was really doing was grouping and sorting.
I also realized that if we could set up this simple validation report web application,
it would create the foundation of a general DITA-and-link-aware server that could be
be
adapted to a wide variety of requirements while satisfying the immediate validation
reporting requirements. A big and evil plan started to take shape in my brain.
As the first step we reimplemented the issue CSV data set generation in XQuery and
added
it to the ad-hoc script. This moved all of the data preparation needed by the existing
Excel
spreadsheets into XQuery and reduced its generation to a single script that loaded
the data
for a family, loaded the separately-generated Oxygen validation report for that family,
and
then generated the CSV files for the spreadsheet. Preparing the spreadsheet was still a
manual process but now simply required reloading two CSV data sets and refreshing
the pivot
tables.
To fully automate the dashboard we would need the following features:
A way to get the current content from git.
A way to get the validation report for each family.
Integration of the ad-hoc link record keeping functions into a reliable
service.
A web application to publish the validation dashboard.
For the git requirement we created a set of bash scripts that set up and pull the
appropriate repositories and branches for each product family. As currently organized,
each
product family's content is managed as a separate branch in one of two git repositories.
In order to correlate validation data to the content it reflects, we implemented XQuery
functions that use the BaseX proc extension functions to call the git command to get
information such as the branch name and commit hash. These functions also
set the
foundation for a deeper integration with git.
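For illustration, a minimal sketch of such a function (the function name and the returned map structure are invented here; the repository path is passed in as a parameter) using the BaseX proc module might look like this:
(: Illustrative sketch, not the production module: use the BaseX proc module to
   read basic git metadata for a cloned repository. :)
declare function local:gitInfo($repoDir as xs:string) as map(*) {
  let $commitHash as xs:string :=
    normalize-space(proc:execute('git', ('-C', $repoDir, 'rev-parse', 'HEAD'))/output)
  let $branch as xs:string :=
    normalize-space(proc:execute('git', ('-C', $repoDir, 'rev-parse', '--abbrev-ref', 'HEAD'))/output)
  return map {
    'commit-hash' : $commitHash,
    'branch'      : $branch
  }
};

local:gitInfo('/data/git/tokyo')   (: example usage; the path is illustrative :)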
To get the validation reports we use OxygenXML's scripting feature to run Oxygen's
map
validation against specific DITA maps. We were already using this on another server
to do
regular validation report generation so it was a simple matter to move it to the Project
Mirabel server. To make capturing a historical record of validations easier we set
up a
separate git repository that stores the validation reports. This repository uses the
git
large file support (LFS) facility to make storing the 50MB+ validation reports more
efficient. While these are XML documents, which would normally not be a candidate
for LFS
storage (because it bypasses git's normal line-based differencing features) this repository
is being used only for archiving so the lack of differencing is not a concern.
For the link record keeping service we reworked the ad-hoc scripts into XQuery modules
for link record keeping and validation reporting. We also adapted existing DITA support
XQuery code from the open-source DITA for Small Teams project [DFST],
which happened to have quite a bit of useful DITA awareness more or less
ready to use.
For the web application we quickly slapped together a small set of RESTXQ modules
that
provided a simple landing page and then a page for the validation dashboard, initially
just
providing a pair of tables, one of documents by issue and another of issues by bundle.
The
actual report generation is just a matter of iterating over the XQuery maps and generating
HTML table rows. We integrated the open-source sortable.js package, which makes tables
sortable by column simply by adding specific class values to the table markup.
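As a sketch of what that looks like from the XQuery side (the function name and row structure are invented, and the class value "sortable" is assumed to be the one the library keys off of):
(: Minimal sketch: generate an HTML table that sortable.js will make sortable
   by column once the script is loaded on the page. :)
declare function local:issuesByBundleTable($rows as map(*)*) as element(table) {
  <table class="sortable">
    <thead>
      <tr><th>Bundle</th><th>Issue count</th></tr>
    </thead>
    <tbody>{
      for $row in $rows
      return <tr><td>{$row?bundle}</td><td>{$row?count}</td></tr>
    }</tbody>
  </table>
};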
RESTXQ as implemented in BaseX is simple to use and BaseX provides many examples to
crib
from. But fundamentally it's as simple as implementing an XQuery function that returns
the
HTML markup for a given page, using RESTXQ-defined annotations that bind the page-generating
function to the URLs and URL parameters the page handles. Once you deploy the pages
to
BaseX's configured webapp directory they just work. It makes creating web services
and web
applications about as simple as it could be. There is none of the build overhead of
typical
JavaScript or Java-based web applications. BaseX provides its own built-in HTTP server
based
on Jetty so there is no separate setup or configuration task needed just to have a
running
HTTP server.
We were able to implement this initial server, running off a developer's personal
development machine, in about two days of effort. It was crude but it worked.
In the meantime we started the internal process of provisioning an internal server
on
which to run our new validation dashboard service.
At this point we had spent maybe four or five person days to go from "let me hack
an
XQuery to make that data for you" to "Look at the dashboard we made".
Validation Dashboard Implementation, Phase 2: Dark Night of the Implementor
We had the ability to produce useful reports for a single validation report but now
we
needed to provide a time series report showing the change over time in the number
of issues
of different types as the content is updated over the course of fixit week.
This required a couple of new features:
Capturing validation reports with appropriate time stamps and connection to the git
details of the content that was validated.
Generating a graph or other visualization of the time data itself.
Ensuring that the web page would remain responsive when multiple users were
accessing it.
For the validation reports we implemented a bash process that could be run via cron
jobs
to pull from the source repositories, run the validation process against each product
family, and then load the resulting validation reports into the appropriate validation
report databases.
To support the git association and time data the XQuery that loads the validation
reports adds the git commit hash for the commit the validation was done against, the
time
stamp of the commit (when it was committed) and the time stamp of when the validation
was
performed. It took us a few iterations to realize that we also needed structured filenames
for the validation reports themselves so that the filenames would provide the data
needed
for the reports as loaded (validation time, commit hash, commit time, and product
version)
so that reports could be loaded after the fact rather than requiring that the process
that
triggered the validation also do the loading into BaseX. Because the validation reports
take
a long time to produce and because they might be regenerated in the future against
past
content, we needed to be able to load reports given only the report file itself.
Each report filename reflects the document validated ("now-maintenance-errors.ditamap"), the product
version ("tokyo"), the time stamp of the validated commit, the commit hash of the validated
commit ("c136e447"), and the time when the validation was performed ("1656513278", a
seconds-since-epoch time stamp).
With each report having a time stamp, we implemented a general "get documents by time
stamp" facility that then enables queries like "get the most recent validation report",
"get
the most recent n reports", etc., and order them by time stamp. This facility then
enables
constructing time series reports for any data that includes a @timestamp attribute.
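A minimal sketch of the idea (the function name is invented; the production facility is more general): order a database's documents by a @timestamp attribute on the root element and take the first n.
(: Illustrative sketch: return the most recent $count documents from a database,
   ordered by a @timestamp attribute on each document's root element. :)
declare function local:mostRecentDocs(
  $database as xs:string,
  $count    as xs:integer
) as document-node()* {
  let $ordered :=
    for $doc in db:open($database)[*/@timestamp]
    order by string($doc/*/@timestamp) descending
    return $doc
  return subsequence($ordered, 1, $count)
};

(: e.g. the most recent validation report: :)
(: local:mostRecentDocs('_tokyo_validation_reports', 1) :)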
As part of this implementation activity we also worked out a general approach for
creating an HTML dashboard that reflects the current state of the validation and indicates
trend direction for each class of issues (info, warning, and error). This required
us to
come up to speed on HTML and CSS techniques that were new to us. Fortunately the how-to
information is readily accessible on the web. We found the Mozilla Developer Network
(MDN)
information the most useful and reliable when it came to learning the details of things
like
using the CSS flex facility.
The ease of generating web pages using RESTXQ made it easy to experiment and iterate
quickly as we developed the web page details.
Within a day or so we were able to get a credible dashboard with tabs for different
reports and visualizations working despite our lack of web UX implementation or design
skills. The resulting dashboard is shown below.
The visualization was produced using the open-source chart.js package, which was easy
to
integrate and easy to use from XQuery. It simply requires generating a JSON structure
with
the data for each series and the chart configuration details. The biggest challenge
was the
syntactic tangle that is using XQuery to generate HTML that includes inline
Javascript:
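A minimal sketch of what that looks like (not the production code; the function name, element id, and data shape are invented). Note the doubled braces needed to emit literal JavaScript braces from inside an XQuery direct element constructor:
(: Sketch: emit a <canvas> plus the inline Chart.js setup for one data series. :)
declare function local:issueTrendChart(
  $labels as xs:string*,
  $counts as xs:integer*
) as element()* {
  (: Serialize the series data as JSON so it can be dropped into the script. :)
  let $data := serialize(
    map { 'labels' : array { $labels }, 'data' : array { $counts } },
    map { 'method' : 'json' })
  return (
    <canvas id="issue-trend"/>,
    <script>
      const seriesData = {$data};
      new Chart(document.getElementById('issue-trend'), {{
        type: 'line',
        data: {{
          labels: seriesData.labels,
          datasets: [{{ label: 'Issues', data: seriesData.data }}]
        }}
      }});
    </script>
  )
};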
There is probably a cleaner way to manage the train wreck of syntaxes but this was
sufficient for the moment.
Because of the time it took to get everything else in place, the implementation of
the
chart generation, which was ultimately the one thing we had to provide to our new
Vice
President of Product Content, occurred in the final hours of the weekend before the
Monday
when the dashboard had to be available. It was, to say the least, a frantic bit of
coding.
But we did make it work.
However, there were still a number of practical and performance issues with the
dashboard as implemented: it took many seconds to actually construct the report, which
meant
that if more than one or two people made a request, the system would be unresponsive
while
the main BaseX server performed the data processing needed to render the dashboard.
In
particular, the issues-by-bundle report was being constructed dynamically for each
request.
Fortunately, we really only had one user, our Vice President, for this initial
rollout.
Validation Dashboard Implementation, Phase 3: Concurrency and Job Control
While we had the system working we were making a number of basic mistakes in how we
managed keeping the databases up to date.
Because the content is being constantly updated and we wanted to add new validation
reports every eight hours, we had to solve the problems of concurrency and background
update
so that the web site itself remained responsive.
For this we developed the job orchestration facility described in detail below and
a
general system of bash scripts that then invoke XQuery scripts via the BaseX command-line
API to run jobs in the background using cron jobs or manual script execution. This
worked
well enough to make the site reasonably stable and reliable. Our main limitation was
lack of
project scope to test it thoroughly.
Once fixit week was over we were directed to stop work because the activity had never
actually been approved or prioritized and was not the most important thing for us
to be
working on. While the organization recognized the value of Mirabel and the fact that we had
had met
our goal of providing a useful validation dashboard we had to accept the reality that
it was
no longer the most important thing.
Of course, we couldn't leave it entirely alone.
Additional Features: Fun With XQuery
While we didn't have approval to work on Mirabel in a fully-planned way, we were still
able to spend a little time on it.
One challenge I faced as the primary implementor was simply keeping up with all the
code
I was producing. While the BaseX GUI is an adequate XQuery editor it is not by any
stretch a
full-featured XQuery IDE. It does not provide any features for navigating the code
or
otherwise exploring it.
However, BaseX does provide an XQuery introspection module that provides XQuery access
to the structure and comments of XQuery modules. I realized that with this I could
quickly
implement RESTXQ pages that provide details of the XQuery code itself. The initial
implementation of the XQuery explorer is shown below.
Because this view is using the code as deployed it is always up to date, so it reflects
the code as it is being developed. The clipboard icon puts a call to the function
on the
clipboard ready to paste into an XQuery editor.
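The core of such a page can be built directly on the inspection module; a rough sketch (the function name and report markup are invented, and the structure of inspect:module()'s output is abbreviated):
(: Illustrative sketch: use inspect:module() to list the functions declared in a
   module on disk, e.g. a module deployed under the BaseX repo or webapp directory. :)
declare function local:moduleSummary($modulePath as xs:string) as element(ul) {
  <ul>{
    for $function in inspect:module($modulePath)/function
    order by string($function/@name)
    return
      <li>{string($function/@name)}#{count($function/argument)}</li>
  }</ul>
};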
Another challenge was developing and testing our Schematron rules. The bulk of the
validation is Schematron rules that check a large number of editorial and markup usage
rules. These Schematron rules are complex and difficult to test across the full scope
of the
Platform content. The fixit week validation activity highlighted the need to optimize
our
Schematron rules to ensure that they were accurate and useful. To assist with that
we
started implementing a Schematron Explorer that enabled interactive development and
testing
of Schematron rules against the entire database. Unfortunately we ran out of scope
to finish
the Explorer, having run into some practical challenges with the HTML for the test
results.
We will return to this at some point.
Another area of exploration was a more general Git Explorer.
With Mirabel's git integration it is easy to get the git log and report on it. As
a way
to simply test and validate Mirabel's ability to access and use information from the
git log
we implemented a Git Explorer that provides a report of the git log and demonstrates
access
to all the git log information. We did not have scope to do more with the git information
but it would be straightforward to get the per-commit and per-file history information,
allowing a direct connection between files in the content repository and their git
history
and status. There is also a general requirement to provide time sequence analysis
and
reporting on the git data, such as commits per unit time, commits per bundle, etc.
Mentions
of git commit hashes are made into links to those commits on our internal GitHub Enterprise
site.
New Requirements: Analytics and Metrics Reporting
After sitting idle for several months while we focused on higher priorities, Mirabel
found new life as the platform for delivering additional Product Content metrics and
analytics.
In the late spring of 2022 Product Content put a focus on analytics and metrics
reporting and created a new job role responsible for gathering and reporting analytics
and
metrics of all kinds with the general goal of supporting data-driven decision making
for
Product Content executives and managers, including metrics for the content itself
as well as
from other sources, such as the ServiceNow documentation server web analytics, customer
survey results, and so on.
The Analytics person immediately saw the potential in Mirabel as a central gathering
point and access service for these analytics and established an ambitious set of
requirements for Mirabel to eventually address. However, we were still facing the
reality of
limited resources and scope to pursue these new Mirabel features.
Fortunately, we hired an intern, Abdul, who was already familiar with Product Content
from his previous internship with us and who had a data science background. Abdul's
internship project was to implement a "key metrics" dashboard, which he did. This
required
doing additional work on Mirabel's core features to improve performance and reliability,
which we were able to do.
The resulting key metrics dashboard is shown below.
In the context of the key metrics implementation we were able to add a number of
important performance enhancements, including using the BaseX attribute index to
dramatically speed up queries, especially construction of the where-used and doc-to-bundle
indexes. We also put some effort into improving the look and feel of the web site
itself, as
best we could given that we are not UX designers.
While ServiceNow has many talented UX designers on staff, they are all fully occupied
on
ServiceNow products so we have not yet been able to get their assistance in improving
Mirabel's site design.
Project Mirabel Components and Architecture
The Mirabel system is delivered to users as a web application served using the BaseX
XQuery database. The web server uses RESTXQ as implemented by BaseX to serve the pages.
The
data processing for the services provided is implemented primarily as XQuery modules
run by
BaseX servers with the persistent results stored in BaseX databases. The system is
deployed to
a standalone Linux server dedicated to the Mirabel system and available to all users
within
the ServiceNow internal network.
The source XML content is accessed from git repositories cloned to the Mirabel server
machine. ServiceNow's internal GitHub system does not allow API-based access to the
repositories so streaming content directly from the GitHub Enterprise server was not
an
option. Bash scripts are used to manage cloning and pulling the repositories as required,
either on a regular schedule or on demand.
Validation of the DITA content is performed by OxygenXML using Oxygen's batch processing
features. The input is DITA maps as managed in the git repositories. The output is
XML
validation reports, stored in a separate git repository on the server file system.
Bash
scripts are used to manage running the validation processes, either on a regular schedule
or
on demand. In addition to simply validating the latest version of the content in a
given
repository, bash scripts also allow validating older versions, for example, to recreate
the
validation reports authors were actually seeing at the time by accessing the corresponding
versions-in-time of the Schematron rule sets used to do the validation. This allows
Mirabel to
change the details of how validation reports are stored and managed, as well as enabling
the
retroactive capture of historical data in order to deliver trends and analysis
information
about how errors changed over time. It also allows for validation of older versions
of the
content with specific versions of the validation rules, for example, to eliminate
rules that,
in retrospect, were not useful or were largely ignored by authors.
From this source DITA content and corresponding validation reports maintained in git
repositories on the file system, the main XQuery-based Mirabel server constructs a
set of
databases containing the source XML data, the corresponding validation reports, and
constructed indexes that optimize the correlation of validation issues to the documents
they
apply to as well as the DITA maps that directly or indirectly refer to those documents.
In BaseX, the only core indexing features are for indexes over the XML markup and
the text
content (full-text indexes). There is no more-general feature for creating indexes
as distinct
from whatever one might choose to store in a database. Instead, the BaseX technique
is to
simply create separate databases that contain XML representations of an "index". These
indexes
typically use BaseX-assigned persistent node IDs to relate index entries to nodes
held in
other databases.
In BaseX, databases are lightweight, meaning that they are quick to create or remove.
The
BaseX XQuery extensions make it possible to access content from different databases
in the
same query. Because databases are lightweight, it tends to be easier and more effective
to
use separate databases for specific purposes rather than putting all the data into
a single
database and using collections or object URIs to organize different kinds of content.
A single database may be read by any number of BaseX servers and may be written to
by
different servers as long as attempts to do concurrent writing are prevented, either
by using
write locks or by ensuring that concurrent writing is not attempted by the code as
designed.
To enable link resolution in order to then correlate individual documents to their
containing DITA maps, Mirabel maintains two linking-related indexes:
Document where used: For each document, records the direct references to it (cross
references, content references, and references from DITA maps).
Document-to-bundle-map: For each document, the bundle DITA maps that directly or
indirectly refer to the document. "Bundle" maps are DITA maps that are used as the
unit of
publication to ServiceNow's public documentation HTML server. Thus these DITA maps
play a
unique and important role in the ServiceNow documentation work flow.
The where-used index is used by the document-to-bundle-map constructor. The
document-to-bundle-map index then enables quick lookup of the bundles a given document
participates in. This then enables the organizing of validation issues by bundle,
a key
requirement for validation reporting. The where-used index also enables organizing
validation
issues by individual DITA map. DITA maps below the bundle level usually organize the
work of
smaller coordinated teams and thus represent another important grouping
for
reporting validation issues. More generally, the doc-to-bundle map allows quick filtering
of
data about individual documents by bundle. For example, the Mirabel key metrics report
presents counts of many different things (topics, maps, images, links, tables, etc.)
across
the content repository. These counts are captured on a per-bundle basis as well as
for the
entire content set. With the doc-to-bundle index it is quick to get the set of documents
in a
given bundle in order to then count things in that set.
Mirabel uses database naming conventions to enable associating the database for a
set of
DITA content to its supporting databases. The databases for the DITA content are named
for the
product release name the content is for (i.e., "rome", "sandiego", "tokyo"). Corresponding
supporting databases are named "_dbname_link_records",
"_dbname_validation_reports", etc. The leading "_" convention indicates
databases that are generated and that therefore can be deleted and recreated as needed.
Databases that serve as backup copies of other databases are named
"_backup_databaseName". Databases that are temporary copies of other
databases are named "_temp_databaseName".
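A few minimal helpers capture the convention (the function names are illustrative, not the production API):
(: Helpers for the database naming convention described above. :)
declare function local:linkRecordsDbName($contentDb as xs:string) as xs:string {
  '_' || $contentDb || '_link_records'
};

declare function local:backupDbName($database as xs:string) as xs:string {
  '_backup_' || $database
};

declare function local:isGeneratedDb($database as xs:string) as xs:boolean {
  starts-with($database, '_')
};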
For a given product release it takes about 30 seconds to construct the where-used
and
doc-to-bundle indexes. By contrast it takes about two minutes to simply load the DITA
XML
content into a database, bringing the total time needed to load a given release to
about three
minutes. This is fast enough to remove the need to implement some sort of incremental
update
of the content database and supporting indexes and still keep the Mirabel server reasonably
up
to date. At the time of writing the production server pulls from the content repositories
every 15 minutes and reloads the content databases every two hours. The more-frequent
git
pulls are required simply because the volume of updates from writers is such that
pulling
frequently avoids having to do massive pulls of hundreds of commits.
For other reports, Mirabel uses a similar pre-generation and caching strategy, where
the
raw XML for a given report is generated at content load and then used at display time
to
generate HTML tables, CSV for download, etc. For example, Mirabel produces a report
that lists
the images that are in one version but not in the previous version, representing the
"images
new since X" report. This is an expensive report to generate so it is pre-generated
and
cached.
Mirabel also relies heavily on the BaseX attribute index, which optimizes attribute-based
lookup. For example, using normal XPath lookups to construct the doc-to-bundle index
takes
about 30 minutes but using the attribute index takes less than 30 seconds.
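As a sketch of the difference (database and attribute names are illustrative):
(: The first expression is a generic XPath lookup over the elements; the second
   goes straight to the attribute index and then steps up to the owning elements. :)
let $topicId := 'some-topic-id'   (: illustrative value :)
return (
  (: generic XPath lookup :)
  db:open('tokyo')//*[@id = $topicId],
  (: explicit attribute-index lookup :)
  db:attribute('tokyo', $topicId, 'id')/..
)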
Managing Concurrency
One significant practical challenge in implementing Mirabel was managing the parallel
processing required to serve web pages and content while also constructing the supporting
indexes.
The BaseX server implementation is relatively simple, which helps make it small and
fast, but means that it lacks features found in other XQuery servers, in particular,
built-in and transparent concurrency.
While evaluating an XQuery, a given BaseX server instance will utilize all resources
available to it, leaving no cycles for serving web pages. In an application where
queries
are quick, which is most BaseX applications, this is not a problem. But in Mirabel,
where
report construction can take tens of seconds or more and the content is being constantly
updated, it is a serious problem. As a more general requirement, any BaseX application
that
needs to constantly ingest new content and construct indexes over it must address
this
concurrency challenge.
Fortunately, there is a simple solution: run multiple BaseX servers.
Because BaseX servers are relatively small and lightweight, it is practical to run
multiple BaseX instances, bound to different ports, and allocate different tasks to
them.
Because multiple servers can read from a single database, there is no need to worry
about
copying databases between different servers: all the BaseX instances simply pull from
a
shared set of databases. The main practical challenge is ensuring that only one server
is
writing to a given database at a time.
Each BaseX server is a separate Java process and thus can fully utilize one core of
a
multi-core server. Thus, on a four-core server such as is used for the Mirabel production
server at the time of writing, you can have one BaseX instance that serves web pages
and two
that do background processing, leaving one core for other tasks, such as OxygenXML
validation processing, which is also a Java process and thus also requires a dedicated
server core.
For Mirabel, the base configuration is three BaseX servers: a primary server that
serves
web pages and handles requests, and two secondary servers that manage the long-running
data
processing needed to create and update the linking indexes and load new validation
reports,
as well as managing updates to the main content databases as new content is pulled
into the
content git repositories.
The web-serving server is bound to the default BaseX ports and the worker servers
are
assigned ports using a simple convention of incrementing the first port number, i.e.,
9894
for the second server and 10894 for the third.
Another scaling option would be to use containers to run separate BaseX servers, where
the containers share one or more file systems with shared databases or use remote
APIs to
copy the results of long-running operations to a production server. Unfortunately,
at
ServiceNow we do not currently have the option of using containers, at least not in
a way
that is supported by our IT organization.
Managing Updating Operations
Another practical challenge is orchestrating update operations.
This presents an orchestration challenge when the data processing requirement is to
create a new database, add content to it, and then use that database as the source
for
another query, as BaseX does not allow a newly-created database to be written to and
read
from in the same query.
BaseX implements the XQuery update recommendation [XQUPDATE] and
imposes the recommendation's restriction that updating expressions cannot return values.
BaseX treats each updating XQuery as a separate transaction. BaseX updates are handled
in an
internal update queue that is processed after any non-updating queries in the queue
are
handled. In addition, updates are handled in parallel and order of processing (and
order of
completion) is not deterministic, so you cannot assume that updates will be handled
in the
order they are submitted or that one operation will have completed before another
starts.
This means, for example, that you cannot have a single updating XQuery expression
that
does a BaseX db:create() followed by db:load() to add documents
and then tries to act on that content. The database creation and loading must be in
one
expression and the access or subsequent updating in another.
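As a sketch, with an illustrative database name and content path, the two steps have to be submitted as separate queries:
(: These two steps must be run as separate queries: one query cannot both
   create/populate a database and then read from it. :)

(: --- Query 1 (updating): create the database and load documents into it --- :)
db:create('_temp_tokyo', '/data/git/tokyo/content')

(: --- Query 2, run after query 1 completes: read from the new database ---
   count(db:open('_temp_tokyo'))
:)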
In addition, indexes on databases, such as the attribute index, are not available for
use
until the database has been optimized after having been updated. This optimization
process
cannot be performed in the same transaction that creates the database.
The BaseX solution for general orchestration is its "jobs" feature, which allows for
the
creation of jobs, where a job is a single XQuery. Jobs are queued and run as resources
become available. A query that creates a job is not, by default, blocked by the job,
but it
can choose to block until the job completes. Jobs are tracked and managed by the BaseX
job
manager. Jobs may be run immediately or scheduled for future execution.
Jobs are not necessarily run in the order queued, so performing a complex task is
not
as simple as queuing a set of jobs in the intended sequence of execution.
The general solution is to have one job submit the next job in a sequence to the job
queue.
The Mirabel system implements this approach through a general-purpose orchestration
XQuery module that provides infrastructure for defining jobs and running them, including
logging facilities to make it easier to debug orchestrated jobs.
The orchestration module defines a "job definition" map structure that allows callers
to
define a sequence of jobs and submit them for execution. Individual XQuery modules
can
define "make job" functions that handle setting the module's namespace and otherwise
constructing jobs specific to that module. XQuery modules can also provide functions
that
construct ready-made job sets to implement specific sequences of actions. This
general-purpose orchestration facility is published as a standalone GitHub project:
basex-orch.
The job-running function is a recursive function that takes as input a sequence of
job
definitions, queues the head job, blocks until it finishes, and then calls itself
with the
remainder of the job queue. This effectively serializes job execution while letting
the
BaseX server manage the resources for each job in the context of the larger job queue
for
the server.
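A simplified sketch of that recursion (this is not the published basex-orch code; it assumes each job is supplied as an XQuery string and uses the BaseX jobs module directly):
(: Run a sequence of queries one after another: queue the head job, wait for it
   to finish, then recurse on the rest of the sequence. :)
declare function local:runJobs($queries as xs:string*) as empty-sequence() {
  if (empty($queries))
  then ()
  else
    let $jobId := jobs:eval(head($queries))
    return (
      (: block until this job has completed before starting the next one :)
      jobs:wait($jobId),
      local:runJobs(tail($queries))
    )
};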
Job sequences can be initiated from within BaseX, for example, in the context of a
RESTXQ handler function, or from outside BaseX using the BaseX command line to run
an XQuery
that constructs a new job. Mirabel uses this technique with simple bash scripts that
then
run XQuery expressions that call job-creating functions. For example, the
cleanup-backup-databases.sh script runs a simple XQuery that starts a set of
jobs to remove backup databases:
#!/usr/bin/env bash
# ==========================================
# ServiceNow Product Content Dashboard
#
# Cleans up any lingering backup databases
# ==========================================
scriptDir=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
xqueryScriptDir="${scriptDir}/xquery"
# Include function libraries:
source "${scriptDir}/bash-functions/validation-functions.sh"
basexuser="${1:-admin}"
basexpw="${2:-admin}"
basexClientPort=${3:-2984}
xqueryToRun="${xqueryScriptDir}/cleanup-backup-databases.xqy"
echo "[INFO] Cleaning up backup databases"
"$(getBasexBinDir)/basexclient" -U "${basexuser}" -P "${basexpw}" -p${basexClientPort} "$xqueryToRun"
The bash script selects the appropriate secondary server to perform the task, in this
case the server on port 2984. The current Mirabel implementation hard codes the ports
used
for different tasks in the scripts. It would be possible in bash to create a more
dynamic
server selection mechanism by getting the list of available servers and choosing one
that
has little or no load. For current Mirabel use cases this level of sophistication
is not
required.
The cleanup-backup-databases.xqy XQuery run by the bash script is:
(:
Runs the cleanup backup databases job
:)
import module namespace orch="http://servicenow.com/xquery/module/orchestration";
import module namespace dbadmin="http://servicenow.com/xquery/module/database-admin";
let $job as xs:string := dbadmin:makeJob('dbadmin:cleanupBackupDatabases', ())
let $result as item()* := orch:runJobs($job)
return (``[Cleanup backup databases job queued. Check BaseX log for details.
]``)
With the orchestration module in conjunction with multiple secondary servers, all
processing can be performed reliably while ensuring the responsiveness of the main
web
server BaseX instance.
Automated Testing
The Mirabel XQuery components are implemented as a set of modules, with each module
focused on a specific concern: link record keeping, database management and access,
validation reporting, git access and interaction, etc.
Each module has associated unit tests, implemented using BaseX's unit testing
extensions. One challenging aspect of the unit tests is testing the results of updating
operations, as the unit tests are subject to the constraints on updating functions,
which
means you cannot have a single test method that both performs an update and evaluates
the
result of the update.
BaseX's solution is to allow the specification of "before" functions that are run
before
a corresponding test-evaluation function, where the "before" function is run as a
separate
transaction. Likewise, an "after" function can be used to clean up the test.
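A minimal sketch of the pattern (module, database, and function names are invented): the updating setup runs in its own transaction via %unit:before, the assertion runs in the test function, and %unit:after drops the test database:
(: Sketch of an update-then-assert unit test using BaseX's unit annotations. :)
module namespace test = "http://servicenow.com/xquery/module/link-records-test";

declare %unit:before %updating function test:setup() {
  (: updating setup: create a throwaway database with one record :)
  db:create('_test_link_records',
    <records><record target="doc1.dita"/></records>, 'records.xml')
};

declare %unit:test function test:recordsAreStored() {
  unit:assert-equals(count(db:open('_test_link_records')//record), 1)
};

declare %unit:after %updating function test:teardown() {
  (: clean up the throwaway database :)
  db:drop('_test_link_records')
};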
Deployment
The server code is deployed from the Mirabel source code repository to the running
BaseX
XQuery repository using an Ant script. The script prepares the working XQuery modules
and
standalone scripts from the source versions (for example, to add version comments to
the
files) in the directory structure required by BaseX, zips the result, and then calls
the
BaseX command line to import the Zip file. BaseX provides a module repository mechanism
by
which it resolves module namespace URIs to module locations, removing the need for
separate
module paths in module import statements.
The web application modules are deployed to the BaseX webapp directory by the Ant
script. For web applications, BaseX automatically loads XQuery modules within the
configured
webapp directory and uses RESTXQ function annotations to map incoming URLs to the
functions
that handle them.
This makes deploying new code quick and simple: there is no need to restart the BaseX
server.
Web Application
Mirabel implements the web application using BaseX's RESTXQ support.
In RESTXQ you bind XQuery functions to URLs using annotations on the functions:
declare
  %rest:GET
  %rest:path('/now/dashboards/{$database}/dbstatus')
  %output:method('html')
function now:databaseStatusReportForFamily(
  $database as xs:string
) as element(html) {
  let $haveDatabase as xs:boolean := db:exists($database)
  let $databaseLabel as xs:string := util:getDatabaseLabel($database)
  return
    <html>
      <head>
        <title>Family {$databaseLabel} Dashboards</title>
        {webutils:makeCommonHeadElements()}
      </head>
      <body>
        ...
      </body>
    </html>
};
The function returns the HTML for the web page.
The HTML is constructed as you would construct any other literal result element.
Parameters can be extracted from the URL itself, as shown in this example, or passed
as
URL parameters.
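For URL parameters, RESTXQ provides the %rest:query-param annotation; a sketch (the path, function, and parameter names are invented):
declare
  %rest:GET
  %rest:path('/now/dashboards/{$database}/issues')
  %rest:query-param('severity', '{$severity}', 'error')
  %output:method('html')
function now:issuesBySeverity(
  $database as xs:string,
  $severity as xs:string
) as element(html) {
  (: ... build the page for the requested database and severity ... :)
  <html>
    <body>
      <h1>{$database}: issues with severity "{$severity}"</h1>
    </body>
  </html>
};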
BaseX's RESTXQ implementation makes it about as easy as it can be to implement web
applications. As a language for constructing XML results, XQuery is ideally suited
for the
type of templated HTML generation required for dynamic web applications.
For the Mirabel project, this ease of web application implementation made it remarkably
quick and easy to get a site up and running and to refine it quickly. Adding a new
page or
set of pages is as simple as implementing a new XQuery module to serve the pages and
deploying the module to the running BaseX server.
This does result in a web application where all the work is done on the server. However,
there is nothing in the RESTXQ implementation that prevents using JavaScript in the
browser
to implement single-page applications or otherwise interact with the BaseX web services
from
in-browser JavaScript. For example, BaseX could be used to quickly implement microservices
in support of a larger web application that depends on in-browser processing.
Processing DITA Without DTDs
DITA processing depends on using the value of the DITA @class attribute, which captures
the specialization hierarchy of every DITA element. For example, the specialized topic
type
"concept" is a specialization of the base element type "topic". Any concept can be
processed
usefully using the processing associated with the base <topic> element type.
The @class value for a concept topic is "- topic/topic concept/concept ", which
means that you can apply generic topic processing to concept elements by matching
on
"topic/topic" in the @class value, ignoring the element type name. For example,
the way to find all topics in your DITA content with XQuery would be
"//*[contains-token(@class, 'topic/topic')]".
However, this only works if your content has the @class attributes available. This
turns
out to be a problem with BaseX for the volume of content we have.
In normal DITA practice the @class attributes are defined with default values in the
governing grammars, i.e., in DTDs. Thus to do @class-aware processing you normally
need to
parse the documents with respect to their DTDs or other grammar.
However, the DITA DTDs are large.
We found that using BaseX's out-of-the-box DTD-aware parsing it took nearly two hours
to
load the content of a single product version, roughly 40,000 documents. This reflects
the
fact that BaseX as of version 10.0.0 does not make use of Xerces' grammar
cache
feature, so it has to re-parse the DTDs for every document. We investigated adding
use of
the grammar cache to BaseX but it was beyond what we could do at the time.
Instead we parse the documents without using the DTDs and add the DITA class awareness
after the fact.
Parsing without the DTDs takes about two minutes to load the 40,000 documents for
a
platform version.
To enable @class-aware processing we implemented a module that simply creates a static
mapping from element types to @class values. This is possible for a known content
set
because the @class value for a given element type name in DITA is (or should be) invariant,
meaning that the DITA element type "foo" should have the same @class value everywhere,
no
matter how many different document types that element type is used in. For all elements
defined in the DITA standard and DITA Technical Committee-defined vocabularies this
is
always true. Within the scope of a single set of DITA content managed as a unit it
must also
be true. While different DITA users could define their own specializations that happen
to
have the same element type name but different @class values, it is unlikely that those
two
vocabularies would be used together. Within an enterprise like ServiceNow we have
complete
control over the markup details of our content and do not have a requirement to integrate
content from third parties in a way that would require more flexible solutions. This
is true
for the vast majority of DITA users.
The solution to finding elements based on DITA @class value is to define a utility
function that takes as input an element and a @class value token to match on and returns
true or false based on a static mapping:
(:~
: Determine if an element is of the specified DITA class
: @param context Element to check the class of
: @param classSpec The class value to check (i.e., 'topic/p')
: @return True if the element is of the specified class.
:)
declare function dutils:class($context as element(), $classSpec as xs:string) as xs:boolean {
let $normalizedClass := normalize-space($classSpec)
let $classTokens as xs:string* := dutils:getClassTokens($context)
return $normalizedClass = $classTokens
};
(:~
: Gets the DITA @class value for the specified element.
: @param context Element to get the class value for
: @return The class value as a sequence of tokens, one for each module/element pair
:)
declare function dutils:getClassTokens($context as element()) as xs:string* {
let $classValue := $dutils:elementToClassMap(name($context))
return tail(tokenize($classValue, '\s+')) (: First token is always the - or + at the start of the @class value :)
};
The $dutils:elementToClassMap XQuery map is generated from the DITA RELAX
NG grammars:
(:~
: Use the RELAX NG grammars to generate a mapping from element types to their
: declared @class values.
: @param database The database that contains the RNG grammars, i.e., "rng". The RNG database
: must have an attribute index.
: @return A map of element type names to @class values.
:)
declare function dutils:generateElementToClassMap($database) as map(*) {
let $debug := prof:dump('dutils:generateElementToClassMap(): Using database ' || $database)
return
map:merge(
for $element in db:open($database)//rng:element[exists(@name)]
let $attlistName as xs:string? := $element//rng:ref[contains(@name, '.attlist')]/@name ! string(.)
let $classAttDecl as element()* := db:attribute('rng', $attlistName, 'name')/../self::rng:define[.//rng:attribute[@name eq 'class']]
where exists($classAttDecl)
return map {string($element/@name) : $classAttDecl[1]//rng:attribute[@name eq 'class']/@a:defaultValue ! string(.)}
)
};
Another option would be to use this mapping to add the @class attributes to the XML,
either as part of the initial ingestion process or as an update applied after the
content is
initially loaded from the file system. Having the @class attributes present on the
DITA
elements would enable using the attribute index to do class-based lookups, which would
be a
significant performance improvement.
In the process of preparing this paper we tried this experiment:
import module namespace dutils="http://servicenow.com/xquery/module/now-dita-utils";
import module namespace util="http://servicenow.com/xquery/module/utilities";
let $database := 'tokyo'
let $profile := prof:track(dutils:getAllMapsAndTopics($database))
let $docs := $profile('value')
return
for $e in $docs//*[empty(@class)]
let $classAtt := attribute {'class'} { $dutils:elementToClassMap(name($e)) }
return insert node $classAtt into $e
It took several hours to complete (exact time not determined because we had to run
errands while the process was running but it was no less than two hours for the 7950589
elements in the content database we tested with).
There may be a more efficient way to do this update but a better solution is probably
a
SAX filter used at initial parse time, which will add essentially no additional overhead
to
the parsing process. This would be relatively easy to implement and configure but
given the
existing speed and few active users it is hard to justify at this time.
Capturing and Reporting Link-Related Information: Where Used Indexes
The validation reporting depends on two indexes: document-where-used and
document-to-bundle-map.
Constructing the document-where-used index is relatively simple: For each element
that
is a referenced, create an XQuery map entry where the key is the target document and
the
value is the referencing element itself, then merge the entries, combining duplicates,
to
create a map of documents to the references to those documents. Because the referencing
elements retain their document contexts, the map provides quick lookup of what documents
point to any given document. By organizing the value of each entry by reference type
(cross
reference, content reference, and map reference) it is easy to evaluate links based
on link
type, for example, to determine what DITA maps refer to a given document as distinct
from
topics that refer to the same document via cross reference.
The current ServiceNow Platform documentation does not use DITA's indirect addressing
feature (keys and key references), which simplifies the process of constructing a
where used
index to the simple
algorithm:
let $whereUsedMap as map(*) := map:merge(
  for $ref in collection()//*[@href|@conref]
  let $target as element()? := local:resolveRefToDoc($ref)
  return
    if (exists($target))
    then map{ base-uri($target) : root($ref) }
    else ()
  ,
  map { 'duplicates' : 'combine' }
)
Where the local:resolveRefToDoc() function attempts to resolve a DITA
reference to the DITA map or topic element that is or contains the reference (references
may
be to elements within maps or topics).
This algorithm results in an XQuery map where the keys are the base URI of a map or
topic document and the values are the references to that topic. In practice, the value
of
each entry is actually a map of reference types to referencing elements as DITA defines
distinct types of reference:
Topic references (links from DITA maps to maps, topics, or non-DITA
resources)
Cross references (<xref> and specializations of <xref> and <link> and
specializations of <link>)
Content references (transclusions specified by the @conref attribute on any element
type).
This classification of references then allows for easy lookup of references by type,
i.e., "find all maps that use this topic" is simply the value of the "topicrefs" entry
in
the map that is the value of the where-used map entry.
The working code constructs the final map as a two-phase process:
Construct a map where each entry has the target document as the key and the value
is
a sequence of maps, one from each reference to the target document.
Construct a new map that combines the individual value maps into single values for
each link
type:
let $mergedMap as map(*) := map:merge($entries, map{ 'duplicates' : 'combine' })
let $whereUsedMap := map:merge(
  for $key in map:keys($mergedMap)
  let $entry := $mergedMap($key)
  let $newEntry :=
    map{
      'doc' : dutils:distinct-nodes($entry?doc),
      'topicrefs' : $entry?topicrefs,
      'xrefs' : $entry?xrefs,
      'conrefs' : $entry?conrefs,
      'referencing-maps' : $entry?referencing-maps
    }
  return map{ $key : $newEntry }
)
Note that this takes advantage of the "?" map lookup syntax, which has the effect
of
getting all the values for the specified key from all maps in the left-hand map set,
resulting in a single sequence for the new map entry value.
This where-used table serves as the underpinning for any link management features
that
might be needed.
This processing results in a single XQuery map with one entry for each document in
the
content set. It currently takes about a minute to construct the where-used index for
a
single product version. This time largely reflects the lack of @class attributes in
the
content, which makes use of the BaseX attribute index impossible. This is still an
acceptable level of performance for the load of a product version.
To persist the index, the XQuery map is converted to XML and stored in the link record
keeping database for the corresponding content database (i.e., "_lrk_tokyo_link_records").
By denormalizing the data stored in the XML, lookups can be optimized. Nodes in the
XQuery
map are represented by their BaseX node IDs. The node IDs are captured as attributes
on the
index elements, enabling optimized lookup using the BaseX attribute index.
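The stored XML format itself is not shown here, but a minimal sketch of the idea, using hypothetical element and attribute names, would record each map entry as an element whose children carry the BaseX node IDs (as reported by db:node-id()) of the referencing elements:
(: Hypothetical serialization of the where-used map for persistence; the element and
   attribute names are illustrative, not the format Mirabel actually stores. :)
let $records :=
  for $targetUri in map:keys($whereUsedMap)
  let $entry as map(*) := $whereUsedMap($targetUri)
  return
    <where-used-record target="{$targetUri}">{
      for $refType in ('topicrefs', 'xrefs', 'conrefs', 'referencing-maps')
      return
        <refs reftype="{$refType}">{
          for $ref in $entry($refType)
          return <ref node-id="{db:node-id($ref)}" doc="{base-uri($ref)}"/>
        }</refs>
    }</where-used-record>
return <where-used-index>{$records}</where-used-index>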
Doc-to-Bundle Map Construction
The document-to-bundle-map index construction requires walking up the map-to-document
reference chain for each target document to find the bundle maps that ultimately refer
to
the document.
Conceptually the algorithm is a recursive graph walk using the where-used table to
find
all the DITA maps that refer to a document, then the maps that refer to those maps,
and so
on, collecting any referencing maps that are bundle maps, until the set of referencing
maps
is exhausted. Cycles are not possible in DITA, or rather, a system of DITA maps containing a
cycle would fail to process, so if one were ever created it should never survive into the
committed content set. DITA maps represent strict hierarchies of maps.
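A minimal recursive sketch of this graph walk, using the where-used index from the previous section and a hypothetical local:isBundleMap() check, might look like this (an illustration of the concept rather than Mirabel's working code):
(: Return the base URIs of the bundle maps that ultimately use the given document. :)
declare function local:bundlesForDoc(
  $docUri as xs:string,
  $whereUsedMap as map(*)
) as xs:string* {
  distinct-values(
    for $refMap in $whereUsedMap($docUri)?referencing-maps
    let $mapUri as xs:string := string(base-uri(root($refMap)))
    return
      if (local:isBundleMap($mapUri))
      then $mapUri
      else local:bundlesForDoc($mapUri, $whereUsedMap)
  )
};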
Initial implementation of this algorithm resulted in wildly different execution times
for different data sets. In addition, the initial implementation did not use the BaseX
attribute index, so performance differences for different test data sets were quite
obvious.
Development was supported by three test data sets:
A small but realistic documentation set consisting of one bundle map, about 20
submaps, and about 400 topics.
A portion of the ServiceNow Platform content reflecting about 9000 topics, a quarter
of the total number of topics, but the total set of maps (we did not bother trying
to
eliminate maps that did not refer to any topics in the test set as the number of maps
is
very small relative to the number of topics).
The full content set for a single Platform release, reflecting the full 40,000
topics and all the maps.
When the document-to-bundle-map construction was run on the small test set, it ran
as
fast as expected, taking only a couple of seconds to construct the index.
However, when run with the one-quarter data set the construction took many minutes.
Why?
The initial implementation was naive: it took each map in turn and determined the bundle
that uses it. For a given map this means finding all its uses, recursing on those maps,
finding their uses, recursing on those, and so on until reaching maps with no uses or maps
that are known to be bundle maps (bundle maps have distinctive filenames).
If the number of uses of any map is small this algorithm will perform well, as it
did
with the single-bundle content set, where almost no map has more than one use.
However, our real content has a single DITA map, now-keys-common.ditamap, that is
used
by almost every other map (about 740 uses of this one map). This map is also referenced
as
the first map in every map that uses it, so when processing this map to see who references
it we find 740 references, which we then process to see who references them, etc.,
and we do
this for every map (because every map references this one map).
The first reaction was to process maps in reference-count order, from least to most used:
by the time we get to the now-keys-common DITA map we already know which bundles all the
other maps are in, so we only need one lookup for each using map to determine the set of
bundles that ultimately use the now-keys-common map.
We then realized that we could short circuit other lookups based on already-gathered
information, further optimizing the algorithm.
With these changes the process went from "never finishes" to about two minutes to
complete.
At this point in the processing we know what bundles use each map. The next step is
to
determine which bundles use each topic.
This process involves getting, for each topic document, the maps that reference it and
then looking up the bundles for those maps. It touches many more documents (36,000 or more)
but each lookup is much more efficient.
The initial implementation took about 20 minutes to complete (about 0.035 seconds
per
topic). Adding the use of the BaseX attribute index reduced that time to about 20
seconds.
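The per-topic lookup just described could be sketched roughly as follows, assuming $whereUsedMap is the where-used index and $mapToBundles is a hypothetical map from map URIs to the bundle sets computed in the previous step (again, an illustration rather than the working code):
(: The bundles that use a topic are the union of the bundles of the maps that reference it. :)
declare function local:bundlesForTopic(
  $topicUri as xs:string,
  $whereUsedMap as map(*),
  $mapToBundles as map(*)
) as xs:string* {
  distinct-values(
    for $topicref in $whereUsedMap($topicUri)?topicrefs
    return $mapToBundles(string(base-uri(root($topicref))))
  )
};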
Content Preview: Rendering Formatted DITA With Minimum Effort
Because the repository provides access to all the authored content, it follows that
it
should provide a way to preview topics and maps so users can see the content in a
useful
way.
DITA markup is complex and transforming it into non-trivial HTML is a tall order—DITA
Open Toolkit has thousands of lines of XSLT code to do just this. But we didn't have
time
for that.
Fortunately, in the context of my earlier DITA for Small Teams project, also based
on
BaseX and XQuery, I had stumbled on a remarkably simple technique for generating a
useful
preview of DITA content: just generate <div> and <span>
where the HTML @class attribute values reflect the tag name and DITA @class attribute
values
and use CSS to do all rendering based on @class values. Attributes are represented as spans
that record the attribute name and value. This transform can then be implemented easily
enough using XQuery typeswitch expressions:
declare function preview:nodeToHTML($node as node()) as node()* {
typeswitch ($node)
case text() return $node
case processing-instruction() return preview:htmlFromPI($node)
case comment() return ()
case element() return preview:htmlFromElement($node)
default return() (: Ignore other node types :)
};
declare function preview:htmlFromElement($elem as element()) as node()* {
let $effectiveElem as element() := lmutil:resolveContentReference($elem)
return
typeswitch ($effectiveElem)
case element(image) return
let $worktree as xs:string := db:name($elem)
let $imagePath as xs:string := relpath:resolve-uri(string($elem/@href), "/" || db:path($elem)) ! substring(.,2)
let $imageFilesystemPath as xs:string? := git:getFileSystemPath($worktree, $imagePath)
let $imageUrl as xs:string := string-join(('/now/content', $worktree, 'image', $imagePath), '/')
return
<a
class="image-link"
href="{$imageUrl}"
title="{$imagePath}"
target="_imageFullSize"><img src="{$imageUrl}" alt="{$imagePath}"
/></a>
default return
<div class="{(name($elem), dutils:getClassTokens($elem), string($elem/@outputclass)) => string-join(' ')}">
{for $att in $effectiveElem/@* except ($effectiveElem/@class)
return preview:attributeToHTML($att)
}
{for $node in $effectiveElem/node()
return preview:nodeToHTML($node)
}
</div>
};
declare function preview:attributeToHTML($att as attribute()) as node()* {
let $attDisplay as node()* :=
if (name($att) = ('href'))
then
let $worktree as xs:string := db:name($att)
let $targetPath as xs:string := relpath:resolve-uri(string($att), "/" || db:path($att)) ! substring(.,2)
let $targetUrl as xs:string := string-join(('/now/content', $worktree, $targetPath), '/')
return <a class="href-link" href="{$targetUrl}">{string($att)}</a>
else text{string($att)}
return
<span class="attribute" data-attname="{name($att)}"
><span class="attvalue">{$attDisplay}</span></span>
};
declare function preview:htmlFromPI($pi as processing-instruction()) as node()* {
let $name as xs:string := name($pi)
return
if ($name eq 'oxy_comment_start')
then preview:oxyCommentToHTML($pi)
else if ($name eq 'oxy_comment_end')
then <span class="comment-end">{string($pi)}</span>
else () (: Ignore :)
};
declare function preview:oxyCommentToHTML($pi as processing-instruction()) as node()* {
let $data as xs:string := $pi ! string(.)
let $elem as element() := parse-xml("<comment " || $data || "/>")/*
let $result as node()* :=
<span class="oxy_comment">{
if (exists($elem/@flag))
then attribute {'data-flag'} {string($elem/@flag)}
else ()
}{
for $att in $elem/@*
return <span class="comment-{name($att)}">{string($att)}</span>
}</span>
return $result
};
This process only needs to special-case elements that must become navigable links and
image references. Everything else can be handled in CSS.
Note that images are served directly from the file system; they are not stored in the
BaseX content database, although they could be. Because BaseX always has the git clones
available, the images can be served straight from the file system, avoiding the expense of
copying them into the database.
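A rough sketch of how such an image response could be constructed with the BaseX File and Web modules, reusing the git:getFileSystemPath() helper seen above (a RESTXQ function bound to the /now/content/.../image/... URL pattern would call something like this; the function name is hypothetical and error handling is omitted):
declare function preview:imageResponse(
  $worktree as xs:string,
  $imagePath as xs:string
) as item()* {
  (: Locate the image in the git clone on the file system :)
  let $fsPath as xs:string? := git:getFileSystemPath($worktree, $imagePath)
  return
    if (exists($fsPath) and file:exists($fsPath))
    then (
      (: Set the media type from the file extension, then stream the bytes :)
      web:response-header(map { 'media-type': web:content-type($fsPath) }),
      file:read-binary($fsPath)
    )
    else ()
};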
In addition to the formatted view of each document we also provide a view of the raw
XML. For that we just serialize the XML into an HTML <pre> element:
declare function preview:elementToHTML($element as element()) as node()* {
if (df:isTopic($element)) then preview:topicToHTML($element)
else if (df:isMap($element)) then preview:mapToHTML($element)
else preview:serializeXmlToHTML($element)
};
declare function preview:serializeXmlToHTML($element as element()) as node()* {
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>{document-uri(root($element))}</title>
</head>
<body>
<pre>
{serialize($element)}
</pre>
</body>
</html>
};
(The BaseX administration web app includes a better serialized-XML renderer, but we didn't
have time to integrate that into Mirabel, so we used this very quick solution.)
Conclusions and Future Work
The use of XQuery, RESTXQ, and BaseX enabled remarkably quick and easy implementation of a
highly functional system for holding and reporting on the large volume of content that makes
up ServiceNow's Platform documentation.
The ease of deployment and the features of the XQuery language itself coupled with
BaseX's
lightweight implementation, clear documentation, and ready community support made
it about as
easy as it could be to apply iterative development approaches. The ability to quickly
acquire
and integrate open-source components for things like data visualization and table
sorting made
things even easier. It's not news that the Internet provides all of this, but for someone who
can still remember when you had to go to a library or book store to learn about a new
technology, or buy a product and wait weeks to receive it in order to do something new, it's
still a marvel of quickness and ease.
The need to implement concurrency and job control added effort relative to using a product
like MarkLogic, but the implementation effort was not that great; once implemented, the
facility is available for future projects and is more than offset by the other advantages of
BaseX.
Future work includes:
Support for authenticated users in order to provide personalized results, options,
and
role-based features.
Support for DITA's indirect addressing feature (keys). Work has started on this in
the
context of an independent open-source project but still needs to be integrated with
the
Mirabel system.
A general "content explorer" that allows authors to explore and navigate the DITA
content with full access to all where-used information. The content explorer will
include
features for doing canned and ad-hoc XML queries over the content (for example, "how
many
steps within the ITSM bundle include a reference to the menu item 'new'?").
Finish out the git explorer to provide complete access to the git history at any desired
level of granularity (individual document, DITA map, bundle, user or set of users, etc.).
One challenge here is not allowing Mirabel to become too Orwellian in its use of git data.
Through git's "blame" feature Mirabel can know which author is responsible for any line in
any version of any file. This knowledge could be misused, or at least be presented in a way
that seems creepy and unwanted.
Finish out the Schematron explorer and testing facility.
Implement complex validation processes currently performed by ad-hoc scripts against
files on the file system, such as link validation processes. Mirabel should be able
to
perform such processing with greater speed and completeness than the current Python-based
implementations.
Consume and present analytics and metrics from other information systems, such as
web
site analytics for docs.servicenow.com and Jenkins build history metrics. In general,
if
the data can be captured, capture it in Mirabel with time stamps to enable time series
reporting.
Integrate with other ServiceNow information systems, especially the ServiceNow
instances ServiceNow uses to manage the work of Product Content.
Add @class attributes to the DITA source at parse time, or use Xerces' grammar cache for
DTD-aware parsing, in order to fully optimize DITA-aware element access based on @class
values.
Integrate more tightly with authoring tools as appropriate. For example, provide
Oxygen add-ons that can submit queries to Mirabel and show the results in the
editor.
Package Mirabel for use on individual users' machines so that they have immediate access
to uncommitted and unpushed changes in their working environment. This would require
implementing incremental update of the content database and supporting indexes. Because
BaseX is a simple-to-install package it would be easy enough to create a standalone Mirabel
package that could be installed and run on users' personal machines. This could even be done
through an OxygenXML add-on (Product Content Engineering already maintains a custom Oxygen
add-on that could be extended to manage a local Mirabel instance).
While the stated goal of Project Mirabel is to be a read-only source of information
about
the DITA source and related materials Product Content creates and works with, it's
not hard to
see how Mirabel starts to look a lot like a DITA-aware component content management
system.
For example, it would not be difficult to integrate Mirabel with Oxygen Web Author
to provide
a complete authoring environment on top of the existing git-based management
infrastructure.