Paxton, Alan, and Adam Retter. “Using a testbed to assess XML Database Performance: Integrating a NoSQL testbed into
the XML testing universe.” Presented at Balisage: The Markup Conference 2024, Washington, DC, July 29 - August 2, 2024. In Proceedings of Balisage: The Markup Conference 2024. Balisage Series on Markup Technologies, vol. 29 (2024). https://doi.org/10.4242/BalisageVol29.Paxton01.
Balisage Paper:
Using a testbed to assess XML Database Performance
Integrating a NoSQL testbed into the XML testing universe
Alan is a Senior Engineer at Evolved Binary, where he works on Database Development
and Testing,
in Java and C++.
Alan started his career at DEC, in VMS Development, where he worked on transactional
components and novel filesystems.
He developed an interest in formal methods and returned to academe to work on a PhD
which linked
proof and testing.
Back in industry, he worked in a variety of software development roles, moving into
Mobile Phone development
when it was new and interesting.
Returning to his database and storage roots, Alan's career has come full circle.
Adam is the Director of Evolved Binary, a software, consultancy,
and training company in the UK that also specialises in XML and
LinkedData technologies. Adam is also a co-founder of eXist Solutions
GmbH in Germany, which specialises in TEI Publisher software.
Adam is passionate about Open Source software, Open Technical
Communities, and enjoys researching and solving difficult software
engineering problems. Adam was an invited expert to the W3C XQuery
Working Group, and has been a core developer on eXist-db for 19 years.
Adam also created the Elemental Native XML Database, and FusionDB.
We have reviewed the history of XML Benchmarking, and we have argued that the benchmarking
of XML Databases has stagnated [Paxton 2024].
Consequently we have proposed that by adopting the testbed approach [Difallah 2013] [Shook 2020], and adapting
it to XML databases, we can bring recent progress in OLTP and NoSQL benchmarking to
XML [Paxton 2024].
Our goal is to make it easier to run flexible, repeatable, and potentially large benchmarks
against any standard-compliant
XQuery or XML database implementation in order to perform:
Application level benchmarking, to provide a broad characterisation of the performance
of a database against a particular workload.
Microbenchmarking, in which more tailored queries and workloads are run to characterise
in detail the performance of particular code or data paths within a database implementation.
Microbenchmarks are often run in the presence of profiling tools which can analyse
detailed CPU and memory usage patterns of a system while it is executing a workload.
We have implemented the two key components necessary for adopting the NoSQLBench testbed
in XML Database benchmarking:
An adapter for NoSQLBench, to drive XML databases.
NoSQLBench [Shook 2020] is a testbed for benchmarking NoSQL databases.
It has an extensible adapter-based architecture for interacting with
different NoSQL database APIs.
Our adapter implements the XML:DB standard [XML:DB].
A tool, xmlgen2, for the efficient generation of synthetic XML data which can be ingested into a database
as a prelude to benchmarking. Of course, where data ingestion is an important part
of a workload, the ingestion itself can become part of the
benchmark workload. This tool is modelled on the xmlgen tool of [Schmidt 2002] but is designed to be flexible enough to generate diverse workloads in
a straightforward way. To do this we exploit the powerful Virtual Data Set features
which are part of NoSQLBench [NoSQLBench].
This is an advance over the original xmlgen tool, which was designed to generate only one specific XML corpus.
We have previously used the term XML Storage Systems to describe a range of systems which are used to
store XML (structured) data [Paxton 2024]. This encompasses processors which implement one-off
XML parsing and transformation; systems which store XML files whole (and perhaps allow
processing of them); systems which decompose XML documents and store them in another storage system (perhaps
a SQL DB); and so-called
Native XML Databases, in which collections of XML documents are decomposed and stored within a
system which supports efficient retrieval, reconstruction, analysis and update of
the documents.
While an XML Processor like Saxon [Saxon Processor] is focused on one-off processing of documents, Native XML
Databases such as FusionDB [FusionDB], Elemental [Elemental], BaseX [BaseX],
and eXist-db [eXist-db] are designed
to handle data which is repeatedly accessed, and potentially updated. As their name
suggests, they are much
closer to traditional SQL and NoSQL Databases in the concerns they address; they just
manage XML structured data.
Native XML Databases and XML Processors share a common, standardized query language
for XML in [XQuery 3.1].
The standardization of XQuery offered an obvious basis on which to measure, compare
and contrast the performance
of XML Storage Systems. The XMark Project [Schmidt 2002] defined a canonical XML data collection
and a set of XQuery queries over that collection. The XMark data collection and queries
have been widely used to
assess the relative performance of multiple XML Storage Systems, for instance by [Lu 2005].
The term database testbed was introduced by [Difallah 2013] in the context of
their system OLTP-Bench, which was designed to flexibly control, manage and extract
metrics from the
benchmarking of OLTP systems. The observation that the core benchmark code is only a small
part of a realistic production-quality benchmark led to the understanding that the infrastructure
of dispatch, monitoring, metric extraction, and reporting is a problem that can be solved once
within a testbed, allowing easier construction of new benchmarks in the future.
In the NoSQL space, the YCSB system [YCSB10] predates OLTP-Bench, and is a database testbed
in practice if not in name. More recently, NoSQLBench [NoSQLBench], [DataStax 2020]
has embraced the testbed ethos, and has united pluggable adapters for database instances
with the concept of Virtual Data Sets,
a flexible and powerful means of generating large, pseudo-random but repeatable synthetic
data with which to
drive benchmarks.
The continued progress in benchmarking of NoSQL and OLTP systems exposes the relative
lack of progress in benchmarking
XML Databases. Much of the literature on the subject of XML benchmarking is not recent,
and a review such
as [Mlýnková 2008] remains relevant. We revisited the XML benchmarking
landscape in detail in [Paxton 2024], and proposed a way forward: adopting NoSQLBench as
a testbed, adapting it with an XML:DB driver to operate with standardized XML Databases,
and enhancing it with a Virtual Data Set (VDS) based library for generating synthetic
XML database collections.
In the rest of this paper we restate our goals in developing an XML Benchmarking testbed,
and describe the
recent progress we have made towards them.
In the section “Running XMark in the Testbed” we demonstrate what is involved in running individual benchmarks
in our testbed, using the XMark benchmarks of [Schmidt 2002] as our example. We show that
the level of automation we have currently achieved already makes this a practical
system.
In the section “A Comparison of XMark xmlgen and xmlgen2 Auction Data” we compare the results of the XMark benchmarks when run
against the original auction database generated by XMark xmlgen with the results
when run against xmlgen2, our re-implementation of xmlgen
utilising our new testbed framework. We have set up this comparison as a checkpoint
of the capabilities
of our testbed.
In the section “Conclusions and Future Work” we discuss what we have achieved so far in using
NoSQLBench as an XML Benchmarking testbed, how successful it has been,
and what can usefully still be done to improve it.
Architecture of the XML Benchmark Testbed
The OLTP-Bench paper [Difallah 2013] defines the goals of benchmark testbeds as:
control the data set and workload for testing
automate the extraction of metrics from test runs
ensure these runs are repeatable
When running benchmarks from a testbed (or otherwise) we recognise 3 distinct phases
to the process:
Test data generation/loading
Benchmark setup, query ramp-up, run and teardown
Metric collection
We want to bring the benefits of a testbed as described by [Difallah 2013]
into the XML benchmark domain, and to do this we have extended NoSQLBench so that
it
can work effectively with XML Storage Systems.
Benchmark run
In order to run NoSQLBench benchmarks with XML databases, NoSQLBench must be able
to interact with XML databases.
NoSQLBench interacts with NoSQL databases using adapters which fit into a framework that it defines.
This allows third parties to create adapters that implement interaction with other databases,
and in practice XML Storage
Systems fit just as well into the adapter framework.
The standard XML:DB API [XML:DB] is the obvious candidate for this, and so the first artifact
we have created is an XML:DB adapter for NoSQLBench.
Adapter
The XML:DB API is database-agnostic, and may be used to access any XML database that supports
it.
Our XML:DB adapter is compatible with [FusionDB], [Elemental],
and [eXist-db].
The XML:DB API adapter implementation is straightforward.
Both NoSQLBench and its adapters are written in the Java programming language.
The current NoSQLBench release (5.17) requires Java 17,
and the current development branch for the forthcoming (5.21) release requires Java
21.
We implemented our adapter as a new Maven module within the NoSQLBench codebase.
Several compulsory classes are needed to create and load an adapter, and to have it build
and execute the operations that NoSQLBench requires.
Our fork of NoSQLBench with our new XML:DB API adapter is available from
https://github.com/evolvedbinary/nosqlbench.
The adapter supports compulsory parameters defining how to connect to the XML database
endpoint.
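For illustration, the following minimal Java sketch shows the kind of XML:DB API call sequence that such an adapter wraps when it executes a query operation. It is not taken from the adapter source; the eXist-db driver class and endpoint URI shown here are assumptions standing in for the adapter's configurable connection parameters.

    import org.xmldb.api.DatabaseManager;
    import org.xmldb.api.base.Collection;
    import org.xmldb.api.base.Database;
    import org.xmldb.api.base.ResourceIterator;
    import org.xmldb.api.base.ResourceSet;
    import org.xmldb.api.modules.XPathQueryService;

    public class XmlDbQuerySketch {
        public static void main(String[] args) throws Exception {
            // Register an XML:DB driver implementation (eXist-db's is shown as an example).
            Database database = (Database) Class.forName("org.exist.xmldb.DatabaseImpl")
                    .getDeclaredConstructor().newInstance();
            DatabaseManager.registerDatabase(database);

            // Connect to a collection at the configured endpoint.
            Collection col = DatabaseManager.getCollection(
                    "xmldb:exist://localhost:8080/exist/xmlrpc/db", "admin", "");
            try {
                // Execute an XQuery via the standard query service and print each result.
                XPathQueryService service =
                        (XPathQueryService) col.getService("XPathQueryService", "1.0");
                ResourceSet results = service.query("count(/site/open_auctions/open_auction)");
                ResourceIterator it = results.getIterator();
                while (it.hasMoreResources()) {
                    System.out.println(it.nextResource().getContent());
                }
            } finally {
                col.close();
            }
        }
    }

Because the API is standardized, the same call sequence works against any database that provides an XML:DB driver; only the driver class and endpoint details differ.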
Naturally we intend to contribute this adapter back to NoSQLBench as soon as is practical.
NoSQLBench provides tight control of the data set and workload in a number of ways:
You can run common testing workloads directly from the command line. You can start
doing this within 5 minutes of reading this.
You can generate virtual data sets of arbitrary size, with deterministic data and
statistically shaped values.
You can design custom workloads that emulate your application, contained in a single
file, based on statement templates - no IDE or coding required.
These features are provided by the core of NoSQLBench. What is missing is a way to
efficiently provide
a large XML data workload. Of course it is easy to update the database through
queries using the XML:DB adapter,
but the overhead of the interface makes this piecemeal mechanism very inefficient.
What is needed is a way to bulk generate XML data which can be stored in a file and
uploaded into the
target database through whatever file ingestion mechanism it provides. We observed
that the NoSQLBench
Virtual Data Sets (VDS) effectively do this at the runtime of a benchmark, and we
decided to build
a tool using VDS that generates XML ahead of time, before a benchmark runs. This is the second and final
artifact we created to adapt NoSQLBench for XML.
A Standalone XML Generation Tool using VDS
Recall that the NoSQLBench framework is written in Java.
In fact, the VDS services within it were once a standalone package of their own,
and while they have been rolled into the greater NoSQLBench,
they still present a clean, general, and very powerful functional library for efficient
data generation.
It turns out to be very easy, then, to use VDS as a Java library directly and to supplement
that with support
(and syntactic sugar) simplifying the generation of XML output. Thus while defining
XML workloads in this system involves,
in both principle and practice, writing programs for XML generation
rather than scripts,
the programs are usually extremely simple and formulaic.
We have implemented such a Java tool which uses the VDS libraries and generates XML
elements,
attributes and text.
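To give a flavour of how simple and formulaic such a program is, here is a minimal, hypothetical sketch. It deliberately does not use the real VDS classes; plain java.util.function.LongFunction values stand in for VDS bindings, but the key property is the same: every generated value is a pure function of the cycle number, so the output is repeatable for a given size.

    import java.util.function.LongFunction;

    // Hypothetical sketch of a formulaic generator; not the xmlgen2 implementation.
    public class PeopleGenSketch {

        static final String[] CITIES = {"London", "Berlin", "Paris"};

        // Stand-ins for VDS bindings: deterministic functions of the cycle number.
        static final LongFunction<String> NAME = cycle -> "person" + (cycle % 10_000);
        static final LongFunction<String> CITY = cycle -> CITIES[(int) (cycle % CITIES.length)];

        public static void main(String[] args) {
            long persons = args.length > 0 ? Long.parseLong(args[0]) : 5;
            StringBuilder xml = new StringBuilder("<people>\n");
            for (long c = 0; c < persons; c++) {
                xml.append("  <person id=\"person").append(c).append("\">\n")
                   .append("    <name>").append(NAME.apply(c)).append("</name>\n")
                   .append("    <city>").append(CITY.apply(c)).append("</city>\n")
                   .append("  </person>\n");
            }
            xml.append("</people>\n");
            // In practice the document would be written to a file ready for bulk ingestion.
            System.out.print(xml);
        }
    }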
Using the tool we have implemented XML generation of a database modelled on the XMark
auction site.
Our tool could be re-integrated into NoSQLBench as a step in a workload scenario,
however it does not take the same form as an adapter (as defined by NoSQLBench).
In the section “A Comparison of XMark xmlgen and xmlgen2 Auction Data”, we compare the results of this method of XML
generation with the original XMark xmlgen.
Metric collection and extraction
NoSQLBench automatically measures and extracts timing of distinct operations run within
a benchmark.
Means and distributions of per-operation performance are then reported via whichever
mechanisms are
configured; the simplest is a log file. That is more than adequate for our needs,
but the growing
requirement for feeding continuous statistics into control panel dashboards
can also be accommodated.
Running XMark in the Testbed
Having built an XML:DB adapter and an XML generation tool, we can construct a NoSQLBench script
to run XMark benchmark queries. NoSQLBench scripts are written in YAML, and are structured
at the top level as a number of scenarios. Our scenarios are:
A scenario for clearing the database and reloading XML data from the (previously generated)
auction
database file.
A scenario for running each individual XMark benchmark; most are structured in the
same way:
Run a warmup phase which performs the single XQuery
operation of the benchmark
several times to stabilise any startup effects
of the database.
Run a read phase which performs the XQuery operation of the benchmark
multiple times to generate a statistically useful distribution of results.
The different benchmark query scenarios are run a fixed number of times which depends
on how long-running each individual query is.
Most of the XMark benchmark scenarios run the query multiple times serially in a single
thread,
because the nature of the XMark benchmark query demands this mode of operation. It
is entirely
possible, and a core feature of NoSQLBench, to configure a scenario to run multiple
threads and
to distribute more than one operation over these threads. We have tested this model
with a
very simple XQuery operation to demonstrate the simulation of a database servicing
multiple
requests from multiple clients. Clearly the potential exists to benchmark realistic
queries
in relevant concurrent contexts.
As an example, here is the scenario for XMark benchmark 3:
Return the first and current increases of all open
auctions whose current increase is at least twice as
high as the initial increase.
xmark003:
  # Single threaded
  warmup: run driver=elemental tags==block:"xmark003.*",
    cycles===TEMPLATE(warmup-cycles,5)
    threads=1 errors=timer,warn
    endpoint=xmldb://localhost:3808
  read: run driver=elemental tags==block:"xmark003.*",
    cycles===TEMPLATE(read-cycles,10)
    threads=1 errors=timer,warn
    endpoint=xmldb://localhost:3808
and here is the single XMark benchmark 3 operation:
xmark003:
  ops:
    query:
      xquery version "3.1";
      for $b in /site/open_auctions/open_auction
      where $b/bidder[1]/increase/text() * 2 <= $b/bidder[last()]/increase/text()
      return <increase
        first="{$b/bidder[1]/increase/text()}"
        last="{$b/bidder[last()]/increase/text()}" />
NoSQLBench is invoked as a command which names the workload file and a series of scenarios to run.
In our case this runs the scenarios reset and xmark003 in order, and sets the (script-defined) auctionfile
parameter to the name of the XML database file to install (part of the reset scenario).
Metrics from the run can, by default, be found in NoSQLBench log files. The default
result_success
metric from the read activity tells us how the benchmark has performed.
A Comparison of XMark xmlgen and xmlgen2 Auction Data
In order to demonstrate that our new XML generation library, together with the XML:DB adapter,
is arguably adequate as an XML benchmark system,
we appeal to the XMark benchmark. First we present the results of
running a selection of XMark benchmarks using auction sites generated both by the original
XMark xmlgen and by our new xmlgen2. Time pressure has prevented
us from running all the benchmarks, but we have tried to select a broad, illustrative
sample. Then, we discuss the results.
Looking at the table, we can see that most benchmarks result in a very slightly slower
runtime for xmlgen2
compared to that for XMark xmlgen.
We speculate that the uniform reason is the consistently slightly larger database, which arises
because the VDS mechanism we use to generate text yields larger paragraphs.
Even so, the differences never exceed 5-10%.
We consider in more detail the queries that differ most significantly:
Query 3
Return the first and current increases of all open
auctions whose current increase is at least twice as
high as the initial increase.
The xmlgen2 numbers were surprisingly better.
A variation in the distribution of the bid values means that
the xmlgen2 database yields only 1862 matching result elements,
while the XMark xmlgen database yields 3048 elements.
When we control for this in xmlgen2 by adjusting the query
to return an element if the final bid is at least 1.4 times greater,
the query yields 3072 elements and the run time is almost identical
to that of XMark xmlgen.
Query 8
List the names of persons and the number of items
they bought. (joins person, closed auction)
We were only able to use the 10MB database for this query;
the query was too slow to measure reliably on the 100MB database.
This reflects a known issue in
eXist-db which will be addressed in Elemental
in the near future.
By manually rewriting the where clauses as predicates we were
able to re-formulate the same query to utilise the defined indices.
The rewritten query could then easily
be executed against the full 100MB database,
and this run completes in an acceptable time.
Query 9
List the names of persons and the names of
the items they bought in Europe. (joins person,
closed auction, item)
This query again had to be reformulated, in the same way as Query 8 was.
Particularly reassuring was the observation that the number of result
elements returned by the query was close but not identical:
2497 for xmlgen2
and 2568 for XMark xmlgen.
This suggests we have managed to select realistic distributions to
replicate the ones from XMark xmlgen.
Query 14
Return the names of all items whose description
contains the word gold.
For xmlgen2 we query for white instead of gold, as its number of
occurrences (1619) in the xmlgen2 database is close to the number of
occurrences of gold (1689) in
the XMark xmlgen database.
Query 15
For only closed auctions, print the keywords in emphasis in annotations.
There are fewer results in the xmlgen2 database
(1440 vs 1890),
but the performance is slightly lower, possibly
due to the greater raw size of the text fields generated by xmlgen2.
Query 17
Return the person records with an empty homepage.
This returns a very large result set, and the cost of sending the result set over
the wire from
the database dwarfs the cost of the query itself. We could improve our adapter implementation
to reduce this effect,
but the easy solution is to run a query which simply counts the result set.
This is the version we provide a result for.
Interestingly, on first running this query we saw a significantly quicker
result for xmlgen2 (0.851s). On
investigation we noted that our
generator was unconditionally creating <homepage> elements within
<person> records,
rather than conditionally creating them with a 50% probability, as in the original
XMark xmlgen.
We quickly remedied this in xmlgen2,
and on re-running this query benchmark we found the reported
results. This is reassuring. The benchmark queries have allowed us to identify a mismatch
between XMark xmlgen and our new xmlgen2.
Conclusions and Future Work
First of all, we have validated that our new xmlgen2 (VDS) generator
can generate an auction site sufficiently similar to that of XMark xmlgen
that the same set of queries executes correctly.
Second, we are able to run performance tests against both versions of the auction
database,
in a range of sizes, and get repeatable results.
Our NoSQLBench driver extension is responsible for this.
Third, our results show that the selection of XMark queries we have tested
behave consistently across auction databases generated by XMark xmlgen
and xmlgen2 (VDS). While results are not identical, they are close enough
to show that our system is generating a sufficiently similar auction database,
and we have been able to identify and repair flaws in xmlgen2 on the basis
of discrepancies in the performance of the XMark benchmarks.
This all means that moving forward, someone who wishes to benchmark their XML Database
can confidently adopt our NoSQLBench adapter and xmlgen2 to easily run
XMark benchmarks. If they have maintained XMark benchmarks with the original
XMark xmlgen they can carry out a one-off comparison of their benchmark
results with both systems, and then confidently adopt xmlgen2 and NoSQLBench
as a modern testbed with which to build a continuous performance regression test setup.
Future Work
Integrate xmlgen2 generation directly into NoSQLBench.
We have tested using a number of variations of the XMark auction database
which we have pre-generated at various sizes, and re-generated when we
discovered flaws in the generator.
An additional NoSQLBench adapter to directly call the generator would remove
a manual step from our testing, and take us close to a 1-step testing invocation.
The more advanced version of this would generalise from just using the auction
site generator, to any top level generator class built upon the
xmlgen2 libraries.
Make the xmlgen2 generator concurrent. At present the generator is
resolutely single-threaded,
as the internal architecture contains a single central XML context state managing
the tree of XML
being generated. A re-architecture to reduce or remove this dependency, and to introduce
threading,
would allow far better exploitation of modern multi-core systems.
This would realistically make the (re)generation of test XML
data interactively fast, and better support 1-step invocation.
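Because every generated value is already a pure function of the cycle number, one plausible approach is to generate disjoint cycle ranges on separate threads and concatenate the resulting fragments in order, producing the same document as a single-threaded run. The following hypothetical Java sketch illustrates that idea; it is not a design commitment for xmlgen2.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.stream.Collectors;
    import java.util.stream.LongStream;

    // Hypothetical sketch: parallel generation of deterministic XML fragments.
    public class ParallelGenSketch {

        static String person(long cycle) {
            // A pure function of the cycle number, as with VDS bindings.
            return "  <person id=\"person" + cycle + "\"><name>person"
                    + (cycle % 10_000) + "</name></person>\n";
        }

        public static void main(String[] args) throws Exception {
            long total = 1_000;
            int chunks = Runtime.getRuntime().availableProcessors();
            long chunkSize = (total + chunks - 1) / chunks;

            ExecutorService pool = Executors.newFixedThreadPool(chunks);
            List<Future<String>> parts = LongStream.range(0, chunks)
                    .mapToObj(i -> pool.submit(() ->
                            LongStream.range(i * chunkSize, Math.min((i + 1) * chunkSize, total))
                                    .mapToObj(ParallelGenSketch::person)
                                    .collect(Collectors.joining())))
                    .collect(Collectors.toList());

            StringBuilder doc = new StringBuilder("<people>\n");
            for (Future<String> part : parts) {
                doc.append(part.get()); // concatenate in cycle order
            }
            doc.append("</people>\n");
            pool.shutdown();
            System.out.print(doc);
        }
    }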
We are considering using xmlgen2 to generate simulated and anonymised test data which has the
performance characteristics of important customer data. With this we can further enhance the
test suites
which ensure that the databases we are developing continue to perform well for real-world
customers.
We intend to submit our new NoSQLBench adapter(s) to the NoSQLBench project for their
inclusion.
References
[Difallah 2013]
Difallah, Djellel Eddine, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux.
OLTP-Bench: An extensible testbed for benchmarking relational databases. Proceedings of the VLDB Endowment, Volume 7, Number 4, Pages 277-288, 2013, VLDB Endowment. doi:https://doi.org/10.14778/2732240.2732246
[Paxton 2024]
Paxton, Alan, and Adam Retter.
Modern Benchmarking of XQuery and XML Databases. XML Prague 2024, Conference Proceedings, Pages 211-239.
[Schmidt 2002]
Schmidt, Albrecht, Florian Waas, Martin Kersten, Michael J. Carey, Ioana Manolescu, and Ralph Busse.
XMark: A benchmark for XML data management. VLDB '02: Proceedings of the 28th International Conference on Very Large Databases, Pages 974-985, 2002, Elsevier. doi:https://doi.org/10.1016/B978-155860869-6/50096-2
[Lu 2005]
Lu, Hongjun, Jeffrey Xu Yu, Guoren Wang, Shihui Zheng, Haifeng Jiang, Ge Yu, and Aoying Zhou.
What makes the differences: benchmarking XML database implementations. ACM Transactions on Internet Technology (TOIT), Volume 5, Number 1, Pages 154-194, 2005, ACM, New York, NY, USA. doi:https://doi.org/10.1145/1052934.1052940
[YCSB10]
Cooper, Brian F., Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears.
Benchmarking cloud serving systems with YCSB. Proceedings of the 1st ACM Symposium on Cloud Computing, Pages 143-154, 2010. doi:https://doi.org/10.1145/1807128.1807152
[Mlýnková 2008]
Mlýnková, Irena.
XML benchmarking: Limitations and opportunities. Technical Report, Department of Software Engineering, Charles University, 2008.
Benchmarking cloud serving systems with YCSB,Proceedings of the 1st ACM symposium on Cloud computing, Pages 143-154, 2010. doi:https://doi.org/10.1145/1807128.1807152