Thompson, Will. “Automatically Denormalizing Document Relationships.” Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). https://doi.org/10.4242/BalisageVol19.Thompson01.
Will Thompson leads software development on core technologies for O'Connor's Online,
a web-based legal research platform that complements the company's expanding library
of mostly home-grown (and mostly Texas-based) legal books. Will works on a wide array
of software, from back-end editorial toolchains to customer-facing search engines.
Native XML databases provide no exception to the problem that data may not be easily
contained by any single data storage idiom. Many-to-many relationships, in particular,
present a unique problem for documents, as strategies for joining across documents
are a potential minefield of software maintenance and performance problems. Automatic
denormalization shifts the responsibility for managing relationships to write-time,
making an explicit trade-off for simplicity and speed at runtime. This paper discusses
existing strategies for managing relationships across documents and explores design
patterns and use cases for performing automatic denormalization and their trade-offs.
It seems unusual that in an industry known for constant innovation and change, it
would be common practice to build systems using forty-year-old technology. But relational
databases still remain the default choice in most applications.
The relational model was conceived “primarily as a tool to free users from the frustrations
of having to deal with the clutter of [database] storage representation details” [Codd1979], and its normalized representational form was intended to ensure stored data “would
be free from [the] insertion, update, and deletion anomalies” [Codd1990] that produce inconsistent data, while also providing a more flexible data model
from which developers could perform ad-hoc queries. In addition to the system design
and maintenance benefits from abstracting a data storage and query layer, constraints
in storage hardware capacity, speed, and reliability made the relational model a prudent
choice for system designers. These advantages have persisted for decades as the technology
has matured.
Early predictors hoped that the relational model would be displaced by something superior
from the wave of NoSQL databases that have arrived over the last two decades. But
now that practitioners have gained experience with them, it seems more likely that
a one-size-fits-all database model will not be realized, and relational databases
will continue to be used alongside a variety of alternative database models, all with
overlapping advantages and drawbacks.
The overarching constraint of the relational model is that while it is ideal for highly
structured and naturally relational data, it requires hierarchical or document-like
data to be deconstructed from its natural form and additional work to insert and retrieve
that data. Yet hierarchies exist everywhere, in print documents, web pages, emails,
message threads, and Facebook posts, as well as abstract and entity data models. Documents
are a natural structure, and document databases - one type under the NoSQL umbrella
- have emerged as a natural and native means to store and query in document idioms.
XQuery databases, in particular, allow a developer to store a document as XML (and
sometimes JSON) and insert, transform and query documents with a language which is
both natively designed with insight into the rich semantics of its underlying data
and highly expressive at navigating it.
One common problem facing developers, however, is that real-world data do not typically
fit neatly into a relational or a document model. Instead, developers are forced to
design with some degree of impedance mismatch. It is important to focus our attention
on these mismatched design aspects because the rough edges of a design are the parts
most likely to generate artifacts that will hinder application development.
Most relational databases offer some level of support for XML, though typically with
very limited XQuery and XPath features based on older versions or working drafts[1][2] and with limited indexing functionality.[3][4] In cases where data is highly structured and only minimally document-like, a relational
database can still be a good bet if those parts don’t require complex querying.
In cases where a majority of the data would be ideally modeled using documents, the
idiomatic environment provided by XQuery databases is a better bet. However, documents
are not always self-contained. They frequently describe relationships: books, articles,
and web pages link to other books, articles, and web pages; entities reference other
entities. Conventional methods for handling relationships across documents using XQuery
and XPath language constructs may be sufficient in some cases. However, systems grow
in complexity over time as they store larger volumes of data, add support for varieties
of data types, and handle more sophisticated modes of querying and visualizing data,
like search grammars or facets.[5] Complex relationships are the crumpled design edges of document database models,
and as a system scales they can become cumbersome to organize, with query performance
that can be difficult to predict.
Automatic denormalization offers a simple set of patterns that reduce this complexity
by shifting the responsibility of managing relationships from run-time to write-time,
simplifying code and improving query performance. This paper will examine conventional
methods for coping with document relationships and explore design patterns for automatic
denormalization and their trade-offs.
Many-to-many relationships
One-to-one and one-to-many relationships are both hierarchical and are thus isomorphic
to the XML document object model. These structures can be modeled and contained completely
within a single XML document [Williams2000]. But to model a many-to-many relationship, a child element would need to have multiple
parents, which is not possible in the XML document model.
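For example, a client with several phone numbers is a one-to-many relationship that
nests naturally inside a single document (the element names here are only illustrative):

<client>
  <full-name>Jane Doe</full-name>
  <phone type="home">555-0114</phone>
  <phone type="office">555-0192</phone>
</client>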
For example, a database may need to store a model for attorney and client entities.
Most properties of each entity are hierarchical, but naturally an attorney will have
multiple clients, and a client may hire multiple attorneys. How is this difference
reconciled in a system whose data model is designed for self-contained documents?
Recommendation 1: Don’t do that
Typically, the initial advice on modeling these types of relationships in XML is to
simply avoid them. There are a few ways this can be done. The first and simplest is
to discard all data about the many-to-many relationships, resulting in documents that
only contain data directly describing the entity [Williams2002]:
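A sketch of such documents, with illustrative element names and values:

<attorney>
  <full-name>Johnny Cochran</full-name>
  <state>CA</state>
</attorney>

<client>
  <full-name>O. J. Simpson</full-name>
  <state>CA</state>
</client>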
These documents are very limited in scope and so also very limited in utility for
any application. Client documents would have no information about the attorneys representing
them and the same would be true for attorney documents with respect to their clients.
In some limited cases, this option may be feasible, but generally only when there
is a real-world many-to-many relationship that is not important to the application.
If the application requires the relationship to be modeled in XML, another potential
solution is reducing the real-world many-to-many relationship to a one-to-many relationship
in XML by scoping the documents to represent only one side of the relationship [Williams2002]. Using the above example, this would require modeling either the attorney or the
client as a document root and the other entity as a child element. However, if the
element-assigned entity is related to more than one document-assigned entity, this
will result in duplicate records of the child element-assigned entity.
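For example, if attorneys are chosen as the document-assigned entity, a client represented
by two attorneys is duplicated in both attorney documents (a sketch; names and values
are illustrative):

<attorney>
  <full-name>Johnny Cochran</full-name>
  <state>CA</state>
  <client>
    <full-name>O. J. Simpson</full-name>
    <state>CA</state>
  </client>
</attorney>

<attorney>
  <full-name>Robert Shapiro</full-name>
  <state>CA</state>
  <client>
    <!-- duplicate record of the same client -->
    <full-name>O. J. Simpson</full-name>
    <state>CA</state>
  </client>
</attorney>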
In practice, this approach may satisfy the requirements for some applications, but
generally only if the document-assigned entity is the only one queried directly (because
querying the child entity would require deduplication), and if the child-assigned
entity is never updated (because updates would require special handling for finding
and updating its duplicates). Even when this can work, the trade-off is risky for
any application whose requirements are not fixed indefinitely, since going outside
of those constraints could entail complex workarounds or even rewriting much of the
application's back-end.
Recommendation 2: Keys and Joins
The next approach is the most common and amounts to essentially giving up on the document-centric
model in favor of a relational model. In this model, both entities are designed as
independent documents, and they are related using a two-way ID and IDREF convention,
imitating the primary-key and foreign-key design of the relational model [Williams2000] [Williams2002] [Walmsley2015].
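A sketch of this convention, with illustrative id values and an illustrative ref element
carrying the idref:

<attorney id="a1">
  <full-name>Johnny Cochran</full-name>
  <state>CA</state>
  <ref idref="c1"/>
</attorney>

<client id="c1">
  <full-name>O. J. Simpson</full-name>
  <state>CA</state>
  <ref idref="a1"/>
</client>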
This design is intuitive for anyone familiar with relational models, and it solves
the above problem of duplicated entities. However, it introduces several new problems
as a result of the normalized data, all of which stem from the fact that data about
an entity and data about its related entities are now in separate documents. By projecting
a relational design into XML, an application will inherit the responsibilities of
a relational database application: reconstructing and joining data at runtime.
Note
id and idref attributes are typically associated with the xs:ID and xs:IDREF types from XML Schema and the corresponding XPath functions fn:id() and fn:idref(), but those are all scoped within a single document. In the context of database joins,
those types and functions are not very useful, but the attribute names have been used
in these examples for clarity.
Incomplete views
The fundamental problem with references is that accessing a complete view of a document
with references requires traversing it and constructing its missing content dynamically
using joins. For each document we must find its references, get the referenced documents
from the database, and insert them into the expanded view.
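A minimal sketch of such an expansion, assuming the id/idref convention sketched above;
the local:expand function name and the ref element are illustrative:

declare function local:expand($doc as element()) as element()
{
  element { node-name($doc) } {
    $doc/@*,
    $doc/node()[not(self::ref)],
    (: join: fetch each referenced document and inline it in place of the reference :)
    for $r in $doc/ref
    return collection()/*[@id = $r/@idref]
  }
};

local:expand(collection()/attorney[@id = "a1"][1])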
In the context of result sets, this effect is multiplied, and the overhead for generating
complete views on a set of documents is no longer simply a factor of the number of
documents selected by a query, but of that number plus the number of referenced documents
in the result set. Performance-sensitive applications will need to ensure that a small
result set of documents with many references does not result in an explosion of reads.
In some cases, that will not be necessary, or it can be mitigated. For example, in
a page of query results, the application can choose not to expand references when
rendering results and only expand an individual document once it has been selected.
But it is often the case that the application needs to address aspects of the expanded
view in queries, which brings us to the next problem.
When using key-based relationships, search applications may be forced to support queries
with constraints that join data across documents. For example, using the attorneys
and clients model above, we may need to generate a query that selects “attorneys living
on the west coast representing a client on the east coast.” This query includes a
pair of related constraints, each in a different set of documents associated by ID/IDREF
relationships. When the query is executed, it will need to read both sets of documents
with constraints applied and join them together based on the criteria that they are
related via ID and IDREF. Even a selective query with few results may, in the worst
case, produce an explosion of document reads due to the fact that the selectivity
of the query hinges only on the intersection of constraints, not any per-document
constraint.
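A sketch of such a query, using the variable names referenced below; the state values
are hypothetical stand-ins for the two coasts, and the ref/idref convention follows
the earlier sketch:

let $attorneys-west := collection()/attorney[state = ("CA", "OR", "WA")]
let $clients-east := collection()/client[state = ("NY", "NJ", "CT")]
(: join the two sets on the ID/IDREF relationship :)
return $attorneys-west[ref/@idref = $clients-east/@id]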
If either $attorneys-west or $clients-east selects a large number of documents, the overhead to perform this query will be high.
It might be improved by selecting the east coast clients first, using their ids to
add an additional constraint to the attorneys, increasing selectivity:
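A sketch of the serialized version, with the same hypothetical names and values:

let $clients-east := collection()/client[state = ("NY", "NJ", "CT")]
let $east-ids := $clients-east/@id/string()
(: use the client ids as an extra predicate on the attorney selection :)
return collection()/attorney[state = ("CA", "OR", "WA")][ref/@idref = $east-ids]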
However, it is possible that $clients-east still leads to overhead or that serializing the queries prevents the query engine
from executing them in parallel. Ultimately, the best optimizations may be data and
implementation-dependent, resulting in queries that are difficult to write and maintain,
especially as the content of the database changes.
Another common application requirement is displaying search facets, wherein aggregated
data about a query’s full set of results are displayed alongside a subset of sorted
results displayed as snippets. Conventionally, search snippets are hyperlinked to
a more complete view of the item represented in the paged set, and facets are hyperlinked
to a query that returns a new result set constrained by the property and value indicated
by the facet.
Most XQuery databases provide configurable indexes for efficiently querying specified
path or element values that can be used to facilitate fast faceted search calculations.
It is not currently possible, however, to create an index for a document using values
from a referenced document, so there are no shortcuts around manual joins using indexes.
Like the first example, calculating facets for documents outside the result set requires
joining across both sets of documents using XQuery:
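For example, a facet of attorney names over a set of client results might be calculated
with a join like the following sketch (names, state values, and the facet output element
are illustrative):

let $results := collection()/client[state = ("NY", "NJ", "CT")]
let $attorneys := collection()/attorney[@id = $results/ref/@idref]
for $a in $attorneys
let $count := count($results[ref/@idref = $a/@id])
order by $count descending
return <facet name="{$a/full-name}" count="{$count}"/>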
Best practices for handling these types of use cases vary across database implementations,
with indexing and query-optimization conventions that can be unintuitive,[6][7] and building a solution to support joins generically within an application can be
tricky and still lead to variable or disappointing performance. Armed with the implementation-recommended
indexing and query patterns, it is possible to minimize the impact of joins, but depending
on the database implementation and application, it may still be necessary to put hard
limits on the size of result sets or drop some of the harder-to-calculate information
from the aggregate queries to prevent high-latency queries.
Runaway code
One issue that may be less obvious at the outset, but can be insidious and have a strong
impact, is the fragmenting effect on code that occurs when a new paradigm is introduced
into an application, along with the maintenance problems that result from adding complexity
and managing it as the application scales. At a high level, combining self-contained
document models with relational-like data models ties together contrasting concepts
for reading and writing information within an application. As outlined above, solutions
built to handle querying across related documents will introduce coding patterns that
diverge from those of fully coherent documents.
Software developers typically follow design principles and utilize coding patterns
as they work, in many cases religiously. Despite the fact that many principles overlap
or conflict, and various trends come and go over time, they all seem to work toward
the same goal: preventing code complexity from ballooning out of control. Whether
or not any particular set of maintenance strategies is more effective than another
is a topic of debate (and unfortunately very little research), but experience has
taught the industry as a whole that some strategy is necessary to continuously reduce
complexity through the implementation of abstractions, modularization, limiting dependencies,
and code reuse.
Applications take on expanded requirements over time, and it is critical for maintainers
to carefully examine their designs and refactor as code is updated to implement new
features. The different coding patterns introduced in examples above to support the
diverging document concepts can negatively affect nearly every aspect of code maintenance.
Even in applications that are not performance critical, avoiding the introduction
of these types of queries and runtime dependencies can result in code that is simpler
to reason about, organize, and maintain as it scales.
Proposed solution: automatic denormalization
Automatic denormalization is not a new idea. The solution presented here is a type
of precomputation similar to indexed views or materialized views in relational databases.
In relational databases, the precomputed views are typically implemented with the
goal of improving read performance. In this instance, while performance is a goal,
it is not the primary goal.
The problems outlined above demonstrate the difficulty in integrating documents that
support many-to-many relationships alongside documents whose relationships are self-contained.
We can take significant steps to resolve this problem by denormalizing the many-to-many
relationships as they are ingested into the database. Instead of storing references
to canonical documents using IDREFs, the referencing document will store a complete
copy of the referenced document. In a way, documents still resemble the relational-like
document model, but with references swapped for copies of the referenced data.
The denormalization process can be broken down into two essential subprocesses: one
for creating the initial many-to-many relationship between two documents and another
for updating any document that is part of a relationship.
Note
XQuery update syntax differs substantially across database implementations, so the
following examples are presented using figures only.
Relationship create
Creating a denormalized many-to-many relationship follows steps similar to those of the
normalized two-way IDREF solution, and in both cases, both documents must be updated. In the
normalized create, only an element denoting a reference to the other is added; in
the denormalized create, the contents of the referenced document are used wholesale
in place of the reference. One new requirement is that a denormalized create must
read-on-update to exchange content. Typically, lock contention will not differ when
reading from a document that is also updated in the same transaction, so the added
overhead of denormalizing this operation can be minimal.[8]
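As a hedged sketch of the outcome, after a denormalized create each document carries a
copy of the other; the ref:copies markup follows the pattern used in the examples below,
and the data values are illustrative:

<attorney id="a1">
  <full-name>Johnny Cochran</full-name>
  <state>CA</state>
  <ref:copies>
    <ref:copy>
      <client id="c1">
        <full-name>O. J. Simpson</full-name>
        <state>CA</state>
      </client>
    </ref:copy>
  </ref:copies>
</attorney>

<client id="c1">
  <full-name>O. J. Simpson</full-name>
  <state>CA</state>
  <ref:copies>
    <ref:copy>
      <attorney id="a1">
        <full-name>Johnny Cochran</full-name>
        <state>CA</state>
      </attorney>
    </ref:copy>
  </ref:copies>
</client>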
This example creates a relationship on documents with no existing relationships. However,
when relating or updating documents that already have denormalized relationships,
the copied entities stored in <ref:copy> should not also include their copied references. From the perspective of the containing
entity, the relationships within a copied entity are either circular, or they will
reference documents that do not have references back, leaving no way to update them
consistently when the canonical document is updated. Therefore, an entity copy transformation
should be used to remove existing denormalized relationships from any dereferenced
copy.
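A sketch of such an entity copy transformation in XQuery; the function name and the
ref namespace URI are illustrative:

declare namespace ref = "http://example.com/ns/ref";  (: illustrative URI :)

(: Copy an entity, dropping any denormalized copies it already contains
   so that copies are never nested inside other copies. :)
declare function local:entity-copy($n as node()) as node()?
{
  typeswitch($n)
    case element(ref:copies) return ()
    case element() return
      element { node-name($n) } {
        $n/@*,
        for $child in $n/node() return local:entity-copy($child)
      }
    default return $n
};

local:entity-copy(collection()/attorney[@id = "a1"][1])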
Related document update
Now that references are replaced with denormalized copies of the canonical entity’s
data, updating a document requires updating every document on the other side of the
relationship as well. Similar to the normalized model, where reading a complete document
required expanding every reference to a canonical document, now to maintain consistency,
any document update requires propagating the new information to its denormalized copy
in referenced documents.
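Where the XQuery Update Facility is available (BaseX, for example, implements it),
propagating an update might be sketched as follows; the namespace URI and id value
are illustrative, and the inline copy simply strips nested ref:copies as described
by the entity copy transformation above:

declare namespace ref = "http://example.com/ns/ref";  (: illustrative URI :)

let $updated := collection()/attorney[@id = "a1"]
for $copy in collection()//ref:copies/ref:copy/attorney[@id = $updated/@id]
return
  replace node $copy with
    element { node-name($updated) } {
      $updated/@*,
      (: strip nested copies, per the entity copy transformation :)
      $updated/node()[not(self::ref:copies)]
    }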
This step is the essence of the denormalization process because it is responsible
for maintaining consistency across related documents. It deliberately defies one of
the original stated design goals of the relational model and normal form - to free
relations from insert, update, and delete dependencies. The primary reason behind
that goal was the expectation that managing dependencies would quickly become complex
or impractical (or a problem writing to storage hardware would occur), leading to
“update anomalies galore” [Codd1990], and inconsistencies across related data would accumulate throughout the database.
The automatic denormalization process proposed here can be implemented in a single
component capable of handling any update operation. As a practical matter, maintenance
of the component to support changes to the data model is simple, and in many cases
(depending on the database) the component can be written to handle generic references,
such that code changes would only be required if the denormalization schema itself
changes. BaseX, eXist-db, and MarkLogic all provide methods to apply cascading updates
atomically, within a single transaction, eliminating a class of possible undesirable
side effects from managing redundant data, one of the motivating factors behind normalization.
Because the number of relationships per document will typically not grow relative
to the size of the database, avoiding joins across large portions of the database
can also be a worthwhile performance trade-off for cascading document updates. However,
the comparison is not apples-to-apples, since writing to a database is a more resource
intensive operation than reading from it. Also, the details of lock handling across
database implementations differ (and some are configurable), so performance testing
for any specific implementation and configuration is appropriate if the likely trade-off
is not clear from an application’s requirements or usage information.
Delete
Denormalized delete follows the same pattern as update, but without the need for the
entity copy transformation step. When a document is deleted, iterate over its references
and remove the copy in any referenced document. Deleting a relationship follows the
inverse pattern of create.
Read
Now that work has been done on the create/update/delete side to ensure consistent
denormalized relationships, documents have been restored to a prototypical document
structure.
Rendering documents with relationships no longer necessitates a runtime expansion
process to produce a complete view, and it is possible to query related documents
without joining. The additional consistency across queries and rendering components
should provide more opportunities for code reuse and less complexity to maintain overall.
The pathological query from above - “return any attorney working on the west coast
representing a client on the east coast” - is now simplified and is structured more
optimally for query evaluation. Additionally, indexing state elements will now be effective at preventing the database from reading documents
from disk before a result set is determined.
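A sketch of the simplified query, again with hypothetical state values and an illustrative
ref namespace URI:

declare namespace ref = "http://example.com/ns/ref";  (: illustrative URI :)

collection()/attorney
  [state = ("CA", "OR", "WA")]
  [ref:copies/ref:copy/client/state = ("NY", "NJ", "CT")]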
Similarly, the facet-calculating query example from above can also be simplified without
the need for joining and can benefit even further by leveraging the indexing capabilities
in MarkLogic or eXist. Because those databases provide functions to access index values
directly, indexing the path /client/ref:copies/attorney/full-name permits even more efficient calculation of facets.
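Even without those index functions, the facet calculation is now join-free (a sketch;
the state values and ref namespace binding are illustrative, and where available the
database's index value functions can replace the distinct-values() scan):

declare namespace ref = "http://example.com/ns/ref";  (: illustrative URI :)

let $results := collection()/client[state = ("NY", "NJ", "CT")]
for $name in distinct-values($results/ref:copies/ref:copy/attorney/full-name)
let $count := count($results[ref:copies/ref:copy/attorney/full-name = $name])
order by $count descending
return <facet name="{$name}" count="{$count}"/>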
Without the need to join data across documents, the need for nebulous (and often unsuccessful)
optimizations is avoided, and with related documents conforming to the self-contained
structure of other documents, integrating two disparate data models is no longer necessary,
avoiding a class of thorny code maintenance problems.
The resource overhead of this technique is driven by the frequency of updates to documents
that have relationships and by the number of relationships per document.
One must also consider the overlap of relationships across documents because concurrent
updates to multiple documents with relationships to a single document will all contend
for the ability to lock and write to that document when the denormalized updating
occurs. In cases where a long chain of cascading updates requires changing the same
document multiple times, and update conflict may occur that prevents the transaction
from completing. In that case, a workaround would be required to process all updates
to the single document in memory first, and then commit only the final result to the
database.
Only recommended for two-way joins
The technique described here is recommended only for direct, two-way relationships; it
does not extend transitively across chains of related entities. For example, a LawFirm entity may have a many-to-many relationship with an Attorney entity, which in turn has a many-to-many relationship with a Client entity.
Using only this technique, it is not possible to get denormalized data
about Clients from LawFirms and vice versa. It is possible to modify the entity copy transformation to retain
references for entities, such that an entity’s references and its references’ references are part of the cascading update. But this is dangerous: it can produce a combinatorial
explosion of updates and quickly becomes impractical for most use cases.
Extensions and Optimizations
Entities and sub-entities
In real-world models, it is often unrealistic to insist that relationships occur only
between document roots. Instead, it is more idiomatic to consider a document as an
entity that may recursively contain descendant sub-entities, such that any entity or
sub-entity can be related to any other entity or sub-entity.
Additionally, it may be helpful when generating a copy of a sub-entity to also include
data about that entity’s parent or ancestors. This pattern can be managed through
modifications to the entity copy transform (used above only to remove references)
and to update behavior. When an entity is copied, process the parent or ancestors
recursively, and when an entity is updated, trigger updates on its child or descendant
sub-entities. The structure of the denormalized ancestor markup should be determined
by the application. In some cases, the application may benefit from maintaining the
nested structure of the related document in its copy and annotating the directly-related
entity:
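For example, a copy of a matter sub-entity might keep its place inside the attorney
that contains it, with an illustrative ref:related attribute marking the directly-related
entity (a sketch using the same data as the variant below):

<ref:copies>
  <ref:copy>
    <attorney>
      <name>Johnny Cochran</name>
      ...
      <matter ref:related="true">
        <name>People v. O. J. Simpson</name>
        <judge>Lance Ito</judge>
        <dt-arraignment>1994-06-20T09:00:00-07:00</dt-arraignment>
      </matter>
    </attorney>
  </ref:copy>
</ref:copies>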
In another, it may be more convenient to break the hierarchy into siblings and annotate
the entities:
<ref:copies>
<ref:copy>
<ref:context>
<matter>
<name>People v. O. J. Simpson</name>
<judge>Lance Ito</judge>
<dt-arraignment>1994-06-20T09:00:00-07:00</dt-arraignment>
</matter>
</ref:context>
<ref:parent>
<attorney>
<name>Johnny Cochran</name>
...
</attorney>
</ref:parent>
</ref:copy>
</ref:copies>
The annotation schema is not prescriptive and should be adapted to fit the needs of
the application, so other variations may be appropriate. For example, relationship
annotation elements could be children of entities instead of parents, or they may
be attributes on the copied element.
Overlapping hierarchies
Sometimes a model may not call explicitly for a many-to-many relationship, but it
may have naturally overlapping one-to-many relationships. Overlapping data structures
are not supported by the XML model directly, but they can be approximated using many-to-many
relationships.
In this scenario, one of the trees may be modeled as the “base” document, and overlapping
trees may be modeled using many-to-many relationships between the shared entity of
the base document and the parent entities of the overlapping documents. Using automatic
denormalization, overlapping hierarchies can be stored as separate and completely
coherent documents.
Sparse entity copy transformations
If, within the context of a particular type of document, it is known ahead of time
that full copies of referenced entities are unnecessary for the application, then
the entity copy transformation can be extended to exclude some parts of the copy.
Depending on the application, it may be appropriate to create rules based on the type
of entity copied, the type of entity containing the copy, or a combination of the
two.
declare function ref:make-copy($source as element(), $target as element()) as element()
{
  (: Sketch: $source below stands in for each rule's actual copy transformation. :)
  typeswitch($source)
    (: Rules based on copied entity :)
    case element(attorney) | element(client) return
      (: From: Attorney or Client + To: Anything rules :)
      $source
    case element(matter) return
      (: Combination rules :)
      typeswitch($target)
        case element(client) return (: From: Matter + To: Client rules :) $source
        default return (: From: Matter + To: Not-Client rules :) $source
    default return
      (: Rules based on containing entity :)
      typeswitch($target)
        default return $source
};
Nearline reference updates (allow dirty reads)
If the number of relationships or the frequency of updates per document is high, denormalization
may be hard to justify. However, in some cases it may not be important for updated
data to be consistent in all related documents immediately (within the same transaction).
In these instances, we may implement a queue for processing updates, such that when
a document is updated, cascading updates to related documents are only added to the
queue to be processed later. This method allows the canonical document update to commit
quickly, while the related data becomes “eventually consistent.” Queued denormalization
updates also provide an opportunity to tune queue processing to operate within a particular
range of resource utilization.
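A minimal sketch of what a queued task might look like; the element names, namespace URI,
and status attribute are illustrative, and a scheduled job would periodically select
pending tasks and apply the related-document updates described above:

<queue:task xmlns:queue="http://example.com/ns/queue"
            status="pending"
            enqueued="2017-06-20T09:00:00-05:00">
  <queue:entity-id>a1</queue:entity-id>
</queue:task>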
Conclusion
NoSQL databases have emerged to serve an increasing demand to handle larger, more complex,
and more varied data structures without resorting to elaborate workarounds or non-database
storage systems. Initially, many NoSQL databases were hyped
as replacements for relational databases. But as the hype cycle dies down and these
systems mature, it seems more likely that the future holds not one, but a variety
of database models, each tailored for a subset of data storage needs.
Many aspects of software development can be presented as a set of tradeoffs, and judgment
is necessary to make good bets that will affect the speed of development, application
performance, and maintainability. Even taking an optimistic view of the solutions
available from the variety of new data storage models, applications grow and change
over time, and it is unlikely that any one model will perfectly meet an application’s
requirements over its lifetime. Rather, when confronted with facets
of a design that are imperfect, finding the most reasonable tradeoffs is one of the
most critical aspects of software development.
The techniques described here provide a simple set of patterns that offer reasonable
tradeoffs for a common set of problems facing XQuery database developers. By explicitly
creating and encapsulating the management of one set of dependencies - specifically,
dependencies the relational model intended to eliminate outright - it is possible
to eliminate a different set of runtime dependencies that can have a more troubling
influence on the development and maintenance of an application throughout its lifecycle.
References
[Codd1979] Codd, E. F. "Extending the Database Relational Model to Capture More Meaning." ACM
Transactions on Database Systems 4.4 (1979): 397-434. Print. doi:https://doi.org/10.1145/320107.320109.
[Codd1990] Codd, Edgar F. The Relational Model for Database Management: Version 2. Reading, Mass.:
Addison-Wesley, 1990. Print.
[Williams2000] Williams, Kevin, Michael Brundage, and Patrick Dengler. Professional XML Databases.
Birmingham: Wrox, 2000. N. pag. Print.
[8] In most databases, read locks are non-exclusive and therefore don't prevent concurrency,
while a write lock will block until writing is complete. Whether two documents are
updated to insert a small element (the reference) or a large one (the full copy),
they will both require write locks. It is also common for database implementations
to rewrite an entire document to disk when any part of it has been updated (MarkLogic
is one implementation that does this). In that case, the relative cost of inserting
a larger element is not as high as one might assume, compared to inserting a smaller
one.
Codd, E. F. "Extending the Database Relational Model to Capture More Meaning." ACM
Transactions on Database Systems 4.4 (1979): 397-434. Print. doi:https://doi.org/10.1145/320107.320109.