Introduction
Relational and document databases
It seems unusual that, in an industry known for constant innovation and change, it is common practice to build systems using forty-year-old technology. Yet relational databases remain the default choice for most applications.
The relational model was conceived “primarily as a tool to free users from the frustrations of having to deal with the clutter of [database] storage representation details” [Codd1979], and its normalized representational form was intended to ensure stored data “would be free from [the] insertion, update, and deletion anomalies” [Codd1990] that produce inconsistent data, while also providing a more flexible data model from which developers could perform ad-hoc queries. In addition to the system design and maintenance benefits from abstracting a data storage and query layer, constraints in storage hardware capacity, speed, and reliability made the relational model a prudent choice for system designers. These advantages have persisted for decades as the technology has matured.
Early predictions held that the relational model would be displaced by something superior from the wave of NoSQL databases that has arrived over the last two decades. But now that practitioners have gained experience with them, it seems more likely that no one-size-fits-all database model will emerge, and relational databases will continue to be used alongside a variety of alternative database models, each with overlapping advantages and drawbacks.
The overarching constraint of the relational model is that while it is ideal for highly structured and naturally relational data, it requires hierarchical or document-like data to be deconstructed from its natural form, adding work to insert and retrieve that data. Yet hierarchies exist everywhere: in print documents, web pages, emails, message threads, and Facebook posts, as well as in abstract and entity data models. Documents are a natural structure, and document databases - one type under the NoSQL umbrella - have emerged as a natural and native means to store and query data in document idioms. XQuery databases, in particular, allow a developer to store a document as XML (and sometimes JSON) and to insert, transform, and query documents with a language that is natively designed around the rich semantics of its underlying data and highly expressive at navigating it.
One common problem facing developers, however, is that real-world data do not typically fit neatly into a relational or a document model. Instead, developers are forced to design with some degree of impedance mismatch. It is important to focus our attention on these mismatched design aspects because the rough edges of a design are the parts most likely to generate artifacts that will hinder application development.
Most relational databases offer some level of support for XML, though typically with very limited XQuery and XPath features based on older versions or working drafts[1][2] and with limited indexing functionality.[3][4] In cases where data is highly structured and only minimally document-like, a relational database can still be a good bet if those parts don’t require complex querying.
In cases where a majority of the data would be ideally modeled using documents, the idiomatic environment provided by XQuery databases is a better bet. However, documents are not always self-contained. They frequently describe relationships: books, articles, and web pages link to other books, articles, and web pages; entities reference other entities. Conventional methods for handling relationships across documents using XQuery and XPath language constructs may be sufficient in some cases. However, systems grow in complexity over time as they store larger volumes of data, add support for varieties of data types, and handle more sophisticated modes of querying and visualizing data, like search grammars or facets.[5] Complex relationships are the crumpled design edges of document database models, and as a system scales they can become cumbersome to organize, with query performance that can be difficult to predict.
Automatic denormalization offers a simple set of patterns that reduce this complexity by shifting the responsibility of managing relationships from run-time to write-time, simplifying code and improving query performance. This paper will examine conventional methods for coping with document relationships and explore design patterns for automatic denormalization and their trade-offs.
Many-to-many relationships
One-to-one and one-to-many relationships are both hierarchical and are thus isomorphic to the XML document object model. These structures can be modeled and contained completely within a single XML document [Williams2000]. But to model a many-to-many relationship, a child element would need to have multiple parents, which is not possible in the XML document model.
For example, a database may need to store a model for attorney and client entities. Most properties of each entity are hierarchical, but naturally an attorney will have multiple clients, and a client may hire multiple attorneys. How is this difference reconciled in a system whose data model is designed for self-contained documents?
Recommendation 1: Don’t do that
Typically, the initial advice on modeling these types of relationships in XML is to simply avoid them. There are a few ways this can be done. The first and simplest is to discard all data about the many-to-many relationships, resulting in documents that only contain data directly describing the entity [Williams2002]:
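For illustration, a minimal sketch of what such stripped-down documents might look like (the element names and values here are hypothetical, not prescribed by the sources cited):

<attorney id="a1">
  <full-name>Johnny Cochran</full-name>
  <state>CA</state>
</attorney>

<client id="c1">
  <full-name>O.J. Simpson</full-name>
  <state>CA</state>
</client>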
These documents are very limited in scope and so also very limited in utility for any application. Client documents would have no information about the attorneys representing them and the same would be true for attorney documents with respect to their clients. In some limited cases, this option may be feasible, but generally only when there is a real-world many-to-many relationship that is not important to the application.
If the application requires the relationship to be modeled in XML, another potential solution is reducing the real-world many-to-many relationship to a one-to-many relationship in XML by scoping the documents to represent only one side of the relationship [Williams2002]. Using the above example, this would require modeling either the attorney or the client as a document root and the other entity as a child element. However, if the element-assigned entity is related to more than one document-assigned entity, this will result in duplicate records of the child element-assigned entity.
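As a hypothetical sketch, choosing the attorney as the document root would produce documents like the following, with the shared client duplicated under each related attorney (names and structure are illustrative):

<attorney id="a1">
  <full-name>Johnny Cochran</full-name>
  <client id="c1">
    <full-name>O.J. Simpson</full-name>
  </client>
</attorney>

<attorney id="a2">
  <full-name>Robert Shapiro</full-name>
  <!-- the same client, duplicated under a second attorney -->
  <client id="c1">
    <full-name>O.J. Simpson</full-name>
  </client>
</attorney>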
In practice, this approach may satisfy the requirements for some applications, but generally only if the document-assigned entity is the only one queried directly (because querying the child entity would require deduplication), and if the child-assigned entity is never updated (because updates would require special handling for finding and updating its duplicates). Even when this can work, the trade-off is risky for any application whose requirements are not fixed indefinitely, since going outside of those constraints could entail complex workarounds or even rewriting much of the application's back-end.
Recommendation 2: Keys and Joins
The next approach is the most common and amounts to essentially giving up on the document-centric model in favor of a relational model. In this model, both entities are designed as independent documents, and they are related using a two-way ID and IDREF convention, imitating the primary-key and foreign-key design of the relational model [Williams2000] [Williams2002] [Walmsley2015].
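A sketch of this convention, with each document carrying references to the other (the clients and attorneys wrapper elements are illustrative, not a prescribed schema):

<attorney id="a1">
  <full-name>Johnny Cochran</full-name>
  <state>CA</state>
  <clients>
    <client idref="c1"/>
  </clients>
</attorney>

<client id="c1">
  <full-name>O.J. Simpson</full-name>
  <state>CA</state>
  <attorneys>
    <attorney idref="a1"/>
  </attorneys>
</client>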
This design is intuitive for anyone familiar with relational models, and it solves the above problem of duplicated entities. However, it introduces several new problems as a result of the normalized data, all of which stem from the fact that data about an entity and data about its related entities are now in separate documents. By projecting a relational design into XML, an application inherits the responsibilities of a relational database application: reconstructing and joining data at runtime.
Note
id and idref attributes are typically associated with the xs:ID and xs:IDREF types from XML Schema and the corresponding XPath functions fn:id() and fn:idref(), but those are all scoped within a single document. In the context of database joins, those types and functions are not very useful, but the attribute names have been used in these examples for clarity.
Incomplete views
The fundamental problem with references is that accessing a complete view of a document with references requires traversing it and constructing its missing content dynamically using joins. For each document we must find its references, get the referenced documents from the database, and insert them into the expanded view.
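As a rough sketch, an expansion function for the attorney documents above might look like this (the function name and the reference structure follow the illustrative examples in this paper, not a prescribed API):

declare function local:expand-attorney($attorney as element(attorney)) as element(attorney)
{
  element attorney {
    $attorney/@*,
    $attorney/node() except $attorney/clients,
    element clients {
      (: one additional database read per referenced client :)
      for $ref in $attorney/clients/client/@idref
      return collection()/client[@id = $ref]
    }
  }
};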
In the context of result sets, this effect is multiplied, and the overhead for generating complete views on a set of documents is no longer simply a factor of the number of documents selected by a query, but of that number plus the number of referenced documents in the result set. Performance-sensitive applications will need to ensure that a small result set of documents with many references does not result in an explosion of reads. In some cases, that will not be necessary, or it can be mitigated. For example, in a page of query results, the application can choose not to expand references when rendering results and only expand an individual document once it has been selected. But it is often the case that the application needs to address aspects of the expanded view in queries, which brings us to the next problem.
When using key-based relationships, search applications may be forced to support queries with constraints that join data across documents. For example, using the attorneys and clients model above, we may need to generate a query that selects “attorneys living on the west coast representing a client on the east coast.” This query includes a pair of related constraints, each in a different set of documents associated by ID/IDREF relationships. When the query is executed, it will need to read both sets of documents with constraints applied and join them together based on the criteria that they are related via ID and IDREF. Even a selective query with few results may, in the worst case, produce an explosion of document reads, because the selectivity of the query hinges only on the intersection of constraints, not on any per-document constraint.
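A sketch of such a join, assuming $states-west and $states-east hold the relevant state values and the reference structure shown above:

let $attorneys-west := collection()/attorney[state = $states-west]
let $clients-east := collection()/client[state = $states-east]
for $attorney in $attorneys-west
where $attorney/clients/client/@idref = $clients-east/@id
return $attorney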
If either $attorneys-west or $clients-east selects a large number of documents, the overhead to perform this query will be high. It might be improved by selecting the east coast clients first, using their ids to add an additional constraint to the attorneys, increasing selectivity:
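A hypothetical version of that reordering:

let $clients-east := collection()/client[state = $states-east]
return
  collection()/attorney[state = $states-west]
                       [clients/client/@idref = $clients-east/@id]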
However, it is possible that $clients-east still leads to high overhead, or that serializing the queries prevents the query engine from executing them in parallel. Ultimately, the best optimizations may be data- and implementation-dependent, resulting in queries that are difficult to write and maintain, especially as the content of the database changes.
Another common application requirement is displaying search facets, wherein aggregated data about a query’s full set of results are displayed alongside a subset of sorted results displayed as snippets. Conventionally, search snippets are hyperlinked to a more complete view of the item represented in the paged set, and facets are hyperlinked to a query that returns a new result set constrained by the property and value indicated by the facet.
Most XQuery databases provide configurable indexes for efficiently querying specified path or element values that can be used to facilitate fast faceted search calculations. It is not currently possible, however, to create an index for a document using values from a referenced document, so there are no shortcuts around manual joins using indexes. Like the first example, calculating facets for documents outside the result set requires joining across both sets of documents using XQuery:
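For example, a sketch of a facet of attorney names computed for a result set of east coast clients, joined via ID/IDREF (the attorneys wrapper and the facet output element are illustrative):

let $clients-east := collection()/client[state = $states-east]
for $attorney in collection()/attorney[@id = $clients-east/attorneys/attorney/@idref]
let $count := count($clients-east[attorneys/attorney/@idref = $attorney/@id])
order by $count descending
return <facet count="{$count}">{string($attorney/full-name)}</facet>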
Different database implementations have their own, sometimes unintuitive, conventions and best practices for indexing and query optimization in these types of use cases,[6][7] and building a solution to support joins generically within an application can be tricky and still lead to variable or disappointing performance. Armed with the implementation-recommended indexing and query patterns, it is possible to minimize the impact of joins, but depending on the database implementation and application, it may still be necessary to put hard limits on the size of result sets or to drop some of the harder-to-calculate information from the aggregate queries to prevent high-latency queries.
Runaway code
One issue that may be less obvious at the outset, but that can be insidious and have a strong impact, is the fragmenting effect on code that occurs when a new paradigm is introduced into an application, along with the maintenance problems that result from adding complexity and managing it as the application scales. At a high level, combining self-contained document models with relational-like data models ties together contrasting concepts for reading and writing information within an application. As outlined above, solutions built to handle querying across related documents will introduce coding patterns that diverge from those of fully coherent documents.
Software developers typically follow design principles and utilize coding patterns as they work, in many cases religiously. Despite the fact that many principles overlap or conflict, and various trends come and go over time, they all seem to work toward the same goal: preventing code complexity from ballooning out of control. Whether or not any particular set of maintenance strategies is more effective than another is a topic of debate (and unfortunately very little research), but experience has taught the industry as a whole that some strategy is necessary to continuously reduce complexity through the implementation of abstractions, modularization, limiting dependencies, and code reuse.
Applications take on expanded requirements over time, and it is critical for maintainers to carefully examine their designs and refactor as code is updated to implement new features. The different coding patterns introduced in examples above to support the diverging document concepts can negatively affect nearly every aspect of code maintenance. Even in applications that are not performance critical, avoiding the introduction of these types of queries and runtime dependencies can result in code that is simpler to reason about, organize, and maintain as it scales.
Proposed solution: automatic denormalization
Automatic denormalization is not a new idea. The solution presented here is a type of precomputation similar to indexed views or materialized views in relational databases. In relational databases, the precomputed views are typically implemented with the goal of improving read performance. In this instance, while performance is a goal, it is not the primary goal.
The problems outlined above demonstrate the difficulty in integrating documents that support many-to-many relationships alongside documents whose relationships are self-contained. We can take significant steps to resolve this problem by denormalizing the many-to-many relationships as they are ingested into the database. Instead of storing references to canonical documents using IDREFs, the referencing document will store a complete copy of the referenced document. In a way, documents still resemble the relational-like document model, but with references swapped for copies of the referenced data.
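A sketch of a denormalized attorney document in this style, with a copy of a related client stored in place of a reference (the reciprocal client document would likewise hold a copy of the attorney):

<attorney id="a1">
  <full-name>Johnny Cochran</full-name>
  <state>CA</state>
  <ref:copies>
    <client id="c1">
      <full-name>O.J. Simpson</full-name>
      <state>CA</state>
    </client>
  </ref:copies>
</attorney>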
The denormalization process can be broken down into two essential subprocesses: one for creating the initial many-to-many relationship between two documents and another for updating any document that is part of a relationship.
Note
XQuery update syntax differs substantially across database implementations, so the update examples that follow are illustrative sketches using W3C XQuery Update Facility syntax rather than any particular implementation's dialect.
Relationship create
Creating a denormalized many-to-many relationship follows similar steps as the normalized two-way IDREF solution, and in both cases, both documents must be updated. In the normalized create, only an element denoting a reference to the other is added; in the denormalized create, the contents of the referenced document are used wholesale in place of the reference. One new requirement is that a denormalized create must read-on-update to exchange content. Typically, lock contention will not differ when reading from a document that is also updated in the same transaction, so the added overhead of denormalizing this operation can be minimal.[8]
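A minimal sketch of the create operation (per the note above, actual update syntax varies by implementation; the document URIs here are hypothetical):

let $attorney := doc("attorney-a1.xml")/attorney
let $client := doc("client-c1.xml")/client
return (
  (: copy each document into the other; updates are applied at the end of the
     transaction, so neither copy includes the other's new ref:copies element :)
  insert node <ref:copies>{$client}</ref:copies> as last into $attorney,
  insert node <ref:copies>{$attorney}</ref:copies> as last into $client
)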
This example creates a relationship on documents with no existing relationships. However, when relating or updating documents that already have denormalized relationships, the copied entities stored in <ref:copy> should not also include their copied references. From the perspective of the containing entity, the relationships within a copied entity are either circular, or they will reference documents that do not have references back, leaving no way to update them consistently when the canonical document is updated. Therefore, an entity copy transformation should be used to remove existing denormalized relationships from any dereferenced copy.
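One possible shape for that transformation, sketched as a recursive function (the function name is hypothetical, and the ref namespace is assumed to be declared):

declare function ref:entity-copy($node as node()) as node()?
{
  typeswitch($node)
    (: drop existing denormalized relationships from the copy :)
    case element(ref:copies) return ()
    case element() return
      element { node-name($node) } {
        $node/@*,
        for $child in $node/node() return ref:entity-copy($child)
      }
    default return $node
};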
Related document update
Now that references are replaced with denormalized copies of the canonical entity’s data, updating a document requires updating every document on the other side of the relationship as well. Similar to the normalized model, where reading a complete document required expanding every reference to a canonical document, maintaining consistency now requires that any document update propagate the new information to its denormalized copies in referenced documents.
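A sketch of that propagation for an updated canonical client document $client, again in illustrative XQuery Update Facility syntax and reusing the entity copy transformation sketched above:

(: refresh the client's denormalized copies in every related attorney document :)
for $copy in collection()/attorney/ref:copies/client[@id = $client/@id]
return replace node $copy with ref:entity-copy($client)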
This step is the essence of the denormalization process because it is responsible for maintaining consistency across related documents. It deliberately defies one of the original stated design goals of the relational model and normal form - to free relations from insert, update, and delete dependencies. The primary reason behind that goal was the expectation that managing dependencies would quickly become complex or impractical (or a problem writing to storage hardware would occur), leading to “update anomalies galore” [Codd1990], and inconsistencies across related data would accumulate throughout the database.
The automatic denormalization process proposed here can be implemented in a single component capable of handling any update operation. As a practical matter, maintenance of the component to support changes to the data model is simple, and in many cases (depending on the database) the component can be written to handle generic references, such that code changes would only be required if the denormalization schema itself changes. BaseX, eXist-db, and MarkLogic all provide methods to apply cascading updates atomically, within a single transaction, eliminating a class of possible undesirable side effects from managing redundant data, one of the motivating factors behind normalization.
Because the number of relationships per document will typically not grow relative to the size of the database, avoiding joins across large portions of the database can also be a worthwhile performance trade-off for cascading document updates. However, the comparison is not apples-to-apples, since writing to a database is a more resource-intensive operation than reading from it. Also, the details of lock handling differ across database implementations (and some are configurable), so performance testing for any specific implementation and configuration is appropriate if the likely trade-off is not clear from an application’s requirements or usage information.
Delete
Denormalized delete follows the same pattern as update, but without the need for the entity copy transformation step. When a document is deleted, iterate over its references and remove the copy in any referenced document. Deleting a relationship follows the inverse pattern of create.
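A sketch of the corresponding delete, in the same illustrative syntax, where $client is the canonical client document being deleted:

(
  (: remove the client's denormalized copies from related attorney documents :)
  delete node collection()/attorney/ref:copies/client[@id = $client/@id],
  (: then delete the canonical client document itself :)
  delete node $client
)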
Read
Now that work has been done on the create/update/delete side to ensure consistent denormalized relationships, documents have been restored to a prototypical document structure.
Rendering documents with relationships no longer necessitates a runtime expansion process to produce a complete view, and it is possible to query related documents without joining. The additional consistency across queries and rendering components should provide more opportunities for code reuse and less complexity to maintain overall.
The pathological query from above - “return any attorney working on the west coast representing a client on the east coast” - is now simplified and is structured more optimally for query evaluation. Additionally, indexing state elements will now be effective at preventing the database from reading documents from disk before a result set is determined.
//attorney[state = $states-west][ref:copies/client/state = $states-east]
Similarly, the facet-calculating query example from above can also be simplified without the need for joining and can benefit even further by leveraging the indexing capabilities in MarkLogic or eXist. Because those databases provide functions to access index values directly, indexing the path /client/ref:copies/attorney/full-name permits even more efficient calculation of facets.
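Even without implementation-specific index functions, a plain XQuery sketch of the simplified facet calculation no longer requires a join (namespace declarations omitted); databases that expose index values directly could compute the same counts without reading the documents at all:

let $clients-east := collection()/client[state = $states-east]
for $name in distinct-values($clients-east/ref:copies/attorney/full-name)
let $count := count($clients-east[ref:copies/attorney/full-name = $name])
order by $count descending
return <facet count="{$count}">{$name}</facet>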
Without the need to join data across documents, the need for nebulous (and often unsuccessful) optimizations is avoided, and with related documents conforming to the self-contained structure of other documents, integrating two disparate data models is no longer necessary, avoiding a class of thorny code maintenance problems.
Caveats
Relationship-heavy workloads and/or update-heavy workloads
The resource overhead of this technique will scale with the frequency of updates to documents that have relationships and with the number of relationships per document. One must also consider the overlap of relationships across documents, because concurrent updates to multiple documents with relationships to a single document will all contend for the ability to lock and write to that document when the denormalized updating occurs. In cases where a long chain of cascading updates requires changing the same document multiple times, an update conflict may occur that prevents the transaction from completing. In that case, a workaround would be required to process all updates to the single document in memory first, and then commit only the final result to the database.
Only recommended for two-way joins
For example, a LawFirm entity may have a many-to-many relationship with an Attorney entity, which also has a many-to-many relationship with a Client entity.
Using only the technique described here, it is not possible to get denormalized data about Clients from LawFirms and vice versa. It is possible to modify the entity copy transformation to retain references for entities such that an entity’s references and its references’ references are part of the cascading update. But this is dangerous behavior that may result in combinatorial explosions of updates, and it can quickly become impractical for most use cases.
Extensions and Optimizations
Entities and sub-entities
In real-world models, it is often unrealistic to insist that relationships occur only between document roots. Instead, it is more idiomatic to consider a document as an entity that may contain descendant sub-entities recursively, such that any entity or sub-entity can be related to any other entity or sub-entity.
Additionally, it may be helpful when generating a copy of a sub-entity to also include data about that entity’s parent or ancestors. This pattern can be managed through modifications to the entity copy transform (used above only to remove references) and to update behavior. When an entity is copied, process the parent or ancestors recursively, and when an entity is updated, trigger updates on its child or descendant sub-entities. The structure of the denormalized ancestor markup should be determined by the application. In some cases, the application may benefit from maintaining the nested structure of the related document in its copy and annotating the directly-related entity:
<client id="abc"> <name>O.J. Simpson</name> <address> ... </address> <client-instance id="1357" id-client="abc"> <name> ... </name> <address> ... </address> <ref:copies> <attorney> <name>Johnny Cochran</name> ... <ref:context> <matter> <name>People v. O. J. Simpson</name> <judge>Lance Ito</judge> <dt-arraignment>1994-06-20T09:00:00-07:00</dt-arraignment> </matter> </ref:context> </attorney> </ref:copies> </client-instance> </client>
In another, it may be more convenient to break the hierarchy into siblings and annotate the entities:
<ref:copies>
  <ref:copy>
    <ref:context>
      <matter>
        <name>People v. O. J. Simpson</name>
        <judge>Lance Ito</judge>
        <dt-arraignment>1994-06-20T09:00:00-07:00</dt-arraignment>
      </matter>
    </ref:context>
    <ref:parent>
      <attorney>
        <name>Johnny Cochran</name>
        ...
      </attorney>
    </ref:parent>
  </ref:copy>
</ref:copies>
The annotation schema is not prescriptive and should be adapted to fit the needs of the application, so other variations may be appropriate. For example, relationship annotation elements could be children of entities instead of parents, or they may be attributes on the copied element.
Overlapping hierarchies
Sometimes a model may not call explicitly for a many-to-many relationship, but it may have naturally overlapping one-to-many relationships. Overlapping data structures are not supported by the XML model directly, but they can be approximated using many-to-many relationships.
In this scenario, one of the trees may be modeled as the “base” document, and overlapping trees may be modeled using many-to-many relationships between the shared entity of the base document and the parent entities of the overlapping documents. Using automatic denormalization, overlapping hierarchies can be stored as separate and completely coherent documents.
Sparse entity copy transformations
If, within the context of a particular type of document, it is known ahead of time that full copies of referenced entities are unnecessary for the application, then the entity copy transformation can be extended to exclude some parts of the copy. Depending on the application, it may be appropriate to create rules based on the type of entity copied, the type of entity containing the copy, or a combination of the two.
declare function ref:make-copy($source as element(), $target as element()) as element()
{
  typeswitch($source)
    (: Rules based on copied entity :)
    case element(attorney) | element(client) return
      (: From: Attorney or Client, To: Anything rules :) ...
    case element(matter) return
      (: Combination rules :)
      typeswitch($target)
        case element(client) return (: From: Matter, To: Client rules :) ...
        default return (: From: Matter, To: Not-Client rules :) ...
    default return
      (: Rules based on containing entity :)
      typeswitch($target) ...
};
Nearline reference updates (allow dirty reads)
If the number of relationships or frequency of updates per document are high, denormalization may be hard to justify. However, in some cases it may not be important for updated data to be consistent in all related documents immediately (within the same transaction). In these instances, we may implement a queue for processing updates, such that when a document is updated, cascading updates to related documents are only added to the queue to be processed later. This method allows the canonical document update to commit quickly, while the related data becomes “eventually consistent.” Queued denormalization updates also provide an opportunity to tune queue processing to operate within a particular range of resource utilization.
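One possible sketch of the queueing step, using a hypothetical queue.xml document as the queue and leaving the scheduled processing job to the implementation; $client is the updated canonical document:

(: instead of cascading updates within the same transaction, append a work item
   to a queue document; a scheduled job later applies the denormalized refreshes :)
insert node
  <refresh entity-id="{$client/@id}" requested="{current-dateTime()}"/>
  as last into doc("queue.xml")/queue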
Conclusion
NoSQL databases have emerged to serve an increasing demand to handle larger volumes, greater complexity, and wider varieties of data structures without elaborate workarounds or resorting to non-database storage systems. Initially, many NoSQL databases were hyped as replacements for relational databases. But as the hype cycle dies down and these systems mature, it seems more likely that the future holds not one, but a variety of database models, each tailored for a subset of data storage needs.
Many aspects of software development can be presented as a set of tradeoffs, and judgment is necessary to make good bets that will affect the speed of development, application performance, and maintainability. Even taking an optimistic view of the solutions available from the variety of new data storage models, applications grow and change over time, and it is still unlikely that any one model will perfectly meet an application’s requirements over its lifetime. Rather, when confronted with facets of a design that are imperfect, finding the most reasonable tradeoffs is one of the most critical aspects of software development.
The techniques described here provide a simple set of patterns that offer reasonable tradeoffs for a common set of problems facing XQuery database developers. By explicitly creating and encapsulating the management of one set of dependencies - specifically, dependencies the relational model intended to eliminate outright - it is possible to eliminate a different set of runtime dependencies that can have a more troubling influence on the development and maintenance of an application throughout its lifecycle.
References
[Codd1979] Codd, E. F. "Extending the Database Relational Model to Capture More Meaning." ACM Transactions on Database Systems 4.4 (1979): 397-434. Print. doi:https://doi.org/10.1145/320107.320109.
[Codd1990] Codd, Edgar F. The Relational Model for Database Management: Version 2. Reading, Mass.: Addison-Wesley, 1990. Print.
[Williams2000] Williams, Kevin, Michael Brundage, and Patrick Dengler. Professional XML Databases. Birmingham: Wrox, 2000. N. pag. Print.
[Williams2002] Williams, Kevin. "XML for Data: Modeling Many-to-many Relationships." IBM DeveloperWorks. IBM, 1 Jan. 2002. Web. 12 May 2017. https://www.ibm.com/developerworks/library/x-xdm2m.html.
[Walmsley2015] Walmsley, Priscilla. XQuery: Search Across a Variety of XML Data. 2nd ed. N.p.: O'Reilly Media, 2015. Print.
[1] See "XQuery Language Reference (SQL Server)," Microsoft Docs, https://docs.microsoft.com/en-us/sql/xquery/xquery-language-reference-sql-server.
[2] See "SQL*Plus® User's Guide and Reference," ORACLE Help Center, http://docs.oracle.com/database/122/SQPUG/XQUERY.htm.
[3] See "XML Indexes (SQL Server)," Microsoft Docs, https://docs.microsoft.com/en-us/sql/relational-databases/xml/xml-indexes-sql-server.
[4] See "8.13. XML Type." PostgreSQL documentation, https://www.postgresql.org/docs/9.6/static/datatype-xml.html.
[5] See "Faceted Search," Wikipedia, https://en.wikipedia.org/wiki/Faceted_search.
[6] See "Tuning the Database," EXist-db Documentation , http://exist-db.org/exist/apps/doc/tuning.xml.
[7] See "Query Performance and Tuning Guide," MarkLogic 8 Product Documentation, https://docs.marklogic.com/8.0/guide/performance.
[8] In most databases, read locks are non-exclusive and therefore don't prevent concurrency, while a write lock will block until writing is complete. Whether two documents are updated to insert a small element (the reference) or a large one (the full copy), they will both require write locks. It is also common for database implementations to rewrite an entire document to disk when any part of it has been updated (MarkLogic is one implementation that does this). In that case, the relative cost of inserting a larger element is not as high as one might assume, compared to inserting a smaller one.