How to cite this paper
Shifrin, Laurel. “Managing multiple vocabularies across a global enterprise.” Presented at International Symposium on Versioning XML Vocabularies and Systems, Montréal, Canada, August 11, 2008. In Proceedings of the International Symposium on Versioning XML Vocabularies and Systems. Balisage Series on Markup Technologies, vol. 2 (2008). https://doi.org/10.4242/BalisageVol2.Shifrin01.
International Symposium on Versioning XML Vocabularies and Systems
August 11, 2008
Balisage Paper: Managing multiple vocabularies across a global enterprise
Laurel Shifrin
Manager, Data Architecture
LexisNexis
Laurel Shifrin is the Manager of Data Architecture at LexisNexis, an international
company offering online information and solutions. She has been working with markup
languages since 1993 and presented at Markup Technologies, a predecessor to Balisage.
She is also the co-author (with Michael Atkinson) of Flickipedia: Perfect Films for Every Occasion, Holiday, Mood, Ordeal and Whim (Chicago Review Press).
Copyright © 2008 LexisNexis
Abstract
Organizations share vocabularies across disparate user groups and data to maximize
the
value of their investment in XML, and, without question, those XML vocabularies need
to
change as the businesses evolve and expand. Managing change to DTDs and schemas is
difficult
enough with a small group of co-located users working on the same content types. What
happens when you have hundreds of XML consumers spread across the globe and they have
completely different requirements, systems, and content? Get a view of the challenges
of
implementing change management and vocabulary versioning on a very large scale.
Table of Contents
- Introduction
- Background of LexisNexis and the migration to XML
- Shared Vocabularies
- XML Maintenance Protocol
  - XML Maintenance System
  - XML Maintenance Process
  - XML Versioning Policy
- Benefits and Challenges
- Conclusion
Introduction
One of the primary motivators for executive sponsorship of an XML implementation is
decreased cost and increased efficiency. One of the ways to maximize this effectiveness
is to
create a centrally managed XML vocabulary that can be shared across the entire organization.
In a small company where technical developers and users alike work in a single location,
this can be a relatively straightforward prospect: only a small number of people are
involved
and they can communicate frequently and easily. Likewise, if the content is specialized
toward
a single discipline, the vocabulary itself may be semantically rich but otherwise
uncomplicated. However, even in a small organization with a fairly lean vocabulary,
managing
the XML definition is important: it will certainly continue to expand and ad hoc change
management can result in unexpected – and unpleasant – outcomes. In a large organization
with
a great diversity of data types, the potential for problems associated with loose
management
is increased. Governance of markup is paramount, not just because of the downstream
applications, stylesheets and other artifacts expecting certain data constructs but
also
because of the cost of changing your data: once you create content in XML and store
it, the
odds are that it’s going to stay that way for a very long time and that changes will
be
incremental. As the amount of data grows, so does the importance of vocabulary control.
Background of LexisNexis and the migration to XML
LexisNexis is a leading international provider of business information solutions to
professionals in a variety of areas, including legal, corporate, government, law
enforcement,
tax, accounting, academic, and risk and compliance assessment. In more than 100 countries,
across six continents, LexisNexis provides customers with access to five billion searchable
documents from more than 40,000 legal, news and business sources. We have over 13,000
employees located in business units around the globe.
Since the majority of these business units were acquired as existing entities, each typically arrived with its own editorial systems, data formats and online products, much of this reflecting the different legal systems in place around the world. Some of the acquired business units already had XML-based Content Management Systems, some had SGML, and some were using either proprietary markup systems or Microsoft Word documents with varying degrees of semantic markup – sometimes none at all, occasionally captured using MS Word styles.
In 2004, we launched a common delivery platform that included news, financial and
legal
data from seven countries across three continents. This shared platform was designed
to meet
several goals:
- consolidate the technology for delivering data to our customers, thus decreasing cost;
- establish a publishing pipeline that could enable the addition of new countries in a relatively clear-cut manner;
- create a standardized user interface that could be recognized globally; and
- make it possible to develop new products and services across different markets and countries more easily.
At the same time, the U.S. business unit continued to enhance the existing flagship
American products, lexis.com and nexis.com, and to build our XML-based editorial systems.
Our
editorial staff are similarly spread across the U.S., working in different time zones
to
create different products consisting of data that is sometimes shared but often unique
in
nature and geared toward a specific market; for example, an editor creating solutions
for
legal practitioners who specialize in Intellectual Property works with a very different data set from that of an editor abstracting and indexing International Statistics for an academic audience.
All of these projects were staffed with representatives from a host of different groups
in
multiple countries (engineering, editorial, user experience, etc.), sometimes leveraging offshore resources as well. The schedules, timelines and project management were also geared toward the specific goals of the individual efforts, with a limited amount of overlap and consideration of the other projects.
Shared Vocabularies
Underlying all of these considerable efforts was the XML definition, managed by a
single,
centralized Data Architecture team, based in the U.S. Our team is dispersed among
eight
locations and two time zones and we maintain over 100 DTDs and schemas.
At the beginning of our markup development, we were faced with two options. The first
option was building standalone DTDs for each content type -- and, for legal statutory
data,
potentially building standalone DTDs for each content type within each
jurisdiction, as the structure and standards for one jurisdiction can be quite dissimilar from another's, and we’re frequently under a contractual obligation that
prevents us from standardizing across jurisdictions. The second option was to build
our DTDs
from modules and share as much common vocabulary as possible.
There were several benefits to creating standalone DTDs: smaller scope makes for swifter,
less complicated development; fewer requirements for the markup to support; discrete
user
groups would mean less controversy and disagreement. However, a modularized vocabulary would prevent redundant markup and duplicated effort and, we believed, would prove more efficient in the long run.
This was the genesis of what we have today: XML vocabularies that are shared across
multiple editorial systems, discrete applications, and editorial groups, and also
provide the
definition for data delivery from the business units around the world. The Data Architecture
team uses a toolbox of XML modules (body text elements, metadata elements, content-specific
elements, etc.) from which unique DTDs can be created, so that whole DTDs do not need
to be
written from scratch for each new content type that is added to our voluminous data
stores.
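As a rough sketch of how this works in practice (the module and element names here are invented for illustration, not our actual vocabulary), a document-type DTD can be little more than a wrapper that pulls in the shared modules:

    <!-- Hypothetical caselaw DTD assembled from shared modules -->
    <!ENTITY % metadata.mod SYSTEM "modules/metadata.mod">
    <!ENTITY % bodytext.mod SYSTEM "modules/bodytext.mod">
    <!ENTITY % caselaw.mod  SYSTEM "modules/caselaw.mod">
    %metadata.mod;
    %bodytext.mod;
    %caselaw.mod;
    <!-- Only the document-level wrapper is declared locally; everything
         else comes from the shared modules -->
    <!ELEMENT caselaw-doc (doc-metadata, opinion+)>

Because the modules are shared, an improvement made for one content type becomes available to every DTD built from the same toolbox.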
Additionally, the XML definition for a given semantic component may need to be transformed
as it moves through our publishing workflow; for example, metadata necessary for Editorial
tools may be dropped on external delivery. To ensure data quality as it moves through
the
publishing pipeline, we maintain a DTD or schema for data validation at specific points
in the
process. Thus, a certain element may have a very robust set of attributes at the beginning
of
the editorial cycle but may have a much leaner group of attributes farther down the
path – or,
conversely, may have additional metadata added by other post-editorial processes.
These
multiple configurations of a given semantic are defined using parameter entities with
conditional switches.
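As an illustration of the technique (the element, attribute, and entity names below are hypothetical), a single source module can carry both the robust and the lean configuration, with a parameter entity acting as the switch:

    <!-- Switch: set to INCLUDE when building Editorial DTDs,
         IGNORE when building delivery DTDs -->
    <!ENTITY % editorial "INCLUDE">
    <![%editorial;[
      <!ENTITY % headnote.atts
        "editor        CDATA            #REQUIRED
         review-status (draft|approved) 'draft'">
    ]]>
    <!-- In a DTD the first declaration wins, so this empty fallback
         applies only when the section above is ignored -->
    <!ENTITY % headnote.atts "">
    <!ELEMENT headnote (#PCDATA)>
    <!ATTLIST headnote
      id ID #REQUIRED
      %headnote.atts;
    >

Rebuilding the DTD with the switch set to IGNORE yields the leaner delivery configuration from the same source module.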
With such a great variety and quantity of data moving through and being added to our
pipeline at any given time, it is expected that our DTDs will change – they need to
change to
support improvements and enhancements to our products, services and systems and to
continue to
expand our content offerings. However, with so many disparate user and developer groups
around
the globe relying on established XML vocabularies, managing that change must be done
carefully
and transparently. Very early in our migration to XML, it became evident that the
small-shop
method of requesting and communicating changes over cubicle walls or via email was
not going
to be sufficient – and that was back when the Data Architecture team itself consisted
of three
people sitting within 10 feet of one another and the majority of the interested parties
were a
stone’s throw away. We determined that a formal maintenance protocol was needed, which
turned
out to be prescient, considering the growth of our own team and the even greater expansion
of
the community depending on our XML. It was essential to institute a centralized system, not just to control the vocabularies but also the data using those vocabularies, and to establish a common understanding and means of expression.
The maintenance protocol we introduced had three features:
- an application for requesting and storing information about each DTD/schema change;
- a publicized change control process; and
- a versioning policy for the vocabulary itself.
XML Maintenance Protocol
XML Maintenance System
Our maintenance system is an intranet application accessible by any employee. Each
change request must contain enough information so that all of our XML consumers can
evaluate
the impact of the change on their own realm of responsibility, including:
- the reason for the change;
- affected modules or DTDs/schemas;
- existing and proposed content models (see the sketch after this list);
- change benefits and risks; and
- a document sample that has been validated against a mocked-up DTD/schema.
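For instance (using invented element names), a request to add an optional docket number to a citation could present the two content models like this:

    <!-- Existing content model -->
    <!ELEMENT citation (title, court?, date)>

    <!-- Proposed content model: adds an optional element, so documents
         created under the existing model remain valid -->
    <!ELEMENT citation (title, court?, docket-number?, date)>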
All change requests are assigned a unique ID and stored in an SQL database. The system
allows users to query various fields, or generate lists of changes by platform and/or
change
status. When a change request is submitted, the maintenance system alerts the appropriate
Data Architect, who is then responsible for shepherding the change through the maintenance
process. Generally, the users provide some basic information about the problem they’re
trying to solve and then the Data Architect follows the usual steps in designing markup:
reviewing requirements, conducting user interviews, modeling the data, assessing downstream
impacts, tagging or converting sample documents, etc. A related tool helps us research
existing vocabulary so that we’re as consistent and efficient as possible – a database
of
elements and attributes, which we search using XQuery.
XML Maintenance Process
Each change request is first reviewed in a Data Architecture-only round table
discussion; because the modules are shared, each Data Architect may find the new markup
in
his DTD and so needs to understand and agree to the change. This has the additional
benefit
of team members learning more about each other’s content expertise and of course provides
an
opportunity for newer staff members to learn about data modeling from those with more
experience. At this point, only our team can view the change request on the Intranet
site;
this reduces churn and controversy within the larger community.
Once the change passes muster with our group, it is then viewable to everyone on our
Intranet site and distributed to representatives throughout the entire LexisNexis
organization. We typically allow 10 business days for assessing impacts, although
we will
expedite the review period when deadlines demand.
If an objection or negative impact is raised, we will reconsider the change based
on the
feedback and we sometimes must negotiate a compromise between the requester and the
objector. It’s entirely possible that a change that benefits one group will have a
negative
impact on another group and the overall business benefits and risks must be evaluated
to
determine the right course of action. A simple DTD change may seem trivial – what
could
possibly cause damage? – but if we were to make a backwards-incompatible change without
working out the details of how to remediate the data, we could block up our publishing
pipeline; invalid data will get bounced back. This is not so bad if it’s a few hundred
documents but if it’s a few hundred thousand, this could wreak havoc. Additionally,
we have
very aggressive deadlines for updating our documents, particularly for caselaw and
news. If
we were to introduce invalid markup that then resulted in even a short delay of delivering
our documents online, our customers would transfer to our competitors pretty quickly.
Once a change is approved, the new markup is implemented in the module with
documentation specifically referencing the change itself (change request ID, date, Data Architect making the change, etc.), so that we can view our DTDs in the future and
not have
to scratch our heads wondering why a particular change was made two years ago. Then
the DTDs
are normalized and validated using a series of scripts and tested using a set of sample
documents. The number of DTDs being updated at a given time depends on several factors:
which system, platform or group needed the change, how many changes are in the queue
waiting
to be implemented, and whether or not we need to meet a specific project deadline.
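To give a flavor of what that documentation might look like (the request number, date, and element are invented for the example), the comment accompanying a revised declaration could read:

    <!-- Change request CR-1234, 2008-05-02, implemented by [Data Architect]:
         added optional docket-number to citation for Legal Editorial;
         backwards compatible (minor revision) -->
    <!ELEMENT citation (title, court?, docket-number?, date)>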
Documentation is embedded within the DTDs and schemas so that it can be kept up-to-date
as changes are implemented; we autogenerate several different types of documentation
for our
users, including a summary of what changes are included in each DTD update.
XML Versioning Policy
Revision control and storage of our source modules are handled by Rational ClearCase. Since our customers only see normalized DTDs and schemas, we adopted a versioning policy for those deliverables. We use a three-node numerical versioning syntax, expressing three degrees of revision: major, minor, and micro.
Major changes are defined as changes that would either render documents generated under the previous version invalid or alter a previous version’s existing default DOM expansion (such as changing the default or fixed value of an attribute). When a change is determined to be major, the major revision node is incremented by 1, and the minor and micro nodes are moved to 0.
Minor changes are defined as changes that will not
render documents generated under the previous version invalid, such as making previously
required nodes optional. When a change is determined to be minor, the minor revision
node is
incremented by 1, and the micro node is moved to 0. The major node is left unchanged.
Micro changes are defined as changes that have no syntactic or semantic impact to
the
document (e.g., enhancements to documentation stored within comments in the DTD).
When a
change is determined to be micro, the micro revision node is incremented by 1, and
the major
and minor nodes are left unchanged.
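For example, under this policy a DTD at version 2.4.1 would move to 3.0.0 after a major change, to 3.1.0 after a subsequent minor change, and to 3.1.1 after a micro change such as a documentation update.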
We typically do not maintain multiple versions of the same DTD or schema at once,
though
we have agreed to this for short periods of time, with the caveat that if we need
to
maintain multiple major versions, we will not accept any backwards incompatible change
requests to the lower version – if a group needs a backwards incompatible change,
then
they’re going to need to upgrade to the later version.
Benefits and Challenges
This process and system have provided several benefits. The first is just simple
record-keeping: there’s no wondering about why a particular change was made three
years ago
because we can easily go back to the original details. Central control of the XML
vocabulary
prevents the practice of keeping local copies that rapidly fall out of sync as each
group
makes changes without regard to the consequences for other teams. This was the situation
at
one of the acquired business units and when we had to fold their DTDs into our catalog,
it was
a labor-intensive, frustrating task to try to find all the copies of their DTDs and
consolidate them to a single canonical version of each.
This record-keeping provides us with an easy way to derive metrics about our own
efficiency in managing change. We can determine which DTDs or modules require the
most
modifications, which could be an indication that something needs a closer, more holistic
analysis. As the manager of the team, I can also draw metrics on the amount of time
needed by
individual team members to process a change, which could indicate any number of issues:
unbalanced workloads, lack of adequate training and skills, etc.
Central control of the XML also helps us keep redundancy in check – we’re less likely
to
create a new element for a very similar semantic because the process for doing this
is so
transparent. When our team first began formal tracking of our DTD changes and introduction
of
new vocabulary, we were still rather siloed in our approach, which resulted in individual
Data
Architects creating separate elements or attributes for the same or very similar semantic.
Adopting a maintenance process ensures that no XML can be introduced or changed without making the rest of the team aware. Our twice-weekly discussions and the use of our Element Repository for researching existing vocabulary provide an additional safeguard against this problem.
Another benefit is that opening up the process to the development and user groups
has
increased their understanding of XML, which can be inconsistent across such a broad
audience.
And we've increased the flow of information coming to us, not just from us: by notifying the community of a proposed change rather than just distributing an updated DTD, we’ve made allowances for the situations when feedback is critical. We have so many applications downstream of the XML definition that it would be impossible for the Data Architecture team to understand the minute details of how the XML is being interrogated and used everywhere (though we do know quite a bit of it), so we rely on the application and content experts for information in this regard.
The transparency of our process has its downside: users have become so accustomed
to
seeing XML syntax that they are prone to requesting specific markup instead of stating
their
requirements and letting the Data Architects design the markup. Then if we propose
an
alternate design, this sometimes causes frustration and contention all around. Also,
because a
given semantic can be modeled in multiple ways, in addition to impact assessments,
we have to
be willing to accept constructive criticism about our markup design and to be flexible
enough
to alter it.
Backwards-incompatibility can also be challenging. In the earliest stages of our initial
migration to markup (which was SGML), the number of DTD consumers could fit comfortably
in a
conference room and the scope of the data conversion was intentionally limited. At
that time,
making backwards-incompatible changes to the DTD wasn’t a big problem. Also, if you’re
lucky
enough to have documents that are published once and don’t need updating,
backwards-incompatibility isn’t an issue. However, many of our documents do need to
be updated
continually and now that we’re several years into such a large-scale migration, making
a DTD
change that would invalidate data is not a trivial matter. And with a shared vocabulary,
this
could have an enormous ripple effect into document types owned by groups completely
unfamiliar
with the data that needed the backwards-incompatible change. For instance, a financial
product
developer might not understand why we need to push a change to a shared module to
solve a
problem for legislative products. Certainly the freedom to make these kinds of
backwards-incompatible changes easily was something we sacrificed when we went down
this path.
Another challenge is that sharing XML to this degree necessitates a relatively loose
set
of rules, particularly for online document delivery. We tend to move the data standards
conformance checks farther upstream: our Editorial DTDs and schemas are far stricter
than the
delivery DTDs and we frequently employ other technologies such as Schematron to ensure
data
standards are met.
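As a rough sketch of the kind of rule involved (the pattern and element names are hypothetical rather than drawn from our actual schemas), a Schematron check might require a court designation on caselaw citations even where the delivery DTD leaves that element optional:

    <schema xmlns="http://purl.oclc.org/dsdl/schematron">
      <pattern>
        <rule context="caselaw-doc//citation">
          <!-- The delivery DTD allows court to be absent; the data
               standard for caselaw requires it -->
          <assert test="court">A caselaw citation must identify the court.</assert>
        </rule>
      </pattern>
    </schema>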
Conclusion
The protocol we follow for managing our XML vocabularies has been in place since 1999, albeit in a much more rigorous form now than it was at that time. The engineering groups have used various forms of change management for their software development -- both home-grown and off-the-shelf change control products -- but our XML management protocol (system and process) has withstood whatever else has come and gone. In fact, several other ancillary vocabularies
have been put under our auspices because our system is so respected. For an enterprise
of this
size to share XML across content types and continents, it’s critical to keep the change
management transparent and well-publicized. Even smaller organizations can benefit
from
following an established process, if only to eliminate the questions of why a predecessor
made
a particular modification. This system enables us to expand our XML vocabularies at
the
aggressive rate our internal users and external customers expect.
It is difficult to imagine how engineering costs, time-to-market, and product and content integration could be managed on such a large scale if DTDs and schemas were spread throughout the organization under the control of local content and application development groups and without a public process. In the loosely-coupled system designs typified by Service Oriented
Architectures of today’s large scale systems, centrally managed schemas provide the
reliable
specifications for the content that systems trade with one another as well as something
just
as important – a common vocabulary that can be used by distant, possibly even unrelated,
product and application development groups to explain and express their needs. This
means of
communication makes it possible for us to deliver a very broad set of products, services
and
content types but it also provides a framework for innovation through shared understanding
across the organization.