Balisage Paper: A Linked-Data Method to Organize an XML Database for Mathematics Education
Alan Edward Bickel
Software Engineer
Big Ideas Learning, LLC / Larson Texts, Inc.
Alan Bickel is a software engineer at Big Ideas Learning, LLC / Larson
Texts, Inc. His current focus with Big Ideas Learning includes data and systems
architecture, system design, and application and web development. Tech stack
experience includes LAMP, Node.js, TypeScript, Express, Angular, Aurelia,
MongoDB, Phaser, and Apache Tomcat. Actively learning and loving the XML/RDF/eXist-db
ecosystem. Interests include:
- machine language translations for digital and print consumables
- embedded electronics engineering and development
- text-to-speech and accessibility-driven application development
- Path of Exile
Elisa E. Beshero-Bondar
Professor of Digital Humanities
Program Chair of Digital Media, Arts, and Technology
Penn State Erie, The Behrend College
Elisa Beshero-Bondar explores and teaches document data modeling with the XML family
of languages.
Until June 2020, she was a professor of English Literature and Director of the Center
for the Digital Text at
Pitt-Greensburg. She serves on the TEI Technical Council and is the founder and organizer
of the Digital Mitford project and
its usually
annual coding school. She experiments with visualizing data from complex document
structures like epic poems and with computer-assisted collation of differently encoded
editions of Frankenstein.
Her ongoing adventures with markup technologies are
documented on her development site at
newtfire.org.
Tim Larson
Director
Big Ideas Learning, LLC / Larson Texts, Inc.
In Timothy Roland “Tim” Larson’s four decades of professional experience, he has written in many formats
from page to screen, for all ages from early reader to adult, and in most media, including
interactive media and markup languages. He is a developer, producer, entrepreneur,
and occasional yacht crewman. Tim helped found
Grant Larson Productions, a film production company, and he is the chief architect of
Larson Texts, an educational publishing company. Many of his projects have won awards and achieved
market success. Tim is married to Mary Grant Larson, and together they have 2 children,
2 grandchildren, 2 dogs, and a multi-generation family farm in Pennsylvania, where
they still make hay in the summer and cider in the fall.
Copyright ©2021 Big Ideas Learning, LLC
Abstract
This paper presents work in progress to support fine-grained semantic relationships
between mathematical concepts and educational resources. Can RDF ontologies and XML
structure support a high-capacity database application for lesson planning, teaching,
assessment, and tutoring?
Table of Contents
- Contexts for designing an adaptive content delivery system
  - Overview: What we seek to design
  - Prior research and solutions in educational technology
  - Pedagogical objectives and market needs
  - A look at the correlation problem in context
- Our proposed solution to the competency alignment problem
  - Introducing the competency graph
  - Constructing RDF for mathematics education
  - RDF/XML and the development of the competency graph
- Implementation challenges
  - The challenge of resource identification and referencing
  - Exploring an XML database for content management supported by RDF/XML
- Conclusion
Contexts for designing an adaptive content delivery system
Overview: What we seek to design
Big Ideas Learning LLC (BIL)
creates and publishes math learning content for elementary, secondary, and
post-secondary courses primarily in the United States. The authors are helping BIL
to organize a new content delivery system that will serve its partner elementary and
secondary schools (K-12). BIL wants to move beyond the restrictions of relational
schemas to organize content using declarative methods. The design must accommodate
a large, diverse multimedia archive of
digitized materials representing textbooks, teacher
materials, tutorials, and assessments that the company has
published over the past four decades. It must also support ongoing creation of digital-first
learning
materials in multiple media formats. BIL’s goal is to atomize
these resources into learning objects, to allow for the rapid customization of
curriculum to suit more varied learning contexts. Those learning contexts may be based
on local curriculum, state standards, and adaptations of the
Common Core State Standards for Mathematics (CCSS)—or may even be individualized.
The authors are working with XML to organize, search, and remix BIL’s archive.
Making all the resources fully searchable and available digitally is a
long-range task. We seek to begin with an organizational structure based on a set of
Resource Description Framework (RDF) ontologies that associate the following kinds of
information:
- topic
- use (teaching, practicing, assessing, etc.)
- curriculum and standards (local, state, and national)
- relationships to other topics and materials
- client data (assessing competency and tracking usage)
The work involves matching internal BIL resources with external
requirements. BIL serves a growing base of over 5 million student users per year,
with custom alignment for 22 states.
Curriculum-to-standards alignment is complicated because there is no formal relationship
between district curriculum needs and state and CCSS standards. We are developing a set of
ontologies to benefit educators and districts by providing
formal ways to identify intersections with, and deviations from, those standards. This is
especially difficult for states that do not follow the CCSS.
In addition, math content providers need to align their contents to distinct
learning contexts for training and reviewing math skills.
The online interactive service that BIL hopes to provide will empower teachers, students,
and tutors to discover and
organize their learning plans. In addition to mainline plans, the system should support
projects and tasks for which gaps and problems are identified. It should also support
students who move between schools in different states, who may need to adjust quickly
to
topics they are not prepared for in their new school. A system that can adeptly assist
these customizations should also be able to track clients’ use of the system through
time to support customized recommendations.
The authors are planning a database storing RDF associations and data
pointers that correlate resources, standards, topics, related topics, and client data.
We are exploring the drafting of RDF in XML format, and organizing this using eXist-db.
We are also exploring XPath and XQuery for fine-grained searching, retrieving, and
visualizing networked data.
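As an early sketch of that exploration, the following XQuery lists the titles of all curriculum-type learning resources. The collection path and the LRMI namespace URI are assumptions for illustration, since the prototype's configuration is still in flux:
xquery version "3.1";
(: Sketch: list titles of all curriculum-type learning resources.
   Collection path and lrmi namespace URI are assumed for illustration. :)
declare namespace rdf  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace dc   = "http://purl.org/dc/elements/1.1/";
declare namespace lrmi = "http://purl.org/dcx/lrmi-terms/";

for $res in collection("/db/bil/learningObjects")//lrmi:learningResource
where $res/lrmi:learningResourceType/@rdf:resource = "learningResource/type/curriculum"
return $res/dc:title/string()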
Prior research and solutions in educational technology
In the field, much attention has been dedicated to
intelligent
learning
management systems that respond to the needs of learners by assessing and delivering
bespoke content. The promise of the semantic web is emphasized by
Gottfried Vossen, Miltiadis Lytras, and Nick Koudas:
The fundamental social and political impact of the Semantic
Web . . . supports a shift of
social interaction patterns from ‘knowledge push’ to ‘knowledge pull’. This includes
the shift . . . from teacher-centric to learner-centric education.
Vossen et al. see this shift in education as well as health care, government, and business.
The linked open data of the semantic web intersects public and private sectors.
Applications in education, like those in health care as well as business, require
networks crossing between open public and
secure private domains. Y. Anistyasari et al. explored the
interoperation of learning management systems like Moodle to help students enroll
in
courses at multiple universities. Cross-enrollment management is based on a publicly
shared ontology of course information, permitting individualized calculations of tuition
for each school involved.
Other projects seek to design
intelligent
or adaptive
learning management systems that combine the use of RDF ontologies with individual
student data.
Monika Rani et al. observe that designing an LMS to have
meta-cognitive awareness
argues for the
application of machine-readable RDF over the use of a relational database, and that
reliance of LMS’s on such databases and client-server applications limits their capacity
to adapt flexibly to individual learners. They propose an RDF-based LMS for Computer
Science designed on two ontology categories: for domain and for task. They base the
domain ontology on a standard ontology for computer science concepts and software,
the
ACM Computing Classification System, and for the task ontology they apply VARK for
classifying
different kinds of learning styles (visual, aural, read/write, kinesthetic) to be
self-selected by the student who interacts with the system.
More pertinent to the BIL project is the work of Fernando Díez and Rafael Gil
on the Reasoning and Managing System (RAMSys), designed to supervise and support
students in writing geometry equations. This is a far narrower application than what
the
authors are designing for BIL but is relevant for its responsiveness to student input
and
its application of the OpenMath markup language for guiding and semantically checking
student input working with Mathematica software.
While there are neighboring use-cases for RDF ontologies informing learning management
systems, what BIL needs is more of a catalog of its resources, with ontologies to deliver
learning objects as needed to instructors as they design lessons and to
students as they seek tutoring. Perhaps the most similar to what BIL seeks to design
is the
model of the intelligent learning management system Multitutor, discussed
by Goran Šimić et al. in 2004. Multitutor was designed in Java with reliance on XML
to
store course descriptions, with the idea of making materials reusable in multiple
course
contexts. The system provided authoring tools for instructors to organize their own
courses and track students’ progress, and it involved administrator, teacher, and
student levels of access.
The system is designed to support changeable navigation possibilities to the
student. It provides the dynamic creation of the learning materials . . .
The tutor is the main part of the system architecture. It is the system coordinator,
dispatcher, and monitor at the same time. The pedagogical strategies are implemented
in the tutor. It analyzes the data of the student model (model of particular student)
and uses its teacher knowledge to require the proper learning contents. Tech expert
module maintains the references of domain knowledge and rule base. The reasoning
machine processes the request of the tutor and composes the learning content.
The content can include the text, the picture, or some other multimedia. In the
test phase the content is represented by the test sets or by the problems that
students have to solve. These contents the tutor sends back to the servlets.
Multitutor permits teachers to customize and organize the learning experience,
with the system brokering delivery of customized content to students. Optimally, the
system can respond
to a student’s need for review by connecting related materials relevant to student
competence with assessed skills and tasks.
The authors have begun experimentally drafting RDF/XML to incorporate existing
ontologies in order to describe resources and their interconnectedness. We present
this
paper at a moment when we face serious questions about how best to adapt existing
RDF
ontologies for education to ontologies describing mathematical concepts. While we
seek
to work with existing ontologies, we need to determine at what point and for what
purposes a new ontology will be required based on BIL’s needs and application. We
also
face serious concerns about how best to implement a functional and adaptive content
delivery service, and how much to deploy XML stack technologies in BIL’s existing
development workflow.
Pedagogical objectives and market needs
The marketplace for learning materials is changing. Classrooms continue to be more
connected and more digital-friendly. At the same time, the divide between urban and
rural, poor and affluent, diverse and homogeneous is more
pronounced in the digital learning environment than in the physical classroom. Market
stakeholders have a
duty to enable all teachers and all students.
BIL’s needs for next-generation digital classrooms require improvements to our
resource correlation and usage. Teachers, administrators, and the community need to
know
that their limited resources are used to benefit all their students. Learning materials
need to be accessible for all students and teachers. Technology must help to lower
the bar for entry, not raise it. One way to eliminate barriers may be to improve the
alignment of resources with standards, regardless of medium.
Historically, BIL’s digital content has been written, aligned, and correlated from
a
print-first perspective, meaning that standards alignment, remediation resources,
and
curriculum coordination occur in terms of the print page. This presents several
challenges in converting print resources to digital and/or interactive web content,
while adding limited value to the teacher. Nevertheless, much of this content has
demonstrated efficacy across decades of use and needs to be preserved if not
enhanced.
The first challenge is that many of BIL’s print resources are re-used across multiple
products and programs, which complicates proper alignment and correlation. Poor
information design leads to a mix of cloning and re-use. This becomes increasingly
difficult for resources like digital assessment questions, when the same assessment
question may be used in multiple products or included in a custom assessment created
by
a teacher.
The second challenge is to guide a teacher or student user to appropriate
remediation materials. In the current system, the best we can do is to direct the
user
back to the lesson that teaches a concept. While this may be appropriate for the simple
general case, it does little to precisely address a student’s needs.
The third challenge is to empower users to create custom curriculum content.
Historically, textbook publishers provide a canonical curriculum to users, with the
expectation that teachers will follow it as laid out in the print books. With the
increase in online learning, teachers expect the ability to tweak their curricula
to
meet the needs of their classrooms and individual students. It is a straightforward
task
to provide users the ability to add and remove lesson content. However, customizing
a
lesson risks invalidating its correlations to standards and curriculum (i.e., x lesson
teaches y required topic). If a teacher removes a component of one lesson, does it
still
teach to the state standard? Does it provide proficiency for a given measurable Learning
Objective? A successful customization tool must do more than enable remixing of the
print content. The customized plan must be meaningful, measurable, and accountable
to
the educational requirements it serves.
A look at the correlation problem in context
When we look at U. S. state mathematics standards to assess objectives and learning
paths, we quickly see that a major shortcoming of nearly every alignment mechanism is that the standards are composite in nature.
A single state standard often concatenates several individual skills, competencies,
and facts into one bullet point. Let’s look at an example of this in the
Common Core Mathematics Standards, Grade 2, since many state standards are, in fact,
simple variants of this standards program.
When looking at the domain Operations and Algebraic Thinking, we see
that a single standard, CCSS.MATH.CONTENT.2.OA.A.1, states that a student should be
able to “Use addition and subtraction within 100 to solve one- and
two-step word problems involving situations of adding to, taking from, putting
together, taking apart, and comparing, with unknowns in all positions.”
If we take a moment to decompose all the tasks that this single
standard covers, we see that there are a number of discrete skills that all combine to
achieve proficiency in this standard:
- Use addition within 100 to solve one- and two-step problems,
- Use subtraction within 100 to solve one- and two-step problems,
- Understand decomposition in order to put together,
- Understand decomposition in order to compare,
- Understand decomposition in order to take apart,
- Use symbols as variables in equations,
- Use symbols as variables in drawings.
While this breakdown might not be expressly outlined in pedagogy, it
nevertheless shows that in order for a student to master a single standard, they
actually need to master several smaller skills. At the end of the day, the teacher is
responsible for the student’s ability to pass an assessment of this standard,
whether or not the teacher is provided with distinct resources for each of these components.
When we apply this insight to the
generality
of our current alignment
and remediation mechanisms, it becomes clear that there is room for improvement in
our
digital offerings. For example, if we have a second-grade learner, little Bobby
DropTables, and they fail an assessment question aligned to this example standard
CCSS.MATH.CONTENT.2.OA.A.1, what can we offer in terms of remediation? Did they fail
the
question because they don’t understand addition within 100? Is it because they don’t
understand how to use symbols as variables in an equation? Or, is it because they
are
missing or forgetting some fundamental prior knowledge
skill or concept?
Currently, our digital offerings have little capability to offer such insight, and
it falls squarely on
the shoulders of both teacher and student users to perform this analysis, for each
student, for each standard, for each assessment. This ambiguity, coupled
with our disconnected remediation offerings, brings to the forefront the challenges
that we
wish to overcome when serving digital content to our users.
Our proposed solution to the competency alignment problem
Introducing the competency graph
One of the beautiful things about mathematics is that it is a progressive, cumulative
discipline. While it is true that many states provide their own distinctive state
mathematics standards,
they all cover the same material, varying primarily in cadence and progression. Whether
you live in Arkansas or California, you
calculate a percentage in the same way. Students in Puerto Rico and Illinois know
the
same Quadratic Equation. It is this immutability, the fundamentally progressive way
in
which mathematics is taught and learned, that enables us to propose our Competency
Graph.
The study of mathematics is in part the progressive attainment
of discrete skills. Certain competencies require certain other prior knowledge competencies.
Looking at the prior example, breaking down a single standard into its composite parts,
we get a
feel for the level of granularity that a competency graph can express.
Essentially, the competency graph is a low-level knowledge framework
that underpins a state standards set, or our internal classification system of measurable
learning objectives.
The immediate benefits of developing the competency graph can be expressed in two
distinct areas. The first is standards correlation: by mapping a
standards set to our competency graph, and also mapping our lesson content,
resources, and curriculum data to the same competency graph, we provide an accurate
alignment at a highly granular level. For example, if we map state standard
CCSS.MATH.CONTENT.2.OA.A.1 to competencies A, B, and C, and
also map a 3-page lesson to the same competencies, then we gain standards alignment
through the proxy of the competency graph. The subtle but important distinction
is the shift from “This lesson covers this standard because they are aligned with
each other” to “lesson A and standard X are aligned through their intersection of
competencies.” This is the basis for
supporting alignable custom curricula.
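A minimal sketch of this proxy alignment in RDF/XML might look as follows; the bil:teaches property and the competency identifiers are hypothetical placeholders (not part of a published namespace), and the lrmi:learningResource element anticipates the LRMI vocabulary discussed later in this paper:
<!-- The standard decomposes into competencies A, B, and C (hypothetical IDs). -->
<rdf:Description rdf:ID="standard/CCSS/MATH/CONTENT/2.OA.A.1">
  <bil:teaches rdf:resource="competency/A"/>
  <bil:teaches rdf:resource="competency/B"/>
  <bil:teaches rdf:resource="competency/C"/>
</rdf:Description>
<!-- The lesson maps to the same competencies; its alignment to the standard
     is inferred through the intersection, never asserted directly. -->
<lrmi:learningResource rdf:ID="learningResource/lesson/grade-2/addition-and-subtraction">
  <bil:teaches rdf:resource="competency/A"/>
  <bil:teaches rdf:resource="competency/B"/>
  <bil:teaches rdf:resource="competency/C"/>
</lrmi:learningResource>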
This new alignment perspective offers superior accuracy and flexibility. By aligning
directly to
individual resources, instead of to the
container
of the curriculum
(i.e., the Lesson), we gain the accuracy and granularity for remediation support that
serves our users’ market and pedagogical needs. We also realize a substantial increase
in alignment efficiency, because a lesson’s alignment to a given standards set
is now inherited through the competency graph, by virtue of the alignment of the contents
within the lesson.
Finally, we need to consider the analysis of prior
knowledge
requirements for competencies. Due to the linked nature of the competency graph,
with any given node being aware of its immediate prior knowledge dependencies, we
are
also positioned to query this information for remediation. If a student
misses an assessment question on the Quadratic Equation, not only can we provide
resource links to target the teaching of the competency, but also its prior knowledge
dependencies. By surfacing these knowledge dependencies to teachers and students at
point of use, we offer a valuable analytical tool that can be used to help diagnose
underlying issues. This becomes especially important in higher grade bands, where topics
become complex and there is a greater probability that an assessment failure is a
symptom of a student’s misunderstanding of a prior competency.
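A sketch of that remediation query in XQuery, assuming the extends relationship defined later in this paper is asserted as a property on competency instances, and assuming an acyclic graph; the bil namespace URI and collection path are placeholders:
xquery version "3.1";
(: Sketch: collect the transitive prior-knowledge closure of a competency
   by walking bil:extends references. Assumes an acyclic competency graph;
   the bil namespace URI and collection path are placeholders. :)
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace bil = "http://bigideaslearning.com/ns/bil#";

declare function local:prior-knowledge($id as xs:string) as xs:string* {
  let $competency := collection("/db/bil/elements")//*[@rdf:ID = $id]
  for $dep in $competency/bil:extends/@rdf:resource/string()
  return ($dep, local:prior-knowledge($dep))
};

distinct-values(local:prior-knowledge("competency/quadratic-equation"))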
Constructing RDF for mathematics education
Ontology vocabularies abound to express the organization of educational
materials for general delivery of content, assessment of skills, and indications of
prerequisite knowledge. However, we found ourselves unexpectedly lacking
RDF models for sequencing educational materials in
mathematics. Looking outside of RDF vocabularies for education, though, we found
some impressively complex ontologies of mathematics concepts, which could be
associated with educational concepts using the subject-predicate-object construction
of RDF triple-stores. One very detailed mathematics ontology we have
found so far is OntoMathPro, developed by a research group at Kazan Federal University
(Russia): https://ontomathpro.org/. The developers write, “We are going to
create an ecosystem of datasets and mashups around the ontology,” which
suggests use in modeling mathematical applications.
We think of RDF for mathematics education as a network that students, teachers, and
their assistants (both human and machine) will traverse in multiple directions, and
through which there is not just one simple linear path of progression. Structuring
these triple-stores gives us a basis for organizing math concepts with educational
content and curricular activities.
We have chosen to separate our local BIL data into several distinct buckets for the
purposes of prototyping. These categorizations are not entirely superficial, however.
Given the expected size of our data set, we sought value in separating each type of
data
into its own collection for maintenance and governance purposes. Our collections are:
- curriculum.rdf, which stores all linked-list style containers
  for representing curricula and lesson structure. Conceivably, we would house
  custom curriculum data in its own collection. Realistically, we could expect
  to break this out into multiple collections, possibly along program or
  product lines.
- elements.rdf, which defines all of our custom data types for
  competencies, learning resource types, and curriculum container types (such as
  lesson, section, and chapter),
  along with custom relationship types, depth-of-knowledge (DOK) alignment
  classes, and any other proprietary data types. This collection effectively
  houses the custom BIL namespace.
- learningObjects.rdf, which stores all instances of learning
  resources. Again, we will probably find ourselves in a position where we
  need to separate along resource types for maintenance and processing
  purposes (see the sketch after this list). To provide a little context into
  the data volumes we expect, our
  assessment question bank alone represents well over 1,500,000 entries. We can
  realistically expect two to three times that volume in ancillary, consumable
  resources after a few years of production, not to mention multimedia assets,
  interactive tools and widgets, as well as static content modules, which
  typically number between 500 and 1,000 per grade.
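The per-collection split also keeps bulk maintenance queries simple. As a sketch (with collection path and LRMI namespace URI assumed), profiling resource volumes by declared type might look like this:
xquery version "3.1";
(: Sketch: count learning resources per declared type, largest first. :)
declare namespace rdf  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace lrmi = "http://purl.org/dcx/lrmi-terms/";

let $resources := collection("/db/bil/learningObjects")//lrmi:learningResource
for $type in distinct-values($resources/lrmi:learningResourceType/@rdf:resource)
let $count := count($resources[lrmi:learningResourceType/@rdf:resource = $type])
order by $count descending
return $type || ": " || $count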
The ontologies that we have chosen to implement alongside our custom BIL namespace
are SKOS (Simple Knowledge Organization System) and the LRMI (Learning Resource
Metadata Initiative) Metadata Specification. The SKOS namespace provides the base
concept class, upon which we can construct our competency
class, along with several semantically expressive labels, such as
prefLabel, editorialNote, altLabel, and hiddenLabel.
Availability of multiple label tags is important not only
for editorial and maintenance purposes, but also for providing alternate names for
competencies. Our labelling scheme will likely be leveraged in search functions, and it
should allow us a certain degree of flexibility to support custom nomenclature for
specific state customizations. Additionally, the canonical prefLabel allows
us to provide competency names in multiple languages. The following is an example of our
base competency definition:
<rdfs:Class rdf:ID="competency">
<rdfs:subClassOf rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
<skos:prefLabel xml:lang="en-US">Competency</skos:prefLabel>
<skos:definition>Root competency class. This should be
extended by competency subclasses.</skos:definition>
</rdfs:Class>
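A hypothetical instance of this class shows how the label properties come together, including a Spanish prefLabel; the bil prefix and the competency ID are placeholders:
<bil:competency rdf:ID="competency/addition-within-100">
  <skos:prefLabel xml:lang="en-US">Addition within 100</skos:prefLabel>
  <skos:prefLabel xml:lang="es">Suma hasta 100</skos:prefLabel>
  <skos:altLabel xml:lang="en-US">Two-digit addition</skos:altLabel>
  <skos:hiddenLabel xml:lang="en-US">adding two digit numbers</skos:hiddenLabel>
  <skos:editorialNote>Hypothetical entry for illustration only.</skos:editorialNote>
</bil:competency>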
Another important vehicle provided by SKOS is the base
relationship class that we use to build our Prior Knowledge bridge. Even with SKOS’s
extensive collection of transitive and hierarchical relationships, we thought it
appropriate to define a custom extends relationship:
<rdfs:Class rdf:ID="extends">
<rdfs:subClassOf rdf:resource="http://www.w3.org/2004/02/skos/core#related"/>
<skos:prefLabel xml:lang="en-US">Extends</skos:prefLabel>
<skos:closeMatch rdf:resource="https://ceds.ed.gov/element/000869/#Prerequisite"/>
<skos:definition>
A semantic relationship to show that a concept, skill, or strategy
'builds upon' another competency. Implies a logical 'requirement',
and is disjoint with 'dc:requires', i.e., a competency either
'dc:requires' or 'bil:extends' another competency, but not both.
</skos:definition>
</rdfs:Class>
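Asserted as a property on a competency instance, the relationship might read as follows; the competency IDs echo the Grade 2 decomposition above and are hypothetical:
<bil:competency rdf:ID="competency/two-step-word-problems-within-100">
  <skos:prefLabel xml:lang="en-US">Two-step word problems within 100</skos:prefLabel>
  <!-- prior-knowledge dependencies: this competency builds upon these two -->
  <bil:extends rdf:resource="competency/addition-within-100"/>
  <bil:extends rdf:resource="competency/subtraction-within-100"/>
</bil:competency>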
The last major components we are using from the SKOS
namespace are the OrderedCollection class and the memberList property,
which allow us to store lists of links to other container resources. The LRMI namespace
provides us with learningResource and learningResourceType,
which we use to define our resource instances (colloquially referred to as Learning
Objects), as well as to provide structure for storing our curriculum data. We see here
an example of how we are storing curriculum data as a list-like container.
<lrmi:learningResource rdf:ID="learningResource/curriculum/NA/algebra-1">
<lrmi:learningResourceType rdf:resource="learningResource/type/curriculum"/>
<dc:title>Big Ideas Learning Algebra 1</dc:title>
<skos:OrderedCollection>
<skos:memberList>
<lrmi:learningResource rdf:resource="Select Resource Type/my-lesson-plan"/>
<lrmi:learningResource rdf:resource="learningResource/curriculum/NA/algebra-1/chapter/2"/>
</skos:memberList>
</skos:OrderedCollection>
</lrmi:learningResource>
In the above example, we note that the curriculum object
is an lrmi:learningResource, with a learningResourceType
attribute, which points to a definition in elements.rdf, and an
OrderedCollection container from the SKOS namespace, which contains
references to child container lists.
RDF/XML and the development of the competency graph
Early in the design phase, we needed a mechanism to regulate
and standardize both our data structures and the semantic relationships between them.
By
leveraging RDF vocabularies such as LRMI and SKOS, we were able to design an XML data
schema that provided consistent relationships, meaningful tag names, and highly
structured collections upon which we could apply validation to ensure consistency
and
homogeneity when creating our initial data sets.
In addition to its advantages for validation, structuring our data in RDF/XML allows
us to explore a number of relational representations not possible in a SQL environment.
This flexibility, coupled with the ability to use attributes such as rdf:resource as
pointers, or weak foreign-key references, allowed us
to organize our data into simple collections with shallow hierarchies in an
easily readable and highly queryable state.
RDF also allows us to express our data in a robust, sustainable vernacular that
requires little transformation between persistent data and its natural language
origin. The ability for us to capture contextual, lexical, and pedagogical metadata
in a human-readable format should empower our internal subject matter experts to
work in data much closer to the persistence layer, which, in turn, helps to increase
our data transparency and accuracy by reducing the amount of transformation that our
data must undergo between entry and storage.
Implementation challenges
The challenge of resource identification and referencing
Creating links to resources poses a serious challenge for the storage of our
curriculum data.
In our system, a learning object resource is simply a pointer to a digital asset, and
we have chosen to make each container’s member list a collection of pointers to resource
identifiers:
<lrmi:learningResource rdf:ID="learningResource/curriculum/NA/algebra-2">
<lrmi:learningResourceType rdf:resource="learningResource/type/curriculum"/>
<dc:title>Big Ideas Learning Algebra 2</dc:title>
<skos:OrderedCollection>
<skos:memberList>
<lrmi:learningResource rdf:resource="learningResource/curriculum/NA/algebra-1/chapter/1"/>
<lrmi:learningResource rdf:resource="learningResource/curriculum/NA/algebra-1/chapter/2"/>
</skos:memberList>
</skos:OrderedCollection>
</lrmi:learningResource>
And, for one of those resources listed within the
container, we have another entry, like this:
<lrmi:learningResource rdf:ID="learningResource/curriculum/NA/algebra-1/chapter/1">
<lrmi:learningResourceType rdf:resource="learningResource/curriculum/chapter"/>
<dc:title>Chapter 1: The 0th Chapter</dc:title>
<skos:OrderedCollection>
<skos:memberList>
<lrmi:learningResource rdf:resource="learningResource/curriculum/NA/algebra-1/chapter/1/lesson/1.1"/>
</skos:memberList>
</skos:OrderedCollection>
</lrmi:learningResource>
This implementation should allow us to store each
container as a free node, rather than being structurally and intrinsically bound to
a
single containing resource. This is important when we take customization support into
consideration: A single BIL lesson may be referenced by
n containers, and we do not want to have to search the entire collection
of customized curricula for every instance of this single lesson resource when changes
are made. Instead, we store it as an independent node and link to the resource
wherever it needs to be included. This way, should we need to make changes to the source
node, we update it in one place, and all references pull the updated data. As we look
at an entry for a single resource (not a container), it should be noted that there
are
two vital pieces of information that need to be captured here:
- The resource’s RDF ID: this is how the database knows to reference and
  locate a specific resource.
- The digital resource ID, which will be passed to a content server for
  retrieval within Big Ideas Learning platforms (a sketch of such an entry
  follows this list).
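A leaf resource entry capturing both identifiers might look like this sketch, where bil:assetId is a hypothetical property holding the digital resource ID that the content server resolves:
<lrmi:learningResource rdf:ID="learningResource/mediaElement/video/quadratic-equation-intro">
  <lrmi:learningResourceType rdf:resource="learningResource/type/video"/>
  <dc:title>Introducing the Quadratic Equation</dc:title>
  <!-- digital resource ID, passed to a content server for retrieval -->
  <bil:assetId>6077473635001</bil:assetId>
</lrmi:learningResource>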
Canonically, we recognize that an IRI represents a
unique, resolvable address
to a resource. We chose not
to use the physical resource ID as the RDF ID for the following reasons:
- Increased flexibility when integrating with our content servers. By providing a system
  ID instead of an absolute IRI, we eliminate the need to host content in a static location.
  For example, if we had hardcoded the absolute address of a video resource, moving it
  to a new cloud host would require us to update the IRIs on all affected video resources.
- Resources represented in RDF are just pointers. BIL houses its digital resources
  across multiple CDN delivery systems, so any application that integrates with
  our content base will do so through a content proxy.
  This layer of abstraction allows our applications to be agnostic to
  the content, since it is the responsibility of the content proxy to
  fetch and return the target resource.
- Semantic, human-readable resource IDs assist in data maintenance and
  governance. By implementing IRIs as semantic paths to each entity, we gain
  important contextual awareness of each IRI within the greater namespace. For
  example, a UUID such as 6afc32b7-8b73-4b42-bb03-08af18ab5655
  ensures uniqueness but
  neither provides nor receives context or purpose from the identifier itself.
  If we instead rely on a namespace hierarchy, such as
  Arjuna/LearningObject/MediaElement/Video/6077473635001, we
  can ensure uniqueness through validation of each segment of the hierarchy
  (i.e., unique values at any given depth), while retaining context and
  purpose through the IRI; a sketch of such a check follows this list. It
  becomes possible, then, to ensure that an entity
  has consistent, singular representation across multiple systems while
  retaining contextual value within the ID itself. In essence, by creating
  'resource namespaces' within our learning object data, we provide another
  measure of organization and control.
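A sketch of the segment-level uniqueness check mentioned above, written in XQuery with an assumed database root, reports any rdf:ID asserted more than once:
xquery version "3.1";
(: Sketch: report rdf:ID values that are not unique across the database. :)
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

let $ids := collection("/db/bil")//@rdf:ID/string()
for $id in distinct-values($ids)
where count($ids[. = $id]) gt 1
return "duplicate ID: " || $id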
This implementation decision is the product of much internal debate and research,
and
represents what we feel is the most stable, scalable solution to the challenge of
naming
and identifying resources. We welcome any insight or
observations into improving our resource identification and storage
mechanisms.
Exploring an XML database for content management supported by RDF/XML
One of the biggest challenges our team faces is determining the most appropriate
technology stack for a production-grade application. Early development and prototyping
have shown the XML database eXist-db to be more than capable in terms of data manipulation,
serving, and storage. However, we do face some significant constraints.
- Tech team background: XQuery's FLWOR expressions represent a significant shift
  from the engineering team's prior experience with data manipulation and processing.
  We learned from early prototyping that development velocity lags when designing
  and writing more complex filtering and querying, due to the team’s unfamiliarity
  with XQuery. This issue has been partially remedied by the development of a custom
  JavaScript API which allows our developers to interact with eXist-db in a context
  closely resembling the Fetch API.
- Hardening and scaling: While the challenges of maintaining, tuning, scaling, and
  securing a new database technology are not unique to eXist-db, the engineering team
  is not currently equipped to absorb all facets of securing and scaling a new database
  technology. Enterprise partnership and support would help to mitigate this concern
  as we seek a scalable database solution.
- Security: Authentication at the server level may present a challenge in implementing
  secure write operations from external client requests. This is an ongoing area of
  investigation.
RDF can be serialized in many different ways, and XML is one of the oldest. Nowadays
it is certainly more common
to see expressions of RDF in Turtle syntax or JSON-LD, yet RDF itself can be shared
in multiple formats as needed.
The authors have been writing and modeling RDF in XML (rather than Turtle or JSON-LD)
for the following reasons:
- Legibility: RDF written in well-formed XML is precise and legible in representing
  relationships among resources via attributes on the XML element tree. This should
  be easy for the core team to write and maintain as a central source of truth for
  the conceptual organization of the project.
- Validation: Maintaining the conceptual RDF framework at the core of the project
  requires validation more precise than simply checking for correct use of RDF
  vocabularies. Checking against the semantic web of linked data standards on its
  own can be served by the W3C validating services at
  https://www.w3.org/RDF/Validator/ or https://www.w3.org/2015/03/ShExValidata/.
  However, validation needs to be customized much more precisely to keep
  relationships simple, to control the use of appropriate namespaces, to delimit
  acceptable values and ranges, and to define valid datatypes where needed. For
  this purpose, we are exploring powerful validation tools such as Relax NG and
  Schematron (see the sketch after this list).
- Querying and transformation: We are exploring XPath, XQuery, and XSLT as tools
  for precise querying, as well as for serializing data in the syntaxes we need to
  interact with multiple web services.
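As flagged in the validation item above, a small Schematron sketch suggests the flavor of such customized checks; the LRMI namespace URI and the specific constraints are illustrative rather than production rules:
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="rdf" uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
  <sch:ns prefix="lrmi" uri="http://purl.org/dcx/lrmi-terms/"/>
  <sch:pattern>
    <sch:rule context="lrmi:learningResource[@rdf:ID]">
      <!-- every resource must declare a type -->
      <sch:assert test="lrmi:learningResourceType/@rdf:resource">
        A learning resource must declare a learningResourceType.
      </sch:assert>
      <!-- IDs must live in the learningResource namespace hierarchy -->
      <sch:assert test="starts-with(@rdf:ID, 'learningResource/')">
        Resource IDs must begin with 'learningResource/'.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>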
Having begun work with RDF/XML for these reasons, we are aware that we can serialize
it as JSON or JSON-LD,
which gives us a wide range of considerations for how best to deploy a system based
on our abstract data model.
If our RDF/XML serves as an index and central nexus point for coordinating access
to
resources, the BIL tech team will need to query it regularly, and of course
RDF/XML can be transformed for querying into JSON, JSON-LD, or GraphQL. A web
service running XSLT 3.0 that mediates between JSON and XML might simplify validation
and maintenance,
and serialize the database outputs as needed for querying.
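A sketch of that mediation in XQuery 3.1 (which, like XSLT 3.0, can serialize maps and arrays as JSON); the collection path and namespace URIs are assumptions:
xquery version "3.1";
(: Sketch: expose one curriculum container's member list as JSON. :)
declare namespace rdf  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace dc   = "http://purl.org/dc/elements/1.1/";
declare namespace skos = "http://www.w3.org/2004/02/skos/core#";
declare namespace lrmi = "http://purl.org/dcx/lrmi-terms/";

let $c := collection("/db/bil/curriculum")
          //lrmi:learningResource[@rdf:ID = "learningResource/curriculum/NA/algebra-1"]
return serialize(
  map {
    "id"      : string($c/@rdf:ID),
    "title"   : string($c/dc:title),
    "members" : array { $c//skos:memberList/lrmi:learningResource/@rdf:resource/string() }
  },
  map { "method" : "json", "indent" : true() }
)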
We close, then, with these questions:
- Is RDF/XML the best format for legible declarative expression of our data
  structure with robust schema validation? Or are JSON expression, validation,
  and querying comparable and sufficient for BIL’s requirements?
- Is an XML database actually necessary for us, even if we are expressing our
  abstract data model and structure in RDF/XML?
- If we continue to work with RDF/XML at the core of our system architecture,
  should we serialize it in a JSON output format for database implementation?
Thanks to XPath, XSLT, and XQuery 3 specifications, we know that we
can now transform XML to JSON, and JSON to XML, which gives us a wide range of
database options to consider implementing in the BIL technology stack. Going
forward, we need to evaluate these decisions based not only on a continually
evolving technology landscape, but also on the particular resources, technology
requirements, and implementation needs of BIL and the community it serves.