Introduction and Pre-History
A long, long time ago in a galaxy much like our own, one of the authors, Ken Sall, was fascinated by the number of musicians who had played with John Mayall. [1] Armed only with paper, pencil and lots of record album jackets, he drew a diagram with Mayall at the center and arcs pointing to musicians who played with him, such as Eric Clapton, Peter Green, Mick Taylor, etc. Then he drew arcs from these musicians to others they had played with. Soon he reached the limit of the paper. Since this was prior to the emergence of the PC, there was no option to scroll or easily copy to a larger drawing area. But Ken held onto this labor of trivia for nearly 40 years. (See Figure 1.) Meanwhile, Pete Frame developed far more detailed and visually appealing rock family trees Frame 1983, some of which appeared in various publications such as Rolling Stone magazine and the Encyclopedia of Rock.
Nearly everyone is familiar with the Six Degrees of Kevin Bacon trivia game Bacon. The Bacon number is based on a game in which actors are rated by their distance from the actor Kevin Bacon. Actors who appeared in a movie with Kevin Bacon are given a Bacon number of 1, and actors who have appeared only in movies with actors having a Bacon number of 1 are given a Bacon number of 2. Each degree of separation increases the Bacon number by 1. The object is to connect any movie star to Mr. Bacon in no more than six hops, based on two people appearing in either the same movie or commercial. This is a popular special case of the more general Six Degrees of Separation Six Degrees, also known as the Human Web. The idea is that any two people on the planet can be connected by no more than six hops. According to Wikipedia, "Mathematicians use an analogous notion of collaboration distance".
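The Bacon-number computation described above is simply a breadth-first search over a collaboration graph. A minimal sketch in Python (the toy graph and function name are our own illustration, not data from any of the sources discussed here):

```python
from collections import deque

def degrees_of_separation(graph, start):
    """Breadth-first search assigning each person a 'Bacon number'
    relative to the start node: 0 for the start, 1 for direct
    collaborators, and so on."""
    numbers = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for peer in graph.get(person, ()):
            if peer not in numbers:
                numbers[peer] = numbers[person] + 1
                queue.append(peer)
    return numbers

# Toy collaboration graph (who appeared with whom).
collaborations = {
    "Kevin Bacon": ["Actor A"],
    "Actor A": ["Kevin Bacon", "Actor B"],
    "Actor B": ["Actor A", "Actor C"],
    "Actor C": ["Actor B"],
}
print(degrees_of_separation(collaborations, "Kevin Bacon"))
# → {'Kevin Bacon': 0, 'Actor A': 1, 'Actor B': 2, 'Actor C': 3}
```

The same search, started from Eric Clapton instead of Kevin Bacon, yields the "Eric Clapton number" discussed later.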
Problem Statement
Using semantic technology, we can certainly improve upon hand copying of data from record album jackets. If we can refer to a recording artist with all the associations to other musicians and to the albums they appeared on together, we can produce more complete graphs than what is shown in Figure 1. As avid fans of blues and rock music, we wondered if we could construct SPARQL queries to examine properties and relationships between performers in order to answer global questions such as "Who has had the greatest impact on rock music?" Our primary focus was Eric Clapton, a musical artist with a decades-spanning career who has enjoyed both a very successful solo career as well as having performed in several world-renowned bands. Then using Drupal and SVG to visualize the results, we could traverse the musician graphs in a straightforward manner.
This paper explores the use of DBpedia and MusicBrainz data sources using OpenLink Virtuoso Universal Server with a Drupal frontend. Much attention is given to the challenges we encountered, especially with respect to community-entered open data sources and the strategies we employed to overcome the challenges. One such challenge we encountered was that there were several properties by which an artist could be connected to another and the semantics were not well-defined, as discussed later.
We should be able to draw inferences from the data. According to Dean Allemang and Jim Hendler:
In the context of the Semantic Web, inferencing simply means that given some stated information, we can determine other related information that we can also consider as if it had been stated.
For example, ultimately we plan to use RDF and SPARQL to address questions such as these:
- Which recording artist has directly played with the most musicians?
- Which recording artist has the most connections within six degrees?
- Which musician has been a session man for the most artists?
- Which recording artist was most active during a particular decade?
- Among all artists of a particular genre, who has played with the most other musicians?
- Which rock artist's extended graph contains the most other artists within 2 degrees? 3 degrees? 4 degrees?
- Who has appeared on the most albums?
- If we weight results by the length of time a band stays together, how does that impact other queries?
- Can we distinguish between legitimate releases and unofficial releases?
- What additional inferences can be made when multiple graphs are queried?
- Can results be corroborated by comparing them to ground truths (e.g., the chart data documented in Joel Whitburn's Billboard books)?
- Which musician-related properties are reversible (i.e., for which does the inverse make sense)?
- How can we differentiate between a musician's playing in a band, being associated with other musicians, starring together in a live show, and others collaborating with the musician?
- Does the total number of songs or albums released correlate with other measures of success?
- Who created the most songs?
- Which song has been recorded the most times by different artists? ("Yesterday" and "White Christmas" are typically cited.)
- Is there a predominant record label in the music world?
- Which solo artist has had the longest career?
- Which band has been together (in some form) the longest time?
- What is the average age of a musician when he/she first joins a band?
- For bands with changing membership, can we conclude which configuration lasted the longest?
- What is the "Eric Clapton number" (a la Kevin Bacon number) for various musicians?
- Can we use our own knowledge of Eric Clapton to clarify some of the semantics behind some of the RDF data we encounter?
Our notion of a musical artist's activity and impact can be explained in general terms. We consider activity to be correlated with the number of recordings produced. We chose not to consider an artist's concert performances unless they resulted in a commercially available recording. An artist's impact is more subjective: the greater the number of musicians that play with a performer (that is, the greater the number of associations), the greater the performer's potential impact, provided that the performer is not simply a session man. Another measure that would have proved extremely helpful in determining both activity and impact in a more quantitative manner is Billboard chart data. Unfortunately, use of Billboard data is not royalty free.[2]
Furthermore, our working definitions of activity and impact are based primarily on an artist's commercial recordings. We acknowledge that some artists perform numerous live concerts and yet have produced relatively few commercial recordings; many of these concerts may be available on bootleg recordings. The argument could easily be made that the number of concerts and/or the number of bootleg recordings is correlated with activity and/or impact. We have chosen not to consider these factors at this time.
Data Sources
Fortunately, in 2011 there is a tremendous wealth of information about musicians freely available on the Web in structured markup, especially triples, which lends itself to SPARQL queries. For example, Wikipedia tells us that Eric Clapton Clapton is associated with: Dire Straits, The Yardbirds, John Mayall & the Bluesbreakers, Powerhouse, Cream, Free Creek, The Dirty Mac, Blind Faith, J.J. Cale, The Plastic Ono Band, Delaney, Bonnie & Friends, Derek and the Dominos, and T.D.F. By comparison, Ringo Starr Starr is associated with The Beatles, Rory Storm and the Hurricanes, Ringo Starr & His All-Starr Band, and Plastic Ono Band. On the surface, Ringo's direct associations are fewer, but Ringo Starr & His All-Starr Band has had 11 different lineups (to date) with a total of 42 unique musicians, most of whom have a number of associations as well. Ringo Starr & His All-Starr Band
The various data sources we leveraged are discussed in the following subsections.
Wikipedia and Infoboxes
The primary (although indirect) source for our RDF data was Wikipedia which is a major source of detailed information about musicians, among many other things. To understand both the kinds of properties available and their open community origins, some details about Wikipedia are in order.
As illustrated by the excerpt from Clapton's Wikipedia page below, the main page for a musical artist contains an abstract at the top, a contents navigation box below the abstract with a variable number of section links pointing to the main content of the page, and a so-called infobox in the upper right. In addition to a main page, most musical artists with more than a handful of albums or singles have a separate discography page with varying amount of detail and organization regarding studio albums, live albums, compilations, singles, etc. The discography is linked to the musician's main page and vice versa.
An infobox is a fixed-format table designed to be added to the top right-hand corner of Wikipedia articles to consistently and concisely present a summary of common aspects that articles of the same category (e.g., musical artist) share, as well as to improve navigation to other interrelated articles (e.g., music genres). Infoboxes are an instance of MediaWiki's template feature; there are numerous infobox templates arranged by broad categories such as arts and culture infobox templates, which is further divided into 10 subcategories including templates for film, fictional characters, and music. There are over 50 templates in the subcategory music infobox templates, of which the most relevant to our work is the template for the infobox of musical artists. (See the right side of Table 1 below.) This infobox template is used by Wikipedia authors to create infoboxes such as Clapton's, shown in the left side of the table. The correspondence between the template and the resultant infobox is apparent when viewed side by side. Most of the properties that we used in our queries (e.g., name, genres, associated acts) are based on the contents of the infobox, with the notable exception of albums (from the discography page).
Table 1
The wiki source markup of the infobox is shown in Figure 3. For a much more complete explanation of the Wikipedia extraction process employed by DBpedia including a discussion of the design and development of infobox templates, see Auer and Lehmann 2007 and Auer et al 2007.
DBpedia
One of the two primary datasets we used was DBpedia 3.6 en (English), based on Wikipedia dumps from October/November 2010. DBpedia DBpedia Dataset is a community effort to provide sophisticated query access to the structured content of Wikipedia, thereby allowing a small group of researchers and developers to enhance Wikipedia by linking to additional datasets. The DBpedia 3.6 release announcement describes the content in detail:
"The new DBpedia dataset DBpedia Release describes more than 3.5 million things, of which 1.67 million are classified in a consistent ontology, including 364,000 persons, 462,000 places, 99,000 music albums, 54,000 films, 16,500 video games, 148,000 organizations, 148,000 species and 5,200 diseases. The DBpedia dataset features labels and abstracts for 3.5 million things in up to 97 different languages; 1,850,000 links to images and 5,900,000 links to external web pages; 6,500,000 external links into other RDF datasets, and 632,000 Wikipedia categories. The dataset consists of 672 million pieces of information (RDF triples) out of which 286 million were extracted from the English edition of Wikipedia and 386 million were extracted from other language editions and links to external datasets."
The DBpedia Ontology DBpedia Ontology is a cross-domain ontology created from the most commonly used infoboxes within Wikipedia. The ontology currently covers over 272 classes with 1,300 different properties and 1.67 million resources, 364,000 of which are Persons. DBpedia predicate IRIs [3] begin with either http://dbpedia.org/ontology or http://dbpedia.org/property; entities are designated by http://dbpedia.org/resource.
We initially used a simple SPARQL query SPARQL 1.0 to determine which properties are relevant to a musician, as shown in the following two listings. Some musicians had fewer than the 23 properties shown; others had more. Note that the resource IRI designating the musician is a straightforward rendering of the artist's name in English, with underscores replacing spaces: http://dbpedia.org/resource/Eric_Clapton. In a later section, we discuss alternative property-related queries with different results.
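The IRI convention and the property-listing query can be expressed as a small Python sketch (the helper names are ours; only the name-to-IRI rendering is documented behavior, per the description above):

```python
def dbpedia_resource(artist_name):
    # DBpedia resource IRIs render the English article title,
    # with underscores replacing spaces.
    return "http://dbpedia.org/resource/" + artist_name.replace(" ", "_")

def property_query(artist_name):
    # Build a query listing the distinct predicates attached to
    # the artist's resource.
    return ("SELECT DISTINCT ?property WHERE { <%s> ?property ?value } "
            "ORDER BY ?property" % dbpedia_resource(artist_name))

print(dbpedia_resource("Eric Clapton"))
# → http://dbpedia.org/resource/Eric_Clapton
```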
Wikipedia-like categories can also be specified in queries. The following query returns the short list of musical artists who have three distinctions: Grammy Award winners, Rock and Roll Hall of Fame inductees, and MTV Video Music Awards winners.
SELECT ?allAwards {
  ?allAwards <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Grammy_Award_winners> .
  ?allAwards <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Rock_and_Roll_Hall_of_Fame_inductees> .
  ?allAwards <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:MTV_Video_Music_Awards_winners>
} ORDER BY ?allAwards
The thrice-honored, distinguished musical artists are: [4]
http://dbpedia.org/resource/Aerosmith
http://dbpedia.org/resource/Bruce_Springsteen
http://dbpedia.org/resource/Elton_John
http://dbpedia.org/resource/Eric_Clapton
http://dbpedia.org/resource/Johnny_Cash
http://dbpedia.org/resource/Madonna_%28entertainer%29
http://dbpedia.org/resource/Metallica
http://dbpedia.org/resource/Michael_Jackson
http://dbpedia.org/resource/R.E.M.
http://dbpedia.org/resource/The_Beatles
http://dbpedia.org/resource/The_Rolling_Stones
http://dbpedia.org/resource/U2
http://dbpedia.org/resource/Van_Halen
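Such category queries can also be issued programmatically against DBpedia's public SPARQL endpoint, which accepts the query and the desired result format as ordinary GET parameters. A minimal sketch (the endpoint URL and the `format` parameter follow the Virtuoso endpoint's conventions; the helper function is our own, and only one of the three category patterns is shown):

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"

def sparql_request_url(query, fmt="application/sparql-results+xml"):
    # Encode the query and result format as GET parameters,
    # as the public DBpedia endpoint expects.
    return DBPEDIA_ENDPOINT + "?" + urlencode({"query": query, "format": fmt})

awards_query = """SELECT ?allAwards {
  ?allAwards <http://purl.org/dc/terms/subject>
      <http://dbpedia.org/resource/Category:Grammy_Award_winners>
} ORDER BY ?allAwards"""

url = sparql_request_url(awards_query)
print(url[:60])
```

Fetching the resulting URL (e.g., with urllib.request) would return the result set in SPARQL XML, the format our visualization code consumes later.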
The full list of predicates relating to MusicalArtist is quite large. The figure below is a partial view of the DBpedia "About: Eric Clapton" page. The IRI http://dbpedia.org/resource/Eric_Clapton forwards to this page.
MusicBrainz
MusicBrainz MusicBrainz is another comprehensive public database for musician information; it contains detailed data about artists, release groups, releases, tracks, record labels and many relationships between them. As described in the database overview, MusicBrainz defines artist attributes including the artist's name, aliases, GUID, annotation, type (individual or group), and begin and end dates (birth/death of an individual or formation/disbanding for a group). A release group is a logical grouping of variant releases (deluxe/limited editions, reissues/remasters, international variations, box sets, etc.). One example is the release group describing several variants of Clapton's 461 Ocean Boulevard album. Each release group is identified by type (album, single, EP, compilation, soundtrack, live, etc.), as well as by artist, title, and ID. An individual release has the same properties as a release group as well as status (official, bootleg, etc.), Amazon ASIN, annotation, language, release event (date, country, format, label, etc.) and more. Each track has properties such as duration (in milliseconds) and PUID (the MusicIP acoustic fingerprint identifier for the track) as well as artist, title, and ID. Record labels are also described with a number of detailed attributes.
Every MusicBrainz ID (MBID) is a permanent GUID. There is a direct relationship between an artist's MBID, an IRI that identifies the artist, and a web page that collects information about the artist. For example:
- Eric Clapton's MBID is 618b6900-0618-4f1e-b835-bccb17f84294
- Eric Clapton's IRI is http://musicbrainz.org/mm-2.1/artist/618b6900-0618-4f1e-b835-bccb17f84294 (points to basic RDF metadata)
- His MusicBrainz page is http://musicbrainz.org/artist/618b6900-0618-4f1e-b835-bccb17f84294.html
Below is a screenshot of the Eric Clapton MusicBrainz page. MusicBrainz offers a tremendous amount of detail about releases as exemplified by the page for Clapton's classic From the Cradle blues album.
MusicBrainz also provides a web service interface to its extensive database. Requests may be English-like such as http://musicbrainz.org/ws/1/artist/?type=xml&name=Eric+Clapton or may contain MBIDs such as http://musicbrainz.org/mm-2.1/album/1301e027-b038-4017-9b4c-7655bff78f6b for the "461 Ocean Boulevard" album, shown below. (Release dates of 1974 and three from 1996 are collapsed in the screenshot.)
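The MBID-to-URL relationships above are mechanical enough to sketch in code (the URL patterns are those shown above; the helper function and the GUID validation are our own):

```python
import re

# A MusicBrainz MBID is a GUID: five groups of lowercase hex digits.
MBID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")

def artist_urls(mbid):
    """Derive the RDF-metadata IRI and the human-readable page
    for an artist from its permanent MusicBrainz ID."""
    if not MBID_PATTERN.match(mbid):
        raise ValueError("not a MusicBrainz GUID: %r" % mbid)
    return {
        "rdf": "http://musicbrainz.org/mm-2.1/artist/" + mbid,
        "page": "http://musicbrainz.org/artist/" + mbid + ".html",
    }

print(artist_urls("618b6900-0618-4f1e-b835-bccb17f84294")["page"])
# → http://musicbrainz.org/artist/618b6900-0618-4f1e-b835-bccb17f84294.html
```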
MusicBrainz maintains numerous database statistics, a small sampling of which appears below. It is interesting to note that in the three months since the draft version of this paper was written, although the number of Recordings increased by approximately 37,000, the Artist count dropped by roughly 8,000. This illustrates the dynamic nature of the data, both in terms of quantity and in how recordings are categorized. (For example, artists may have appeared under variant name spellings.) We could not find a definition for the new term Works in the terminology page or other MusicBrainz documentation.
Table II
Artists | 612,428 |
Release Groups | 787,918 |
Releases | 952,743 |
Disc IDs | 460,876 |
Recordings [tracks] | 10,307,311 |
Labels | 52,156 |
Works | 276,864 |
Relationships [links] | 3,160,096 |
The MusicBrainz data quality page states that one of the goals is "Establish a method to determine the quality of an artist and the releases that belong to that artist. This provides consumers of MusicBrainz a clue about the relative quality rating of the data in the database." The page explains the connection between the quality number, voting periods, and strictness regarding edits.
To accomplish these goals, this feature will allow editors to indicate the quality for a given artist. An artist can be of unknown, low, medium or high data quality. The data quality indicator determines what level of effort is required to change the artist information or to add/remove albums from an artist. An artist with unknown or medium quality will roughly require the amount of effort that MusicBrainz currently requires to edit the database. An artist with low data quality will make it easier to add/remove albums or to change the artist information (name, sortname, aliases). And an artist with high data quality will require more effort to add/remove albums or to change the artist information. The data quality concept also applies to releases in the same manner. Changing a release with low data quality will be easier than changing a release with high data quality.
Freebase
Initially we considered using music data from Freebase Freebase, an open, community-based, Creative Commons licensed repository of structured data describing millions of entities (e.g., a person, place, or thing) which are connected as a graph. Freebase IDs can be used to uniquely identify entities from any web-reachable source. At the time of this writing, the 1.22 GB (uncompressed) music segment of Freebase contained 9 million topics and 33 million facts. The music category contains classical, opera, and many other genres in addition to blues and rock music. Data is formatted as approximately 50 separate Tab Separated Values files with filenames such as group_member.tsv, group_membership.tsv, musical_group.tsv, guitarist.tsv, album.tsv, artist.tsv, release.tsv, and track.tsv (the largest file at roughly 671 MB). A total of over 151,000 groups are listed in musical_group.tsv. [5]
Freebase data can be used in several ways. MQL (Metaweb Query Language) MQL is a query API analogous to SPARQL for RDF. MQL uses JSON objects as queries via standard HTTP requests and responses. For example, the IRI below will return all the genres associated with Eric Clapton (line breaks added for readability).
http://api.freebase.com/api/service/mqlread?query={%20%22query%22: {%20%22active_start%22:null,%20%22genre%22:[],%20%22name%22:%22Eric%20Clapton%22, %20%22type%22:%22/music/artist%22%20}%20}

This produces the following result, indicating all the genres associated with Eric Clapton:
We will see later that this list of genres differs from other data sources.
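The shape of that MQL request can be reconstructed programmatically: the query is a JSON envelope in which null asks for a single value and [] asks for a list, URL-encoded into the mqlread service's query parameter. A sketch (the service URL is the one cited above; the helper function is our own):

```python
import json
from urllib.parse import quote

def mql_genre_query(artist_name):
    # MQL convention: None (null) requests a value, [] requests a list;
    # the filled-in fields constrain the match.
    envelope = {"query": {
        "active_start": None,
        "genre": [],
        "name": artist_name,
        "type": "/music/artist",
    }}
    return ("http://api.freebase.com/api/service/mqlread?query="
            + quote(json.dumps(envelope)))

url = mql_genre_query("Eric Clapton")
print(url[:60])
```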
For any given group, the membership entries are not contiguous within the group_membership.tsv file. The relevant lines for the members of Blind Faith are collected in the table below.
Table III
id | member | group | (role) | start | end |
---|---|---|---|---|---|
/m/01tfhrr | Ginger Baker | Blind Faith | | 1968 | 1969-10 |
/m/01tfhry | Steve Winwood | Blind Faith | | 1968 | 1969-10 |
/m/01wvwr8 | Ric Grech | Blind Faith | | 1968 | 1969-10 |
/m/01tfhrk | Eric Clapton | Blind Faith | | 1968 | 1969-10 |
The group_membership.tsv file contains dozens of entries for Clapton, one line for each band or other association in which he participated. Each entry (line) is identified by a different ID. For example:
/m/01t73cp Eric Clapton Derek and the Dominos
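Because a group's rows are scattered through the file, collecting a band's membership means filtering the whole TSV stream. A sketch (the sample lines are modeled on the table above, not copied from the actual file; the helper name is ours):

```python
import csv
import io

# Representative lines in the shape of group_membership.tsv
# (id, member, group, start, end) -- illustrative, not actual file content.
SAMPLE = (
    "/m/01tfhrr\tGinger Baker\tBlind Faith\t1968\t1969-10\n"
    "/m/01tfhry\tSteve Winwood\tBlind Faith\t1968\t1969-10\n"
    "/m/01t73cp\tEric Clapton\tDerek and the Dominos\t\t\n"
    "/m/01tfhrk\tEric Clapton\tBlind Faith\t1968\t1969-10\n"
)

def members_of(group, tsv_text):
    # Rows for one group are not contiguous, so scan every row
    # and keep the member names whose group column matches.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return sorted(row[1] for row in reader if row and row[2] == group)

print(members_of("Blind Faith", SAMPLE))
# → ['Eric Clapton', 'Ginger Baker', 'Steve Winwood']
```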
Eric Clapton's main Freebase page is shown below. The IRI format for Clapton as a topic is http://www.freebase.com/view/en/eric_clapton.
Ultimately we elected not to use Freebase as a data source because we, like others, were unable to locate the complete Freebase dataset rendered as RDF, and we determined that the conversion process would be non-trivial. Indeed, we found mailing list and forum messages from others describing the same problem. Initially we had downloaded the dataset in its native format and considered converting it into RDF. This would have been possible, although many of the lines contained a fourth element consisting of a number of values concatenated together; it was unclear how that could be cleanly converted into a named graph. After the rough draft of our paper was submitted, we discovered that RDF data can be obtained manually by following a link labeled "RDF" near the bottom of each Freebase page. This RDF appears to be rendered at execution time, since there is a slight delay before it is displayed. For example, follow the RDF link on the Eric Clapton page. Had we discovered this sooner, we might have attempted to obtain the Freebase RDF for rock and roll stars using an automated process.
Methodology
In this section, we discuss our frontend and backend development platform, a few term definitions, and details concerning the Graphviz visualization, which was still under development as of this writing in July.
Development Environment
The front end to our SPARQL queries was a guest virtual machine running under Sun VirtualBox 3.1.6_OSE on a Dell Inspiron with an Intel Core 2 Quad CPU Q9400 running at 2.66GHz. The guest operating system is the 10.10 server release of Ubuntu Linux. The database machine leverages 2 Intel Xeon X5650 CPUs overclocked to 3.12GHz with 48 gigabytes of RAM. From the perspective of the 10.10 Ubuntu Desktop operating system, the 2 hexcore processors are regarded as 24 processors (149867.81 BogoMIPS).
Our RDF store was the open source (freely available) version of OpenLink Virtuoso Universal Server Virtuoso Universal Server (Version 06.01.3127) running in "Full Mode". Configuration changes were made to maximize use of available RAM by setting "NumberOfBuffers = 7000000" in the virtuoso.ini file. On the frontend, our queries were facilitated by Drupal 6.18 with the modules listed in the Drupal Modules appendix. It was necessary to configure PHP to allow Drupal more memory, so "memory_limit = 512M" was added to /etc/php5/apache2/php.ini.
Drupal 7.0 became available during 2011, but we elected not to migrate to that version. Although Drupal 7.0 has substantially better support for RDF data than earlier versions, that support mainly benefits people publishing information that was authored or stored inside Drupal. We used Drupal for collaboratively developing and storing queries, as well as for its visualization capabilities.
Working Definitions of Key Properties
The definition list below presents our current thinking about several key concepts that are covered in detail in later sections; some refer to OpenCyc for the Semantic Web.
musical artist
    a person who is either a musician who plays one or more instruments, or is a singer, or is a music composer; similar (but not identical) to the corresponding OpenCyc concept
associatedMusicalArtist
    both the subject and object of this predicate are musicians
associatedActs
    this is mapped either to associatedBand (for subjects who are individual artists) or associatedMusicalArtist (for bands)
band
    OpenCyc: An element of Band_MusicGroup is a (small or large) group of musicians who play non-Classical music together on either a regular or intermittent schedule.
genre
    a categorization of music into types that can be distinguished from other types of music; can be applied to a musical artist, a band, an album, or individual songs.
RDF Visualization Coding
In order to better understand our result sets, a form of visualization was implemented as a custom Drupal module, written in PHP. The front end was a simple HTML form with JQuery and AJAX for displaying the information. Main processing was handled by the PHP module, set up to intercept a POST to a particular URL. The PHP module first accesses the POST data and extracts the input search term, which becomes the subject of the following SPARQL query:
SELECT DISTINCT ?predicate ?object WHERE {<" . $subject . "> ?predicate ?object } ORDER BY ?predicate

where in our initial implementation $subject is replaced by the musician subject requested by the user. XML output is specified in the request as the desired return format.
The next step was to convert the XML results output from the query to the Graphviz input format DOT language. Graphviz Graphviz is open source graph visualization software that converts descriptions of graphs in a simple text language either to diagrams in various formats (e.g., images, SVG, PDF, Postscript) or for display in an interactive graph browser.
Using the SimpleXMLElement class (from the PHP library), the code loops through the <binding> elements in the results of the SPARQL query, accessing the "name" attribute of each one. If the name attribute is 'predicate' (from the ?predicate variable in our SPARQL query), the predicate is obtained from the <uri> subelement. If the name attribute is 'object' (from the ?object variable in the query), the value is obtained from the <uri> subelement or the <literal> subelement, depending on which is present in the XML output of the query. If the <uri> subelement is found, we use the URI target for display but the complete URI for the link, for use in future queries that the user can select.
The Graphviz text input file is constructed incrementally by adding Graphviz format input statements for each <binding> element. Since Graphviz defines links and colors in a separate section of the file, the XML <binding> elements are searched a second time to create the Graphviz [label] entries. (To improve performance, we may do this in a single loop, populating two sections of the Graphviz input at once.) A static set of predefined colors is defined in the module. As new predicates (e.g., artist, associatedMusicalArtist, associatedBand, musicComposer, producer, writer) are encountered, new colors are assigned by taking the next one from the list and associating it with the new predicate. Use of an associative array allows predicates of the same type to be assigned the same color. The Graphviz input string is completed when all the <binding> elements have been processed and a header and footer are added.
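A rough Python analogue of that conversion loop may make the process concrete. This is a sketch, not the module's actual PHP code; the sample XML, helper name, and colour palette are illustrative:

```python
import xml.etree.ElementTree as ET

SR = "{http://www.w3.org/2005/sparql-results#}"
PALETTE = ["red", "blue", "green", "orange", "purple", "brown"]

def results_to_dot(subject, results_xml):
    """Turn SPARQL XML results (?predicate ?object rows) into Graphviz
    DOT input, assigning each distinct predicate its own edge colour
    via an associative array."""
    colors = {}                       # predicate IRI -> colour
    lines = ["digraph g {"]
    for result in ET.fromstring(results_xml).iter(SR + "result"):
        row = {}
        for binding in result.iter(SR + "binding"):
            # A value is carried by either a <uri> or a <literal> child.
            node = binding.find(SR + "uri")
            if node is None:
                node = binding.find(SR + "literal")
            row[binding.get("name")] = node.text
        pred, obj = row["predicate"], row["object"]
        if pred not in colors:        # next free colour for a new predicate
            colors[pred] = PALETTE[len(colors) % len(PALETTE)]
        lines.append('  "%s" -> "%s" [label="%s", color=%s];'
                     % (subject, obj, pred.rsplit("/", 1)[-1], colors[pred]))
    lines.append("}")
    return "\n".join(lines)

SAMPLE_XML = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <results>
    <result>
      <binding name="predicate"><uri>http://dbpedia.org/ontology/associatedBand</uri></binding>
      <binding name="object"><uri>http://dbpedia.org/resource/Cream_(band)</uri></binding>
    </result>
    <result>
      <binding name="predicate"><uri>http://dbpedia.org/ontology/associatedBand</uri></binding>
      <binding name="object"><uri>http://dbpedia.org/resource/Blind_Faith</uri></binding>
    </result>
  </results>
</sparql>"""

print(results_to_dot("Eric Clapton", SAMPLE_XML))
```

Feeding the returned string to the dot tool would render the digraph; in our implementation the equivalent string is handed to the Graphviz filter module instead.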
The SVG data is created by sending the Graphviz input string to the graphviz_filter_process method, available by means of a Graphviz filter module installed on the Drupal site. The SVG description is serialized by the filter module. At this point the return information is structured using the Drupal function drupal_json to create JSON-formatted data. The SVG data and the XML data are each added as named elements of the JSON data and returned to the caller.
To use this module, an HTML page with a short embedded JQuery script invokes the module and displays the results. The HTML defines an input text field, a search button, and two <div> elements which initially contain no content. They are given the IDs outputGraph and outputTable so content can be associated later using JavaScript (JQuery). The user interface is a simple text field into which the user types the name of the musician subject; pressing the search button runs a JQuery function that posts the input field to the previously described module using AJAX. The AJAX method waits for the data to be returned from the module in JSON format. When the JSON data arrives, the returned SVG data and XML data are assigned to the <div> elements with the corresponding IDs.
The figure below shows one of the Graphviz digraphs. The SVG graphic produced by this Graphviz DOT file was too large to present here. See Vocabutek Web Site.
Interface Screenshots
A sampling of our interface on Vocabutek Web Site is presented in the following figures. As of this writing, the Graphviz visualization is expected to undergo changes. Figure 12 illustrates how we entered queries for testing in the Drupal interface.
Figure 13 displays the result of executing the query shown in Figure 12.
We used the Drupal administration content management to organize queries into cascading menus, as shown in Figure 14.
The result of the query organization is shown in Figure 15.
Figure 16 illustrates the interface for running a query to send to Graphviz for rendering in SVG. Pre-tested queries can be copy/pasted into a single string (topmost text field) or new queries can be entered. The subject defaults to Eric Clapton but any musician can be entered.
Finally, Figure 17 depicts part of the SVG resulting from a query, as rendered by Graphviz. (This is one aspect we plan to improve over time.)
Challenges and Results
Our results were hampered by various anomalies or complications discovered in the data sources especially in terms of semantics. In this section, we present some of the problems we encountered and either how they were solved or how they might be approached in the future.
Properties of a Musical Artist
We next consider a different view of Clapton predicates. The query below returns properties with an object value that contains the string "Eric Clapton".
Compare the result below to that presented in figure Clapton's DBpedia Properties - Version 1.
When we subsequently connected the DBpedia and MusicBrainz data sources as discussed in Bridging Data Sources below, we obtained a third view of Clapton's properties, shown in figure Clapton's Properties Combined.
Along Comes the Association
The paramount requirement in determining an Eric Clapton Number is the ability to unambiguously identify those musicians with whom he directly worked. Originally we thought this would be rather straightforward. Properties that seemed relevant to making that determination included:
- http://dbpedia.org/property/associatedActs
- http://dbpedia.org/ontology/associatedBand
- http://dbpedia.org/ontology/associatedMusicalArtist
- http://dbpedia.org/property/starring
- http://dbpedia.org/property/musicalguests
The musicalguests property was eliminated because it relates only to variety shows such as "The Late Show". The starring property also proved unreliable, since multiple musicians might all have been considered stars of a performance yet might not have played together.
One case in point is The Concert for Bangladesh; a query returned Klaus Voorman, Billy Preston, Bob Dylan, Eric Clapton, George Harrison, Leon Russell, Ravi Shankar, and Ringo Starr. While most of these performers did share the stage at one time during the concert, Ravi Shankar did not perform with the others. Furthermore, members of the band Badfinger did play with most of the others, but they were not returned by the query, although we note that Badfinger is cited in the abstract of the Wikipedia article in the list of stars of the supergroup. On the other hand, if we look at the infobox on the concert's Wikipedia page, Badfinger is not listed as one of the stars.
We considered the associatedActs property since that is the term used in the infobox. As seen in Table 1, the infobox on Clapton's Wikipedia page lists these associated acts: "The Yardbirds, John Mayall & the Bluesbreakers, Powerhouse, Cream, Free Creek, The Dirty Mac, Blind Faith, J.J. Cale, The Plastic Ono Band, Delaney, Bonnie & Friends, Derek and the Dominos".
Our initial attempt to retrieve these artists produced a surprising result. Our query was:
SELECT DISTINCT ?who
{ <http://dbpedia.org/resource/Eric_Clapton> <http://dbpedia.org/property/associatedActs> ?who. }
ORDER BY ?who
The result, shown below, was one long string [6] whose contents represented a superset of the associated acts on the Wikipedia page. The additional associated acts are underlined below.
"The Yardbirds, John Mayall & the Bluesbreakers, Powerhouse, Cream, Free Creek, Dire Straits, George Harrison, The Dirty Mac, Blind Faith, Freddie King, Phil Collins, J.J. Cale, The Plastic Ono Band, Delaney, Bonnie & Friends, Derek and the Dominos, T.D.F., Jeff Beck, Paul McCartney, Steve Winwood, B.B. King, The Beatles, The Band"@en
It is unclear how there could be this disparity since the DBpedia data is derived from Wikipedia. Although several of the underlined performers played with Clapton, they were not in bands together (as far as we could determine). Furthermore, several of the bands listed on the original Wikipedia page are arguably not bands in the general sense, since they existed only briefly and for a specific purpose. The Dirty Mac were a supergroup consisting of Eric Clapton, John Lennon, Keith Richards and Mitch Mitchell that came together for The Rolling Stones' TV special. Free Creek was a band composed of many musical artists, including Eric Clapton, Jeff Beck, and Keith Emerson, for one super-session album. Powerhouse recorded only a few songs, only three of which were released on a compilation album. These "bands" are certainly not on a par with the others in the infobox since they were never intended to exist beyond their stated purpose.
Perhaps our interpretation of the associatedActs
property was incorrect? Further
examination of the DBpedia and Wikipedia documentation pointed us to the
template for the infobox of musical artists,
shown on the right side of Table 1 in the earlier Wikipedia section.
The template description of associated_acts
from Wikipedia follows. (Numbers have been added for ease of reference and
the text has been reformatted.)
This field is for professional relationships with other musicians or bands that are significant and notable to this artist's career. This field can include, for example, any of the following:
a1) For individuals: groups of which he or she has been a member
a2) Other acts with which this act has collaborated on multiple occasions, or on an album, or toured with as a single collaboration act playing together
a3) Groups which have spun off from this group
a4) A group from which this group has spun off
Separate multiple entries with commas. The following uses of this field should be avoided:
b1) Association of groups with members' solo careers
b2) Groups with only one member in common
b3) Association of producers, managers, etc. (who are themselves acts) with other acts (unless the act essentially belongs to the producer, as in the case of a studio orchestra formed by and working exclusively with a producer)
b4) One-time collaboration for a single, or on a single song
b5) Groups that are merely similar
Based on (a1), (a2) and (b4), it would seem that Powerhouse and arguably The Dirty Mac and Free Creek should be eliminated from the list of groups of which Clapton was a member. [7]
Further investigation brought us to the numerous
DBpedia Infobox Mappings
and specifically to the ontology mapping for the Infobox for Musical Artists.
The (infobox) template property Associated_acts
maps to two ontology properties,
associatedBand
and associatedMusicalArtist
, meaning that the infobox property
can refer to either a group (cases (a3) and (a4) above) or to an individual (cases
(a1) and (a2)).
Therefore we turned our attention to queries involving the ontology properties
associatedBand
and associatedMusicalArtist
. The following query
asks for the bands in which Clapton was a member.
SELECT DISTINCT ?who
WHERE { <http://dbpedia.org/resource/Eric_Clapton> <http://dbpedia.org/ontology/associatedBand> ?who. }
ORDER BY ?who
The results are shown below. Note that the results are nearly identical to the string superset shown earlier. (The only exceptions are X-sample and a few band name variations.) The same results are obtained if the predicate is replaced by
http://dbpedia.org/ontology/associatedMusicalArtist
.
http://dbpedia.org/resource/B.B._King
http://dbpedia.org/resource/Blind_Faith
http://dbpedia.org/resource/Cream_%28band%29
http://dbpedia.org/resource/Delaney,_Bonnie_&_Friends
http://dbpedia.org/resource/Derek_and_the_Dominos
http://dbpedia.org/resource/Dire_Straits
http://dbpedia.org/resource/Eric_Clapton%27s_Powerhouse
http://dbpedia.org/resource/Freddie_King
http://dbpedia.org/resource/Free_Creek_%28band%29
http://dbpedia.org/resource/George_Harrison
http://dbpedia.org/resource/J.J._Cale
http://dbpedia.org/resource/Jeff_Beck
http://dbpedia.org/resource/John_Mayall_&_the_Bluesbreakers
http://dbpedia.org/resource/Paul_McCartney
http://dbpedia.org/resource/Phil_Collins
http://dbpedia.org/resource/Steve_Winwood
http://dbpedia.org/resource/The_Band
http://dbpedia.org/resource/The_Beatles
http://dbpedia.org/resource/The_Dirty_Mac
http://dbpedia.org/resource/The_Plastic_Ono_Band
http://dbpedia.org/resource/The_Yardbirds
http://dbpedia.org/resource/X-sample
When we reverse the order of the subject and object, the results can be interpreted as musicians who have played on Clapton's albums. The query:
SELECT DISTINCT ?who
WHERE { ?who <http://dbpedia.org/ontology/associatedBand> <http://dbpedia.org/resource/Eric_Clapton>. }
ORDER BY ?who
yields these results:
http://dbpedia.org/resource/Aashish_Khan
http://dbpedia.org/resource/Alan_Clark_%28keyboardist%29
http://dbpedia.org/resource/Albert_Lee
http://dbpedia.org/resource/Andy_Fairweather_Low
http://dbpedia.org/resource/B.B._King
http://dbpedia.org/resource/Billy_Preston
http://dbpedia.org/resource/Bobby_Keys
http://dbpedia.org/resource/Chris_Stainton
http://dbpedia.org/resource/Chuck_Leavell
http://dbpedia.org/resource/Dave_Carlock
http://dbpedia.org/resource/Dave_Mason
http://dbpedia.org/resource/Doyle_Bramhall_II
http://dbpedia.org/resource/Freddie_King
http://dbpedia.org/resource/Ian_Wallace_%28drummer%29
http://dbpedia.org/resource/Jamie_Oldaker
http://dbpedia.org/resource/Jeff_Beck
http://dbpedia.org/resource/Jesse_Ed_Davis
http://dbpedia.org/resource/Jim_Gordon_%28musician%29
http://dbpedia.org/resource/Leon_Russell
http://dbpedia.org/resource/Mac_and_Katie_Kissoon
http://dbpedia.org/resource/Marc_Benno
http://dbpedia.org/resource/Marcella_Detroit
http://dbpedia.org/resource/Nathan_East
http://dbpedia.org/resource/Otis_Spann
http://dbpedia.org/resource/P._P._Arnold
http://dbpedia.org/resource/Phil_Collins
http://dbpedia.org/resource/Phil_Palmer
http://dbpedia.org/resource/Pino_Palladino
http://dbpedia.org/resource/Plastic_Ono_Band
http://dbpedia.org/resource/Ray_Cooper
http://dbpedia.org/resource/Reverend_Zen
http://dbpedia.org/resource/Richard_Cole
http://dbpedia.org/resource/Rita_Coolidge
http://dbpedia.org/resource/Rob_Fraboni
http://dbpedia.org/resource/Sheryl_Crow
http://dbpedia.org/resource/Steve_Ferrone
http://dbpedia.org/resource/Steve_Jordan_%28musician%29
http://dbpedia.org/resource/Stevie_Ray_Vaughan
http://dbpedia.org/resource/The_Shaun_Murphy_Band
http://dbpedia.org/resource/Yvonne_Elliman
The same results are obtained if the predicate is replaced by
http://dbpedia.org/ontology/associatedMusicalArtist
.
How are associatedBand
, associatedMusicalArtist
, and associatedActs
related?
We believe the relationship varies across performers, depending upon how individual
contributors interpreted the terms.
Consider the following similar queries and their very different results, including
many performers who are
unfamiliar to all of the present authors.
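The specific queries and their result figures are not reproduced in this text. A representative pair of such queries, differing only in the predicate used, would have the following shape; the artist is chosen purely for illustration.

```sparql
# Associated acts via the infobox-derived property
SELECT DISTINCT ?who
WHERE { <http://dbpedia.org/resource/Jeff_Beck>
        <http://dbpedia.org/property/associatedActs> ?who. }
ORDER BY ?who

# Associated acts via the ontology property
SELECT DISTINCT ?who
WHERE { <http://dbpedia.org/resource/Jeff_Beck>
        <http://dbpedia.org/ontology/associatedMusicalArtist> ?who. }
ORDER BY ?who
```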
Bridging the Data Sources
In order to address some of the more difficult questions or compare results between data sources, we needed a way to combine the RDF datasets, or at least to connect musical artist identifiers across DBpedia and MusicBrainz.
We employed the OpenLink Virtuoso Universal Server Virtuoso Universal Server as an RDF store. While there are other very good RDF stores such as 4store, the Virtuoso Universal Server [8] is compelling because it supports reasoning. Backward-chaining reasoning is the ability to derive new information from existing information at run time. This contrasts with forward-chaining reasoning, in which derived information is expressed explicitly. An example of forward-chaining reasoning would be creating and adding triples that represent new information that was inherent in information the datastore already contained. [9] There are two major costs to leveraging backward-chaining reasoning. The first and foremost is that derived information is created each time it is needed; it is therefore quite likely that the same information is derived repeatedly at run time. The advantage, however, is that the derived information does not take up any space in the database. The second cost, specific to Virtuoso, is the required use of particular predicates, none of which were available in the native datasets with which we worked.
Given that the IRI for Eric Clapton was different in DBpedia and MusicBrainz, we used
forward
chaining to express the relatedness between these two IRIs and thereby facilitate
backward-chaining reasoning in queries. We used the SPARQL
INSERT
directive to create new statements which made
assertions explicit as a forward-chaining reasoning strategy. Specifically, the
assertions we added leveraged the owl:sameAs
[10] predicates
Virtuoso defines to empower
reasoning. The
following query was used to create an assertion stating that the DBpedia Subject IRI
of type
http://dbpedia.org/ontology/MusicalArtist
corresponds to the MusicBrainz Subject IRIs of type
http://musicbrainz.org/mm/mm-2.1#Artist
whenever the DBpedia
rdf-schema#label
matches the MusicBrainz dc:title
exactly.
SPARQL INSERT IN GRAPH <inference/sameAs>
  { ?mbiri <http://www.w3.org/2002/07/owl#sameAs> ?s }
WHERE
  { ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MusicalArtist>.
    ?s <http://www.w3.org/2000/01/rdf-schema#label> ?dbpedianame.
    ?mbiri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://musicbrainz.org/mm/mm-2.1#Artist>.
    ?mbiri <http://purl.org/dc/elements/1.1/title> ?mbname.
    FILTER (str(?mbname) = str(?dbpedianame)) };
This query resulted in the addition of 16,029
owl:sameAs
assertions to our datastore which took
109,692,665 msec (30.47 hours) to complete.
Since these assertions, the DBpedia assertions, and the MusicBrainz assertions were not all in
the same NAMED GRAPH
, it was necessary to use another unique
capability of the Virtuoso Universal Server called GRAPH GROUPs
.
This enabled us to refer to multiple NAMED GRAPHs
as if they were one. First we
created a GRAPH GROUP
with the following command:
DB.DBA.RDF_GRAPH_GROUP_CREATE ('http://group.dbpedia.inference','1');
Then we placed the previously created owl:sameAs
assertions, as well as
the NAMED GRAPHs
containing DBpedia assertions and the MusicBrainz assertions
into the GRAPH GROUP
:
DB.DBA.RDF_GRAPH_GROUP_INS('http://group.dbpedia.inference' , 'inference/sameAs'); DB.DBA.RDF_GRAPH_GROUP_INS('http://group.dbpedia.inference' , 'http://mytest.com'); DB.DBA.RDF_GRAPH_GROUP_INS('http://group.dbpedia.inference' , 'http://musicbrainz.com');
To demonstrate that the owl:sameAs
assertions do indeed allow us to
intersect the two datasets, consider the following query that determines the properties
connecting Eric Clapton to resources. Compare this to the figure
Clapton's DBpedia Properties - Version 1.
DEFINE input:same-as "yes" SELECT DISTINCT ?predicate FROM <http://group.dbpedia.inference> WHERE { ?s ?predicate <http://dbpedia.org/resource/Eric_Clapton>. } ORDER BY ?predicate
The result of this query is 30 properties, which is 7 more than the result when querying only DBpedia; 6 of the additional properties come from MusicBrainz. Compare this result to the previous query results shown in figures Clapton's DBpedia Properties - Version 1 and Clapton's DBpedia Properties - Version 2.
Any query that benefits from using a single IRI to refer to the same artist in both
datasets
becomes possible by adding DEFINE input:same-as "yes"
before the result clause.
The same 30 results can be obtained with the UNION
query below. However, this query
requires knowledge of the GUID-based MusicBrainz IRI for each musician of interest,
whereas the above query takes advantage of the previously established
correspondence between the methods of artist identification in DBpedia and MusicBrainz.
Therefore, the owl:sameAs
approach is clearly the better solution.
SELECT DISTINCT ?predicate WHERE { { ?s ?predicate <http://dbpedia.org/resource/Eric_Clapton>. } UNION { ?s ?predicate <http://musicbrainz.org/mm-2.1/artist/618b6900-0618-4f1e-b835-bccb17f84294>. } } ORDER BY ?predicate
You Say It's Your Birthday?
Given the ability to bridge the two datasets, we can now issue queries that compare data values across the sets. The query below obtains the birthdates for each musical artist in both sources and returns the discrepancies.
DEFINE input:same-as "yes"
SELECT ?musician ?DBPdate ?MBdate
{ ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MusicalArtist>.
  ?s <http://dbpedia.org/ontology/birthDate> ?DBPdate .
  ?s <http://musicbrainz.org/mm/mm-2.1#beginDate> ?MBdate.
  ?s <http://www.w3.org/2000/01/rdf-schema#label> ?musician.
  FILTER (str(?MBdate) != str(?DBPdate)) }
ORDER BY ?musician
A sampling of the nearly one thousand mismatched
birthdates follows. Typically the differences are one day, one month, or one year,
but note several major differences. We have yet to independently verify all of the
discrepancies, but we note that the birthday of George Harrison is incorrect in
MusicBrainz. In fact, MusicBrainz often uses "00-00" dates indicating only the year
is known. It is probably safe to conjecture that Courtney Love is not 21 and Five
For Fighting (singer-songwriter John Ondrasik) is not 14, so these "birthdates" more
likely record when their musical careers started. Wikipedia confirms that 1997 is indeed
the first "year active" for Five For Fighting, but Love's initial "year
active" is 1982. Clearly the interpretation of MusicBrainz's beginDate
property
varies across artists.
Our tentative conclusion is that DBpedia is more accurate than
MusicBrainz with respect to birthdates.
musician  DBPdate  MBdate
50 Cent  1975-07-06  1976-07-06
Andy Partridge  1953-11-11  1953-12-11
Astrud Gilberto  1940-03-29  1940-03-30
Blind Willie Johnson  1897-01-22  1902-00-00  [5 year difference]
Blind Willie McTell  1898-05-05  1901-05-05  [3 year difference]
Carl Perkins  1932-04-09  1928-08-16  [4 year difference]
Courtney Love  1964-07-09  1990-00-00  [extremely different!]
David Lee Roth  1954-10-10  1955-10-10
Eddie Van Halen  1955-01-26  1956-01-26
Edgar Winter  1947-12-28  1946-12-28
Five for Fighting  1965-01-07  1997-00-00  [extremely different!]
Frankie Valli  1934-05-03  1937-05-03  [3 year difference]
George Harrison  1943-02-25  1943-02-24
Jennifer Lopez  1969-07-24  1970-07-24
Top Record Labels
One of our original questions was "Is there a predominant record label in the music world?" The following query answers that question.
SELECT ?label (count(?label) as ?count)
{ ?artist <http://dbpedia.org/ontology/recordLabel> ?label }
GROUP BY ?label
ORDER BY DESC (?count) LIMIT 20
The results are shown below.
http://dbpedia.org/resource/Columbia_Records  5762
http://dbpedia.org/resource/EMI  4215
http://dbpedia.org/resource/Warner_Bros._Records  3518
http://dbpedia.org/resource/Epic_Records  3509
http://dbpedia.org/resource/Atlantic_Records  3502
http://dbpedia.org/resource/Capitol_Records  3264
http://dbpedia.org/resource/Virgin_Records  2926
http://dbpedia.org/resource/RCA_Records  2590
http://dbpedia.org/resource/MCA_Records  1975
http://dbpedia.org/resource/Mercury_Records  1919
http://dbpedia.org/resource/Island_Records  1800
http://dbpedia.org/resource/A&M_Records  1697
http://dbpedia.org/resource/Sony_BMG  1513
http://dbpedia.org/resource/Elektra_Records  1480
http://dbpedia.org/resource/Reprise_Records  1406
http://dbpedia.org/resource/Arista_Records  1382
http://dbpedia.org/resource/Geffen_Records  1372
http://dbpedia.org/resource/Universal_Music_Group  1339
http://dbpedia.org/resource/Universal_Records  1277
http://dbpedia.org/resource/Interscope_Records  1226
Musical Genres
In order to determine which artist is most influential in rock music, we needed to be able to reliably specify the genre of interest. However, formulating queries involving genres is more difficult than it would seem. DBpedia (and presumably Wikipedia) defines 2,887 musical genres of which 330 contain "rock" in their label. Rock-related genres literally run the gamut from A to Z -- Aboriginal_rock to Zulu_rock (really!). To our surprise, the single concept of "rock and roll music" is represented by 18 distinctly different IRIs, as determined by the following query. [11]
SELECT DISTINCT (?genre) { ?album <http://dbpedia.org/ontology/genre> ?genre. ?album <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Album>. FILTER regex (?genre, "[Rr]ock") . FILTER regex (?genre, "[Rr]oll") . } ORDER BY ?genre
The variants of "rock and roll" are:
http://dbpedia.org/resource/British_Rock_and_Roll
http://dbpedia.org/resource/Real_Rock_and_Roll
http://dbpedia.org/resource/Rock%27n%27Roll
http://dbpedia.org/resource/Rock%27n%27roll
http://dbpedia.org/resource/Rock%27n_roll
http://dbpedia.org/resource/Rock_%27N%27_Roll
http://dbpedia.org/resource/Rock_%27n%27_Roll
http://dbpedia.org/resource/Rock_%27n%27_roll
http://dbpedia.org/resource/Rock_%27n_Roll
http://dbpedia.org/resource/Rock_&_Roll
http://dbpedia.org/resource/Rock_&_roll
http://dbpedia.org/resource/Rock_N%27_Roll
http://dbpedia.org/resource/Rock_and_Roll
http://dbpedia.org/resource/Rock_and_Roll_music
http://dbpedia.org/resource/Rock_and_roll
http://dbpedia.org/resource/Rock_n%27_Roll
http://dbpedia.org/resource/Rock_n_Roll
http://dbpedia.org/resource/Spanish_language_rock_and_roll
The first and last results above could be considered outliers since they are narrowings of the generic rock and roll classification.
Although Clapton is identified by only a few genres on Wikipedia, his albums fall into 18 genres, as determined by the query:
SELECT DISTINCT (?genre) WHERE { ?s <http://dbpedia.org/ontology/artist> <http://dbpedia.org/resource/Eric_Clapton>. ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Album>. ?s <http://dbpedia.org/ontology/genre> ?genre } ORDER BY ?genre
Genres associated with Clapton's albums follow. Note four variants each for the concepts "blues-rock" and "rock and roll".
http://dbpedia.org/resource/Acoustic_blues
http://dbpedia.org/resource/Blues
http://dbpedia.org/resource/Blues-Rock
http://dbpedia.org/resource/Blues-rock
http://dbpedia.org/resource/Blues_Rock
http://dbpedia.org/resource/Blues_rock
http://dbpedia.org/resource/British_Blues
http://dbpedia.org/resource/Electric_blues
http://dbpedia.org/resource/Folk_music
http://dbpedia.org/resource/Jazz
http://dbpedia.org/resource/Orchestral
http://dbpedia.org/resource/Pop_music
http://dbpedia.org/resource/Reggae
http://dbpedia.org/resource/Rock_%28music%29
http://dbpedia.org/resource/Rock_and_Roll
http://dbpedia.org/resource/Rock_and_roll
http://dbpedia.org/resource/Rock_music
http://dbpedia.org/resource/Soul_blues
For any given album, more than one genre may apply. For example, the 1975 album "There's One in Every Crowd" is classified as both reggae and blues-rock. For those wondering which Clapton album could possibly be considered jazz or orchestral, that distinction belongs to the first "Lethal Weapon" soundtrack, which is also designated as blues. However, the genre query above does not capture all genres associated with Clapton. For example, the genre for "Lethal Weapon 3" soundtrack is simply "soundtrack".
If we wish to display all of Clapton's blues and rock albums, we could use a query such as:
SELECT DISTINCT (?genre) ?album WHERE { ?album <http://dbpedia.org/ontology/artist> <http://dbpedia.org/resource/Eric_Clapton>. ?album <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Album>. ?album <http://dbpedia.org/ontology/genre> ?genre . FILTER ( regex (?genre, "[Rr]ock") || regex (?genre, "[Bb]lues")) } ORDER BY ?album
To display all of Clapton's albums and their associated genre(s), we used the following query:
SELECT ?album ?genre WHERE { ?album <http://dbpedia.org/ontology/artist> <http://dbpedia.org/resource/Eric_Clapton>. ?album <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Album>. ?album <http://dbpedia.org/ontology/genre> ?genre . } ORDER BY ?album
The result, including albums with several genres listed, follows. (The common portion
of the IRI,
http://dbpedia.org/resource/,
has been removed from each resource to make the results
more readable.)
album  genre
24_Nights  Rock_music
24_Nights  Blues
461_Ocean_Boulevard  Blues-rock
Another_Ticket  Blues-rock
August_%28album%29  Rock_music
August_%28album%29  Pop_music
Back_Home_%28Eric_Clapton_album%29  Blues-Rock
Backless  Rock_and_roll
Backtrackin%27  Rock_%28music%29
Behind_the_Sun_%28Eric_Clapton_album%29  Rock_music
Behind_the_Sun_%28Eric_Clapton_album%29  Pop_music
Blues_%28Eric_Clapton_album%29  Blues-rock
Clapton_%282010_album%29  Blues-rock
Clapton_Chronicles:_The_Best_of_Eric_Clapton  Rock_and_Roll
Crossroads_%28Eric_Clapton_album%29  Blues-rock
Crossroads_2:_Live_in_the_Seventies  Blues-rock
E._C._Was_Here  Blues-rock
Eric_Clapton%27s_Rainbow_Concert  Blues-rock
Eric_Clapton_%28album%29  Rock_and_Roll
From_the_Cradle  Blues
From_the_Cradle  Electric_blues
From_the_Cradle  Soul_blues
From_the_Cradle  British_Blues
Guitar_Boogie  Rock_and_roll
Guitar_Boogie  Blues-rock
Journeyman_%28album%29  Blues-rock
Just_One_Night_%28album%29  Blues-rock
Lethal_Weapon_%28soundtrack%29  Jazz
Lethal_Weapon_%28soundtrack%29  Blues
Lethal_Weapon_%28soundtrack%29  Orchestral
Live_in_Hyde_Park_%28Eric_Clapton_album%29  Rock_music
Live_in_Hyde_Park_%28Eric_Clapton_album%29  Blues
Live_in_Japan_%28George_Harrison_album%29  Rock_and_roll
Me_and_Mr._Johnson  Blues
Money_and_Cigarettes  Blues-rock
No_Reason_to_Cry  Rock_and_roll
One_More_Car,_One_More_Rider  Blues-rock
Pilgrim_%28Eric_Clapton_album%29  Rock_music
Pilgrim_%28Eric_Clapton_album%29  Blues
Pilgrim_%28Eric_Clapton_album%29  Pop_music
Reptile_%28album%29  Rock_music
Reptile_%28album%29  Blues
Riding_with_the_King_%28B._B._King_and_Eric_Clapton_album%29  Blues-rock
Riding_with_the_King_%28B._B._King_and_Eric_Clapton_album%29  Blues_rock
Slowhand  Rock_music
Steppin%27_Out_%28Eric_Clapton_album%29  Blues-rock
The_Cream_of_Clapton  Rock_music
The_Cream_of_Clapton  Blues_Rock
The_Cream_of_Eric_Clapton  Rock_music
The_History_of_Eric_Clapton  Rock_music
The_History_of_Eric_Clapton  Blues
There%27s_One_in_Every_Crowd  Reggae
There%27s_One_in_Every_Crowd  Blues-rock
Time_Pieces:_The_Best_of_Eric_Clapton  Blues_Rock
Unplugged_%28Eric_Clapton_album%29  Folk_music
Unplugged_%28Eric_Clapton_album%29  Acoustic_blues
However, if we omit a reference to genre and simply ask for Clapton's albums using the query:
SELECT ?album WHERE { ?album <http://dbpedia.org/ontology/artist> <http://dbpedia.org/resource/Eric_Clapton>. ?album <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Album>. } ORDER BY ?album
then the results contain 7 additional albums (below). We have not been able to determine the reason for this disparity since there is genre information for each of them when we dereference their IRIs on the DBpedia site.
Clapton_%281973_album%29  dbpprop:genre Rock  [compilation from 1973]
Complete_Clapton  dbpprop:genre Blues  [compilation from 2007]
Edge_of_Darkness_%28soundtrack%29  dbpprop:genre Soundtracks  [18 minute 1985 soundtrack for British TV series]
Eric_Clapton_at_His_Best  dbpprop:genre Rock  [compilation from 1972]
Lethal_Weapon_3_%28soundtrack%29  dbpprop:genre Orchestral, Jazz, and Blues  [Wikipedia soundtrack entry shares page with the movie; 2 infoboxes]
Live_from_Madison_Square_Garden_%28Eric_Clapton_and_Steve_Winwood_album%29  dbpprop:genre Blues/Rock  [note the slash]
Rush_%28soundtrack%29  dbpedia-owl:type dbpedia:Soundtrack  [Wikipedia soundtrack entry shares page with the movie; but no album infobox]
A better solution for managing the many representations of the genre concept "Rock and Roll",
as well as its narrowings, would be to use reasoning again. The semantically equivalent variant
genres could be related to one another, for example via
rdfs:subPropertyOf
assertions, thereby enabling a single genre representation to refer to many. [12] This would
require steps analogous to those above: create and load the assertions, define a graph,
define a graph group, and in this case define a rule_set
in Virtuoso. Then, using the
proper syntax, we could query with that rule_set
to include exactly the genres we
intend to use in our queries. A less elegant approach would be to enumerate the genres one
by one and construct a
UNION
of results.
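As a sketch of the forward-chained alternative, the variant genre IRIs matched by the earlier regex filters could be mapped to a single canonical IRI. The graph name and the use of owl:sameAs (mirroring the artist-bridging query earlier, rather than the rdfs:subPropertyOf route just described) are our assumptions.

```sparql
SPARQL INSERT IN GRAPH <inference/genres>
  { ?genre <http://www.w3.org/2002/07/owl#sameAs>
           <http://dbpedia.org/resource/Rock_and_roll> }
WHERE
  { ?album <http://dbpedia.org/ontology/genre> ?genre.
    ?album <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Album>.
    # the same pattern used earlier to find the 18 variants
    FILTER regex (?genre, "[Rr]ock") .
    FILTER regex (?genre, "[Rr]oll") . };
```

With DEFINE input:same-as "yes", a query filtering on the canonical IRI would then match all variants; outliers such as British_Rock_and_Roll would still need to be excluded by hand.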
Limitations and Further Efforts
We recognize several limitations in our work to date:
-
data inconsistencies
-
need for further RDF visualization work
-
need to address more of the original problems
-
constructing queries across datasets
Data Inconsistencies
While we would wish the source RDF data could be regarded as ground truth, we realize there are several problems with our sources.
-
Errors of omission: Since the original data in Wikipedia is community-entered, it is predictable that certain true facts will be missing from the data. Some of these facts may be obscure while others may be more obvious to a subject matter expert for the given topic. For example, what if Eric Clapton were to enter his own data about himself?
-
Poor data curation: In some cases, Wikipedia data may have been present and complete, but there might have been problems in the extraction process from Wikipedia to DBpedia.
-
Erroneous data: Such problems are due to simple data entry errors, unintentional errors in stating what the data entry person considers facts, or possibly intentional falsehoods or unsubstantiated facts.
-
Unclear semantics: The various properties which were of importance to us were generally not defined in the ontology, as discussed in Working Definitions. We struggled to interpret these terms in a consistent manner. It is possible that individuals who care less about the precision of information may have entered relationships not necessarily using the correct semantics. We believe that the ambiguity of the musician-related properties was the most significant problem with the reliability of the data and therefore the biggest challenge in testing our hypothesis.
Further RDF Visualization Work
Since AJAX facilitates updating a web page without reloading the entire page, we plan to insert additional links within the SVG graphic to add more interactivity, invoking additional queries. Links could also be added during the module processing. CSS and XSLT could also be used to enhance the XML presentation.
The current visualization graph is quite wide and long, making it difficult to
view in a web browser without additional panning and zooming capabilities. Useful
visualization of the result set is difficult but we intend to improve the
visualization to facilitate traversing the dataset using hyperlinks. The
hyperlinks are there now, but there is currently a problem with the JavaScript
POST
method.
Further Attempts to Address the Original Problems
We initially formulated nearly two dozen questions [see Problems] that we believed we could use the DBpedia and MusicBrainz RDF datasets to answer. As of this writing in July, few of these questions have been answered definitively. The following questions, if properly bounded, are ones we should be able to address in the future.
-
Which recording artist has directly played with the most musicians?
-
Which recording artist has the most connections within six degrees?
-
Which musician has been a session man for the greatest number of artists?
-
Which recording artist was most active during a particular decade?
-
Among all artists of a particular genre, who has played with the most other musicians?
-
Which rock artist's extended graph has the most other artists in 2 degrees? 3 degrees? 4 degrees?
-
Who has appeared on the most albums?
-
Which musician-related properties are reversible (inverse makes sense)?
-
Who created the most songs?
-
Which song has been recorded the most times by any artists? ("Yesterday" and "White Christmas" are typically cited.)
-
What is the average age of a musician when he/she first joined a band?
Other questions we may potentially be able to address include:
-
If we weight results by the length of time a band stays together, how does that impact other queries?
-
Does the total number of songs or albums released correlate with other measures of success?
-
Which solo artist has had the longest career?
-
Which band has been together (in some form) the longest time?
-
For bands with changing membership, can we conclude which configuration lasted the longest?
-
What is the "Eric Clapton number" (a la Kevin Bacon number) for various musicians?
Queries Across Datasets
Since we have established a mechanism to refer to a given artist in both datasets using a single IRI, we are prepared to ask queries that span the datasets, including queries that would not be possible without both sources. A few such queries are as follows:
-
Which albums of a given artist occur in one dataset but not the other? Since we are using Virtuoso, which does not support the SPARQL 1.1
MINUS
construct, we would need to use OpenLink's proprietary approach. [13]
-
Which albums that occur in both datasets have the same title and record producer but different release dates? This might be difficult to determine because of differences in representation of the record producer name.
-
For a given artist, return their spouse(s) and children even if they themselves were not musical artists. MusicBrainz only has names of children if they themselves are artists (e.g., has John Lennon's sons but not Paul McCartney's children).
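For the first question above, a pre-1.1 workaround for the missing MINUS construct is the classic OPTIONAL plus unbound-variable filter. The following is only a sketch: matching albums by exact title string, via rdfs:label on the DBpedia side and dc:title on the MusicBrainz side, is our assumption, and title variations would make it imperfect in practice.

```sparql
DEFINE input:same-as "yes"
SELECT DISTINCT ?album
FROM <http://group.dbpedia.inference>
WHERE
  { ?album <http://dbpedia.org/ontology/artist> <http://dbpedia.org/resource/Eric_Clapton>.
    ?album <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Album>.
    ?album <http://www.w3.org/2000/01/rdf-schema#label> ?title.
    # attempt to find a MusicBrainz entry with the same title
    OPTIONAL { ?mbalbum <http://purl.org/dc/elements/1.1/title> ?mbtitle.
               FILTER (str(?mbtitle) = str(?title)) }
    # keep only albums with no MusicBrainz counterpart
    FILTER (!bound(?mbalbum)) }
ORDER BY ?album
```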
Put the LOD Right on Me
Regrettably, we focused most of our attention on DBpedia, MusicBrainz and Freebase web sites, rather than casting a broader net to the Linking Open Data community effort initiated by the W3C's (now closed) Semantic Web Education and Outreach (SWEO) Interest Group. As we were finalizing this paper, we discovered several key resources that could prove extremely useful in answering some of our original questions; we hope to explore resources such as the following over the next year:
-
Demos including DBpedia and MusicBrainz (and many others)
-
DBpedia Extended Datasets for MusicBrainz and Freebase
-
LOD cloud diagram, an imagemap showing datasets published in Linked Data format (maintained by Richard Cyganiak and Anja Jentzsch)
-
Comprehensive Knowledge Archive Network (CKAN) web-based software for storage and distribution of data supported by the Open Knowledge Foundation
-
Music Ontology especially MusicArtist, musicbrainz_guid and supporting_musician
-
Linked Data: Evolving the Web into a Global Data Space, a new book by Tom Heath and Christian Bizer (2011)
Advantages and Disadvantages
Next we offer our subjective opinion about the relative advantages and disadvantages of our approach as compared to simply using the DBpedia SPARQL endpoint (which also runs on a Virtuoso server).
Advantages of Our Approach
Our Drupal frontend to a dedicated Virtuoso server under our fine-grained control afforded us several advantages [14], namely:
-
No execution timeout - Time-intensive queries were allowed to run to completion rather than being subject to an externally specified time limit on a shared server.
-
Higher limits - SPARQL endpoints shared by multiple users are often constrained to a limited number of results (e.g., 2000 for the DBpedia endpoint). We did not need to limit the number of results returned, since our number of "users" never exceeded three.
-
Direct access to raw RDF at command line level - When results were not as expected or did not seem to make sense, the ability to examine the actual RDF files (which had been ingested into our datastore) in our Linux environment was invaluable.
-
Collaborative SPARQL development for saving queries in Drupal - As the three authors worked remotely, it was quite helpful to enter our SPARQL queries into a Drupal interface, run the queries, save the queries, and optionally enter them into a cascading menu system for each other to try so we could discuss results.
-
More precise SPARQL timing metrics - Again, since we could control which other processes were running and other variables that would otherwise impact timing metrics, we could more accurately time our queries. For example, we found it took on average 3477.6 msec for a query with one FILTER statement comprised of two regex expressions joined by "&&", compared to only 731 msec for a similar query in which each regex expression was the parameter of a separate FILTER statement. See the first example in the Musical Genres section.
-
Test harness for repeatedly running the same query - We employed a Perl test harness for iterative execution and averaging of results. For example, the previously mentioned FILTER and regex tests were run 1,000 times.
-
Examination of error messages - We had access to server logging of error messages.
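The FILTER timing comparison mentioned above contrasts two equivalent query shapes. The sketch below is ours, not the exact query from the Musical Genres section; the property and the regex strings are illustrative assumptions:

```sparql
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

# Variant 1: one FILTER with two regex expressions joined by "&&"
# (the slower form in our tests, averaging 3477.6 msec).
SELECT ?artist WHERE {
  ?artist dbpedia-owl:genre ?genre .
  FILTER ( regex(str(?genre), "Rock", "i") && regex(str(?genre), "roll", "i") )
}

# Variant 2: each regex as the parameter of a separate FILTER
# (the faster form, averaging 731 msec). The two queries would be
# run separately; they are shown together only for comparison.
SELECT ?artist WHERE {
  ?artist dbpedia-owl:genre ?genre .
  FILTER regex(str(?genre), "Rock", "i")
  FILTER regex(str(?genre), "roll", "i")
}
```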
Disadvantages of Our Approach
-
Static datastore - After we ingested the RDF files, our datastore did not change, so it could not take advantage of possible improvements in the DBpedia and MusicBrainz data throughout the first half of 2011. On the other hand, this was also an advantage, since it meant we were not subject to subtle changes that might have impacted previous queries.
-
Limited knowledge of inner workings - Since we were not members of either the DBpedia or MusicBrainz developer communities, we were not privy to any internal documentation that might have answered some of the questions we raised and helped us to address some of the problems we encountered. In a volunteer developer collaborative environment, user-facing documentation is likely to lag behind actual implementation changes; relevant examples may also be lacking.
Recommendations
The problems we encountered suggest several recommendations we would like to share with the DBpedia and MusicBrainz communities:
-
Publish example SPARQL queries: Although sample data is readily available on the music sites, we did not find SPARQL queries, just endpoints to explore.
-
Provide detailed explanation of IRI conventions and property semantics: We found that it was not always obvious what the significance was for IRI differences, nor what non-alphabetic characters should be used in multi-word IRIs. For example, what is the difference between http://dbpedia.org/ontology and http://dbpedia.org/property IRIs?
-
Include examples of intersecting graphs from other music-oriented sites: In the spirit of Linking Open Data, it would be helpful if both DBpedia and MusicBrainz created and posted demos (or links to demos) on their sites to illustrate linking of their data sources to other popular music data stores.
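As an illustration of the ontology-versus-property question, the two namespaces expose similar-looking predicates whose relationship is not obvious. This hedged sketch queries both for a single artist; the predicate names are assumptions based on common DBpedia usage:

```sparql
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop:     <http://dbpedia.org/property/>

# Do the mapping-based ontology property and the raw infobox
# property agree for the same artist? OPTIONAL lets either
# predicate be absent without eliminating the row.
SELECT DISTINCT ?ontGenre ?propGenre
WHERE {
  OPTIONAL { <http://dbpedia.org/resource/Eric_Clapton> dbpedia-owl:genre ?ontGenre . }
  OPTIONAL { <http://dbpedia.org/resource/Eric_Clapton> dbpprop:genre ?propGenre . }
}
```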
Conclusions
Although at this time we have fallen far short of our lofty goal of answering two dozen complex questions by means of RDF and SPARQL, we believe we have uncovered several significant challenges regarding the consistency of the data sources and the interpretation of the semantics underlying various key properties of musical artists. Similar problems would likely be encountered when using other large community-entered datasets of relatively low quality.
While Wikipedia editors often focus on crafting the most accessible presentation of encyclopedic information for human consumption, there are other factors to consider. Individuals working with data at the semantic level could improve the semantic representation by isolating outliers, as in the case of multiple representations of the "Rock and Roll music" genre concept. Data quality and therefore utility would be greatly enhanced by making the necessary modifications to "normalize" or decrease spurious diversity that is not based on any actual or intended semantic distinction. [15]
In the end, any accurate measure of impact a given musician has on a genre is likely to be the product of a weighted measurement of several variables and is therefore largely subjective. The number of bands someone played with, the number of songs they wrote, the number of albums they produced or the length of time they were active are only part of the equation. An artist who wrote a single song that has enjoyed frequent airplay for decades and/or was covered by artists in many different genres clearly demonstrates a substantial impact. Other measurements not available in the RDF datasets are likely to be even more revealing of an artist's true impact, such as the number of concerts they played, the number of people who recognize their name, and their success on music charts (e.g., Billboard).
By the same token, a strong argument could be made that an individual musician whose career lasted only a short time with a limited repertoire, but who perhaps created or influenced a new genre, could be regarded as having a major impact. Therefore, one would have to somehow compare Eric Clapton to early genre pioneers. For example, Robert Johnson, a major influence on Clapton, recorded only 29 songs and enjoyed only a two-year (1936-37) recording career. Yet Clapton himself has called Johnson "the most important blues singer that ever lived".[16] Johnson was ranked fifth in Rolling Stone's list of 100 Greatest Guitarists of All Time.
Given our knowledge and supporting resources surrounding Eric Clapton's career, we believe that some of what we perceive as errors are indeed just that -- mistakes. Whether those mistakes are based on individuals simply being incorrect, or on not understanding the semantics, really does not matter in the final analysis. What this highlights, for us, is a limitation induced by the Open World Assumption: when an assertion is not made, we cannot assume its negation. Having assertions stating that Clapton created 30 albums does not mean he did not create 45. Likewise, because we know that Clapton did participate in collaboration efforts like Powerhouse, The Dirty Mac and Free Creek, we cannot state unequivocally that such collaboration happened only a single time and therefore violates the semantics of the associatedActs predicate.
Triple Counts
MusicBrainz Triples Count
4301998 albums.rdf
1824953 albums_tracklists.rdf
1712434 artists.rdf
427027 relations_artist_to_artist.rdf
292955 albums_tags.rdf
188708 tracks.rdf
114285 relations_album_to_artist.rdf
--------------------------------------
8,862,360 total triples
DBpedia Assertions
130166251 page_links_en.nt
43640719 infobox_properties_en.nt
23917050 wikipedia_links_en.nt
13795664 mappingbased_properties_en.nt
12161691 article_categories_en.nt
9485630 revisions_en.nt
9485630 page_ids_en.nt
7972385 labels_en.nt
6173940 instance_types_en.nt
5907507 external_links_en.nt
4615815 images_en.nt
4503651 redirects_en.nt
3261096 short_abstracts_en.nt
3261096 long_abstracts_en.nt
2529082 skos_categories_en.nt
1745873 persondata_en.nt
1544820 geo_coordinates_en.nt
928708 disambiguations_en.nt
910517 article_related_geo_countries_en.nt
632615 category_labels_en.nt
414195 homepages_en.nt
387336 specific_mappingbased_properties_en.nt
81602 infobox_property_definitions_en.nt
1555 pnd_en.nt
---------------------------------------------------
287,524,428 total assertions
Drupal 6.18 Modules
Admin role 6.x-1.3
Administration menu 6.x-1.6
Advanced help 6.x-1.2
Backup and Migrate 6.x-2.4
Colorpicker 6.x-2.1
Content Construction Kit (CCK) 6.x-2.9
Date 6.x-2.7
Devel 6.x-1.23
Drupal For Firebug 6.x-1.4
Google Visualization API 6.x-1.3
Graphviz Filter 6.x-1.6
Graphviz Styles 6.x-1.0
jQuery Update 6.x-2.0-alpha1
Masquerade 6.x-1.5
Node export 6.x-2.24
Node import 6.x-1.0
Pathauto 6.x-1.5
RDF CCK 6.x-2.x-dev (2011-Feb-25)
Resource Description Framework (RDF) 6.x-1.0-alpha8
SPARQL 6.x-1.0-alpha1 (includes SPARQL API)
Sphinx (Sphinx search integration) 6.x-1.3
Tagadelic 6.x-1.2
Taxonomy import/export via XML 6.x-1.3
Taxonomy Manager 6.x-2.2
Taxonomy Menu 6.x-2.9
Token 6.x-1.15
Views 6.x-2.12
References
[Auer and Lehmann 2007] Auer, Sören; Lehmann, Jens: What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content. ISBN-13: 978-3-540-72666-1. Springer Verlag. © 2007. Lecture Notes in Computer Science, 2007, Volume 4519/2007, pp. 503-517, doi:https://doi.org/10.1007/978-3-540-72667-8_36. http://www.springerlink.com/content/3131t21p634191n2/ and http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.1314&rep=rep1&type=pdf.
[Auer et al 2007] Auer, Sören; Bizer, Christian; Kobilarov, Georgi; Lehmann, Jens; Cyganiak, Richard; Ives, Zachary: DBpedia: A Nucleus for a Web of Open Data. ISBN-13: 978-3-540-76297-3. Springer Verlag. © 2007. Lecture Notes in Computer Science, 2007, Volume 4825/2007, pp. 722-735, doi:https://doi.org/10.1007/978-3-540-76298-0_52. http://www.springerlink.com/content/rm32474088w54378/ and http://www.informatik.uni-leipzig.de/~auer/publication/dbpedia.pdf.
[Allemang and Hendler 2008] Allemang, Dean; Hendler, James A.: Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL. ISBN-13: 978-0-12-373556-0. Elsevier Inc. © 2008. http://www.workingontologist.org/index.html
[Bacon] Six Degrees of Kevin Bacon. [online]. [cited 08 Apr 2011]. http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
[Butcher 2010] Butcher, Matt et al. Drupal 7 Module Development. ISBN-10: 1849511160. First edition. Packt Publishing © 2010. [cited 08 Apr 2011]. http://www.amazon.com/Drupal-Module-Development-Matt-Butcher/dp/1849511160/
[Clapton] Eric Clapton's Wikipedia page. [online]. [cited 08 Apr 2011]. http://en.wikipedia.org/wiki/Eric_clapton
[Clapton FAQ] Eric Clapton Frequently Asked Questions, part of The Eric Clapton Fan Club Magazine. [online]. [cited 08 Apr 2011]. http://www.whereseric.com/eric-clapton-faq
[DBpedia Dataset] The DBpedia Dataset. [online]. [cited 08 Apr 2011]. http://wiki.dbpedia.org/Datasets
[DBpedia Ontology] The DBpedia Ontology. [online]. [cited 08 Apr 2011]. http://wiki.dbpedia.org/Ontology
[DBpedia Release] DBpedia 3.6 released. [online]. [cited 08 Apr 2011]. http://blog.dbpedia.org/2011/01/17/dbpedia-36-released/.
[Feigenbaum and Prud'hommeaux 2008] SPARQL By Example - A Tutorial. [online] [cited 08 Apr 2011]. Feigenbaum, Lee and Prud'hommeaux, Eric. Cambridge Semantics, 2008. Updated 2011-01-25. http://www.cambridgesemantics.com/2008/09/sparql-by-example/
[Feigenbaum 2008] SPARQL by Example: The Cheat Sheet. [online] [cited 08 Apr 2011]. Feigenbaum, Lee. Cambridge Semantics, 2008. http://www.slideshare.net/LeeFeigenbaum/sparql-cheat-sheet
[Frame 1983] Peter Frame's Rock Family Trees. [online and 3 books]. [cited 08 Apr 2011]. http://www.familyofrock.com/browse/details/trees.html
[Freebase] Freebase home page. [online]. [cited 08 Apr 2011]. http://www.freebase.com/
[Freebase Data Dumps] Freebase Data Dumps. [online]. [cited 08 Apr 2011]. http://wiki.freebase.com/wiki/Data_dumps/
[Graphviz] Graphviz home page. [online]. [cited 08 Apr 2011]. http://www.graphviz.org/
[MQL] Metaweb Query Language. [online]. [cited 08 Apr 2011]. http://wiki.freebase.com/wiki/MQL
[MusicBrainz] MusicBrainz home page. [online]. [cited 08 Apr 2011]. http://musicbrainz.org/
[Newman 2010] Newman, Mark. Networks - An Introduction. ISBN-10: 9780199206650. First edition. Oxford University Press, USA © 2010. [cited 08 Apr 2011]. http://www.amazon.com/Networks-Introduction-Mark-Newman/dp/0199206651/
[OpenCyc for the Semantic Web] OpenCyc for the Semantic Web. [online]. [Cited 26 Jun 2011]. http://sw.opencyc.org/
[Six Degrees] Six Degrees of Separation. [online]. [Cited 08 Apr 2011]. http://en.wikipedia.org/wiki/Six_degrees_of_separation
[SPARQL 1.0] SPARQL Query Language for RDF. [online] [cited 08 Apr 2011]. W3C Recommendation 15 January 2008. W3C © 2006-2007 http://www.w3.org/TR/rdf-sparql-query/
[Starr] Ringo Starr's Wikipedia page. [online] [cited 08 Apr 2011]. http://en.wikipedia.org/wiki/Ringo_Starr
[Ringo Starr & His All-Starr Band] Ringo Starr & His All-Starr Band Wikipedia page. [online] [cited 08 Apr 2011]. http://en.wikipedia.org/wiki/Ringo_Starr_%26_His_All-Starr_Band
[VanDyke and Westgate 2007] VanDyk, John K and Westgate, Matt Pro Drupal Development. ISBN-10: 1590597559. First edition. APress © 2007. [cited 08 Apr 2011]. http://www.amazon.com/Pro-Drupal-Development-John-VanDyk/dp/1590597559/
[Virtuoso Universal Server] OpenLink Virtuoso Universal Server. [online]. [cited 08 Apr 2011]. http://virtuoso.openlinksw.com/
[Virtuoso SPARQL Tutorial] Virtuoso SPARQL Tutorial, Part 2. [online]. [cited 08 Apr 2011]. http://virtuoso.openlinksw.com/presentations/SPARQL_Tutorials/SPARQL_Tutorials_Part_2/SPARQL_Tutorials_Part_2.html
[Vocabutek Web Site] Ron Reck's demo site. [online] [cited 08 Apr 2011]. http://www.vocabutek.com/
[Whitburn 2006] Whitburn, Joel. The Billboard Albums. ISBN-13 0-89820-166-7. Sixth edition. Record Research Inc. © 2006. [Data from 1956-2006; other editions are entitled Top Pop Albums.] https://www.recordresearch.com
[1] He believes it was circa 1972 in Buffalo, New York. But his recollection of the Seventies is much like his recollection of the Sixties.
[2] According to Whitburn 2006, Clapton released 41 albums (including reissues) from 1970 to 2005. (He has released three more as of July 2011.) This includes eleven Top 10 albums and three number 1 albums. Whitburn devised a formula for ranking artists by their chart success, with rankings per decade and overall. According to Whitburn's calculations as of 2006, Clapton ranked #10 in the 1990's and ranked #12 all-time; he ranks #21 in most charted albums.
[3] IRI is Internationalized Resource Identifier, a generalization of the Uniform Resource Identifier (URI) enabling the use of Unicode. In this paper, we refer to IRI rather than URI since SPARQL technically permits IRIs.
[4] If we add one more clause to the query asking also for http://dbpedia.org/resource/Category:English_rock_guitarists, then Eric Clapton is the only result.
[5] The full Freebase data set in TSV format is 1.3 GB compressed. The entire Freebase data dump is also available as quadruples in a 3.36 GB download. When uncompressed, the single file freebase-datadump-quadruples.tsv is 27.3 GB.
[6] Technically this is an object value comprised of a number of strings concatenated into a single field.
[7] We believe this is indicative of the open world assumption in formal logic.
[8] The OpenLink Virtuoso Universal Server supports relational data management, XML and RDF data management, free text content management with full text indexing and web services, and functions as a document web server, linked data server and web application server.
[9] "Inference is moreso the means or mechanisms by which reasoning occurs. Reasoning is the 'goal' whereas inferencing is the 'implementation'". -- From answers.semanticweb.com.
[10] owl:sameAs is often used in establishing mappings between two or more ontologies.
[11] Actually, there is a 19th IRI if the query is not constrained to require the string "roll": http://dbpedia.org/resource/Rock_music. This link redirects to a page which among other things lists many subgenres of rock.
[12] However, we chose to use rdfs:subClassOf because it was the most appropriate RDFS property that the Virtuoso database reasoner supported.
[13] It is possible that NOT EXISTS (introduced in SPARQL 1.1) would be sufficient.
[14] We note, however, that the DBpedia Amazon Machine Instance (AMI) released earlier in 2011 shares several of the advantages discussed in this section.
[15] For example, see The Pedantic Web, especially Frequently Observed Problems on the Web of Data.
[16] Booklet accompanying Johnson's Complete Recordings box set, Stephen LaVere, Sony Music Entertainment, 1990, Clapton quote on p. 26.