The Resource Description Framework (RDF) is a way to describe information about relationships between entities and objects. It was originally developed by the W3C as a way to describe information about resources on the Web. It is intended to be the data model used in the Semantic Web, where web pages contain not just text but also structured records describing the data they contain and the relationships in that data.
RDF has seen widespread adoption in recent years. For example, the entire MIT library catalog is available in RDF format. More recently, a number of biology researchers have begun to publish their data in RDF, including the UniProt comprehensive catalog of protein sequence, function, and annotation data.
An RDF document consists of a collection of statements of the form subject-property-object. For example, a library database that stores data about authors and books might have statement triples like “User1 has-name ‘Sam Madden’”, “User1 is-an Author”, “User1 wrote Book1”, “Book1 is-a Book”, “Book1 has-title ‘Who ate my cheese?’”, etc.
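As a minimal sketch, the example statements above can be written down as plain (subject, property, object) tuples in Python. The `Triple` type and the `objects_of` helper are names made up for this illustration, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject-property-object."""
    subject: str
    prop: str
    obj: str


# The example statements from the text.
TRIPLES = [
    Triple("User1", "has-name", "Sam Madden"),
    Triple("User1", "is-an", "Author"),
    Triple("User1", "wrote", "Book1"),
    Triple("Book1", "is-a", "Book"),
    Triple("Book1", "has-title", "Who ate my cheese?"),
]


def objects_of(triples, subject, prop):
    """Return every object o such that (subject, prop, o) is a statement."""
    return [t.obj for t in triples if t.subject == subject and t.prop == prop]


print(objects_of(TRIPLES, "Book1", "has-title"))
```

Every question about the data, however simple, becomes a filter over this single flat collection of statements.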
It should be clear that an RDF document, containing a collection of triples about a group of resources, is a structured database that users may want to browse, search, or query in a number of ways. Building tools that make it possible to do this efficiently is one of the goals of our research. In particular, we are interested in the performance of different on-disk storage representations for a collection of triples.
Designing tools to handle RDF efficiently
Our first attempts to do this have focused on leveraging relational database technology. The obvious relational representation of an RDF document is as a table with three columns, which would conventionally be stored as a series of 3-tuples laid out on disk in a row-major format. This representation, however, performs quite poorly for many types of queries. Suppose, for example, we want to find the names of all the authors of the book “Who ate my cheese?”. We will first have to find the triple “bookM has-title ‘Who ate my cheese?’”. We will then have to perform a self-join with the triples table to find all of the triples of the form “personN wrote bookM”. Finally, for each author, we will have to perform another self-join to find the triples that carry the author's name, such as “personN has-name ‘Sam Madden’”.
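The query just described can be sketched against a three-column triples table using Python's built-in `sqlite3` module. SQLite is only a convenient stand-in here, not the system we evaluated, and the table name, column names, and function names are assumptions made for illustration.

```python
import sqlite3


def build_triples_db():
    """Create an in-memory triples table holding the example statements."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
    conn.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [
            ("User1", "has-name", "Sam Madden"),
            ("User1", "is-an", "Author"),
            ("User1", "wrote", "Book1"),
            ("Book1", "is-a", "Book"),
            ("Book1", "has-title", "Who ate my cheese?"),
        ],
    )
    return conn


def find_author_names(conn, title):
    """Find author names for a title: one lookup plus two self-joins."""
    query = """
        SELECT n.object
        FROM triples AS t                      -- bookM has-title <title>
        JOIN triples AS w                      -- personN wrote bookM
          ON w.object = t.subject AND w.property = 'wrote'
        JOIN triples AS n                      -- personN has-name <name>
          ON n.subject = w.subject AND n.property = 'has-name'
        WHERE t.property = 'has-title' AND t.object = ?
    """
    return [row[0] for row in conn.execute(query, (title,))]


conn = build_triples_db()
print(find_author_names(conn, "Who ate my cheese?"))
```

The same table appears three times in the `FROM` clause, which is exactly the self-join cost the triples representation incurs.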
Hence, we have been looking at alternative representations that eliminate these self-joins. (We still expose a logical model of a collection of triples that the user queries, but we transform user queries to apply to our modified physical representation.) For example, one possible representation is to store a table where the first column contains the subject, and each additional column corresponds to a particular property. This representation is sometimes called a “property representation”, as shown on the bottom of the figure above. Though this representation can have many NULL values if there are a variety of subjects with diverse properties defined, it has the advantage that all of the properties of a given subject are now stored together.
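A rough sketch of pivoting triples into such a property representation follows. It assumes each subject has at most one value per property (a multi-valued property would need extra handling), and `to_property_table` is a hypothetical helper name rather than anything from our implementation.

```python
def to_property_table(triples):
    """Pivot (subject, property, object) triples into one row per subject.

    Returns (header, rows). A property that a subject lacks is stored as
    None, which plays the role of NULL. Assumes single-valued properties.
    """
    columns = sorted({prop for _, prop, _ in triples})
    by_subject = {}
    for subject, prop, obj in triples:
        # Start each new subject with every property column set to None.
        by_subject.setdefault(subject, dict.fromkeys(columns))[prop] = obj
    header = ["subject", *columns]
    rows = [(s, *(vals[c] for c in columns)) for s, vals in by_subject.items()]
    return header, rows


TRIPLES = [
    ("User1", "has-name", "Sam Madden"),
    ("User1", "is-an", "Author"),
    ("User1", "wrote", "Book1"),
    ("Book1", "is-a", "Book"),
    ("Book1", "has-title", "Who ate my cheese?"),
]

header, rows = to_property_table(TRIPLES)
print(header)
for row in rows:
    print(row)
```

Even with only two subjects, more than half of the property cells come out empty, which gives a feel for how sparse the table becomes as the set of properties grows.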
Our work in this area, “Scalable Semantic Web Data Management Using Vertical Partitioning,” appeared at the VLDB Conference in Vienna in September. It showed that using a column-oriented database, along with this property representation, allows us to overcome the overhead of representing NULLs, while providing two orders of magnitude better performance than the naive triples representation. This is particularly true when processing queries that must access many triples during execution (e.g., computing the number of books grouped by subject area or institution). Of course, there is a fair amount of subtlety to getting good performance out of such a representation. Have a look at our conference paper for the details!
Caveats for column- and row-store databases
As we’ve discussed elsewhere in this blog, column-stores can perform worse than row-stores for certain classes of queries. In particular, for lookups of a single record (e.g., all of the information about a particular author), a row-oriented database (using a property representation) may outperform a column-oriented system. This is because it only has to seek to one location on disk to read the data from this record, whereas a column store will have to seek to each column to reconstruct the entire record.
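The difference can be illustrated with a toy model of the two layouts, counting one "seek" per separately stored region that a single-record lookup must touch. This is a deliberately simplified sketch: the seek counts are a modeling assumption rather than measurements, and the second record is made up purely for illustration.

```python
# A small property table: (subject, has-name, wrote).
ROWS = [
    ("User1", "Sam Madden", "Book1"),
    ("User2", "Example Author", "Book2"),  # hypothetical record
]


def to_columns(rows):
    """Convert a row-major table into a list of per-attribute columns."""
    return [list(col) for col in zip(*rows)]


def row_store_lookup(rows, index):
    """Row-major: the whole record is contiguous, so one seek suffices."""
    return rows[index], 1


def column_store_lookup(columns, index):
    """Column-major: one seek per column to stitch the record together."""
    record = tuple(col[index] for col in columns)
    return record, len(columns)


print(row_store_lookup(ROWS, 0))
print(column_store_lookup(to_columns(ROWS), 0))
```

Both lookups return the same record, but in this model the column layout pays one seek per property, so the gap widens as the property table gets wider.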
There are other situations where neither a row- nor column-oriented property representation is ideal. Imagine, for example, a user browsing an RDF-based Web site containing our library database. During browsing, suppose the user navigates from books or articles, to authors, to related books and articles, and so on. Such browsing queries in a property representation will lead to (slow) self-joins on the property table, just as they did in the triples table. Hence, a more sensible representation for a browsing-oriented database would be to store a given record R near the records the user is likely to navigate to from R. This is the topic of our current research in this area.
* Editor's note: While this post will show up in the blog as written by Sam Madden, it has two authors: Samuel Madden (MIT) and Daniel Abadi (Yale).