Our goal in this myth-debunking series has been to shed some light on common misconceptions about column-store databases that we have come across in our conversations with people in the database community. This post we debunk yet another myth, which is perhaps even more subtle than the vertical partitioning myth:
Myth: The performance advantages of a column store can gained by replacing a row-oriented storage layer of a DBMS with a column-oriented storage layer without also rewriting the row-oriented query execution system that plans and processes queries on top of the storage system.
Experiments that debunk this myth
At first blush, it may seem that all of the benefits of column orientation can be had by simply replacing the storage layer. For example, one commonly cited benefit of column stores is: Because they store each attribute of a table separately, they are better for read queries since they can avoid reading in unnecessary attributes. This benefit for read queries can clearly be obtained solely by replacing the storage layer — the query executor simply requests the columns it needs and the storage layer reads them off disk and merges them into rows so that the query executer can process them in the standard fashion. The other commonly cited advantage of column stores is that by storing data from the same attribute domain consecutively, they see higher data value locality, and thus an improved compression ratio. Again, this improvement is at the storage layer, and the executor need not be aware of compression at all if data is decompressed as it is read off of disk.
In our recent SIGMOD paper, “Column-Stores vs. Row-Stores: How Different Are They Really?”, we ran some experiments to discover how closely related the performance of column stores is to these storage layer optimizations. We took the C-Store column-store database (the academic precursor to Vertica) and ran it on the same Star Schema Benchmark discussed in our previous two posts. We then ran the same benchmark on the same C-Store software after stripping out all column-specific operations in the query executer (which sits on top of the storage layer). More specifically, the column-oriented storage layer reads in the subset of data needed to answer a particular query (i.e., only those columns accessed by the query), merges these columns into tuples, and then runs a traditional query executer on these tuples. So in the first case we have a column-oriented executer sitting on top of a column-oriented storage layer, and in the second case we have a traditional (row-oriented) executer sitting on top of a column-oriented storage layer. Our benchmark results of these two systems are presented in the figure below: