In this post, we’ll discuss how column-oriented databases are able to more effectively exploit compression than a typical row-oriented system. There are two key points:
- Columns of data from the same attribute compress better than rows of tuples in a table, and
- A well-architected database engine using appropriate compression techniques can operate directly on the compressed data, without decompressing it.
Achieving a high degree of compression is highly desirable. Depending on the dataset, column databases use up to 90% less storage space than the raw data loaded into the database. This means column databases require less storage hardware (and associated space, cooling & power) and/or allow users to analyze much more data on the hardware they already have.
There are several common approaches to database compression. One involves compressing tables using a block-based dictionary compression algorithm like Lempel-Ziv (the algorithm used by gzip and other tools). A variant of these block-based dictionary schemes are value-based approaches, which replace individual values in records with shorter values from a dictionary – for example, every time the string “John Smith” occurs, it might be replaced with the code “*A”. Other codes would be used for other common strings.
Another type of compression can be applied to sorted sequences of numbers or values that generally differ by just a few bits. For example, if a collection of tuples contain the values “10001, 10002, 10005, 10008″ in one attribute, those can be compactly encoded as a series of differences, or deltas as “10001, +1, +3, +3″. Though this hasn’t reduced the number of entries in the list, the differences are much smaller than the original number, and so can be stored with fewer bits (in this case, it takes 16 bits to encode 10001, but just 2 to encode +1 and 3 to encode +3.) Text data works here too, if you think of a string as really being a sequence of numeric ASCII character codes.
These methods effectively reduce the space requirements for storing data and work similarly in column- or row-oriented systems. But column-oriented databases can also take advantage of other forms of compression that row-oriented databases cannot easily leverage. We will spend the rest of this post examining alternative compression techniques that are advantageous to column-oriented databases solutions.
The column compression advantage
To understand the additional advantage that column-oriented databases have with respect to compression, let’s examine how a column-oriented database structures data on disk and how compression on that stored data works.
Think of a column-oriented database as storing each column of a table in a separate file. This means that when the database system wants to access data, it only needs to read the files representing those columns that it cares about. This is unlike a row-oriented database that must always scan over all of the columns in the rows it is reading.
When the database wants to reconstruct a tuple in the original table (remember, column-oriented databases typically still provide a relational API!), it needs some way to establish correspondence between different columns. The simplest way to do this is to ensure that the positions in the different files line up — e.g., that the nth entry in column 1 belongs to the same tuple as the nth entry in column 2, and so on. Note, however, that these columns can be stored in any order the system wants, and that certain orders — e.g., those sorted on one or more of the attributes (e.g., customer state, and then customer name) — may allow the system to rapidly find records of interest (e.g., Smith’s in California).
Rather than simply storing each column once in one sort order, column-oriented databases will often store each column multiple times, in different sort orders, much as materialized views are used in traditional database systems. This improves query performance since there is a higher probability of there being a “good” sort order for a particular query. But storing different columns in different sort orders provides the database with another benefit: sorted columns of data compress extremely well.
There are several reasons for this. First, when a sorted column has only a few different distinct values (such as a column of states, or genders, departments, or even product codes) there will be many repeated values that can be compressed away. For example, given a list of numbers like 1,1,1,1,2,2,2,3,3, it’s clearly going to be more compact to encode this as 4 x 1, 3 x 2, 2 x 3 (we’ve used just six numbers, instead of 9.) This type of compression is called run-length encoding. Of course, this technique can be used with textual data just as effectively as with numeric data.
A second reason that columns compress so well is that there is relatively little variability between successive values in columns — that is, columns are low entropy. This means that schemes like Lempel-Ziv encoding, which are especially good at compressing such low-entropy data, can be substantially more effective in column-based systems. This is because individual columns tend to have fewer differences between values than entire rows and hence compress better.
Improved compression benefits
The benefit of techniques like run length encoding can be tremendous. Imagine a table with 50 million sales records, sorted on state and then last name as described above. To represent the state column using run length encoding, the system needs just 50 states names and 50 run lengths (e.g., AK x 1,000,000 , AL x 1,000,000, etc). Perhaps more surprisingly, there’s also a big win to be had for the last name column. Though it won’t be strictly sorted, it will be sorted within each state (e.g., when the two columns are joined, the will form a sorted list of tuples like (AK, Aaronson), (AK, Adams), … , (AL, Adams), (AL, Baker), (AL, Cook), …. ). Last names within each state can thus be run length or delta encoded quite effectively. For example, in the 50 million record database, if there are 10,000 last names in the database, and customers are evenly spread across 50 states, there will be 50,000,000 / (50×10,000) = 100 repeats of each last name in each state. This means run length encoding will reduce the size of the last name column by a factor of 100.
The compression techniques described above have another benefit: the database system often doesn’t need to convert the compressed records back into their uncompressed representation before operating on the data (direct operation can be applied to run length, delta encoded, and some kinds of dictionary encoded data.) For example, it’s trivial to determine whether a predicate is satisfied by a run length encoded (run length, value) pair by simply applying the predicate to the value. And when that predicate is applied, it can be applied to the entire run’s worth of tuples. But it is possible to efficiently join or aggregate run length encoded data too, and the performance wins can be dramatic. For example, in his SIGMOD 2006 paper “Integrating Compression and Execution in a Column-Oriented Database System“, my student Daniel Abadi showed that for many queries over the TPC-H data set, performance gains in C-Store (an academic predecessor to the Vertica Database) are an order of magnitude or more relative to an uncompressed database.
Notice that these compression techniques would be much less effective in a row-oriented system because successive tuples in a table have a lot less in common than successive values in a column. Though it’s probably possible to re-arch
itect a row-oriented database to get some of the advantages described above, doing so is tantamount to making the system column-oriented, since it requires the compressed data to be stored separately in a compact representation. So, the bottom line is that redundant columns, multiple sort orders, and column orientation pave the way for dramatic performance gains from compression.
Note: This post had been updated to improve clarity about compression techniques.