Archive for September, 2007

Follow up on compression post: Columns, indices, and sorting

Earlier this week I wrote about the advantages of compression in column-oriented databases. A reader had questions about an example I used and the issue of sorting and indexes. I thought the commenter’s points and questions were worth exploring in some depth.

Keeping columns in order

The commenter’s key question was really about maintaining the correspondence between columns in a column-oriented database. The issue isn’t really “getting the data back in the original order” as much as it is ensuring that the system has a way of accessing records in other columns that correspond to a particular record in a sorted column. I’ll explain the details of how this works.

The simplest way to do this is just store one copy of the table sorted on one or more attributes. For example, suppose we have the following table sorted on state:

name   state  salary
joe    ca     10k
dave   ca     55k
mary   ma     50k
nancy  ma     65k
mike   nv     30k
We can physically represent this as three columns, each stored in the order shown in a different file. We can RLE-encode the state column to reduce its space, and, if we secondarily sort the table on salary, can probably also benefit from delta encoding salaries and dictionary encoding names. Any query that includes a predicate over state (e.g., state = ‘ca’) can be evaluated quickly because we can figure out which positions (e.g., 1 and 2) in the three columns need to be looked up to find matching records, without decompressing the state column or looking at all of the records in the other two columns.
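This kind of predicate evaluation over a run-length encoded column can be sketched in a few lines of Python (a hypothetical illustration; the function names are invented, a real system operates on compressed disk pages rather than Python lists, and positions here are 0-based rather than the 1-based numbering used above):

```python
def rle_encode(values):
    """Encode a sorted column as (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

def matching_positions(runs, predicate):
    """Return column positions whose value satisfies the predicate,
    applying it once per run rather than once per row."""
    positions = []
    start = 0
    for value, length in runs:
        if predicate(value):
            positions.extend(range(start, start + length))
        start += length
    return positions

states = ["ca", "ca", "ma", "ma", "nv"]
runs = rle_encode(states)                               # [("ca", 2), ("ma", 2), ("nv", 1)]
ca_positions = matching_positions(runs, lambda s: s == "ca")  # [0, 1]
```

Note that the predicate is applied once per distinct run, not once per row, which is where the savings come from on long runs.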

Of course, we need to be able to efficiently perform this lookup of a given set of positions in a column (e.g., finding the names and salaries that correspond to positions 1 and 2 in the state column). We can do this much faster than a conventional join (which is what the commenter’s foreign-key suggestion would require) because the name and salary columns are stored in the same order as the state column (so the 1st state corresponds to the 1st entry in the name file, etc.). This allows us to either:

  1. Directly offset to the location in a column file where a particular position is stored, if the column is uncompressed and its fields are fixed length (e.g., each record is the same size). This would be the case for salary in the previous example if it were uncompressed.
  2. Use a sparse index to map positions to entries in a column, when records are variable-sized or the column is compressed (as would be the case if name is stored as a text field, for example), so that we can’t apply direct offsetting. A sparse index is simply an index with one entry per *page* of the column file, where an entry indicates the range of positions on the corresponding page. So if the name column has 10,000 names stored on 100 pages, there would be 100 entries in the sparse index. To look up a given position, the system searches the index to quickly find the pages that contain the positions of interest. Because these sparse indices are very compact, they can usually be stored in memory or on a single disk page, and this extra lookup imposes very little overhead (unlike a traditional database index, which can be quite large and require a significant number of I/Os to query).
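A sparse index of this kind can be sketched very simply (a hypothetical illustration with made-up names; here the index stores the first position on each page, and the 10,000-names-on-100-pages figures are taken from the example above):

```python
import bisect

POSITIONS_PER_PAGE = 100  # 10,000 names spread over 100 pages
# One index entry per page: the first position stored on that page.
first_position_on_page = [p * POSITIONS_PER_PAGE for p in range(100)]

def page_for_position(pos):
    """Binary-search the sparse index for the page containing `pos`."""
    return bisect.bisect_right(first_position_on_page, pos) - 1

page_for_position(1)     # page 0
page_for_position(250)   # page 2
```

Because the whole index is 100 small entries, it fits comfortably in memory, and the lookup is a binary search rather than a disk-resident B-tree traversal.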

Notice that the positions will be accessed sequentially within each column file, which further improves performance (especially if a range of consecutive positions are accessed).

That said, the commenter was correct that in this example we only get the maximum benefit of RLE on the state column; a secondary or tertiary sort on name or salary (if we sort on them at all) yields a diminished compression benefit for those columns.

One way to address this is to replicate subsets of columns, sorting each subset in a different order. We call such subsets “projections” (where the columns in a projection are still each stored in separate files). For example, a user might create one projection with the above three columns sorted in state order and another projection with just name and salary sorted in salary order (which will compress well using delta or RLE encoding). That way, if a query wants to find employees in a particular salary range, it can do that quickly using this second projection. Of course, for a query to be able to use a particular projection, that projection must have the columns used in the query (or the system must run an expensive join operation to glue together two projections that are sorted in different orders).
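As a rough sketch of the idea (in Python, with invented helper names; a real system stores each projection’s columns in separate compressed files rather than in-memory lists), the two projections described above might look like:

```python
# The example table from earlier in the post.
rows = [("joe", "ca", 10), ("dave", "ca", 55), ("mary", "ma", 50),
        ("nancy", "ma", 65), ("mike", "nv", 30)]

def make_projection(rows, columns, sort_key):
    """Build a projection: the chosen columns, all sorted on sort_key."""
    idx = {"name": 0, "state": 1, "salary": 2}
    ordered = sorted(rows, key=lambda r: r[idx[sort_key]])
    return {c: [r[idx[c]] for r in ordered] for c in columns}

by_state  = make_projection(rows, ["name", "state", "salary"], "state")
by_salary = make_projection(rows, ["name", "salary"], "salary")

# A salary-range query scans the salary-sorted projection directly:
lo, hi = 40, 60
hits = [n for n, s in zip(by_salary["name"], by_salary["salary"])
        if lo <= s <= hi]
# hits == ["mary", "dave"]
```

The query never touches the state-sorted projection at all; conversely, a query over state would use the first projection, and a query needing state from the second projection would require the expensive join mentioned above.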

Future post: Projections to maximize query performance

This, of course, presents an interesting physical database design question: What projections should the user (or administrator) create to maximize query performance for a given query workload? This is the idea behind the automatic database design features in C-Store and Vertica, which will be the subject of a future post.

Good things come in small packages: The advantage of compression in column databases

In this post, we’ll discuss how column-oriented databases are able to more effectively exploit compression than a typical row-oriented system. There are two key points:

  • Columns of data from the same attribute compress better than rows of tuples in a table, and
  • A well-architected database engine using appropriate compression techniques can operate directly on the compressed data, without decompressing it.

Achieving a high degree of compression is highly desirable. Depending on the dataset, column databases use up to 90% less storage space than the raw data loaded into the database. This means column databases require less storage hardware (and associated space, cooling & power) and/or allow users to analyze much more data on the hardware they already have.

There are several common approaches to database compression. One involves compressing tables using a block-based dictionary compression algorithm like Lempel-Ziv (the algorithm used by gzip and other tools). A variant of these block-based dictionary schemes is the value-based approach, which replaces individual values in records with shorter values from a dictionary – for example, every time the string “John Smith” occurs, it might be replaced with the code “*A”. Other codes would be used for other common strings.
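A value-based dictionary scheme might look roughly like this in Python (a hypothetical sketch; a real system would assign compact variable-length bit codes rather than Python integers):

```python
def build_dictionary(values):
    """Assign a short integer code to each distinct value."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

names = ["John Smith", "Ann Lee", "John Smith", "John Smith"]
dictionary = build_dictionary(names)      # {"Ann Lee": 0, "John Smith": 1}
encoded = [dictionary[v] for v in names]  # [1, 0, 1, 1]

# Decoding reverses the mapping.
decode = {code: v for v, code in dictionary.items()}
decoded = [decode[c] for c in encoded]
```

Each occurrence of a long repeated string now costs one small code plus a single shared dictionary entry.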

Another type of compression can be applied to sorted sequences of numbers or values that generally differ by just a few bits. For example, if a collection of tuples contains the values “10001, 10002, 10005, 10008” in one attribute, those can be compactly encoded as a series of differences, or deltas: “10001, +1, +3, +3”. Though this hasn’t reduced the number of entries in the list, the differences are much smaller than the original numbers and so can be stored with fewer bits (in this case, it takes 16 bits to encode 10001, but just 2 to encode +1 and 3 to encode +3). Text data works here too, if you think of a string as a sequence of numeric ASCII character codes.
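The delta scheme just described can be sketched in a few lines of Python (an illustration only; the names are invented, and a real encoder would pack the deltas into a minimal number of bits rather than storing Python integers):

```python
def delta_encode(values):
    """Store the first value, then successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values by running-summing the deltas."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

vals = [10001, 10002, 10005, 10008]
deltas = delta_encode(vals)   # [10001, 1, 3, 3]
```

The round trip is lossless: decoding the deltas recovers the original sequence exactly.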

These methods effectively reduce the space requirements for storing data and work similarly in column- or row-oriented systems. But column-oriented databases can also take advantage of other forms of compression that row-oriented databases cannot easily leverage. We will spend the rest of this post examining alternative compression techniques that are advantageous to column-oriented database solutions.

The column compression advantage

To understand the additional advantage that column-oriented databases have with respect to compression, let’s examine how a column-oriented database structures data on disk and how compression on that stored data works.

Think of a column-oriented database as storing each column of a table in a separate file. This means that when the database system wants to access data, it only needs to read the files representing those columns that it cares about. This is unlike a row-oriented database that must always scan over all of the columns in the rows it is reading.

When the database wants to reconstruct a tuple in the original table (remember, column-oriented databases typically still provide a relational API!), it needs some way to establish correspondence between different columns. The simplest way to do this is to ensure that the positions in the different files line up — e.g., that the nth entry in column 1 belongs to the same tuple as the nth entry in column 2, and so on. Note, however, that these columns can be stored in any order the system wants, and that certain orders — e.g., those sorted on one or more of the attributes (e.g., customer state, and then customer name) — may allow the system to rapidly find records of interest (e.g., Smiths in California).

Rather than simply storing each column once in one sort order, column-oriented databases will often store each column multiple times, in different sort orders, much as materialized views are used in traditional database systems. This improves query performance since there is a higher probability of there being a “good” sort order for a particular query. But storing different columns in different sort orders provides the database with another benefit: sorted columns of data compress extremely well.

There are several reasons for this. First, when a sorted column has only a few distinct values (such as a column of states, genders, departments, or even product codes) there will be many repeated values that can be compressed away. For example, given a list of numbers like 1,1,1,1,2,2,2,3,3, it’s clearly going to be more compact to encode this as 4 x 1, 3 x 2, 2 x 3 (we’ve used just six numbers instead of nine). This type of compression is called run-length encoding. Of course, this technique can be used with textual data just as effectively as with numeric data.

A second reason that columns compress so well is that there is relatively little variability between successive values in columns — that is, columns are low entropy. This means that schemes like Lempel-Ziv encoding, which are especially good at compressing such low-entropy data, can be substantially more effective in column-based systems. This is because individual columns tend to have fewer differences between values than entire rows and hence compress better.
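A rough way to see the low-entropy effect (an illustration with synthetic data, not a benchmark) is to feed the very same bytes to Python’s zlib (a Lempel-Ziv/DEFLATE implementation) laid out column-wise versus interleaved row-wise:

```python
import random
import zlib

random.seed(0)
# A low-entropy column (long runs) and a high-entropy column (varied digits).
states = ["ca"] * 1000 + ["ma"] * 1000
salaries = [str(random.randint(10000, 99999)) for _ in range(2000)]

# Column layout: each column's values stored contiguously.
column_layout = ("".join(states) + "".join(salaries)).encode()
# Row layout: the same bytes, interleaved record by record.
row_layout = "".join(s + p for s, p in zip(states, salaries)).encode()

# Identical content, but the column layout exposes the long state runs
# to the compressor, while the row layout hides them among random digits.
len(zlib.compress(column_layout)) < len(zlib.compress(row_layout))  # True
```

The gap grows as the low-entropy columns get wider relative to the high-entropy ones, which is exactly the situation run-length and delta encoding exploit further.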

Improved compression benefits

The benefit of techniques like run-length encoding can be tremendous. Imagine a table with 50 million sales records, sorted on state and then last name as described above. To represent the state column using run-length encoding, the system needs just 50 state names and 50 run lengths (e.g., AK x 1,000,000, AL x 1,000,000, etc.). Perhaps more surprisingly, there’s also a big win to be had for the last name column. Though it won’t be strictly sorted, it will be sorted within each state (e.g., when the two columns are joined, they will form a sorted list of tuples like (AK, Aaronson), (AK, Adams), … , (AL, Adams), (AL, Baker), (AL, Cook), …). Last names within each state can thus be run-length or delta encoded quite effectively. For example, if there are 10,000 last names in the database and customers are evenly spread across 50 states, there will be 50,000,000 / (50 × 10,000) = 100 repeats of each last name in each state. This means run-length encoding will reduce the size of the last name column by a factor of 100.
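The arithmetic in that example is quick to verify (a trivial sketch, using the figures from the paragraph above):

```python
records = 50_000_000
states = 50
last_names = 10_000

# With an even spread, each (state, last name) pair repeats this often:
repeats_per_name_per_state = records // (states * last_names)  # 100

# And the RLE'd state column needs only one (value, run length) pair per state:
state_column_runs = states  # 50 runs instead of 50 million entries
```

One run per repeated group is the entire storage cost, which is why the factor-of-100 reduction falls straight out of the division.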

The compression techniques described above have another benefit: the database system often doesn’t need to convert the compressed records back into their uncompressed representation before operating on the data (direct operation can be applied to run-length, delta encoded, and some kinds of dictionary encoded data). For example, it’s trivial to determine whether a predicate is satisfied by a run-length encoded (run length, value) pair by simply applying the predicate to the value. And when that predicate is satisfied, it applies to the entire run’s worth of tuples. It is also possible to efficiently join or aggregate run-length encoded data, and the performance wins can be dramatic. For example, in his SIGMOD 2006 paper “Integrating Compression and Execution in a Column-Oriented Database System”, my student Daniel Abadi showed that for many queries over the TPC-H data set, performance gains in C-Store (an academic predecessor to the Vertica Database) are an order of magnitude or more relative to an uncompressed database.
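As a sketch of direct operation (hypothetical names; the column is held here as (value, run length) pairs), a predicate or aggregate touches each run once rather than each row:

```python
# A run-length encoded salary column: (value, run_length) pairs.
rle_salaries = [(10_000, 4), (55_000, 3), (65_000, 2)]

# COUNT(*) WHERE salary > 50000 -- the predicate is tested once per run,
# and a satisfied run contributes its whole length at once.
count = sum(length for value, length in rle_salaries if value > 50_000)

# SUM(salary) -- multiply instead of iterating over individual rows.
total = sum(value * length for value, length in rle_salaries)

count  # 5  (the 55k run of 3 plus the 65k run of 2)
total  # 335000
```

Nine logical rows are processed with three comparisons and three multiplications; on a run of a million identical values the same code still does one comparison and one multiply.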

Notice that these compression techniques would be much less effective in a row-oriented system because successive tuples in a table have a lot less in common than successive values in a column. Though it’s probably possible to re-architect a row-oriented database to get some of the advantages described above, doing so is tantamount to making the system column-oriented, since it requires the compressed data to be stored separately in a compact representation. So, the bottom line is that redundant columns, multiple sort orders, and column orientation pave the way for dramatic performance gains from compression.

Note: This post has been updated to improve clarity about compression techniques.
