Earlier this week I wrote about the advantages of compression in column-oriented databases. A reader had questions about an example I used and the issue of sorting and indexes. I thought the commenter’s points and questions were worth exploring in some depth.
Keeping columns in order
The commenter’s key question was really about maintaining the correspondence between columns in a column-oriented database. The issue isn’t really “getting the data back in the original order” as much as it is ensuring that the system has a way of accessing records in other columns that correspond to a particular record in a sorted column. I’ll explain the details of how this works.
The simplest way to do this is just store one copy of the table sorted on one or more attributes. For example, suppose we have the following table sorted on state:
We can physically represent this as three columns each stored in the order shown in a different file. We can RLE encode the state column to reduce its space, and, if we secondarily sort the table on salaries, can probably also benefit from delta encoding salaries and dictionary encoding names. Any query that includes a predicate over state (e.g., state = ‘ca’) can be evaluated quickly because we can figure out what positions (e.g., 1 and 2) in the three columns need to be looked up to find matching records without decompressing the state column or looking at all of records of the other two columns.
Of course, we need to be able to efficiently perform this look up of a given set of positions in a column (e.g., finding the names and salaries that correspond to positions 1 and 2 in the state column). We can do this much faster than a conventional join would (which is what the commenter’s foreign key suggestion would require) because the name and salary columns are stored in the same order as the state column (so the 1st state corresponds to the 1st entry in the name file, etc.). This allows us to either:
- Directly offset to the locations in the columns where a particular position is stored, if the column is uncompressed and fields are fixed length (e.g., each record is the same size). This would be the case for salary in the previous example if it were uncompressed, or, if each record is variable-sized, or the column is compressed (as would be the case if name is stored as a text field, for example) such that we can’t apply direct offsetting.
- Use a sparse index to map positions to entries in a column. A sparse index is simply an index with one entry per *page* of the column file where an entry indicates the range of positions on the corresponding page. So if the name column has 10,000 names stored on 100 pages, there would 100 entries in the sparse index. To look up a given position, the system searches the index to quickly find the pages that contain the positions of interest. Because these sparse indices are very compact, they can usually be stored in memory or on a single disk page, and this extra lookup imposes very little overhead (unlike a traditional database index which can be quite large and require a significant number of I/Os to query).
Notice that the positions will be accessed sequentially within each column file, which further improves performance (especially if a range of consecutive positions are accessed).
That said, the commenter was correct in that in this example we will really only get the maximum benefit of RLE on the state column with (if we sort them at all) a potentially decreased benefit for a secondary or tertiary sort on name or salary.
One way to address this is to replicate subsets of columns, sorting each subset in a different order. We call such subsets “projections” (where the columns in a projection are still each stored in separate files). For example, a user might create one projection with the above three columns sorted in state order and another projection with just name and salary sorted in salary order (which will compress well using delta or RLE encoding). That way, if a query wants to find employees in a particular salary range, it can do that quickly using this second projection. Of course, for a query to be able to use a particular projection, that projection must have the columns used in the query (or the system must run an expensive join operation to glue together two projections that are sorted in different orders).
Future post: Projections to maximize query performance
This, of course, presents an interesting physical database design question: What projections should the user (or administrator) create to maximize query performance for a given query workload? This is the idea behind the automatic database design features in C-Store and Vertica, which will be the subject of a future post.