Data compression is like money or free time: the more the better. Compression can save big bucks on storage costs, increase data center density, or allow more data to be kept, while simultaneously increasing query performance in cases where I/O is the bottleneck. So naturally, database marketeers want to claim great compression ratios. Vertica is no exception.
Generally, I am skeptical of vendors who claim amazing compression breakthroughs. In fact, I’m quite skeptical of any lossless compression product that promises a compression ratio up front, without any knowledge about the data to be compressed. Remember the WEB 16:1 compressor saga? Also, don’t we all know that it’s theoretically impossible to contrive a general, lossless compression utility that makes all input files smaller? Take a look at Steve Tate’s or Mike Goldman’s challenges in the timeless comp.compression FAQ. I’d double their wagers.
Thus I expect some skepticism surrounding Vertica’s claims of compression capabilities. To dispel some of these doubts, I’ll discuss the three main reasons why Vertica’s compression is better than you’ll see in competing databases for a long time to come. I’ll also present a real-life customer example (albeit a best-case scenario) of database compression. I won’t be claiming any specific compression ratio, like “up to 10x compression or more!” But I will claim that we can do as well as anyone else, and usually much better.
Compression Advantages of Vertica’s True Column Store
There are three main reasons that Vertica’s compression is superior and each reason has its own advantage for compressing data:
Reason #1: Vertica Is A True Column Store.
If you walk down attribute columns there is more similarity than if you cut across the rows. So why can’t a row store simply compress its blocks by slicing each block of rows into columns? It can, but given the same block size, the column store will see more data and therefore get better compression, as shown in Illustration 1, where data was fed into gzip in rows, and in columns.
In the extreme case of run length encoding (RLE), even many billions of records can be encoded in a single disk block, which is far outside the scope of what can be done by slicing up blocks of rows.
Reason #2: Data Is Always Sorted in Vertica.
Data that is well organized compresses better than data that is located haphazardly, as shown in Illustration 2. Take a text file containing a million random integers (say between 1 and 10 million). It’ll be a shade under 8MB, as most entries are 7 digits plus a newline. Applying gzip, you’ll find that the data compresses, because the numbers are made of digits, which are a subset of all byte representations. I got 3.7MB, for a bit better than 2:1 compression. However, sorting the data makes it much more compressible; 3.4:1 compression is achieved. By not using strings, Vertica gets better than 10:1 compression in this case.
This is a contrived case, but real data also has trends. Stocks trade from pennies to over a hundred thousand dollars a share. But if the data is sorted by stock, date, and time, there are few distinct trading prices and the data is quite compressible. Unlike other databases that keep data in insertion order, Vertica keeps the data sorted all the time, improving compression.
Reason #3: Vertica Doesn’t Do In-Place Updates.
In most databases that were originally designed for transaction processing, updates are applied in place. Since new values could come along that don’t compress as well as the old values, some empty space must be left, or updates foregone. Since Vertica puts updates in a separate place (such as the Write Optimized Store), we can squeeze every last bit out of the data. I’ve seen some competing systems that update in place and others that don’t allow updates if maximum compression is on, but none that allow updates while still squeezing every last bit out of the data representation.
A Real-Life Example
The reasons sound good, but how well does Vertica’s compression work in practice? Take a look at the following example and ask yourself if your favorite row store (which is probably soon to claim that it is also a column store) can match Vertica’s true column compression.
We have a customer that collects metrics from some meters. There are 4 columns in the schema:
- Metric. There are a few hundred metrics collected.
- Meter. There are a couple thousand meters.
- Collection Time Stamp. Each meter spits out metrics every 5 minutes, 10 minutes, hour, etc., depending on the metric.
- Metric Value. A 64-bit floating point value.
So how much space will this take to record? A baseline .csv file of 200 million of the meter/metric/time/value rows takes 6200 MB, for ~31 bytes per row. gzip reduces this to 1050MB.
Vertica compresses much better. By sorting the data on metric, meter, and collection time, Vertica not only optimizes common query predicates (which specify the metric or a time range), but exposes great compression opportunities for each column:
- Metric: There aren’t many. With RLE, it is as if there are only a few hundred rows. Vertica compressed this column to 5KB.
- Meter: There are quite a few, and there is one record for each meter for each metric. With RLE, Vertica brings this down to a mere 35MB.
- Collection Time Stamp: The regular collection intervals present a great compression opportunity. Vertica compressed this column to 20MB.
- Metric Value: Some metrics have trends (like lots of 0?values when nothing happens). Others change gradually with time. Some are much more random, and less compressible. However, Vertica compressed the data to only 363MB.
The total size for all the columns is 418MB (a shade over 2 bytes per row). This is less than half of what gzip used, and represents a compression ratio of 30:1 over the raw data.