No, You Do Not Need One Projection Per Query in Vertica

Projections are the Vertica Analytic Database's only data structure. Every row of every table stored in Vertica is stored in a projection; there is no such thing as a query that "does not hit the projections." If someone says those words, you should immediately suspect their motives and/or competence. We are quite open about projections (see our previous posts on the topic) and we think they are part of Vertica's fundamental technical advantage. For a more detailed description of projections, please see our VLDB paper from last year.

The idea that Vertica needs a special projection for every query in order to get good performance is simply wrong. This rumor is spread as a FUD sales tactic by one of our more unscrupulous competitors, who knows it is not true and yet promulgates it anyway. We have typically assumed that people see through this transparent tactic, but after recently reading the same idea repeated in some otherwise credible articles and posts, I feel the need to set the record straight. The reason our competitor's FUD-inducing claim is semi-believable is that it plays on a classic DBA nightmare: the full table scan in a Row Store system.

There is a fundamental technical difference between a native Column Store (e.g., Vertica) and a Row Store when you run ad hoc analysis without a specialized data structure for that query. In a Row Store, if you don't have an appropriate index, the system must fall back to a full table scan to retrieve the data, and scanning an entire table's worth of data is almost always a disastrous amount of I/O for large tables. In a Column Store, however, even if you don't have an optimal physical structure for a specific query (for example, the optimal projection in Vertica), you simply end up with column scans for the columns referenced in the query.

Furthermore, because we built our column storage and execution engine from the ground up with this kind of scenario in mind, our specialized storage format can often avoid reading all of a column's data from disk even when a "full column scan" is needed. Along with the other well-discussed benefits, such as better compression, this fundamental I/O advantage for ad hoc queries is why a Column Store architecture is so much better suited to many data warehouse scenarios than a Row Store.
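
To make the I/O difference concrete, here is a minimal back-of-the-envelope sketch in Python. The table, column widths, and compression ratio are invented purely for illustration; this is a toy model, not Vertica's planner or storage format.

```python
# A back-of-the-envelope I/O model: how many bytes a query touching two
# columns must read in each architecture. All numbers are made up.

ROWS = 1_000_000_000                      # rows in a hypothetical fact table
COLUMN_WIDTHS = {                         # average stored bytes per value
    "sale_id": 8, "store_id": 4, "product_id": 4,
    "sale_date": 4, "price": 8, "comment": 40,
}

def row_store_scan_bytes():
    """Without a suitable index, a row store falls back to a full table scan:
    every column of every row is read, whatever the query actually needs."""
    return ROWS * sum(COLUMN_WIDTHS.values())

def column_store_scan_bytes(queried_columns, compression_ratio=1.0):
    """A column store reads only the referenced columns; encoding and
    compression (modeled here as a single ratio) shrink that further."""
    return ROWS * sum(COLUMN_WIDTHS[c] for c in queried_columns) / compression_ratio

cols = ["store_id", "price"]
print(f"row store   : {row_store_scan_bytes() / 1e9:.0f} GB read")
print(f"column store: {column_store_scan_bytes(cols, compression_ratio=4) / 1e9:.0f} GB read")
```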

When UPDATE is actually INSERT

At the VLDB 2012 conference a few weeks ago, we had the chance to listen to Jiri Schindler give a tutorial about NoSQL. His interesting and informative presentation covered the fundamental architecture and I/O usage patterns of RDBMS systems and of various NoSQL data management systems, such as HBase, Cassandra, and MongoDB.

During the presentation, Schindler summarized the basic I/O access patterns of columnar databases on a single slide. It is hard to capture the operation of the various columnar database systems in one slide, and Schindler did a great job given the constraints of the presentation. However, while his characterization may hold for other columnar databases, the Vertica Analytic Database has a different I/O pattern for UPDATEs, which we want to explain in more detail.

First, Vertica does not require synchronous I/O to a recovery log. Unlike most other RDBMS systems, Vertica implements durability and fault tolerance via distributed replication.

Second, since Vertica never modifies storage in place, it avoids the other I/O-intensive operations referenced in the slide.

When a user issues an UPDATE statement, Vertica performs the equivalent of a delete followed by an insert. The existing row is deleted by inserting a Delete Vector (a small record saying that the row was deleted), and a new copy of the row with the appropriately updated columns is inserted. Both the Delete Vector and the new version of the row are stored in a memory buffer known as the WOS (write optimized store). After sufficient data has accumulated in the WOS from INSERTs, UPDATEs, DELETEs, and COPYs (bulk loads), it is moved in bulk to disk storage known as the ROS (read optimized store).
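
Conceptually, the flow looks something like the following Python sketch. The class and method names are ours, invented to illustrate the mechanism just described; they are not Vertica's actual data structures.

```python
# Illustrative model of UPDATE-as-delete-plus-insert with an in-memory
# write-optimized buffer. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DeleteVector:
    row_id: int                     # marks a previously stored row as deleted

@dataclass
class WriteOptimizedStore:
    new_rows: list = field(default_factory=list)
    delete_vectors: list = field(default_factory=list)

    def update(self, old_row_id, new_row):
        # UPDATE = record a Delete Vector for the old row + insert the new version
        self.delete_vectors.append(DeleteVector(old_row_id))
        self.new_rows.append(new_row)

    def ready_to_move_out(self, threshold=100_000):
        # Once enough data accumulates, it is written to on-disk column files
        # in one bulk operation, amortizing the sort/encode/write costs.
        return len(self.new_rows) + len(self.delete_vectors) >= threshold

wos = WriteOptimizedStore()
wos.update(old_row_id=42, new_row={"meter": 7, "metric": "temp", "value": 21.5})
```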

It is important to note that existing files in the ROS are not modified when data is moved from the WOS to the ROS; rather, a new set of sorted and encoded column files is created. To avoid a large number of files accumulating over time, the Tuple Mover regularly merges column files together using an algorithm that limits the number of times any tuple is rewritten and relies on large contiguous disk operations, which perform well on most modern file and disk systems.

This arrangement has several advantages:

  1. From the user's point of view, the UPDATE statement completes quickly, and future queries get the expected answer (by filtering out the original values at runtime using the Delete Vectors).
  2. The cost of sorting, encoding, and writing column files to disk is amortized over a large number of rows by utilizing the in-memory WOS.
  3. I/O is always in proportion to the number of rows inserted or modified – it is never the case that an update of a small number of rows causes I/O on a significant amount of previously stored data.

For more details about how data is stored, as well as Vertica's overall architecture and design decisions, please consider reading our VLDB 2012 paper.

 

 

Why Vertica’s Compression is Better

Data compression is like money or free time: the more the better. Compression can save big bucks on storage costs, increase data center density, or allow more data to be kept, while simultaneously increasing query performance in cases where I/O is the bottleneck. So naturally, database marketeers want to claim great compression ratios. Vertica is no exception.

Generally, I am skeptical of vendors who claim amazing compression breakthroughs. In fact, I’m quite skeptical of any lossless compression product that promises a compression ratio up front, without any knowledge about the data to be compressed. Remember the WEB 16:1 compressor saga? Also, don’t we all know that it’s theoretically impossible to contrive a general, lossless compression utility that makes all input files smaller? Take a look at Steve Tate’s or Mike Goldman’s challenges in the timeless comp.compression FAQ. I’d double their wagers.

Thus I expect some skepticism surrounding Vertica’s claims of compression capabilities. To dispel some of these doubts, I’ll discuss the three main reasons why Vertica’s compression is better than you’ll see in competing databases for a long time to come. I’ll also present a real-life customer example (albeit a best-case scenario) of database compression. I won’t be claiming any specific compression ratio, like “up to 10x compression or more!” But I will claim that we can do as well as anyone else, and usually much better.

Compression Advantages of Vertica’s True Column Store

There are three main reasons why Vertica's compression is superior, and each one contributes its own advantage when compressing data:

Reason #1: Vertica Is A True Column Store.
If you walk down attribute columns, there is more similarity than if you cut across the rows. So why can't a row store simply compress its blocks by slicing each block of rows into columns? It can, but given the same block size, the column store will see more data and therefore get better compression, as shown in Illustration 1, where the same data was fed to gzip in rows and in columns.

Illustration 1: Vertica’s True Column Compression Advantage 
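
You can reproduce the spirit of this comparison yourself. The sketch below gzips the same synthetic table twice, once serialized row by row and once column by column; the data is made up, so your exact numbers will vary, but the column-wise layout consistently compresses better.

```python
import gzip
import random

random.seed(0)
# A synthetic three-column table: low-cardinality region, date string, price.
rows = [(random.choice(["east", "west", "north", "south"]),
         f"2012-09-{random.randint(1, 30):02d}",
         f"{random.uniform(1, 100):.2f}")
        for _ in range(100_000)]

row_major = "\n".join(",".join(r) for r in rows).encode()
col_major = "\n".join("\n".join(col) for col in zip(*rows)).encode()

print("row-major gzip bytes   :", len(gzip.compress(row_major)))
print("column-major gzip bytes:", len(gzip.compress(col_major)))
```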

In the extreme case of run length encoding (RLE), even many billions of records can be encoded in a single disk block, which is far outside the scope of what can be done by slicing up blocks of rows.
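
For intuition, a run-length encoder stores each run of identical values once, together with its length, so a long sorted run collapses to a single pair no matter how many rows it covers. A minimal sketch:

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# A small stand-in for a sorted, low-cardinality column; a run of a billion
# identical values would still encode to just one pair.
print(rle_encode(["NY"] * 5 + ["SF"] * 3 + ["LA"] * 2))
# [('NY', 5), ('SF', 3), ('LA', 2)]
```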

Reason #2: Data Is Always Sorted in Vertica.

Data that is well organized compresses better than data that is laid out haphazardly, as shown in Illustration 2. Take a text file containing a million random integers (say, between 1 and 10 million). It will be a shade under 8MB, as most entries are 7 digits plus a newline. Applying gzip, you'll find that the data compresses, because the numbers are made of digits, which are a subset of all byte representations; I got 3.7MB, for a bit better than 2:1 compression. Sorting the data first makes it much more compressible: gzip then achieves about 3.4:1. By not storing the values as strings at all, Vertica gets better than 10:1 compression in this case.

Illustration 2: Sorting Dramatically Improves Compression with Vertica 
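
The experiment is easy to repeat. The sketch below writes a million random integers as text and gzips them before and after sorting; exact ratios will differ a little from the numbers quoted above, but sorting reliably improves the result.

```python
import gzip
import random

random.seed(0)
numbers = [random.randint(1, 10_000_000) for _ in range(1_000_000)]

unsorted_text = "\n".join(map(str, numbers)).encode()
sorted_text = "\n".join(map(str, sorted(numbers))).encode()

for label, data in (("unsorted", unsorted_text), ("sorted", sorted_text)):
    ratio = len(data) / len(gzip.compress(data))
    print(f"{label:8s}: {len(data) / 1e6:.1f} MB raw, {ratio:.1f}:1 with gzip")
```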

This is a contrived case, but real data also has trends. Stocks trade from pennies to over a hundred thousand dollars a share. But if the data is sorted by stock, date, and time, there are few distinct trading prices and the data is quite compressible. Unlike other databases that keep data in insertion order, Vertica keeps the data sorted all the time, improving compression.

Reason #3: Vertica Doesn’t Do In-Place Updates.
In most databases that were originally designed for transaction processing, updates are applied in place. Because new values that don't compress as well as the old values could come along, either some empty space must be left or updates must be forgone. Vertica instead puts updates in a separate place (the Write Optimized Store), so we can squeeze every last bit out of the data. I've seen competing systems that update in place, and others that don't allow updates when maximum compression is on, but none that allow updates while still squeezing every last bit out of the data representation.

A Real-Life Example

The reasons sound good, but how well does Vertica’s compression work in practice? Take a look at the following example and ask yourself if your favorite row store (which is probably soon to claim that it is also a column store) can match Vertica’s true column compression.

We have a customer that collects metrics from some meters. There are 4 columns in the schema:

  1. Metric. There are a few hundred metrics collected.
  2. Meter. There are a couple thousand meters.
  3. Collection Time Stamp. Each meter spits out metrics every 5 minutes, 10 minutes, hour, etc., depending on the metric.
  4. Metric Value. A 64-bit floating point value.

So how much space will this take to record? A baseline .csv file of 200 million meter/metric/time/value rows takes 6200MB, or about 31 bytes per row; gzip reduces this to 1050MB.

Vertica compresses much better. By sorting the data on metric, meter, and collection time, Vertica not only optimizes common query predicates (which specify the metric or a time range), but exposes great compression opportunities for each column:

  1. Metric: There aren’t many. With RLE, it is as if there are only a few hundred rows. Vertica compressed this column to 5KB.
  2. Meter: There are quite a few, and there is one record for each meter for each metric. With RLE, Vertica brings this down to a mere 35MB.
  3. Collection Time Stamp: The regular collection intervals present a great compression opportunity. Vertica compressed this column to 20MB.
  4. Metric Value: Some metrics have trends (like lots of 0 values when nothing happens). Others change gradually with time. Some are much more random, and less compressible. However, Vertica compressed the data to only 363MB.

Illustration 3: Real Customer Data Compression

 

The total size for all the columns is 418MB, a shade over 2 bytes per row. That is less than half of what gzip achieved, and works out to nearly 15:1 compression over the raw CSV data.
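
As a quick sanity check, here is the arithmetic using only the sizes quoted above:

```python
# Sizes quoted in the example above, in (decimal) megabytes.
csv_mb     = 6200
gzip_mb    = 1050
columns_mb = [0.005, 35, 20, 363]          # metric, meter, timestamp, value
total_mb   = sum(columns_mb)
rows       = 200_000_000

print(f"total column storage : {total_mb:.0f} MB")              # ~418 MB
print(f"bytes per row        : {total_mb * 1e6 / rows:.2f}")     # ~2.09
print(f"vs raw CSV           : {csv_mb / total_mb:.1f}:1")       # ~14.8:1
print(f"vs gzipped CSV       : {gzip_mb / total_mb:.1f}:1")      # ~2.5:1
```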
