It has been five years since Vertica was founded, and it is great to see that column stores have become prevalent and are widely regarded as the preferred architecture for data warehousing and analytics. Mainstream and upstart vendors alike are announcing columnar storage and columnar compression as “features” of their row-oriented DBMSs. While this is excellent news for column store enthusiasts, marketing messages are rife with misinformation that creates confusion for buyers. Could you be mistaking an imitation diamond for the real thing?
Here’s what you should know when evaluating or buying a columnar DBMS.
What makes a True Columnar DBMS
A true column store, like Vertica, must have the following four features:
Columnar Storage, Compression and Retrieval
Data is stored in columns so that any column can be retrieved without fetching the others. This reduces I/O and improves compression. Data is compressed on a column-by-column basis, with the compression technique chosen based on the properties of each column. Block-level columnar compression in row-oriented databases fails this test: compression in those systems is typically limited to a single technique and does not eliminate unneeded columns (and the resulting I/O) on retrieval.
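To make the idea concrete, here is a minimal sketch (not Vertica's actual implementation) of column-wise storage where each column is compressed independently, with the technique picked from simple properties of the data. The encodings and heuristics shown are illustrative assumptions.

```python
def rle_encode(values):
    """Run-length encoding: effective for sorted or low-cardinality columns."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def delta_encode(values):
    """Delta encoding: effective for nearly monotonic integer columns."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def compress_column(values):
    # Illustrative heuristic: choose the technique per column,
    # based on the column's own properties.
    if len(set(values)) <= len(values) // 2:
        return ("rle", rle_encode(values))
    if all(isinstance(v, int) for v in values):
        return ("delta", delta_encode(values))
    return ("raw", list(values))

# Columnar layout: each column is stored and compressed on its own,
# so a query touching only `state` never reads `order_id` or `amount`.
table = {
    "order_id": [1001, 1002, 1003, 1004],
    "state":    ["CA", "CA", "CA", "NY"],
    "amount":   [9.99, 5.00, 12.50, 3.25],
}
compressed = {name: compress_column(col) for name, col in table.items()}
```

Note how the three columns end up with three different encodings, something a single block-level compressor applied to whole rows cannot do.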
Columnar on Disk, not just In-memory
Some so-called columnar DBMS vendors rely on caching the entire data set in memory in columnar format. These systems hit a performance cliff when data grows beyond what fits in memory, or they require a huge hardware footprint to avoid it. Memory remains the most expensive component in any system, so this approach is likely to limit your scalability. Check out some recently published 1 TB TPC-H benchmarks by columnar vendors and notice how much hardware and memory was needed for this tiny amount of data!
Columnar Optimizer & Execution Engine
To really take advantage of a column store architecture, the query optimizer must be deeply aware of columnar storage and optimization techniques. Late materialization is just one example of an optimization that can significantly speed up joins in a column store: the join result is computed by fetching only the join-key columns off disk, with the remaining columns fetched at the very end of query execution, and only for the rows that survive the join.
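The late materialization idea above can be sketched as follows. This is a simplified illustration, not Vertica's engine: tables are dicts of column lists, and "fetching" a column is just indexing into it.

```python
def late_materialized_join(fact_cols, dim_cols, fact_key, dim_key):
    """Join two columnar tables, deferring non-key column fetches."""
    # Step 1: read only the two key columns and find matching row positions.
    dim_index = {v: i for i, v in enumerate(dim_cols[dim_key])}
    matches = [(fi, dim_index[v])
               for fi, v in enumerate(fact_cols[fact_key])
               if v in dim_index]
    # Step 2: materialize the remaining columns only now, and only
    # for the positions that survived the join.
    result = []
    for fi, di in matches:
        row = {name: col[fi] for name, col in fact_cols.items()}
        row.update({name: col[di] for name, col in dim_cols.items()
                    if name != dim_key})
        result.append(row)
    return result

fact = {"cust_id": [1, 2, 3], "amount": [10, 20, 30]}
dim = {"id": [1, 3], "name": ["alice", "bob"]}
joined = late_materialized_join(fact, dim, "cust_id", "id")
```

In a row store, every column of every row would be read before the join could even begin; here the wide columns are touched only for the two matching rows.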
Going hand in hand with the optimizer, the execution engine of a true columnar database looks radically different from the processing model employed by a typical modern row-oriented DBMS. A true columnar engine can evaluate predicates, joins, aggregates, sorts, and analytics directly on compressed data, saving not only I/O but also CPU cycles. The bottleneck then shifts to memory bandwidth, and techniques such as vectorized execution over column blocks allow more efficient use of the L2 cache.
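A small sketch of what "operating on compressed data" means in practice, assuming a run-length-encoded column (the encoding is an illustrative assumption): a predicate or aggregate inspects each run once rather than each row once.

```python
def rle_count_where(runs, predicate):
    """Count qualifying rows on an RLE column without decompressing it."""
    return sum(length for value, length in runs if predicate(value))

def rle_sum(runs):
    """SUM() over an RLE-encoded numeric column: one multiply per run."""
    return sum(value * length for value, length in runs)

# One million logical rows, but only three runs to inspect.
runs = [(5, 400_000), (7, 350_000), (9, 250_000)]
matching = rle_count_where(runs, lambda v: v >= 7)
total = rle_sum(runs)
```

The predicate is evaluated three times instead of a million, which is where the CPU savings on top of the I/O savings come from.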
No amount of retrofitting can turn a row-oriented optimizer and engine into column-oriented ones.
For more on this subject, see Dan Abadi’s excellent research.
Optimized Loads and Transactions
While analytic DBMS workloads are query-heavy relative to transaction throughput, that does not make them “read-only”. Many vendors implement columnar storage as a feature that assumes “archival” or “read-only” access, or that sacrifices compression when updates are supported. A true columnar RDBMS should provide fast loads and handle SQL deletes and updates without sacrificing query performance or compression benefits.
Lacking any one of these elements significantly reduces the benefits of a column store. Vertica is the only analytic database on the market today with all of the above. That said, a columnar architecture is just one of the many design choices that make Vertica the DBMS of choice for large-scale, real-time analytics; I’ll talk more about these in a future blog post.
And don’t take our word for it. Try it out for yourself.