Column Store vs. Column Store

It has been 5 years since Vertica was founded, and it is great to see that column stores are becoming prevalent and widely regarded as the preferred architecture for data warehousing and analytics. Mainstream and upstart vendors alike are announcing columnar storage and columnar compression as “features” of their row-oriented DBMSs. While this is excellent news for column store enthusiasts, marketing messages are rife with misinformation that creates confusion for buyers. Could you be mistaking an imitation diamond for the real thing?

Here’s what you should know when evaluating or buying a column store DBMS.

What makes a True Columnar DBMS

A true column store, like Vertica, must have the following 4 features:

Columnar Storage, Compression and Retrieval

Data is stored in columns such that it is possible to retrieve data in a column without fetching other columns. This has the benefits of I/O reduction as well as improved compression. Data is compressed on a column-by-column basis, with the compression technique chosen based on the properties of the data. Block-level columnar compression in row-oriented databases fails to meet this criterion – compression in these systems is typically limited to a single technique and does not eliminate unnecessary columns (and the resulting I/O) on retrieval.
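To make the storage model concrete, here is a minimal sketch (illustrative only, not Vertica’s on-disk format) in which each column is compressed with a technique chosen from its data properties, and a query that touches one column never reads the others:

```python
# Minimal sketch of columnar storage with per-column compression.
# Illustrative only; not Vertica's on-disk format.

def rle_encode(values):
    """Run-length encode: good for sorted / low-cardinality columns."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def dict_encode(values):
    """Dictionary encode: good for repetitive string columns."""
    mapping = {}
    codes = [mapping.setdefault(v, len(mapping)) for v in values]
    return mapping, codes

# A tiny "table" stored column by column.
state = ["CA", "CA", "CA", "MA", "MA", "NY"]          # sorted, few distinct values
city  = ["LA", "SF", "SF", "Boston", "Boston", "NYC"]
sales = [100, 250, 250, 90, 90, 300]

columns = {
    "state": ("rle",  rle_encode(state)),             # RLE chosen for sorted data
    "city":  ("dict", dict_encode(city)),             # dictionary chosen for strings
    "sales": ("raw",  sales),                         # left uncompressed
}

# A query such as SELECT count(*) WHERE state = 'CA' reads only the
# "state" column; "city" and "sales" are never touched (no I/O for them).
encoding, runs = columns["state"]
ca_rows = sum(length for value, length in runs if value == "CA")
print(ca_rows)  # -> 3
```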

Columnar on Disk, not just In-memory

Some so-called columnar DBMS vendors rely on caching the entire dataset in memory in columnar format. These systems hit a performance cliff when data sizes grow beyond what fits in memory, or they require a huge hardware footprint. Memory remains the most expensive component in any system, so this approach is likely to limit your scalability. Check out some recently published 1TB TPC-H benchmarks by columnar vendors and notice how much hardware and memory was needed for this tiny amount of data!

Columnar Optimizer & Execution Engine

To really take advantage of a column store architecture, the query optimizer must be deeply aware of columnar storage and optimization techniques. Late materialization is just one example of an optimization that can significantly speed up joins in a column store: the join result is computed by fetching only the join-key columns from disk, and the remaining columns are fetched at the very end of query execution.
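As a toy illustration of late materialization (not Vertica’s engine), assume two tables stored column-wise as Python lists: the join is evaluated on the key columns alone, and the wide columns are fetched only for the rows that survive.

```python
# Toy illustration of a late-materialized join.
# fact and dim are stored column-wise; only join keys are read up front.

fact = {
    "cust_id": [1, 2, 2, 3, 5],
    "amount":  [10.0, 20.0, 5.0, 7.5, 99.0],   # wide columns, fetched late
    "comment": ["a", "b", "c", "d", "e"],
}
dim = {
    "cust_id": [1, 2, 3, 4],
    "name":    ["Ann", "Bob", "Cat", "Dan"],
}

# Phase 1: join using ONLY the key columns; remember matching row positions.
dim_pos = {k: i for i, k in enumerate(dim["cust_id"])}
matches = [(fi, dim_pos[k]) for fi, k in enumerate(fact["cust_id"]) if k in dim_pos]

# Phase 2 (late materialization): fetch the remaining columns only for
# the surviving row positions, at the very end of the query.
result = [(dim["name"][di], fact["amount"][fi]) for fi, di in matches]
print(result)   # [('Ann', 10.0), ('Bob', 20.0), ('Bob', 5.0), ('Cat', 7.5)]
```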

Going hand in hand with the optimizer, the execution engine of a true columnar database looks radically different from the processing model employed in a typical modern row-oriented DBMS. A true columnar engine can evaluate predicates, joins, aggregates, sorts and analytics directly on compressed data, saving not only I/O but also CPU cycles. The bottleneck then shifts to memory bandwidth, and techniques such as vectorized execution over columns allow more efficient use of the L2 cache.
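As a small illustration of working directly on compressed data, the sketch below answers a GROUP BY count over an RLE-encoded column without expanding the runs, so the CPU work is proportional to the number of runs rather than the number of rows (conceptual only, not Vertica’s executor):

```python
# Conceptual sketch: SELECT region, count(*) GROUP BY region evaluated
# directly on an RLE-encoded column. Work is proportional to the number
# of runs (3), not the number of rows (1,000,000); nothing is decompressed.

region_rle = [("east", 400_000), ("west", 350_000), ("north", 250_000)]

counts = {}
for value, run_length in region_rle:          # 3 iterations, not 1,000,000
    counts[value] = counts.get(value, 0) + run_length

print(counts)   # {'east': 400000, 'west': 350000, 'north': 250000}
```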

No amount of retrofitting can turn a row-oriented optimizer and engine into column-oriented ones.

For more on this subject, see Dan Abadi’s excellent research:

http://cs-www.cs.yale.edu/homes/dna/papers/abadiicde2007.pdf,

http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf,

http://cs-www.cs.yale.edu/homes/dna/talks/abadi-sigmod-award.pdf

Optimized Loads and Transactions

While analytic DBMS workloads are query-heavy compared to transaction throughput, this does not mean they are “read-only”. Many vendors implement columnar storage as a feature that assumes “archival” or “read-only” access, or that sacrifices compression when updates are supported. A true columnar RDBMS should provide fast loads and handle SQL deletes and updates without sacrificing query performance or compression benefits.

Lacking any one of the above elements significantly reduces the benefits of a column store. Vertica is the only analytic database on the market today with all of the above features. That being said, a columnar architecture is just one of the many design choices that make Vertica the DBMS of choice for large-scale real-time analytics – I’ll talk more about these in a future blog post.

And don’t take our word for it.  Try it out for yourself.

Five Tools for High Velocity Analytics

In this post we take a look at the tools required to achieve “high velocity” analytics. What technologies matter for high velocity analytics, and what are their defining characteristics?

  • Low Latency Data Source – Starting at the front end, the gating requirement for embedding analytics in your business is a real time data source.
  • Pipelined [ET]L – Whether data is extracted or transformed, it has to be loaded in a pipeline as data arrives, not in batches at the end of the day.
  • Complex Event Processing – The only way to respond to real time events is on the wire.  CEP systems look at every record before it is stored and can respond to events as they happen.
  • Real Time Analytic Database – Once data makes it to a storage system you can analyze it in context with your historical data. Concurrent load and query is the critical backbone of high velocity analytics.
  • Flexible Business Intelligence – Front end tools are designed for business users and they have to be as flexible as an analyst is creative.

We assume you’re reading this post because you have a lot of incoming data or you are expecting it. You will need tools in place to get this data into your system as quickly as possible. If your inbound data is log files, you can use tools like Scribe to capture these logs in real time from your web servers. Scribe was developed by the team at Facebook to centralize all of their log data and is now free and open source. There’s a great introductory article on the High Scalability blog.
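If you want to see the shape of the idea without adopting a specific tool, a minimal tail-and-forward loop looks like the sketch below. The log path and the forward() destination are hypothetical placeholders, not part of Scribe or any other product; the point is to ship each line as it arrives rather than in a nightly batch.

```python
import time

LOG_PATH = "/var/log/app/access.log"      # hypothetical log file

def forward(line: str) -> None:
    # Placeholder: in practice this would write to Scribe, a message
    # queue, or your analytic database's load interface.
    print("shipping:", line.rstrip())

def tail_and_forward(path: str) -> None:
    with open(path, "r") as f:
        f.seek(0, 2)                      # start at the end of the file
        while True:
            line = f.readline()
            if line:
                forward(line)             # ship immediately, no batching
            else:
                time.sleep(0.5)           # wait for new data

if __name__ == "__main__":
    tail_and_forward(LOG_PATH)
```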

If your key business data comes from an online transaction processing (OLTP) system, you first need to make sure you have a fast OLTP system handling inbound transactions. This can be anything from a general purpose database, perhaps sharded for scalability, to an optimized OLTP-specific DBMS. As with log files, the key attribute to identify is transaction latency under large volumes of data; just having a lot of data is not sufficient. If you can handle a million users but it takes an hour to process their orders, then that hour is going to be your bottleneck. The same is true when you need to modify your pricing or ad campaigns or to identify up-sell opportunities. That hour of lag is going to be the bottleneck to building analytics into your business and increasing your profitability. The time between a user executing a transaction, such as purchasing an item or changing their subscription, and when you can act on that transaction needs to be minutes if not seconds.

To achieve this low latency interface you can use a change data capture tool, an OLTP solution with a direct extract/load (EL) path into an analytic database, or structure your application to log the result of the transaction directly to your analytic database. You can also use tools such as Scribe to log the transaction just as you would any other application log. You may also have a custom data feed, such as a financial tick feed from Thomson Reuters, Bloomberg B-PIPE or QuantHouse QuantFEED, or a feed from your operational network provider.

In between the transaction data source and your analytic database you may need to respond to events as they occur “on the wire.” This is where you employ a complex event processing (CEP) engine to handle on-the-wire detection, automate common responses and flag important events. CEP systems typically operate by running data through a pre-defined query that accumulates and modifies state, triggering behaviors when a certain threshold is met. For example, a CEP system can be used to keep a count of errors for various data sources and raise an alert if any of them exceeds your maximum SLA threshold.
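Here is a stripped-down version of that example, assuming a simple in-process event stream; the window size and SLA threshold are hypothetical values, not from any particular CEP product.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_ERRORS = 5                       # hypothetical SLA threshold

error_times = defaultdict(deque)     # source -> timestamps of recent errors

def on_event(source: str, level: str, now: float = None) -> None:
    """Process one event on the wire, before it is ever stored."""
    now = time.time() if now is None else now
    if level != "ERROR":
        return
    window = error_times[source]
    window.append(now)
    # Drop errors that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_ERRORS:
        print(f"ALERT: {source} exceeded {MAX_ERRORS} errors/{WINDOW_SECONDS}s")

# Example: seven errors from one source within the window trigger an alert.
for i in range(7):
    on_event("web-01", "ERROR", now=1000.0 + i)
```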

The hub for your business processes is the analytic database. This database is different from the general purpose database you use for accounting, and it may even be different from the enterprise data warehouse where you log and report across your business activities. Your analytic database must be able to accept incoming data 24×7, let you access data quickly and with low latency – within minutes if not seconds – and scale out as your data volumes grow.

The analytic database collects real time data as it is streamed in and stores it for some defined period of time. Since the historical retention period is defined by business requirements, the analytic database must scale out to handle as much historical data as necessary. Similarly, as requirements for faster analysis on more data grow, the database must scale to handle more users and faster queries.

Since data is flowing in non-stop, the analytic database must have robust support for trickle loads, concurrent load and query, and non-stop high availability. If you have to pause queries in order to load, or if any part of the system needs to be restarted in order to load data after a failure, you risk having downtime in a critical component during your peak load. These scenarios are easy to test by simulating high rates of loads and queries while pulling the plug (figuratively or literally) on random components.
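A rough outline of such a test harness is sketched below. The load_batch, run_query and kill_random_component functions are placeholders you would replace with calls against your own cluster (nothing here is a Vertica API); the pass condition is simply that both workloads keep making progress while failures are injected.

```python
import random
import threading
import time

stop = threading.Event()
progress = {"loads": 0, "queries": 0}

def load_batch():      time.sleep(0.01)   # placeholder: trickle-load a batch
def run_query():       time.sleep(0.01)   # placeholder: run a reporting query
def kill_random_component():              # placeholder: restart a node, yank a disk
    print("simulated failure injected")

def loader():
    while not stop.is_set():
        load_batch()
        progress["loads"] += 1

def querier():
    while not stop.is_set():
        run_query()
        progress["queries"] += 1

def chaos():
    while not stop.is_set():
        time.sleep(random.uniform(0.2, 0.5))
        kill_random_component()

threads = [threading.Thread(target=t) for t in (loader, querier, chaos)]
for t in threads: t.start()
time.sleep(2)                             # run the scenario for a while
stop.set()
for t in threads: t.join()

# The test passes only if both workloads kept making progress throughout.
assert progress["loads"] > 0 and progress["queries"] > 0
print(progress)
```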

Finally, as business users get access to more data in real time, the type of analysis changes. With the flexibility to iteratively explore real time data, user demand for additional information and different views grows. The front end tools must handle dynamic visualizations to accommodate these requirements. Both classic BI tools such as MicroStrategy, Cognos and Business Objects, as well as newer cutting edge tools from Tableau, JasperSoft and Pentaho, are modernizing the front end for real time BI. The key features to look for are flexibility of schema and simplicity of abstraction. When adding new data to the system, it should be easy (if not automatic) to incorporate it into the tool, and the complexity of mapping between what the business user clicks on and what the database schema defines should be minimal. Keep in mind the development time between adding a data source to the analytic DBMS and giving your business users access to that data.

The key takeaway of this tutorial is that the latency between your transaction processing systems, data capture tools, complex event processing, real time analytic DBMS and your business analytics tool will define the speed at which your business can react to changes, and ultimately your flexibility to adapt. The lower the latency in your toolset, the higher the velocity with which you can operate and the more effectively you will be able to compete. There is no doubt today that the winning players in every market are the most adaptable and flexible companies.

Vertica 3.5 FlexStore: The next generation of Column-stores

Last month Vertica delivered release 3.5 of our analytic database software. Over the past few years, Vertica has continued to innovate and mature its core database functionality with major releases every 6 months. I would like to thank all our customers and partners whose feedback has been instrumental in developing our product. The centerpiece of the Vertica 3.5 release is FlexStore – our next generation columnar database architecture. With FlexStore, Vertica now combines the benefits of a columnar database with several of the benefits traditionally considered the forte of row-oriented databases.

There are three main ideas introduced by FlexStore:

  1. grouping of multiple (possibly all) columns into a single file
  2. automatically choosing the disk storage format based on data load patterns
  3. ability to differentiate storage media by their performance characteristics and to enable intelligent placement of data based on usage patterns

Let us look at a couple of practical examples where Vertica can derive huge performance gains from patent-pending FlexStore features.

One of the key innovations in Vertica has been eliminating the need to batch-load data warehouses – most Vertica customers trickle load data throughout the day while providing real-time data latency to their end users. There are no indexes to build or materialized views to refresh. Data is trickle loaded into the Write-Optimized Store (WOS) – a fully queryable in-memory store – and over time moved to the Read-Optimized Store (ROS) on disk. The migration and access of data across the WOS and ROS is completely transparent to the user and managed automatically by the Tuple Mover. Data from both the WOS and ROS is also automatically combined whenever necessary in queries.

FlexStore improves the efficiency of this trickle load process by enabling Vertica to choose whether incoming data is stored in a row-oriented format in the WOS, or in a row- or column-oriented format in the ROS, depending on the size of the data being loaded. This determination is made entirely automatically, without any user intervention. The row-oriented format on disk groups several columns into a single file and applies columnar compression within each file, reducing the number of file accesses needed during queries. Over time, data from multiple small loads is combined if necessary and reorganized into a more highly optimized column-oriented format.
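The following is a much-simplified conceptual model of that flow, not Vertica’s actual data structures: small loads land in an in-memory WOS, large loads go straight to a column-oriented ROS, a background mover drains the WOS, and queries transparently combine both. The load-size threshold is an invented number for illustration.

```python
WOS_LOAD_THRESHOLD = 1000    # hypothetical cutoff for a "small" load

wos = []        # write-optimized store: row-oriented, in memory, queryable
ros = {}        # read-optimized store: column-oriented containers "on disk"

def load(rows):
    """Trickle or bulk load; the destination is chosen automatically."""
    if len(rows) < WOS_LOAD_THRESHOLD:
        wos.extend(rows)                      # small load -> WOS
    else:
        _write_ros(rows)                      # large load -> ROS directly

def tuple_mover():
    """Background task: migrate WOS contents into the columnar ROS."""
    if wos:
        _write_ros(wos)
        wos.clear()

def _write_ros(rows):
    for row in rows:
        for col, val in row.items():
            ros.setdefault(col, []).append(val)

def query_sum(column):
    """Queries transparently combine WOS and ROS data."""
    return sum(r[column] for r in wos) + sum(ros.get(column, []))

load([{"sales": 10}, {"sales": 20}])          # trickle load -> WOS
load([{"sales": 1} for _ in range(2000)])     # bulk load -> ROS
print(query_sum("sales"))                     # 2030, combining both stores
tuple_mover()                                 # WOS drained into ROS over time
print(query_sum("sales"))                     # still 2030
```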

FlexStore also gives users control over the placement of columns on different media. This intelligent data placement also provides an opportunity to incorporate solid state drives into database storage in a cost-effective manner. Even if your architecture consists of a homogeneous set of disks, it is well known that storing data on inner versus outer tracks results in different performance. By presenting the inner and outer tracks as two different storage locations to Vertica, FlexStore allows intelligent placement of columns, so that frequently accessed columns can be placed on the outer tracks and infrequently used columns on the inner tracks.

Grouping two or more columns into a single file is also available as a user directive when defining projections. Columns that are frequently accessed together can be grouped to reduce the number of file accesses needed to retrieve the data. Columns that are related in a domain-specific way, such as bid and ask values in tick data, are candidates for the grouping directive. Grouping columns also enables interesting compression and encoding techniques across multiple columns.
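As a toy illustration of why grouping helps (one “file” per storage object, counted by a simple global counter; not how Vertica measures I/O), reading bid and ask together costs one file access when they are grouped versus two when each column lives in its own file:

```python
file_accesses = 0

def read_file(store, name):
    global file_accesses
    file_accesses += 1
    return store[name]

# Layout A: each column in its own file.
ungrouped = {"bid": [10.0, 10.1], "ask": [10.2, 10.3], "volume": [500, 700]}

# Layout B: bid and ask grouped into a single file (stored as pairs).
grouped = {"bid_ask": [(10.0, 10.2), (10.1, 10.3)], "volume": [500, 700]}

# Query: compute spreads, which needs bid and ask but not volume.
file_accesses = 0
spreads = [a - b for b, a in zip(read_file(ungrouped, "bid"),
                                 read_file(ungrouped, "ask"))]
print("ungrouped file accesses:", file_accesses)   # 2

file_accesses = 0
spreads = [a - b for b, a in read_file(grouped, "bid_ask")]
print("grouped file accesses:", file_accesses)     # 1
```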

The grouped column feature can be combined with the data placement feature to fine-tune performance or storage efficiency. For instance, columns that are infrequently accessed can be combined into a single group and stored on slower disks or on the inner tracks of a disk. In this manner, FlexStore lays the foundation of Vertica’s Information Lifecycle Management strategy, where data is differentiated based on its value over the course of its lifecycle. You can expect more advances from Vertica in this area over the next few releases.

Vertica 3.5 also introduces our next generation query optimizer and execution engine, with improved performance and many new optimizations. For example, because Vertica stores data sorted, merge joins are extremely efficient, and in Vertica 3.5 merge joins can now be used in many more situations. Deletes and updates, traditionally considered a weakness of column store architectures, are not only faster but also have little impact on query performance. Several of our early-adopter customers reported 30% faster out-of-the-box query performance and over 100x improvement in delete performance after upgrading to Vertica 3.5!
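For readers unfamiliar with the technique, here is a textbook merge join over two already-sorted key columns; it makes a single pass over each input with no sort or hash build, which is why sorted storage makes it so cheap. (This is a generic sketch, not Vertica’s implementation, and it assumes the right-hand keys are unique.)

```python
left_keys  = [1, 1, 2, 4, 6, 6, 7]     # sorted fact-side join keys
right_keys = [1, 2, 3, 6, 7]           # sorted, unique dimension-side keys

def merge_join(left, right):
    matches = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matches.append(left[i])
            i += 1                      # advance left; right key may match again
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return matches

print(merge_join(left_keys, right_keys))   # [1, 1, 2, 6, 6, 7]
```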

Along with performance and scalability, Vertica continues to invest in enhancing and simplifying the operational capabilities of the database and improving the overall user experience. Vertica 3.5 introduces a rich set of monitoring tables fully queryable using SQL.

Watch this space to learn about more upcoming technical innovations from Vertica as we continue to build the fastest and simplest database for analytics!
