Archive for the ‘column store’ Category

HP Vertica Boosts Performance for Infinity Insurance

Dana Gardner, president and principal analyst for Interarbor Solutions, recently conducted a podcast with Barry Ralston, assistant vice president for data management, of Infinity Insurance Companies, an HP Vertica customer.  Infinity Insurance is a Top 50 property and casualty insurance company based in Birmingham, Alabama, and they are using the HP Vertica Analytics Platform for their business intelligence and analytics programs.

In a story written for ZDNet, Dana shares some of the highlights from his podcast.  Here are a few key quotes from Barry from the podcast that were shown in the ZDNet story:

“We specifically chose Vertica to achieve higher productivity and to allow us to focus on optimizing queries and extracting value out of the data.”

“We took the top 12 worst-performing queries or longest-running queries from the Exadata implementation, and not one of the proof of concept queries ran less than 100 times faster. It was an easy decision to make in terms of the analytic workload, versus trying to use the Oracle row-store technology.”

“In short, in leveraging Vertica, the underlying architecture allows me to create a playfield, if you will, for business analysts.”

Infinity Insurance is deriving a true competitive advantage from their ongoing analysis of data.  If you would like, you can listen to the full podcast or subscribe on iTunes. Or, you can read a full transcript or download a copy of the transcript.

No, You Do Not Need One Projection Per Query in Vertica

Projections are the Vertica Analytic Databases’s only data structure. Every row from every table stored in Vertica is stored in a projection. There is no such thing as a query which “does not hit the projections.” If someone says those words, you should immediately suspect their motives and/or competence. We are quite open about projections (see previous posts such as this, this and this) and we think they are part of Vertica’s fundamental technical advantage. For those wishing for a more detailed description of projections, please see our VLDB paper from last year.

The idea that Vertica needs a special projection for every query in order to get good performance is just wrong. This rumor is spread as FUD sales tactic by one of our more unscrupulous competitors who knows it is not true and yet promulgates it anyways. We have typically assumed that people see through this transparent tactic, but after reading the same idea repeated by some otherwise credible articles and posts on the internet recently, I feel the need to set the record straight. The reason our competitor’s FUD inducing claim is semi-believable is because it plays on a classic DBA nightmare of full table scans in Row Store systems.

There is a fundamental technical difference between a native Column Store (e.g. Vertica) and a Row Store for ad hoc analysis when you do not have a specialized data structure for that query. In a Row Store, if you don’t have an appropriate index, the system must fallback to a full table scan to retrieve the data. Scanning an entire table’s worth of data is almost always a disastrous amount of I/O for large tables. However, in a Column Store, even if you don’t have an optimal physical structure for a specific query (for example, the optimal projection in Vertica), you simply end up with column scans for those columns referred to in the query.

Furthermore, due to the fact that we built our column storage and execution engine from the ground up with this kind of scenario in mind, our specialized storage format can often avoid reading all the column data from the disk even when a “full column scan” is needed. Along with the other well discussed benefits such as better compression, the fundamental I/O benefit for ad hoc queries is why a Column Store architecture is so much better suited to many data warehouse scenarios than a Row Store.

When UPDATE is actually INSERT

At the VLDB 2012 conference a few weeks ago, we had a chance to listen to Jiri Schindler giving a tutorial about NoSQL.  His interesting and informative presentation covered the fundamental architecture and I/O usage patterns of RDBMS systems and various NoSQL data management systems, such as HBase, Cassandra, and MongoDB.

During the presentation, Schindler listed basic I/O access patterns for columnar databases using the slide below. It is hard to summarize the operation of the various columnar database systems on a single slide, and Schindler did a great job given the constraints of the presentation. However, while his characterization might hold for other columnar databases, the Vertica Analytic Database  has a different I/O pattern for UPDATEs which we wanted to explain in more detail.

First, Vertica does not require synchronous I/O of a recovery log. Unlike most other RDBMS systems,  Vertica implements durability and fault tolerance via distributed replication.

Second, since Vertica never modifies storage in place, it avoids the other I/O intensive operations referenced in the slide.

When a user issues an UPDATE statement, Vertica performs the equivalent of a delete followed by an insert. The existing row is deleted by inserting a Delete Vector (a small record saying that the row was deleted), and a new copy of the row with the appropriately updated columns is inserted. Both the Delete Vector and the new version of the row are stored in a memory buffer known as the WOS (write optimized store). After sufficient data has accumulated in the WOS from INSERTs, UPDATEs, DELETEs, and COPYs (bulk loads), they are moved in bulk to disk storage known as the ROS.

It is important to note that existing files in the ROS are not modified while data is moved from WOS to the ROS – rather a new set of sorted and encoded column files is created. To avoid a large number of files accumulating over time, the Tuple Mover regularly merges column files together using an algorithm that limits the number of times any tuple is rewritten and also uses large contiguous disk operations, which is quite efficient well on most modern file and disk systems.

This arrangement has several advantages:

  1. From the user’s point of view, the update statement completes quickly and future queries get the expected answer (by filtering out the original values at runtime using the Delete Vectors).
  2. The cost of sorting, encoding, and writing column files to disk is amortized over a large number of rows by utilizing the in memory WOS.
  3. I/O is always in proportion to the number of rows inserted or modified – it is never the case that an update of a small number of rows causes I/O on a significant amount of previously stored data.

More details about how data is stored and Vertica’s overall architecture and design decisions, please consider reading our VLDB 2012 paper.



Setting the Record Straight on Column Stores (Again?!)

Couple months ago I went to SIGMOD 2012.  One of the big award winners there was Bruce Lindsay (IBM Fellow Emeritus), a true patriarch of relational databases.  (System R; enough said!)

I was somehow drawn to him before I figured out his name, and before I learned that he was an award winner.  Maybe it was the hairdo and mannerisms.

Or maybe it was how he asked the presenters of the paper on “MCJoin” something along the lines of  “So, I’ve written a few join algorithms in my day and one of the things that set me back a few months each time was OUTER JOINs”.  Which, in my day, set me back a few months.

Back to the awards.  Each recipient gave a talk.  Bruce gave a very interesting presentation covering RDBMS, how it built up to something useful over the years, and then considered whether we are “losing our way”.  I was a bit surprised that he listed “column stores” as a “detour” on the path of RDBMS progress.  This is his slide (and, as you view it, try imagine someone in the row in front of you cackling about how Mike Stonebraker would react to it…):


The Power of Projections – Part 2

By Colin Mahony and Shilpa Lawande

Part II – Understanding the Simplicity of Projections and the Vertica Database Designer™

In Part I of this post, we introduced you to the simple concept of Vertica’s projections.  Now that you have an understanding of what they are, we wanted to go into more detail on how users interface with them, and introduce you to Vertica’s unique Database Designer tool.

For each table in the database, Vertica requires a minimum of one projection, called a “superprojection”. A superprojection is a projection for a single table that contains all the columns and rows in the table.  Although the data may be the same as a traditional base table, it has the advantages of segmentation (spreading the data evenly across the nodes in the cluster), sorting, and encoding (compressing the size of the data on a per column basis).  This leads to significant footprint reduction as well as load and query performance enhancements.  To give you a sense of the impact that Vertica’s projections have on size, most Vertica customers have at least a 50% reduction in footprint thanks to our compression.  This includes the high availability copy and on average 3-5 projections.  Again, contrast this to traditional row-store databases ballooning upwards of 5x their original size and that is a 10:1 difference in Vertica.

To get your database up and running quickly, Vertica automatically creates a default superprojection for each table created through the CREATE TABLE and CREATE TEMPORARY TABLE statements. This means that if database admins and users never want to know about a projection, they don’t have to – Vertica will automatically handle it under the covers. To further illustrate this point, users can simply pass in projection parameters such as Order By, Encodings, Segmentation, High Availability, and Partitioning right after the CREATE TABLE statement, never interfacing directly with a projection under the hood.

By creating a superprojection for each table in the database, Vertica ensures that all SQL queries can be answered. Default superprojections alone will do far better than a row-store, however, by themselves they may not fully optimize database performance and Vertica’s full potential.  Vertica recommends that you start with the default projections and then use Vertica’s nifty Database Designer™  to optimize your database.  Database Designer creates new projections that optimize your database based on its data statistics and the queries you use. Database Designer:

1. Analyzes your logical schema, sample data, and sample queries (optional).
2. Creates a physical schema design (projections) in the form of a SQL script that can be deployed automatically or manually.
3. Can be used by anyone without specialized database knowledge (even business users can run Database Designer).
4. Can be run and re-run anytime for additional optimization without stopping the database.

Designs created by the Database Designer provide exceptional query performance. The Database Designer uses sophisticated strategies to provide excellent ad-hoc query performance while using disk space efficiently. Of course, a proficient human may do even better than the Database Designer with more intimate knowledge of the data and the use-case – a small minority of our customers prefer to do manual projection design and can usually get a good feel for it after working with the product for a few weeks.

We’ve heard people ask if we need a projection for each query in Vertica, which we absolutely do not! Typically our customers use 3-5 projections and several are using the single superprojection only. A typical customer would have the superprojection along with a few smaller projections (often comprised of only a few columns each).  Unlike MVs and indexes, projections are cheap to maintain during load and due to Vertica’s compression, the resulting data size tends to be 5-25x smaller than the base data. Depending on your data latency needs (seconds to minutes) and storage availability you could choose to add more projections to further optimize the database.  Also important to note is that Vertica does not charge extra for projections, regardless of how many are deployed.  So whether a customer has 1 or 50 projections, their license fees are the same – entirely based on raw data.

As you can see, projections are very easy to work with, and if you are a business analyst who doesn’t know SQL/DDL, that’s okay, we created a tool that designs, deploys and optimizes the database automatically for you.  Our objective from day one has always been to enable customers to ask more questions and get faster answers from their data without having to constantly tune the underlying database.  Part III of this post goes into more detail on projections versus indexes and materialized views.

Read the rest of the 3-part series…

The Power of Projections – Part 1: Understanding Projections and What They Do
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

The Power of Projections – Part 1

By Colin Mahony and Shilpa Lawande

Part I: Understanding Projections and What They Do

Many of us here at Vertica have been amazed and frankly flattered at how much FUD our competitors are putting out there regarding Vertica’s “projections”.  Having heard some incredibly inaccurate statements about them, we’ve decided to clarify what they are, how and why we have them, and the advantages they bring.  Actually, projections are a pivotal component of our platform, and a major area of differentiation from the competition.  Most importantly, Vertica’s customers love the benefits projections bring! In an effort to provide you with as much detail as possible, this blog is broken up into three posts with Parts II and III being more technical.

First, some background. In traditional database architectures, data is primarily stored in tables. Additionally, secondary tuning structures such as indexes and materialized views are created for improved query performance.  Secondary structures like MVs and indexes have drawbacks – for instance they are expensive to maintain during data load (more detail on this in Part III).  Hence best practices often require rebuilding them during nightly batch windows, which prevents the ability to do real-time analytics.  Also, it isn’t uncommon to find data warehouse implementations that balloon to 3-6x base table sizes due to these structures. As a result, customers are often forced to remove valuable detailed data and replace it with aggregated data to solve this problem. However you can’t monetize what you lost!

Vertica created a superior solution by optimizing around performance, storage footprint, flexibility and simplicity. We removed the trade-off between performance and data size by using projections as the lynchpin of our purpose-built architecture.  Physical storage consists of optimized collections of table columns, which we call “projections”. In the traditional sense, Vertica has no raw uncompressed base tables, no materialized views, and no indexes. As a result there are no complex choices – everything is a projection!  Of course, your logical schema (we support any) remains the same as with any other database so that importing data is a cinch.  Furthermore, you still work with standard SQL/DDL (i.e. Create Table statements, etc).  The magic of projections and Vertica are what we do under the covers for you with the physical storage objects.  We provide the same benefits as indexes without all of the baggage.  We also provide an automatic tool, the Database Designer (more on this in Part II) to create projections automatically.

Projections store data in formats that optimize query execution. They share one similarity to materialized views in that they store data sets on disk rather than compute them each time they are used in a query (e.g. physical storage).  However, projections aren’t aggregated but rather store every row in a table, e.g. the full atomic detail. The data sets are automatically refreshed whenever data values are inserted, appended, or changed – again, all of this happens beneath the covers without user intervention – unlike materialized views. Projections provide the following benefits:

  • • Projections are transparent to end-users and SQL. The Vertica query optimizer automatically picks the best projections to use for any query.
  • • Projections allow for the sorting of data in any order (even if different from the source tables). This enhances query performance and compression.
  • • Projections deliver high availability optimized for performance, since the redundant copies of data are always actively used in analytics.  We have the ability to automatically store the redundant copy using a different sort order.  This provides the same benefits as a secondary index in a more efficient manner.
  • • Projections do not require a batch update window.  Data is automatically available upon loads.
  • • Projections are dynamic and can be added/changed on the fly without stopping the database.

In summary, Vertica’s projections represent collections of columns (okay so it is a table!), but are optimized for analytics at the physical storage structure level and are not constrained by the logical schema.  This allows for much more freedom and optimization without having to change the actual schema that certain applications are built upon.

Hopefully this gave you an overview of what projections are and how they work.  Please read Part II and Part III of this post to drill down into projections even further.

Read the rest of the 3-part series…

The Power of Projections – Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

Get Started With Vertica Today

Subscribe to Vertica