Vertica

Archive for the ‘column store’ Category

No, You Do Not Need One Projection Per Query in Vertica

Projections are the Vertica Analytic Databases’s only data structure. Every row from every table stored in Vertica is stored in a projection. There is no such thing as a query which “does not hit the projections.” If someone says those words, you should immediately suspect their motives and/or competence. We are quite open about projections (see previous posts such as this, this and this) and we think they are part of Vertica’s fundamental technical advantage. For those wishing for a more detailed description of projections, please see our VLDB paper from last year.

The idea that Vertica needs a special projection for every query in order to get good performance is just wrong. This rumor is spread as FUD sales tactic by one of our more unscrupulous competitors who knows it is not true and yet promulgates it anyways. We have typically assumed that people see through this transparent tactic, but after reading the same idea repeated by some otherwise credible articles and posts on the internet recently, I feel the need to set the record straight. The reason our competitor’s FUD inducing claim is semi-believable is because it plays on a classic DBA nightmare of full table scans in Row Store systems.

There is a fundamental technical difference between a native Column Store (e.g. Vertica) and a Row Store for ad hoc analysis when you do not have a specialized data structure for that query. In a Row Store, if you don’t have an appropriate index, the system must fallback to a full table scan to retrieve the data. Scanning an entire table’s worth of data is almost always a disastrous amount of I/O for large tables. However, in a Column Store, even if you don’t have an optimal physical structure for a specific query (for example, the optimal projection in Vertica), you simply end up with column scans for those columns referred to in the query.

Furthermore, due to the fact that we built our column storage and execution engine from the ground up with this kind of scenario in mind, our specialized storage format can often avoid reading all the column data from the disk even when a “full column scan” is needed. Along with the other well discussed benefits such as better compression, the fundamental I/O benefit for ad hoc queries is why a Column Store architecture is so much better suited to many data warehouse scenarios than a Row Store.

When UPDATE is actually INSERT

At the VLDB 2012 conference a few weeks ago, we had a chance to listen to Jiri Schindler giving a tutorial about NoSQL.  His interesting and informative presentation covered the fundamental architecture and I/O usage patterns of RDBMS systems and various NoSQL data management systems, such as HBase, Cassandra, and MongoDB.

During the presentation, Schindler listed basic I/O access patterns for columnar databases using the slide below. It is hard to summarize the operation of the various columnar database systems on a single slide, and Schindler did a great job given the constraints of the presentation. However, while his characterization might hold for other columnar databases, the Vertica Analytic Database  has a different I/O pattern for UPDATEs which we wanted to explain in more detail.

First, Vertica does not require synchronous I/O of a recovery log. Unlike most other RDBMS systems,  Vertica implements durability and fault tolerance via distributed replication.

Second, since Vertica never modifies storage in place, it avoids the other I/O intensive operations referenced in the slide.

When a user issues an UPDATE statement, Vertica performs the equivalent of a delete followed by an insert. The existing row is deleted by inserting a Delete Vector (a small record saying that the row was deleted), and a new copy of the row with the appropriately updated columns is inserted. Both the Delete Vector and the new version of the row are stored in a memory buffer known as the WOS (write optimized store). After sufficient data has accumulated in the WOS from INSERTs, UPDATEs, DELETEs, and COPYs (bulk loads), they are moved in bulk to disk storage known as the ROS.

It is important to note that existing files in the ROS are not modified while data is moved from WOS to the ROS – rather a new set of sorted and encoded column files is created. To avoid a large number of files accumulating over time, the Tuple Mover regularly merges column files together using an algorithm that limits the number of times any tuple is rewritten and also uses large contiguous disk operations, which is quite efficient well on most modern file and disk systems.

This arrangement has several advantages:

  1. From the user’s point of view, the update statement completes quickly and future queries get the expected answer (by filtering out the original values at runtime using the Delete Vectors).
  2. The cost of sorting, encoding, and writing column files to disk is amortized over a large number of rows by utilizing the in memory WOS.
  3. I/O is always in proportion to the number of rows inserted or modified – it is never the case that an update of a small number of rows causes I/O on a significant amount of previously stored data.

More details about how data is stored and Vertica’s overall architecture and design decisions, please consider reading our VLDB 2012 paper.

 

 

Setting the Record Straight on Column Stores (Again?!)

Couple months ago I went to SIGMOD 2012.  One of the big award winners there was Bruce Lindsay (IBM Fellow Emeritus), a true patriarch of relational databases.  (System R; enough said!)

I was somehow drawn to him before I figured out his name, and before I learned that he was an award winner.  Maybe it was the hairdo and mannerisms.

Or maybe it was how he asked the presenters of the paper on “MCJoin” something along the lines of  “So, I’ve written a few join algorithms in my day and one of the things that set me back a few months each time was OUTER JOINs”.  Which, in my day, set me back a few months.

Back to the awards.  Each recipient gave a talk.  Bruce gave a very interesting presentation covering RDBMS, how it built up to something useful over the years, and then considered whether we are “losing our way”.  I was a bit surprised that he listed “column stores” as a “detour” on the path of RDBMS progress.  This is his slide (and, as you view it, try imagine someone in the row in front of you cackling about how Mike Stonebraker would react to it…):

(more…)

The Power of Projections – Part 2

By Colin Mahony and Shilpa Lawande

Part II – Understanding the Simplicity of Projections and the Vertica Database Designer™

In Part I of this post, we introduced you to the simple concept of Vertica’s projections.  Now that you have an understanding of what they are, we wanted to go into more detail on how users interface with them, and introduce you to Vertica’s unique Database Designer tool.

For each table in the database, Vertica requires a minimum of one projection, called a “superprojection”. A superprojection is a projection for a single table that contains all the columns and rows in the table.  Although the data may be the same as a traditional base table, it has the advantages of segmentation (spreading the data evenly across the nodes in the cluster), sorting, and encoding (compressing the size of the data on a per column basis).  This leads to significant footprint reduction as well as load and query performance enhancements.  To give you a sense of the impact that Vertica’s projections have on size, most Vertica customers have at least a 50% reduction in footprint thanks to our compression.  This includes the high availability copy and on average 3-5 projections.  Again, contrast this to traditional row-store databases ballooning upwards of 5x their original size and that is a 10:1 difference in Vertica.

To get your database up and running quickly, Vertica automatically creates a default superprojection for each table created through the CREATE TABLE and CREATE TEMPORARY TABLE statements. This means that if database admins and users never want to know about a projection, they don’t have to – Vertica will automatically handle it under the covers. To further illustrate this point, users can simply pass in projection parameters such as Order By, Encodings, Segmentation, High Availability, and Partitioning right after the CREATE TABLE statement, never interfacing directly with a projection under the hood.

By creating a superprojection for each table in the database, Vertica ensures that all SQL queries can be answered. Default superprojections alone will do far better than a row-store, however, by themselves they may not fully optimize database performance and Vertica’s full potential.  Vertica recommends that you start with the default projections and then use Vertica’s nifty Database Designer™  to optimize your database.  Database Designer creates new projections that optimize your database based on its data statistics and the queries you use. Database Designer:

1. Analyzes your logical schema, sample data, and sample queries (optional).
2. Creates a physical schema design (projections) in the form of a SQL script that can be deployed automatically or manually.
3. Can be used by anyone without specialized database knowledge (even business users can run Database Designer).
4. Can be run and re-run anytime for additional optimization without stopping the database.

Designs created by the Database Designer provide exceptional query performance. The Database Designer uses sophisticated strategies to provide excellent ad-hoc query performance while using disk space efficiently. Of course, a proficient human may do even better than the Database Designer with more intimate knowledge of the data and the use-case – a small minority of our customers prefer to do manual projection design and can usually get a good feel for it after working with the product for a few weeks.

We’ve heard people ask if we need a projection for each query in Vertica, which we absolutely do not! Typically our customers use 3-5 projections and several are using the single superprojection only. A typical customer would have the superprojection along with a few smaller projections (often comprised of only a few columns each).  Unlike MVs and indexes, projections are cheap to maintain during load and due to Vertica’s compression, the resulting data size tends to be 5-25x smaller than the base data. Depending on your data latency needs (seconds to minutes) and storage availability you could choose to add more projections to further optimize the database.  Also important to note is that Vertica does not charge extra for projections, regardless of how many are deployed.  So whether a customer has 1 or 50 projections, their license fees are the same – entirely based on raw data.

As you can see, projections are very easy to work with, and if you are a business analyst who doesn’t know SQL/DDL, that’s okay, we created a tool that designs, deploys and optimizes the database automatically for you.  Our objective from day one has always been to enable customers to ask more questions and get faster answers from their data without having to constantly tune the underlying database.  Part III of this post goes into more detail on projections versus indexes and materialized views.

Read the rest of the 3-part series…

The Power of Projections – Part 1: Understanding Projections and What They Do
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

The Power of Projections – Part 1

By Colin Mahony and Shilpa Lawande

Part I: Understanding Projections and What They Do

Many of us here at Vertica have been amazed and frankly flattered at how much FUD our competitors are putting out there regarding Vertica’s “projections”.  Having heard some incredibly inaccurate statements about them, we’ve decided to clarify what they are, how and why we have them, and the advantages they bring.  Actually, projections are a pivotal component of our platform, and a major area of differentiation from the competition.  Most importantly, Vertica’s customers love the benefits projections bring! In an effort to provide you with as much detail as possible, this blog is broken up into three posts with Parts II and III being more technical.

First, some background. In traditional database architectures, data is primarily stored in tables. Additionally, secondary tuning structures such as indexes and materialized views are created for improved query performance.  Secondary structures like MVs and indexes have drawbacks – for instance they are expensive to maintain during data load (more detail on this in Part III).  Hence best practices often require rebuilding them during nightly batch windows, which prevents the ability to do real-time analytics.  Also, it isn’t uncommon to find data warehouse implementations that balloon to 3-6x base table sizes due to these structures. As a result, customers are often forced to remove valuable detailed data and replace it with aggregated data to solve this problem. However you can’t monetize what you lost!

Vertica created a superior solution by optimizing around performance, storage footprint, flexibility and simplicity. We removed the trade-off between performance and data size by using projections as the lynchpin of our purpose-built architecture.  Physical storage consists of optimized collections of table columns, which we call “projections”. In the traditional sense, Vertica has no raw uncompressed base tables, no materialized views, and no indexes. As a result there are no complex choices – everything is a projection!  Of course, your logical schema (we support any) remains the same as with any other database so that importing data is a cinch.  Furthermore, you still work with standard SQL/DDL (i.e. Create Table statements, etc).  The magic of projections and Vertica are what we do under the covers for you with the physical storage objects.  We provide the same benefits as indexes without all of the baggage.  We also provide an automatic tool, the Database Designer (more on this in Part II) to create projections automatically.

Projections store data in formats that optimize query execution. They share one similarity to materialized views in that they store data sets on disk rather than compute them each time they are used in a query (e.g. physical storage).  However, projections aren’t aggregated but rather store every row in a table, e.g. the full atomic detail. The data sets are automatically refreshed whenever data values are inserted, appended, or changed – again, all of this happens beneath the covers without user intervention – unlike materialized views. Projections provide the following benefits:

  • • Projections are transparent to end-users and SQL. The Vertica query optimizer automatically picks the best projections to use for any query.
  • • Projections allow for the sorting of data in any order (even if different from the source tables). This enhances query performance and compression.
  • • Projections deliver high availability optimized for performance, since the redundant copies of data are always actively used in analytics.  We have the ability to automatically store the redundant copy using a different sort order.  This provides the same benefits as a secondary index in a more efficient manner.
  • • Projections do not require a batch update window.  Data is automatically available upon loads.
  • • Projections are dynamic and can be added/changed on the fly without stopping the database.

In summary, Vertica’s projections represent collections of columns (okay so it is a table!), but are optimized for analytics at the physical storage structure level and are not constrained by the logical schema.  This allows for much more freedom and optimization without having to change the actual schema that certain applications are built upon.

Hopefully this gave you an overview of what projections are and how they work.  Please read Part II and Part III of this post to drill down into projections even further.

Read the rest of the 3-part series…

The Power of Projections – Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

Sessionize with Style: Part 2

Why is the Vertica Approach Unmatched?

In Part 1 of this two-part article, we described Vertica’s approach to sessionization. To understand why this approach is head and shoulders above the state-of-the-art, let us closely examine a few existing approaches.

The Aster Data Approach –From Boston to LA via China

The first approach comes from Aster Data. One query example from this article is quoted below.

SELECT sessionId, userId, timestampValue

FROM Sessionize( ‘timestamp’, 60 ) ON

(SELECT userid, timestampValue FROM webclicks WHERE userid = 50);

There are a few usability hurdles with this design.

  1. The Sessionize function does not take userId as its input. It therefore must be assuming the input table webclicks has a column named userId, to be used in this computation. Hard-coding column names makes the product less usable.
  2. This design of sessionization support unnecessarily deviates from standard SQL in terms of both syntax and semantics, artificially creating a steep learning curve for the end users to climb. For example, the keyword ON has a different meaning from the standard use of ON in SQL ANSI joins. As a proposal for an alternative design, why not retain the subquery as a standard FROM clause subquery, and move the Sessionize function to the SELECT clause? This could preserve the required semantic ordering of evaluating the subquery before the Sessionize function, while minimizing the deviation from Standard SQL.
  3. This example implies that the predicate userid = 50 has to be placed in a subquery in order for that predicate to be pushed before the Sessionize computation. This predicate push down technique, well known to relational databases since the invention of relational algebra in the 1970’s, should have been taken care of in the query optimizer automatically. Forcing the end user to come up with “clever” SQL formulations violates the spirit of declarative programming in SQL, again making the product less usable.

Apparently Aster Data took another stab at the design, where a query example in that new design is quoted below.

Select ts, userid, session

From sessionize (

on clicks partition by userid order by ts

Timecolumn(‘ts’) timeout (60)

)

While this design represents progress, it still suffers from unnecessary deviation from the Standard SQL (see Comment #2 above). Also, since the proposed Sessionize syntax is not fully interoperable with other existing SQL constructs, it makes the query formulation cumbersome when the user wants to perform sessionization and other SQL computation in the same query. For example, how many subqueries need to be involved, if the user wants to first join the table clicks with a dimension table, perform a group-by, and then perform sessionization? In comparison, the Vertica approach employs not a single subquery to accomplish this task.

Finally, in case you wonder, in Aster Data Sessionize is a MapReduce function written in JavaDoes the sessionization task justify the employment of a big hammer execution mechanism like MapReduce? No! As we mentioned before, the run-time complexity of sessionization is at the level of the simplest SQL ’99 analytic functions such as RANK and ROW_NUMBER, which takes nothings more than a single pass over the sorted data.

Using MapReduce to implement sessionization is a mistake on the architectural level. It does not matter how fast your MapReduce implementation is – if you travel from Boston to LA by flying around the globe, does it matter that you are taking a supersonic aircraft? Besides, while MapReduce is more versatile than SQL when processing computational tasks, it has been shown that when processing SQL-like query tasks, due to its significant overhead a MapReduce engine can be slower than a SQL engine by an order of magnitude or more.

The Teradata Approach – a Horse Wagon Fun Ride from Boston to LA

In this article, Teradata revealed its support for sessionization. We quote its query example below.

with dt (IP, Click_Timestamp, samesession) as

( select IP, Click_Timestamp,

case when (Click_Timestamp – (max(Click_Timestamp) over (partition by IP order by Click_Timestamp rows between 1 preceding and 1 preceding ) ) minute) < (interval ’30′ minute) then 0 else 1 end

from webclicks)

select sum(samesession) over (partition by IP order by Click_Timestamp rows unbounded preceding) as Session_No, IP, Click_Timestamp

from dt;

From this example, we can see that Teradata’s sessionization support is not native. Instead, it is expressed in terms of existing SQL ’99 functions. Such a design bears the following consequences.

  1. The resulting formulation is quite cumbersome and unintuitive – if you understand why this query is actually performing sessionization, we salute to you for being a SQL guru. On the other hand, an average DBA might be at a complete loss at what the query intends to do. Also, the WITH clause is an unnecessary big hammer construct to employ – a subquery could have been used. Then again, no subquery would be nice!
  2. The use of the SQL window aggregate function Max further complicates the formulation unnecessarily. The SQL analytic function LAG should have been used.
  3. Last but certainly not least, the complex query formulation involving two query blocks is likely to lead to suboptimal run-time performance.

Conclusions

Vertica’s sessionization support is unmatched among its peers, and here is why.

  1. Thanks to Vertica’s analytic function CONDITIONAL_TRUE_EVENT (CTE), the query formulation of sessionization is extremely compact – no subquery or unnecessary SQL construct is involved.
  2. Thanks to CTE, Vertica’s sessionization goes beyond its standard semantics, and is thus capable of supporting a wider range of use cases.
  3. The run-time performance of Vertica’s sessionization is close to optimal –the computation is fully pipelined for each user id, and is fully load-balanced across all user ids.

In closing this post, we would like to offer our dear esteemed readers the challenge to come up with a solution that beats Vertica’s implementation of sessionization.