Vertica

Archive for April, 2010

Exciting Times at Vertica

I recently joined Vertica as President and CEO and could not be more excited about the team and technology.  These are also exciting times for the leading next generation analytic database company.  I want to introduce myself and share my thoughts on Vertica’s position and opportunity in this market.

I joined Vertica because over the past few years I’ve seen a fundamental change in the enterprise IT ecosystem, and I believe Vertica embodies this shift.  The change I am referring to is the broad horizontal and vertical consolidation occurring in the data center, and the need for innovative, purpose-built applications that can take full advantage of this architectural shift.  Hyper-competitiveness, cost constraints, and regulatory compliance are forcing this consolidation, and Vertica not only participates in it, but also helps facilitate it.

Already in the company’s young life, Vertica has created a new ecosystem for software, services, server, and storage vendors.  Vertica fundamentally changes the way organizations store, access, analyze, and ultimately monetize their data.  Given my background in the systems, network, and storage business, I know firsthand how Vertica’s unique architecture creates differentiated opportunities for server and storage vendors to compete effectively for the data center footprint of tomorrow.  Vertica offers organizations the freedom to choose the best industry-standard hardware with the best purpose-built database solution for analytics.

When it comes to delivering high-volume real-time analytics, solutions that attempt to incrementally retrofit next-gen features as add-ons to their traditional offerings need not apply.  As recently as twelve months ago, next generation MPP-columnar databases were considered to be relatively new technologies.  The customer pain around data warehousing has only gotten worse, and now the data warehousing market and industry influencers have embraced columnar technology and predict it will be used in the majority of data warehouse implementations.  However, not all column stores were created equal!

Vertica was purpose-built from the ground up as a truly native MPP-Columnar DBMS.  This allows Vertica to deliver superior query performance, high compression rates, concurrent real-time loading and querying, and a unique scale-out model.  We’ve also built this as a modular and massively parallel platform, to which we continually add rich analytics libraries and capabilities.

This only matters as a means to the end of solving customer problems.  Last year we more than doubled our customer base, reaching the 100th customer milestone faster than any data warehouse vendor.  We now have over 130 customers.  We have consistently brought out new versions of our database incorporating innovations that increase the effectiveness of the Vertica-based solutions our customers use every day, all day.  We have established a strong presence in the telecommunications, financial services, internet, retail and healthcare industries.  Finally, we have built strong direct and indirect sales channels and have begun to establish our presence in international markets.

It is my goal to capitalize on the momentum and market opportunity that exists for Vertica and to pursue an aggressive growth strategy.  This will lead to continuous innovation and further improvements in what we deliver to customers, and it will also expand a healthy ecosystem around us.  The end result is that we will continue to build on our track record of delivering an analytic database solution that scales out further, queries and loads data faster, enables more real-time analytics, and gets even easier to use.  It is going to be a fun ride, and we are eager to experience this journey with our customers, prospects and partners alike.

Vertica Under the Hood: The Query Optimizer

As we bring our 4.0 release to market, we are starting a series of educational blog posts to provide an in-depth look at Vertica’s core technology. We start with one of our crown jewels – the Vertica Query Optimizer.

The goal of query optimizers in general is to let users get maximal performance from their database without worrying about the details of how it gets done.  At Vertica, we take this goal to heart in everything that we build.  From day one, the Vertica Optimizer team has focused on creating a product that reduces the need for manual tuning as much as possible.  This lets users focus on their business needs rather than on tuning our technology.

Before we dive into the unique innovations within our optimizer, let’s get a few simple facts straight:

  • The Vertica Optimizer is not limited to classic Star and Snowflake Schemas – it hasn’t been since version 2.5. Many of our 130+ customers in production today are using non-star schemas with great success.  In fact, our Optimizer easily handles very complicated queries – from workloads as simple as TPC-H, containing only relatively simple Star queries with a few tables, to complex queries containing hundreds of joins with mixes of INNER/OUTER joins and a variety of predicates and sub-queries.
  • It is not common and certainly not necessary to have one projection per query to get great performance from Vertica. While the Optimizer understands and chooses the optimal plan in the presence of several choices, few customers have found it necessary to do custom tuning for individual queries except in very unusual circumstances.  It is far more typical to have great performance without such tuning at all.
  • The Vertica Optimizer is the only true columnar optimizer developed from scratch to make best use of a column store engine.  Unlike some other column store vendors, we do not use any part of the Postgres optimizer.

Why? Because fundamentally, we believe that no amount of retrofitting can turn a row-oriented optimizer into a column-oriented one.

For the optimizer geeks out there, here are some of the capabilities that we believe give the Vertica Optimizer that special edge over others, even mature ones:

  • The entire Optimizer is designed as a set of extensible modules so that we can change the brains of the optimizer without rewriting much of the code. This means we can incorporate knowledge gleaned from end-user experiences into the Optimizer, without a lot of engineering effort.  After all, when you build a system from scratch, you can build it smarter and better!
  • Unlike standard optimizers that determine the optimal single-node plan and then introduce parallelizing operators into it as an afterthought, our patent-pending optimizer algorithms account for data distribution during the join order enumeration phase of the optimizer. We use sophisticated heuristics based on knowledge of the physical properties of the available projections to control the explosion of the search space.
  • Unlike standard optimizers that restrict the join search space to left-deep plans, the Vertica Optimizer considers bushy plans very naturally.
  • The Vertica Optimizer is cost-based, with a cost model that accounts not only for I/O but also for CPU and network transfer costs, and it takes into account the unique details of our columnar operators and runtime environment.
  • The Vertica Optimizer employs many techniques that take advantage of the specifics of our sorted columnar storage and compression – for example, late materialization, compression-aware costing and planning, stream aggregation, sort elimination, merge joins, etc. (a sketch of inspecting such a plan follows this list).
  • The Vertica Database Designer works hand-in-glove with the optimizer by producing a physical design that can take advantage of the many clever optimizations available to the optimizer.
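
For the optimizer-curious, here is a minimal, hypothetical sketch of how one might inspect these decisions; the table and column names are invented, and the behavior described in the comments is illustrative of the technique rather than a transcript of actual Vertica plan output:

-- Ask the optimizer for its chosen plan without executing the query.
-- If the projections on both sides are sorted on cust_id, the plan
-- would typically choose a merge join over a hash join.
EXPLAIN
SELECT c.name, SUM(s.price)
FROM sales s JOIN customers c ON s.cust_id = c.cust_id
GROUP BY c.name;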

While innovating on the core algorithms, we have also incorporated many of the best practices developed over the past 30 years of optimizer research, such as:

  • Using histograms to calculate selectivity.
  • Optimizing queries to favor co-located joins where possible.  Note that the optimizer can handle physical designs with arbitrary distribution properties and uses distribution techniques such as re-segmented or broadcast joins.
  • Transformations such as converting outer joins to inner joins (see the sketch following this list), taking advantage of primary/foreign key and null constraints, de-correlating sub-queries, flattening views, introducing transitive predicates based on join keys, and automatically pruning unnecessary parts of the query.
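
To make the first of these transformations concrete, here is a minimal sketch with hypothetical table names: because the WHERE predicate rejects NULLs coming from the outer side, the outer join is equivalent to an inner join and can be planned as one.

-- The predicate o.total > 100 discards any NULL-extended rows,
-- so this LEFT OUTER JOIN...
SELECT c.name, o.total
FROM customers c LEFT OUTER JOIN orders o ON c.id = o.cust_id
WHERE o.total > 100;

-- ...can safely be rewritten by the optimizer as an inner join:
SELECT c.name, o.total
FROM customers c JOIN orders o ON c.id = o.cust_id
WHERE o.total > 100;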

As a testament to the quality of our optimizer, we are proud to say that customers rarely override the plans it produces.  This removes an entire class of management from the DBA and lets our algorithms take full advantage of our ever-improving execution engine. That being said, we believe that performance and ease of use speak for themselves, and so we invite you to Test Drive the Vertica Database on your schema, your queries and your data!

Column Store vs. Column Store

It has been five years since Vertica was founded, and it is great to see that column stores are becoming prevalent and widely regarded as the preferred architecture for data warehousing and analytics. Mainstream and upstart vendors alike are announcing columnar storage and columnar compression as “features” of their row-oriented DBMSs. While this is excellent news for column store enthusiasts, marketing messages are rife with false information that creates confusion for buyers. Could you be mistaking an imitation diamond for the real thing?

Here’s what you should know when evaluating or buying a column store DBMS.

What makes a True Columnar DBMS

A true column store, like Vertica, must have the following four features:

Columnar Storage, Compression and Retrieval

Data is stored in columns such that it is possible to retrieve data in one column without fetching other columns. This has the benefits of I/O reduction as well as improved compression. Data is compressed on a column-by-column basis, with the compression technique chosen based on the properties of the data. Block-level columnar compression in row-oriented databases fails to meet this criterion – compression in these systems is typically limited to a single technique and does not eliminate unnecessary columns (and the resulting I/O) on retrieval.
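
As a minimal sketch of what this looks like in practice (the table, columns, and encoding choices here are hypothetical), a Vertica projection can declare a per-column encoding along with a sort order and segmentation key:

-- Run-length encoding (RLE) suits the repetitive userId column,
-- and sorting by (userId, ts) makes both columns compress well.
CREATE PROJECTION webclicks_p (
    userId ENCODING RLE,
    ts,
    url
) AS
SELECT userId, ts, url FROM webclicks
ORDER BY userId, ts
SEGMENTED BY HASH(userId) ALL NODES;

Segmenting frequently joined tables on the same key is also what makes the co-located joins discussed below possible.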

Columnar on Disk, not just In-memory

Some so-called columnar DBMS vendors rely on caching all of the data in memory in columnar format. These systems hit a performance cliff when data sizes grow beyond what fits into memory, or they require a huge hardware footprint. It is no secret that memory continues to be the most expensive component in any system, so this approach is likely to limit your scalability. Check out some recently published 1TB TPC-H benchmarks by columnar vendors and notice how much hardware and memory was needed for this tiny amount of data!

Columnar Optimizer & Execution Engine

To really take advantage of a column store architecture, the query optimizer must be deeply aware of columnar storage and optimization techniques.  Late materialization is just one example of an optimization technique that can significantly speed up joins in a column store: the result of the join is computed by fetching only the join key columns off disk, with the remaining columns fetched only at the very end of query execution.
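
As a rough illustration (hypothetical tables; the steps in the comments are a conceptual sketch of the technique, not a trace of the engine):

-- Under late materialization, a plan for this query can:
--   1. read only s.cust_id and c.cust_id to find the matching rows,
--   2. fetch c.name and s.price from disk for those rows only,
--      at the very end of execution.
SELECT c.name, s.price
FROM sales s JOIN customers c ON s.cust_id = c.cust_id;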

Going hand in hand with the optimizer, the execution engine of a true columnar database looks radically different from the processing model employed in a typical modern row-oriented DBMS. A true columnar engine can evaluate predicates, joins, aggregates, sorts, and analytics directly on compressed data, saving not only I/O but also CPU cycles. The bottleneck then shifts to memory bandwidth, and techniques such as vectorized execution over columns are used to make more efficient use of the L2 cache.

No amount of retrofitting can turn a row-oriented optimizer and engine into column-oriented ones.

For more on this subject, see Dan Abadi’s excellent research:

http://cs-www.cs.yale.edu/homes/dna/papers/abadiicde2007.pdf,

http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf,

http://cs-www.cs.yale.edu/homes/dna/talks/abadi-sigmod-award.pdf

Optimized Loads and Transactions

While analytic DBMS workloads are heavy on queries versus transaction throughput, this does not mean they are “read-only”. Many vendors implement columnar storage as a feature that assumes “archival” or “read-only” access, or that sacrifices compression when updates are supported.  A true columnar RDBMS should provide the ability to do fast loads and handle SQL deletes and updates to the data without sacrificing query performance or compression benefits.
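
As a minimal sketch (the file path and table are hypothetical), bulk loads via COPY and ordinary SQL DML can run alongside queries:

-- Bulk load a day of click data while queries continue to run;
-- DIRECT hints that the load should go straight to disk storage.
COPY webclicks FROM '/data/clicks_2010_04_01.csv' DELIMITER ',' DIRECT;

-- Standard SQL deletes and updates are also supported.
DELETE FROM webclicks WHERE ts < '2009-01-01';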

Lacking any one of the above elements significantly reduces the benefits of a column store. Vertica is the only analytic database in the market today with all of the above features. That being said, a columnar architecture is just one of the many design choices that make Vertica the DBMS of choice for large-scale real-time analytics – I’ll talk more about these in a future blog post.

And don’t take our word for it.  Try it out for yourself.

Reaffirming our Commitment and Approach to Hadoop / MapReduce

As head of Product Management at a next generation analytic DBMS company, I often get questions about Vertica’s endeavors with Hadoop/MapReduce.  Given that Vertica and Hadoop/MR share many core principles, such as being massively parallel and highly available on distributed commodity hardware, there is a natural fit.  That said, the two are still different: Vertica is designed for real-time analytics of structured data, whereas Hadoop/MR is typically for batch-oriented jobs with any type of data (structured/semi-structured/unstructured).  We try to stay out of the comparisons, though, and instead focus on complementary approaches, particularly in solving real-world customer problems.  This has been our approach since the beginning of our joint development efforts.

Vertica and Hadoop/MR complement one another extremely well, and we are committed to ensuring bi-directional and tight integration between Hadoop/MR and Vertica.  Our preference is to work with great partners like Cloudera who understand enterprise class Hadoop the same way Vertica understands enterprise-class databases.  Our approach of seamless and parallel integration is in line with Vertica’s core “One Size Does Not Fit All” tenet. We don’t think we need to develop the technology ourselves, much in the same way that we don’t feel the need to develop our own ETL and front-end visualization solutions.

Vertica is focused on building the best next generation analytic database solution on the market.  Our solution enables customers to unlock and monetize their data in a fully relational and massively parallel manner, with scalability and simplicity of setup and administration as core design principles. We enable companies to ingest, store, and analyze vast amounts of structured data with near real-time latency on a fraction of the hardware they would otherwise need.  This is why Vertica was founded, it is what we owe our success to date, and as far as we can tell, we are solving a very clear and present data problem that is only getting worse.  Our focus is also the reason we reached the 100 customer mark faster than all of our competitors.  Among other uses, Hadoop/MR is wonderful at getting more and higher quality data into Vertica.

While Hadoop/MR and Vertica are different, the problems they solve are not always orthogonal.  As it turns out, and not surprisingly, many data problems can be solved in more than one way.  Again, we see merit in Hadoop/MR for several use cases (including but not limited to the massaging, structuring, and transformation of data before and/or after it gets to the database), but we also know that some of the most commonly cited MR use cases can be performed through a single pass of SQL in the database engine as well.  By stripping away the noise and listening to our customers and their pain, we are able to deliver a core product that solves many of the same issues.  Not all, but many.

A case in point is sessionization, which is perhaps the most often cited use case for MapReduce in the enterprise (stay tuned for a more in-depth post on this topic and CTE).  Sessionization is the process of taking web log files and grouping them into buckets of visitor sessions (most commonly time-based, e.g. 30 seconds) for analysis.  This has been pegged as problematic to perform in SQL, and therefore in the RDBMS, because it often requires multiple passes through the engine and is difficult to express.  In Vertica 4.0, however, this can be expressed in a single pass of SQL.

Here’s the SQL with a Web/Clickstream timeout threshold of 30 seconds:

SELECT userId, timestamp, CTE(timestamp - LAG(timestamp) <= '30 seconds') OVER (PARTITION BY userId ORDER BY timestamp) AS session FROM webclicks;
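
As a hypothetical follow-up using the same CTE syntax, the computed session label can feed ordinary SQL in the same statement, for example to count sessions per visitor:

SELECT userId, COUNT(DISTINCT session) AS num_sessions
FROM (
    SELECT userId,
           CTE(timestamp - LAG(timestamp) <= '30 seconds')
               OVER (PARTITION BY userId ORDER BY timestamp) AS session
    FROM webclicks
) s
GROUP BY userId;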


By performing this operation in the Vertica database, our customers leverage our massively parallel real-time columnar infrastructure without having to move the data around for external batch processing.  They can do this from within their same favorite reporting tool, without adding that extra step.  Furthermore, Vertica’s native windowing capabilities for advanced analytics, including sessionization, are extensive and are not limited to the conditional true event (CTE) on timestamps depicted above.  Of course, there are still good reasons to perform sessionization outside the database, such as not wanting to take up valuable real-time analytics resources while performing such grouping legwork (although this can actually be addressed using Vertica’s new workload management capabilities).  We get that, and again, that is why we support native Hadoop/MR — no need for syntax changes.

Key to our One Size Does Not Fit All approach was Vertica’s day-one decision not to cut corners by building on top of Postgres or some other traditional row store, as most of our competitors have done with their offerings.  We have instead written a truly next generation native MPP-Columnar ADBMS solution from scratch, complete with a unique set of bells and whistles (stay tuned for a specific post on this subject as well).  The good news is that on this core foundation, we can now add functionality that traditional row stores would never be able to handle in a fast enough manner.  Sessionization is a great example.  It is simply too inefficient to perform in a traditional RDBMS, not to mention that most databases are not as expressive; that is why many people turn to Hadoop/MR for it.  Vertica’s customers are finding there are a lot of things they can now do in Vertica that they could never consider with a traditional database.  This, combined with tight integration with frameworks like Hadoop, allows our customers to monetize all of their data in ways never before possible.
