Vertica

Author Archive

Announcing the Vertica Community Edition

by Colin Mahony, VP of Products & Business Development
and Shilpa Lawande, VP of Engineering

Vertica has had an amazing journey since it was founded in 2005. We’ve built a great product, a great team and an incredibly strong and loyal customer base and partner ecosystem. When we first started, no one had even heard of a column store, and today, ‘Big Data Analytics’ is taking the industry by storm. Every day we see companies – big and small – in industries from retail to gaming becoming more data-driven and doing amazing things with the help of analytics. We feel proud and humbled to see the transformation impact the Vertica Analytics Platform has had on our customers’ businesses, and we believe the time has come to broaden access to our technology to a wider Big Data community.

Today, we are truly excited to announce the Vertica Community Edition beta program! The Vertica Community Edition will offer many of the same features as the enterprise edition of the Vertica Analytics Platform to anyone who wants to discover the power of Vertica.  And, as part of the Community Edition beta announcement, we are developing a new MyVertica Community portal which will provide a platform for Vertica users and partners to interact and share knowledge and code with the entire Vertica user community.

Vertica has always been a customer-driven company and we couldn’t have built Vertica without ideas, feedback and guidance from our customers and partners. We hope that the Vertica community will play a similar role going forward – sharing ideas and best practices and providing candid feedback about the product and how it can be made richer and simpler to use.  The MyVertica community portal will feature product downloads, forums, documentation,  training materials, FAQs and best practice guides. We will also be maintaining a GitHub code repository where community users will be able to share code samples, user-defined extensions built using our SDK, adapters to 3rd party products, and more. We hope that with the Community Edition, we take a small step towards our vision of democratizing data and making data and analytics accessible to all!

To register for the Vertica Community Edition beta program, simply visit www.vertica.com/community and complete the registration form.  The beta program will be limited initially, but full availability of the Vertica Community Edition software is expected by the end of the year.

On behalf of Vertica and HP, we are excited to contribute something back to the Vertica Community.  We sincerely invite you to join and contribute, and we can’t wait to see the many cool things you will do with Big Data and Vertica!

Shilpa & Colin

Vertica at BIRTE 2011: Social Graph Analytics

A few days ago, Lakshmikant Shrinivas and I gave an invited talk at BIRTE 2011.  Rather than yet another talk on Vertica 101, we chose to talk about Social Graph Analytics, a topic we are truly very excited about because it is something we are learning from our customers! And of course, it is uber cool to talk about Cityville and Zynga these days!  We presented the Zynga usecase  – how to find the influencers among the active users of a game and several other graph problems Vertica has very successfully solved with SQL.  More about these in future blog posts.

The most fun part of this talk (according to the audience) was that it was multi-threaded – we weaved a real-time demo through the talk. The demo was in two parts. The first part calculated an approximate 4-core of a graph of 90M nodes and 405M edges representing active users and their interactions in a game.  A k-core of a graph picks out a subgraph where every node has at least k neighbors in the subgraph – some research shows that the set of influencers in a social network is often a 3 or 4-core. The demo was run on 4 blades of the HP/Vertica Analytics System with 12 cores and 96GB RAM each. The animation below show an approximate visualization of the actual graph with 10K nodes and 45K edges (we could not find any graph visualizing tool that could draw a graph of 405M edges!) and how the algorithms iteratively reduces it to find the 4-core. The actual computation (which cannot be visualized) took about 1 minute to process 1TB of in-game event data to compute the initial graph of 90M active users and only 4.5 minutes to run 8 iterations resulting in the 4-core of 34K users!

The second part showed how this graph of influencers could be used to do A/B testing for in-game product placement. We simulated giving a group of users from the graph of influencers , Coke, and a group of randomly chosen users, Pepsi, and loading their in-game interactions data into Vertica, every 15 seconds. In the 5-minutes or so it took us to describe the setup, you could see how the relative penetration of the two products changed in real-time.  This was really a demo of Vertica’s awesome real-time analytics engine – we loaded data continuously at the rate of 40GB/minute (on 4 nodes with  2 copies of the data), which translates to 20TB/hr on the full rack HP/Vertica Analytics System.  That’s fast, eh?!!

This little demo and what our customers are doing in the real world, makes for a very convincing case that SQL can be used to express many practical graph problems. Vertica’s MPP architecture and sorted projections provide a highly scalable and fast engine for iterative graph algorithms. Such problems are relevant not only to the Zyngas and Facebooks of the world but to any company that has a connected user community. From online forums to friend-and-family call logs, social networks are everywhere and finding influencers is the key to monetizing them. Gartner believes that social CRM will be a $1B market by end of 2012!

We hope that our talk planted a small seed for the database research community to more formally explore SQL solutions to graph problems!  Watch for more blogs from Vertica on solving graph problems with SQL.

The Power of Projections – Part 3

By Colin Mahony and Shilpa Lawande

Part III – Comparing and Contrasting Projections to Materialized Views and Indexes

In Part I and Part II of this post, we introduced you to Vertica’s projections, and described how easy it is to interface with them directly via SQL or through our Database Designer™ tool.  We will now end this series by comparing and contrasting Vertica’s projections with traditional indexes and materialized views.

Row-store databases often use Btree indexes as a performance enhancement.  Btree indexes are designed for highly concurrent single-record inserts and updates, e.g. an OLTP scenario. Most data warehouse practitioners would agree that index rebuilds after a batch load are preferable to taking the hit of maintaining them record by record given the logging overhead. Bitmap indexes are designed for bulk loads and are better than btrees for data warehousing but only suitable for low cardinality columns and a certain class of queries.  Even though you have these indexes to help find data, you still have to go to the base table to get the actual data, which brings with it all the disadvantages of a row store.

In a highly simplified view, you can think of a Vertica projection as a single level, densely packed, clustered index which stores the actual data values, is never updated in place, and has no logging. Any “maintenance” such as merging sorted chunks or purging deleted records is done as automatic and background activity, not in the path of real-time loads.  So yes, projections are a type of native index if you will, but they are very different from traditional indexes like Bitmap and Btrees.

Vertica also offers a unique feature known as “pre-join projections”. Pre-join projections denormalize tables at the physical layer under the covers providing a significant performance advantage over joining tables at run-time.  Pre-join projections automatically store the results of a join ahead of time, yet the logical schema is maintained – again, flexibility of the storage structure without having to rewrite your ETL or application.  Vertica can get away with this because it excels at sparse data storage, and in particular isn’t penalized at all for null values nor for wide fact tables. Since Vertica does not charge extra for additional projections, this is a great way to reap the benefits of denormalization without the need to purchase a larger capacity license.

So to sum up, here’s how Vertica projections stack up versus materialized views and conventional indexes.

Vertica’s Projections

Traditional Materialized Views

Traditional Indexes

  • Are primary storage – no base tables are required
  • Are secondary storage
  • Are secondary storage pointing to base table data
  • Can be segmented, partitioned, sorted, compressed and encoded to suit your needs
  • Are rigid: Practically limited to columns and query needs, more columns = more I/O
  • Support one clustered index at most – tough to scale out
  • Have a simple physical design
  • Use Aggregation losing valuable detail
  • Require complex design choices
  • Are efficient to load & maintain
  • Are mostly batch updated
  • Are expensive to update
  • Are versatile – they can support any data model
  • Provide high data latency
  • Provide high data latency
  • Allow you to work with the detailed data
  • Provide near-real time low data latency
  • Combine high availability with special optimizations for query performance

.
That’s pretty much all there is to it.  Whether you are running ad-hoc queries or canned operational-BI workloads, you will find projections to be a very powerful backbone for getting the job done!

Read the rest of the 3-part series…

The Power of Projections – Part 1: Understanding Projections and What They Do
The Power of Projections – Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™

The Power of Projections – Part 2

By Colin Mahony and Shilpa Lawande

Part II – Understanding the Simplicity of Projections and the Vertica Database Designer™

In Part I of this post, we introduced you to the simple concept of Vertica’s projections.  Now that you have an understanding of what they are, we wanted to go into more detail on how users interface with them, and introduce you to Vertica’s unique Database Designer tool.

For each table in the database, Vertica requires a minimum of one projection, called a “superprojection”. A superprojection is a projection for a single table that contains all the columns and rows in the table.  Although the data may be the same as a traditional base table, it has the advantages of segmentation (spreading the data evenly across the nodes in the cluster), sorting, and encoding (compressing the size of the data on a per column basis).  This leads to significant footprint reduction as well as load and query performance enhancements.  To give you a sense of the impact that Vertica’s projections have on size, most Vertica customers have at least a 50% reduction in footprint thanks to our compression.  This includes the high availability copy and on average 3-5 projections.  Again, contrast this to traditional row-store databases ballooning upwards of 5x their original size and that is a 10:1 difference in Vertica.

To get your database up and running quickly, Vertica automatically creates a default superprojection for each table created through the CREATE TABLE and CREATE TEMPORARY TABLE statements. This means that if database admins and users never want to know about a projection, they don’t have to – Vertica will automatically handle it under the covers. To further illustrate this point, users can simply pass in projection parameters such as Order By, Encodings, Segmentation, High Availability, and Partitioning right after the CREATE TABLE statement, never interfacing directly with a projection under the hood.

By creating a superprojection for each table in the database, Vertica ensures that all SQL queries can be answered. Default superprojections alone will do far better than a row-store, however, by themselves they may not fully optimize database performance and Vertica’s full potential.  Vertica recommends that you start with the default projections and then use Vertica’s nifty Database Designer™  to optimize your database.  Database Designer creates new projections that optimize your database based on its data statistics and the queries you use. Database Designer:

1. Analyzes your logical schema, sample data, and sample queries (optional).
2. Creates a physical schema design (projections) in the form of a SQL script that can be deployed automatically or manually.
3. Can be used by anyone without specialized database knowledge (even business users can run Database Designer).
4. Can be run and re-run anytime for additional optimization without stopping the database.

Designs created by the Database Designer provide exceptional query performance. The Database Designer uses sophisticated strategies to provide excellent ad-hoc query performance while using disk space efficiently. Of course, a proficient human may do even better than the Database Designer with more intimate knowledge of the data and the use-case – a small minority of our customers prefer to do manual projection design and can usually get a good feel for it after working with the product for a few weeks.

We’ve heard people ask if we need a projection for each query in Vertica, which we absolutely do not! Typically our customers use 3-5 projections and several are using the single superprojection only. A typical customer would have the superprojection along with a few smaller projections (often comprised of only a few columns each).  Unlike MVs and indexes, projections are cheap to maintain during load and due to Vertica’s compression, the resulting data size tends to be 5-25x smaller than the base data. Depending on your data latency needs (seconds to minutes) and storage availability you could choose to add more projections to further optimize the database.  Also important to note is that Vertica does not charge extra for projections, regardless of how many are deployed.  So whether a customer has 1 or 50 projections, their license fees are the same – entirely based on raw data.

As you can see, projections are very easy to work with, and if you are a business analyst who doesn’t know SQL/DDL, that’s okay, we created a tool that designs, deploys and optimizes the database automatically for you.  Our objective from day one has always been to enable customers to ask more questions and get faster answers from their data without having to constantly tune the underlying database.  Part III of this post goes into more detail on projections versus indexes and materialized views.

Read the rest of the 3-part series…

The Power of Projections – Part 1: Understanding Projections and What They Do
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

The Power of Projections – Part 1

By Colin Mahony and Shilpa Lawande

Part I: Understanding Projections and What They Do

Many of us here at Vertica have been amazed and frankly flattered at how much FUD our competitors are putting out there regarding Vertica’s “projections”.  Having heard some incredibly inaccurate statements about them, we’ve decided to clarify what they are, how and why we have them, and the advantages they bring.  Actually, projections are a pivotal component of our platform, and a major area of differentiation from the competition.  Most importantly, Vertica’s customers love the benefits projections bring! In an effort to provide you with as much detail as possible, this blog is broken up into three posts with Parts II and III being more technical.

First, some background. In traditional database architectures, data is primarily stored in tables. Additionally, secondary tuning structures such as indexes and materialized views are created for improved query performance.  Secondary structures like MVs and indexes have drawbacks – for instance they are expensive to maintain during data load (more detail on this in Part III).  Hence best practices often require rebuilding them during nightly batch windows, which prevents the ability to do real-time analytics.  Also, it isn’t uncommon to find data warehouse implementations that balloon to 3-6x base table sizes due to these structures. As a result, customers are often forced to remove valuable detailed data and replace it with aggregated data to solve this problem. However you can’t monetize what you lost!

Vertica created a superior solution by optimizing around performance, storage footprint, flexibility and simplicity. We removed the trade-off between performance and data size by using projections as the lynchpin of our purpose-built architecture.  Physical storage consists of optimized collections of table columns, which we call “projections”. In the traditional sense, Vertica has no raw uncompressed base tables, no materialized views, and no indexes. As a result there are no complex choices – everything is a projection!  Of course, your logical schema (we support any) remains the same as with any other database so that importing data is a cinch.  Furthermore, you still work with standard SQL/DDL (i.e. Create Table statements, etc).  The magic of projections and Vertica are what we do under the covers for you with the physical storage objects.  We provide the same benefits as indexes without all of the baggage.  We also provide an automatic tool, the Database Designer (more on this in Part II) to create projections automatically.

Projections store data in formats that optimize query execution. They share one similarity to materialized views in that they store data sets on disk rather than compute them each time they are used in a query (e.g. physical storage).  However, projections aren’t aggregated but rather store every row in a table, e.g. the full atomic detail. The data sets are automatically refreshed whenever data values are inserted, appended, or changed – again, all of this happens beneath the covers without user intervention – unlike materialized views. Projections provide the following benefits:

  • • Projections are transparent to end-users and SQL. The Vertica query optimizer automatically picks the best projections to use for any query.
  • • Projections allow for the sorting of data in any order (even if different from the source tables). This enhances query performance and compression.
  • • Projections deliver high availability optimized for performance, since the redundant copies of data are always actively used in analytics.  We have the ability to automatically store the redundant copy using a different sort order.  This provides the same benefits as a secondary index in a more efficient manner.
  • • Projections do not require a batch update window.  Data is automatically available upon loads.
  • • Projections are dynamic and can be added/changed on the fly without stopping the database.

In summary, Vertica’s projections represent collections of columns (okay so it is a table!), but are optimized for analytics at the physical storage structure level and are not constrained by the logical schema.  This allows for much more freedom and optimization without having to change the actual schema that certain applications are built upon.

Hopefully this gave you an overview of what projections are and how they work.  Please read Part II and Part III of this post to drill down into projections even further.

Read the rest of the 3-part series…

The Power of Projections – Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

Column Store vs. Column Store

It has been 5 years since Vertica was founded and it is great to see that Column Stores are becoming prevalent and widely regarded as the preferred architectures for data warehousing and analytics. Mainstream and upstart vendors alike are announcing columnar storage and columnar compression as “features” of their row-oriented DBMS. While this is excellent news for the column store enthusiasts, marketing messages are rife with false information that creates confusion for buyers. Could you be mistaking an imitation diamond for the real thing?

Here’s what you should know about when evaluating or buying a Column store DBMS.

What makes a True Columnar DBMS

A true column store, like Vertica, must have the following 4 features:

Columnar Storage, Compression and Retrieval

Data is stored in columns such that it is possible to retrieve data in a column without fetching other columns. This has the benefits of I/O reduction as well as improved compression. Data is compressed on column-by-column basis, with the compression technique chosen based on properties of the data. Block level columnar compression in row-oriented databases fails to meet this criterion – compression in these systems is typically limited to a single technique and does not eliminate unnecessary columns (and the resulting I/O) on retrieval.

Columnar on Disk, not just In-memory

Some so-called columnar DBMS vendors rely on caching the entire data into memory in columnar format. These systems experience a performance cliff when the data sizes grow beyond what can fit into memory or require a huge hardware footprint. It is no secret that memory continues to be the most expensive component in any system, so this approach is likely to limit your scalability. Check out some recently published 1TB TPCH benchmarks by columnar vendors and notice how much hardware and memory was needed for this tiny amount of data!!

Columnar Optimizer & Execution Engine

To really take advantage of a column store architecture, the query optimizer must be deeply aware of columnar storage and optimization techniques.  Late materialization is just one example of an optimization technique that can significantly speed up joins in a column store. Here, the result of the join can be computed by simply fetching the join key columns off the disk and the remaining columns are only fetched at the very end of query execution.

Going hand in hand with the optimizer, the execution engine of a true columnar database looks radically different from the typical processing model employed in a typical modern row-oriented DBMS. A true columnar engine can do predicates, joins, aggregates, sorts and analytics on compressed data, thereby saving not only on I/O but also CPU cycles. The problem then shifts to optimizing memory bandwidth and techniques like vectorizing or operating on columns are used to allow more efficient use of the L2 cache.

No amount of retrofitting can turn a row-oriented optimizer and engine into column-oriented ones.

For more on this subject, see Dan Abadi’s excellent research on this topic:

http://cs-www.cs.yale.edu/homes/dna/papers/abadiicde2007.pdf,

http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf,

http://cs-www.cs.yale.edu/homes/dna/talks/abadi-sigmod-award.pdf

Optimized Loads and Transactions

While analytic DBMS workloads are heavy on queries v/s transaction throughput, this does not mean they are “read-only”. Many vendors implement columnar storage as a feature assuming “archival” or “read-only” access or reducing compression if updates are supported.  A true columnar RDBMS should provide the ability to do fast loads, and handle SQL deletes and updates to the data without sacrificing query performance or compression benefits.

Lacking any one of the above elements significantly reduces the benefits of column stores. Vertica is the only analytic database in the market today with all of the above features. That being said, a columnar architecture is just one of the many design choices that makes Vertica the DBMS of choice for large-scale real-time analytics – I’ll talk more about these in a future blog post.

And don’t take our word for it.  Try it out for yourself.

Five Tools for High Velocity Analytics

In this post we take a look at the tools required to achieve “high velocity” analytics.  What are the technologies that are important for high velocity analytics and the defining characteristics of these technologies?

  • Low Latency Data Source – Starting at the front end, the gating requirement for embedding analytics in your business is a real time data source.
  • Pipelined [ET]L – Whether data is extracted or transformed, it has to be loaded in a pipeline as data arrives, not in batches at the end of the day.
  • Complex Event Processing – The only way to respond to real time events is on the wire.  CEP systems look at every record before it is stored and can respond to events as they happen.
  • Real Time  Analytic Database – Once data makes it to a storage system you can analyze it in context with your historical data. Concurrent load and query is the critical backbone to high velocity analytics.
  • Flexible Business Intelligence – Front end tools are designed for business users and they have to be as flexible as an analyst is creative.

We assume you’re reading this post because you have a lot of incoming data or you are expecting it.  You will need to have tools in place to get this data into your system as quickly as possible. If your inbound data is log files, you can use tools like scribe to capture these logs in real time from your web servers. This tool was developed by the team at Facebook to centralize all of their log data and is now free and open source.  There’s a great introductory article on the High Scalability blog.

If your key business data comes from an online transaction processing (OLTP) system you need to first make sure you have a fast OLTP system handling inbound transactions.  This can be anything from a general purpose database, perhaps shardedfor scalability or an optimized OLTP specific DBMS.  As with log files, the key attribute you need to identify is transaction latency with large volumes of data; just having a lot of data is not sufficient.  If you can handle a million users but it takes an hour to process their orders then that hour is going to be your bottleneck. The same is true when you need to modify your pricing or ad campaigns or to identify up-sell opportunities. That hour of lag is going to be the bottleneck to building analytics into your business and increasing your profitability. The time between a user executing a transaction, such as purchasing an item or changing their subscription, and when you can act on that transaction needs to be minutes if not seconds.

To achieve this low latency interface you can use a change data capture tool, an OLTP solution with a direct extract/load (EL) to an analytic database or structure your application to log the result of the transaction directly to your analytic database. You can also use tools such as scribe to log this transaction just as you would any other application log. You may also have a custom data feed such as a financial tick feed fromThomson ReutersBloomberg B-PIPE or QuantHouse QuantFEED or a feed from your operational network provider.

In between the transaction data source and your analytic database you may need to respond to events as they occur “on the wire.”  This is where you employ a complex event processing engine to handle on the wire detection, automate common responses and flag important events.  CEP systems typically operate by running data through a pre-defined query, that accumulates and modifies state, triggering behaviors when a certain threshold is met. For example, a CEP system can be used to keep a count of errors for various data sources and raise an alert if any of them exceeds your maximum SLA threshold.

The hub for your business processes is the analytic database. This database is different from the general purpose database you use for accounting and it may even be different than the enterprise data warehouse where you log and report across your business activities. Your analytic database must be able to accept incoming data 24×7, allow you to access data quickly and with low latency – within minutes if not seconds and scale out infinitely as your data volumes grow.

The analytic database collects real time data as it is streamed in and stored for some defined period of time. Since the historical data storage is defined by business requirements, the analytic database must scale out to handle as much historical data as necessary. Similarly as requirements for faster analysis on more data grows the database must scale to handle more users and faster queries.

Since data is flowing in non-stop, the analytic database must have robust features to support trickle load, concurrent load and query and non-stop high availability.  If you have to pause queries in order to load or if any parts of the system need to be restarted in order to load data after a failure, you risk having downtime in a critical component during the highest peak of loads. These scenarios are easy to test by simulating high rates of loads and queries while pulling the plug (figuratively or literally) on random components.

Finally, as business users get access to more data in real time, the type of analysis changes. With the flexibility to iteratively explore real time data, user demand for additional information and different views grows. The front end tools must handle dynamic visualizations to accommodate these requirements. Both classic BI tools such as MicroStrategyCognos and Business Objects as well as new cutting edge tools fromTableauJasperSoft and Pentaho are modernizing the front end for real time BI. The key features to look for are flexibility of schema, and simplicity of abstraction.  When adding new data to the system it should be easy to incorporate it into the tool (if not automatic) and the complexity for mapping from what the business user clicks on and what the database schema defines should be minimal.  Keep in mind the development time between adding a data source to the analytic DBMS and giving your business users access to that data.

The key takeaway of this tutorial is that the latency between your transaction processing systems, data capture tools, complex event processing, real time analytic DBMS and your business analytics tool will define the speed at which your business can react to changes and ultimately your flexibility to adapt.  The lower the latency in your toolset, the higher the velocity which with you can operate and the more effectively you will be able to compete.  There is no doubt today that the winning players in every market are the most adaptable and flexible companies.