Author Archive

Teaching the elephant new tricks

by Shilpa Lawande & Rajat Venkatesh


A growing number of Vertica customers use Hadoop as an ETL/pre-processing tool or HDFS as a “data parking lot” – either to stage files before they are loaded or to simply store data whose worth is yet to be proven. For several years, Vertica has provided a Hadoop Connector that provides bi-directional connectivity between the two systems. In this post, we evaluate the pros and cons of the current approach and discuss a new technique that is enabled by Vertica 6, newly released.

Current Design

The most efficient topology of network connections is shown in the diagram  above. There is one database connection for each partition or a subset of partitions and all these connections transfer data in parallel. Most solutions in the market including the current Vertica-Hadoop connector do follow this design. They open a JDBC connection for each (or a subset of) partition and execute a batch insert to transfer the data. Some products, like Vertica may optimize the transfer by using a bulk load API or transform the data to avoid using resources on the database. Apache Sqoop is the end result of this evolution and provides a platform to implement generic or optimized connectors for all databases that provide a JDBC driver.


The architecture described above has some inherent bad qualities owing to a fundamental impedance mismatch between Vertica and Hadoop.

Even though Vertica and Hadoop are both MPP systems there is no good way to coordinate the degree of parallelism across the two systems, resulting in an inefficient use of the combined resources. Even a moderately large Hadoop cluster can overwhelm a database with too many JDBC connections. Therefore, it is common best practice to reduce the number of reduce (or map) tasks that connect to the database.

Vertica’s customers typically transfer a large file (with many HDFS chunks) or a set of files into one table. Each of the resulting JDBC connections initiates a new transaction overwhelming the transaction module with too many actors. Since each connection is a transaction, it’s not possible to roll back a transfer. A failed transfer may leave partial data – the worst kind of failure. A common best practice is to first store the data into a temporary table, verify the transfer and then move it to its final resting table. Another common strategy is to add a column to identify the rows inserted in a transfer, verify the rows with the unique id and then keep or delete the rows. Depending on the database and storage engine, verifying and moving the data may take up significant resources. Regardless of the approach or database, there is significant management overhead for loading data from Hadoop. A majority of Vertica’s customers use Vertica so that they don’t have to worry about ACID for bulk loads. They don’t want to be in the business of writing and maintaining complex scripts.

In our survey of customers, we found that many of them have come up with creative solutions to the “too-many-connections” problems. By popular demand, we’re in the process of creating a repository of the Hadoop connector source on Github and hope to see some of these solutions contributed there. However, by far, the most preferred solution is to simply write the results of a MR job to HDFS files, then transfer those files to Vertica and load them. And that’s what we’ve chosen to do in our new design as well.

New Design

In the new HDFS connector, currently in private beta, we have taken the first step to solve the problems described earlier by allowing direct load from files from HDFS to Vertica,  thereby eliminating the non-ACID behaviors due to multiple connections. Further, we solve the problem of too many connections by using one TCP/IP connection per file in HDFS. How did we do this? With Vertica 6, we introduced User-defined Loads. This exposes APIs that can extend Vertica’s native load command COPY, to load from any source and any format. With a few days of effort, using the UDL API and Apache Hadoop REST API for HDFS files, we wrote a plugin to read HDFS files from COPY. The benefits are obvious – COPY can process many files within the same transaction in parallel, across the Vertica cluster. Parallelism is limited by the hardware of the machines. Since it is only one transaction, a failure will result in a complete roll back.

The resulting network topology looks like the diagram below.

Let’s see how the HDFS connector works using some examples.

Imagine that there are 6 files in a directory in HDFS

-rw-r-r-   1 hadoop supergroup        759863287 2012-05-18 16:44
Found 1 items
-rw-r-r-   1 hadoop supergroup        759863287 2012-05-18 16:44
Found 1 items
-rw-r-r-   1 hadoop supergroup        759863287 2012-05-18 16:44
Found 1 items
-rw-r-r-   1 hadoop supergroup        759863287 2012-05-18 16:44
Found 1 items
-rw-r-r-   1 hadoop supergroup       759863287 2012-05-18 16:44
Found 1 items
-rw-r-r-   1 hadoop supergroup       759863287 2012-05-18 16:44
Found 1 items

The Vertica Copy command to load lineitem.tbl works as follows:

copy lineitem source

If you wanted to load all 6 files, you simply use the glob * feature. In this case, the 6 files are loaded into Vertica and are processed in parallel across the cluster.

copy lineitem source

In this example, lineitem.tbl is a tab-limited file. What about all the other popular file formats? User Defined Load provides an API to plugin a parser for any file format as well as APIs to decompress, decrypt or perform any arbitrary binary transformation. Check out this blog post showing how UDL can be used to process image file data.

Further, with Vertica 6 external tables, in combination with the HDFS connector, you can now use Vertica SQL to analyze data in HDFS directly (see the “Vertica Analytics Anywhere” blog post). This means, while you are in the exploratory stage with some data, you could experiment with different data models and still use SQL analytics (or use the Vertica R SDK) to plough through the data.

CREATE EXTERNAL TABLE hdfs.lineitem (<list of columns>) AS COPY
FROM Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/*');


“Wait a minute! Most chunks are sent to another Hadoop datanode and then to Vertica. This is worse than what we had before”. Yes we have exchanged expensive JDBC connections with one extra hop for most chunks. However, the new strategy is still faster and maintains transaction semantics. The next step is to enable COPY to parse data chunks in parallel. However, we believe that the current solution has significant advantages over using JDBC connections and already solves major pain points of Vertica-Hadoop users.

An under-performing connector requires far more resources – machine and human – to bring together data from various sources. With an efficient connector, users can transfer data in real time and on a whim, allowing them to use the right tool for the job.

If you would be interested in our private beta for the HDFS connector, please send a note to with a brief description of your use-case.

Introducing Vertica 6

by Shilpa Lawande & Luis Maldonado

Big data is all the rage these days.  It is incredible to watch this segment mature- even in just the past few years.  Of course, Vertica continues to evolve and innovate to keep up with the challenges and demands of this movement, especially when it comes to analytics.  The three greatest challenges customers face with regards to Big Data are 1) the volume and velocity of data, 2) the rapidly growing variety of disparate sources of data, and 3) the complex analytics that are required for maximizing the value of information derived from data.  In Vertica 6, we address all of these and more!

It has been just over two and a half years since Vertica first released FlexStore™, our patented architecture for flexible storage and placement of data based on usage patterns. During this time, we have been hard at work to evolve our flagship database product from a blazingly  fast database into a rich and flexible analytic platform.  Our high performance MPP columnar engine combined with our suite of built-in analytic functions including time-series, event-series pattern matching, C++ SDK, and more have enabled over 600 customers to transform their businesses by harnessing the power of large scale real-time analytics.  Last week at HP Discover, we raised the bar once again, with the release of Vertica 6, including the industry’s first and only open analytics architecture that enables broad analytics over any data source, structured, semi-structured or unstructured.

Let’s review some of the highlights of Vertica 6.

  • Vertica 6 FlexStore™ has been expanded to allow access to any data, stored at any location, via any interface, providing a comprehensive architecture to handle the requirements of Big Data – open, extensible, and flexible.  Using our patent-pending User-Defined Load (UDL) feature, you can now connect natively to popular storage systems like Hadoop File System (HDFS), existing databases and data warehouses, as well as unstructured analysis platforms such as HP/Autonomy IDOL™.  Further, by combining our External Tables capability with UDL, you can access those data sources in a federated manner at query time, without pre- loading data into Vertica. Now, whether you are just exploring your data in HDFS to find the right data model or building a production application, you can use the same robust Vertica SQL analytics you know and love, and use your favorite reporting and visualization tools to slice and dice the data!
  • Vertica 6 includes high performance and massively scalable in-database analytics for the R statistical tool.  Data scientists and analysts can now run their favorite statistical algorithms written in R natively and in parallel on Vertica, without the in-memory and single-threading limitations of R!  We’ve also expanded our C++ SDK to add secure sandboxing of user-defined code, and additional APIs for user-defined aggregates, analytics and multi-phase transform functions that can enable Map-Reduce style computations to be done in Vertica.
  • Vertica 6 simplifies the out-of-the-box user experience and enables more diverse workloads with our improvements to Workload Management.  An early adopter of Vertica 6 experienced a 40x speed up on some of their queries.  Whether you are running complex analytics or short operational BI workloads, Vertica’s workload management capabilities can effectively balance all system resources to meet your SLAs.
  • In 2007, Vertica was the first analytics platform to run on the Cloud. With Vertica 6 we extended this lead in significant ways. First, we announced a private beta of Vertica on HP’s new public Cloud. Further, we extended the core platform with a number of enhancements such as Object-level Backup & Restore and Schema evolution that make it much easier to develop and deploy multi-tenant applications using Vertica.
  • Last but not the least, we continue to invest in our core platform with a  number of enhancements to overall query and load performance, elastic cluster, monitoring views, database designer, security and more.

For a quick overview of Vertica 6, we highly recommend watching this video interview of Luis Maldonado, Director of Product Management, from last week’s HP Discover show, or reading What’s new in Vertica 6 Enterprise Edition.  Watch this space over the next few days, for more articles that drill down into specific features of Vertica 6!

We at Vertica are very excited about Vertica 6 and the role it plays in HP’s vision for Cloud, Security and Information Optimization.  We hope you will try it for yourselves and tell us your stories about how you used it to Make it Matter for your business!

Over the next few days, watch this space for articles that drill down into specific features of Vertica 6.

Announcing the Vertica Community Edition

by Colin Mahony, VP of Products & Business Development
and Shilpa Lawande, VP of Engineering

Vertica has had an amazing journey since it was founded in 2005. We’ve built a great product, a great team and an incredibly strong and loyal customer base and partner ecosystem. When we first started, no one had even heard of a column store, and today, ‘Big Data Analytics’ is taking the industry by storm. Every day we see companies – big and small – in industries from retail to gaming becoming more data-driven and doing amazing things with the help of analytics. We feel proud and humbled to see the transformation impact the Vertica Analytics Platform has had on our customers’ businesses, and we believe the time has come to broaden access to our technology to a wider Big Data community.

Today, we are truly excited to announce the Vertica Community Edition beta program! The Vertica Community Edition will offer many of the same features as the enterprise edition of the Vertica Analytics Platform to anyone who wants to discover the power of Vertica.  And, as part of the Community Edition beta announcement, we are developing a new MyVertica Community portal which will provide a platform for Vertica users and partners to interact and share knowledge and code with the entire Vertica user community.

Vertica has always been a customer-driven company and we couldn’t have built Vertica without ideas, feedback and guidance from our customers and partners. We hope that the Vertica community will play a similar role going forward – sharing ideas and best practices and providing candid feedback about the product and how it can be made richer and simpler to use.  The MyVertica community portal will feature product downloads, forums, documentation,  training materials, FAQs and best practice guides. We will also be maintaining a GitHub code repository where community users will be able to share code samples, user-defined extensions built using our SDK, adapters to 3rd party products, and more. We hope that with the Community Edition, we take a small step towards our vision of democratizing data and making data and analytics accessible to all!

To register for the Vertica Community Edition beta program, simply visit and complete the registration form.  The beta program will be limited initially, but full availability of the Vertica Community Edition software is expected by the end of the year.

On behalf of Vertica and HP, we are excited to contribute something back to the Vertica Community.  We sincerely invite you to join and contribute, and we can’t wait to see the many cool things you will do with Big Data and Vertica!

Shilpa & Colin

Vertica at BIRTE 2011: Social Graph Analytics

A few days ago, Lakshmikant Shrinivas and I gave an invited talk at BIRTE 2011.  Rather than yet another talk on Vertica 101, we chose to talk about Social Graph Analytics, a topic we are truly very excited about because it is something we are learning from our customers! And of course, it is uber cool to talk about Cityville and Zynga these days!  We presented the Zynga usecase  – how to find the influencers among the active users of a game and several other graph problems Vertica has very successfully solved with SQL.  More about these in future blog posts.

The most fun part of this talk (according to the audience) was that it was multi-threaded – we weaved a real-time demo through the talk. The demo was in two parts. The first part calculated an approximate 4-core of a graph of 90M nodes and 405M edges representing active users and their interactions in a game.  A k-core of a graph picks out a subgraph where every node has at least k neighbors in the subgraph – some research shows that the set of influencers in a social network is often a 3 or 4-core. The demo was run on 4 blades of the HP/Vertica Analytics System with 12 cores and 96GB RAM each. The animation below show an approximate visualization of the actual graph with 10K nodes and 45K edges (we could not find any graph visualizing tool that could draw a graph of 405M edges!) and how the algorithms iteratively reduces it to find the 4-core. The actual computation (which cannot be visualized) took about 1 minute to process 1TB of in-game event data to compute the initial graph of 90M active users and only 4.5 minutes to run 8 iterations resulting in the 4-core of 34K users!

The second part showed how this graph of influencers could be used to do A/B testing for in-game product placement. We simulated giving a group of users from the graph of influencers , Coke, and a group of randomly chosen users, Pepsi, and loading their in-game interactions data into Vertica, every 15 seconds. In the 5-minutes or so it took us to describe the setup, you could see how the relative penetration of the two products changed in real-time.  This was really a demo of Vertica’s awesome real-time analytics engine – we loaded data continuously at the rate of 40GB/minute (on 4 nodes with  2 copies of the data), which translates to 20TB/hr on the full rack HP/Vertica Analytics System.  That’s fast, eh?!!

This little demo and what our customers are doing in the real world, makes for a very convincing case that SQL can be used to express many practical graph problems. Vertica’s MPP architecture and sorted projections provide a highly scalable and fast engine for iterative graph algorithms. Such problems are relevant not only to the Zyngas and Facebooks of the world but to any company that has a connected user community. From online forums to friend-and-family call logs, social networks are everywhere and finding influencers is the key to monetizing them. Gartner believes that social CRM will be a $1B market by end of 2012!

We hope that our talk planted a small seed for the database research community to more formally explore SQL solutions to graph problems!  Watch for more blogs from Vertica on solving graph problems with SQL.

The Power of Projections – Part 3

By Colin Mahony and Shilpa Lawande

Part III – Comparing and Contrasting Projections to Materialized Views and Indexes

In Part I and Part II of this post, we introduced you to Vertica’s projections, and described how easy it is to interface with them directly via SQL or through our Database Designer™ tool.  We will now end this series by comparing and contrasting Vertica’s projections with traditional indexes and materialized views.

Row-store databases often use Btree indexes as a performance enhancement.  Btree indexes are designed for highly concurrent single-record inserts and updates, e.g. an OLTP scenario. Most data warehouse practitioners would agree that index rebuilds after a batch load are preferable to taking the hit of maintaining them record by record given the logging overhead. Bitmap indexes are designed for bulk loads and are better than btrees for data warehousing but only suitable for low cardinality columns and a certain class of queries.  Even though you have these indexes to help find data, you still have to go to the base table to get the actual data, which brings with it all the disadvantages of a row store.

In a highly simplified view, you can think of a Vertica projection as a single level, densely packed, clustered index which stores the actual data values, is never updated in place, and has no logging. Any “maintenance” such as merging sorted chunks or purging deleted records is done as automatic and background activity, not in the path of real-time loads.  So yes, projections are a type of native index if you will, but they are very different from traditional indexes like Bitmap and Btrees.

Vertica also offers a unique feature known as “pre-join projections”. Pre-join projections denormalize tables at the physical layer under the covers providing a significant performance advantage over joining tables at run-time.  Pre-join projections automatically store the results of a join ahead of time, yet the logical schema is maintained – again, flexibility of the storage structure without having to rewrite your ETL or application.  Vertica can get away with this because it excels at sparse data storage, and in particular isn’t penalized at all for null values nor for wide fact tables. Since Vertica does not charge extra for additional projections, this is a great way to reap the benefits of denormalization without the need to purchase a larger capacity license.

So to sum up, here’s how Vertica projections stack up versus materialized views and conventional indexes.

Vertica’s Projections

Traditional Materialized Views

Traditional Indexes

  • Are primary storage – no base tables are required
  • Are secondary storage
  • Are secondary storage pointing to base table data
  • Can be segmented, partitioned, sorted, compressed and encoded to suit your needs
  • Are rigid: Practically limited to columns and query needs, more columns = more I/O
  • Support one clustered index at most – tough to scale out
  • Have a simple physical design
  • Use Aggregation losing valuable detail
  • Require complex design choices
  • Are efficient to load & maintain
  • Are mostly batch updated
  • Are expensive to update
  • Are versatile – they can support any data model
  • Provide high data latency
  • Provide high data latency
  • Allow you to work with the detailed data
  • Provide near-real time low data latency
  • Combine high availability with special optimizations for query performance

That’s pretty much all there is to it.  Whether you are running ad-hoc queries or canned operational-BI workloads, you will find projections to be a very powerful backbone for getting the job done!

Read the rest of the 3-part series…

The Power of Projections – Part 1: Understanding Projections and What They Do
The Power of Projections – Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™

The Power of Projections – Part 2

By Colin Mahony and Shilpa Lawande

Part II – Understanding the Simplicity of Projections and the Vertica Database Designer™

In Part I of this post, we introduced you to the simple concept of Vertica’s projections.  Now that you have an understanding of what they are, we wanted to go into more detail on how users interface with them, and introduce you to Vertica’s unique Database Designer tool.

For each table in the database, Vertica requires a minimum of one projection, called a “superprojection”. A superprojection is a projection for a single table that contains all the columns and rows in the table.  Although the data may be the same as a traditional base table, it has the advantages of segmentation (spreading the data evenly across the nodes in the cluster), sorting, and encoding (compressing the size of the data on a per column basis).  This leads to significant footprint reduction as well as load and query performance enhancements.  To give you a sense of the impact that Vertica’s projections have on size, most Vertica customers have at least a 50% reduction in footprint thanks to our compression.  This includes the high availability copy and on average 3-5 projections.  Again, contrast this to traditional row-store databases ballooning upwards of 5x their original size and that is a 10:1 difference in Vertica.

To get your database up and running quickly, Vertica automatically creates a default superprojection for each table created through the CREATE TABLE and CREATE TEMPORARY TABLE statements. This means that if database admins and users never want to know about a projection, they don’t have to – Vertica will automatically handle it under the covers. To further illustrate this point, users can simply pass in projection parameters such as Order By, Encodings, Segmentation, High Availability, and Partitioning right after the CREATE TABLE statement, never interfacing directly with a projection under the hood.

By creating a superprojection for each table in the database, Vertica ensures that all SQL queries can be answered. Default superprojections alone will do far better than a row-store, however, by themselves they may not fully optimize database performance and Vertica’s full potential.  Vertica recommends that you start with the default projections and then use Vertica’s nifty Database Designer™  to optimize your database.  Database Designer creates new projections that optimize your database based on its data statistics and the queries you use. Database Designer:

1. Analyzes your logical schema, sample data, and sample queries (optional).
2. Creates a physical schema design (projections) in the form of a SQL script that can be deployed automatically or manually.
3. Can be used by anyone without specialized database knowledge (even business users can run Database Designer).
4. Can be run and re-run anytime for additional optimization without stopping the database.

Designs created by the Database Designer provide exceptional query performance. The Database Designer uses sophisticated strategies to provide excellent ad-hoc query performance while using disk space efficiently. Of course, a proficient human may do even better than the Database Designer with more intimate knowledge of the data and the use-case – a small minority of our customers prefer to do manual projection design and can usually get a good feel for it after working with the product for a few weeks.

We’ve heard people ask if we need a projection for each query in Vertica, which we absolutely do not! Typically our customers use 3-5 projections and several are using the single superprojection only. A typical customer would have the superprojection along with a few smaller projections (often comprised of only a few columns each).  Unlike MVs and indexes, projections are cheap to maintain during load and due to Vertica’s compression, the resulting data size tends to be 5-25x smaller than the base data. Depending on your data latency needs (seconds to minutes) and storage availability you could choose to add more projections to further optimize the database.  Also important to note is that Vertica does not charge extra for projections, regardless of how many are deployed.  So whether a customer has 1 or 50 projections, their license fees are the same – entirely based on raw data.

As you can see, projections are very easy to work with, and if you are a business analyst who doesn’t know SQL/DDL, that’s okay, we created a tool that designs, deploys and optimizes the database automatically for you.  Our objective from day one has always been to enable customers to ask more questions and get faster answers from their data without having to constantly tune the underlying database.  Part III of this post goes into more detail on projections versus indexes and materialized views.

Read the rest of the 3-part series…

The Power of Projections – Part 1: Understanding Projections and What They Do
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

The Power of Projections – Part 1

By Colin Mahony and Shilpa Lawande

Part I: Understanding Projections and What They Do

Many of us here at Vertica have been amazed and frankly flattered at how much FUD our competitors are putting out there regarding Vertica’s “projections”.  Having heard some incredibly inaccurate statements about them, we’ve decided to clarify what they are, how and why we have them, and the advantages they bring.  Actually, projections are a pivotal component of our platform, and a major area of differentiation from the competition.  Most importantly, Vertica’s customers love the benefits projections bring! In an effort to provide you with as much detail as possible, this blog is broken up into three posts with Parts II and III being more technical.

First, some background. In traditional database architectures, data is primarily stored in tables. Additionally, secondary tuning structures such as indexes and materialized views are created for improved query performance.  Secondary structures like MVs and indexes have drawbacks – for instance they are expensive to maintain during data load (more detail on this in Part III).  Hence best practices often require rebuilding them during nightly batch windows, which prevents the ability to do real-time analytics.  Also, it isn’t uncommon to find data warehouse implementations that balloon to 3-6x base table sizes due to these structures. As a result, customers are often forced to remove valuable detailed data and replace it with aggregated data to solve this problem. However you can’t monetize what you lost!

Vertica created a superior solution by optimizing around performance, storage footprint, flexibility and simplicity. We removed the trade-off between performance and data size by using projections as the lynchpin of our purpose-built architecture.  Physical storage consists of optimized collections of table columns, which we call “projections”. In the traditional sense, Vertica has no raw uncompressed base tables, no materialized views, and no indexes. As a result there are no complex choices – everything is a projection!  Of course, your logical schema (we support any) remains the same as with any other database so that importing data is a cinch.  Furthermore, you still work with standard SQL/DDL (i.e. Create Table statements, etc).  The magic of projections and Vertica are what we do under the covers for you with the physical storage objects.  We provide the same benefits as indexes without all of the baggage.  We also provide an automatic tool, the Database Designer (more on this in Part II) to create projections automatically.

Projections store data in formats that optimize query execution. They share one similarity to materialized views in that they store data sets on disk rather than compute them each time they are used in a query (e.g. physical storage).  However, projections aren’t aggregated but rather store every row in a table, e.g. the full atomic detail. The data sets are automatically refreshed whenever data values are inserted, appended, or changed – again, all of this happens beneath the covers without user intervention – unlike materialized views. Projections provide the following benefits:

  • • Projections are transparent to end-users and SQL. The Vertica query optimizer automatically picks the best projections to use for any query.
  • • Projections allow for the sorting of data in any order (even if different from the source tables). This enhances query performance and compression.
  • • Projections deliver high availability optimized for performance, since the redundant copies of data are always actively used in analytics.  We have the ability to automatically store the redundant copy using a different sort order.  This provides the same benefits as a secondary index in a more efficient manner.
  • • Projections do not require a batch update window.  Data is automatically available upon loads.
  • • Projections are dynamic and can be added/changed on the fly without stopping the database.

In summary, Vertica’s projections represent collections of columns (okay so it is a table!), but are optimized for analytics at the physical storage structure level and are not constrained by the logical schema.  This allows for much more freedom and optimization without having to change the actual schema that certain applications are built upon.

Hopefully this gave you an overview of what projections are and how they work.  Please read Part II and Part III of this post to drill down into projections even further.

Read the rest of the 3-part series…

The Power of Projections – Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

Get Started With Vertica Today

Subscribe to Vertica