
Archive for the ‘Hadoop’ Category

Avro parser UDx – Using Apache Avro to enable easier data transfer from Hadoop to Vertica

After careful research and brainstorming of ideas for the intern UDx competition, we decided to implement an Avro parser UDx. Our team, “The Avro-rian Revolutionaries,” wanted to build something useful, ready to use, and among the top three items on customers’ wish lists. And what better than an Avro parser to help users easily transfer data from Hadoop to Vertica? (This Avro parser UDx package is now available on GitHub [6], and Vertica users are encouraged to try it out!)

Apache Avro [1] is a data serialization format widely used in the Hadoop world. It is a newer serialization format that follows in the footsteps of Thrift [2] and Protocol Buffers [3], and according to some technologists it is the best data serialization framework out there [4]. This was good motivation for us to implement an Avro parser for the intern competition, in the hope of making it feasible to import Avro data into Vertica.

Figure 1. Hadoop, Avro, Avro UDx and Vertica workflow

With this motivation, we began day 1 of the 5-day intern competition. The first milestone was to get a standalone Avro parser working: a basic parser (still no Vertica in the picture) that simply reads an Avro file and prints out the header and data in text format. The Avro APIs were our means to do it, and by referring to the basic documentation [5] we quickly came up with a parser that could dump the contents of a sample Avro file in text format, as in Figure 2.

Figure 2: weather.avro sample file in text format.

We spent day 2 of the competition learning the Vertica SDK, the next tool of the trade. There were some great examples already out there on GitHub. We picked a simple example UDx and began playing with it. Once we had loaded, tested, and run this UDx, we started learning the SDK interfaces required for loading data into Vertica. One important interface is UDParser, which parses a stream of bytes into Vertica in parallel. Very quickly we were able to use this to develop a UDx skeleton, ready to be integrated with the module developed on day 1.

On day 3, midway through the competition, we had the most important milestone to achieve: integrating the standalone Avro parser developed on day 1 with the parser UDx skeleton developed on day 2. This was the point where we got stuck and suffered an unexpected setback. After talking to our mentors, we discovered that there is an interface gap between the Avro file reader API and Vertica’s UDParser interface. To fill this gap we developed a couple of modules, CRReader and CRStream, which successfully addressed the issue.

On day 4 we finished integrating the modules, and finally the moment of judgment arrived: our first test of loading a weather.avro file into Vertica, which exercised most of the code we had written. We did not have to hold our breath for long. Within a fraction of a second the data was loaded into Vertica. We could hardly believe that the three modules we had written in three days were working together like parts of an engine. The magic of UDx was happening, and the Avro file was successfully loaded into Vertica (Figure 3).

Figure 3: Demo screenshot
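
For those who want to try it, installing and invoking a parser UDx from SQL looks roughly like the sketch below. The library path, factory name, and table definition here are illustrative assumptions; see the GitHub package [6] for the actual install script and factory name.

-- hypothetical install-and-load sequence
CREATE LIBRARY AvroParserLib AS '/home/dbadmin/libAvroParser.so';
CREATE PARSER AvroParser AS NAME 'AvroParserFactory' LIBRARY AvroParserLib;
CREATE TABLE weather (station VARCHAR(32), ts BIGINT, temp INT);
COPY weather FROM '/tmp/weather.avro' WITH PARSER AvroParser();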

On day 5, the last day of the competition, we put all our effort into testing and packaging the UDx. We wanted a quality product that customers would be able to use by the end of the competition.

Finally, we presented our work alongside the other interns in front of a fully packed room, with an audience from all departments of Vertica. This was a unique experience in itself, because beyond the technical dimension we had to present the work in a format that would appeal to audiences with different perspectives. At the end of the day we were happy that we had learned lots of new things, collaborated with senior mentors, and received great feedback and comments on our work, all of which made the competition a great success! And now, seeing our UDx parser available on GitHub [6] and ready to use gives us great satisfaction at getting one step closer to the Avro-rian revolution!

References:
[1] http://avro.apache.org/docs/1.7.1/
[2] http://wiki.apache.org/thrift/FrontPage
[3] http://code.google.com/p/protobuf/
[4] http://www.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/
[5] http://avro.apache.org/docs/1.6.1/api/cpp/html/index.html
[6] https://github.com/vertica/Vertica-Extension-Packages/tree/master/avro_parser

Teaching the elephant new tricks

by Shilpa Lawande & Rajat Venkatesh

Introduction

A growing number of Vertica customers use Hadoop as an ETL/pre-processing tool or HDFS as a “data parking lot” – either to stage files before they are loaded or to simply store data whose worth is yet to be proven. For several years, Vertica has provided a Hadoop Connector that enables bi-directional connectivity between the two systems. In this post, we evaluate the pros and cons of the current approach and discuss a new technique enabled by the newly released Vertica 6.

Current Design

The most efficient topology of network connections is shown in the diagram above. There is one database connection for each partition (or subset of partitions), and all of these connections transfer data in parallel. Most solutions on the market, including the current Vertica-Hadoop connector, follow this design: they open a JDBC connection for each partition (or subset of partitions) and execute a batch insert to transfer the data. Some products, like Vertica, may optimize the transfer by using a bulk-load API or by transforming the data to avoid using resources on the database. Apache Sqoop is the end result of this evolution, providing a platform to implement generic or optimized connectors for all databases that offer a JDBC driver.

Problems

The architecture described above has some inherent drawbacks, owing to a fundamental impedance mismatch between Vertica and Hadoop.

Even though Vertica and Hadoop are both MPP systems, there is no good way to coordinate the degree of parallelism across the two, resulting in inefficient use of the combined resources. Even a moderately large Hadoop cluster can overwhelm a database with too many JDBC connections. Therefore, it is common best practice to reduce the number of reduce (or map) tasks that connect to the database.

Vertica’s customers typically transfer a large file (with many HDFS chunks) or a set of files into one table. Each of the resulting JDBC connections initiates a new transaction, overwhelming the transaction module with too many actors. Since each connection is its own transaction, it is not possible to roll back the transfer as a whole, and a failed transfer may leave partial data behind – the worst kind of failure. A common best practice is to first load the data into a temporary table, verify the transfer, and then move it to its final destination table. Another common strategy is to add a column that identifies the rows inserted in a transfer, verify the rows with the unique id, and then keep or delete them. Depending on the database and storage engine, verifying and moving the data may take up significant resources. Regardless of the approach or database, there is significant management overhead for loading data from Hadoop. A majority of Vertica’s customers use Vertica precisely so that they don’t have to worry about ACID for bulk loads; they don’t want to be in the business of writing and maintaining complex scripts.
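
In SQL terms, the staging pattern looks roughly like this sketch (table names are hypothetical):

-- stage, verify, then publish
CREATE TABLE lineitem_staging AS SELECT * FROM lineitem LIMIT 0;
-- ... Hadoop tasks bulk-insert into lineitem_staging ...
-- after verifying row counts, publish and clean up:
INSERT INTO lineitem SELECT * FROM lineitem_staging;
COMMIT;
DROP TABLE lineitem_staging;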

In our survey of customers, we found that many of them have come up with creative solutions to the “too many connections” problem. By popular demand, we are in the process of creating a repository for the Hadoop connector source on GitHub, and we hope to see some of these solutions contributed there. By far the most preferred solution, however, is to simply write the results of an MR job to HDFS files, then transfer those files to Vertica and load them. And that’s what we’ve chosen to do in our new design as well.

New Design

In the new HDFS connector, currently in private beta, we have taken the first step toward solving the problems described earlier by allowing direct loads of files from HDFS into Vertica, thereby eliminating the non-ACID behaviors caused by multiple connections. Further, we solve the problem of too many connections by using one TCP/IP connection per file in HDFS. How did we do this? Vertica 6 introduced User-Defined Loads (UDL), which exposes APIs that extend Vertica’s native load command, COPY, to load from any source and any format. With a few days of effort, using the UDL API and the Apache Hadoop REST API for HDFS files, we wrote a plugin that lets COPY read HDFS files. The benefits are obvious: COPY can process many files within the same transaction, in parallel, across the Vertica cluster. Parallelism is limited only by the hardware of the machines, and since everything happens in one transaction, a failure results in a complete rollback.

The resulting network topology looks like the diagram below.

Let’s see how the HDFS connector works using some examples.

Imagine that there are 6 files in a directory in HDFS:

-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem1.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem11.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem12.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem2.tbl
-rw-r--r--   1 hadoop supergroup  759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem31.tbl

The Vertica COPY command to load lineitem.tbl works as follows:

COPY lineitem SOURCE
Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem.tbl');

If you want to load all 6 files, you simply use the glob (*) feature. In this case, the 6 files are loaded into Vertica and processed in parallel across the cluster.

COPY lineitem SOURCE
Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem*');

In this example, lineitem.tbl is a tab-delimited file. What about all the other popular file formats? User-Defined Load provides an API to plug in a parser for any file format, as well as APIs to decompress, decrypt, or perform any arbitrary binary transformation. Check out this blog post showing how UDL can be used to process image file data.
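
A UDL parser can also be combined with the HDFS source in a single statement. A sketch, where MyParser stands in for any parser previously installed with CREATE PARSER:

COPY lineitem SOURCE
Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem.tbl')
PARSER MyParser();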

Further, with Vertica 6 external tables, in combination with the HDFS connector, you can now use Vertica SQL to analyze data in HDFS directly (see the “Vertica Analytics Anywhere” blog post). This means that while you are in the exploratory stage with some data, you can experiment with different data models and still use SQL analytics (or the Vertica R SDK) to plough through the data.

CREATE EXTERNAL TABLE hdfs.lineitem (<list of columns>) AS COPY
FROM Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/*');
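
Once defined, the external table can be queried like any other table, for example (assuming the TPC-H lineitem columns):

SELECT l_returnflag, COUNT(*) FROM hdfs.lineitem GROUP BY l_returnflag;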

Summary

“Wait a minute! Most chunks are sent to another Hadoop datanode and then to Vertica. This is worse than what we had before!” Yes, we have exchanged expensive JDBC connections for one extra hop for most chunks. However, the new strategy is still faster and maintains transaction semantics. The next step is to enable COPY to parse data chunks in parallel, but we believe that even the current solution has significant advantages over using JDBC connections and already solves major pain points for Vertica-Hadoop users.

An under-performing connector requires far more resources – machine and human – to bring together data from various sources. With an efficient connector, users can transfer data in real time and on a whim, allowing them to use the right tool for the job.

If you would be interested in our private beta for the HDFS connector, please send a note to beta@vertica.com with a brief description of your use-case.

Counting Triangles

by Stephen Walkauskas

Recently I’ve heard from or read about people who use Hadoop because their analytic jobs can’t achieve the same level of performance in a database. In one case, a professor I visited said his group uses Hadoop to count triangles “because a database doesn’t perform the necessary joins efficiently.”

Perhaps I’m being dense, but I don’t understand why a database doesn’t efficiently support these use cases. In fact, I have a hard time believing they wouldn’t perform better in a columnar MPP database like Vertica – where memory and storage are laid out and accessed efficiently, queries are automatically tuned by the optimizer, and expression execution is vectorized at run-time. There are additional benefits when several similar jobs are run, or when data is updated and the same job is re-run multiple times. Of course, performance isn’t everything; ease of use and maintainability are important factors at which Vertica excels as well.

Since the “gauntlet was thrown down,” to steal a line from Good Will Hunting, I decided to take up the challenge of computing the number of triangles in a graph (and to include the solutions on GitHub so others can experiment – more on this at the end of the post).

Problem Description

A triangle exists when a vertex has two adjacent vertexes that are also adjacent to each other. Using friendship as an example: If two of your friends are also friends with each other, then the three of you form a friendship triangle. How nice. Obviously this concept is useful for understanding social networks and graph analysis in general (e.g. it can be used to compute the clustering coefficient of a graph).

Let’s assume we have an undirected graph with reciprocal edges, so there’s always a pair of edges ({e1,e2} and {e2,e1}). We’ll use the following input for illustration (reciprocal edges are elided to condense the information):

source   destination
Ben      Chuck
Ben      Stephen
Chuck    Stephen
Chuck    Rajat
Rajat    Stephen
Andrew   Ben
Andrew   Matt
Andrew   Pachu
Matt     Pachu
Chuck    Lyric
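
Incidentally, if a raw edge list is one-directional, making it reciprocal inside a database is a one-liner; a sketch, assuming an edges(source, destination) table:

INSERT INTO edges SELECT destination, source FROM edges;
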
A little ascii art to diagram the graph might help.

I know you can quickly count the number of triangles. I’m very proud of you, but imagine there are hundreds of millions of vertexes and tens of billions of edges. How long would it take you to diagram that graph? And how much longer to count all of the triangles? And what if your 2-year-old daughter barges in counting “one, two, three, four, …” and throws off your count?

Below we present a few practical solutions for large scale graphs and evaluate their performance.

The Hadoop Solution

Let’s consider first the Hadoop approach to solving this problem. The MapReduce (MR) framework implemented in Hadoop allows us to distribute work over many computers to get the count faster. The solution we describe here is a simplified version of the work at Yahoo Research. You can download our solution here.

Overview

The solution involves a sequence of 3 MR jobs. The first job constructs all of the triads in the graph. A triad is formed by a pair of edges sharing a vertex, called its apex. It doesn’t matter which vertex we choose as the apex of the triad, so for our purposes we’ll pick the “lowest” vertex (e.g., friends could be ordered alphabetically by their names). The Yahoo paper makes a more intelligent choice of “lowest” – the vertex with the smallest degree. However, that requires an initial pass over the data (and more work on my part), so I skipped that optimization, consistently for all solutions, to ensure fairness.

These triads and the original edges are emitted as rows by the first MR job, with a field added to distinguish the two. Note that the output of the first job can be quite large, especially for a dense graph. This output is consumed by the second MR job, which partitions the rows by the unclosed edge if the row is a triad, or by the original edge otherwise. A partition contains n triangles if it contains an original edge and n triads. A third, trivial MR job counts the triangles produced by the second job to produce the final result.

Details

Let’s look at each MR job in detail. The map part of the first job emits a key-value pair for each edge, with the edge’s “lowest” vertex (the prospective apex) as the key and the edge itself as the value. In our small example the map job would emit the following rows.

key      value
Andrew   Andrew, Matt
Andrew   Andrew, Pachu
Andrew   Andrew, Ben
Matt     Matt, Pachu
Ben      Ben, Chuck
Ben      Ben, Stephen
Chuck    Chuck, Rajat
Chuck    Chuck, Lyric
Chuck    Chuck, Stephen
Rajat    Rajat, Stephen

For each apex partition, the reduce job emits the original edges and all of the corresponding triads. (There are d(d-1)/2 triads per partition – the sum of (j-1) for j = 1 to d – where d is the degree of the vertex at the apex; for example, Andrew is the apex of three edges, yielding three triads.) For each original edge, the key is the edge itself and the value is “edge”. For each triad, the key is the unclosed edge, in other words the edge needed to complete the triangle, and the value is “triad”. The actual code uses “0” for the edge value and “1” for the triad value for run-time efficiency.

The rows corresponding to the triads emitted by this reduce job in our simple example are shown below in the “key” and “value” columns (the original edges are also emitted by the reduce job but are elided below for brevity). For presentation purposes we added a third column, “triad content”; that column is not produced by the actual reduce job.

key              value   triad content
Ben, Matt        triad   {Andrew, Ben}, {Andrew, Matt}
Ben, Pachu       triad   {Andrew, Ben}, {Andrew, Pachu}
Matt, Pachu      triad   {Andrew, Matt}, {Andrew, Pachu}
Chuck, Stephen   triad   {Ben, Chuck}, {Ben, Stephen}
Lyric, Rajat     triad   {Chuck, Lyric}, {Chuck, Rajat}
Lyric, Stephen   triad   {Chuck, Lyric}, {Chuck, Stephen}
Rajat, Stephen   triad   {Chuck, Rajat}, {Chuck, Stephen}

The input to the next reduce job is partitioned such that the unclosed edge of each triad is in the same partition as its corresponding original edge, if any. The reduce job just needs to check for the existence of an original edge in that partition (i.e., a row with value set to “edge”). If it finds one, all of the triads in the partition are closed as triangles. The reduce job sums up all of the closed triads and on finalize emits a count. A trivial final MR job aggregates the counts from the previous job.

There: we’ve used MapReduce to count the number of triangles in a graph. The approach isn’t trivial, but it’s not horribly complex either. And if it runs too slowly we can add more hardware: each machine does less work and we get our answer faster.

Experiences with Hadoop

I have to admit it took me much longer than I estimated to implement the Hadoop solution. Part of the reason is that I’m new to the API, which is exacerbated by the fact that there are currently two APIs, one of them deprecated and the other incomplete, forcing use of portions of the deprecated API. Specifically, the examples I started with were unfortunately based on the deprecated API, and when I ported to the newer one I ran into several silly but somewhat time-consuming issues (like mapred’s version of Reducer.reduce taking an Iterator while mapreduce’s version takes an Iterable – they look similar to the human eye, but the compiler knows that a method that takes an Iterator should not be overridden by one that takes an Iterable). Learning curve aside, there was a fair chunk of code to write. The simple version is over 200 lines. In a more complex version I added a secondary sort to the MR job that computes triads; doing so introduced several dozen lines of code (most of it brain-dead stuff like implementing a Comparable interface). Granted, a lot of the code is cookie-cutter or trivial, but it still needs to be written (or cut-and-pasted and edited). In contrast, adding a secondary sort column in SQL takes a mere few characters of extra code.

The PIG Solution

Rajat Venkatesh, a colleague of mine, said he could convert the algorithm to a relatively small PIG script, and he wagered a lunch that the PIG script would outperform my code. He whipped up what was eventually a 10-statement PIG script that accomplished the task. When we get to the performance comparison, we’ll find out who got a free lunch.

Here’s the PIG solution, much simpler than coding MR jobs by hand. We used PIG 0.8.1. We made several passes over the script to optimize it, following the PIG Cookbook. For example, we rearranged the join order and put the larger table last (I’m probably not giving too much away by mentioning that Vertica’s optimizer uses a cost model which properly chooses join order). We also tried several values for default_parallel and mapreduce.job.maps (and we changed the corresponding parameter in mapred-site.xml as well, just to be safe). We did not enable lzo compression, for two reasons. First, considering the hardware used for the experiment (large RAM, so plenty of file system cache, and a high-throughput network), the CPU tax incurred by compression was more likely to hurt performance than help in this case. Second, one website listed 7 steps to get the compression working, but the 2nd step had several steps itself, so I gave up on it.

The Vertica Solution

Can you count the number of triangles in a graph using a database? Of course. First create an “edges” table and load the graph. Vertica can automate the decision about how to organize storage for the table – something called projections specifies important characteristics of physical storage, such as sort order, segmentation, encoding, and compression. In this case we simply tell Vertica how to distribute data among the nodes in our cluster (Vertica calls this segmenting). Alternatively, the Vertica Database Designer can be used to automate projection design. The following statements create our table and load the data.
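
Something like the following sketch, where the types, file path, and segmentation are illustrative (the LiveJournal data set used later identifies vertexes by integer ids):

CREATE TABLE edges (source INT, dest INT)
SEGMENTED BY HASH(source, dest) ALL NODES;
COPY edges FROM '/data/edges.txt' DELIMITER E'\t' DIRECT;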

We’ve got the data loaded and stored in an efficient way. If we need to run more jobs with the same data later we won’t incur the load cost again. Likewise, if we need to modify the data we only incur work proportional to the change. Now we just need a horribly complex hack to count triangles. Take a deep breath, stretch, get a cup of coffee, basically do what you have to do to prepare your brain for this challenge. Ok, ready? Here it is:
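
A sketch of the query, assuming the edges(source, dest) table above:

SELECT COUNT(*) FROM edges e1
JOIN edges e2 ON e1.dest = e2.source AND e1.source < e2.source
JOIN edges e3 ON e2.dest = e3.source AND e2.source < e3.source
WHERE e3.dest = e1.source;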

Good, you didn’t run away screaming in horror. If we ignore the less-than predicates, the query is simply finding all triplets that form a cycle, v1 -> v2 -> v3 -> v1. The less-than predicates ensure we don’t count the same triangle multiple times (remember, our edges are reciprocal, and we only need to consider triangles with the “lowest” vertex at the apex).

That’s it! A single four-line query. Of course you’re interested in what the Vertica database does under the covers, and how its performance, disk utilization, and scalability compare with those of Hadoop and PIG.

Performance Study

The publicly available LiveJournal social network graph (http://snap.stanford.edu/data/soc-LiveJournal1.html) was used to test performance. It was selected because of its public availability and its modest size, which permitted relatively quick experiments. The modified edges file (in the original file, not every edge is reciprocated) contained 86,220,856 edges, about 1.3GB in raw size. We used HDFS dfs.replication=2 (replication=1 performed worse – fewer map jobs were run, almost regardless of the mapreduce.job.maps value). Experiments were run on between 1 and 4 machines, each with 96GB of RAM, 12 cores, and a 10Gbit interconnect.

Run-Time Cost

All solutions were manually tuned to obtain the best performance numbers. For the Hadoop and PIG solutions, the number of mappers and reducers, as well as the code itself, were tweaked to optimize performance. For the Vertica solution, out-of-the-box Vertica is configured to support multiple users; the default expectation is 24 concurrent queries for the hardware used. This configuration was tweaked to further increase pipeline parallelism (equivalent configuration settings will be on by default in an upcoming release). The following chart compares the best performance numbers for each solution.

PIG beat my Hadoop program, so my colleague who wrote the PIG script earned his free lunch. One major factor is PIG’s superior join performance: it uses a hash join. In comparison, the Hadoop solution employs a join method very close to a sort-merge join.

Vertica’s performance wasn’t even close to that of Hadoop – thankfully. It was much, much better. In fact, Vertica ate PIG’s and Hadoop’s lunch: its best time is 22x faster than PIG’s and 40x faster than the Hadoop program (even without configuration tweaks, Vertica beats the optimized Hadoop and PIG programs by more than a factor of 9x in comparable tests).

Here are a few key factors in Vertica’s performance advantage:

  • Fully pipelined execution in Vertica, compared to a sequence of MR jobs in the Hadoop and PIG solutions, which incurs significant extra I/O. We quantify the differences in how the disk is used among the solutions below in the “disk usage” study.
  • Vectorization of expression execution, and the use of just-in-time code generation in the Vertica engine.
  • More efficient memory layout, compared to the frequent Java heap memory allocation and deallocation in Hadoop / PIG.

Overall, Hadoop and PIG are free as software, but the hardware to run them is not. With a 22x speed-up, Vertica’s performance advantage effectively equates to a 95% discount on hardware. Think about that. You’d need 1000 nodes to run the PIG job to equal the performance of just 48 Vertica nodes, which is a rack and a half of the Vertica appliance.

Finally, consider what happens when the use case shifts from counting all of the triangles in a graph to counting (or listing) just the triangles that include a particular vertex. Vertica’s projections (those things that define the physical storage layout) can be optimized so that looking up all of the edges with a particular vertex is essentially an index search (and once found, the associated edges are co-located on disk – an important detail that anyone who knows the relative cost of a seek versus a scan will appreciate). This very quickly whittles e1 and e3 down to relatively few rows, which can then participate in a merge join with e2 – all in all, a relatively inexpensive operation. On the other hand, PIG and Hadoop must process all of the edges to satisfy such a query.
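
A sketch of such a projection, with illustrative names:

CREATE PROJECTION edges_by_source (source, dest)
AS SELECT source, dest FROM edges
ORDER BY source, dest
SEGMENTED BY HASH(source) ALL NODES;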

Disk Usage

For the input data set of 1.3GB, it takes 560MB to store it in Vertica’s compressed storage. In comparison, storing it in HDFS consumes more space than the raw data size.

At run-time, here is the peak disk usage among all 3 solutions in a 4-node cluster (remember lzo was not enabled for Hadoop and PIG – turning it on would reduce disk usage but likely hurt performance).

Given the huge differences in disk usage and thus I/O work, along with other advantages outlined above it should come as no surprise that the Vertica solution is much faster.

Join Optimization

As we mentioned earlier, the Hadoop solution does not optimize for join performance. Both Vertica and PIG were able to take advantage of a relatively small edges table that fits in memory (hundreds of billions or more edges can fit in memory when distributed over tens or hundreds of machines), with a hash join implementation.

For PIG, the join ordering needs to be explicitly specified. Getting this ordering wrong may carry a significant performance penalty. In our study, the PIG solution with the wrong join ordering is 1.5x slower. The penalty is likely even higher with a larger data set, where the extra disk I/O incurred in join processing can no longer be masked by sufficient RAM. To further complicate the matter, the optimal join ordering may depend on the input data set (e.g. whether the input graph is dense or not). It is infeasible for users to manually tweak the join ordering before submitting each PIG job.

In comparison, the Vertica columnar optimizer takes care of join ordering as well as many other factors crucial to optimizing for the job run-time.

The Right Tool for the Job

Many people get significant value out of Hadoop and PIG, including a number of Vertica’s customers who use these tools to work with unstructured or semi-structured data – typically before loading that data into Vertica. The question is which tool is best suited to solve your problem. With User-Defined Functions, Aggregates, Load, et cetera available or coming soon to Vertica, the lines are becoming blurred, but when it comes to performance the choice is crystal clear.

In the case of triangle counting as we presented above, the Vertica solution enjoys the following advantages over Hadoop and PIG:

  • Ease of programming and maintenance, in terms of both ensuring correctness (the Vertica SQL solution is simpler) and achieving high performance (the Vertica optimizer chooses the best execution plan)
  • Compressed storage
  • Orders-of-magnitude faster query performance

Do Try this at Home

It is a relatively safe experiment (unlike slicing a grape in half and putting it in the microwave – don’t try that one at home). We’ve uploaded all three solutions to GitHub. Feel free to run your own experiments and improve on our work. As it stands, the project includes a build.xml file which runs the Hadoop and PIG solutions in standalone mode – the project README file describes these targets and more in detail. With a little more work, one can configure a Hadoop cluster and run the experiments in distributed mode, which is how we ran the experiments described above.

It’s a little more difficult to run the tests if you are not currently a Vertica customer, but we do have a free trial version of the Vertica Analytics Platform software.

Acknowledgements

Many thanks to Rajat Venkatesh for writing the PIG script (though I already thanked him with a lunch) and Mingsheng Hong for his suggestions, ideas and edits.

The Right Tool for the Job: Using Hadoop with Vertica for Big Data Analytics

by Mingsheng Hong, Vertica Product Marketing Engineer

I have an entrepreneur friend who used to carry a butter knife around.  He claimed this “almighty” tool was the only one he ever needed!  While the butter knife does serve a wide range of purposes (especially with a stretch of the imagination), in practice it doesn’t always yield optimal results.  For example, as a screwdriver, it may work for common screws, but certainly not a Phillips (unless you push down very hard and hope not to strip the screw).  As a hammer, you may be able to drive finishing nails, but your success and mileage may vary.  As a pry bar, well, I think you get my point!  Clearly one tool isn’t sufficient for all purposes – a good toolbox includes various tools each fulfilling a specific purpose.

When it comes to Big Data analytics, Hadoop (as a platform) has received an incredible amount of attention. Some highlights include: a scalable architecture based on commodity hardware, flexible programming language support, and a strong open source community committed to its ongoing development. However, Hadoop is not without limitations: due to its batch-oriented nature, Hadoop alone cannot be deployed as a real-time analytics solution. Its highly technical, low-level programming interface makes it extremely flexible and friendly to developers, but not optimal for business analysts. And in an enterprise business intelligence environment, Hadoop’s limited integration with existing BI tools leaves people scratching their heads trying to figure out how to fit it into their environment.

As Hadoop has continued to gain traction in the market and (in my opinion) moved beyond the peak of the hype cycle, it is becoming clear that to maximize its effectiveness, one should leverage Hadoop in conjunction with other business intelligence platforms and tools.  Best practices are emerging regarding the choice of such companions, as well as how to leverage each component in a joint deployment.

Among the various BI platforms and tools, Vertica has proved an excellent choice. Many of its customers have successfully leveraged the joint deployment of Hadoop and Vertica to tackle BI challenges in algorithmic trading, web analytics, and countless other industry verticals.

What makes the joint deployment so effective, and what are the common use cases?

First, both platforms have a lot in common:

  • Purpose-built from scratch for Big Data transformation and analytics
  • Leverage MPP architecture to scale out with commodity hardware, capable of managing TBs through PBs of data
  • Native HA support with low administration overhead

In the Big Data space crowded with existing and emerging solutions, the above architectural elements have been accepted as must-haves for any solution to deliver scalability, cost effectiveness and ease of use.  Both platforms have obtained strong market traction in the last few years, with customer success stories from a wide range of industry verticals.

While agreeing on things can be pleasant, it is the following key differences that make Hadoop and Vertica complement each other when addressing Big Data challenges:

Interface and extensibility
Hadoop: Hadoop’s map-reduce programming interface is designed for developers. The platform is acclaimed for its multi-language support as well as the ready-made analytic library packages supplied by a strong community.
Vertica: Vertica’s interface complies with BI industry standards (SQL, ODBC, JDBC, etc.), which enables both technologists and business analysts to leverage Vertica in their analytic use cases. Vertica’s 5.0 analytics SDK enables users to plug their custom analytic logic into the platform, with in-process and parallel execution. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance.

Tool chain / ecosystem
Hadoop: Hadoop and HDFS integrate well with many other open source tools. Their integration with existing BI tools is emerging.
Vertica: Vertica integrates with BI tools thanks to its standards-compliant interface. Through Vertica’s Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica.

Storage management
Hadoop: Hadoop replicates data 3 times by default for HA. It segments data across the machine cluster for load balancing, but the data segmentation scheme is opaque to end users and cannot be tweaked to optimize for analytic jobs.
Vertica: Vertica’s columnar compression often achieves a 10:1 compression ratio. A typical Vertica deployment replicates data once for HA, and the two replicas can have different physical layouts in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing but for compression and query workload optimization as well.

Runtime optimization
Hadoop: Because HDFS storage management does not sort or segment data in ways that optimize for an analytic job, at job runtime the input data often needs to be re-segmented across the cluster and/or sorted, incurring a large amount of network and disk I/O.
Vertica: The data layout is often optimized for the target query workload during data loading, so that a minimal amount of I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics rather than batch-oriented data processing.

Auto tuning
Hadoop: Map-reduce programs are written in procedural languages (Java, Python, etc.), which give developers fine-grained control of the analytic logic but also require that they optimize each job carefully in their programs.
Vertica: The Vertica Database Designer provides automatic performance tuning given an input workload. Queries are specified in the declarative SQL language and are automatically optimized by the Vertica columnar optimizer.

After working with a number of customers on joint Hadoop and Vertica deployments, we have identified a number of best practices for combining the power of both platforms. As an example, Hadoop is ideal for initial exploratory data analysis, where the data is often available in HDFS and is schema-less, and where batch jobs usually suffice, whereas Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data. Sessionizing clickstreams, Monte Carlo analysis, and web-scale graph analytics are some such examples. For those analytic features supported by both platforms, we have observed significant performance advantages in Vertica, due to the key architectural differences between the two platforms described above.

Finally, by leveraging Vertica’s Hadoop connector, users can easily move data between the two platforms. Also, a single analytic job can be decomposed into bits and pieces that leverage the execution power of both platforms. For instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS. A map-reduce job is then invoked to convert that semi-structured data into relational tuples, with the results loaded into Vertica for optimized storage and retrieval by subsequent analytic queries. As another example, when an analytic job retrieves input data from Vertica storage, its initial stages of computation, often consisting of filters, joins, and aggregation, should be conducted in Vertica for optimal performance. The intermediate result can then be fed into a map-reduce job for further processing, such as building a decision tree or some other machine learning model.

Big Data with Hadoop and Vertica – OSCON ‘11

The recent OSCON ’11 was filled with exciting technology and best-practice discussions on Big Data, Java, and many other subjects. There I had an opportunity to deliver a talk to the open source community on the subject of this post. In a subsequent talk, my colleagues Steve Watt and Glenn Gebhart presented a compelling demo illustrating the power of combining Hadoop and Vertica to analyze unstructured and structured data. We were delighted at the feedback both talks received, in follow-up conversations in person as well as on Twitter. This interview captured the gist of the numerous conversations we had with other OSCON attendees about Vertica’s real-time analytics capabilities and its underlying technology.

How to Make Pig for Hadoop SQL

A few weeks back, my co-worker Rajat Venkatesh gave a little talk on his work with the Vertica Connector for Hadoop. When he displayed a program written in Pig, the de facto programming language for Hadoop, I noted that it looks a lot like SQL – and also that it shouldn’t be too hard to convert a Pig program to SQL, allowing us to run Pig on Vertica directly!

So, I spent a few hours over the last week and whipped up an engine for running Pig programs on Hadoop and Vertica. I’m not trying to start a Hadoop-versus-database war, or even to “perfume” Pig (sorry… bad joke). I just want to make life easier for those users who are combining Hadoop and Vertica to get the most out of their data.

The core idea of the Vertica Pig/SQL conversion engine is to rewrite the data flow described by the Pig program into a set of nested SQL queries that produce the query answer. The conversion engine uses the regular PigScriptParser, looks at the LogicalPlan, and transforms each Operator into a SQL statement that implements its functionality.

The Vertica Connector for Hadoop is used to move data from HDFS into Vertica and from Vertica back into HDFS.  We run a mixture of Pig scripts and SQL scripts in order to create tables in Vertica, move the data from Hadoop to Vertica, compute the result, and move it back to Hadoop.  The hope is to seamlessly take advantage of Vertica to run your Pig program faster.

Here’s a fairly trivial Pig script example:

x = LOAD 'foo.log' USING PigStorage('|') as (a:int,b:int,c:int);
y = FILTER x BY a > 5;
z = FOREACH y GENERATE b*c as f;
z2 = DISTINCT z;
z3 = FOREACH z2 GENERATE f, f*f as g;
z4 = ORDER z3 by g;
STORE z4 INTO 'zzz';

And here is the corresponding code run by the Vertica conversion engine:

Script: schema refresh(SQL)[ErrorsOK]
DROP SCHEMA squeal_PigLatin_simple_pig CASCADE;
CREATE SCHEMA squeal_PigLatin_simple_pig;
Script: sql setup(SQL)
CREATE TABLE squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log
(
a int,
b int,
c int,
d varchar(65000)
);
Script: test sql(SQL)
SELECT * FROM (SELECT * FROM (SELECT * FROM (SELECT f as f,f * f as g FROM (SELECT DISTINCT * FROM (SELECT b * c as f FROM (SELECT * FROM squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log as x WHERE a > 5) AS y) AS z) as z2) AS z3 ORDER BY g) AS z4 LIMIT 3) AS z5
Script: load3(PIG)
x = LOAD 'hdfs://L3H9215.verticacorp.com:9935/user/bvandiver/simple/foo.log' USING PigStorage('|') AS (a:int,b:int,c:int,d:chararray);
STORE x into '{squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log(a int,b int,c int,d varchar(65000))}' using com.vertica.pig.VerticaStorer('localhost','verticadb502','5935','bvandiver','""');
Script: populate vdata4(SQL)
CREATE TABLE squeal_PigLatin_simple_pig.vdata AS SELECT * FROM (SELECT * FROM (SELECT * FROM (SELECT f as f,f * f as g FROM (SELECT DISTINCT * FROM (SELECT b * c as f FROM (SELECT * FROM squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log as x WHERE a > 5) AS y) AS z) as z2) AS z3 ORDER BY g) AS z4 LIMIT 3) AS z5
Script: store vdata5(PIG)
vdata = LOAD 'sql://{SELECT * FROM squeal_PigLatin_simple_pig.vdata}' using com.vertica.pig.VerticaLoader('localhost','verticadb502','5935','bvandiver','""');
STORE vdata INTO 'hdfs://L3H9215.verticacorp.com:9935/user/bvandiver/simple/zzz' USING org.apache.pig.builtin.PigStorage();

While this example only shows filter, distinct, order, and foreach, the conversion engine currently supports limit, join, sample, and group as well.

There are a number of cool tricks we can do to improve performance. One is that we’re already taking advantage of the Vertica optimizer to make data-aware decisions. For example, the optimizer will choose the join order based on the data actually loaded, instead of relying on the programmer to know the best order (and to specify it correctly!).

Additionally, we can take advantage of the Vertica Database Designer. Our Database Designer automatically chooses a good data layout and compression strategy for storing the data in Vertica. The Designer uses the schema, the queries, and a sample of the data as its input – all of which the conversion engine can provide programmatically. Finally, you can leave the original data in Vertica, so the second time you run the script you avoid paying the cost of transferring the data.

What’s the take-away? With the Vertica Pig/SQL conversion engine, dislike of writing SQL is not a reason to throw away your database. If scalability is your major concern, I’d encourage you to see how well Vertica scales up to the challenge of large data.

For an off-hours, couple-day hack, the conversion engine is pretty feature-rich! Today it doesn’t support all of Pig (for instance, cogroup or user-defined functions), but this is simply a matter of time and effort. We would love to hear from our user community about interest in using the Vertica Pig/SQL conversion engine if it were a real product. If so, we will improve it and release it as an open source download.

Thoughts?

The Vertica Connector for Hadoop

At Hadoop World today, Vertica is announcing our second generation connector for Hadoop and Pig.  You can read more about the Vertica Connector for Hadoop in the press release, but we wanted to give you the basics about the new Connector here on the blog.

About a year ago, Vertica was the first analytic database company to deliver a parallel connector for Hadoop.  Today, Vertica has over 25 customers using Vertica and Hadoop together.  These are real customers, with real-world applications, who are combining the strengths of Hadoop and the Vertica Analytics Platform.

With the Vertica Connector for Hadoop, users have unprecedented flexibility and speed in loading data from Hadoop to Vertica and querying data from Vertica in Hadoop. And, as part of our ongoing commitment to Hadoop, we’ve announced an upcoming backup option to the Hadoop File System (HDFS), along with plans to publish a set of libraries that will allow Hadoop users to leverage Vertica’s advanced sorting, compression, and encoding natively in the backup files on HDFS for additional processing and data exploration.

The Vertica 4.0 Connector for Hadoop is open source, supported by Vertica, and available for download here.

You can read more on the web site, but here’s a short summary of what the Vertica 4.0 Connector for Hadoop offers right now:

  • Enhanced integration of Vertica 4.0 with stable versions of Hadoop and Pig
  • Improved transfer performance

And, in a separate release available in the first half of 2011, we’ll be providing:

  • An innovative, low-cost backup option
  • Libraries to read Vertica’s native format from HDFS

As part of our Hadoop push, we’re going to be presenting a couple of webinars around our combined Vertica/Hadoop solution.  Watch for an announcement here on the blog or in your email (if you are on the Vertica list).

Reaffirming our Commitment and Approach to Hadoop / MapReduce

As head of Product Management at a next-generation analytic DBMS company, I often get questions about Vertica’s endeavors with Hadoop/MapReduce. Given that Vertica and Hadoop/MR share many core principles, like being massively parallel and highly available on distributed commodity hardware, there is a natural fit. That said, the two are still different: Vertica is designed for real-time analytics on structured data, whereas Hadoop/MR is typically for batch-oriented jobs over any type of data (structured, semi-structured, or unstructured). We try to stay out of the comparisons, though, and instead focus on complementary approaches, particularly in solving real-world customer problems. This has been our approach since the beginning of our joint development.

Vertica and Hadoop/MR complement one another extremely well, and we are committed to ensuring bi-directional and tight integration between Hadoop/MR and Vertica.  Our preference is to work with great partners like Cloudera who understand enterprise class Hadoop the same way Vertica understands enterprise-class databases.  Our approach of seamless and parallel integration is in line with Vertica’s core “One Size Does Not Fit All” tenet. We don’t think we need to develop the technology ourselves, much in the same way that we don’t feel the need to develop our own ETL and front-end visualization solutions.

Vertica is focused on building the best next-generation analytic database solution on the market. Our solution enables customers to unlock and monetize their data in a fully relational and massively parallel manner, with scalability and simplicity of setup and administration as core design principles. We enable companies to ingest, store, and analyze vast amounts of structured data with near real-time latency on a fraction of the hardware they would otherwise need. This is why Vertica was founded, it is what we owe our success to date, and as far as we can tell, we are solving a very clear and present data problem that is only getting worse. Our focus is also the reason we reached the 100-customer mark faster than all of our competitors. Among other uses, Hadoop/MR is wonderful at getting more, and higher quality, data into Vertica.

While Hadoop/MR and Vertica are different, the problems they solve are not always orthogonal. As it turns out, and not surprisingly, many data problems can be solved in more than one way. Again, we see merit in Hadoop/MR for several use cases (including, but not limited to, the massaging, structuring, and transformation of data before and/or after it gets to the database), but we also know that some of the most commonly cited MR use cases can be performed in a single pass of SQL in the database engine. By stripping away the noise and listening to our customers and their pain, we are able to deliver a core product that solves many of the same issues. Not all, but many.

A case in point is sessionization, which is perhaps the most often cited use case for MapReduce in the enterprise (stay tuned for a more in-depth post on this topic and CTE). Sessionization is the process of taking web log files and grouping them into buckets of visitor sessions (most commonly time-based, e.g. with a 30-second timeout) for analysis. This has been pegged as problematic to perform in SQL, and therefore in the RDBMS, because it often requires multiple passes through the engine and is difficult to express. In Vertica 4.0, however, it can be expressed in a single pass of SQL with no problem.

Here’s the SQL with a Web/Clickstream timeout threshold of 30 seconds:

SELECT userId, timestamp, CTE(timestamp - LAG(timestamp) <= '30 seconds') OVER (PARTITION BY userId ORDER BY timestamp) AS session FROM webclicks;

By performing this operation in the Vertica database, our customers leverage our massively parallel, real-time columnar infrastructure without having to move the data around for external batch processing. They can do this from within their favorite reporting tool, without adding an extra step. Furthermore, Vertica’s native windowing conditions for advanced analytics, including sessionization, are many, and are not limited to the conditional true event (CTE) on timestamps depicted above. Of course, there are still good reasons to perform sessionization outside the database, such as not wanting to take up valuable real-time analytics resources on such grouping legwork (although this can actually be addressed using Vertica’s new workload management capabilities). We get that, and again, that is why we support native Hadoop/MR – no need for syntax changes.

Key to our One Size Does Not Fit All approach was Vertica’s day-one decision not to cut corners by building on top of Postgres or some other traditional row store, as most of our competitors have done with their offerings. We instead wrote a truly next-generation, native MPP columnar ADBMS solution from scratch, complete with a unique set of bells and whistles (stay tuned for a specific post on this subject as well). The good news is that on this core foundation we can now add functionality that traditional row stores would never be able to handle fast enough. Sessionization is a great example: it is simply too inefficient to perform in a traditional RDBMS, not to mention that most databases are not as expressive, which is why many people turn to Hadoop/MR for it. Vertica’s customers are finding there are a lot of things they can now do in Vertica that they could never consider with a traditional database. This, combined with tight integration with frameworks like Hadoop, allows our customers to monetize all of their data in ways never before possible.
