Vertica

Archive for the ‘Hadoop’ Category

Distributed R for Big Data

Data scientists use sophisticated algorithms to obtain insights. However, what usually takes tens of lines of MATLAB or R code is now being rewritten in Hadoop-like systems and applied at scale in industry. Instead of rewriting algorithms in a new model, can we stretch the limits of R and reuse it for analyzing Big Data? We present our early experiences at HP Labs as we attempt to answer this question.

Consider a few use cases: product recommendations at Netflix and Amazon, PageRank calculation by search providers, financial options pricing, and detection of important people in social networks. These applications (1) process large amounts of data, (2) implement complex algorithms such as matrix decomposition and eigenvalue calculation, and (3) continuously refine their predictive models as new user ratings, Web pages, or relations in the network arrive. To support these applications we need systems that can scale, can easily express complex algorithms, and can handle continuous analytics.

By complex algorithms we mean that most of the above applications use advanced concepts such as matrix operations and graph algorithms. By continuous analytics we mean that if a programmer writes y=f(x), then y is recomputed automatically whenever x changes. Continuous analytics reduces the latency with which information is processed. For example, in recommendation systems new ratings can be quickly processed to give better suggestions. In search engines, newly added Web pages can be ranked and made part of search results more quickly.

In this post we will focus on scalability and complex algorithms.

R is open-source statistical software. It has millions of users, including data scientists, and more than three thousand algorithm packages. Many machine learning algorithms already exist in R, albeit for small datasets. These algorithms use matrix operations that are easily expressed and efficiently implemented in R. Most algorithms can be implemented in less than a hundred lines. Therefore, we decided to extend R and determine whether we can achieve scalability in a familiar programming model.

Figure 1 is a very simplified view that compares R and Hadoop. Hadoop can handle large volumes of data, but R can efficiently execute a variety of advanced analyses. At HP Labs we have developed a distributed system that extends R. The main advantages are the language semantics and the mechanisms to scale R and run programs in a distributed manner.


Figure 1: Extending R for Big Data

Details

Figure 2 shows a high level diagram of how programs are executed in our distributed R framework. Users write programs using language extensions to R and then submit the code to the new runtime. The code is executed across servers in a distributed manner. Distributed R programs run on commodity hardware: from your multi-core desktop to existing Vertica clusters.


Figure 2: Architecture

Our framework adds three main language constructs to R: darray, splits, and update. A foreach construct is also present. It is similar to parallel loops found in other languages.

For transparent scaling, we provide the abstraction of distributed arrays, darray. Distributed arrays store data across multiple machines and give programmers the flexibility to partition data by rows, columns, or blocks. Programmers write analytics code treating the distributed array as a regular array, without worrying that it is mapped to different physical machines. Array partitions can be referenced using splits and their contents modified using update. The body of a foreach loop processes array partitions in parallel.

Figure 3 shows part of a program that calculates the distributed PageRank of a graph. At a high level, the program executes A = (M*B)+C in a distributed manner until convergence. Here M is the adjacency matrix of a large graph. Initially M is declared an NxN sparse matrix partitioned by rows. The vector A is partitioned such that each partition has the same number of rows as the corresponding partition of M. The accompanying illustration (Figure 3) points out that each partition of A requires the corresponding (shaded) partitions of M and C, and the whole array B. The runtime passes these partitions and automatically reconstructs B from its partitions before executing the body of foreach on the workers.

Our algorithms package has distributed algorithms such as regression analysis, clustering, power method based PageRank, a recommendation system, and so on. For each of these applications we had to write less than 150 lines of code.


Figure 3: Sample Code

The point of this post is not to claim yet another system that is faster than Hadoop, so we will skip comprehensive experiment results and pretty graphs. Our Eurosys 2013 and HotCloud 2012 papers have detailed performance results [1, 2]. As a data nugget, our experiments show that many algorithms in our distributed R framework run more than 20 times faster than Hadoop.

Summary

Our framework extends R. It efficiently executes machine learning and graph algorithms on a cluster. Distributed R programs are easy to write, are scalable, and are fast.

Our aim in building a distributed R engine is not to replace Hadoop or its variants. Rather, it is a design point in the space of analytics interfaces—one that is more familiar to data scientists.

Our framework is still evolving. Today, you can use R on top of Vertica to accelerate your data mining analysis. Soon we will support in-database operations as well. Stay tuned.


[1] Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, Rob Schreiber. Eurosys 2013, Prague, Czech Republic.

[2] Using R for Iterative and Incremental Processing. Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Rob Schreiber. HotCloud 2012, Boston, USA.

GameStop CIO: Hadoop Isn’t For Everyone

GameStop Corp. is the world’s largest multichannel video game retailer, with a retail network and family of brands that includes 6,650 company-operated stores in 15 countries worldwide and online at www.GameStop.com. The network also includes:  www.Kongregate.com, a leading browser-based game site; Game Informer® magazine, the leading multi-platform video game publication; Spawn Labs, a streaming technology company; and a digital PC game distribution platform available at http://www.GameStop.com/PC.

As part of their efforts to upgrade their analytics infrastructure to handle the massive traffic from their 21 million member PowerUp Rewards™ loyalty membership program, GameStop looked at Hadoop to see if the open source platform would handle their Big Data requirements.  Ultimately, GameStop CIO Jeff Donaldson chose the HP Vertica Analytics Platform because his engineers, who are trained in working with traditional data warehousing solutions that use the SQL programming language, would be able to quickly transition to the open, standards-based HP Vertica Analytics Platform.

Recently, GameStop was featured in an article by Clint Boulton, a reporter for the Wall Street Journal’s CIO Journal. In the article, “GameStop CIO: Hadoop Isn’t For Everyone,” Clint and Jeff Donaldson discuss the issues with implementing Hadoop and why a high-performance analytics platform like HP Vertica may be a better solution for Big Data success than Hadoop. According to Jeff Donaldson, “[Data management] is a hard enough of a business problem and we wanted to reduce the technology risk to near zero.”

The article can be found at the CIO Journal blog.  To read the full article, you will need to be a member of the Wall Street Journal web site.

Now Shipping: HP Vertica Analytics Platform 6.1 “Bulldozer” Release

We’re pleased to announce that the “Bulldozer” (v6.1) release of the HP Vertica Analytics Platform is now shipping! The 6.1 release extends v6 of the HP Vertica Analytics Platform with exciting new enhancements that are sure to delight cloud computing and Hadoop fans alike, while giving data scientists and administrators more features and functionality to get the most out of their HP Vertica investment.

Tighter Hadoop Integration

With a name like “Bulldozer,” you’d expect the release to provide some heavy Big Data lifting, and this release delivers. Many of our customers use Hadoop in the early stages of their data pipeline, especially for storing loads of raw data. But after they’ve MapReduce-massaged the data, users want to load it into HP Vertica as fast as possible so they can start their true analytics processing. HP Vertica 6.1 provides a new HDFS connector that allows you to do just that: pull data straight from HDFS with optimal parallelism, without any additional MapReduce coding. Furthermore, users who are still deciding whether to bring some of their Hadoop data into their primary analytics window can use HP Vertica’s external tables feature with the HDFS connector to run rich analytics queries and functions in situ in HDFS. They may even choose to plug in a custom parser using the User Defined Load framework and let HP Vertica do some of the ETL lifting for them. Flexibility is what it’s all about, and to learn how to use the HP Vertica Analytics Platform with Hadoop, see our newly released white paper: Make All Your Information Matter — Hadoop and HP Vertica Analytics Platform.
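For a flavor of what this looks like in SQL, here is a minimal sketch (the table definitions and HDFS paths are hypothetical; the full syntax is covered in the white paper and product documentation):

-- Hypothetical sketch: pull MapReduce output straight from HDFS into an existing table.
COPY web_events SOURCE Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/events/part-00000');

-- Hypothetical sketch: leave the data in HDFS and query it in situ through an external table.
CREATE EXTERNAL TABLE web_events_raw (user_id INT, page VARCHAR(2048), event_time TIMESTAMP)
AS COPY FROM Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/events/*');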

Simplified Cloud Deployments

We also have many customers who run HP Vertica in the cloud, and know that more and more enterprises are making the cloud their deployment model of choice. To simplify and improve the cloud deployment experience, we now have an officially qualified Amazon EC2 AMI for HP Vertica. This AMI eliminates the guesswork and manual effort involved in rolling your own AMI. And to make these AMIs even easier to administer, we’ve provided cloud scripts that simplify the installation, configuration, and deployment of HP Vertica clusters. Now creating, expanding, and contracting your HP Vertica deployments is easier than ever, enabling a more agile and elastic cloud experience.

Killer Features for Big Data Analytics

In addition to the above, there are dozens of new features and improvements in this release that address the needs of Big Data analytics deployments. From a new R language pack that gets data scientists up and running quickly, to enhanced storage tiering and archiving features that help optimize storage media spend, to new validation tools that assist administrators with hardware deployment planning and tuning, this release provides the platform needed to create an enterprise-grade Big Data environment. And, as with every release, we’ve made HP Vertica’s already incredibly fast performance even faster.

It’s easy for me to be excited about all the great new improvements in this release, but I challenge you to come see for yourself. Test drive HP Vertica 6.1 and find out how its new features can help you tackle your biggest Big Data challenges. Interested in learning more? Attend our upcoming Introduction to HP Vertica 6.1 webinar, where we’ll provide even more details about this exciting new release. We’re constantly striving to make the HP Vertica Analytics Platform the best solution for our customers’ Big Data analytics needs, and with our “Bulldozer” now out the door, we’re looking forward to helping more enterprises pave the way to data-driven business success.

 

Luis Maldonado

Director, HP Vertica

Observations from Hadoop World 2012


More than 3,000 attendees converged on the sold-out O’Reilly Strata Conference and Hadoop World 2012 in New York City to gain some clarity on arguably the biggest high-tech megatrend in recent years: Big Data.

From a 100,000-foot view, the majority of attendees—from press to developers to exhibitors to event staff—understood that we are generating a nearly incomprehensible amount of data, really Big Data. And there’s every reason to believe that this Big Data will continue to grow by orders of magnitude, given the proliferation of:

But from my conversations, attendees came to the show to understand how their organization could manage, analyze, and ultimately monetize this Big Data, and, specifically, how Hadoop could help with that effort.

As a newbie to this space, I could relate to the quizzical faces of attendees, barraged with messages from vendors each claiming to be the next Big Data solution, yet with very different offerings—everything from search engines to hosted solutions to ETL tools to even staffing resources.

Hadoop in itself comprises a set of uniquely named technologies: Hive, Sqoop, Pig, Flume, etc. Despite the unusual terminology, the Hadoop-focused sessions proved educational and featured a range of real-world case studies, including large companies (such as Facebook) using Hadoop to store and analyze impressive amounts of Big Data.

But the question still remains: is Hadoop the answer or are there other technologies that can either complement or serve as a better path?

As is often the case when choosing technology, the answer is “It depends on your business need.”

At HP, many of our customers used Hadoop for batch processing before ultimately adopting the HP Vertica Data Analytics Platform to manage and analyze their Big Data for sub-second query response times.

Other customers, particularly those using the Hadoop Connector released with HP Vertica Version 6, use the two technologies together to move data seamlessly back and forth between Hadoop and HP Vertica.

Which use cases do you feel are a good fit for Hadoop and how can we provide better integration with our platform? Let us know.

We’re passionate about providing the data analytics platform to help you obtain answers to your Big Data questions and add some clarity in the process.

Avro parser UDx – Using Apache Avro to enable easier data transfer from Hadoop to Vertica

After careful research and brainstorming of different ideas for the intern UDx competition, we decided to implement an Avro parser UDx. Our team, “The Avro-rian Revolutionaries,” wanted to implement something useful, ready to use, and high on customers’ wish lists. And what better than an Avro parser that helps users easily transfer data from Hadoop to Vertica! (This Avro parser UDx package is now available on GitHub [6], and Vertica users are encouraged to try it out!)

Apache Avro [1] is a data serialization format widely used in the Hadoop world. It is a newer format that succeeds Thrift [2] and Protocol Buffers [3]. According to some technologists, Avro is the best data serialization framework out there [4]. This was good motivation for us to implement an Avro parser for the intern competition, hoping to make importing Avro data into Vertica feasible.

Figure 1. Hadoop, Avro, Avro UDx and Vertica workflow

With this motivation, we began day 1 of the 5-day intern competition. The first milestone was to get a standalone Avro parser to work: a basic parser (still no Vertica in the picture) that would just read an Avro file and print out the header and data in text format. The Avro APIs were our means to do it, and by referring to the basic documentation [5] we quickly came up with a parser that could dump out the contents of a sample Avro file in text format, as in Figure 2.

Figure 2: weather.avro sample file in text format.

We spent day 2 of the competition learning the Vertica SDK, the next tool of the trade. There were some great examples already out there on GitHub. We picked a simple example UDx and began playing with it. Once we got our hands on loading, testing, and running this UDx, we started learning the SDK interfaces required for loading data into Vertica. One important interface was UDParser, which parses a stream of bytes into Vertica in parallel. Very quickly we were able to use this to develop a UDx skeleton, ready to be integrated with the module developed on day 1.

On day 3, midway through the competition, we had the most important milestone to achieve: integrating the standalone Avro parser developed on day 1 with the parser UDx skeleton developed on day 2. This is where we got stuck and hit an unexpected setback. After talking to our mentors, we discovered that there is an interface gap between the Avro file reader API and the Vertica UDParser interface. To fill this gap we developed a couple of modules, CRReader and CRStream, which successfully addressed the issue.

On day 4, we began integrating the modules, and finally the moment of judgement arrived: our first test of loading the weather.avro file into Vertica, which exercised most of the code we had written. We did not have to hold our breath long. Within a fraction of a second the data was loaded into Vertica. We could hardly believe that the three modules we had written over three days were working together like parts of an engine. The magic of UDx was happening, and the Avro file was successfully loaded into Vertica (Figure 3).

Figure 3: Demo screenshot

On day 5, the last day of the competition, we spent all our effort on testing and packaging the UDx. We wanted a quality product that would be ready for customers to use by the end of the competition.

Finally, we presented our work alongside the other interns in front of a packed room, with an audience from all departments of Vertica. This was a unique experience in itself, because we had to present the work in a way that would appeal to audiences with perspectives beyond the purely technical. At the end of the day we were happy that we had learned lots of new things, collaborated with senior mentors, and received great feedback and comments on our work, which made the competition a great success! And now, looking at our UDx parser available on GitHub [6] and ready to use, it gives us great satisfaction to have taken the first step toward the Avro-rian revolution!

References:
[1] http://avro.apache.org/docs/1.7.1/
[2] http://wiki.apache.org/thrift/FrontPage
[3] http://code.google.com/p/protobuf/
[4] http://www.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/
[5] http://avro.apache.org/docs/1.6.1/api/cpp/html/index.html
[6] https://github.com/vertica/Vertica-Extension-Packages/tree/master/avro_parser

Teaching the elephant new tricks

by Shilpa Lawande & Rajat Venkatesh

Introduction

A growing number of Vertica customers use Hadoop as an ETL/pre-processing tool or HDFS as a “data parking lot” – either to stage files before they are loaded or to simply store data whose worth is yet to be proven. For several years, Vertica has shipped a Hadoop Connector that provides bi-directional connectivity between the two systems. In this post, we evaluate the pros and cons of the current approach and discuss a new technique enabled by the newly released Vertica 6.

Current Design

The most efficient topology of network connections is shown in the diagram above. There is one database connection for each partition (or subset of partitions), and all of these connections transfer data in parallel. Most solutions in the market, including the current Vertica-Hadoop connector, follow this design. They open a JDBC connection for each partition (or subset of partitions) and execute a batch insert to transfer the data. Some products, like Vertica, may optimize the transfer by using a bulk load API or by transforming the data to avoid using resources on the database. Apache Sqoop is the end result of this evolution and provides a platform for implementing generic or optimized connectors for all databases that provide a JDBC driver.

Problems

The architecture described above has some inherent drawbacks owing to a fundamental impedance mismatch between Vertica and Hadoop.

Even though Vertica and Hadoop are both MPP systems there is no good way to coordinate the degree of parallelism across the two systems, resulting in an inefficient use of the combined resources. Even a moderately large Hadoop cluster can overwhelm a database with too many JDBC connections. Therefore, it is common best practice to reduce the number of reduce (or map) tasks that connect to the database.

Vertica’s customers typically transfer a large file (with many HDFS chunks) or a set of files into one table. Each of the resulting JDBC connections initiates a new transaction, overwhelming the transaction module with too many actors. Since each connection is its own transaction, it’s not possible to roll back a transfer. A failed transfer may leave partial data – the worst kind of failure. A common best practice is to first store the data in a temporary table, verify the transfer, and then move the data to its final resting table. Another common strategy is to add a column that identifies the rows inserted in a transfer, verify the rows with the unique id, and then keep or delete them. Depending on the database and storage engine, verifying and moving the data may take up significant resources. Regardless of the approach or database, there is significant management overhead for loading data from Hadoop. A majority of Vertica’s customers use Vertica so that they don’t have to worry about ACID for bulk loads. They don’t want to be in the business of writing and maintaining complex scripts.
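To make that overhead concrete, the staging workaround typically looks something like the following sketch (table and column names are made up purely for illustration):

-- Hypothetical staging table that the Hadoop job writes into over its many JDBC connections.
CREATE TABLE lineitem_staging (orderkey INT, partkey INT, quantity INT);

-- After the transfer, verify it (for example, by comparing row counts against what the MR job reports) ...
SELECT COUNT(*) FROM lineitem_staging;

-- ... and only then move the rows to their final resting table and clean up.
INSERT INTO lineitem SELECT * FROM lineitem_staging;
DROP TABLE lineitem_staging;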

In our survey of customers, we found that many of them have come up with creative solutions to the “too-many-connections” problem. By popular demand, we’re in the process of creating a repository for the Hadoop connector source on GitHub and hope to see some of these solutions contributed there. However, by far the most preferred solution is to simply write the results of an MR job to HDFS files, then transfer those files to Vertica and load them. And that’s what we’ve chosen to do in our new design as well.

New Design

In the new HDFS connector, currently in private beta, we have taken the first step toward solving the problems described earlier by allowing direct loads of files from HDFS into Vertica, thereby eliminating the non-ACID behaviors caused by multiple connections. Further, we solve the problem of too many connections by using one TCP/IP connection per file in HDFS. How did we do this? With Vertica 6, we introduced User-Defined Loads. This exposes APIs that extend Vertica’s native load command, COPY, to load from any source and in any format. With a few days of effort, using the UDL API and the Apache Hadoop REST API for HDFS files, we wrote a plugin that lets COPY read HDFS files. The benefits are obvious: COPY can process many files within the same transaction, in parallel, across the Vertica cluster. Parallelism is limited only by the hardware of the machines. And since it is a single transaction, a failure results in a complete rollback.

The resulting network topology looks like the diagram below.

Let’s see how the HDFS connector works using some examples.

Imagine that there are 6 files in a directory in HDFS

-rw-r--r--   1 hadoop supergroup   759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem.tbl
-rw-r--r--   1 hadoop supergroup   759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem1.tbl
-rw-r--r--   1 hadoop supergroup   759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem11.tbl
-rw-r--r--   1 hadoop supergroup   759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem12.tbl
-rw-r--r--   1 hadoop supergroup   759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem2.tbl
-rw-r--r--   1 hadoop supergroup   759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem31.tbl

The Vertica Copy command to load lineitem.tbl works as follows:

copy lineitem source
Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem.tbl');

If you wanted to load all 6 files, you simply use the glob * feature. In this case, the 6 files are loaded into Vertica and are processed in parallel across the cluster.

copy lineitem source
Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem*');

In this example, lineitem.tbl is a tab-delimited file. What about all the other popular file formats? User Defined Load provides an API to plug in a parser for any file format, as well as APIs to decompress, decrypt, or perform any arbitrary binary transformation. Check out this blog post showing how UDL can be used to process image file data.
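As a rough illustration, plugging in a custom parser is just another clause on COPY (the parser name and file path below are hypothetical, standing in for whatever parser you register):

-- Hypothetical: load a proprietary binary format through a user-defined parser.
COPY images FROM '/data/images/pictures.raw' WITH PARSER MyImageParser();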

Further, with Vertica 6 external tables, in combination with the HDFS connector, you can now use Vertica SQL to analyze data in HDFS directly (see the “Vertica Analytics Anywhere” blog post). This means that while you are in the exploratory stage with some data, you can experiment with different data models and still use SQL analytics (or the Vertica R SDK) to plough through the data.

CREATE EXTERNAL TABLE hdfs.lineitem (<list of columns>) AS COPY
FROM Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/*');
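Once the external table is defined, exploration is plain SQL. For example (the column names below are hypothetical, standing in for the elided column list):

-- Hypothetical exploratory query over data that still lives in HDFS.
SELECT l_orderkey, SUM(l_quantity) AS total_qty
FROM hdfs.lineitem
GROUP BY l_orderkey
ORDER BY total_qty DESC
LIMIT 10;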

Summary

“Wait a minute! Most chunks are sent to another Hadoop datanode and then to Vertica. This is worse than what we had before.” Yes, we have traded expensive JDBC connections for one extra hop for most chunks. However, the new strategy is still faster and maintains transaction semantics. The next step is to enable COPY to parse data chunks in parallel. Even so, we believe that the current solution has significant advantages over using JDBC connections and already solves major pain points of Vertica-Hadoop users.

An under-performing connector requires far more resources – machine and human – to bring together data from various sources. With an efficient connector, users can transfer data in real time and on a whim, allowing them to use the right tool for the job.

If you would be interested in our private beta for the HDFS connector, please send a note to beta@vertica.com with a brief description of your use-case.

Counting Triangles

by Stephen Walkauskas

Recently I’ve heard from or read about people who use Hadoop because their analytic jobs can’t achieve the same level of performance in a database. In one case, a professor I visited said his group uses Hadoop to count triangles “because a database doesn’t perform the necessary joins efficiently.”

Perhaps I’m being dense but I don’t understand why a database doesn’t efficiently support these use-cases. In fact, I have a hard time believing they wouldn’t perform better in a columnar, MPP database like Vertica – where memory and storage are laid out and accessed efficiently, query jobs are automatically tuned by the optimizer, and expression execution is vectorized at run-time. There are additional benefits when several, similar jobs are run or data is updated and the same job is re-run multiple times. Of course, performance isn’t everything; ease-of-use and maintainability are important factors that Vertica excels at as well.

Since the “gauntlet was thrown down”, to steal a line from Good Will Hunting, I decided to take up the challenge of computing the number of triangles in a graph (and include the solutions in GitHub so others can experiment – more on this at the end of the post).

Problem Description

A triangle exists when a vertex has two adjacent vertexes that are also adjacent to each other. Using friendship as an example: If two of your friends are also friends with each other, then the three of you form a friendship triangle. How nice. Obviously this concept is useful for understanding social networks and graph analysis in general (e.g. it can be used to compute the clustering coefficient of a graph).

Let’s assume we have an undirected graph with reciprocal edges, so there’s always a pair of edges ({e1,e2} and {e2,e1}). We’ll use the following input for illustration (reciprocal edge is elided to condense the information):

source    destination
Ben       Chuck
Ben       Stephen
Chuck     Stephen
Chuck     Rajat
Rajat     Stephen
Andrew    Ben
Andrew    Matt
Andrew    Pachu
Matt      Pachu
Chuck     Lyric

A little ascii art to diagram the graph might help.

I know you can quickly count the number of triangles. I’m very proud of you, but imagine there are hundreds of millions of vertexes and tens of billions of edges. How long would it take you to diagram that graph? And how much longer to count all of the triangles? And what if your 2-year-old daughter barges in counting “one, two, three, four, …” and throws off your count?

Below we present a few practical solutions for large scale graphs and evaluate their performance.

The Hadoop Solution

Let’s consider first the Hadoop approach to solving this problem. The MapReduce (MR) framework implemented in Hadoop allows us to distribute work over many computers to get the count faster. The solution we describe here is a simplified version of the work at Yahoo Research. You can download our solution here.

Overview

The solution involves a sequence of 3 MR jobs. The first job constructs all of the triads in the graph. A triad is formed by a pair of edges sharing a vertex, called its apex. It doesn’t matter which vertex we choose as the apex of the triad, so for our purposes we’ll pick the “lowest” vertex (e.g. friends could be ordered alphabetically by their names). The Yahoo paper makes a more intelligent choice of “lowest” – the vertex with the smallest degree. However, that requires an initial pass over the data (and more work on my part), so I skipped that optimization and did so consistently for all solutions to ensure fairness.

These triads and the original edges are emitted as rows by the first MR job, with a field added to distinguish the two. Note that the output of the first job can be quite large, especially in a dense graph. Such output is consumed by the second MR job, which partitions the rows by either the unclosed edge, if the row is a triad, or the original edge. A partition has n triangles if it contains an original edge and n triads. A third, trivial MR job counts the triangles produced by the second job, to produce the final result.

Details

Let’s look at each MR job in detail. The map part of the first job generates a key-value pair for each edge, such that the apex (the edge’s “lowest” vertex) is the key and the edge itself is the value. In our small example the map job would emit the following rows.
key       value
Andrew    Andrew, Matt
Andrew    Andrew, Pachu
Andrew    Andrew, Ben
Matt      Matt, Pachu
Ben       Ben, Chuck
Ben       Ben, Stephen
Chuck     Chuck, Rajat
Chuck     Chuck, Lyric
Chuck     Chuck, Stephen
Rajat     Rajat, Stephen

For each apex-partition, the reduce job emits the original edges and all of the corresponding triads (there are 0 + 1 + … + (d−1) = d(d−1)/2 triads per partition, where d is the degree of the vertex at the apex). For each original edge, the key is the edge itself and the value is “edge”. For each triad, the key is the unclosed edge, that is, the edge needed to complete the triangle, and the value is “triad”. The actual code used “0” for the edge value and “1” for the triad value for run-time efficiency.

The rows corresponding to the triads emitted by this reduce job in our simple example are described below in the “key” and “value” columns (the original edges are also emitted by the reduce job but elided below for brevity). For presentation purposes we added a third column “triad content”. That column is not produced by the actual reduce job.
key              value   triad content
Ben, Matt        triad   {Andrew, Ben}, {Andrew, Matt}
Ben, Pachu       triad   {Andrew, Ben}, {Andrew, Pachu}
Matt, Pachu      triad   {Andrew, Matt}, {Andrew, Pachu}
Chuck, Stephen   triad   {Ben, Chuck}, {Ben, Stephen}
Lyric, Rajat     triad   {Chuck, Lyric}, {Chuck, Rajat}
Lyric, Stephen   triad   {Chuck, Lyric}, {Chuck, Stephen}
Rajat, Stephen   triad   {Chuck, Rajat}, {Chuck, Stephen}

The input to the next reduce job is partitioned such that the unclosed edge of each triad is in the same partition as its corresponding original edge, if any. The reduce job just needs to check for the existence of an original edge in that partition (i.e., a row with value set to “edge”). If it finds one, all of the triads in the partition are closed as triangles. The reduce job sums up all of the closed triads and on finalize emits a count. A trivial final MR job aggregates the counts from the previous job.

There we’ve used MapReduce to count the number of triangles in a graph. The approach isn’t trivial, but it’s not horribly complex either. And if it runs too slowly, we can add more hardware: each machine does less work and we get our answer faster.

Experiences with Hadoop

I have to admit it took me much longer than I estimated to implement the Hadoop solution. Part of the reason is that I’m new to the API, which is exacerbated by the fact that there are currently two APIs, one of them deprecated, the other incomplete, forcing use of portions of the deprecated API. Specifically, the examples I started with were unfortunately based on the deprecated API, and when I ported to the newer one I ran into several silly but somewhat time-consuming issues (like mapred’s version of Reducer.reduce takes an Iterator but mapreduce’s version takes an Iterable – they look similar to the human eye, but the compiler knows that a method that takes an Iterator should not be overridden by one that takes an Iterable). Learning curve aside, there was a fair chunk of code to write. The simple version is >200 lines. In a more complex version I added a secondary sort to the MR job that computes triads. Doing so introduced several dozen lines of code (most of it brain-dead stuff like implementing a Comparable interface). Granted, a lot of the code is cookie-cutter or trivial, but it still needs to be written (or cut-and-pasted and edited). In contrast, adding a secondary sort column in SQL is a mere few characters of extra code.

The PIG Solution

Rajat Venkatesh, a colleague of mine, said he could convert the algorithm to a relatively small PIG script, and he wagered a lunch that the PIG script would outperform my code. He whipped up what was eventually a 10-statement PIG script that accomplished the task. When we get to the performance comparison we’ll find out who got a free lunch.

Here’s the PIG solution, much simpler than coding MR jobs by hand. We used PIG 0.8.1. We made several passes over the script to optimize it, following the PIG Cookbook. For example, we rearranged the join order and put the larger table last (I’m probably not giving too much away by mentioning that Vertica’s optimizer uses a cost model which properly chooses join order). We also tried several values for default_parallel and mapreduce.job.maps (and we changed the corresponding parameter in mapred-site.xml as well, just to be safe). We did not enable lzo compression for two reasons. First, considering the hardware used for the experiment (large RAM – plenty of file system cache, high throughput network), the CPU tax incurred by compression was more likely to hurt performance than help in this case. Second, one website listed 7 steps to get the compression working but the 2nd step had several steps itself, so I gave up on it.

The Vertica Solution

Can you count the number of triangles in a graph using a database? Of course. First create an “edges” table and load the graph. Vertica can automate the decision about how to organize storage for the table – structures called projections specify important characteristics of physical storage, such as sort order, segmentation, encoding, and compression. In this case we simply tell Vertica how to distribute data among the nodes in our cluster (Vertica calls this segmenting). Alternatively, the Vertica Database Designer can be used to automate projection design. The following statements create our table and load the data.
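A minimal sketch of what such statements look like (the column types, segmentation clause, and load path here are illustrative rather than the exact ones used in the study):

-- Illustrative: a two-column edges table, segmented across the cluster by source vertex.
CREATE TABLE edges (
    source      VARCHAR(64) NOT NULL,
    destination VARCHAR(64) NOT NULL
)
SEGMENTED BY HASH(source) ALL NODES;

-- Illustrative: bulk load the (reciprocal) edge list.
COPY edges FROM '/data/edges.txt' DELIMITER E'\t' DIRECT;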

We’ve got the data loaded and stored in an efficient way. If we need to run more jobs with the same data later we won’t incur the load cost again. Likewise, if we need to modify the data we only incur work proportional to the change. Now we just need a horribly complex hack to count triangles. Take a deep breath, stretch, get a cup of coffee, basically do what you have to do to prepare your brain for this challenge. Ok, ready? Here it is:
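(Sketched below with the e1, e2, e3 aliases used in the discussion that follows; the exact query may differ slightly in form, but this is its shape.)

SELECT COUNT(*)
FROM edges e1
JOIN edges e2 ON e1.destination = e2.source AND e1.source < e2.source
JOIN edges e3 ON e2.destination = e3.source AND e2.source < e3.source
             AND e3.destination = e1.source;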

Good, you didn’t run away screaming in horror. If we ignore the less than predicates the query is simply finding all triplets that form a cycle, v1 -> v2 -> v3 -> v1. The less than predicates ensure we don’t count the same triangle multiple times (remember our edges are reciprocal and we only need to consider triangles with the “lowest” vertex at the apex).

That’s it! A single, 4-liner query. Of course you’re interested in what the Vertica database does under the covers and how its performance, disk utilization and scalability compare with those of Hadoop and PIG.

Performance Study

The publicly available LiveJournal social network graph (http://snap.stanford.edu/data/soc-LiveJournal1.html) was used to test performance. It was selected because of its public availability and because its modest size permitted relatively quick experiments. The modified edges file (in the original file not every edge is reciprocated) contained 86,220,856 edges, about 1.3GB in raw size. We used HDFS dfs.replication=2 (replication=1 performed worse – fewer map jobs were run, almost regardless of the mapreduce.job.maps value). Experiments were run on between 1 and 4 machines, each with 96GB of RAM, 12 cores, and a 10GBit interconnect.

Run-Time Cost

All solutions were manually tuned to obtain the best performance numbers. For the Hadoop and PIG solutions, the number of mappers and reducers, as well as the code itself, were tweaked to optimize performance. For the Vertica solution, out of the box Vertica is configured to support multiple users; the default expectation is 24 concurrent queries for the hardware used. This configuration was tweaked to further increase pipeline parallelism (equivalent configuration settings will be on by default in an upcoming release). The following chart compares the best performance numbers for each solution.

PIG beat my Hadoop program, so my colleague who wrote the PIG script earned his free lunch. One major factor is PIG’s superior join performance – it uses a hash join. In comparison, the Hadoop solution employs a join method very close to a sort-merge join.

Vertica’s performance wasn’t even close to that of Hadoop – thankfully. It was much, much better. In fact, Vertica ate PIG’s and Hadoop’s lunch – its best time is 22x faster than PIG’s and 40x faster than the Hadoop program (even without configuration tweaks, Vertica beats the optimized Hadoop and PIG programs by more than a factor of 9x in comparable tests).

Here are a few key factors in Vertica’s performance advantage:

  • Fully pipelined execution in Vertica, compared to a sequence of MR jobs in the Hadoop and PIG solutions, which incurs significant extra I/O. We quantify the differences in how the disk is used among the solutions below in the “disk usage” study.
  • Vectorization of expression execution, and the use of just-in-time code generation in the Vertica engine
  • More efficient memory layout, compared to the frequent Java heap memory allocation and deallocation in Hadoop / PIG

Overall, Hadoop and PIG are free in software, but hardware is not included. With a 22x speed-up, Vertica’s performance advantage effectively equates to a 95% discount on hardware. Think about that. You’d need 1000 nodes to run the PIG job to equal the performance of just 48 Vertica nodes, which is a rack and a half of the Vertica appliance.

Finally consider what happens when the use case shifts from counting all of the triangles in a graph to counting (or listing) just the triangles that include a particular vertex. Vertica’s projections (those things that define the physical storage layout) can be optimized such that looking up all of the edges with a particular vertex is essentially an index search (and once found the associated edges are co-located on disk – an important detail which anyone who knows the relative cost of a seek versus a scan will appreciate). This very quickly whittles e1 and e3 down to relatively few rows which  can participate in a merge join with e2. All in all a relatively inexpensive operation. On the other hand PIG and Hadoop must process all of the edges to satisfy such a query.

Disk Usage

For the input data set of 1.3GB, it takes 560MB to store it in Vertica’s compressed storage. In comparison, storing it in HDFS consumes more space than the raw data size.

At run-time, here is the peak disk usage among all 3 solutions in a 4-node cluster (remember lzo was not enabled for Hadoop and PIG – turning it on would reduce disk usage but likely hurt performance).

Given the huge differences in disk usage and thus I/O work, along with other advantages outlined above it should come as no surprise that the Vertica solution is much faster.

Join Optimization

As we mentioned earlier, the Hadoop solution does not optimize for join performance. Both Vertica and PIG were able to take advantage of a relatively small edges table that fit in memory (100s of billions or more edges can fit in memory when distributed over 10s or 100s of machines), with a hash join implementation.

For PIG, the join ordering needs to be explicitly specified. Getting this ordering wrong may carry a significant performance penalty. In our study, the PIG solution with the wrong join ordering is 1.5x slower. The penalty is likely even higher with a larger data set, where the extra disk I/O incurred in join processing can no longer be masked by sufficient RAM. To further complicate the matter, the optimal join ordering may depend on the input data set (e.g. whether the input graph is dense or not). It is infeasible for users to manually tweak the join ordering before submitting each PIG job.

In comparison, the Vertica columnar optimizer takes care of join ordering as well as many other factors crucial to optimizing for the job run-time.

The Right Tool for the Job

Many people get significant value out of Hadoop and PIG, including a number of Vertica’s customers who use these tools to work with unstructured or semi-structured data – typically before loading that data into Vertica. The question is which tool is best suited to solve your problem. With User Defined Functions, Aggregates, Load, et cetera available or coming soon in Vertica, the lines are becoming blurred, but when it comes to performance the choice is crystal clear.

In the case of triangle counting as we presented above, the Vertica solution enjoys the following advantages over Hadoop and PIG:

  • Ease of programming and maintenance, in terms of both ensuring correctness (the Vertica SQL solution is simpler) and achieving high performance (the Vertica optimizer chooses the best execution plan)
  • Compressed storage
  • Orders of magnitude faster query performance

Do Try this at Home

It is a relatively safe experiment (unlike slicing a grape in half and putting it in the microwave – don’t try that one at home). We’ve uploaded all three solutions to GitHub. Feel free to run your own experiments and improve on our work. As it stands the project includes a build.xml file which runs the Hadoop and PIG solutions in standalone mode – the project README file describes these targets and more in detail. With a little more work one can configure a Hadoop cluster and run the experiments in distributed mode, which is how we ran the experiments described above.

It’s a little more difficult to run the tests if you are not currently a Vertica customer, but we do have a free trial version of the Vertica Analytics Platform software.

Acknowledgements

Many thanks to Rajat Venkatesh for writing the PIG script (though I already thanked him with a lunch) and Mingsheng Hong for his suggestions, ideas and edits.
