# Counting Triangles

by Stephen Walkauskas

Recently I’ve heard from or read about people who use Hadoop because their analytic jobs can’t achieve the same level of performance in a database. In one case, a professor I visited said his group uses Hadoop to count triangles “because a database doesn’t perform the necessary joins efficiently.”

Perhaps I’m being dense but I don’t understand why a database doesn’t efficiently support these use cases. In fact, I have a hard time believing they wouldn’t perform better in a columnar, MPP database like Vertica – where memory and storage are laid out and accessed efficiently, query jobs are automatically tuned by the optimizer, and expression execution is vectorized at run-time. There are additional benefits when several similar jobs are run, or when data is updated and the same job is re-run multiple times. Of course, performance isn’t everything; ease of use and maintainability are important factors, and Vertica excels at those as well.

Since the “gauntlet was thrown down”, to steal a line from Good Will Hunting, I decided to take up the challenge of computing the number of triangles in a graph (and include the solutions in GitHub so others can experiment – more on this at the end of the post).

### Problem Description

A triangle exists when a vertex has two adjacent vertexes that are also adjacent to each other. Using friendship as an example: If two of your friends are also friends with each other, then the three of you form a friendship triangle. How nice. Obviously this concept is useful for understanding social networks and graph analysis in general (e.g. it can be used to compute the clustering coefficient of a graph).

Let’s assume we have an undirected graph with reciprocal edges, so there’s always a pair of edges ({e1,e2} and {e2,e1}). We’ll use the following input for illustration (reciprocal edge is elided to condense the information):

| source | destination |
| ------ | ----------- |
| Ben    | Chuck       |
| Ben    | Stephen     |
| Chuck  | Stephen     |
| Chuck  | Rajat       |
| Rajat  | Stephen     |
| Andrew | Ben         |
| Andrew | Matt        |
| Andrew | Pachu       |
| Matt   | Pachu       |
| Chuck  | Lyric       |

A little ascii art to diagram the graph might help.

I know you can quickly count the number of triangles in a graph this small. I’m very proud of you, but imagine there are hundreds of millions of vertexes and tens of billions of edges. How long would it take you to diagram that graph? How much longer to count all of the triangles? And what if your 2-year-old daughter barges in counting “one, two, three, four, …” and throws off your count?
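For a graph this small, the count is easy to sketch by brute force. Here’s a minimal illustration in Python (the edge list below is the example graph, including the Andrew–Pachu edge that appears in the triad tables later):

```python
from itertools import combinations

# Undirected example graph from the post (one direction per edge;
# the reciprocal edges are implied).
edges = [
    ("Ben", "Chuck"), ("Ben", "Stephen"), ("Chuck", "Stephen"),
    ("Chuck", "Rajat"), ("Rajat", "Stephen"), ("Andrew", "Ben"),
    ("Andrew", "Matt"), ("Andrew", "Pachu"), ("Matt", "Pachu"),
    ("Chuck", "Lyric"),
]

# Adjacency sets make the "are my two friends also friends?" test O(1).
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# A triangle is a vertex plus two adjacent vertexes that are adjacent to
# each other; checking only pairs of neighbors of the "lowest" vertex
# counts each triangle exactly once.
triangles = sum(
    1
    for v in neighbors
    for a, b in combinations(sorted(neighbors[v]), 2)
    if v < a and v < b and b in neighbors[a]
)
print(triangles)  # 3
```

The `v < a and v < b` test is the same trick the solutions below use: anchor each triangle at its “lowest” vertex so it is counted once rather than six times.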

Below we present a few practical solutions for large scale graphs and evaluate their performance.

### The Hadoop Solution

Let’s first consider the Hadoop approach to solving this problem. The MapReduce (MR) framework implemented in Hadoop allows us to distribute work over many computers to get the count faster. The solution we describe here is a simplified version of the work from Yahoo Research. You can download our solution here.

#### Overview

The solution involves a sequence of 3 MR jobs. The first job constructs all of the triads in the graph. A triad is formed by a pair of edges sharing a vertex, called its apex. It doesn’t matter which vertex we choose as the apex of the triad, so for our purposes we’ll pick the “lowest” vertex (e.g. friends could be ordered alphabetically by their names). The Yahoo paper makes a more intelligent choice of “lowest” – the vertex with the smallest degree. However that requires an initial pass of the data (and more work on my part) so I skipped that optimization and did so consistently for all solutions to ensure fairness.
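The “lowest vertex as apex” convention amounts to orienting every edge before the first job runs. A one-line sketch of that normalization (in Python, purely for illustration):

```python
# Orient each undirected edge so its "lowest" endpoint comes first; that
# endpoint serves as the apex when the edge participates in a triad.
edges = [("Stephen", "Ben"), ("Chuck", "Ben"), ("Chuck", "Stephen")]
oriented = [tuple(sorted(e)) for e in edges]
print(oriented)  # [('Ben', 'Stephen'), ('Ben', 'Chuck'), ('Chuck', 'Stephen')]
```

The Yahoo variant would sort by vertex degree here instead of by name, at the cost of an extra pass over the data.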

These triads and the original edges are emitted as rows by the first MR job, with a field added to distinguish the two. Note that the output of the first job can be quite large, especially in a dense graph. Such output is consumed by the second MR job, which partitions the rows by either the unclosed edge, if the row is a triad, or the original edge. A partition has n triangles if it contains an original edge and n triads. A third, trivial MR job counts the triangles produced by the second job, to produce the final result.

#### Details

Let’s look at each MR job in detail. The map part of the first job generates key-value pairs for each triad such that the apex is the key and the value is the edge. In our small example the map job would emit the following rows.
| key    | value          |
| ------ | -------------- |
| Andrew | Andrew, Matt   |
| Andrew | Andrew, Pachu  |
| Andrew | Andrew, Ben    |
| Matt   | Matt, Pachu    |
| Ben    | Ben, Chuck     |
| Ben    | Ben, Stephen   |
| Chuck  | Chuck, Rajat   |
| Chuck  | Chuck, Lyric   |
| Chuck  | Chuck, Stephen |
| Rajat  | Rajat, Stephen |

For each apex partition, the reduce job emits the original edges and all of the corresponding triads (there are sum over j = 1..d of (j-1), i.e. d(d-1)/2, triads per partition, where d is the degree of the apex vertex). For each original edge, the key is the edge itself and the value is “edge”. For each triad, the key is the unclosed edge, that is, the edge needed to complete the triangle, and the value is “triad”. (The actual code uses “0” for the edge value and “1” for the triad value for run-time efficiency.)

The rows corresponding to the triads emitted by this reduce job in our simple example are described below in the “key” and “value” columns (the original edges are also emitted by the reduce job but elided below for brevity). For presentation purposes we added a third column “triad content”. That column is not produced by the actual reduce job.
| key            | value | triad content                    |
| -------------- | ----- | -------------------------------- |
| Ben, Matt      | triad | {Andrew, Ben}, {Andrew, Matt}    |
| Ben, Pachu     | triad | {Andrew, Ben}, {Andrew, Pachu}   |
| Matt, Pachu    | triad | {Andrew, Matt}, {Andrew, Pachu}  |
| Chuck, Stephen | triad | {Ben, Chuck}, {Ben, Stephen}     |
| Lyric, Rajat   | triad | {Chuck, Lyric}, {Chuck, Rajat}   |
| Lyric, Stephen | triad | {Chuck, Lyric}, {Chuck, Stephen} |
| Rajat, Stephen | triad | {Chuck, Rajat}, {Chuck, Stephen} |

The input to the next reduce job is partitioned such that the unclosed edge of each triad is in the same partition as its corresponding original edge, if any. The reduce job just needs to check for the existence of an original edge in that partition (i.e., a row with value set to “edge”). If it finds one, all of the triads in the partition are closed as triangles. The reduce job sums up all of the closed triads and on finalize emits a count. A trivial final MR job aggregates the counts from the previous job.
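The whole three-job pipeline can be sketched end to end in a few lines of Python. This is a single-process simulation of the logic, not Hadoop code; the variable names are mine:

```python
from collections import defaultdict
from itertools import combinations

# One row per undirected edge; vertexes in each pair are sorted so the
# "lowest" endpoint can serve as a triad's apex.
edges = [tuple(sorted(e)) for e in [
    ("Ben", "Chuck"), ("Ben", "Stephen"), ("Chuck", "Stephen"),
    ("Chuck", "Rajat"), ("Rajat", "Stephen"), ("Andrew", "Ben"),
    ("Andrew", "Matt"), ("Andrew", "Pachu"), ("Matt", "Pachu"),
    ("Chuck", "Lyric"),
]]

# Job 1 map: key each edge by its apex (lowest vertex).
by_apex = defaultdict(list)
for src, dst in edges:
    by_apex[src].append((src, dst))

# Job 1 reduce: per apex, emit each original edge plus every triad's
# unclosed edge; a vertex of degree d yields d*(d-1)/2 triads.
emitted = [(edge, "edge") for edge in edges]
for apex, apex_edges in by_apex.items():
    for (_, a), (_, b) in combinations(apex_edges, 2):
        emitted.append((tuple(sorted((a, b))), "triad"))

# Job 2: partition by key; a partition containing an original edge
# closes all of its triads into triangles. Job 3 just sums the counts.
partitions = defaultdict(list)
for key, value in emitted:
    partitions[key].append(value)
triangles = sum(
    vals.count("triad") for vals in partitions.values() if "edge" in vals
)
print(triangles)  # 3
```

Running this on the example emits the same seven triads as the table above and closes three of them into triangles.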

There – we’ve used MapReduce to count the number of triangles in a graph. The approach isn’t trivial but it’s not horribly complex either. And if it runs too slowly we can add more hardware: each machine does less work and we get our answer faster.

I have to admit it took me much longer than I estimated to implement the Hadoop solution. Part of the reason is that I’m new to the API, which is exacerbated by the fact that there are currently two APIs, one of them deprecated and the other incomplete, forcing use of portions of the deprecated API. Specifically, the examples I started with were unfortunately based on the deprecated API, and when I ported to the newer one I ran into several silly but somewhat time-consuming issues (like mapred’s version of Reducer.reduce taking an Iterator while mapreduce’s version takes an Iterable – they look similar to the human eye, but the compiler knows that a method that takes an Iterator should not be overridden by one that takes an Iterable). Learning curve aside, there was a fair chunk of code to write. The simple version is >200 lines. In a more complex version I added a secondary sort to the MR job that computes triads. Doing so introduced several dozen lines of code (most of it brain-dead stuff like implementing a Comparable interface). Granted, a lot of the code is cookie-cutter or trivial, but it still needs to be written (or cut-and-pasted and edited). In contrast, adding a secondary sort column in SQL is a mere few characters of extra code.

### The PIG Solution

Rajat Venkatesh, a colleague of mine, said he could convert the algorithm to a relatively small PIG script and he wagered a lunch that the PIG script would outperform my code. He whipped up what was eventually a 10 statement PIG script that accomplished the task. When we get to the performance comparison we’ll find out who got a free lunch.

Here’s the PIG solution, much simpler than coding MR jobs by hand. We used PIG 0.8.1. We made several passes over the script to optimize it, following the PIG Cookbook. For example, we rearranged the join order and put the larger table last (I’m probably not giving too much away by mentioning that Vertica’s optimizer uses a cost model which properly chooses join order). We also tried several values for default_parallel and mapreduce.job.maps (and we changed the corresponding parameter in mapred-site.xml as well, just to be safe). We did not enable lzo compression for two reasons. First, considering the hardware used for the experiment (large RAM – plenty of file system cache, high throughput network), the CPU tax incurred by compression was more likely to hurt performance than help in this case. Second, one website listed 7 steps to get the compression working but the 2nd step had several steps itself, so I gave up on it.

### The Vertica Solution

Can you count the number of triangles in a graph using a database? Of course. First create an “edges” table and load the graph. Vertica can automate the decision about how to organize storage for the table – something called projections specify important characteristics of physical storage such as sort order, segmentation, encoding and compression. In this case we simply tell Vertica how to distribute data among nodes in our cluster (Vertica calls this segmenting). Alternatively the Vertica Database Designer can be used to automate projection design. The following statements create our table and load data.

We’ve got the data loaded and stored in an efficient way. If we need to run more jobs with the same data later we won’t incur the load cost again. Likewise, if we need to modify the data we only incur work proportional to the change. Now we just need a horribly complex hack to count triangles. Take a deep breath, stretch, get a cup of coffee, basically do what you have to do to prepare your brain for this challenge. Ok, ready? Here it is:

Good, you didn’t run away screaming in horror. If we ignore the less than predicates the query is simply finding all triplets that form a cycle, v1 -> v2 -> v3 -> v1. The less than predicates ensure we don’t count the same triangle multiple times (remember our edges are reciprocal and we only need to consider triangles with the “lowest” vertex at the apex).

That’s it! A single, 4-liner query. Of course you’re interested in what the Vertica database does under the covers and how its performance, disk utilization and scalability compare with those of Hadoop and PIG.

### Performance Study

The publicly available LiveJournal social network graph (http://snap.stanford.edu/data/soc-LiveJournal1.html) was used to test performance. It was selected for its public availability and because its modest size permitted relatively quick experiments. The modified edges file (in the original file not every edge is reciprocated) contained 86,220,856 edges, about 1.3GB in raw size. We used HDFS dfs.replication=2 (replication=1 performed worse – fewer map jobs were run, almost regardless of the mapreduce.job.maps value). Experiments were run on between 1 and 4 machines, each with 96GB of RAM, 12 cores and a 10Gbit interconnect.

#### Run-Time Cost

All solutions are manually tuned to obtain the best performance numbers. For the Hadoop and PIG solutions, the number of mappers and reducers as well as the code itself were tweaked to optimize performance. For the Vertica solution, out-of-the-box Vertica is configured to support multiple users; default expectation is 24 concurrent queries for the hardware used. This configuration was tweaked to further increase pipeline parallelism (equivalent configuration settings will be on by default in an upcoming release). The following chart compares the best performance numbers for each solution.

PIG beat my Hadoop program, so my colleague who wrote the PIG script earned his free lunch. One major factor is PIG’s superior join performance – it uses a hash join. In comparison, the Hadoop solution employs a join method very close to a sort-merge join.

Vertica’s performance wasn’t even close to that of Hadoop – thankfully. It was much much better. In fact Vertica ate PIG’s and Hadoop’s lunch – its best time is 22x faster than PIG’s and 40x faster than the Hadoop program (even without configuration tweaks Vertica beats optimized Hadoop and PIG programs by more than a factor of 9x in comparable tests).

Here are a few key factors in Vertica’s performance advantage:

• Fully pipelined execution in Vertica, compared to a sequence of MR jobs in the Hadoop and PIG solutions, which incurs significant extra I/O. We quantify the differences in how the disk is used among the solutions below in the “disk usage” study.

• Vectorization of expression execution, and the use of just-in-time code generation in the Vertica engine.

• More efficient memory layout, compared to the frequent Java heap memory allocation and deallocation in Hadoop / PIG.

Overall, Hadoop and PIG are free in terms of software, but the hardware is not included. With a 22x speed-up, Vertica’s performance advantage effectively equates to a 95% discount on hardware. Think about that. You’d need over 1000 nodes running the PIG job to equal the performance of just 48 Vertica nodes, which is a rack and a half of the Vertica appliance.

Finally, consider what happens when the use case shifts from counting all of the triangles in a graph to counting (or listing) just the triangles that include a particular vertex. Vertica’s projections (those things that define the physical storage layout) can be optimized such that looking up all of the edges with a particular vertex is essentially an index search (and once found, the associated edges are co-located on disk – an important detail which anyone who knows the relative cost of a seek versus a scan will appreciate). This very quickly whittles e1 and e3 down to relatively few rows, which can participate in a merge join with e2. All in all, a relatively inexpensive operation. On the other hand, PIG and Hadoop must process all of the edges to satisfy such a query.
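To make the shifted use case concrete, here is a sketch of the per-vertex variant (again illustrative Python over SQLite, with the same assumed schema): anchoring e1 at a single vertex lets the predicates prune most of the work up front.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (source TEXT, dest TEXT)")
pairs = [
    ("Ben", "Chuck"), ("Ben", "Stephen"), ("Chuck", "Stephen"),
    ("Chuck", "Rajat"), ("Rajat", "Stephen"), ("Andrew", "Ben"),
    ("Andrew", "Matt"), ("Andrew", "Pachu"), ("Matt", "Pachu"),
    ("Chuck", "Lyric"),
]
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [p for a, b in pairs for p in ((a, b), (b, a))])

# Triangles through one vertex: pin e1 to that vertex and order the
# opposite edge (e2) so each triangle is still counted exactly once.
(count,) = con.execute("""
    SELECT COUNT(*)
    FROM edges e1
    JOIN edges e2 ON e2.source = e1.dest
    JOIN edges e3 ON e3.source = e2.dest AND e3.dest = e1.source
    WHERE e1.source = ?
      AND e2.source < e2.dest
""", ("Chuck",)).fetchone()
print(count)  # Chuck participates in 2 triangles
```

With a projection sorted on `source`, the `e1.source = ?` predicate is exactly the index-search-style lookup described above.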

#### Disk Usage

For the input data set of 1.3GB, it takes 560MB to store it in Vertica’s compressed storage. In comparison, storing it in HDFS consumes more space than the raw data size.

At run-time, here is the peak disk usage among all 3 solutions in a 4-node cluster (remember lzo was not enabled for Hadoop and PIG – turning it on would reduce disk usage but likely hurt performance).

Given the huge differences in disk usage, and thus I/O work, along with the other advantages outlined above, it should come as no surprise that the Vertica solution is much faster.

#### Join Optimization

As we mentioned earlier, the Hadoop solution does not optimize for join performance. Both Vertica and PIG were able to take advantage of a relatively small edges table that fit in memory (100s of billions or more edges can fit in memory when distributed over 10s or 100s of machines), with a hash join implementation.

For PIG, the join ordering needs to be explicitly specified. Getting this ordering wrong may carry a significant performance penalty. In our study, the PIG solution with the wrong join ordering is 1.5x slower. The penalty is likely even higher with a larger data set, where the extra disk I/O incurred in join processing can no longer be masked by sufficient RAM. To further complicate the matter, the optimal join ordering may depend on the input data set (e.g. whether the input graph is dense or not). It is infeasible for users to manually tweak the join ordering before submitting each PIG job.

In comparison, the Vertica columnar optimizer takes care of join ordering as well as many other factors crucial to optimizing for the job run-time.

### The Right Tool for the Job

Many people get significant value out of Hadoop and PIG, including a number of Vertica’s customers who use these tools to work with unstructured or semi-structured data – typically before loading that data into Vertica. The question is which tool is best suited to solve your problem. With User Defined Functions, Aggregates, Load, et cetera available or coming soon to Vertica, the lines are becoming blurred, but when it comes to performance the choice is crystal clear.

In the case of triangle counting as we presented above, the Vertica solution enjoys the following advantages over Hadoop and PIG:

• Ease of programming and maintenance, in terms of both ensuring correctness (the Vertica SQL solution is simpler) and achieving high performance (the Vertica optimizer chooses the best execution plan).

• Compressed storage.

• Orders of magnitude faster query performance.

### Do Try this at Home

It is a relatively safe experiment (unlike slicing a grape in half and putting it in the microwave – don’t try that one at home). We’ve uploaded all three solutions to GitHub. Feel free to run your own experiments and improve on our work. As it stands the project includes a build.xml file which runs the Hadoop and PIG solutions in standalone mode – the project README file describes these targets and more in detail. With a little more work one can configure a Hadoop cluster and run the experiments in distributed mode, which is how we ran the experiments described above.

It’s a little more difficult to run the tests if you are not currently a Vertica customer, but we do have a free trial version of the Vertica Analytics Platform software.

### Acknowledgements

Many thanks to Rajat Venkatesh for writing the PIG script (though I already thanked him with a lunch) and Mingsheng Hong for his suggestions, ideas and edits.

## 45 Responses

1. Colin Loghin says:

Good job Stephen. I am looking forward to trying the experiment at home! I enjoy those runs where vertica eats others’ lunch.

2. Brock says:

What happens with the grape?

3. Wkinkel says:

I would like to see this comparison adding H-Base

4. Ken Egozi says:

Great experiment. I’m going over the github code, and am downloading the dataset to experiment now. Got a question regarding the dataset – if it’s 1.3Gb, why would you need 48 cores and 384Gb memory? couldn’t it be computed using some in-memory single process JVM structures? or a straightforward old-time RDBMS query on a single machine. The hadoop-ers will say you’d need a far larger dataset (100s GBs or more) to really see where hadoop shines, and it would be interesting to see how vertica stands vs hadoop on these really large datasets

• Stephen says:

Hi Ken,

That is a good observation. Though the input is small it generates a reasonable amount of work – enough to keep Hadoop busy for ~1 hour on 4 machines – a convenient characteristic when you want to provide input data. There are graph generators available but I didn’t find any with a license suitable for my purposes.

I mentioned that Vertica and PIG benefited from the relatively small input by using a hash join. When I forced Vertica to use a sort-merge-join (something it will do automatically when the inner relation doesn’t fit in memory) it was still >25x faster than Hadoop.

I plan to follow this work with a sort experiment. That should address the concerns you raised. Of course I’d be happy to see others run this experiment on a larger graph.

Stay tuned for future posts.

Cheers,
Stephen

• Guy Bayes says:

Stephen, want to clarify:

if I look at the graph above, your runtime with PIG is 97 seconds on the 4-node cluster, right?

• Stephen says:

No, times (in seconds) are above the corresponding lines. On 4 nodes: Hadoop=3900, PIG=2151 and Vertica=97.

5. Stephen says:

You’re welcome to try that experiment. I’m curious about your ideas to improve performance using HBase.

6. Dimi says:

pro tip: if you do the same test using 1.3MB of data, Vertica’s solution will be 1000x faster than hadoop ;)

7. Guy Bayes says:

You should have used Giraph, Pig is not the right hadoop tool for this

http://incubator.apache.org/giraph/

It’s well known that standard hadoop map/reduce does not perform well against graph problems.

Also in general, as Ken says, hadoop is for embarrassingly parallel problems; 1.3 GB does not meet that criterion.

• Stephen says:

If you provide a Giraph solution I’m happy to add it to github and report performance numbers.

Actually the problem is embarrassingly parallel, hence the near-linear decrease in time with the number of machines. As noted, the input is small but the intermediate result is a reasonable size – otherwise why would Hadoop spin its wheels for an hour (and consume 160 GB)?

• Guy Bayes says:

I will look into doing a giraph solution

160 gig is still teeny tiny from a hadoop perspective; you really need to be in the 10s of terabytes for it to shine. 160 gig can still fit into main memory on a single machine. Honestly, to process that data volume you would be better off with neither hadoop nor vertica but something like neo4j.

The hadoop implementation has a lot of abstraction layers and overhead; you are probably better off running in local mode on one machine than on a 4-machine cluster.

As for why is it taking an hour, it’s probably thrashing, spending a lot of time reading and writing the same set of data over and over again. Basic hadoop map/reduce is not at all clever in the use of memory, one of the places that rdbm’s shine.

• Stephen says:

Thanks Guy. I’d appreciate a Giraph solution. I’m sure others would as well.

A significant difference between Hadoop and PIG/Vertica is that Hadoop sorts the intermediate data (the triads); the other two do not. While unfair to some degree, a very large graph will likely fit in memory when distributed over many machines – the triads, however, may not. Therefore, it is realistic to expect real-world graphs to benefit from this advantage. Hence, flexible, optimized plan execution is important.

When I artificially force Vertica to behave as though it is memory constrained and execute a sort-merge-join Vertica’s performance advantage is still notable (>25x). Of course all of the 4 node experiments comfortably ride in fs cache.

I’ll put together an experiment that highlights the strengths of Hadoop for another blog.

8. Guy Bayes says:

Stephen, you might also be interesting in this google tech talk with discussion around graph problems in general and what makes them difficult

I’m talking from memory here, so don’t shoot me if I get it wrong, but my recollection is that in general you have two requirements for an efficient graph traversal, neither of which is that hard individually but which together are difficult to meet.

In general you want any vertex in the graph to be able to do a constant time lookup to any other vertex in the graph since any vertex can in theory be able to connect to any other. This makes any solution derived from indexing problematic, since index lookups are Olog(n).

Normally the way to solve this would be to hash your data into memory; however, you also need to be able to support a graph that does not fit into memory, especially given that once you start including degrees of indirection your memory structure explodes in volume.

These two requirements make solving this problem tricky to do efficiently in general (for both rdbms and map/reduce engines), since both rely on large sequential disk scans and pre-fetching algorithms to optimize I/O-to-memory transfer, and what you are effectively doing is random access.

Also the caching algorithms for expiring objects from memory are suboptimal, since they are generally variants of LRU algorithms, and all the vertexes are sort of in use, so there is not a clear distinction between hot and cold memory objects.

There are some papers if you want to know more, try searching “google pregel”

You need new algorithms and methods of computing to solve these problems efficiently (example http://en.wikipedia.org/wiki/Bulk_synchronous_parallel)

The other thing that is interesting is that social graphs tend to have some unique properties with regards to distribution of edges and vertices: they tend to clump into smaller sub-graphs that are then linked by a few super nodes with many, many connections (the small world phenomenon). So your algorithms need to be specially optimized for this type of graph (the super nodes cause lots of performance issues when doing distributed joins).

9. thejas m says:

Hadoop pays a price for being fault tolerant, it writes between phases to disk and to hdfs so that failure of a single node does not result in whole query having to be re-run, or a slow node from slowing down the whole query. It does so because it is designed to run on 100s of nodes where the probability of that happening is much higher.

But the hadoop/pig performance can be much better, and much closer, if you make more optimal use of pig. Without doing these optimizations, I think the comparison is misleading.

1. The pig query that you have posted does not do a hash-join in the traditional sense. There will be a MR job for each join, where the map partitions the data on the join key and the reduce does a sort-merge-join; that does not take advantage of the fact that the table can fit into memory. To use a hash-join (which uses an in-memory hash table), you need to add “using ‘replicated’”. That will result in a map-only join.

2. The experiment with Vertica is storing the data such that it is stored optimally for this join (partitioned and sorted?). You can do that with hadoop as well. In pig you can store the data sorted on the join keys and use the zebra storage format, and then add “using ‘merge’” to the join query. For such a join, pig will run a quick sampling MR job to find the first record in each block and then do the join in a map-only job (it saves the cost of reduce and shuffle/sort). This will improve performance by a few times.

3. Pig trunk has changes to speed up algebraic aggregation functions by up to 50%. Using that, the group-by should be almost twice as fast. (https://issues.apache.org/jira/browse/PIG-2228)

4. PigStorage is not an efficient storage format; it stores data in a human-readable text format (tsv). Using the zebra format (as suggested in no. 2 above) will reduce the data size and also result in significant savings in (de)serialization costs.

5. LZO or snappy can be used for light weight compression. It will specially be useful to enable use of this for storing pig intermediate results. (I am assuming that what you say about entire input being in FS cache is true when pig query starts, otherwise compression there is also likely to help.)

Also, comparing uncompressed size of file in hadoop with compressed size in vertica, without measuring the performance degradation that you think will happen in hadoop/pig is not meaningful.
snappy compression library from google has been released with apache compatible license, so enabling compression in hadoop projects will be easier in future.

Thejas


• Stephen says:

Fault tolerance is certainly an important consideration when choosing the right technology. In this example the cost of query fault tolerance is excessive. Vertica could re-run a failed query multiple times before the PIG job completes. This will not always be true. Users need to weigh this trade-off for their application (and the hardware they intend to run it on). A Vertica cluster is fault tolerant but in-progress queries may need to be restarted after a node fails.

Vertica is currently run on clusters with hundreds of nodes. Each machine is dedicated to Vertica, so there’s never any performance skew caused by rogue jobs. Vertica monitors and manages the resources on each of the machines in the cluster. On very, very rare occasions (I can recall two in the three years I’ve been at Vertica) we observed performance skew resulting from hardware problems.

I appreciate your performance tuning tips and I’m curious to measure their impact. Feel free to improve the source in github, if you’re so inclined. That would save me some time, reduce any chance of error and speed up the turnaround. I realize everyone is busy; if you don’t have a chance to do so, I’ll get to it, eventually.

As per #2, Vertica’s storage was not optimized for this experiment. The projection is sorted (by default) by “source, dest” and segmented / distributed by “hash(source,dest)”. This permits neither a merge-join nor an identically segmented first join. Nor does it allow for an identically segmented second join (the outer (larger) relation is re-segmented). Only when the intermediate result is sorted does the second join benefit from this storage (it can use a MJ). However, sorting the intermediate result dominates execution time therefore the time to sort the original edges table would be negligible (hence the benefit is negligible). So I’m skeptical that your proposed change will have any impact. But give me a modified script to test and we’ll see.

• Guy Bayes says:

The pig job will never run as fast as the Vertica job, I don’t care how you tweak it. Though send me code and I am happy to benchmark; our dev cluster is close to the specs they used for this test.

It’s close to a worse case scenario for map/reduce.

- The table is medium-sized: not excessively big, but too big for map/reduce to fit in memory
- it’s a pure join with no transformation, and excessively large intermediate data that necessitates a lot of shuffle and reduce
- the same data table is used three times, so there’s lots of benefit from caching

Different tools excel at different things. If you understand the architecture of two different platforms it is usually possible to design a problem where one will overcome the other.

For instance, I have not yet seen Vertica post a TeraSort result that beats Hadoop.

http://sortbenchmark.org/

The important thing is to understand your workload, the tools you have at hand, and how best to use a combination of tools to solve a given problem in the most cost-effective manner.

It’s also important to remember that Hadoop is more than map/reduce: it is a toolkit that lets you build any data processing platform you like.

That is why exercises like this are useful (so thank you, Stephen, for organizing it), and why I am interested in seeing how Giraph performs.

I’d be really interested, though, to see how fast this thing would run on a single Neo4j machine with, say, 512GB of RAM (if there are any Neo4j people listening) (-:

Also, this is how you would write the code in Hive. Why do you RDBMS people always forget about Hive? I got it to run in 87 minutes without any tweaking:

select count(*)
from
  (select * from edges where source < dest) e1 join
  (select * from edges where source < dest) e2 on (e1.dest = e2.source) join
  (select * from edges where dest < source) e3 on (e2.dest = e3.source)
where e3.dest = e1.source;
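(Editor’s note: the inequality predicates in the query above count each triangle exactly once by forcing a canonical vertex order. The same idea can be sketched in a few lines of Python; the edge list below is the toy friendship graph from the post, not the benchmark data.)

```python
from itertools import combinations

# Toy friendship graph from the post (one direction per pair; the
# reciprocal edge is implied, as in the post's input).
edges = {("Ben", "Chuck"), ("Ben", "Stephen"), ("Chuck", "Stephen"),
         ("Chuck", "Rajat"), ("Rajat", "Stephen"), ("Andrew", "Ben"),
         ("Andrew", "Matt"), ("Matt", "Pachu"), ("Chuck", "Lyric")}

# Build a symmetric adjacency set (undirected graph with reciprocal edges).
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# For each vertex, test every pair of its neighbors. Requiring
# v < a < b counts each triangle exactly once, just like the
# source < dest predicates in the SQL.
triangles = 0
for v, nbrs in adj.items():
    for a, b in combinations(sorted(nbrs), 2):
        if v < a and b in adj[a]:
            triangles += 1
print(triangles)  # -> 2 (Ben-Chuck-Stephen and Chuck-Rajat-Stephen)
```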

• Stephen says:

Thanks for providing and benchmarking a Hive solution. And sharing your ideas in general.

We plan to run a TeraSort experiment. I’m sure there will be another blog post coming soon.

• thejas m says:

The reliability of the hardware on your clusters is very different from what I have seen with Hadoop clusters. Do these machines have something better than commodity hardware (RAID disks, etc.)?

You are right about #2; I failed to note that the join is on two different keys, so a merge join can’t be done in this case. To do a merge join here, for the first join, there would have to be two copies of the data, one sorted by each of the join columns.
Where do you think the large performance difference in your experiment comes from? I haven’t tried running the query, so I don’t know how large the intermediate sizes are for the MR jobs. I am guessing Vertica is able to hold the input in the 96GB of RAM available on each of the machines and avoid writing to disk. Is it possible for you to try the same query with much larger input, which would force the intermediate data to be written to disk? I.e., use tens of terabytes of data instead of 1GB?

• Stephen says:

Yes, most are running RAID, but that isn’t so exotic these days. And many are running on 3-4 year old hardware that has been constantly stressed over its lifetime.

Certainly some of the performance advantage is due to the fact that Vertica pipelines execution.

Give me an optimized Pig script and I’ll re-run the experiment. I’ll try bigger data as well. 1GB took 35 minutes; I don’t think I’ll be patient enough to wait for an experiment with even 100GB. Keep in mind the intermediate Pig result for 1GB was 100GB (I don’t have 1PB of storage on these 4 machines).

• thejas m says:

This query can also be re-written using Pig’s COGROUP. Doing a join is like doing a flatten on a bag coming out of a cogroup operation, which is unnecessary in this case.

TRIANGLE_CGROUP = cogroup CANON_EDGES_1 by (source, dest),
                          OPEN_EDGES by (CANON_EDGES_1::source,
                                         CANON_EDGES_2::dest);

TRIANGLES     = foreach TRIANGLE_CGROUP generate SIZE(CANON_EDGES_1) as num_triangles;

CONST_GROUP   = group TRIANGLES ALL parallel 1;
FINAL_COUNT   = foreach CONST_GROUP generate SUM(TRIANGLES);

(thanks to Daniel Dai for this tip)

• Stephen says:

OK. Once you Pig gurus agree on the best script, give it to me, with a pointer to the easiest instructions for enabling compression, and I’ll re-run the experiment.

11. [...] Hadoop vs PIG vs Vertica for Counting Triangles: [...]

13. JZS says:

There are a lot of comments saying — oh, forget it, this isn’t the ideal problem or data for Hadoop. Wasn’t that also the point of the article? I agree with Stephen — I’ve heard tons of people say “oh, we need Hadoop, because a DB can’t cut it for our workload,” when their workload looks a LOT like this (under 200 GB of peak data usage, fairly structured data, queries that can be written in SQL reasonably well). Look at the example problem in the Orbitz Hadoop presentation from just a few weeks ago (http://assets.en.oreilly.com/1/event/61/Distributed%20Data%20Analysis%20with%20Hadoop%20and%20R%20Presentation.pdf). They’re doing simple aggregates on 12 GB of structured data. It’s just an example presentation, but many, many people are taking these sorts of examples to heart and applying Hadoop to similar problems where a DBMS would be a much better fit in terms of both performance and developer time.

• Stephen says:

Exactly. And let’s not forget that a number of research papers and white papers (the Yahoo! Research paper referenced by this post for one) tout the performance benefits of using Hadoop to solve this problem.

• Guy Bayes says:

If you read it, it states

“Regular DBMS
– How to fit square peg in a round hole
– No in-line R calls from SQL but commercial efforts are underway”

This guy is an R coder. Does Vertica support in-database R?

This may come as a shock to you, but a lot of people do not like SQL very much and are never going to code in it, no matter how much you wish they would.

One of the reasons Hadoop is popular is that it gives people what they want rather than what vendors think they should have.

14. Anonymous says:

I will post this as a simple thank-you note, rather than a comment on the comparison itself. The author (Stephen Walkauskas) has written it with all the right caveats. One simple question, though: what is the performance of a straight DB solution, say written directly in Oracle or SQL Server, in terms of the following query:

select count(*)
from edges e1, edges e2, edges e3
where e1.dest = e2.source and e1.source < e2.source
  and e2.dest = e3.source and e2.source < e3.source
  and e3.dest = e1.source
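(Editor’s note: the three inequality predicates above walk each triangle in one fixed direction, so every triangle is counted exactly once. A naive nested-loop rendering of the same join logic in Python, on a made-up four-vertex graph with reciprocal edges, illustrates this.)

```python
# Toy reciprocal edge list: triangle a-b-c plus a dangling edge c-d.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"),
         ("a", "c"), ("c", "a"), ("c", "d"), ("d", "c")]

# Literal translation of the three-way self-join and its predicates.
count = sum(
    1
    for s1, d1 in edges
    for s2, d2 in edges
    for s3, d3 in edges
    if d1 == s2 and s1 < s2   # e1.dest = e2.source and e1.source < e2.source
    and d2 == s3 and s2 < s3  # e2.dest = e3.source and e2.source < e3.source
    and d3 == s1              # e3.dest = e1.source
)
print(count)  # -> 1 (the a-b-c triangle, counted once)
```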

• Stephen says:

I think they’d perform reasonably. This example doesn’t do a good job showcasing the advantages of Vertica. Generally optimizations based on storage and segmentation (distribution of data in a cluster) have significant performance impact on large scale analytic queries but not in this example (as noted in comments below).

I don’t have a license for either of the DBMSs you mentioned. Not to mention that an Oracle license prohibits customers from posting performance results. Though it is worth noting that neither of those products is particularly well known for working with large-scale data – http://www.informationweek.com/news/global-cio/interviews/231602498.

“There may well be Exadata customers managing hundreds of terabytes, but they haven’t talked to the media as yet.”

15. [...] Counting Triangles [...]

16. Neetesh says:

Thanks for posting this. Makes for a very interesting read.

17. It shouldn’t come as much of a surprise that Hadoop on “raw” HDFS doesn’t perform well for the case of counting triangles in a graph. Hadoop’s MapReduce approach benefits immensely from “locality”. Unfortunately, a collection of triangular subgraphs is not very local, neither for real-world networks nor for constructive approaches modeling real-world networks.

18. [...] few months ago, I discovered Vertica’s “Counting Triangles”-article through Prismatic. The blog post describes a number of benchmarks on counting triangles in large [...]

19. [...] long ago, I read an article from Vertica’s Stephen Walkauskas titled Counting Triangles that compared Hadoop, Hadoop/Pig, and Vertica when applied to the problem of counting triangles in [...]

20. Bimal says:

A very interesting experiment indeed. Thanks for doing this.

21. [...] to read my previous post): the query runs on a sample data set originally created by Vertica (see here) that consists of a 1.2GB csv file with two integer columns and 86,220,856 rows. The rows represent [...]

23. […] Counting Triangles by Stephen Walkauskas […]

24. […] big difference. Others have shown this sort of thing too. Look at this blog on counting triangles. There is also these results at UC […]