
Archive for the ‘social graph analysis’ Category

The Unstructured Leprechaun

The “De-mythification” Series

Part 2: The Unstructured Leprechaun


In this, the second installment of the multi-part “de-mythification” series, I’ll address another common misconception in the Big Data marketplace today: that there are only two types of data an enterprise must deal with for Big Data analytics – structured and unstructured – and that unstructured data is somehow structure-free.

Let’s start with a definition of “structured” data. When we in the Big Data space talk of structured data, what we really mean is that the data has easily identifiable elements – fields, rows, columns, and so on – which makes it simple to use as input for analytics. Virtually all modern analytic routines leverage mathematical algorithms that look for groupings, trends, patterns, and the like, and these routines require that the data be structured in such a way that they can digest it. So when we say “structured” in this context, what we really mean is “structured in such a way that our analytic routines can process it.”

On the other hand, “unstructured” data has become a catch-all term that’s used to describe everything not captured by the definition above. And this is unfortunate, because there’s very little data in the world which is truly unstructured. This over-generalization leads many organizations down costly, time-consuming paths which they don’t need to traverse.

Here’s a short list of some of the types of data and information commonly lumped under the “unstructured” label, with a sanity check as to the real story for each.

Type of Data | Common Source(s) | Structure Sanity Check
Audio | Call center recordings, webinars, etc. | Digital audio is stored in files, usually as a stream of bits. The stream is encoded as it is written and decoded as it is read, often with compression – which is how the audio can be replayed after recording.
Video | Dash-cams, security, retail traffic monitoring, social media sharing, etc. | As with audio, digital video is stored in files, with a very similar approach to storing the stream of bits – encoded and often compressed, and replayable with the right decoder.
E-mails | Personal and business e-mail, marketing automation, etc. | An e-mail is typically quite well structured, with one section of the message containing key data about it – From, To, Date, Subject, etc. – and another field containing the message itself, often stored as simple text.
Documents (contracts, books, white papers, articles, etc.) | Electronic document systems, file sharing systems such as Google Docs and Sharepoint, etc. | The documents themselves have structure similar to e-mail, with a group of fields describing the document and a body of text that comprises the document itself. This is a broad category with much variation.
Social Media | Tweets, blog posts, online video, picture sharing, check-ins, status updates, etc. | Similar to e-mails, social media posts usually carry data describing the message – who’s posting it, the date of the post, referenced hashtags and users, etc. – alongside the post itself. Images, audio and video included in social media are structured no differently than they are elsewhere.
Machine Logs | Mobile applications, hardware devices, web applications, etc. | I’m not sure who lumped machine logs under the “unstructured” label, since these are highly structured and always have been. They are, after all, written by machines! I suspect a bunch of marketing people decided this after consuming one too many bottles of wine in Napa.

By now it should be clear that this data is not at all unstructured. Quite the opposite. It has plenty of structure to it, otherwise we could never replay that video or audio, read a status update, read e-mail, etc. The real challenge is that this data is generated for a purpose, and that purpose rarely includes analytics. Furthermore, video, audio and email have been around for decades, but it’s only in recent years that we’ve discovered the value of analyzing that information along with the rest.

How does this information add new value? Here are a few examples:

  • Hedge funds found, a number of years ago, that by incorporating sentiment analysis of Tweets about publicly traded securities, they could predict the daily closing prices of those securities very accurately.
  • Facial recognition in video allows for the creation of an event-driven monitoring system that lets a single soldier effectively monitor hundreds of security cameras concurrently.
  • Sentiment scoring in audio allows a business to detect an unhappy customer during a call, predict that they are likely to churn, and extend a retention offer to keep that customer.
  • Expressing the graph of relationships between players of a social game, as determined by their in-game messages, allows the game developer to dramatically improve profitability as well as player experience.

There are many, many such examples. This is why there’s so much attention being paid to “unstructured” data today – it offers a powerful competitive advantage for those who can incorporate it into their analytics.

The problem is that the data serves…the application which created it. When coder/decoder algorithms were being developed in the 1990s for audio and video, I doubt anyone expected that someday we might want to understand (a) who is talking; (b) what they’re talking about; and (c) how they feel about it.

This is the core problem many of us in the Big Data industry are working to address today: how do we take data with one type of structure, such as audio, and create a second type of structure that suits it for analytics? To accomplish this, we need structure suited to our analytic routines – a field identifying the person speaking, a field with the timestamp, a field identifying the topic they’re talking about, and so on. Getting from a stream of audio to this requires careful choice of technology and thoughtful design. Unfortunately, my esteemed colleagues in the Big Data marketplace have tended to oversimplify this complex situation down to a single word: “unstructured”. This has given rise to the unstructured leprechaun – a mythical creature that many organizations chase in hopes of finding an elusive pot of gold.

Not that simplicity of messaging is a bad thing. Lord knows I’ve been in enough conference rooms watching people’s eyes glaze over as I talk through structured versus unstructured data! But, as with the real-time unicorn, if organizations chase the unstructured leprechaun – the myth that there is one big bucket of “unstructured” data that can somehow be addressed with a single magic tool (for more on that, see my next post: “The Single Solution Elf”) – they risk wasting time and money approaching the challenge without truly understanding the problem.

Once my colleagues and I get everyone comfortable with this more nuanced view, we can begin the real work – identifying the high-value use cases where we can bring in non-traditional data to enhance analytic outcomes. It’s worth mentioning that I’m careful today to refer to this data as non-traditional, and never as unstructured! This avoids a lot of overgeneralizing, and makes selecting the right portfolio of technologies and designing a good architecture for the use cases very doable.

So when organizations state that they need to deal with their “unstructured” data, I recommend a thorough assessment of the types of data involved and why they matter, along with the identification of discrete use cases where this data can add value. We can then use this information as a guideline in developing a plan of action that’s much more likely to yield a tangible ROI.

Next up: The Single Solution Elf

Counting Triangles

by Stephen Walkauskas

Recently I’ve heard from or read about people who use Hadoop because their analytic jobs can’t achieve the same level of performance in a database. In one case, a professor I visited said his group uses Hadoop to count triangles “because a database doesn’t perform the necessary joins efficiently.”

Perhaps I’m being dense but I don’t understand why a database doesn’t efficiently support these use-cases. In fact, I have a hard time believing they wouldn’t perform better in a columnar, MPP database like Vertica – where memory and storage are laid out and accessed efficiently, query jobs are automatically tuned by the optimizer, and expression execution is vectorized at run-time. There are additional benefits when several, similar jobs are run or data is updated and the same job is re-run multiple times. Of course, performance isn’t everything; ease-of-use and maintainability are important factors that Vertica excels at as well.

Since the “gauntlet was thrown down”, to steal a line from Good Will Hunting, I decided to take up the challenge of computing the number of triangles in a graph (and include the solutions in GitHub so others can experiment – more on this at the end of the post).

Problem Description

A triangle exists when a vertex has two adjacent vertexes that are also adjacent to each other. Using friendship as an example: If two of your friends are also friends with each other, then the three of you form a friendship triangle. How nice. Obviously this concept is useful for understanding social networks and graph analysis in general (e.g. it can be used to compute the clustering coefficient of a graph).
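
For the curious, the connection to the clustering coefficient is direct – the global clustering coefficient of a graph is simply

    C = 3 × (number of triangles) / (number of connected triples of vertexes)

so once you can count triangles at scale, computing that metric comes almost for free.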

Let’s assume we have an undirected graph with reciprocal edges, so there’s always a pair of edges ({e1,e2} and {e2,e1}). We’ll use the following input for illustration (reciprocal edges are elided to condense the information):

source | destination
Ben | Chuck
Ben | Stephen
Chuck | Stephen
Chuck | Rajat
Rajat | Stephen
Andrew | Ben
Andrew | Matt
Andrew | Pachu
Matt | Pachu
Chuck | Lyric

A little ascii art to diagram the graph might help.

I know you can quickly count the number of triangles. I’m very proud of you, but imagine there are hundreds of millions of vertexes and tens of billions of edges. How long would it take you to diagram that graph? And how much longer to count all of the triangles? And what if your 2-year-old daughter barges in counting “one, two, three, four, …” and throws off your count?

Below we present a few practical solutions for large scale graphs and evaluate their performance.

The Hadoop Solution

Let’s consider first the Hadoop approach to solving this problem. The MapReduce (MR) framework implemented in Hadoop allows us to distribute work over many computers to get the count faster. The solution we describe here is a simplified version of the work at Yahoo Research. You can download our solution here.

Overview

The solution involves a sequence of 3 MR jobs. The first job constructs all of the triads in the graph. A triad is formed by a pair of edges sharing a vertex, called its apex. It doesn’t matter which vertex we choose as the apex of the triad, so for our purposes we’ll pick the “lowest” vertex (e.g. friends could be ordered alphabetically by their names). The Yahoo paper makes a more intelligent choice of “lowest” – the vertex with the smallest degree. However, that requires an initial pass over the data (and more work on my part), so I skipped that optimization – and did so consistently for all solutions to ensure fairness.

These triads and the original edges are emitted as rows by the first MR job, with a field added to distinguish the two. Note that the output of the first job can be quite large, especially in a dense graph. Such output is consumed by the second MR job, which partitions the rows by either the unclosed edge, if the row is a triad, or the original edge. A partition has n triangles if it contains an original edge and n triads. A third, trivial MR job counts the triangles produced by the second job, to produce the final result.

Details

Let’s look at each MR job in detail. The map part of the first job generates key-value pairs for each triad such that the apex is the key and the value is the edge. In our small example the map job would emit the following rows.

key | value
Andrew | Andrew, Matt
Andrew | Andrew, Pachu
Andrew | Andrew, Ben
Matt | Matt, Pachu
Ben | Ben, Chuck
Ben | Ben, Stephen
Chuck | Chuck, Rajat
Chuck | Chuck, Lyric
Chuck | Chuck, Stephen
Rajat | Rajat, Stephen

For each apex-partition, the reduce job emits the original edges and all of the corresponding triads (there are Σ(j−1) for j = 1 to d, i.e. d(d−1)/2, triads per partition, where d is the degree of the vertex at the apex – the Andrew partition, for example, has three edges and therefore 0 + 1 + 2 = 3 triads). For each original edge, the key is the edge itself and the value is “edge”. For each triad, the key is the unclosed edge – in other words, the edge needed to complete the triangle – and the value is “triad”. (The actual code used “0” for the edge value and “1” for the triad value for run-time efficiency.)

The rows corresponding to the triads emitted by this reduce job in our simple example are described below in the “key” and “value” columns (the original edges are also emitted by the reduce job but elided below for brevity). For presentation purposes we added a third column “triad content”. That column is not produced by the actual reduce job.

key | value | triad content
Ben, Matt | triad | {Andrew, Ben}, {Andrew, Matt}
Ben, Pachu | triad | {Andrew, Ben}, {Andrew, Pachu}
Matt, Pachu | triad | {Andrew, Matt}, {Andrew, Pachu}
Chuck, Stephen | triad | {Ben, Chuck}, {Ben, Stephen}
Lyric, Rajat | triad | {Chuck, Lyric}, {Chuck, Rajat}
Lyric, Stephen | triad | {Chuck, Lyric}, {Chuck, Stephen}
Rajat, Stephen | triad | {Chuck, Rajat}, {Chuck, Stephen}

The input to the next reduce job is partitioned such that the unclosed edge of each triad is in the same partition as its corresponding original edge, if any. The reduce job just needs to check for the existence of an original edge in that partition (i.e., a row with value set to “edge”). If it finds one, all of the triads in the partition are closed as triangles. The reduce job sums up all of the closed triads and on finalize emits a count. A trivial final MR job aggregates the counts from the previous job.

There we have it – we’ve used MapReduce to count the number of triangles in a graph. The approach isn’t trivial, but it’s not horribly complex either. And if it runs too slowly we can add more hardware: each machine does less work and we get our answer faster.

Experiences with Hadoop

I have to admit it took me much longer than I estimated to implement the Hadoop solution. Part of the reason is that I’m new to the API, which is exacerbated by the fact that there are currently two APIs, one of them deprecated, the other incomplete, forcing use of portions of the deprecated API. Specifically, the examples I started with were unfortunately based on the deprecated API, and when I ported to the newer one I ran into several silly but somewhat time-consuming issues (like mapred’s version of Reducer.reduce taking an Iterator while mapreduce’s version takes an Iterable – they look similar to the human eye, but the compiler knows that a method that takes an Iterator is not overridden by one that takes an Iterable). Learning curve aside, there was a fair chunk of code to write. The simple version is >200 lines. In a more complex version I added a secondary sort to the MR job that computes triads. Doing so introduced several dozen lines of code (most of it brain-dead stuff like implementing a Comparable interface). Granted, a lot of the code is cookie-cutter or trivial, but it still needs to be written (or cut-n-pasted and edited). In contrast, adding a secondary sort column in SQL takes a mere few characters of extra code.

The PIG Solution

Rajat Venkatesh, a colleague of mine, said he could convert the algorithm to a relatively small PIG script, and he wagered a lunch that the PIG script would outperform my code. He whipped up what was eventually a 10-statement PIG script that accomplished the task. When we get to the performance comparison we’ll find out who got a free lunch.

Here’s the PIG solution – much simpler than coding MR jobs by hand. We used PIG 0.8.1. We made several passes over the script to optimize it, following the PIG Cookbook. For example, we rearranged the join order and put the larger table last (I’m probably not giving too much away by mentioning that Vertica’s optimizer uses a cost model which properly chooses join order). We also tried several values for default_parallel and mapreduce.job.maps (and we changed the corresponding parameter in mapred-site.xml as well, just to be safe). We did not enable lzo compression, for two reasons. First, considering the hardware used for the experiment (large RAM – plenty of file system cache – and a high-throughput network), the CPU tax incurred by compression was more likely to hurt performance than help in this case. Second, one website listed 7 steps to get the compression working, but the 2nd step had several steps itself, so I gave up on it.

The Vertica Solution

Can you count the number of triangles in a graph using a database? Of course. First create an “edges” table and load the graph. Vertica can automate the decision about how to organize storage for the table – structures called projections specify important characteristics of physical storage such as sort order, segmentation, encoding and compression. In this case we simply tell Vertica how to distribute data among the nodes in our cluster (Vertica calls this segmenting). Alternatively, the Vertica Database Designer can be used to automate projection design. The following statements create our table and load the data.
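
A minimal sketch of the shape these statements take is below; the column types, segmentation expression, file path and delimiter are illustrative assumptions rather than the exact statements used in the experiments.

    -- Create the edges table; SEGMENTED BY tells Vertica how to distribute
    -- rows across the nodes of the cluster.
    CREATE TABLE edges (
        source      VARCHAR(64) NOT NULL,
        destination VARCHAR(64) NOT NULL
    ) SEGMENTED BY HASH(source, destination) ALL NODES;

    -- Bulk-load the edge list (one tab-delimited source/destination pair per line).
    COPY edges FROM '/path/to/edges.txt' DELIMITER E'\t' DIRECT;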

We’ve got the data loaded and stored in an efficient way. If we need to run more jobs with the same data later we won’t incur the load cost again. Likewise, if we need to modify the data we only incur work proportional to the change. Now we just need a horribly complex hack to count triangles. Take a deep breath, stretch, get a cup of coffee, basically do what you have to do to prepare your brain for this challenge. Ok, ready? Here it is:
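
What follows is a minimal sketch of that query – a three-way self-join of the edges table with aliases e1, e2 and e3, plus less-than predicates to avoid double counting; the original query may differ slightly in formatting and predicate placement.

    SELECT COUNT(*)
      FROM edges e1
      JOIN edges e2 ON e1.destination = e2.source AND e1.source < e1.destination
      JOIN edges e3 ON e2.destination = e3.source AND e3.destination = e1.source
                   AND e2.source < e2.destination;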

Good, you didn’t run away screaming in horror. If we ignore the less-than predicates, the query simply finds all triplets that form a cycle, v1 -> v2 -> v3 -> v1. The less-than predicates ensure we don’t count the same triangle multiple times (remember, our edges are reciprocal and we only need to consider triangles with the “lowest” vertex at the apex).

That’s it! A single, 4-liner query. Of course you’re interested in what the Vertica database does under the covers and how its performance, disk utilization and scalability compare with those of Hadoop and PIG.

Performance Study

The publicly available LiveJournal social network graph (http://snap.stanford.edu/data/soc-LiveJournal1.html) was used to test performance. It was selected because of its public availability and because its modest size permitted relatively quick experiments. The modified edges file (in the original file not every edge is reciprocated) contained 86,220,856 edges, about 1.3GB in raw size. We used HDFS dfs.replication=2 (replication=1 performed worse – fewer map jobs were run, almost regardless of the mapreduce.job.maps value). Experiments were run on between 1 and 4 machines, each with 96GB of RAM, 12 cores and a 10Gbit interconnect.

Run-Time Cost

All solutions were manually tuned to obtain the best performance numbers. For the Hadoop and PIG solutions, the number of mappers and reducers, as well as the code itself, were tweaked to optimize performance. For the Vertica solution: out of the box, Vertica is configured to support multiple users, with a default expectation of 24 concurrent queries for the hardware used. This configuration was tweaked to further increase pipeline parallelism (equivalent configuration settings will be on by default in an upcoming release). The following chart compares the best performance numbers for each solution.

PIG beat my Hadoop program, so my colleague who wrote the PIG script earned his free lunch. One major factor is PIG’s superior join performance – it uses a hash join. In comparison, the Hadoop solution employs a join method very close to a sort-merge join.

Vertica’s performance wasn’t even close to that of Hadoop – thankfully. It was much, much better. In fact, Vertica ate PIG’s and Hadoop’s lunch – its best time is 22x faster than PIG’s and 40x faster than the Hadoop program’s (even without configuration tweaks, Vertica beats the optimized Hadoop and PIG programs by more than a factor of 9x in comparable tests).

Here are a few key factors in Vertica’s performance advantage:

  • Fully pipelined execution in Vertica, compared to a sequence of MR jobs in the Hadoop and PIG solutions, which incurs significant extra I/O. We quantify the differences in how the disk is used among the solutions below in the “Disk Usage” study.
  • Vectorization of expression execution, and the use of just-in-time code generation in the Vertica engine.
  • More efficient memory layout, compared to the frequent Java heap memory allocation and deallocation in Hadoop / PIG.
Overall, Hadoop and PIG are free in software, but hardware is not included. With a 22x speed-up, Vertica’s performance advantage effectively equates to a 95% discount on hardware. Think about that. You’d need 1000 nodes to run the PIG job to equal the performance of just 48 Vertica nodes, which is a rack and a half of the Vertica appliance.

Finally, consider what happens when the use case shifts from counting all of the triangles in a graph to counting (or listing) just the triangles that include a particular vertex. Vertica’s projections (those things that define the physical storage layout) can be optimized so that looking up all of the edges with a particular vertex is essentially an index search (and once found, the associated edges are co-located on disk – an important detail that anyone who knows the relative cost of a seek versus a scan will appreciate). This very quickly whittles e1 and e3 down to relatively few rows, which can participate in a merge join with e2 – all in all a relatively inexpensive operation. PIG and Hadoop, on the other hand, must process all of the edges to satisfy such a query.
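
To make that concrete, here is a rough sketch of the per-vertex variant; the vertex name is just an illustrative placeholder, and the predicates are one way of anchoring the vertex of interest rather than the exact query we ran.

    -- Count the triangles that include one particular vertex ('Chuck' is a
    -- placeholder). Restricting e1 to edges leaving that vertex (and, via the
    -- join condition, e3 to edges returning to it) leaves only a handful of
    -- rows for those two inputs.
    SELECT COUNT(*)
      FROM edges e1
      JOIN edges e2 ON e1.destination = e2.source
      JOIN edges e3 ON e2.destination = e3.source AND e3.destination = e1.source
     WHERE e1.source = 'Chuck'
       AND e1.destination < e2.destination;  -- count each such triangle once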

Disk Usage

For the input data set of 1.3GB, it takes 560MB to store it in Vertica’s compressed storage. In comparison, storing it in HDFS consumes more space than the raw data size.

At run-time, here is the peak disk usage among all 3 solutions in a 4-node cluster (remember lzo was not enabled for Hadoop and PIG – turning it on would reduce disk usage but likely hurt performance).

Given the huge differences in disk usage, and thus I/O work, along with the other advantages outlined above, it should come as no surprise that the Vertica solution is much faster.

Join Optimization

As we mentioned earlier, the Hadoop solution does not optimize for join performance. Both Vertica and PIG were able to take advantage of a relatively small edges table that fit in memory (100s of billions or more edges can fit in memory when distributed over 10s or 100s of machines), with a hash join implementation.

For PIG, the join ordering needs to be explicitly specified. Getting this ordering wrong may carry a significant performance penalty. In our study, the PIG solution with the wrong join ordering is 1.5x slower. The penalty is likely even higher with a larger data set, where the extra disk I/O incurred in join processing can no longer be masked by sufficient RAM. To further complicate the matter, the optimal join ordering may depend on the input data set (e.g. whether the input graph is dense or not). It is infeasible for users to manually tweak the join ordering before submitting each PIG job.

In comparison, the Vertica columnar optimizer takes care of join ordering as well as many other factors crucial to optimizing for the job run-time.

The Right Tool for the Job

Many people get significant value out of Hadoop and PIG, including a number of Vertica’s customers who use these tools to work with unstructured or semi-structured data – typically before loading that data into Vertica. The question is which tool is best suited to solve your problem. With User Defined Functions, Aggregates, Load, et cetera available or coming soon to Vertica, the lines are becoming blurred, but when it comes to performance the choice is crystal clear.

In the case of triangle counting as we presented above, the Vertica solution enjoys the following advantages over Hadoop and PIG:

  • Ease of programming and maintenance, in terms of both ensuring correctness (the Vertica SQL solution is simpler) and achieving high performance (the Vertica optimizer chooses the best execution plan)
  • Compressed storage
  • Orders of magnitude faster query performance

Do Try this at Home

It is a relatively safe experiment (unlike slicing a grape in half and putting it in the microwave – don’t try that one at home). We’ve uploaded all three solutions to GitHub. Feel free to run your own experiments and improve on our work. As it stands the project includes a build.xml file which runs the Hadoop and PIG solutions in standalone mode – the project README file describes these targets and more in detail. With a little more work one can configure a Hadoop cluster and run the experiments in distributed mode, which is how we ran the experiments described above.

It’s a little more difficult to run the tests if you are not currently a Vertica customer, but we do have a free trial version of the Vertica Analytics Platform software.

Acknowledgements

Many thanks to Rajat Venkatesh for writing the PIG script (though I already thanked him with a lunch) and Mingsheng Hong for his suggestions, ideas and edits.

Vertica at BIRTE 2011: Social Graph Analytics

A few days ago, Lakshmikant Shrinivas and I gave an invited talk at BIRTE 2011. Rather than yet another Vertica 101 talk, we chose to talk about Social Graph Analytics, a topic we are truly excited about because it is something we are learning from our customers! And of course, it is uber cool to talk about Cityville and Zynga these days! We presented the Zynga use case – how to find the influencers among the active users of a game – and several other graph problems Vertica has very successfully solved with SQL. More about these in future blog posts.

The most fun part of this talk (according to the audience) was that it was multi-threaded – we weaved a real-time demo through the talk. The demo was in two parts. The first part calculated an approximate 4-core of a graph of 90M nodes and 405M edges representing active users and their interactions in a game. A k-core of a graph picks out a subgraph where every node has at least k neighbors in the subgraph – some research shows that the set of influencers in a social network is often a 3- or 4-core. The demo was run on 4 blades of the HP/Vertica Analytics System with 12 cores and 96GB RAM each. The animation below shows an approximate visualization of the actual graph with 10K nodes and 45K edges (we could not find any graph visualization tool that could draw a graph of 405M edges!) and how the algorithm iteratively reduces it to find the 4-core. The actual computation (which cannot be visualized) took about 1 minute to process 1TB of in-game event data to compute the initial graph of 90M active users, and only 4.5 minutes to run the 8 iterations that produced the 4-core of 34K users!
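
To give a flavor of how naturally the iterative part maps to SQL, here is a minimal sketch of a single pruning pass for an exact k-core (k = 4); this is an illustration only, not the code behind the demo, and it assumes a reciprocal edges(source, destination) table like the one in the triangle-counting post. The pass is simply repeated until no more rows are deleted.

    -- Remove every edge that touches a vertex with fewer than 4 neighbors.
    -- With reciprocal edges, a vertex's degree is the number of rows in which
    -- it appears as the source.
    DELETE FROM edges
     WHERE source IN (SELECT source FROM edges
                       GROUP BY source HAVING COUNT(*) < 4)
        OR destination IN (SELECT source FROM edges
                            GROUP BY source HAVING COUNT(*) < 4);

Each pass shrinks the graph, and after a handful of passes only the densely connected core remains.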

The second part showed how this graph of influencers could be used to do A/B testing for in-game product placement. We simulated giving one product, Coke, to a group of users from the graph of influencers and another, Pepsi, to a group of randomly chosen users, loading their in-game interaction data into Vertica every 15 seconds. In the 5 minutes or so it took us to describe the setup, you could see how the relative penetration of the two products changed in real time. This was really a demo of Vertica’s awesome real-time analytics engine – we loaded data continuously at a rate of 40GB/minute (on 4 nodes with 2 copies of the data), which translates to 20TB/hr on the full-rack HP/Vertica Analytics System. That’s fast, eh?!!

This little demo, and what our customers are doing in the real world, make a very convincing case that SQL can be used to express many practical graph problems. Vertica’s MPP architecture and sorted projections provide a highly scalable and fast engine for iterative graph algorithms. Such problems are relevant not only to the Zyngas and Facebooks of the world but to any company that has a connected user community. From online forums to friend-and-family call logs, social networks are everywhere, and finding influencers is the key to monetizing them. Gartner believes that social CRM will be a $1B market by the end of 2012!

We hope that our talk planted a small seed for the database research community to more formally explore SQL solutions to graph problems!  Watch for more blogs from Vertica on solving graph problems with SQL.
