
Comparing Pattern Mining on a Billion Records with HP Vertica and Hadoop

Pattern mining helps analysts discover hidden structures in data, and it has many applications, from retail and marketing to security management. For example, from a supermarket data set, you may be able to predict whether customers who buy Lay’s potato chips are likely to buy a certain brand of beer. Similarly, from network log data, you may determine groups of Web sites that are visited together or perform event analysis for security enforcement. In this blog post, we will show you how the HP Vertica Analytics Platform can efficiently find frequent patterns in very large data sets.

A pattern mining algorithm

Frequent patterns are items that occur often in a data set. After finding frequent patterns, analysts can use methods such as association rule mining to discover rules in the data. A classic example of an association (from Wikipedia) is that supermarket customers who buy diapers also tend to buy beer. While there are many frequent pattern mining algorithms in the literature, we will use the FP-growth algorithm. FP-growth is considered efficient because it performs fewer database scans and does not require candidate set generation [1].

Instead of describing FP-growth in detail, we list the main steps from a practitioner’s perspective. To obtain frequent patterns with FP-growth, we perform the following steps:

  1. Create transactions of items
  2. Count the occurrences of each item
  3. Sort items by their occurrence counts
  4. Remove infrequent items
  5. Scan the database and build the FP-tree
  6. Recursively grow frequent item sets

Let’s use an example to illustrate these steps. Assume that our data set is a Web proxy log from a corporate network that, among other things, records the IP address and the Web sites visited as fields. Our goal is to find patterns such as Web sites that are visited together. After step 1, we obtain the set of transaction items shown in Table 1. Each transaction lists the Web sites visited from each IP address. After steps 2 and 3, we get Table 2, which has items sorted by their frequencies. If an item is considered frequent only when it occurs more than three times, then in step 4 we discard cnn and yahoo from the table. In step 5 we use the pruned table to create an FP-tree (Figure 1). Finally, in step 6 we grow frequent patterns. The final output is shown in Table 3. It shows, for example, that many users tend to visit both the HP and Amazon Web sites.

Table 1: Sites Visited

Table 2: Sorted Items

Figure 1: FP-tree

Table 3: Final output of frequent patterns

Parallel pattern mining on the HP Vertica Analytics Platform

Despite the efficiency of the FP-growth algorithm, a single-threaded, sequential implementation can take a very long time on large data sets. Fortunately, we can rewrite the algorithm using SQL and HP Vertica user-defined functions (UDFs), and let the HP Vertica Analytics Platform parallelize the implementation. The main issue to resolve is how to map the algorithm to SQL statements and then remove dependencies between UDF invocations so that they can run independently and in parallel. Below are the statements that we used in the HP Vertica Analytics Platform, still working with the Web proxy log example introduced earlier.

  1. Create transactions of items
    • SELECT DISTINCT srcIP, hostname INTO uniqueSipHn FROM networkLog;
  2. Count the frequency of occurrence of each host name
    • SELECT hostName, COUNT(*) AS frequency INTO hnCnt FROM uniqueSipHn GROUP BY hostName;
  3. List the host names visited by each IP, along with each host name’s frequency.
    • SELECT a.srcIP, b.hostName, b.frequency INTO sipHnCnt FROM uniqueSipHn a INNER JOIN hnCnt b ON a.hostName=b.hostName;
  4. Build conditional transactions. Assume an item is frequent if it occurs more than 20,000 times.
    • SELECT t1.hostName, t1.srcIP, t2.hostName AS condItem INTO condTr FROM sipHnCnt t1 JOIN sipHnCnt t2 ON (t1.srcIP=t2.srcIP) AND (t1.frequency>20000 AND t2.frequency>20000) AND ((t2.frequency>t1.frequency) OR (t2.frequency=t1.frequency AND t2.hostName>t1.hostName));
  5. Generate patterns in parallel using a UDF (a sample query against the output follows this list).
    • SELECT FPGrowth(srcIP, condItem, 20000) OVER(PARTITION BY hostName ORDER BY srcIP) INTO frequentItems FROM condTr;
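
After the final statement runs, the frequentItems table holds the discovered patterns. As a quick sanity check, you can rank the patterns by how often they occur. The query below is a minimal sketch: the actual column names depend on the output schema that the FPGrowth UDF declares, so pattern and support are assumed names.

    -- Hypothetical inspection query; the column names (pattern, support) are
    -- assumptions, since they depend on the FPGrowth UDF's declared output schema.
    SELECT pattern, support
    FROM frequentItems
    ORDER BY support DESC
    LIMIT 10;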

The real test: a billion records, and, of course, Hadoop

Now that we know how to implement parallel frequent pattern mining in the HP Vertica Analytics Platform, let’s see how the implementation performs on a large data set. Our input data is a few days’ worth of Web proxy logs. The log file is 330 GB in size and has a billion records, each with 22 fields. For comparison, we use Mahout’s implementation of parallel frequent pattern mining (Cloudera Hadoop 2.0 and mahout-0.7). We wrote a MapReduce program to create transactions from the log (step 1 of the algorithm). Our test bed consists of 12 HP ProLiant servers, each with 12 cores, 96 GB of RAM, and a 128 GB SSD.

Figure 2 depicts our results. On 4 servers, the HP Vertica Analytics Platform completes the end-to-end pattern mining in less than 140 seconds. Hadoop takes 1,250 seconds (about 21 minutes), approximately 9x longer than the HP Vertica Analytics Platform. As we increase the number of servers to 12, both the HP Vertica Analytics Platform and Hadoop take less time to complete. However, unlike Hadoop, the HP Vertica Analytics Platform scales close to linearly in this setup.

Are you searching for patterns in your data set? Want a fast and easy-to-use data analytics platform? Evaluate the HP Vertica Community Edition today.


[1] Mining frequent patterns without candidate generation. Jiawei Han, Jian Pei, Yiwen Yin. SIGMOD 2000.

Distributed R for Big Data

Data scientists use sophisticated algorithms to obtain insights. However, what usually takes tens of lines of MATLAB or R code is now being rewritten in Hadoop-like systems and applied at scale in industry. Instead of rewriting algorithms in a new model, can we stretch the limits of R and reuse it for analyzing Big Data? We present our early experiences at HP Labs as we attempt to answer this question.

Consider a few use cases: product recommendations in Netflix and Amazon, PageRank calculation by search providers, financial options pricing, and detection of important people in social networks. These applications (1) process large amounts of data, (2) implement complex algorithms such as matrix decomposition and eigenvalue calculation, and (3) continuously refine their predictive models as new user ratings, Web pages, or social relations arrive. To support these applications we need systems that can scale, can easily express complex algorithms, and can handle continuous analytics.

By complex algorithms we mean that most of the above applications use advanced concepts such as matrix operations and graph algorithms. By continuous analytics we mean that if a programmer writes y=f(x), then y is recomputed automatically whenever x changes. Continuous analytics reduces the latency with which information is processed. For example, in recommendation systems new ratings can be quickly processed to give better suggestions. In search engines, newly added Web pages can be ranked and made part of search results more quickly.

In this post we will focus on scalability and complex algorithms.

R is open source statistical software. It has millions of users, including data scientists, and more than three thousand algorithm packages. Many machine learning algorithms already exist in R, albeit for small data sets. These algorithms use matrix operations that are easily expressed and efficiently implemented in R; most of them can be written in less than a hundred lines. Therefore, we decided to extend R and determine whether we can achieve scalability in a familiar programming model.
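
As an illustration of that conciseness, the core of PageRank’s power method fits in a few lines of ordinary single-machine R. The toy graph, damping factor, and iteration count below are made up for illustration; this is the sequential computation that the distributed example later in this post parallelizes.

    # Power-method PageRank on a toy graph in plain, single-machine R.
    # The graph, damping factor, and iteration count are illustrative only.
    N  <- 4
    M  <- matrix(c(0, 1, 1, 0,
                   0, 0, 1, 1,
                   1, 0, 0, 1,
                   1, 1, 0, 0), nrow = N, byrow = TRUE)
    M  <- t(t(M) / colSums(M))      # normalize columns to sum to 1 (link probabilities)
    d  <- 0.85                      # damping factor
    pr <- rep(1 / N, N)             # start from uniform ranks
    for (iter in 1:50) {
      pr <- (1 - d) / N + d * as.vector(M %*% pr)
    }
    print(pr)

The single-machine version is short; the challenge is running it when M no longer fits on one machine.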

Figure 1 is a very simplified view that compares R and Hadoop. Hadoop can handle large volumes of data, but R can efficiently execute a variety of advanced analyses. At HP Labs we have developed a distributed system that extends R. Its main advantages are the language semantics and the mechanisms it adds to scale R and run programs in a distributed manner.


Figure 1: Extending R for Big Data

Details

Figure 2 shows a high-level diagram of how programs are executed in our distributed R framework. Users write programs using language extensions to R and submit the code to the new runtime. The code is executed across servers in a distributed manner. Distributed R programs run on commodity hardware: from your multi-core desktop to existing Vertica clusters.


Figure 2: Architecture

Our framework adds three main language constructs to R: darray, splits, and update. A foreach construct, similar to the parallel loops found in other languages, is also provided.

For transparent scaling, we provide the abstraction of distributed arrays through darray. Distributed arrays store data across multiple machines and give programmers the flexibility to partition data by rows, columns, or blocks. Programmers write analytics code treating a distributed array as a regular array, without worrying that it is mapped to different physical machines. Array partitions can be referenced using splits, and their contents can be modified using update. The body of a foreach loop processes array partitions in parallel.
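
As a flavor of how these constructs fit together, here is a minimal sketch that doubles every element of a distributed array, one partition per parallel task. It assumes the constructs behave as just described; the exact signatures in the released package may differ.

    # Minimal sketch of darray/foreach/splits/update; exact signatures may
    # differ from the released package.
    A <- darray(dim = c(1000, 1000), blocks = c(100, 1000))   # 10 row partitions
    foreach(i, 1:numsplits(A),
      function(a = splits(A, i)) {
        a <- a * 2     # each task sees its partition as a regular R matrix
        update(a)      # publish the modified partition back to the darray
      })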

Figure 3 shows part of a program that calculates the distributed PageRank of a graph. At a high level, the program executes A = (M*B)+C in a distributed manner until convergence. Here M is the adjacency matrix of a large graph. Initially, M is declared as an N×N sparse matrix partitioned by rows. The vector A is partitioned such that each partition has the same number of rows as the corresponding partition of M. The accompanying illustration (Figure 3) points out that each partition of A requires the corresponding (shaded) partitions of M and C, and the whole array B. The runtime passes these partitions and automatically reconstructs B from its partitions before executing the body of foreach on the workers.
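
Since Figure 3 is reproduced only as an image, here is a hedged reconstruction of the loop it describes, using the names from the text. N, the partition size s, and the fixed iteration count are placeholders, and the exact API may differ from the code in the figure.

    # Sketch of the distributed computation A = (M %*% B) + C from Figure 3.
    # N, s, and num_iterations are placeholders; the real program iterates
    # until convergence rather than for a fixed count.
    M <- darray(dim = c(N, N), blocks = c(s, N), sparse = TRUE)  # row-partitioned
    A <- darray(dim = c(N, 1), blocks = c(s, 1))   # partitioned like M's rows
    B <- darray(dim = c(N, 1), blocks = c(s, 1))   # fetched whole by each task
    C <- darray(dim = c(N, 1), blocks = c(s, 1))
    for (iter in 1:num_iterations) {
      foreach(i, 1:numsplits(M),
        function(a = splits(A, i), m = splits(M, i),
                 b = splits(B), cvec = splits(C, i)) {
          # one partition each of A, M, and C, plus the whole of B, per task
          a <- (m %*% b) + cvec
          update(a)       # write back this partition of A
        })
    }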

Our algorithms package includes distributed algorithms such as regression analysis, clustering, power-method-based PageRank, a recommendation system, and so on. Each of these applications required fewer than 150 lines of code.


Figure 3: Sample Code

The point of this post is not to claim yet another system that is faster than Hadoop, so we exclude comprehensive experimental results and pretty graphs; our EuroSys 2013 and HotCloud 2012 papers have detailed performance results [1, 2]. As a data nugget, our experiments show that many algorithms in our distributed R framework run more than 20 times faster than Hadoop.

Summary

Our framework extends R to efficiently execute machine learning and graph algorithms on a cluster. Distributed R programs are easy to write, scalable, and fast.

Our aim in building a distributed R engine is not to replace Hadoop or its variants. Rather, it is a design point in the space of analytics interfaces—one that is more familiar to data scientists.

Our framework is still evolving. Today, you can use R on top of Vertica to accelerate your data mining analysis. Soon we will support in-database operations as well. Stay tuned.


[1] Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, Rob Schreiber. EuroSys 2013, Prague, Czech Republic.

[2] Using R for Iterative and Incremental Processing. Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Rob Schreiber. HotCloud 2012, Boston, USA.
