Archive for the ‘pattern matching’ Category

Comparing Pattern Mining on a Billion Records with HP Vertica and Hadoop

Pattern mining can help analysts discover hidden structures in data. Pattern mining has many applications—from retail and marketing to security management. For example, from a supermarket data set, you may be able to predict whether customers who buy Lay’s potato chips are likely to buy a certain brand of beer. Similarly, from network log data, you may determine groups of Web sites that are visited together or perform event analysis for security enforcement. In this blog post, we will show you how the HP Vertica Analytics Platform can efficiently find frequent patterns in very large data sets.

A pattern mining algorithm

Frequent patterns are items that occur often in a data set. After finding frequent patterns, analysts can use methods such as association rule mining to discover rules in the data. A classic example of an association (from Wikipedia) is that customers who buy diapers also tend to buy beer from a supermarket. While there are many frequent pattern mining algorithms in literature, we will use the FP-growth algorithm. FP-growth is considered efficient as it performs fewer database scans and does not require candidate set generation [1].

Instead of describing FP-growth in detail, we list the main steps from a practitioner’s perspective. We need to perform the following steps to obtain frequent patterns using FP-growth:

  1. Create transactions of items
  2. Count occurrence of item sets
  3. Sort item sets according to their occurrence
  4. Remove infrequent items
  5. Scan DB and build FP-tree
  6. Recursively grow frequent item sets

Let’s use an example to illustrate these steps. We will assume that our data set is a Web proxy log from a corporate network that, among other things, has IP address and Web sites visited as fields. Our goal is to find patterns such as Web sites that are visited together. After step 1, we obtain a set of transaction items shown in Table 1. Each transaction lists the Web sites visited from each IP address. After steps 2 and 3, we get Table 2 that has items sorted by their frequencies. Assuming that an item is considered frequent only if it occurs more than three times, then in step 4 we will discard cnn and yahoo from the table. In step 5 we use the pruned table to create an FP-tree (Figure 1). Finally, in step 6 we grow frequent patterns. The final output is shown in Table 3. The output, for example, shows that many users tend to visit both the Web sites of HP and Amazon.

Table 1: Sites Visited

Table 2: Sorted Items

Figure 1: FP-tree

Table 3: Final output of frequent patterns

Parallel pattern mining on the HP Vertica Analytics Platform

Despite the efficiency of the FP-Growth algorithm, single-threaded sequential version of FP-Growth can take very long on large data sets. Fortunately, we can rewrite the algorithm using SQL and HP Vertica user-defined functions (UDFs), and let the HP Vertica Analytics Platform parallelize the implementation. The main issue to resolve is how to map the algorithm to SQL statements and then remove dependencies between UDFs so that they can run independently and in parallel. Below are the statements that we used in the HP Vertica Analytics Platform. Let’s assume that we are still working with the Web proxy log example introduced earlier.

  1. Create transaction of items
    • SELECT DISTINCT srcIP, hostname INTO uniqueSipHn FROM networkLog;
  2. Count frequency of occurrence of each host name
    • SELECT count(hostname) INTO hnCnt FROM uniqueSipHn;
  3. List host names visited by each IP and also the frequency of each host name.
    • SELECT a.srcIP, b.hostName, b.frequency into sipHnCnt FROM uniqueSipHn a INNER JOIN hnCnt b ON a.hostName=b.hostName;
  4. Build conditional transactions. Assume an item is frequent if it occurs more than 20,000 times.
    • SELECT t1.hostName, t1.srcIP, t2.hostName AS condItem INTO condTr FROM sipHnCnt t1 JOIN sipHnCnt t2 ON (t1.srcIP=t2.srcIP) and (t1.count>20000 and t2.count>20000) and ((t2.count>t1.count) or (t2.count=t1.count and t2.hostName>t1.hostName))
  5. Generate patterns in parallel using UDF.
    • SELECT FPGrowth(srcIP, condItem, 20000) OVER(PARTITION BY hostName ORDER BY srcIP) INTO frequentItems FROM condTr;

The real test: a billion records, and, of course, Hadoop

Now that we know how to implement parallel frequent pattern mining in the HP Vertica Analytics Platform, let’s see how the implementation performs a large data set. Our input data is a few days’ worth of Web proxy logs. The log file is 330 GB in size, and has a billion records each with 22 fields. For comparison, we use Mahout’s implementation of parallel frequent pattern mining (Cloudera Hadoop 2.0 and mahout-0.7). We wrote a MapReduce program to create transactions from the log (step 1 of the algorithm). Our test bed consists of 12 HP ProLiant servers, each with 12 cores, 96GB RAM, and 128GB SSD.

Figure 2 depicts our results. On 4 servers, the HP Vertica Analytics Platform can complete the end-to-end pattern mining in fewer than 140 seconds. Hadoop takes 1,250 seconds (20 minutes)—approximately 9x more time than the HP Vertica Analytics Platform. As we increase the number of servers to 12, both the HP Vertica Analytics Platform and Hadoop take less time to complete. However, unlike Hadoop, the HP Vertica Analytics Platform has close to linear scaling for this setup.

Are you searching for patterns in your data set? Want a fast and easy-to-use data analytics platform? Evaluate the HP Vertica Community Edition today.

[1] Mining frequent patterns without candidate generation. Jiawei Han, Jian Pei, Yiwen Yin. SIGMOD 2000.

Being Green with Data Exhaust

Funnel Analysis

by Matt Fuller, Vertica

By 2015, it is estimated the annual global internet traffic will reach almost 1 zettabyte. To put it into more familiar units, this is equivalent to about 1 billion terabytes. Web, email, instant messaging, etc. will account for about 30% of this figure.  I found this fascinating, but not surprising, given the rate of new applications and users entering the market. Whether you believe these estimates or interpret the data differently, I think we can agree there is a vast amount of data out there.

As users perform their online activities, such as playing Farmville, reading tweets, or browsing for slick deals on Groupon, web server logs may store their clicks. This raw data, , or “data exhaust,” may appear to be junk to many, but in reality this “data exhaust” can be monetized given the right tools.

In funnel analysis, a funnel is the flow, or path, a user may take before reaching an end goal, such as a purchase, sign up, or download. The path may consist of a series of web clicks to different pages on the site until finally ending up at the goal. Along this path, users may drop out after any point, thus reducing the percentage of users that make it to the goal.

Analyzing the funnel can provide insight to improve the site flow. For example, if a site knew the registration page is the page where most users dropped out, the site could improve the usability of that page to engage more customers and then analyze those improvements.

In Vertica 5.0, we introduced the latest addition to our in-database analytics package: Event Series Pattern Matching.  In this article we will discuss how you can use Vertica’s Event Series Pattern Matching to discover user click events that match funnels.

Suppose we used a user-defined transform (UDT) to help load your server’s web click log into a Vertica table, where each row corresponds to a single click in your log (user_id may actually be an ip address).

Next, we would like to search for sequences of web clicks that match a particular funnel. For example, we would like to identify the series of clicks where a user viewed an item for sale, filled out the form to purchase, and then ultimately made the purchase. Additionally, we would like to include any other items the user visited during the flow. The funnel may look like:

Let’s now construct an Event Series Pattern Matching SQL query in Vertica  to find the sequence of clicks matching our funnel.  This can be done in 3 simple steps by defining our funnel as a regular expression, defining an alphabet for our regular expression, and defining the logical window over which we want to match our regular expression. The next sections describe these steps in more detail.

Step 1: Pattern Specification

Event Series Pattern Matching goes far beyond simple Funnel Analysis. So let’s start to use event series pattern matching terminology. First, think of each click record as an “event” and a clickstream (or sequence of many clicks) as an “event series.” And let’s also think of our funnel as a “pattern.”

Our pattern is specified using regular expression notation consisting of events from our event alphabet (more details in the next section) with optional quantifiers such as the Kleene Star (i.e. “*”):

(EntryItemView ItemView* Checkout Purchase)

For a contiguous sequence of events, this notation describes all the pattern instances starting with an item page view referred from another site event, zero or more page views of other items, proceeding to the checkout page to buy an item, and then ultimately ending with a purchase by landing on the purchase confirmation page.

Step 2: Event Definitions

Next, we must define our event alphabet. We use SQL Boolean expressions aliased to an event name. The event names are what we used in the pattern specification.

EntryItemView as referring_url NOT ILIKE ‘’
…………..and page_url ILIKE ‘’

ItemView as page_url ILIKE ‘’ and action = ‘VIEW’

Checkout as page_url ILIKE ‘’ and action = ‘CHECKOUT’

Purchase as page_url ILIKE ‘’ and action = ‘PURCHASE’

A row is considered to be of an event type if the Boolean expression yields TRUE for that row. For example, the below row from the clickstream table is considered to be of event “Purchase” because the predicate is TRUE given the values of the row.

Step 3: Window Definition

A window is the logical grouping of data. In our example, we are trying to discover matching pattern instances per user. Therefore, we would like to logically group all the data per user together and perform Event Series Pattern Matching for each user. And since we wish to find the pattern instances of a contiguous sequence of events, the data must be ordered. In this example, processing the data sorted on the timestamp is the logical choice. We define our window using the PARTITION BY and ORDER BY clause:

PARTITION BY user_id ORDER BY timestamp

Optionally, you may want to group the data per user per session. This is simple and efficient to do with Vertica’s in-database sessionization. In our example, we added the session id to the table already for simplicity:

PARTITION BY user_id, session_id ORDER BY timestamp

(NOTE: In these two examples, what is the “Window” – we talk about it but don’t point out the final result (i.e. the window))

Putting it all together…

Let’s combine the steps from above into a SQL SELECT:

SELECT user_id, referring_url, page_url,
event_name(), match_id(), pattern_id()
FROM clickstream
…….PARTITION BY user_id, session_id ORDER BY timestamp
…………EntryItemView as referring_url NOT ILIKE ‘’
…………….and page_url ILIKE ‘’,
…………ItemView as page_url ILIKE ‘’
…………….and action = ‘VIEW’,
…………Checkout as page_url ILIKE ‘’
…………….and action = ‘CHECKOUT’,
…………Purchase as page_url ILIKE ‘’
…………….and action = ‘PURCHASE’
…………PATTERN P as (EntryItemView ItemView* Checkout Purchase)

Run on Vertica, and Voila!


The results above consist of all rows that contributed to a discovered pattern instance. You may have noticed three new functions in the SELECT list: event_name(), match_id(), pattern_id(). These return data for additional analysis as well as demarcate the different pattern instances:

  • event_name() returns the name of the event for which that row contributed in the pattern instance
  • - match_id() is a monotonically increasing integer to serve as a unique identifier for the row within the pattern instance
  • pattern_id() is a monotonically increasing integer serving as a unique identifier with the pattern instance within the partition/group

For simplicity, our example doesn’t demonstrate more than one pattern match per partition/group. But imagine that if user 100 made another set of clicks matching the funnel, its pattern identifier would be 2. And if user 300 also made another set of clicks matching the funnel, its pattern identifier would also be 2 since we reset the starting identifier for each new partition.

We immediately notice from the results that Twitter user ABC has referred users to the website, but more importantly, referred users with a high success rate of making purchases. And of course further analysis, such as aggregation and pivoting, can be performed on the results of this Event Series Pattern Matching.

It might be said this could be done with SQL OLAP windowing functions such as LAG. For simple funnels, this is certainly true. But this would be difficult, if not impossible, using SQL OLAP for funnels defined by more complex pattern specifications including quantifiers such as Kleene Star or Kleene Plus.

And since Event Series Pattern Matching is in-database, you get Vertica’s performance and scalability. As an MPP system, Vertica automatically parallelizes the work across the cluster. The figure below illustrates finding the pattern instances in parallel across a Vertica cluster. First the data is segmented based on the partition window definition. Then each node independently processes and finds the pattern instances in the data segments. Finally, the pattern instance results from each node are combined at the node that issued the query and the final result is sent back to the user.

Vertica is an ideal platform for monetizing ALL of your data, and we’ve shown you how event series pattern matching can be used to analyze seemingly unimportant web log data to find the top patterns that lead to conversion events on a web site.  Just by using our new SQL “match” clause in three very simple and straightforward steps.

In a future post, we will discuss how one can use event series pattern matching to perform more advanced sessionization and compare it to  Google Analytics.

Get Started With Vertica Today

Subscribe to Vertica