Identifying Patterns in Your Data with Event Series Pattern Matching – Part 1

Posted April 20, 2016 by Sarah Lemaire, Manager, Vertica Documentation

white cloud in vault type room representing cloud computing

While walking to the coffee shop yesterday morning, I wondered briefly how frequently I decide to stop and get coffee and how frequently I head straight to work. While I stood in line for an embarrassingly complicated latte, I started to think about how my actions could actually be interpreted (and maybe even predicted) by Big Data analytics. Specifically, I thought about pattern matching.

It works like this: I could record everywhere I go, and enter each event (leave home, get coffee, arrive at work) as a different row in a table. Then, I could examine that table for patterns, or specific series of events. One pattern would be: leave home, arrive at work. Another pattern would be: leave home, get coffee, arrive at work.

Extrapolating from the previous scenario, we can see that the same tactic can be applied to larger amounts of data. Maybe you want to find specific patterns of user behavior in your online game. Maybe you want to identify areas on your credit card bill that show a $20+ charge at one bar followed immediately by a $20+ charge at a different bar. Then again, maybe you don’t want to do that. The point is, the applications of Vertica’s event series pattern matching functionally are endless. In this blog, we’ll introduce you to the pattern matching key features.

What’s an event and what’s a pattern?

As the name suggests, event patterns are simply series of “events” that occur in an order, or “pattern” that you specify. Vertica evaluates each row in your table, looking for the event you define. When Vertica finds a sequence of rows that conform to your pattern, it outputs the rows that contribute to the match.

As a simple example, let’s say you want to find places in your records where event A is following by event B and then event C. With pattern matching, Vertica will return the following shaded rows:

 

pattern1

 

An example with clickstream analysis

Pattern matching is particularly useful for clickstream analysis where you might want to identify users’ actions based on their web browsing behavior. An example of an online clickstream event sequence might be:

  1. User is on outside website
  2. User visits your organization’s home page
  3. User navigates to a product page
  4. User purchases a product online

pattern2

 

 

 

Here’s the clickstream_log table we’ll use through the tutorial. The column definitions are as follows:

  • uid INT: user ID
  • ts TIME: timestamp of the action the user took
  • refURL VARCHAR(50): URL of the page referencing pageURL. In other words, the page where the previous event took place.
  • pageURL VARCHAR(50): URL of the page the user visits
  • action CHAR(1): action the user took after visiting the page (‘E’ = Enter my site, ‘V’ = View page on my site, ‘P’ = Purchase item from my site, ‘X’ = Exit my site)

uid  |    ts    |           refURL           |          pageURL           | action
----+----------+----------------------------+----------------------------+--------
1 | 12:00:00 | otherSite.com              | mySite.com/home            | E
1 | 12:01:00 | mySite.com/home            | mySite.com/products        | V
1 | 12:02:00 | mySite.com/products        | mySite.com/products/tshirt | V
1 | 12:03:00 | mySite.com/products/tshirt | mySite.com/buy             | P
2 | 12:10:00 | otherSite.com              | mySite.com/home            | E
2 | 12:11:00 | mySite.com/home            | mySite.com/products        | V
2 | 12:12:00 | mySite.com/products        | mySite.com/home            | V
2 | 12:13:00 | mySite.com/home            | otherSite.com              | X
3 | 12:17:00 | otherSite.com              | mySite.com/home            | E
3 | 12:18:00 | mySite.com/home            | mySite.com/products        | V
3 | 12:19:00 | mySite.com/products        | mySite.com/products/couch  | V
3 | 12:20:00 | mySite.com/products/couch  | mySite.com/buy             | P
(12 rows)

 

The MATCH Clause

So what can you do with this data? Let’s say you want to find out if your purchase option is easy for a user to use. You might determine this by identifying how many times a user goes straight from a specific product’s page to the buy page, indicating the user is ready to purchase that specific item.

To find instances where these events occurred in order, use the MATCH clause. The MATCH clause includes two subclauses: DEFINE and PATTERN. True to their name, you use the subclauses to do the following:

  • DEFINE: define your events
  • PATTERN: specify the pattern of the events

In this scenario, the events involved are the following:

  • User viewed a product’s page
  • User clicked the buy button

pattern3

And the pattern we want them to be in is:

  • User viewed a product’s page, and then immediately purchased the item.
pattern4

 

 

 

Putting these subclauses together in the MATCH clause, we get:
=> SELECT * FROM clickstream_log
MATCH(PARTITION BY uid ORDER BY ts
DEFINE product AS pageURL iLIKE '%mySite.com/products/%'
AND action='V', purchase AS pageURL ILIKE '%mySite.com%' AND action ='P' PATTERN P AS (product purchase));
uid  |    ts    |           refURL           |          pageURL           | action
----+----------+----------------------------+----------------------------+--------
1 | 12:02:00 | mySite.com/products        | mySite.com/products/tshirt | V
1 | 12:03:00 | mySite.com/products/tshirt | mySite.com/buy             | P
3 | 12:19:00 | mySite.com/products        | mySite.com/products/couch  | V
3 | 12:20:00 | mySite.com/products/couch  | mySite.com/buy             | P
(4 rows)

As you can see, the statement returns all instances where a user went straight from viewing a product’s page to purchasing an item.

Pattern Matching Functions

You can view more information about your patterns using pattern matching functions.

Pattern matching functions include:

  • EVENT_NAME(): Returns the name of the event that matched the row.
  • MATCH_ID(): Returns a successful pattern match as an INTEGER value. The returned value is the ordinal position of a match within a partition.
  • PATTERN_ID():Returns an integer value that is a partition-wide unique identifier for the instance of the pattern that matched.

For example, using the previous MATCH clause:
=> SELECT uid, ts, refURL, pageURL, action, event_name(), pattern_id(), match_id()
FROM clickstream_log
MATCH (PARTITION BY uid ORDER BY ts
DEFINE product AS pageURL iLIKE '%mySite.com/products/%' AND action='V', purchase AS pageURL ILIKE '%mySite.com%' AND action ='P' PATTERN P AS (product purchase));

pattern6

 

You’ll notice that the event_name column lists only one event per row. But what if, within a single row, more than one event is a match? In the previous example, this can never happen (i.e., a row can’t satisfy both the product and purchase predicates). So let’s look at an example where this might happen.

Using the ROWS MATCH parameter for multiple TRUE events

In the following clause we define event ‘duplicate1’ as an event where the pageURL includes ‘mySite.com’. We define event ‘duplicate2’ as an event where the refURL contains ‘mySite.com.’ We want to output the pattern where a row with event duplicate1 is followed by a row with duplicate2.

DEFINE       
duplicate1 AS pageURL iLIKE '%mySite.com/%',       
duplicate2 AS refURL iLIKE '%mySite.com/%'     
PATTERN P AS (duplicate1 duplicate2)

From our clickstream_log table, it’s easy to see that almost all rows satisfy events duplicate1 and duplicate2. If we leave this clause as-is, Vertica returns the following error:

ERROR: pattern events must be mutually exclusive 
HINT:  try using ROWS MATCH FIRST EVENT

The Vertica MATCH clause has a ROWS MATCH parameter that allows you to solve this problem. Until now, we’ve been letting the MATCH clause use the default ROWS MATCH parameter: ROWS MATCH ALL EVENTS. To deal with a table that has rows that satisfy more than one event, use the ROWS MATCH FIRST EVENT option to avoid the above error.

With the ROWS MATCH FIRST EVENT keyword, when Vertica comes across a row where both event types exist, it considers the row as being of the event type that is mentioned first in the MATCH clause.

Looking at our table, Vertica would consider the rows marked in green boxes as being of event type ‘duplicate1’ (because they all satisfy the first defined event) and the row marked in a hashed orange box as being of event type ‘duplicate2’ (it only matches the second event):

=> SELECT * FROM clickstream_log;

 

 

Thus, when we specify that we want a pattern where duplicate1 is followed by duplicate2, we get only one instance (where the orange box follows a green box):
=> SELECT uid, ts, refURL, pageURL, action, event_name()
FROM clickstream_log
MATCH ( PARTITION BY uid ORDER BY ts
DEFINE duplicate1 AS pageURL iLIKE '%mySite.com/%', duplicate2 AS refURL iLIKE '%mySite.com/%'
PATTERN P AS (duplicate1 duplicate2)
ROWS MATCH FIRST EVENT);
uid |    ts    |       refURL        |     pageURL     | action | event_name
-----+----------+---------------------+-----------------+--------+------------
2 | 12:12:00 | mySite.com/products | mySite.com/home | V      | duplicate1
2 | 12:13:00 | mySite.com/home     | otherSite.com   | X      | duplicate2
(2 rows)

 

More Examples

With event series pattern matching, the possibilities for data interpretation are endless. Want to see a more complex example? Stay tuned for part two of the Event Series Pattern Matching series.

 

Learn More

Learn more about event series pattern matching, including restrictions, in the core Vertica documentation.