Event Series Pattern Matching

The SQL MATCH Clause syntax lets you screen large amounts of historical data in search of event patterns. You specify a pattern as a regular expression and can then search for the pattern within a sequence of input events. MATCH provides subclauses for analytic data partitioning and ordering, and the pattern matching occurs on a contiguous set of rows.

Pattern matching is particularly useful for clickstream analysis where you might want to identify users' actions based on their Web browsing behavior (page clicks). A typical online clickstream funnel is:

Company home page -> product home page -> search -> results -> purchase online

Using this clickstream funnel, you can search for a match on the user's sequence of web clicks and identify that user:

  • Landed on the company home page
  • Navigated to the product page
  • Ran a search
  • Clicked a link from the search results
  • Made a purchase

Clickstream Funnel Schema

The examples in this topic use this clickstream funnel and the following clickstream_log table schema:

=> CREATE TABLE clickstream_log (
  uid INT,             --user ID
  sid INT,             --browsing session ID, produced by previous sessionization computation
  ts TIME,             --timestamp that occurred during the user's page visit 
  refURL VARCHAR(20),  --URL of the page referencing PageURL
  pageURL VARCHAR(20), --URL of the page being visited  
  action CHAR(1)       --action the user took after visiting the page ('P' = Purchase, 'V' = View) 
);

INSERT INTO clickstream_log VALUES (1,100,'12:00','website1.com','website2.com/home', 'V');
INSERT INTO clickstream_log VALUES (1,100,'12:01','website2.com/home','website2.com/floby', 'V');
INSERT INTO clickstream_log VALUES (1,100,'12:02','website2.com/floby','website2.com/shamwow', 'V');
INSERT INTO clickstream_log values (1,100,'12:03','website2.com/shamwow','website2.com/buy', 'P');
INSERT INTO clickstream_log values (2,100,'12:10','website1.com','website2.com/home', 'V');
INSERT INTO clickstream_log values (2,100,'12:11','website2.com/home','website2.com/forks', 'V');
INSERT INTO clickstream_log values (2,100,'12:13','website2.com/forks','website2.com/buy', 'P');
COMMIT;

Here's the clickstream_log table's output:

=> SELECT * FROM clickstream_log;
 uid | sid |    ts    |        refURL        |       pageURL        | action
-----+-----+----------+----------------------+----------------------+--------
   1 | 100 | 12:00:00 | website1.com         | website2.com/home    | V
   1 | 100 | 12:01:00 | website2.com/home    | website2.com/floby   | V
   1 | 100 | 12:02:00 | website2.com/floby   | website2.com/shamwow | V
   1 | 100 | 12:03:00 | website2.com/shamwow | website2.com/buy     | P
   2 | 100 | 12:10:00 | website1.com         | website2.com/home    | V
   2 | 100 | 12:11:00 | website2.com/home    | website2.com/forks   | V
   2 | 100 | 12:13:00 | website2.com/forks   | website2.com/buy     | P
(7 rows)

Example

This example includes the Vertica Pattern Matching Functions to analyze users' browsing history over website2.com. It identifies patterns where the user performed the following tasks:

  • Landed on website2.com from another web site (Entry)
  • Browsed to any number of other pages (Onsite)
  • Made a purchase (Purchase)

In the following statement, pattern P (Entry Onsite* Purchase) consist of three event types: Entry, Onsite, and Purchase. When Vertica finds a match in the input table, the associated pattern instance must be an event of type Entry followed by 0 or more events of type Onsite, and an event of type Purchase

=> SELECT uid,
       sid,
       ts,
       refurl,
       pageurl,
       action,
       event_name(),
       pattern_id(),
       match_id()
FROM clickstream_log
MATCH
  (PARTITION BY uid, sid ORDER BY ts
   DEFINE
     Entry    AS RefURL  NOT ILIKE '%website2.com%' AND PageURL ILIKE '%website2.com%',
     Onsite   AS PageURL ILIKE     '%website2.com%' AND Action='V',
     Purchase AS PageURL ILIKE     '%website2.com%' AND Action = 'P'
   PATTERN
     P AS (Entry Onsite* Purchase)
   ROWS MATCH FIRST EVENT);

In the output below, the first four rows represent the pattern for user 1's browsing activity, while the following three rows show user 2's browsing habits.

 uid | sid |    ts    |        refurl        |       pageurl        | action | event_name | pattern_id | match_id
-----+-----+----------+----------------------+----------------------+--------+------------+------------+----------
   1 | 100 | 12:00:00 | website1.com         | website2.com/home    | V      | Entry      |          1 |        1
   1 | 100 | 12:01:00 | website2.com/home    | website2.com/floby   | V      | Onsite     |          1 |        2
   1 | 100 | 12:02:00 | website2.com/floby   | website2.com/shamwow | V      | Onsite     |          1 |        3
   1 | 100 | 12:03:00 | website2.com/shamwow | website2.com/buy     | P      | Purchase   |          1 |        4
   2 | 100 | 12:10:00 | website1.com         | website2.com/home    | V      | Entry      |          1 |        1
   2 | 100 | 12:11:00 | website2.com/home    | website2.com/forks   | V      | Onsite     |          1 |        2
   2 | 100 | 12:13:00 | website2.com/forks   | website2.com/buy     | P      | Purchase   |          1 |        3
(7 rows)