Sessionization with Event-Based Windows

Sessionization, a special case of event-based windows, is a feature often used to analyze click streams, such as identifying web browsing sessions from recorded web clicks.

In Vertica, given an input clickstream table, where each row records a Web page click made by a particular user (or IP address), the sessionization computation attempts to identify Web browsing sessions from the recorded clicks by grouping the clicks from each user based on the time-intervals between the clicks. If two clicks from the same user are made too far apart in time, as defined by a time-out threshold, the clicks are treated as though they are from two different browsing sessions.

Example Schema

The examples in this topic use the following WebClicks schema to represent a simple clickstream table:

CREATE TABLE WebClicks(userId INT, timestamp TIMESTAMP);
INSERT INTO WebClicks VALUES (1, '2009-12-08 3:00:00 pm');
INSERT INTO WebClicks VALUES (1, '2009-12-08 3:00:25 pm');
INSERT INTO WebClicks VALUES (1, '2009-12-08 3:00:45 pm');
INSERT INTO WebClicks VALUES (1, '2009-12-08 3:01:45 pm');
INSERT INTO WebClicks VALUES (2, '2009-12-08 3:02:45 pm');
INSERT INTO WebClicks VALUES (2, '2009-12-08 3:02:55 pm');
INSERT INTO WebClicks VALUES (2, '2009-12-08 3:03:55 pm');
COMMIT;

The input table WebClicks contains the following rows:

=> SELECT * FROM WebClicks;
 userId |      timestamp
--------+---------------------
      1 | 2009-12-08 15:00:00
      1 | 2009-12-08 15:00:25
      1 | 2009-12-08 15:00:45
      1 | 2009-12-08 15:01:45
      2 | 2009-12-08 15:02:45
      2 | 2009-12-08 15:02:55
      2 | 2009-12-08 15:03:55
(7 rows)

In the following query, sessionization performs computation on the SELECT list columns, showing the difference between the current and previous timestamp value using LAG(). It evaluates to true and increments the window ID when the difference is greater than 30 seconds.

=> SELECT userId, timestamp,
     CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > '30 seconds') 
     OVER(PARTITION BY userId ORDER BY timestamp) AS session FROM WebClicks;
 userId |      timestamp      | session
--------+---------------------+---------
      1 | 2009-12-08 15:00:00 |       0
      1 | 2009-12-08 15:00:25 |       0
      1 | 2009-12-08 15:00:45 |       0
      1 | 2009-12-08 15:01:45 |       1
      2 | 2009-12-08 15:02:45 |       0
      2 | 2009-12-08 15:02:55 |       0
      2 | 2009-12-08 15:03:55 |       1
(7 rows)

In the output, the session column contains the window ID from the CONDITIONAL_TRUE_EVENT function. The window ID evaluates to true on row 4 (timestamp 15:01:45), and the ID that follows row 4 is zero because it is the start of a new partition (for user ID 2), and that row does not evaluate to true until the last line in the output.

You might want to give users different time-out thresholds. For example, one user might have a slower network connection or be multi-tasking, while another user might have a faster connection and be focused on a single Web site, doing a single task.

To compute an adaptive time-out threshold based on the last 2 clicks, use CONDITIONAL_TRUE_EVENT with LAG to return the average time between the last 2 clicks with a grace period of 3 seconds:

=> SELECT userId, timestamp, CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) >
(LAG(timestamp, 1) - LAG(timestamp, 3)) / 2 + '3 seconds') 
OVER(PARTITION BY userId ORDER BY timestamp) AS session
FROM WebClicks;
 userId |      timestamp      | session
--------+---------------------+---------
      2 | 2009-12-08 15:02:45 |       0
      2 | 2009-12-08 15:02:55 |       0
      2 | 2009-12-08 15:03:55 |       0
      1 | 2009-12-08 15:00:00 |       0
      1 | 2009-12-08 15:00:25 |       0
      1 | 2009-12-08 15:00:45 |       0
      1 | 2009-12-08 15:01:45 |       1
(7 rows)

You cannot define a moving window in time series data. For example, if the query is evaluating the first row and there’s no data, it will be the current row. If you have a lag of 2, no results are returned until the third row.