Sessionization with Event-Based Windows
Sessionization, a special case of event-based windows, is a feature often used to analyze click streams, such as identifying web browsing sessions from recorded web clicks.
In Vertica, given an input clickstream table, where each row records a Web page click made by a particular user (or IP address), the sessionization computation attempts to identify Web browsing sessions from the recorded clicks by grouping the clicks from each user based on the time-intervals between the clicks. If two clicks from the same user are made too far apart in time, as defined by a time-out threshold, the clicks are treated as though they are from two different browsing sessions.
Example Schema
The examples in this topic use the following WebClicks schema to represent a simple clickstream table:
CREATE TABLE WebClicks(userId INT, timestamp TIMESTAMP); INSERT INTO WebClicks VALUES (1, '2009-12-08 3:00:00 pm'); INSERT INTO WebClicks VALUES (1, '2009-12-08 3:00:25 pm'); INSERT INTO WebClicks VALUES (1, '2009-12-08 3:00:45 pm'); INSERT INTO WebClicks VALUES (1, '2009-12-08 3:01:45 pm'); INSERT INTO WebClicks VALUES (2, '2009-12-08 3:02:45 pm'); INSERT INTO WebClicks VALUES (2, '2009-12-08 3:02:55 pm'); INSERT INTO WebClicks VALUES (2, '2009-12-08 3:03:55 pm'); COMMIT;
The input table WebClicks
contains the following rows:
=> SELECT * FROM WebClicks; userId | timestamp --------+--------------------- 1 | 2009-12-08 15:00:00 1 | 2009-12-08 15:00:25 1 | 2009-12-08 15:00:45 1 | 2009-12-08 15:01:45 2 | 2009-12-08 15:02:45 2 | 2009-12-08 15:02:55 2 | 2009-12-08 15:03:55 (7 rows)
In the following query, sessionization performs computation on the SELECT list columns, showing the difference between the current and previous timestamp value using LAG()
. It evaluates to true and increments the window ID when the difference is greater than 30 seconds.
=> SELECT userId, timestamp, CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > '30 seconds') OVER(PARTITION BY userId ORDER BY timestamp) AS session FROM WebClicks; userId | timestamp | session --------+---------------------+--------- 1 | 2009-12-08 15:00:00 | 0 1 | 2009-12-08 15:00:25 | 0 1 | 2009-12-08 15:00:45 | 0 1 | 2009-12-08 15:01:45 | 1 2 | 2009-12-08 15:02:45 | 0 2 | 2009-12-08 15:02:55 | 0 2 | 2009-12-08 15:03:55 | 1 (7 rows)
In the output, the session column contains the window ID from the CONDITIONAL_TRUE_EVENT function. The window ID evaluates to true on row 4 (timestamp 15:01:45), and the ID that follows row 4 is zero because it is the start of a new partition (for user ID 2), and that row does not evaluate to true until the last line in the output.
You might want to give users different time-out thresholds. For example, one user might have a slower network connection or be multi-tasking, while another user might have a faster connection and be focused on a single Web site, doing a single task.
To compute an adaptive time-out threshold based on the last 2 clicks, use CONDITIONAL_TRUE_EVENT with LAG to return the average time between the last 2 clicks with a grace period of 3 seconds:
=> SELECT userId, timestamp, CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > (LAG(timestamp, 1) - LAG(timestamp, 3)) / 2 + '3 seconds') OVER(PARTITION BY userId ORDER BY timestamp) AS session FROM WebClicks; userId | timestamp | session --------+---------------------+--------- 2 | 2009-12-08 15:02:45 | 0 2 | 2009-12-08 15:02:55 | 0 2 | 2009-12-08 15:03:55 | 0 1 | 2009-12-08 15:00:00 | 0 1 | 2009-12-08 15:00:25 | 0 1 | 2009-12-08 15:00:45 | 0 1 | 2009-12-08 15:01:45 | 1 (7 rows)
You cannot define a moving window in time series data. For example, if the query is evaluating the first row and there’s no data, it will be the current row. If you have a lag of 2, no results are returned until the third row.