The Vertica Approach
Sessionization is a common analytic operation in clickstream analysis. Given an input clickstream table, where each row records a webpage click made by a particular user (or IP address), the sessionization operation identifies user’s web browsing sessions from the recorded clicks, by grouping the clicks from each user based on the time-intervals between the clicks. Conceptually, if two clicks from the same user are made too far apart in time (as defined by a time-out threshold), they will be treated as coming from two browsing sessions.
Here is an example input clickstream table with a simplified schema. Ignore the output column session_id for now.
The standard semantics of sessionization takes a single input parameter: the time-out threshold, which is a constant time interval value. An example time-out threshold value is 30 seconds. Sessionization performs its computation on two columns in the input clickstream table, the user_id and the timestamp of the click. The output session_id column produced by sessionization is shown in the above table.
Vertica’s Sessionization Support
Sessionization in Vertica is built on top of the event-based window function CONDITIONAL_TRUE_EVENT (or CTE in short). Recall the semantics of CTE with input Boolean expression P: CTE(P) is evaluated once per input row, and defines a new window starting at the current row, whenever P is evaluated to true for that row. For example, given a sequence of values <1, 2, 3, 4> for column X, CTE(X > 2) assigns to these rows a sequence of window Ids <0, 0, 1, 2>. Also, recall that the expression P in CTE can access column values in the current row, as well as in previous rows. For example, CTE (X > LAG(X)) defines a new window whenever the value of column X in the current row is greater than X in the last row.
Despite of its powerful semantics, the run-time complexity of CTE is at the level of the simplest SQL ’99 analytic functions such as RANK and ROW_NUMBER – it takes only a single pass over the sorted data, while retaining a minimal amount of state in the computation.
Thanks to CTE, sessionization with its standard semantics can be expressed in Vertica as follows.
SELECT user_id, timestamp, CONDITIONAL_TRUE_EVENT(timestamp – LAG(timestamp) > ’30 seconds’) OVER (PARTITION BY user_id ORDER BY timestamp)
Beyond the Standard Semantics of Sessionization
One limitation of the standard semantics of sessionization is that the time-out threshold is a constant value. However, different users may have different styles and preferences for internet browsing, and therefore the same time-out threshold may not accurately identify sessions for all users.
For example, say user A is a slower web-surfer than an average user, perhaps because A is multi-tasking heavily. Say if an average user does not perform page clicks in a particular web domain D in 30 seconds, it indicates the end of a session. However, for user A, the typical interval between two clicks in same domain is 1 minute, as she is busy tweeting, listening to music, and harvesting in Farmville at the same time. So a better solution is to adaptively determine the session timeout threshold of user A based on her recent browsing behavior (e.g. the average time interval between 2 consecutive clicks in the last 10 clicks which A has performed). This allows the clickstream analyst to customize the timeout threshold for difference users.
For example, to adaptively compute the time-out threshold for a user based on her last 10 clicks with a “fudge factor” of 3 seconds, we can use the following CTE expression: CONDITIONAL_TRUE_EVENT (timestamp – lag(timestamp) <= (LAG(timestamp, 1) – LAG(timestamp,11)) / 10) + ’3 seconds’. The fudge factor can be a multiplicative factor instead of an additive one. For example, it can be 110% of the average time intervals of the last 10 clicks.
Another sessionization use case involving a more sophisticated time-out threshold is to use different threshold values based on other factors, such as the time of the day, or the nature of the website being browsed. For example, the time-out threshold for Wall Street Journal Online should be higher than xkcd webcomic, as the WSJ articles take longer to read in average than the xkcd comic strips.
The Vertica approach to sessionization enjoys the multiple benefits of ease of use, strong expressive power, as well as highly efficient and scalable execution. Wondering how some alternative approaches will stack up against Vertica’s (hint: they won’t)? That’s what we will answer in a future post.