Why is the Vertica Approach Unmatched?
In Part 1 of this two-part article, we described Vertica’s approach to sessionization. To understand why this approach is head and shoulders above the state-of-the-art, let us closely examine a few existing approaches.
The Aster Data Approach –From Boston to LA via China
The first approach comes from Aster Data. One query example from this article is quoted below.
SELECT sessionId, userId, timestampValue
FROM Sessionize( ‘timestamp’, 60 ) ON
(SELECT userid, timestampValue FROM webclicks WHERE userid = 50);
There are a few usability hurdles with this design.
- The Sessionize function does not take userId as its input. It therefore must be assuming the input table webclicks has a column named userId, to be used in this computation. Hard-coding column names makes the product less usable.
- This design of sessionization support unnecessarily deviates from standard SQL in terms of both syntax and semantics, artificially creating a steep learning curve for the end users to climb. For example, the keyword ON has a different meaning from the standard use of ON in SQL ANSI joins. As a proposal for an alternative design, why not retain the subquery as a standard FROM clause subquery, and move the Sessionize function to the SELECT clause? This could preserve the required semantic ordering of evaluating the subquery before the Sessionize function, while minimizing the deviation from Standard SQL.
- This example implies that the predicate userid = 50 has to be placed in a subquery in order for that predicate to be pushed before the Sessionize computation. This predicate push down technique, well known to relational databases since the invention of relational algebra in the 1970’s, should have been taken care of in the query optimizer automatically. Forcing the end user to come up with “clever” SQL formulations violates the spirit of declarative programming in SQL, again making the product less usable.
Apparently Aster Data took another stab at the design, where a query example in that new design is quoted below.
Select ts, userid, session
From sessionize (
on clicks partition by userid order by ts
Timecolumn(‘ts’) timeout (60)
While this design represents progress, it still suffers from unnecessary deviation from the Standard SQL (see Comment #2 above). Also, since the proposed Sessionize syntax is not fully interoperable with other existing SQL constructs, it makes the query formulation cumbersome when the user wants to perform sessionization and other SQL computation in the same query. For example, how many subqueries need to be involved, if the user wants to first join the table clicks with a dimension table, perform a group-by, and then perform sessionization? In comparison, the Vertica approach employs not a single subquery to accomplish this task.
Finally, in case you wonder, in Aster Data Sessionize is a MapReduce function written in Java. Does the sessionization task justify the employment of a big hammer execution mechanism like MapReduce? No! As we mentioned before, the run-time complexity of sessionization is at the level of the simplest SQL ’99 analytic functions such as RANK and ROW_NUMBER, which takes nothings more than a single pass over the sorted data.
Using MapReduce to implement sessionization is a mistake on the architectural level. It does not matter how fast your MapReduce implementation is – if you travel from Boston to LA by flying around the globe, does it matter that you are taking a supersonic aircraft? Besides, while MapReduce is more versatile than SQL when processing computational tasks, it has been shown that when processing SQL-like query tasks, due to its significant overhead a MapReduce engine can be slower than a SQL engine by an order of magnitude or more.
The Teradata Approach – a Horse Wagon Fun Ride from Boston to LA
In this article, Teradata revealed its support for sessionization. We quote its query example below.
with dt (IP, Click_Timestamp, samesession) as
( select IP, Click_Timestamp,
case when (Click_Timestamp – (max(Click_Timestamp) over (partition by IP order by Click_Timestamp rows between 1 preceding and 1 preceding ) ) minute) < (interval ’30′ minute) then 0 else 1 end
select sum(samesession) over (partition by IP order by Click_Timestamp rows unbounded preceding) as Session_No, IP, Click_Timestamp
From this example, we can see that Teradata’s sessionization support is not native. Instead, it is expressed in terms of existing SQL ’99 functions. Such a design bears the following consequences.
- The resulting formulation is quite cumbersome and unintuitive – if you understand why this query is actually performing sessionization, we salute to you for being a SQL guru. On the other hand, an average DBA might be at a complete loss at what the query intends to do. Also, the WITH clause is an unnecessary big hammer construct to employ – a subquery could have been used. Then again, no subquery would be nice!
- The use of the SQL window aggregate function Max further complicates the formulation unnecessarily. The SQL analytic function LAG should have been used.
- Last but certainly not least, the complex query formulation involving two query blocks is likely to lead to suboptimal run-time performance.
Vertica’s sessionization support is unmatched among its peers, and here is why.
- Thanks to Vertica’s analytic function CONDITIONAL_TRUE_EVENT (CTE), the query formulation of sessionization is extremely compact – no subquery or unnecessary SQL construct is involved.
- Thanks to CTE, Vertica’s sessionization goes beyond its standard semantics, and is thus capable of supporting a wider range of use cases.
- The run-time performance of Vertica’s sessionization is close to optimal –the computation is fully pipelined for each user id, and is fully load-balanced across all user ids.
In closing this post, we would like to offer our dear esteemed readers the challenge to come up with a solution that beats Vertica’s implementation of sessionization.