An event-based window function assigns input rows to windows based on their non-timestamp column values. The Vertica event-based window functions, one of Vertica’s many in-database analytics, assign to each input row an integer value representing the window ID, starting from 0. The window ID is incremented when a new window starts.
Two previous posts discussed Gap Filling and Interpolation (GFI) time series analytics in Vertica and some use cases. GFI analytics groups the input rows according to consecutive uniform time intervals referred to as time slices, and then performs gap filling and interpolation on the group of rows associated with each time slice. We say the grouping of the input rows in GFI is time-based.
In some use cases, however, the rows in the input time series data need to be grouped according to certain column values, instead of their timestamps. For example, given an input stream of MSFT stock quotes, the stock analyst may want to place the input quotes into a new group whenever the spread (the difference between the ask price and the bid price) goes above $0.05. If we view each such group as awindow of events, then the window endpoints are defined by the occurrence of certain event types. In the above example, the window border-defining event type is a stock quote whose spread is above $0.05.
The above Financial Services query example can be formulated as follows.
SELECT symbol, ask, bid, timestamp, CONDITIONAL_TRUE_EVENT(ask – bid > 0.05) OVER (PARTITION BY symbol ORDER BY timestamp) spread_window_id
The Almighty Event-based Window Function CONDITIONAL_TRUE_EVENT
The primary event window function in Vertica 4.0 time series analytics is CONDITIONAL_TRUE_EVENT, which we abbreviate as CTE in subsequent text. CTE takes an input Boolean expression P (P for predicate). P is evaluated once for each input row. When P is evaluated to true, the associated input row is labeled with a new window ID.
Below is a pictorial example illustrating the semantics of CTE. Let the input Tickstore table contain the following rows (the ask column is omitted for simplicity). Ignore the output column window_id for now.
Now let us answer the following query. The values of its output column window_id are shown in the above table.
SELECT symbol, bid, timestamp, CONDITIONAL_TRUE_EVENT(bid > 10.6) OVER(PARTITION BY symbol ORDER BY timestamp) window_id
The input and output data can be visualized in the following figure. The blue dots represent the input rows. Whenever the bid price goes above the threshold value $10.6, the output window_id is incremented.
Accessing Previous Rows in Event-based Window Functions
In the example of event-based window functions provided above, the Boolean expression P within CTE only accesses values from the current row. However, sometimes the window border-defining event type involves a sequence of recent past rows as well as the current row. For example, we may want to define a new window whenever the average value of bid and ask in the current row is above that in the last row. This Boolean expression can be formulated in Vertica as (bid1 + ask1) / 2 – (LAG(bid1) + LAG(ask1))/2 > 0. More generally, we use the analytic functional syntax LAG(x, n) to retrieve the value of column X in the nth to last input row. The second parameter n is optional, and defaults to 1.
With its ability to access previous rows, we can show that CTE can express any event-based window functions whose input involves the current row and the past n rows for any arbitrary finite number n. A formal proof is omitted here in an attempt to keep the readers on this post.
Another Event-based Window Function CONDITIONAL_CHANGE_EVENT
Having covered the extremely powerful event-based window function CTE, we now introduce a second function CONDITIONAL_CHANGE_EVENT, abbreviate as CCE in this post. CCE takes an input expression E of any data type. E is evaluated once for each input row. When the value of E on the current row is different from the value of E on the previous row, the current row is labeled with a new window ID.
Semantically, CCE is a mere special version of CTE, because CCE(E(current row)) º CTE(E(current row) <> E(previous row)). However, proper use of CCE can result in more compact query formulation, and possibly better run-time performance.
More Use Cases of Event-based Window Functions
Besides Financial Services, event-based window functions have applications in many other industry sectors as well. Let us turn to log analysis, where the log data can be produced by click streams, software programs, online games, etc. say the popular MMORPG game World of Warcraft logs a sequence of action events for each in-game character. Each character can work on one major task at a time (e.g. slay dragons, obtain magical artifacts, etc). For each task being taken on by the character, its log events consist of a START TASK event, followed by a sequence of action events pertinent to accomplishing this task. An example table schema that stores such log data can be (character, event_type, timestamp). Now the log analyst would like to group the log events based on the tasks they are associated with. This can be accomplished by this event-based window function: CONDITIONAL_TRUE_EVENT(eventType = ‘START TASK’) OVER (PARTITION BY character ORDER BY timestamp) task_id.
It turns out that for clickstream analytics, CTE is a powerful tool to implement in-database sessionization capability with unmatched flexibility and performance. This will be the subject of a future post.