
Archive for September, 2010

More Time Series Analytics: Event-based Window Functions

An event-based window function assigns input rows to windows based on their non-timestamp column values. The Vertica event-based window functions, among Vertica’s many in-database analytics features, assign to each input row an integer window ID, starting from 0. The window ID is incremented whenever a new window starts.

Two previous posts discussed Gap Filling and Interpolation (GFI) time series analytics in Vertica and some use cases. GFI analytics groups the input rows according to consecutive uniform time intervals referred to as time slices, and then performs gap filling and interpolation on the group of rows associated with each time slice. We say the grouping of the input rows in GFI is time-based.

In some use cases, however, the rows in the input time series data need to be grouped according to certain column values, instead of their timestamps. For example, given an input stream of MSFT stock quotes, a stock analyst may want to place the input quotes into a new group whenever the spread (the difference between the ask price and the bid price) rises above $0.05. If we view each such group as a window of events, then the window endpoints are defined by the occurrence of certain event types. In the above example, the window border-defining event type is a stock quote whose spread exceeds $0.05.

The above Financial Services query example can be formulated as follows.

SELECT symbol, ask, bid, timestamp,
       CONDITIONAL_TRUE_EVENT(ask - bid > 0.05)
         OVER (PARTITION BY symbol ORDER BY timestamp) spread_window_id
FROM Tickstore;

The Almighty Event-based Window Function CONDITIONAL_TRUE_EVENT

The primary event-based window function in Vertica 4.0 time series analytics is CONDITIONAL_TRUE_EVENT, which we abbreviate as CTE in the subsequent text. CTE takes an input Boolean expression P (P for predicate). P is evaluated once for each input row; whenever P evaluates to true, the associated input row is labeled with a new window ID.

Below is a pictorial example illustrating the semantics of CTE. Let the input Tickstore table contain the following rows (the ask column is omitted for simplicity). Ignore the output column window_id for now.

symbol  bid   timestamp  window_id
XYZ     10.0  03:00:00   0
XYZ     11.0  03:00:03   1
XYZ     10.5  03:00:06   1
XYZ     11.0  03:00:09   2

Now let us answer the following query. The values of its output column window_id are shown in the above table.

SELECT symbol, bid, timestamp,
       CONDITIONAL_TRUE_EVENT(bid > 10.6)
         OVER (PARTITION BY symbol ORDER BY timestamp) window_id
FROM Tickstore;

The input and output data can be visualized on a timeline: each input row is a point, and whenever the bid price goes above the threshold value $10.6, the output window_id is incremented.

Accessing Previous Rows in Event-based Window Functions

In the example of event-based window functions provided above, the Boolean expression P within CTE accesses values only from the current row. Sometimes, however, the window border-defining event type involves a sequence of recent past rows as well as the current row. For example, we may want to start a new window whenever the average of bid and ask in the current row is above that of the previous row. This Boolean expression can be formulated in Vertica as (bid + ask) / 2 - (LAG(bid) + LAG(ask)) / 2 > 0. More generally, the analytic function syntax LAG(x, n) retrieves the value of column x in the nth-to-last input row. The second parameter n is optional, and defaults to 1.
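Put together as a full query, a sketch over the Tickstore table from the earlier examples (reusing its bid and ask columns) might read:

-- Start a new window whenever the mid-price rises relative to the previous row
SELECT symbol, bid, ask, timestamp,
       CONDITIONAL_TRUE_EVENT((bid + ask) / 2 - (LAG(bid) + LAG(ask)) / 2 > 0)
         OVER (PARTITION BY symbol ORDER BY timestamp) window_id
FROM Tickstore;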

With its ability to access previous rows, CTE can express any event-based window function whose defining event involves the current row and the past n rows, for any arbitrary finite number n. A formal proof is omitted here in an attempt to keep our readers on this post.

Another Event-based Window Function CONDITIONAL_CHANGE_EVENT

Having covered the extremely powerful event-based window function CTE, we now introduce a second function, CONDITIONAL_CHANGE_EVENT, abbreviated as CCE in this post. CCE takes an input expression E of any data type. E is evaluated once for each input row; when the value of E on the current row differs from the value of E on the previous row, the current row is labeled with a new window ID.

Semantically, CCE is a special case of CTE, because CCE(E(current row)) ≡ CTE(E(current row) <> E(previous row)). However, proper use of CCE can result in a more compact query formulation, and possibly better run-time performance.
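As an illustration, the following two formulations, sketched over the Tickstore table used above, label windows identically:

-- CCE: a new window starts whenever the bid value changes
SELECT symbol, bid, timestamp,
       CONDITIONAL_CHANGE_EVENT(bid)
         OVER (PARTITION BY symbol ORDER BY timestamp) window_id
FROM Tickstore;

-- The equivalent, but more verbose, CTE formulation
SELECT symbol, bid, timestamp,
       CONDITIONAL_TRUE_EVENT(bid <> LAG(bid))
         OVER (PARTITION BY symbol ORDER BY timestamp) window_id
FROM Tickstore;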

More Use Cases of Event-based Window Functions

Besides Financial Services, event-based window functions have applications in many other industry sectors as well. Let us turn to log analysis, where the log data may be produced by click streams, software programs, online games, and so on. Say the popular MMORPG World of Warcraft logs a sequence of action events for each in-game character. Each character can work on one major task at a time (e.g., slay dragons, obtain magical artifacts). For each task taken on by a character, its log events consist of a START TASK event, followed by a sequence of action events pertinent to accomplishing that task. An example table schema that stores such log data is (character, event_type, timestamp). Now the log analyst would like to group the log events based on the tasks they are associated with. This can be accomplished by the event-based window function CONDITIONAL_TRUE_EVENT(event_type = 'START TASK') OVER (PARTITION BY character ORDER BY timestamp) task_id, as spelled out below.
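Over a hypothetical game_log table with the schema above, the full query might read:

SELECT character, event_type, timestamp,
       CONDITIONAL_TRUE_EVENT(event_type = 'START TASK')
         OVER (PARTITION BY character ORDER BY timestamp) task_id
FROM game_log;

Every row from one START TASK event up to (but not including) the next then carries the same task_id, so downstream analysis can simply group by character and task_id.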

It turns out that for clickstream analytics, CTE is a powerful tool for implementing in-database sessionization with unmatched flexibility and performance. This will be the subject of a future post.

Gap Filling and Interpolation (GFI)

A Swiss-Army Knife for Time Series Analytics

Gap Filling and Interpolation (GFI) is a set of patent-pending time series analytics features in Vertica 4.0.  In a previous post, we introduced GFI and showcased their expressive power and ease of use through a use case in financial services. In this post, through additional use cases, we will show that GFI can enable Vertica users in a wide range of industry sectors to achieve a diverse set of goals.

Use Case #1: Aggregate bandwidth analysis for Video-On-Demand (VOD) applications

In a Telecommunications VOD workload, assume we have the following (simplified) fact table schema for tracking the quality of service (QoS) in serving video content to end users: vod_qos_fact(user_id, video_id, timestamp, bandwidth).

Each time a user requests a video clip, a row is inserted into the fact table, recording the timestamp and the bandwidth consumption of serving the video clip at that time. In addition, when the bandwidth consumption changes significantly for a video clip being served, a new row is inserted to record the new bandwidth consumption. An example fact table might look like the following (illustrative values):

user_id  video_id  timestamp  bandwidth
1        100       03:00:00   2 Mbps
1        100       03:01:30   3 Mbps
2        100       03:00:54   2 Mbps

In this example, the  VOD QoS analyst might like to know the following:  for every minute, what is the aggregate bandwidth consumption across all users? This question can be answered in Vertica 4.0 in two pipelined steps.

First, the GFI time series analytics is used to compute the interpolated bandwidth consumption of each user for every minute.

Second, the interpolated bandwidth consumption is aggregated across all users for every minute. The resulting SQL formulation of these two steps is the following.

SELECT slice_time, SUM(i_bandwidth) AS agg_bandwidth
FROM (
  SELECT user_id, slice_time, TS_FIRST_VALUE(bandwidth) AS i_bandwidth
  FROM vod_qos_fact
  TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY user_id ORDER BY timestamp)
) tmp
GROUP BY slice_time;
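Here slice_time is the output time column defined by the TIMESERIES clause, which assigns each input row to a consecutive one-minute time slice per user, and TS_FIRST_VALUE(bandwidth) returns the interpolated bandwidth value at the start of each slice (by default using constant interpolation, as discussed in our previous GFI posts). The outer query then sums these per-user interpolated values for each minute.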

In comparison to other popular solution alternatives such as stored procedures, Map/Reduce scripts, and ODBC/JDBC programs, the Vertica approach enjoys the following benefits:

  • Code and architecture simplicity: The computation is expressed in SQL, and is performed within the Vertica database. No additional programming language or execution platform is needed.
  • Efficiency and scalability: The Vertica approach effectively balances the data and computation across all cluster nodes, and minimizes the amount of disk and network I/O needed to accomplish this computation task.

These benefits apply to the following use cases as well.

Use Case #2: Storage compression for inventory management applications

In a retail workload, assume we have the following (simplified) fact table schema for inventory management: inventory_fact(item_id, item_location, timestamp, quantity_in_stock).

For each inventory item in a certain location, every time its stock quantity changes, there is a new row inserted into this fact table with the updated quantity value and the associated timestamp of that inventory change.

For example, I recently went to a nearby Apple retail store to purchase an iPad. After that transaction, suppose Apple’s inventory_fact table is populated with a new row, indicating that for this particular Apple store, the quantity of iPads in stock has decreased by 1.

For a popular product like the iPad, the inventory level in each store may change thousands of times or more each day, creating a large number of rows in the inventory_fact table. However, for the purpose of inventory analysis, the inventory data can often be stored in a much coarser time granularity, such as one row per day for each item and location pair.

The GFI time series analytics in Vertica 4.0 can be used to efficiently aggregate the raw inventory data into a proper time granularity, thus consuming much less storage as well as significantly speeding up inventory analysis. Note that in comparison, traditional SQL analytic functions such as FIRST_VALUE cannot be used to correctly compute the aggregation (the reasoning behind this statement is left as an exercise for the interested reader).
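As a sketch of that coarser-granularity aggregation, assuming the TS_LAST_VALUE time series aggregate (which returns the value of its expression at the end of each time slice) with its default constant interpolation, one row per day for each item and location pair could be computed as follows:

SELECT item_id, item_location, slice_time,
       TS_LAST_VALUE(quantity_in_stock) AS eod_quantity
FROM inventory_fact
TIMESERIES slice_time AS '1 day'
  OVER (PARTITION BY item_id, item_location ORDER BY timestamp);

Each output row reports the end-of-day stock quantity, and days with no inventory events are gap-filled with the carried-forward quantity.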

Use Case #3: Power reduction for wireless sensor applications

In a Smartdust project, computer scientists may deploy a large number of tiny sensors across a remote region to measure environmental parameters, such as light and temperature. The sensors transmit their measurement readings via wireless signals to one or more base stations, where the installed applications process the measurements.

Since these sensors are battery-powered, efficient power management is crucial to ensuring prolonged operation. The key to efficient power management is to minimize the amount of data transmitted from the sensors to the base stations.

A base station application may require that measurement updates be available at regular time intervals, say once every second. One straightforward implementation is to have each sensor transmit a new measurement for each time interval. However, this transmission policy could lead to a huge battery drain on the sensors, especially considering that consecutive measurements from the same sensor often carry the same value.

Instead, we could have the sensors submit their readings only when the measurement values change, and then use GFI on the base station application to losslessly recover the required measurement values at regular intervals.
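A sketch of that recovery step on the base station, assuming a hypothetical sensor_readings(sensor_id, timestamp, temperature) table:

SELECT sensor_id, slice_time,
       TS_FIRST_VALUE(temperature) AS i_temperature
FROM sensor_readings
TIMESERIES slice_time AS '1 second'
  OVER (PARTITION BY sensor_id ORDER BY timestamp);

Because a sensor transmits a reading only when its value changes, constant interpolation reproduces the per-second measurement values exactly, which is what makes the recovery lossless.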

Conclusion

The above three use cases respectively illustrate the use of Vertica’s GFI time series analytics in the context of Telecommunications, Retail, and Science applications. You may be wondering how to assess whether GFI applies to your particular problem. One way is to look for the following common characteristics of problems to which GFI is applicable:

  1. You have multiple time series event streams (say from multiple users or multiple stock symbols), where the events do not occur at regular time intervals. However, you would like to aggregate or correlate over these event streams (use case #1)
  2. The raw data is stored in a denser time granularity than needed in your analysis computation (use case #2)
  3. The raw data is stored in a sparser time granularity than needed in your analysis computation (use case #3)

With Vertica, the built-in GFI functionality allows you to simplify and accelerate your time-series analytics.

Introducing Tech Talk

Since its beginnings, the Database Column has evolved to include information about the Business Intelligence and Database industries from a number of different perspectives. As such, this blog has attracted a diverse range of readers, from those who want high-level intelligence to help them make strategic business decisions to those who implement databases and work with business intelligence on a daily basis.

It’s for this last group, the “techies,” that we’re proud to introduce a new feature to the Database Column blog. In this new feature, called “Tech Talk,” each post will offer in-depth information for those of you who are in the trenches. And for those who don’t need to know how to write a SQL query or aren’t interested in the details of GFI (Gap Filling and Interpolation), we are tagging these posts with the “Tech Talk” category so you know immediately whether the content is appropriate for you.

So watch this space for our first Tech Talk post entitled:

GFI – A Swiss Army Knife for Time Series Analytics

Over the coming weeks and months, we’ll be rolling out Tech Talk posts on a number of different subjects, including event window functions, the query benefits of sorted data, implementing a successful proof of concept trial, and how Vertica offers high availability.

If you have any ideas for a Tech Talk topic, please use the Comments section to let us know. We’re looking forward to exploring a number of topics with you, our readers.
