How to Do Time Series Analysis in Vertica

Posted June 11, 2018 by Soniya Shah, Information Developer

This blog post was authored by Maurizio Felici. Time series analytics evaluate the values of a given set of variables over time and group those values into a window, based on a time interval for analysis and aggregation. There are a few types of functions that help perform time series analytics:
  • Event based functions
  • Time series functions

Event Based Functions

Event based analytic functions allow you to use significant events to define the “borders” of your event window. For example, suppose you have a series of events represented by timestamps: select * from ts1 order by ts ; ts --------------------- 2015-05-21 23:08:02 2015-05-21 23:08:11 2015-05-21 23:08:20 2015-05-21 23:08:53 2015-05-21 23:09:03 2015-05-21 23:11:18 2015-05-21 23:16:38 2015-05-21 23:16:49 You can visualize these events by plotting them on a time axis: Now we want to identify group events consisting of one or more single events. A single event is part of a group event if it happens within a time proximity threshold of 60 seconds: Each group event is described by the following attributes:
  • Start timestamp
  • End timestamp
  • Group-event size (the number of single events in the group event)
This kind of analysis has several practical applications, including customer interaction analysis and web interaction session definitions. Let’s see how easily we can identify our group events using the Vertica event based analytic functions: The following graphic breaks down how the previous query worked:

Time Series Functions

Time series are two dimensional objects used to measure variables over time. Each element of a time series contains a variable value and a position in the time axis. Time series analytics is useful when you want to analyze discrete data collected over time, such as stock market trades and performance. Discrete time series occur when variables are measured within a certain sampling interval. Time series are normally represented in two dimension graphs, with an X-axis for time and a Y-axis for the variable value. Time series analysis can be difficult when you have gaps, incomplete series, or a non-uniform time interval. Time series analytics is a useful Vertica SQL extension used to evaluate and group variables over time. It consists of:
  • SELECT timeseries clause
  • TS_FIRST_VALUE and TS_LAST_VALUE aggregate functions
The syntax for time series functions is similar to other analytic functions: The slice time is the time column produced by the TIMESERIES clause. The time interval is the length of the time slice. The window partition clause (PARTITION BY) is option and used to partition data. The window order clause (ORDER BY) is required and used to sort data. This example uses the same list of timestamps used in the previous section and then fills in the gaps: select * from ts1 ; ts --------------------- 2015-05-21 23:08:02 2015-05-21 23:08:11 2015-05-21 23:08:20 2015-05-21 23:08:53 2015-05-21 23:09:03 2015-05-21 23:11:18 2015-05-21 23:16:38 2015-05-21 23:16:49 The SELECT…TIMESERIES created one time point for each minute: select tm from ts1 timeseries tm as '1 minute' over ( order by ts ) ; tm --------------------- 2015-05-21 23:08:00 2015-05-21 23:09:00 2015-05-21 23:10:00 2015-05-21 23:11:00 2015-05-21 23:12:00 2015-05-21 23:13:00 2015-05-21 23:14:00 2015-05-21 23:15:00 2015-05-21 23:16:00 Now, let’s add a new column: select * from tseries order by ts ; ts | value ---------------------+------- 2015-05-21 23:08:02 | 27 2015-05-21 23:08:11 | 26 2015-05-21 23:08:20 | 30 2015-05-21 23:08:52 | 28 2015-05-21 23:09:03 | 27 2015-05-21 23:11:18 | 20 2015-05-21 23:16:38 | 13 2015-05-21 23:16:49 | 12 Then, create a time series with both gap filling and interpolation: select tm, ts_first_value(value) as int_value from tseries timeseries tm as '1 minute' over ( order by ts ) ; tm | int_value ---------------------+---------------- 2015-05-21 23:08:00 | 2015-05-21 23:09:00 | 28 2015-05-21 23:10:00 | 27 2015-05-21 23:11:00 | 27 2015-05-21 23:12:00 | 20 2015-05-21 23:13:00 | 20 2015-05-21 23:14:00 | 20 2015-05-21 23:15:00 | 20 2015-05-21 23:16:00 | 20 Vertica uses two different interpolation methods:
  • CONSTANT interpolation
  • LINEAR interpolation
If you do not specify the interpolation method, the default is CONSTANT interpolation. The interpolation method is specified through the TS_FIRST_VALUE and TS_LAST_VALUE functions: TS_FIRST_VALUE ( expression [ IGNORE NULLS ] ... [, { 'CONST' | 'LINEAR' } ] ) TS_LAST_VALUE ( expression [ IGNORE NULLS ] … [, { ‘CONST’ | ‘LINEAR’ } ] ) Choosing the correct interpolation method depends on what you are analyzing. For example, analyzing stock quotes would likely mean using CONSTANT, while analyzing temperatures is better suited for LINEAR. The following example shows what happens when the example is gap filled using linear interpolation: Time series can have NULL values either on the time axis or on the value axis. Rows with NULL in the time axis are filtered out by Vertica. If you instruct Vertica to IGNORE NULLS, it will use the previous or next LAST/FIRST value in the time slice. LINEAR interpolation requires at least two values in the time slice. If one is NULL, Vertica interpolated values are NULL as well.

Event Series Joins

Event series joins are a Vertica SQL extension for OUTER JOINs. This extension allows you to join tables on time series that do not align precisely. For example, the following shows an OUTER JOIN on the two tables, where non-matching rows are padded with NULL values: Event series joins will replace “gaps” with interpolated values.

Time Series Pattern Matching

The Vertica event series pattern matching functionality allows you to screen large amounts of historical data in search of event patterns. Event patters are a series of events that occur in an order or “pattern” that you specify. Events are evaluated for each row. When Vertica finds a contiguous sequence of rows that conforms to the pattern you define, it outputs the rows that contribute to the match. Rows not part of the match do not output. For more information, see the following in the Vertica documentation: