Time series analytics is a powerful Vertica tool that evaluates the values of a given set of variables over time and groups those values into a window based on a time interval for analysis and aggregation. Time series analytics is useful when you want to analyze discrete data collected over time, such as stock market trades and performance, but find that your input records often occur at non-uniform intervals, which creates gaps. To solve this problem Vertica has:
• The TIMESERIES clause, which normalizes data into time slices.
• Time series aggregate functions, which determine whether to return the first or last value in a given time slice.
• Gap-filling, which has interpolation schemes that construct new data points within the range of a discrete set of known data points.
This post refers to the TickStore table, which contains stock bid prices over time for company XYZ:
Time | Symbol | Bid |
---|---|---|
2014-09-30 10:00:00 | XYZ | 0 |
2014-09-30 10:01:00 | XYZ | 32 |
2014-09-30 10:04:00 | XYZ | 38 |
2014-09-30 10:06:00 | XYZ | 36 |
2014-09-30 10:10:00 | XYZ | 33 |
There are time lags between several bids. You can use time series analytics to estimate the bid price at the times that fall between these known times, such as 10:02:00.
The TIMESERIES Clause
Provide the TIMESERIES clause an INTERVAL slice time that determines at which intervals output records are produced. Slice_time is a time column produced by the clause that stores the time slices generated from gap filling:TIMESERIES slice_time AS '3 minutes'…
The result consists of four, 3-minute time slices:
2014-09-30 10:00:00 through 2014-09-30 10:02:59
2014-09-30 10:03:00 through 2014-09-30 10:05:59
2014-09-30 10:06:00 through 2014-09-30 10:08:59
2014-09-30 10:09:00 through 2014-09-30 10:11:59
Time Series Aggregate Functions
The two time series aggregate functions determine which value within your defined time slice to output:• TS_FIRST_VALUE: Returns the value at the beginning of a time slice.
• TS_LAST VALUE: Returns the value at the end of a time slice.
Each aggregate function takes a column and an interpolation scheme, as described in the next section.
Interpolation Schemes
Interpolation schemes determine how Vertica calculates or “fills in” the missing data points. Vertica provides two interpolation schemes:• Constant Interpolation: This scheme is the default method for gap-filling in Vertica. It is specified with the CONST keyword in both time series aggregate functions. The constant interpolation scheme fills in missing data based on the last known value at any given time. This means that the value at a constructed data point matches the value at the last known data point.
• Linear Interpolation: This scheme is specified by the LINEAR keyword in both time series aggregate functions. It interpolates values in a linear slope based on the specified time slice. Using linear interpolation, a constructed data point value falls between the values of the last known data point and the next known data point following the constructed point.
Using Schemes
Given the two time series aggregate functions, and the two methods of interpolations, you have four options to construct missing data points:1.TS_FIRST_VALUE with constant scheme, which fills in missing data points according to the last value seen so far, and returns the value at the beginning of the time slice:
TS_FIRST_VALUE (value, 'CONST')
2.TS_FIRST_VALUE with linear scheme, which fills in missing data points according to a linear slope, and returns the value at the beginning of the time slice:
TS_FIRST_VALUE (value, 'LINEAR')
3.TS_LAST_VALUE with constant scheme, which fills in missing data points according to the last value seen so far, and returns the value at the end of the time slice:
TS_LAST_VALUE (value, 'CONST')
4.TS_LAST_VALUE with linear scheme, which fills in missing data points according to a linear slope, and returns the value at the end of the time slice:
TS_LAST_VALUE (value, 'LINEAR')
Examples
Using the TickStore table presented earlier, let’s apply time series analytics to find some missing data points.This example shows how to use the TS_FIRST_VALUE function with the constant interpolation scheme. The statement uses the TIMESERIES clause to divide the time range into 1 minute slices. The result is one row for each minute within our defined data set. Vertica displays the first value in each time slice and when data is not available, Vertica interpolates the bid prices, shown in bold, based on the last known value.
=> SELECT slice_time, symbol, TS_FIRST_VALUE(bid, 'CONST') AS first_bid FROM TickStore TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY symbol ORDER BY time);
Output: slice_time | symbol | first_bid |
---|---|---|
10:00:00 | XYZ | 0 |
10:01:00 | XYZ | 32 |
10:02:00 | XYZ | 32 |
10:03:00 | XYZ | 32 |
10:04:00 | XYZ | 38 |
10:05:00 | XYZ | 38 |
10:06:00 | XYZ | 36 |
10:07:00 | XYZ | 36 |
10:08:00 | XYZ | 36 |
10:09:00 | XYZ | 36 |
10:10:00 | XYZ | 33 |
This example shows how to use the TS_FIRST_VALUE function with the linear interpolation scheme. The statement uses the TIMESERIES clause to divide the time range into 1 minute slices. The result is one row for each minute within our defined data set. Vertica displays the first value in each time slice and when data is not available, Vertica interpolates the bid prices, shown in bold, based on the linear slope.
=> SELECT slice_time, symbol, TS_FIRST_VALUE(bid, 'LINEAR') AS first_bid FROM TickStore TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY symbol ORDER BY time);
Output: slice_time | symbol | first_bid |
---|---|---|
10:00:00 | XYZ | 0 |
10:01:00 | XYZ | 32 |
10:02:00 | XYZ | 34 |
10:03:00 | XYZ | 36 |
10:04:00 | XYZ | 38 |
10:05:00 | XYZ | 37 |
10:06:00 | XYZ | 36 |
10:07:00 | XYZ | 35.25 |
10:08:00 | XYZ | 34.5 |
10:09:00 | XYZ | 33.75 |
10:10:00 | XYZ | 33 |
This example shows how to use the TS_LAST_VALUE function with the constant interpolation scheme. The statement uses the TIMESERIES clause to divide the time range into 1 minute slices. The result is one row for each minute within our defined data set. Vertica displays the last value in each time slice and when data is not available, Vertica interpolates the bid prices, shown in bold, based on the last known value.
=> SELECT slice_time, symbol, TS_LAST_VALUE(bid, 'CONST') AS last_bid FROM TickStore TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY symbol ORDER BY time);
Output: slice_time | symbol | last_bid |
---|---|---|
10:00:00 | XYZ | 0 |
10:01:00 | XYZ | 32 |
10:02:00 | XYZ | 32 |
10:03:00 | XYZ | 32 |
10:04:00 | XYZ | 38 |
10:05:00 | XYZ | 38 |
10:06:00 | XYZ | 36 |
10:07:00 | XYZ | 36 |
10:08:00 | XYZ | 36 |
10:09:00 | XYZ | 36 |
10:10:00 | XYZ | 33 |
This example shows how to use the TS_LAST_VALUE function with the linear interpolation scheme. The statement uses the TIMESERIES clause to divide the time range into 1 minute slices. The result is one row for each minute within our defined data set. Vertica displays the last value in each time slice and when data is not available, Vertica interpolates the bid prices, shown in bold, based on the linear slope.
=> SELECT slice_time, symbol, TS_LAST_VALUE(bid, 'LINEAR') AS last_bid FROM TickStore TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY symbol ORDER BY time);
Output: slice_time | symbol | last_bid |
---|---|---|
10:00:00 | XYZ | 32 |
10:01:00 | XYZ | 34 |
10:02:00 | XYZ | 36 |
10:03:00 | XYZ | 38 |
10:04:00 | XYZ | 37 |
10:05:00 | XYZ | 36 |
10:06:00 | XYZ | 35.25 |
10:07:00 | XYZ | 34.5 |
10:08:00 | XYZ | 33.75 |
10:09:00 | XYZ | 33 |
10:10:00 | XYZ | N/A-no value at the end of this time slice (10:10:59) |