Null Values in Time Series Data

Null values are not common inputs for gap-filling and interpolation (GFI) computation, but if null values do exist, you can use time series aggregate functions (TS_FIRST_VALUE/TS_LAST_VALUE) with the IGNORE NULLS arguments to affect output of the interpolated values. The TSA functions are treated similarly to their analytic counterparts (FIRST_VALUE/LAST_VALUE) in that if the timestamp itself is null Vertica filter out those rows before gap filling and interpolation occurs.

The three images below will illustrate the points that follow on how Vertica handles time series data that contains null values.

Figure 1. Interpolated bid values when the input has no NULLs

Figure 2. CONST-interpolated bid values when the input has NULL values

Figure 3. LINEAR-interpolated bid values when the input has NULL values

Constant Interpolation with Null Values

Figure 1 illustrates a default (constant) interpolation result on four input rows where none of the inputs contains a NULL value. Figure 2 shows the same input rows with the addition of another input record whose bid value is NULL, and whose timestamp (ts) value is 3:00:03.

For constant interpolation, the bid value starting at 3:00:03 is null until the next non-null bid value appears in time. In Figure 2, the presence of the null row makes the interpolated bid value null in the time interval denoted by the shaded region. As a result, if TS_FIRST_VALUE(bid) is evaluated with constant interpolation on the time slice that begins at 3:00:02, its output is non-null. However, TS_FIRST_VALUE(bid) on the next time slice produces null. If the last value of the 3:00:02 time slice is null, the first value for the next time slice (3:00:04) is null. However, if you were to use a TSA function with IGNORE NULLS, then the value at 3:00:04 would be the same value as it was at 3:00:02.

To illustrate, insert a new row into the TickStore table at 03:00:03 with a null bid value, Vertica will output a row for the 03:00:02 record with a null value but no row for the 03:00:03 input:

=> INSERT INTO tickstore VALUES('2009-01-01 03:00:03', 'XYZ', NULL);
=> SELECT slice_time, symbol, TS_LAST_VALUE(bid) AS last_bid FROM TickStore
-> TIMESERIES slice_time AS '2 seconds' OVER (PARTITION BY symbol ORDER BY ts);
     slice_time      | symbol | last_bid
---------------------+--------+----------
 2009-01-01 03:00:00 | XYZ    |       10
 2009-01-01 03:00:02 | XYZ    |
 2009-01-01 03:00:04 | XYZ    |     10.5
(3 rows)

If you specify IGNORE NULLS, Vertica fills in the missing data point using a constant interpolation scheme. Here, the bid price at 03:00:02 is interpolated to the last known input record for bid, which was $10 at 03:00:00:

=> SELECT slice_time, symbol, TS_LAST_VALUE(bid IGNORE NULLS) AS last_bid FROM TickStore
-> TIMESERIES slice_time AS '2 seconds' OVER (PARTITION BY symbol ORDER BY ts);
     slice_time      | symbol | last_bid
---------------------+--------+----------
 2009-01-01 03:00:00 | XYZ    |       10
 2009-01-01 03:00:02 | XYZ    |       10
 2009-01-01 03:00:04 | XYZ    |     10.5
(3 rows)

Now if you were to insert a row where the timestamp column contained a null value, Vertica would filter out that row before gap filling and interpolation occurred.

=> INSERT INTO tickstore VALUES(NULL, 'XYZ', 11.2);
=> SELECT slice_time, symbol, TS_LAST_VALUE(bid) AS last_bid FROM TickStore
-> TIMESERIES slice_time AS '2 seconds' OVER (PARTITION BY symbol ORDER BY ts);

Notice there is no output for the 11.2 bid row:

     slice_time      | symbol | last_bid
---------------------+--------+----------
 2009-01-01 03:00:00 | XYZ    |       10
 2009-01-01 03:00:02 | XYZ    |
 2009-01-01 03:00:04 | XYZ    |     10.5
(3 rows)

Linear Interpolation with Null Values

For linear interpolation, the interpolated bid value becomes null in the time interval, which is represented by the shaded region in Figure 3. In the presence of an input null value at 3:00:03, Vertica cannot linearly interpolate the bid value around that time point.

Vertica takes the closest non null value on either side of the time slice and uses that value. For example, if you use a linear interpolation scheme and you do not specify IGNORE NULLS, and your data has one real value and one null, the result is null. If the value on either side is null, the result is null. Therefore, to evaluate TS_FIRST_VALUE(bid) with linear interpolation on the time slice that begins at 3:00:02, its output is null. TS_FIRST_VALUE(bid) on the next time slice remains null.

=> SELECT slice_time, symbol, TS_FIRST_VALUE(bid, 'linear') AS fv_l FROM TickStore
-> TIMESERIES slice_time AS '2 seconds' OVER (PARTITION BY symbol ORDER BY ts);
     slice_time      | symbol | fv_l
---------------------+--------+------
 2009-01-01 03:00:00 | XYZ    |   10
 2009-01-01 03:00:02 | XYZ    |
 2009-01-01 03:00:04 | XYZ    |
(3 rows)