Null Values in Time Series Data
Null values are uncommon inputs for gap-filling and interpolation (GFI) computation. When null values exist, you can use time series aggregate (TSA) functions TS_FIRST_VALUE
and TS_LAST_VALUE
with IGNORE NULLS
to affect output of the interpolated values. TSA functions are treated like their analytic counterparts FIRST_VALUE
and LAST_VALUE
: if the timestamp itself is null, Vertica filters out those rows before gap filling and interpolation occurs.
Constant Interpolation with Null Values
Figure 1 illustrates a default (constant) interpolation result on four input rows where none of the inputs contains a NULL value:
Figure 2 shows the same input rows with the addition of another input record whose bid value is NULL, and whose timestamp (ts) value is 3:00:03:
For constant interpolation, the bid value starting at 3:00:03 is null until the next non-null bid value appears in time. In Figure 2, the presence of the null row makes the interpolated bid value null in the time interval denoted by the shaded region. If TS_FIRST_VALUE(bid)
is evaluated with constant interpolation on the time slice that begins at 3:00:02, its output is non-null. However, TS_FIRST_VALUE(bid)
on the next time slice produces null. If the last value of the 3:00:02 time slice is null, the first value for the next time slice (3:00:04) is null. However, if you use a TSA function with IGNORE NULLS
, then the value at 3:00:04 is the same value as it was at 3:00:02.
To illustrate, insert a new row into the TickStore table at 03:00:03 with a null bid value, Vertica outputs a row for the 03:00:02 record with a null value but no row for the 03:00:03 input:
=> INSERT INTO tickstore VALUES('2009-01-01 03:00:03', 'XYZ', NULL); => SELECT slice_time, symbol, TS_LAST_VALUE(bid) AS last_bid FROM TickStore -> TIMESERIES slice_time AS '2 seconds' OVER (PARTITION BY symbol ORDER BY ts); slice_time | symbol | last_bid ---------------------+--------+---------- 2009-01-01 03:00:00 | XYZ | 10 2009-01-01 03:00:02 | XYZ | 2009-01-01 03:00:04 | XYZ | 10.5 (3 rows)
If you specify IGNORE NULLS, Vertica fills in the missing data point using a constant interpolation scheme. Here, the bid price at 03:00:02 is interpolated to the last known input record for bid, which was $10 at 03:00:00:
=> SELECT slice_time, symbol, TS_LAST_VALUE(bid IGNORE NULLS
) AS last_bid FROM TickStore
TIMESERIES slice_time AS '2 seconds' OVER (PARTITION BY symbol ORDER BY ts);
slice_time | symbol | last_bid
---------------------+--------+----------
2009-01-01 03:00:00 | XYZ | 10
2009-01-01 03:00:02 | XYZ | 10
2009-01-01 03:00:04 | XYZ | 10.5
(3 rows)
Now, if you insert a row where the timestamp column contains a null value, Vertica filters out that row before gap filling and interpolation occurred.
=> INSERT INTO tickstore VALUES(NULL, 'XYZ', 11.2); => SELECT slice_time, symbol, TS_LAST_VALUE(bid) AS last_bid FROM TickStore TIMESERIES slice_time AS '2 seconds' OVER (PARTITION BY symbol ORDER BY ts);
Notice there is no output for the 11.2 bid row:
slice_time | symbol | last_bid ---------------------+--------+---------- 2009-01-01 03:00:00 | XYZ | 10 2009-01-01 03:00:02 | XYZ | 2009-01-01 03:00:04 | XYZ | 10.5 (3 rows)
Linear Interpolation with Null Values
For linear interpolation, the interpolated bid value becomes null in the time interval, represented by the shaded region in Figure 3:
In the presence of an input null value at 3:00:03, Vertica cannot linearly interpolate the bid value around that time point.
Vertica takes the closest non null value on either side of the time slice and uses that value. For example, if you use a linear interpolation scheme and do not specify IGNORE NULLS
, and your data has one real value and one null, the result is null. If the value on either side is null, the result is null. Therefore, to evaluate TS_FIRST_VALUE(bid)
with linear interpolation on the time slice that begins at 3:00:02, its output is null. TS_FIRST_VALUE(bid)
on the next time slice remains null.
=> SELECT slice_time, symbol, TS_FIRST_VALUE(bid, 'linear') AS fv_l FROM TickStore TIMESERIES slice_time AS '2 seconds' OVER (PARTITION BY symbol ORDER BY ts); slice_time | symbol | fv_l ---------------------+--------+------ 2009-01-01 03:00:00 | XYZ | 10 2009-01-01 03:00:02 | XYZ | 2009-01-01 03:00:04 | XYZ | (3 rows)