Using Vertica on IoT Data: Gap Filling and Interpolation for Incomplete Sensor Data

Posted March 6, 2017 by Soniya Shah, Information Developer

This post was originally authored by Marco Gessner and appeared on LinkedIn. It has been reposted here with his permission.

This article explains the basic gap filling and interpolation functionality in Vertica. Vertica was designed for the fast processing and analysis of huge volumes of data and is well suited to IoT applications. One of the challenges in processing IoT data is the fact that the data is almost always incomplete. Sensor data consists of sensor measurements combined with the timestamps of the measure recordings. For example:

sensor_id | measure_ts | dgr_celsius
101203202 | 2017-01-23 13:07:05 | 70
101203202 | 2017-01-23 13:07:10 | 71
101203202 | 2017-01-23 13:10:03 | 35

As you can see from this example, the raw data almost always has gaps. A sensor in the engine bay of a car can get too hot to deliver its measures for a period of time. A satellite might have a communication issue, so that it can’t send its sensor data to its receiver for a while. A smart meter, which sends its consumption data to a central point using the electricity grid as the communication line, might be cut off because a high tension line is disconnected for a period of time to keep the grid in balance.

Probabilistic analysis, however, relies on regular data without gaps. Timestamps must be spaced evenly, with always the same interval between them. In most cases, IoT data must be linearly interpolated from the raw data prior to the computation of probabilities.

The first time I had to satisfy a request for this type of data transformation was in the early 1990s. The actuaries of an insurance company needed their claims data to be evenly spaced in time and linearly interpolated within a working day, so that they could calculate how an insurance plan was being used and predict its usage in the future. Since claims do not appear at regular intervals in time, I had to transform the data. I spent two hours every day from a Monday to a Friday to come up with the solution. Back then (without Vertica) I must have written two pages of SQL to make that happen.

But Vertica has the TIMESERIES clause, which would have made that exercise completely unnecessary!

To demonstrate how it works, let’s start from this irregular time series that could have come from an electricity smart meter. The input data is in the initial WITH clause. It consists of three rows with the smartmeter ID, the timestamp of the measurement, and the running sum of consumption in Watt hours.

WITH energy_readings (meter_id,reading_ts,watt_hrs) AS (
SELECT 100012700004100,'2015-01-01 03:00:00'::TIMESTAMP(0),10000
UNION ALL SELECT 100012700004100,'2015-01-01 03:03:00'::TIMESTAMP(0),10030
UNION ALL SELECT 100012700004100,'2015-01-01 03:20:00'::TIMESTAMP(0),10370
)
SELECT
meter_id
, reading_ts
, watt_hrs
FROM energy_readings;

The result is:

meter_id |reading_ts |watt_hrs
100,012,700,004,100|2015-01-01 03:00:00| 10,000
100,012,700,004,100|2015-01-01 03:03:00| 10,030
100,012,700,004,100|2015-01-01 03:20:00| 10,370

So we have one measurement on the hour, another measurement 3 minutes past the hour, and a third measurement 20 minutes past the hour. In the first 3 minutes, 30 Watt hours have been consumed—that’s 10 Watt hours per minute. In the following 17 minutes, 340 Watt hours were consumed—that’s 20 Watt hours per minute.

To prepare the data for predictive analysis—say to predict electricity consumption using linear regression—we need the data evenly spaced and with no gaps. For this example, we use an interval of 5 minutes and a linear interpolation of the Watt hour running sum.

Using the same WITH clause for the input data as above, we add the timeseries name and a Vertica function dependent on the TIMESERIES clause to the SELECT column list. After the FROM clause, we add the TIMESERIES clause itself, where we specify the timeseries name, the interval spacing (AS …), when to reset the timeseries (PARTITON BY …), and which timestamp to start from (ORDER BY …).

That’s all you need! And it’s a long way from the roughly two pages mentioned above. Take a look at the following:

WITH energy_readings (meter_id,reading_ts,watt_hrs) AS (
SELECT 100012700004100,'2015-01-01 03:00:00'::TIMESTAMP(0),10000
UNION ALL SELECT 100012700004100,'2015-01-01 03:03:00'::TIMESTAMP(0),10030
UNION ALL SELECT 100012700004100,'2015-01-01 03:20:00'::TIMESTAMP(0),10370
)
SELECT
meter_id
,interval_5min::TIMESTAMP(0)
,TS_FIRST_VALUE(watt_hrs,'linear') AS watt_hrs_5min
FROM energy_readings
TIMESERIES interval_5min AS '5 minutes' OVER(
PARTITION BY meter_id ORDER BY reading_ts
)
;

Here’s the result:

meter_id |interval_5min |watt_hrs_5min
100,012,700,004,100|2015-01-01 03:00:00| 10,000
100,012,700,004,100|2015-01-01 03:05:00| 10,070
100,012,700,004,100|2015-01-01 03:10:00| 10,170
100,012,700,004,100|2015-01-01 03:15:00| 10,270
100,012,700,004,100|2015-01-01 03:20:00| 10,370

Note that the entry 3 minutes after the hour has disappeared and that we have 70 Watt hours between the measure on the hour and the measure 5 minutes after the hour. There are 10 Watt hours for 3 minutes, followed by 20 Watt hours for 2 minutes, adding up to 7. And now we have a regular difference of 100 Watt hours every 5 minutes.

How would you do that without the TIMESERIES clause? Well, with databases that don’t support OLAP functions (also called analytic functions or window-based functions), such as MySQL or Microsoft Access or Hive before Version 0.11, you can just forget it and program it in C or Java or something else that is procedural. When OLAP functions are supported, it’s possible with several nested queries, which are best coded as Common Table Expressions in a WITH clause with several CTEs.

NOTE: The code samples in this post are complete examples, including data. You can try them out by simply cutting and pasting the code into vsql, DbVisualizer, or your favorite SQL front-end tool. All you need is a connection to Vertica. If you do not have Vertica, you can download the community edition free of charge from my.vertica.com.