Smart Meters¶
This example uses the following datasets to predict peoples' electricity consumption. You can download the Jupyter Notebook of the study here. We'll use the following datasets:
- dateUTC: Date and time of the record
- meterID: Smart meter ID
- value: Electricity consumed during 30 minute interval (in kWh)
- dateUTC: Date and time of the record
- temperature: Temperature
- humidity: Humidity
- longitude: Longitude
- latitude: Latitude
- residenceType: 1 for Single-Family; 2 for Multi-Family; 3 for Appartement
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Create vDataFrames of the datasets:
sm_consumption = vp.read_csv("data/smart_meters/sm_consumption.csv")
sm_weather = vp.read_csv("data/smart_meters/sm_weather.csv")
sm_meters = vp.read_csv("data/smart_meters/sm_meters.csv")
display(sm_consumption)
display(sm_weather)
display(sm_meters)
Data Exploration and Preparation¶
Predicting energy consumption in households is very important. Surges in electricity use could cause serious power outages. In our case, we'll be using data on general household energy consumption in Ireland to predict consumption at various times.
In order to join the different data sources, we need to assume that the weather will be approximately the same across the entirety of Ireland. We'll use the date and time as the key to join 'sm_weather' and 'sm_consumption'.
Joining different datasets with interpolation
In VerticaPy, you can interpolate joins; Vertica will find the closest timestamp to the key and join the result.
sm_consumption_weather = sm_consumption.join(
sm_weather,
how = "left",
on_interpolate = {"dateUTC": "dateUTC"},
expr1 = ["dateUTC", "meterID", "value"],
expr2 = ["humidity", "temperature"])
display(sm_consumption_weather)
Segmenting Latitude & Longitude using Clustering
The dataset 'sm_meters' is pretty important. In particular, the type of residence is probably a good predictor for electricity usage. We can create clusters of the different regions with k-means clustering based on longitude and latitude. Let's find the most suitable 'k' using an elbow curve and scatter plot.
sm_meters.agg(["min", "max"])
%matplotlib inline
from verticapy.learn.model_selection import elbow
from verticapy.datasets import load_world
world = load_world()
df = world.to_geopandas(geometry = "geometry")
df = df[df["country"].isin(["Ireland", "United Kingdom"])]
ax = df.plot(edgecolor = "black",
color = "white",
figsize = (10, 9))
ax = sm_meters.scatter(["longitude", "latitude"], ax = ax)
Based on the scatter plot, five seems like the optimal number of clusters. Let's verify this hypothesis using an elbow curve.
elbow(sm_meters, ["longitude", "latitude"], n_cluster = (3, 8))
The elbow curve seems to confirm that five is the optimal number of clusters, so let's create a k-means model with that in mind.
from verticapy.learn.cluster import KMeans
model = KMeans("kmeans_sm_regions",
n_cluster = 5,
init = [(-6.26980, 53.38127),
(-9.06178, 53.25998),
(-8.48641, 51.90216),
(-7.12408, 52.24610),
(-8.63985, 52.65945),])
model.drop()
model.fit(sm_meters,
["longitude", "latitude"])
Let's add our clusters to the vDataFrame.
sm_meters = model.predict(sm_meters, name = "region")
Let's draw a scatter plot of the different regions.
ax = df.plot(edgecolor = "black",
color = "white",
figsize = (10, 9))
sm_meters.scatter(["longitude", "latitude"],
catcol = "region",
max_cardinality = 10,
ax = ax)
Dataset Enrichment
Let's join 'sm_meters' with 'sm_consumption_weather'.
sm_consumption_weather_region = sm_consumption_weather.join(
sm_meters,
how = "natural",
expr1 = ["*"],
expr2 = ["residenceType",
"region"])
display(sm_consumption_weather_region)
Handling Missing Values
Let's take care of our missing values.
sm_consumption_weather_region.count_percent()
The variable 'value' has a few missing values that we can drop.
sm_consumption_weather_region["value"].dropna()
sm_consumption_weather_region.count()
Interpolation & Aggregations
Since power outages seem relatively common in each area, and the "value" represents the electricity consumed during 30 minute intervals (in kWh), it'd be a good idea to interpolate and aggregate the data to get a monthly average in electricity consumption per region.
Let's save our new dataset in the Vertica database.
vp.drop("sm_consumption_weather_region", method = "table")
sm_consumption_weather_region.to_db("sm_consumption_weather_region",
relation_type = "table")
sm_consumption_weather_region_clean = vp.vDataFrame("sm_consumption_weather_region")
To get an equally-sliced dataset, we can then interpolate to fill any gaps. This operation is essential for creating correct time series models.
sm_consumption_weather_region_clean = sm_consumption_weather_region_clean.interpolate(
ts = "dateUTC",
rule = "30 minutes",
method = {"value": "linear",
"humidity": "linear",
"temperature": "linear",
"residenceType": "ffill",
"region": "ffill"},
by = ["meterID"])
display(sm_consumption_weather_region_clean)
Let's aggregate the data to figure out the monthly energy consumption for each smart meter. We can then save the result in the Vertica database.
import verticapy.stats as st
sm_consumption_month = sm_consumption_weather_region_clean.groupby(
columns = ["meterID",
"region",
"residenceType",
"MONTH(dateUTC) AS month",
"DATE_TRUNC('MONTH', dateUTC::date) AS date_month"],
expr = [st.sum(sm_consumption_weather_region["value"])._as("value"),
st.avg(sm_consumption_weather_region["temperature"])._as("avg_temperature"),
st.avg(sm_consumption_weather_region["humidity"])._as("avg_humidity")]).filter(
"date_month < '2015-09-01'")
vp.drop("sm_consumption_month", method = "table")
sm_consumption_month.to_db("sm_consumption_month",
relation_type = "table",
inplace = True)
Understanding the Data & Detecting Outliers
Looking at three different smart meters, we can see a clear decrease in energy consumption during the summer followed by a sharp increase in the winter.
for i in [10, 12, 14]:
sm_consumption_month[sm_consumption_month["meterID"] == i]["value"].plot(ts = "date_month")
This behavior seems to be seasonal, but we don't have enough data to prove this.
Let's find outliers in the distribution by computing the ZSCORE per meterID.
std = st.std(sm_consumption_month["value"])._over(by = [sm_consumption_month["meterID"]])
avg = st.avg(sm_consumption_month["value"])._over(by = [sm_consumption_month["meterID"]])
sm_consumption_month["value_zscore"] = (sm_consumption_month["value"] - avg) / std
sm_consumption_month.search("value_zscore > 4")
Four smart meters are outliers in energy consumption. We'll need to investigate to get more information.
for i in [364, 399, 809, 951]:
sm_consumption_month[sm_consumption_month["meterID"] == i]["value"].plot(ts = "date_month")
Data Encoding & Bivariate Analysis
Since most of our data is categorical, let's encode them with One-hot encoding. We can then examine the correlations between the various categories.
sm_consumption_month = sm_consumption_month.one_hot_encode(["region", "residenceType", "month"],
drop_first = False,
max_cardinality = 20)
display(sm_consumption_month)
Let's compute the Pearson correlation matrix.
sm_consumption_month.corr()
There's a clear correlation between the month and energy consumption, but this isn't causal. Instead, we can think of the weather as having the direct influence on energy consumption. To accomodate for this view, we'll use the temperature as a predictor (rather than the month).
sm_consumption_month.corr(focus = "value")
Global Behavior
Let's look at this globally.
sm_consumption_final = sm_consumption_month.groupby(
["date_month"],
[st.avg(sm_consumption_month["avg_temperature"])._as("avg_temperature"),
st.avg(sm_consumption_month["avg_humidity"])._as("avg_humidity"),
st.avg(sm_consumption_month["value"])._as("avg_value")])
sm_consumption_final.plot(ts = "date_month", columns = ["avg_value"])
We expect to see a fall in energy consumption during summer and then an increase during the winter. A simple prediction could use the average value a year before.
sm_consumption_final["prediction"] = st.case_when(sm_consumption_final["date_month"] < '2015-01-01',
sm_consumption_final["avg_value"],
st.lag(sm_consumption_final["avg_value"], 12)._over(
order_by = ["date_month"]))
sm_consumption_final.plot(ts = "date_month", columns = ["prediction", "avg_value"])
sm_consumption_final.score("avg_value", "prediction", "r2")
As expected, our model's score is excellent.
Let's use machine learning to understand the influence of the weather and the humidity on energy consumption.
Machine Learning¶
Let's create our model.
from verticapy.learn.linear_model import LinearRegression
predictors = ["avg_temperature",
"avg_humidity"]
model = LinearRegression("sm_lr", solver = "BFGS")
model.fit(sm_consumption_final,
predictors,
"avg_value")
model.report("details")
The model seems to be good with an adjusted R2 of 77.5%, and the F-Statistic indicates that at least one of the two predictors is useful. Let's look at the residual plot.
sm_consumption_final = model.predict(sm_consumption_final,
name = "value_prediction")
sm_consumption_final["residual"] = sm_consumption_final["avg_value"] - sm_consumption_final["value_prediction"]
sm_consumption_final.scatter(["avg_value", "residual"])
Looking at the residual plot, we can see that the error variance varies by quite a bit. A possible suspect might be heteroscedasticity. Let's verify our hypothesis using a Breusch-Pagan test.
from verticapy.stats import het_breuschpagan
het_breuschpagan(sm_consumption_final, "residual", predictors)
The p-value is 4.81% and sits around the 5% threshold, so we can't really draw any conclusions.
Let's look at the entire regression report.
model.report()
Our model is very good; its median absolute error is around 13kWh.
With this model, we can make predictions about the energy consumption of households per region. If the usage exceeds what the model predicts, we can raise an alert and respond, for example, by regulating the electricity distributed to the region.
Conclusion¶
We've solved our problem in a Pandas-like way, all without ever loading data into memory!
VerticaPy
About the Author
Badr Ouali
Head of Data Science
Badr Ouali works as a Lead Data Scientist for Vertica worldwide. He can embrace data projects end to end through a clear understanding of the “big picture” as well as attention to details, resulting in achieving great business outcomes – a distinctive differentiator in his role. Badr enjoys sharing knowledge and insights related to data analytics with colleagues & peers and has a sweet spot for Python. He loves helping customers finding the best value from their data and empower them to solve their use-cases.
