Smart Meters
This example uses smart meter data to predict people's electricity consumption. You can download the Jupyter Notebook of the study here. We'll use the following datasets:
sm_consumption:
- dateUTC: Date and time of the record
- meterID: Smart meter ID
- value: Electricity consumed during a 30-minute interval (in kWh)
sm_weather:
- dateUTC: Date and time of the record
- temperature: Temperature
- humidity: Humidity
sm_meters:
- longitude: Longitude
- latitude: Latitude
- residenceType: 1 for Single-Family; 2 for Multi-Family; 3 for Apartment
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Create vDataFrames of the datasets:
sm_consumption = vp.read_csv("data/smart_meters/sm_consumption.csv")
sm_weather = vp.read_csv("data/smart_meters/sm_weather.csv")
sm_meters = vp.read_csv("data/smart_meters/sm_meters.csv")
display(sm_consumption)
display(sm_weather)
display(sm_meters)
Data Exploration and Preparation
Predicting household energy consumption matters because surges in electricity use can cause serious power outages. In our case, we'll use data on general household energy consumption in Ireland to predict consumption at various times.
In order to join the different data sources, we need to assume that the weather will be approximately the same across the entirety of Ireland. We'll use the date and time as the key to join 'sm_weather' and 'sm_consumption'.
Joining different datasets with interpolation
VerticaPy supports interpolated joins: when the timestamps in the two datasets do not line up exactly, Vertica finds the closest timestamp to the key and joins on it.
sm_consumption_weather = sm_consumption.join(
    sm_weather,
    how = "left",
    on_interpolate = {"dateUTC": "dateUTC"},
    expr1 = ["dateUTC", "meterID", "value"],
    expr2 = ["humidity", "temperature"])
display(sm_consumption_weather)
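To make the idea of joining on the closest timestamp concrete, here is a minimal pure-Python sketch. It is not VerticaPy's implementation: the helper name `nearest_join` and the toy rows are hypothetical, and Vertica's actual interpolation rule may differ from a strict "nearest in either direction" match.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_join(left, right, key):
    """Attach to each left row the right row whose key is closest.

    Hypothetical helper for illustration; assumes `right` is non-empty.
    """
    right_sorted = sorted(right, key=lambda r: r[key])
    right_keys = [r[key] for r in right_sorted]
    joined = []
    for row in left:
        i = bisect_left(right_keys, row[key])
        # Only the neighbours just before and just after the insertion
        # point can be the closest match.
        candidates = right_sorted[max(i - 1, 0): i + 1]
        best = min(candidates, key=lambda r: abs(r[key] - row[key]))
        # Left-hand values win on shared column names (here, the key).
        joined.append({**best, **row})
    return joined


# Made-up toy rows, not taken from the real datasets.
consumption = [
    {"dateUTC": datetime(2014, 1, 1, 11, 0), "meterID": 1, "value": 0.03},
    {"dateUTC": datetime(2014, 1, 1, 13, 45), "meterID": 1, "value": 0.28},
]
weather = [
    {"dateUTC": datetime(2014, 1, 1, 10, 50), "temperature": 38.0, "humidity": 80.0},
    {"dateUTC": datetime(2014, 1, 1, 13, 50), "temperature": 41.0, "humidity": 75.0},
]

result = nearest_join(consumption, weather, "dateUTC")
for row in result:
    print(row["dateUTC"], row["value"], row["temperature"], row["humidity"])
```

Each consumption row keeps its own timestamp and picks up the weather reading recorded closest to it, which is why no exact key match is required.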
Segmenting Latitude & Longitude using Clustering
The 'sm_meters' dataset is important; in particular, the type of residence is probably a good predictor of electricity usage. We can also group the meters into regions by applying k-means clustering to their longitude and latitude. Let's find the most suitable 'k' using an elbow curve and a scatter plot.
sm_meters.agg(["min", "max"])
%matplotlib inline
from verticapy.learn.model_selection import elbow
from verticapy.datasets import load_world
world = load_world()
df = world.to_geopandas(geometry = "geometry")
df = df[df["country"].isin(["Ireland", "United Kingdom"])]
ax = df.plot(edgecolor = "black",
             color = "white",
             figsize = (10, 9))
ax = sm_meters.scatter(["longitude", "latitude"], ax = ax)
Based on the scatter plot, five seems like the optimal number of clusters. Let's verify this hypothesis using an elbow curve.
elbow(sm_meters, ["longitude", "latitude"], n_cluster = (3, 8))
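For intuition about what an elbow curve measures, the sketch below runs a small k-means on made-up points and records the within-cluster sum of squares (WCSS) for several values of k. It is only an illustration: the helper names, the farthest-point initialization, and the toy data are assumptions, and the exact score that VerticaPy plots may be defined differently.

```python
import math


def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from every center chosen so far (a deterministic heuristic)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans_wcss(points, k, n_iter=20):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members; keep it if empty.
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else center
            for members, center in zip(clusters, centers)
        ]
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


# Toy data: three tight groups of four points each (not the real meters).
offsets = [(-0.1, -0.1), (-0.1, 0.1), (0.1, -0.1), (0.1, 0.1)]
points = [
    (cx + dx, cy + dy)
    for cx, cy in [(0, 0), (10, 0), (0, 10)]
    for dx, dy in offsets
]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 6)}
for k, score in wcss.items():
    print(k, round(score, 4))
```

On this toy data the score drops sharply until k reaches 3 and barely improves afterwards; that bend is the "elbow" one looks for on the real curve.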
The elbow curve seems to confirm that five is the optimal number of clusters, so let's create a k-means model with that in mind.
from verticapy.learn.cluster import KMeans
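# Seed the five centroids with hand-picked (longitude, latitude) pairs.
model = KMeans("kmeans_sm_regions",
               n_cluster = 5,
               init = [(-6.26980, 53.38127),
                       (-9.06178, 53.25998),
                       (-8.48641, 51.90216),
                       (-7.12408, 52.24610),
                       (-8.63985, 52.65945)])
# Drop any previously saved model with the same name before fitting.
model.drop()
model.fit(sm_meters,
          ["longitude", "latitude"])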
Let's add our clusters to the vDataFrame.
sm_meters = model.predict(sm_meters, name = "region")
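Conceptually, a k-means prediction assigns each point to its nearest centroid. The sketch below shows that step in plain Python. As an assumption for illustration, it uses the initial centers passed to the model above rather than the fitted centroids, and the meter coordinates are made up; `assign_region` is a hypothetical helper, and the region numbers produced by the fitted model may differ.

```python
import math

# Initial centers from the KMeans call above, as (longitude, latitude).
# The fitted centroids will generally have moved away from these values.
centers = [
    (-6.26980, 53.38127),
    (-9.06178, 53.25998),
    (-8.48641, 51.90216),
    (-7.12408, 52.24610),
    (-8.63985, 52.65945),
]


def assign_region(longitude, latitude, centers):
    """Return the index of the center closest to the given point,
    using plain Euclidean distance on the coordinate values."""
    return min(
        range(len(centers)),
        key=lambda j: math.dist((longitude, latitude), centers[j]),
    )


# Made-up meter locations, not rows from 'sm_meters'.
print(assign_region(-6.30, 53.35, centers))  # 0
print(assign_region(-8.50, 51.90, centers))  # 2
```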
Let's draw a scatter plot of the different regions.
ax = df.plot(edgecolor = "black",
             color = "white",
             figsize = (10, 9))
sm_meters.scatter(["longitude", "latitude"],
                  catcol = "region",
                  max_cardinality = 10,
                  ax = ax)
Dataset Enrichment
Let's join 'sm_meters' with 'sm_consumption_weather'.
sm_consumption_weather_region = sm_consumption_weather.join(
    sm_meters,
    how = "natural",
    expr1 = ["*"],
    expr2 = ["residenceType",
             "region"])
display(sm_consumption_weather_region)
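A natural join matches rows on every column name the two inputs share, so no explicit key is given. The following sketch illustrates that behavior in plain Python. The helper `natural_join` and the toy rows are hypothetical, and the sketch assumes that 'sm_meters' shares a 'meterID' column with the consumption data.

```python
def natural_join(left, right):
    """Inner-join two lists of dicts on all column names they share."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    # Index the right-hand rows by their values in the shared columns.
    index = {}
    for r in right:
        index.setdefault(tuple(r[c] for c in shared), []).append(r)
    return [
        {**l, **r}
        for l in left
        for r in index.get(tuple(l[c] for c in shared), [])
    ]


# Made-up toy rows; meter 3 has no entry on the right-hand side.
readings = [
    {"meterID": 1, "value": 0.03, "temperature": 38.0},
    {"meterID": 2, "value": 0.28, "temperature": 41.0},
    {"meterID": 3, "value": 0.11, "temperature": 40.0},
]
meters = [
    {"meterID": 1, "residenceType": 1, "region": 0},
    {"meterID": 2, "residenceType": 3, "region": 2},
]

joined = natural_join(readings, meters)
for row in joined:
    print(row)
```

Rows without a counterpart on the other side (meter 3 here) drop out, which is worth keeping in mind when checking row counts after the join.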