VerticaPy

Python API for Vertica Data Science at Scale

Regression

Regression algorithms are machine learning algorithms that predict a numerical response column. Examples include predicting employees' salaries from their age, or predicting the number of cyberattacks a website might face. The most popular regression algorithm is linear regression.

You must always verify that all the assumptions of a given algorithm are met before using it. For example, to create a good linear regression model, we need to verify the Gauss-Markov assumptions:

  • Linearity : the model must be linear in the parameters we are estimating with the OLS method.
  • Non-Collinearity : the regressors aren't perfectly correlated with each other.
  • Exogeneity : the regressors aren't correlated with the error term.
  • Homoscedasticity : no matter what the values of our regressors might be, the variance of the error is constant.
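
Two of these assumptions can be checked roughly with a few lines of plain Python. The sketch below is only an illustration on invented toy data: the helper names `pearson` and `variance_ratio` are hypothetical and are not part of VerticaPy, and the variance ratio is a crude diagnostic, not a formal statistical test.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def variance_ratio(x, residuals):
    """Residual variance in the upper half of x divided by the lower half.

    A ratio far from 1 hints that the error variance is not constant.
    """
    pairs = sorted(zip(x, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]

    def sample_variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    return sample_variance(high) / sample_variance(low)

# Invented toy regressors: x2 is an exact linear function of x1 (perfect collinearity),
# while x3 is correlated with x1 but not perfectly.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0 * v + 1.0 for v in x1]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

collinear = pearson(x1, x2)  # essentially 1: x1 and x2 should not both be regressors
usable = pearson(x1, x3)     # strictly below 1: not perfectly collinear

# Invented residuals whose spread grows with x1 (heteroscedastic pattern).
residuals = [0.1, -0.1, 0.1, -2.0, 2.0, -2.0]
ratio = variance_ratio(x1, residuals)  # much larger than 1

print(collinear, usable, ratio)
```

A correlation of exactly 1 or -1 between two regressors violates non-collinearity, and a variance ratio far from 1 suggests the homoscedasticity assumption deserves a closer look.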

Most regression models are sensitive to unnormalized data, so it's important to normalize and decompose your data before using them (though some models, like random forest, can handle unnormalized and correlated data). If the assumptions aren't met, we might get unexpected results (for example, a negative R2).
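
As a minimal sketch of what normalization means here, the following plain-Python snippet applies z-score scaling to a column. The `zscore` helper and the sample tenure values are hypothetical and only illustrate the idea; in practice you would use whatever normalization your data tooling provides.

```python
import statistics

def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot normalize a constant column")
    return [(v - mean) / std for v in values]

# Invented sample values standing in for a numerical column such as tenure.
tenure = [1, 12, 24, 48, 72]
scaled = zscore(tenure)

print(scaled)
```

After scaling, the column has a mean of 0 and a standard deviation of 1, so columns measured in very different units become comparable.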

Let's predict the total charges of the Telco customers using their tenure. We'll start by importing a linear regression model.

In [38]:
from verticapy.learn.linear_model import LinearRegression

Let's create a model object. Since Vertica has its own model management system, we just need to choose a model name.

In [39]:
model = LinearRegression("LR_churn")

We can then fit the model with our data.

In [41]:
model.fit("churn", ["tenure"], "TotalCharges")
model.plot()
Out[41]:
<AxesSubplot:xlabel='"tenure"', ylabel='"TotalCharges"'>
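
To make the fitted line less of a black box, here is a textbook least-squares fit for a single regressor in plain Python. This is a sketch of the general technique on invented data, not a description of how Vertica trains the model; the function name `fit_simple_ols` is hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors of y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Invented data lying exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_simple_ols(x, y)
predictions = [intercept + slope * v for v in x]

print(intercept, slope)
```

With one regressor, the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means.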

We have many metrics to evaluate the model.

In [42]:
model.regression_report()
Out[42]:
                            value
explained_variance          0.682078535751238
max_error                   3997.15535368371
median_absolute_error       480.052176277521
mean_absolute_error         879.836040491484
mean_squared_error          1633328.42507227
root_mean_squared_error     1278.0173805830145
r2                          0.682078535751241
r2_adj                      0.6820333828661042
Rows: 1-8 | Columns: 2

Our example skips splitting the data into training and testing sets, a step that is important for real-world work, because our main goal in this lesson is to look at the metrics used to evaluate regressions. The best-known metric is R2: generally speaking, the closer R2 is to 1, the better the model fits the data.
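
The sketch below computes several of these metrics from their standard textbook definitions, using invented predictions. The `regression_metrics` helper is hypothetical, and the adjusted R2 formula shown is the common one; the source does not state which exact formula the report uses. The second call shows how R2 becomes negative when a model predicts worse than simply using the mean of the response.

```python
import math

def regression_metrics(y_true, y_pred, n_features=1):
    """Compute common regression metrics from their standard definitions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    return {
        "max_error": max(abs(e) for e in errors),
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "r2": r2,
        # Common adjusted R2 formula (assumed here, not confirmed by the source).
        "r2_adj": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }

# Invented responses and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
good = regression_metrics(y_true, [1.1, 1.9, 3.2, 3.8])  # r2 close to 1
bad = regression_metrics(y_true, [4.0, 3.0, 2.0, 1.0])   # r2 is -3.0: worse than the mean

# Values copied from the report above: RMSE should equal sqrt(MSE).
reported_mse = 1633328.42507227
reported_rmse = 1278.0173805830145
consistent = math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9)

print(good["r2"], bad["r2"], consistent)
```

Because R2 compares the model's squared error with that of a constant prediction at the mean, it has no lower bound: a sufficiently poor model yields a negative value.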

In the next lesson, we'll go over classification models.