VerticaPy

Python API for Vertica Data Science at Scale

Introduction to Machine Learning

One of the last stages of the data science life cycle is data modeling. Machine learning algorithms are a set of statistical techniques that build mathematical models from training data. These algorithms come in two types:

  • Supervised: these algorithms are used to predict a response column.
  • Unsupervised: these algorithms are used to detect anomalies or to segment the data; no response column is needed.

Supervised Learning

Supervised Learning techniques map an input to an output based on some example dataset. This type of learning consists of two main types:

  • Regression: the response is numerical (Linear Regression, SVM Regression, Random Forest Regression...)
  • Classification: the response is categorical (Gradient Boosting, Naive Bayes, Logistic Regression...)

For example, predicting a Telco customer's total charges from their tenure is a regression problem. The following code fits a linear regression of 'TotalCharges' as a function of 'tenure' in the Telco Churn dataset and plots the result.

In [29]:
from verticapy.learn.linear_model import LinearRegression
model = LinearRegression("LR_churn")
model.drop()
model.fit("churn", ["tenure"], "TotalCharges")
model.plot()
Out[29]:
<AxesSubplot:xlabel='"tenure"', ylabel='"TotalCharges"'>
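Under the hood, simple linear regression finds the intercept and slope that minimize the sum of squared residuals. The following database-free sketch shows the closed-form computation; the (tenure, charges) pairs are made-up toy values, not taken from the Telco Churn dataset.

```python
# Illustrative sketch of ordinary least squares with one predictor.
# The toy (tenure, charges) values below are invented for this example.

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

tenure = [1, 12, 24, 48, 72]                       # months (toy data)
charges = [70.0, 850.0, 1700.0, 3400.0, 5100.0]    # dollars (toy data)
b0, b1 = fit_simple_ols(tenure, charges)
print(b0, b1)  # slope close to the toy monthly rate of ~70.8
```

VerticaPy performs the equivalent computation inside Vertica, so the data never leaves the database.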

In contrast, when we have to predict a categorical column, we're dealing with classification.

In the following example, we use a Linear Support Vector Classification (SVC) to predict the species of a flower based on its petal and sepal lengths.

In [30]:
from verticapy.learn.svm import LinearSVC
model = LinearSVC("svc_setosa_iris")
model.drop()
model.fit("iris_clean", ["PetalLengthCm", "SepalLengthCm"], "Species_Iris-setosa")
model.plot()
Out[30]:
<AxesSubplot:xlabel='"PetalLengthCm"', ylabel='"SepalLengthCm"'>
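Once fitted, a linear classifier like LinearSVC labels a point by the sign of its decision function w · x + b, which corresponds to the side of the separating line the point falls on. A minimal sketch, with weights invented for illustration (a real model learns them from the data):

```python
# Sketch of how a fitted linear classifier assigns a label: the sign
# of the decision function w . x + b picks the class. The weights and
# bias below are hypothetical, not taken from a trained model.

def linear_decision(weights, bias, point):
    """Return 1 (e.g. Iris-setosa) if w . x + b > 0, else 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else 0

# Hypothetical weights on (PetalLengthCm, SepalLengthCm).
w = (-1.0, 0.2)
b = 2.0
print(linear_decision(w, b, (1.4, 5.1)))  # -> 1 (short petal: setosa side)
print(linear_decision(w, b, (4.7, 7.0)))  # -> 0 (long petal: other side)
```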

When there are more than two categories, we use the term 'multiclass classification' instead of 'classification'.
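One common way to build a multiclass classifier from binary ones is the one-vs-rest strategy: score one binary model per class and predict the class with the highest score. A sketch of the idea, with per-class weights invented for illustration:

```python
# One-vs-rest sketch: each class gets its own binary linear scorer,
# and the predicted label is the class with the highest score. The
# (weights, bias) values below are hypothetical, not from a real model.

def ovr_predict(models, point):
    """models: {label: (weights, bias)}; return the label with max score."""
    def score(weights, bias):
        return sum(w * x for w, x in zip(weights, point)) + bias
    return max(models, key=lambda label: score(*models[label]))

models = {
    "Iris-setosa":     ((-1.0, 0.2), 2.0),
    "Iris-versicolor": ((0.5, -0.1), -1.0),
    "Iris-virginica":  ((0.8, 0.1), -4.0),
}
print(ovr_predict(models, (1.4, 5.1)))  # -> Iris-setosa
```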

Unsupervised Learning

These algorithms are used to segment the data (k-means, DBSCAN, etc.) or to detect anomalies (Local Outlier Factor, Z-score techniques...). In particular, they're useful for finding patterns in unlabeled data. For example, let's use a k-means algorithm to create clusters on the Smart Meters dataset, where each cluster represents a region.

In [31]:
from verticapy.learn.cluster import KMeans
model = KMeans("KMeans_sm", n_cluster = 6)
model.drop()
model.fit("sm_meters", ["latitude", "longitude"])
model.plot()
Out[31]:
<AxesSubplot:xlabel='"latitude"', ylabel='"longitude"'>
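Conceptually, k-means alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal sketch of one such iteration on toy 2-D coordinates (the points are made up, not from the Smart Meters dataset):

```python
# One Lloyd iteration of k-means on toy 2-D points: assign each point
# to its nearest centroid (squared Euclidean distance), then recompute
# each centroid as the mean of its assigned points.

def kmeans_step(points, centroids):
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        clusters[nearest].append(p)
    # Empty clusters keep their old centroid.
    return [
        tuple(sum(c) / len(c) for c in zip(*pts)) if pts else centroids[i]
        for i, pts in clusters.items()
    ]

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
print(kmeans_step(points, centroids))  # each centroid moves toward its group
```

VerticaPy's KMeans runs this kind of iteration in-database until the centroids converge.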

In this section, we went over a few of the many ML algorithms available in VerticaPy. In the next lesson, we'll cover creating a regression model.