Introduction to Machine Learning#

One of the last stages of the data science life cycle is data modeling. Machine learning algorithms are a set of statistical techniques that build mathematical models from training data. These algorithms come in two types:

  • Supervised : these algorithms are used when we want to predict a response column.

  • Unsupervised : these algorithms are used when we want to detect anomalies or when we want to segment the data. No response column is needed.

Supervised Learning#

Supervised learning techniques learn to map an input to an output from a dataset of labeled examples. There are two main types:

  • Regression : The response is numerical (Linear Regression, SVM Regression, Random Forest Regression…)

  • Classification : The response is categorical (Gradient Boosting, Naive Bayes, Logistic Regression…)

For example, predicting the total charges of a Telco customer from their tenure is a regression problem. The following code fits a linear regression of ‘TotalCharges’ as a function of ‘tenure’ on the Telco Churn dataset and plots the result.

[1]:
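import verticapy as vp

# Use Highcharts to render the plots
vp.set_option("plotting_lib","highcharts")
# Load the Telco Churn dataset into a vDataFrame
churn = vp.read_csv("data/churn.csv")

from verticapy.learn.linear_model import LinearRegression

# Drop any existing model with the same name, then fit and plot the regression
vp.drop("LR_churn")
model = LinearRegression("LR_churn")
model.fit(churn, ["tenure"], "TotalCharges")
model.plot()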
[1]:
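To make the idea behind a one-feature linear regression concrete, here is a minimal pure-Python sketch of ordinary least squares. It is not VerticaPy's implementation; the function name fit_simple_linear_regression is our own, and the tenure and charge values are made up for illustration rather than taken from the Telco Churn dataset.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: total charges are exactly 70 per month of tenure.
tenure = [1, 5, 12, 24, 48, 60]
total_charges = [70.0, 350.0, 840.0, 1680.0, 3360.0, 4200.0]

slope, intercept = fit_simple_linear_regression(tenure, total_charges)

# Predict the total charges of a customer with 36 months of tenure.
predicted = slope * 36 + intercept
print(slope, intercept, predicted)
```

Because the made-up data lie exactly on a line, the fitted slope recovers the monthly rate and the intercept is essentially zero. Real data would scatter around the line, which is what the plot above shows.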

In contrast, when we have to predict a categorical column, we’re dealing with classification.

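In the following example, we use a Linear Support Vector Classifier (SVC) to predict whether a flower belongs to the species Iris setosa based on its petal and sepal lengths. We first one-hot encode the ‘Species’ column to obtain a binary response column.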

[2]:
from verticapy.datasets import load_iris

# Load the Iris dataset and one-hot encode its low-cardinality columns (here, Species)
iris = load_iris()
iris.one_hot_encode()
/opt/venv/lib/python3.10/site-packages/verticapy/core/vdataframe/_encoding.py:123: Warning: The vDataColumn '"Id"' was ignored because of its high cardinality.
Increase the parameter 'max_cardinality' to solve this issue or use directly the vDataColumn get_dummies method.
  warnings.warn(warning_message, Warning)
/opt/venv/lib/python3.10/site-packages/verticapy/core/vdataframe/_encoding.py:123: Warning: The vDataColumn '"PetalLengthCm"' was ignored because of its high cardinality.
Increase the parameter 'max_cardinality' to solve this issue or use directly the vDataColumn get_dummies method.
  warnings.warn(warning_message, Warning)
/opt/venv/lib/python3.10/site-packages/verticapy/core/vdataframe/_encoding.py:123: Warning: The vDataColumn '"PetalWidthCm"' was ignored because of its high cardinality.
Increase the parameter 'max_cardinality' to solve this issue or use directly the vDataColumn get_dummies method.
  warnings.warn(warning_message, Warning)
/opt/venv/lib/python3.10/site-packages/verticapy/core/vdataframe/_encoding.py:123: Warning: The vDataColumn '"SepalLengthCm"' was ignored because of its high cardinality.
Increase the parameter 'max_cardinality' to solve this issue or use directly the vDataColumn get_dummies method.
  warnings.warn(warning_message, Warning)
/opt/venv/lib/python3.10/site-packages/verticapy/core/vdataframe/_encoding.py:123: Warning: The vDataColumn '"SepalWidthCm"' was ignored because of its high cardinality.
Increase the parameter 'max_cardinality' to solve this issue or use directly the vDataColumn get_dummies method.
  warnings.warn(warning_message, Warning)
[2]:
     Id       PetalLengthCm  PetalWidthCm  SepalLengthCm  SepalWidthCm  Species          Species_Iris-setosa  Species_Iris-versicolor
     Integer  Numeric(8)     Numeric(8)    Numeric(8)     Numeric(8)    Varchar(30)      Integer              Integer
  1  1        1.4            0.2           5.1            3.5           Iris-setosa      1                    0
  2  2        1.4            0.2           4.9            3.0           Iris-setosa      1                    0
  3  3        1.3            0.2           4.7            3.2           Iris-setosa      1                    0
  4  4        1.5            0.2           4.6            3.1           Iris-setosa      1                    0
  5  5        1.4            0.2           5.0            3.6           Iris-setosa      1                    0
  6  6        1.7            0.4           5.4            3.9           Iris-setosa      1                    0
  7  7        1.4            0.3           4.6            3.4           Iris-setosa      1                    0
  8  8        1.5            0.2           5.0            3.4           Iris-setosa      1                    0
  9  9        1.4            0.2           4.4            2.9           Iris-setosa      1                    0
 10  10       1.5            0.1           4.9            3.1           Iris-setosa      1                    0
 11  11       1.5            0.2           5.4            3.7           Iris-setosa      1                    0
 12  12       1.6            0.2           4.8            3.4           Iris-setosa      1                    0
 13  13       1.4            0.1           4.8            3.0           Iris-setosa      1                    0
 14  14       1.1            0.1           4.3            3.0           Iris-setosa      1                    0
 15  15       1.2            0.2           5.8            4.0           Iris-setosa      1                    0
 16  16       1.5            0.4           5.7            4.4           Iris-setosa      1                    0
 17  17       1.3            0.4           5.4            3.9           Iris-setosa      1                    0
 18  18       1.4            0.3           5.1            3.5           Iris-setosa      1                    0
 19  19       1.7            0.3           5.7            3.8           Iris-setosa      1                    0
 20  20       1.5            0.3           5.1            3.8           Iris-setosa      1                    0
 21  21       1.7            0.2           5.4            3.4           Iris-setosa      1                    0
 22  22       1.5            0.4           5.1            3.7           Iris-setosa      1                    0
 23  23       1.0            0.2           4.6            3.6           Iris-setosa      1                    0
 24  24       1.7            0.5           5.1            3.3           Iris-setosa      1                    0
 25  25       1.9            0.2           4.8            3.4           Iris-setosa      1                    0
 26  26       1.6            0.2           5.0            3.0           Iris-setosa      1                    0
 27  27       1.6            0.4           5.0            3.4           Iris-setosa      1                    0
 28  28       1.5            0.2           5.2            3.5           Iris-setosa      1                    0
 29  29       1.4            0.2           5.2            3.4           Iris-setosa      1                    0
 30  30       1.6            0.2           4.7            3.2           Iris-setosa      1                    0
 31  31       1.6            0.2           4.8            3.1           Iris-setosa      1                    0
 32  32       1.5            0.4           5.4            3.4           Iris-setosa      1                    0
 33  33       1.5            0.1           5.2            4.1           Iris-setosa      1                    0
 34  34       1.4            0.2           5.5            4.2           Iris-setosa      1                    0
 35  35       1.5            0.1           4.9            3.1           Iris-setosa      1                    0
 36  36       1.2            0.2           5.0            3.2           Iris-setosa      1                    0
 37  37       1.3            0.2           5.5            3.5           Iris-setosa      1                    0
 38  38       1.5            0.1           4.9            3.1           Iris-setosa      1                    0
 39  39       1.3            0.2           4.4            3.0           Iris-setosa      1                    0
 40  40       1.5            0.2           5.1            3.4           Iris-setosa      1                    0
 41  41       1.3            0.3           5.0            3.5           Iris-setosa      1                    0
 42  42       1.3            0.3           4.5            2.3           Iris-setosa      1                    0
 43  43       1.3            0.2           4.4            3.2           Iris-setosa      1                    0
 44  44       1.6            0.6           5.0            3.5           Iris-setosa      1                    0
 45  45       1.9            0.4           5.1            3.8           Iris-setosa      1                    0
 46  46       1.4            0.3           4.8            3.0           Iris-setosa      1                    0
 47  47       1.6            0.2           5.1            3.8           Iris-setosa      1                    0
 48  48       1.4            0.2           4.6            3.2           Iris-setosa      1                    0
 49  49       1.5            0.2           5.3            3.7           Iris-setosa      1                    0
 50  50       1.4            0.2           5.0            3.3           Iris-setosa      1                    0
 51  51       4.7            1.4           7.0            3.2           Iris-versicolor  0                    1
 52  52       4.5            1.5           6.4            3.2           Iris-versicolor  0                    1
 53  53       4.9            1.5           6.9            3.1           Iris-versicolor  0                    1
 54  54       4.0            1.3           5.5            2.3           Iris-versicolor  0                    1
 55  55       4.6            1.5           6.5            2.8           Iris-versicolor  0                    1
 56  56       4.5            1.3           5.7            2.8           Iris-versicolor  0                    1
 57  57       4.7            1.6           6.3            3.3           Iris-versicolor  0                    1
 58  58       3.3            1.0           4.9            2.4           Iris-versicolor  0                    1
 59  59       4.6            1.3           6.6            2.9           Iris-versicolor  0                    1
 60  60       3.9            1.4           5.2            2.7           Iris-versicolor  0                    1
 61  61       3.5            1.0           5.0            2.0           Iris-versicolor  0                    1
 62  62       4.2            1.5           5.9            3.0           Iris-versicolor  0                    1
 63  63       4.0            1.0           6.0            2.2           Iris-versicolor  0                    1
 64  64       4.7            1.4           6.1            2.9           Iris-versicolor  0                    1
 65  65       3.6            1.3           5.6            2.9           Iris-versicolor  0                    1
 66  66       4.4            1.4           6.7            3.1           Iris-versicolor  0                    1
 67  67       4.5            1.5           5.6            3.0           Iris-versicolor  0                    1
 68  68       4.1            1.0           5.8            2.7           Iris-versicolor  0                    1
 69  69       4.5            1.5           6.2            2.2           Iris-versicolor  0                    1
 70  70       3.9            1.1           5.6            2.5           Iris-versicolor  0                    1
 71  71       4.8            1.8           5.9            3.2           Iris-versicolor  0                    1
 72  72       4.0            1.3           6.1            2.8           Iris-versicolor  0                    1
 73  73       4.9            1.5           6.3            2.5           Iris-versicolor  0                    1
 74  74       4.7            1.2           6.1            2.8           Iris-versicolor  0                    1
 75  75       4.3            1.3           6.4            2.9           Iris-versicolor  0                    1
 76  76       4.4            1.4           6.6            3.0           Iris-versicolor  0                    1
 77  77       4.8            1.4           6.8            2.8           Iris-versicolor  0                    1
 78  78       5.0            1.7           6.7            3.0           Iris-versicolor  0                    1
 79  79       4.5            1.5           6.0            2.9           Iris-versicolor  0                    1
 80  80       3.5            1.0           5.7            2.6           Iris-versicolor  0                    1
 81  81       3.8            1.1           5.5            2.4           Iris-versicolor  0                    1
 82  82       3.7            1.0           5.5            2.4           Iris-versicolor  0                    1
 83  83       3.9            1.2           5.8            2.7           Iris-versicolor  0                    1
 84  84       5.1            1.6           6.0            2.7           Iris-versicolor  0                    1
 85  85       4.5            1.5           5.4            3.0           Iris-versicolor  0                    1
 86  86       4.5            1.6           6.0            3.4           Iris-versicolor  0                    1
 87  87       4.7            1.5           6.7            3.1           Iris-versicolor  0                    1
 88  88       4.4            1.3           6.3            2.3           Iris-versicolor  0                    1
 89  89       4.1            1.3           5.6            3.0           Iris-versicolor  0                    1
 90  90       4.0            1.3           5.5            2.5           Iris-versicolor  0                    1
 91  91       4.4            1.2           5.5            2.6           Iris-versicolor  0                    1
 92  92       4.6            1.4           6.1            3.0           Iris-versicolor  0                    1
 93  93       4.0            1.2           5.8            2.6           Iris-versicolor  0                    1
 94  94       3.3            1.0           5.0            2.3           Iris-versicolor  0                    1
 95  95       4.2            1.3           5.6            2.7           Iris-versicolor  0                    1
 96  96       4.2            1.2           5.7            3.0           Iris-versicolor  0                    1
 97  97       4.2            1.3           5.7            2.9           Iris-versicolor  0                    1
 98  98       4.3            1.3           6.2            2.9           Iris-versicolor  0                    1
 99  99       3.0            1.1           5.1            2.5           Iris-versicolor  0                    1
100  100      4.1            1.3           5.7            2.8           Iris-versicolor  0                    1
Rows: 1-100 of 150 | Columns: 8
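The two indicator columns in the output above come from one-hot encoding the ‘Species’ column. As a rough illustration of what that transformation does, here is a small pure-Python sketch. The function one_hot_encode below is our own helper, not the vDataFrame method, and dropping the last category in sorted order is an assumption chosen to mirror the output above, not a statement about how VerticaPy selects the column to omit.

```python
def one_hot_encode(values, prefix, drop_last=True):
    """Return a dict mapping indicator column names to lists of 0/1 flags."""
    categories = sorted(set(values))
    # With k categories, k - 1 indicators are enough: the omitted category
    # is the one for which every remaining indicator is 0.
    if drop_last and len(categories) > 1:
        categories = categories[:-1]
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
encoded = one_hot_encode(species, prefix="Species")

for column, flags in encoded.items():
    print(column, flags)
```

A binary column such as ‘Species_Iris-setosa’ can then serve as the response of a two-class classifier.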
[3]:
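from verticapy.learn.svm import LinearSVC

# Drop any existing model with the same name before training
vp.drop("svc_setosa_iris")
model = LinearSVC("svc_setosa_iris")
model.drop()
# Predict the binary 'Species_Iris-setosa' column from the two length features
model.fit(iris, ["PetalLengthCm", "SepalLengthCm"], "Species_Iris-setosa")
model.plot()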
[3]:
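To give an intuition for how a linear classifier of this kind can be trained, here is a minimal pure-Python sketch that runs stochastic subgradient descent on the hinge loss, the loss that linear SVC is built on. It is a simplification under stated assumptions, not VerticaPy's algorithm: it omits the regularization term that a full SVM includes, the function names are our own, and it trains on only eight rows (rows 1-4 and 51-54 of the table above), with the label +1 for setosa and -1 otherwise.

```python
def score(w, b, x):
    """Signed score of point x under the linear model (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_hinge_classifier(points, labels, learning_rate=0.1, max_epochs=1000):
    """Stochastic subgradient descent on the hinge loss max(0, 1 - y * score)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(points, labels):
            if y * score(w, b, x) < 1:
                # The point is misclassified or inside the margin:
                # step along the negative subgradient of its hinge loss.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
                updated = True
        if not updated:
            # Every training point now has margin >= 1.
            break
    return w, b


def predict(w, b, x):
    return 1 if score(w, b, x) > 0 else -1


# (PetalLengthCm, SepalLengthCm) for four setosa and four versicolor rows.
points = [(1.4, 5.1), (1.4, 4.9), (1.3, 4.7), (1.5, 4.6),
          (4.7, 7.0), (4.5, 6.4), (4.9, 6.9), (4.0, 5.5)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_hinge_classifier(points, labels)
train_predictions = [predict(w, b, x) for x in points]
print(train_predictions)
print(predict(w, b, (1.4, 5.0)), predict(w, b, (4.6, 6.7)))
```

The two classes in this small sample are linearly separable, so the loop stops once every point sits on the correct side of the boundary with a margin of at least 1. On data that are not separable, the regularization term and a decaying learning rate become important.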

When the response has more than two categories, the task is called ‘Multiclass Classification’ rather than binary classification.

Unsupervised Learning#

These algorithms are used to segment the data (k-means, DBSCAN, etc.) or to detect anomalies (Local Outlier Factor, Z-score techniques…). In particular, they’re useful for finding patterns in data without labels. For example, let’s use a k-means algorithm to create clusters on the Iris dataset. Ideally, each cluster will correspond to one flower species, although the algorithm never sees the species labels.

[4]:
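from verticapy.learn.cluster import KMeans

# Drop any existing model with the same name before training
vp.drop("KMeans_iris")
# Ask for three clusters, one per expected species; no response column is passed
model = KMeans("KMeans_iris", n_cluster = 3)
model.fit(iris, ["PetalLengthCm", "SepalLengthCm"])
model.plot()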
[4]:
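The k-means idea itself fits in a few lines. The sketch below is a plain-Python version of the classic alternating procedure (assign each point to its nearest center, then move each center to the mean of its points), not VerticaPy's implementation. For brevity it uses two clusters on the same eight rows as before (rows 1-4 and 51-54 of the Iris table), and the initial centers are hand-picked, which is an assumption made to keep the run deterministic; library implementations typically choose them with a randomized scheme.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def kmeans(points, initial_centers, max_iter=100):
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        cluster_ids = assign(points, centers)
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [p for p, c in zip(points, cluster_ids) if c == j]
            if members:
                # Move the center to the coordinate-wise mean of its members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(old_center)
        if new_centers == centers:
            break  # Centers stopped moving: converged.
        centers = new_centers
    return centers, assign(points, centers)


# (PetalLengthCm, SepalLengthCm); note that no species labels are used.
points = [(1.4, 5.1), (1.4, 4.9), (1.3, 4.7), (1.5, 4.6),
          (4.7, 7.0), (4.5, 6.4), (4.9, 6.9), (4.0, 5.5)]

centers, cluster_ids = kmeans(points, initial_centers=[points[0], points[4]])
print(centers)
print(cluster_ids)
```

On this sample the two clusters coincide with the two species, but that is a property of the data rather than a guarantee: k-means only groups points that are close to each other.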

In this section, we went over a few of the many ML algorithms available in VerticaPy. In the next lesson, we’ll cover creating a regression model.