Clustering

Clustering algorithms are used to segment data or to find anomalies. Generally speaking, they are sensitive to unnormalized data because most of them rely on distances between points, so it’s important to properly prepare your data beforehand.

For example, in the ‘titanic’ dataset, the features ‘fare’ and ‘age’ don’t share the same range of values; ‘fare’ can be much higher than ‘age’. Applying a clustering algorithm to this kind of dataset without normalizing it first would create misleading clusters, because ‘fare’ would dominate the distance computation.
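
To see why scale matters, here is a minimal sketch in plain Python (no Vertica connection needed). The three passengers and the helper names `euclidean` and `min_max_scale` are made up for illustration, and min-max scaling is only one common normalization choice, assumed here for simplicity.

```python
import math

# Hypothetical passengers as (age, fare); these values are invented for illustration.
passengers = [(22.0, 7.25), (24.0, 512.33), (60.0, 8.05)]


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(rows):
    """Rescale every column to the [0, 1] interval."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # `or 1.0` avoids a division by zero when a column is constant.
        tuple((v - lo) / ((hi - lo) or 1.0) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


scaled = min_max_scale(passengers)

# Raw distances: the fare gap swamps the age gap.
raw_fare_gap = euclidean(passengers[0], passengers[1])  # similar age, very different fare
raw_age_gap = euclidean(passengers[0], passengers[2])   # similar fare, very different age

# Scaled distances: both features now contribute on the same footing.
scaled_fare_gap = euclidean(scaled[0], scaled[1])
scaled_age_gap = euclidean(scaled[0], scaled[2])

print(raw_fare_gap, raw_age_gap)
print(scaled_fare_gap, scaled_age_gap)
```

On the raw values, the fare difference makes one distance more than ten times larger than the other; after scaling, the two distances are nearly equal, so neither feature dominates.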

To create a clustering model, we’ll start by importing the k-means algorithm.

[1]:
from verticapy.learn.cluster import KMeans
import verticapy as vp

vp.set_option("plotting_lib","highcharts") # Set the desired plotting library

Next, we’ll create a model object. Since Vertica has its own model management system, we just need to choose a model name and the number of clusters. The model’s name can include the schema; if it doesn’t, the public schema is used by default.

[2]:
vp.drop("KMeans_sm") # Drop any existing model with the same name
model = KMeans("KMeans_sm", n_cluster = 3)

Let’s use the iris dataset to fit our model.

[3]:
from verticapy.datasets import load_iris
iris = load_iris()

We can now fit the model on two of its features and plot the resulting clusters.

[4]:
model.fit(iris, ["PetalLengthCm", "SepalLengthCm"])
model.plot()

While there aren’t any real metrics for evaluating unsupervised models, metrics used during computation can help us to understand the quality of the model. For example, a k-means model is generally considered better when it uses fewer clusters and when its k-means score, ‘Between-Cluster SS / Total SS’, is close to 1.

[5]:
model.get_vertica_attributes("metrics")
[5]:
Rows: 1-1 | Column: metrics | Type: Varchar(65000)
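
As a rough illustration of the ‘Between-Cluster SS / Total SS’ score, the sketch below computes it by hand in plain Python. It assumes the usual definitions: the total SS is the sum of squared distances from every point to the overall mean, and the between-cluster SS weights each cluster mean’s squared distance to the overall mean by the cluster size. The points, labels, and function names are hypothetical, and the code does not use VerticaPy.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def between_over_total(points, labels):
    """Between-cluster sum of squares divided by total sum of squares."""
    overall = mean_point(points)
    total_ss = sum(squared_distance(p, overall) for p in points)

    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    between_ss = sum(
        len(members) * squared_distance(mean_point(members), overall)
        for members in clusters.values()
    )
    return between_ss / total_ss


# Invented data: two tight groups that are far apart.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]

good = between_over_total(points, [0, 0, 1, 1])     # labels follow the groups
poor = between_over_total(points, [0, 1, 0, 1])     # labels cut across the groups
trivial = between_over_total(points, [0, 1, 2, 3])  # one cluster per point

print(good, poor, trivial)
```

The labeling that follows the two groups scores close to 1, while the one that cuts across them scores close to 0. Giving every point its own cluster also scores 1, which is why the score has to be weighed against the number of clusters.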

You can add the prediction to your vDataFrame.

[6]:
model.predict(iris, name = "cluster")
[6]:
Row   Id       PetalLengthCm  PetalWidthCm  SepalLengthCm  SepalWidthCm  Species          cluster
Type  Integer  Numeric(8)     Numeric(8)    Numeric(8)     Numeric(8)    Varchar(30)      Integer
1     1        1.4            0.2           5.1            3.5           Iris-setosa      2
2     2        1.4            0.2           4.9            3.0           Iris-setosa      2
3     3        1.3            0.2           4.7            3.2           Iris-setosa      2
4     4        1.5            0.2           4.6            3.1           Iris-setosa      2
5     5        1.4            0.2           5.0            3.6           Iris-setosa      2
6     6        1.7            0.4           5.4            3.9           Iris-setosa      2
7     7        1.4            0.3           4.6            3.4           Iris-setosa      2
8     8        1.5            0.2           5.0            3.4           Iris-setosa      2
9     9        1.4            0.2           4.4            2.9           Iris-setosa      2
10    10       1.5            0.1           4.9            3.1           Iris-setosa      2
11    11       1.5            0.2           5.4            3.7           Iris-setosa      2
12    12       1.6            0.2           4.8            3.4           Iris-setosa      2
13    13       1.4            0.1           4.8            3.0           Iris-setosa      2
14    14       1.1            0.1           4.3            3.0           Iris-setosa      2
15    15       1.2            0.2           5.8            4.0           Iris-setosa      2
16    16       1.5            0.4           5.7            4.4           Iris-setosa      2
17    17       1.3            0.4           5.4            3.9           Iris-setosa      2
18    18       1.4            0.3           5.1            3.5           Iris-setosa      2
19    19       1.7            0.3           5.7            3.8           Iris-setosa      2
20    20       1.5            0.3           5.1            3.8           Iris-setosa      2
21    21       1.7            0.2           5.4            3.4           Iris-setosa      2
22    22       1.5            0.4           5.1            3.7           Iris-setosa      2
23    23       1.0            0.2           4.6            3.6           Iris-setosa      2
24    24       1.7            0.5           5.1            3.3           Iris-setosa      2
25    25       1.9            0.2           4.8            3.4           Iris-setosa      2
26    26       1.6            0.2           5.0            3.0           Iris-setosa      2
27    27       1.6            0.4           5.0            3.4           Iris-setosa      2
28    28       1.5            0.2           5.2            3.5           Iris-setosa      2
29    29       1.4            0.2           5.2            3.4           Iris-setosa      2
30    30       1.6            0.2           4.7            3.2           Iris-setosa      2
31    31       1.6            0.2           4.8            3.1           Iris-setosa      2
32    32       1.5            0.4           5.4            3.4           Iris-setosa      2
33    33       1.5            0.1           5.2            4.1           Iris-setosa      2
34    34       1.4            0.2           5.5            4.2           Iris-setosa      2
35    35       1.5            0.1           4.9            3.1           Iris-setosa      2
36    36       1.2            0.2           5.0            3.2           Iris-setosa      2
37    37       1.3            0.2           5.5            3.5           Iris-setosa      2
38    38       1.5            0.1           4.9            3.1           Iris-setosa      2
39    39       1.3            0.2           4.4            3.0           Iris-setosa      2
40    40       1.5            0.2           5.1            3.4           Iris-setosa      2
41    41       1.3            0.3           5.0            3.5           Iris-setosa      2
42    42       1.3            0.3           4.5            2.3           Iris-setosa      2
43    43       1.3            0.2           4.4            3.2           Iris-setosa      2
44    44       1.6            0.6           5.0            3.5           Iris-setosa      2
45    45       1.9            0.4           5.1            3.8           Iris-setosa      2
46    46       1.4            0.3           4.8            3.0           Iris-setosa      2
47    47       1.6            0.2           5.1            3.8           Iris-setosa      2
48    48       1.4            0.2           4.6            3.2           Iris-setosa      2
49    49       1.5            0.2           5.3            3.7           Iris-setosa      2
50    50       1.4            0.2           5.0            3.3           Iris-setosa      2
51    51       4.7            1.4           7.0            3.2           Iris-versicolor  0
52    52       4.5            1.5           6.4            3.2           Iris-versicolor  1
53    53       4.9            1.5           6.9            3.1           Iris-versicolor  0
54    54       4.0            1.3           5.5            2.3           Iris-versicolor  1
55    55       4.6            1.5           6.5            2.8           Iris-versicolor  1
56    56       4.5            1.3           5.7            2.8           Iris-versicolor  1
57    57       4.7            1.6           6.3            3.3           Iris-versicolor  1
58    58       3.3            1.0           4.9            2.4           Iris-versicolor  1
59    59       4.6            1.3           6.6            2.9           Iris-versicolor  1
60    60       3.9            1.4           5.2            2.7           Iris-versicolor  1
61    61       3.5            1.0           5.0            2.0           Iris-versicolor  1
62    62       4.2            1.5           5.9            3.0           Iris-versicolor  1
63    63       4.0            1.0           6.0            2.2           Iris-versicolor  1
64    64       4.7            1.4           6.1            2.9           Iris-versicolor  1
65    65       3.6            1.3           5.6            2.9           Iris-versicolor  1
66    66       4.4            1.4           6.7            3.1           Iris-versicolor  1
67    67       4.5            1.5           5.6            3.0           Iris-versicolor  1
68    68       4.1            1.0           5.8            2.7           Iris-versicolor  1
69    69       4.5            1.5           6.2            2.2           Iris-versicolor  1
70    70       3.9            1.1           5.6            2.5           Iris-versicolor  1
71    71       4.8            1.8           5.9            3.2           Iris-versicolor  1
72    72       4.0            1.3           6.1            2.8           Iris-versicolor  1
73    73       4.9            1.5           6.3            2.5           Iris-versicolor  1
74    74       4.7            1.2           6.1            2.8           Iris-versicolor  1
75    75       4.3            1.3           6.4            2.9           Iris-versicolor  1
76    76       4.4            1.4           6.6            3.0           Iris-versicolor  1
77    77       4.8            1.4           6.8            2.8           Iris-versicolor  0
78    78       5.0            1.7           6.7            3.0           Iris-versicolor  0
79    79       4.5            1.5           6.0            2.9           Iris-versicolor  1
80    80       3.5            1.0           5.7            2.6           Iris-versicolor  1
81    81       3.8            1.1           5.5            2.4           Iris-versicolor  1
82    82       3.7            1.0           5.5            2.4           Iris-versicolor  1
83    83       3.9            1.2           5.8            2.7           Iris-versicolor  1
84    84       5.1            1.6           6.0            2.7           Iris-versicolor  1
85    85       4.5            1.5           5.4            3.0           Iris-versicolor  1
86    86       4.5            1.6           6.0            3.4           Iris-versicolor  1
87    87       4.7            1.5           6.7            3.1           Iris-versicolor  1
88    88       4.4            1.3           6.3            2.3           Iris-versicolor  1
89    89       4.1            1.3           5.6            3.0           Iris-versicolor  1
90    90       4.0            1.3           5.5            2.5           Iris-versicolor  1
91    91       4.4            1.2           5.5            2.6           Iris-versicolor  1
92    92       4.6            1.4           6.1            3.0           Iris-versicolor  1
93    93       4.0            1.2           5.8            2.6           Iris-versicolor  1
94    94       3.3            1.0           5.0            2.3           Iris-versicolor  1
95    95       4.2            1.3           5.6            2.7           Iris-versicolor  1
96    96       4.2            1.2           5.7            3.0           Iris-versicolor  1
97    97       4.2            1.3           5.7            2.9           Iris-versicolor  1
98    98       4.3            1.3           6.2            2.9           Iris-versicolor  1
99    99       3.0            1.1           5.1            2.5           Iris-versicolor  2
100   100      4.1            1.3           5.7            2.8           Iris-versicolor  1
Rows: 1-100 | Columns: 7
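
Conceptually, a k-means prediction assigns each row to its nearest cluster center. The sketch below illustrates that idea in plain Python, assuming Euclidean distance. The centroid coordinates are invented for illustration and are not the centers fitted by the model above; `assign_cluster` is a hypothetical helper, not a VerticaPy function.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_cluster(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )


# Invented centroids in (PetalLengthCm, SepalLengthCm) space, NOT fitted values.
centroids = [(5.6, 6.8), (4.3, 5.9), (1.5, 5.0)]

# A few (PetalLengthCm, SepalLengthCm) pairs taken from the iris rows above.
rows = [(1.4, 5.1), (4.7, 7.0), (4.5, 6.4), (3.0, 5.1)]

labels = [assign_cluster(row, centroids) for row in rows]
print(labels)
```

With these made-up centers, the short-petal rows fall into one cluster and the two long-petal rows are split between the other two, depending on which center is closer.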

This concludes this lesson on clustering models in VerticaPy. We’ll look at time series models in the next lesson.