VerticaPy

Python API for Vertica Data Science at Scale

Clustering

Clustering algorithms are used to segment data or to find anomalies. Generally speaking, clustering algorithms are sensitive to unnormalized data, so it's important to properly prepare your data beforehand.

For example, in the 'titanic' dataset, the features 'fare' and 'age' don't span the same range of values: 'fare' can be much larger than 'age'. Because k-means groups points by distance, the feature with the larger scale dominates, so applying a clustering algorithm to unnormalized data like this would create misleading clusters.
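To make the scaling problem concrete, here is a minimal pure-Python sketch. The three passengers and their (age, fare) values are made up for illustration and are not taken from the 'titanic' dataset, and `min_max_scale` is a hypothetical helper rather than a VerticaPy function. It compares Euclidean distances before and after rescaling each feature to the [0, 1] interval.

```python
import math

# Hypothetical passengers as (age in years, fare); values are made up for illustration.
passengers = [(22.0, 7.25), (24.0, 512.33), (60.0, 8.05)]

def euclidean(a, b):
    """Euclidean distance between two equally sized tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every column of `rows` to the [0, 1] interval."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]

# On raw values, the fare difference swamps the age difference.
raw_near = euclidean(passengers[0], passengers[2])  # similar fare, 38 years apart
raw_far = euclidean(passengers[0], passengers[1])   # similar age, very different fare

# After scaling, both features contribute on the same footing.
scaled = min_max_scale(passengers)
scaled_near = euclidean(scaled[0], scaled[2])
scaled_far = euclidean(scaled[0], scaled[1])

print(f"raw:    {raw_near:.2f} vs {raw_far:.2f}")
print(f"scaled: {scaled_near:.4f} vs {scaled_far:.4f}")
```

On the raw values, the passenger who is 38 years older looks far closer than the one who paid a much higher fare. After scaling, the two distances are nearly equal, so neither feature dominates.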

To create a clustering model, we'll start by importing the k-means algorithm.

In [1]:
from verticapy.learn.cluster import KMeans

Next, we'll create a model object. Since Vertica has its own model management system, we just need to choose a model name and the number of clusters. The model's name can include the schema; if it doesn't, the public schema is used by default.

In [9]:
model = KMeans("KMeans_sm", n_cluster = 3)

Let's use the iris dataset to fit our model.

In [10]:
from verticapy.datasets import load_iris
iris = load_iris()

Now we can fit the model on two of the features and plot the resulting clusters.

In [11]:
model.fit(iris, ["PetalLengthCm", "SepalLengthCm"])
model.plot()
Out[11]:
<AxesSubplot:xlabel='"PetalLengthCm"', ylabel='"SepalLengthCm"'>

While there aren't any real metrics for evaluating unsupervised models, metrics used during computation can help us to understand the quality of the model. For example, a k-means model is generally better when it uses fewer clusters and when its k-means score, 'Between-Cluster SS / Total SS', is close to 1.

In [12]:
model.metrics_
Out[12]:
                                 value
Between-Cluster SS               512.23072
Total SS                         566.03207
Total Within-Cluster SS          53.801351
Between-Cluster SS / Total SS    0.9049499969144859
Rows: 1-5 | Columns: 2
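As a rough illustration of where these numbers come from, the sketch below computes the same kind of sum-of-squares decomposition in plain Python for a tiny made-up dataset. The helper names (`sum_of_squares`, `centroid`, `kmeans_score`) and the sample points are hypothetical and are not part of VerticaPy. It also checks that the first two values reported above, read as the between-cluster and total sums of squares (an assumption based on the arithmetic), reproduce the reported ratio.

```python
def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to `center`."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def centroid(points):
    """Component-wise mean of a list of equally sized tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans_score(points, labels):
    """Return (between SS, total SS, within SS, between SS / total SS)."""
    total_ss = sum_of_squares(points, centroid(points))
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    # Within SS: squared distances to each cluster's own centroid.
    within_ss = sum(sum_of_squares(m, centroid(m)) for m in clusters.values())
    between_ss = total_ss - within_ss
    return between_ss, total_ss, within_ss, between_ss / total_ss

# Made-up example: two tight, well-separated clusters.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels = [0, 0, 1, 1]
between_ss, total_ss, within_ss, score = kmeans_score(points, labels)
print(between_ss, total_ss, within_ss, round(score, 4))

# Sanity check on the values reported above, assuming the first is the
# between-cluster SS and the second is the total SS.
reported_ratio = 512.23072 / 566.03207
print(round(reported_ratio, 5))
```

A score near 1 means that most of the variance in the data is explained by the separation between clusters rather than by the spread inside them.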

You can add the prediction to your vDataFrame.

In [13]:
model.predict(iris, name = "cluster")
Out[13]:
     Id   PetalLengthCm  PetalWidthCm  SepalLengthCm  SepalWidthCm  Species          cluster
     Int  Numeric(6,3)   Numeric(6,3)  Numeric(6,3)   Numeric(6,3)  Varchar(30)      Integer
1    1    1.4            0.2           5.1            3.5           Iris-setosa      1
2    2    1.4            0.2           4.9            3.0           Iris-setosa      1
3    3    1.3            0.2           4.7            3.2           Iris-setosa      1
4    4    1.5            0.2           4.6            3.1           Iris-setosa      1
5    5    1.4            0.2           5.0            3.6           Iris-setosa      1
6    6    1.7            0.4           5.4            3.9           Iris-setosa      1
7    7    1.4            0.3           4.6            3.4           Iris-setosa      1
8    8    1.5            0.2           5.0            3.4           Iris-setosa      1
9    9    1.4            0.2           4.4            2.9           Iris-setosa      1
10   10   1.5            0.1           4.9            3.1           Iris-setosa      1
11   11   1.5            0.2           5.4            3.7           Iris-setosa      1
12   12   1.6            0.2           4.8            3.4           Iris-setosa      1
13   13   1.4            0.1           4.8            3.0           Iris-setosa      1
14   14   1.1            0.1           4.3            3.0           Iris-setosa      1
15   15   1.2            0.2           5.8            4.0           Iris-setosa      1
16   16   1.5            0.4           5.7            4.4           Iris-setosa      1
17   17   1.3            0.4           5.4            3.9           Iris-setosa      1
18   18   1.4            0.3           5.1            3.5           Iris-setosa      1
19   19   1.7            0.3           5.7            3.8           Iris-setosa      1
20   20   1.5            0.3           5.1            3.8           Iris-setosa      1
21   21   1.7            0.2           5.4            3.4           Iris-setosa      1
22   22   1.5            0.4           5.1            3.7           Iris-setosa      1
23   23   1.0            0.2           4.6            3.6           Iris-setosa      1
24   24   1.7            0.5           5.1            3.3           Iris-setosa      1
25   25   1.9            0.2           4.8            3.4           Iris-setosa      1
26   26   1.6            0.2           5.0            3.0           Iris-setosa      1
27   27   1.6            0.4           5.0            3.4           Iris-setosa      1
28   28   1.5            0.2           5.2            3.5           Iris-setosa      1
29   29   1.4            0.2           5.2            3.4           Iris-setosa      1
30   30   1.6            0.2           4.7            3.2           Iris-setosa      1
31   31   1.6            0.2           4.8            3.1           Iris-setosa      1
32   32   1.5            0.4           5.4            3.4           Iris-setosa      1
33   33   1.5            0.1           5.2            4.1           Iris-setosa      1
34   34   1.4            0.2           5.5            4.2           Iris-setosa      1
35   35   1.5            0.1           4.9            3.1           Iris-setosa      1
36   36   1.2            0.2           5.0            3.2           Iris-setosa      1
37   37   1.3            0.2           5.5            3.5           Iris-setosa      1
38   38   1.5            0.1           4.9            3.1           Iris-setosa      1
39   39   1.3            0.2           4.4            3.0           Iris-setosa      1
40   40   1.5            0.2           5.1            3.4           Iris-setosa      1
41   41   1.3            0.3           5.0            3.5           Iris-setosa      1
42   42   1.3            0.3           4.5            2.3           Iris-setosa      1
43   43   1.3            0.2           4.4            3.2           Iris-setosa      1
44   44   1.6            0.6           5.0            3.5           Iris-setosa      1
45   45   1.9            0.4           5.1            3.8           Iris-setosa      1
46   46   1.4            0.3           4.8            3.0           Iris-setosa      1
47   47   1.6            0.2           5.1            3.8           Iris-setosa      1
48   48   1.4            0.2           4.6            3.2           Iris-setosa      1
49   49   1.5            0.2           5.3            3.7           Iris-setosa      1
50   50   1.4            0.2           5.0            3.3           Iris-setosa      1
51   51   4.7            1.4           7.0            3.2           Iris-versicolor  0
52   52   4.5            1.5           6.4            3.2           Iris-versicolor  2
53   53   4.9            1.5           6.9            3.1           Iris-versicolor  0
54   54   4.0            1.3           5.5            2.3           Iris-versicolor  2
55   55   4.6            1.5           6.5            2.8           Iris-versicolor  2
56   56   4.5            1.3           5.7            2.8           Iris-versicolor  2
57   57   4.7            1.6           6.3            3.3           Iris-versicolor  2
58   58   3.3            1.0           4.9            2.4           Iris-versicolor  2
59   59   4.6            1.3           6.6            2.9           Iris-versicolor  2
60   60   3.9            1.4           5.2            2.7           Iris-versicolor  2
61   61   3.5            1.0           5.0            2.0           Iris-versicolor  2
62   62   4.2            1.5           5.9            3.0           Iris-versicolor  2
63   63   4.0            1.0           6.0            2.2           Iris-versicolor  2
64   64   4.7            1.4           6.1            2.9           Iris-versicolor  2
65   65   3.6            1.3           5.6            2.9           Iris-versicolor  2
66   66   4.4            1.4           6.7            3.1           Iris-versicolor  2
67   67   4.5            1.5           5.6            3.0           Iris-versicolor  2
68   68   4.1            1.0           5.8            2.7           Iris-versicolor  2
69   69   4.5            1.5           6.2            2.2           Iris-versicolor  2
70   70   3.9            1.1           5.6            2.5           Iris-versicolor  2
71   71   4.8            1.8           5.9            3.2           Iris-versicolor  2
72   72   4.0            1.3           6.1            2.8           Iris-versicolor  2
73   73   4.9            1.5           6.3            2.5           Iris-versicolor  2
74   74   4.7            1.2           6.1            2.8           Iris-versicolor  2
75   75   4.3            1.3           6.4            2.9           Iris-versicolor  2
76   76   4.4            1.4           6.6            3.0           Iris-versicolor  2
77   77   4.8            1.4           6.8            2.8           Iris-versicolor  0
78   78   5.0            1.7           6.7            3.0           Iris-versicolor  0
79   79   4.5            1.5           6.0            2.9           Iris-versicolor  2
80   80   3.5            1.0           5.7            2.6           Iris-versicolor  2
81   81   3.8            1.1           5.5            2.4           Iris-versicolor  2
82   82   3.7            1.0           5.5            2.4           Iris-versicolor  2
83   83   3.9            1.2           5.8            2.7           Iris-versicolor  2
84   84   5.1            1.6           6.0            2.7           Iris-versicolor  2
85   85   4.5            1.5           5.4            3.0           Iris-versicolor  2
86   86   4.5            1.6           6.0            3.4           Iris-versicolor  2
87   87   4.7            1.5           6.7            3.1           Iris-versicolor  2
88   88   4.4            1.3           6.3            2.3           Iris-versicolor  2
89   89   4.1            1.3           5.6            3.0           Iris-versicolor  2
90   90   4.0            1.3           5.5            2.5           Iris-versicolor  2
91   91   4.4            1.2           5.5            2.6           Iris-versicolor  2
92   92   4.6            1.4           6.1            3.0           Iris-versicolor  2
93   93   4.0            1.2           5.8            2.6           Iris-versicolor  2
94   94   3.3            1.0           5.0            2.3           Iris-versicolor  2
95   95   4.2            1.3           5.6            2.7           Iris-versicolor  2
96   96   4.2            1.2           5.7            3.0           Iris-versicolor  2
97   97   4.2            1.3           5.7            2.9           Iris-versicolor  2
98   98   4.3            1.3           6.2            2.9           Iris-versicolor  2
99   99   3.0            1.1           5.1            2.5           Iris-versicolor  1
100  100  4.1            1.3           5.7            2.8           Iris-versicolor  2
Rows: 1-100 | Columns: 7

This concludes this lesson on clustering models in VerticaPy. We'll look at time series models in the next lesson.