VerticaPy

Python API for Vertica Data Science at Scale

Unsupervised

This example demonstrates various Unsupervised Model methods. We will use the Iris dataset.

In [1]:
from verticapy.learn.datasets import load_iris
iris = load_iris()
display(iris)
123
SepalLengthCm
Numeric(5,2)
123
SepalWidthCm
Numeric(5,2)
123
PetalLengthCm
Numeric(5,2)
123
PetalWidthCm
Numeric(5,2)
Abc
Species
Varchar(30)
14.303.001.100.10Iris-setosa
24.402.901.400.20Iris-setosa
34.403.001.300.20Iris-setosa
44.403.201.300.20Iris-setosa
54.502.301.300.30Iris-setosa
64.603.101.500.20Iris-setosa
74.603.201.400.20Iris-setosa
84.603.401.400.30Iris-setosa
94.603.601.000.20Iris-setosa
104.703.201.300.20Iris-setosa
114.703.201.600.20Iris-setosa
124.803.001.400.10Iris-setosa
134.803.001.400.30Iris-setosa
144.803.101.600.20Iris-setosa
154.803.401.600.20Iris-setosa
164.803.401.900.20Iris-setosa
174.902.403.301.00Iris-versicolor
184.902.504.501.70Iris-virginica
194.903.001.400.20Iris-setosa
204.903.101.500.10Iris-setosa
214.903.101.500.10Iris-setosa
224.903.101.500.10Iris-setosa
235.002.003.501.00Iris-versicolor
245.002.303.301.00Iris-versicolor
255.003.001.600.20Iris-setosa
265.003.201.200.20Iris-setosa
275.003.301.400.20Iris-setosa
285.003.401.500.20Iris-setosa
295.003.401.600.40Iris-setosa
305.003.501.300.30Iris-setosa
315.003.501.600.60Iris-setosa
325.003.601.400.20Iris-setosa
335.102.503.001.10Iris-versicolor
345.103.301.700.50Iris-setosa
355.103.401.500.20Iris-setosa
365.103.501.400.20Iris-setosa
375.103.501.400.30Iris-setosa
385.103.701.500.40Iris-setosa
395.103.801.500.30Iris-setosa
405.103.801.600.20Iris-setosa
415.103.801.900.40Iris-setosa
425.202.703.901.40Iris-versicolor
435.203.401.400.20Iris-setosa
445.203.501.500.20Iris-setosa
455.204.101.500.10Iris-setosa
465.303.701.500.20Iris-setosa
475.403.004.501.50Iris-versicolor
485.403.401.500.40Iris-setosa
495.403.401.700.20Iris-setosa
505.403.701.500.20Iris-setosa
515.403.901.300.40Iris-setosa
525.403.901.700.40Iris-setosa
535.502.304.001.30Iris-versicolor
545.502.403.701.00Iris-versicolor
555.502.403.801.10Iris-versicolor
565.502.504.001.30Iris-versicolor
575.502.604.401.20Iris-versicolor
585.503.501.300.20Iris-setosa
595.504.201.400.20Iris-setosa
605.602.503.901.10Iris-versicolor
615.602.704.201.30Iris-versicolor
625.602.804.902.00Iris-virginica
635.602.903.601.30Iris-versicolor
645.603.004.101.30Iris-versicolor
655.603.004.501.50Iris-versicolor
665.702.505.002.00Iris-virginica
675.702.603.501.00Iris-versicolor
685.702.804.101.30Iris-versicolor
695.702.804.501.30Iris-versicolor
705.702.904.201.30Iris-versicolor
715.703.004.201.20Iris-versicolor
725.703.801.700.30Iris-setosa
735.704.401.500.40Iris-setosa
745.802.604.001.20Iris-versicolor
755.802.703.901.20Iris-versicolor
765.802.704.101.00Iris-versicolor
775.802.705.101.90Iris-virginica
785.802.705.101.90Iris-virginica
795.802.805.102.40Iris-virginica
805.804.001.200.20Iris-setosa
815.903.004.201.50Iris-versicolor
825.903.005.101.80Iris-virginica
835.903.204.801.80Iris-versicolor
846.002.204.001.00Iris-versicolor
856.002.205.001.50Iris-virginica
866.002.705.101.60Iris-versicolor
876.002.904.501.50Iris-versicolor
886.003.004.801.80Iris-virginica
896.003.404.501.60Iris-versicolor
906.102.605.601.40Iris-virginica
916.102.804.001.30Iris-versicolor
926.102.804.701.20Iris-versicolor
936.102.904.701.40Iris-versicolor
946.103.004.601.40Iris-versicolor
956.103.004.901.80Iris-virginica
966.202.204.501.50Iris-versicolor
976.202.804.801.80Iris-virginica
986.202.904.301.30Iris-versicolor
996.203.405.402.30Iris-virginica
1006.302.304.401.30Iris-versicolor
Rows: 1-100 of 150 | Columns: 5

Let's create a k-means model to segment the data into clusters.

In [2]:
from verticapy.learn.cluster import KMeans
model = KMeans("public.KMeans_iris", n_cluster = 3)
model.fit("public.iris", ["PetalWidthCm", "PetalLengthCm"])
Out[2]:

=======
centers
=======
petalwidthcm|petallengthcm
------------+-------------
   2.04783  |   5.62609   
   1.35926  |   4.29259   
   0.24400  |   1.46400   


=======
metrics
=======
Evaluation metrics:
     Total Sum of Squares: 550.64347
     Within-Cluster Sum of Squares: 
         Cluster 0: 15.163478
         Cluster 1: 14.227407
         Cluster 2: 2.0384
     Total Within-Cluster Sum of Squares: 31.429286
     Between-Cluster Sum of Squares: 519.21418
     Between-Cluster SS / Total SS: 94.29%
 Number of iterations performed: 8
 Converged: True
 Call:
kmeans('public.KMeans_iris', 'public.iris', '"PetalWidthCm", "PetalLengthCm"', 3
USING PARAMETERS max_iterations=10, epsilon=0.0001, init_method='kmeanspp', distance_method='euclidean')

Fitting the model creates new model attributes, which make methods easier to use.

In [3]:
model.X
Out[3]:
['"PetalWidthCm"', '"PetalLengthCm"']
In [4]:
model.input_relation
Out[4]:
'public.iris'

These attributes will be used when invoking the different model abstractions. The model could also have other useful attributes. In the case of KMeans, the 'clustercenters' and 'metrics_' attributes can give you useful information about the model.

In [5]:
model.cluster_centers_
123
petalwidthcm
Float
123
petallengthcm
Float
12.047826086956525.62608695652174
21.359259259259264.29259259259259
30.2441.464
Out[5]:

In [6]:
model.metrics_
value
519.21418
550.64347
31.429286
0.9429226137921877
Out[6]:

You can also use the 'get_model_attribute' method to check the other available Vertica attributes.

In [8]:
model.get_model_attribute()
Abc
attr_name
Varchar(128)
Abc
attr_fields
Long varchar(32000000)
123
#_of_rows
Integer
1centerspetalwidthcm, petallengthcm3
2metricsmetrics1
Out[8]:

Let's look at the generated SQL code.

In [7]:
display(model.deploySQL())
APPLY_KMEANS("PetalWidthCm", "PetalLengthCm" USING PARAMETERS model_name = 'public.KMeans_iris', match_by_pos = 'true')

You can also add the prediction to your vDataFrame.

In [9]:
model.predict(iris, name = "pred_Species")
123
SepalLengthCm
Numeric(5,2)
123
SepalWidthCm
Numeric(5,2)
123
PetalLengthCm
Numeric(5,2)
123
PetalWidthCm
Numeric(5,2)
Abc
Species
Varchar(30)
123
pred_Species
Integer
14.303.001.100.10Iris-setosa2
24.402.901.400.20Iris-setosa2
34.403.001.300.20Iris-setosa2
44.403.201.300.20Iris-setosa2
54.502.301.300.30Iris-setosa2
64.603.101.500.20Iris-setosa2
74.603.201.400.20Iris-setosa2
84.603.401.400.30Iris-setosa2
94.603.601.000.20Iris-setosa2
104.703.201.300.20Iris-setosa2
114.703.201.600.20Iris-setosa2
124.803.001.400.10Iris-setosa2
134.803.001.400.30Iris-setosa2
144.803.101.600.20Iris-setosa2
154.803.401.600.20Iris-setosa2
164.803.401.900.20Iris-setosa2
174.902.403.301.00Iris-versicolor1
184.902.504.501.70Iris-virginica1
194.903.001.400.20Iris-setosa2
204.903.101.500.10Iris-setosa2
214.903.101.500.10Iris-setosa2
224.903.101.500.10Iris-setosa2
235.002.003.501.00Iris-versicolor1
245.002.303.301.00Iris-versicolor1
255.003.001.600.20Iris-setosa2
265.003.201.200.20Iris-setosa2
275.003.301.400.20Iris-setosa2
285.003.401.500.20Iris-setosa2
295.003.401.600.40Iris-setosa2
305.003.501.300.30Iris-setosa2
315.003.501.600.60Iris-setosa2
325.003.601.400.20Iris-setosa2
335.102.503.001.10Iris-versicolor1
345.103.301.700.50Iris-setosa2
355.103.401.500.20Iris-setosa2
365.103.501.400.20Iris-setosa2
375.103.501.400.30Iris-setosa2
385.103.701.500.40Iris-setosa2
395.103.801.500.30Iris-setosa2
405.103.801.600.20Iris-setosa2
415.103.801.900.40Iris-setosa2
425.202.703.901.40Iris-versicolor1
435.203.401.400.20Iris-setosa2
445.203.501.500.20Iris-setosa2
455.204.101.500.10Iris-setosa2
465.303.701.500.20Iris-setosa2
475.403.004.501.50Iris-versicolor1
485.403.401.500.40Iris-setosa2
495.403.401.700.20Iris-setosa2
505.403.701.500.20Iris-setosa2
515.403.901.300.40Iris-setosa2
525.403.901.700.40Iris-setosa2
535.502.304.001.30Iris-versicolor1
545.502.403.701.00Iris-versicolor1
555.502.403.801.10Iris-versicolor1
565.502.504.001.30Iris-versicolor1
575.502.604.401.20Iris-versicolor1
585.503.501.300.20Iris-setosa2
595.504.201.400.20Iris-setosa2
605.602.503.901.10Iris-versicolor1
615.602.704.201.30Iris-versicolor1
625.602.804.902.00Iris-virginica0
635.602.903.601.30Iris-versicolor1
645.603.004.101.30Iris-versicolor1
655.603.004.501.50Iris-versicolor1
665.702.505.002.00Iris-virginica0
675.702.603.501.00Iris-versicolor1
685.702.804.101.30Iris-versicolor1
695.702.804.501.30Iris-versicolor1
705.702.904.201.30Iris-versicolor1
715.703.004.201.20Iris-versicolor1
725.703.801.700.30Iris-setosa2
735.704.401.500.40Iris-setosa2
745.802.604.001.20Iris-versicolor1
755.802.703.901.20Iris-versicolor1
765.802.704.101.00Iris-versicolor1
775.802.705.101.90Iris-virginica0
785.802.705.101.90Iris-virginica0
795.802.805.102.40Iris-virginica0
805.804.001.200.20Iris-setosa2
815.903.004.201.50Iris-versicolor1
825.903.005.101.80Iris-virginica0
835.903.204.801.80Iris-versicolor1
846.002.204.001.00Iris-versicolor1
856.002.205.001.50Iris-virginica1
866.002.705.101.60Iris-versicolor0
876.002.904.501.50Iris-versicolor1
886.003.004.801.80Iris-virginica1
896.003.404.501.60Iris-versicolor1
906.102.605.601.40Iris-virginica0
916.102.804.001.30Iris-versicolor1
926.102.804.701.20Iris-versicolor1
936.102.904.701.40Iris-versicolor1
946.103.004.601.40Iris-versicolor1
956.103.004.901.80Iris-virginica1
966.202.204.501.50Iris-versicolor1
976.202.804.801.80Iris-virginica1
986.202.904.301.30Iris-versicolor1
996.203.405.402.30Iris-virginica0
1006.302.304.401.30Iris-versicolor1
Out[9]:
Rows: 1-100 of 150 | Columns: 6

Examine your prediction.

In [10]:
iris.hist(["Species", "pred_Species"])

Some Unsupervised Algorithms can be drawn.

In [12]:
model.plot()

Some Algorithms are constructed using SQL code generation. Let's create a DBSCAN and a LOF.

In [17]:
from verticapy.learn.cluster import DBSCAN
model = DBSCAN("public.DBSCAN_iris")
model.fit("public.iris", ["PetalWidthCm", "PetalLengthCm"])
Out[17]:
<DBSCAN>
Number of Clusters: 6
Number of Outliers: 0
In [18]:
from verticapy.learn.neighbors import LocalOutlierFactor
model2 = LocalOutlierFactor("public.LocalOutlierFactor_iris")
model2.fit("public.iris", ["PetalWidthCm", "PetalLengthCm"])
Out[18]:
<LocalOutlierFactor>

These models store the results in a table. It is possible to play with the result using the 'predict' method.

In [19]:
model.predict()
123
PetalWidthCm
Numeric(5,2)
123
PetalLengthCm
Numeric(5,2)
123
dbscan_cluster
Int
10.101.100
20.101.400
30.101.500
40.101.500
50.101.500
60.101.500
70.201.000
80.201.200
90.201.200
100.201.300
110.201.300
120.201.300
130.201.300
140.201.400
150.201.400
160.201.400
170.201.400
180.201.400
190.201.400
200.201.400
210.201.400
220.201.500
230.201.500
240.201.500
250.201.500
260.201.500
270.201.500
280.201.600
290.201.600
300.201.600
310.201.600
320.201.600
330.201.700
340.201.900
350.301.300
360.301.300
370.301.400
380.301.400
390.301.400
400.301.500
410.301.700
420.401.300
430.401.500
440.401.500
450.401.500
460.401.600
470.401.700
480.401.900
490.501.700
500.601.600
511.003.301
521.003.301
531.003.501
541.003.501
551.003.701
561.004.002
571.004.102
581.103.001
591.103.801
601.103.901
611.203.901
621.204.002
631.204.202
641.204.402
651.204.702
661.303.601
671.304.002
681.304.002
691.304.002
701.304.102
711.304.102
721.304.202
731.304.202
741.304.302
751.304.302
761.304.402
771.304.502
781.304.602
791.403.902
801.404.402
811.404.402
821.404.602
831.404.702
841.404.702
851.404.802
861.405.604
871.504.202
881.504.502
891.504.502
901.504.502
911.504.502
921.504.502
931.504.602
941.504.702
951.504.902
961.504.902
971.505.003
981.505.103
991.604.502
1001.604.702
Out[19]:
Rows: 1-100 of 150 | Columns: 3
In [20]:
model2.predict()
123
PetalWidthCm
Numeric(5,2)
123
PetalLengthCm
Numeric(5,2)
123
lof_score
Float
10.101.102.08234031003641
20.101.401.05722609447314
30.101.501.10299515545165
40.101.501.10299515545165
50.101.501.10299515545165
60.101.501.10299515545165
70.201.002.43694090797296
80.201.201.53695743415418
90.201.201.53695743415418
100.201.301.14067601173295
110.201.301.14067601173295
120.201.301.14067601173295
130.201.301.14067601173295
140.201.400.982463512587469
150.201.400.982463512587469
160.201.400.982463512587469
170.201.400.982463512587469
180.201.400.982463512587469
190.201.400.982463512587469
200.201.400.982463512587469
210.201.400.982463512587469
220.201.501.00386533089101
230.201.501.00386533089101
240.201.501.00386533089101
250.201.501.00386533089101
260.201.501.00386533089101
270.201.501.00386533089101
280.201.601.16605506750783
290.201.601.16605506750783
300.201.601.16605506750783
310.201.601.16605506750783
320.201.601.16605506750783
330.201.701.47475360743923
340.201.902.20482138359511
350.301.301.25406400900897
360.301.301.25406400900897
370.301.401.06843214709229
380.301.401.06843214709229
390.301.401.06843214709229
400.301.501.10661974516951
410.301.701.4276374283454
420.401.301.38388841756981
430.401.501.36155303325323
440.401.501.36155303325323
450.401.501.36155303325323
460.401.601.40306099177471
470.401.701.51794573888052
480.401.902.09963466437765
490.501.701.81377540597986
500.601.601.93923956424429
511.003.301.70757746613585
521.003.301.70757746613585
531.003.501.46018853313619
541.003.501.46018853313619
551.003.701.28788552390156
561.004.001.16146034202009
571.004.101.1277650089115
581.103.002.12104418437916
591.103.801.21079006485531
601.103.901.14844993391747
611.203.901.13380669424528
621.204.001.09283321298032
631.204.201.04785386369406
641.204.401.03244025717094
651.204.701.1239354494644
661.303.601.30419812752566
671.304.001.09294378681154
681.304.001.09294378681154
691.304.001.09294378681154
701.304.101.06133152082905
711.304.101.06133152082905
721.304.20