VerticaPy

Python API for Vertica Data Science at Scale

Decomposition

This example demonstrates various Decomposition Model methods with the Iris dataset.

In [1]:
from verticapy.learn.datasets import load_iris
iris = load_iris()
display(iris)
123
SepalLengthCm
Numeric(5,2)
123
SepalWidthCm
Numeric(5,2)
123
PetalLengthCm
Numeric(5,2)
123
PetalWidthCm
Numeric(5,2)
Abc
Species
Varchar(30)
14.303.001.100.10Iris-setosa
24.402.901.400.20Iris-setosa
34.403.001.300.20Iris-setosa
44.403.201.300.20Iris-setosa
54.502.301.300.30Iris-setosa
64.603.101.500.20Iris-setosa
74.603.201.400.20Iris-setosa
84.603.401.400.30Iris-setosa
94.603.601.000.20Iris-setosa
104.703.201.300.20Iris-setosa
114.703.201.600.20Iris-setosa
124.803.001.400.10Iris-setosa
134.803.001.400.30Iris-setosa
144.803.101.600.20Iris-setosa
154.803.401.600.20Iris-setosa
164.803.401.900.20Iris-setosa
174.902.403.301.00Iris-versicolor
184.902.504.501.70Iris-virginica
194.903.001.400.20Iris-setosa
204.903.101.500.10Iris-setosa
214.903.101.500.10Iris-setosa
224.903.101.500.10Iris-setosa
235.002.003.501.00Iris-versicolor
245.002.303.301.00Iris-versicolor
255.003.001.600.20Iris-setosa
265.003.201.200.20Iris-setosa
275.003.301.400.20Iris-setosa
285.003.401.500.20Iris-setosa
295.003.401.600.40Iris-setosa
305.003.501.300.30Iris-setosa
315.003.501.600.60Iris-setosa
325.003.601.400.20Iris-setosa
335.102.503.001.10Iris-versicolor
345.103.301.700.50Iris-setosa
355.103.401.500.20Iris-setosa
365.103.501.400.20Iris-setosa
375.103.501.400.30Iris-setosa
385.103.701.500.40Iris-setosa
395.103.801.500.30Iris-setosa
405.103.801.600.20Iris-setosa
415.103.801.900.40Iris-setosa
425.202.703.901.40Iris-versicolor
435.203.401.400.20Iris-setosa
445.203.501.500.20Iris-setosa
455.204.101.500.10Iris-setosa
465.303.701.500.20Iris-setosa
475.403.004.501.50Iris-versicolor
485.403.401.500.40Iris-setosa
495.403.401.700.20Iris-setosa
505.403.701.500.20Iris-setosa
515.403.901.300.40Iris-setosa
525.403.901.700.40Iris-setosa
535.502.304.001.30Iris-versicolor
545.502.403.701.00Iris-versicolor
555.502.403.801.10Iris-versicolor
565.502.504.001.30Iris-versicolor
575.502.604.401.20Iris-versicolor
585.503.501.300.20Iris-setosa
595.504.201.400.20Iris-setosa
605.602.503.901.10Iris-versicolor
615.602.704.201.30Iris-versicolor
625.602.804.902.00Iris-virginica
635.602.903.601.30Iris-versicolor
645.603.004.101.30Iris-versicolor
655.603.004.501.50Iris-versicolor
665.702.505.002.00Iris-virginica
675.702.603.501.00Iris-versicolor
685.702.804.101.30Iris-versicolor
695.702.804.501.30Iris-versicolor
705.702.904.201.30Iris-versicolor
715.703.004.201.20Iris-versicolor
725.703.801.700.30Iris-setosa
735.704.401.500.40Iris-setosa
745.802.604.001.20Iris-versicolor
755.802.703.901.20Iris-versicolor
765.802.704.101.00Iris-versicolor
775.802.705.101.90Iris-virginica
785.802.705.101.90Iris-virginica
795.802.805.102.40Iris-virginica
805.804.001.200.20Iris-setosa
815.903.004.201.50Iris-versicolor
825.903.005.101.80Iris-virginica
835.903.204.801.80Iris-versicolor
846.002.204.001.00Iris-versicolor
856.002.205.001.50Iris-virginica
866.002.705.101.60Iris-versicolor
876.002.904.501.50Iris-versicolor
886.003.004.801.80Iris-virginica
896.003.404.501.60Iris-versicolor
906.102.605.601.40Iris-virginica
916.102.804.001.30Iris-versicolor
926.102.804.701.20Iris-versicolor
936.102.904.701.40Iris-versicolor
946.103.004.601.40Iris-versicolor
956.103.004.901.80Iris-virginica
966.202.204.501.50Iris-versicolor
976.202.804.801.80Iris-virginica
986.202.904.301.30Iris-versicolor
996.203.405.402.30Iris-virginica
1006.302.304.401.30Iris-versicolor
Rows: 1-100 of 150 | Columns: 5

Let's create a PCA model of the different flowers.

In [2]:
from verticapy.learn.decomposition import PCA
model = PCA("public.PCA_iris")
model.fit("public.iris", ["PetalWidthCm", "PetalLengthCm", "SepalLengthCm", "SepalWidthCm"])
Out[2]:

=======
columns
=======
index|    name     |  mean  |   sd   
-----+-------------+--------+--------
  1  |petalwidthcm | 1.19867| 0.76316
  2  |petallengthcm| 3.75867| 1.76442
  3  |sepallengthcm| 5.84333| 0.82807
  4  |sepalwidthcm | 3.05400| 0.43359


===============
singular_values
===============
index| value  |explained_variance|accumulated_explained_variance
-----+--------+------------------+------------------------------
  1  | 2.05544|      0.92462     |            0.92462           
  2  | 0.49218|      0.05302     |            0.97763           
  3  | 0.28022|      0.01719     |            0.99482           
  4  | 0.15389|      0.00518     |            1.00000           


====================
principal_components
====================
index|  PC1   |  PC2   |  PC3   |  PC4   
-----+--------+--------+--------+--------
  1  | 0.35884|-0.07471| 0.54906| 0.75112
  2  | 0.85657|-0.17577| 0.07252|-0.47972
  3  | 0.36159| 0.65654|-0.58100| 0.31725
  4  |-0.08227| 0.72971| 0.59642|-0.32409


========
counters
========
   counter_name   |counter_value
------------------+-------------
accepted_row_count|     150     
rejected_row_count|      0      
 iteration_count  |      1      


===========
call_string
===========
SELECT PCA('public.PCA_iris', 'public.iris', '"PetalWidthCm", "PetalLengthCm", "SepalLengthCm", "SepalWidthCm"'
USING PARAMETERS scale=false);

Fitting the model creates new model attributes, which make methods easier to use.

In [3]:
model.X
Out[3]:
['"PetalWidthCm"', '"PetalLengthCm"', '"SepalLengthCm"', '"SepalWidthCm"']
In [4]:
model.input_relation
Out[4]:
'public.iris'

These attributes will be used when invoking the different model abstractions. The model could also have other useful attributes. In the case of PCA, the 'components_', 'explainedvariance' and 'mean_' attributes can give you useful information about the model.

In [5]:
model.components_
123
PC1
Float
123
PC2
Float
123
PC3
Float
123
PC4
Float
10.358843926248216-0.07470647013503420.5490609107266030.751120560380823
20.856572105290528-0.1757674034286540.0725240754869635-0.47971898732994
30.361589677381450.656539883285831-0.5809972798276180.31725454716854
4-0.08226888989221420.7297123713264970.596418087938103-0.324094352417966
Out[5]:

In [6]:
model.explained_variance_
123
value
Float
123
explained_variance
Float
123
accumulated_explained_variance
Float
12.055441745299560.9246162071742680.924616207174268
20.4921824576592660.05301556785053510.977631775024803
30.2802211770979390.01718513952500680.99481691454981
40.1538929079782450.005183085450189610.999999999999999
Out[6]:

In [7]:
model.mean_
Abc
name
Varchar(65000)
123
mean
Float
123
sd
Float
1petalwidthcm1.198666666666670.763160741700841
2petallengthcm3.758666666666671.76442041995226
3sepallengthcm5.843333333333330.828066127977863
4sepalwidthcm3.0540.433594311362174
Out[7]:

Some other attributes might be useful. You can get the Vertica attributes using the 'get_model_attribute' method.

In [13]:
model.get_model_attribute()
Abc
attr_name
Varchar(128)
Abc
Long varchar(32000000)
123
#_of_rows
Integer
1columns4
2singular_values4
3principal_components4
4counters3
5call_string1
Out[13]:

Looking at the SQL code can help you understand how Vertica works.

In [8]:
display(model.deploySQL())
APPLY_PCA("PetalWidthCm", "PetalLengthCm", "SepalLengthCm", "SepalWidthCm" USING PARAMETERS model_name = 'public.PCA_iris', match_by_pos = 'true', cutoff = 1)

It is also possible to deploy the inverse PCA.

In [9]:
display(model.deployInverseSQL())
APPLY_INVERSE_PCA("PetalWidthCm", "PetalLengthCm", "SepalLengthCm", "SepalWidthCm" USING PARAMETERS model_name = 'public.PCA_iris', match_by_pos = 'true')

It is also possible to use the 'transform' method to apply the model on a vDataFrame. You can choose the number of components to keep.

In [10]:
model.transform(iris, n_components = 2)
123
col1
Float
123
col2
Float
1-3.22520044627498-0.503279909485424
2-2.88795856533563-0.57079802633159
3-2.98184266485391-0.480250048856075
4-2.99829644283235-0.334307574590776
5-2.85221108156639-0.932865367469544
6-2.74643719730874-0.31112431575199
7-2.84032129682701-0.220576338276475
8-2.82089068218063-0.0821045110246793
9-3.215857694900110.141615571625585
10-2.88981953961792-0.137345609605026
11-2.63284790803076-0.190075830633622
12-2.7874339759971-0.227740188871105
13-2.71566519074746-0.242681482898112
14-2.58846205130339-0.197393079437689
15-2.613142718271060.0215206319602596
16-2.3561710866839-0.0312095890683366
17-0.751467140648227-1.00110751297439
180.519383245084935-1.19135168905065
19-2.71539061563413-0.169556847556025
20-2.67384468671912-0.106691703752737
21-2.67384468671912-0.106691703752737
22-2.67384468671912-0.106691703752737
23-0.511086195895091-1.26249195386214
24-0.707081283920861-1.00842476177846
25-2.50791722683788-0.139056339913173
26-2.866999846932530.0771930957235882
27-2.703912314863650.115010852170507
28-2.626481993323820.170405348960291
29-2.469055997545120.137887314590419
30-2.770138910746320.271059419765168
31-2.40551410128470.195917257696062
32-2.728592981831310.333924563568456
33-0.908463333123496-0.751568725169485
34-2.303128537663880.105522678429983
35-2.590323025585670.236059337288874
36-2.684207125103950.326607314764389
37-2.648322732479130.319136667750886
38-2.54323490730370.440031754659817
39-2.587346188917740.520473638805969
40-2.537573371013510.510367545476607
41-2.208832954176710.442696030421005
42-0.0101900707278539-0.7205748667019
43-2.639821268376580.319290065960323
44-2.562390946836750.374684562750107
45-2.64763667339690.819982632559509
46-2.542685757077050.58628102534399
470.58727485357991-0.483284267717481
48-2.41007733712160.418080008247617
49-2.310531701313130.397867821588893
50-2.50652678933890.651935013672573
51-2.622526203125810.818089674596596
52-2.27989736100960.747782713225135
530.180967206347698-0.825603943576112
54-0.191884492103146-0.677490544374356
55-0.0703428889492723-0.702537931730725
560.164513428369255-0.679661469310813
570.463030988871423-0.669526546536121
58-2.625228464680420.606800008421587
59-2.597159477075921.10002192800727
600.043246400328704-0.581489446612358
610.355533039187063-0.503218487402661
621.19809737227496-0.605578961764593
63-0.174864001965697-0.25181557108017
640.245195161690345-0.266728035661847
650.6595927890562-0.351976291060315
661.34459421750982-0.776415425176824
67-0.307334755663405-0.365086612766159
680.297807907406933-0.347016521598563
690.640436749523145-0.417323482970024
700.375238228946765-0.291622024808778
710.331126947332722-0.211180140662625
72-2.199077961430760.879244088091738
73-2.383869323799381.34475434455986
740.228879049969647-0.40225762017891
750.134994950451373-0.311709642703395
760.234540586259835-0.331921829362119
771.41407222517376-0.574925055912304
781.41407222517376-0.574925055912304
791.58526729930864-0.539307053847171
80-2.643543216941151.18619489941345
810.511098060683477-0.102284105045969
821.38966613331942-0.282886709172268
831.11624072375382-0.0842140138783721
840.26233575615318-0.547893298025336
851.29832982456781-0.761013936521507
861.37873698277558-0.421205138214627
870.812455548998001-0.162331574878632
881.1688534694704-0.164502499815089
890.8072054966767160.195053963771113
901.77964010689857-0.501464794706014
910.356786567830461-0.0668238279413654
920.920502648909008-0.18239036332792
930.98404454516943-0.124360420222277
940.890160445651156-0.0338124427467615
951.2906696477376-0.116425251829371
960.942361707398841-0.541822258150014
971.25762518292514-0.179136997423222
980.6416902781665430.0190711764912717
991.901629075288230.115876748247967
1000.812867790369069-0.370678998318909
Out[10]:
Rows: 1-100 of 150 | Columns: 2

Or the minimal cumulative explained variance.

In [11]:
model.transform(iris, cutoff = 0.8)
123
col1
Float
1-3.22520044627498
2-2.88795856533563
3-2.98184266485391
4-2.99829644283235
5-2.85221108156639
6-2.74643719730874
7-2.84032129682701
8-2.82089068218063
9-3.21585769490011
10-2.88981953961792
11-2.63284790803076
12-2.7874339759971
13-2.71566519074746
14-2.58846205130339
15-2.61314271827106
16-2.3561710866839
17-0.751467140648227
180.519383245084935
19-2.71539061563413
20-2.67384468671912
21-2.67384468671912
22-2.67384468671912
23-0.511086195895091
24-0.707081283920861
25-2.50791722683788
26-2.86699984693253
27-2.70391231486365
28-2.62648199332382
29-2.46905599754512
30-2.77013891074632
31-2.4055141012847
32-2.72859298183131
33-0.908463333123496
34-2.30312853766388
35-2.59032302558567
36-2.68420712510395
37-2.64832273247913
38-2.5432349073037
39-2.58734618891774
40-2.53757337101351
41-2.20883295417671
42-0.0101900707278539
43-2.63982126837658
44-2.56239094683675
45-2.6476366733969
46-2.54268575707705
470.58727485357991
48-2.4100773371216
49-2.31053170131313
50-2.5065267893389
51-2.62252620312581
52-2.2798973610096
530.180967206347698
54-0.191884492103146
55-0.0703428889492723
560.164513428369255
570.463030988871423
58-2.62522846468042
59-2.59715947707592
600.043246400328704
610.355533039187063
621.19809737227496
63-0.174864001965697
640.245195161690345
650.6595927890562
661.34459421750982
67-0.307334755663405
680.297807907406933
690.640436749523145
700.375238228946765
710.331126947332722
72-2.19907796143076
73-2.38386932379938
740.228879049969647
750.134994950451373
760.234540586259835
771.41407222517376
781.41407222517376
791.58526729930864
80-2.64354321694115
810.511098060683477
821.38966613331942
831.11624072375382
840.26233575615318
851.29832982456781
861.37873698277558
870.812455548998001
881.1688534694704
890.807205496676716
901.77964010689857
910.356786567830461
920.920502648909008
930.98404454516943
940.890160445651156
951.2906696477376
960.942361707398841
971.25762518292514
980.641690278166543
991.90162907528823
1000.812867790369069
Out[11]:
Rows: 1-100 of 150 | Column: col1 | Type: float