Decomposition

Decomposition, here in the form of Principal Component Analysis (PCA), uses an orthogonal transformation to convert a set of observations of possibly correlated numerical variables into a set of linearly uncorrelated variables called principal components.

Since some algorithms are sensitive to correlated predictors, it can be a good idea to apply PCA before running them. Some algorithms are also sensitive to the number of predictors, so we need to be selective about which variables we include.
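
To make the definition concrete, here is a minimal pure-Python sketch of PCA on two synthetic, strongly correlated variables. It is not VerticaPy code: the data, the helper names (`mean`, `covariance`, `correlation`), and the closed-form rotation angle for the two-variable case are illustrative assumptions. The rotation is orthogonal, so the rotated variables should come out uncorrelated while the total variance stays the same.

```python
import math
import random

random.seed(0)

# Two strongly correlated synthetic predictors (illustrative data, not Iris).
x = [random.gauss(0, 1) for _ in range(500)]
y = [0.9 * xi + 0.1 * random.gauss(0, 1) for xi in x]


def mean(values):
    """Arithmetic mean of a sequence."""
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Pearson correlation coefficient."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


# Center the data.
mx, my = mean(x), mean(y)
xc = [xi - mx for xi in x]
yc = [yi - my for yi in y]

# For two variables, the rotation angle that diagonalizes the 2x2
# covariance matrix has a closed form.
theta = 0.5 * math.atan2(2 * covariance(x, y), covariance(x, x) - covariance(y, y))
cos_t, sin_t = math.cos(theta), math.sin(theta)

# Orthogonal transformation (a plane rotation) of the centered data.
pc1 = [cos_t * a + sin_t * b for a, b in zip(xc, yc)]
pc2 = [-sin_t * a + cos_t * b for a, b in zip(xc, yc)]

corr_before = correlation(x, y)
corr_after = correlation(pc1, pc2)
total_var_before = covariance(x, x) + covariance(y, y)
total_var_after = covariance(pc1, pc1) + covariance(pc2, pc2)

print(f"correlation before rotation: {corr_before:.4f}")
print(f"correlation after rotation:  {corr_after:.4f}")
print(f"total variance before / after: {total_var_before:.4f} / {total_var_after:.4f}")
```

With more than two variables there is no such closed-form angle, and the components are obtained from an eigendecomposition or a singular value decomposition instead.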

To demonstrate data decomposition in VerticaPy, we’ll use the well-known ‘Iris’ dataset.

[1]:
from verticapy.datasets import load_iris
import verticapy as vp

vp.set_option("plotting_lib","highcharts")

vdf = load_iris()
display(vdf)
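    | Id      | PetalLengthCm | PetalWidthCm | SepalLengthCm | SepalWidthCm | Species
    | Integer | Numeric(8)    | Numeric(8)   | Numeric(8)    | Numeric(8)   | Varchar(30)
----+---------+---------------+--------------+---------------+--------------+----------------
  1 | 1       | 1.4           | 0.2          | 5.1           | 3.5          | Iris-setosa
  2 | 2       | 1.4           | 0.2          | 4.9           | 3.0          | Iris-setosa
  3 | 3       | 1.3           | 0.2          | 4.7           | 3.2          | Iris-setosa
  4 | 4       | 1.5           | 0.2          | 4.6           | 3.1          | Iris-setosa
  5 | 5       | 1.4           | 0.2          | 5.0           | 3.6          | Iris-setosa
  6 | 6       | 1.7           | 0.4          | 5.4           | 3.9          | Iris-setosa
  7 | 7       | 1.4           | 0.3          | 4.6           | 3.4          | Iris-setosa
  8 | 8       | 1.5           | 0.2          | 5.0           | 3.4          | Iris-setosa
  9 | 9       | 1.4           | 0.2          | 4.4           | 2.9          | Iris-setosa
 10 | 10      | 1.5           | 0.1          | 4.9           | 3.1          | Iris-setosa
 11 | 11      | 1.5           | 0.2          | 5.4           | 3.7          | Iris-setosa
 12 | 12      | 1.6           | 0.2          | 4.8           | 3.4          | Iris-setosa
 13 | 13      | 1.4           | 0.1          | 4.8           | 3.0          | Iris-setosa
 14 | 14      | 1.1           | 0.1          | 4.3           | 3.0          | Iris-setosa
 15 | 15      | 1.2           | 0.2          | 5.8           | 4.0          | Iris-setosa
 16 | 16      | 1.5           | 0.4          | 5.7           | 4.4          | Iris-setosa
 17 | 17      | 1.3           | 0.4          | 5.4           | 3.9          | Iris-setosa
 18 | 18      | 1.4           | 0.3          | 5.1           | 3.5          | Iris-setosa
 19 | 19      | 1.7           | 0.3          | 5.7           | 3.8          | Iris-setosa
 20 | 20      | 1.5           | 0.3          | 5.1           | 3.8          | Iris-setosa
 21 | 21      | 1.7           | 0.2          | 5.4           | 3.4          | Iris-setosa
 22 | 22      | 1.5           | 0.4          | 5.1           | 3.7          | Iris-setosa
 23 | 23      | 1.0           | 0.2          | 4.6           | 3.6          | Iris-setosa
 24 | 24      | 1.7           | 0.5          | 5.1           | 3.3          | Iris-setosa
 25 | 25      | 1.9           | 0.2          | 4.8           | 3.4          | Iris-setosa
 26 | 26      | 1.6           | 0.2          | 5.0           | 3.0          | Iris-setosa
 27 | 27      | 1.6           | 0.4          | 5.0           | 3.4          | Iris-setosa
 28 | 28      | 1.5           | 0.2          | 5.2           | 3.5          | Iris-setosa
 29 | 29      | 1.4           | 0.2          | 5.2           | 3.4          | Iris-setosa
 30 | 30      | 1.6           | 0.2          | 4.7           | 3.2          | Iris-setosa
 31 | 31      | 1.6           | 0.2          | 4.8           | 3.1          | Iris-setosa
 32 | 32      | 1.5           | 0.4          | 5.4           | 3.4          | Iris-setosa
 33 | 33      | 1.5           | 0.1          | 5.2           | 4.1          | Iris-setosa
 34 | 34      | 1.4           | 0.2          | 5.5           | 4.2          | Iris-setosa
 35 | 35      | 1.5           | 0.1          | 4.9           | 3.1          | Iris-setosa
 36 | 36      | 1.2           | 0.2          | 5.0           | 3.2          | Iris-setosa
 37 | 37      | 1.3           | 0.2          | 5.5           | 3.5          | Iris-setosa
 38 | 38      | 1.5           | 0.1          | 4.9           | 3.1          | Iris-setosa
 39 | 39      | 1.3           | 0.2          | 4.4           | 3.0          | Iris-setosa
 40 | 40      | 1.5           | 0.2          | 5.1           | 3.4          | Iris-setosa
 41 | 41      | 1.3           | 0.3          | 5.0           | 3.5          | Iris-setosa
 42 | 42      | 1.3           | 0.3          | 4.5           | 2.3          | Iris-setosa
 43 | 43      | 1.3           | 0.2          | 4.4           | 3.2          | Iris-setosa
 44 | 44      | 1.6           | 0.6          | 5.0           | 3.5          | Iris-setosa
 45 | 45      | 1.9           | 0.4          | 5.1           | 3.8          | Iris-setosa
 46 | 46      | 1.4           | 0.3          | 4.8           | 3.0          | Iris-setosa
 47 | 47      | 1.6           | 0.2          | 5.1           | 3.8          | Iris-setosa
 48 | 48      | 1.4           | 0.2          | 4.6           | 3.2          | Iris-setosa
 49 | 49      | 1.5           | 0.2          | 5.3           | 3.7          | Iris-setosa
 50 | 50      | 1.4           | 0.2          | 5.0           | 3.3          | Iris-setosa
 51 | 51      | 4.7           | 1.4          | 7.0           | 3.2          | Iris-versicolor
 52 | 52      | 4.5           | 1.5          | 6.4           | 3.2          | Iris-versicolor
 53 | 53      | 4.9           | 1.5          | 6.9           | 3.1          | Iris-versicolor
 54 | 54      | 4.0           | 1.3          | 5.5           | 2.3          | Iris-versicolor
 55 | 55      | 4.6           | 1.5          | 6.5           | 2.8          | Iris-versicolor
 56 | 56      | 4.5           | 1.3          | 5.7           | 2.8          | Iris-versicolor
 57 | 57      | 4.7           | 1.6          | 6.3           | 3.3          | Iris-versicolor
 58 | 58      | 3.3           | 1.0          | 4.9           | 2.4          | Iris-versicolor
 59 | 59      | 4.6           | 1.3          | 6.6           | 2.9          | Iris-versicolor
 60 | 60      | 3.9           | 1.4          | 5.2           | 2.7          | Iris-versicolor
 61 | 61      | 3.5           | 1.0          | 5.0           | 2.0          | Iris-versicolor
 62 | 62      | 4.2           | 1.5          | 5.9           | 3.0          | Iris-versicolor
 63 | 63      | 4.0           | 1.0          | 6.0           | 2.2          | Iris-versicolor
 64 | 64      | 4.7           | 1.4          | 6.1           | 2.9          | Iris-versicolor
 65 | 65      | 3.6           | 1.3          | 5.6           | 2.9          | Iris-versicolor
 66 | 66      | 4.4           | 1.4          | 6.7           | 3.1          | Iris-versicolor
 67 | 67      | 4.5           | 1.5          | 5.6           | 3.0          | Iris-versicolor
 68 | 68      | 4.1           | 1.0          | 5.8           | 2.7          | Iris-versicolor
 69 | 69      | 4.5           | 1.5          | 6.2           | 2.2          | Iris-versicolor
 70 | 70      | 3.9           | 1.1          | 5.6           | 2.5          | Iris-versicolor
 71 | 71      | 4.8           | 1.8          | 5.9           | 3.2          | Iris-versicolor
 72 | 72      | 4.0           | 1.3          | 6.1           | 2.8          | Iris-versicolor
 73 | 73      | 4.9           | 1.5          | 6.3           | 2.5          | Iris-versicolor
 74 | 74      | 4.7           | 1.2          | 6.1           | 2.8          | Iris-versicolor
 75 | 75      | 4.3           | 1.3          | 6.4           | 2.9          | Iris-versicolor
 76 | 76      | 4.4           | 1.4          | 6.6           | 3.0          | Iris-versicolor
 77 | 77      | 4.8           | 1.4          | 6.8           | 2.8          | Iris-versicolor
 78 | 78      | 5.0           | 1.7          | 6.7           | 3.0          | Iris-versicolor
 79 | 79      | 4.5           | 1.5          | 6.0           | 2.9          | Iris-versicolor
 80 | 80      | 3.5           | 1.0          | 5.7           | 2.6          | Iris-versicolor
 81 | 81      | 3.8           | 1.1          | 5.5           | 2.4          | Iris-versicolor
 82 | 82      | 3.7           | 1.0          | 5.5           | 2.4          | Iris-versicolor
 83 | 83      | 3.9           | 1.2          | 5.8           | 2.7          | Iris-versicolor
 84 | 84      | 5.1           | 1.6          | 6.0           | 2.7          | Iris-versicolor
 85 | 85      | 4.5           | 1.5          | 5.4           | 3.0          | Iris-versicolor
 86 | 86      | 4.5           | 1.6          | 6.0           | 3.4          | Iris-versicolor
 87 | 87      | 4.7           | 1.5          | 6.7           | 3.1          | Iris-versicolor
 88 | 88      | 4.4           | 1.3          | 6.3           | 2.3          | Iris-versicolor
 89 | 89      | 4.1           | 1.3          | 5.6           | 3.0          | Iris-versicolor
 90 | 90      | 4.0           | 1.3          | 5.5           | 2.5          | Iris-versicolor
 91 | 91      | 4.4           | 1.2          | 5.5           | 2.6          | Iris-versicolor
 92 | 92      | 4.6           | 1.4          | 6.1           | 3.0          | Iris-versicolor
 93 | 93      | 4.0           | 1.2          | 5.8           | 2.6          | Iris-versicolor
 94 | 94      | 3.3           | 1.0          | 5.0           | 2.3          | Iris-versicolor
 95 | 95      | 4.2           | 1.3          | 5.6           | 2.7          | Iris-versicolor
 96 | 96      | 4.2           | 1.2          | 5.7           | 3.0          | Iris-versicolor
 97 | 97      | 4.2           | 1.3          | 5.7           | 2.9          | Iris-versicolor
 98 | 98      | 4.3           | 1.3          | 6.2           | 2.9          | Iris-versicolor
 99 | 99      | 3.0           | 1.1          | 5.1           | 2.5          | Iris-versicolor
100 | 100     | 4.1           | 1.3          | 5.7           | 2.8          | Iris-versicolor
Rows: 1-100 | Columns: 6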

Notice that several of the predictors are strongly correlated with one another.

[2]:
vdf.corr()
[2]:

Let’s fit a PCA model on the four numerical predictors.

[3]:
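from verticapy.learn.decomposition import PCA

# Drop any existing object named "pca_iris" so the model name can be reused.
vp.drop("pca_iris")

# Fit the PCA on the four numerical predictors of the "iris" relation.
model = PCA("pca_iris")
model.fit("iris", ["PetalLengthCm",
                   "SepalWidthCm",
                   "SepalLengthCm",
                   "PetalWidthCm"])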
[3]:
=======
columns
=======
index|    name     |  mean  |   sd
-----+-------------+--------+--------
  1  |petallengthcm| 3.75867| 1.76442
  2  |sepalwidthcm | 3.05400| 0.43359
  3  |sepallengthcm| 5.84333| 0.82807
  4  |petalwidthcm | 1.19867| 0.76316

===============
singular_values
===============
index| value  |explained_variance|accumulated_explained_variance
-----+--------+------------------+------------------------------
  1  | 2.05544|      0.92462     |            0.92462
  2  | 0.49218|      0.05302     |            0.97763
  3  | 0.28022|      0.01719     |            0.99482
  4  | 0.15389|      0.00518     |            1.00000

====================
principal_components
====================
index|  PC1   |  PC2   |  PC3   |  PC4
-----+--------+--------+--------+--------
  1  | 0.85657|-0.17577| 0.07252|-0.47972
  2  |-0.08227| 0.72971| 0.59642|-0.32409
  3  | 0.36159| 0.65654|-0.58100| 0.31725
  4  | 0.35884|-0.07471| 0.54906| 0.75112

========
counters
========
   counter_name   |counter_value
------------------+-------------
accepted_row_count|     150
rejected_row_count|      0
 iteration_count  |      1

===========
call_string
===========
SELECT PCA('public.pca_iris', 'iris', '"PetalLengthCm", "SepalWidthCm", "SepalLengthCm", "PetalWidthCm"'
USING PARAMETERS scale=false);
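
As a quick sanity check on the model summary, the sketch below verifies that the `principal_components` matrix is orthonormal (up to the five-decimal rounding of the printed values), which is what an orthogonal transformation requires. The loadings are copied by hand from the output above; the variable names and the `dot` helper are illustrative and not part of VerticaPy.

```python
# Loadings copied from the principal_components table in the model summary.
# Rows are the four input columns, columns are PC1..PC4.
loadings = [
    [0.85657, -0.17577, 0.07252, -0.47972],
    [-0.08227, 0.72971, 0.59642, -0.32409],
    [0.36159, 0.65654, -0.58100, 0.31725],
    [0.35884, -0.07471, 0.54906, 0.75112],
]


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


# One vector per principal component (the columns of the table).
components = [list(col) for col in zip(*loadings)]

# Gram matrix: entry (i, j) is the dot product of component i and component j.
gram = [[dot(u, v) for v in components] for u in components]

# For an orthonormal set the Gram matrix is the identity.
size = len(components)
max_deviation = max(
    abs(gram[i][j] - (1.0 if i == j else 0.0))
    for i in range(size)
    for j in range(size)
)

print(f"largest deviation from the identity matrix: {max_deviation:.2e}")
```

A deviation on the order of the rounding error indicates that each component has unit length and that different components are mutually orthogonal.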

Let’s compute the correlation matrix of the result of the PCA.

[4]:
model.transform().corr()
[4]:

Notice that the new components are linearly uncorrelated. Each one is a linear combination of the original predictors, and together they carry exactly the same amount of information as the original variables. Let’s look at the explained variance of each PCA component.

[5]:
model.explained_variance_
[5]:
array([0.92461621, 0.05301557, 0.01718514, 0.00518309])
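
These ratios can be reproduced from the singular values printed in the model summary: squaring each value and dividing by the sum of the squares gives each component's share of the variance. The sketch below assumes the `value` column holds the square roots of the covariance eigenvalues, which is consistent with the printed numbers. The 0.95 threshold used to pick a number of components is a hypothetical choice for illustration, not a VerticaPy default.

```python
from itertools import accumulate

# Singular values copied from the model summary (rounded to five decimals).
singular_values = [2.05544, 0.49218, 0.28022, 0.15389]

# Each component's variance is the square of its singular value.
variances = [s ** 2 for s in singular_values]
total_variance = sum(variances)

# Share of the total variance per component, and the running total.
ratios = [v / total_variance for v in variances]
cumulative = list(accumulate(ratios))

# Smallest number of components whose cumulative share reaches the
# (hypothetical) threshold.
threshold = 0.95
n_components = next(k for k, c in enumerate(cumulative, start=1) if c >= threshold)

for k, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{k}: explained={r:.5f} cumulative={c:.5f}")
print(f"components needed for {threshold:.0%}: {n_components}")
```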

The first two components carry most of the information: together they account for more than 97.7% of the variance (0.9246 + 0.0530 ≈ 0.9776). We can export this reduced representation to a vDataFrame.

[6]:
model.transform(n_components = 2)
[6]:
    | Id      | Species         | col1                | col2
    | Integer | Varchar(30)     | Float(22)           | Float(22)
----+---------+-----------------+---------------------+--------------------
  1 | 1       | Iris-setosa     | -1.22853483184053   | -2.32797486205867
  2 | 2       | Iris-setosa     | -1.48027473044093   | -2.42192960364832
  3 | 3       | Iris-setosa     | -1.56648109119663   | -2.55060213398962
  4 | 4       | Iris-setosa     | -1.46721003050149   | -2.64393895599043
  5 | 5       | Iris-setosa     | -1.22880940695385   | -2.40109949740075
  6 | 6       | Iris-setosa     | -0.736002504518092  | -2.06768323189023
  7 | 7       | Iris-setosa     | -1.4534409521453    | -2.57580291955543
  8 | 8       | Iris-setosa     | -1.21492098167444   | -2.40373494371661
  9 | 9       | Iris-setosa     | -1.69695396175647   | -2.74272889827773
 10 | 10      | Iris-setosa     | -1.35050623829783   | -2.51994822813733
 11 | 11      | Iris-setosa     | -0.962631932847397  | -2.16353093144279
 12 | 12      | Iris-setosa     | -1.20158170662168   | -2.55261966071664
 13 | 13      | Iris-setosa     | -1.50820680918985   | -2.56055482910955
 14 | 14      | Iris-setosa     | -1.94597327946773   | -2.83609454972387
 15 | 15      | Iris-setosa     | -0.967314515607511  | -1.87059669814037
 16 | 16      | Iris-setosa     | -0.619418059237654  | -1.87292102128627
 17 | 17      | Iris-setosa     | -1.0786313466343    | -1.99737627051877
 18 | 18      | Iris-setosa     | -1.23676172082975   | -2.25500362492602
 19 | 19      | Iris-setosa     | -0.655183104939257  | -1.93622185702363
 20 | 20      | Iris-setosa     | -1.04345133242623   | -2.29499230630939
 21 | 21      | Iris-setosa     | -0.898970689663757  | -2.17627247108801
 22 | 22      | Iris-setosa     | -1.08756261404028   | -2.21455042216324
 23 | 23      | Iris-setosa     | -1.71607412002264   | -2.59340848934362
 24 | 24      | Iris-setosa     | -1.06801265247068   | -2.14685007766231
 25 | 25      | Iris-setosa     | -0.944610075034521  | -2.60534988174524
 26 | 26      | Iris-setosa     | -1.27280134164468   | -2.39142909600546
 27 | 27      | Iris-setosa     | -1.14571754912383   | -2.27536920979418
 28 | 28      | Iris-setosa     | -1.10671865357333   | -2.27989761407295
 29 | 29      | Iris-setosa     | -1.2282602567272    | -2.25485022671658
 30 | 30      | Iris-setosa     | -1.30950945960947   | -2.60333235501822
 31 | 31      | Iris-setosa     | -1.30923488449614   | -2.53020771967613
 32 | 32      | Iris-setosa     | -1.0867388887003    | -1.99517651613698
 33 | 33      | Iris-setosa     | -0.88318540883518   | -2.39769273328662
 34 | 34      | Iris-setosa     | -0.832708212514197  | -2.11765343783886
 35 | 35      | Iris-setosa     | -1.35050623829783   | -2.51994822813733
 36 | 36      | Iris-setosa     | -1.54366139851124   | -2.33606342866101
 37 | 37      | Iris-setosa     | -1.169556171417     | -2.04778216840147
 38 | 38      | Iris-setosa     | -1.35050623829783   | -2.51994822813733
 39 | 39      | Iris-setosa     | -1.7467267796607    | -2.73262280494837
 40 | 40      | Iris-setosa     | -1.1787620139363    | -2.33808095538803
 41 | 41      | Iris-setosa     | -1.35857789909695   | -2.30308087291174
 42 | 42      | Iris-setosa     | -1.96998544928553   | -2.54170305039261
 43 | 43      | Iris-setosa     | -1.67495799441106   | -2.74756409897537
 44 | 44      | Iris-setosa     | -1.12628693447745   | -2.13689738254238
 45 | 45      | Iris-setosa     | -0.709049379299243  | -2.29232803054821
 46 | 46      | Iris-setosa     | -1.52466058716829   | -2.41461235484425
 47 | 47      | Iris-setosa     | -0.949567232907958  | -2.38554028378491
 48 | 48      | Iris-setosa     | -1.51698284840572   | -2.63383286266107
 49 | 49      | Iris-setosa     | -0.998790900585542  | -2.22918491977137
 50 | 50      | Iris-setosa     | -1.33646258482832   | -2.37868755636024
 51 | 51      | Iris-versicolor | 2.07879765689785    | -0.762514728497838
 52 | 52      | Iris-versicolor | 1.68230254042165    | -1.04831394065096
 53 | 53      | Iris-versicolor | 2.16984182860376    | -0.782880313365999
 54 | 54      | Iris-versicolor | 0.622080022488128   | -1.63002278503764
 55 | 55      | Iris-versicolor | 1.66058114818956    | -0.970354104611224
 56 | 56      | Iris-versicolor | 1.30210597373379    | -1.62395174516232
 57 | 57      | Iris-versicolor | 1.84511549737721    | -1.08362081954612
 58 | 58      | Iris-versicolor | -0.133909198051625  | -2.12729389102054
 59 | 59      | Iris-versicolor | 1.74907828653097    | -1.05811323756144
 60 | 60      | Iris-versicolor | 0.563256590254706   | -1.76631936060189
 61 | 61      | Iris-versicolor | -0.0699733797546608 | -2.06691079532367
 62 | 62      | Iris-versicolor | 1.17276728489412    | -1.30891236723827
 63 | 63      | Iris-versicolor | 0.791671135521697   | -1.51319590777918
 64 | 64      | Iris-versicolor | 1.64571376938008    | -1.33098868241458
 65 | 65      | Iris-versicolor | 0.530916503858992   | -1.53888571741862
 66 | 66      | Iris-versicolor | 1.67746472947143    | -0.899275825441488
 67 | 67      | Iris-versicolor | 1.32126201326685    | -1.55860455325261
 68 | 68      | Iris-versicolor | 0.984432373698567   | -1.69943385984672
 69 | 69      | Iris-versicolor | 1.25114067869714    | -1.10491544717309
 70 | 70      | Iris-versicolor | 0.660804342925307   | -1.7076758246585
 71 | 71      | Iris-versicolor | 1.73379866635042    | -1.21040039192452
 72 | 72      | Iris-versicolor | 1.01845579204111    | -1.27345209013366
 73 | 73      | Iris-versicolor | 1.73758166642596    | -1.13198036125648
 74 | 74      | Iris-versicolor | 1.6262831547337     | -1.46946050966637
 75 | 75      | Iris-versicolor | 1.41978871946752    | -1.13669099319001
 76 | 76      | Iris-versicolor | 1.60542136910846    | -0.957459166756568
 77 | 77      | Iris-versicolor | 1.94859936145132    | -0.881516857443856
 78 | 78      | Iris-versicolor | 2.13084293305326    | -0.778351909087227
 79 | 79      | Iris-versicolor | 1.4300134915946     | -1.28851795292478
 80 | 80      | Iris-versicolor | 0.398445750161284   | -1.65215675910461
 81 | 81      | Iris-versicolor | 0.503103772033287   | -1.74828242563072
 82 | 82      | Iris-versicolor | 0.425673450493456   | -1.8036769224205
 83 | 83      | Iris-versicolor | 0.796664174662019   | -1.51833790489569
 84 | 84      | Iris-versicolor | 1.86396108053006    | -1.30606586382231
 85 | 85      | Iris-versicolor | 1.24894407779056    | -1.68991252990978
 86 | 86      | Iris-versicolor | 1.60120856572949    | -1.25289995085965
 87 | 87      | Iris-versicolor | 1.92620947206937    | -0.879034809337434
 88 | 88      | Iris-versicolor | 1.2539806065095     | -1.17509783978044
 89 | 89      | Iris-versicolor | 0.995086949129077   | -1.63424006614645
 90 | 90      | Iris-versicolor | 0.693848807737772   | -1.64496407906465
 91 | 91      | Iris-versicolor | 1.08058893146803    | -1.79571292458227
 92 | 92      | Iris-versicolor | 1.59594095147584    | -1.32088258908521
 93 | 93      | Iris-versicolor | 0.84643699256625    | -1.52844399822506
 94 | 94      | Iris-versicolor | -0.133634622938302  | -2.05416925567845
 95 | 95      | Iris-versicolor | 0.973090981783665   | -1.62940486544881
 96 | 96      | Iris-versicolor | 1.1251300163855     | -1.65913405529338
 97 | 97      | Iris-versicolor | 1.08101873477145    | -1.57869217114723
 98 | 98      | Iris-versicolor | 1.34747078399123    | -1.26799896984718
 99 | 99      | Iris-versicolor | -0.290905390526893  | -1.87775510321563
100 | 100     | Iris-versicolor | 0.959477131617579   | -1.55364478379086
Rows: 1-100 | Columns: 4