Iris

This example uses the 'iris' dataset to predict the species of various flowers based on their physical features. You can download the Jupyter Notebook of the study here. The dataset has the following columns:

  • PetalLengthCm: Petal Length in cm
  • PetalWidthCm: Petal Width in cm
  • SepalLengthCm: Sepal Length in cm
  • SepalWidthCm: Sepal Width in cm
  • Species: The Flower Species (Setosa, Virginica, Versicolor)

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [24]:
import verticapy as vp
vp.__version__
Out[24]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN". For details on how to create a connection, see the connection tutorial.

In [1]:
vp.connect("VerticaDSN")

Let's create a Virtual DataFrame of the dataset.

In [11]:
from verticapy.datasets import load_iris
import verticapy.stats as st
iris = load_iris()
iris.head(5)
Out[11]:
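   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species
   | Numeric(5,2)  | Numeric(5,2) | Numeric(5,2)  | Numeric(5,2) | Varchar(30)
---+---------------+--------------+---------------+--------------+------------
 1 | 4.3           | 3.0          | 1.1           | 0.1          | Iris-setosa
 2 | 4.4           | 2.9          | 1.4           | 0.2          | Iris-setosa
 3 | 4.4           | 3.0          | 1.3           | 0.2          | Iris-setosa
 4 | 4.4           | 3.2          | 1.3           | 0.2          | Iris-setosa
 5 | 4.5           | 2.3          | 1.3           | 0.3          | Iris-setosa
Rows: 1-5 | Columns: 5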

Data Exploration and Preparation

Let's explore the data by displaying descriptive statistics of all the columns.

In [12]:
iris.describe(method = "categorical", unique=True)
Out[12]:
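                | dtype        | count | top         | top_percent | unique
----------------+--------------+-------+-------------+-------------+-------
"SepalLengthCm" | numeric(5,2) | 150   | 5.0         | 6.667       | 35.0
"SepalWidthCm"  | numeric(5,2) | 150   | 3.0         | 17.333      | 23.0
"PetalLengthCm" | numeric(5,2) | 150   | 1.5         | 9.333       | 43.0
"PetalWidthCm"  | numeric(5,2) | 150   | 0.2         | 18.667      | 22.0
"Species"       | varchar(30)  | 150   | Iris-setosa | 33.333      | 3.0
Rows: 1-5 | Columns: 6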
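
To make the columns of this summary concrete, here is a minimal pure-Python sketch of how 'count', 'top', 'top_percent', and 'unique' can be derived for a categorical column. It does not call VerticaPy; the function name `describe_categorical` is hypothetical, and the 50/50/50 species split is an assumption (consistent with the 33.333 top_percent shown above). How Vertica breaks ties between equally frequent values is not covered here.

```python
from collections import Counter


def describe_categorical(values):
    """Summarize a categorical column (hypothetical helper, not a VerticaPy API)."""
    counts = Counter(values)
    # most_common(1) returns the most frequent value; on ties, Counter keeps
    # the value that was seen first.
    top, top_count = counts.most_common(1)[0]
    return {
        "count": len(values),
        "top": top,
        "top_percent": round(100 * top_count / len(values), 3),
        "unique": len(counts),
    }


# Assumed class balance: 50 flowers per species.
species = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
summary = describe_categorical(species)
print(summary)
```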

We don't have much data here, but that's okay. Different flower species differ in the proportions of their petals and sepals, so we can start by computing the width-to-length ratio of each.

We'll also apply One-Hot Encoding to 'Species' to get one binary column per species.

In [13]:
iris["Species"].one_hot_encode(drop_first = False)
iris["ratio_pwl"] = iris["PetalWidthCm"] / iris["PetalLengthCm"]
iris["ratio_swl"] = iris["SepalWidthCm"] / iris["SepalLengthCm"]
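
As a rough illustration of what these three statements produce, the sketch below applies the same transformations to a few rows in plain Python. The rows are copied from outputs shown in this example; the helper name `add_features` is hypothetical, and the code does not reflect how VerticaPy generates SQL internally.

```python
SPECIES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# A few rows copied from the outputs in this example.
rows = [
    {"SepalLengthCm": 4.3, "SepalWidthCm": 3.0, "PetalLengthCm": 1.1,
     "PetalWidthCm": 0.1, "Species": "Iris-setosa"},
    {"SepalLengthCm": 4.9, "SepalWidthCm": 2.4, "PetalLengthCm": 3.3,
     "PetalWidthCm": 1.0, "Species": "Iris-versicolor"},
    {"SepalLengthCm": 4.9, "SepalWidthCm": 2.5, "PetalLengthCm": 4.5,
     "PetalWidthCm": 1.7, "Species": "Iris-virginica"},
]


def add_features(row):
    """Return a copy of the row with one-hot columns and the two ratios."""
    out = dict(row)
    # One binary column per species (equivalent to drop_first = False).
    for name in SPECIES:
        out["Species_" + name] = int(row["Species"] == name)
    out["ratio_pwl"] = row["PetalWidthCm"] / row["PetalLengthCm"]
    out["ratio_swl"] = row["SepalWidthCm"] / row["SepalLengthCm"]
    return out


features = [add_features(r) for r in rows]
for f in features:
    print(f["Species"], round(f["ratio_pwl"], 3), round(f["ratio_swl"], 3))
```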

We can draw the correlation matrix (Pearson correlation coefficient) of the new features to see if there are some linear links.

In [14]:
%matplotlib inline
iris.corr()
Out[14]:
 | "SepalLengthCm" | "SepalWidthCm" | "PetalLengthCm" | "PetalWidthCm" | "Species_Iris-setosa" | "Species_Iris-versicolor" | "Species_Iris-virginica" | "ratio_pwl" | "ratio_swl"
"SepalLengthCm" | 1.0 | -0.109369249950656 | 0.871754157304886 | 0.817953633369181 | -0.717415668686111 | 0.0793955238434444 | 0.63802014484266 | 0.645854809352185 | -0.724085081370438
"SepalWidthCm" | -0.109369249950656 | 1.0 | -0.420516096401169 | -0.356544089613812 | 0.595600845226849 | -0.464699560561606 | -0.13090128466524 | -0.339854865893446 | 0.755415996230207
"PetalLengthCm" | 0.871754157304886 | -0.420516096401169 | 1.0 | 0.962757097050968 | -0.922688332883109 | 0.201586759537506 | 0.721101573345601 | 0.8129592586072 | -0.867296724293694
"PetalWidthCm" | 0.817953633369181 | -0.356544089613812 | 0.962757097050968 | 1.0 | -0.887509958782658 | 0.118375979139307 | 0.76913397964335 | 0.910838299356941 | -0.796309230951869
"Species_Iris-setosa" | -0.717415668686111 | 0.595600845226849 | -0.922688332883109 | -0.887509958782658 | 1.0 | -0.5 | -0.5 | -0.825149778706552 | 0.907112624976622
"Species_Iris-versicolor" | 0.0793955238434444 | -0.464699560561606 | 0.201586759537506 | 0.118375979139307 | -0.5 | 1.0 | -0.5 | 0.212967593449348 | -0.409714538856098
"Species_Iris-virginica" | 0.63802014484266 | -0.13090128466524 | 0.721101573345601 | 0.76913397964335 | -0.5 | -0.5 | 1.0 | 0.612182185257205 | -0.497398086120528
"ratio_pwl" | 0.645854809352185 | -0.339854865893446 | 0.8129592586072 | 0.910838299356941 | -0.825149778706552 | 0.212967593449348 | 0.612182185257205 | 1.0 | -0.689720454015874
"ratio_swl" | -0.724085081370438 | 0.755415996230207 | -0.867296724293694 | -0.796309230951869 | 0.907112624976622 | -0.409714538856098 | -0.497398086120528 | -0.689720454015874 | 1.0
Rows: 1-9 | Columns: 10
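
The -0.5 entries between the one-hot species columns are not a coincidence. The sketch below computes the Pearson coefficient by hand for two indicator columns, assuming a balanced dataset of 50 flowers per species (an assumption consistent with the 150 rows and three species in this example). The function `pearson` is a plain-Python illustration, not the routine VerticaPy runs in the database.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Assumed balanced classes: 50 rows per species.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
is_setosa = [int(s == "setosa") for s in species]
is_versicolor = [int(s == "versicolor") for s in species]

r = pearson(is_setosa, is_versicolor)
print(round(r, 6))  # -0.5
```

Two mutually exclusive indicators that each cover one third of the rows have covariance -1/9 and variance 2/9, which gives exactly -0.5.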

Iris setosa is strongly linearly correlated with both the petal length and the sepal ratio ('ratio_swl'). These two features separate it perfectly from the other species (though petal length alone already gives this separation).

In [15]:
iris.scatter(columns = ["PetalLengthCm", "ratio_swl"], 
             catcol = "Species")
Out[15]:
<AxesSubplot:xlabel='"PetalLengthCm"', ylabel='"ratio_swl"'>

We can see a clear linear separation between Iris setosa and the other species, but we'll need more features to identify the differences between Iris virginica and Iris versicolor.

In [16]:
iris.scatter(columns = ["PetalLengthCm", 
                        "PetalWidthCm", 
                        "SepalLengthCm"], 
             catcol = "Species")
Out[16]:
<Axes3DSubplot:xlabel='"PetalLengthCm"', ylabel='"PetalWidthCm"'>

Our strategy is simple: we'll use two Linear Support Vector Classification (SVC) models, one to classify Iris setosa and another to classify Iris virginica.
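
One way two binary classifiers could be combined into a three-class decision is sketched below. This is an illustrative assumption, not the study's stated method: the function name `classify_species`, the 0.5 threshold, and the order of the checks are all hypothetical.

```python
def classify_species(p_setosa, p_virginica, threshold=0.5):
    """Combine two binary probabilities into one label (hypothetical rule).

    p_setosa:    probability from a setosa-vs-rest model
    p_virginica: probability from a virginica-vs-rest model
    """
    if p_setosa >= threshold:
        return "Iris-setosa"
    if p_virginica >= threshold:
        return "Iris-virginica"
    # Neither model claims the flower, so it falls into the remaining class.
    return "Iris-versicolor"


print(classify_species(0.80, 0.05))  # Iris-setosa
print(classify_species(0.10, 0.90))  # Iris-virginica
print(classify_species(0.20, 0.30))  # Iris-versicolor
```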

Machine Learning

Let's build the first Linear SVC to predict if a flower is an Iris setosa.

In [17]:
from verticapy.learn.svm import LinearSVC
from verticapy.learn.model_selection import cross_validate

predictors = ["PetalLengthCm", "ratio_swl"]
response = "Species_Iris-setosa"
model = LinearSVC("svc_setosa_iris")
cross_validate(model, iris, predictors, response)

Out[17]:
 | auc | prc_auc | accuracy | log_loss | precision | recall | f1_score | mcc | informedness | markedness | csi | time
1-fold | 1.0 | 0.9999999999999999 | 1.0 | 0.0852135991098265 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.638392686843872
2-fold | 1.0 | 0.9999999999999999 | 1.0 | 0.0808943478921133 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.5792129039764404
3-fold | 0.9999999999999999 | 1.0 | 1.0 | 0.0744378576492842 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.4164810180664062
avg | 1.0 | 0.9999999999999999 | 1.0 | 0.08018193488374134 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.544695536295573
std | 6.409875621278546e-17 | 6.409875621278546e-17 | 0.0 | 0.005423080326436787 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11491206937746823
Rows: 1-5 | Columns: 13
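
For readers unfamiliar with the metric columns, the sketch below computes several of them from a binary confusion matrix using the standard textbook definitions. The counts are hypothetical and chosen only for illustration; whether VerticaPy uses exactly these formulas internally is an assumption.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)  # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "informedness": recall + specificity - 1,
        "markedness": precision + npv - 1,
        "csi": tp / (tp + fp + fn),
    }


# Hypothetical fold of 50 rows: 20 true positives, 2 false positives,
# 0 false negatives, 28 true negatives.
metrics = binary_metrics(tp=20, fp=2, fn=0, tn=28)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A perfect classifier (no false positives or false negatives) yields 1.0 for every metric in this list, which is what the table above shows for each fold.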

Our model is excellent: accuracy, precision, and recall are 1.0 on every fold. Let's fit it on the entire dataset.

In [18]:
model.fit(iris, predictors, response)
Out[18]:

=======
details
=======
  predictor  |coefficient
-------------+-----------
  Intercept  |  1.38349  
petallengthcm| -0.84012  
  ratio_swl  |  1.32517  


===========
call_string
===========
SELECT svm_classifier('public.svc_setosa_iris', '"public"."_verticapy_tmp_view_dbadmin_40328_9186882012_"', '"species_iris-setosa"', '"PetalLengthCm", "ratio_swl"'
USING PARAMETERS class_weights='1,1', C=1, max_iterations=100, intercept_mode='regularized', intercept_scaling=1, epsilon=0.0001);

===============
Additional Info
===============
       Name       |Value
------------------+-----
accepted_row_count| 150 
rejected_row_count|  0  
 iteration_count  |  7  

Let's plot the model to see the perfect separation.

In [19]:
model.plot()
Out[19]:
<AxesSubplot:xlabel='"PetalLengthCm"', ylabel='"ratio_swl"'>

We can add this probability to the vDataFrame.

In [20]:
model.predict_proba(iris, name = "setosa", pos_label=1)
Out[20]:
 | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | Species_Iris-setosa | Species_Iris-versicolor | Species_Iris-virginica | ratio_pwl | ratio_swl | setosa
 | Numeric(5,2) | Numeric(5,2) | Numeric(5,2) | Numeric(5,2) | Varchar(30) | Bool | Bool | Bool | Numeric(20,15) | Numeric(20,15) | Float
1 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa | 1 | 0 | 0 | 0.090909090909091 | 0.697674418604651 | 0.799616133475824
2 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.142857142857143 | 0.659090909090909 | 0.746632427676863
3 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.153846153846154 | 0.681818181818182 | 0.767609329878423
4 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.153846153846154 | 0.727272727272727 | 0.778180760876583
5 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa | 1 | 0 | 0 | 0.230769230769231 | 0.511111111111111 | 0.724849376339106
6 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.133333333333333 | 0.673913043478261 | 0.734263300128653
7 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.142857142857143 | 0.695652173913043 | 0.755687832710383
8 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 1 | 0 | 0 | 0.214285714285714 | 0.739130434782609 | 0.766167839545688
9 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.2 | 0.782608695652174 | 0.82926980908022
10 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.153846153846154 | 0.680851063829787 | 0.76738063375719
11 | 4.7 | 3.2 | 1.6 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.125 | 0.680851063829787 | 0.719411418945644
12 | 4.8 | 3.0 | 1.4 | 0.1 | Iris-setosa | 1 | 0 | 0 | 0.071428571428571 | 0.625 | 0.737991521269184
13 | 4.8 | 3.0 | 1.4 | 0.3 | Iris-setosa | 1 | 0 | 0 | 0.214285714285714 | 0.625 | 0.737991521269184
14 | 4.8 | 3.1 | 1.6 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.125 | 0.645833333333333 | 0.709949644614095
15 | 4.8 | 3.4 | 1.6 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.125 | 0.708333333333333 | 0.726703765068597
16 | 4.8 | 3.4 | 1.9 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.105263157894737 | 0.708333333333333 | 0.673910864793945
17 | 4.9 | 2.4 | 3.3 | 1.0 | Iris-versicolor | 0 | 1 | 0 | 0.303030303030303 | 0.489795918367347 | 0.323039614860416
18 | 4.9 | 2.5 | 4.5 | 1.7 | Iris-virginica | 0 | 0 | 1 | 0.377777777777778 | 0.510204081632653 | 0.151750696077398
19 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.142857142857143 | 0.612244897959184 | 0.734710097493698
20 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 1 | 0 | 0 | 0.066666666666667 | 0.63265306122449 | 0.723459090938328
21 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 1 | 0 | 0 | 0.066666666666667 | 0.63265306122449 | 0.723459090938328
22 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 1 | 0 | 0 | 0.066666666666667 | 0.63265306122449 | 0.723459090938328
23 | 5.0 | 2.0 | 3.5 | 1.0 | Iris-versicolor | 0 | 1 | 0 | 0.285714285714286 | 0.4 | 0.263694030318376
24 | 5.0 | 2.3 | 3.3 | 1.0 | Iris-versicolor | 0 | 1 | 0 | 0.303030303030303 | 0.46 | 0.314465933883348
25 | 5.0 | 3.0 | 1.6 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.125 | 0.6 | 0.697285021121722
26 | 5.0 | 3.2 | 1.2 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.166666666666667 | 0.64 | 0.772671290753683
27 | 5.0 | 3.3 | 1.4 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.142857142857143 | 0.66 | 0.746860255947267
28 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.133333333333333 | 0.68 | 0.735834215605851
29 | 5.0 | 3.4 | 1.6 | 0.4 | Iris-setosa | 1 | 0 | 0 | 0.25 | 0.68 | 0.719183705655739
30 | 5.0 | 3.5 | 1.3 | 0.3 | Iris-setosa | 1 | 0 | 0 | 0.230769230769231 | 0.7 | 0.771879602615583
31 | 5.0 | 3.5 | 1.6 | 0.6 | Iris-setosa | 1 | 0 | 0 | 0.375 | 0.7 | 0.724505067474212
32 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.142857142857143 | 0.72 | 0.761595468468724
33 | 5.1 | 2.5 | 3.0 | 1.1 | Iris-versicolor | 0 | 1 | 0 | 0.366666666666667 | 0.490196078431373 | 0.380536424820992
34 | 5.1 | 3.3 | 1.7 | 0.5 | Iris-setosa | 1 | 0 | 0 | 0.294117647058824 | 0.647058823529412 | 0.692695552384892
35 | 5.1 | 3.4 | 1.5 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.133333333333333 | 0.666666666666667 | 0.732385408664398
36 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.142857142857143 | 0.686274509803922 | 0.753386230741964
37 | 5.1 | 3.5 | 1.4 | 0.3 | Iris-setosa | 1 | 0 | 0 | 0.214285714285714 | 0.686274509803922 | 0.753386230741964
38 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa | 1 | 0 | 0 | 0.266666666666667 | 0.725490196078431 | 0.747384308490649
39 | 5.1 | 3.8 | 1.5 | 0.3 | Iris-setosa | 1 | 0 | 0 | 0.2 | 0.745098039215686 | 0.752258455681401
40 | 5.1 | 3.8 | 1.6 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.125 | 0.745098039215686 | 0.736272141535994
41 | 5.1 | 3.8 | 1.9 | 0.4 | Iris-setosa | 1 | 0 | 0 | 0.210526315789474 | 0.745098039215686 | 0.68452518034775
42 | 5.2 | 2.7 | 3.9 | 1.4 | Iris-versicolor | 0 | 1 | 0 | 0.358974358974359 | 0.519230769230769 | 0.230604211159528
43 | 5.2 | 3.4 | 1.4 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.142857142857143 | 0.653846153846154 | 0.745315390833046
44 | 5.2 | 3.5 | 1.5 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.133333333333333 | 0.673076923076923 | 0.734047050003793
45 | 5.2 | 4.1 | 1.5 | 0.1 | Iris-setosa | 1 | 0 | 0 | 0.066666666666667 | 0.788461538461538 | 0.762811887474022
46 | 5.3 | 3.7 | 1.5 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.133333333333333 | 0.69811320754717 | 0.740473501806866
47 | 5.4 | 3.0 | 4.5 | 1.5 | Iris-versicolor | 0 | 1 | 0 | 0.333333333333333 | 0.555555555555556 | 0.159649649102876
48 | 5.4 | 3.4 | 1.5 | 0.4 | Iris-setosa | 1 | 0 | 0 | 0.266666666666667 | 0.62962962962963 | 0.722656797169744
49 | 5.4 | 3.4 | 1.7 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.117647058823529 | 0.62962962962963 | 0.687757248197022
50 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.133333333333333 | 0.685185185185185 | 0.737167698365962
51 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa | 1 | 0 | 0 | 0.307692307692308 | 0.722222222222222 | 0.777023333634539
52 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 1 | 0 | 0 | 0.235294117647059 | 0.722222222222222 | 0.713482010522215
53 | 5.5 | 2.3 | 4.0 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.325 | 0.418181818181818 | 0.194219643823906
54 | 5.5 | 2.4 | 3.7 | 1.0 | Iris-versicolor | 0 | 1 | 0 | 0.27027027027027 | 0.436363636363636 | 0.241093824269382
55 | 5.5 | 2.4 | 3.8 | 1.1 | Iris-versicolor | 0 | 1 | 0 | 0.289473684210526 | 0.436363636363636 | 0.226058264210938
56 | 5.5 | 2.5 | 4.0 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.325 | 0.454545454545455 | 0.201872265775898
57 | 5.5 | 2.6 | 4.4 | 1.2 | Iris-versicolor | 0 | 1 | 0 | 0.272727272727273 | 0.472727272727273 | 0.156225228875569
58 | 5.5 | 3.5 | 1.3 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.153846153846154 | 0.636363636363636 | 0.756691610416908
59 | 5.5 | 4.2 | 1.4 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.142857142857143 | 0.763636363636364 | 0.771935440609133
60 | 5.6 | 2.5 | 3.9 | 1.1 | Iris-versicolor | 0 | 1 | 0 | 0.282051282051282 | 0.446428571428571 | 0.213933207860434
61 | 5.6 | 2.7 | 4.2 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.30952380952381 | 0.482142857142857 | 0.181519581637256
62 | 5.6 | 2.8 | 4.9 | 2.0 | Iris-virginica | 0 | 0 | 1 | 0.408163265306122 | 0.5 | 0.111996799091321
63 | 5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.361111111111111 | 0.517857142857143 | 0.277943121171659
64 | 5.6 | 3.0 | 4.1 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.317073170731707 | 0.535714285714286 | 0.205693392743854
65 | 5.6 | 3.0 | 4.5 | 1.5 | Iris-versicolor | 0 | 1 | 0 | 0.333333333333333 | 0.535714285714286 | 0.156153617405426
66 | 5.7 | 2.5 | 5.0 | 2.0 | Iris-virginica | 0 | 0 | 1 | 0.4 | 0.43859649122807 | 0.0965737982202029
67 | 5.7 | 2.6 | 3.5 | 1.0 | Iris-versicolor | 0 | 1 | 0 | 0.285714285714286 | 0.456140350877193 | 0.278390203317582
68 | 5.7 | 2.8 | 4.1 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.317073170731707 | 0.491228070175439 | 0.196228591121515
69 | 5.7 | 2.8 | 4.5 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.288888888888889 | 0.491228070175439 | 0.148542053943545
70 | 5.7 | 2.9 | 4.2 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.30952380952381 | 0.508771929824561 | 0.186821367793599
71 | 5.7 | 3.0 | 4.2 | 1.2 | Iris-versicolor | 0 | 1 | 0 | 0.285714285714286 | 0.526315789473684 | 0.190379016686771
72 | 5.7 | 3.8 | 1.7 | 0.3 | Iris-setosa | 1 | 0 | 0 | 0.176470588235294 | 0.666666666666667 | 0.698198801634805
73 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa | 1 | 0 | 0 | 0.266666666666667 | 0.771929824561404 | 0.758825404150609
74 | 5.8 | 2.6 | 4.0 | 1.2 | Iris-versicolor | 0 | 1 | 0 | 0.3 | 0.448275862068966 | 0.200536953059694
75 | 5.8 | 2.7 | 3.9 | 1.2 | Iris-versicolor | 0 | 1 | 0 | 0.307692307692308 | 0.46551724137931 | 0.218217860142616
76 | 5.8 | 2.7 | 4.1 | 1.0 | Iris-versicolor | 0 | 1 | 0 | 0.24390243902439 | 0.46551724137931 | 0.190910337541694
77 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica | 0 | 0 | 1 | 0.372549019607843 | 0.46551724137931 | 0.0924379577314657
78 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica | 0 | 0 | 1 | 0.372549019607843 | 0.46551724137931 | 0.0924379577314657
79 | 5.8 | 2.8 | 5.1 | 2.4 | Iris-virginica | 0 | 0 | 1 | 0.470588235294118 | 0.482758620689655 | 0.0943726599781477
80 | 5.8 | 4.0 | 1.2 | 0.2 | Iris-setosa | 1 | 0 | 0 | 0.166666666666667 | 0.689655172413793 | 0.784021608790257
81 | 5.9 | 3.0 | 4.2 | 1.5 | Iris-versicolor | 0 | 1 | 0 | 0.357142857142857 | 0.508474576271186 | 0.186761512366878
82 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica | 0 | 0 | 1 | 0.352941176470588 | 0.508474576271186 | 0.0973257207195747
83 | 5.9 | 3.2 | 4.8 | 1.8 | Iris-versicolor | 0 | 1 | 0 | 0.375 | 0.542372881355932 | 0.126712735034151
84 | 6.0 | 2.2 | 4.0 | 1.0 | Iris-versicolor | 0 | 1 | 0 | 0.25 | 0.366666666666667 | 0.183758509548148
85 | 6.0 | 2.2 | 5.0 | 1.5 | Iris-virginica | 0 | 0 | 1 | 0.3 | 0.366666666666667 | 0.0885712440311769
86 | 6.0 | 2.7 | 5.1 | 1.6 | Iris-versicolor | 0 | 1 | 0 | 0.313725490196078 | 0.45 | 0.0907272611961409
87 | 6.0 | 2.9 | 4.5 | 1.5 | Iris-versicolor | 0 | 1 | 0 | 0.333333333333333 | 0.483333333333333 | 0.147223724664182
88 | 6.0 | 3.0 | 4.8 | 1.8 | Iris-virginica | 0 | 0 | 1 | 0.375 | 0.5 | 0.120628357257542
89 | 6.0 | 3.4 | 4.5 | 1.6 | Iris-versicolor | 0 | 1 | 0 | 0.355555555555556 | 0.566666666666667 | 0.16163497279113
90 | 6.1 | 2.6 | 5.6 | 1.4 | Iris-virginica | 0 | 0 | 1 | 0.25 | 0.426229508196721 | 0.0597292318554775
91 | 6.1 | 2.8 | 4.0 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.325 | 0.459016393442623 | 0.202828547374176
92 | 6.1 | 2.8 | 4.7 | 1.2 | Iris-versicolor | 0 | 1 | 0 | 0.25531914893617 | 0.459016393442623 | 0.123814688224948
93 | 6.1 | 2.9 | 4.7 | 1.4 | Iris-versicolor | 0 | 1 | 0 | 0.297872340425532 | 0.475409836065574 | 0.126190742362697
94 | 6.1 | 3.0 | 4.6 | 1.4 | Iris-versicolor | 0 | 1 | 0 | 0.304347826086957 | 0.491803278688525 | 0.138317919091645
95 | 6.1 | 3.0 | 4.9 | 1.8 | Iris-virginica | 0 | 0 | 1 | 0.36734693877551 | 0.491803278688525 | 0.110921074597917
96 | 6.2 | 2.2 | 4.5 | 1.5 | Iris-versicolor | 0 | 1 | 0 | 0.333333333333333 | 0.354838709677419 | 0.127102859928249
97 | 6.2 | 2.8 | 4.8 | 1.8 | Iris-virginica | 0 | 0 | 1 | 0.375 | 0.451612903225806 | 0.113990321158717
98 | 6.2 | 2.9 | 4.3 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.302325581395349 | 0.467741935483871 | 0.166702290969475
99 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica | 0 | 0 | 1 | 0.425925925925926 | 0.548387096774194 | 0.0811783736332713
100 | 6.3 | 2.3 | 4.4 | 1.3 | Iris-versicolor | 0 | 1 | 0 | 0.295454545454545 | 0.365079365079365 | 0.138329002279972
Rows: 1-100 of 150 | Columns: 11
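
The 'setosa' values above appear consistent with passing the model's linear score through a logistic function. The sketch below checks this against two rows, using the rounded coefficients printed by the fitted model. This is an observation from the numbers in this example, not a statement about how Vertica computes probabilities internally, and the function names are hypothetical.

```python
import math

# Rounded coefficients as printed in the model details above.
INTERCEPT = 1.38349
COEF_PETAL_LENGTH = -0.84012
COEF_RATIO_SWL = 1.32517


def setosa_score(petal_length, ratio_swl):
    """Linear score: intercept + w1 * PetalLengthCm + w2 * ratio_swl."""
    return INTERCEPT + COEF_PETAL_LENGTH * petal_length + COEF_RATIO_SWL * ratio_swl


def sigmoid(z):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Row 1 (Iris-setosa): PetalLengthCm = 1.1, ratio_swl = 3.0 / 4.3
p_row_1 = sigmoid(setosa_score(1.1, 3.0 / 4.3))
# Row 17 (Iris-versicolor): PetalLengthCm = 3.3, ratio_swl = 2.4 / 4.9
p_row_17 = sigmoid(setosa_score(3.3, 2.4 / 4.9))

print(round(p_row_1, 4), round(p_row_17, 4))
```

Both results agree with the table (0.799616... and 0.323039...) to about four decimal places; the small remaining difference comes from the coefficients being rounded to five decimals.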

Let's create a model to classify the Iris virginica.

In [21]:
predictors = ["PetalLengthCm", "SepalLengthCm", "SepalWidthCm", 
              "PetalWidthCm", "ratio_pwl", "ratio_swl"]
response = "Species_Iris-virginica"
model = LinearSVC("svc_virginica_iris")
cross_validate(model, iris, predictors, response)

Out[21]:
 | auc | prc_auc | accuracy | log_loss | precision | recall | f1_score | mcc | informedness | markedness | csi | time
1-fold | 0.9883333333333334 | 0.9809541204839196 | 0.96 | 0.124753596556893 | 0.9090909090909091 | 1.0 | 0.9523809523809523 | 0.9211323729436766 | 0.9333333333333333 | 0.9090909090909092 | 0.9090909090909091 | 1.9328792095184326
2-fold | 0.9961904761904762 | 0.9910912698412698 | 0.98 | 0.0803143364420691 | 0.9375 | 1.0 | 0.967741935483871 | 0.9543135154205278 | 0.9714285714285715 | 0.9375 | 0.9375 | 1.7381629943847656
3-fold | 1.0 | 0.9999999999999999 | 1.0 | 0.0850368152073147