Iris
This example uses the 'iris' dataset to predict the species of various flowers based on their physical features. You can download the Jupyter Notebook of the study here. The dataset contains the following columns:
- PetalLengthCm: Petal Length in cm
- PetalWidthCm: Petal Width in cm
- SepalLengthCm: Sepal Length in cm
- SepalWidthCm: Sepal Width in cm
- Species: The Flower Species (Setosa, Virginica, Versicolor)
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset.
from verticapy.datasets import load_iris
import verticapy.stats as st
iris = load_iris()
iris.head(5)
Data Exploration and Preparation
Let's explore the data by displaying descriptive statistics of all the columns.
iris.describe(method = "categorical", unique=True)
We don't have much data here, but that's okay. Different flower species have different proportions, so we can start by computing ratios between features: petal width to petal length, and sepal width to sepal length.
We'll also need to apply One-Hot Encoding to the 'Species' column to get one binary indicator column per species.
iris["Species"].one_hot_encode(drop_first = False)
iris["ratio_pwl"] = iris["PetalWidthCm"] / iris["PetalLengthCm"]
iris["ratio_swl"] = iris["SepalWidthCm"] / iris["SepalLengthCm"]
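To make the two transformations above concrete, here is a minimal plain-Python sketch of what one-hot encoding and the ratio features compute for each row. It does not use VerticaPy or a database. The three sample rows and the helper name `add_features` are assumptions made for illustration, and the output column names simply mirror the ones used in this example.

```python
# Illustrative rows only (assumption): in the example the data lives in Vertica.
rows = [
    {"SepalLengthCm": 5.1, "SepalWidthCm": 3.5,
     "PetalLengthCm": 1.4, "PetalWidthCm": 0.2, "Species": "Iris-setosa"},
    {"SepalLengthCm": 7.0, "SepalWidthCm": 3.2,
     "PetalLengthCm": 4.7, "PetalWidthCm": 1.4, "Species": "Iris-versicolor"},
    {"SepalLengthCm": 6.3, "SepalWidthCm": 3.3,
     "PetalLengthCm": 6.0, "PetalWidthCm": 2.5, "Species": "Iris-virginica"},
]

SPECIES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


def add_features(row):
    """Hypothetical helper: return a copy of `row` with the derived columns."""
    out = dict(row)
    # One-hot encoding: one 0/1 column per category, none dropped.
    for species in SPECIES:
        out[f"Species_{species}"] = 1 if row["Species"] == species else 0
    # Ratio features: width divided by length for petals and sepals.
    out["ratio_pwl"] = row["PetalWidthCm"] / row["PetalLengthCm"]
    out["ratio_swl"] = row["SepalWidthCm"] / row["SepalLengthCm"]
    return out


featured = [add_features(r) for r in rows]
for r in featured:
    print(r["Species"], round(r["ratio_pwl"], 3), round(r["ratio_swl"], 3))
```

Each row ends up with exactly one indicator column set to 1, which is what lets a binary classifier use a column such as `Species_Iris-setosa` as its response.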
We can draw the correlation matrix (Pearson correlation coefficient) of the new features to see if there are some linear links.
%matplotlib inline
iris.corr()
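For reference, each cell of a Pearson correlation matrix is the covariance of two columns divided by the product of their standard deviations. The following is a minimal plain-Python sketch of that formula; the function name `pearson` and the toy inputs are assumptions for illustration, and this is not a description of how Vertica computes the matrix internally.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (dev_x * dev_y)


# Toy inputs (assumption): perfectly increasing and perfectly decreasing pairs.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

A value near 1 or -1 indicates a strong linear relationship, which is what we look for between the features and the one-hot encoded species columns.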
The Iris setosa is highly linearly correlated with the petal length and the sepal ratio. We can see a perfect separation using these two features (though we can also see this separation with the petal length alone).
iris.scatter(columns = ["PetalLengthCm", "ratio_swl"],
             catcol = "Species")
We can see a clear linear separation between the Iris setosa and the other species, but we'll need more features to identify the differences between Iris virginica and Iris versicolor.
iris.scatter(columns = ["PetalLengthCm",
                        "PetalWidthCm",
                        "SepalLengthCm"],
             catcol = "Species")
Our strategy is simple: we'll use two Linear Support Vector Classification (SVC) models: one to classify the Iris setosa and another to classify the Iris virginica. A flower that neither model claims is then an Iris versicolor.
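One way to turn two binary classifiers into a three-class prediction is sketched below. This is an assumption about how the outputs could be combined, not a step taken from this example: the function name `predict_species`, the 0.5 threshold, and the order of the checks are all hypothetical. The two inputs stand for the probabilities produced by the models whose responses are `Species_Iris-setosa` and `Species_Iris-virginica`.

```python
def predict_species(p_setosa, p_virginica, threshold=0.5):
    """Hypothetical rule combining two binary classifiers into one label.

    p_setosa    -- probability from the setosa-vs-rest model
    p_virginica -- probability from the virginica-vs-rest model
    """
    # Setosa is checked first because it is the easiest class to separate.
    if p_setosa >= threshold:
        return "Iris-setosa"
    if p_virginica >= threshold:
        return "Iris-virginica"
    # Neither model claims the flower, so it falls into the remaining class.
    return "Iris-versicolor"


# Made-up probabilities (assumption), one per possible outcome.
for probs in [(0.9, 0.1), (0.1, 0.8), (0.1, 0.2)]:
    print(probs, "->", predict_species(*probs))
```

Only two models are needed for three classes because the third class is whatever remains after the first two decisions.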
Machine Learning
Let's build the first Linear SVC to predict if a flower is an Iris setosa.
from verticapy.learn.svm import LinearSVC
from verticapy.learn.model_selection import cross_validate
predictors = ["PetalLengthCm", "ratio_swl"]
response = "Species_Iris-setosa"
model = LinearSVC("svc_setosa_iris")
cross_validate(model, iris, predictors, response)
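As background, k-fold cross-validation splits the rows into k folds, trains on k-1 of them, and scores on the held-out fold, repeating so that each fold is held out once. The sketch below shows only the splitting step in plain Python. The function name `k_fold_indices`, the choice of k=3, and the fixed seed are assumptions for illustration; it is not meant to reproduce how `cross_validate` partitions data inside Vertica.

```python
import random


def k_fold_indices(n_rows, k=3, seed=0):
    """Return a list of (train_indices, test_indices) pairs for k-fold CV."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    # Shuffle so that rows sorted by class do not end up in a single fold.
    random.Random(seed).shuffle(indices)
    folds = []
    for i in range(k):
        test = indices[i::k]          # every k-th shuffled row, offset by i
        held_out = set(test)
        train = [j for j in indices if j not in held_out]
        folds.append((train, test))
    return folds


# 150 rows matches the size of the classic iris dataset.
folds = k_fold_indices(150, k=3)
for train, test in folds:
    print(len(train), "training rows,", len(test), "test rows")
```

Averaging the metric over the folds gives a less optimistic estimate than scoring the model on the same rows it was trained on.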
The cross-validation scores are excellent. Let's train the model on the entire dataset.
model.fit(iris, predictors, response)
Let's plot the model to see the perfect separation.
model.plot()
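The boundary in such a plot is a straight line because a linear SVC classifies a point by the sign of a weighted sum of its features plus an intercept. The sketch below illustrates that decision rule in plain Python. The weights and intercept are hand-picked assumptions that place the boundary at a petal length of 2.5 cm; they are not the coefficients of the fitted model, and the two sample points are illustrative.

```python
def decision_value(x, weights, intercept):
    """Signed score of a point relative to the hyperplane w.x + b = 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + intercept


def classify(x, weights, intercept):
    """Return 1 on the non-negative side of the hyperplane, else 0."""
    return 1 if decision_value(x, weights, intercept) >= 0 else 0


# Hand-picked parameters (assumption, NOT fitted): features are
# [PetalLengthCm, ratio_swl]; class 1 means "is an Iris setosa".
weights = [-1.0, 0.0]
intercept = 2.5

setosa_like = [1.4, 3.5 / 5.1]   # short petal
other_like = [4.7, 3.2 / 7.0]    # long petal

print(classify(setosa_like, weights, intercept))
print(classify(other_like, weights, intercept))
```

A fitted model would choose the weights that maximize the margin between the two classes rather than use values set by hand.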
We can add the predicted probability of the setosa class to the vDataFrame.
model.predict_proba(iris, name = "setosa", pos_label=1)
Let's create a model to classify the Iris virginica.
predictors = ["PetalLengthCm", "SepalLengthCm", "SepalWidthCm",
              "PetalWidthCm", "ratio_pwl", "ratio_swl"]
response = "Species_Iris-virginica"
model = LinearSVC("svc_virginica_iris")
cross_validate(model, iris, predictors, response)