Wine Quality
This example uses the Wine Quality dataset to predict the quality of white wine. You can download the Jupyter Notebook of the study here.
The dataset includes the following variables:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- total sulfur dioxide
- free sulfur dioxide
- density
- pH
- sulphates
- alcohol
- quality (score between 0 and 10)
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN". For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset.
from verticapy.datasets import load_winequality
winequality = load_winequality()
winequality.head(5)
Data Exploration and Preparation
Let's explore the data by displaying descriptive statistics of all the columns.
winequality.describe()
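To make the summary easier to read, here is a minimal local sketch of the kind of statistics such a table typically reports. It is not VerticaPy's implementation: VerticaPy computes its aggregates in the database, and the exact set of statistics and the quantile method it uses may differ. The `describe` helper and the alcohol values below are hypothetical and exist only for illustration.

```python
import statistics


def describe(values):
    """Return common summary statistics for one numeric column."""
    ordered = sorted(values)
    # "inclusive" treats the sample as containing its own min and max.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": ordered[-1],
    }


# Made-up alcohol values, for illustration only (not taken from the dataset).
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
summary = describe(alcohol)
for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

Comparing the mean with the median, and the quartiles with the min and max, is a quick way to spot skewed columns or outliers before modeling.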
The quality of a wine is based on the equilibrium between certain components:
- For red wines: tannin/smoothness/acidity
- For white wines: smoothness/acidity
Based on this, we don't have the data to create a good model for red wines (tannin content is not included in the dataset). We do, however, have enough data to make a good model for white wines, so let's filter out red wines from our study.
winequality.filter(winequality["color"] == 'white').drop(["good", "color"])
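The logic of this step (keep only the white wines, then remove the columns we no longer need) can be sketched on plain Python dictionaries. This is only an illustration of the logic, not how VerticaPy executes it, since VerticaPy performs the work in the database. The helpers `filter_rows` and `drop_columns` and the sample rows are hypothetical.

```python
def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    dropped = set(columns)
    return [
        {key: value for key, value in row.items() if key not in dropped}
        for row in rows
    ]


# Made-up rows, for illustration only (not taken from the dataset).
rows = [
    {"color": "white", "good": 1, "alcohol": 10.5, "quality": 7},
    {"color": "red", "good": 0, "alcohol": 9.4, "quality": 5},
    {"color": "white", "good": 0, "alcohol": 9.0, "quality": 5},
]

white_only = filter_rows(rows, lambda row: row["color"] == "white")
white = drop_columns(white_only, ["good", "color"])
print(white)
```

Once only white wines remain, "color" is constant and carries no information, which is one reason to drop it.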
Let's draw the correlation matrix of the dataset.
%matplotlib inline
winequality.corr(method = "spearman")
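For readers unfamiliar with the "spearman" method: Spearman's coefficient is the Pearson correlation computed on the ranks of the values rather than on the values themselves, so it captures any monotonic relationship, not just linear ones. The following is a minimal pure-Python sketch of that definition, not VerticaPy's implementation. The function names and the sample values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every value tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values, for illustration only (not taken from the dataset).
density = [0.990, 0.992, 0.994, 0.996, 0.998]
alcohol = [13.0, 12.1, 11.0, 10.2, 9.0]
rho = spearman(density, alcohol)
print(round(rho, 3))
```

In this toy sample, alcohol falls every time density rises, so the coefficient is close to -1. A value near +1 or -1 in the matrix signals that two columns carry largely redundant information.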
We can see a strong correlation between the density and the alcohol degree (the alcohol degree measures the concentration of ethanol in the wine, which in turn affects its density). We can drop the 'density' column: it does not itself influence the quality of the white wine, and its presence would just bias the data.
winequality.drop(["density"])