Wine Quality

This example uses the Wine Quality dataset to predict the quality of white wine. You can download the Jupyter Notebook of the study here.

  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • total sulfur dioxide
  • free sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol
  • quality (score between 0 and 10)

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [10]:
import verticapy as vp
vp.__version__
Out[10]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a Virtual DataFrame of the dataset.

In [3]:
from verticapy.datasets import load_winequality
winequality = load_winequality()
winequality.head(5)
Out[3]:
123
fixed_acidity
Numeric(6,3)
123
volatile_acidity
Numeric(7,4)
123
citric_acid
Numeric(6,3)
123
residual_sugar
Numeric(7,3)
123
chlorides
Float
123
free_sulfur_dioxide
Numeric(7,2)
123
total_sulfur_dioxide
Numeric(7,2)
123
density
Float
123
pH
Numeric(6,3)
123
sulphates
Numeric(6,3)
123
alcohol
Float
123
quality
Int
123
good
Int
Abc
color
Varchar(20)
13.80.310.0211.10.03620.0114.00.992483.750.4412.460white
23.90.2250.44.20.0329.0118.00.9893.570.3612.881white
34.20.170.361.80.02993.0161.00.989993.650.8912.071white
44.20.2150.235.10.04164.0157.00.996883.420.448.030white
54.40.320.394.30.0331.0127.00.989043.460.3612.881white
Rows: 1-5 | Columns: 14

Data Exploration and Preparation

Let's explore the data by displaying descriptive statistics of all the columns.

In [4]:
winequality.describe()
Out[4]:
count
mean
std
min
approx_25%
approx_50%
approx_75%
max
"fixed_acidity"64977.215307064799141.296433757799823.86.47.07.715.9
"volatile_acidity"64970.3396659996921650.1646364740846790.080.230.290.41.58
"citric_acid"64970.3186332153301530.1453178648975910.00.250.310.391.66
"residual_sugar"64975.443235339387424.757803743147410.61.83.08.165.8
"chlorides"64970.05603386178236110.03503360137245910.0090.0380.0470.0650.611
"free_sulfur_dioxide"649730.525319378174617.74939977200251.017.029.041.0289.0
"total_sulfur_dioxide"6497115.74457441896356.52185452263046.077.0118.0156.0440.0
"density"64970.9946966338310.002998673003719150.987110.992340.994890.996991.03898
"pH"64973.218500846544560.1607872021039872.723.113.213.324.01
"sulphates"64970.5312682776666140.148805873614490.220.430.510.62.0
"alcohol"649710.49180083115291.1927117488718.09.510.311.314.9
"quality"64975.818377712790510.8732552715311233.05.06.06.09.0
"good"64970.196552254886870.3974214088953670.00.00.00.01.0
Rows: 1-13 | Columns: 9

The quality of a wine is based on the equilibrium between certain components:

  • For red wines: tannin/smoothness/acidity
  • For white wines: smoothness/acidity

Based on this, we don't have the data to create a good model for red wines (the tannins weren't extracted). We do, however, have enough data to make a good model for white wines, so let's filter out red wines from our study.

In [5]:
winequality.filter(winequality["color"] == 'white').drop(["good", "color"])
1599 elements were filtered
Out[5]:
123
fixed_acidity
Numeric(6,3)
123
volatile_acidity
Numeric(7,4)
123
citric_acid
Numeric(6,3)
123
residual_sugar
Numeric(7,3)
123
chlorides
Float
123
free_sulfur_dioxide
Numeric(7,2)
123
total_sulfur_dioxide
Numeric(7,2)
123
density
Float
123
pH
Numeric(6,3)
123
sulphates
Numeric(6,3)
123
alcohol
Float
123
quality
Int
13.80.310.0211.10.03620.0114.00.992483.750.4412.46
23.90.2250.44.20.0329.0118.00.9893.570.3612.88
34.20.170.361.80.02993.0161.00.989993.650.8912.07
44.20.2150.235.10.04164.0157.00.996883.420.448.03
54.40.320.394.30.0331.0127.00.989043.460.3612.88
64.40.460.12.80.02431.0111.00.988163.480.3413.16
74.40.540.095.10.03852.097.00.990223.410.412.27
84.50.190.210.950.03389.0159.00.993323.340.428.05
94.60.4450.01.40.05311.0178.00.994263.790.5510.25
104.70.1450.291.00.04235.090.00.99083.760.4911.36
114.70.3350.141.30.03669.0168.00.992123.470.4610.55
124.70.4550.181.90.03633.0106.00.987463.210.8314.07
134.70.670.091.00.025.09.00.987223.30.3413.65
144.70.7850.03.40.03623.0134.00.989813.530.9213.86
154.80.130.321.20.04240.098.00.98983.420.6411.87
164.80.170.282.90.0322.0111.00.99023.380.3411.37
174.80.210.2110.20.03717.0112.00.993243.660.4812.27
184.80.2250.381.20.07447.0130.00.991323.310.410.36
194.80.260.2310.60.03423.0111.00.992743.460.2811.57
204.80.290.231.10.04438.0180.00.989243.280.3411.96
214.80.330.06.50.02834.0163.00.99373.350.619.95
224.80.340.06.50.02833.0163.00.99393.360.619.96
234.80.650.121.10.0134.010.00.992463.320.3613.54
244.90.2350.2711.750.0334.0118.00.99543.070.59.46
254.90.330.311.20.01639.0150.00.987133.330.5914.08
264.90.3350.141.30.03669.0168.00.992123.470.4610.46666666666675
274.90.3350.141.30.03669.0168.00.992123.470.4610.46666666666675
284.90.3450.341.00.06832.0143.00.991383.240.410.15
294.90.3450.341.00.06832.0143.00.991383.240.410.15
304.90.470.171.90.03560.0148.00.989643.270.3511.56
315.00.170.561.50.02624.0115.00.99063.480.3910.87
325.00.20.41.90.01520.098.00.98973.370.5512.056
335.00.2350.2711.750.0334.0118.00.99543.070.59.46
345.00.240.195.00.04317.0101.00.994383.670.5710.05
355.00.240.212.20.03931.0100.00.990983.690.6211.76
365.00.240.341.10.03449.0158.00.987743.320.3213.17
375.00.2550.222.70.04346.0153.00.992383.750.7611.36
385.00.270.324.50.03258.0178.00.989563.450.3112.67
395.00.270.324.50.03258.0178.00.989563.450.3112.67
405.00.270.41.20.07642.0124.00.992043.320.4710.16
415.00.290.545.70.03554.0155.00.989763.270.3412.98
425.00.30.333.70.0354.0173.00.98873.360.313.07
435.00.310.06.40.04643.0166.00.9943.30.639.96
445.00.330.161.50.04910.097.00.99173.480.4410.76
455.00.330.161.50.04910.097.00.99173.480.4410.76
465.00.330.161.50.04910.097.00.99173.480.4410.76
475.00.330.184.60.03240.0124.00.991143.180.411.06
485.00.330.2311.80.0323.0158.00.993223.410.6411.86
495.00.350.257.80.03124.0116.00.992413.390.411.36
505.00.350.257.80.03124.0116.00.992413.390.411.36
515.00.440.0418.60.03938.0128.00.99853.370.5710.26
525.00.4550.181.90.03633.0106.00.987463.210.8314.07
535.00.550.148.30.03235.0164.00.99183.530.5112.58
545.00.610.121.30.00965.0100.00.98743.260.3713.55
555.10.110.321.60.02812.090.00.990083.570.5212.26
565.10.140.250.70.03915.089.00.99193.220.439.26
575.10.1650.225.70.04742.0146.00.99343.180.559.96
585.10.210.281.40.04748.0148.00.991683.50.4910.45
595.10.230.181.00.05313.099.00.989563.220.3911.55
605.10.250.361.30.03540.078.00.98913.230.6412.17
615.10.260.331.10.02746.0113.00.989463.350.4311.47
625.10.260.346.40.03426.099.00.994493.230.419.26
635.10.290.288.30.02627.0107.00.993083.360.3711.06
645.10.290.288.30.02627.0107.00.993083.360.3711.06
655.10.30.32.30.04840.0150.00.989443.290.4612.26
665.10.3050.131.750.03617.073.00.993.40.5112.33333333333335
675.10.310.30.90.03728.0152.00.9923.540.5610.16
685.10.330.221.60.02718.089.00.98933.510.3812.57
695.10.330.221.60.02718.089.00.98933.510.3812.57
705.10.330.221.60.02718.089.00.98933.510.3812.57
715.10.330.276.70.02244.0129.00.992213.360.3911.07
725.10.350.266.80.03436.0120.00.991883.380.411.56
735.10.350.266.80.03436.0120.00.991883.380.411.56
745.10.350.266.80.03436.0120.00.991883.380.411.56
755.10.390.211.70.02715.072.00.98943.50.4512.56
765.10.420.011.50.01725.0102.00.98943.380.3612.37
775.10.520.062.70.05230.079.00.99323.320.439.35
785.20.1550.331.60.02813.059.00.989753.30.8411.98
795.20.1550.331.60.02813.059.00.989753.30.8411.98
805.20.160.340.80.02926.077.00.991553.250.5110.16
815.20.170.270.70.0311.068.00.992183.30.419.85
825.20.1850.221.00.0347.0123.00.992183.550.4410.156
835.20.20.273.20.04716.093.00.992353.440.5310.17
845.20.210.311.70.04817.061.00.989533.240.3712.07
855.20.220.466.20.06641.0187.00.993623.190.429.733333333333335
865.20.240.157.10.04332.0134.00.993783.240.489.96
875.20.240.453.80.02721.0128.00.9923.550.4911.28
885.20.240.453.80.02721.0128.00.9923.550.4911.28
895.20.250.231.40.04720.077.00.990013.320.6211.45
905.20.280.291.10.02818.069.00.991683.240.5410.06
915.20.2850.295.150.03564.0138.00.98953.190.3412.48
925.20.30.341.50.03818.096.00.989423.560.4813.08
935.20.310.22.40.02727.0117.00.988863.560.4513.07
945.20.310.365.10.03146.0145.00.98973.140.3112.47
955.20.3350.21.70.03317.074.00.990023.340.4812.36
965.20.340.376.20.03142.0133.00.990763.250.4112.56
975.20.360.021.60.03124.0104.00.98963.440.3512.26
985.20.3650.0813.50.04137.0142.00.9973.460.399.96
995.20.370.331.20.02813.081.00.99023.370.3811.76
1005.20.380.267.70.05320.0103.00.99253.270.4512.26
Rows: 1-100 of 4898 | Columns: 12

Let's draw the correlation matrix of the dataset.

In [6]:
%matplotlib inline
winequality.corr(method = "spearman")
Out[6]:
"fixed_acidity"
"volatile_acidity"
"citric_acid"
"residual_sugar"
"chlorides"
"free_sulfur_dioxide"
"total_sulfur_dioxide"
"density"
"pH"
"sulphates"
"alcohol"
"quality"
"fixed_acidity"1.0-0.04295344115546210.2943519557055970.1016444303644920.0900110347529423-0.02779264833156940.1093887187104840.265711053951591-0.415781573031458-0.0141936686269647-0.103450490979059-0.0796745702348406
"volatile_acidity"-0.04295344115546211.0-0.1454069083790830.107237277630878-0.00339782066183659-0.08338190889885730.1147485415406030.00854780369850073-0.0449435976968945-0.01743666868474510.0353025244854393-0.158558078043926
"citric_acid"0.294351955705597-0.1454069083790831.00.02489103261813680.03620891887063470.09005644858207480.09600645634609480.0946381002313568-0.1472496252419410.0813914907919214-0.03513340617580740.00381493685138116
"residual_sugar"0.1016444303644920.1072372776308780.02489103261813681.00.2246599587936940.3452300155377770.4301034543860880.779068758176437-0.180627450864797-0.00657339265069306-0.441559643430315-0.0923288244199536
"chlorides"0.0900110347529423-0.003397820661836590.03620891887063470.2246599587936941.00.1654698862453890.3733551183476690.505577326845011-0.05598968197799410.0887708062447246-0.569837905062804-0.318382425676257
"free_sulfur_dioxide"-0.0277926483315694-0.08338190889885730.09005644858207480.3452300155377770.1654698862453891.00.6189596424554670.329172049904512-0.008278266941152710.0510877316189511-0.2734204567507060.00379066722904633
"total_sulfur_dioxide"0.1093887187104840.1147485415406030.09600645634609480.4301034543860880.3733551183476690.6189596424554671.00.563857044809788-0.01306978626218150.155569043482336-0.477228700782309-0.204178237570967
"density"0.2657110539515910.008547803698500730.09463810023135680.7790687581764370.5055773268450110.3291720499045120.5638570448097881.0-0.110660332831870.0921206844814246-0.821563801759593-0.357150131425647
"pH"-0.415781573031458-0.0449435976968945-0.147249625241941-0.180627450864797-0.0559896819779941-0.00827826694115271-0.0130697862621815-0.110660332831871.00.1404081751579920.1488642082304350.113164400131828
"sulphates"-0.0141936686269647-0.01743666868474510.0813914907919214-0.006573392650693060.08877080624472460.05108773161895110.1555690434823360.09212068448142460.1404081751579921.0-0.04135567739912990.0303598456826122
"alcohol"-0.1034504909790590.0353025244854393-0.0351334061758074-0.441559643430315-0.569837905062804-0.273420456750706-0.477228700782309-0.8215638017595930.148864208230435-0.04135567739912991.00.444900539842434
"quality"-0.0796745702348406-0.1585580780439260.00381493685138116-0.0923288244199536-0.3183824256762570.00379066722904633-0.204178237570967-0.3571501314256470.1131644001318280.03035984568261220.4449005398424341.0
Rows: 1-12 | Columns: 13

We can see a strong correlation between the density and the alcohol degree (the alcohol degree describes the density of pure ethanol in the wine). We can drop the 'density' column since it doesn't influence the quality of the white wine (instead, its presence will just bias the data).

In [7]:
winequality.drop(["density"])
Out[7]:
123
fixed_acidity
Numeric(6,3)
123
volatile_acidity
Numeric(7,4)
123
citric_acid
Numeric(6,3)
123
residual_sugar
Numeric(7,3)
123
chlorides
Float
123
free_sulfur_dioxide
Numeric(7,2)
123
total_sulfur_dioxide
Numeric(7,2)
123
pH
Numeric(6,3)
123
sulphates
Numeric(6,3)
123
alcohol
Float
123
quality
Int
13.80.310.0211.10.03620.0114.03.750.4412.46
23.90.2250.44.20.0329.0118.03.570.3612.88
34.20.170.361.80.02993.0161.03.650.8912.07
44.20.2150.235.10.04164.0157.03.420.448.03
54.40.320.394.30.0331.0127.03.460.3612.88
64.40.460.12.80.02431.0111.03.480.3413.16
74.40.540.095.10.03852.097.03.410.412.27
84.50.190.210.950.03389.0159.03.340.428.05
94.60.4450.01.40.05311.0178.03.790.5510.25
104.70.1450.291.00.04235.090.03.760.4911.36
114.70.3350.141.30.03669.0168.03.470.4610.55
124.70.4550.181.90.03633.0106.03.210.8314.07
134.70.670.091.00.025.09.03.30.3413.65
144.70.7850.03.40.03623.0134.03.530.9213.86
154.80.130.321.20.04240.098.03.420.6411.87
164.80.170.282.90.0322.0111.03.380.3411.37
174.80.210.2110.20.03717.0112.03.660.4812.27
184.80.2250.381.20.07447.0130.03.310.410.36
194.80.260.2310.60.03423.0111.03.460.2811.57
204.80.290.231.10.04438.0180.03.280.3411.96
214.80.330.06.50.02834.0163.03.350.619.95
224.80.340.06.50.02833.0163.03.360.619.96
234.80.650.121.10.0134.010.03.320.3613.54
244.90.2350.2711.750.0334.0118.03.070.59.46
254.90.330.311.20.01639.0150.03.330.5914.08
264.90.3350.141.30.03669.0168.03.470.4610.46666666666675
274.90.3350.141.30.03669.0168.03.470.4610.46666666666675
284.90.3450.341.00.06832.0143.03.240.410.15
294.90.3450.341.00.06832.0143.03.240.410.15
304.90.470.171.90.03560.0148.03.270.3511.56
315.00.170.561.50.02624.0115.03.480.3910.87
325.00.20.41.90.01520.098.03.370.5512.056
335.00.2350.2711.750.0334.0118.03.070.59.46
345.00.240.195.00.04317.0101.03.670.5710.05
355.00.240.212.20.03931.0100.03.690.6211.76
365.00.240.341.10.03449.0158.03.320.3213.17
375.00.2550.222.70.04346.0153.03.750.7611.36
385.00.270.324.50.03258.0178.03.450.3112.67
395.00.270.324.50.03258.0178.03.450.3112.67
405.00.270.41.20.07642.0124.03.320.4710.16
415.00.290.545.70.03554.0155.03.270.3412.98
425.00.30.333.70.0354.0173.03.360.313.07
435.00.310.06.40.04643.0166.03.30.639.96
445.00.330.161.50.04910.097.03.480.4410.76
455.00.330.161.50.04910.097.03.480.4410.76
465.00.330.161.50.04910.097.03.480.4410.76
475.00.330.184.60.03240.0124.03.180.411.06
485.00.330.23