How can you really tell if a wine is good? It’s commonly said that the price of a wine does not correlate directly with its quality or taste. What if we used some analytics to help determine clearly whether a wine is “good”? My friend and data scientist Badr Ouali shared with me a way to put analytics into action and state whether a wine is, in fact, good.
This fascinated me. He created a simple machine learning model, then tested it on individual wines. I love this because it shows a practical application of my favorite SQL database, Vertica (AND IT’S ABOUT WINE).
The purpose of the study was to determine all the components that make a wine good. Badr created this model using Kaggle data. The author behind the data was a French chemist who measured the chemical components of 5,000 wines. The chemical components can be found in the Kaggle database. You can use this wine analyzer model to test new, specific wines.
Three impartial factors define a “good” wine:
- Zero faults: no visual defects (like grapes) and absolutely no bad odors.
- Good chemical equilibrium: a good balance between three variables (tannin, acidity, and smoothness).
- Fine length in the mouth. This means the taste of the wine stays in the mouth for a while, and is related to the quality of the grapes.
The following factors need to be constant when testing wines:
- Wines of the same type (red, white, or rosé)
- Wine tasted by an impartial wine expert
The process to rate a wine is quite simple. The wine expert looks at it and tastes it, then rates it between 1 and 10. A wine is excellent if its rating is above 7, and good if its rating is between 5 and 7. Otherwise, the wine is bad.
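As a rough illustration, the rating rule above can be written as a small Python function. This is only a sketch: the function name `rating_category` is made up for this post, and it assumes that scores of exactly 5 and exactly 7 count as “good,” which the rule itself does not spell out.

```python
def rating_category(score: float) -> str:
    """Map an expert score (1 to 10) to one of three categories."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score > 7:
        return "excellent"
    if score >= 5:  # assumption: both 5 and 7 are treated as "good"
        return "good"
    return "bad"


# Try the rule on a few example scores (invented for illustration).
for s in (8, 7, 5, 4.5):
    print(s, rating_category(s))
```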
We can create two types of models for this study:
- Regression: The model predicts the rate (between 0 and 10).
- Classification: We consider a wine good when its rate is greater than 6.5, so we create another variable: good=1 if rate>6.5, 0 otherwise. This approach is less precise than the first one.
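To make the classification target concrete, here is a minimal Python sketch of how the `good` variable could be derived from `rate`. The sample rows and the helper name `add_good_label` are invented for illustration; they are not taken from the Kaggle data or from Badr's model, and this is plain Python rather than Vertica code.

```python
GOOD_THRESHOLD = 6.5  # from the rule above: good=1 if rate > 6.5


def add_good_label(wines):
    """Return copies of the rows with an extra 'good' field (1 or 0)."""
    return [
        {**wine, "good": 1 if wine["rate"] > GOOD_THRESHOLD else 0}
        for wine in wines
    ]


# Hypothetical rows, only to show the shape of the data.
sample = [
    {"name": "wine_a", "rate": 7},
    {"name": "wine_b", "rate": 6},
    {"name": "wine_c", "rate": 5},
]

for row in add_good_label(sample):
    print(row["name"], row["rate"], row["good"])
```

Collapsing the rate into a 0/1 label throws away the difference between, say, a 7 and a 9, which is one way to see why the classification case is less precise than the regression one.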
In Vertica, we can choose from multiple in-database classification and regression models without moving the data, which is a big advantage. You can compare the accuracy of the models and choose the best one. We cannot see Vertica’s performance advantage in this case because the volume of data is quite small, but the approach stays the same even if the data volume grows.
Badr used the Kaggle data on Vinho Verde, a Portuguese wine. However, because the person who tested the wines didn’t include tannins among the features, and tannin is a key variable, the predictive accuracy of our algorithm will remain quite weak. That’s why Badr pointed out that a solid business understanding is important, even before data ingestion.
Thanks, Badr, for helping us create a systematic approach to identifying good-quality wine with machine learning in Vertica.
Check out the full recording here: LINK TO THE VIDEO
Also, we decided to put this model into action at a partner event. Join Vertica, CB Technologies, Inc., & Tableau Software at 4 PM on May 30th at JM Cellars in Seattle, WA. We will be talking about big data, analytics, and machine learning while tasting delicious wine!