Pokemon
This example uses the 'pokemon' and 'combats' datasets to predict the winner of a 1-on-1 Pokemon battle. You can download the Jupyter Notebook of the study here, along with the two datasets.
The 'pokemon' dataset describes each Pokemon:
- Name: The name of the Pokemon
- Generation: The Pokemon's generation
- Legendary: True if the Pokemon is legendary
- HP: Number of hit points
- Attack: Attack stat
- Sp_Atk: Special attack stat
- Defense: Defense stat
- Sp_Def: Special defense stat
- Speed: Speed stat
- Type_1: The Pokemon's first type
- Type_2: The Pokemon's second type
The 'combats' dataset describes each battle:
- First_pokemon: The Pokemon of trainer 1
- Second_pokemon: The Pokemon of trainer 2
- Winner: The winner of the battle
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's ingest the datasets.
import verticapy.stats as st
vp.drop('combats')
combats = vp.read_csv('data/combats.csv')
combats.head(5)
vp.drop('pokemon')
pokemon = vp.read_csv('data/pokemon.csv')
pokemon.head(5)
Data Exploration and Preparation
The table 'combats' will be joined to the table 'pokemon' to predict the winner.
The 'pokemon' table contains the information on each Pokemon. Let's describe this table.
pokemon.describe(method = "categorical", unique = True)
The Pokemon's 'Name' is just an identifier, and its 'Generation' and 'Legendary' status are already reflected in its stats, so we can drop these columns.
pokemon.drop(["Generation",
"Legendary",
"Name"])
The 'ID' column will be the key used to join the two tables. By joining the data, we will be able to create more relevant features.
fights = pokemon.join(combats,
on = {"ID": "First_Pokemon"},
how = "inner",
expr1 = ["Sp_Atk AS Sp_Atk_1",
"Speed AS Speed_1",
"Sp_Def AS Sp_Def_1",
"Defense AS Defense_1",
"Type_1 AS Type_1_1",
"Type_2 AS Type_2_1",
"HP AS HP_1",
"Attack AS Attack_1"],
expr2 = ["First_Pokemon",
"Second_Pokemon",
"Winner"]).join(pokemon,
on = {"Second_Pokemon": "ID"},
how = "inner",
expr2 = ["Sp_Atk AS Sp_Atk_2",
"Speed AS Speed_2",
"Sp_Def AS Sp_Def_2",
"Defense AS Defense_2",
"Type_1 AS Type_1_2",
"Type_2 AS Type_2_2",
"HP AS HP_2",
"Attack AS Attack_2"],
expr1 = ["Sp_Atk_1",
"Speed_1",
"Sp_Def_1",
"Defense_1",
"Type_1_1",
"Type_2_1",
"HP_1",
"Attack_1",
"Winner",
"Second_pokemon"])
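Under the hood, these two inner joins attach each fighter's stats to its battle row. A minimal pure-Python sketch of the same logic (the in-memory dictionaries below are hypothetical stand-ins for the Vertica tables):

```python
# Hypothetical in-memory stand-ins for the 'pokemon' and 'combats' tables.
pokemon = {
    1: {"HP": 45, "Attack": 49, "Speed": 45},
    4: {"HP": 39, "Attack": 52, "Speed": 65},
}
combats = [{"First_pokemon": 1, "Second_pokemon": 4, "Winner": 4}]

# Inner-join each battle to both fighters' stats, suffixing columns _1 / _2.
fights = []
for battle in combats:
    p1 = pokemon.get(battle["First_pokemon"])
    p2 = pokemon.get(battle["Second_pokemon"])
    if p1 is None or p2 is None:
        continue  # inner join: skip battles involving an unknown Pokemon
    row = {f"{stat}_1": value for stat, value in p1.items()}
    row.update({f"{stat}_2": value for stat, value in p2.items()})
    row["Winner"] = battle["Winner"]
    fights.append(row)

print(fights[0])
```

In Vertica, of course, the joins run in the database; this sketch only shows the row-matching logic.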
Feature engineering is key. Here, we create features that describe the stat differences between the first and second Pokemon. We also convert 'Winner' to a binary value: 1 if the first Pokemon won and 0 otherwise.
fights["Sp_Atk_diff"] = fights["Sp_Atk_1"] - fights["Sp_Atk_2"]
fights["Speed_diff"] = fights["Speed_1"] - fights["Speed_2"]
fights["Sp_Def_diff"] = fights["Sp_Def_1"] - fights["Sp_Def_2"]
fights["Defense_diff"] = fights["Defense_1"] - fights["Defense_2"]
fights["HP_diff"] = fights["HP_1"] - fights["HP_2"]
fights["Attack_diff"] = fights["Attack_1"] - fights["Attack_2"]
fights["Winner"] = st.case_when(fights["Winner"] == fights["Second_pokemon"], 0, 1)
fights = fights[["Sp_Atk_diff", "Speed_diff", "Sp_Def_diff",
"Defense_diff", "HP_diff", "Attack_diff",
"Type_1_1", "Type_1_2", "Type_2_1", "Type_2_2",
"Winner"]]
display(fights)
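To make the diff features and the binary label explicit, the same transformation can be sketched in plain Python on one hypothetical joined row:

```python
# One hypothetical joined row, as produced by the joins above.
row = {"Speed_1": 80, "Speed_2": 65, "HP_1": 60, "HP_2": 70,
       "Winner": 257, "Second_pokemon": 257}

# Only two stats shown; the full example also diffs Sp_Atk, Sp_Def,
# Defense, and Attack.
stats = ["Speed", "HP"]
features = {f"{s}_diff": row[f"{s}_1"] - row[f"{s}_2"] for s in stats}

# Equivalent of st.case_when: 1 if the first Pokemon won, 0 otherwise.
features["Winner"] = 0 if row["Winner"] == row["Second_pokemon"] else 1
print(features)  # {'Speed_diff': 15, 'HP_diff': -10, 'Winner': 0}
```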
Most machine learning models cannot handle missing values. Let's see which features we should impute.
fights.count()
In terms of missing values, our only concern is the Pokemon's second type (Type_2_1 and Type_2_2). Since some Pokemon only have one type, these values are MNAR (missing not at random). We can impute them by creating an explicit category.
fights["Type_2_1"].fillna("No")
fights["Type_2_2"].fillna("No")
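Conceptually, fillna here just replaces NULLs with the new category; in plain Python, on hypothetical values:

```python
# Hypothetical Type_2 values; None marks a single-type Pokemon.
type_2 = ["Poison", None, "Flying", None]

# Impute the structurally missing values with an explicit 'No' category.
imputed = [t if t is not None else "No" for t in type_2]
print(imputed)  # ['Poison', 'No', 'Flying', 'No']
```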
Let's use the current_relation method to see the SQL code that our data preparation on the vDataFrame has generated so far.
print(fights.current_relation())
VerticaPy will remember your modifications and always generate an up-to-date SQL query.
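The lazy-evaluation idea can be sketched with a tiny class that records transformations and only renders SQL on demand (hypothetical names; this is not the actual VerticaPy implementation):

```python
class LazyRelation:
    """Minimal sketch of a vDataFrame-style lazy relation."""

    def __init__(self, table):
        self.table = table
        self.derived = []  # (alias, SQL expression) pairs added so far

    def assign(self, alias, expression):
        # Record the transformation instead of executing it immediately.
        self.derived.append((alias, expression))
        return self

    def current_relation(self):
        # Render an up-to-date query from the recorded transformations.
        cols = ["*"] + [f"{expr} AS {alias}" for alias, expr in self.derived]
        return f"SELECT {', '.join(cols)} FROM {self.table}"

rel = LazyRelation("fights").assign("Speed_diff", "Speed_1 - Speed_2")
print(rel.current_relation())
# SELECT *, Speed_1 - Speed_2 AS Speed_diff FROM fights
```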
Let's look at the correlations between all the variables.
%matplotlib inline
fights.corr(method = "spearman")
Many variables are correlated with the response column. We have enough information to create our predictive model.
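As a reminder, Spearman correlation is just Pearson correlation computed on ranks, which makes it robust to monotonic but non-linear relationships. A stdlib-only sketch for two numeric columns:

```python
def rank(values):
    # Average 1-based ranks, with ties sharing their group's mean rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Pearson correlation of the two rank vectors.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

VerticaPy computes the full correlation matrix in-database; this sketch only shows the metric itself.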
Machine Learning
Some of the most important features are categorical, and random forests handle these well. We also need trees deep enough to compare all the different types.
from verticapy.learn.ensemble import RandomForestClassifier
from verticapy.learn.model_selection import cross_validate
predictors = fights.get_columns(exclude_columns = ['Winner'])
model = RandomForestClassifier("rf_pokemon",
n_estimators = 50,
max_depth = 100,
max_leaf_nodes = 400,
nbins = 100)
cross_validate(model, fights, predictors, "Winner")
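The AUC reported by cross_validate measures how often the model ranks an actual winner above an actual loser. A stdlib-only sketch of that rank-based computation, on hypothetical predicted probabilities:

```python
def auc(labels, scores):
    # Probability that a random positive outranks a random negative,
    # counting ties as half a win.
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities that the first Pokemon wins.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.8, 0.4, 0.35]
print(auc(labels, scores))  # 0.8333... (5 of 6 positive/negative pairs ranked correctly)
```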
We have an excellent model, with an average AUC of more than 99%. Let's train a model on the entire dataset and look at the importance of each feature.
model.fit(fights,
predictors,
"Winner").features_importance()
Based on our model, it seems that a Pokemon's speed and attack stats are the strongest predictors for the winner of a battle.
Conclusion
We've solved our problem in a Pandas-like way, all without ever loading data into memory!
About the Author
Badr Ouali
Head of Data Science
Badr Ouali works as a Lead Data Scientist for Vertica worldwide. He can embrace data projects end to end through a clear understanding of the “big picture” as well as attention to detail, resulting in great business outcomes – a distinctive differentiator in his role. Badr enjoys sharing knowledge and insights related to data analytics with colleagues and peers, and has a sweet spot for Python. He loves helping customers find the best value in their data and empowering them to solve their use cases.
