Pokemon
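This example uses the 'pokemon' and 'combats' datasets to predict the winner of a 1-on-1 Pokemon battle. You can download the Jupyter Notebook of the study and the two datasets. The 'pokemon' dataset describes each Pokemon (Name through Type_2 below), and the 'combats' dataset records each battle (First_pokemon, Second_pokemon, and Winner):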
- Name: The name of the Pokemon
- Generation: Pokemon's generation
- Legendary: True if the Pokemon is legendary
- HP: Number of hit points
- Attack: Attack stat
- Sp_Atk: Special attack stat
- Defense: Defense stat
- Sp_Def: Special defense stat
- Speed: Speed stat
- Type_1: Pokemon's first type
- Type_2: Pokemon's second type
- First_pokemon: Pokemon of trainer 1
- Second_pokemon: Pokemon of trainer 2
- Winner: Winner of the battle
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN". For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's ingest the datasets.
import verticapy.stats as st
vp.drop('combats')
combats = vp.read_csv('data/combats.csv')
combats.head(5)
vp.drop('pokemon')
pokemon = vp.read_csv('data/pokemon.csv')
pokemon.head(5)
Data Exploration and Preparation
The table 'combats' will be joined to the table 'pokemon' to predict the winner.
The 'pokemon' table contains the information on each Pokemon. Let's describe this table.
pokemon.describe(method = "categorical", unique = True)
We assume that a Pokemon's 'Name', 'Generation', and 'Legendary' status do not influence the outcome of a battle, so we can drop these columns.
pokemon.drop(["Generation",
"Legendary",
"Name"])
The 'ID' will be the key to join the data. By joining the data, we will be able to create more relevant features.
fights = pokemon.join(
    combats,
    on = {"ID": "First_Pokemon"},
    how = "inner",
    expr1 = [
        "Sp_Atk AS Sp_Atk_1",
        "Speed AS Speed_1",
        "Sp_Def AS Sp_Def_1",
        "Defense AS Defense_1",
        "Type_1 AS Type_1_1",
        "Type_2 AS Type_2_1",
        "HP AS HP_1",
        "Attack AS Attack_1",
    ],
    expr2 = ["First_Pokemon", "Second_Pokemon", "Winner"],
).join(
    pokemon,
    on = {"Second_Pokemon": "ID"},
    how = "inner",
    expr2 = [
        "Sp_Atk AS Sp_Atk_2",
        "Speed AS Speed_2",
        "Sp_Def AS Sp_Def_2",
        "Defense AS Defense_2",
        "Type_1 AS Type_1_2",
        "Type_2 AS Type_2_2",
        "HP AS HP_2",
        "Attack AS Attack_2",
    ],
    expr1 = [
        "Sp_Atk_1",
        "Speed_1",
        "Sp_Def_1",
        "Defense_1",
        "Type_1_1",
        "Type_2_1",
        "HP_1",
        "Attack_1",
        "Winner",
        "Second_pokemon",
    ],
)
Feature engineering is key. Here, we can create features that describe the stat differences between the first and second Pokemon. We can also change 'Winner' to a binary value: 1 if the first Pokemon won and 0 otherwise.
fights["Sp_Atk_diff"] = fights["Sp_Atk_1"] - fights["Sp_Atk_2"]
fights["Speed_diff"] = fights["Speed_1"] - fights["Speed_2"]
fights["Sp_Def_diff"] = fights["Sp_Def_1"] - fights["Sp_Def_2"]
fights["Defense_diff"] = fights["Defense_1"] - fights["Defense_2"]
fights["HP_diff"] = fights["HP_1"] - fights["HP_2"]
fights["Attack_diff"] = fights["Attack_1"] - fights["Attack_2"]
fights["Winner"] = st.case_when(fights["Winner"] == fights["Second_pokemon"], 0, 1)
fights = fights[["Sp_Atk_diff", "Speed_diff", "Sp_Def_diff",
                 "Defense_diff", "HP_diff", "Attack_diff",
                 "Type_1_1", "Type_1_2", "Type_2_1", "Type_2_2",
                 "Winner"]]
display(fights)
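To make the logic of the double join and the derived features easier to follow, here is a minimal sketch that reproduces the same idea with Python's built-in sqlite3 module instead of Vertica. The toy tables, their values, and the reduced column set (only HP and Speed) are assumptions made for illustration. The SQL that VerticaPy actually generates may look different.

```python
import sqlite3

# In-memory toy tables; all values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pokemon (ID INTEGER, HP INTEGER, Speed INTEGER)")
conn.execute(
    "CREATE TABLE combats (First_pokemon INTEGER, Second_pokemon INTEGER, Winner INTEGER)"
)
conn.executemany(
    "INSERT INTO pokemon VALUES (?, ?, ?)",
    [(1, 45, 45), (2, 60, 60), (3, 80, 100)],
)
conn.executemany(
    "INSERT INTO combats VALUES (?, ?, ?)",
    [(1, 2, 2), (3, 1, 3)],
)

# 'pokemon' is joined twice: p1 carries the first Pokemon's stats,
# p2 the second's. The CASE expression plays the role of case_when:
# 0 if the second Pokemon won, 1 otherwise.
query = """
    SELECT p1.HP - p2.HP       AS HP_diff,
           p1.Speed - p2.Speed AS Speed_diff,
           CASE WHEN c.Winner = c.Second_pokemon THEN 0 ELSE 1 END AS Winner
    FROM combats AS c
    INNER JOIN pokemon AS p1 ON p1.ID = c.First_pokemon
    INNER JOIN pokemon AS p2 ON p2.ID = c.Second_pokemon
    ORDER BY c.First_pokemon
"""
rows = conn.execute(query).fetchall()
for hp_diff, speed_diff, winner in rows:
    print(hp_diff, speed_diff, winner)
```

Each output row corresponds to one battle: a negative difference means the second Pokemon has the higher stat, and the last value is the binary label.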
Most machine learning models cannot handle missing values. Let's see which features we should impute.
fights.count()
In terms of missing values, our only concern is the Pokemon's second type (Type_2_1 and Type_2_2). Since some Pokemon only have one type, these features are MNAR (missing not at random). We can impute the missing values by creating a new category, "No".
fights["Type_2_1"].fillna("No")
fights["Type_2_2"].fillna("No")
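The counting and imputation steps above can be illustrated without a database. The following is a small sketch in plain Python, assuming toy rows with made-up values where None marks a missing second type. The helper names count_present and fill_missing are hypothetical and are not part of VerticaPy.

```python
# Toy rows with hypothetical values; None marks a missing second type.
rows = [
    {"Type_1_1": "Grass", "Type_2_1": "Poison"},
    {"Type_1_1": "Fire", "Type_2_1": None},
    {"Type_1_1": "Water", "Type_2_1": None},
]

def count_present(rows, column):
    """Number of rows where `column` is not missing."""
    return sum(1 for row in rows if row[column] is not None)

def fill_missing(rows, column, value):
    """Return copies of `rows` with missing `column` entries replaced by `value`."""
    return [
        {**row, column: value if row[column] is None else row[column]}
        for row in rows
    ]

before = count_present(rows, "Type_2_1")
filled = fill_missing(rows, "Type_2_1", "No")
after = count_present(filled, "Type_2_1")
print(before, after)
print([row["Type_2_1"] for row in filled])
```

Using a dedicated category rather than dropping rows keeps every battle in the dataset and lets a model treat "no second type" as information in its own right.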