Pokemon

This example uses the 'pokemon' and 'combats' datasets to predict the winner of a 1-on-1 Pokemon battle. You can download the Jupyter Notebook of the study here and two datasets:

pokemon

  • Name: The name of the Pokemon
  • Generation: Pokemon's generation
  • Legendary: True if the Pokemon is legendary
  • HP: Number of hit points
  • Attack: Attack stat
  • Sp_Atk: Special attack stat
  • Defense: Defense stat
  • Sp_Def: Special defense stat
  • Speed:
  • Speed stat
  • Type_1: Pokemon's first type
  • Type_2: Pokemon's second type

combats

  • First_pokemon: Pokemon of trainer 1
  • Second_pokemon: Pokemon of trainer 2
  • Winner: Winner of the battle

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [15]:
import verticapy as vp
vp.__version__
Out[15]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [1]:
vp.connect("VerticaDSN")

Let's ingest the datasets.

In [2]:
import verticapy.stats as st
vp.drop('combats')
combats = vp.read_csv('data/combats.csv')
combats.head(5)
Out[2]:
123
First_pokemon
Int
123
Second_pokemon
Int
123
Winner
Int
1166
212626
313737
414343
515454
Rows: 1-5 | Columns: 3
In [4]:
vp.drop('pokemon')
pokemon = vp.read_csv('data/pokemon.csv')
pokemon.head(5)
Out[4]:
123
ID
Int
Abc
Name
Varchar(50)
Abc
Type_1
Varchar(20)
Abc
Type_2
Varchar(20)
123
HP
Int
123
Attack
Int
123
Defense
Int
123
Sp_Atk
Int
123
Sp_Def
Int
123
Speed
Int
123
Generation
Int
010
Legendary
Boolean
11BulbasaurGrassPoison4549496565451
22IvysaurGrassPoison6062638080601
33VenusaurGrassPoison808283100100801
44Mega VenusaurGrassPoison80100123122120801
55CharmanderFire[null]3952436050651
Rows: 1-5 | Columns: 12

Data Exploration and Preparation

The table 'combats' will be joined to the table 'pokemon' to predict the winner.

The 'pokemon' table contains the information on each Pokemon. Let's describe this table.

In [5]:
pokemon.describe(method = "categorical", unique = True)
Out[5]:
dtype
count
top
top_percent
unique
"ID"int80010.125800.0
"Name"varchar(50)799Deino0.125799.0
"Type_1"varchar(20)800Water14.018.0
"Type_2"varchar(20)414[null]48.2518.0
"HP"int800608.37594.0
"Attack"int8001005.0111.0
"Defense"int800706.75103.0
"Sp_Atk"int800606.375105.0
"Sp_Def"int800806.592.0
"Speed"int800505.75108.0
"Generation"int800120.756.0
"Legendary"boolean800
91.8752.0
Rows: 1-12 | Columns: 6

The pokemon's 'Name', 'Generation', and whether or not it's 'Legendary' will never influence the outcome of the battle, so we can drop these columns.

In [6]:
pokemon.drop(["Generation", 
              "Legendary", 
              "Name"])
Out[6]:
123
ID
Int
Abc
Type_1
Varchar(20)
Abc
Type_2
Varchar(20)
123
HP
Int
123
Attack
Int
123
Defense
Int
123
Sp_Atk
Int
123
Sp_Def
Int
123
Speed
Int
11GrassPoison454949656545
22GrassPoison606263808060
33GrassPoison80828310010080
44GrassPoison8010012312212080
55Fire[null]395243605065
66Fire[null]586458806580
77FireFlying78847810985100
88FireDragon7813011113085100
99FireFlying7810478159115100
1010Water[null]444865506443
1111Water[null]596380658058
1212Water[null]79831008510578
1313Water[null]7910312013511578
1414Bug[null]453035202045
1515Bug[null]502055252530
1616BugFlying604550908070
1717BugPoison403530202050
1818BugPoison452550252535
1919BugPoison659040458075
2020BugPoison65150401580145
2121NormalFlying404540353556
2222NormalFlying636055505071
2323NormalFlying8380757070101
2424NormalFlying83808013580121
2525Normal[null]305635253572
2626Normal[null]558160507097
2727NormalFlying406030313170
2828NormalFlying6590656161100
2929Poison[null]356044405455
3030Poison[null]608569657980
3131Electric[null]355540505090
3232Electric[null]6090559080110
3333Ground[null]507585203040
3434Ground[null]75100110455565
3535Poison[null]554752404041
3636Poison[null]706267555556
3737PoisonGround909287758576
3838Poison[null]465740404050
3939Poison[null]617257555565
4040PoisonGround8110277857585
4141Fairy[null]704548606535
4242Fairy[null]957073959060
4343Fire[null]384140506565
4444Fire[null]73767581100100
4545NormalFairy1154520452520
4646NormalFairy1407045855045
4747PoisonFlying404535304055
4848PoisonFlying758070657590
4949GrassPoison455055756530
5050GrassPoison606570857540
5151GrassPoison7580851109050
5252BugGrass357055455525
5353BugGrass609580608030
5454BugPoison605550405545
5555BugPoison706560907590
5656Ground[null]105525354595
5757Ground[null]3580505070120
5858Normal[null]404535404090
5959Normal[null]6570606565115
6060Water[null]505248655055
6161Water[null]808278958085
6262Fighting[null]408035354570
6363Fighting[null]6510560607095
6464Fire[null]557045705060
6565Fire[null]90110801008095
6666Water[null]405040404090
6767Water[null]656565505090
6868WaterFighting909595709070
6969Psychic[null]2520151055590
7070Psychic[null]40353012070105
7171Psychic[null]55504513595120
7272Psychic[null]55506517595150
7373Fighting[null]708050353535
7474Fighting[null]8010070506045
7575Fighting[null]9013080658555
7676GrassPoison507535703040
7777GrassPoison659050854555
7878GrassPoison80105651007070
7979WaterPoison4040355010070
8080WaterPoison80706580120100
8181RockGround4080100303020
8282RockGround5595115454535
8383RockGround80120130556545
8484Fire[null]508555656590
8585Fire[null]65100708080105
8686WaterPsychic906565404015
8787WaterPsychic95751101008030
8888WaterPsychic95751801308030
8989ElectricSteel253570955545
9090ElectricSteel5060951207070
9191NormalFlying526555586260
9292NormalFlying358545353575
9393NormalFlying60110706060100
9494Water[null]654555457045
9595WaterIce907080709570
9696Poison[null]808050405025
9797Poison[null]105105756510050
9898Water[null]3065100452540
9999WaterIce5095180854570
100100GhostPoison3035301003580
Rows: 1-100 of 800 | Columns: 9

The 'ID' will be the key to join the data. By joining the data, we will be able to create more relevant features.

In [7]:
fights = pokemon.join(combats, 
                      on = {"ID": "First_Pokemon"}, 
                      how = "inner",
                      expr1 = ["Sp_Atk AS Sp_Atk_1", 
                               "Speed AS Speed_1", 
                               "Sp_Def AS Sp_Def_1", 
                               "Defense AS Defense_1", 
                               "Type_1 AS Type_1_1", 
                               "Type_2 AS Type_2_1", 
                               "HP AS HP_1", 
                               "Attack AS Attack_1"],
                      expr2 = ["First_Pokemon", 
                               "Second_Pokemon", 
                               "Winner"]).join(pokemon, 
                      on = {"Second_Pokemon": "ID"}, 
                      how = "inner",
                      expr2 = ["Sp_Atk AS Sp_Atk_2", 
                               "Speed AS Speed_2", 
                               "Sp_Def AS Sp_Def_2", 
                               "Defense AS Defense_2", 
                               "Type_1 AS Type_1_2", 
                               "Type_2 AS Type_2_2", 
                               "HP AS HP_2", 
                               "Attack AS Attack_2"],
                      expr1 = ["Sp_Atk_1", 
                               "Speed_1", 
                               "Sp_Def_1", 
                               "Defense_1", 
                               "Type_1_1", 
                               "Type_2_1", 
                               "HP_1", 
                               "Attack_1", 
                               "Winner", 
                               "Second_pokemon"])

Features engineering is the key. Here, we can create features that describe the stat differences between the first and second Pokemon. We can also change 'winner' to a binary value: 1 if the first pokemon won and 0 otherwise.

In [8]:
fights["Sp_Atk_diff"] = fights["Sp_Atk_1"] - fights["Sp_Atk_2"]
fights["Speed_diff"] = fights["Speed_1"] - fights["Speed_2"]
fights["Sp_Def_diff"] = fights["Sp_Def_1"] - fights["Sp_Def_2"]
fights["Defense_diff"] = fights["Defense_1"] - fights["Defense_2"]
fights["HP_diff"] = fights["HP_1"] - fights["HP_2"]
fights["Attack_diff"] = fights["Attack_1"] - fights["Attack_2"]
fights["Winner"] = st.case_when(fights["Winner"] == fights["Second_pokemon"], 0, 1)
fights = fights[["Sp_Atk_diff", "Speed_diff", "Sp_Def_diff", 
                 "Defense_diff", "HP_diff", "Attack_diff", 
                 "Type_1_1", "Type_1_2", "Type_2_1", "Type_2_2", 
                 "Winner"]]
display(fights)
123
Sp_Atk_diff
Integer
123
Speed_diff
Integer
123
Sp_Def_diff
Integer
123
Defense_diff
Integer
123
HP_diff
Integer
123
Attack_diff
Integer
Abc
Type_1_1
Varchar(20)
Abc
Type_1_2
Varchar(20)
Abc
Type_2_1
Varchar(20)
Abc
Type_2_2
Varchar(20)
123
Winner
Integer
1-15-350-9-13-15GrassFirePoison[null]0
215-52-5-11-10-32GrassNormalPoison[null]0
3-10-31-20-38-45-43GrassPoisonPoisonGround0
415-200978GrassFirePoison[null]0
525010-1-15-6GrassBugPoisonPoison0
60-10-20-31-45-81GrassFightingPoison[null]0
70-10-20-31-45-81GrassFightingPoison[null]0
80-450-6-5-36GrassFirePoison[null]0
9-60-100-36-50-46GrassGrassPoisonPsychic0
1030-42-45-4-5-71GrassFightingPoison[null]0
11-35-70-20-36-15-26GrassWaterPoisonPsychic0
120-35-5-56-15-66GrassRockPoisonWater0
13-5-25-5-16-16-35GrassDragonPoison[null]0
14-35-35-35-46-46-85GrassDragonPoisonFlying0
1529-5919-1519GrassNormalPoisonFlying0
169-22911-3011GrassWaterPoisonElectric0
1725250-161029GrassFairyPoison[null]1
18-155-40-36-109GrassFairyPoisonFlying1
19010209-109GrassElectricPoison[null]1
20-3515-45-31-50-26GrassWaterPoisonPsychic1
21-7-3171-3-23GrassPsychicPoison[null]0
22-25-400-16-25-31GrassNormalPoisonPsychic0
23505-26-45-71GrassFairyPoison[null]0
245-55-31-55-51GrassIcePoisonGround0
25-75-70-25-41-30-41GrassDarkPoisonFire0
265-55-71-45-71GrassGroundPoison[null]0
273510351410-6GrassDarkPoison[null]1
2810-403519519GrassWaterPoisonFlying1
29-100-55-70-16-23-36GrassPsychicPoisonFairy0
305-255-31-15-81GrassGrassPoisonFighting0
312015-25-86154GrassRockPoison[null]1
32-15-55-20-36-15-51GrassFightingPoisonPsychic0
330-2025954GrassElectricPoison[null]0
3422512-4-256GrassPoisonPoison[null]1
35-5-153014-85-21GrassWaterPoison[null]0
36-50-105-11-25-66GrassGrassPoisonDark0
3719-15246-51GrassWaterPoisonGround0
38-50-15-51-30-76GrassRockPoisonBug0
3955-55-21-15-41GrassNormalPoison[null]1
40-7-6-22-34-54-19GrassGrassPoisonFlying0
4115-5-35-151-35-51GrassRockPoison[null]0
42-29-2115-1-25-45GrassBugPoisonFlying0
4324-29242-18-14GrassPoisonPoisonDark0
44-5-15-2542524GrassPsychicPoisonFairy0
4515-3710-16-23-41GrassDragonPoisonGround0
465-50-10-61-25-41GrassPoisonPoisonDark0
47-25-1-7-23-29-51GrassGrassPoison[null]0
4820-80-20-16-25-71GrassDarkPoisonIce0
49-65-20-30-61-20-11GrassIcePoison[null]0
50-40-41-42-58-5-16GrassElectricPoisonFire0
51-40-41-42-58-5-16GrassElectricPoisonFire0
52-10-68-30-46-30-26GrassGrassPoison[null]0
53200204-20-14GrassFirePoison[null]0
54-33-562-14-30-49GrassGrassPoison[null]0
5529235-1-5-6GrassNormalPoisonFlying1
5635-1226-10154GrassBugPoisonPoison0
57-15-202595-16GrassDarkPoison[null]0
58-30-20-45-46-25-6GrassPsychicPoison[null]0
59-22-532-14-30-38GrassWaterPoisonFlying0
60-10-585-11-10-26GrassElectricPoisonFlying0
61-5-5-20-46-15-31GrassSteelPoison[null]0
62-35-10059-35-21GrassBugPoison[null]0
63301015-1-14-25GrassGroundPoisonGhost1
645-25-5-51-20-76GrassDarkPoisonSteel0
655-25-5-51-20-76GrassDarkPoisonSteel0
66-80-56-15-21-34-56GrassElectricPoisonFlying0
67-64-63-25-41-46-23GrassWaterPoisonFighting0
68-12-83-12-41-55-79GrassNormalPoisonFighting0
69-44-64-29-3-17-6GrassElectricPoisonNormal0
70157512-21GrassGhostPoisonGrass1
7120-530202110GrassFirePoison[null]0
7235-525-47-15-38GrassGroundPoison[null]0
7340104023145GrassPoisonPoison[null]1
7430-515232221GrassFirePoison[null]0
7520300-170-33GrassBugPoisonGrass1
7640353013-20-18GrassPoisonPoison[null]0
771510-20-12-45-43GrassPoisonPoison[null]1
78-35-3525181512GrassGhostPoisonPoison0
79-20040-52-57GrassGrassPoison[null]0
8010-21-50-46-35-93GrassWaterPoisonDark0
81-50-15-17-70-23GrassWaterPoisonIce0
8215-2010-420-53GrassRockPoisonWater0
83-74-70-20-37-46-128GrassPsychicPoisonFighting0
84361732-110-3GrassWaterPoison[null]1
85404015-22542GrassFairyPoison[null]1
86404015-22542GrassFairyPoison[null]1
87020-25-22522GrassFairyPoisonFlying1
88-50-50-153-5-3GrassPsychicPoison[null]0
89-5-3138210-23GrassDarkPoisonFlying0
90-5-3138210-23GrassDarkPoisonFlying0
91-30-2010-7-20-58GrassFirePoisonFighting0
92604050281517GrassBugPoison[null]1
9315-5604315-28GrassWaterPoisonDark0
94-35-55203-5-88GrassDarkPoison[null]0
95301030131012GrassIcePoison[null]1
96-40-400-17-20-58GrassIcePoison[null]0
97-3485-425-22GrassWaterPoison[null]1
985005033207GrassNormalPoisonFlying0
9920031140-23GrassElectricPoison[null]0
1003330-58-105010GrassRockPoisonSteel0
Rows: 1-100 | Columns: 11

Missing values can not be handled by most machine learning models. Let's see which features we should impute.

In [9]:
fights.count()
Out[9]:
count
"Sp_Atk_diff"50000.0
"Speed_diff"50000.0
"Sp_Def_diff"50000.0
"Defense_diff"50000.0
"HP_diff"50000.0
"Attack_diff"50000.0
"Type_1_1"50000.0
"Type_1_2"50000.0
"Type_2_1"25969.0
"Type_2_2"26015.0
"Winner"50000.0
Rows: 1-11 | Columns: 2

In terms of missing values, our only concern is the Pokemon's second type (Type_2_1 and Type_2_2). Since some Pokemon only have one type, these features are MNAR (missing values not at random). We can impute the missing values by creating another category.

In [10]:
fights["Type_2_1"].fillna("No")
fights["Type_2_2"].fillna("No")
24031 elements were filled.
23985 elements were filled.
Out[10]: