Football

In this exaample, we use the 'football' dataset to predict the outcomes of games between various teams. You can download the Jupyter Notebook of the study here.

  • date: Date of the game
  • home_team: Home Team
  • home_score: Home Team number of goals
  • away_team: Away Team
  • away_score: Away Team number of goals
  • tournament: Game Type (World Cup, Friendly...)
  • city: City where the game took place
  • country: Country where the game took place
  • neutral: If the event took place to a neutral location

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a Virtual DataFrame of the dataset. The dataset is available here.

In [9]:
football = vp.read_csv("data/football.csv")
football.head(5)
Out[9]:
📅
date
Date
Abc
home_team
Varchar(64)
Abc
away_team
Varchar(64)
123
home_score
Int
123
away_score
Int
Abc
tournament
Varchar(84)
Abc
city
Varchar(56)
Abc
country
Varchar(64)
010
neutral
Boolean
11872-11-30ScotlandEngland00FriendlyGlasgowScotland
21873-03-08EnglandScotland42FriendlyLondonEngland
31874-03-07ScotlandEngland21FriendlyGlasgowScotland
41875-03-06EnglandScotland22FriendlyLondonEngland
51876-03-04ScotlandEngland30FriendlyGlasgowScotland
Rows: 1-5 | Columns: 9

Data Exploration

Let's explore the data by displaying descriptive statistics of all the columns.

In [4]:
football["date"].describe()
Out[4]:
value
name"date"
dtypedate
count41586
min1872-11-30
max2020-02-01
Rows: 1-5 | Columns: 2

The dataset includes a total of 41,586 games, which take place between 1872 and 2020. Let's look at our game types and teams.

In [5]:
football["tournament"].describe()
Out[5]:
value
"tournament"
varchar(84)
112.0
41586.0
17029
10630
7236
2582
1672
900
813
Rows: 1-11 | Columns: 2

Different types of tournaments took place (FIFA World Cup, UEFA Euro, etc.) aand most of the games in our data are friendlies or qualifiers for international tournaments.

In [6]:
display(football.describe())
football.describe(method = "categorical")
count
mean
std
min
approx_25%
approx_50%
approx_75%
max
"home_score"415861.745755783196281.753780340476990.01.01.02.031.0
"away_score"415861.187587168758721.405323468335890.00.01.02.021.0
"neutral"415860.2472466695522520.4314165382756490.00.00.00.01.0
Rows: 1-3 | Columns: 9
Out[6]:
dtype
count
top
top_percent
"date"date415862012-02-290.159
"home_team"varchar(64)41586Brazil1.366
"away_team"varchar(64)41586Uruguay1.301
"home_score"int41586129.57
"away_score"int41586037.135
"tournament"varchar(84)41586Friendly40.949
"city"varchar(56)41586Kuala Lumpur1.416
"country"varchar(64)41586United States2.787
"neutral"boolean41586
75.275
Rows: 1-9 | Columns: 5

The dataset includes 308 national teams. For most of the games, the home team scores better than the away team. Since some games take place in a neutral location, we can ensure this hypothesis using the variable 'neutral.' Notice also that the number of goals per match is pretty low (median of 1 for both away and home teams).

Goal

Our goal for the study will be to predict the outcomes of games after 2015.

Before doing the study, we can notice that some teams names have changed over time. We need to change the old names by the new names otherwise it will add too much bias in the data.

In [10]:
for team in ["home_team", "away_team"]:
    football[team].decode('German DR', 'Germany',
                          'Czechoslovakia', 'Czech Republic',
                          'Yugoslavia', 'Serbia',
                          'Yemen DPR', 'Yemen',
                          football[team])

Let's just consider teams that have played more than five home and away games.

In [11]:
football["cnt_games_1"] = "COUNT(*) OVER (PARTITION BY home_team)"
football["cnt_games_2"] = "COUNT(*) OVER (PARTITION BY away_team)"
football.filter((football["cnt_games_2"] > 5) & (football["cnt_games_1"] > 5))
vp.drop("football_clean", relation_type = "table")
football.to_db(name = "football_clean",
               usecols = ["date", 
                          "home_score", 
                          "home_team", 
                          "tournament", 
                          "away_team", 
                          "away_score", 
                          "neutral", 
                          "country",
                          "city"],
               relation_type = "table",
               inplace = True)
177 elements were filtered
Out[11]:
📅
date
Date
123
home_score
Int
Abc
home_team
Varchar(64)
Abc
tournament
Varchar(84)
Abc
away_team
Varchar(64)
123
away_score
Int
010
neutral
Boolean
Abc
country
Varchar(64)
Abc
city
Varchar(56)
11872-11-300ScotlandFriendlyEngland0
ScotlandGlasgow
21873-03-084EnglandFriendlyScotland2
EnglandLondon
31874-03-072ScotlandFriendlyEngland1
ScotlandGlasgow
41875-03-062EnglandFriendlyScotland2
EnglandLondon
51876-03-043ScotlandFriendlyEngland0
ScotlandGlasgow
61876-03-254ScotlandFriendlyWales0
ScotlandGlasgow
71877-03-031EnglandFriendlyScotland3
EnglandLondon
81877-03-050WalesFriendlyScotland2
WalesWrexham
91878-03-027ScotlandFriendlyEngland2
ScotlandGlasgow
101878-03-239ScotlandFriendlyWales0
ScotlandGlasgow
111879-01-182EnglandFriendlyWales1
EnglandLondon
121879-04-055EnglandFriendlyScotland4
EnglandLondon
131879-04-070WalesFriendlyScotland3
WalesWrexham
141880-03-135ScotlandFriendlyEngland4
ScotlandGlasgow
151880-03-152WalesFriendlyEngland3
WalesWrexham
161880-03-275ScotlandFriendlyWales1
ScotlandGlasgow
171881-02-260EnglandFriendlyWales1
EnglandBlackburn
181881-03-121EnglandFriendlyScotland6
EnglandLondon
191881-03-141WalesFriendlyScotland5
WalesWrexham
201882-02-180Northern IrelandFriendlyEngland13
Republic of IrelandBelfast
211882-02-257WalesFriendlyNorthern Ireland1
WalesWrexham
221882-03-115ScotlandFriendlyEngland1
ScotlandGlasgow
231882-03-135WalesFriendlyEngland3
WalesWrexham
241882-03-255ScotlandFriendlyWales0
ScotlandGlasgow
251883-02-035EnglandFriendlyWales0
EnglandLondon
261883-02-247EnglandFriendlyNorthern Ireland0
EnglandLiverpool
271883-03-102EnglandFriendlyScotland3
EnglandSheffield
281883-03-120WalesFriendlyScotland3
WalesWrexham
291883-03-171Northern IrelandFriendlyWales1
Republic of IrelandBelfast
301884-01-260Northern IrelandBritish ChampionshipScotland5
Republic of IrelandBelfast
311884-02-096WalesBritish ChampionshipNorthern Ireland0
WalesWrexham
321884-02-231Northern IrelandBritish ChampionshipEngland8
Republic of IrelandBelfast
331884-03-151ScotlandBritish ChampionshipEngland0
ScotlandGlasgow
341884-03-170WalesBritish ChampionshipEngland4
WalesWrexham
351884-03-294ScotlandBritish ChampionshipWales1
ScotlandGlasgow
361885-02-284EnglandBritish ChampionshipNorthern Ireland0
EnglandManchester
371885-03-141EnglandBritish ChampionshipWales1
EnglandBlackburn
381885-03-148ScotlandBritish ChampionshipNorthern Ireland2
ScotlandGlasgow
391885-03-211EnglandBritish ChampionshipScotland1
EnglandLondon
401885-03-231WalesBritish ChampionshipScotland8
WalesWrexham
411885-04-112Northern IrelandBritish ChampionshipWales8
Republic of IrelandBelfast
421885-11-280United StatesFriendlyCanada1
United StatesNewark
431886-02-275WalesBritish ChampionshipNorthern Ireland0
WalesWrexham
441886-03-131Northern IrelandBritish ChampionshipEngland6
Republic of IrelandBelfast
451886-03-202Northern IrelandBritish ChampionshipScotland7
Republic of IrelandBelfast
461886-03-271ScotlandBritish ChampionshipEngland1
ScotlandGlasgow
471886-03-291WalesBritish ChampionshipEngland3
WalesWrexham
481886-04-104ScotlandBritish ChampionshipWales1
ScotlandGlasgow
491886-11-253United StatesFriendlyCanada2
United StatesNewark
501887-02-057EnglandBritish ChampionshipNorthern Ireland0
EnglandSheffield
511887-02-194ScotlandBritish ChampionshipNorthern Ireland1
ScotlandGlasgow
521887-02-264EnglandBritish ChampionshipWales0
EnglandLondon
531887-03-124Northern IrelandBritish ChampionshipWales1
Republic of IrelandBelfast
541887-03-192EnglandBritish ChampionshipScotland3
EnglandBlackburn
551887-03-210WalesBritish ChampionshipScotland2
WalesWrexham
561888-02-045EnglandBritish ChampionshipWales1
EnglandCrewe
571888-03-0311WalesBritish ChampionshipNorthern Ireland0
WalesWrexham
581888-03-105ScotlandBritish ChampionshipWales1
ScotlandEdinburgh
591888-03-170ScotlandBritish ChampionshipEngland5
ScotlandGlasgow
601888-03-242Northern IrelandBritish ChampionshipScotland10
Republic of IrelandBelfast
611888-04-071Northern IrelandBritish ChampionshipEngland5
Republic of IrelandBelfast
621888-09-194ScotlandFriendlyCanada0
ScotlandGlasgow
631889-02-234EnglandBritish ChampionshipWales1
EnglandStoke-on-Trent
641889-03-026EnglandBritish ChampionshipNorthern Ireland1
EnglandLiverpool
651889-03-097ScotlandBritish ChampionshipNorthern Ireland0
ScotlandGlasgow
661889-04-132EnglandBritish ChampionshipScotland3
EnglandLondon
671889-04-150WalesBritish ChampionshipScotland0
WalesWrexham
681889-04-271Northern IrelandBritish ChampionshipWales3
Republic of IrelandBelfast
691890-02-085WalesBritish ChampionshipNorthern Ireland2
EnglandShrewsbury
701890-03-151Northern IrelandBritish ChampionshipEngland9
Republic of IrelandBelfast
711890-03-151WalesBritish ChampionshipEngland3
WalesWrexham
721890-03-225ScotlandBritish ChampionshipWales0
ScotlandPaisley
731890-03-291Northern IrelandBritish ChampionshipScotland4
Republic of IrelandBelfast
741890-04-051ScotlandBritish ChampionshipEngland1
ScotlandGlasgow
751891-02-077Northern IrelandBritish ChampionshipWales2
Republic of IrelandBelfast
761891-03-074EnglandBritish ChampionshipWales1
EnglandSunderland
771891-03-076EnglandBritish ChampionshipNorthern Ireland1
EnglandWolverhampton
781891-03-213WalesBritish ChampionshipScotland4
WalesWrexham
791891-03-282ScotlandBritish ChampionshipNorthern Ireland1
ScotlandGlasgow
801891-04-062EnglandBritish ChampionshipScotland1
EnglandBlackburn
811892-02-271WalesBritish ChampionshipNorthern Ireland1
WalesBangor
821892-03-050Northern IrelandBritish ChampionshipEngland2
Republic of IrelandBelfast
831892-03-050WalesBritish ChampionshipEngland2
WalesWrexham
841892-03-192Northern IrelandBritish ChampionshipScotland3
Republic of IrelandBelfast
851892-03-266ScotlandBritish ChampionshipWales1
ScotlandEdinburgh
861892-04-021ScotlandBritish ChampionshipEngland4
ScotlandGlasgow
871893-02-256EnglandBritish ChampionshipNorthern Ireland1
EnglandBirmingham
881893-03-136EnglandBritish ChampionshipWales0
EnglandStoke-on-Trent
891893-03-180WalesBritish ChampionshipScotland8
WalesWrexham
901893-03-256ScotlandBritish ChampionshipNorthern Ireland1
ScotlandGlasgow
911893-04-015EnglandBritish ChampionshipScotland2
EnglandRichmond
921894-02-244WalesBritish ChampionshipNorthern Ireland1
WalesSwansea
931894-03-032Northern IrelandBritish ChampionshipEngland2
Republic of IrelandBelfast
941894-03-121WalesBritish ChampionshipEngland5
WalesWrexham
951894-03-245ScotlandBritish ChampionshipWales2
ScotlandKilmarnock
961894-03-311Northern IrelandBritish ChampionshipScotland2
Republic of IrelandBelfast
971894-04-072ScotlandBritish ChampionshipEngland2
ScotlandGlasgow
981895-03-099EnglandBritish ChampionshipNorthern Ireland0
EnglandDerby
991895-03-162Northern IrelandBritish ChampionshipWales2
Republic of IrelandBelfast
1001895-03-181EnglandBritish ChampionshipWales1
EnglandLondon
Rows: 1-100 | Columns: 9

A lot of things could influence the outcome of a game. Since we only have access to the score, teams, and type of game, we can't consider external factors like, weather or temperature, which would otherwise help our prediction.

To create a good model using this dataset, we could compute each team's key performance indicator (KPI), ranking (clusters computed using the number of games in important tournaments like the World Cup, the percentage of victory...), shape (moving windows using the last games information), and other factors.

Here's our plan:

  • Identify cup winners
  • Rank the teams with clustering
  • Compute teams' KPIs
  • Create a machine learning model

Data Preparation for Clustering

To create clusters, we need to find which teams are the winners of main tournaments (mainly the World Cups and Continental Cups). Since all tournaments took place the same year, we could partition by tournament and year to identify the last game of the tournament.

We'll ignore ties for our analysis since there's no way to determine a winner.

Cup Winner

Let's start by creating the feature 'winner' to indicate the winner of a game.

In [13]:
import verticapy.stats as st
football.filter(st.year(football["date"]) <= 2015)
football.case_when("winner",
                   football["home_score"] > football["away_score"], football["home_team"],
                   football["home_score"] < football["away_score"], football["away_team"],
                   None)
3884 elements were filtered
Out[13]:
📅
date
Date
123
home_score
Int
Abc
home_team
Varchar(64)
Abc
tournament
Varchar(84)
Abc
away_team
Varchar(64)
123
away_score
Int
010
neutral
Boolean
Abc
country
Varchar(64)
Abc
city
Varchar(56)
Abc
winner
Varchar(64)
11872-11-300ScotlandFriendlyEngland0
ScotlandGlasgow[null]
21873-03-084EnglandFriendlyScotland2
EnglandLondonEngland
31874-03-072ScotlandFriendlyEngland1
ScotlandGlasgowScotland
41875-03-062EnglandFriendlyScotland2
EnglandLondon[null]
51876-03-043ScotlandFriendlyEngland0
ScotlandGlasgowScotland
61876-03-254ScotlandFriendlyWales0
ScotlandGlasgowScotland
71877-03-031EnglandFriendlyScotland3
EnglandLondonScotland
81877-03-050WalesFriendlyScotland2
WalesWrexhamScotland
91878-03-027ScotlandFriendlyEngland2
ScotlandGlasgowScotland
101878-03-239ScotlandFriendlyWales0
ScotlandGlasgowScotland
111879-01-182EnglandFriendlyWales1
EnglandLondonEngland
121879-04-055EnglandFriendlyScotland4
EnglandLondonEngland
131879-04-070WalesFriendlyScotland3
WalesWrexhamScotland
141880-03-135ScotlandFriendlyEngland4
ScotlandGlasgowScotland
151880-03-152WalesFriendlyEngland3
WalesWrexhamEngland
161880-03-275ScotlandFriendlyWales1
ScotlandGlasgowScotland
171881-02-260EnglandFriendlyWales1
EnglandBlackburnWales
181881-03-121EnglandFriendlyScotland6
EnglandLondonScotland
191881-03-141WalesFriendlyScotland5
WalesWrexhamScotland
201882-02-180Northern IrelandFriendlyEngland13
Republic of IrelandBelfastEngland
211882-02-257WalesFriendlyNorthern Ireland1
WalesWrexhamWales
221882-03-115ScotlandFriendlyEngland1
ScotlandGlasgowScotland
231882-03-135WalesFriendlyEngland3
WalesWrexhamWales
241882-03-255ScotlandFriendlyWales0
ScotlandGlasgowScotland
251883-02-035EnglandFriendlyWales0
EnglandLondonEngland
261883-02-247EnglandFriendlyNorthern Ireland0
EnglandLiverpoolEngland
271883-03-102EnglandFriendlyScotland3
EnglandSheffieldScotland
281883-03-120WalesFriendlyScotland3
WalesWrexhamScotland
291883-03-171Northern IrelandFriendlyWales1
Republic of IrelandBelfast[null]
301884-01-260Northern IrelandBritish ChampionshipScotland5
Republic of IrelandBelfastScotland
311884-02-096WalesBritish ChampionshipNorthern Ireland0
WalesWrexhamWales
321884-02-231Northern IrelandBritish ChampionshipEngland8
Republic of IrelandBelfastEngland
331884-03-151ScotlandBritish ChampionshipEngland0
ScotlandGlasgowScotland
341884-03-170WalesBritish ChampionshipEngland4
WalesWrexhamEngland
351884-03-294ScotlandBritish ChampionshipWales1
ScotlandGlasgowScotland
361885-02-284EnglandBritish ChampionshipNorthern Ireland0
EnglandManchesterEngland
371885-03-141EnglandBritish ChampionshipWales1
EnglandBlackburn[null]
381885-03-148ScotlandBritish ChampionshipNorthern Ireland2
ScotlandGlasgowScotland
391885-03-211EnglandBritish ChampionshipScotland1
EnglandLondon[null]
401885-03-231WalesBritish ChampionshipScotland8
WalesWrexhamScotland
411885-04-112Northern IrelandBritish ChampionshipWales8
Republic of IrelandBelfastWales
421885-11-280United StatesFriendlyCanada1
United StatesNewarkCanada
431886-02-275WalesBritish ChampionshipNorthern Ireland0
WalesWrexhamWales
441886-03-131Northern IrelandBritish ChampionshipEngland6
Republic of IrelandBelfastEngland
451886-03-202Northern IrelandBritish ChampionshipScotland7
Republic of IrelandBelfastScotland
461886-03-271ScotlandBritish ChampionshipEngland1
ScotlandGlasgow[null]
471886-03-291WalesBritish ChampionshipEngland3
WalesWrexhamEngland
481886-04-104ScotlandBritish ChampionshipWales1
ScotlandGlasgowScotland
491886-11-253United StatesFriendlyCanada2
United StatesNewarkUnited States
501887-02-057EnglandBritish ChampionshipNorthern Ireland0
EnglandSheffieldEngland
511887-02-194ScotlandBritish ChampionshipNorthern Ireland1
ScotlandGlasgowScotland
521887-02-264EnglandBritish ChampionshipWales0
EnglandLondonEngland
531887-03-124Northern IrelandBritish ChampionshipWales1
Republic of IrelandBelfastNorthern Ireland
541887-03-192EnglandBritish ChampionshipScotland3
EnglandBlackburnScotland
551887-03-210WalesBritish ChampionshipScotland2
WalesWrexhamScotland
561888-02-045EnglandBritish ChampionshipWales1
EnglandCreweEngland
571888-03-0311WalesBritish ChampionshipNorthern Ireland0
WalesWrexhamWales
581888-03-105ScotlandBritish ChampionshipWales1
ScotlandEdinburghScotland
591888-03-170ScotlandBritish ChampionshipEngland5
ScotlandGlasgowEngland
601888-03-242Northern IrelandBritish ChampionshipScotland10
Republic of IrelandBelfastScotland
611888-04-071Northern IrelandBritish ChampionshipEngland5
Republic of IrelandBelfastEngland
621888-09-194ScotlandFriendlyCanada0
ScotlandGlasgowScotland
631889-02-234EnglandBritish ChampionshipWales1
EnglandStoke-on-TrentEngland
641889-03-026EnglandBritish ChampionshipNorthern Ireland1
EnglandLiverpoolEngland
651889-03-097ScotlandBritish ChampionshipNorthern Ireland0
ScotlandGlasgowScotland
661889-04-132EnglandBritish ChampionshipScotland3
EnglandLondonScotland
671889-04-150WalesBritish ChampionshipScotland0
WalesWrexham[null]
681889-04-271Northern IrelandBritish ChampionshipWales3
Republic of IrelandBelfastWales
691890-02-085WalesBritish ChampionshipNorthern Ireland2
EnglandShrewsburyWales
701890-03-151Northern IrelandBritish ChampionshipEngland9
Republic of IrelandBelfastEngland
711890-03-151WalesBritish ChampionshipEngland3
WalesWrexhamEngland
721890-03-225ScotlandBritish ChampionshipWales0
ScotlandPaisleyScotland
731890-03-291Northern IrelandBritish ChampionshipScotland4
Republic of IrelandBelfastScotland
741890-04-051ScotlandBritish ChampionshipEngland1
ScotlandGlasgow[null]
751891-02-077Northern IrelandBritish ChampionshipWales2
Republic of IrelandBelfastNorthern Ireland
761891-03-074EnglandBritish ChampionshipWales1
EnglandSunderlandEngland
771891-03-076EnglandBritish ChampionshipNorthern Ireland1
EnglandWolverhamptonEngland
781891-03-213WalesBritish ChampionshipScotland4
WalesWrexhamScotland
791891-03-282ScotlandBritish ChampionshipNorthern Ireland1
ScotlandGlasgowScotland
801891-04-062EnglandBritish ChampionshipScotland1
EnglandBlackburnEngland
811892-02-271WalesBritish ChampionshipNorthern Ireland1
WalesBangor[null]
821892-03-050Northern IrelandBritish ChampionshipEngland2
Republic of IrelandBelfastEngland
831892-03-050WalesBritish ChampionshipEngland2
WalesWrexhamEngland
841892-03-192Northern IrelandBritish ChampionshipScotland3
Republic of IrelandBelfastScotland
851892-03-266ScotlandBritish ChampionshipWales1
ScotlandEdinburghScotland
861892-04-021ScotlandBritish ChampionshipEngland4
ScotlandGlasgowEngland
871893-02-256England