Football¶
In this exaample, we use the 'football' dataset to predict the outcomes of games between various teams. You can download the Jupyter Notebook of the study here.
- date: Date of the game
- home_team: Home Team
- home_score: Home Team number of goals
- away_team: Away Team
- away_score: Away Team number of goals
- tournament: Game Type (World Cup, Friendly...)
- city: City where the game took place
- country: Country where the game took place
- neutral: If the event took place to a neutral location
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset. The dataset is available here.
football = vp.read_csv("data/football.csv")
football.head(5)
Data Exploration¶
Let's explore the data by displaying descriptive statistics of all the columns.
football["date"].describe()
The dataset includes a total of 41,586 games, which take place between 1872 and 2020. Let's look at our game types and teams.
football["tournament"].describe()
Different types of tournaments took place (FIFA World Cup, UEFA Euro, etc.) aand most of the games in our data are friendlies or qualifiers for international tournaments.
display(football.describe())
football.describe(method = "categorical")
The dataset includes 308 national teams. For most of the games, the home team scores better than the away team. Since some games take place in a neutral location, we can ensure this hypothesis using the variable 'neutral.' Notice also that the number of goals per match is pretty low (median of 1 for both away and home teams).
Goal
Our goal for the study will be to predict the outcomes of games after 2015.
Before doing the study, we can notice that some teams names have changed over time. We need to change the old names by the new names otherwise it will add too much bias in the data.
for team in ["home_team", "away_team"]:
football[team].decode('German DR', 'Germany',
'Czechoslovakia', 'Czech Republic',
'Yugoslavia', 'Serbia',
'Yemen DPR', 'Yemen',
football[team])
Let's just consider teams that have played more than five home and away games.
football["cnt_games_1"] = "COUNT(*) OVER (PARTITION BY home_team)"
football["cnt_games_2"] = "COUNT(*) OVER (PARTITION BY away_team)"
football.filter((football["cnt_games_2"] > 5) & (football["cnt_games_1"] > 5))
vp.drop("football_clean", relation_type = "table")
football.to_db(name = "football_clean",
usecols = ["date",
"home_score",
"home_team",
"tournament",
"away_team",
"away_score",
"neutral",
"country",
"city"],
relation_type = "table",
inplace = True)
A lot of things could influence the outcome of a game. Since we only have access to the score, teams, and type of game, we can't consider external factors like, weather or temperature, which would otherwise help our prediction.
To create a good model using this dataset, we could compute each team's key performance indicator (KPI), ranking (clusters computed using the number of games in important tournaments like the World Cup, the percentage of victory...), shape (moving windows using the last games information), and other factors.
Here's our plan:
- Identify cup winners
- Rank the teams with clustering
- Compute teams' KPIs
- Create a machine learning model
Data Preparation for Clustering¶
To create clusters, we need to find which teams are the winners of main tournaments (mainly the World Cups and Continental Cups). Since all tournaments took place the same year, we could partition by tournament and year to identify the last game of the tournament.
We'll ignore ties for our analysis since there's no way to determine a winner.
Cup Winner¶
Let's start by creating the feature 'winner' to indicate the winner of a game.
import verticapy.stats as st
football.filter(st.year(football["date"]) <= 2015)
football.case_when("winner",
football["home_score"] > football["away_score"], football["home_team"],
football["home_score"] < football["away_score"], football["away_team"],
None)