Movies Scoring and Clustering¶
This example uses the 'filmtv_movies' dataset to evaluate the quality of the movies and create clusters of similar movies. You can download the Jupyter Notebook of the study here.
- year: Movie's release year
- filmtv_id: Movie ID
- title: Movie title
- genre: Movie genre
- country: Movie's country of origin
- description: Movie description
- notes: Information about the movie
- duration: Movie duration
- votes: Number of votes
- avg_vote: Average score
- director: Movie director
- actors: Actors in the movie
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset. The dataset is available here.
filmtv_movies = vp.read_csv("data/filmtv_movies.csv")
display(filmtv_movies.head(5))
Data Exploration and Preparation¶
One of the biggest challenges for any streaming platform is to find a good catalog of movies.
First, let's explore the dataset.
filmtv_movies.describe(method = 'categorical', unique = True)
We can drop the 'description' and 'notes' columns since these fields are empty for most of our dataset.
filmtv_movies.drop(["description", "notes"])
We have access to more than 50000 movies in 27 different genres. Let's organize our list by their average rating.
filmtv_movies.sort({"avg_vote" : "desc"})
Since we want properly averaged scores, let's just consider the top 10 movies that have at least 10 votes.
filmtv_movies.search(conditions = [filmtv_movies["votes"] > 10],
order_by = {"avg_vote" : "desc" })
We can see classic movies like 'The Godfather' and 'Greed.' Let's smooth the avg_vote using a linear regression to make it more representative.
To create our model we could use the votes, the category, the duration, etc. but let's go with the director and main actors.
We can extract the five main actors for each movie with regular expressions.
for i in range(1, 5):
filmtv_movies2 = vp.read_csv("data/filmtv_movies.csv")
filmtv_movies2.regexp(column = "actors",
method = "substr",
pattern = '[^,]+',
occurrence = i,
name = "actor")
if i == 1:
filmtv_movies = filmtv_movies2.copy()
else:
filmtv_movies = filmtv_movies.append(filmtv_movies2)
filmtv_movies["actor"].describe()
By aggregating the data, we can find the number of actors and the number of votes by actor. We can then normalize the data using the min-max method and quantify the notoriety of the actors.
import verticapy.stats as st
actors_stats = filmtv_movies.groupby(columns = ["actor"],
expr = [st.sum(filmtv_movies["votes"])._as("notoriety_actors"),
st.count(filmtv_movies["actors"])._as("castings_actors")])
actors_stats["actor"].dropna()
actors_stats["notoriety_actors"].normalize(method = 'minmax')
Let's look at the top ten actors by notoriety.
actors_stats.search(order_by = {"notoriety_actors" : "desc",
"castings_actors" : "desc"}).head(10)
As expected, we get a list of very popular actors like Robert De Niro, Morgan Freeman, and Clint Eastwood.
Let's do the same for the directors.
director_stats = filmtv_movies.groupby(columns = ["director"],
expr = [st.sum(filmtv_movies["votes"])._as("notoriety_director"),
st.count(filmtv_movies["director"])._as("castings_director")])
director_stats["notoriety_director"].normalize(method = 'minmax')
Now let's look at the top 10 movie directors.
director_stats.search(order_by = {"notoriety_director" : "desc",
"castings_director" : "desc" }).head(10)
Again, we get a list of popular directors like Steven Spielberg, Woody Allen, and Clint Eastwood.
Let's join our notoriety metrics for actors and directors with the main dataset.
filmtv_movies_director = filmtv_movies.join(
director_stats,
on = {'director': 'director'},
how = "left",
expr1 = ["*"],
expr2 = ["notoriety_director",
"castings_director"])
filmtv_movies_director_actors = filmtv_movies_director.join(
actors_stats,
on = {'actor': 'actor'},
how = "left",
expr1 = ["*"],
expr2 = ["notoriety_actors",
"castings_actors" ])
As we did many operation, it can be nice to save the vDataFrame as a table in the Vertica database.
vp.drop("filmtv_movies_director_actors", relation_type = "table")
filmtv_movies_director_actors.to_db(name = "filmtv_movies_director_actors",
relation_type = "table",
inplace = True)
We can aggregate the data to get metrics on each movie.
filmtv_movies_complete = filmtv_movies_director_actors.groupby(
columns = ["filmtv_id",
"title",
"year",
"genre",
"country",
"avg_vote",
"votes",
"duration",
"director",
"notoriety_director",
"castings_director"],
expr = [st.sum(filmtv_movies_director_actors["notoriety_actors"])._as("notoriety_actors"),
st.sum(filmtv_movies_director_actors["castings_actors"])._as("castings_actors")])
Let's compute some statistics on our dataset.
filmtv_movies_complete.describe(method = "all")
We can use the movie's release year to get create three categories.
filmtv_movies_complete.case_when('period',
filmtv_movies_complete["year"] < 1990, 'Old',
filmtv_movies_complete["year"] >= 2000, 'Recent', '90s')
Now, let's look at the countries that made the most movies.
filmtv_movies_complete.groupby(columns = ["country"],
expr = ["COUNT(*)"]).sort(
{"count" : "desc"}).head(10)
We can use this variable to create language groups.
from verticapy import str_sql
# Language Discretization
Arabic_Middle_Est = ["Arab", "Iran", "Turkey", "Egypt", "Tunisia",
"Lebanon", "Palestine", "Morocco", "Iraq",
"Sudan", "Algeria", "Yemen", "Afghanistan",
"Azerbaijan", "Kazakhstan", "Kyrgyzstan",
"Kurdistan", "Syria", "Uzbekistan"]
Chinese_Japan_Asian = ["Japan", "Hong Kong", "China", "South Korea",
"Thailand", "Philippines", "Taiwan", "Indonesia",
"Singapore", "Malaysia", "Vietnam", "Laos", "Cambodia",
"Bhutan"]
Indian = ["India", "Pakistan", "Nepal", "Sri Lanka", "Bangladesh"]
Hebrew = ["Israel"]
Spanish_Portuguese = ["Spain", "Portugal", "Mexico", "Brasil", "Chile",
"Argentina", "Colombia", "Cuba", "Venezuela", "Peru",
"Uruguay", "Dominican Republic", "Ecuador", "Guatemala",
"Costa Rica", "Paraguay", "Bolivia"]
English = ["United States", "England", "Great Britain", "Ireland",
"Australia", "New Zealand", "South Africa"]
French = ["France", "Canada", "Belgium", "Switzerland", "Luxembourg"]
Italian = ["Italy"]
German_North_Europe = ["German", "Austria", "Holland", "Netherlands", "Denmark",
"Norway", "Iceland", "Finland", "Sweden", "Greenland"]
Russian_Est_Europe = ["Russia", "Soviet Union", "Yugoslavia", "Czechoslovakia",
"Poland", "Bulgaria", "Croatia", "Czech Republic", "Serbia",
"Ukraine", "Slovenia", "Lithuania", "Latvia", "Estonia",
"Bosnia and Herzegovina", "Georgia"]
Grec_Balkan = ["Greece", "Macedonia", "Cyprus", "Romania", "Armenia", "Hungary",
"Albania", "Malta"]
# Creation of the new feature
filmtv_movies_complete.case_when('language_area',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Arabic_Middle_Est))), 'Arabic_Middle_Est',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Chinese_Japan_Asian))), 'Chinese_Japan_Asian',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Indian))), 'Indian',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Hebrew))), 'Hebrew',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Spanish_Portuguese))), 'Spanish_Portuguese',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(English))), 'English',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(French))), 'French',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Italian))), 'Italian',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(German_North_Europe))), 'German_North_Europe',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Russian_Est_Europe))), 'Russian_Est_Europe',
str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Grec_Balkan))), 'Grec_Balkan',
'Others')
We can do the same for the genres.
filmtv_movies_complete.case_when(
'Category',
str_sql("REGEXP_LIKE(Genre, 'Drama|Noir')"), 'Drama',
str_sql("REGEXP_LIKE(Genre, 'Comedy|Grotesque')"), 'Comedy',
str_sql("REGEXP_LIKE(Genre, 'Fantasy|Super-hero')"), 'Fantasy',
str_sql("REGEXP_LIKE(Genre, 'Romantic|Sperimental|Mélo')"), 'Romantic',
str_sql("REGEXP_LIKE(Genre, 'Thriller|Crime|Gangster')"), 'Thriller',
str_sql("REGEXP_LIKE(Genre, 'Action|Western|War|Spy')"), 'Action',
str_sql("REGEXP_LIKE(Genre, 'Adventure')"), 'Adventure',
str_sql("REGEXP_LIKE(Genre, 'Animation')"), 'Animation',
str_sql("REGEXP_LIKE(Genre, 'Horror')"), 'Horror',
'Others')
Since we're more concerned with the 'Category' at this point, we can drop 'genre.'
filmtv_movies_complete.drop(columns = ["genre"])
Let's look at the missing values.
filmtv_movies_complete.count_percent()
Let's impute the missing values for 'notoriety_actors' and 'castings_actors' using different techniques. We can then drop the few remaining missing values.
filmtv_movies_complete["notoriety_actors"].fillna(method = "median",
by = ["director",
"Category"])
filmtv_movies_complete["castings_actors"].fillna(method = "median",
by = ["director",
"Category"])
filmtv_movies_complete.dropna()
