Movies Scoring and Clustering
This example uses the 'filmtv_movies' dataset to evaluate the quality of the movies and create clusters of similar movies. You can download the Jupyter Notebook of the study here.
The dataset contains the following columns:
- year: Movie's release year
- filmtv_id: Movie ID
- title: Movie title
- genre: Movie genre
- country: Movie's country of origin
- description: Movie description
- notes: Information about the movie
- duration: Movie duration
- votes: Number of votes
- avg_vote: Average score
- director: Movie director
- actors: Actors in the movie
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN". For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset. The dataset is available here.
filmtv_movies = vp.read_csv("data/filmtv_movies.csv")
display(filmtv_movies.head(5))
Data Exploration and Preparation
One of the biggest challenges for any streaming platform is to find a good catalog of movies.
First, let's explore the dataset.
filmtv_movies.describe(method = 'categorical', unique = True)
We can drop the 'description' and 'notes' columns since these fields are empty for most of our dataset.
filmtv_movies.drop(["description", "notes"])
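One way to confirm that a column is mostly empty before dropping it is to compute the share of missing values per column. The sketch below illustrates the idea in plain Python on a few made-up rows. The `rows` sample and the `missing_fraction` helper are hypothetical and are not part of VerticaPy or of the original study.

```python
def missing_fraction(rows, columns):
    """Return {column: share of rows whose value is None or an empty string}."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    result = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        result[col] = missing / total
    return result


# Hypothetical sample rows, for illustration only (not taken from the dataset).
rows = [
    {"title": "Movie A", "description": None, "notes": None},
    {"title": "Movie B", "description": "A short plot summary", "notes": None},
    {"title": "Movie C", "description": None, "notes": ""},
    {"title": "Movie D", "description": None, "notes": None},
]

fractions = missing_fraction(rows, ["title", "description", "notes"])

# Columns that are missing in more than half of the rows are drop candidates.
mostly_empty = [col for col, frac in fractions.items() if frac > 0.5]
print(mostly_empty)
```

The 0.5 threshold is an arbitrary choice for this sketch; a stricter or looser cutoff may suit a given analysis better.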
We have access to more than 50,000 movies in 27 different genres. Let's sort the movies by their average rating.
filmtv_movies.sort({"avg_vote" : "desc"})
An average computed from only a handful of votes is not reliable, so let's look at the top-rated movies that have more than 10 votes.
filmtv_movies.search(conditions = [filmtv_movies["votes"] > 10],
order_by = {"avg_vote" : "desc" })
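The filter in the call above keeps movies with strictly more than 10 votes and orders them by average score. As a minimal sketch of the same logic outside the database, the plain-Python example below applies a vote threshold, sorts by average score, and keeps the first few results. The `movies` sample and the `top_rated` helper are hypothetical and only illustrate the idea.

```python
def top_rated(movies, min_votes=10, n=10):
    """Return up to n movies with more than min_votes votes, best average first."""
    eligible = [m for m in movies if m["votes"] > min_votes]
    return sorted(eligible, key=lambda m: m["avg_vote"], reverse=True)[:n]


# Hypothetical sample records, for illustration only (not taken from the dataset).
movies = [
    {"title": "Movie A", "votes": 3, "avg_vote": 9.8},
    {"title": "Movie B", "votes": 250, "avg_vote": 8.9},
    {"title": "Movie C", "votes": 10, "avg_vote": 9.1},
    {"title": "Movie D", "votes": 42, "avg_vote": 9.0},
]

best = top_rated(movies, min_votes=10, n=2)
print([m["title"] for m in best])
```

"Movie A" has the highest average but only 3 votes, so the threshold excludes it. "Movie C" has exactly 10 votes and is also excluded because the comparison is strict.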