Predicting Popularity on Spotify
This example uses the publicly available Spotify dataset from Kaggle to predict the popularity of Polish songs and artists on Spotify. We'll also use a model to group artists together based on how similar their songs are.
You can download the Jupyter notebook of this study here.
The "tracks" dataset (tracks.csv) has the following features:
- id - ID of the track, generated by Spotify
Numerical:
- acousticness (range: [0,1])
- danceability (range: [0,1])
- energy (range: [0,1])
- duration_ms (range: [200000,300000])
- instrumentalness (range: [0,1])
- valence (range: [0,1])
- popularity (range: [0,100])
- tempo (range: [50,150])
- liveness (range: [0,1])
- loudness (range: [-60,0])
- speechiness (range: [0,1])
Dummy:
- mode (0 = Minor, 1 = Major)
- explicit (0 = No explicit content, 1 = Explicit content)
Categorical:
- key - keys on an octave encoded as integers in range [0,11] (C = 0, C# = 1, etc.)
- timesignature - predicted time signature
- artists - list of contributing artists
- id_artists - list of IDs of contributing artists
- release_date - date of release (yyyy-mm-dd)
- name - track name
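As a minimal sketch of how the key and mode encodings above can be decoded, the following plain-Python helper maps the integer key to a pitch-class name and the mode flag to "minor" or "major". The helper name describe_key and the choice to spell accidentals as sharps are assumptions made for illustration; they are not part of the dataset or of VerticaPy.

```python
# Pitch classes in ascending order, matching the encoding C = 0, C# = 1, ..., B = 11.
# Spelling accidentals as sharps is an illustrative choice.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MODES = {0: "minor", 1: "major"}


def describe_key(key: int, mode: int) -> str:
    """Return a readable label such as 'C# minor' (hypothetical helper)."""
    if key not in range(12):
        raise ValueError(f"key must be in [0, 11], got {key}")
    if mode not in MODES:
        raise ValueError(f"mode must be 0 or 1, got {mode}")
    return f"{PITCH_CLASSES[key]} {MODES[mode]}"


print(describe_key(0, 1))  # C major
print(describe_key(1, 0))  # C# minor
```

A lookup like this is only useful for display; for modeling, key remains a categorical feature even though it is stored as an integer.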
The "artists" dataset (artists.csv) has the following features:
- id - ID of the artist
- name - artist name
- followers - how many followers the artist has
- popularity - popularity of the artists based on their tracks
- genres - list of genres covered by the artist's tracks
Import libraries
Start by importing VerticaPy and loading the SQL extension, which allows you to query the Vertica database with SQL.
import verticapy as vp
%load_ext verticapy.sql
This example uses the following version of VerticaPy:
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN". For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Create a new schema, "spotify", dropping any existing schema of that name first.
vp.drop("spotify", method = "schema")
vp.create_schema("spotify")
Data Loading
Load each dataset into a vDataFrame with read_csv(), then view them with display().
# load datasets as vDataFrame objects
artists = vp.read_csv("data/artists.csv", schema = "spotify", parse_nrows = 100)
tracks = vp.read_csv("data/tracks.csv" , schema = "spotify", parse_nrows = 100)
display(artists)
display(tracks)
Data Exploration
Our "artists" dataset is too broad for us to use right now; we're only concerned with Polish artists, so let's extract and save them to our Vertica database.
# select Polish artists from the 'artists' dataset using the 'genres' column
polish_artists = artists.search("genres ilike '%disco polo%' or genres ilike '%polish%'")
# save it to the database
polish_artists.to_db('"spotify"."polish_artists"', relation_type = "table")
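To make the filter condition concrete, here is a minimal plain-Python sketch of what the two ILIKE patterns test: a case-insensitive substring match on the genres value. The function name and the sample rows are hypothetical and only illustrate the matching logic; the actual filtering runs inside Vertica.

```python
def is_polish_artist(genres: str) -> bool:
    """Mimic: genres ILIKE '%disco polo%' OR genres ILIKE '%polish%'."""
    lowered = genres.lower()
    return "disco polo" in lowered or "polish" in lowered


# Hypothetical sample rows (name, genres); not taken from the dataset.
sample = [
    ("Artist A", "['polish pop', 'polish rock']"),
    ("Artist B", "['Disco Polo']"),
    ("Artist C", "['german techno']"),
]

matches = [name for name, genres in sample if is_polish_artist(genres)]
print(matches)  # ['Artist A', 'Artist B']
```

Because the match is a plain substring test, any genre label containing "polish" qualifies, so it may be worth inspecting the resulting genres before relying on the filter.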