Movies Scoring and Clustering

This example uses the 'filmtv_movies' dataset to evaluate the quality of the movies and create clusters of similar movies. You can download the Jupyter Notebook of the study here.

  • year: Movie's release year
  • filmtv_id: Movie ID
  • title: Movie title
  • genre: Movie genre
  • country: Movie's country of origin
  • description: Movie description
  • notes: Information about the movie
  • duration: Movie duration
  • votes: Number of votes
  • avg_vote: Average score
  • director: Movie director
  • actors: Actors in the movie

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a Virtual DataFrame of the dataset. The dataset is available here.

In [3]:
filmtv_movies = vp.read_csv("data/filmtv_movies.csv")
display(filmtv_movies.head(5))
123
filmtv_id
Int
Abc
title
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
123
duration
Int
Abc
country
Varchar(208)
Abc
director
Varchar(1066)
Abc
Varchar(2218)
123
avg_vote
Numeric(6,2)
123
votes
Int
Abc
Varchar(2232)
Abc
Varchar(1052)
118Diner1982.0Comedy95United StatesBarry Levinson7.215
222A ciascuno il suo1967.0Drama93ItalyElio Petri7.8102
324A... come assassino1966.0Thriller80ItalyRay Morrison (Angelo Dorigo)6.53
430A Ghentar si muore facile1968.0Adventure101ItalyLeon Klimovsky5.52
531Does This Mean We are Married?1990.0Comedy90United StatesCarol Wiseman4.01
Rows: 1-5 | Columns: 12

Data Exploration and Preparation

One of the biggest challenges for any streaming platform is to find a good catalog of movies.

First, let's explore the dataset.

In [4]:
filmtv_movies.describe(method = 'categorical', unique = True)
Out[4]:
dtype
count
top
top_percent
unique
"filmtv_id"int53397180.00253397.0
"title"varchar(486)53395Les Vampires0.01950584.0
"year"numeric(8,2)533872016.03.092111.0
"genre"varchar(22)53195Drama30.13827.0
"duration"int533979011.798282.0
"country"varchar(208)53346United States41.1412394.0
"director"varchar(1066)53335Mario Mattòli0.13719160.0
"actors"varchar(2218)50372[null]5.66550121.0
"avg_vote"numeric(6,2)533976.015.01489.0
"votes"int53397122.988588.0
"description"varchar(2232)359[null]99.328354.0
"notes"varchar(1052)106[null]99.801105.0
Rows: 1-12 | Columns: 6

We can drop the 'description' and 'notes' columns since these fields are empty for most of our dataset.

In [5]:
filmtv_movies.drop(["description", "notes"])
Out[5]:
123
filmtv_id
Int
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
123
duration
Int
Abc
country
Varchar(208)
Abc
director
Varchar(1066)
Abc
Varchar(2218)
123
avg_vote
Numeric(6,2)
123
votes
Int
1181982.0Comedy95United StatesBarry Levinson7.215
2221967.0Drama93ItalyElio Petri7.8102
3241966.0Thriller80ItalyRay Morrison (Angelo Dorigo)6.53
4301968.0Adventure101ItalyLeon Klimovsky5.52
5311990.0Comedy90United StatesCarol Wiseman4.01
6321990.0Drama96United StatesJoseph Ruben5.1107
7331991.0Thriller100United StatesBrian Grant8.02
8371975.0Comedy100ItalyMarcello Fondato5.942
9411985.0Drama110FranceVera Belmont4.01
10421988.0Drama110HungaryGéza Beremenyi8.01
11461992.0Drama88France, Great BritainIan Sellar6.01
12491964.0Drama110United StatesSidney Lumet8.368
13501990.0Action88United States, ChinaLiu Chia Yung7.02
14531968.0Adventure94ItalyEmilio Miraglia4.06
15551986.0Comedy100United StatesAlan Metter6.616
16561988.0Comedy89United StatesNico Mastorakis4.01
17581989.0Comedy100United StatesBruce Beresford7.1171
18601987.0Drama98Great BritainMark Rezyka4.01
19611966.0Western98United StatesSidney J. Furie7.024
20631975.0Crime100ItalyMario Caiano5.526
21641991.0Crime85ItalyRomolo Guerrieri4.01
22651981.0Comedy94CanadaLes Rose4.01
23681945.0Comedy90ItalyGennaro Righelli6.028
24701986.0Drama120United StatesMichael Miller6.01
25741977.0Adventure130United StatesPeter Yates5.228
26751979.0Drama120United StatesMilton Katselas8.02
27771940.0Biography110United StatesJohn Cromwell6.39
28821943.0Comedy88Italy, SpainJosé Lopez Rubio, Primo Zeglio3.53
29851955.0War75GermanyGeorg Wilhelm Pabst6.52
30861962.0Comedy92United StatesAndrew Marton6.53
31881950.0Comedy104United StatesWilliam Dieterle7.25
32911988.0Comedy90United StatesMichael Schroeder6.01
33931980.0Comedy97United StatesJames Frawley6.01
34941987.0Drama77FranceChristine Allen4.54
35961933.0Comedy68ItalyGero Zambuto8.01
361001942.0Drama80ItalyNunzio Malasomma5.53
371011988.0Crime97United StatesCraig R. Baxley5.132
381021990.0Drama115JapanKoei Oguri8.01
391031977.0Drama92Great BritainDon Boyd4.01
401071961.0Drama109United StatesDaniel Mann6.55
411081949.0Comedy90ItalyMario Mattòli5.07
421091949.0Comedy92Great BritainHarold French6.811
431111932.0War78United StatesFrank Borzage6.716
441131990.0Thriller100United StatesJerrold Freedman6.01
451141940.0Drama88ItalyFerdinando M. Poggioli7.211
461151956.0Drama95United StatesWilliam A. Wellman6.03
471161969.0Adventure104Great BritainJack Couffer4.01
481171939.0Romantic114United StatesSam Wood7.323
491181951.0Comedy90Great BritainAnthony Asquith8.314
501191954.0Drama86ItalyRoberto Bianchi Montero3.02
511201954.0Drama104United StatesDaniel Mann8.01
521221963.0Comedy108FranceEdouard Molinaro5.310
531231984.0Comedy100United StatesHoward Zieff6.114
541251943.0Comedy80FranceFernandel6.01
551271971.0Drama93Sweden, United StatesIngmar Bergman6.016
561281952.0Drama82United StatesWilliam Spier, Roy Kellino4.01
571291966.0Comedy95ItalyPasquale Festa Campanile5.533
581301963.0Comedy90ItalyRaffaello Matarazzo5.04
591361966.0Spy100United StatesRaoul Levy4.84
601371988.0Comedy92United StatesJim Abrahams5.220
611381990.0Crime112United StatesMike Figgis6.565
621391988.0Drama100United StatesSharron Miller6.01
631401962.0Comedy98United StatesRichard Quine7.132
641411991.0Drama89United StatesLou Antonio6.01
651451975.0Adventure95ItalyMichele Lupo5.520
661461952.0Adventure95ItalyGiovanni Roccardi4.56
671531969.0Spy140Great BritainPeter R. Hunt7.3145
681551974.0Spy122Great BritainGuy Hamilton6.0107
691581967.0Spy116Great BritainLewis Gilbert6.6117
701591965.0Spy132Great BritainTerence Young7.0130
711621971.0Spy122Great BritainGuy Hamilton6.1116
721631964.0Spy105FranceMaurice Labro4.01
731641954.0Thriller75United StatesLouis King3.95
741671965.0Spy90ItalyEnrico Bomba3.24
751681966.0Spy90ItalyHenry Bay3.85
761691972.0Spy73Great BritainDavid Greene4.77
771711966.0Spy97FranceRobert Vernay4.01
781721980.0Spy108United StatesClaudio Guzman4.52
791731916.0Comedy42United StatesCharles Chaplin6.913
801741984.0Comedy102United StatesAnthony Harvey5.310
811751988.0Adventure88SpainGerardo Herrero4.01
821761957.0Thriller88ItalyRiccardo Freda4.96
831781987.0Crime96United StatesAndy Sidaris3.55
841791967.0Western91United StatesR. G. Springsteen5.02
851801964.0Adventure91United StatesIvan Tors2.52
861811964.0Western86GermanyJürgen Roland4.52
871831985.0Drama98United StatesNorman Jewison6.143
881851985.0Fantasy78United StatesAaron Lipstadt3.52
891861953.0Drama95ItalyCarlo Lizzani5.49
901871939.0Comedy70ItalyMario Mattòli6.33
911881953.0Musical97ItalyClemente Fracassi6.07
921891990.0Adventure106United StatesRoger Spottiswoode5.356
931931977.0Action110United StatesJerry Jameson4.736
941941991.0Comedy100United StatesFrancis Schaeffer4.78
951961989.0Comedy60United StatesBurt Brickerhoff4.01
961981961.0Comedy99ItalyLuigi Filippo D'Amico4.46
971991983.0Comedy100ItalyFrancesco Massaro5.1103
982001992.0Thriller90ItalyAlessandro Lucidi4.36
992011959.0Gangster104United StatesRichard Wilson6.511
1002021956.0Drama87United StatesDaniel Taradash6.39
Rows: 1-100 of 53397 | Columns: 10

We have access to more than 50000 movies in 27 different genres. Let's organize our list by their average rating.

In [6]:
filmtv_movies.sort({"avg_vote" : "desc"})
Out[6]:
123
filmtv_id
Int
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
123
duration
Int
Abc
Varchar(208)
Abc
Varchar(1066)
Abc
Varchar(2218)
123
avg_vote
Numeric(6,2)
123
votes
Int
1743762013.0Documentary19310.01
2709691969.0Documentary37810.01
3817152015.0Documentary9910.01
4744082013.0Documentary8710.01
5727211987.0Fantasy8110.01
61250832016.0Drama6510.01
7862662018.0Drama7310.01
8822802016.0Comedy9010.01
9798881976.0Comedy10010.01
10753941973.0Drama10110.01
11729472014.0Comedy10110.01
121392422006.0Thriller9210.01
131340512016.0Thriller9010.02
141273652016.0Documentary7010.03
151212612015.0Drama9010.02
16869452015.0Comedy8610.01
17845351974.0Animation4310.01
18831021994.0Drama8110.01
19811892015.0Drama11510.01
20802052008.0Drama20010.01
21778981975.0Drama17710.01
22758061948.0Documentary7110.01
23742382014.0Drama12610.01
24734451971.0Comedy8510.01
251400352001.0Drama9110.01
261398182017.0Comedy10510.01
271560592018.0Comedy8310.01
281340902016.0Comedy9010.01
291299142015.0Comedy8910.01
301286652016.0Documentary5710.01
311509592015.0Romantic9010.01
321440852000.0Comedy9410.01
331211472001.0Drama9310.01
341146152015.0Documentary11010.01
351500772017.0Romantic9010.01
36860422007.0Drama9210.01
37833422008.0War11810.01
38831521982.0Horror7010.01
391472572017.0Drama9210.01
40814312015.0Drama8010.01
41811472005.0Adventure10410.01
42803092015.0Drama28010.01
431456572017.0Thriller10210.01
44779502014.0Drama8710.01
45770782009.0Animation6010.01
46766461937.0Adventure9210.01
471419402017.0Drama8810.03
481228102015.0Documentary8210.01
491433562017.0Thriller9010.01
501550931945.0Biography13810.01
511727432019.0Drama7710.01
521605872018.0Comedy9010.01
531620172018.0Romantic9010.01
541398252017.0Comedy8110.01
551734891967.0Comedy8610.01
561584171982.0Documentary6310.01
571669072019.0Comedy8410.02
581365052016.0Documentary7010.01
591672591985.0Animation5310.01
601561132016.0Documentary6010.01
611566872018.0Documentary8510.01
621291612017.0Comedy9010.01
631798632017.0Romantic8410.01
641642271988.0Drama10510.01
651694272008.0History7210.01
661476632017.0Drama7310.01
671710992019.0Documentary10710.01
681520452017.0Drama9210.01
691622352015.0Documentary9010.01
701206682016.0Thriller9010.02
711761992019.0Documentary7610.01
721504312017.0Romantic9010.01
731643012016.0Documentary5010.02
74860712016.0Documentary7010.01
751667752019.0Musical13010.01
761502972017.0Action9910.01
771503512017.0Documentary6210.02
78833061976.0Comedy9710.02
791775532008.0Animation8310.01
801623752018.0Documentary7410.01
811640632019.0Adventure10410.01
821444212017.0Documentary9010.01
831695172019.0Drama10810.01
841474382005.0Animation6710.01
851474422016.0Thriller18010.01
86806372015.0Comedy8310.01
871708792019.0Comedy9010.01
881472312017.0Documentary6010.01
891623772018.0Comedy9010.01
90792072015.0Romantic9010.01
911635992018.0Thriller9010.01
921460732017.0Comedy11510.01
931470932017.0Thriller9010.01
94767141972.0Comedy9010.01
951795192018.0Thriller8710.01
961605632018.0Documentary7810.01
971667772017.0Romantic8610.03
981389202017.0Drama10110.01
991673912019.0Documentary7510.01
100706712014.0Drama8010.01
Rows: 1-100 of 53397 | Columns: 10

Since we want properly averaged scores, let's just consider the top 10 movies that have at least 10 votes.

In [7]:
filmtv_movies.search(conditions = [filmtv_movies["votes"] > 10], 
                     order_by = {"avg_vote" : "desc" })
Out[7]:
123
filmtv_id
Integer
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
123
duration
Integer
Abc
Varchar(208)
Abc
Varchar(1066)
Abc
Varchar(2218)
123
avg_vote
Numeric(6,2)
123
votes
Integer
1259801992.0Drama5839.833
2291361985.0Documentary544