Movies Scoring and Clustering

This example uses the 'filmtv_movies' dataset to evaluate the quality of the movies and create clusters of similar movies. You can download the Jupyter Notebook of the study here.

  • year: Movie's release year
  • filmtv_id: Movie ID
  • title: Movie title
  • genre: Movie genre
  • country: Movie's country of origin
  • description: Movie description
  • notes: Information about the movie
  • duration: Movie duration
  • votes: Number of votes
  • avg_vote: Average score
  • director: Movie director
  • actors: Actors in the movie

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a Virtual DataFrame of the dataset. The dataset is available here.

In [3]:
filmtv_movies = vp.read_csv("data/filmtv_movies.csv")
display(filmtv_movies.head(5))
123
filmtv_id
Int
Abc
title
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
123
duration
Int
Abc
country
Varchar(208)
Abc
director
Varchar(1066)
Abc
Varchar(2218)
123
avg_vote
Numeric(6,2)
123
votes
Int
Abc
Varchar(2232)
Abc
Varchar(1052)
118Diner1982.0Comedy95United StatesBarry Levinson7.215
222A ciascuno il suo1967.0Drama93ItalyElio Petri7.8102
324A... come assassino1966.0Thriller80ItalyRay Morrison (Angelo Dorigo)6.53
430A Ghentar si muore facile1968.0Adventure101ItalyLeon Klimovsky5.52
531Does This Mean We are Married?1990.0Comedy90United StatesCarol Wiseman4.01
Rows: 1-5 | Columns: 12

Data Exploration and Preparation

One of the biggest challenges for any streaming platform is to find a good catalog of movies.

First, let's explore the dataset.

In [4]:
filmtv_movies.describe(method = 'categorical', unique = True)
Out[4]:
dtype
count
top
top_percent
unique
"filmtv_id"int53397180.00253397.0
"title"varchar(486)53395Les Vampires0.01950584.0
"year"numeric(8,2)533872016.03.092111.0
"genre"varchar(22)53195Drama30.13827.0
"duration"int533979011.798282.0
"country"varchar(208)53346United States41.1412394.0
"director"varchar(1066)53335Mario Mattòli0.13719160.0
"actors"varchar(2218)50372[null]5.66550121.0
"avg_vote"numeric(6,2)533976.015.01489.0
"votes"int53397122.988588.0
"description"varchar(2232)359[null]99.328354.0
"notes"varchar(1052)106[null]99.801105.0
Rows: 1-12 | Columns: 6

We can drop the 'description' and 'notes' columns since these fields are empty for most of our dataset.

In [5]:
filmtv_movies.drop(["description", "notes"])
Out[5]:
123
filmtv_id
Int
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
123
duration
Int
Abc
country
Varchar(208)
Abc
director
Varchar(1066)
Abc
Varchar(2218)
123
avg_vote
Numeric(6,2)
123
votes
Int
1181982.0Comedy95United StatesBarry Levinson7.215
2221967.0Drama93ItalyElio Petri7.8102
3241966.0Thriller80ItalyRay Morrison (Angelo Dorigo)6.53
4301968.0Adventure101ItalyLeon Klimovsky5.52
5311990.0Comedy90United StatesCarol Wiseman4.01
6321990.0Drama96United StatesJoseph Ruben5.1107
7331991.0Thriller100United StatesBrian Grant8.02
8371975.0Comedy100ItalyMarcello Fondato5.942
9411985.0Drama110FranceVera Belmont4.01
10421988.0Drama110HungaryGéza Beremenyi8.01
11461992.0Drama88France, Great BritainIan Sellar6.01
12491964.0Drama110United StatesSidney Lumet8.368
13501990.0Action88United States, ChinaLiu Chia Yung7.02
14531968.0Adventure94ItalyEmilio Miraglia4.06
15551986.0Comedy100United StatesAlan Metter6.616
16561988.0Comedy89United StatesNico Mastorakis4.01
17581989.0Comedy100United StatesBruce Beresford7.1171
18601987.0Drama98Great BritainMark Rezyka4.01
19611966.0Western98United StatesSidney J. Furie7.024
20631975.0Crime100ItalyMario Caiano5.526
21641991.0Crime85ItalyRomolo Guerrieri4.01
22651981.0Comedy94CanadaLes Rose4.01
23681945.0Comedy90ItalyGennaro Righelli6.028
24701986.0Drama120United StatesMichael Miller6.01
25741977.0Adventure130United StatesPeter Yates5.228
26751979.0Drama120United StatesMilton Katselas8.02
27771940.0Biography110United StatesJohn Cromwell6.39
28821943.0Comedy88Italy, SpainJosé Lopez Rubio, Primo Zeglio3.53
29851955.0War75GermanyGeorg Wilhelm Pabst6.52
30861962.0Comedy92United StatesAndrew Marton6.53
31881950.0Comedy104United StatesWilliam Dieterle7.25
32911988.0Comedy90United StatesMichael Schroeder6.01
33931980.0Comedy97United StatesJames Frawley6.01
34941987.0Drama77FranceChristine Allen4.54
35961933.0Comedy68ItalyGero Zambuto8.01
361001942.0Drama80ItalyNunzio Malasomma5.53
371011988.0Crime97United StatesCraig R. Baxley5.132
381021990.0Drama115JapanKoei Oguri8.01
391031977.0Drama92Great BritainDon Boyd4.01
401071961.0Drama109United StatesDaniel Mann6.55
411081949.0Comedy90ItalyMario Mattòli5.07
421091949.0Comedy92Great BritainHarold French6.811
431111932.0War78United StatesFrank Borzage6.716
441131990.0Thriller100United StatesJerrold Freedman6.01
451141940.0Drama88ItalyFerdinando M. Poggioli7.211
461151956.0Drama95United StatesWilliam A. Wellman6.03
471161969.0Adventure104Great BritainJack Couffer4.01
481171939.0Romantic114United StatesSam Wood7.323
491181951.0Comedy90Great BritainAnthony Asquith8.314
501191954.0Drama86ItalyRoberto Bianchi Montero3.02
511201954.0Drama104United StatesDaniel Mann8.01
521221963.0Comedy108FranceEdouard Molinaro5.310
531231984.0Comedy100United StatesHoward Zieff6.114
541251943.0Comedy80FranceFernandel6.01
551271971.0Drama93Sweden, United StatesIngmar Bergman6.016
561281952.0Drama82United StatesWilliam Spier, Roy Kellino4.01
571291966.0Comedy95ItalyPasquale Festa Campanile5.533
581301963.0Comedy90ItalyRaffaello Matarazzo5.04
591361966.0Spy100United StatesRaoul Levy4.84
601371988.0Comedy92United StatesJim Abrahams5.220
611381990.0Crime112United StatesMike Figgis6.565
621391988.0Drama100United StatesSharron Miller6.01
631401962.0Comedy98United StatesRichard Quine7.132
641411991.0Drama89United StatesLou Antonio6.01
651451975.0Adventure95ItalyMichele Lupo5.520
661461952.0Adventure95ItalyGiovanni Roccardi4.56
671531969.0Spy140Great BritainPeter R. Hunt7.3145
681551974.0Spy122Great BritainGuy Hamilton6.0107
691581967.0Spy116Great BritainLewis Gilbert6.6117
701591965.0Spy132Great BritainTerence Young7.0130
711621971.0Spy122Great BritainGuy Hamilton6.1116
721631964.0Spy105FranceMaurice Labro4.01
731641954.0Thriller75United StatesLouis King3.95
741671965.0Spy90ItalyEnrico Bomba3.24
751681966.0Spy90ItalyHenry Bay3.85
761691972.0Spy73Great BritainDavid Greene4.77
771711966.0Spy97FranceRobert Vernay4.01
781721980.0Spy108United StatesClaudio Guzman4.52
791731916.0Comedy42United StatesCharles Chaplin6.913
801741984.0Comedy102United StatesAnthony Harvey5.310
811751988.0Adventure88SpainGerardo Herrero4.01
821761957.0Thriller88ItalyRiccardo Freda4.96
831781987.0Crime96United StatesAndy Sidaris3.55
841791967.0Western91United StatesR. G. Springsteen5.02
851801964.0Adventure91United StatesIvan Tors2.52
861811964.0Western86GermanyJürgen Roland4.52
871831985.0Drama98United StatesNorman Jewison6.143
881851985.0Fantasy78United StatesAaron Lipstadt3.52
891861953.0Drama95ItalyCarlo Lizzani5.49
901871939.0Comedy70ItalyMario Mattòli6.33
911881953.0Musical97ItalyClemente Fracassi6.07
921891990.0Adventure106United StatesRoger Spottiswoode5.356
931931977.0Action110United StatesJerry Jameson4.736
941941991.0Comedy100United StatesFrancis Schaeffer4.78
951961989.0Comedy60United StatesBurt Brickerhoff4.01
961981961.0Comedy99ItalyLuigi Filippo D'Amico4.46
971991983.0Comedy100ItalyFrancesco Massaro5.1103
982001992.0Thriller90ItalyAlessandro Lucidi4.36
992011959.0Gangster104United StatesRichard Wilson6.511
1002021956.0Drama87United StatesDaniel Taradash6.39
Rows: 1-100 of 53397 | Columns: 10

We have access to more than 50000 movies in 27 different genres. Let's organize our list by their average rating.

In [6]:
filmtv_movies.sort({"avg_vote" : "desc"})
Out[6]:
123
filmtv_id
Int
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
123
duration
Int
Abc
Varchar(208)
Abc
Varchar(1066)
Abc
Varchar(2218)
123
avg_vote
Numeric(6,2)
123
votes
Int
1743762013.0Documentary19310.01
2709691969.0Documentary37810.01
3817152015.0Documentary9910.01
4744082013.0Documentary8710.01
5727211987.0Fantasy8110.01
61250832016.0Drama6510.01
7862662018.0Drama7310.01
8822802016.0Comedy9010.01
9798881976.0Comedy10010.01
10753941973.0Drama10110.01
11729472014.0Comedy10110.01
121392422006.0Thriller9210.01
131340512016.0Thriller9010.02
141273652016.0Documentary7010.03
151212612015.0Drama9010.02
16869452015.0Comedy8610.01
17845351974.0Animation4310.01
18831021994.0Drama8110.01
19811892015.0Drama11510.01
20802052008.0Drama20010.01
21778981975.0Drama17710.01
22758061948.0Documentary7110.01
23742382014.0Drama12610.01
24734451971.0Comedy8510.01
251400352001.0Drama9110.01
261398182017.0Comedy10510.01
271560592018.0Comedy8310.01
281340902016.0Comedy9010.01
291299142015.0Comedy8910.01
301286652016.0Documentary5710.01
311509592015.0Romantic9010.01
321440852000.0Comedy9410.01
331211472001.0Drama9310.01
341146152015.0Documentary11010.01
351500772017.0Romantic9010.01
36860422007.0Drama9210.01
37833422008.0War11810.01
38831521982.0Horror7010.01
391472572017.0Drama9210.01
40814312015.0Drama8010.01
41811472005.0Adventure10410.01
42803092015.0Drama28010.01
431456572017.0Thriller10210.01
44779502014.0Drama8710.01
45770782009.0Animation6010.01
46766461937.0Adventure9210.01
471419402017.0Drama8810.03
481228102015.0Documentary8210.01
491433562017.0Thriller9010.01
501550931945.0Biography13810.01
511727432019.0Drama7710.01
521605872018.0Comedy9010.01
531620172018.0Romantic9010.01
541398252017.0Comedy8110.01
551734891967.0Comedy8610.01
561584171982.0Documentary6310.01
571669072019.0Comedy8410.02
581365052016.0Documentary7010.01
591672591985.0Animation5310.01
601561132016.0Documentary6010.01
611566872018.0Documentary8510.01
621291612017.0Comedy9010.01
631798632017.0Romantic8410.01
641642271988.0Drama10510.01
651694272008.0History7210.01
661476632017.0Drama7310.01
671710992019.0Documentary10710.01
681520452017.0Drama9210.01
691622352015.0Documentary9010.01
701206682016.0Thriller9010.02
711761992019.0Documentary7610.01
721504312017.0Romantic9010.01
731643012016.0Documentary5010.02
74860712016.0Documentary7010.01
751667752019.0Musical13010.01
761502972017.0Action9910.01
771503512017.0Documentary6210.02
78833061976.0Comedy9710.02
791775532008.0Animation8310.01
801623752018.0Documentary7410.01
811640632019.0Adventure10410.01
821444212017.0Documentary9010.01
831695172019.0Drama10810.01
841474382005.0Animation6710.01
851474422016.0Thriller18010.01
86806372015.0Comedy8310.01
871708792019.0Comedy9010.01
881472312017.0Documentary6010.01
891623772018.0Comedy9010.01
90792072015.0Romantic9010.01
911635992018.0Thriller9010.01
921460732017.0Comedy11510.01
931470932017.0Thriller9010.01
94767141972.0Comedy9010.01
951795192018.0Thriller8710.01
961605632018.0Documentary7810.01
971667772017.0Romantic8610.03
981389202017.0Drama10110.01
991673912019.0Documentary7510.01
100706712014.0Drama8010.01
Rows: 1-100 of 53397 | Columns: 10

Since we want properly averaged scores, let's just consider the top 10 movies that have at least 10 votes.

In [7]:
filmtv_movies.search(conditions = [filmtv_movies["votes"] > 10], 
                     order_by = {"avg_vote" : "desc" })
Out[7]:
123
filmtv_id
Integer
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
123
duration
Integer
Abc
Varchar(208)
Abc
Varchar(1066)
Abc
Varchar(2218)
123
avg_vote
Numeric(6,2)
123
votes
Integer
1259801992.0Drama5839.833
2291361985.0Documentary5449.624
3165671924.0Drama1009.658
4165841973.0Comedy2089.430
538311948.0Drama909.4477
6279081962.0Drama1359.468
7121351994.0Drama4509.472
877531950.0Drama1009.4535
9233951954.0Adventure2009.4325
1056481941.0Drama1199.4588
1170421936.0Comedy839.4427
12279841992.0Drama15259.318
13127301957.0War869.3514
14260351930.0Drama849.331
15230231954.0Drama1109.350
1655801960.0Thriller1089.3789
17231121953.0Drama979.3125
1839631931.0Drama879.3314
19181731955.0Drama1199.3148
20283822002.0Action1299.317
21105591924.0Comedy429.3123
22104231928.0Drama759.329
23120911937.0War1179.2172
24232601950.0Drama889.2291
25230371931.0Comedy829.216
26278381954.0Drama1199.263
2726861925.0Comedy609.2273
2826761926.0Drama859.252
2944701921.0Comedy519.2310
3017921948.0Biography959.244
3150821928.0Drama979.2153
32418961996.0Animation759.213
3360401945.0Drama989.2389
34181721943.0Drama949.2144
35197551953.0Drama1369.2152
36170651971.0Grotesque1379.21087
3792601951.0Comedy939.148
3885851958.0Noir1049.1356
39131961979.0War1479.1650
4074981951.0Drama909.1302
4183951978.0Drama1839.1629
42145491927.0Drama979.1140
4368631932.0Horror709.1124
4461861948.0Drama1339.1113
4569841931.0Drama809.134
4649911972.0Drama1789.1837
47146031949.0Drama1089.187
4849921974.0Drama1859.1603
49277971952.0Drama1369.186
5032321940.0Comedy1189.1459
5119631968.0Fantasy1419.1986
5222331958.0Thriller1289.1622
5344801947.0Drama1239.1206
5464761980.0Horror1199.11168
5523211944.0Noir1069.1259
5625771950.0Comedy1309.1255
5727671954.0Thriller1129.1692
5871191957.0Thriller1149.1288
59543291975.0Drama2019.124
6070251976.0Drama1109.1867
6129731920.0Horror789.1222
6289111988.0Drama5609.156
63278081952.0Drama1439.1122
64249571971.0Action1009.115
65279271955.0Drama1159.136
6610091982.0Fantasy1249.1278
6745081955.0Drama939.1300
68198391926.0Fantasy879.1329
69279661966.0Drama1199.153
70278751956.0Drama959.0127
71253731980.0Drama589.014
72279131941.0Drama2239.036
73279831984.0Drama9249.042
74281111924.0Drama2379.031
75174551983.0Drama1079.015
76163511989.0Drama1109.012
77230881962.0Drama959.0229
78122541957.0Drama959.0369
79128491994.0Gangster1549.01222
80236621941.0Drama1169.049
81103591932.0Horror649.0302
8289671934.0Comedy999.021
83105261932.0Drama909.034
8471411928.0Drama919.021
8587731928.0Drama909.035
8692051950.0Comedy979.039
87110081962.0Comedy1089.0436
88240652000.0Drama1409.091
8911521984.0Drama2209.0813
90121231962.0Drama1129.057
9121941960.0Drama1789.0429
9210221980.0Comedy1279.0684
9352731959.0Drama759.097
94169031959.0Drama939.0431
954391969.0Drama1749.0137
968681975.0Drama1849.0599
9760251960.0Drama1809.0246
9861631946.0Fantasy1049.075
99242351963.0Drama629.039
100695152006.0Thriller1639.016
Rows: 1-100 | Columns: 10

We can see classic movies like 'The Godfather' and 'Greed.' Let's smooth the avg_vote using a linear regression to make it more representative.

To create our model we could use the votes, the category, the duration, etc. but let's go with the director and main actors.

We can extract the five main actors for each movie with regular expressions.

In [8]:
for i in range(1, 5):
    filmtv_movies2 = vp.read_csv("data/filmtv_movies.csv")
    filmtv_movies2.regexp(column = "actors",
                          method = "substr",
                          pattern = '[^,]+',
                          occurrence = i,
                          name = "actor")
    if i == 1:
        filmtv_movies = filmtv_movies2.copy()
    else:
        filmtv_movies = filmtv_movies.append(filmtv_movies2)
filmtv_movies["actor"].describe()
Out[8]:
value
name"actor"
dtypevarchar(2218)
unique74885.0
count191674.0
Others191083
[null]21914
Totò100
Ciccio Ingrassia91
John Wayne85
Franco Franchi83
Alberto Sordi79
Rows: 1-11 | Columns: 2

By aggregating the data, we can find the number of actors and the number of votes by actor. We can then normalize the data using the min-max method and quantify the notoriety of the actors.

In [9]:
import verticapy.stats as st
actors_stats = filmtv_movies.groupby(columns = ["actor"], 
                                     expr = [st.sum(filmtv_movies["votes"])._as("notoriety_actors"),
                                             st.count(filmtv_movies["actors"])._as("castings_actors")])
actors_stats["actor"].dropna()
actors_stats["notoriety_actors"].normalize(method = 'minmax')
1 element was filtered.
Out[9]:
Abc
actor
Varchar(2218)
123
notoriety_actors
Float
123
castings_actors
Integer
1 Claire Brosseau0.0003049710277521
2 Uberto Kovacevich0.0003049710277521
3 Julie Bishop0.0028463962590223
4 Guillaume Lemay-Thivierge0.01
5 Keith Morris0.0001016570092511
6Carl Brisson0.0048795364440382
7 Ameko Eks Mass Carroll0.0006099420555051
8Claudio Santamaria0.13113754193351616
9 Paul Lukather0.0005082850462541
10 Ben Gavin0.0012198841110092
11Gina Carano0.0123004981193453
12 Mathew St. Patrick0.0005082850462541
13 Carl Shell0.01
14 Wanda Cannon0.005184507471792
15 Tasko Nacic0.0024397682220191
16 Dominik Madani0.0001016570092511
17 Pina Di Blasi0.0005082850462541
18 Daniel Quinn0.0002033140185022
19 Federico Zardi0.0152485513876181
20 Charles Duming0.0001016570092511
21 Dorville0.0004066280370032
22Emily Ratajkowski0.0007115990647561
23 Vili Matula0.0002033140185021
24 Renee Cline0.0004066280370033
25Ian McDonald0.0001016570092511
26Gorka Otxoa0.0001016570092512
27Rae Dawn Chong0.0019314831757652
28 John Phillips0.0006099420555051
29 Amy Paffrath0.01
30 Ludivine Sagnier0.0425942868760814
31Céline Sallette0.0007115990647562
32Bill Mumy0.0003049710277521
33 Zhang Yao0.01
34Babbayar Batchuluun0.0018298261665141
35 Adrian Boult0.01
36 Juan Camillo Roman Estrada0.0054894784995431
37 Geoff Shaw0.01
38 Nick Cassavetes0.0209413439056627
39 Kuniko Asagi0.01
40 Brendan Fehr0.00986072989732612
41 Carolyn Colquhoun0.0005082850462541
42Mohammed Bakri0.0003049710277522
43Isuzu Yamada0.0022364542035172
44 Jared Martin0.0053878214902922
45 Gian Guido Baldi0.0001016570092511
46 Loa Falkman0.01
47 Victoria Larchenko0.0006099420555051
48 John Ritter0.03964623360780711
49 Martin Bell0.0008132560740061
50 Luca Esposito0.0033546813052761
51 Axel Milberg0.01006404391582810
52 Piroska Vaszary0.0004066280370031
53 Kamar de los Reyes0.0003049710277521
54 Andreina Pezzi0.0005082850462541
55 Johanna Sallstrom0.0005082850462543
56 Ji-won Ye0.0004066280370033
57 Olga James0.0022364542035171
58 Sigfrit Steiner0.0004066280370031
59Jin Zhang0.0002033140185021
60 Cecile Anckarswärd0.0001016570092511
61 Miya Cech0.0032530242960252
62 François Maistre0.0001016570092511
63 Lane Edwards0.0011182271017592
64 Pier Ana Quaglia0.0027447392497711
65Ricky Davao0.0003049710277521
66 Lauren London0.01
67 Nokko0.01
68 Pat Evison0.0006099420555052
69András Hatházi0.01
70 Gustavo Beck0.0001016570092511
71Jean-Jacques Lelté0.0001016570092511
72Gig Young0.0011182271017594
73 Polly Draper0.0100640439158288
74 Naveen Andrews0.0201280878316563
75 Sally Cooper0.01
76 Michael Badalucco0.0023381112127684
77 Valerie Mairesse0.0005082850462543
78 Cheanik Nov0.01
79 Facundo Gambandé0.01
80Gianfranco Giachetti0.0002033140185023
81 William E. Fountaine0.0003049710277521
82Alexandra Breckenridge0.0003049710277521
83 Beverly Hills0.0001016570092511
84 Krista Stadler0.01
85 Jesus Tordesillas0.0004066280370032
86 Charlotte Laemmel0.0018298261665141
87 Manuel Alexandre0.01
88Ewa Strömberg0.0019314831757651
89 María Jesús Solina0.01
90 Jacqueline Scott0.0018298261665142
91Alessia Barela0.0008132560740061
92 James McCaffrey0.0005082850462543
93 Siri Neal0.01
94 Stef Aerts0.0060994205550472
95 Anne Gwynne0.0002033140185023
96Roland Laos0.0002033140185021
97Martha Scott0.0007115990647561
98 Makalou De Yafera0.01
99Sakis Rouvas0.0006099420555051
100 Katee Sackhoff0.0269391074514596
Rows: 1-100 | Columns: 3

Let's look at the top ten actors by notoriety.

In [10]:
actors_stats.search(order_by = {"notoriety_actors" : "desc", 
                                "castings_actors" : "desc"}).head(10)
Out[10]:
Abc
actor
Varchar(2218)
123
notoriety_actors
Numeric(34,15)
123
castings_actors
Integer
1Robert De Niro1.057
2 Morgan Freeman0.86144149639117652
3Clint Eastwood0.85646030293788843
4Tom Cruise0.82037206465385834
5Johnny Depp0.81488258615431534
6Tom Hanks0.77188167124123237
7 Samuel L. Jackson0.74250279556775449
8Brad Pitt0.7277625292263926
9Leonardo DiCaprio0.71546203110704520
10Al Pacino0.64836840500152540
Rows: 1-10 | Columns: 3

As expected, we get a list of very popular actors like Robert De Niro, Morgan Freeman, and Clint Eastwood.

Let's do the same for the directors.

In [11]:
director_stats = filmtv_movies.groupby(columns = ["director"], 
                                       expr = [st.sum(filmtv_movies["votes"])._as("notoriety_director"),
                                               st.count(filmtv_movies["director"])._as("castings_director")])
director_stats["notoriety_director"].normalize(method = 'minmax')
Out[11]:
Abc
Varchar(1066)
123
notoriety_director
Float
123
castings_director
Integer
10.0003466204506078
20.0065857885615258
30.0009532062391688
40.04
50.0009532062391684
60.04
70.0003466204506074
80.00727902946273824
90.00563258232235752
100.04
110.01863084922010480
120.0002599653379554
130.04
148.6655112652e-054
150.0284228769497420
160.04
170.0023396880415948
180.00086655112651616
190.04
200.00987868284228812
218.6655112652e-058
220.0007798960138658
230.02616984402079712
240.0006065857885628
250.0007798960138654
260.04
270.00086655112651612
280.03544194107452368
290.0006065857885624
300.04
310.0002599653379554
320.03024263431542596
330.0001733102253038
340.0002599653379554
350.000519930675914
360.00181975736568528
370.0210571923743548
380.01533795493934112
390.00095320623916812
400.00467937608318912
410.0004332755632584
420.0044194107452344
430.0001733102253034
440.00086655112651620
450.04
460.00753899480069328
470.0002599653379554
480.01967071057192424
490.0045927209705378
500.04
510.04
520.0017331022530334
530.0011265164644714
540.00025996533795516
550.0028596187175048
568.6655112652e-054
570.0006932409012134
580.0030329289428088
590.04
600.04
610.00658578856152532
620.04
630.0001733102253034
640.0002599653379554
650.00025996533795512
668.6655112652e-054
670.0041594454072794
680.0002599653379554
690.00060658578856224
700.04
718.6655112652e-054
728.6655112652e-054
730.0001733102253038
740.0002599653379554
750.04
760.04
778.6655112652e-054
780.0014731369150788
790.0001733102253034
800.00415944540727916
818.6655112652e-054
820.0002599653379554
830.0002599653379554
840.0001733102253034
850.04
860.0002599653379554
870.000519930675914
880.04
890.04
900.08084922010398624
910.0007798960138654
920.0004332755632584
930.04
940.04
950.0001733102253034
960.0001733102253034
970.0045927209705374
980.002079722703648
990.0020797227036428
1000.09064124783362236
Rows: 1-100 | Columns: 3

Now let's look at the top 10 movie directors.

In [16]:
director_stats.search(order_by = {"notoriety_director" : "desc", 
                                  "castings_director" : "desc" }).head(10)
Out[16]:
Abc
director
Varchar(1066)
123
notoriety_director
Numeric(34,15)
123
castings_director
Integer
1Steven Spielberg1.0132
2Woody Allen0.962045060658579192
3Clint Eastwood0.893067590987868152
4Martin Scorsese0.829289428076256140
5Alfred Hitchcock0.753379549393414208
6Quentin Tarantino0.68674176776429840
7Ridley Scott0.655459272097054108
8Stanley Kubrick0.64991334488734848
9Tim Burton0.58821490467937672
10David Cronenberg0.51343154246100584
Rows: 1-10 | Columns: 3

Again, we get a list of popular directors like Steven Spielberg, Woody Allen, and Clint Eastwood.

Let's join our notoriety metrics for actors and directors with the main dataset.

In [12]:
filmtv_movies_director = filmtv_movies.join(
                                    director_stats,
                                    on = {'director': 'director'},
                                    how = "left",
                                    expr1 = ["*"],
                                    expr2 = ["notoriety_director", 
                                             "castings_director"])
filmtv_movies_director_actors = filmtv_movies_director.join(
                                    actors_stats,
                                    on = {'actor': 'actor'},
                                    how = "left",
                                    expr1 = ["*"],
                                    expr2 = ["notoriety_actors",
                                             "castings_actors" ])

As we did many operation, it can be nice to save the vDataFrame as a table in the Vertica database.

In [15]:
vp.drop("filmtv_movies_director_actors", relation_type = "table")
filmtv_movies_director_actors.to_db(name = "filmtv_movies_director_actors", 
                                    relation_type = "table",
                                    inplace = True)
Out[15]:
123
filmtv_id
Int
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
123
duration
Int
Abc
country
Varchar(208)
Abc
director
Varchar(1066)
Abc
Varchar(2218)
123
avg_vote
Numeric(6,2)
123
votes
Int
Abc
Varchar(2232)
Abc
Varchar(1052)
Abc
actor
Varchar(2218)
123
notoriety_director
Numeric(34,15)
123
castings_director
Int
123
notoriety_actors
Numeric(34,15)
123
castings_actors
Int
1181982.0Comedy95United StatesBarry Levinson7.215Mickey Rourke0.1897746967071061040.13449222323879226
2181982.0Comedy95United StatesBarry Levinson7.215 Steve Guttenberg0.1897746967071061040.04889702144962914
3181982.0Comedy95United StatesBarry Levinson7.215 Ellen Barkin0.1897746967071061040.09637084476974727
4181982.0Comedy95United StatesBarry Levinson7.215 Daniel Stern0.1897746967071061040.08193554945613514
5221967.0Drama93ItalyElio Petri7.8102Gian Maria Volonté0.1078856152513480.23025312595303435
6221967.0Drama93ItalyElio Petri7.8102 Salvo Randone0.1078856152513480.12473315035071724
7221967.0Drama93ItalyElio Petri7.8102 Gabriele Ferzetti0.1078856152513480.1135508793331357
8221967.0Drama93ItalyElio Petri7.8102 Irene Papas0.1078856152513480.0864084578631728
9241966.0Thriller80ItalyRay Morrison (Angelo Dorigo)6.53Alan Steel0.00017331022530340.00406628037003210
10241966.0Thriller80ItalyRay Morrison (Angelo Dorigo)6.53 Sergio Ciani0.00017331022530340.0007115990647564
11241966.0Thriller80ItalyRay Morrison (Angelo Dorigo)6.53 Mary Arden0.00017331022530340.0002033140185021
12241966.0Thriller80ItalyRay Morrison (Angelo Dorigo)6.53 Ivano Davoli0.00017331022530340.0024397682220194
13301968.0Adventure101ItalyLeon Klimovsky5.52George Hilton0.004679376083189440.02785402053471625
14301968.0Adventure101ItalyLeon Klimovsky5.52 Marta Padovan0.004679376083189440.0006099420555052
15301968.0Adventure101ItalyLeon Klimovsky5.52 Ennio Girolami0.004679376083189440.00833587475856514
16301968.0Adventure101ItalyLeon Klimovsky5.52 Alfonso De La Vega0.004679376083189440.0001016570092511
17311990.0Comedy90United StatesCarol Wiseman4.01Patsy Kensit0.000433275563258120.0012198841110098
18311990.0Comedy90United StatesCarol Wiseman4.01 Stéphane Freiss0.000433275563258120.0112839280268379
19311990.0Comedy90United StatesCarol Wiseman4.01 Mouss Diouf0.000433275563258120.01
20311990.0Comedy90United StatesCarol Wiseman4.01 Anne-Marie Pisani0.000433275563258120.01
21321990.0Drama96United StatesJoseph Ruben5.1107Julia Roberts0.037781629116118520.22913489885127620
22321990.0Drama96United StatesJoseph Ruben5.1107 Patrick Bergin0.037781629116118520.01535020839686915
23321990.0Drama96United StatesJoseph Ruben5.1107 Kevin Anderson0.037781629116118520.01992477381315414
24321990.0Drama96United StatesJoseph Ruben5.1107 Elizabeth Lawrence0.037781629116118520.0107756429805841
25331991.0Thriller100United StatesBrian Grant8.02Virginia Madsen0.000693240901213200.02216122801667214
26331991.0Thriller100United StatesBrian Grant8.02 Lenny Von Dohlen0.000693240901213200.0011182271017596
27331991.0Thriller100United StatesBrian Grant8.02 Jim Metzler0.000693240901213200.0050828504625396
28331991.0Thriller100United StatesBrian Grant8.02 Erich Anderson0.000693240901213200.0001016570092511
29371975.0Comedy100ItalyMarcello Fondato5.942Claudia Cardinale0.031542461005199320.23991054183185929
30371975.0Comedy100ItalyMarcello Fondato5.942 Vittorio Gassman0.031542461005199320.27142421469960452
31371975.0Comedy100ItalyMarcello Fondato5.942 Monica Vitti0.031542461005199320.08579851580766519
32371975.0Comedy100ItalyMarcello Fondato5.942 Giancarlo Giannini0.031542461005199320.15878824844973154
33411985.0Drama110FranceVera Belmont4.01Charlotte Valandrey0.003206239168111120.0002033140185022
34411985.0Drama110FranceVera Belmont4.01 Lambert Wilson0.003206239168111120.10877299989834328
35411985.0Drama110FranceVera Belmont4.01 Marthe Keller0.003206239168111120.05275998780115922
36411985.0Drama110FranceVera Belmont4.01 Laurent Terzieff0.003206239168111120.04625393920910820
37421988.0Drama110HungaryGéza Beremenyi8.01Károly Eperjes0.040.01
38421988.0Drama110HungaryGéza Beremenyi8.01 Judit Pogàny0.040.0005082850462542
39421988.0Drama110HungaryGéza Beremenyi8.01 Péter Andorai0.040.0003049710277523
40421988.0Drama110HungaryGéza Beremenyi8.01 Géza Balkay0.040.01
41461992.0Drama88France, Great BritainIan Sellar6.01Alan Cumming0.040.0119955270915934
42461992.0Drama88France, Great BritainIan Sellar6.01 Sandrine Bonnaire0.040.0162651214801269
43461992.0Drama88France, Great BritainIan Sellar6.01 Raphael Meiss0.040.01
44461992.0Drama88France, Great BritainIan Sellar6.01 Bruno Ganz0.040.16692080918979431
45491964.0Drama110United StatesSidney Lumet8.368 Walter Matthau0.2350953206239171680.13835518959032232
46491964.0Drama110United StatesSidney Lumet8.368Henry Fonda0.2350953206239171680.1625495577920138
47491964.0Drama110United StatesSidney Lumet8.368 Dan O'Herlihy0.2350953206239171680.05489478499542514
48491964.0Drama110United StatesSidney Lumet8.368 Dom DeLuise0.2350953206239171680.0477787943478719
49501990.0Action88United States, ChinaLiu Chia Yung7.02 Tong Chung8.6655112652e-0540.0001016570092511
50501990.0Action88United States, ChinaLiu Chia Yung7.02Cynthia Rothrock8.6655112652e-0540.00284639625902212
51501990.0Action88United States, ChinaLiu Chia Yung7.02 Mark Huston8.6655112652e-0540.0001016570092511
52501990.0Action88United States, ChinaLiu Chia Yung7.02 Shing Fulon8.6655112652e-0540.0001016570092511
53531968.0Adventure94ItalyEmilio Miraglia4.06Walter Pidgeon0.00043327556325840.02439768222018911
54531968.0Adventure94ItalyEmilio Miraglia4.06 Tino Carraro0.00043327556325840.02155128596116711
55531968.0Adventure94ItalyEmilio Miraglia4.06 Klaus Kinski0.00043327556325840.10989122700010239
56531968.0Adventure94ItalyEmilio Miraglia4.06 Ira Fürstenberg0.00043327556325840.0053878214902926
57551986.0Comedy100United StatesAlan Metter6.616Rodney Dangerfield0.006759098786828440.0031513672867745
58551986.0Comedy100United StatesAlan Metter6.616 Keith Gordon0.006759098786828440.0253125953034464
59551986.0Comedy100United StatesAlan Metter6.616 Robert Downey jr.0.006759098786828440.34136423706414632
60551986.0Comedy100United StatesAlan Metter6.616 Sally Kellerman0.006759098786828440.02998881772898211
61561988.0Comedy89United StatesNico Mastorakis4.01Will Egan0.000779896013865240.0001016570092512
62561988.0Comedy89United StatesNico Mastorakis4.01 Kathleen Stevens0.000779896013865240.0001016570092512
63561988.0Comedy89United StatesNico Mastorakis4.01 Kelly Randall0.000779896013865240.0001016570092512
64561988.0Comedy89United StatesNico Mastorakis4.01 Gerald Okamura0.000779896013865240.0002033140185023
65581989.0Comedy100United StatesBruce Beresford7.1171Jessica Tandy0.0553726169844021040.0177899766188883
66581989.0Comedy100United StatesBruce Beresford7.1171 Patty LuPone0.0553726169844021040.0172816915726341
67581989.0Comedy100United StatesBruce Beresford7.1171 Morgan Freeman0.0553726169844021040.86144149639117652
68581989.0Comedy100United StatesBruce Beresford7.1171 Dan Aykroyd0.0553726169844021040.36067906882179531
69601987.0Drama98Great BritainMark Rezyka4.01Jeff Osterhage0.040.01
70601987.0Drama98Great BritainMark Rezyka4.01 Lewis Van Bergen0.040.001321541120265
71601987.0Drama98Great BritainMark Rezyka4.01 Joe Estevez0.040.0011182271017596
72601987.0Drama98Great BritainMark Rezyka4.01 Lisa Blount0.040.0076242756938093
73611966.0Western98United StatesSidney J. Furie7.024Marlon Brando0.0291161178509531200.33994103893463532
74611966.0Western98United StatesSidney J. Furie7.024 John Saxon0.0291161178509531200.12066686998068526
75611966.0Western98United StatesSidney J. Furie7.024 Anjanette Comer0.0291161178509531200.0049811934532898
76611966.0Western98United StatesSidney J. Furie7.024 Emilio Fernández0.0291161178509531200.0042695943885332
77631975.0Crime100ItalyMario Caiano5.526Antonio Sabato0.018717504332756960.01768831960963720
78631975.0Crime100ItalyMario Caiano5.526 Luciana Paluzzi0.018717504332756960.03232692894175121
79631975.0Crime100ItalyMario Caiano5.526 Enrico Maria Salerno0.018717504332756960.18308427366066954
80631975.0Crime100ItalyMario Caiano5.526 Gabriele Ferzetti0.018717504332756960.1135508793331357
81641991.0Crime85ItalyRomolo Guerrieri4.01[null]0.01880415944540768[null][null]
82641991.0Crime85ItalyRomolo Guerrieri4.01[null]0.01880415944540768[null][null]
83641991.0Crime85ItalyRomolo Guerrieri4.01Tony Palazzo0.018804159445407680.01
84641991.0Crime85ItalyRomolo Guerrieri4.01 Massimiliano Pazzaglia0.018804159445407680.01
85651981.0Comedy94CanadaLes Rose4.01Susan Anspach0.00025996533795580.0007115990647564
86651981.0Comedy94CanadaLes Rose4.01 Sterling Hayden0.00025996533795580.1276812036189921
87651981.0Comedy94CanadaLes Rose4.01 Howie Mandel0.00025996533795580.0009149130832573
88651981.0Comedy94CanadaLes Rose4.01 Helen Shaver0.00025996533795580.03151367286774419
89681945.0Comedy90ItalyGennaro Righelli6.028Anna Magnani0.004506065857886520.12432652231371424
90681945.0Comedy90ItalyGennaro Righelli6.028 Virgilio Riento0.004506065857886520.01077564298058417
91681945.0Comedy90ItalyGennaro Righelli6.028 Lauro Gazzolo0.004506065857886520.00925078784182213
92681945.0Comedy90ItalyGennaro Righelli6.028 Nino Besozzi0.004506065857886520.01280878316559912
93701986.0Drama120United StatesMichael Miller6.01Justine Bateman0.003726169844021720.0005082850462544
94701986.0Drama120United StatesMichael Miller6.01 Joe Nasser0.003726169844021720.01
95701986.0Drama120United StatesMichael Miller6.01 Max Gail0.003726169844021720.0026430822405217
96701986.0Drama120United StatesMichael Miller6.01 Jason Bateman0.003726169844021720.20351733252007718
97741977.0Adventure130United StatesPeter Yates5.228Nick Nolte0.045753899480069960.09748907187150626
98741977.0Adventure130United StatesPeter Yates5.228 Robert Shaw0.045753899480069960.06180746162447910
99741977.0Adventure130United StatesPeter Yates5.228 Eli Wallach0.045753899480069960.15126562976517232
100741977.0Adventure130United StatesPeter Yates5.228 Jacqueline Bisset0.045753899480069960.09840398495476335
Rows: 1-100 | Columns: 17

We can aggregate the data to get metrics on each movie.

In [17]:
filmtv_movies_complete = filmtv_movies_director_actors.groupby(
                            columns = ["filmtv_id", 
                                       "title",
                                       "year",
                                       "genre",
                                       "country",
                                       "avg_vote",
                                       "votes", 
                                       "duration", 
                                       "director", 
                                       "notoriety_director",
                                       "castings_director"],
                            expr = [st.sum(filmtv_movies_director_actors["notoriety_actors"])._as("notoriety_actors"),
                                    st.sum(filmtv_movies_director_actors["castings_actors"])._as("castings_actors")])

Let's compute some statistics on our dataset.

In [18]:
filmtv_movies_complete.describe(method = "all")
Out[18]:
123
"filmtv_id"
Integer
100%
123
"year"
Numeric(8,2)
99%
123
"avg_vote"
Numeric(6,2)
100%
123
"votes"
Integer
100%
123
"duration"
Integer
100%
123
"notoriety_director"
Numeric(34,15)
99%
123
"castings_director"
Integer
99%
123
"notoriety_actors"
Numeric(34,15)
94%
123
"castings_actors"
Integer
94%
Abc
"title"
Varchar(486)
99%
Abc
"genre"
Varchar(22)
99%
Abc
"country"
Varchar(208)
99%
Abc
"director"
Varchar(1066)
99%
dtypeintegernumeric(8,2)numeric(6,2)integerintegernumeric(34,15)integernumeric(34,15)integervarchar(486)varchar(22)varchar(208)varchar(1066)
percent100.099.981100.0100.0100.099.88499.88494.33594.33599.99699.62299.90499.884
count53397533875339753397533975333553335503725037253395531955334653335
top182016.06.01900.04[null]4Les VampiresDramaUnited StatesMario Mattòli
top_percent0.0023.09215.01422.98811.7987.97622.3125.6656.0430.01930.13841.1410.137
avg44479.20304511491990.978346788545.8447796692698124.541753282019698.46678652358750.030242309368906841.62812412112120.1513469057266939.050266020805217.50892405655966.5140520725632111.498162936302614.9814943283022
stddev42397.272644743622.90076753250251.5236564256841360.975726784458935.3330112784710.081119700089097448.01899838820590.22879181880830834.09619465415019.676488266566431.784579700438766.363784025028947.9513590648408
min181897.00.51400.040.011343
approx_25%14979.681975.05.02.089.00.0006065857885628.00.0095557588695738811.0115612
approx_50%31301.281251997.06.05.095.00.00415944540727924.00.057711754225453830.01661314
approx_75%57081.05970149252010.07.018.0105.00.02166377816291260.00.19575010165700958.02281316
max1799372019.010.0122252801.02922.23147301006404426724311104532
range179919122.09.5122152401.02882.2314730100640442662428100529
empty[null][null][null][null][null][null][null][null][null]0000
Rows: 1-14 | Columns: 14

We can use the movie's release year to get create three categories.

In [19]:
filmtv_movies_complete.case_when('period',
                                 filmtv_movies_complete["year"] < 1990, 'Old',
                                 filmtv_movies_complete["year"] >= 2000, 'Recent', '90s') 
Out[19]:
123
filmtv_id
Integer
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
Abc
Varchar(208)
123
avg_vote
Numeric(6,2)
123
votes
Integer
123
duration
Integer
Abc
Varchar(1066)
123
notoriety_director
Numeric(34,15)
123
castings_director
Integer
123
notoriety_actors
Numeric(34,15)
123
castings_actors
Integer
Abc
period
Varchar(6)
1849571993.0Animation5.813720.001039861351824[null][null]90s
245831979.0Thriller6.011200.018630849220104800.02500762427569421Old
3522242012.0Thriller6.021000.009965337954939560.09464267561248457Recent
4518542012.0Drama6.02888.6655112652e-0540.0004066280370044Recent
585381992.0Drama5.531350.001906412478336120.6757141404899879890s
6695902008.0Fantasy7.021290.029809358752166480.0007115990647566Recent
7549522012.0Comedy7.35800.00034662045060740.0047778794347879Recent
8204241997.0Thriller6.01960.040.000101657009251590s
9346901955.0Drama9.011070.040.00945410186032311Old
101705372019.0Comedy4.07750.00069324090121380.00721764765680611Recent
11310922000.0Animation6.114740.0012131715771238[null][null]Recent
12402281992.0Drama7.031320.003552859618718520.003557995323779790s
13102811960.0Comedy5.02890.001646447140381160.14628443631188370Old
14457722009.0Fantasy6.5111120.00086655112651640.00691267662905412Recent
15753342007.0Thriller4.841100.001993067590988400.37389448002439867Recent
16174561976.0Action8.02900.1746967071057191120.27416895394937558Old
171467192017.0Documentary7.65490.0007798960138658[null][null]Recent
1884001990.0Drama9.15990.00051993067591120.002439768222019790s
19183631997.0Action6.01900.001646447140381160.1584832774219793590s
20187721963.0Thriller8.39820.015597920277296720.0035579953237775Old
21253971989.0Crime7.0491150.0533795493934141040.12696960455423537Old
22233781950.0Comedy5.36930.002772963604853320.09992884009352456Old
23137821995.0Drama6.9191100.002253032928943240.0562163261156852590s
241409091958.0Drama8.85660.00034662045060740.0008132560740062Old
25153721994.0Horror4.629910.02105719237435480.1498424316356612990s
2673381984.0Drama4.52880.00051993067591120.00365965233302915Old
27583022012.0Thriller3.36900.00086655112651680.26745959133882246Recent
28109531972.0Adventure5.817930.006499133448873360.14862254752465279Old
29742382014.0Drama10.011260.040.00569279251804414Recent
301622572016.0[null]7.01880.040.04Recent
31490782011.0Documentary6.53920.00025996533795580.0023381112127695Recent
32282881999.0Horror5.03950.010398613518198320.0034563383145281390s
3337741985.0Comedy5.1811090.09107452339688720.10247026532479415Old
3488521981.0Comedy4.5301100.055112651646447680.04137440276507141Old
351286652016.0Documentary10.01570.04[null][null]Recent
36104591960.0Comedy5.8231000.012045060658579520.35091999593372128Old
37702032007.0Drama7.340910.00337954939341440.0253125953034478Recent
38589192012.0Comedy4.021000.00017331022530380.0007115990647567Recent
391591172019.0Comedy6.059930.102253032928943520.1534004269594426Recent
4072011991.0Drama6.01900.040.0390s
4171491986.0Drama7.95980.000953206239168120.0014231981295114Old
421658232016.0Drama6.52830.00017331022530380.0004066280370044Recent
431391988.0Drama6.011008.6655112652e-0580.16946223442106441Old
44144971992.0Drama5.151130.125303292894281440.0330385280065073290s
45168301996.0Thriller8.02920.00017331022530380.000711599064756790s
46130571990.0Comedy4.01900.0082322357019061560.0661787130222632690s
4790031984.0Comedy5.929950.024610051993068641.03537663921927596Old
48379962004.0Horror5.511980.015164644714038400.25790383246924977Recent
49150711983.0Comedy5.88950.00060658578856240.10409677747280751Old
50177241978.0Spy6.9111000.006412478336222560.10917962793534743Old
51542542010.0Drama3.16900.001473136915078120.00853918877706722Recent
52404342009.0Drama8.021150.006759098786828240.04330588594083613Recent
53693692014.0Comedy5.2113950.02668977469670780.26959438853308938Recent
54111301961.0War4.011000.017417677642981760.10023381112127745Old
55617252013.0Comedy4.97900.004592720970537360.0166717495171316Recent
561655772015.0Comedy9.021160.00043327556325880.0426959438853314Recent
57555052006.0Horror5.521098.6655112652e-0540.00853918877706710Recent
58468422011.0Drama6.1461000.00389948006932440.43671851174138584Recent
59425142009.0Action3.515920.013431542461005920.01240215512859610Recent
601715692012.0Horror7.011140.040.00620107756429811Recent
6170521959.0Drama7.545810.1237435008665511240.0134187252211053Old
62422431994.0Fantasy5.84910.00069324090121380.1731218867540917690s
63447742012.0Drama5.1291070.031369150779896400.51621429297550175Recent
64512702012.0Drama3.35900.009445407279029360.15004574565416250Recent
65306512004.0Comedy7.2269990.097487001733102200.23076141099928819Recent
66540672012.0Adventure6.838880.005892547660312120.04066280370031613Recent
67556882008.0Horror6.02860.001126516464471120.02510928128494516Recent
681449692016.0Thriller1.01908.6655112652e-0580.01972145979465381Recent
69223182002.0Drama4.85940.00034662045060740.0020331401850155Recent
70860712016.0Documentary10.01700.04[null][null]Recent
71145001969.0Drama5.86900.1798093587521661400.0568262681711912Old
7267051963.0Spy4.011000.040.01108061400833622Old
731471572017.0Animation8.01930.0001733102253038[null][null]Recent
74855092011.0Action5.54910.002339688041594120.22700010165700948Recent
75649702014.0Comedy5.3821080.021490467937608200.34685371556368850Recent
76382672007.0Thriller5.926950.00216637781629140.42523126969604685Recent
77135321962.0History4.8381500.00320623916811140.14852089051540278Old
781615392018.0Drama6.01880.040.01728169157263422Recent
79378502001.0Comedy4.03880.008058925476603360.01026735793433111Recent
801148062016.0Drama6.2111070.00233968804159480.0073193046660578Recent
81452722006.0Action8.041170.1487001733102252320.01484192335061414Recent
82232641985.0Horror7.5107860.049566724436742640.06414557283724724Old
83448252009.0Action3.310900.00086655112651680.02551590932194717Recent
84271712003.0Thriller5.79950.009618717504333520.14892751855240561Recent
85586952003.0Comedy7.83900.00017331022530340.0009149130832585Recent
86682412014.0Comedy4.341000.010051993067591760.1767815390871250Recent
87317201982.0Thriller7.041050.00025996533795540.13998170173833491Old
88506602012.0Comedy7.11991100.01715771230502640.11914201484192439Recent
89279591965.0Romantic7.07750.008058925476603480.0040662803700327Old
90115881956.0Adventure7.5131010.001906412478336120.13520382230354877Old
91251501962.0Adventure4.39940.1166377816291162440.00599776354579610Old
921276282016.0Documentary6.01920.00268630849220120[null][null]Recent
93274392004.0Drama4.54900.003552859618718520.25332926705296365Recent
94420622007.0Comedy4.27900.0005199306759140.002439768222024Recent
95229651946.0Spy6.01980.040.11558401951814659Old
96148181985.0Action4.01950.0005199306759180.00396462336078112Old
97325942005.0Comedy5.681120.011958405545927400.10043712513977916Recent
9829291952.0Adventure6.05830.001819757365685360.01453695232286312Old
99338791930.0Musical6.011040.040.0696350513367936Old
100385152007.0Comedy2.01850.040.04Recent
Rows: 1-100 of 53397 | Columns: 14

Now, let's look at the countries that made the most movies.

In [20]:
filmtv_movies_complete.groupby(columns = ["country"], 
                               expr = ["COUNT(*)"]).sort(
                               {"count" : "desc"}).head(10)
Out[20]:
Abc
country
Varchar(208)
123
COUNT
Integer
1United States21968
2Italy9071
3France3045
4Great Britain2524
5Germany1699
6Japan1133
7Canada1053
8Spain563
9Italy, France403
10Hong Kong380
Rows: 1-10 | Columns: 2

We can use this variable to create language groups.

In [22]:
from verticapy import str_sql
# Language Discretization
Arabic_Middle_Est = ["Arab", "Iran", "Turkey", "Egypt", "Tunisia",
                     "Lebanon", "Palestine", "Morocco", "Iraq",
                     "Sudan", "Algeria", "Yemen", "Afghanistan",
                     "Azerbaijan", "Kazakhstan", "Kyrgyzstan",
                     "Kurdistan", "Syria", "Uzbekistan"]
Chinese_Japan_Asian = ["Japan", "Hong Kong", "China", "South Korea", 
                       "Thailand", "Philippines", "Taiwan", "Indonesia",
                       "Singapore", "Malaysia", "Vietnam", "Laos", "Cambodia",
                       "Bhutan"]
Indian = ["India", "Pakistan", "Nepal", "Sri Lanka", "Bangladesh"]
Hebrew = ["Israel"]
Spanish_Portuguese = ["Spain", "Portugal", "Mexico", "Brasil", "Chile",
                      "Argentina", "Colombia", "Cuba", "Venezuela", "Peru",
                      "Uruguay", "Dominican Republic", "Ecuador", "Guatemala",
                      "Costa Rica", "Paraguay", "Bolivia"]
English = ["United States", "England", "Great Britain", "Ireland",
           "Australia", "New Zealand", "South Africa"]
French = ["France", "Canada", "Belgium", "Switzerland", "Luxembourg"]
Italian = ["Italy"]
German_North_Europe = ["German", "Austria", "Holland", "Netherlands", "Denmark",
                       "Norway", "Iceland", "Finland", "Sweden", "Greenland"]
Russian_Est_Europe = ["Russia", "Soviet Union", "Yugoslavia", "Czechoslovakia",
                      "Poland", "Bulgaria", "Croatia", "Czech Republic", "Serbia",
                      "Ukraine", "Slovenia", "Lithuania", "Latvia", "Estonia", 
                      "Bosnia and Herzegovina", "Georgia"]
Grec_Balkan = ["Greece", "Macedonia", "Cyprus", "Romania", "Armenia", "Hungary",
               "Albania", "Malta"]

# Creation of the new feature
filmtv_movies_complete.case_when('language_area', 
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Arabic_Middle_Est))), 'Arabic_Middle_Est',
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Chinese_Japan_Asian))), 'Chinese_Japan_Asian', 
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Indian))), 'Indian', 
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Hebrew))), 'Hebrew', 
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Spanish_Portuguese))), 'Spanish_Portuguese', 
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(English))), 'English',
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(French))), 'French',
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Italian))), 'Italian',
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(German_North_Europe))), 'German_North_Europe',
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Russian_Est_Europe))), 'Russian_Est_Europe',
        str_sql("REGEXP_LIKE(Country, '{}')".format("|".join(Grec_Balkan))), 'Grec_Balkan', 
        'Others') 
Out[22]:
123
filmtv_id
Integer
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
Abc
Varchar(208)
123
avg_vote
Numeric(6,2)
123
votes
Integer
123
duration
Integer
Abc
Varchar(1066)
123
notoriety_director
Numeric(34,15)
123
castings_director
Integer
123
notoriety_actors
Numeric(34,15)
123
castings_actors
Integer
Abc
period
Varchar(6)
Abc
language_area
Varchar(19)
1849571993.0Animation5.813720.001039861351824[null][null]90sChinese_Japan_Asian
245831979.0Thriller6.011200.018630849220104800.02500762427569421OldEnglish
3522242012.0Thriller6.021000.009965337954939560.09464267561248457RecentItalian
4518542012.0Drama6.02888.6655112652e-0540.0004066280370044RecentSpanish_Portuguese
585381992.0Drama5.531350.001906412478336120.6757141404899879890sSpanish_Portuguese
6695902008.0Fantasy7.021290.029809358752166480.0007115990647566RecentChinese_Japan_Asian
7549522012.0Comedy7.35800.00034662045060740.0047778794347879RecentRussian_Est_Europe
8204241997.0Thriller6.01960.040.000101657009251590sFrench
9346901955.0Drama9.011070.040.00945410186032311OldRussian_Est_Europe
101705372019.0Comedy4.07750.00069324090121380.00721764765680611RecentSpanish_Portuguese
11310922000.0Animation6.114740.0012131715771238[null][null]RecentEnglish
12402281992.0Drama7.031320.003552859618718520.003557995323779790sIndian
13102811960.0Comedy5.02890.001646447140381160.14628443631188370OldEnglish
14457722009.0Fantasy6.5111120.00086655112651640.00691267662905412RecentFrench
15753342007.0Thriller4.841100.001993067590988400.37389448002439867RecentEnglish
16174561976.0Action8.02900.1746967071057191120.27416895394937558OldEnglish
171467192017.0Documentary7.65490.0007798960138658[null][null]RecentFrench
1884001990.0Drama9.15990.00051993067591120.002439768222019790sRussian_Est_Europe
19183631997.0Action6.01900.001646447140381160.1584832774219793590sEnglish
20187721963.0Thriller8.39820.015597920277296720.0035579953237775OldEnglish
21253971989.0Crime7.0491150.0533795493934141040.12696960455423537OldChinese_Japan_Asian
22233781950.0Comedy5.36930.002772963604853320.09992884009352456OldEnglish
23137821995.0Drama6.9191100.002253032928943240.0562163261156852590sFrench
241409091958.0Drama8.85660.00034662045060740.0008132560740062OldRussian_Est_Europe
25153721994.0Horror4.629910.02105719237435480.1498424316356612990sEnglish
2673381984.0Drama4.52880.00051993067591120.00365965233302915OldFrench
27583022012.0Thriller3.36900.00086655112651680.26745959133882246RecentEnglish
28109531972.0Adventure5.817930.006499133448873360.14862254752465279OldItalian
29742382014.0Drama10.011260.040.00569279251804414RecentChinese_Japan_Asian
301622572016.0[null]7.01880.040.04RecentEnglish
31490782011.0Documentary6.53920.00025996533795580.0023381112127695RecentEnglish
32282881999.0Horror5.03950.010398613518198320.0034563383145281390sEnglish
3337741985.0Comedy5.1811090.09107452339688720.10247026532479415OldEnglish
3488521981.0Comedy4.5301100.055112651646447680.04137440276507141OldEnglish
351286652016.0Documentary10.01570.04[null][null]RecentItalian
36104591960.0Comedy5.8231000.012045060658579520.35091999593372128OldEnglish
37702032007.0Drama7.340910.00337954939341440.0253125953034478RecentEnglish
38589192012.0Comedy4.021000.00017331022530380.0007115990647567RecentItalian
391591172019.0Comedy6.059930.102253032928943520.1534004269594426RecentItalian
4072011991.0Drama6.01900.040.0390sGrec_Balkan
4171491986.0Drama7.95980.000953206239168120.0014231981295114OldRussian_Est_Europe
421658232016.0Drama6.52830.00017331022530380.0004066280370044RecentChinese_Japan_Asian
431391988.0Drama6.011008.6655112652e-0580.16946223442106441OldEnglish
44144971992.0Drama5.151130.125303292894281440.0330385280065073290sEnglish
45168301996.0Thriller8.02920.00017331022530380.000711599064756790sGerman_North_Europe
46130571990.0Comedy4.01900.0082322357019061560.0661787130222632690sEnglish
4790031984.0Comedy5.929950.024610051993068641.03537663921927596OldEnglish
48379962004.0Horror5.511980.015164644714038400.25790383246924977RecentEnglish
49150711983.0Comedy5.88950.00060658578856240.10409677747280751OldItalian
50177241978.0Spy6.9111000.006412478336222560.10917962793534743OldEnglish
51542542010.0Drama3.16900.001473136915078120.00853918877706722RecentGerman_North_Europe
52404342009.0Drama8.021150.006759098786828240.04330588594083613RecentGrec_Balkan
53693692014.0Comedy5.2113950.02668977469670780.26959438853308938RecentItalian
54111301961.0War4.011000.017417677642981760.10023381112127745OldEnglish
55617252013.0Comedy4.97900.004592720970537360.0166717495171316RecentEnglish
561655772015.0Comedy9.021160.00043327556325880.0426959438853314RecentItalian
57555052006.0Horror5.521098.6655112652e-0540.00853918877706710RecentChinese_Japan_Asian
58468422011.0Drama6.1461000.00389948006932440.43671851174138584RecentEnglish
59425142009.0Action3.515920.013431542461005920.01240215512859610RecentEnglish
601715692012.0Horror7.011140.040.00620107756429811RecentEnglish
6170521959.0Drama7.545810.1237435008665511240.0134187252211053OldItalian
62422431994.0Fantasy5.84910.00069324090121380.1731218867540917690sEnglish
63447742012.0Drama5.1291070.031369150779896400.51621429297550175RecentEnglish
64512702012.0Drama3.35900.009445407279029360.15004574565416250RecentItalian
65306512004.0Comedy7.2269990.097487001733102200.23076141099928819RecentEnglish
66540672012.0Adventure6.838880.005892547660312120.04066280370031613RecentSpanish_Portuguese
67556882008.0Horror6.02860.001126516464471120.02510928128494516RecentChinese_Japan_Asian
681449692016.0Thriller1.01908.6655112652e-0580.01972145979465381RecentGerman_North_Europe
69223182002.0Drama4.85940.00034662045060740.0020331401850155RecentItalian
70860712016.0Documentary10.01700.04[null][null]RecentEnglish
71145001969.0Drama5.86900.1798093587521661400.0568262681711912OldEnglish
7267051963.0Spy4.011000.040.01108061400833622OldFrench
731471572017.0Animation8.01930.0001733102253038[null][null]RecentChinese_Japan_Asian
74855092011.0Action5.54910.002339688041594120.22700010165700948RecentEnglish
75649702014.0Comedy5.3821080.021490467937608200.34685371556368850RecentEnglish
76382672007.0Thriller5.926950.00216637781629140.42523126969604685RecentSpanish_Portuguese
77135321962.0History4.8381500.00320623916811140.14852089051540278OldItalian
781615392018.0Drama6.01880.040.01728169157263422RecentEnglish
79378502001.0Comedy4.03880.008058925476603360.01026735793433111RecentEnglish
801148062016.0Drama6.2111070.00233968804159480.0073193046660578RecentEnglish
81452722006.0Action8.041170.1487001733102252320.01484192335061414RecentChinese_Japan_Asian
82232641985.0Horror7.5107860.049566724436742640.06414557283724724OldEnglish
83448252009.0Action3.310900.00086655112651680.02551590932194717RecentEnglish
84271712003.0Thriller5.79950.009618717504333520.14892751855240561RecentEnglish
85586952003.0Comedy7.83900.00017331022530340.0009149130832585RecentChinese_Japan_Asian
86682412014.0Comedy4.341000.010051993067591760.1767815390871250RecentItalian
87317201982.0Thriller7.041050.00025996533795540.13998170173833491OldFrench
88506602012.0Comedy7.11991100.01715771230502640.11914201484192439RecentFrench
89279591965.0Romantic7.07750.008058925476603480.0040662803700327OldFrench
90115881956.0Adventure7.5131010.001906412478336120.13520382230354877OldEnglish
91251501962.0Adventure4.39940.1166377816291162440.00599776354579610OldItalian
921276282016.0Documentary6.01920.00268630849220120[null][null]RecentEnglish
93274392004.0Drama4.54900.003552859618718520.25332926705296365RecentItalian
94420622007.0Comedy4.27900.0005199306759140.002439768222024RecentEnglish
95229651946.0Spy6.01980.040.11558401951814659OldEnglish
96148181985.0Action4.01950.0005199306759180.00396462336078112OldEnglish
97325942005.0Comedy5.681120.011958405545927400.10043712513977916RecentEnglish
9829291952.0Adventure6.05830.001819757365685360.01453695232286312OldEnglish
99338791930.0Musical6.011040.040.0696350513367936OldEnglish
100385152007.0Comedy2.01850.040.04RecentFrench
Rows: 1-100 of 53397 | Columns: 15

We can do the same for the genres.

In [23]:
filmtv_movies_complete.case_when(
         'Category', 
         str_sql("REGEXP_LIKE(Genre, 'Drama|Noir')"), 'Drama', 
         str_sql("REGEXP_LIKE(Genre, 'Comedy|Grotesque')"), 'Comedy', 
         str_sql("REGEXP_LIKE(Genre, 'Fantasy|Super-hero')"), 'Fantasy', 
         str_sql("REGEXP_LIKE(Genre, 'Romantic|Sperimental|Mélo')"), 'Romantic', 
         str_sql("REGEXP_LIKE(Genre, 'Thriller|Crime|Gangster')"), 'Thriller', 
         str_sql("REGEXP_LIKE(Genre, 'Action|Western|War|Spy')"), 'Action', 
         str_sql("REGEXP_LIKE(Genre, 'Adventure')"), 'Adventure', 
         str_sql("REGEXP_LIKE(Genre, 'Animation')"), 'Animation', 
         str_sql("REGEXP_LIKE(Genre, 'Horror')"), 'Horror', 
         'Others') 
Out[23]:
123
filmtv_id
Integer
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
genre
Varchar(22)
Abc
Varchar(208)
123
avg_vote
Numeric(6,2)
123
votes
Integer
123
duration
Integer
Abc
Varchar(1066)
123
notoriety_director
Numeric(34,15)
123
castings_director
Integer
123
notoriety_actors
Numeric(34,15)
123
castings_actors
Integer
Abc
period
Varchar(6)
Abc
language_area
Varchar(19)
Abc
Category
Varchar(9)
1849571993.0Animation5.813720.001039861351824[null][null]90sChinese_Japan_AsianAnimation
245831979.0Thriller6.011200.018630849220104800.02500762427569421OldEnglishThriller
3522242012.0Thriller6.021000.009965337954939560.09464267561248457RecentItalianThriller
4518542012.0Drama6.02888.6655112652e-0540.0004066280370044RecentSpanish_PortugueseDrama
585381992.0Drama5.531350.001906412478336120.6757141404899879890sSpanish_PortugueseDrama
6695902008.0Fantasy7.021290.029809358752166480.0007115990647566RecentChinese_Japan_AsianFantasy
7549522012.0Comedy7.35800.00034662045060740.0047778794347879RecentRussian_Est_EuropeComedy
8204241997.0Thriller6.01960.040.000101657009251590sFrenchThriller
9346901955.0Drama9.011070.040.00945410186032311OldRussian_Est_EuropeDrama
101705372019.0Comedy4.07750.00069324090121380.00721764765680611RecentSpanish_PortugueseComedy
11310922000.0Animation6.114740.0012131715771238[null][null]RecentEnglishAnimation
12402281992.0Drama7.031320.003552859618718520.003557995323779790sIndianDrama
13102811960.0Comedy5.02890.001646447140381160.14628443631188370OldEnglishComedy
14457722009.0Fantasy6.5111120.00086655112651640.00691267662905412RecentFrenchFantasy
15753342007.0Thriller4.841100.001993067590988400.37389448002439867RecentEnglishThriller
16174561976.0Action8.02900.1746967071057191120.27416895394937558OldEnglishAction
171467192017.0Documentary7.65490.0007798960138658[null][null]RecentFrenchOthers
1884001990.0Drama9.15990.00051993067591120.002439768222019790sRussian_Est_EuropeDrama
19183631997.0Action6.01900.001646447140381160.1584832774219793590sEnglishAction
20187721963.0Thriller8.39820.015597920277296720.0035579953237775OldEnglishThriller
21253971989.0Crime7.0491150.0533795493934141040.12696960455423537OldChinese_Japan_AsianThriller
22233781950.0Comedy5.36930.002772963604853320.09992884009352456OldEnglishComedy
23137821995.0Drama6.9191100.002253032928943240.0562163261156852590sFrenchDrama
241409091958.0Drama8.85660.00034662045060740.0008132560740062OldRussian_Est_EuropeDrama
25153721994.0Horror4.629910.02105719237435480.1498424316356612990sEnglishHorror
2673381984.0Drama4.52880.00051993067591120.00365965233302915OldFrenchDrama
27583022012.0Thriller3.36900.00086655112651680.26745959133882246RecentEnglishThriller
28109531972.0Adventure5.817930.006499133448873360.14862254752465279OldItalianAdventure
29742382014.0Drama10.011260.040.00569279251804414RecentChinese_Japan_AsianDrama
301622572016.0[null]7.01880.040.04RecentEnglishOthers
31490782011.0Documentary6.53920.00025996533795580.0023381112127695RecentEnglishOthers
32282881999.0Horror5.03950.010398613518198320.0034563383145281390sEnglishHorror
3337741985.0Comedy5.1811090.09107452339688720.10247026532479415OldEnglishComedy
3488521981.0Comedy4.5301100.055112651646447680.04137440276507141OldEnglishComedy
351286652016.0Documentary10.01570.04[null][null]RecentItalianOthers
36104591960.0Comedy5.8231000.012045060658579520.35091999593372128OldEnglishComedy
37702032007.0Drama7.340910.00337954939341440.0253125953034478RecentEnglishDrama
38589192012.0Comedy4.021000.00017331022530380.0007115990647567RecentItalianComedy
391591172019.0Comedy6.059930.102253032928943520.1534004269594426RecentItalianComedy
4072011991.0Drama6.01900.040.0390sGrec_BalkanDrama
4171491986.0Drama7.95980.000953206239168120.0014231981295114OldRussian_Est_EuropeDrama
421658232016.0Drama6.52830.00017331022530380.0004066280370044RecentChinese_Japan_AsianDrama
431391988.0Drama6.011008.6655112652e-0580.16946223442106441OldEnglishDrama
44144971992.0Drama5.151130.125303292894281440.0330385280065073290sEnglishDrama
45168301996.0Thriller8.02920.00017331022530380.000711599064756790sGerman_North_EuropeThriller
46130571990.0Comedy4.01900.0082322357019061560.0661787130222632690sEnglishComedy
4790031984.0Comedy5.929950.024610051993068641.03537663921927596OldEnglishComedy
48379962004.0Horror5.511980.015164644714038400.25790383246924977RecentEnglishHorror
49150711983.0Comedy5.88950.00060658578856240.10409677747280751OldItalianComedy
50177241978.0Spy6.9111000.006412478336222560.10917962793534743OldEnglishAction
51542542010.0Drama3.16900.001473136915078120.00853918877706722RecentGerman_North_EuropeDrama
52404342009.0Drama8.021150.006759098786828240.04330588594083613RecentGrec_BalkanDrama
53693692014.0Comedy5.2113950.02668977469670780.26959438853308938RecentItalianComedy
54111301961.0War4.011000.017417677642981760.10023381112127745OldEnglishAction
55617252013.0Comedy4.97900.004592720970537360.0166717495171316RecentEnglishComedy
561655772015.0Comedy9.021160.00043327556325880.0426959438853314RecentItalianComedy
57555052006.0Horror5.521098.6655112652e-0540.00853918877706710RecentChinese_Japan_AsianHorror
58468422011.0Drama6.1461000.00389948006932440.43671851174138584RecentEnglishDrama
59425142009.0Action3.515920.013431542461005920.01240215512859610RecentEnglishAction
601715692012.0Horror7.011140.040.00620107756429811RecentEnglishHorror
6170521959.0Drama7.545810.1237435008665511240.0134187252211053OldItalianDrama
62422431994.0Fantasy5.84910.00069324090121380.1731218867540917690sEnglishFantasy
63447742012.0Drama5.1291070.031369150779896400.51621429297550175RecentEnglishDrama
64512702012.0Drama3.35900.009445407279029360.15004574565416250RecentItalianDrama
65306512004.0Comedy7.2269990.097487001733102200.23076141099928819RecentEnglishComedy
66540672012.0Adventure6.838880.005892547660312120.04066280370031613RecentSpanish_PortugueseAdventure
67556882008.0Horror6.02860.001126516464471120.02510928128494516RecentChinese_Japan_AsianHorror
681449692016.0Thriller1.01908.6655112652e-0580.01972145979465381RecentGerman_North_EuropeThriller
69223182002.0Drama4.85940.00034662045060740.0020331401850155RecentItalianDrama
70860712016.0Documentary10.01700.04[null][null]RecentEnglishOthers
71145001969.0Drama5.86900.1798093587521661400.0568262681711912OldEnglishDrama
7267051963.0Spy4.011000.040.01108061400833622OldFrenchAction
731471572017.0Animation8.01930.0001733102253038[null][null]RecentChinese_Japan_AsianAnimation
74855092011.0Action5.54910.002339688041594120.22700010165700948RecentEnglishAction
75649702014.0Comedy5.3821080.021490467937608200.34685371556368850RecentEnglishComedy
76382672007.0Thriller5.926950.00216637781629140.42523126969604685RecentSpanish_PortugueseThriller
77135321962.0History4.8381500.00320623916811140.14852089051540278OldItalianOthers
781615392018.0Drama6.01880.040.01728169157263422RecentEnglishDrama
79378502001.0Comedy4.03880.008058925476603360.01026735793433111RecentEnglishComedy
801148062016.0Drama6.2111070.00233968804159480.0073193046660578RecentEnglishDrama
81452722006.0Action8.041170.1487001733102252320.01484192335061414RecentChinese_Japan_AsianAction
82232641985.0Horror7.5107860.049566724436742640.06414557283724724OldEnglishHorror
83448252009.0Action3.310900.00086655112651680.02551590932194717RecentEnglishAction
84271712003.0Thriller5.79950.009618717504333520.14892751855240561RecentEnglishThriller
85586952003.0Comedy7.83900.00017331022530340.0009149130832585RecentChinese_Japan_AsianComedy
86682412014.0Comedy4.341000.010051993067591760.1767815390871250RecentItalianComedy
87317201982.0Thriller7.041050.00025996533795540.13998170173833491OldFrenchThriller
88506602012.0Comedy7.11991100.01715771230502640.11914201484192439RecentFrenchComedy
89279591965.0Romantic7.07750.008058925476603480.0040662803700327OldFrenchRomantic
90115881956.0Adventure7.5131010.001906412478336120.13520382230354877OldEnglishAdventure
91251501962.0Adventure4.39940.1166377816291162440.00599776354579610OldItalianAdventure
921276282016.0Documentary6.01920.00268630849220120[null][null]RecentEnglishOthers
93274392004.0Drama4.54900.003552859618718520.25332926705296365RecentItalianDrama
94420622007.0Comedy4.27900.0005199306759140.002439768222024RecentEnglishComedy
95229651946.0Spy6.01980.040.11558401951814659OldEnglishAction
96148181985.0Action4.01950.0005199306759180.00396462336078112OldEnglishAction
97325942005.0Comedy5.681120.011958405545927400.10043712513977916RecentEnglishComedy
9829291952.0Adventure6.05830.001819757365685360.01453695232286312OldEnglishAdventure
99338791930.0Musical6.011040.040.0696350513367936OldEnglishOthers
100385152007.0Comedy2.01850.040.04RecentFrenchComedy
Rows: 1-100 of 53397 | Columns: 16

Since we're more concerned with the 'Category' at this point, we can drop 'genre.'

In [24]:
filmtv_movies_complete.drop(columns = ["genre"])
Out[24]:
123
filmtv_id
Integer
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
Varchar(208)
123
avg_vote
Numeric(6,2)
123
votes
Integer
123
duration
Integer
Abc
Varchar(1066)
123
notoriety_director
Numeric(34,15)
123
castings_director
Integer
123
notoriety_actors
Numeric(34,15)
123
castings_actors
Integer
Abc
period
Varchar(6)
Abc
language_area
Varchar(19)
Abc
Category
Varchar(9)
1849571993.05.813720.001039861351824[null][null]90sChinese_Japan_AsianAnimation
245831979.06.011200.018630849220104800.02500762427569421OldEnglishThriller
3522242012.06.021000.009965337954939560.09464267561248457RecentItalianThriller
4518542012.06.02888.6655112652e-0540.0004066280370044RecentSpanish_PortugueseDrama
585381992.05.531350.001906412478336120.6757141404899879890sSpanish_PortugueseDrama
6695902008.07.021290.029809358752166480.0007115990647566RecentChinese_Japan_AsianFantasy
7549522012.07.35800.00034662045060740.0047778794347879RecentRussian_Est_EuropeComedy
8204241997.06.01960.040.000101657009251590sFrenchThriller
9346901955.09.011070.040.00945410186032311OldRussian_Est_EuropeDrama
101705372019.04.07750.00069324090121380.00721764765680611RecentSpanish_PortugueseComedy
11310922000.06.114740.0012131715771238[null][null]RecentEnglishAnimation
12402281992.07.031320.003552859618718520.003557995323779790sIndianDrama
13102811960.05.02890.001646447140381160.14628443631188370OldEnglishComedy
14457722009.06.5111120.00086655112651640.00691267662905412RecentFrenchFantasy
15753342007.04.841100.001993067590988400.37389448002439867RecentEnglishThriller
16174561976.08.02900.1746967071057191120.27416895394937558OldEnglishAction
171467192017.07.65490.0007798960138658[null][null]RecentFrenchOthers
1884001990.09.15990.00051993067591120.002439768222019790sRussian_Est_EuropeDrama
19183631997.06.01900.001646447140381160.1584832774219793590sEnglishAction
20187721963.08.39820.015597920277296720.0035579953237775OldEnglishThriller
21253971989.07.0491150.0533795493934141040.12696960455423537OldChinese_Japan_AsianThriller
22233781950.05.36930.002772963604853320.09992884009352456OldEnglishComedy
23137821995.06.9191100.002253032928943240.0562163261156852590sFrenchDrama
241409091958.08.85660.00034662045060740.0008132560740062OldRussian_Est_EuropeDrama
25153721994.04.629910.02105719237435480.1498424316356612990sEnglishHorror
2673381984.04.52880.00051993067591120.00365965233302915OldFrenchDrama
27583022012.03.36900.00086655112651680.26745959133882246RecentEnglishThriller
28109531972.05.817930.006499133448873360.14862254752465279OldItalianAdventure
29742382014.010.011260.040.00569279251804414RecentChinese_Japan_AsianDrama
301622572016.07.01880.040.04RecentEnglishOthers
31490782011.06.53920.00025996533795580.0023381112127695RecentEnglishOthers
32282881999.05.03950.010398613518198320.0034563383145281390sEnglishHorror
3337741985.05.1811090.09107452339688720.10247026532479415OldEnglishComedy
3488521981.04.5301100.055112651646447680.04137440276507141OldEnglishComedy
351286652016.010.01570.04[null][null]RecentItalianOthers
36104591960.05.8231000.012045060658579520.35091999593372128OldEnglishComedy
37702032007.07.340910.00337954939341440.0253125953034478RecentEnglishDrama
38589192012.04.021000.00017331022530380.0007115990647567RecentItalianComedy
391591172019.06.059930.102253032928943520.1534004269594426RecentItalianComedy
4072011991.06.01900.040.0390sGrec_BalkanDrama
4171491986.07.95980.000953206239168120.0014231981295114OldRussian_Est_EuropeDrama
421658232016.06.52830.00017331022530380.0004066280370044RecentChinese_Japan_AsianDrama
431391988.06.011008.6655112652e-0580.16946223442106441OldEnglishDrama
44144971992.05.151130.125303292894281440.0330385280065073290sEnglishDrama
45168301996.08.02920.00017331022530380.000711599064756790sGerman_North_EuropeThriller
46130571990.04.01900.0082322357019061560.0661787130222632690sEnglishComedy
4790031984.05.929950.024610051993068641.03537663921927596OldEnglishComedy
48379962004.05.511980.015164644714038400.25790383246924977RecentEnglishHorror
49150711983.05.88950.00060658578856240.10409677747280751OldItalianComedy
50177241978.06.9111000.006412478336222560.10917962793534743OldEnglishAction
51542542010.03.16900.001473136915078120.00853918877706722RecentGerman_North_EuropeDrama
52404342009.08.021150.006759098786828240.04330588594083613RecentGrec_BalkanDrama
53693692014.05.2113950.02668977469670780.26959438853308938RecentItalianComedy
54111301961.04.011000.017417677642981760.10023381112127745OldEnglishAction
55617252013.04.97900.004592720970537360.0166717495171316RecentEnglishComedy
561655772015.09.021160.00043327556325880.0426959438853314RecentItalianComedy
57555052006.05.521098.6655112652e-0540.00853918877706710RecentChinese_Japan_AsianHorror
58468422011.06.1461000.00389948006932440.43671851174138584RecentEnglishDrama
59425142009.03.515920.013431542461005920.01240215512859610RecentEnglishAction
601715692012.07.011140.040.00620107756429811RecentEnglishHorror
6170521959.07.545810.1237435008665511240.0134187252211053OldItalianDrama
62422431994.05.84910.00069324090121380.1731218867540917690sEnglishFantasy
63447742012.05.1291070.031369150779896400.51621429297550175RecentEnglishDrama
64512702012.03.35900.009445407279029360.15004574565416250RecentItalianDrama
65306512004.07.2269990.097487001733102200.23076141099928819RecentEnglishComedy
66540672012.06.838880.005892547660312120.04066280370031613RecentSpanish_PortugueseAdventure
67556882008.06.02860.001126516464471120.02510928128494516RecentChinese_Japan_AsianHorror
681449692016.01.01908.6655112652e-0580.01972145979465381RecentGerman_North_EuropeThriller
69223182002.04.85940.00034662045060740.0020331401850155RecentItalianDrama
70860712016.010.01700.04[null][null]RecentEnglishOthers
71145001969.05.86900.1798093587521661400.0568262681711912OldEnglishDrama
7267051963.04.011000.040.01108061400833622OldFrenchAction
731471572017.08.01930.0001733102253038[null][null]RecentChinese_Japan_AsianAnimation
74855092011.05.54910.002339688041594120.22700010165700948RecentEnglishAction
75649702014.05.3821080.021490467937608200.34685371556368850RecentEnglishComedy
76382672007.05.926950.00216637781629140.42523126969604685RecentSpanish_PortugueseThriller
77135321962.04.8381500.00320623916811140.14852089051540278OldItalianOthers
781615392018.06.01880.040.01728169157263422RecentEnglishDrama
79378502001.04.03880.008058925476603360.01026735793433111RecentEnglishComedy
801148062016.06.2111070.00233968804159480.0073193046660578RecentEnglishDrama
81452722006.08.041170.1487001733102252320.01484192335061414RecentChinese_Japan_AsianAction
82232641985.07.5107860.049566724436742640.06414557283724724OldEnglishHorror
83448252009.03.310900.00086655112651680.02551590932194717RecentEnglishAction
84271712003.05.79950.009618717504333520.14892751855240561RecentEnglishThriller
85586952003.07.83900.00017331022530340.0009149130832585RecentChinese_Japan_AsianComedy
86682412014.04.341000.010051993067591760.1767815390871250RecentItalianComedy
87317201982.07.041050.00025996533795540.13998170173833491OldFrenchThriller
88506602012.07.11991100.01715771230502640.11914201484192439RecentFrenchComedy
89279591965.07.07750.008058925476603480.0040662803700327OldFrenchRomantic
90115881956.07.5131010.001906412478336120.13520382230354877OldEnglishAdventure
91251501962.04.39940.1166377816291162440.00599776354579610OldItalianAdventure
921276282016.06.01920.00268630849220120[null][null]RecentEnglishOthers
93274392004.04.54900.003552859618718520.25332926705296365RecentItalianDrama
94420622007.04.27900.0005199306759140.002439768222024RecentEnglishComedy
95229651946.06.01980.040.11558401951814659OldEnglishAction
96148181985.04.01950.0005199306759180.00396462336078112OldEnglishAction
97325942005.05.681120.011958405545927400.10043712513977916RecentEnglishComedy
9829291952.06.05830.001819757365685360.01453695232286312OldEnglishAdventure
99338791930.06.011040.040.0696350513367936OldEnglishOthers
100385152007.02.01850.040.04RecentFrenchComedy
Rows: 1-100 of 53397 | Columns: 15

Let's look at the missing values.

In [25]:
filmtv_movies_complete.count_percent()
Out[25]:
count
percent
"filmtv_id"53397.0100.0
"avg_vote"53397.0100.0
"votes"53397.0100.0
"duration"53397.0100.0
"period"53397.0100.0
"language_area"53397.0100.0
"Category"53397.0100.0
"title"53395.099.996
"year"53387.099.981
"country"53346.099.904
"director"53335.099.884
"notoriety_director"53335.099.884
"castings_director"53335.099.884
"notoriety_actors"50372.094.335
"castings_actors"50372.094.335
Rows: 1-15 | Columns: 3

Let's impute the missing values for 'notoriety_actors' and 'castings_actors' using different techniques. We can then drop the few remaining missing values.

In [26]:
filmtv_movies_complete["notoriety_actors"].fillna(method = "median",
                                                  by = ["director",
                                                        "Category"])
filmtv_movies_complete["castings_actors"].fillna(method = "median",
                                                 by = ["director",
                                                       "Category"])
filmtv_movies_complete.dropna()
640 elements were filled.
640 elements were filled.
2499 elements were filtered.
Out[26]:
123
filmtv_id
Integer
Abc
Varchar(486)
123
year
Numeric(8,2)
Abc
country
Varchar(208)
123
avg_vote
Numeric(6,2)
123
votes
Integer
123
duration
Integer
Abc
Varchar(1066)
123
notoriety_director
Numeric(34,15)
123
castings_director
Integer
123
notoriety_actors
Float
123
castings_actors
Float
Abc
period
Varchar(6)
Abc
language_area
Varchar(19)
Abc
Category
Varchar(9)
1806842015.0United States4.05860.00034662045060740.01616346447087521.0RecentEnglishAction
2562922012.0India8.011508.6655112652e-0580.0001016570092515.0RecentIndianAction
3637382008.0India8.011838.6655112652e-0580.0023381112127689.0RecentIndianAction
4490212011.0United States3.9391080.00329289428076340.88502592253735895.0RecentEnglishAction
5171961985.0United States4.01900.00155979202773160.19650299888177355.0OldEnglishAction
616441988.0United States4.971000.008838821490468440.01657009250787829.0OldEnglishAction
7234701995.0United States5.27860.008838821490468440.04320422893158647.090sEnglishAction
8771901997.0United States5.52970.008838821490468440.04838873640337668.090sEnglishAction
9113871993.0United States3.7211000.008838821490468440.05052353359764360.090sEnglishAction
10782432001.0United States5.52900.008838821490468440.05103181864389675.0RecentEnglishAction
11475662005.0United States6.041200.008838821490468440.05103181864389675.0RecentEnglishAction
12159241996.0United States5.37930.008838821490468440.0593676934024666.090sEnglishAction
1348821991.0United States5.112950.008838821490468440.08020738029887274.090sEnglishAction
14513831992.0United States6.021010.008838821490468440.09708244383450394.090sEnglishAction
15103451990.0United States4.6191100.008838821490468440.10562163261156970.090sEnglishAction
16607232008.0India6.351490.00034662045060740.00498119345328913.0RecentIndianAction
1748561989.0United States5.521860.1378682842287691040.427162752871811117.0OldEnglishAction
18504052010.0India6.371260.0005199306759140.00365965233302910.0RecentIndianAction
191507051957.0United States5.01880.040.07797092609535461.0OldEnglishAction
2074201969.0United States7.145960.007538994800693120.4477991257497266.0OldEnglishAction
21206541997.0Germany6.01970.040.00355799532377816.090sGerman_North_EuropeAction
22435592010.0United States5.86890.016117850953206240.04960862051438535.0RecentEnglishAction
23511402012.0United States8.01810.00077989601386580.04.0RecentEnglishAction
24391522008.0United States4.810900.004766031195841120.18237267459591352.0RecentEnglishAction
251377762017.0Great Britain5.119910.0015597920277340.19406323065975437.0RecentEnglishAction
261390762015.0United States4.01870.040.0078275897123122.0RecentEnglishAction
27501082011.0Canada5.87910.0005199306759140.03171698688624534.0RecentFrenchAction
2879551991.0United States6.011000.040.0004066280370045.090sEnglishAction
29220121968.0Italy4.08940.00086655112651680.03263189996950433.0OldItalianAction
301229832017.0Spain6.01900.040.00955575886957413.0RecentSpanish_PortugueseAction
3111381990.0United States4.01950.040.0004066280370034.090sEnglishAction
321657892019.0United States, Mexico, Spain5.9221000.01048526863084980.64420046762224368.0RecentSpanish_PortugueseAction
33451912011.0United States6.3100950.01048526863084980.73416692080918978.0RecentEnglishAction
34508272006.0Hungary, Great Britain2.791090.00069324090121340.10785808681508637.0RecentEnglishAction
35700332008.0Great Britain7.331230.00242634315424680.0040662803700328.0RecentEnglishAction
36449012011.0Great Britain3.826900.00242634315424680.05448815695842229.0RecentEnglishAction
37218331970.0Italy4.841000.00034662045060780.15726339331096994.0OldItalianAction
3876761964.0Italy, Spain3.55800.00043327556325880.0018298261665145.0OldSpanish_PortugueseAction
39679752012.0Kazakhstan8.011380.00017331022530380.04.0RecentArabic_Middle_EstAction
40574421965.0Japan6.01840.040.00599776354579722.0OldChinese_Japan_AsianAction
41310761945.0Japan6.917830.1999133448873481200.02307614109992911.0OldChinese_Japan_AsianAction
42310731943.0Japan7.529800.1999133448873481200.02643082240520512.0OldChinese_Japan_AsianAction
43289801975.0United States6.02800.000866551126516120.0004066280370044.0OldEnglishAction
44385491969.0United States2.66860.000866551126516120.02439768222018928.0OldEnglishAction
451243581973.0Spain, Italy4.52840.00043327556325880.03517332520077323.0OldSpanish_PortugueseAction
46218501967.0Spain, France, Italy4.24870.00043327556325880.04838873640337664.0OldSpanish_PortugueseAction
4774241969.0Italy4.53990.002426343154246200.00864084578631720.0OldItalianAction
4830951967.0Italy5.4121050.002426343154246200.03354681305276148.0OldItalianAction
49199481968.0Italy3.89820.002426343154246200.04716885229236653.0OldItalianAction
5073171954.0United States7.83770.00017331022530340.0903730812239551.0OldEnglishAction
511446852017.0United States4.77870.0005199306759140.58808579851580880.0RecentEnglishAction
52611292003.0United States6.52900.00069324090121380.00467622242553613.0RecentEnglishAction
5378731981.0United States4.461180.072270363951473640.26908610348683575.0OldEnglishAction
546041978.0United States6.9361200.072270363951473640.411812544474941102.0OldEnglishAction
55360141950.0United States5.01810.040.01768831960963731.0OldEnglishAction
561410842017.0Hong Kong, China7.041170.00025996533795540.0019314831757657.0RecentChinese_Japan_AsianAction
57144681996.0United States4.53900.00017331022530340.04554234014435424.090sEnglishAction
58178451991.0United States4.01900.000259965337955120.0002033140185024.090sEnglishAction
59201101995.0United States2.01880.003812824956672440.00213479719426710.090sEnglishAction
60355521969.0United States6.411940.003812824956672440.05723289620819349.0OldEnglishAction
61225261997.0United States, Luxembourg5.02908.6655112652e-0540.07512452983633261.090sEnglishAction
62682672013.0China, Hong Kong6.4191180.0015597920277340.12524143539697124.0RecentChinese_Japan_AsianAction
63263961967.0Italy, Spain4.01820.040.07126156348480344.0OldSpanish_PortugueseAction
64160291957.0United States5.02850.000866551126516120.00447290840703412.0OldEnglishAction
65130221958.0United States6.15800.00043327556325880.03080207380298843.0OldEnglishAction
66292501957.0United States6.01720.00043327556325880.20229744840906826.0OldEnglishAction
6739961967.0Italy4.181030.003899480069324160.02287282708142821.0OldItalianAction
68139881970.0Italy4.36960.003899480069324160.04798210836637246.0OldItalianAction
69140801966.0Italy5.7111000.003899480069324160.08945816814069384.0OldItalianAction
7043391966.0Italy6.4211050.003899480069324160.13256074006302888.0OldItalianAction
71279951965.0Italy, Germany, France4.52918.6655112652e-0540.01585849344312429.0OldFrenchAction
721426012018.0United States6.026960.00216637781629140.04411914201484212.0RecentEnglishAction
73273651995.0United States6.031150.02192374350086780.13337399613703430.090sEnglishAction
74407622010.0United States5.72511170.02192374350086781.0023381112127798.0RecentEnglishAction
75123991993.0United States6.54880.001213171577123160.00620107756429818.090sEnglishAction
76194521964.0Spain4.33820.0001