Titanic

This example uses the 'titanic' dataset to predict the survival of passengers on the Titanic. You can download the Jupyter Notebook of the study here.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Use the following command to allow Matplotlib to display graphics.

In [3]:
%matplotlib inline

Let's load the dataset.

In [4]:
from verticapy.datasets import load_titanic
titanic = load_titanic()
display(titanic)
[Output: the first 100 rows of the titanic vDataFrame. Columns: pclass (Int), survived (Int), name (Varchar(164)), sex (Varchar(20)), age (Numeric(6,3)), sibsp (Int), parch (Int), ticket (Varchar(36)), fare (Numeric(10,5)), cabin (Varchar(30)), embarked (Varchar(20)), boat (Varchar(100)), body (Int), home.dest (Varchar(100)).]
Rows: 1-100 | Columns: 14

Data Exploration and Preparation

Let's explore the data by displaying descriptive statistics of all the columns.

In [5]:
titanic.describe(method = "categorical", unique = True)
Out[5]:
             dtype          count  top               top_percent  unique
"pclass"     int            1234   3                 53.728       3.0
"survived"   int            1234   0                 63.533       2.0
"name"       varchar(164)   1234   Kelly, Mr. James  0.162        1232.0
"sex"        varchar(20)    1234   male              65.964       2.0
"age"        numeric(6,3)   997    [null]            19.206       96.0
"sibsp"      int            1234   0                 67.747       7.0
"parch"      int            1234   0                 76.904       8.0
"ticket"     varchar(36)    1234   CA. 2343          0.81         887.0
"fare"       numeric(10,5)  1233   8.05              4.7          277.0
"cabin"      varchar(30)    286    [null]            76.823       182.0
"embarked"   varchar(20)    1232   S                 70.746       3.0
"boat"       varchar(100)   439    [null]            64.425       26.0
"body"       int            118    [null]            90.438       118.0
"home.dest"  varchar(100)   706    [null]            42.788       359.0
Rows: 1-14 | Columns: 6

The columns "body" (body identification number), "home.dest" (passenger origin/destination), "embarked" (port of embarkation), and "ticket" (ticket ID) shouldn't influence survival, so we can ignore these.

Let's focus our analysis on the columns "name" and "cabin." We'll begin with the passengers' names.
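CountVectorizer builds a vocabulary table from a text column: each token together with its document frequency (df), count (cnt), and rank (rnk). As a rough plain-Python sketch of those statistics (illustrative only, on a made-up three-name corpus; this is not VerticaPy's in-database implementation):

```python
from collections import Counter

# Hypothetical miniature corpus standing in for the "name" column.
names = ["Sage, Mr. John", "Sage, Mr. George", "Sage, Miss. Stella"]

# Tokenize: lowercase words, stripping punctuation.
tokens = [t.strip(".,").lower() for name in names for t in name.split()]
counts = Counter(tokens)
total = sum(counts.values())

# df = token count / total tokens; rank by descending count.
vocabulary = sorted(counts.items(), key=lambda kv: -kv[1])
table = [(tok, cnt / total, cnt) for tok, cnt in vocabulary]
print(table[0])  # most frequent token: ('sage', 0.333..., 3)
```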

In [6]:
from verticapy.learn.preprocessing import CountVectorizer
model = CountVectorizer("name_voc")
model.fit(titanic, ["Name"]).transform()
Out[6]:
[Output: the "name_voc" vocabulary table with columns token (Varchar(128)), df (Numeric(36,18)), cnt (Integer), rnk (Integer). The most frequent tokens are titles and common first names: mr (rnk 1), miss (2), mrs (3), william (4), john (5), master (6), ...]
Rows: 1-100 | Columns: 4

Passengers' titles might come in handy. We can extract these from their names.
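The extraction itself happens later in this example with str_extract; the pattern can be sketched in plain Python using the same regular expression:

```python
import re

def extract_title(name):
    # A title is the word ending in "." that follows the surname,
    # e.g. " Mr.", " Miss." (same pattern as the str_extract call
    # used later in this example).
    m = re.search(r" ([A-Za-z]+)\.", name)
    return m.group(0) if m else None

print(extract_title("Allison, Miss. Helen Loraine"))  # " Miss."
```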

Let's move on to the cabins.

In [7]:
model = CountVectorizer("cabin_voc")
model.fit("titanic", ["cabin"]).transform()
Out[7]:
[Output: the "cabin_voc" vocabulary table with columns token (Varchar(128)), df (Numeric(36,18)), cnt (Integer), rnk (Integer). The [null] token dominates (948 occurrences, df close to 0.734); the remaining tokens are individual cabin IDs such as f, c27, c25, c23, g6, ...]
Rows: 1-100 | Columns: 4

Here, we have the cabin IDs, the letter of which represents a certain position on the boat. Let's see how often each cabin occurs in the dataset.
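The next cell keeps only the first character of each token (the deck letter) with str_slice(1, 1) and sums the counts per letter. The same aggregation, sketched in plain Python on hypothetical (token, cnt) pairs:

```python
from collections import Counter

# Hypothetical (token, cnt) pairs like those in the "cabin_voc" table.
cabin_counts = [("c22", 4), ("c26", 4), ("b58", 2), ("e46", 3), (None, 948)]

decks = Counter()
for token, cnt in cabin_counts:
    deck = token[0] if token else None  # first character = deck letter
    decks[deck] += cnt
print(dict(decks))  # {'c': 8, 'b': 2, 'e': 3, None: 948}
```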

In [8]:
CountVectorizer("cabin_voc").fit("titanic", ["cabin"]).transform(
                )["token"].str_slice(1, 1).groupby(
                columns = ["token"], expr = ["SUM(cnt)"]).head(30)
Out[8]:
   token   SUM
1  [null]  948
2  a       20
3  b       92
4  c       113
5  d       47
6  e       43
7  f       19
8  g       9
9  t       1
Rows: 1-9 | Columns: 2

While NULL values for "boat" clearly represent passengers who didn't make it onto a lifeboat, we can't be so sure about NULL values for "cabin". We can guess that these might represent passengers without a cabin. If this is the case, then these values are missing not at random (MNAR).

We'll revisit this problem later. For now, let's drop the columns that don't affect survival and then encode the rest.

In [9]:
titanic.drop(["body", "home.dest", "embarked", "ticket"])
Out[9]:
[Output: the vDataFrame after the drop, with the remaining columns pclass, survived, name, sex, age, sibsp, parch, fare, cabin, boat.]
Rows: 1-100 of 1234 | Columns: 10
In [10]:
titanic["cabin"].str_slice(1, 1)["name"].str_extract(
        ' ([A-Za-z]+)\.')["boat"].fillna(
        method = "0ifnull")["cabin"].fillna("No Cabin")
795 elements were filled.
948 elements were filled.
Out[10]:
[Output: names reduced to titles (" Miss.", " Mr.", " Mrs.", " Dr.", ...), cabin reduced to its deck letter or "No Cabin", and boat converted to a boolean (1 means the passenger had a lifeboat).]
Rows: 1-100 of 1234 | Columns: 10

Looking at our data now, we can see that some first-class passengers have a NULL value for "cabin", so our earlier assumption (that a NULL value means no cabin) turns out to be incorrect. The "cabin" column simply has far too many values missing at random (MAR), so we'll have to drop it.

In [11]:
titanic["cabin"].drop()
Out[11]:
[Output: the vDataFrame without the "cabin" column: pclass, survived, name, sex, age, sibsp, parch, fare, boat.]
Rows: 1-100 of 1234 | Columns: 9

Let's look at descriptive statistics of the entire vDataFrame.

In [12]:
titanic.describe(method = "all")
Out[12]:
[Output: summary statistics (dtype, percent, count, top, top_percent, avg, stddev, min, approx_25%, approx_50%, approx_75%, max, range, empty) for the nine remaining columns. Highlights: fare max = 512.3292 versus approx_75% = 31.3875; pclass median = 3; average age of about 30.15; average survived of about 0.365.]
Rows: 1-14 | Columns: 10

Descriptive statistics can give us valuable insights into our data. Notice, for example, that the column "fare" has many outliers (the maximum of 512.33 is much greater than the 9th decile of 79.13). Most passengers traveled in 3rd class (median pclass = 3).

The "sibsp" column represents the number of siblings and spouses aboard for each passenger, while the "parch" column represents the number of parents and children aboard. We can combine these to create a new feature: "family_size".

In [13]:
titanic["family_size"] = titanic["parch"] + titanic["sibsp"] + 1

Let's move on to outliers. We have several tools for locating outliers (Local Outlier Factor, DBSCAN, k-means, and so on), but we'll just use winsorization in this example. Again, "fare" has many outliers, so we'll start there.
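Winsorization replaces extreme values with chosen quantiles instead of dropping them: with alpha = 0.03, anything below the 3rd percentile is raised to it and anything above the 97th is lowered to it. A minimal plain-Python sketch of the idea (using a simple nearest-rank quantile; VerticaPy computes the quantiles in-database):

```python
def winsorize(values, alpha=0.03):
    # Clip each value into [q_alpha, q_(1 - alpha)] using a simple
    # index-based quantile on the sorted data.
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(alpha * (n - 1))]
    hi = ordered[int((1 - alpha) * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

fares = [0.0, 7.9, 8.1, 14.5, 26.0, 31.4, 79.2, 512.3]
# A large alpha here just to make the clipping visible on 8 values.
print(winsorize(fares, alpha=0.25))
# [7.9, 7.9, 8.1, 14.5, 26.0, 31.4, 31.4, 31.4]
```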

In [14]:
titanic["fare"].fill_outliers(method = "winsorize", 
                              alpha = 0.03)
Out[14]:
[Output: the vDataFrame with "fare" winsorized (outlier fares are capped at about 244.548) and the new "family_size" column.]
Rows: 1-100 | Columns: 10

Let's encode the column "sex" so we can use it with numerical methods.
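Label encoding replaces each category with an integer code; with two categories this gives "female" mapped to 0 and "male" mapped to 1 (assuming categories are ordered alphabetically, as the encoded output suggests). A plain-Python sketch:

```python
def label_encode(column):
    # Assign integer codes by sorted category order,
    # e.g. "female" -> 0, "male" -> 1.
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column], mapping

codes, mapping = label_encode(["female", "male", "male", "female"])
print(codes, mapping)  # [0, 1, 1, 0] {'female': 0, 'male': 1}
```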

In [15]:
titanic["sex"].label_encode()
Out[15]:
[Output: the vDataFrame with "sex" label-encoded (female = 0, male = 1).]
Rows: 1-100 of 1234 | Columns: 10

The column "age" has many missing values, and since most machine learning algorithms can't handle missing values, we need to impute our data. Let's fill the missing values with the average "age" of passengers of the same "pclass" and "sex".

In [16]:
titanic["age"].fillna(method = "mean", by = ["pclass", "sex"])
237 elements were filled.
Out[16]:
pclass (Int) | survived (Int) | name (Varchar(164)) | sex (Int) | age (Float) | sibsp (Int) | parch (Int) | fare (Numeric(18,13)) | boat (Bool) | family_size (Integer)
[table preview omitted: missing ages are now filled with the group mean, e.g. about 37.26 for first-class women]
Rows: 1-100 of 1234 | Columns: 10
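The imputation rule above is group-wise mean filling. Here's the same idea in a small pure-Python sketch (toy data, not the real table), to make the mechanics explicit:

```python
# Group-wise mean imputation: fill each missing "age" with the mean age
# of the rows sharing the same (pclass, sex) group.
from collections import defaultdict

rows = [
    {"pclass": 1, "sex": 0, "age": 30.0},
    {"pclass": 1, "sex": 0, "age": None},
    {"pclass": 1, "sex": 0, "age": 40.0},
    {"pclass": 3, "sex": 1, "age": None},
    {"pclass": 3, "sex": 1, "age": 20.0},
]

# Compute the mean age per (pclass, sex) group, ignoring missing values.
sums = defaultdict(lambda: [0.0, 0])
for r in rows:
    if r["age"] is not None:
        key = (r["pclass"], r["sex"])
        sums[key][0] += r["age"]
        sums[key][1] += 1
means = {k: s / n for k, (s, n) in sums.items()}

# Fill the gaps with the group means.
for r in rows:
    if r["age"] is None:
        r["age"] = means[(r["pclass"], r["sex"])]
# rows[1]["age"] -> 35.0, rows[3]["age"] -> 20.0
```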

Let's draw the correlation matrix to see the links between variables.

In [17]:
titanic.corr(method = "spearman")
Out[17]:
             pclass survived    sex    age  sibsp  parch   fare   boat family_size
pclass        1.000   -0.336  0.151 -0.447 -0.069 -0.043 -0.722 -0.344      -0.109
survived     -0.336    1.000 -0.529 -0.003  0.093  0.172  0.322  0.949       0.186
sex           0.151   -0.529  1.000  0.066 -0.185 -0.247 -0.264 -0.484      -0.285
age          -0.447   -0.003  0.066  1.000 -0.116 -0.175  0.229  0.004      -0.133
sibsp        -0.069    0.093 -0.185 -0.116  1.000  0.437  0.444  0.101       0.855
parch        -0.043    0.172 -0.247 -0.175  0.437  1.000  0.402  0.175       0.776
fare         -0.722    0.322 -0.264  0.229  0.444  0.402  1.000  0.335       0.527
boat         -0.344    0.949 -0.484  0.004  0.101  0.175  0.335  1.000       0.190
family_size  -0.109    0.186 -0.285 -0.133  0.855  0.776  0.527  0.190       1.000
(values rounded to three decimals)
Rows: 1-9 | Columns: 10

Fare correlates with family size, which is about what you would expect: a larger family means more tickets, and more tickets mean a higher fare.
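As an aside, the Spearman correlation used above is just the Pearson correlation computed on ranks, which is why it captures any monotonic relationship, not only linear ones. A small self-contained sketch (no tie handling, unlike the real implementation):

```python
# Spearman correlation = Pearson correlation of the ranks.
# Simplified sketch: ties are not averaged, unlike a production implementation.
def rank(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1.0
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(rank(x), rank(y))

# Perfectly monotonic but non-linear: Spearman is exactly 1.
spearman([1, 2, 3, 4], [1, 4, 9, 16])  # -> 1.0
```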

Survival correlates strongly with whether or not a passenger has a lifeboat (the "boat" variable). Still, to increase the generality of our model, we should avoid predictions based on just one variable. Let's split the study into two use cases:

  • Passengers with a lifeboat
  • Passengers without a lifeboat

Before we move on: we did a lot of work to clean up this data, but we haven't saved anything to our Vertica database! Let's look at the modifications we've made to the vDataFrame.

In [18]:
print(titanic.current_relation())
(
   SELECT
     "pclass",
     "survived",
     "name",
     "sex",
     COALESCE("age", AVG("age") OVER (PARTITION BY "pclass", "sex")) AS "age",
     "sibsp",
     "parch",
     "fare",
     "boat",
     "family_size" 
   FROM
 (
   SELECT
     "pclass",
     "survived",
     REGEXP_SUBSTR("name", ' ([A-Za-z]+)\.') AS "name",
     DECODE("sex", 'female', 0, 'male', 1, 2) AS "sex",
     "age",
     "sibsp",
     "parch",
     (CASE WHEN "fare" < -176.6204982585513 THEN -176.6204982585513 WHEN "fare" > 244.5480856064831 THEN 244.5480856064831 ELSE "fare" END) AS "fare",
     DECODE("boat", NULL, 0, 1) AS "boat",
     (("parch") + ("sibsp")) + (1) AS "family_size" 
   FROM
 (
   SELECT
     "pclass",
     "survived",
     "name",
     "sex",
     "age",
     "sibsp",
     "parch",
     "fare",
     "boat" 
   FROM
 "public"."titanic") 
VERTICAPY_SUBTABLE) 
VERTICAPY_SUBTABLE) 
VERTICAPY_SUBTABLE
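The nested VERTICAPY_SUBTABLE queries come from lazy SQL composition: each transformation wraps the previous relation in a new subquery instead of executing anything. A toy illustration of the pattern (hypothetical class, not VerticaPy's actual internals):

```python
# Toy illustration of lazy SQL composition: each step nests the previous
# relation as a subquery; nothing runs until the final string is executed.
class LazyRelation:
    def __init__(self, relation):
        self.relation = relation

    def select(self, *exprs):
        cols = ", ".join(exprs)
        return LazyRelation(f"(SELECT {cols} FROM {self.relation}) subtable")

rel = LazyRelation('"public"."titanic"')
rel = rel.select('"pclass"', '"sex"', '"age"')
rel = rel.select('"pclass"', 'COALESCE("age", 30.0) AS "age"')
# rel.relation now holds two nested SELECTs over "public"."titanic"
```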

Let's see what happens when we aggregate with SQL code generation turned on.

In [19]:
vp.set_option("sql_on", True)
titanic.avg()

Computing the different aggregations.

  SELECT
    2.28444084278768,
    0.364667747163695,
    AVG("sex"),
    AVG("age"),
    0.504051863857374,
    0.378444084278768,
    AVG("fare"),
    0.355753646677472,
    AVG("family_size")  
  FROM
(
  SELECT
    "pclass",
    "survived",
    "name",
    "sex",
    COALESCE("age", AVG("age") OVER (PARTITION BY "pclass", "sex")) AS "age",
    "sibsp",
    "parch",
    "fare",
    "boat",
    "family_size"  
  FROM
(
  SELECT
    "pclass",
    "survived",
    REGEXP_SUBSTR("name", ' ([A-Za-z]+)\.') AS "name",
    DECODE("sex", 'female', 0, 'male', 1, 2) AS "sex",
    "age",
    "sibsp",
    "parch",
    (CASE WHEN "fare" < -176.6204982585513 THEN -176.6204982585513 WHEN "fare" > 244.5480856064831 THEN 244.5480856064831 ELSE "fare" END) AS "fare",
    DECODE("boat", NULL, 0, 1) AS "boat",
    (("parch") + ("sibsp")) + (1) AS "family_size"  
  FROM
(
  SELECT
    "pclass",
    "survived",
    "name",
    "sex",
    "age",
    "sibsp",
    "parch",
    "fare",
    "boat"  
  FROM
"public"."titanic")  
VERTICAPY_SUBTABLE)  
VERTICAPY_SUBTABLE)  
VERTICAPY_SUBTABLE LIMIT 1
Out[19]:
avg
"pclass"2.28444084278768
"survived"0.364667747163695
"sex"0.659643435980551
"age"29.717623352014
"sibsp"0.504051863857374
"parch"0.378444084278768
"fare"32.9113074018842
"boat"0.355753646677472
"family_size"1.88249594813614
Rows: 1-9 | Columns: 2

VerticaPy dynamically generates SQL code whenever you make modifications to your data. To avoid recomputation, it also stores previous aggregations. If we filter anything in our data, it will update the catalog with our modifications.
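The catalog of precomputed aggregations behaves like a memo cache keyed by (column, aggregate) that is invalidated when the data changes. A rough sketch of the mechanism (hypothetical names, not VerticaPy internals):

```python
# Rough sketch of an aggregate catalog: computed values are cached per
# (column, function) and dropped whenever the underlying relation changes.
class AggregateCatalog:
    def __init__(self):
        self._cache = {}
        self.computations = 0  # counts real (expensive) computations

    def aggregate(self, column, func, compute):
        key = (column, func)
        if key not in self._cache:
            self.computations += 1
            self._cache[key] = compute()  # e.g. run SELECT AVG(col) ...
        return self._cache[key]

    def invalidate(self):
        # called after a filter/transform modifies the relation
        self._cache.clear()

catalog = AggregateCatalog()
catalog.aggregate("age", "avg", lambda: 29.7)
catalog.aggregate("age", "avg", lambda: 29.7)  # served from cache
# catalog.computations -> 1
```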

In [20]:
vp.set_option("sql_on", False)
print(titanic.info())
The vDataFrame was modified many times: 
 * {Tue May  3 10:20:59 2022} [Drop]: vColumn "body" was deleted from the vDataFrame.
 * {Tue May  3 10:20:59 2022} [Drop]: vColumn "home.dest" was deleted from the vDataFrame.
 * {Tue May  3 10:20:59 2022} [Drop]: vColumn "embarked" was deleted from the vDataFrame.
 * {Tue May  3 10:20:59 2022} [Drop]: vColumn "ticket" was deleted from the vDataFrame.
 * {Tue May  3 10:21:01 2022} [Apply]: The vColumn 'cabin' was transformed with the func 'x -> SUBSTR(x, 1, 1)'.
 * {Tue May  3 10:21:01 2022} [Apply]: The vColumn 'name' was transformed with the func 'x -> REGEXP_SUBSTR(x, ' ([A-Za-z]+)\.')'.
 * {Tue May  3 10:21:01 2022} [Fillna]: 795 "boat" missing values were  filled.
 * {Tue May  3 10:21:02 2022} [Fillna]: 948 "cabin" missing values were  filled.
 * {Tue May  3 10:21:04 2022} [Drop]: vColumn "cabin" was deleted from the vDataFrame.
 * {Tue May  3 10:21:09 2022} [Eval]: A new vColumn "family_size" was added to the vDataFrame.
 * {Tue May  3 10:21:12 2022} [Apply]: The vColumn 'fare' was transformed with the func 'x -> (CASE WHEN x < -176.6204982585513 THEN -176.6204982585513 WHEN x > 244.5480856064831 THEN 244.5480856064831 ELSE x END)'.
 * {Tue May  3 10:21:14 2022} [Label Encoding]: Label Encoding was applied to the vColumn "sex" using the following mapping:
	female => 0	male => 1
 * {Tue May  3 10:21:16 2022} [Fillna]: 237 "age" missing values were  filled.

Let's move on to modeling our data. Save the vDataFrame to your Vertica database.

In [37]:
from verticapy import drop
drop("titanic_boat", method = "view")
titanic_boat = titanic.search(titanic["boat"] == 1).to_db("titanic_boat", relation_type = "view")
drop("titanic_no_boat", method = "view")
titanic_no_boat = titanic.search(titanic["boat"] == 0).to_db("titanic_no_boat", relation_type = "view")

Machine Learning

Passengers with a lifeboat

First, let's look at the number of survivors.

In [24]:
titanic_boat["survived"].describe()
Out[24]:
        value
name    "survived"
dtype   integer
unique  2.0
count   439.0
1       430
0       9
Rows: 1-6 | Columns: 2

We have nine deaths. Let's try to understand why these passengers died.

In [40]:
titanic_boat.search(titanic_boat["survived"] == 0).head(10)
Out[40]:
pclass | survived | name | sex | age | sibsp | parch | fare | boat | family_size
[table preview omitted: the nine non-survivors; six of the nine traveled third class, and eight of the nine are men]
Rows: 1-9 | Columns: 10

Aside from many of these passengers traveling third class, there don't seem to be any clear predictors of death here; a model built on just these nine passengers would be unhelpful.

Passengers without a lifeboat

Let's move on to passengers without a lifeboat.

In [38]:
titanic_no_boat["survived"].describe()
Out[38]:
        value
name    "survived"
dtype   integer
unique  2.0
count   795.0
0       775
1       20
Rows: 1-6 | Columns: 2

Only 20 survived. Let's find out why.

In [39]:
titanic_no_boat.search(titanic_no_boat["survived"] == 1).head(20)
Out[39]:
pclass | survived | name | sex | age | sibsp | parch | fare | boat | family_size
[table preview omitted: the 20 survivors; 18 of the 20 are women]
Rows: 1-20 | Columns: 10

Most survivors seem to be women. Let's build a model with this in mind.

One of our predictors is categorical: the passenger's title. Some of our predictors are correlated, so it's best to use a non-linear classifier that handles correlated features well. A random forest classifier seems perfect for this. Let's evaluate it with cross-validation.

In [41]:
from verticapy.learn.ensemble import RandomForestClassifier
from verticapy.learn.model_selection import cross_validate

predictors = titanic.get_columns(exclude_columns = ['survived'])
response = "survived"
model = RandomForestClassifier("rf_titanic", 
                               n_estimators = 40, 
                               max_depth = 4)
cross_validate(model, titanic_no_boat, predictors, response)

Out[41]:
(columns: auc, prc_auc, accuracy, log_loss, precision, recall, f1_score, mcc, informedness, markedness, csi, time)
1-fold  auc ≈ 0.953   accuracy ≈ 0.966
2-fold  auc ≈ 0.929   accuracy ≈ 0.974
3-fold  auc ≈ 0.977   accuracy ≈ 0.985
avg     auc ≈ 0.953   accuracy ≈ 0.975
(other metrics omitted)
Rows: 1-5 | Columns: 13

This dataset is quite unbalanced, so accuracy alone would be misleading; we'll use AUC to evaluate the model instead. With an average AUC above 0.95, the model performs well.
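AUC has a handy probabilistic reading: it is the chance that a randomly chosen positive example is scored above a randomly chosen negative one, which is why it stays informative on unbalanced data. A small self-contained sketch of that definition:

```python
# AUC = P(score of a random positive > score of a random negative),
# with ties counted as half. This makes it insensitive to class imbalance.
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One positive is ranked below one negative -> 3 of 4 pairs are correct.
auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # -> 0.75
```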

We can now build a model with the entire dataset.

In [42]:
model.fit(titanic_no_boat, predictors, response)
Out[42]:

===========
call_string
===========
SELECT rf_classifier('public.rf_titanic', '"public"."_verticapy_tmp_view_dbadmin_43247_5454950161_"', '"survived"', '"pclass", "name", "sex", "age", "sibsp", "parch", "fare", "boat", "family_size"' USING PARAMETERS exclude_columns='', ntree=40, mtry=4, sampling_size=0.632, max_depth=4, max_breadth=1000000000, min_leaf_size=1, min_info_gain=0, nbins=32);

=======
details
=======
 predictor |      type      
-----------+----------------
  pclass   |      int       
   name    |char or varchar 
    sex    |      int       
    age    |float or numeric
   sibsp   |      int       
   parch   |      int       
   fare    |float or numeric
   boat    |      int       
family_size|      int       


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    | 40  
rejected_row_count|  1  
accepted_row_count| 794 

Let's look at the importance of each feature.

In [43]:
model.features_importance()
Out[43]:
             importance  sign
name              26.04     1
age               13.79     1
sex               13.64     1
fare              12.89     1
pclass             9.73     1
parch              9.48     1
sibsp              7.47     1
family_size        6.95     1
boat               0.00     0
Rows: 1-9 | Columns: 3

As expected, a passenger's title and sex are the most important predictors of survival.

Conclusion

We've solved our problem in a Pandas-like way, all without ever loading data into memory!


VerticaPy

Python API for Vertica Data Science at Scale

About the Author


Badr Ouali
Head of Data Science


Badr Ouali works as a Lead Data Scientist for Vertica worldwide. He can take data projects end to end, combining a clear understanding of the "big picture" with attention to detail to achieve great business outcomes, a distinctive differentiator in his role. Badr enjoys sharing knowledge and insights related to data analytics with colleagues and peers, and has a sweet spot for Python. He loves helping customers find the best value in their data and empowering them to solve their use cases.