Titanic

This example uses the 'titanic' dataset to predict the survival of passengers on the Titanic. You can download the Jupyter Notebook of the study here.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Use the following command to allow Matplotlib to display graphics.

In [3]:
%matplotlib inline

Let's load the dataset.

In [4]:
from verticapy.datasets import load_titanic
titanic = load_titanic()
display(titanic)
123
pclass
Int
123
survived
Int
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(6,3)
123
sibsp
Int
123
parch
Int
Abc
ticket
Varchar(36)
123
fare
Numeric(10,5)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Int
Abc
home.dest
Varchar(100)
110female2.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
210male30.012113781151.55C22 C26S[null]135Montreal, PQ / Chesterville, ON
310female25.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
410male39.0001120500.0A36S[null][null]Belfast, NI
510male71.000PC 1760949.5042[null]C[null]22Montevideo, Uruguay
610male47.010PC 17757227.525C62 C64C[null]124New York, NY
710male[null]00PC 1731825.925[null]S[null][null]New York, NY
810male24.001PC 17558247.5208B58 B60C[null][null]Montreal, PQ
910male36.0001305075.2417C6CA[null]Winnipeg, MN
1010male25.0001390526.0[null]C[null]148San Francisco, CA
1110male45.00011378435.5TS[null][null]Trenton, NJ
1210male42.00011048926.55D22S[null][null]London / Winnipeg, MB
1310male41.00011305430.5A21S[null][null]Pomeroy, WA
1410male48.000PC 1759150.4958B10C[null]208Omaha, NE
1510male[null]0011237939.6[null]C[null][null]Philadelphia, PA
1610male45.00011305026.55B38S[null][null]Washington, DC
1710male[null]0011379831.0[null]S[null][null][null]
1810male33.0006955.0B51 B53 B55S[null][null]New York, NY
1910male28.00011305947.1[null]S[null][null]Montevideo, Uruguay
2010male17.00011305947.1[null]S[null][null]Montevideo, Uruguay
2110male49.0001992426.0[null]S[null][null]Ascot, Berkshire / Rochester, NY
2210male36.0101987778.85C46S[null]172Little Onn Hall, Staffs
2310male46.010W.E.P. 573461.175E31S[null][null]Amenia, ND
2410male[null]001120510.0[null]S[null][null]Liverpool, England / Belfast
2510male27.01013508136.7792C89C[null][null]Los Angeles, CA
2610male[null]0011046552.0A14S[null][null]Stoughton, MA
2710male47.000572725.5875E58S[null][null]Victoria, BC
2810male37.011PC 1775683.1583E52C[null][null]Lakewood, NJ
2910male[null]0011379126.55[null]S[null][null]Roachdale, IN
3010male70.011WE/P 573571.0B22S[null]269Milwaukee, WI
3110male39.010PC 1759971.2833C85C[null][null]New York, NY
3210male31.010F.C. 1275052.0B71S[null][null]Montreal, PQ
3310male50.010PC 17761106.425C86C[null]62Deephaven, MN / Cedar Rapids, IA
3410male39.000PC 1758029.7A18C[null]133Philadelphia, PA
3510female36.000PC 1753131.6792A29C[null][null]New York, NY
3610male[null]00PC 17483221.7792C95S[null][null][null]
3710male30.00011305127.75C111C[null][null]New York, NY
3810male19.03219950263.0C23 C25 C27S[null][null]Winnipeg, MB
3910male64.01419950263.0C23 C25 C27S[null][null]Winnipeg, MB
4010male[null]0011377826.55D34S[null][null]Westcliff-on-Sea, Essex
4110male[null]001120580.0B102S[null][null][null]
4210male37.01011380353.1C123S[null][null]Scituate, MA
4310male47.00011132038.5E63S[null]275St Anne's-on-Sea, Lancashire
4410male24.000PC 1759379.2B86C[null][null][null]
4510male71.000PC 1775434.6542A5C[null][null]New York, NY
4610male38.001PC 17582153.4625C91S[null]147Winnipeg, MB
4710male46.000PC 1759379.2B82 B84C[null][null]New York, NY
4810male[null]0011379642.4[null]S[null][null][null]
4910male45.0103697383.475C83S[null][null]New York, NY
5010male40.0001120590.0B94S[null]110[null]
5110male55.0111274993.5B69S[null]307Montreal, PQ
5210male42.00011303842.5B11S[null][null]London / Middlesex
5310male[null]001746351.8625E46S[null][null]Brighton, MA
5410male55.00068050.0C39S[null][null]London / Birmingham
5510male42.01011378952.0[null]S[null]38New York, NY
5610male[null]00PC 1760030.6958[null]C14[null]New York, NY
5710female50.000PC 1759528.7125C49C[null][null]Paris, France New York, NY
5810male46.00069426.0[null]S[null]80Bennington, VT
5910male50.00011304426.0E60S[null][null]London
6010male32.500113503211.5C132C[null]45[null]
6110male58.0001177129.7B37C[null]258Buffalo, NY
6210male41.0101746451.8625D21S[null][null]Southington / Noank, CT
6310male[null]0011302826.55C124S[null][null]Portland, OR
6410male[null]00PC 1761227.7208[null]C[null][null]Chicago, IL
6510male29.00011350130.0D6S[null]126Springfield, MA
6610male30.00011380145.5[null]S[null][null]London / New York, NY
6710male30.00011046926.0C106S[null][null]Brockton, MA
6810male19.01011377353.1D30S[null][null]New York, NY
6910male46.0001305075.2417C6C[null]292Vancouver, BC
7010male54.0001746351.8625E46S[null]175Dorchester, MA
7110male28.010PC 1760482.1708[null]C[null][null]New York, NY
7210male65.0001350926.55E38S[null]249East Bridgewater, MA
7310male44.0201992890.0C78Q[null]230Fond du Lac, WI
7410male55.00011378730.5C30S[null][null]Montreal, PQ
7510male47.00011379642.4[null]S[null][null]Washington, DC
7610male37.001PC 1759629.7C118C[null][null]Brooklyn, NY
7710male58.00235273113.275D48C[null]122Lexington, MA
7810male64.00069326.0[null]S[null]263Isle of Wight, England
7910male65.00111350961.9792B30C[null]234Providence, RI
8010male28.500PC 1756227.7208D43C[null]189?Havana, Cuba
8110male[null]001120520.0[null]S[null][null]Belfast
8210male45.50011304328.5C124S[null]166Surbiton Hill, Surrey
8310male23.0001274993.5B24S[null][null]Montreal, PQ
8410male29.01011377666.6C2S[null][null]Isleworth, England
8510male18.010PC 17758108.9C65C[null][null]Madrid, Spain
8610male47.00011046552.0C110S[null]207Worcester, MA
8710male38.000199720.0[null]S[null][null]Rotterdam, Netherlands
8810male22.000PC 17760135.6333[null]C[null]232[null]
8910male[null]00PC 17757227.525[null]C[null][null][null]
9010male31.000PC 1759050.4958A24S[null][null]Trenton, NJ
9110male[null]0011376750.0A32S[null][null]Seattle, WA
9210male36.0001304940.125A10C[null][null]Winnipeg, MB
9310male55.010PC 1760359.4[null]C[null][null]New York, NY
9410male33.00011379026.55[null]S[null]109London
9510male61.013PC 17608262.375B57 B59 B63 B66C[null][null]Haverford, PA / Cooperstown, NY
9610male50.0101350755.9E44S[null][null]Duluth, MN
9710male56.00011379226.55[null]S[null][null]New York, NY
9810male56.0001776430.6958A7C[null][null]St James, Long Island, NY
9910male24.0101369560.0C31S[null][null]Huntington, WV
10010male[null]0011305626.0A19S[null][null]Streatham, Surrey
Rows: 1-100 | Columns: 14

Data Exploration and Preparation

Let's explore the data by displaying descriptive statistics of all the columns.

In [5]:
titanic.describe(method = "categorical", unique = True)
Out[5]:
dtype
count
top
top_percent
unique
"pclass"int1234353.7283.0
"survived"int1234063.5332.0
"name"varchar(164)1234Kelly, Mr. James0.1621232.0
"sex"varchar(20)1234male65.9642.0
"age"numeric(6,3)997[null]19.20696.0
"sibsp"int1234067.7477.0
"parch"int1234076.9048.0
"ticket"varchar(36)1234CA. 23430.81887.0
"fare"numeric(10,5)12338.054.7277.0
"cabin"varchar(30)286[null]76.823182.0
"embarked"varchar(20)1232S70.7463.0
"boat"varchar(100)439[null]64.42526.0
"body"int118[null]90.438118.0
"home.dest"varchar(100)706[null]42.788359.0
Rows: 1-14 | Columns: 6

The columns "body" (passenger ID), "home.dest" (passenger origin/destination), "embarked" (origin port) and "ticket" (ticket ID) shouldn't influence survival, so we can ignore these.

Let's focus our analysis on the columns "name" and "cabin." We'll begin with the passengers' names.

In [6]:
from verticapy.learn.preprocessing import CountVectorizer
model = CountVectorizer("name_voc")
model.fit(titanic, ["Name"]).transform()
Out[6]:
Abc
token
Varchar(128)
123
df
Numeric(36,18)
123
cnt
Integer
123
rnk
Integer
1mr0.148163100524828427341
2miss0.0460234154218813062282
3mrs0.037343560758982641853
4william0.016148566814695196804
5john0.013726281792490917685
6master0.011303996770286637566
7henry0.008881711748082359447
8james0.007468712151796528378
9charles0.007468712151796528378
10thomas0.0068631408962454583410
11joseph0.0062575696406943883111
12george0.0062575696406943883111
13edward0.0054501412999596292713
14mary0.0052482842147759392614
15johan0.0044408558740411792215
16elizabeth0.0044408558740411792215
17frederick0.0040371417036737992017
18samuel0.0038352846184901091918
19robert0.0036334275333064191819
20alexander0.0036334275333064191819
21arthur0.0034315704481227291721
22alfred0.0034315704481227291721
23richard0.0032297133629390391623
24maria0.0032297133629390391623
25anna0.0032297133629390391623
26karl0.0028259991925716591426
27jr0.0028259991925716591426
28albert0.0028259991925716591426
29peter0.0026241421073879691329
30j0.0026241421073879691329
31patrick0.0024222850222042791231
32frank0.0024222850222042791231
33ernest0.0024222850222042791231
34alice0.0024222850222042791231
35margaret0.0022204279370205891135
36h0.0022204279370205891135
37andersson0.0022204279370205891135
38walter0.0020185708518368991038
39sage0.0020185708518368991038
40martin0.0020185708518368991038
41leonard0.0020185708518368991038
42edith0.00181671376665321942
43e0.00181671376665321942
44y0.00161485668146952844
45victor0.00161485668146952844
46van0.00161485668146952844
47rev0.00161485668146952844
48harry0.00161485668146952844
49goodwin0.00161485668146952844
50francis0.00161485668146952844
51elias0.00161485668146952844
52dr0.00161485668146952844
53david0.00161485668146952844
54daniel0.00161485668146952844
55asplund0.00161485668146952844
56annie0.00161485668146952844
57nils0.00141299959628583757
58marie0.00141299959628583757
59louise0.00141299959628583757
60l0.00141299959628583757
61helen0.00141299959628583757
62ernst0.00141299959628583757
63emil0.00141299959628583757
64de0.00141299959628583757
65davies0.00141299959628583757
66carl0.00141299959628583757
67brown0.00141299959628583757
68bertram0.00141299959628583757
69bertha0.00141299959628583757
70benjamin0.00141299959628583757
71august0.00141299959628583757
72a0.00141299959628583757
73sofia0.00121114251110214673
74smith0.00121114251110214673
75skoog0.00121114251110214673
76rice0.00121114251110214673
77panula0.00121114251110214673
78olsen0.00121114251110214673
79kate0.00121114251110214673
80johnson0.00121114251110214673
81johansson0.00121114251110214673
82ivan0.00121114251110214673
83ida0.00121114251110214673
84hansen0.00121114251110214673
85gustafsson0.00121114251110214673
86fortune0.00121114251110214673
87emily0.00121114251110214673
88ellen0.00121114251110214673
89douglas0.00121114251110214673
90delia0.00121114251110214673
91catherine0.00121114251110214673
92carter0.00121114251110214673
93bridget0.00121114251110214673
94augusta0.00121114251110214673
95ada0.00121114251110214673
96williams0.00100928542591845596
97white0.00100928542591845596
98vander0.00100928542591845596
99sidney0.00100928542591845596
100philip0.00100928542591845596
Rows: 1-100 | Columns: 4

Passengers' titles might come in handy. We can extract these from their names.

Let's move on to the cabins.

In [7]:
model = CountVectorizer("cabin_voc")
model.fit("titanic", ["cabin"]).transform()
Out[7]:
Abc
token
Varchar(128)
123
df
Numeric(36,18)
123
cnt
Integer
123
rnk
Integer
1[null]0.73374613003095989481
2f0.00619195046439628582
3c270.00464396284829721463
4c250.00464396284829721463
5c230.00464396284829721463
6g60.00386996904024767856
7f40.00309597523219814247
8d0.00309597523219814247
9c780.00309597523219814247
10c260.00309597523219814247
11c220.00309597523219814247
12b980.00309597523219814247
13b960.00309597523219814247
14b660.00309597523219814247
15b630.00309597523219814247
16b590.00309597523219814247
17b570.00309597523219814247
18f330.002321981424148607318
19f20.002321981424148607318
20e460.002321981424148607318
21e340.002321981424148607318
22c1010.002321981424148607318
23b600.002321981424148607318
24b580.002321981424148607318
25b550.002321981424148607318
26b530.002321981424148607318
27b510.002321981424148607318
28a340.002321981424148607318
29g730.001547987616099071229
30g630.001547987616099071229
31e80.001547987616099071229
32e670.001547987616099071229
33e500.001547987616099071229
34e440.001547987616099071229
35e330.001547987616099071229
36e310.001547987616099071229
37e240.001547987616099071229
38e1210.001547987616099071229
39e1010.001547987616099071229
40d370.001547987616099071229
41d360.001547987616099071229
42d350.001547987616099071229
43d330.001547987616099071229
44d300.001547987616099071229
45d280.001547987616099071229
46d260.001547987616099071229
47d210.001547987616099071229
48d200.001547987616099071229
49d190.001547987616099071229
50d170.001547987616099071229
51d150.001547987616099071229
52d120.001547987616099071229
53d100.001547987616099071229
54c930.001547987616099071229
55c920.001547987616099071229
56c890.001547987616099071229
57c860.001547987616099071229
58c850.001547987616099071229
59c830.001547987616099071229
60c800.001547987616099071229
61c70.001547987616099071229
62c680.001547987616099071229
63c650.001547987616099071229
64c640.001547987616099071229
65c620.001547987616099071229
66c60.001547987616099071229
67c570.001547987616099071229
68c550.001547987616099071229
69c540.001547987616099071229
70c520.001547987616099071229
71c460.001547987616099071229
72c320.001547987616099071229
73c310.001547987616099071229
74c20.001547987616099071229
75c1260.001547987616099071229
76c1250.001547987616099071229
77c1240.001547987616099071229
78c1230.001547987616099071229
79c1160.001547987616099071229
80c1060.001547987616099071229
81b780.001547987616099071229
82b770.001547987616099071229
83b710.001547987616099071229
84b690.001547987616099071229
85b50.001547987616099071229
86b490.001547987616099071229
87b450.001547987616099071229
88b410.001547987616099071229
89b350.001547987616099071229
90b280.001547987616099071229
91b220.001547987616099071229
92b200.001547987616099071229
93b180.001547987616099071229
94t0.000773993808049536194
95f380.000773993808049536194
96e770.000773993808049536194
97e690.000773993808049536194
98e680.000773993808049536194
99e630.000773993808049536194
100e600.000773993808049536194
Rows: 1-100 | Columns: 4

Here, we have the cabin IDs, the letter of which represents a certain position on the boat. Let's see how often each cabin occurs in the dataset.

In [8]:
CountVectorizer("cabin_voc").fit("titanic", ["cabin"]).transform(
                )["token"].str_slice(1, 1).groupby(
                columns = ["token"], expr = ["SUM(cnt)"]).head(30)
Out[8]:
Abc
token
Varchar(4)
123
SUM
Integer
1[null]948
2a20
3b92
4c113
5d47
6e43
7f19
8g9
9t1
Rows: 1-9 | Columns: 2

While NULL values for "boat" clearly represent passengers who have a dedicated "lifeboat," we can't be so sure about NULL values for "cabin". We can guess that these might represent passengers without a cabin. If this is the case, then these are missing values not at random (MNAR).

We'll revisit this problem later. For now, let's drop the columns that don't affect survival and then encode the rest.

In [9]:
titanic.drop(["body", "home.dest", "embarked", "ticket"])
Out[9]: