import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Use the following command to allow Matplotlib to display graphics.
%matplotlib inline
Let's load the dataset.
from verticapy.datasets import load_titanic
titanic = load_titanic()
display(titanic)
Data Exploration and Preparation
Let's explore the data by displaying descriptive statistics of all the columns.
titanic.describe(method = "categorical", unique = True)
The columns "body" (passenger ID), "home.dest" (passenger origin/destination), "embarked" (origin port) and "ticket" (ticket ID) shouldn't influence survival, so we can ignore these.
Let's focus our analysis on the columns "name" and "cabin." We'll begin with the passengers' names.
from verticapy.learn.preprocessing import CountVectorizer
model = CountVectorizer("name_voc")
model.fit(titanic, ["Name"]).transform()
Passengers' titles might come in handy. We can extract these from their names.
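As a rough illustration of what title extraction could look like, here is a minimal plain-Python sketch. It assumes names follow the "Surname, Title. Given names" layout; the sample strings, the regular expression, and the helper name `extract_title` are illustrative assumptions, not part of VerticaPy or of this tutorial's own method.

```python
import re
from collections import Counter

# Illustrative sample values, assumed to follow "Surname, Title. Given names".
sample_names = [
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
    "Anderson, Mr. Harry",
    "Andrews, Mr. Thomas Jr",
]

# Capture the text between the first comma and the next period.
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")


def extract_title(name):
    """Return the title found in a name, or None if the layout does not match."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


titles = [extract_title(name) for name in sample_names]
title_counts = Counter(titles)  # how often each title appears in the sample
print(title_counts)
```

A pattern like this only works while the comma-and-period layout holds, so names that deviate from it come back as None and would need separate handling.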
Let's move on to the cabins.
model = CountVectorizer("cabin_voc")
model.fit("titanic", ["cabin"]).transform()
Here, we have the cabin IDs. The letter in each ID represents a certain position on the boat. Let's see how often each cabin letter occurs in the dataset.
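# Keep only each token's first character (the cabin letter), then total the counts.
cabin_tokens = CountVectorizer("cabin_voc").fit("titanic", ["cabin"]).transform()
cabin_tokens["token"].str_slice(1, 1).groupby(
    columns = ["token"], expr = ["SUM(cnt)"]
).head(30)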
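For intuition, the same first-letter aggregation can be sketched in plain Python, without a database. The cabin values below are hypothetical samples, and the lowercasing step assumes the vectorizer normalizes case before counting; this is a sketch of the idea, not of how VerticaPy computes it internally.

```python
from collections import Counter

# Hypothetical cabin values; None stands in for NULL.
sample_cabins = ["C22 C26", "B5", None, "E12", "C22 C26", None, "D7"]

letter_counts = Counter()
for cabin in sample_cabins:
    if cabin is None:
        continue  # missing cabins contribute no tokens
    # A single field can list several cabins, so split it into tokens first.
    for token in cabin.lower().split():
        letter_counts[token[0]] += 1  # keep only the leading letter

print(letter_counts)
```

Note that a passenger with two listed cabins contributes two tokens, so the totals count cabin tokens rather than passengers.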
While NULL values for "boat" clearly represent passengers who did not have a dedicated lifeboat, we can't be so sure about NULL values for "cabin." We can guess that these might represent passengers without a cabin. If this is the case, then these values are missing not at random (MNAR).
We'll revisit this problem later. For now, let's drop the columns that don't affect survival and then encode the rest.
titanic.drop(["body", "home.dest", "embarked", "ticket"])