import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Use the following command to allow Matplotlib to display graphics.
%matplotlib inline
Let's load the dataset.
from verticapy.datasets import load_titanic
titanic = load_titanic()
display(titanic)
Data Exploration and Preparation¶
Let's explore the data by displaying descriptive statistics of all the columns.
titanic.describe(method = "categorical", unique = True)
The columns "body" (body identification number), "home.dest" (passenger origin/destination), "embarked" (origin port), and "ticket" (ticket ID) shouldn't influence survival, so we can ignore them.
Let's focus our analysis on the columns "name" and "cabin." We'll begin with the passengers' names.
from verticapy.learn.preprocessing import CountVectorizer
model = CountVectorizer("name_voc")
model.fit(titanic, ["name"]).transform()
Passengers' titles might come in handy. We can extract these from their names.
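The extraction itself happens in-database later on, but the regular-expression idea can be sketched in plain Python. Here, `extract_title` is just an illustrative helper using the same pattern:

```python
import re

def extract_title(name):
    """Extract the honorific (e.g. 'Mr', 'Mrs') from a Titanic-style name.

    Names follow the pattern "Surname, Title. Given names", so we look for
    a word followed by a period.
    """
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else None

print(extract_title("Braund, Mr. Owen Harris"))    # Mr
print(extract_title("Cumings, Mrs. John Bradley")) # Mrs
```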
Let's move on to the cabins.
model = CountVectorizer("cabin_voc")
model.fit("titanic", ["cabin"]).transform()
Here, we have the cabin IDs, the first letter of which represents a certain position (deck) on the ship. Let's see how often each cabin occurs in the dataset.
vocabulary = CountVectorizer("cabin_voc").fit("titanic", ["cabin"]).transform()
vocabulary["token"].str_slice(1, 1)
vocabulary.groupby(columns = ["token"], expr = ["SUM(cnt)"]).head(30)
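For intuition, the deck-counting step can be mimicked in plain Python. The `cabins` list below is a made-up sample, not the actual data:

```python
from collections import Counter

cabins = ["C85", "C123", "E46", "B78", "B78", None]

# Keep the first letter of each cabin ID (the deck), like str_slice(1, 1),
# then count occurrences, like GROUP BY with SUM(cnt)
decks = Counter(c[0] for c in cabins if c is not None)
print(decks)  # Counter({'C': 2, 'B': 2, 'E': 1})
```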
While NULL values for "boat" clearly represent passengers who didn't have a dedicated lifeboat, we can't be so sure about NULL values for "cabin". We can guess that these might represent passengers without a cabin. If that's the case, then these values are missing not at random (MNAR).
We'll revisit this problem later. For now, let's drop the columns that don't affect survival and then encode the rest.
titanic.drop(["body", "home.dest", "embarked", "ticket"])
titanic["cabin"].str_slice(1, 1)
titanic["name"].str_extract(r' ([A-Za-z]+)\.')
titanic["boat"].fillna(method = "0ifnull")
titanic["cabin"].fillna("No Cabin")
Looking at our data now, we can see that some first-class passengers have a NULL value for "cabin", so our assumption about the meaning of a NULL "cabin" turns out to be incorrect. This means that the "cabin" column has far too many values missing at random (MAR). We'll have to drop it.
titanic["cabin"].drop()
Let's look at descriptive statistics of the entire vDataFrame.
titanic.describe(method = "all")
Descriptive statistics can give us valuable insights into our data. Notice, for example, that the column "fare" has many outliers (the maximum of 512.33 is much greater than the 9th decile of 79.13). Most passengers traveled in 3rd class (median of pclass = 3).
The "sibsp" column represents the number of siblings for each passenger, while the "parch" column represents the number of parents and children. We can use these to create a new feature: "family_size".
titanic["family_size"] = titanic["parch"] + titanic["sibsp"] + 1
Let's move on to outliers. We have several tools for locating outliers (Local Outlier Factor, DBSCAN, k-means...), but we'll just use winsorization in this example. Again, "fare" has many outliers, so we'll start there.
titanic["fare"].fill_outliers(method = "winsorize", alpha = 0.03)
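Winsorization caps extreme values at chosen quantiles instead of removing them. A minimal plain-Python sketch of the idea (`winsorize` is a hypothetical helper with a crude quantile rule; VerticaPy does this in-database):

```python
def winsorize(values, alpha=0.03):
    """Clip values outside the [alpha, 1 - alpha] quantile range.

    Extreme values are replaced by the nearest quantile, not dropped,
    which is the idea behind fill_outliers(method="winsorize").
    """
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(alpha * (n - 1))]        # lower quantile (crude index rule)
    hi = ordered[int((1 - alpha) * (n - 1))]  # upper quantile
    return [min(max(v, lo), hi) for v in values]

fares = [7.25, 8.05, 13.0, 26.0, 79.13, 512.33]
print(winsorize(fares, alpha=0.2))  # [8.05, 8.05, 13.0, 26.0, 79.13, 79.13]
```

The extreme fare of 512.33 gets pulled back to the upper quantile rather than being discarded, so the row (and its other features) survives.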
Let's encode the column "sex" so we can use it with numerical methods.
titanic["sex"].label_encode()
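Label encoding simply maps each distinct category to an integer. A plain-Python sketch of the idea (`label_encode` here is an illustrative helper, not VerticaPy's function):

```python
def label_encode(values):
    """Map each distinct category to an integer (alphabetical order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["male", "female", "female", "male"])
print(mapping)  # {'female': 0, 'male': 1}
print(codes)    # [1, 0, 0, 1]
```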
The column "age" has too many missing values and since most machine learning algorithms can't handle missing values, we need to impute our data. Let's fill the missing values using the average "age" of the passengers that have the same "pclass" and "sex".
titanic["age"].fillna(method = "mean", by = ["pclass", "sex"])
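The idea behind this group-wise imputation can be sketched in plain Python (`fillna_mean_by` is a hypothetical helper and the rows are made-up; VerticaPy computes the group means in-database):

```python
from collections import defaultdict

def fillna_mean_by(rows, target, by):
    """Fill missing `target` values with the mean over rows sharing the same `by` keys."""
    groups = defaultdict(list)
    for row in rows:
        if row[target] is not None:
            groups[tuple(row[k] for k in by)].append(row[target])
    means = {key: sum(v) / len(v) for key, v in groups.items()}
    for row in rows:
        if row[target] is None:
            row[target] = means[tuple(row[k] for k in by)]
    return rows

passengers = [
    {"pclass": 3, "sex": "male", "age": 22.0},
    {"pclass": 3, "sex": "male", "age": 30.0},
    {"pclass": 3, "sex": "male", "age": None},
]
fillna_mean_by(passengers, "age", ["pclass", "sex"])
print(passengers[2]["age"])  # 26.0 (mean of 22.0 and 30.0)
```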
Let's draw the correlation matrix to see the links between variables.
titanic.corr(method = "spearman")
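Spearman's correlation is just Pearson's correlation computed on ranks, which makes it sensitive to any monotonic relationship, not only linear ones. A no-ties sketch in plain Python (illustration only; VerticaPy computes this in-database):

```python
def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for pos, i in enumerate(order, start=1):
            r[i] = float(pos)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Monotonic but non-linear relationship -> Spearman is still ~1.0
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```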
Fare correlates strongly with family size. This is about what you would expect: a larger family means more tickets, and more tickets means a greater fare.
Survival correlates strongly with whether or not a passenger has a lifeboat (the "boat" variable). Still, to increase the generality of our model, we should avoid predictions based on just one variable. Let's split the study into two use cases:
- Passengers with a lifeboat
- Passengers without a lifeboat
Before we move on: we did a lot of work to clean up this data, but we haven't saved anything to our Vertica database! Let's look at the modifications we've made to the vDataFrame.
print(titanic.current_relation())
Let's see what happens when we aggregate with SQL generation turned on.
vp.set_option("sql_on", True)
titanic.avg()
VerticaPy dynamically generates SQL code whenever you make modifications to your data. To avoid recomputation, it also stores previous aggregations. If we filter anything in our data, it will update the catalog with our modifications.
vp.set_option("sql_on", False)
print(titanic.info())
Let's move on to modeling our data. First, save the vDataFrame to your Vertica database.
from verticapy import drop
drop("titanic_boat", method = "view")
titanic_boat = titanic.search(titanic["boat"] == 1).to_db("titanic_boat", relation_type = "view")
drop("titanic_no_boat", method = "view")
titanic_no_boat = titanic.search(titanic["boat"] == 0).to_db("titanic_no_boat", relation_type = "view")
First, let's look at the number of survivors.
titanic_boat["survived"].describe()
We have nine deaths. Let's try to understand why these passengers died.
titanic_boat.search(titanic_boat["survived"] == 0).head(10)
Apart from the fact that many of these passengers traveled in third class, there don't seem to be any clear predictors of their deaths. Building a model from this data wouldn't be helpful.
Passengers without a lifeboat¶
Let's move on to passengers without a lifeboat.
titanic_no_boat["survived"].describe()
Only 20 survived. Let's find out why.
titanic_no_boat.search(titanic_no_boat["survived"] == 1).head(20)
Most survivors seem to be women. Let's build a model with this in mind.
One of our predictors is categorical: the passenger's title. Some of our predictors are also correlated, so it'd be best to work with a non-linear classifier that can handle that. In this case, a random forest classifier seems perfect. Let's evaluate it with cross-validation.
from verticapy.learn.ensemble import RandomForestClassifier
from verticapy.learn.model_selection import cross_validate
predictors = titanic.get_columns(exclude_columns = ['survived'])
response = "survived"
model = RandomForestClassifier("rf_titanic",
n_estimators = 40,
max_depth = 4)
cross_validate(model, titanic_no_boat, predictors, response)
This dataset is fairly unbalanced, so we'll use AUC to evaluate it. Looking at our table, our model has an average AUC of more than 0.9, so it performs quite well.
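AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which is why it stays informative on unbalanced data. A tiny plain-Python sketch (illustration only, not how Vertica computes it):

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count half). Fine for small samples."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ordered correctly -> AUC = 0.75
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Note that a trivial classifier predicting the majority class everywhere can have high accuracy on unbalanced data but only chance-level AUC (0.5), which is why AUC is the better metric here.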
We can now build a model with the entire dataset.
model.fit(titanic_no_boat, predictors, response)
Let's look at the importance of each feature.
model.features_importance()
As expected, a passenger's title and sex are the most important predictors of survival.
Conclusion¶
We've solved our problem in a Pandas-like way, all without ever loading data into memory!
About the Author
Badr Ouali
Head of Data Science
Badr Ouali works as a Lead Data Scientist for Vertica worldwide. He approaches data projects end to end, combining a clear understanding of the big picture with close attention to detail to achieve great business outcomes, a distinctive differentiator in his role. Badr enjoys sharing knowledge and insights about data analytics with colleagues and peers, and has a soft spot for Python. He loves helping customers get the most value from their data and empowering them to solve their use cases.
