
Encoding
Encoding features is an important part of the data science life cycle. Models need to generalize, and a feature with too many categories can compromise that and lead to incorrect results. In addition, some linear algorithms work better with categorized information, and others can't process non-numerical features at all.
There are many encoding techniques:
- User-Defined Encoding : The most flexible encoding. The user can choose how to encode the different categories.
- Label Encoding : Each category is converted to an integer using a bijection to [0;n-1], where n is the number of unique values of the feature.
- One-hot Encoding : This technique creates a dummy (a value in {0,1}) for each category. The feature is thus split into n binary features.
- Mean Encoding : This technique replaces each category with the average of a specific response column for that category.
- Discretization : This technique uses various mathematical techniques to encode continuous features into categories.
To demonstrate encoding data in VerticaPy, we'll use the well-known 'Titanic' dataset.
from verticapy.datasets import load_titanic
vdf = load_titanic()
display(vdf)
Let's look at the 'age' of the passengers.
vdf["age"].hist()
With the 'discretize' method, we can bin the data into equal-width intervals. Here, h = 10 sets the width of each bin to 10 years.
vdf["age"].discretize(method = "same_width", h = 10)
vdf["age"].hist(max_cardinality = 10)
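To make the idea of equal-width binning concrete, here is a minimal pure-Python sketch. It is not VerticaPy's implementation: the helper name `equal_width_bin`, the half-open interval convention, and the sample ages are assumptions made for illustration only.

```python
import math

def equal_width_bin(value, h):
    """Return the half-open interval [lower, upper) of width h containing value.

    Hypothetical helper for illustration; not part of VerticaPy.
    """
    lower = math.floor(value / h) * h
    return (lower, lower + h)

# Made-up ages, not taken from the Titanic dataset.
ages = [2, 9.5, 10, 24, 29, 36, 58, 80]
bins = [equal_width_bin(a, 10) for a in ages]

for age, interval in zip(ages, bins):
    print(age, "->", interval)
```

Every interval has the same width, so sparsely populated age ranges end up as bins with very few rows.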
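We can also discretize the data using equal-frequency bins, so that each bin holds roughly the same number of passengers. We reload the dataset first because the previous call already replaced 'age' with its discretized version.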
vdf = load_titanic()
vdf["age"].discretize(method = "same_freq", nbins = 5)
vdf["age"].hist(max_cardinality = 5)
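The following sketch shows one simple way to build equal-frequency bins: rank the values and cut the ranking into groups of equal size. It is an assumption-laden illustration rather than VerticaPy's method, which may derive its cut points differently. The helper `equal_freq_bin_ids` and the sample ages are hypothetical.

```python
from collections import Counter

def equal_freq_bin_ids(values, nbins):
    """Assign each value a bin index in [0, nbins - 1] so bins hold roughly equal counts.

    Hypothetical helper: bins are formed by rank, so tied values may land
    in different bins. Quantile cut points would avoid that.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bin_ids = [0] * len(values)
    for rank, i in enumerate(order):
        bin_ids[i] = rank * nbins // len(values)
    return bin_ids

# Made-up ages, not taken from the Titanic dataset.
ages = [22, 38, 26, 35, 54, 2, 27, 14, 4, 58]
bin_ids = equal_freq_bin_ids(ages, 5)
counts = Counter(bin_ids)
print(bin_ids)
print(counts)
```

Unlike equal-width bins, the intervals here have different widths: they are narrow where the data is dense and wide where it is sparse.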
Another option is to compute the categories using a response column, so that the resulting intervals are the ones most relevant to that response (here, 'survived').
vdf = load_titanic()
vdf["age"].discretize(method = "smart", response = "survived", nbins = 6)
vdf["age"].hist(method = "avg", of = "survived")
We can view the techniques available in the 'discretize' method with Python's built-in 'help' function.
help(vdf["age"].discretize)
To encode a categorical feature, we can use label encoding. For example, the column 'sex' has two categories (male and female) that we can represent with 0 and 1.
vdf["sex"].label_encode()
display(vdf["sex"])
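As a rough illustration of the bijection behind label encoding, the sketch below maps each distinct category to an integer in [0, n-1]. Assigning codes in alphabetical order is a choice of this sketch, and the helper `label_encode` here is a hypothetical standalone function, not the VerticaPy method of the same name.

```python
def label_encode(values):
    """Map each distinct category to an integer in [0, n - 1].

    Hypothetical helper: codes follow the sorted order of the categories,
    which is an assumption of this sketch.
    """
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

# Made-up sample values.
sexes = ["male", "female", "female", "male"]
codes, mapping = label_encode(sexes)
print(mapping)
print(codes)
```

Because the mapping is a bijection, it can be inverted to recover the original categories from the codes.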
Label encoding converts a categorical feature to a numerical one, but the integers it assigns imply an order and distances between categories that don't actually exist. When a feature has only a few categories, one-hot encoding is usually the more suitable choice. Let's use one-hot encoding on the 'embarked' column.
vdf["embarked"].one_hot_encode()
vdf.select(["embarked", "embarked_C", "embarked_Q"])
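The sketch below illustrates what one-hot encoding produces, using plain Python lists. The helper `one_hot_encode` and its `drop_last` parameter are hypothetical names chosen for this example, and the sample values are made up. The optional drop reflects a general fact: a feature with n categories needs only n-1 dummies, since the remaining category is implied when all the others are 0.

```python
def one_hot_encode(values, prefix, drop_last=False):
    """Return a dict mapping each dummy column name to its list of 0/1 values.

    Hypothetical helper for illustration; not part of VerticaPy.
    """
    categories = sorted(set(values))
    if drop_last:
        # n - 1 dummies are enough: the dropped category is implied
        # when every remaining dummy is 0.
        categories = categories[:-1]
    return {
        f"{prefix}_{c}": [1 if v == c else 0 for v in values]
        for c in categories
    }

# Made-up sample values using the three port codes.
embarked = ["S", "C", "Q", "S"]
dummies = one_hot_encode(embarked, "embarked")
reduced = one_hot_encode(embarked, "embarked", drop_last=True)
print(dummies)
print(reduced)
```

With all dummies kept, exactly one of them equals 1 in every row, which is why one column is redundant.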
One-hot encoding can be expensive if the column in question has a large number of categories. In that case, mean encoding is a useful alternative. Mean encoding replaces each category with the average of a response column computed over the rows in that category. This makes it an efficient way to encode the data, but be careful of over-fitting, since the encoded values are derived from the response itself.
Let's use a mean encoding on the 'home.dest' variable.
vdf["home.dest"].mean_encode("survived")
display(vdf["home.dest"])
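To show the arithmetic behind mean encoding, here is a small sketch in plain Python. The helper `mean_encode` is a hypothetical standalone function (not the VerticaPy method), and the destinations and outcomes are invented for the example.

```python
from collections import defaultdict

def mean_encode(categories, response):
    """Replace each category with the mean of the response over its rows.

    Hypothetical helper for illustration; not part of VerticaPy.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, response):
        totals[category] += y
        counts[category] += 1
    means = {category: totals[category] / counts[category] for category in totals}
    return [means[category] for category in categories], means

# Made-up rows, not taken from the Titanic dataset.
home_dest = ["New York, NY", "London", "New York, NY", "Paris", "London", "New York, NY"]
survived = [1, 0, 0, 1, 1, 1]
encoded, means = mean_encode(home_dest, survived)
print(means)
print(encoded)
```

Note that "Paris" appears only once in this sample, so its encoded value is simply that row's own response. This is the over-fitting risk in miniature: rare categories leak the target directly into the feature.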
VerticaPy offers many encoding techniques. For example, the 'case_when' and 'decode' methods allow the user to use a customized encoding on a column. The 'discretize' method allows you to reduce the number of categories in a column. It's important to get familiar with all the techniques available so you can make informed decisions about which to use for a given dataset.
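A user-defined encoding boils down to an explicit mapping chosen by the user. The sketch below expresses that idea in plain Python. It does not use the 'case_when' or 'decode' methods themselves, and the helper `custom_encode`, the labels, and the sample values are all assumptions made for illustration.

```python
def custom_encode(values, mapping, default=None):
    """Encode values with a user-supplied mapping, using default for unmapped values.

    Hypothetical helper for illustration; not part of VerticaPy.
    """
    return [mapping.get(v, default) for v in values]

# Made-up passenger classes, including a missing value.
pclass = [1, 3, 2, 3, None]
labels = custom_encode(
    pclass,
    {1: "first", 2: "second", 3: "third"},
    default="unknown",
)
print(labels)
```

The explicit default makes the handling of unexpected or missing categories a deliberate decision rather than an accident.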