Encoding

Encoding features is an important part of the data science life cycle. Generality matters in data science, and having too many categories can compromise it and lead to unreliable results. In addition, some algorithms are optimized for categorical inputs, and some can't process non-numerical features at all.

There are many encoding techniques; the sketch after this list illustrates the main ones:

  • User-Defined Encoding: The most flexible encoding; the user chooses how to encode each category.

  • Label Encoding: Each category is converted to an integer using a bijection to [0; n-1], where n is the number of unique values in the feature.

  • One-hot Encoding: This technique creates a dummy variable (values in {0,1}) for each category, expanding the feature into n separate features.

  • Mean Encoding: This technique replaces each category with the average of a specific response column over that category.

  • Discretization: This technique uses various mathematical techniques to encode continuous features into categories.
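
Before turning to VerticaPy, here is a toy plain-Python sketch of what label, one-hot, and mean encoding each produce on a small sample (the 'sex' and 'survived' values below are made up for illustration):

# Toy illustration of three encoding techniques; plain Python,
# for intuition only. The VerticaPy equivalents are demonstrated
# on the titanic dataset below.
sex = ["male", "female", "female", "male"]
survived = [0, 1, 1, 1]

# Label encoding: map each category to an integer in [0; n-1].
labels = {c: i for i, c in enumerate(sorted(set(sex)))}
print([labels[c] for c in sex])  # [1, 0, 0, 1]

# One-hot encoding: one 0/1 dummy per category.
print([[int(c == cat) for cat in sorted(set(sex))] for c in sex])
# [[0, 1], [1, 0], [1, 0], [0, 1]]

# Mean encoding: replace each category with the average of the
# response ('survived') over that category.
means = {c: sum(s for x, s in zip(sex, survived) if x == c)
            / sex.count(c) for c in set(sex)}
print([means[c] for c in sex])  # [0.5, 1.0, 1.0, 0.5]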

To demonstrate encoding data in VerticaPy, we’ll use the well-known ‘Titanic’ dataset.

[1]:
from verticapy.datasets import load_titanic
import verticapy as vp

vp.set_option("plotting_lib","highcharts")

vdf = load_titanic()
display(vdf)
[Output abbreviated: the first 100 rows of the titanic vDataFrame]
Columns: pclass (Integer), survived (Integer), name (Varchar(164)), sex (Varchar(20)), age (Numeric(8)), sibsp (Integer), parch (Integer), ticket (Varchar(36)), fare (Numeric(12)), cabin (Varchar(30)), embarked (Varchar(20)), boat (Varchar(100)), body (Integer), home.dest (Varchar(100))
Rows: 1-100 | Columns: 14

Let’s look at the ‘age’ of the passengers.

[2]:
vdf["age"].hist()
[2]:

Using the ‘discretize’ method, we can bin the data into intervals of equal width.

[3]:
vdf["age"].discretize(method = "same_width", h = 10)
vdf["age"].bar(max_cardinality = 10)
[3]:

We can also discretize the data using equal-frequency bins.

[4]:
vdf = load_titanic()
vdf["age"].discretize(method = "same_freq", nbins = 5)
vdf["age"].bar(max_cardinality = 5)
[4]:
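
The ‘discretize’ method also supports a ‘topk’ mode (described in the help output shown later) that keeps the k most frequent categories of a column and merges the rest into a single one. A minimal sketch on the categorical ‘home.dest’ column:

vdf = load_titanic()

# Keep the 5 most frequent destinations; every other value is
# merged into the 'Others' category.
vdf["home.dest"].discretize(method = "topk", k = 5, new_category = "Others")
vdf["home.dest"].bar(max_cardinality = 6)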

Computing categories based on a response column (the ‘smart’ method) can also be a good solution.

[5]:
vdf = load_titanic()
vdf["age"].discretize(method = "smart", response = "survived", nbins = 6)
vdf["age"].bar(method = "avg", of = "survived")
[5]:

We can view the available techniques of the ‘discretize’ method with the built-in ‘help’ function.

[6]:
help(vdf["age"].discretize)
Help on method discretize in module verticapy.core.vdataframe._encoding:

discretize(method: Literal['auto', 'smart', 'same_width', 'same_freq', 'topk'] = 'auto', h: Annotated[Union[int, float, decimal.Decimal], 'Python Numbers'] = 0, nbins: int = -1, k: int = 6, new_category: str = 'Others', RFmodel_params: Optional[dict] = None, response: Optional[str] = None, return_enum_trans: bool = False) -> 'vDataFrame' method of verticapy.core.vdataframe.base.vDataColumn instance
    Discretizes the vDataColumn using the input method.

    Parameters
    ----------
    method: str, optional
        The method used to discretize the vDataColumn.
            auto       : Uses method 'same_width' for numerical
                         vDataColumns, casts the other types to
                         varchar.
            same_freq  : Computes bins  with the same number of
                         elements.
            same_width : Computes regular width bins.
            smart      : Uses  the Random  Forest on a  response
                         column  to   find   the  most  relevant
                         interval to use for the discretization.
            topk       : Keeps the topk most frequent categories
                         and  merge the  other  into one  unique
                         category.
    h: PythonNumber, optional
        The  interval  size  used  to  convert  the vDataColumn.
        If this parameter is equal to 0, an optimised interval is
        computed.
    nbins: int, optional
        Number of bins  used for the discretization  (must be > 1)
    k: int, optional
        The integer k of the 'topk' method.
    new_category: str, optional
        The  name of the  merging  category when using the  'topk'
        method.
    RFmodel_params: dict, optional
        Dictionary  of the  Random Forest  model  parameters used  to
        compute the best splits when 'method' is set to 'smart'.
        A RF Regressor is  trained if  the response is numerical
        (except ints and bools), a RF Classifier otherwise.
        Example: Write {"n_estimators": 20, "max_depth": 10} to train
        a Random Forest with 20 trees and a maximum depth of 10.
    response: str, optional
        Response vDataColumn when method is set to 'smart'.
    return_enum_trans: bool, optional
        Returns  the transformation instead of the vDataFrame parent,
        and does not apply the transformation. This parameter is
        useful for testing the look of the final transformation.

    Returns
    -------
    vDataFrame
        self._parent
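
As the signature shows, setting return_enum_trans to True returns the transformation instead of applying it, which is handy for previewing an encoding before committing to it. A minimal sketch on the untouched ‘fare’ column:

# Preview the equal-width binning transformation for 'fare'
# without modifying the vDataFrame.
vdf["fare"].discretize(method = "same_width", h = 50, return_enum_trans = True)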

To encode a categorical feature, we can use label encoding. For example, the column ‘sex’ has two categories (male and female) that we can represent with 0 and 1, respectively.

[7]:
vdf["sex"].label_encode()
display(vdf["sex"])
 #    sex (Integer)
 1    0
 2    1
 3    0
 4    1
 5    1
 ...  ...
Rows: 1-100 of 1234 | Column: sex | Type: Integer

When a feature has only a few categories, one-hot encoding is usually the most suitable choice: label encoding converts a categorical feature to a numerical one without preserving meaningful mathematical relationships between the resulting integers. Let's use one-hot encoding on the ‘embarked’ column.

[8]:
vdf["embarked"].one_hot_encode()
vdf.select(["embarked", "embarked_C", "embarked_Q"])
[8]:
 #    embarked (Varchar)    embarked_C (Integer)    embarked_Q (Integer)
 1    S                     0                       0
 2    S                     0                       0
 3    S                     0                       0
 4    S                     0                       0
 5    C                     1                       0
 ...
 73   Q                     0                       1
 ...
Rows: 1-100 | Columns: 3
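
Only the ‘embarked_C’ and ‘embarked_Q’ dummies are selected above: since n-1 dummies fully determine a feature with n categories, one of them can be dropped without losing information. If a column for every category is needed, a minimal sketch, assuming the drop_first parameter of the current VerticaPy signature:

vdf = load_titanic()

# Create a dummy for every category of 'embarked'
# (embarked_C, embarked_Q, and embarked_S).
vdf["embarked"].one_hot_encode(drop_first = False)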

One-hot encoding can be expensive if the column in question has a large number of categories. In that case, we should use mean encoding. Mean encoding replaces each category of a variable with the average of the response column over that category. This makes it an efficient way to encode the data, but be careful of over-fitting.

Let’s use a mean encoding on the ‘home.dest’ variable.

[9]:
vdf["home.dest"].mean_encode("survived")
display(vdf["home.dest"])
The mean encoding was successfully done.
 #    home.dest (Float)
 1    1.0
 2    0.0
 3    0.0
 4    0.0
 5    0.0
 ...
 63   0.666666666666667
 ...
Rows: 1-100 of 1234 | Column: home.dest | Type: Float(22)

VerticaPy offers many encoding techniques. For example, the ‘case_when’ and ‘decode’ methods allow the user to apply a customized encoding to a column (a sketch with ‘decode’ follows), and the ‘discretize’ method reduces the number of categories in a column. It's important to get familiar with all of the available techniques so you can make informed decisions about which one to use for a given dataset.
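
As a final illustration, here is a minimal sketch of a user-defined encoding with ‘decode’, which takes DECODE-style pairs of original value and replacement, with an optional default as the last argument; the spelled-out port names are our own choice for this example:

vdf = load_titanic()

# Map each port abbreviation to a full name; unmatched values
# fall back to the default 'Unknown'.
vdf["embarked"].decode("S", "Southampton",
                       "C", "Cherbourg",
                       "Q", "Queenstown",
                       "Unknown")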