Encoding
Encoding features is an important part of the data science life cycle: models need to generalize, and features with too many categories can compromise generality and lead to poor results. In addition, some algorithms work better with categorized information, and many can't process non-numerical features at all.
There are many encoding techniques:
User-Defined Encoding : The most flexible encoding. The user can choose how to encode the different categories.
Label Encoding : Each category is converted to an integer using a bijection to [0;n-1], where n is the number of unique values of the feature.
One-hot Encoding : This technique creates dummies (values in {0,1}) of each category. The categories are then separated into n features.
Mean Encoding : This technique replaces each category with the average of a specific response column computed over that category.
Discretization : This technique uses various mathematical techniques to convert continuous features into categories.
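Before turning to VerticaPy, the first few techniques can be sketched in plain Python on a toy column (all data and names below are illustrative, not part of the Titanic example):

```python
# Toy categorical column and a binary response (illustrative data).
colors = ["red", "blue", "red", "green", "blue", "red"]
response = [1, 0, 1, 0, 1, 0]

# Label encoding: a bijection from the categories to {0, ..., n-1}.
labels = {c: i for i, c in enumerate(sorted(set(colors)))}
label_encoded = [labels[c] for c in colors]
print(label_encoded)  # [2, 0, 2, 1, 0, 2]

# One-hot encoding: one 0/1 dummy per category.
one_hot = [{f"color_{c}": int(c == v) for c in sorted(set(colors))} for v in colors]

# Mean encoding: replace each category by the mean of the response over it.
means = {
    c: sum(r for v, r in zip(colors, response) if v == c) / colors.count(c)
    for c in set(colors)
}
mean_encoded = [means[c] for c in colors]
print([round(m, 2) for m in mean_encoded])  # [0.67, 0.5, 0.67, 0.0, 0.5, 0.67]
```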
To demonstrate encoding data in VerticaPy, we’ll use the well-known ‘Titanic’ dataset.
[1]:
from verticapy.datasets import load_titanic
import verticapy as vp
vp.set_option("plotting_lib", "highcharts")
vdf = load_titanic()
display(vdf)
123 pclassInteger | 123 survivedInteger | Abc nameVarchar(164) | Abc sexVarchar(20) | 123 ageNumeric(8) | 123 sibspInteger | 123 parchInteger | Abc ticketVarchar(36) | 123 fareNumeric(12) | Abc cabinVarchar(30) | Abc embarkedVarchar(20) | Abc boatVarchar(100) | 123 bodyInteger | Abc home.destVarchar(100) | |
1 | 1 | 0 | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | [null] | [null] | Montreal, PQ / Chesterville, ON | |
2 | 1 | 0 | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | [null] | 135 | Montreal, PQ / Chesterville, ON | |
3 | 1 | 0 | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | [null] | [null] | Montreal, PQ / Chesterville, ON | |
4 | 1 | 0 | male | 39.0 | 0 | 0 | 112050 | 0.0 | A36 | S | [null] | [null] | Belfast, NI | |
5 | 1 | 0 | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | [null] | C | [null] | 22 | Montevideo, Uruguay | |
6 | 1 | 0 | male | 47.0 | 1 | 0 | PC 17757 | 227.525 | C62 C64 | C | [null] | 124 | New York, NY | |
7 | 1 | 0 | male | [null] | 0 | 0 | PC 17318 | 25.925 | [null] | S | [null] | [null] | New York, NY | |
8 | 1 | 0 | male | 24.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | [null] | [null] | Montreal, PQ | |
9 | 1 | 0 | male | 36.0 | 0 | 0 | 13050 | 75.2417 | C6 | C | A | [null] | Winnipeg, MN | |
10 | 1 | 0 | male | 25.0 | 0 | 0 | 13905 | 26.0 | [null] | C | [null] | 148 | San Francisco, CA | |
11 | 1 | 0 | male | 45.0 | 0 | 0 | 113784 | 35.5 | T | S | [null] | [null] | Trenton, NJ | |
12 | 1 | 0 | male | 42.0 | 0 | 0 | 110489 | 26.55 | D22 | S | [null] | [null] | London / Winnipeg, MB | |
13 | 1 | 0 | male | 41.0 | 0 | 0 | 113054 | 30.5 | A21 | S | [null] | [null] | Pomeroy, WA | |
14 | 1 | 0 | male | 48.0 | 0 | 0 | PC 17591 | 50.4958 | B10 | C | [null] | 208 | Omaha, NE | |
15 | 1 | 0 | male | [null] | 0 | 0 | 112379 | 39.6 | [null] | C | [null] | [null] | Philadelphia, PA | |
16 | 1 | 0 | male | 45.0 | 0 | 0 | 113050 | 26.55 | B38 | S | [null] | [null] | Washington, DC | |
17 | 1 | 0 | male | [null] | 0 | 0 | 113798 | 31.0 | [null] | S | [null] | [null] | [null] | |
18 | 1 | 0 | male | 33.0 | 0 | 0 | 695 | 5.0 | B51 B53 B55 | S | [null] | [null] | New York, NY | |
19 | 1 | 0 | male | 28.0 | 0 | 0 | 113059 | 47.1 | [null] | S | [null] | [null] | Montevideo, Uruguay | |
20 | 1 | 0 | male | 17.0 | 0 | 0 | 113059 | 47.1 | [null] | S | [null] | [null] | Montevideo, Uruguay | |
21 | 1 | 0 | male | 49.0 | 0 | 0 | 19924 | 26.0 | [null] | S | [null] | [null] | Ascot, Berkshire / Rochester, NY | |
22 | 1 | 0 | male | 36.0 | 1 | 0 | 19877 | 78.85 | C46 | S | [null] | 172 | Little Onn Hall, Staffs | |
23 | 1 | 0 | male | 46.0 | 1 | 0 | W.E.P. 5734 | 61.175 | E31 | S | [null] | [null] | Amenia, ND | |
24 | 1 | 0 | male | [null] | 0 | 0 | 112051 | 0.0 | [null] | S | [null] | [null] | Liverpool, England / Belfast | |
25 | 1 | 0 | male | 27.0 | 1 | 0 | 13508 | 136.7792 | C89 | C | [null] | [null] | Los Angeles, CA | |
26 | 1 | 0 | male | [null] | 0 | 0 | 110465 | 52.0 | A14 | S | [null] | [null] | Stoughton, MA | |
27 | 1 | 0 | male | 47.0 | 0 | 0 | 5727 | 25.5875 | E58 | S | [null] | [null] | Victoria, BC | |
28 | 1 | 0 | male | 37.0 | 1 | 1 | PC 17756 | 83.1583 | E52 | C | [null] | [null] | Lakewood, NJ | |
29 | 1 | 0 | male | [null] | 0 | 0 | 113791 | 26.55 | [null] | S | [null] | [null] | Roachdale, IN | |
30 | 1 | 0 | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0 | B22 | S | [null] | 269 | Milwaukee, WI | |
31 | 1 | 0 | male | 39.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | [null] | [null] | New York, NY | |
32 | 1 | 0 | male | 31.0 | 1 | 0 | F.C. 12750 | 52.0 | B71 | S | [null] | [null] | Montreal, PQ | |
33 | 1 | 0 | male | 50.0 | 1 | 0 | PC 17761 | 106.425 | C86 | C | [null] | 62 | Deephaven, MN / Cedar Rapids, IA | |
34 | 1 | 0 | male | 39.0 | 0 | 0 | PC 17580 | 29.7 | A18 | C | [null] | 133 | Philadelphia, PA | |
35 | 1 | 0 | female | 36.0 | 0 | 0 | PC 17531 | 31.6792 | A29 | C | [null] | [null] | New York, NY | |
36 | 1 | 0 | male | [null] | 0 | 0 | PC 17483 | 221.7792 | C95 | S | [null] | [null] | [null] | |
37 | 1 | 0 | male | 30.0 | 0 | 0 | 113051 | 27.75 | C111 | C | [null] | [null] | New York, NY | |
38 | 1 | 0 | male | 19.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S | [null] | [null] | Winnipeg, MB | |
39 | 1 | 0 | male | 64.0 | 1 | 4 | 19950 | 263.0 | C23 C25 C27 | S | [null] | [null] | Winnipeg, MB | |
40 | 1 | 0 | male | [null] | 0 | 0 | 113778 | 26.55 | D34 | S | [null] | [null] | Westcliff-on-Sea, Essex | |
41 | 1 | 0 | male | [null] | 0 | 0 | 112058 | 0.0 | B102 | S | [null] | [null] | [null] | |
42 | 1 | 0 | male | 37.0 | 1 | 0 | 113803 | 53.1 | C123 | S | [null] | [null] | Scituate, MA | |
43 | 1 | 0 | male | 47.0 | 0 | 0 | 111320 | 38.5 | E63 | S | [null] | 275 | St Anne's-on-Sea, Lancashire | |
44 | 1 | 0 | male | 24.0 | 0 | 0 | PC 17593 | 79.2 | B86 | C | [null] | [null] | [null] | |
45 | 1 | 0 | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C | [null] | [null] | New York, NY | |
46 | 1 | 0 | male | 38.0 | 0 | 1 | PC 17582 | 153.4625 | C91 | S | [null] | 147 | Winnipeg, MB | |
47 | 1 | 0 | male | 46.0 | 0 | 0 | PC 17593 | 79.2 | B82 B84 | C | [null] | [null] | New York, NY | |
48 | 1 | 0 | male | [null] | 0 | 0 | 113796 | 42.4 | [null] | S | [null] | [null] | [null] | |
49 | 1 | 0 | male | 45.0 | 1 | 0 | 36973 | 83.475 | C83 | S | [null] | [null] | New York, NY | |
50 | 1 | 0 | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S | [null] | 110 | [null] | |
51 | 1 | 0 | male | 55.0 | 1 | 1 | 12749 | 93.5 | B69 | S | [null] | 307 | Montreal, PQ | |
52 | 1 | 0 | male | 42.0 | 0 | 0 | 113038 | 42.5 | B11 | S | [null] | [null] | London / Middlesex | |
53 | 1 | 0 | male | [null] | 0 | 0 | 17463 | 51.8625 | E46 | S | [null] | [null] | Brighton, MA | |
54 | 1 | 0 | male | 55.0 | 0 | 0 | 680 | 50.0 | C39 | S | [null] | [null] | London / Birmingham | |
55 | 1 | 0 | male | 42.0 | 1 | 0 | 113789 | 52.0 | [null] | S | [null] | 38 | New York, NY | |
56 | 1 | 0 | male | [null] | 0 | 0 | PC 17600 | 30.6958 | [null] | C | 14 | [null] | New York, NY | |
57 | 1 | 0 | female | 50.0 | 0 | 0 | PC 17595 | 28.7125 | C49 | C | [null] | [null] | Paris, France New York, NY | |
58 | 1 | 0 | male | 46.0 | 0 | 0 | 694 | 26.0 | [null] | S | [null] | 80 | Bennington, VT | |
59 | 1 | 0 | male | 50.0 | 0 | 0 | 113044 | 26.0 | E60 | S | [null] | [null] | London | |
60 | 1 | 0 | male | 32.5 | 0 | 0 | 113503 | 211.5 | C132 | C | [null] | 45 | [null] | |
61 | 1 | 0 | male | 58.0 | 0 | 0 | 11771 | 29.7 | B37 | C | [null] | 258 | Buffalo, NY | |
62 | 1 | 0 | male | 41.0 | 1 | 0 | 17464 | 51.8625 | D21 | S | [null] | [null] | Southington / Noank, CT | |
63 | 1 | 0 | male | [null] | 0 | 0 | 113028 | 26.55 | C124 | S | [null] | [null] | Portland, OR | |
64 | 1 | 0 | male | [null] | 0 | 0 | PC 17612 | 27.7208 | [null] | C | [null] | [null] | Chicago, IL | |
65 | 1 | 0 | male | 29.0 | 0 | 0 | 113501 | 30.0 | D6 | S | [null] | 126 | Springfield, MA | |
66 | 1 | 0 | male | 30.0 | 0 | 0 | 113801 | 45.5 | [null] | S | [null] | [null] | London / New York, NY | |
67 | 1 | 0 | male | 30.0 | 0 | 0 | 110469 | 26.0 | C106 | S | [null] | [null] | Brockton, MA | |
68 | 1 | 0 | male | 19.0 | 1 | 0 | 113773 | 53.1 | D30 | S | [null] | [null] | New York, NY | |
69 | 1 | 0 | male | 46.0 | 0 | 0 | 13050 | 75.2417 | C6 | C | [null] | 292 | Vancouver, BC | |
70 | 1 | 0 | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | [null] | 175 | Dorchester, MA | |
71 | 1 | 0 | male | 28.0 | 1 | 0 | PC 17604 | 82.1708 | [null] | C | [null] | [null] | New York, NY | |
72 | 1 | 0 | male | 65.0 | 0 | 0 | 13509 | 26.55 | E38 | S | [null] | 249 | East Bridgewater, MA | |
73 | 1 | 0 | male | 44.0 | 2 | 0 | 19928 | 90.0 | C78 | Q | [null] | 230 | Fond du Lac, WI | |
74 | 1 | 0 | male | 55.0 | 0 | 0 | 113787 | 30.5 | C30 | S | [null] | [null] | Montreal, PQ | |
75 | 1 | 0 | male | 47.0 | 0 | 0 | 113796 | 42.4 | [null] | S | [null] | [null] | Washington, DC | |
76 | 1 | 0 | male | 37.0 | 0 | 1 | PC 17596 | 29.7 | C118 | C | [null] | [null] | Brooklyn, NY | |
77 | 1 | 0 | male | 58.0 | 0 | 2 | 35273 | 113.275 | D48 | C | [null] | 122 | Lexington, MA | |
78 | 1 | 0 | male | 64.0 | 0 | 0 | 693 | 26.0 | [null] | S | [null] | 263 | Isle of Wight, England | |
79 | 1 | 0 | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C | [null] | 234 | Providence, RI | |
80 | 1 | 0 | male | 28.5 | 0 | 0 | PC 17562 | 27.7208 | D43 | C | [null] | 189 | ?Havana, Cuba | |
81 | 1 | 0 | male | [null] | 0 | 0 | 112052 | 0.0 | [null] | S | [null] | [null] | Belfast | |
82 | 1 | 0 | male | 45.5 | 0 | 0 | 113043 | 28.5 | C124 | S | [null] | 166 | Surbiton Hill, Surrey | |
83 | 1 | 0 | male | 23.0 | 0 | 0 | 12749 | 93.5 | B24 | S | [null] | [null] | Montreal, PQ | |
84 | 1 | 0 | male | 29.0 | 1 | 0 | 113776 | 66.6 | C2 | S | [null] | [null] | Isleworth, England | |
85 | 1 | 0 | male | 18.0 | 1 | 0 | PC 17758 | 108.9 | C65 | C | [null] | [null] | Madrid, Spain | |
86 | 1 | 0 | male | 47.0 | 0 | 0 | 110465 | 52.0 | C110 | S | [null] | 207 | Worcester, MA | |
87 | 1 | 0 | male | 38.0 | 0 | 0 | 19972 | 0.0 | [null] | S | [null] | [null] | Rotterdam, Netherlands | |
88 | 1 | 0 | male | 22.0 | 0 | 0 | PC 17760 | 135.6333 | [null] | C | [null] | 232 | [null] | |
89 | 1 | 0 | male | [null] | 0 | 0 | PC 17757 | 227.525 | [null] | C | [null] | [null] | [null] | |
90 | 1 | 0 | male | 31.0 | 0 | 0 | PC 17590 | 50.4958 | A24 | S | [null] | [null] | Trenton, NJ | |
91 | 1 | 0 | male | [null] | 0 | 0 | 113767 | 50.0 | A32 | S | [null] | [null] | Seattle, WA | |
92 | 1 | 0 | male | 36.0 | 0 | 0 | 13049 | 40.125 | A10 | C | [null] | [null] | Winnipeg, MB | |
93 | 1 | 0 | male | 55.0 | 1 | 0 | PC 17603 | 59.4 | [null] | C | [null] | [null] | New York, NY | |
94 | 1 | 0 | male | 33.0 | 0 | 0 | 113790 | 26.55 | [null] | S | [null] | 109 | London | |
95 | 1 | 0 | male | 61.0 | 1 | 3 | PC 17608 | 262.375 | B57 B59 B63 B66 | C | [null] | [null] | Haverford, PA / Cooperstown, NY | |
96 | 1 | 0 | male | 50.0 | 1 | 0 | 13507 | 55.9 | E44 | S | [null] | [null] | Duluth, MN | |
97 | 1 | 0 | male | 56.0 | 0 | 0 | 113792 | 26.55 | [null] | S | [null] | [null] | New York, NY | |
98 | 1 | 0 | male | 56.0 | 0 | 0 | 17764 | 30.6958 | A7 | C | [null] | [null] | St James, Long Island, NY | |
99 | 1 | 0 | male | 24.0 | 1 | 0 | 13695 | 60.0 | C31 | S | [null] | [null] | Huntington, WV | |
100 | 1 | 0 | male | [null] | 0 | 0 | 113056 | 26.0 | A19 | S | [null] | [null] | Streatham, Surrey |
Let’s look at the ‘age’ of the passengers.
[2]:
vdf["age"].hist()
[2]:
By using the ‘discretize’ method, we can discretize the data using equal-width binning.
[3]:
vdf["age"].discretize(method="same_width", h=10)
vdf["age"].bar(max_cardinality=10)
[3]:
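Under the hood, equal-width binning simply slices the value range into intervals of a fixed width h. A minimal pure-Python sketch of the idea (an illustration, not VerticaPy's implementation):

```python
import math

def same_width_bins(values, h):
    """Assign each value to an interval [k*h; (k+1)*h] of fixed width h."""
    edges = [math.floor(v / h) * h for v in values]
    return [f"[{lo};{lo + h}]" for lo in edges]

ages = [2.0, 30.0, 25.0, 39.0, 71.0]
print(same_width_bins(ages, 10))  # ['[0;10]', '[30;40]', '[20;30]', '[30;40]', '[70;80]']
```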
We can also discretize the data into equal-frequency bins, where each bin contains roughly the same number of elements.
[4]:
vdf = load_titanic()
vdf["age"].discretize(method="same_freq", nbins=5)
vdf["age"].bar(max_cardinality=5)
[4]:
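Equal-frequency binning instead chooses cut points so that each bin holds roughly the same number of elements, i.e. the edges are empirical quantiles. A rough sketch (illustrative only):

```python
def same_freq_edges(values, nbins):
    """Pick cut points that put roughly the same number of elements in each bin."""
    s = sorted(values)
    # One interior edge every len(s)/nbins elements (a crude empirical quantile).
    return [s[0]] + [s[(i * len(s)) // nbins] for i in range(1, nbins)] + [s[-1]]

ages = [2, 17, 24, 25, 28, 30, 36, 39, 47, 71]
print(same_freq_edges(ages, 5))  # [2, 24, 28, 36, 47, 71]
```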
Computing categories using a response column (the 'smart' method) can also be a good solution.
[5]:
vdf = load_titanic()
vdf["age"].discretize(method="smart", response="survived", nbins=6)
vdf["age"].bar(method="avg", of="survived")
/opt/venv/lib/python3.10/site-packages/vertica_python/vertica/connection.py:659: UserWarning: [WARNING] max_depth is set to 8 while max_breadth to 1000000000. This means the size of trees may become limited by max_depth first
warnings.warn(notice)
[5]:
We can view the available techniques in the ‘discretize’ method with the ‘help’ method.
[6]:
help(vdf["age"].discretize)
Help on method discretize in module verticapy.core.vdataframe._encoding:
discretize(method: Literal['auto', 'smart', 'same_width', 'same_freq', 'topk'] = 'auto', h: Annotated[Union[int, float, decimal.Decimal], 'Python Numbers'] = 0, nbins: int = -1, k: int = 6, new_category: str = 'Others', RFmodel_params: Optional[dict] = None, response: Optional[str] = None, return_enum_trans: bool = False) -> 'vDataFrame' method of verticapy.core.vdataframe.base.vDataColumn instance
Discretizes the vDataColumn using the input method.
Parameters
----------
method: str, optional
The method used to discretize the vDataColumn.
auto : Uses method 'same_width' for numerical
vDataColumns, casts the other types to
varchar.
same_freq : Computes bins with the same number of
elements.
same_width : Computes regular width bins.
smart : Uses the Random Forest on a response
column to find the most relevant
interval to use for the discretization.
topk : Keeps the topk most frequent categories
and merge the other into one unique
category.
h: PythonNumber, optional
The interval size used to convert the vDataColumn.
If this parameter is equal to 0, an optimised interval is
computed.
nbins: int, optional
Number of bins used for the discretization (must be > 1)
k: int, optional
The integer k of the 'topk' method.
new_category: str, optional
The name of the merging category when using the 'topk'
method.
RFmodel_params: dict, optional
Dictionary of the Random Forest model parameters used to
compute the best splits when 'method' is set to 'smart'.
A RF Regressor is trained if the response is numerical
(except ints and bools), a RF Classifier otherwise.
Example: Write {"n_estimators": 20, "max_depth": 10} to train
a Random Forest with 20 trees and a maximum depth of 10.
response: str, optional
Response vDataColumn when method is set to 'smart'.
return_enum_trans: bool, optional
Returns the transformation instead of the vDataFrame parent,
and does not apply the transformation. This parameter is
useful for testing the look of the final transformation.
Returns
-------
vDataFrame
self._parent
To encode a categorical feature, we can use label encoding. For example, the column ‘sex’ has two categories (male and female) that we can represent with 0 and 1, respectively.
[7]:
vdf["sex"].label_encode()
display(vdf["sex"])
123 sexInteger | |
1 | 0 |
2 | 1 |
3 | 0 |
4 | 1 |
5 | 1 |
6 | 1 |
7 | 1 |
8 | 1 |
9 | 1 |
10 | 1 |
11 | 1 |
12 | 1 |
13 | 1 |
14 | 1 |
15 | 1 |
16 | 1 |
17 | 1 |
18 | 1 |
19 | 1 |
20 | 1 |
21 | 1 |
22 | 1 |
23 | 1 |
24 | 1 |
25 | 1 |
26 | 1 |
27 | 1 |
28 | 1 |
29 | 1 |
30 | 1 |
31 | 1 |
32 | 1 |
33 | 1 |
34 | 1 |
35 | 0 |
36 | 1 |
37 | 1 |
38 | 1 |
39 | 1 |
40 | 1 |
41 | 1 |
42 | 1 |
43 | 1 |
44 | 1 |
45 | 1 |
46 | 1 |
47 | 1 |
48 | 1 |
49 | 1 |
50 | 1 |
51 | 1 |
52 | 1 |
53 | 1 |
54 | 1 |
55 | 1 |
56 | 1 |
57 | 0 |
58 | 1 |
59 | 1 |
60 | 1 |
61 | 1 |
62 | 1 |
63 | 1 |
64 | 1 |
65 | 1 |
66 | 1 |
67 | 1 |
68 | 1 |
69 | 1 |
70 | 1 |
71 | 1 |
72 | 1 |
73 | 1 |
74 | 1 |
75 | 1 |
76 | 1 |
77 | 1 |
78 | 1 |
79 | 1 |
80 | 1 |
81 | 1 |
82 | 1 |
83 | 1 |
84 | 1 |
85 | 1 |
86 | 1 |
87 | 1 |
88 | 1 |
89 | 1 |
90 | 1 |
91 | 1 |
92 | 1 |
93 | 1 |
94 | 1 |
95 | 1 |
96 | 1 |
97 | 1 |
98 | 1 |
99 | 1 |
100 | 1 |
When a feature has only a few categories, one-hot encoding is usually the most suitable choice: unlike label encoding, it does not impose an artificial numerical order on the categories. Let's use one-hot encoding on the 'embarked' column.
[8]:
vdf["embarked"].one_hot_encode()
vdf.select(["embarked", "embarked_C", "embarked_Q"])
[8]:
Abc embarkedVarchar(20) | 123 embarked_CInteger | 123 embarked_QInteger | |
1 | S | 0 | 0 |
2 | S | 0 | 0 |
3 | S | 0 | 0 |
4 | S | 0 | 0 |
5 | C | 1 | 0 |
6 | C | 1 | 0 |
7 | S | 0 | 0 |
8 | C | 1 | 0 |
9 | C | 1 | 0 |
10 | C | 1 | 0 |
11 | S | 0 | 0 |
12 | S | 0 | 0 |
13 | S | 0 | 0 |
14 | C | 1 | 0 |
15 | C | 1 | 0 |
16 | S | 0 | 0 |
17 | S | 0 | 0 |
18 | S | 0 | 0 |
19 | S | 0 | 0 |
20 | S | 0 | 0 |
21 | S | 0 | 0 |
22 | S | 0 | 0 |
23 | S | 0 | 0 |
24 | S | 0 | 0 |
25 | C | 1 | 0 |
26 | S | 0 | 0 |
27 | S | 0 | 0 |
28 | C | 1 | 0 |
29 | S | 0 | 0 |
30 | S | 0 | 0 |
31 | C | 1 | 0 |
32 | S | 0 | 0 |
33 | C | 1 | 0 |
34 | C | 1 | 0 |
35 | C | 1 | 0 |
36 | S | 0 | 0 |
37 | C | 1 | 0 |
38 | S | 0 | 0 |
39 | S | 0 | 0 |
40 | S | 0 | 0 |
41 | S | 0 | 0 |
42 | S | 0 | 0 |
43 | S | 0 | 0 |
44 | C | 1 | 0 |
45 | C | 1 | 0 |
46 | S | 0 | 0 |
47 | C | 1 | 0 |
48 | S | 0 | 0 |
49 | S | 0 | 0 |
50 | S | 0 | 0 |
51 | S | 0 | 0 |
52 | S | 0 | 0 |
53 | S | 0 | 0 |
54 | S | 0 | 0 |
55 | S | 0 | 0 |
56 | C | 1 | 0 |
57 | C | 1 | 0 |
58 | S | 0 | 0 |
59 | S | 0 | 0 |
60 | C | 1 | 0 |
61 | C | 1 | 0 |
62 | S | 0 | 0 |
63 | S | 0 | 0 |
64 | C | 1 | 0 |
65 | S | 0 | 0 |
66 | S | 0 | 0 |
67 | S | 0 | 0 |
68 | S | 0 | 0 |
69 | C | 1 | 0 |
70 | S | 0 | 0 |
71 | C | 1 | 0 |
72 | S | 0 | 0 |
73 | Q | 0 | 1 |
74 | S | 0 | 0 |
75 | S | 0 | 0 |
76 | C | 1 | 0 |
77 | C | 1 | 0 |
78 | S | 0 | 0 |
79 | C | 1 | 0 |
80 | C | 1 | 0 |
81 | S | 0 | 0 |
82 | S | 0 | 0 |
83 | S | 0 | 0 |
84 | S | 0 | 0 |
85 | C | 1 | 0 |
86 | S | 0 | 0 |
87 | S | 0 | 0 |
88 | C | 1 | 0 |
89 | C | 1 | 0 |
90 | S | 0 | 0 |
91 | S | 0 | 0 |
92 | C | 1 | 0 |
93 | C | 1 | 0 |
94 | S | 0 | 0 |
95 | C | 1 | 0 |
96 | S | 0 | 0 |
97 | S | 0 | 0 |
98 | C | 1 | 0 |
99 | S | 0 | 0 |
100 | S | 0 | 0 |
One-hot encoding can be expensive when the column in question has a large number of categories. In that case, mean encoding is a better fit: it replaces each category with the average of a response column computed over that category. This makes it an efficient way to encode the data, but be careful about over-fitting.
Let’s use a mean encoding on the ‘home.dest’ variable.
[9]:
vdf["home.dest"].mean_encode("survived")
display(vdf["home.dest"])
The mean encoding was successfully done.
123 home.destFloat(22) | |
1 | 1.0 |
2 | 0.0 |
3 | 0.0 |
4 | 0.0 |
5 | 0.0 |
6 | 0.0 |
7 | 0.0 |
8 | 0.5 |
9 | 0.5 |
10 | 0.0 |
11 | 0.0 |
12 | 0.0 |
13 | 0.0 |
14 | 0.0 |
15 | 0.0 |
16 | 1.0 |
17 | 1.0 |
18 | 1.0 |
19 | 0.5 |
20 | 0.5 |
21 | 1.0 |
22 | 0.0 |
23 | 0.0 |
24 | 0.75 |
25 | 0.75 |
26 | 0.75 |
27 | 0.75 |
28 | 0.0 |
29 | 0.0 |
30 | 0.0 |
31 | 0.0 |
32 | 0.25 |
33 | 0.25 |
34 | 0.25 |
35 | 0.25 |
36 | 0.75 |
37 | 0.75 |
38 | 0.75 |
39 | 0.75 |
40 | 1.0 |
41 | 1.0 |
42 | 0.0 |
43 | 0.0 |
44 | 0.0 |
45 | 0.0 |
46 | 0.0 |
47 | 1.0 |
48 | 0.0 |
49 | 0.0 |
50 | 0.5 |
51 | 0.5 |
52 | 0.0 |
53 | 1.0 |
54 | 1.0 |
55 | 0.0 |
56 | 0.0 |
57 | 0.0 |
58 | 0.0 |
59 | 0.0 |
60 | 1.0 |
61 | 0.0 |
62 | 0.0 |
63 | 0.666666666666667 |
64 | 0.666666666666667 |
65 | 0.666666666666667 |
66 | 0.0 |
67 | 0.357142857142857 |
68 | 0.357142857142857 |
69 | 0.357142857142857 |
70 | 0.357142857142857 |
71 | 0.357142857142857 |
72 | 0.357142857142857 |
73 | 0.357142857142857 |
74 | 0.357142857142857 |
75 | 0.357142857142857 |
76 | 0.357142857142857 |
77 | 0.357142857142857 |
78 | 0.357142857142857 |
79 | 0.357142857142857 |
80 | 0.357142857142857 |
81 | 0.0 |
82 | 1.0 |
83 | 0.5 |
84 | 0.5 |
85 | 0.5 |
86 | 0.5 |
87 | 0.5 |
88 | 0.5 |
89 | 1.0 |
90 | 0.0 |
91 | 0.0 |
92 | 0.0 |
93 | 0.0 |
94 | 0.0 |
95 | 0.0 |
96 | 0.0 |
97 | 0.0 |
98 | 0.0 |
99 | 0.5 |
100 | 0.5 |
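Conceptually, the transformation above replaces each 'home.dest' value with the average of 'survived' over that destination. One common way to limit the over-fitting mentioned earlier, sketched here in plain Python (this is an illustration, not what VerticaPy's 'mean_encode' does), is to smooth each category mean toward the global mean:

```python
def smoothed_mean_encode(categories, response, weight=10):
    """Blend each category's mean with the global mean; 'weight' acts like a
    pseudo-count of global-mean observations (purely illustrative)."""
    global_mean = sum(response) / len(response)
    stats = {}  # category -> (count, sum of response)
    for c, r in zip(categories, response):
        n, s = stats.get(c, (0, 0.0))
        stats[c] = (n + 1, s + r)
    return [
        (stats[c][1] + weight * global_mean) / (stats[c][0] + weight)
        for c in categories
    ]

dests = ["NY", "NY", "Belfast", "NY", "Belfast"]
survived = [1, 0, 0, 1, 1]
print(smoothed_mean_encode(dests, survived, weight=2))
```

Rare categories are pulled toward the global mean, so a destination seen only once no longer gets an extreme 0.0 or 1.0 encoding.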
VerticaPy offers many encoding techniques. For example, the 'case_when' and 'decode' methods let the user define a customized encoding for a column, and the 'discretize' method reduces the number of categories in a column. It's important to be familiar with all the available techniques so you can make informed decisions about which to use for a given dataset.
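For intuition, such a user-defined encoding boils down to a category-to-value mapping; a pure-Python sketch (the function name and signature here are illustrative, not part of the VerticaPy API):

```python
def decode_like(values, mapping, default=None):
    """Map each category through a user-defined dictionary (the idea behind a
    user-defined encoding); unknown categories fall back to 'default'."""
    return [mapping.get(v, default) for v in values]

print(decode_like(["male", "female", "other"], {"male": 0, "female": 1}, default=-1))
# [0, 1, -1]
```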