Encoding

Encoding features is an important part of the data science life cycle. Generality matters in data science, and having too many categories can compromise it and lead to unreliable results. In addition, some algorithms are optimized for categorical inputs, and some can't process non-numerical features at all.

There are many encoding techniques; the sketch after this list illustrates the main ones:

  • User-Defined Encoding: The most flexible encoding; the user chooses how to encode each category.

  • Label Encoding: Each category is converted to an integer using a bijection to [0; n-1], where n is the number of unique values in the feature.

  • One-hot Encoding: This technique creates a dummy variable (values in {0,1}) for each category, expanding the feature into n separate features.

  • Mean Encoding: This technique replaces each category with the average of a specific response column over that category.

  • Discretization: This technique uses various mathematical techniques to encode continuous features into categories.
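
Before turning to VerticaPy, here is a toy plain-Python sketch of what label, one-hot, and mean encoding each produce on a small sample (the 'sex' and 'survived' values below are made up for illustration):

# Toy illustration of three encoding techniques; plain Python,
# for intuition only. The VerticaPy equivalents are demonstrated
# on the titanic dataset below.
sex = ["male", "female", "female", "male"]
survived = [0, 1, 1, 1]

# Label encoding: map each category to an integer in [0; n-1].
labels = {c: i for i, c in enumerate(sorted(set(sex)))}
print([labels[c] for c in sex])  # [1, 0, 0, 1]

# One-hot encoding: one 0/1 dummy per category.
print([[int(c == cat) for cat in sorted(set(sex))] for c in sex])
# [[0, 1], [1, 0], [1, 0], [0, 1]]

# Mean encoding: replace each category with the average of the
# response ('survived') over that category.
means = {c: sum(s for x, s in zip(sex, survived) if x == c)
            / sex.count(c) for c in set(sex)}
print([means[c] for c in sex])  # [0.5, 1.0, 1.0, 0.5]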

To demonstrate encoding data in VerticaPy, we’ll use the well-known ‘Titanic’ dataset.

[1]:
from verticapy.datasets import load_titanic
import verticapy as vp

vp.set_option("plotting_lib","highcharts")

vdf = load_titanic()
display(vdf)
[Output abbreviated: the first 100 rows of the titanic vDataFrame]
Columns: pclass (Integer), survived (Integer), name (Varchar(164)), sex (Varchar(20)), age (Numeric(8)), sibsp (Integer), parch (Integer), ticket (Varchar(36)), fare (Numeric(12)), cabin (Varchar(30)), embarked (Varchar(20)), boat (Varchar(100)), body (Integer), home.dest (Varchar(100))
Rows: 1-100 | Columns: 14

Let’s look at the ‘age’ of the passengers.

[2]:
vdf["age"].hist()
[2]:

Using the ‘discretize’ method, we can bin the data into intervals of equal width.

[3]:
vdf["age"].discretize(method = "same_width", h = 10)
vdf["age"].bar(max_cardinality = 10)
[3]:

We can also discretize the data using equal-frequency bins.

[4]:
vdf = load_titanic()
vdf["age"].discretize(method = "same_freq", nbins = 5)
vdf["age"].bar(max_cardinality = 5)
[4]:
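
The ‘discretize’ method also supports a ‘topk’ mode (described in the help output shown later) that keeps the k most frequent categories of a column and merges the rest into a single one. A minimal sketch on the categorical ‘home.dest’ column:

vdf = load_titanic()

# Keep the 5 most frequent destinations; every other value is
# merged into the 'Others' category.
vdf["home.dest"].discretize(method = "topk", k = 5, new_category = "Others")
vdf["home.dest"].bar(max_cardinality = 6)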

Computing categories based on a response column (the ‘smart’ method) can also be a good solution.

[5]:
vdf = load_titanic()
vdf["age"].discretize(method = "smart", response = "survived", nbins = 6)
vdf["age"].bar(method = "avg", of = "survived")
[5]:

We can view the available techniques of the ‘discretize’ method with the built-in ‘help’ function.

[6]:
help(vdf["age"].discretize)
Help on method discretize in module verticapy.core.vdataframe._encoding:

discretize(method: Literal['auto', 'smart', 'same_width', 'same_freq', 'topk'] = 'auto', h: Annotated[Union[int, float, decimal.Decimal], 'Python Numbers'] = 0, nbins: int = -1, k: int = 6, new_category: str = 'Others', RFmodel_params: Optional[dict] = None, response: Optional[str] = None, return_enum_trans: bool = False) -> 'vDataFrame' method of verticapy.core.vdataframe.base.vDataColumn instance
    Discretizes the vDataColumn using the input method.

    Parameters
    ----------
    method: str, optional
        The method used to discretize the vDataColumn.
            auto       : Uses method 'same_width' for numerical
                         vDataColumns, casts the other types to
                         varchar.
            same_freq  : Computes bins  with the same number of
                         elements.
            same_width : Computes regular width bins.
            smart      : Uses  the Random  Forest on a  response
                         column  to   find   the  most  relevant
                         interval to use for the discretization.
            topk       : Keeps the topk most frequent categories
                         and  merge the  other  into one  unique
                         category.
    h: PythonNumber, optional
        The  interval  size  used  to  convert  the vDataColumn.
        If this parameter is equal to 0, an optimised interval is
        computed.
    nbins: int, optional
        Number of bins  used for the discretization  (must be > 1)
    k: int, optional
        The integer k of the 'topk' method.
    new_category: str, optional
        The  name of the  merging  category when using the  'topk'
        method.
    RFmodel_params: dict, optional
        Dictionary  of the  Random Forest  model  parameters used  to
        compute the best splits when 'method' is set to 'smart'.
        A RF Regressor is  trained if  the response is numerical
        (except ints and bools), a RF Classifier otherwise.
        Example: Write {"n_estimators": 20, "max_depth": 10} to train
        a Random Forest with 20 trees and a maximum depth of 10.
    response: str, optional
        Response vDataColumn when method is set to 'smart'.
    return_enum_trans: bool, optional
        Returns  the transformation instead of the vDataFrame parent,
        and does not apply the transformation. This parameter is
        useful for testing the look of the final transformation.

    Returns
    -------
    vDataFrame
        self._parent
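
As the signature shows, setting return_enum_trans to True returns the transformation instead of applying it, which is handy for previewing an encoding before committing to it. A minimal sketch on the untouched ‘fare’ column:

# Preview the equal-width binning transformation for 'fare'
# without modifying the vDataFrame.
vdf["fare"].discretize(method = "same_width", h = 50, return_enum_trans = True)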

To encode a categorical feature, we can use label encoding. For example, the column ‘sex’ has two categories (male and female) that we can represent with 0 and 1, respectively.

[7]:
vdf["sex"].label_encode()
display(vdf["sex"])
 #    sex (Integer)
 1    0
 2    1
 3    0
 4    1
 5    1
 ...  ...
Rows: 1-100 of 1234 | Column: sex | Type: Integer

When a feature has only a few categories, one-hot encoding is usually the most suitable choice: label encoding converts a categorical feature to a numerical one without preserving meaningful mathematical relationships between the resulting integers. Let's use one-hot encoding on the ‘embarked’ column.

[8]:
vdf["embarked"].one_hot_encode()
vdf.select(["embarked", "embarked_C", "embarked_Q"])
[8]:
 #    embarked (Varchar)    embarked_C (Integer)    embarked_Q (Integer)
 1    S                     0                       0
 2    S                     0                       0
 3    S                     0                       0
 4    S                     0                       0
 5    C                     1                       0
 ...
 73   Q                     0                       1
 ...
Rows: 1-100 | Columns: 3
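
Only the ‘embarked_C’ and ‘embarked_Q’ dummies are selected above: since n-1 dummies fully determine a feature with n categories, one of them can be dropped without losing information. If a column for every category is needed, a minimal sketch, assuming the drop_first parameter of the current VerticaPy signature:

vdf = load_titanic()

# Create a dummy for every category of 'embarked'
# (embarked_C, embarked_Q, and embarked_S).
vdf["embarked"].one_hot_encode(drop_first = False)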

One-hot encoding can be expensive if the column in question has a large number of categories. In that case, we should use mean encoding. Mean encoding replaces each category of a variable with the average of the response column over that category. This makes it an efficient way to encode the data, but be careful of over-fitting.

Let’s use a mean encoding on the ‘home.dest’ variable.

[9]:
vdf["home.dest"].mean_encode("survived")
display(vdf["home.dest"])
The mean encoding was successfully done.
 #    home.dest (Float)
 1    1.0
 2    0.0
 3    0.0
 4    0.0
 5    0.0
 ...
 63   0.666666666666667
 ...
Rows: 1-100 of 1234 | Column: home.dest | Type: Float(22)

VerticaPy offers many encoding techniques. For example, the ‘case_when’ and ‘decode’ methods allow the user to apply a customized encoding to a column (a sketch with ‘decode’ follows), and the ‘discretize’ method reduces the number of categories in a column. It's important to get familiar with all of the available techniques so you can make informed decisions about which one to use for a given dataset.
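
As a final illustration, here is a minimal sketch of a user-defined encoding with ‘decode’, which takes DECODE-style pairs of original value and replacement, with an optional default as the last argument; the spelled-out port names are our own choice for this example:

vdf = load_titanic()

# Map each port abbreviation to a full name; unmatched values
# fall back to the default 'Unknown'.
vdf["embarked"].decode("S", "Southampton",
                       "C", "Cherbourg",
                       "Q", "Queenstown",
                       "Unknown")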