VerticaPy

Python API for Vertica Data Science at Scale

Encoding

Encoding features is an important part of the data science life cycle. In data science, generality matters: a feature with too many categories can compromise that generality and lead to incorrect results. In addition, some algorithms work better with categorized information, and others can't process non-numerical features at all.

There are many encoding techniques:

  • User-Defined Encoding : The most flexible encoding. The user chooses how to encode each category (a sketch using 'decode' appears at the end of this section).
  • Label Encoding : Each category is mapped to an integer using a bijection to [0; n-1], where n is the number of unique values of the feature.
  • One-hot Encoding : This technique creates a dummy column (values in {0, 1}) for each category, splitting the feature into n binary features.
  • Mean Encoding : This technique replaces each category with the average of a specific response column over that category.
  • Discretization : This technique uses various mathematical techniques to encode continuous features into categories.

To demonstrate encoding data in VerticaPy, we'll use the well-known 'Titanic' dataset.

In [35]:
from verticapy.datasets import load_titanic
vdf = load_titanic()
display(vdf)
[Output: the Titanic vDataFrame with columns pclass (Int), survived (Int), name (Varchar(164)), sex (Varchar(20)), age (Numeric(6,3)), sibsp (Int), parch (Int), ticket (Varchar(36)), fare (Numeric(10,5)), cabin (Varchar(30)), embarked (Varchar(20)), boat (Varchar(100)), body (Int), home.dest (Varchar(100))]
Rows: 1-100 | Columns: 14

Let's look at the 'age' of the passengers.

In [36]:
vdf["age"].hist()
Out[36]:
<AxesSubplot:xlabel='"age"', ylabel='Density'>

By using the 'discretize' method, we can discretize the data using equal-width binning.

In [38]:
vdf["age"].discretize(method = "same_width", h = 10)
vdf["age"].hist(max_cardinality = 10)
Out[38]:
<AxesSubplot:xlabel='"age"', ylabel='Density'>

We can also discretize the data using frequency bins.

In [39]:
vdf = load_titanic()
vdf["age"].discretize(method = "same_freq", nbins = 5)
vdf["age"].hist(max_cardinality = 5)
Out[39]:
<AxesSubplot:xlabel='"age"', ylabel='Density'>

Computing categories using a response column can also be a good solution.

In [40]:
vdf = load_titanic()
vdf["age"].discretize(method = "smart", response = "survived", nbins = 6)
vdf["age"].hist(method = "avg", of = "survived")
Out[40]:
<AxesSubplot:xlabel='"age"', ylabel='avg'>
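
The 'smart' method trains a Random Forest on the response column behind the scenes. If you need to control that model, the 'RFmodel_params' parameter (documented in the help output below) accepts its hyperparameters. A minimal sketch, reusing the example values from the documentation:

vdf = load_titanic()
# Same 'smart' discretization as above, but with an explicitly
# parameterized Random Forest (20 trees, maximum depth of 10).
vdf["age"].discretize(
    method = "smart",
    response = "survived",
    nbins = 6,
    RFmodel_params = {"n_estimators": 20, "max_depth": 10},
)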

We can view the available techniques in the 'discretize' method with Python's 'help' function.

In [42]:
help(vdf["age"].discretize)
Help on method discretize in module verticapy.vcolumn:

discretize(method:str='auto', h:float=0, nbins:int=-1, k:int=6, new_category:str='Others', RFmodel_params:dict={}, response:str='', return_enum_trans:bool=False) method of verticapy.vcolumn.vColumn instance
    ---------------------------------------------------------------------------
    Discretizes the vColumn using the input method.

    Parameters
    ----------
    method: str, optional
        The method to use to discretize the vColumn.
            auto       : Uses method 'same_width' for numerical vColumns and
                casts the other types to varchar.
            same_freq  : Computes bins with the same number of elements.
            same_width : Computes regular width bins.
            smart      : Uses a Random Forest on a response column to find the
                most relevant intervals to use for the discretization.
            topk       : Keeps the top-k most frequent categories and merges
                the others into one unique category.
    h: float, optional
        The interval size used to convert the vColumn. If this parameter
        is equal to 0, an optimized interval is computed.
    nbins: int, optional
        Number of bins used for the discretization (must be > 1).
    k: int, optional
        The integer k of the 'topk' method.
    new_category: str, optional
        The name of the merging category when using the 'topk' method.
    RFmodel_params: dict, optional
        Dictionary of the Random Forest model parameters used to compute the
        best splits when 'method' is set to 'smart'. A RF Regressor will be
        trained if the response is numerical (except ints and bools), a RF
        Classifier otherwise.
        Example: write {"n_estimators": 20, "max_depth": 10} to train a Random
        Forest with 20 trees and a maximum depth of 10.
    response: str, optional
        Response vColumn when method is set to 'smart'.
    return_enum_trans: bool, optional
        Returns the transformation instead of the vDataFrame parent and does
        not apply it. This parameter is very useful for testing: it lets you
        inspect the final transformation.

    Returns
    -------
    vDataFrame
        self.parent

    See Also
    --------
    vDataFrame[].decode       : Encodes the vColumn with user-defined encoding.
    vDataFrame[].get_dummies  : Encodes the vColumn with one-hot encoding.
    vDataFrame[].label_encode : Encodes the vColumn with label encoding.
    vDataFrame[].mean_encode  : Encodes the vColumn using the mean encoding of a response.
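
The 'topk' method is the one discretization technique this section doesn't demonstrate. As a quick sketch (parameter names taken from the help output above), it can shrink a high-cardinality column such as 'cabin':

vdf = load_titanic()
# Keep the 6 most frequent cabin values and merge everything else
# into the 'Others' category.
vdf["cabin"].discretize(method = "topk", k = 6, new_category = "Others")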

To encode a categorical feature, we can use label encoding. For example, the column 'sex' has two categories (male and female) that we can represent with 0 and 1, respectively.

In [43]:
vdf["sex"].label_encode()
display(vdf["sex"])
[Output: vColumn "sex" encoded as Integer: 'female' → 0, 'male' → 1]
Rows: 1-100 of 1234 | Column: sex | Type: Integer

When a feature has only a few categories, one-hot encoding is usually the most suitable choice: label encoding converts a categorical feature into integers whose ordering and spacing carry no real meaning, whereas one-hot encoding avoids imposing such an artificial structure. Let's use one-hot encoding on the 'embarked' column.

In [44]:
vdf["embarked"].one_hot_encode()
vdf.select(["embarked", "embarked_C", "embarked_Q"])
Out[44]:
[Output: columns embarked (Varchar(20)), embarked_C (Integer), embarked_Q (Integer): 'S' → (0, 0), 'C' → (1, 0), 'Q' → (0, 1)]
Rows: 1-100 | Columns: 3

One-hot encoding can be expensive if the column in question has a large number of categories. In that case, we should use mean encoding. Mean encoding replaces each category of a variable with the average of a response column computed over that category. This makes it an efficient way to encode the data, but be careful of overfitting.

Let's use a mean encoding on the 'home.dest' variable.

In [45]:
vdf["home.dest"].mean_encode("survived")
display(vdf["home.dest"])
The mean encoding was successfully done.
[Output: vColumn "home.dest" converted to Float: each destination replaced by the average of 'survived' for that category (e.g. 0.0, 0.5, 0.666666666666667, 0.75, 1.0)]
Rows: 1-100 of 1234 | Column: home.dest | Type: Float
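
Under the hood, this encoding is equivalent to a windowed average of the response over each category. The following sketch assumes the vDataFrame 'eval' method accepts a raw SQL expression (an assumption; adapt it to your VerticaPy version):

vdf = load_titanic()
# New column holding the average of 'survived' per 'home.dest' category:
# the same values that mean_encode writes in place.
vdf.eval("home_dest_mean", 'AVG(survived) OVER (PARTITION BY "home.dest")')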

VerticaPy offers many encoding techniques. For example, the 'case_when' and 'decode' methods let the user apply a customized encoding to a column (sketched below), and the 'discretize' method reduces the number of categories in a column. It's important to get familiar with all the available techniques so you can make informed decisions about which one to use for a given dataset.
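
As a closing example, here is a minimal sketch of user-defined encoding with 'decode'. It assumes 'decode' follows the semantics of SQL DECODE (alternating original/replacement values, with an optional trailing default); check the documentation for your VerticaPy version.

vdf = load_titanic()
# Map each port of embarkation to a readable label; any unmatched
# value falls back to the default 'Other'.
vdf["embarked"].decode("S", "Southampton", "C", "Cherbourg", "Q", "Queenstown", "Other")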