vDataFrame[].discretize

In [ ]:
vDataFrame[].discretize(method: str = "auto",
                        h: float = 0,
                        bins: int = -1,
                        k: int = 6,
                        new_category: str = "Others",
                        RFmodel_params: dict = {},
                        response: str = "",
                        return_enum_trans: bool = False)

Discretizes the vcolumn using the input method.

Parameters

Name Type Optional Description
method
str
The method to use to discretize the vcolumn.
  • auto : Uses method 'same_width' for numerical vcolumns and casts the other types to varchar.
  • same_freq : Computes bins with the same number of elements.
  • same_width : Computes regular width bins.
  • smart : Uses a Random Forest on a response column to find the most relevant intervals for the discretization.
  • topk : Keeps the k most frequent categories and merges the others into a single category.
h
float
The interval size used to discretize the vcolumn. If this parameter is equal to 0, an optimised interval will be computed.
bins
int
Number of bins used for the discretization (must be > 1).
k
int
The integer k of the 'topk' method.
new_category
str
The name of the merging category when using the 'topk' method.
RFmodel_params
dict
Dictionary of the Random Forest model parameters used to compute the best splits when 'method' is set to 'smart'. An RF Regressor is trained if the response is numerical (except ints and bools); otherwise an RF Classifier is trained. Example: use {"n_estimators": 20, "max_depth": 10} to train a Random Forest with 20 trees and a maximum depth of 10.
response
str
Response vcolumn when using the 'smart' method.
return_enum_trans
bool
Returns the transformation instead of the parent vDataFrame and does not apply it. This is useful for testing, because it lets you inspect the final transformation.

Returns

vDataFrame : self.parent

Example

In [14]:
from verticapy.datasets import load_titanic
titanic = load_titanic()
display(titanic["age"])
titanic["age"].hist()
     age
     Numeric(6,3)
1    2.0
2    30.0
3    25.0
4    39.0
5    71.0
6    47.0
7    [null]
8    24.0
9    36.0
10   25.0
11   45.0
12   42.0
13   41.0
14   48.0
15   [null]
16   45.0
17   [null]
18   33.0
19   28.0
20   17.0
21   49.0
22   36.0
23   46.0
24   [null]
25   27.0
26   [null]
27   47.0
28   37.0
29   [null]
30   70.0
31   39.0
32   31.0
33   50.0
34   39.0
35   36.0
36   [null]
37   30.0
38   19.0
39   64.0
40   [null]
41   [null]
42   37.0
43   47.0
44   24.0
45   71.0
46   38.0
47   46.0
48   [null]
49   45.0
50   40.0
51   55.0
52   42.0
53   [null]
54   55.0
55   42.0
56   [null]
57   50.0
58   46.0
59   50.0
60   32.5
61   58.0
62   41.0
63   [null]
64   [null]
65   29.0
66   30.0
67   30.0
68   19.0
69   46.0
70   54.0
71   28.0
72   65.0
73   44.0
74   55.0
75   47.0
76   37.0
77   58.0
78   64.0
79   65.0
80   28.5
81   [null]
82   45.5
83   23.0
84   29.0
85   18.0
86   47.0
87   38.0
88   22.0
89   [null]
90   31.0
91   [null]
92   36.0
93   55.0
94   33.0
95   61.0
96   50.0
97   56.0
98   56.0
99   24.0
100  [null]
Rows: 1-100 of 1234 | Column: age | Type: Numeric(6,3)
In [45]:
# Discretizing using the same bar width
titanic["age"].discretize(method = "same_width", h = 10)
display(titanic["age"])
titanic["age"].hist()
     age
     Varchar
1    [0;10]
2    [30;40]
3    [20;30]
4    [30;40]
5    [70;80]
6    [40;50]
7    [null]
8    [20;30]
9    [30;40]
10   [20;30]
11   [40;50]
12   [40;50]
13   [40;50]
14   [40;50]
15   [null]
16   [40;50]
17   [null]
18   [30;40]
19   [20;30]
20   [10;20]
21   [40;50]
22   [30;40]
23   [40;50]
24   [null]
25   [20;30]
26   [null]
27   [40;50]
28   [30;40]
29   [null]
30   [70;80]
31   [30;40]
32   [30;40]
33   [50;60]
34   [30;40]
35   [30;40]
36   [null]
37   [30;40]
38   [10;20]
39   [60;70]
40   [null]
41   [null]
42   [30;40]
43   [40;50]
44   [20;30]
45   [70;80]
46   [30;40]
47   [40;50]
48   [null]
49   [40;50]
50   [40;50]
51   [50;60]
52   [40;50]
53   [null]
54   [50;60]
55   [40;50]
56   [null]
57   [50;60]
58   [40;50]
59   [50;60]
60   [30;40]
61   [50;60]
62   [40;50]
63   [null]
64   [null]
65   [20;30]
66   [30;40]
67   [30;40]
68   [10;20]
69   [40;50]
70   [50;60]
71   [20;30]
72   [60;70]
73   [40;50]
74   [50;60]
75   [40;50]
76   [30;40]
77   [50;60]
78   [60;70]
79   [60;70]
80   [20;30]
81   [null]
82   [40;50]
83   [20;30]
84   [20;30]
85   [10;20]
86   [40;50]
87   [30;40]
88   [20;30]
89   [null]
90   [30;40]
91   [null]
92   [30;40]
93   [50;60]
94   [30;40]
95   [60;70]
96   [50;60]
97   [50;60]
98   [50;60]
99   [20;30]
100  [null]
Rows: 1-100 of 1234 | Column: age | Type: varchar
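The interval labels in the output above can be mimicked in plain Python. This is a minimal sketch of fixed-width binning, assuming each value x falls in the interval that starts at floor(x / h) * h. The helper name same_width_label is hypothetical and not part of VerticaPy, and the transformation VerticaPy generates in SQL may differ in detail.

```python
import math

def same_width_label(x, h):
    """Return the fixed-width bin label for x, or None for a missing value."""
    if x is None:
        return None
    # Lower bound of the bin: the largest multiple of h that is <= x
    lower = math.floor(x / h) * h
    return f"[{lower};{lower + h}]"

# A few ages taken from the first example, plus one missing value
ages = [2.0, 30.0, 25.0, 39.0, 71.0, 47.0, None, 32.5]
labels = [same_width_label(a, 10) for a in ages]
print(labels)
```

With h = 10, the age 30.0 is labelled [30;40] rather than [20;30], which is consistent with the output above.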
In [47]:
# Discretizing using the same frequency per bin
titanic["age"].discretize(method = "same_freq", bins = 5)
display(titanic["age"])
titanic["age"].hist()
     age
     Varchar
1    [0.330;21.000]
2    [28.000;39.000]
3    [21.000;28.000]
4    [28.000;39.000]
5    [39.000;80.000]
6    [39.000;80.000]
7    [null]
8    [21.000;28.000]
9    [28.000;39.000]
10   [21.000;28.000]
11   [39.000;80.000]
12   [39.000;80.000]
13   [39.000;80.000]
14   [39.000;80.000]
15   [null]
16   [39.000;80.000]
17   [null]
18   [28.000;39.000]
19   [21.000;28.000]
20   [0.330;21.000]
21   [39.000;80.000]
22   [28.000;39.000]
23   [39.000;80.000]
24   [null]
25   [21.000;28.000]
26   [null]
27   [39.000;80.000]
28   [28.000;39.000]
29   [null]
30   [39.000;80.000]
31   [28.000;39.000]
32   [28.000;39.000]
33   [39.000;80.000]
34   [28.000;39.000]
35   [28.000;39.000]
36   [null]
37   [28.000;39.000]
38   [0.330;21.000]
39   [39.000;80.000]
40   [null]
41   [null]
42   [28.000;39.000]
43   [39.000;80.000]
44   [21.000;28.000]
45   [39.000;80.000]
46   [28.000;39.000]
47   [39.000;80.000]
48   [null]
49   [39.000;80.000]
50   [39.000;80.000]
51   [39.000;80.000]
52   [39.000;80.000]
53   [null]
54   [39.000;80.000]
55   [39.000;80.000]
56   [null]
57   [39.000;80.000]
58   [39.000;80.000]
59   [39.000;80.000]
60   [28.000;39.000]
61   [39.000;80.000]
62   [39.000;80.000]
63   [null]
64   [null]
65   [28.000;39.000]
66   [28.000;39.000]
67   [28.000;39.000]
68   [0.330;21.000]
69   [39.000;80.000]
70   [39.000;80.000]
71   [21.000;28.000]
72   [39.000;80.000]
73   [39.000;80.000]
74   [39.000;80.000]
75   [39.000;80.000]
76   [28.000;39.000]
77   [39.000;80.000]
78   [39.000;80.000]
79   [39.000;80.000]
80   [28.000;39.000]
81   [null]
82   [39.000;80.000]
83   [21.000;28.000]
84   [28.000;39.000]
85   [0.330;21.000]
86   [39.000;80.000]
87   [28.000;39.000]
88   [21.000;28.000]
89   [null]
90   [28.000;39.000]
91   [null]
92   [28.000;39.000]
93   [39.000;80.000]
94   [28.000;39.000]
95   [39.000;80.000]
96   [39.000;80.000]
97   [39.000;80.000]
98   [39.000;80.000]
99   [21.000;28.000]
100  [null]
Rows: 1-100 of 1234 | Column: age | Type: varchar
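To illustrate the idea behind 'same_freq', the sketch below derives cut points from quantiles so that each bin holds roughly the same number of elements. It is an illustration only and does not claim to reproduce the bounds VerticaPy computes in the database. The helper names same_freq_edges and same_freq_label are hypothetical, and the sample values are made up. A value equal to a cut point is assigned to the lower interval, which appears to match the output above (28.0 is labelled [21.000;28.000]).

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

def same_freq_edges(values, bins):
    """Return bins + 1 edges: the minimum, the interior quantiles, the maximum."""
    clean = sorted(v for v in values if v is not None)
    cuts = quantiles(clean, n=bins, method="inclusive")
    return [clean[0]] + cuts + [clean[-1]]

def same_freq_label(x, edges):
    """Label x with its interval; a value on a cut point goes to the lower bin."""
    if x is None:
        return None
    # First edge (ignoring the minimum) that is >= x closes the interval
    i = bisect_left(edges, x, 1, len(edges) - 1)
    return f"[{edges[i - 1]:.3f};{edges[i]:.3f}]"

values = list(range(1, 13))            # made-up sample: 1, 2, ..., 12
edges = same_freq_edges(values, 3)
labels = [same_freq_label(v, edges) for v in values]
counts = Counter(labels)
print(counts)                          # each of the 3 bins holds 4 values
```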
In [15]:
# Discretizing using a response column distribution
# During the process, a Random Forest will be created
titanic["age"].discretize(method = "smart", 
                          response = "survived", 
                          bins = 6, 
                          RFmodel_params = {"n_estimators": 20})
display(titanic["age"].topk())
titanic["age"].hist()
# Each bin will represent a Random Forest split
titanic["age"].hist(method = "avg", of = "survived")
                         count    percent
[15.268125;60.082500]    859      86.158
[0.33;5.309375]          53       5.316
[7.799062;15.268125]     43       4.313
[60.082500;72.530937]    30       3.009
[5.309375;7.799062]      9        0.903
[72.530937;80]           3        0.301
Rows: 1-6 | Columns: 3
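The 'smart' method relies on the thresholds a Random Forest learns from the response column. As a much simpler stand-in, the sketch below searches for a single threshold on one numeric column that best separates a binary response, using the weighted Gini impurity. It is not VerticaPy's implementation and it trains no Random Forest; the function names gini and best_split are hypothetical and the toy data is invented for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Return the threshold on xs that minimises the weighted Gini impurity of ys."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold

# Invented toy data: the youngest passengers survive, the others do not
ages = [2, 4, 5, 30, 35, 40, 60, 70]
survived = [1, 1, 1, 0, 0, 0, 0, 0]
threshold = best_split(ages, survived)
print(threshold)
```

Applying such a search recursively, or collecting the thresholds of many trees, yields several cut points; each pair of consecutive cut points then defines one bin.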
In [51]:
# Extracting the passenger Title from the name
titanic["name"].str_extract(r' ([A-Za-z])+\.')
titanic["name"].hist()
# Discretizing using the 5 most frequent categories;
# the others will be merged together to create the 'rare' category
titanic["name"].discretize(method = "topk", k = 5, new_category = "rare")
display(titanic["name"])
titanic["name"].hist()
     name
     Varchar
1    Miss.
2    Mr.
3    Mrs.
4    Mr.
5    Mr.
6    rare
7    Mr.
8    Mr.
9    Mr.
10   Mr.
11   Mr.
12   Mr.
13   Mr.
14   Mr.
15   Dr.
16   rare
17   Mr.
18   Mr.
19   Mr.
20   Mr.
21   Mr.
22   Mr.
23   Mr.
24   Mr.
25   Mr.
26   Mr.
27   Mr.
28   Mr.
29   Mr.
30   rare
31   Mr.
32   Mr.
33   Mr.
34   Mr.
35   Miss.
36   Mr.
37   Mr.
38   Mr.
39   Mr.
40   Mr.
41   Mr.
42   Mr.
43   Mr.
44   Mr.
45   Mr.
46   Mr.
47   Mr.
48   Mr.
49   Mr.
50   Mr.
51   Mr.
52   Mr.
53   Mr.
54   Mr.
55   Mr.
56   Mr.
57   Miss.
58   Mr.
59   Mr.
60   Mr.
61   Mr.
62   Mr.
63   Mr.
64   Mr.
65   Mr.
66   Mr.
67   Mr.
68   Mr.
69   Mr.
70   Mr.
71   Mr.
72   Mr.
73   Dr.
74   Mr.
75   Mr.
76   Mr.
77   Mr.
78   Mr.
79   Mr.
80   Mr.
81   Mr.
82   Mr.
83   Mr.
84   Mr.
85   Mr.
86   Mr.
87   rare
88   Mr.
89   Mr.
90   Mr.
91   Mr.
92   Mr.
93   Mr.
94   Mr.
95   Mr.
96   Mr.
97   Mr.
98   Mr.
99   Mr.
100  Mr.
Rows: 1-100 of 1234 | Column: name | Type: varchar
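The 'topk' logic can be sketched in plain Python as follows: count the categories, keep the k most frequent, and replace every other value with new_category. The function topk_discretize is a hypothetical helper, not a VerticaPy call, and the list of titles is made up for illustration.

```python
from collections import Counter

def topk_discretize(values, k, new_category="Others"):
    """Keep the k most frequent categories and merge the rest into new_category."""
    counts = Counter(v for v in values if v is not None)
    keep = {category for category, _ in counts.most_common(k)}
    # Missing values stay missing; rare categories are merged
    return [v if v is None or v in keep else new_category for v in values]

# Made-up titles: 5 x "Mr.", 3 x "Miss.", 2 x "Mrs.", 1 x "Dr.", 1 x "Col."
titles = ["Mr."] * 5 + ["Miss."] * 3 + ["Mrs."] * 2 + ["Dr.", "Col."]
result = topk_discretize(titles, k=3, new_category="rare")
print(Counter(result))
```

Here "Dr." and "Col." each appear once, so with k = 3 both are merged into the 'rare' category.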

See Also

vDataFrame[].decode Encodes the vcolumn using a user-defined encoding.
vDataFrame[].label_encode Encodes the vcolumn using the Label Encoding.
vDataFrame[].get_dummies Encodes the vcolumn using the One Hot Encoding.
vDataFrame[].mean_encode Encodes the vcolumn using the Mean Encoding of a response.