vDataFrame[].discretize
In [ ]:
vDataFrame[].discretize(method: str = "auto",
h: float = 0,
nbins: int = -1,
k: int = 6,
new_category: str = "Others",
RFmodel_params: dict = {},
response: str = "",
return_enum_trans: bool = False)
Discretizes the vcolumn using the input method.
Parameters
| Name | Type | Optional | Description |
|---|---|---|---|
| method | str | ✓ | The method to use to discretize the vcolumn. The default is 'auto'; the examples on this page use 'same_width', 'same_freq', 'smart', and 'topk'. |
| h | float | ✓ | The interval size used to discretize the vcolumn. If this parameter is equal to 0, an optimised interval will be computed. |
| nbins | int | ✓ | Number of bins used for the discretization (must be > 1). |
| k | int | ✓ | The integer k of the 'topk' method, that is, the number of most frequent categories to keep. |
| new_category | str | ✓ | The name of the merging category when using the 'topk' method. |
| RFmodel_params | dict | ✓ | Dictionary of the Random Forest model parameters used to compute the best splits when 'method' is set to 'smart'. A RF Regressor will be trained if the response is numerical (except ints and bools), a RF Classifier otherwise. Example: {"n_estimators": 20, "max_depth": 10} trains a Random Forest with 20 trees and a maximum depth of 10. |
| response | str | ✓ | Response vcolumn when using the 'smart' method. |
| return_enum_trans | bool | ✓ | Returns the transformation instead of the parent vDataFrame and does not apply it. This is useful for testing, as it lets you inspect the final transformation. |
In [14]:
from verticapy.datasets import load_titanic
titanic = load_titanic()
display(titanic["age"])
titanic["age"].hist()
In [45]:
# Discretizing using the same bar width
titanic["age"].discretize(method = "same_width", h = 10)
display(titanic["age"])
titanic["age"].hist()
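To make the 'same_width' idea concrete, here is a minimal pure-Python sketch of equal-width binning. It is not VerticaPy's implementation: the helper name `same_width_bin`, the `[lower;upper]` label format, and the sample ages are assumptions chosen for illustration.

```python
import math

def same_width_bin(value, h):
    """Return a label for the interval of width h that contains value."""
    # Hypothetical helper: snap the value down to the nearest multiple of h.
    lower = math.floor(value / h) * h
    return f"[{lower};{lower + h}]"

# Toy ages, not taken from the Titanic dataset.
ages = [2, 9.5, 10, 24, 36, 71]
labels = [same_width_bin(a, 10) for a in ages]
print(labels)
```

Every interval has the same width `h`, so the number of rows per interval can vary a lot when the data is skewed.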
In [47]:
# Discretizing using the same frequency per bin
titanic["age"].discretize(method = "same_freq", nbins = 5)
display(titanic["age"])
titanic["age"].hist()
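As a rough illustration of what 'same_freq' aims for, the sketch below computes bin edges so that each bin holds about the same number of values. The helper `same_freq_edges`, its nearest-rank quantile rule, and the sample ages are assumptions made for this example; VerticaPy may compute its quantiles differently.

```python
def same_freq_edges(values, nbins):
    """Return nbins + 1 edges so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Edge i sits at the i/nbins quantile (nearest-rank approximation).
    return [ordered[min(n - 1, (i * n) // nbins)] for i in range(nbins + 1)]

# Toy ages, not taken from the Titanic dataset.
ages = [1, 4, 8, 15, 18, 22, 25, 30, 41, 60]
edges = same_freq_edges(ages, 5)
print(edges)
```

Unlike equal-width bins, the intervals here have different widths: they are narrow where the values are dense and wide where they are sparse.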
In [15]:
# Discretizing using a response column distribution
# During the process, a Random Forest will be created
titanic["age"].discretize(method = "smart",
response = "survived",
nbins = 6,
RFmodel_params = {"n_estimators": 20})
display(titanic["age"].topk())
titanic["age"].hist()
# Each bin will represent a Random Forest split
titanic["age"].hist(method = "avg", of = "survived")
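The 'smart' method relies on splits learned by a Random Forest. The sketch below shows only the basic building block, under stated assumptions: a single split on one numeric column, chosen by minimising the weighted Gini impurity of a binary response. The helpers `gini` and `best_split` and the toy data are hypothetical; a real forest combines many such splits across many trees, and this is not how VerticaPy selects its final bins.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the threshold that minimises the weighted Gini impurity of both sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            # Place the threshold halfway between the two neighbouring values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold

# Toy data for illustration only, not taken from the Titanic dataset.
ages = [3, 5, 8, 30, 45, 60]
response = [1, 1, 1, 0, 0, 0]
print(best_split(ages, response))
```

Bin edges obtained this way follow the response distribution, which is why each resulting bin can show a distinct average of the response.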
In [51]:
# Extracting the passenger Title from the name
titanic["name"].str_extract(r' ([A-Za-z])+\.')
titanic["name"].hist()
# Discretizing using the top 5 most frequent categories
# The others will be merged together to create the 'rare' category
titanic["name"].discretize(method = "topk", k = 5, new_category = "rare")
display(titanic["name"])
titanic["name"].hist()
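The 'topk' behaviour can be sketched in plain Python as follows: keep the `k` most frequent categories and relabel everything else with `new_category`. The helper `topk_merge` and the sample titles are assumptions for illustration, and tie-breaking between equally frequent categories may differ from VerticaPy.

```python
from collections import Counter

def topk_merge(values, k, new_category="Others"):
    """Keep the k most frequent categories and merge the rest into new_category."""
    counts = Counter(values)
    keep = {category for category, _ in counts.most_common(k)}
    return [v if v in keep else new_category for v in values]

# Toy titles, not taken from the Titanic dataset.
titles = ["Mr", "Mr", "Mr", "Miss", "Miss", "Mrs", "Dr", "Rev"]
merged = topk_merge(titles, k=2, new_category="rare")
print(merged)
```

Merging rare categories this way keeps the number of distinct values small, which helps before encoding the column.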
See Also
| vDataFrame[].decode | Encodes the vcolumn using a user-defined encoding. |
| vDataFrame[].label_encode | Encodes the vcolumn using the Label Encoding. |
| vDataFrame[].get_dummies | Encodes the vcolumn using the One Hot Encoding. |
| vDataFrame[].mean_encode | Encodes the vcolumn using the Mean Encoding of a response. |
