AutoDataPrep
In [ ]:
class AutoDataPrep(name: str = "",
cat_method: str = "ooe",
num_method: str = "none",
nbins: int = 20,
outliers_threshold: float = 4.0,
na_method: str = "auto",
cat_topk: int = 10,
normalize: bool = True,
normalize_min_cat: int = 6,
id_method: str = "drop",
apply_pca: bool = False,
rule: (str, datetime.timedelta) = "auto",
identify_ts: bool = True,
save: bool = True,)
Automatically find relations between the different features to preprocess the data according to each column type.
Parameters
Name | Type | Optional | Description |
---|---|---|---|
name | str | ❌ | Name of the model. |
cat_method | str | ✓ | Method for encoding categorical features. This can be set to 'label' for label encoding and 'ooe' for One-Hot Encoding. |
num_method | str | ✓ | [Only used for non-time series datasets] Method for encoding numerical features. This can be set to 'same_freq' to encode using frequencies, 'same_width' to encode using regular bins, or 'none' to not encode numerical features. |
nbins | int | ✓ | [Only used for non-time series datasets] Number of bins used to discretize numerical features. |
outliers_threshold | float | ✓ | [Only used for non-time series datasets] How to deal with outliers. If a number is used, all elements with an absolute z-score greater than the threshold will be converted to NULL values. Otherwise, outliers are treated as regular values. |
na_method | str | ✓ | Method for handling missing values. |
cat_topk | int | ✓ | Keeps the top-k most frequent categories and merges the others into one unique category. If unspecified, all categories are kept. |
normalize | bool | ✓ | If True, the data will be normalized using the z-score. The 'num_method' parameter must be set to 'none'. |
normalize_min_cat | int | ✓ | Minimum feature cardinality before using normalization. |
id_method | str | ✓ | Method for handling ID features. |
apply_pca | bool | ✓ | [Only used for non-time series datasets] If True, a PCA is applied at the end of the preprocessing. |
rule | str / datetime.timedelta | ✓ | [Only used for time series datasets] Interval used to slice the time. For example, '5 minutes' creates records separated by 5-minute intervals. If set to 'auto', the rule is detected using aggregations. |
identify_ts | bool | ✓ | If True and parameter 'ts' is undefined when fitting the model, the function will try to automatically detect the parameter 'ts'. |
print_info | bool | ✓ | If True, prints the model information at each step. |
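To make the 'outliers_threshold' and 'cat_topk' descriptions concrete, here is a minimal pure-Python sketch of the two ideas. It is not VerticaPy's implementation, which runs inside the database. The helper names, the use of the population standard deviation, and the "Others" label for merged categories are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def null_outliers(values, threshold=4.0):
    """Replace values whose absolute z-score exceeds `threshold` with None.

    Assumption: the z-score uses the population standard deviation of the
    non-missing values. Existing None entries are left untouched.
    """
    present = [v for v in values if v is not None]
    mu = mean(present)
    sigma = pstdev(present)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [
        None if v is None or abs((v - mu) / sigma) > threshold else v
        for v in values
    ]


def keep_top_k(categories, k=10, other="Others"):
    """Keep the k most frequent categories and merge the rest into `other`.

    Assumption: "Others" is a hypothetical label for the merged category.
    """
    kept = {category for category, _ in Counter(categories).most_common(k)}
    return [c if c in kept else other for c in categories]


# Twenty identical fares plus one extreme value.
fares = [10.0] * 20 + [1000.0]
cleaned_fares = null_outliers(fares, threshold=4.0)

# Two frequent categories and two rare ones.
ports = ["a", "a", "a", "b", "b", "c", "d"]
merged_ports = keep_top_k(ports, k=2)
```

With these inputs, only the extreme fare is converted to a missing value, and the two rare categories are merged into a single one.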
Attributes
Name | Type | Description |
---|---|---|
X_in | list | Variables used to fit the model. |
X_out | list | Variables created by the model. |
ts | str | TS component. |
by | list | vcolumns used in the partition. |
sql_ | str | SQL needed to deploy the model. |
final_relation_ | vDataFrame | Relation created after fitting the model. |
Main Methods
Name | Description |
---|---|
fit | Trains the model. |
AutoDataPrep also inherits the vModel methods.
Example
In [4]:
from verticapy.learn.delphi import AutoDataPrep

# Create the model with its default parameters
model = AutoDataPrep("titanic_autodataprep")
# Preprocess every column of the input relation
model.fit("public.titanic")
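The same workflow with non-default parameters might look like the sketch below. It uses only the constructor parameters and attributes listed on this page. The helper `prepare_relation` and the model name are hypothetical, and the code assumes an active Vertica connection and an existing `public.titanic` relation.

```python
def prepare_relation(relation="public.titanic",
                     model_name="titanic_autodataprep_custom"):
    """Fit an AutoDataPrep model and return its outputs.

    Hypothetical helper: it assumes VerticaPy is installed, a database
    connection is already set up, and `relation` exists.
    """
    # Imported here so that defining the helper does not require VerticaPy.
    from verticapy.learn.delphi import AutoDataPrep

    model = AutoDataPrep(
        model_name,
        cat_method="label",      # label encoding instead of 'ooe'
        cat_topk=5,              # keep the 5 most frequent categories
        outliers_threshold=3.0,  # stricter z-score cutoff than the default 4.0
        normalize=True,          # requires num_method to stay 'none'
    )
    model.fit(relation)
    # X_out lists the variables created by the model;
    # final_relation_ is the vDataFrame produced by the fit.
    return model.X_out, model.final_relation_
```

Calling `prepare_relation()` returns the list of generated variables together with the preprocessed relation, which can then be passed to another model.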