verticapy.machine_learning.vertica.automl.AutoDataPrep#

class verticapy.machine_learning.vertica.automl.AutoDataPrep(name: str | None = None, overwrite_model: bool | None = False, cat_method: Literal['label', 'ooe'] = 'ooe', num_method: Literal['same_freq', 'same_width', 'none'] = 'none', nbins: int = 20, outliers_threshold: float = 4.0, na_method: Literal['auto', 'drop'] = 'auto', cat_topk: int = 10, standardize: bool = True, standardize_min_cat: int = 6, id_method: Literal['none', 'drop'] = 'drop', apply_pca: bool = False, rule: str | timedelta = 'auto', identify_ts: bool = True, save: bool = True)#

Automatically finds relations between the different features and preprocesses the data according to each column's type.

Parameters#

name: str, optional

Name of the model in which to store the output relation in the Vertica database.

overwrite_model: bool, optional

If set to True, training a model with the same name as an existing model overwrites the existing model.

cat_method: str, optional

Method for encoding categorical features. This can be set to ‘label’ for label encoding and ‘ooe’ for One-Hot Encoding.
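
The two encodings can be sketched in plain Python (an illustrative sketch of the concepts, not VerticaPy's internals; the toy `colors` column is an assumption):

```python
# Toy categorical column.
colors = ["red", "blue", "red", "green"]

# 'label': map each distinct category to an integer code.
categories = sorted(set(colors))            # ['blue', 'green', 'red']
label_encoded = [categories.index(c) for c in colors]
# → [2, 0, 2, 1]

# 'ooe': one-hot encode, one 0/1 indicator column per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
# → [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```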

num_method: str, optional

[Only used for non-time series datasets] Method for encoding numerical features. This can be set to ‘same_freq’ to encode using frequencies, ‘same_width’ to encode using regular bins, or ‘none’ to not encode numerical features.
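
The difference between the two binning strategies can be sketched as follows (an illustrative sketch, not VerticaPy's internals; the toy data and `nbins = 2` are assumptions):

```python
# Toy numerical column with one large value.
values = [1.0, 2.0, 3.0, 10.0]
nbins = 2

# 'same_width': bins of equal value range.
lo, hi = min(values), max(values)
width = (hi - lo) / nbins
same_width = [min(int((v - lo) / width), nbins - 1) for v in values]
# 1, 2, 3 fall in [1, 5.5) → bin 0; 10 falls in [5.5, 10] → bin 1

# 'same_freq': bins holding (roughly) equal numbers of rows.
ranked = sorted(values)
same_freq = [ranked.index(v) * nbins // len(values) for v in values]
# two smallest values → bin 0, two largest → bin 1
```

Equal-width bins are sensitive to outliers (here one bin holds a single row), while equal-frequency bins balance the row counts.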

nbins: int, optional

[Only used for non-time series datasets] Number of bins used to discretize numerical features.

outliers_threshold: float, optional

[Only used for non-time series datasets] Method for dealing with outliers. If a number is used, all elements with an absolute z-score greater than the threshold are converted to NULL values. Otherwise, outliers are treated as regular values.
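
The z-score rule can be sketched as follows (an illustrative sketch, not VerticaPy's internals; the toy data is an assumption, and the threshold is lowered from the default of 4.0 so the small sample trips it):

```python
from statistics import mean, pstdev

# Toy numerical column with one obvious outlier.
values = [10.0, 11.0, 9.0, 10.0, 120.0]
threshold = 1.5  # VerticaPy's default is 4.0

# Convert values whose absolute z-score exceeds the threshold to NULL (None).
mu, sigma = mean(values), pstdev(values)
cleaned = [v if abs((v - mu) / sigma) <= threshold else None for v in values]
# → [10.0, 11.0, 9.0, 10.0, None]
```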

na_method: str, optional

Method for handling missing values.

auto: Imputes the mean for numerical features and creates a new category for the categorical vDataColumns. For time series datasets, ‘constant’ interpolation is used for categorical features and ‘linear’ interpolation for the others.

drop: Drops the missing values.
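
The non-time-series ‘auto’ strategy can be sketched as follows (an illustrative sketch, not VerticaPy's internals; the toy columns and the ‘missing’ category name are assumptions):

```python
from statistics import mean

# Numerical column: impute missing values with the mean of the observed values.
numeric = [1.0, None, 3.0]
observed = [v for v in numeric if v is not None]
numeric_filled = [v if v is not None else mean(observed) for v in numeric]
# → [1.0, 2.0, 3.0]

# Categorical column: replace missing values with a new category.
categorical = ["a", None, "b"]
categorical_filled = [c if c is not None else "missing" for c in categorical]
# → ['a', 'missing', 'b']
```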

cat_topk: int, optional

Keeps the top-k most frequent categories and merges the others into one unique category. If unspecified, all categories are kept.
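
The top-k merge can be sketched as follows (an illustrative sketch, not VerticaPy's internals; the toy column, `k = 2`, and the ‘others’ label are assumptions):

```python
from collections import Counter

# Toy categorical column with a long tail of rare categories.
colors = ["red", "red", "red", "blue", "blue", "green", "gray"]
k = 2

# Keep the k most frequent categories; fold the rest into one category.
top_k = {cat for cat, _ in Counter(colors).most_common(k)}  # {'red', 'blue'}
merged = [c if c in top_k else "others" for c in colors]
# → ['red', 'red', 'red', 'blue', 'blue', 'others', 'others']
```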

standardize: bool, optional

If True, the data is standardized. The ‘num_method’ parameter must be set to ‘none’.
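
Standardization here is the usual z-score transform, sketched below (an illustrative sketch, not VerticaPy's internals; the toy data is an assumption):

```python
from statistics import mean, pstdev

# Toy numerical column.
values = [2.0, 4.0, 6.0]

# Center on the mean and scale by the standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]
# mean 4.0, population std ≈ 1.633 → roughly [-1.22, 0.0, 1.22]
```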

standardize_min_cat: int, optional

Minimum feature cardinality before using standardization.

id_method: str, optional

Method for handling ID features.

drop: Drops any feature detected as an ID.

none: Does not change ID features.

apply_pca: bool, optional

[Only used for non-time series datasets] If True, a PCA is applied at the end of the preprocessing.

rule: TimeInterval, optional

[Only used for time series datasets] Interval used to slice the time. For example, setting the rule to ‘5 minutes’ creates records separated by five-minute intervals. If set to ‘auto’, the rule is detected using aggregations.
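
Slicing timestamps into fixed intervals can be sketched as follows (an illustrative sketch, not VerticaPy's internals; the toy timestamps and the epoch origin are assumptions):

```python
from datetime import datetime, timedelta

# A '5 minutes' rule, expressed as a timedelta.
rule = timedelta(minutes=5)
timestamps = [
    datetime(2024, 1, 1, 0, 1),
    datetime(2024, 1, 1, 0, 4),
    datetime(2024, 1, 1, 0, 7),
]

# Floor each timestamp to the start of its interval.
epoch = datetime(2024, 1, 1)
buckets = [epoch + rule * ((t - epoch) // rule) for t in timestamps]
# 00:01 and 00:04 → the 00:00 slice; 00:07 → the 00:05 slice
```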

identify_ts: bool, optional

If True and parameter ‘ts’ is undefined when fitting the model, the function tries to automatically detect the parameter ‘ts’.

save: bool, optional

If True, saves the final relation inside the database.

Attributes#

X_in_: list

Variables used to fit the model.

X_out_: list

Variables created by the model.

sql_: str

SQL needed to deploy the model.

final_relation_: vDataFrame

Relation created after fitting the model.

__init__(name: str | None = None, overwrite_model: bool | None = False, cat_method: Literal['label', 'ooe'] = 'ooe', num_method: Literal['same_freq', 'same_width', 'none'] = 'none', nbins: int = 20, outliers_threshold: float = 4.0, na_method: Literal['auto', 'drop'] = 'auto', cat_topk: int = 10, standardize: bool = True, standardize_min_cat: int = 6, id_method: Literal['none', 'drop'] = 'drop', apply_pca: bool = False, rule: str | timedelta = 'auto', identify_ts: bool = True, save: bool = True) None#

Must be overridden in the child class

Methods

__init__([name, overwrite_model, ...])

Must be overridden in the child class

contour([nbins, chart])

Draws the model's contour plot.

deploySQL([X])

Returns the SQL code needed to deploy the model.

does_model_exists(name[, raise_error, ...])

Checks whether the model is stored in the Vertica database.

drop()

Drops the model from the Vertica database.

export_models(name, path[, kind])

Exports machine learning models.

fit(input_relation[, X, ts, by, return_report])

Trains the model.

get_attributes([attr_name])

Returns the model attributes.

get_match_index(x, col_list[, str_check])

Returns the matching index.

get_params()

Returns the parameters of the model.

get_plotting_lib([class_name, chart, ...])

Returns the first available library (Plotly, Matplotlib, or Highcharts) to draw a specific graphic.

get_vertica_attributes([attr_name])

Returns the model Vertica attributes.

import_models(path[, schema, kind])

Imports machine learning models.

register(registered_name[, raise_error])

Registers the model and adds it to in-DB Model versioning environment with a status of 'under_review'.

set_params([parameters])

Sets the parameters of the model.

summarize()

Summarizes the model.

to_binary(path)

Exports the model to the Vertica Binary format.

to_pmml(path)

Exports the model to PMML.

to_python([return_proba, ...])

Returns the Python function needed for in-memory scoring without using built-in Vertica functions.

to_sql([X, return_proba, ...])

Returns the SQL code needed to deploy the model without using built-in Vertica functions.

to_tf(path)

Exports the model to the Frozen Graph format (TensorFlow).

Attributes