verticapy.machine_learning.vertica.automl.AutoDataPrep
- class verticapy.machine_learning.vertica.automl.AutoDataPrep(name: str | None = None, overwrite_model: bool | None = False, cat_method: Literal['label', 'ooe'] = 'ooe', num_method: Literal['same_freq', 'same_width', 'none'] = 'none', nbins: int = 20, outliers_threshold: float = 4.0, na_method: Literal['auto', 'drop'] = 'auto', cat_topk: int = 10, standardize: bool = True, standardize_min_cat: int = 6, id_method: Literal['none', 'drop'] = 'drop', apply_pca: bool = False, rule: str | timedelta = 'auto', identify_ts: bool = True, save: bool = True)
Automatically find relations between the different features to preprocess the data according to each column type.
Parameters
- name: str, optional
Name of the model in which to store the output relation in the Vertica database.
- overwrite_model: bool, optional
If set to True, training a model with the same name as an existing model overwrites the existing model.
- cat_method: str, optional
Method for encoding categorical features. This can be set to ‘label’ for label encoding and ‘ooe’ for One-Hot Encoding.
- num_method: str, optional
[Only used for non-time series datasets] Method for encoding numerical features. This can be set to ‘same_freq’ to discretize into bins that each hold roughly the same number of elements, ‘same_width’ to discretize into bins of equal width, or ‘none’ to not encode numerical features.
- nbins: int, optional
[Only used for non-time series datasets] Number of bins used to discretize numerical features.
- outliers_threshold: float, optional
[Only used for non-time series datasets] Method for dealing with outliers. If a number is used, all elements with an absolute z-score greater than the threshold are converted to NULL values. Otherwise, outliers are treated as regular values.
- na_method: str, optional
Method for handling missing values.
auto: Uses the mean for numerical features and creates a new category for categorical vDataColumns. For time series datasets, ‘constant’ interpolation is used for categorical features and ‘linear’ for the others.
drop: Drops the missing values.
- cat_topk: int, optional
Keeps the top-k most frequent categories and merges the others into one unique category. If unspecified, all categories are kept.
- standardize: bool, optional
If True, the data is standardized. The ‘num_method’ parameter must be set to ‘none’.
- standardize_min_cat: int, optional
Minimum feature cardinality before using standardization.
- id_method: str, optional
Method for handling ID features.
drop: Drops any feature detected as ID.
none: Does not change ID features.
- apply_pca: bool, optional
[Only used for non-time series datasets] If True, a PCA is applied at the end of the preprocessing.
- rule: TimeInterval, optional
[Only used for time series datasets] Interval used to slice the time. For example, setting it to ‘5 minutes’ creates records separated by 5-minute intervals. If set to ‘auto’, the rule is detected using aggregations.
- identify_ts: bool, optional
If True and parameter ‘ts’ is undefined when fitting the model, the function tries to automatically detect the parameter ‘ts’.
- save: bool, optional
If True, saves the final relation inside the database.
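The parameter descriptions above name several per-column transformations: z-score outlier removal, mean imputation, top-k category merging, and the two binning modes. The sketch below illustrates those ideas in plain Python on small lists. It is not VerticaPy's implementation, which runs in the database. All function names, the ‘Others’ and ‘NULL’ labels, the use of the population standard deviation, and the order of the steps are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean, pstdev


def null_outliers(values, threshold=4.0):
    """Replace values whose absolute z-score exceeds `threshold` with None."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), pstdev(present)  # population std: an assumption
    if sigma == 0:
        return list(values)
    return [
        None if v is not None and abs((v - mu) / sigma) > threshold else v
        for v in values
    ]


def fill_mean(values):
    """Numerical 'auto' handling: replace None with the mean of the rest."""
    mu = mean([v for v in values if v is not None])
    return [mu if v is None else v for v in values]


def merge_rare_categories(values, topk=10, other="Others", missing="NULL"):
    """Keep the `topk` most frequent categories and merge the rest.

    Missing entries are first given their own category (`missing`), as in
    the 'auto' description for categorical columns. Both labels are
    hypothetical.
    """
    filled = [missing if v is None else v for v in values]
    keep = {cat for cat, _ in Counter(filled).most_common(topk)}
    return [v if v in keep else other for v in filled]


def same_width_bins(values, nbins):
    """Assign each value to one of `nbins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    if width == 0:
        return [0] * len(values)
    # The maximum falls exactly on the upper edge, so clamp it to the last bin.
    return [min(int((v - lo) / width), nbins - 1) for v in values]


def same_freq_bins(values, nbins):
    """Assign each value to one of `nbins` bins of roughly equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * nbins // len(values)
    return bins


# Toy data: one extreme value and one missing value in a numerical column.
ages = [30.0] * 20 + [1000.0, None]
cleaned = fill_mean(null_outliers(ages, threshold=4.0))

# Toy data: a categorical column with rare and missing entries.
colors = ["red", "red", "red", "blue", "blue", "green", None, "teal"]
merged = merge_rare_categories(colors, topk=2)

skewed = [1, 2, 3, 4, 100, 200]
width_bins = same_width_bins(skewed, 2)
freq_bins = same_freq_bins(skewed, 2)
```

On the skewed list, equal-width binning puts almost every value in the first bin, while equal-frequency binning splits the values evenly. That contrast is the practical difference between ‘same_width’ and ‘same_freq’.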
Attributes
- X_in_: list
Variables used to fit the model.
- X_out_: list
Variables created by the model.
- sql_: str
SQL needed to deploy the model.
- final_relation_: vDataFrame
Relation created after fitting the model.
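A minimal usage sketch follows. It relies only on the class path, the constructor parameters, the `input_relation` argument of `fit`, and the fitted attributes listed on this page. The helper name `run_auto_data_prep` and the `prep_params` dictionary are hypothetical. The sketch also assumes that a Vertica connection is already open and that `input_relation` is a relation accepted by `fit`. It has not been run against a database.

```python
# Keyword arguments taken from the constructor signature documented above.
prep_params = {
    "cat_method": "ooe",        # one-hot encode categorical features
    "num_method": "none",       # must be 'none' when standardize is True
    "outliers_threshold": 4.0,  # |z-score| above this becomes NULL
    "na_method": "auto",
    "cat_topk": 10,
    "standardize": True,
    "id_method": "drop",
}


def run_auto_data_prep(input_relation, params=None):
    """Fit AutoDataPrep and return the fitted attributes listed above.

    verticapy is imported lazily so that this module can be loaded
    without the package installed. Calling the function requires an
    open Vertica connection (assumption).
    """
    from verticapy.machine_learning.vertica.automl import AutoDataPrep

    model = AutoDataPrep(**(prep_params if params is None else params))
    model.fit(input_relation)
    return {
        "X_in": model.X_in_,                # variables used to fit the model
        "X_out": model.X_out_,              # variables created by the model
        "sql": model.sql_,                  # SQL needed to deploy the model
        "relation": model.final_relation_,  # resulting vDataFrame
    }
```

Keeping the parameters in a dictionary makes it easy to check combinations, such as `standardize=True` with `num_method='none'`, before anything is sent to the database.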
- __init__(name: str | None = None, overwrite_model: bool | None = False, cat_method: Literal['label', 'ooe'] = 'ooe', num_method: Literal['same_freq', 'same_width', 'none'] = 'none', nbins: int = 20, outliers_threshold: float = 4.0, na_method: Literal['auto', 'drop'] = 'auto', cat_topk: int = 10, standardize: bool = True, standardize_min_cat: int = 6, id_method: Literal['none', 'drop'] = 'drop', apply_pca: bool = False, rule: str | timedelta = 'auto', identify_ts: bool = True, save: bool = True) -> None
Must be overridden in the child class
Methods
- __init__([name, overwrite_model, ...])
Must be overridden in the child class
- contour([nbins, chart])
Draws the model's contour plot.
- deploySQL([X])
Returns the SQL code needed to deploy the model.
- does_model_exists(name[, raise_error, ...])
Checks whether the model is stored in the Vertica database.
- drop()
Drops the model from the Vertica database.
- export_models(name, path[, kind])
Exports machine learning models.
- fit(input_relation[, X, ts, by, return_report])
Trains the model.
- get_attributes([attr_name])
Returns the model attributes.
- get_match_index(x, col_list[, str_check])
Returns the matching index.
Returns the parameters of the model.
- get_plotting_lib([class_name, chart, ...])
Returns the first available library (Plotly, Matplotlib, or Highcharts) to draw a specific graphic.
- get_vertica_attributes([attr_name])
Returns the model Vertica attributes.
- import_models(path[, schema, kind])
Imports machine learning models.
- register(registered_name[, raise_error])
Registers the model and adds it to the in-DB Model versioning environment with a status of 'under_review'.
- set_params([parameters])
Sets the parameters of the model.
Summarizes the model.
- to_binary(path)
Exports the model to the Vertica Binary format.
- to_pmml(path)
Exports the model to PMML.
- to_python([return_proba, ...])
Returns the Python function needed for in-memory scoring without using built-in Vertica functions.
- to_sql([X, return_proba, ...])
Returns the SQL code needed to deploy the model without using built-in Vertica functions.
- to_tf(path)
Exports the model to the Frozen Graph format (TensorFlow).
Attributes