AutoDataPrep

In [ ]:
class AutoDataPrep(name: str = "",
                   cursor=None,
                   cat_method: str = "ooe",
                   num_method: str = "none",
                   nbins: int = 20,
                   outliers_threshold: float = 4.0,
                   na_method: str = "auto",
                   cat_topk: int = 10,
                   normalize: bool = True,
                   normalize_min_cat: int = 6,
                   id_method: str = "drop",
                   apply_pca: bool = False,
                   rule: (str, datetime.timedelta) = "auto",
                   identify_ts: bool = True,
                   save: bool = True)

Automatically finds relations between the features and preprocesses the data according to each column's type.

Parameters

Name Type Optional Description
name
str
Name of the model.
cursor
DBcursor
Vertica database cursor.
cat_method
str
Method for encoding categorical features. This can be set to 'label' for label encoding and 'ooe' for One-Hot Encoding.
num_method
str
[Only used for non-time series datasets]
Method for encoding numerical features. This can be set to 'same_freq' to encode using frequencies, 'same_width' to encode using regular bins, or 'none' to not encode numerical features.
nbins
int
[Only used for non-time series datasets]
Number of bins used to discretize numerical features.
outliers_threshold
float
[Only used for non-time series datasets]
How to deal with outliers. If a number is used, all elements with an absolute z-score greater than the threshold will be converted to NULL values. Otherwise, outliers are treated as regular values.
na_method
str
Method for handling missing values.
  • auto: Imputes numerical features with the mean and creates a new category for categorical vcolumns. For time series datasets, 'constant' interpolation is used for categorical features and 'linear' interpolation for the others.
  • drop: Drops the missing values.
cat_topk
int
Keeps the top-k most frequent categories and merges the others into one unique category. If unspecified, all categories are kept.
normalize
bool
If True, the data will be normalized using the z-score. The 'num_method' parameter must be set to 'none'.
normalize_min_cat
int
Minimum cardinality a feature must have for normalization to be applied.
id_method
str
Method for handling ID features.
  • drop : Drops any feature detected as ID.
  • none : Does not change ID features.
apply_pca
bool
[Only used for non-time series datasets]
If True, a PCA is applied at the end of the preprocessing.
rule
str / datetime.timedelta
[Only used for time series datasets]
Interval used to slice the time axis. For example, '5 minutes' creates records separated by 5-minute intervals. If set to 'auto', the rule is detected using aggregations.
identify_ts
bool
If True and parameter 'ts' is undefined when fitting the model, the function will try to automatically detect the parameter 'ts'.
print_info
bool
If True, prints the model information at each step.
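
The numerical options above can be illustrated outside the database. The following is a minimal pure-Python sketch, not VerticaPy's implementation, of what 'outliers_threshold', na_method='auto' and 'normalize' describe for a single numerical column: values whose absolute z-score exceeds the threshold become NULL (None here), missing values are filled with the mean, and the result is z-score normalized. The helper name prepare_numeric, the order of the steps, and the use of the population standard deviation are assumptions made for illustration. The sketch also assumes the column has at least one non-missing value.

```python
from statistics import fmean, pstdev
from typing import List, Optional


def prepare_numeric(values: List[Optional[float]],
                    outliers_threshold: float = 4.0) -> List[float]:
    """Hypothetical helper: null outliers, impute with the mean, then z-score."""
    present = [v for v in values if v is not None]
    mean, std = fmean(present), pstdev(present)

    def is_outlier(v: Optional[float]) -> bool:
        # A value is an outlier when its absolute z-score exceeds the threshold.
        return v is not None and std > 0 and abs((v - mean) / std) > outliers_threshold

    # Step 1: outliers are converted to NULL (None).
    cleaned = [None if is_outlier(v) else v for v in values]

    # Step 2 (na_method='auto'): missing values are replaced by the mean.
    fill = fmean([v for v in cleaned if v is not None])
    imputed = [fill if v is None else v for v in cleaned]

    # Step 3 (normalize=True): z-score normalization of the imputed column.
    mean2, std2 = fmean(imputed), pstdev(imputed)
    if std2 == 0:
        return [0.0 for _ in imputed]
    return [(v - mean2) / std2 for v in imputed]


# Made-up sample: one missing value and one extreme value.
ages = [22.0, 38.0, None, 35.0, 54.0, 2.0, 27.0, 14.0, 4.0, 58.0, 20.0, 39.0, 1000.0]
normalized = prepare_numeric(ages, outliers_threshold=3.0)
```

In this sketch, both the missing entry and the nulled outlier are filled with the mean, so they end up at approximately 0 after normalization.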

Attributes

Name Type Description
X_in
list
Variables used to fit the model.
X_out
list
Variables created by the model.
ts
str
TS component.
by
list
vcolumns used in the partition.
sql_
str
SQL needed to deploy the model.
final_relation_
vDataFrame
Relation created after fitting the model.
model_grid_
tablesample
Grid containing information about the different models.
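
The names that appear in X_out for categorical inputs can be understood with a small sketch. The following pure-Python example is an illustration under assumptions, not VerticaPy's code. It mimics what na_method='auto', 'cat_topk' and One-Hot Encoding describe: missing values become their own category, only the k most frequent categories are kept, the rest are merged into a single category, and each remaining category yields one 0/1 column. The helper one_hot_topk, the labels 'NULL' and 'Others', and the <column>_<category> naming are assumptions modeled on column names such as boat_NULL and ticket_Others in the example output.

```python
from collections import Counter
from typing import Dict, List, Optional


def one_hot_topk(column: str,
                 values: List[Optional[str]],
                 cat_topk: int = 10) -> Dict[str, List[int]]:
    """Hypothetical helper: NULL category, top-k merge, then one-hot columns."""
    # Missing values become their own category.
    filled = ["NULL" if v is None else v for v in values]

    # Keep the cat_topk most frequent categories and merge everything else.
    top = {cat for cat, _ in Counter(filled).most_common(cat_topk)}
    merged = [v if v in top else "Others" for v in filled]

    # One 0/1 column per remaining category, named <column>_<category>.
    categories = sorted(set(merged))
    return {f"{column}_{cat}": [int(v == cat) for v in merged] for cat in categories}


# Made-up sample values for a 'boat' column.
boats = ["13", None, "C", "13", None, "7", "D", None]
encoded = one_hot_topk("boat", boats, cat_topk=2)
```

With cat_topk=2, only the two most frequent categories of the sample ('NULL' and '13') keep their own column, and 'C', '7' and 'D' are merged into boat_Others.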

Main Methods

Name Description
fit
Trains the model.

AutoDataPrep also inherits the vModel methods.

Example

In [4]:
from verticapy.learn.delphi import AutoDataPrep

model = AutoDataPrep("titanic_autodataprep")
model.fit("public.titanic")
Out[4]:
Column                            Type
pclass                            Int
survived                          Int
age                               Numeric(45,29)
sibsp                             Numeric(33,15)
parch                             Numeric(33,15)
fare                              Numeric(48,30)
body                              Numeric(60,29)
sex_female                        Int
sex_male                          Int
ticket_113781                     Int
ticket_1601                       Int
ticket_19950                      Int
ticket_3101295                    Int
ticket_347077                     Int
ticket_347082                     Int
ticket_347088                     Int
ticket_CA_2144                    Int
ticket_CA._2343                   Int
ticket_Others                     Int
ticket_S.O.C._14879               Int
cabin_A34                         Int
cabin_B57_B59_B63_B66             Int
cabin_B96_B98                     Int
cabin_C22_C26                     Int
cabin_C23_C25_C27                 Int
...
embarked_C                        Int
embarked_Q                        Int
embarked_S                        Int
boat_10                           Int
boat_13                           Int
boat_14                           Int
boat_15                           Int
boat_3                            Int
boat_4                            Int
boat_5                            Int
boat_8                            Int
boat_C                            Int
boat_NULL                         Int
boat_Others                       Int
home.dest_Cornwall___Akron__OH    Int
home.dest_London                  Int
home.dest_Montreal__PQ            Int
home.dest_NULL                    Int
home.dest_New_York__NY            Int
home.dest_Others                  Int
home.dest_Paris__France           Int
home.dest_Philadelphia__PA        Int
home.dest_Sweden_Winnipeg__MN     Int
[name missing]                    Int
home.dest_Winnipeg__MB            Int

[Data rows of the preview: a row index, then the numerical values and the 0/1 indicator values for the columns listed above.]