VerticaPy Machine Learning V0.10.1 Cheat Sheet
VerticaPy Machine Learning supports the entire machine learning workflow via a Python interface. For more information about the capabilities of VerticaPy ML, see the VerticaPy ML documentation or check out the VerticaPy examples.
Preprocessing Data
Load data (link)
=> from verticapy.utilities import *
=> import verticapy as vp
=> VDataFrame=vp.read_csv("filename.csv")
Creates a VDataFrame from a CSV file.
Summarize data (link)
=> VDataFrame.describe()
Aggregates the VDataFrame using multiple statistical aggregations.
=> VDataFrame.describe(columns=["column_1", "column_2", "column_3"], method="categorical")
Aggregates the selected columns using categorical statistical aggregations.
Detect Outliers (link) and (link)
=> VDataFrame.outliers_plot(["col1", "col2"])
A 2D plot to visualize outliers based on the given two columns.
=> VDataFrame.outliers(columns=["col1", "col2"], name="name of the outlier columns")
Creates a new column that indicates whether each data point is an outlier.
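As a rough illustration of what such an outlier flag column represents, the following pure-Python sketch marks values whose absolute z-score exceeds a threshold. The helper name `flag_outliers`, the threshold of 3.0, and the use of the population standard deviation are assumptions for illustration; this is not VerticaPy's implementation, which runs inside the database.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=3.0):
    """Return 1 for values whose absolute z-score exceeds threshold, else 0."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return [0] * len(values)
    return [1 if abs((v - mu) / sigma) > threshold else 0 for v in values]

# Hypothetical data: twelve ordinary readings and one extreme value.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 100]
flags = flag_outliers(data, threshold=3.0)
```

Only the last value is flagged here; lowering the threshold makes the flag more sensitive.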
Measure Correlations (link)
=> VDataFrame.corr(method="pearson")
Calculates and displays the Pearson correlation matrix.
=> VDataFrame.corr(["column_1", "column_2"], method="spearman")
Calculates and displays the Spearman correlation between two columns.
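To make the difference between the two `method` values concrete, here is a minimal pure-Python sketch: Pearson measures linear association on the raw values, while Spearman applies the same formula to ranks. The function names are illustrative assumptions, not part of VerticaPy.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # monotonic but non-linear (y = x**3)
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

For this monotonic but non-linear relationship, the Spearman coefficient is 1 while the Pearson coefficient stays below 1.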
Normalize Data (link)
=> VDataFrame.normalize()
Normalizes all the columns in the dataset, using the z-score method by default.
=> VDataFrame.normalize(columns=["col1", "col2"], method="minmax")
Normalizes the selected columns in the dataset using the min-max method.
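The two scaling methods can be sketched in plain Python as follows. The helper names are hypothetical, and the use of the sample standard deviation in the z-score is an assumption; the cheat sheet does not state which variant the database uses.

```python
from statistics import mean, stdev

def zscore(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def minmax(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

col = [2.0, 4.0, 6.0, 8.0]
z = zscore(col)   # symmetric around 0
m = minmax(col)   # starts at 0.0 and ends at 1.0
```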
Dimensionality Reduction (link)
=> from verticapy.learn.decomposition import PCA
Imports the PCA class.
=> model = PCA("PCA_name")
Creates a PCA object.
=> model.fit(VDataFrame)
Fits the PCA on the VDataFrame and displays the results.
=> model.transform(n_components=2)
Creates a VDataFrame whose columns are the principal components.
Encode Categorical Features (label encode link) and (get dummies link)
=> VDataFrame["column_name"].label_encode()
Encodes a categorical column into numerical values.
=> VDataFrame["column_name"].one_hot_encode()
One Hot Encoding for the desired column.
=> VDataFrame["column_name"].mean_encode("response_column")
Mean encoding for the desired column: each category is replaced by the mean of the response column for that category.
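The three encodings can be compared on a toy column with the sketch below. It is a pure-Python illustration under assumed conventions (categories numbered in sorted order, dummy columns named `is_<category>`); the actual column names and numbering produced by VerticaPy may differ.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """Produce one 0/1 indicator per distinct category for every row."""
    cats = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in cats} for v in values]

def mean_encode(values, response):
    """Replace each category with the mean response observed for it."""
    totals, counts = {}, {}
    for v, r in zip(values, response):
        totals[v] = totals.get(v, 0.0) + r
        counts[v] = counts.get(v, 0) + 1
    return [totals[v] / counts[v] for v in values]

colors = ["red", "blue", "red", "green"]
target = [1, 0, 0, 1]  # hypothetical response column

labels = label_encode(colors)
dummies = one_hot_encode(colors)
means = mean_encode(colors, target)
```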
Impute Missing Values (link)
=> VDataFrame.count_percent()
Counts the percentage of missing values for each column.
=> VDataFrame["col_to_fill"].fillna(method="auto")
Fills missing values with the mean for numerical columns and the mode for categorical columns.
=> VDataFrame["col_to_fill"].fillna(method="avg", by=["columns_used_in_partition"])
Fills missing values with the average computed within each partition defined by the "by" columns. This replaces the original column.
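The partitioned fill can be pictured with this minimal pure-Python sketch: each missing value is replaced by the average of the non-missing values that share its partition key. The function name, the row layout, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

def fill_with_group_mean(rows, target, by):
    """Return copies of rows where missing target values get their group's mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        if row[target] is not None:
            sums[row[by]] += row[target]
            counts[row[by]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[target] is None and counts[new_row[by]]:
            new_row[target] = sums[new_row[by]] / counts[new_row[by]]
        filled.append(new_row)
    return filled

rows = [
    {"city": "A", "age": 30.0},
    {"city": "A", "age": None},
    {"city": "A", "age": 40.0},
    {"city": "B", "age": 20.0},
    {"city": "B", "age": None},
]
result = fill_with_group_mean(rows, target="age", by="city")
```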
Process Imbalanced Data (link)
=> VDataFrame.balance(column=["column_to_balance"])
Creates a view with an equal distribution of the input data based on the response column. The default method is hybrid.
=> VDataFrame.balance(column=["column_to_balance"], method="under", x=0.5)
Creates a view with a custom distribution of the input data based on response column. Ratio(x) can be changed.
=> VDataFrame["column_to_balance"].topk(k=3)
Returns the k most frequent values in the column along with how often they occur.
Sample Data (link)
=> VDataFrame.sample(x=0.2)
The entire table is randomly sampled using the given ratio(x).
=> VDataFrame.sample(n=100)
The entire table is randomly sampled using the number of elements required(n).
=> VDataFrame.sample(x=0.3, method="stratified")
The entire table is randomly sampled using the given ratio(x) and method (random, stratified, or systematic).
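As a sketch of what stratified sampling means, the snippet below draws the same fraction from every group so that group proportions are preserved. The `by` argument, the function name, and the fixed seed are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, by, x, seed=0):
    """Sample a fraction x of the rows from each group defined by column `by`."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    sample = []
    for members in groups.values():
        k = round(len(members) * x)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical table: 80 rows labelled "a" and 20 rows labelled "b".
rows = [{"id": i, "label": "a" if i < 80 else "b"} for i in range(100)]
picked = stratified_sample(rows, by="label", x=0.3)
```

The sample keeps the 80/20 split of the original data, which a plain random sample only approximates.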
Training and Predicting
Regression – Model Building
Linear Regression (link)
=> from verticapy.learn.linear_model import LinearRegression
Import the Linear Regression function.
=> model = LinearRegression(name="public.Name_of_Model")
Build a Linear Regression model.
Support Vector Machines (SVM) (link)
=> from verticapy.learn.svm import LinearSVR
=> model = LinearSVR(name="Name_of_Model", acceptable_error_margin=0.5)
Build a LinearSVR object using the Vertica SVM (Support Vector Machine) algorithm.
Random Forest (link)
=> from verticapy.learn.ensemble import RandomForestRegressor
=> model = RandomForestRegressor(name="Name_of_Model", n_estimators=20, max_features="auto", max_leaf_nodes=32, sample=0.7, max_depth=3, min_sample_leaf=5, min_info_gain=0.0, nbins=32)
Creates a RandomForestRegressor object using the Vertica Random Forest function on the data.
XGBoost (link)
=> from verticapy.learn.ensemble import XGBoostRegressor
=> model = XGBoostRegressor(name="Name_of_Model", max_ntree=10, max_depth=5, nbins=32, objective="squarederror", split_proposal_method="global", tol=0.001, learning_rate=0.1, min_split_loss=0, weight_reg=0, sample=1)
Creates an XGBoostRegressor object using the Vertica XGBoost algorithm. Of all the available options, only name is mandatory.
AutoML (link)
=> from verticapy.learn.delphi import AutoML
=> model = AutoML(name="Name_of_Model", estimator_type="regressor", cv=3, stepwise=True)
Tests multiple models to find the ones that maximize the input score.
Classification
Logistic Regression (link)
=> from verticapy.learn.linear_model import LogisticRegression
Import the LogisticRegression class.
=> model = LogisticRegression(name="Name_of_Model", penalty="L2", tol=1e-4, C=1, max_iter=100, solver="CGD")
Creates a LogisticRegression object using the Vertica LOGISTIC_REG function.
Support Vector Machines (SVM) (link)
=> from verticapy.learn.svm import LinearSVC
=> model = LinearSVC(name="Name_of_Model", tol=1e-4, C=1.0, fit_intercept=True, intercept_mode="regularized", max_iter=100)
Build a LinearSVC object using the Vertica SVM (Support Vector Machine) algorithm.
Random Forest (link)
=> from verticapy.learn.ensemble import RandomForestClassifier
=> model = RandomForestClassifier(name="Name_of_Model", n_estimators=20, max_features="auto", max_leaf_nodes=32, sample=0.7, max_depth=3, min_sample_leaf=5, min_info_gain=0.0, nbins=32)
Creates a RandomForestClassifier object using the Vertica Random Forest function on the data.
XGBoost (link)
=> from verticapy.learn.ensemble import XGBoostClassifier
=> model = XGBoostClassifier(name="Name_of_Model", max_ntree=10, max_depth=5, nbins=32, objective="squarederror", split_proposal_method="global", tol=0.001, learning_rate=0.1, min_split_loss=0, weight_reg=0, sample=1)
Creates an XGBoostClassifier object using the Vertica XGBoost algorithm. Of all the available options, only name is mandatory.
AutoML (link)
=> from verticapy.learn.delphi import AutoML
=> model = AutoML(name="Name_of_Model", estimator_type="multi", cv=3, stepwise=True)
Tests multiple models to find the ones that maximize the input score.
Clustering
K-neighbors (link)
=> from verticapy.learn.neighbors import KNeighborsClassifier
=> model= KNeighborsClassifier(name="Name_of_Model", n_neighbors=5, p=2)
Creates a KNeighborsClassifier object by using the k-nearest neighbors algorithm.
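The roles of `n_neighbors` and `p` can be seen in this minimal pure-Python sketch of k-nearest neighbors voting, where `p` is the exponent of the Minkowski distance (p=2 is Euclidean, p=1 is Manhattan). The function names and the toy data are assumptions; the sketch is not how the computation is carried out in the database.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train, query, n_neighbors=5, p=2):
    """Majority vote among the n_neighbors training points closest to query.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda item: minkowski(item[0], query, p))
    votes = Counter(label for _, label in nearest[:n_neighbors])
    return votes.most_common(1)[0][0]

train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]
near_a = knn_predict(train, (2, 2), n_neighbors=3)
near_b = knn_predict(train, (7, 7), n_neighbors=3)
```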
Nearest centroid (link)
=> from verticapy.learn.neighbors import NearestCentroid
=> model = NearestCentroid(name="Name_of_Model", p=2)
Creates a NearestCentroid object by using the nearest centroid algorithm.
Fitting, Predicting, and Evaluating Models
Regression/Classification – Model Prediction
Fitting
=> model.fit("public.Name_of_Table", ["independent_col_1", "independent_col_2"], "dependent_col")
Fits the model on the input relation using the given independent columns and dependent column.
Prediction
=> model.predict(VDataFrame, X=["independent_col_1", "independent_col_2"], name="name_of_pred_column")
Predicts and adds those values inside the VDataFrame using the new name of prediction columns.
General Metrics
Link to all (link)
Mean Squared Error | R-squared | AIC | BIC | Explained Variance |
=> model.score("mse") | => model.score("r2") | => model.score("aic") | => model.score("bic") | => model.score("var") |
Max Error | Adjusted R-squared | RMSE | Mean Absolute Error | Median Absolute Error |
=> model.score("max") | => model.score("r2a") | => model.score("rmse") | => model.score("mae") | => model.score("median") |
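For readers who want the definitions behind several of these scores, here is a small pure-Python sketch that computes them from true and predicted values. The dictionary keys and the sample numbers are illustrative assumptions and do not necessarily match the strings accepted by `model.score`.

```python
from statistics import mean, median

def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from paired true/predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    y_bar = mean(y_true)
    ss_res = sum(e ** 2 for e in errors)               # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)     # total sum of squares
    mse = ss_res / len(errors)
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "r2": 1 - ss_res / ss_tot,
        "max_error": max(abs(e) for e in errors),
        "mean_abs_error": mean(abs(e) for e in errors),
        "median_abs_error": median(abs(e) for e in errors),
    }

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 8.0, 9.5])
```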
Classification-specific Metrics
Confusion Matrix (link)
=> model.confusion_matrix(pos_label="Label", cutoff=0.33)
Computes the confusion matrix for the given positive label, using the cutoff as the probability threshold.
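The effect of the `pos_label` and `cutoff` arguments can be sketched as follows: a row is predicted positive when its probability reaches the cutoff, and the four cells count how predictions line up with the true labels. The function and the toy data are assumptions for illustration, not VerticaPy's code.

```python
def confusion_matrix(y_true, y_prob, pos_label, cutoff=0.5):
    """Count true/false positives and negatives at a probability cutoff."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, prob in zip(y_true, y_prob):
        predicted_pos = prob >= cutoff
        actual_pos = truth == pos_label
        if predicted_pos and actual_pos:
            counts["tp"] += 1
        elif predicted_pos:
            counts["fp"] += 1
        elif actual_pos:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

y_true = ["yes", "no", "yes", "no", "yes"]
y_prob = [0.9, 0.4, 0.35, 0.2, 0.1]  # predicted probability of "yes"

cm_low = confusion_matrix(y_true, y_prob, pos_label="yes", cutoff=0.33)
cm_default = confusion_matrix(y_true, y_prob, pos_label="yes", cutoff=0.5)
```

Lowering the cutoff from 0.5 to 0.33 catches one more true positive here at the cost of one false positive.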
Lift Chart (link)
=> from verticapy.learn.model_selection import lift_chart
=> lift_chart("Response_Column", "Prediction_Probability", VDataFrame)
Draws a lift chart.
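A lift chart compares the positive rate among the highest-scored rows with the overall positive rate. The sketch below computes that ratio for a single fraction of the data; the function name and data are hypothetical, and the exact quantities plotted by `lift_chart` may be defined differently.

```python
def lift_at(y_true, y_score, fraction):
    """Lift of the top `fraction` of rows when ranked by descending score."""
    ranked = sorted(zip(y_score, y_true), reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(t for _, t in ranked[:top_n]) / top_n
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate

y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

lift_top_20 = lift_at(y_true, y_score, 0.2)   # positive rate 0.5 vs. base 0.3
lift_all = lift_at(y_true, y_score, 1.0)      # whole dataset, lift of 1
```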
ROC Curve (link)
=> model.roc_curve(nbins=12)
Plots the ROC curve.
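The points of a ROC curve can be sketched by sweeping a cutoff over evenly spaced values and recording the false positive rate and true positive rate at each one. Treating `nbins` as the number of cutoff steps is an assumption of this sketch, as are the function names and data.

```python
def roc_points(y_true, y_prob, nbins=12):
    """Return (false positive rate, true positive rate) pairs for nbins + 1 cutoffs."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for i in range(nbins + 1):
        cutoff = 1 - i / nbins  # sweep from 1 down to 0
        tp = sum(1 for t, p in zip(y_true, y_prob) if p >= cutoff and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_prob) if p >= cutoff and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

y_true = [1, 1, 0, 1, 0, 0]
y_prob = [0.95, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(y_true, y_prob, nbins=12)
area = auc(points)
```

A larger `nbins` gives a finer curve; with too few steps the area is only an approximation.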
Managing Models
memModel
To build models using their attributes (link)
For Linear Regression
=> from verticapy.learn.memmodel import memModel
=> model = memModel(model_type="LinearRegression", attributes={"coefficients": [0.5, 1.2], "intercept": 2})
Builds a Linear Regression model from its attributes.
=> model.predict_sql(["x1", "x2"])
Generates the SQL prediction expression for the given input columns, for deploying the model in Vertica.
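To show why a model's attributes are enough to deploy it, here is a hypothetical helper that turns linear regression coefficients and an intercept into a SQL expression. The function `linear_regression_sql` is an assumption written for this example; the exact text returned by `predict_sql` may be formatted differently.

```python
def linear_regression_sql(columns, coefficients, intercept):
    """Build 'intercept + coef_1 * col_1 + ...' as a SQL expression string."""
    if len(columns) != len(coefficients):
        raise ValueError("columns and coefficients must have the same length")
    terms = [f"{coef} * {col}" for coef, col in zip(coefficients, columns)]
    return " + ".join([str(intercept)] + terms)

# Same attributes as the memModel example: coefficients [0.5, 1.2], intercept 2.
expr = linear_regression_sql(["x1", "x2"], [0.5, 1.2], 2)
query = f"SELECT {expr} AS prediction FROM my_table"  # my_table is a placeholder
```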
Generate SQL Code (link)
For Linear Regression
=> model.to_sql()
Generates the SQL code for deploying the model in Vertica.