VerticaPy

Python API for Vertica Data Science at Scale

Training and Testing Sets

Before you test a supervised model, you'll need separate, non-overlapping sets for training and testing.

In VerticaPy, the 'train_test_split' method uses a random number generator to decide how to split the data.

In [22]:
%load_ext verticapy.sql
%sql SELECT SEEDED_RANDOM(0);
Execution: 0.007s
Out[22]:
SEEDED_RANDOM
0.548813502304256
Rows: 1-1 | Column: SEEDED_RANDOM | Type: float

The 'SEEDED_RANDOM' function returns a random number in the interval [0,1). Since the seed is user-provided, these results are reproducible: in this example, passing '0' as the seed always returns the same value.

In [23]:
%sql SELECT SEEDED_RANDOM(0);
Execution: 0.005s
Out[23]:
SEEDED_RANDOM
0.548813502304256
Rows: 1-1 | Column: SEEDED_RANDOM | Type: float

A different seed will generate a different value.

In [24]:
%sql SELECT SEEDED_RANDOM(1);
Execution: 0.008s
Out[24]:
SEEDED_RANDOM
0.417021998437122
Rows: 1-1 | Column: SEEDED_RANDOM | Type: float

The 'train_test_split' method generates a random seed, which is then shared between the training and testing sets so that the two never overlap.
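
Conceptually, sharing a seed means both sets filter the same reproducible random column with complementary conditions. Here is a minimal sketch of the idea in raw SQL, assuming the dataset lives in a relation named 'titanic'; the seed '42' and the 2/3 cutoff are arbitrary illustration values, not what 'train_test_split' actually generates:

%sql SELECT * FROM titanic WHERE SEEDED_RANDOM(42) < 0.67   -- training rows: roughly two thirds of the data
%sql SELECT * FROM titanic WHERE SEEDED_RANDOM(42) >= 0.67  -- testing rows: the complement, same seed

Because both queries use the same seed, each row gets the same random value in both, so no row can land in both sets.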

In [25]:
from verticapy.datasets import load_titanic
titanic = load_titanic()
train, test = titanic.train_test_split()
In [26]:
train.shape()
Out[26]:
(827, 14)
In [27]:
test.shape()
Out[27]:
(407, 14)
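
By default, about a third of the rows end up in the test set (407 of 1234 here). Assuming your version of VerticaPy exposes the 'test_size' parameter on 'train_test_split' (recent releases do), you can adjust that proportion; a quick sketch:

train, test = titanic.train_test_split(test_size = 0.2)  # keep roughly 20% of the rows for testing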

Note that 'SEEDED_RANDOM' depends on the order of your data: if the data isn't sorted by a unique feature, the rows selected for each set can differ between runs. To avoid this, use the 'order_by' parameter.

In [28]:
train, test = titanic.train_test_split(order_by = {"fare": "asc"})

Even if 'fare' contains duplicates, ordering the data alone drastically decreases the likelihood of a collision.
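
To rule out collisions entirely, order by a combination of columns that uniquely identifies each row. A sketch, assuming 'fare' together with 'name' forms a unique key in the data and that 'order_by' accepts multiple sort columns (swap in whatever combination is actually unique in your data):

# a composite sort key makes the row order, and thus the split, fully deterministic
train, test = titanic.train_test_split(order_by = {"fare": "asc", "name": "asc"})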

Let's create a model and evaluate it.

In [29]:
from verticapy.learn.linear_model import LinearRegression
lr = LinearRegression("MyModel")

When fitting the model with the 'fit' method, you can use the 'test_relation' parameter to score your model on a specific relation.

In [30]:
lr.fit(train,
       ["age", "fare"],
       "survived",
       test)
lr.report()
Out[30]:
                         value
explained_variance       0.0624802875215391
max_error                0.736622624880268
median_absolute_error    0.387460828317489
mean_absolute_error      0.454952951370068
mean_squared_error       0.227156783963415
root_mean_squared_error  0.476609676741267
r2                       0.0594284048444568
r2_adj                   0.05369321219106937
aic                      -484.50661147280925
bic                      -473.1736508420909
Rows: 1-10 | Columns: 2

All model evaluation methods will now use the test relation for scoring, so you can assess your model's performance on unseen data.
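
For instance, you can retrieve a single metric the same way. A sketch, assuming the 'score' method of this verticapy.learn API generation takes the metric name through its 'method' parameter:

lr.score(method = "r2")  # R² computed on the test relation passed at fit time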