
Training and Testing Sets
Before you test a supervised model, you'll need separate, non-overlapping sets for training and testing.
In VerticaPy, the 'train_test_split' method uses a random number generator to decide how to split the data.
%load_ext verticapy.sql
%sql SELECT SEEDED_RANDOM(0);
The 'SEEDED_RANDOM' function returns a random number in the interval [0, 1). Since the seed is user-provided, the results are reproducible: passing '0' as the seed always returns the same value.
%sql SELECT SEEDED_RANDOM(0);
A different seed will generate a different value.
%sql SELECT SEEDED_RANDOM(1);
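This behavior is analogous to any seeded pseudo-random generator. As a plain-Python illustration (not VerticaPy's implementation), the same seed always reproduces the same draw, while a different seed yields a different one:

```python
import random

# Same seed: the generator is deterministic, so the draws are identical.
a = random.Random(0).random()
b = random.Random(0).random()
assert a == b

# Different seed: the first draw diverges.
c = random.Random(1).random()
assert a != c
```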
The 'train_test_split' method generates a random seed, which is then shared between the training and testing sets so that the two relations never overlap.
from verticapy.datasets import load_titanic
titanic = load_titanic()
train, test = titanic.train_test_split()
train.shape()
test.shape()
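The idea behind sharing a seed can be sketched in plain Python (a hypothetical illustration, not VerticaPy's actual code): every row receives a deterministic pseudo-random value from one seeded stream, and a single threshold splits the rows into two disjoint, exhaustive sets. Re-running with the same seed reproduces the exact same split.

```python
import random

def seeded_split(n_rows, seed, test_size=0.33):
    """Assign row indices to train/test using one seeded random stream."""
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(n_rows)]
    test = [i for i, r in enumerate(draws) if r < test_size]
    train = [i for i, r in enumerate(draws) if r >= test_size]
    return train, test

train_idx, test_idx = seeded_split(1000, seed=42)
train_again, test_again = seeded_split(1000, seed=42)

# Same seed -> identical split; the sets are disjoint and cover every row.
assert train_idx == train_again and test_idx == test_again
assert set(train_idx).isdisjoint(test_idx)
assert len(train_idx) + len(test_idx) == 1000
```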
Note that 'SEEDED_RANDOM' depends on the order of your data. That is, if your data isn't sorted by a unique feature, the selected data might be inconsistent. To avoid this, we'll want to use the 'order_by' parameter.
train, test = titanic.train_test_split(order_by = {"fare": "asc"})
Even if 'fare' contains duplicate values, ordering the data drastically decreases the likelihood of a collision; for a fully deterministic split, order by a unique key.
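Why ordering matters can be sketched in plain Python (an illustration of the concept, not VerticaPy internals): a positional seeded split selects different rows if the data arrives in a different physical order, whereas sorting by a (near-)unique key first makes the selection consistent across runs.

```python
import random

def split_ids(rows, seed=0, test_size=0.33):
    """Return the row ids assigned to the test set, by position."""
    rng = random.Random(seed)
    return {row_id for row_id in rows if rng.random() < test_size}

ids = list(range(100))
shuffled = ids[:]
random.Random(7).shuffle(shuffled)

# Same seed, different physical order: different ids typically land in
# the test set, because selection is positional.
unsorted_a = split_ids(ids)
unsorted_b = split_ids(shuffled)

# Sorting by a unique key first restores a consistent selection.
assert split_ids(sorted(ids)) == split_ids(sorted(shuffled))
```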
Let's create a model and evaluate it.
from verticapy.learn.linear_model import LinearRegression
lr = LinearRegression("MyModel")
When fitting the model with the 'fit' function, you can use the parameter 'test_relation' to score your data on a specific relation.
lr.fit(train,
       ["age", "fare"],
       "survived",
       test)
lr.report()
All model evaluation methods will now use the test relation for scoring, letting you assess your model's performance on data it never saw during training.
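As a concrete illustration of one metric the report contains (plain Python, independent of VerticaPy), R² compares the model's squared error against that of a constant mean predictor:

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# A perfect prediction scores 1.0; always predicting the mean scores 0.0.
y = [1.0, 2.0, 3.0, 4.0]
assert r2_score(y, y) == 1.0
assert r2_score(y, [2.5] * 4) == 0.0
```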