RF_CLASSIFIER

Trains a random forest model for classification on an input table or view.

Important: Before using a machine learning function, be aware that any ongoing transactions might be committed.

Syntax

RF_CLASSIFIER ( 'model_name', 'input_relation', 'response_column',
                'predictor_col1, predictor_col2, ..., predictor_coln'
                [USING PARAMETERS [exclude_columns='col1, col2, ..., coln',]
                                  [ntree=value,]
                                  [mtry=value,]
                                  [sampling_size=value,]
                                  [max_depth=value,]
                                  [max_breadth=value,]
                                  [min_leaf_size=value,]
                                  [min_info_gain=value,]
                                  [nbins=value] ] )

Arguments

model_name

The name under which the trained model is stored. Model names are case-insensitive.

input_relation

The table or view that contains the training samples.

response_column

The name of the column in input_relation that represents the dependent variable.

This column must be of data type CHAR or VARCHAR.
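If the class labels are stored in a numeric column, one way to meet this requirement is to train on a view that casts the labels to VARCHAR. The following is only a sketch: the table iris_raw, its integer column species_id, the view iris_labeled, and the model name are all hypothetical names chosen for illustration.

```sql
-- Hypothetical source table (assumption, not part of this reference):
--   iris_raw(sepal_length, sepal_width, petal_length, petal_width, species_id INT)
-- Expose the integer label as VARCHAR so it can serve as response_column.
CREATE VIEW iris_labeled AS
SELECT sepal_length,
       sepal_width,
       petal_length,
       petal_width,
       CAST(species_id AS VARCHAR(20)) AS species
FROM iris_raw;

-- Train on the view; input_relation accepts either a table or a view.
SELECT RF_CLASSIFIER ('myRFModel_cast', 'iris_labeled', 'species',
                      'sepal_length, sepal_width, petal_length, petal_width');
```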

predictor_columns

A comma-separated list of the columns in input_relation that represent the independent variables for the model. Each column must be of data type CHAR, VARCHAR, BOOLEAN, INT, or FLOAT.

CHAR, VARCHAR and BOOLEAN are treated as categorical data types. All other data types are treated as numeric data types.

Parameters

exclude_columns='col1, col2, ..., coln'

(Optional) The columns from input_relation that you want to exclude from the predictor_columns argument.
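As an illustration, the call below lists four predictor columns and then drops one of them with exclude_columns. It assumes the iris table and column names used in the Examples section. The model name is hypothetical.

```sql
-- Petal_Width is named in the predictor list but removed by exclude_columns,
-- so the model is trained on the remaining three predictors.
SELECT RF_CLASSIFIER ('myRFModel_excl', 'iris', 'Species',
                      'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
                      USING PARAMETERS exclude_columns='Petal_Width');
```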

ntree=value

(Optional) A positive integer number that indicates the number of trees in the forest.

Default Value: 20

Valid Range: (0 to 1000]

mtry=value

(Optional) A positive integer number that indicates the number of features to be considered at the split of a tree node.

Default Value: The square root of the total number of predictors.

Valid Range: A positive integer less than or equal to the number of predictors.
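For instance, with the four predictors of the iris table from the Examples section, the default works out to the square root of 4, that is, 2. The sketch below overrides it so that three features are considered at each split. The model name is a hypothetical choice.

```sql
-- Four predictors are supplied, so mtry must be between 1 and 4.
-- Omitting mtry would use the default (square root of 4 = 2).
SELECT RF_CLASSIFIER ('myRFModel_mtry3', 'iris', 'Species',
                      'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
                      USING PARAMETERS mtry=3);
```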

sampling_size=value

(Optional) A number that indicates the portion of the input data set that is randomly picked for training each tree.

Default Value: 0.632

Valid Range: (0.0 to 1.0]

max_depth=value

(Optional) A positive integer number that specifies the maximum depth for growing each tree.

Default Value: 5

Valid Range: [1 to 100]

max_breadth=value

(Optional) A positive integer number that specifies the maximum number of leaf nodes a tree in the forest can have.

Default Value: 32

Valid Range: [1 to 1e9]

min_leaf_size=value

(Optional) A positive integer number that specifies the minimum number of samples each branch must have after a node is split. A split that leaves any branch with fewer samples is discarded.

Default Value: 1

Valid Range: [1 to 1e6]

min_info_gain=value

(Optional) A non-negative number. Any split with information gain less than this threshold will be discarded.

Default Value: 0.0

Valid Range: [0.0 to 1.0)

nbins=value

(Optional) A positive integer number that indicates the number of bins to use for continuous features.

Default Value: 32

Valid Range: [2 to 1000]
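The tree-shape parameters can be combined in a single call. The sketch below reuses the iris table from the Examples section. The model name and the specific values are illustrative choices within the documented ranges, not recommended settings.

```sql
-- Each value below lies inside the valid range documented for its parameter.
SELECT RF_CLASSIFIER ('myRFModel_tuned', 'iris', 'Species',
                      'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
                      USING PARAMETERS max_depth=10,       -- trees may grow deeper than the default of 5
                                       max_breadth=64,     -- up to 64 leaf nodes per tree
                                       min_leaf_size=5,    -- discard splits that leave a branch under 5 samples
                                       min_info_gain=0.01, -- discard splits with information gain below 0.01
                                       nbins=16);          -- 16 bins for continuous features
```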

Privileges

To use RF_CLASSIFIER, you must either be a superuser or have CREATE privileges for the schema of the output view and SELECT privileges for the input table or view. No privileges are needed on the function itself.

See GRANT (Schema) and GRANT (Table).
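A superuser might grant these privileges along the following lines. This is a sketch only: the user analyst and the schema public are hypothetical names, and iris stands in for whatever input table you train on.

```sql
-- Run as a superuser or as the owner of the objects.
-- 'analyst' and 'public' are assumed names for illustration.
GRANT CREATE ON SCHEMA public TO analyst;  -- schema where the output is created
GRANT SELECT ON TABLE iris TO analyst;     -- input table that holds the training samples
```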

Examples

This example trains a random forest model named myRFModel on the iris table, using 100 trees and a sampling size of 0.3:

=> SELECT RF_CLASSIFIER ('myRFModel', 'iris', 'Species', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width' 
USING PARAMETERS ntree=100, sampling_size=0.3);
RF_CLASSIFIER
--------------------------------------------------
The random forest is trained
(1 row)

 

See Also