LOGISTIC_REG

Executes logistic regression on an input relation. The result is a logistic regression model.

Syntax

LOGISTIC_REG ( 'model‑name', 'input‑relation', 'response‑column', 'predictor‑columns'
                 [ USING PARAMETERS [exclude_columns='excluded‑columns']
                                    [, optimizer='optimizer‑method']
                                    [, regularization='regularization‑method']
                                    [, epsilon=epsilon‑value]
                                    [, max_iterations=iterations]
                                    [, lambda=lambda‑value]
                                    [, alpha=alpha‑value] ] )

Arguments

model‑name

Identifies the model to create, where model‑name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.

input‑relation

The table or view that contains the training data for building the model. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.

response‑column

The input column that represents the dependent variable or outcome. The column value must be 0 or 1, and of type numeric or BOOLEAN. The function automatically skips all other values.
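
Because rows with any other response value are skipped without an error, it can help to count them before training. A minimal sketch in plain SQL, assuming the mtcars table and the am response column used in the Examples section, and assuming NULL is treated as an invalid value:

```sql
-- Count rows whose response value is neither 0 nor 1.
-- Assumes am is a numeric column; NULL is counted on the assumption
-- that the function treats it as invalid.
SELECT COUNT(*) AS rows_with_invalid_response
FROM mtcars
WHERE am IS NULL OR am NOT IN (0, 1);
```

This checks the response column only. Rows can also be skipped because of invalid predictor values, so the count may differ from the skippedRows model attribute.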

predictor‑columns

Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns must include response‑column, and any columns that are invalid as predictor columns.

All predictor columns must be of type numeric or BOOLEAN; otherwise the model is invalid.

All BOOLEAN predictor values are converted to FLOAT values before training: 0 for false, 1 for true. No type checking occurs during prediction, so you can use a BOOLEAN predictor column in training, and during prediction provide a FLOAT column of the same name. In this case, all FLOAT values must be either 0 or 1.
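
As an illustration of the asterisk form, the following sketch trains on every column of mtcars except those listed in exclude_columns. The model name myLogisticRegModelAll is hypothetical, and car_model is an assumed non-numeric column; substitute whichever columns in your table are invalid as predictors.

```sql
-- '*' selects all columns, so the response column (am) must be excluded,
-- along with any column that is not numeric or BOOLEAN.
-- car_model is an assumed column name, used here for illustration only.
SELECT LOGISTIC_REG('myLogisticRegModelAll', 'mtcars', 'am', '*'
                    USING PARAMETERS exclude_columns='am, car_model');
```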

Parameter Settings

Parameter name Set to…
exclude_columns

Comma-separated list of columns from predictor‑columns to exclude from processing.

optimizer

The optimizer method used to train the model, one of the following:

  • Newton
  • BFGS
  • CGD

    If you select CGD, regularization‑method must be set to L1 or ENet, otherwise the function returns an error.

Default: CGD if regularization‑method is set to L1 or ENet, otherwise Newton.

regularization

Determines the method of regularization, one of the following:

  • None (default)
  • L1
  • L2
  • ENet
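
epsilon

The error tolerance that determines when the algorithm has converged. Training stops once the specified accuracy is reached or max_iterations is exceeded, whichever comes first.

Default: 1e-6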

max_iterations

The maximum number of iterations the algorithm performs. Training stops at this limit even if the accuracy specified by epsilon has not been reached.

Default: 100

lambda

The regularization parameter value, an integer ≥ 0.

Default: 1

alpha

ENet mixture parameter that defines how much L1 versus L2 regularization to provide, a value in the range 0 to 1, where:

  • 0: L2
  • 1: L1

The function returns a warning if this parameter is set without ENet regularization.
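
The parameters described above can be combined as in the following sketch, which requests elastic net regularization. The model name myLogisticRegModelENet is hypothetical, and the lambda, alpha, epsilon, and max_iterations values are illustrative assumptions, not recommendations. In particular, alpha=0.5 assumes that values between 0 and 1 are accepted.

```sql
-- CGD is already the default optimizer when regularization is L1 or ENet;
-- it is spelled out here only for clarity.
SELECT LOGISTIC_REG('myLogisticRegModelENet', 'mtcars', 'am',
                    'mpg, cyl, disp, drat, wt, qsec, vs, gear, carb'
                    USING PARAMETERS optimizer='CGD',
                                     regularization='ENet',
                                     lambda=1,
                                     alpha=0.5,
                                     epsilon=1e-6,
                                     max_iterations=200);
```

Specifying optimizer='CGD' with regularization set to None or L2 would instead return an error, as noted under the optimizer parameter.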

Model Attributes

Attribute Description
data

The data for the function, including:

  • coeffNames: Names of the coefficients, starting with the intercept, followed by the predictor names in the order specified in the call.
  • coeff: Vector of estimated coefficients, with the same order as coeffNames
  • stdErr: Vector of the standard error of the coefficients, with the same order as coeffNames
  • zValue (for logistic regression): Vector of z-values of the coefficients, in the same order as coeffNames
  • tValue (for linear regression): Vector of t-values of the coefficients, in the same order as coeffNames
  • pValue: Vector of p-values of the coefficients, in the same order as coeffNames

regularization

The type of regularization used to train the model.

lambda

The regularization parameter. Higher values enforce stronger regularization. This value must be nonnegative.

alpha

The elastic net mixture parameter.

iterations

The number of iterations that actually occurred before convergence, up to max_iterations.

skippedRows

The number of rows of input‑relation that were skipped because they contained an invalid value.

processedRows

The total number of rows in input‑relation minus skippedRows.

callStr

The values of all input arguments that were specified when the function was called.
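
These attributes can be inspected after training. The sketch below assumes that the GET_MODEL_ATTRIBUTE function is available in your Vertica version and accepts the model_name and attr_name parameters shown. The attribute names are taken from the table above and may differ between versions, so check that function's reference page for the exact syntax.

```sql
-- Coefficient details (coeffNames, coeff, stdErr, zValue, pValue).
-- attr_name values are assumed to match the attribute names listed above.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myLogisticRegModel',
                                            attr_name='data');

-- Number of input rows skipped because they contained an invalid value.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myLogisticRegModel',
                                            attr_name='skippedRows');
```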

Privileges

Superuser, or SELECT privileges on the input relation

Examples

=> SELECT LOGISTIC_REG('myLogisticRegModel', 'mtcars', 'am',
                       'mpg, cyl, disp, hp, drat, wt, qsec, vs, gear, carb'
                        USING PARAMETERS exclude_columns='hp', optimizer='BFGS');
        LOGISTIC_REG
----------------------------
 Finished in 20 iterations

(1 row)
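
Once trained, the model can be applied to data with a prediction function. A hedged sketch, assuming PREDICT_LOGISTIC_REG is available in your Vertica version with the model_name parameter shown. The predictor columns are passed in the same order used for training, with hp omitted because it was excluded from the model.

```sql
-- Compare the actual response (am) with the model's prediction.
-- Assumes PREDICT_LOGISTIC_REG takes the predictor columns followed by
-- USING PARAMETERS model_name; verify against its reference page.
SELECT am,
       PREDICT_LOGISTIC_REG(mpg, cyl, disp, drat, wt, qsec, vs, gear, carb
                            USING PARAMETERS model_name='myLogisticRegModel') AS predicted_am
FROM mtcars;
```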

See Also