KMEANS

Executes the k-means algorithm on an input relation. The result is a model with a list of cluster centers.

You can export the resulting k-means model in VERTICA_MODELS or PMML format to apply it on data outside Vertica. You can also train a k-means model elsewhere, then import it to Vertica in PMML format to predict on data in Vertica.

Syntax

KMEANS ( 'model‑name', 'input‑relation', 'input‑columns', 'num‑clusters'
        [ USING PARAMETERS 
              [exclude_columns = 'excluded‑columns']
	       [, max_iterations = max‑iterations]
	       [, epsilon = epsilon‑value]
	       [, { init_method = 'init‑method' } | { initial_centers_table = 'init‑table' } ]
	       [, output_view = 'output‑view']
	       [, key_columns = 'key‑columns'] ] )

Arguments

model‑name

Identifies the model to create, where model‑name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.

input‑relation

The table or view that contains the input data for k-means. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.

input‑columns

Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.

num‑clusters

The number of clusters to create, an integer ≤ 10,000. This argument represents the k in k-means.

Parameters

Parameters init_method and initial_centers_table are mutually exclusive. If you set both, the function returns an error.

exclude_columns

Comma-separated list of column names from input‑columns to exclude from processing.

max_iterations

The maximum number of iterations the algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm may not converge.

Default: 10

epsilon

Determines whether the algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of
'epsilon' from the previous iteration.

Default: 1e-4

init_method

The method used to find the initial cluster centers, one of the following:

  • random
  • kmeanspp (default): kmeans++ algorithm

    This value can be memory intensive for high k. If the function returns an error that not enough memory is available, decrease the value of k or use the random method.

initial_centers_table

The table with the initial cluster centers to use. Supply this value if you know the initial centers to use and do not want Vertica to find the initial cluster centers for you.

output_view

The name of the view where you save the assignments of each point to its cluster. You must have CREATE privileges on the schema where the view is saved.

key_columns

Comma-separated list of column names that identify the output rows. Columns must be in the input‑columns argument list. To exclude these and other input columns from being used by the algorithm, list them in parameter exclude_columns.

Model Attributes

centers A list that contains the center of each cluster.
metrics

A string summary of several metrics related to the quality of the clustering.

Examples

The following example creates k-means model myKmeansModel and applies it to input table iris1. The call to APPLY_KMEANS mixes column names and constants. When a constant is passed in place of a column name, the constant is substituted for the value of the column in all rows:

=> SELECT KMEANS('myKmeansModel', 'iris1', '*', 5
USING PARAMETERS max_iterations=20, output_view='myKmeansView', key_columns='id', exclude_columns='Species, id');
           KMEANS
----------------------------
 Finished in 12 iterations

(1 row)	
=> SELECT id, APPLY_KMEANS(Sepal_Length, 2.2, 1.3, Petal_Width 
USING PARAMETERS model_name='myKmeansModel', match_by_pos='true') FROM iris2;
 id  | APPLY_KMEANS
-----+--------------
   5 |            1
  10 |            1
  14 |            1
  15 |            1
  21 |            1
  22 |            1
  24 |            1
  25 |            1
  32 |            1
  33 |            1
  34 |            1
  35 |            1
  38 |            1
  39 |            1
  42 |            1	
…
 (60 rows)

See Also