Clustering Data Using k-means

This k-means example uses two small data sets named agar_dish_1 and agar_dish_2. Using the numeric data in the agar_dish_1 data set, you can cluster the data into k clusters. Then, using the created k-means model, you can run APPLY_KMEANS on agar_dish_2 and assign them to the clusters created in your original model.

Before you begin the example, make sure that you have loaded the Machine Learning sample data.

Clustering Training Data into k Clusters

  1. Using the KMEANS function, run k-means on the agar_dish_1 table.

    => SELECT KMEANS('agar_dish_kmeans', 'agar_dish_1', '*', 5
                      USING PARAMETERS exclude_columns ='id', max_iterations=20, output_view=agar_1_view, 
                      key_columns='id');
               KMEANS
    ---------------------------
     Finished in 7 iterations
    
    (1 row)

    The example creates a model named agar_dish_kmeans and a view containing the results of the model named agar_1_view. You might get different results when you run the clustering algorithm. This is because KMEANS randomly picks initial centers by default.

  2. View the output of agar_1_view.

    => SELECT * FROM agar_1_view;
     id  | cluster_id
    -----+------------
       2 |          4
       5 |          4
       7 |          4
       9 |          4
      13 |          4
    .
    .
    .
    (375 rows)
  3. Because you specified the number of clusters as 5, verify that the function created five clusters. Count the number of data points within each cluster.

    => SELECT cluster_id, COUNT(cluster_id) as Total_count
       FROM agar_1_view
       GROUP BY cluster_id;
     cluster_id | Total_count
    ------------+-------------
              0 |          76
              2 |          80
              1 |          74
              3 |          73
              4 |          72
    (5 rows)
    

    From the output, you can see that five clusters were created: 0, 1, 2, 3, and 4.

    You have now successfully clustered the data from agar_dish_1.csv into five distinct clusters.

Summarizing Your Model

You can also view a summary of the model you created using the SUMMARIZE_MODEL function. This summary tells you how many cluster centers your model contains, along with other metrics.

=> SELECT SUMMARIZE_MODEL('agar_dish_kmeans');
-[ RECORD 1 ]---+-------------------------------------------------------------------------------------------------
SUMMARIZE_MODEL | k-Means Model Summary:

 Number of clusters: 5
 Input columns: x, y
 Cluster centers:
     0: {x: -7.4811859, y: -7.5257672}
     1: {x: -3.5061558, y: -3.5570295}
     2: {x: -5.5205715, y: -5.4919726}
     3: {x: -1.5623823, y: -1.5056116}
     4: {x:  0.4970753, y:  0.5111612}
 Evaluation metrics:
     Total Sum of Squares: 6008.4619
     Within-Cluster Sum of Squares:
         Cluster 0: 12.389038
         Cluster 1: 11.210146
         Cluster 2: 12.994356
         Cluster 3: 12.639238
         Cluster 4: 12.083548
     Total Within-Cluster Sum of Squares: 61.316326
     Between-Cluster Sum of Squares: 5947.1456
     Between-Cluster SS / Total SS: 98.98%
 Number of iterations performed: 6
 Converged: True
 Call:
kmeans(model_name=agar_dish_kmeans, input_table=agar_dish_training, input_columns=*, num_clusters=5,
exclude_columns=id, max_iterations=20, epsilon=0.0001, init_method=random, initial_centers_table=,
distance_method=euclidean, outputView=agar_training_view, key_columns=id
)

Clustering Data Using a k-means Model

Using agar_dish_kmeans, the k-means model you just created, you can classify the agar_dish_2 data set.

Create a table named kmeans_results, using the agar_dish_2 table as your input table and the agar_dish_kmeans model for your initial cluster centers.

Add only the relevant feature columns to the arguments in the APPLY_KMEANS function.

=> CREATE TABLE kmeans_results AS
        (SELECT id,
                APPLY_KMEANS(x, y
                             USING PARAMETERS 
                                              model_name='agar_dish_kmeans') AS cluster_id
         FROM agar_dish_2);

The kmeans_results table shows that the agar_dish_kmeans model correctly clustered the agar_dish_2 data.

See Also