Clustering Data Using k-means

This k-means example uses two small data sets named agar_dish_1 and agar_dish_2. Using the numeric data in the agar_dish_1 data set, you can cluster the data into k clusters. Then, using the created k-means model, you can run APPLY_KMEANS on agar_dish_2 and assign them to the clusters created in your original model.

Before you begin the example, make sure that you have loaded the Machine Learning sample data.

Clustering Training Data into k Clusters

  1. Create the k-means model, named agar_dish_kmeans using the agar_dish_1 table data.

    => SELECT KMEANS('agar_dish_kmeans', 'agar_dish_1', '*', 5
                      USING PARAMETERS exclude_columns ='id', max_iterations=20, output_view='agar_1_view', 
                      key_columns='id');
               KMEANS
    ---------------------------
     Finished in 7 iterations
    
    (1 row)

    The example creates a model named agar_dish_kmeans and a view containing the results of the model named agar_1_view. You might get different results when you run the clustering algorithm. This is because KMEANS randomly picks initial centers by default.

  2. View the output of agar_1_view.

    => SELECT * FROM agar_1_view;
     id  | cluster_id
    -----+------------
       2 |          4
       5 |          4
       7 |          4
       9 |          4
      13 |          4
    .
    .
    .
    (375 rows)
  3. Because you specified the number of clusters as 5, verify that the function created five clusters. Count the number of data points within each cluster.

    => SELECT cluster_id, COUNT(cluster_id) as Total_count
       FROM agar_1_view
       GROUP BY cluster_id;
     cluster_id | Total_count
    ------------+-------------
              0 |          76
              2 |          80
              1 |          74
              3 |          73
              4 |          72
    (5 rows)
    

    From the output, you can see that five clusters were created: 0, 1, 2, 3, and 4.

    You have now successfully clustered the data from agar_dish_1.csv into five distinct clusters.

Summarizing Your Model

View the summary output of agar_dish_means using the GET_MODEL_SUMMARY function.

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='agar_dish_kmeans');
----------------------------------------------------------------------------------
=======
centers
=======
x       |   y
--------+--------
0.49708 | 0.51116
-7.48119|-7.52577
-1.56238|-1.50561
-3.50616|-3.55703
-5.52057|-5.49197

=======
metrics
=======
Evaluation metrics:
  Total Sum of Squares: 6008.4619
  Within-Cluster Sum of Squares:
      Cluster 0: 12.083548
      Cluster 1: 12.389038
      Cluster 2: 12.639238
      Cluster 3: 11.210146
      Cluster 4: 12.994356 
  Total Within-Cluster Sum of Squares: 61.316326 
  Between-Cluster Sum of Squares: 5947.1456
  Between-Cluster SS / Total SS: 98.98%
Number of iterations performed: 2
Converged: True
Call:
kmeans('public.agar_dish_kmeans', 'agar_dish_1', '*', 5
USING PARAMETERS exclude_columns='id', max_iterations=20, epsilon=0.0001, init_method='kmeanspp', 
distance_method='euclidean', output_view='agar_view_1', key_columns='id')
(1 row)

Clustering Data Using a k-means Model

Using agar_dish_kmeans, the k-means model you just created, you can assign the points in agar_dish_2 to cluster centers.

Create a table named kmeans_results, using the agar_dish_2 table as your input table and the agar_dish_kmeans model for your initial cluster centers.

Add only the relevant feature columns to the arguments in the APPLY_KMEANS function.

=> CREATE TABLE kmeans_results AS
        (SELECT id,
                APPLY_KMEANS(x, y
                             USING PARAMETERS 
                                              model_name='agar_dish_kmeans') AS cluster_id
         FROM agar_dish_2);

The kmeans_results table shows that the agar_dish_kmeans model correctly clustered the agar_dish_2 data.

See Also