Finding the “K” in K-means Clustering With a UDx

Posted October 4, 2019 by Bryan Herger, Vertica Big Data Solution Architect at Micro Focus

[Image: clusters of points, colored by group, on a black background]

You can apply k-means clustering to partition data points into k different groups. Along with the data, the number of clusters “k” is an input to the algorithm. Common examples like the Iris data set tell you upfront how many different groups exist, so you set k=3. What if you don’t know how many clusters to expect in your data set?
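To illustrate k as an input, here is a minimal sketch that runs outside Vertica, using the kmeans function from base R on the built-in iris data frame. The seed, the nstart value, and the variable names are arbitrary choices made for this example.

```r
# Cluster the four numeric Iris measurements into k groups with base R.
features <- iris[, 1:4]   # drop the Species label column

set.seed(42)              # k-means starts from random centers; fix the seed for repeatability
k <- 3                    # known in advance for Iris, but usually unknown
fit <- kmeans(features, centers = k, nstart = 10)

# Compare the cluster assignments with the known species labels.
print(table(cluster = fit$cluster, species = iris$Species))
```

Changing k changes the partition, which is why a reasonable estimate of k matters before you start clustering.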

There are several approaches to estimating “k”. Cebeci and Cebeci combined several of them into the R library “kpeaks”, which we’ll use here to predict “k”.

The UDx is built on an R library, so first install the Vertica R package from your usual source.

Next, install the jsonlite and kpeaks packages into the Vertica R installation as shown at https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ExtendingVertica/R/RPackages.htm

Alternatively, start the R interpreter bundled with Vertica and install the packages from its prompt:

$ sudo /opt/vertica/R/bin/R
install.packages("jsonlite");
install.packages("kpeaks");

You’ll likely need to select a CRAN mirror from the download list.

Download the two attached text files, rename them to “kpeaks.R” and “kpeaks_test.sql” so that each has the correct extension, and copy them to a cluster node. Then run kpeaks_test.sql with vsql. The script does the following:

• Loads the R library
• Defines the kpeaks function
• Loads the Iris clustering data set
• Runs “kpeaks” on the Iris data set

You should get JSON output like the following:

KPeaks_User
{"am":[2],"med":[2],"mod":[2],"mppc":[2],"cr":[2],"ciqr":[2],"mq3m":[3],"mtl":[2],"avgk":[2],"modk":[2],"mtlk":[2],"dst":["Full"],"pcounts":[2,1,2,3]}

Most fields in the output report 2, “mq3m” reports 3, and the values in “pcounts” range from 1 to 3. So the methods implemented by kpeaks suggest there are 1 to 3 clusters in the data set. This should reduce the number of trials needed to identify the best “k” for a data set. Check out the references below for a better understanding of what each kpeaks result means.
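If you want to work with the result programmatically, one option is to parse it with jsonlite, which was installed earlier. The sketch below assumes the JSON string has been copied by hand from the vsql output into an R session. The variable names are illustrative only and are not part of the attached files.

```r
library(jsonlite)

# JSON string copied from the sample output above
out <- '{"am":[2],"med":[2],"mod":[2],"mppc":[2],"cr":[2],"ciqr":[2],"mq3m":[3],"mtl":[2],"avgk":[2],"modk":[2],"mtlk":[2],"dst":["Full"],"pcounts":[2,1,2,3]}'
res <- fromJSON(out)

# Collect the single-valued numeric fields and count how often each value appears
fields <- c("am", "med", "mod", "mppc", "cr", "ciqr", "mq3m", "mtl", "avgk", "modk", "mtlk")
estimates <- unlist(res[fields])
print(table(estimates))

# Use the span of pcounts as the list of candidate k values to try
k_range <- range(res$pcounts)
candidates <- seq(k_range[1], k_range[2])
print(candidates)   # [1] 1 2 3
```

Each candidate can then be tried as k in a k-means run, which is a much shorter list than scanning a wide range of values blindly.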

For more information and the math behind the library, see:

• kpeaks documentation – https://cran.r-project.org/web/packages/kpeaks/kpeaks.pdf
• kpeaks publication – Cebeci, Z., & Cebeci, C. “kpeaks: An R Package for Quick Selection of K for Cluster Analysis”, https://www.researchgate.net/publication/331258718_kpeaks_An_R_Package_for_Quick_Selection_of_K_for_Cluster_Analysis

Attachments:

kpeaks_test_sql
kpeaks_R.txt

Enjoy!
