Finding the "K" in K-means Clustering With a UDx

[Image: clusters of points, colored by group, on a black background]

You can apply k-means clustering to partition data points into k different groups. Along with the data, the number of clusters “k” is an input to the algorithm. Common examples like the Iris data set tell you upfront how many different groups exist, so you set k=3. What if you don’t know how many clusters to expect in your data set?
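To make the role of “k” concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) in pure Python. It is not the Vertica implementation. The function name `kmeans`, its parameters, and the toy data are made up for illustration. The point to notice is that k must be supplied by the caller.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Illustrative Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Toy data (hypothetical): two well-separated groups of points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

With a poor choice of k the algorithm still returns k groups, which is why an estimate of k before clustering is useful.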

There are several approaches to estimate “k”. Cebeci and Cebeci combined several methods into the R library “kpeaks”, which we’ll use here to predict “k”.
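As the package name suggests, kpeaks bases its estimates on counting peaks in the frequency distribution of each feature. The sketch below illustrates that general idea only. It is a simplified assumption, not the kpeaks algorithm: the function `count_peaks`, the fixed bin count, and the sample values are hypothetical, and flat-topped peaks are ignored.

```python
def count_peaks(values, bins=4):
    """Count strict local maxima in an equal-width histogram of one feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values match
    freqs = [0] * bins
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        freqs[idx] += 1
    # Pad with zeros so a peak in the first or last bin is also detected.
    padded = [0] + freqs + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )


# Hypothetical feature with two clumps of values.
feature = [0, 1, 1, 1, 2, 6, 7, 7, 8]
peaks = count_peaks(feature)
print(peaks)
```

A feature whose values pile up in several separate places hints at several groups in the data. See the references at the end of this post for how kpeaks actually builds and combines these counts.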

The UDx is built on an R library, so first install the Vertica R language package from your usual source.

Next, install the jsonlite and kpeaks packages into the Vertica R installation, as shown at https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ExtendingVertica/R/RPackages.htm

$ sudo /opt/vertica/R/bin/R
> install.packages("jsonlite")
> install.packages("kpeaks")

You’ll likely need to select a CRAN mirror from the download list.

Download the two attached text files, rename them “kpeaks.R” and “kpeaks_test.sql” so that each has the correct extension, and copy them to a cluster node. Then run kpeaks_test.sql with vsql, which does the following:

• Loads the R library
• Defines the kpeaks function
• Loads the Iris clustering data set
• Runs “kpeaks” on the Iris data set

You should get a JSON output such as the following:

KPeaks_User {"am":[2],"med":[2],"mod":[2],"mppc":[2],"cr":[2],"ciqr":[2],"mq3m":[3],"mtl":[2],"avgk":[2],"modk":[2],"mtlk":[2],"dst":["Full"],"pcounts":[2,1,2,3]}

So the methods implemented by kpeaks suggest there are 1-3 clusters in the data set: most of the estimates in this output are 2, one (mq3m) is 3, and the values in pcounts range from 1 to 3. This should help reduce the number of trials needed to identify the best “k” for a data set. Check out the references below for a better understanding of what the kpeaks results mean.
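If you want to turn that output into a short list of k values to try, a small script can do it. The following is a sketch that parses the JSON shown above with Python's standard json module. It assumes that every field except “dst” and “pcounts” holds a single estimate of k. The variable names are illustrative.

```python
import json

# The JSON document from the sample output above.
raw = (
    '{"am":[2],"med":[2],"mod":[2],"mppc":[2],"cr":[2],"ciqr":[2],'
    '"mq3m":[3],"mtl":[2],"avgk":[2],"modk":[2],"mtlk":[2],'
    '"dst":["Full"],"pcounts":[2,1,2,3]}'
)

result = json.loads(raw)

# Assumption: all keys other than "dst" and "pcounts" carry one k estimate each.
estimates = {
    name: values[0]
    for name, values in result.items()
    if name not in ("dst", "pcounts")
}
peak_counts = result["pcounts"]

# Candidate k values: every distinct estimate or peak count in the output.
candidates = sorted(set(estimates.values()) | set(peak_counts))
print(candidates)
```

Each candidate can then be passed to k-means in turn, rather than scanning a wide range of k values blindly.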

For more information and the math behind the library, see

• kpeaks documentation – https://cran.r-project.org/web/packages/kpeaks/kpeaks.pdf
• kpeaks publication – Cebeci, Z., & Cebeci, C. “kpeaks: An R Package for Quick Selection of K for Cluster Analysis”, https://www.researchgate.net/publication/331258718_kpeaks_An_R_Package_for_Quick_Selection_of_K_for_Cluster_Analysis

Attachments:

kpeaks_test_sql
kpeaks_R.txt

Enjoy!
