The content of this blog is based on a white paper that was authored by Maurizio Felici.
What is k-means Clustering?
K-means clustering is an unsupervised learning algorithm that clusters data into groups based on their similarity. Using k-means, you can find k clusters of data, represented by centroids. As the user, you select the number of clusters.
For example, suppose you want to separate customers into groups based on their purchase history so you can send out targeted emails to the different groups. There are a number of factors you might want to consider when you create the groups – how many times they visit the store, whether purchases are in-store or online, whether or not they use coupons, the types of items they buy, and so on.
Let’s say you want different email messages for these three different groups: frequent purchasers, occasional purchasers, and rare purchasers. In this case, you can specify the number of clusters k as 3 in the k-means function.
Inside the k-means algorithm, to assign a data point to a cluster, the distance is calculated from the data point to all centroids, and the closest centroid is selected for each data point. Then, the centroids are recalculated based on the data points assigned to each cluster. This cycle of calculating centroid positions continues until the centroids no longer move significantly or the max number of iterations is reached. The following image shows how data points move across iterations. The crosses represent centroids and the colored circles represent the data points:
K-means Clustering in Vertica
Running the Vertica k-means function is a two-step process:
- Run the k-means function to identify cluster centers and store the centers in a model.
- Apply the resulting model to assign data points to the identified cluster centers.
To better understand, let’s look at an example.
About the Wheat Data Set
We’re going to use a small data set, with 210 rows and 8 columns that is publicly available from UCI. This data set includes the geometrical properties of three different types of wheat: Kama, Rosa, and Canadian:
• Length of kernel
• Width of kernel
• Asymmetry coefficient
• Length of kernel groove
• Type: 1=Kama, 2=Rosa or 3=Canadian
If you’re a wheat aficionado, a farmer, or a wine expert, the clustering of wheat might mean something to you. For others, the reason we chose this data set is because it’s relatively easy to understand within the k-means algorithm.
Loading the Data Set
We’ll use the following table to store our wheat data set:
=> CREATE TABLE public.wheat (
The table has all the properties defined in their own columns.
Before loading the data into our database, we ran this sed command on the data set to remove a few extra tab characters:
$ sed ’s/\t\t*/\t/g’ seeds_dataset.txt | vsql -c "
To load the data, we use the following:
=> COPY public.wheat (
area, perimeter, compactness, klength,
width, asymmetry, glength, type
ABORT ON ERROR
Clustering the Wheat Data Set
After loading the data set into public.wheat, it’s time to use the Vertica k-means functions to cluster our wheat data set.
Before we can run k-means, we must define the number of clusters in our data set. While this is sometimes difficult, with our data set we can assume we need three clusters, because we have three different types of wheat.
Let’s run the Vertica k-means function against the public.wheat table:
=> SELECT KMEANS(’wmod1’, ’public.wheat’, ’*’, 3,
This SQL statement creates a k-means model named wmod1 that contains three clusters. All the columns in the public.wheat table will be used, except type.
Now that we have created the model, we can access the model contents with the following statement:
=> SELECT SUMMARIZE_MODEL(’wmod1’);
SUMMARIZE_MODEL() provides information about the following:
The table above describes the centers of the clusters. Each center is represented by the combination of the values of the seven properties.
: The algorithm picks the initial set of centers randomly. Therefore, both the order of the vector of the centers and their values may change slightly in different executions.
SUMMARIZE_MODEL() also provides the following information:
•Within Cluster Sum of Squares (WCSS) – This value measures cluster cohesion. The lower this value, the more compact the cluster. If you have as many clusters as the number of data points so that each data point coincides with one centroid, WCSS is zero.
•Between Cluster Sum of Squares (BCSS) – This value measures the separation between clusters. If you have just one cluster, this value is zero.
•Total Sum of Squares (TSS) – Equal to the BCSS plus the sum of all WCSS.
•BCSS/TSS ratio – The higher the cohesion within clusters and cluster separation, the closer this value will be to 1.
Assigning the Data Points to Clusters
Now we can use the Vertica APPLY_KMEANS function to mark the data points in the table using the model we developed in the previous step. The following SQL statement provides cluster distribution over seed types:
SUM(CASE cid WHEN 0 THEN cnt ELSE 0 END) AS clust_0,
SUM(CASE cid WHEN 1 THEN cnt ELSE 0 END) AS clust_1,
SUM(CASE cid WHEN 2 THEN cnt ELSE 0 END) AS clust_2
SELECT type, APPLY_KMEANS(area, perimeter,
compactness, klength, width, asymmetry, glength
USING PARAMETERS OWNER=’dbadmin’,
MODEL_NAME=’wmod1’) AS cid,
COUNT(*) as cnt
GROUP BY 1, 2 ) x
GROUP BY 1 ORDER BY 1
type | clust_0 | clust_1 | clust_2
Canadian | 34 | 0 | 0
Kama | 5 | 34 | 0
Rosa | 0 | 4 | 33
We see there was a small number of seeds that were not assigned to the cluster that corresponds to their type. This is because some of the geometrical properties used to cluster the data points overlap with each other. We can better visualize this overlap by plotting area, perimeter, and compactness:
We hope this introduction to k-means was helpful. Keep an eye out for the next blog post in our machine learning series. For more information about k-means, see k-means
in the Vertica documentation.
All the Vertica machine learning functions are also described in Machine Learning for Predictive Analytics
in the Vertica documentation.