In a previous blog posting titled, “Vertica Moneyball and ‘R’. The perfect team!” we showed how by using the kmeans clustering algorithm we were able to group our Major League Baseball (MLB) best pitchers for 2011 based on a couple of key performance indicators called WHIP and IPOUTS. This blog posting provides more detail on how you can implement in Vertica the statistical algorithm called kmeans provided by “R”.
A quick explanation on how User Defined Functions (UDFs) work is necessary before we describe how R can be implemented. UDFs provide a means to execute business logic best suited for analytic operations that are typically difficult to perform in standard SQL. Vertica includes two types of user defined functions.
- User defined scalar functions: Scalar functions take in a single row of data and produce a single output value. For example, a scalar function called add2Ints takes in a row that has two integers and produces the sum of the integers as the output.
- User defined transform functions: Transform functions can take in any number of rows of data and produce any number of rows and columns of data as output. For example, a transform function topk takes in a set of rows and produces the top k rows as the output.
UDFs are the integration method we use to invoke business logic difficult to perform in standard SQL. Let’s look at how we can implement R in Vertica using a UDF.
The following example uses a transform function, sending an entire results set to R, which in our case is a list of baseball players and their associated WHIP and IPOUT measures.
Implementing any user defined function in Vertica is a two-step process. What follows is a summary of the process. (The actual code for the function is given after the summary of steps.)
Step 1: Write the function.
For this example, we begin by writing an R source file. This example contains the following two necessary functions:
- Main function: contains the code for main processing.
- Factory function: consists of a list of at most six elements, including name, udxtype, intype,outtype, outtypecallback, parametertypecallback. Note that the outtypecallback and parametertypecallback are optional fields.
Step 2: Deploy the function.
- Define a new library using the CREATE LIBRARY command.
- Define a new function/transform using CREATE FUNCTION/TRANSFORM command.
Write the function (sample code): The first step in implementing kmeans clustering is to write the R script for computing the kmeans clusters. The R script is a file with the extension “.R” that tells Vertica what the main processing function looks like, and provides some datatype information. Keep in mind that writing this function can be done by someone on your analytics team who is somewhat familiar with ‘R’ and this skill is not required by your entire user base. Here is a simplified example R script for implementing kmeans clustering.
# Function that does all the work
kmeans_cluster <- function(x)
# load the required package
# number of clusters to be made
# Run the kmeans algorithm
#returns the clustering vector which will contain the information
#about grouping of our data entities in our case WHIP & IPOUTS.
#KMEANS groups the data entities into 3 groups (the default).
clusters <- data.frame(x[,1], c1$cluster)
kmeansFactory <- function()
list(name=kmeans_cluster, #function that does the processing
udxtype=c(“transform”), #type of the function
intype=c(“int”, “float”,”float”), #input types
outtype=c(“int”,”int”) #output types
This is a simplified version, but you can develop a more robust production ready version to make this even more reusable. Stay tuned for a future blog that describe how this can be done. Now that we have the function written it is now ready to be deployed.
Deploy the new function.
Deployment is done like any other UDF deployment by issuing the following statements in Vertica. If you have written a UDF before you might notice the new variable R for the LANGUAGE parameter:
create library kmeansGeoLib as ‘/home/Vertica/R-code/kmeans.R’ language ‘R’;
create transform function kmeansGeo as name ‘kmeansGeoData’ library kmeansGeoLib;
Invoking the new R function.
To invoke the new R function you can use standard sql syntax such as:
select kmeans(geonameid, latitude, longitude) over () from geotab_kmeans;
The above is an example of how you would invoke the function with a “points in space” or location related scenario. The example below is how we used it in our moneyball example.
select kmeans(playerid, WHIP, IPOUTS) over () from bestpitchers;
Note: The over() clause is required for transform functions. The over clause can be used to parallelize the execution if the user knows that the calculation for a group of rows is independent of other rows. For example, consider that you want to cluster the data for each player independently. In such a scenario, this is what the sql might look like:
select kmeans(playerid, WHIP, IPOUTS) over (partition by playerid) from bestpitchers;
Once this R function has been implemented in Vertica, it can be used by anyone who has a requirement to group subjects together using a sophisticated data mining clustering technology across many business domains. It does not take much effort to implement data mining algorithms in Vertica.
Many of our customers have indicated to us that time to market is very important. We believe our implementation of R provides more value for your organization because it saves time from the following perspectives:
- Implementation perspective – leverage the current UDX integration.
- End users perspective – leverage standard sql syntax.
- Performance perspective – leverage the parallelism of the Vertica multi node architecture.
Some big data problems have requirements that demand better utilization of the hardware architecture in order to deliver timely results. KMeans is a powerful, but compute-intensive algorithm that can involve multiple iterations to the increase accuracy of the results. Stay tuned for another blog that will describe in more detail how Vertica’s R implementation takes advantage of your Vertica cluster and parallelism to improve the accuracy of results with this algorithm while meeting your service level agreements.