A Deeper Dive on Vertica & R

The R programming language is fast gaining popularity among data scientists for performing statistical analyses. It is extensible and has a large community of users, many of whom contribute packages that extend its capabilities. However, it is single-threaded and limited by the amount of RAM on the machine it runs on, which makes it challenging to run R programs against big data.

There are efforts under way to remedy this situation, which essentially fall into one of the following two categories:

  • Integrate R into a parallel database, or
  • Parallelize R so it can process big data

In this post, we look at Vertica’s take on “Integrating R into a parallel database” and the two major areas that account for the performance improvement. A follow-on post will describe alternatives to the first approach.

1.)    Running multiple instances of the R algorithm in parallel (query partitioned data)

The first major performance benefit of Vertica’s R implementation comes from running multiple instances of the R algorithm in parallel, with queries that chunk the data independently. In the recently launched Vertica 6.0, we added the ability to write sophisticated R programs and have them run in parallel on a cluster of machines. At a high level, Vertica threads communicate with R processes to compute results. The integration uses an optimized conversion from Vertica tables to R data frames, and all R processing is automatically parallelized across the Vertica servers. The diagram below shows how the Vertica R integration has been implemented from a parallelization perspective.

The parallelism comes from processing independent chunks of data simultaneously (referred to as data parallelism). SQL, being a declarative language with well-defined semantics, allows database query optimizers to figure out the order of operations as well as which of them can be done in parallel. For example, consider the following query that computes the average sales figure for each month:

SELECT month, avg(qty*price) FROM sales GROUP BY month;

The semantics of the GROUP BY operation are such that the average sales of one month are independent of the average sales of any other month, which allows the database to compute the averages for different months in parallel. Similarly, the SQL-99 standard defines analytic functions (also referred to as window functions); these functions operate on a sliding window of rows and can be used to compute moving averages, percentiles, and so on. For example, the following query assigns student test scores into quartiles within each grade:

SELECT name, grade, score, NTILE(4) OVER (PARTITION BY grade ORDER BY score DESC) FROM test_scores;

 name       grade  score  ntile
 Tigger         1     98      1
 Winnie         1     89      1
 Rabbit         1     78      2
 Roo            1     67      2
 Piglet         1     56      3
 Owl            1     54      3
 Eeyore         1     45      4
 Batman         2     98      1
 Ironman        2     95      1
 Spiderman      2     75      2
 Heman          2     56      2
 Superman       2     54      3
 Hulk           2     43      4

Again, the semantics of the OVER clause in window functions allow the database to compute the quartiles for each grade in parallel, since the grades are independent of one another. Rather than inventing yet another syntax for performing R computations inside the database, as some of our competitors have, we decided to leverage the OVER clause, since it is a familiar and natural way to express data-parallel computations. A prior blog post shows how easy it is to create, deploy, and use R functions on Vertica.

Listed below is an example comparing R over ODBC with Vertica’s R implementation using the UDX framework.

As the chart above shows, Vertica’s UDX-based implementation scales much better than the ODBC approach as data volumes increase. Note: the numbers in the chart should be used only for relative comparison, since this is not a formal benchmark.
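For context, the ODBC approach in this comparison amounts to pulling the entire result set across the network into a single local R session and doing all of the work there. A minimal sketch of that baseline using the RODBC package might look like the following; the DSN name is an assumption for illustration, and the table and columns come from the moneyball example later in this post:

# ODBC baseline: ship every row to a local, single-threaded R session
library(RODBC)

ch <- odbcConnect("VerticaDSN")   # assumed DSN name
df <- sqlQuery(ch, "SELECT playerid, WHIP, IPOUTS FROM bestpitchers;")
odbcClose(ch)

# All of the work happens on the client, limited to one core and local RAM
clusters <- kmeans(df[, 2:3], centers = 3)

Every row has to cross the network and be converted to R types on the client, and the clustering itself runs on a single core, which is why this approach falls behind as data volumes grow.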

 

2.)    Leveraging column-store technology for optimized data exchange (query non-partitioned data).

It is important to note that even for non-data-parallel tasks (functions that operate on one big chunk of non-partitioned data), Vertica’s implementation provides better performance, since the computation runs on a server instead of the client and the data flow between the database and R is optimized (there is no need to parse the data again).

The other major benefit of Vertica’s R integration comes from the UDX framework, which avoids ODBC entirely, and from the efficiencies of Vertica’s column store. Here are some examples showing how much more efficient Vertica’s integration with R is compared to a typical ODBC approach for a query over non-partitioned data.

As the chart above indicates, performance improvements are also achieved by optimizing the data transfer between Vertica and R. Because Vertica is a column store and R is vector based, data can be moved very efficiently from a Vertica column to an R vector in large blocks. Note: the numbers in the chart should be used only for relative comparison, since this is not a formal benchmark.

This post focused on performance and on R algorithms that are amenable to data-parallel solutions. A follow-up post will describe our approach for problems that are not data parallel, such as building a single decision tree, where the goal is to “Parallelize R” itself so it can process such workloads more effectively.

For more details on how to implement R in Vertica, please see the following blog: http://www.vertica.com/2012/10/02/how-to-implement-r-in-vertica/

How to implement “R” in Vertica

In a previous blog post, “Vertica Moneyball and ‘R’. The perfect team!”, we showed how the kmeans clustering algorithm can group Major League Baseball’s (MLB) best pitchers for 2011 based on a couple of key performance indicators called WHIP and IPOUTS. This post provides more detail on how you can implement R’s kmeans algorithm in Vertica.

A quick explanation of how User Defined Functions (UDFs) work is necessary before we describe how R can be implemented. UDFs provide a means to execute business logic best suited for analytic operations that are typically difficult to perform in standard SQL. Vertica includes two types of user defined functions.

  1. User defined scalar functions: Scalar functions take in a single row of data and produce a single output value. For example, a scalar function called add2Ints takes in a row that has two integers and produces the sum of the integers as the output (a minimal sketch of such a function follows this list).
  2. User defined transform functions: Transform functions can take in any number of rows of data and produce any number of rows and columns of data as output. For example, a transform function topk takes in a set of rows and produces the top k rows as the output.
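To make the scalar case concrete, here is a minimal, hypothetical sketch of the add2Ints function mentioned above, written in the same main-plus-factory style as the kmeans example later in this post (the udxtype value and the shape of the input data frame are assumptions here, not taken from the original post):

# Hypothetical scalar function: x arrives as a data frame with one column
# per SQL argument; return one value for each input row
add2Ints <- function(x)
{
  x[,1] + x[,2]
}

# Factory function describing the UDF to Vertica
add2IntsFactory <- function()
{
  list(name=add2Ints,
       udxtype=c("scalar"),
       intype=c("int", "int"),
       outtype=c("int"))
}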

UDFs are the integration method we use to invoke business logic difficult to perform in standard SQL. Let’s look at how we can implement R in Vertica using a UDF.

The following example uses a transform function, sending an entire results set to R, which in our case is a list of baseball players and their associated WHIP and IPOUT measures.

Implementing any user defined function in Vertica is a two-step process.  What follows is a summary of the process.  (The actual code for the function is given after the summary of steps.)

Step 1: Write the function.

For this example, we begin by writing an R source file. This example contains the following two necessary functions:

  1. Main function: contains the code for main processing.
  2. Factory function: consists of a list of at most six elements: name, udxtype, intype, outtype, outtypecallback, and parametertypecallback. Note that outtypecallback and parametertypecallback are optional.

Step 2: Deploy the function.

  1. Define a new library using the CREATE LIBRARY command.
  2. Define a new function/transform using CREATE FUNCTION/TRANSFORM command.

Write the function (sample code): The first step in implementing kmeans clustering is to write the R script that computes the kmeans clusters. The R script is a file with the extension “.R” that tells Vertica what the main processing function looks like and provides some datatype information. Keep in mind that this script can be written by someone on your analytics team who is familiar with R; the skill is not required of your entire user base. Here is a simplified example R script for implementing kmeans clustering.

# Function that does all the work
kmeans_cluster <- function(x)
{
  # number of clusters to be made
  k <- 3

  # Run the kmeans algorithm (kmeans() ships in R's base 'stats' package)
  c1 <- kmeans(x, k)

  # Return the clustering result, which contains the grouping of our data
  # entities (in our case WHIP and IPOUTS) into k = 3 groups: one row per
  # input entity, with its id and its assigned cluster
  clusters <- data.frame(x[,1], c1$cluster)
  clusters
}

kmeansFactory <- function()
{
  list(name=kmeans_cluster,               # function that does the processing
       udxtype=c("transform"),            # type of the function
       intype=c("int", "float", "float"), # input types
       outtype=c("int", "int")            # output types
  )
}

This is a simplified version, but you can develop a more robust, production-ready version to make it even more reusable. Stay tuned for a future blog that will describe how this can be done. Now that the function is written, it is ready to be deployed.

Deploy the new function

Deployment is done like any other UDF deployment, by issuing the following statements in Vertica. If you have written a UDF before, you might notice the new value ‘R’ for the LANGUAGE parameter:

create library kmeansGeoLib as '/home/Vertica/R-code/kmeans.R' language 'R';

create transform function kmeans as name 'kmeansFactory' library kmeansGeoLib;

Invoking the new R function.

To invoke the new R function, you can use standard SQL syntax such as:

select kmeans(geonameid, latitude, longitude) over () from geotab_kmeans;

The query above is an example of how you would invoke the function in a “points in space” or location-related scenario. The query below shows how we used it in our moneyball example.

select kmeans(playerid, WHIP, IPOUTS) over () from bestpitchers;

Note: The OVER() clause is required for transform functions. It can be used to parallelize the execution if the user knows that the calculation for a group of rows is independent of other rows. For example, if you wanted to cluster the data for each player independently, the SQL might look like this:

select kmeans(playerid, WHIP, IPOUTS) over (partition by playerid) from bestpitchers;

Once this R function has been implemented in Vertica, it can be used by anyone who needs to group subjects together using a sophisticated data mining clustering technique, across many business domains. It does not take much effort to implement data mining algorithms in Vertica.

Many of our customers have indicated to us that time to market is very important. We believe our implementation of R provides more value for your organization because it saves time from the following perspectives:

  1. Implementation perspective - leverage the current UDX integration.
  2. End-user perspective - leverage standard SQL syntax.
  3. Performance perspective - leverage the parallelism of Vertica's multi-node architecture.

Some big data problems have requirements that demand better utilization of the hardware architecture in order to deliver timely results. KMeans is a powerful but compute-intensive algorithm that can require multiple iterations to increase the accuracy of the results. Stay tuned for another blog that will describe in more detail how Vertica’s R implementation takes advantage of your Vertica cluster and its parallelism to improve the accuracy of results with this algorithm while meeting your service level agreements.

Vertica Moneyball and ‘R’. The perfect team!

Back in April, Colin’s blog, “Moneyball – not just for baseball anymore”, was a good example of how statistics can be used to make better decisions on and off the baseball field. New measures can be created to better understand a player’s real contribution to a team. For instance, most baseball players are familiar with the popular earned run average (ERA) measure for pitchers, but a newer one that is becoming popular is called WHIP (Walks plus Hits per Innings Pitched).

Here is how Wikipedia describes WHIP: While earned run average (ERA) measures the runs a pitcher gives up, WHIP more directly measures a pitcher’s effectiveness against the batters faced. It is calculated by adding the number of walks and hits allowed and dividing this sum by the number of innings pitched; therefore, the lower a pitcher’s WHIP, the better his performance.   Listed below is the calculation for WHIP.

 WHIP = (Walks + Hits)/ Innings Pitched.

Dashboards such as the following can be built to demonstrate these new kinds of measures, or key performance indicators (KPIs), showing how they can reach a wider audience and provide more insight into teams and players.

Some of the other measures needed to accurately determine a player’s contribution to the team can only be implemented using a statistical package such as R. Typically, implementing a statistical package in an organization is not a trivial task, for the following reasons:

1.)    Specialized domain expertise required – Statistics requires a new skill set to understand and use properly.

2.)    Data Access – Data must be imported into and exported out of the statistical package.

3.)    Performance – Many of the statistical algorithms are compute intensive.

This article will demonstrate how Vertica 6 handles the first two items above, and a follow-up article will show how Vertica 6 “Presto” has some distinct performance capabilities related to its R integration.

While it is true that understanding statistics can be challenging without proper training, having a group who fully understands the algorithms collaborate with the business domain experts ensures a proper implementation. Implementing these new algorithms in the database allows your organization to leverage powerful statistics in its daily business analysis and reduces time to market, because they can now be treated like any other “standard” database function. The possibility for error is also reduced, because complex Extraction, Transformation and Load (ETL) products are no longer required to import and export the data into and out of the statistical package. The entire process is streamlined, so any BI or ETL tool in the organization can leverage the new capability as well, because it now lives in the database.

So let’s put on our favorite baseball cap, in my case a Tiger cap, and take a closer look at how using ‘R’ can enhance our understanding of our favorite baseball teams and players.

As indicated before, “Moneyball” enlightened the baseball world with many new measures that are now almost common speak amongst baseball fans. The scenario for this example is a team that wants to ensure it is paying its pitchers appropriately based on performance, or that is interested in finding more talented pitchers for the team. Once these pitchers are identified, I want to group them together in like clusters based on our key performance indicators (KPIs). The two KPIs I have decided to use are the WHIP calculation described above and another one called IPouts, which is simply the number of outs pitched.

Listed below is a simple query showing results for last year’s top pitchers, sorted on the new measure called WHIP.
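The results appeared as a screenshot in the original post. A query of roughly this shape would produce them; the column names below (walks, hits, outs pitched, and ERA in a bestpitchers table) are assumptions for illustration:

SELECT name,
       (walks + hits) / (ipouts / 3.0) AS whip,  -- innings pitched = IPouts / 3
       ipouts,
       era
FROM bestpitchers
ORDER BY whip ASC
LIMIT 10;

Because a lower WHIP is better, the sort is ascending.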

You can see very clearly why Justin Verlander was the MVP and Cy Young award winner last year: his WHIP and IPouts were the best, and he was third in ERA. All of the measures provided so far can be implemented with standard SQL. The next thing I want to do is group these pitchers into clusters based on my two measures, WHIP and IPouts. To do this, I used Vertica’s new integration with the statistical package R to implement a clustering algorithm called kmeans. In my case, I want 3 clusters of the 91 pitchers from 2011 who qualified. The column below called Geo.cluster was provided by the integration of R in Vertica.

You can see that even in the top 10 we have pitchers in all 3 of our clusters. Keep in mind that lower values for WHIP and ERA are better, while higher values for IPouts are better. Looking at the list above, I now have better insight into the players, and I can focus on cluster 3 players and possibly some players from cluster 2. Listed below is an example of a histogram showing WHIP on the X axis for all 91 of our 2011 pitchers. You can include these charts and graphs in your dashboards as well.
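The histogram was shown as an image in the original post. As a rough sketch, once the WHIP values are in an R session (for example via the ODBC pull shown earlier), a single call draws it; the data frame and column position are assumptions:

# df holds one row per 2011 qualified pitcher, with WHIP in the second column
hist(df[, 2], breaks = 20, xlab = "WHIP",
     main = "WHIP distribution, 2011 qualified pitchers")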

Other database competitors can also claim ‘R’ integration, but Vertica’s implementation provides better value to your organization because of its simplicity and performance.  Other vendors take an ‘R’ centric approach, which means your users have to know ‘R’ and use the ‘R’ suite of programming tools.  Vertica’s implementation is a more ‘data’ centric approach that shields the users from having to know and use the ‘R’ language.  Users can continue to use their favorite BI or query tool and now have access to ‘R’ capability.

This article demonstrated how statistics can be used to build new measures to provide more insight on a particular situation.  This kind of analysis can also be applied in your organization to help with detecting fraud etc.

Stay tuned for future posts that will give you more detail on how kmeans and other statistical functions, such as page rank, were implemented in Vertica using ‘R’. Go Tigers!

For more details on how to implement R in Vertica, please refer to the following blog: http://www.vertica.com/2012/10/02/how-to-implement-r-in-vertica/
