Distributed R for Big Data

Data scientists use sophisticated algorithms to obtain insights. However, what usually takes tens of lines of MATLAB or R code is now being rewritten in Hadoop-like systems and applied at scale in industry. Instead of rewriting algorithms in a new model, can we stretch the limits of R and reuse it for analyzing Big Data? We present our early experiences at HP Labs as we attempt to answer this question.

Consider a few use cases: product recommendations at Netflix and Amazon, PageRank calculation by search providers, financial options pricing, and detection of important people in social networks. These applications (1) process large amounts of data, (2) implement complex algorithms such as matrix decomposition and eigenvalue calculation, and (3) continuously refine their predictive models as new user ratings, Web pages, or relations in the network arrive. To support these applications we need systems that can scale, can easily express complex algorithms, and can handle continuous analytics.

The complex aspect refers to the observation that most of the above applications use advanced concepts such as matrix operations, graph algorithms, and so on. By continuous analytics we mean that if a programmer writes y=f(x), then y is recomputed automatically whenever x changes. Continuous analytics reduces the latency with which information is processed. For example, in recommendation systems new ratings can be quickly processed to give better suggestions. In search engines newly added Web pages can be ranked and made part of search results more quickly.
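To make the y=f(x) idea concrete, here is a minimal sketch in Python. It is not the HP Labs implementation; the class names Observable and Derived and the sample ratings are assumptions made up for illustration. It only shows the general pattern of recomputing a derived value whenever its input changes.

```python
class Observable:
    """A value that notifies registered callbacks whenever it is replaced."""

    def __init__(self, value):
        self._value = value
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def get(self):
        return self._value

    def set(self, value):
        self._value = value
        for callback in self._listeners:
            callback(value)


class Derived:
    """Holds y = f(x) and recomputes it each time x changes."""

    def __init__(self, source, f):
        self._f = f
        self.value = f(source.get())
        source.subscribe(self._recompute)

    def _recompute(self, new_x):
        self.value = self._f(new_x)


# Hypothetical ratings for one product.
ratings = Observable([4, 5, 3])
mean_rating = Derived(ratings, lambda xs: sum(xs) / len(xs))
before = mean_rating.value

ratings.set([4, 5, 3, 1, 2])  # new ratings arrive
after = mean_rating.value     # already recomputed, no explicit call needed
```

The programmer never calls the function a second time; the update to the input triggers the recomputation, which is the latency benefit described above.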

In this post we will focus on scalability and complex algorithms.

R is open-source statistical software. It has millions of users, including data scientists, and more than three thousand algorithm packages. Many machine learning algorithms already exist in R, albeit for small datasets. These algorithms use matrix operations that are easily expressed and efficiently implemented in R, and most of them can be implemented in less than a hundred lines. Therefore, we decided to extend R and determine whether we can achieve scalability in a familiar programming model.

Figure 1 is a very simplified view that compares R and Hadoop. Hadoop can handle large volumes of data, whereas R can efficiently execute a variety of advanced analyses. At HP Labs we have developed a distributed system that extends R. Its main advantages are the language semantics and the mechanisms to scale R and run programs in a distributed manner.

Figure 1: Extending R for Big Data


Figure 2 shows a high-level diagram of how programs are executed in our distributed R framework. Users write programs using language extensions to R and then submit the code to the new runtime. The code is executed across servers in a distributed manner. Distributed R programs run on commodity hardware, from your multi-core desktop to existing Vertica clusters.

Figure 2: Architecture

Our framework adds three main language constructs to R: darray, splits, and update. A foreach construct is also present. It is similar to parallel loops found in other languages.

For transparent scaling, we provide the abstraction of distributed arrays, darray. Distributed arrays store data across multiple machines and give programmers the flexibility to partition data by rows, columns, or blocks. Programmers write analytics code treating the distributed array as a regular array, without worrying that it is mapped to different physical machines. Array partitions can be referenced using splits and their contents modified using update. The body of a foreach loop processes array partitions in parallel.
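The partition-and-process pattern can be sketched in a few lines of Python. This is only an analogy, not the framework's API: split_rows and foreach here are hypothetical helpers, the "array" is a plain list of rows, and the partitions are processed by threads on one machine rather than by workers on separate servers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(matrix, nparts):
    """Partition a list of rows into nparts contiguous row blocks."""
    size, extra = divmod(len(matrix), nparts)
    parts, start = [], 0
    for i in range(nparts):
        end = start + size + (1 if i < extra else 0)
        parts.append(matrix[start:end])
        start = end
    return parts


def foreach(parts, body):
    """Run body on every partition in parallel and collect the updated partitions."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(body, parts))


def double_block(block):
    """Loop body: works on one partition without knowing about the others."""
    return [[2 * value for value in row] for row in block]


data = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
parts = split_rows(data, 2)             # two row partitions (3 rows and 2 rows)
updated = foreach(parts, double_block)  # each partition is processed independently
result = [row for block in updated for row in block]
```

The loop body sees only its own partition, which is what lets a runtime place partitions on different machines without changing the analytics code.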

Figure 3 shows part of a program that calculates distributed PageRank of a graph. At a high level, the program executes A = (M*B)+C in a distributed manner until convergence. Here M is the adjacency matrix of a large graph. Initially M is declared as an NxN sparse matrix partitioned by rows. The vector A is partitioned such that each partition has the same number of rows as the corresponding partition of M. The accompanying illustration (Figure 3) points out that each partition of A requires the corresponding (shaded) partitions of M and C, and the whole array B. The runtime passes these partitions and automatically reconstructs B from its partitions before executing the body of foreach on workers.
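The iteration A = (M*B)+C can be illustrated with a small single-machine sketch in Python. This is not the program from Figure 3. It assumes that M already holds damped link weights (damping factor d divided by the out-degree of the source page), that C is the constant teleport term (1-d)/N, and that a dense list of lists stands in for the sparse matrix. The helper names are hypothetical.

```python
def row_partitions(n, nparts):
    """Return (start, end) row ranges that split n rows into nparts blocks."""
    size, extra = divmod(n, nparts)
    bounds, start = [], 0
    for i in range(nparts):
        end = start + size + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def update_partition(M_rows, B, C_rows):
    """Compute one partition of A from its rows of M and C and the whole of B."""
    return [sum(m * b for m, b in zip(row, B)) + c
            for row, c in zip(M_rows, C_rows)]


def pagerank(M, C, nparts=2, tol=1e-10, max_iter=200):
    n = len(M)
    bounds = row_partitions(n, nparts)
    B = [1.0 / n] * n
    for _ in range(max_iter):
        # A distributed runtime would run these partition updates in parallel.
        pieces = [update_partition(M[lo:hi], B, C[lo:hi]) for lo, hi in bounds]
        A = [value for piece in pieces for value in piece]
        if max(abs(a - b) for a, b in zip(A, B)) < tol:
            return A
        B = A
    return B


d = 0.85  # assumed damping factor
# Links: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0. Entry M[i][j] is d / outdegree(j) if j links to i.
M = [
    [0.0,   0.0, d],
    [d / 2, 0.0, 0.0],
    [d / 2, d,   0.0],
]
C = [(1 - d) / 3] * 3
ranks = pagerank(M, C)
```

Each call to update_partition needs only its own rows of M and C plus all of B, which mirrors the data dependencies described for Figure 3.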

Our algorithms package has distributed algorithms such as regression analysis, clustering, power method based PageRank, a recommendation system, and so on. For each of these applications we had to write less than 150 lines of code.

Figure 3: Sample Code

This post does not aim to present yet another system that is faster than Hadoop, so we omit comprehensive experimental results and graphs. Our EuroSys 2013 and HotCloud 2012 papers have detailed performance results [1, 2]. As a data nugget, our experiments show that many algorithms in our distributed R framework are more than 20 times faster than Hadoop.


Our framework extends R. It efficiently executes machine learning and graph algorithms on a cluster. Distributed R programs are easy to write, are scalable, and are fast.

Our aim in building a distributed R engine is not to replace Hadoop or its variants. Rather, it is a design point in the space of analytics interfaces—one that is more familiar to data scientists.

Our framework is still evolving. Today, you can use R on top of Vertica to accelerate your data mining analysis. Soon we will support in-database operations as well. Stay tuned.

[1] Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, Rob Schreiber. EuroSys 2013, Prague, Czech Republic.

[2] Using R for Iterative and Incremental Processing. Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Rob Schreiber. HotCloud 2012, Boston, USA.

A Deeper Dive on Vertica & R

The R programming language is fast gaining popularity among data scientists for performing statistical analyses. It is extensible and has a large community of users, many of whom contribute packages to extend its capabilities. However, it is single-threaded and limited by the amount of RAM on the machine it is running on, which makes it challenging to run R programs on big data.

There are efforts under way to remedy this situation, which essentially fall into one of the following two categories:

  • Integrate R into a parallel database, or
  • Parallelize R so it can process big data

In this post, we look at Vertica’s take on “Integrating R into a parallel database” and the two major areas that allow for the performance improvement. A follow-on post will describe alternatives to the first approach.

1.)    Running multiple instances of the R algorithm in parallel (query partitioned data)

The first major performance benefit of Vertica’s R implementation comes from running multiple instances of the R algorithm in parallel, with queries that chunk the data independently. In the recently launched Vertica 6.0, we added the ability to write sophisticated R programs and have them run in parallel on a cluster of machines. At a high level, Vertica threads communicate with R processes to compute results. The integration uses optimized data conversion from Vertica tables to R data frames, and all R processing is automatically parallelized across Vertica servers. The diagram below shows how the Vertica R integration has been implemented from a parallelization perspective.

The parallelism comes from processing independent chunks of data simultaneously (referred to as data parallelism).   SQL, being a declarative language, allows database query optimizers to figure out the order of operations, as well as which of them can be done in parallel, due to the well-defined semantics of the language. For example, consider the following query that computes the average sales figures for each month:

SELECT avg(qty*price) FROM sales GROUP BY month;
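To see why a grouped aggregate like the one above parallelizes, here is a rough Python sketch. It is not how Vertica executes the query; the sales rows are made-up sample data and the threads stand in for whatever parallel workers a database would use. Once the rows are grouped by month, each group's average can be computed without looking at any other group.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample rows from a sales table.
sales = [
    {"month": "Jan", "qty": 2, "price": 10.0},
    {"month": "Jan", "qty": 1, "price": 30.0},
    {"month": "Feb", "qty": 4, "price": 5.0},
    {"month": "Feb", "qty": 2, "price": 20.0},
]


def average_revenue(rows):
    """avg(qty*price) over the rows of a single group."""
    return sum(r["qty"] * r["price"] for r in rows) / len(rows)


# GROUP BY month: collect the rows of each month.
groups = defaultdict(list)
for row in sales:
    groups[row["month"]].append(row)

# Each group is independent, so the groups can be aggregated in parallel.
with ThreadPoolExecutor() as pool:
    averages = dict(zip(groups, pool.map(average_revenue, groups.values())))
```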

The semantics of the GROUP BY operation are such that the average sales of a particular month are independent of the average sales of a different month, which allows the database to compute the average for different months in parallel.   Similarly, the SQL-99 standard defines analytic functions (also referred to as window functions) – these functions operate on a sliding window of rows and can be used to compute moving averages, percentiles etc. For example, the following query assigns student test scores into quartiles for each grade:

SELECT name, grade, score, NTILE(4) OVER (PARTITION BY grade ORDER BY score DESC) FROM test_scores;

 name       grade  score  ntile
 Tigger     1      98     1
 Winnie     1      89     1
 Rabbit     1      78     2
 Roo        1      67     2
 Piglet     1      56     3
 Owl        1      54     3
 Eeyore     1      45     4
 Batman     2      98     1
 Ironman    2      95     1
 Spiderman  2      75     2
 Heman      2      56     2
 Superman   2      54     3
 Hulk       2      43     4


Again, the semantics of the OVER clause in window functions allows the database to compute the quartiles for each grade in parallel, since they are independent of one another.   Unlike some of our competitors, instead of inventing yet another syntax to perform R computations inside the database, we decided to leverage the OVER clause, since it is a familiar and natural way to express data parallel computations.  A prior blog post shows how easy it is to create, deploy and use R functions on Vertica.
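The quartile table above can be reproduced with a short Python sketch that treats each grade as an independent partition. This illustrates the semantics only and is not Vertica's execution engine: ntile_labels and quartiles are hypothetical helpers, and threads stand in for parallel workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

scores = [
    ("Tigger", 1, 98), ("Winnie", 1, 89), ("Rabbit", 1, 78), ("Roo", 1, 67),
    ("Piglet", 1, 56), ("Owl", 1, 54), ("Eeyore", 1, 45),
    ("Batman", 2, 98), ("Ironman", 2, 95), ("Spiderman", 2, 75),
    ("Heman", 2, 56), ("Superman", 2, 54), ("Hulk", 2, 43),
]


def ntile_labels(n_rows, buckets):
    """Bucket numbers for n_rows ordered rows; earlier buckets take the remainder."""
    base, extra = divmod(n_rows, buckets)
    labels = []
    for b in range(buckets):
        size = base + (1 if b < extra else 0)
        labels.extend([b + 1] * size)
    return labels


def quartiles(rows):
    """NTILE(4) ... ORDER BY score DESC for the rows of one partition."""
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    labels = ntile_labels(len(ordered), 4)
    return [(name, grade, score, tile)
            for (name, grade, score), tile in zip(ordered, labels)]


# PARTITION BY grade: split the rows into independent groups.
partitions = defaultdict(list)
for row in scores:
    partitions[row[1]].append(row)

# Each partition is processed on its own, so the work can run in parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(quartiles, partitions.values()))

ntile_by_name = {name: tile for part in results for name, _, _, tile in part}
```

An R transform function invoked through the OVER clause follows the same shape: the function body plays the role of quartiles and is called once per partition.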


Listed below is an example comparing R with ODBC against Vertica’s R implementation with the UDX framework.

As the chart above shows, Vertica’s implementation using the UDX framework scales much better than an ODBC approach as your data volumes increase. Note: Numbers indicated on the chart should only be used for relative comparisons since this is not a formal benchmark.


2.)    Leveraging column-store technology for optimized data exchange (query non-partitioned data).

It is important to note that even for non-data-parallel tasks (functions that operate on input that is basically one big chunk of non-partitioned data), Vertica’s implementation provides better performance, since computation runs on the server instead of the client and we have optimized the data flow between the database and R (there is no need to parse the data again).

The other major benefit of Vertica’s R integration comes from the UDX framework, which avoids ODBC, and from the efficiencies of Vertica’s column store. Here are some examples showing how much more efficient Vertica’s integration with R is than a typical ODBC approach for a query over non-partitioned data.

As the chart above indicates, performance improvements are also achieved by optimizing the data transfers between Vertica and R. Since Vertica is a column store and R is vector based, it is very efficient to move data from a Vertica column in very large blocks to R vectors. Note: Numbers indicated on the chart should only be used for relative comparisons since this is not a formal benchmark.
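The column-to-vector point can be illustrated with Python's array module. This is a loose analogy and not Vertica's transfer code: the sample values are made up, and typed arrays stand in for both database column blocks and R vectors. A row-oriented source has to be unpacked value by value, while a column-oriented source can hand over each column as one contiguous block.

```python
from array import array

# Hypothetical sample rows: (playerid, WHIP, IPOUTS).
rows = [(1, 1.10, 540.0), (2, 0.95, 610.0), (3, 1.25, 480.0)]


def vectors_from_rows(rows):
    """Row-oriented transfer: each row is unpacked and appended value by value."""
    ids, whip, ipouts = array("q"), array("d"), array("d")
    for playerid, w, ip in rows:
        ids.append(playerid)
        whip.append(w)
        ipouts.append(ip)
    return ids, whip, ipouts


# Column-oriented storage already keeps each column as one contiguous block.
columns = (
    array("q", [1, 2, 3]),
    array("d", [1.10, 0.95, 1.25]),
    array("d", [540.0, 610.0, 480.0]),
)


def vectors_from_columns(columns):
    """Column-oriented transfer: each column is copied as a single block of bytes."""
    vectors = []
    for column in columns:
        vector = array(column.typecode)
        vector.frombytes(column.tobytes())
        vectors.append(vector)
    return tuple(vectors)


by_row = vectors_from_rows(rows)
by_column = vectors_from_columns(columns)
```

Both paths end with the same vectors; the difference is that the column path does one bulk copy per column instead of one operation per value.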

This post focused on performance and on R algorithms that are amenable to data-parallel solutions. A following post will talk about our approach to parallelizing R for problems that are not amenable to data-parallel solutions, such as building a single decision tree, where R itself has to be parallelized (the “Parallelize R” approach above) to process the data effectively.

For more details on how to implement R in Vertica please go to the following blog

Now Shipping: HP Vertica Analytics Platform 6.1 “Bulldozer” Release

We’re pleased to announce that the “Bulldozer” (v6.1) release of the HP Vertica Analytics Platform is now shipping! The 6.1 release extends v6 of the HP Vertica Analytics Platform with exciting new enhancements that are sure to delight cloud computing and Hadoop fans alike, while giving data scientists and administrators more features and functionality to get the most out of their HP Vertica investment.

Tighter Hadoop Integration

With a name like “Bulldozer,” you’d expect the release to provide some heavy, big data lifting, and this release delivers. Many of our customers use Hadoop in early stages of their data pipeline, especially for storing loads of raw data. But after they’ve MapReduce-massaged the data, users want to load it into HP Vertica as fast as possible so they can start their true analytics processing. HP Vertica 6.1 provides a new HDFS connector that allows you to do just that: pull data straight from HDFS with optimal parallelism without any additional MapReduce coding. Furthermore, users who are still deciding whether or not to bring some of their Hadoop data into their primary analytics window can use HP Vertica’s external tables feature with the HDFS connector to run rich analytics queries and functions in situ in HDFS. They may even choose to plug in a custom parser using the User Defined Load framework and let HP Vertica do some of the ETL lifting for them. Flexibility is what it’s all about. To learn how to use the HP Vertica Analytics Platform with Hadoop, see our newly released white paper: Make All Your Information Matter — Hadoop and HP Vertica Analytics Platform.

Simplified Cloud Deployments

We also have many customers who run HP Vertica in the cloud, and know that more and more enterprises are making the cloud their deployment model of choice. To simplify and improve the cloud deployment experience, we now have an officially qualified Amazon EC2 AMI for HP Vertica. This AMI eliminates the guesswork and manual effort involved in rolling your own AMI. And to make these AMIs even easier to administer, we’ve provided cloud scripts that simplify the installation, configuration, and deployment of HP Vertica clusters. Now creating, expanding, and contracting your HP Vertica deployments is easier than ever, enabling a more agile and elastic cloud experience.

Killer Features for Big Data Analytics

In addition to the above, there are dozens of new features and improvements in this release that address the needs of Big Data analytics deployments. From a new R language pack that gets data scientists up and running quickly to enhanced storage tiering and archiving features that will help optimize storage media spend to new validation tools that assist administrators with hardware deployment planning and tuning, this new release provides the platform needed to create an enterprise-grade Big Data environment.  And, as with every release, we’ve made HP Vertica’s already incredibly fast performance even faster.

It’s easy for me to be excited about all the great new improvements in this release, but I challenge you to come see for yourself. Test drive HP Vertica 6.1 and find out how its new features can help you tackle your biggest Big Data challenges. Interested in learning more? Attend our upcoming Introduction to HP Vertica 6.1 webinar, where we’ll provide even more details about this exciting new release. We’re constantly striving to make the HP Vertica Analytics Platform the best solution for our customers’ Big Data analytics needs, and with our “Bulldozer” now out the door, we’re looking forward to helping more enterprises pave the way to data-driven business success.


Luis Maldonado

Director, HP Vertica

Capitalizing on the Potential of R and HP Vertica Analytics Platform

With the release of HP Vertica v6, we introduced a no-charge download of a new package that incorporates R, one of the most popular open-source data mining and statistics software offerings on the market today.

With this package, you can implement advanced analytics and sift through your data quickly to find anomalies using advanced R data mining algorithms. For more technical details on this integration, see How to Implement “R” in Vertica.

It’s but one part of our open platform approach: we integrate and support a range of tools and technologies, from Hadoop to R to BI/visualization environments like Tableau, to give you more flexibility and choice in deriving maximum insight from your Big Data initiative.

“Got it,” you say. “But can you share some common use cases to spark ideas for my organization?”

For a complete list of use cases and additional details on the technology behind this integration, download our just-released white paper: “R” you ready? Turning Big Data into big value with the HP Vertica Analytics Platform and R.

That said, the use cases are limited only by your imagination: everything from behavior analytics (making meaningful predictions based on past and current buying behavior) to claims analyses (identifying anomalies, such as fraud, or identifying product defects early in the product release phase). In fact, the best ideas and, even more gratifying to us, the best implementations are happening today in our user community.

One of our digital intelligence customers uses Hadoop’s HDFS to store raw input behavioral data and Hadoop/MapReduce to find conversions (regular-expressions processing) by determining what type of user clicked on a particular advertisement, and HP Vertica Analytics Platform to store and operationalize high-value business data.

In addition, the company’s big data solution supports reporting and analytics via Tableau and the R programming language as well as custom ETL. This combination of technologies helps this customer achieve faster insights that are delivered more consistently with less administrative overhead and lower-cost, commodity hardware.

How are you using R with HP Vertica Analytics Platform? We’re all ears.

How to implement “R” in Vertica

In a previous blog posting titled “Vertica Moneyball and ‘R’. The perfect team!” we showed how, by using the kmeans clustering algorithm, we were able to group the best Major League Baseball (MLB) pitchers of 2011 based on a couple of key performance indicators called WHIP and IPOUTS. This blog posting provides more detail on how you can implement in Vertica the statistical algorithm called kmeans provided by “R”.

A quick explanation on how User Defined Functions (UDFs) work is necessary before we describe how R can be implemented.  UDFs provide a means to execute business logic best suited for analytic operations that are typically difficult to perform in standard SQL.  Vertica includes two types of user defined functions.

  1. User defined scalar functions: Scalar functions take in a single row of data and produce a single output value. For example, a scalar function called add2Ints takes in a row that has two integers and produces the sum of the integers as the output.
  2. User defined transform functions: Transform functions can take in any number of rows of data and produce any number of rows and columns of data as output. For example, a transform function topk takes in a set of rows and produces the top k rows as the output.
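The difference between the two kinds of function can be sketched in plain Python. This is only an illustration of the shapes involved, not the Vertica UDX API: add2ints and topk mirror the examples named in the list, apply_scalar is a hypothetical stand-in for the database calling a scalar function once per row, and ordering topk by the first column is an assumption.

```python
def add2ints(row):
    """Scalar function: one input row in, one output value out."""
    a, b = row
    return a + b


def topk(rows, k):
    """Transform function: any number of rows in, any number of rows out.

    Assumes the rows are ranked by their first column, largest first.
    """
    return sorted(rows, key=lambda row: row[0], reverse=True)[:k]


def apply_scalar(func, rows):
    """A scalar function is invoked once per row."""
    return [func(row) for row in rows]


rows = [(3, 4), (10, -2), (7, 7)]
sums = apply_scalar(add2ints, rows)  # one value per input row
top2 = topk(rows, 2)                 # sees the whole row set at once
```

A clustering algorithm needs to see all rows together, which is why the kmeans example that follows is written as a transform function.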

UDFs are the integration method we use to invoke business logic difficult to perform in standard SQL. Let’s look at how we can implement R in Vertica using a UDF.

The following example uses a transform function, sending an entire result set to R, which in our case is a list of baseball players and their associated WHIP and IPOUTS measures.

Implementing any user defined function in Vertica is a two-step process.  What follows is a summary of the process.  (The actual code for the function is given after the summary of steps.)

Step 1: Write the function.

For this example, we begin by writing an R source file. This example contains the following two necessary functions:

  1. Main function: contains the code for main processing.
  2. Factory function: returns a list of at most six elements: name, udxtype, intype, outtype, outtypecallback, and parametertypecallback. Note that outtypecallback and parametertypecallback are optional fields.

Step 2: Deploy the function.

  1. Define a new library using the CREATE LIBRARY command.
  2. Define a new function/transform using CREATE FUNCTION/TRANSFORM command.

Write the function (sample code): The first step in implementing kmeans clustering is to write the R script for computing the kmeans clusters. The R script is a file with the extension “.R” that tells Vertica what the main processing function looks like, and provides some datatype information. Keep in mind that writing this function can be done by someone on your analytics team who is somewhat familiar with ‘R’ and this skill is not required by your entire user base. Here is a simplified example R script for implementing kmeans clustering.


# Function that does all the work
kmeans_cluster <- function(x)
{
  # load the required package (kmeans() is provided by stats)
  library(stats)

  # number of clusters to be made
  k <- 3

  # Run the kmeans algorithm on the two measure columns
  c1 <- kmeans(x[, 2:3], k)

  # Return the clustering vector, which tells us which group each
  # data entity (in our case a pitcher's WHIP and IPOUTS) falls into.
  # With k set to 3 above, kmeans places the entities into 3 groups.
  clusters <- data.frame(x[, 1], c1$cluster)
  return(clusters)
}

# Factory function that describes kmeans_cluster to Vertica
kmeansFactory <- function()
{
  list(name=kmeans_cluster,                # function that does the processing
       udxtype=c("transform"),             # type of the function
       intype=c("int", "float", "float"),  # input types
       outtype=c("int", "int")             # output types
  )
}

This is a simplified version, but you can develop a more robust, production-ready version to make this even more reusable. Stay tuned for a future blog that describes how this can be done. Now that the function is written, it is ready to be deployed.

Deploy the new function

Deployment is done like any other UDF deployment, by issuing the following statements in Vertica. If you have written a UDF before, you might notice the new value R for the LANGUAGE parameter:

create library kmeansGeoLib as '/home/Vertica/R-code/kmeans.R' language 'R';

create transform function kmeans as language 'R' name 'kmeansFactory' library kmeansGeoLib;

Invoking the new R function.

To invoke the new R function you can use standard SQL syntax such as:

select kmeans(geonameid, latitude, longitude) over () from geotab_kmeans;

The above is an example of how you would invoke the function with a “points in space” or location related scenario.  The example below is how we used it in our moneyball example.

select kmeans(playerid, WHIP, IPOUTS) over () from bestpitchers;

Note: The over() clause is required for transform functions. The over clause can be used to parallelize the execution if the user knows that the calculation for a group of rows is independent of other rows. For example, consider that you want to cluster the data for each player independently. In such a scenario, this is what the SQL might look like:

select kmeans(playerid, WHIP, IPOUTS) over (partition by playerid) from bestpitchers;

Once this R function has been implemented in Vertica, it can be used by anyone who has a requirement to group subjects together using a sophisticated data mining clustering technology across many business domains.   It does not take much effort to implement data mining algorithms in Vertica.

Many of our customers have indicated to us that time to market is very important. We believe our implementation of R provides more value for your organization because it saves time from the following perspectives:

  1. Implementation perspective: leverage the current UDX integration.
  2. End-user perspective: leverage standard SQL syntax.
  3. Performance perspective: leverage the parallelism of the Vertica multi-node architecture.

Some big data problems have requirements that demand better utilization of the hardware architecture in order to deliver timely results. KMeans is a powerful but compute-intensive algorithm that can involve multiple iterations to increase the accuracy of the results. Stay tuned for another blog that will describe in more detail how Vertica’s R implementation takes advantage of your Vertica cluster and parallelism to improve the accuracy of results with this algorithm while meeting your service level agreements.
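To show where the iterations come from, here is a minimal Python sketch of the textbook k-means loop (Lloyd's algorithm). It is not R's kmeans implementation and not Vertica code: the function name, the sample points, and the fixed starting centroids are assumptions chosen to keep the example deterministic.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Repeat assign-then-recompute until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: another pass would change nothing
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-dimensional points forming three obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0), (9.2, 1.2)]
start = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
labels, centers = kmeans(points, start)
```

Every pass touches every point, so the cost grows with both the data size and the number of iterations, which is why the hardware utilization mentioned above matters.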
