Data scientists use sophisticated algorithms to obtain insights. However, what usually takes tens of lines of MATLAB or R code is now been rewritten in Hadoop like systems and applied at scale in the industry. Instead of rewriting algorithms in a new model, can we stretch the limits of R and reuse it for analyzing Big Data? We present our early experiences at HP Labs as we attempt to answer this question.
Consider a few use cases– product recommendations in Netflix and Amazon, PageRank calculation by search providers, financial options pricing and detection of important people in social networks. These applications (1) process large amounts of data, (2) implement complex algorithms such as matrix decomposition and eigenvalue calculation, and (3) continuously refine their predictive models on arrival of new user ratings, Web pages, or addition of relations in the network. To support these applications we need systems that can scale, can easily express complex algorithms, and can handle continuous analytics.
The complex aspect refers to the observation that most of the above applications use advanced concepts such as matrix operations, graph algorithms, and so on. By continuous analytics we mean that if a programmer writes y=f(x), then y is recomputed automatically whenever x changes. Continuous analytics reduces the latency with which information is processed. For example, in recommendation systems new ratings can be quickly processed to give better suggestions. In search engines newly added Web pages can be ranked and made part of search results more quickly.
In this post we will focus on scalability and complex algorithms.
R is an open source statistical software. It has millions of users, including data scientists, and more than three thousand algorithms packages. Many machine learning algorithms already exist in R, albeit for small datasets. These algorithms use matrix operations that are easily expressed and efficiently implemented in R. In less than a hundred lines you can implement most algorithms. Therefore, we decided to extend R and determine if we can achieve scalability in a familiar programming model.
Figure 1 is a very simplified view that compares R and Hadoop. Hadoop can handle large volumes of data, but R can efficiently execute a variety of advanced analysis. At HP Labs we have developed Presto– a distributed system that extends R. The main advantages of Presto are the language semantics, and the mechanisms to scale R and to run programs in a distributed manner.
Figure 2 shows a high level diagram of how Presto programs are executed. Users write programs using language extensions to R and then submit the code to the Presto runtime. The code is executed across servers in a distributed manner. Presto programs run on commodity hardware: from your multi-core desktop to existing Vertica clusters.
Presto adds three main language constructs to R: darray, splits, and update. A foreach construct is also present. It is similar to parallel loops found in other languages.
For transparent scaling, Presto provides the abstraction of distributed arrays, darray. Distributed arrays store data across multiple machines and give programmers the flexibility to partition data by rows, columns or blocks. Programmers write analytics code treating the distributed array as a regular array, without worrying that it is mapped to different physical machines. Array partitions can be referenced using splits and their contents modified using update. The body of foreach loop processes array partitions in parallel.
Figure 3 shows part of a Presto program that calculates PageRank of a graph. At a high level, the program executes A = (M*B)+C in a distributed manner till convergence. Here M is the adjacency matrix of a large graph. Initially M is declared a NxN sparse matrix partitioned by rows. The vector A is partitioned such that each partition has the same number of rows as the corresponding partition of M. The accompanying illustration (Figure 3) points out that each partition of A requires the corresponding (shaded) partitions of M, C, and the whole array B. The Presto runtime passes these partitions and automatically reconstructs B from its partitions before executing the body of foreach on workers.
Our Presto package has distributed algorithms such as regression analysis, clustering, power method based PageRank, a recommendation system, and so on. For each of these applications we had to write less than 150 lines of code.
While Presto is faster than Hadoop, this post is not to claim yet another system that outperforms Hadoop. Hence we exclude comprehensive experiment results or pretty graphs. Our Eurosys 2013 and HotCloud 2012 papers have detailed performance results [1, 2]. As a data nugget, our experiments show that for many algorithms Presto is more than 20 times faster than Hadoop.
Presto extends R. It efficiently executes machine learning and graph algorithms on a cluster. Presto programs are easy to write, are scalable, and are fast. Our aim in building Presto is not to replace Hadoop or its variants. Rather, Presto is a design point in the space of analytics interfaces—one that is more familiar to data scientists. Presto is still evolving. Today, you can use Presto on top of Vertica to accelerate your data mining analysis. Soon we will support in-database operations as well. Stay tuned.
 Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, Rob Schreiber. Eurosys 2013, Prague, Czech Republic.
 Using R for Iterative and Incremental Processing. Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Rob Schreiber. HotCloud 2012, Boston, USA.