Archive for February, 2013

Distributed R for Big Data

Data scientists use sophisticated algorithms to obtain insights. However, what usually takes tens of lines of MATLAB or R code is now being rewritten in Hadoop-like systems and applied at scale in industry. Instead of rewriting algorithms in a new model, can we stretch the limits of R and reuse it for analyzing Big Data? We present our early experiences at HP Labs as we attempt to answer this question.

Consider a few use cases: product recommendations in Netflix and Amazon, PageRank calculation by search providers, financial options pricing, and detection of important people in social networks. These applications (1) process large amounts of data, (2) implement complex algorithms such as matrix decomposition and eigenvalue calculation, and (3) continuously refine their predictive models as new user ratings, Web pages, or network relations arrive. To support these applications we need systems that can scale, can easily express complex algorithms, and can handle continuous analytics.

The complex aspect refers to the observation that most of the above applications use advanced concepts such as matrix operations, graph algorithms, and so on. By continuous analytics we mean that if a programmer writes y=f(x), then y is recomputed automatically whenever x changes. Continuous analytics reduces the latency with which information is processed. For example, in recommendation systems new ratings can be quickly processed to give better suggestions. In search engines newly added Web pages can be ranked and made part of search results more quickly.
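To make the y=f(x) idea concrete, here is a minimal sketch in plain Python. It is not the API of our framework: the class `Tracked` and its methods `bind` and `update` are hypothetical names chosen for illustration, and the example assumes a single process with no distribution.

```python
class Tracked:
    """Holds a value x and recomputes every bound y = f(x) when x changes."""

    def __init__(self, value):
        self._value = value
        self._dependents = []  # callbacks to run on every change

    def bind(self, f):
        """Register y = f(x); the returned dict's 'y' entry is kept current."""
        result = {"y": f(self._value)}

        def recompute(new_value):
            result["y"] = f(new_value)

        self._dependents.append(recompute)
        return result

    def update(self, new_value):
        """Replace x and trigger recomputation of all dependents."""
        self._value = new_value
        for recompute in self._dependents:
            recompute(new_value)


# Hypothetical data: a list of user ratings and their running mean.
ratings = Tracked([4, 5, 3])
mean_rating = ratings.bind(lambda xs: sum(xs) / len(xs))
before = mean_rating["y"]

ratings.update([4, 5, 3, 2, 1])  # new ratings arrive
after = mean_rating["y"]         # recomputed without an explicit call to f
```

The caller never invokes f a second time; the update to x is what triggers the recomputation, which is the property that reduces latency in the recommendation and search examples.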

In this post we will focus on scalability and complex algorithms.

R is open source statistical software. It has millions of users, including data scientists, and more than three thousand algorithm packages. Many machine learning algorithms already exist in R, albeit for small datasets. These algorithms use matrix operations that are easily expressed and efficiently implemented in R. You can implement most algorithms in fewer than a hundred lines. Therefore, we decided to extend R and determine whether we can achieve scalability in a familiar programming model.

Figure 1 is a very simplified view that compares R and Hadoop. Hadoop can handle large volumes of data, while R can efficiently execute a variety of advanced analyses. At HP Labs we have developed a distributed system that extends R. Its main advantages are the language semantics and the mechanisms to scale R and run programs in a distributed manner.


Figure 1: Extending R for Big Data


Figure 2 shows a high-level diagram of how programs are executed in our distributed R framework. Users write programs using language extensions to R and then submit the code to the new runtime. The code is executed across servers in a distributed manner. Distributed R programs run on commodity hardware, from your multi-core desktop to existing Vertica clusters.


Figure 2: Architecture

Our framework adds three main language constructs to R: darray, splits, and update. A foreach construct is also present. It is similar to parallel loops found in other languages.

For transparent scaling, we provide the abstraction of distributed arrays, darray. Distributed arrays store data across multiple machines and give programmers the flexibility to partition data by rows, columns, or blocks. Programmers write analytics code that treats the distributed array as a regular array, without worrying that it is mapped to different physical machines. Array partitions can be referenced using splits, and their contents modified using update. The body of a foreach loop processes array partitions in parallel.
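The roles of these constructs can be imitated on a single machine. The sketch below is plain Python, not R and not our framework's API: `RowPartitionedArray`, its `split`, `update`, and `to_rows` methods, and the `foreach` helper are hypothetical stand-ins, and threads play the part of workers.

```python
from concurrent.futures import ThreadPoolExecutor


class RowPartitionedArray:
    """Toy stand-in for a distributed array: rows grouped into partitions."""

    def __init__(self, rows, num_partitions):
        size = -(-len(rows) // num_partitions)  # ceiling division
        self.partitions = [rows[i:i + size] for i in range(0, len(rows), size)]

    def split(self, i):
        """Return the rows held by partition i."""
        return self.partitions[i]

    def update(self, i, new_rows):
        """Replace the contents of partition i."""
        self.partitions[i] = new_rows

    def to_rows(self):
        """Reassemble the whole array from its partitions."""
        return [row for part in self.partitions for row in part]


def foreach(indices, body):
    """Run body(i) for every partition index in parallel."""
    with ThreadPoolExecutor() as pool:
        list(pool.map(body, indices))  # list() waits for all tasks to finish


data = RowPartitionedArray([[1, 2], [3, 4], [5, 6], [7, 8]], num_partitions=2)


def double_partition(i):
    # Each task reads one partition and writes back only that partition.
    doubled = [[2 * v for v in row] for row in data.split(i)]
    data.update(i, doubled)


foreach(range(len(data.partitions)), double_partition)
result = data.to_rows()
```

Because each task touches only its own partition, the loop body needs no locking. That is the property that lets the partitions be placed on different machines.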

Figure 3 shows part of a program that calculates distributed PageRank of a graph. At a high level, the program executes A = (M*B)+C in a distributed manner until convergence. Here M is the adjacency matrix of a large graph. Initially M is declared as an NxN sparse matrix partitioned by rows. The vector A is partitioned such that each partition has the same number of rows as the corresponding partition of M. The accompanying illustration (Figure 3) points out that each partition of A requires the corresponding (shaded) partitions of M and C, and the whole array B. The runtime passes these partitions and automatically reconstructs B from its partitions before executing the body of foreach on workers.
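The data flow of this iteration can be sketched without any distributed runtime. The following plain-Python version is an illustration under stated assumptions, not the code in Figure 3: the function name `pagerank_by_row_partitions` is hypothetical, M is a small dense list of lists that is assumed to be already column-normalized and scaled by a damping factor of 0.85, C holds the constant (1 - 0.85)/N term, and the partitions are processed one after another rather than in parallel.

```python
def pagerank_by_row_partitions(M, C, num_partitions, tol=1e-8, max_iter=200):
    """Iterate A = (M*B) + C, computing A one row partition at a time."""
    n = len(M)
    size = -(-n // num_partitions)  # ceiling division
    bounds = [(s, min(s + size, n)) for s in range(0, n, size)]
    B = [1.0 / n] * n  # start from a uniform rank vector
    for _ in range(max_iter):
        parts = []
        for start, end in bounds:
            # A partition of A needs its own rows of M and C, plus all of B.
            parts.append([
                sum(M[r][c] * B[c] for c in range(n)) + C[r]
                for r in range(start, end)
            ])
        A = [v for part in parts for v in part]  # rebuild the full vector
        if max(abs(a - b) for a, b in zip(A, B)) < tol:
            return A
        B = A  # the new ranks become the input of the next iteration
    return B


# Hypothetical 3-page graph with links 0->1, 0->2, 1->2, 2->0.
# M[dest][src] = 0.85 / out_degree(src).
M = [
    [0.0,   0.0,  0.85],
    [0.425, 0.0,  0.0],
    [0.425, 0.85, 0.0],
]
C = [0.05, 0.05, 0.05]  # (1 - 0.85) / 3
ranks = pagerank_by_row_partitions(M, C, num_partitions=2)
```

In this toy graph page 2 receives links from both other pages and ends up with the highest rank. In the distributed setting the inner loop over `bounds` is what the foreach body would execute on separate workers.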

Our algorithms package has distributed algorithms such as regression analysis, clustering, power-method-based PageRank, a recommendation system, and so on. For each of these applications we had to write fewer than 150 lines of code.


Figure 3: Sample Code

This post is not meant to claim that we have built yet another system that is faster than Hadoop. Hence we leave out comprehensive experimental results and pretty graphs. Our EuroSys 2013 and HotCloud 2012 papers have detailed performance results [1, 2]. As a data nugget, our experiments show that many algorithms in our distributed R framework are more than 20 times faster than Hadoop.


Our framework extends R. It efficiently executes machine learning and graph algorithms on a cluster. Distributed R programs are easy to write, are scalable, and are fast.

Our aim in building a distributed R engine is not to replace Hadoop or its variants. Rather, it is a design point in the space of analytics interfaces—one that is more familiar to data scientists.

Our framework is still evolving. Today, you can use R on top of Vertica to accelerate your data mining analysis. Soon we will support in-database operations as well. Stay tuned.

[1] Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, Rob Schreiber. EuroSys 2013, Prague, Czech Republic.

[2] Using R for Iterative and Incremental Processing. Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Rob Schreiber. HotCloud 2012, Boston, USA.

Sensor Data and the Internet of Things: When Big Data Gets Really Big

I remember back in the 1990s when Sun Microsystems claimed that “Java anywhere” would even make refrigerators intelligent enough to know when you were out of milk, triggering a series of events that ultimately resulted in a grocery delivery chain bringing milk to your doorstep the very next day.

Fast forward to today. There are millions (and soon billions) of devices that are connected to the Internet — cars, medical equipment, buildings, meters, power grids, and, yes, even refrigerators. These connected devices comprise the Internet of Things (also known as Machine to Machine or M2M).

But why is this important to your world of Big Data analytics?

The Internet of Things is generating an unfathomable amount of sensor data: data that product manufacturers, in particular, need to manage and analyze to build better products, predict failures to reduce costs, and understand customer behavior to differentiate themselves and improve loyalty.

In fact, IDC’s recent report, The Digital Universe 2020, forecasts that machine-generated data will increase to 42 percent of all data by 2020, up from 11 percent in 2005.

The use cases are proven and here. Some are even mainstream. Think of the commercials for Progressive Insurance’s Snapshot pay-as-you-drive insurance that have taken over our airwaves. Others are around us, and you may not even know it. Over your next work day, think about how many devices are connected and distributing information just waiting for analysis: your car, train, flight, or bus; traffic lights, roadside signs, the elevator and escalator, an ATM, your check-out system.

But, more importantly, join us for our upcoming Webcast: Unlocking the Massive Potential of Sensor Data and the Internet of Things on Thursday, February 14th at noon EST (9:00AM PST).

We look forward to continuing the conversation and sharing these and other emerging use cases, real-world case studies, and a technology perspective to help you prepare for the massive opportunity ushered in by sensor data and the Internet of Things!

HP Vertica helps secure HP’s IT infrastructure


HP’s online security strategy is designed to protect its infrastructure from hackers, fraud, and malware. The cyber security model includes prevention, detection, and response, and incorporates a number of key HP solutions from the HP IT Performance Suite — Security Intelligence and Risk Management portfolio. These HP solutions help HP’s security professionals respond more quickly and efficiently to events — despite the complexity of HP’s IT infrastructure.

HP Vertica figures prominently in the HP IT cyber security story as part of the Lancope StealthWatch solution. Lancope StealthWatch is a network monitoring tool that leverages the HP Vertica Analytics Platform to provide HP’s network security team with a cost-effective, yet powerful, way to monitor and analyze HP’s network traffic, delivering network-based anomaly detection.


Startup Rink

For years, I’ve enjoyed working at Vertica, part of a culture where developers aren’t encumbered by bureaucracy, there is a true meritocracy, and we focus on efficiently delivering meaningful features to customers. I’ve been impressed through the years by the commitment, hard work, and truly impressive accomplishments of my colleagues. It takes an incredible team to build a product, like the original Vertica Analytics Database (now known as the HP Vertica Analytics Platform), from scratch, and tackle complex distributed systems and scalability challenges — it is also a lot of fun, especially with this group.

After HP acquired Vertica over a year and a half ago, I was glad to see the startup culture continue to thrive. The acquisition did bring about some change, which has overall been very positive. The engineering group has benefited from a wealth of resources at HP, including new toys, mostly in the form of hardware, and newfound relationships with the talented folks at HP Labs and in other business units.

It is my great fortune to work with truly talented developers, who have greatly impacted my personal and career growth. The challenges we’ve faced have worked to strengthen their influence. During a recent holiday project, I leaned on lessons learned from my colleagues. Interestingly, the project had nothing to do with my profession.

What does building a backyard (or, in my case, front yard) skating rink have to do with a startup experience?

For starters, you hear lots of reasons why you shouldn’t do it. Building a rink is an impractical project, especially in my geographical location. It is relatively expensive compared to skating at a public rink — the cost is roughly what many pay for a few months of cable, but for something that you don’t mind your kids doing for hours each day. It is a lot of work. I call it exercise, something I need more of this time of year. At best, temperatures will remain cold enough to sustain five or six weeks of skating. As I got started, I heard all about how the ground didn’t freeze at all last winter.

To complete a project like this, one must filter criticism appropriately. The folks at my local box store were very helpful in improving my rink design, while others contributed only negative comments. I’m certain a good many of my neighbors think I am crazy. I was a little concerned when two fire engines came down my street while I was flooding the rink. It turns out that they were carrying Santa Claus on display for kids; his sleigh must have been getting tuned for his big day.

Perhaps most importantly, you have to be able to rebound when things don’t go as planned. I broke my back (at least it felt that way) framing the rink. What I didn’t count on was a lot of rain, followed by a fair amount of snow. These conditions added weight to the rink and made the ground extremely soggy (it was mush to a depth of more than one foot in some areas). Consequently, the deep end of the rink (the ground isn’t perfectly level) burst at one corner.

I’m certain that I looked crazed as I hurried to mend the damage before the rink fell apart completely. Once things stabilized, I could see that the ground wasn’t holding. The stakes were leaning and the rink was in great jeopardy. I felt defeated. I thought about giving up. I’d invested a lot of time and energy and wasted some money on this foolish project. Comments from the naysayers filled my head. But, as I said earlier, I’ve had the good fortune of working on challenging projects with colleagues who know how to make things work in the face of adversity. I didn’t need to consult them. I knew how they’d react. I’ve seen the same scenario play out dozens of times at work. After I cleared my head and got a pep talk from my wife, I doubled down on my efforts and made a serious attempt to salvage the rink. There was no guarantee of success; things looked bleak.

Thankfully, hard work paid off. It usually does, but there are times when, despite good intentions and best efforts, things don’t work out as intended. When that happens you’re left with valuable lessons learned. And, in that case, next year’s rink will be a success.

A few days after the rink was repaired, Mother Nature did her part. The rink has been in operation for a couple of days now. Already, the work has been worthwhile. My family has had some very memorable times out there, like the time my three-year-old daughter amazed us with her on-ice impression of Prof. Hinkle chasing Frosty down a hill as she laughed hysterically, or watching my five-year-old son give my wife a celebratory hug after imagining winning the Stanley Cup for the 1,000th time with another amazing goal.

With any luck, we’ve got a few more weeks to enjoy the cold weather. Now I’ve got to head out to resurface the ice with the homeboni I built (see image) so there’s a fresh sheet for the kids to skate on tomorrow.

