Vertica

Archive for the ‘Hadoop’ Category

I Love DIY Projects

I love DIY projects. I love watching the YouTube videos, scouring the web for advice from others, and learning new skills. The innovation and creativity that’s out there is amazing! Plus, DIY projects save money, sometimes a lot of money! This past month, we decided to build our own stone patio in the backyard … how hard could that be? Turns out, lifting 5000+ pounds of rock and stone combined with the challenges of grade and water management is a lot harder than it looks!


Somehow, this experience led me to think about Open Source Software. Quite a leap, isn’t it? But think about it … the innovation and excitement that comes from thousands of smart people working together to create new software is pretty cool. It saves money (well, let’s talk about that later) and it exposes an organization to new thinking in a new era of technology.

Then comes the hard part. First, implementations of Open Source Software like Hadoop take more expertise than might have been expected, so it really makes sense to engage with a professional Hadoop distribution partner. Suddenly, free isn’t really free anymore. Then the commercial-grade discussion comes into play. Is there enough security and manageability built into the current Hadoop distributions to meet the constantly rising bar in today’s world? I don’t want my company’s logo in the next Big Data breach article. Finally, the underlying infrastructure (kind of like the piping, drainage gravel, and paver base that now must be added to my patio project!) starts to expand in ways that might not have been expected.
But does this mean that Open Source projects are a bad idea? Absolutely not! I will never give up the satisfaction (and cost savings!) of DIY projects. But the key is to make the right choices and partner with the right people. HP Software Big Data has a passion for innovation and we love the excitement of the Open Source community; that’s why we were so excited to contribute our recent Distributed R release to Open Source. We want our customers to find value in their Hadoop implementations, so we bring the strongest and most sophisticated SQL on Hadoop offering to the market, with a set of rich analytics that can really uncover the insights in data stored in Hadoop.

After all, as Mike Stonebraker, the founder of HP Vertica and winner of the Turing Award (really the Nobel Prize of Computer Science!), recently said in an interview with Barron’s: “It started out, NoSQL meant ‘Not SQL,’ then it became ‘Not only SQL,’ and now I think it means ‘Not-yet-SQL.’” That’s why we at HP are so determined to “speak the language” that our developer and customer community knows and needs. And perhaps most importantly, we continue to develop open APIs and SDKs for our Haven Big Data Platform because we know that the brilliant and passionate developer community (think DIY for analytics!) needs the right tools for their jobs.

HP Software Big Data believes in DIY. We’re a hands-on group with the advantage of structured QA processes, the expertise from more than a decade of analytics purpose-built for Big Data, and the ability to bring data scientists and data migration specialists to any enterprise DIY project through our Enterprise Services Analytics & Data Management practice. Got a DIY project in mind? We’re in!

The Top Five Reasons SQL-on-Hadoop Keeps CIOs Awake at Night


Being a part of HP is really an amazing thing – it gives us access to remarkable technologies and very bright, hard-working people. But the best part is talking with our customers.

One topic on the mind of many technology leaders today is the “elephant in the room” – Hadoop. From its humble beginnings as a low-cost implementation of mass storage and the Map/Reduce programming framework, it’s become something of a movement. Businesses from Manhattan to Mumbai are quickly discovering that it provides favorable economics for one very specific use case: a very low-cost way to store data of uncertain value. This use case has even acquired a name – the “data lake”.

I first heard the term five years ago, when Vertica was a tiny startup based in Boston. It seemed that a few risk-tolerant businesses in California were trying out this thing called Hadoop as a place to park data that they’d previously been throwing away. Many businesses throw away all but a tiny portion of their data simply because they can’t find a cost-effective place to store it. To these companies, Hadoop was a godsend.

And yet in some key ways, Hadoop is also extremely limited. Technology teams continue to wrestle with extracting value from a Hadoop investment. Their primary complaint? That there is no easy way to explore and ask questions of data stored in Hadoop. Technology teams understand SQL, but Hadoop provides only the most basic SQL support. I’ve even heard stories of entire teams resigning en masse, frustrated that their company has put them in a no-win situation – data everywhere and not a drop to drink.

Variations on the above story have undoubtedly played out at many companies across the globe. The common theme is that, love it or hate it, SQL is one of the core languages for exploration and inquiry of semi-structured and structured data. And most SQL on Hadoop offerings are simply not up to the task. As a result, we now have a gold rush of sorts, with multiple vendors rushing to build SQL on Hadoop solutions. To date, there are at least seven different commercial SQL for Hadoop offerings, and many organizations are learning about the very big differences between these offerings!

In our many conversations with C-level technology executives, we’ve heard a common set of concerns about most SQL on Hadoop options. Some are significant. So, without further ado, here are the top five reasons SQL on Hadoop keeps CIOs awake at night:

5. Is it secure? Really?

The initial appeal of the data lake is that it can be a consolidated store – businesses can place all their data in one place. But that creates huge risk because now…all the data is in one place. That’s why our team has been working diligently on a SQL on Hadoop offering that not only includes core enterprise security features, but also secures data in flight with SSL encryption, integrates with enterprise security systems such as Kerberos, and provides a column-level access model. If your SQL on Hadoop solution doesn’t offer these capabilities, your data is at risk.

4. Does it support all the SQL you need?

Technically, SQL on Hadoop has been around for years now in the form of an open source project called Hive. Hive has its own version of SQL called HQL, and Hive users frequently complain that HQL supports only a subset of SQL. There are many things you just can’t do. This forces all manner of data-flow contortions, as analysts must continually resort to other tools or languages for things that are easily expressed in SQL…if only the Hadoop environment supported it.

This problem remains today, as many of the SQL on Hadoop variants do not support the full range of ANSI SQL. For example, our benchmark team regularly performs tests with the Vertica SQL on Hadoop product to ensure that it meets our standards for quality, stability and performance. One of the test suites we use is the TPC-H benchmark. For those not in the know, TPC-H is an industry standard benchmark with pre-defined SQL, schemas, and data. While our engine runs the full suite of tests, other SQL on Hadoop flavors that we’ve tested are not capable of running the entire workload. In fact, some of them only run 60% of the queries!
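To make “the full range of ANSI SQL” concrete, here is a rough sketch of the kind of query the benchmark exercises: a correlated EXISTS subquery with date arithmetic, in the spirit of TPC-H Q4, issued from R over a generic DBI/ODBC connection. The DSN name, connection details, and exact interval syntax are illustrative assumptions rather than a prescribed setup; the point is that engines supporting only a SQL subset tend to reject this pattern outright or force analysts to rewrite it by hand.

    # Sketch only: assumes the DBI and odbc packages, an ODBC DSN named
    # "VerticaSQLonHadoop", and the standard TPC-H 'orders' and 'lineitem' tables.
    library(DBI)
    con <- dbConnect(odbc::odbc(), dsn = "VerticaSQLonHadoop")

    # A TPC-H Q4-style query: correlated EXISTS plus interval date arithmetic,
    # features a SQL-subset engine may not be able to run as written.
    sql <- "
      SELECT o_orderpriority, COUNT(*) AS order_count
      FROM   orders
      WHERE  o_orderdate >= DATE '1993-07-01'
        AND  o_orderdate <  DATE '1993-07-01' + INTERVAL '3 months'
        AND  EXISTS (SELECT 1
                     FROM   lineitem
                     WHERE  l_orderkey   = o_orderkey
                       AND  l_commitdate < l_receiptdate)
      GROUP BY o_orderpriority
      ORDER BY o_orderpriority;
    "
    print(dbGetQuery(con, sql))
    dbDisconnect(con)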

3. …And if it runs the SQL, does it run well?

It’s one thing to implement a SQL engine that can parse a bit of SQL and create an execution plan to go and get the data. It’s a very different thing to optimize the engine so that it does these things quickly and efficiently. I’ve been working with database products for almost thirty years now, and have seen over and over that the biggest challenge faced by any SQL engine is not creating the engine, but dealing with the tens of thousands of edge cases that will arise in the real world.

For example, being aware of the sort order of data stored on disk can dramatically improve query performance. Moreover, optimizing the storage of the data to leverage that sort sequence with something like run-length encoding can improve performance further. But not if the SQL engine doesn’t know how to take advantage of it. One example of an immature implementation is an engine that cannot use just-in-time decompression of highly compressed data. If the system has to pay the CPU penalty of decompressing highly compressed data every time it is queried, why bother compressing it in the first place, except maybe to save disk space? Likewise, if a user needs to keep extremely high-performance aggregations in sync with the transaction data, that simply won’t be possible unless the engine has been written to manage data this way and is aware of the data characteristics at run time.
These are just two examples, but they can make the difference between a query taking one second or two days. Or worse, the difference between a query that runs and one that brings down the database because uncompressed data overflows memory.
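To illustrate the sort-order and run-length-encoding point above, here is a hedged sketch of what declaring such a physical design looks like in Vertica-style SQL, driven from R over an assumed DBI/ODBC connection. The table, column names, and DSN are made up for the example, and projection syntax can vary by Vertica version; the idea is simply that sortedness and encoding are declared once and then exploited by the engine at query time rather than managed by hand.

    # Sketch only: the 'clickstream' table, its columns, and the DSN are assumptions.
    library(DBI)
    con <- dbConnect(odbc::odbc(), dsn = "VerticaSQLonHadoop")

    # A physical design sorted by timestamp, with run-length encoding on the
    # low-cardinality columns, so the engine can operate on compressed data
    # instead of decompressing everything on every query.
    dbExecute(con, "
      CREATE PROJECTION clickstream_by_time (
        event_ts,
        page_id  ENCODING RLE,
        region   ENCODING RLE,
        user_id
      )
      AS SELECT event_ts, page_id, region, user_id
         FROM clickstream
         ORDER BY event_ts, page_id, region
      SEGMENTED BY HASH(user_id) ALL NODES;
    ")

    # Aggregations over the sorted, encoded columns can then use the encoding directly.
    dbGetQuery(con, "
      SELECT page_id, COUNT(*) AS views
      FROM clickstream
      GROUP BY page_id
      ORDER BY views DESC
      LIMIT 10;
    ")
    dbDisconnect(con)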

2. Does it just dump files to a file-system, or actively manage and optimize storage?

Projects built for Hadoop almost invariably pick up some of the “baggage” of using the core Hadoop functionality. For example, some of the SQL on Hadoop offerings just dump individual files into the filesystem as data is ingested. After loading a year of data, you’re likely to find yourself with hundreds of thousands of individual files. This is a performance catastrophe. Moreover, to optimize these files a person has to do something manually – write a script, run a process, call an executable, and so on. All of this adds to the real cost of the solution: administrative complexity, plus design complexity to work around performance issues. What a business needs is a system that simplifies this by managing and optimizing files automatically.

1. When two people ask the same question at the same time, do they get the same answer?

There are certain fundamentals about databases that have made them so common for tracking key business data today. One of these things is called ACID compliance. It’s an acronym that doesn’t bear explaining here, so suffice it to say that one of the things an ACID-compliant database guarantees is that if two people ask the exact same question of the exact same data at the exact same time, they will get the same answer.

Seems kind of obvious, doesn’t it? And yet a common issue with SQL on Hadoop distributions is that they may lack ACID compliance. That isn’t so good for data scientists building predictive models to grow the business, and it’s certainly not suitable for producing financials! Caveat emptor.

Many of our customers consider these five areas to be a benchmark for measuring SQL on Hadoop maturity. SQL on Hadoop offerings that fail to deliver these things will drive up the cost and time it takes to solve problems as analysts must use a mix of tools, work around performance and stability limitations, etc. And in the context of massive data thefts taking place today, how many CIOs feel comfortable with three petabytes of unsecured data pertaining to every single aspect of their business being accessible to anyone with a text editor and a bit of Java programming know-how?

The good news is that we at HP have been thinking about these concerns for years now – and working on solving them. Vertica SQL on Hadoop addresses each of these concerns in a comprehensive way, so organizations can finally unlock the full value of their data lake. We’re happy to tell you more about this, and we’d love for you to try it out! Click here to request more information from our team.

Better Together


Just like peanut butter and chocolate, the mix of several flavors of data is much more interesting and useful than just one. At HP we classify types of data into three categories:


Human Data

Human data is stuff created by people as opposed to machines: social media posts, videos, audio, emails, spreadsheets, blogs, and Wikipedia. This data is hard to analyze, as it is written in natural language, does not conform to a particular structure, and lives in places that are not particularly easy to access. Because human data lacks traditional structure, we can’t just pull it straight into a data warehouse (nor should we want to).

If you want to take full advantage of human data, you must do two things: extract metadata and textual content, and extract meaning. These are completely different things. I can easily write a program to extract keywords and text from PDFs and use them for a simple search engine. But unless I understand how that PDF relates to the millions of other documents in my business, I cannot do much more than that simple search. Plus, how can I extract information from a video? What about audio recordings from your customer service desk? Sentiment from a YouTube video review of your product and the related comments? These are all very valuable, and not particularly easy to analyze.
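To be clear about which half is the easy half: extracting text and keywords really is a few lines of code. Here is a minimal R sketch using the pdftools package (the file name is a stand-in, and the keyword counting is deliberately naive) that gets you as far as a crude search index and no further – relating that document to the millions of others in your business is the hard part.

    # Sketch only: assumes the pdftools package is installed; the file path is a stand-in.
    library(pdftools)

    pages <- pdf_text("quarterly_report.pdf")      # one character string per page
    text  <- tolower(paste(pages, collapse = " "))
    words <- unlist(strsplit(text, "[^a-z]+"))
    words <- words[nchar(words) > 3]               # crude filter for trivial tokens

    # Top keywords: enough for a naive search index, nowhere near "meaning".
    head(sort(table(words), decreasing = TRUE), 20)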


Machine Data

Machine Data is data produced by machines or for machines, like sensor data from smart meters, wearable technology, and weblogs from your web site. This category of data is growing exponentially faster than human or business data, and the size of the data is the main driver behind technologies like the Hadoop Distributed File System (HDFS). If I asked you how much data you have today versus 5 years ago, you might say 10 times as much. (If I asked you how many new customers you have today vs. 5 years ago, I would hope you’d say 10 times as many as well!) If you do indeed have 10x more data today, it’s because most of your new data is machine data. Machine data is growing so fast that it has spawned a number of new technologies to store and analyze it, from both open-source and proprietary sources. Understanding what these technologies do, and what they do NOT do, should be on your to-do list right now! (If you want help, feel free to contact me.)

Business Data

Business data is data created by businesses to help them run the business. This includes data in your data warehouse, as well as less centralized data like the data found in spreadsheets. Think your data warehouse solution has all of your business data? Just for fun, think about how much of your business is run through Excel spreadsheets. If you are lucky, those spreadsheets are sitting in a SharePoint space somewhere, and not just on employee desktops. And if they are on people’s desktops, hopefully they’re being backed up. Scary that you don’t have that information indexed and searchable, isn’t it?

So now that you have an idea of the types of data out there, what can you do with it? A picture is worth a thousand words, so let’s start off with a picture and a story.

Use Case: NASCAR

First, watch this video. When you think about NASCAR, you think about fast cars flying around the track, smacking into each other as they jockey for position. What you might not realize is that everything in NASCAR comes back to sponsorship. A NASCAR race is essentially a collection of 200mph billboards. Take a look at this picture:

[Photo: Tony Stewart’s car at Infineon Raceway]

You are looking at 3-Time Sprint Cup Champion Tony Stewart at Infineon Raceway. First, notice that the car is an advertisement for a number of different companies. The race is called the Sprint Cup. The raceway is Infineon Raceway. NASCAR is not just about racing!

“The NASCAR ecosystem is a huge one involving fans, race teams, owners, drivers, racetracks, promoters, sponsors, advertisers, media and many more.”
– Sean Doherty, Director of Digital Engagement and Integrated Marketing Communications at NASCAR (credit CIO Insight).

NASCAR is a vehicle for advertising, as much as it is advertising for vehicles. Of course advertisers want to maximize viewers, because that is ultimately what sponsors want: people looking at their logo, or viewing their ads during the commercial break.

NASCAR realizes that its success is all about the fan base. But the majority of that fan base is sitting at home, far from the action. How to engage them? Putting aside creepy ideas like taking over video feeds from an Xbox Kinect, there are plenty of ways that fans publicly interact. The most obvious one: they tweet about the action in real time. They even tweet during the commercials, about the commercials. So now we have two things we can monitor: the number of tweets at any given time during the race, and the content of the tweets. Counting tweets is easy: just pick a time slice like 1 minute, count tweets that include NASCAR-related hashtags in that timeslice, and put them up on a dashboard. TA-DA! You now have one indicator of engagement.
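Here is a quick sketch of that counting step, assuming the tweets have already been landed in a table with a timestamp and a text column (the table, columns, hashtag filter, and connection are all illustrative): Vertica’s TIME_SLICE function buckets the stream into one-minute windows for the dashboard, and a plain DATE_TRUNC would do the same job in other engines.

    # Sketch only: the DSN and the tweets(tweet_ts, body) table are assumptions.
    library(DBI)
    con <- dbConnect(odbc::odbc(), dsn = "VerticaSQLonHadoop")

    engagement <- dbGetQuery(con, "
      SELECT TIME_SLICE(tweet_ts, 1, 'MINUTE') AS minute_bucket,
             COUNT(*)                          AS tweet_count
      FROM   tweets
      WHERE  body ILIKE '%#nascar%'
      GROUP BY TIME_SLICE(tweet_ts, 1, 'MINUTE')
      ORDER BY minute_bucket;
    ")

    # The 'TA-DA' line on the dashboard: tweets per minute over race time.
    plot(engagement$minute_bucket, engagement$tweet_count, type = "l",
         xlab = "Race time", ylab = "Tweets per minute")
    dbDisconnect(con)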

But wait, are the fans happy or mad? We have to look at the content of the tweets, and that means sentiment analysis. We need to attach sentiment to each tweet so that we can gauge overall sentiment. Now the real problem: tweets are, by nature, short. They also are written in shorthand, and use colloquial language. So now we need natural language processing on what is essentially slang. We have two factors that we can gauge throughout the race: engagement level and sentiment. That dashboard is getting more interesting!

[Photo: a Zamboni resurfacing the ice at a hockey game]

Here is a strange and related observation: did you know that the time spent during a hockey game where a Zamboni cleans the ice is one of the most heavily tweeted parts of the game? PEOPLE LOVE THE ZAMBONI.

Anyway, how does this relate to fan engagement? Well, let’s say that it starts raining heavily during a race, and NASCAR decides to pull the vehicles off the track. We now have a problem and an opportunity: will the home viewers check out until the rain stops? How do we keep them engaged during the break? We could start by looking at that dashboard to see which parts of the race were most heavily talked about, then cue up the commentators and video to go over those bits. We could poll the audience and have them tweet their favorite moment, then watch in real time as the results come in from Twitter. For that we will have to categorize and cluster keywords from the tweets in real time.

There is much more to this use case, but suffice it to say that NASCAR also collects data from media outlets in print, radio, and TV, and adds them into the mix. That means scanning video and audio for keywords and content, just like the tweets.

The data collected by NASCAR can then be used by its sponsors, who have their own data, likely in a more traditional data warehouse. Here are a few of the things NASCAR and its sponsors are doing with this system:

  • Race teams can gauge fan conversation and reaction to a new paint scheme for one of its cars to decide whether to alter it before future races.
  • The Charlotte Motor Speedway is tracking conversations and levels of interaction about one of its recent ticket promotions.
  • A sponsor is following track response and media coverage about a new marketing campaign.

-List Credit: CIO Insight

What has HP done to make this easier?

We covered a lot of ground in that one use case. We needed access to non-traditional data sources like Twitter, access to traditional data sources like an EDW, sentiment analysis, natural language processing, audio text retrieval, video frame recognition, and time series functions to slice up the data. Throw in some pattern-matching techniques and probabilistic modeling too. Then connect all that data to some real-time dashboards using standard SQL technologies and tools. That’s quite a laundry list.

HP has all of the technologies needed to implement this solution for NASCAR. We created a platform that can store and analyze 100% of your data. Structured, unstructured, semi-structured, multi-structured, human, machine, or business data: we can store it and analyze it. The latter part is the interesting one. It’s trivial to set up a Hadoop cluster and store your EDW, web logs, and tweets from the Twitter Firehose there. But Hadoop doesn’t magically know how to parse emails, databases, weblog data, or anything else. That’s on you. So is stitching those data sources together, running analytics on them, and hooking all that up to a sensible user interface. Of course, that’s what we do at HP. We have even moved this technology onto the cloud, to make development and testing of these solutions quick and easy. Take a look at Haven on Demand!

What should you be asking yourself?

First, do you understand all of the types of data involved in your industry? Outside of your EDW, how do you interact with your customers, vendors, sponsors, or investors? How can you collect that data and get it into an analytics system? Does your data give you a competitive advantage, or is it just sitting in cold storage? What other data sources do you need in order to make innovative products and services? How do you join it all together using modern data science techniques, while using common data languages like SQL?

These are non-trivial questions. Sometimes just knowing what you have is a science project in itself (it doesn’t have to be – we actually have products for that). Many people assume that data cannot be analyzed unless it is all lumped together in one place, like a Hadoop cluster or an EDW. The good news is that this isn’t necessary in most cases. There will likely be cases where you can optimize processing by moving data into a high-performance data store, but much of your data can be analyzed right where it is. We have been helping customers solve these problems, and we would be delighted to help you as well.

Author Note

This is the first in a series of three articles. The next article deals with how location data from cell phones and social media is creating huge new opportunities for those with the means to analyze it. The third article will deal with machine data and the challenges of handling the Internet of Things at scale.

Thanks for reading!

HP Vertica for SQL on Hadoop

HP Vertica for SQL on Hadoop from Vertica Systems on Vimeo

HP Vertica now offers a SQL on Hadoop license, which allows you to leverage Vertica’s powerful analytics engine to explore data in Hadoop Distributed File System (HDFS).

This offering is licensed on a per-node, per-year term with no data volume limits.

With your SQL on Hadoop license, you get access to proven enterprise features like:

  • Database designer
  • Management console
  • Workload management
  • Flex tables
  • External tables
  • Backup functionality

See our documentation on HP Vertica SQL on Hadoop for limitations.
To learn more about other HP Vertica licenses, view our Obtaining and Installing Your HP Vertica Licenses video or contact an HP Licensing center.
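As a rough illustration of the external tables item above, here is what pointing the engine at files that already live in HDFS can look like, driven from R over an assumed ODBC connection. The HDFS path, schema, and DSN are made-up placeholders, and the exact COPY source syntax for HDFS differs across Vertica releases, so treat this as the shape of the feature and check the documentation referenced above for the syntax your version supports.

    # Sketch only: DSN, HDFS path, and schema are illustrative assumptions;
    # see the HP Vertica SQL on Hadoop documentation for your release's exact syntax.
    library(DBI)
    con <- dbConnect(odbc::odbc(), dsn = "VerticaSQLonHadoop")

    # An external table whose data stays in HDFS and is read at query time.
    dbExecute(con, "
      CREATE EXTERNAL TABLE web_logs (
        log_ts    TIMESTAMP,
        client_ip VARCHAR(45),
        url       VARCHAR(2048),
        status    INT
      )
      AS COPY FROM 'hdfs:///data/weblogs/*.csv' DELIMITER ',';
    ")

    # Query it with ordinary SQL, as if it were a native table.
    dbGetQuery(con, "SELECT status, COUNT(*) FROM web_logs GROUP BY status;")
    dbDisconnect(con)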

The Automagic Pixie

The “De-mythification” Series

Part 4: The Automagic Pixie

Au∙to∙mag∙ic: (Of a usually complicated technical or computer process) done, operating, or happening in a way that is hidden from or not understood by the user, and in that sense, apparently “magical”

[Source: Dictionary.com]

In previous installments of this series, I debunked some of the more common myths around big data analytics. In this final installment, I’ll address one of the most pervasive and costly myths: that there exists an easy button that organizations can press to automagically solve their big data problems. I’ll provide some insights as to how this myth came about, and recommend strategies for dealing with the real challenges inherent in big data analytics.

Like the single-solution elf, this easy-button idea is born of the desire of many vendors to simplify their message. The big data marketplace is new enough that all the distinct types of needs haven’t yet become entirely clear – which makes it tough to formulate a targeted message. Remember in the late 1990s when various web vendors were all selling “e-commerce” or “narrowcasting” or “recontextualization”? Today most people are clear on the utility of the first two, while the third is recognized for what it was at the time – unhelpful marketing fluff. I worked with a few of these firms, and watched as the businesses tried to position product for a need which hadn’t yet been very well defined by the marketplace. The typical response was to keep it simple – just push the easy button and our technology will do it for you.

I was at my second startup in 2001 (an e-commerce provider using what we would refer to today as a SaaS model) when I encountered the unfortunate aftermath of this approach. I sat down at my desk on the first day of the job and was promptly approached by the VP of Engineering, who informed me that our largest customer was about to cancel its contract – we’d been trying to upgrade the customer for weeks, during which time their e-commerce system was down. Although they’d informed the customer that the upgrade was a push-button process, it wasn’t. In fact, by the time I arrived, the team was beginning to believe that an upgrade would be impossible and that they should propose re-implementing the customer from scratch. By any standard, this would be a fail.

Over the next 72 hours, I migrated the customer’s data and got them up and running.   It was a Pyrrhic victory at best – the customer cancelled anyhow, and the startup went out of business a few months later.

The moral of the story? No, it’s not to keep serious data geeks on staff to do automagical migrations. The lesson here is that when it comes to data-driven applications – including analytics – the “too good to be true” easy button almost always is. Today, the big data marketplace is full of great-sounding messages such as “up and running in minutes” or “data scientist in a box”.

“Push a button and deploy a big data infrastructure in minutes to grind through that ten petabytes of data sitting on your SAN!”

“Automatically derive predictive models that used to take the data science team weeks in mere seconds! (…and then fire the expensive data scientists)!”

Don’t these sound great?

The truth is, as usual, more nuanced. One key point I like to make with organizations is that big data analytics, like most technology practices, involves different tasks. And those tasks generally require different tools. To illustrate this for business stakeholders, I usually resort to the metaphor of building a house. We don’t build a house with just a hammer, or just a screwdriver. In fact, it requires a variety of tools – each of which is suited to a different task. A brad nailer for trim. A circular saw for cutting. A framing hammer for framing. And so on. And in the world of engineering, a house is a relatively simple thing to construct. A big data infrastructure is considerably more complex. So it’s reasonable to assume that an organization building this infrastructure would need a rich set of tools and technologies to meet the different needs.

Now that we’ve clarified this, we can get to the question behind the question. When someone asks me, “Why can’t we have an easy button to build and deploy analytics?”, what they’re really asking is, “How can I use technological advances to build and deploy analytics faster, better, and cheaper?”

Ahh, now that’s an actionable question!

In the information technology industry, we’ve been blessed (some would argue cursed) by the nature of computing. For decades now we’ve been able to count on continually increasing capacity and efficiency. So while processors continue to grow more powerful, they also consume less power. As the power requirements for a given unit of processing become low enough, it is suddenly possible to design computing devices which run on “ambient” energy from light, heat, motion, etc. This has opened up a very broad set of possibilities to instrument the world in ways never before seen – resulting in dramatic growth of machine-readable data. This data explosion has led to continued opportunity and innovation across the big data marketplace. Imagine if each year, a homebuilder could purchase a saw which could cut twice as much wood with a battery half the size. What would that mean for the homebuilder? How about the vendor of the saw? That’s roughly analogous to what we all face in big data.

And while we won’t find one “easy button”, it’s very likely that we can find a tool for a given analytic task which is significantly better than one that was built in the past. A database that operates well at petabyte scale, with performance characteristics that make it practical to use. A distributed filesystem whose economics make it a useful place to store virtually unlimited amounts of data until you need it. An engine capable of extracting machine-readable structured information from media. And so on. Once my colleagues and I have debunked the myth of the automagic pixie, we can have a productive conversation to identify the tools and technologies that map cleanly to the needs of an organization and can offer meaningful improvements in their analytical capability.

I hope readers have found this series useful. In my years in this space, I’ve learned that in order to move forward with effective technology selection, sometimes we have to begin by taking a step backward and undoing misconceptions. And there are plenty! So stay tuned.

Vertica on MapR SQL-on-Hadoop – join us in June!

We’ve been working closely with MapR Technologies to bring to market our industry-leading SQL-on-Hadoop solution, and on June 3, 2014, we will jointly deliver a live webinar featuring this joint solution and related use cases. To register and learn how you can enjoy the benefits of a SQL-on-Hadoop analytics solution that provides the highest-performing, most tightly-integrated platform for operational and exploratory analytics, click here.

This unified, integrated solution reduces complexity and costs by running a single cluster for both HP Vertica and Hadoop. It tightly integrates HP Vertica’s 100% ANSI SQL, high-performance Big Data analytics platform with the MapR enterprise-grade Distribution for Apache Hadoop, providing customers and partners with the highest-performing, most tightly-integrated solution for operational and exploratory analytics at the lowest total cost of ownership (TCO).

This solution will also be presented live by HP Vertica and MapR executives at HP Discover on June 11, 2014. For more information, visit the HP Discover website.

In addition, a specially-optimized version of the MapR Sandbox for Hadoop is now available in the HP Vertica Marketplace. To download this and other add-ons for the HP Vertica Analytics platform, click here.

 

Distributed R for Big Data

Data scientists use sophisticated algorithms to obtain insights. However, what usually takes tens of lines of MATLAB or R code is now being rewritten in Hadoop-like systems and applied at scale in industry. Instead of rewriting algorithms in a new model, can we stretch the limits of R and reuse it for analyzing Big Data? We present our early experiences at HP Labs as we attempt to answer this question.

Consider a few use cases: product recommendations at Netflix and Amazon, PageRank calculation by search providers, financial options pricing, and detection of important people in social networks. These applications (1) process large amounts of data, (2) implement complex algorithms such as matrix decomposition and eigenvalue calculation, and (3) continuously refine their predictive models as new user ratings and Web pages arrive or new relations are added to the network. To support these applications we need systems that can scale, can easily express complex algorithms, and can handle continuous analytics.

The complex aspect refers to the observation that most of the above applications use advanced concepts such as matrix operations, graph algorithms, and so on. By continuous analytics we mean that if a programmer writes y=f(x), then y is recomputed automatically whenever x changes. Continuous analytics reduces the latency with which information is processed. For example, in recommendation systems new ratings can be quickly processed to give better suggestions. In search engines newly added Web pages can be ranked and made part of search results more quickly.

In this post we will focus on scalability and complex algorithms.

R is open source statistical software. It has millions of users, including data scientists, and more than three thousand algorithm packages. Many machine learning algorithms already exist in R, albeit for small datasets. These algorithms use matrix operations that are easily expressed and efficiently implemented in R; most algorithms can be implemented in less than a hundred lines. Therefore, we decided to extend R and determine whether we could achieve scalability in a familiar programming model.
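To make the “a few lines of R” claim concrete, here is a plain, single-machine sketch of the power iteration at the heart of PageRank on a toy four-node graph (the damping factor and convergence threshold are the usual textbook choices, not anything specific to our framework). The distributed version discussed below keeps essentially this shape.

    # Plain R, single machine: PageRank by power iteration on a toy 4-node graph.
    adj <- matrix(c(0, 1, 1, 0,
                    0, 0, 1, 1,
                    1, 0, 0, 1,
                    0, 0, 1, 0), nrow = 4, byrow = TRUE)

    # Column-stochastic transition matrix: column j spreads node j's rank to its out-links.
    M <- t(adj / rowSums(adj))

    d <- 0.85                  # damping factor
    n <- nrow(M)
    r <- rep(1 / n, n)         # start from a uniform rank vector

    repeat {
      r_new <- d * (M %*% r) + (1 - d) / n
      if (max(abs(r_new - r)) < 1e-9) break
      r <- r_new
    }
    round(as.vector(r_new), 4) # PageRank scores for the four nodes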

Figure 1 is a very simplified view that compares R and Hadoop. Hadoop can handle large volumes of data, but R can efficiently execute a variety of advanced analyses. At HP Labs we have developed a distributed system that extends R. Its main advantages are the language semantics and the mechanisms it adds to scale R and to run programs in a distributed manner.


Figure 1: Extending R for Big Data

Details

Figure 2 shows a high-level diagram of how programs are executed in our distributed R framework. Users write programs using language extensions to R and then submit the code to the new runtime. The code is executed across servers in a distributed manner. Distributed R programs run on commodity hardware: from your multi-core desktop to existing Vertica clusters.


Figure 2: Architecture

Our framework adds three main language constructs to R: darray, splits, and update. A foreach construct is also present. It is similar to parallel loops found in other languages.

For transparent scaling, we provide the abstraction of distributed arrays, darray. Distributed arrays store data across multiple machines and give programmers the flexibility to partition data by rows, columns, or blocks. Programmers write analytics code treating the distributed array as a regular array, without worrying that it is mapped to different physical machines. Array partitions can be referenced using splits and their contents modified using update. The body of the foreach loop processes array partitions in parallel.
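Here is a hedged sketch of how these constructs fit together, following the programming model described in the papers cited at the end of this post (function names and argument conventions may differ slightly between releases, so treat it as illustrative): create a row-partitioned darray, then use foreach with splits and update to fill each partition in parallel.

    # Sketch only: follows the Distributed R programming model from the papers
    # cited below; exact function signatures may differ between releases.
    library(distributedR)
    distributedR_start()               # start workers on the local machine or cluster

    # A 1,000,000 x 10 dense array, split into row blocks of 100,000 rows each.
    X <- darray(dim = c(1e6, 10), blocks = c(1e5, 10), sparse = FALSE)

    # Fill every partition in parallel: splits() hands a partition to a worker,
    # and update() publishes the modified partition back to the darray.
    foreach(i, 1:npartitions(X), function(x = splits(X, i)) {
      x <- matrix(runif(nrow(x) * ncol(x)), nrow(x), ncol(x))
      update(x)
    })

    distributedR_shutdown()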

Figure 3 shows part of a program that calculates the distributed PageRank of a graph. At a high level, the program executes A = (M*B) + C in a distributed manner until convergence. Here M is the adjacency matrix of a large graph. Initially M is declared as an NxN sparse matrix partitioned by rows. The vector A is partitioned such that each partition has the same number of rows as the corresponding partition of M. The accompanying illustration (Figure 3) points out that each partition of A requires the corresponding (shaded) partitions of M and C, plus the whole array B. The runtime passes these partitions, and automatically reconstructs B from its partitions, before executing the body of foreach on the workers.
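In the same spirit as Figure 3 – this is a hedged reconstruction of the published example rather than the exact code in the figure – one iteration of A = (M*B) + C looks roughly like the following: each task receives its own row block of M, A, and C plus the whole of B, computes its slice, and publishes it back with update.

    # Sketch only: a reconstruction of the PageRank update step; M, A, B, and C are
    # assumed to be row-partitioned darrays declared as described above.
    foreach(i, 1:npartitions(M),
            function(a = splits(A, i),   # i-th row block of the result
                     m = splits(M, i),   # matching row block of the adjacency matrix
                     b = splits(B),      # the whole array B, reassembled by the runtime
                     c = splits(C, i)) { # matching row block of the constant term
      a <- (m %*% b) + c
      update(a)                          # publish this block of A
    })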

Our algorithms package includes distributed algorithms such as regression analysis, clustering, power-method-based PageRank, a recommendation system, and so on. For each of these applications we had to write fewer than 150 lines of code.


Figure 3: Sample Code

The point of this post is not to claim yet another system faster than Hadoop, so we’ll skip comprehensive experimental results and pretty graphs; our Eurosys 2013 and HotCloud 2012 papers have detailed performance results [1, 2]. As a data nugget, our experiments show that many algorithms in our distributed R framework are more than 20 times faster than their Hadoop counterparts.

Summary

Our framework extends R. It efficiently executes machine learning and graph algorithms on a cluster. Distributed R programs are easy to write, are scalable, and are fast.

Our aim in building a distributed R engine is not to replace Hadoop or its variants. Rather, it is a design point in the space of analytics interfaces—one that is more familiar to data scientists.

Our framework is still evolving. Today, you can use R on top of Vertica to accelerate your data mining analysis. Soon we will support in-database operations as well. Stay tuned.


[1] Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, Rob Schreiber. Eurosys 2013, Prague, Czech Republic.

[2] Using R for Iterative and Incremental Processing. Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Rob Schreiber. HotCloud 2012, Boston, USA.
