Vertica

Archive for the ‘Hadoop’ Category

HP Vertica for SQL on Hadoop


HP Vertica now offers a SQL on Hadoop license, which allows you to leverage Vertica’s powerful analytics engine to explore data in the Hadoop Distributed File System (HDFS).

This offering is licensed on a per-node, per-year term with no data volume limits.

With your SQL on Hadoop license, you get access to proven enterprise features such as:

  • Database Designer
  • Management Console
  • Workload management
  • Flex tables
  • External tables
  • Backup functionality

See our documentation on HP Vertica SQL on Hadoop for limitations.
To learn more about other HP Vertica licenses, view our Obtaining and Installing Your HP Vertica Licenses video or contact an HP Licensing center.

The Automagic Pixie

The “De-mythification” Series

Part 4: The Automagic Pixie

Au∙to∙mag∙ic: (Of a usually complicated technical or computer process) done, operating, or happening in a way that is hidden from or not understood by the user, and in that sense, apparently “magical”

[Source: Dictionary.com]

In previous installments of this series, I debunked some of the more common myths around big data analytics. In this final installment, I’ll address one of the most pervasive and costly myths: that there exists an easy button that organizations can press to automagically solve their big data problems. I’ll provide some insight into how this myth came about, and recommend strategies for dealing with the real challenges inherent in big data analytics.

Like the single-solution elf, this easy-button idea is born of many vendors’ desire to simplify their message. The big data marketplace is new enough that the distinct types of needs haven’t yet become entirely clear, which makes it tough to formulate a targeted message. Remember in the late 1990s when various web vendors were all selling “e-commerce” or “narrowcasting” or “recontextualization”? Today most people are clear on the utility of the first two, while the third is recognized for what it was at the time: unhelpful marketing fluff. I worked with a few of these firms, and watched as the businesses tried to position product for a need which hadn’t yet been well defined by the marketplace. The typical response was to keep the message simple: just push the easy button and our technology will do it for you.

I was at my second startup in 2001 (an e-commerce provider using what we would today call a SaaS model) when I encountered the unfortunate aftermath of this approach. I sat down at my desk on the first day of the job and was promptly approached by the VP of Engineering, who informed me that our largest customer was about to cancel its contract: we’d been trying to upgrade the customer for weeks, during which time their e-commerce system was down. Although they’d told the customer that the upgrade was a push-button process, it wasn’t. In fact, by the time I started, the team had begun to believe that an upgrade was impossible and that they should propose re-implementing the customer from scratch. By any standard, this was a failure.

Over the next 72 hours, I migrated the customer’s data and got them up and running. It was a Pyrrhic victory at best: the customer cancelled anyway, and the startup went out of business a few months later.

The moral of the story? No, it’s not to keep serious data geeks on staff to do automagical migrations. The lesson here is that when it comes to data-driven applications, including analytics, the “too good to be true” easy button almost always is. Today, the big data marketplace is full of great-sounding messages such as “up and running in minutes” or “data scientist in a box.”

“Push a button and deploy a big data infrastructure in minutes to grind through that ten petabytes of data sitting on your SAN!”

“Automatically derive, in mere seconds, predictive models that used to take the data science team weeks! (…and then fire the expensive data scientists!)”

Don’t these sound great?

The truth is, as usual, more nuanced. One key point I like to make with organizations is that big data analytics, like most technology practices, involves different tasks. And those tasks generally require different tools. To illustrate this for business stakeholders, I usually resort to the metaphor of building a house. We don’t build a house with just a hammer, or just a screwdriver. In fact, it requires a variety of tools, each of which is suited to a different task. A screw gun for drywall. A circular saw for cutting. A framing hammer for framing. And so on. And in the world of engineering, a house is a relatively simple thing to construct. A big data infrastructure is considerably more complex. So it’s reasonable to assume that an organization building this infrastructure would need a rich set of tools and technologies to meet the different needs.

Now that we’ve clarified this, we can get to the question behind the question. When someone asks me, “Why can’t we have an easy button to build and deploy analytics?”, what they’re really asking is, “How can I use technological advances to build and deploy analytics faster, better, and cheaper?”

Ahh, now that’s an actionable question!

In the information technology industry, we’ve been blessed (some would argue cursed) by the nature of computing. For decades now we’ve been able to count on continually increasing capacity and efficiency. So while processors continue to grow more powerful, they also consume less power. As the power requirements for a given unit of processing become low enough, it is suddenly possible to design computing devices which run on “ambient” energy from light, heat, motion, etc. This has opened up a very broad set of possibilities to instrument the world in ways never before seen – resulting in dramatic growth of machine-readable data. This data explosion has led to continued opportunity and innovation across the big data marketplace. Imagine if each year, a homebuilder could purchase a saw which could cut twice as much wood with a battery half the size. What would that mean for the homebuilder? How about the vendor of the saw? That’s roughly analogous to what we all face in big data.

And while we won’t find one “easy button”, it’s very likely that we can find a tool for a given analytic task which is significantly better than one that was built in the past. A database that operates well at petabyte scale, with performance characteristics that make it practical to use. A distributed filesystem whose economics make it a useful place to store virtually unlimited amounts of data until you need it. An engine capable of extracting machine-readable structured information from media. And so on. Once my colleagues and I have debunked the myth of the automagic pixie, we can have a productive conversation to identify the tools and technologies that map cleanly to the needs of an organization and can offer meaningful improvements in their analytical capability.

I hope readers have found this series useful. In my years in this space, I’ve learned that in order to move forward with effective technology selection, sometimes we have to begin by taking a step backward and undoing misconceptions. And there are plenty! So stay tuned.

Vertica on MapR SQL-on-Hadoop – join us in June!

We’ve been working closely with MapR Technologies to bring to market our industry-leading SQL-on-Hadoop solution, and on June 3, 2014, we will jointly deliver a live webinar featuring this solution and related use cases. To register, and to learn how you can enjoy the benefits of a SQL-on-Hadoop analytics solution that provides the highest-performing, tightly integrated platform for operational and exploratory analytics, click here.

This unified, integrated solution reduces complexity and costs by running a single cluster for both HP Vertica and Hadoop. It combines HP Vertica’s 100% ANSI SQL, high-performance Big Data analytics platform with the MapR enterprise-grade Distribution for Apache Hadoop, providing customers and partners with the highest-performing, most tightly integrated solution for operational and exploratory analytics at the lowest total cost of ownership (TCO).

This solution will also be presented live by HP Vertica and MapR executives at HP Discover on June 11, 2014. For more information, visit the HP Discover website.

In addition, a specially-optimized version of the MapR Sandbox for Hadoop is now available in the HP Vertica Marketplace. To download this and other add-ons for the HP Vertica Analytics platform, click here.

 

Distributed R for Big Data

Data scientists use sophisticated algorithms to obtain insights. However, what usually takes tens of lines of MATLAB or R code is now being rewritten in Hadoop-like systems and applied at scale in industry. Instead of rewriting algorithms in a new model, can we stretch the limits of R and reuse it for analyzing Big Data? We present our early experiences at HP Labs as we attempt to answer this question.

Consider a few use cases: product recommendations at Netflix and Amazon, PageRank calculation by search providers, financial options pricing, and detection of important people in social networks. These applications (1) process large amounts of data, (2) implement complex algorithms such as matrix decomposition and eigenvalue calculation, and (3) continuously refine their predictive models as new user ratings, Web pages, or network relations arrive. To support these applications we need systems that can scale, can easily express complex algorithms, and can handle continuous analytics.

The complex aspect refers to the observation that most of the above applications use advanced concepts such as matrix operations, graph algorithms, and so on. By continuous analytics we mean that if a programmer writes y=f(x), then y is recomputed automatically whenever x changes. Continuous analytics reduces the latency with which information is processed. For example, in recommendation systems new ratings can be quickly processed to give better suggestions. In search engines newly added Web pages can be ranked and made part of search results more quickly.

In this post we will focus on scalability and complex algorithms.

R is an open-source statistical software environment. It has millions of users, including data scientists, and more than three thousand algorithm packages. Many machine learning algorithms already exist in R, albeit for small datasets. These algorithms use matrix operations that are easily expressed and efficiently implemented in R; most algorithms can be implemented in less than a hundred lines. Therefore, we decided to extend R and determine whether we could achieve scalability in a familiar programming model.
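As a small illustration of that expressiveness, here is the power method for computing a dominant eigenvector, the core of PageRank-style computations, in a few lines of plain base R (the function and variable names are ours, for illustration):

    # Power iteration: repeatedly multiply by M and normalize until the
    # vector stops changing; the result approximates the dominant eigenvector.
    power_method <- function(M, iterations = 100, tol = 1e-9) {
      x <- rep(1 / nrow(M), nrow(M))          # start from a uniform vector
      for (i in seq_len(iterations)) {
        x_new <- M %*% x
        x_new <- x_new / sqrt(sum(x_new^2))   # normalize to unit length
        if (sqrt(sum((x_new - x)^2)) < tol) break
        x <- x_new
      }
      as.vector(x)
    }

    # Example: dominant eigenvector of a small symmetric matrix
    M <- matrix(c(2, 1, 1, 3), nrow = 2)
    print(power_method(M))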

Figure 1 is a much-simplified comparison of R and Hadoop. Hadoop can handle large volumes of data, but R can efficiently execute a wide variety of advanced analyses. At HP Labs we have developed a distributed system that extends R; its main advantages are the language semantics and the mechanisms it provides to scale R and run programs in a distributed manner.


Figure 1: Extending R for Big Data

Details

Figure 2 shows a high level diagram of how programs are executed in our distributed R framework. Users write programs using language extensions to R and then submit the code to the new runtime. The code is executed across servers in a distributed manner. Distributed R programs run on commodity hardware: from your multi-core desktop to existing Vertica clusters.


Figure 2: Architecture

Our framework adds three main language constructs to R: darray, splits, and update. A foreach construct, similar to the parallel loops found in other languages, is also provided.

For transparent scaling, we provide the abstraction of distributed arrays through darray. Distributed arrays store data across multiple machines and give programmers the flexibility to partition data by rows, columns, or blocks. Programmers write analytics code treating a distributed array as a regular array, without worrying that it is mapped to different physical machines. Array partitions can be referenced using splits and their contents modified using update. The body of the foreach loop processes array partitions in parallel, as sketched below.
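Here is a minimal sketch of these constructs in use, doubling every partition of a distributed array in parallel. The constructs (darray, splits, update, foreach) are the ones described above; the package name, npartitions, and the start/stop calls are assumptions for illustration, not confirmed API:

    library(distributedR)    # assumed package name for the framework

    distributedR_start()     # assumed: launch the distributed runtime

    # A 1,000,000 x 10 dense darray, partitioned into blocks of 100,000 rows
    A <- darray(dim = c(1e6, 10), blocks = c(1e5, 10), sparse = FALSE)

    # Each loop body runs on a worker and sees only its own partition
    foreach(i, 1:npartitions(A), function(a = splits(A, i)) {
      a <- a * 2             # treat the partition like an ordinary R matrix
      update(a)              # publish the modified partition back to the darray
    })

    distributedR_shutdown()  # assumed: stop the runtime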

Figure 3 shows part of a program that calculates the distributed PageRank of a graph. At a high level, the program executes A = (M*B) + C in a distributed manner until convergence. Here M is the adjacency matrix of a large graph. Initially, M is declared an N×N sparse matrix partitioned by rows. The vector A is partitioned such that each partition has the same number of rows as the corresponding partition of M. The accompanying illustration (Figure 3) points out that each partition of A requires the corresponding (shaded) partitions of M and C, and the whole array B. The runtime passes these partitions, and automatically reconstructs B from its partitions, before executing the body of foreach on the workers.
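In outline, the loop just described might look like the following sketch. The variable names and the convergence test are illustrative, based on the constructs introduced above rather than the exact code in Figure 3:

    # M: N x N sparse adjacency matrix, partitioned by rows.
    # A and C: vectors partitioned to match the row partitions of M.
    # B: the previous iterate, reconstructed in full on each worker.
    while (!converged) {                    # convergence check is schematic
      foreach(i, 1:npartitions(A),
              function(a = splits(A, i),    # i-th partition of the result
                       m = splits(M, i),    # corresponding rows of the graph
                       c = splits(C, i),    # matching partition of the constant
                       b = splits(B)) {     # all of B, assembled by the runtime
                a <- (m %*% b) + c          # the distributed A = (M*B) + C step
                update(a)                   # publish the recomputed partition
              })
      # copy A into B and re-evaluate the convergence test here
    }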

Our algorithms package includes distributed algorithms such as regression analysis, clustering, power-method PageRank, a recommendation system, and so on. Each of these applications required fewer than 150 lines of code.


Figure 3: Sample Code

This post does not set out to claim yet another system faster than Hadoop, so we omit comprehensive experimental results and pretty graphs; our EuroSys 2013 and HotCloud 2012 papers have detailed performance results [1, 2]. As a data nugget, our experiments show that many algorithms in our distributed R framework run more than 20 times faster than in Hadoop.

Summary

Our framework extends R to efficiently execute machine learning and graph algorithms on a cluster. Distributed R programs are easy to write, scalable, and fast.

Our aim in building a distributed R engine is not to replace Hadoop or its variants. Rather, it is a design point in the space of analytics interfaces—one that is more familiar to data scientists.

Our framework is still evolving. Today, you can use R on top of Vertica to accelerate your data mining analysis. Soon we will support in-database operations as well. Stay tuned.


[1] Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, Rob Schreiber. EuroSys 2013, Prague, Czech Republic.

[2] Using R for Iterative and Incremental Processing. Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Rob Schreiber. HotCloud 2012, Boston, USA.

GameStop CIO: Hadoop Isn’t For Everyone

GameStop Corp. is the world’s largest multichannel video game retailer, with a retail network and family of brands that includes 6,650 company-operated stores in 15 countries worldwide and online at www.GameStop.com. The network also includes www.Kongregate.com, a leading browser-based game site; Game Informer® magazine, the leading multi-platform video game publication; Spawn Labs, a streaming technology company; and a digital PC game distribution platform available at http://www.GameStop.com/PC.

As part of its effort to upgrade its analytics infrastructure to handle the massive traffic from its 21-million-member PowerUp Rewards™ loyalty program, GameStop looked at Hadoop to see whether the open-source platform would handle its Big Data requirements. Ultimately, GameStop CIO Jeff Donaldson chose the HP Vertica Analytics Platform because his engineers, trained in traditional data warehousing solutions that use the SQL programming language, would be able to transition quickly to the open, standards-based HP Vertica Analytics Platform.

Recently, GameStop was featured in an article by Clint Boulton, a reporter for the Wall Street Journal’s CIO Journal. In the article, “GameStop CIO: Hadoop Isn’t For Everyone,” Boulton and Jeff Donaldson discuss the issues with implementing Hadoop and why a high-performance analytics platform like HP Vertica may be a better path to Big Data success. According to Donaldson, “[Data management] is a hard enough of a business problem and we wanted to reduce the technology risk to near zero.”

The article can be found at the CIO Journal blog. To read the full article, you will need a subscription to the Wall Street Journal website.

Now Shipping: HP Vertica Analytics Platform 6.1 “Bulldozer” Release

We’re pleased to announce that the “Bulldozer” (v6.1) release of the HP Vertica Analytics Platform is now shipping! The 6.1 release extends v6 of the HP Vertica Analytics Platform with exciting new enhancements that are sure to delight cloud computing and Hadoop fans alike, while giving data scientists and administrators more features and functionality to get the most out of their HP Vertica investment.

Tighter Hadoop Integration

With a name like “Bulldozer,” you’d expect the release to provide some heavy Big Data lifting, and this release delivers. Many of our customers use Hadoop in the early stages of their data pipeline, especially for storing loads of raw data. But after they’ve MapReduce-massaged the data, users want to load it into HP Vertica as fast as possible so they can start their true analytics processing. HP Vertica 6.1 provides a new HDFS connector that allows you to do just that: pull data straight from HDFS with optimal parallelism and without any additional MapReduce coding. Furthermore, users who are still deciding whether to bring some of their Hadoop data into their primary analytics window can use HP Vertica’s external tables feature with the HDFS connector to run rich analytics queries and functions in situ in HDFS. They may even choose to plug in a custom parser using the User Defined Load framework and let HP Vertica do some of the ETL lifting for them. Flexibility is what it’s all about, and to learn how to use the HP Vertica Analytics Platform with Hadoop, see our newly released white paper: Make All Your Information Matter — Hadoop and HP Vertica Analytics Platform.

Simplified Cloud Deployments

We also have many customers who run HP Vertica in the cloud, and know that more and more enterprises are making the cloud their deployment model of choice. To simplify and improve the cloud deployment experience, we now have an officially qualified Amazon EC2 AMI for HP Vertica. This AMI eliminates the guesswork and manual effort involved in rolling your own AMI. And to make these AMIs even easier to administer, we’ve provided cloud scripts that simplify the installation, configuration, and deployment of HP Vertica clusters. Now creating, expanding, and contracting your HP Vertica deployments is easier than ever, enabling a more agile and elastic cloud experience.

Killer Features for Big Data Analytics

In addition to the above, there are dozens of new features and improvements in this release that address the needs of Big Data analytics deployments. From a new R language pack that gets data scientists up and running quickly, to enhanced storage tiering and archiving features that help optimize storage media spend, to new validation tools that assist administrators with hardware deployment planning and tuning, this release provides the platform needed to create an enterprise-grade Big Data environment. And, as with every release, we’ve made HP Vertica’s already incredibly fast performance even faster.

It’s easy for me to be excited about all the great new improvements in this release, but I challenge you to come see for yourself. Test drive HP Vertica 6.1 and find out how its new features can help you tackle your biggest Big Data challenges. Interested in learning more? Attend our upcoming Introduction to HP Vertica 6.1 webinar, where we’ll provide even more details about this exciting new release. We’re constantly striving to make the HP Vertica Analytics Platform the best solution for our customers’ Big Data analytics needs, and with our “Bulldozer” now out the door, we’re looking forward to helping more enterprises pave the way to data-driven business success.

 

Luis Maldonado

Director, HP Vertica

Observations from Hadoop World 2012


More than 3,000 attendees converged on the sold-out O’Reilly Strata Conference and Hadoop World 2012 in New York City to gain some clarity on arguably the biggest high-tech megatrend in recent years: Big Data.

From a 100,000-foot view, the majority of attendees, from press to developers to exhibitors to event staff, understood that we are generating a nearly incomprehensible amount of data: really Big Data. And there’s every reason to believe that this Big Data will continue to grow by orders of magnitude.

But from my conversations, attendees came to the show to understand how their organization could manage, analyze, and ultimately monetize this Big Data, and, specifically, how Hadoop could help with that effort.

As a newbie to this space, I could relate to the quizzical faces of attendees barraged with claims from vendors each positioning themselves as the next Big Data solution, but with very different offerings: everything from search engines to hosted solutions to ETL tools to even staffing resources.

Hadoop in itself comprises a uniquely named set of technologies: Hive, Sqoop, Pig, Flume, etc. Despite the unusual terminology, the Hadoop-focused sessions proved educational and featured a range of real-world case studies, including large companies (such as Facebook) using Hadoop to store and analyze impressive amounts of Big Data.

But the question still remains: is Hadoop the answer, or are there other technologies that can either complement it or serve as a better path?

As is often the case when choosing technology, the answer is “It depends on your business need.”

At HP, many of our customers used Hadoop for batch processing before ultimately adopting the HP Vertica Analytics Platform to manage and analyze their Big Data with sub-second query response times.

Other customers, particularly those using the Hadoop Connector released with HP Vertica Version 6, use the technologies together to seamlessly move data back and forth between Hadoop and HP Vertica.

Which use cases do you feel are a good fit for Hadoop and how can we provide better integration with our platform? Let us know.

We’re passionate about providing the data analytics platform that helps you obtain answers to your Big Data questions and adds some clarity in the process.
