Vertica

Archive for December, 2012

A Deeper Dive on Vertica & R

The R programming language is fast gaining popularity among data scientists for performing statistical analyses. It is extensible and has a large community of users, many of whom contribute packages that extend its capabilities. However, R is single-threaded and limited by the amount of RAM on the machine it runs on, which makes it challenging to run R programs on big data.

There are efforts under way to remedy this situation, which essentially fall into one of the following two categories:

  • Integrate R into a parallel database, or
  • Parallelize R so it can process big data

In this post, we look at Vertica’s take on integrating R into a parallel database and the two major areas that account for the performance improvement. A follow-on post will describe the second approach, parallelizing R itself.

1.) Running multiple instances of the R algorithm in parallel (query partitioned data)

The first major performance benefit of Vertica’s R implementation comes from running multiple instances of the R algorithm in parallel, with queries that chunk the data independently. In the recently launched Vertica 6.0, we added the ability to write sophisticated R programs and have them run in parallel on a cluster of machines. At a high level, Vertica threads communicate with R processes to compute results, as the diagram below shows from a parallelization perspective. The integration uses an optimized conversion from Vertica tables to R data frames, and all R processing is automatically parallelized across the Vertica servers.
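To make this concrete, here is a minimal sketch of the deployment side, assuming the R UDx library syntax introduced in Vertica 6.0; the file path and the library, factory, and function names are all hypothetical. An R source file is loaded into the database as a library, and the factory function it exports is registered as a SQL-callable transform function:

-- Load an R source file into the cluster as a UDx library (path is hypothetical)
CREATE LIBRARY rLib AS '/home/dbadmin/kmeans_udx.R' LANGUAGE 'R';

-- Register the transform function exported by the library's factory function
CREATE TRANSFORM FUNCTION kmeans_cluster AS LANGUAGE 'R' NAME 'kmeansFactory' LIBRARY rLib;

Once registered, the function is invoked with the same OVER clause used by built-in analytic functions, as discussed below.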

The parallelism comes from processing independent chunks of data simultaneously (referred to as data parallelism). SQL, being a declarative language with well-defined semantics, allows database query optimizers to figure out the order of operations, as well as which of them can be done in parallel. For example, consider the following query that computes the average sales figure for each month:

SELECT month, avg(qty*price) FROM sales GROUP BY month;

The semantics of the GROUP BY operation are such that the average sales of a particular month are independent of the average sales of a different month, which allows the database to compute the averages for different months in parallel. Similarly, the SQL-99 standard defines analytic functions (also referred to as window functions); these functions operate on a sliding window of rows and can be used to compute moving averages, percentiles, and so on. For example, the following query assigns student test scores to quartiles within each grade:

SELECT name, grade, score, NTILE(4) OVER (PARTITION BY grade ORDER BY score DESC) FROM test_scores;

   name        grade   score   ntile
   Tigger      1       98      1
   Winnie      1       89      1
   Rabbit      1       78      2
   Roo         1       67      2
   Piglet      1       56      3
   Owl         1       54      3
   Eeyore      1       45      4
   Batman      2       98      1
   Ironman     2       95      1
   Spiderman   2       75      2
   Heman       2       56      2
   Superman    2       54      3
   Hulk        2       43      4

Again, the semantics of the OVER clause in window functions allow the database to compute the quartiles for each grade in parallel, since they are independent of one another. Instead of inventing yet another syntax to perform R computations inside the database, as some of our competitors have done, we decided to leverage the OVER clause, since it is a familiar and natural way to express data-parallel computations. A prior blog post shows how easy it is to create, deploy, and use R functions on Vertica.
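As an illustration, here is a hedged sketch of invoking the hypothetical R transform function registered above. Each PARTITION BY group forms an independent chunk that Vertica can hand to a separate R process, so the per-region computations run in parallel:

-- Hypothetical: cluster points within each region; regions are processed in parallel
SELECT kmeans_cluster(x, y) OVER (PARTITION BY region) FROM points;

Because each region’s rows form an independent chunk, adding nodes to the cluster increases the number of R processes that can work simultaneously.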

Listed below is an example comparing R over ODBC with Vertica’s R implementation in the UDx framework.

Looking at the chart above, as your data volumes increase, Vertica’s implementation using the UDx framework scales much better than the ODBC approach. Note: the numbers indicated on the chart should be used only for relative comparison, since this is not a formal benchmark.

2.) Leveraging column-store technology for optimized data exchange (query non-partitioned data)

It is important to note that even for non-data-parallel tasks (functions that operate on one big chunk of non-partitioned data), Vertica’s implementation provides better performance, since the computation runs on the server instead of the client and the data flow between the database and R is optimized (there is no need to parse the data again).
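As a sketch (the function and table names are hypothetical), such a task can be expressed with an empty OVER() clause, which sends all rows to a single R process. There is no data parallelism here, but the computation still runs server-side and benefits from the optimized column-to-data-frame transfer rather than pulling every row to a client over ODBC:

-- Hypothetical: fit a model over the entire table as one non-partitioned chunk
SELECT fit_arima(ts, reading) OVER () FROM sensor_readings;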

The other major benefit of Vertica’s R integration comes from the UDx framework’s avoidance of ODBC and from the efficiencies of Vertica’s column store. Here are some examples showing how much more efficient Vertica’s R integration is compared to a typical ODBC approach for a query over non-partitioned data.

As the chart above indicates, performance improvements are also achieved by optimizing the data transfers between Vertica and R. Since Vertica is a column store and R is vector-based, data can be moved from a Vertica column to an R vector very efficiently, in very large blocks. Note: the numbers indicated on the chart should be used only for relative comparison, since this is not a formal benchmark.

This post focused on performance and on R algorithms that are amenable to data-parallel solutions. A following post will discuss our approach to parallelizing R itself for problems that are not data parallel, such as building a single decision tree over the entire data set.

For more details on how to implement R in Vertica, please see the following blog post: http://www.vertica.com/2012/10/02/how-to-implement-r-in-vertica/

The growth of Big Data, the demand for Data Scientists, and the power of Community

There was an interesting article in CIO last week, IT Departments Battle for Data Analytics Talent, which argues (along with a related McKinsey report) that by 2018 the US will be facing a massive shortage of analytics talent:

By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

On a more personal note, I attended a holiday party this weekend where a parent was relating to me how their pre-college-age son was being advised to pursue ‘Data Scientist’ as a course of study because it is ‘hot’ (while asking me what exactly a ‘Data Scientist’ does).

But is the solution really just to throw more people at the problem? More importantly, is harnessing and leveraging Big Data really a labor problem, a technology problem, or a community problem?

At HP Vertica, we believe that the Big Data challenge will be met – and while we agree that Data Scientist will indeed be one of the hottest (if not sexiest) jobs of the 21st century, we are also confident that the power of community will allow companies to leverage technology to compensate for the demand for labor. Consequently, we have been making significant investments in the MyVertica community and have big plans in store for 2013.

Our friends at the Community Roundtable have put together a terrific set of materials around what it takes to build an active, engaged community – which aligns very well with our efforts to ‘socialize’ our organization.

Watch for much more in the year ahead, and if you’re not already a member of MyVertica, sign up today!

GameStop CIO: Hadoop Isn’t For Everyone

GameStop Corp. is the world’s largest multichannel video game retailer, with a retail network and family of brands that includes 6,650 company-operated stores in 15 countries worldwide and online at www.GameStop.com. The network also includes:  www.Kongregate.com, a leading browser-based game site; Game Informer® magazine, the leading multi-platform video game publication; Spawn Labs, a streaming technology company; and a digital PC game distribution platform available at http://www.GameStop.com/PC.

As part of their efforts to upgrade their analytics infrastructure to handle the massive traffic from their 21 million member PowerUp Rewards™ loyalty membership program, GameStop looked at Hadoop to see if the open source platform would handle their Big Data requirements.  Ultimately, GameStop CIO Jeff Donaldson chose the HP Vertica Analytics Platform because his engineers, who are trained in working with traditional data warehousing solutions that use the SQL programming language, would be able to quickly transition to the open, standards-based HP Vertica Analytics Platform.

Recently, GameStop was featured in an article by Clint Boulton, reporter for the Wall Street Journal’s CIO Journal.  In the article, “GameStop CIO: Hadoop Isn’t For Everyone,” Clint and Jeff Donaldson discuss the issues with implementing Hadoop and why a high-performance analytics platform like HP Vertica may be a better solution for Big Data success than Hadoop.  According to Jeff Donaldson, “[Data management] is a hard enough of a business problem and we wanted to reduce the technology risk to near zero.”

The article can be found at the CIO Journal blog.  To read the full article, you will need to be a member of the Wall Street Journal web site.

Big Data, Information Optimization, and Bulldozers at HP Discover

Nearly 10,000 HP customers, partners, prospects, and employees met in Frankfurt, Germany for HP Discover, which was abuzz with major announcements (including “Bulldozer,” a.k.a. HP Vertica Analytics Platform 6.1) and spirited discussions around Big Data and Information Optimization.

The blogosphere, news feeds, and headlines are dominated by the challenges and perceived virtues of Big Data. But is it all just hype, and how can companies really monetize, and avoid getting swallowed up by, all this Big Data? And what does Information Optimization mean, and how are Big Data and Information Optimization related, if at all?

The HP Vertica team played a central role in answering these very questions throughout the conference in sessions, press and media briefings, CIO tours, blogging sessions, news video interviews, and even livestream Twitter chats.

Opening Session on Information Optimization

To kick off the conference, Colin Mahony joined panelists Professor Andrew McAfee of the MIT Sloan School of Management, John Sontag of HP Labs, and Paul Miller of HP Enterprise Group to offer their perspectives on the potential impact of Big Data on every organization — even going so far as to propose that businesses that avoid Big Data (or, at a minimum, do not implement a strategy) risk being driven to extinction by more nimble competitors.

June Manley of HP Software also shared some interesting results from an independent study conducted by Coleman Parkes in October 2012 with senior business executives and senior technology executives:

  • 84% of executives said that they DO NOT have the right information at the point of need that enables them to obtain actionable insight that drives the right business outcome
  • Only 10% of executives said their organization currently incorporates unstructured data into their enterprise insights, processes, and strategy

Enter Information Optimization, enabling enterprises to harness the power of Big Data by storing, managing, understanding and acting upon the variety, velocity, and volume of organizational data to drive maximum Return on Information. In summary, it’s the HP solution that enables enterprises to monetize all of their Big Data.

Professor McAfee concluded the session by challenging the audience to avoid listening to HIPPOs (Highest Paid Person’s Opinion) and to learn from how companies and organizations are using Big Data to accelerate drug discovery, optimize airline seating and scheduling, and even accurately predict the 2012 presidential election.

Livestream Twitter Chat on Big Data

HP also piloted its first livestream Twitter chat on Big Data. Led by Paul Muller of HP Software, the leading question was: How can you harness the true power of business data to stay competitive while remaining compliant?

I joined Randy Cairns and Brian Weiss of HP Autonomy in providing our insights and answering a steady flow of questions from bloggers and Twitter followers (#infoopt). Topics covered how organizations can stay compliant and derive value from the unfathomable amount of unstructured data, the security concerns introduced by such broad access to Big Data, and which data (structured, unstructured, semistructured) to retain and use and which to confidently discard.

For our responses and perspectives, see the archived video chat.

Announcing “Bulldozer” HP Vertica Analytics Platform 6.1

We rolled out our “Bulldozer” HP Vertica Analytics Platform 6.1 at the show to much fanfare and interest from companies in need of maximum speed, performance, and scale to power their Big Data analytics initiatives. For an overview, see Luis Maldonado’s blog post as well as Colin Mahony’s video interview with Yahoo!

Those are our perspectives, but we always learn the most from you, our community. So, we welcome your thoughts and feedback.

Now Shipping: HP Vertica Analytics Platform 6.1 “Bulldozer” Release

We’re pleased to announce that the “Bulldozer” (v6.1) release of the HP Vertica Analytics Platform is now shipping! The 6.1 release extends v6 of the HP Vertica Analytics Platform with exciting new enhancements that are sure to delight cloud computing and Hadoop fans alike, while giving data scientists and administrators more features and functionality to get the most out of their HP Vertica investment.

Tighter Hadoop Integration

With a name like “Bulldozer,” you’d expect the release to provide some heavy, big data lifting, and this release delivers. Many of our customers use Hadoop in early stages of their data pipeline, especially for storing loads of raw data. But after they’ve MapReduce-massaged the data, users want to load it into HP Vertica as fast as possible so they can start their true analytics processing. HP Vertica 6.1 provides a new HDFS connector that allows you to do just that: pull data straight from HDFS with optimal parallelism and without any additional MapReduce coding. Furthermore, users who are still deciding whether to bring some of their Hadoop data into their primary analytics window can use HP Vertica’s external tables feature with the HDFS connector to run rich analytics queries and functions in situ in HDFS. They may even choose to plug in a custom parser using the User Defined Load framework and let HP Vertica do some of the ETL lifting for them. Flexibility is what it’s all about, and to learn how to use the HP Vertica Analytics Platform with Hadoop, see our newly released white paper: Make All Your Information Matter — Hadoop and HP Vertica Analytics Platform.
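As a rough sketch of how this looks in SQL (the host, path, and table names are hypothetical, and the exact connector parameters may differ in your deployment), you can either copy the data into a native table or define an external table over it:

-- Pull data straight from HDFS into a native HP Vertica table
COPY raw_clicks SOURCE Hdfs(url='http://namenode:50070/webhdfs/v1/data/clicks/*', username='hadoop');

-- Or leave the data in HDFS and query it in place through an external table
CREATE EXTERNAL TABLE hdfs_clicks (user_id INT, ts TIMESTAMP, url VARCHAR)
AS COPY SOURCE Hdfs(url='http://namenode:50070/webhdfs/v1/data/clicks/*', username='hadoop');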

Simplified Cloud Deployments

We also have many customers who run HP Vertica in the cloud, and know that more and more enterprises are making the cloud their deployment model of choice. To simplify and improve the cloud deployment experience, we now have an officially qualified Amazon EC2 AMI for HP Vertica. This AMI eliminates the guesswork and manual effort involved in rolling your own AMI. And to make these AMIs even easier to administer, we’ve provided cloud scripts that simplify the installation, configuration, and deployment of HP Vertica clusters. Now creating, expanding, and contracting your HP Vertica deployments is easier than ever, enabling a more agile and elastic cloud experience.

Killer Features for Big Data Analytics

In addition to the above, there are dozens of new features and improvements in this release that address the needs of Big Data analytics deployments. From a new R language pack that gets data scientists up and running quickly, to enhanced storage tiering and archiving features that help optimize storage media spend, to new validation tools that assist administrators with hardware deployment planning and tuning, this release provides the platform needed to create an enterprise-grade Big Data environment. And, as with every release, we’ve made HP Vertica’s already incredibly fast performance even faster.

It’s easy for me to be excited about all the great new improvements in this release, but I challenge you to come see for yourself. Test drive HP Vertica 6.1 and find out how its new features can help you tackle your biggest Big Data challenges. Interested in learning more? Attend our upcoming Introduction to HP Vertica 6.1 webinar, where we’ll provide even more details about this exciting new release. We’re constantly striving to make the HP Vertica Analytics Platform the best solution for our customers’ Big Data analytics needs, and with our “Bulldozer” now out the door, we’re looking forward to helping more enterprises pave the way to data-driven business success.

Luis Maldonado

Director, HP Vertica

Vertica Inside

Nearly everyone inside of high tech is familiar with Intel Inside. Even those outside of high tech are familiar with its jingle, heard in living rooms around the world during televised events.

But did you know that the HP Vertica Analytics Platform can also be found “inside” or embedded into a growing number of software solutions as the real-time analytics engine?

At only 80 megabytes and with a standard SQL engine, the HP Vertica Analytics Platform takes only two minutes to install and just a day or two for you to compare its advantages over OLTP databases. It scales up and down with ease, running on anything from a shared single-node appliance to clusters of hundreds of servers, both on-premise and in the cloud. With standard drivers that help it fit right in with your application, and reference architectures with all the major ETL and BI vendors, it offers the flexibility to align with any deployment, licensing, and pricing model.

But what kind of results can you expect?

On the infrastructure side, OEM partners have seen as much as 1,000x query and 100x load performance improvements. They have been able to store more detailed data in the same hardware footprint, take on customers with higher data volumes and rates than ever before, and give their customers more real-time, interactive, and ad-hoc access to the data. On the analytics end, they can extend HP Vertica’s built-in analytics with their own algorithms or take advantage of our platform’s SQL-99 extensions and integration points into R and SAS.

And what about the business benefits?

OEM partners ultimately choose the HP Vertica Analytics Platform so that they can:

  • Address the needs of larger customers, particularly when they run into scalability issues with their current database
  • Improve the customer experience with much more interactive response times
  • Offer new capabilities to their customers, such as ad-hoc query access (which is often restricted because good performance could not be guaranteed with their former database)
  • Lower administration costs around the database for both the OEM partner and their customers (HP Vertica is essentially zero administration after deployment – end customers don’t even need to know it’s there!)

Get started today

Sign up for a 30-day evaluation license of HP Vertica Analytics Platform today, and let’s talk about how the HP Vertica Analytics Platform has everything you need to enhance your software solution with real-time analytics—except the jingle.
