
Posts Tagged ‘Vertica’

Taking a Moonshot at Big Data Analytics for Everyone

HP Vertica is very excited about Monday’s announcement of the HP Moonshot system.

Why? Because we believe that the combination of the HP Vertica Analytics Platform running on the HP Moonshot Servers offers a truly game-changing value proposition for a variety of customers, and new segments of the market.

Moonshot is, simply put, a groundbreaking system that lets customers rapidly deploy, scale and manage their infrastructure with dramatically lower space and energy requirements. Traditional IT services that support business functions will continue to be served by general-purpose server infrastructure, but specialized workloads need a new computing platform that can deliver innovative solutions to market at unprecedented speed and scale.

 

We’ve already successfully tested the HP Vertica Analytics Platform on HP Moonshot Servers and achieved performance comparable to traditional Big Data Analytics hardware across certain performance ranges. For a large segment of the market, that is more than sufficient to handle Big Data Analytics workloads, while offering potentially significant cost, space and energy savings.

Running Vertica on Moonshot offers yet another proof point of the unmatched value provided by HP’s combination of Information Optimization solutions, and a great example of the opportunity created by innovation that makes us so excited to be a part of the greater OneHP.

To learn more about HP Project Moonshot, visit http://www.hp.com/go/moonshot

A Deeper Dive on Vertica & R

The R programming language is fast gaining popularity among data scientists for performing statistical analyses. It is extensible and has a large community of users, many of whom contribute packages to extend its capabilities. However, it is single-threaded and limited by the amount of RAM on the machine it is running on, which makes it challenging to run R programs on big data.

There are efforts under way to remedy this situation, which essentially fall into one of the following two categories:

  • Integrate R into a parallel database, or
  • Parallelize R so it can process big data

In this post, we look at Vertica’s take on “Integrating R into a parallel database” and the two major areas that account for the performance improvement. A follow-up post will describe the second approach, parallelizing R.

1.)    Running multiple instances of the R algorithm in parallel (query partitioned data)

The first major performance benefit of Vertica’s R implementation comes from running multiple instances of the R algorithm in parallel, with queries that split the data into independent chunks. In the recently launched Vertica 6.0, we added the ability to write sophisticated R programs and have them run in parallel on a cluster of machines. At a high level, Vertica threads communicate with R processes to compute results. The integration uses optimized data conversion from Vertica tables to R data frames, and all ‘R’ processing is automatically parallelized across Vertica servers. The diagram below shows how the Vertica R integration has been implemented from a parallelization perspective.

The parallelism comes from processing independent chunks of data simultaneously (referred to as data parallelism).   SQL, being a declarative language, allows database query optimizers to figure out the order of operations, as well as which of them can be done in parallel, due to the well-defined semantics of the language. For example, consider the following query that computes the average sales figures for each month:

SELECT month, avg(qty*price) FROM sales GROUP BY month;

The semantics of the GROUP BY operation are such that the average sales of a particular month are independent of the average sales of a different month, which allows the database to compute the average for different months in parallel.   Similarly, the SQL-99 standard defines analytic functions (also referred to as window functions) – these functions operate on a sliding window of rows and can be used to compute moving averages, percentiles etc. For example, the following query assigns student test scores into quartiles for each grade:

SELECT name, grade, score, NTILE(4) OVER (PARTITION BY grade ORDER BY score DESC) FROM test_scores;

 name       grade  score  ntile
 Tigger         1     98      1
 Winnie         1     89      1
 Rabbit         1     78      2
 Roo            1     67      2
 Piglet         1     56      3
 Owl            1     54      3
 Eeyore         1     45      4
 Batman         2     98      1
 Ironman        2     95      1
 Spiderman      2     75      2
 Heman          2     56      2
 Superman       2     54      3
 Hulk           2     43      4

 

Again, the semantics of the OVER clause in window functions allows the database to compute the quartiles for each grade in parallel, since they are independent of one another.   Unlike some of our competitors, instead of inventing yet another syntax to perform R computations inside the database, we decided to leverage the OVER clause, since it is a familiar and natural way to express data parallel computations.  A prior blog post shows how easy it is to create, deploy and use R functions on Vertica.
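To make the idea of data parallelism concrete, here is a minimal Python sketch (not Vertica code, and not how Vertica is implemented internally) that mimics the NTILE query above. It splits the rows by grade and hands each partition to a separate worker, which is possible only because no partition needs to see another partition's rows. The function names `ntile` and `ntile_by_partition` are hypothetical helpers invented for this illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Sample rows from the test_scores example above: (name, grade, score).
TEST_SCORES = [
    ("Tigger", 1, 98), ("Winnie", 1, 89), ("Rabbit", 1, 78), ("Roo", 1, 67),
    ("Piglet", 1, 56), ("Owl", 1, 54), ("Eeyore", 1, 45),
    ("Batman", 2, 98), ("Ironman", 2, 95), ("Spiderman", 2, 75),
    ("Heman", 2, 56), ("Superman", 2, 54), ("Hulk", 2, 43),
]


def ntile(rows, buckets):
    """Mimic NTILE(buckets) OVER (ORDER BY score DESC) for ONE partition.

    Rows are split into `buckets` groups that differ in size by at most one;
    the larger groups come first, as in standard SQL.
    """
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    base, extra = divmod(len(ordered), buckets)
    result = []
    start = 0
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        for row in ordered[start:start + size]:
            result.append(row + (bucket,))
        start += size
    return result


def ntile_by_partition(rows, buckets=4):
    """Mimic PARTITION BY grade: process each grade independently."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1]].append(row)
    grades = sorted(partitions)
    # Each partition goes to its own worker. Threads are used here only to
    # show that the partitions are independent units of work.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda g: ntile(partitions[g], buckets), grades)
    return [row for part in results for row in part]


if __name__ == "__main__":
    for name, grade, score, tile in ntile_by_partition(TEST_SCORES):
        print(f"{name:<10} {grade:>5} {score:>5} {tile:>5}")
```

The important design point is that `ntile` only ever receives the rows of a single grade, so the workers never need to coordinate. A database can exploit the same property of the OVER (PARTITION BY ...) clause by sending different partitions to different threads or servers.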

 

Below is an example comparing R over ODBC with Vertica’s R implementation using the UDX framework.

As the chart above shows, Vertica’s implementation using the UDX framework scales much better than the ODBC approach as data volumes increase. Note: Numbers indicated on the chart should only be used for relative comparisons since this is not a formal benchmark.

 

2.)    Leveraging column-store technology for optimized data exchange (query non-partitioned data).

It is important to note that even for non-data-parallel tasks (functions that operate on input that is essentially one big chunk of non-partitioned data), Vertica’s implementation provides better performance, since computation runs on the server instead of the client and we have optimized the data flow between the database and R (there is no need to parse the data again).

The other major benefit of Vertica’s R integration comes from the UDX framework, which avoids ODBC, and from the efficiencies of Vertica’s column store. Here are some examples showing how much more efficient Vertica’s integration with ‘R’ is than a typical ODBC approach for a query on non-partitioned data.

As the chart above indicates, performance improvements are also achieved by optimizing the data transfers between Vertica and R. Since Vertica is a column store and R is vector-based, it is very efficient to move data from a Vertica column to R vectors in very large blocks. Note: Numbers indicated on the chart should only be used for relative comparisons since this is not a formal benchmark.
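The following toy Python sketch illustrates why column-to-vector transfer can be cheaper than a row-at-a-time text transfer. It is only an analogy: the encodings shown are assumptions made up for the illustration and do not represent Vertica's or ODBC's actual wire formats, and the sample values and function names are hypothetical.

```python
from array import array

# Hypothetical sample data: two columns of a small sales table.
QTY = [2.0, 5.0, 1.0, 4.0]
PRICE = [9.5, 3.0, 20.0, 7.25]


def rows_as_text(qty, price):
    """Row-at-a-time, text-encoded transfer: one string per row."""
    return [f"{q},{p}" for q, p in zip(qty, price)]


def parse_text_rows(lines):
    """Receiver side: parse every field of every row, then pivot the
    values into one vector per column."""
    qty, price = array("d"), array("d")
    for line in lines:
        q, p = line.split(",")
        qty.append(float(q))
        price.append(float(p))
    return qty, price


def column_as_block(values):
    """Column-at-a-time transfer: the whole column as one binary block."""
    return array("d", values).tobytes()


def block_to_vector(block):
    """Receiver side: a single bulk copy, with no per-value parsing."""
    vec = array("d")
    vec.frombytes(block)
    return vec


if __name__ == "__main__":
    # Row path: per-row, per-field work on the receiving side.
    qty_rows, price_rows = parse_text_rows(rows_as_text(QTY, PRICE))
    # Column path: each column arrives already shaped like a vector.
    qty_col = block_to_vector(column_as_block(QTY))
    price_col = block_to_vector(column_as_block(PRICE))
    print(list(qty_rows) == list(qty_col), list(price_rows) == list(price_col))
```

Both paths end with the same vectors, but the row path does work proportional to the number of fields, while the column path does one bulk copy per column. That is the intuition behind pairing a column store with a vector-based language.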

This post focused on performance and on ‘R’ algorithms that are amenable to data-parallel solutions. A following post will cover our approach to the second category, “Parallelize R”, for problems that are not amenable to data-parallel solutions, such as building a single decision tree over the whole data set.

For more details on how to implement R in Vertica, see the following post: http://www.vertica.com/2012/10/02/how-to-implement-r-in-vertica/

HP Vertica and Tableau Software Customers Speak Out in Philadelphia

It was my distinct pleasure this week to participate in a joint customer roundtable at the Cira Center in Philadelphia, co-sponsored by HP Vertica and our partner Tableau Software, and featuring a number of our respective and joint customers speaking out on topics related to Big Data.

Our panelists, who did a terrific job interacting with an audience of more than 50 of their peers, included David Baker of IMS Health, George Chalissery of hMetrix, Amit Garg of Comcast, Seth Madison of Compete.com and Elizabeth Worster of State Street Global Advisors.

The discussion centered on five themes related to Big Data, listed here with unattributed comments from the panelists:

  • Democratizing data – all of our panelists discussed the value of giving business users the ability to understand data and make ad hoc requests themselves, as well as extending some of those capabilities outside the walls of the enterprise. Audience members raised a number of concerns and questions about how to handle security when democratizing data, which our panelists addressed. “Self-reliance really sings to me.” “We have internal and external users – and increasingly the external users are our clients”.
  • Getting more productivity out of small teams – related to the previous point, data analyst teams are generally small and their time must be leveraged; they should not have to spend time on repetitive tasks. “Once you start delivering, are on the hook to do it constantly.” “Can’t do anything predictive if you’re reactive all the time.” “You can’t just rely on databases – you do need people.”
  • Extracting meaning from data – panelists repeatedly spoke of the need for first class dashboards – and for those dashboards to be flexible and fast (a primary benefit of our combined Vertica / Tableau solution). “People are more willing to experiment and run what-if scenarios with flexible dashboards” “Your data’s growing, but users want answers faster.”
    • One particularly interesting and notable comment from a Vertica customer - “Results are delivered so fast that I don’t believe it – this can’t be real.” (it is)
  • New capabilities – There was a great deal of discussion of enablement of new organizational capabilities as Big Data gets under control and becomes more available. “People are more willing to experiment because time to load and query data is orders of magnitude better” “When you change the network ecosystem, you can create new offerings and new value for customers” “Having intermediate data helps with disaster recovery and provides redundancy” “I don’t think I’m doing complex things, but then people tell me I am doing very complex things”
  • Time to value – Speed continued to be a theme – both in analyzing Big Data and creating organizational value – “We can answer questions much more quickly and get new data-oriented products into the pipeline for revenue.”, “I don’t need to talk to my manager or IT – I can answer that question right now.”, “You give people a taste of this stuff, and they just want you to do more and more and more”
Overall it was an outstanding event, and we plan to do more partner-related activities with our Business Intelligence and other partners, including the Tableau Customer Conference in early November. We hope to see you at a future event!

Setting the Record Straight on Column Stores (Again?!)

A couple of months ago I went to SIGMOD 2012. One of the big award winners there was Bruce Lindsay (IBM Fellow Emeritus), a true patriarch of relational databases. (System R; enough said!)

I was somehow drawn to him before I figured out his name, and before I learned that he was an award winner.  Maybe it was the hairdo and mannerisms.

Or maybe it was how he asked the presenters of the paper on “MCJoin” something along the lines of  “So, I’ve written a few join algorithms in my day and one of the things that set me back a few months each time was OUTER JOINs”.  Which, in my day, set me back a few months.

Back to the awards. Each recipient gave a talk. Bruce gave a very interesting presentation covering RDBMS, how it built up to something useful over the years, and then considered whether we are “losing our way”. I was a bit surprised that he listed “column stores” as a “detour” on the path of RDBMS progress. This is his slide (and, as you view it, try to imagine someone in the row in front of you cackling about how Mike Stonebraker would react to it…):
