Archive for the ‘third-party tools integration’ Category

Connecting HP Vertica 7.x to Tableau Desktop 8.2

Connecting HP Vertica to Tableau Desktop from Vertica Systems on Vimeo.
Have you ever wanted to visualize the data in your HP Vertica Analytics Platform with graphs, tables, maps, or other formats? Tableau Desktop, the visualization tool from Tableau Software, lets you do just that in a couple of steps. Use Tableau Desktop's HP Vertica-specific ODBC driver connector to access your data in HP Vertica and create different views for visual analysis. Watch this video to learn how to connect HP Vertica to Tableau Desktop using both the HP Vertica-specific ODBC driver connector and the generic connector.

Distributed R for Big Data

Data scientists use sophisticated algorithms to obtain insights. However, what usually takes tens of lines of MATLAB or R code is now being rewritten in Hadoop-like systems and applied at scale in industry. Instead of rewriting algorithms in a new model, can we stretch the limits of R and reuse it for analyzing Big Data? We present our early experiences at HP Labs as we attempt to answer this question.

Consider a few use cases: product recommendations in Netflix and Amazon, PageRank calculation by search providers, financial options pricing, and detection of important people in social networks. These applications (1) process large amounts of data, (2) implement complex algorithms such as matrix decomposition and eigenvalue calculation, and (3) continuously refine their predictive models on the arrival of new user ratings, Web pages, or relations in the network. To support these applications we need systems that can scale, can easily express complex algorithms, and can handle continuous analytics.

The complex aspect refers to the observation that most of the above applications use advanced concepts such as matrix operations, graph algorithms, and so on. By continuous analytics we mean that if a programmer writes y=f(x), then y is recomputed automatically whenever x changes. Continuous analytics reduces the latency with which information is processed. For example, in recommendation systems new ratings can be quickly processed to give better suggestions. In search engines newly added Web pages can be ranked and made part of search results more quickly.

In this post we will focus on scalability and complex algorithms.

R is open source statistical software. It has millions of users, including data scientists, and more than three thousand algorithm packages. Many machine learning algorithms already exist in R, albeit for small datasets. These algorithms use matrix operations that are easily expressed and efficiently implemented in R. Most algorithms can be implemented in less than a hundred lines. Therefore, we decided to extend R and determine if we can achieve scalability in a familiar programming model.
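To make the "easily expressed in R" point concrete, here is a small, generic illustration (not code from this post): the power method behind PageRank written as plain, single-machine R, using just a few matrix operations.

```r
# Plain single-machine R: power-method PageRank in a handful of lines.
# M is assumed to be a column-stochastic adjacency matrix of the graph.
pagerank <- function(M, d = 0.85, iters = 50) {
  n <- nrow(M)
  r <- rep(1 / n, n)                    # start with uniform ranks
  for (i in 1:iters) {
    r <- d * (M %*% r) + (1 - d) / n    # one power-method step
  }
  as.vector(r)
}
```

This works fine while the graph fits in one machine's memory; the rest of this post is about keeping the same style of code when it does not.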

Figure 1 is a very simplified view that compares R and Hadoop. Hadoop can handle large volumes of data, but R can efficiently execute a variety of advanced analyses. At HP Labs we have developed a distributed system that extends R. The main advantages are the language semantics, and the mechanisms to scale R and run programs in a distributed manner.

Figure 1: Extending R for Big Data

Details

Figure 2 shows a high-level diagram of how programs are executed in our distributed R framework. Users write programs using our language extensions to R and submit the code to the new runtime, which executes it across servers in a distributed manner. Distributed R programs run on commodity hardware: from your multi-core desktop to existing Vertica clusters.

Figure 2: Architecture

Our framework adds three main language constructs to R: darray, splits, and update. A foreach construct is also present. It is similar to parallel loops found in other languages.

For transparent scaling, we provide the abstraction of distributed arrays, darray. Distributed arrays store data across multiple machines and give programmers the flexibility to partition data by rows, columns, or blocks. Programmers write analytics code treating the distributed array as a regular array, without worrying that it is mapped to different physical machines. Array partitions can be referenced using splits and their contents modified using update. The body of a foreach loop processes array partitions in parallel.
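Here is a minimal sketch of how these constructs fit together. This is illustrative, not code from the post: the constructor and helper names follow the text and the distributedR package, but exact signatures may differ.

```r
# Illustrative sketch only: create a row-partitioned distributed array and
# update its partitions in parallel using the constructs described above.
library(distributedR)
distributedR_start()

# 1000 x 1000 darray split into 250-row blocks placed across workers
A <- darray(dim = c(1000, 1000), blocks = c(250, 1000), data = 0)

# Each loop body receives one partition via splits() and writes it back
# with update(); inside the body the partition is an ordinary R matrix.
foreach(i, 1:npartitions(A), function(a = splits(A, i), idx = i) {
  a[, ] <- idx      # fill partition idx with its index, as a toy computation
  update(a)         # propagate the modified partition back to the darray
})

distributedR_shutdown()
```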

Figure 3 shows part of a program that calculates distributed PageRank of a graph. At a high level, the program executes A = (M*B)+C in a distributed manner until convergence. Here M is the adjacency matrix of a large graph. Initially, M is declared an NxN sparse matrix partitioned by rows. The vector A is partitioned such that each partition has the same number of rows as the corresponding partition of M. The accompanying illustration (Figure 3) points out that each partition of A requires the corresponding (shaded) partitions of M and C, as well as the whole array B. The runtime passes these partitions and automatically reconstructs B from its partitions before executing the body of foreach on the workers.

Our algorithms package has distributed algorithms such as regression analysis, clustering, power-method-based PageRank, a recommendation system, and so on. For each of these applications we had to write fewer than 150 lines of code.

Figure 3: Sample Code
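Since the sample code in Figure 3 is only available as an image, here is a rough, hypothetical sketch of the kind of loop it shows, using the constructs described above; the exact names and signatures may differ from the real program.

```r
# Hypothetical sketch in the spirit of Figure 3, not the actual figure code.
# Assumes M (N x N sparse adjacency matrix, partitioned by rows) and the
# darrays A, B, C have already been declared as described in the text.
foreach(i, 1:npartitions(M),
        function(m = splits(M, i),    # i-th row block of the adjacency matrix
                 a = splits(A, i),    # matching partition of the result vector
                 b = splits(B),       # all of B; the runtime reassembles it
                 c = splits(C, i)) {  # matching partition of the constant vector
  a <- (m %*% b) + c                  # A = (M * B) + C, one partition at a time
  update(a)                           # write the updated partition back
})
# The surrounding program repeats this step, feeding the new A back in as B,
# until the ranks converge.
```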

The point of this post is not to claim yet another system that is faster than Hadoop, so we exclude comprehensive experimental results and pretty graphs. Our EuroSys 2013 and HotCloud 2012 papers have detailed performance results [1, 2]. As a data nugget, our experiments show that many algorithms in our distributed R framework are more than 20 times faster than Hadoop.

Summary

Our framework extends R. It efficiently executes machine learning and graph algorithms on a cluster. Distributed R programs are easy to write, are scalable, and are fast.

Our aim in building a distributed R engine is not to replace Hadoop or its variants. Rather, it is a design point in the space of analytics interfaces—one that is more familiar to data scientists.

Our framework is still evolving. Today, you can use R on top of Vertica to accelerate your data mining analysis. Soon we will support in-database operations as well. Stay tuned.


[1] Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, Rob Schreiber. EuroSys 2013, Prague, Czech Republic.

[2] Using R for Iterative and Incremental Processing. Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Rob Schreiber. HotCloud 2012, Boston, USA.

Vertica Moneyball and ‘R’. The perfect team!

Back in April, Colin's blog, "Moneyball – not just for baseball anymore," was a good example of how statistics can be used to make better decisions on and off the baseball field. New measures can be created to better understand a player's real contribution to a team. For instance, most baseball players are familiar with the popular earned run average (ERA) measure for pitchers, but a new one that is becoming more popular is called WHIP (Walks plus Hits per Inning Pitched).

Here is how Wikipedia describes WHIP: While earned run average (ERA) measures the runs a pitcher gives up, WHIP more directly measures a pitcher’s effectiveness against the batters faced. It is calculated by adding the number of walks and hits allowed and dividing this sum by the number of innings pitched; therefore, the lower a pitcher’s WHIP, the better his performance.   Listed below is the calculation for WHIP.

WHIP = (Walks + Hits) / Innings Pitched

Dashboards such as the following can be built to demonstrate these new kinds of measures, or key performance indicators (KPIs), and show how they can be used across a wider audience to provide more insight into teams and players.

Some of the other measures needed to accurately determine a player's contribution to the team can only be implemented using a statistical package such as 'R'. Typically, implementing a statistical package in an organization is not a trivial task, for the following reasons:

1.) Specialized domain expertise required – statistics requires a new skill set to understand and use properly.

2.) Data access – data must be imported into and exported out of the statistical package.

3.) Performance – many of the statistical algorithms are compute intensive.

This article demonstrates how Vertica 6 handles the first two items above; a follow-up article, to be posted soon, will show how Vertica 6 "Presto" provides some distinct performance capabilities related to its 'R' integration.

While it is true that understanding statistics can be challenging without proper training, having a group that fully understands the algorithms collaborate with the business domain experts ensures that a proper implementation can be done. Implementing these new algorithms in the database lets your organization leverage powerful statistics in its daily business analysis and reduces time to market, because the algorithms can now be treated like any other "standard" database function. The possibility for error is also reduced, because complex "Extraction, Transformation and Load (ETL)" products are no longer required to import and export data into the statistical package. The entire process is streamlined so that any BI or ETL tool in the organization can leverage the new capability as well, because the functions now live in the database.
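To make that concrete, here is a rough, hypothetical sketch of what deploying the WHIP formula as an in-database R function could look like. It follows the factory pattern used by Vertica's R language support as described in the how-to post linked at the end of this article; the exact factory fields and type names should be checked against that post.

```r
# Hypothetical sketch only: the WHIP formula as an in-database R scalar
# function. Vertica passes the arguments as columns of a data frame.
whip <- function(x) {
  (x[, 1] + x[, 2]) / x[, 3]    # (walks + hits) / innings pitched
}

# Factory describing the function to the database; field names follow the
# R UDx conventions referenced above and may differ in your version.
whipFactory <- function() {
  list(name    = whip,
       udxtype = c("scalar"),
       intype  = c("float", "float", "float"),
       outtype = c("float"))
}
```

Once registered, a function like this can be called from any SQL query or BI tool just like a built-in function.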

So let’s put on our favorite baseball cap, in my case a Tiger cap, and take a closer look at how using ‘R’ can enhance our understanding of our favorite baseball teams and players.

As indicated before, "Moneyball" enlightened the baseball world with many new "measures" that are now almost common parlance among baseball fans. The scenario for this example could be a team that wants to ensure it is paying its pitchers appropriately based on performance, or one that is looking for more talented pitchers for its roster. Once these pitchers are determined, I want to group them together in like clusters based on our key performance indicators (KPIs). The two KPIs I have decided to use are the WHIP calculation described above and another called IPouts, which is simply the number of outs pitched.

Listed below is a simple query showing results for last year’s top pitchers sorted on the new measure called WHIP.

You can see very clearly why Justin Verlander was the MVP and Cy Young award winner last year. His WHIP and IPouts were the best, and he was third in ERA. All of the measures provided so far can be implemented with standard SQL. The next thing I want to do is group these pitchers into clusters based on my two measures, WHIP and IPouts. To do this, I used the new Vertica integration with the statistical package 'R' to implement a clustering algorithm called KMeans. In my case, I want 3 clusters of the 91 pitchers from 2011 that qualified. The column below called Geo.cluster was provided by the integration of 'R' in Vertica.
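As a rough illustration (not the post's actual in-database code), the heart of such a clustering step is ordinary R: take the WHIP and IPouts columns and return a cluster label per pitcher. The data frame and column names below are assumptions for the example; wrapping this as an in-database function follows the same pattern as the earlier sketch.

```r
# Illustrative only: cluster pitchers by WHIP and IPouts with k-means.
# 'pitchers' stands in for the query result; column names are assumed.
cluster_pitchers <- function(pitchers, k = 3) {
  feats <- scale(pitchers[, c("whip", "ipouts")])    # put both KPIs on one scale
  fit   <- kmeans(feats, centers = k, nstart = 10)   # k clusters of similar pitchers
  cbind(pitchers, cluster = fit$cluster)             # attach a cluster label per row
}
```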

You can see that even in the top 10 we have pitchers in all three of our clusters. Keep in mind that lower numbers for WHIP and ERA are better, and higher values for IPouts are better. Looking at the list above, I now have better insight into the players, and I can focus on cluster 3 players and possibly some players from cluster 2. Listed below is an example of a histogram showing WHIP on the X axis for all 91 of our 2011 pitchers. You can include these charts and graphs in your dashboards as well.
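In plain R, a histogram like that is a one-liner (the data frame and column name are assumptions carried over from the sketch above):

```r
# Stand-in for the chart referenced above; 'pitchers$whip' is assumed.
hist(pitchers$whip, breaks = 20, xlab = "WHIP",
     main = "WHIP distribution, 2011 qualified pitchers")
```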

Other database competitors can also claim 'R' integration, but Vertica's implementation provides better value to your organization because of its simplicity and performance. Other vendors take an 'R'-centric approach, which means your users have to know 'R' and use the 'R' suite of programming tools. Vertica's implementation is a more data-centric approach that shields users from having to know and use the 'R' language. Users can continue to use their favorite BI or query tool and now have access to 'R' capability.

This article demonstrated how statistics can be used to build new measures that provide more insight into a particular situation. This kind of analysis can also be applied in your organization to help with problems such as detecting fraud.

Stay tuned for future posts that will give you more detail on how KMeans and other statistical functions, like PageRank, were implemented in Vertica using 'R'. Go Tigers!

For more details on how to implement R in Vertica, please refer to the following blog post: http://www.vertica.com/2012/10/02/how-to-implement-r-in-vertica/

What’s New in Vertica 4.1

Vertica announced a new version of its Vertica Analytics Platform software, version 4.1, on Tuesday, November 9th at TDWI Orlando. You can read more about Vertica 4.1 in the press release, but I wanted to give you a few highlights of the features that make 4.1 so important to our customers, or to anyone looking to make the most of their data.

What’s New in Vertica 4.1 from Vertica Systems on Vimeo.

Here are some highlights from the video:

What’s New Intro
Third-Party Tools Integration – 0:43
SQL Macros – 2:14
Enhanced Security & Authentication – 2:47
Updates & Deletes – 3:27
Vertica 4.1 Wrap Up – 3:50
We hope you enjoy the video!
