Vertica


Vertica Moneyball and ‘R’. The perfect team!

Back in April, Colin’s blog post, “Moneyball – not just for baseball anymore,” was a good example of how statistics can be used to make better decisions on and off the baseball field. New measures can be created to better understand a player’s real contribution to a team. For instance, most baseball players are familiar with the popular earned run average (ERA) measure for pitchers, but a newer one that is gaining popularity is WHIP (Walks plus Hits per Inning Pitched).

Here is how Wikipedia describes WHIP: While earned run average (ERA) measures the runs a pitcher gives up, WHIP more directly measures a pitcher’s effectiveness against the batters faced. It is calculated by adding the number of walks and hits allowed and dividing this sum by the number of innings pitched; therefore, the lower a pitcher’s WHIP, the better his performance.   Listed below is the calculation for WHIP.

WHIP = (Walks + Hits) / Innings Pitched
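
To make the formula concrete, here is a minimal Python sketch of the calculation. It assumes innings pitched are derived from the number of outs recorded (three outs per inning); the function name and the sample season line are made up for illustration and do not come from any real pitcher.

```python
def whip(walks, hits, ipouts):
    """Walks plus hits per inning pitched.

    ipouts is the number of outs recorded; three outs make one inning.
    """
    if ipouts <= 0:
        raise ValueError("pitcher must have recorded at least one out")
    innings_pitched = ipouts / 3
    return (walks + hits) / innings_pitched


# Hypothetical season line: 50 walks, 180 hits, 690 outs (230 innings).
example_whip = whip(walks=50, hits=180, ipouts=690)
print(f"WHIP = {example_whip:.3f}")  # prints "WHIP = 1.000"
```

A pitcher who allows exactly one baserunner per inning by walk or hit has a WHIP of 1.0, so anything below that is excellent.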

Dashboards such as the following can be built to present these new kinds of measures, or key performance indicators (KPIs), to a wider audience and to provide more insight into teams and players.

Some of the other measures needed to accurately determine a person’s contribution to the team can only be implemented using a statistical package such as ‘R’.  Typically implementing a statistical package in an organization is not a trivial task for the following reasons:

1.)    Specialized domain expertise required – Statistics requires a new skill set to understand and use properly.

2.)    Data Access – Import and Export must be done into the statistical package.

3.)    Performance – Many of the statistical algorithms are compute intensive.

This article demonstrates how Vertica 6 handles the first two items above. Another article, to be posted soon, will show how Vertica 6 “Presto” offers some distinct “Performance” capabilities related to its ‘R’ integration.

While it is true that understanding statistics can be challenging without proper training, having a group that fully understands the algorithms collaborate with the business domain experts ensures a proper implementation. Implementing these new algorithms in the database lets your organization leverage powerful statistics in its daily business analysis and reduces time to market, because the algorithms can now be treated like any other “standard” database function. The possibility for error is also reduced because complex “Extraction, Transformation and Load (ETL)” products are no longer required to import and export the data into the statistical package. The entire process is now streamlined, so any BI tool or ETL tool in the organization can leverage the new capability because it now lives in the database.

So let’s put on our favorite baseball cap, in my case a Tiger cap, and take a closer look at how using ‘R’ can enhance our understanding of our favorite baseball teams and players.

As indicated before, “Moneyball” enlightened the baseball world with many new “measures” that are now almost common parlance among baseball fans. In the scenario for this example, a team might want to ensure it is paying its pitchers appropriately based on performance, or it might be interested in finding more talented pitchers. Once these pitchers are identified, I want to group them into “like clusters” based on our key performance indicators (KPIs). The two KPIs I have decided to use are the WHIP calculation described above and another one called IPouts, which is simply the number of outs pitched.

Listed below is a simple query showing results for last year’s top pitchers sorted on the new measure called WHIP.

You can see very clearly why Justin Verlander was the MVP and Cy Young award winner last year. His WHIP and IPouts were the best, and he was third in ERA. All of the measures provided so far can be implemented with standard SQL. The next thing I want to do is group these pitchers into clusters based on my two measures, WHIP and IPouts. To do this I used the new Vertica integration with a statistical package called ‘R’ to run a clustering algorithm called KMeans. In my case I want 3 clusters from the 91 pitchers who qualified in 2011. The column below called Geo.cluster was provided by the integration of ‘R’ in Vertica.
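
For readers curious about what the clustering step does, here is a plain-Python sketch of k-means on the same two measures. It is an illustration only, not the Vertica and ‘R’ integration used in this article: the six pitchers are made up, standardizing both measures is my own assumption (WHIP and IPouts sit on very different scales), and seeding the centroids with the first k points is a simplification chosen to keep the result repeatable.

```python
import math


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]


def kmeans(points, k, max_iter=100):
    """Basic k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, until stable."""
    centroids = list(points[:k])  # simplification: first k points as seeds
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels, centroids


# Hypothetical pitchers: (name, WHIP, IPouts). Not real 2011 data.
pitchers = [
    ("A", 0.92, 750), ("B", 1.20, 600), ("C", 1.45, 490),
    ("D", 0.95, 740), ("E", 1.22, 610), ("F", 1.42, 500),
]
whips = standardize([p[1] for p in pitchers])
outs = standardize([p[2] for p in pitchers])
labels, centroids = kmeans(list(zip(whips, outs)), k=3)

for (name, w, o), label in zip(pitchers, labels):
    print(f"{name}: WHIP={w:.2f} IPouts={o} cluster={label + 1}")
```

With this toy data, pitchers A and D (low WHIP, many outs) land in one cluster, B and E in a second, and C and F in a third. The cluster numbers themselves carry no ranking; they are just labels.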

You can see that even in the top 10 we have pitchers in all 3 of our clusters. Keep in mind that lower numbers for WHIP and ERA are better and higher values for IPouts are better. Looking at the list above, I now have better insight into the players and can focus on cluster 3 players and possibly some players from cluster 2. Listed below is an example of a histogram showing WHIP on the X axis for all 91 of our 2011 pitchers. You can include these charts and graphs in your dashboards as well.
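
If you want to see how such a histogram is put together, the sketch below bins WHIP values into buckets of 0.10 and prints a text bar for each bucket. The ten WHIP values, the starting point of 0.90, and the bucket width are all assumptions for illustration; the function expects every value to be at or above the starting point.

```python
def whip_histogram(whips, start=0.90, width=0.10):
    """Count WHIP values per bucket; bucket i covers
    [start + i*width, start + (i+1)*width)."""
    # Work in integer thousandths to avoid floating-point edge cases.
    start_k = round(start * 1000)
    width_k = round(width * 1000)
    scaled = [round(w * 1000) for w in whips]
    n_bins = (max(scaled) - start_k) // width_k + 1
    counts = [0] * n_bins
    for value in scaled:
        counts[(value - start_k) // width_k] += 1
    return counts


# Hypothetical WHIP values, not the real 2011 data.
sample_whips = [0.92, 1.05, 1.08, 1.15, 1.18, 1.21, 1.24, 1.27, 1.33, 1.45]
counts = whip_histogram(sample_whips)

for i, count in enumerate(counts):
    low = 0.90 + i * 0.10
    print(f"{low:.2f}-{low + 0.10:.2f} | {'#' * count}")
```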

Other database competitors can also claim ‘R’ integration, but Vertica’s implementation provides better value to your organization because of its simplicity and performance.  Other vendors take an ‘R’ centric approach, which means your users have to know ‘R’ and use the ‘R’ suite of programming tools.  Vertica’s implementation is a more ‘data’ centric approach that shields the users from having to know and use the ‘R’ language.  Users can continue to use their favorite BI or query tool and now have access to ‘R’ capability.

This article demonstrated how statistics can be used to build new measures that provide more insight into a particular situation. The same kind of analysis can be applied in your organization to help with tasks such as detecting fraud.

Stay tuned for future posts that will give you more detail on how kmeans and other statistical functions like PageRank were implemented in Vertica using ‘R’.  Go Tigers!

For more details on how to implement R in Vertica, please see the following blog post: http://www.vertica.com/2012/10/02/how-to-implement-r-in-vertica/

Moneyball – Not Just for Baseball Anymore

Spring is in the air, major league baseball is now underway here in North America, and thoughts of Michael Lewis’ fantastic book and film, “Moneyball,” come to mind. The plot captures how Billy Beane (played by Brad Pitt) leverages an extreme data analyst/quant to fundamentally change baseball strategy and scouting after 100 years of tradition. The unorthodox, data-driven strategy ran counter to the traditional approach. Not surprisingly, Billy Beane was questioned until, ultimately, the strategy proved successful. Now every team in the league, including our Boston Red Sox, is deploying a variant of this approach. I see the exact same thing happening in just about every industry when it comes to the race for better insight and competitive advantage through extreme information and analytics. The struggle now, of course, is where to find the expert quants, analysts, managers, and solution providers who understand how to make it happen.

At Hewlett-Packard, I get to witness and enable real-world moneyball every day in a variety of global industries. I see how savvy organizations are creating SWAT teams of business leaders, statisticians, and IT staff to leverage extreme information and platforms like Vertica in ways that fundamentally alter markets and business dynamics.

In business school I was lucky enough to take Frances Frei’s course “Managing Service Operations”.  The course and her recent best-selling book “Uncommon Service: How to Win by Putting Customers at the Core of Your Business” investigate organizations’ efforts to diagnose and improve service experiences.  Interestingly though, Frances was way ahead of her time and forced us to crunch numbers with statistical programs combining fundamental business information with detailed historical data for true forensics and root cause analysis.  She stressed the importance of math and data analysis.  We were careful never to rely solely on data or theory, but rather bring all of the information together to make the best informed decisions we could.  In the current Big Data era, this can be taken to a whole new level and every company must work this way from the top down.

In addition to the baseball season starting, we know that “April showers bring May flowers”. The equivalent in our industry is that for the past several years, so many organizations have been “showered” with data. The “flowers,” of course, bloom when those same organizations are able to monetize the information to create better products, services, and shareholder value. Modern technologies and comprehensive solution providers like Hewlett-Packard can help organizations drastically reduce the cost and increase the efficacy of analytics by provisioning comprehensive offerings of hardware, software, and services. Organizations are now able to cost-effectively take disparate sources of extreme information, both structured and unstructured, and seamlessly combine them for constant ad hoc analysis. This can lead to fundamentally better decisions and value creation. Spring is an exciting time of year. Let the insights bloom!

Colin Mahony
VP & GM
Vertica, An HP Company
