
Vertica Moneyball and ‘R’. The perfect team!

Back in April, Colin’s blog, “Moneyball – not just for baseball anymore,” was a good example of how statistics can be used to make better decisions on and off the baseball field.  New measures can be created to better understand a player’s real contribution to a team.  For instance, most baseball fans are familiar with the popular earned run average (ERA) measure for pitchers, but a newer one that is gaining popularity is called WHIP (Walks plus Hits per Innings Pitched).

Here is how Wikipedia describes WHIP: While earned run average (ERA) measures the runs a pitcher gives up, WHIP more directly measures a pitcher’s effectiveness against the batters faced. It is calculated by adding the number of walks and hits allowed and dividing this sum by the number of innings pitched; therefore, the lower a pitcher’s WHIP, the better his performance.   Listed below is the calculation for WHIP.

 WHIP = (Walks + Hits) / Innings Pitched
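As a quick sanity check, the formula is easy to compute directly. Here is a minimal Python sketch; the 2011 Verlander figures used below are approximate:

```python
def whip(walks: int, hits: int, innings_pitched: float) -> float:
    """WHIP = (Walks + Hits) / Innings Pitched; lower is better."""
    return (walks + hits) / innings_pitched

# Justin Verlander, 2011 (approximate): 57 BB, 174 H, 251.0 IP
print(round(whip(57, 174, 251.0), 3))  # -> 0.92
```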

Dashboards can be built to showcase these new kinds of measures, or key performance indicators (KPIs), making them accessible to a wider audience and providing more insight into teams and players.

Some of the other measures needed to accurately determine a player’s contribution to the team can only be implemented using a statistical package such as ‘R’.  Typically, implementing a statistical package in an organization is not a trivial task, for the following reasons:

1) Specialized domain expertise required – statistics requires a new skill set to understand and use properly.

2) Data access – data must be exported to and imported from the statistical package.

3) Performance – many statistical algorithms are compute intensive.

This article will demonstrate how Vertica 6 handles the first two items above, and a forthcoming article will show how Vertica 6 “Presto” has some distinct performance capabilities related to its ‘R’ integration.

While it is true that understanding statistics can be challenging without proper training, having a group who fully understands the algorithms collaborate with the business domain experts ensures a proper implementation.  Implementing these new algorithms in the database lets your organization leverage powerful statistics in its daily business analysis and reduces time to market, because the algorithms can now be treated like any other “standard” database function. The possibility for error is also reduced, because complex Extraction, Transformation and Load (ETL) steps are no longer required to move data into and out of the statistical package.  The entire process is streamlined, so any BI or ETL tool in the organization can leverage the new capability as well, because it now lives in the database.

So let’s put on our favorite baseball cap, in my case a Tiger cap, and take a closer look at how using ‘R’ can enhance our understanding of our favorite baseball teams and players.

As indicated before, “Moneyball” enlightened the baseball world with many new “measures” that are now almost common parlance amongst baseball fans.  The scenario for this example: a team might want to ensure it is paying its pitchers appropriately based on performance, or it might be interested in finding more talented pitchers for the team.  Once these pitchers are identified, I want to group them into “like clusters” based on our key performance indicators (KPIs). The two KPIs I have decided to use are the WHIP calculation described above and another called IPouts, which is simply the number of outs pitched.

Listed below is a simple query showing results for last year’s top pitchers sorted on the new measure called WHIP.

You can see very clearly why Justin Verlander was the MVP and Cy Young award winner last year.  His WHIP and IPouts were the best, and he was third in ERA.   All of the measures provided so far can be implemented with standard SQL.  The next thing I want to do is group these pitchers into clusters based on my two measures, WHIP and IPouts.  To do this I used the new Vertica integration with the statistical package ‘R’ to implement a clustering algorithm called KMeans.  In my case I want 3 clusters of the 91 pitchers from 2011 that qualified.  The column below called Geo.cluster was provided by the integration of ‘R’ in Vertica.
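The R code itself isn’t shown in this post, so as an illustration of what the clustering step computes, here is a minimal, self-contained Python sketch of k-means (Lloyd’s algorithm) over (WHIP, IPouts) pairs. The pitcher values are made up, and the scaling step matters because IPouts (hundreds of outs) is two orders of magnitude larger than WHIP (around 1.0):

```python
import random

def kmeans(points, k, iters=100, seed=42):
    """Plain Lloyd's algorithm: returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centers[c])))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels

# Hypothetical (WHIP, IPouts) pairs; scale both features to [0, 1] first so
# IPouts does not dominate the distance computation.
pitchers = [(0.92, 753), (1.04, 700), (1.35, 520),
            (1.42, 495), (1.10, 640), (1.38, 505)]
lo_w, hi_w = min(p[0] for p in pitchers), max(p[0] for p in pitchers)
lo_o, hi_o = min(p[1] for p in pitchers), max(p[1] for p in pitchers)
scaled = [((w - lo_w) / (hi_w - lo_w), (o - lo_o) / (hi_o - lo_o))
          for w, o in pitchers]
print(kmeans(scaled, k=3))
```

In Vertica the same computation would run inside the database via the ‘R’ integration (for example, R’s own kmeans() function), so the labels come back as just another column in the result set.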

You can see that even in the top 10 we have pitchers in all three of our clusters. Keep in mind that lower values for WHIP and ERA are better, while higher values for IPouts are better. Looking at the list above, I now have better insight into the players, and I can focus on cluster 3 players and possibly some players from cluster 2. Listed below is an example of a histogram showing the WHIP on the X axis for all 91 of our 2011 pitchers.  You can include these charts and graphs in your dashboards as well.

Other database vendors can also claim ‘R’ integration, but Vertica’s implementation provides better value to your organization because of its simplicity and performance.  Other vendors take an ‘R’-centric approach, which means your users have to know ‘R’ and use the ‘R’ suite of programming tools.  Vertica’s implementation is a more data-centric approach that shields users from having to know and use the ‘R’ language.  Users can continue to use their favorite BI or query tool and now have access to ‘R’ capability.

This article demonstrated how statistics can be used to build new measures that provide more insight on a particular situation.  This kind of analysis can also be applied in your organization, for example to help with detecting fraud.

Stay tuned for future posts that will give you more detail on how kmeans and other statistical functions, such as PageRank, were implemented in Vertica using ‘R’.  Go Tigers!

For more details on how to implement ‘R’ in Vertica, please refer to the following blog post.

On the Trail of a Red-Tailed Hawk! – Part 2

Part 2:  Why User-Defined Loads are music to my ears!

Webcams and hawks – really?  I heard a healthy dose of skepticism expressed about what User-defined Loads can accomplish.  So how about I show something a bit more practical?

I’m a fan of music of all flavors. Among the various devices I’ve had connected to my stereo at home are an LP player and an open-reel tape deck.  While it’s fun to get out the old vinyl records from time to time, they’re a bit cumbersome and don’t travel very well; more often I just want to pull up iTunes and hit “Play”.  So if I have a recording that I particularly enjoy, I try to digitize it.  But it is a huge pain to split it up into tracks.  I thought – why not automate it? You guessed right – with UDL, of course!

Vertica’s not much for music; natively it couldn’t tell the difference between The Beach Boys and Van Halen.  In fact, it can’t make heads or tails of most music formats.  So I need one (or both) of Vertica’s two other UDL tools.  One is a User-Defined Parser:  I could write a parser that directly parsed the WAV format, for example.  This would give me the best performance, but would take some time to code.

The other is a User-Defined Filter:  A Filter takes in data, processes it, and emits it in a new form.  Filters are intended for things like decompression and transcoding.  It’d be nice to support FLAC files and the like (c’mon, you can’t use lossy MP3 files for this; this is the good stuff!), and my computer has more free time than I do, so a simple Filter seems like the way to go.

We conveniently have an example external Filter on GitHub as well; it opens up an external process and passes data through it via pipes.  So let’s go install it:

$ cd Vertica-Extension-Packages/shell_load_package
$ make install

Now, let’s once again make a table and try loading some data:

>>> CREATE TABLE music (filename varchar(64), time_index float, L float, R float)
     ORDER BY filename, time_index;
>>> COPY music (filename AS 'Sample', time_index, L, R)
     FROM '/home/aseering/Music/Recordings/Sample.flac'
     FIXEDWIDTH COLSIZES (17, 17, 17)
     WITH FILTER ExternalFilter(cmd='sox --type flac - --type dat -');

The FIXEDWIDTH parser takes advantage of the format of the ‘.dat’ file:  It’s three floating-point columns; each floating-point number can be up to 16 characters wide (and is padded with spaces to that length if it’s smaller), and there’s one extra leading space as a separator.
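To make that layout concrete, here is a small Python sketch that slices a ‘.dat’ row the same way the FIXEDWIDTH parser does; the sample values are fabricated:

```python
def parse_dat_line(line, width=17, ncols=3):
    """Slice one fixed-width record into floats: each column is 17 bytes
    (a float padded to 16 characters, plus one leading separator space)."""
    return [float(line[i * width:(i + 1) * width]) for i in range(ncols)]

# Build a line shaped like one '.dat' row: three right-aligned 17-char fields.
vals = (0.0, 0.125, -0.25)  # time_index, L, R
line = "".join(f"{v:17.10e}" for v in vals)
print(parse_dat_line(line))  # -> [0.0, 0.125, -0.25]
```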

Of course, using ExternalSource, I can load directly from my audio input as well, assuming a properly-configured sound system on the computer:

>>> COPY music (filename AS 'Sample', time_index, data filler int, L AS data, R AS data)
     WITH SOURCE ExternalSource(cmd='arecord -d 10')
     FILTER ExternalFilter(cmd='sox --type wav - --type dat -');

Once I’ve loaded the data, I have to split up tracks.  On recordings of old records, the only real way I know to split tracks is to look for areas where the audio is quiet for a while.  Vertica can do this with window analytics.  The following is a good start, though there is of course a ton of flexibility if you want to tune the algorithm:

>>> SELECT sqrt(avg(L*L + R*R)
         over (order by time_index rows between 100 preceding and 100 following))
     AS avg_volume
     FROM music WHERE filename = 'Sample';

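For readers who want to experiment outside the database, the same sliding-window volume computation can be sketched in a few lines of Python. The stereo samples below are fabricated, and the 0.01 silence threshold is an arbitrary starting point:

```python
import math

def rolling_rms(samples, radius=100):
    """RMS of L*L + R*R over a window of `radius` rows before and after
    each row, mirroring the SQL window frame in the query."""
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - radius):i + radius + 1]
        out.append(math.sqrt(sum(l * l + r * r for l, r in window) / len(window)))
    return out

def quiet_spans(volume, threshold=0.01):
    """Indices where the smoothed volume drops below a silence threshold."""
    return [i for i, v in enumerate(volume) if v < threshold]

# Fabricated stereo samples: loud passage, silent gap, loud passage.
samples = [(0.5, 0.5)] * 300 + [(0.0, 0.0)] * 300 + [(0.5, 0.5)] * 300
vol = rolling_rms(samples)
print(quiet_spans(vol)[:3])  # -> [400, 401, 402]
```

A track break would then be placed wherever the quiet indices form a long enough run.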
And there we go!  Now I can enjoy my Carpenters on the move.

Most people think of databases as tools for managing payroll or user accounts, or diving through sales records.  Vertica can do all of that.  But Vertica 6 is so much cooler than that:  It gives you a platform for running your own custom code in a robust, distributed way.  You can mix-and-match UDLs (and all the other UDx types) however you want; develop libraries of UDL tools; even share those tools on Github if you want, or download tools that others have shared.  (We take pull requests!)  You can also use UDLs with external tables to provide a real queryable table that’s backed by any data store that you want, so you can run any task through Vertica’s parallel execution engine.

Are you convinced now? Whether you are or not, I’ve spent the past few hours digitizing several of my favorite LPs.  Now I can sit by my window with my headphones on listening to music and working away until the hawk shows up. Life is good!

Introducing Vertica 6

by Shilpa Lawande & Luis Maldonado

Big data is all the rage these days.  It is incredible to watch this segment mature, even in just the past few years.  Of course, Vertica continues to evolve and innovate to keep up with the challenges and demands of this movement, especially when it comes to analytics.  The three greatest challenges customers face with regard to Big Data are 1) the volume and velocity of data, 2) the rapidly growing variety of disparate data sources, and 3) the complex analytics required to maximize the value of the information derived from data.  In Vertica 6, we address all of these and more!

It has been just over two and a half years since Vertica first released FlexStore™, our patented architecture for flexible storage and placement of data based on usage patterns. During this time, we have been hard at work to evolve our flagship database product from a blazingly fast database into a rich and flexible analytic platform.  Our high performance MPP columnar engine combined with our suite of built-in analytic functions including time-series, event-series pattern matching, C++ SDK, and more have enabled over 600 customers to transform their businesses by harnessing the power of large scale real-time analytics.  Last week at HP Discover, we raised the bar once again, with the release of Vertica 6, including the industry’s first and only open analytics architecture that enables broad analytics over any data source, structured, semi-structured, or unstructured.

Let’s review some of the highlights of Vertica 6.

  • Vertica 6 FlexStore™ has been expanded to allow access to any data, stored at any location, via any interface, providing a comprehensive architecture to handle the requirements of Big Data – open, extensible, and flexible.  Using our patent-pending User-Defined Load (UDL) feature, you can now connect natively to popular storage systems like Hadoop File System (HDFS), existing databases and data warehouses, as well as unstructured analysis platforms such as HP/Autonomy IDOL™.  Further, by combining our External Tables capability with UDL, you can access those data sources in a federated manner at query time, without pre-loading data into Vertica. Now, whether you are just exploring your data in HDFS to find the right data model or building a production application, you can use the same robust Vertica SQL analytics you know and love, and use your favorite reporting and visualization tools to slice and dice the data!
  • Vertica 6 includes high performance and massively scalable in-database analytics for the R statistical tool.  Data scientists and analysts can now run their favorite statistical algorithms written in R natively and in parallel on Vertica, without the in-memory and single-threading limitations of R!  We’ve also expanded our C++ SDK to add secure sandboxing of user-defined code, and additional APIs for user-defined aggregates, analytics and multi-phase transform functions that can enable Map-Reduce style computations to be done in Vertica.
  • Vertica 6 simplifies the out-of-the-box user experience and enables more diverse workloads with our improvements to Workload Management.  An early adopter of Vertica 6 experienced a 40x speed-up on some of their queries.  Whether you are running complex analytics or short operational BI workloads, Vertica’s workload management capabilities can effectively balance all system resources to meet your SLAs.
  • In 2007, Vertica was the first analytics platform to run on the Cloud. With Vertica 6 we extended this lead in significant ways. First, we announced a private beta of Vertica on HP’s new public Cloud. Further, we extended the core platform with a number of enhancements such as Object-level Backup & Restore and Schema evolution that make it much easier to develop and deploy multi-tenant applications using Vertica.
  • Last but not least, we continue to invest in our core platform with a number of enhancements to overall query and load performance, elastic cluster, monitoring views, database designer, security, and more.

For a quick overview of Vertica 6, we highly recommend watching this video interview with Luis Maldonado, Director of Product Management, from last week’s HP Discover show, or reading What’s new in Vertica 6 Enterprise Edition.  Watch this space over the next few days for more articles that drill down into specific features of Vertica 6!

We at Vertica are very excited about Vertica 6 and the role it plays in HP’s vision for Cloud, Security and Information Optimization.  We hope you will try it for yourselves and tell us your stories about how you used it to Make it Matter for your business!

