
Better Together

Assembling puzzles

Just like peanut butter and chocolate, the mix of several flavors of data is much more interesting and useful than just one. At HP we classify types of data into three categories:

Human Data

Human data is stuff created by people rather than machines: social media posts, videos, audio, emails, spreadsheets, blogs, and Wikipedia. This data is hard to analyze because it is written in natural language, does not conform to a particular structure, and lives in places that are not particularly easy to access. Because human data lacks traditional structure, we can’t just pull it straight into a data warehouse (nor should we want to). If you want to take full advantage of human data, you must do two things: extract metadata and textual content, and extract meaning. These are completely different things. I can easily write a program to extract keywords and text from PDFs and use them for a simple search engine. But unless I understand how that PDF relates to the millions of other documents in my business, I cannot do much more than that simple search. Plus, how can I extract information from a video? What about audio recordings from your customer service desk? Sentiment from a YouTube video review of your product and the related comments? These are all very valuable, and not particularly easy to analyze.
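To make the “simple search” half of the problem concrete, here is a minimal keyword-index sketch in Python. The names and stopword list are invented for illustration, and it assumes the text has already been extracted from the PDFs; extracting meaning requires far more than this:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def index_documents(docs):
    """Build a toy inverted index: keyword -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            word = word.strip(".,!?")
            if word and word not in STOPWORDS:
                index[word].add(doc_id)
    return index

def search(index, keyword):
    """Return the ids of all documents containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))
```

This gets you keyword lookup and nothing more: it knows which documents mention “fox,” but nothing about how those documents relate to each other.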

Machine Data

Machine Data is data produced by machines or for machines, like sensor data from smart meters, wearable technology, and weblogs from your web site. This category of data is growing exponentially faster than human or business data, and the size of the data is the main driver behind technologies like the Hadoop Distributed File System (HDFS). If I asked you how much data you have today versus 5 years ago, you might say 10 times as much. (If I asked you how many new customers you have today vs. 5 years ago, I would hope you’d say 10 times as many as well!) If you do indeed have 10x more data today, it’s because most of your new data is machine data. Machine data is growing so fast that it has spawned a number of new technologies to store and analyze it, from both open-source and proprietary sources. Understanding what these technologies do, and what they do NOT do, should be on your to-do list right now! (If you want help, feel free to contact me.)

Business Data

Business data is data created by businesses to help run the business. This includes data in your data warehouse, as well as less centralized data such as spreadsheets. Think your data warehouse solution has all of your business data? Just for fun, think about how much of your business is run through Excel spreadsheets. If you are lucky, those spreadsheets are sitting in a SharePoint space somewhere, and not just on employee desktops. And if they are on people’s desktops, hopefully they’re being backed up. Scary that you don’t have that information indexed and searchable, isn’t it?

So now that you have an idea of the types of data out there, what can you do with it? A picture is worth a thousand words, so let’s start off with a picture and a story.

Use Case: NASCAR

First, watch this video. When you think about NASCAR, you think about fast cars flying around the track, smacking into each other as they jockey for position. What you might not realize is that everything in NASCAR comes back to sponsorship. A NASCAR race is essentially a collection of 200mph billboards. Take a look at this picture:


You are looking at three-time Sprint Cup champion Tony Stewart at Infineon Raceway. First, notice that the car is an advertisement for a number of different companies. The race is called the Sprint Cup. The raceway is Infineon Raceway. NASCAR is not just about racing!

“The NASCAR ecosystem is a huge one involving fans, race teams, owners, drivers, racetracks, promoters, sponsors, advertisers, media and many more.”
– Sean Doherty, Director of Digital Engagement and Integrated Marketing Communications at NASCAR (credit CIO Insight).

NASCAR is a vehicle for advertising, as much as it is advertising for vehicles. Of course advertisers want to maximize viewers, because that is ultimately what sponsors want: people looking at their logo, or viewing their ads during the commercial break.

NASCAR realizes that its success is all about the fan base. But the majority of that fan base is sitting at home, far from the action. How to engage them? Putting aside creepy ideas like taking over video feeds from an Xbox Kinect, there are plenty of ways that fans publicly interact. The most obvious one: they tweet about the action in real time. They even tweet during the commercials, about the commercials. So now we have two things we can monitor: the number of tweets at any given time during the race, and the content of the tweets. Counting tweets is easy: just pick a time slice like 1 minute, count tweets that include NASCAR-related hashtags in that timeslice, and put them up on a dashboard. TA-DA! You now have one indicator of engagement.
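The counting step really is that easy. Here is a minimal sketch in Python; the hashtag list and the (timestamp, text) tweet format are assumptions for illustration:

```python
from collections import Counter
from datetime import datetime

def engagement_series(tweets, hashtags=("#nascar", "#sprintcup")):
    """Count hashtag-matching tweets per one-minute time slice.

    tweets is an iterable of (datetime, text) pairs.
    """
    counts = Counter()
    for ts, text in tweets:
        lowered = text.lower()
        if any(tag in lowered for tag in hashtags):
            # Truncate the timestamp to its minute bucket.
            counts[ts.replace(second=0, microsecond=0)] += 1
    return counts
```

Feed the resulting series to a dashboard and you have your engagement indicator.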

But wait, are the fans happy or mad? We have to look at the content of the tweets, and that means sentiment analysis. We need to attach sentiment to each tweet so that we can gauge overall sentiment. Now the real problem: tweets are, by nature, short. They also are written in shorthand, and use colloquial language. So now we need natural language processing on what is essentially slang. We have two factors that we can gauge throughout the race: engagement level and sentiment. That dashboard is getting more interesting!
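At its crudest, sentiment tagging is a lexicon lookup, as in the sketch below. This is a deliberately naive illustration with made-up word lists; production systems such as HP IDOL use much richer NLP models that can actually cope with slang and shorthand:

```python
# Toy sentiment lexicons (illustrative only).
POSITIVE = {"love", "great", "awesome", "epic"}
NEGATIVE = {"hate", "boring", "awful", "ugh"}

def tweet_sentiment(text):
    """Label a tweet positive, negative, or neutral by word counts."""
    words = [w.strip("#!.,?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The sketch fails exactly where real tweets live: sarcasm, abbreviations, and context. That gap is why proper natural language processing matters.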


Here is a strange and related observation: did you know that the time spent during a hockey game where a Zamboni cleans the ice is one of the most heavily tweeted parts of the game? PEOPLE LOVE THE ZAMBONI.

Anyway, how does this relate to fan engagement? Well, let’s say that it starts raining heavily during a race, and NASCAR decides to pull the vehicles off the track. We now have a problem and an opportunity: will the home viewers check out until the rain stops? How do we keep them engaged during the break? We could start by looking at that dashboard to see which parts of the race were most heavily talked about, then cue up the commentators and video to go over those bits. We could poll the audience, have them tweet their favorite moment, and watch the results come in live from Twitter. For that, we will have to categorize and cluster keywords from the tweets in real time.
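The poll-tallying piece could start as simply as the sketch below. The hashtags and moment labels here are invented for illustration; genuine real-time keyword clustering is considerably more involved:

```python
from collections import Counter

def tally_favorite_moments(poll_tweets, moment_tags):
    """Map vote hashtags to moment labels and rank them by vote count.

    moment_tags maps a lowercase hashtag to a human-readable label.
    """
    votes = Counter()
    for text in poll_tweets:
        lowered = text.lower()
        for tag, label in moment_tags.items():
            if tag in lowered:
                votes[label] += 1
    return votes.most_common()
```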

There is much more to this use case, but suffice it to say that NASCAR also collects data from media outlets in print, radio, and TV, and adds them into the mix. That means scanning video and audio for keywords and content, just like the tweets.

The data collected by NASCAR can then be used by its sponsors, who have their own data, likely in a more traditional data warehouse. Here are a few of the things NASCAR and its sponsors are doing with this system:

  • Race teams can gauge fan conversation and reaction to a new paint scheme for one of its cars to decide whether to alter it before future races.
  • The Charlotte Motor Speedway is tracking conversations and levels of interaction about one of its recent ticket promotions.
  • A sponsor is following track response and media coverage about a new marketing campaign.

-List Credit: CIO Insight

What has HP done to make this easier?

We covered a lot of ground in that one use case. We needed access to non-traditional data sources like Twitter, access to traditional sources like an EDW, sentiment analysis, natural language processing, audio text retrieval, video frame recognition, and time series functions to slice up the data. Throw in some pattern-matching techniques and probabilistic modeling too. Then connect all that data to some real-time dashboards using standard SQL technologies and tools. That’s quite a laundry list.

HP has all of the technologies needed to implement this solution for NASCAR. We created a platform that can store and analyze 100% of your data. Structured, unstructured, semi-structured, multi-structured, human, machine, or business data: we can store it and analyze it. The latter part is the interesting one. It’s trivial to set up a Hadoop cluster and store your EDW, web logs, and tweets from the Twitter Fire Hose on there. But Hadoop doesn’t magically know how to parse emails, databases, weblog data, or anything else. That’s on you. So is stitching those data sources together, running analytics on them, and hooking all that up to a sensible user interface. Of course, that’s what we do at HP. We have even moved this technology onto the cloud, to make development and testing of these solutions quick and easy. Take a look at Haven on Demand!

What should you be asking yourself?

First, do you understand all of the types of data involved in your industry? Outside of your EDW, how do you interact with your customers, vendors, sponsors, or investors? How can you collect that data and get it into an analytics system? Does your data give you a competitive advantage, or is it just sitting in cold storage? What other data sources do you need in order to make innovative products and services? How do you join it all together using modern data science techniques, while using common data languages like SQL?

These are non-trivial questions. Sometimes just knowing what you have is a science project in itself (it doesn’t have to be, we actually have products for that). Many people assume that data cannot be analyzed unless it is lumped all together in one place, like a Hadoop cluster or an EDW. The good news is that it isn’t necessary in most cases. There are likely cases where you can optimize data processing by moving data into a high-performance data store, but much of your data can be analyzed right where it is. We have been helping customers solve these problems, and we would be delighted to help you as well.

Author Note

This is the first in a series of three articles. The next article deals with how location data from cell phones and social media is creating huge new opportunities for those with the means to analyze it. The third article will deal with machine data, and the issues with dealing with the Internet of Things at scale.

Thanks for reading!

Enter to Win the March Data Madness Machine Learning Mania Contest!

Slam Data Dunk

If you’re a frequent visitor to our blog, you may recall reading about the March Data Madness Sentiment Tracker that we demonstrated at the MIT Sloan Sports Analytics Conference just before the 2013 NCAA Men’s Basketball “March Madness” Tournament.

The demonstration focused on tracking the “sentiment of the crowd” by collecting and analyzing roughly a half million tweets with the HP Vertica and HP IDOL engines. These results were displayed with a Tibco Spotfire dashboard and offered great conversation fodder at the event:

  • Volume of tweets by team
  • Volume of tweets by player
  • Positive, negative, and neutral sentiment groupings
  • Volume of tweets by U.S. city and by worldwide country
  • Volume of tweets by language (English, French, Spanish, etc.)

We also dug into additional results and continued the conversation around sports analytics with our webinar, The Future of Big Data in Sports, with an impressive roundtable of experts, including STATS and

Join the Machine Learning Mania Competition for a Chance to Win Cash Prizes!

For this year’s tournament, we’re at it again, only we’re offering cash prizes to the data scientist who can use the HP Haven Big Data Platform to accurately predict this year’s winner by sifting through massive amounts of data and applying machine learning and statistical techniques.

You will have access to key HP Haven technologies, including HP Vertica Distributed R, which accelerates your machine learning by running your R models across multiple nodes, vastly reducing execution time and letting you analyze much larger data sets.

The Machine Learning Mania Contest, hosted by Kaggle and sponsored by HP Software’s Big Data Group, lets you get creative with the data sets that you use to create your statistical models. We will provide you with data covering three decades of historical games, but all participants are encouraged to pull in data from a variety of external sources.

There’s no cost to join, and you can compete to win up to $15,000 in cash prizes – join the Madness today!

Thoughts About HP Vertica for SQL on Hadoop

Et voilà

HP recently announced HP Vertica for SQL on Hadoop. We’ve leveraged our years of experience in big data analytics and opened up our platform to allow users to tap into the full power of Hadoop. It’s a rich, fast, and enterprise-ready implementation of SQL on Hadoop that we’re very proud to introduce.

We know that you have a choice when it comes to SQL-on-Hadoop engines. There are several on the market for a reason: they are a very powerful way to perform analytics on big data stored in Hadoop using the familiar SQL language. Users can leverage any reporting or analytical tool to analyze and study the data rather than write their own Java and MapReduce code.

However, not all SQL-on-Hadoop engines are created equal. We think HP Vertica for SQL on Hadoop has some very big differences. These include:

  • Platform Agnostic – When you adopt a SQL-on-Hadoop query engine, it may be tied to one distribution of Hadoop. Not so with HP Vertica for SQL on Hadoop: our implementation works with Hortonworks, Cloudera, and MapR distributions.
  • SQL Completeness – The richer the SQL engine, the wider the range of analytics you can perform without extensive coding and data movement. HP Vertica for SQL on Hadoop gives you a very rich set of analytical functions: enterprise-ready, advanced analytics that support JOINs, complex data types, and other capabilities only available from our implementation.
  • Manageability – Tools for managing queries and cluster resources are fairly scarce and immature in the Hadoop world. With the tools we include, you can divide resources among different queries and different types of queries, and cancel or temporarily interrupt unplanned, resource-intensive queries.
  • Data Source Transparency – It’s important to be able to query common storage formats such as Parquet, Avro, and ORC. When you can query native formats, you avoid having to move the data.
  • Path to Optimization – When you need to boost performance, HP Vertica for SQL on Hadoop offers optimizations like compression, columnar storage, and projections.

And you can’t forget that this offering comes from HP Software. Users can take advantage of all the power of our Haven platform for big data. Encompassing proven technologies from HP Software, including Autonomy, Vertica, and ArcSight, Haven enables forward-thinking organizations to make use of virtually all information sources, both inside and outside their four walls, to make better, faster decisions.


HP Vertica Storage Location for HDFS

Do you find yourself running low on disk space on your HP Vertica database? You could delete older data, but that sacrifices your ability to perform historical queries. You could add new nodes to your cluster or add storage to your existing nodes. However, these options require additional expense.

The HP Vertica Storage Locations for HDFS feature introduced in HP Vertica Version 7.1 offers you a new solution: storing data on an Apache Hadoop cluster. You can use this feature to store data in a Hadoop Distributed File System (HDFS) while still being able to query it through HP Vertica.

Watch this video for an overview of the HP Vertica Storage Locations for HDFS feature and an example of how you can use it to free storage space on your HP Vertica cluster.

For more information about this feature, see the HP Vertica Storage Location for HDFS section of the documentation.

HP Vertica Best Practices: Native Connection Load Balancing

You may be aware that each client connection to a host in your HP Vertica cluster requires a small overhead in memory and processor time. For a single connection, this impact is minimal, almost unnoticeable. Now imagine you have many clients all connecting to the same host at the same time. In this situation, the compounded overhead can potentially affect database performance.

To limit the performance impact of multiple client connections, you might manually assign certain clients to certain hosts. But this becomes tedious and difficult as more and more client connections are added. Luckily, HP Vertica offers a feature that does all of this for you: native connection load balancing.

Native connection load balancing is available in HP Vertica 7.0 and later releases. It is a feature built into both the server and the client libraries that helps spread the CPU and memory overhead caused by client connections across the hosts in the database. When you enable native load balancing on the server and client, you won’t have to manually assign clients to specific hosts to reduce overhead.
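Conceptually, the balancing scheme behaves like the little round-robin sketch below. This is a client-side illustration only; Vertica’s native feature is built into the server and client libraries and is enabled there, not in application code:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand each new client connection the next host in turn,
    spreading per-connection CPU and memory overhead across the cluster."""

    def __init__(self, hosts):
        self._hosts = cycle(hosts)

    def next_host(self):
        return next(self._hosts)
```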

Watch this best practices video to learn more about HP Vertica native connection load balancing and how to enable and disable it on the server and client.

For more information, see Native Connection Load Balancing in our documentation.

What Is a Range Join and Why Is It So Fast?


Last week, I was at the 2015 Conference on Innovative Data Systems Research (CIDR), held at the beautiful Asilomar Conference Grounds. The picture above shows one of the many gorgeous views you won’t see when you watch other people do PowerPoint presentations. One HP Vertica user at the conference said he saw a “range join” in a query plan, and wondered what it is and why it is so fast.

First, you need to understand what kind of queries turn into range joins. Generally, these are queries with inequality (greater than, less than, or between) predicates. For example, a map of the IPv4 address space might give details about addresses between a start and end IP for each subnet. Or, a slowly changing dimension table might, for each key, record attributes with their effective time ranges.

A rudimentary approach to handling such joins would be as follows: For each fact table row, check each dimension row to see if the range condition is true (effectively taking the Cartesian product and filtering the results). A more sophisticated, and often more efficient, approach would be to use some flavor of interval trees. However, HP Vertica uses a simpler approach based on sorting.

Basically, if the ranges don’t overlap very much (or at all), sorting the table by range allows sections of the table to be skipped (using a binary search or similar). For large tables, this can reduce the join time by orders of magnitude compared to “brute force”.

Let’s take the example of a table fact, with a column fv, which we want to join to a table dim using a BETWEEN predicate against attributes dv_start and dv_end (fv >= dv_start AND fv <= dv_end). The dim table contains the following data:


We can choose, arbitrarily, to sort the data on dv_start. This way, we can eliminate ranges that have a dv_start that is too large to be relevant to a particular fv value. In the second figure, this is illustrated for the lookup of an fv value of 62. The left shaded red area does not need to be checked, because 62 is not greater than these dv_start values.


Optimizing on dv_end is slightly trickier, because we have no guarantee that the data is also sorted by dv_end (in fact, in this example, it is not). However, we can keep a running maximum of dv_end, scanning from the beginning of the table, and search based on that. In this manner, the red area on the right can be skipped, because all of those rows have a running maximum dv_end that is not greater than 62. The part in blue, between the red areas, is then scanned to look for matches.
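The whole trick (sort by dv_start, binary-search to bound one end, and use a running maximum of dv_end to bound the other) fits in a short sketch. This is illustrative Python, not Vertica’s actual implementation:

```python
from bisect import bisect_left, bisect_right

def range_join(fact_values, dim_ranges):
    """Join each fv to every (dv_start, dv_end) with dv_start <= fv <= dv_end."""
    dim = sorted(dim_ranges)                  # sort by dv_start
    starts = [s for s, _ in dim]
    # Running maximum of dv_end; nondecreasing by construction.
    max_end, cur = [], float("-inf")
    for _, e in dim:
        cur = max(cur, e)
        max_end.append(cur)
    matches = []
    for fv in fact_values:
        hi = bisect_right(starts, fv)         # skip rows whose dv_start > fv
        lo = bisect_left(max_end, fv, 0, hi)  # skip prefix whose max dv_end < fv
        for s, e in dim[lo:hi]:               # scan only the remaining slice
            if s <= fv <= e:
                matches.append((fv, (s, e)))
    return matches
```

When the ranges barely overlap, the lo..hi slice is tiny, which is where the orders-of-magnitude speedup over brute force comes from.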

If you managed to follow the example, you can see our approach is simple. Yet it has helped many customers in practice. The IP subnet lookup case was the first prominent one, with a 1000x speedup. But if you got lost in this example, don’t worry… the beauty of languages like SQL is that there is a community of researchers and developers who figure these things out for you. So next time you see us at a conference, don’t hesitate to ask about HP Vertica features. You just might see a blog post about it afterward.

The HP Vertica Community is Moving!

The HP Vertica online community will soon have a new home. In the next few months, we’ll be joining the Big Data and Analytics Community, part of the HP Developer Community, located at

Why are we doing this?

We’re joining the new community so that you’ll have a centralized place to go for all your big data questions and answers. Using the Big Data and Analytics Community, you will be able to:

  • Connect with customers across all our Big Data offerings, including HP Vertica Enterprise and Community Editions, HP Vertica OnDemand, HP IDOL, and HP IDOL OnDemand.
  • Learn more about HP Haven, the HP Big Data Platform that allows you to harness 100% of your data, including business, machine, and human-generated data.

In short, the Big Data and Analytics Community will provide you with one-stop shopping for product information, guidance on best practices, and solutions to technical problems.

What about existing content?

To preserve the rich exchange of knowledge in our current community and forum, we are migrating all of the content from our current forum to our new Big Data and Analytics location. All your questions and answers will be saved and accessible on the new forum.

When will this happen?

The migration process is just beginning and we estimate it will take a number of weeks. As the new launch date nears, we’ll share more information with you about the actions you’ll need to take to access the new forum.

Want a preview?

Here’s a sneak peek at new community plans:

We look forward to greeting you in our new space! Stay tuned for more detailed information to come.
