Just like peanut butter and chocolate, the mix of several flavors of data is much more interesting and useful than just one. At HP we classify types of data into three categories:
Human data is stuff created by people as opposed to machines, like social media posts, videos, audio, emails, spreadsheets, blogs, and Wikipedia. This data is hard to analyze: it is written in natural language, does not conform to a particular structure, and lives in places that are not particularly easy to access. Because human data lacks traditional structure, we can’t just pull it straight into a data warehouse (nor should we want to). If you want to take full advantage of human data, you must do two things: extract metadata and textual content, and extract meaning. These are completely different things. I can easily write a program to extract keywords and text from PDFs, and use them for a simple search engine. But unless I understand how that PDF relates to the millions of other documents in my business, I cannot do much more than that simple search. Plus, how can I extract information from a video? What about audio recordings from your customer service desk? Sentiment from a YouTube video review of your product and the related comments? These are all very valuable, and not particularly easy to analyze.
Machine data is data produced by machines or for machines, like sensor data from smart meters, wearable technology, and weblogs from your web site. This category of data is growing exponentially faster than human or business data, and the size of the data is the main driver behind technologies like the Hadoop Distributed File System (HDFS). If I asked you how much data you have today versus 5 years ago, you might say 10 times as much. (If I asked you how many new customers you have today vs. 5 years ago, I would hope you’d say 10 times as many as well!) If you do indeed have 10x more data today, it’s because most of your new data is machine data. Machine data is growing so fast that it has spawned a number of new technologies to store and analyze it, from both open-source and proprietary sources. Understanding what these technologies do, and what they do NOT do, should be on your to-do list right now! (If you want help, feel free to contact me.)
Business data is data created by businesses to help them run the business. This includes data in your data warehouse, as well as less centralized data like data found in spreadsheets. Think your data warehouse solution has all of your business data? Just for fun, think about how much of your business is run through Excel spreadsheets. If you are lucky, those spreadsheets are sitting in a SharePoint space somewhere, and not just on employee desktops. And if they are on people’s desktops, hopefully, they’re being backed up. Scary that you don’t have that information indexed and searchable, isn’t it?
So now that you have an idea of the types of data out there, what can you do with it? A picture is worth a thousand words, so let’s start off with a picture and a story.
Use Case: NASCAR
First, watch this video. When you think about NASCAR, you think about fast cars flying around the track, smacking into each other as they jockey for position. What you might not realize is that everything in NASCAR comes back to sponsorship. A NASCAR race is essentially a collection of 200mph billboards. Take a look at this picture:
You are looking at 3-Time Sprint Cup Champion Tony Stewart at Infineon Raceway. First, notice that the car is an advertisement for a number of different companies. The race is called the Sprint Cup. The raceway is Infineon Raceway. NASCAR is not just about racing!
“The NASCAR ecosystem is a huge one involving fans, race teams, owners, drivers, racetracks, promoters, sponsors, advertisers, media and many more.”
– Sean Doherty, Director of Digital Engagement and Integrated Marketing Communications at NASCAR (credit CIO Insight).
NASCAR is a vehicle for advertising, as much as it is advertising for vehicles. Of course advertisers want to maximize viewers, because that is ultimately what sponsors want: people looking at their logo, or viewing their ads during the commercial break.
NASCAR realizes that its success is all about the fan base. But the majority of that fan base is sitting at home, far from the action. How to engage them? Putting aside creepy ideas like taking over video feeds from an Xbox Kinect, there are plenty of ways that fans publicly interact. The most obvious one: they tweet about the action in real time. They even tweet during the commercials, about the commercials. So now we have two things we can monitor: the number of tweets at any given time during the race, and the content of the tweets. Counting tweets is easy: just pick a time slice like 1 minute, count tweets that include NASCAR-related hashtags in that timeslice, and put them up on a dashboard. TA-DA! You now have one indicator of engagement.
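The tweet-counting step can be sketched in a few lines of Python. This is a minimal illustration, not NASCAR's or HP's actual pipeline: the hashtag list, the `count_tweets_per_minute` helper, and the sample tweets are all invented for the example, and it assumes tweets arrive as (timestamp, text) pairs.

```python
from collections import Counter
from datetime import datetime

# Hypothetical hashtag list; a real deployment would track its own set.
NASCAR_HASHTAGS = {"#nascar", "#sprintcup"}

def count_tweets_per_minute(tweets, hashtags=NASCAR_HASHTAGS):
    """Count tweets that mention a tracked hashtag, bucketed by minute.

    `tweets` is an iterable of (timestamp, text) pairs.
    """
    counts = Counter()
    for timestamp, text in tweets:
        # Lowercase each word and strip trailing punctuation before matching.
        words = {word.lower().rstrip(".,!?") for word in text.split()}
        if words & hashtags:
            # Truncate the timestamp to the start of its one-minute slice.
            bucket = timestamp.replace(second=0, microsecond=0)
            counts[bucket] += 1
    return counts

# Invented sample data, for illustration only.
sample = [
    (datetime(2014, 6, 22, 15, 0, 5), "Green flag! #NASCAR"),
    (datetime(2014, 6, 22, 15, 0, 40), "What a pass #nascar"),
    (datetime(2014, 6, 22, 15, 1, 10), "Lunch time"),
    (datetime(2014, 6, 22, 15, 1, 30), "Caution is out #SprintCup!"),
]
per_minute = count_tweets_per_minute(sample)
for minute, count in sorted(per_minute.items()):
    print(minute.strftime("%H:%M"), count)
```

Each (minute, count) pair is one point on the engagement chart. Changing the time slice only means truncating the timestamp differently.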
But wait, are the fans happy or mad? We have to look at the content of the tweets, and that means sentiment analysis. We need to attach sentiment to each tweet so that we can gauge overall sentiment. Now the real problem: tweets are, by nature, short. They also are written in shorthand, and use colloquial language. So now we need natural language processing on what is essentially slang. We have two factors that we can gauge throughout the race: engagement level and sentiment. That dashboard is getting more interesting!
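To make "attach sentiment to each tweet" concrete, here is a deliberately naive sketch that scores tweets against a tiny word list. The lexicon and function names are made up for this example. A word list like this is exactly the kind of approach that breaks down on slang and shorthand, which is why real systems lean on much larger lexicons or trained language models.

```python
import re

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"love", "great", "awesome", "win"}
NEGATIVE = {"hate", "boring", "awful", "wreck"}

def score_tweet(text):
    """Return the positive-word count minus the negative-word count."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

def average_sentiment(texts):
    """Average the per-tweet scores; returns 0.0 for an empty batch."""
    scores = [score_tweet(text) for text in texts]
    return sum(scores) / len(scores) if scores else 0.0

# Invented sample tweets.
tweets = [
    "Love this race, awesome finish!",
    "So boring, hate these cautions",
    "Pit stop now",
]
print([score_tweet(text) for text in tweets])
print(average_sentiment(tweets))
```

Averaging the scores for each time slice gives the second dashboard line, overall sentiment, alongside the raw tweet count.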
Here is a strange and related observation: did you know that the time spent during a hockey game where a Zamboni cleans the ice is one of the most heavily tweeted parts of the game? PEOPLE LOVE THE ZAMBONI.
Anyway, how does this relate to fan engagement? Well, let’s say that it starts raining heavily during a race, and NASCAR decides to pull the vehicles off the track. We now have a problem and an opportunity: will the home viewers check out until the rain stops? How do we keep them engaged during the break? Well, we could start by looking at that dashboard and see what the most heavily talked about parts of the race were, then queue up the commentators and video to go over those bits. We could poll the audience and have them tweet their favorite moment, then watch in real time as we see the results from Twitter. For that we will have to categorize and cluster keywords from the tweets in real time.
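As a rough idea of what the favorite-moment poll involves, the sketch below tallies the most common keywords across fan replies. Note that it swaps in plain keyword counting for true categorization and clustering, which would group related terms together. The stopword list, the `top_keywords` helper, and the replies are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept short for the example.
STOPWORDS = {"the", "a", "was", "my", "that", "on", "of", "is", "so", "in"}

def top_keywords(texts, n=3):
    """Return the n keywords that appear in the most tweets."""
    counts = Counter()
    for text in texts:
        # Use a set so each keyword counts at most once per tweet.
        words = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(words - STOPWORDS)
    return counts.most_common(n)

# Invented sample replies.
replies = [
    "The restart on lap 40 was wild",
    "That restart!",
    "Lap 40 pileup",
    "My favorite: the restart",
]
print(top_keywords(replies, 2))
```

Counting each keyword once per tweet keeps a single enthusiastic fan from skewing the tally. A production system would also need to run this continuously over a sliding window rather than over a fixed list.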
There is much more to this use case, but suffice it to say that NASCAR also collects data from media outlets in print, radio, and TV, and adds them into the mix. That means scanning video and audio for keywords and content, just like the tweets.
The data collected by NASCAR can then be used by its sponsors, who have their own data, likely in a more traditional data warehouse. Here are a few of the things NASCAR and their sponsors are doing with this system:
- Race teams can gauge fan conversation and reaction to a new paint scheme for one of its cars to decide whether to alter it before future races.
- The Charlotte Motor Speedway is tracking conversations and levels of interaction about one of its recent ticket promotions.
- A sponsor is following track response and media coverage about a new marketing campaign.
-List Credit: CIO Insight
What has HP done to make this easier?
We covered a lot of ground in that one use case. We needed access to non-traditional data sources like Twitter, access to traditional data sources like an EDW, sentiment analysis, natural language processing, audio text retrieval, video frame recognition, and time series functions to slice up the data. Throw in some pattern-matching techniques and probabilistic modeling too. Then connect all that data to some real-time dashboards using standard SQL technologies and tools. That’s quite a laundry list.
HP has all of the technologies needed to implement this solution for NASCAR. We created a platform that can store and analyze 100% of your data. Structured, unstructured, semi-structured, multi-structured, human, machine, or business data: we can store it and analyze it. The latter part is the interesting one. It’s trivial to set up a Hadoop cluster and store your EDW, web logs, and tweets from the Twitter firehose on it. But Hadoop doesn’t magically know how to parse emails, databases, weblog data, or anything else. That’s on you. So is stitching those data sources together, running analytics on them, and hooking all that up to a sensible user interface. Of course, that’s what we do at HP. We have even moved this technology onto the cloud, to make development and testing of these solutions quick and easy. Take a look at Haven on Demand!
What should you be asking yourself?
First, do you understand all of the types of data involved in your industry? Outside of your EDW, how do you interact with your customers, vendors, sponsors, or investors? How can you collect that data and get it into an analytics system? Does your data give you a competitive advantage, or is it just sitting in cold storage? What other data sources do you need in order to make innovative products and services? How do you join it all together using modern data science techniques, while using common data languages like SQL?
These are non-trivial questions. Sometimes just knowing what you have is a science project in itself (it doesn’t have to be, we actually have products for that). Many people assume that data cannot be analyzed unless it is lumped all together in one place, like a Hadoop cluster or an EDW. The good news is that it isn’t necessary in most cases. There are likely cases where you can optimize data processing by moving data into a high-performance data store, but much of your data can be analyzed right where it is. We have been helping customers solve these problems, and we would be delighted to help you as well.
This is the first in a series of three articles. The next article deals with how location data from cell phones and social media is creating huge new opportunities for those with the means to analyze it. The third article will deal with machine data, and the issues with dealing with the Internet of Things at scale.
Thanks for reading!