Author Archive

Better Together

Assembling puzzles

Just like peanut butter and chocolate, the mix of several flavors of data is much more interesting and useful than just one. At HP we classify types of data into three categories:

Human Data

Human data is content created by people as opposed to machines, like social media posts, videos, audio, emails, spreadsheets, blogs, and Wikipedia. This data is hard to analyze because it is written in natural language, does not conform to a particular structure, and lives in places that are not particularly easy to access. Because human data lacks traditional structure, we can’t just pull it straight into a data warehouse (nor should we want to). If you want to take full advantage of human data, you must do two things: extract metadata and textual content, and extract meaning. These are completely different things. I can easily write a program to extract keywords and text from PDFs and use them for a simple search engine. But unless I understand how that PDF relates to the millions of other documents in my business, I cannot do much more than that simple search. Plus, how can I extract information from a video? What about audio recordings from your customer service desk? Sentiment from a YouTube video review of your product and the related comments? These are all very valuable, and not particularly easy to analyze.

Machine Data

Machine data is data produced by machines or for machines, like sensor data from smart meters, wearable technology, and web logs from your website. This category of data is growing exponentially faster than human or business data, and the size of the data is the main driver behind technologies like the Hadoop Distributed File System (HDFS). If I asked you how much data you have today versus 5 years ago, you might say 10 times as much. (If I asked you how many new customers you have today vs. 5 years ago, I would hope you’d say 10 times as many as well!) If you do indeed have 10x more data today, it’s because most of your new data is machine data. Machine data is growing so fast that it has spawned a number of new technologies to store and analyze it, both open-source and proprietary. Understanding what these technologies do, and what they do NOT do, should be on your to-do list right now! (If you want help, feel free to contact me.)

Business Data

Business data is data created by businesses to help them run the business. This includes data in your data warehouse, as well as less centralized data like spreadsheets. Think your data warehouse solution has all of your business data? Just for fun, think about how much of your business is run through Excel spreadsheets. If you are lucky, those spreadsheets are sitting in a SharePoint space somewhere, and not just on employee desktops. And if they are on people’s desktops, hopefully they’re being backed up. Scary that you don’t have that information indexed and searchable, isn’t it?

So now that you have an idea of the types of data out there, what can you do with it? A picture is worth a thousand words, so let’s start off with a picture and a story.

Use Case: NASCAR

First, watch this video. When you think about NASCAR, you think about fast cars flying around the track, smacking into each other as they jockey for position. What you might not realize is that everything in NASCAR comes back to sponsorship. A NASCAR race is essentially a collection of 200mph billboards. Take a look at this picture:


You are looking at 3-time Sprint Cup champion Tony Stewart at Infineon Raceway. First, notice that the car is an advertisement for a number of different companies. The race is called the Sprint Cup. The raceway is Infineon Raceway. NASCAR is not just about racing!

“The NASCAR ecosystem is a huge one involving fans, race teams, owners, drivers, racetracks, promoters, sponsors, advertisers, media and many more.”
– Sean Doherty, Director of Digital Engagement and Integrated Marketing Communications at NASCAR (credit CIO Insight).

NASCAR is a vehicle for advertising, as much as it is advertising for vehicles. Of course advertisers want to maximize viewers, because that is ultimately what sponsors want: people looking at their logo, or viewing their ads during the commercial break.

NASCAR realizes that its success is all about the fan base. But the majority of that fan base is sitting at home, far from the action. How to engage them? Putting aside creepy ideas like taking over video feeds from an Xbox Kinect, there are plenty of ways that fans publicly interact. The most obvious one: they tweet about the action in real time. They even tweet during the commercials, about the commercials. So now we have two things we can monitor: the number of tweets at any given time during the race, and the content of the tweets. Counting tweets is easy: just pick a time slice like 1 minute, count tweets that include NASCAR-related hashtags in that time slice, and put them up on a dashboard. TA-DA! You now have one indicator of engagement.
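The tweet-counting step described above can be sketched in a few lines. This is a toy illustration, not NASCAR's actual pipeline; the tweet format and hashtag list are assumptions for the example:

```python
from collections import Counter
from datetime import datetime

# Assumed set of NASCAR-related hashtags, for illustration only.
HASHTAGS = {"#nascar", "#sprintcup"}

def engagement_counts(tweets):
    """Bucket tweets into one-minute slices and count those with
    a matching hashtag. tweets: iterable of (datetime, text) pairs."""
    counts = Counter()
    for ts, text in tweets:
        words = {w.lower() for w in text.split()}
        if words & HASHTAGS:
            # Truncate the timestamp to the minute to form the bucket.
            counts[ts.replace(second=0, microsecond=0)] += 1
    return counts

tweets = [
    (datetime(2015, 6, 28, 14, 0, 12), "Stewart takes the lead! #NASCAR"),
    (datetime(2015, 6, 28, 14, 0, 45), "What a pass #SprintCup"),
    (datetime(2015, 6, 28, 14, 1, 3), "Grabbing a snack"),
]
print(engagement_counts(tweets))  # two tagged tweets land in the 14:00 bucket
```

Feed each minute's count to a dashboard and you have a live engagement curve.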

But wait, are the fans happy or mad? We have to look at the content of the tweets, and that means sentiment analysis. We need to attach sentiment to each tweet so that we can gauge overall sentiment. Now the real problem: tweets are, by nature, short. They also are written in shorthand, and use colloquial language. So now we need natural language processing on what is essentially slang. We have two factors that we can gauge throughout the race: engagement level and sentiment. That dashboard is getting more interesting!


Here is a strange and related observation: did you know that the time spent during a hockey game where a Zamboni cleans the ice is one of the most heavily tweeted parts of the game? PEOPLE LOVE THE ZAMBONI.

Anyway, how does this relate to fan engagement? Well, let’s say that it starts raining heavily during a race, and NASCAR decides to pull the vehicles off the track. We now have a problem and an opportunity: will the home viewers check out until the rain stops? How do we keep them engaged during the break? Well, we could start by looking at that dashboard and see what the most heavily talked about parts of the race were, then queue up the commentators and video to go over those bits. We could poll the audience and have them tweet their favorite moment, then watch in real time as we see the results from Twitter. For that we will have to categorize and cluster keywords from the tweets in real time.
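The keyword step above can be sketched with simple frequency counting. This is a rough stand-in for real clustering; the stopword list and tokenization are assumed simplifications:

```python
from collections import Counter

# Assumed minimal stopword list, for illustration only.
STOPWORDS = {"the", "a", "is", "was", "this", "that", "so", "and", "by", "what"}

def top_keywords(tweets, n=3):
    """Return the n most frequent non-stopword keywords across tweets."""
    counts = Counter()
    for text in tweets:
        for w in text.lower().split():
            w = w.strip("#!.,?")
            if w and w not in STOPWORDS:
                counts[w] += 1
    return counts.most_common(n)

tweets = [
    "Stewart takes the lead #NASCAR",
    "What a move by Stewart!",
    "Stewart is flying today",
]
print(top_keywords(tweets))  # "stewart" dominates this window
```

Run this over a sliding window of tweets and the top keywords tell you which moments fans are talking about right now.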

There is much more to this use case, but suffice it to say that NASCAR also collects data from media outlets in print, radio, and TV, and adds them into the mix. That means scanning video and audio for keywords and content, just like the tweets.

The data collected by NASCAR can then be used by its sponsors, who have their own data, likely in a more traditional data warehouse. Here are a few of the things NASCAR and their sponsors are doing with this system:

  • Race teams can gauge fan conversation and reaction to a new paint scheme for one of their cars to decide whether to alter it before future races.
  • The Charlotte Motor Speedway is tracking conversations and levels of interaction about one of its recent ticket promotions.
  • A sponsor is following track response and media coverage about a new marketing campaign.

-List Credit: CIO Insight

What has HP done to make this easier?

We covered a lot of ground in that one use case. We needed access to non-traditional data sources like Twitter, access to traditional data sources like an EDW, sentiment analysis, natural language processing, audio text retrieval, video frame recognition, and time series functions to slice up the data. Throw in some pattern-matching techniques and probabilistic modeling too. Then connect all that data to some real-time dashboards using standard SQL technologies and tools. That’s quite a laundry list.

HP has all of the technologies needed to implement this solution for NASCAR. We created a platform that can store and analyze 100% of your data. Structured, unstructured, semi-structured, multi-structured, human, machine, or business data: we can store it and analyze it. The latter part is the interesting one. It’s trivial to set up a Hadoop cluster and store your EDW, web logs, and tweets from the Twitter firehose on there. But Hadoop doesn’t magically know how to parse emails, databases, web log data, or anything else. That’s on you. So is stitching those data sources together, running analytics on them, and hooking all that up to a sensible user interface. Of course, that’s what we do at HP. We have even moved this technology onto the cloud, to make development and testing of these solutions quick and easy. Take a look at Haven on Demand!

What should you be asking yourself?

First, do you understand all of the types of data involved in your industry? Outside of your EDW, how do you interact with your customers, vendors, sponsors, or investors? How can you collect that data and get it into an analytics system? Does your data give you a competitive advantage, or is it just sitting in cold storage? What other data sources do you need in order to make innovative products and services? How do you join it all together using modern data science techniques, while using common data languages like SQL?

These are non-trivial questions. Sometimes just knowing what you have is a science project in itself (it doesn’t have to be, we actually have products for that). Many people assume that data cannot be analyzed unless it is lumped all together in one place, like a Hadoop cluster or an EDW. The good news is that it isn’t necessary in most cases. There are likely cases where you can optimize data processing by moving data into a high-performance data store, but much of your data can be analyzed right where it is. We have been helping customers solve these problems, and we would be delighted to help you as well.

Author Note

This is the first in a series of three articles. The next article deals with how location data from cell phones and social media is creating huge new opportunities for those with the means to analyze it. The third article will deal with machine data, and the issues with dealing with the Internet of Things at scale.

Thanks for reading!

Vertica in Private Cloud Deployments


Ask any CIO what their top priorities are, and cloud deployment is likely to be at the top of the list. While the reasons for deploying internal applications on the cloud are beyond the scope of this post, it is valid to ask why private cloud deployment is a viable option for a Big Data implementation, and what impact it will have on the deployment, maintenance, and performance of the system.

A strong argument for private clouds is the savings in time and capital provided by consolidating all applications onto a single, industry-standard server configuration. This enables fast procurement procedures and decreases the time to scale out the infrastructure.

  • Vertica runs on industry-standard x86 hardware, and works with all DAS, SAN and NAS solutions in the marketplace.
  • Vertica is a massively parallel processing (MPP) database, scaling out horizontally through the addition of virtual servers rather than increased hardware per virtual server.

All modern virtualization frameworks provide the ability to quickly deploy a pre-configured VM, or template, into the system. This template encodes the results of tuning, security audits and vendor best practices, ensuring that the new virtual server will work seamlessly in the new environment, and reduce support costs for both the customer and the supplying vendor.

  • Vertica can assist you with building your own templates.
  • Every node in a Vertica database provides the same functionality, so only one template is needed.
  • The Vertica database will remotely install on new nodes, rebalance the data throughout the cluster, and bring the new nodes on-line automatically when they are ready.

Another benefit of virtualization is the ease of maintenance. Is a server sending warning signals? Migrate its workload to another virtual server, pull the malfunctioning hardware, and replace it with new hardware.
In addition to the built-in migration services provided by virtualization vendors, Vertica provides simple migration facilities to replace a faulty node with a fresh node. Because Vertica does not use specialized nodes, any available virtual server in the server pool can be used.

There is no free lunch, and the price for these improvements in procurement, deployment, and maintenance is slower execution on a given hardware configuration. Most cloud deployments see 15% to 30% degradation, depending on the application’s profile.

HP Vertica was built for virtualization. Virtualization’s weaknesses are offset by Vertica’s strengths. For example, one of the weaknesses of a virtual infrastructure is reduced I/O compared to large DAS arrays. Vertica employs aggressive compression routines to minimize the size of the data on disk, greatly reducing the I/O requirements of the storage network.

Columnar databases have a natural I/O advantage. In a column store, data for each column in a table is stored separately, so only the data needed to answer the question must be scanned, rather than the full row. Especially with wide tables, Vertica only needs to materialize columns specified in the query.
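A toy sketch makes the columnar advantage concrete. This is not Vertica's internals, just an illustration of the principle: when each column lives in its own array, a query touching two columns never reads the other columns at all.

```python
# Row store: every row is a complete record, so scanning any field
# means reading entire rows.
row_store = [
    {"driver": "Stewart", "laps": 110, "speed": 185.2, "sponsor": "Mobil 1"},
    {"driver": "Gordon", "laps": 110, "speed": 183.9, "sponsor": "Axalta"},
]

# Column store: the same table, stored column by column.
col_store = {
    "driver": ["Stewart", "Gordon"],
    "laps": [110, 110],
    "speed": [185.2, 183.9],
    "sponsor": ["Mobil 1", "Axalta"],
}

# SELECT driver, speed FROM race_stats:
# only two of the four columns are ever touched.
result = list(zip(col_store["driver"], col_store["speed"]))
print(result)  # [('Stewart', 185.2), ('Gordon', 183.9)]
```

On a wide table with hundreds of columns, skipping the unused columns is a dramatic I/O saving.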

Due to its unique architecture, Vertica is CPU-bound rather than memory- or I/O-bound. Most virtual infrastructures are compute-heavy, a perfect match for Vertica.

Vertica can assist you with building your own templates. We can provide best practices, health checks, and other services to ensure that your configuration is optimized and fully supported.

How Does Vertica Enhance Private Cloud Deployments?

Vertica offers additional improvements for cloud deployments above and beyond those provided by your virtualization product.

Elastic Cluster—You can scale your cluster up or down to meet the needs of your database. The most common case is to add nodes to your database cluster to accommodate more data and provide better query performance. However, you can scale down your cluster if you find that it is overprovisioned or if you need to divert hardware for other uses. Visit our online documentation for additional information on Elastic Clusters.

Tiered Storage Support—Most virtual infrastructures make use of storage pools. The idea is to have pools of disks for different workload profiles: SSDs or fast hard drives for high-performance applications, and slower disks for less critical workloads. Visit our online documentation for additional information on Storage Locations.

Fast Backup and Restore—Vertica stores data in highly compressed files on disk. When doing a backup or restore, Vertica moves these highly compressed files over the network to the backup storage location. This provides an immense reduction in bandwidth on the storage networks. Visit our online documentation for additional information on Vertica’s backup and recovery features.

Fast Data Copying—To make these activities simple and fast, Vertica employs the same mechanisms for moving tables between databases as it does for backup and recovery: highly compressed data files are copied between the databases. Each node in the Vertica cluster sends copies of its data to the remote database in parallel, enabling movement of several terabytes per minute in large clusters. Visit our online documentation for additional information on fast data copy.

Final Thoughts

While slower performance may hinder some cloud-based deployments, the HP Vertica Analytics Platform implements a number of design features and architectural decisions that complement today’s private cloud environments. Learn more about how HP Vertica handles data faster and more reliably than any other database within public and virtualized enterprise cloud environments.

Get Started With Vertica Today

Subscribe to Vertica