Archive for the ‘Uncategorized’ Category

Couchbase and HP Haven for Customer Analytics

Customer analytics is vital for just about any industry or market segment. By understanding how your customers behave and interact, you can learn from that dialogue and act on it. To fully understand your customers, you need to consider web, finance, CRM, and even geospatial and sentiment data.

Today, HP Software’s Big Data group is excited to announce a partnership with Couchbase to offer more powerful data lifecycle analytics for customer data. Couchbase is a high-performance, distributed NoSQL database platform for web-scale applications. By extending Couchbase with HP’s Haven Big Data Platform, we support one of HP’s key initiatives: empowering the data-driven enterprise.

HP Haven enables you to harness 100% of your human, machine, and business data – whether it’s in your data center or in the cloud – at virtually limitless scale. Powered by HP Vertica and HP IDOL, HP Haven offers an easier and proven method to analyze unstructured data and a massively scalable SQL-based analytical database for structured information.

HP Vertica offers full ANSI SQL, distributed R for predictive analytics, and advanced analytics using custom logic and user-defined extensions. HP IDOL indexes, searches, and analyzes human information at scale and in context. IDOL processes hundreds of file types including tweets, email, audio, images, and streaming video. This partnership enables you to leverage another great data source — dynamic Couchbase data structures. By combining financial, operational and performance data with customer profiles from Couchbase, you can create and deploy analytical apps at huge scale.

Use cases include:

  • Internet of Things – Ingest and manage massive volumes of sensor data from connected cars, smart meters, and other intelligent devices, and apply advanced analytics across mixed workloads for predictive maintenance, consumable resupply, and more.
  • Customer 360 — Aggregate multiple and changing data types from disparate data sources and analyze and visualize the data to understand and prevent customer churn, reduce support calls, and improve brand recognition.
  • Fraud Detection – Meet the real-time scale requirements of financial services organizations, using fraud detection and risk analytics to reduce profit loss, minimize financial exposure, and comply with regulations.

Together, Couchbase and HP Haven handle massive quantities of big data from virtually any source, at virtually limitless scale.

Evaluate Couchbase Server today:

Test drive HP Vertica Community Edition, a freely available version of HP Vertica for up to 1 TB of data across 3 nodes:

Check out HP IDOL’s core APIs for developing next-generation applications on-demand:

New Release of “DbVisualizer Free for Vertica” Now Available via the HP Haven Marketplace

A new version of DbVisualizer Free for Vertica is now available!

Expanding on features normally reserved for the Pro version, DbVis has added the “Connection Keep Alive” feature to DbVisualizer Free for Vertica. This feature issues a simple “ping” to the database server at a specified interval, preventing the time-outs and lost connections that can result from sitting idle in DbVisualizer.
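The mechanism behind a keep-alive is simple enough to sketch. The snippet below is a minimal, hypothetical illustration in Python (not DbVisualizer's actual implementation): a background thread invokes a lightweight `ping` callable, standing in for a trivial query such as `SELECT 1`, at a fixed interval until told to stop.

```python
import threading
import time

def keep_alive(ping, interval_s, stop_event):
    # Fire the ping every interval_s seconds; stop_event.wait doubles
    # as both the sleep and the shutdown signal.
    while not stop_event.wait(interval_s):
        ping()

# Demo: a list append stands in for a real database round trip.
calls = []
stop = threading.Event()
worker = threading.Thread(
    target=keep_alive, args=(lambda: calls.append(time.time()), 0.05, stop)
)
worker.start()
time.sleep(0.3)   # simulate an idle session
stop.set()
worker.join()
# While the "session" sat idle, several pings kept the connection warm.
```

In a real client, the ping callable would execute a cheap statement over the existing connection, resetting any server-side idle timer.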

In addition to supporting Vertica Flex Tables, UDFs, and projections, other new features in this version include “Editor Templates,” which let you easily insert text that you often use in SQL statements, and a “Master Password” option, which improves the encryption of all your saved passwords.

This version also includes some bug fixes.

To experience these and other new features of DbVisualizer Free for Vertica, simply visit the HP Haven Marketplace to download the latest version.

Whose Side Are You On? Using HP Vertica Pulse with College Basketball

Want to know what people across the nation are saying about your college basketball team? Want to know when and where those opinions change? You can do this and more with HP Vertica Pulse.

Visit our new blog to see how HP data scientist Manolo Garcia-Solaco used HP Vertica Pulse to analyze the sentiment of tweets when the Wisconsin Badgers faced the Duke Blue Devils in the finals of the NCAA basketball championship.

Better Together

Assembling puzzles

Just like peanut butter and chocolate, the mix of several flavors of data is much more interesting and useful than just one. At HP we classify types of data into three categories:

Human Data

Human data is content created by people rather than machines, like social media posts, videos, audio, emails, spreadsheets, blogs, and Wikipedia. This data is hard to analyze, as it is written in natural language, does not conform to a particular structure, and lives in places that are not particularly easy to access. Because human data lacks traditional structure, we can’t just pull it straight into a data warehouse (nor should we want to). If you want to take full advantage of human data, you must do two things: extract metadata and textual content, and extract meaning. These are completely different things. I can easily write a program to extract keywords and text from PDFs and use them for a simple search engine. But unless I understand how that PDF relates to the millions of other documents in my business, I cannot do much more than that simple search. Plus, how can I extract information from a video? What about audio recordings from your customer service desk? Sentiment from a YouTube video review of your product and the related comments? These are all very valuable, and not particularly easy to analyze.
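As a concrete illustration of the easy half of that problem, here is a minimal keyword index in Python. It is a sketch that assumes the text has already been pulled out of the PDFs by some extraction library; it can find documents, but it understands nothing about how they relate to each other.

```python
from collections import defaultdict

def build_index(docs):
    """docs maps doc_id -> extracted text. Returns word -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(doc_id)
    return index

def search(index, query):
    """AND-search: return the docs that contain every query term."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

# Invented sample documents, standing in for extracted PDF text.
docs = {
    "a.pdf": "Quarterly revenue report for the northeast region",
    "b.pdf": "Revenue forecast model, draft",
}
index = build_index(docs)
print(search(index, "revenue report"))  # {'a.pdf'}
print(search(index, "revenue"))         # both documents match
```

Everything past this, such as ranking, relationships between documents, video, or audio, is where the hard work begins.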

Machine Data

Machine Data is data produced by machines or for machines, like sensor data from smart meters, wearable technology, and weblogs from your web site. This category of data is growing exponentially faster than human or business data, and the size of the data is the main driver behind technologies like the Hadoop Distributed File System (HDFS). If I asked you how much data you have today versus 5 years ago, you might say 10 times as much. (If I asked you how many new customers you have today vs. 5 years ago, I would hope you’d say 10 times as many as well!) If you do indeed have 10x more data today, it’s because most of your new data is machine data. Machine data is growing so fast that it has spawned a number of new technologies to store and analyze it, from both open-source and proprietary sources. Understanding what these technologies do, and what they do NOT do, should be on your to-do list right now! (If you want help, feel free to contact me.)

Business Data

Business data is data created by businesses to help them run the business. This includes data in your data warehouse, as well as less centralized data, like data found in spreadsheets. Think your data warehouse solution has all of your business data? Just for fun, think about how much of your business is run through Excel spreadsheets. If you are lucky, those spreadsheets are sitting in a SharePoint site somewhere, and not just on employee desktops. And if they are on people’s desktops, hopefully they’re being backed up. Scary that you don’t have that information indexed and searchable, isn’t it?

So now that you have an idea of the types of data out there, what can you do with it? A picture is worth a thousand words, so let’s start off with a picture and a story.

Use Case: NASCAR

First, watch this video. When you think about NASCAR, you think about fast cars flying around the track, smacking into each other as they jockey for position. What you might not realize is that everything in NASCAR comes back to sponsorship. A NASCAR race is essentially a collection of 200mph billboards. Take a look at this picture:


You are looking at 3-time Sprint Cup champion Tony Stewart at Infineon Raceway. First, notice that the car is an advertisement for a number of different companies. The race series is called the Sprint Cup. The raceway is Infineon Raceway. NASCAR is not just about racing!

“The NASCAR ecosystem is a huge one involving fans, race teams, owners, drivers, racetracks, promoters, sponsors, advertisers, media and many more.”
– Sean Doherty, Director of Digital Engagement and Integrated Marketing Communications at NASCAR (credit CIO Insight).

NASCAR is a vehicle for advertising, as much as it is advertising for vehicles. Of course advertisers want to maximize viewers, because that is ultimately what sponsors want: people looking at their logo, or viewing their ads during the commercial break.

NASCAR realizes that its success is all about the fan base. But the majority of that fan base is sitting at home, far from the action. How to engage them? Putting aside creepy ideas like taking over video feeds from an Xbox Kinect, there are plenty of ways that fans publicly interact. The most obvious one: they tweet about the action in real time. They even tweet during the commercials, about the commercials. So now we have two things we can monitor: the number of tweets at any given time during the race, and the content of the tweets. Counting tweets is easy: just pick a time slice like 1 minute, count tweets that include NASCAR-related hashtags in that timeslice, and put them up on a dashboard. TA-DA! You now have one indicator of engagement.
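That counting step can be sketched in a few lines of Python. The hashtags and timestamps below are invented for illustration; a real pipeline would read from the Twitter API and feed a live dashboard.

```python
from collections import Counter
from datetime import datetime

TRACKED = {"#nascar", "#sprintcup"}  # hypothetical hashtags to follow

def minute_of(ts):
    """Bucket a timestamp into its 1-minute time slice."""
    return ts.strftime("%Y-%m-%d %H:%M")

def tweets_per_minute(tweets):
    """tweets is an iterable of (timestamp, text) pairs; returns a
    per-minute count of tweets mentioning a tracked hashtag."""
    counts = Counter()
    for ts, text in tweets:
        if TRACKED & set(text.lower().split()):
            counts[minute_of(ts)] += 1
    return counts

sample = [
    (datetime(2015, 3, 1, 14, 0, 5),  "Green flag! #NASCAR"),
    (datetime(2015, 3, 1, 14, 0, 40), "What a restart #nascar"),
    (datetime(2015, 3, 1, 14, 1, 10), "Pit stop drama #NASCAR"),
    (datetime(2015, 3, 1, 14, 1, 30), "lunch was good"),  # ignored
]
print(dict(tweets_per_minute(sample)))
# {'2015-03-01 14:00': 2, '2015-03-01 14:01': 1}
```

Plot those per-minute counts over the course of the race and you have the engagement curve for the dashboard.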

But wait, are the fans happy or mad? We have to look at the content of the tweets, and that means sentiment analysis. We need to attach sentiment to each tweet so that we can gauge overall sentiment. Now the real problem: tweets are, by nature, short. They also are written in shorthand, and use colloquial language. So now we need natural language processing on what is essentially slang. We have two factors that we can gauge throughout the race: engagement level and sentiment. That dashboard is getting more interesting!
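To make the idea concrete, here is a toy lexicon-based scorer in Python. A production system such as IDOL uses statistical NLP models trained on colloquial text; a hand-picked word list like this one is only a sketch of the interface, not of the technique's real difficulty.

```python
# Tiny, invented sentiment lexicons for illustration only.
POSITIVE = {"love", "great", "awesome", "amazing", "win"}
NEGATIVE = {"hate", "terrible", "boring", "awful", "crash"}

def tag_sentiment(tweet):
    """Return 'positive', 'negative', or 'neutral' for one tweet."""
    words = {w.strip("#!,.?") for w in tweet.lower().split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(tag_sentiment("Love that restart, great racing!"))  # positive
print(tag_sentiment("This rain delay is boring"))         # negative
print(tag_sentiment("Caution flag on lap 42"))            # neutral
```

Tag every tweet this way and the dashboard can track the running balance of positive versus negative chatter alongside raw volume.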


Here is a strange and related observation: did you know that the time spent during a hockey game where a Zamboni cleans the ice is one of the most heavily tweeted parts of the game? PEOPLE LOVE THE ZAMBONI.

Anyway, how does this relate to fan engagement? Well, let’s say that it starts raining heavily during a race, and NASCAR decides to pull the vehicles off the track. We now have a problem and an opportunity: will the home viewers check out until the rain stops? How do we keep them engaged during the break? Well, we could start by looking at that dashboard to see which parts of the race were most heavily talked about, then cue up the commentators and video to go over those bits. We could poll the audience and have them tweet their favorite moment, then watch in real time as the results come in from Twitter. For that, we will have to categorize and cluster keywords from the tweets in real time.
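At its simplest, that "favorite moment" poll boils down to keyword counting over a window of tweets. The Python below is a minimal stand-in for real clustering, with an invented stop-word list and sample tweets.

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "was", "that", "and", "of", "to", "in", "what"}

def top_keywords(tweets, n=3):
    """Rank the most-mentioned keywords across a batch of tweets."""
    counts = Counter()
    for text in tweets:
        for raw in text.lower().split():
            word = raw.strip("#!,.?")
            if word and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)

window = [
    "That last-lap pass was incredible #NASCAR",
    "Incredible finish, what a pass",
    "Best pass of the season",
]
print(top_keywords(window, n=2))  # [('pass', 3), ('incredible', 2)]
```

Real clustering would group related terms and phrases rather than count exact words, but the shape of the computation is the same: slice the stream, aggregate, rank, display.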

There is much more to this use case, but suffice it to say that NASCAR also collects data from media outlets in print, radio, and TV, and adds them into the mix. That means scanning video and audio for keywords and content, just like the tweets.

The data collected by NASCAR can then be used by its sponsors, who have their own data, likely in a more traditional data warehouse. Here are a few of the things NASCAR and its sponsors are doing with this system:

  • Race teams can gauge fan conversation and reaction to a new paint scheme for one of their cars to decide whether to alter it before future races.
  • The Charlotte Motor Speedway is tracking conversations and levels of interaction about one of its recent ticket promotions.
  • A sponsor is following track response and media coverage about a new marketing campaign.

List credit: CIO Insight

What has HP done to make this easier?

We covered a lot of ground in that one use case. We needed access to non-traditional data sources like Twitter, access to traditional data sources like an EDW, sentiment analysis, natural language processing, audio text retrieval, video frame recognition, and time series functions to slice up the data. Throw in some pattern-matching techniques and probabilistic modeling too. Then connect all that data to some real-time dashboards using standard SQL technologies and tools. That’s quite a laundry list.

HP has all of the technologies needed to implement this solution for NASCAR. We created a platform that can store and analyze 100% of your data. Structured, unstructured, semi-structured, multi-structured, human, machine, or business data: we can store it and analyze it. The latter part is the interesting one. It’s trivial to set up a Hadoop cluster and store your EDW, web logs, and tweets from the Twitter Firehose there. But Hadoop doesn’t magically know how to parse emails, databases, weblog data, or anything else. That’s on you. So is stitching those data sources together, running analytics on them, and hooking all that up to a sensible user interface. Of course, that’s what we do at HP. We have even moved this technology onto the cloud, to make development and testing of these solutions quick and easy. Take a look at Haven on Demand!

What should you be asking yourself?

First, do you understand all of the types of data involved in your industry? Outside of your EDW, how do you interact with your customers, vendors, sponsors, or investors? How can you collect that data and get it into an analytics system? Does your data give you a competitive advantage, or is it just sitting in cold storage? What other data sources do you need in order to make innovative products and services? How do you join it all together using modern data science techniques, while using common data languages like SQL?

These are non-trivial questions. Sometimes just knowing what you have is a science project in itself (it doesn’t have to be; we actually have products for that). Many people assume that data cannot be analyzed unless it is all lumped together in one place, like a Hadoop cluster or an EDW. The good news is that this isn’t necessary in most cases. There are cases where you can optimize processing by moving data into a high-performance data store, but much of your data can be analyzed right where it is. We have been helping customers solve these problems, and we would be delighted to help you as well.

Author Note

This is the first in a series of three articles. The next article covers how location data from cell phones and social media is creating huge new opportunities for those with the means to analyze it. The third article will cover machine data and the challenges of handling the Internet of Things at scale.

Thanks for reading!

Enter to Win the March Data Madness Machine Learning Mania Contest!

Slam Data Dunk

If you’re a frequent visitor to our blog, you may recall reading about the March Data Madness Sentiment Tracker that we demonstrated at the MIT Sloan Sports Analytics Conference just before the 2013 NCAA Men’s Basketball “March Madness” Tournament.

The demonstration focused on tracking the “sentiment of the crowd” by collecting and analyzing roughly a half million tweets with the HP Vertica and HP IDOL engines. These results were displayed with a Tibco Spotfire dashboard and offered great conversation fodder at the event:

  • Volume of tweets by team
  • Volume of tweets by player
  • Positive, negative, and neutral sentiment groupings
  • Volume of tweets by U.S. city and by country worldwide
  • Volume of tweets by language (English, French, Spanish, etc.)

We also dug into additional results and continued the conversation around sports analytics with our webinar – The Future of Big Data in Sports – with an impressive roundtable of experts, including STATS and

Join the Machine Learning Mania Competition for a Chance to Win Cash Prizes!

For this year’s tournament, we’re at it again, only we’re offering cash prizes to the data scientist who can use the HP Haven Big Data Platform to accurately predict this year’s winner by sifting through massive amounts of data and applying machine learning and statistical techniques.

You will have access to key HP Haven technologies, including HP Vertica Distributed R, which accelerates your machine learning by running your R models across multiple nodes, vastly reducing execution time and letting you analyze much larger data sets.

The Machine Learning Mania Contest, hosted by Kaggle and sponsored by HP Software’s Big Data Group, enables you to get creative with the data sets that you use to create your statistical models. We will provide you with data covering three decades of historical games, and all participants are encouraged to pull in data from a variety of external sources.

There’s no cost to join, and you can compete to win up to $15,000 in cash prizes – join the Madness today!

Thoughts About HP Vertica for SQL on Hadoop

Et voilà

HP recently announced HP Vertica for SQL on Hadoop. We’ve leveraged our years of experience in big data analytics and opened up our platform to let users tap into the full power of Hadoop. It’s a rich, fast, and enterprise-ready implementation of SQL on Hadoop that we’re very proud to introduce.

We know that you have a choice when it comes to SQL-on-Hadoop engines. There are several on the market for a reason – they are a very powerful way to perform analytics on big data stored in Hadoop using the familiar SQL language. Users can leverage any reporting or analytical tool to analyze and study the data rather than write their own Java or MapReduce code.

However, not all SQL-on-Hadoop engines are created equal. We think HP Vertica for SQL on Hadoop has some very big differences. These include:

  • Platform Agnostic – When you adopt a SQL-on-Hadoop query engine, it may be tied to one distribution of Hadoop. Not so with HP Vertica for SQL on Hadoop: our implementation works with the Hortonworks, Cloudera, and MapR distributions.
  • SQL Completeness – The richer the SQL engine, the wider the range of analytics you can perform without extensive coding and data movement. You get a very rich set of analytical functions with HP Vertica for SQL on Hadoop, which offers enterprise-ready, advanced analytics that support JOINs, complex data types, and other capabilities only available from our SQL-on-Hadoop implementation.
  • Manageability – Tools for managing queries and managing the resources of your cluster are fairly scarce and immature in the Hadoop world. However, with some of the tools we include, you can divide resources among different queries and different types of queries. If unplanned and resource-intensive queries have to be cancelled or temporarily interrupted, they can be.
  • Data Source Transparency – It’s important to be able to query data in common standard storage formats such as Parquet, Avro, and ORC. When you can query native formats, you avoid having to move the data.
  • Path to Optimization – When you need to boost performance, HP Vertica for SQL on Hadoop offers optimizations like compression, columnar storage, and projections.

Nor should you forget that this offering comes from HP Software, so users can take advantage of all the power of our Haven platform for big data. Encompassing proven technologies from HP Software, including Autonomy, Vertica, and ArcSight, Haven enables forward-thinking organizations to make use of virtually all information sources, both inside and outside their four walls, to make better, faster decisions.

Download the report here.

HP Vertica Storage Location for HDFS

Do you find yourself running low on disk space on your HP Vertica database? You could delete older data, but that sacrifices your ability to perform historical queries. You could add new nodes to your cluster or add storage to your existing nodes. However, these options require additional expense.

The HP Vertica Storage Locations for HDFS feature introduced in HP Vertica Version 7.1 offers you a new solution: storing data on an Apache Hadoop cluster. You can use this feature to store data in a Hadoop Distributed File System (HDFS) while still being able to query it through HP Vertica.

Watch this video for an overview of the HP Vertica Storage Locations for HDFS feature and an example of how you can use it to free storage space on your HP Vertica cluster.

For more information about this feature, see the HP Vertica Storage Location for HDFS section of the documentation.

Get Started With Vertica Today

Subscribe to Vertica