Archive for the ‘SQL on Hadoop’ Category

I Love DIY Projects

I love DIY projects. I love watching the YouTube videos, scouring the web for advice from others, and learning new skills. The innovation and creativity that’s out there is amazing! Plus, DIY projects save money, sometimes a lot of money! This past month, we decided to build our own stone patio in the backyard … how hard could that be? Turns out, lifting 5000+ pounds of rock and stone combined with the challenges of grade and water management is a lot harder than it looks!

patch of dirt

Somehow, this experience led me to think about Open Source Software. Quite a leap, isn’t it? But think about it … the innovation and excitement that comes from thousands of smart people working together to create new software is pretty cool. It saves money (well, let’s talk about that later) and it exposes an organization to new thinking in a new era of technology.

Then comes the hard part. First, implementations of Open Source Software like Hadoop do take more expertise than might have been expected, so it really makes sense to engage with a professional Hadoop distribution partner. Which means free isn’t really free anymore. Then the commercial-grade discussion comes into play. Is there enough security and manageability built into the current Hadoop distributions to meet the constantly rising bar in today’s world? I don’t want my company’s logo in the next Big Data breach article. Finally, the underlying infrastructure (kind of like the piping, drainage gravel and paver base that now must be added to my patio project!) starts to expand in ways that might not have been expected.
But does this mean that Open Source projects are a bad idea? Absolutely not! I will never give up the satisfaction (and cost savings!) of DIY projects. But the key is to make the right choices and partner with the right people. HP Software Big Data has a passion for innovation and we love the excitement of the Open Source community; that’s why we were so excited to contribute our recent Distributed R release to Open Source. We want our customers to find value in their Hadoop implementations, so we bring the strongest and most sophisticated SQL on Hadoop offering to the market, with a set of rich analytics that can really uncover the insights in data stored in Hadoop.

After all, as Mike Stonebraker, the founder of HP Vertica and winner of the Turing Award (really the Nobel Prize of Computer Science!), recently said in an interview with Barron’s: “It started out, NoSQL meant, ‘Not SQL.’ Then it became ‘Not only SQL,’ and now I think it means ‘Not-yet-SQL.’” That’s why we at HP are so determined to “speak the language” that our developer and customer community knows and needs. And perhaps most importantly, we continue to develop open APIs and SDKs for our Haven Big Data Platform because we know that the brilliant and passionate developer community (think DIY for analytics!) needs the right tools for their jobs.

HP Software Big Data believes in DIY. We’re a hands-on group with the advantage of structured QA processes, the expertise from more than a decade of analytics purpose-built for Big Data, and the ability to bring data scientists and data migration specialists to any enterprise DIY project through our Enterprise Services Analytics & Data Management practice. Got a DIY project in mind? We’re in!

The Top Five Reasons SQL-on-Hadoop Keeps CIOs Awake at Night

The Elephant and the engineer

Being a part of HP is really an amazing thing – it gives us access to amazing technologies and very bright, hard-working people. But the best part is talking with our customers.

One topic on the mind of many technology leaders today is the “elephant in the room” – Hadoop. From its humble beginnings as a low-cost implementation of mass storage and the Map/Reduce programming framework, it’s become something of a movement. Businesses from Manhattan to Mumbai are quickly discovering that it offers favorable economics for one very specific use case: a very low-cost way to store data of uncertain value. This use case has even acquired a name – the “data lake”.

I first heard the term five years ago, when Vertica was a tiny startup based in Boston. It seemed that a few risk-tolerant businesses in California were trying out this thing called Hadoop as a place to park data that they’d previously been throwing away. Many businesses have been throwing away all but a tiny portion of their data simply because they can’t find a cost-effective place to store it. To these companies, Hadoop was a godsend.

And yet in some key ways, Hadoop is also extremely limited. Technology teams continue to wrestle with extracting value from a Hadoop investment. Their primary complaint? That there is no easy way to explore and ask questions of data stored in Hadoop. Technology teams understand SQL, but Hadoop provides only the most basic SQL support. I’ve even heard stories of entire teams resigning en masse, frustrated that their company has put them in a no-win situation – data everywhere and not a drop to drink.

Variations on the above story have undoubtedly played out at many companies across the globe. The common theme is that, love it or hate it, SQL is one of the core languages for exploration and inquiry of semi-structured and structured data. And most SQL on Hadoop offerings are simply not up to the task. As a result, we now have a gold rush of sorts, with multiple vendors rushing to build SQL on Hadoop solutions. To date, there are at least seven different commercial SQL on Hadoop offerings, and many organizations are learning about the very big differences between these offerings!

In our many conversations with C-level technology executives, we’ve heard a common set of concerns about most SQL on Hadoop options. Some are significant. So, without further ado, here are the top five reasons SQL on Hadoop keeps CIOs awake at night:

5. Is it secure? Really?

The initial appeal of the data lake is that it can be a consolidated store – businesses can place all their data in one place. But that creates huge risk because now…all the data is in one place. That’s why our team has been working diligently on a SQL on Hadoop offering that not only includes core enterprise security features, but also secures data in flight with SSL encryption, integrates with enterprise security systems such as Kerberos, and enforces a column-level access model. If your SQL on Hadoop solution doesn’t offer these features, your data is at risk.
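As a rough illustration of the “data in flight” piece, a client talking to any database over the network would typically wrap its connection in TLS before sending credentials or results. Here is a minimal, generic Python sketch of the client-side setup (this is the standard library’s `ssl` module, not any Vertica-specific API):

```python
import ssl

# Build a TLS context the way a database client driver typically would:
# verify the server's certificate chain and hostname before trusting it.
context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse legacy protocols

# With these defaults, certificate verification is mandatory, so an
# unverified server cannot impersonate the cluster.
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```

The context would then be passed to the driver’s connect call; the point is simply that “secure in flight” means verified, encrypted channels by default, not as an afterthought.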

4. Does it support all the SQL you need?

Technically, SQL on Hadoop has been around for years now in the form of an open source project called Hive. Hive has its own version of SQL called HQL. Hive users frequently complain that HQL only supports a subset of SQL. There are many things you just can’t do. This requires all manner of data flow contortions, as analysts must continually resort to other tools or languages for things that are easily expressed in SQL…if only the Hadoop environment supported it.

This problem remains today, as many of the SQL on Hadoop variants do not support the full range of ANSI SQL. For example, our benchmark team regularly performs tests with the Vertica SQL on Hadoop product to ensure that it meets our standards for quality, stability and performance. One of the test suites we use is the TPC-H benchmark. For those not in the know, TPC-H is an industry standard benchmark with pre-defined SQL, schemas, and data. While our engine runs the full suite of tests, other SQL on Hadoop flavors that we’ve tested are not capable of running the entire workload. In fact, some of them only run 60% of the queries!

3. …And if it runs the SQL, does it run well?

It’s one thing to implement a SQL engine that can parse a bit of SQL and create an execution plan to go and get the data. It’s a very different thing to optimize the engine such that it does these things quickly and efficiently. I’ve been working with database products for almost thirty years now, and have seen over and over that the biggest challenge faced by any SQL engine is not creating the engine, but in dealing with the tens of thousands of edge cases that will arise in the real world.

For example, being aware of the sort order of stored data on disk can dramatically improve query performance. Moreover, optimizing the storage of the data to leverage the sort sequence with something like run-length encoding can further improve performance. But not if the SQL engine doesn’t know how to deal with this. One example of an immature implementation is an engine that cannot use just-in-time decompression of highly compressed data. If the system has to pay the CPU penalty of decompressing highly compressed data every time it is queried, why bother compressing it in the first place, except maybe to save disk space? Similarly, if a user needs to keep extremely high-performance aggregations in sync with the transaction data, that simply won’t be possible unless the engine has been written to manage data this way and to be aware of the data’s characteristics at run time.
These are just two examples, but they can make the difference between a query taking one second and one taking two days. Or worse, crashing when you try to run it because uncompressed data overflows memory and brings down the database.
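Run-length encoding itself is simple to sketch; the hard part described above is teaching the engine to operate on the compressed form directly. A toy Python illustration of why sorted, low-cardinality data compresses so well, and how an RLE-aware engine can answer questions without touching the raw values:

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A sorted, low-cardinality column: say, a 'state' column after sorting.
column = ["CA"] * 5 + ["NY"] * 3 + ["TX"] * 4

encoded = rle_encode(column)
print(encoded)       # [('CA', 5), ('NY', 3), ('TX', 4)]

# An RLE-aware engine can answer COUNT(*) ... GROUP BY state from the
# run lengths alone, never materializing the 12 raw values.
counts = {v: n for v, n in encoded}
print(counts["CA"])  # 5
```

A naive engine would decompress all 12 values and count them; on billions of rows, that difference is the one-second-versus-two-days gap.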

2. Does it just dump files to a file-system, or actively manage and optimize storage?

Projects built for Hadoop almost invariably pick up some of the “baggage” of using the core Hadoop functionality. For example, some of the SQL on Hadoop offerings just dump individual files into the filesystem as data is ingested. After loading a year of data, you’re likely to find yourself with hundreds of thousands of individual files. This is a performance catastrophe. Moreover, to optimize these files a person has to intervene manually: write a script, run a process, call an executable, and so on. This just adds to the real cost of the solution in terms of administrative complexity and design complexity to work around performance issues. What a business needs is a system that simplifies this by managing and optimizing files automatically.
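The manual workaround alluded to above is usually some flavor of “compaction”: periodically merging many small ingest files into a few large ones. A hedged local-filesystem sketch of the idea (a real Hadoop job would do this over HDFS, not a local directory):

```python
import os
import tempfile

def compact(directory, merged_name="merged.dat"):
    """Merge every small file in `directory` into one file, removing the originals."""
    parts = sorted(f for f in os.listdir(directory) if f != merged_name)
    merged_path = os.path.join(directory, merged_name)
    with open(merged_path, "wb") as out:
        for name in parts:
            path = os.path.join(directory, name)
            with open(path, "rb") as src:
                out.write(src.read())
            os.remove(path)
    return merged_path

# Simulate an ingest directory full of tiny per-batch files.
d = tempfile.mkdtemp()
for i in range(100):
    with open(os.path.join(d, f"part-{i:05d}"), "w") as f:
        f.write(f"row {i}\n")

compact(d)
print(len(os.listdir(d)))  # 1
```

The point is not that this script is hard to write; it’s that someone has to write it, schedule it, and babysit it, which is precisely the hidden administrative cost of a system that doesn’t manage its own storage.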

1. When two people ask the same question at the same time, do they get the same answer?

There are certain fundamentals about databases that have made them so common for tracking key business data today. One of these things is called ACID compliance. It’s an acronym that doesn’t bear explaining here, so suffice it to say that one of the things an ACID-compliant database guarantees is that if two people ask the exact same question of the exact same data at the exact same time, they will get the same answer.

Seems kind of obvious, doesn’t it? And yet a common issue with SQL on Hadoop distributions is that they may lack ACID compliance. That’s a problem for data scientists building predictive models to grow the business, and it’s certainly not suitable for producing financials! Caveat emptor.

Many of our customers consider these five areas to be a benchmark for measuring SQL on Hadoop maturity. SQL on Hadoop offerings that fail to deliver these things will drive up the cost and time it takes to solve problems as analysts must use a mix of tools, work around performance and stability limitations, etc. And in the context of massive data thefts taking place today, how many CIOs feel comfortable with three petabytes of unsecured data pertaining to every single aspect of their business being accessible to anyone with a text editor and a bit of Java programming know-how?

The good news is that we at HP have been thinking of these concerns for years now. And working on solving them. Vertica SQL on Hadoop addresses each of these concerns in a comprehensive way, so organizations can finally unlock the full value of their data lake. We’re happy to tell you more about this, and we’d love for you to try it out! Click here to request more information from our team.

Better Together

Assembling puzzles

Just like peanut butter and chocolate, the mix of several flavors of data is much more interesting and useful than just one. At HP we classify types of data into three categories:

Human Data

Human data is stuff created by people as opposed to machines, like social media posts, videos, audio, emails, spreadsheets, blogs, and Wikipedia. This data is hard to analyze: it is written in natural language, does not conform to a particular structure, and lives in places that are not particularly easy to access. Because human data lacks traditional structure, we can’t just pull it straight into a data warehouse (nor should we want to). If you want to take full advantage of human data, you must do two things: extract metadata and textual content, and extract meaning. These are completely different things. I can easily write a program to extract keywords and text from PDFs, and use them for a simple search engine. But unless I understand how that PDF relates to the millions of other documents in my business, I cannot do much more than that simple search. Plus, how can I extract information from a video? What about audio recordings from your customer service desk? Sentiment from a YouTube video review of your product and the related comments? These are all very valuable, and not particularly easy to analyze.
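The “simple search engine” half of that claim really is simple. A toy sketch of keyword extraction, assuming the raw text has already been pulled out of the PDF (real extraction would need a library such as pypdf; the stopword list and sample document here are made up for illustration):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def keywords(text, top_n=3):
    """Return the most frequent non-stopword terms: a crude search-index entry."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]

doc = ("The patio project needs gravel, paver base and drainage. "
       "Drainage matters because water ruins a patio without drainage.")
print(keywords(doc))  # ['drainage', 'patio', ...]
```

This gets you lookup, not understanding: the program has no idea that “drainage” relates this document to a thousand other construction documents, which is exactly the meaning-extraction gap described above.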

Machine Data

Machine Data is data produced by machines or for machines, like sensor data from smart meters, wearable technology, and weblogs from your web site. This category of data is growing exponentially faster than human or business data, and the size of the data is the main driver behind technologies like the Hadoop Distributed File System (HDFS). If I asked you how much data you have today versus 5 years ago, you might say 10 times as much. (If I asked you how many new customers you have today vs. 5 years ago, I would hope you’d say 10 times as many as well!) If you do indeed have 10x more data today, it’s because most of your new data is machine data. Machine data is growing so fast that it has spawned a number of new technologies to store and analyze it, from both open-source and proprietary sources. Understanding what these technologies do, and what they do NOT do, should be on your to-do list right now! (If you want help, feel free to contact me.)

Business Data

Business data is data created by businesses to help them run the business. This includes data in your data warehouse, as well as less centralized data like data found in spreadsheets. Think your data warehouse solution has all of your business data? Just for fun, think about how much of your business is run through Excel spreadsheets. If you are lucky, those spreadsheets are sitting in a SharePoint space somewhere, and not just on employee desktops. And if they are on people’s desktops, hopefully, they’re being backed up. Scary that you don’t have that information indexed and searchable, isn’t it?

So now that you have an idea of the types of data out there, what can you do with it? A picture is worth a thousand words, so let’s start off with a picture and a story.

Use Case: NASCAR

First, watch this video. When you think about NASCAR, you think about fast cars flying around the track, smacking into each other as they jockey for position. What you might not realize is that everything in NASCAR comes back to sponsorship. A NASCAR race is essentially a collection of 200mph billboards. Take a look at this picture:


You are looking at 3-Time Sprint Cup Champion Tony Stewart at Infineon Raceway. First, notice that the car is an advertisement for a number of different companies. The race is called the Sprint Cup. The raceway is Infineon Raceway. NASCAR is not just about racing!

“The NASCAR ecosystem is a huge one involving fans, race teams, owners, drivers, racetracks, promoters, sponsors, advertisers, media and many more.”
– Sean Doherty, Director of Digital Engagement and Integrated Marketing Communications at NASCAR (credit CIO Insight).

NASCAR is a vehicle for advertising, as much as it is advertising for vehicles. Of course advertisers want to maximize viewers, because that is ultimately what sponsors want: people looking at their logo, or viewing their ads during the commercial break.

NASCAR realizes that its success is all about the fan base. But the majority of that fan base is sitting at home, far from the action. How to engage them? Putting aside creepy ideas like taking over video feeds from an Xbox Kinect, there are plenty of ways that fans publicly interact. The most obvious one: they tweet about the action in real time. They even tweet during the commercials, about the commercials. So now we have two things we can monitor: the number of tweets at any given time during the race, and the content of the tweets. Counting tweets is easy: just pick a time slice like 1 minute, count tweets that include NASCAR-related hashtags in that timeslice, and put them up on a dashboard. TA-DA! You now have one indicator of engagement.
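That counting step really is as mechanical as it sounds. A minimal sketch (the hashtags and timestamps here are made up for illustration):

```python
from collections import Counter
from datetime import datetime

NASCAR_TAGS = {"#nascar", "#sprintcup"}  # illustrative hashtag list

def engagement_by_minute(tweets):
    """Count NASCAR-tagged tweets per 1-minute slice.

    `tweets` is an iterable of (timestamp, text) pairs; returns a Counter
    keyed by the minute each matching tweet falls into.
    """
    counts = Counter()
    for ts, text in tweets:
        if any(tag in text.lower() for tag in NASCAR_TAGS):
            counts[ts.replace(second=0, microsecond=0)] += 1
    return counts

tweets = [
    (datetime(2014, 6, 1, 14, 0, 12), "Green flag! #NASCAR"),
    (datetime(2014, 6, 1, 14, 0, 48), "What a start #SprintCup"),
    (datetime(2014, 6, 1, 14, 1, 5),  "Pit stop already? #nascar"),
    (datetime(2014, 6, 1, 14, 1, 30), "unrelated lunch tweet"),
]
for minute, n in sorted(engagement_by_minute(tweets).items()):
    print(minute.strftime("%H:%M"), n)
# 14:00 2
# 14:01 1
```

Feed those per-minute counts to a dashboard and you have the engagement indicator described above; the hard part comes next, with the content of the tweets.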

But wait, are the fans happy or mad? We have to look at the content of the tweets, and that means sentiment analysis. We need to attach sentiment to each tweet so that we can gauge overall sentiment. Now the real problem: tweets are, by nature, short. They also are written in shorthand, and use colloquial language. So now we need natural language processing on what is essentially slang. We have two factors that we can gauge throughout the race: engagement level and sentiment. That dashboard is getting more interesting!
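A real slang-aware NLP model is well beyond a blog post, but the basic shape of lexicon-based sentiment scoring can be sketched. The tiny word list below stands in for a real sentiment lexicon, which would have thousands of entries including slang and hashtag variants:

```python
LEXICON = {  # toy scores; a real lexicon is far larger and slang-aware
    "awesome": 2, "love": 2, "epic": 1,
    "boring": -1, "wreck": -2, "hate": -2,
}

def sentiment(tweet):
    """Sum lexicon scores over a tweet's words; the sign gives the overall mood."""
    words = tweet.lower().replace("#", "").split()
    return sum(LEXICON.get(w, 0) for w in words)

print(sentiment("That restart was AWESOME #nascar"))  # 2
print(sentiment("hate this rain delay, so boring"))   # -3
```

Averaging these scores per time slice gives the second dashboard dimension: not just how much fans are talking, but how they feel while doing it.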


Here is a strange and related observation: did you know that the time spent during a hockey game where a Zamboni cleans the ice is one of the most heavily tweeted parts of the game? PEOPLE LOVE THE ZAMBONI.

Anyway, how does this relate to fan engagement? Well, let’s say that it starts raining heavily during a race, and NASCAR decides to pull the vehicles off the track. We now have a problem and an opportunity: will the home viewers check out until the rain stops? How do we keep them engaged during the break? Well, we could start by looking at that dashboard and see what the most heavily talked about parts of the race were, then queue up the commentators and video to go over those bits. We could poll the audience and have them tweet their favorite moment, then watch in real time as we see the results from Twitter. For that we will have to categorize and cluster keywords from the tweets in real time.

There is much more to this use case, but suffice it to say that NASCAR also collects data from media outlets in print, radio, and TV, and adds them into the mix. That means scanning video and audio for keywords and content, just like the tweets.

The data collected by NASCAR can then be used by its sponsors, who have their own data, likely in a more traditional data warehouse. Here are a few of the things NASCAR and their sponsors are doing with this system:

  • Race teams can gauge fan conversation and reaction to a new paint scheme for one of its cars to decide whether to alter it before future races.
  • The Charlotte Motor Speedway is tracking conversations and levels of interaction about one of its recent ticket promotions.
  • A sponsor is following track response and media coverage about a new marketing campaign.

List credit: CIO Insight

What has HP done to make this easier?

We covered a lot of ground in that one use case. We needed access to non-traditional data sources like Twitter, access to traditional data sources like an EDW, sentiment analysis, natural language processing, audio text retrieval, video frame recognition, and time series functions to slice up the data. Throw in some pattern-matching techniques and probabilistic modeling too. Then connect all that data to some real-time dashboards using standard SQL technologies and tools. That’s quite a laundry list.

HP has all of the technologies needed to implement this solution for NASCAR. We created a platform that can store and analyze 100% of your data. Structured, unstructured, semi-structured, multi-structured, human, machine, or business data: we can store it and analyze it. The latter part is the interesting one. It’s trivial to set up a Hadoop cluster and store your EDW, web logs, and tweets from the Twitter firehose there. But Hadoop doesn’t magically know how to parse emails, databases, weblog data, or anything else. That’s on you. So is stitching those data sources together, running analytics on them, and hooking all that up to a sensible user interface. Of course, that’s what we do at HP. We have even moved this technology onto the cloud, to make development and testing of these solutions quick and easy. Take a look at Haven on Demand!

What should you be asking yourself?

First, do you understand all of the types of data involved in your industry? Outside of your EDW, how do you interact with your customers, vendors, sponsors, or investors? How can you collect that data and get it into an analytics system? Does your data give you a competitive advantage, or is it just sitting in cold storage? What other data sources do you need in order to make innovative products and services? How do you join it all together using modern data science techniques, while using common data languages like SQL?

These are non-trivial questions. Sometimes just knowing what you have is a science project in itself (it doesn’t have to be; we actually have products for that). Many people assume that data cannot be analyzed unless it is all lumped together in one place, like a Hadoop cluster or an EDW. The good news is that, in most cases, it doesn’t have to be. There are likely cases where you can optimize data processing by moving data into a high-performance data store, but much of your data can be analyzed right where it is. We have been helping customers solve these problems, and we would be delighted to help you as well.

Author Note

This is the first in a series of three articles. The next article deals with how location data from cell phones and social media is creating huge new opportunities for those with the means to analyze it. The third article will deal with machine data, and the issues with dealing with the Internet of Things at scale.

Thanks for reading!

HP Vertica for SQL on Hadoop

HP Vertica for SQL on Hadoop from Vertica Systems on Vimeo

HP Vertica now offers a SQL on Hadoop license, which allows you to leverage Vertica’s powerful analytics engine to explore data in the Hadoop Distributed File System (HDFS).

This offering is licensed on a per-node, per-year term with no data volume limits.

With your SQL on Hadoop license, you get access to proven enterprise features like:

  • Database designer
  • Management console
  • Workload management
  • Flex tables
  • External tables
  • Backup functionality

See our documentation on HP Vertica SQL on Hadoop for limitations.
To learn more about other HP Vertica licenses, view our Obtaining and Installing Your HP Vertica Licenses video or contact an HP Licensing center.

Vertica on MapR SQL-on-Hadoop – join us in June!

We’ve been working closely with MapR Technologies to bring to market our industry-leading SQL-on-Hadoop solution, and on June 3, 2014, we will jointly deliver a live webinar featuring this joint solution and related use cases. To register and learn how you can enjoy the benefits of a SQL-on-Hadoop analytics solution that provides the highest-performing, tightly-integrated platform for operational and exploratory analytics, click here.

This joint solution reduces complexity and costs by running a single cluster for both HP Vertica and Hadoop. It tightly integrates HP Vertica’s 100% ANSI SQL, high-performance Big Data analytics platform with the MapR enterprise-grade Distribution for Apache Hadoop, providing customers and partners with the highest-performing, most tightly integrated solution for operational and exploratory analytics at the lowest total cost of ownership (TCO).

This solution will also be presented live by HP Vertica and MapR executives at HP Discover on June 11, 2014. For more information, visit the HP Discover website.

In addition, a specially-optimized version of the MapR Sandbox for Hadoop is now available in the HP Vertica Marketplace. To download this and other add-ons for the HP Vertica Analytics platform, click here.


Get Started With Vertica Today

Subscribe to Vertica