Since Hadoop’s initial release 14 years ago, untold volumes of data have been stored in HDFS (Hadoop Distributed File System). Spread across a virtual landscape of data-inspired organizations, those data lakes are wide and deep. Companies have made tremendous investments in Hadoop over the years, and data continues to pour into their data lakes.
You might ask, have those been wise investments? We think so. Despite what the naysayers are claiming about Hadoop itself these days, it’s still true that vast quantities of data from useful sources can reveal lucrative patterns that make massive data collection worthwhile. Unfortunately, we believe most of those revelations are still out there to be made. The problem is that many analytics teams are using open-source query engines that were designed to work with their Hadoop distros. Those query engines are simply not providing the insights that are possible from HDFS data lakes.
For heavy-duty concurrency, open-source query engines aren’t the best answer
Companies that have embraced the basic elements of big data – often for running routine reports on business activities – are beginning to realize they need to do more with their data. Data has become one of their most valuable business resources. Query engines like Impala, Hive, and Presto can be fine for ad-hoc exploration of HDFS data, but things are quite different when dozens of users want constant access to that data, or when you use it to drive production analytics applications or data visualizations and propagate those to hundreds of end users.
Open-source query engines were designed with small teams of data scientists in mind, not enterprise use. They don’t have the sophisticated resource management and query speed optimizations that a database has. As long as they’re only used for experiments or exploration, they work fine. Once you try to put those experiments into large-scale production to get the value back out of your investment, you discover their limitations.
Here’s a real example. A large company in the electronics industry had a problem: slow queries and a very high incidence of query failure. The more users they added, seeking to get value out of their data, the higher the number of failures. They realized something had to change when 25% of all queries using the open-source query engine Impala were failing. They set up a test with the same queries, the same number of users, and the same data, but with Vertica doing the querying. By contrast, ~99.5% of all queries succeeded on Vertica, and they completed on average three times faster than the Impala-based queries that did succeed.
Typical Vertica users query between 100 and 300 TB of data, and many query much more. Dozens or even hundreds of simultaneous users are normal. In addition to querying data in place as Impala does, Vertica can also optimize and compress data efficiently on HDFS to boost performance; you just can’t optimize like that using Impala or Hive. And Vertica manages resources to make sure ALL reasonable queries will complete, so your concurrent users won’t see their queries fail just because someone else’s query is consuming much of the available compute resources.
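To put those two failure rates in perspective, here is a minimal Python sketch of the arithmetic. Only the two rates come from the example above; the daily query volume and the function name are hypothetical, chosen purely for illustration.

```python
def expected_failures(total_queries: int, failure_rate: float) -> int:
    """Expected number of failed queries, rounded to the nearest whole query."""
    return round(total_queries * failure_rate)


# Hypothetical workload size (an assumption, not a figure from the example).
DAILY_QUERIES = 10_000

# Rates taken from the example: 25% of Impala queries failed,
# and ~99.5% of Vertica queries succeeded (so roughly 0.5% failed).
IMPALA_FAILURE_RATE = 0.25
VERTICA_FAILURE_RATE = 0.005

impala_failed = expected_failures(DAILY_QUERIES, IMPALA_FAILURE_RATE)
vertica_failed = expected_failures(DAILY_QUERIES, VERTICA_FAILURE_RATE)

print(f"Impala:  ~{impala_failed} failed queries per day")
print(f"Vertica: ~{vertica_failed} failed queries per day")
```

Plug in your own query volume to see how many reruns, support tickets, and stalled dashboards each failure rate would imply for your team.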
So if you’re trying to use Impala, Hive, Presto, or Spark SQL to query HDFS data in production, you should know that the value you’re looking for from Hadoop is achievable, just not with those query engines.
The hidden costs of open source
One of the most common arguments you hear in favor of using all open source components for HDFS data is low cost. Especially for start-up companies, those low price points look appealing. But when you look at the bigger picture, the apparently low prices are deceptive. Companies with big Hadoop investments can attest to the massive, hidden costs of free software.
Having reversed its former stance against “being an open source company,” Cloudera is now available for free, and Impala is an open source query engine. But what happens when you need extensive technical support? Or when you find a bug that cripples some method your query routines have come to rely on, from the days when Cloudera was a regular, licensed software vendor? You either wait for somebody in the community to struggle with the same issue and, you hope, fix it, or you hire a dedicated expert to maintain and improve the software. Either way, you face additional investments in time, money, and effort that can easily wipe out the savings of the initial software download.
How can you get full value from all that data if only ten people can access it, and a quarter of their queries fail?
Putting aside the TCO risks of open source query tools, Vertica has an impressive, independently documented ROI track record. According to a recent Nucleus Research study, you will make back $4.07 for every $1.00 you spend on Vertica. This better-than-fourfold return on your data analytics with Vertica derives not just from the 75-80% improvements in query speed, but also from:
- Better data compression – 90% less space than row-stores, which means big savings on storage costs
- The industry’s most flexible licensing – deploy on premises, in the cloud, hybrid, or change infrastructure as your requirements change, and Dev, QA, and HA environments are included
- Smart resource management to enable unlimited concurrent user access to wring every dime of value out of your investment
All that translates to savings and improved revenue on many levels.
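For a rough sense of scale, the two headline figures above can be turned into a back-of-the-envelope calculation. This is a minimal Python sketch, not a pricing or sizing tool: the $4.07-per-$1.00 return and the 90% space reduction come from the text, while the spend, the data volume, and the function names are hypothetical.

```python
def projected_return(spend: float, return_per_dollar: float = 4.07) -> float:
    """Return implied by the cited ROI figure ($4.07 back per $1.00 spent)."""
    return spend * return_per_dollar


def compressed_size_tb(row_store_tb: float, reduction: float = 0.90) -> float:
    """Storage footprint after the cited '90% less space than row-stores'."""
    return row_store_tb * (1.0 - reduction)


# Hypothetical inputs, chosen only to make the arithmetic concrete.
annual_spend = 100_000.0   # dollars spent on the platform
row_store_volume = 200.0   # TB the same data would occupy in a row-store

print(f"Projected return: ${projected_return(annual_spend):,.2f}")
print(f"Compressed size:  {compressed_size_tb(row_store_volume):.1f} TB")
```

Actual results depend on your workload, data types, and storage pricing, so treat the output as an order-of-magnitude estimate.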
Put Vertica’s Unified Analytics Warehouse to the test
We began this discussion by mentioning data lakes: how they seem here to stay, and how that’s not a bad thing. Recently, a race has developed between data lake and data warehouse vendors to win the hearts and wallets of big data practitioners. But you don’t have to bet one side against the other.
With Vertica, you can optimize your data and queries with the intelligence of an analytical data warehouse, and you can access vast stores of HDFS data, wherever it lives. Vertica’s Unified Analytics Warehouse gives you a way to combine all your data analytics – cloud, on-premises, or hybrid – to run at blazing speeds at big data scale. And you get full end-to-end machine learning capabilities built in.
The Unified Analytics Warehouse is described in an EMA whitepaper. Here’s an excerpt:
“Instead of replicating the data in Vertica to run a query, Vertica can access the data directly in HDFS or S3 object storage. This eliminates the need for data storage duplication and enables much quicker answers to questions requiring both data stored in Vertica and in other data platforms.”
Move your HDFS workloads over to Vertica
So don’t listen to the naysayers: the investments you’ve made in Hadoop infrastructure are more valuable than ever. We’re not here to replace your HDFS data, but rather to suggest you add Vertica to your Hadoop stack so everyone in your company who can benefit from that data has access, whether for BI or for your evolving data science needs.
Vertica allows organizations to derive maximum value from their HDFS data lakes.