Originally posted 8/15/2011 by Mingsheng Hong, Vertica Product Marketing Engineer
See our recent blog about Apache Hadoop here.
I have an entrepreneur friend who used to carry a butter knife around. He claimed this almighty tool was the only one he ever needed! While the butter knife does serve a wide range of purposes (especially with a stretch of the imagination), in practice it doesn't always yield optimal results. For example, as a screwdriver, it may work for common screws, but certainly not a Phillips (unless you push down very hard and hope not to strip the screw). As a hammer, you may be able to drive finishing nails, but your success and mileage may vary. As a pry bar, well, I think you get my point! Clearly one tool isn't sufficient for all purposes; a good toolbox includes various tools, each fulfilling a specific purpose.
When it comes to Big Data Analytics, Apache Hadoop (as a platform) has received an incredible amount of attention. Some highlights include: a scalable architecture based on commodity hardware, flexible programming language support, and a strong open source community committed to its ongoing development. However, Hadoop is not without limitations: due to its batch-oriented nature, Hadoop alone cannot be deployed as a real-time analytics solution. Its highly technical, low-level programming interface makes it extremely flexible and friendly to developers, but not optimal for business analysts. And in an enterprise business intelligence environment, Hadoop's limited integration with existing BI tools leaves people scratching their heads trying to figure out how to fit it in.
As Hadoop has continued to gain traction in the market and (in my opinion) moved beyond the peak of the hype cycle, it is becoming clear that to maximize its effectiveness, one should leverage Hadoop in conjunction with other business intelligence platforms and tools. Best practices are emerging regarding the choice of such companions, as well as how to leverage each component in a joint deployment.
Among the various BI platforms and tools, Vertica has proved to be an excellent choice. Many customers have successfully leveraged joint deployments of Hadoop and Vertica to tackle BI challenges in algorithmic trading, web analytics, and countless other industry verticals.
What Makes the Joint Deployment so Effective, and What Are the Common Use Cases?
First, both platforms have a lot in common:
- Purpose-built from scratch for Big Data transformation and analytics
- Leverage MPP architecture to scale out with commodity hardware, capable of managing TBs through PBs of data
- Native HA support with low administration overhead
In a Big Data space crowded with existing and emerging solutions, the architectural elements above have come to be accepted as must-haves for any solution that aims to deliver scalability, cost effectiveness, and ease of use. Both platforms have gained strong market traction in the last few years, with customer success stories from a wide range of industry verticals.
While agreeing on things can be pleasant, it is the following key differences that make Hadoop and Vertica complement each other when addressing Big Data challenges:
| Aspect / Feature | Hadoop | Vertica |
| --- | --- | --- |
| Interface and extensibility | Hadoop's map-reduce programming interface is designed for developers. The platform is acclaimed for its multi-language support, as well as the ready-made analytic library packages supplied by a strong community. | Vertica's interface complies with BI industry standards (SQL, ODBC, JDBC, etc.), enabling both technologists and business analysts to leverage Vertica in their analytic use cases. Vertica 5.0's analytics SDK lets users plug custom analytic logic into the platform, with in-process and parallel execution. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance. |
| Tool chain / ecosystem | Hadoop and HDFS integrate well with many other open source tools. Integration with existing BI tools is emerging. | Vertica integrates with BI tools out of the box thanks to its standards-compliant interface. Through Vertica's Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica. |
| Storage management | Hadoop replicates data three times by default for HA. It segments data across the machine cluster for load balancing, but the data segmentation scheme is opaque to end users and cannot be tweaked to optimize for analytic jobs. | Vertica's columnar compression often achieves a 10:1 compression ratio. A typical Vertica deployment replicates data once for HA, and the two data replicas can have different physical layouts in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing, but for compression and query workload optimization as well. |
| Runtime optimization | Because HDFS storage management does not sort or segment data in ways that optimize for an analytic job, at job runtime the input data often needs to be re-segmented across the cluster and/or sorted, incurring a large amount of network and disk I/O. | The data layout is often optimized for the target query workload during data loading, so that minimal I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics as opposed to batch-oriented data processing. |
| Performance tuning | Map-reduce programs are written in procedural languages (Java, Python, etc.), which give developers fine-grained control over the analytic logic, but also require the developers to optimize each job carefully in their programs. | The Vertica Database Designer provides automatic performance tuning given an input workload. Queries are specified in declarative SQL and are automatically optimized by Vertica's columnar optimizer. |
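To make the procedural-versus-declarative contrast concrete, here is a toy sketch in plain Python (not Hadoop's actual API; the record format is invented for illustration) of the map, shuffle, and reduce steps needed for a simple per-key count:

```python
from collections import defaultdict

# Input records, one page-view event per line: "url user_id" (illustrative format).
records = [
    "/home alice",
    "/home bob",
    "/cart alice",
    "/home carol",
]

def mapper(record):
    """Emit (key, 1) for every page view: the 'map' phase."""
    url, _user = record.split()
    yield url, 1

def reducer(key, values):
    """Sum the counts for one key: the 'reduce' phase."""
    return key, sum(values)

# The framework's shuffle phase, simulated here by grouping values by key.
groups = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        groups[key].append(value)

counts = dict(reducer(k, vs) for k, vs in groups.items())
print(counts)  # {'/home': 3, '/cart': 1}
```

In SQL the same job is simply `SELECT url, COUNT(*) FROM views GROUP BY url;`, and the optimizer, not the developer, decides how to execute it.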
After working with a number of customers on joint Hadoop and Vertica deployments, we have identified a number of best practices for combining the power of both platforms. As an example, Hadoop is ideal for initial exploratory data analysis, where the data is often available in HDFS and is schema-less, and batch jobs usually suffice, whereas Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data. Sessionizing clickstreams, Monte Carlo analysis, and web-scale graph analytics are some such examples. For analytic features supported by both platforms, we have observed significant performance advantages in Vertica, due to the key architectural differences between the two platforms described above. Finally, by leveraging Vertica's Hadoop connector, users can easily move data between the two platforms.
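To illustrate what "sessionizing" a clickstream means, here is a minimal sketch in plain Python (the 30-minute timeout and the `(user, timestamp)` tuple format are illustrative assumptions, not from the original post): a user's clicks are split into a new session whenever the gap since the previous click exceeds the timeout.

```python
TIMEOUT = 30 * 60  # session timeout in seconds (an assumed, typical value)

def sessionize(events, timeout=TIMEOUT):
    """events: iterable of (user, unix_timestamp), in any order.
    Returns {user: [[ts, ts, ...], ...]}, one inner list per session."""
    sessions = {}
    for user, ts in sorted(events):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= timeout:
            user_sessions[-1].append(ts)   # gap small enough: same session
        else:
            user_sessions.append([ts])     # gap too large (or first click): new session
    return sessions

clicks = [("alice", 0), ("alice", 600), ("alice", 4000), ("bob", 100)]
print(sessionize(clicks))
# {'alice': [[0, 600], [4000]], 'bob': [[100]]}
```

In a Vertica deployment this logic would typically be expressed in SQL over event tables rather than in application code; the sketch just shows the gap-based grouping rule itself.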
Also, a single analytic job can be decomposed into bits and pieces that leverage the execution power of both platforms; for instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS. A map-reduce job is then invoked to convert such semi-structured data into relational tuples, with the results being loaded into Vertica for optimized storage and retrieval by subsequent analytic queries. As another example, when an analytic job retrieves input data from the Vertica storage, its initial stages of computation, often consisting of filter, join and aggregation, should be conducted in Vertica for optimal performance. The intermediate result can then be fed into a map-reduce job for further processing, such as building a decision tree or some other machine learning model.
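The "JSON to relational tuples" step described above can be sketched as a map-only job. The snippet below is plain Python standing in for a streaming-style mapper (the field names and the pipe-delimited output format are assumptions for illustration), producing rows ready for bulk loading into a table:

```python
import json

def to_tuple(json_line):
    """Flatten one semi-structured web-server event into a
    delimited row suitable for a bulk loader. Field names are
    illustrative; real web logs will differ."""
    event = json.loads(json_line)
    fields = (
        event.get("ts", ""),
        event.get("user", ""),
        event.get("url", ""),
        str(event.get("status", "")),
    )
    return "|".join(fields)

log_lines = [
    '{"ts": "2011-08-15T12:00:00", "user": "alice", "url": "/home", "status": 200}',
    '{"ts": "2011-08-15T12:00:05", "user": "bob", "url": "/cart", "status": 404}',
]
for line in log_lines:
    print(to_tuple(line))
# 2011-08-15T12:00:00|alice|/home|200
# 2011-08-15T12:00:05|bob|/cart|404
```

In a real deployment each mapper would read log lines from HDFS and write its delimited output to files that are then bulk-loaded into the database.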
Big Data with Hadoop and Vertica – OSCON 11
The recent OSCON 11 was filled with exciting technology and best-practice discussions on Big Data, Java, and many other subjects. There I had an opportunity to deliver a talk to the open source community on the subject of this post. In a subsequent talk, my colleagues Steve Watt and Glenn Gebhart presented a compelling demo illustrating the power of combining Hadoop and Vertica to analyze unstructured and structured data. We were delighted at the feedback both talks received, from follow-up conversations in person as well as on Twitter. This interview captured the gist of the numerous conversations we had with other OSCON attendees about Vertica's real-time analytics capabilities and its underlying technology.