Archive for August, 2011

On Both Sides of the Internship

Author: Lyric Doshi

Three cheers for Vertica Summer Interns 2011!

As summer comes to an end, we bid goodbye to yet another amazing crop of summer interns. This year's interns, Hieu, Zhongliang, Ruchika, and Zhijie, were MS/PhD students from different schools along the East Coast.  They worked on projects to extend the Vertica SDK, enhance Hadoop/Pig connectivity, and create internal developer productivity tools.  We plan to incorporate much of this work into a future major release of Vertica. In fact, there was so much excitement surrounding their work that in addition to the traditional presentation to engineering, they were asked to present to the entire company after having lunch with Vertica VP&GM Chris Lynch.

Once a picky intern at other companies (and a very happy one at Vertica; why else would I have come back?), I had the opportunity this year to run our internship program, beginning with coordinating interviews that narrowed nearly 40 candidates down to 4 solid interns. That was over 3 months ago. In the past few days, I took some time to speak with each intern about their experience, and I was happy to hear echoes of the same things I told my own mentors at Vertica 2 years ago.

Zhongliang gained experience working on a full Java project for the first time and told me he felt his coding improved dramatically thanks to feedback from his mentor Matt.

Ruchika told me how her project made some of her friends jealous because she was never bored at work. She appreciated how everyone here dropped their work to help her out when she had questions. She singled out the unwavering patience of her mentor Ben in answering her questions. Having appreciated the same time and time again, I responded, “Been there, done that.”

Hieu highlighted how much fun he had in addition to his project, partaking in our Ping Pong tournament, joining the weekly pick-up basketball, learning the art of sword-play, destroying us in Starcraft, and attending cook-outs hosted by co-workers. He even laughed himself through an unforgettable first experience with water sports (hint: it involved an inner tube, some soccer shorts, and a motorboat) at our annual interns party.

Zhijie was very happy to work on a project that, while sufficiently separated that he did not have to worry about causing trouble, was in the release plan. Whenever he got frustrated with coding issues, he looked at the customer feature requests page to see all the demand for what he was working on and found real inspiration in knowing he was making a difference.

Repeatedly pressing each for complaints in the spirit of constructive criticism, major or minor, I finally forced something out of Zhijie: “The office was a little hot when the AC broke.” Surely a sign of a successful summer?

But it’s not just the interns who got something out of the internship program. Our program is run entirely by engineering, and the four dedicated intern mentors and I gained experience managing projects, goals, and expectations. We’ve adapted our program over the years, trying both team and individual projects focused on everything from tools and demos to server-side changes. This year, we even seated our interns near their respective mentors for a proper full-timer experience and even higher mentor accessibility.

A big thank you to all of our interns for your hard work and commitment. We’ll miss having you around but wish you the best for the coming school year and hope to see you again soon!

The Right Tool for the Job: Using Hadoop with Vertica for Big Data Analytics

by Mingsheng Hong, Vertica Product Marketing Engineer

I have an entrepreneur friend who used to carry a butter knife around.  He claimed this “almighty” tool was the only one he ever needed!  While the butter knife does serve a wide range of purposes (especially with a stretch of the imagination), in practice it doesn’t always yield optimal results.  For example, as a screwdriver, it may work for common screws, but certainly not a Phillips (unless you push down very hard and hope not to strip the screw).  As a hammer, you may be able to drive finishing nails, but your success and mileage may vary.  As a pry bar, well, I think you get my point!  Clearly one tool isn’t sufficient for all purposes – a good toolbox includes various tools each fulfilling a specific purpose.

When it comes to Big Data Analytics, Hadoop (as a platform) has received an incredible amount of attention.  Some highlights include: a scalable architecture based on commodity hardware, flexible programming language support, and a strong open source community committed to its ongoing development.  However, Hadoop is not without limitations: due to its batch-oriented nature, Hadoop alone cannot be deployed as a real-time analytics solution.  Its highly technical and low-level programming interface makes it extremely flexible and friendly to developers, but not optimal for business analysts.  In an enterprise business intelligence environment, Hadoop’s limited integration with existing BI tools makes people scratch their heads trying to figure out how to fit it into their environment.

As Hadoop has continued to gain traction in the market and (in my opinion) moved beyond the peak of the hype cycle, it is becoming clear that to maximize its effectiveness, one should leverage Hadoop in conjunction with other business intelligence platforms and tools.  Best practices are emerging regarding the choice of such companions, as well as how to leverage each component in a joint deployment.

Among the various BI platforms and tools, Vertica has proven to be an excellent choice. Many of its customers have successfully leveraged the joint deployment of Hadoop and Vertica to tackle BI challenges in algorithmic trading, web analytics, and countless other industry verticals.

What makes the joint deployment so effective, and what are the common use cases?

First, both platforms have a lot in common:

  • Purpose-built from scratch for Big Data transformation and analytics
  • Leverage MPP architecture to scale out with commodity hardware, capable of managing TBs through PBs of data
  • Native HA support with low administration overhead

In the Big Data space crowded with existing and emerging solutions, the above architectural elements have been accepted as must-haves for any solution to deliver scalability, cost effectiveness and ease of use.  Both platforms have obtained strong market traction in the last few years, with customer success stories from a wide range of industry verticals.

While agreeing on things can be pleasant, it is the following key differences that make Hadoop and Vertica complement each other when addressing Big Data challenges:

Interface and extensibility
  • Hadoop: The map-reduce programming interface is designed for developers. The platform is acclaimed for its multi-language support as well as ready-made analytic library packages supplied by a strong community.
  • Vertica: The interface complies with BI industry standards (SQL, ODBC, JDBC, etc.), enabling both technologists and business analysts to leverage Vertica in their analytic use cases. Vertica’s 5.0 analytics SDK lets users plug custom analytic logic into the platform, with in-process and parallel execution. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance.

Tool chain / ecosystem
  • Hadoop: Hadoop and HDFS integrate well with many other open source tools. Integration with existing BI tools is emerging.
  • Vertica: Vertica integrates with BI tools thanks to its standards-compliant interface. Through Vertica’s Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica.

Storage management
  • Hadoop: Data is replicated 3 times by default for HA. It is segmented across the machine cluster for load balancing, but the segmentation scheme is opaque to end users and cannot be tweaked to optimize for analytic jobs.
  • Vertica: Columnar compression often achieves a 10:1 compression ratio. A typical deployment replicates data once for HA, and the two replicas can use different physical layouts in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing, but for compression and query workload optimization as well.

Runtime optimization
  • Hadoop: Because HDFS storage management does not sort or segment data in ways that optimize for an analytic job, the input data often needs to be resegmented across the cluster and/or sorted at job runtime, incurring a large amount of network and disk I/O.
  • Vertica: The data layout is optimized for the target query workload during data loading, so that a minimal amount of I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics as opposed to batch-oriented data processing.

Auto tuning
  • Hadoop: Map-reduce programs are written in procedural languages (Java, Python, etc.), which give developers fine-grained control over the analytic logic but also require them to optimize their jobs carefully.
  • Vertica: The Database Designer provides automatic performance tuning for a given input workload. Queries are specified in declarative SQL and automatically optimized by Vertica’s columnar optimizer.
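To make the interface contrast concrete, here is a minimal sketch, in plain Python rather than actual Hadoop code, of the procedural map-reduce style of computing page-view counts; the equivalent declarative SQL, which Vertica’s optimizer would handle automatically, appears as a comment. The record data is invented for illustration.

```python
from collections import defaultdict

# Hypothetical clickstream records: (user_id, url)
records = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")]

# Map phase: the developer spells out what to emit per record.
def map_phase(recs):
    for user, url in recs:
        yield url, 1

# Reduce phase: the developer spells out the grouping and summing.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for url, n in pairs:
        counts[url] += n
    return dict(counts)

page_views = reduce_phase(map_phase(records))

# The equivalent declarative query, left to the columnar optimizer:
#   SELECT url, COUNT(*) FROM clicks GROUP BY url;
```

The point is not performance but who carries the optimization burden: in the procedural version it is the developer, in the declarative version it is the database.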


After working with a number of customers on joint Hadoop and Vertica deployments, we have identified a number of best practices for combining the power of both platforms.  As an example, Hadoop is ideal for initial exploratory data analysis, where the data is often available in HDFS and schema-less, and batch jobs usually suffice, whereas Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data.  Sessionizing clickstreams, Monte Carlo analysis, and web-scale graph analytics are some such examples.  For those analytic features supported by both platforms, we have observed significant performance advantages in Vertica, due to the key architectural differences between the two platforms as described above.
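As an illustration of the kind of stylized, repeated analysis mentioned above, here is a minimal gap-based sessionization sketch in plain Python: a new session begins whenever the gap between a user’s consecutive clicks exceeds a threshold. The 30-minute threshold and the event data are assumptions for illustration; in a Vertica deployment this logic would typically be expressed in SQL rather than in application code.

```python
SESSION_GAP = 30 * 60  # assumed threshold: 30 minutes, in seconds

def sessionize(events):
    """events: list of (user_id, timestamp) sorted by (user_id, timestamp).
    Returns a list of (user_id, timestamp, session_id) tuples, where
    session_id restarts at 0 for each user."""
    out = []
    last_user, last_ts, session = None, None, 0
    for user, ts in events:
        if user != last_user:
            session = 0                      # new user: new session
        elif ts - last_ts > SESSION_GAP:
            session += 1                     # idle gap: next session
        out.append((user, ts, session))
        last_user, last_ts = user, ts
    return out

# Demo: u1's third click arrives after a >30-minute gap.
sessions = sessionize([("u1", 0), ("u1", 100), ("u1", 2000), ("u2", 50)])
```

Because the method is fixed and applied repeatedly to incoming data, this is exactly the workload profile where a tuned, pre-sorted columnar layout pays off.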

Finally, by leveraging Vertica’s Hadoop connector, users can easily move data between the two platforms.  A single analytic job can also be decomposed into pieces that leverage the execution power of both platforms; for instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS.  A map-reduce job is then invoked to convert that semi-structured data into relational tuples, and the results are loaded into Vertica for optimized storage and retrieval by subsequent analytic queries.  As another example, when an analytic job retrieves its input data from Vertica storage, the initial stages of computation, often consisting of filters, joins, and aggregations, should be conducted in Vertica for optimal performance.  The intermediate result can then be fed into a map-reduce job for further processing, such as building a decision tree or another machine learning model.
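A sketch of the conversion step described above, in plain Python rather than an actual map-reduce job: flattening JSON log records into delimited tuples that a bulk loader could then ingest. The log lines and field names are invented for illustration, and the pipe delimiter is just one common bulk-load convention.

```python
import csv
import io
import json

# Hypothetical web-server log lines (JSON), as they might sit in HDFS.
raw_lines = [
    '{"ts": "2011-08-01T12:00:00", "user": "u1", "url": "/home"}',
    '{"ts": "2011-08-01T12:00:05", "user": "u2", "url": "/cart"}',
]

def to_tuples(lines):
    """Flatten each JSON record into a (ts, user, url) row, mimicking
    the mapper that converts semi-structured data into relational tuples."""
    for line in lines:
        rec = json.loads(line)
        yield rec["ts"], rec["user"], rec["url"]

# Serialize the tuples as pipe-delimited text for a bulk loader.
buf = io.StringIO()
csv.writer(buf, delimiter="|").writerows(to_tuples(raw_lines))
```

In a real deployment this conversion would run in parallel across the Hadoop cluster, with the output streamed into the database rather than buffered in memory.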

Big Data with Hadoop and Vertica – OSCON ‘11

The recent OSCON ’11 was filled with exciting technology and best-practice discussions on Big Data, Java, and many other subjects. There I had an opportunity to deliver a talk to the open source community on the subject of this post. In a subsequent talk, my colleagues Steve Watt and Glenn Gebhart presented a compelling demo illustrating the power of combining Hadoop and Vertica to analyze unstructured and structured data. We were delighted by the feedback both talks received, from follow-up conversations in person as well as on Twitter. This interview captured the gist of the numerous conversations we had with other OSCON attendees about Vertica’s real-time analytics capabilities and its underlying technology.
