As head of Product Management at a next-generation analytic DBMS company, I am often asked about Vertica’s endeavors with Hadoop/MapReduce. Given that Vertica and Hadoop/MR share many core principles, such as being massively parallel and highly available on distributed commodity hardware, there is a natural fit. That said, the two are still different: Vertica is designed for real-time analytics of structured data, whereas Hadoop/MR is typically used for batch-oriented jobs on any type of data (structured, semi-structured, or unstructured). We try to stay out of the comparisons, though, and instead focus on complementary approaches, particularly in solving real-world customer problems. This has been our approach since the beginning of our joint development.
Vertica and Hadoop/MR complement one another extremely well, and we are committed to ensuring tight, bi-directional integration between Hadoop/MR and Vertica. Our preference is to work with great partners like Cloudera who understand enterprise-class Hadoop the same way Vertica understands enterprise-class databases. Our approach of seamless and parallel integration is in line with Vertica’s core “One Size Does Not Fit All” tenet. We don’t think we need to develop the technology ourselves, much in the same way that we don’t feel the need to develop our own ETL and front-end visualization solutions.
Vertica is focused on building the best next-generation analytic database solution on the market. Our solution enables customers to unlock and monetize their data in a fully relational and massively parallel manner, with scalability and simplicity of setup and administration as core design principles. We enable companies to ingest, store, and analyze vast amounts of structured data with near real-time latency on a fraction of the hardware they would otherwise need. This is why Vertica was founded, this is what we owe our success to date to, and as far as we can tell, we are solving a very clear and present data problem that is only getting worse. Our focus is also the reason we reached the 100 customer mark faster than all of our competitors. Among other uses, Hadoop/MR is wonderful at getting more and higher quality data into Vertica.
While Hadoop/MR and Vertica are different, the problems they solve are not always orthogonal. As it turns out, and not surprisingly, many data problems can be solved in more than one way. Again, we see merit in Hadoop/MR for several use cases (including but not limited to the massaging, structuring, and transformation of data before and/or after it gets to the database), but we also know that some of the most commonly cited MR use cases can be performed in a single pass of SQL in the database engine as well. By stripping away the noise and listening to our customers and their pain, we are able to deliver a core product that solves many of the same issues. Not all, but many.
A case in point is sessionization, which is perhaps the most often cited use case for MapReduce in the enterprise (stay tuned for a more in-depth post on this topic and on the conditional true event, or CTE). Sessionization is the process of taking web log entries and grouping them into buckets of visitor sessions for analysis (most commonly based on a time threshold between clicks, e.g. 30 seconds). This has been pegged as problematic to perform in SQL, and therefore in the RDBMS, because it often requires multiple passes through the engine and is difficult to express. In Vertica 4.0, however, it can be expressed in single-pass SQL without difficulty.
Here’s the SQL with a Web/Clickstream timeout threshold of 30 seconds:
SELECT userId, timestamp, CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > '30 seconds') OVER (PARTITION BY userId ORDER BY timestamp) AS session FROM webclicks;
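To make the semantics of the query concrete, here is a minimal Python sketch of the grouping it describes. It is only an illustration, not how Vertica executes the query: the function name `sessionize` and the sample clicks are hypothetical, and it assumes that each user's session counter starts at 0 and increases by one whenever the gap to that user's previous click exceeds the timeout.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(clicks, timeout=timedelta(seconds=30)):
    """Assign a per-user session number to each (user_id, timestamp) click.

    Assumption for this sketch: the counter starts at 0 for every user and
    is incremented each time the gap to the previous click exceeds `timeout`.
    """
    # Equivalent of PARTITION BY userId: collect timestamps per user.
    by_user = defaultdict(list)
    for user_id, ts in clicks:
        by_user[user_id].append(ts)

    rows = []
    for user_id in sorted(by_user):
        session = 0
        previous = None  # plays the role of LAG(timestamp); no value on the first row
        # Equivalent of ORDER BY timestamp within the partition.
        for ts in sorted(by_user[user_id]):
            if previous is not None and ts - previous > timeout:
                session += 1  # the condition was true, so a new session begins
            previous = ts
            rows.append((user_id, ts, session))
    return rows


if __name__ == "__main__":
    t0 = datetime(2010, 1, 1, 12, 0, 0)
    sample_clicks = [  # hypothetical data
        (1, t0),
        (1, t0 + timedelta(seconds=10)),
        (1, t0 + timedelta(seconds=100)),
        (2, t0 + timedelta(seconds=5)),
    ]
    for row in sessionize(sample_clicks):
        print(row)
```

In the sample data, user 1's third click arrives 90 seconds after the second, so it falls into a new session, while user 2's single click stays in that user's first session.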
By performing this operation in the Vertica database, our customers leverage our massively parallel, real-time columnar infrastructure without having to move the data around for external batch processing. They can do this from within their favorite reporting tool without adding that extra step. Furthermore, Vertica offers many native windowing conditions for advanced analytics, including sessionization, and they are not limited to the conditional true event (CTE) on a timestamp shown above. Of course, there are still good reasons to perform sessionization outside the database, such as not wanting to take up valuable real-time analytics resources with this kind of grouping legwork (although this can also be addressed using Vertica’s new workload management capabilities). We get that, and again, that is why we support native Hadoop/MR, with no need for syntax changes.
Key to our One Size Does Not Fit All approach was Vertica’s day-one decision not to cut corners by building on top of Postgres or some other traditional row-store, as most of our competitors have done with their offerings. We have instead written a truly next-generation, native MPP-columnar ADBMS solution from scratch, complete with a unique set of bells and whistles (stay tuned for a specific post on this subject as well). The good news is that on this core foundation, we can now add functionality that traditional row-stores would never be able to handle fast enough. Sessionization is a great example. It is simply too inefficient to perform in a traditional RDBMS, and most databases are not expressive enough to state it concisely, which is why many people turn to Hadoop/MR for it. Vertica’s customers are finding there are a lot of things they can now do in Vertica that they could never consider with a traditional database. This, combined with tight integration with frameworks like Hadoop, allows our customers to monetize all of their data in ways never before possible.