I have just come back from a business trip to China, where I visited several large Chinese telecom customers to talk about the recent big Vertica win at Facebook. Two questions these customers kept asking me were: What’s the future of MPP databases? Will Hadoop become the one database that rules the whole analytic space?
These seemed to be odd questions considering that Facebook, one of the juggernauts in the open source community in general and the Hadoop world in particular, has recently picked Vertica as the anchoring database to satisfy its ever-increasing analytical demands and has since put the biggest Vertica cluster (with ~300 nodes and effective data storage of 6+ PB) into production. If a Hadoop powerhouse and the inventor of Hive (the most popular SQL-on-Hadoop database) like Facebook, with its teams of brilliant programmers and boundless resources, still thinks that it needs an MPP database like Vertica in its “Big Data” technology stack for the foreseeable future, that sends a clear and strong message. Obviously Facebook thinks the answers to both questions are NO, not so fast. In the meantime, Facebook will continue to use Hive/HBase and other Hadoop technologies for the tasks they are good at: ETL, handling unstructured data and conducting complex data-mining types of deep analysis.
So why does Facebook think that it needs an MPP database? Facebook has been running an EDW (Oracle Exadata, ~50TB) for some time but feels that its existing EDW is running out of steam because it cannot keep up with the rapid data growth, especially as the mobile platform becomes more and more popular. Facebook would like to take advantage of the established commercial MPP databases for lower cost, a robust ecosystem, improved data security and better scalability/performance. Its main reasons for going with an MPP database can be summarized as follows:
- Rapidly expanding analytical needs at Facebook
- MapReduce is too slow, plus security concerns
- In-Memory Database (IMDB) is too expensive and too immature
- Current SQL-on-Hadoop databases are not good enough and too immature
Facebook invited four MPP vendors (including Vertica) to participate in two rounds of competitive POCs before declaring Vertica the ultimate winner on the basis of Vertica’s low TCO, ease of management and superior ad-hoc query performance.
There have been many SQL-on-Hadoop offerings in the last couple of years, both open source and proprietary, including but not limited to Hive, Hadapt, Citus, Impala, Stinger and Apache Drill. Though their efforts to make Hadoop more SQL friendly are welcome, my general impression is that they are still a long way from closing the performance gap to the popular MPP databases in the marketplace (e.g. Vertica). Depending on your perspective, you may argue that this gap is not narrowing at a pace that foretells its closing any time soon.
There is strong reason for me to believe that the SQL-on-Hadoop camp may have overestimated the effectiveness of bolting open source SQL optimizers (e.g. PostgreSQL) onto HDFS and severely underestimated the effort and time it takes to produce an enterprise-quality MPP database. The core optimizer/execution engine technology of such a database requires years of intensive real-world use to mature, 100s (if not 1000s) of customers to validate, and millions of cases to test and train. This is certainly more about practice than theory or concept. Query optimization is fundamentally a software problem and there is a limit to what any “brute force” hardware-based approach can do. To echo and rephrase what the authors of “MapReduce and Parallel Databases: Friends or Foes?” said, smart software (like MPP databases) is still a good idea in the age of Hadoop and “Big Data”, and there is plenty of room and opportunity for MPP databases to thrive for a long time to come.
Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.