by Shilpa Lawande & Rajat Venkatesh
A growing number of Vertica customers use Hadoop as an ETL/pre-processing tool or HDFS as a “data parking lot” – either to stage files before they are loaded or to simply store data whose worth is yet to be proven. For several years, Vertica has provided a Hadoop Connector that provides bi-directional connectivity between the two systems. In this post, we evaluate the pros and cons of the current approach and discuss a new technique that is enabled by Vertica 6, newly released.
The most efficient topology of network connections is shown in the diagram above. There is one database connection for each partition or a subset of partitions and all these connections transfer data in parallel. Most solutions in the market including the current Vertica-Hadoop connector do follow this design. They open a JDBC connection for each (or a subset of) partition and execute a batch insert to transfer the data. Some products, like Vertica may optimize the transfer by using a bulk load API or transform the data to avoid using resources on the database. Apache Sqoop is the end result of this evolution and provides a platform to implement generic or optimized connectors for all databases that provide a JDBC driver.
The architecture described above has some inherent bad qualities owing to a fundamental impedance mismatch between Vertica and Hadoop.
Even though Vertica and Hadoop are both MPP systems there is no good way to coordinate the degree of parallelism across the two systems, resulting in an inefficient use of the combined resources. Even a moderately large Hadoop cluster can overwhelm a database with too many JDBC connections. Therefore, it is common best practice to reduce the number of reduce (or map) tasks that connect to the database.
Vertica’s customers typically transfer a large file (with many HDFS chunks) or a set of files into one table. Each of the resulting JDBC connections initiates a new transaction overwhelming the transaction module with too many actors. Since each connection is a transaction, it’s not possible to roll back a transfer. A failed transfer may leave partial data – the worst kind of failure. A common best practice is to first store the data into a temporary table, verify the transfer and then move it to its final resting table. Another common strategy is to add a column to identify the rows inserted in a transfer, verify the rows with the unique id and then keep or delete the rows. Depending on the database and storage engine, verifying and moving the data may take up significant resources. Regardless of the approach or database, there is significant management overhead for loading data from Hadoop. A majority of Vertica’s customers use Vertica so that they don’t have to worry about ACID for bulk loads. They don’t want to be in the business of writing and maintaining complex scripts.
In our survey of customers, we found that many of them have come up with creative solutions to the “too-many-connections” problems. By popular demand, we’re in the process of creating a repository of the Hadoop connector source on Github and hope to see some of these solutions contributed there. However, by far, the most preferred solution is to simply write the results of a MR job to HDFS files, then transfer those files to Vertica and load them. And that’s what we’ve chosen to do in our new design as well.
In the new HDFS connector, currently in private beta, we have taken the first step to solve the problems described earlier by allowing direct load from files from HDFS to Vertica, thereby eliminating the non-ACID behaviors due to multiple connections. Further, we solve the problem of too many connections by using one TCP/IP connection per file in HDFS. How did we do this? With Vertica 6, we introduced User-defined Loads. This exposes APIs that can extend Vertica’s native load command COPY, to load from any source and any format. With a few days of effort, using the UDL API and Apache Hadoop REST API for HDFS files, we wrote a plugin to read HDFS files from COPY. The benefits are obvious – COPY can process many files within the same transaction in parallel, across the Vertica cluster. Parallelism is limited by the hardware of the machines. Since it is only one transaction, a failure will result in a complete roll back.
The resulting network topology looks like the diagram below.
Let’s see how the HDFS connector works using some examples.
Imagine that there are 6 files in a directory in HDFS
-rw-r-r- 1 hadoop supergroup 759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem.tbl Found 1 items
-rw-r-r- 1 hadoop supergroup 759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem1.tbl Found 1 items
-rw-r-r- 1 hadoop supergroup 759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem11.tbl Found 1 items
-rw-r-r- 1 hadoop supergroup 759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem12.tbl Found 1 items
-rw-r-r- 1 hadoop supergroup 759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem2.tbl Found 1 items
-rw-r-r- 1 hadoop supergroup 759863287 2012-05-18 16:44 hdfslib/glob_test/lineitem31.tbl Found 1 items
The Vertica Copy command to load lineitem.tbl works as follows:
copy lineitem source Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem.tbl');
If you wanted to load all 6 files, you simply use the glob * feature. In this case, the 6 files are loaded into Vertica and are processed in parallel across the cluster.
copy lineitem source Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/lineitem*');
In this example, lineitem.tbl is a tab-limited file. What about all the other popular file formats? User Defined Load provides an API to plugin a parser for any file format as well as APIs to decompress, decrypt or perform any arbitrary binary transformation. Check out this blog post showing how UDL can be used to process image file data.
Further, with Vertica 6 external tables, in combination with the HDFS connector, you can now use Vertica SQL to analyze data in HDFS directly (see the “Vertica Analytics Anywhere” blog post). This means, while you are in the exploratory stage with some data, you could experiment with different data models and still use SQL analytics (or use the Vertica R SDK) to plough through the data.
CREATE EXTERNAL TABLE hdfs.lineitem (<list of columns>) AS COPY FROM Hdfs(url='http://<dfs.http.address>/webhdfs/v1/user/hadoop/hdfslib/glob_test/*');
“Wait a minute! Most chunks are sent to another Hadoop datanode and then to Vertica. This is worse than what we had before”. Yes we have exchanged expensive JDBC connections with one extra hop for most chunks. However, the new strategy is still faster and maintains transaction semantics. The next step is to enable COPY to parse data chunks in parallel. However, we believe that the current solution has significant advantages over using JDBC connections and already solves major pain points of Vertica-Hadoop users.
An under-performing connector requires far more resources – machine and human – to bring together data from various sources. With an efficient connector, users can transfer data in real time and on a whim, allowing them to use the right tool for the job.
If you would be interested in our private beta for the HDFS connector, please send a note to email@example.com with a brief description of your use-case.