Vertica

Avro parser UDx – Using Apache Avro to enable easier data transfer from Hadoop to Vertica

After careful research and brainstorming for the intern UDx competition, we decided to implement an Avro parser UDx. Our team, “The Avro-rian Revolutionaries,” wanted to build something useful, ready to use, and in the top three of our customers’ wish lists. What could be better than an Avro parser that helps users easily transfer data from Hadoop to Vertica? (This Avro parser UDx package is now available on github [6], and Vertica users are encouraged to try it out!)

Apache Avro [1] is a data serialization format widely used in the Hadoop world. It is a newer format that follows Thrift [2] and Protocol Buffers [3]. According to some technologists, Avro is the best data serialization framework out there [4]. This was good motivation for us to implement an Avro parser for the intern competition, in the hope of making it feasible to import Avro data into Vertica.

Figure 1: Hadoop, Avro, Avro UDx and Vertica workflow

With this motivation, we began day 1 of the 5-day intern competition. The first milestone was to get a standalone Avro parser working: a basic parser (still no Vertica in the picture) that just reads an Avro file and prints out the header and data in text format. The Avro APIs were our means to do it, and by referring to the basic documentation [5] we quickly came up with a parser that could dump the contents of a sample Avro file in text format, as in Figure 2.

Figure 2: weather.avro sample file in text format.
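
To give a feel for what "the header" of an Avro file contains, here is a minimal sketch in Python. It is not our parser, which used the Avro C++ API [5]. It hand-decodes the header layout described in the Avro specification (magic bytes, a metadata map holding the schema and codec, and a 16-byte sync marker). The helper names and the small "Weather" schema are made up for illustration, and the sample header is built in memory rather than read from the real weather.avro.

```python
import io
import json

MAGIC = b"Obj\x01"   # first four bytes of an Avro object container file
SYNC_SIZE = 16       # length of the sync marker that ends the header


def read_long(stream):
    """Decode one Avro long: a variable-length, zigzag-encoded integer."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data while reading a long")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:      # high bit clear: last byte of this long
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo the zigzag mapping


def read_bytes(stream):
    """Decode Avro bytes or a string: a long length, then that many bytes."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data while reading bytes")
    return data


def read_header(stream):
    """Return (metadata dict, sync marker) from an Avro container header."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:              # a zero count ends the map
            break
        if count < 0:               # negative count: a byte size follows
            read_long(stream)       # block size in bytes, not needed here
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(SYNC_SIZE)
    if len(sync) != SYNC_SIZE:
        raise EOFError("unexpected end of data while reading sync marker")
    return meta, sync


def write_long(value):
    """Encode an integer as an Avro long (used only to build the demo)."""
    raw = (value << 1) ^ (value >> 63)  # zigzag, for 64-bit range values
    out = bytearray()
    while raw & ~0x7F:
        out.append((raw & 0x7F) | 0x80)
        raw >>= 7
    out.append(raw)
    return bytes(out)


def build_sample_header():
    """Build a tiny in-memory header with a hypothetical Weather schema."""
    schema = json.dumps({
        "type": "record",
        "name": "Weather",
        "fields": [{"name": "station", "type": "string"},
                   {"name": "temp", "type": "int"}],
    }).encode("utf-8")
    meta = {"avro.schema": schema, "avro.codec": b"null"}
    out = bytearray(MAGIC)
    out += write_long(len(meta))        # one map block with all entries
    for key, value in meta.items():
        key_bytes = key.encode("utf-8")
        out += write_long(len(key_bytes)) + key_bytes
        out += write_long(len(value)) + value
    out += write_long(0)                # end of the metadata map
    out += bytes(range(SYNC_SIZE))      # placeholder sync marker
    return bytes(out)


if __name__ == "__main__":
    meta, sync = read_header(io.BytesIO(build_sample_header()))
    schema = json.loads(meta["avro.schema"])
    print("record:", schema["name"])
    print("fields:", [field["name"] for field in schema["fields"]])
    print("codec:", meta["avro.codec"].decode("utf-8"))
```

To try it on a real file, open the file in binary mode and pass it to read_header. The data blocks that follow the header are not decoded in this sketch.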

We spent day 2 of the competition learning the Vertica SDK, the next tool of the trade. There were some great examples already out there on github. We picked a simple example UDx and began experimenting with it. Once we were comfortable loading, testing, and running this UDx, we started learning the SDK interfaces required for loading data into Vertica. One important interface is UDParser, which parses a stream of bytes into Vertica in parallel. Very quickly we were able to use it to develop a UDx skeleton, ready to be integrated with the module developed on day 1.

On day 3, midway through the competition, we had the most important milestone to achieve: integrating the standalone Avro parser developed on day 1 with the parser UDx skeleton developed on day 2. This was the point where we got stuck and had an unexpected setback. After talking to our mentors, we discovered that there is an interface gap between the Avro file reader API and the Vertica UDParser interface. To fill this gap we developed a couple of modules, CRReader and CRStream, which successfully addressed the issue.

On day 4 we began integrating the modules, and finally the moment of judgement arrived: our first test of loading a weather.avro file into Vertica, which exercised most of the code we had written. We did not have to hold our breath long. Within a fraction of a second the data was loaded into Vertica. We really couldn’t believe our eyes: all three modules we had written in three days were working like parts of an engine. The magic of UDx was happening, and the Avro file was successfully loaded into Vertica (Figure 3).

Figure 3: Demo screenshot

On day 5, the last day of the competition, we put all our effort into testing and packaging the UDx. We wanted a quality product that would be ready for customers to use by the end of the competition.

Finally, we presented our work alongside the other interns in front of a fully packed room, with an audience from all departments of Vertica. This was a unique experience in itself, because we had to present the work in a way that appealed to an audience with different perspectives, not just the technical one. At the end of the day we were happy that we had learned lots of new things, collaborated with senior mentors, and received great feedback and comments on our work, which made the competition a great success! And now, seeing our UDx parser available on github [6] and ready to use gives us the great satisfaction of having taken our first step toward the Avro-rian revolution!

References:
[1] http://avro.apache.org/docs/1.7.1/
[2] http://wiki.apache.org/thrift/FrontPage
[3] http://code.google.com/p/protobuf/
[4] http://www.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/
[5] http://avro.apache.org/docs/1.6.1/api/cpp/html/index.html
[6] https://github.com/vertica/Vertica-Extension-Packages/tree/master/avro_parser
