Vertica

How to parse anything into Vertica using ExternalFilter

Vertica’s data-load process has three steps:  “Get the data”, “Unpack the data”, and “Parse the data.”  If you look on our Github site, there are handy helpers for the first two of those steps, ExternalSource and ExternalFilter, that let you call out to shell scripts or command-line tools.  But there’s no helper for parsing.  Why not?  Because you don’t need it!

Earlier today, I was trying to load a simple XML data set into a table:

“”” sample_data.xml:

<data>

<record>

<city>Cambridge, MA</city>

<population>106038</population>

</record>

<record>

<city>Arlington, MA</city>

<population>42389</population>

</record>

<record>

<city>Belmont, MA</city>

<population>24194</population>

</record>

</data>

“””

Vertica doesn’t have a built-in XML parser.  So this might look like it would be a real pain.  But I got it loaded nice and quickly with just a little bit of scripting.

First, we need something that can parse this file.  Fortunately, this can be done with just a few lines of Python:

“”” xmlparser.py:

#!/usr/bin/env python

import sys, xml.etree.ElementTree

for record in xml.etree.ElementTree.fromstringlist(sys.stdin).getchildren():

keys = record.getchildren()

print ‘|’.join(key.text for key in keys)

“””

A very simplistic script; it reads the whole file into memory and it assumes that the data is clean.  But on this file it’s all we need.  For more complicated inputs, we could make the script fancier or install and make use of a third-party tool (such as xml_grep, available as an add-on package in some Linux distributions).

Now, what happens when we run that script on the raw data file?

“””

$ ~/xmlparser.py < sample_data.xml

Cambridge, MA|106038

Arlington, MA|42389

Belmont, MA|24194

“””

You may recognize this as the basic output format of vsql, our command-line client.  Which means that Vertica can load it directly.  If you’ve installed ExternalFilter (by checking out our Github repository and running “make install” in the shell_load_package directory), just do the following:

“””

dbadmin=> CREATE TABLE cities (city VARCHAR, population INT);

dbadmin=> COPY cities FROM LOCAL ‘sample_data.xml’ WITH FILTER ExternalFilter(‘/path/to/xmlparser.py’);

Rows Loaded

————-

3

(1 row)

 

dbadmin=> SELECT * FROM cities;

city      | population

—————+————

Cambridge, MA |     106038

Arlington, MA |      42389

Belmont, MA   |      24194

(3 rows)

“””

Of course, with ExternalFilter, you don’t have to write code at all.  You have full access to the command-line tools installed on your server.  So for example, you can whip up a sed script and get a simple Apache web-log loader:

“””

dbadmin=> COPY weblogs FROM ‘/var/log/apache2/access.log’ WITH FILTER ExternalFilter(cmd=’sed ”s/^\([0-9\.]*\) \([^ ]*\) \([^ ]*\) \[\([^ ]* [^ ]*\)\] “\([^"]*\)” \([0-9]*\) \([0-9]*\) “\([^"]*\)” “\([^"]*\)”$/\1|\2|\3|\4|\5|\6|\7|\8|\9/”’);

“””

Is this really technically parsing, if you’re just outputting more text?  I’ll let the academics argue over that one.  It’s true that a native-C++ UDParser would likely yield better performance, and that these simple examples aren’t the most robust bits of code out there.  But I didn’t have time today to carefully craft an elegant, optimized extension.  I just wanted to load my data, to get the job done.  And these commands let me do so quickly and easily.

Leave a Reply

Get Started With Vertica Today

Subscribe to Vertica