Vertica’s data-load process has three steps: “Get the data”, “Unpack the data”, and “Parse the data.” If you look on our GitHub site, there are handy helpers for the first two of those steps, ExternalSource and ExternalFilter, that let you call out to shell scripts or command-line tools. But there’s no helper for parsing. Why not? Because you don’t need it!
Earlier today, I was trying to load a simple XML data set into a table:
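The data set itself isn’t reproduced here, but based on the results further down, it looked something like this (the element names are a hypothetical reconstruction):

```xml
<!-- Hypothetical reconstruction of sample_data.xml; element names are guesses -->
<cities>
  <record>
    <city>Cambridge, MA</city>
    <population>106038</population>
  </record>
  <record>
    <city>Arlington, MA</city>
    <population>42389</population>
  </record>
  <record>
    <city>Belmont, MA</city>
    <population>24194</population>
  </record>
</cities>
```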
Vertica doesn’t have a built-in XML parser. So this might look like it would be a real pain. But I got it loaded nice and quickly with just a little bit of scripting.
First, we need something that can parse this file. Fortunately, this can be done with just a few lines of Python:
import sys, xml.etree.ElementTree

for record in xml.etree.ElementTree.fromstringlist(sys.stdin):
    print('|'.join(key.text for key in record))
A very simplistic script: it reads the whole file into memory, and it assumes that the data is clean. But for this file it’s all we need. For more complicated inputs, we could make the script fancier, or install and use a third-party tool (such as xml_grep, available as an add-on package in some Linux distributions).
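For instance, if the file were too big to hold in memory, one “fancier” variant would stream it with the standard library’s iterparse instead. Here’s a sketch of that idea; the element names in the sample are hypothetical, matching the reconstruction above:

```python
import io
import xml.etree.ElementTree as ET

def stream_records(source):
    """Yield one pipe-delimited line per direct child of the root element,
    without holding the whole document in memory."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 1:  # just closed a direct child of the root
                yield "|".join(child.text or "" for child in elem)
                elem.clear()  # free the subtree we no longer need

# Hypothetical sample in the same shape as the file above:
sample = io.StringIO(
    "<cities><record><city>Cambridge, MA</city>"
    "<population>106038</population></record></cities>"
)
for line in stream_records(sample):
    print(line)  # Cambridge, MA|106038
```

In a real loader you would pass sys.stdin (or an open file) to stream_records instead of the StringIO sample.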
Now, what happens when we run that script on the raw data file?
$ ~/xmlparser.py < sample_data.xml
Cambridge, MA|106038
Arlington, MA|42389
Belmont, MA|24194
You may recognize this as the basic output format of vsql, our command-line client. Which means that Vertica can load it directly. If you’ve installed ExternalFilter (by checking out our GitHub repository and running "make install" in the shell_load_package directory), just do the following:
dbadmin=> CREATE TABLE cities (city VARCHAR, population INT);
dbadmin=> COPY cities FROM LOCAL 'sample_data.xml' WITH FILTER ExternalFilter('/path/to/xmlparser.py');
dbadmin=> SELECT * FROM cities;
     city      | population
---------------+------------
 Cambridge, MA |     106038
 Arlington, MA |      42389
 Belmont, MA   |      24194
(3 rows)
Of course, with ExternalFilter, you don’t have to write code at all. You have full access to the command-line tools installed on your server. So for example, you can whip up a sed script and get a simple Apache web-log loader:
dbadmin=> COPY weblogs FROM '/var/log/apache2/access.log' WITH FILTER ExternalFilter(cmd='sed ''s/^\([0-9\.]*\) \([^ ]*\) \([^ ]*\) \[\([^ ]* [^ ]*\)\] "\([^"]*\)" \([0-9]*\) \([0-9]*\) "\([^"]*\)" "\([^"]*\)"$/\1|\2|\3|\4|\5|\6|\7|\8|\9/''');
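If sed isn’t your thing, the same transformation is easy to read in Python. This is a sketch of the equivalent logic, with a made-up log line in Apache’s combined format:

```python
import re

# The same pattern as the sed one-liner: nine capture groups for the
# fields of Apache's combined log format, joined with pipes.
LOG_RE = re.compile(
    r'^([0-9.]*) ([^ ]*) ([^ ]*) \[([^ ]* [^ ]*)\] '
    r'"([^"]*)" ([0-9]*) ([0-9]*) "([^"]*)" "([^"]*)"$'
)

def to_pipes(line):
    """Turn one combined-format log line into a pipe-delimited row."""
    m = LOG_RE.match(line)
    return "|".join(m.groups()) if m else None

# Made-up example line:
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://example.com/start.html" "Mozilla/4.08"')
print(to_pipes(line))
# 127.0.0.1|-|frank|10/Oct/2000:13:55:36 -0700|GET /apache_pb.gif HTTP/1.0|200|2326|http://example.com/start.html|Mozilla/4.08
```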
Is this really technically parsing, if you’re just outputting more text? I’ll let the academics argue over that one. It’s true that a native-C++ UDParser would likely yield better performance, and that these simple examples aren’t the most robust bits of code out there. But I didn’t have time today to carefully craft an elegant, optimized extension. I just wanted to load my data, to get the job done. And these commands let me do so quickly and easily.