Vertica

Author Archive

How to parse anything into Vertica using ExternalFilter

Vertica’s data-load process has three steps:  “Get the data”, “Unpack the data”, and “Parse the data.”  If you look on our Github site, there are handy helpers for the first two of those steps, ExternalSource and ExternalFilter, that let you call out to shell scripts or command-line tools.  But there’s no helper for parsing.  Why not?  Because you don’t need it!

Earlier today, I was trying to load a simple XML data set into a table:

“”” sample_data.xml:

<data>

<record>

<city>Cambridge, MA</city>

<population>106038</population>

</record>

<record>

<city>Arlington, MA</city>

<population>42389</population>

</record>

<record>

<city>Belmont, MA</city>

<population>24194</population>

</record>

</data>

“””

Vertica doesn’t have a built-in XML parser.  So this might look like it would be a real pain.  But I got it loaded nice and quickly with just a little bit of scripting.

First, we need something that can parse this file.  Fortunately, this can be done with just a few lines of Python:

“”” xmlparser.py:

#!/usr/bin/env python

import sys, xml.etree.ElementTree

for record in xml.etree.ElementTree.fromstringlist(sys.stdin).getchildren():

keys = record.getchildren()

print ‘|’.join(key.text for key in keys)

“””

A very simplistic script; it reads the whole file into memory and it assumes that the data is clean.  But on this file it’s all we need.  For more complicated inputs, we could make the script fancier or install and make use of a third-party tool (such as xml_grep, available as an add-on package in some Linux distributions).

Now, what happens when we run that script on the raw data file?

“””

$ ~/xmlparser.py < sample_data.xml

Cambridge, MA|106038

Arlington, MA|42389

Belmont, MA|24194

“””

You may recognize this as the basic output format of vsql, our command-line client.  Which means that Vertica can load it directly.  If you’ve installed ExternalFilter (by checking out our Github repository and running “make install” in the shell_load_package directory), just do the following:

“””

dbadmin=> CREATE TABLE cities (city VARCHAR, population INT);

dbadmin=> COPY cities FROM LOCAL ‘sample_data.xml’ WITH FILTER ExternalFilter(‘/path/to/xmlparser.py’);

Rows Loaded

————-

3

(1 row)

 

dbadmin=> SELECT * FROM cities;

city      | population

—————+————

Cambridge, MA |     106038

Arlington, MA |      42389

Belmont, MA   |      24194

(3 rows)

“””

Of course, with ExternalFilter, you don’t have to write code at all.  You have full access to the command-line tools installed on your server.  So for example, you can whip up a sed script and get a simple Apache web-log loader:

“””

dbadmin=> COPY weblogs FROM ‘/var/log/apache2/access.log’ WITH FILTER ExternalFilter(cmd=’sed ”s/^\([0-9\.]*\) \([^ ]*\) \([^ ]*\) \[\([^ ]* [^ ]*\)\] “\([^”]*\)” \([0-9]*\) \([0-9]*\) “\([^”]*\)” “\([^”]*\)”$/\1|\2|\3|\4|\5|\6|\7|\8|\9/”’);

“””

Is this really technically parsing, if you’re just outputting more text?  I’ll let the academics argue over that one.  It’s true that a native-C++ UDParser would likely yield better performance, and that these simple examples aren’t the most robust bits of code out there.  But I didn’t have time today to carefully craft an elegant, optimized extension.  I just wanted to load my data, to get the job done.  And these commands let me do so quickly and easily.

On the Trail of a Red-Tailed Hawk! – Part 2

Part 2:  Why User-Defined-Loads is music to my ears!

Webcams and hawks – really?  I heard a healthy dose of skepticism expressed about what User-defined Loads can accomplish.  So how about I show something a bit more practical?

I’m a fan of music of all flavors. Among the various devices I’ve had connected to my stereo at home are an LP player and an open-reel tape deck.  While it’s fun to get out the old vinyl records from time to time, they’re a bit cumbersome and don’t travel very well; more often I just want to pull up iTunes and hit “Play”.  So if I have a recording that I particularly enjoy, I try to digitize it.  But it is a huge pain to split it up into tracks.  I thought – why not automate it? You guessed right – with UDL, of course!

Vertica’s not much for music; natively it couldn’t tell the difference between The Beach Boys and Van Halen.  In fact, it can’t make heads or tails of most any music format.  So I need one (or both) of Vertica’s two other UDL tools.  One is a User-Defined Parser:  I could write a parser that directly parsed the WAV format, for example.  This would give me the best performance, but would take some time to code.

The other is a User-Defined Filter:  A Filter takes in data and processes it and emits it in a new form.  They’re intended for things like decompression and transcoding.  It’d be nice to support FLAC files and the like (c’mon, you can’t use lossy MP3 files for this!; this is the good stuff!), and my computer has more free time than I do, so a simple Filter seems like the way to go.

We conveniently have an example external-Filter example on Github as well; it opens up an external process and passes data through it via pipes.  So let’s go install it:

$ cd Vertica-Extension-Packages/shell_load_package
$ make install

Now, let’s once again make a table and try loading some data:

>>> CREATE TABLE music (filename varchar(64), time_index float, L float, R float)
     ORDER BY filename, time_index;
>>> COPY music (filename AS ‘Sample’, time_index, L, R)
     FROM ‘/home/aseering/Music/Recordings/Sample.flac’
     FIXEDWIDTH COLSIZES (17, 17, 17)
     WITH FILTER ExternalFilter(cmd=’sox –type flac – –type dat -‘);

The FIXEDWIDTH parser takes advantage of the format of the ‘.dat’ file:  It’s three floating-point columns; each floating-point number can be up to 16 characters wide (and is padded with spaces to that length if it’s smaller), and there’s one extra leading space as a separator.

Of course, using ExampleSource, I can load directly from my audio input as well, assuming a properly-configured sound system on the computer:

>>> COPY music (filename AS ‘Sample’, time_index, data filler int, L AS data, R AS data)
     FIXEDWIDTH COLSIZES (17, 18)
     WITH SOURCE ExternalSource(cmd=’arecord -d 10′)
     FILTER ExternalFilter(cmd=’sox –type wav – –type dat -‘);

Once I’ve loaded the data, I have to split up tracks.  On recordings of old records, the only real way I know to split tracks is to look for areas where the audio is quiet for a while.  Vertica can do this with window analytics.  The following is a good start, though there is of course a ton of flexibility if you want to tune the algorithm:

>>> SELECT sqrt(avg(L*L + R*R)
         over (order by time_index rows between 100 preceding and 100 following))
     AS avg_volume
     FROM music WHERE filename = ‘Sample';

And there we go!  Now I can enjoy my Carpenters on the move.

Most people think of databases as tools for managing payroll or user accounts, or diving through sales records.  Vertica can do all of that.  But Vertica 6 is so much cooler than that:  It gives you a platform for running your own custom code in a robust, distributed way.  You can mix-and-match UDLs (and all the other UDx types) however you want; develop libraries of UDL tools; even share those tools on Github if you want, or download tools that others have shared.  (We take pull requests!)  You can also use UDLs with external tables to provide a real queryable table that’s backed by any data store that you want, so you can run any task through Vertica’s parallel execution engine.

Are you convinced now? Whether you are or not, I’ve spend the past few hours digitizing several of my favorite LPs.  Now I can sit by my window with my headphones on listening to music and working away until the hawk shows up. Life is good!

On the Trail of a Red-Tailed Hawk! – Part 1

by Adam Seering

A couple weeks ago, Vertica moved into our cool new space in Cambridge, MA.  I’d been eagerly awaiting the commute; a scenic bike ride along the Charles River.  I didn’t expect nature to follow me all the way up to the 9th floor, though.  So imagine my excitement when, on our first day, a magnificent red-tailed hawk swoops down and lands right on the window ledge!  We’re at the top of the tallest building around, so I guess the ledge works well as a cliff.  The hawk has become a regular visitor, and the unofficial Vertica mascot.  After all, it’s “Simply Fast”.   As much as we’d like to sit there waiting for it to show up all day, we all do have work to get done.  So I figured – why not put a webcam next to the window to analyze its habits and and be alerted when it shows up.

And with Vertica 6, that’s easy!  As it turns out, I’ve been working with the Vertica SDK team on User-Defined Load (UDL), a collection of SDK APIs that lets you ingest data in any arbitrary format from any arbitrary source.  Loading files and tables is pretty dull, so instead, I’ll use Vertica to track our mascot!  In this blog, I’d like to show you how I did it.

The first step is to get Vertica to read data not from a file, but from a webcam.  For this, we will need a User-Defined Source.  UDSources are a part of the UDL SDK.  As far as Vertica is concerned, all they do is emit data.  They could get that data from anywhere:  A file, a remote server, a webcam (!), an API call into a special service; they could even make the data up.

We could write a custom User-Defined Source in C++, that linked in OpenCV or a similar imaging library and interacted with the webcam directly.  But that takes work.  The command-line program “streamer” already provides a simple way to grab a snapshot from a webcam.  Our Github repository contains an example UDSource that runs a program and passes its output to Vertica.  We can install it as follows:

$ git clone https://github.com/vertica/Vertica-Extension-Packages.git
$ cd Vertica-Extension-Packages/shell_load_package

$ make install

Now we have a UDSource that can load our data.  So we need a table to load it into.  At a Vertica prompt:

CREATE TABLE frames (frame_name varchar(64), X int, Y int, R int, G int, B int)
             ORDER BY frame_name, X, Y;

Note the “ORDER BY” clause.  This is a Vertica extension that tells Vertica how to keep the data sorted.  This is the order that the data will come sorted in; we might as well keep it that way so that Vertica can search through it faster.

Now, let’s take a quick look at one line of the data that our command outputs:

0,0: ( 29, 19, 17)  #1D1311  rgb(29,19,17)

It’s a little bit unusual, but Vertica’s built-in Delimited parser can handle it.  So, we can go ahead and run the following statement to load the data:

>>> COPY frames (frame_name AS ‘frame 1′,
                X delimiter ‘,’,
                Y delimiter ‘:’,
                blank_space filler varchar(10) delimiter ‘(‘,
                R delimiter ‘,’,
                G delimiter ‘,’,
                B delimiter ‘)’,
                rest_of_line filler varchar(50))
WITH SOURCE ExternalSource(cmd=’streamer -f jpeg -o /dev/stdout | convert – -compress none txt:-‘);

Now we have our hawk in Vertica!  So we can do some analytics.  For example, we can run a simple query to get the brightness of each frame that I’ve loaded:

>>> SELECT frame_name, avg((R+G+B)/3) AS brightness FROM frames GROUP BY frame_name;

And if I have a frame where the image suddenly gets darker, I know the bird has flown in.

If I want, I could easily do analytics on a window of the data; look for images where one part of the image changes more than others; even write a UDF that does proper object recognition.  Pick out if the hawk is eyeing its prey or snacking on a rodent (which it has done more than once).  But this turns out to be good enough for us here in Cambridge!

Do you have an idea for a UDSource?  Have your own local hawk, or any other project you might be working on?  Download Vertica’s Community Edition and try our UDSource, or create your own and post it to Github.  (We take pull requests!)

Happy Hawking!


Get Started With Vertica Today

Subscribe to Vertica