Vertica

Archive for the ‘User-Defined Load (UDL)’ Category

On the Trail of a Red-Tailed Hawk! – Part 2

Part 2:  Why User-Defined-Loads is music to my ears!

Webcams and hawks – really?  I heard a healthy dose of skepticism expressed about what User-defined Loads can accomplish.  So how about I show something a bit more practical?

I’m a fan of music of all flavors. Among the various devices I’ve had connected to my stereo at home are an LP player and an open-reel tape deck.  While it’s fun to get out the old vinyl records from time to time, they’re a bit cumbersome and don’t travel very well; more often I just want to pull up iTunes and hit “Play”.  So if I have a recording that I particularly enjoy, I try to digitize it.  But it is a huge pain to split it up into tracks.  I thought – why not automate it? You guessed right – with UDL, of course!

Vertica’s not much for music; natively it couldn’t tell the difference between The Beach Boys and Van Halen.  In fact, it can’t make heads or tails of most any music format.  So I need one (or both) of Vertica’s two other UDL tools.  One is a User-Defined Parser:  I could write a parser that directly parsed the WAV format, for example.  This would give me the best performance, but would take some time to code.

The other is a User-Defined Filter:  A Filter takes in data and processes it and emits it in a new form.  They’re intended for things like decompression and transcoding.  It’d be nice to support FLAC files and the like (c’mon, you can’t use lossy MP3 files for this!; this is the good stuff!), and my computer has more free time than I do, so a simple Filter seems like the way to go.

We conveniently have an example external-Filter example on Github as well; it opens up an external process and passes data through it via pipes.  So let’s go install it:

$ cd Vertica-Extension-Packages/shell_load_package
$ make install

Now, let’s once again make a table and try loading some data:

>>> CREATE TABLE music (filename varchar(64), time_index float, L float, R float)
     ORDER BY filename, time_index;
>>> COPY music (filename AS ‘Sample’, time_index, L, R)
     FROM ‘/home/aseering/Music/Recordings/Sample.flac’
     FIXEDWIDTH COLSIZES (17, 17, 17)
     WITH FILTER ExternalFilter(cmd=’sox –type flac – –type dat -’);

The FIXEDWIDTH parser takes advantage of the format of the ‘.dat’ file:  It’s three floating-point columns; each floating-point number can be up to 16 characters wide (and is padded with spaces to that length if it’s smaller), and there’s one extra leading space as a separator.

Of course, using ExampleSource, I can load directly from my audio input as well, assuming a properly-configured sound system on the computer:

>>> COPY music (filename AS ‘Sample’, time_index, data filler int, L AS data, R AS data)
     FIXEDWIDTH COLSIZES (17, 18)
     WITH SOURCE ExternalSource(cmd=’arecord -d 10′)
     FILTER ExternalFilter(cmd=’sox –type wav – –type dat -’);

Once I’ve loaded the data, I have to split up tracks.  On recordings of old records, the only real way I know to split tracks is to look for areas where the audio is quiet for a while.  Vertica can do this with window analytics.  The following is a good start, though there is of course a ton of flexibility if you want to tune the algorithm:

>>> SELECT sqrt(avg(L*L + R*R)
         over (order by time_index rows between 100 preceding and 100 following))
     AS avg_volume
     FROM music WHERE filename = ‘Sample’;

And there we go!  Now I can enjoy my Carpenters on the move.

Most people think of databases as tools for managing payroll or user accounts, or diving through sales records.  Vertica can do all of that.  But Vertica 6 is so much cooler than that:  It gives you a platform for running your own custom code in a robust, distributed way.  You can mix-and-match UDLs (and all the other UDx types) however you want; develop libraries of UDL tools; even share those tools on Github if you want, or download tools that others have shared.  (We take pull requests!)  You can also use UDLs with external tables to provide a real queryable table that’s backed by any data store that you want, so you can run any task through Vertica’s parallel execution engine.

Are you convinced now? Whether you are or not, I’ve spend the past few hours digitizing several of my favorite LPs.  Now I can sit by my window with my headphones on listening to music and working away until the hawk shows up. Life is good!

On the Trail of a Red-Tailed Hawk! – Part 1

by Adam Seering

A couple weeks ago, Vertica moved into our cool new space in Cambridge, MA.  I’d been eagerly awaiting the commute; a scenic bike ride along the Charles River.  I didn’t expect nature to follow me all the way up to the 9th floor, though.  So imagine my excitement when, on our first day, a magnificent red-tailed hawk swoops down and lands right on the window ledge!  We’re at the top of the tallest building around, so I guess the ledge works well as a cliff.  The hawk has become a regular visitor, and the unofficial Vertica mascot.  After all, it’s “Simply Fast”.   As much as we’d like to sit there waiting for it to show up all day, we all do have work to get done.  So I figured – why not put a webcam next to the window to analyze its habits and and be alerted when it shows up.

And with Vertica 6, that’s easy!  As it turns out, I’ve been working with the Vertica SDK team on User-Defined Load (UDL), a collection of SDK APIs that lets you ingest data in any arbitrary format from any arbitrary source.  Loading files and tables is pretty dull, so instead, I’ll use Vertica to track our mascot!  In this blog, I’d like to show you how I did it.

The first step is to get Vertica to read data not from a file, but from a webcam.  For this, we will need a User-Defined Source.  UDSources are a part of the UDL SDK.  As far as Vertica is concerned, all they do is emit data.  They could get that data from anywhere:  A file, a remote server, a webcam (!), an API call into a special service; they could even make the data up.

We could write a custom User-Defined Source in C++, that linked in OpenCV or a similar imaging library and interacted with the webcam directly.  But that takes work.  The command-line program “streamer” already provides a simple way to grab a snapshot from a webcam.  Our Github repository contains an example UDSource that runs a program and passes its output to Vertica.  We can install it as follows:

$ git clone https://github.com/vertica/Vertica-Extension-Packages.git
$ cd Vertica-Extension-Packages/shell_load_package

$ make install

Now we have a UDSource that can load our data.  So we need a table to load it into.  At a Vertica prompt:

CREATE TABLE frames (frame_name varchar(64), X int, Y int, R int, G int, B int)
             ORDER BY frame_name, X, Y;

Note the “ORDER BY” clause.  This is a Vertica extension that tells Vertica how to keep the data sorted.  This is the order that the data will come sorted in; we might as well keep it that way so that Vertica can search through it faster.

Now, let’s take a quick look at one line of the data that our command outputs:

0,0: ( 29, 19, 17)  #1D1311  rgb(29,19,17)

It’s a little bit unusual, but Vertica’s built-in Delimited parser can handle it.  So, we can go ahead and run the following statement to load the data:

>>> COPY frames (frame_name AS ‘frame 1′,
                X delimiter ‘,’,
                Y delimiter ‘:’,
                blank_space filler varchar(10) delimiter ‘(‘,
                R delimiter ‘,’,
                G delimiter ‘,’,
                B delimiter ‘)’,
                rest_of_line filler varchar(50))
WITH SOURCE ExternalSource(cmd=’streamer -f jpeg -o /dev/stdout | convert – -compress none txt:-’);

Now we have our hawk in Vertica!  So we can do some analytics.  For example, we can run a simple query to get the brightness of each frame that I’ve loaded:

>>> SELECT frame_name, avg((R+G+B)/3) AS brightness FROM frames GROUP BY frame_name;

And if I have a frame where the image suddenly gets darker, I know the bird has flown in.

If I want, I could easily do analytics on a window of the data; look for images where one part of the image changes more than others; even write a UDF that does proper object recognition.  Pick out if the hawk is eyeing its prey or snacking on a rodent (which it has done more than once).  But this turns out to be good enough for us here in Cambridge!

Do you have an idea for a UDSource?  Have your own local hawk, or any other project you might be working on?  Download Vertica’s Community Edition and try our UDSource, or create your own and post it to Github.  (We take pull requests!)

Happy Hawking!


Get Started With Vertica Today

Subscribe to Vertica