Vertica

Archive for the ‘data loading’ Category

Data-Driven Decision Making with the Vertica Analytics Platform

Physicians need access to a wealth of critical information from multiple systems in order to make life-saving decisions on a daily basis. Greg Gootee, Product Manager, MZI Healthcare, discusses how their new application, powered by the Vertica analytics platform, helps deliver better patient care through data-driven decision making. Delivering information in a timely manner is central to their application’s success. Check out this video shot at HP Discover 2013 to see how HP Vertica helps Greg and his team provide physicians with the information they need to make more accurate point-of-care decisions. In the video Mr. Gootee recounts how his Aunt may have avoided a tragic incident with better point-of-care services–the type of services MZI Healthcare and the Vertica analytics platform provide.

How to parse anything into Vertica using ExternalFilter

Vertica’s data-load process has three steps:  “Get the data”, “Unpack the data”, and “Parse the data.”  If you look on our Github site, there are handy helpers for the first two of those steps, ExternalSource and ExternalFilter, that let you call out to shell scripts or command-line tools.  But there’s no helper for parsing.  Why not?  Because you don’t need it!

Earlier today, I was trying to load a simple XML data set into a table:

“”” sample_data.xml:

<data>

<record>

<city>Cambridge, MA</city>

<population>106038</population>

</record>

<record>

<city>Arlington, MA</city>

<population>42389</population>

</record>

<record>

<city>Belmont, MA</city>

<population>24194</population>

</record>

</data>

“””

Vertica doesn’t have a built-in XML parser.  So this might look like it would be a real pain.  But I got it loaded nice and quickly with just a little bit of scripting.

First, we need something that can parse this file.  Fortunately, this can be done with just a few lines of Python:

“”” xmlparser.py:

#!/usr/bin/env python

import sys, xml.etree.ElementTree

for record in xml.etree.ElementTree.fromstringlist(sys.stdin).getchildren():

keys = record.getchildren()

print ‘|’.join(key.text for key in keys)

“””

A very simplistic script; it reads the whole file into memory and it assumes that the data is clean.  But on this file it’s all we need.  For more complicated inputs, we could make the script fancier or install and make use of a third-party tool (such as xml_grep, available as an add-on package in some Linux distributions).

Now, what happens when we run that script on the raw data file?

“””

$ ~/xmlparser.py < sample_data.xml

Cambridge, MA|106038

Arlington, MA|42389

Belmont, MA|24194

“””

You may recognize this as the basic output format of vsql, our command-line client.  Which means that Vertica can load it directly.  If you’ve installed ExternalFilter (by checking out our Github repository and running “make install” in the shell_load_package directory), just do the following:

“””

dbadmin=> CREATE TABLE cities (city VARCHAR, population INT);

dbadmin=> COPY cities FROM LOCAL ‘sample_data.xml’ WITH FILTER ExternalFilter(‘/path/to/xmlparser.py’);

Rows Loaded

————-

3

(1 row)

 

dbadmin=> SELECT * FROM cities;

city      | population

—————+————

Cambridge, MA |     106038

Arlington, MA |      42389

Belmont, MA   |      24194

(3 rows)

“””

Of course, with ExternalFilter, you don’t have to write code at all.  You have full access to the command-line tools installed on your server.  So for example, you can whip up a sed script and get a simple Apache web-log loader:

“””

dbadmin=> COPY weblogs FROM ‘/var/log/apache2/access.log’ WITH FILTER ExternalFilter(cmd=’sed ”s/^\([0-9\.]*\) \([^ ]*\) \([^ ]*\) \[\([^ ]* [^ ]*\)\] “\([^"]*\)” \([0-9]*\) \([0-9]*\) “\([^"]*\)” “\([^"]*\)”$/\1|\2|\3|\4|\5|\6|\7|\8|\9/”’);

“””

Is this really technically parsing, if you’re just outputting more text?  I’ll let the academics argue over that one.  It’s true that a native-C++ UDParser would likely yield better performance, and that these simple examples aren’t the most robust bits of code out there.  But I didn’t have time today to carefully craft an elegant, optimized extension.  I just wanted to load my data, to get the job done.  And these commands let me do so quickly and easily.

How to Load New Data and Modify Existing Data Simultaneously

Many Vertica customers tell us “we have an OLTP workload” which is not Vertica’s architectural sweet spot. However, when we dig into what they are actually doing, it often turns out that they are simply bulk loading mostly new data with some small number of updates to existing rows. In Vertica 6, we have added support for the MERGE statement to allow users to do just that.

Let’s look at the example shown in Figure 1. In this example, users and their numbers of appearances at a specific location (given by X, Y columns) are being merged from the table New_Location (a) into the existing table Location (b), and the merged results are shown in (c), with the updated and new data in pink. The user with UserID 1 comes to Burger King again, thus his total number of appearances must be updated to 2; while users 2 and 3 go to a new location, so their data must be inserted.

(more…)

Get Started With Vertica Today

Subscribe to Vertica