Archive for November, 2012

Top 3 Discussions from TDWI 2012

At last week’s TDWI Conference in Orlando, FL, there was a general buzz about how to derive value from Big Data for competitive advantage. That makes perfect sense, given that the conference theme was to “Get Smarter with Big Data.”

Naturally, attendees — largely data scientists, business analysts, architects, and developers — were on hand to learn more about which technologies to recommend back at headquarters for their Big Data initiatives. We at HP Vertica were busy addressing a range of questions at our booth. However, the conversations went beyond features, benefits, and how we complement BI, ETL, and technologies like Hadoop. Attendees were also generally interested in learning from their peers and presenters how to:

    1. Build a Business Case

    One attendee from a Fortune 100 company understood how his company could gain greater analytics performance from a columnar database (such as the HP Vertica Analytics Platform). However, he acknowledged that his company needed to step back and learn how the insight from Big Data analytics would address top-line objectives: reduce costs, generate revenue, differentiate, and ultimately improve customer satisfaction.

    2. Define the Meaning and Value of Big Data

    My colleague, Chris Selland, participated in a panel (Business Value: The Fourth V of Big Data) that answered a range of questions from the audience. It was clear that attendees still struggled with a common definition of Big Data. The three-V analogy (Velocity, Volume, and Variety) provided some clarity, but the fourth V (Value) underscored how Big Data analytics supports overall business benefits. For an example, check out our recently published case study — Cardlytics Serves Up Success with HP Vertica.

    3. Adopt Common Use Cases

    There were some interesting lunch-time discussions centered on how Big Data analytics could address common use cases by industry. In health care, for example, providers are using analytics to improve preventive care based on family history, test results, and similar data. A large logistics company is using analytics for route optimization to reduce fuel costs and overall emissions. Clickstream analytics, fraud detection, inventory management, and myriad other use cases are emerging as companies gain greater insight from their Big Data, once they have removed the technology barriers (performance, scalability, and overall TCO).

    How does your organization plan to derive value from Big Data? We’d love to hear your use case.

Yottabytes, Zillionics, and More at Defrag 2012

HP Vertica sponsored this year’s Defrag Conference, so I had the opportunity to attend the event in Broomfield, CO. It’s close to the startup community in Boulder and right in my backyard. It was a great conference with an intimate setting. Defrag started in 2007 as a forum for exploring information overload and building implicit tools for the web. In years 2-4, Defrag’s focus shifted to things like enterprise collaboration, social media, and Big Data.

A Future of Yottabytes?

There were a lot of good takeaways and the theme that continues to permeate everything these days is data. The keynotes were great, led by Kevin Kelly, Founding Executive Editor, Wired Magazine. Kelly noted: “Technology and human activity are so global that they operate together as if they were a geological force.” He described this global system of technology deployed around the planet as an “emerging superorganism.” In an industry where I’m focused on how I solve problems today, it was interesting to take a step back and think where technology and data will be 20, 30, and 50 years from now.

When the topic turned to the future of data, Kelly talked about the yottabyte, which is one septillion (10^24) bytes. We are truly experiencing an explosion of data (he even included the math to illustrate the amount being generated), and we will eventually be in the yottabytes. The problem is that there is no metric prefix after yotta. He gave some humorous suggestions, such as Lotta or Hella, but it is largely uncharted territory. He followed up with a more realistic approach to naming beyond yotta and talked about Zillionics. You can read more about Kelly and his thoughts at
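For a sense of scale, a quick sketch in Python of the SI prefix ladder (note: yotta, 10^24, was the largest standardized prefix at the time of writing):

```python
# SI byte-scale prefixes: each step up the ladder is a factor of 10**3.
# 'yotta' (10**24) was the largest standardized prefix as of 2012.
PREFIXES = ['kilo', 'mega', 'giga', 'tera', 'peta', 'exa', 'zetta', 'yotta']

def prefix_exponent(name):
    # Exponent of 10 for a given SI prefix, e.g. 'yotta' -> 24.
    return 3 * (PREFIXES.index(name) + 1)

print(prefix_exponent('yotta'))  # 24 -- one septillion
```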

Solving Big Data Problems Today

Back to the problems of today, we are helping many of our clients solve this explosion of data problem. Similar to how Kelly draws analogies that technology is now part of our ecosystem, data permeates organizations both internally and externally. It has also become a part of us. Companies that will be around the next 50 years are the ones that will adapt and drive decisions based on information.

This goes for HP Vertica, too. We are constantly adapting and changing to the information that we receive from our clients and the marketplace. HP Vertica allows you to adapt quickly to the continuously evolving, next generations of hardware, software, and data.

Over this Thanksgiving, give thanks and take some time to think about the future of technology, how you will better deal with the explosion of data, and a good name for something after yotta. Maybe in the spirit of technology, we can put it on the web for a vote.

I think the Bradabyte has a nice ring to it….

How to parse anything into Vertica using ExternalFilter

Vertica’s data-load process has three steps:  “Get the data”, “Unpack the data”, and “Parse the data.”  If you look on our Github site, there are handy helpers for the first two of those steps, ExternalSource and ExternalFilter, that let you call out to shell scripts or command-line tools.  But there’s no helper for parsing.  Why not?  Because you don’t need it!
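The contract ExternalFilter expects of the command it runs is simply "read raw data on stdin, write transformed rows on stdout." A toy sketch of that shape (illustrative only, not shipped code — a real filter would unpack or reshape the data rather than uppercase it):

```python
#!/usr/bin/env python
# Toy external filter: transforms rows read from an input stream and
# writes them to an output stream -- the stdin-to-stdout contract that
# ExternalFilter expects of any command-line tool it invokes.
import sys

def filter_stream(lines):
    # Uppercase each line; a real filter would unpack or reshape data.
    return [line.strip().upper() for line in lines]

def main():
    for row in filter_stream(sys.stdin):
        print(row)

if __name__ == '__main__':
    main()
```

Any program with this shape — shell script, sed one-liner, or a few lines of Python — can slot into the "Unpack" step of a COPY.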

Earlier today, I was trying to load a simple XML data set into a table:

sample_data.xml:

<data>
  <record>
    <city>Cambridge, MA</city>
    <population>106038</population>
  </record>
  <record>
    <city>Arlington, MA</city>
    <population>42389</population>
  </record>
  <record>
    <city>Belmont, MA</city>
    <population>24194</population>
  </record>
</data>
Vertica doesn’t have a built-in XML parser.  So this might look like it would be a real pain.  But I got it loaded nice and quickly with just a little bit of scripting.

First, we need something that can parse this file.  Fortunately, this can be done with just a few lines of Python:


#!/usr/bin/env python
import sys, xml.etree.ElementTree

for record in xml.etree.ElementTree.fromstringlist(sys.stdin).getchildren():
    keys = record.getchildren()
    print '|'.join(key.text for key in keys)


A very simplistic script; it reads the whole file into memory and it assumes that the data is clean.  But on this file it’s all we need.  For more complicated inputs, we could make the script fancier or install and make use of a third-party tool (such as xml_grep, available as an add-on package in some Linux distributions).
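As one example of fancier, a streaming variant using the standard library's iterparse avoids holding the whole document in memory. This is a sketch under the assumption that each row is a flat element (named 'record' here for illustration) whose children are the column values:

```python
#!/usr/bin/env python
# Streaming variant of the script above: xml.etree.ElementTree.iterparse
# yields each element as its close tag arrives, so the whole file never
# sits in memory at once. (Sketch; assumes flat row elements.)
import sys
import xml.etree.ElementTree as ET

def rows(stream, record_tag):
    for event, elem in ET.iterparse(stream):
        if elem.tag == record_tag:
            yield '|'.join(child.text for child in elem)
            elem.clear()  # free the subtree we just emitted

if __name__ == '__main__':
    for row in rows(sys.stdin, 'record'):
        print(row)
```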

Now, what happens when we run that script on the raw data file?


$ ~/ < sample_data.xml

Cambridge, MA|106038

Arlington, MA|42389

Belmont, MA|24194


You may recognize this as the basic output format of vsql, our command-line client.  Which means that Vertica can load it directly.  If you’ve installed ExternalFilter (by checking out our Github repository and running “make install” in the shell_load_package directory), just do the following:


dbadmin=> CREATE TABLE cities (city VARCHAR, population INT);

dbadmin=> COPY cities FROM LOCAL 'sample_data.xml' WITH FILTER ExternalFilter('/path/to/');
 Rows Loaded
-------------
           3
(1 row)

dbadmin=> SELECT * FROM cities;
     city      | population
---------------+------------
 Cambridge, MA |     106038
 Arlington, MA |      42389
 Belmont, MA   |      24194
(3 rows)


Of course, with ExternalFilter, you don’t have to write code at all.  You have full access to the command-line tools installed on your server.  So for example, you can whip up a sed script and get a simple Apache web-log loader:


dbadmin=> COPY weblogs FROM '/var/log/apache2/access.log' WITH FILTER ExternalFilter(cmd='sed ''s/^\([0-9\.]*\) \([^ ]*\) \([^ ]*\) \[\([^ ]* [^ ]*\)\] "\([^"]*\)" \([0-9]*\) \([0-9]*\) "\([^"]*\)" "\([^"]*\)"$/\1|\2|\3|\4|\5|\6|\7|\8|\9/''');
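The same transformation can also be written as a short Python filter, which some may find easier to read than sed. A sketch of turning Apache combined-log lines into pipe-delimited rows (the regex mirrors the sed pattern; field layout per the combined log format):

```python
#!/usr/bin/env python
# Turn Apache combined-log lines into pipe-delimited rows, mirroring
# the sed one-liner above. (Illustrative sketch.)
import re
import sys

# host ident user [timestamp] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'^([0-9\.]*) ([^ ]*) ([^ ]*) \[([^ ]* [^ ]*)\] '
    r'"([^"]*)" ([0-9]*) ([0-9]*) "([^"]*)" "([^"]*)"$')

def to_row(line):
    # Return the nine fields joined by '|', or None if the line
    # does not match the combined log format.
    m = LOG_RE.match(line.strip())
    return '|'.join(m.groups()) if m else None

if __name__ == '__main__':
    for line in sys.stdin:
        row = to_row(line)
        if row:
            print(row)
```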


Is this really technically parsing, if you’re just outputting more text?  I’ll let the academics argue over that one.  It’s true that a native-C++ UDParser would likely yield better performance, and that these simple examples aren’t the most robust bits of code out there.  But I didn’t have time today to carefully craft an elegant, optimized extension.  I just wanted to load my data, to get the job done.  And these commands let me do so quickly and easily.

Capitalizing on the Potential of R and HP Vertica Analytics Platform

With the release of HP Vertica v6, we introduced a no-charge download of a new package that incorporates R, one of the most popular open-source data mining and statistics software offerings on the market today.

With this package, you can implement advanced analytics and sift through your data quickly to find anomalies using advanced R data mining algorithms. For more technical details on this integration, see How to Implement “R” in Vertica.

It’s but one part of our open platform approach, integrating and supporting a range of tools and technologies—from Hadoop to R to BI/visualization environments like Tableau —in affording you more flexibility and choice to derive maximum insight from your Big Data initiative.

“Got it,” you say. “But can you share some common use cases to spark ideas for my organization?”

For a complete list of use cases and additional details on the technology behind this integration, download our just-released white paper: “R” you ready? Turning Big Data into big value with the HP Vertica Analytics Platform and R.

That said, the use cases are only limited by your imagination—everything from behavior analytics (making meaningful predictions based on past and current buying behavior) to claims analyses (identifying anomalies, such as fraud, or identifying product defects early in the product release phase). In fact, the best ideas, and even more gratifying to us, implementations, are happening today in our user community.

One of our digital intelligence customers uses Hadoop’s HDFS to store raw behavioral input data, Hadoop MapReduce to find conversions (regular-expression processing) by determining what type of user clicked on a particular advertisement, and the HP Vertica Analytics Platform to store and operationalize high-value business data.

In addition, the company’s big data solution supports reporting and analytics via Tableau and the R programming language as well as custom ETL. This combination of technologies helps this customer achieve faster insights that are delivered more consistently with less administrative overhead and lower-cost, commodity hardware.

How are you using R with HP Vertica Analytics Platform? We’re all ears.

HP Vertica Analytics Platform ready to grow with Cardlytics

Recently, we’ve been spending more time with our customers in order to get a better understanding of how they are using HP Vertica, and why they chose the HP Vertica Analytics Platform as their analytics database.  One such customer, Cardlytics, was generous enough to spend some time with us and talk at length about how they use HP Vertica to accelerate their Big Data analytics and improve their ability to target new customers with their transaction-driven marketing programs.

To find out more about how Cardlytics is using HP Vertica to power their Big Data analytics program and increase revenue and customer satisfaction, please take some time to read the recently-published case study, or watch one of the Cardlytics customer videos, using the links below:

UPDATE: We just found out that The Economist published a great article on Cardlytics on October 27th.  You can check it out at:
