
Defining Big Data

Let’s start with an easy question.  What’s “Big Data”?  Fortunately, I read the answer to that in one of my favorite tech news sources just the other day:  The answer, for those who won’t bother with the link, is “Big data is any data that doesn’t fit well into tables and that generally responds poorly to manipulation by SQL” … “A Twitter feed is Big Data; the census isn’t. Images, graphical traces, Call Detail Records (CDRs) from telecoms companies, web logs, social data, RFID output can all be Big Data. Lists of your employees, customers, products are not.”

That’s great, except that it is self-contradictory!  5 out of the 7 things the author considers “big data” are not only susceptible to manipulation by SQL (in a well-designed database of course), but have representation on Vertica’s “paying customers” list.  Twitter is a customer (and I thank them for their ability to simultaneously give us props and jabs in venues like ACM SIGMOD).  We hardly ever lose in the CDR space (or any xDR, really).  Zynga has petabytes of what could be loosely described as “web logs” and “social data” stored in Vertica.  The evidence list becomes long and boring quite quickly, so I won’t get into how the other 2 out of 7 “Big Data” areas are, as presented, too nebulous to decide one way or the other.

I can’t claim to have a definitive definition of “Big Data”.  But I can tell you that for each meaningful result (such as a sale of a product), there are many website clicks made, and for each “click” there are many impressions (things that are presented to the user but not clicked).  If you want to analyze these things, and maybe run some tests where you try experiments on your customers and prospects to see what does the best job of reeling them in, you will strain the abilities of single-machine processing, “traditional” RDBMSs, and many statistics packages and algorithms (yes, including your trusty Excel spreadsheet).  Then there is machine-generated data.  The handful of self-aware devices you own (your desktop PC, laptop, GPS-enabled smart phone, smart meter, car, refrigerator, etc.) have much more free time with which to generate “Big Data” than you do.  (For example, a fair-sized cable MSO has a “sensor network” with 20 million networked devices that never rest, producing 10+ billion rows a day.)

So now that the definition of “Big Data” is as clear as mud, let us next consider structured vs. unstructured data.  I have heard people say that “web logs are unstructured data”.  This is outright silly.  The average web log is entirely machine-generated one line at a time, and to do any serious analysis on it you are going to have to parse it into a format with some amount of structure (at least the date and time, session, page, etc.).  Sure, it can be stored as “unstructured data” in your favorite file system, but that’s a simple matter of procrastination on the issue of writing (or finding that someone has already written) a parser.  On the other hand, Twitter data, with its rigid 140-character limit, is quite “structured”, but figuring out what it “means” is nonetheless a respectable challenge.

So instead, I would implore you to consider “machine information” versus “human information”.  It is probably true that, byte for byte, there is 10x more “human” information.  The slide deck our sales guys use takes up 10x more space on disk than the spreadsheet that contains the funnel and prospect list.  Viral videos probably (I am not privy to the hard data) consume 10x more disk space than the IPDRs and web logs recording their accesses.

But while disk size is a fun, quotable metric, it says little about the nature of the “Big Data” problem you have to solve.  Instead, consider whether you have “machine” or “human” information.  You should be able to tell at a glance, and this will help you choose the right tools.  If it’s machine-generated financial trades, a scale-out SQL database with some time series analytics extensions will do nicely.  If it’s the tweets of Twitter twits, you can use a structured data tool, but you’re going to need some (in a very loose sense of the word) natural language sentiment analysis and graph processing packages.  If it is a bunch of PDFs, Word documents, HTML pages, PowerPoint presentations, and so on scattered across eleventeen different web servers, CMS systems, and file systems in your enterprise, you may need a high-powered “human information” system like Autonomy (and if you are an intelligence agency that needs to watch every video broadcast worldwide to gauge international sentiment, I think they can help you with that too…).

There is a point to all of this.  I can’t claim to have invented this tag line, but I wish I had.  You should “Know SQL” before you assume you should “NoSQL”.  While “Big Data” couldn’t have lived in an SQL database as it existed 10 years ago, we now have different kinds of SQL databases that are “web scale”, high performance, designed for analytic workloads, cost effective, and so on.  It’s no longer, as a tech visionary in our own back yard recently said, “just a box that can’t keep up”.  If you have “Big Data” upon which structure can be imposed, analytic databases are very efficient, flexible, easy to use, and come with all the features people have come to expect from 30+ years of development.  (Try one.  We have a free community download!)

Setting the Record Straight on Column Stores (Again?!)

A couple of months ago I went to SIGMOD 2012.  One of the big award winners there was Bruce Lindsay (IBM Fellow Emeritus), a true patriarch of relational databases.  (System R; enough said!)

I was somehow drawn to him before I figured out his name, and before I learned that he was an award winner.  Maybe it was the hairdo and mannerisms.

Or maybe it was how he asked the presenters of the paper on “MCJoin” something along the lines of “So, I’ve written a few join algorithms in my day and one of the things that set me back a few months each time was OUTER JOINs”.  They set me back a few months in my day, too.

Back to the awards.  Each recipient gave a talk.  Bruce gave a very interesting presentation covering RDBMS technology, how it built up to something useful over the years, and then considered whether we are “losing our way”.  I was a bit surprised that he listed “column stores” as a “detour” on the path of RDBMS progress.  This is his slide (and, as you view it, try to imagine someone in the row in front of you cackling about how Mike Stonebraker would react to it…):


Monte Carlo Analysis Using the Vertica Analytics Platform: Part 1

This post is part of a new series of blogs on harnessing the power of the Vertica Analytics Platform to solve problems that, at first glance, do not appear to be database-centric.


The main ingredient in any Monte Carlo method is, as the name suggests, some process akin to going to the casino and repeatedly throwing the dice to see if the odds are in your favor or stacked towards the house.  While this description is simple, the true beauty lies in the application of brute force; instead of applying a complex analysis to the rules and physics of a game, just go play it a number of times and see what happens.

Illustrated graphically, suppose we have a model to analyze:

The Monte Carlo method of analysis says to characterize the model by repeated trials:

Databases, especially those geared toward massively parallel analytics, are surprisingly good platforms for implementing Monte Carlo techniques.  Conversely, the Monte Carlo method is a great tool to have on hand for analyzing data, particularly in situations where brute force seems more appealing than over-thinking things.
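To make the “repeated trials” idea concrete before diving into databases, here is a minimal sketch of my own (plain Python rather than SQL): estimating the odds that two dice sum to at least 10, where the exact answer is 6/36 ≈ 16.7%.

```python
import random

def estimate_two_dice_at_least_10(trials: int, seed: int = 42) -> float:
    """Brute force: roll two dice `trials` times and count the wins."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if rng.randint(1, 6) + rng.randint(1, 6) >= 10)
    return hits / trials

# The exact probability is 6/36 = 0.1667; the estimate converges as trials grow.
print(estimate_two_dice_at_least_10(1_000_000))
```

No physics of the dice, no combinatorics; just play the game many times and count.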

A Simple Monte Carlo Example

A software engineering manager wants to know the odds of a release shipping in three months.  There are several projects in the release; whichever one of them takes the longest will determine the total time until ship, as seen in the following figure:

Project A will take 2-4 months (if the estimate is uniformly distributed, that’s a 50/50 chance of an on-time completion); in parallel is an initiative that has several end products, Project B and Project C (that share infrastructure).  Project B is expected to take 2-3 months, with the infrastructure required for Project C being completed half way through Project B, followed by 1-2 months of additional work on Project C.  Project B will get done on time, but it’s not immediately clear what the odds are for finishing Project C on schedule, or shipping the release as a whole.  We can either break out the math books to calculate the final probability (a strategy that decreases in appeal as schedule complexity increases), or do a Monte Carlo analysis.

The steps are simple:

  1. Generate random values for the time taken to implement Projects A, B, and C.
  2. Apply the formula GREATEST(A, B, B/2 + C).
  3. Count the number of times this is less than or equal to 3, as a proportion of total trials.
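The original post runs these steps in Vertica SQL (shown in figures not reproduced here). As an illustrative sketch, the same three steps can be written in a few lines of Python, with the uniform ranges taken from the schedule above:

```python
import random

def on_time_fraction(trials: int, seed: int = 1) -> float:
    """Fraction of trials in which the release ships within 3 months."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = rng.uniform(2, 4)  # Project A: 2-4 months
        b = rng.uniform(2, 3)  # Project B: 2-3 months
        c = rng.uniform(1, 2)  # Project C: 1-2 months after B's midpoint
        # Step 2: the schedule's GREATEST(A, B, B/2 + C)
        if max(a, b, b / 2 + c) <= 3:
            hits += 1
    return hits / trials

print(on_time_fraction(1_000))  # noisy with only 1,000 trials; try more
```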


I ran this a few times and got values ranging from 34-41%.  Clearly, 1000 samples is not enough to get a solid read on this model, but that is easy enough to fix:

With a million samples, the answer is consistently 37.4-37.5%.  Certainly, this percentage range is far more precise than the input data.

Why Bother Using a Database?

For the simple example above, using a database may run afoul of the maxim, “when you have a hammer, everything looks like a nail.”  I could have written this in C, Java, Perl, or maybe even Z80 assembly language.  If I wanted parallelism, I could have used MapReduce.  Or, I could have used an off-the-shelf project management tool, which may have resorted to the same technique, but it would probably have made a prettier project plan graphic.

That said, there are a few appealing characteristics when using the Vertica Analytics Platform:

  1. The declarative nature of SQL makes this pretty easy to express.  It only took a couple of minutes to code and debug.
  2. I get scalability and parallelism for free.  The Vertica Analytics Platform will assign work to many processors or nodes, where each node will run some trials and record the results, and then the results will be aggregated.
  3. It ran extremely fast: in the blink of an eye on my laptop.

However, the real power here lies in the fact that the data to be analyzed may already be in a database.  We want to bring the computation to the data, as long as this is easy and efficient:

As a concrete example, let’s say we have a database that contains, among other things, all the stock trades that happened over a certain period of time (day, hour, minute, second, whatever).  We can create some kind of model that predicts the upcoming price change percentages, based partly on past performance (which is known from database data, in orange), and partly on future events (random variables such as market forces) which ultimately dictate the predicted valuation and uncertainty.  We can then use Monte Carlo techniques to run the model for a number of input scenarios, and then rank the stocks based on the predicted performance and uncertainty.

In order to arrive at SQL that is simple enough to present here, let’s use a very simple process and model:

  1. The model of the expected price changes will be Ax² + By + C, where A, B, and C are computed from the historical data using in-database analytics, and x and y are random inputs representing potential future market forces, uniformly distributed between 0 and 1.  The manner in which A, B, and C are derived is some sort of secret sauce in this “get rich quick” scheme to beat the market; x and y are the inputs subject to Monte Carlo “roll the dice” analysis.
  2. The secret formula has already been computed, giving us a “model” table associating each stock symbol with its A, B, and C values:

  3. We want the top 5 stocks, based on the output of the model (expected percentage change).  We’ll compute the standard deviation as well, so that the analyst can disregard any stocks at his or her discretion, based on a gut feel of which stocks are too volatile.

Given the simplifications, here is the SQL that runs the model (we will save the output in case we need it for additional analysis later):

Then, gathering the top 5 picks is also easy to express in SQL:

This produces the results we hoped to see:
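(The actual SQL and its output appear as figures in the original post.) For a rough idea of the computation, here is an illustrative stand-in in Python; the symbols and the A, B, C coefficients below are made up for demonstration, not taken from any real model:

```python
import random
import statistics

# Hypothetical "model" table: symbol -> (A, B, C) from the secret sauce.
MODEL = {"AAA": (2.0, 1.0, 0.5), "BBB": (0.5, 3.0, -1.0),
         "CCC": (1.5, 1.5, 0.0), "DDD": (4.0, 0.1, -2.0),
         "EEE": (0.2, 0.2, 1.0), "FFF": (1.0, 2.0, 0.3)}

def run_model(trials: int, seed: int = 7):
    """Monte Carlo over A*x^2 + B*y + C with x, y ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    results = {}
    for sym, (a, b, c) in MODEL.items():
        vals = [a * rng.random() ** 2 + b * rng.random() + c
                for _ in range(trials)]
        results[sym] = (statistics.mean(vals), statistics.stdev(vals))
    return results

# Rank by expected change; keep the stddev around for gut-feel vetoes.
top5 = sorted(run_model(10_000).items(),
              key=lambda kv: kv[1][0], reverse=True)[:5]
for sym, (mean, sd) in top5:
    print(f"{sym}: expected {mean:+.2f}%, stddev {sd:.2f}")
```

Swapping in a different secret formula only changes the expression inside the loop; the trial-and-rank scaffolding stays the same.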

While this is a simple example, there is no reason to stop here.  We can use any secret formula, and a much more complicated model, without changing the overall approach.  Or, if we want to change the overall approach, we can incorporate performance assessment and adaptation into the process:

In our example, this would entail comparing the predictions of the model to the actual data acquired by the database as the stock market plays out.  In-database analytics can be used to assess the quality of the model and make adjustments to improve future results.

Stay Tuned for Part 2…

While the examples presented here may suffice as food for thought, there is much more to be written about the use of in-database analytics to build models and measure their effectiveness.  Also, there are several Vertica-specific features and tips to leverage in performing Monte Carlo analysis.  These will be discussed in an upcoming post.


How To Make A Pumpkin Look Like Somebody

(No Artistic Talent Required!)

It’s Halloween! This holiday is second only to Christmas in terms of the amount spent per year on decorations (in the USA, anyway). But for some of us, there is more to it than cheap, mass-produced rubber spiders, talking skeletons, and so on. It’s yet another excuse to be creative, in a geeky way.

So let’s carve pumpkins! When I was a wee lad, my brother and I took black markers and made an outline of what we wanted on our pumpkins, then dear old dad would cut them with a sharp knife (which we, of course, weren’t allowed to play with). Think triangular eyes, a simple nose, and a snaggletoothed mouth. Now that I am older, I find this is way too efficient, and much more time can be frittered away with this exercise…

Basic Carving Outline

(Apologies in advance for saying so many things you already know… or just skip this section.)

  1. Get pumpkins. You can grow them, get them at the store, or visit the kitschy farm down the road. Look for ones you can easily spread your hand across; they are the right size for single-page (8 ½ x 11) paper patterns.
  2. Get a paper pattern. They come in books, but more sophisticated ones can be found online, and printed out (laser printouts are preferred, as they are nearly impervious to pumpkin juice).
  3. Tape the pattern to the pumpkin (This means you don’t actually need artistic talent. Just trace it!). This is easier if you have cut some notches in the paper so it bends around the pumpkin. I say use lots of tape. Cover every bit of the paper with tape. That way, if it gets pumpkin juice on it, it won’t fall apart.
  4. Cut the top off. Angle it, so the top doesn’t fall in. (If you cut it straight up and down, this will happen quickly as the pumpkin ages.) Alternatively, some experts prefer cutting the bottom out of the pumpkin instead of the top. This may make the pumpkin last longer, especially if it is out in the weather. But then you may need a chimney. Either way, I leave a notch so the original orientation of the lid can be quickly reestablished.
  5. Scrape the guts out. Scrape the part where the pattern is applied extra hard, if you are going with a three-level pattern (explained next), so the light shines through nicely. Keep some seeds to bake and eat, if you are into that (I am not).
  6. Cut the pattern. Unless you are being really fancy, this can be done in three levels:
    • Skin left on. This is of course the darkest.
    • Skin peeled off. Much more light shines though.
    • Cut all the way through. This is the lightest.

There are many tools for the job. For cutting through, knives really can’t get a high level of detail compared to the special-purpose pumpkin saws they sell these days. (Cut as perpendicular to the surface of the pumpkin as possible so the piece is easily extracted. If the piece doesn’t pop out easily, cut it into bits.)

For scraping the skin, I haven’t found anything better than a pocket knife. Just cut the edge around the area (this makes nice clean lines), then if the area is small/thin pick it out with the knife point, or if it is large, cut it into pieces to pick out. (Cutting up and down the grain of the pumpkin is easiest, if it is convenient given the shape of the area to scrape.) They also sell tools with little loops on the ends as part of store-bought kits, but I prefer to live dangerously and use my trusty knife.

The order in which the areas are cut out has a profound effect on how hard it is to execute the design without breaking anything. This is hard to pin down in words, but just as you don’t want to paint yourself into a corner, you also don’t want to be cutting or scraping anything that has very little structural support left. Starting with the smallest details is never a bad idea.

  7. Take the pattern and tape off.
  8. Cut air holes in the back if the pattern doesn’t involve many areas that are cut through.
  9. Put a candle in, light it. Pumpkins are >90% water, so putting them on your cement steps with real fiery candles in doesn’t sound that dangerous.
  10. Take pictures of your handiwork! (See the last section.)

Additional Steps

  • Invite some friends over, and have your sister-in-law with the Johnson and Wales degree make the snacks.
  • Add beer and/or wine, to taste.
  • There are also tailor-made power saws (reciprocating, not circular) available, and these are quite helpful for designs with lots of long cuts. Some people also use power tools to scrape the skin, such as a Dremel tool. This works, but I advise against combining this with #1, or #2 in particular.

How To Make Your Own Patterns

Making your own patterns is a great way to get geek technology involved in what otherwise would be a nice, earthy-crunchy hobby. This is not that hard, but you will impress people who don’t think of these things.

Get a picture using a digital camera, Google image search, or whatever. This could be easy, or not. But you won’t know for sure until later steps. Here’s what I used this year:

Crop the picture. Convert it to black and white. (I used to use Photoshop when I had an employer that could justify the expense of having a legit copy. Now I use the GIMP.) If you look closely at the hair, you can see that I touched it up a little, in preparation for the next step…

Convert the image to three levels. I use black for skin, gray for scraped, and white for cut through. This gives an impression of the final result, but generally uses more precious toner than doing it the other way ’round. This year I just adjusted the color curve in GIMP, but I am sure I have used other means in the past.

This should result in a 3-level image:

There are a few things to note here. Obviously it is tricky to pick the levels to get something that is true to the original image. However, you also have to be aware that gravity will claim anything that is completely surrounded by a cut-out (white area). You can (in order of preference) either just erase these (in the image above they are tiny), keep fussing with the levels until there aren’t any, add supports to them, or go back to step 1 (pick a new image).
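If you would rather script this step than fiddle with color curves, the three-level split boils down to two thresholds applied to grayscale pixel values. A stdlib-only Python sketch (the cutoff values are arbitrary starting points; tune them per image):

```python
def to_three_levels(gray_pixels, dark_cutoff=85, light_cutoff=170):
    """Map 0-255 grayscale values onto the three carving levels:
    0 = skin left on (darkest), 128 = skin peeled, 255 = cut through."""
    levels = []
    for p in gray_pixels:
        if p < dark_cutoff:
            levels.append(0)      # skin left on
        elif p < light_cutoff:
            levels.append(128)    # skin peeled off
        else:
            levels.append(255)    # cut all the way through
    return levels

print(to_three_levels([10, 100, 200, 84, 85, 170]))
# → [0, 128, 255, 0, 128, 255]
```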

Cut the pumpkin. It may look like complete crap in the daylight:

Fear not! In the dark, things look better than you’d think, given the number of mistakes you may or may not have made in the cutting process:

Get addicted, and do more pumpkins next year. Here are a few samples from our parties. (See if you can spot the shameless plugs for our company.)

Taking Good Pictures

So even if your pumpkin doesn’t look that good, you may be able to salvage it by creative use of your digital camera.

  • Use full manual mode. That way you can adjust the picture by trial and error, and keep the most appealing one.
  • Unless you can hold still for 15 seconds, a tripod is a must.
  • Stop the camera down. (This may not apply to your camera, but the one I got for a mere $150, 6 long years ago works best stopped down, as it blurs less. I realize that this is counter-intuitive since there isn’t a lot of need for depth of field, and the light is low. But that’s what I do.)
  • Same goes for film speed. Use the slowest one, as you get the least noise in the dark areas, even though this is counter-intuitive in low-light settings.
  • Then adjust the exposure time to make it look good. Take a few: use 2 seconds, 4, 6, 8, 10, 15, and then keep whatever looks best. Usually, it takes a lot of work to get a digital photo to look almost as good as real life, but with pumpkins, it is pretty easy to make things look even better than reality: just pick the right exposure.
  • The first photo shows approximately what the pumpkin looks like in real life (you will have to trust me on this). The second shows the exposure I liked the best, which soaked for a bit longer.


I’d like to thank my wife for starting this tradition, and cleaning the house before every party. And all our friends, for their contributions to our growing gallery.

Why Vertica’s Compression is Better

Data compression is like money or free time: the more the better. Compression can save big bucks on storage costs, increase data center density, or allow more data to be kept, while simultaneously increasing query performance in cases where I/O is the bottleneck. So naturally, database marketeers want to claim great compression ratios. Vertica is no exception.

Generally, I am skeptical of vendors who claim amazing compression breakthroughs. In fact, I’m quite skeptical of any lossless compression product that promises a compression ratio up front, without any knowledge about the data to be compressed. Remember the WEB 16:1 compressor saga? Also, don’t we all know that it’s theoretically impossible to contrive a general, lossless compression utility that makes all input files smaller? Take a look at Steve Tate’s or Mike Goldman’s challenges in the timeless comp.compression FAQ. I’d double their wagers.

Thus I expect some skepticism surrounding Vertica’s claims of compression capabilities. To dispel some of these doubts, I’ll discuss the three main reasons why Vertica’s compression is better than you’ll see in competing databases for a long time to come. I’ll also present a real-life customer example (albeit a best-case scenario) of database compression. I won’t be claiming any specific compression ratio, like “up to 10x compression or more!” But I will claim that we can do as well as anyone else, and usually much better.

Compression Advantages of Vertica’s True Column Store

There are three main reasons why Vertica’s compression is superior, and each brings its own advantage for compressing data:

Reason #1: Vertica Is A True Column Store.
If you walk down attribute columns there is more similarity than if you cut across the rows. So why can’t a row store simply compress its blocks by slicing each block of rows into columns? It can, but given the same block size, the column store will see more data and therefore get better compression, as shown in Illustration 1, where data was fed into gzip in rows, and in columns.

Illustration 1: Vertica’s True Column Compression Advantage 

In the extreme case of run length encoding (RLE), even many billions of records can be encoded in a single disk block, which is far outside the scope of what can be done by slicing up blocks of rows.
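This effect is easy to demonstrate outside any database. A sketch with synthetic data (sequential ids, a low-cardinality category column, noisy prices) and ordinary gzip, comparing row-major and column-major serialization of the same table:

```python
import gzip
import random

rng = random.Random(0)
# A toy table: sequential ids, a 3-value category column, noisy prices.
rows = [(i, rng.choice(["red", "green", "blue"]),
         round(rng.uniform(1, 100), 2)) for i in range(100_000)]

# Row-major: values interleaved, one row per line (a row store's layout).
row_major = "\n".join(f"{i},{c},{p}" for i, c, p in rows).encode()

# Column-major: each column's values kept together (a column store's layout).
col_major = "\n".join(["\n".join(str(i) for i, _, _ in rows),
                       "\n".join(c for _, c, _ in rows),
                       "\n".join(str(p) for _, _, p in rows)]).encode()

# Identical information, but walking down columns puts similar values next
# to each other, so general-purpose compression does noticeably better.
print(len(gzip.compress(row_major)), len(gzip.compress(col_major)))
```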

Reason #2: Data Is Always Sorted in Vertica.

Data that is well organized compresses better than data that is located haphazardly, as shown in Illustration 2. Take a text file containing a million random integers (say between 1 and 10 million). It’ll be a shade under 8MB, as most entries are 7 digits plus a newline. Applying gzip, you’ll find that the data compresses, because the numbers are made of digits, which are a subset of all byte representations. I got 3.7MB, for a bit better than 2:1 compression. However, sorting the data makes it much more compressible; 3.4:1 compression is achieved. By not using strings, Vertica gets better than 10:1 compression in this case.

Illustration 2: Sorting Dramatically Improves Compression with Vertica 

This is a contrived case, but real data also has trends. Stocks trade from pennies to over a hundred thousand dollars a share. But if the data is sorted by stock, date, and time, there are few distinct trading prices and the data is quite compressible. Unlike other databases that keep data in insertion order, Vertica keeps the data sorted all the time, improving compression.
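The sorted-integers experiment above is easy to reproduce (exact sizes will vary slightly with the random draw, but the shape of the result holds):

```python
import gzip
import random

rng = random.Random(0)
nums = [rng.randint(1, 10_000_000) for _ in range(1_000_000)]

unsorted_text = "\n".join(map(str, nums)).encode()
sorted_text = "\n".join(map(str, sorted(nums))).encode()

print(len(unsorted_text))                 # a shade under 8MB of text
print(len(gzip.compress(unsorted_text)))  # roughly 2:1 against the text file
print(len(gzip.compress(sorted_text)))    # noticeably smaller again
```

Sorting costs nothing in information content, yet neighboring values now share long digit prefixes, which gzip’s matching exploits.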

Reason #3: Vertica Doesn’t Do In-Place Updates.
In most databases that were originally designed for transaction processing, updates are applied in place. Since new values could come along that don’t compress as well as the old values, some empty space must be left, or updates foregone. Since Vertica puts updates in a separate place (such as the Write Optimized Store), we can squeeze every last bit out of the data. I’ve seen some competing systems that update in place and others that don’t allow updates if maximum compression is on, but none that allow updates while still squeezing every last bit out of the data representation.

A Real-Life Example

The reasons sound good, but how well does Vertica’s compression work in practice? Take a look at the following example and ask yourself if your favorite row store (which is probably soon to claim that it is also a column store) can match Vertica’s true column compression.

We have a customer that collects metrics from some meters. There are 4 columns in the schema:

  1. Metric. There are a few hundred metrics collected.
  2. Meter. There are a couple thousand meters.
  3. Collection Time Stamp. Each meter spits out metrics every 5 minutes, 10 minutes, hour, etc., depending on the metric.
  4. Metric Value. A 64-bit floating point value.

So how much space will this take to record? A baseline .csv file of 200 million of the meter/metric/time/value rows takes 6200 MB, for ~31 bytes per row. gzip reduces this to 1050MB.

Vertica compresses much better. By sorting the data on metric, meter, and collection time, Vertica not only optimizes common query predicates (which specify the metric or a time range), but exposes great compression opportunities for each column:

  1. Metric: There aren’t many. With RLE, it is as if there are only a few hundred rows. Vertica compressed this column to 5KB.
  2. Meter: There are quite a few, and there is one record for each meter for each metric. With RLE, Vertica brings this down to a mere 35MB.
  3. Collection Time Stamp: The regular collection intervals present a great compression opportunity. Vertica compressed this column to 20MB.
  4. Metric Value: Some metrics have trends (like lots of 0 values when nothing happens). Others change gradually with time. Some are much more random, and less compressible. However, Vertica compressed the data to only 363MB.
Illustration 3: Real Customer Data Compression


The total size for all the columns is 418MB (a shade over 2 bytes per row). This is less than half of what gzip used, and represents a compression ratio of roughly 15:1 over the raw CSV.
