Vertica

Archive for October, 2010

How To Make A Pumpkin Look Like Somebody

(No Artistic Talent Required!)

It’s Halloween! This holiday is second only to Christmas in terms of the amount spent per year on decorations (in the USA, anyway). But for some of us, there is more to it than cheap, mass-produced rubber spiders, talking skeletons, and so on. It’s yet another excuse to be creative, in a geeky way.

So let’s carve pumpkins! When I was a wee lad, my brother and I took black markers and made an outline of what we wanted on our pumpkins, then dear old dad would cut them with a sharp knife (which we, of course, weren’t allowed to play with). Think triangular eyes, a simple nose, and a snaggletoothed mouth. Now that I am older, I find this is way too efficient, and much more time can be frittered away with this exercise…

Basic Carving Outline

(Apologies in advance for saying so many things you already know… or just skip this section.)

  1. Get pumpkins. You can grow them, get them at the store, or visit the kitschy farm down the road. Look for ones you can easily spread your hand across, they are the right size for single page (8 ½ x 11) paper patterns.
  2. Get a paper pattern. They come in books, but more sophisticated ones can be found online (carvingpumpkins.com is a favorite of mine), and printed out (laser printouts are preferred, as they are nearly impervious to pumpkin juice).
  3. Tape the pattern to the pumpkin (This means you don’t actually need artistic talent. Just trace it!). This is easier if you have cut some notches in the paper so it bends around the pumpkin. I say use lots of tape. Cover every bit of the paper with tape. That way, if it gets pumpkin juice on it, it won’t fall apart.
  4. Cut the top off. Angle it, so the top doesn’t fall in. (If you cut it straight up and down, this will happen quickly as the pumpkin ages.) Alternatively, some experts prefer cutting the bottom out of the pumpkin instead of the top. This may make the pumpkin last longer, especially if it is out in the weather. But then you may need a chimney. Either way, I leave a notch so the original orientation of the lid can be quickly reestablished.
  5. Scrape the guts out. Scrape the part where the pattern is applied extra hard, if you are going with a three-level pattern (explained next), so the light shines through nicely. Keep some seeds to bake and eat, if you are into that (I am not).
  6. Cut the pattern. Unless you are being really fancy, this can be done in three levels:
    • Skin left on. This is of course the darkest.
    • Skin peeled off. Much more light shines though.
    • Cut all the way through. This is the lightest.

There are many tools for the job. For cutting through, knives really can’t get a high level of detail compared to the special-purpose pumpkin saws they sell these days. (Cut as perpendicular to the surface of the pumpkin as possible so the piece is easily extracted. If the piece doesn’t pop out easily, cut it into bits.)

For scraping the skin, I haven’t found anything better than a pocket knife. Just cut the edge around the area (this makes nice clean lines), then if the area is small/thin pick it out with the knife point, or if it is large, cut it into pieces to pick out. (Cutting up and down the grain of the pumpkin is easiest, if it is convenient given the shape of the area to scrape.) They also sell tools with little loops on the ends as part of store-bought kits, but I prefer to live dangerously and use my trusty knife.

The order in which the areas are cut out has a profound effect on how hard it is to execute the design without breaking anything. This is hard to pin down in words, but as you don’t want to paint yourself into a corner, you also don’t want to be cutting/scraping anything that has very little structural support. Starting with the smallest details is never a bad idea.

  1. Take the pattern and tape off.
  2. Cut air holes in the back if the pattern doesn’t involve many areas that are cut through.
  3. Put a candle in, light it. Pumpkins are >90% water, so putting them on your cement steps with real fiery candles in doesn’t sound that dangerous.
  4. Take pictures of your handiwork! (See the last section.)

Additional Steps

  • Invite some friends over, and have your sister-in-law with the Johnson and Wales degree make the snacks.
  • Add beer and/or wine, to taste.
  • There are also tailor-made power saws (reciprocating, not circular) available, and these are quite helpful for designs with lots of long cuts. Some people also use power tools to scrape the skin, such as a Dremel tool. This works, but I advise against combining this with #1, or #2 in particular.

How To Make Your Own Patterns

Making your own patterns is a great way to get geek technology involved in what otherwise would be a nice, earthy-crunchy hobby. This is not that hard, but you will impress people who don’t think of these things.

Get a picture using a digital camera, Google image search, or whatever. This could be easy, or not. But you won’t know for sure until later steps. Here’s what I used this year:

Crop the picture. Convert it to black and white. (I used to use PhotoShop when I had an employer that could justify the expense of having a legit copy. Now I use the GIMP.) If you look closely at the hair, you can see that I touched it up a little, in preparation for the next step…

Convert the image to three levels. I use black for skin, gray for scraped, and white for cut through. This gives an impression of the final result, but generally uses more precious toner than doing it the other way ’round. This year I just adjusted the color curve in GIMP, but I am sure I have used other means in the past.

This should result in a 3-level image:

There are a few things to note here. Obviously it is tricky to pick the levels to get something that is true to the original image. However, you also have to be aware that gravity will claim anything that is completely surrounded by a cut-out (white area). You can (in order of preference) either just erase these (in the image above they are tiny), keep fussing with the levels until there aren’t any, add supports to them, or go back to step 1 (pick a new image).

Cut the pumpkin. It may look like complete crap in the daylight:

Fear not! in the dark, things look better than you’d think, given the number of mistakes you may or may not have made in the cutting process:

Get addicted, and do more pumpkins next year. Here are a few samples from our parties. (See if you can spot the shameless plugs for our company.)

Taking Good Pictures

So even if your pumpkin doesn’t look that good, you may be able to salvage it by creative use of your digital camera.

  • Use full manual mode. That way you can adjust the picture by trial and error, and keep the most appealing one.
  • Unless you can hold still for 15 seconds, a tripod is a must.
  • Stop the camera down. (This may not apply to your camera, but the one I got for a mere $150, 6 long years ago works best stopped down, as it blurs less. I realize that this is counter-intuitive since there isn’t a lot of need for depth of field, and the light is low. But that’s what I do.)
  • Same goes for film speed. Use the slowest one, as you get the least noise in the dark areas. Even though this is counter-intuitive in low light settings.
  • Then adjust the exposure time to make it look good. Take a few. Use 2 seconds, 4, 6, 8, 10, 15, and then keep whatever looks best. Usually, it takes a lot of work to get a digital photo to look almost as good as real life, but with pumpkins, it is pretty easy to make things look even better than reality, just pick the right exposure.
  • The first photo shows approximately what the pumpkin looks like in real life (you will have to trust me on this). The second shows the exposure I liked the best, which soaked for a bit longer.

Acknowledgments

I’d like to thank my wife for starting this tradition, and cleaning the house before every party. And all our friends, for their contributions to our growing gallery.

Sessionize with Style: Part 2

Why is the Vertica Approach Unmatched?

In Part 1 of this two-part article, we described Vertica’s approach to sessionization. To understand why this approach is head and shoulders above the state-of-the-art, let us closely examine a few existing approaches.

The Aster Data Approach –From Boston to LA via China

The first approach comes from Aster Data. One query example from this article is quoted below.

SELECT sessionId, userId, timestampValue

FROM Sessionize( ‘timestamp’, 60 ) ON

(SELECT userid, timestampValue FROM webclicks WHERE userid = 50);

There are a few usability hurdles with this design.

  1. The Sessionize function does not take userId as its input. It therefore must be assuming the input table webclicks has a column named userId, to be used in this computation. Hard-coding column names makes the product less usable.
  2. This design of sessionization support unnecessarily deviates from standard SQL in terms of both syntax and semantics, artificially creating a steep learning curve for the end users to climb. For example, the keyword ON has a different meaning from the standard use of ON in SQL ANSI joins. As a proposal for an alternative design, why not retain the subquery as a standard FROM clause subquery, and move the Sessionize function to the SELECT clause? This could preserve the required semantic ordering of evaluating the subquery before the Sessionize function, while minimizing the deviation from Standard SQL.
  3. This example implies that the predicate userid = 50 has to be placed in a subquery in order for that predicate to be pushed before the Sessionize computation. This predicate push down technique, well known to relational databases since the invention of relational algebra in the 1970’s, should have been taken care of in the query optimizer automatically. Forcing the end user to come up with “clever” SQL formulations violates the spirit of declarative programming in SQL, again making the product less usable.

Apparently Aster Data took another stab at the design, where a query example in that new design is quoted below.

Select ts, userid, session

From sessionize (

on clicks partition by userid order by ts

Timecolumn(‘ts’) timeout (60)

)

While this design represents progress, it still suffers from unnecessary deviation from the Standard SQL (see Comment #2 above). Also, since the proposed Sessionize syntax is not fully interoperable with other existing SQL constructs, it makes the query formulation cumbersome when the user wants to perform sessionization and other SQL computation in the same query. For example, how many subqueries need to be involved, if the user wants to first join the table clicks with a dimension table, perform a group-by, and then perform sessionization? In comparison, the Vertica approach employs not a single subquery to accomplish this task.

Finally, in case you wonder, in Aster Data Sessionize is a MapReduce function written in JavaDoes the sessionization task justify the employment of a big hammer execution mechanism like MapReduce? No! As we mentioned before, the run-time complexity of sessionization is at the level of the simplest SQL ’99 analytic functions such as RANK and ROW_NUMBER, which takes nothings more than a single pass over the sorted data.

Using MapReduce to implement sessionization is a mistake on the architectural level. It does not matter how fast your MapReduce implementation is – if you travel from Boston to LA by flying around the globe, does it matter that you are taking a supersonic aircraft? Besides, while MapReduce is more versatile than SQL when processing computational tasks, it has been shown that when processing SQL-like query tasks, due to its significant overhead a MapReduce engine can be slower than a SQL engine by an order of magnitude or more.

The Teradata Approach – a Horse Wagon Fun Ride from Boston to LA

In this article, Teradata revealed its support for sessionization. We quote its query example below.

with dt (IP, Click_Timestamp, samesession) as

( select IP, Click_Timestamp,

case when (Click_Timestamp – (max(Click_Timestamp) over (partition by IP order by Click_Timestamp rows between 1 preceding and 1 preceding ) ) minute) < (interval ’30′ minute) then 0 else 1 end

from webclicks)

select sum(samesession) over (partition by IP order by Click_Timestamp rows unbounded preceding) as Session_No, IP, Click_Timestamp

from dt;

From this example, we can see that Teradata’s sessionization support is not native. Instead, it is expressed in terms of existing SQL ’99 functions. Such a design bears the following consequences.

  1. The resulting formulation is quite cumbersome and unintuitive – if you understand why this query is actually performing sessionization, we salute to you for being a SQL guru. On the other hand, an average DBA might be at a complete loss at what the query intends to do. Also, the WITH clause is an unnecessary big hammer construct to employ – a subquery could have been used. Then again, no subquery would be nice!
  2. The use of the SQL window aggregate function Max further complicates the formulation unnecessarily. The SQL analytic function LAG should have been used.
  3. Last but certainly not least, the complex query formulation involving two query blocks is likely to lead to suboptimal run-time performance.

Conclusions

Vertica’s sessionization support is unmatched among its peers, and here is why.

  1. Thanks to Vertica’s analytic function CONDITIONAL_TRUE_EVENT (CTE), the query formulation of sessionization is extremely compact – no subquery or unnecessary SQL construct is involved.
  2. Thanks to CTE, Vertica’s sessionization goes beyond its standard semantics, and is thus capable of supporting a wider range of use cases.
  3. The run-time performance of Vertica’s sessionization is close to optimal –the computation is fully pipelined for each user id, and is fully load-balanced across all user ids.

In closing this post, we would like to offer our dear esteemed readers the challenge to come up with a solution that beats Vertica’s implementation of sessionization.

The Vertica Connector for Hadoop

At Hadoop World today, Vertica is announcing our second generation connector for Hadoop and Pig.  You can read more about the Vertica Connector for Hadoop in the press release, but we wanted to give you the basics about the new Connector here on the blog.

About a year ago, Vertica was the first analytic database company to deliver a parallel connector for Hadoop.  Today, Vertica has over 25 customers using Vertica and Hadoop together.  These are real customers, with real-world applications, who are combining the strengths of Hadoop and the Vertica Analytics Platform.

And with Vertica Connector for Hadoop, users have unprecedented flexibility and speed in loading data from Hadoop to Vertica and querying data from Vertica in Hadoop.  And, as part of our ongoing commitment to Hadoop, we’ve announced an upcoming backup option to Hadoop File System (HDFS), and plans to publish a set of libraries that will allow Hadoop users to leverage Vertica’s advanced sorting, compression and encoding natively in the backup files on HDFS for additional processing and data exploration.

The Vertica 4.0 Connector for Hadoop is open source, supported by Vertica, and available for download at here.

You can read more on the web site, but here’s a short summary of what Vertica 4.0 Connector for Hadoop offers right now:

  • Enhanced integration of Vertica 4.0 with stable versions of Hadoop and Pig
  • Improved transfer performance

And, in a separate release available in the first half of 2011, we’ll be providing:

  • An innovative, low-cost backup option
  • Libraries to read Vertica’s native format from HDFS

As part of our Hadoop push, we’re going to be presenting a couple of webinars around our combined Vertica/Hadoop solution.  Watch for an announcement here on the blog or in your email (if you are on the Vertica list).

Tales from a Cocktail Party: How Customers Use Vertica

On Wednesday, I gave one of the lightning talks at the 4th Extremely Large Databases Conference (XLDB) at Stanford University about how real customers use Vertica to store and analyze their ‘extremely large’ databases. The 5 minute format is tough.  I had to pick only three customers   but I hope I was able to get across how cool the things they are doing with Vertica are.

Given the proximity to San Francisco, it occurs to me that Vertica is basically in the same business that Levi Strauss was in during the 19th century gold rush: selling supplies to a rapid and lucrative market where there is lots of money to be made. Our modern day gold rush is between businesses to see who can use the massive amounts of data they collect to make the best X possible, where X is anything from ‘television distribution network’ to ‘massive online game’ and everything in between.

Given that we are in the system software business, we sell a (vital) component of an end product and thus I largely don’t see complete applications. I most typically see the SQL that those applications generate, and on most days I am focused on the nitty-gritty details of making it work as fast as possible.  Sometimes, though, I get to step back and see what is really being done with our products and it’s usually quite cool.

So, in this blog post, I wanted to give you some cool cocktail party tidbits about what our customers do with their data…

One telecom company told me that (unsurprisingly) most of those 500 channels on your cable box never get watched.  So most of the time, the company doesn’t waste their network’s bandwidth sending all of the channels to your house.  Instead, they use switched video and then use Vertica to analyze the patterns of who watches what and when in every local distribution group, so that they can make network planning decisions like “how much switched video capacity do we really need for those 100 subscribers?”

Cocktail item: This telco has an (anonymized) record of what station each television box is tuned to for every minute of every day.

Another of our customers analyzes SMS message content for big mobile service providers. To be honest, I don’t know exactly what they are doing with this information, but you might imagine they have interesting data and interesting conclusions, and they use Vertica to parse this data.

Cocktail item: Question: What is the most common text message? Answer: The single letter ‘k.’

My aunt is a devoted online farmer, but I doubt she realizes how much technical firepower is used to analyze her online buying habits. Zynga, the developer of FarmVille, Mafia Wars and a number of other popular online games, has a massive Vertica cluster that they use to analyze past in-game decisions with the goal of making their games more fun to play.  This means you will spend more time playing the games, and hopefully spending more money with Zynga.

Cocktail Item: Zynga’s revenue is massive, estimated by various sources to be $200M or greater in 2009, and also reportedly on track for a killer 2010.  Who would have thought virtual farming or playing a gangster would be so lucrative?

Vertica’s customers are from a wide range of industries – gaming, telecommunications, financial, healthcare, and more – but they share one thing in common.  They all use Vertica to analyze massive amounts of data in real time, converting it into usable information that helps drive business decisions.  And now, I feel like a cocktail!

Sessionize with Style – Part 1

The Vertica Approach

Sessionization is a common analytic operation in clickstream analysis. Given an input clickstream table, where each row records a webpage click made by a particular user (or IP address), the sessionization operation identifies user’s web browsing sessions from the recorded clicks, by grouping the clicks from each user based on the time-intervals between the clicks. Conceptually, if two clicks from the same user are made too far apart in time (as defined by a time-out threshold), they will be treated as coming from two browsing sessions.

Here is an example input clickstream table with a simplified schema. Ignore the output column session_id for now.

user_id timestamp URL session_id
U0 15:00:00 http://online.wsj.com/home-page 0
U0 15:00:25 http://online.wsj.com/public/page/news-world-business.html 0
U0 15:00:45 http://online.wsj.com/articles/capital_journal 0
U0 15:01:45 http://www.xkcd.com/ 0

The standard semantics of sessionization takes a single input parameter: the time-out threshold, which is a constant time interval value. An example time-out threshold value is 30 seconds. Sessionization performs its computation on two columns in the input clickstream table, the user_id and the timestamp of the click. The output session_id column produced by sessionization is shown in the above table.

Vertica’s Sessionization Support

Sessionization in Vertica is built on top of the event-based window function CONDITIONAL_TRUE_EVENT (or CTE in short). Recall the semantics of CTE with input Boolean expression P: CTE(P) is evaluated once per input row, and defines a new window starting at the current row, whenever P is evaluated to true for that row. For example, given a sequence of values <1, 2, 3, 4> for column X, CTE(X > 2) assigns to these rows a sequence of window Ids <0, 0, 1, 2>. Also, recall that the expression P in CTE can access column values in the current row, as well as in previous rows. For example, CTE (X > LAG(X)) defines a new window whenever the value of column X in the current row is greater than X in the last row.

Despite of its powerful semantics, the run-time complexity of CTE is at the level of the simplest SQL ’99 analytic functions such as RANK and ROW_NUMBER – it takes only a single pass over the sorted data, while retaining a minimal amount of state in the computation.

Thanks to CTE, sessionization with its standard semantics can be expressed in Vertica as follows.

SELECT user_id, timestamp, CONDITIONAL_TRUE_EVENT(timestamp – LAG(timestamp) > ’30 seconds’) OVER (PARTITION BY user_id ORDER BY timestamp)

FROM clickstream;

Beyond the Standard Semantics of Sessionization

One limitation of the standard semantics of sessionization is that the time-out threshold is a constant value. However, different users may have different styles and preferences for internet browsing, and therefore the same time-out threshold may not accurately identify sessions for all users.

For example, say user A is a slower web-surfer than an average user, perhaps because A is multi-tasking heavily. Say if an average user does not perform page clicks in a particular web domain D in 30 seconds, it indicates the end of a session. However, for user A, the typical interval between two clicks in same domain is 1 minute, as she is busy tweeting, listening to music, and harvesting in Farmville at the same time. So a better solution is to adaptively determine the session timeout threshold of user A based on her recent browsing behavior (e.g. the average time interval between 2 consecutive clicks in the last 10 clicks which A has performed). This allows the clickstream analyst to customize the timeout threshold for difference users.

For example, to adaptively compute the time-out threshold for a user based on her last 10 clicks with a “fudge factor” of 3 seconds, we can use the following CTE expression: CONDITIONAL_TRUE_EVENT (timestamp – lag(timestamp) <= (LAG(timestamp, 1) – LAG(timestamp,11)) / 10) + ’3 seconds’. The fudge factor can be a multiplicative factor instead of an additive one. For example, it can be 110% of the average time intervals of the last 10 clicks.

Another sessionization use case involving a more sophisticated time-out threshold is to use different threshold values based on other factors, such as the time of the day, or the nature of the website being browsed. For example, the time-out threshold for Wall Street Journal Online should be higher than xkcd webcomic, as the WSJ articles take longer to read in average than the xkcd comic strips.

Conclusion

The Vertica approach to sessionization enjoys the multiple benefits of ease of use, strong expressive power, as well as highly efficient and scalable execution. Wondering how some alternative approaches will stack up against Vertica’s (hint: they won’t)? That’s what we will answer in a future post.

Get Started With Vertica Today

Subscribe to Vertica