Vertica

Archive for the ‘SQL’ Category

Top 4 Considerations When Evaluating a Data Analytics Platform

From fraud detection to clickstream analytics to simply building better products or delivering a more optimal customer experience, Big Data use cases are abounding with analytics at the core.

With a solid business or use case in place, the next step that organizations typically take is to investigate and evaluate the appropriate set of analytics technology from which to accomplish their analysis, often starting with a data analytics platform. But what are the requirements from which to base your evaluation?

The Winter Corporation, the large-scale data experts, just finalized an in-depth white paper (The HP Vertica Analytics Platform: Large Scale Use and Advanced Analytics) that reflects the results and findings through evaluation, independent research, customer and employee interviews, and documentation review.

Intended for a more technical audience, this white paper focuses on key evaluation criteria that your organization can use as a guide as you conduct your own evaluation.

 

 

Winter Corporation identified these key feature areas as critical for any data analytics platform:

1. Architecture
• Column store architecture
• Shared nothing parallelism
• Cluster size and elasticity
• Smart K-Safety based availability
• Hybrid storage model
• Multiple database isolation modes
• Both bulk load and trickle feed

2. Performance
• Extensive data compression and data encoding
• Read-optimized storage
• Highly parallel operation
• Storage of multiple projections
• Automatic physical database design

3. General Useful and Noteworthy Features for Large-Scale Use
• Export-import
• Backup/restore
• Workload analyzer
• Workload management
• Role-based security

4. Extensions for Advanced Analytics
• SQL extensions
• Built-in functions
• User-defined extensions
• Flexibility in accessing and analyzing all data (structured, semistructured, or unstructured)

Finally, once you have evaluated and confirmed that the data analytics platform meets your feature and technology requirements, you want to hear from other organizations that have deployed large-scale analytics’ initiatives in real-world environments.

The white paper concludes with a write-up on how Zynga, a social game services company with more than 240 million users of its online games, stores the actions of every player in every game — about 6 TB per day of data — in near-real time in the HP Vertica Analytics Platform. No matter where in the world a game event occurs, the data can be retrieved via a report or query from the central HP Vertica database no more than five minutes later.

Defining Big Data

Let’s start with an easy question.  What’s “Big Data”?  Fortunately, I read the answer to that in one of my favorite tech news sources just the other day: http://www.theregister.co.uk/2012/08/27/how_did_big_data_get_so_big/.  The answer, for those who won’t bother with the link, is “Big data is any data that doesn’t fit well into tables and that generally responds poorly to manipulation by SQL” … “A Twitter feed is Big Data; the census isn’t. Images, graphical traces, Call Detail Records (CDRs) from telecoms companies, web logs, social data, RFID output can all be Big Data. Lists of your employees, customers, products are not.”

That’s great, except that it is self-contradictory!  5 out of the 7 things the author considers “big data” are not only susceptible to manipulation by SQL (in a well-designed database of course), but have representation on Vertica’s “paying customers” list.  Twitter is a customer (and I thank them for their ability to simultaneously give us props and jabs in venues like ACM SIGMOD).  We hardly ever lose in the CDR space (or any xDR, really).  Zynga has petabytes of what could be loosely described as “web logs” and “social data” stored in Vertica.  The evidence list becomes long and boring quite quickly, so I won’t get into how the other 2 out of 7 “Big Data” areas are, as presented, too nebulous to decide one way or the other.

I can’t claim to have a definitive definition of “Big Data”.   But I can tell you that for each meaningful result (such as a sale of a product), there are many website clicks made, and for each “click” there are many impressions (things that are presented to the user but not clicked).  If you want to analyze these things, and maybe run some tests where you try experiments on your customers and prospects to see what does the best job of reeling them in, you will strain the the abilities of single-machine processing, “traditional” RDBMSs, and many statistics packages and algorithms (yes, including your trusty Excel spreadsheet).  Then there is machine-generated data.  The handful of self-aware devices you own (your desktop PC, laptop, GPS-enabled smart phone, smart meter, car, refrigerator, etc.) have much more free time with which to generate “Big Data” than you do.  (For example, a fair-sized cable MSO has a “sensor network” with 20 million networked devices that never rest, producing 10+ billion rows a day.)

So now that the definition of “Big Data” is as clear as mud, let us next consider structured vs. unstructured data.  I have heard people say that “web logs are unstructured data”.  This is outright silly.  The average web log is entirely machine-generated one line at a time, and to do any serious analysis on it you are going to have to parse it into a format with some amount of structure (at least the date and time, session, page, etc.).  Sure, it can be stored as “unstructured data” in your favorite file system, but that’s a simple matter of procrastination on the issue of writing (or finding that someone has already written a parser.  On the other hand, Twitter data, with its rigid 140-character limit is quite “structured”, but figuring out what it “means” is nonetheless a respectable challenge.

So instead, I would implore you to consider “machine information” versus “human information”.  It is probably true that, byte for byte, there is 10x more “human” information.  The slide deck our sales guys use takes up 10x more space on disk than the spreadsheet that contains the funnel and prospect list.  Viral videos probably (I am not privy to the hard data) consume 10x more disk space than the IPDRs and web logs recording their accesses.

But while disk size is a fun, quotable metric, it says little about the nature of the “Big Data” problem you have to solve.  Instead, consider whether you have “machine” or “human” information.  You should be able to tell at a glance, and this will help you choose the right tools.  If it’s machine-generated financial trades, a scale-out SQL database with some time series analytics extensions will do nicely.  If it’s the tweets of Twitter twits, you can use a structured data tool, but you’re going to need some (in a very loose sense of the word) natural language sentiment analysis and graph processing packages.  If it is a bunch of PDFs, Word documents, HTML pages, PowerPoint presentations, and so on scattered across eleventeen different web servers, CMS systems, and file systems in your enterprise, you may need a high-powered “human information” system like Autonomy (and if you are an intelligence agency that needs to watch every video broadcast worldwide to gauge international sentiment, I think they can help you with that too…).

There is a point to all of this.  I can’t claim to have invented this tag line, but I wish I had.  You should “Know SQL” before you assume you should “NoSQL”.  While “Big Data” couldn’t have lived in an SQL database as they existed 10 years ago, we have different kinds of SQL databases now, that are “web scale”, high performance, designed for analytic workloads, cost effective, and so on.  It’s no longer, as a tech visionary in our own back yard recently said, “just a box that can’t keep up”.  If you have “Big Data” upon which structure can be imposed, analytic databases are very efficient, flexible, easy to use, and come with all the features people have come to expect from 30+ years of development.  (Try one.  We have a free, community download!)

How to Make Pig for Hadoop SQL

A few weeks back, my co-worker Rajat Venkatesh gave a little talk on his work with the Vertica Connector for Hadoop.  When he displayed a Pig program, the de facto programming language for Hadoop, I noted that it looks a lot like SQL.  And also, that it shouldn’t be too hard to convert a Pig program to SQL, allowing us to run Pig on Vertica directly!

So, I spent a few hours over the last week and whipped up an engine for running Pig programs on Hadoop and Vertica.  I’m not trying to start up a Hadoop vs. database war, or even “perfume” Pig (sorry… bad joke).  I just wanted to make life easier for those users who are combining Hadoop and Vertica to get the most out of their data.

The core idea of the Vertica Pig/SQL conversion engine is to rewrite the data flow described by the Pig program into a set of nested SQL queries that produce the query answer. The conversion engine uses the regular PigScriptParser, looks at the LogicalPlan, and transforms each Operator into a SQL statement that implements its functionality.

The Vertica Connector for Hadoop is used to move data from HDFS into Vertica and from Vertica back into HDFS.  We run a mixture of Pig scripts and SQL scripts in order to create tables in Vertica, move the data from Hadoop to Vertica, compute the result, and move it back to Hadoop.  The hope is to seamlessly take advantage of Vertica to run your Pig program faster.

Here’s a fairly trivial Pig script example:

x = LOAD ‘foo.log’ USING PigStorage(‘|’) as (a:int,b:int,c:int);
y = FILTER x BY a > 5;
z = FOREACH y GENERATE b*c as f;
z2 = DISTINCT z;
z3 = FOREACH z2 GENERATE f, f*f as g;
z4 = ORDER z3 by g;
STORE z4 INTO ‘zzz’;

And here is the corresponding code run by the Vertica conversion engine:

Script: schema refresh(SQL)[ErrorsOK]
DROP SCHEMA squeal_PigLatin_simple_pig CASCADE;
CREATE SCHEMA squeal_PigLatin_simple_pig;
Script: sql setup(SQL)
CREATE TABLE squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log
(
a int,
b int,
c int,
d varchar(65000)
);
Script: test sql(SQL)
SELECT * FROM (SELECT * FROM (SELECT * FROM (SELECT f as f,f * f as g FROM (SELECT DISTINCT * FROM (SELECT b * c as f FROM (SELECT * FROM squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log as x WHERE a > 5) AS y) AS z) as z2) AS z3 ORDER BY g) AS z4 LIMIT 3) AS z5
Script: load3(PIG)
x = LOAD ‘hdfs://L3H9215.verticacorp.com:9935/user/bvandiver/simple/foo.log’ USING PigStorage(‘|’) AS (a:int,b:int,c:int,d:chararray);
STORE x into ‘{squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log(a int,b int,c int,d varchar(65000))}’ using com.vertica.pig.VerticaStorer(‘localhost’,'verticadb502′,’5935′,’bvandiver’,'”"‘);
Script: populate vdata4(SQL)
CREATE TABLE squeal_PigLatin_simple_pig.vdata AS SELECT * FROM (SELECT * FROM (SELECT * FROM (SELECT f as f,f * f as g FROM (SELECT DISTINCT * FROM (SELECT b * c as f FROM (SELECT * FROM squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log as x WHERE a > 5) AS y) AS z) as z2) AS z3 ORDER BY g) AS z4 LIMIT 3) AS z5
Script: store vdata5(PIG)
vdata = LOAD ‘sql://{SELECT * FROM squeal_PigLatin_simple_pig.vdata}’ using com.vertica.pig.VerticaLoader(‘localhost’,'verticadb502′,’5935′,’bvandiver’,'”"‘);
STORE vdata INTO ‘hdfs://L3H9215.verticacorp.com:9935/user/bvandiver/simple/zzz’ USING org.apache.pig.builtin.PigStorage();

While this example only shows filter, distinct, order, and foreach, the conversion engine currently supports limit, join, sample, and group as well.

There are a number of cool tricks we can do to improve performance. One trick is that we’re already taking advantage of the Vertica Optimizer to make data-aware decisions. For example, the Optimizer will chose join order based on the data actually loaded, instead of relying on the programmer to know what the best order was (and specify it correctly!).

Additionally, we can take advantage of the Vertica Database Designer. Our Database Designer automatically chooses a good data layout and compression strategy for storing the data in Vertica. The Designer uses the schema, the queries, and a sample of the data as its input – all of which the conversion engine can provide and involve programmatically. Finally, you can leave the original data in Vertica, so the second time you run the script, you avoid paying the cost of transferring the data.

What’s the take-away? With the Vertica Pig/SQL conversion engine, dislike of writing SQL is not a reason to throw away your database. If scalability is your major concern, I’d encourage you to see how well Vertica scales up to the challenge of large data.

For an off-hours, couple-day hack, the conversion engine is pretty feature-rich! Today, the conversion engine doesn’t support all of Pig (for instance cogroup or user-defined functions), but this is simply a matter of time and effort. We would love to hear from our user community on the interest in using the Vertica Pig/SQL conversion engine if it were a real product. If so, we will improve it and release it as an open source download.

Thoughts?

What’s New in Vertica 4.1

Vertica announced a new version of its Vertica Analytics Platform software, version 4.1, on Tuesday, November 9th at the TDWI Orlando. You can read more about Vertica 4.1 in the press release, but I wanted to give you a few of the highlights of the features that make 4.1 so important to our customers, or anyone looking to make the most of their data.

What’s New in Vertica 4.1 from Vertica Systems on Vimeo.

Here are some highlights from the video:

What’s New Intro
Third-Party Tools Integration – 0:43
SQL Macros – 2:14
Enhanced Security & Authentication – 2:47
Updates & Deletes – 3:27
Vertica 4.1 Wrap Up – 3:50
We hope you enjoy the video!

Gap Filling and Interpolation (GFI)

A Swiss-Army Knife for Time Series Analytics

Gap Filling and Interpolation (GFI) is a set of patent-pending time series analytics features in Vertica 4.0.  In a previous post, we introduced GFI and showcased their expressive power and ease of use through a use case in financial services. In this post, through additional use cases, we will show that GFI can enable Vertica users in a wide range of industry sectors to achieve a diverse set of goals.

Use Case #1: Aggregate bandwidth analysis for Video-On-Demand (VOD) applications

In a Telecommunications VOD workload, assume we have the following (simplified) fact table schema for tracking the quality of service (QoS) in serving video content to end users: vod_qos_fact(user_id, video_id, timestamp, bandwidth).

Each time a user requests a video clip, a row is inserted into the fact table, recording the timestamp and the bandwidth consumption of serving the video clip at that time.  In addition, when the bandwidth consumption changes significantly for a video clip being served, a new row is inserted to record the new bandwidth consumption. An example fact table might be:

In this example, the  VOD QoS analyst might like to know the following:  for every minute, what is the aggregate bandwidth consumption across all users? This question can be answered in Vertica 4.0 in two pipelined steps.

First, the GFI time series analytics is used to compute the interpolated bandwidth consumption of each user for every minute. The result is shown below.

Second, the interpolated bandwidth consumption is aggregated for every minute. The resulting SQL formulation of these two steps is the following.

SELECT slice_time, sum(i_bandwidth) agg_bandwidth

FROM (

SELECT user_id, slice_time, TS_FIRST_VALUE(bandwidth) AS i_bandwidth

FROM vod_qos_fact

TIMESERIES slice_time AS ’1 minute’ OVER (PARTITION BY user_id ORDER BY timestamp)) tmp

GROUP BY slice_time;

In comparison to other popular solution alternatives such as stored procedure, Map/Reduce script and ODBC/JDBC program, the Vertica approach enjoys the following benefits:

  • Code and architecture simplicity: The computation is expressed in SQL, and is performed within the Vertica database. No additional programming language or execution platform is needed.
  • Efficiency and scalability: The Vertica approach effectively balances the data and computation across all cluster nodes, and minimizes the amount of disk and network I/O needed to accomplish this computation task.

These benefits apply to the following use cases as well.

Use Case #2: Storage compression for inventory management applications

In a retail workload, assume we have the following (simplified) fact table schema for inventory management: inventory_fact(item_id, item_location, timestamp, quantity_in_stock).

For each inventory item in a certain location, every time its stock quantity changes, there is a new row inserted into this fact table with the updated quantity value and the associated timestamp of that inventory change.

For example, I recent went to a nearby Apple Retail Store to purchase an iPad. After that transaction, suppose Apple’s inventory_fact table is populated with a new row, indicating that for this particular Apple store, the quantity of its iPads in stock has decreased by 1.

For a popular product like the iPad, the inventory level in each store may change thousands of times or more each day, creating a large number of rows in the inventory_fact table. However, for the purpose of inventory analysis, the inventory data can often be stored in a much coarser time granularity, such as one row per day for each item and location pair.

The GFI time series analytics in Vertica 4.0 can be used to efficiently aggregate the raw inventory data into a proper time granularity, thus consuming much less storage as well as significantly speeding up inventory analysis. Note that in comparison, traditional SQL analytic functions such as FIRST_VALUE cannot be used to correctly compute the aggregation (the reasoning behind this statement is left as an exercise for the interested reader).

Use Case #3: Power reduction for wireless sensor applications

In a Smartdust project, computer scientists may deploy a large number of sensors in a wild region to measure environmental parameters, such as light and temperature. The sensors transmit their measurement readings via wireless signals to one or more base stations, where installed applications process the measurements.

Since these sensors are battery-driven, efficient power management is crucial to ensure a prolonged operation. The key to efficient power management is to minimize the amount of data transmission from sensors to the base stations.

A base station application may require that measurement updates be available at regular time intervals; for example, once every second. One straightforward implementation is to have each sensor transmit a new measurement for each time interval. However, this transmission policy could lead to a huge battery drain on the sensors, especially considering that the consecutive measurements from the same sensor often carry the same value.

Instead, we could have the sensors submit their readings only when the measurement values change, and then use GFI on the base station application to losslessly recover the required measurement values at regular intervals.

Conclusion

The above three use cases respectively illustrate the use of Vertica’s GFI time series analytics in the context of Telecommunications, Retail, and Science applications. You may be wondering, how to assess whether GFI applies to your particular problem at hand? One way is to look for the following common characteristics of problems to which GFI is applicable:

  1. You have multiple time series event streams (say from multiple users or multiple stock symbols), where the events do not occur at regular time intervals. However, you would like to aggregate or correlate over these event streams (use case #1)
  2. The raw data is stored in a denser time granularity than needed in your analysis computation (use case #2)
  3. The raw data is stored in a sparser time granularity than needed in your analysis computation (use case #3)

With Vertica, the built-in GFI functionality allows you to simplify and accelerate your time-series analytics.