Vertica

Archive for the ‘SQL’ Category

Using Location Data with HP Vertica Place

A couple of weeks ago, I went to my favorite museum in the Boston area to see a new exhibition. While I was there, I noticed a lot of visitors walking through the museum using self-guided tours on handheld devices. I knew that the museum rented such devices to visitors, but I hadn’t thought about how the museum could use these devices to gather information about visitor behavior. Because I’ve been working on HP Vertica’s geospatial functionality, HP Vertica Place, I wondered what kind of insight the museum might be able to gather from their visitors’ spatial relationship with the museum and its artwork.

HP Vertica Place leverages the performance and scale of the HP Vertica Analytics Platform and uses OGC-based SQL functions for computation on two-dimensional planar data. So I knew that HP Vertica Place could help me answer a number of questions involving point-in-polygon containment, distance, and intersection, to name a few.

To test my hypothesis I needed a dataset. Because I couldn’t find one publicly available, I created my own dataset using a Python script.

Instead of trying to replicate the movement of thousands of people through a space of more than 500,000 square feet, I took a more micro approach. Museums are normally compartmentalized into rooms, so I chose to focus on a single room of a museum. In that room, I added six different works of art for visitors to view. Visitors could view the artworks in any order and for varying amounts of time. All visitors to the room are assumed to be using a handheld device that provides them additional information about the artwork. The handheld device records each visitor’s location every second. This information is downloaded from the device after the visitor returns it and is then loaded into an HP Vertica database.

Using the script I created, I generated a dataset of 2,000 visitors moving through the room from 9:00 am to 6:00 pm.


This graphic shows how the room was laid out:
[Image: room layout]

I then posed four questions that can provide valuable information about the collection and how visitors engaged with the space:

  • Which work of art was the most popular?
  • How many people interacted with each artwork?
  • On average, how much time does a visitor spend viewing an artwork?
  • At the busiest times of day, how physically close are visitors to one another?

To answer these questions, I queried my database of visitor location data together with the polygons representing the viewing area of each artwork.
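Before STV_Intersect can reference INDEX='art_index', the viewing-area polygons have to be indexed. Here is a minimal setup sketch; the polygon table name and columns (artworks, gid, g) are my assumptions, since the post doesn’t show them:

-- Build the spatial index used by the queries below as INDEX='art_index'.
-- Assumes a table artworks(gid INT, g GEOMETRY) with one viewing-area
-- polygon per artwork; the real table name may differ.
SELECT STV_Create_Index(gid, g USING PARAMETERS index='art_index') OVER()
FROM artworks;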

While working through the first question, I discovered that I could use HP Vertica Place to write one query to answer my first two questions. Here’s the query I wrote:

SELECT pol_gid,
       COUNT(DISTINCT(usr_id)) count_user_visit
FROM
  (SELECT pol_gid,
          usr_id,
          COUNT(usr_id) user_points_in
   FROM
     (SELECT STV_Intersect(usr_id, g USING PARAMETERS INDEX='art_index') OVER(PARTITION BEST) AS (usr_id,
                                                                                                  pol_gid)
      FROM usr_data
      WHERE date_time BETWEEN '2014-07-02 09:30:20' AND '2014-07-02 17:05:00') AS c
   GROUP BY pol_gid,
            usr_id HAVING COUNT(usr_id) > 20) AS real_visits
GROUP BY pol_gid
ORDER BY count_user_visit DESC;
 pol_gid | count_user_visit
---------+------------------
       4 |             1663
       2 |             1558
       3 |             1483
       6 |             1272
       5 |             1261
       1 |              856
(6 rows)

From the results it’s clear that artwork number 4 is the most popular, but 2 and 3 are in close competition. However, artwork 1 isn’t drawing nearly as many visitors as we’d expect. Why? Is it not properly marketed? Or is the location a problem? The results of this query could help us address those types of questions.

Curators across the country would be thrilled to know, on average, how long visitors spend in the viewing area of each artwork in their collection. I used this query to calculate the average time spent viewing the most popular artwork (number 4):

SELECT AVG(count_seconds)
FROM
  (SELECT usr_id, COUNT(*) count_seconds
   FROM usr_data
   WHERE STV_Intersect(g USING PARAMETERS index='art_index') = 4
   GROUP BY usr_id) foo;

       AVG
------------------
 328.536970933197
(1 row)

Because the handheld device records each visitor’s location once per second, counting the points that fall inside a viewing area gives the number of seconds each visitor spent there, which is a good estimate of how long each visitor viewed the work.
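The same one-point-per-second logic extends to all six artworks at once. Here is a hedged variation (not from the original post) that computes the average viewing time per artwork, assuming points that fall outside every viewing area return NULL:

-- Average seconds per visit, per artwork (sketch; assumes NULL for points
-- that are not inside any viewing area).
SELECT pol_gid,
       AVG(count_seconds) AS avg_seconds
FROM
  (SELECT pol_gid,
          usr_id,
          COUNT(*) AS count_seconds
   FROM
     (SELECT STV_Intersect(g USING PARAMETERS index='art_index') AS pol_gid,
             usr_id
      FROM usr_data) pts
   WHERE pol_gid IS NOT NULL
   GROUP BY pol_gid, usr_id) per_visit
GROUP BY pol_gid
ORDER BY avg_seconds DESC;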

The fourth question relates to my most common gripe when visiting a popular exhibition or artwork: how close are people standing to one another during the busiest time of day? For this query, I decided to look at artwork number 4 because it was the most popular in our previous query:

SELECT AVG(ST_Distance(foo.g,foo1.g))
FROM tmp foo,
     tmp foo1
WHERE foo.i!=foo1.i;
       AVG
------------------
 2.20523723570353
(1 row)
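The query above references a helper table tmp(i, g) that the post doesn’t define. Here is one hedged way it might have been populated: one representative point per visitor inside artwork 4’s viewing area during an assumed busy window (the specific hour is my guess):

CREATE LOCAL TEMPORARY TABLE tmp ON COMMIT PRESERVE ROWS AS
SELECT usr_id AS i, g
FROM (SELECT usr_id, g,
             ROW_NUMBER() OVER (PARTITION BY usr_id ORDER BY date_time) AS rn
      FROM usr_data
      WHERE STV_Intersect(g USING PARAMETERS index='art_index') = 4
        AND date_time BETWEEN '2014-07-02 13:00:00' AND '2014-07-02 14:00:00') s
WHERE rn = 1;  -- keep one point per visitor to avoid over-counting pairs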

On average during the busiest time of day, visitors viewing the most popular artwork are standing 2.2 feet apart from one another! This type of insight is fascinating. What if this artwork was placed in a larger room with a wider viewing angle? Would more people be able to get a more direct view of the artwork? These questions are difficult to answer without this type of data and analysis, but think about how having this data could improve the overall visitor experience.

Imagine that you’re a museum curator and learn that your most popular artwork isn’t the Monet that your members continually rave about in their visitor surveys. Instead, it’s actually the Renoir across from the Monet that generates the most traffic. Or, you discover that the buzz of a high-profile auction drove more traffic to your three Giacometti statues than the email marketing campaign you did the month prior.

This type of data is available in many different types of business scenarios, not just museums. Location data provides valuable insights into how people interact with spaces. HP Vertica can help you discover these insights. 

Do you want to try out this example on your own? Install HP Vertica 7.1.x and the HP Vertica Place package from your my.vertica page. Then, download the dataset and accompanying SQL file from our GitHub repository.

How to make ROLLUP fly in Vertica?


ROLLUP is a very common Online Analytic Processing (OLAP) function and is part of ANSI SQL. Many customers use ROLLUP to write reports that automatically perform sub-total aggregations across multiple dimensions at different levels in one SQL query. The following is an example:

Select DEPT_CODE, STORE_CODE, SUM(SALE_TOT_QTY) As TOTAL_SALES
From AGG_CUST_MONTHLY A INNER JOIN DIM_DATE B
ON ( A.DATE_KEY = B.DATE_KEY )
WHERE B.DATE_KEY BETWEEN 20140301 AND 20140331
GROUP BY ROLLUP(DEPT_CODE, STORE_CODE);

Note: ROLLUP syntax is not supported in the current version of Vertica (7.0); the query above is shown only as an illustration. See the blog text for more details.

The business question the above query intends to answer is: For the month of March, 2014, show the total sales across both departments (dept_code) and stores (store_code) to generate sub-total/total sales figures at three levels:

  1. At the detailed level of each pair of (dept_code, store_code)
  2. At the level of only (dept_code) but aggregating across all stores
  3. Finally a grand sales total aggregating across all departments and stores

Assuming performance is not an issue, ROLLUP is clearly a very powerful and useful function that can do a lot of data crunching in the background and deliver quite a complex report from a single query.

Unlike a plain GROUP BY, in which the order of the grouping columns does not matter, ROLLUP works by creating subtotals that “roll up” from the most detailed level to a grand total, and it requires an ordered list of grouping expressions as arguments. ROLLUP first calculates the standard aggregate values specified in the GROUP BY (without ROLLUP), then moves from right to left through the list of grouping columns, starting with the lowest level of aggregation and creating progressively higher-level subtotals. This process continues until all grouping columns have been consumed and all aggregations computed; the operator essentially “rolls up” its grouping operations, culminating in a grand total.

If N is the number of grouping columns, the ROLLUP function creates (N+1) levels of subtotals (including the final grand total). The column order in a ROLLUP operator is critical, since it works by removing the rightmost column at each step. For example:

ROLLUP(A, B, C) creates 4 groups: (A, B, C), (A, B), (A), ().

Note there are NO groups such as (A, C) and (B, C).
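In other words, ROLLUP over a list of grouping columns is shorthand for a UNION ALL of ordinary GROUP BY queries, one per level. A generic sketch for two hypothetical grouping columns A and B with a measure M on a table T:

-- ROLLUP(A, B) is logically equivalent to:
SELECT A, B, SUM(M) FROM T GROUP BY A, B   -- level (A, B)
UNION ALL
SELECT A, NULL, SUM(M) FROM T GROUP BY A   -- level (A)
UNION ALL
SELECT NULL, NULL, SUM(M) FROM T;          -- grand total ()

This equivalence is exactly what the workaround below exploits.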

Unfortunately, the current version of Vertica (7.0) does not support ROLLUP. So is there an alternative or workaround for existing Vertica customers, and will it perform well on a large data set? Fortunately, the answer to both questions is yes. However, it does take some ingenuity in physical design and SQL rewriting to make ROLLUP work and perform well in the currently supported Vertica releases (version 5.1 and later).

The key steps are to create and pre-load a local temp table that is sorted on ALL the ROLLUP columns in exactly the same order, rewrite the aggregation at each level as a GROUP BY sub-query, and finally UNION ALL these sub-queries together. Applying this strategy to the ROLLUP SQL posted at the beginning of this blog, we arrive at the following alternative/workaround:

DROP TABLE IF EXISTS AGG_TEMP CASCADE;

CREATE LOCAL TEMPORARY TABLE AGG_TEMP
ON COMMIT PRESERVE ROWS
AS
/*+ DIRECT */
( Select DEPT_CODE, STORE_CODE, SUM(SALE_TOT_QTY) As TOTAL_SALES
From AGG_CUST_MONTHLY A INNER JOIN DIM_DATE B
ON ( A.DATE_KEY = B.DATE_KEY )
WHERE B.DATE_KEY BETWEEN 20140301 AND 20140331
GROUP BY DEPT_CODE, STORE_CODE )
ORDER BY DEPT_CODE,
STORE_CODE
SEGMENTED BY HASH (DEPT_CODE, STORE_CODE) ALL NODES;

SELECT ANALYZE_STATISTICS('AGG_TEMP');

SELECT DEPT_CODE, STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP
GROUP BY DEPT_CODE, STORE_CODE
UNION ALL
SELECT DEPT_CODE, NULL As STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP
GROUP BY DEPT_CODE
UNION ALL
SELECT NULL As DEPT_CODE, NULL As STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP;

The fact that temp table AGG_TEMP is sorted on the ROLLUP columns (dept_code, store_code) is important for performance because this means that all the GROUP BY sub-queries at individual levels will be able to exploit the highly efficient GROUPBY PIPELINED operator (explicit in the query’s explain plan) in Vertica to get the job done quickly and accurately.
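One way to confirm this on your own cluster (a hedged suggestion, since plan output varies by version) is to run EXPLAIN on one of the per-level sub-queries and look for GROUPBY PIPELINED rather than GROUPBY HASH in the output:

EXPLAIN
SELECT DEPT_CODE, STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP
GROUP BY DEPT_CODE, STORE_CODE;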

A nice property of GROUPBY PIPELINED operator (in contrast to the more resource intensive GROUPBY HASH) is that its performance is largely independent of the actual table size. As a result this workaround can easily scale up to include tables with 10+B or even 100+B rows without any noticeable drop in query performance (as measured in first fetch time).

To conclude, I would like to mention that serious work is going on in Vertica Engineering to implement ROLLUP (and many other OLAP functions) in upcoming Vertica releases. Stay tuned!

Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.

Vertica on MapR SQL-on-Hadoop – join us in June!

We’ve been working closely with MapR Technologies to bring to market our industry-leading SQL-on-Hadoop solution, and on June 3, 2014, we will jointly deliver a live webinar featuring this joint solution and related use cases. To register, and to learn how you can enjoy the benefits of a SQL-on-Hadoop analytics solution that provides the highest-performing, tightly integrated platform for operational and exploratory analytics, click here.

The joint solution is unified and integrated, reducing complexity and costs by running a single cluster for both HP Vertica and Hadoop. It tightly integrates HP Vertica’s 100% ANSI SQL, high-performance Big Data analytics platform with the MapR enterprise-grade Distribution for Apache Hadoop, giving customers and partners the highest-performing, most tightly integrated solution for operational and exploratory analytics at the lowest total cost of ownership (TCO).

This solution will also be presented live by HP Vertica and MapR executives at HP Discover on June 11, 2014. For more information, visit the HP Discover website.

In addition, a specially-optimized version of the MapR Sandbox for Hadoop is now available in the HP Vertica Marketplace. To download this and other add-ons for the HP Vertica Analytics platform, click here.


Can Vertica Climb a Tree?


The answer is YES, if it is the right kind of tree. Here “tree” refers to a common data structure that consists of parent-child hierarchical relationships, such as an org chart. Traditionally, this kind of hierarchical data structure can be modeled and stored in tables, but it is usually not simple to navigate and use in a relational database (RDBMS). Some other RDBMSs (e.g., Oracle) have a built-in CONNECT BY clause that can be used to find the level of a given node and navigate the tree. However, if you take a close look at its syntax, you will realize that it is quite complicated and not at all easy to understand or use.

For a complex hierarchical tree with 10+ levels and a large number of nodes, any meaningful business question that requires joins to fact tables, plus aggregation and filtering at multiple levels, will result in SQL statements that look extremely unwieldy and can perform poorly. The reason is that this kind of procedural logic may internally scan the same tree multiple times, wasting precious machine resources. This approach also flies in the face of the basic SQL principles of being simple, intuitive, and declarative. Another major issue is integration with third-party BI reporting tools, which often do not recognize vendor-specific variants such as CONNECT BY.

Other implementations include ANSI SQL’s recursive syntax using WITH and UNION ALL, special graph-based algorithms, and the enumerated-path technique. These solutions tend to follow an algorithmic approach and, as such, can be long on theory but short on practical application.

Since SQL derives its tremendous power and popularity from its declarative nature, specifying clearly WHAT you want to get out of an RDBMS but not HOW to get it, a fair question to ask is: is there a simple and intuitive way to model and navigate this kind of hierarchical (recursive) data structure in an RDBMS? Thankfully, the answer is yes.

In the following example, I will discuss a design that focuses on “flattening” out this kind of hierarchical parent-child relationship in a special way. The output is a wide, sparsely populated table with extra columns that hold the node IDs at the various levels of the tree; the number of these extra columns depends on the depth of the tree. For simplicity, I will use one table with one hierarchy as an example; the same design principles can be applied to tables with multiple hierarchies embedded in them. The following is a detailed outline of how this can be done in a program/script (a SQL sketch of these steps follows the list):

  1. Capture the (parent, child) pairs in a table (table_source).
  2. Identify the root node by following specific business rules and store this info in a new temp_table_1.
    Example: parent_id=id.
  3. Next find the 1st level of nodes and store them in a temp_table_2. Join condition:
    temp_table_1.id=table_source.parent_id.
  4. Continue to go down the tree and at the end of each step (N), store data in temp_table_N.
    Join condition: temp_table_M.parent_id=temp_table_N.id, where M=N+1.
  5. Stop at a MAX level (Mlevel) when there is no child for any node at this level (leaf nodes).
  6. Create a flattened table (table_flat) by adding a total of (Mlevel+1) columns named LEVEL,
    LEVEL_1_ID, …, LEVEL_Mlevel_ID.
  7. A SQL insert statement can be generated to join all these temp tables together to load
    into the final flat table: table_flat.

  8. When there are multiple hierarchies in one table, the above procedures can be repeated for each
    hierarchy to arrive at a flattened table in the end.
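Here is a minimal SQL sketch of steps 1-7 for a tree that is three levels deep, using the generic names from the outline above (table_source with columns id and parent_id, temp_table_N, table_flat); the depth and the root-detection rule are assumptions for illustration:

-- Step 2: root node(s), using the business rule parent_id = id.
CREATE LOCAL TEMPORARY TABLE temp_table_1 ON COMMIT PRESERVE ROWS AS
SELECT id FROM table_source WHERE parent_id = id;

-- Step 3: level-2 nodes (children of the root).
CREATE LOCAL TEMPORARY TABLE temp_table_2 ON COMMIT PRESERVE ROWS AS
SELECT s.id, s.parent_id
FROM table_source s JOIN temp_table_1 t1 ON s.parent_id = t1.id
WHERE s.parent_id <> s.id;

-- Steps 4/5: level-3 nodes; repeat this pattern until a level comes back empty.
CREATE LOCAL TEMPORARY TABLE temp_table_3 ON COMMIT PRESERVE ROWS AS
SELECT s.id, s.parent_id
FROM table_source s JOIN temp_table_2 t2 ON s.parent_id = t2.id;

-- Steps 6/7: the flattened table, one LEVEL_N_ID column per level plus LEVEL.
CREATE TABLE table_flat (node_id INT, level INT,
                         level_1_id INT, level_2_id INT, level_3_id INT);

INSERT INTO table_flat
SELECT id, 1, id, NULL, NULL FROM temp_table_1;
INSERT INTO table_flat
SELECT t2.id, 2, t2.parent_id, t2.id, NULL FROM temp_table_2 t2;
INSERT INTO table_flat
SELECT t3.id, 3, t2.parent_id, t3.parent_id, t3.id
FROM temp_table_3 t3 JOIN temp_table_2 t2 ON t3.parent_id = t2.id;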


This design is general and is not specific to any particular RDBMS architecture: row, column, or hybrid. However, the physical implementation of this design naturally favors columnar databases such as Vertica. Why? The flattened table is usually wide, with many extra columns, and these extra columns tend to be sparsely populated, so they can be stored very efficiently in compressed format in Vertica. Another advantage is that when only a small set of these columns is included in the select clause of a SQL statement, because of Vertica’s columnar nature, the other columns (no matter how many there are) will not introduce any performance overhead. This is as close to a “free lunch” as you can get in an RDBMS.

Let’s consider the following simple hierarchical tree structure:

Vertica Tree diagram

There are four levels and the root node has an ID of 1. Each node is assumed to have one and only one parent (except for the root node) and each parent node may have zero to many child nodes. The above structure can be loaded into a table (hier_tab) having two columns: Parent_ID and Node_ID, which represent all the (parent, child) pairs in the above hierarchical tree:

Chart 1

It is possible to develop a script to “flatten” out this table by starting from the root node, going down the tree recursively one level at a time and stopping when there is no data left (i.e. reaching the max level or depth of the tree). The final output is a new table (hier_tab_flat):

Chart 2

What’s so special about this “flattened” table? First, it has the same key (Node_ID) as the original table. Second, it has several extra columns named LEVEL_N_ID, and the number of these columns is equal to the max number of levels (4 in this case), plus one extra LEVEL column. Third, for each node in this table, there is a row that includes the IDs of all of its parents up to the root (LEVEL=1), as well as the node itself; this represents a path starting from a node and going all the way up to the root level. The power of this new “flattened” table is that it encodes all the hierarchical tree information in the original table. Questions such as finding the level of a node, or finding all the nodes below a given node, can be translated into relatively simple SQL statements by applying predicates to the proper columns.

Example 1: Find all the nodes that are at LEVEL=3.

Select Node_ID From hier_tab_flat Where LEVEL=3;

Example 2: Find all the nodes that are below node=88063633.

This requires two logical steps (which can be handled in a front-end application to generate the proper SQL).

Step 2.1. Find the LEVEL of node= 88063633 (which is 3).

Select LEVEL From hier_tab_flat Where Node_ID=88063633;

Step 2.2. Apply a predicate to the column LEVEL_3_ID:

Select Node_ID From hier_tab_flat Where LEVEL_3_ID=88063633;
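If you do not want the extra round trip to look up the node’s level first, the same flattened columns support a single query that finds every node in the subtree of a given node, whatever level it is on (a hedged sketch using the columns above):

-- All nodes at or below node 88063633, regardless of its level;
-- add "And Node_ID <> 88063633" to exclude the node itself.
Select Node_ID
From hier_tab_flat
Where 88063633 In (LEVEL_1_ID, LEVEL_2_ID, LEVEL_3_ID, LEVEL_4_ID);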

Complex business conditions such as finding all the nodes belonging to node=214231509 but excluding the nodes that are headed by node=88063633 can now be translated into the following SQL:

Select Node_ID
From hier_tab_flat
Where LEVEL_2_ID=214231509
And LEVEL_3_ID <> 88063633;

By invoking the script that flattens one hierarchy repeatedly, you can also flatten a table with multiple hierarchies using the same design. With this flattened table in your Vertica tool box, you can climb up and down any hierarchical tree using nothing but SQL.

Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.

Top 4 Considerations When Evaluating a Data Analytics Platform

From fraud detection to clickstream analytics to simply building better products or delivering a more optimal customer experience, Big Data use cases are abounding with analytics at the core.

With a solid business or use case in place, the next step that organizations typically take is to investigate and evaluate the appropriate set of analytics technology from which to accomplish their analysis, often starting with a data analytics platform. But what are the requirements from which to base your evaluation?

The Winter Corporation, the large-scale data experts, has just finalized an in-depth white paper (The HP Vertica Analytics Platform: Large Scale Use and Advanced Analytics) that presents its results and findings from evaluation, independent research, customer and employee interviews, and documentation review.

Intended for a more technical audience, this white paper focuses on key evaluation criteria that your organization can use as a guide as you conduct your own evaluation.


Winter Corporation identified these key feature areas as critical for any data analytics platform:

1. Architecture
• Column store architecture
• Shared nothing parallelism
• Cluster size and elasticity
• Smart K-Safety based availability
• Hybrid storage model
• Multiple database isolation modes
• Both bulk load and trickle feed

2. Performance
• Extensive data compression and data encoding
• Read-optimized storage
• Highly parallel operation
• Storage of multiple projections
• Automatic physical database design

3. Generally Useful and Noteworthy Features for Large-Scale Use
• Export-import
• Backup/restore
• Workload analyzer
• Workload management
• Role-based security

4. Extensions for Advanced Analytics
• SQL extensions
• Built-in functions
• User-defined extensions
• Flexibility in accessing and analyzing all data (structured, semistructured, or unstructured)

Finally, once you have evaluated and confirmed that the data analytics platform meets your feature and technology requirements, you want to hear from other organizations that have deployed large-scale analytics initiatives in real-world environments.

The white paper concludes with a write-up on how Zynga, a social game services company with more than 240 million users of its online games, stores the actions of every player in every game — about 6 TB per day of data — in near-real time in the HP Vertica Analytics Platform. No matter where in the world a game event occurs, the data can be retrieved via a report or query from the central HP Vertica database no more than five minutes later.

Defining Big Data

Let’s start with an easy question.  What’s “Big Data”?  Fortunately, I read the answer to that in one of my favorite tech news sources just the other day: http://www.theregister.co.uk/2012/08/27/how_did_big_data_get_so_big/.  The answer, for those who won’t bother with the link, is “Big data is any data that doesn’t fit well into tables and that generally responds poorly to manipulation by SQL” … “A Twitter feed is Big Data; the census isn’t. Images, graphical traces, Call Detail Records (CDRs) from telecoms companies, web logs, social data, RFID output can all be Big Data. Lists of your employees, customers, products are not.”

That’s great, except that it is self-contradictory!  5 out of the 7 things the author considers “big data” are not only susceptible to manipulation by SQL (in a well-designed database of course), but have representation on Vertica’s “paying customers” list.  Twitter is a customer (and I thank them for their ability to simultaneously give us props and jabs in venues like ACM SIGMOD).  We hardly ever lose in the CDR space (or any xDR, really).  Zynga has petabytes of what could be loosely described as “web logs” and “social data” stored in Vertica.  The evidence list becomes long and boring quite quickly, so I won’t get into how the other 2 out of 7 “Big Data” areas are, as presented, too nebulous to decide one way or the other.

I can’t claim to have a definitive definition of “Big Data”.  But I can tell you that for each meaningful result (such as a sale of a product), there are many website clicks made, and for each “click” there are many impressions (things that are presented to the user but not clicked).  If you want to analyze these things, and maybe run some tests where you try experiments on your customers and prospects to see what does the best job of reeling them in, you will strain the abilities of single-machine processing, “traditional” RDBMSs, and many statistics packages and algorithms (yes, including your trusty Excel spreadsheet).  Then there is machine-generated data.  The handful of self-aware devices you own (your desktop PC, laptop, GPS-enabled smart phone, smart meter, car, refrigerator, etc.) have much more free time with which to generate “Big Data” than you do.  (For example, a fair-sized cable MSO has a “sensor network” with 20 million networked devices that never rest, producing 10+ billion rows a day.)

So now that the definition of “Big Data” is as clear as mud, let us next consider structured vs. unstructured data.  I have heard people say that “web logs are unstructured data”.  This is outright silly.  The average web log is entirely machine-generated one line at a time, and to do any serious analysis on it you are going to have to parse it into a format with some amount of structure (at least the date and time, session, page, etc.).  Sure, it can be stored as “unstructured data” in your favorite file system, but that’s a simple matter of procrastination on the issue of writing (or finding that someone has already written) a parser.  On the other hand, Twitter data, with its rigid 140-character limit, is quite “structured”, but figuring out what it “means” is nonetheless a respectable challenge.

So instead, I would implore you to consider “machine information” versus “human information”.  It is probably true that, byte for byte, there is 10x more “human” information.  The slide deck our sales guys use takes up 10x more space on disk than the spreadsheet that contains the funnel and prospect list.  Viral videos probably (I am not privy to the hard data) consume 10x more disk space than the IPDRs and web logs recording their accesses.

But while disk size is a fun, quotable metric, it says little about the nature of the “Big Data” problem you have to solve.  Instead, consider whether you have “machine” or “human” information.  You should be able to tell at a glance, and this will help you choose the right tools.  If it’s machine-generated financial trades, a scale-out SQL database with some time series analytics extensions will do nicely.  If it’s the tweets of Twitter twits, you can use a structured data tool, but you’re going to need some (in a very loose sense of the word) natural language sentiment analysis and graph processing packages.  If it is a bunch of PDFs, Word documents, HTML pages, PowerPoint presentations, and so on scattered across eleventeen different web servers, CMS systems, and file systems in your enterprise, you may need a high-powered “human information” system like Autonomy (and if you are an intelligence agency that needs to watch every video broadcast worldwide to gauge international sentiment, I think they can help you with that too…).

There is a point to all of this.  I can’t claim to have invented this tag line, but I wish I had.  You should “Know SQL” before you assume you should “NoSQL”.  While “Big Data” couldn’t have lived in an SQL database as they existed 10 years ago, we have different kinds of SQL databases now, that are “web scale”, high performance, designed for analytic workloads, cost effective, and so on.  It’s no longer, as a tech visionary in our own back yard recently said, “just a box that can’t keep up”.  If you have “Big Data” upon which structure can be imposed, analytic databases are very efficient, flexible, easy to use, and come with all the features people have come to expect from 30+ years of development.  (Try one.  We have a free, community download!)

How to Make Pig for Hadoop SQL

A few weeks back, my co-worker Rajat Venkatesh gave a little talk on his work with the Vertica Connector for Hadoop.  When he displayed a program written in Pig, the de facto programming language for Hadoop, I noted that it looked a lot like SQL, and also that it shouldn’t be too hard to convert a Pig program to SQL, allowing us to run Pig on Vertica directly!

So, I spent a few hours over the last week and whipped up an engine for running Pig programs on Hadoop and Vertica.  I’m not trying to start up a Hadoop vs. database war, or even “perfume” Pig (sorry… bad joke).  I just wanted to make life easier for those users who are combining Hadoop and Vertica to get the most out of their data.

The core idea of the Vertica Pig/SQL conversion engine is to rewrite the data flow described by the Pig program into a set of nested SQL queries that produce the query answer. The conversion engine uses the regular PigScriptParser, looks at the LogicalPlan, and transforms each Operator into a SQL statement that implements its functionality.

The Vertica Connector for Hadoop is used to move data from HDFS into Vertica and from Vertica back into HDFS.  We run a mixture of Pig scripts and SQL scripts in order to create tables in Vertica, move the data from Hadoop to Vertica, compute the result, and move it back to Hadoop.  The hope is to seamlessly take advantage of Vertica to run your Pig program faster.

Here’s a fairly trivial Pig script example:

x = LOAD 'foo.log' USING PigStorage('|') as (a:int,b:int,c:int);
y = FILTER x BY a > 5;
z = FOREACH y GENERATE b*c as f;
z2 = DISTINCT z;
z3 = FOREACH z2 GENERATE f, f*f as g;
z4 = ORDER z3 by g;
STORE z4 INTO 'zzz';

And here is the corresponding code run by the Vertica conversion engine:

Script: schema refresh(SQL)[ErrorsOK]
DROP SCHEMA squeal_PigLatin_simple_pig CASCADE;
CREATE SCHEMA squeal_PigLatin_simple_pig;
Script: sql setup(SQL)
CREATE TABLE squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log
(
a int,
b int,
c int,
d varchar(65000)
);
Script: test sql(SQL)
SELECT * FROM (SELECT * FROM (SELECT * FROM (SELECT f as f,f * f as g FROM (SELECT DISTINCT * FROM (SELECT b * c as f FROM (SELECT * FROM squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log as x WHERE a > 5) AS y) AS z) as z2) AS z3 ORDER BY g) AS z4 LIMIT 3) AS z5
Script: load3(PIG)
x = LOAD 'hdfs://L3H9215.verticacorp.com:9935/user/bvandiver/simple/foo.log' USING PigStorage('|') AS (a:int,b:int,c:int,d:chararray);
STORE x into '{squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log(a int,b int,c int,d varchar(65000))}' using com.vertica.pig.VerticaStorer('localhost','verticadb502','5935','bvandiver','""');
Script: populate vdata4(SQL)
CREATE TABLE squeal_PigLatin_simple_pig.vdata AS SELECT * FROM (SELECT * FROM (SELECT * FROM (SELECT f as f,f * f as g FROM (SELECT DISTINCT * FROM (SELECT b * c as f FROM (SELECT * FROM squeal_PigLatin_simple_pig.hdfs_L3H9215_verticacorp_com_9935_user_bvandiver_simple_foo_log as x WHERE a > 5) AS y) AS z) as z2) AS z3 ORDER BY g) AS z4 LIMIT 3) AS z5
Script: store vdata5(PIG)
vdata = LOAD 'sql://{SELECT * FROM squeal_PigLatin_simple_pig.vdata}' using com.vertica.pig.VerticaLoader('localhost','verticadb502','5935','bvandiver','""');
STORE vdata INTO 'hdfs://L3H9215.verticacorp.com:9935/user/bvandiver/simple/zzz' USING org.apache.pig.builtin.PigStorage();

While this example only shows filter, distinct, order, and foreach, the conversion engine currently supports limit, join, sample, and group as well.
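For a flavor of those additional operators, here is a hand-written sketch of the shape such a rewrite might take for a Pig GROUP followed by a FOREACH; this is illustrative only, not the engine’s actual output, and the staging table name is hypothetical:

-- Pig:  g = GROUP x BY a;
--       s = FOREACH g GENERATE group AS a, SUM(x.b) AS total;
-- One plausible SQL rewrite of the same data flow:
SELECT a, SUM(b) AS total
FROM (SELECT * FROM squeal_PigLatin_example.hdfs_x_source) AS x  -- hypothetical staging table
GROUP BY a;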

There are a number of cool tricks we can do to improve performance. One trick is that we’re already taking advantage of the Vertica Optimizer to make data-aware decisions. For example, the Optimizer will choose the join order based on the data actually loaded, instead of relying on the programmer to know what the best order is (and specify it correctly!).

Additionally, we can take advantage of the Vertica Database Designer. Our Database Designer automatically chooses a good data layout and compression strategy for storing the data in Vertica. The Designer uses the schema, the queries, and a sample of the data as its input – all of which the conversion engine can provide, invoking the Designer programmatically. Finally, you can leave the original data in Vertica, so the second time you run the script, you avoid paying the cost of transferring the data.

What’s the take-away? With the Vertica Pig/SQL conversion engine, dislike of writing SQL is not a reason to throw away your database. If scalability is your major concern, I’d encourage you to see how well Vertica scales up to the challenge of large data.

For an off-hours, couple-day hack, the conversion engine is pretty feature-rich! Today, the conversion engine doesn’t support all of Pig (for instance cogroup or user-defined functions), but this is simply a matter of time and effort. We would love to hear from our user community on the interest in using the Vertica Pig/SQL conversion engine if it were a real product. If so, we will improve it and release it as an open source download.

Thoughts?
