Archive for the ‘big data’ Category

HP Vertica Tutorials You Asked, We Listened.

Over recent months, we’ve heard our community request short, instructional videos and tutorials to help them learn more about the rich and powerful features of the HP Vertica Analytics Platform.

Well, we heard you, and have developed and posted some initial videos to help you maximize your investment in HP Vertica. We’ve posted a new videos that highlight new features in HP Vertica 7 (“Crane”). Among the videos we’ve posted are:

  • A two-part series on the HP Vertica Connector for HCatalog. Part 1 provides an overview. Part 2 includes a demonstration:
  • A demonstration of the power of HP Vertica Flex Zone:
  • A tuturial on how to run Database Designer in Management Console 7:
  • A five-part series that demonstrates how to set up HP Vertica 7 with Amazon Web Services
  • An introduction to HP Vertica 7 Fault Groups

You can see these and all video tutorials here. Here’s a sample:

Stay tuned in the weeks ahead. We’ll be posting new videos that highlight new features in Management Console, how to use Fault Groups to set up large clusters, and more.

We’d love to hear more from you! If you have any suggestions or ideas for topics for future videos, let us know. You can post your ideas on our forum at, or you can send ideas to

We’re committed to your success! Check back soon to see what’s new in HP Vertica Tutorials!

Visualizing Your Query Plan with Management Console 7

You run your newly crafted query and patiently wait for the results to appear on the terminal. You stare at your clock, waiting. 1 minute, 2 minutes, then 5, then 10. Your heart sinks. Why is it taking so long,? The query should be done by now, you tell yourself. You built your projections to optimize the joins, you’re sure there is enough memory to avoid spilling to disk. You start to doubt yourself at this point, so you’ll check to make sure.

You decide to run EXPLAIN to see if there’s anything obvious that the optimizer did incorrectly. You open a separate VSQL window and run EXPLAIN. You can see that there’s a hash-join at Path ID 4-that’s not good. You wonder, why isn’t this a merge-join? And, you could have sworn you were joining on sorted columns. You’d better check the sort order on the columns for your projections. What’s the query for that, again, you wonder. Well, since that may not be the bottleneck anyway; you decide to check the profile information for the query. You try to remember– which table stores profile information? EXECUTION_ENGINE_PROFILES, or QUERY_PLAN_PROFILES?”? What columns? Probably should select on all of them and see which columns I need.

And once you do find the columns you need, you may realize that trying to understand VSQL profile-metric outputs is not how you want to spend your afternoon.

But that doesn’t mean you are forever doomed to wade through dense text to get your answers…

Welcome to Management Console Query Plan Visualizer!

In the HP Vertica Analytics Platform 7., Management Console (MC) offers a simple interface, the Query Plan Visualizer, for getting plan or profile information on the your query. The Query Plan Visualizer provides a graphical view of the details of your query, allowing you to quickly identify and understand problem areas.

Explain SS

Let’s examine the same query mentioned previously using MC’s Query Plan Visualizer. Just paste in the query text and click Explain Plan . The results are shown here:

explain run ss

MC’s EXPLAIN output maintains the structure of the plan, and also highlights important information such as “No Statistics,” while linking to relevant metadata for the projections used and columns materialized. For example, we can see that Path ID 3 is a hash join, but now we can actually find out why.



So now we know why there was a hash-join instead of a merge-join. But how do we see how the query was actually executed? We can get the profile metrics for your query using either of these methods:

  1. We can click “Profile Query.” MC then executes the query and displays profile information for it once it completes. However, our query takes a while to run. This option may take a bit of time…
  2. We can also examine past query activity via the Activity tab located at the bottom of the screen. Using the Activity tab, we can retrieve profile information for queries we’ve already run.

In this case, we’ll choose the second option.

To do so:

    1. Go to the Activity tab,
    2. Select Queries from the dropdown menu for our chart type. This chart shows us a graph of number of queries run in the past.

Because we know our query was run recently, we’ll see it at the right side of the graph. Clicking that location brings us a table of query activity from the past few minutes. Sorting the queries by Elapsed brings our long-running query to the top.

run query ss

Clicking Explain/Profile on the far right of the table brings us back to the Query Plan Visualizer page and requests the profile information from the HP Vertica database.

query profile ss

The screen above shows a collapsed view of the profile information, which hides projection and column information. Metric information for each path appears to the right of the plan. We can measure 5 types of metrics for each path: disk usage, memory usage, data sent, data received, and time spent. Each blue bar represents the relative usage of a metric among all other paths. For example, in the Time column, we can see that the row of Path ID 3 has the largest blue bar (at about 35% fullness). This means, that out of all the paths, Path ID 3 took 35% of the total execution time. Now we can easily see that it was indeed our hash-join that took the most amount of time. Additionally, we can see that the disk-read on Path ID 6 was also responsible for a significant portion of the execution time.

So what about that pie chart? The pie chart shows how long the query took in each of its phases. As the query runs, it goes through multiple phases before it completes. Ideally, the query will spend most of its time in the “execution phase,” as the other phases should happen relatively quickly. So if your pie chart is mostly green, that’s good. Think of the chart as a sanity check that validates whether your query spent most of its time where it should.

Additionally, if you want to track the progress of a long running query, you can profile it with “Enable Monitoring” checked. With monitoring enabled, the counter values on the right hand side update at the set interval time, as well as show how much they increased or decreased by since the previous update. So rather than waiting for the query to complete profiling before you can see profile metric information, you can get the latest information on what paths are currently being processed at your set update-interval.

monitoring ss

By removing the need to know the specific queries required for getting profile information, and by making relevant data (projection metadata, query events) just a click away, the MC Query Plan Visualizer can greatly simplify the process of getting and understanding profiling information. If you’re still using version pre-7.0 version of MC, be sure to upgrade to a new Vertica 7.0 and give this a whirl :-)

Two new case studies – Jobrapido & Supercell.

We’ve published two more case studies, featuring Job Rapido and Supercell. These are compelling examples of innovative companies that use the HP Vertica Analytics Platform to gain a competitive edge and dervive maximum business value from Big Data. The two summaries and respective full case study PDF’s provide details about each company’s goals, success, and ultimate outcomes using HP Vertica. To see more like these, visit the HP Vertica Case Studies page.


Jobrapido scales its database
to the next level

Since its founding in 2006, Jobrapido has become one of the biggest online job search aggregators in the world, helping millions of users everywhere from Italy to the United states find the job that’s right for them. In 2012, they were acquired by Evenbase, a part of DMG media based in the UK. HP Vertica has proved invaluable to their success, performing above and beyond for their big data analytics needs. David Conforti Director of BI at Jobrapido describes HP Vertica as “like having a sort of magic mirror to ask to all the business questions that come to my mind,” one that has allowed him and his team to deliver their users both valuable insight and results, and a unique personal experience based on their analytics.

Download the full case study(.pdf)


Supercell performs real-time analytics

In 2012, Supercell delivered two top-grossing games on iOS with the titles “Clash of Clans” and “Hey Day,” just a year after its founding in 2011. Using HP Vertica big data analytics platform, Supercell has been able to engage in real-time gaming data analytics, allowing them to balance, adapt, and improve their gamers experiences on a day to day basis. “HP Vertica is an important tool in making sure that our games provide the best possible experience for our players” says Janne Peltola, a data scientist at Supercell. Using HP Vertica, Supercell is able to create gaming experiences that are fun and engaging for customers to keep coming back to, long after they have started playing.

Download the full case study (.pdf)

Avoiding the OLAP Cliff for Count Distinct Queries in Vertica

Since the early days of data warehousing and the heyday of Ralph Kimball, data warehouse practitioners recognized the use of pre-computed aggregates to be “the single most effective tool the data warehouse designer has to improve performance” (footnote 1)., However, there was then, and continues to be, a gaping hole in the dimensional modelling approach concerning distinct aggregates, and in particular what to do about COUNT(DISTINCT x).

Let’s say that you want to count the number of distinct users who visit your website each day, each week, and each month. You can solve this problem using a pre-computed aggregate. However, the number of distinct users who visited each week cannot be computed from the number of distinct users who visited each day because some customers may have visited your web site on more than one day in the same week. Since you can’t roll a distinct aggregate up and you can’t incrementally maintain it, you’re pretty much stuck with computing what you need from the detail data.

Before HP Vertica, this had always been the Achille’s Heel of OLAP: operations worked 10x to 1000x faster if you had a pre-computed aggregate (which looks fantastic in a demo). If however, you asked a question that depended on an aggregate that you hadn’t pre-computed or that could not be rolled up, then you fell off the “OLAP cliff” and needed to compute the answer from the detail data, and that could take a long time. Query performance for OLAP queries was highly variable and appeared erratic.

HP Vertica’s columnar MPP database architecture changed most of that. By operating directly on a compressed columnar representation of the data, combined with the ability to massively parallelize the computation, most aggregations could be computed in real time without pre-materializing the aggregates. However, computing COUNT(DISTINCT p) could still be an expensive, memory-intensive operation, even on HP Vertica’s massively parallel architecture. Computing a distinct aggregate on Vertica’s MPP architecture can be broken into these phases:

  • computing a partial aggregate per distinct group on each node
  • redistributing/collecting the partial aggregates on the same node and aggregating again per distinct group
  • sending the results to the initiator node.

HP Vertica has been highly optimized for computing COUNT DISTINCT, but in some cases the computation can still require a great deal of memory and data movement. Since COUNT DISTINCT can be expensive to compute and cannot be rolled up, some people have called this the “COUNT DISTINCT Pain”.

HP Vertica 7.0.0 introduces a new family of aggregate functions designed to alleviate this pain when exact results are not required:


APPROXIMATE_COUNT_DISTINCT(x) is the direct equivalent of COUNT(DISTINCT x), and by default, has an accuracy within 1% of the value returned by COUNT(DISTINCT x), 97% of the time. You can specify that you require more or less accuracy than the default with an optional second argument. Whereas COUNT(DISTINCT x) requires a relatively large amount of memory per aggregate, APPROXIMATE_COUNT_DISTINCT(x) requires only 1500 bytes of memory per aggregate to achieve either:

  • 5% accuracy 97% of the time (typically within 2%), or
  • 50K bytes of memory per aggregate for 1% accuracy 97% of the time.

Furthermore, there is no need to make the partial aggregates distinct before sending the data to the initiator node, as required by COUNT(DISTINCT x).

In a performance experiment on a 1Tb TPC-H dataset, using a single-node developer’s laptop, performing an ungrouped COUNT DISTINCT on (l_orderkey, l_partkey) required 211 seconds. Using APPROXIMATE_COUNT_DISTINCT, the same computation took just 4.02 seconds, a factor of 52 times faster. In other cases where the number of distinct values was small, COUNT DISTINCT and APPROXIMATE_COUNT_DISTINCT were equally fast.And, in some cases where HP Vertica’s COUNT DISTINCT optimizations kick in, COUNT DISTINCT can be faster. So, while your mileage may vary, you should note that there are cases where APPROXIMATE_COUNT_DISTINCT is clearly a lot faster.

But it gets better, because unlike COUNT(DISTINCT x), APPROXIMATE_COUNT_DISTINCT rolls up in the same way SUM(x) and COUNT(x) roll up. By materializing the internal “synopsis” used by APPROXIMATE_COUNT_DISTINCT, you can roll it up later to preserve the full accuracy of APPROXIMATE_COUNT_DISTINCT(). On the same 1Tb TPC-H dataset, precomputing APPROXIMATE_COUNT_DISTINCT_SYNOPSIS on (l_orderkey, l_partkey) and grouping by a low cardinality column, and materializing the result with CREATE TABLE AS SELECT took about 30 seconds. Rollingprecomputed aggregate up with APPROXIMATE_COUNT_DISTINT_OF_SYNOPSIS() took just 64.7 milliseconds, more than 3200x faster than running COUNT DISTINCT against the detail data.

To illustrate, let’s suppose that you’re a political insider using the Pulse innovation package from HP Vertica. HP Vertica Pulse enables you to analyze the sentiment expressed in a tweet. You want to be notified, in real-time, when the number of distinct persons who post a tweet with a negative sentiment about one of several political topics exceeds N in any 1-week period. Instead of constantly running COUNT DISTINCT on the most recent weeks’ worth of tweets, you could compute and save an APPROXIMATE_COUNT_DISTINCT synopsis once per hour, and then run a relatively fast query that combines the pre-materialized synopses with a real-time synopsis computed from the most recent partial hour. Remember that this would not work with a regular COUNT DISTINCT because, if any individuals posted multiple tweets in the same week, they would be counted multiple times. The remarkable thing is that double counting will not occur with aggregating APPROXIMATE_COUNT_DISTINCT synopses. To allow for the possibility of a false-negative signal, you could adjust the alert threshold downward, and if triggered, compute an exact COUNT DISTINCT. However, in this case, the accuracy of APPROXIMATE_COUNT_DISTINCT is much higher than the accuracy of the sentiment classifications, so the measure of interest is intrinsically subjective and approximate anyway.

To compute and save an approximate count distinct synopsis, use the APPROXIMATE_COUNT_DISTINCT_SYNOPSIS() grouping function. To roll up a set of pre-computed synopses, use the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS() grouping function. That’s all there is to it.

At the time of this writing, HP Vertica is the first and only SQL database to provide approximate count distinct with user-controllable accuracy and to support the rollup of approximate count distinct aggregates. Using the HP Vertica MPP column store, you can avert the OLAP cliff, and using pre-computed synopses, you can avoid the COUNT DISTINCT pain. For the first time since the dawn of data warehousing, you can compute and incrementally maintain pre-computed aggregates for count distinct with controllable accuracy, and roll these aggregates up in an OLAP framework.


(1) – “The Data Warehouse Toolkit”, Ralph Kimball, pg 190.

Enter the Flex Zone – Modernizing the Enterprise Data Warehouse

I’ve had the privilege of attending the Data Warehouse Institute’s (TDWI) conference this week. The Las Vegas show is usually one of their biggest gatherings. This year, there were about 600 of us gathered together to talk about the latest and greatest in the data warehouse and business intelligence world. HP Vertica was a sponsor.
The latest buzz was around many of the new data discovery tools that were announced by some vendors. Vendors recognize that there is a significant amount of undiscovered data in most businesses. As data warehouse teams go merrily along delivering daily analytics, piles and piles of dark data builds within that might have value. To innovate, users are recognizing that some of this unexplored data could be quite valuable, and it’s spurring on the development of a new breed of data discovery tools that enable users to develop new views of structured, semi-structured, and unstructured data.

Of course, this is the very reason that we have developed HP Vertica Flex Zone. The ability to ingest semi-structured data and use current visualization tools are one of the key tenets of HP Vertica Flex Zone. With HP Vertica Flex Zone, you can leverage your existing business intelligence (BI) and visualization tools to visually explore and draw conclusions from data patterns across a full spectrum of structured and semi-structured data. Analysts, data scientists, and business users can now explore and visualize information without burdening or waiting for your IT organizations to use lengthy and costly ETL tools and processes typical with legacy databases and data warehouses.
Most agreed that special data discovery tools should converge with standard analytical platforms in the coming months. Discovery should be as much a part of your business as daily analytics.

There were some first-rate executive sessions led by Fern Halper and Philip Russom, who talked about the transformation of analytics over the years. Analytics has become more mainstream, more understood by the masses of business users. Therefore innovation comes when we can deliver business intelligence for this new generation of information consumers.

The panel discussions and sessions focused very much on business value and put forth a call-to-action for some. Innovate. Feed the business users needs for information that will help drive revenue, improve efficiency, and achieve compliance with regulations. It was clear that data warehouse must be modernized of data warehouse (and that is happening today). Data warehouse pros aren’t satisfied with daily static analytics that they delivered in the past. They are looking for new data sources, including big data, and new-age data analytic platforms to help achieve their business goals.

Get started modernizing your enterprise data warehouse – evaluate HP Vertica 7 today.

Enter the Flex Zone – Under the Hood

With HP Vertica’s latest release (HP Vertica 7 “Crane”), we introduced HP Vertica Flex Zone, based on the patent-pending flex tables technology, which dynamically adapt to whatever schema is present in the data. Flex tables offer dramatic usability improvements over regular tables. In this post, we take a look under the hood and show how flex tables are similar to regular Vertica tables, with a little pinch of magic thrown in.

The Tour Starts with Storage
Let’s take a peek at a flexible table:

    => create flex table tweets();
    => \d tweets

Not Null
Primary Key
Foreign Key
long varbinary(130000)

(2 rows)

At the heart of the implementation is the simple idea that we will store an entire record in __raw__, a single wide varbinary column. Any record provided, no matter the schema, can be stored – assuming that the column is wide enough to hold the whole record. From the example above, you can see that the default max size is around 128K. We’ve found that this size is sufficient for most use cases, but you can stretch it up to 32 MB if necessary. For performance reasons, Vertica includes the __identity__ column as the column on which to sort and segment the data.

So what does the __raw__ column actually contain? Let’s take a peek:

    => copy tweets from ‘tweets.json’ parser fjsonparser();
     Rows Loaded
    (1 row)

    => select __raw__ from tweets limit 1;
    (1 row)

First, notice that this isn’t the JSON value that we supplied as input. When you load a flex table, the parser reads each input record, parses it into a collection of key/value pairs, and writes it into an internal map format. Assuming you query the data at least once, it is worth parsing upfront to avoid re-parsing with each query. Vertica also does a little in-record indexing to make finding field values efficient. Additionally, the storage is now agnostic to the input format, freeing you from having to specify the input format upfront or locking the table to a specific format. Finally, Vertica’s storage engine has more flexibility with compression algorithms for storing the data efficiently on disk.

It’s a Kind of Magic
So why can I run “select text from tweets” despite the “text” column not existing in the table? Looking at the query plan shows us what we would expect, namely we are reading the __raw__ column:

    => EXPLAIN SELECT text FROM tweets;

    Access Path:
    +-STORAGE ACCESS for tweets [Cost: 21K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
    | Projection: public.tweets_super
    | Materialize: tweets.__raw__
    | Execute on: All Nodes

If you dig into the “GraphViz” (see query plan, you will discover that the actual query that is running is:

     SELECT MapLookup(__raw__,’text’) FROM tweets;

Flex tables have a special parse rule that rewrites unknown column references into calls to the MapLookup function. The MapLookup function takes a key and returns the value, or null if the key does not exist in the map:

   => SELECT maplookup(__raw__,’id’) FROM tweets limit 1;
    (1 row)    

   => SELECT maplookup(__raw__,’nonexistentkey’) FROM tweets limit 1;

    (1 row)

This little piece of magic is what lets you write vanilla SQL queries against data with unknown or varying schema.

Maps to Treasure

Now that you know about the __raw__ column, we can play with a number of the map functions that come with flex tables. Here are a couple of examples:]

Getting JSON back:

    => SELECT maptostring(__raw__) FROM tweets limit 1;
    “” : “230242346078457856”,
    “delete.status.id_str” : “230242346078457856”,
    “delete.status.user_id” : “524683721”,
    “delete.status.user_id_str” : “524683721”

Shredding a record into rows of key value pairs:

    => SELECT mapitems(__raw__) OVER (partition auto) FROM (SELECT__raw__ FROM tweets limit     1) a;


(4 rows)

Finding keys of all records, similar to computing contents of the keys table:

    => SELECT distinct key FROM (select mapkeys(__raw__) OVER (partition auto) as
    key from tweets ) a ORDER BY key;

Checking how close you are to overflowing the record size:

    => SELECT max(length(__raw__)) FROM tweets;
    (1 row)

Isn’t that a row store?

By storing the whole record in a single column, Vertica must read the entire record even if the query needs only a subset of the keys – the definition of a row-store database. Vertica’s flex tables perform decently well due to their solid executor and distributed execution but Vertica’s columnar tables are orders of magnitude faster. As described in a previous post, accelerating flex table performance is done by promoting important columns to be real columns in the table. These hybrid tables have the best of both worlds: flexible schema due to a __raw__ column and columnar access to the most important data. You can see this for yourself by creating a hybrid flex table:

    bvandiver=> create flex table logs (time timestamp, sessionid varchar, component     varchar);
    bvandiver=> \d logs

Not Null
Primary Key
Foreign Key
long varbinary(130000)
t f
f f
f f
f f

(4 rows)


Hopefully this whirlwind tour through the back end of flextables has given you a better understanding of how it all works and the performance tradeoffs we made. As usual we encourage you to download our Community Edition and give it a try!

Social Media Insights with HP Vertica Pulse

Get insight into what your customers are saying about your products and services with HP Vertica Pulse, one of the new innovations available on the recently announced HP Vertica Marketplace.

HP Vertica Pulse allows you to analyze Tweets automatically, scalably, and in-database so that you can:

  • understand the top-of-mind topics for your community and the sentiment associated with those topics.
  • combine results with other business data already stored in Vertica
  • identify advocates and detractors
  • analyze how sentiment changes over time
  • view high-level aggregate results as well as low-level comments
  • customize results for your domain.


Below are some visual examples of how HP Vertica Pulse can help you to make meaning of your data.  In this case, we used our free Social Media Connector on Github to collect and automatically load Tweets about some NFL teams over one weekend in November.  After running Pulse on the data, we used Tableau to visualize.  The visual on the left gives an aggregate view of the most popular topics.  Although we collected Tweets on football teams, a number of other topics appear, like “Tom Brady”, “Cam Newton”, and “call”.  The size of the circle reflects the volume of the tweets, while the color indicates sentiment (green for positive and red for negative).  As it turns out, the Patriots, 49ers, Jets, and Chiefs all lost that weekend.


The next visual shows an excerpt of a drill-down into one of the compelling results, namely the negative sentiment on ‘call’. This extract shows that there was a controversial call in the New England-Carolina game.

Bubble list

By using Pulse, and without actually watching the games, we were able to get an understanding of the most talked-about topics.  We could drill-down into the most compelling results to get an understanding of why the sentiment of the topic is negative, and take action if needed.  In the case of products and services, you could similarly see the aspects that mattered most to your community.  Negative sentiment could highlight a problem that needs attention.

You can try HP Vertica Pulse on tweets that impact your business.  To download, visit the HP Vertica Community Marketplace, and look for HP Vertica Pulse on the Innovations tab.

The HP Vertica Pulse download contains the following resources:

  • Installation files and documentation
  • three instructional videos
  • A quick start guide,
  • A Tableau template

These resources will help you to evaluate HP Vertica Pulse and make meaning of your Tweets as quickly and easily as possible.  When you create your Marketplace account, please review the Terms and Conditions, as they contain important information about the Innovations efforts.

We want to hear from you!  Post your questions, comments, and feedback to the HP Vertica Community.  Would you use HP Vertica Pulse to make an impact on your business?  What changes would you like to see?  We look forward to your feedback!

Get Started With Vertica Today

Subscribe to Vertica