Vertica

Archive for the ‘use cases’ Category

Can Vertica Climb a Tree?


The answer is YES, if it is the right kind of tree. Here “tree” refers to a common data structure consisting of parent-child hierarchical relationships, such as an org chart. Traditionally this kind of hierarchical data can be modeled and stored in tables, but it is usually not simple to navigate and use in a relational database (RDBMS). Some RDBMSs (e.g. Oracle) have a built-in CONNECT BY clause that can be used to find the level of a given node and navigate the tree. However, if you take a close look at its syntax, you will realize that it is quite complicated and not at all easy to understand or use.

For a complex hierarchical tree with 10+ levels and a large number of nodes, any meaningful business question that requires joins to fact tables, plus aggregation and filtering on multiple levels, will result in SQL statements that look extremely unwieldy and can perform poorly. The reason is that this kind of procedural logic may internally scan the same tree multiple times, wasting precious machine resources. This approach also flies in the face of the basic SQL principles of being simple, intuitive, and declarative. Another major issue is integration with third-party BI reporting tools, which often do not recognize vendor-specific variants such as CONNECT BY.

Other implementations include ANSI SQL’s recursive syntax using WITH and UNION ALL, special graph-based algorithms, and the enumerated-path technique. These solutions tend to follow an algorithmic approach and, as such, they can be long on theory but short on practical applications.
Since SQL derives its tremendous power and popularity from its declarative nature (specifying clearly WHAT you want to get out of an RDBMS, not HOW to get it), a fair question to ask is: is there a simple and intuitive approach to modeling and navigating this kind of hierarchical (recursive) data structure in an RDBMS? Thankfully the answer is yes.
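For comparison, here is a minimal sketch of the ANSI-standard recursive approach mentioned above. It is run against an in-memory SQLite database purely for illustration (Vertica and most modern RDBMSs accept the same WITH … UNION ALL syntax); the hier_tab table and its node IDs are invented for this sketch:

```python
import sqlite3

# Toy (parent, child) pairs; the root is stored as a self-pair (parent_id = node_id).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hier_tab (parent_id INTEGER, node_id INTEGER)")
con.executemany("INSERT INTO hier_tab VALUES (?, ?)",
                [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (3, 6)])

# Compute the level of every node by walking down from the root.
sql = """
WITH RECURSIVE walk(node_id, level) AS (
    SELECT node_id, 1 FROM hier_tab WHERE parent_id = node_id      -- the root
    UNION ALL
    SELECT h.node_id, w.level + 1
    FROM hier_tab h JOIN walk w ON h.parent_id = w.node_id
    WHERE h.parent_id <> h.node_id                                 -- skip the root's self-pair
)
SELECT node_id, level FROM walk ORDER BY node_id
"""
for node_id, level in con.execute(sql):
    print(node_id, level)
```

Note how even this small query needs a seed branch, a recursive branch, and a guard against revisiting the root; that verbosity is part of what the flattening design below avoids.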

In the following example, I will discuss a design that focuses on “flattening” out this kind of hierarchical parent-child relationship in a special way. The output is a wide, sparsely populated table with extra columns that hold the node IDs at the various levels of the tree; the number of these extra columns depends on the depth of the tree. For simplicity, I will use one table with one hierarchy as an example. The same design principles can be applied to tables with multiple hierarchies embedded in them. The following is a detailed outline of how this can be done in a program/script:

  1. Capture the (parent, child) pairs in a table (table_source).
  2. Identify the root node by following specific business rules and store this info in a new temp_table_1.
    Example: parent_id=id.
  3. Next find the 1st level of nodes and store them in temp_table_2. Join condition:
    temp_table_1.id=table_source.parent_id.
  4. Continue down the tree and, at the end of each step (N), store the data in temp_table_N.
    Join condition: temp_table_M.parent_id=temp_table_N.id, where M=N+1.
  5. Stop at the MAX level (Mlevel) when there is no child for any node at this level (leaf nodes).
  6. Create a flattened table (table_flat) by adding in total (Mlevel+1) columns, named LEVEL,
    LEVEL_1_ID, …, LEVEL_Mlevel_ID.
  7. Generate a SQL insert statement that joins all these temp tables together to load
    into the final flat table, table_flat.
  8. When there are multiple hierarchies in one table, the above procedure can be repeated for each
    hierarchy to arrive at a flattened table in the end.
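The steps above can be sketched in a short script. The version below uses Python with an in-memory SQLite database purely for illustration (the table names follow the outline; the toy node IDs are invented). It walks down the tree one level at a time, keeping each node’s path to the root, and then emits the flattened table:

```python
import sqlite3

# Toy (parent, child) pairs; the root is the row where parent_id = node_id,
# matching the business rule in step 2.
pairs = [
    (1, 1),          # root: parent_id = node_id
    (1, 2), (1, 3),
    (2, 4), (2, 5),
    (3, 6),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_source (parent_id INTEGER, node_id INTEGER)")
con.executemany("INSERT INTO table_source VALUES (?, ?)", pairs)

# Steps 2-5: walk down one level at a time. levels[n] maps node_id -> the
# path from the root to that node, inclusive.
levels = [{1: [1]}]                      # level 1: the root
while True:
    nxt = {}
    for parent, path in levels[-1].items():
        rows = con.execute(
            "SELECT node_id FROM table_source "
            "WHERE parent_id = ? AND node_id <> parent_id", (parent,))
        for (child,) in rows:
            nxt[child] = path + [child]
    if not nxt:                          # no children left: leaf level reached
        break
    levels.append(nxt)

max_level = len(levels)                  # "Mlevel" in the outline

# Steps 6-7: build the flattened table with LEVEL plus one ID column per level.
id_cols = ", ".join(f"LEVEL_{i}_ID INTEGER" for i in range(1, max_level + 1))
con.execute(f"CREATE TABLE table_flat (node_id INTEGER, LEVEL INTEGER, {id_cols})")
placeholders = ", ".join("?" * (max_level + 2))
for depth, nodes in enumerate(levels, start=1):
    for node, path in nodes.items():
        padded = path + [None] * (max_level - len(path))   # sparse trailing columns
        con.execute(f"INSERT INTO table_flat VALUES ({placeholders})",
                    [node, depth] + padded)

for row in con.execute("SELECT * FROM table_flat ORDER BY node_id"):
    print(row)
```

In a production version the per-level dictionaries would be the temp_table_N tables of the outline and the final load would be a single generated INSERT … SELECT, but the control flow is the same.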


This design is general and not specific to any particular RDBMS architecture (row, column, or hybrid). However, its physical implementation naturally favors columnar databases such as Vertica. Why? The flattened table is usually wide, and its many extra columns tend to be sparsely populated, so they can be stored very efficiently in compressed format in Vertica. Another advantage is that when only a small set of these columns is included in the select clause of a query, because of Vertica’s columnar nature, the other columns (no matter how many there are) introduce no performance overhead. This is as close to a “free lunch” as you can get in an RDBMS.

Let’s consider the following simple hierarchical tree structure:

[Figure: Vertica tree diagram]

There are four levels and the root node has an ID of 1. Each node is assumed to have one and only one parent (except for the root node) and each parent node may have zero to many child nodes. The above structure can be loaded into a table (hier_tab) having two columns: Parent_ID and Node_ID, which represent all the (parent, child) pairs in the above hierarchical tree:

[Chart 1: the (parent, child) pairs loaded into hier_tab]

It is possible to develop a script to “flatten” out this table by starting from the root node, going down the tree recursively one level at a time and stopping when there is no data left (i.e. reaching the max level or depth of the tree). The final output is a new table (hier_tab_flat):

[Chart 2: the flattened table hier_tab_flat]

What’s so special about this “flattened” table? First, it has the same key (Node_ID) as the original table. Second, it has several extra columns named LEVEL_N_ID; the number of these columns equals the maximum number of levels (4 in this case), plus one extra LEVEL column. Third, for each node there is a row containing the IDs of all of its parents up to the root (LEVEL=1), plus the node itself. This represents a path starting from the node and going all the way up to the root level. The power of this new “flattened” table is that it encodes all the hierarchical tree info from the original table. Questions such as finding the level of a node, or finding all the nodes below a given node, can be translated into relatively simple SQL statements by applying predicates to the proper columns.

Example 1: Find all the nodes that are at LEVEL=3.

Select Node_ID From hier_tab_flat Where LEVEL=3;

Example 2: Find all the nodes that are below node=88063633.

This requires two logical steps (which can be handled in a front-end application to generate the proper SQL).

Step 2.1. Find the LEVEL of node=88063633 (which is 3).

Select LEVEL From hier_tab_flat Where Node_ID=88063633;

Step 2.2. Apply a predicate to the column LEVEL_3_ID:

Select Node_ID From hier_tab_flat Where LEVEL_3_ID=88063633;

Complex business conditions such as finding all the nodes belonging to node=214231509 but excluding the nodes that are headed by node=88063633 can now be translated into the following SQL:

Select Node_ID
From hier_tab_flat
Where LEVEL_2_ID=214231509
And LEVEL_3_ID <> 88063633;
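To make the predicate logic concrete, here is a small runnable sketch against an in-memory SQLite copy of a hier_tab_flat-style table. The node IDs 214231509 and 88063633 come from the examples above; the remaining rows are invented for illustration. One subtlety worth noting: a node’s own row also has its LEVEL_N_ID column set, and a plain `<>` predicate silently drops rows whose LEVEL_3_ID is NULL, so the sketch handles both points explicitly:

```python
import sqlite3

rows = [
    # (node_id, LEVEL, LEVEL_1_ID, LEVEL_2_ID, LEVEL_3_ID, LEVEL_4_ID)
    (1,         1, 1, None,      None,     None),
    (214231509, 2, 1, 214231509, None,     None),
    (88063633,  3, 1, 214231509, 88063633, None),
    (777,       3, 1, 214231509, 777,      None),
    (888,       4, 1, 214231509, 88063633, 888),
    (999,       4, 1, 214231509, 777,      999),
]
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE hier_tab_flat (
    node_id INTEGER, LEVEL INTEGER,
    LEVEL_1_ID INTEGER, LEVEL_2_ID INTEGER,
    LEVEL_3_ID INTEGER, LEVEL_4_ID INTEGER)""")
con.executemany("INSERT INTO hier_tab_flat VALUES (?, ?, ?, ?, ?, ?)", rows)

# Example 2: everything strictly below node 88063633 (exclude its own row
# by filtering on LEVEL).
below = con.execute("""SELECT node_id FROM hier_tab_flat
                       WHERE LEVEL_3_ID = 88063633 AND LEVEL > 3""").fetchall()
print(below)            # node 888 only

# The complex condition: under 214231509 but not under 88063633. The IS NULL
# branch keeps rows that never reach level 3.
keep = con.execute("""SELECT node_id FROM hier_tab_flat
                      WHERE LEVEL_2_ID = 214231509
                        AND (LEVEL_3_ID IS NULL OR LEVEL_3_ID <> 88063633)""").fetchall()
print(sorted(n for (n,) in keep))
```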

By invoking the script that flattens one hierarchy repeatedly, you can also flatten a table with multiple hierarchies using the same design. With this flattened table in your Vertica toolbox, you can climb up and down any hierarchical tree using nothing but SQL.

Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata, and Oracle.

Our Users Validate the Value of Vertica

We recently allowed TechValidate, a trusted authority for creating customer evidence content, to survey the HP Vertica customer base. The firm reached out to nearly 200 customers across a variety of industries and came back with some extremely powerful results.

From the financial benefits to the performance advantages, the benefits of the HP Vertica Analytics platform were repeatedly and clearly detailed by our customers.

A sampling of some of the comments and results can be found below, but to see the full results set click here.

HP Vertica Software rocks: “HP Vertica Software – the best in the market”

Query performance increased by 100-500% or more

HP Vertica customers have achieved a wide range of benefits: the majority of Vertica users saved $100-500K or more


How MZI HealthCare identifies big data patient productivity gems using HP Vertica

As part of our continuing podcast series, Dana Gardner, president and principal analyst for Interarbor Solutions, recently conducted an interview with Greg Gootee, product manager at MZI HealthCare. MZI HealthCare develops and provides sophisticated software solutions that are flexible, reliable, and cost effective, and that help reduce the complexities of the healthcare industry.

In a post on ZDNet, Dana shares some of the highlights from his podcast with Greg Gootee:

Doctors make informed decisions from their experience and the data that they have. So it’s critical that they can actually see all the information that’s available to them.

The other critical thing was speed, being able to deliver high-end analytics at the point of care, instead of two or three months later, and Vertica really produced. In fact, we did a proof of concept with them. It was almost unbelievable some of the queries that ran and the speed at which that data came back to us.

The ability to expand and scale the Vertica system along with the scalability that we get with the Amazon allows us to deliver that information. No matter what type of queries we’re getting, we can expand that automatically. We can grow that need, and it really makes a large difference in how we could be competitive in the marketplace.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

Data-Driven Decision Making with the Vertica Analytics Platform

Physicians need access to a wealth of critical information from multiple systems in order to make life-saving decisions on a daily basis. Greg Gootee, Product Manager, MZI Healthcare, discusses how their new application, powered by the Vertica analytics platform, helps deliver better patient care through data-driven decision making. Delivering information in a timely manner is central to their application’s success. Check out this video, shot at HP Discover 2013, to see how HP Vertica helps Greg and his team provide physicians with the information they need to make more accurate point-of-care decisions. In the video, Mr. Gootee recounts how his aunt might have avoided a tragic incident with better point-of-care services, the type of services MZI Healthcare and the Vertica analytics platform provide.

Big Data Value in Japan, No Translation Necessary

Last week, I had the opportunity to present at the Gartner BI Summit in Tokyo. With knowledge of merely a handful of Japanese terms, I arrived in this beautiful country with mild heartburn that my presentation would be somehow misinterpreted and fall flat. The sessions were teeming with representatives from organizations across Japan eager to understand if Big Data was valuable or simply another passing technology fad.

Recently celebrating its 50th anniversary in the country, HP Japan was well represented at the event. My HP Vertica counterpart on the ground reinforced the need to emphasize business value, noting the growing demand for Big Data solutions from nearly every industry, particularly automotive, telecommunications, and railways/transportation. However, before technology decisions are made, Japanese businesses want concrete evidence that they can either save money, make money, or differentiate themselves from their competitors — not unlike businesses here in the States.

The title of my presentation was The New Economics of Enterprise Data Warehousing, based on a recently published research report from GigaOM. The general message is that traditional enterprise data warehouses cannot, and were never built to, handle the variety, volume, and velocity of Big Data — mainly because Big Data in its truest sense didn’t exist back in the ’80s and ’90s when those systems were architected. Therefore, a new breed of big data analytics platforms (led by the HP Vertica Analytics Platform) has emerged in the past few years that can handle these demands with extreme performance at massive scale, while enabling organizations to achieve true value at an overall lower TCO.

Heads nodded, followed by hushed side conversations in Japanese as attendees heard story after story on how leading organizations — Cardlytics, Guess, KDDI, HP.com, and even the Democratic National Committee — are deriving measurable business value and accomplishing the previously unimaginable with the HP Vertica Analytics Platform (including re-electing an American president).

I didn’t need the two translators (or my colleague) on hand to explain to me that the conference attendees were overall convinced that there is indeed value in all of the Big Data generated around them in Tokyo and other regions of Japan. I left the conference satisfied and amazed by these incredibly polite, organized, and astute people, with an understanding that business value is universally understood, despite the language.

The Disruptive Power of Big Data

Aside from the sheer quantity of digital data created every day—about 2.5 exabytes[1]—there’s more to Big Data than volume. Big Data offers enterprise leaders the opportunity to dramatically change the way their organizations operate to gain competitive advantage and find new revenue opportunities. But realizing the value Big Data promises requires a new approach. Traditional data warehouses and business intelligence tools weren’t built for the scale of Big Data, and can’t provide insight quickly enough to be useful, or even keep up.

But this isn’t just a case of data growth outstripping technology growth. Big Data embodies fundamental differences that necessitate new approaches and new technologies. Big Data takes many forms, three in particular we’ll discuss here:

  • Transactional data
  • Sentiment and perceptual data based on conversations taking place in social media
  • Data from networked sensors—the so-called “Internet of Things”

Transactional Data

As businesses have expanded—and expanded onto the Internet—the volume of business transactions has grown. The Economist reported in 2010 that Wal-Mart processes more than 1 million customer transactions every hour and maintains databases exceeding 2.5 petabytes (million gigabytes)[2]. Imagine how those numbers have grown since then.

What’s even more critical is that companies can now capture not just sales transactions, but the detailed histories and clickstreams that lead to the sale. From web-based clickstream analysis to call data records, pre- and post-transaction histories are more robust than ever—and our ability to collect, analyze and act on that data must adjust accordingly.

The social media explosion

Today’s online customer has progressed well beyond accessing information. Today’s consumers are not only interacting and collaborating with each other, but they’re also talking about and interacting with your brand. Facebook has more than 1 billion active subscribers[3], and it’s estimated they share almost 700,000 individual pieces of content every minute. On Twitter, more than a billion tweets go out every two to three days[4]. (You can watch them mapped geographically in real time at tweetping.net.)

Product reviews, user communities, forums, and blogs allow consumers to generate content that contains critical insight for the business. The proliferation of user-generated content in these social channels has led to new techniques and tools for “sentiment analysis”—the ability to measure emotion to determine how your company and brand are perceived.

The Internet of Things

The amount of information generated by devices rather than people is also growing explosively.
Mobile devices—and the apps people use on them—regularly broadcast individuals’ location, performance and other factors to the network. Retailers and distributors are using radio frequency identification (RFID), bar and QR codes to track inventory and enhance their supply chain and inventory performance. The healthcare industry seeks to improve care and reduce costs through remote patient monitoring. The automotive industry is embedding sensors in vehicles. And utilities are beginning to rely on smart meters to track usage. McKinsey Global Institute reports that more than 30 million networked sensors are in use in the transportation, automotive, industrial, utilities and retail sectors—and the number is growing by 30 percent every year.[5]

We recently presented a webinar on the Internet of Things and the Power of Sensor Data, which delves into this exciting area in much more detail.

Disrupting conventional analytics – developing a ‘conversational relationship with data’

Using Big Data to make operations more efficient, improve competitiveness and increase revenue is not about generating traditional statistics or producing standard reports.

Just as important as systems to collect and store data are systems to analyze and extract insight from that data. Without insight, you can’t gain new knowledge into your markets, your products and your operations.

When you have this insight at your disposal, you can act faster and with greater probability of success.

Extracting business value from Big Data requires a new approach. We believe that Big Data analytics is an iterative process. We describe it as developing a conversational relationship with your data. Analytics becomes a continuous improvement loop, which uses the results of analyses to frame better, more meaningful analyses, which, in turn, produce more definitive results. When results are available in minutes, analysts can ask, “What if?”

When properly applied, Big Data analytics enables business leaders to:

  • Understand market reaction and brand perception
  • Identify key buying factors
  • Segment populations to customize actions
  • Enable experimentation
  • Accurately predict outcomes
  • Reinvent and enhance inventory and supply chain systems and processes
  • Disrupt their industries, gain an edge over competitors and enable new business models

Big Data already proved its game-changing power during the 2012 U.S. presidential election. Obama campaign manager Jim Messina said: “We were going to demand data on everything, we were going to measure everything…We were going to put an analytics team inside of us to study us the entire time to make sure we were being smart about things.”
And, in fact, Big Data analytics helped the Obama campaign ratchet up the three key levers in any election: voter registration, persuasion and turnout. Rolling Stone magazine singled out Messina and the campaign’s CTO, Harper Reed, as two among a handful of unsung heroes in Obama’s victory.

You can hear more about how HP Vertica contributed to the high-tech strategy behind Obama’s reelection in a recent webinar featuring Chris Wegrzyn, director of data architecture for the Democratic National Committee.

The traditional data warehouse won’t get it done

The concept of the data warehouse evolved in the 1980s. Then, data warehouses were simply databases into which data from multiple sources was consolidated for the purpose of query and reporting. But today, these systems fall short when confronted with the volume, velocity and variety of Big Data. Why? They fail to enable the conversational approach to data required by Big Data analytics.

Traditional databases and data warehouses don’t easily scale to the hundreds of terabytes or even petabytes needed for many Big Data applications. Data is often not compressed, so huge amounts of storage and I/O bandwidth are needed to load, store and retrieve data. Data is still stored in tables by row, so access to a single data element through many rows—a common operation in business analytics—requires retrieving practically all of the data in a dataset to extract the specific element(s) needed. That strains I/O bandwidth and extends processing time. We have seen cases where the velocity of incoming data exceeds the capacity of the system to load it into the database, and queries produce answers in hours rather than the seconds or minutes needed for iterative business analytics. As a result, systems cost too much to maintain, and they fail to deliver the insight business leaders seek.

Take sentiment analysis, for example. The goal is to extract meaningful information from unstructured data so results can be stored in databases and analyzed. But the formats of resulting data are less predictable, more varied and subject to change during iterative analytics. This requires frequent changes to relational database structure and to processes that load data into them. For IT, it means the iterative approach to extracting business insight from Big Data requires new approaches, new tools and new skills.

Challenges for business leaders

Big Data is not just a technical challenge. Gaining and applying business insight compels business leaders to adopt new and disruptive ways of thinking and working.
The successful leaders we have known in data-driven organizations make a point of becoming familiar with the sources of data available to them. Rather than asking IT what information is available in the database, they view information as a key competitive asset and explore how insights might be extracted from it to offer immediate and sustainable competitive advantage.

A solution for Big Data analytics

HP Vertica Analytics Platform is a new kind of database designed from the ground up for business analytics at the scale of Big Data. Compared to traditional databases and data warehouses, it drives down the cost of capturing, storing and analyzing data. And it produces answers 50 to 1,000 times faster to enable the iterative, conversational analytics approach needed.

  • HP Vertica Analytics Platform compresses data to reduce storage costs and speed access by up to 90 percent.
  • It stores data by columns rather than rows and caches data in memory to make analytic queries 50 to 1,000 times faster.
  • It uses massively parallel processing (MPP) to spread huge data volumes over any hardware, including low-cost commodity servers.
  • It uses data replication, failover and recovery to achieve automatic high availability.
  • It includes a pre-packaged, in-database analytics library to handle complex analytics and development framework.
  • It supports the R statistical programming language so analysts can create user-defined analytics inside the database.
  • It dynamically integrates with Hadoop to analyze large sets of structured, semi-structured and unstructured data.

The HP Vertica Analytics Platform means better, faster business insight at lower cost.


Test drive the HP Vertica Analytics Platform at www.vertica.com/evaluate.


[1] “Big Data: The Management Revolution,” Andrew McAfee and Erik Brynjolfsson, Harvard Business Review, October 2012.

[2]“Data, data everywhere,” The Economist, Feb 25, 2010.

[3]Facebook key facts.

[4] http://www.mediabistro.com/alltwitter/tweetping_b35247

[5] “Big data: The next frontier for innovation, competition, and productivity,” The McKinsey Global Institute, June 2011.

Sensor Data and the Internet of Things: When Big Data Gets Really Big

I remember back in the 1990s when Sun Microsystems claimed that “Java anywhere” would make even refrigerators intelligent enough to know when you were out of milk, triggering a series of events that ultimately resulted in a grocery delivery chain bringing milk to your doorstep the very next day.

Fast forward to today. There are millions (and soon billions) of devices that are connected to the Internet — cars, medical equipment, buildings, meters, power grids, and, yes, even refrigerators. These connected devices comprise the Internet of Things (also known as Machine to Machine or M2M).

But why is this important to your world of Big Data analytics?

The Internet of Things is generating an unfathomable amount of sensor data — data that product manufacturers, in particular, need to manage and analyze to build better products, predict failures to reduce costs, and understand customer behavior to differentiate themselves and improve loyalty.

In fact, a recent IDC report, The Digital Universe in 2020, forecasts that machine-generated data will increase to 42 percent of all data by 2020, up from 11 percent in 2005.

The use cases are proven and here. Some are even mainstream. Think Progressive Insurance’s Snapshot pay-as-you-drive insurance commercials that have taken over our airwaves. Others are around us, and you may not even know it. Over your next work day, think about how many devices are connected and distributing information just waiting for analysis — your car, train, flight, or bus; traffic lights, road side signs, the elevator and escalator, an ATM, your check-out system.

But, more importantly, join us for our upcoming Webcast: Unlocking the Massive Potential of Sensor Data and the Internet of Things on Thursday, February 14th at noon EST (9:00AM PST).

We look forward to continuing the conversation and sharing these and other emerging use cases, real-world case studies, and a technology perspective to help you prepare for this massive opportunity ushered in by sensor data and the Internet of Things!

Get Started With Vertica Today

Subscribe to Vertica