
The Real-Time Unicorn

The “De-mythification” Series

Part 1: The Real-Time Unicorn

This is part one of a series I call the “de-mythification” series, wherein I’ll aim to clear up some of the more widespread myths in the big data marketplace.

In the first of this multi-part series, I’ll address one of the most common myths my colleagues and I have to confront in the Big Data marketplace today: the notion of “real-time” data visibility. Whether it’s real-time analytics or real-time data, the same misconception always seems to come up. So I figured I’d address this, define what “real-time” really means, and provide readers some advice on how to approach this topic in a productive way.

First of all, let’s establish the theoretical definition of “real-time” data visibility. In the purest interpretation, it means that as some data is generated – say, a row of log data in an Apache web server – the data would immediately be queryable. What does that imply? Well, we’d have to parse the row into something readable by a query engine – so some program would have to ingest the row, parse the row, characterize it in terms of metadata, and understand enough about the data in that row to determine a decent machine-level plan for querying it. Now since all our systems are limited by that pesky “speed of light” thing, we can’t move data any faster than that – considerably slower in fact. So even if we only need to move the data through the internal wires of the same computer where the data is generated, it would take measurable time to get the row ready for query. And let’s not forget the time required for the CPU to actually perform the operations on the data. It may be nanoseconds, milliseconds, or longer, but in any event it’s a non-zero amount of time.

So “real-time” never, ever means real-time, despite marketing myths to the contrary.

There are two exceptions to this – slowing down time inside the machine, or technology that queries a stream of data as it flows by (typically called complex event processing, or CEP). With regard to the first option: let’s say we wanted to make data queryable as soon as the row is generated. We could make the flow from the logger to the query engine part of one synchronous process, so the weblog row wouldn’t actually be written until it was also processed and ready for query. Those of you who administer web and application infrastructures are probably getting gray hair just reading this, as you can imagine the performance impact on a web application. So, in the real world, this is a non-starter. The other option – CEP – is exotic and typically very expensive, and while it will tell you what’s happening at the current moment, it’s not designed to build analytics models. It’s largely used to put those models to work in a real-time application such as currency arbitrage.

So, given all this, what’s a good working definition of “real-time” in the world of big data analytics?

Most organizations define it this way: “As fast as it can be done providing a correct answer and not torpedoing the rest of the infrastructure or the technology budget”.

Once everyone gets comfortable with that definition, then we can discuss the real goal: reducing the time to useful visibility of the data to an optimal minimum. This might mean a few seconds, it might mean a few minutes, or it might mean hours or longer. In fact, for years now I’ve found that once we get the IT department comfortable with the practical definition of real-time, it invariably turns out that the CEO/CMO/CFO/etc. really meant exactly that when they said they needed real-time visibility to the data. So, in other words, when the CEO said “real-time”, she meant “within fifteen minutes” or something along those lines.

This then becomes a realistic goal we can work towards in terms of engineering product, field deployment, customer production work, etc. Ironically, chasing the real-time unicorn can actually impede efforts to develop high speed data flows by forcing the team to chase unrealistic targets for which, at the end of the day, there is no quantifiable business value.

So when organizations say they need “real-time” visibility to the data, I recommend not walking away from that conversation until fully understanding just what that phrase means, and using that as the guiding principle in technology selection and design.

I hope readers found this helpful! In the remaining segments of this series, I’ll address other areas of confusion in the Big Data marketplace. So stay tuned!

Next up: The Unstructured Leprechaun


Introducing HP Vertica “Dragline”

Today, we announced “Dragline,” the code name for the latest release of the HP Vertica Analytics Platform. Focused on the strategic value of all data to every organization, “Dragline” includes a range of industrial-strength features – befitting its code name – for serious Big Data initiatives.

Our data sheet provides detailed descriptions of this release’s full range of capabilities and benefits, so we’ll focus here on three top features that are sure to capture your attention (after all, they were highly requested by our growing customer base).

By the way, learn about these and all of the new “Dragline” features in our upcoming Webinar.


“Project Maverick” – Speed in All Directions


Speed is a given … but what about in every possible direction? “Project Maverick” has multiple elements, including fast, self-service analytics about individual customers or devices. Live Aggregate Projections, a key new feature of “Project Maverick,” accelerates these live lookups – frequent, highly concurrent queries about a single customer or device – by up to 10x. It does this by computing metrics on the data as it arrives, enabling targeted, personalized analytics without programming accelerator layers.

For telecommunications companies, as well as utilities and energy providers that, due to deregulation, are pursuing smart metering initiatives to differentiate themselves from competitors, this capability is hot. With Live Aggregate Projections, these providers can deliver smart metering reports that educate customers on their consumption rates vis-à-vis their neighbors, promoting a greener planet and establishing a tighter relationship with their provider. Interested in learning how HP Vertica easily chews through the enormous volume and scale of smart meter data and the Internet of Things? Check out our newly published white paper, which includes detailed benchmarks.
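
To make that concrete, here is a minimal sketch of a live aggregate projection for smart meter data. The table and column names (meter_readings, meter_id, kwh) are hypothetical, and the aggregates you precompute will depend on your reports:

CREATE TABLE meter_readings (
    meter_id  INT,
    read_time TIMESTAMP,
    kwh       FLOAT
);

CREATE PROJECTION meter_usage_agg AS
    SELECT meter_id,
           COUNT(*) AS reading_count,
           SUM(kwh) AS total_kwh
    FROM meter_readings
    GROUP BY meter_id;

Because the GROUP BY is part of the projection definition, the aggregates are maintained as data arrives, so a per-customer lookup reads precomputed values instead of scanning the raw readings.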


Dynamic Mixed Workload Management – Make ’em All Happy


Another major advancement of “Project Maverick” is Dynamic Mixed Workload Management. Commonly found in many data warehouse technologies and requested by some of our largest customers, this powerful new feature identifies and adapts to varying query complexities — simple and ad-hoc queries as well as long-running advanced queries — and dynamically assigns the appropriate amount of resources to meet the needs of all data consumers.

So, now, you can preserve your precious hardware and system resources, while pleasing even your most demanding internal and external constituents – from analysts who seek ad-hoc insights to data scientists who manage complex analytics to business executives who need customized views or dashboards on Monday mornings.
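
One way to express this kind of tiering yourself is with resource pools. The sketch below uses hypothetical pool and user names and arbitrary sizes: short ad-hoc queries run in a small pool with a runtime cap, and anything that runs longer cascades to a larger batch pool. The right values depend entirely on your hardware and workload:

CREATE RESOURCE POOL batch_pool MEMORYSIZE '8G' PLANNEDCONCURRENCY 4;

CREATE RESOURCE POOL adhoc_pool MEMORYSIZE '2G' PLANNEDCONCURRENCY 16
    RUNTIMECAP '2 minutes' CASCADE TO batch_pool;

ALTER USER analyst1 RESOURCE POOL adhoc_pool;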


Cost-Optimized Storage – Don’t Break the Bank


Hadoop is commonly used as a data lake to store bulk data. That’s why, with each release, we offer tighter integration that delivers the most open SQL on Hadoop. “Dragline” builds on that strong heritage by helping you manage multiple storage tiers for cost-effective analytics. You can perform advanced SQL queries on bulk data stored in HDFS using the HP Vertica Analytics Platform without moving the data or using any connectors, and you can move data into HP Vertica when your organization requires faster performance for in-depth analytics.
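
As a rough illustration, querying bulk data in place can be as simple as defining an external table over files in HDFS and running ordinary SQL against it. The table, path, and format below are hypothetical, and the exact URL scheme and options vary by release and Hadoop distribution:

CREATE EXTERNAL TABLE web_clicks (
    user_id  INT,
    click_ts TIMESTAMP,
    url      VARCHAR(2048)
)
AS COPY FROM 'hdfs://namenode/data/web_clicks/*' DELIMITER ',';

SELECT COUNT(*) FROM web_clicks WHERE click_ts >= '2014-01-01';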

HP Vertica supports all of the major Hadoop distributions, but the power of MapR and HP Vertica on the same cluster is something special. Join our upcoming Webinar with MapR to get the full details behind this solution and to build your business case for SQL on Hadoop.

Try HP Vertica 7 today and stay tuned for more blog posts and materials to support this release.

Meet The Team: Patrick Day


Welcome back to another edition of our “Meet the Team” feature! This week we sat down with our inside sales manager, Patrick Day, and talked about everything from what it’s like to manage the badass sales team we have here in Cambridge, MA, to ballroom dancing.

Inside the Secret World of the Workload Analyzer


When I’m on a flight sitting next to someone and we’re making polite conversation, the question often comes up: “What do you do?” In these situations, I have to assess whether the person works in the IT industry or is otherwise familiar with the lingo. If not, my stock response is “I fix databases.” This usually warrants a polite nod, and then we both go back to sleep. This over-simplified explanation generally suffices, but in truth it is wholly inadequate. The truth of the matter is that my job is to ensure that databases don’t get broken in the first place – more specifically, an HP Vertica database. But because our clients have different, complex goals in mind, they sometimes configure their systems incorrectly for the kind of work they’re doing. I’m constantly looking for ways to empower clients to spot problems before they become bigger problems.

That’s why I’m a huge fan of the HP Vertica Workload Analyzer (also referred to as the WLA). The WLA is a great tool in the war against problems. Its goal in life (if it has one) is to find and report on problems – 20 specific ones, to be exact. If you are using Management Console, the WLA runs once per day and produces a report. This report indicates actions to take, like tables that should have their statistics updated or queries that aren’t performing adequately. The report (actually just entries in a table) is displayed in the Management Console in a window called “Tuning Recommendations”.

But you don’t need Management Console to take advantage of the Workload Analyzer – you can run it manually by running the following command:

SELECT ANALYZE_WORKLOAD('');

Interestingly, even though it’s called “Workload Analyzer”, the actual command is ANALYZE_WORKLOAD. You could toss this into CRON and run it multiple times per day if you’d like, but it’s probably not necessary.
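
The empty string argument tells it to analyze everything. If you want to narrow the scope – say, while troubleshooting one table – you can pass an object name instead (the table name below is hypothetical):

SELECT ANALYZE_WORKLOAD('public.fact_sales');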

The output of this command comes back to your screen – it reports on all the various “problems” it is programmed to identify. It also creates entries in two separate tables: V_MONITOR.TUNING_RECOMMENDATIONS and TUNING_RECOMMENDATION_DETAILS.
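
If you’d rather work with the results directly – for example, to feed your own monitoring scripts – you can query those tables like any other system table:

SELECT * FROM V_MONITOR.TUNING_RECOMMENDATIONS;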

As I mentioned earlier, there are 20 queries that the WLA runs in order to look for problems. These queries are called rules. WLA rules can be disabled or enabled, and many of them have parameters that can be modified to increase or decrease the scope of the rule. The rules have clever names like sys_r1 and sys_r13. That’s OK, though – the WLA doesn’t allow you to run rules directly anyway. Below, I’ll break down the most useful rules and how they can be modified for maximum effect.

sys_r1, sys_r5, and sys_r6: Stale Statistics and Spill Events

sys_r1 looks for tables that have stale or missing statistics and recommends updating statistics on those tables. sys_r5 and sys_r6 find queries that have reported JOIN_SPILLED or GROUP_BY_SPILLED events – indications that no projection design exists that would let these queries operate at maximum efficiency. These three rules generate the most typical output I see at client sites.
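
Acting on a sys_r1 finding is usually a one-liner: refresh statistics on the flagged table. A minimal example (the table name is hypothetical):

SELECT ANALYZE_STATISTICS('public.fact_sales');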

sys_r16 and sys_r20: System Utilization

sys_r16 and sys_r20 are rules that check for system utilization. sys_r16 checks for a CPU utilization that exceeds a 95% threshold for more than 20 minutes. sys_r20 looks for memory utilization of 99% for more than 10 minutes.

Many of the other rules cover very specific edge-case scenarios and are unlikely to fire at most client sites; we hope to cover those in future blogs. For now, let’s stick to this handful and talk about how we can modify these rules and add our own!

Altering Rules

What you may not know about the WLA is that you can modify certain rules. That is, you can change some parameter settings, such as the threshold values a rule uses to find problems. Take rules sys_r16 and sys_r20, which look for high CPU and memory utilization, respectively. The default thresholds might be too high for your liking – but that’s OK, you can adjust them.

dbadmin=> ALTER TUNING RULE sys_r16 SET USAGE_THRESHOLD_PCT=80;

ALTER TUNING RULE

dbadmin=> ALTER TUNING RULE sys_r16 SET DURATION_MIN=10;

ALTER TUNING RULE

Now this rule will fire under less extreme circumstances: instead of looking for CPU usage of 95% for 20 minutes, it will report on CPU usage of 80% for 10 minutes. All the tuning rule parameters (where applicable) are defined in another system table called VS_TUNING_RULE_PARAMETERS. You can use that table to determine which parameters are available for which rules, and tweak them as you see fit.
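
For example, before changing anything you can check which parameters the two utilization rules expose:

SELECT tuning_rule, parameter, current_value
FROM VS_TUNING_RULE_PARAMETERS
WHERE tuning_rule IN ('sys_r16', 'sys_r20');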

Furthermore, you can disable or enable rules with the following commands:

dbadmin=> ALTER TUNING RULE sys_r16 disable;

ALTER TUNING RULE

dbadmin=> ALTER TUNING RULE sys_r16 enable;

ALTER TUNING RULE

Creating your own rules

The coolest part of the WLA is that you can actually create your own rules. The first step is to write your rule – just a regular SQL query whose purpose is to find a problem. The example I’m going to include here is not currently a rule, but it would make a great addition: finding too many delete vectors. This rule has a threshold defined as the percentage of delete vectors relative to the number of rows in a table. When that percentage exceeds 20%, the rule fires.

CREATE TUNING RULE dv_rule (DELETE_VECTORS, PURGE, dv_threshold=20) AS
SELECT CURRENT_TIME AS time
, 'Delete vectors account for ' || round(deleted_row_count/row_count, 2)*100::char(4) || '% of rows for projection ' || ps.projection_name AS observation_description
, ps.anchor_table_id AS table_id
, ps.anchor_table_name AS table_name
, s.schema_id AS table_schema_id
, ps.projection_schema AS table_schema
, NULL AS transaction_id
, NULL AS statement_id
, (SELECT current_value || '%' FROM VS_TUNING_RULE_PARAMETERS
   WHERE tuning_rule = 'dv_rule' AND parameter = 'dv_threshold') AS tuning_parameter
, 'run purge operations on tables with delete vectors' AS tuning_description
, 'SELECT PURGE_PROJECTION(''' || ps.projection_name || ''');' AS tuning_command
, 'LOW' AS tuning_cost
FROM delete_vectors dv
JOIN projection_storage ps ON dv.projection_name = ps.projection_name
JOIN schemata s ON s.schema_name = dv.schema_name
WHERE ps.row_count > 0
AND deleted_row_count / row_count >
    (SELECT current_value/100::numeric FROM VS_TUNING_RULE_PARAMETERS
     WHERE tuning_rule = 'dv_rule' AND parameter = 'dv_threshold');

The first line sets up my rule. I give it a name, “dv_rule”, and define some parameters. The first two are labels – they are arbitrary. Technically, they are referred to as the “observation type” and the “tuning type”; you can see examples by looking in the TUNING_RECOMMENDATION_DETAILS table. All remaining parameters are optional – here, I’ve defined my 20% threshold for delete vectors. The query’s SELECT list follows the order of the columns in the TUNING_RECOMMENDATION_DETAILS table; if you compare the two, you’ll note the similarities.

My query references the VS_TUNING_RULE_PARAMETERS table in order to get my % threshold. This creates a bit of a chicken-and-egg problem – the query won’t work until the parameter exists in that table. So the first thing you have to do is create the rule, which also creates the parameter; then you can modify the rule as needed. If you need to drop the rule, you can do so with the following command:

dbadmin=> DROP TUNING RULE dv_rule ;

DROP TUNING RULE

Because you might go through this process several times, I recommend putting your work into a file so you can reference it easily.

And that’s it! That’s everything you need to know in order to create and manage WLA rules. Not only can you use these rules to find problems in your HP Vertica cluster, you could use them to look for issues in your own data.

I’d like to continue the discussion in the community forums – come up with your own rules and post them. Who knows – we might include the good ones in a future HP Vertica release!

Meet the Team: Amy Miller



It’s another weekly edition of our Meet the Team feature! Our very own Amy Miller from Support shares the story of her career here at Vertica, what she loves about her job and her team, and how she wins national hockey tournaments.

HP Vertica presents three talks at ICDE, Chicago, March 31-April 4, 2014


For the past three years, HP Vertica has presented innovative topics at prestigious database conferences such as the International Conference on Data Engineering (ICDE), the Very Large Databases (VLDB) conference, and the Extremely Large Database (XLDB) conference. This year, our engineering team proudly announces three talks at the upcoming ICDE held in Chicago, IL, USA, March 31-April 4, 2014.

  • On 3/31/2014, at SMDB, Ben Vandiver introduces Flex Zone, one of the new features of our recent release, HP Vertica Analytics Platform 7. Flex Zone enables the
    smooth data load and exploration flexibility of NoSQL solutions while maintaining a unified SQL interface over structured and semi-structured data.
  • On 4/1/2014, Ramakrishna Varadarajan presents Vertica’s customizable physical design tool, called the Database Designer, which produces designs optimized for various scenarios and applications. For a given workload and space budget, Database Designer automatically recommends a physical design that optimizes query performance, storage footprint, fault tolerance, and database recovery to meet different customer requirements.
  • On 4/2/2014, Jaimin Dave introduces HP Vertica SQL Query Optimizer. The Query Optimizer was written from the ground up for the HP Vertica Analytic Database. Jaimin will discuss its design and the tradeoffs encountered during its implementation. He’ll also argue that the full power of today’s database systems can be realized only with a carefully designed custom Query Optimizer, written specifically for the system in which it operates.

Click here to find details and schedule. Please do attend the talks and stop by and say hello to our presenters and their co-authors. We’ll be happy to tell you more about our designs and the trade-offs we encountered.

Can Vertica Climb a Tree?


The answer is YES – if it is the right kind of tree. Here, “tree” refers to the common data structure consisting of parent-child hierarchical relationships, such as an org chart. Traditionally, this kind of hierarchical structure can be modeled and stored in tables, but it is usually not simple to navigate and use in a relational database (RDBMS). Some other RDBMSs (e.g., Oracle) have a built-in CONNECT BY clause that can be used to find the level of a given node and navigate the tree. However, if you take a close look at its syntax, you will realize that it is quite complicated and not at all easy to understand or use.

For a complex hierarchical tree with 10+ levels and a large number of nodes, any meaningful business question that requires joins to fact tables, plus aggregation and filtering on multiple levels, results in SQL statements that look extremely unwieldy and can perform poorly. The reason is that this kind of procedural logic may internally scan the same tree multiple times, wasting precious machine resources. It also flies in the face of basic SQL principles: simple, intuitive, and declarative. Another major issue is integration with third-party BI reporting tools, which often do not recognize vendor-specific variants such as CONNECT BY.

Other implementations include ANSI SQL’s recursive syntax using WITH and UNION ALL, special graph-based algorithms, and enumerated-path techniques. These solutions tend to follow an algorithmic approach and, as such, can be long on theory but short on practical application.
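
For reference, the ANSI recursive form looks something like the sketch below, written against a generic (parent_id, node_id) table like the hier_tab example used later in this post, with the root identified by the rule parent_id = node_id. It answers the question, but every new question re-walks the tree:

WITH RECURSIVE tree (node_id, tree_level) AS (
    SELECT node_id, 1
    FROM hier_tab
    WHERE parent_id = node_id        -- root node
    UNION ALL
    SELECT h.node_id, t.tree_level + 1
    FROM hier_tab h
    JOIN tree t ON h.parent_id = t.node_id
)
SELECT node_id, tree_level FROM tree;
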
Since SQL derives its tremendous power and popularity from its declarative nature – specifying clearly WHAT you want to get out of an RDBMS, not HOW to get it – a fair question to ask is: is there a simple and intuitive approach to modeling and navigating this kind of hierarchical (recursive) data structure in an RDBMS? Thankfully, the answer is yes.

In the following example, I will discuss a design that focuses on “flattening” out this kind of hierarchical parent-child relationship in a special way. The output is a wide, sparsely populated table with extra columns that hold the node IDs at various levels of the tree; the number of these extra columns depends on the depth of the tree. For simplicity, I will use one table with one hierarchy as an example. The same design principles can be applied to tables with multiple hierarchies embedded in them. The following is a detailed outline of how this can be done in a program/script:

  1. Capture the (parent, child) pairs in a table (table_source).
  2. Identify the root node by following specific business rules and store this info in a new temp_table_1.
    Example: parent_id=id.
  3. Next find the 1st level of nodes and store them in a temp_table_2. Join condition:
    temp_table_1.id=table_source.parent_id.
  4. Continue to go down the tree and at the end of each step (N), store data in temp_table_N.
    Join condition: temp_table_M.parent_id=temp_table_N.id, where M=N+1.
  5. Stop at a MAX level (Mlevel) when there is no child for any node at this level (leaf nodes).
  6. Create a flattened table, table_flat, by adding in total (Mlevel+1) columns named LEVEL,
    LEVEL_1_ID, …, LEVEL_Mlevel_ID.
  7. A SQL insert statement can be generated to join all these temp tables together to load
    into the final flat table, table_flat (see the sketch after this list).

  8. When there are multiple hierarchies in one table, the above procedures can be repeated for each
    hierarchy to arrive at a flattened table in the end.
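
As an illustration of step 7, for a tree that turns out to be four levels deep the generated statement might look roughly like this (a sketch only – the real script would emit one branch per level, using the temp-table and column names from the outline above):

INSERT INTO table_flat (node_id, level, level_1_id, level_2_id, level_3_id, level_4_id)
SELECT t1.id, 1, t1.id, NULL, NULL, NULL
FROM temp_table_1 t1
UNION ALL
SELECT t2.id, 2, t1.id, t2.id, NULL, NULL
FROM temp_table_2 t2
JOIN temp_table_1 t1 ON t2.parent_id = t1.id
UNION ALL
SELECT t3.id, 3, t1.id, t2.id, t3.id, NULL
FROM temp_table_3 t3
JOIN temp_table_2 t2 ON t3.parent_id = t2.id
JOIN temp_table_1 t1 ON t2.parent_id = t1.id
UNION ALL
SELECT t4.id, 4, t1.id, t2.id, t3.id, t4.id
FROM temp_table_4 t4
JOIN temp_table_3 t3 ON t4.parent_id = t3.id
JOIN temp_table_2 t2 ON t3.parent_id = t2.id
JOIN temp_table_1 t1 ON t2.parent_id = t1.id;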


This design is general and is not specific to any particular RDBMS architecture – row, column, or hybrid. However, the physical implementation of this design naturally favors columnar databases such as Vertica. Why? The flattened table is usually wide, with many extra columns that tend to be sparsely populated, and such columns can be stored very efficiently in compressed format in Vertica. Another advantage is that when only a small set of these columns is included in the select clause of a SQL statement, Vertica’s columnar nature means the other columns (no matter how many there are) introduce no performance overhead. This is as close to a “free lunch” as you can get in an RDBMS.

Let’s consider the following simple hierarchical tree structure:

[Figure: Vertica tree diagram]

There are four levels and the root node has an ID of 1. Each node is assumed to have one and only one parent (except for the root node) and each parent node may have zero to many child nodes. The above structure can be loaded into a table (hier_tab) having two columns: Parent_ID and Node_ID, which represent all the (parent, child) pairs in the above hierarchical tree:

[Table: the (parent, child) pairs loaded into hier_tab]

It is possible to develop a script to “flatten” out this table by starting from the root node, going down the tree recursively one level at a time and stopping when there is no data left (i.e. reaching the max level or depth of the tree). The final output is a new table (hier_tab_flat):

[Table: the flattened table hier_tab_flat]

What’s so special about this “flattened” table? First, it has the same key (Node_ID) as the original table. Second, it has several extra columns named LEVEL_N_ID, and the number of these columns is equal to the max number of levels (4 in this case), plus one extra LEVEL column. Third, for each node, there is a row that includes the IDs of all of its parents up to the root (LEVEL=1) as well as itself – a path starting from the node and going all the way up to the root level. The power of this new “flattened” table is that it encodes all of the hierarchical tree information in the original table. Questions such as finding the level of a node, or finding all the nodes below a given node, can be translated into relatively simple SQL statements by applying predicates to the proper columns.

Example 1: Find all the nodes that are at LEVEL=3.

Select Node_ID From hier_tab_flat Where LEVEL=3;

Example 2: Find all the nodes that are below node=88063633.

This requires two logical steps (which can be handled in a front-end application to generate the proper SQL).

Step 2.1. Find the LEVEL of node= 88063633 (which is 3).

Select LEVEL From hier_tab_flat Where Node_ID=88063633;

Step 2.2. Apply predicates to the column LEVEL_3_ID:

Select Node_ID From hier_tab_flat Where LEVEL_3_ID=88063633;

Complex business conditions such as finding all the nodes belonging to node=214231509 but excluding the nodes that are headed by node=88063633 can now be translated into the following SQL:

Select Node_ID
From hier_tab_flat
Where LEVEL_2_ID=214231509
And LEVEL_3_ID <> 88063633 ;

By invoking the script that flattens one hierarchy repeatedly, you can also flatten a table with multiple hierarchies using the same design. With this flattened table in your Vertica tool box, you can climb up and down any hierarchical tree using nothing but SQL.

Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group, with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata, and Oracle.
