Vertica


What Is a Range Join and Why Is It So Fast?

[Photo: view from the Asilomar Conference Grounds]

Last week, I was at the 2015 Conference on Innovative Data Systems Research (CIDR), held at the beautiful Asilomar Conference Grounds. The picture above shows one of the many gorgeous views you won’t see when you watch other people do PowerPoint presentations. One HP Vertica user at the conference said he saw a “range join” in a query plan, and wondered what it is and why it is so fast.

First, you need to understand what kind of queries turn into range joins. Generally, these are queries with inequality (greater than, less than, or between) predicates. For example, a map of the IPv4 address space might give details about addresses between a start and end IP for each subnet. Or, a slowly changing dimension table might, for each key, record attributes with their effective time ranges.

A rudimentary approach to handling such joins would be as follows: For each fact table row, check each dimension row to see if the range condition is true (effectively taking the Cartesian product and filtering the results). A more sophisticated, and often more efficient, approach would be to use some flavor of interval trees. However, HP Vertica uses a simpler approach based on sorting.

Basically, if the ranges don’t overlap very much (or at all), sorting the table by range allows sections of the table to be skipped (using a binary search or similar). For large tables, this can reduce the join time by orders of magnitude compared to “brute force”.

Let’s take the example of a table fact, with a column fv, which we want to join to a table dim using a BETWEEN predicate against attributes dv_start and dv_end (fv >= dv_start AND fv <= dv_end). The dim table contains the following data:

[Figure: sample data in the dim table, with its dv_start and dv_end ranges]

We can choose, arbitrarily, to sort the data on dv_start. This way, we can eliminate ranges whose dv_start is too large to be relevant to a particular fv value. The second figure illustrates this for the lookup of an fv value of 62. The shaded red area on the left does not need to be checked, because those dv_start values are greater than 62, so the predicate fv >= dv_start cannot hold there.

[Figure: lookup of fv = 62; the red areas are skipped and only the blue section is scanned]

Optimizing on dv_end is slightly trickier, because we have no guarantee that the data is also sorted by dv_end (in fact, in this example, it is not). However, we can keep a running maximum of dv_end, computed from the beginning of the table, and search based on that. In this manner, the red area on the right can be skipped, because all of those rows have a dv_end less than 62, so the predicate fv <= dv_end cannot hold there. The part in blue, between the red areas, is then scanned to look for matches.
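To make the example concrete, here is roughly what such a query looks like in SQL. This is a minimal sketch using the fact and dim tables described above; whether the optimizer actually plans it as a range join depends on the data and the plan it chooses.

-- Join each fact row to the dimension row(s) whose range contains it.
-- Inequality predicates like this are what can turn into a range join.
SELECT f.fv, d.dv_start, d.dv_end
FROM fact f
JOIN dim d
  ON f.fv BETWEEN d.dv_start AND d.dv_end;

The IP subnet lookup mentioned below has the same shape: the fact column is an address and the dimension ranges are each subnet's start and end addresses.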

If you managed to follow the example, you can see our approach is simple. Yet it has helped many customers in practice. The IP subnet lookup case was the first prominent one, with a 1000x speedup. But if you got lost in this example, don’t worry… the beauty of languages like SQL is that there is a community of researchers and developers who figure these things out for you. So next time you see us at a conference, don’t hesitate to ask about HP Vertica features. You just might see a blog post about it afterward.

The HP Vertica Community is Moving!

The HP Vertica online community will soon have a new home. In the next few months, we’ll be joining the Big Data and Analytics Community, part of the HP Developer Community, located at https://community.dev.hp.com/.

Why are we doing this?

We’re joining the new community so that you’ll have a centralized place to go for all your big data questions and answers. Using the Big Data and Analytics Community, you will be able to:

  • Connect with customers across all our Big Data offerings, including HP Vertica Enterprise and Community Editions, HP Vertica OnDemand, HP IDOL, and HP IDOL OnDemand.
  • Learn more about HP Haven, the HP Big Data Platform that allows you to harness 100% of your data, including business, machine, and human-generated data.

In short, the Big Data and Analytics Community will provide you with one-stop shopping for product information, guidance on best practices, and solutions to technical problems.

What about existing content?

To preserve the rich exchange of knowledge in our current community and forum, we are migrating all of the content from our current forum to our new Big Data and Analytics location. All your questions and answers will be saved and accessible on the new forum.

When will this happen?

The migration process is just beginning and we estimate it will take a number of weeks. As the new launch date nears, we’ll share more information with you about the actions you’ll need to take to access the new forum.

Want a preview?

Here’s a sneak peek at the new community plans:

[Screenshot: annotated preview of the new community site]

We look forward to greeting you in our new space! Stay tuned for more detailed information to come.

HP Vertica Gives Back this Holiday Season


This holiday season, four teams of HP Vertica employees and families made a trip to East End House in Cambridge, MA to help with the annual Thanksgiving Basket Giveaway. If this organization sounds familiar, you might have read our blog about our summer interns visiting the same location to work with students to build bridges made of toothpicks and gumdrops.

This time around, Vertica volunteers assisted with a program that provided food to individuals and families for Thanksgiving. On Monday, the team helped stuff hundreds of bags with donated goods like whole frozen turkeys, boxed stuffing, canned fruits and vegetables, potatoes, and even fresh kale. They bagged over 22,000 pounds of fresh produce! All of these items were generously donated by individuals and companies. The following day, more Vertica volunteers helped distribute the (now overflowing) bags to over 1,200 families to enjoy this Thanksgiving.

The HP Vertica volunteers are pleased to know they contributed. In the words of Tim Severyn, East End House’s Director of Community Programs, “We couldn’t have done it without you.”

East Cambridge is thankful to have a community center that provides such a great service to local families and HP Vertica looks forward to working with it in the future!

Learn more about East End House and how you can give back to the community here: http://www.eastendhouse.org/get-involved

Ad-Hoc Queries — Cascading Pool to the Rescue


To understand HP Vertica workload management, you should have a basic understanding of Vertica resource pools. For details about HP Vertica’s resource pool architecture and the parameter definitions, see the product documentation.

Because HP Vertica is a columnar database, you’ll rarely encounter an I/O bottleneck. Memory consumption is often the resource bottleneck that workloads have to contend with in an HP Vertica database. You can tune resource pool parameters to control memory allocation and runtime priority among these competing workloads.
In HP Vertica, you can define custom resource pools configured to limit memory usage, concurrency, and runtime priority. Optionally, you can restrict each database user to a specific resource pool to control memory consumption. And in HP Vertica 7.1, we introduced a new feature to meet the customer requirement for ad-hoc queries: cascading pools.

Prior to version 7.1, we recommended redirecting truly ad-hoc queries to different sets of custom pools. But because ad-hoc queries are, by definition, hard to anticipate, that is almost impossible to do well. So, to simplify things for customers and to integrate better with third-party BI tools, HP Vertica introduced the cascading pool feature in release 7.1.

Here’s how cascading pools work. Let’s assume there are two resource pools: R1 (a starter pool) and R2 (a secondary/cascading pool). When a query’s execution time exceeds the pre-set RUNTIMECAP in R1, it cascades to R2. When that happens, all the resources are released from pool R1 and moved to pool R2 (from an accounting perspective). The query continues to execute without interruption. This, of course, assumes that there are enough resources available in pool R2; otherwise, the query has to wait in the queue or re-plan.
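As a minimal sketch of the R1/R2 setup described above (the memory limits and the 5-second cap here are purely illustrative, not a recommendation), the two pools might be declared like this:

-- Secondary pool: receives the queries that outgrow R1.
CREATE RESOURCE POOL R2 MEMORYSIZE '0%' MAXMEMORYSIZE '60%';

-- Starter pool: queries that run longer than 5 seconds cascade to R2.
CREATE RESOURCE POOL R1 MEMORYSIZE '0%' MAXMEMORYSIZE '20%'
  RUNTIMECAP '5 SECONDS' CASCADE TO R2;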

How does this feature help an HP Vertica customer? A typical HP Vertica customer often has two or more different types of workloads in their production environments. Prior to HP Vertica 7.1, customers needed to figure out a way to properly classify a query based on certain criteria (memory footprint by profiling, for example). Customers then had to use a program or script to direct the query to a resource pool. But for ad-hoc queries, this was a very difficult task.
With cascading pools, customers can route all queries through the starter pool R1 and let the queries cascade to the secondary pool R2 automatically.

Furthermore, this feature means that users need not even know that the secondary cascading pools exist. After the secondary cascading pools are configured, they work in the background; you don’t need to grant end users explicit access to them. So in some sense, one pool – the starter pool – is all that HP Vertica customers and third-party BI tools need.
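For example, to point a reporting user (or the account a BI tool connects as) at the starter pool, something like the following is typically all that is needed; the user and pool names here are illustrative:

-- Let the user run in the starter pool and make it their default pool.
GRANT USAGE ON RESOURCE POOL R1 TO report_user;
ALTER USER report_user RESOURCE POOL R1;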

The HP Big Data Platform Corporate Presales group has developed a video that demonstrates the new cascading pool feature in Vertica 7.1. The video shows how to manage complex workloads using cascading pools.

This workload management demo was performed on a 3-node HP Vertica cluster running version 7.1. Each node has 16 CPU cores and 64 GB of memory.

The test workload consists of:

  • 7500 small queries
  • 700 medium queries
  • 150 large queries

To simulate a real-life scenario, these queries are submitted to HP Vertica on different schedules and in different batch sizes.

The same workload was run in two different scenarios:

  • Using the “out-of-the-box” default GENERAL pool
  • Using the cascading pools

Figure 1 shows the test results from using the default GENERAL pool and the cascading pools. On average, there is a 30% performance improvement for all three types of workloads when using the cascading pools.


Figure 1: Total runtime using the default GENERAL pool and the cascading pools.

So what do these cascading pools look like?

create resource pool pool_long priority 10 memorysize '0%' maxmemorysize '60%' plannedconcurrency 4 maxconcurrency NONE executionparallelism 16 queuetimeout 3600 RUNTIMEPRIORITY LOW RUNTIMEPRIORITYTHRESHOLD 0 RUNTIMECAP NONE CPUAFFINITYSET NONE CPUAFFINITYMODE DEFAULT;

create resource pool pool_medium priority 30 memorysize '0%' maxmemorysize '60%' plannedconcurrency 8 maxconcurrency NONE executionparallelism 8 queuetimeout 3600 RUNTIMEPRIORITY MEDIUM RUNTIMEPRIORITYTHRESHOLD 0 RUNTIMECAP '100 SECONDS' CASCADE TO pool_long CPUAFFINITYSET '4-15' CPUAFFINITYMODE SHARED;

create resource pool pool_fast priority 50 memorysize '3G' maxmemorysize '20%' plannedconcurrency 24 maxconcurrency NONE executionparallelism 1 queuetimeout 3600 RUNTIMEPRIORITY HIGH RUNTIMEPRIORITYTHRESHOLD 0 RUNTIMECAP '10 SECONDS' CASCADE TO pool_medium CPUAFFINITYSET '0-3' CPUAFFINITYMODE SHARED;

Let’s highlight some of the key design considerations for the cascading pools used in this demo.

  • The demo uses three pools: pool_fast, pool_medium, and pool_long. All queries start in pool_fast. If a query takes more than 10 seconds to finish in pool_fast, it cascades to pool_medium and continues running in pool_medium. If it runs for another 90 seconds and still does not finish, then it cascades to pool_long. The runtime is cumulative as a query moves from one pool to another.
  • pool_fast is meant to catch all the short-running tactical queries. It runs in “high priority” mode and shares 4 CPU cores (of a total of 16 cores) with other pools.
  • These three pools are designed so that the secondary pools have larger memory budgets than the source pools. This ensures that when a query moves from one pool to another, the secondary/cascading pool has enough resources to keep the query running uninterrupted. That way, the query does not have to wait in the queue for resources to be freed up or retry itself. Minimizing query retries during the pool move improves performance.

By simulating a real-life scenario with three different types of query workloads (large, medium, and small), you can see that, compared to the GENERAL pool, a carefully designed set of cascading pools can produce, on average, a 30% performance gain across all workload types. A best practice in designing HP Vertica cascading pools is to minimize query retries during the pool move.

In prior HP Vertica releases, getting optimal mixed-workload performance out of resource pools for truly ad-hoc workloads required a good knowledge of query profiles and memory footprints. With HP Vertica 7.1, by using cascading pools, you can point all queries at one pool and let HP Vertica do the heavy lifting in the background automatically.

Po Hong and Satish Sathiyavageswaran are respectively Senior Solutions Architect and Solutions Architect in HP Big Data Platform Corporate Presales, which specializes in HP Vertica performance tuning and workload management.

Tech Support Series: Optimizing Projections

Welcome to another installment of our Top Tech Support Questions Answered blog series. In our first blog, we discussed ways to optimize your database for deletes. In this installment, we’ll talk about optimizing projections.

People often ask, “How can I optimize my projections for maximum query performance?” As with many questions, the answer is “It depends.” This is because every database has a different structure and uses data in very different ways.

But fear not, we do have some general guidelines you can use to optimize your projections.

Your first step should always be to run Database Designer. The HP Vertica Database Designer creates optimized projections based on sample queries and sample data that you provide. However, you may find that you want to create your own projections as well.

If you feel you must create your own projections, focus on three characteristics:

  • Sort Order
  • Segmentation
  • Encoding

Sort Order
Choosing a good sort order can help you achieve maximum query performance. If you have queries that contain GROUP BY clauses, joins, or other predicates, it’s good practice to place the columns specified in those clauses early in the sort order. If you have no other criteria for how to sort your data, the fastest way to access the data is to sort first on the columns with the smallest number of distinct values (lowest cardinality), before the high-cardinality columns.

Segmentation
Also consider creating segmented projections on large tables to spread the query execution workload across multiple nodes. Projection segmentation also provides high availability and recovery, and optimizes query execution. Therefore, it’s important to determine which columns to use to segment a projection. For HP Vertica, hash segmentation is the preferred method of segmentation. Primary key columns that have a large number of unique data values (high cardinality) and acceptable skew in their data distribution are an excellent choice for hash segmentation.

Encoding
Database Designer implements optimum encoding for the data you provide. Likewise, when creating your own projections, make sure you specify the encoding on your projection columns to optimize query performance. With appropriate encoding, you can reduce your database footprint and improve query performance. Read more about the encoding types HP Vertica supports here.
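As a rough illustration of all three characteristics together, a hand-written projection might look like the sketch below. The table, columns, and encoding choices are hypothetical, so adapt them to your own schema and verify the details against the documentation for your version.

-- Hypothetical projection for a sales table:
--   * sort order: low-cardinality region first, then date, then customer
--   * segmentation: hash on the high-cardinality customer_id, across all nodes
--   * encoding: RLE on the low-cardinality columns that lead the sort order
CREATE PROJECTION sales_by_region (
    region    ENCODING RLE,
    sale_date ENCODING RLE,
    customer_id,
    amount
) AS
SELECT region, sale_date, customer_id, amount
FROM sales
ORDER BY region, sale_date, customer_id
SEGMENTED BY HASH(customer_id) ALL NODES;

Note that a new projection on a table that already contains data typically needs to be refreshed (for example, with START_REFRESH) before it is up to date.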

So there you have it: three main characteristics to consider when creating your own projections. As mentioned before, all databases are different, so you may find that one approach is more beneficial for you than another. But focusing on these three things can make the whole process a little less daunting.

Stay tuned for more tech support blogs!

To learn about optimizing your projections using Database Designer, see our documentation.

Financial Analytics with HP Vertica and Vichara Technologies


If you’re an investor on Wall Street, you want to know with a good amount of certainty that you aren’t throwing your money into a crapshoot. As it turns out, the government feels the same way. Since the financial crisis of 2008, the government has added regulations for FSIs (Financial Services Institutions) to follow in order to prevent a repeat scenario. Financial organizations rely on FSIs like Vichara Technologies to analyze risk and to provide a comprehensive, aggregated view of a firm’s potential to default at any point in time. As you can imagine, those analyses require looking at a LOT of data.

Take this particular case:

The Monte Carlo method is a common way of producing a VaR (Value at Risk) analysis for an FSI’s many accounts by running millions of simulations based on existing data sets. A VaR is essentially a snapshot of the level of risk within a firm at any given point in time. A risk balance sheet, if you will. The Monte Carlo method of determining a VaR works by running simulations based on past data, say a year’s worth, and then using the results to determine the maximum losses a firm can incur over a year’s time with a given level of confidence (95% is most common).
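For readers who like a formula, the standard textbook definition (not anything specific to Vichara’s implementation) says that the 95% VaR is just a high quantile of the simulated loss distribution L:

\mathrm{VaR}_{0.95}(L) = \inf\{\, \ell : \Pr(L > \ell) \le 0.05 \,\}

That is, after running the simulations you sort the simulated losses and read off the 95th percentile; losses beyond that level should occur no more than 5% of the time.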

In other words, imagine you’re at a poker table with your friends on a Friday night, and in the pot is the equivalent of the combined yearly income of everyone at the table. But instead of taking a leap of faith and calling a raise based on the way your buddy Dave is picking his nose, you get to play the same hand over and over and over again in your head (millions of times) in order to see how the odds are stacked against you with, let’s say, around 95% accuracy. Imagine being able to do all of that before you decide to fold or go all in with your life’s savings. This type of analysis is used by firms every day to calculate risk effectively and efficiently, so the faster they can run these simulations, the more accurate they can be. That’s where HP Vertica and Vichara come into play.

Vichara Technologies deals with these types of risk-modeling calculations every day, and was in need of an upgrade from their legacy database. With HP Vertica leading the charge on reinventing the RDBMS, it was the obvious choice. In two recent use cases, Vichara was able to leverage its risk assessment software, vLense, on top of the HP Vertica database to totally revolutionize the way these two companies looked at their financial risk analytics. For you engineers out there, vLense is a lot like Visual Studio for C++, but for their own custom language, VQuery, which specializes in financial analysis.

Using vLense on top of HP Vertica, the two companies involved in these cases were able to pull from their own data, instead of having to outsource to third parties like Moody’s and Fannie Mae. This not only granted them autonomy, but allowed them to create daily reports rather than waiting days, if not weeks, to get the answers they needed. In addition, the data was no longer hidden behind a few programmers running custom scripts to ETL the data. Now even portfolio managers can use the intuitive query builder interface to create ad-hoc queries on the fly. After analyzing the data, they can easily export the results into SAS or R for further insight. Transparency at its finest.

Wrapping it all up, the folks at Vichara touched on one of the crucial aspects of Vertica during the webinar: the most bang for your buck. Use whatever hardware you want, start small, and go big. We aren’t tying you to a multi-million dollar physical ball and chain that might be outdated in a year; we scale when you scale, and we won’t cost you an arm and a leg in the process.

Check out the webinar here to learn more.



The Top Five Ways to Botch a Vertica Evaluation


In the years I’ve been working with Vertica and other large-scale data technologies, I’ve been a party to a large number of technology evaluations. Most businesses are familiar with these – they’re often called a “proof of concept”, a “proof of value”, or a “pilot”. Technology evaluations are a key part of the technology selection process, wherein the business identifies a set of criteria which the candidate technology must meet (or exceed). These evaluations are tightly scoped operations, sponsored by company leadership, with clear-cut input data, test scenarios, and defined metrics to measure success.

At least, that’s the theory.

While some evaluations are very much the way I describe them above, many aren’t. In fact, many evaluations fail to demonstrate measurable value, and can in fact muddy the waters around technology selection – exactly the opposite of what they’re supposed to do. While there are all manner of things that can go wrong with evaluating a big data platform, I’ve seen organizations struggle with specific areas when conducting a Vertica evaluation. Here are the top five.

Mistake number 5: Don’t talk with any Vertica people at all

We’ve all bought cars, and have had to deal with car dealers. For many of us, talking with sales people can leave a bad taste in our mouths. This is unfortunate, because there is unique value to be found in talking with the right sales team. A skilled sales executive will know how to work with an organization’s leadership to map technology to strategy – which greatly increases the likelihood that an investment in that technology will pay off. A skilled presales engineer will know how to deploy the technology in ways that fit a particular business and use case(s) – which can serve as an accelerator in the project, and mitigate the risk of failure. Moreover, these teams accumulate knowledge on best (and worst) practices, and can be a powerful source of knowledge and guidance. By ignoring sales people, organizations run the risk of repeating mistakes made by others and possibly selecting the wrong technology for their needs.

Mistake number 4: Use 100% IT-defined success criteria

First, I have to say that I have nothing but respect for IT teams. I worked in various IT departments for many years before moving to the vendor side of the world. In my experience, they’re incredibly hard working, talented folks. But the people working in the technology trenches tend to think about the technology, not why it’s there. Rather than thinking of that Oracle operational store as “a key resource for business stakeholders to optimize day to day decisions,” they tend to think of it as “an Oracle database that needs to stay up at all times or the CEO will call the CIO and complain.”

This shapes expectations. And when it’s time to select new technology, IT will focus on the things it cares about – SQL completeness, availability, fault-tolerance, backup and recovery, and so forth. I’ve seen evaluations where the IT team made their “wish list” of criteria, and the vendor demonstrated every single one of them, only to see another technology get chosen. Because the test criteria didn’t matter to the business stakeholders.

Mistake number 3: Never, ever run the Database Designer

The other mistakes discussed here are pretty much technology agnostic – they can be issues in all sorts of evaluations. This one, however, is specific to Vertica. That’s because the Vertica team re-invented storage as opposed to borrowing somebody else’s storage engine and bolting on column-like features. While this bit is somewhat longer than the others, it bears reading because it is often the moment when the light bulb goes on for the database folks as to why Vertica has done so well in the market in recent years.

When a user creates a table in Vertica, two things happen:

  1. A logical table is created. This is the structure that all users will query, insert to, update, and delete from. It is just a stub, however.
  2. A super-projection is created. The super-projection is identical to the logical table; however, it is the actual storage structure for the data. It uses certain rules for things like data distribution, sort order, and encoding – which are all part of the “secret sauce” of Vertica’s performance and scalability. The super-projection is required because Vertica is a database – we need a spot where data can go, in an ACID-compliant form, immediately.

But the beauty of the Vertica storage engine is that additional projections can be created, and they don’t all require every column. This is why we built our own engine from the ground up – Vertica establishes a loose coupling between the logical data model and the physical storage of that data. Additional projections can use fewer columns, other sort orders, different distribution keys, other forms of compression, and so on to deliver maximum performance. And the database will decide – when a query is submitted – which set of projections will make the query perform best.
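As a hedged sketch of what this looks like in practice (the table and columns below are made up purely for illustration): creating the table gives you the super-projection automatically, and a second, narrower projection can then be added for a specific query pattern.

-- Creating the table automatically creates its super-projection.
CREATE TABLE clicks (
    click_time TIMESTAMP,
    user_id    INT,
    url        VARCHAR(2000),
    referrer   VARCHAR(2000)
);

-- An additional projection with fewer columns and a different sort order,
-- tuned for time-range queries over users.
CREATE PROJECTION clicks_by_time (click_time, user_id) AS
SELECT click_time, user_id
FROM clicks
ORDER BY click_time, user_id
SEGMENTED BY HASH(user_id) ALL NODES;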

To make projections easier for our users to leverage, we’ve created a tool which is included with Vertica, called the Database Designer. This is unique in the industry as far as I know. A user only needs to create the desired tables and load a modest amount of data, then package up their queries and pass them to the Database Designer. The Database Designer will then test the queries and write SQL to create a set of optimized projections. In this way, the Database Designer can make just about anyone as effective as a skilled DBA when it comes to performance tuning.
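For reference, the flow looks roughly like the following when driven from SQL. The meta-function names and arguments below are as I recall them for Vertica 7.x, so treat this as an approximation and check the documentation for your version; the design name, table, and file paths are hypothetical.

-- Create a design, point it at the tables and a file of representative queries,
-- then let the Database Designer generate and deploy optimized projections.
SELECT DESIGNER_CREATE_DESIGN('my_design');
SELECT DESIGNER_ADD_DESIGN_TABLES('my_design', 'public.clicks');
SELECT DESIGNER_ADD_DESIGN_QUERIES('my_design', '/home/dbadmin/queries.sql');
SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY('my_design',
       '/home/dbadmin/design.sql',       -- generated projection DDL
       '/home/dbadmin/deployment.sql');  -- generated deployment script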

Unfortunately, much of the market doesn’t understand Vertica and projections. So I often walk into conversations where the technology team has been told – usually by another vendor – that projections are “cheating” because they optimize performance. And so the business decides to deliberately avoid using the Database Designer to optimize performance. This is like telling yourself that breathing more oxygen during a foot race is cheating, so the runners should hold their breath during the race in order to slow the faster runners down and give the slower ones a chance of winning. I think I’m being generous when I call this a bad idea.

Mistake number 2: Don’t take it seriously

Sometimes, the technology team already knows which technology they want. And the technology evaluation is just a rubber stamp – the outcome is predetermined, and the team just needs the window dressing to make it look like they evaluated other vendors. This is a bad idea for two reasons. First, even if it’s all about putting a rubber stamp on a predetermined choice, it’s still a new use case for the technology. So the team has to plan to mitigate risk. And a well-executed technology evaluation is one good way to mitigate risk. Second, going into an evaluation having already chosen the technology will put blinders on the team – rather than looking for unique ways in which new technologies can be applied, the focus instead is on doing things very much the way they’ve been done before.

A few years ago, I was managing a field engineering team when we found ourselves in one of these evaluations. The company clearly had already chosen another vendor, but because they were already using Vertica (happily), a technology evaluation was required. The company didn’t take the evaluation very seriously, and despite the fact that our team executed flawlessly, the company went with their original choice. They didn’t pay attention to the fact that the Vertica team started (and finished) the evaluation within seven days, which was how long it took the other vendor to pack their equipment and prepare it for shipping to the customer. They didn’t want to see the findings our team uncovered highlighting revenue opportunities hidden within the data. They selected the other vendor as they’d planned all along. And after six months trying to implement it, the folks who had selected the other vendor were looking for new jobs. Moreover, most of the data science team quit in frustration. So in one fell swoop, they significantly damaged their analytics capabilities.

So take it seriously, even if the choice seems predetermined.

Mistake number 1: Do an unrealistic test

One way to create an unrealistic test is to fit the test to the conditions, rather than the conditions to the test. The most frequent mistake here is using Vertica Community Edition, which is limited to three nodes and a terabyte of data, and then forcing the data used in the test to fit that limit. This is a bad idea for several reasons. First, the benefits of a distributed computing technology like Vertica don’t really show up at a terabyte. While you can run queries on the data, old-school strategies such as indexing can make it look like row-oriented databases perform just as well. Second, it means “chopping down” the data – making it fit the one-terabyte threshold. This often results in artificial data, which brings with it all sorts of problems. The biggest problem, however, is that it may no longer allow you to derive the insights which solve the problems you’re trying to solve.

So test with a realistic volume of data. What is “realistic”? It’s a relative thing, but it should be more than just a token amount of data. Don’t feel compelled to limit your evaluation to a terabyte just because you want to run Vertica CE. This often goes hand in hand with mistake number 5 (don’t talk to any Vertica people). Don’t worry about talking with Vertica folks! We’re a friendly bunch with a commitment to the success of our customers. And we’re happy to set you up with an evaluation license that fits your data, so you don’t have to cram the data to fit the license.

Finally, there’s another way in which we see unrealistic evaluations. Particularly when the evaluation is driven by the IT team (see Mistake Number 4), the use case is often “run our existing queries faster”. While this is helpful, this is not what keeps the executive leadership awake at night. What keeps them awake? Fraud detection, personalized marketing, optimized business operations, new data products, and so forth. Note that the phrase “run our queries faster” did not appear on that list. So make the test realistic by asking bigger questions. What can’t the company do today because it can’t cope with big data? Why does it matter? These are the use cases which take a technology evaluation and translate it into terms leadership can understand – how is this technology going to enable the strategy of the business?

So there, in a nutshell, are the problems we see the most often in Vertica evaluations. We do a lot of these, and are quite good at it. So don’t hesitate to let us know when you want to try it out so we can help you avoid the pitfalls, and tie the technology to your strategy. If you’d like to talk with our team, click here to arrange a conversation.
