HP Vertica Storage Location for HDFS

Do you find yourself running low on disk space on your HP Vertica database? You could delete older data, but that sacrifices your ability to perform historical queries. You could add new nodes to your cluster or add storage to your existing nodes. However, these options require additional expense.

The HP Vertica Storage Locations for HDFS feature introduced in HP Vertica Version 7.1 offers you a new solution: storing data on an Apache Hadoop cluster. You can use this feature to store data in a Hadoop Distributed File System (HDFS) while still being able to query it through HP Vertica.
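
To give a flavor of how the feature is used, here is a minimal sketch that creates an HDFS-backed storage location and moves a table's storage there by label. The namenode address, path, label, and table name are illustrative assumptions, and the URL scheme and exact syntax vary by release, so check the documentation for your version.

-- Create a storage location backed by HDFS (hypothetical namenode and path):
CREATE LOCATION 'webhdfs://namenode:50070/user/dbadmin/vertica_cold'
   ALL NODES USAGE 'data' LABEL 'hdfs_cold';

-- Move an older table's storage to the HDFS location by label:
SELECT SET_OBJECT_STORAGE_POLICY('sales_2012', 'hdfs_cold');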

Watch this video for an overview of the HP Vertica Storage Locations for HDFS feature and an example of how you can use it to free storage space on your HP Vertica cluster.

For more information about this feature, see the HP Vertica Storage Location for HDFS section of the documentation.

HP Vertica Best Practices: Native Connection Load Balancing

You may be aware that each client connection to a host in your HP Vertica cluster requires a small overhead in memory and processor time. For a single connection, this impact is minimal, almost unnoticeable. Now imagine you have many clients all connecting to the same host at the same time. In this situation, the compounded overhead can potentially affect database performance.

To limit the database performance consequences caused by multiple client connections, you might manually assign certain client connections to certain hosts. But this can become tedious and difficult as more and more client connections are added. Luckily, HP Vertica offers a feature that can do all this for you. It’s called native connection load balancing.

Native connection load balancing is available in HP Vertica 7.0 and later releases. It is a feature built into both the server and the client libraries that helps spread the CPU and memory overhead caused by client connections across the hosts in the database. When you enable native load balancing on the server and client, you won’t have to manually assign clients to specific hosts to reduce overhead.
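
As a quick sketch, enabling native connection load balancing takes a single statement on the server; clients then opt in (for example, with vsql's -C flag or the equivalent client-library connection property):

-- Choose a server-side policy (supported values are NONE, ROUNDROBIN, and RANDOM):
SELECT SET_LOAD_BALANCE_POLICY('ROUNDROBIN');

Clients that don't opt in keep connecting to the host they specify, so you can roll the feature out gradually.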

Watch this best practices video to learn more about HP Vertica native connection load balancing and how to enable and disable it on the server and client.

For more information, see Native Connection Load Balancing in our documentation.

What Is a Range Join and Why Is It So Fast?

[Photo: a view of the Asilomar Conference Grounds]

Last week, I was at the 2015 Conference on Innovative Data Systems Research (CIDR), held at the beautiful Asilomar Conference Grounds. The picture above shows one of the many gorgeous views you won’t see when you watch other people do PowerPoint presentations. One HP Vertica user at the conference said he saw a “range join” in a query plan, and wondered what it is and why it is so fast.

First, you need to understand what kind of queries turn into range joins. Generally, these are queries with inequality (greater than, less than, or between) predicates. For example, a map of the IPv4 address space might give details about addresses between a start and end IP for each subnet. Or, a slowly changing dimension table might, for each key, record attributes with their effective time ranges.

A rudimentary approach to handling such joins would be as follows: For each fact table row, check each dimension row to see if the range condition is true (effectively taking the Cartesian product and filtering the results). A more sophisticated, and often more efficient, approach would be to use some flavor of interval trees. However, HP Vertica uses a simpler approach based on sorting.

Basically, if the ranges don’t overlap very much (or at all), sorting the table by range allows sections of the table to be skipped (using a binary search or similar). For large tables, this can reduce the join time by orders of magnitude compared to “brute force”.

Let’s take the example of a table fact, with a column fv, which we want to join to a table dim using a BETWEEN predicate against attributes dv_start and dv_end (fv >= dv_start AND fv <= dv_end). The dim table contains the following data:

[Figure: contents of the dim table, with dv_start and dv_end columns]
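
In SQL, the join in this example is simply (using the fact and dim tables just described):

SELECT f.fv, d.dv_start, d.dv_end
FROM fact f
JOIN dim d ON f.fv BETWEEN d.dv_start AND d.dv_end;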

We can choose, arbitrarily, to sort the data on dv_start. This way, we can eliminate ranges whose dv_start is too large to be relevant to a particular fv value. The second figure illustrates this for the lookup of an fv value of 62. The left shaded red area does not need to be checked, because those dv_start values are greater than 62, so the condition fv >= dv_start cannot hold.

[Figure: dim sorted by dv_start, with the skippable regions shaded red for a lookup of fv = 62]

Optimizing on dv_end is slightly trickier, because we have no guarantee that the data is also sorted by dv_end (in fact, in this example, it is not). However, we can keep track of the largest dv_end seen so far, scanning from the beginning of the table, and search based on that running maximum. In this manner, the red area on the right can be skipped, because all of those rows have a dv_end less than 62. The part in blue, between the red areas, is then scanned to look for matches.

If you managed to follow the example, you can see that our approach is simple. Yet it has helped many customers in practice; the IP subnet lookup case was the first prominent one, with a 1000x speedup. But if you got lost in this example, don’t worry… the beauty of languages like SQL is that there is a community of researchers and developers who figure these things out for you. So next time you see us at a conference, don’t hesitate to ask about HP Vertica features. You just might see a blog post about it afterward.

The HP Vertica Community is Moving!

The HP Vertica online community will soon have a new home. In the next few months, we’ll be joining the Big Data and Analytics Community, part of the HP Developer Community, located at https://community.dev.hp.com/.

Why are we doing this?

We’re joining the new community so that you’ll have a centralized place to go for all your big data questions and answers. Using the Big Data and Analytics Community, you will be able to:

  • Connect with customers across all our Big Data offerings, including HP Vertica Enterprise and Community Editions, HP Vertica OnDemand, HP IDOL, and HP IDOL OnDemand.
  • Learn more about HP Haven, the HP Big Data Platform that allows you to harness 100% of your data, including business, machine, and human-generated data.

In short, the Big Data and Analytics Community will provide you with one-stop shopping for product information, guidance on best practices, and solutions to technical problems.

What about existing content?

To preserve the rich exchange of knowledge in our current community and forum, we are migrating all of the content from our current forum to our new Big Data and Analytics location. All your questions and answers will be saved and accessible on the new forum.

When will this happen?

The migration process is just beginning, and we estimate it will take several weeks. As the new launch date nears, we’ll share more information with you about the actions you’ll need to take to access the new forum.

Want a preview?

Here’s a sneak peek at the new community plans:

[Screenshot: annotated preview of the new Big Data and Analytics Community]

We look forward to greeting you in our new space! Stay tuned for more detailed information to come.

HP Vertica Gives Back this Holiday Season

[Photo: HP Vertica volunteers at the East End House Thanksgiving Basket Giveaway]

This holiday season, four teams of HP Vertica employees and families made a trip to East End House in Cambridge, MA to help with the annual Thanksgiving Basket Giveaway. If this organization sounds familiar, you might have read our blog about our summer interns visiting the same location to work with students to build bridges made of toothpicks and gumdrops.

This time around, Vertica volunteers assisted with a program that provided food to individuals and families for Thanksgiving. On Monday, the team helped stuff hundreds of bags with donated goods like whole frozen turkeys, boxed stuffing, canned fruits and vegetables, potatoes, and even fresh kale. They bagged over 22,000 pounds of fresh produce! All of these items were generously donated by individuals and companies. The following day, more Vertica volunteers helped distribute the (now overflowing) bags to over 1,200 families to enjoy this Thanksgiving.

The HP Vertica volunteers are pleased to know they contributed. In the words of Tim Severyn, East End House’s Director of Community Programs, “we couldn’t have done it without you.”

East Cambridge is thankful to have a community center that provides such a great service to local families and HP Vertica looks forward to working with it in the future!

Learn more about East End House and how you can give back to the community here: http://www.eastendhouse.org/get-involved

Ad-Hoc Queries — Cascading Pool to the Rescue

To understand HP Vertica workload management, you should have a basic understanding of Vertica resource pools. For details about HP Vertica’s resource pool architecture and the parameter definitions, see the product documentation.

Because HP Vertica is a columnar database, you’ll rarely encounter an I/O bottleneck. Memory consumption is often the resource bottleneck that workloads have to contend with in an HP Vertica database. You can tune resource pool parameters to control memory allocation and runtime priority among these competing workloads.

In HP Vertica, you can define custom resource pools configured to limit memory usage, concurrency, and runtime priority. Optionally, you can restrict each database user to a specific resource pool to control memory consumption. And in HP Vertica 7.1, we introduced a new feature to meet customer requirements for ad-hoc queries: cascading pools.

Prior to version 7.1, we recommended redirecting different classes of queries to different sets of custom pools. But truly ad-hoc queries are difficult to anticipate, which made classifying them ahead of time nearly impossible. So, to simplify things for customers and to integrate better with third-party BI tools, HP Vertica introduced the cascading pool feature in release 7.1.

Here’s how cascading pools work. Let’s assume there are two resource pools: R1 (a starter pool) and R2 (a secondary/cascading pool). When a query’s execution time exceeds the pre-set RUNTIMECAP in R1, it cascades to R2. When that happens, all the resources are released from pool R1 and moved to pool R2 (from an accounting perspective). The query continues to execute without interruption. This, of course, assumes that there are enough resources available in pool R2; otherwise, the query has to wait in the queue or re-plan.
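
A minimal sketch of this two-pool setup might look like the following (the pool names, sizes, and user are illustrative; the full pools from our demo appear later in this post):

-- Secondary pool with a large memory budget and no runtime cap:
CREATE RESOURCE POOL R2 MEMORYSIZE '0%' MAXMEMORYSIZE '60%';

-- Starter pool: queries running longer than 10 seconds cascade to R2:
CREATE RESOURCE POOL R1 MEMORYSIZE '4G' RUNTIMECAP '10 SECONDS' CASCADE TO R2;

-- Route a user's queries through the starter pool:
ALTER USER adhoc_user RESOURCE POOL R1;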

How does this feature help an HP Vertica customer? A typical HP Vertica customer often has two or more different types of workloads in their production environment. Prior to HP Vertica 7.1, customers needed to figure out a way to classify each query based on certain criteria (its memory footprint obtained by profiling, for example), and then use a program or script to direct the query to an appropriate resource pool. For ad-hoc queries, this was a very difficult task.

With cascading pools, customers can route all queries through the starter pool R1 and let long-running queries cascade to the secondary pool R2 automatically.

Furthermore, users need not even know that the secondary cascading pools exist. After the secondary cascading pools are configured, they work in the background; you don’t need to grant end users explicit access to them. So in some sense, one pool, the starter pool, is all that HP Vertica customers and third-party BI tools need.

The HP Big Data Platform Corporate Presales group has developed a video that demonstrates the new cascading pool feature in Vertica 7.1. The video shows how to manage complex workloads using cascading pools.

This workload management demo was performed on a 3-node HP Vertica cluster running version 7.1. Each node has 16 CPU cores and 64 GB of memory.

The test workload consists of:

  • 7500 small queries
  • 700 medium queries
  • 150 large queries

To simulate a real-life scenario, these queries are submitted to HP Vertica on different schedules and in different batch sizes.

The same workload was run in two different scenarios:

  • Using the “out-of-the-box” default GENERAL pool
  • Using the cascading pools

Figure 1 shows the test results from using the default GENERAL pool and the cascading pools. On average, there is a 30% performance improvement for all three types of workloads when using the cascading pools.

Figure 1: Total runtime using the default GENERAL pool and the cascading pools.

So what do these cascading pools look like?

create resource pool pool_long priority 10 memorysize '0%' maxmemorysize '60%'
   plannedconcurrency 4 maxconcurrency NONE executionparallelism 16 queuetimeout 3600
   RUNTIMEPRIORITY LOW RUNTIMEPRIORITYTHRESHOLD 0 RUNTIMECAP NONE
   CPUAFFINITYSET NONE CPUAFFINITYMODE DEFAULT;

create resource pool pool_medium priority 30 memorysize '0%' maxmemorysize '60%'
   plannedconcurrency 8 maxconcurrency NONE executionparallelism 8 queuetimeout 3600
   RUNTIMEPRIORITY MEDIUM RUNTIMEPRIORITYTHRESHOLD 0 RUNTIMECAP '100 SECONDS'
   CASCADE TO pool_long CPUAFFINITYSET '4-15' CPUAFFINITYMODE SHARED;

create resource pool pool_fast priority 50 memorysize '3G' maxmemorysize '20%'
   plannedconcurrency 24 maxconcurrency NONE executionparallelism 1 queuetimeout 3600
   RUNTIMEPRIORITY HIGH RUNTIMEPRIORITYTHRESHOLD 0 RUNTIMECAP '10 SECONDS'
   CASCADE TO pool_medium CPUAFFINITYSET '0-3' CPUAFFINITYMODE SHARED;

Let’s highlight some of the key design considerations for the cascading pools used in this demo.

  • The demo uses three pools: pool_fast, pool_medium, and pool_long. All queries start in pool_fast. If a query takes more than 10 seconds to finish in pool_fast, it cascades to pool_medium and continues running there. If it runs for another 90 seconds and still does not finish, it cascades to pool_long. The runtime is cumulative as a query moves from one pool to another.
  • pool_fast is meant to catch short-running tactical queries. It runs in “high priority” mode and shares 4 CPU cores (of a total of 16) with other pools.
  • The three pools are designed so that each secondary pool has a larger memory budget than its source pool. This ensures that when a query moves from one pool to another, the secondary/cascading pool has enough resources to keep it running uninterrupted, so the query does not have to wait in the queue for resources to be freed or retry itself. Minimizing query retries during the pool move improves performance.

This simulation of a real-life scenario with three different types of query workloads (large, medium, and small) shows that, compared to the GENERAL pool, a carefully designed set of cascading pools can produce an average 30% performance gain across all workload types. A best practice in designing HP Vertica cascading pools is to minimize query retries during the pool move.

In prior HP Vertica releases, taking advantage of resource pools for optimal mixed-workload performance with truly ad-hoc workloads required a good knowledge of query profiles and memory footprints. With HP Vertica 7.1, you can point all queries at one cascading pool and let HP Vertica do the heavy lifting in the background automatically.

Po Hong and Satish Sathiyavageswaran are respectively Senior Solutions Architect and Solutions Architect in HP Big Data Platform Corporate Presales, which specializes in HP Vertica performance tuning and workload management.

Tech Support Series: Optimizing Projections

Welcome to another installment of our Top Tech Support Questions Answered blog series. In our first blog, we discussed ways to optimize your database for deletes. In this installment, we’ll talk about optimizing projections.

People often ask, “How can I optimize my projections for maximum query performance?” As with many questions, the answer is “it depends.” This is because every database has a different structure and uses data in very different ways.

But fear not, we do have some general guidelines you can use to optimize your projections.

Your first step should always be to run Database Designer. The HP Vertica Database Designer creates optimized projections based on sample queries and sample data that you provide. However, you may find that you want to create your own projections as well.

If you feel you must create your own projections, focus on three characteristics:

  • Sort Order
  • Segmentation
  • Encoding

Sort Order
Choosing a good sort order can help you achieve maximum query performance. If your queries contain GROUP BY clauses, joins, or other predicates, it’s good practice to place the columns specified in those clauses early in the sort order. If you have no other criteria for sorting your data, the fastest way to access it is to sort first on the columns with the fewest distinct values (lowest cardinality), before the high-cardinality columns.

Segmentation
Also consider creating segmented projections on large tables to spread the query execution workload across multiple nodes. Projection segmentation also provides high availability and recovery, and optimizes query execution. Therefore, it’s important to determine which columns to use to segment a projection. For HP Vertica, hash segmentation is the preferred method of segmentation. Primary key columns that have a large number of unique data values (high cardinality) and acceptable skew in their data distribution are an excellent choice for hash segmentation.

Encoding
Database Designer implements optimum encoding for the data you provide. Likewise, when creating your own projections, make sure you specify the encoding on your projection columns to optimize query performance. With appropriate encoding, you can reduce your database footprint and improve query performance. Read more about the encoding types HP Vertica supports here.
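
Putting the three together, a hand-crafted projection might look like the following sketch (the sales table and its columns are hypothetical):

CREATE PROJECTION sales_by_region (
   region ENCODING RLE,       -- low cardinality, early in the sort order
   sale_date ENCODING RLE,
   customer_id,               -- high cardinality, good for hash segmentation
   amount
) AS
SELECT region, sale_date, customer_id, amount
FROM sales
ORDER BY region, sale_date, customer_id
SEGMENTED BY HASH(customer_id) ALL NODES;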

So there you have it: three main characteristics to consider when creating your own projections. As mentioned before, all databases are different, so you may find that one approach is more beneficial for you than another. But focusing on these three things can make the whole process a little less daunting.

Stay tuned for more tech support blogs!

To learn about optimizing your projections using Database Designer, see our documentation.
