Vertica

Archive for the ‘Technical’ Category

HP Vertica Storage Location for HDFS

Do you find yourself running low on disk space on your HP Vertica database? You could delete older data, but that sacrifices your ability to perform historical queries. You could add new nodes to your cluster or add storage to your existing nodes. However, these options require additional expense.

The HP Vertica Storage Locations for HDFS feature introduced in HP Vertica Version 7.1 offers you a new solution: storing data on an Apache Hadoop cluster. You can use this feature to store data in a Hadoop Distributed File System (HDFS) while still being able to query it through HP Vertica.

Watch this video for an overview of the HP Vertica Storage Locations for HDFS feature and an example of how you can use it to free storage space on your HP Vertica cluster.

For more information about this feature, see the HP Vertica Storage Location for HDFS section of the documentation.

HP Vertica Best Practices: Native Connection Load Balancing

You may be aware that each client connection to a host in your HP Vertica cluster requires a small overhead in memory and processor time. For a single connection, this impact is minimal, almost unnoticeable. Now imagine you have many clients all connecting to the same host at the same time. In this situation, the compounded overhead can potentially affect database performance.

To limit the database performance consequences caused by multiple client connections, you might manually assign certain client connections to certain hosts. But this can become tedious and difficult as more and more client connections are added. Luckily, HP Vertica offers a feature that can do all this for you. It’s called native connection load balancing.

Native connection load balancing is available in HP Vertica 7.0 and later releases. It is a feature built into both the server and the client libraries that helps spread the CPU and memory overhead caused by client connections across the hosts in the database. When you enable native load balancing on the server and client, you won’t have to manually assign clients to specific hosts to reduce overhead.

Watch this best practices video to learn more about HP Vertica native connection load balancing and how to enable and disable it on the server and client.

For more information, see Native Connection Load Balancing in our documentation.

What Is a Range Join and Why Is It So Fast?

chuck5

Last week, I was at the 2015 Conference on Innovative Data Systems Research (CIDR), held at the beautiful Asilomar Conference Grounds. The picture above shows one of the many gorgeous views you won’t see when you watch other people do PowerPoint presentations. One HP Vertica user at the conference said he saw a “range join” in a query plan, and wondered what it is and why it is so fast.

First, you need to understand what kind of queries turn into range joins. Generally, these are queries with inequality (greater than, less than, or between) predicates. For example, a map of the IPv4 address space might give details about addresses between a start and end IP for each subnet. Or, a slowly changing dimension table might, for each key, record attributes with their effective time ranges.

A rudimentary approach to handling such joins would be as follows: For each fact table row, check each dimension row to see if the range condition is true (effectively taking the Cartesian product and filtering the results). A more sophisticated, and often more efficient, approach would be to use some flavor of interval trees. However, HP Vertica uses a simpler approach based on sorting.

Basically, if the ranges don’t overlap very much (or at all), sorting the table by range allows sections of the table to be skipped (using a binary search or similar). For large tables, this can reduce the join time by orders of magnitude compared to “brute force”.

Let’s take the example of a table fact, with a column fv, which we want to join to a table dim using a BETWEEN predicate against attributes dv_start and dv_end (fv >= dv_start AND fv <= dv_end). The dim table contains the following data:

chuck3

We can choose, arbitrarily, to sort the data on dv_start. This way, we can eliminate ranges that have a dv_start that is too large to be relevant to a particular fv value. In the second figure, this is illustrated for the lookup of an fv value of 62. The left shaded red area does not need to be checked, because 62 is not greater than these dv_start values.

chuck4

Optimizing dv_end is slightly trickier, because we have no proof that the data is sorted by dv_end (in fact, in this example, it is not). However, we can keep the largest dv_end seen in the table starting from the beginning, and search based on that. In this manner, the red area on the right can be skipped, because all of these rows have a dv_end that is not greater than 62. The part in blue, between the red areas, is then scanned to look for matches.

If you managed to follow the example, you can see our approach is simple. Yet it has helped many customers in practice. The IP subnet lookup case was the first prominent one, with a 1000x speedup. But if you got lost in this example, don’t worry… the beauty of languages like SQL is there is a community of researchers and developers who figure these things out for you. So next time you see us at a conference, don’t hesitate to ask about HP Vertica features. You just might see a blog about it after.

Tech Support Series: Optimizing Projections

Welcome to another installment of our Top Tech Support Questions Answered blog series. In our first blog , we discussed ways to optimize your database for deletes. In this installment, we’ll talk about optimizing projections.

People often ask, “How can I optimize my projections for maximum query performance?” Like with many questions, the answer is “It depends.” This is because every database has a different structure and uses data in very different ways.

But fear not, we do have some general guidelines you can use to optimize your projections.

Your first step should always be to run Database Designer. The HP Vertica Database Designer creates optimized projections based on sample queries and sample data that you provide. However, you may find that you want to create your own projections as well.

If you feel you must create your own projections, focus on three characteristics:

  • Sort Order
  • Segmentation
  • Encoding

Sort Order
Choosing a good sort order can help you achieve maximum query performance. If you have queries that contain GROUP BY clauses, joins, or other predicates, it’s good practice to place the columns specified in those clauses early in the sort order. If you have no other criteria on how to sort your data, the fastest way to access the data is to first sort on the columns with the smallest number of distinct values (lowest cardinality) before the high-cardinality columns.

Segmentation
Also consider creating segmented projections on large tables to spread the query execution workload across multiple nodes. Projection segmentation also provides high availability and recovery, and optimizes query execution. Therefore, it’s important to determine which columns to use to segment a projection. For HP Vertica, hash segmentation is the preferred method of segmentation. Primary key columns that have a large number of unique data values (high cardinality) and acceptable skew in their data distribution are an excellent choice for hash segmentation.

Encoding
Database Designer implements optimum encoding for the data you provide. Likewise, when creating your own projections, make sure you specify the encoding on your projection columns to optimize query performance. With appropriate encoding, you can reduce your database footprint and improve query performance. Read more about the encoding types HP Vertica supports here.

So there you have it– three main characteristics to consider when creating your own projections. As mentioned before, all databases are different, so you may find that leveraging one approach over another is more beneficial for you. But focusing on these three things can make the whole process a little less daunting.

Stay tuned for more tech support blogs!

To learn about optimizing your projections using Database Designer, see our documentation.

New Configuration Parameter Storage and Setting Options

New Configuration Parameter Storage from Vertica Systems on Vimeo.

Set and Clear Configuration Parameters from Vertica Systems on Vimeo.

HP Vertica 7.1.x introduces a new way to store and set configuration parameters.

Previously, configuration parameter values were stored in individual vertica.conf files on each node. As of HP Vertica 7.1.0, these values are stored in the database catalog, allowing the values to stay consistent over all nodes.

Why is this better? Well, when the values were stored at the node level, it was possible that down nodes could have inconsistent values when returning to the cluster because the SET_CONFIG_PARAMETER statement only acts on up nodes. Additionally, there was no way to tell which vertica.conf file had the most up-to-date information. Now, since the values are stored in the database catalog, when the down node returns and grabs the latest catalog file, it also receives the most up-to-date parameter values.

Our new storage settings go hand-in-hand with the HP Vertica 7.1.1 syntax option that lets you SET and CLEAR parameters at the session, node, and database level. Previously, if you wanted to make node-level changes, you had to log on to an individual node hand-edit its vertica.conf file. Not only is this tedious, but if that node were to go down, you could lose those specific settings. As of 7.1.1, the new SET and CLEAR methods allow you to configure any one node’s parameter values from any other node. Use our new SET and CLEAR keywords with the ALTER statements to take advantage of this feature. With ALTER SESSION, ALTER NODE, and ALTER DATABASE, you can set and clear multiple parameters from any node.

Watch these videos to learn more about the new configuration parameter features.

To learn about when and how to set and clear configuration parameters, see our documentation.

Connecting HP Vertica to DbVisualizer



In our latest video entry, we show you how to connect the DbVis Software database tool, DbVisualizer, to your HP Vertica database. DbVisualizer is a database tool that allows you to manage and administer your database visually. To learn more about DbVisualizer, visit their webpage at dbvis.com. You can also go to vertica.com/documentation to find more HP Vertica documentation.

HP Vertica for SQL on Hadoop

HP Vertica for SQL on Hadoop from Vertica Systems on Vimeo

HP Vertica now offers a SQL on Hadoop license, which allows you to leverage Vertica’s powerful analytics engine to explore data in Hadoop Distributed File System (HDFS).

This offering is licensed per-node/per-year term with no data volume limits.

With your SQL on Hadoop license, you get access to proven and enterprise features like:

  • Database designer
  • Management console
  • Workload management
  • Flex tables
  • External tables
  • Backup functionality

See our documentation on HP Vertica SQL on Hadoop for limitations.
To learn more about other HP Vertica licenses, view our Obtaining and Installing Your HP Vertica Licenses video or contact an HP Licensing center.

Get Started With Vertica Today

Subscribe to Vertica