Archive for the ‘Uncategorized’ Category

Work hard, have fun and make a difference!


My name is Jaimin and I work as a Software Engineer in the Distributed Query Optimizer Team at HP Vertica. I wanted to share with you what I think makes Vertica the best place to work! I will explain the kind of impact you can make as an employee/intern at HP Vertica, while sharing my personal experiences.

As a student, I researched many companies I might want to work for to get a better understanding of the everyday life of software engineers. However, what I was most interested in learning about was the kinds of things engineers might do that went above and beyond the normal day-to-day stuff.

Is writing code something unique to the job?

No! Right?

As Software Engineers, we write code, develop algorithms, and implement them. But here at HP Vertica, we do lots of other things besides simply writing code.

Go above and beyond!

Vertica is different from other companies as far as normal day-to-day stuff goes.

Let me ask you this question: How many new graduates would you guess could get a chance to file a patent within their first 6 months of joining a company? How many would get chance to write a paper within first six months? Not a lot, right?

In my experience at HP Vertica, I’ve seen that just about all new graduate engineers file at least one patent in their first year at work. This speaks to the fact that the work we do here at Vertica is completely innovative. Our projects have a huge business impact.

Be the captain of your ship!

Vertica offers engineers incredible opportunities! All you have to do is be willing to accept them. One of the best things about HP Vertica is that you work in an environment where other engineers are smarter than you! You’ll find yourself constantly challenged to learn new, interesting, and exciting things. You’ll get better exposure and, more importantly, you have a massive role to play in the company’s growth and development.

Something else that’s unique about HP Vertica—the projects you work on as an intern become part of the shipping product! As a result, you’ll get the chance to see your code in action and sometimes you can learn what customers have to say about your feature in particular. You won’t be allowed to sit idle for a minute because we have a very short release cycle. This will keep you on your toes and encourage you to think something new day in and day out.

Here, engineers are not forced to work on this and that—they have a great deal of autonomy and frequently get to choose the things they work on. If you have an idea you think can help improve the product, you are encouraged to see it through. And, you’ll also get a chance to participate in various technical events that take place within HP and submit your ideas.

Taking initiative is always encouraged and you’ll be expected to make, discuss, and defend your design decisions with your mentors instead of just following directions. You’ll also be able to learn about the complexities of building a database and how we achieve the performance advantages in HP Vertica.

It is also easy to move between the teams. It is entirely up to you and the only question is what you want to do?

Share and gain knowledge!

Knowledge Sharing is another important thing at Vertica. We do a lunch talk where we discuss any new paper related to database systems. Every now and then people from various teams give tech talks so that each team is aware of what people in other groups are doing.

As a fresh graduate before joining Vertica, I did not have any experience working on a database optimizer product, though I had worked a bit on optimizations when I took a compiler class. Because of the great culture and environment at Vertica, I didn’t find the transition difficult at all. Sometimes it was challenging, but it allowed me to learn a lot by working with incredibly smart people at the company while working on challenging projects (I wonder how many people have the opportunity to work on the design and implementation of queries involving Set Operators during their first year of work).

Have fun!

We frequently unwind doing fun things at work, including watching the Olympics games or other sporting events during lunch, or playing table-tennis and board games when we can. Vertica provides a lot of flexibility and it comes with huge responsibility. You’re expected to get your work done on time—if you do that, no one will have any problem with having a little fun. Interns also go on outdoor field trips, including horseback riding, hiking to Blue Hills, going for a movie, participating in a bocce tournament, and water activities such as motor boat racing. Once, we went to the Boston Harbor and tried to learn how to sail a boat from one of our in-house experts in Vertica.

We are looking for people to join Vertica! Do you have any interest in being challenged in an innovative design environment? Then apply today!

How to make ROLLUP fly in Vertica?

ROLLUP screenshot

ROLLUP is a very common Online Analytic Processing (OLAP) function and is part of ANSI SQL. Many customers use ROLLUP to write reports that automatically perform sub-total aggregations across multiple dimensions at different levels in one SQL query. The following is an example:

WHERE B.DATE_KEY BETWEEN 20120301 AND 20120331

Note: ROLLUP syntax is not supported in the current version of Vertica 7.0. This
is used only as an illustration. See blog text for more details

The business question the above query intends to answer is: For the month of March, 2014, show the total sales across both departments (dept_code) and stores (store_code) to generate sub-total/total sales figures at three levels:

  1. At the detailed level of each pair of (dept_code, store_code)
  2. At the level of only (dept_code) but aggregating across all stores
  3. Finally a grand sales total aggregating across all departments and stores

Assuming performance is not an issue, clearly ROLLUP is a very powerful and useful function that can do a lot of data crunching in the background and deliver a quite complex report.

Unlike the more familiar GROUP BY SQL function in which the column order is not material, ROLLUP works by creating subtotals that “roll up” from the most detailed level to a grand total and requires an ordered list of grouping expressions to be supplied as arguments. First ROLLUP calculates the standard aggregate values specified in the standard GROUP BY (without ROLLUP) and then ROLLUP moves from right to left through the list of grouping columns, starting with the lowest level of aggregation to create progressively higher-level subtotals. This process continues until all grouping columns have been consumed and aggregations computed. The operator essentially “rolls up” its grouping operations, culminating in a grand total in the end.

If N is the number of grouping columns, ROLLUP function will create (N+1) levels of subtotals (including a final grand total). The column order in a ROLLUP operator is critical since it works by removing the left most column at each step. For example:

ROLLUP(A, B, C) creates 4 groups: (A, B, C), (A, B), (A), ().

Note there are NO groups such as (A, C) and (B, C).

Unfortunately the current version of Vertica 7.0 does not support ROLLUP. So what is the alternative or workaround for the existing Vertica customers? Will the workaround perform on a large data set? Fortunately, the answers to both questions are yes. However it does take some ingenuity in physical design and SQL re-write to make ROLLUP work and perform in the currently supported Vertica releases (version 5.1 and later).

The key considerations are to create and pre-load a local temp table which is sorted on ALL the ROLLUP columns in the same exact order, rewrite SQL at each level using GROUP BY into a sub-query and finally UNION ALL of these sub-queries together. By applying this strategy to the ROLLUP SQL posted at the beginning of this blog, we arrive at the following alternative/workaround:


/*+ DIRECT */
WHERE B.DATE_KEY BETWEEN 20140301 AND 20140331



The fact that temp table AGG_TEMP is sorted on the ROLLUP columns (dept_code, store_code) is important for performance because this means that all the GROUP BY sub-queries at individual levels will be able to exploit the highly efficient GROUPBY PIPELINED operator (explicit in the query’s explain plan) in Vertica to get the job done quickly and accurately.

A nice property of GROUPBY PIPELINED operator (in contrast to the more resource intensive GROUPBY HASH) is that its performance is largely independent of the actual table size. As a result this workaround can easily scale up to include tables with 10+B or even 100+B rows without any noticeable drop in query performance (as measured in first fetch time).

To conclude I would like to mention that there has been serious work going on in Vertica Engineering to implement ROLLUP (and many other OLAP functions) in the up-coming Vertica releases. Stay tuned!

Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.

HP Vertica 7 Installer Changes for Optimum Performance

Installing Vertica title pic

HP Vertica 7 introduces a new installer that helps you configure your system for the best possible performance. How does it do this? Well, the new installer checks your systems and lists any changes that you need to make for optimum performance. Not only that, it gives you a direct link to the documentation page that will help you change your system appropriately. Once you change your system configuration, you can rerun the install and know that your system is optimized to run HP Vertica 7.

Check out the following video to see a demonstration of how to change typical settings for default installs on RedHat or CentOS systems.

Inside the Secret world of the Workload Analyzer

WLA image

When I’m on a flight sitting next to someone, and we’re making polite conversations, often the question comes up “what do you do?” In these situations, I have to assess whether the person works in the IT industry or is otherwise familiar with the lingo. If not, my stock response is “I fix databases”. This usually warrants a polite nod, and then we both go back to sleep. This over-simplified explanation generally suffices, but in truth, it is wholly inadequate. The truth of the matter is that my job is to ensure that databases don’t get broken in the first place; more specifically – an HP Vertica database. But our clients have different, complex goals in mind, they sometimes configure their systems incorrectly for the kind of stuff they’re doing. I’m constantly looking for ways to empower clients to understand problems to look for before they become bigger problems.

That’s why I’m a huge fan of the HP Vertica Workload Analyzer (also referred to as WLA). The WLA is a great tool in the war against problems. The WLA’s goal in life (if it has one) is to find and report on problems – 20 specific ones to be exact. If you are using Management Console, the WLA runs once per day and produces a report. This report indicates actions to take, like tables that should have statistics updated, or queries that aren’t performing adequately. This report (actually just entries in a table) is reported in the Management Console in a window called “Tuning Recommendations”.

But you don’t need Management Console to take advantage of the Workload Analyzer – you can run it manually by running the following command:


Interestingly, even though it’s called “Workload Analyzer”, the actual command is ANALYZE_WORKLOAD. You could toss this into CRON and run it multiple times per day if you’d like, but it’s probably not necessary.

The output of this command comes back to your screen – it reports on all the various “problems” that it is programmed to identify and returns them. It also creates entries in two separate tables V_MONITOR.TUNING_RECOMMENDATIONS and TUNING_RECOMMENDATION_DETAILS.

As I mentioned earlier, there are 20 queries that WLA runs in order to look for problems. These queries are called rules. WLA rules can be disabled or enabled, and many of them have parameters that can be modified- increasing or decreasing the scope of the rule. These rules have clever names like sys_r1 and sys_r13. It’s ok though – WLA doesn’t allow you to run rules directly, but I’ll break down the most useful rules, and how they can be modified for maximum effect.

sys_r1: Stale and Missing Statistics

sys_r1 looks for tables that have stale or missing statistics. It recommends updating statistics on those tables. sys_r5 and sys_r6 find queries that have reported JOIN_SPILLED or GROUP BY SPILLED events. These are indications that there is no good projection design which would enable these queries to operate at maximum efficiency. These three rules generate the most typical output I see at client sites.

sys_r16 and ssys_r20: System Utilization

sys_r16 and sys_r20 are rules that check for system utilization. sys_r16 checks for a CPU utilization that exceeds a 95% threshold for more than 20 minutes. sys_r20 looks for memory utilization of 99% for more than 10 minutes.

Many of the other rules cover very specific edge-case scenarios, and are unlikely to fire in most client sites and we hope to cover those in future blogs. For now, let’s stick to this handful, and talk about how we can modify these rules, and add our own rules!

Altering Rules

What you may not know about the WLA is that you can modify certain rules. That is, you can change some parameter settings such as threshold values that the rule uses to find problems. Let’s take rules sys_r16 and sys_r20. Each looks for CPU and memory utilization in your system. The default thresholds might be too high for your liking–but that’s ok, you can adjust them.





Now this rule will fire under less extreme circumstances. Instead of looking for a CPU usage of 95% for 20 minutes, it will now report on a CPU usage of 80% for 10 minutes. All the tuning rule parameters (if applicable) are defined in another system table called VS_TUNING_RULE_PARAMETERS. You can use that table to determine which parameters are available to which rules. You can tweak them if you desire.

Furthermore, you can disable or enable rules with the following commands:

dbadmin=> ALTER_TUNING_RULE sys_r16 disable;


dbadmin=> ALTER_TUNING_RULE sys_r16 enable;


Creating your own rules

The coolest part of the WLA is that you can actually create your own rules. The first step is to write your rule. This is just a regular SQL query whose purpose it is to find a problem. The example I’m going to include here is not currently a rule, but it would make a great addition – finding too many delete vectors. This rule has a threshold defined as a percentage of delete vectors as compared to the number of rows in a table. When this percent exceeds 20%, this rule will fire.

CREATE TUNING RULE dv_rule (DELETE_VECTORS, PURGE, dv_threshold=20) AS SELECT CURRENT_TIME AS time, ‘Delete vectors account for ‘ || round(deleted_row_count/row_count, 2)*100::char(4) || ‘% of rows for projection ‘ || ps.projection_name AS observation_description

, ps.anchor_table_id AS table_id

, ps.anchor_table_name AS table_name

, s.schema_id AS table_schema_id

, ps.projection_schema AS table_schema

, NULL AS transaction_id

, NULL AS statement_id


WHERE tuning_rule = ‘dv_rule’ AND parameter = ‘dv_threshold’)

AS tuning_parameter

, ‘run purge operations on tables with delete vectors’ AS tuning_description

, ‘SELECT PURGE_PROJECTION(”’ || ps.projection_name || ”’);’ AS tuning_command

, ‘LOW’ as tuning_cost

FROM delete_vectors dv

JOIN projection_storage ps ON dv.projection_name = ps.projection_name

JOIN schemata s ON s.schema_name = dv.schema_name

WHERE ps.row_count > 0

AND deleted_row_count / row_count >

(SELECT current_value/100::numeric FROM VS_TUNING_RULE_PARAMETERS

WHERE tuning_rule = ‘dv_rule’

AND parameter = ‘dv_threshold’) ;

The first line sets up my rule. I give it a name “dv_rule” and I define some parameters. The first two are like labels. They are arbitrary. Technically, these are referred to as the “observation type” and the “tuning type.” You can see examples by looking in the TUNING_RECOMMENDATON_DETAILS table. All remaining parameters are optional — and I can optionally define tuning parameters. Here, I’ve defined my 20% threshold for delete vectors. This query follows the order of the columns in the TUNING_RECOMMENDATION_DETAILS table. If you compare the SELECT list, and the columns in that table, you’ll note the similarities.

My query actually references the VS_TUNING_RULE_PARAMETERS table in order to get my % threshold. This creates a bit of a chicken and egg problem – the query won’t work since the parameter doesn’t yet exist in this table. So, the first thing you have to do is create the rule, which also creates the parameter. Then you can modify the query as needed. If you need to drop the query, you can do so with the following command:

dbadmin=> DROP TUNING RULE dv_rule ;


Because you might go through this process several times, I recommend putting your work in to a file so you can reference it easily.

And that’s it! That’s everything you need to know in order to create and manage WLA rules. Not only can you use these rules to find problems in your HP Vertica cluster, you could use them to look for issues in your own data.

In the community forums, I’d like to continue the discussion – come up with your own rules, and post them. Who knows – the good ones we might include in a future HP Vertica release!

Optimizing Big Data Storage in HP Vertica

Man looking at server in HP EcoPod

With the explosion of data volumes all enterprises are capturing, new technological solutions, such as HP Vertica, offer a solution to non-expert users who need to analyze and monetize their Big Data. If you are a non-expert user, the Database Designer (DBD) module in HP Vertica can help you choose a physical database design that minimizes storage footprint while optimizing the performance of the input query workload. The DBD can recommend good physical designs as quickly as possible using minimal computing resources.

One key technique by which HP Vertica can improve performance is data-specific encoding and compression. The encoding optimization component automatically chooses the best techniques, saving the cost of very-scarce human experts to manually specify encodings. While very effective, this process can be very resource and time intensive, often requiring all of a cluster’s resources for many hours to produce an optimal design. However, this approach is often not cost effective because the less resource intense the optimization process is, the more it will be used and the more data people can analyze.

The HP Vertica Analytic Database was designed from the ground up to analyze massive amounts of data, using research from MIT, Brown, and Brandeis. At least three current HP Vertica customers have over 1 PB in their production databases. One of the key technologies to handle petabyte-scale data sets is sophisticated encoding and compression algorithms because they minimize storage and I/O bandwidth requirements. HP Vertica automatically chooses an optimal physical database design including encoding and compression based on schema, sample data, and query workload. The storage optimization process is extraordinarily accurate.

With storage optimization, the specific challenges to address are:

  • Identifying appropriate encoding/compression methods for each column type based on cardinality, sortedness, or both.
  • Finding the best encoding/compression method among multiple candidate choices that yields the least storage footprint without compromising retrieval performance.
  • Identifying a representative data sample for storage experiments, which requires:
    • Finding an appropriate sample size
    • Identifying an appropriate sampling technique

Data Encoding: Encoding is the process of converting data into a standard format. In HP Vertica, encoded data can be processed directly, which distinguishes encoding from compression. HP Vertica uses a number of different encoding strategies, depending on column data type, table cardinality, and sort order. The query executor in HP Vertica operates on the encoded data representation whenever possible to avoid the cost of decoding. It also passes encoded values to other operations, saving memory bandwidth. In contrast, row stores and most other column stores typically decode data elements before performing any operation. The following table presents the different encoding types along with their descriptions, emphasis on their ideal column type.

Data Compression: Compression is the process of transforming encoded data into a compact format. Compressed data cannot be directly processed; it must first be decompressed. HP Vertica uses integer packing for unencoded integers and LZO for compressible data. When compression is used extensively, it allows a column store to occupy substantially less storage than a row store. In a column store, every value stored in a column has the same data type which enables more effective encoding and compression, particularly in sorted columns. In a row store, each row contains multiple columns with different data types, resulting in a much less effective use of compression.

Storage Optimization: The best encoding type for each column is the one that achieves the best compression (least storage footprint). The DBD tries all ideal encoding candidates for each column, based on the column type and properties to find this optimal encoding. You choose a data sample, and then the DBD compares the storage footprints of the encoding candidates, skipping inappropriate choices. For example, RLE encoding is not an ideal choice for a column identified as high cardinality through statistics or primary key constraints. Thus, the DBD won’t try the RLE encoding type r such columns, as the following table shows. Also, in most cases, Database Designer won’t try a column with an unambiguous encoding choice unless it affects the storage footprint of other columns.

Encoding types Description Ideal for
AUTO(default) Automatically picks based on properties of the data itself, when insufficient examples are known. For string, Boolean and float types, Lempel-Ziv-Oberhumer based (LZO) compression is used. For integer-based types, the compression scheme is based on the delta between consecutive column values. CPU requirements are relatively small. Sorted, high cardinality columns such as primary keys. Also suitable for general-purpose applications for which no other encoding or compression scheme is applicable.
RLE (Run-length encoding)  Replaces sequences (runs) of identical values with a single pair that contains the value and number of occurrences. The storage footprint for RLE and AUTO encoding of string types is the same. Sorted, low-cardinality columns. Ideal when the run length is large, such as when low-cardinality columns are sorted.
BLOCK-DICT (Block Dictionary) Within a block, distinct values are stored in a dictionary. A list of references to that dictionary represents the data block. CPU requirements are significantly higher than for default encoding schemes. Unsorted, low-cardinality columns, such as stock prices.
BLOCKDICT-COMP (Compressed Block Dictionary) Similar to BLOCK_DICT except that dictionary indexes are entropy coded. Requires significantly more CPU time to encode and decode. Unsorted, low-cardinality columns with extremely skewed value distribution.
DELTAVAL (Delta Value) Data is recorded as a difference from the smallest value in a data block. CPU requirements are minimal, and the data never expands. Unsorted, high-cardinality integer or integer-based columns.
DELTARANGE-COMP(Compressed Delta Range) Stores each value as a delta from the previous one.  This scheme has a high cost for both compression and decompression. High-cardinality float columns that are sorted or confined to a range.
COMMONDELTA-COMP(Compressed Common Delta) Builds a dictionary of all the deltas in the block and then stores indexes into the delta dictionary using entropy coding. If the delta distribution is excellent, columns can be stored in less than one bit per row. CPU requirements are extensive. Sorted columns with predictable sequences and only occasional sequence breaks.  (For example,   you could use this encoding type for timestamps recorded at periodic intervals or primary keys).
GCD-DELTA (Greatest Common Divisor Delta Value) Data is recorded as the difference from the smallest value in the data block divided by the greatest common divisor (GCD) of all entries in the block. CPU requirements for decoding GCDDELTA are minimal and the data never expands, but may take more encoding time than DELTAVAL. Many-valued unsorted integer or integer-based columns, when the values are a multiple of a common factor. (For example, timestamps are stored internally in microseconds, so data this is only precise to the millisecond are all multiples of 1000).

Data Sampling for Storage Optimization: Implementing an appropriate sampling technique is critical for the quality of encoding/compression achieved. We experimented two different methods: (a) random sampling and (b) clustered sampling. Random sampling involves selecting a random set of rows for encoding tests. The main disadvantage is performance, as the random sample is selected by scanning every row and applying a random function.

We also noticed that using such a random sample is usually detrimental for certain encoding types. For example, with RLE, it tends to increase the apparent number of distinct values and propensity toward sequences. This increase negatively affects COMMONDELTA-COMP and a few others. To overcome these disadvantages, we implemented a clustered sampling technique. Clustered sampling selects contiguous rows from equally spaced bands in the database, based on a preselected sample size and band count. This approach improves performance because HP Vertica reads only the required amount of data. It also improves quality of sample appropriate for encoding experiments.

Determining an appropriate Sample Size for Storage Optimization: Selecting an appropriate sample size is critical for both performance and quality of the encoding/compression achieved. To decide on an appropriate sample size, we experimented with six real customer data sets. We tried different sample sizes in million increments starting from 1 million to 10 million samples. We then measured the ccompression ratio achieved, as explained by this equation:

Compression ratio = Storage Footprint (in bytes) with optimal encodings (selected using all available data)
Storage Footprint (in bytes) with approximate encodings (selected using a data sample)


The compression ratio cannot be more than 1.0 and the closer it is to 1.0, the better. We found that one million clustered samples are sufficient to achieve a compression ratio of 99.6%. We also noticed that 1 million samples are enough to find the best encodings in most columns. Other columns, however, might end up having the second or third best encoding.

Summary: The Database Designer module of HP Vertica Analytic Database automatically recommends physical designs that optimize query performance, storage footprint, fault tolerance and recovery to meet different customer scenarios and applications. In this blog, we focused on how big data storage optimization is being done using the Database Designer component of HP Vertica.

A Vertica training update: Installing Vertica and Management Console


The training team has started the process of updating and revamping out web-based training offerings, available here. Our first released rebuild is the course Installing Vertica and Management Console. Here are some of the updates that we made to the material:

  • The course covers installing both Vertica Analytics and Management Console
  • We’ve streamlines the description of the installation process, so that you don’t need to work through each step
  • We’ve improved the description of how a cluster is defined, and highlighted some of the most common configuration errors
  • We’ve included pointers to relevant sections of the documentation
  • The course is self-running, and takes less than 10 minutes to complete. You can, of course, pause and replay each page of the course.
  • The course includes voiceover descriptions and a Table of Contents for navigation, moving us away from the text-based presentation model

To start the course, click here

We hope you enjoy it, and look forward to your feedback!

Database Designer in HP Vertica 7

With the HP Vertica 7, you can use Database Designer with Management Console. As in previous releases, you can still run Database Designer from Admin Tools, but its integration with Management Console offers an additional easy-to-use method for creating a database design.

Database Designer optimizes query performance and minimizes the disk storage that the database uses. It does this by analyzing your logical schema, sample data, and, optionally, your sample queries. Then, Database Designer creates a physical schema design (a set of projections) that can be deployed automatically or manually.

Check out the following demo to get started with the new Database Designer feature in Management Console.

* When using this new feature, remember that, to create the design, you must be a DBADMIN user or have the DBUSER role assigned to you with write access to the tables in your schema.

For more information, visit

Get Started With Vertica Today

Subscribe to Vertica