Redesigning Projections for Query Optimization

Posted November 9, 2016 by Vertica Technical Team


When you submit a query to Vertica, the Vertica query optimizer automatically assembles a query plan, which consists of a set of operations to compute the requested result. Depending on the properties of the projections defined in your database, the query optimizer can choose faster and more efficient operations. Thus, it’s important to recognize what you can do to optimize your projections to improve query performance.

Different projection designs have different impacts on query performance and on the resources a query needs to run.

The following table describes how various projection properties either positively (+) or negatively (-) impact query performance. For example, encoding your data will positively impact network usage.

| Projection property | Memory | CPU | I/O | Network | Performance |
|---|---|---|---|---|---|
| Encoding | + when data is encoded; – after data is materialized | +/– depends on encoding type | + | + | + |
| Order by | + in joins and group bys | – in load and mergeout | No impact | No impact | + |
| Segmentation | No impact | – in load and mergeout | + | + when loading data; – in joins and group bys | + when the table is big and data is distributed; – in joins |

Projection Encoding and Compression

As a columnar database, Vertica takes advantage of data patterns by applying encoding. The goal of data encoding is to reduce storage and therefore the I/O needed to load and read the data. While there are benefits to decreasing your storage footprint, encoding may increase CPU usage.

In addition to encoding, Vertica applies compression. In Vertica, the compression process transforms data into a compact form using the LZO algorithm.

Vertica can work with encoded data until the data needs to be materialized, but compressed data needs to be uncompressed before participating in a query.

Default (AUTO) encoding:

If you don't specify an encoding type, Vertica uses a default encoding (AUTO). The default encoding type depends on the data you are encoding. For example, the default encoding for TIMESTAMP columns is a compression scheme based on the delta between consecutive column values. For columns with data types such as CHAR and VARCHAR, Vertica applies LZO compression. You can view a list of data types and their default encodings in the Vertica documentation.
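If you already know your data's characteristics, you can override AUTO by declaring an encoding type directly on a column definition. The following is a minimal sketch with a hypothetical table; note that RLE typically pays off only for low-cardinality columns that appear early in the sort order:

CREATE TABLE web_events (
    event_date DATE ENCODING DELTAVAL,    -- consecutive dates differ little, so storing deltas is compact
    event_type VARCHAR(20) ENCODING RLE,  -- few distinct values; sorted runs collapse to single entries
    payload    VARCHAR(2000)              -- no ENCODING clause, so the AUTO default (LZO for VARCHAR) applies
)
ORDER BY event_type, event_date;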

Encoding with Database Designer:

You can run Database Designer to identify optimal encoding for your projections. Database Designer does this by sampling 1% of your data and identifying the column field types and cardinality. Based on these parameters, Database Designer tries different encoding options and selects the one that uses the least storage.

You can also use the DESIGNER_DESIGN_PROJECTION_ENCODINGS function to analyze encoding in specified projections, create a script to implement encoding recommendations, and deploy the recommendations.
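For instance, the following sketch writes encoding recommendations to a script without deploying them (the projection name and output path are hypothetical; the argument meanings follow the function's documented signature):

SELECT DESIGNER_DESIGN_PROJECTION_ENCODINGS(
    'public.online_sales_fact_super',      -- projection(s) to analyze
    '/tmp/encoding_recommendations.sql',   -- file to write the recommended DDL to
    FALSE,                                 -- deploy: FALSE writes the script without applying it
    FALSE                                  -- reanalyze-encodings: leave existing non-AUTO encodings unchanged
);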

Example: Encoding and performance in data loading

Helpful system tables:

PROJECTION_COLUMNS: Provides information such as encoding type, sort order, type of statistics, and the time at which column statistics were last updated.

COLUMN_STORAGE: Returns the amount of disk storage used by each column of each projection on each node.

Sample query:

SELECT pc.projection_name,
       pc.projection_column_name,
       pc.column_position,
       pc.sort_position,
       pc.encoding_type,
       cs.encodings,
       cs.compressions,
       SUM(cs.used_bytes) AS sum,
       pc.access_rank,
       pc.statistics_type,
       pc.statistics_updated_timestamp
FROM PROJECTION_COLUMNS pc, COLUMN_STORAGE cs
WHERE pc.column_id = cs.column_id
  AND pc.projection_id = cs.projection_id
  AND pc.projection_name ILIKE 'online_sales_fact%'
GROUP BY 1,2,3,4,5,6,7,9,10,11
ORDER BY pc.projection_name, pc.sort_position;

-[ RECORD 1 ]----------------+------------------------------------------------
projection_name              | online_sales_fact_DBD_15_seg_vertica_7_2_3_8_b0
projection_column_name       | product_key
column_position              | 2
sort_position                |
encoding_type                | DELTAVAL
encodings                    | Int_Delta
compressions                 | none
sum                          | 40112451252
access_rank                  | 0
statistics_type              | FULL
statistics_updated_timestamp | 2016-10-17 18:44:27.255243-04

ORDER BY clause in projections

Merge joins vs. hash joins

You can improve the performance of join operations by ensuring projections are designed in a way that lets Vertica choose the faster join between the inner and outer tables without having to perform additional sort and data transfer operations.

To implement a join, Vertica uses one of two algorithms:

  • Merge join
  • Hash join

Pros and Cons

|            | Pro | Con |
|------------|-----|-----|
| Merge join | Faster; uses less memory | Inputs must be sorted on the join columns |
| Hash join  | No sort requirement; can be faster if the inner table is small and fits in the memory assigned to the query | Uses more memory; if the inner table does not fit in memory, the query fails and Vertica automatically retries it with ENABLE_JOIN_SPILL, which spills the inner hash table to disk, making the join costly due to the added disk I/O |

The optimizer automatically chooses the most appropriate algorithm given the query and projections in a system. In a hash join algorithm, Vertica uses the smaller (inner) joined table to build an in-memory hash table on the join column. Vertica then scans the outer (larger) table and probes the hash table for matches. If you design your projections so that both join inputs are pre-sorted on the join key, the optimizer can choose the faster merge join.

If only the outer table is sorted, Vertica may perform a sorted merge join by sorting the inner table before performing the join. In some cases this approach can use less memory than hashing the whole inner table in memory.
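For illustration, here is a sketch (table, projection, and column names are hypothetical) of two projections pre-sorted on a shared join key, which lets the optimizer pick a merge join:

CREATE PROJECTION store_orders_by_cust AS
    SELECT * FROM store_orders
    ORDER BY customer_key;   -- sorted on the join key

CREATE PROJECTION customers_by_cust AS
    SELECT * FROM customers
    ORDER BY customer_key;   -- same sort key on the other side of the join

-- With both inputs pre-sorted on customer_key, the plan should show a merge join:
EXPLAIN SELECT c.customer_name, o.order_total
FROM store_orders o JOIN customers c ON o.customer_key = c.customer_key;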

Hints

Use the following hints to tell the optimizer what join operator to use:

/*+JType(M)*/ : To force a merge join. If join keys are not pre-sorted, Vertica will sort them before the join.

/*+JType(H)*/ : To force a hash join.
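For example, the following hypothetical query forces a hash join. In recent Vertica documentation the join-type hint appears inline in the join clause alongside the SYNTACTIC_JOIN hint; exact placement may vary by version:

SELECT /*+SYNTACTIC_JOIN*/ c.customer_name, SUM(o.order_total) AS total
FROM store_orders o
JOIN /*+JTYPE(H)*/ customers c ON o.customer_key = c.customer_key
GROUP BY c.customer_name;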

GROUP BY clauses in projections

PIPELINED vs. HASH

If your query contains a GROUP BY clause, Vertica computes the result with either the GROUPBY PIPELINED or GROUPBY HASH algorithm. Both algorithms compute the same result, but if your query contains a large number of groups, GROUPBY PIPELINED uses less memory and can be faster. However, this algorithm is used only when the input data is pre-sorted on the GROUP BY columns.

Pros and Cons

|           | Pro | Con |
|-----------|-----|-----|
| PIPELINED | Uses less memory and is faster when there are many groups | Input must be sorted on the GROUP BY columns |
| HASH      | No sort requirement; can perform better when the aggregation has only a few distinct values | Uses more memory |

To improve the performance of a query that has a large number of distinct groups, get Vertica to use the GROUPBY PIPELINED algorithm by applying the following best practices (a sketch follows the list):

  • Make sure your projection's sort columns include all columns in the query's GROUP BY clause.
  • If the projection's ORDER BY clause has more columns than the query's GROUP BY clause, verify that the GROUP BY columns occur first in the projection's ORDER BY clause. An exception to this is described next.
  • If a query's GROUP BY columns do not appear first in the projection's ORDER BY clause, verify that any earlier projection sort columns missing from the query's GROUP BY clause appear as single-column constant equality predicates in the query's WHERE clause.
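Here is a minimal sketch of that last case, using hypothetical names. The projection sorts on region before product_key, so a query grouping only by product_key can still get GROUPBY PIPELINED by pinning region to a constant:

CREATE PROJECTION sales_by_region AS
    SELECT * FROM sales
    ORDER BY region, product_key;   -- region precedes the GROUP BY column in the sort order

-- With region fixed by an equality predicate, the scanned rows arrive sorted on
-- product_key, so the optimizer can use GROUPBY PIPELINED:
SELECT product_key, SUM(amount) AS total
FROM sales
WHERE region = 'East'
GROUP BY product_key;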

Projection Replication vs. Segmentation

Database Designer creates projections based on data statistics and queries. To create a design that optimizes query performance, it also reviews the submitted design tables to decide whether projections should be segmented (distributed across the cluster nodes) or replicated (duplicated on all cluster nodes).

Replication is used for:

  • Small tables
  • LONG VARCHAR and LONG BINARY columns
  • Projections where any of the following is true (largest-row-count = the number of rows in the table with the largest row count):
      o largest-row-count < 1,000,000 and the number of rows in the table <= 10% of largest-row-count
      o largest-row-count >= 10,000,000 and the number of rows in the table <= 1% of largest-row-count
      o the number of rows in the table <= 100,000

Segmentation is used for:

  • Large tables when deploying to a multi-node cluster

Segmentation in auto-projections

Auto-projections are superprojections that Vertica automatically generates for tables. If the table has a primary key defined, the auto-projection is segmented by the primary key; otherwise, it is segmented by the first 32 columns of the table.
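To see how a table's projections were segmented, you can query the PROJECTIONS system table. A sketch with a hypothetical table name:

SELECT projection_name, is_segmented, segment_expression
FROM v_catalog.projections
WHERE anchor_table_name = 'store_orders';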

Minimizing data redistribution

Depending on your query, the Vertica optimizer may need to redistribute data at execution time. This process causes excess network traffic and requires more memory to operate.

The optimizer can redistribute data in two ways:

  • Broadcasting: sends a complete copy of an intermediate result to all nodes in the cluster.
  • Resegmentation: takes an existing projection or intermediate relation and resegments the data evenly across all cluster nodes. At the end of the resegmentation operation, every row from the input relation is on exactly one node.

In Joins

Broadcast is used when:

  • One table is very small (usually the inner table) compared to the other.
  • Vertica can avoid other large upstream resegmentation operations.
  • Outer join or subquery semantics require one side of the join to be replicated.

Resegmentation is used when:

  • Data in distributed joins is not already segmented for local joins.

To improve query performance when you join multiple tables, create projections that are identically segmented on the join keys or replicate the inner table. Identically segmented projections, or replicated inner tables, allow the joins to occur locally on each node, reducing data movement across the network during query processing.

You can determine if projections are identically-segmented on the query join keys by generating and viewing a query plan with EXPLAIN. If the query plan contains RESEGMENT or BROADCAST, the projections are not identically segmented.
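For example, here is a sketch (hypothetical tables) of identically segmented projections that let the join run locally on each node:

CREATE PROJECTION store_orders_seg AS
    SELECT * FROM store_orders
    SEGMENTED BY HASH(customer_key) ALL NODES;   -- segmented on the join key

CREATE PROJECTION customers_seg AS
    SELECT * FROM customers
    SEGMENTED BY HASH(customer_key) ALL NODES;   -- same segmentation expression

-- If the plan contains neither RESEGMENT nor BROADCAST, the join is local:
EXPLAIN SELECT c.customer_name, o.order_total
FROM store_orders o JOIN customers c ON o.customer_key = c.customer_key;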


Segmentation and Group Bys

To avoid resegmentation, verify that the query's GROUP BY clause contains all of the projection's segmentation columns; it can include other columns as well. When it does, all rows of any given group reside on a single node, so each node can compute its groups locally. Otherwise, Vertica must resegment or broadcast the data across the nodes before it can complete the aggregation.
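Continuing the hypothetical sketch above, store_orders_seg is segmented by HASH(customer_key), so a GROUP BY that includes customer_key requires no redistribution:

-- Every row for a given customer_key lives on exactly one node,
-- so each node computes its groups locally:
SELECT customer_key, COUNT(*) AS orders, SUM(order_total) AS total
FROM store_orders
GROUP BY customer_key;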


Reducing the data to distribute with network compression

Sometimes network operations cannot be avoided. If network bandwidth is an issue, you can set the CompressNetworkData configuration parameter to 1, which compresses data before it is sent to other nodes. Compression reduces the volume of network traffic at the cost of more CPU.
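For example, one way to enable it database-wide (newer Vertica versions also support setting the parameter with ALTER DATABASE):

-- 1 = compress data before sending it across the network; 0 = off (the default)
SELECT SET_CONFIG_PARAMETER('CompressNetworkData', 1);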


Projection usage

As the number of projections tuned for specific queries increases, the performance of those queries improves. However, the amount of disk space used and the time required to load data increase as well. Therefore, you should create and test designs to determine the optimum number of projections for your database configuration. In practice, most organizations achieve optimal performance by adding only a few query-specific projections.


Learn more

For more information about projections, see the links below:


Blog: Optimizing Projections

Core doc: Working with Projections

Hands on Vertica: Creating Tables and Projections