When you submit a query, the Vertica query optimizer automatically assembles a query plan: a set of operations that compute the requested result. Depending on the properties of the projections defined in your database, the optimizer can choose faster and more efficient operations. It is therefore important to recognize what you can do to optimize your projections to improve query performance.
Different projection designs affect both query performance and the resources a query needs to run.
The following table describes how various projection properties either positively (+) or negatively (-) impact query performance. For example, encoding your data will positively impact network usage.
Encoding
+ When data is encoded
– After data materialization
+/– Depends on encoding type

Sort order
+ In joins and GROUP BYs
– In load and mergeout

Replication
– In load and mergeout

Segmentation
+ When loading data
+ When the table is big and data is distributed
– In joins and GROUP BYs
Projection Encoding and Compression
As a columnar database, Vertica takes advantage of data patterns by applying encoding. The goal of data encoding is to reduce storage and therefore reduce the I/O needed to load and read the data. While there are benefits to decreasing your storage footprint, encoding may increase CPU usage.
In addition to encoding, Vertica applies compression. In Vertica, the compression process transforms data into a compact form using the LZO algorithm.
Vertica can work with encoded data until the data needs to be materialized, but compressed data needs to be uncompressed before participating in a query.
Default (AUTO) encoding:
If you don't specify an encoding type, Vertica uses a default encoding (AUTO). The default encoding type depends on the data you are encoding. For example, the default encoding for TIMESTAMP columns is a compression scheme based on the delta between consecutive column values. For columns with data types such as CHAR and VARCHAR, Vertica applies LZO compression. You can view a list of data types and their default encodings in the Vertica documentation.
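You can also override AUTO per column in the projection definition. The following sketch uses a hypothetical sales table; DELTAVAL, RLE, and AUTO are standard Vertica encoding types, but the column-to-encoding assignments here are illustrative assumptions, not recommendations for your data:

```sql
-- Hypothetical projection with explicit per-column encodings.
-- DELTAVAL suits ascending integers; RLE suits low-cardinality
-- columns that sort first; AUTO lets Vertica pick the default.
CREATE PROJECTION sales_enc (
    sale_id ENCODING DELTAVAL,
    region  ENCODING RLE,
    amount  ENCODING AUTO
)
AS SELECT sale_id, region, amount
   FROM sales
   ORDER BY region, sale_id
   SEGMENTED BY HASH(sale_id) ALL NODES;
```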
Encoding with Database Designer:
You can run Database Designer to identify optimal encoding for your projections. Database Designer does this by sampling 1% of your data and identifying the column field types and cardinality. Based on these parameters, Database Designer tries different encoding options and selects the one that uses the least storage.
You can also use the DESIGNER_DESIGN_PROJECTION_ENCODINGS function to analyze encoding in specified projections, create a script to implement encoding recommendations, and deploy the recommendations.
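A minimal sketch of calling the function; the projection name and output path are placeholders, and the third and fourth arguments (deploy, reanalyze) are passed so the recommendations are written to a script rather than deployed:

```sql
-- Analyze encodings for one projection and write the recommended
-- changes to a script without deploying them; 'public.sales_enc'
-- and '/tmp/encodings.sql' are placeholders.
SELECT DESIGNER_DESIGN_PROJECTION_ENCODINGS(
    'public.sales_enc',   -- projection(s) to analyze
    '/tmp/encodings.sql', -- destination for the generated script
    FALSE,                -- deploy: only write the script
    TRUE                  -- reanalyze encodings
);
```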
Example: Encoding and performance in data loading
Helpful system tables:
PROJECTION_COLUMNS: Provides information such as encoding type, sort order, type of statistics, and the time at which column statistics were last updated.
COLUMN_STORAGE: Returns the amount of disk storage used by each column of each projection on each node.
SELECT pc.projection_name,
       pc.projection_column_name,
       pc.column_position,
       pc.encoding_type,
       cs.encodings,
       cs.compressions,
       SUM(cs.used_bytes) AS sum,
       pc.access_rank,
       pc.statistics_type,
       pc.statistics_updated_timestamp
FROM PROJECTION_COLUMNS pc
JOIN COLUMN_STORAGE cs
  ON pc.column_id = cs.column_id
 AND pc.projection_id = cs.projection_id
WHERE pc.projection_name ILIKE '<projection-name>'
GROUP BY 1,2,3,4,5,6,8,9,10
ORDER BY pc.projection_name, pc.sort_position;
-[ RECORD 1 ]----------------+------------------------------------------------
projection_name | online_sales_fact_DBD_15_seg_vertica_7_2_3_8_b0
projection_column_name | product_key
column_position | 2
encoding_type | DELTAVAL
encodings | Int_Delta
compressions | none
sum | 40112451252
access_rank | 0
statistics_type | FULL
statistics_updated_timestamp | 2016-10-17 18:44:27.255243-04
ORDER BY clause in projections
Merge joins vs. hash joins
You can improve the performance of join operations by ensuring projections are designed in a way that lets Vertica choose the faster join between the inner and outer tables without having to perform additional sort and data transfer operations.
To implement a join, Vertica uses one of two algorithms:
Pros and Cons
Merge join:
· Less memory
· Inputs must be sorted on the join column

Hash join:
· No sort requirement
· Can be faster if the inner table is small and fits in the memory assigned to the query
· More memory
· If the inner table does not fit in memory, the query fails and Vertica automatically retries it with ENABLE_JOIN_SPILL, which spills the inner hash table to disk; the resulting disk I/O makes the join costly.
The optimizer automatically chooses the most appropriate algorithm given the query and projections in a system. In a hash join algorithm, Vertica uses the smaller (inner) joined table to build an in-memory hash table on the join column. Vertica then scans the outer (larger) table and probes the hash table for matches. If you design your projections so that both join inputs are pre-sorted on the join key, the optimizer can choose the faster merge join.
If only the outer table is sorted, Vertica may perform a sorted merge join by sorting the inner table before performing the join. In some cases this approach can use less memory than hashing the whole inner table in memory.
Use the following hints to tell the optimizer what join operator to use:
/*+JType(M)*/ : To force a merge join. If join keys are not pre-sorted, Vertica will sort them before the join.
/*+JType(H)*/ : To force a hash join.
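In Vertica, the hint is placed after the JOIN keyword. The tables and columns below are hypothetical:

```sql
-- Force a merge join (Vertica sorts any unsorted input first):
SELECT s.sale_id, c.name
FROM sales s
JOIN /*+JType(M)*/ customers c ON s.customer_id = c.customer_id;

-- Force a hash join:
SELECT s.sale_id, c.name
FROM sales s
JOIN /*+JType(H)*/ customers c ON s.customer_id = c.customer_id;
```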
GROUP BY clauses in projections
PIPELINED vs. HASH
If your query contains a GROUP BY clause, Vertica computes the result with either the GROUPBY PIPELINED or GROUPBY HASH algorithm. Both algorithms compute the same result, but if your query contains a large number of groups, GROUPBY PIPELINED uses less memory and can be faster. However, this algorithm is used only when the input data is pre-sorted on the GROUP BY columns.
Pros and Cons
GROUPBY PIPELINED:
· Less memory and faster if there is a large number of groups
· Inputs must be sorted on the GROUP BY columns

GROUPBY HASH:
· No sort requirement
· Can perform better if there are just a few distinct values in the aggregations
· More memory
To improve the performance of a query that has a large number of distinct groups, use the GROUPBY PIPELINED algorithm by applying the following best practices:
- Make sure your projection's sort columns include all columns in the query's GROUP BY clause.
- If the projection's ORDER BY clause has more columns than the query's GROUP BY clause, verify that the GROUP BY columns occur first in the projection's ORDER BY clause. An exception to this is described next.
- If a query's GROUP BY columns do not appear first in the projection's ORDER BY clause, verify that any earlier projection sort columns missing from the query's GROUP BY clause appear as single-column constant equality predicates in the query's WHERE clause.
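The first practice can be sketched with a hypothetical projection whose ORDER BY matches the query's GROUP BY columns, which lets the optimizer choose GROUPBY PIPELINED:

```sql
-- Projection sorted on the GROUP BY columns (region, product_key):
CREATE PROJECTION sales_by_region (region, product_key, amount)
AS SELECT region, product_key, amount
   FROM sales
   ORDER BY region, product_key
   SEGMENTED BY HASH(product_key) ALL NODES;

-- Check the chosen algorithm in the plan; it should show
-- GROUPBY PIPELINED rather than GROUPBY HASH:
EXPLAIN SELECT region, product_key, SUM(amount)
        FROM sales
        GROUP BY region, product_key;
```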
Projection Replication vs. Segmentation
Database Designer creates projections based on data statistics and queries. To create a design that optimizes query performance, it also reviews the submitted design tables to decide whether projections should be segmented (distributed across the cluster nodes) or replicated (duplicated on all cluster nodes).
Replication:
· Small tables
· Tables with LONG VARCHAR and LONG BINARY columns
· Projections where any of the following is true
  (largest-row-count = number of rows in the table with the largest row count):
  o largest-row-count < 1,000,000 and the number of rows in the table <= 10% of largest-row-count
  o largest-row-count >= 10,000,000 and the number of rows in the table <= 1% of largest-row-count
  o The number of rows in the table <= 100,000

Segmentation:
· Large tables when deploying to a multi-node cluster
Segmentation in auto-projections
Auto-projections are superprojections that Vertica automatically generates for tables. If the table has a primary key defined, the auto-projection is segmented by the primary key. If no primary key exists, the projection is segmented by the first 32 columns of the table.
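In projection DDL, the two options above correspond to UNSEGMENTED ALL NODES (replication) and SEGMENTED BY HASH (segmentation). The dimension and fact tables in this sketch are hypothetical:

```sql
-- Small dimension table: replicate a copy on every node.
CREATE PROJECTION dim_region_rep AS
SELECT region_key, region_name
FROM dim_region
ORDER BY region_key
UNSEGMENTED ALL NODES;

-- Large fact table: distribute rows evenly across the cluster,
-- segmented on the join key so joins to the dimension stay local.
CREATE PROJECTION fact_sales_seg AS
SELECT sale_id, region_key, amount
FROM fact_sales
ORDER BY region_key
SEGMENTED BY HASH(region_key) ALL NODES;
```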
Minimizing data redistribution
Depending on your query, the Vertica optimizer may need to redistribute data at execution time. This process causes excess network traffic and requires more memory to operate.
The optimizer can redistribute data in two ways:
- Broadcasting: sends a complete copy of an intermediate result to all nodes in the cluster.
- Resegmentation: takes an existing projection or intermediate relation and resegments the data evenly across all cluster nodes. At the end of the resegmentation operation, every row from the input relation is on exactly one node.
The optimizer typically chooses broadcasting when:
· One table is very small (usually the inner table) compared to the other.
· Vertica can avoid other large upstream resegmentation operations.
· Outer join or subquery semantics require one side of the join to be replicated.

It chooses resegmentation when:
· Data in distributed joins is not already segmented for local joins.
To improve query performance when you join multiple tables, create projections that are identically segmented on the join keys or replicate the inner table. Identically segmented projections, or replicated inner tables, allow the joins to occur locally on each node, reducing data movement across the network during query processing.
You can determine if projections are identically-segmented on the query join keys by generating and viewing a query plan with EXPLAIN. If the query plan contains RESEGMENT or BROADCAST, the projections are not identically segmented.
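For example, with hypothetical fact and dimension tables, the check looks like this:

```sql
-- If the resulting plan contains RESEGMENT or BROADCAST on the join,
-- the join inputs are not identically segmented on the join key.
EXPLAIN SELECT f.sale_id, d.region_name
        FROM fact_sales f
        JOIN dim_region d ON f.region_key = d.region_key;
```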
Segmentation and Group Bys
To avoid resegmentation, verify that the GROUP BY clause contains all the segmentation columns of the projection; it can include other columns as well. When it does, each node can compute its groups locally. Otherwise, Vertica must resegment or broadcast the data across nodes before computing the aggregation.
Reducing the data to distribute with compact network
Sometimes network operations cannot be avoided. If network bandwidth is an issue, you can set the CompressNetworkData configuration parameter to 1, which compresses data before it is sent to other nodes. This compression reduces network traffic and speeds up transfers, but requires more CPU.
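One way to set the parameter, sketched with the SET_CONFIG_PARAMETER function (newer Vertica versions also accept ALTER DATABASE ... SET):

```sql
-- Enable network-data compression cluster-wide; trades CPU for bandwidth.
SELECT SET_CONFIG_PARAMETER('CompressNetworkData', 1);

-- Verify the current setting:
SELECT parameter_name, current_value
FROM CONFIGURATION_PARAMETERS
WHERE parameter_name = 'CompressNetworkData';
```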
As the number of projections tuned for specific queries increases, the performance of those queries improves. However, the disk space used and the time required to load data increase as well. Therefore, you should create and test designs to determine the optimum number of projections for your database configuration. In practice, organizations that implement query-specific projections typically reach optimal performance by adding only a few of them.
For more information about projections, see the links below:
Blog: Optimizing Projections
Core doc: Working with Projections
Hands on Vertica: Creating Tables and Projections