|Share this article:|
Best Practices for Projection Optimization
Vertica stores data in projections in a format that optimizes query execution. Similar to materialized views, projections store result sets on disk rather than compute them each time they are used in a query. Vertica automatically refreshes these result sets with updated or new data.
At times you may need to refresh a projection to ensure the latest table data exists in the projection. For example, when you create a projection from an existing table the projection won’t be updated with the latest data until you perform a refresh.
Refreshing a projection also advances your Ancient History Mark. This shortens the gap between your AHM and current epoch, thereby reducing the amount of memory used in the catalog.
This document provides some best practices tips on optimizing projections to ensure you receive the best performance when using and refreshing projections.
This document includes:
- How Vertica Works with Projections
- Helpful Tips
- Monitoring Projection Refreshes
- Other Projection Tasks
How Vertica Works with Projections
Projections store data in a format that optimizes query execution. Similar to materialized views, projections store result sets on disk rather than calculate them each time they are queried. You can refresh projections with new or updated data.
Projections provide the following benefits:
- Compress and encode data to reduce storage space.
- Simplify distribution across the database cluster.
- Provide high availability and recovery.
To minimize some of the risks you may experience when using and refreshing projections, note the following:
- Ensure you have enough temp and disk space for a projection to refresh successfully.
- Create the projection with a logical sort order.
- Use segmented projections on large tables.
- Specify the proper encoding when creating the projection.
If you do not have enough disk space, your projection may take a long time to refresh. Typically, having disk space that is 2 to 3 times the size of the projection alleviates any refresh issues.
Other space issues that may affect your projection performance are:
- Not enough space in TEMP space.
- Not enough space because the table is too large and projection refresh duplicates the data.
If you experience issues due to low disk space, you must perform a rollback, which ends the current transaction and discards all changes that occurred during the transaction. For more information see Creating and Rolling Back Transactions.
You can also monitor disk space using the following system tables:
Logical Sort Order
Selecting a logical sort order can help you achieve maximum query performance. If you have queries that contain GROUP BY clauses, joins, or other predicates, it is a good practice to place the columns specified in those clauses early in the sort order. If you have no other criteria on how to sort your data, it is more efficient to first sort on the columns with the smallest number of distinct values (lowest cardinality) before the high-cardinality columns.
Sorting first by the column of fewest distinct values also provides the benefit of reducing disk space because the column gets compressed more efficiently.
For more information see Choosing Sort Order: Best Practices.
At times it is helpful to create segmented projections on large tables to spread the query execution workload across multiple nodes. Projection segmentation achieves the following goals:
- Ensures high availability and recovery when used with a K-safe value greater than zero.
- Spreads the query execution workload across multiple nodes.
- Allows each node to be optimized for different query workloads.
It is important to determine which columns to use to segment a projection.
Hash segmentation is the preferred segmentation method. Primary key columns that have a large number of unique data values (high cardinality) and acceptable skew in their data distribution are an excellent choice for hash segmentation.
For more information see:
Database Designer implements optimal encoding for your data. When creating projections, make sure you specify the encoding on your projection columns to optimize query performance. With appropriate encoding, you can reduce your database footprint and improve query performance.
For more information see:
Use the following system tables to monitor your projection status and performance:
- PROJECTIONS: Lists projection information
- PROJECTION_CHECKPOINT_EPOCHS: Provides details on when checkpoint epochs are refreshed.
- PROJECTION COLUMNS: Provides information about projection columns, such as encoding type, sort order, type of statistics, and the time at which columns statistics were last updated.
- PROJECTION_DELETE_CONCERNS: Lists projections whose design may cause performance issues when deleting data.
Other Projection Tasks
When you create a projection for a table that already contains data, Vertica does not automatically load that data into the new projection. You must perform a refresh on the projection, otherwise the new projection cannot participate in executed queries on its anchor table.
Preparing projections for merging begins with good projection design (see How to Create a Design). This makes it easier for the Vertica query optimizer to select the best projections when performing a merge.
Good projection design strategy provides projections that help the query optimizer avoid extra sort and data transfer operations, and facilitate MERGE performance.
Making Projections K-Safe
K-safety sets the fault tolerance in your Vertica database cluster. The value K represents the number of replicated data in the database cluster. These replicas allow other nodes to take over query processing for any failed nodes.
Implementing K-Safety in projections differs depending on whether you are using a segmented or unsegmented projections.
For segmented projections in a K-safe database, Vertica creates multiple buddies for this projection and distributes them on different nodes across the cluster.
You may choose to not perform segmentation on dimension tables since they are relatively small. In this case, design a K-safe database so projections for its dimension tables are replicated without segmentation on all cluster nodes.
You set the K-Safety value when creating a projection.
For more information on the topics discussed here, see the following: