Understanding the Vertica Replay Delete Algorithms
Previously, we wrote about the delete lifecycle inVertica. This document focuses on replay deletes in Vertica, how to design projections, and configure your Vertica database to improve replay delete performance.
What is Replay Delete?
Because storage containers are reorganized as part of the Tuple Mover, recovery, refresh, and rebalance operations, the position of deleted records may change in the newly created storage container. In Vertica, a delete vector tracks the position of a deleted record in a storage container and tracks the commit epoch of when the record was deleted. The process of rebuilding delete vectors to adapt to the movement of records across storage containers is known as replay delete. Vertica has two algorithms that perform the replay delete operation.
The Traditional Algorithm
The traditional replay delete algorithm first tries to find the position of a deleted record in a new storage container by matching only the columns in the projection sort order. If the algorithm finds a unique match, then it records the position of the matching record in the storage container in a delete vector. If the algorithm does not find a unique match, then all projection columns are matched.
This replay delete algorithm is efficient when there are only a few columns in the projection sort order and these columns can uniquely identify deleted rows. However, when columns in the projection sort order have low cardinality and cannot uniquely identify deleted rows, this algorithm does not perform well. This algorithm has a time complexity of O(N^2).
The New Algorithm
Because the traditional algorithm only worked on data sets that contained a few columns with unique identifiers, it was not always the best option. The new replay delete algorithm performs a join between deleted tuples and the records from the storage container. All projection columns are used as join keys. Vertica keeps track of the position of records that are input to the join operator. The positions of the records that match the join criteria are used to build a delete vector. The new replay delete plan with a join plan operator has a time complexity of O(NlogN).
How does Vertica use the Algorithms?
When Vertica engineers tested both replay delete algorithms, they discovered that each algorithm has its strengths. The algorithm performance depends on many factors, including projection design, row count, and the number of deleted records to be replayed.
The traditional replay delete algorithm performs best when the projections had high cardinality in the sort order and the percentage of deleted records was low. The new replay delete algorithm performs better than the traditional algorithm when the projection sort order had low cardinality.
Because neither algorithm is an ideal solution for every data set, and because both algorithms performed differently based on the projection design, Vertica uses both. To begin replay delete operations, Vertica uses the traditional replay delete algorithm. When the traditional replay delete algorithm hits its time complexity O(N^2), Vertica re-plans the replay delete task with the new algorithm.
Optimize your Projection Design
To improve replay delete performance, Vertica recommends that the projection should have at least one high cardinality column in the projection sort order. The position of the high cardinality column in the projection sort order does not matter and it can be the last column.
Configuring Vertica for Replay Delete
ReplayDeleteAlgorithmSwitchThreshold is the configuration parameter Vertica uses to abort the traditional replay delete algorithm when the runtime data scan statistics exceed the threshold. Then, Vertica retries with the new join algorithm.
The default value for this parameter is 10. If you change this value to 0, Vertica will not use the traditional replay delete algorithm.
If your database has projections that are all optimized for replay delete and the percentage of deleted rows are low, leave the ReplayDeleteAlgorithmSwitchThreshold parameter at the default value of 10.
If your database has projections that are not optimized for replay delete, and the percentage of deleted records are high for large tables or the status of the projection design is unknown, you can set the ReplayDeleteAlgorithmSwitchThreshold parameter to a value of 0 or 1:
- By setting this value to 1, you lower the threshold used by the traditional replay delete algorithm; projections optimized for replay delete will succeed using the traditional algorithm and others will fail-over more quickly to the new join algorithm.
- By setting this parameter to 0, Vertica uses the new join replay delete algorithm and may save time in most cases, especially if no attention was previously paid to optimizing projections for replay delete.
For more information, see DELETE in the Vertica documentation.