Tech Support Series: Optimizing for Deletes

Posted June 11, 2014 by Sarah Lemaire, Manager, Vertica Documentation


This blog is just the first in a series that addresses frequently asked tech support questions. For now, we'll talk about optimizing your database for deletion.

From time to time, you may find that recovery and query execution are slow because of a high volume of delete vectors. Performing a large number of deletes or updates can also hurt query performance and recovery because of replay deletes.

Delete replay occurs when ROS containers are merged together. The data marked for deletion in each of the ROS containers must be re-marked in the merged container once the mergeout completes. This process can hold up your ETL processes because the Tuple Mover lock (T lock) is held until the replay deletes finish.
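If your load jobs seem to stall behind a mergeout, you can check what the Tuple Mover is doing and which locks are currently held. Here is a minimal sketch using the TUPLE_MOVER_OPERATIONS and LOCKS system tables (the exact columns returned vary by Vertica version):

=> SELECT * FROM v_monitor.tuple_mover_operations;
=> SELECT * FROM v_monitor.locks;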

Luckily, optimizing your database for deletes can help speed up your processes. If you expect to perform a high number of deletes, first consider the reason for deletion. The following is a list of common reasons for high delete usage:

  • You regularly delete historical data and upload new data at specific intervals
  • You constantly update data or you want to delete data that was loaded by mistake
  • You often delete staging tables

To optimize your database for deletion, follow the suggestions that correspond to your reason for deletion.

  1. If you regularly delete historical data to make room for newer data, use partitioning to chunk data into groups that will be deleted together. For example, if you regularly delete the previous month's data, partition data by month (see the sketch after this list). When you use partitioning, you can use the DROP_PARTITION function to discard all ROS containers that contain data for that partition. This operation removes historical data quickly because no purging or replay deletes are involved.
  2. You may also want to delete a high volume of data because it was loaded by mistake or because you frequently update data (which involves frequently deleting data). In these cases, you may see a high volume of delete vectors. There are three good ways to prevent this:
    • Create delete-optimized projections by placing a high-cardinality column at the end of the sort order (see the sketch after this list). This helps the replay delete process quickly identify the rows to mark for deletion.
    • Make sure your Ancient History Mark (AHM) is advancing and close to the Last Good Epoch (LGE) or Current Epoch. You may also want to periodically use the MAKE_AHM_NOW function to advance the ancient history mark to the greatest allowable value. When a mergeout occurs, all data that is marked for deletion before the AHM will be purged, minimizing the amount of replay deletes.
    • Periodically check the number of delete vectors in your tables using the DELETE_VECTORS system table. The automatic Tuple Mover will eventually purge deleted data, but if you find that your tables have a large number of delete vectors, you can manually purge records using the PURGE_TABLE function.
  3. You may find that you frequently delete staging tables. To streamline this process, truncate the staging table with the TRUNCATE TABLE statement instead of deleting its rows (see the sketch after this list). Truncating a table discards the ROS containers that hold the data instead of creating delete vectors, so it is more efficient than deleting.
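To make these suggestions concrete, here is a minimal sketch of the DDL involved. The schema, table, column, and projection names (store.sales_fact, sales_fact_deleteopt, store.sales_staging) are made up for illustration:

=> CREATE TABLE store.sales_fact (
       sale_id   INT NOT NULL,
       sale_date DATE NOT NULL,
       amount    NUMERIC(10,2))
   PARTITION BY EXTRACT(YEAR FROM sale_date) * 100 + EXTRACT(MONTH FROM sale_date);

=> CREATE PROJECTION store.sales_fact_deleteopt AS
   SELECT sale_id, sale_date, amount
   FROM store.sales_fact
   ORDER BY sale_date, sale_id       -- high-cardinality column last in the sort order
   SEGMENTED BY HASH(sale_id) ALL NODES;

=> TRUNCATE TABLE store.sales_staging;   -- discards ROS containers, creates no delete vectors

With month-based partition keys like 201405, removing last month's data becomes a single DROP_PARTITION call (see the How to section below).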

Frequently deleting data is often a cause of slow query performance. Fortunately, you can optimize your database for deletions with these tips and avoid the headache.

How to:

Drop a partition:
=> SELECT DROP_PARTITION('table_name', partition_value);

Get epoch:
=> SELECT current_epoch, ahm_epoch, last_good_epoch FROM SYSTEM;

Set AHM to greatest allowable value:
=> SELECT MAKE_AHM_NOW();

Determine number of delete vectors:
=> SELECT * FROM v_monitor.DELETE_VECTORS;

Purge data:
=> SELECT PURGE_TABLE('table_name');