Best Practices for Deleting Data

Updated April 2021

In Vertica, delete operations do not remove rows from physical storage. DELETE marks rows as deleted, as does UPDATE, which combines delete and insert operations. In both cases, Vertica retains discarded rows as historical data, which remains accessible to historical queries until it is purged.

Common reasons for deleting data include:

  • To delete data that falls outside of the retention policy at regular intervals.

  • To update and/or delete loaded data as new updates are available.

  • To delete all data from a staging table.

Different Types of Delete Operations

  Rollback Possibility Performance Use

DELETE/UPDATE

Yes

Depends on the size of table, the number of projections anchored on the table, and the number of records deleted.

Ideal deleting/updating very small percentage of records that need to be removed or changed.

SWAP_PARTITION_BETWEEN_TABLES

No

Fast catalog operation

Ideal for replacing a partition that has changed significantly. This operation require table to be partitioned.

DROP_PARTITION

No

Fast catalog operation

Ideal for deleting old data outside retention policy. This operation requires table to be partitioned.

TRUNCATE_TABLE

No

Fast catalog operation

Ideal for deleting entire table content

Using DELETE/UPDATE

The DELETE statement does not actually delete data from the disk storage. Instead, the statement creates a delete vector that records the position of the deleted record and the epoch when the delete was committed. When storage containers with deleted data are merged, deleted data is purged asynchronously by the automatic Tuple Mover.

The UPDATE statement writes two rows: one with new data and one marked for deletion. To learn more about delete statements and how Vertica handles deleted data, read about the deletes life cycle.

Managing Delete Vectors

The DELETE/UPDATE statement creates delete vectors to mark rows deleted from storage containers. Deleted rows are asynchronously purged when a storage container qualifies for a mergeout operation by the automatic Tuple Mover service. If your application is creating a high percentage of deleted records on a large table, it may impact your query performance and other database operations like node recovery and rebalancing.

The following query can help you identify projections with a high percentage of deleted records:

=> SELECT schema_name
          ,projection_name
          ,count(*) num_ros
          ,sum(total_row_count) num_rows
          ,sum(deleted_row_count) num_deld_rows
          ,sum(delete_vector_count) Num_dv
          ,(sum(deleted_row_count) / sum(total_row_count) * 100)::INT percentage_del_rows
FROM storage_containers
GROUP BY 1, 2
HAVING sum(deleted_row_count) > 0
ORDER BY 5 DESC;
schema_name | projection_name      | num_ros | num_rows | num_deld_rows | Num_dv | percentage_del_rows
------------+----------------------+---------+----------+---------------+--------+--------
store       | store_orders_fact_b1 | 60      | 200044   |  5636         | 62     |  3
store       | store_orders_fact_b0 | 60      | 200210   |  5618         | 62     |  3

If the table is partitioned, you can view the deleted rows per partition using the following query:

=> SELECT p.node_name
          ,p.Table_schema
          ,p.projection_name
          ,p.partition_key
          ,count(DISTINCT p.ros_id) num_ros -- Number of ROS containers in projection
          ,sum(p.ros_size_bytes) used_bytes -- Used bytes by the projecton
          ,sum(p.ros_row_count) num_rows -- Number of rows in projection
          ,sum(p.deleted_row_count) num_del -- Number of deleted rows in the table
          ,sum(delete_vector_count) cdv-- Number deleted vectors
          ,(sum(sc.deleted_row_count)/sum(p.ros_row_count)*100)::int percentage_del_rows -- Percentage of deleted rows per partition.
FROM partitions p inner join storage_containers sc ON ros_id = storage_oid
GROUP BY 1,2,3,4
Having sum(delete_vector_count) > 0
ORDER BY 10 DESC,2,3,4;
node_name           |Table_schema|projection_name       | partition_key |num_ros  |used_bytes|num_rows|num_del|cdv|percentage_del_rows
--------------------+------------+----------------------+---------------+---------+----------+--------+-------+---+-----------
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200511        | 1       | 244467   | 9978   | 1664  | 9 | 50
v_utn_demo_node0001 | store      | store_orders_fact_b1 | 200511        | 1       | 243471   | 9936   | 1650  | 9 | 50
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200311        | 1       | 82143    | 3354   | 84    | 1 | 3
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200407        | 1       | 82799    | 3380   | 90    | 1 | 3
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200601        | 1       | 79787    | 3258   | 82    | 1 | 3
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200703        | 1       | 83788    | 3422   | 88    | 1 | 3
v_utn_demo_node0001 | store      | store_orders_fact_b0 | 200712        | 1       | 81095    | 3316   | 92    | 1 | 3

Based on the results of the preceding queries, you can pick one of two options to manually purge deleted data if the percentage of deleted data is higher than 5 percent and the table has more than 100 million records:

  • Purge the partition: Purge the partitioned data (also called a partial purge).

  • Purge the table: Manually purge the data by rewriting the entire table.

Optimizing Projections for DELETE/UPDATE

You can optimize your projection for deletion by performing the following tasks:

  • Create a better projection design by using a high-cardinality column at the end of the sort order and including all columns used in predicates of update and delete statements.

  • Use partitioning to chunk your data into groups so that you can delete the data together with the DROP_PARTITION statement. Hierarchical partitioning is very useful in creating fine-grain partitions, which allow you to update and/or delete recently loaded data.

  • Truncate tables instead of using the DELETE statement when you want to empty a table.

For more information, see Optimizing DELETEs and UPDATEs for Performance.

Using SWAP_PARTITIONS_BETWEEN_TABLES

SWAP_PARTITIONS_BETWEEN_TABLES is an alternative way of deleting and/or updating data from a partitioned table without using the DELETE/UPDATE statement. Unlike the DELETE/UPDATE statement, this operation does not create any delete vectors.

Follow these steps to update data from a specific partition.

  1. Create an empty staging table with the same table definition as the original table.

  2. Load new/updated data for a partition into the staging table.

  3. Swap partitions between the fact table and the staging table using the SWAP_PARTITIONS_BETWEEN_TABLES function. This function runs a fast catalog operation with no data read/write.

For example, suppose you need to reload last week's data due to an error in your table, which is partitioned by month. You delete last week's data and insert new data using the following operations:

  1. Create a staging table.

    => CREATE TABLE store.staging_store_orders_fact like store.store_orders_fact including projections;
  2. Move the data from 2005-11-20 to 2005-11-27 to the staging table.

    => INSERT /*+ direct */ into store.staging_store_orders_fact select * from store.store_orders_fact where date_ordered between '2005-11-01' 
    and '2005-11-19' or date_ordered between '2005-11-28' and '2005-11-30';
  3. Swap the partition.

    => SELECT SWAP_PARTITIONS_BETWEEN_TABLES('store.staging_store_orders_fact',200511,200511,'store.store_orders_fact');
  4. Truncate the staging table so it can be re-purposed for other partitions.

Purging Deleted Data

A purge operation permanently removes historical data from physical storage and frees disk space for reuse.

Purging Partitions

Instead of purging an entire table, you can purge individual partitions. This might be useful if one partition in your table has significantly more deleted rows than the other partitions.

Tip Recall the SELECT schema_name query used for managing delete vectors. This query returns the location of deleted data. If cdv and por_del_rows are higher for a particular partition, it is likely that you can purge just that partition.

Follow these steps to purge a partition:

  1. Use the following command to advance the AHM epoch. Moving the AHM epoch makes all deletes committed prior to the AHM epoch eligible for purging.

    => SELECT GET_AHM_EPOCH(), GET_CURRENT_EPOCH();
    => SELECT MAKE_AHM_NOW();
  2. Purge the partition using the partition key.

    => SELECT PURGE_PARTITION('store.store_orders_fact',200511);

Purging Tables

If a very large table has more than 5% of its deleted data spread across multiple partitions, or if the table is not partitioned, you may need to purge the entire table.

Follow these steps to purge a table:

  1. Use the following command to advance the AHM epoch. Moving the AHM epoch makes all deletes committed prior to the AHM epoch eligible for purging.

    => SELECT GET_AHM_EPOCH(), GET_CURRENT_EPOCH();
    => SELECT MAKE_AHM_NOW();
  2. Purge the table.

    => SELECT PURGE('store.store_orders_fact');

    Or, you can purge the projection.

    => SELECT PURGE_PARTITION ('store_orders_fact_b0')