Another Way to De-Duplicate Table Rows: Quick Tip

Posted February 8, 2019 by Jim Knicely, Big Data Solutions Architect, Vertica

Three 3D arrows, different colors pointing in different directions
To remove duplicate table rows it is typically better (i.e. faster) to create a temporary table that contains all of the distinct rows from a table, drop the original table, then rename the temp table to the original table’s name.

Example: dbadmin=> SELECT * FROM dups; c1 | c2 ----+---- 1 | A 1 | A 1 | A 2 | B 3 | C 3 | C 4 | D (7 rows) dbadmin=> CREATE TABLE dups_new LIKE dups INCLUDING PROJECTIONS; CREATE TABLE dbadmin=> INSERT /*+ DIRECT */ INTO dups_new SELECT DISTINCT * FROM dups; OUTPUT -------- 4 (1 row) dbadmin=> DROP TABLE dups; DROP TABLE dbadmin=> ALTER TABLE dups_new RENAME TO dups; ALTER TABLE dbadmin=> SELECT * FROM dups; c1 | c2 ----+---- 1 | A 2 | B 3 | C 4 | D (4 rows) The issue with that solution is that you’ll need to be sure that the original table grants are restored if they exist.

For smaller tables that have duplicate rows, here is another method to remove them that doesn’t involve creating a new table. dbadmin=> SELECT * FROM dups2; c1 | c2 ----+---- 1 | A 1 | A 1 | A 2 | B 3 | C 3 | C 4 | D (7 rows) dbadmin=> INSERT /*+ DIRECT */ INTO dups2 SELECT DISTINCT c1, c2 FROM dups; OUTPUT -------- 4 (1 row) dbadmin=> DELETE /*+ DIRECT */ FROM dups2 WHERE epoch IS NOT NULL; OUTPUT -------- 7 (1 row) dbadmin=> SELECT * FROM dups2; c1 | c2 ----+---- 1 | A 2 | B 3 | C 4 | D (4 rows) dbadmin=> COMMIT; COMMIT This method works because the hidden table EPOCH column is NULL for each row inserted until you issue a COMMIT statement.

Helpful Link:

https://www.vertica.com/kb/Understanding-Vertica-Epochs/Content/BestPractices/Understanding-Vertica-Epochs.htm

Have fun!