Quick Tip: Replacing non-UTF-8 Characters

Posted July 16, 2021 by James Knicely, Vertica Field Chief Technologist

Vertica database servers expect to receive all data in UTF-8 and Vertica outputs all data in UTF-8. However, you can load or insert data into Vertica that is non UTF-8, but you’ll want to clean it up.

I used to recommend the REGEXP_REPLACE function for that task, but now there is a better way!

Vertica 10.1.x introduces the MAKEUTF8 built-in function that coerces a string to UTF-8 by removing or replacing non-UTF-8 characters.

The old way of removing non-UTF-8 characters:


dbadmin=> \d nonutf_test;
                                       List of Fields by Tables
 Schema |    Table    | Column |     Type     | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------------+--------+--------------+------+---------+----------+-------------+-------------
 public | nonutf_test | c      | varchar(100) |  100 |         | f        | f           |
(1 row)

dbadmin=> INSERT INTO nonutf_test SELECT E'\xa0' || 'Test!' || E'\xa0';
 OUTPUT
--------
      1
(1 row)

dbadmin=> SELECT c, REGEXP_REPLACE(c, '[^\t\r\n\x20-\x7E]+', '', 1, 0, 'b') FROM nonutf_test ;
    c    | REGEXP_REPLACE
---------+----------------
 ▒Test!▒ | Test!
(1 row)

dbadmin=> SELECT * FROM nonutf_test;
    c
---------
 ▒Test!▒
(1 row)

dbadmin=> UPDATE /*+ DIRECT */ nonutf_test SET c = REGEXP_REPLACE(c, '[^\t\r\n\x20-\x7E]+', '', 1, 0, 'b');
 OUTPUT
--------
      1
(1 row)

dbadmin=> SELECT * FROM nonutf_test;
   c
-------
 Test!
(1 row)

Now the new and improved way:


verticademos=> TRUNCATE TABLE nonutf_test;
TRUNCATE TABLE

verticademos=> INSERT INTO nonutf_test SELECT E'\xa0' || 'Test!' || E'\xa0';
 OUTPUT
--------
      1
(1 row)

verticademos=> SELECT c, makeutf8(c) FROM nonutf_test;
    c    | makeutf8
---------+----------
 ▒Test!▒ | Test!
(1 row)

verticademos=> SELECT c, makeutf8(c) "Removed non-UTF8 Junk!" FROM nonutf_test;
    c    | Removed non-UTF8 Junk!
---------+------------------------
 ▒Test!▒ | Test!
(1 row)

verticademos=> UPDATE nonutf_test SET c = makeutf8(c);
 OUTPUT
--------
      1
(1 row)

verticademos=> SELECT * FROM nonutf_test;
   c
-------
 Test!
(1 row)

Have Fun!