Handling Non-UTF-8 Input

Vertica supports loading data files in the Unicode UTF-8 format. You can load ASCII data, which is UTF-8 compatible. Character sets like ISO 8859-1 (Latin1) are incompatible with UTF-8 and are not directly supported.

If you have data that does not meet the UTF-8 standard, you can modify the data during the load or you can transform the data files before loading.

Checking Data Format

Before loading data from text files, you can use several Linux tools to ensure that your data is in UTF-8 format. The file command reports the encoding of any text files. For example:

$ file Date_Dimension.tbl
Date_Dimension.tbl: ASCII text

The file command could indicate ASCII text even though the file contains multibyte characters.

To check for multibyte characters in an ASCII file, use the wc command. For example:

$ wc Date_Dimension.tbl
  1828   5484 221822 Date_Dimension.tbl

If the wc command returns an error such as Invalid or incomplete multibyte or wide character, the data file is using an incompatible character set.

This example shows two files that are not UTF-8 data files:

$ file data*
data1.txt: Little-endian UTF-16 Unicode text 
data2.txt: ISO-8859 text 

The results indicate that neither of the files is in UTF-8 format.

Converting Data While Loading

You can remove or replace non-UTF-8 characters in text data during the load. The MAKEUTF8 function removes such characters by default, or you can specify a replacement string.

The following example shows how to use this function during a load. The original data is loaded into the orig_name column, and the transformed data is loaded into the name column. Typically you would use a FILLER column for the original value instead of adding the column to the table definition; this example adds the column to show the differences side by side.

=> CREATE TABLE people (orig_name VARCHAR, name VARCHAR);
CREATE TABLE
				
=> COPY people (orig_name, name AS MAKEUTF8(orig_name)) FROM ...;
 Rows Loaded 
-------------
           8
(1 row)

=> SELECT * FROM people;
orig_name |  name  
----------+--------
  Dáithí    | Dith
  Fíona     | Fona
  Móirín     | Mirn
  Róisín     | Risn
  Séamus    | Samus
  Séan      | San
  Tiarnán  | Tiarnn
  Áine     | ine
(8 rows)

For general information about transforming data, see Transforming Data During Loads.

Converting Files Before Loading Data

To convert files before loading them into Vertica, use the iconv UNIX command. For example, to convert the data2.txt file from the previous example, use the iconv command as follows:

$ iconv -f ISO88599 -t utf-8 data2.txt > data2-utf8.txt

See the man pages for file and iconv for more information.

Checking UTF-8 Compliance After Loading Data

After loading data, use the ISUTF8 function to verify that all of the string-based data in the table is in UTF-8 format. For example, if you loaded data into a table named people that has a VARCHAR column named name, you can use this statement to verify that all of the strings are UTF-8 encoded:

=> SELECT name FROM people WHERE NOT ISUTF8(name);

If all of the strings are in UTF-8 format, the query should not return any rows.