Handling Non-UTF-8 Input
Vertica supports loading data files in the Unicode UTF-8 format. You can load ASCII data, which is UTF-8 compatible. Character sets like ISO 8859-1 (Latin1) are incompatible with UTF-8 and are not directly supported.
If you have data that does not meet the UTF-8 standard, you can modify the data during the load or you can transform the data files before loading.
Checking Data Format
Before loading data from text files, you can use several Linux tools to ensure that your data is in UTF-8 format. The
file command reports the encoding of any text files. For example:
$ file Date_Dimension.tbl Date_Dimension.tbl: ASCII text
file command could indicate ASCII text even though the file contains multibyte characters.
To check for multibyte characters in an ASCII file, use the
wc command. For example:
$ wc Date_Dimension.tbl 1828 5484 221822 Date_Dimension.tbl
wc command returns an error such as
Invalid or incomplete multibyte or wide character, the data file is using an incompatible character set.
This example shows two files that are not UTF-8 data files:
$ file data* data1.txt: Little-endian UTF-16 Unicode text data2.txt: ISO-8859 text
The results indicate that neither of the files is in UTF-8 format.
You can remove or replace non-UTF-8 characters in text data during the load. The MAKEUTF8 function removes such characters by default, or you can specify a replacement string.
The following example shows how to use this function during a load. The original data is loaded into the
orig_name column, and the transformed data is loaded into the
name column. Typically you would use a FILLER column for the original value instead of adding the column to the table definition; this example adds the column to show the differences side by side.
=> CREATE TABLE people (orig_name VARCHAR, name VARCHAR); CREATE TABLE => COPY people (orig_name, name AS MAKEUTF8(orig_name)) FROM ...; Rows Loaded ------------- 8 (1 row) => SELECT * FROM people; orig_name | name ----------+-------- Dáithí | Dith Fíona | Fona Móirín | Mirn Róisín | Risn Séamus | Samus Séan | San Tiarnán | Tiarnn Áine | ine (8 rows)
For general information about transforming data, see Transforming Data During Loads.
Converting Files Before Loading Data
To convert files before loading them into Vertica, use the
iconv UNIX command. For example, to convert the
data2.txt file from the previous example, use the
iconv command as follows:
$ iconv -f ISO88599 -t utf-8 data2.txt > data2-utf8.txt
See the man pages for
iconv for more information.
Checking UTF-8 Compliance After Loading Data
After loading data, use the ISUTF8 function to verify that all of the string-based data in the table is in UTF-8 format. For example, if you loaded data into a table named
people that has a VARCHAR column named
name, you can use this statement to verify that all of the strings are UTF-8 encoded:
=> SELECT name FROM people WHERE NOT ISUTF8(name);
If all of the strings are in UTF-8 format, the query should not return any rows.