Checking Data Format Before or After Loading
Vertica supports loading data files in the Unicode UTF-8 format. You can load ASCII data, which is UTF-8 compatible. Character sets like ISO 8859-1 (Latin1) are incompatible with UTF-8 and are not supported.
Before loading data from text files, you can use several Linux tools to ensure that your data is in UTF-8 format. The file
command reports the encoding of any text files. For example:
$ file Date_Dimension.tbl
Date_Dimension.tbl: ASCII text
The file
command could indicate ASCII TEXT even though the file contains multibyte characters.
To check for multibyte characters in an ASCII file, use the wc
command. For example:
$ wc Date_Dimension.tbl
1828 5484 221822 Date_Dimension.tbl
If the wc
command returns an error such as Invalid or incomplete multibyte or wide character
, the data file is using an incompatible character set.
This example describes files that are not UTF-8 data files. Two text files have filenames starting with the string data
. To check their format, use the file
command as follows:
$file data*
data1.txt: Little-endian UTF-16 Unicode text data2.txt: ISO-8859 text
The results indicate that neither of the files is in UTF-8 format.
Converting Files Before Loading Data
To convert files before loading them into Vertica, use the iconv
UNIX command. For example, to convert the data2.txt
file from the previous example, use the iconv
command as follows:
iconv -f ISO88599 -t utf-8 data2.txt > data2-utf8.txt
See the man pages for file
and iconv
for more information.
Checking UTF-8 Compliance After Loading Data
After loading data, use the ISUTF8 function to verify that all of the string-based data in the table is in UTF-8 format. For example, if you loaded data into a table named nametable
that has a VARCHAR column named name
, you can use this statement to verify that all of the strings are UTF-8 encoded:
=> SELECT name FROM nametable WHERE ISUTF8(name) = FALSE;
If all of the strings are in UTF-8 format, the query should not return any rows.