Locale and UTF-8 Support

Vertica supports Unicode Transformation Format-8 (UTF-8), where 8 refers to the 8-bit code units the encoding uses. UTF-8 is a variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It can represent any character in the Unicode standard. The byte codes and character assignments at the start of the UTF-8 range coincide with ASCII, so software that handles ASCII needs little or no change to handle UTF-8, provided it preserves other byte values.
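The variable-length property and the ASCII overlap can be checked with a few lines of Python. This is only an illustrative sketch using Python's built-in str.encode; it does not involve Vertica, and the sample characters are arbitrary choices.

```python
# Illustrative only: uses Python's built-in codecs, not Vertica.
# Sample characters chosen to need 1, 2, 3, and 4 bytes in UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "Vertica"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

print(byte_lengths)
print(same_bytes)
```

Characters outside the ASCII range take two to four bytes, which is why character counts and byte counts can differ for the same string.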

Vertica database servers expect to receive all data in UTF-8, and Vertica outputs all data in UTF-8. The ODBC API operates on data in UCS-2 on Windows systems, and normally UTF-8 on Linux systems. The JDBC and ADO.NET APIs operate on data in UTF-16. Client drivers automatically convert data to and from UTF-8 when sending data to and receiving data from Vertica through API calls. The drivers do not transform data loaded by executing a COPY or COPY LOCAL statement.
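The kind of transcoding described above, and the reason load files need to be UTF-8 already, can be sketched in Python. This is not the drivers' actual implementation; the helper name is_valid_utf8 and the sample data are hypothetical, and a check like this is only one possible way to inspect a file before loading it with COPY.

```python
# Illustrative only: mimics a UTF-16 <-> UTF-8 conversion with Python codecs.
text = "na\u00efve \u20ac5"                 # sample string with non-ASCII characters

utf16_bytes = text.encode("utf-16-le")      # how a UTF-16 API might hold the string
utf8_bytes = utf16_bytes.decode("utf-16-le").encode("utf-8")  # transcode to UTF-8
round_trip = utf8_bytes.decode("utf-8")     # decode again to confirm nothing was lost


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: report whether raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Latin-1 bytes for the same text are not valid UTF-8, so a file in that
# encoding would need converting before a load that bypasses the drivers.
latin1_bytes = "na\u00efve".encode("latin-1")

print(len(utf16_bytes), len(utf8_bytes))
print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))
```

Because COPY and COPY LOCAL bypass the driver conversion, any re-encoding of this sort would have to happen before the load.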

UTF-8 String Functions

The following string functions treat VARCHAR arguments as UTF-8 strings (when USING OCTETS is not specified) regardless of locale setting.

String function   Description
LOWER             Returns a VARCHAR value containing the argument converted to lowercase letters.
UPPER             Returns a VARCHAR value containing the argument converted to uppercase letters.
INITCAP           Capitalizes the first letter of each alphanumeric word and puts the rest in lowercase.
INSTR             Searches string for substring and returns an integer indicating the position of the character in string that is the first character of this occurrence.
SPLIT_PART        Splits string on the delimiter and returns the string at the location of the beginning of the given field (counting from one).
POSITION          Returns an integer value representing the character location of a specified substring within a string (counting from one).
STRPOS            Returns an integer value representing the character location of a specified substring within a string (counting from one).
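The difference between counting in characters and counting in octets can be emulated in Python. The sketch below only imitates the one-based position semantics described in the table; the function names position_chars and position_octets are hypothetical, returning 0 when the substring is absent is an assumption of this sketch, and none of this is Vertica's implementation.

```python
# Illustrative only: emulates one-based substring positions in Python.

def position_chars(substring: str, string: str) -> int:
    """One-based position counted in characters; 0 if not found (assumed)."""
    return string.find(substring) + 1


def position_octets(substring: str, string: str) -> int:
    """One-based position counted in UTF-8 bytes; 0 if not found (assumed)."""
    return string.encode("utf-8").find(substring.encode("utf-8")) + 1


s = "stra\u00dfe"  # "strasse" spelled with a sharp s, which is 2 bytes in UTF-8

char_pos = position_chars("e", s)     # counts the sharp s as one character
octet_pos = position_octets("e", s)   # counts the sharp s as two bytes

print(char_pos, octet_pos)
```

For pure ASCII input the two results coincide; they diverge only after the first multibyte character in the string.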