Specifying Locale: Long Form

Vertica supports long forms that specify the collation keyword. Vertica extends long-form processing to accept collation arguments.

Syntax

[language][_script][_country][_variant][@collation‑spec]

The following syntax options apply:

  • Locale specification strings are case insensitive. For example, en_us and EN_US, are equivalent.
  • You can substitute underscores with hyphens. For example: [-script]

Parameters

language

A two- or three-letter lowercase code for a particular language. For example, Spanish is es English is en and French is fr. The two-letter language code uses the ISO-639 standard.
_script

An optional four-letter script code that follows the language code. If specified, it should be a valid script code as listed on the Unicode ISO 15924 Registry.

_country A specific language convention within a generic language for a specific country or region. For example, French is spoken in many countries, but the currencies are different in each country. To allow for these differences among specific geographical, political, or cultural regions, locales are specified by two-letter, uppercase codes. For example, FR represents France and CA represents Canada. The two letter country code uses the ISO-3166 standard.
_variant

Differences may also appear in language conventions used within the same country. For example, the Euro currency is used in several European countries while the individual country's currency is still in circulation. To handle variations inside a language and country pair, add a third code, the variant code. The variant code is arbitrary and completely application-specific. ICU adds _EURO to its locale designations for locales that support the Euro currency. Variants can have any number of underscored key words. For example, EURO_WIN is a variant for the Euro currency on a Windows computer.

Another use of the variant code is to designate the Collation (sorting order) of a locale. For instance, the es__TRADITIONAL locale uses the traditional sorting order which is different from the default modern sorting of Spanish.

@collation‑spec

Vertica only supports the keyword collation, as follows:

@collation=collation‑type[;arg]...

Collation can specify one or more semicolon-delimited arguments, described below.

collation‑type is set to one of the following values:

  • big5han: Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese).
  • dict: For a dictionary-style ordering (such as in Sinhala).
  • direct: Hindi variant.
  • gb2312/gb2312han: Pinyin ordering for Latin, gb2312han charset ordering for CJK characters (used in Chinese).
  • phonebook: For a phonebook-style ordering (such as in German).
  • pinyin: Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin (used in Chinese).
  • reformed: Reformed collation (such as in Swedish).
  • standard: The default ordering for each language. For root it is [UCA] order; for each other locale it is the same as UCA (Unicode Collation Algorithm) ordering except for appropriate modifications to certain characters for that language. The following are additional choices for certain locales; they have effect only in certain locales.
  • stroke: Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese) not supported.
  • traditional: For a traditional-style ordering (such as in Spanish).
  • unihan: Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters (used in Chinese) not supported.
  • binary: Vertica default, providing UTF-8 octet ordering.

Notes:

  • Collations might default to root, the ICU default collation.
  • Invalid values of the collation keyword and its synonyms do not cause an error. For example, the following does not generate an error. It simply ignores the invalid value:

    => \locale en_GB@collation=xyz
    INFO 2567:  Canonical locale: 'en_GB@collation=xyz'
    Standard collation: 'LEN'
    English (United Kingdom, collation=xyz)

For more about collation options, see Unicode Locale Data Markup Language (LDML).

Collation Arguments

collation can specify one or more of the following arguments :

Parameter Short form Description
colstrength

S

Sets the default strength for comparison. This feature is locale dependent.

Set colstrength to one of the following:

  • 1 | primary: Ignores case and accents. Only primary differences are used during comparison—for example, a versus z.
  • 2 | secondary: Ignores case. Only secondary and above differences are considered for comparison—for example, different accented forms of the same base letter such as a versus \u00E4.
  • 3 | tertiary (default): Only tertiary differences and higher are considered for comparison. Tertiary comparisons are typically used to evaluate case differences—for example, Z versus z.
  • 4 | quarternary: For example, used with Hiragana.
colAlternate

A

Sets alternate handling for variable weights, as described in UCA, one of the following:

  • non-ignorable | N | D
  • shifted | S
colBackwards

F

For Latin with accents, this parameter determines which accents are sorted. It sets the comparison for the second level to be backwards.

colBackwards is automatically set for French accents.

Set colBackwards to one of the following:

  • on | O: The normal UCA algorithm is used.
  • off | X: All strings that are in Fast C or D normalization form (FCD) sort correctly, but others do not necessarily sort correctly. Set to off if the strings to be compared are in FCD.
colNormalization

N

Set to one of the following:

  • on | O: The normal UCA algorithm is used.
  • off | X: All strings that are in Fast C or D normalization form (FCD) sort correctly, but others won't necessarily sort correctly. It should only be set off if the strings to be compared are in FCD.
colCaseLevel

E

Set to one of the following:

  • on | O: A level consisting only of case characteristics is inserted in front of tertiary level. To ignore accents but take cases into account, set strength to primary and case level to on.
  • off | X: This level is omitted.
colCaseFirst

C

Set to one of the following:

  • upper | U:  Upper case sorts before lower case.
  • lower | L: Lower case sorts before upper case. This is useful for locales that have already supported ordering but require different order of cases. It affects case and tertiary levels.
  • off | short: Tertiary weights unaffected
colHiraganaQuarternary

H

Controls special treatment of Hiragana code points on quaternary level, one of the following:

  • on | O: Hiragana codepoints get lower values than all the other non-variable code points. The strength must be greater or equal than quaternary for this attribute to take effect.
  • off | X: Hiragana letters are treated normally.
colNumeric

D

If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UCD]) is sorted at a primary level with its numeric value. For example, A‑21 < A‑123.

variableTop

B

Sets the default value for the variable top. All code points with primary weights less than or equal to the variable top will be considered variable, and are affected by the alternate handling.

For example, the following command sets variableTop to be HYPHEN (u2010)

=> \locale en_US@colalternate=shifted;variabletop=u2010

Locale Processing Notes

  • Incorrect locale strings are accepted if the prefix can be resolved to a known locale version.

    For example, the following works because the language can be resolved:

    => \locale en_XX
    INFO 2567:  Canonical locale: 'en_XX'
    Standard collation: 'LEN'
    English (XX)
    

    The following does not work because the language cannot be resolved:

    => \locale xx_XX
    xx_XX: invalid locale identifier
  • POSIX-type locales such as en_US.UTF-8 work to some extent in that the encoding part "UTF-8" is ignored.
  • Vertica uses the icu4c-4_2_1 library to support basic locale/collation processing with some extensions. This does not currently meet current standards for locale processing (https://tools.ietf.org/html/rfc5646).

Examples

Specify German locale as used in Germany (de), with phonebook-style collation:

=> \locale de_DE@collation=phonebook
INFO 2567:  Canonical locale: 'de_DE@collation=phonebook'
Standard collation: 'KPHONEBOOK_LDE'
German (Germany, collation=Phonebook Sort Order)  
Deutsch (Deutschland, Sortierung=Telefonbuch-Sortierregeln)

Specify German locale as used in Germany (de), with phonebook-style collation and strength set to secondary:

=> \locale de_DE@collation=phonebook;colStrength=secondary
INFO 2567:  Canonical locale: 'de_DE@collation=phonebook'
Standard collation: 'KPHONEBOOK_LDE_S2'
German (Germany, collation=Phonebook Sort Order)  
Deutsch (Deutschland, Sortierung=Telefonbuch-Sortierregeln)