Vertica Tokenizers
The Vertica Analytics Platform provides the following pre-configured tokenizers:
Name | Description |
---|---|
public.FlexTokenizer(long varbinary) | Splits the values in your Flex Table by white space. |
v_txtindex.StringTokenizer(long varchar) | Splits the string into words by splitting on white space. |
v_txtindex.AdvancedLogTokenizer | Uses the default values for all tokenizer parameters. For more information, see Advanced Log Tokenizer. |
v_txtindex.BasicLogTokenizer | Uses the default values for all tokenizer parameters except minorseparator, which is set to an empty list. For more information, see Basic Log Tokenizer. |
v_txtindex.WhitespaceLogTokenizer | Uses the default values for all tokenizer parameters except majorseparators, which is set to E' \t\n\f\r', and minorseparator, which is set to an empty list. For more information, see Whitespace Log Tokenizer. |
Vertica also provides the following tokenizer, which is not pre-configured:
Name | Description |
---|---|
v_txtindex.ICUTokenizer | Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see ICU Tokenizer. |
Examples
The following examples show how you can use a pre-configured tokenizer when creating a text index.
Use the StringTokenizer to create an index from the top_100 table:
=> CREATE TEXT INDEX idx_100 FROM top_100 on (id, feedback) TOKENIZER v_txtindex.StringTokenizer(long varchar) STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);
Use the FlexTokenizer to create an index from unstructured data:
=> CREATE TEXT INDEX idx_unstruc FROM unstruc_data on (__identity__, __raw__) TOKENIZER public.FlexTokenizer(long varbinary) STEMMER v_txtindex.StemmerCaseSensitive(long varchar);
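Once a text index exists, you can search it by matching stemmed tokens. The following is a minimal sketch, not a definitive query. It assumes that the idx_100 index from the first example exposes the token and doc_id columns that Vertica text indexes normally contain, and it uses 'excellent' as a hypothetical search term. The term is passed through the same stemmer that built the index, so both sides of the comparison are normalized the same way:

```sql
-- Sketch: find rows in top_100 whose feedback contains the search term.
-- Assumes idx_100 has the columns token and doc_id (not shown in the text above).
-- 'excellent' is a placeholder search term.
SELECT *
FROM top_100
WHERE id IN (
    SELECT doc_id
    FROM idx_100
    -- Stem the search term with the stemmer named in CREATE TEXT INDEX.
    WHERE token = v_txtindex.StemmerCaseInsensitive('excellent')
);
```

If the index was built with a different stemmer, such as StemmerCaseSensitive in the second example, apply that stemmer to the search term instead.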