Vertica Tokenizers

The Vertica Analytics Platform provides the following pre-configured tokenizers:

Name Description
public.FlexTokenizer(long varbinary) Splits the values in your Flex Table by white space.
v_txtindex.StringTokenizer(long varchar) Splits the string into words by splitting on white space.
v_txtindex.AdvancedLogTokenizer Uses the default values for all tokenizer parameters. For more information, see Advanced Log Tokenizer.
v_txtindex.BasicLogTokenizer Uses the default values for all tokenizer parameters except minorseparators, which is set to an empty list. For more information, see Basic Log Tokenizer.
v_txtindex.WhitespaceLogTokenizer Uses the default values for all tokenizer parameters except majorseparators, which is set to E' \t\n\f\r', and minorseparators, which is set to an empty list. For more information, see Whitespace Log Tokenizer.

Vertica also provides the following tokenizer, which is not pre-configured:

Name Description
v_txtindex.ICUTokenizer Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see ICU Tokenizer.

Examples

The following examples show how you can use a pre-configured tokenizer when creating a text index.

Use the StringTokenizer to create an index from the top_100 table:

=> CREATE TEXT INDEX idx_100 ON top_100 (id, feedback)
     TOKENIZER v_txtindex.StringTokenizer(long varchar)
     STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);
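
Once the index exists, you can inspect and search it like a table. The following is a minimal sketch, not a definitive recipe: it assumes the index exposes token and doc_id columns, and the search term 'excellent' is only a placeholder.

=> SELECT token, doc_id FROM idx_100 LIMIT 10;

=> SELECT * FROM top_100
     WHERE id IN (SELECT doc_id FROM idx_100
                  WHERE token = v_txtindex.StemmerCaseInsensitive('excellent'));

The search term is passed through the same stemmer that built the index, because the index stores stemmed tokens rather than the original words.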

Use the FlexTokenizer to create an index from unstructured data:

=> CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__)
     TOKENIZER public.FlexTokenizer(long varbinary)
     STEMMER v_txtindex.StemmerCaseSensitive(long varchar);
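
The pre-configured log tokenizers can be named in the same statement. The following is a minimal sketch that assumes a hypothetical table app_logs with a unique id column and a message text column; neither name comes from this page. STEMMER NONE is used so that tokens are stored without stemming.

=> CREATE TEXT INDEX idx_app_logs ON app_logs (id, message)
     TOKENIZER v_txtindex.WhitespaceLogTokenizer(long varchar)
     STEMMER NONE;

Swapping in v_txtindex.BasicLogTokenizer or v_txtindex.AdvancedLogTokenizer changes only the separator defaults described in the table above.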