Configuring a Tokenizer
You configure a tokenizer by creating a User-Defined Transform Function (UDTF) using one of the two base UDTFs in the v_txtindex.AdvTxtSearchLib library. The library contains two base tokenizers: one for Log Words and one for Ngrams. You can configure each base function with or without positional relevance.
You can choose among four base tokenizer configurations, one for each combination of tokenizer type and positional relevance:
Type | With Position | Without Position |
---|---|---|
Ngram | logNgramTokenizerPositionFactory | logNgramTokenizerFactory |
Words | logWordITokenizerPositionFactory | logWordITokenizerFactory |
Create a logWord tokenizer without positional relevance:
=> CREATE TRANSFORM FUNCTION v_txtindex.fooTokenizer AS LANGUAGE 'C++' NAME 'logWordITokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;
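The other configurations in the table above can be created the same way by swapping the factory name. As a minimal sketch, the statement below creates an Ngram tokenizer with positional relevance. The function name fooNgramPosTokenizer is hypothetical, and the library name is copied from the preceding example on the assumption that it also holds the Ngram factories in your installation.

```sql
-- Hypothetical tokenizer name; substitute your own.
-- Factory name comes from the "Ngram" row of the table above.
-- Library name is copied from the logWord example and is an assumption here.
CREATE TRANSFORM FUNCTION v_txtindex.fooNgramPosTokenizer
    AS LANGUAGE 'C++'
    NAME 'logNgramTokenizerPositionFactory'
    LIBRARY v_txtindex.logSearchLib
    NOT FENCED;
```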
Retrieve a Tokenizer's proc_oid
After you create the tokenizer, Vertica writes the name and proc_oid to the system table vs_procedures. You must retrieve the tokenizer's proc_oid to perform additional configuration.
Enter the following query, substituting your own tokenizer name:
=> SELECT proc_oid FROM vs_procedures WHERE procedure_name = 'fooTokenizer';
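If you have created several tokenizers, it may be convenient to list them together. The sketch below assumes your tokenizer names all end in "Tokenizer", which is a naming convention used only for illustration. It returns each matching name alongside its proc_oid.

```sql
-- Assumes tokenizer names end in "Tokenizer" (illustrative convention only).
-- ILIKE makes the match case-insensitive.
SELECT procedure_name, proc_oid
FROM vs_procedures
WHERE procedure_name ILIKE '%Tokenizer'
ORDER BY procedure_name;
```

Because the pattern matches on the name alone, confirm that each returned row is actually one of your tokenizers before using its proc_oid.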