Configuring a Tokenizer

You configure a tokenizer by creating a user-defined transform function (UDTF) from one of the two base UDTFs in the v_txtindex.AdvTxtSearchLib library: one for Log Words and one for Ngrams. You can configure each base function with or without positional relevance.

This gives you four base tokenizer configurations to choose from:

Type     With Position                       Without Position
Ngram    logNgramTokenizerPositionFactory    logNgramTokenizerFactory
Words    logWordITokenizerPositionFactory    logWordITokenizerFactory

The following example creates a logWord tokenizer named fooTokenizer without positional relevance:

=> CREATE TRANSFORM FUNCTION v_txtindex.fooTokenizer AS LANGUAGE 'C++' NAME 'logWordITokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;
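
The same pattern should carry over to the other factories in the table. As a minimal sketch, the statement below creates an Ngram tokenizer with positional relevance. The function name fooNgramPosTokenizer is hypothetical, and the library is assumed to be the same one used in the fooTokenizer statement above; only the factory name comes from the table of base configurations.

```sql
-- Sketch only: fooNgramPosTokenizer is a hypothetical function name.
-- The factory name is taken from the table of base configurations.
-- The library is assumed to match the one used for fooTokenizer.
CREATE TRANSFORM FUNCTION v_txtindex.fooNgramPosTokenizer
    AS LANGUAGE 'C++'
    NAME 'logNgramTokenizerPositionFactory'
    LIBRARY v_txtindex.logSearchLib
    NOT FENCED;
```

To create one of the other configurations, swap in the corresponding factory name from the table and choose a function name of your own.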

Retrieve a Tokenizer's proc_oid

After you create the tokenizer, Vertica writes the name and proc_oid to the system table vs_procedures. You must retrieve the tokenizer's proc_oid to perform additional configuration.

Enter the following query, substituting your own tokenizer name:

=> SELECT proc_oid FROM vs_procedures WHERE procedure_name = 'fooTokenizer';
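
If you are not sure how the tokenizer name was recorded, a looser variant of this query may help. The sketch below assumes only the two columns mentioned above (procedure_name and proc_oid) and uses a case-insensitive pattern match, so the pattern '%tokenizer%' is an illustration rather than a required form.

```sql
-- Sketch: list every procedure whose name contains "tokenizer",
-- ignoring case, so each name can be checked next to its proc_oid.
SELECT procedure_name, proc_oid
FROM vs_procedures
WHERE procedure_name ILIKE '%tokenizer%';
```

Returning procedure_name alongside proc_oid makes it easier to confirm that the OID belongs to the tokenizer you just created before you use it for further configuration.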