Python Example: String Tokenizer
The following example shows a transform function (UDTF) that splits an input string into whitespace-separated tokens. It is similar to the tokenizer examples for C++ and Java.
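The tokenization itself relies on Python's built-in `str.split()`: called with no arguments, it splits on any run of whitespace (spaces, tabs, newlines) and discards leading and trailing whitespace, so no empty tokens are produced. A quick standalone sketch of that behavior:

```python
# str.split() with no separator splits on runs of whitespace
# and never yields empty tokens.
text = "  this\tis  a\ntest  "
print(text.split())  # ['this', 'is', 'a', 'test']
```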
Loading and Using the Example
Create the library and function:
=> CREATE LIBRARY pyudtf AS '/home/dbadmin/udx/tokenize.py' LANGUAGE 'Python';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION tokenize AS NAME 'StringTokenizerFactory' LIBRARY pyudtf;
CREATE TRANSFORM FUNCTION
You can then use the function in SQL statements, for example:
=> CREATE TABLE words (w VARCHAR);
CREATE TABLE
=> COPY words FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> this is a test of the python udtf
>> \.
=> SELECT tokenize(w) OVER () FROM words;
  token
----------
 this
 is
 a
 test
 of
 the
 python
 udtf
(8 rows)
UDTF Python Code
The following code defines the tokenizer and its factory.
import vertica_sdk

class StringTokenizer(vertica_sdk.TransformFunction):
    """
    Transform function which tokenizes its inputs.
    For each input string, each of the whitespace-separated tokens of
    that string is produced as output.
    """
    def processPartition(self, server_interface, input, output):
        while True:
            # Emit one output row per whitespace-separated token.
            for token in input.getString(0).split():
                output.setString(0, token)
                output.next()
            if not input.next():
                break

class StringTokenizerFactory(vertica_sdk.TransformFunctionFactory):
    def getPrototype(self, server_interface, arg_types, return_type):
        arg_types.addVarchar()
        return_type.addVarchar()

    def getReturnType(self, server_interface, arg_types, return_type):
        # Output column has the same type as the input column.
        return_type.addColumn(arg_types.getColumnType(0), "tokens")

    def createTransformFunction(self, server_interface):
        return StringTokenizer()
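The `processPartition` loop can be sanity-checked outside Vertica by driving the same logic with small stand-in classes. The `MockReader` and `MockWriter` below are hypothetical test doubles, not part of `vertica_sdk`; they mimic only the calls the tokenizer makes (`getString`, `setString`, `next`):

```python
class MockReader:
    """Test double: iterates over a list of single-column string rows."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = 0
    def getString(self, col):
        return self.rows[self.pos]
    def next(self):
        # Advance to the next row; return False when input is exhausted.
        self.pos += 1
        return self.pos < len(self.rows)

class MockWriter:
    """Test double: collects emitted rows into a list."""
    def __init__(self):
        self.rows = []
        self.current = None
    def setString(self, col, value):
        self.current = value
    def next(self):
        self.rows.append(self.current)

def process_partition(reader, writer):
    # Same loop body as StringTokenizer.processPartition above.
    while True:
        for token in reader.getString(0).split():
            writer.setString(0, token)
            writer.next()
        if not reader.next():
            break

writer = MockWriter()
process_partition(MockReader(["this is a test", "of the python udtf"]), writer)
print(writer.rows)
# ['this', 'is', 'a', 'test', 'of', 'the', 'python', 'udtf']
```

This mirrors the SQL session above: two input rows produce eight output rows, one per token.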