C++ Example: String Tokenizer

The following example shows a subclass of TransformFunction named StringTokenizer. It defines a UDTF that reads a table containing an INTEGER ID column and a VARCHAR column. It breaks the text in the VARCHAR column into tokens (individual words). It returns a table containing each token, the row it occurred in, and its position within the string.

Loading and Using the Example

The following example shows how to load the function into Vertica. It assumes that the TransformFunctions.so library that contains the function has been copied to the dbadmin user's home directory on the initiator node.

=> CREATE LIBRARY TransformFunctions AS
   '/home/dbadmin/TransformFunctions.so';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION tokenize
   AS LANGUAGE 'C++' NAME 'TokenFactory' LIBRARY TransformFunctions;
CREATE TRANSFORM FUNCTION

You can then use it from SQL statements, for example:

=> CREATE TABLE T (url varchar(30), description varchar(2000));
CREATE TABLE
=> INSERT INTO T VALUES ('www.amazon.com','Online retail merchant and provider of cloud services');
 OUTPUT 
--------
      1
(1 row)
=> INSERT INTO T VALUES ('www.hp.com','Leading provider of computer hardware and imaging solutions');
 OUTPUT 
--------
      1
(1 row)
=> INSERT INTO T VALUES ('www.vertica.com','World''s fastest analytic database');
 OUTPUT 
--------
      1
(1 row)
=> COMMIT;
COMMIT
=> -- Invoke the UDT
=> SELECT url, tokenize(description) OVER (partition by url) FROM T;
       url       |   words   
-----------------+-----------
 www.amazon.com  | Online
 www.amazon.com  | retail
 www.amazon.com  | merchant
 www.amazon.com  | and
 www.amazon.com  | provider
 www.amazon.com  | of
 www.amazon.com  | c
 www.amazon.com  | loud
 www.amazon.com  | services
 www.hp.com      | Leading
 www.hp.com      | provider
 www.hp.com      | of
 www.hp.com      | computer
 www.hp.com      | hardware
 www.hp.com      | and
 www.hp.com      | im
 www.hp.com      | aging
 www.hp.com      | solutions
 www.vertica.com | World's
 www.vertica.com | fastest
 www.vertica.com | analytic
 www.vertica.com | database
(22 rows)

Notice that the number of rows and columns in the result table are different than the input table. This is one of the strengths of a UDTF.

TransformFunction Implementation

The following code defines the StringTokenizer class.

#include "Vertica.h"
#include <sstream>
// Use the Vertica namespace to make referring
// to SDK classes easier.
using namespace Vertica;
using namespace std;

// The primary class for the StringTokenizer UDTF. This does the actual work.
class StringTokenizer : public TransformFunction {
    // Called for each partition in the table.
    virtual void processPartition(ServerInterface &srvInterface,
                                  PartitionReader &inputReader,
                                  PartitionWriter &outputWriter) {
        // Use a top-level try to prevent exceptions from bubbling up.
        try {
            // Loop through the input rows
            do {
                // First check if the id is NULL. If so, skip the row without
                // writing any output.
                if (inputReader.isNull(0))
                {
                    srvInterface.log("Skipping row with NULL ID");
                    continue;
                }
                // Get an input row containing an int and a varchar
                const vint row = inputReader.getIntRef(0);
                const VString &sentence = inputReader.getStringRef(1);
                // If input string is NULL, then output NULL tokens and 
                // positions.
                if (sentence.isNull())
                {
                    outputWriter.setNull(0);
                    outputWriter.setInt(1,row);
                    outputWriter.setNull(2);
                    // Move on to the next row
                    outputWriter.next();
                } 
                else 
                {
                    // Tokenize the string and output the words
                    std::string tmp = sentence.str();
                    istringstream ss(tmp);
                    int wordcount = 0; // Track which word we're on
                    do
                    {
                        std::string buffer;
                        ss >> buffer;
                        
                        // Copy to output
                        if (!buffer.empty()) {
                            // To write a string, need to get the buffer to
                            // write to, then copy the token into it.
                            VString &word = outputWriter.getStringRef(0);
                            word.copy(buffer);
                             // 2nd column is row id of input
                            outputWriter.setInt(1,row);
                            // 3rd column is the token's position in the row
                            outputWriter.setInt(2,wordcount++);

                            // Move on to the next output row.
                            outputWriter.next();
                        }
                    } while (ss);
                }
            }  while (inputReader.next()); // Get next row (if any)
        } catch (exception& e) {
            // Standard exception. Report the error.
            vt_report_error(0, "Exception while processing partition: [%s]", 
                            e.what());
        }
    }
};

The processPartition() function in this example follows a pattern that you will follow in your own UDTF: it loops over all rows in the table partition that Vertica sends it, processing each row. For UDTFs you do not have to actually process every row. You can exit your function without having read all of the input without any issues. You may choose to do this if your UDTF is performing some sort search or some other operation where it can determine that the rest of the input is unneeded.

In this example, processPartition() first extracts the ID number of the row. Then it extracts the VString containing the text from the PartitionReader object. The VString class represents a Vertica string value (VARCHAR or CHAR).

When developing UDTFs, you often need to handle NULL input values in a special way. This example has two different reactions to a NULL value:

After handling any NULL values, the processPartition() function shown in the example moves on to performing the actual processing. It breaks the string into individual tokens and outputs a row for each token.

Similarly to reading input columns, the PartitionWriter object has functions for writing each type of data to the output row. In this case, the example:

After setting the three column values, the example calls PartitionWriter.next() to complete the output row.

TranformFunctionFactory Implementation

The following code shows the factory class.

// Provides Vertica with metadata about the StringTokenizer 
class TokenFactory : public TransformFunctionFactory {    
    // Tell Vertica that StringTokenizer reads 1 int and 1 string,
    // and returns two ints and a string
    virtual void getPrototype(ServerInterface &srvInterface, ColumnTypes
                              &argTypes, ColumnTypes &returnType) {
        argTypes.addInt(); // Row number
        argTypes.addVarchar(); // Line of text
        returnType.addVarchar(); // The token
        returnType.addInt(); // The row this token occurred in
        returnType.addInt(); // The position in the row of the token
        
    }
    
    // Gives details of the output columns. For integer rows, 
    // Tell Vertica the maxiumu return string length will be, given the input
    // string length. Also names the output columns. This function is only
    // necessary if you have a column that has a variable size (i.e. strings) or
    // have to report their precision. 
    virtual void getReturnType(ServerInterface &srvInterface,
                               const SizedColumnTypes &input_types,
                               SizedColumnTypes &output_types) {
                                   
        // String column's output size will never be more than the size of 
        // the input string
        int input_len = input_types.getColumnType(1).getStringLength();
        
        // Describe each column in the output. Adds a column name.
        output_types.addVarchar(input_len, "token");
        output_types.addInt("row_id");
        output_types.addInt("token_position");
    }
    
    // Instantiate a StringTokenizer  object to process a partition of data
    virtual TransformFunction *createTransformFunction(ServerInterface 
		&srvInterface) {
        return vt_createFuncObj(srvInterface.allocator, StringTokenizer);
    }
};

In this example:

This example must return the maximum length of the VARCHAR output column. It sets the length to the length input string. This is a safe value, because the output will never be longer than the input string. It also sets the name of the VARCHAR output column to "token".

You can use getReturnType() to name the output columns, even if they do not require a precision or output length. The example sets the names of the two INTEGER output columns to "row_id" and "token_postion".

Note: You are not required to supply a name for an output column in this function. However, it is a best practice to do so. If you do not name an output column, getReturnType() sets the column name to "". The SQL statements that call your UDTF must provide aliases for any unnamed columns to access them or else they return an error. From a usability standpoint, it is easier for you to supply the column names here once. The alternative is to force all of the users of your function to supply their own column names for each call to the UDTF.

The implementation of the createTransformFunction() function in the example is boilerplate code. It just calls the vt_returnFuncObj macro with the name of the TransformFunction class associated with this factory class. This macro takes care of instantiating a copy of the TransformFunction class that Vertica can use to process data.

The RegisterFactory Macro

The final step in creating your UDTF is to call the RegisterFactory macro. This macro ensures that your factory class is instantiated when Vertica loads the shared library containing your UDTF. Having your factory class instantiated is the only way that Vertica can find your UDTF and determine what its inputs and outputs are.

The RegisterFactory macro just takes the name of your factory class:

RegisterFactory(TokenFactory);