C++ Example: String Tokenizer
The following example shows a subclass of TransformFunction
named StringTokenizer
. It defines a UDTF that reads a table containing an INTEGER ID column and a VARCHAR column. It breaks the text in the VARCHAR column into tokens (individual words). It returns a table containing each token, the row it occurred in, and its position within the string.
Loading and Using the Example
The following example shows how to load the function into Vertica. It assumes that the TransformFunctions.so
library that contains the function has been copied to the dbadmin user's home directory on the initiator node.
=> CREATE LIBRARY TransformFunctions AS '/home/dbadmin/TransformFunctions.so'; CREATE LIBRARY => CREATE TRANSFORM FUNCTION tokenize AS LANGUAGE 'C++' NAME 'TokenFactory' LIBRARY TransformFunctions; CREATE TRANSFORM FUNCTION
You can then use it from SQL statements, for example:
=> CREATE TABLE T (url varchar(30), description varchar(2000)); CREATE TABLE => INSERT INTO T VALUES ('www.amazon.com','Online retail merchant and provider of cloud services'); OUTPUT -------- 1 (1 row) => INSERT INTO T VALUES ('www.hp.com','Leading provider of computer hardware and imaging solutions'); OUTPUT -------- 1 (1 row) => INSERT INTO T VALUES ('www.vertica.com','World''s fastest analytic database'); OUTPUT -------- 1 (1 row) => COMMIT; COMMIT
=> -- Invoke the UDT => SELECT url, tokenize(description) OVER (partition by url) FROM T; url | words -----------------+----------- www.amazon.com | Online www.amazon.com | retail www.amazon.com | merchant www.amazon.com | and www.amazon.com | provider www.amazon.com | of www.amazon.com | c www.amazon.com | loud www.amazon.com | services www.hp.com | Leading www.hp.com | provider www.hp.com | of www.hp.com | computer www.hp.com | hardware www.hp.com | and www.hp.com | im www.hp.com | aging www.hp.com | solutions www.vertica.com | World's www.vertica.com | fastest www.vertica.com | analytic www.vertica.com | database (22 rows)
Notice that the number of rows and columns in the result table are different than the input table. This is one of the strengths of a UDTF.
TransformFunction Implementation
The following code defines the StringTokenizer
class.
#include "Vertica.h" #include <sstream> // Use the Vertica namespace to make referring // to SDK classes easier. using namespace Vertica; using namespace std; // The primary class for the StringTokenizer UDTF. This does the actual work. class StringTokenizer : public TransformFunction { // Called for each partition in the table. virtual void processPartition(ServerInterface &srvInterface, PartitionReader &inputReader, PartitionWriter &outputWriter) { // Use a top-level try to prevent exceptions from bubbling up. try { // Loop through the input rows do { // First check if the id is NULL. If so, skip the row without // writing any output. if (inputReader.isNull(0)) { srvInterface.log("Skipping row with NULL ID"); continue; } // Get an input row containing an int and a varchar const vint row = inputReader.getIntRef(0); const VString &sentence = inputReader.getStringRef(1); // If input string is NULL, then output NULL tokens and // positions. if (sentence.isNull()) { outputWriter.setNull(0); outputWriter.setInt(1,row); outputWriter.setNull(2); // Move on to the next row outputWriter.next(); } else { // Tokenize the string and output the words std::string tmp = sentence.str(); istringstream ss(tmp); int wordcount = 0; // Track which word we're on do { std::string buffer; ss >> buffer; // Copy to output if (!buffer.empty()) { // To write a string, need to get the buffer to // write to, then copy the token into it. VString &word = outputWriter.getStringRef(0); word.copy(buffer); // 2nd column is row id of input outputWriter.setInt(1,row); // 3rd column is the token's position in the row outputWriter.setInt(2,wordcount++); // Move on to the next output row. outputWriter.next(); } } while (ss); } } while (inputReader.next()); // Get next row (if any) } catch (exception& e) { // Standard exception. Report the error. vt_report_error(0, "Exception while processing partition: [%s]", e.what()); } } };
The processPartition()
function in this example follows a pattern that you will follow in your own UDTF: it loops over all rows in the table partition that Vertica sends it, processing each row. For UDTFs you do not have to actually process every row. You can exit your function without having read all of the input without any issues. You may choose to do this if your UDTF is performing some sort search or some other operation where it can determine that the rest of the input is unneeded.
In this example, processPartition()
first extracts the ID number of the row. Then it extracts the VString
containing the text from the PartitionReader
object. The VString
class represents a Vertica string value (VARCHAR or CHAR).
When developing UDTFs, you often need to handle NULL input values in a special way. This example has two different reactions to a NULL value:
- If the ID column is NULL, then
processPartition()
skips the row and moves on to the next row. It writes a message to the Vertica log to indicate that it is skipping the row. - If the VARCHAR column is NULL,
processPartition()
outputs a single row that contains the row_id value and NULLs for both the token and token_position columns.
After handling any NULL values, the processPartition()
function shown in the example moves on to performing the actual processing. It breaks the string into individual tokens and outputs a row for each token.
Similarly to reading input columns, the PartitionWriter
object has functions for writing each type of data to the output row. In this case, the example:
- Calls the
PartitionWriter
object'sgetStringRef()
function to allocate a newVString
object to hold the token to output for the first column. It then copies the token's value into the VString object. - Calls
setInt()
to output the row ID in the second column. - Calls
setInt()
again to output the token's position within the input string.
After setting the three column values, the example calls PartitionWriter.next()
to complete the output row.
TranformFunctionFactory Implementation
The following code shows the factory class.
// Provides Vertica with metadata about the StringTokenizer class TokenFactory : public TransformFunctionFactory { // Tell Vertica that StringTokenizer reads 1 int and 1 string, // and returns two ints and a string virtual void getPrototype(ServerInterface &srvInterface, ColumnTypes &argTypes, ColumnTypes &returnType) { argTypes.addInt(); // Row number argTypes.addVarchar(); // Line of text returnType.addVarchar(); // The token returnType.addInt(); // The row this token occurred in returnType.addInt(); // The position in the row of the token } // Gives details of the output columns. For integer rows, // Tell Vertica the maxiumu return string length will be, given the input // string length. Also names the output columns. This function is only // necessary if you have a column that has a variable size (i.e. strings) or // have to report their precision. virtual void getReturnType(ServerInterface &srvInterface, const SizedColumnTypes &input_types, SizedColumnTypes &output_types) { // String column's output size will never be more than the size of // the input string int input_len = input_types.getColumnType(1).getStringLength(); // Describe each column in the output. Adds a column name. output_types.addVarchar(input_len, "token"); output_types.addInt("row_id"); output_types.addInt("token_position"); } // Instantiate a StringTokenizer object to process a partition of data virtual TransformFunction *createTransformFunction(ServerInterface &srvInterface) { return vt_createFuncObj(srvInterface.allocator, StringTokenizer); } };
In this example:
- The UDTF takes an INTEGER and a VARCHAR column as input. To define these input columns,
getPrototype()
callsaddInt()
andaddVarchar()
on theColumnTypes
object that represents the input table. - The UDTF returns a VARCHAR and two INTEGER columns as output. The
getPrototype()
function callsaddVarchar()
once andaddInt()
twice to define the output table.
This example must return the maximum length of the VARCHAR output column. It sets the length to the length input string. This is a safe value, because the output will never be longer than the input string. It also sets the name of the VARCHAR output column to "token".
You can use getReturnType()
to name the output columns, even if they do not require a precision or output length. The example sets the names of the two INTEGER output columns to "row_id" and "token_postion".
Note: You are not required to supply a name for an output column in this function. However, it is a best practice to do so. If you do not name an output column, getReturnType()
sets the column name to "". The SQL statements that call your UDTF must provide aliases for any unnamed columns to access them or else they return an error. From a usability standpoint, it is easier for you to supply the column names here once. The alternative is to force all of the users of your function to supply their own column names for each call to the UDTF.
The implementation of the createTransformFunction()
function in the example is boilerplate code. It just calls the vt_returnFuncObj
macro with the name of the TransformFunction
class associated with this factory class. This macro takes care of instantiating a copy of the TransformFunction
class that Vertica can use to process data.
The RegisterFactory Macro
The final step in creating your UDTF is to call the RegisterFactory
macro. This macro ensures that your factory class is instantiated when Vertica loads the shared library containing your UDTF. Having your factory class instantiated is the only way that Vertica can find your UDTF and determine what its inputs and outputs are.
The RegisterFactory
macro just takes the name of your factory class:
RegisterFactory(TokenFactory);