Developing UDxs in Java Tutorial Part 2: User-Defined Scalar Functions

Welcome to part two of the developing UDxs in Java tutorial. See Part 1 in the series if you missed it.

In this tutorial, you will learn how to create User-Defined Scalar Functions.

What is a User Defined Scalar Function?

A User-Defined Scalar Function (UDSF) is a simple function: you provide it with zero or more arguments and it returns a single value as a result.  It must always return a value when called (even if that value is NULL). This is not true of other types of UDxs, as you will see in future tutorials.

You call a UDSF the same way you call Vertica built-in functions such as SIN or REGEXP_LIKE. For example, suppose you have a UDx named ADD2INTS that adds two INTEGER values. You can call this UDSF by passing it two values in a simple SELECT statement:

=> SELECT ADD2INTS(4,6);
 add2ints 
----------
       10
(1 row)

However, more often, you call UDSF functions from within a query:

=> SELECT c1, c2, ADD2INTS(c1, c2) FROM example_table;
 c1 | c2 | add2ints 
----+----+----------
  1 |  2 |        3
  1 |  4 |        5
  2 |  2 |        4
  3 |  2 |        5
  5 |  2 |        7
(5 rows)

In the above statement, ADD2INTS is called once for each row in example_table.

The COUNT_WORDS UDSF

In this tutorial, we will develop a UDSF named COUNT_WORDS. You pass it a VARCHAR argument, and it returns the number of words in the argument as an INTEGER. For the sake of simplicity, “words” means any group of letters or numbers that are separated by spaces. So COUNT_WORDS("5 golden rings") returns the value 3.

Creating an Eclipse Project for UDx Development

The first step in creating our UDSF is to create an Eclipse project.

  1. Start Eclipse.

  2. On the File menu, click New > Java Project to open the New Java Project wizard.

  3. Choose your JDK version on the first pane and click Next.

  4. On the Java Settings pane, click the Libraries tab.

  5. Click Add External JARs, and select /opt/vertica/bin/VerticaSDK.jar file.
  6. Click Finish.
  7. On the File menu, click Import and select General > File System.

  8. Click Next.
  9. Enter /opt/vertica/sdk in the From directory box, select BuildInfo.java. Enter <Project Folder>/src in the Into folder box.

  10. Click Finish.
  11. If an error appears in BuildInfo.java, open it and fix the package name issue in the "package com.vertica.sdk;" line.

  12. Click Move 'BuildInfo.java' to package 'com.vertica.sdk'.

Note You do not need to create a new Eclipse project for every UDx you want to develop. In fact, it is convenient use the same project to develop several UDxs if you intend to deploy them together.

The Anatomy of a UDx

All UDxs contain at least two classes: CountWords.java and CountWordsFactory.java.

  • Function class. Vertica passes data to this class when the UDx is called in a SQL statement. This class processes the data and passes the result back to Vertica. In this tutorial, it is named CountWords.
  • Factory class. This class tells Vertica information about your UDx, such as the arguments it accepts and the values it returns. It is also responsible for instantiating the function class for Vertica when it calls the function. In this tutorial, the factory class is named CountWordsFactory.

Your UDx can also contain other supporting classes, if necessary.

To start developing COUNT_WORDS, create 2 Java source files with template codes.

To create CountWords.java:

  1. In the Package Explorer, right-click the src folder and select New > Class.
  2. In the Package box, type a Java namespace for your UDx. Usually, this will start with com.yourCompanyName. In this tutorial, we’ll use com.example.udxs.
  3. In the Name box, type a name for your UDSF. This value is used to create your UDx’s class files. Normally, this name reflects the name you want to use when calling the function in a SQL statement. However, they do not have to be related. In this tutorial, we’ll use the name CountWords.
  4. In the Superclass box, type com.vertica.sdk.ScalarFunction.
  5. Click Finish.
  6. Replace the generated source codes with the following codes.
package com.example.udxs;
 
import com.vertica.sdk.*;
 
public class CountWords extends ScalarFunction {
 
@Override
public void processBlock(ServerInterface srvInterface, BlockReader argReader, BlockWriter resWriter)
throws UdfException, DestroyInvocation {
 
}
 
}

To create CountWordsFactory.java:

  1. In the Package Explorer, right-click the src folder and select New > Class.
  2. In the Package box, type the same Java namespace of your function class. In this tutorial, we’ll use com.example.udxs.
  3. In the Name box, type a name for your factory class. In this tutorial, we’ll use the name CountWordsFactory.
  4. In the Superclass box, type com.vertica.sdk.ScalarFunctionFactory.
  5. Click Finish.
  6. Replace the generated source codes with the following codes.
package com.example.udxs;
 
import com.vertica.sdk.*;
 
public class CountWordsFactory extends ScalarFunctionFactory {
 
@Override
public ScalarFunction createScalarFunction(ServerInterface srvInterface) {
return null;
}
 
@Override
public void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType) {
 
}
 
}

Editing the CountWordsFactory Class

A good next step in creating the CountWords UDx is to edit the CountWordsFactory.java file. Open this file in Eclipse. You will see that the CountWordsFactory class contains two methods: createScalarFunction and getPrototype.

The createScalarFunction method just instantiates and returns an instance of the CountWords class. You add the following code to return a new instance of your function class for you.

return new CountWords();

The getPrototype method defines the arguments and return value for your function.

Let’s take a look at the arguments passed to this method:

  • An instance of the ServerInterface class. You will find this object passed to most of your UDx methods. It lets you interact with the Vertica server. For example, you can use it to send messages to the Vertica log, get the user’s locale, or access parameters that the user passed to your function (which we’ll cover in a future tutorial).
  • An instance of the ColumnType class named argTypes. This object describes the arguments your function accepts.
  • An instance of the ColumnType class named returnTypes. This object defines your function’s return value.

Your override of getPrototype has to populate argTypes and returnTypes with the arguments your function expects and the type of value it returns. You declare the input and output types by calling methods on these objects, such as addInt or addVarchar. In our example, we just need to set one VARCHAR argument and a INTEGER output value, so we can add the following code inside the method:

argTypes.addVarchar();
returnType.addInt();

If our UDSF had more arguments, we would have added more argTypes method calls to getPrototype.

There is an additional method in the ScalarFunctionFactory class named getReturnType that you may need to override in your own UDxs. You must override this method if your function returns a value that has a variable-width (such as VARCHAR) or variable-precision (such as NUMERIC). This method sets the width or precision of these output types.

Since our UDx only returns an INTEGER value, we do not need to override getReturnType. Our factory class is now finished, and we can move on to developing the actual function.

Editing the CountWords Class

When you open the CountWords.java file, you’ll see a single method named processBlock. This method is where your UDSF actually does its processing. Here’s what the code looks like now:

@Override
public void processBlock(ServerInterface srvInterface,
          BlockReader argReader,   BlockWriter resWriter)
    throws UdfException, DestroyInvocation
  {
        
  }

The arguments to the method are:

  • The ubiquitous instance of the ServerInterface class.
  • An instance of the BlockReader class, which your function uses to read its input data.
  • An instance of the BlockWriter class, which your function uses to write its output.

Your override of the processBlock method usually follows these steps:

  1. Read arguments using methods on the BlockReader object.
  2. Process that data somehow to arrive at a result.
  3. Write the result back to Vertica using methods on the BlockWriter object.
  4. Advance the BlockReader and BlockWriter to the next row of input and output.
  5. Continue until the BlockReader runs out of data, then exit.

The function needs to iterate over all of the input, so the majority of the code will be wrapped in a loop. We'll cover that in a bit. Let's dive into what we need to do in each iteration first.

Reading Arguments

Remember that the function in our example counts the number of words in each string passed to it. So, the first step we need to take is to read the string argument. We read arguments by calling type-specific methods on the argReader object. In this case, we are expecting a single VARCHAR value, so we'll call the BlockReader.getString method to get the string value. We'll also set up some local variables to count the words we have found and a flag to indicate whether the current character we're reading is part of a word:

            // Read the string argument
            String inline = argReader.getString(0).trim();
            int wordCount = 0;
            boolean inword = false; // Are we in a word?
Counting Words

With the preliminaries out of the way, we're ready to start counting words. There are many ways to accomplish this task. In this example, we'll step through the characters in the string, and follow this algorithm:

  • If the current character is a letter or number, set the flag indicating we are in a word.
  • If the current character is a whitespace, clear the word flag. If the previous character was in a word, then increment the word count.
  • If we were in a word after we finished going through the string, then increment the word count.

This algorithm is "good enough" for most cases. Translating this into code, we get something like the following:

for (int i = 0; i < inline.length(); i++) {
    if  (Character.isLetterOrDigit(inline.charAt(i))) {
        inword = true;
    } else if (Character.isWhitespace(inline.charAt(i))) {
        if (inword) {
            wordCount++;
        }
        // Not in word any more. 
        inword = false;
    }
}
Writing the Result

Now that we have a word count, we need to send it back to Vertica . Similarly to reading arguments, you write your output by calling type-specific methods on the resWriter object. Because we need to write out an INTEGER value, we need to call the BlockWriter.setLong method to write the output value.

// Save the result to output, then advance to next row of output
resWriter.setLong(wordCount);
Moving on to the Next Input Row

Usually, your UDSF is called from within a query, so it normally gets multiple rows of data to process. Once your UDSF has written its output, it needs to move on to the next row of input and output. You do this by calling the next method on both argReader and resWriter objects.

In our example we call resWriter.next right after calling setLong:

resWriter.next();

The argReader.next method returns a Boolean value indicating whether it read a row of data. If this method call returns False, your UDSF has processed all of the rows it needs to process and must exit. Since we need to loop over the rows until argReader.next returns False, the best place to put the method call in the while statement of a do loop that wraps all of the processing statements.

So, our full processBlock method looks like this:

    @Override
    public void processBlock(ServerInterface srvInterface,
              BlockReader argReader,   BlockWriter resWriter)
        throws UdfException, DestroyInvocation
    {
        do {
            // Read the string argument
            String inline = argReader.getString(0).trim();
            int wordCount = 0;
            boolean inword = false; // Are we in a word?
            
            for (int i = 0; i < inline.length(); i++) {
                if  (Character.isLetterOrDigit(inline.charAt(i))) {
                    inword = true;
                } else if (Character.isWhitespace(inline.charAt(i))) {
                    if (inword) {
                        wordCount++;
                    }
                    // Not in word any more. 
                    inword = false;
                }
            }
            
            // Handle case where string ended with a word.
            if (inword) {
                wordCount++;
            }
            
            // Save the result to output, then advance to next row of output
            resWriter.setLong(wordCount);
            resWriter.next(); // Move to the next row of output
        }  while (argReader.next()); // Get next row of input until there is no more
    }

Exporting and Deploying Your UDSF

After you have finished subclassing ScalarFunctionFactory and ScalarFunction (and fixed any issues in your code), you are ready to deploy your new UDSF. The overall process of deploying a Java UDx is:

  • Create a JAR file that holds your UDx's class files, as well as the BuildInfo.java file.
  • Use the CREATE LIBRARY statement to load your library into the Vertica catalog.
  • Use the CREATE FUNCTION library to add your function to the database catalog.

If you followed the earlier recommendations to develop in Eclipse on a Linux system that also has Vertica running on it, these steps are easy.

Note Eclipse is usually configured to automatically build your project. If you have disabled this option, you must manually build your project before proceeding.

Creating the JAR File

To create the JAR file:

  1. On Eclipse's File menu, click Export....
  2. In the Select and export wizard, expand Java and double-click JAR file.
  3. In the Select the resources to export box, select the name of your UDSF project (Count Words). The checkbox next to your project's name should be checked.
  4. In the JAR file box, enter the full path and file name for the JAR file to create. Or you can click Browse to select a directory and then enter the file name. A good place for this file is the root directory of the your Linux account. In this tutorial we're running Eclipse as the dbadmin user, so we'll save the file to /home/dbadmin/CountWordsUDxLib.jar.
  5. Click Finish.
Loading the JAR Library into the Vertica Catalog

Before you can define a UDx function, you must load the JAR library containing the function into the database catalog using the CREATE LIBRARY statement. When you add the library, Vertica :

  • Examines the library file to make sure it is compatible with the Vertica server's SDK version.
  • Automatically copies the library file to every node in the cluster.

To load the library into the catalog:

  1. If Vertica is not running, start it.
  2. Using vsql, log into Vertica using a superuser account (such as dbadmin).
  3. Use the CREATE LIBRARY statement to add the library to the catalog. The syntax for this statement is:

    CREATE LIBRARY library_name AS 'absolute path to library' LANGUAGE 'Java';

    Where library_name is a name you want to use to refer to the library. You'll use this name later when creating your functions.

    In our example, we'll use the statement:

    => CREATE LIBRARY wordCountLib AS '/home/dbadmin/CountWordsUDxLib.jar' 
       LANGUAGE 'Java';
    
Defining the UDFS Function

After loading the library that contains your UDSF, you must define the function in the Vertica catalog. This maps a SQL function name to a class contained in your library JAR file. Use the CREATE FUNCTION statement to define the function:

CREATE FUNCTION function_name AS LANGUAGE 'Java' NAME 'factory_class' LIBRARY library_name

Where:

  • function_name is the name you want to use in your SQL statements when calling the function. In this example, we'll use the name count_words.
  • factory_class is the fully-qualified name of your function's factory class. In our example, this value is com.example.udxs.CountWordFactory.
  • library_name is the name you assigned to the library earlier. In our example, this is wordCountLib.

In our example, we'll use the statement:

=> CREATE FUNCTION count_words as language 'Java' 
   NAME 'com.example.udxs.CountWordsFactory' library wordCountLib;

Testing Your UDSF

Now that we've deployed our UDSF, it's time to test it out. You can call a UDSF the same way you call many native Vertica functions. Let's call the function in a simple SELECT statement:

=> SELECT count_words('Four score and seven years ago...');
 count_words 
-------------
           6
(1 row)

Usually, you will use your UDSF within a query on a table. For example:

=> CREATE TABLE quotes (quote VARCHAR);
CREATE TABLE
=> COPY quotes FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Neither a borrower nor a lender be.
>> The play's the thing wherein I'll catch the conscience of the king.         
>> We are such stuff as dreams are made on, rounded with a little sleep
>> \.
=> SELECT count_words(quote), quote FROM quotes;
 count_words |                                quote                                 
-------------+----------------------------------------------------------------------
           7 | Neither a borrower nor a lender be.
          12 | The play's the thing wherein I'll catch the conscience of the king.
          14 | We are such stuff as dreams are made on, rounded with a little sleep
(3 rows)

When testing, you should always try passing your function all types of values that the user might send it. That includes NULL values. Let's see what happens if we call count_words with a NULL:

=> SELECT count_words(NULL);
ERROR 3399:  Failure in UDx RPC call InvokeProcessBlock(): Unexpected exception in User Defined Object [count_words]
null

Oops. That's not good! In the next tutorial, we'll explain error handling and also several ways to handle NULL values.

Additional Resources

For more information about Scalar Functions developed in Java, see the Vertica documentation.