Share this article:

Developing UDxs in Java Tutorial Part 2: User-Defined Scalar Functions

Welcome to part two of the developing UDxs in Java tutorial. See part 1 in the series if you missed it.

In this tutorial, you will learn how to create User-Defined Scalar Functions.

What is a User Defined Scalar Function?

A User-Defined Scalar Function (UDSF) is a simple function: you provide it with zero or more arguments and it returns a single value as a result.  It must always return a value when called (even if that value is NULL). This is not true of other types of UDxs, as you will see in future tutorials.

You call a UDFS the same way you call Vertica built-in functions such as SIN or REGEXP_LIKE. For example, suppose you have a UDx named ADD2INTS that adds two INTEGER values. You can call this UDSF by passing it two values in a simple SELECT statement:

=> SELECT ADD2INTS(4,6);
 add2ints 
----------
       10
(1 row)

However, more often, you call UDSF functions from within a query:

=> SELECT c1, c2, ADD2INTS(c1, c2) FROM example_table;
 c1 | c2 | add2ints 
----+----+----------
  1 |  2 |        3
  1 |  4 |        5
  2 |  2 |        4
  3 |  2 |        5
  5 |  2 |        7
(5 rows)

In the above statement, ADD2INTS is called once for each row in example_table.

The COUNT_WORDS UDSF

In this tutorial, we will develop a UDSF named COUNT_WORDS. You pass it a VARCHAR argument, and it returns the number of words in the argument as an INTEGER. For the sake of simplicity, “words” means any group of letters or numbers that are separated by spaces. So COUNT_WORDS("5 golden rings") returns the value 3.

Creating an Eclipse Project for UDx Development

The first step in creating our UDSF is to create an Eclipse project. The Vertica plugin for Eclipse simplifies this process:

  1. On Eclipse’s File menu, point to New and select Other….
  2. In the Wizards list, expand the Vertica x.x.x Java SDK item (where x.x.x is your version of Vertica ).
  3. Select Vertica UDx Project.
  4. Click Next.
  5. In the Project name box, enter a name for your project. In this tutorial, we’ll name the project Count Words UDSF.
  6. Click Finish.

The plugin performs several configuration steps for you:

  • It adds the VerticaSDK.jar file to your project as a referenced library. This reference ensures that your UDx code can call the Vertica Java SDK. This link also makes the documentation for the SDK available within Eclipse.
  • It adds the BuildInfo.java source file to your project. Your Java UDx library must include this file. It tells Vertica which SDK version it is compatible with.

Note You do not need to create a new Eclipse project for every UDx you want to develop. In fact, it is convenient use the same project to develop several UDxs if you intend to deploy them together.

Running the New UDSF Wizard

In addition to setting up your Eclipse project, the Vertica Java SDK Plugin for Eclipse also contains wizards that help you develop each type of UDx. They create two class files containing boilerplate code that gives you a head start.

To start developing WORD_COUNT:

  1. In the Package Explorer, right-click the src folder and select New > Other….
  2. In the Wizards list, expland the Vertica x.x.x Java SDK and select User Defined Scalar Function.
  3. Click Next.
  4. In the Package box, type a Java namespace for your UDx. Usually, this will start with com.yourCompanyName. In this tutorial, we’ll use com.example.udxs.
  5. In the Name box, type a name for your UDSF. This value is used to create your UDx’s class files. Normally, this name reflects the name you want to use when calling the function in a SQL statement. However, they do not have to be related. In this tutorial, we’ll use the name CountWords.
  6. Click Finish.

The Anatomy of a UDx

After running the UDSF wizard, open the newly-created source folder (com.example.udxs). You’ll see that the wizard created two Java source files for you: CountWords.java and CountWordsFactory.java.

All UDxs contain at least two classes:

  • A factory class. This class tells Vertica information about your UDx, such as the arguments it accepts and the values it returns. It is also responsible for instantiating the function class for Vertica when calls the function. The Vertica Eclipse plugin names this class whatever you entered in the Name box with Factory appended. In this example, the factory class is named CountWordsFactory.
  • A function class. Vertica passes data to this class when the UDx is called in a SQL statement. This class processes the data and passes the result back to Vertica . It is named whatever you entered in the Name field of the wizard. In this case, it is named CountWords.

Your UDx can also contain other supporting classes, if necessary.

Editing the CountWordsFactory Class

A good next step in creating the CountWords UDx is to edit the CountWordsFactory.java file. Double-click this file in the Package Explorer. You will see that the CountWordsFactory class contains two methods: createScalarFunction and getPrototype.

The createScalarFunction method just instantiates and returns an instance of the CountWords class. You shouldn’t have to change this method. The UDSF wizard automatically added the code to return a new instance of your function class for you.

The getPrototype method defines the arguments and return value for your function. Here’s the boilerplate code that the UDSF Wizard created for you:

@Override
public void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType){ 
    //TODO: Define arguments and return value
}

Let’s take a look at the arguments passed to this method:

  • An instance of the ServerInterface class. You will find this object passed to most of your UDx methods. It lets you interact with the Vertica server. For example, you can use it to send messages to the Vertica log, get the user’s locale, or access parameters that the user passed to your function (which we’ll cover in a future tutorial).
  • An instance of the ColumnType class named argTypes. This object describes the arguments your function accepts.
  • An instance of the ColumnType class named returnTypes. This object defines your function’s return value.

Your override of getPrototype has to populate argTypes and returnTypes with the arguments your function expects and the type of value it returns. You declare the input and output types by calling methods on these objects, such as addInt or addVarchar. In our example, we just need to set one VARCHAR argument and a INTEGER output value, so we can add the following code inside the method:

argTypes.addVarchar();
returnType.addInt();

If our UDSF had more arguments, we would have added more argTypes method calls to getPrototype.

There is an additional method in the ScalarFunctionFactory class named getReturnType that you may need to override in your own UDxs. You must override this method if your function returns a value that has a variable-width (such as VARCHAR) or variable-precision (such as NUMERIC). This method sets the width or precision of these output types.

Since our UDx only returns an INTEGER value, we do not need to override getReturnType. Our factory class is now finished, and we can move on to developing the actual function.

Editing the CountWords Class

When you open the CountWords.java file, you’ll see that the UDSF Wizard has created the shell of a single method named processBlock. This method is where your UDSF actually does its processing. Here’s what the code looks like now:

@Override
public void processBlock(ServerInterface srvInterface,
          BlockReader argReader,   BlockWriter resWriter)
    throws UdfException, DestroyInvocation
  {
        
  }

The arguments to the method are:

  • The ubiquitous instance of the ServerInterface class.
  • An instance of the BlockReader class, which your function uses to read its input data.
  • An instance of the BlockWriter class, which you function uses to write its output.

Your override of the processBlock method usually follows these steps:

  1. Read arguments using methods on the BlockReader object.
  2. Process that data somehow to arrive at a result.
  3. Write the result back to Vertica using methods on the BlockWriter object.
  4. Advance the BlockReader and BlockWriter to the next row of input and output.
  5. Continue until the BlockReader runs out of data, then exit.

The function needs to iterate over all of the input, so the majority of the code will be wrapped in a loop. We'll cover that in a bit. Let's dive into what we need to do in each iteration first.

Reading Arguments

Remember that the function in our example counts the number of words in each string passed to it. So, the first step we need to take is to read the string argument. We read arguments by calling type-specific methods on the argReader object. In this case, we are expecting a single VARCHAR value, so we'll call the BlockReader.getString method to get the string value. We'll also set up some local variables to count the words we have found and a flag to indicate whether the current character we're reading is part of a word:

            // Read the string argument
            String inline = argReader.getString(0).trim();
            int wordCount = 0;
            boolean inword = false; // Are we in a word?
Counting Words

With the preliminaries out of the way, we're ready to start counting words. There are many ways to accomplish this task. In this example, we'll step through the characters in the string, and follow this algorithm:

  • If the current character is a letter or number, set the flag indicating we are in a word.
  • If the current character is a whitespace, clear the word flag. If the previous character was in a word, then increment the word count.
  • If we were in a word after we finished going through the string, then increment the word count.

This algorithm is "good enough" for most cases. Translating this into code, we get something like the following:

for (int i = 0; i < inline.length(); i++) {
    if  (Character.isLetterOrDigit(inline.charAt(i))) {
        inword = true;
    } else if (Character.isWhitespace(inline.charAt(i))) {
        if (inword) {
            wordCount++;
        }
        // Not in word any more. 
        inword = false;
    }
}
Writing the Result

Now that we have a word count, we need to send it back to Vertica . Similarly to reading arguments, you write your output by calling type-specific methods on the resWriter object. Because we need to write out an INTEGER value, we need to call the BlockWriter.setLong method to write the output value.

// Save the result to output, then advance to next row of output
resWriter.setLong(wordCount);
Moving on to the Next Input Row

Usually, your UDSF is called from within a query, so it normally gets multiple rows of data to process. Once your UDSF has written its output, it needs to move on to the next row of input and output. You do this by calling the next method on both argReader and resWriter objects.

In our example we call resWriter.next right after calling setLong:

resWriter.next();

The argReader.next method returns a Boolean value indicating whether it read a row of data. If this method call returns False, your UDSF has processed all of the rows it needs to process and must exit. Since we need to loop over the rows until argReader.next returns False, the best place to put the method call in the while statement of a do loop that wraps all of the processing statements.

So, our full processBlock method looks like this:

    @Override
    public void processBlock(ServerInterface srvInterface,
              BlockReader argReader,   BlockWriter resWriter)
        throws UdfException, DestroyInvocation
    {
        do {
            // Read the string argument
            String inline = argReader.getString(0).trim();
            int wordCount = 0;
            boolean inword = false; // Are we in a word?
            
            for (int i = 0; i < inline.length(); i++) {
                if  (Character.isLetterOrDigit(inline.charAt(i))) {
                    inword = true;
                } else if (Character.isWhitespace(inline.charAt(i))) {
                    if (inword) {
                        wordCount++;
                    }
                    // Not in word any more. 
                    inword = false;
                }
            }
            
            // Handle case where string ended with a word.
            if (inword) {
                wordCount++;
            }
            
            // Save the result to output, then advance to next row of output
            resWriter.setLong(wordCount);
            resWriter.next(); // Move to the next row of output
        }  while (argReader.next()); // Get next row of input until there is no more
    }

Exporting and Deploying Your UDSF

After you have finished subclassing ScalarFunctionFactory and ScalarFunction (and fixed any issues in your code), you are ready to deploy your new UDSF. The overall process of deploying a Java UDx is:

  • Create a JAR file that holds you UDx's class files, as well as the BuildInfo.java file.
  • Use the CREATE LIBRARY statement to load your library into the Vertica catalog.
  • Use the CREATE FUNCTION library to add your function to the database catalog.

If you followed the earlier recommendations to develop in Eclipse on a Linux system that also has Vertica running on it, these steps are easy.

Note Eclipse is usually configured to automatically build your project. If you have disabled this option, you must manually build your project before proceeding.

Creating the JAR File

To creating the JAR file:

  1. On Eclipse's File menu, click Export....
  2. In the Select and export destination list, expand Java and double-click JAR File.
  3. In the Select the resources to export box, select the name of your UDSF project (Count Words). The checkbox next to your project's name should be checked.
  4. In the JAR file box, enter the full path and file name for the JAR file to create. Or you can click Browse to select a directory and then enter the file name. A good place for this files is the root directory of the your Linux account. In this tutorial we're running Eclipse as the dbadmin user, so we'll save the file to /home/dbadmin/CountWordsUDxLib.jar.
  5. Click Finish.
Loading the JAR Library into the Vertica Catalog

Before you can define a UDx function, you must load the JAR library containing the function into the database catalog using the CREATE LIBRARY statement. When you add the library, Vertica :

  • Examines the library file to make sure it is compatible with the Vertica server's SDK version.
  • Automatically copies the library file to every node in the cluster.

To load the library into the catalog:

  1. If Vertica is not running, start it.
  2. Using vsql, log into Vertica using a superuser account (such as dbadmin).
  3. Use the CREATE LIBRARY statement to add the library to the catalog. The syntax for this statement is:

    CREATE LIBRARY library_name AS 'absolute path to library' LANGUAGE 'Java';

    Where library_name is a name you want to use to refer to the library. You'll use this name later when creating your functions.

    In our example, we'll use the statement:

    => CREATE LIBRARY wordCountLib AS '/home/dbadmin/CountWordsUDxLib.jar' 
       LANGUAGE 'Java';
    
Defining the UDFS Function

After loading the library that contains your UDSF, you must define the function in the Vertica catalog. This maps a SQL function name to a class contained in your library JAR file. Use the CREATE FUNCTION statement to define the function:

CREATE FUNCTION function_name AS LANGUAGE 'Java' NAME 'factory_class' LIBRARY library_name

Where:

  • function_name is the name you want to use in your SQL statements when calling the function. In this example, we'll use the name count_words.
  • factory_class is the fully-qualified name of your function's factory class. In our example, this value is com.example.udxs.CountWordFactory.
  • library_name is the name you assigned to the library earlier. In our example, this is wordCountLib.

In our example, we'll use the statement:

=> CREATE FUNCTION count_words as language 'Java' 
   NAME 'com.example.udxs.CountWordsFactory' library wordCountLib;

Testing Your UDSF

Now that we've deployed our UDSF, it's time to test it out. You can call a UDSF the same way you call many native Vertica functions. Let's call the function in a simple SELECT statement:

=> SELECT count_words('Four score and seven years ago...');
 count_words 
-------------
           6
(1 row)

Usually, you will use your UDSF within a query on a table. For example:

=> CREATE TABLE quotes (quote VARCHAR);
CREATE TABLE
=> COPY quotes FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Neither a borrower nor a lender be.
>> The play's the thing wherein I'll catch the conscience of the king.         
>> We are such stuff as dreams are made on, rounded with a little sleep
>> \.
=> SELECT count_words(quote), quote FROM quotes;
 count_words |                                quote                                 
-------------+----------------------------------------------------------------------
           7 | Neither a borrower nor a lender be.
          12 | The play's the thing wherein I'll catch the conscience of the king.
          14 | We are such stuff as dreams are made on, rounded with a little sleep
(3 rows)

When testing, you should always try passing your function all types of values that user might send it. That includes NULL values. Let's see what happens if we call count_words with a NULL:

=> SELECT count_words(NULL);
ERROR 3399:  Failure in UDx RPC call InvokeProcessBlock(): Unexpected exception in User Defined Object [count_words]
null

Oops. That's not good! In the next tutorial, we'll explain error handling and also several ways to handle NULL values.

Share this article: