Developing UDxs in Java Tutorial Part 1: Introduction and Setting Up
Welcome to the first part of our tutorial series on developing User Defined Extensions (UDxs) with the Java programming language. In this part, you’ll learn what UDxs are and why you would want to develop one. We’ll also show you how to set up a development environment to make developing your UDxs easier.
What is a UDx?
User Defined Extensions let you add your own features to Vertica. They help you analyze and transform your database’s data in ways that are difficult or impossible to do using SQL alone.
UDxs are broken down into two broad categories:
- User Defined Functions (UDFs) provide functions that you can call from within your SQL statements (usually, a SELECT statement).
- User Defined Loads (UDLs) let you replace one or more steps in the data load process.
Each category has several types of UDxs, and each type fills a specific role in processing your data. Future installments of this tutorial series explain how to develop and use each type.
Why Create a UDx?
If you’ve ever found yourself thinking “if I could just do X with my data…” then you may want to create a UDx. (Well, for many values of X, at least.) For example:
- Load data from a data source other than a flat file or stream. For example, load data from a web API or extract it from PDF documents. If you can write code to extract the data, you can load it into Vertica.
- Load data compressed using an unsupported compression format.
- Load data encoded with an unsupported character encoding. (Finally! A way to process all of that data in EBCDIC format you have lying around on punch cards!)
- Transform data in one table into a completely different format. For example, extract words from a VARCHAR column and turn them into a word cloud based on their frequency.
- Use a programming library that performs the data analytics you want to run on your data stored in Vertica.
How Do You Create a UDx?
You develop UDxs in one of four programming languages: C++, Python, Java, or R. Vertica provides a Software Development Kit (SDK) for each of these languages. The SDK reduces the complexity of creating a distributed data processing or analytic function. Instead of worrying about organizing and distributing the data, your code only has to process the blocks of data that Vertica sends to it and return the results.
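To make the "process a block, return the results" idea concrete, here is a minimal conceptual sketch in plain Java. It deliberately does not use the Vertica SDK: the class name, the processBlock method, and the use of lists as "blocks" are all hypothetical stand-ins chosen for illustration. The real SDK classes are introduced in later parts of this series.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual stand-in only. None of these types come from the Vertica SDK.
class BlockProcessingSketch {

    // Hypothetical UDx-style logic: receive one block of rows and
    // return exactly one result per row.
    static List<Long> processBlock(List<long[]> block) {
        List<Long> results = new ArrayList<Long>();
        for (long[] row : block) {
            results.add(row[0] + row[1]); // the per-row work: add two columns
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in for the database, which decides how rows are split into blocks.
        List<long[]> block1 = Arrays.asList(new long[] {1, 2}, new long[] {3, 4});
        List<long[]> block2 = Arrays.asList(new long[] {10, 20}, new long[] {30, 40});

        System.out.println(processBlock(block1)); // prints [3, 7]
        System.out.println(processBlock(block2)); // prints [30, 70]
    }
}
```

The point of the sketch is the division of labor: the caller decides which rows land in which block, and the function only ever sees one block at a time.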
What Do These Tutorials Cover?
These tutorials explain how to develop UDxs using Java. The topics include:
- Developing each type of UDx supported in the Vertica Java SDK:
  - User Defined Scalar Functions
  - User Defined Transform Functions
  - User Defined Analytic Functions
  - User Defined Sources
  - User Defined Filters
  - User Defined Parsers
- Handling different numbers of arguments
- Handling parameters
- Debugging your Java UDx
What Do You Need to Know?
These tutorials assume:
- You know the basics of using Vertica, such as how to create tables and run queries. If you are familiar with other SQL databases, you should be fine.
- You know how to program in Java.
- You are familiar with an IDE. In this tutorial, we'll use the Eclipse IDE.
Setting Up Your Development Environment
Before you start developing UDxs, you need to set up a development environment. You can develop UDxs on any platform that supports Java. However, developing your UDx on a Linux system that has Vertica installed makes testing your UDx easier. If you choose to develop on a non-Linux system, you must transfer your compiled UDx to a Vertica database in order to test it. These tutorials assume you are developing UDxs on a Linux system on which you have also installed a single-node Vertica database.
Important: Never use your production Vertica database as a development platform for UDxs. Because Java UDxs run in a sandbox, they cannot corrupt or crash your database. However, bugs in your UDx can still consume enough RAM and CPU to cause performance issues in your Vertica cluster.
If you do not have a physical Linux system to use as a development system, you can create a virtual development system using virtualization software such as VMware, Oracle VM VirtualBox, or Microsoft Hyper-V. You could start with the pre-configured CentOS-based Vertica virtual machine, which is available in Open Virtualization Format or VMware VMX file format. You can download it at https://www.vertica.com/download/vertica/community-edition/ (free registration required). You can also create your own VM and then install Vertica using a community license.
Whether you choose to develop on a physical or virtual system, your development environment must meet the following requirements:
- It must run a version of Linux supported by Vertica. See the Vertica Supported Platforms guide for details.
- It must have sufficient resources to run both Vertica and Eclipse. Your development system should have at least 8 GB of memory and two processor cores.
- It must have a supported Java Development Kit (JDK) installed. Vertica supports Oracle Java Platform Standard Edition and OpenJDK. Be sure to install version 6, 7, or 8.
- You must configure your single-node Vertica database to enable Java UDxs. See Installing Java on Hosts in the Vertica documentation to learn how to configure Vertica to enable Java UDxs.
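As a quick sanity check of the JDK requirement above, a small standalone program can report which Java version it is running on. This is a hypothetical helper written for this tutorial, not part of the Vertica SDK. It relies only on the standard java.specification.version system property, which reports "1.6", "1.7", or "1.8" for Java 6 through 8.

```java
// Hypothetical helper, not part of the Vertica SDK: checks whether the running
// JVM is one of the Java versions this tutorial lists (6, 7, or 8).
class JdkVersionCheck {

    // Java 6 through 8 report "1.6", "1.7", and "1.8" in
    // java.specification.version; Java 9 and later report "9", "10", ...
    static boolean isSupported(String specVersion) {
        return specVersion.equals("1.6")
            || specVersion.equals("1.7")
            || specVersion.equals("1.8");
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        String verdict = isSupported(spec)
            ? "is one of the versions listed in this tutorial"
            : "is NOT one of the versions listed in this tutorial";
        System.out.println("Java specification version " + spec + " " + verdict);
    }
}
```

Compile and run it with the same JDK that you plan to select in Eclipse, so that the check reflects the toolchain you will actually build with.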
In addition to your development system, you should also have a test cluster. Before you deploy your UDx to a production database, always perform further testing on a multi-node test cluster. If you do not have the hardware for a physical test cluster, you can create a virtual three-node cluster using a Vertica Community license.
To make developing even easier, consider developing your UDxs while logged into your development Linux system as the dbadmin user. This ensures that you have the permissions necessary to deploy your UDx. Another option is to add a user to your test Vertica database whose name matches your Linux user name. Then grant the Vertica user superuser privileges.
These tutorials assume you are using the Eclipse Foundation's Eclipse IDE for Java Developers to develop Java UDxs.
You need the following two files on the Vertica node:
- /opt/vertica/bin/VerticaSDK.jar contains the Vertica Java SDK and other supporting files.
- /opt/vertica/sdk/BuildInfo.java contains version information about the SDK. You must compile this file and include it within your Java UDx JAR files.
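Before configuring Eclipse, it can help to confirm that both files are where the tutorial expects them. The following is a minimal sketch of such a pre-flight check. The class and method names are hypothetical, and the default root of /opt/vertica is simply taken from the two paths above. Pass a different root as the first argument if your installation differs.

```java
import java.io.File;

// Hypothetical pre-flight check, not part of the Vertica SDK.
class SdkFileCheck {

    // Paths relative to the Vertica install root, as given in this tutorial.
    static final String[] REQUIRED = { "bin/VerticaSDK.jar", "sdk/BuildInfo.java" };

    // Prints the status of each required file and returns how many are missing.
    static int countMissing(File verticaRoot) {
        int missing = 0;
        for (String relative : REQUIRED) {
            File f = new File(verticaRoot, relative);
            boolean found = f.isFile();
            System.out.println((found ? "found    " : "MISSING  ") + f.getPath());
            if (!found) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Default root assumed from the paths above; override with an argument.
        File root = new File(args.length > 0 ? args[0] : "/opt/vertica");
        System.exit(countMissing(root) == 0 ? 0 : 1);
    }
}
```

A non-zero exit status means at least one file is missing, which usually indicates that you are not on a Vertica node or that Vertica is installed in a different location.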
To create a new Java project in Eclipse and set it up with the Vertica Java SDK:
1. On the File menu, click New > Java Project to open the New Java Project wizard.
2. Choose your JDK version on the first pane and click Next.
3. On the Java Settings pane, click the Libraries tab.
4. Click Add External JARs, and select the /opt/vertica/bin/VerticaSDK.jar file.
5. Click Finish.
6. On the File menu, click Import and select General > File System.
7. Click Next.
8. Enter /opt/vertica/sdk in the From directory box and select BuildInfo.java. Then enter <Project Folder>/src in the Into folder box.
9. Click Finish.
10. If an error appears in BuildInfo.java, open the file and look at the error on the "package com.vertica.sdk;" line. The file was imported into the default package, so its location does not match its package declaration.
11. To fix the error, click Move 'BuildInfo.java' to package 'com.vertica.sdk'.
Now, you are ready to develop UDxs!
In the next tutorial, we'll learn how to develop a User Defined Scalar Function.
For more information about the Java SDK, see the Vertica documentation.