Developing User-Defined Extensions (UDxs)
User-Defined Extensions (UDxs) are functions contained in external shared libraries that are developed in C++, Python, Java, or R using the Vertica SDK. The external libraries are defined in the Vertica catalog using the CREATE LIBRARY statement. They are best suited for analytic operations that are difficult to perform in SQL, or need to be performed frequently enough that their speed is a major concern.
The primary strengths of UDxs are:
- They can be used anywhere an internal function can be used.
- They take full advantage of Vertica's distributed computing features. The extensions usually execute in parallel on each node in the cluster.
- Vertica handles the distribution of the UDx library to the individual nodes. You only need to copy the library to the initiator node.
- All of the complicated aspects of developing a distributed piece of analytic code are handled for you by Vertica. Your main programming task is to read in data, process it, and then write it out using the Vertica SDK APIs.
There are a few things to keep in mind about developing UDxs:
- UDxs can be developed in the programming languages C++, Python, Java, and R. (Not all UDx types support all languages.)
- UDxs written in Java always run in Fenced Mode, because the Java Virtual Machine that executes Java programs cannot run directly within the Vertica process.
- UDxs written in Python and R always run in Fenced Mode
- UDxs developed in C++ have the option of running in unfenced mode, which means they load and run directly in the Vertica database process. This option provides the lowest overhead and highest speed. However, any bugs in the UDx's code can cause database instability. You must thoroughly test any UDxs you intend to run in unfenced mode before deploying them in a live environment. Consider whether the performance boost of running a C++ UDx unfenced is worth the potential database instability that a buggy UDx can cause.
- Because a UDx runs on the Vertica cluster, it can take processor time and memory away from the database processes. A UDx that consumes large amounts of computing resources can negatively impact database performance.
Each UDx type consists of two classes. The main class does the primary work (a transformation, an aggregation, and so on). The class usually has at least three methods: one to set up, one to tear down (release reserved resources), and one to do the actual work. Sometimes additional methods are defined.
The main processing method receives an instance of the
ServerInterface class as an argument. This object is used by the underlying Vertica SDK code to make calls back into the Vertica process, for example to allocate memory. You can use this class to write to the server log during UDx execution.
The second class is a singleton factory. It defines one method that produces instances of the first class, and might define other methods to manage parameters.
When implementing a UDx you must subclass both classes.
The C++, Python, and Java APIs are nearly identical. Where possible, this documentation describes these interfaces without respect to language. Documentation specific to C++, Python, or Java is covered in language-specific sections.
Because some documentation is language-independent, it is not always possible to use ideal, language-based terminology. This documentation uses the term "method" to refer to a Java method or a C++ member function.