Migrating From the Vertica Spark Connector V1 to V2

If you have an existing Spark 3.0 or later cluster that uses the Vertica Spark connector V1, consider upgrading to the connector V2. The connector V2 offers new features, better performance, and is under ongoing development. The connector V1 is deprecated and will eventually be removed from service. See Available Connector Versions for more information about how the connector V2 compares to the connector V1.

When transitioning from the connector V1 to V2, there are several changes you must take into account.

Deployment Changes Between the Connector V1 and V2

The connector V1 is distributed only with the Vertica server installation. To use it, you must copy it from a Vertica node to your Spark cluster. The connector V2 is distributed through several channels, giving you more ways to deploy it.

The Spark connector V2 is available from Maven Central. If you use Gradle, Maven, or SBT to manage your Spark applications, you may find it more convenient to deploy the Spark connector using a dependency rather than manually installing it on your Spark cluster. Integrating the connector into your Spark project as a dependency also makes updating to newer versions easy: just update the required version in the dependency. See Getting the Connector from Maven Central for more information.
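For example, if you manage your application with SBT, a minimal sketch of the dependency declaration in build.sbt might look like the following. The group ID and artifact ID shown reflect the project's published coordinates, but verify them and the current version number on Maven Central before use:

// build.sbt: pull the Spark connector V2 from Maven Central as a library dependency
libraryDependencies += "com.vertica.spark" % "vertica-spark" % "3.x.x" // replace 3.x.x with the current release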

You can also download the precompiled connector V2 assembly or build it from source. In either case, you must deploy the connector to your Spark cluster yourself. The connector V1 depends on the Vertica JDBC driver and requires that you install the driver separately. The connector V2 is an assembly that incorporates all of its dependencies, including the JDBC driver, so you only need to deploy a single JAR file to your Spark cluster.

You can have Spark load both the connector V1 and V2 at the same time because the connector's primary class name is different in V2 (see below). This renaming lets you add the connector V2 to your Spark configuration files without having to immediately port all of your Spark applications that use the connector V1 to the new V2 API. You can just add the V2 assembly JAR file to the spark.jars list in the spark-defaults.conf file.
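For example, a spark-defaults.conf entry that loads both connector versions side by side might look like the following (the file names and paths are hypothetical; remember that the connector V1 also needs the Vertica JDBC driver JAR):

# spark-defaults.conf: load the V1 and V2 connector JARs together
spark.jars /opt/spark/extra-jars/vertica-spark-v1.jar,/opt/spark/extra-jars/vertica-jdbc.jar,/opt/spark/extra-jars/vertica-spark-v2-assembly.jar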

API Changes

The connector V2 API differs from the V1 API in several ways that require you to update your Spark applications.

VerticaRDD Class No Longer Supported

The connector V1 supported a class named VerticaRDD to perform data loading from Vertica using the Spark resilient distributed dataset (RDD) feature (see Using the Vertica RDD API with the Connector V1). The connector V2 does not support this separate class. Instead, if you want to directly manipulate an RDD, access it through the DataFrame object you create using the DataSource API.
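For example, here is a minimal sketch of reading a Vertica table through the DataSource API and then working with the RDD behind the resulting DataFrame (the connection option values are illustrative):

// Load a Vertica table as a DataFrame, then access its underlying RDD
val opts = Map(
  "host" -> "vertica.example.com",
  "user" -> "dbadmin",
  "password" -> "xxxx",
  "db" -> "testdb",
  "table" -> "mytable",
  "staging_fs_url" -> "hdfs://hdfs.example.com:8020/tmp"
)
val df = spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opts).load()
val rdd = df.rdd // manipulate the data as an RDD[Row] from here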

DefaultSource Class Renamed VerticaSource

The primary class in the connector V1 is named DefaultSource. In the connector V2, this class has been renamed VerticaSource. This renaming lets both connectors coexist, allowing you to gradually transition your Spark applications from the connector V1 to V2.

For your existing Spark application to use the connector V2, you must change calls to the DefaultSource class to the VerticaSource class. For example, suppose your Spark application has this method call to read data using the connector V1:

spark.read.format("com.vertica.spark.datasource.DefaultSource").options(opts).load()

To have it use the connector V2, change the call to this:

spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opts).load()

Changed API Options

In addition to the renaming of the DefaultSource class to VerticaSource, some of the option names for the primary connector class have changed, and other options are no longer supported. If you are porting a Spark application from the connector V1 to V2 that uses one of the following options, you must update your code:

| V1 DefaultSource option | V2 VerticaSource option | Description |
|---|---|---|
| fileformat | none | The connector V2 does not support the fileformat option. The files that Vertica and Spark write to the intermediate storage location are always in Parquet format. |
| hdfs_url | staging_fs_url | The intermediate storage location that Vertica and Spark use to exchange data. Renamed to be more general, as future versions of the connector will support storage platforms in addition to HDFS. |
| logging_level | none | The connector no longer supports setting a logging level. Instead, set the logging level in Spark. |
| numpartitions | num_partitions | The number of Spark partitions to use when reading data from Vertica. |
| target_table_ddl | target_table_sql | A SQL statement for Vertica to execute before loading data from Spark. |
| web_hdfs_url | none | The connector V2 does not support using the web HDFS protocol as a fallback. |
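For example, an options map that used the renamed options under V1 might be updated for V2 like this (host names and URLs are illustrative):

// Connector V1 (DefaultSource) options
val optsV1 = Map("host" -> "vertica.example.com", "db" -> "testdb", "user" -> "dbadmin",
  "table" -> "mytable", "hdfs_url" -> "hdfs://hdfs.example.com:8020/tmp", "numpartitions" -> "16")

// Equivalent connector V2 (VerticaSource) options
val optsV2 = Map("host" -> "vertica.example.com", "db" -> "testdb", "user" -> "dbadmin",
  "table" -> "mytable", "staging_fs_url" -> "hdfs://hdfs.example.com:8020/tmp", "num_partitions" -> "16")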

In addition, the connector V2 has added options to support new features such as Kerberos authentication. For details on the connector V2's VerticaSource options API, see the Vertica Spark connector GitHub project.

For details on the connector V1 options API, see V1 Spark Connector Load Options and V1 Spark Connector Save Options.

Take Advantage of New Features

The Vertica Spark connector V2 offers new features that you may want to take advantage of.

Currently, the most notable new feature in the connector V2 is support for Kerberos authentication. This feature lets you configure the connector for passwordless connections to Vertica. See the Kerberos documentation in the Vertica Spark connector GitHub project for details of using this feature.
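A minimal sketch of what a Kerberos-enabled read might look like, assuming the kerberos_service_name and kerberos_host_name options described in the connector project's documentation (verify the option names, and the JAAS setup the connector requires, against the GitHub project):

// Kerberos connection: no password option; the connector authenticates through Kerberos
val kerberosOpts = Map(
  "host" -> "vertica.example.com",
  "db" -> "testdb",
  "user" -> "kerberized_user",
  "table" -> "mytable",
  "staging_fs_url" -> "hdfs://hdfs.example.com:8020/tmp",
  "kerberos_service_name" -> "vertica",         // assumed option name; see the connector docs
  "kerberos_host_name" -> "vertica.example.com" // assumed option name; see the connector docs
)
val df = spark.read.format("com.vertica.spark.datasource.VerticaSource").options(kerberosOpts).load()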