Migrating From the Vertica Spark Connector V1 to V2
If you have an existing Spark 3.0 or later cluster that uses the Vertica Spark connector V1, consider upgrading to the connector V2. The connector V2 offers new features, better performance, and is under ongoing development. The connector V1 is deprecated and will eventually be removed from service. See Available Connector Versions for more information about how the connector V2 compares to the connector V1.
When transitioning from the connector V1 to the connector V2, you must take several changes into account.
Deployment Changes Between the Connector V1 and V2
The connector V1 is only distributed with the Vertica server install. To use it, you must copy it from a Vertica node to your Spark cluster. The connector V2 is distributed through several channels, giving you more ways to deploy it.
The Spark connector V2 is available from Maven Central. If you use Gradle, Maven, or SBT to manage your Spark applications, you may find it more convenient to deploy the Spark connector using a dependency rather than manually installing it on your Spark cluster. Integrating the Spark Connector into your Spark project as a dependency makes updating to newer versions of the connector easy—just update the required version in the dependency. See Getting the Connector from Maven Central for more information.
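For example, with sbt you could declare the connector as a dependency similar to the following. The group ID, artifact ID, and version placeholder shown here are assumptions for illustration; check Maven Central for the connector's current coordinates:

```scala
// build.sbt -- coordinates are illustrative; verify them on Maven Central
libraryDependencies += "com.vertica.spark" % "vertica-spark" % "<latest-version>"
```

Gradle and Maven users declare the equivalent dependency in build.gradle or pom.xml.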
You can also download the precompiled connector V2 assembly or build it from source. In this case, you must deploy the connector to your Spark cluster. The connector V1 depends on the Vertica JDBC driver and requires that you separately include it. The connector V2 is an assembly that incorporates all of its dependencies, including the JDBC driver. You only need to deploy a single JAR file containing the Spark connector V2 to your Spark cluster.
You can have Spark load both the connector V1 and the connector V2 at the same time because the connector's primary class name is different in V2 (see below). This renaming lets you add the connector V2 to your Spark configuration files without having to immediately port all of your Spark applications that use the connector V1 to the new V2 API. You can just add the V2 assembly JAR file to the spark.jars list in the spark-defaults.conf file.
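For example, if you copied the connector assembly to /opt/spark/extra-jars (an illustrative path; the JAR file name also varies by release), the spark-defaults.conf entry might look like this:

```
spark.jars /opt/spark/extra-jars/vertica-spark-assembly.jar
```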
API Changes
There are several API changes from the connector V1 to the connector V2 that require changes to your Spark application.
VerticaRDD Class No Longer Supported
The connector V1 supported a class named VerticaRDD to perform data loading from Vertica using the Spark resilient distributed dataset (RDD) feature (see Using the Vertica RDD API with the Connector V1). The connector V2 does not support this separate class. Instead, if you want to manipulate an RDD directly, access it through the DataFrame object you create using the DataSource API.
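A minimal sketch of that pattern, assuming a SparkSession named spark and a populated options map named opts (both hypothetical names for this example):

```scala
// Load data through the V2 DataSource API; opts holds connection
// options such as host, db, user, password, and table.
val df = spark.read
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(opts)
  .load()

// Access the underlying RDD through the DataFrame instead of
// the removed VerticaRDD class.
val rdd = df.rdd
```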
DefaultSource Class Renamed VerticaSource
The primary class in the connector V1 is named DefaultSource. In the connector V2, this class has been renamed to VerticaSource. This renaming lets both connectors coexist, allowing you to gradually transition your Spark applications from the connector V1 to the connector V2.
For your existing Spark application to use the connector V2, you must change calls to the DefaultSource class to the VerticaSource class. For example, suppose your Spark application has this method call to read data from the connector V1:
spark.read.format("com.vertica.spark.datasource.DefaultSource").options(opts).load()
To have it use the connector V2, change the call to this:
spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opts).load()
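The same one-line change applies to writes, assuming your V1 application saved a DataFrame (df here is a placeholder) through the DefaultSource class:

```scala
import org.apache.spark.sql.SaveMode

// V1 write call:
df.write.format("com.vertica.spark.datasource.DefaultSource")
  .options(opts).mode(SaveMode.Append).save()

// V2 write call -- only the source class name changes:
df.write.format("com.vertica.spark.datasource.VerticaSource")
  .options(opts).mode(SaveMode.Append).save()
```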
Changed API Options
In addition to the renaming of the DefaultSource class to VerticaSource, some of the option names for the primary connector class have changed. Other options are no longer supported. If you are porting a Spark application from the connector V1 to the connector V2 that uses one of the following options, you must update your code:
V1 DataSource Option | V2 VerticaSource Option | Description |
---|---|---|
fileformat | none | The connector V2 does not support the fileformat option. The files that Vertica and Spark write to the intermediate storage location are always in parquet format. |
hdfs_url | staging_fs_url | The location of the intermediate storage location that Vertica and Spark use to exchange data. Renamed to be more general, as future versions of the connector will support storage platforms in addition to HDFS. |
logging_level | none | The connector no longer supports setting a logging level. Instead, set the logging level in Spark. |
numpartitions | num_partitions | The number of Spark partitions to use when reading data from Vertica. |
target_table_ddl | target_table_sql | A SQL statement for Vertica to execute before loading data from Spark. |
web_hdfs_url | none | The connector V2 does not support using the web HDFS protocol as a fallback. |
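As an illustration of these renames, an options map built for the connector V1 might be updated for the connector V2 as follows (all values are placeholders):

```scala
// V1 option names (values are placeholders):
val optsV1 = Map(
  "hdfs_url"         -> "hdfs://hdfs-host:8020/tmp/staging",
  "numpartitions"    -> "16",
  "target_table_ddl" -> "CREATE TABLE target (a INT)"
)

// Equivalent V2 option names:
val optsV2 = Map(
  "staging_fs_url"   -> "hdfs://hdfs-host:8020/tmp/staging",
  "num_partitions"   -> "16",
  "target_table_sql" -> "CREATE TABLE target (a INT)"
)
```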
In addition, the connector V2 has added options to support new features such as Kerberos authentication. For details on the connector V2's VerticaSource options API, see the Vertica Spark connector GitHub project.
For details on the DataSource options API, see V1 Spark Connector Load Options and V1 Spark Connector Save Options.
Take Advantage of New Features
The Vertica Spark connector V2 offers new features that you may want to take advantage of.
Currently, the most notable new feature in the connector V2 is support for Kerberos authentication. This feature lets you configure the connector for passwordless connections to Vertica. See the Kerberos documentation in the Vertica Spark connector GitHub project for details on using this feature.