The Spark Connector V2

Use the Spark Connector V2 when your Spark cluster runs Spark 3.0 or later. The connector V2 is not backwards-compatible with versions of Spark earlier than 3.0 or versions of Scala earlier than 2.12.

Getting the Connector

The Spark connector V2 is an open source project. For the latest information on the Spark connector, visit the spark-connector project on GitHub. To get an alert when there are updates to the connector, you can log into a GitHub account and click the Notifications button on any of the project's pages.

You have three options to get the Spark connector V2:

  • Get the connector from Maven Central. If you use Gradle, Maven, or SBT to manage your Spark applications, you can list the connector as a dependency in your project so your build tool automatically downloads the connector and its dependencies and adds them to your Spark application. See Getting the Connector from Maven Central below.
  • Download a precompiled assembly from the GitHub project's releases page. Once you have downloaded the connector, you must configure Spark to use it. See Deploy the Connector to the Spark Cluster below.
  • Clone the connector project and compile it. This option is useful if you want features or bugfixes that have not been released yet. See Compiling the Connector below.

Getting the Connector from Maven Central

The Vertica Spark connector is available from the Maven Central Repository. Using Maven Central is the easiest method of getting the connector if you are already using a build tool that can download dependencies from it.

If your Spark project is managed by Gradle, Maven, or SBT, you can add the Spark connector to it by listing it as a dependency in the project's configuration file. You may also need to enable Maven Central if your build tool does not do so automatically.

For example, suppose you use SBT to manage your Spark application. In this case, the Maven Central repository is enabled by default. All you need to do is add the com.vertica.spark dependency to your build.sbt file.
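
For illustration, the dependency entry in build.sbt might look like the following sketch. The group ID comes from the connector's Maven coordinates, but the artifact name and version shown here are placeholders; check Maven Central for the coordinates that match your Scala and Spark versions:

// build.sbt (sketch): pull the Vertica Spark connector from Maven Central.
// The artifact name and version below are illustrative placeholders.
libraryDependencies += "com.vertica.spark" % "vertica-spark_2.12" % "x.x.x"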

See the com.vertica.spark page on the Maven Repository site for more information on the available Spark connector versions and dependency information for your build system.

Compiling the Connector

You may choose to compile the Spark connector V2 if you want to test new features or bugfixes that have not yet been released. Compiling the connector is necessary if you plan on contributing your own features.

To compile the connector, you need:

  • The SBT build tool. To compile the connector, you must install SBT and all of its dependencies (including a Java Development Kit). See the SBT documentation for requirements and installation instructions.
  • git to clone the Spark connector V2 source from GitHub.

As a quick overview, executing the following commands on a Linux command line will download the source and compile it into an assembly file:

$ git clone https://github.com/vertica/spark-connector.git
$ cd spark-connector/connector
$ sbt assembly

Once compiled, the connector assembly is located at target/scala-n.n/spark-vertica-connector-assembly-x.x.jar, where n.n is the supported version of Scala and x.x is the current version of the Spark connector.

See the CONTRIBUTING document in the connector's GitHub project for detailed requirements and compiling instructions.

Once you compile the connector, you must deploy it to your Spark cluster. See the next section for details.

Deploy the Connector to the Spark Cluster

If you downloaded the Spark connector V2 from GitHub or compiled it yourself, you must deploy it to your Spark cluster before you can use it. You have two options: copy it to a Spark node and include it in a spark-submit or spark-shell command, or deploy it to the entire cluster and have it loaded automatically.

Loading the Spark Connector from the Command Line

The quickest way to use the connector is to include it in the --jars argument when executing a spark-submit or spark-shell command. To use the connector on the command line, first copy its assembly JAR to the Spark node on which you will run the commands. Then add the path to the assembly JAR file as part of the --jars command-line argument.

For example, suppose you copied the assembly file to your current directory on a Spark node. Then you could load the connector when starting spark-shell with the command:

spark-shell --jars spark-vertica-connector-assembly-1.0.jar

You could also enable the connector when submitting a Spark job by adding the same --jars argument to your spark-submit command:

spark-submit --jars spark-vertica-connector-assembly-1.0.jar
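
In either case, once the connector's assembly JAR is on the classpath, you can use it like any other Spark data source. The following Scala sketch reads a Vertica table into a DataFrame. The format string and option names follow the connector project's examples, but treat them as assumptions to verify against the connector documentation; the host, credentials, table name, and staging location are placeholders:

// Minimal sketch: read a Vertica table through the connector V2 data source.
// All option values are placeholders; replace them with your own settings.
val df = spark.read
  .format("com.vertica.spark.datasource.VerticaSource")
  .option("host", "vertica.example.com")   // Vertica node to connect to
  .option("db", "exampledb")               // Vertica database name
  .option("user", "dbadmin")               // database user
  .option("password", "example_password")  // database password
  .option("table", "example_table")        // table to read
  .option("staging_fs_url", "hdfs://hdfs.example.com:8020/data/staging") // intermediate storage for the data transfer
  .load()

df.show()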

Configure Spark to Automatically Load the Connector

You can also choose to configure your Spark cluster to automatically load the connector. Deploying the connector this way ensures that the Spark connector is available to all Spark applications on your cluster.

To have Spark automatically load the Spark connector:

  1. Copy the Spark connector's assembly JAR file to the same path on all of the nodes in your Spark cluster. For example, on Linux you could copy the spark-vertica-connector-assembly-1.0.jar file to the /usr/local/lib directory on every Spark node. The file must be in the same location on each node.

  2. In your Spark installation directories on each node in your cluster, edit the conf/spark-defaults.conf file to add or alter the following line:

    spark.jars /path_to_assembly/spark-vertica-connector-assembly-1.0.jar

    For example, if you copied the assembly JAR file to /usr/local/lib, you would add:

    spark.jars /usr/local/lib/spark-vertica-connector-assembly-1.0.jar

    If the spark.jars line already exists in the configuration file, add a comma to the end of the line, then add the path to the assembly file. Do not add a space between the comma and the surrounding values.
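
    For example, if the line already lists another JAR (shown here as a placeholder path), the combined entry looks like this:

    spark.jars /path/to/existing.jar,/usr/local/lib/spark-vertica-connector-assembly-1.0.jar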

    If you have the JAR files for the Spark connector V1 in your configuration file, you do not need to remove them. Both connector versions can be loaded into Spark at the same time without a conflict.

  3. Test your configuration by starting a Spark shell and entering the statement:

    import com.vertica.spark._ 

    If the statement completes successfully, then Spark was able to locate and load the Spark connector library correctly.
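
    To verify further, you can write a small test DataFrame to Vertica through the connector. The following Scala sketch assumes the same data source and option names as the read example earlier in this topic; the option names are assumptions to check against the connector documentation, and every value is a placeholder for your own settings:

    import org.apache.spark.sql.SaveMode

    // Minimal sketch: append a small test DataFrame to a Vertica table.
    // All option values are placeholders.
    val connectionOptions = Map(
      "host" -> "vertica.example.com",
      "db" -> "exampledb",
      "user" -> "dbadmin",
      "password" -> "example_password",
      "table" -> "example_table",
      "staging_fs_url" -> "hdfs://hdfs.example.com:8020/data/staging"
    )

    val testData = Seq((1, "a"), (2, "b")).toDF("id", "value")

    testData.write
      .format("com.vertica.spark.datasource.VerticaSource")
      .options(connectionOptions)
      .mode(SaveMode.Append)
      .save()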

In This Section