The Spark Connector V1

Use the legacy Vertica Spark connector V1 with versions of Spark older than 3.0 or versions of Scala earlier than 2.12. If your Spark cluster is running Spark 3.0 or later, use the newer connector V2 instead: it is faster, has more features, and is fully supported. Connector V1 is deprecated and will eventually be removed from distribution.

Getting the Connector

The legacy Spark connector V1 is packaged as a JAR file. You install this file on your Spark cluster to enable Spark and Vertica to exchange data. In addition to the connector JAR file, you also need to install the Vertica JDBC client library. The connector uses this library to connect to Vertica.

Both of these libraries are distributed with the Vertica server and are available on all nodes in the Vertica cluster in the following locations:

  • The Spark connector files are located in /opt/vertica/packages/SparkConnector/lib.
  • The JDBC client library is /opt/vertica/java/vertica-jdbc.jar.

Choosing the Correct Connector Version

Vertica supplies multiple versions of the connector JAR file. Each file is compatible with one or more versions of Apache Spark and a specific version of Scala. The connector file you need depends on the versions of Apache Spark and Scala you have installed. You can determine your Spark and Scala versions by starting a Spark shell:

$ spark-shell
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://node01:4040
Spark context available as 'sc' (master = local[*], app id = local-1488824765565).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.3-8
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.

The startup messages contain the version numbers of both Spark and Scala. In the previous example, the Spark version is 2.1.0.2.6.0.3-8 and the Scala version is 2.11.8.
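If the shell is already running, you can also get both version numbers programmatically. The following is a minimal sketch using the standard Spark and Scala APIs; sc is the SparkContext that spark-shell creates for you, and the output shown matches the versions in the previous example:

scala> // Spark version, reported by the SparkContext that spark-shell creates
scala> sc.version
res0: String = 2.1.0.2.6.0.3-8

scala> // Version of Scala the shell is running on
scala> scala.util.Properties.versionNumberString
res1: String = 2.11.8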

The list in Vertica Integration for Apache Spark tells you which version of the connector JAR file you need for each combination of Spark and Scala. Note that some versions of the connector are compatible with multiple versions of Spark. For example, the connector for Spark 2.1 is also compatible with Spark 2.2.

Deploying the Connector

Once you have the connector and JDBC library JAR files, you can deploy them to your Spark cluster in two ways:

  • Include the connector and Vertica JDBC JAR files using the --jars option when invoking spark-submit or spark-shell.
  • Deploy the connector to every node in the Spark cluster so that all Spark applications have access to it.

Copying the Connector for Use with Spark Submit or Spark Shell

The easiest way to deploy the connector to your Spark cluster is to copy the JAR files to a single Spark node and include them in a spark-submit or spark-shell command line.

  1. Transfer both the Vertica Spark connector and Vertica JDBC driver JAR files to a directory on a Spark node.
  2. Log into the Spark node where you copied the files.
  3. Add the connector and JDBC files to the --jars argument of your spark-submit or spark-shell command line. If you are not in the directory where you downloaded the connector and JDBC JAR files, specify the path to them.

    • To start a Spark application using spark-submit if you are in the same directory as the JAR files:

      spark-submit --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-10.1.0-0.jar other options SparkApplication.jar  

      Do not include a space before or after the comma in the --jars argument.

    • To start an interactive Spark shell using spark-shell if you are in the same directory as the JAR files:

      spark-shell --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-10.1.0-0.jar other options

The version numbers in the JAR file names will vary depending on your version of Vertica, Spark, and Scala.
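
With the shell running and the connector and JDBC driver on its classpath, you can load Vertica data into a DataFrame. The following is a minimal read sketch; it assumes the V1 connector's data source name is com.vertica.spark.datasource.DefaultSource, and the connection values (verticahost, mydb, and so on) are placeholders you must replace with your own:

// A minimal sketch of reading a Vertica table through connector V1.
// The data source name and option keys below are assumptions based on
// the V1 connector; the connection values are placeholders.
val opts = Map(
  "host"     -> "verticahost",  // a node in your Vertica cluster
  "db"       -> "mydb",         // database name
  "user"     -> "dbadmin",      // database user
  "password" -> "secret",       // database password
  "dbschema" -> "public",       // schema containing the table
  "table"    -> "mytable"       // table to read
)

val df = spark.read
  .format("com.vertica.spark.datasource.DefaultSource")
  .options(opts)
  .load()

df.show(5)  // display the first five rows to verify the connection

If the load succeeds, df is an ordinary Spark DataFrame, so all the usual transformations and actions apply.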

Add the Connector to the Spark Cluster's Configuration

Including the JAR files in every Spark command line is cumbersome. Instead, you can configure Spark to load the connector automatically. This approach gives all Spark applications on your cluster the ability to transfer data with Vertica.

To deploy to the Spark cluster:

  1. Copy the connector and JDBC JAR files to a common path on all nodes in your Spark cluster.
  2. Add the paths of the connector and JDBC driver JAR files to your conf/spark-defaults.conf file, and restart the Spark master. For example, modify the spark.jars line by adding the connector and JDBC JAR files as follows (replace the paths and version numbers with your own values):

    spark.jars /JAR_file_Path/vertica-spark2.0_scala2.11.jar,/JAR_file_Path/vertica-jdbc-10.1.0-0.jar

See the Spark Documentation's Configuration page for more information on including JAR files in your Spark jobs.
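
After the restart, you can verify that the connector is available to every application without passing --jars. One quick check, assuming the V1 connector class name com.vertica.spark.datasource.DefaultSource, is to start a plain spark-shell and ask the JVM to load the class:

scala> // Throws ClassNotFoundException if the connector JAR was not picked up
scala> Class.forName("com.vertica.spark.datasource.DefaultSource")
res0: Class[_] = class com.vertica.spark.datasource.DefaultSource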