Integrating with Apache Spark
Apache Spark is an open-source, general-purpose cluster-computing framework. See the Apache Spark website for more information. Vertica provides a connector that you install into Spark that lets you transfer data between Vertica and Spark.
Using the connector, you can:
- Copy large volumes of data from Spark DataFrames to Vertica tables. This feature lets you save your Spark analytics in Vertica.
- Copy data from Vertica to Spark RDDs or DataFrames for use with Python, R, Scala and Java. The connector efficiently pushes down column selection and predicate filtering to Vertica before loading the data.
The connector lets you use Spark to preprocess data for Vertica and to use Vertica data in your Spark application. You can even round-trip data from Vertica to Spark—copy data from Vertica to Spark for analysis, and then save the results of that analysis back to Vertica.
How the Connector Works
The Vertica Apache Spark connector is a library that you incorporate into your Spark applications to read data from and write data to Vertica. When transferring data, the connector uses an intermediate storage location as a buffer between the Vertica and Spark clusters. Using the intermediate storage location lets both Vertica and Spark use all of the nodes in their clusters to transfer data in parallel. Currently, the connector only supports using HDFS as an intermediate storage location.
When transferring data from Vertica to Spark, the connector tells Vertica to write the data as Parquet files in the intermediate storage location. As part of this process, the connector pushes down the required columns and any Spark data filters into Vertica as SQL. This push down lets Vertica pre-filter the data so it only copies the data that Spark needs. Once Vertica finishes copying the data, the connector has Spark load it into DataFrames from the intermediate location.
When transferring data from Spark to Vertica, the process is reversed. Spark writes data into the intermediate storage location. It then connects to Vertica and runs a COPY statement to load the data from the intermediate location into a table. When loading data into Vertica, you can choose to have the data
There are two main versions of the Spark connector:
The legacy connector (now referred to as the connector V1) is a closed-source connector distributed as part of the Vertica server install. This was the only version of the connector available before Vertica version 10.1.1. No new features will be added to this version of the connector. However it will remain available for use with older versions of Spark or Scala that are not supported by the newer connector. See The Spark Connector V1 for more information.
The Spark connector V2 is a new, open-source version of the connector. It does not support versions of Spark older than 3.0 or Scala versions earlier than 2.12. This version of the connector is distributed separately from the Vertica server. See The Spark Connector V2 for more information.
Each of these connectors may have sub-versions that work with specific version combinations of Spark and Scala.
The new connector differs from the old in the following ways:
- The new connector uses the Spark V2 API. Using this newer API makes it more future-proof than the connector V1, which uses an older Spark API.
- The primary class name has changed. Also, the primary class has several renamed configuration options and a few removed options. See Migrating From the Vertica Spark Connector V1 to V2 for a list of these changes.
- It supports more features than the older connector V1, such as Kerberos authentication.
- It comes compiled as an assembly which contains supporting libraries such as the Vertica JDBC library. To use the connector V1, you must also configure the JDBC library yourself.
- It is distributed separate from the Vertica server. You can directly download it from the GitHub project. It is also available from the Maven Central Repository, making it easier for you to integrate it into your Gradle, Maven, or SBT workflows.
- It is an open-source project that is not tied to the Vertica server release cycle. New features and bug fixes do not have to wait for a Vertica release. You can also contribute your own features and fixes.
Use the connector V2 if your environment supports it. It has more features than the connector V1. If your Spark cluster is running a version of Spark earlier than 3.0, continue to use the connector V1. You can load both versions of the connector into Spark at the same time, so you can gradually port your Spark applications to the new version.
To use the Vertica Spark connector, you must have:
- An installed and configured Vertica database. See Installing Vertica and the Configuring the Database for details.
- An Apache Spark cluster supported by one of the versions of the Vertica Spark connector. See Vertica Integration for Apache Spark for details on the versions of Spark and Scala supported by the connector.
- An intermediate storage location where Vertica and Spark can save data being transferred to the other system. See the next section for details.
Intermediate Storage Configuration
When transferring data, the Spark connector has Spark and Vertica write data to a temporary storage location Currently, the connector supports using HDFS as an intermediate storage location. Both Spark and Vertica must be configured to access this location. Dedicate a directory in the intermediate storage location for transferring files between Spark and Vertica. A dedicated storage can help prevent interference from other applications.
On the Vertica size, you must configure your database to read data from and write data to the HDFS storage location. See Using HDFS URLs for the steps you need to take to configure Vertica to be able to access the intermediate storage location.
For Spark, you will need any credentials the HDFS storage location requires to access it.
For supported versions of Apache Spark and Apache Hadoop see the following sections in the Supported Platforms guide: