Integrating with Apache Spark
Welcome to the Vertica Connector for Apache Spark Guide.
The Vertica Connector for Apache Spark is a fast parallel connector that transfers data between the Vertica Analytics Platform and Apache Spark. It lets you use Spark to pre-process data for Vertica and use Vertica data in your Spark applications.
Apache Spark is an open-source, general-purpose cluster-computing framework. It evolved as a faster, multi-stage, in-memory alternative to the two-stage, disk-based MapReduce framework offered by Hadoop. The Spark framework is based on Resilient Distributed Datasets (RDDs), which are logical collections of data partitioned across machines. Spark is typically used in upstream workloads to process data before loading it into Vertica for interactive analytics. It can also be used downstream of Vertica, where data pre-processed by Vertica is moved into Spark for further transformation.
Using the Vertica Connector for Apache Spark, you can:
- Move large volumes of data from Spark DataFrames to Vertica tables.
- Move data from Vertica to Spark RDDs or DataFrames for use with Python, R, Scala and Java. The connector efficiently pushes down column selection and predicate filtering to Vertica before loading the data.
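As a rough illustration of moving data from Vertica to Spark, the sketch below loads a Vertica table into a Spark DataFrame from Python and then narrows it with a filter and a column selection, the two operations the connector pushes down. The data source class name `com.vertica.spark.datasource.DefaultSource` and the option names (`host`, `db`, `user`, `password`, `table`, `numPartitions`) are assumptions based on one version of the connector and may differ in yours. The helper names and all values are hypothetical.

```python
# Assumed data source class name; verify it against your connector version.
VERTICA_SOURCE = "com.vertica.spark.datasource.DefaultSource"


def vertica_read_options(host, db, user, password, table, num_partitions=4):
    """Build the option map for a read (option names are assumptions)."""
    return {
        "host": host,
        "db": db,
        "user": user,
        "password": password,
        "table": table,
        # Spark data source options are passed as strings.
        "numPartitions": str(num_partitions),
    }


def load_filtered(spark, opts, columns, predicate):
    """Load a Vertica table, then apply a filter and a column selection.

    `spark` is an existing SparkSession. filter() and select() are lazy,
    so a data source that supports pushdown can apply them before any
    rows are transferred.
    """
    df = spark.read.format(VERTICA_SOURCE).options(**opts).load()
    return df.filter(predicate).select(*columns)
```

With a live SparkSession named `spark`, a call might look like `load_filtered(spark, vertica_read_options("vertica-host", "mydb", "dbadmin", "secret", "sales"), ["id", "amount"], "amount > 100")`.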
This book is intended for anyone who wants to transfer data between a Vertica database and an Apache Spark cluster.
Prerequisites and Compatibility
This document assumes that you have installed and configured Vertica as described in Installing Vertica and the Configuring the Database section of the Administrator's Guide. You must also have an installed Apache Spark cluster.
To save data from Spark to Vertica, you must have an HDFS cluster for an intermediate staging location. Your Vertica database must be configured to read data from this HDFS cluster. See Using HDFS URLs in Integrating with Apache Hadoop for more information.
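To make the staging requirement concrete, here is a minimal sketch of saving a DataFrame from Python with an HDFS staging directory supplied as an option. The data source class name and the option names, including `hdfs_url`, are assumptions based on one version of the connector. The helper names, the example URL, and the non-empty check are hypothetical.

```python
# Assumed data source class name; verify it against your connector version.
VERTICA_SOURCE = "com.vertica.spark.datasource.DefaultSource"


def vertica_write_options(host, db, user, password, table, hdfs_url):
    """Build the option map for a save (option names are assumptions).

    hdfs_url is the intermediate staging directory, for example
    "hdfs://namenode:8020/tmp/vertica_staging" (a made-up location).
    """
    if not hdfs_url:
        raise ValueError("an HDFS staging location is required to save to Vertica")
    return {
        "host": host,
        "db": db,
        "user": user,
        "password": password,
        "table": table,
        "hdfs_url": hdfs_url,
    }


def save_to_vertica(df, opts, mode="append"):
    """Write an existing Spark DataFrame `df` to the Vertica table in opts."""
    df.write.format(VERTICA_SOURCE).options(**opts).mode(mode).save()
```

Failing early on a missing staging location keeps the error close to the configuration mistake instead of surfacing it partway through a Spark job.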
For supported versions of Apache Spark and Apache Hadoop, see the following sections in the Supported Platforms guide: