Integrating with Apache Spark

Welcome to the OpenText Vertica Connector for Apache Spark Guide.

Vertica Connector for Apache Spark is a fast parallel connector that transfers data between the Vertica Analytics Platform and Apache Spark. This feature lets you use Spark to pre-process data for Vertica and to use Vertica data in your Spark application.

Apache Spark is an open-source, general-purpose cluster-computing framework. It evolved as a faster, multi-stage, in-memory alternative to the two-stage, disk-based MapReduce framework offered by Hadoop. The Spark framework is based on Resilient Distributed Datasets (RDDs), which are logical collections of data partitioned across machines. Spark is typically used in upstream workloads to process data before loading it into Vertica for interactive analytics. It can also be used downstream of Vertica, where data pre-processed by Vertica is then moved into Spark for further transformation.
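To illustrate the RDD concept described above, the following is a minimal Scala sketch (the application name, partition count, and local master setting are illustrative assumptions, not part of the connector):

```scala
// Sketch: an RDD is a logical collection partitioned across the cluster.
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process for demonstration only.
    val spark = SparkSession.builder
      .appName("rdd-sketch")
      .master("local[*]")
      .getOrCreate()

    // Distribute 1..1000 across 8 partitions; Spark operates on
    // each partition in parallel.
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)
    println(rdd.getNumPartitions)       // number of partitions (8)
    println(rdd.map(_ * 2).sum())       // a distributed transformation + action

    spark.stop()
  }
}
```

In a real deployment you would submit such an application to a Spark cluster with spark-submit rather than running it with a local master.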

Using the OpenText Vertica Connector for Apache Spark, you can:

- Save data from a Spark DataFrame to a table in your Vertica database.
- Load data from your Vertica database into a Spark DataFrame.

Audience

This book is intended for anyone who wants to transfer data between a Vertica database and an Apache Spark cluster.

Prerequisites and Compatibility

This document assumes that you have installed and configured Vertica as described in Installing Vertica and the Configuring the Database section of the Administrator's Guide. You must also have installed and configured your Apache Spark cluster.

To save data from Spark to Vertica, you must have an HDFS cluster for an intermediate staging location. Your Vertica database must be configured to read data from this HDFS cluster. See Reading Directly from HDFS in Integrating with Apache Hadoop for more information.
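The save path described above can be sketched in Scala as follows. This is a non-authoritative sketch: the data source class name and the option keys (host, db, user, password, dbschema, table, hdfs_url, web_hdfs_url) follow one version of the connector and may differ in yours, and all host names and paths are hypothetical placeholders. Check the connector documentation for your installed version.

```scala
// Sketch: saving a Spark DataFrame to Vertica, staging through HDFS.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder.appName("save-to-vertica").getOrCreate()

// Example DataFrame to save.
val df = spark.range(100).toDF("id")

// Connection and staging options -- all values below are placeholders.
val opts = Map(
  "host"         -> "vertica-node1.example.com",
  "db"           -> "mydb",
  "user"         -> "dbadmin",
  "password"     -> "secret",
  "dbschema"     -> "public",
  "table"        -> "spark_data",
  // Intermediate staging location on HDFS; the Vertica cluster must be
  // configured to read from this same HDFS cluster.
  "hdfs_url"     -> "hdfs://hdfs-namenode:8020/tmp/vertica-staging",
  "web_hdfs_url" -> "webhdfs://hdfs-namenode:50070/tmp/vertica-staging"
)

df.write
  .format("com.vertica.spark.datasource.DefaultSource") // assumed class name
  .options(opts)
  .mode(SaveMode.Append)
  .save()
```

The connector writes the DataFrame to the HDFS staging location, then directs Vertica to load the staged files, which is why the prerequisite above requires that Vertica be able to read from that HDFS cluster.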

For details on installing and using Apache Spark and Apache Hadoop, see the Apache Spark website, the Apache Hadoop website, or your Hadoop vendor's installation documentation.

For supported versions of Apache Spark and Apache Hadoop see the following sections in the Supported Platforms guide:

In This Section