Saving an Apache Spark DataFrame to a Vertica Table

The Vertica Connector for Apache Spark copies data partitions distributed across multiple Spark worker nodes into a temporary staging location in HDFS. Vertica then reads the data from HDFS. These data transfers use parallel reads and writes, letting the connector efficiently move large volumes of data from Spark to Vertica.

Save Options

When writing a Spark DataFrame to Vertica, you specify a Vertica target table and set how the data is saved using the DataFrame SaveMode (see the sketch after this list). The valid values for SaveMode are:

  • SaveMode.Overwrite overwrites the existing table or creates a new one.

    The existing table is dropped whether the save succeeds or not, except when the connector is asked to load a DataFrame that contains zero rows.

  • SaveMode.Append appends data to an existing table or creates the table if it does not exist.
  • SaveMode.ErrorIfExists creates a new table if the table does not exist, otherwise returns an error.
  • SaveMode.Ignore creates a new table if the table does not exist, otherwise it does not save the data and does not return an error.
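
The following sketch shows a save with SaveMode.Overwrite through the connector's DefaultSource API. The connection and staging options (host, db, user, password, hdfs_url, web_hdfs_url) are placeholder values here; see Parameters in Saving Spark Data to Vertica Using the DefaultSource API for the authoritative option list.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder.appName("SparkToVertica").getOrCreate()
    import spark.implicits._

    // A small DataFrame to save; in practice this would be your real data.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Placeholder connection and staging options; substitute your own values.
    val opts = Map(
      "table"        -> "target_table",                             // Vertica target table
      "db"           -> "testdb",
      "user"         -> "dbadmin",
      "password"     -> "vertica_password",
      "host"         -> "vertica-node01.example.com",
      "hdfs_url"     -> "hdfs://namenode.example.com:8020/tmp/s2v", // intermediate staging location
      "web_hdfs_url" -> "webhdfs://namenode.example.com:50070/tmp/s2v"
    )

    df.write
      .format("com.vertica.spark.datasource.DefaultSource")
      .options(opts)
      .mode(SaveMode.Overwrite)   // drop and recreate target_table
      .save()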

The save() operation never results in a partial or duplicate save: the connector either saves the DataFrame in its entirety to the target table or aborts the save. To avoid leaving the target table in an inconsistent state when a failure occurs, the connector:

  1. Saves the data to HDFS as an intermediate staging location
  2. Copies the staged data from HDFS into the actual target table in Vertica

Rejected Rows

For bulk loads, the connector API lets you specify a tolerance for rejected rows as a parameter (see Parameters in Saving Spark Data to Vertica Using the DefaultSource API). If the number of rejected rows falls within the tolerance, the save completes successfully. Otherwise, the connector aborts the save and reports an error.
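
For example, a save that tolerates up to 10% rejected rows might look like the sketch below. It reuses df, opts, and SaveMode from the earlier example, and it assumes the failed_rows_percent_tolerance parameter name from the DefaultSource parameter list; treat that name as an assumption if your connector version differs.

    // Allow up to 10% of rows to be rejected before the save is aborted.
    // "failed_rows_percent_tolerance" is assumed from the DefaultSource
    // parameter list; opts holds the placeholder connection options
    // shown earlier.
    df.write
      .format("com.vertica.spark.datasource.DefaultSource")
      .options(opts + ("failed_rows_percent_tolerance" -> "0.10"))
      .mode(SaveMode.Append)
      .save()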

Job Status

The connector creates a Vertica table, S2V_JOB_STATUS_USER_$USERNAME$, to report the status of each save job. When the save to Vertica starts, the connector writes the unique job_id to the Spark log. After the Spark job starts, you can query this table for the job status based on that job_id. The S2V_JOB_STATUS_USER_$USERNAME$ table has the following columns:

  • target_table_schema
  • target_table_name
  • save_mode
  • job_name
  • start_time
  • all_done
  • success
  • percent_failed_rows

The table records the unique job name, start time, percentage of rows that failed, and the final outcome of the save job (success = T/F). The all_done column indicates that the Spark job finished without errors, and the success column indicates that the data was saved to the Vertica table (that is, any rejected rows fell within the specified tolerance).
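
For example, you could check a job's status over JDBC, as in the sketch below. The JDBC URL, credentials, and job name are placeholders; it assumes the Vertica JDBC driver is on the classpath and that the saving user was dbadmin, so the status table is S2V_JOB_STATUS_USER_DBADMIN.

    import java.sql.DriverManager

    // Placeholder connection details; substitute your own.
    val conn = DriverManager.getConnection(
      "jdbc:vertica://vertica-node01.example.com:5433/testdb",
      "dbadmin", "vertica_password")
    try {
      val stmt = conn.prepareStatement(
        "SELECT all_done, success, percent_failed_rows " +
        "FROM S2V_JOB_STATUS_USER_DBADMIN WHERE job_name = ?")
      stmt.setString(1, "S2V_job_0123456789")  // hypothetical job name from the Spark log
      val rs = stmt.executeQuery()
      while (rs.next()) {
        println(s"all_done=${rs.getBoolean("all_done")}, " +
                s"success=${rs.getBoolean("success")}, " +
                s"percent_failed_rows=${rs.getDouble("percent_failed_rows")}")
      }
    } finally {
      conn.close()
    }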