Vertica Integration with Apache Airflow: Connection Guide

About Vertica Connection Guides

Vertica connection guides provide basic instructions for connecting a third-party partner product to Vertica. Connection guides are based on our testing with specific versions of Vertica and the partner product.

Vertica and Apache Airflow: Latest Versions Tested

This document is based on our testing using the following versions:

Software Version
Apache Airflow 2.2.2
Desktop Platform

Red Hat Enterprise Linux Server 8.3

Vertica Client

Vertica Python Driver 1.0.2

Vertica Server

Vertica 11.0.1

We tested the integration of Apache Airflow with Vertica using the Vertica Python Driver. To connect to Vertica using Airflow, you need to create Directed Acyclic Graphs (DAGs) using your own Python script. We created a sample Python script that reads data from Vertica and writes this data to Vertica. The sample script is available later in this guide for your reference. As a result of this testing, there are some data type limitations that occur due to the Python script that are listed at the end of this guide.

Apache Airflow Overview

Apache Airflow is an open source work flow engine and scheduling system. It executes your work flows as Directed Acyclic Graphs (DAGs). Airflow allows you to schedule these DAGs and set dependencies on these tasks. Its simple user interface makes it easy to monitor and manage tasks, check status and progress, and troubleshoot issues.

Installing Apache Airflow

To install Apache Airflow, follow the prerequisites and instructions here.

Installing the Vertica Client Driver

Apache Airflow uses the Vertica Python client to connect to Vertica. You also need to install sqlalchemy and Apache Airflow provider package for Vertica.

Note Ensure that you have the latest version of Python and pip.

  1. To install vertica-python client driver using the pip command, refer to the vertica-python github page.

  2. To install sqlalchemy-vertica-python using the pip command, refer to sqlalchemy-vertica-python.

  3. To install Apache Airflow provider package, refer to apache-airflow-providers-vertica.

    For more information about the provider package, see Apache Airflow Provider Package.

Connecting Vertica to Apache Airflow

  1. After you set up Airflow, start the database:
    Airflow db init
  2. Create the user in Airflow:
    airflow users create -r Admin -u admin -e admin@example.com -f admin -l user -p test

    Note -r is role, -u is username, -e is email, -f is first name, -l is last name, -p is password.

  3. Start the web server. The default port is 8080:
    airflow webserver --port 8080
  4. Open http://localhost:8080/home in a browser.

  5. Select Admin and click Connections.

  6. In List Connection, select vertica_default and click the Edit record button.

    Enter the connection details.

    Note By default, the Conn Id is vertica-default. You can change this Id.

    Schema is the database name.



  7. Click Save.

    Note The connection variables that you specify here will be stored in the apache airflow providers vertica file.

Creating Directed Acyclic Graphs (DAGs) Using the Vertica Connection

In this example, we will be creating a DAG which reads data from Vertica and returns the number of rows in a table.

  1. Using the command line interface, create a dags folder in the airflow folder.

  2. Create a sample python program in the dags folder.

  3. Import the VerticaHook library in the dag to connect to Vertica:

    from airflow.contrib.hooks.vertica_hook import VerticaHook
  4. Add the following variable when defining the function:

    variable = VerticaHook('Connection_id').get_cursor()
  5. To update the DAGs list in Apache airflow, you need to run the scheduler.

  6. To start the scheduler, run the following command in a new command line:

    airflow scheduler
  7. We have now created the following sample python program and saved it in the dag folder.

  8. In Apache Airflow, click the Trigger DAG button and check the status.

    If the status shows success, verify the logs in the log folder for the results. The logs folder is in the airflow folder.


Sample DAG Script:

from datetime import datetime, timedelta
import os
import logging
from airflow import DAG
from airflow.contrib.hooks.vertica_hook import VerticaHook
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.today(),
'email': ['@airflow'],
'email_on_failure': False,
'email_on_retry': False,
'retry_delay': timedelta(minutes=5),
}
dag = DAG("vertica_monitoring_1",
default_args=default_args,
schedule_interval=None)
def get_vertica_status(**kwargs):
cur = VerticaHook('vertica_default').get_cursor()
sql = "select count(*) from r_test.sample_decimal "
cur.execute(sql)
result = cur.fetchall()
logging.info(result)
logging.info('above is the result value')
return result[0][0]
result = get_vertica_status()
logging.info(result)
logging.info('The above is the result value')

Known Limitations

Note Apache Airflow uses Python to perform tasks. The following limitations are as a result of using the Python program.

For all data types, Null is displayed as None.

Read Mode

  • For BINARY, VARBINARY, LONG VARBINARY data types, values are not displayed correctly.
  • For INTERVAL and TIMETZ data types, b is appended to each value.
  • For DECIMAL, TIME, TIMESTAMP, TIMESTAMTZ, and UUID data types, the respective function name is appended to the value.

Write Mode

  • For DECIMAL, TIME, TIMESTAMP, TIMESTAMTZ, and UUID data types, since the function name is appended to the value, these values are not written to Vertica.
  • BINARY, VARBINARY, LONG VARBINARY, and TIMETZ data types are not supported.
  • For CHAR, VARCHAR, and LONG VARCHAR data types, replace Null with EMPTY to write the value to Vertica.

For More Information