Enhancing Your Data Discovery Journey: Using DataHub Vertica Plugin

DataHub is a metadata platform for end-to-end lineage and data discovery. This video demonstrates how to effectively incorporate Vertica data in DataHub using the DataHub Vertica Plugin, enabling you to leverage the platform's advanced capabilities for gaining deeper insights.

DataHub is an open-source metadata platform for the modern data stack. It is a data catalog built to enable end-to-end data discovery, data observability, and data governance. This extensible metadata platform is built for developers to tame the complexity of their rapidly evolving data ecosystems and for data practitioners to leverage the total value of data within their organization.

DataHub Vertica Plugin

As of December 2022, DataHub supported ingesting metadata from various data sources, including databases such as PostgreSQL, Snowflake, Vertica, and so on. The beta Vertica DataHub Plugin was originally a community contribution by the DataHub community. The beta plugin only supports ingestion of metadata for tables, views, and data platforms.

Currently the Vertica plugin uses sqlalchemy dialect to connect to Vertica.

However, the plugin was missing important metadata information such as Projections, MLModels, EON mode information, OAuth metadata, and Table-View-Projection lineage. We undertook a project to write a plugin for Vertica that can pull this information. This also required a re-write of the SQLAlchemy Vertica Dialect as DataHub uses SQLAlchemy dialects to connect to Databases.

DataHub plugin for Vertica now has a certified ingestion plugin for DataHub that can be used to ingest metadata into DataHub. This now allows data practitioners to have Vertica metadata within their Datahub instance alongside other metadata, so that they can take advantage of DataHub's end-to-end data discovery and observability. They can derive meaningful insights about the usage of Vertica Objects such as Tables, Views, and Projections and visualize the lineage relationship.

This document provides an example of a metadata ingestion using Vertica and includes the steps to configure the environment and connect to Vertica using the DataHub Vertica plugin.

Test Environment

  • DataHub 0.13.2.4

  • RHEL 8.3

  • Vertica Python driver 1.3.8

  • Vertica Server 24.1

Configuring the Environment to Use DataHub with Vertica

This section explains how to install the DataHub Vertica plugin and initialize a project. The configuration is based on a Linux (RHEL) environment.

Prerequisites

To ingest metadata from Vertica, you will need

  • Vertica Server Version 10.1.1-0 and above. It may also work for older versions.

  • Vertica Credentials (Username/Password)

  • Docker for Linux

  • Docker Compose

  • Python 3.8 or later

Installing the DataHub Vertica Plugin

  1. Install Docker and Docker Compose v2 for your platform.

  2. On Linux, install Docker for Linux and Docker Compose.

  3. Launch the Docker Engine from the command line or the desktop app.

  4. Install DataHub CLI.

    1. Ensure you have Python 3.8+ installed and configured. (Check using the command python3 --version).

    2. Run the following commands in your terminal.

      python3 -m pip install --upgrade pip wheel setuptools
      python3 -m pip install --upgrade acryl-datahub
      datahub version
      
  5. To deploy a DataHub instance locally, run the following CLI command from your terminal.

    datahub docker quickstart
  6. To install the Vertica plugin, run the following command from the command line:

     pip install 'acryl-datahub[vertica]'

    Note This command will automatically install Vertica-python client driver and Vertica- SQLAlchemy dialect. You do not need to install them manually.

See the DataHub Quickstart Guide for more details.

GUI Based Ingestion

To run the ingestion from GUI:

  1. Go to localhost:9002.

  2. Enter your credentials to login.

  3. On the top right, click Ingestion. In Sources, click Create new source , and select Vertica

  4. Fill the form with your Vertica DB credentials.

  5. Click Advance to configure your metadata objects. Click the objects that you want to ingest to Datahub GUI.

  6. Click Next, configure an Ingestion schedule, and then click Next.

  7. Enter the name of the ingestion and then click Advance.


  8. Check the CLI version it is on while running the ingestion. Make sure you enter the CLI Version as 0.10.5.4V or later to see Performance improvement while ingesting the Vertica metadata.

  9. Click Save & Run to start the ingestion.

  10. You will see the following if the ingestion is successful:

     

CLI Based Ingestion

For CLI based ingestion, go to the DataHub directory in your machine. Create the recipe.yml file and add the Vertica source details in it.

Check out the following recipe to get started with ingestion! See Integration Details for all configuration options.

Note By default, you can change all configuration properties that are set to true as per your preference. You can add profiling and other config options based on your needs.

After adding the recipe.yml file in the DataHub directory, run the following command to start the ingestion:

datahub ingest -c recipe.yml

When the ingestion starts running, the log looks like this:

Now you can go to the DataHub UI and check metadata. You will see Vertica in the platforms sections with number of objects ingested:

Click the Vertica logo and it will take you to the list of schemas. You can select any schema and then table, view, projection to view the metadata.

New Vertica DataHub Plugin Features that Support Vertica Metadata

  1. UI-based data ingestion.

    1. You can now easily ingest data using the UI recipe form. That means no CLI and getting data into DataHub is now as easy for both, an experienced or novice user.

  2. Supports viewing metadata for Tables, Projections, and Views.

    1. You can visualize Vertica tables and associated projections.

    2. One stop location to get metadata information such as row count, number of projections, partitions, and columns, table size, column size, data type, min/max stats per column.

    3. For projections, you can also detect the columns list, type of projection, pinned projections, ROS count and so on.

      Note Default values for the columns will show up in Description if there are any.


  3. Supports viewing lineage for Tables, Projections, and Views.

    1. You can visualize “downstream” and “upstream” lineage for a Vertica object. For example, a table can have 2 projections created using the table. These can be viewed as a downstream lineage in DataHub.

      Note Lineage is not displayed if the table, projection, or view is empty.

    2. A view can have multiple upstream tables which can be visualized as well.

    3. This is particularly useful to determine the consequences of deleting or altering a Vertica table/view/projection on its dependencies.


  4. Support UDX metadata.

    1. You can view all schema level UDXs (permitted) and corresponding language.


  5. Support for EON mode metadata.

    1. You can determine DB mode (Enterprise/EON) the data was ingested from, the communal storage path, and subcluster size.

  6. Support for ML Models.

    1. You can view a summary of any ML model created in the Vertica DB at the schema level.

    2. Corresponding results of the Model runs are also shown along with the user who used it.

  7. Support for deleted entities.

    1. During ingestion, enabling stateful_ingestion will allow you to determine any objects that were deleted with respect to previous ingestion.

  8. Owner for ingested data.

    1. Vertica plugin extends DataHub’s support for data federation using “owners”. Every ingested data can only be viewed by the person authorized to in DataHub.

  9. Improved SQLAlchemy Dialect.

    1. The relevant logic to fetch the metadata into Datahub needed to be materialized into the SQLalchemy-vertica dialect. Previously this was developed by an unknown third party. So, as a by-product of this development, we have now a new and improved Vertica-dialect for sqlalchemy that supports many more features than previous open-source versions did.

    2. An additional advantage of adding logic into our dialect instead of DataHub was that any other product can now use the new dialect to access Vertica.

Support for column level lineage for views

Vertica now supports column level lineage for views. Follow these steps to get column level lineage for views in Vertica:

  1. After the DataHub production environment is ready, look for the files where python is installed at the following location. Add a new file vertica.py here.

    {python_location}/site-packages/sqlglot/dialects

  2. Add the following lines of code in the new file and save it.

    from sqlglot import exp, generator, tokens.
    from sqlglot.dialects.dialect import Dialect
    from sqlglot.tokens import Tokenizer, TokenType
    
    
    class Vertica (Dialect):
        class Tokenizer(tokens.Tokenizer):
            QUOTES = ["'", '"']
            IDENTIFIERS = ["`"]
    
            KEYWORDS = {
                **Tokenizer.KEYWORDS,
            }
    
        class Generator(generator.Generator):
            TYPE_MAPPING = {
                **generator.Generator.TYPE_MAPPING,
                exp.DataType.Type.INTERVAL: "BIGINT",
            }
    
  3. Modify __init__.py in the same location and add the following line of code at the end of the file.

    from sqlglot.dialects.vertica import Vertica

    See the following image for details.

  4. Modify dialect.py in the same location and add the following line of code in the Dialects class.

    VERTICA = "vertica" 

    See the following image for details.

  5. Modify the code as explained below in the vertica.py file at

    {python_location}/site-packages/datahub/ingestion/source/sql

    Comment loop_views and get_upstream_lineage_info functions as it overrides CLL feature to view lineage in vertica.py, and save the file.

    You can now run the ingestion from CLI to see the column level lineage for views in Vertica. Following is an illustration with example CLL for views using Vertica as a datasource.

Note After these changes, you will need to run ingestion from CLI only.

Vertica Driver Additional Parameters

DataHub uses the Vertica python driver to connect. For more details about the connection option, see the vertica-python driver readme. You need to pass those parameters in the recipe file that you are using. Following is a sample recipe file:

source:
    type: vertica
    config:
        host_port: 'localhost:5433'
        database: db_name
        username: {"username"}
        password: {"password"}
        include_tables: true
        include_views: true
        include_projections: true
        include_models: true
        include_view_lineage: true
        include_view_column_lineage: true
        profiling:
            enabled: false
        stateful_ingestion:
            enabled: false
        options:
           connect_args:
              connection_load_balance: true
              oauth_access_token: token_value
                
sink:
    type: datahub-rest
    config:
        server: 'http://localhost:8080'

Ensure the indentations are correct. We've only added a few parameters to the recipe file, but all other Vertica-Python connection properties will also work.

Known Limitations

  • In Stats, max and min values are not displayed for primary keys.

  • For integer, boolean, and date columns, length of the column is not displayed as the length is fixed in the Vertica database.

For More Information