Vertica Integration with Experian Pandora: Connection Guide

For Vertica 8.0.x 

About Vertica Connection Guides

Vertica connection guides provide basic information about setting up connections to Vertica from software that our technology partners create. These documents provide guidance using one specific version of Vertica and one specific version of the third party vendor’s software. Other versions of the third-party product may work with Vertica. However, other versions may not have been tested. This document provides guidance using the versions described in the following topic.

Software Versions Used in this Document

This document describes the integration of Vertica with Experian Pandora. The Partner Engineering team has tested Pandora with Vertica using the following software versions:

  • Experian Pandora Client and Server versions 5.4 on a Windows Server 2012 R2 64-bit platform
  • Vertica JDBC driver 8.0
  • Vertica Analytic Database 8.0

Experian Pandora Overview

Experian Pandora is an on-premises solution for data quality analysis and management. Pandora loads the data into a proprietary file system repository where it automatically checks the quality of the data against 180 different data checks and produces 225 different data quality measures. The results are immediately available for viewing in the repository.

For more information, visit the Experian Pandora website.

Using Experian Pandora with Vertica

To assess the quality of your data stored in Vertica using Pandora, follow these steps:

  1. Connect Pandora to Vertica using JDBC.
  2. Select and configure the tables and columns that you want to analyze.
  3. Load the tables and columns into the Pandora repository. Pandora automatically analyzes the quality of the data and stores the results in the repository.
  4. Browse the data quality results. Pandora discovers defects in the data, such as unusual values, inconsistent formats, standard deviation of outliers, and duplicates. You can drill down into the high-level results to view the underlying rows.
  5. Create rules and alerts to monitor the integrity of your data over time.

For more information about Pandora, see the product documentation included in the product download.

Download and Install Experian Pandora

You can download a trial version of Pandora for evaluation. Follow these steps:

  1. Go to https://www.edq.com/ to request a trial of Pandora. You will receive an email containing a link to the latest version.
  2. Download the installer.
  3. Run the installer and follow the steps in the wizard. The installer checks for required software and attempts to install any missing components from your operating system.

On Windows, Pandora is installed in the following directories by default:

  • Pandora application files: C:\Program Files\Experian\Pandora X.X.X
  • Pandora repository: C:\pandora

The Pandora installer lets you perform a Typical, Custom, or Complete installation. With each type of installation, you can choose which of these components to install:

  • Pandora Client—Manages Pandora Server. Available on Windows only.
  • Pandora Server—Performs the data validation. Available for Windows, Unix, and Linux.
  • ODBC drivers (32-bit and 64-bit)—Connect client applications such as Microsoft Excel and Tableau to Pandora Server.
    • You can install the Pandora ODBC drivers on a different machine from the machine that is hosting the Pandora client and server.
    • If your operating system is 32-bit, install the 32-bit driver. If your operating system is 64-bit, you can install either the 32-bit or the 64-bit driver or both.
    • If your application is 32-bit, create a 32-bit DSN. If your application is 64-bit, create a 64-bit DSN. You will use the DSN to connect to Pandora Server from a client application.

Download and Install the Vertica JDBC Driver

Pandora uses Vertica JDBC to connect to Vertica. To install the Vertica JDBC driver, follow these steps:

  1. Go to the Vertica Client Drivers page.
  2. Download the version of the Vertica JDBC driver that is compatible with the architecture of your operating system and the Vertica server version.
  3. Go to the directory of the Pandora repository and paste the Vertica JAR file in the subdirectory drivers\jdbc. For example on Windows, paste the JAR file in C:\pandora\drivers\jdbc.

You do not need to restart Pandora after installing the JDBC driver.

Note Vertica drivers are forward compatible, so you can connect to the Vertica server using previous versions of the client. For more information about client and server compatibility, see Client Driver and Server Version Compatibilityin the Vertica documentation.

Start Experian Pandora

To start Experian Pandora, first start the server, then start the client and create a connection to the server.

Start Pandora Server

Pandora Server runs as a Windows service. You can start the service from the Windows Start menu or by using the Service Manager utility in Windows Control Panel. On the Start menu, locate the Pandora Server program and click thePandora Initialize Database icon.

Start Pandora Client

Follow these steps to start the Pandora client application:

  1. On the Windows Start menu, click Experian Pandora Client.
  2. In the login screen, enter the connection information for Pandora Server:
  3. Server: Name or IP address of the machine running Pandora Server. If Pandora Server and Client are running on the same machine, type localhost.

    Port name: Number of the port where Pandora Server is listening. The default port is 7700.

    Ensure that the firewall allows connections to this port.

    Username: Pandora server user. The default is administrator.

    Password: Pandora server password. The default is administrator.

    Reset Desktop: Check this box if you do not want to restore the previous connection, which is saved and presented by default every time you connect.

  4. Click Login.

Connect Experian Pandora to Vertica

To create a new connection to Vertica using JDBC, follow these steps:

  1. Navigate to Home > Connections.
  2. Right-click Connections and select New.

    NewConnection.png

  3. The Create Connection dialog box displays. This dialog box has three tabs: Details, Properties, and Permissions. The fields of the Create Connection dialog box are described in the following topics.
    Specify the connection information in the Create Connection dialog box and click Save.

The new connection is listed under Connections in the Explorer. If the connection was successful, it is marked as Available.

VerifyConnection.png

Note If the connection has already been created, you will see two additional tabs in the Create Connection dialog: Refresh Details and Automaticn. See Edit an Existing Connection for details.

Create Connection: Details Tab

In the Details tab, enter the details of the connection for your Vertica database.

DetailsTab.png

  • Name (required)—Name that identifies the connection.
  • Description—Description of the connection.
  • Adapter (required)—Select Custom.
  • Username—Username for the database connection. You must provide the database username if it is not specified in the connection string.
  • Password—Password for the database connection. You must provide the password for the database user if it is not specified in the connection string.
  • Driver name (required)—Specify com.vertica.jdbc.Driver.
  • Connection string (required)—The JDBC URL that tells the driver where to find the database. The format of the connection string is as follows:
    jdbc:vertica://VerticaHost:portNumber/databaseName

    Replace VerticaHost, portNumber, and databaseName with your information. You can provide additional connection properties such as username and password as part of the connection string as follows:

    jdbc:vertica://VerticaHost:portNumber/databaseName?user=username&password=password&label=label

    You must include the username and password in the connection string if you do not provide them in the Username and Password fields on the Details tab of the Create Connection dialog box.

    • Schema—Database schema that is visible to Pandora. If you do not specify a schema, no tables are listed.
    • Force schema name—Forces Pandora to specify the schema name in all SQL statements it sends to Vertica. This option is enabled by default.
    • Table pattern—Filter pattern for table names. The filter uses an ILIKE clause with wildcard replacement in the SQL statement. Wildcards are % (percent sign) and _ (underscore). A percent sign replaces multiple characters; an underscore replaces a single character.
    • Filter regular expression—Regular expression in the filter pattern. Pandora matches the regular expression exactly. For example, the expression [a-z] only matches table names consisting of a single lowercase letter. The Matches Expression function in the expression builder uses this same behavior.

    • Filter is case sensitive—Whether or not the regular expression filter is case sensitive.
    • Include Tables—Whether or not tables are visible to Pandora. You can choose to make either tables or views visible. If you do not choose either one, tables are visible.
    • Include Views—Whether or not views are visible to Pandora.

Create Connection: Properties Tab

The Properties tab lets you change settings such as the JDBC connection settings.

Scope—Scope of the property. In this example, the scope is at the connection level, and the value cannot be changed.

Type—Type of the property. In this example, the property type is JDBC.

Name— Name of the setting. In this example, the name of the property is label.

Value—Value of the property. In this example, the session label is PandoraSessionLb.

For a complete list of JDBC properties, see the Vertica documentation.

For information about properties specific to Pandora, see the Pandora User's Guide (installed with the product).

To add a new property to the list:

  1. In the Edit Property box, click + Add.
  2. Select the property type from the Type drop-down liist.
  3. For a given property name, provide a value.
  4. Click the green tick mark to add the new property.

Create Connection: Permissions tab

The Permissions tab lets you set role-based permissions for the tables in the connection. You can specify the following permissions:

  • Read—Allows the user to see the tables.
  • Modify—Allows the user to make local changes to the tables.
  • Export—Allows the user to write changes back to the database.

Edit an Existing Connection

If the Create Connection dialog is populated with information for a connection that was previously created, you see two additional tabs: Refresh Details and Automation.

Refresh Details—Allows you to specify how often to refresh the list of tables in the connection. By default ,the list is refreshed every hour.

Automation—Allows you to specify how to reload the data into the tables that have been loaded into Pandora. The default method is Manual. You can choose to reload the data automatically at intervals specified in a trigger file.

Configure Vertica Tables Before Loading

After you create a connection to Vertica, you must select the tables you want to analyze and configure them before loading them into Pandora.

To prepare a table for loading into Pandora, follow these steps:

  1. In the Explorer, navigate to Home > Connections.
  2. Double-click the connection to list the available tables.
  3. Right-click the table you want to configure and select Configure. The Configure Table dialog box displays. The dialog box has four tabs: Details, Sampling, Dependency/Key Analysis, and Filter. The tabs are described in the following topics.
  4. Enter the configuration information and click OK.

ConfigureTables.png

Configure Table: Details Tab

  • JDBC Table—Original table name in Vertica.
  • Load as—The name for the table in Pandora.
  • Description—Optional description.
  • Schema—Schema for the table in Pandora.
  • Project—Pandora project to contain the table. You can create a new project at this point.
  • Language—The language of the data. The default is English.
  • Exclude from Relationships?—Whether or not to exclude the table from relationships to other tables. Pandora discovers the relationships between tables automatically and incrementally at load time. Check this box if you want to discover relationships manually.

Configure Table: Sampling Tab

Use this tab to limit the number of rows of data to load. Pandora loads all the rows in the table by default.

Configure Table: Dependency / Key Analysis Tab

Use this tab to specify columns for Pandora to analyze at load time for relationships to other tables.

Load Vertica Data Into Pandora

After you have configured the tables, you can load them into Pandora. Pandora automatically analyzes the quality of the data at load time.

  • The maximum number of columns that you can load into a Pandora table is 30,000.
  • The maximum number of rows that you can load into a Pandora table is 2 billion.

To load a table into Pandora, follow these steps:

  1. In the Explorer, navigate to Home > Connections.
  2. Double-click the connection to list the available tables.
  3. Right-click the table you want to load and select Load. The Load Table dialog box appears. The Load Table dialog box has two tabs: Details and Permissions. Pandora populates most of the fields automatically based on the values entered in the fields of the Configure dialog box for each table.
  4. Enter the required information, as described in the following topics.
  5. Click OK.

LoadData.png

Load Table: Details Tab

  • JDBC—Original table name in Vertica.
  • Loads as—The name the table is given when it is loaded into the Pandora repository.
  • Description—An optional description of the table.
  • Default Schema—Where the table is placed within the Pandora repository. A schema is a collection of tables in the Pandora repository. A table can only belong to one schema. You can create a new schema or select an existing schema from the list.
  • Default Project—As with Schema, you can create a new project at this point and add the table to it. A project is a collection of tables that are grouped to perform a particular task.
  • Ignore Empty Tables—Whether or not to ignore empty tables. By default, Pandora includes empty tables.
  • Storage Type—Storage in the repository is compressed and indexed by default.
  • Schedule Load Time—The load time is Now by default. If you want to schedule the load for later, select Later and specify the date and time.

    The next 3 fields allow you to limit the number of rows to load. By default, all rows in the table are loaded.

  • Start at Row—Start loading from row number 1.
  • Sample One Row in Every—Number of rows in a sample. By default, the sample size is 1 row.
  • Row Limit—No limit by default.
  • Exclude from Relationships?—Pandora discovers the relationships between tables automatically and incrementally at load time. Check this box if you want to discover relationships manually.

Load Table: Configure Columns

You can configure individual columns of the table by clicking the Columns button on the lower-left corner of the Details tab. In the Configure Columns dialog box, you can configure columns individually before loading and exclude columns from the load.

ColumnConfig.png

The configurable options for columns include:

  • Is A Key?—Whether Pandora should analyze this column as a key by verifying the uniqueness of the values.
  • Exclude from Relationships? —Whether Pandora should use this column to discover relationships. Enable this option if you know that the column is not a key and does not specify a relationship with other tables.
  • Automatic Key/Dependency Analysis?—Whether Pandora should include this column in its automatic analysis of keys and dependencies when loading the table.

Load Table: Permissions Tab

The table owner and all users in the team of the table owner have full permission to access the table by default. Other users only have Read and Annotate permissions.

View the Loaded Tables

To view the loaded table:

  1. Navigate to Home > Projects.
  2. Select the project.
  3. Click Tables to list the tables that have been loaded into this project.

Browse the Results of Data Quality Analysis

 To browse the results of the analysis for all columns in the table:

  1. In Pandora Administrator, use the Explorer to select Home > Projects > my_project > Tables > my_table.
  2. Click the right arrow to expand the folders at each level.
  3. Click Columns.

BrowseResults.png

To browse the results of the analysis by column (Column Profile):

  1. In the Explorer, select Home > Projects > my_project > Tables > my_table > Columns > my_column.
  2. Click the right arrow to expand the folders at each level.
  3. Click Profile to display the Profile statistics for the selected column.

PandoraAdmin.png

Vertica Data Type Support

Pandora support for Vertica data types has these limitations:

  • The time zone offset is not included in the TIMESTAMPTZ and TIMETZ data types.
  • Values of the TIME and TIMETZ data type are loaded as TIMESTAMP and displayed using the date 01-01-1970.

For More Information

For More Information About… … See
Experian Pandora http://www.experian.com/data-quality/experian-pandora.html
Vertica Community https://vertica.com/community/
Vertica Documentation http://vertica.com/docs/latest/HTML/index.htm
Big Data and Analytics Community https://vertica.com/big-data-analytics-community-content