Vertica and Data-Planet Web Service: Technical Exploration

About this Document

This document explains how you can use Data-Planet Web Services to include public data in your Vertica analytic database. You can extract third-party datasets, such as census, financial, and traffic data, from the Data-Planet repository. You can then load these datasets into Vertica to enrich your database analysis. This document provides guidance using the latest versions of Vertica and Data-Planet as of June 2016.

Data-Planet Overview

Data-Planet is a web service (SaaS) that provides access to a repository of datasets from a variety of organizations. Data in the repository is arranged in chronological order, also known as time-series data. This chronological organization allows users to compare and contrast data from different sources and run visualizations to interpret correlations.

The data in the repository grows regularly with the addition of new datasets. Data-Planet keeps the data updated as information changes at the source.

We have tested loading Data-Planet datasets into Vertica 7.2.x in the cloud.  

Request Data-Planet

Data-Planet products and services are available from SAGE Publishing. To request a demo or trial of Data-Planet, contact SAGE:

  1. Navigate to the Data-Planet website.
  2. Click Contact Us.

Use Data-Planet Datasets with Vertica

This document discusses the two main use cases for Data-Planet datasets in your Vertica analysis:

Use Data-Planet Hosting Services

Use Data-Planet hosting services to integrate your data into the Data-Planet repository. Then, use the Data-Planet interface to compare and correlate your data against the collection of statistical datasets.

If you are interested in this service, the Data-Planet team will work with your team to load your data into the repository and control access to your information at the dataset level. This ensures that only users in your organization can see the data.

Include Data-Planet Datasets

To include public data in your analysis in Vertica, export datasets from the Data-Planet repository to JSON and CSV formats. Then, import these datasets into your Vertica database to continue your analysis in Vertica.

You can export data from Data-Planet into Vertica in one of two ways:

  • Export datasets using the Data-Planet User Interface
  • Export datasets using the Data-Planet REST API

After you have exported Data-Planet datasets from the repository, you can import them into your Vertica database. You can export Data-Planet datasets in Excel, CSV, XML and JSON formats. We recommend using CSV or JSON formats for the following reasons:

  • Data in JSON format can be loaded into a columnar table or a flex table using the COPY command and JSON parser, named fjsonparser.
  • Data in CSV format can be loaded using the CSV parser, named fcsvparser.
  • Data in XML format cannot be loaded directly, because Vertica does not provide an XML parser for the COPY command.

After you have exported the data as indicated previously, you can load data into Vertica as described in Load Data-Planet Datasets into Vertica.

Export Data-Planet Datasets Using the Data-Planet User Interface

The Data-Planet user interface allows you to download result sets from correlations you run between datasets. To download aggregated data in Excel, CSV or XML formats, click the Export icon on the Data-Planet interface. We recommend that you download the results set in CSV format. To load the CSV file into Vertica, follow the steps in Load Data-Planet Datasets into Vertica.

For more information about exporting data from the Data-Planet UI, see Exporting Data-Planet Statistical Datasets on the Data-Planet website.

Note Vertica cannot directly load an XML file using the COPY command. If you have an XML file, you must use an application to convert the file to JSON or CSV formats.

Export Data-Planet Datasets Using the Data-Planet REST API

The Data-Planet REST API allows you to export datasets contained in the Data-Planet repository. You can also download a subset of data from a dataset, as well as the metadata that describes that dataset. You can use the Data-Planet REST API to export data from the Data-Planet repository in XML or JSON formats. We recommend using the JSON format.

Consider the following when exporting datasets using the Data-Planet REST API:

  • The maximum number of records in the evaluation version is 1000. If you need more records, contact the Data-Planet team to request an increase.
  • The API does not support ingesting data into the repository.
  • One API call can only access data for one indicator at a time. The API call cannot be used to export correlations between datasets.
  • If the API index is not updated, the data you see while using the API may not match the data you see in the user interface.

Data-Planet has three REST API calls available:

  • Database API Call: Returns metadata about datasets and documents in the repository
  • Dataset API Call: Searches datasets and documents across the entire repository based on your input text
  • Data API Call: Extracts a subset of data from a dataset in the repository

To evaluate the Data-Planet API, use your Data-Planet account credentials in key:value pair format. For example, the pair APIEVAL: zpqk authorizes access to the data in the repository. The Data-Planet REST API uses Apache Solr search technology and the standard HTTP GET method to query the data. The REST API supports all Apache Solr query parameters. For more information, see the Solr Wiki.

Parameters Summary

The following is a summary of the parameters used in the examples in this section.

Common Request and Query Parameters

  • q: The main query for the search. This parameter supports raw input strings. Text wrapped in balanced quote characters, such as "San Jose", is treated as a phrase.
  • start: Used to paginate results from a query. The first record is at offset 0, which is also the default value of this parameter.
  • rows: The maximum number of documents to return for a request. The maximum number supported is 1000.
  • fl: The field list. Use this parameter to specify the set of fields to return, limiting the amount of information in the response. Separate field names with a space, comma, or + symbol. Use * to return all fields of the document. The default value is *.
  • wt: The output format, which is either XML or JSON. The default value is XML.
  • defType: The query parser to use when processing the query. Solr supports several query parsers to control how queries are parsed. You can find more examples in the Apache Solr Reference Guide.
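
The following hypothetical query string shows how these parameters might be combined in a single GET request. The search term and field choices are illustrative assumptions, and the endpoint URL to which the string is appended is provided by the Data-Planet team:

    ?q=unemployment&start=0&rows=100&fl=database_id,database_name,source&wt=json

This request would return the first 100 matching documents in JSON format, limited to the three listed fields.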

Facet Parameters

Faceting generates an extra section in the search response known as facet counts. The facet count provides a summary that shows the counts for each field in the result.

Set facet = true to enable facet counts in the query response. The default value is blank. For more information about faceting, see the Solr Faceting Overview guide.

Note The following parameters only have an effect if the facet parameter is set to true.

  • facet.field: The field to treat as a facet when generating facet counts. You should specify at least one field, and you can specify multiple facet fields using &.
  • facet.mincount: The minimum count required for a facet field to be included in the response. The default value is 0.
  • facet.limit: The maximum number of constraint counts to return for the facet fields. A negative value indicates no limit. The default value is 100.
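
For example, a hypothetical request could combine the standard Solr match-all query with faceting on the documented source field. The field choice and values below are illustrative assumptions, not a documented Data-Planet example:

    ?q=*:*&rows=0&facet=true&facet.field=source&facet.mincount=1&facet.limit=50&wt=json

Setting rows=0 suppresses the individual documents so that the response contains only the facet counts.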

Extended DisMax (eDisMax) Query Parser Parameters

The Extended DisMax (eDisMax) query parser allows you to search across all datasets in the repository. The parser matches user input, such as a simple phrase or individual terms, across several text fields.

Use the boost query parameter (bq) to adjust the boost applied to each field based on its significance in the search result. For more information, see The DisMax Query Parser in the Apache documentation.

  • q: Defines the raw input string for the query. For example, lucene searches for "lucene" in the default field, and name:solr searches for "solr" in the name field.
  • qf: The query fields parameter. This parameter specifies the fields in the index on which to perform the query. If this value is absent, it defaults to the df parameter value.
  • df: The default field parameter. This parameter overrides the default field defined in schema.xml, if provided.
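
As a sketch, the following query string enables the eDisMax parser and searches two of the documented metadata fields. The search phrase, the field choices, and the literal spaces (shown unencoded for readability) are illustrative assumptions:

    ?defType=edismax&q=airline on-time performance&qf=database_name database_description&wt=json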

Important The API links in the following sections are subject to changes by Data-Planet. These APIs may be under maintenance with Data-Planet and may not work as expected. To use the API service, contact the Data-Planet team.

Database Web Service Call

The database web service call API provides metadata information about a dataset or document. It can show information such as the database_id, database_name, source, category, folder, and other properties.
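
For example, a hypothetical database call might request only the metadata fields listed above for a specific dataset. The database_id value is a placeholder, and the endpoint itself is provided by the Data-Planet team:

    ?q=database_id:<database_id>&fl=database_id,database_name,source,category,folder&wt=json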

Dataset Web Service Call

The dataset web service call API allows free-text searching (text entered by the user) across the datasets in the Data-Planet repository. Use the Extended DisMax query parser (eDisMax) to explore the entire dataset schema. In addition, you can use the following two search fields, which contain all textual data associated with a dataset, as shown in the example after this list:

  • "text": includes all data in the textual fields in the dataset schema
  • "complete_text": includes the following fields:
    • database_description
    • database_notes
    • technical_documentation
    • source_description
    • source_notes
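
For example, a hypothetical dataset call could use the eDisMax parser to search both fields for a phrase. The search phrase, row count, and returned fields are illustrative assumptions (the query is shown unencoded for readability), and the endpoint is provided by the Data-Planet team:

    ?defType=edismax&q="unemployment rate"&qf=text complete_text&rows=20&fl=database_id,database_name&wt=json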

Data Web Service Call

The data web service call returns a subset of data from the Data-Planet repository in XML or JSON formats. XML is the default format. However, we recommend using JSON with Vertica. You must provide at least the following parameters to run a data web service call, as shown in the sketch after this list:

  • database_id: the unique code of the dataset in the repository
  • datetype_id: the type of date to use in the result
  • geo_id: the hierarchical organization of the data to use in the result
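
A data call therefore might look like the following sketch, assuming the three required identifiers are passed as query parameters. The placeholder values and the exact parameter syntax depend on the endpoint that the Data-Planet team provides:

    ?database_id=<database_id>&datetype_id=<datetype_id>&geo_id=<geo_id>&rows=1000&wt=json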

Load Data-Planet Datasets into Vertica

To import data into your Vertica database, you can use any of the following options:

Import Using Flex Tables

The Vertica flex tables feature allows you to load data into a table without defining columns or schemas. Flex tables support SQL queries and built-in analytics, just as regular tables do. You can load data in JSON and CSV formats into a flex table using a COPY statement.

The following examples demonstrate how to load JSON data into a flex table, how to begin parsing the metadata, and how to load data in CSV format into a flex table.

The first example shows how to load JSON data into a flex table:

  1. Create a flex table for metadata and data:

    => CREATE FLEX TABLE DataPlanet.AirlineOnTimeMetadata();
    => CREATE FLEX TABLE DataPlanet.AirlineOnTimeData();
  2. Load metadata using the fjsonparser:

    => COPY DataPlanet.AirlineOnTimeMetadata FROM
    '/home/dbadmin/data/flex/AirlineOnTimeMetadata.json' PARSER
    fjsonparser(flatten_arrays=TRUE, start_point='docs',
    reject_on_duplicate=true);
  3. Load data from the dataset:

    => COPY DataPlanet.AirlineOnTimeData FROM
    '/home/dbadmin/data/flex/AirlineOnTimeData.json' PARSER
    fjsonparser(flatten_arrays=TRUE, start_point='row',
    reject_on_duplicate=true);

Note The previous example shows how you can begin parsing the metadata file from the start_point key 'docs' and begin parsing the data file from the start_point key 'row'. It also shows how you can flatten all arrays in both JSON files.

  4. Query some of the columns in the metadata file. This example shows properties of the dataset, including the source and database name:

    => SELECT source, database_id, database_name, "datatype_name.0" FROM DataPlanet.AirlineOnTimeMetadata;
  5. Query some columns in both the data and the metadata files. This example queries monthly on-time performance percentages by airline:

    => SELECT
       data.database_id,
       metadata.database_name,
       data.geo_name Airline,
       data.datedata YearMonth,
       SUM(data."value._text"::FLOAT)
    FROM
       DataPlanet.AirlineOnTimeData data
    JOIN
       DataPlanet.AirlineOnTimeMetadata metadata
    ON
       (data.database_id = metadata.database_id)
    GROUP BY
       1,2,3,4
    ORDER BY
       1;
  6. Use COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW to populate the _keys and _view tables:

    => SELECT
    COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('DataPlanet.AirlineOnTimeMetadata'),
    COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('DataPlanet.AirlineOnTimeData');
  7. Query the flex table views as follows:

    => SELECT * FROM DataPlanet.AirlineOnTimeMetadata_view;
    => SELECT * FROM DataPlanet.AirlineOnTimeData_view;
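
If you want to inspect the raw key-value pairs that the parser stored before building the views, you can query the flex table's __raw__ column with the MAPTOSTRING function. This is a minimal sketch using the metadata table created above:

    => SELECT MAPTOSTRING(__raw__) FROM DataPlanet.AirlineOnTimeMetadata LIMIT 1;

The output lists each virtual column name and value in the first loaded record, which helps you confirm that the start_point and flatten_arrays settings produced the keys you expect.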

The following example shows how to load data in CSV format into a flex table. In this example, the data was exported from the Data-Planet Statistical Datasets user interface: 

  1. Create a flex table:

    => CREATE FLEX TABLE DataPlanet.AirlineOnTimeCSV();
  2. Load CSV data into the flex table using either the fcsvparser or the fdelimitedparser with a comma delimiter:

    • Load data using the fcsvparser. The Vertica fcsvparser expects the RFC 4180 standard CSV format, which uses a comma delimiter and encloses fields in double quotation marks. The following example uses the reject_on_duplicate=true option to prevent loading duplicates:

      => COPY DataPlanet.AirlineOnTimeCSV FROM
      '/home/dbadmin/data/flex/AirlineOnTimeCSV.csv' PARSER
      fcsvparser(reject_on_duplicate=true);
    • Load data using the fdelimitedparser. You must remove double quotation marks from any strings in the CSV file before importing it into Vertica:

      => COPY DataPlanet.AirlineOnTimeCSV FROM
      '/home/dbadmin/data/flex/AirlineOnTimeCSV.csv' PARSER
      fdelimitedparser(delimiter=',', reject_on_duplicate=true);
  3. Use COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW to populate the _keys and _view tables:

    => SELECT
    COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW
    ('DataPlanet.AirlineOnTimeCSV');
  4. Query the flex table view as shown in the following:

    => SELECT * FROM DataPlanet.AirlineOnTimeCSV_view;

Import Using COPY Statement 

The SQL COPY statement bulk loads data from a file stored on a node into your database. This method is efficient because the file resides on the database server. You can also use the LOCAL option of the COPY statement to stream data from a file on the client, rather than from the file system of the database node.

For more information about the COPY statement syntax, see the SQL Reference Manual in the Vertica documentation.

The following example shows how to use the COPY statement on the server:

  1. Drop the table:

    => DROP TABLE IF EXISTS dataplanet.AirlineOnTimeTest;
  2. Create a new table:

    => CREATE TABLE IF NOT EXISTS dataplanet.AirlineOnTimeTest(
    database_id        Varchar(65000),
    datedata           Varchar(65000),
    datetype_id        Varchar(65000),
    geo_id             Varchar(65000),
    geo_name           Varchar(65000),
    geo_value          Varchar(65000),
    "value._text"      Varchar(65000),
    "value._data_column_id"       Varchar(65000));
  3. Copy the table on the server using regular COPY syntax:

    => COPY dataplanet.AirlineOnTimeTest FROM
    '/home/dbadmin/data/AirlineOnTimeData.csv' SKIP 1 DELIMITER ','
    ENCLOSED BY '"' REJECTED DATA '/home/dbadmin/data/REJECTIONS'
    EXCEPTIONS '/home/dbadmin/data/EXCEPTIONS' DIRECT;
  4. Copy the table on the server using the fcsvparser:

    => COPY dataplanet.AirlineOnTimeTest FROM
    '/home/dbadmin/data/AirlineOnTimeData.csv' PARSER
    fcsvparser(reject_on_duplicate=true);
  5. Copy the table on the server using the fjsonparser:

    => COPY dataplanet.AirlineOnTimeTest FROM
    '/home/dbadmin/data/AirlineOnTimeData.json' PARSER
    fjsonparser(flatten_arrays=TRUE, start_point='row',
    reject_on_duplicate=true);

The following example uses COPY LOCAL:

  1. Drop the table:

    => DROP TABLE IF EXISTS dataplanet.AirlineOnTimeTest;
  2. Create a new table:

    => CREATE TABLE IF NOT EXISTS dataplanet.AirlineOnTimeTest(
    database_id        Varchar(65000),
    datedata           Varchar(65000),
    datetype_id        Varchar(65000),
    geo_id             Varchar(65000),
    geo_name           Varchar(65000),
    geo_value          Varchar(65000),
    "value._text"      Varchar(65000),
    "value._data_column_id"       Varchar(65000));
  3. Copy the table using regular COPY LOCAL syntax:

    => COPY dataplanet.AirlineOnTimeTest FROM LOCAL
    'C:\Users\Administrator\Downloads\AirlineOnTimeData.csv' SKIP 1 DELIMITER ','
    ENCLOSED BY '"' REJECTED DATA 'C:\Users\Administrator\Downloads\REJECTIONS'
    EXCEPTIONS 'C:\Users\Administrator\Downloads\EXCEPTIONS' DIRECT;
  4. Copy the table using COPY LOCAL and the fcsvparser:

    => COPY dataplanet.AirlineOnTimeTest FROM LOCAL
    'C:\Users\Administrator\Downloads\AirlineOnTimeData.csv' PARSER
    fcsvparser(reject_on_duplicate=true);
  5. Copy the table using COPY LOCAL and the fjsonparser:

    => COPY dataplanet.AirlineOnTimeTest FROM LOCAL
    'C:\Users\Administrator\Downloads\AirlineOnTimeData.json' PARSER
    fjsonparser(flatten_arrays=TRUE, start_point='row',
    reject_on_duplicate=true);
  6. Copy the table using COPY LOCAL and the fdelimitedparser, which requires double quotation marks to be removed from the file beforehand:

    => COPY dataplanet.AirlineOnTimeTest FROM LOCAL
    'C:\Users\Administrator\Downloads\AirlineOnTimeData.csv' PARSER
    fdelimitedparser(delimiter=',', reject_on_duplicate=true);

Import Using an ETL Tool

You can also use an ETL tool that integrates with Vertica to load data into your database. These products use either the Vertica JDBC or ODBC driver to connect to Vertica.

Define an ETL process to pull down datasets from the repository using the Data-Planet API and then load the output into your database. You can also use an ETL tool to load data in CSV or JSON formats.

Configure the tool to use the COPY statement to stream data efficiently. The Vertica user in the connection must either be a superuser or have USAGE privileges on the schema and INSERT privileges on the table.
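
For example, the following is a minimal sketch of the grants for a hypothetical ETL account named etl_user; the user name and the target objects are assumptions based on the examples in this document:

    => CREATE USER etl_user IDENTIFIED BY 'password';
    => GRANT USAGE ON SCHEMA DataPlanet TO etl_user;
    => GRANT INSERT ON TABLE DataPlanet.AirlineOnTimeData TO etl_user;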

For more information about ETL tools that integrate with Vertica, see the Technology Partners page on the Vertica website.

For More Information