Vertica VHist ETL Overview

This document describes the VHist schema and provides an overview of the ETL processes that populate the VHist dimension tables and fact tables. VHist is used in the Vertica ETL QuickStart applications.

Note This document is an abridged version of the complete Vertica VHist ETL Overview document that is available as a PDF. Refer to that document for details about the VHist fact tables and dimension tables. 

What is VHist?

VHist (Vertica History) is a sample star schema that is populated with information about system processes, including user sessions, cluster management, memory management, disk activity, and I/O. A VHist ETL application loads the data from Vertica system tables into the VHist data warehouse.

The data in Vertica system tables is dynamic, reflecting the constantly-changing state of the system. However, when the data is captured in the VHist star schema, it is suitable for querying, allowing DBAs to monitor system activity over time.

What Are the Vertica QuickStart Applications?

The QuickStarts are free, sample applications created using front-end products from Vertica technology partners.

The QuickStarts are posted for download on vertica.com/quickstart.

For an overview, see the video: Introduction to the Vertica QuickStart Apps.

Note The Vertica QuickStarts are freely available for demonstration and educational purposes. They are not governed by any license or support agreements and are not suitable for deployment in production environments.

Overview VHist ETL

ETL processes extract data from Vertica system tables in the V_CATALOG and V_MONITOR schemas and load it into the dimension tables and fact tables in the VHist schema.

Populating VHist is a two-step process:

  1. Source to stage—The source tables are loaded into staging tables. Each staging table is a replica of its source table, with one difference: the addition of BATCH_ID, an integer column that holds the identity of the ETL batch load.
  2. Stage to star—The staging tables are transformed into a star schema.

Source-to-Stage ETL Process Flow

This diagram summarizes the ETL process flow for creating vhist_stage from Vertica system tables.

vhist_stage_diagram.jpg

Stage-to-Star ETL Process Flow

This diagram summarizes the ETL process flow for creating the VHist star schema from the vhist_stage staging tables.

vhist_star_diagram.jpg

About VHist Incremental Loads

Incremental loads cause the VHist data warehouse to accumulate system data over time, creating a rich store of data for analysis. However, you should take care when repeatedly refreshing the data warehouse, especially when scheduling the ETL job to run at regular intervals.

Keep the following in mind:

  • The size of the VHist data warehouse could potentially grow very large with repeated loads, and the size of VHist counts towards the size limit specified in your Vertica license.

    Note The Vertica Community Edition allows up to one terabyte of free storage. If you already have a licensed version of Vertica, you could build the VHist warehouse using the Community Edition on a separate cluster.

  • There could be gaps in some of the data loaded into VHist, because the source system tables are not flushed uniformly. Vertica system tables are actually views of underlying tables that are flushed at different intervals.

    Note The VHist ETL process generates log tables that capture detailed information about the timing of each load and the amount of data loaded into each target table. To avoid running incremental loads more often than is necessary, try starting with daily loads and review the results. If there are gaps in the results, decrease the interval between loads until you find an optimal balance. The VHist log tables are listed as follows:

    vhist_stage.vhist_stage_load_log
    
    vhist.vhist_load_log

VHist Fact Tables (Overview Only)

V_MONITOR Source VHist Target ETL Based On:

USER_SESSIONS

USER_SESSIONS_FACT

Timestamp

QUERY_PROFILES

QUERY_PROFILES_FACT

Timestamp

LOAD_STREAMS

LOAD_STREAMS_FACT

Timestamp

RESOURCE_REJECTIONS

RESOURCE_REJECTIONS_FACT

Timestamp

DISK_STORAGE

DISK_STORAGE_FACT

Polling

PROJECTION_STORAGE

PROJECTION_STORAGE_FACT

Polling

PROJECTION_USAGE

PROJECTION_USAGE_FACT

Timestamp

SYSTEM_RESOURCE_USAGE

SYSTEM_RESOURCE_USAGE_FACT

Timestamp

For details about the VHist fact tables, see Vertica VHist ETL Overview (unabridged).

For details about the V_MONITOR system tables, see V_MONITOR Schema in the Vertica documentation.

VHist Dimension Tables (Overview Only)

Dimension Description Created by:

DATE_DIM

Contains date data

SQL script

TIME_DIM

Contains timestamp data

SQL script

BATCH_DIM

Contains information about the ETL batch jobs

SQL Script

NODE_DIM

Contains data loaded from V_CATALOG.NODES

ETL

USER_DIM

Contains data loaded from V_CATALOG.USERS

ETL

For details about the VHist dimension tables, see Vertica VHist ETL Overview (unabridged).

For details about the V_CATALOG system tables, see V_CATALOG Schema in the Vertica documentation.

Find More Information