Vertica VHist ETL Overview
This document describes the VHist schema and provides an overview of the ETL processes that populate the VHist dimension tables and fact tables. VHist is used in the Vertica ETL QuickStart applications.
Note This document is an abridged version of the complete Vertica VHist ETL Overview document that is available as a PDF. Refer to that document for details about the VHist fact tables and dimension tables.
What is VHist?
VHist (Vertica History) is a sample star schema that is populated with information about system processes, including user sessions, cluster management, memory management, disk activity, and I/O. A VHist ETL application loads the data from Vertica system tables into the VHist data warehouse.
The data in Vertica system tables is dynamic, reflecting the constantly-changing state of the system. However, when the data is captured in the VHist star schema, it is suitable for querying, allowing DBAs to monitor system activity over time.
What Are the Vertica QuickStart Applications?
The QuickStarts are free, sample applications created using front-end products from Vertica technology partners.
The QuickStarts are posted for download on vertica.com/quickstart.
For an overview, see the video: Introduction to the Vertica QuickStart Apps.
Note The Vertica QuickStarts are freely available for demonstration and educational purposes. They are not governed by any license or support agreements and are not suitable for deployment in production environments.
Overview VHist ETL
ETL processes extract data from Vertica system tables in the V_CATALOG and V_MONITOR schemas and load it into the dimension tables and fact tables in the VHist schema.
Populating VHist is a two-step process:
- Source to stage—The source tables are loaded into staging tables. Each staging table is a replica of its source table, with one difference: the addition of BATCH_ID, an integer column that holds the identity of the ETL batch load.
- Stage to star—The staging tables are transformed into a star schema.
Source-to-Stage ETL Process Flow
This diagram summarizes the ETL process flow for creating vhist_stage from Vertica system tables.
Stage-to-Star ETL Process Flow
This diagram summarizes the ETL process flow for creating the VHist star schema from the vhist_stage staging tables.
About VHist Incremental Loads
Incremental loads cause the VHist data warehouse to accumulate system data over time, creating a rich store of data for analysis. However, you should take care when repeatedly refreshing the data warehouse, especially when scheduling the ETL job to run at regular intervals.
Keep the following in mind:
- The size of the VHist data warehouse could potentially grow very large with repeated loads, and the size of VHist counts towards the size limit specified in your Vertica license.
Note The Vertica Community Edition allows up to one terabyte of free storage. If you already have a licensed version of Vertica, you could build the VHist warehouse using the Community Edition on a separate cluster.
- There could be gaps in some of the data loaded into VHist, because the source system tables are not flushed uniformly. Vertica system tables are actually views of underlying tables that are flushed at different intervals.
Note The VHist ETL process generates log tables that capture detailed information about the timing of each load and the amount of data loaded into each target table. To avoid running incremental loads more often than is necessary, try starting with daily loads and review the results. If there are gaps in the results, decrease the interval between loads until you find an optimal balance. The VHist log tables are listed as follows:
vhist_stage.vhist_stage_load_log
vhist.vhist_load_log
VHist Fact Tables (Overview Only)
V_MONITOR Source | VHist Target | ETL Based On: |
---|---|---|
USER_SESSIONS |
USER_SESSIONS_FACT |
Timestamp |
QUERY_PROFILES |
QUERY_PROFILES_FACT |
Timestamp |
LOAD_STREAMS |
LOAD_STREAMS_FACT |
Timestamp |
RESOURCE_REJECTIONS |
RESOURCE_REJECTIONS_FACT |
Timestamp |
DISK_STORAGE |
DISK_STORAGE_FACT |
Polling |
PROJECTION_STORAGE |
PROJECTION_STORAGE_FACT |
Polling |
PROJECTION_USAGE |
PROJECTION_USAGE_FACT |
Timestamp |
SYSTEM_RESOURCE_USAGE |
SYSTEM_RESOURCE_USAGE_FACT |
Timestamp |
For details about the VHist fact tables, see Vertica VHist ETL Overview (unabridged).
For details about the V_MONITOR system tables, see V_MONITOR Schema in the Vertica documentation.
VHist Dimension Tables (Overview Only)
Dimension | Description | Created by: |
---|---|---|
DATE_DIM |
Contains date data |
SQL script |
TIME_DIM |
Contains timestamp data |
SQL script |
BATCH_DIM |
Contains information about the ETL batch jobs |
SQL Script |
NODE_DIM |
Contains data loaded from V_CATALOG.NODES |
ETL |
USER_DIM |
Contains data loaded from V_CATALOG.USERS |
ETL |
For details about the VHist dimension tables, see Vertica VHist ETL Overview (unabridged).
For details about the V_CATALOG system tables, see V_CATALOG Schema in the Vertica documentation.