Monitoring COPY Loads and Metrics

You can check COPY loads using:

  • Vertica functions
  • LOAD_STREAMS system table
  • LOAD_SOURCES system table

Using Vertica Functions

Two meta-functions return COPY metrics for the number of accepted or rejected rows from a COPY statement:

  1. To get the number of accepted rows, use the GET_NUM_ACCEPTED_ROWS function:

    => select get_num_accepted_rows();
     get_num_accepted_rows 
    -----------------------
                        11
    (1 row)
    
  2. To check the number of rejected rows, use the GET_NUM_REJECTED_ROWS function:

    => select get_num_rejected_rows();
     get_num_rejected_rows 
    -----------------------
                        0
    (1 row)
    

GET_NUM_ACCEPTED_ROWS and GET_NUM_REJECTED_ROWS support loads from STDIN, COPY LOCAL from a Vertica client, or a single file on the initiator. You cannot use these functions for multi-node loads.

Using the CURRENT_LOAD_SOURCE Function

You can include the CURRENT_LOAD_SOURCE function as a part of the COPY statement. Doing so allows you to insert into a column the input file name or value computed by this function.

To insert the file names into a column from multiple source files:

=> COPY t (c1, c2, c3 as CURRENT_LOAD_SOURCE()) FROM '/home/load_file_1' ON  exampledb_node02, 
'/home/load_file_2' ON exampledb_node03 DELIMITER ',';

Using the LOAD_STREAMS System Table

Vertica includes a set of system tables that include monitoring information, as described in Using System Tables. The LOAD_STREAMS system table includes information about load stream metrics from COPY and COPY FROM VERTICA statements. Thus, you can query table values to get COPY metrics.

To see all table columns:

=> SELECT * FROM load_streams; 

Using the STREAM NAME Parameter

Using the STREAM NAME parameter as part of the COPY statement labels COPY streams explicitly so they are easier to identify in the LOAD_STREAMS system table.

To use the STREAM NAME parameter:

=> COPY mytable FROM myfile DELIMITER '|' DIRECT STREAM NAME 'My stream name';

The LOAD_STREAMS system table includes stream names for every COPY statement that takes more than 1-second to run. The 1-second duration includes the time to plan and execute the statement.

Vertica maintains system table metrics until they reach a designated size quota (in kilobytes). This quota is set through internal processes, which you cannot set or view directly.

Other LOAD_STREAMS Columns for COPY Metrics

These LOAD_STREAMS system table column values depend on the load status:

  • ACCEPTED_ROW_COUNT
  • REJECTED_ROW_COUNT
  • PARSE_COMPLETE_PERCENT
  • SORT_COMPLETE_PERCENT

When a COPY statement using the DIRECT option is in progress, the ACCEPTED_ROW_COUNT value can increase during parsing. This value can reach the maximum number of rows in the input file.

If COPY reads input data from multiple named pipes, the PARSE_COMPLETE_PERCENT value remains at zero (0) until all named pipes return an EOF. While COPY awaits an EOF from multiple pipes, it can appear to be hung. However, before canceling the COPY statement, check your system CPU and disk accesses to determine if any activity is in progress.

In a typical load, the PARSE_COMPLETE_PERCENT value can either increase slowly or jump quickly to 100%, if you are loading from named pipes or STDIN. However, SORT_COMPLETE_PERCENT remains at 0 when loading from named pipes or STDIN. After PARSE_COMPLETE_PERCENT reaches 100%, SORT_COMPLETE_PERCENT increases to 100%. Depending on the data sizes, a significant lag can occur between the time PARSE_COMPLETE_PERCENT reaches 100% sand the time SORT_COMPLETE_PERCENT begins to increase.

This example shows how you can set the VSQL expanded display and then select various columns of data from the LOAD_STREAMS system table:

=> \pset expanded
Expanded display is on.
=> SELECT stream_name, table_name, load_start, accepted_row_count, 
   rejected_row_count, read_bytes, unsorted_row_count, sorted_row_count,
   sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name            | fact-13
table_name             | fact
load_start             | 2010-12-28 15:07:41.132053
accepted_row_count     | 900
rejected_row_count     | 100
read_bytes             | 11975
input_file_size_bytes  | 0
parse_complete_percent | 0
unsorted_row_count     | 3600
sorted_row_count       | 3600
sort_complete_percent  | 100

See the SQL Reference Manual for other meta-function details.

Using the LOAD_SOURCES System Table

The LOAD_STREAMS table shows the total number of rows that were loaded or rejected. Grouping this information by source can help you determine from where data is coming. The LOAD_SOURCES system table includes some of the same data as LOAD_STREAMS does but adds this source-specific information. If apportioning is enabled, LOAD_SOURCES also provides information about how loads are apportioned among nodes.

You can use this table to identify causes of differing query results. For example, you can use the following statement to create an external table based on globs:

=> CREATE EXTERNAL TABLE tt AS COPY WITH SOURCE AWS(dir = 'foo', file = '*');

If you select from this table, Vertica loads data from every file in the foo directory and creates one row in the LOAD_SOURCES table for each file. Suppose you later repeat the query and see different results. You could look at the LOAD_SOURCES table and discover that—between the two queries—somebody added another file to the foo directory.  Because each file is recorded in LOAD_SOURCES, you can see the new file that explains the changed query results.

If you are using many data sources. you might prefer to disable this reporting. To disable reporting, set the LoadSourceStatisticsLimit configuration parameter to 0.  This parameter sets the upper bound on the number of sources profiled by LOAD_SOURCES per load. The default value is 256.