Collecting Table Statistics

ANALYZE_STATISTICS collects and aggregates data samples and storage information from all nodes that store projections of the target tables.

You can set the scope of the collection at several levels:

Database
Table
Table columns

ANALYZE_STATISTICS can also control the size of the data sample that it collects.

Analyze All Database Tables

If ANALYZE_STATISTICS specifies no table, it collects statistics for all database tables and their projections. For example:

=> SELECT ANALYZE_STATISTICS (''); 
 ANALYZE_STATISTICS 
--------------------
                  0
(1 row)

Analyze a Single Table

You can compute statistics on a single table as follows:

=> SELECT ANALYZE_STATISTICS ('public.store_orders_fact'); 
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)

When you query system table PROJECTION_COLUMNS, it confirms that statistics have been collected on all table columns for all projections of store_orders_fact:

=> SELECT projection_name, statistics_type, table_column_name,statistics_updated_timestamp 
    FROM projection_columns WHERE projection_name ilike 'store_orders_fact%' AND table_schema='public';
   projection_name    | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+-------------------------------
 store_orders_fact_b0 | FULL            | product_key       | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | product_version   | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | store_key         | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | vendor_key        | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | employee_key      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | order_number      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | date_ordered      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | date_shipped      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | quantity_ordered  | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | shipper_name      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b1 | FULL            | product_key       | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b1 | FULL            | product_version   | 2019-04-04 18:06:55.747329-04
...
(20 rows)

Analyze Table Columns

Within a table, you can narrow scope of analysis to a subset of its columns. Doing so can save significant processing overhead for big tables that contain many columns. It is especially useful if you frequently query these tables on specific columns.

If you collect statistics on specific columns, be sure to include all columns that you are likely to query. If a query includes other columns in that table, the query optimizer regards the statistics as incomplete for that query and ignores them in its plan.

For example, instead of collecting statistics on all columns in store_orders_fact, you can select only those columns that are frequently queried: product_key, product_version, order_number, and quantity_shipped:

=> SELECT DROP_STATISTICS('public.store_orders_fact');
=> SELECT ANALYZE_STATISTICS ('public.store_orders_fact', 'product_key, product_version, order_number, quantity_ordered'); 
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)

If you query PROJECTION_COLUMNS again, it returns the following results:

=> SELECT projection_name, statistics_type, table_column_name,statistics_updated_timestamp 
    FROM projection_columns WHERE projection_name ilike 'store_orders_fact%' AND table_schema='public';
   projection_name    | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+------------------------------
 store_orders_fact_b0 | FULL            | product_key       | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | FULL            | product_version   | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | store_key         | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | vendor_key        | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | employee_key      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | FULL            | order_number      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | date_ordered      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | date_shipped      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | FULL            | quantity_ordered  | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | shipper_name      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | product_key       | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | product_version   | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | store_key         | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | vendor_key        | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | employee_key      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | order_number      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | date_ordered      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | date_shipped      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | quantity_ordered  | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | shipper_name      | 2019-04-04 18:09:40.05452-04
(20 rows)

In this case, columns statistics_type is set to FULL only for those columns on which you ran ANALYZE_STATISTICS. The remaining table columns are set to ROWCOUNT, indicating that only row statistics were collected for them.

ANALYZE_STATISTICS always invokes ANALYZE_ROW_COUNT on all table columns, even if ANALYZE_STATISTICS specifies a subset of those columns.

Data Collection Percentage

By default, Vertica collects a fixed 10-percent sample of statistical data from disk. Specifying a percentage of data to read from disk gives you more control over deciding between sample accuracy and speed.

The percentage of data you collect affects collection time and accuracy:

A smaller percentage is faster but returns a smaller data sample, which might compromise histogram accuracy.
A larger percentage reads more data off disk. Data collection is slower, but a larger data sample enables greater histogram accuracy.

For example:

Collect data on all projections for shipping_dimension from 20 percent of the disk:

=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 20); 
 ANALYZE_STATISTICS 
-------------------
                 0
(1 row)

Collect data from the entire disk by setting the percent parameter to 100:

=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 'shipping_key', 100);
 ANALYZE_STATISTICS 
--------------------
                  0
(1 row)

Sampling Size

ANALYZE_STATISTICS constructs a column histogram from a set of rows that it randomly selects from all collected data. Regardless of the percentage setting, the function always creates a statistical sample that contains up to (approximately) the smaller of:

2¹⁷ (131,072) rows
Number of rows that fit in 1 GB of memory

If a column has fewer rows than the maximum sample size, ANALYZE_STATISTICS reads all rows from disk and analyzes the entire column.

The data collected in a sample range does not indicate how data should be distributed.

The following table shows how ANALYZE_STATISTICS, when set to different percentages, obtains a statistical sample from a given column:

Number of column rows	%	Number of rows read	Number of sampled rows
`<=`max‑sample‑size	20	All	All
400K	10	max‑sample‑size	max‑sample‑size
4000K	10	400K	max‑sample‑size