Collecting Partition Statistics

ANALYZE_STATISTICS_PARTITION collects and aggregates data samples and storage information for a range of partitions in the specified table. Vertica writes the collected statistics to the database catalog.

For example, the following table stores sales data and is partitioned by order dates:

CREATE TABLE public.store_orders_fact
(
    product_key int,
    product_version int,
    store_key int,
    vendor_key int,
    employee_key int,
    order_number int,
    date_ordered date NOT NULL,
    date_shipped date NOT NULL,
    quantity_ordered int,
    shipper_name varchar(32)
);

ALTER TABLE public.store_orders_fact PARTITION BY date_ordered::DATE GROUP BY CALENDAR_HIERARCHY_DAY(date_ordered::DATE, 2, 2) REORGANIZE;
ALTER TABLE public.store_orders_fact ADD CONSTRAINT fk_store_orders_product FOREIGN KEY (product_key, product_version) references public.product_dimension (product_key, product_version);
ALTER TABLE public.store_orders_fact ADD CONSTRAINT fk_store_orders_vendor FOREIGN KEY (vendor_key) references public.vendor_dimension (vendor_key);
ALTER TABLE public.store_orders_fact ADD CONSTRAINT fk_store_orders_employee FOREIGN KEY (employee_key) references public.employee_dimension (employee_key);

At the end of each business day you might call ANALYZE_STATISTICS_PARTITION and collect statistics on all data of the latest (today's) partition:

=> SELECT ANALYZE_STATISTICS_PARTITION('public.store_orders_fact', CURRENT_DATE::VARCHAR(10), CURRENT_DATE::VARCHAR(10));
 ANALYZE_STATISTICS_PARTITION
------------------------------
                            0
(1 row)

The function produces a set of fresh statistics for the most recent partition in store.store_sales_fact. If you query this table each morning on yesterday's sales, the optimizer uses these statistics to generate an optimized query plan:

=> EXPLAIN SELECT COUNT(*) FROM public.store_orders_fact WHERE date_ordered = CURRENT_DATE-1;

                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT COUNT(*) FROM public.store_orders_fact WHERE date_ordered = CURRENT_DATE-1;

 Access Path:
 +-GROUPBY NOTHING [Cost: 2, Rows: 1] (PATH ID: 1)
 |  Aggregates: count(*)
 |  Execute on: All Nodes
 | +---> STORAGE ACCESS for store_orders_fact [Cost: 1, Rows: 222(PARTITION-LEVEL STATISTICS)] (PATH ID: 2)
 | |      Projection: public.store_orders_fact_v1_b1
 | |      Filter: (store_orders_fact.date_ordered = '2019-04-01'::date)
 | |      Execute on: All Nodes

Narrowing the Collection Scope

Like ANALYZE_STATISTICS, ANALYZE_STATISTICS_PARTITION lets you narrow the scope of analysis to a subset of a table's columns. You can also control the size of the data sample that it collects. For details on these options, see Collecting Table Statistics.

Collecting Statistics on Multiple Partition Ranges

If you specify multiple partitions, they must be continuous. Different collections of statistics can overlap. For example, the following table t1 is partitioned on column c1:

=> SELECT export_tables('','t1');
                                          export_tables
-------------------------------------------------------------------------------------------------
CREATE TABLE public.t1
(
    a int,
    b int,
    c1 int NOT NULL
)
PARTITION BY (t1.c1);


=> SELECT * FROM t1 ORDER BY c1;
 a  | b  | c1
----+----+----
  1 |  2 |  3
  4 |  5 |  6
  7 |  8 |  9
 10 | 11 | 12
(4 rows)

Given this dataset, you can call ANALYZE_STATISTICS_PARTITION on t1 twice. The successive calls collect statistics for two overlapping ranges of partition keys, 3 through 9 and 6 through 12:

=> SELECT drop_statistics_partition('t1', '', '');
 drop_statistics_partition
---------------------------
                         0
(1 row)

=> SELECT analyze_statistics_partition('t1', '3', '9');
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=> SELECT analyze_statistics_partition('t1', '6', '12');
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=> SELECT table_name, min_partition_key, max_partition_key, row_count FROM table_statistics WHERE table_name = 't1';
 table_name | min_partition_key | max_partition_key | row_count
------------+-------------------+-------------------+-----------
 t1         | 3                 | 9                 |         3
 t1         | 6                 | 12                |         3
(2 rows)

If two statistics collections overlap, Vertica stores only the most recent statistics for each partition range. Thus, given the previous example, Vertica uses only statistics from the second collection for partition keys 6 through 9.

Statistics that are collected for a given range of partition keys always supersede statistics that were previously collected for a subset of that range. For example, given a call to ANALYZE_STATISTICS_PARTITION that specifies partition keys 3 through 12, the collected statistics are a superset of the two sets of statistics collected earlier, so it supersedes both:

=> SELECT analyze_statistics_partition('t1', '3', '12');
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=> SELECT table_name, min_partition_key, max_partition_key, row_count FROM table_statistics WHERE table_name = 't1';
 table_name | min_partition_key | max_partition_key | row_count
------------+-------------------+-------------------+-----------
 t1         | 3                 | 12                |         4
(1 row)

Finally, ANALYZE_STATISTICS_PARTITION collects statistics on partition keys 3 through 6. This collection is a subset of the previous collection, so Vertica retains both sets and uses the latest statistics from each:

=> SELECT analyze_statistics_partition('t1', '3', '6');
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=> SELECT table_name, min_partition_key, max_partition_key, row_count FROM table_statistics WHERE table_name = 't1';
 table_name | min_partition_key | max_partition_key | row_count
------------+-------------------+-------------------+-----------
 t1         | 3                 | 12                |         4
 t1         | 3                 | 6                 |         2
(2 rows)

Supported Date/Time Functions

ANALYZE_STATISTICS_PARTITION can collect partition-level statistics on tables where the partition expression specifies one of the following date/time functions:

Requirements and Restrictions

The following requirements and restrictions apply to ANALYZE_STATISTICS_PARTITION:

  • The table must be partitioned and cannot contain unpartitioned data.
  • The table partition expression must specify a single column. The following expressions are supported:
    • Expressions that specify only the column—that is, partition on all column values. For example:
      PARTITION BY ship_date GROUP BY CALENDAR_HIERARCHY_DAY(ship_date, 2, 2)
    • If the column is a DATE or TIMESTAMP/TIMESTAMPTZ, the partition expression can specify a supported date/time function that returns that column or any portion of it, such as month or year. For example, the following partition expression specifies to partition on the year portion of column order_date:
      PARTITION BY YEAR(order_date)
    • Expressions that perform addition or subtraction on the column. For example:
      PARTITION BY YEAR(order_date) -1
  • The table partition expression cannot coerce the specified column to another data type.
  • Vertica collects no statistics from the following projections:
    • Live aggregate and Top-K projections
    • Projections that are defined to include an SQL function within an expression