Collecting Database Statistics
The Vertica cost-based query optimizer relies on data statistics to produce query plans. If statistics are incomplete or out-of-date, the optimizer is liable to use a sub-optimal plan to execute a query.
For example, you load timestamp data into a large table at regular intervals and query the table for the mostly recently loaded rows. The following scenarios influence how optimizer chooses a plan:
- You load days 1 through 15 into the table and run
ANALYZE_STATISTICS
. When you next run a query that requests yesterday's data by filtering on the timestamp column, the optimizer chooses an optimized query plan. - The next day, you load day 16 data and run the same query. If you do not run
again, the optimizer might conclude that the predicate results in only one row being returned because the date range falls outside the histogram range and the data becomes stale. When the optimizer detects that statistics are not current for a particular predicate—in this case, because a timestamp predicate is outside a histogram's boundary)—Vertica uses other considerations to plan the query, such as FK-PK constraints.ANALYZE_STATISTICS
In this case, you can run ANALYZE_STATISTICS
after loading new data on day 16. You can also look for statistics in the EXPLAIN
-generated query plan. For example, when statistics are outside a histogram's boundaries, the query plan is annotated with a status.
Vertica provides a number of tools like ANALYZE_STATISTICS
for analyzing and updating database statistics. For an overview, see Database Statistics Tools.