Approximate Count Distinct Functions
Vertica provides the COUNT(DISTINCT) function to compute the exact number of distinct values in a data set. If projections are available that allow COUNT(DISTINCT) to execute using the GROUPBY PIPELINED algorithm, COUNT(DISTINCT) performs well. In some situations, however, using APPROXIMATE_COUNT_DISTINCT performs better than COUNT(DISTINCT).
A COUNT [Aggregate] operation performs well when:
- One of the sorted projections delivers an order that enables sorted aggregation to be performed.
- The number of distinct values is fairly small.
- Hashed aggregation is required to execute the query.
When an approximate value will suffice or the values need to be rolled up, consider using the APPROXIMATE_COUNT_DISTINCT* functions.
Note: The APPROXIMATE_COUNT_DISTINCT* functions cannot appear in the same query block as DISTINCT aggregates.
Use Cases
Use APPROXIMATE_COUNT_DISTINCT as a direct replacement for COUNT (DISTINCT) when:
- You have a large data set and you do not require an exact count of distinct values.
- The performance of COUNT(DISTINCT) on a given data set is insufficient.
- You calculate several distinct counts in the same query.
- The plan for COUNT(DISTINCT) uses hashed aggregation.
Most of the time, APPROXIMATE_COUNT_DISTINCT executes faster than a comparable COUNT(DISTINCT) operation when hashed.
The expected value that APPROXIMATE_COUNT_DISTINCT returns is equal to COUNT(DISTINCT), with an error that is lognormally distributed with standard deviation s. You can control the standard deviation directly by setting the error_tolerance.
Use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS and APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS together when:
- You have a large data set and you don't require an exact count of distinct values.
-
The performance of COUNT(DISTINCT) on a given data set is insufficient.
and
- You want to pre-compute the distinct counts and later combine them in different ways.
Pass APPROXIMATE_COUNT_DISTINCT_SYNOPSIS the data set and a normally distributed confidence interval. The function returns a subset of the data, called a synopsis.
Pass the synopsis to the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS function, which then performs an approximate count distinct on the synopsis.