Approximate Count Distinct Functions

Vertica provides the COUNT(DISTINCT) function to compute the exact number of distinct values in a data set. If projections are available that allow COUNT(DISTINCT) to execute using the GROUPBY PIPELINED algorithm, COUNT(DISTINCT) performs well. In some situations, however, using APPROXIMATE_COUNT_DISTINCT performs better than COUNT(DISTINCT).

A COUNT [Aggregate] operation performs well when:

When an approximate value will suffice or the values need to be rolled up, consider using the APPROXIMATE_COUNT_DISTINCT* functions.

Note: The APPROXIMATE_COUNT_DISTINCT* functions cannot appear in the same query block as DISTINCT aggregates.

Use Cases

Use APPROXIMATE_COUNT_DISTINCT as a direct replacement for COUNT (DISTINCT) when:

Most of the time, APPROXIMATE_COUNT_DISTINCT executes faster than a comparable COUNT(DISTINCT) operation when hashed.

The expected value that APPROXIMATE_COUNT_DISTINCT returns is equal to COUNT(DISTINCT), with an error that is lognormally distributed with standard deviation s. You can control the standard deviation directly by setting the error_tolerance.

Use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS and APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS together when:

Pass APPROXIMATE_COUNT_DISTINCT_SYNOPSIS the data set and a normally distributed confidence interval. The function returns a subset of the data, called a synopsis.

Pass the synopsis to the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS function, which then performs an approximate count distinct on the synopsis.