APPROXIMATE_COUNT_DISTINCT
Returns the number of distinct non-NULL values in a data set.
Behavior Type
Syntax
APPROXIMATE_COUNT_DISTINCT ( expression[, error-tolerance ] )
Parameters
expression |
Value to be evaluated using any data type that supports equality comparison. |
error‑tolerance |
Numeric value that represents the desired percentage of error tolerance, distributed around the value returned by this function. The smaller the error tolerance, the closer the approximation. You can set If you omit this argument, Vertica uses an error tolerance of 1.25(%). |
Restrictions
APPROXIMATE_COUNT_DISTINCT
and DISTINCT
aggregates cannot be in the same query block.
Error Tolerance
APPROXIMATE_COUNT_DISTINCT(x, error‑tolerance)
returns a value equal to COUNT(DISTINCT x)
, with an error that is lognormally distributed with standard deviation.
Parameter error‑tolerance is optional. Supply this argument to specify the desired standard deviation. error‑tolerance is defined as 2.17 standard deviations, which corresponds to a 97 percent confidence interval:
standard-deviation = error tolerance / 2.17
For example:
-
error‑tolerance = 1
The default setting, corresponds to a standard deviation. 97 percent of the time,APPROXIMATE_COUNT_DISTINCT(x,5)
returns a value between:COUNT(DISTINCT x) * 0.99
COUNT(DISTINCT x) * 1.01
error‑tolerance = 5
97 percent of the time,
APPROXIMATE_COUNT_DISTINCT(x)
returns a value between:COUNT(DISTINCT x) * 0.95
COUNT(DISTINCT x) * 1.05
A 99 percent confidence interval corresponds to 2.58
standard deviations. To set error‑tolerance confidence level corresponding to 99 (instead of a 97) percent , multiply error‑tolerance by 2.17 / 2.58 = 0.841
.
For example, if you specify error‑tolerance
as 5 * 0.841 = 4.2
, APPROXIMATE_COUNT_DISTINCT(x,4.2)
returns values 99 percent of the time between:
COUNT (DISTINCT x) * 0.95
COUNT (DISTINCT x) * 1.05
Examples
Count the total number of distinct values in column product_key
from table store.store_sales_fact
:
=> SELECT COUNT(DISTINCT product_key) FROM store.store_sales_fact;
COUNT ------- 19982 (1 row)
Count the approximate number of distinct values in product_key
with various error tolerances. The smaller the error tolerance, the closer the approximation:
=> SELECT APPROXIMATE_COUNT_DISTINCT(product_key,5) AS five_pct_accuracy, APPROXIMATE_COUNT_DISTINCT(product_key,1) AS one_pct_accuracy, APPROXIMATE_COUNT_DISTINCT(product_key,.88) AS point_eighteight_pct_accuracy FROM store.store_sales_fact;
five_pct_accuracy | one_pct_accuracy | point_eighteight_pct_accuracy -------------------+------------------+------------------------------- 19431 | 19921 | 19921 (1 row)