Approximate Count Distinct Functions

Vertica provides the COUNT(DISTINCT) function to compute the exact number of distinct values in a data set. If projections are available that allow COUNT(DISTINCT) to execute using the GROUPBY PIPELINED algorithm, COUNT(DISTINCT) performs well. In some situations, however, using APPROXIMATE_COUNT_DISTINCT performs better than COUNT(DISTINCT).

A COUNT [Aggregate] operation performs well when:

  • One of the sorted projections delivers an order that enables sorted aggregation to be performed.
  • The number of distinct values is fairly small.
  • Hashed aggregation is not required to execute the query.

When an approximate value will suffice or the values need to be rolled up, consider using the APPROXIMATE_COUNT_DISTINCT* functions.

The APPROXIMATE_COUNT_DISTINCT* functions cannot appear in the same query block as DISTINCT aggregates.
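For example, a query along these lines (using the pviews table from the example below) is rejected, because it mixes an approximate aggregate with an exact DISTINCT aggregate in one query block:

-- Invalid: APPROXIMATE_COUNT_DISTINCT cannot appear with COUNT(DISTINCT)
SELECT APPROXIMATE_COUNT_DISTINCT(user_id),
       COUNT(DISTINCT page_id)
FROM pviews;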

Use Cases

Use APPROXIMATE_COUNT_DISTINCT as a direct replacement for COUNT(DISTINCT) when:

  • You have a large data set and you do not require an exact count of distinct values.
  • The performance of COUNT(DISTINCT) on a given data set is insufficient.
  • You calculate several distinct counts in the same query.
  • The plan for COUNT(DISTINCT) uses hashed aggregation.

In most cases, APPROXIMATE_COUNT_DISTINCT executes faster than a comparable COUNT(DISTINCT) operation that must use hashed aggregation.
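One way to check which algorithm a plan uses is to run EXPLAIN on the query and look for GROUPBY HASH rather than GROUPBY PIPELINED in the plan output. A minimal sketch, assuming the pviews table created in the example below:

-- Inspect the plan; GROUPBY HASH indicates hashed aggregation
EXPLAIN SELECT COUNT(DISTINCT user_id) FROM pviews;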

The expected value that APPROXIMATE_COUNT_DISTINCT returns is equal to COUNT(DISTINCT), with an error that is lognormally distributed with standard deviation s. You can control the standard deviation directly by setting the error_tolerance parameter.
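For instance, assuming the pviews table created in the example below, the following sketch compares the exact count with approximate counts at the default error tolerance (1.25 percent) and at a caller-specified tolerance:

SELECT COUNT(DISTINCT user_id) FROM pviews;                -- exact count
SELECT APPROXIMATE_COUNT_DISTINCT(user_id) FROM pviews;    -- default error tolerance
SELECT APPROXIMATE_COUNT_DISTINCT(user_id, 5) FROM pviews; -- 5 percent error tolerance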

Use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS and APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS together when all of the following are true:

  • You have a large data set and you do not require an exact count of distinct values.
  • The performance of COUNT(DISTINCT) on a given data set is insufficient.
  • You want to pre-compute the distinct counts and later combine them in different ways.

Pass APPROXIMATE_COUNT_DISTINCT_SYNOPSIS the data set and, optionally, an error tolerance. The function returns a binary object called a synopsis, which summarizes the distinct values in the data.

Pass the synopsis to the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS function, which then performs an approximate count distinct on the synopsis.
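A minimal sketch of the two-step pattern, using hypothetical table and column names (t, x):

-- Step 1: pre-compute and store a synopsis of column x
CREATE TABLE t_summary AS
    SELECT APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(x) AS syn FROM t;

-- Step 2: later, compute the approximate distinct count from the stored synopsis
SELECT APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(syn) FROM t_summary;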

COUNT(DISTINCT) performs better on small groups of data; APPROXIMATE_COUNT_DISTINCT performs better on larger groups of data.

Use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE to merge multiple synopses into one synopsis. This allows you to continually update a "master" synopsis by merging in one or more synopses that cover more recent, shorter periods of time.
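For instance, if a hypothetical table daily_summary holds one synopsis per day in a column users_syn, a sketch like the following rolls them up into a single synopsis that can be stored as the new master:

-- Merge many per-day synopses into one combined synopsis
SELECT APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(users_syn) AS master_syn
FROM daily_summary;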

Example

This example shows how you can use the Approximate Count Distinct functions, along with GROUP BY, as an efficient way to keep a running approximate count of how many distinct users have clicked a given web page within a given time period.

First, we create a distributed table called pviews to store data about visits to a website. Its three columns record the time of each visit, the web page visited, and the user who visited it: visit_time, page_id, and user_id.

The table is segmented by the hash of the user_id column. This ensures that all visits by a given user are stored on the same segment, on the same node, which prevents inefficient cross-node transfer of data when we later perform a COUNT(DISTINCT user_id).

The table also uses hierarchical partitioning by time of visit to optimize ROS storage, which improves performance when we filter data by time.

Later in the example, we use the APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE function to merge the daily synopses into monthly synopses.

DROP TABLE IF EXISTS pviews;
DROP TABLE
CREATE TABLE pviews(
    visit_time TIMESTAMP NOT NULL,
    page_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL)
    ORDER BY page_id, visit_time
    SEGMENTED BY HASH(user_id) ALL NODES KSAFE
    PARTITION BY visit_time::DATE
    GROUP BY CALENDAR_HIERARCHY_DAY(visit_time::DATE, 2, 2);

-- HINT: Use mark_design_ksafe() to set system KSAFE level

CREATE TABLE

-- Load some data into pviews

COPY pviews FROM STDIN DELIMITER '|' DIRECT;
2019-02-01 10:00:01|1000|1
2019-02-01 10:00:02|1002|1
2019-02-01 10:00:03|1002|2
2019-02-01 10:00:04|1002|1
2019-02-01 10:00:05|1002|3
2019-02-01 10:00:06|1002|1
2019-02-01 10:00:07|1002|3
2019-02-01 10:00:08|1002|1
2019-02-01 10:00:09|1002|3
2019-02-01 10:00:12|1002|2
2019-02-02 10:00:01|1000|1
2019-02-02 10:00:02|1002|4
2019-02-02 10:00:03|1002|2
2019-02-02 10:00:04|1002|1
2019-02-02 10:00:05|1002|3
2019-02-02 10:00:06|1002|4
2019-02-02 10:00:07|1002|3
2019-02-02 10:00:08|1002|4
2019-02-02 10:00:09|1002|3
2019-02-02 10:00:12|1002|2
2019-03-02 10:00:01|1000|1
2019-03-02 10:00:02|1002|1
2019-03-02 10:00:03|1002|2
2019-03-02 10:00:04|1002|1
2019-03-02 10:00:05|1002|3
2019-03-02 10:00:06|1002|4
2019-03-02 10:00:07|1002|3
2019-03-02 10:00:08|1002|6
2019-03-02 10:00:09|1002|5
2019-03-02 10:00:12|1002|2
2019-03-02 11:00:01|1000|5
2019-03-02 11:00:02|1002|6
2019-03-02 11:00:03|1002|7
2019-03-02 11:00:04|1002|4
2019-03-02 11:00:05|1002|1
2019-03-02 11:00:06|1002|6
2019-03-02 11:00:07|1002|8
2019-03-02 11:00:08|1002|6
2019-03-02 11:00:09|1002|7
2019-03-02 11:00:12|1002|1
2019-03-03 10:00:01|1000|1
2019-03-03 10:00:02|1002|2
2019-03-03 10:00:03|1002|4
2019-03-03 10:00:04|1002|1
2019-03-03 10:00:05|1002|2
2019-03-03 10:00:06|1002|6
2019-03-03 10:00:07|1002|9
2019-03-03 10:00:08|1002|10
2019-03-03 10:00:09|1002|7
2019-03-03 10:00:12|1002|1
\.

-- Now create a summary table, pview_summary, where each row summarizes
-- activities for one date. The first column is the date to summarize.
-- The second column, partial_visit_count, stores the number of rows in
-- the pviews table that belong to that date. In the third column,
-- daily_users_acdp, we use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS to
-- construct a synopsis that encodes the approximate number of distinct
-- users for that date, generated from the user_id values in the rows
-- for that date.

DROP TABLE IF EXISTS pview_summary;
DROP TABLE
CREATE TABLE pview_summary AS
    SELECT visit_time::DATE "date",
           COUNT(*) partial_visit_count,
           APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
    FROM pviews
    GROUP BY 1;
CREATE TABLE

-- We set the pview_summary table properties to use the same partitioning scheme as pviews.
-- The REORGANIZE keyword forces the repartitioning of the existing data to occur immediately.

ALTER TABLE pview_summary ALTER COLUMN "date" SET NOT NULL;
ALTER TABLE
ALTER TABLE pview_summary
    PARTITION BY "date"
    GROUP BY CALENDAR_HIERARCHY_DAY("date", 2, 2) REORGANIZE;
vsql:/home/ale/acd_ex4.sql:93: NOTICE 8364: The new partitioning scheme will produce partitions in 2 physical storage containers per projection
vsql:/home/ale/acd_ex4.sql:93: NOTICE 4785: Started background repartition table task
ALTER TABLE

-- To process new incoming data, we create ETL staging tables
-- with the same structure as pviews and pview_summary.

CREATE TABLE pviews_etl LIKE pviews INCLUDING PROJECTIONS;
CREATE TABLE
CREATE TABLE pview_summary_etl LIKE pview_summary INCLUDING PROJECTIONS;
CREATE TABLE

-- Load some new data into pviews_etl.

COPY pviews_etl FROM STDIN DELIMITER '|' DIRECT;
2019-03-03 11:00:01|1000|8
2019-03-03 11:00:02|1002|9
2019-03-03 11:00:03|1002|1
2019-03-03 11:00:04|1002|11
2019-03-03 11:00:05|1002|10
2019-03-03 11:00:06|1002|12
2019-03-03 11:00:07|1002|3
2019-03-03 11:00:08|1002|10
2019-03-03 11:00:09|1002|1
2019-03-03 11:00:12|1002|1
\.

-- Summarize the new data.

INSERT INTO pview_summary_etl
SELECT visit_time::DATE "date",
       COUNT(*) partial_visit_count,
       APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
FROM pviews_etl
GROUP BY 1;
 OUTPUT 
--------
      1
(1 row)

-- We should sanity-check all the data. Then we can append it to the main tables.

SELECT COPY_PARTITIONS_TO_TABLE('pviews_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pviews');
             COPY_PARTITIONS_TO_TABLE
--------------------------------------------------
 1 distinct partition values copied at epoch 62.
(1 row)

SELECT COPY_PARTITIONS_TO_TABLE('pview_summary_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pview_summary');
             COPY_PARTITIONS_TO_TABLE
--------------------------------------------------
 1 distinct partition values copied at epoch 63.
(1 row)

-- Clean up the ETL tables.

DROP TABLE pviews_etl CASCADE;
DROP TABLE
DROP TABLE pview_summary_etl CASCADE;
DROP TABLE

-- Compute visit counts and approximate distinct user counts by day for all the data, including the new batch.

SELECT "date", SUM(partial_visit_count) visit_count, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS daily_users_acd FROM pview_summary GROUP BY 1 ORDER BY 1; date | visit_count | daily_users_acd ------------+-------------+----------------- 2019-02-01 | 10 | 3 2019-02-02 | 10 | 4 2019-03-02 | 20 | 8 2019-03-03 | 20 | 11 (4 rows)

-- We can calculate an exact count, but on a really big data set this query would take a very long time.

SELECT visit_time::DATE "date", COUNT(*) visit_count, COUNT(DISTINCT user_id) AS daily_users FROM pviews GROUP BY 1 ORDER BY 1; date | visit_count | daily_users ------------+-------------+------------- 2019-02-01 | 10 | 3 2019-02-02 | 10 | 4 2019-03-02 | 20 | 8 2019-03-03 | 20 | 11 (4 rows)

-- Although we summarized daily, we can still create a report at monthly granularity.

-- Compute visit counts and approximate distinct user counts by month.

SELECT DATE_TRUNC('MONTH', "date")::DATE "month", SUM(partial_visit_count) visit_count, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS monthly_users_acd FROM pview_summary GROUP BY 1 ORDER BY 1; month | visit_count | monthly_users_acd ------------+-------------+------------------- 2019-02-01 | 20 | 4 2019-03-01 | 40 | 12 (2 rows)

-- Compute visit counts and exact distinct user counts by month, the expensive way.
-- Note that we need access to the entire data set for every day to do this.

SELECT DATE_TRUNC('MONTH', visit_time)::DATE "month",
       COUNT(*) visit_count,
       COUNT(DISTINCT user_id) AS monthly_users
FROM pviews
GROUP BY 1
ORDER BY 1;
   month    | visit_count | monthly_users
------------+-------------+---------------
 2019-02-01 |          20 |             4
 2019-03-01 |          40 |            12
(2 rows)

-- Merge daily synopses into monthly synopses.

DROP TABLE IF EXISTS pview_monthly_summary;
DROP TABLE
CREATE TABLE pview_monthly_summary AS
    SELECT DATE_TRUNC('MONTH', "date")::DATE "month",
           SUM(partial_visit_count) partial_visit_count,
           APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(daily_users_acdp) AS monthly_users_acdp
    FROM pview_summary
    GROUP BY 1
    ORDER BY 1;
CREATE TABLE

-- Compute visit counts and approximate distinct user counts by month, this time from the merged synopses.

SELECT month,
       SUM(partial_visit_count) monthly_visit_count,
       APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS monthly_users_acd
FROM pview_monthly_summary
GROUP BY 1
ORDER BY 1;
   month    | monthly_visit_count | monthly_users_acd
------------+---------------------+-------------------
 2019-02-01 |                  20 |                 4
 2019-03-01 |                  40 |                12
(2 rows)

-- We can also use the monthly summary to produce a yearly summary, which will be faster
-- than using a daily summary if we have a lot of data.

SELECT DATE_TRUNC('YEAR', "month")::DATE "year",
       SUM(partial_visit_count) yearly_visit_count,
       APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS yearly_users_acd
FROM pview_monthly_summary
GROUP BY 1
ORDER BY 1;
    year    | yearly_visit_count | yearly_users_acd
------------+--------------------+------------------
 2019-01-01 |                 60 |               12
(1 row)
