Approximate Count Distinct Functions

Vertica provides the COUNT(DISTINCT) function to compute the exact number of distinct values in a data set. If projections are available that allow COUNT(DISTINCT) to execute using the GROUPBY PIPELINED algorithm, COUNT(DISTINCT) performs well. In some situations, however, using APPROXIMATE_COUNT_DISTINCT performs better than COUNT(DISTINCT).

A COUNT [Aggregate] operation performs well when:

  • One of the sorted projections delivers an order that enables sorted aggregation to be performed.
  • The number of distinct values is fairly small.
  • Hashed aggregation is not required to execute the query.

When an approximate value will suffice or the values need to be rolled up, consider using the APPROXIMATE_COUNT_DISTINCT* functions.

The APPROXIMATE_COUNT_DISTINCT* functions cannot appear in the same query block as DISTINCT aggregates.
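For example, a query along these lines (using the pviews table from the example below) is rejected, because it mixes an approximate aggregate with an exact DISTINCT aggregate in one query block:

-- Invalid: APPROXIMATE_COUNT_DISTINCT cannot appear with COUNT(DISTINCT)
SELECT APPROXIMATE_COUNT_DISTINCT(user_id),
       COUNT(DISTINCT page_id)
FROM pviews;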

Use Cases

Use APPROXIMATE_COUNT_DISTINCT as a direct replacement for COUNT(DISTINCT) when:

  • You have a large data set and you do not require an exact count of distinct values.
  • The performance of COUNT(DISTINCT) on a given data set is insufficient.
  • You calculate several distinct counts in the same query.
  • The plan for COUNT(DISTINCT) uses hashed aggregation.

In most cases, APPROXIMATE_COUNT_DISTINCT executes faster than a comparable COUNT(DISTINCT) operation that must use hashed aggregation.
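One way to check which algorithm a plan uses is to run EXPLAIN on the query and look for GROUPBY HASH rather than GROUPBY PIPELINED in the plan output. A minimal sketch, assuming the pviews table created in the example below:

-- Inspect the plan; GROUPBY HASH indicates hashed aggregation
EXPLAIN SELECT COUNT(DISTINCT user_id) FROM pviews;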

The expected value that APPROXIMATE_COUNT_DISTINCT returns is equal to COUNT(DISTINCT), with an error that is lognormally distributed with standard deviation s. You can control the standard deviation directly by setting the error_tolerance parameter.
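For instance, assuming the pviews table created in the example below, the following sketch compares the exact count with approximate counts at the default error tolerance (1.25 percent) and at a caller-specified tolerance:

SELECT COUNT(DISTINCT user_id) FROM pviews;                -- exact count
SELECT APPROXIMATE_COUNT_DISTINCT(user_id) FROM pviews;    -- default error tolerance
SELECT APPROXIMATE_COUNT_DISTINCT(user_id, 5) FROM pviews; -- 5 percent error tolerance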

Use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS and APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS together when all of the following are true:

  • You have a large data set and you do not require an exact count of distinct values.
  • The performance of COUNT(DISTINCT) on a given data set is insufficient.
  • You want to pre-compute the distinct counts and later combine them in different ways.

Pass APPROXIMATE_COUNT_DISTINCT_SYNOPSIS the data set and, optionally, an error tolerance. The function returns a binary object called a synopsis, which summarizes the distinct values in the data.

Pass the synopsis to the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS function, which then performs an approximate count distinct on the synopsis.
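A minimal sketch of the two-step pattern, using hypothetical table and column names (t, x):

-- Step 1: pre-compute and store a synopsis of column x
CREATE TABLE t_summary AS
    SELECT APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(x) AS syn FROM t;

-- Step 2: later, compute the approximate distinct count from the stored synopsis
SELECT APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(syn) FROM t_summary;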

COUNT(DISTINCT) performs better on small groups of data; APPROXIMATE_COUNT_DISTINCT performs better on larger groups of data.

Use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE to merge multiple synopses into one synopsis. This allows you to continually update a "master" synopsis by merging in one or more synopses that cover more recent, shorter periods of time.
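For instance, if a hypothetical table daily_summary holds one synopsis per day in a column users_syn, a sketch like the following rolls them up into a single synopsis that can be stored as the new master:

-- Merge many per-day synopses into one combined synopsis
SELECT APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(users_syn) AS master_syn
FROM daily_summary;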

Example

This example shows how you can use the Approximate Count Distinct functions, along with GROUP BY, as an efficient way to keep a running approximate count of how many distinct users have clicked a given web page within a given time period.

First, we create a distributed table called pviews to store data about visits to a website. Its three columns record the time of each visit, the web page visited, and the user who visited it: visit_time, page_id, and user_id.

The table is segmented by the hash of the user_id column. This ensures that all visits by a given user are stored on the same segment, on the same node, which prevents inefficient cross-node transfer of data when we later perform a COUNT(DISTINCT user_id).

The table also uses hierarchical partitioning by time of visit to optimize ROS storage, which improves performance when we filter data by time.

Later in the example, we use the APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE function to merge the daily synopses into monthly synopses.

DROP TABLE IF EXISTS pviews;
DROP TABLE
CREATE TABLE pviews(
    visit_time TIMESTAMP NOT NULL,
    page_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL)
    ORDER BY page_id, visit_time
    SEGMENTED BY HASH(user_id) ALL NODES KSAFE
    PARTITION BY visit_time::DATE
    GROUP BY CALENDAR_HIERARCHY_DAY(visit_time::DATE, 2, 2);

-- HINT: Use mark_design_ksafe() to set system KSAFE level

CREATE TABLE

-- Load some data into pviews

COPY pviews FROM STDIN DELIMITER '|' DIRECT;
2019-02-01 10:00:01|1000|1
2019-02-01 10:00:02|1002|1
2019-02-01 10:00:03|1002|2
2019-02-01 10:00:04|1002|1
2019-02-01 10:00:05|1002|3
2019-02-01 10:00:06|1002|1
2019-02-01 10:00:07|1002|3
2019-02-01 10:00:08|1002|1
2019-02-01 10:00:09|1002|3
2019-02-01 10:00:12|1002|2
2019-02-02 10:00:01|1000|1
2019-02-02 10:00:02|1002|4
2019-02-02 10:00:03|1002|2
2019-02-02 10:00:04|1002|1
2019-02-02 10:00:05|1002|3
2019-02-02 10:00:06|1002|4
2019-02-02 10:00:07|1002|3
2019-02-02 10:00:08|1002|4
2019-02-02 10:00:09|1002|3
2019-02-02 10:00:12|1002|2
2019-03-02 10:00:01|1000|1
2019-03-02 10:00:02|1002|1
2019-03-02 10:00:03|1002|2
2019-03-02 10:00:04|1002|1
2019-03-02 10:00:05|1002|3
2019-03-02 10:00:06|1002|4
2019-03-02 10:00:07|1002|3
2019-03-02 10:00:08|1002|6
2019-03-02 10:00:09|1002|5
2019-03-02 10:00:12|1002|2
2019-03-02 11:00:01|1000|5
2019-03-02 11:00:02|1002|6
2019-03-02 11:00:03|1002|7
2019-03-02 11:00:04|1002|4
2019-03-02 11:00:05|1002|1
2019-03-02 11:00:06|1002|6
2019-03-02 11:00:07|1002|8
2019-03-02 11:00:08|1002|6
2019-03-02 11:00:09|1002|7
2019-03-02 11:00:12|1002|1
2019-03-03 10:00:01|1000|1
2019-03-03 10:00:02|1002|2
2019-03-03 10:00:03|1002|4
2019-03-03 10:00:04|1002|1
2019-03-03 10:00:05|1002|2
2019-03-03 10:00:06|1002|6
2019-03-03 10:00:07|1002|9
2019-03-03 10:00:08|1002|10
2019-03-03 10:00:09|1002|7
2019-03-03 10:00:12|1002|1
\.

-- Now create a summary table, pview_summary, where each row summarizes
-- activities for one date. The first column is the date to summarize.
-- The second column, partial_visit_count, stores the number of rows in
-- the pviews table that belong to that date. In the third column,
-- daily_users_acdp, we use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS to
-- construct a synopsis that encodes the approximate number of distinct
-- users for that date, generated from the user_id values in the rows
-- for that date.

DROP TABLE IF EXISTS pview_summary;
DROP TABLE
CREATE TABLE pview_summary AS
    SELECT visit_time::DATE "date",
           COUNT(*) partial_visit_count,
           APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
    FROM pviews
    GROUP BY 1;
CREATE TABLE

-- We set the pview_summary table properties to use the same partitioning scheme as pviews.
-- The REORGANIZE keyword forces the repartitioning of the existing data to occur immediately.

ALTER TABLE pview_summary ALTER COLUMN "date" SET NOT NULL;
ALTER TABLE
ALTER TABLE pview_summary
    PARTITION BY "date"
    GROUP BY CALENDAR_HIERARCHY_DAY("date", 2, 2) REORGANIZE;
vsql:/home/ale/acd_ex4.sql:93: NOTICE 8364: The new partitioning scheme will produce partitions in 2 physical storage containers per projection
vsql:/home/ale/acd_ex4.sql:93: NOTICE 4785: Started background repartition table task
ALTER TABLE

-- To process new incoming data, we create ETL staging tables
-- with the same structure as pviews and pview_summary.

CREATE TABLE pviews_etl LIKE pviews INCLUDING PROJECTIONS;
CREATE TABLE
CREATE TABLE pview_summary_etl LIKE pview_summary INCLUDING PROJECTIONS;
CREATE TABLE

-- Load some new data into pviews_etl.

COPY pviews_etl FROM STDIN DELIMITER '|' DIRECT;
2019-03-03 11:00:01|1000|8
2019-03-03 11:00:02|1002|9
2019-03-03 11:00:03|1002|1
2019-03-03 11:00:04|1002|11
2019-03-03 11:00:05|1002|10
2019-03-03 11:00:06|1002|12
2019-03-03 11:00:07|1002|3
2019-03-03 11:00:08|1002|10
2019-03-03 11:00:09|1002|1
2019-03-03 11:00:12|1002|1
\.

-- Summarize the new data.

INSERT INTO pview_summary_etl
SELECT visit_time::DATE "date",
       COUNT(*) partial_visit_count,
       APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
FROM pviews_etl
GROUP BY 1;
 OUTPUT 
--------
      1
(1 row)

-- We should sanity-check all the data. Then we can append it to the main tables.

SELECT COPY_PARTITIONS_TO_TABLE('pviews_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pviews');
             COPY_PARTITIONS_TO_TABLE
--------------------------------------------------
 1 distinct partition values copied at epoch 62.
(1 row)

SELECT COPY_PARTITIONS_TO_TABLE('pview_summary_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pview_summary');
             COPY_PARTITIONS_TO_TABLE
--------------------------------------------------
 1 distinct partition values copied at epoch 63.
(1 row)

-- Clean up the ETL tables.

DROP TABLE pviews_etl CASCADE;
DROP TABLE
DROP TABLE pview_summary_etl CASCADE;
DROP TABLE

-- Compute visit counts and approximate distinct user counts by day for all the data, including the new batch.

SELECT "date", SUM(partial_visit_count) visit_count, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS daily_users_acd FROM pview_summary GROUP BY 1 ORDER BY 1; date | visit_count | daily_users_acd ------------+-------------+----------------- 2019-02-01 | 10 | 3 2019-02-02 | 10 | 4 2019-03-02 | 20 | 8 2019-03-03 | 20 | 11 (4 rows)

-- We can calculate an exact count, but on a really big data set this query would take a very long time.

SELECT visit_time::DATE "date", COUNT(*) visit_count, COUNT(DISTINCT user_id) AS daily_users FROM pviews GROUP BY 1 ORDER BY 1; date | visit_count | daily_users ------------+-------------+------------- 2019-02-01 | 10 | 3 2019-02-02 | 10 | 4 2019-03-02 | 20 | 8 2019-03-03 | 20 | 11 (4 rows)

-- Although we summarized daily, we can still create a report at monthly granularity.

-- Compute visit counts and approximate distinct user counts by month.

SELECT DATE_TRUNC('MONTH', "date")::DATE "month", SUM(partial_visit_count) visit_count, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS monthly_users_acd FROM pview_summary GROUP BY 1 ORDER BY 1; month | visit_count | monthly_users_acd ------------+-------------+------------------- 2019-02-01 | 20 | 4 2019-03-01 | 40 | 12 (2 rows)

-- Compute visit counts and exact distinct user counts by month, the expensive way.
-- Note that we need access to the entire data set for every day to do this.

SELECT DATE_TRUNC('MONTH', visit_time)::DATE "month",
       COUNT(*) visit_count,
       COUNT(DISTINCT user_id) AS monthly_users
FROM pviews
GROUP BY 1
ORDER BY 1;
   month    | visit_count | monthly_users
------------+-------------+---------------
 2019-02-01 |          20 |             4
 2019-03-01 |          40 |            12
(2 rows)

-- Merge daily synopses into monthly synopses.

DROP TABLE IF EXISTS pview_monthly_summary;
DROP TABLE
CREATE TABLE pview_monthly_summary AS
    SELECT DATE_TRUNC('MONTH', "date")::DATE "month",
           SUM(partial_visit_count) partial_visit_count,
           APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(daily_users_acdp) AS monthly_users_acdp
    FROM pview_summary
    GROUP BY 1
    ORDER BY 1;
CREATE TABLE

-- Compute visit counts and approximate distinct user counts by month, this time from the merged synopses.

SELECT month,
       SUM(partial_visit_count) monthly_visit_count,
       APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS monthly_users_acd
FROM pview_monthly_summary
GROUP BY 1
ORDER BY 1;
   month    | monthly_visit_count | monthly_users_acd
------------+---------------------+-------------------
 2019-02-01 |                  20 |                 4
 2019-03-01 |                  40 |                12
(2 rows)

-- We can also use the monthly summary to produce a yearly summary, which will be faster
-- than using a daily summary if we have a lot of data.

SELECT DATE_TRUNC('YEAR', "month")::DATE "year",
       SUM(partial_visit_count) yearly_visit_count,
       APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS yearly_users_acd
FROM pview_monthly_summary
GROUP BY 1
ORDER BY 1;
    year    | yearly_visit_count | yearly_users_acd
------------+--------------------+------------------
 2019-01-01 |                 60 |               12
(1 row)
