Vertica In-Database Approximate Count Distinct Functions Using LogLogBeta

Posted July 19, 2017 by Soniya Shah, Information Developer

City in Blur Motion
This blog post was authored by Ginger Ni.

Counting Distinct Values

Data cardinality is a commonly used statistic in data analysis. Vertica has the exact COUNT(DISTINCT) function to count distinct values in a data set, but the function does not scale well for extremely large data sets. When exploring large data sets, speed is critical. Sometimes, an estimation for the distinct count is good enough when people are looking for general patterns in data. Vertica 8.1.1 provides an upgraded version of the approximate count distinct (ACD) functions. These functions are orders of magnitude faster than the previous version of these functions and are much faster than the exact COUNT(DISTINCT). The 8.1.1 version of built-in ACD functions makes data analysis much quicker.

Vertica 8.1.1 has improved the following three functions without changing the user interface:
APPROXIMATE_COUNT_DISTINCT
APPROXIMATE_COUNT_DISTINCT_SYNOPSIS
APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS
The user interfaces of the ACD functions in Vertica 8.1.1 are the same as Vertica 8.1.

The following shows an example of the syntax you can use: -- Using Vertica VMart Example Database -- Count the total number of distinct values in column product_key from table store.store_sales_fact: SELECT COUNT(DISTINCT product_key) FROM store.store_sales_fact; COUNT ------- 19982 (1 row) -- -- Count the approximate number of distinct values in product_key with various error tolerances: SELECT APPROXIMATE_COUNT_DISTINCT(product_key) as unique_product_keys FROM store.store_sales_fact; -- with default error-tolerance = 1 unique_product_keys --------------------- 20014 (1 row) -- -- Or, you can create a synopsis object first, and then return a count from the saved synopsis -- -- Create a synopsis object that represents unique product_key values. It stores the synopsis in the new table my_summary CREATE TABLE my_summary AS SELECT APPROXIMATE_COUNT_DISTINCT_SYNOPSIS (product_key) syn FROM store.store_sales_fact; -- Get a count from the saved synopsis SELECT APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(syn) as unique_product_keys FROM my_summary; unique_product_keys --------------------- 20014 (1 row) -- -- Or, write the two steps in one query SELECT APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(syn) as unique_product_keys FROM (SELECT APPROXIMATE_COUNT_DISTINCT_SYNOPSIS (product_key) syn FROM store.store_sales_fact) my_summary; unique_product_keys --------------------- 20014 (1 row)

Accuracy of ACD Functions

The expected value that APPROXIMATE_COUNT_DISTINCT returns is equal to COUNT(DISTINCT), with an error that is lognormally distributed with standard deviations. You can control the standard deviation directly by setting the error-tolerance(in percentage). The smaller the error tolerance, the closer the approximation.

The following shows the accuracy and error tolerance (%): -- Count the total number of distinct values in column product_key from table store.store_sales_fact: SELECT COUNT(DISTINCT product_key) FROM store.store_sales_fact; COUNT ------- 19982 (1 row) -- -- default error-tolerance = 1(%) SELECT APPROXIMATE_COUNT_DISTINCT(product_key) as unique_product_keys FROM store.store_sales_fact; unique_product_keys --------------------- 20014 (1 row) -- abs(20014 - 19982) / 19982 = 0.16% < 1% -- -- error-tolerance = 3(%) SELECT APPROXIMATE_COUNT_DISTINCT(product_key, 3) as unique_product_keys FROM store.store_sales_fact; unique_product_keys --------------------- 20236 (1 row) -- abs(20236 - 19982) / 19982 = 1.27% < 3% -- -- error-tolerance = 5(%) SELECT APPROXIMATE_COUNT_DISTINCT(product_key, 5) as unique_product_keys FROM store.store_sales_fact; unique_product_keys --------------------- 20446 (1 row) -- abs(20446 - 19982) / 19982 = 2.32% < 5%

Using LogLogBeta with Binary Search Tree Sorting

The Vertica in-database ACD functions were upgraded based on a new approximate algorithm LogLogBeta proposed by Jason Qin and some Vertica SDK improvements in version 8.1.1.

Vertica engineers made several modifications to the algorithm to handle small data sets and small data groups. The APPROXIMATE_COUNT_DISTINCT and APPROXIMATE_COUNT_DISTINCT_SYNOPSIS functions start with a Binary Search Tree Sorting algorithm and switch to the LogLogBeta algorithm to handle large data sets. This avoids the big overhead that LogLogBeta incurs for small data. Most importantly, the LogLogBeta algorithm is implemented in a distributed fashion, which leverages the Vertica Massively Parallel Processing (MPP) architecture. This makes the APPROXIMATE_COUNT_DISTINCT and APPROXIMATE_COUNT_DISTINCT_SYNOPSIS functions very fast.

The following performance tests compare the performance of the Vertica 8.1.1 built-in ACD functions with the Vertica 8.1.0 built-in ACD functions. They also compare the performance of the Vertica 8.1.1 built-in ACD functions with its exact count distinct functions.

Performance Comparisons

The testing environment was set up as follows:
Vertica Cluster: 4 nodes
Machine: HP ProLiant DL380p Gen8/Gen9
Memory: 120 GB
CPU #: 32
OS: RHEL 7.0

Data set 1 consists of the following:
Name: allDataTypesCard1to100K_Table1
Row Count: 10 M rows
Columns: 66 columns with a combination of 11 different data types and 6 different cardinal values.
GROUP BY column name: dateC100K
Cardinality of the GROUP BY column: 100K

Data set 2 consists of the following:
Name: varBin_longStrTbl_99cols
Row Count: 1 M rows
Columns: 99 varBinary columns with long string values (5000 char long). Each row has a random string for var1. var2 thru var99 values default to var1.
GROUP BY column name: var10
Cardinality of the GROUP BY column: 1M

Note: The tests are running ACD on each of the 66 columns of data set 1 or on each of the 99 columns of Ddta set 2 simultaneously in the same query.

Exploring Big Data

The ACD function in Vertica 8.1.1 use a constant size of memory, and scale well on big data. The running time is linear to the dataset size.

Scenario 0: For various row counts, comparison of APPROXIMATE_COUNT_DISTINCT when used with GROUP BY statement for all 66 columns of data set 1.



Vertica 8.1.1 versus Vertica 8.1.0

Compared with the previous versions, the performance of the ACD functions in Vertica 8.1.1 has improved dramatically.

Scenario 1: 8.1.0 vs 8.1.1 APPROXIMATE_COUNT_DISTINCT for 66 columns of varying data types and cardinalities of data set 1 with a GROUP BY clause. The new APPROXIMATE_COUNT_DISTINCT function is more than 25 times faster.



Scenario 2: 8.1.0 vs 8.1.1 APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(ACDSyn) when used with GROUP BY statement for all 66 columns of data set 1. APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS in Vertica 8.1.1 is 67 times faster.



Scenario 3: 8.1.0 vs 8.1.1 APPROXIMATE_COUNT_DISTINCT for 99 long string varBinary columns of data set 2 with GROUP BY and varying error tolerance level. Performance improved dramatically for an error tolerance < 5%.



Approximate Count Distinct versus Exact Count Distinct

The COUNT(DISTINCT) function computes the exact number of distinct values in a data set. COUNT(DISTINCT) performs well when the data set is small or sorted. However, in most situations, APPROXIMATE_COUNT_DISTINCT performs better than COUNT(DISTINCT).

APPROXIMATE_COUNT_DISTINCT was designed to address the issue when the data set is large. The approximate algorithm uses a constant size of memory, and scales well on big data sets.

Use APPROXIMATE_COUNT_DISTINCT as a direct replacement for COUNT (DISTINCT) when:

• You have a large data set and you do not require an exact count of distinct values.
• The performance of COUNT(DISTINCT) on a given data set is insufficient.
• You calculate several distinct counts in the same query.

Use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS and APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS together, if you want to pre-compute the distinct counts and later combine them in different ways. Pass APPROXIMATE_COUNT_DISTINCT_SYNOPSIS the data set and a normally distributed confidence interval. The function returns a summary of the data, called a synopsis. Pass the synopsis to the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS function, which then performs an approximate count distinct on the synopsis.

Scenario 4: COUNT(DISTINCT)Vs APPROXIMATE_COUNT_DISTINCT for all 66 columns of varying data types with different cardinalities without GROUP BY using data set 1. ACD is more than 5 times faster.



Scenario 5: COUNT(DISTINCT)Vs APPROXIMATE_COUNT_DISTINCT for all 66 columns of varying data types with different cardinalities with GROUP BY using data set 1. ACD is more than 7 times faster.



Scenario 6: COUNT(DISTINCT)Vs APPROXIMATE_COUNT_DISTINCT for 99 long string varBinary columns with GROUP BY using data set 2. ACD is more than 11 times faster.

References

[1] Jason Qin, Denys Kim,YumeiTung, LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting, 2017.
[2] Criteo, Using Vertica and HyperLogLog, 2017.