Since the early days of data warehousing and the heyday of Ralph Kimball, data warehouse practitioners recognized the use of pre-computed aggregates to be “the single most effective tool the data warehouse designer has to improve performance” (footnote 1)., However, there was then, and continues to be, a gaping hole in the dimensional modelling approach concerning distinct aggregates, and in particular what to do about COUNT(DISTINCT x).

Let’s say that you want to count the number of distinct users who visit your website each day, each week, and each month. You can solve this problem using a pre-computed aggregate. However, the number of distinct users who visited each week cannot be computed from the number of distinct users who visited each day because some customers may have visited your web site on more than one day in the same week. Since you can’t roll a distinct aggregate up and you can’t incrementally maintain it, you’re pretty much stuck with computing what you need from the detail data.

Before HP Vertica, this had always been the Achille’s Heel of OLAP: operations worked 10x to 1000x faster if you had a pre-computed aggregate (which looks fantastic in a demo). If however, you asked a question that depended on an aggregate that you hadn’t pre-computed or that could not be rolled up, then you fell off the “OLAP cliff” and needed to compute the answer from the detail data, and that could take a long time. Query performance for OLAP queries was highly variable and appeared erratic.

HP Vertica’s columnar MPP database architecture changed most of that. By operating directly on a compressed columnar representation of the data, combined with the ability to massively parallelize the computation, most aggregations could be computed in real time without pre-materializing the aggregates. However, computing COUNT(DISTINCT p) could still be an expensive, memory-intensive operation, even on HP Vertica’s massively parallel architecture. Computing a distinct aggregate on Vertica’s MPP architecture can be broken into these phases:

- computing a partial aggregate per distinct group on each node
- redistributing/collecting the partial aggregates on the same node and aggregating again per distinct group
- sending the results to the initiator node.

HP Vertica has been highly optimized for computing COUNT DISTINCT, but in some cases the computation can still require a great deal of memory and data movement. Since COUNT DISTINCT can be expensive to compute and cannot be rolled up, some people have called this the “COUNT DISTINCT Pain”.

HP Vertica 7.0.0 introduces a new family of aggregate functions designed to alleviate this pain when exact results are not required:

- APPROXIMATE_COUNT_DISTINCT()
- APPROXIMATE_COUNT_DISTINCT_SYNOPSIS()
- APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS().

APPROXIMATE_COUNT_DISTINCT(x) is the direct equivalent of COUNT(DISTINCT x), and by default, has an accuracy within 1% of the value returned by COUNT(DISTINCT x), 97% of the time. You can specify that you require more or less accuracy than the default with an optional second argument. Whereas COUNT(DISTINCT x) requires a relatively large amount of memory per aggregate, APPROXIMATE_COUNT_DISTINCT(x) requires only 1500 bytes of memory per aggregate to achieve either:

- 5% accuracy 97% of the time (typically within 2%), or
- 50K bytes of memory per aggregate for 1% accuracy 97% of the time.

Furthermore, there is no need to make the partial aggregates distinct before sending the data to the initiator node, as required by COUNT(DISTINCT x).

In a performance experiment on a 1Tb TPC-H dataset, using a single-node developer’s laptop, performing an ungrouped COUNT DISTINCT on (l_orderkey, l_partkey) required 211 seconds. Using APPROXIMATE_COUNT_DISTINCT, the same computation took just 4.02 seconds, a factor of 52 times faster. In other cases where the number of distinct values was small, COUNT DISTINCT and APPROXIMATE_COUNT_DISTINCT were equally fast.And, in some cases where HP Vertica’s COUNT DISTINCT optimizations kick in, COUNT DISTINCT can be faster. So, while your mileage may vary, you should note that there are cases where APPROXIMATE_COUNT_DISTINCT is clearly a lot faster.

But it gets better, because unlike COUNT(DISTINCT x), APPROXIMATE_COUNT_DISTINCT rolls up in the same way SUM(x) and COUNT(x) roll up. By materializing the internal “synopsis” used by APPROXIMATE_COUNT_DISTINCT, you can roll it up later to preserve the full accuracy of APPROXIMATE_COUNT_DISTINCT(). On the same 1Tb TPC-H dataset, precomputing APPROXIMATE_COUNT_DISTINCT_SYNOPSIS on (l_orderkey, l_partkey) and grouping by a low cardinality column, and materializing the result with CREATE TABLE AS SELECT took about 30 seconds. Rollingprecomputed aggregate up with APPROXIMATE_COUNT_DISTINT_OF_SYNOPSIS() took just 64.7 milliseconds, more than 3200x faster than running COUNT DISTINCT against the detail data.

To illustrate, let’s suppose that you’re a political insider using the Pulse innovation package from HP Vertica. HP Vertica Pulse enables you to analyze the sentiment expressed in a tweet. You want to be notified, in real-time, when the number of distinct persons who post a tweet with a negative sentiment about one of several political topics exceeds N in any 1-week period. Instead of constantly running COUNT DISTINCT on the most recent weeks’ worth of tweets, you could compute and save an APPROXIMATE_COUNT_DISTINCT synopsis once per hour, and then run a relatively fast query that combines the pre-materialized synopses with a real-time synopsis computed from the most recent partial hour. Remember that this would not work with a regular COUNT DISTINCT because, if any individuals posted multiple tweets in the same week, they would be counted multiple times. The remarkable thing is that double counting will not occur with aggregating APPROXIMATE_COUNT_DISTINCT synopses. To allow for the possibility of a false-negative signal, you could adjust the alert threshold downward, and if triggered, compute an exact COUNT DISTINCT. However, in this case, the accuracy of APPROXIMATE_COUNT_DISTINCT is much higher than the accuracy of the sentiment classifications, so the measure of interest is intrinsically subjective and approximate anyway.

To compute and save an approximate count distinct synopsis, use the APPROXIMATE_COUNT_DISTINCT_SYNOPSIS() grouping function. To roll up a set of pre-computed synopses, use the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS() grouping function. That’s all there is to it.

At the time of this writing, HP Vertica is the first and only SQL database to provide approximate count distinct with user-controllable accuracy and to support the rollup of approximate count distinct aggregates. Using the HP Vertica MPP column store, you can avert the OLAP cliff, and using pre-computed synopses, you can avoid the COUNT DISTINCT pain. For the first time since the dawn of data warehousing, you can compute and incrementally maintain pre-computed aggregates for count distinct with controllable accuracy, and roll these aggregates up in an OLAP framework.

———–

(1) – “The Data Warehouse Toolkit”, Ralph Kimball, pg 190.