Improving COUNT DISTINCT Performance with Live Aggregate Projections

Posted February 17, 2016 by HPBigData_BP

Programmer

Today’s organizations need to be able to measure the effectiveness of online ads and marketing campaigns. In your particular organization, you may want to measure how effectively your ads drive unique visitors to your website. Or you may want to see if your ads drive repeat visits from the same user over a specific period of time.

As a Vertica user, you probably know that you can use COUNT DISTINCT to gain these types of insights from your clickstream data. What you may not know is that you can improve the performance of COUNT DISTINCT queries by using live aggregate projections. By pre-aggregating the data in live aggregate projections, you can improve the performance of your COUNT DISTINCT queries.

The best way to learn how this works is, of course, an example. Read on… 

Querying Website Events

First, store data about each website visit in a table similar to the following table:
=> CREATE TABLE WebSiteEventVisits (
visit_id INT NOT NULL, -- Unique id assigned every visit
visitor_id INT,      -- Unique id assigned to each visitor
visit_time TIMESTAMPTZ(3), -- Time of the visit
campaign_id INT NOT NULL       -- Unique id for the ad campaign ) ;

To find out the number of unique visits by a specific user for the month of May 2005, try using the following query:
=> SELECT COUNT (DISTINCT visit_id) AS distinct_visits
FROM WebSiteEventVisits
WHERE visitor_id = 1
AND campaign_id = 12
AND visit_time >= '2005-05-01'::timestamptz(3)
AND visit_time < '2005-06-01'::timestamptz(3);
distinct_visits
-----------------
69

To see the number of unique visitors for the month of May 2005, use this query:
=> SELECT COUNT (DISTINCT visitor_id) AS distinct_visitors
FROM WebSiteEventVisits
WHERE campaign_id = 12
AND visit_time >= '2005-05-01'::timestamptz(3)
AND visit_time < '2005-06-01'::timestamptz(3);
distinct_visitors
------------------
100

Improve COUNT DISTINCT Performance

You gather a lot of data about your website visitors every day. You run these queries daily so you can see quickly which ad campaigns draw the most website visitors.

To help these queries run faster, you could run Database Designer, optimizing for these queries.

However, Vertica provides another option for optimizing COUNT DISTINCT queries that we mentioned earlier in this blog: live aggregate projections.

Create a Live Aggregate Projection

Create a live aggregate projection using the data in WebSiteEventVisits. Use a name that associates the live aggregate projection with WebSiteEventVisits, which Vertica calls the projection’s anchor table.
=> CREATE PROJECTION WebSiteEventVisits_AGG AS
SELECT
visitor_id,
campaign_id,
visit_id,
MAX (visit_time) MAX visit_time, -- Satisfies lower bound predicate (visit_time>='2005-05-01'::TIMESTAMPTZ(3))
MIN (visit_time) MIN_visit_time -- Satisfies upper bound predicate ( visit_time<'2005-06-01'::TIMESTAMPTZ(3))
FROM WebSiteEventVisits
GROUP BY
visitor_id,
campaign_id,
visit_id;

Don’t forget to refresh the table projections so that the live aggregate projection is up to date before you query its content:
=> SELECT REFRESH ('WebSiteEventVisits');
You’ll find that the aggregated data in the live aggregate projection results in a smaller number of rows:
=> SELECT COUNT(*) FROM WebSiteEventVisits;
count
--------
100000
=> SELECT COUNT(*) FROM WebSiteEventVisits_AGG;
count
-------
10000

Query the Live Aggregate Projection

Rewrite your queries to read the data from the live aggregate projection instead of from the anchor table. Notice that the results are the same as querying the anchor table, in this example, WebSiteEventVisits.

The number of unique visits by visitor_id = 1 for the month of May 2005:
=> SELECT COUNT (DISTINCT visit_id) AS distinct_visits
FROM WebSiteEventVisits_AGG
WHERE visitor_id = 1
AND campaign_id = 12
AND max_visit_time >= '2005-05-01'::timestamptz(3)
AND min_visit_time < '2005-06-01'::timestamptz(3);
distinct_visits
-----------------
69

The number of unique visitors for the month of May 2005:

=> SELECT COUNT (DISTINCT visitor_id) AS distinct_visitors
FROM WebSiteEventVisits_AGG
WHERE campaign_id = 12 AND max_visit_time >= '2005-05-01'::timestamptz(3)
AND min_visit_time < '2005-06-01'::timestamptz(3);
distinct_visitors
-------------------
100

If you are aggregating Vertica data using COUNT DISTINCT, consider creating live aggregate projections to pre-aggregate the data and help your queries run faster.

For More Information

You can find everything you need to know about COUNT DISTINCT and live aggregate projections in the Vertica documentation:

Note: If you do not need the precision of COUNT DISTINCT, you can use APPROXIMATE_COUNT_DISTINCT. Approximate Count Distinct Functions in the Vertica documentation describes when to use both capabilities.