GROUP BY Implementation Options

Vertica implements a query GROUP BY clause with one of these algorithms: GROUPBY PIPELINED or GROUPBY HASH. Both algorithms return the same results. Performance of both is generally similar for queries that return a small number of distinct groups—typically a thousand per node .

You can use EXPLAIN to determine which algorithm the query optimizer chooses for a given query. The following conditions generally determine which algorithm is chosen:

  • GROUPBY PIPELINED requires all GROUP BY data to be specified in the projection's ORDER BY clause. For details, see GROUPBY PIPELINED Requirements below.

    Because GROUPBY PIPELINED only needs to retain in memory the current group data, this algorithm generally requires less memory and executes faster than GROUPBY HASH. Performance improvements are especially notable for queries that aggregate large numbers of distinct groups.

  • GROUPBY HASH is used for any query that does not comply with GROUP BY PIPELINED sort order requirements. In this case, Vertica must build a hash table on GROUP BY columns before it can start grouping the data.

GROUPBY PIPELINED Requirements

You can enable use of the GROUP BY PIPELINED algorithm by ensuring that the query and one of its projections comply with GROUP BY PIPELINED requirements. The following conditions apply to GROUPBY PIPELINED. If none of them is true for the query, then Vertica uses GROUPBY HASH.

All examples that follow assume this schema:

CREATE TABLE sortopt (
    a INT NOT NULL, 
    b INT NOT NULL,
    c INT,
    d INT
);
CREATE PROJECTION sortopt_p (
   a_proj,
   b_proj,
   c_proj,
   d_proj )
AS SELECT * FROM sortopt
ORDER BY a,b,c 
UNSEGMENTED ALL NODES;
INSERT INTO sortopt VALUES(5,2,13,84);
INSERT INTO sortopt VALUES(14,22,8,115);
INSERT INTO sortopt VALUES(79,9,401,33);

Condition 1

All GROUP BY columns are also included in the projection ORDER BY clause.

For example:

GROUP BY columns GROUPBY algorithm Reason chosen
a
a,b
b,a
a,b,c
c,a,b

GROUPBY PIPELINED

Columns a, b, and c are included in the projection sort columns.
a,b,c,d

GROUPBY HASH

Column d is not part of the projection sort columns.

Condition 2

If the query's GROUP BY clause has fewer columns than the projection's ORDER BY clause, the GROUP BY columns must:

  • Be a subset of ORDER BY columns that are contiguous.
  • Include the first ORDER BY column.

For example:

GROUP BY columns GROUPBY algorithm Reason chosen
a
a,b
b,a

GROUPBY PIPELINED

All GROUP BY columns are a subset of contiguous columns in the projection's ORDER BY clause {a,b,c}, and include column a.
a,c

GROUPBY HASH

GROUP BY columns {a,c} are not contiguous in the projection ORDER BY clause {a,b,c}.
b,c GROUPBY HASH GROUP BY columns {b,c} do not include the projection's first ORDER BY column a.

Condition 3

If a query's GROUP BY columns do not appear first in the projection's ORDER BY clause, then any early-appearing projection sort columns that are missing in the query's GROUP BY clause must be present as single-column constant equality predicates in the query's WHERE clause.

For example:

Query GROUPBY algorithm Reason chosen
SELECT SUM(a) FROM sortopt
  WHERE a = 10 GROUP BY b

GROUPBY PIPELINED

All columns preceding b in the projection sort order appear as constant equality predicates.
SELECT SUM(a) FROM sortopt
   WHERE a = 10 GROUP BY a, b

GROUPBY PIPELINED

Grouping column a is redundant but has no effect on algorithm selection.
SELECT SUM(a) FROM sortopt
   WHERE a = 10 GROUP BY b, c 

GROUPBY PIPELINED

All columns preceding b and c in the projection sort order appear as constant equality predicates.
SELECT SUM(a) FROM sortopt
   WHERE a = 10 GROUP BY c, b

GROUPBY PIPELINED

All columns preceding b and c in the projection sort order appear as constant equality predicates.
SELECT SUM(a) FROM sortopt
   WHERE a = 10 GROUP BY c

GROUPBY HASH

All columns preceding c in the projection sort order do not appear as constant equality predicates.

Controlling GROUPBY Algorithm Choice

It is generally best to allow Vertica to determine which GROUP BY algorithm is best suited for a given query. Occasionally, you might want to use one algorithm over another. In such cases, you can qualify the GROUP BY clause with a GBYTYPE hint:

GROUP BY /*+ GBYTYPE( HASH | PIPE ) */

For example, given the following query, the query optimizer uses the GROUPBY PIPELINED algorithm:

=> EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY a,b;
 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY a,b;

 Access Path:
 +-GROUPBY PIPELINED [Cost: 11, Rows: 3 (NO STATISTICS)] (PATH ID: 1)
 |  Aggregates: sum(sortopt.a)
 |  Group By: sortopt.a, sortopt.b

...

You can use the GBYTYPE hint to force the query optimizer to use the GROUPBY HASH algorithm instead:

=> EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(HASH) */ a,b;
 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(HASH) */ a,b;

 Access Path:
 +-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 11, Rows: 3 (NO STATISTICS)] (PATH ID: 1)
 |  Aggregates: sum(sortopt.a)
 |  Group By: sortopt.a, sortopt.b

...

The GBYTYPE hint can specify a PIPE (GROUPBY PIPELINED algorithm) argument only if the query and one of its projections comply with GROUP BY PIPELINED requirements. Otherwise, Vertica issues a warning and uses GROUPBY HASH.

For example, the following query cannot use the GROUPBY PIPELINED algorithm, as the GROUP BY columns {b,c} do not include the projection's first ORDER BY column a:

=> SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(PIPE) */ b,c;
WARNING 7765:  Cannot apply Group By Pipe algorithm. Proceeding with Group By Hash and hint will be ignored
 SUM
-----
  79
  14
   5
(3 rows)