GROUP BY Implementation Options
Vertica implements a query GROUP BY
clause with one of these algorithms: GROUPBY PIPELINED
or GROUPBY HASH
. Both algorithms return the same results. Performance of both is generally similar for queries that return a small number of distinct groups—typically a thousand per node .
You can use EXPLAIN
to determine which algorithm the query optimizer chooses for a given query. The following conditions generally determine which algorithm is chosen:
-
GROUPBY PIPELINED
requires allGROUP BY
data to be specified in the projection'sORDER BY
clause. For details, see GROUPBY PIPELINED Requirements below.Because
GROUPBY PIPELINED
only needs to retain in memory the current group data, this algorithm generally requires less memory and executes faster thanGROUPBY HASH
. Performance improvements are especially notable for queries that aggregate large numbers of distinct groups. GROUPBY HASH
is used for any query that does not comply withGROUP BY PIPELINED
sort order requirements. In this case, Vertica must build a hash table onGROUP BY
columns before it can start grouping the data.
GROUPBY PIPELINED Requirements
You can enable use of the GROUP BY PIPELINED
algorithm by ensuring that the query and one of its projections comply with GROUP BY PIPELINED
requirements. The following conditions apply to GROUPBY PIPELINED
. If none of them is true for the query, then Vertica uses GROUPBY HASH
.
All examples that follow assume this schema:
CREATE TABLE sortopt (
a INT NOT NULL,
b INT NOT NULL,
c INT,
d INT
);
CREATE PROJECTION sortopt_p (
a_proj,
b_proj,
c_proj,
d_proj )
AS SELECT * FROM sortopt
ORDER BY a,b,c
UNSEGMENTED ALL NODES;
INSERT INTO sortopt VALUES(5,2,13,84);
INSERT INTO sortopt VALUES(14,22,8,115);
INSERT INTO sortopt VALUES(79,9,401,33);
Condition 1
All GROUP BY
columns are also included in the projection ORDER BY
clause.
For example:
GROUP BY columns | GROUPBY algorithm | Reason chosen |
---|---|---|
a
a,b
b,a
a,b,c
c,a,b
|
|
Columns a , b , and c are included in the projection sort columns. |
|
|
Column d is not part of the projection sort columns. |
Condition 2
If the query's GROUP BY
clause has fewer columns than the projection's ORDER BY
clause, the GROUP BY
columns must:
- Be a subset of
ORDER BY
columns that are contiguous. - Include the first
ORDER BY
column.
For example:
GROUP BY columns | GROUPBY algorithm | Reason chosen |
---|---|---|
a
a,b
b,a
|
|
All GROUP BY columns are a subset of contiguous columns in the projection's ORDER BY clause {a,b,c} , and include column a . |
|
|
GROUP BY columns {a,c} are not contiguous in the projection ORDER BY clause {a,b,c} . |
b,c
|
GROUPBY HASH
|
GROUP BY columns {b,c } do not include the projection's first ORDER BY column a . |
Condition 3
If a query's GROUP BY
columns do not appear first in the projection's ORDER BY
clause, then any early-appearing projection sort columns that are missing in the query's GROUP BY
clause must be present as single-column constant equality predicates in the query's WHERE
clause.
For example:
Query | GROUPBY algorithm | Reason chosen |
---|---|---|
SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY b |
|
All columns preceding b in the projection sort order appear as constant equality predicates. |
SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY a, b |
|
Grouping column a is redundant but has no effect on algorithm selection. |
SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY b, c |
|
All columns preceding b and c in the projection sort order appear as constant equality predicates. |
SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY c, b |
|
All columns preceding b and c in the projection sort order appear as constant equality predicates. |
SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY c |
|
All columns preceding c in the projection sort order do not appear as constant equality predicates. |
Controlling GROUPBY Algorithm Choice
It is generally best to allow Vertica to determine which GROUP BY
algorithm is best suited for a given query. Occasionally, you might want to use one algorithm over another. In such cases, you can qualify the GROUP BY
clause with a GBYTYPE
hint:
GROUP BY /*+ GBYTYPE( HASH | PIPE ) */
For example, given the following query, the query optimizer uses the GROUPBY PIPELINED
algorithm:
=> EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY a,b;
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------
EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY a,b;
Access Path:
+-GROUPBY PIPELINED [Cost: 11, Rows: 3 (NO STATISTICS)] (PATH ID: 1)
| Aggregates: sum(sortopt.a)
| Group By: sortopt.a, sortopt.b
...
You can use the GBYTYPE
hint to force the query optimizer to use the GROUPBY HASH
algorithm instead:
=> EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(HASH) */ a,b; ------------------------------ QUERY PLAN DESCRIPTION: ------------------------------ EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(HASH) */ a,b; Access Path: +-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 11, Rows: 3 (NO STATISTICS)] (PATH ID: 1) | Aggregates: sum(sortopt.a) | Group By: sortopt.a, sortopt.b ...
The GBYTYPE
hint can specify a PIPE
(GROUPBY PIPELINED
algorithm) argument only if the query and one of its projections comply with GROUP BY PIPELINED
requirements. Otherwise, Vertica issues a warning and uses GROUPBY HASH
.
For example, the following query cannot use the GROUPBY PIPELINED
algorithm, as the GROUP BY
columns {b,c
} do not include the projection's first ORDER BY
column a
:
=> SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(PIPE) */ b,c; WARNING 7765: Cannot apply Group By Pipe algorithm. Proceeding with Group By Hash and hint will be ignored SUM ----- 79 14 5 (3 rows)