Analytic Queries in Vertica

Posted October 2, 2017 by Soniya Shah, Information Developer

This blog post was authored by Soniya Shah.

Analytic functions handle complex analysis and reporting tasks. Here are some example use cases for Vertica analytic functions:

• Rank the longest standing customers in a particular state
• Calculate the moving average of retail volume over a specific time
• Find the highest score among all students in the same class

For a full list of Vertica analytic functions, see Analytic Functions in the Vertica documentation.

Analytic vs. Aggregate Functions

Some analytic functions aggregate data, so it’s important to distinguish between analytic functions that perform aggregations and aggregate functions. The similarities and differences are summarized below:

Analytic Functions Aggregate Functions
Return the same number of rows as the input Return a single summary value
The groups of rows on which an analytic function operate are defined by window partitioning and window frame clauses. The groups of rows on which aggregate functions operate are defined by the SQL GROUP BY clause.

Working with the Window Clauses

In an analytic function’s OVER clause, you can specify a window (or group of rows) to analyze.

The window partition clause (PARTITION BY) groups input rows before the function processes them. Window partitioning is similar to the GROUP BY clause, except that it returns only one result row per input row. If you omit the window partition clause, all input rows are treated as a single partition: …OVER(PARTITION BY expression) The window order clause (ORDER BY) specifies how to sort rows that are supplied to the analytic function. The clause only specifies order within a window result set. The query can have its own ORDER BY clause outside the OVER clause. The additional OVER BY clause has precedence over the window order clause and orders the final result set: …OVER(ORDER BY expression [ASC | DESC]) The window frame of an analytic function comprises a set of rows relative to the row that is currently being evaluated. After the function processes that row and its window frame, Vertica advances the current row and adjusts the frame boundaries accordingly. If the OVER clause also specifies a partition, Vertica also checks that the frame boundaries do not cross partition boundaries. This process repeats until the function evaluates the last row of the last partition. …OVER ( {ROWS | RANGE} {BETWEEN start-point AND end-point} | start-point)

Function Execution

Vertica executes an analytic function as follows:

1. Takes the input rows that the query returns after it performs all joins and evaluates FROM, WHERE, GROUP BY, and HAVING clauses.
2. Groups the input rows according to the window partition clause (PARTITION BY). If this clause is omitted, all input rows are treated as a single partition.
3. Sorts rows within each partition according to the window order clause (ORDER BY).
4. If the OVER clause includes a window order clause, the function checks for a window frame clause and executes it as it processes each input row. If the OVER clause omits a window frame clause, the function treats the entire partition as a window frame.

Examples

The following examples use the Vertica VMart example database that ships with all Vertica versions.

Within each window partition, the RANK analytic function ranks all rows in the query results set according to the order specified in the window’s ORDER BY clause. For example, let’s say we want to rank the longest-standing customers in a particular state. This example uses the customer_dimension table in the VMart sample database to rank by state all companies that have been customers since 2007: => SELECT customer_state, customer_name, customer_since, RANK() OVER(PARTITION BY customer_state ORDER BY customer_since) AS rank FROM customer_dimension WHERE customer_type='Company' AND customer_since > '01/01/2007' ORDER BY customer_state LIMIT 10; customer_state | customer_name | customer_since | rank ---------------+---------------+----------------+------ AZ | Foodshop | 2007-01-20 | 1 AZ | Goldstar | 2007-08-11 | 2 CA | Metahope | 2007-01-05 | 1 CA | Foodgen | 2007-02-05 | 2 CA | Infohope | 2007-02-09 | 3 CA | Foodcom | 2007-02-19 | 4 CA | Amerihope | 2007-02-22 | 5 CA | Infostar | 2007-03-05 | 6 CA | Intracare | 2007-03-14 | 7 CA | Infocare | 2007-04-07 | 8 (10 rows) The AVG analytic function computes an average of an expression in a group within a window. For example, the following query uses various tables in the VMart database to find the sales for that calendar month and returns a running average using the default window of RANGE UNBOUNDED PRECEDING AND CURRENT ROW. => SELECT calendar_month_number_in_year, SUM(product_price) AS sales, AVG(SUM(product_price)) OVER (ORDER BY calendar_month_number_in_year) as average_sales FROM product_dimension, date_dimension, inventory_fact WHERE date_dimension.date_key = inventory_fact.date_key AND product_dimension.product_key = inventory_fact.product_key GROUP BY calendar_month_number_in_year; calendar_month_number_in_year | sales | average_sales ------------------------------+----------+------------------ 1 | 23869547 | 23869547 2 | 19604661 | 21737104 3 | 22877913 | 22117373.6666667 4 | 22901263 | 22313346 5 | 23670676 | 22584812 6 | 22507600 | 22571943.3333333 7 | 21514089 | 22420821.2857143 8 | 24860684 | 22725804.125 9 | 21687795 | 22610469.7777778 10 | 23648921 | 22714314.9 11 | 21115910 | 22569005.3636364 12 | 24708317 | 22747281.3333333 (12 rows) The MAX analytic function returns the maximum value of an expression within a window. For example, the following query finds the maximum employee salary in Massachusetts and then computes the deviation between an employee’s annual salary and the maximum employee salary. This example uses the employee_dimension table from the VMart database: =>SELECT employee_state, annual_salary, MAX(annual_salary) OVER(PARTITION BY employee_state ORDER BY employee_key) max,annual_salary- MAX(annual_salary) OVER(PARTITION BY employee_state ORDER BY employee_key) diff FROM employee_dimension WHERE employee_state = 'MA' LIMIT 10; employee_state | annual_salary | max | diff ---------------+---------------+--------+--------- MA | 625887 | 625887 | 0 MA | 395688 | 625887 | -230199 MA | 396174 | 625887 | -229713 MA | 380373 | 625887 | -245514 MA | 196768 | 625887 | -429119 MA | 146312 | 625887 | -479575 MA | 192084 | 625887 | -433803 MA | 141933 | 625887 | -483954 MA | 193966 | 625887 | -431921 MA | 131931 | 625887 | -493956 (10 rows) For more information, see Analytic Functions.