Vertica Analytics Platform Version 9.2.x Documentation

MATCH Clause

A SQL extension that lets you screen large amounts of historical data in search of event patterns, the MATCH clause provides subclasses for analytic partitioning and ordering and matches rows from the result table based on a pattern you define.

You specify a pattern as a regular expression, which is composed of event types defined in the DEFINE subclause, where each event corresponds to a row in the input table. Then you can search for the pattern within a sequence of input events. Pattern matching returns the contiguous sequence of rows that conforms to PATTERN subclause. For example, pattern P (A B* C) consist of three event types: A, B, and C. When Vertica finds a match in the input table, the associated pattern instance must be an event of type A followed by 0 or more events of type B, and an event of type C.

Pattern matching is particularly useful for clickstream analysis where you might want to identify users' actions based on their Web browsing behavior (page clicks). A typical online clickstream funnel is:

Company home page -> product home page -> search -> results -> purchase online

Using the above clickstream funnel, you can search for a match on the user's sequence of web clicks and identify that the user:

  • Landed on the company home page
  • Navigated to the product page
  • Ran a search
  • Clicked a link from the search results
  • Made a purchase

For examples that use this clickstream model, see Event Series Pattern Matching in Analyzing Data.

Syntax

MATCH ( [ PARTITION BY table‑column ] ORDER BY table‑column
... DEFINE event‑name AS boolean‑expr [,…]
... PATTERN pattern‑name AS ( regexp ) 
... [ rows‑match‑clause ] )

Parameters

PARTITION BY

Defines the window data scope in which the pattern, defined in the PATTERN subclause, is matched. The partition clause partitions the data by matched patterns defined in the PATTERN subclause. For each partition, data is sorted by the ORDER BY clause. If the partition clause is omitted, the entire data set is considered a single partition.

ORDER BY

Defines the window data scope in which the pattern, defined in the PATTERN subclause, is matched. For each partition, the order clause specifies how the input data is ordered for pattern matching.

The ORDER BY clause is mandatory.

DEFINE

Defines the boolean expressions that make up the event types in the regular expressions. For example:

DEFINE 
 Entry    AS RefURL  NOT ILIKE '%website2.com%' AND PageURL ILIKE
                               '%website2.com%',
 Onsite   AS PageURL ILIKE     '%website2.com%' AND Action='V',

 Purchase AS PageURL ILIKE     '%website2.com%' AND Action='P' 

The DEFINE subclause accepts a maximum of 52 events. See Event Series Pattern Matching in Machine Learning for Predictive Analytics for examples.

event‑name

Name of the event to evaluate for each row—in the earlier example, Entry, Onsite, Purchase.

Event names are case insensitive and follow the same naming conventions as those used for tables and columns.

boolean‑expr

Expression that returns true or false. boolean_expr can include Boolean Operators and relational (comparison) operators. For example:

Purchase AS PageURL ILIKE '%website2.com%' AND Action = 'P'
PATTERN pattern‑name

Name of the pattern defined in the PATTERN subclause; for example, P is the pattern name defined below:

 PATTERN P AS (…)

A PATTERN is a search pattern that is comprised of a name and a regular expression.

Vertica supports one pattern per query.

regexp

A regular expression comprised of event types defined in the DEFINE subclause and one or more quantifiers below. When Vertica evaluates the MATCH clause, the regular expression identifies the rows that meet the expression criteria.

*

Match 0 or more times

*?

Match 0 or more times, not greedily

+

Match 1 or more times

+?

Match 1 or more times, not greedily

?

Match 0 or 1 time

??

Match 0 or 1 time, not greedily

*+

Match 0 or more times, possessive

++

Match 1 or more times, possessive

?+

Match 0 or 1 time, possessive

|

Alternation. Matches expression before or after the vertical bar. Similar to a Boolean or.

rows‑match‑clause

Specifies how to resolve more than one event evaluating to true for a single row, one of the following:

  • ROWS MATCH ALL EVENTS: If more than one event evaluates to true for a single row, Vertica returns this error :

    ERROR: pattern events must be mutually exclusive
    HINT:  try using ROWS MATCH FIRST EVENT
  • ROWS MATCH FIRST EVENT: If more than one event evaluates to true for a given row, Vertica uses the first event in the SQL statement for that row.

Pattern Semantic Evaluation

  • The semantic evaluating ordering of the SQL clauses is: FROM -> WHERE -> PATTERN MATCH -> SELECT.
  • Data is partitioned as specified in the PARTITION BY clause. If the partition clause is omitted, the entire data set is considered a single partition.
  • For each partition, the order clause specifies how the input data is ordered for pattern matching.
  • Events are evaluated for each row. A row could have 0, 1, or N events evaluate to true. If more than one event evaluates to true for the same row, Vertica returns a run-time error unless you specify ROWS MATCH FIRST EVENT. If you specify ROWS MATCH FIRST EVENT and more than one event evaluates to TRUE for a single row, Vertica chooses the event that was defined first in the SQL statement to be the event it uses for the row.
  • Vertica performs pattern matching by finding the contiguous sequence of rows that conforms to the pattern defined in the PATTERN subclause.

For each match, Vertica outputs the rows that contribute to the match. Rows not part of the match (do not satisfy one or more predicates) are not output.

  • Vertica reports only non-overlapping matches. If an overlap occurs, Vertica chooses the first match found in the input stream. After finding the match, Vertica looks for the next match, starting at the end of the previous match.
  • Vertica reports the longest possible match, not a subset of a match. For example, consider pattern: A*B with input: AAAB. Because A uses the greedy regular expression quantifier (*), Vertica reports all A inputs (AAAB), not AAB, AB, or B.

Notes and Restrictions

  • DISTINCT and GROUP BY/HAVING clauses are not allowed in pattern match queries.
  • The following expressions are not allowed in the DEFINE subclause:

    • Subqueries, such as DEFINE X AS c IN (SELECT c FROM table1)
    • Analytic functions, such as DEFINE X AS c < LEAD(1) OVER (ORDER BY 1)
    • Aggregate functions, such as DEFINE X AS c < MAX(1)
  • You cannot use the same pattern name to define a different event; for example, the following is not allowed for X:

    DEFINE   X AS c1 <  3
      X AS c1  >= 3
    
  • Used with MATCH clause, Vertica Pattern Matching Functions provide additional data about the patterns it finds. For example, you can use the functions to return values representing the name of the event that matched the input row, the sequential number of the match, or a partition-wide unique identifier for the instance of the pattern that matched.

Examples

For examples, see Event Series Pattern Matching in Machine Learning for Predictive Analytics.