A SQL extension that lets you screen large amounts of historical data in search of event patterns, the MATCH clause provides subclasses for analytic partitioning and ordering and matches rows from the result table based on a pattern you define.
You specify a pattern as a regular expression, which is composed of event types defined in the DEFINE subclause, where each event corresponds to a row in the input table. Then you can search for the pattern within a sequence of input events. Pattern matching returns the contiguous sequence of rows that conforms to PATTERN subclause. For example, pattern
A B* C) consist of three event types: A, B, and C. When Vertica finds a match in the input table, the associated pattern instance must be an event of type A followed by 0 or more events of type B, and an event of type C.
Pattern matching is particularly useful for clickstream analysis where you might want to identify users' actions based on their Web browsing behavior (page clicks). A typical online clickstream funnel is:
Company home page -> product home page -> search -> results -> purchase online
Using the above clickstream funnel, you can search for a match on the user's sequence of web clicks and identify that the user:
- Landed on the company home page
- Navigated to the product page
- Ran a search
- Clicked a link from the search results
- Made a purchase
For examples that use this clickstream model, see Event Series Pattern Matching in Analyzing Data.
MATCH ( [ PARTITION BY table‑column ] ORDER BY table‑column ... DEFINE event‑name AS boolean‑expr [,…] ... PATTERN pattern‑name AS ( regexp ) ... [ rows‑match‑clause ] )
Defines the window data scope in which the pattern, defined in the PATTERN subclause, is matched. The partition clause partitions the data by matched patterns defined in the PATTERN subclause. For each partition, data is sorted by the ORDER BY clause. If the partition clause is omitted, the entire data set is considered a single partition.
Defines the window data scope in which the pattern, defined in the PATTERN subclause, is matched. For each partition, the order clause specifies how the input data is ordered for pattern matching.
The ORDER BY clause is mandatory.
Defines the boolean expressions that make up the event types in the regular expressions. For example:
DEFINE Entry AS RefURL NOT ILIKE '%website2.com%' AND PageURL ILIKE '%website2.com%', Onsite AS PageURL ILIKE '%website2.com%' AND Action='V', Purchase AS PageURL ILIKE '%website2.com%' AND Action='P'
The DEFINE subclause accepts a maximum of 52 events. See Event Series Pattern Matching in Machine Learning for Predictive Analytics for examples.
Name of the event to evaluate for each row—in the earlier example,
Event names are case insensitive and follow the same naming conventions as those used for tables and columns.
Purchase AS PageURL ILIKE '%website2.com%' AND Action = 'P'
Name of the pattern defined in the PATTERN subclause; for example, P is the pattern name defined below:
A PATTERN is a search pattern that is comprised of a name and a regular expression.
Vertica supports one pattern per query.
A regular expression comprised of event types defined in the
Specifies how to resolve more than one event evaluating to true for a single row, one of the following:
Pattern Semantic Evaluation
- The semantic evaluating ordering of the SQL clauses is: FROM -> WHERE -> PATTERN MATCH -> SELECT.
- Data is partitioned as specified in the PARTITION BY clause. If the partition clause is omitted, the entire data set is considered a single partition.
- For each partition, the order clause specifies how the input data is ordered for pattern matching.
- Events are evaluated for each row. A row could have 0, 1, or N events evaluate to true. If more than one event evaluates to true for the same row, Vertica returns a run-time error unless you specify ROWS MATCH FIRST EVENT. If you specify ROWS MATCH FIRST EVENT and more than one event evaluates to TRUE for a single row, Vertica chooses the event that was defined first in the SQL statement to be the event it uses for the row.
- Vertica performs pattern matching by finding the contiguous sequence of rows that conforms to the pattern defined in the PATTERN subclause.
For each match, Vertica outputs the rows that contribute to the match. Rows not part of the match (do not satisfy one or more predicates) are not output.
- Vertica reports only non-overlapping matches. If an overlap occurs, Vertica chooses the first match found in the input stream. After finding the match, Vertica looks for the next match, starting at the end of the previous match.
- Vertica reports the longest possible match, not a subset of a match. For example, consider pattern: A*B with input: AAAB. Because A uses the greedy regular expression quantifier (*), Vertica reports all A inputs (AAAB), not AAB, AB, or B.
Notes and Restrictions
- DISTINCT and GROUP BY/HAVING clauses are not allowed in pattern match queries.
The following expressions are not allowed in the DEFINE subclause:
- Subqueries, such as
DEFINE X AS c IN (
SELECT c FROM table1
- Analytic functions, such as
DEFINE X AS c <
(1) OVER (ORDER BY 1)
- Aggregate functions, such as
DEFINE X AS c <
- Subqueries, such as
You cannot use the same pattern name to define a different event; for example, the following is not allowed for X:
DEFINE X AS c1 < 3 X AS c1 >= 3
- Used with MATCH clause, Vertica Pattern Matching Functions provide additional data about the patterns it finds. For example, you can use the functions to return values representing the name of the event that matched the input row, the sequential number of the match, or a partition-wide unique identifier for the instance of the pattern that matched.
For examples, see Event Series Pattern Matching in Machine Learning for Predictive Analytics.