Using Regular Expressions in Dictionaries
Vertica Pulse supports the use of regular expressions in user-defined dictionaries. Vertica Pulse regular expressions use the java.util.regex package syntax. For more information on this syntax, refer to the Oracle documentation.
Note: Regular expressions do not apply to the base word in normalization dictionaries.
You can add regular expressions to the following dictionaries:
- pos_words
- neg_words
- neutral_words
- normalization
- white_list
- stop_words
You can add a regular expression to a user-defined dictionary using an INSERT statement and a $REGEX parameter containing the regular expression. Regular expressions are case insensitive. For example, the regular expression $regex(apple) produces the same matches as the regular expression $regex(Apple).
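The case-insensitive behavior can be illustrated outside Pulse with the java.util.regex package itself. This is a minimal sketch, not Pulse's implementation: the class name CaseInsensitiveDemo and the helper matchesToken are hypothetical, and it assumes that Pulse's matching behaves like Pattern.CASE_INSENSITIVE applied to a whole token.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Hypothetical helper. Assumption: Pulse's matching is approximated by a
    // case-insensitive match against the entire token.
    static boolean matchesToken(String regex, String token) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(token)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesToken("apple", "Apple")); // true
        System.out.println(matchesToken("Apple", "apple")); // true
        System.out.println(matchesToken("apple", "APPLE")); // true
    }
}
```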
Note: A regular expression can match only a single token or word. Smartphone would be a valid token, but smart phone would not.
The following example would match any word ending with the string "day". You could use it to identify the days of the week or words such as yesterday and today.
INSERT INTO stopwords_en VALUES('$LIST(nice,good,fine) $REGEX(.*day)');
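To check which tokens the pattern .*day would accept, you can try it directly against java.util.regex. The sketch below is illustrative only: the class DaySuffixDemo and the helper matchesToken are hypothetical names, and it assumes that Pulse tests the pattern against one whole token at a time, ignoring case.

```java
import java.util.regex.Pattern;

class DaySuffixDemo {
    // Assumption: the pattern must match the whole token, case-insensitively.
    private static final Pattern DAY =
        Pattern.compile(".*day", Pattern.CASE_INSENSITIVE);

    static boolean matchesToken(String token) {
        return DAY.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"Monday", "yesterday", "today", "daylight", "days"};
        for (String token : tokens) {
            // Tokens that end in "day" match; "daylight" and "days" do not.
            System.out.println(token + " -> " + matchesToken(token));
        }
    }
}
```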
The following example matches references to iPhones, including the number and letter version.
INSERT INTO whitelist_en VALUES('Iphone $REGEX(\d{1}(S|C)?)');
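The version pattern \d{1}(S|C)? accepts a single digit optionally followed by S or C. The following sketch exercises only that pattern with java.util.regex. It does not model how Pulse combines the pattern with the word Iphone. The class VersionSuffixDemo and the helper matchesVersion are hypothetical, and whole-token, case-insensitive matching is an assumption.

```java
import java.util.regex.Pattern;

class VersionSuffixDemo {
    // In Java source, "\\d" is the regex \d. Assumption: case-insensitive,
    // whole-token matching approximates Pulse's behavior.
    private static final Pattern VERSION =
        Pattern.compile("\\d{1}(S|C)?", Pattern.CASE_INSENSITIVE);

    static boolean matchesVersion(String token) {
        return VERSION.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesVersion("5"));  // true: single digit
        System.out.println(matchesVersion("5S")); // true: digit plus S
        System.out.println(matchesVersion("5c")); // true: case is ignored
        System.out.println(matchesVersion("55")); // false: two digits
        System.out.println(matchesVersion("5X")); // false: X is not S or C
    }
}
```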
To use a parenthesis as a literal part of a regular expression, you must use the escape character \ twice to prevent Pulse from interpreting the parenthesis as a metacharacter in the regular expression. The following example would match the literal string (hugs).
INSERT INTO whitelist_es VALUES('$REGEX(\\(hugs\\))');
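The difference between an escaped and an unescaped parenthesis can be seen with java.util.regex directly. This is a sketch under assumptions: the class EscapedParenDemo and the helper matchesToken are hypothetical. In the Java source below, the doubled backslash is Java string escaping that yields the regex \(hugs\). Whether Pulse processes its doubled backslash the same way is not confirmed here.

```java
import java.util.regex.Pattern;

class EscapedParenDemo {
    // Assumption: whole-token, case-insensitive matching approximates Pulse.
    static boolean matchesToken(String regex, String token) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(token)
                      .matches();
    }

    public static void main(String[] args) {
        // Escaped: the parentheses are literal characters.
        System.out.println(matchesToken("\\(hugs\\)", "(hugs)")); // true
        System.out.println(matchesToken("\\(hugs\\)", "hugs"));   // false

        // Unescaped: the parentheses form a capturing group around "hugs".
        System.out.println(matchesToken("(hugs)", "hugs"));       // true
        System.out.println(matchesToken("(hugs)", "(hugs)"));     // false
    }
}
```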