
Formatted text such as system logs hold a huge amount of actionable data if you can extract and discover the content. If there is a pattern to the lines, then you can build a Vertica regex parser to extract fields and contents into Vertica for query and analysis at scale. Vertica’s regex parser uses Perl-like regular expressions as described at https://perldoc.perl.org/perlre.html and is implemented with the popular and very efficient Perl Compatible Regular Expressions (PCRE) engine. Regex parsers and other built-in and user-defined parsers can process data from any supported source and operate in parallel. Let’s look at Vertica’s FREGEXPARSER and how to extract content from two types of logs: Apache httpd log and our own vertica.log
httpd.log
Here’s a typical httpd log entry with ranges we would like to capture (we will also try to capture the GET Method but it’s too narrow to type in the legend!)

We can extract fields by matching on pattern – IP, date, separator characters as well as order of fields in the string. Here is a sample regex parser to extract httpd log-like entries:

Breaking down the above expression which will extract tokens denoted by < > left to right from an input log entry:
Just after the beginning of the line (^), extract tokens <srcip>,<dstip>,<username> which are each sequences of non-whitespace characters “[\S+]” followed by whitespace “\s+”;
token <ts> is a composite key that extracts the entire string between the next pair of square braces and in turn extracts of 3 grouped patterns within the <ts> key: <dd>, 2 digits followed by “/”; <mon>, 3 letters in a word followed by “/”; and <yr>, the next 4 digits;
token <request type> is preceded by double quote [“], is made of one or more words (\w+) followed by whitespace “\s+”;
token <request_url> is a series of non-space characters also excluding “?” [^ ?];
token <url_para> is optional as denoted by “(?:” and starts from “?” [?] followed by non-space, non-double quote characters “[^ “]” and followed by whitespace “\s+”;
token <protover> is a series of non-space, non-double quote characters [^ “] followed by double quote [“] and whitespace “\s+”;
tokens <httpcode> and <httpsize> are series of non-whitespace characters “[\S+]” followed by whitespace “\s+”;
last, tokens <refurl> and <useragent> are series of non-double quote characters “[^”]+” enclosed by double quotes [“], followed by whitespace “\s+” for <refurl> and optional whitespace at end of line “\s*” for <useragent>.
Let’s capture fields from the following log entry:
192.168.1.238 192.168.1.206 - [11/Sep/2020:17:06:25 -0400] "GET /favicon.ico HTTP/1.1" 200 3638 "http://192.168.1.206/webgl/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
After ingesting the entry with the COPY with FREGEXPARSER, you can then directly query the flex table as shown, or materialize the data into a Vertica table to take advantage of the higher performance offered by a projection optimized for your log searches.
dbadmin=> select maptostring(__raw__) from public.httpd_log_flex limit 1;
{
"dd": "07",
"dstip": "192.168.1.206",
"httpcode": "200",
"httpsize": "17582",
"mon": "Sep",
"protover":
"HTTP/1.1",
"refurl": "-",
"request_type": "GET",
"request_url": "/webgl/",
"srcip": "192.168.1.238",
"ts": "07/Sep/2020:22:13:36 -0400",
"useragent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
"username": "-",
"yr": "2020"
}
Vertica.log
Vertica itself outputs formatted log strings as well. Here’s a sample:
2020-05-08 03:05:02.073 INFO New log
2020-05-08 03:05:02.073 Main:0x7f4460a6e280 [Init] Log /home/dbadmin/sbx/v_sbx_...
2020-05-08 03:05:02.074 Main:0x7f4460a6e280 [Init] Processing command line: /op...
2020-05-08 03:05:51.001 ManageEpochs:0x7f4370ff9700-a0000000093495 [Txn] Begin ...
Fields to capture: Timestamp, Process Name, Session, Component, Log Level
Here is the corresponding regex. Vertica log rows are not all consistent; most will match this pattern, but rows that don’t match will be ignored:

Putting this one also into somewhat plain English, the lines of expression read:
Just after the beginning of the line ( ^ ) there is a pattern grouped into a key named ts: ( (?<ts> … ) ), which, in turn consists of 4 grouped patterns, the first of which, (?<yy>\d{4}), goes to a key named yy, and consists of four digits. After this group comes a hyphen, followed by the next named group (?<mm>\d{2}), which goes to a key named mm and consists of two digits; then comes a hypen again, and yet another group, (?<dd>\d{2}), which goes to a key named dd and consists, again, of two digits. This is followed by a single space, and finally, we have the group (?<tm>\d\d:\d\d:\d\d\.\d{3}), which goes to a key named tm, a time literal, consisting of: two digits, a colon, two digits, a colon, two digits, a decimal point, and three digits.
The next group, (?<proc_name>\w+), follows after a single space, goes to a key named proc_name, and consists of one or more programmer word characters.
The next group again, (?<session>0x[[:xdigit:]]{12}.{0,15}), comes after a colon, goes to a key named session and consists of: the two-character sequence “0x”, 12 hexadecimal digits, a dot zero to fifteen occurrences of any character (the dot).
This is followed by a space, an open square bracket; then comes the next group, (?<component>\w+), which is one or more occurrences of a programmer word character; and this is followed by a closing square bracket.
Finally, we have a space, a smaller-than-sign, and the group (?<loglvl>\w+), which goes to the key loglvl and consists, again, of one or more occurrences of a programmer word character; this is followed by a greater-than-sign.

Regex parsing offers a way to define a custom ingest format for lines of text in a very fast manner. Regex can be a bit esoteric and the examples we show may seem complex, but there are many resources at your disposal. If you’re working on a data ingest that might benefit from regex parsing, please get in touch through your account team or visit the following resources.
Related Links:
Documentation for Vertica COPY and FREGEXPARSER
For details on how to construct and interpret regex, visit the Perl Regex documentation
Also visit the Vertica Forum for discussions and questions
And visit the Vertica Academy for more free online training