Apache Log Analysis in Vertica
I am a proud new father and of course the first thing I did was to put pictures of my daughter online on our webserver because I wanted to see who had been looking at them. I could have gone the Google Analytics route, but being a geek I wanted to explore the data myself rather than just get a static report.
Before Vertica, for this kind of analysis I probably would have written a perl script (because it has the best regexp support of any language I know), but as soon as I start doing anything more complicated than summarizing, things get ugly quickly. Specifically I wanted to group the web log entries into sessions (“sessionize”) to analyze visits rather than page views. According to the interwebs, it seems Hadoop is often used to do this kind of analysis, but that still requires a program to compute the statistics of interest, though it can distribute that computation across many machines.
Since working at Vertica, I have become convinced that SQL is an excellent language for this kind of analysis — it allows one to easily express declaratively what is painful to express programatically (e.g. COUNT distinct). An often cited problem of SQL based analysis is that you need to get your data into a database first by writing a load script that parses your logs into some structured table format. Far from impossible (and any analysis needs to put the logs into a structured format) but annoying as the parsing code (the script) and structure definition (SQL DDL) aren’t bound together.
Recently, I have been working at Vertica on our extensibility mechanism to extend our database from within. So I selfishly used my desire to analyze my own web logs to justify writing an example of parsing Apache logs inside the database. On a recent cross country flight I whipped up a simple Apache log parser (now included as an example in our SDK in 5.0 – if you try it out let me know what you think!). The hardest part of the parser was dealing with the Apache log format (which for some reason changed sometime since the first batch of logs I have from 2005).
Armed with the log parser, the analysis of who was looking at my daughter’s pictures became pretty easy. Furthermore, because I had access to the raw log data in a database, I ended up finding several other fascinating patterns that I hadn’t specifically set out to find. The more I see, the more convinced I am that Hal Varian has it right and that data analysis will be the sexy job of the next decade.
The analysis steps were simple:
- • rsync logs from my web server to my laptop
- • Get logs into Vertica with straightforward SQL:
CREATE TABLE raw_logs
(filename VARCHAR(500),raw_log varchar(4000))
SEGMENTED BY HASH(filename) ALL NODES;
COPY raw_logs(filename as ‘access_log’ , raw_log) FROM
DELIMITER E’\1′\”"; — avoid field parsing on tabs
- • Install and run the log parser code
– Install the parser code
CREATE LIBRARY ParserLib AS ‘/tmp/ApacheParser2.so’;
CREATE TRANSFORM FUNCTION ApacheParser
AS LANGUAGE ‘C++’ NAME ‘ApacheParserFactory’ LIBRARY ParserLib;
– Parse the logs into a new table
CREATE TABLE parsed_logs AS
SELECT filename, ApacheParser(raw_log) OVER (PARTITION BY filename)
Voila! Now I have a structured table with one row per log entry (i.e. file served by the server) and one column per logical log field. It is now a simple task to collect the clicks into sessions (see Sessionize with Style)
CREATE TABLE parsed_sessions as
..CONDITIONAL_TRUE_EVENT(ts – LAG(ts) > ’30 seconds’)
….OVER (PARTITION BY ip ORDER BY ts) || replace(ip,’.',”) as session_id
Now I am ready to ask questions like:
How many sessions, ips, clicks and total bytes were served for my daughter’s pages?
..count(distinct session_id) as session_count,
..count(distinct ip) as ip_count,
..count(*) as total_click_count,
..sum(response_size)/(1024*1024) as Mbytes
where extract(year from ts) = ’2011′ and username = ‘changed’;
session_count -| ip_count | total_click_count | Mbytes
oooooooooo 313 |ooooo 162 |oooooooooooo 11151 | 7353.86
Who looked at the most pictures?
..count(*) as click_count
where extract(year from ts) = ’2011′ and username = ‘changed’
group by ip, session_id
order by click_count desc
ooooooo max_ts ooooooo | click_count
2011-04-10 08:11:43-04 | ooooooo 294
2011-04-12 11:53:51-04 | ooooooo 197
2011-04-12 09:22:20-04 | ooooooo 191
2011-04-12 06:17:36-04 | ooooooo 184
2011-04-10 10:46:36-04 | ooooooo 171
2011-04-18 11:18:52-04 | ooooooo 167
2011-04-10 14:47:31-04 | ooooooo 160
2011-04-12 13:35:12-04 | ooooooo 159
2011-04-12 18:15:56-04 | ooooooo 157
2011-04-14 10:04:10-04 | ooooooo 153
So now I was curious: Who where those top 10 clickers? At this point, querying the raw data (as opposed to an aggregated report) was super helpful.
select distinct cnt_rnk, ps.session_id, ip
from parsed_sessions ps JOIN click_counts cc USING (session_id)
where extract(year from ts) = ’2011′ and
oooo username = ‘changed’ and
ooooo ps.session_id IN (select session_id from click_counts where cnt_rnk <= 5)
order by cnt_rnk;
Without divulging any actual of the actual results (privacy, you know), it turns out 7 of the top 10 were my wife and I, one was my mother in law, one was a family friend and one is still a mystery which I am looking into.
As I was poking around, I noticed another interesting pattern that I wasn’t specifically looking for: a lot of requests came in from Google searches. So my next logical query was:
What are people searching for on Google and where does it lead them to on my site?
where referring_url ilike ‘%google.com%’ and
ooooo extract(year from ts) = ’2011′
order by ts desc
ooooooooo ts ooooooooo | oooooooooo request_url oooooooooo | referring_url
2011-06-13 21:36:57-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/search?q=dft+math+using+matrices&hl=en&prmd=ivnsfd&ei=VLn2TdLSLsnr0gHC9MiMCw&start=30&sa=N&biw=1221&bih=812
2011-06-13 19:55:26-04 | /classes/commit/fft-factoring.pdf | http://www.google.com.sa/search?sourceid=navclient&aq=0&oq=fft+matrix+factoriz&ie=UTF-8&rlz=1T4RNSN_enSA402SA402&q=fft+matrix+factorization
2011-06-13 02:58:03-04 | /classes/mechatronics/ion-generator.pdf | http://www.google.com/search?client=ubuntu&channel=fs&q=airflow+detector&ie=utf-8&oe=utf-8
2011-06-10 20:35:54-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=4&ved=0CC8QFjAD&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=fft%20matrix%20decomposition&ei=WrjyTdqLKYP2tgPRw_i7Cw&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A
2011-06-10 18:29:56-04 | /favicon.ico | http://www.google.com/search?hl=en&safe=off&client=iceweasel-a&rls=org.mozilla%3Aen-US%3Aunofficial&q=DFT+identities&aq=f&aqi=&aql=t&oq=
2011-06-01 14:54:13-04 | /classes/mechatronics/ion-generator.pdf | http://www.google.com/search?q=ion+%22measure+airflow%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
2011-06-01 09:36:51-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=5&ved=0CDYQFjAE&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=dft%20factorization&ei=akDmTcHBCILw0gGgofCeCw&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A
2011-05-31 16:41:56-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=9&ved=0CFYQFjAI&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=fft%20matrix%20notation&ei=fVLlTZ76Nc-A-waBmKHyBg&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A&sig2=4jpsx3kmRWJilGnX0VOaAg
2011-05-29 14:29:16-04 | /classes/commit/asplos.ps | http://www.google.com/url?sa=t&source=web&cd=4&ved=0CCsQFjAD&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Fasplos.ps&rct=j&q=naive%20partitioning%20in%20stream%20graph&ei=tY7iTfzGOpC5hAfFspnzBw&usg=AFQjCNHNNERPnNXT7aFGvfaJB3bjpK5Oxg&cad=rja
2011-05-25 03:44:12-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=14&ved=0CC4QFjADOAo&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=DERIVATION%20OF%20DFT&ei=L7PcTZ7zL4q_0AH007C_Dw&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A
I could see the query terms peeking out of that mess, but it isn’t easy to analyze because the query string is URI encoded within the referring url. I thought it would be cool to programmatically pick out the query terms, and so I spent some time messing with an unsatisfactory regexp based solution. Then I found out that Hieu, one of our interepid interns this summer, had already made a URI decoder using our SDK and the uriparser library:
CREATE TRANSFORM FUNCTION UriExtractor
AS LANGUAGE ‘C++’ NAME ‘UriExtractorFactory’ LIBRARY ParserLib;
Extract the search terms from the URIs of Google searches
CREATE table search_terms
SELECT request_url, value as search_term
..(SELECT request_url, UriExtractor(referring_url) OVER (PARTITION BY request_url) FROM search_referrals ) AS sq
WHERE sq.name = ‘q’;
SELECT * FROM search_terms LIMIT 10;
ooooooooooooooo request_url ooooooooooooooo | search_term
/ ooooooooooooooooooooooooooooooooooooooooo | andrew nerdnetworks
/classes/6.033/cyberbeanie.pdf oooooooooooo | 6.033 cyberbeanie
/classes/6.033/cyberbeanie.pdf oooooooooooo | Jerome H. Saltzer and M. Frans Kaashoek. 6.033 class notes
/classes/6.033/cyberbeanie/cyberbeanie.html | 6.033 Bibliography Saltzer Computer systems
/classes/6.033/cyberbeanie.html ooooooooooo | link_send
/classes/6.033/cyberbeanie.html ooooooooooo | “Topics in the Engineering of Computer Systems”
/classes/6.033/cyberbeanie.html ooooooooooo | Jerome H. Saltzer, M. Frans Kaashoek. Topics in the Engineering of Computer Systems. M.I.T. 6.033 class notes
/classes/6.033/spank/spankster.html ooooooo | MITPerson
/classes/6.033/spankster.pdf oooooooooooooo | MITPerson
/classes/6.033/spankster.pdf oooooooooooooo | “chunk server”
Note that some of the actual values in the above data have been changed to protect other people’s privacy.
Now I need to go back to my day job making Vertica better, but I truly do hope people are able to take the Apache log parser and quickly and easily find their own interesting insights.