Vertica

Archive for the ‘Uncategorized’ Category

Join HP Vertica’s User-Driven Community!

We at HP Vertica are very excited to announce our new user-driven community that will help us better serve our users, partners, and anyone generally interested in learning more about the HP Vertica Analytics Platform. In partnership with GetSatisfaction, community members can now:

  • Start engaging with other customers to establish new and valuable relationships
  • Add context to your past issues and get the best possible answers to your questions
  • Wield your influence as subject matter expert in the community

Now, HP Vertica newbies as well as seasoned database/data analytics veterans and everyone in between can post questions, share ideas, report issues, and give praise. To make this information readily available for your convenient access, all questions are cataloged and searchable.

The new-and-improved community interface will enable you to easily access more information and better communicate with HP Vertica and other users of the HP Vertica Analytics Platform.

We welcome you to join our community by visiting http://community.vertica.com or by accessing the community tab on the side of the www.vertica.com homepage.

Big Data is Changing Software and (Product) Development as We Know It

I am often asked about “Big Data”, its use cases, real-world business value and how it will transform various products, services and markets.  This is one of my favorite topics, and I am fortunate in that I get to spend significant amounts of time with our amazing customers and partners who teach me a lot.  I am actually writing this from a plane after a few recent customer meetings that inspired me to share a point of view.

“Big Data” is already having and will continue to have the most impact in products and services where there is an ability to capture information about usage, experience and behavior in a manner that is accepted, yet not disruptive by the consumer of that product or service.  Data warehousing has been around for a long time with regards to retail transactions and purchasing behavior, but usage and experience measurement hasn’t had the same repository equivalent.  It now does, and I believe this will lead to an exponential jump in the quality and variety of products and services that are delivered to consumers.  In fact, this will not only improve existing solutions, but it will spawn entirely new products and services in industries as diverse as entertainment to medical treatments.

While the notion of experience analysis has been around for a long time through various manual observation efforts, focus groups, and survey methods- the results have been fragmented, small, and analyzed in what I’ll call a “basic” manner. Thanks to technology advancements and the resulting cost shifts, massive near real-time “feedback” collection can now be done through automation and sensor technology.  While the prospect of having this information delights any product manager and merchandiser, the challenge of capturing, storing, and analyzing the information at this scale is still foreign to many.

 

There is one community who is embracing this feedback fire hose with greater ease and speed than most- software developers.  Vertica has several ISVs, who are leaving “breadcrumbs” in their code to collect usage information that can be anonymously transferred back to headquarters for very specific feedback on how users of software are interfacing with it.  Their users agree to this data collection and sharing, and the ISVs ensure that it has no impact on the operational performance of their software.

These “breadcrumbs” can measure how long someone spends on a screen, which buttons they clicked on to get there, how successful they were, etc.  For instance, good development organizations analyze the time that a user should get from one place to another, that is, navigation within and between screens.  If and ISVs software is the track, this is the laser measurement for precise timing.

Vertica is an ideal platform to store and analyze this information.  Using Vertica’s advanced analytic and pattern matching capabilities, correlations of usage patterns can be identified and the developers can patch, redesign, or document accordingly to deliver a better experience to end users.  For example, you could quite easily determine that users who spent 3 minutes on one screen, clicked a certain button, spent less than 1 minute on that screen, then quit might not be happy with their experience compared with users who started in the same place but stayed online longer. Further analysis could determine “why” through the more traditional interview techniques to improve the experience.

Why are software developers so eager to embrace this as the early adopters?  Well, one reason is that it gives them direct feedback on their work, without having to get the sometimes editorialized version from sales, support, management and yes even product managers!  Traditionally, most feedback to this community is sparse at best with highly anecdotal sentiment mixed in.  This method can augment that sentiment, (which should still be captured through sales, support, and product management by the way) with very complete data sets.  The product managers at these customers actually love this capability, and many of them are directly interacting and analyzing with the raw data collected.

Software developers also have the ability to make and control their own sensors- pretty cool when you think about it.  The savvy developer is able to create these listening points at various places in their code.  Savvy developers and product managers these days are spending time on these breadcrumbs because while they know they require more work (just as good quality assurance does), the payback is huge and ultimately can save them a lot of time.  Recently I visited one of our customers that develops enterprise software and they are piloting a project in this area that already has 8 Billion rows of this type of information- now that’s bigger than a breadbox!

This capability is not limited to SaaS vendors (although they certainly have more control and an easier time collecting the data).  Our online gaming customers are at the forefront, but we see all ISVs getting into this.  There is so much we can learn from software developers.  What is especially exciting is seeing how other physical sensors are being used in everything from automobiles to jet engines and even refrigerators to deliver the same type of feedback.  There is no question, the sensor economy is upon us.  In the end, this will lead to better products and services for you and me, the consumer, which is a good thing.

Are You Ready for the Data Race?

A few hardy souls pressed against the tide of humanity heading home after work last Tuesday to gather at a nondescript loft in downtown Boston. We carefully looked left and right before dodging in past the bouncer to join a select crowd in their new favorite adrenaline-pumping sport… Tweet Racing.

In Tweet Racing, each participant carefully selects a twitter search term for the race, betting on the term they hope the Twitterverse will smile upon. Thrown into the cage and subjected to Vertica’s live twitter sentiment analysis code, the terms dueled for an hour. There are no rules in Tweet Racing – anything goes. We watched as the participants encouraged their Twitter followers to tweet for their terms or brutally tweet down others.

In the end, we even learned a few things. People don’t feel very strongly about kittens on a Tuesday evening.  However, “skrillex” is fairly popular, but it’s hard to beat “jolie” right after her Oscar Night poses.

 

But we weren’t there just to watch the races. New Blood Boston hosted the Vertica Engineering team for a discussion about ”Big Data” and how the Vertica Analytics Platform is a natural fit for many of the data problems facing start-ups today.

At the event, we showed how Vertica can blaze through anything from clickstream data with funnel analysis to graph problems like k-core and counting triangles – problems that may not initially appear to be database problems. We demonstrated what makes Tweet Racing possible in Vertica – the extensibility of the platform and its applicability to things outside the usual scope of the traditional SQL database.

But mostly, we were there to share our passion for Vertica and the engineering challenges that go into making it the industry’s most powerful, extensible analytics database.

Missed the New Blood Boston event? Check out our Vertica Community Edition to test drive Vertica and experience the thrills first hand!

The Power of Projections – Part 3

By Colin Mahony and Shilpa Lawande

Part III – Comparing and Contrasting Projections to Materialized Views and Indexes

In Part I and Part II of this post, we introduced you to Vertica’s projections, and described how easy it is to interface with them directly via SQL or through our Database Designer™ tool.  We will now end this series by comparing and contrasting Vertica’s projections with traditional indexes and materialized views.

Row-store databases often use Btree indexes as a performance enhancement.  Btree indexes are designed for highly concurrent single-record inserts and updates, e.g. an OLTP scenario. Most data warehouse practitioners would agree that index rebuilds after a batch load are preferable to taking the hit of maintaining them record by record given the logging overhead. Bitmap indexes are designed for bulk loads and are better than btrees for data warehousing but only suitable for low cardinality columns and a certain class of queries.  Even though you have these indexes to help find data, you still have to go to the base table to get the actual data, which brings with it all the disadvantages of a row store.

In a highly simplified view, you can think of a Vertica projection as a single level, densely packed, clustered index which stores the actual data values, is never updated in place, and has no logging. Any “maintenance” such as merging sorted chunks or purging deleted records is done as automatic and background activity, not in the path of real-time loads.  So yes, projections are a type of native index if you will, but they are very different from traditional indexes like Bitmap and Btrees.

Vertica also offers a unique feature known as “pre-join projections”. Pre-join projections denormalize tables at the physical layer under the covers providing a significant performance advantage over joining tables at run-time.  Pre-join projections automatically store the results of a join ahead of time, yet the logical schema is maintained – again, flexibility of the storage structure without having to rewrite your ETL or application.  Vertica can get away with this because it excels at sparse data storage, and in particular isn’t penalized at all for null values nor for wide fact tables. Since Vertica does not charge extra for additional projections, this is a great way to reap the benefits of denormalization without the need to purchase a larger capacity license.

So to sum up, here’s how Vertica projections stack up versus materialized views and conventional indexes.

Vertica’s Projections

Traditional Materialized Views

Traditional Indexes

  • Are primary storage – no base tables are required
  • Are secondary storage
  • Are secondary storage pointing to base table data
  • Can be segmented, partitioned, sorted, compressed and encoded to suit your needs
  • Are rigid: Practically limited to columns and query needs, more columns = more I/O
  • Support one clustered index at most – tough to scale out
  • Have a simple physical design
  • Use Aggregation losing valuable detail
  • Require complex design choices
  • Are efficient to load & maintain
  • Are mostly batch updated
  • Are expensive to update
  • Are versatile – they can support any data model
  • Provide high data latency
  • Provide high data latency
  • Allow you to work with the detailed data
  • Provide near-real time low data latency
  • Combine high availability with special optimizations for query performance

.
That’s pretty much all there is to it.  Whether you are running ad-hoc queries or canned operational-BI workloads, you will find projections to be a very powerful backbone for getting the job done!

Read the rest of the 3-part series…

The Power of Projections – Part 1: Understanding Projections and What They Do
The Power of Projections – Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™

Reports of SQL’s Death Are Greatly Exaggerated

Apache Log Analysis in Vertica
.

I am a proud new father and of course the first thing I did was to put pictures of my daughter online on our webserver because I wanted to see who had been looking at them. I could have gone the Google Analytics route, but being a geek I wanted to explore the data myself rather than just get a static report.

Before Vertica, for this kind of analysis I probably would have written a perl script (because it has the best regexp support of any language I know), but as soon as I start doing anything more complicated than summarizing, things get ugly quickly. Specifically I wanted to group the web log entries into sessions (“sessionize”) to analyze visits rather than page views. According to the interwebs, it seems Hadoop is often used to do this kind of analysis, but that still requires a program to compute the statistics of interest, though it can distribute that computation across many machines.

Since working at Vertica, I have become convinced that SQL is an excellent language for this kind of analysis — it allows one to easily express declaratively what is painful to express programatically (e.g. COUNT distinct). An often cited problem of SQL based analysis is that you need to get your data into a database first by writing a load script that parses your logs into some structured table format. Far from impossible (and any analysis needs to put the logs into a structured format) but annoying as the parsing code (the script) and structure definition (SQL DDL) aren’t bound together.

Recently, I have been working at Vertica on our extensibility mechanism to extend our database from within. So I selfishly used my desire to analyze my own web logs to justify writing an example of parsing Apache logs inside the database. On a recent cross country flight I whipped up a simple Apache log parser (now included as an example in our SDK in 5.0 – if you try it out let me know what you think!). The hardest part of the parser was dealing with the Apache log format (which for some reason changed sometime since the first batch of logs I have from 2005).

Armed with the log parser, the analysis of who was looking at my daughter’s pictures became pretty easy. Furthermore, because I had access to the raw log data in a database, I ended up finding several other fascinating patterns that I hadn’t specifically set out to find. The more I see, the more convinced I am that Hal Varian has it right and that data analysis will be the sexy job of the next decade.

The analysis steps were simple:

  • • rsync logs from my web server to my laptop
  • • Get logs into Vertica with straightforward SQL:

CREATE TABLE raw_logs
(filename VARCHAR(500),raw_log varchar(4000))
SEGMENTED BY HASH(filename) ALL NODES;

COPY raw_logs(filename as ‘access_log’ , raw_log) FROM
‘/home/alamb/access_log’
DELIMITER E’\1′\”"; — avoid field parsing on tabs

  • • Install and run the log parser code

– Install the parser code
CREATE LIBRARY ParserLib AS ‘/tmp/ApacheParser2.so’;
CREATE TRANSFORM FUNCTION ApacheParser
AS LANGUAGE ‘C++’ NAME ‘ApacheParserFactory’ LIBRARY ParserLib;

– Parse the logs into a new table
CREATE TABLE parsed_logs AS
SELECT filename, ApacheParser(raw_log) OVER (PARTITION BY filename)
FROM raw_logs;

Voila! Now I have a structured table with one row per log entry (i.e. file served by the server) and one column per logical log field. It is now a simple task to collect the clicks into sessions (see Sessionize with Style)

CREATE TABLE parsed_sessions as
SELECT
..*,
..CONDITIONAL_TRUE_EVENT(ts – LAG(ts) > ’30 seconds’)
….OVER (PARTITION BY ip ORDER BY ts) || replace(ip,’.',”) as session_id
FROM parsed_logs;

Now I am ready to ask questions like:

How many sessions, ips, clicks and total bytes were served for my daughter’s pages?

select
..count(distinct session_id) as session_count,
..count(distinct ip) as ip_count,
..count(*) as total_click_count,
..sum(response_size)/(1024*1024) as Mbytes
from parsed_sessions
where extract(year from ts) = ’2011′ and username = ‘changed’;

session_count -| ip_count | total_click_count | Mbytes
________________________________________________________
oooooooooo 313 |ooooo 162 |oooooooooooo 11151 | 7353.86
(1 row)

Who looked at the most pictures?

select
..max(ts) max_ts,
..count(*) as click_count
from parsed_sessions
where extract(year from ts) = ’2011′ and username = ‘changed’
group by ip, session_id
order by click_count desc
limit 10;

ooooooo max_ts ooooooo | click_count
_______________________|_____________
2011-04-10 08:11:43-04 | ooooooo 294
2011-04-12 11:53:51-04 | ooooooo 197
2011-04-12 09:22:20-04 | ooooooo 191
2011-04-12 06:17:36-04 | ooooooo 184
2011-04-10 10:46:36-04 | ooooooo 171
2011-04-18 11:18:52-04 | ooooooo 167
2011-04-10 14:47:31-04 | ooooooo 160
2011-04-12 13:35:12-04 | ooooooo 159
2011-04-12 18:15:56-04 | ooooooo 157
2011-04-14 10:04:10-04 | ooooooo 153
(10 rows)

So now I was curious: Who where those top 10 clickers? At this point, querying the raw data (as opposed to an aggregated report) was super helpful.

select distinct cnt_rnk, ps.session_id, ip
from parsed_sessions ps JOIN click_counts cc USING (session_id)
where extract(year from ts) = ’2011′ and
oooo username = ‘changed’ and
ooooo ps.session_id IN (select session_id from click_counts where cnt_rnk <= 5)
order by cnt_rnk;

Without divulging any actual of the actual results (privacy, you know), it turns out 7 of the top 10 were my wife and I, one was my mother in law, one was a family friend and one is still a mystery which I am looking into.

As I was poking around, I noticed another interesting pattern that I wasn’t specifically looking for: a lot of requests came in from Google searches. So my next logical query was:

What are people searching for on Google and where does it lead them to on my site?

select ts,
request_url,
referring_url
from parsed_logs
where referring_url ilike ‘%google.com%’ and
ooooo extract(year from ts) = ’2011′
order by ts desc
limit 10;

ooooooooo ts ooooooooo | oooooooooo request_url oooooooooo | referring_url
_______________________|___________________________________|____________________________
2011-06-13 21:36:57-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/search?q=dft+math+using+matrices&hl=en&prmd=ivnsfd&ei=VLn2TdLSLsnr0gHC9MiMCw&start=30&sa=N&biw=1221&bih=812
2011-06-13 19:55:26-04 | /classes/commit/fft-factoring.pdf | http://www.google.com.sa/search?sourceid=navclient&aq=0&oq=fft+matrix+factoriz&ie=UTF-8&rlz=1T4RNSN_enSA402SA402&q=fft+matrix+factorization
2011-06-13 02:58:03-04 | /classes/mechatronics/ion-generator.pdf | http://www.google.com/search?client=ubuntu&channel=fs&q=airflow+detector&ie=utf-8&oe=utf-8
2011-06-10 20:35:54-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=4&ved=0CC8QFjAD&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=fft%20matrix%20decomposition&ei=WrjyTdqLKYP2tgPRw_i7Cw&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A
2011-06-10 18:29:56-04 | /favicon.ico | http://www.google.com/search?hl=en&safe=off&client=iceweasel-a&rls=org.mozilla%3Aen-US%3Aunofficial&q=DFT+identities&aq=f&aqi=&aql=t&oq=
2011-06-01 14:54:13-04 | /classes/mechatronics/ion-generator.pdf | http://www.google.com/search?q=ion+%22measure+airflow%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
2011-06-01 09:36:51-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=5&ved=0CDYQFjAE&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=dft%20factorization&ei=akDmTcHBCILw0gGgofCeCw&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A
2011-05-31 16:41:56-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=9&ved=0CFYQFjAI&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=fft%20matrix%20notation&ei=fVLlTZ76Nc-A-waBmKHyBg&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A&sig2=4jpsx3kmRWJilGnX0VOaAg
2011-05-29 14:29:16-04 | /classes/commit/asplos.ps | http://www.google.com/url?sa=t&source=web&cd=4&ved=0CCsQFjAD&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Fasplos.ps&rct=j&q=naive%20partitioning%20in%20stream%20graph&ei=tY7iTfzGOpC5hAfFspnzBw&usg=AFQjCNHNNERPnNXT7aFGvfaJB3bjpK5Oxg&cad=rja
2011-05-25 03:44:12-04 | /classes/commit/fft-factoring.pdf | http://www.google.com/url?sa=t&source=web&cd=14&ved=0CC4QFjADOAo&url=http%3A%2F%2Fandrew.nerdnetworks.org%2Fclasses%2Fcommit%2Ffft-factoring.pdf&rct=j&q=DERIVATION%20OF%20DFT&ei=L7PcTZ7zL4q_0AH007C_Dw&usg=AFQjCNEyfN4KoSidrjR4EsL5wTHbqakb7A
(10 rows)

I could see the query terms peeking out of that mess, but it isn’t easy to analyze because the query string is URI encoded within the referring url. I thought it would be cool to programmatically pick out the query terms, and so I spent some time messing with an unsatisfactory regexp based solution. Then I found out that Hieu, one of our interepid interns this summer, had already made a URI decoder using our SDK and the uriparser library:

CREATE TRANSFORM FUNCTION UriExtractor
AS LANGUAGE ‘C++’ NAME ‘UriExtractorFactory’ LIBRARY ParserLib;

Extract the search terms from the URIs of Google searches

CREATE table search_terms
AS
SELECT request_url, value as search_term
FROM
..(SELECT request_url, UriExtractor(referring_url) OVER (PARTITION BY request_url) FROM search_referrals ) AS sq
WHERE sq.name = ‘q’;

SELECT * FROM search_terms LIMIT 10;

ooooooooooooooo request_url ooooooooooooooo | search_term
____________________________________________|_______________________________________
/ ooooooooooooooooooooooooooooooooooooooooo | andrew nerdnetworks
/classes/6.033/cyberbeanie.pdf oooooooooooo | 6.033 cyberbeanie
/classes/6.033/cyberbeanie.pdf oooooooooooo | Jerome H. Saltzer and M. Frans Kaashoek. 6.033 class notes
/classes/6.033/cyberbeanie/cyberbeanie.html | 6.033 Bibliography Saltzer Computer systems
/classes/6.033/cyberbeanie.html ooooooooooo | link_send
/classes/6.033/cyberbeanie.html ooooooooooo | “Topics in the Engineering of Computer Systems”
/classes/6.033/cyberbeanie.html ooooooooooo | Jerome H. Saltzer, M. Frans Kaashoek. Topics in the Engineering of Computer Systems. M.I.T. 6.033 class notes
/classes/6.033/spank/spankster.html ooooooo | MITPerson
/classes/6.033/spankster.pdf oooooooooooooo | MITPerson
/classes/6.033/spankster.pdf oooooooooooooo | “chunk server”
(10 rows)

Note that some of the actual values in the above data have been changed to protect other people’s privacy.

Now I need to go back to my day job making Vertica better, but I truly do hope people are able to take the Apache log parser and quickly and easily find their own interesting insights.