<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Vertica &#187; Blog</title>
	<atom:link href="http://www.vertica.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.vertica.com</link>
	<description>Simply Fast</description>
	<lastBuildDate>Tue, 15 May 2012 02:46:13 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>Moneyball &#8211; Not Just for Baseball Anymore</title>
		<link>http://www.vertica.com/2012/04/17/moneyball-not-just-for-baseball-anymore/</link>
		<comments>http://www.vertica.com/2012/04/17/moneyball-not-just-for-baseball-anymore/#comments</comments>
		<pubDate>Tue, 17 Apr 2012 08:36:47 +0000</pubDate>
		<dc:creator>cmahony</dc:creator>
				<category><![CDATA[Moneyball]]></category>

		<guid isPermaLink="false">http://vertica.com/?p=10227</guid>
		<description><![CDATA[<p>Spring is in the air, major league baseball is now underway here in North America, and thoughts of Michael Lewis&#8217; fantastic book and film, “Moneyball” come to mind.  The plot captures how Billy Beane (played by Brad Pitt) leverages an extreme data analyst/quant to fundamentally change baseball strategy and scouting after 100 years of tradition.  The unorthodox data driven strategy was counter to the traditional approach.  Not surprisingly, Billy Beane was questioned until ultimately, the strategy proved successful.  Now, every team in the league, including our Boston Red Sox, is deploying a variant of this approach.  I see the exact same thing happening in just about every industry when it comes to the race for better insight and competitive advantage <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<p>Spring is in the air, major league baseball is now underway here in North America, and thoughts of Michael Lewis&#8217; fantastic book and film, “Moneyball” come to mind.  The plot captures how Billy Beane (played by Brad Pitt) leverages an extreme data analyst/quant to fundamentally change baseball strategy and scouting after 100 years of tradition.  The unorthodox data driven strategy was counter to the traditional approach.  Not surprisingly, Billy Beane was questioned until ultimately, the strategy proved successful.  Now, every team in the league, including our Boston Red Sox, is deploying a variant of this approach.  I see the exact same thing happening in just about every industry when it comes to the race for better insight and competitive advantage through extreme information and analytics.  The struggle now of course is where to find the expert quants, analysts, managers, and solution providers who understand how to make it happen.</p>
<p>At Hewlett-Packard, I get to witness and enable real-world moneyball every day in a variety of global industries.  I see how savvy organizations are creating SWAT teams of business leaders, statisticians, and IT to leverage extreme information and platforms like Vertica in ways that fundamentally alter markets and business dynamics.</p>
<p>In business school I was lucky enough to take <a href="http://drfd.hbs.edu/fit/public/facultyInfo.do?facInfo=bio&amp;facEmId=ffrei">Frances Frei’s</a> course “Managing Service Operations”.  The course and her recent best-selling book “Uncommon Service: How to Win by Putting Customers at the Core of Your Business” investigate organizations’ efforts to diagnose and improve service experiences.  Interestingly though, Frances was way ahead of her time and forced us to crunch numbers with statistical programs combining fundamental business information with detailed historical data for true forensics and root cause analysis.  She stressed the importance of math and data analysis.  We were careful never to rely solely on data or theory, but rather bring all of the information together to make the best informed decisions we could.  In the current Big Data era, this can be taken to a whole new level and every company must work this way from the top down.</p>
<p>In addition to the baseball season starting, we know that “April showers bring May flowers”.  The equivalent in our industry is that for the past several years, so many organizations have been “showered” with data.  The “flowers” of course bloom when those same organizations are able to monetize the information to create better products and services and shareholder value.  Modern technologies and comprehensive solution providers like Hewlett-Packard can help organizations drastically reduce the cost and increase the efficacy of analytics by provisioning comprehensive offerings of hardware, software, and services.  Organizations are now able to cost-effectively take disparate sources of extreme information, both structured and unstructured, and seamlessly combine them for constant ad hoc analysis.  This can lead to fundamentally better decisions and value creation.  Spring is an exciting time of year. Let the insights bloom!</p>
<p>Colin Mahony<br />
VP &amp; GM<br />
Vertica, An HP Company</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2012/04/17/moneyball-not-just-for-baseball-anymore/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Are You Ready for the Data Race?</title>
		<link>http://www.vertica.com/2012/03/07/are-you-ready-for-the-data-race/</link>
		<comments>http://www.vertica.com/2012/03/07/are-you-ready-for-the-data-race/#comments</comments>
		<pubDate>Wed, 07 Mar 2012 14:46:20 +0000</pubDate>
		<dc:creator>Ben Vandiver</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://184.106.12.19/?p=10010</guid>
		<description><![CDATA[<p>A few hardy souls pressed against the tide of humanity heading home after work last Tuesday to gather at a nondescript loft in downtown Boston. We carefully looked left and right before dodging in past the bouncer to join a select crowd in their new favorite adrenaline-pumping sport&#8230; Tweet Racing.</p> <p>In Tweet Racing, each participant carefully selects a twitter search term for the race, betting on the term they hope the Twitterverse will smile upon. Thrown into the cage and subjected to Vertica&#8217;s live twitter sentiment analysis code, the terms dueled for an hour. There are no rules in Tweet Racing – anything goes. We watched as the participants encouraged their Twitter followers to tweet for their terms or brutally <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<p>A few hardy souls pressed against the tide of humanity heading home after work last Tuesday to gather at a nondescript loft in downtown Boston. We carefully looked left and right before dodging in past the bouncer to join a select crowd in their new favorite adrenaline-pumping sport&#8230; Tweet Racing.</p>
<p>In Tweet Racing, each participant carefully selects a twitter search term for the race, betting on the term they hope the Twitterverse will smile upon. Thrown into the cage and subjected to Vertica&#8217;s live twitter sentiment analysis code, the terms dueled for an hour. There are no rules in Tweet Racing – anything goes. We watched as the participants encouraged their Twitter followers to tweet for their terms or brutally tweet down others.</p>
<p><a href="http://vertica.com/wp-content/uploads/2012/03/Tweets.png"><img class="aligncenter size-full wp-image-10075" title="Tweets" src="http://vertica.com/wp-content/uploads/2012/03/Tweets.png" alt="" width="278" height="81" /></a></p>
<p>In the end, we even learned a few things. People don&#8217;t feel very strongly about kittens on a Tuesday evening.  &#8220;skrillex&#8221; is fairly popular, but it&#8217;s hard to beat &#8220;jolie&#8221; right after her Oscar Night poses.</p>
<p><a href="http://vertica.com/wp-content/uploads/2012/03/JolieSkrillexKittens.png"><img class="aligncenter size-full wp-image-10076" title="JolieSkrillexKittens" src="http://vertica.com/wp-content/uploads/2012/03/JolieSkrillexKittens.png" alt="" width="498" height="226" /></a></p>
<p>But we weren&#8217;t there just to watch the races. <a title="New Blood Boston" href="http://new-blood-boston.eventbrite.com/" target="_blank">New Blood Boston</a> hosted the Vertica Engineering team for a discussion about “Big Data” and how the <a href="http://www.vertica.com/the-analytics-platform/">Vertica Analytics Platform</a> is a natural fit for many of the data problems facing start-ups today.</p>
<p>At the event, we showed how Vertica can blaze through anything from clickstream data with <a href="http://www.vertica.com/2011/10/05/being-green-with-data-exhaust/">funnel analysis</a> to graph problems like <a href="http://www.vertica.com/2011/09/19/vertica-at-birte-2011-social-graph-analytics/">k-core</a> and <a href="http://www.vertica.com/2011/09/21/counting-triangles/">counting triangles</a> – problems that may not initially appear to be database problems. We demonstrated what makes Tweet Racing possible in Vertica – the extensibility of the platform and its applicability to things outside the usual scope of the traditional SQL database.</p>
<p>But mostly, we were there to share our passion for Vertica and the engineering challenges that go into making it the industry’s most powerful, extensible analytics database.</p>
<p>Missed the New Blood Boston event? Check out our <a href="http://www.vertica.com/community/">Vertica Community Edition</a> to test drive Vertica and experience the thrills first hand!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2012/03/07/are-you-ready-for-the-data-race/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing the Vertica Community Edition</title>
		<link>http://www.vertica.com/2011/10/18/announcing-the-vertica-community-edition/</link>
		<comments>http://www.vertica.com/2011/10/18/announcing-the-vertica-community-edition/#comments</comments>
		<pubDate>Tue, 18 Oct 2011 12:00:36 +0000</pubDate>
		<dc:creator>slawande</dc:creator>
				<category><![CDATA[Vertica Community Edition]]></category>

		<guid isPermaLink="false">http://vertica.linchpinagency.com/?p=4268</guid>
		<description><![CDATA[<p>by Colin Mahony, VP of Products &#38; Business Development and Shilpa Lawande, VP of Engineering</p> <p>Vertica has had an amazing journey since it was founded in 2005. We&#8217;ve built a great product, a great team and an incredibly strong and loyal customer base and partner ecosystem. When we first started, no one had even heard of a column store, and today, &#8216;Big Data Analytics&#8217; is taking the industry by storm. Every day we see companies &#8211; big and small &#8211; in industries from retail to gaming becoming more data-driven and doing amazing things with the help of analytics. We feel proud and humbled to see the transformational impact the Vertica Analytics Platform has had on our customers&#8217; businesses, and we <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<p>by Colin Mahony, VP of Products &amp; Business Development<br />
and Shilpa Lawande, VP of Engineering</p>
<p>Vertica has had an amazing journey since it was founded in 2005. We&#8217;ve built a great product, a great team and an incredibly strong and loyal customer base and partner ecosystem. When we first started, no one had even heard of a column store, and today, &#8216;Big Data Analytics&#8217; is taking the industry by storm. Every day we see companies &#8211; big and small &#8211; in industries from retail to gaming becoming more data-driven and doing amazing things with the help of analytics. We feel proud and humbled to see the transformational impact the Vertica Analytics Platform has had on our customers&#8217; businesses, and we believe the time has come to broaden access to our technology to a wider Big Data community.</p>
<p>Today, we are truly excited to announce the <strong><a title="Vertica Community Edition beta program" href="http://184.106.12.19/community">Vertica Community Edition</a></strong> beta program! The Vertica Community Edition will offer many of the same features as the enterprise edition of the Vertica Analytics Platform to anyone who wants to discover the power of Vertica.  And, as part of the Community Edition beta announcement, we are developing a new MyVertica Community portal which will provide a platform for Vertica users and partners to interact and share knowledge and code with the entire Vertica user community.</p>
<p>Vertica has always been a customer-driven company and we couldn&#8217;t have built Vertica without ideas, feedback and guidance from our customers and partners. We hope that the Vertica community will play a similar role going forward &#8211; sharing ideas and best practices and providing candid feedback about the product and how it can be made richer and simpler to use.  The MyVertica community portal will feature product downloads, forums, documentation,  training materials, FAQs and best practice guides. We will also be maintaining a GitHub code repository where community users will be able to share code samples, user-defined extensions built using our SDK, adapters to 3rd party products, and more. We hope that with the Community Edition, we take a small step towards our vision of democratizing data and making data and analytics accessible to all!</p>
<p>To register for the Vertica Community Edition beta program, simply visit <strong><a title="Community Edition Beta Signup" href="http://184.106.12.19/community">www.vertica.com/community</a></strong> and complete the registration form.  The beta program will be limited initially, but full availability of the Vertica Community Edition software is expected by the end of the year.</p>
<p>On behalf of Vertica and HP, we are excited to contribute something back to the Vertica Community.  We sincerely invite you to join and contribute, and we can&#8217;t wait to see the many cool things you will do with Big Data and Vertica!</p>
<p>Shilpa &amp; Colin</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2011/10/18/announcing-the-vertica-community-edition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Being Green with Data Exhaust</title>
		<link>http://www.vertica.com/2011/10/05/being-green-with-data-exhaust/</link>
		<comments>http://www.vertica.com/2011/10/05/being-green-with-data-exhaust/#comments</comments>
		<pubDate>Wed, 05 Oct 2011 13:41:31 +0000</pubDate>
		<dc:creator>mfuller</dc:creator>
				<category><![CDATA[pattern matching]]></category>

		<guid isPermaLink="false">http://vertica.linchpinagency.com/?p=4175</guid>
		<description><![CDATA[Funnel Analysis <p><strong><em></em></strong><em>by Matt Fuller, Vertica</em><strong><em> </em></strong></p> <p>By 2015, it is estimated the annual global internet traffic will reach almost 1 zettabyte. To put it into more familiar units, this is equivalent to about 1 billion terabytes. Web, email, instant messaging, etc. will account for about 30% of this <a href="http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-481360_ns827_Networking_Solutions_White_Paper.html">figure</a>.  I found this fascinating, but not surprising, given the rate of new applications and users entering the market. Whether you believe these estimates or interpret the data differently, I think we can agree there is a vast amount of data out there.</p> <p>As users perform their online activities, such as playing Farmville, reading tweets, or browsing for slick deals on Groupon, web server logs may store their clicks. This <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<h4>Funnel Analysis</h4>
<p><em>by Matt Fuller, Vertica</em></p>
<p>By 2015, it is estimated the annual global internet traffic will reach almost 1 zettabyte. To put it into more familiar units, this is equivalent to about 1 billion terabytes. Web, email, instant messaging, etc. will account for about 30% of this <a href="http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-481360_ns827_Networking_Solutions_White_Paper.html">figure</a>.  I found this fascinating, but not surprising, given the rate of new applications and users entering the market. Whether you believe these estimates or interpret the data differently, I think we can agree there is a vast amount of data out there.</p>
<p>As users perform their online activities, such as playing Farmville, reading tweets, or browsing for slick deals on Groupon, web server logs may store their clicks. This raw data, or “<a href="http://en.wikipedia.org/wiki/Digital_exhaust">data exhaust</a>,” may appear to be junk to many, but in reality it can be monetized given the right tools.</p>
<p>In funnel analysis, a funnel is the flow, or path, a user may take before reaching an end goal, such as a purchase, sign up, or download. The path may consist of a series of web clicks to different pages on the site until finally ending up at the goal. Along this path, users may drop out after any point, thus reducing the percentage of users that make it to the goal.</p>
<p><a href="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching1.png"><img class="aligncenter size-full wp-image-4176" title="PatternMatching1" src="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching1.png" alt="" width="462" height="123" /></a></p>
<p>Analyzing the funnel can provide insight to improve the site flow. For example, if a site knew the registration page is the page where most users dropped out, the site could improve the usability of that page to engage more customers and then analyze those improvements.</p>
<p><a href="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching2.png"><img class="aligncenter size-full wp-image-4177" title="PatternMatching2" src="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching2.png" alt="" width="464" height="126" /></a></p>
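<p>To make the drop-out idea concrete, here is a minimal Python sketch that finds the step where a funnel loses the largest share of its users. The step names and counts are hypothetical and exist only for illustration; they are not taken from any real site.</p>

```python
def biggest_drop(step_counts):
    """Return (from_step, to_step, fraction_lost) for the largest step-to-step loss."""
    drops = []
    for (a, n_a), (b, n_b) in zip(step_counts, step_counts[1:]):
        # Fraction of the users at step a who never reach step b.
        drops.append((a, b, (n_a - n_b) / n_a))
    return max(drops, key=lambda d: d[2])


# Hypothetical counts of users reaching each step of a funnel.
steps = [("landing", 1000), ("registration", 600), ("checkout", 150), ("purchase", 120)]
print(biggest_drop(steps))
```

<p>With these made-up numbers, the registration page is the one to fix first, because most of the users who reach it never get to checkout.</p>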
<p>In Vertica 5.0, we introduced the latest addition to our in-database analytics package: <strong>Event Series Pattern Matching</strong>.  In this article we will discuss how you can use Vertica’s Event Series Pattern Matching to discover user click events that match funnels.</p>
<p>Suppose you used a user-defined transform (UDT) to help load your server’s <a href="../2011/06/20/reports-of-sqls-death-are-greatly-exaggerated/">web click log</a> into a Vertica table, where each row corresponds to a single click in the log (user_id may actually be an IP address).</p>
<p><a href="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching3.png"><img class="size-full wp-image-4178 aligncenter" title="PatternMatching3" src="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching3.png" alt="" width="721" height="534" /></a></p>
<p>Next, we would like to search for sequences of web clicks that match a particular funnel. For example, we would like to identify the series of clicks where a user viewed an item for sale, filled out the form to purchase, and then ultimately made the purchase. Additionally, we would like to include any other items the user visited during the flow. The funnel may look like:</p>
<p><a href="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching4.png"><img class="aligncenter size-full wp-image-4179" title="PatternMatching4" src="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching4.png" alt="" width="513" height="133" /></a></p>
<p>Let’s now construct an Event Series Pattern Matching SQL query in Vertica  to find the sequence of clicks matching our funnel.  This can be done in 3 simple steps by defining our funnel as a <a href="http://en.wikipedia.org/wiki/Regular_expression">regular expression</a>, defining an alphabet for our regular expression, and defining the logical window over which we want to match our regular expression. The next sections describe these steps in more detail.</p>
<p><strong>Step 1: Pattern Specification</strong></p>
<p>Event Series Pattern Matching goes far beyond simple Funnel Analysis. So let’s start to use event series pattern matching terminology. First, think of each click record as an “event” and a clickstream (or sequence of many clicks) as an “event series.” And let’s also think of our funnel as a “pattern.”</p>
<p>Our pattern is specified using regular expression notation consisting of events from our event alphabet (more details in the next section) with optional quantifiers such as the <a href="http://en.wikipedia.org/wiki/Kleene_star">Kleene Star</a> (i.e. “*”):</p>
<p style="padding-left: 30px;"><span style="font-family: courier new,courier; color: #4682b4;"><strong>(EntryItemView ItemView* Checkout Purchase)</strong></span></p>
<p>For a contiguous sequence of events, this notation describes all pattern instances that start with an item page view referred from another site, continue with zero or more page views of other items, proceed to the checkout page to buy an item, and end with a purchase, marked by landing on the purchase confirmation page.</p>
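<p>One way to build intuition for the pattern is to treat each event as a single letter and the event series as a string, so the funnel becomes an ordinary regular expression. The Python sketch below does exactly that. The one-letter codes and the helper name <code>find_funnels</code> are assumptions made for this illustration, and this is not how Vertica evaluates the pattern internally.</p>

```python
import re

# Hypothetical one-letter codes for the four events named in the pattern.
EVENT_CODES = {"EntryItemView": "E", "ItemView": "I", "Checkout": "C", "Purchase": "P"}

# (EntryItemView ItemView* Checkout Purchase) written over those codes.
FUNNEL = re.compile(r"EI*CP")


def find_funnels(events):
    """Return (start, end) index pairs of contiguous runs that match the funnel."""
    # Rows that match no event are encoded as "?" so they break contiguity.
    encoded = "".join(EVENT_CODES.get(name, "?") for name in events)
    return [m.span() for m in FUNNEL.finditer(encoded)]


clicks = ["EntryItemView", "ItemView", "ItemView", "Checkout", "Purchase", "ItemView"]
print(find_funnels(clicks))
```

<p>The first five clicks form one pattern instance; the trailing item view is not part of any match.</p>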
<p><strong>Step 2: Event Definitions</strong></p>
<p>Next, we must define our event alphabet. We use SQL Boolean expressions aliased to an event name. The event names are what we used in the pattern specification.</p>
<p style="padding-left: 30px;"><strong><span style="font-family: courier new,courier; color: #4682b4;">EntryItemView as referring_url NOT ILIKE '%groupon.com%'</span></strong><br />
<strong><span style="font-family: courier new,courier; color: #4682b4;"><span style="background-color: #ffffff; color: #ffffff;"> &#8230;&#8230;&#8230;&#8230;..</span>and page_url ILIKE '%groupon.com%'</span></strong></p>
<p style="padding-left: 30px;"><strong><span style="font-family: courier new,courier; color: #4682b4;">ItemView as page_url ILIKE '%groupon.com%' and action = 'VIEW'</span></strong></p>
<p style="padding-left: 30px;"><strong><span style="font-family: courier new,courier; color: #4682b4;">Checkout as page_url ILIKE '%groupon.com%' and action = 'CHECKOUT'</span></strong></p>
<p style="padding-left: 30px;"><strong><span style="font-family: courier new,courier; color: #4682b4;">Purchase as page_url ILIKE '%groupon.com%' and action = 'PURCHASE'</span></strong></p>
<p>A row is considered to be of an event type if the Boolean expression yields TRUE for that row. For example, the below row from the clickstream table is considered to be of event “Purchase” because the predicate is TRUE given the values of the row.</p>
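<p>The same four predicates can be mirrored in a few lines of Python to see how a row gets its event type. This is only a sketch: the sample rows are hypothetical, the case-insensitive substring test stands in for <code>ILIKE '%groupon.com%'</code>, and picking the first definition that is true when several apply is an assumption of the sketch rather than a statement about Vertica.</p>

```python
def classify(row):
    """Return the name of the first event whose predicate is true, else None."""
    on_site = "groupon.com" in row["page_url"].lower()
    from_site = "groupon.com" in row["referring_url"].lower()
    definitions = [
        ("EntryItemView", on_site and not from_site),
        ("ItemView", on_site and row["action"] == "VIEW"),
        ("Checkout", on_site and row["action"] == "CHECKOUT"),
        ("Purchase", on_site and row["action"] == "PURCHASE"),
    ]
    for name, is_match in definitions:
        if is_match:
            return name
    return None


# Hypothetical click rows, shaped like the clickstream table described above.
entry = {"referring_url": "http://twitter.com/ABC",
         "page_url": "http://www.groupon.com/deal1", "action": "VIEW"}
purchase = {"referring_url": "http://www.groupon.com/checkout",
            "page_url": "http://www.groupon.com/confirm", "action": "PURCHASE"}
print(classify(entry), classify(purchase))
```

<p>Note that the first row would also satisfy the ItemView predicate, which is why the order of the definitions matters in this sketch.</p>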
<p><a href="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching5.png"><img class="aligncenter size-full wp-image-4183" title="PatternMatching5" src="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching5.png" alt="" width="716" height="72" /></a><strong></strong></p>
<p><strong>Step 3: Window Definition</strong></p>
<p>A window is the logical grouping of data. In our example, we are trying to discover matching pattern instances <em>per</em> user. Therefore, we would like to logically group all the data per user together and perform Event Series Pattern Matching for <em>each</em> user. And since we wish to find the pattern instances of a contiguous sequence of events, the data must be ordered. In this example, processing the data sorted on the timestamp is the logical choice. We define our window using the PARTITION BY and ORDER BY clauses:</p>
<p style="padding-left: 30px;"><span style="color: #4682b4;"><strong><span style="font-family: courier new,courier;">PARTITION BY user_id ORDER BY timestamp</span></strong></span></p>
<p>Optionally, you may want to group the data <em>per</em> user <em>per</em> session. This is simple and efficient to do with Vertica’s in-database <a href="../2010/10/04/sessionize-with-style-part-1/">sessionization</a>. In our example, we added the session id to the table already for simplicity:</p>
<p style="padding-left: 30px;"><span style="font-family: courier new,courier; color: #4682b4;"><strong>PARTITION BY user_id, session_id ORDER BY timestamp</strong></span></p>
<p>In both examples, the window is the set of rows that share the same PARTITION BY key (a single user, or a single user session), sorted by timestamp. Pattern matching runs independently within each window.</p>
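<p>As a rough illustration of what the window definition does to the data, the Python sketch below groups rows by the partition keys and sorts each group by timestamp. The function name <code>windows</code> and the sample rows are hypothetical.</p>

```python
from collections import defaultdict
from operator import itemgetter


def windows(rows, keys=("user_id", "session_id")):
    """Group rows by the partition keys and sort each group by timestamp."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return {key: sorted(group, key=itemgetter("timestamp"))
            for key, group in groups.items()}


# Hypothetical rows, deliberately out of order.
rows = [
    {"user_id": 100, "session_id": 1, "timestamp": "2011-10-05 10:02", "action": "CHECKOUT"},
    {"user_id": 300, "session_id": 1, "timestamp": "2011-10-05 09:00", "action": "VIEW"},
    {"user_id": 100, "session_id": 1, "timestamp": "2011-10-05 10:00", "action": "VIEW"},
]
for key, ordered in windows(rows).items():
    print(key, [r["action"] for r in ordered])
```

<p>Passing <code>keys=("user_id",)</code> gives the per-user window from the first example.</p>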
<p><strong>Putting it all together…</strong></p>
<p>Let’s combine the steps from above into a SQL SELECT:</p>
<p style="padding-left: 30px;"><span style="font-family: courier new,courier;"><strong><span style="color: #4682b4;">SELECT user_id, referring_url, page_url,<br />
event_name(), match_id(), pattern_id()<br />
FROM clickstream<br />
MATCH (<br />
<span style="color: #ffffff;">&#8230;&#8230;.</span>PARTITION BY user_id, session_id ORDER BY timestamp<br />
<span style="color: #ffffff;">&#8230;&#8230;.</span>DEFINE<br />
<span style="color: #ffffff;">&#8230;&#8230;&#8230;&#8230;</span>EntryItemView as referring_url NOT ILIKE '%groupon.com%'<br />
<span style="color: #ffffff;">&#8230;&#8230;&#8230;&#8230;&#8230;.</span>and page_url ILIKE '%groupon.com%',<br />
<span style="color: #ffffff;">&#8230;&#8230;&#8230;&#8230;</span>ItemView as page_url ILIKE '%groupon.com%'<br />
<span style="color: #ffffff;">&#8230;&#8230;&#8230;&#8230;&#8230;.</span>and action = 'VIEW',<br />
<span style="color: #ffffff;">&#8230;&#8230;&#8230;&#8230;</span>Checkout as page_url ILIKE '%groupon.com%'<br />
<span style="color: #ffffff;">&#8230;&#8230;&#8230;&#8230;&#8230;.</span>and action = 'CHECKOUT',<br />
<span style="color: #ffffff;">&#8230;&#8230;&#8230;&#8230;</span>Purchase as page_url ILIKE '%groupon.com%'<br />
<span style="color: #ffffff;">&#8230;&#8230;&#8230;&#8230;&#8230;.</span>and action = 'PURCHASE'<br />
<span style="color: #ffffff;">&#8230;&#8230;&#8230;&#8230;</span>PATTERN P as (EntryItemView ItemView* Checkout Purchase)<br />
);</span></strong></span></p>
<p>Run on Vertica, and Voila!</p>
<p style="text-align: center;"><strong><img class="aligncenter size-full wp-image-4182" title="PatternMatching6" src="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching6.png" alt="" width="642" height="360" /> </strong></p>
<p>The results above consist of all rows that contributed to a discovered pattern instance. You may have noticed three new functions in the SELECT list: <span style="color: #4682b4; font-family: courier new,courier;"><strong>event_name(), match_id(), pattern_id()</strong></span>. These return data for additional analysis as well as demarcate the different pattern instances:</p>
<ul>
<li><span style="color: #003366;">-  <span style="font-family: courier new,courier; color: #4682b4;"><strong>event_name()</strong></span> returns the name of the event that the row matched within the pattern instance</span></li>
<li><span style="color: #003366;">-<strong><span style="font-family: courier new,courier; color: #4682b4;"> match_id()</span></strong> is a monotonically increasing integer that uniquely identifies the row within its pattern instance</span></li>
<li><span style="color: #003366;">-  <span style="color: #4682b4;"><strong><span style="font-family: courier new,courier;">pattern_id()</span></strong></span> is a monotonically increasing integer that uniquely identifies the pattern instance within the partition/group</span></li>
</ul>
<p>For simplicity, our example doesn’t demonstrate more than one pattern match per partition/group. But imagine that user 100 made another set of clicks matching the funnel: its pattern identifier would be 2. And if user 300 also made another set of clicks matching the funnel, its pattern identifier would also be 2, since we reset the starting identifier for each new partition.</p>
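<p>The numbering described above can be imitated with a short Python sketch. It assumes each partition has already been encoded as a string of hypothetical one-letter event codes (E, I, C, P for the four events, anything else for a row that matches no event). It mirrors the description of <code>match_id()</code> and <code>pattern_id()</code> in this post and is not Vertica's implementation.</p>

```python
import re

# (EntryItemView ItemView* Checkout Purchase) over hypothetical one-letter codes.
FUNNEL = re.compile(r"EI*CP")


def number_matches(encoded):
    """Return (row_position, match_id, pattern_id) for every row inside a match."""
    out = []
    # pattern_id counts pattern instances and starts again at 1 on every call,
    # that is, for every partition.
    for pattern_id, m in enumerate(FUNNEL.finditer(encoded), start=1):
        # match_id counts rows inside one pattern instance.
        for match_id, pos in enumerate(range(m.start(), m.end()), start=1):
            out.append((pos, match_id, pattern_id))
    return out


# One partition per user; each user completes the funnel twice.
partitions = {100: "EICPxECP", 300: "ECPEIICP"}
for user_id, encoded in partitions.items():
    print(user_id, number_matches(encoded))
```

<p>Because the function is called once per partition, the second funnel of user 100 and the second funnel of user 300 both get pattern identifier 2.</p>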
<p>We immediately notice from the results that Twitter user ABC has referred users to the website, but more importantly, referred users with a high success rate of making purchases. And of course further analysis, such as aggregation and pivoting, can be performed on the results of this Event Series Pattern Matching.</p>
<p>It might be said this could be done with SQL OLAP windowing functions such as LAG. For simple funnels, this is certainly true. But this would be difficult, if not impossible, using SQL OLAP for funnels defined by more complex pattern specifications including quantifiers such as Kleene Star or Kleene Plus.</p>
<p>And since Event Series Pattern Matching is in-database, you get Vertica’s performance and scalability. As an MPP system, Vertica automatically parallelizes the work across the cluster. The figure below illustrates finding the pattern instances in parallel across a Vertica cluster. First the data is segmented based on the partition window definition. Then each node independently processes and finds the pattern instances in the data segments. Finally, the pattern instance results from each node are combined at the node that issued the query and the final result is sent back to the user.</p>
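<p>The three phases just described (segment, match independently, combine) can be sketched in Python with a thread pool standing in for the cluster nodes. Everything here is illustrative: the modulo placement of partitions on nodes, the function names, and the one-letter event encoding are assumptions of the sketch and say nothing about how Vertica actually segments data or schedules work.</p>

```python
import re
from concurrent.futures import ThreadPoolExecutor

# (EntryItemView ItemView* Checkout Purchase) over hypothetical one-letter codes.
FUNNEL = re.compile(r"EI*CP")


def segment(partitions, node_count):
    """Assign every partition (keyed by integer user_id) to exactly one node."""
    nodes = [dict() for _ in range(node_count)]
    for user_id, encoded in partitions.items():
        nodes[user_id % node_count][user_id] = encoded
    return nodes


def find_on_node(node_data):
    """Each node matches the pattern only within the partitions it holds."""
    return {user_id: [m.span() for m in FUNNEL.finditer(encoded)]
            for user_id, encoded in node_data.items()}


def run_query(partitions, node_count=3):
    """Segment, match on every node in parallel, then combine the results."""
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(find_on_node, segment(partitions, node_count)))
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


print(run_query({100: "EICP", 200: "EIC", 300: "ECPECP"}))
```

<p>Since a pattern instance never spans two partitions, no node needs data held by another node, which is what makes the independent middle phase possible.</p>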
<p><a href="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching7.png"><img class="aligncenter size-full wp-image-4184" title="PatternMatching7" src="http://184.106.12.19/wp-content/uploads/2011/10/PatternMatching7.png" alt="" width="703" height="375" /></a></p>
<p>Vertica is an ideal platform for monetizing ALL of your data, and we’ve shown you how event series pattern matching can be used to analyze seemingly unimportant web log data to find the top patterns that lead to conversion events on a web site, all by using our new SQL MATCH clause in three simple steps.</p>
<p>In a future post, we will discuss how one can use event series pattern matching to perform more advanced sessionization and compare it to  Google Analytics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2011/10/05/being-green-with-data-exhaust/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Counting Triangles</title>
		<link>http://www.vertica.com/2011/09/21/counting-triangles/</link>
		<comments>http://www.vertica.com/2011/09/21/counting-triangles/#comments</comments>
		<pubDate>Wed, 21 Sep 2011 13:00:45 +0000</pubDate>
		<dc:creator>swalkauskas@vertica.com</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[social graph analysis]]></category>

		<guid isPermaLink="false">http://vertica.linchpinagency.com/?p=4064</guid>
		<description><![CDATA[<p>by Stephen Walkauskas</p> <p>Recently I&#8217;ve heard from or read about people who use Hadoop because their analytic jobs can&#8217;t achieve the same level of performance in a database. In one case, a professor I visited said his group uses Hadoop to count triangles &#8220;because a database doesn&#8217;t perform the necessary joins efficiently.&#8221;</p> <p>Perhaps I&#8217;m being dense but I don&#8217;t understand why a database doesn&#8217;t efficiently support these use-cases. In fact, I have a hard time believing they wouldn&#8217;t perform better in a columnar, MPP database like Vertica &#8211; where memory and storage are laid out and accessed efficiently, query jobs are automatically tuned by the optimizer, and expression execution is vectorized at run-time. There are additional benefits when several, similar <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<p>by Stephen Walkauskas</p>
<p>Recently I&#8217;ve heard from or read about people who use Hadoop because their analytic jobs can&#8217;t achieve the same level of performance in a database. In one case, a professor I visited said his group uses Hadoop to count triangles &#8220;because a database doesn&#8217;t perform the necessary joins efficiently.&#8221;</p>
<p>Perhaps I&#8217;m being dense but I don&#8217;t understand why a database doesn&#8217;t efficiently support these use cases. In fact, I have a hard time believing they wouldn&#8217;t perform better in a columnar, MPP database like Vertica &#8211; where memory and storage are laid out and accessed efficiently, query jobs are automatically tuned by the optimizer, and expression execution is vectorized at run-time. There are additional benefits when several similar jobs are run, or when data is updated and the same job is re-run multiple times. Of course, performance isn&#8217;t everything; ease-of-use and maintainability are important factors that Vertica excels at as well.</p>
<p>Since the &#8220;gauntlet was thrown down&#8221;, to steal a line from Good Will Hunting, I decided to take up the challenge of computing the number of triangles in a graph (and include the solutions in <a href="https://github.com/vertica/Graph-Analytics----Triangle-Counting" target="_blank">GitHub</a> so others can experiment &#8211; more on this at the end of the post).</p>
<h3>Problem Description</h3>
<p>A triangle exists when a vertex has two adjacent vertexes that are also adjacent to each other. Using friendship as an example: If two of your friends are also friends with each other, then the three of you form a friendship triangle. How nice. Obviously this concept is useful for understanding social networks and graph analysis in general (e.g. it can be used to compute the clustering coefficient of a graph).</p>
<p>Let&#8217;s assume we have an undirected graph with reciprocal edges, so every pair of adjacent vertexes v1 and v2 is stored as a pair of edges ({v1,v2} and {v2,v1}). We&#8217;ll use the following input for illustration (the reciprocal edges are elided to condense the information):</p>
<div>
<table border="0" cellpadding="0">
<tbody>
<tr>
<td>
<p style="text-align: left" align="center"><strong>source </strong></p>
</td>
<td>
<p style="text-align: left" align="center"><strong>destination </strong></p>
</td>
</tr>
<tr>
<td>Ben</td>
<td>Chuck</td>
</tr>
<tr>
<td>Ben</td>
<td>Stephen</td>
</tr>
<tr>
<td>Chuck</td>
<td>Stephen</td>
</tr>
<tr>
<td>Chuck</td>
<td>Rajat</td>
</tr>
<tr>
<td>Rajat</td>
<td>Stephen</td>
</tr>
<tr>
<td>Andrew</td>
<td>Ben</td>
</tr>
<tr>
<td>Andrew</td>
<td>Matt</td>
</tr>
<tr>
<td>Andrew</td>
<td>Pachu</td>
</tr>
<tr>
<td>Matt</td>
<td>Pachu</td>
</tr>
<tr>
<td>Chuck</td>
<td>Lyric</td>
</tr>
</tbody>
</table>
</div>
<p><span style="color: #ffffff">.</span><br />
A little ascii art to diagram the graph might help.</p>
<p><a href="http://184.106.12.19/wp-content/uploads/2011/09/Triangles1.png"><img class="size-full wp-image-4065 alignnone" src="http://184.106.12.19/wp-content/uploads/2011/09/Triangles1.png" alt="" width="248" height="256" /></a></p>
<p>I know you can quickly count the number of triangles. I&#8217;m very proud of you, but imagine there are hundreds of millions of vertexes and tens of billions of edges. How long would it take you to diagram that graph? And how much longer to count all of the triangles? And what if your 2-year-old daughter barges in counting &#8220;one, two, three, four, &#8230;&#8221; and throws off your count?</p>
<p>Below we present a few practical solutions for large scale graphs and evaluate their performance.</p>
<h3>The Hadoop Solution</h3>
<p>Let&#8217;s consider first the Hadoop approach to solving this problem. The MapReduce (MR) framework implemented in Hadoop allows us to distribute work over many computers to get the count faster. The solution we describe here is a simplified version of <a href="http://theory.stanford.edu/%7Esergei/papers/www11-triangles.pdf" target="_blank">the work at Yahoo Research</a>. You can download our solution <a href="https://github.com/vertica/Graph-Analytics----Triangle-Counting" target="_blank">here</a>.</p>
<h4>Overview</h4>
<p>The solution involves a sequence of 3 MR jobs. The first job constructs all of the triads in the graph. A <em>triad</em> is formed by a pair of edges sharing a vertex, called its <em>apex</em>. It doesn&#8217;t matter which vertex we choose as the apex of the triad, so for our purposes we&#8217;ll pick the &#8220;lowest&#8221; vertex (e.g. friends could be ordered alphabetically by their names). The Yahoo paper makes a more intelligent choice of &#8220;lowest&#8221; – the vertex with the smallest degree. However that requires an initial pass of the data (and more work on my part) so I skipped that optimization and did so consistently for all solutions to ensure fairness.</p>
<p>These triads and the original edges are emitted as rows by the first MR job, with a field added to distinguish the two. Note that the output of the first job can be quite large, especially in a dense graph. Such output is consumed by the second MR job, which partitions the rows by either the unclosed edge, if the row is a triad, or the original edge. A partition has <em>n</em> triangles if it contains an original edge and <em>n</em> triads. A third, trivial MR job counts the triangles produced by the second job, to produce the final result.</p>
<h4>Details</h4>
<p>Let&#8217;s look at each MR job in detail. The map part of the first job generates a key-value pair for each edge such that the key is the edge&#8217;s &#8220;lowest&#8221; vertex (the apex of any triad the edge belongs to) and the value is the edge itself. In our small example the map phase would emit the following rows.<br />
<span style="color: #ffffff">.</span></p>
<div>
<table border="0" cellpadding="0">
<tbody>
<tr>
<td>
<p style="text-align: left" align="center"><strong>key </strong></p>
</td>
<td>
<p style="text-align: left" align="center"><strong>value </strong></p>
</td>
</tr>
<tr>
<td>Andrew</td>
<td>Andrew, Matt</td>
</tr>
<tr>
<td>Andrew</td>
<td>Andrew, Pachu</td>
</tr>
<tr>
<td>Andrew</td>
<td>Andrew, Ben</td>
</tr>
<tr>
<td>Matt</td>
<td>Matt, Pachu</td>
</tr>
<tr>
<td>Ben</td>
<td>Ben, Chuck</td>
</tr>
<tr>
<td>Ben</td>
<td>Ben, Stephen</td>
</tr>
<tr>
<td>Chuck</td>
<td>Chuck, Rajat</td>
</tr>
<tr>
<td>Chuck</td>
<td>Chuck, Lyric</td>
</tr>
<tr>
<td>Chuck</td>
<td>Chuck, Stephen</td>
</tr>
<tr>
<td>Rajat</td>
<td>Rajat, Stephen</td>
</tr>
</tbody>
</table>
</div>
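<p>The map step above can be sketched in a few lines of Python. This is only an illustration, not the Java code from the GitHub project: the function names are made up, &#8220;lowest&#8221; is taken to mean alphabetical order, and skipping the reciprocal copy of each edge with a simple comparison is an assumption about how the duplicates are handled.</p>

```python
# Illustrative sketch of the first job's map step (not the post's Java code).
# EDGES holds the ten original edges listed in the map output table above.
EDGES = [
    ("Andrew", "Matt"), ("Andrew", "Pachu"), ("Andrew", "Ben"),
    ("Matt", "Pachu"), ("Ben", "Chuck"), ("Ben", "Stephen"),
    ("Chuck", "Rajat"), ("Chuck", "Lyric"), ("Chuck", "Stephen"),
    ("Rajat", "Stephen"),
]


def with_reciprocals(edges):
    """Return the input as it is stored: each edge in both directions."""
    return edges + [(dest, source) for source, dest in edges]


def map_phase(edges):
    """Yield (key, value) rows: key is the lowest vertex, value is the edge."""
    for source, dest in edges:
        # Assumption: the reciprocal copy (source > dest) is simply skipped,
        # so every undirected edge is emitted exactly once.
        if source < dest:
            yield source, (source, dest)


if __name__ == "__main__":
    for key, value in sorted(map_phase(with_reciprocals(EDGES))):
        print(key, value)
```

<p>Because the key is the lowest vertex, all edges that can form a triad at a given apex land in the same partition, which is what the reduce step relies on.</p>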
<p><span style="color: #ffffff">.</span><br />
For each apex partition, the reduce job emits the original edges and all of the corresponding triads. A partition holding d edges yields &#931;(j-1) for j = 1 to d, that is d(d-1)/2, triads, where d is the number of edges in the partition (the apex&#8217;s neighbors that sort after it). For each original edge, the key is the edge itself and the value is &#8220;edge&#8221;. For each triad, the key is the unclosed edge, in other words the edge needed to complete the triangle, and the value is &#8220;triad&#8221;. The actual code used &#8220;0&#8221; for the edge value and &#8220;1&#8221; for the triad value for run-time efficiency.</p>
<p>The rows corresponding to the triads emitted by this reduce job in our simple example are described below in the &#8220;key&#8221; and &#8220;value&#8221; columns (the original edges are also emitted by the reduce job but elided below for brevity). For presentation purposes we added a third column &#8220;triad content&#8221;. That column is not produced by the actual reduce job.<br />
<span style="color: #ffffff">.</span></p>
<div>
<table border="0" cellpadding="0">
<tbody>
<tr>
<td>
<p style="text-align: left" align="center"><strong>key </strong></p>
</td>
<td>
<p style="text-align: left" align="center"><strong>value </strong></p>
</td>
<td>
<p style="text-align: left" align="center"><strong>triad content </strong></p>
</td>
</tr>
<tr>
<td>Ben,  Matt</td>
<td>triad</td>
<td>{Andrew, Ben}, {Andrew, Matt}</td>
</tr>
<tr>
<td>Ben, Pachu</td>
<td>triad</td>
<td>{Andrew, Ben}, {Andrew, Pachu}</td>
</tr>
<tr>
<td>Matt, Pachu</td>
<td>triad</td>
<td>{Andrew, Matt}, {Andrew, Pachu}</td>
</tr>
<tr>
<td>Chuck, Stephen</td>
<td>triad</td>
<td>{Ben, Chuck}, {Ben, Stephen}</td>
</tr>
<tr>
<td>Lyric, Rajat</td>
<td>triad</td>
<td>{Chuck, Lyric}, {Chuck, Rajat}</td>
</tr>
<tr>
<td>Lyric, Stephen</td>
<td>triad</td>
<td>{Chuck, Lyric}, {Chuck, Stephen}</td>
</tr>
<tr>
<td>Rajat, Stephen</td>
<td>triad</td>
<td>{Chuck, Rajat}, {Chuck, Stephen}</td>
</tr>
</tbody>
</table>
</div>
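<p>As a rough sketch of this reduce step (again in Python rather than the Java of the actual solution, with made-up function names and alphabetical ordering assumed), each apex partition emits its original edges plus one triad row per pair of edges:</p>

```python
from collections import defaultdict
from itertools import combinations

# The ten original edges of the small example, lowest vertex first.
EDGES = [
    ("Andrew", "Matt"), ("Andrew", "Pachu"), ("Andrew", "Ben"),
    ("Matt", "Pachu"), ("Ben", "Chuck"), ("Ben", "Stephen"),
    ("Chuck", "Rajat"), ("Chuck", "Lyric"), ("Chuck", "Stephen"),
    ("Rajat", "Stephen"),
]


def reduce_apex(edges):
    """Emit rows for one apex partition: original edges, then triads."""
    for edge in edges:
        yield edge, "edge"
    # Every pair of edges sharing the apex forms a triad; the row's key is
    # the unclosed edge between the two non-apex vertexes.
    others = sorted(high for _, high in edges)
    for a, b in combinations(others, 2):
        yield (a, b), "triad"


def first_job(edges):
    """Group edges by their lowest vertex (the map output) and reduce."""
    partitions = defaultdict(list)
    for low, high in edges:
        partitions[low].append((low, high))
    rows = []
    for apex in sorted(partitions):
        rows.extend(reduce_apex(partitions[apex]))
    return rows


if __name__ == "__main__":
    for key, value in first_job(EDGES):
        if value == "triad":
            print(key, value)
```

<p>A partition with d edges produces d(d-1)/2 triad rows, which is why the output of this job can dwarf its input on a dense graph.</p>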
<p><span style="color: #ffffff">.</span><br />
The input to the next reduce job is partitioned such that the unclosed edge of each triad is in the same partition as its corresponding original edge, if any. The reduce job just needs to check for the existence of an original edge in that partition (i.e., a row with value set to &#8220;edge&#8221;). If it finds one, all of the triads in the partition are closed as triangles. The reduce job sums up all of the closed triads and on finalize emits a count. A trivial final MR job aggregates the counts from the previous job.</p>
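<p>Here is a minimal Python sketch of that second job, fed with the rows from the tables above. The names are illustrative and the real implementation is Java running on Hadoop; the point is only to show the &#8220;does this partition contain an original edge?&#8221; check.</p>

```python
from collections import defaultdict

# Output of the first job for the small example: ten original edges and the
# seven triads from the table above, each keyed by an edge.
ROWS = [
    (("Andrew", "Matt"), "edge"), (("Andrew", "Pachu"), "edge"),
    (("Andrew", "Ben"), "edge"), (("Matt", "Pachu"), "edge"),
    (("Ben", "Chuck"), "edge"), (("Ben", "Stephen"), "edge"),
    (("Chuck", "Rajat"), "edge"), (("Chuck", "Lyric"), "edge"),
    (("Chuck", "Stephen"), "edge"), (("Rajat", "Stephen"), "edge"),
    (("Ben", "Matt"), "triad"), (("Ben", "Pachu"), "triad"),
    (("Matt", "Pachu"), "triad"), (("Chuck", "Stephen"), "triad"),
    (("Lyric", "Rajat"), "triad"), (("Lyric", "Stephen"), "triad"),
    (("Rajat", "Stephen"), "triad"),
]


def count_triangles(rows):
    """Partition rows by edge and count triads closed by an original edge."""
    partitions = defaultdict(list)
    for key, value in rows:
        # Normalize the key so {a, b} and {b, a} share a partition.
        partitions[tuple(sorted(key))].append(value)
    total = 0
    for values in partitions.values():
        if "edge" in values:
            total += values.count("triad")
    return total


if __name__ == "__main__":
    print(count_triangles(ROWS))
```

<p>For these rows the closed triads are the ones keyed by {Matt, Pachu}, {Chuck, Stephen} and {Rajat, Stephen}, so the sketch reports three triangles.</p>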
<p>There, we&#8217;ve used MapReduce to count the number of triangles in a graph. The approach isn&#8217;t trivial but it&#8217;s not horribly complex either. And if it runs too slowly we can add more hardware, so that each machine does less work and we get our answer faster.</p>
<h4>Experiences with Hadoop</h4>
<p>I have to admit it took me much longer than I estimated to implement the Hadoop solution. Part of the reason is that I&#8217;m new to the API, a problem exacerbated by the fact that there are currently two APIs, one of them deprecated, the other incomplete, forcing use of portions of the deprecated API. Specifically, the examples I started with were unfortunately based on the deprecated API and when I ported to the newer one I ran into several silly but somewhat time-consuming issues (like <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reducer.html#reduce%28K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter%29" target="_blank">mapred&#8217;s</a> version of <em>Reducer.reduce</em> takes an <em>Iterator</em> but <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html#reduce%28KEYIN,%20java.lang.Iterable,%20org.apache.hadoop.mapreduce.Reducer.Context%29" target="_blank">mapreduce&#8217;s</a> version takes an <em>Iterable</em> &#8211; they look similar to the human eye but the compiler knows that a method that takes an <em>Iterator</em> should not be overridden by one that takes an <em>Iterable</em>). Learning curve aside, there was a fair chunk of code to write. The simple version is &gt;200 lines. In a more complex version I added a secondary sort to the MR job that computes triads. Doing so introduced several dozen lines of code (most of it brain dead stuff like implementing a <em>Comparable</em> interface). Granted, a lot of the code is cookie cutter or trivial but it still needs to be written (or cut-n-pasted and edited). In contrast, adding a secondary sort column in SQL takes a mere few characters of extra code.</p>
<h3>The PIG Solution</h3>
<p>Rajat Venkatesh, a colleague of mine, said he could convert the algorithm to a relatively small PIG script and he wagered a lunch that the PIG script would outperform my code. He whipped up what was eventually a 10 statement PIG script that accomplished the task. When we get to the performance comparison we&#8217;ll find out who got a free lunch.</p>
<p><img class="alignnone size-full wp-image-4095" src="http://184.106.12.19/wp-content/uploads/2011/09/Triangles21.png" alt="" width="791" height="193" /></p>
<p>Here&#8217;s the PIG solution, much simpler than coding MR jobs by hand. We used PIG 0.8.1. We made several passes over the script to optimize it, following the PIG Cookbook. For example, we rearranged the join order and put the larger table last (I&#8217;m probably not giving too much away by mentioning that Vertica&#8217;s optimizer uses a cost model which properly chooses join order). We also tried several values for default_parallel and mapreduce.job.maps (and we changed the corresponding parameter in mapred-site.xml as well, just to be safe). We did not enable lzo compression for two reasons. First, considering the hardware used for the experiment (large RAM &#8211; plenty of file system cache, high throughput network), the CPU tax incurred by compression was more likely to hurt performance than help in this case. Second, one website listed 7 steps to get the compression working but the 2nd step had several steps itself, so I gave up on it.</p>
<h3>The Vertica Solution</h3>
<p>Can you count the number of triangles in a graph using a database? Of course. First create an &#8220;edges&#8221; table and load the graph. Vertica can automate the decision about how to organize storage for the table; structures called <a href="http://184.106.12.19/2011/09/01/the-power-of-projections-part-1/" target="_blank">projections</a> specify important characteristics of physical storage such as sort order, segmentation, encoding and compression. In this case we simply tell Vertica how to distribute data among nodes in our cluster (Vertica calls this segmenting). Alternatively, the Vertica <a href="http://184.106.12.19/2011/09/02/the-power-of-projections-part-2/" target="_blank">Database Designer</a> can be used to automate projection design. The following statements create our table and load data.</p>
<p><img class="alignnone size-full wp-image-4096" src="http://184.106.12.19/wp-content/uploads/2011/09/Triangles31.png" alt="" width="714" height="39" /></p>
<p>We&#8217;ve got the data loaded and stored in an efficient way. If we need to run more jobs with the same data later we won&#8217;t incur the load cost again. Likewise, if we need to modify the data we only incur work proportional to the change. Now we just need a horribly complex hack to count triangles. Take a deep breath, stretch, get a cup of coffee, basically do what you have to do to prepare your brain for this challenge. Ok, ready? Here it is:</p>
<p><img class="alignnone size-full wp-image-4097" src="http://184.106.12.19/wp-content/uploads/2011/09/Triangles41.png" alt="" width="636" height="68" /></p>
<p>Good, you didn&#8217;t run away screaming in horror. If we ignore the less-than predicates, the query is simply finding all triplets that form a cycle, v1 -&gt; v2 -&gt; v3 -&gt; v1. The less-than predicates ensure we don&#8217;t count the same triangle multiple times (remember our edges are reciprocal and we only need to consider triangles with the &#8220;lowest&#8221; vertex at the apex).</p>
<p>That&#8217;s it! A single, four-line query. Of course you&#8217;re interested in what the Vertica database does under the covers and how its performance, disk utilization and scalability compare with those of Hadoop and PIG.</p>
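<p>For readers who want to try the idea without a Vertica cluster, here is a small sketch that runs a query of the shape described above against SQLite from Python. It is a reconstruction from the description (a three-way self-join with less-than predicates), not a copy of the query in the image; the column names <em>source</em> and <em>dest</em> are assumptions, and the data is the ten-edge example from the Hadoop walk-through.</p>

```python
import sqlite3

# The ten original edges of the small example used earlier in the post.
EDGES = [
    ("Andrew", "Matt"), ("Andrew", "Pachu"), ("Andrew", "Ben"),
    ("Matt", "Pachu"), ("Ben", "Chuck"), ("Ben", "Stephen"),
    ("Chuck", "Rajat"), ("Chuck", "Lyric"), ("Chuck", "Stephen"),
    ("Rajat", "Stephen"),
]

# Find cycles v1 -> v2 -> v3 -> v1 and keep only those with v1 < v2 < v3,
# so each triangle is counted once despite the reciprocal edges.
QUERY = """
SELECT COUNT(*)
FROM edges e1
JOIN edges e2 ON e1.dest = e2.source AND e1.source < e2.source
JOIN edges e3 ON e2.dest = e3.source AND e3.dest = e1.source
             AND e2.source < e3.source
"""


def count_triangles(edges):
    """Load reciprocal edges into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (source TEXT, dest TEXT)")
    reciprocal = edges + [(dest, source) for source, dest in edges]
    conn.executemany("INSERT INTO edges VALUES (?, ?)", reciprocal)
    (count,) = conn.execute(QUERY).fetchone()
    conn.close()
    return count


if __name__ == "__main__":
    print(count_triangles(EDGES))
```

<p>SQLite is a single-node row store, so this says nothing about performance; it only shows that the whole algorithm fits in one declarative statement.</p>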
<h3>Performance Study</h3>
<p>The publicly available LiveJournal social network graph (<a href="http://snap.stanford.edu/data/soc-LiveJournal1.html" target="_blank">http://snap.stanford.edu/data/soc-LiveJournal1.html</a>) was used to test performance. It was selected because of its public availability and because its modest size permitted relatively quick experiments. The modified edges file (in the original file not every edge is reciprocated) contained 86,220,856 edges, about 1.3GB in raw size. We used HDFS dfs.replication=2 (replication=1 performed worse &#8211; fewer map jobs were run, almost regardless of the mapreduce.job.maps value). Experiments were run on between 1 and 4 machines each with 96GB of RAM, 12 cores and 10GBit interconnect.</p>
<h4>Run-Time Cost</h4>
<p>All solutions are manually tuned to obtain the best performance numbers. For the Hadoop and PIG solutions, the number of mappers and reducers as well as the code itself were tweaked to optimize performance. For the Vertica solution, out-of-the-box Vertica is configured to support multiple users; default expectation is 24 concurrent queries for the hardware used. This configuration was tweaked to further increase pipeline parallelism (equivalent configuration settings will be on by default in an upcoming release). The following chart compares the best performance numbers for each solution.</p>
<p><img class="alignnone size-full wp-image-4069" src="http://184.106.12.19/wp-content/uploads/2011/09/Triangles5.png" alt="" width="609" height="470" /></p>
<p>PIG beat my Hadoop program, so my colleague who wrote the PIG script earned his free lunch. One major factor is PIG&#8217;s superior join performance – it uses a hash join. In comparison, the Hadoop solution employs a join method very close to a sort-merge join.</p>
<p>Vertica&#8217;s performance wasn&#8217;t even close to that of Hadoop &#8211; thankfully. It was much, much better. In fact Vertica ate PIG&#8217;s and Hadoop&#8217;s lunch &#8211; its best time is 22x faster than PIG&#8217;s and 40x faster than the Hadoop program (even without configuration tweaks Vertica beats optimized Hadoop and PIG programs by more than 9x in comparable tests).</p>
<p>Here are a few key factors in Vertica&#8217;s performance advantage:</p>
<ul>
<li><span style="color: #333300"><span style="color: #333300">Fully pipelined execution in Vertica, compared to a sequence of MR jobs in the Hadoop and PIG solutions, which incurs significant extra I/O. We quantify the differences in how the disk is used among the solutions below in the &#8220;disk usage&#8221; study.<br />
<span style="color: #ffffff">.</span><br />
</span></span></li>
<li><span style="color: #333300"><span style="color: #333300">Vectorization of expression execution, and the use of just-in-time code generation in the Vertica engine<br />
<span style="color: #ffffff">.</span><br />
</span></span></li>
<li><span style="color: #333300"><span style="color: #333300">More efficient memory layout, compared to the frequent Java heap memory allocation and deallocation in Hadoop / PIG<br />
</span></span></li>
</ul>
<p>Overall, Hadoop and PIG are free in software, but hardware is not included. With a 22x speed-up, Vertica&#8217;s performance advantage effectively equates to a 95% discount on hardware. Think about that. You&#8217;d need 1000 nodes to run the PIG job to equal the performance of just 48 Vertica nodes, which is a rack and a half of the Vertica appliance.</p>
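<p>The arithmetic behind those figures can be checked with a few lines of Python. The sketch assumes performance scales perfectly linearly with node count, which is a simplification, and the rounding to half racks of 16 nodes is only a guess at how 45-and-a-bit nodes became 48 (the text implies 32 nodes per rack, since 48 nodes is a rack and a half).</p>

```python
import math

SPEEDUP = 22              # Vertica's best time versus PIG's best time
PIG_NODES = 1000          # cluster size used in the comparison above
NODES_PER_HALF_RACK = 16  # assumption: 48 nodes = 1.5 racks, so 32 per rack


def hardware_discount(speedup):
    """Fraction of hardware saved if work scales linearly with nodes."""
    return 1 - 1 / speedup


def equivalent_nodes(baseline_nodes, speedup):
    """Nodes needed to match the baseline cluster, given the speed-up."""
    return baseline_nodes / speedup


def round_up_to_half_racks(nodes, half_rack=NODES_PER_HALF_RACK):
    """Round a node count up to a whole number of half racks."""
    return math.ceil(nodes / half_rack) * half_rack


if __name__ == "__main__":
    print(f"discount: {hardware_discount(SPEEDUP):.1%}")
    needed = equivalent_nodes(PIG_NODES, SPEEDUP)
    print(f"equivalent nodes: {needed:.1f}")
    print(f"rounded to half racks: {round_up_to_half_racks(needed)}")
```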
<p>Finally, consider what happens when the use case shifts from counting all of the triangles in a graph to counting (or listing) just the triangles that include a particular vertex. Vertica&#8217;s projections (those things that define the physical storage layout) can be optimized such that looking up all of the edges with a particular vertex is essentially an index search (and once found the associated edges are co-located on disk &#8211; an important detail which anyone who knows the relative cost of a seek versus a scan will appreciate). This very quickly whittles e1 and e3 down to relatively few rows which can participate in a merge join with e2. All in all a relatively inexpensive operation. On the other hand PIG and Hadoop must process all of the edges to satisfy such a query.</p>
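<p>To make the single-vertex case concrete, here is a toy Python sketch. It does not model projections or merge joins at all; a dictionary of neighbor sets merely stands in for storage where all edges of a vertex can be found with one lookup, and the function names are made up for the illustration.</p>

```python
from collections import defaultdict
from itertools import combinations

# The ten original edges of the small example used earlier in the post.
EDGES = [
    ("Andrew", "Matt"), ("Andrew", "Pachu"), ("Andrew", "Ben"),
    ("Matt", "Pachu"), ("Ben", "Chuck"), ("Ben", "Stephen"),
    ("Chuck", "Rajat"), ("Chuck", "Lyric"), ("Chuck", "Stephen"),
    ("Rajat", "Stephen"),
]


def build_adjacency(edges):
    """Index the graph by vertex: one lookup returns all of its neighbors."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def triangles_at(vertex, adjacency):
    """List the triangles that include the given vertex."""
    neighbors = sorted(adjacency.get(vertex, ()))
    # Only pairs of this vertex's neighbors are examined, not the whole graph.
    return [
        (vertex, a, b)
        for a, b in combinations(neighbors, 2)
        if b in adjacency[a]
    ]


if __name__ == "__main__":
    index = build_adjacency(EDGES)
    for triangle in triangles_at("Chuck", index):
        print(triangle)
```

<p>The work is proportional to the neighborhood of the chosen vertex rather than to the size of the graph, which is the property the paragraph above attributes to a well-chosen projection.</p>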
<h4>Disk Usage</h4>
<p>For the input data set of 1.3GB, it takes 560MB to store it in Vertica&#8217;s compressed storage. In comparison, storing it in HDFS consumes more space than the raw data size.</p>
<p>At run-time, here is the peak disk usage among all 3 solutions in a 4-node cluster (remember lzo was not enabled for Hadoop and PIG &#8211; turning it on would reduce disk usage but likely hurt performance).</p>
<p><img class="alignnone size-full wp-image-4070" src="http://184.106.12.19/wp-content/uploads/2011/09/Triangles6.png" alt="" width="608" height="472" /></p>
<p>Given the huge differences in disk usage and thus I/O work, along with the other advantages outlined above, it should come as no surprise that the Vertica solution is much faster.</p>
<h4>Join Optimization</h4>
<p>As we mentioned earlier, the Hadoop solution does not optimize for join performance. Both Vertica and PIG were able to take advantage of a relatively small edges table that fit in memory (100s of billions or more edges can fit in memory when distributed over 10s or 100s of machines), with a hash join implementation.</p>
<p>For PIG, the join ordering needs to be explicitly specified. Getting this ordering wrong may carry a significant performance penalty. In our study, the PIG solution with the wrong join ordering is 1.5x slower. The penalty is likely even higher with a larger data set, where the extra disk I/O incurred in join processing can no longer be masked by sufficient RAM. To further complicate the matter, the optimal join ordering may depend on the input data set (e.g. whether the input graph is dense or not). It is infeasible for users to manually tweak the join ordering before submitting each PIG job.</p>
<p>In comparison, the <a href="http://184.106.12.19/2010/04/28/vertica-under-the-hood-the-query-optimizer/" target="_blank">Vertica columnar optimizer</a> takes care of join ordering as well as many other factors crucial to optimizing for the job run-time.</p>
<h3>The Right Tool for the Job</h3>
<p>Many people get significant value out of Hadoop and PIG, <a href="http://184.106.12.19/2011/08/15/the-right-tool-for-the-job-using-hadoop-with-vertica-for-big-data-analytics/" target="_blank">including a number of Vertica&#8217;s customers</a> who use these tools to work with unstructured or semi-structured data &#8211; typically before loading that data into Vertica. The question is which tool is best suited to solve your problem. With User Defined Functions, Aggregates, Load, et cetera available or coming soon to Vertica, the lines are becoming blurred, but when it comes to performance the choice is crystal clear.</p>
<p>In the case of triangle counting as we presented above, the Vertica solution enjoys the following advantages over Hadoop and PIG:</p>
<ul>
<li><span style="color: #333300"><span style="color: #333300">Ease of programming and maintenance, in terms of both ensuring correctness (The Vertica SQL solution is simpler) and achieving high performance (The Vertica optimizer chooses the best execution plan)</span></span><br />
<span style="color: #ffffff">.</span></li>
<li><span style="color: #333300"><span style="color: #333300">Compressed storage</span></span><br />
<span style="color: #ffffff">.</span></li>
<li><span style="color: #333300">Orders of magnitude faster query performance</span></li>
</ul>
<h3>Do Try this at Home</h3>
<p>It is a relatively safe experiment (unlike slicing a grape in half and putting it in the <a href="http://www.youtube.com/watch?v=vCNNqgKqnaQ" target="_blank">microwave</a> &#8211; don&#8217;t try that one at home). We&#8217;ve uploaded all three solutions to <a href="https://github.com/vertica/Graph-Analytics----Triangle-Counting" target="_blank">GitHub</a>. Feel free to run your own experiments and improve on our work. As it stands the project includes a build.xml file which runs the Hadoop and PIG solutions in standalone mode &#8211; the project README file describes these targets and more in detail. With a little more work one can configure a Hadoop cluster and run the experiments in distributed mode, which is how we ran the experiments described above.</p>
<p>It&#8217;s a little more difficult to run the tests if you are not currently a Vertica customer, but we do have a <a href="http://184.106.12.19/evaluate/" target="_blank">free trial version of the Vertica Analytics Platform software</a>.</p>
<h3>Acknowledgements</h3>
<p>Many thanks to Rajat Venkatesh for writing the PIG script (though I already thanked him with a lunch) and Mingsheng Hong for his suggestions, ideas and edits.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2011/09/21/counting-triangles/feed/</wfw:commentRss>
		<slash:comments>33</slash:comments>
		</item>
		<item>
		<title>Vertica at BIRTE 2011: Social Graph Analytics</title>
		<link>http://www.vertica.com/2011/09/19/vertica-at-birte-2011-social-graph-analytics/</link>
		<comments>http://www.vertica.com/2011/09/19/vertica-at-birte-2011-social-graph-analytics/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 13:39:51 +0000</pubDate>
		<dc:creator>slawande</dc:creator>
				<category><![CDATA[social graph analysis]]></category>

		<guid isPermaLink="false">http://vertica.linchpinagency.com/?p=4055</guid>
		<description><![CDATA[<p>A few days ago, Lakshmikant Shrinivas and I gave an invited talk at <a href="http://wwwdb.inf.tu-dresden.de/birte2011/" target="_blank">BIRTE 2011</a>.  Rather than yet another talk on Vertica 101, we chose to talk about <a href="http://en.wikipedia.org/wiki/Social_network" target="_blank">Social Graph Analytics</a>, a topic we are truly very excited about because it is something we are learning from our customers! And of course, it is uber cool to talk about Cityville and Zynga these days!  We presented the <a href="http://184.106.12.19/wp-content/uploads/2011/05/ZyngaSocialGraphing.pdf" target="_blank">Zynga usecase </a> – how to find the influencers among the active users of a game and several other graph problems Vertica has very successfully solved with SQL.  More about these in future blog posts.</p> <p>The most fun part of this talk (according to the audience) was that <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<p>A few days ago, Lakshmikant Shrinivas and I gave an invited talk at <a href="http://wwwdb.inf.tu-dresden.de/birte2011/" target="_blank">BIRTE 2011</a>.  Rather than yet another talk on Vertica 101, we chose to talk about <a href="http://en.wikipedia.org/wiki/Social_network" target="_blank">Social Graph Analytics</a>, a topic we are truly very excited about because it is something we are learning from our customers! And of course, it is uber cool to talk about Cityville and Zynga these days!  We presented the <a href="http://184.106.12.19/wp-content/uploads/2011/05/ZyngaSocialGraphing.pdf" target="_blank">Zynga usecase </a> – how to find the influencers among the active users of a game and several other graph problems Vertica has very successfully solved with SQL.  More about these in future blog posts.</p>
<p>The most fun part of this talk (according to the audience) was that it was multi-threaded – we wove a real-time demo through the talk. The demo was in two parts. The first part calculated <a href="http://en.wikipedia.org/wiki/Degeneracy_%28graph_theory%29" target="_blank">an approximate 4-core</a> of a graph of 90M nodes and 405M edges representing active users and their interactions in a game.  A k-core of a graph is a subgraph in which every node has at least k neighbors in the subgraph – <a href="http://www.nature.com/nphys/journal/v6/n11/abs/nphys1746.html" target="_blank">some research</a> shows that the set of influencers in a social network is often a 3- or 4-core. The demo was run on 4 blades of the <a href="http://h18006.www1.hp.com/storage/server-solutions/vertica-analytics-overview.html" target="_blank">HP/Vertica Analytics System</a> with 12 cores and 96GB RAM each. The animation below shows an approximate visualization of the actual graph with 10K nodes and 45K edges (we could not find any graph visualizing tool that could draw a graph of 405M edges!) and how the algorithm iteratively reduces it to find the 4-core. The actual computation (which cannot be visualized) took about 1 minute to process 1TB of in-game event data to compute the initial graph of 90M active users and only 4.5 minutes to run 8 iterations resulting in the 4-core of 34K users!</p>
<p><center><iframe width="480" height="360" src="http://www.youtube.com/embed/co7jD1_g0Es?rel=0" frameborder="0" allowfullscreen></iframe></center></p>
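<p>The iterative reduction shown in the animation can be sketched in Python as follows. This is not the SQL used in the demo; it is a plain in-memory illustration with made-up names. Each pass drops every node that has fewer than k surviving neighbors, and the optional iteration cap is an assumption about what &#8220;approximate&#8221; might mean (stopping before the pruning has fully converged).</p>

```python
from collections import defaultdict


def k_core(edges, k, max_iterations=None):
    """Return the nodes left after repeatedly pruning low-degree nodes."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    alive = set(adjacency)
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        # A node survives a pass only if k of its neighbors are still alive.
        weak = {v for v in alive if len(adjacency[v] & alive) < k}
        if not weak:
            break
        alive -= weak
        iterations += 1
    return alive


if __name__ == "__main__":
    # Five mutually connected users plus a short tail hanging off one of them.
    clique = ["a", "b", "c", "d", "e"]
    edges = [(u, v) for i, u in enumerate(clique) for v in clique[i + 1:]]
    edges += [("e", "f"), ("f", "g")]
    print(sorted(k_core(edges, 4)))
```

<p>With no cap the loop runs until no node is removed, which yields the exact k-core; capping the number of passes trades exactness for a bounded run time.</p>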
<p>The second part showed how this graph of influencers could be used to do <a href="http://en.wikipedia.org/wiki/A/B_testing" target="_blank">A/B testing</a> for in-game product placement. We simulated giving Coke to a group of users from the graph of influencers and Pepsi to a group of randomly chosen users, and loaded their in-game interaction data into Vertica every 15 seconds. In the 5 minutes or so it took us to describe the setup, you could see how the relative penetration of the two products changed in real-time.  This was really a demo of Vertica’s awesome real-time analytics engine – we loaded data continuously at the rate of 40GB/minute (on 4 nodes with 2 copies of the data), which translates to 20TB/hr on the full rack HP/Vertica Analytics System.  That’s fast, eh?!!</p>
<p>This little demo and what our customers are doing in the real world make a very convincing case that SQL can be used to express many practical graph problems. <a href="http://184.106.12.19/wp-content/uploads/2011/01/VerticaArchitectureWhitePaper.pdf" target="_blank">Vertica’s MPP architecture</a> and sorted <a href="http://184.106.12.19/2011/09/01/the-power-of-projections-part-1/" target="_blank">projections</a> provide a highly scalable and fast engine for iterative graph algorithms. Such problems are relevant not only to the Zyngas and Facebooks of the world but to any company that has a connected user community. From online forums to friend-and-family call logs, social networks are everywhere and finding influencers is the key to monetizing them. <a href="http://techcrunch.com/2011/08/30/gartner-social-crm-market-will-reach-1b-in-revenue-by-2012/" target="_blank">Gartner believes</a> that social CRM will be a $1B market by end of 2012!</p>
<p>We hope that our talk planted a small seed for the database research community to more formally explore SQL solutions to graph problems!  Watch for more blogs from Vertica on solving graph problems with SQL.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2011/09/19/vertica-at-birte-2011-social-graph-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Power of Projections &#8211; Part 3</title>
		<link>http://www.vertica.com/2011/09/06/the-power-of-projections-part-3/</link>
		<comments>http://www.vertica.com/2011/09/06/the-power-of-projections-part-3/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 12:30:02 +0000</pubDate>
		<dc:creator>cmahony</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://vertica.linchpinagency.com/?p=3973</guid>
		<description><![CDATA[<p>By Colin Mahony and Shilpa Lawande</p> <p><strong>Part III &#8211; Comparing and Contrasting Projections to Materialized Views and Indexes</strong></p> <p>In <a title="The Power of Projections - Part 1" href="http://184.106.12.19/2011/09/01/the-power-of-projections-part-1/" target="_blank">Part I</a> and <a title="The Power of Projections - Part 2" href="http://184.106.12.19/2011/09/02/the-power-of-projections-part-2/" target="_blank">Part II</a> of this post, we introduced you to Vertica’s projections, and described how easy it is to interface with them directly via SQL or through our Database Designer™ tool.  We will now end this series by comparing and contrasting Vertica’s projections with traditional indexes and materialized views.</p> <p>Row-store databases often use Btree indexes as a performance enhancement.  Btree indexes are designed for highly concurrent single-record inserts and updates, e.g. an OLTP scenario. Most data warehouse practitioners would agree that <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<p>By Colin Mahony and Shilpa Lawande</p>
<p><strong>Part III &#8211; Comparing and Contrasting Projections to Materialized Views and Indexes</strong></p>
<p>In <a title="The Power of Projections - Part 1" href="http://184.106.12.19/2011/09/01/the-power-of-projections-part-1/" target="_blank">Part I</a> and <a title="The Power of Projections - Part 2" href="http://184.106.12.19/2011/09/02/the-power-of-projections-part-2/" target="_blank">Part II</a> of this post, we introduced you to Vertica’s projections, and described how easy it is to interface with them directly via SQL or through our Database Designer™ tool.  We will now end this series by comparing and contrasting Vertica’s projections with traditional indexes and materialized views.</p>
<p>Row-store databases often use Btree indexes as a performance enhancement.  Btree indexes are designed for highly concurrent single-record inserts and updates, e.g. an OLTP scenario. Most data warehouse practitioners would agree that index rebuilds after a batch load are preferable to taking the hit of maintaining them record by record given the logging overhead. Bitmap indexes are designed for bulk loads and are better than btrees for data warehousing but only suitable for low cardinality columns and a certain class of queries.  Even though you have these indexes to help find data, you still have to go to the base table to get the actual data, which brings with it all the disadvantages of a row store.</p>
<p>In a highly simplified view, you can think of a Vertica projection as a single-level, densely packed, clustered index that stores the actual data values, is never updated in place, and has no logging. Any “maintenance”, such as merging sorted chunks or purging deleted records, happens automatically in the background, not in the path of real-time loads.  So yes, projections are a type of native index if you will, but they are very different from traditional indexes such as bitmap and B-tree indexes.</p>
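<p>To make the "never updated in place" idea concrete, here is a minimal Python sketch. It is a toy model under our own assumptions, not Vertica's storage engine: the function names and sample rows are hypothetical. Each load writes a new immutable sorted chunk, and a separate merge step (standing in for the background maintenance described above) combines the chunks later.</p>

```python
import heapq

# Toy model only (an assumption for illustration, not Vertica internals):
# every load becomes its own immutable sorted chunk.

def load_batch(chunks, rows):
    """Return a new chunk list with one more sorted chunk; old chunks are untouched."""
    return chunks + [tuple(sorted(rows))]

def merge_chunks(chunks):
    """Combine all sorted chunks into a single sorted chunk (the 'background' step)."""
    return [tuple(heapq.merge(*chunks))]

# Hypothetical (date, value) rows arriving in two separate loads.
chunks = []
chunks = load_batch(chunks, [("2011-09-02", 7), ("2011-09-01", 3)])
chunks = load_batch(chunks, [("2011-09-01", 9), ("2011-09-03", 1)])

# Loads never rewrite earlier data; merging happens afterwards.
merged = merge_chunks(chunks)
```

<p>After the two loads there are two sorted chunks, and the merge produces one sorted chunk holding all four rows. The load path only ever appends, which is the property the paragraph above describes.</p>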
<p>Vertica also offers a unique feature known as “pre-join projections”. Pre-join projections denormalize tables at the physical layer under the covers, providing a significant performance advantage over joining tables at run-time.  Pre-join projections automatically store the results of a join ahead of time, yet the logical schema is maintained &#8211; again, flexibility of the storage structure without having to rewrite your ETL or application.  Vertica can get away with this because it excels at sparse data storage, and in particular isn’t penalized for null values or for wide fact tables. Since Vertica does not charge extra for additional projections, this is a great way to reap the benefits of denormalization without the need to purchase a larger capacity license.</p>
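<p>The idea of storing a join result ahead of time can be sketched in a few lines of Python. This is an illustration under assumed, made-up tables (<code>customers</code>, <code>orders</code>) and hypothetical helper names; it does not show how Vertica implements pre-join projections.</p>

```python
# Hypothetical toy tables; names and values are made up for illustration.
customers = {1: "Boston", 2: "Chicago"}  # dimension: customer id to city
orders = [(100, 1, 25.0), (101, 2, 40.0), (102, 1, 15.0)]  # fact: order id, customer id, amount

def build_prejoin(fact_rows, dimension):
    """Resolve the join once, at load time, and store the widened rows."""
    return [(oid, cid, amount, dimension[cid]) for oid, cid, amount in fact_rows]

def sales_by_city(prejoined):
    """Answer a query from the stored join result; no dimension lookup at query time."""
    totals = {}
    for _oid, _cid, amount, city in prejoined:
        totals[city] = totals.get(city, 0.0) + amount
    return totals

prejoined = build_prejoin(orders, customers)
totals = sales_by_city(prejoined)
```

<p>The logical tables stay separate, while the query reads only the denormalized copy. The trade-off in this sketch is the same one the paragraph names: the join cost moves from query time to load time.</p>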
<p>So to sum up, here’s how Vertica projections stack up versus materialized views and conventional indexes.</p>
<table style="border-color: #000000; border-width: 1px; border-style: solid;" border="1" frame="box" rules="all" cellpadding="0" align="center">
<tbody>
<tr valign="middle">
<td style="border-color: #000000; border-style: solid; border-width: 1px; text-align: center;">
<h4><span style="color: #000000;">Vertica&#8217;s Projections</span></h4>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<h4><span style="color: #000000;">Traditional Materialized Views</span></h4>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<h4><span style="color: #000000;">Traditional Indexes</span></h4>
</td>
</tr>
<tr>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Are primary storage – no base tables are required</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Are secondary storage</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Are secondary storage pointing to base table data</span></li>
</ul>
</td>
</tr>
<tr>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Can be segmented, partitioned, sorted, compressed and encoded to suit your needs</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Are rigid: practically limited to the columns a given query needs; more columns = more I/O</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Support one clustered index at most &#8211; tough to scale out</span></li>
</ul>
</td>
</tr>
<tr>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Have a simple physical design</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Use aggregation, losing valuable detail</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Require complex design choices</span></li>
</ul>
</td>
</tr>
<tr>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Are efficient to load &amp; maintain</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Are mostly batch updated</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Are expensive to update</span></li>
</ul>
</td>
</tr>
<tr>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Are versatile – they can support any data model</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Provide high data latency</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Provide high data latency</span></li>
</ul>
</td>
</tr>
<tr>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Allow you to work with the detailed data</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;"></td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;"></td>
</tr>
<tr>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Provide near-real-time (low) data latency</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;"></td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;"></td>
</tr>
<tr>
<td style="border-color: #000000; border-style: solid; border-width: 1px;">
<ul>
<li><span style="color: #000000;">Combine high availability with special optimizations for query performance</span></li>
</ul>
</td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;"></td>
<td style="border-color: #000000; border-style: solid; border-width: 1px;"></td>
</tr>
</tbody>
</table>
<p>
That&#8217;s pretty much all there is to it.  Whether you are running ad-hoc queries or canned operational-BI workloads, you will find projections to be a very powerful backbone for getting the job done!</p>
<p><strong>Read the rest of the 3-part series&#8230;</strong></p>
<p><a title="The Power of Projections - Part 1" href="http://184.106.12.19/2011/09/01/the-power-of-projections-part-1/" target="_blank">The Power of Projections &#8211; Part 1: Understanding Projections and What They Do</a><br />
<a title="The Power of Projections - Part 2" href="http://184.106.12.19/2011/09/02/the-power-of-projections-part-2/" target="_blank">The Power of Projections &#8211; Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2011/09/06/the-power-of-projections-part-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Power of Projections &#8211; Part 2</title>
		<link>http://www.vertica.com/2011/09/02/the-power-of-projections-part-2/</link>
		<comments>http://www.vertica.com/2011/09/02/the-power-of-projections-part-2/#comments</comments>
		<pubDate>Fri, 02 Sep 2011 12:30:50 +0000</pubDate>
		<dc:creator>cmahony</dc:creator>
				<category><![CDATA[column store]]></category>

		<guid isPermaLink="false">http://vertica.linchpinagency.com/?p=3964</guid>
		<description><![CDATA[<p>By Colin Mahony and Shilpa Lawande</p> <p><strong>Part II &#8211; Understanding the Simplicity of Projections and the Vertica Database Designer™ </strong></p> <p>In <a title="The Power of Projections - Part 1" href="http://184.106.12.19/2011/09/01/the-power-of-projections-part-1/" target="_blank">Part I of this post</a>, we introduced you to the simple concept of Vertica’s projections.  Now that you have an understanding of what they are, we wanted to go into more detail on how users interface with them, and introduce you to Vertica’s unique Database Designer tool.</p> <p>For each table in the database, Vertica requires a minimum of one projection, called a “superprojection”. A superprojection is a projection for a single table that contains all the columns and rows in the table.  Although the data may be the same as a traditional <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<p>By Colin Mahony and Shilpa Lawande</p>
<p><strong>Part II &#8211; Understanding the Simplicity of Projections and the Vertica Database Designer™ </strong></p>
<p>In <a title="The Power of Projections - Part 1" href="http://184.106.12.19/2011/09/01/the-power-of-projections-part-1/" target="_blank">Part I of this post</a>, we introduced you to the simple concept of Vertica’s projections.  Now that you have an understanding of what they are, we wanted to go into more detail on how users interface with them, and introduce you to Vertica’s unique Database Designer tool.</p>
<p>For each table in the database, Vertica requires a minimum of one projection, called a “superprojection”. A superprojection is a projection for a single table that contains all the columns and rows in the table.  Although the data may be the same as a traditional base table, it has the advantages of segmentation (spreading the data evenly across the nodes in the cluster), sorting, and encoding (compressing the size of the data on a per-column basis).  This leads to significant footprint reduction as well as load and query performance enhancements.  To give you a sense of the impact that Vertica&#8217;s projections have on size, most Vertica customers see at least a 50% reduction in footprint thanks to our compression.  This includes the high-availability copy and on average 3-5 projections.  Contrast this with traditional row-store databases ballooning to upwards of 5x their original size: 0.5x versus 5x is a 10:1 difference in Vertica&#8217;s favor.</p>
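<p>A small Python sketch may help show why sorting and per-column encoding go together. It uses run-length encoding as a stand-in for column encoding in general (an assumption made for illustration; the sample column is hypothetical), and it also spells out the footprint arithmetic using the 0.5x and 5x figures quoted above.</p>

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Hypothetical low-cardinality column, in arrival order.
states = ["MA", "NY", "MA", "CA", "NY", "MA", "CA", "MA"]

# Unsorted, no two neighbours match, so encoding saves nothing.
unsorted_runs = run_length_encode(states)

# Sorted, equal values sit together and collapse into a few runs.
sorted_runs = run_length_encode(sorted(states))

# Footprint arithmetic from the figures in the post:
# Vertica at 0.5x of the original size, a row store at 5x.
vertica_factor = 0.5
row_store_factor = 5.0
ratio = row_store_factor / vertica_factor
```

<p>In this toy column the unsorted data yields eight runs and the sorted data yields three, and the ratio works out to 10, matching the 10:1 comparison. Real encodings and real data will of course behave differently.</p>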
<p>To get your database up and running quickly, Vertica automatically creates a default superprojection for each table created through the CREATE TABLE and CREATE TEMPORARY TABLE statements. This means that if database admins and users never want to know about a projection, they don&#8217;t have to &#8211; Vertica will automatically handle it under the covers. To further illustrate this point, users can simply pass in projection parameters such as Order By, Encodings, Segmentation, High Availability, and Partitioning right after the CREATE TABLE statement, never interfacing directly with a projection under the hood.</p>
<p>By creating a superprojection for each table in the database, Vertica ensures that all SQL queries can be answered. Default superprojections alone will do far better than a row store; by themselves, however, they may not fully optimize database performance or realize Vertica&#8217;s full potential.  Vertica recommends that you start with the default projections and then use Vertica&#8217;s nifty Database Designer™ to optimize your database.  Database Designer creates new projections that optimize your database based on its data statistics and the queries you use. Database Designer:</p>
<p style="padding-left: 10px;">1. Analyzes your logical schema, sample data, and sample queries (optional).<br />
2. Creates a physical schema design (projections) in the form of a SQL script that can be deployed automatically or manually.<br />
3. Can be used by anyone without specialized database knowledge (even business users can run Database Designer).<br />
4. Can be run and re-run anytime for additional optimization without stopping the database.</p>
<p>Designs created by the Database Designer provide exceptional query performance. The Database Designer uses sophisticated strategies to provide excellent ad-hoc query performance while using disk space efficiently. Of course, a proficient human with more intimate knowledge of the data and the use case may do even better than the Database Designer – a small minority of our customers prefer to do manual projection design and can usually get a good feel for it after working with the product for a few weeks.</p>
<p>We&#8217;ve heard people ask whether Vertica needs a projection for each query, and it absolutely does not! Typically our customers use 3-5 projections, and several use the single superprojection only. A typical customer would have the superprojection along with a few smaller projections (often comprising only a few columns each).  Unlike MVs and indexes, projections are cheap to maintain during load, and due to Vertica’s compression, the resulting data size tends to be 5-25x smaller than the base data. Depending on your data latency needs (seconds to minutes) and storage availability, you could choose to add more projections to further optimize the database.  Also important to note is that Vertica does not charge extra for projections, regardless of how many are deployed.  So whether a customer has 1 or 50 projections, their license fees are the same &#8211; entirely based on raw data.</p>
<p>As you can see, projections are very easy to work with, and if you are a business analyst who doesn’t know SQL/DDL, that’s okay: we created a tool that designs, deploys and optimizes the database automatically for you.  Our objective from day one has always been to enable customers to ask more questions and get faster answers from their data without having to constantly tune the underlying database.  <a title="The Power of Projections - Part 3" href="http://184.106.12.19/2011/09/06/the-power-of-projections-part-3/" target="_blank">Part III</a> of this post goes into more detail on projections versus indexes and materialized views.</p>
<p><strong>Read the rest of the 3-part series&#8230;</strong></p>
<p><a title="The Power of Projections - Part 1" href="http://184.106.12.19/2011/09/01/the-power-of-projections-part-1/" target="_blank">The Power of Projections &#8211; Part 1: Understanding Projections and What They Do</a><br />
<a title="The Power of Projections - Part 3" href="http://184.106.12.19/2011/09/06/the-power-of-projections-part-3/" target="_blank">The Power of Projections &#8211; Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2011/09/02/the-power-of-projections-part-2/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Power of Projections &#8211; Part 1</title>
		<link>http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/</link>
		<comments>http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/#comments</comments>
		<pubDate>Thu, 01 Sep 2011 14:50:32 +0000</pubDate>
		<dc:creator>cmahony</dc:creator>
				<category><![CDATA[column store]]></category>

		<guid isPermaLink="false">http://vertica.linchpinagency.com/?p=3954</guid>
		<description><![CDATA[<p>By Colin Mahony and Shilpa Lawande</p> <p><strong>Part I: Understanding Projections and What They Do</strong></p> <p>Many of us here at Vertica have been amazed and frankly flattered at how much FUD our competitors are putting out there regarding Vertica&#8217;s &#8220;projections&#8221;.  Having heard some incredibly inaccurate statements about them, we&#8217;ve decided to clarify what they are, how and why we have them, and the advantages they bring.  Actually, projections are a pivotal component of our platform, and a major area of differentiation from the competition.  Most importantly, Vertica&#8217;s customers love the benefits projections bring! In an effort to provide you with as much detail as possible, this blog is broken up into three posts with Parts II and III being more technical.</p> <p>First, <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<p>By Colin Mahony and Shilpa Lawande</p>
<p><strong>Part I: Understanding Projections and What They Do</strong></p>
<p>Many of us here at Vertica have been amazed and frankly flattered at how much FUD our competitors are putting out there regarding Vertica&#8217;s &#8220;projections&#8221;.  Having heard some incredibly inaccurate statements about them, we&#8217;ve decided to clarify what they are, how and why we have them, and the advantages they bring.  Actually, projections are a pivotal component of our platform, and a major area of differentiation from the competition.  Most importantly, Vertica&#8217;s customers love the benefits projections bring! In an effort to provide you with as much detail as possible, this blog is broken up into three posts with Parts II and III being more technical.</p>
<p>First, some background. In traditional database architectures, data is primarily stored in tables. Additionally, secondary tuning structures such as indexes and materialized views are created for improved query performance.  Secondary structures like MVs and indexes have drawbacks &#8211; for instance, they are expensive to maintain during data load (more detail on this in Part III).  Hence best practices often require rebuilding them during nightly batch windows, which rules out real-time analytics.  Also, it isn’t uncommon to find data warehouse implementations that balloon to 3-6x base table sizes due to these structures. As a result, customers are often forced to remove valuable detailed data and replace it with aggregated data to solve this problem. However, you can&#8217;t monetize what you&#8217;ve lost!</p>
<p>Vertica created a superior solution by optimizing around performance, storage footprint, flexibility and simplicity. We removed the trade-off between performance and data size by using projections as the linchpin of our purpose-built architecture.  Physical storage consists of optimized collections of table columns, which we call “projections”. In the traditional sense, Vertica has no raw uncompressed base tables, no materialized views, and no indexes. As a result there are no complex choices &#8211; everything is a projection!  Of course, your logical schema (we support any) remains the same as with any other database so that importing data is a cinch.  Furthermore, you still work with standard SQL/DDL (e.g. CREATE TABLE statements).  The magic of projections and Vertica is in what we do under the covers for you with the physical storage objects.  We provide the same benefits as indexes without all of the baggage.  We also provide a tool, the Database Designer (more on this in Part II), that creates projections automatically.</p>
<p><a href="http://184.106.12.19/wp-content/uploads/2011/09/Projections.jpg"><img class="aligncenter size-full wp-image-3956" title="Projections" src="http://184.106.12.19/wp-content/uploads/2011/09/Projections.jpg" alt="" width="500" height="375" /></a></p>
<p>Projections store data in formats that optimize query execution. They share one similarity with materialized views in that they store data sets on disk rather than compute them each time they are used in a query (i.e. physical storage).  However, projections aren&#8217;t aggregated but rather store every row in a table, i.e. the full atomic detail. Unlike materialized views, the data sets are automatically refreshed whenever data values are inserted, appended, or changed &#8211; again, all of this happens beneath the covers without user intervention. Projections provide the following benefits:</p>
<ul>
<li><span style="color: #003300;">Projections are transparent to end-users and SQL. The Vertica query optimizer automatically picks the best projections to use for any query.</span></li>
<li><span style="color: #003300;">Projections allow for the sorting of data in any order (even if different from the source tables). This enhances query performance and compression.</span></li>
<li><span style="color: #003300;">Projections deliver high availability optimized for performance, since the redundant copies of data are always actively used in analytics.  We have the ability to automatically store the redundant copy using a different sort order.  This provides the same benefits as a secondary index in a more efficient manner.</span></li>
<li><span style="color: #003300;">Projections do not require a batch update window.  Data is automatically available upon loads.</span></li>
<li><span style="color: #003300;">Projections are dynamic and can be added/changed on the fly without stopping the database.</span></li>
</ul>
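<p>As a rough illustration of why a redundant copy in a different sort order can stand in for a secondary index, consider the Python sketch below. It is a toy under our own assumptions: the rows, the <code>projections</code> dictionary and the <code>lookup</code> helper are hypothetical, and the "pick the copy sorted on the filter column" rule is a deliberate simplification, not a description of the Vertica query optimizer.</p>

```python
from bisect import bisect_left, bisect_right

# Hypothetical (state, day) rows.
rows = [("MA", 3), ("CA", 1), ("NY", 2), ("MA", 1)]
column_index = {"state": 0, "day": 1}

# Two full copies of the same rows, each sorted on a different column.
projections = {
    name: sorted(rows, key=lambda r, i=i: r[i])
    for name, i in column_index.items()
}

# Sort keys extracted once per copy so lookups can binary-search them.
sort_keys = {
    name: [r[column_index[name]] for r in data]
    for name, data in projections.items()
}

def lookup(column, value):
    """Use the copy sorted on the filter column and binary-search it."""
    data = projections[column]
    keys = sort_keys[column]
    return data[bisect_left(keys, value):bisect_right(keys, value)]

ma_rows = lookup("state", "MA")
day_one_rows = lookup("day", 1)
```

<p>Both copies hold every row, so either one can answer any query and either one can serve as the redundant copy. Each is simply faster for filters on its own sort column.</p>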
<p>In summary, Vertica’s projections represent collections of columns (okay so it is a table!), but are optimized for analytics at the physical storage structure level and are not constrained by the logical schema.  This allows for much more freedom and optimization without having to change the actual schema that certain applications are built upon.</p>
<p>Hopefully this gave you an overview of what projections are and how they work.  Please read <a title="The Power of Projections - Part 2" href="http://184.106.12.19/2011/09/02/the-power-of-projections-part-2/" target="_blank">Part II</a> and <a title="The Power of Projections - Part 3" href="http://184.106.12.19/2011/09/06/the-power-of-projections-part-3/" target="_blank">Part III</a> of this post to drill down into projections even further.</p>
<p><strong>Read the rest of the 3-part series&#8230;</strong></p>
<p><a title="The Power of Projections - Part 2" href="http://184.106.12.19/2011/09/02/the-power-of-projections-part-2/" target="_blank">The Power of Projections &#8211; Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™</a><br />
<a title="The Power of Projections - Part 3" href="http://184.106.12.19/2011/09/06/the-power-of-projections-part-3/" target="_blank">The Power of Projections &#8211; Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>On Both Sides of the Internship</title>
		<link>http://www.vertica.com/2011/08/26/on-both-sides-of-the-internship/</link>
		<comments>http://www.vertica.com/2011/08/26/on-both-sides-of-the-internship/#comments</comments>
		<pubDate>Fri, 26 Aug 2011 17:10:51 +0000</pubDate>
		<dc:creator>Lyric Doshi</dc:creator>
				<category><![CDATA[interns]]></category>
		<category><![CDATA[vertica]]></category>

		<guid isPermaLink="false">http://vertica.linchpinagency.com/?p=3931</guid>
		<description><![CDATA[<p><strong>Author: Lyric Doshi</strong></p> Three cheers for Vertica Summer Interns 2011! <p>As summer comes to an end, we bid goodbye to yet another amazing crop of summer interns. This year&#8217;s interns, Hieu, Zhongliang, Ruchika and Zhijie, were MS/PhD students from different schools along the East Coast.  They worked on projects to extend the Vertica SDK, enhance Hadoop/Pig connectivity, and create internal developer productivity tools.  We plan to incorporate much of this work into a future major release of Vertica. In fact, there was so much excitement surrounding their work that in addition to the traditional presentation to engineering, they were asked to present to the entire company after having lunch with Vertica VP&#38;GM Chris Lynch.</p> <p>Once a picky intern <a href="" class="read-more">Read More &#187;</a>]]></description>
			<content:encoded><![CDATA[<p><strong>Author: Lyric Doshi</strong></p>
<h3>Three cheers for Vertica Summer Interns 2011!</h3>
<p>As summer comes to an end, we bid goodbye to yet another amazing crop of summer interns. This year&#8217;s interns, Hieu, Zhongliang, Ruchika and Zhijie, were MS/PhD students from different schools along the East Coast.  They worked on projects to extend the Vertica SDK, enhance Hadoop/Pig connectivity, and create internal developer productivity tools.  We plan to incorporate much of this work into a future major release of Vertica. In fact, there was so much excitement surrounding their work that in addition to the traditional presentation to engineering, they were asked to present to the entire company after having lunch with Vertica VP&amp;GM Chris Lynch.</p>
<p>Once a picky intern at other companies (and a very happy one at Vertica, why else would I have come back?), I had the opportunity this year to run our internship program, beginning with coordinating interviews for nearly 40 candidates and narrowing them down to 4 solid interns. That was over 3 months ago. In the past few days, I took some time to speak with each intern to hear about their experience, and I was happy to hear them echo some of my own words to my mentors at Vertica from 2 years ago.</p>
<p>Zhongliang gained experience working on a full Java project for the first time and told me he felt his coding improved dramatically thanks to feedback from his mentor Matt.</p>
<p>Ruchika told me how her project made some of her friends jealous because she was never bored at work. She appreciated how everyone here dropped their work to help her out when she had questions. She singled out the unwavering patience of her mentor Ben in answering her questions. Having appreciated the same time and time again, I responded, &#8220;Been there, done that.&#8221;</p>
<p>Hieu highlighted how much fun he had in addition to his project, partaking in our Ping Pong tournament, joining the weekly pick-up basketball, learning the art of sword-play, destroying us in Starcraft, and attending cook-outs hosted by co-workers. He even laughed himself through an unforgettable first experience with water sports (hint: it involved an inner tube, some soccer shorts, and a motorboat) at our annual interns party.</p>
<p style="text-align: left;"><a href="http://184.106.12.19/wp-content/uploads/2011/08/SummerInterns.png"><img class="aligncenter size-full wp-image-3932" title="SummerInterns" src="http://184.106.12.19/wp-content/uploads/2011/08/SummerInterns.png" alt="" width="586" height="361" /></a><br />
Zhijie was very happy to work on a project that, while sufficiently separated that he did not worry about causing trouble, was in the release plan. Whenever he got frustrated with coding issues, he looked at the customer feature requests page to see all the demand for what he was working on and found real inspiration in knowing that he was making a difference.</p>
<p>Repeatedly pressing each for complaints in the spirit of constructive criticism, major or minor, I finally forced something out of Zhijie: &#8220;The office was a little hot when the AC broke.&#8221; Surely a sign of a successful summer?</p>
<p>But it&#8217;s not just the interns that got something out of the internship program. Our program is run entirely by engineering, and both I and the four dedicated intern mentors gained experience managing projects, goals and expectations. We&#8217;ve adapted our program over the years, trying both team and individual projects focused on everything from tools and demos to server-side changes. This year, we even sat our interns at desks near their respective mentors for a proper full-timer experience and even higher mentor accessibility.</p>
<p>A big thank you to all of our interns for your hard work and commitment. We&#8217;ll miss having you around but wish you the best for the coming school year and hope to see you again soon!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vertica.com/2011/08/26/on-both-sides-of-the-internship/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

