Archive for September, 2012

Vertica Gives Back!


HP encourages all employees worldwide to contribute time and energy toward good causes.  Last week, Vertica participated in the second annual Tech Gives Back event hosted by TUGG, a Boston open source philanthropic group which offers local companies a chance to give back to their community through charitable day-long events.  Projects include preparing and packaging meals for community service groups, sorting and inventorying clothes, toys, and books for donation, landscaping, cleaning, and painting line games outside of schools and shelters, and more.

Vertica Gives Back!

This year, Vertica’s project was held at the W. Hennigan Elementary School in Jamaica Plain.  Alongside a few other helpful companies, we spent the day clearing out an overgrown garden, building a couple of impressive benches/ flower beds, and repainting everything other than the building which had ever had paint on it, which was a lot of paint!  There were many murals, games, and railings which were rusted, faded, and long overdue for a fresh coat.  With around 50 people in 5 hours, the improvement was absolutely amazing.

Team Vertica adds a fresh coat of paint.

Just as the last lines of paint were drawn and barely dried, the bell rang and the flood gates were opened.  The kids came running out and shouted with excitement at the new look.  It was truly a rewarding experience.

We look forward to the third annual Tech Gives Back event in 2013, and we are always looking for more ways to give back, so please, if you know of any, send them our way!

If you like technology, good causes, and ping pong (more on that later) join the fun with the Vertica team, we’re hiring!

The finished product!

Defining Big Data

Let’s start with an easy question.  What’s “Big Data”?  Fortunately, I read the answer to that in one of my favorite tech news sources just the other day:  The answer, for those who won’t bother with the link, is “Big data is any data that doesn’t fit well into tables and that generally responds poorly to manipulation by SQL” … “A Twitter feed is Big Data; the census isn’t. Images, graphical traces, Call Detail Records (CDRs) from telecoms companies, web logs, social data, RFID output can all be Big Data. Lists of your employees, customers, products are not.”

That’s great, except that it is self-contradictory!  5 out of the 7 things the author considers “big data” are not only susceptible to manipulation by SQL (in a well-designed database of course), but have representation on Vertica’s “paying customers” list.  Twitter is a customer (and I thank them for their ability to simultaneously give us props and jabs in venues like ACM SIGMOD).  We hardly ever lose in the CDR space (or any xDR, really).  Zynga has petabytes of what could be loosely described as “web logs” and “social data” stored in Vertica.  The evidence list becomes long and boring quite quickly, so I won’t get into how the other 2 out of 7 “Big Data” areas are, as presented, too nebulous to decide one way or the other.

I can’t claim to have a definitive definition of “Big Data”.   But I can tell you that for each meaningful result (such as a sale of a product), there are many website clicks made, and for each “click” there are many impressions (things that are presented to the user but not clicked).  If you want to analyze these things, and maybe run some tests where you try experiments on your customers and prospects to see what does the best job of reeling them in, you will strain the the abilities of single-machine processing, “traditional” RDBMSs, and many statistics packages and algorithms (yes, including your trusty Excel spreadsheet).  Then there is machine-generated data.  The handful of self-aware devices you own (your desktop PC, laptop, GPS-enabled smart phone, smart meter, car, refrigerator, etc.) have much more free time with which to generate “Big Data” than you do.  (For example, a fair-sized cable MSO has a “sensor network” with 20 million networked devices that never rest, producing 10+ billion rows a day.)

So now that the definition of “Big Data” is as clear as mud, let us next consider structured vs. unstructured data.  I have heard people say that “web logs are unstructured data”.  This is outright silly.  The average web log is entirely machine-generated one line at a time, and to do any serious analysis on it you are going to have to parse it into a format with some amount of structure (at least the date and time, session, page, etc.).  Sure, it can be stored as “unstructured data” in your favorite file system, but that’s a simple matter of procrastination on the issue of writing (or finding that someone has already written a parser.  On the other hand, Twitter data, with its rigid 140-character limit is quite “structured”, but figuring out what it “means” is nonetheless a respectable challenge.

So instead, I would implore you to consider “machine information” versus “human information”.  It is probably true that, byte for byte, there is 10x more “human” information.  The slide deck our sales guys use takes up 10x more space on disk than the spreadsheet that contains the funnel and prospect list.  Viral videos probably (I am not privy to the hard data) consume 10x more disk space than the IPDRs and web logs recording their accesses.

But while disk size is a fun, quotable metric, it says little about the nature of the “Big Data” problem you have to solve.  Instead, consider whether you have “machine” or “human” information.  You should be able to tell at a glance, and this will help you choose the right tools.  If it’s machine-generated financial trades, a scale-out SQL database with some time series analytics extensions will do nicely.  If it’s the tweets of Twitter twits, you can use a structured data tool, but you’re going to need some (in a very loose sense of the word) natural language sentiment analysis and graph processing packages.  If it is a bunch of PDFs, Word documents, HTML pages, PowerPoint presentations, and so on scattered across eleventeen different web servers, CMS systems, and file systems in your enterprise, you may need a high-powered “human information” system like Autonomy (and if you are an intelligence agency that needs to watch every video broadcast worldwide to gauge international sentiment, I think they can help you with that too…).

There is a point to all of this.  I can’t claim to have invented this tag line, but I wish I had.  You should “Know SQL” before you assume you should “NoSQL”.  While “Big Data” couldn’t have lived in an SQL database as they existed 10 years ago, we have different kinds of SQL databases now, that are “web scale”, high performance, designed for analytic workloads, cost effective, and so on.  It’s no longer, as a tech visionary in our own back yard recently said, “just a box that can’t keep up”.  If you have “Big Data” upon which structure can be imposed, analytic databases are very efficient, flexible, easy to use, and come with all the features people have come to expect from 30+ years of development.  (Try one.  We have a free, community download!)

When UPDATE is actually INSERT

At the VLDB 2012 conference a few weeks ago, we had a chance to listen to Jiri Schindler giving a tutorial about NoSQL.  His interesting and informative presentation covered the fundamental architecture and I/O usage patterns of RDBMS systems and various NoSQL data management systems, such as HBase, Cassandra, and MongoDB.

During the presentation, Schindler listed basic I/O access patterns for columnar databases using the slide below. It is hard to summarize the operation of the various columnar database systems on a single slide, and Schindler did a great job given the constraints of the presentation. However, while his characterization might hold for other columnar databases, the Vertica Analytic Database  has a different I/O pattern for UPDATEs which we wanted to explain in more detail.

First, Vertica does not require synchronous I/O of a recovery log. Unlike most other RDBMS systems,  Vertica implements durability and fault tolerance via distributed replication.

Second, since Vertica never modifies storage in place, it avoids the other I/O intensive operations referenced in the slide.

When a user issues an UPDATE statement, Vertica performs the equivalent of a delete followed by an insert. The existing row is deleted by inserting a Delete Vector (a small record saying that the row was deleted), and a new copy of the row with the appropriately updated columns is inserted. Both the Delete Vector and the new version of the row are stored in a memory buffer known as the WOS (write optimized store). After sufficient data has accumulated in the WOS from INSERTs, UPDATEs, DELETEs, and COPYs (bulk loads), they are moved in bulk to disk storage known as the ROS.

It is important to note that existing files in the ROS are not modified while data is moved from WOS to the ROS – rather a new set of sorted and encoded column files is created. To avoid a large number of files accumulating over time, the Tuple Mover regularly merges column files together using an algorithm that limits the number of times any tuple is rewritten and also uses large contiguous disk operations, which is quite efficient well on most modern file and disk systems.

This arrangement has several advantages:

  1. From the user’s point of view, the update statement completes quickly and future queries get the expected answer (by filtering out the original values at runtime using the Delete Vectors).
  2. The cost of sorting, encoding, and writing column files to disk is amortized over a large number of rows by utilizing the in memory WOS.
  3. I/O is always in proportion to the number of rows inserted or modified – it is never the case that an update of a small number of rows causes I/O on a significant amount of previously stored data.

More details about how data is stored and Vertica’s overall architecture and design decisions, please consider reading our VLDB 2012 paper.



HP Vertica and Tableau Software Customers Speak Out in Philadelphia

It was my distinct pleasure this week to participate in a joint customer roundtable at the Cira Center in Philadelphia, co-sponsored by HP Vertica and our partner Tableau Software, and featuring a number of our respective and joint customers speaking out on topics related to Big Data.

Our panelists, who did a terrific job interacting with an audience of more than 50 of their peers, included David Baker of IMS Health, George Chalissery of hMetrix, Amit Garg of Comcast, Seth Madison of and Elizabeth Worster of State Street Global Advisors.

The discussion essentially centered on 5 themes related to Big Data. They included (with unattributed comments from the panelists).

  • Democratizing data – all of our panelists discussed the value of giving business users the ability to understand data and make ad hoc requests themselves – as well as extending some of those capabilities outside the walls of the enterprise. A number of concerns and questions came from the audience as to how you handle security when democratizing data which were addressed by our panelists. “Self-reliance really sings to me.” “We have internal and external users – and increasingly the external users are our clients”. 
  • Getting more productivity out of small teams – related to the previous point, data analyst teams are generally small and their time must be leveraged – they don’t have to spend time on repetitive tasks. “Once you start delivering, are on the hook to do it constantly.” “Can’t do anything predictive if you’re reactive all the time.” “You can’t just rely on databases – you do need people.”
  • Extracting meaning from data – panelists repeatedly spoke of the need for first class dashboards – and for those dashboards to be flexible and fast (a primary benefit of our combined Vertica / Tableau solution). “People are more willing to experiment and run what-if scenarios with flexible dashboards” “Your data’s growing, but users want answers faster.”
    • One particularly interesting and notable comment from a Vertica customer – “Results are delivered so fast that I don’t believe it – this can’t be real.” (it is)
  • New capabilities – There was a great deal of discussion of enablement of new organizational capabilities as Big Data gets under control and becomes more available. “People are more willing to experiment because time to load and query data is orders of magnitude better” “When you change the network ecosystem, you can create new offerings and new value for customers” “Having intermediate data helps with disaster recovery and provides redundancy” “I don’t think I’m doing complex things, but then people tell me I am doing very complex things”
  • Time to value – Speed continued to be a theme – both in analyzing Big Data and creating organizational value – “We can answer questions much more quickly and get new data-oriented products into the pipeline for revenue.”, “I don’t need to talk to my manager or IT – I can answer that question right now.”, “You give people a taste of this stuff, and they just want you to do more and more and more”
Overall it was an outstanding event, and we plan to do more partner-related activities with our Business Intelligence and other partners, including the Tableau Customer Conference in early November. We hope to see you at a future event!

A Feather in Vertica’s CAP

In this post, I attempt to relate Vertica distributed system properties to the well known CAP theorem and provide a fault tolerance comparison with the well known HDFS block storage mechanism.

The CAP theorem, as originally presented by Brewer @ PODC 2000 reads:

The CAP Theorem

It is impossible for a web service to provide the following three

  • Consistency
  • Availability
  • Partition-tolerance

The CAP theorem is useful from a system engineering perspective because distributed systems must pick 2/3 of the properties to implement and 1/3 to give up. A system that “gives up” on a particular property strives makes a best effort but cannot provide solid guarantees. Different systems choose to give up on different properties, resulting in different behavior when failures occur. However, there is a fair amount of confusion about what the C, A, and P actually mean for a system.

  • Partition-tolerance – A network partition results in some node A being unable to exchange messages with another node B. More generally, the inability of the nodes to communicate. Systems that give up on P assume that all messages are reliably delivered without fail and nodes never go down. Pretty much any context in which the CAP theorem is invoked, the system in question supports P.
  • Consistency – For these types of distributed systems, consistency means that all operations submitted to the system are executed as if in some sequential order on a single node. For example, if a write is executed, a subsequent read will observe the new data. Systems that give up on C can return inconsistent answers when nodes fail (or are partitioned). For example, two clients can read and each receive different values.
  • Availability – A system is unavailable when a client does not receive an answer to a request. Systems that give up on A will return no answer rather than a potentially incorrect (or inconsistent) answer. For example, unless a quorum of nodes are up, a write will fail to succeed.

Vertica is a stateful distributed system and thus worthy of consideration under the CAP theorem:

  • Partition-tolerance – Vertica supports partitions. That is, nodes can fail or messages can fail to be delivered and Vertica can continue functioning.
  • Consistency – Vertica is consistent. All operations on Vertica are strongly ordered – i.e., there is a singular truth about what data is in the system and it can be observed by querying the database.
  • Availability – Vertica is willing to sacrifice availability in pursuit of consistency when failures occur. Without a quorum of nodes (over half), Vertica will shut down since no modification may safely be made to the system state. The choice to give up availability for consistency is a very deliberate one and represents cultural expectations for a relational database as well as a belief that a database component should make the overall system design simpler. Developers can more easily reason about the database component being up or down than about it giving inconsistent (dare I say … “wrong”) answers. One reason for this belief is that a lack of availability is much more obvious than a lack of consistency. The more obvious and simplistic a failure mode is, the easier integration testing will be with other components, resulting in a higher quality overall system.

In addition to requiring a quorum of up nodes, each row value must be available from some up node, otherwise the full state of the database is no longer observable by queries. If Vertica fully replicated every row on every node, the database could function any time it had quorum: any node can service any query. Since full replication significantly limits scale-out, most users employ a replication scheme which stores some small number of copies of each row – in Vertica parlance, K-Safety. To be assured of surviving any K node failures, Vertica will store K+1 copies of each row. However, it’s not necessary for Vertica to shut down the instant more than K nodes fail. For larger clusters, it’s likely that all the row data is still available. Data (or Smart) K-Safety is the Vertica feature that tracks inter-node data dependencies and only shuts down the cluster when node failure actually makes data unavailable. This feature achieves a significant reliability improvement over basic K-Safety, as shown in the graph below.

The key reason Data K-Safety scales better is that Vertica is careful about how it arranges the replicas to ensure that nodes are not too interdependent. Internally, Vertica arranges the nodes in a ring and adjacent nodes serve as replicas. For K=1, if node i fails, then nodes i-1 and i+1 become critical: failure of either one will bring down the cluster. The key take away is that for each node that fails, a constant number (2) of new nodes become critical, whereas in the regular K-Safety mechanism, failure of the K th node makes all N-K remaining nodes critical! While basic K=2 safety initially provides better fault tolerance, the superior scalability of Data K=1 Safety eventually dominates as the cluster grows in size.

Here we can draw an interesting comparison to HDFS, which also provides high availability access to data blocks in a distributed system. Each HDFS block is replicated and by default stored on three different nodes, which would correspond to a K of 2. HDFS provides no coordination between the replicas of each block: the nodes are chosen randomly (modulo rack awareness) for each individual block. By contrast, Vertica storing data on node i at K=2 would replicate that data on nodes i+1 and i+2 every time. If nodes 3, 6, and 27 fail, there is no chance that this brings down a Vertica cluster. What is the chance that it impacts HDFS? Well, it depends on how much data is stored – the typical block size is 64MB. The graph below presents the results of simulated block allocation on a 100 node cluster with replication factor of 3, computing the probability of a random 3-node failure making at least one block unavailable.

Assuming that you’re storing 50TB of data on your 100 node cluster, the fault tolerance of HDFS should be the same as a basic K=2 Vertica cluster – namely, if any 3 nodes fail, some block is highly likely to be unavailable. Data K-Safety with K=1 provides better fault tolerance in this situation. And here’s the real kicker: at K=1, we can fit 50% more data on the cluster due to less replication!

This comparison is worth a couple extra comments. First, HDFS does not become unavailable if you lose a single block – unless it’s the block your application really needs to run. Second, nodes experience correlated failures, which is why HDFS is careful to place replicas on different racks. We’ve been working on making Vertica rack-aware and have seen good progress. Third, the model assumes the mean-time-to-repair (MTTR) is short relative to the mean-time-to-failure (MTTF). In case of a non-transient failure, HDFS re-replicates the blocks of the failed node to any node that has space. Since Vertica aggressively co-locates data for increased query performance, it uses a more significant rebalance operation to carefully redistribute the failed node’s data to the other nodes. In practice, the recovery or rebalance operation is timely relative to the MTTF.

In conclusion, Vertica uses a combination of effective implementation and careful data placement to provide a consistent and fault tolerant distributed database system. We demonstrate that our design choices yield a system which is both highly fault tolerant and very resource efficient.


  • The CAP theorem was proved by Lynch in 2002 in the context of stateful distributed systems on an asynchronous network.


Big Interns For Big Data

“[My wife] won’t let me talk about work anymore.” — Intern overheard talking at lunch

Quotes can be forged, but casual lunchtime conversation tends to be very candid. Indeed, it’s the indirect signals that mean to the most to me as I coordinate the intern program for the second year.

Another intern expressed delighted surprise at how well the interns are integrated into our teams. I take serious pride in this trait of the Vertica Summer Intern Program, as we ensure our interns each have at least one personal mentor and project that matters to us and to them. With eight interns this year, we have them doing everything from releasing features to customers and researching ways to improve performance to analyzing Vertica usage patterns and improving our testing framework. Our interns represented some geographic diversity, hailing from MIT, UVa, UMass Amherst, University of Houston, Brown and Purdue. The program has doubled and with good reason – in the last six months, two of our interns from 2011 have started full-time, as did a fellow co-intern of mine from back in 2009.

We encourage our interns to work their 40 hours and then enjoy Boston. Still, during the week-long Intern User-Defined-Function Contest, one of the eventual winners told me at 10 PM he wanted to skip school and come work at Vertica, while another pair of interns extended their internships. Though all will be returning to school in the fall, I’m thrilled that we can inspire the interns this deeply and grateful to all my coworkers who helped choose them from the candidate pool.

Vertica intern party

Annual intern party at Shilpa's, complete with (brand new) traditions of single-ski water skiing and watermelon carving. Photos taken by Ramachandra CN

But it’s not all work at Vertica. Along with individual lunches with Vertica’s top-brass, we managed hiking trips, poker nights (intern-organized!), creative four-player bocce matches, horse riding, and water skiing. Trust Vertica interns to even take the weekly Counter Strike game and turn it into a data-collection event, loading in-game kill locations into a Vertica database. I leave you with a level heatmap produced by our interns’ very own Vertica User-Defined-Function.

Heat Map

Here we see the deadliest locations of the Counter Strike map Italy. Though the concentration of carnage while attempting to rescue the hostages in the upper left is unsurprising, we can also understand how dangerous each of the access paths to the hostages are. For the same contest, the other interns created an AVRO parser, a JSON parser, and an automatic email-sending function for their contest entries. Heat map from Mark Fay and Matt Fay

We’ll be keeping in touch with this year’s crop of interns as they finish here and return to their respective academic programs. Many people have helped with the intern program this year, but I feel Adam Seering deserves special mention for all his work in making this summer a success. I also appreciate the support our coworkers have given the intern program from 1-on-1 help to attending the intern presentations in numbers.

Thank you Vertica 2012 interns for all your hard work this summer. You have no idea how much positive feedback I’ve heard about you all!

Our interns ending a successful summer by riding off into the sunset. Literally.

Our interns ending a successful summer by riding off into the sunset. Literally. Photo taken by Ramachandra CN.

Vertica Moneyball and ‘R’. The perfect team!

Back in April, Colin’s blog on, “Moneyball – not just for baseball anymore” was a good example of describing how statistics can be used to make better decisions on and off the baseball field.  New measures can be created to better understand a player’s real contribution to a team.  For instance, most baseball players are familiar with the popular earned run average (ERA) measure for pitchers, but a new one that is becoming more popular is called WHIP (Walks plus Hits per Innings Pitched).

Here is how Wikipedia describes WHIP: While earned run average (ERA) measures the runs a pitcher gives up, WHIP more directly measures a pitcher’s effectiveness against the batters faced. It is calculated by adding the number of walks and hits allowed and dividing this sum by the number of innings pitched; therefore, the lower a pitcher’s WHIP, the better his performance.   Listed below is the calculation for WHIP.

 WHIP = (Walks + Hits)/ Innings Pitched.

Dashboards such as the following can be built demonstrating these new kinds of measures or key performance indicators (KPI) and how they can be used across a wider audience and provide more insight on teams and players.

Some of the other measures needed to accurately determine a person’s contribution to the team can only be implemented using a statistical package such as ‘R’.  Typically implementing a statistical package in an organization is not a trivial task for the following reasons:

1.)    Specialized domain expertise required – Statistics requires a new skill set to understand and use properly.

2.)    Data Access – Import and Export must be done into the statistical package.

3.)    Performance – Many of the statistical algorithms are compute intensive.

This article will demonstrate how Vertica 6 handles the first two items above and another article to soon be posted will show how Vertica 6 “Presto” has some distinct ‘R’ integration related “Performance” capabilities.

While it is true that understanding statistics can be challenging without proper training, having a group who fully understands the algorithms collaborate with the business domain experts ensures that proper implementation can be done.  Implementing these new algorithms in the database allows your organization to leverage the powerful statistics in their daily business analysis and reduce the time to market because they can now be treated as any other “standard” database function. The possibility for error is also reduced because no longer are complex “Extraction, Transformation and Load (ETL)” products required to import and export the data into the statistical package.  The entire process is now streamlined so that any BI tool or ETL tool in the organization can also leverage the new capability as well because they are now in the database.

So let’s put on our favorite baseball cap, in my case a Tiger cap, and take a closer look at how using ‘R’ can enhance our understanding of our favorite baseball teams and players.

As indicated before, “Moneyball” enlightened the baseball world with many new “measures” that are now almost common speak amongst baseball fans.  The scenario for this example could be a team might want to ensure they are paying their pitchers appropriately based on performance, or they might be interested in finding some more talented pitchers for their team.  Once these pitchers are determined, I want to group them together in “liked clusters” based on our key performance indicators (KPI). The two KPI’s I have decided to use are the WHIP calculation that we described above and another one called IPouts, which is simply the “number of outs pitched”.

Listed below is a simple query showing results for last year’s top pitchers sorted on the new measure called WHIP.

You can see very clearly why Justin Verlander was the MVP and Cy Young award winner last year.  His WHIP and IPouts where the best and he was third in ERA.   All of the measures provided so far can be implemented with standard SQL.  The next thing I want to do is group these pitchers into clusters based on my two measures of WHIP and IPouts.  To do this I used the new Vertica integration with a statistical package called ‘R’ to implement a clustering algorithm called KMeans.  In my case I want 3 clusters of the 91 pitchers from 2011 that qualified.  The column below called Geo.cluster was provided by the integration of ‘R’ in Vertica.

You can see that even in the top 10 we have pitchers in all of our 3 clusters. Keep in mind that lower numbers for WHIP and ERA are better and higher values for IPouts are better. Looking at the list above I now have some better insight on the players and I can focus on cluster 3 players and possibly some players from cluster 2. Listed below is an example of a histogram showing the WHIP on the X axis for all our 91 pitchers of 2011.  You can include these charts and graphs in your dashboards as well.

Other database competitors can also claim ‘R’ integration, but Vertica’s implementation provides better value to your organization because of its simplicity and performance.  Other vendors take an ‘R’ centric approach, which means your users have to know ‘R’ and use the ‘R’ suite of programming tools.  Vertica’s implementation is a more ‘data’ centric approach that shields the users from having to know and use the ‘R’ language.  Users can continue to use their favorite BI or query tool and now have access to ‘R’ capability.

This article demonstrated how statistics can be used to build new measures to provide more insight on a particular situation.  This kind of analysis can also be applied in your organization to help with detecting fraud etc.

Stay tuned on future posts that will give you more detail on how the kmeans and other statistical functions like page rank were implemented in Vertica using ‘R’.  Go Tigers!

For more details on how to implement R in Vertica please to the following blog

Get Started With Vertica Today

Subscribe to Vertica