Archive for the ‘vertica’ Category

Top 4 Considerations When Evaluating a Data Analytics Platform

From fraud detection to clickstream analytics to simply building better products or delivering a better customer experience, Big Data use cases abound, with analytics at the core.

With a solid business or use case in place, the next step organizations typically take is to investigate and evaluate the analytics technology with which to accomplish their analysis, often starting with a data analytics platform. But what requirements should you base your evaluation on?

The Winter Corporation, the large-scale data experts, just finalized an in-depth white paper (The HP Vertica Analytics Platform: Large Scale Use and Advanced Analytics) that presents findings from hands-on evaluation, independent research, customer and employee interviews, and documentation review.

Intended for a more technical audience, this white paper focuses on key evaluation criteria that your organization can use as a guide as you conduct your own evaluation.



Winter Corporation identified these key feature areas as critical for any data analytics platform:

1. Architecture
• Column store architecture
• Shared nothing parallelism
• Cluster size and elasticity
• Smart K-Safety based availability
• Hybrid storage model
• Multiple database isolation modes
• Both bulk load and trickle feed

2. Performance
• Extensive data compression and data encoding
• Read-optimized storage
• Highly parallel operation
• Storage of multiple projections
• Automatic physical database design

3. Generally Useful and Noteworthy Features for Large-Scale Use
• Export-import
• Backup/restore
• Workload analyzer
• Workload management
• Role-based security

4. Extensions for Advanced Analytics
• SQL extensions
• Built-in functions
• User-defined extensions
• Flexibility in accessing and analyzing all data (structured, semistructured, or unstructured)

Finally, once you have evaluated and confirmed that the data analytics platform meets your feature and technology requirements, you will want to hear from other organizations that have deployed large-scale analytics initiatives in real-world environments.

The white paper concludes with a write-up on how Zynga, a social game services company with more than 240 million users of its online games, stores the actions of every player in every game — about 6 TB per day of data — in near-real time in the HP Vertica Analytics Platform. No matter where in the world a game event occurs, the data can be retrieved via a report or query from the central HP Vertica database no more than five minutes later.

Optimizing Value – Creating a Conversational Relationship with Your Big Data

I spent most of the past week on the road, attending Gartner Symposium in Orlando and then later in the week at Strata Hadoop World in NYC. (For more, see my colleague Jeff Healey’s excellent recap of Hadoop World here.)

In the course of delivering the session ‘Big Data: Turning the Information Overload into an Information Advantage’ with my colleague Jerome Levadoux of our sister company Autonomy, and in just walking the events in general, I spoke to many people and, unsurprisingly, found the interest level in Big Data continuing to skyrocket.

Some of the most notable comments came from those who had already begun to tackle the Big Data challenge, since so many are trying to uncover the fourth ‘V’ of Big Data: Value.

What I continue to hear is that the Value of effectively leveraging Big Data (or, as we across HP like to call it, ‘Information Optimization’) lies in fundamentally changing the relationship between the organization and its data. In particular, moving from static queries that take minutes, hours, or sometimes days to run to nearly instantaneous answers that lead to more interactive ‘conversations’ with the data completely changes how business executives perceive their data, and allows them to gain significantly more meaning and value.

Suddenly, it is no longer “specify the reports, set up the queries, run the reports, deliver to the business users” daily or weekly (rinse and repeat), but “I have a question, I need an answer”, which delivered in near-real-time via a platform such as Vertica then leads quickly to follow-on questions, what-if scenarios, and a virtuous cycle that puts the data – and the Analysts/Data Scientists who provide access to it – in a much more strategic and business-critical role.

My colleague Jim Campbell discussed this during his visit to Cloudera’s booth at Hadoop World.

Live from Strata + Hadoop World 2012: Jim Campbell, Vertica from Cloudera.

If you want to take a live look at how Vertica can add game-changing Velocity to your organization’s conversations with your Big Data, sign up for an Evaluation today.

When Customers Buy You Beer, You Are on to Something

A few weeks ago, Shilpa, our VP of Engineering, was in New York City visiting prospective customers. While there, she also had an informal meetup with new and existing customers. One of our new customers liked Vertica so much that he literally handed Shilpa money to buy the Vertica Engineering team beer.

So, she did what all good managers do – delegate the acquisition to Sumeet. Thanks to his efforts we had a very special addition to one of our recent engineering lunches.

Nick, cheers from the entire engineering team! Thank you for your gift – we will all keep working hard to ensure your experience with Vertica continues to be a pleasure.


Vertica Lunch

If you are intrigued, don’t take my anecdotal customer stories for why Vertica is great – try it yourself with the Vertica Community Edition.

P.S. If you are interested in working somewhere customers like your product so much they send you tasty beverages, we are hiring in all areas. Within engineering specifically, we are looking for hackers at every level, from the lowest depths of the database server up through the client interfaces, the management console, and third-party integration programs. Consider coming in to talk with us.

The First Boston Area Vertica User Group

We recently held the first Boston-area Vertica meetup / user group, and it was a huge success! The crowd consisted of a few members from Vertica, representatives from 7 area Vertica customers, a Vertica partner, Vertica consultants/experts, and also a few (hopefully) future Vertica users!  For the first hour it was all about Vertica users meeting each other for the first time and learning about how each of them uses the platform, why they use it, what they like about it, tips and tricks, etc. It was pretty cool to take a back seat and listen to them talk about our database!

We had a few speakers. Up first was Syncsort, a Vertica partner on the ETL side. They spoke about how Vertica connects to Syncsort and the benefits of using it with Vertica’s database. Next up, Compete spoke about how they use Vertica and the benefits it brings to their business.

Seth from Compete talking about the fastest database in the world!

We had two special guests – Colin Mahony, Vertica’s CEO, and Shilpa Lawande, Vertica’s VP of Engineering – say a few words and answer questions from the crowd!  I thought it was awesome not only for our customers to meet them, but for Colin and Shilpa to meet with the Vertica community as well!

That was the play-by-play for the first Boston Vertica User Group meetup.  It was a success, and I am hoping we grow this group with more and more Vertica enthusiasts! Special thanks to Compete for hosting the first event. If you have not already, make sure you sign up right here and look for the new Meetup to be announced soon! Don’t miss out!

Vertica Gives Back!


HP encourages all employees worldwide to contribute time and energy toward good causes.  Last week, Vertica participated in the second annual Tech Gives Back event hosted by TUGG, a Boston open source philanthropic group which offers local companies a chance to give back to their community through charitable day-long events.  Projects include preparing and packaging meals for community service groups, sorting and inventorying clothes, toys, and books for donation, landscaping, cleaning, and painting line games outside of schools and shelters, and more.

Vertica Gives Back!

This year, Vertica’s project was held at the W. Hennigan Elementary School in Jamaica Plain.  Alongside a few other helpful companies, we spent the day clearing out an overgrown garden, building a couple of impressive benches and flower beds, and repainting everything that had ever had paint on it, other than the building itself – and that was a lot of paint!  There were many murals, games, and railings that were rusted, faded, and long overdue for a fresh coat.  With around 50 people working for 5 hours, the improvement was absolutely amazing.

Team Vertica adds a fresh coat of paint.

Just as the last lines of paint were drawn and had barely dried, the bell rang and the floodgates opened.  The kids came running out and shouted with excitement at the new look.  It was truly a rewarding experience.

We look forward to the third annual Tech Gives Back event in 2013, and we are always looking for more ways to give back, so please, if you know of any, send them our way!

If you like technology, good causes, and ping pong (more on that later), join the fun with the Vertica team – we’re hiring!

The finished product!

HP Vertica and Tableau Software Customers Speak Out in Philadelphia

It was my distinct pleasure this week to participate in a joint customer roundtable at the Cira Center in Philadelphia, co-sponsored by HP Vertica and our partner Tableau Software, and featuring a number of our respective and joint customers speaking out on topics related to Big Data.

Our panelists, who did a terrific job interacting with an audience of more than 50 of their peers, included David Baker of IMS Health, George Chalissery of hMetrix, Amit Garg of Comcast, Seth Madison of Compete, and Elizabeth Worster of State Street Global Advisors.

The discussion essentially centered on five themes related to Big Data. They included (with unattributed comments from the panelists):

  • Democratizing data – all of our panelists discussed the value of giving business users the ability to understand data and make ad hoc requests themselves – as well as extending some of those capabilities outside the walls of the enterprise. A number of concerns and questions came from the audience as to how you handle security when democratizing data which were addressed by our panelists. “Self-reliance really sings to me.” “We have internal and external users – and increasingly the external users are our clients”. 
  • Getting more productivity out of small teams – related to the previous point, data analyst teams are generally small and their time must be leveraged – they can’t afford to spend time on repetitive tasks. “Once you start delivering, you are on the hook to do it constantly.” “Can’t do anything predictive if you’re reactive all the time.” “You can’t just rely on databases – you do need people.”
  • Extracting meaning from data – panelists repeatedly spoke of the need for first class dashboards – and for those dashboards to be flexible and fast (a primary benefit of our combined Vertica / Tableau solution). “People are more willing to experiment and run what-if scenarios with flexible dashboards” “Your data’s growing, but users want answers faster.”
    • One particularly interesting and notable comment from a Vertica customer - “Results are delivered so fast that I don’t believe it – this can’t be real.” (it is)
  • New capabilities – There was a great deal of discussion of enablement of new organizational capabilities as Big Data gets under control and becomes more available. “People are more willing to experiment because time to load and query data is orders of magnitude better” “When you change the network ecosystem, you can create new offerings and new value for customers” “Having intermediate data helps with disaster recovery and provides redundancy” “I don’t think I’m doing complex things, but then people tell me I am doing very complex things”
  • Time to value – Speed continued to be a theme – both in analyzing Big Data and creating organizational value – “We can answer questions much more quickly and get new data-oriented products into the pipeline for revenue.”, “I don’t need to talk to my manager or IT – I can answer that question right now.”, “You give people a taste of this stuff, and they just want you to do more and more and more”
Overall it was an outstanding event, and we plan to do more partner-related activities with our Business Intelligence and other partners, including the Tableau Customer Conference in early November. We hope to see you at a future event!

A Feather in Vertica’s CAP

In this post, I attempt to relate Vertica’s distributed system properties to the well-known CAP theorem and to provide a fault-tolerance comparison with the widely used HDFS block storage mechanism.

The CAP theorem, as originally presented by Brewer at PODC 2000, reads:

The CAP Theorem

It is impossible for a web service to provide the following three guarantees:

  • Consistency
  • Availability
  • Partition-tolerance

The CAP theorem is useful from a system engineering perspective because distributed systems must pick two of the three properties to implement and give up the third. A system that “gives up” on a particular property makes a best effort but cannot provide solid guarantees. Different systems choose to give up on different properties, resulting in different behavior when failures occur. However, there is a fair amount of confusion about what the C, A, and P actually mean for a system.

  • Partition-tolerance – A network partition results in some node A being unable to exchange messages with another node B – more generally, the inability of nodes to communicate. Systems that give up on P assume that all messages are reliably delivered without fail and that nodes never go down. In pretty much any context in which the CAP theorem is invoked, the system in question supports P.
  • Consistency – For these types of distributed systems, consistency means that all operations submitted to the system are executed as if in some sequential order on a single node. For example, if a write is executed, a subsequent read will observe the new data. Systems that give up on C can return inconsistent answers when nodes fail (or are partitioned). For example, two clients can read and each receive different values.
  • Availability – A system is unavailable when a client does not receive an answer to a request. Systems that give up on A will return no answer rather than a potentially incorrect (or inconsistent) answer. For example, unless a quorum of nodes are up, a write will fail to succeed.

Vertica is a stateful distributed system and thus worthy of consideration under the CAP theorem:

  • Partition-tolerance – Vertica supports partitions. That is, nodes can fail or messages can fail to be delivered and Vertica can continue functioning.
  • Consistency – Vertica is consistent. All operations on Vertica are strongly ordered – i.e., there is a singular truth about what data is in the system and it can be observed by querying the database.
  • Availability – Vertica is willing to sacrifice availability in pursuit of consistency when failures occur. Without a quorum of nodes (over half), Vertica will shut down, since no modification can safely be made to the system state. The choice to give up availability for consistency is a very deliberate one; it reflects both cultural expectations for a relational database and a belief that a database component should make the overall system design simpler. Developers can more easily reason about the database component being up or down than about it giving inconsistent (dare I say … “wrong”) answers. One reason for this belief is that a lack of availability is much more obvious than a lack of consistency. The more obvious and simple a failure mode is, the easier integration testing will be with other components, resulting in a higher-quality overall system.
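The quorum rule described above can be sketched in a few lines. This is illustrative only; the function names are hypothetical and not Vertica’s actual API:

```python
# Illustrative sketch of a CP-style choice: without a majority of nodes up,
# refuse the operation (sacrificing availability) rather than risk an
# inconsistent answer. `handle_write` and `has_quorum` are made-up names.

def has_quorum(n_total, n_up):
    # "Over half" of the nodes must be up.
    return n_up > n_total // 2

def handle_write(n_total, up_nodes, payload):
    if not has_quorum(n_total, len(up_nodes)):
        raise RuntimeError("no quorum: refusing write rather than risk inconsistency")
    # ...replicate `payload` to the up nodes here...
    return "committed"

print(handle_write(5, [0, 1, 2], "row"))   # 3 of 5 nodes up: quorum holds
```

Note that with 4 nodes, 2 up is not a quorum: a majority must be strictly more than half, which is what prevents two disconnected halves of a cluster from both accepting writes.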

In addition to requiring a quorum of up nodes, each row value must be available from some up node, otherwise the full state of the database is no longer observable by queries. If Vertica fully replicated every row on every node, the database could function any time it had quorum: any node can service any query. Since full replication significantly limits scale-out, most users employ a replication scheme which stores some small number of copies of each row – in Vertica parlance, K-Safety. To be assured of surviving any K node failures, Vertica will store K+1 copies of each row. However, it’s not necessary for Vertica to shut down the instant more than K nodes fail. For larger clusters, it’s likely that all the row data is still available. Data (or Smart) K-Safety is the Vertica feature that tracks inter-node data dependencies and only shuts down the cluster when node failure actually makes data unavailable. This feature achieves a significant reliability improvement over basic K-Safety, as shown in the graph below.

The key reason Data K-Safety scales better is that Vertica is careful about how it arranges the replicas to ensure that nodes are not too interdependent. Internally, Vertica arranges the nodes in a ring, and adjacent nodes serve as replicas. For K=1, if node i fails, then nodes i-1 and i+1 become critical: failure of either one will bring down the cluster. The key takeaway is that for each node that fails, a constant number (2) of new nodes become critical, whereas in the regular K-Safety mechanism, failure of the Kth node makes all N-K remaining nodes critical! While basic K=2 safety initially provides better fault tolerance, the superior scalability of Data K=1 Safety eventually dominates as the cluster grows in size.
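The ring placement described above can be sketched as a simplified model (not Vertica’s actual implementation): the cluster survives a set of failures as long as quorum holds and no node together with all of its ring-adjacent replica holders is down.

```python
# Simplified model of ring-style Data K-Safety (illustrative only).
# With safety level k, node i's data also lives on nodes i+1..i+k (mod N).

def cluster_survives(n_nodes, failed, k=1):
    """Return True if quorum holds and every node's data is still reachable.

    Data for node i is lost only if i and all of its replica holders are down.
    """
    failed = set(failed)
    if len(failed) > n_nodes // 2:        # quorum check: over half must be up
        return False
    for i in range(n_nodes):
        holders = {(i + j) % n_nodes for j in range(k + 1)}
        if holders <= failed:             # every copy of node i's data is down
            return False
    return True

# On a 100-node ring with K=1, non-adjacent failures are survivable:
print(cluster_survives(100, {3, 6, 27}))   # no two failed nodes are adjacent
# ...but two adjacent failures lose a node's data entirely:
print(cluster_survives(100, {6, 7}))
```

This makes the scaling argument concrete: for the cluster to go down at K=1, two *adjacent* nodes must fail, rather than any two nodes as in basic K=1 safety.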

Here we can draw an interesting comparison to HDFS, which also provides high-availability access to data blocks in a distributed system. Each HDFS block is replicated and by default stored on three different nodes, which would correspond to a K of 2. HDFS provides no coordination between the replicas of each block: the nodes are chosen randomly (modulo rack awareness) for each individual block. By contrast, Vertica storing data on node i at K=2 would replicate that data on nodes i+1 and i+2 every time. If nodes 3, 6, and 27 fail, there is no chance that this brings down a Vertica cluster. What is the chance that it impacts HDFS? Well, it depends on how much data is stored – the typical block size is 64 MB. The graph below presents the results of simulated block allocation on a 100-node cluster with a replication factor of 3, computing the probability of a random 3-node failure making at least one block unavailable.
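Under the stated assumptions (each block’s three replicas placed on a uniformly random 3-node subset, no rack awareness), the quantity the simulation estimates also has a closed form; a sketch:

```python
from math import comb

def p_block_loss(n_nodes=100, n_blocks=800_000, replication=3):
    """P(at least one block unavailable) when `replication` nodes fail at once,
    assuming each block's replicas occupy an independent, uniformly random
    node subset (no rack awareness)."""
    # A block is lost only if its replica set is exactly the failed set.
    p_single = 1 / comb(n_nodes, replication)
    return 1 - (1 - p_single) ** n_blocks

# 50 TB at 64 MB per block is roughly 800,000 blocks:
print(round(p_block_loss(), 3))   # → 0.993
```

At that scale a random 3-node failure almost certainly loses some block, which matches the post’s point that HDFS with replication 3 behaves like basic K=2 safety once enough data is stored.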

Assuming that you’re storing 50TB of data on your 100 node cluster, the fault tolerance of HDFS should be the same as a basic K=2 Vertica cluster – namely, if any 3 nodes fail, some block is highly likely to be unavailable. Data K-Safety with K=1 provides better fault tolerance in this situation. And here’s the real kicker: at K=1, we can fit 50% more data on the cluster due to less replication!

This comparison is worth a couple extra comments. First, HDFS does not become unavailable if you lose a single block – unless it’s the block your application really needs to run. Second, nodes experience correlated failures, which is why HDFS is careful to place replicas on different racks. We’ve been working on making Vertica rack-aware and have seen good progress. Third, the model assumes the mean-time-to-repair (MTTR) is short relative to the mean-time-to-failure (MTTF). In case of a non-transient failure, HDFS re-replicates the blocks of the failed node to any node that has space. Since Vertica aggressively co-locates data for increased query performance, it uses a more significant rebalance operation to carefully redistribute the failed node’s data to the other nodes. In practice, the recovery or rebalance operation is timely relative to the MTTF.

In conclusion, Vertica uses a combination of effective implementation and careful data placement to provide a consistent and fault tolerant distributed database system. We demonstrate that our design choices yield a system which is both highly fault tolerant and very resource efficient.


  • The CAP theorem was proved by Lynch in 2002 in the context of stateful distributed systems on an asynchronous network.

