
Archive for the ‘Engineering’ Category

Recapping the HP Vertica Boston Meet-Up

This week, some of our Boston-area HP Vertica users joined our team at the HP Vertica office in Cambridge, MA. Over some drinks and great food, we had the honor of hearing from HP Vertica power users Michal Klos, followed by Andrew Rollins of Localytics. Both Michal and Andrew offered valuable insight into how their businesses use the HP Vertica Analytics Platform on Amazon Web Services (AWS).

Michal runs HP Vertica in the cloud, hosted on AWS. The highlight of his presentation was a live demonstration of a Python script, built with Fabric (a Python library and command-line tool for remote administration over SSH) and Boto (a Python interface to AWS), that quickly set up and deployed a Vertica cluster in AWS. Launching HP Vertica Analytics Platform nodes in AWS eliminates the need to acquire hardware and allows for extremely speedy deployment. Michal was very complimentary of the AWS enhancements in the recently released version 6.1 of the HP Vertica software.

Michal Klos Demonstration
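For readers curious about the mechanics, here is a minimal sketch of the general Fabric-plus-Boto approach (my own illustration, not Michal’s actual script; the AMI ID, key name, and installer invocation are hypothetical): Boto launches the EC2 instances, and Fabric then runs setup commands on them over SSH.

"""
#!/usr/bin/env python
import boto.ec2
from fabric.api import env, sudo

def launch_cluster(count=3):
    # Launch EC2 instances from a (hypothetical) Vertica-ready AMI.
    conn = boto.ec2.connect_to_region('us-east-1')
    reservation = conn.run_instances(
        'ami-00000000',                      # hypothetical AMI ID
        min_count=count, max_count=count,
        instance_type='m1.xlarge',
        key_name='vertica-key')              # hypothetical key pair
    return [instance.id for instance in reservation.instances]

def install_vertica():
    # Invoked per node via: fab -H host1,host2,host3 install_vertica
    # Hypothetical invocation; consult the Vertica install docs for exact flags.
    sudo('/opt/vertica/sbin/install_vertica --hosts %s' % ','.join(env.hosts))
"""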

Following Michal’s demonstration, Andrew took the floor to talk about how Localytics uses the HP Vertica Analytics Platform to analyze user behavior in mobile and tablet apps. With HP Vertica, Localytics gives its customers access to granular detail in real time. Localytics caters to its clients by launching a dedicated node in the cloud for each customer. With the HP Vertica Analytics Platform powering their data in AWS, customers can start gathering insightful data almost immediately.

Our engineers then took the stage to serve as a panel for questions from the floor. It’s not often that our engineers get the opportunity to answer questions from customers and interested BI professionals in an open forum discussion. Everyone took full advantage of the occasion, asking a number of questions about upcoming features and current use cases.  In addition, our engineers were able to highlight a number of new features from the 6.1 release that the users in attendance may not have been taking advantage of yet.

Meet-ups serve as a fantastic catalyst for users and future users to interact with each other, share best practices, and have valuable conversations with different members of the HP Vertica team. We reiterate our thanks to Michal and Andrew, and to all those who joined us at our offices: thank you for an excellent meet-up!

Don’t miss another valuable opportunity to hear from fellow HP Vertica user Chris Wegrzyn of the Democratic National Committee in our January 24th webinar at 1 PM EST, where we will discuss how the HP Vertica Analytics Platform revolutionized the way a presidential campaign is run. Register now!

How to parse anything into Vertica using ExternalFilter

Vertica’s data-load process has three steps: “Get the data”, “Unpack the data”, and “Parse the data.” If you look on our Github site, there are handy helpers for the first two of those steps, ExternalSource and ExternalFilter, which let you call out to shell scripts or command-line tools (a quick sketch follows below). But there’s no helper for parsing. Why not? Because you don’t need one!
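For context, here is a hedged sketch of what the first two steps can look like with those helpers (my own example, not from the original post; the URL is hypothetical and the exact COPY syntax may vary by version):

"""
-- "Get the data" with ExternalSource, "unpack the data" with ExternalFilter.
-- The URL is hypothetical; gunzip -c decompresses the stream to stdout.
dbadmin=> COPY t WITH SOURCE ExternalSource(cmd='curl -s http://example.com/data.gz')
                 FILTER ExternalFilter(cmd='gunzip -c');
"""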

Earlier today, I was trying to load a simple XML data set into a table:

""" sample_data.xml:

<data>
  <record>
    <city>Cambridge, MA</city>
    <population>106038</population>
  </record>
  <record>
    <city>Arlington, MA</city>
    <population>42389</population>
  </record>
  <record>
    <city>Belmont, MA</city>
    <population>24194</population>
  </record>
</data>
"""

Vertica doesn’t have a built-in XML parser, so this might look like a real pain. But I got it loaded nice and quick with just a little bit of scripting.

First, we need something that can parse this file.  Fortunately, this can be done with just a few lines of Python:

""" xmlparser.py:

#!/usr/bin/env python
import sys, xml.etree.ElementTree

# Parse the whole XML document from stdin; each child of the root
# element (<data>) is one <record>.
for record in xml.etree.ElementTree.fromstringlist(sys.stdin).getchildren():
    keys = record.getchildren()
    # Emit one pipe-delimited line per record, matching vsql's default format.
    print '|'.join(key.text for key in keys)
"""

A very simplistic script: it reads the whole file into memory, and it assumes that the data is clean. But for this file, it’s all we need. For more complicated inputs, we could make the script fancier (see the sketch just below) or install and make use of a third-party tool (such as xml_grep, available as an add-on package in some Linux distributions).
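For instance, here is one way a fancier variant might look (a sketch of my own, not part of the original post): it streams records with iterparse() instead of reading the whole document into memory, and it tolerates empty fields.

"""
#!/usr/bin/env python
import sys, xml.etree.ElementTree

# iterparse() yields each element once its closing tag has been read, so
# memory use stays bounded by one <record> rather than the whole document.
for event, elem in xml.etree.ElementTree.iterparse(sys.stdin):
    if elem.tag == 'record':
        # A missing or empty field becomes an empty column value.
        print '|'.join(child.text or '' for child in elem)
        elem.clear()  # release the subtree we just emitted
"""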

Now, what happens when we run the original xmlparser.py on the raw data file?

"""
$ ~/xmlparser.py < sample_data.xml
Cambridge, MA|106038
Arlington, MA|42389
Belmont, MA|24194
"""

You may recognize this as the basic output format of vsql, our command-line client, which means that Vertica can load it directly. If you’ve installed ExternalFilter (by checking out our Github repository and running “make install” in the shell_load_package directory), just do the following:

"""
dbadmin=> CREATE TABLE cities (city VARCHAR, population INT);
dbadmin=> COPY cities FROM LOCAL 'sample_data.xml' WITH FILTER ExternalFilter('/path/to/xmlparser.py');
 Rows Loaded
-------------
           3
(1 row)

dbadmin=> SELECT * FROM cities;
     city      | population
---------------+------------
 Cambridge, MA |     106038
 Arlington, MA |      42389
 Belmont, MA   |      24194
(3 rows)
"""

Of course, with ExternalFilter, you don’t have to write code at all.  You have full access to the command-line tools installed on your server.  So for example, you can whip up a sed script and get a simple Apache web-log loader:

"""
dbadmin=> COPY weblogs FROM '/var/log/apache2/access.log' WITH FILTER ExternalFilter(cmd='sed ''s/^\([0-9\.]*\) \([^ ]*\) \([^ ]*\) \[\([^ ]* [^ ]*\)\] "\([^"]*\)" \([0-9]*\) \([0-9]*\) "\([^"]*\)" "\([^"]*\)"$/\1|\2|\3|\4|\5|\6|\7|\8|\9/''');
"""

Is this really technically parsing, if you’re just outputting more text? I’ll let the academics argue over that one. It’s true that a native C++ UDParser would likely yield better performance, and that these simple examples aren’t the most robust bits of code out there. But I didn’t have time today to carefully craft an elegant, optimized extension. I just wanted to load my data and get the job done. And these commands let me do so quickly and easily.

Big Data is Changing Software and (Product) Development as We Know It

I am often asked about “Big Data”, its use cases, real-world business value and how it will transform various products, services and markets.  This is one of my favorite topics, and I am fortunate in that I get to spend significant amounts of time with our amazing customers and partners who teach me a lot.  I am actually writing this from a plane after a few recent customer meetings that inspired me to share a point of view.

“Big Data” is already having, and will continue to have, the most impact in products and services where information about usage, experience, and behavior can be captured in a manner the consumer of that product or service accepts and does not find disruptive. Data warehousing has been around for a long time for retail transactions and purchasing behavior, but usage and experience measurement has never had an equivalent repository. It now does, and I believe this will lead to an exponential jump in the quality and variety of products and services delivered to consumers. In fact, this will not only improve existing solutions, but will spawn entirely new products and services in industries ranging from entertainment to medical treatment.

While the notion of experience analysis has been around for a long time through various manual observation efforts, focus groups, and survey methods, the results have been fragmented, small, and analyzed in what I’ll call a “basic” manner. Thanks to technology advancements and the resulting cost shifts, massive near real-time “feedback” collection can now be done through automation and sensor technology. While the prospect of having this information delights any product manager and merchandiser, the challenge of capturing, storing, and analyzing information at this scale is still foreign to many.


There is one community that is embracing this feedback fire hose with greater ease and speed than most: software developers. Vertica has several ISV customers who are leaving “breadcrumbs” in their code to collect usage information that can be anonymously transferred back to headquarters for very specific feedback on how users of the software are interacting with it. Their users agree to this data collection and sharing, and the ISVs ensure that it has no impact on the operational performance of their software.

These “breadcrumbs” can measure how long someone spends on a screen, which buttons they clicked to get there, how successful they were, and so on. For instance, good development organizations analyze the time a user should need to get from one place to another, that is, navigation within and between screens. If an ISV’s software is the track, this is the laser measurement for precise timing.

Vertica is an ideal platform to store and analyze this information. Using Vertica’s advanced analytic and pattern-matching capabilities, correlations in usage patterns can be identified, and developers can patch, redesign, or document accordingly to deliver a better experience to end users. For example, you could quite easily determine that users who spent 3 minutes on one screen, clicked a certain button, spent less than 1 minute on the next screen, and then quit might not be happy with their experience, compared with users who started in the same place but stayed online longer (a query sketch follows below). Further analysis could determine “why” through more traditional interview techniques to improve the experience.
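To make that concrete, here is a hedged sketch (my own; the usage_events table and its columns are hypothetical) of how Vertica’s event series pattern matching could express that click-path in SQL:

"""
-- Sessions where a long dwell is followed by a click, a short dwell,
-- and a quit. usage_events(user_id, event_time, event_name, dwell_seconds)
-- is a hypothetical breadcrumb table.
SELECT user_id, event_name, MATCH_ID()
FROM usage_events
MATCH (
    PARTITION BY user_id ORDER BY event_time
    DEFINE
        long_dwell  AS event_name = 'screen_view' AND dwell_seconds >= 180,
        click       AS event_name = 'button_click',
        short_dwell AS event_name = 'screen_view' AND dwell_seconds < 60,
        quit        AS event_name = 'session_end'
    PATTERN frustrated AS (long_dwell click short_dwell quit)
);
"""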

Why are software developers so eager to embrace this as early adopters? Well, one reason is that it gives them direct feedback on their work, without the sometimes editorialized version from sales, support, management, and yes, even product managers! Traditionally, most feedback to this community is sparse at best, with highly anecdotal sentiment mixed in. This method can augment that sentiment (which should still be captured through sales, support, and product management, by the way) with very complete data sets. The product managers at these customers actually love this capability, and many of them directly interact with and analyze the raw data collected.

Software developers also have the ability to make and control their own sensors, which is pretty cool when you think about it. The savvy developer can create these listening points at various places in the code. Savvy developers and product managers are spending time on these breadcrumbs because, while breadcrumbs require more work (just as good quality assurance does), the payback is huge and can ultimately save a lot of time. Recently I visited one of our customers, an enterprise software developer, that is piloting a project in this area and already has 8 billion rows of this type of information. Now that’s bigger than a breadbox!

This capability is not limited to SaaS vendors (although they certainly have more control and an easier time collecting the data). Our online gaming customers are at the forefront, but we see all ISVs getting into this. There is so much we can learn from software developers. What is especially exciting is seeing how physical sensors are being used in everything from automobiles to jet engines and even refrigerators to deliver the same type of feedback. There is no question: the sensor economy is upon us. In the end, this will lead to better products and services for you and me, the consumer, which is a good thing.

When Customers Buy You Beer, You Are on to Something

A few weeks ago, Shilpa, our VP of Engineering, was in New York City visiting prospective customers. While there, she also had an informal meet-up with new and existing customers. One of our new customers liked Vertica so much that he literally handed Shilpa money to buy the Vertica Engineering team beer.

So she did what all good managers do: delegated the acquisition to Sumeet. Thanks to his efforts, we had a very special addition to one of our recent engineering lunches.

Nick, cheers from the entire engineering team! Thank you for your gift – we will all keep working hard to ensure your experience with Vertica continues to be a pleasure.


Vertica Lunch

If you are intrigued, don’t just take my anecdotal customer stories as evidence that Vertica is great: try it yourself with the Vertica Community Edition.

P.S. If you are interested in working somewhere customers like your product so much that they send you tasty beverages, we are hiring in all areas. Within engineering specifically, we are looking for hackers at every level, from the lowest depths of the database server up through the client interfaces, the management console, and third-party integration programs. Consider coming in to talk with us: marcia.langdon@hp.com.

A Feather in Vertica’s CAP

In this post, I relate Vertica’s distributed system properties to the well-known CAP theorem and provide a fault tolerance comparison with the well-known HDFS block storage mechanism.

The CAP theorem, as originally presented by Brewer at PODC 2000, reads:

The CAP Theorem

It is impossible for a web service to provide the following three guarantees:

  • Consistency
  • Availability
  • Partition-tolerance

The CAP theorem is useful from a system engineering perspective because distributed systems must pick two of the three properties to implement and give up the third. A system that “gives up” on a particular property makes a best effort but cannot provide solid guarantees. Different systems choose to give up on different properties, resulting in different behavior when failures occur. However, there is a fair amount of confusion about what the C, A, and P actually mean for a system.

  • Partition-tolerance – A network partition results in some node A being unable to exchange messages with another node B; more generally, it is the inability of nodes to communicate. Systems that give up on P assume that all messages are reliably delivered and that nodes never go down. In pretty much any context in which the CAP theorem is invoked, the system in question supports P.
  • Consistency – For these types of distributed systems, consistency means that all operations submitted to the system are executed as if in some sequential order on a single node. For example, if a write is executed, a subsequent read will observe the new data. Systems that give up on C can return inconsistent answers when nodes fail (or are partitioned). For example, two clients can read and each receive different values.
  • Availability – A system is unavailable when a client does not receive an answer to a request. Systems that give up on A will return no answer rather than a potentially incorrect (or inconsistent) answer. For example, unless a quorum of nodes is up, a write will fail.

Vertica is a stateful distributed system and thus worthy of consideration under the CAP theorem:

  • Partition-tolerance – Vertica supports partitions. That is, nodes can fail or messages can fail to be delivered and Vertica can continue functioning.
  • Consistency – Vertica is consistent. All operations on Vertica are strongly ordered – i.e., there is a singular truth about what data is in the system and it can be observed by querying the database.
  • Availability – Vertica is willing to sacrifice availability in pursuit of consistency when failures occur. Without a quorum of nodes (over half), Vertica will shut down, since no modification may safely be made to the system state. The choice to give up availability for consistency is a very deliberate one; it reflects cultural expectations for a relational database as well as a belief that a database component should make the overall system design simpler. Developers can more easily reason about the database component being up or down than about it giving inconsistent (dare I say… “wrong”) answers. One reason for this belief is that a lack of availability is much more obvious than a lack of consistency. The more obvious and simple a failure mode is, the easier integration testing will be with other components, resulting in a higher-quality overall system.

In addition to requiring a quorum of up nodes, each row value must be available from some up node; otherwise the full state of the database is no longer observable by queries. If Vertica fully replicated every row on every node, the database could function any time it had quorum: any node could service any query. Since full replication significantly limits scale-out, most users employ a replication scheme that stores some small number of copies of each row – in Vertica parlance, K-Safety. To be assured of surviving any K node failures, Vertica stores K+1 copies of each row. However, it’s not necessary for Vertica to shut down the instant more than K nodes fail; for larger clusters, it’s likely that all the row data is still available. Data (or Smart) K-Safety is the Vertica feature that tracks inter-node data dependencies and shuts down the cluster only when a node failure actually makes data unavailable. This feature achieves a significant reliability improvement over basic K-Safety, as shown in the graph below.

The key reason Data K-Safety scales better is that Vertica is careful about how it arranges the replicas so that nodes are not too interdependent. Internally, Vertica arranges the nodes in a ring, and adjacent nodes serve as replicas. For K=1, if node i fails, then nodes i-1 and i+1 become critical: failure of either one will bring down the cluster. The key takeaway is that for each node that fails, a constant number (2) of new nodes become critical, whereas in the basic K-Safety mechanism, failure of the Kth node makes all N-K remaining nodes critical! While basic K=2 safety initially provides better fault tolerance, the superior scalability of Data K=1 Safety eventually dominates as the cluster grows in size.

Here we can draw an interesting comparison to HDFS, which also provides highly available access to data blocks in a distributed system. Each HDFS block is replicated and by default stored on three different nodes, which corresponds to a K of 2. HDFS provides no coordination between the replicas of each block: the nodes are chosen randomly (modulo rack awareness) for each individual block. By contrast, Vertica storing data on node i at K=2 would replicate that data on nodes i+1 and i+2 every time. If nodes 3, 6, and 27 fail, there is no chance that this brings down a Vertica cluster. What is the chance that it impacts HDFS? Well, it depends on how much data is stored – the typical block size is 64MB. The graph below presents the results of simulated block allocation on a 100-node cluster with a replication factor of 3, computing the probability that a random 3-node failure makes at least one block unavailable.
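The flavor of that calculation is easy to reproduce. Here is a back-of-the-envelope sketch of my own (not the original simulation code; it models the placement schemes analytically rather than by simulated allocation):

"""
#!/usr/bin/env python
from __future__ import division

def comb(n, k):
    # n choose k, computed iteratively so every intermediate value is exact.
    result = 1
    for i in range(k):
        result = result * (n - i) // (i + 1)
    return result

NODES = 100
failure_sets = comb(NODES, 3)        # possible 3-node failures: 161,700

# HDFS-style: each block picks 3 replica nodes at random, so a given
# 3-node failure wipes out a given block with probability 1/C(100,3).
# With B blocks, P(some block lost) = 1 - (1 - 1/C(100,3))^B.
blocks = 50 * 1024 * 1024 // 64      # ~819,200 64MB blocks in 50TB
p_hdfs = 1 - (1 - 1 / failure_sets) ** blocks

# Vertica ring, Data K=2 (replicas on i+1 and i+2): only the 100
# consecutive triples {i, i+1, i+2} are fatal, no matter how much data.
p_ring_k2 = NODES / failure_sets

# Vertica ring, Data K=1 (replica on i+1): a 3-node failure is fatal iff
# it contains an adjacent pair. 3-subsets of an n-cycle with no adjacent
# pair number (n / (n - 3)) * C(n - 3, 3).
safe_sets = comb(NODES - 3, 3) * NODES // (NODES - 3)
p_ring_k1 = (failure_sets - safe_sets) / failure_sets

print 'P(data loss | 3 nodes fail):'
print '  HDFS-style random placement, 50TB: %.4f' % p_hdfs     # ~0.99
print '  Ring placement, Data K=2:          %.6f' % p_ring_k2  # ~0.0006
print '  Ring placement, Data K=1:          %.4f' % p_ring_k1  # ~0.06
"""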

Assuming that you’re storing 50TB of data on your 100-node cluster, the fault tolerance of HDFS should be about the same as a basic K=2 Vertica cluster: if any 3 nodes fail, some block is highly likely to be unavailable. Data K-Safety with K=1 provides better fault tolerance in this situation. And here’s the real kicker: at K=1, we can fit 50% more data on the cluster, because each row is stored twice rather than three times!

This comparison is worth a couple extra comments. First, HDFS does not become unavailable if you lose a single block – unless it’s the block your application really needs to run. Second, nodes experience correlated failures, which is why HDFS is careful to place replicas on different racks. We’ve been working on making Vertica rack-aware and have seen good progress. Third, the model assumes the mean-time-to-repair (MTTR) is short relative to the mean-time-to-failure (MTTF). In case of a non-transient failure, HDFS re-replicates the blocks of the failed node to any node that has space. Since Vertica aggressively co-locates data for increased query performance, it uses a more significant rebalance operation to carefully redistribute the failed node’s data to the other nodes. In practice, the recovery or rebalance operation is timely relative to the MTTF.

In conclusion, Vertica uses a combination of effective implementation and careful data placement to provide a consistent and fault tolerant distributed database system. We demonstrate that our design choices yield a system which is both highly fault tolerant and very resource efficient.

Notes:

  • The CAP theorem was proved by Gilbert and Lynch in 2002 in the context of stateful distributed systems on an asynchronous network.


VLDB 2012 – Istanbul Bound!

I’ll be giving a talk next week about Vertica at VLDB 2012. If you happen to be in Istanbul, please stop by (Nga and I have a T-Shirt for you). Our paper can be found at the VLDB website:

The Vertica Analytic Database: C-Store 7 Years Later

http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf

At Vertica/HP, we pride ourselves on cutting-edge technology, informed by the latest academic research and applied with equally cutting-edge software craftsmanship. Over the years, we have benefited from close collaboration with academic researchers, befitting a company founded by Mike Stonebraker.

Vertica Systems was originally founded to commercialize the ideas from the C-Store research project, developed at MIT and other top universities and originally described in a VLDB 2005 paper. This year, I am proud that we have come full circle and published a rigorous technical description of the Vertica Analytic Database at VLDB 2012.

We look forward to many more years of technical breakthroughs and cool innovation in analytic database systems. Speaking of which, we are hiring! If you are a superstar (cliché, I know) and are interested in working with us to:

  • Design, build and test challenging distributed systems, database internals, and analytics systems software
  • Bring one of the very few new database engines to new customers who desperately need it

Drop us a line at marcia.langdon@hp.com
