Vertica

That Giant Sucking Sound is Your Big Data Science Project

Vertica recently hosted its second annual Big Data Conference in Boston, Massachusetts. It was very well attended, with over eight hundred people from about two hundred companies. We at Vertica love these events for a few reasons – first, because our customers, backed by a sound product, tend to be our best spokespeople, but also because these events are a chance for us to learn from them.

In one of the sessions, the presenter asked the audience how many of them had Hadoop installed today. Almost all the hands went up. This wasn’t too surprising given that the session was on Hadoop and Vertica integration. Then the presenter asked how many of those folks had actually paid for Hadoop. Most of the hands went down. Then the presenter asked how many of those folks felt that they were getting business value out of their investment. Only two or three hands stayed up. This was eye-opening for us at HP, and it was surprising to the audience as well. Everyone seemed to think they were doing something wrong with Hadoop that was causing them to miss out on the value.

Over the next few days, I made a point of tracking down folks in the audience – most of them Vertica customers I already knew – to get their thoughts on what the issues were. I thought it would be helpful to identify the signs of a big data science project: a project where a team has installed something like Hadoop and is experimenting with it in the hope of achieving new analytic insights, but isn’t on a clear path to deriving value from it. Some clear themes emerged, and they align with what my colleagues in the industry and I have been observing over the last few years. So, without further ado, here are the top five signs that you may have a big data science project in your enterprise:

    1. The project isn’t tied to business value, but has lots of urgency. Somebody on the leadership team went to a big data presentation and has hit the panic button. As a result, the team rushes ahead and does…something. And maybe splinters into different teams doing different things. We all know how well this will turn out.
    2. The technologies were chosen primarily because they beef up resumes. There’s so much hype around big data and the shortage of people with relevant skills that salaries are inflated. And in the face of a project with high urgency, nobody wants to stand still. So download some software! That open source stuff is great, right? While it’s generally true that multiple technologies can solve the same big data problems, some will fit with the business more readily than others. Maybe they’re easier to deploy. Maybe they don’t require extensive skill retooling for the staff. Maybe the TCO is better. Those are all good things to keep in mind during technology selection. But selecting technology for “resume polishing”? Not so much.
    3. The project is burdened with too much process. Most organizations already have well-defined governance processes in place for technology projects. And, so the reasoning goes, big data is basically just a bunch more of the same data and the same old reporting and analysis. So when it’s time to undertake a highly experimental big data analytics project that requires agility and adaptability, rigid process usually produces a risk-averse mindset in which failure at any level is seen as a bad thing. But for projects like these, failure during experimentation isn’t just expected – it’s a critical part of innovation.
    4. The “can’t-do” attitude. It’s been a well-understood fact of life for decades that IT departments often feel under siege – the business always asks for too much, never knows what it wants, and wants it yesterday. As a result, the prevailing attitude in many IT teams today is to start by saying “no”, and then line up a set of justifications for why radical change is bad.
    5. The decision-making impedance mismatch. Sometimes organizations need to move fast to develop their insights – maybe driven by the competition, or by a change in leadership. And…then they move slooooowly, and miss the opportunity. Other times, the change is a big one with impact across the company, and requires extensive buy-in and consensus. And…then it moves at a breakneck pace, and the organization develops antibodies and rejects the project.

    So if your organization has one or more big data projects underway, ask whether it suffers from any of these issues. If so, you may have a big data science project on your hands.

No limits: How Big Data changes competition

Data drives the bottom line, and technology is no longer limiting your competitors.

This post is condensed from a full article in the latest issue of Discover Performance, HP Software’s hub for IT thought leadership.

Business technology has always been a world of give and take. The more you ask for, the longer you wait. As technology improves, we compromise less—and in the case of Big Data, we can’t afford to compromise at all.

Today’s Big Data analytics platforms are making it possible for organizations to give the business everything: all the data, from all sources, in all formats, in real time, without limits. It’s a novel idea for most organizations, but it’s in the DNA of young, agile companies. This new breed of business is killing the competition by holding technology to the highest possible standard and putting data at the top of the value pyramid.

To compete, the rest of the market will need to act urgently to change their data ideologies: reject limitations in how they store and explore data, and in how they serve analytic insights to the business.

Competing with the new natives

“Leading companies today are changing the user experience while it is happening,” says HP Vertica VP Joy King. King says Twitter, as an example, is using real-time analysis of user demographics and usage trends to deploy new features and UI variations on the fly to limited “cohort” populations. The result is that people who use Twitter differently get a different experience—immediately.

“Compare a company using that approach to a company that’s relying on a report that comes once a week or once a month,” King says. “Who do you think will win?”

To stay on top of the new competitive pace set by the data-native enterprise, join Discover Performance, and get all our Big Data insights in your inbox.

Welcome to the HP Vertica 2014 BDC!

Well all, it’s finally here. We’ve come a long way in the past months of planning and preparing for this week, and we’re already off to a great start! Yesterday we kicked off the BDC with a Hackathon built around datasets provided by Conservation International. Below are the names of the winning teams:

FIRST

Team 6

  • Tomáš Jirotka
  • Pavel Burdanov
  • Nikolay Golov

SECOND

Team 8

  • Phil Ivers
  • Talal Assir
  • Zach Taylor
  • Pedro Pedreira

THIRD

Team 4

  • Norbert Krupa
  • Karel Jakubec
  • Durga Nemani
  • Jun Yin

Next up, we had a session of ASE testing and were delighted to find that all 9 of our participants passed the first round! In addition, our Best Practices forum was packed and ended up spilling over into a second room to make space for everyone (a good problem to have!).

Following all that, we wrapped up the day with a fantastic reception with food and drinks for all.

Check back with us tomorrow for another update!

HP Vertica named “Best Columnar Database”

“Like the emerging category of in-memory database technologies, columnar databases are deployed in market segments where speed of data analysis is paramount…”

On August 4th, we were pleased and excited to learn that the HP Vertica Analytics Platform was crowned the winner of this year’s Database Trends and Applications Readers’ Choice Award for “Best Columnar Database.” Here at HP, we work tirelessly to deliver the fastest, most cutting-edge Big Data platform in the world. This award is yet another recognition of our hard work and dedication, and for that we thank you! You can read the whole story here.

We hope to see you at the BDC next week!

System Mechanics & HP Vertica

Last week Andy Stubley, interviewed by Briefings Direct, discussed how HP Vertica is a critical component of System Mechanics’ Zen, a fault, performance, and social media service assurance solution for mobile networks. Below is a quick excerpt, along with a link to the full article – check it out!


Gardner: Now that we understand what you do, let’s get into how you do it. What’s beneath the covers in your Zen system that allows you to confidently say you can take any volume of data you want?

Stubley: Fundamentally, that comes down to the architecture we built for Zen. The first element is our data-integration layer. We have a technology that we developed over the last 10 years specifically to capture data in telco networks. It’s real-time and rugged and it can deal with any volume. That enables us to take anything from the network and push it into our real-time database, which is HP’s Vertica solution, part of the HP HAVEn family.

Vertica allows us to basically record any amount of data in real time and scale automatically on the HP hardware platform we also use. If we need more processing power, we can add more servers to scale transparently. That enables us to take in any amount of data, which we can then process…

You can read the rest of the article here.

Is Big Data Giving You Grief? Part 2: Anger

“We missed our numbers last quarter because we’re not leveraging Big Data! How did we miss this?!”

Continuing this five-part series on how organizations frequently go through the five stages of grief when confronting big data challenges, this post will focus on the second stage: anger.

It’s important to note that while an organization may begin confronting big data with something very like denial, anger usually isn’t far behind. As mentioned previously, very often the denial is rooted in the fact that the company doesn’t see the benefit in big data, or the benefits appear too expensive. And sometimes the denial can be rooted in a company’s own organizational inertia.

Moving past denial often entails learning that big data is worth pursuing. Ideally, this learning comes from self-discovery and research – looking at the various opportunities it represents, casting a broad net as to technologies for addressing it, and so on. Unfortunately, sometimes the learning can be much less pleasant, as the competition learns big data first…and suddenly is performing much better. This can show up in a variety of ways – your competitors suddenly have products that seem much more aligned with what people want to buy; their customer service improves dramatically while their overhead actually goes down; and so on.

For better or worse, this learning often results in something that looks an awful lot like organizational “anger”. As I look back at my own career to my days before HP, I can recall more than a few all-hands meetings hosted by somber executives highlighting deteriorating financials, as well as meetings featuring a fist-pounding leader or two talking about the need to change, dammit! It’s a natural part of the process wherein eyes are suddenly opened to the fact that change needs to occur. This anger is often focused on the parties involved in the situation. So, who are the targets, and why?

The Leadership Team

At any company worth its salt, the buck stops with the leadership team. A shortcoming of the company is a shortcoming of the leadership, so self-reflection would be a natural focus of anger. How did a team of experienced business leaders miss this? Companies task leaders with both the strategic and operational guidance of the business – so if they missed a big opportunity in big data, or shot it down because it looked too costly or risky, this is often seen as a problem.

Not to let anybody off the hook, but company leadership is also tasked with a responsibility to the investors. And this varies with the type of company, stage in the market, etc. In an organization tasked with steady growth, taking chances on something which appears risky – like a big data project where the benefits are less understood than the costs – is often discouraged. Also, leaders often develop their own “playbook” – their way of viewing and running a business that works. And not that many retool their skills and thinking over time. So their playbook might’ve worked great when brand value was determined by commercial airtime, and social media was word of mouth from a tradeshow. But the types and volume of information available are changing rapidly in the big data world, so that playbook may be obsolete.

Also, innovation is as much art as science. This is something near and dear to me, both in my educational background and my career interests. If innovation were a competence that could simply be taught or bought, we wouldn’t see a constant flow of companies appearing (and disappearing) across markets. We also wouldn’t see new ideas (the web! social networking!) appear overnight to upend entire segments of the economy. For most firms, recognizing the possibilities inherent in big data and acting on those possibilities represents innovation, so it’s not surprising that some leadership teams struggle.

The Staff

There are times when the upset over a missed big data opportunity is aimed at the staff. It’s not unusual to see a situation where the CEO of a firm asked IT to research big data opportunities, only to have the team come back and state that they weren’t worthwhile. And six months later, after discovering that the competition is eating their lunch, the CEO is a bit upset at the IT team.

While this is sometimes due to teams being “in the bunker” (see my previous post here), in my experience it occurs far more often due to the IT comfort zone. Early in my career, I worked in IT for a human resources department. The leader of the department asked a group of us to research new opportunities for the delivery of information to the HR team across a large geographic area (yeah, I’m dating myself a bit here…this was in the very early days of the web). We were all very excited about it, so we ran back to our desks and proceeded to install a bunch of software to see what it could do. In retrospect I have to laugh at myself about this – it never occurred to me to have a conversation with the stakeholders first! My first thought was to install the technology and experiment with it, then build something.

This is probably the most common issue I see in IT today. The technologies are different but the practice is the same. Ask a room full of techies to research big data with no business context and…they’ll go set up a bunch of technology and see what it can do! Will the solution meet the needs of the business? Hmm. Given the historical failure rate of large IT projects, probably not.

The Vendors

It’s a given that the vendors might get the initial blame for missing a big data opportunity. After all, they’re supposed to sell us stuff that solves our problems, aren’t they? As it turns out, that’s not exactly right. What they’re really selling us is stuff that solves problems for which their technology was built. Why? Well, that’s a longer discussion that Clayton Christensen has addressed far better than I ever could in “The Innovator’s Dilemma”. Suffice it to say that the world of computing technology continues to change rapidly today, and products built twenty years ago to handle data often are hobbled by their legacy – both in the technology and the organization that sells it.

But if a company is writing a large check every year to a vendor – it’s not at all unusual to see firms spend $1 million or more per year with technology vendors – they often expect a measure of thought leadership from that vendor. So if a company is blindsided by bad results because they’re behind on big data, it’s natural to expect that the vendor should have offered some guidance, even if it was just to steer the IT folks away from an unproductive big data science project (for more on that, see my blog post coming soon titled “That Giant Sucking Sound is Your Big Data Lab Experiment”).

Moving past anger

Organizational anger can be a real time-waster. Sometimes, assigning blame can gain enough momentum that it distracts from the original issue. Here are some thoughts on moving past this.

You can’t change the past, only the future. Learning from mistakes is a positive thing, but there’s a difference between looking at the causes and looking for folks to blame. It’s critical to identify the real reasons the opportunity was missed instead of playing the “blame game”, which sucks up precious time and may actually prevent identification of the real issue. I’ve seen more than one organization with what I call a “Teflon team” – a team which is never held responsible for any of the impacts its work has on the business, regardless of its track record. Once or twice, I’ve seen these teams do very poor work, but the responsibility was placed elsewhere. So the team never improves and the poor work continues. Watch out for the Teflon team!

Big data is bigger than you think. It’s big in every sense of the word because it represents not just the things we usually talk about – volume of data, variety of data, and velocity of data – but it also represents the ability to bring computing to bear on problems where this was previously impossible. This is not an incremental or evolutionary opportunity, but a revolutionary one. Can a business improve its bottom line by ten percent with big data? Very likely. Can it drive more revenue? Almost certainly. But it can also develop entirely new products and capabilities, and even create new markets.

So it’s not surprising that businesses may have a hard time recognizing this and coping with it. Business leaders accustomed to thinking of incremental boosts to revenue, productivity, margins, etc. may not be ready to see the possibilities. And the IT team is likely to be even less prepared. So while it may take some convincing to get the VP of Marketing to accept that Twitter is a powerful tool for evaluating their brand, asking IT to evaluate it in a vacuum is a recipe for confusion.

So understanding the true scope of big data and what it means for an organization is critical to moving forward.

A vendor is a vendor. Most organizations have one or more data warehouses today, along with a variety of tools for the manipulation, transformation, delivery, analysis, and consumption of data. So they will almost always have some existing vendor relationships around technologies which manage data. And most of those vendors will want to leverage the excitement around big data, so they will have some message along those lines. But it’s important to separate the technology from the message – and to distinguish between aging technology which has simply been rebranded and technology which can actually do the job.

Also, particularly in big data, there are “vendorless” or “vendor-lite” technologies which have become quite popular. By this I mean technologies such as Apache Hadoop, MongoDB, Cassandra, and so on. These are often driven less by a vendor with a product goal and more by a community of developers who cut their teeth on the concept of open-source software, which comes with very different business economics. Generally, without a single marketing department to control the message, these technologies can be associated with all manner of claims regarding capabilities – some accurate, and some not. This is a tough issue to confront because the messages can be conflicting and diffuse. The best advice I’ve got here is: if an open source technology sounds too good to be true, it very likely is.

Fortunately, this phase is a transitional one. Having come to terms with anger over the missed big data opportunity or risk, businesses then start to move forward…only to find their way blocked. This is when the bargaining starts. So stay tuned!

Next up – Bargaining: “Can’t we work with our current technologies (and vendors)? …but they cost too much!”

Physical Design Automation in the HP Vertica Analytic Database

Automatic physical database design is a challenging task. Different customers have different requirements and expectations, bounded by their resource constraints. To deal with these challenges in HP Vertica, we adopt a customizable approach by allowing users to tailor their designs for specific scenarios and applications. To meet different customer requirements, any physical database design tool should allow its users to trade off query performance and storage footprint for different applications.

In this blog, we present a technical overview of the Database Designer (DBD), a customizable physical design tool that primarily operates under three design policies:

  • Load-optimized – DBD proposes the minimum required set of super projections (containing all columns) that permit fast load and deliver the required fault tolerance.
  • Query-optimized – DBD may propose additional (possibly non-super) projections such that all workload queries are fully optimized.
  • Balanced – DBD proposes projections until it reaches the point where additional projections do not bring sufficient benefits in query optimization.

These options let users trade off query performance and storage footprint while considering update costs. The policies indirectly control the number of projections proposed to achieve the desired balance among query performance, storage, and load constraints.
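
To make the three policies concrete, here is a minimal sketch in Python of how each policy could translate into a stopping rule for proposing projections. This is an illustration of the behavior described above, not DBD’s actual implementation; the function, its parameters, and the benefit threshold are all illustrative assumptions.

```python
# Illustrative sketch only -- not DBD's actual code. Shows how the three
# design policies described above could map to different stopping rules
# when deciding whether to propose another projection for a table.

def should_stop_proposing(policy, num_projections, marginal_benefit,
                          all_queries_optimized, min_benefit=0.05):
    """Return True when the design policy says no more projections are needed.

    policy                -- 'load', 'query', or 'balanced'
    num_projections       -- projections proposed so far for this table
    marginal_benefit      -- estimated benefit of the best remaining candidate
                             (hypothetical units from a cost model)
    all_queries_optimized -- True once every workload query is fully optimized
    min_benefit           -- illustrative cutoff for the 'balanced' policy
    """
    if policy == "load":
        # Load-optimized: only the minimal superprojections needed for fast
        # load and fault tolerance; no query-specific projections.
        return num_projections >= 1
    if policy == "query":
        # Query-optimized: keep going until every workload query is optimized.
        return all_queries_optimized
    if policy == "balanced":
        # Balanced: stop once another projection no longer buys enough benefit.
        return all_queries_optimized or marginal_benefit < min_benefit
    raise ValueError(f"unknown design policy: {policy}")
```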

In real-world environments, query workloads often evolve over time. A projection that was helpful in the past may not be relevant today and could be wasting space or slowing down loads. This space could instead be reused to create new projections that optimize current workloads. To cater to such workload changes, DBD operates in two different modes (a hedged invocation sketch follows the list):

  • Comprehensive – DBD creates an entirely new physical design that optimizes for the current workload, retaining parts of the existing design that are beneficial and dropping parts that are not.
  • Incremental – customers can optionally create additional projections that optimize new queries without disturbing the existing physical design. Incremental mode is appropriate when workloads have not changed significantly. With no input queries, DBD optimizes purely for storage and load purposes.
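
For readers who want to experiment, the sketch below drives DBD from Python using the open-source vertica_python client. The DESIGNER_* meta-function names and signatures follow those documented for later Vertica releases and should be treated as assumptions to verify against your version’s documentation; the connection details, design name, table, and query file are placeholders.

```python
# A hedged sketch of driving the Database Designer programmatically via SQL.
# The DESIGNER_* meta-function names and signatures are assumptions based on
# later Vertica documentation -- verify them against your release.
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}  # placeholder credentials

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("SELECT DESIGNER_CREATE_DESIGN('my_design')")
    # Comprehensive mode: redesign for the whole current workload.
    # Use 'INCREMENTAL' instead when workloads have not changed much.
    cur.execute("SELECT DESIGNER_SET_DESIGN_TYPE('my_design', 'COMPREHENSIVE')")
    # One of the three design policies: 'LOAD', 'QUERY', or 'BALANCED'.
    cur.execute("SELECT DESIGNER_SET_OPTIMIZATION_OBJECTIVE('my_design', 'BALANCED')")
    # Register the tables under design and a file of workload queries.
    cur.execute("SELECT DESIGNER_ADD_DESIGN_TABLES('my_design', 'public.store_sales')")
    cur.execute("SELECT DESIGNER_ADD_DESIGN_QUERIES('my_design', '/tmp/queries.sql')")
    # Generate the design scripts (and optionally deploy the projections).
    cur.execute("SELECT DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY("
                "'my_design', '/tmp/design.sql', '/tmp/deploy.sql')")
```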

The key challenges involved in projection design are picking appropriate column sets, sort orders, cluster data distributions, and column encodings that optimize query performance while reducing space overhead and allowing faster recovery. The DBD proceeds in two major sequential phases. During the query optimization phase, DBD chooses projection columns, sort orders, and cluster distributions (segmentation) that optimize query performance. DBD enumerates candidate projections after extracting interesting column subsets by analyzing the query workload for predicate, join, group-by, order-by, and aggregate columns. Run-length encoding (RLE) is given special preference for columns appearing early in the sort order, because it is beneficial for both query performance and storage optimization. DBD then invokes the query optimizer for each workload query, presenting the candidate projections as choices. The query optimizer evaluates the query plans for all candidate projections, progressively narrowing the set of candidates until a stopping condition (based on the design policy) is reached. Query and table filters are applied during this process to filter out queries that are already sufficiently optimized by the chosen projections, and tables that have reached the target number of projections set by the design policy. DBD’s direct use of the optimizer’s cost and benefit model guarantees that it stays synchronized as the optimizer evolves over time.
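
The enumeration-and-narrowing loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions – a stubbed-in cost function stands in for the real query optimizer, and segmentation is naively taken from the leading sort column:

```python
# Simplified illustration of the query optimization phase described above:
# enumerate candidate projections from "interesting" column subsets, then
# progressively narrow them with a cost model. The cost function is a stub
# standing in for Vertica's real query optimizer.
from itertools import permutations

def enumerate_candidates(interesting_columns, max_sort_cols=3):
    """Yield candidate (sort order, segmentation, encoding) choices."""
    for k in range(1, max_sort_cols + 1):
        for sort_order in permutations(interesting_columns, k):
            yield {
                "sort_order": sort_order,
                # Naive assumption: segment (hash-distribute) on the leading
                # sort column; real DBD weighs cardinality and join patterns.
                "segmentation": sort_order[0],
                # RLE preference for the leading sort column, as noted above.
                "encodings": {sort_order[0]: "RLE"},
            }

def narrow_candidates(candidates, workload, cost_of_plan, keep=5):
    """Keep the candidates with the lowest total workload cost."""
    scored = sorted(
        (sum(cost_of_plan(q, c) for q in workload), i, c)
        for i, c in enumerate(candidates))   # index i breaks cost ties
    return [c for _, _, c in scored[:keep]]
```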

During the storage optimization phase, DBD finds the best non-RLE column encoding schemes that achieve the smallest storage footprint for the designed projections, via a series of empirical encoding experiments on sample data. In addition, DBD creates the required number of buddy projections, which contain the same data but are distributed differently across the cluster, enabling the design to tolerate node-down scenarios. When a node is down, buddy projections are employed to source the missing data from the down node. In HP Vertica, identical buddy projections (with the same sort orders and column encodings) enable faster recovery by facilitating direct copy of their physical storage structures, and DBD automatically produces such designs.

When DBD is invoked with an input set of workload queries, the queries are parsed and useful query metadata is extracted (e.g., the predicate, group-by, order-by, aggregate, and join columns). Design proceeds in iterations. In each iteration, one new projection is proposed for each table under design. Once an iteration is done, queries that have been optimized by the newly proposed projections are removed, and the remaining queries serve as input to the next iteration. If a design table has reached its target number of projections (decided by the design policy), it is not considered in future iterations, ensuring that no more projections are proposed for it. This process repeats until no design tables or design queries remain to propose projections for.
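
Putting the pieces together, the outer design loop just described looks roughly like the following sketch. The helpers propose_best_projection and is_optimized are hypothetical stand-ins for the candidate enumeration and optimizer evaluation covered earlier:

```python
# Simplified sketch of DBD's outer design loop: one new projection per table
# per iteration, optimized queries dropped between iterations, and per-table
# projection targets (set by the design policy) enforced. Assumes finite
# integer targets, so the loop terminates.

def run_design(tables, queries, target_projections,
               propose_best_projection, is_optimized):
    design = {t: [] for t in tables}      # projections proposed per table
    remaining = list(queries)             # queries not yet optimized
    active = set(tables)                  # tables still under design
    while active and remaining:
        for table in sorted(active):
            projection = propose_best_projection(table, remaining)
            if projection is None:        # no candidate helps this table
                active.discard(table)
                continue
            design[table].append(projection)
            if len(design[table]) >= target_projections[table]:
                active.discard(table)     # table reached its policy target
        # Drop queries that the newly proposed projections now optimize.
        remaining = [q for q in remaining if not is_optimized(q, design)]
    return design
```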

To form the complete search space for enumerating projections, we identify the following design features in a projection definition:

  • Feature 1: Sort order
  • Feature 2: Segmentation
  • Feature 3: Column encoding schemes
  • Feature 4: Column sets (select columns)

We enumerate choices for features 1 and 2 above, and use the optimizer’s cost and benefit model to compare and evaluate them (during the query optimization phase). Note that the choices made for features 3 and 4 typically do not affect query performance significantly. The winners decided by the cost and benefit model are then extended to full projections by filling out the choices for features 3 and 4, which have a large impact on load performance and storage (during the storage optimization phase).
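
As a concrete example of what a projection looks like once all four features are fixed, the sketch below assembles Vertica CREATE PROJECTION DDL from a dictionary of feature choices. The table and column names are hypothetical, and the generated statement should be checked against the CREATE PROJECTION syntax of your Vertica version:

```python
# Sketch: assemble a complete CREATE PROJECTION statement from the four
# design features described above. Table and column names are hypothetical.

def render_projection_ddl(name, table, features):
    col_defs = ",\n  ".join(
        f"{col} ENCODING {enc}" for col, enc in features["encodings"].items())
    select_cols = ", ".join(features["columns"])
    sort_cols = ", ".join(features["sort_order"])
    return (
        f"CREATE PROJECTION {name} (\n  {col_defs}\n) AS\n"
        f"SELECT {select_cols} FROM {table}\n"
        f"ORDER BY {sort_cols}\n"
        f"SEGMENTED BY HASH({features['segmentation']}) ALL NODES;"
    )

features = {
    "columns":      ["sale_date", "store_id", "amount"],   # feature 4
    "sort_order":   ["sale_date", "store_id"],             # feature 1
    "segmentation": "store_id",                            # feature 2
    # RLE on the leading sort columns, per the preference noted earlier.
    "encodings":    {"sale_date": "RLE", "store_id": "RLE",
                     "amount": "AUTO"},                     # feature 3
}
print(render_projection_ddl("store_sales_p1", "public.store_sales", features))
```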

In summary, the HP Vertica Database Designer is a customizable physical database design tool that works with a set of configurable input parameters, allowing users to trade off query performance, storage footprint, fault tolerance, and recovery time to meet their requirements, and optionally to override design features.
