Vertica

Archive for the ‘Engineering’ Category

Is Big Data Giving You Grief? Part Two: Anger

“We missed our numbers last quarter because we’re not leveraging Big Data! How did we miss this?!”

Continuing this five-part series on how organizations frequently go through the five stages of grief when confronting big data challenges, this post focuses on the second stage: anger.

It’s important to note that while an organization may begin confronting big data with something very like denial, anger usually isn’t far behind. As mentioned previously, very often the denial is rooted in the fact that the company doesn’t see the benefit in big data, or the benefits appear too expensive. And sometimes the denial can be rooted in a company’s own organizational inertia.

Moving past denial often entails learning – that big data is worth pursuing. Ideally, this learning comes from self-discovery and research – looking at the various opportunities it represents, casting a broad net as to technologies for addressing it, etc. Unfortunately, sometimes the learning can be much less pleasant as the competition learns big data first…and suddenly is performing much better. This can show up in a variety of ways – your competitors suddenly have products that seem much more aligned with what people want to buy; their customer service improves dramatically while their overhead actually goes down; and so on.

For better or worse, this learning often results in something that looks an awful lot like organizational “anger”. As I look back at my own career to my days before HP, I can recall more than a few all-hands meetings hosted by somber executives highlighting deteriorating financials, as well as meetings featuring a fist-pounding leader or two talking about the need to change, dammit! It’s a natural part of the process wherein eyes are suddenly opened to the fact that change needs to occur. This anger is often focused on the parties involved in the situation. So, who’re the targets, and why?

The Leadership Team

At any company worth its salt, the buck stops with the leadership team. A shortcoming of the company is a shortcoming of the leadership, so self-reflection is a natural focus of anger. How did a team of experienced business leaders miss this? Companies task leaders with both the strategic and operational guidance of the business – so if they missed a big opportunity in big data, or shot it down because it looked too costly or risky, this is often seen as a problem.

Not to let anybody off the hook, but company leadership is also tasked with a responsibility to the investors. And this varies with the type of company, stage in the market, etc. In an organization tasked with steady growth, taking chances on something which appears risky – like a big data project where the benefits are less understood than the costs – is often discouraged. Also, leaders often develop their own “playbook” – their way of viewing and running a business that works. And not that many retool their skills and thinking over time. So their playbook might’ve worked great when brand value was determined by commercial airtime, and social media was word of mouth from a tradeshow. But the types and volume of information available are changing rapidly in the big data world, so that playbook may be obsolete.

Also, innovation is as much art as science. This is something near and dear to me, both in my educational background and in my career interests. If innovation were a competence that could simply be taught or bought, we wouldn’t see a constant flow of companies appearing (and disappearing) across markets. We also wouldn’t see new ideas (the web! social networking!) appear overnight to upend entire segments of the economy. For most firms, recognizing the possibilities inherent in big data and acting on those possibilities represents innovation, so it’s not surprising that some leadership teams struggle.

The Staff

There are times when the upset over a missed big data opportunity is aimed at the staff. It’s not unusual to see a situation where the CEO of a firm asks IT to research big data opportunities, only to have the team come back and state that they aren’t worthwhile. And six months later, after discovering that the competition is eating their lunch, the CEO is a bit upset at the IT team.

While this is sometimes due to teams being “in the bunker” (see my previous post here), in my experience it occurs far more often due to the IT comfort zone. Early in my career, I worked in IT for a human resources department. The leader of the department asked a group of us to research new opportunities for the delivery of information to the HR team across a large geographic area (yeah, I’m dating myself a bit here…this was in the very early days of the web). We were all very excited about it, so we ran back to our desks and proceeded to install a bunch of software to see what it could do. In retrospect I have to laugh at myself about this – it never occurred to me to have a conversation with the stakeholders first! My first thought was to install the technology and experiment with it, then build something.

This is probably the most common issue I see in IT today. The technologies are different but the practice is the same. Ask a room full of techies to research big data with no business context and…they’ll go set up a bunch of technology and see what it can do! Will the solution meet the needs of the business? Hmm. Given the historical failure rate of large IT projects, probably not.

The Vendors

It’s a given that the vendors might get the initial blame for missing a big data opportunity. After all, they’re supposed to sell us stuff that solves our problems, aren’t they? As it turns out, that’s not exactly right. What they’re really selling us is stuff that solves problems for which their technology was built. Why? Well, that’s a longer discussion that Clayton Christensen has addressed far better than I ever could in “The Innovator’s Dilemma”. Suffice it to say that the world of computing technology continues to change rapidly today, and products built twenty years ago to handle data often are hobbled by their legacy – both in the technology and the organization that sells it.

But if a company is writing a large check every year to a vendor – it’s not at all unusual to see firms spend $1 million or more per year with technology vendors – they often expect a measure of thought leadership from that vendor. So if a company is blindsided by bad results because they’re behind on big data, it’s natural to expect that the vendor should have offered some guidance, even if it was just to steer the IT folks away from an unproductive big data science project (for more on that, see my blog post coming soon titled “That Giant Sucking Sound is Your Big Data Lab Experiment”).

Moving past anger

Organizational anger can be a real time-waster. Sometimes, assigning blame can gain enough momentum that it distracts from the original issue. Here are some thoughts on moving past this.

You can’t change the past, only the future. Learning from mistakes is a positive thing, but there’s a difference between looking at the causes and looking for folks to blame. It’s critical to identify the real reasons the opportunity was missed instead of playing the “blame game”, which sucks up precious time and may prevent the real issue from ever being identified. I’ve seen more than one organization with what I call a “Teflon team” – a team which is never held responsible for any of the impacts their work has on the business, regardless of their track record. Once or twice, I’ve seen these teams do very poor work, but the responsibility was placed elsewhere. So the team never improves and the poor work continues. Watch out for the Teflon team!

Big data is bigger than you think. It’s big in every sense of the word because it represents not just the things we usually talk about – volume of data, variety of data, and velocity of data – but it also represents the ability to bring computing to bear on problems where this was previously impossible. This is not an incremental or evolutionary opportunity, but a revolutionary one. Can a business improve its bottom line by ten percent with big data? Very likely. Can it drive more revenue? Almost certainly. But it can also develop entirely new products and capabilities, and even create new markets.

So it’s not surprising that businesses may have a hard time recognizing this and coping with it. Business leaders accustomed to thinking of incremental boosts to revenue, productivity, margins, etc. may not be ready to see the possibilities. And the IT team is likely to be even less prepared. So while it may take some convincing to get the VP of Marketing to accept that Twitter is a powerful tool for evaluating their brand, asking IT to evaluate it in a vacuum is a recipe for confusion.

So understanding the true scope of big data and what it means for an organization is critical to moving forward.

A vendor is a vendor. Most organizations have one or more data warehouses today, along with a variety of tools for the manipulation, transformation, delivery, analysis, and consumption of data. So they will almost always have some existing vendor relationships around technologies which manage data. And most of those vendors will want to leverage the excitement around big data, so they will have some message along those lines. But it’s important to separate the technology from the message – and to distinguish between aging technology which has simply been rebranded and technology which can actually do the job.

Also, particularly in big data, there are “vendorless” or “vendor-lite” technologies which have become quite popular. By this I mean technologies such as Apache Hadoop, MongoDB, Cassandra, etc. These are often driven less by a vendor with a product goal and more by a community of developers who cut their teeth on the concept of open-source software, which comes with very different business economics. Generally without a single marketing department to control the message, these technologies can be associated with all manner of claims regarding capabilities – some of which are accurate, and some of which aren’t. This is a tough issue to confront because the messages can be conflicting, diffused, etc. The best advice I’ve got here is: if an open source technology sounds too good to be true, it very likely is.

Fortunately, this phase is a transitional one. Having come to terms with anger over the missed big data opportunity or risk, businesses then start to move forward…only to find their way blocked. This is when the bargaining starts. So stay tuned!

Next up – Bargaining: “Can’t we work with our current technologies (and vendors)? …but they cost too much!”

Physical Design Automation in the HP Vertica Analytic Database

Automatic physical database design is a challenging task. Different customers have different requirements and expectations, bounded by their resource constraints. To deal with these challenges in HP Vertica, we adopt a customizable approach, allowing users to tailor their designs for specific scenarios and applications. To meet these varying requirements, a physical database design tool should let its users trade off query performance and storage footprint for different applications.

In this blog, we present a technical overview of the Database Designer (DBD), a customizable physical design tool that primarily operates under three design policies:

  • Load-optimized – DBD proposes the minimum required set of super projections (containing all columns) that permit fast loads and deliver the required fault tolerance.
  • Query-optimized – DBD may propose additional (possibly non-super) projections such that all workload queries are fully optimized.
  • Balanced – DBD proposes projections until it reaches the point where additional projections do not bring sufficient benefits in query optimization.

These policies allow users to trade off query performance and storage footprint, while also taking update costs into account. They indirectly control the number of projections proposed to achieve the desired balance among query performance, storage, and load constraints.

In real-world environments, query workloads often evolve over time. A projection that was helpful in the past may not be relevant today and could be wasting space or slowing down loads. That space could instead be reused for new projections that optimize current workloads. To cater to such workload changes, DBD operates in two different modes:

  • Comprehensive – DBD creates an entirely new physical design that optimizes for the current workload, retaining parts of the existing design that are beneficial and dropping parts that are not.
  • Incremental – Customers can optionally create additional projections that optimize new queries without disturbing the existing physical design. Incremental mode is appropriate when workloads have not changed significantly. With no input queries, DBD optimizes purely for storage and load purposes.
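To make these options concrete, here is a minimal Python sketch of how a design request might be parameterized. The class, field, and table names are purely illustrative assumptions for this post, not the actual DBD interface:

```python
from dataclasses import dataclass, field
from enum import Enum

class DesignPolicy(Enum):
    LOAD_OPTIMIZED = "load"        # super projections only: fast loads, required fault tolerance
    QUERY_OPTIMIZED = "query"      # keep proposing projections until all workload queries are optimized
    BALANCED = "balanced"          # stop when extra projections no longer bring sufficient benefit

class DesignMode(Enum):
    COMPREHENSIVE = "comprehensive"  # redesign for the current workload, keep only beneficial parts
    INCREMENTAL = "incremental"      # add projections for new queries, leave the existing design alone

@dataclass
class DesignRequest:
    tables: list                                   # tables under design
    queries: list = field(default_factory=list)    # empty: optimize purely for storage and load
    policy: DesignPolicy = DesignPolicy.BALANCED
    mode: DesignMode = DesignMode.COMPREHENSIVE

# Example: incrementally optimize a few new reporting queries without touching the existing design.
request = DesignRequest(
    tables=["fact_sales", "dim_customer"],
    queries=["SELECT ... FROM fact_sales JOIN dim_customer ON ..."],
    policy=DesignPolicy.QUERY_OPTIMIZED,
    mode=DesignMode.INCREMENTAL,
)
```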

[Figure: DBD comprehensive mode]

The key challenges in projection design are picking appropriate column sets, sort orders, cluster data distributions, and column encodings that optimize query performance while reducing space overhead and allowing faster recovery. The DBD proceeds in two major sequential phases. During the query optimization phase, DBD chooses projection columns, sort orders, and cluster distributions (segmentation) that optimize query performance. DBD enumerates candidate projections after extracting interesting column subsets by analyzing the query workload for predicate, join, group-by, order-by, and aggregate columns. Run length encoding (RLE) is given special preference for columns appearing early in the sort order, because it benefits both query performance and storage. DBD then invokes the query optimizer for each workload query, presenting the candidate projections as choices. The query optimizer evaluates the query plans over all candidate projections, progressively narrowing the set of candidates until a stopping condition (based on the design policy) is reached. Query and table filters are applied during this process to drop queries that are already sufficiently optimized by the chosen projections, as well as tables that have reached the target number of projections set by the design policy. DBD’s direct use of the optimizer’s cost and benefit model guarantees that the two remain synchronized as the optimizer evolves over time.

[Figure: DBD input parameters]

During the storage optimization phase, DBD finds the best non-RLE column encoding schemes that achieve the smallest storage footprint for the designed projections via a series of empirical encoding experiments on sample data. In addition, DBD creates the required number of buddy projections, which contain the same data but are distributed differently across the cluster, making the design tolerant to node-down scenarios. When a node is down, buddy projections are used to source the missing data. In HP Vertica, identical buddy projections (with the same sort orders and column encodings) enable faster recovery by allowing their physical storage structures to be copied directly, and DBD automatically produces such designs.
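The idea behind the encoding experiments can be illustrated with a toy example: for each column, encode a small sample with each candidate scheme and keep whichever produces the smallest footprint. The sketch below uses made-up stand-in encodings (a raw concatenation, a crude dictionary proxy, and zlib compression), not Vertica’s real encoding types:

```python
# Illustrative only: "try each encoding on a sample and keep the smallest".
# The candidate encodings are crude stand-ins, not Vertica's encoding implementations.
import zlib

CANDIDATE_ENCODINGS = {
    "raw": lambda values: "|".join(values).encode(),
    "dictionary": lambda values: ",".join(sorted(set(values))).encode() + bytes(len(values)),
    "compressed": lambda values: zlib.compress("|".join(values).encode()),
}

def choose_encodings(sample_columns):
    """sample_columns maps a column name to a list of sampled string values."""
    chosen = {}
    for col, values in sample_columns.items():
        sizes = {name: len(encode(values)) for name, encode in CANDIDATE_ENCODINGS.items()}
        chosen[col] = min(sizes, key=sizes.get)   # smallest footprint wins
    return chosen

print(choose_encodings({"state": ["MA", "MA", "NH", "MA"], "order_id": ["1001", "1002", "1003"]}))
```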

When DBD is invoked with an input set of workload queries, the queries are parsed and useful query meta-data is extracted (e.g., the predicate, group-by, order-by, aggregate, and join columns). Design proceeds in iterations. In each iteration, one new projection is proposed for each table under design. Once an iteration is done, queries that have been optimized by the newly proposed projections are removed, and the remaining queries serve as input to the next iteration. If a design table has reached its target number of projections (determined by the design policy), it is not considered in future iterations, ensuring that no more projections are proposed for it. This process repeats until no design tables or design queries remain to propose projections for.
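A minimal Python sketch of that outer loop, with stub functions standing in for the projection enumerator and the optimizer’s cost and benefit model (none of these names are actual DBD internals), might look like this:

```python
import random

# --- Hypothetical stand-ins for DBD internals (not Vertica code) -------------

def enumerate_candidates(table, queries):
    """Pretend to enumerate candidate projections (sort order + segmentation)."""
    return [f"{table}_proj_{i}" for i in range(3)]

def optimizer_benefit(projection, queries):
    """Pretend cost/benefit score obtained from the query optimizer."""
    return random.random() * len(queries)

def sufficiently_optimized(query, design):
    """Pretend check that a query is already well served by the design."""
    return random.random() < 0.5

# --- Illustrative outer loop of the iterative design process -----------------

def run_design(tables, queries, target_projections_per_table=2):
    design = {t: [] for t in tables}
    active_tables = set(tables)
    remaining = list(queries)

    while remaining and active_tables:
        # One iteration: propose one new projection per table still under design.
        for table in sorted(active_tables):
            candidates = enumerate_candidates(table, remaining)
            best = max(candidates, key=lambda p: optimizer_benefit(p, remaining))
            design[table].append(best)
            # A table that reached its policy-driven target gets no more projections.
            if len(design[table]) >= target_projections_per_table:
                active_tables.discard(table)

        # Queries already optimized by the new projections drop out of the next iteration.
        remaining = [q for q in remaining if not sufficiently_optimized(q, design)]

    return design

print(run_design(["fact_sales", "dim_customer"], ["q1", "q2", "q3"]))
```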

To form the complete search space for enumerating projections, we identify the following design features in a projection definition:

  • Feature 1: Sort order
  • Feature 2: Segmentation
  • Feature 3: Column encoding schemes
  • Feature 4: Column sets (select columns)

We enumerate choices for features 1 and 2 above and use the optimizer’s cost and benefit model to compare and evaluate them (during the query optimization phase). Note that the choices made for features 3 and 4 typically do not affect query performance significantly. The winners decided by the cost and benefit model are then extended to full projections by filling out the choices for features 3 and 4, which have a large impact on load performance and storage (during the storage optimization phase).

In summary, the HP Vertica Database Designer is a customizable physical database design tool that works with a set of configurable input parameters, allowing users to trade off query performance, storage footprint, fault tolerance, and recovery time to meet their requirements and optionally override design features.

Obtaining and installing your HP Vertica license

Watch the video here!

Obtaining and installing your HP Vertica license may seem like tricky business. Especially if you have more than one. But the process need not be complicated or frustrating. For a Community Edition license, you don’t even need to go through any additional steps after installing Vertica. For Enterprise Edition or Flex Zone licenses, you’ll go through a step-by-step process in HP’s licensing portal called Poetic and then provide Vertica with the path to the license file you download. That’s it! You can even apply your license through the Vertica Management Console. To see the process in action, watch this video about obtaining and installing the different HP Vertica licenses.

Useful links:
Poetic (HP’s Licensing for Software portal)
HP License Support Center

Work hard, have fun and make a difference!


My name is Jaimin and I work as a Software Engineer in the Distributed Query Optimizer Team at HP Vertica. I wanted to share with you what I think makes Vertica the best place to work! I will explain the kind of impact you can make as an employee/intern at HP Vertica, while sharing my personal experiences.

As a student, I researched many companies I might want to work for to get a better understanding of the everyday life of software engineers. However, what I was most interested in learning about was the kinds of things engineers might do that went above and beyond the normal day-to-day stuff.

Is writing code something unique to the job?

No! Right?

As Software Engineers, we write code, develop algorithms, and implement them. But here at HP Vertica, we do lots of other things besides simply writing code.

Go above and beyond!

Vertica is different from other companies as far as normal day-to-day stuff goes.

Let me ask you this question: how many new graduates would you guess get a chance to file a patent within their first six months of joining a company? How many get a chance to write a paper within their first six months? Not a lot, right?

In my experience at HP Vertica, I’ve seen that just about all new graduate engineers file at least one patent in their first year at work. This speaks to the fact that the work we do here at Vertica is completely innovative. Our projects have a huge business impact.

Be the captain of your ship!

Vertica offers engineers incredible opportunities! All you have to do is be willing to accept them. One of the best things about HP Vertica is that you work in an environment where other engineers are smarter than you! You’ll find yourself constantly challenged to learn new, interesting, and exciting things. You’ll get better exposure and, more importantly, you have a massive role to play in the company’s growth and development.

Something else that’s unique about HP Vertica—the projects you work on as an intern become part of the shipping product! As a result, you’ll get the chance to see your code in action, and sometimes you can hear what customers have to say about your feature in particular. You won’t be allowed to sit idle for a minute because we have a very short release cycle. This will keep you on your toes and encourage you to think of something new day in and day out.

Here, engineers are not forced to work on this and that—they have a great deal of autonomy and frequently get to choose the things they work on. If you have an idea you think can help improve the product, you are encouraged to see it through. And, you’ll also get a chance to participate in various technical events that take place within HP and submit your ideas.

Taking initiative is always encouraged and you’ll be expected to make, discuss, and defend your design decisions with your mentors instead of just following directions. You’ll also be able to learn about the complexities of building a database and how we achieve the performance advantages in HP Vertica.

It is also easy to move between teams. It is entirely up to you; the only question is what you want to do.

Share and gain knowledge!

Knowledge sharing is another important thing at Vertica. We hold lunch talks where we discuss new papers related to database systems. Every now and then, people from various teams give tech talks so that each team is aware of what people in other groups are doing.

Before joining Vertica as a fresh graduate, I did not have any experience working on a database optimizer, though I had worked a bit on optimizations when I took a compiler class. Because of the great culture and environment at Vertica, I didn’t find the transition difficult at all. It was sometimes challenging, but I learned a lot by working on challenging projects with incredibly smart people at the company (I wonder how many people get the opportunity to work on the design and implementation of queries involving set operators during their first year of work).

Have fun!

We frequently unwind by doing fun things at work, including watching the Olympics or other sporting events during lunch, or playing table tennis and board games when we can. Vertica provides a lot of flexibility, and it comes with huge responsibility. You’re expected to get your work done on time—if you do that, no one will have any problem with having a little fun. Interns also go on outdoor field trips, including horseback riding, hiking in the Blue Hills, going to a movie, participating in a bocce tournament, and water activities such as motor boat racing. Once, we went to Boston Harbor and learned how to sail from one of our in-house experts at Vertica.

We are looking for people to join Vertica! Do you have any interest in being challenged in an innovative design environment? Then apply today!

BDOC – Big Data on Campus

I had a great time speaking at the MIT Sloan Sports Analytics Conference yesterday, and perhaps the most gratifying part of doing a panel in front of a packed house was how many students were in the audience. Having been a bit of a ‘stats geek’ during my college years, I can assure you that such an event, even with a sports theme, would never have drawn such an audience back then.

It was even more gratifying to read this weekend’s Wall Street Journal, with the article titled “Data Crunchers Now the Cool Kids on Campus.” Clearly this is a terrific time to be studying – and teaching – statistics and Big Data. To quote the article:

The explosive growth in data available to businesses and researchers has brought a surge in demand for people able to interpret and apply the vast new swaths of information, from the analysis of high-resolution medical images to improving the results of Internet search engines.

Schools have rushed to keep pace, offering college-level courses to high-school students, while colleges are teaching intro stats in packed lecture halls and expanding statistics departments when the budget allows.


Of course, Big Data training is not just for college students, and at HP Vertica we are working on programs to train both professionals and students, in conjunction with our colleagues in the HP ExpertOne program. We invite those interested in learning more to contact us – including educational institutions interested in adding Big Data training to their curricula.

Startup Rink

For years, I’ve enjoyed working at Vertica, part of a culture where developers aren’t encumbered by bureaucracy, there is a true meritocracy, and we focus on efficiently delivering meaningful features to customers. I’ve been impressed through the years by the commitment, hard work, and truly impressive accomplishments of my colleagues. It takes an incredible team to build a product, like the original Vertica Analytics Database (now known as the HP Vertica Analytics Platform), from scratch, and tackle complex distributed systems and scalability challenges — it is also a lot of fun, especially with this group.

After HP acquired Vertica over a year and a half ago, I was glad to see the startup culture continue to thrive. The acquisition did bring about some change, which has overall been very positive. The engineering group has benefited from a wealth of resources at HP, including new toys, mostly in the form of hardware, and newfound relationships with the talented folks at HP Labs and in other business units.

It is my great fortune to work with truly talented developers, who have greatly influenced my personal and career growth. The challenges we’ve faced have only strengthened that influence. During a recent holiday project, I leaned on lessons learned from my colleagues. Interestingly, the project had nothing to do with my profession.

What does building a backyard (or, in my case, front yard) skating rink have to do with a startup experience?

For starters, you hear lots of reasons why you shouldn’t do it. Building a rink is an impractical project, especially in my geographical location. It is relatively expensive compared to skating at a public rink — the cost is roughly what many pay for a few months of cable, but for something that you don’t mind your kids doing for hours each day. It is a lot of work. I call it exercise, something I need more of this time of year. At best, temperatures will remain cold enough to sustain five or six weeks of skating. As I got started, I heard all about how the ground didn’t freeze at all last winter.

To complete a project like this, one must filter criticism appropriately. The folks at my local box store were very helpful in improving my rink design, while others contributed only negative comments. I’m certain a good many of my neighbors think I am crazy. I was a little concerned when two fire engines came down my street while I was flooding the rink. It turns out they were carrying Santa Claus on display for the kids; his sleigh must have been getting tuned up for his big day.

[Image: the front yard rink]

Perhaps most importantly, you have to be able to rebound when things don’t go as planned. I broke my back — at least it felt that way — framing the rink. What I didn’t count on was a lot of rain, followed by a fair amount of snow. These conditions added extra weight to the rink and made the ground extremely soggy (it was mush to a depth of more than a foot in some areas). Consequently, the deep end of the rink — the ground isn’t perfectly level — burst at one corner.

I’m certain that I looked crazed as I hurried to mend the damage before the rink fell apart completely. Once things stabilized, I could see that the ground wasn’t holding. The stakes were leaning and the rink was in great jeopardy. I felt defeated. I thought about giving up. I’d invested a lot of time and energy, and wasted some money, on this foolish project. Comments from the naysayers filled my head. But, as I said earlier, I’ve had the good fortune of working on challenging projects with colleagues who know how to make things work in the face of adversity. I didn’t need to consult them; I knew how they’d react. I’ve seen the same scenario play out dozens of times at work. After I cleared my head and got a pep talk from my wife, I doubled down on my efforts and made a serious attempt to salvage the rink. There was no guarantee of success—things looked bleak.

Thankfully, hard work paid off. It usually does, but there are times when, despite good intentions and best efforts, things don’t work out as intended. When that happens you’re left with valuable lessons learned. And, in that case, next year’s rink will be a success.

[Image: shooting a goal]

A few days after the rink was repaired, Mother Nature did her part. The rink has been in operation for a couple of days now. Already, the work has been worthwhile. My family has had some very memorable times out there, like the time my three-year-old daughter amazed us with her on-ice impression of Prof. Hinkle chasing Frosty down a hill as she laughed hysterically, or watching my five-year-old son give my wife a celebratory hug after imagining winning the Stanley Cup for the 1,000th time with another amazing goal.

With any luck, we’ve got a few more weeks to enjoy the cold weather. Now I’ve got to head out to resurface the ice with the homeboni I built (see image) so there’s a fresh sheet for the kids to skate on tomorrow.

[Image: the homeboni]

Recapping the HP Vertica Boston Meet-Up

This week, some of our Boston-area HP Vertica users joined our team at the HP Vertica office in Cambridge, MA. Over drinks and great food, we had the honor of hearing from HP Vertica power users Michal Klos and, following him, Andrew Rollins of Localytics. Both Michal and Andrew offered valuable insight into how their businesses use the HP Vertica Analytics Platform on Amazon Web Services (AWS).

Michal runs HP Vertica in the cloud, hosted on AWS. The highlight of his presentation was a live demonstration of a Python script that uses Fabric (a Python library and command-line tool) and Boto (a Python interface to AWS) to quickly set up and deploy a Vertica cluster in AWS. Launching HP Vertica Analytics Platform nodes in AWS eliminates the need to acquire hardware and allows for extremely speedy deployment. Michal was very complimentary about the recent enhancements to our AWS capabilities in the recently released version 6.1 of the HP Vertica software.
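As a rough illustration of that approach (this is a hedged sketch, not Michal’s actual script: the AMI ID, key pair, security group, and install commands are placeholders), a Boto/Fabric workflow to launch instances and bootstrap Vertica on them might look like this:

```python
# Hedged sketch only: AMI ID, key names, security group, and install steps are
# placeholders, not the script demonstrated at the meet-up.
import time

import boto.ec2                                    # classic Boto EC2 interface
from fabric.api import env, execute, run, sudo     # Fabric 1.x

def launch_cluster(node_count=3):
    """Launch EC2 instances that will become the Vertica cluster nodes."""
    conn = boto.ec2.connect_to_region("us-east-1")
    reservation = conn.run_instances(
        "ami-00000000",               # placeholder AMI
        min_count=node_count,
        max_count=node_count,
        key_name="vertica-demo",      # placeholder key pair
        instance_type="m1.xlarge",
        security_groups=["vertica"],  # placeholder security group
    )
    instances = reservation.instances
    # Wait until every instance is running and has a public address.
    while any(i.update() != "running" for i in instances):
        time.sleep(10)
    return [i.ip_address for i in instances]

def install_vertica():
    """Fabric task run on each node: install the RPM and sanity-check the host."""
    sudo("rpm -Uvh /tmp/vertica-6.1.x.rpm")   # placeholder package path
    run("hostname")

if __name__ == "__main__":
    hosts = launch_cluster()
    env.user = "ec2-user"
    env.key_filename = "~/.ssh/vertica-demo.pem"
    execute(install_vertica, hosts=hosts)
```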

[Image: Michal Klos demonstration]

Following Michal’s demonstration, Andrew took the floor to talk about how Localytics uses the HP Vertica Analytics Platform to analyze user behavior in mobile and tablet apps. With HP Vertica, Localytics gives its customers access to granular detail in real time. Localytics caters to its clients by launching a dedicated node in the cloud for each customer. With the HP Vertica Analytics Platform powering their data in AWS, those customers can start gathering insightful data almost immediately.

Our engineers then took the stage to serve as a panel for questions from the floor. It’s not often that our engineers get the opportunity to answer questions from customers and interested BI professionals in an open forum discussion. Everyone took full advantage of the occasion, asking a number of questions about upcoming features and current use cases.  In addition, our engineers were able to highlight a number of new features from the 6.1 release that the users in attendance may not have been taking advantage of yet.

Meet-ups serve as a fantastic catalyst for users and future users to interact with each other, share best practices, and have valuable conversations with members of the HP Vertica team. We reiterate our thanks to Michal and Andrew, and to all those who joined us at our offices — thank you for an excellent meet-up!

Don’t miss another valuable opportunity to hear from fellow HP Vertica user Chris Wegrzyn of the Democratic National Committee in our January 24th webinar at 1 PM EST, where we will discuss how the HP Vertica Analytics Platform revolutionized the way a presidential campaign is run. Register now!
