Author Archive

The Top Five Reasons SQL-on-Hadoop Keeps CIOs Awake at Night

Being a part of HP is really an amazing thing – it gives us access to amazing technologies and very bright, hard-working people. But the best part is talking with our customers.

One topic on the mind of many technology leaders today is the “elephant in the room” – Hadoop. From its humble beginnings as a low-cost implementation of mass storage and the Map/Reduce programming framework, it’s become something of a movement. Businesses from Manhattan to Mumbai are quickly discovering that it offers favorable economics for one very specific use case: a very low-cost way to store data of uncertain value. This use case has even acquired a name – the “data lake”.

I first heard the term five years ago, while Vertica was a tiny startup based in Boston. It seemed that a few risk-tolerant businesses in California were trying out this thing called Hadoop as a place to park data that they’d previously been throwing away. Many businesses have been throwing away all but a tiny portion of their data simply because they can’t find a cost effective place to store it. To these companies, Hadoop was a godsend.

And yet in some key ways, Hadoop is also extremely limited. Technology teams continue to wrestle with extracting value from a Hadoop investment. Their primary complaint? That there is no easy way to explore and ask questions of data stored in Hadoop. Technology teams understand SQL, but Hadoop provides only the most basic SQL support. I’ve even heard stories of entire teams resigning en masse, frustrated that their company has put them in a no-win situation – data everywhere and not a drop to drink.

Variations on the above story have undoubtedly played out at many companies across the globe. The common theme is that, love it or hate it, SQL is one of the core languages for exploration and inquiry of semi-structured and structured data. And most SQL on Hadoop offerings are simply not up to the task. As a result, we now have a gold rush of sorts, with multiple vendors rushing to build SQL on Hadoop solutions. To date, there are at least seven different commercial SQL on Hadoop offerings, and many organizations are learning about the very big differences between these offerings!

In our many conversations with C-level technology executives, we’ve heard a common set of concerns about most SQL on Hadoop options. Some are significant. So, without further ado, here are the top five reasons SQL on Hadoop keeps CIOs awake at night:

5. Is it secure? Really?

The initial appeal of the data lake is that it can be a consolidated store – businesses can place all their data in one place. But that creates huge risk because now…all the data is in one place. Therefore, our team has been working diligently on a SQL on Hadoop offering that not only includes core enterprise security features, but also provides the ability to secure data in flight with such things as SSL encryption, integration with enterprise security systems such as Kerberos, and a column-level access model. If your SQL on Hadoop solution doesn’t offer these features, your data is at risk.
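To make “column-level access model” concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product implements it: the role names, column names, and the `authorize` helper are all hypothetical.

```python
# Hypothetical grants: role -> set of columns that role may read.
COLUMN_GRANTS = {
    "analyst": {"order_id", "region", "amount"},
    "auditor": {"order_id", "region", "amount", "card_number"},
}


def authorize(role, requested_columns):
    """Return the requested columns, or raise PermissionError if any is not granted."""
    allowed = COLUMN_GRANTS.get(role, set())
    denied = sorted(set(requested_columns) - allowed)
    if denied:
        raise PermissionError(f"{role} may not read: {', '.join(denied)}")
    return list(requested_columns)


# The auditor can read the sensitive column; the analyst cannot.
print(authorize("auditor", ["order_id", "card_number"]))
try:
    authorize("analyst", ["order_id", "card_number"])
except PermissionError as err:
    print("denied:", err)
```

The point of the sketch is that the check happens per column, before any data is read, rather than per file or per table.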

4. Does it support all the SQL you need?

Technically, SQL on Hadoop has been around for years now in the form of an open source project called Hive. Hive has its own version of SQL called HQL. Hive users frequently complain that HQL only supports a subset of SQL. There are many things you just can’t do. This requires all manner of data flow contortions as analysts must continually resort to other tools or languages for things that are very expressible in SQL…if only the Hadoop environment supported it.

This problem remains today, as many of the SQL on Hadoop variants do not support the full range of ANSI SQL. For example, our benchmark team regularly performs tests with the Vertica SQL on Hadoop product to ensure that it meets our standards for quality, stability and performance. One of the test suites we use is the TPC-H benchmark. For those not in the know, TPC-H is an industry standard benchmark with pre-defined SQL, schemas, and data. While our engine runs the full suite of tests, other SQL on Hadoop flavors that we’ve tested are not capable of running the entire workload. In fact, some of them only run 60% of the queries!

3. …And if it runs the SQL, does it run well?

It’s one thing to implement a SQL engine that can parse a bit of SQL and create an execution plan to go and get the data. It’s a very different thing to optimize the engine such that it does these things quickly and efficiently. I’ve been working with database products for almost thirty years now, and have seen over and over that the biggest challenge faced by any SQL engine is not creating the engine, but dealing with the tens of thousands of edge cases that will arise in the real world.

For example, being aware of sort order in stored data on disk can dramatically improve query performance. Moreover, optimizing the storage of the data to leverage the sort sequence with something like run-length encoding can further improve performance. But not if the SQL engine doesn’t know how to deal with this. One example of an immature implementation is an engine that cannot use just-in-time decompression of highly compressed data. If the system has to pay the CPU penalty of decompressing highly compressed data every time it is queried, why bother compressing it in the first place, except maybe to save disk space? Also, if a user needs to keep extremely high-performance aggregations in sync with the transaction data, this simply won’t be possible unless the engine has been written to manage the data this way and to be aware of the data characteristics at run-time.

These are just two examples. But they can make the difference between a query taking one second and a query taking two days. Or worse, crashing when you try to run it because uncompressed data overflows memory and brings down the database.
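Here is a small Python sketch of why sort order and run-length encoding go together, and why an engine that can work on the encoded form avoids the decompression penalty. The column contents and function names are made up for illustration; this is a toy, not a description of any product’s storage format.

```python
import random
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def count_matching(runs, target):
    """Count rows equal to target by summing run lengths, without expanding the runs."""
    return sum(length for value, length in runs if value == target)


# Hypothetical column: 100,000 rows drawn from five region codes.
random.seed(0)
regions = [random.choice(["APAC", "EMEA", "LATAM", "NA", "OTHER"]) for _ in range(100_000)]

unsorted_runs = rle_encode(regions)        # many short runs: little compression
sorted_runs = rle_encode(sorted(regions))  # one run per distinct value

print("runs when unsorted:", len(unsorted_runs))
print("runs when sorted:  ", len(sorted_runs))

# The aggregate is answered from the handful of sorted runs alone.
print("EMEA rows:", count_matching(sorted_runs, "EMEA"))
```

On the sorted column the count touches one pair per distinct value instead of 100,000 rows, which is the kind of saving an engine gives up if it must decompress everything before it can look at it.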

2. Does it just dump files to a file-system, or actively manage and optimize storage?

Projects built for Hadoop almost invariably pick up some of the “baggage” of using the core Hadoop functionality. For example, some of the SQL on Hadoop offerings just dump individual files into the filesystem as data is ingested. After loading a year of data, you’re likely to find yourself with hundreds of thousands of individual files. This is a performance catastrophe. Moreover, to optimize these files a person has to manually do something – write a script, run a process, call an executable, etc. This just adds to the real cost of the solution in terms of administrative complexity and design complexity to work around performance issues. What a business needs is a system which simplifies this by managing and optimizing files automatically.
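To give a feel for the manual chore described above, here is a rough Python sketch of a compaction script that merges many small ingest files into a few larger ones. Everything in it (the `compact` function, the file names, the size threshold) is hypothetical and runs against a local temporary directory rather than a real Hadoop filesystem.

```python
import tempfile
from pathlib import Path


def compact(source_dir: Path, target_dir: Path, max_bytes: int) -> list[Path]:
    """Merge the files in source_dir into fewer files of roughly max_bytes each."""
    target_dir.mkdir(parents=True, exist_ok=True)
    outputs: list[Path] = []
    buffer = bytearray()

    def flush() -> None:
        # Write whatever has accumulated as the next output part.
        if buffer:
            out = target_dir / f"part-{len(outputs):05d}.dat"
            out.write_bytes(bytes(buffer))
            outputs.append(out)
            buffer.clear()

    for small in sorted(source_dir.iterdir()):
        if small.is_file():
            buffer.extend(small.read_bytes())
            if len(buffer) >= max_bytes:
                flush()
    flush()  # write the final, possibly short, part
    return outputs


def demo():
    """Create 1,000 tiny files, compact them, and report counts and byte totals."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "ingest", Path(tmp) / "compacted"
        src.mkdir()
        for i in range(1000):
            (src / f"load-{i:04d}.csv").write_bytes(f"{i},event\n".encode())
        bytes_in = sum(f.stat().st_size for f in src.iterdir())
        merged = compact(src, dst, max_bytes=4096)
        bytes_out = sum(f.stat().st_size for f in merged)
        return len(list(src.iterdir())), len(merged), bytes_in, bytes_out


print(demo())
```

Someone has to write, schedule, monitor, and debug a job like this for every table, which is exactly the administrative cost the paragraph above is describing.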

1. When two people ask the same question at the same time, do they get the same answer?

There are certain fundamentals about databases that have made them so common for tracking key business data today. One of these things is called ACID compliance. It’s an acronym that doesn’t bear explaining here, so suffice it to say that one of the things an ACID-compliant database guarantees is that if two people ask the exact same question of the exact same data at the exact same time, they will get the same answer.

Seems kind of obvious, doesn’t it? Yet a common issue with SQL on Hadoop distributions is that they may lack ACID compliance. That isn’t good for data scientists building predictive models to grow the business, and it’s certainly not suitable for producing financials! Caveat emptor.
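A toy Python sketch shows what is at stake. It contrasts a store that is updated in place, where a reader can catch a change half-applied, with one that publishes each committed change as a complete new version. The `SnapshotStore` class and the account names are invented for illustration; real databases use far more sophisticated machinery to provide these guarantees.

```python
from types import MappingProxyType


def total(balances):
    """The 'question' both readers ask: how much money is there in total?"""
    return sum(balances.values())


# Without atomic commits: a reader arriving mid-transfer sees money missing.
naive = {"checking": 700, "savings": 300}
naive["checking"] -= 200           # first half of a transfer
mid_transfer_total = total(naive)  # a reader asking right now gets 800
naive["savings"] += 200            # second half of the transfer


class SnapshotStore:
    """Toy versioned store: each commit publishes a complete new version in one step."""

    def __init__(self, initial):
        self._committed = MappingProxyType(dict(initial))

    def snapshot(self):
        # Readers get a read-only view of the last committed version.
        return self._committed

    def commit(self, changes):
        new_version = dict(self._committed)
        new_version.update(changes)
        self._committed = MappingProxyType(new_version)


store = SnapshotStore({"checking": 700, "savings": 300})
reader_a = store.snapshot()
reader_b = store.snapshot()
store.commit({"checking": 500, "savings": 500})  # the whole transfer, as one unit

print("naive reader mid-transfer:", mid_transfer_total)
print("snapshot readers:", total(reader_a), total(reader_b))
print("after commit:", total(store.snapshot()))
```

The two snapshot readers asked at the same moment and get the same, internally consistent answer, and no reader ever sees the transfer half done.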

Many of our customers consider these five areas to be a benchmark for measuring SQL on Hadoop maturity. SQL on Hadoop offerings that fail to deliver these things will drive up the cost and time it takes to solve problems as analysts must use a mix of tools, work around performance and stability limitations, etc. And in the context of massive data thefts taking place today, how many CIOs feel comfortable with three petabytes of unsecured data pertaining to every single aspect of their business being accessible to anyone with a text editor and a bit of Java programming know-how?

The good news is that we at HP have been thinking of these concerns for years now. And working on solving them. Vertica SQL on Hadoop addresses each of these concerns in a comprehensive way, so organizations can finally unlock the full value of their data lake. We’re happy to tell you more about this, and we’d love for you to try it out! Click here to request more information from our team.

The Top Five Ways to Run a Great Vertica Evaluation

In last week’s blog, I listed the top five ways in which I’ve seen organizations struggle in conducting Vertica evaluations. This week, I’d like to discuss the best practices that drive good Vertica evaluations. What’s a “good” evaluation? For us at HP, it’s one that produces results that allow a company to make a technology decision that maximizes the value of the investment to the business. For our team, it’s not about convincing organizations to buy something they don’t want or need. It’s about associating an investment in our technology with a tangible business outcome.

That said, here are my top five ways to run a great Vertica evaluation.

Best Practice 1: Think Outside the Box

Having spent the early years of my career working with databases by Oracle and Microsoft, I developed a set of core beliefs about how databases worked…and about how they could work. So when I branched out and started working with newer database technologies, my first efforts focused around very conventional data warehousing patterns – rigorous pre-design of a somewhat normalized star or snowflake schema; designing around long loads and longer running queries; thinking in terms of row level transactionality; and so forth.

I had to unlearn a lot of these preconceptions to put the newer technologies to effective use. Case in point – for many years the star/snowflake schema has been the go-to design for data marts and warehouses. It turns out the design was really driven by two separate needs:

  • The notion of “master data”, or dimensions which have been scrubbed, and the performance characteristics of row-based databases when applied to data warehouse use cases.
  • Industry dogma that a snowflake schema is “just the way you do it” because legacy databases just aren’t fast enough. And as a result many IT shops believe that entire categories of business questions fall into the “I’m sorry Dave, but I can’t do that” bucket – because they just wouldn’t run.

For big data analytics, some of these beliefs need to be unlearned. Schema is now a flexible thing that can be defined as needed, even at query runtime. Technology like Vertica incorporates a number of analytic extensions which enable a laundry list of business questions that were previously difficult or impossible to answer.
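As a rough illustration of schema being applied at query time rather than at load time, here is a short Python sketch. The sample records and the `query` helper are hypothetical, and this is plain Python over JSON strings, not a depiction of how Vertica handles semi-structured data.

```python
import json

# Raw records are stored exactly as they arrived; fields vary from record to record.
RAW_EVENTS = [
    '{"user": "ana", "action": "login", "ts": 1}',
    '{"user": "bo", "action": "purchase", "amount": 42.5, "ts": 2}',
    '{"user": "ana", "action": "purchase", "amount": 10.0, "coupon": "SPRING", "ts": 3}',
]


def query(raw_lines, columns, where=lambda record: True):
    """Parse each record at query time and project the requested columns.

    Fields a record does not have come back as None, so no schema needs
    to be declared before the data is loaded.
    """
    for line in raw_lines:
        record = json.loads(line)
        if where(record):
            yield tuple(record.get(column) for column in columns)


purchases = list(
    query(
        RAW_EVENTS,
        ["user", "amount", "coupon"],
        where=lambda record: record.get("action") == "purchase",
    )
)
print(purchases)
```

The “coupon” column did not exist when the first records were stored, yet it can be queried the moment it shows up in the data.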

So when preparing to evaluate Vertica, think outside the box. For specific insights on this, read on!

Best Practice 2: Test What You Need to Test

I always recommend to businesses that we identify three types of evaluation criteria: those that need to be verified by tests in the evaluation, those that can be verified in other ways such as references, and those that are considered “nice to have”. This kind of approach will help the evaluation in a number of ways. First, it’ll distill out the tests of core importance. Second, it will help the account team spend time on the things that matter most. Finally, it’ll minimize the time it takes to complete the evaluation.

To my first point of thinking outside the box – our team does these evaluations every day, whereas most businesses only run them every few years. It’s tough to be good at something you don’t do often. We can help identify good tests to run as well as best practices for getting everything done smoothly. So don’t hesitate to ask our team for help when it’s time to identify your test plan.

Best Practice 3: Test What the Business Cares About

This is the corollary to my point last week about not using pure technology-defined success criteria. In my years in the IT trenches, I saw many technology investments fail to deliver the desired business outcome. Often this was because the business was not involved in the technology selection process. The way to fix that is by involving business stakeholders in the evaluation – identify use-cases that are timely and relevant (and for which there’s data), so that when the evaluation is done, business stakeholders can be comfortable that they’re going to get what they need. And the IT team knows it can deliver. This is a very powerful way to mitigate a number of risks.

This is another way in which we can help. We’ve got experts who understand analytic use cases and industry particulars, and who can facilitate the discovery of business-relevant evaluation tests. And we’re happy to work with companies to do this.

Best Practice 4: A Pilot is Not Production

When technology teams don’t conduct frequent evaluations, the inclination is to think of an evaluation like a production implementation with a rigid set of processes and a set schedule. And while evaluations can be run this way, it often results in a case of what I call “use case myopia” – the focus is on testing the technology against goals of incrementally improving things. Sometimes this is appropriate, but when selecting technology for big data analytics, this may miss the mark. Whereas it might be beneficial to the business to build a database so that the analysts can get their reports more quickly, focusing on that test may miss the fact that new data technology allows for business questions which were previously impossible.

For example, I’ve worked with multiple organizations whose first test in an evaluation was to see whether they could run a report more quickly. But after a bit of conversation, we identified analytic use cases that the company didn’t even consider because they were so used to older technology being too slow or hard to use. And these use cases were transformational – they represented entirely new capabilities like fraud detection in seconds instead of days, A/B testing in real time for every feature, real-time application optimization, and behavioral targeting. And the list goes on.

Best Practice 5: Think “partnership”

I saved my personal favorite for last. Having been on both the purchaser and vendor sides of the table, I’ve seen the different ways businesses can approach technology purchases. But the most consistently successful approach I’ve seen is when an organization partners with strategic vendors. This transforms a technology evaluation from a test of nuts and bolts to a test of whether you can build what you need.

Partnering has some requirements though. First, make sure the vendor brings enough to the table to warrant a strategic partnership. In the big data space, there are plenty of vendors who want to be strategic, but lack either the business or technology capabilities to really deliver on the promise. Second, a partnership will require a measure of transparency and trust. This will allow your vendor to help you in ways you might not have thought they could. For example, we at HP can bring all of the capabilities of one of the largest, most-established technology vendors in the world to the table. In the big data space, that means we can help companies leverage things like deep linking, pattern recognition, breakthrough hardware designs, and much more. And as your partner, we’ll help you sort through it all so you don’t have to.

In an evaluation, this means that we can help an organization think out of the box, in terms of a good test plan and business relevant use cases, and help make the evaluation a good one.

“Wherever you go, there you are” –Yogi Berra

IT teams very often find themselves in a set of circumstances with many constraints – budget, time, people, knowhow, and so forth. In that context, an evaluation can represent a lot of work. I’ve watched many businesses work their way through as many as six separate technology evaluations to make a big data platform choice – a considerable investment of time and money. I’ve found that when a company works with us closely during the evaluation process, it goes more quickly and with less investment of their time. So if your organization is about to embark on the big data journey and needs to think about evaluations, we can help. Click here to arrange to talk with one of our folks and learn more.

The Top Five Ways to Botch a Vertica Evaluation


In the years I’ve been working with Vertica and other large scale data technologies, I’ve been a party to a large number of technology evaluations. Most businesses are familiar with these – they’re often called either a “proof of concept”, “proof of value”, or “pilot”. Technology evaluations are a key part of the technology selection process, wherein the business identifies a set of criteria which the candidate technology must meet (or exceed). These evaluations are tightly scoped operations sponsored by company leadership, with clear-cut input data, test scenarios, and defined metrics to measure success.

At least, that’s the theory.

While some evaluations are very much the way I describe them above, many aren’t. In fact, many evaluations fail to demonstrate measurable value, and can in fact muddy the waters around technology selection – exactly the opposite of what they’re supposed to do. While there are all manner of things that can go wrong with evaluating a big data platform, I’ve seen organizations struggle with specific areas when conducting a Vertica evaluation. Here are the top five.

Mistake number 5: Don’t talk with any Vertica people at all

We’ve all bought cars, and have had to deal with car dealers. For many of us, talking with sales people can leave a bad taste in our mouths. This is unfortunate, because there is unique value to be found in talking with the right sales team. A skilled sales executive will know how to work with an organization’s leadership to map technology to strategy – which greatly increases the likelihood that an investment in that technology will pay off. A skilled presales engineer will know how to deploy the technology in ways that fit a particular business and use case(s) – which can serve as an accelerator in the project, and mitigate the risk of failure. Moreover, these teams accumulate knowledge on best (and worst) practices, and can be a powerful source of knowledge and guidance. By ignoring sales people, organizations run the risk of repeating mistakes made by others and possibly selecting the wrong technology for their needs.

Mistake number 4: Use 100% IT-defined success criteria

First, I have to say that I have nothing but respect for IT teams. I worked in various IT departments for many years before moving to the vendor side of the world. In my experience, they’re incredibly hard working, talented folks. But the people working in the technology trenches tend to think about the technology, not why it’s there. Rather than thinking of that Oracle operational store as “a key resource for business stakeholders to optimize day to day decisions,” they tend to think of it as “an Oracle database that needs to stay up at all times or the CEO will call the CIO and complain.”

This shapes expectations. And when it’s time to select new technology, IT will focus on the things it cares about – SQL completeness, availability, fault-tolerance, backup and recovery, and so forth. I’ve seen evaluations where the IT team made their “wish list” of criteria, and the vendor demonstrated every single one of them, only to see another technology get chosen. Because the test criteria didn’t matter to the business stakeholders.

Mistake number 3: Never, ever run the Database Designer

The other mistakes discussed here are pretty much technology agnostic – they can be issues in all sorts of evaluations. This one, however, is specific to Vertica. That’s because the Vertica team re-invented storage as opposed to borrowing somebody else’s storage engine and bolting on column-like features. While this bit is somewhat longer than the others, it bears reading because it is often the moment when the light bulb goes on for the database folks as to why Vertica has done so well in the market in recent years.

When a user creates a table in Vertica, two things happen:

  1. A logical table is created. This is the structure that all users will query, insert to, update, delete from, and so forth. It is just a stub, however.
  2. A superprojection is created. The superprojection is identical to the logical table. However, it is the actual storage structure for the data. It uses certain rules for things like data distribution, sort and encoding – which are all part of the “secret sauce” of Vertica’s performance and scalability. The superprojection is required because Vertica is a database – we need a spot where data can go in an ACID compliant form immediately.

But the beauty of the Vertica storage engine is that additional projections can be created, and they don’t all require every column. This is why we built our own engine from the ground up – so Vertica establishes a loose coupling between logical data model and the physical storage of that data. Additional projections can use fewer columns, other sort orders, different distribution keys, other forms of compression, etc. to deliver maximum performance. And the database will decide – when a query is submitted – which set of projections will make the query perform the best.
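The idea of several physical projections behind one logical table, with the database picking among them per query, can be sketched in a few lines of Python. To be clear, this is a toy: the projection names, the columns, and the two-part cost rule are assumptions made for illustration and are not Vertica’s actual optimizer logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Projection:
    name: str
    columns: frozenset   # columns physically stored in this projection
    sort_order: tuple    # columns the stored data is sorted by


def choose_projection(projections, needed_columns, filter_column):
    """Pick a projection that covers the query.

    Toy cost rule: prefer a projection whose leading sort column is the
    filter column, then prefer the one storing the fewest columns.
    """
    needed = frozenset(needed_columns)
    candidates = [p for p in projections if needed <= p.columns]
    if not candidates:
        raise LookupError("no projection covers the query")

    def cost(p):
        sorted_on_filter = bool(p.sort_order) and p.sort_order[0] == filter_column
        return (0 if sorted_on_filter else 1, len(p.columns))

    return min(candidates, key=cost)


# One logical "sales" table, three hypothetical projections.
projections = [
    Projection(
        "sales_super",
        frozenset({"sale_id", "sale_date", "region", "customer_id", "amount"}),
        ("sale_id",),
    ),
    Projection("sales_by_region", frozenset({"region", "amount"}), ("region",)),
    Projection(
        "sales_by_date",
        frozenset({"sale_date", "customer_id", "amount"}),
        ("sale_date", "customer_id"),
    ),
]

print(choose_projection(projections, {"region", "amount"}, "region").name)
print(choose_projection(projections, {"sale_date", "amount"}, "sale_date").name)
print(choose_projection(projections, {"sale_id", "customer_id"}, "customer_id").name)
```

The first two queries land on the narrow projections sorted for them, while the third falls back to the superprojection because it is the only one holding every column the query needs. Users only ever name the logical table; the choice of storage happens underneath.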

To make projections easier for our users to leverage, we’ve created a tool which is included with Vertica, called the Database Designer. This is unique in the industry as far as I know. A user only needs to create the desired tables and load a modest amount of data, then package up their queries and pass them to the Database Designer. The Database Designer will then test the queries and write SQL to create a set of optimized projections. In this way, the Database Designer can make just about anyone as effective as a skilled DBA when it comes to performance tuning.

Unfortunately, much of the market doesn’t understand Vertica and projections. So I often walk into conversations where the technology team has been told – usually by another vendor – that projections are “cheating” because they optimize performance. And so the business decides to deliberately avoid using the Database Designer to optimize performance. This is like telling yourself that breathing more oxygen during a foot race is cheating, so the runners should hold their breath during the race in order to slow the faster runners down and give the slower ones a chance of winning. I think I’m being generous when I call this a bad idea.

Mistake number 2: Don’t take it seriously

Sometimes, the technology team already knows which technology they want. And the technology evaluation is just a rubber stamp – the outcome is predetermined, and the team just needs the window dressing to make it look like they evaluated other vendors. This is a bad idea for two reasons. First, even if it’s all about putting a rubber stamp on a predetermined choice, it’s still a new use case for the technology. So the team has to plan to mitigate risk. And a well-executed technology evaluation is one good way to mitigate risk. Second, going into an evaluation having already chosen the technology will put blinders on the team – rather than looking for unique ways in which new technologies can be applied, the focus instead is on doing things very much the way they’ve been done before.

A few years ago, I was managing a field engineering team when we found ourselves in one of these evaluations. The company clearly had already chosen another vendor, but because they were already using Vertica (happily), a technology evaluation was required. The company didn’t take the evaluation very seriously, and despite the fact that our team executed flawlessly, the company went with their original choice. They didn’t pay attention to the fact that the Vertica team started (and finished) the evaluation within seven days, which was how long it took the other vendor to pack their equipment and prepare it for shipping to the customer. They didn’t want to see the findings our team uncovered highlighting revenue opportunities hidden within the data. They selected the other vendor as they’d planned all along. And after six months trying to implement it, the folks who had selected the other vendor were looking for new jobs. Moreover, most of the data science team quit in frustration. So in one fell swoop, they significantly damaged their analytics capabilities.

So take it seriously, even if the choice seems predetermined.

Mistake number 1: Do an unrealistic test

One way to create an unrealistic test is to fit the test to the conditions, rather than the conditions to the test. The most frequent mistake here is using Vertica Community Edition, which is limited to three nodes and a terabyte of data, and then forcing the data used in the test to fit that limit. This is a bad idea for several reasons. First, the benefits of a distributed computing technology like Vertica don’t really show up at a terabyte. While you can run queries on the data, old school strategies such as indexing can make it look like row-oriented databases may perform as well. Second, it means “chopping down” the data – or making it fit the one terabyte threshold. This often results in artificial data, which brings with it all sorts of problems. The biggest problem, however, is that it may no longer allow you to derive the insights which solve the problems you’re trying to solve.

So test with a realistic volume of data. What is “realistic”? It’s a relative thing, but it should be more than just a token amount of data. Don’t feel compelled to limit your evaluation to a terabyte just because you want to run Vertica CE. This often goes hand in hand with mistake number 5 (don’t talk to any Vertica people). Don’t worry about talking with Vertica folks! We’re a friendly bunch with a commitment to the success of our customers. And we’re happy to set you up with an evaluation license that fits your data, so you don’t have to cram the data to fit the license.

Finally, there’s another way in which we see unrealistic evaluations. Particularly when the evaluation is driven by the IT team (see Mistake Number 4), the use case is often “run our existing queries faster”. While this is helpful, this is not what keeps the executive leadership awake at night. What keeps them awake? Fraud detection, personalized marketing, optimized business operations, new data products, and so forth. Note that the phrase “run our queries faster” did not appear on that list. So make the test realistic by asking bigger questions. What can’t the company do today because it can’t cope with big data? Why does it matter? These are the use cases which take a technology evaluation and translate it into terms leadership can understand – how is this technology going to enable the strategy of the business?

So there, in a nutshell, are the problems we see the most often in Vertica evaluations. We do a lot of these, and are quite good at it. So don’t hesitate to let us know when you want to try it out so we can help you avoid the pitfalls, and tie the technology to your strategy. If you’d like to talk with our team, click here to arrange a conversation.

MySQL Ate My Homework: Five Reasons You Should Always Use a Subpar Data Platform

falling short of a standard <the service at the restaurant was subpar, to say the least>

Synonyms: bush, bush-league, crummy (also crumby), deficient, dissatisfactory, ill, inferior, lame, lousy, off, paltry, poor, punk, sour, suboptimal, subpar, substandard, unacceptable, unsatisfactory, wack [slang], wanting, wretched, wrong

Related Words: abysmal, atrocious, awful, brutal, damnable, deplorable, detestable, disastrous, dreadful, execrable, gnarly [slang], horrendous, horrible, pathetic, stinky, sucky [slang], terrible, unspeakable; defective, faulty, flawed; egregious, flagrant, gross; bum, cheesy, coarse, common, crappy [slang], cut-rate, junky, lesser, low-grade, low-rent, mediocre, miserable, reprehensible, rotten, rubbishy, second-rate, shoddy, sleazy, trashy; abominable, odious, vile; useless, valueless, worthless; inadequate, insufficient, lacking, meager (or meagre), mean, miserly, scanty, shabby, short, skimp, skimpy, spare, stingy; miscreant, scurrilous, villainous; counterfeit, fake, phony (also phoney), sham

In the big data technology industry, we spend most of our time writing blogs and whitepapers about our technology.  I’m sure you’ve heard this before…“Our technology is great…it’s the best…most functional…top-notch” and so forth.  But we never really discuss when someone might want to use less effective technology – systems that may be more raw, or less suited to the task, or that have no vendor behind them.  Sure, these systems can break easily or might not do everything you want, but some of these technologies have tens of thousands of users around the world. So, they must be valid choices, right?

So, when should less effective technology be used?  Based on many years in the IT trenches, here is my countdown of the top five reasons you should use a subpar big data platform.

Caution: sarcasm ahead with a mostly serious ending which actually makes a point

Reason Number 5: Not invented here, dude

Who wants to be boring and pick existing technology that is solid… and works?  By rolling your own, you get serious technical chops.  What’s that knocking sound? That’s O’Reilly Media at your door…they want you to write a book!  Seriously, reinvention is under-rated.  Sure, relational databases have been around for forty-plus years, but reinventing transaction semantics or indexing would be seriously cool!  Give it a funny name and pick a cute animal for the logo and…voila!  Tech cred!

Furthermore, using off-the-shelf technology tends to create a situation some IT shops dread: transparency. What? The executives understand the technology we’re using well enough to monitor progress with it?  Time to throw it out and build something arcane from scratch to control what the execs see!

Reason Number 4: It’s free


When I was seven, one of my dad’s friends came by to visit around the holidays.  He gave me a kitten.  My dad got seriously steamed, and my mom looked like somebody had just sneezed in her soup.  But the kitten was free, right?  Three illnesses, a few injuries, and one or two thousand dollars later, and coupled with a year or so cleaning a litterbox, I realized that the kitten was not – in fact – free.

But we’re talking about software here.  Isn’t that different?  Free means you don’t need to deal with a sales guy and some engineer who’ll help you set things up in an hour.  You just go through a few websites and download four RPMs, the Java SDK, a Java JRE, five or six utilities, upgrade your OS, downgrade your OS, grab some runtime libraries for Linux, the Eclipse IDE, a downgraded version of the Eclipse IDE that’s required by the plug-in you’re about to download, and an Eclipse plug-in which kinda does most of what you need and…voila!  You can run the “hello world” example.  So free must be good, right?  Now, fire up “Getting Better” by the Beatles on your iPod and get to work!

Reason Number 3: You’ve got all the time in the world


Yeah, the business folks are in a panic about losing market share, and the CIO is a little bent out of shape about the fact that the IT budget has been going up at 15% every year, but what’s the big rush?  After all, the prospectus for that O’Reilly book needs to be seriously heavy stuff to have a chance of getting anywhere.  So dig into the technology!  Science projects can be fun when you’re doing science.  Hey, do those hardware guys really think that putting data on the disk tracks closer to the spindle will improve read times by 0.01%??  That sounds fun to test!  We can write a hack in HDFS for that!  Of course, the only way we can tell is on a cluster that has at least a thousand nodes.  The good news is that with modern cloud technologies, it’ll take only six months and ten people to test it!  The business can wait a little longer.

Reason Number 2: It’s cool


Does anything really need to be said here?  Cool + Not Invented Here = Happy Technologists = Productivity, right?

And (drum roll please)…

Reason Number 1: You like risk


Do you fly by those ancient thirty-year-olds on your kitesurfing rig wondering why they still use something as yesterday as a windsurfer?  Is base jumping from old-school spots like the KL Tower yesterday’s news for you?  Well, risk on!  In the stock market, risk=volatility=upside, right?  And the worst that can happen is the dollar value of your investment hits zero.  Why should it be any different with technology?  If you’re not base jumping from that erupting volcano, you’re not alive.  So bring together the adrenaline rush and the upside potential of adopting something which looks like it isn’t ready so that, in the event it ever gets to be what you need, you’re ahead of the curve!

Summing up, Seriously

While this piece – so far – has been very sarcastic, there’s a nugget of truth hidden within.  Businesses globally choose subpar technology every day believing that it will solve their problems.  And while they rarely select such technologies based on my sarcastic “top five” list above, they often select these technologies with the mistaken belief that they’re cheaper/better/faster, etc.

Today businesses don’t have to select subpar technologies for big data analytics.  Two years ago, Vertica opted to release the Vertica Community Edition.  This release of Vertica offers the full functionality of the product on up to one terabyte of raw data and a three-node cluster.  Furthermore, it now includes the full functionality of Vertica’s sentiment scoring engine (Pulse), Vertica’s geospatial add-in (Place), and Vertica’s auto-schematizer for things like JSON data (FlexZone).  I tried to talk the Vertica team out of offering so much for free!  But the team wants to share this with the world so organizations no longer have to settle for a subpar data platform.  It’s hard to argue with that!

So, if you want to try Vertica CE today, click here.

In my twenty-plus years of working with databases, I’ve installed and worked with just about every commercially available database under the sun, including Vertica.  And out of all of them, Vertica has been the easiest to stand up, the most powerful, and the highest quality.  Try it.  Seriously.

Don’t go for the subpar stuff, because you don’t have to.

That Giant Sucking Sound is Your Big Data Science Project


Vertica recently hosted its second annual Big Data Conference in Boston, Massachusetts. It was very well attended, with over eight hundred folks and about two hundred companies represented. We at Vertica love these events for a couple of reasons: first, because our customers tend to be our best spokespeople for such a sound product, and second, because it’s a chance for us to learn from them.

In one of the sessions, the presenter asked the audience how many of them had Hadoop installed today. Almost all the hands went up. This wasn’t too surprising given that the session was on Hadoop and Vertica integration. Then the presenter asked how many of those folks had actually paid for Hadoop. Most of the hands went down. Then the presenter asked how many of those folks felt that they were getting business value out of their investment. Only two or three hands stayed up. This was eye-opening for us at HP, and it was surprising to the audience as well. Everyone seemed to think they were doing something wrong with Hadoop that was causing them to miss out on the value.

Over the next few days, I made a point of tracking down folks in the audience and getting their thoughts on what the issues were. Since most of them were Vertica customers, I knew many of them already. I thought it would be helpful to identify the signs of a big data science project – a project where a team has installed something like Hadoop and is experimenting with it in the hope of achieving some new analytic insights, but isn’t on a clear path to deriving value from it. Some clear themes emerged, and they align with what my colleagues in the industry and I have been observing over the last few years. So, without further ado, here are the top five signs that you may have a big data science project in your enterprise:

    1. The project isn’t tied to business value, but has lots of urgency. Somebody on the leadership team went to a big data presentation and has hit the panic button. As a result, the team rushes ahead and does…something. And maybe splinters into different teams doing different things. We all know how well this will turn out.
    2. The technologies were chosen primarily because they beef up resumes. There’s so much hype around big data and the shortage of people with relevant skills that salaries are inflated. And in the face of a project with high urgency, nobody wants to stand still. So download some software! That open source stuff is great, right? While it’s generally true that multiple technologies can solve the same big data problems, some will fit with the business more readily than others. Maybe they’re easier to deploy. Maybe they don’t require extensive skill retooling for the staff. Maybe the TCO is better. Those are all good things to keep in mind during technology selection. But selecting technology for “resume polishing”? Not so much.
    3. The project is burdened with too much process. Most organizations already have well-defined governance processes in place for technology projects. And, so the reasoning goes, big data is basically just a bunch more of the same data and same old reporting & analysis. So when it’s time to undertake a highly experimental big data analytics project which requires agility and adaptability, rigid process usually results in a risk-averse mindset where failure at any level is seen as a bad thing. For projects like these, failure during the experimentation isn’t just expected, it’s a critical part of innovation.
    4. The “can’t-do” attitude. It’s been a well understood fact of life for decades that IT departments often feel under siege – the business always asks for too much, never knows what it wants, and wants it yesterday. As a result, the prevailing attitude in many IT teams today is to start by saying “no”, and then line up a set of justifications for why radical change is bad.
    5. The decision-making impedance mismatch. Sometimes, organizations need to move fast to develop their insights. Maybe it’s driven by the competition, or maybe it’s driven by a change in leadership. And…then they move slooooowly, and miss the opportunity. Other times, the change represents a big one with impact across the company, and requires extensive buy-in and consensus. And…then it moves at a breakneck pace and causes the organization to develop antibodies and reject the project.


So if your organization has one or more big data projects underway, ask whether it suffers from any of these issues. If so, you may have a big data science project on your hands.

Is Big Data Giving You Grief? Part 5: Acceptance

“We can do this”

Over the last month or so, this series has discussed how organizations often deal with a missed big data opportunity in ways that closely resemble the grieving process, and how that process maps to the commonly understood five stages of grief: denial, anger, bargaining, depression, and acceptance. This is the last entry in the series; it focuses on how an organization can move forward effectively with a big data project.

While big data is big, complicated, fast, and so forth, it is also very vague to most businesses. I was at an event recently where a poll question was asked of a room full of technology professionals – “How important is big data to your business?” A surprisingly high number of respondents felt that big data wasn’t relevant to them. Afterwards, I spoke with one of the attendees over lunch. I asked him what the primary challenges to his business were. It turns out that his company’s costs depend primarily on commodity prices – if the price of an input such as oil goes up or the supply is disrupted, the entire business is affected. I asked him whether he thought social media was relevant to his business, and he didn’t believe so. I then talked about how hedge funds have found that Tweets can be a very effective way of predicting commodity prices and availability disruptions. Until that moment, he was unaware that this was possible. This was what I call a “light bulb” moment. Suddenly, the appeal of big data became clear.

This experience highlighted for me a fundamental issue I see daily in the big data space: big data is just too big (and vague) for many organizations to grasp its tangible value, and grasping that value is an important prerequisite to moving forward. So even while they go through all the stages of grief and struggle with the fact that their competitors may be outperforming them due to big data, companies also struggle with how to turn that into a plan of action.

Once they’ve worked their way through the realization that something’s wrong, organizations are often ready to take action. Here are some of the most helpful techniques I’ve seen businesses use over the years to begin an effective big data program – to accept the reality of the situation, and move forward.

Execute tactically, think strategically
For the organization first tackling big data, this is probably the most important thing to keep in mind. Big data projects rarely start with a crystal clear vision of what the strategic outcome should be. Uncertainty and hype around the opportunity, unfamiliarity with how to handle big data, a lack of data science competence, and so forth all create challenges that make it tough to articulate an up-front strategic vision.

But don’t interpret that as a pass to ignore the potential impact of a big data project. Thus the advice. Execute the project tactically – be prepared to move fast with the aim to demonstrate value quickly. And when the project is complete, a debrief with the business leadership is essential. In this debrief, answer two questions: How did applying big data matter to the business? And given what we’ve learned, how can our next project impact the business in a bigger way?

The answers are inputs to the next project, and over time can serve as a powerful guide to articulating a big data strategy for the business.

Don’t boil the ocean
Very often, when a group of people from an organization attend a big data event, they all come back very enthused about big data projects. Vendors love to talk about big-picture, blue sky notions of transforming businesses or industries with big data. It’s exciting stuff, but doesn’t lend itself to immediate action – especially for a business new to big data.

So don’t start there.

A much better approach is to identify measurable goals that can be tied to actions that can be completed in the right timeframe. What’s “the right timeframe”? Good question! In part, it depends on how open the business is to a big data initiative – if the leadership team is bearish on the idea and needs powerful convincing, it’ll be important to demonstrate value quickly. Also, immediacy is a powerful guide to enthusiasm – so don’t tell the IT team to disappear for a year and come back with a big data architecture. There’s no immediacy, and as a result there likely won’t be much focus. So don’t boil the ocean and try to do everything at once, in a big hurry. Start with focus, and retain it as you progress.

One foot in front of the other (and sometimes…baby steps!)
When an organization wakes up and realizes that it’s at risk of being left behind or otherwise outperformed by others due to big data, the first response can be panic. The CEO or CMO may set a goal for the team – catch up. This can kick everyone into overdrive quickly, which is great. But it can also set everyone running in different directions with a vague charter to do something to change the business…now!

The tendency is to start chasing the Big Goal – maybe something dramatic like “reinvent the business”. For the organization new to big data, this is a recipe for trouble. Developing any new core competence takes time, and nobody starts as an expert. Learning to incorporate big data into your business is the same thing. It’s probably not realistic to expect a team accustomed to managing enterprise applications (which might all be running on a twenty-year-old technology stack) to learn massively parallel technologies, large scale data management and data science in a week. Or a month. Or a year.

So put one foot in front of the other. Don’t expect to master big data overnight, and instead take measured steps. Pick a project with a strong return on investment to get stakeholders on board and get the technology team’s feet wet in new technology. Then make the next project somewhat more ambitious. As the team learns more about delivering these projects, it’ll be much more natural to assess larger questions such as revising technology architecture.

It’s not too late
Marketing is marketing and reality is reality. Just because one of your competitors released a success story about their big data program last week doesn’t mean that there’s no benefit for your company. And when an article shows up online or in the printed media that declares that the big data war is over, and you lost if you’re not one of a handful of companies – take it with a huge grain of salt. There’s nothing wrong with a big data project that makes your business more profitable, or drives more top line revenue. And while it’s fun to contemplate reinventing your company, there are plenty of practical (and do-able) opportunities for improving revenue, customer experience, efficiency, etc. So don’t think for a moment that it’s too late.

Furthermore, by waiting a bit, organizations can take advantage of the learnings of others – things to do, things to avoid, and so forth. And the tools will usually improve. And successful use cases will become easier to spot. All these factors will reduce the risk to your big data project, and increase the likelihood of success. So it’s not too late.

To Accept or Not
Sadly, not all organizations make it to this stage. I’ve seen companies get stuck in finger pointing exercises, or trapped in endless cycles of ill-defined big data “science projects” that never seem to produce anything tangible and never end, or even put on blinders and avoid big data completely. But for companies who get to a place where they’re ready to accept the challenge, there are opportunities to meaningfully impact the business. And there are frequently increasing returns on well-crafted big data projects – which is to say that for every additional dollar spent over time, the value to the business actually increases. I’ve seen this cycle unfold time and time again, and in every single case of which I’m aware, the organization has reached the stage I’m referring to as “acceptance”, and is moving forward in a well-planned fashion with an effective big data program.

In fact, as I write this I’m listening to the HP Vertica Customer Advisory Board talk about their experiences to date with Vertica. And every one of them has approached their big data program in the ways described above. And every one of them has discovered increasing returns to their big data investment over time.

So put big data grief aside, accept that big data can help your business, and get started!

Is Big Data Giving You Grief? Part 4: Depression

“The problem is too big.  How can we possibly address it?”

Continuing the five-part series which explores how organizations coping with big data often go through a process that closely resembles grief, this segment addresses the point at which the organization finally grasps the reality of big data and realizes the magnitude of the opportunity and challenge…and gets depressed about the reality of it.

Having seen this more than once, I’ve observed a few ways this shows in an organization.  Here are the most common reactions.

It’s too big

This reaction makes sense.  After all, as much as we in the industry say that “big data” is more than big and describe it with a laundry list of varying attributes, we all agree that it’s big.  It represents addressing data at a scale never before attempted by most organizations.  It represents analytic capabilities the organization may never have attempted before – and a pivot towards being an analytics-driven company.  And it may represent opportunities that are so big they appear to be nebulous: “If I capture ten thousand times as much data about my product, how does that translate into value?  Does that mean I’ll sell ten thousand times as many widgets?  How do I quantify the payoff?”

It may be challenging just to get a handle on the costs of a big data program for reasons mentioned in earlier parts of this series, much less the potential payoff.  This can make for a very challenging return-on-investment calculation.

We’re not ready

I believe I may have heard this particular form of worry more than anything else.  The infrastructure isn’t ready, the people aren’t ready to build big data applications, the business isn’t ready to consume the new data, and so on.  And, in fact, the company may not be prepared to size the big data effort because the team may not have the know-how for the ROI calculation (see above).  Also, the executive leadership may be unprepared to make a strategic wager on the program because of the uncertainty around the risks and benefits.

This can seem like a true show-stopper.  It’s not easy to change an organization.  Skills and technologies may not appear to be aligned with big data needs.  The various lines of business may not realize the ways they can improve or revolutionize their business.  The leadership team may be unaccustomed to making big bets on unproven technologies, or may believe that big data is a fad and will pass.

We’re too late

I hear this a lot too.  Everywhere a business turns today there’s a story about how someone has transformed their business, created new markets, broken old barriers, etc.  It’s easy to believe that all the opportunity is gone – that there’s no more benefit to tackling big data because it’s already been done.  It’s also easy to believe that it would be impossible to “catch up” with others because of all the time and effort required.

While this can be an intimidating belief, it can also be hard to characterize accurately.  After all, do you think your competitors will announce that the big data project they recently publicized in the media is a year late and $10M over budget?  Instead, they’ll play it up as if it’s a runaway success.  Vendors help this along too – who wouldn’t want to tout that their product helped a company?

So the saying goes – “The darkest hour is just before the dawn.”  Sage words, written long before computers, that apply to this situation.  But this is actually a positive place to be, because once a team has moved through denial, anger, bargaining, and into depression, it’s ready to come to terms with the situation and make an action plan to move forward.  I’ll discuss that next week in the final part of this series: acceptance.

Next week the series concludes with…acceptance.  “We can do this.”
