There were many interesting technical sessions at the 2014 Vertica Big Data Conference. However, I feel the most captivating was the session on loading 35 TB per hour. The sheer number of nodes in the clusters, the amount of data ingested, and the use cases made for a truly educational session, which covered how Vertica was able to meet Facebook’s ingest SLA (Service Level Agreement) of 35 TB/hour.
Facebook’s SLA required continuous loading without disrupting queries, while still budgeting enough capacity to catch up after maintenance windows, unplanned events, or ETL process issues. With continuous loading, planning for such events comes down to a choice: catch up on the missed batches or skip them. There are also numerous considerations, such as the average and peak ingest sizes. There may also be an availability SLA, which requires that users have access to data within a certain amount of time after an event happens.
In planning for Facebook’s SLA, the processes driving the large fact tables were scheduled to run every 15 minutes, or 4 times an hour. Factoring in the largest possible batch, and allowing time for batches to catch up, the required rate worked out to 35 TB/hour. Achieving this rate meant writing data to disk once at a speed of 15,000 MB/sec.
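As a rough sanity check (my arithmetic, not figures from the session), the raw rate behind that SLA works out like this:

```python
# Back-of-the-envelope check of the ingest SLA (illustrative arithmetic).
TB_PER_HOUR = 35
MB_PER_TB = 1_000_000
SECONDS_PER_HOUR = 3600

raw_mb_per_sec = TB_PER_HOUR * MB_PER_TB / SECONDS_PER_HOUR
print(f"Raw average rate: {raw_mb_per_sec:,.0f} MB/sec")  # ~9,722 MB/sec

# The session's 15,000 MB/sec target sits well above the raw average,
# leaving headroom to catch up after maintenance or ETL hiccups.
headroom = 15_000 / raw_mb_per_sec
print(f"Headroom factor: {headroom:.2f}x")  # ~1.54x
```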
The first stage of the discovery process was to determine how fast one node can parse with a COPY statement. For the test, a COPY was initiated from any node, with files staged on node 1. Only the CPU on node 1 performed parsing; the parsed rows were then delivered to all nodes and sorted on each node. A speed of 170 MB/sec could be achieved with 4 COPY statements running in parallel (on year-old hardware); additional statements did not increase the parse rate. In this configuration, memory was consumed mainly during the sort phase.
Scaling out to parse on multiple nodes, throughput scaled roughly linearly to about 510 MB/sec. In this test, a COPY was initiated from any node, with files staged on nodes 1, 2, and 3. Only the CPUs on nodes 1, 2, and 3 performed parsing; the parsed rows were then delivered to all nodes and sorted on each node. However, scaling does not hold up past about 50-60 parsing nodes, at which point more memory was consumed by receive than by sort. Vertica is designed to ensure there is never head-of-line blocking in a query plan, which means that beyond 50 or 60 nodes, the overhead of that many streams converging on a node requires a lot of memory for receive buffers.
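A quick extrapolation (mine, not the presenters') shows why that 50-60 node ceiling matters for the SLA:

```python
# Linear extrapolation of the observed parse rates (illustrative).
per_node_mb_sec = 170   # observed with 4 parallel COPY statements per node
target_mb_sec = 15_000  # write rate needed for the 35 TB/hour SLA

nodes_needed = target_mb_sec / per_node_mb_sec
print(f"Parsing nodes needed if scaling stayed linear: {nodes_needed:.0f}")
# ~88 nodes -- well past the 50-60 node point where receive-buffer
# overhead breaks the linear scaling, hence the need for another approach.
```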
At 50 to 60 nodes, the overhead of running a COPY statement is large enough that running 3 or 4 statements in parallel and continuing to scale becomes difficult. Scaling the cluster up further would disrupt queries, as the nodes would be completely utilized for 10 minutes at a time. Running COPY for the full 15 minutes would leave some CPU for queries, but not enough headroom to catch up. In addition, Facebook required no more than 20% degradation in query performance.
Even scaled to 180 nodes, running a COPY statement remains expensive in terms of memory. The best approach is to isolate the workloads, separating loading from querying with Vertica’s ephemeral node feature. This feature is typically used in preparation for removing a node from a cluster. In this solution, however, some nodes were marked ephemeral before any data was loaded. Data loaded into the database would therefore never land on the ephemeral nodes, and those nodes would not be used for querying.
With the loader nodes isolated from querying, their CPUs could be utilized at 100%. The implementation required adjusting resource pool scaling only on the ingest nodes, using an undocumented setting, SysMemoryOverrideMB, in vertica.conf on those nodes. There are additional considerations around memory utilization on the ingest nodes, such as setting aside memory for the catalog. After some minor tuning, the results looked like this:
Nodes | Nodes (Ingest) | Nodes (Data) | Ingest Rate (TB/HR) | Peak Ingest Rate (TB/HR)
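For illustration only, the memory override mentioned above might look like the following in vertica.conf on an ingest node. The parameter name comes from the session; the value here is a made-up placeholder, and since the setting is undocumented, you should consult Vertica support before touching it:

```ini
# vertica.conf on an ingest node (illustrative sketch; value is a placeholder)
SysMemoryOverrideMB = 131072  ; scale resource pools as if the node had 128 GB,
                              ; applied only on the loader (ingest) nodes
```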
There were also considerations around using a cluster file system. For instance, a batch process may produce all the files to be loaded by a COPY statement, which are then transferred to the loader nodes. If one of the loader nodes suffered a hardware failure, the batch would fail during the COPY. With a NAS or cluster file system, recovery would be possible by simply re-running the batch. However, the disk I/O and network bandwidth required to write extra copies of the data for every batch made a NAS or cluster file system too expensive; it was easier to just re-run failed batches.
Given the massive pipes at Facebook, it was cheaper to re-run an ingest than to build a high availability solution that could re-stage data when part of a batch did not complete. However, if it’s critical to make data available to users quickly, it may be worth investing more in an architecture that allows the COPY statement to restart without re-staging data.
Along the way, it was discovered that file systems lose performance as they age under a high rate of churn. The typical Vertica environment keeps about a year or two of history. With a short window of history and 3-4% of the file system capacity being loaded each day, the file system is churned frequently. Although some file systems age better than others, it’s good practice to monitor the percentage of CPU time spent in system (kernel) mode.
Facebook’s churn rate was about 1.5 PB every 3 days in their 3.5-4 PB cluster. After about a week, the file system would become extremely fragmented. This wasn’t a Vertica issue per se, but an operational issue spanning the stack from the hardware up through the software.
There will be times when a recovery cannot finish if loading runs 24 hours a day and there are long-running queries. The larger the cluster, the less avoidable hardware failures – and with them, nodes in recovery – become. As of 7.1, standby nodes can be assigned to fault groups. For example, a rack of servers can be treated as a fault domain, with a spare node on standby within it. If a node fails and crosses a downtime threshold, the standby node within the same rack becomes active. This approach is well suited to cases where nodes must enter recovery fairly quickly and the cluster is continuously busy.
The most common practice for a standby cluster is to run the ETL twice, maintaining two separate instances of Vertica. With Facebook’s clusters potentially running to hundreds of nodes, this approach was not feasible as a high availability solution.
Facebook is taking steps to give better operational performance to other groups in their organization while ensuring that none of their clusters sit idle. The focus is moving away from dedicated resources and toward dynamically allocating resources and offering them as a service. The goal is for any cluster, regardless of size, to be able to request data sets from any input source and ingest them into Vertica. In terms of size, 10 nodes is considered small, and several hundred nodes is considered large.
To accomplish this, an adapter retrieves the source data and puts it into a form that can be staged or streamed. At the message bus layer, the data is pre-hashed, meaning that hashing is local to Vertica and each row also hashes to its buddy projection. This is accomplished by implementing, in the message distribution layer, the same hash function used on the projection. Based on the value of the segmentation column, the node on which a row will reside can be predicted, so data can be sent explicitly to the target nodes with no need to redistribute rows between nodes. In this elastic and scalable structure, the ephemeral node approach to ingest is no longer needed.
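The routing idea can be sketched in miniature. This is an illustrative stand-in only: the hash function, node names, and topology below are made up, not Vertica's actual segmentation hash or buddy-projection layout.

```python
# Toy sketch of pre-hashing rows to target nodes at the message-bus layer.
# Hypothetical: MD5-based routing stands in for the projection's real hash.
import hashlib

NODES = ["node01", "node02", "node03", "node04"]

def target_node(segment_key: str) -> str:
    """Predict the node a row lands on from its segmentation column value."""
    digest = hashlib.md5(segment_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Rows with the same key always route to the same node, so the message bus
# can deliver each row directly and skip inter-node redistribution.
assert target_node("user_42") == target_node("user_42")
print(target_node("user_42"))
```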
Is Big Data Giving You Grief? Part Two: Anger
“We missed our numbers last quarter because we’re not leveraging Big Data! How did we miss this?!”
Continuing this five-part series on how organizations frequently go through the five stages of grief when confronting big data challenges, this post focuses on the second stage: anger.
It’s important to note that while an organization may begin confronting big data with something very like denial, anger usually isn’t far behind. As mentioned previously, very often the denial is rooted in the fact that the company doesn’t see the benefit in big data, or the benefits appear too expensive. And sometimes the denial can be rooted in a company’s own organizational inertia.
Moving past denial often entails learning that big data is worth pursuing. Ideally, this learning comes from self-discovery and research – looking at the various opportunities big data represents, casting a broad net for technologies to address it, and so on. Unfortunately, sometimes the learning is much less pleasant: the competition learns big data first…and suddenly performs much better. This can show up in a variety of ways – your competitors suddenly have products that seem much more aligned with what people want to buy; their customer service improves dramatically while their overhead actually goes down; and so on.
For better or worse, this learning often results in something that looks an awful lot like organizational “anger”. As I look back at my own career to my days before HP, I can recall more than a few all-hands meetings hosted by somber executives highlighting deteriorating financials, as well as meetings featuring a fist pounding leader or two talking about the need to change, dammit! It’s a natural part of the process wherein eyes are suddenly opened to the fact that change needs to occur. This anger often is focused at the parties involved in the situation. So, who’re the targets, and why?
The Leadership Team
At any company worth its salt, the buck stops with the leadership team. A shortcoming of the company is a shortcoming of the leadership, so self-reflection is a natural focus of anger. How did a team of experienced business leaders miss this? Companies task leaders with both the strategic and operational guidance of the business – so if they missed a big opportunity in big data, or shot it down because it looked too costly or risky, this is often seen as a problem.
Not to let anybody off the hook, but company leadership is also tasked with a responsibility to the investors. And this varies with the type of company, stage in the market, etc. In an organization tasked with steady growth, taking chances on something which appears risky – like a big data project where the benefits are less understood than the costs – is often discouraged. Also, leaders often develop their own “playbook” – their way of viewing and running a business that works. And not that many retool their skills and thinking over time. So their playbook might’ve worked great when brand value was determined by commercial airtime, and social media was word of mouth from a tradeshow. But the types and volume of information available are changing rapidly in the big data world, so that playbook may be obsolete.
Also, innovation is as much art as science. This is something near & dear to me both in my educational background as well as career interests. If innovation was a competence that could just be taught or bought, we wouldn’t see a constant flow of companies appearing (and disappearing) across markets. We also wouldn’t see new ideas (the web! social networking!) appear overnight to upend entire segments of the economy. For most firms, recognizing the possibilities inherent in big data and acting on those possibilities represents innovation, so it’s not surprising to see that some leadership teams struggle.
The IT Team
There are times when the upset over a missed big data opportunity is aimed at the staff. It’s not unusual to see a situation where the CEO of a firm asked IT to research big data opportunities, only to have the team come back and state that they weren’t worthwhile. And six months later, after discovering that the competition is eating their lunch, the CEO is a bit upset at the IT team.
While this is sometimes due to teams being “in the bunker” (see my previous post here), in my experience it occurs far more often due to the IT comfort zone. Early in my career, I worked in IT for a human resources department. The leader of the department asked a group of us to research new opportunities for the delivery of information to the HR team across a large geographic area (yeah, I’m dating myself a bit here…this was in the very early days of the web). We were all very excited about it, so we ran back to our desks and proceeded to install a bunch of software to see what it could do. In retrospect I have to laugh at myself about this – it never occurred to me to have a conversation with the stakeholders first! My first thought was to install the technology and experiment with it, then build something.
This is probably the most common issue I see in IT today. The technologies are different but the practice is the same. Ask a room full of techies to research big data with no business context and…they’ll go set up a bunch of technology and see what it can do! Will the solution meet the needs of the business? Hmm. Given the historical failure rate of large IT projects, probably not.
The Vendors
It’s a given that the vendors might get the initial blame for missing a big data opportunity. After all, they’re supposed to sell us stuff that solves our problems, aren’t they? As it turns out, that’s not exactly right. What they’re really selling us is stuff that solves the problems their technology was built for. Why? Well, that’s a longer discussion that Clayton Christensen has addressed far better than I ever could in “The Innovator’s Dilemma”. Suffice it to say that the world of computing technology continues to change rapidly, and products built twenty years ago to handle data are often hobbled by their legacy – both in the technology and in the organization that sells it.
But if a company is writing a large check every year to a vendor – it’s not at all unusual to see firms spend $1 million or more per year with technology vendors – they often expect a measure of thought leadership from that vendor. So if a company is blindsided by bad results because they’re behind on big data, it’s natural to expect that the vendor should have offered some guidance, even if it was just to steer the IT folks away from an unproductive big data science project (for more on that, see my blog post coming soon titled “That Giant Sucking Sound is Your Big Data Lab Experiment”).
Moving past anger
Organizational anger can be a real time-waster. Sometimes, assigning blame can gain enough momentum that it distracts from the original issue. Here are some thoughts on moving past this.
You can’t change the past, only the future. Learning from mistakes is a positive thing, but there’s a difference between examining the causes and looking for folks to blame. It’s critical to identify the real reasons the opportunity was missed rather than playing the “blame game”, which sucks up precious time and may in fact prevent identification of the real issue. I’ve seen more than one organization with what I call a “Teflon team” – a team which is never held responsible for the impact its work has on the business, regardless of its track record. Once or twice, I’ve seen these teams do very poor work while the responsibility was placed elsewhere – so the team never improves and the poor work continues. Watch out for the Teflon team!
Big data is bigger than you think. It’s big in every sense of the word because it represents not just the things we usually talk about – volume of data, variety of data, and velocity of data – but it also represents the ability to bring computing to bear on problems where this was previously impossible. This is not an incremental or evolutionary opportunity, but a revolutionary one. Can a business improve its bottom line by ten percent with big data? Very likely. Can it drive more revenue? Almost certainly. But it can also develop entirely new products and capabilities, and even create new markets.
So it’s not surprising that businesses may have a hard time recognizing this and coping with it. Business leaders accustomed to thinking of incremental boosts to revenue, productivity, margins, etc. may not be ready to see the possibilities. And the IT team is likely to be even less prepared. So while it may take some convincing to get the VP of Marketing to accept that Twitter is a powerful tool for evaluating their brand, asking IT to evaluate it in a vacuum is a recipe for confusion.
So understanding the true scope of big data and what it means for an organization is critical to moving forward.
A vendor is a vendor. Most organizations have one or more data warehouses today, along with a variety of tools for the manipulation, transformation, delivery, analysis, and consumption of data. So they will almost always have some existing vendor relationships around technologies which manage data. And most of them will want to leverage the excitement around big data, so will have some message along those lines. But it’s important to separate the technology from the message. And to distinguish between aging technology which has simply been rebranded and technology which can actually do the job.
Also, particularly in big data, there are “vendorless” or “vendor-lite” technologies which have become quite popular – technologies such as Apache Hadoop, MongoDB, and Cassandra. These are often driven less by a vendor with a product goal and more by a community of developers who cut their teeth on open-source software, which comes with very different business economics. Without a single marketing department to control the message, these technologies can be associated with all manner of claims about their capabilities – some accurate, and some not. This is a tough issue to confront because the messages can be conflicting and diffuse. The best advice I’ve got here is: if an open source technology sounds too good to be true, it very likely is.
Fortunately, this phase is a transitional one. Having come to terms with anger over the missed big data opportunity or risk, businesses then start to move forward…only to find their way blocked. This is when the bargaining starts. So stay tuned!
Next up: Bargaining “Can’t we work with our current technologies (and vendors)? …but they cost too much!”
My father passed away recently, and so I’ve found myself in the midst of a cycle of grief. And, in thinking about good blog topics, I realized that many of the organizations I’ve worked with over the years have gone through something very much like grief as they’ve come to confront big data challenges…and the stages they go through even map pretty cleanly to the five stages of grief! So this series was born.
So it’ll focus on the five stages of grief: denial, anger, bargaining, depression, and acceptance. I’ll explore the ways in which organizations experience each of these phases when confronting the challenges of big data, and also present strategies for coping with these challenges and coming to terms with big data grief.
Part One: Denial
“We don’t have a big data problem. Our Oracle DBA says so.”
Big data is a stealth tsunami – it has snuck up on many businesses and markets worldwide. As a result, they often believe initially that they don’t need to change. In other words, they are in denial. In this post, I’ll discuss various forms of denial and recommend strategies for moving forward.
Here are the three types of organizational “denial” that we’ve seen most frequently:
They don’t know what they’re missing
Typically, these organizations are aware that there’s now much more data available to them, but don’t see how it represents an opportunity for their business. Organizations may have listened to vendors, who often focus their message on the use cases they want to sell into – which may not be the problem a business needs to solve. But it’s also common for an organization to settle into its comfort zone; the business is doing just fine and the competition doesn’t seem to be gaining any serious ground. So, the reasoning goes, why change?
The truth is that, as much as those of us who work with big data every day feel there’s always a huge opportunity in it, for many organizations it’s just not that important yet. They might know that tens of thousands of people tweet about their brand every day, but they haven’t yet recognized the influence these tweets can have on their business. And they may not have any inkling that those tweets can be signals of intent – intent to purchase, intent to churn, and so on.
They don’t think it’s worth doing
Organizations in denial may also question whether dealing with big data is worth doing. An organization might already be paying a technology vendor $1 million or more per year for technology…and this to handle just a few terabytes of data. When the team looks at a request to suddenly deal with multiple petabytes of data, it automatically assumes that the costs would be prohibitive and shuts down that line of thinking. This attitude often goes hand-in-hand with the first item…after all, if it’s outrageously expensive to even consider a big data initiative, it seems there’s no point in researching it further since it can’t possibly provide a strong return on investment.
Somebody is in the bunker
While the prior two items pertained largely to management decisions based on return on investment for a big data project, this one is different. Early in my career I learned to program on the SAS analysis platform. As I pursued this at several different firms, I observed that organizations tended to build a team of SAS gurus who held the keys to this somewhat exotic kingdom. Key business data existed only in SAS datasets, which were difficult to access from other systems. Programming in SAS also required a specialized skillset that only a few possessed. Key business logic – predictive models, data transformations, business metric calculations, and so on – was all locked away in a large library of SAS programs. I’ve spoken with more than one organization that has a hundred thousand (or more!) SAS datasets, and several times that many SAS programs, floating around the business…many of which contain key business logic and information. As a result, the SAS team often held a good position in the organizational food chain, and its members were well paid.
One day, folks began to discover that they could download other tools that did very similar things, didn’t care where the data resided, cost a fraction of SAS, and required less exotic programming skills.
Can you see where this is going?
I also spent some years as an Oracle DBA and database architect, and witnessed very similar situations. It’s not uncommon – especially given how disruptive big data technologies can be – to see teams go “into the bunker” and be very reluctant to change. Why would they volunteer to give up their position, influence and perks? And so we now are at the intersection of information technology and a classic change management challenge.
Moving forward past denial
For an organization, working through the denial stage can seem daunting, but it’s very do-able. Here are some recommendations to get started:
Be prepared to throw out old assumptions. The world is rapidly becoming a much more instrumented place, so there are possibilities today that literally didn’t exist ten years ago. The same will be true in another ten years (or less). This represents both opportunity and competitive threat. Not only might your current competitors leverage data in new ways, but entirely new classes of products may appear quickly that will change everything. For example, consider the sudden emergence in recent years of smartphones, tablets, Facebook, and Uber. In their respective domains, they’ve caused entire industries to churn. So it’s important to cast a broad net in terms of looking for big data projects to deliver value for your business.
Big data means not having to say “no.” I’ve worked with numerous organizations that have had to maintain a high-cost infrastructure for so long that they’re used to saying “no” when they’re approached for a new project. And they add an exclamation point (“no!”) when they’re approached with a big data project. Newer technologies and delivery models offer the chance to put much more in the hands of users. So, while saying no may sometimes be inevitable, it no longer needs to be an automatic response. When it comes to an organization’s IT culture, be ready to challenge the common wisdom about team organization, project evaluation, and service delivery. The old models – the IT service desk, the dedicated analyst/BI team, organizing a technology team into technology-centric silos such as the DBA team – may no longer be a fit.
Big data is in the eye of the beholder. Just because vendors love to talk about Twitter (and I’m guilty of that too) doesn’t mean that Twitter is relevant to your business. Maybe you manufacture a hundred pieces of very complex equipment every year and sell them to a handful of very large companies. In this case, it’s probably best not to worry overmuch about tweets. You might have a very different big data problem. For instance, you may need to evaluate data from your last generation of devices, which had ten sensors each generating ten rows of data per second. And you know that the next generation will have ten thousand sensors each generating a hundred rows per second – so very soon you’ll need to cope with around ten thousand times as much data (or more – the new sensors may provide much more information than the older ones). And if a device goes awry, your customer might lose a $100 million manufacturing run. So don’t dismiss the possibilities in big data just because your vendor doesn’t talk about your business. Push vendors to help you solve your problems; the ones worth partnering with will work with you to do it.
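The growth in that hypothetical sensor example is easy to quantify (a quick sanity check of the numbers above, nothing more):

```python
# Scale factor for the hypothetical sensor example in the text.
old_rows_per_sec = 10 * 10       # ten sensors, ten rows/sec each
new_rows_per_sec = 10_000 * 100  # ten thousand sensors, a hundred rows/sec each
print(new_rows_per_sec // old_rows_per_sec)  # -> 10000
```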
Data expertise is a good thing. Just because you might not need ten Oracle DBAs in the new world doesn’t mean you should lay eight of them off. The folks who have been working intimately with the data in the bunker often have very deep knowledge of it. They frequently can retool and, in fact, find themselves having a lot more fun delivering insights and helping the business. It may be important to re-think the role of the “data gurus” in the new world. In fact, I’d contend that this is where you may find some of your best data scientists.
While organizational denial is a tough place to be when it comes to big data, it happens often. And many are able to move past it. Sometimes voluntarily, and sometimes not – as I’ll describe in the next installment. So stay tuned!
Anger: “We missed our numbers last quarter because we have a big data problem! What the heck are we going to do about it?”
The “De-mythification” Series
Part 4: The Automagic Pixie
Au∙to∙mag∙ic: (Of a usually complicated technical or computer process) done, operating, or happening in a way that is hidden from or not understood by the user, and in that sense, apparently “magical”
In previous installments of this series, I de-bunked some of the more common myths around big data analytics. In this final installment, I’ll address one of the most pervasive and costly myths: that there exists an easy button that organizations can press to automagically solve their big data problems. I’ll provide some insights as to how this myth has come about, and recommend strategies for dealing with the real challenges inherent in big data analytics.
Like the single-solution elf, this easy button idea is born of the desire of many vendors to simplify their message. The big data marketplace is new enough that all the distinct types of needs haven’t yet become entirely clear – which makes it tough to formulate a targeted message. Remember in the late 1990’s when various web vendors were all selling “e-commerce” or “narrowcasting” or “recontextualization”? Today most people are clear on the utility of the first two, while the third is recognized for what it was at the time – unhelpful marketing fluff. I worked with a few of these firms, and watched as the businesses tried to position product for a need which hadn’t yet been very well defined by the marketplace. The typical response by the business was to keep it simple – just push the easy button and our technology will do it for you.
I was at my second startup in 2001 (an e-commerce provider using what we would refer to today as a SaaS model) when I encountered the unfortunate aftermath of this approach. I sat down at my desk on the first day of the job, and was promptly approached by the VP of Engineering, who informed me that our largest customer was about to cancel its contract – we’d been trying to upgrade the customer for weeks, during which time their e-commerce system was down. Although they’d informed the customer that the upgrade was a push-button process, it wasn’t. In fact, at the time I started there, the team was starting to believe that an upgrade would be impossible and that they should propose re-implementing the customer from scratch. By any standard, this would be a fail.
Over the next 72 hours, I migrated the customer’s data and got them up and running. It was a Pyrrhic victory at best – the customer cancelled anyhow, and the startup went out of business a few months later.
The moral of the story? No, it’s not to keep serious data geeks on staff to do automagical migrations. The lesson here is that when it comes to data driven applications – including analytics – the “too good to be true” easy button almost always is. Today, the big data marketplace is full of great sounding messages such as “up and running in minutes”, or “data scientist in a box”.
“Push a button and deploy a big data infrastructure in minutes to grind through that ten petabytes of data sitting on your SAN!”
“Automatically derive predictive models that used to take the data science team weeks in mere seconds! (…and then fire the expensive data scientists)!”
Don’t these sound great?
The truth is, as usual, more nuanced. One key point I like to make with organizations is that big data analytics, like most technology practices, involves different tasks. And those tasks generally require different tools. To illustrate this for business stakeholders, I usually resort to the metaphor of building a house. We don’t build a house with just a hammer, or just a screwdriver. In fact, it requires a variety of tools – each of which is suited to a different task. A brad nailer for trim. A circular saw for cutting. A framing hammer for framing. And so on. And in the world of engineering, a house is a relatively simple thing to construct. A big data infrastructure is considerably more complex. So it’s reasonable to assume that an organization building this infrastructure would need a rich set of tools and technologies to meet the different needs.
Now that we’ve clarified this, we can get to the question behind the question. When someone asks me, “Why can’t we have an easy button to build and deploy analytics?”, what they’re really asking is, “How can I use technological advances to build and deploy analytics faster, better, and cheaper?”
Ahh, now that’s an actionable question!
In the information technology industry, we’ve been blessed (some would argue cursed) by the nature of computing. For decades now we’ve been able to count on continually increasing capacity and efficiency. So while processors continue to grow more powerful, they also consume less power. As the power requirements for a given unit of processing become low enough, it is suddenly possible to design computing devices which run on “ambient” energy from light, heat, motion, etc. This has opened up a very broad set of possibilities to instrument the world in ways never before seen – resulting in dramatic growth of machine-readable data. This data explosion has led to continued opportunity and innovation across the big data marketplace. Imagine if each year, a homebuilder could purchase a saw which could cut twice as much wood with a battery half the size. What would that mean for the homebuilder? How about the vendor of the saw? That’s roughly analogous to what we all face in big data.
And while we won’t find one “easy button”, it’s very likely that we can find a tool for a given analytic task which is significantly better than one that was built in the past. A database that operates well at petabyte scale, with performance characteristics that make it practical to use. A distributed filesystem whose economics make it a useful place to store virtually unlimited amounts of data until you need it. An engine capable of extracting machine-readable structured information from media. And so on. Once my colleagues and I have debunked the myth of the automagic pixie, we can have a productive conversation to identify the tools and technologies that map cleanly to the needs of an organization and can offer meaningful improvements in their analytical capability.
I hope readers have found this series useful. In my years in this space, I’ve learned that in order to move forward with effective technology selection, sometimes we have to begin by taking a step backward and undoing misconceptions. And there are plenty! So stay tuned.
The “De-mythification” Series
Part 3: The Single-Solution Elf
In this part of the de-mythification series, I’ll address another common misconception in the big data marketplace: that there exists a single piece of technology that will solve all big data problems. Whereas the first two entries in this series focused on market needs, this will focus more on the vendor side of things in terms of how big data has driven technology development, and give some practical guidance on how an organization can better align their needs with their technology purchases.
Big Data is the Tail Wagging the Vendor
Big data is in the process of flipping certain technology markets upside-down. Ten or so years ago, vendors of databases, ETL, data analysis, etc. all could focus on building tools and technologies for discrete needs, with an evolutionary eye – focused on incremental advance and improvement. That’s all changed very quickly as the world has become much more instrumented. Smartphones are a great example. Pre-smartphone, the data stream from an individual throughout the day might consist of a handful of call-detail records and a few phone status records. Maybe a few kilobytes of data at most. The smartphone changed that. Today a smartphone user may generate megabytes, or even gigabytes of data in a single day from the phone, the broadband, the OS, email, applications, etc. Multiply that across a variety of devices, instruments, applications and systems, and the result is a slice of what we commonly refer to as “Big Data”.
Most of the commentary on big data has focused on the impact to organizations. But vendors have been, in many cases, blindsided. With technology designed for orders of magnitude less data, sales teams accustomed to competing against a short list of well-established competitors, marketing messages focused on clearly identified use cases, and product pricing and packaging oriented towards a mature, slow-growth market, many have struggled to adapt and keep up.
Vendors have responded with updated product taglines (and product packaging) which often read like this:
“End-to-end package for big data storage, acquisition and analysis”
“A single platform for all your big data needs”
“Store and analyze everything”
Don’t these sound great?
But simple messages like these mask the reality that there are distinct activities that comprise big data analytics, that these activities come with different technology requirements, and that much of today’s technology was born in a very different time – so the likelihood of a single tool doing everything well is quite low. Let’s start with the analytic lifecycle, depicted in the figure below, and discuss the ways it has driven the state of the technology.
This depicts the various phases of an analytic lifecycle, from the creation and acquisition of data, through exploration and structuring, to analysis and modeling, to putting the information to work. These phases often require very different things from technology. Let’s take the example of acquiring and storing large volumes of data with varying structure. Batch performance is often important here, as is cost to scale. Somewhat less important is ease of use – load jobs tend to change at a lower rate than user queries, especially when the data is in a document-like format (e.g., JSON). By contrast, the development of a predictive model requires a highly interactive technology that combines high performance with a rich analytic toolkit. So batch use will be minimal, while ease of use is key.
Historically, many of the technologies required for big data analytics were built as stand-alone technologies: a database, a data mining tool, an ETL tool, etc. Because of this lineage, the time and effort required to re-engineer these tools to work effectively together as a single technology, with orders of magnitude more data, can be significant.
Despite how a vendor packages technology, organizations must ask themselves this question: what do we really need to solve the business problems? When it comes time to start identifying a technology portfolio to address big data challenges, I always recommend that customers start by putting things in terms of what they really need. This is surprisingly uncommon, because many organizations have grown accustomed to vendor messaging which is focused on what the vendor wants to sell as opposed to what the customer needs to buy. It may seem like a subtle distinction, but it can make all the difference between a successful project and a very expensive set of technology sitting on the shelf unused.
I recommend engaging in a thoughtful dialog with vendors to assess not only what you need today, but to explore things you might find helpful which you haven’t thought of yet. A good vendor will help you in this process. As part of this exercise, it’s important to avoid getting hung up on the notion that there’s one single piece of technology that will solve all your problems: the single solution elf.
Once my colleagues and I dispel the single solution myth, we can then have a meaningful dialog with an organization and focus on the real goal: finding the best way to solve their problems with a technology portfolio which is sustainable and agile.
I’ve been asked, more than once, “Why can’t there be a single solution? Things would be so much easier that way.” That’s a great question, which I’ll address in my next blog post as I discuss some common-sense perspectives on what technology should – and shouldn’t – do for you.
Next up: The Automagic Pixie
We’ve published two more case studies, featuring Jobrapido and Supercell. These are compelling examples of innovative companies that use the HP Vertica Analytics Platform to gain a competitive edge and derive maximum business value from Big Data. The two summaries and respective full case study PDFs provide details about each company’s goals, successes, and ultimate outcomes using HP Vertica. To see more like these, visit the HP Vertica Case Studies page.
Jobrapido scales its database to the next level
Since its founding in 2006, Jobrapido has become one of the biggest online job-search aggregators in the world, helping millions of users everywhere from Italy to the United States find the job that’s right for them. In 2012, the company was acquired by Evenbase, a part of DMG media based in the UK. HP Vertica has proved invaluable to Jobrapido’s success, performing above and beyond for its big data analytics needs. David Conforti, Director of BI at Jobrapido, describes HP Vertica as “like having a sort of magic mirror to ask to all the business questions that come to my mind” – one that has allowed him and his team to deliver to their users both valuable insights and a personalized experience based on analytics.
Supercell performs real-time analytics
In 2012, just a year after its founding, Supercell delivered two top-grossing games on iOS: “Clash of Clans” and “Hay Day.” Using the HP Vertica big data analytics platform, Supercell has been able to perform real-time gaming data analytics, allowing it to balance, adapt, and improve its gamers’ experiences on a day-to-day basis. “HP Vertica is an important tool in making sure that our games provide the best possible experience for our players,” says Janne Peltola, a data scientist at Supercell. Using HP Vertica, Supercell is able to create gaming experiences that are fun and engaging, keeping customers coming back long after they first start playing.