This week I sat down with Ben Vandiver, a Vertica veteran who’s been with the company since 2008, and talked about everything from influencing presidential elections to making an impact to sword-fighting with interns.
The “De-mythification” Series
Part 4: The Automagic Pixie
Au∙to∙mag∙ic: (Of a usually complicated technical or computer process) done, operating, or happening in a way that is hidden from or not understood by the user, and in that sense, apparently “magical”
In previous installments of this series, I de-bunked some of the more common myths around big data analytics. In this final installment, I’ll address one of the most pervasive and costly myths: that there exists an easy button that organizations can press to automagically solve their big data problems. I’ll provide some insights as to how this myth has come about, and recommend strategies for dealing with the real challenges inherent in big data analytics.
Like the single-solution elf, this easy button idea is born of the desire of many vendors to simplify their message. The big data marketplace is new enough that all the distinct types of needs haven’t yet become entirely clear – which makes it tough to formulate a targeted message. Remember in the late 1990’s when various web vendors were all selling “e-commerce” or “narrowcasting” or “recontextualization”? Today most people are clear on the utility of the first two, while the third is recognized for what it was at the time – unhelpful marketing fluff. I worked with a few of these firms, and watched as the businesses tried to position product for a need which hadn’t yet been very well defined by the marketplace. The typical response by the business was to keep it simple – just push the easy button and our technology will do it for you.
I was at my second startup in 2001 (an e-commerce provider using what we would refer to today as a SaaS model) when I encountered the unfortunate aftermath of this approach. I sat down at my desk on the first day of the job, and was promptly approached by the VP of Engineering, who informed me that our largest customer was about to cancel its contract – the team had been trying to upgrade the customer for weeks, during which time the customer’s e-commerce system was down. Although the team had told the customer that the upgrade was a push-button process, it wasn’t. In fact, at the time I started there, the team was starting to believe that an upgrade would be impossible and that they should propose re-implementing the customer from scratch. By any standard, this would be a fail.
Over the next 72 hours, I migrated the customer’s data and got them up and running. It was a Pyrrhic victory at best – the customer cancelled anyhow, and the startup went out of business a few months later.
The moral of the story? No, it’s not to keep serious data geeks on staff to do automagical migrations. The lesson here is that when it comes to data-driven applications – including analytics – the “too good to be true” easy button almost always is. Today, the big data marketplace is full of great-sounding messages such as “up and running in minutes”, or “data scientist in a box”.
“Push a button and deploy a big data infrastructure in minutes to grind through that ten petabytes of data sitting on your SAN!”
“Automatically derive predictive models that used to take the data science team weeks in mere seconds! (…and then fire the expensive data scientists)!”
Don’t these sound great?
The truth is, as usual, more nuanced. One key point I like to make with organizations is that big data analytics, like most technology practices, involves different tasks. And those tasks generally require different tools. To illustrate this for business stakeholders, I usually resort to the metaphor of building a house. We don’t build a house with just a hammer, or just a screwdriver. In fact, it requires a variety of tools – each of which is suited to a different task. A brad nailer for drywall. A circular saw for cutting. A framing hammer for framing. And so on. And in the world of engineering, a house is a relatively simple thing to construct. A big data infrastructure is considerably more complex. So it’s reasonable to assume that an organization building this infrastructure would need a rich set of tools and technologies to meet the different needs.
Now that we’ve clarified this, we can get to the question behind the question. When someone asks me “Why can’t we have an easy button to build and deploy analytics?”, what they’re really asking is “How can I use technological advances to build and deploy analytics faster, better and cheaper?”
Ahh, now that’s an actionable question!
In the information technology industry, we’ve been blessed (some would argue cursed) by the nature of computing. For decades now we’ve been able to count on continually increasing capacity and efficiency. So while processors continue to grow more powerful, they also consume less power. As the power requirements for a given unit of processing become low enough, it is suddenly possible to design computing devices which run on “ambient” energy from light, heat, motion, etc. This has opened up a very broad set of possibilities to instrument the world in ways never before seen – resulting in dramatic growth of machine-readable data. This data explosion has led to continued opportunity and innovation across the big data marketplace. Imagine if each year, a homebuilder could purchase a saw which could cut twice as much wood with a battery half the size. What would that mean for the homebuilder? How about the vendor of the saw? That’s roughly analogous to what we all face in big data.
And while we won’t find one “easy button”, it’s very likely that we can find a tool for a given analytic task which is significantly better than one that was built in the past. A database that operates well at petabyte scale, with performance characteristics that make it practical to use. A distributed filesystem whose economics make it a useful place to store virtually unlimited amounts of data until you need it. An engine capable of extracting machine-readable structured information from media. And so on. Once my colleagues and I have debunked the myth of the automagic pixie, we can have a productive conversation to identify the tools and technologies that map cleanly to the needs of an organization and can offer meaningful improvements in their analytical capability.
I hope readers have found this series useful. In my years in this space, I’ve learned that in order to move forward with effective technology selection, sometimes we have to begin by taking a step backward and undoing misconceptions. And there are plenty! So stay tuned.
The “De-mythification” Series
Part 3: The Single-Solution Elf
In this part of the de-mythification series, I’ll address another common misconception in the big data marketplace: that there exists a single piece of technology that will solve all big data problems. Whereas the first two entries in this series focused on market needs, this will focus more on the vendor side of things in terms of how big data has driven technology development, and give some practical guidance on how an organization can better align their needs with their technology purchases.
Big Data is the Tail Wagging the Vendor
Big data is in the process of flipping certain technology markets upside-down. Ten or so years ago, vendors of databases, ETL, data analysis, etc. all could focus on building tools and technologies for discrete needs, with an evolutionary eye – focused on incremental advance and improvement. That’s all changed very quickly as the world has become much more instrumented. Smartphones are a great example. Pre-smartphone, the data stream from an individual throughout the day might consist of a handful of call-detail records and a few phone status records. Maybe a few kilobytes of data at most. The smartphone changed that. Today a smartphone user may generate megabytes, or even gigabytes of data in a single day from the phone, the broadband, the OS, email, applications, etc. Multiply that across a variety of devices, instruments, applications and systems, and the result is a slice of what we commonly refer to as “Big Data”.
Most of the commentary on big data has focused on the impact to organizations. But vendors have been, in many cases, blindsided. With technology designed for orders of magnitude less data, sales teams accustomed to competing against a short list of well-established competitors, marketing messages focused on clearly identified use cases, and product pricing and packaging oriented towards a mature, slow-growth market, many have struggled to adapt and keep up.
Vendors have responded with updated product taglines (and product packaging) which often read like this:
“End-to-end package for big data storage, acquisition and analysis”
“A single platform for all your big data needs”
“Store and analyze everything”
Don’t these sound great?
But simple messages like these mask the reality that there are distinct activities which comprise big data analytics, that these activities come with different technology requirements, and that much of today’s technology was born in a very different time – so the likelihood of there being a single tool that does everything well is quite low. Let’s start with the analytic lifecycle, depicted in the figure below, and discuss the ways this has driven the state of the technology.
The figure depicts the various phases of an analytic lifecycle, from the creation and acquisition of data through exploration and structuring to analysis and modeling, and finally to putting the information to work. These phases often require very different things from technology. Let’s take the example of acquiring and storing large volumes of data with varying structure. Batch performance is often important here, as is cost to scale. Somewhat less important is ease of use – load jobs tend to change at a lower rate than user queries, especially when the data is in a document-like format (e.g. JSON). By contrast, the development of a predictive model requires a highly interactive technology which combines high performance with a rich analytic toolkit. So batch use will be minimal, while ease of use is key.
Historically, many of the technologies required for big data analytics were built as stand-alone technologies: a database, a data mining tool, an ETL tool, etc. Because of this lineage, the time and effort required to re-engineer these tools to work effectively together as a single technology, with orders of magnitude more data, can be significant.
Regardless of how a vendor packages technology, organizations must ask themselves this question: what do you really need to solve the business problems? When it comes time to start identifying a technology portfolio to address big data challenges, I always recommend that customers start by putting things in terms of what they really need. This is surprisingly uncommon, because many organizations have grown accustomed to vendor messaging which is focused on what the vendor wants to sell as opposed to what the customer needs to buy. It may seem like a subtle distinction, but it can make all the difference between a successful project and a very expensive set of technology sitting on the shelf unused.
I recommend engaging in a thoughtful dialog with vendors to assess not only what you need today, but to explore things you might find helpful which you haven’t thought of yet. A good vendor will help you in this process. As part of this exercise, it’s important to avoid getting hung up on the notion that there’s one single piece of technology that will solve all your problems: the single solution elf.
Once my colleagues and I dispel the single solution myth, we can then have a meaningful dialog with an organization and focus on the real goal: finding the best way to solve their problems with a technology portfolio which is sustainable and agile.
I’ve been asked, more than once “Why can’t there be a single solution? Things would be so much easier that way.” That’s a great question, which I’ll address in my next blog post as I discuss some common sense perspectives on what technology should – and shouldn’t – do for you.
Next up: The Automagic Pixie
If you ask Conservation International this question, they may just say yes. After all, Conservation International has teamed up with HP Earth Insights to provide organizations around the world – from environmentalists to policy makers – with a real-time look at what is happening within our planet’s most valuable natural resource: the rain forest.
But how does their work relate to you as a start-up organization or a Fortune 500 company?
First, they have surprisingly similar analytical needs to many other start-ups and corporations, collecting data regularly from 16 sites around the globe, performing more than 4 million climate measurements as of this February, and managing more than 3 TB of biodiversity information. As the name implies, this information is incredibly, well… diverse, including everything from photos to hand-recorded measurements to weather station and camera trap imagery. While your company may not be recording or analyzing the metadata of candid photos of elephants and chimpanzees, chances are that many of you out there are working with more than one type of data.
Collecting and Analyzing Multiple Data Types
All of these different data types have to be funneled into a database, analyzed, and then acted on. Running queries based on millions of climate readings begins to look a lot like running queries on the kind of diverse customer base that many other companies deal with every day. Many agricultural companies collect sensor data from across their farm lands to get a forecast of how the climate has affected their crops for the upcoming year. These days, utility companies are launching Advanced Metering Infrastructures (AMI) to deal with the staggering amounts of sensor data collected from the energy usage of millions of homes. HP Vertica coincidentally works as an effective Meter Data Management (MDM) system (read more here).
Visualizing the Data and Reaching More People
Working with HP, Conservation International has built its own analytics system and dashboard from the ground up for visualizing data from all 16 rainforest sites around the globe. CI DBAs discover trends based on over 140 million simulations, and analyze the metadata from over 1.7 million photos. Not only is their custom interface intuitive, it also enables them to generate PDFs instantly and share to social media directly from the dashboard. For CI, this means more people now see more of their impact in more places to proactively address environmental threats. For you, it might mean anything from less time spent prepping your data to present to management to simply fewer emails to send.
The Power of Prediction for the Greater Good
Like many companies, CI uses standard methodology in processing their data, and uses R for their analysis, as is very common in scientific studies. Using R, CI can proactively assess where the future trouble spots will be, and what parts of their monitored ecosystems are most threatened. Many other HP Vertica customers use R in surprisingly similar ways, such as seeing which neighborhoods a future power outage might affect most, or how severely next year’s dry season will affect a farmer’s crops.
See Conservation International at the HP Vertica Big Data Conference
These are just a few examples of how an incredibly unique organization uses HP Vertica to analyze unique data, yet does it in ways that many other groups might find surprisingly familiar. Sometimes after a closer look, we can see that many organizations have a lot more in common with their data needs than they may think, and HP Vertica is the right tool for the job.
Be sure to attend our upcoming Big Data Conference in Boston, MA, where Conservation International is leading the hackathon!
Yesterday, a few fellow members of the HP Vertica team and I attended Boston TechJam 2014 at City Hall Plaza in Boston. Featuring a digital art display by local artist Cindy Bishop entitled “The Way You Move”, our booth was thronged with people wanting to know more about what we do as the leading big data analytics platform. The rest of my team and I want to send out a huge thank you to everyone who stopped by our booth to talk with us. I personally had an amazing time interacting with the rest of the tech community here in beautiful Boston, getting a chance to talk to everyone from up-and-coming innovators to grizzled tech veterans (some of whom may be joining our ranks in the future!).
Below are some pictures I snapped of the festivities (when there was a rare break in between people coming up to the booth). I’m already looking forward to next year!
The “De-mythification” Series
Part 2: The Unstructured Leprechaun
In this, the second of the multi-part “de-mythification” series, I’ll address another common misconception in the Big Data marketplace today – that there are only two types of data an enterprise must deal with for Big Data analytics – structured and unstructured, and that unstructured data is somehow structure-free.
Let’s start with a definition of “structured” data. When we in the Big Data space talk of structured data, what we really mean is that the data has easily identifiable things like fields, rows, columns, etc. which makes it simple for us to use this as input for analytics. Virtually all modern analytic routines leverage mathematical algorithms which look for things like groupings, trends, patterns, etc., and these routines require that the data be structured in such a way that they can digest it. So when we say “structured” in this context, what we really mean is “structured in such a way that our analytic routines can process it.”
On the other hand, “unstructured” data has become a catch-all term that’s used to describe everything not captured by the definition above. And this is unfortunate, because there’s very little data in the world which is truly unstructured. This over-generalization leads many organizations down costly, time-consuming paths which they don’t need to traverse.
The truth is that there is very little electronic data in our world today which is unstructured. Here’s a short list of some of the types of data or information commonly lumped under the “unstructured” label, with a sanity check as to the real story.
|Type of Data|Common Source(s)|Structure Sanity Check|
|---|---|---|
|Audio|Call center recordings, webinars, etc.|Digital audio is stored in files, usually as a stream of bits. This stream is encoded and decoded as written & read, often with compression. This is how the audio can be replayed after recording.|
|Video|Dash-cams, security, retail traffic monitoring, social media sharing, etc.|As with audio, digital video is stored in files, with a very similar approach to storing the stream of bits – encoded and often compressed, and replayable with the right decoder.|
|E-mails|Personal and business e-mail, marketing automation, etc.|An e-mail is typically quite well structured, with one section of the message containing key data about the message – From, To, Date, Subject, etc. – and another field containing the message itself, often stored as simple text.|
|Documents (contracts, books, white papers, articles, etc.)|Electronic document systems, file sharing systems such as Google Docs and Sharepoint, etc.|The documents themselves have structure similar to e-mail, with a group of fields often describing the document, and a body of text which comprises the document itself. This is a broad category with much variation.|
|Social Media|Tweets, blog posts, online video, picture sharing, check-ins, status updates, etc.|Similar to e-mails, social media often has data which describes the message – who’s posting it, the date of the post, referenced hashtags and users, etc. – and the post itself. Images, audio and video included in social media are structured no differently than they are elsewhere.|
|Machine Logs|Mobile applications, hardware devices, web applications, etc.|I’m not sure who exactly lumped machine logs under the “unstructured” label since these are highly structured and always have been. They are, after all, written by machines! I suspect a bunch of marketing people decided this after consuming one too many bottles of wine in Napa.|
By now it should be clear that this data is not at all unstructured. Quite the opposite. It has plenty of structure to it, otherwise we could never replay that video or audio, read a status update, read e-mail, etc. The real challenge is that this data is generated for a purpose, and that purpose rarely includes analytics. Furthermore, video, audio and email have been around for decades, but it’s only in recent years that we’ve discovered the value of analyzing that information along with the rest.
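To make the e-mail row of the table concrete, here is a minimal sketch in Python of pulling the header fields and the body out of a raw message using the standard library's `email` package. The sample message and the `email_to_row` helper are invented for illustration; a real pipeline would also need to handle attachments, multipart bodies and character encodings.

```python
from email import message_from_string

# Hypothetical raw message, invented for this example.
RAW_EMAIL = """\
From: alice@example.com
To: support@example.com
Date: Mon, 02 Jun 2014 09:15:00 -0400
Subject: Order 1234 arrived damaged

The box was crushed and I would like a replacement.
"""


def email_to_row(raw):
    """Flatten one simple, single-part e-mail into a dict of fields."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "date": msg["Date"],
        "subject": msg["Subject"],
        # For a single-part text message the payload is the body text.
        "body": msg.get_payload().strip(),
    }


row = email_to_row(RAW_EMAIL)
print(row)
```

The header fields come out ready to use as columns. The body is still free text, which is where the second layer of structure (sentiment, topic, and so on) would have to be created.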
How does this information add new value? Here are a few examples:
- Hedge funds found, a number of years ago, that by incorporating sentiment analysis of Tweets on publicly traded securities, they could predict the daily closing prices of those securities very accurately.
- Facial recognition in video allows for the creation of an event driven monitoring system which allows a single soldier to effectively monitor hundreds of security cameras concurrently.
- Sentiment scoring in audio allows a business to detect an unhappy customer during a call, predict that they are likely to churn, and extend a retention offer to keep that customer.
- Expressing the graph of relationships between players of a social game, as determined by their in-game messages, allows the game developer to dramatically improve profitability as well as player experience.
There are many, many such examples. This is why there’s so much attention being paid to “unstructured” data today – it offers a powerful competitive advantage for those who can incorporate it into their analytics.
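As a rough illustration of the social game example, the sketch below turns a list of in-game messages into a weighted graph of player relationships. Everything here is an assumption made for the example: the player names, the `(sender, recipient)` message format, and the choice to treat a conversation as an undirected edge weighted by message count.

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs extracted from in-game messages.
MESSAGES = [
    ("ana", "bo"),
    ("bo", "ana"),
    ("ana", "cy"),
    ("cy", "ana"),
    ("ana", "bo"),
    ("dee", "cy"),
]


def build_edge_weights(messages):
    """Count messages per pair of players, ignoring direction."""
    edges = Counter()
    for sender, recipient in messages:
        edge = tuple(sorted((sender, recipient)))
        edges[edge] += 1
    return edges


def degree(edges):
    """Number of distinct players each player has exchanged messages with."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {player: len(peers) for player, peers in neighbors.items()}


edges = build_edge_weights(MESSAGES)
degrees = degree(edges)
print(edges)
print(degrees)
```

Once the messages are expressed this way, ordinary analytic routines can work on them: edge weights and degrees are just fields in a table.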
The problem is that the data serves…the application which created it. When coder/decoder algorithms were being developed in the 1990’s for audio and video, I doubt that anyone expected that someday we might want to understand (a) who is talking; (b) what they’re talking about; and (c) how they feel about it.
This is the core problem many of us in the Big Data industry are working to address today. How do we take data with one type of structure such as audio, and create a second type of structure which suits it for analytics? To accomplish this, we need structure suited to our analytic routines such as a field identifying the person speaking, a field with the timestamp, a field identifying the topic they’re talking about, and so on. Getting from a stream of audio to this requires careful choice of technology, and thoughtful design. Unfortunately, my esteemed colleagues in the Big Data marketplace have tended to oversimplify this complex situation down to a single word: “unstructured”. This has led to the unstructured leprechaun – a mythical creature who many organizations are chasing hoping to find an elusive pot of gold.
Not that simplicity of messaging is a bad thing. Lord knows I’ve been in enough conference rooms watching people’s eyes glaze over as I talk through structured versus unstructured data! But, as with the real-time unicorn, if organizations chase the unstructured leprechaun – the myth that there is this big bucket of “unstructured” data that we can somehow address with a single magic tool (for more on that, see my next post: “The Single Solution Elf”), they risk wasting their time and money approaching the challenge without truly understanding the problem.
Once my colleagues and I get everyone comfortable with this more nuanced situation, we can begin the real work – identifying the high value use-cases where we can bring in non-traditional data to enhance analytic outcomes. It’s worth mentioning that I’m careful today to refer to this data as non-traditional, and never unstructured! This avoids a lot of overgeneralizing, and makes selecting the right portfolio of technologies and designing a good architecture to address the use-cases very do-able.
So when organizations state that they need to deal with their “unstructured” data, I recommend a thorough assessment of the types of data involved and why they matter and the identification of discrete use cases where this data can add value. We can then use this information as a guideline in developing the plan of action that’s much more likely to yield a tangible ROI.
Next up: The Single Solution Elf
The “De-mythification” Series
Part 1: The Real-Time Unicorn
This is part one of a series I call the “de-mythification” series, wherein I’ll aim to clear up some of the more widespread myths in the big data marketplace.
In the first of this multi-part series, I’ll address one of the most common myths my colleagues and I have to confront in the Big Data marketplace today: the notion of “real-time” data visibility. Whether it’s real-time analytics or real-time data, the same misconception always seems to come up. So I figured I’d address this, define what “real-time” really means, and provide readers some advice on how to approach this topic in a productive way.
First of all, let’s establish the theoretical definition of “real-time” data visibility. In the purest interpretation, it means that as some data is generated – say, a row of log data in an Apache web server – the data would immediately be queryable. What does that imply? Well, we’d have to parse the row into something readable by a query engine – so some program would have to ingest the row, parse the row, characterize it in terms of metadata, and understand enough about the data in that row to determine a decent machine-level plan for querying it. Now since all our systems are limited by that pesky “speed of light” thing, we can’t move data any faster than that – considerably slower in fact. So even if we only need to move the data through the internal wires of the same computer where the data is generated, it would take measurable time to get the row ready for query. And let’s not forget the time required for the CPU to actually perform the operations on the data. It may be nanoseconds, milliseconds, or longer, but in any event it’s a non-zero amount of time.
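The ingest-and-parse step described above can be sketched in a few lines of Python. This assumes the web server writes the Common Log Format; the sample line, the `LOG_PATTERN` regular expression and the `parse_log_line` helper are all invented for illustration, and a production parser would need to cope with other log formats and malformed rows.

```python
import re
import time

# Pattern for the Common Log Format (an assumption about the server config).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical log row, invented for this example.
SAMPLE_LINE = (
    '203.0.113.7 - - [02/Jun/2014:09:15:00 -0400] '
    '"GET /index.html HTTP/1.1" 200 5120'
)


def parse_log_line(line):
    """Turn one raw log row into named fields, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # The server logs "-" when no bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


start = time.perf_counter()
parsed = parse_log_line(SAMPLE_LINE)
elapsed = time.perf_counter() - start

print(parsed)
print(f"parse took {elapsed * 1e6:.1f} microseconds")
```

The exact timing will vary from machine to machine, but even this trivial parse of a single row takes measurable time, before any metadata handling or query planning has happened.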
So “real-time” never, ever means real-time, despite marketing myths to the contrary.
There are two exceptions to this – slowing down time inside the machine, or technology which queries a stream of data as it flows by (typically called complex event processing, or CEP). With regard to the first option: let’s say we wanted to make data queryable as soon as the row is generated. We could make the flow from the logger to the query engine part of one synchronous process. So the weblog row wouldn’t actually be written until it were also processed and ready for query. Those of you who administer web and application infrastructures are probably getting gray hair just reading this as you can imagine the performance impact to a web application. So, in the real world, this is a non-starter. The other option – CEP – is exotic and typically very expensive, and while it will tell you what’s happening at the current moment, it’s not designed to build analytics models. It’s largely used to put those models to work in a real-time application such as currency arbitrage.
So, given all this, what’s a good working definition of “real-time” in the world of big data analytics?
Most organizations define it this way: “As fast as it can be done providing a correct answer and not torpedoing the rest of the infrastructure or the technology budget”.
Once everyone gets comfortable with that definition, then we can discuss the real goal: reducing the time to useful visibility of the data to an optimal minimum. This might mean a few seconds, it might mean a few minutes, or it might mean hours or longer. In fact, for years now I’ve found that once we get the IT department comfortable with the practical definition of real-time, it invariably turns out that the CEO/CMO/CFO/etc. really meant exactly that when they said they needed real-time visibility to the data. So, in other words, when the CEO said “real-time”, she meant “within fifteen minutes” or something along those lines.
This then becomes a realistic goal we can work towards in terms of engineering product, field deployment, customer production work, etc. Ironically, chasing the real-time unicorn can actually impede efforts to develop high speed data flows by forcing the team to chase unrealistic targets for which, at the end of the day, there is no quantifiable business value.
So when organizations say they need “real-time” visibility to the data, I recommend not walking away from that conversation until fully understanding just what that phrase means, and using that as the guiding principle in technology selection and design.
I hope readers found this helpful! In the remaining segments of this series, I’ll address other areas of confusion in the Big Data marketplace. So stay tuned!
Next up: The Unstructured Leprechaun