The “De-mythification” Series
Part 1: The Real-Time Unicorn
This is part one of what I call the “de-mythification” series, in which I’ll aim to clear up some of the more widespread myths in the Big Data marketplace.
In this first installment, I’ll address one of the most common myths my colleagues and I have to confront in the Big Data marketplace today: the notion of “real-time” data visibility. Whether the topic is real-time analytics or real-time data, the same misconception always seems to come up. So I figured I’d address it, define what “real-time” really means, and offer readers some advice on how to approach the topic in a productive way.
First of all, let’s establish the theoretical definition of “real-time” data visibility. In the purest interpretation, it means that the moment some data is generated (say, a row of log data in an Apache web server), that data would immediately be queryable. What does that imply? Well, we’d have to parse the row into something readable by a query engine, so some program would have to ingest the row, parse it, characterize it in terms of metadata, and understand enough about the data in that row to determine a decent machine-level plan for querying it. Now, since all our systems are limited by that pesky “speed of light” thing, we can’t move data any faster than that, and in practice we move it considerably slower. So even if we only need to move the data through the internal wires of the same computer where it is generated, it would take a measurable amount of time to get the row ready to query. And let’s not forget the time required for the CPU to actually perform the operations on the data. It may be nanoseconds, milliseconds, or longer, but in any event it’s a non-zero amount of time.
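To make the “non-zero amount of time” point concrete, here is a minimal sketch that parses a single web server log row and times the work. It assumes the row is in Apache’s Common Log Format; the names `CLF_PATTERN`, `parse_log_row`, and `timed_parse` are made up for illustration, and a real ingestion pipeline would do far more (metadata, indexing, query planning) than this.

```python
import re
import time

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

# Sample row in Common Log Format, used here purely as test input.
SAMPLE_ROW = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)


def parse_log_row(row: str) -> dict:
    """Turn one raw log row into typed fields a query engine could use."""
    match = CLF_PATTERN.match(row)
    if match is None:
        raise ValueError(f"unrecognized log row: {row!r}")
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no body was returned to the client.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


def timed_parse(row: str):
    """Parse a row and report how many nanoseconds the parse took."""
    start = time.perf_counter_ns()
    fields = parse_log_row(row)
    elapsed_ns = time.perf_counter_ns() - start
    return fields, elapsed_ns


if __name__ == "__main__":
    fields, elapsed_ns = timed_parse(SAMPLE_ROW)
    print(fields["request"], fields["status"], f"parsed in {elapsed_ns} ns")
```

The exact figure will vary from machine to machine and run to run, but even this trivial parsing step, with no network hop and no disk write, consumes measurable time.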
So “real-time” never, ever means truly instantaneous, despite marketing myths to the contrary.
There are two exceptions to this: slowing down time inside the machine, or using technology that queries a stream of data as it flows by (typically called complex event processing, or CEP). With regard to the first option, let’s say we wanted to make data queryable as soon as the row is generated. We could make the flow from the logger to the query engine part of one synchronous process, so the weblog row wouldn’t actually be written until it was also processed and ready for query. Those of you who administer web and application infrastructures are probably getting gray hair just reading this, as you can imagine the performance impact on a web application. So, in the real world, this is a non-starter. The other option, CEP, is exotic and typically very expensive, and while it will tell you what’s happening at the current moment, it’s not designed to build analytics models. It’s largely used to put those models to work in a real-time application such as currency arbitrage.
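The trade-off in the synchronous option can be sketched in a few lines. This is a toy model, not how any particular web server or query engine works: `QueryIndex`, `handle_request_sync`, and `AsyncIngester` are hypothetical names, and the `time.sleep` call stands in for an assumed per-row cost of parsing and indexing.

```python
import queue
import threading
import time

PROCESSING_COST_S = 0.005  # assumed cost to parse and index one row


class QueryIndex:
    """Stand-in for a query engine: a row is queryable once add() returns."""

    def __init__(self):
        self._rows = []
        self._lock = threading.Lock()

    def add(self, row):
        time.sleep(PROCESSING_COST_S)  # simulate parse + index work
        with self._lock:
            self._rows.append(row)

    def count(self):
        with self._lock:
            return len(self._rows)


def handle_request_sync(index, row):
    """Synchronous path: the request is held until the row is queryable."""
    start = time.perf_counter()
    index.add(row)
    return time.perf_counter() - start  # seconds added to the request


class AsyncIngester:
    """Asynchronous path: the request only enqueues the row; a background
    worker makes it queryable a little later."""

    def __init__(self, index):
        self._index = index
        self._queue = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        while True:
            row = self._queue.get()
            self._index.add(row)
            self._queue.task_done()

    def handle_request(self, row):
        start = time.perf_counter()
        self._queue.put(row)
        return time.perf_counter() - start  # seconds added to the request

    def wait_until_visible(self):
        """Block until every enqueued row has been indexed."""
        self._queue.join()


if __name__ == "__main__":
    sync_delay = handle_request_sync(QueryIndex(), "row-1")
    ingester = AsyncIngester(QueryIndex())
    async_delay = ingester.handle_request("row-1")
    ingester.wait_until_visible()
    print(f"sync added {sync_delay:.4f}s, async added {async_delay:.6f}s")
```

In the synchronous path every request pays the full processing cost before it can return; in the asynchronous path the request returns almost immediately, and the price is a short window during which the row exists but cannot yet be queried.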
So, given all this, what’s a good working definition of “real-time” in the world of big data analytics?
Most organizations define it this way: “As fast as it can be done while still providing a correct answer and without torpedoing the rest of the infrastructure or the technology budget.”
Once everyone gets comfortable with that definition, we can discuss the real goal: reducing the time it takes for data to become usefully visible to a practical minimum. This might mean a few seconds, it might mean a few minutes, or it might mean hours or longer. In fact, for years now I’ve found that once we get the IT department comfortable with the practical definition of real-time, it invariably turns out that the CEO/CMO/CFO/etc. really meant exactly that when they said they needed real-time visibility into the data. In other words, when the CEO said “real-time”, she meant “within fifteen minutes” or something along those lines.
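Once a number like “within fifteen minutes” is agreed on, it can be written down and checked like any other requirement. The sketch below is one hypothetical way to express that; `FRESHNESS_TARGET`, `visibility_lag`, and `meets_target` are illustrative names, and the fifteen-minute value is simply the example from the paragraph above, not a recommendation.

```python
from datetime import datetime, timedelta, timezone

# Agreed business definition of "real-time" (example value from the text).
FRESHNESS_TARGET = timedelta(minutes=15)


def visibility_lag(generated_at: datetime, queryable_at: datetime) -> timedelta:
    """Time between a row being generated and it becoming queryable."""
    return queryable_at - generated_at


def meets_target(generated_at: datetime, queryable_at: datetime,
                 target: timedelta = FRESHNESS_TARGET) -> bool:
    """True if the row became queryable within the agreed target."""
    return visibility_lag(generated_at, queryable_at) <= target


if __name__ == "__main__":
    generated = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    queryable = generated + timedelta(minutes=4)
    print(visibility_lag(generated, queryable), meets_target(generated, queryable))
```

Framing the requirement this way gives the team something measurable to engineer toward, rather than an open-ended demand for “faster”.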
This then becomes a realistic goal we can work towards in terms of product engineering, field deployment, customer production work, etc. Ironically, chasing the real-time unicorn can actually impede efforts to develop high-speed data flows by forcing the team to pursue unrealistic targets for which, at the end of the day, there is no quantifiable business value.
So when organizations say they need “real-time” visibility into the data, I recommend not walking away from that conversation until you fully understand just what that phrase means to them, and then using that understanding as the guiding principle in technology selection and design.
I hope readers found this helpful! In the remaining segments of this series, I’ll address other areas of confusion in the Big Data marketplace. So stay tuned!
Next up: The Unstructured Leprechaun