The Unstructured Leprechaun

Posted June 9, 2014 by Walt_Maguire


The “De-mythification” Series

Part 2: The Unstructured Leprechaun

In this, the second post in the multi-part “de-mythification” series, I’ll address another common misconception in the Big Data marketplace today: that there are only two types of data an enterprise must deal with for Big Data analytics, structured and unstructured, and that unstructured data is somehow structure-free.

Let’s start with a definition of “structured” data. When we in the Big Data space talk of structured data, what we really mean is that the data has easily identifiable things like fields, rows, and columns, which make it simple to use as input for analytics. Virtually all modern analytic routines rely on mathematical algorithms that look for groupings, trends, patterns, and the like, and these routines require that the data be structured in a way they can digest. So when we say “structured” in this context, what we really mean is “structured in such a way that our analytic routines can process it.”

On the other hand, “unstructured” data has become a catch-all term that’s used to describe everything not captured by the definition above. And this is unfortunate, because there’s very little data in the world which is truly unstructured. This over-generalization leads many organizations down costly, time-consuming paths which they don’t need to traverse.

Here’s a short list of some of the types of data or information commonly lumped under the “unstructured” label, with a sanity check as to the real story.

| Type of Data | Common Source(s) | Structure Sanity Check |
| --- | --- | --- |
| Audio | Call center recordings, webinars, etc. | Digital audio is stored in files, usually as a stream of bits. This stream is encoded and decoded as it is written and read, often with compression. This is how the audio can be replayed after recording. |
| Video | Dash-cams, security, retail traffic monitoring, social media sharing, etc. | As with audio, digital video is stored in files, with a very similar approach to storing the stream of bits: encoded and often compressed, and replayable with the right decoder. |
| E-mails | Personal and business e-mail, marketing automation, etc. | An e-mail is typically quite well structured, with one section of the message containing key data about the message (From, To, Date, Subject, etc.) and another field containing the message itself, often stored as simple text. |
| Documents (contracts, books, white papers, articles, etc.) | Electronic document systems, file sharing systems such as Google Docs and Sharepoint, etc. | The documents themselves have structure similar to e-mail, with a group of fields often describing the document, and a body of text which comprises the document itself. This is a broad category with much variation. |
| Social Media | Tweets, blog posts, online video, picture sharing, check-ins, status updates, etc. | Similar to e-mails, social media often has data which describes the message (who’s posting it, the date of the post, referenced hashtags and users, etc.) and the post itself. Images, audio and video included in social media are structured no differently than they are elsewhere. |
| Machine Logs | Mobile applications, hardware devices, web applications, etc. | I’m not sure who exactly lumped machine logs under the “unstructured” label, since these are highly structured and always have been. They are, after all, written by machines! I suspect a bunch of marketing people decided this after consuming one too many bottles of wine in Napa. |
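
To make the e-mail and machine log rows concrete, here is a minimal Python sketch of how little work it takes to pull fields out of each. The sample message, the addresses, and the log layout (timestamp, level, component, message) are all made up for illustration; real logs vary widely by product, so treat the pattern as an assumption rather than a standard.

```python
import re
from email import message_from_string

# Made-up sample message, for illustration only.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 09 Jun 2014 10:15:00 +0000
Subject: Renewal question

Hi Bob, can you confirm the renewal date for my contract?
"""

def email_to_record(raw):
    """Split a raw message into its header fields and its body text."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "date": msg["Date"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }

# Assumed (hypothetical) log layout: "<timestamp> <LEVEL> <component>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<component>[\w.-]+): (?P<message>.*)$"
)

def log_line_to_record(line):
    """Return the named fields of a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

email_record = email_to_record(RAW_EMAIL)
log_record = log_line_to_record(
    "2014-06-09T10:15:02Z ERROR checkout: payment gateway timeout"
)
print(email_record["subject"])
print(log_record["level"], log_record["component"])
```

The structure was there all along. What the code adds is a second shape, a flat record of named fields, which is the shape analytic routines expect.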

By now it should be clear that this data is not at all unstructured. Quite the opposite. It has plenty of structure to it, otherwise we could never replay that video or audio, read a status update, read e-mail, etc. The real challenge is that this data is generated for a purpose, and that purpose rarely includes analytics. Furthermore, video, audio and email have been around for decades, but it’s only in recent years that we’ve discovered the value of analyzing that information along with the rest.
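
The same goes for audio. As a small sketch, the Python below writes a half-second tone to an in-memory WAV file using the standard `wave` module and then reads the file's own description of itself back out. WAV is just one convenient container I picked for the example, and the tone parameters are arbitrary; compressed formats carry comparable metadata in their own layouts.

```python
import io
import math
import struct
import wave

def make_tone(seconds=0.5, rate=8000, freq=440.0):
    """Write a short mono 16-bit sine tone to an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)       # mono
        out.setsampwidth(2)       # 2 bytes per sample = 16-bit audio
        out.setframerate(rate)    # samples per second
        frames = bytearray()
        for n in range(int(seconds * rate)):
            sample = int(32767 * 0.5 * math.sin(2 * math.pi * freq * n / rate))
            frames += struct.pack("<h", sample)  # little-endian signed 16-bit
        out.writeframes(bytes(frames))
    return buffer.getvalue()

def describe_wav(data):
    """Read back the structural metadata stored in the WAV header."""
    with wave.open(io.BytesIO(data), "rb") as src:
        return {
            "channels": src.getnchannels(),
            "sample_width_bytes": src.getsampwidth(),
            "sample_rate_hz": src.getframerate(),
            "frames": src.getnframes(),
            "duration_s": src.getnframes() / src.getframerate(),
        }

wav_bytes = make_tone()
wav_info = describe_wav(wav_bytes)
print(wav_info)
```

Everything a player needs in order to replay the sound is declared in the file. Nothing in it says who is speaking or what about, which is exactly the gap between structure for playback and structure for analytics.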

How does this information add new value? Here are a few examples:

    • Hedge funds found, a number of years ago, that by incorporating sentiment analysis of Tweets about publicly traded securities, they could predict the daily closing prices of those securities very accurately.
    • Facial recognition in video allows for the creation of an event-driven monitoring system which lets a single soldier effectively monitor hundreds of security cameras concurrently.
    • Sentiment scoring in audio allows a business to detect an unhappy customer during a call, predict that they are likely to churn, and extend a retention offer to keep that customer.
    • Expressing the graph of relationships between players of a social game, as determined by their in-game messages, allows the game developer to dramatically improve profitability as well as player experience.
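
As a rough illustration of the last example, here is a minimal Python sketch that turns a list of in-game messages into a weighted relationship graph. The player names and the (sender, recipient) layout are hypothetical, and a real system would likely weigh message content and timing as well, not just counts.

```python
from collections import Counter, defaultdict

# Hypothetical in-game messages as (sender, recipient) pairs.
MESSAGES = [
    ("ana", "ben"), ("ben", "ana"), ("ana", "cy"),
    ("dee", "ana"), ("ben", "cy"), ("ana", "ben"),
]

def build_graph(messages):
    """Return undirected edge weights: messages exchanged per pair of players."""
    edges = Counter()
    for sender, recipient in messages:
        if sender != recipient:
            # Sort the pair so (a, b) and (b, a) count toward the same edge.
            edges[tuple(sorted((sender, recipient)))] += 1
    return edges

def contact_counts(edges):
    """Count the distinct players each player has exchanged messages with."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {player: len(others) for player, others in neighbors.items()}

edges = build_graph(MESSAGES)
contacts = contact_counts(edges)
most_connected = max(contacts, key=contacts.get)
print(dict(edges))
print(most_connected, contacts[most_connected])
```

Once the messages are expressed this way, questions like "who are the most connected players?" become ordinary queries over rows and columns.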

There are many, many such examples. This is why there’s so much attention being paid to “unstructured” data today: it offers a powerful competitive advantage for those who can incorporate it into their analytics.

The problem is that the data serves …the application which created it. When coder/decoder (codec) algorithms were being developed in the 1990s for audio and video, I doubt that anyone expected that someday we might want to understand (a) who is talking; (b) what they’re talking about; and (c) how they feel about it.

This is the core problem many of us in the Big Data industry are working to address today. How do we take data with one type of structure, such as audio, and create a second type of structure which suits it for analytics? To accomplish this, we need structure suited to our analytic routines, such as a field identifying the person speaking, a field with the timestamp, a field identifying the topic they’re talking about, and so on. Getting from a stream of audio to this requires a careful choice of technology and thoughtful design. Unfortunately, my esteemed colleagues in the Big Data marketplace have tended to oversimplify this complex situation down to a single word: “unstructured”. This has led to the unstructured leprechaun, a mythical creature that many organizations are chasing in the hope of finding an elusive pot of gold.
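
To show what that second structure might look like, here is a minimal Python sketch. It assumes the hard part is already done: some speech-to-text and speaker-identification step (not shown, and not a specific product) has produced time-stamped segments. The `Utterance` fields, the keyword lists, and the sample call are all hypothetical, and the keyword lookup is a stand-in for whatever topic model you would really use.

```python
from dataclasses import dataclass, asdict

@dataclass
class Utterance:
    """One analytics-ready row: who spoke, when, and about what."""
    call_id: str
    start_seconds: float
    speaker: str
    topic: str
    text: str

# Hypothetical keyword lists; a real deployment would use a proper topic model.
TOPIC_KEYWORDS = {
    "billing": {"invoice", "charge", "refund"},
    "cancellation": {"cancel", "close", "terminate"},
}

def tag_topic(text):
    """Return the first topic whose keywords appear in the text, else 'other'."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return "other"

def to_rows(call_id, segments):
    """Convert (start_seconds, speaker, text) segments into flat dict rows."""
    return [
        asdict(Utterance(call_id, start, speaker, tag_topic(text), text))
        for start, speaker, text in segments
    ]

# Assumed output of an upstream transcription step, invented for this example.
SEGMENTS = [
    (0.0, "agent", "Thanks for calling, how can I help?"),
    (4.2, "customer", "I was hit with a charge I do not recognize."),
    (9.8, "customer", "If this is not fixed I will cancel my account."),
]

rows = to_rows("call-001", SEGMENTS)
for row in rows:
    print(row["start_seconds"], row["speaker"], row["topic"])
```

The rows are trivial to load into a table and join with the rest of your customer data. The design effort lives upstream, in choosing the technology that gets you from the audio stream to segments like these.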

Not that simplicity of messaging is a bad thing. Lord knows I’ve been in enough conference rooms watching people’s eyes glaze over as I talk through structured versus unstructured data! But, as with the real-time unicorn, if organizations chase the unstructured leprechaun (the myth that there is a big bucket of “unstructured” data that we can somehow address with a single magic tool; for more on that, see my next post, “The Single Solution Elf”), they risk wasting their time and money approaching the challenge without truly understanding the problem.

Once my colleagues and I get everyone comfortable with this more nuanced situation, we can begin the real work: identifying the high-value use cases where we can bring in non-traditional data to enhance analytic outcomes. It’s worth mentioning that I’m careful today to refer to this data as non-traditional, and never unstructured! This avoids a lot of overgeneralizing, and makes selecting the right portfolio of technologies and designing a good architecture to address the use cases very doable.

So when organizations state that they need to deal with their “unstructured” data, I recommend a thorough assessment of the types of data involved and why they matter, along with the identification of discrete use cases where this data can add value. We can then use this information as a guideline in developing a plan of action that’s much more likely to yield a tangible ROI.

Next up: The Single Solution Elf