With our HP Vertica 7 release, we announced HP Vertica Flex Zone, a new product offering that simplifies the way that you consume and then explore semi-structured data, such as Web logs, sensor data, and other emerging data types. In this blog post, our first “Flex Zone Friday” post, let’s look at how you can use HP Vertica Flex Zone to get a leg up on your latest data analysis problem, using Twitter data as the sample data type.
Tweet child of mine
For this example, let’s look at Twitter data. Tweets are an interesting source because the data volume is large, the tweet format is typically JSON, and many companies are mining tweet data for insights about their business in how the community perceives their brand, product, or service. A tweet might look like the following:
It has the tweet text itself, with a lot of interesting metadata. Note that much of the metadata is stored in sub-structures like the “user” field.
You can acquire twitter data quite easily by making requests to the Twitter API. For the purposes of following along here, you can download a small sample of tweets: tweets.json.gz
Look Ma: No Columns!
Typically, loading this type of data into a relational database would have required transforming the raw data into something more easily consumable – or a pre-built custom Twitter data loader. With HP Vertica Flex Zone, you can skip the schema definition phase and just load JSON directly into the database, saving you hours of time:
As easy as “SELECT … FROM T”
Now we can just write SQL queries against the data, as if it were naturally loaded into columns. For example, to discover what the average tweet length by language is:
From the SQL point of view, it doesn’t matter that the source data was JSON. The only concession is the column names themselves: we do need to “” the column names if they contain the ‘.’ character.
Let’s make a Schema
But how can you tell what columns are available? The HP Vertica Analytics Platform has a couple of mechanisms for discovering and making explicit the columns available.
The tweets_keys table now contains a row for each available column in the data. For example, to find all the user-related count columns:
Armed with these new columns, we can compute some statistics about the followers in our dataset:
Clearly someone is tweeting to the void. And we probably have some observer bias in our small sample: there’s probably a correlation between tweeting frequency and follower count.
In addition to building the keys table, the function also whipped up a view that selects all available columns, which can be handy to see what is in the table. Tweets are quite large:
A key differentiator of HP Vertica Flex Zone is that you can visualize this semi-structured data during data exploration with any of your preferred BI and visualization tools. These tools are suitable for the views, as they expose the metadata about the “flexible table” in a manner that the tools already understand. For example, we used Tableau to generate a report against the tweets_view:
And now for something completely different
It turns out that the tweet stream includes two types of records: tweets and deletes. Deletes are revocations of tweets and they have the following structure:
Our previous load actually loaded both record types into the same table. How did the previous queries work? Well, if you reference a column that doesn’t exist in a record, you get null. We can use this to find out how many deletes we loaded:
Lots yet to come
We haven’t talked about a number of key things like column promotion (performance!), dynamic schema, or substructure. Here’s a teaser about substructure — namely the JSON input is quite complex and the HP Vertica Analytics Platform loaded it all. You can descend into the nested structures and get the most out of the data. The following is a query that computes the top hashtags in the last month:
Note: If you got recent data from twitter, this will work without modification – if you used our data set, use interval ‘3 years’ (or more).