
Archive for the ‘Flex Zone’ Category

The Unstructured Leprechaun

The “De-mythification” Series

Part 2: The Unstructured Leprechaun

Unstructured Data

In this, the second of the multi-part “de-mythification” series, I’ll address another common misconception in the Big Data marketplace today – that there are only two types of data an enterprise must deal with for Big Data analytics, structured and unstructured, and that unstructured data is somehow structure-free.

Let’s start with a definition of “structured” data. When we in the Big Data space talk of structured data, what we really mean is that the data has easily identifiable things like fields, rows, columns, etc., which makes it simple to use as input for analytics. Virtually all modern analytic routines leverage mathematical algorithms which look for things like groupings, trends, and patterns, and these routines require that the data be structured in such a way that they can digest it. So when we say “structured” in this context, what we really mean is “structured in such a way that our analytic routines can process it.”

On the other hand, “unstructured” data has become a catch-all term that’s used to describe everything not captured by the definition above. And this is unfortunate, because there’s very little data in the world which is truly unstructured. This over-generalization leads many organizations down costly, time-consuming paths which they don’t need to traverse.

The truth is that there is very little electronic data in our world today which is unstructured. Here’s a short list of some of the types of data or information commonly lumped under the “unstructured” label, with a sanity check as to the real story.

  • Audio (call center recordings, webinars, etc.): Digital audio is stored in files, usually as a stream of bits. This stream is encoded as it is written and decoded as it is read, often with compression. This is how the audio can be replayed after recording.
  • Video (dash-cams, security cameras, retail traffic monitoring, social media sharing, etc.): As with audio, digital video is stored in files, with a very similar approach to storing the stream of bits – encoded, often compressed, and replayable with the right decoder.
  • E-mails (personal and business e-mail, marketing automation, etc.): An e-mail is typically quite well structured, with one section of the message containing key data about the message – From, To, Date, Subject, etc. – and another field containing the message itself, often stored as simple text.
  • Documents such as contracts, books, white papers, and articles (electronic document systems, file sharing systems such as Google Docs and SharePoint, etc.): The documents themselves have structure similar to e-mail, with a group of fields describing the document and a body of text which comprises the document itself. This is a broad category with much variation.
  • Social media (tweets, blog posts, online video, picture sharing, check-ins, status updates, etc.): Similar to e-mail, social media often has data which describes the message – who’s posting it, the date of the post, referenced hashtags and users, etc. – and the post itself. Images, audio, and video included in social media are structured no differently than they are elsewhere.
  • Machine logs (mobile applications, hardware devices, web applications, etc.): I’m not sure who exactly lumped machine logs under the “unstructured” label, since these are highly structured and always have been. They are, after all, written by machines! I suspect a bunch of marketing people decided this after consuming one too many bottles of wine in Napa.

By now it should be clear that this data is not at all unstructured. Quite the opposite. It has plenty of structure to it, otherwise we could never replay that video or audio, read a status update, read e-mail, etc. The real challenge is that this data is generated for a purpose, and that purpose rarely includes analytics. Furthermore, video, audio and email have been around for decades, but it’s only in recent years that we’ve discovered the value of analyzing that information along with the rest.

How does this information add new value? Here are a few examples:

  • Hedge funds found, a number of years ago, that by incorporating sentiment analysis of tweets about publicly traded securities, they could predict the daily closing prices of those securities very accurately.
  • Facial recognition in video allows for the creation of an event-driven monitoring system which allows a single soldier to effectively monitor hundreds of security cameras concurrently.
  • Sentiment scoring in audio allows a business to detect an unhappy customer during a call, predict that they are likely to churn, and extend a retention offer to keep that customer.
  • Expressing the graph of relationships between players of a social game, as determined by their in-game messages, allows the game developer to dramatically improve profitability as well as player experience.

There are many, many such examples. This is why there’s so much attention being paid to “unstructured” data today – it offers a powerful competitive advantage for those who can incorporate it into their analytics.

The problem is that the data serves…the application which created it. When coder/decoder algorithms were being developed in the 1990s for audio and video, I doubt that anyone expected that someday we might want to understand (a) who is talking; (b) what they’re talking about; and (c) how they feel about it.

This is the core problem many of us in the Big Data industry are working to address today. How do we take data with one type of structure, such as audio, and create a second type of structure which suits it for analytics? To accomplish this, we need structure suited to our analytic routines, such as a field identifying the person speaking, a field with the timestamp, a field identifying the topic they’re talking about, and so on. Getting from a stream of audio to this requires careful choice of technology and thoughtful design. Unfortunately, my esteemed colleagues in the Big Data marketplace have tended to oversimplify this complex situation down to a single word: “unstructured”. This has led to the unstructured leprechaun – a mythical creature that many organizations are chasing in hopes of finding an elusive pot of gold.

Not that simplicity of messaging is a bad thing. Lord knows I’ve been in enough conference rooms watching people’s eyes glaze over as I talk through structured versus unstructured data! But, as with the real-time unicorn, organizations that chase the unstructured leprechaun – the myth that there is one big bucket of “unstructured” data that we can somehow address with a single magic tool (for more on that, see my next post, “The Single Solution Elf”) – risk wasting their time and money by approaching the challenge without truly understanding the problem.

Once my colleagues and I get everyone comfortable with this more nuanced situation, we can begin the real work – identifying the high-value use cases where we can bring in non-traditional data to enhance analytic outcomes. It’s worth mentioning that I’m careful today to refer to this data as non-traditional, and never unstructured! This avoids a lot of overgeneralizing, and makes selecting the right portfolio of technologies and designing a good architecture to address the use cases very doable.

So when organizations state that they need to deal with their “unstructured” data, I recommend a thorough assessment of the types of data involved, why they matter, and the discrete use cases where this data can add value. We can then use this information as a guideline in developing a plan of action that’s much more likely to yield a tangible ROI.

Next up: The Single Solution Elf

Enter the Flex Zone – Modernizing the Enterprise Data Warehouse

I’ve had the privilege of attending The Data Warehousing Institute’s (TDWI) conference this week. The Las Vegas show is usually one of their biggest gatherings. This year, there were about 600 of us gathered together to talk about the latest and greatest in the data warehouse and business intelligence world. HP Vertica was a sponsor.
The latest buzz was around the new data discovery tools announced by several vendors. Vendors recognize that there is a significant amount of undiscovered data in most businesses. As data warehouse teams go merrily along delivering daily analytics, piles and piles of dark data that might have value build up within the business. Users are recognizing that some of this unexplored data could be quite valuable, and that recognition is spurring the development of a new breed of data discovery tools that enable users to develop new views of structured, semi-structured, and unstructured data.

Of course, this is the very reason that we have developed HP Vertica Flex Zone. The ability to ingest semi-structured data and use current visualization tools is one of the key tenets of HP Vertica Flex Zone. With HP Vertica Flex Zone, you can leverage your existing business intelligence (BI) and visualization tools to visually explore and draw conclusions from data patterns across a full spectrum of structured and semi-structured data. Analysts, data scientists, and business users can now explore and visualize information without burdening your IT organization, and without waiting on the lengthy and costly ETL tools and processes typical of legacy databases and data warehouses.
Most agreed that special data discovery tools should converge with standard analytical platforms in the coming months. Discovery should be as much a part of your business as daily analytics.

There were some first-rate executive sessions led by Fern Halper and Philip Russom, who talked about the transformation of analytics over the years. Analytics has become more mainstream and better understood by the masses of business users, so innovation comes from delivering business intelligence to this new generation of information consumers.

The panel discussions and sessions focused very much on business value and put forth a call to action: innovate. Feed business users’ need for information that will help drive revenue, improve efficiency, and achieve regulatory compliance. It was clear that the data warehouse must be modernized (and that is happening today). Data warehouse pros aren’t satisfied with the daily static analytics they delivered in the past. They are looking for new data sources, including big data, and new-age data analytic platforms to help achieve their business goals.

Get started modernizing your enterprise data warehouse – evaluate HP Vertica 7 today.

Enter the Flex Zone – Flex Zone Demo

In December 2013, we introduced HP Vertica Flex Zone with the HP Vertica 7 “Crane” release. HP Vertica Flex Zone gives you the power to quickly and easily load, explore, analyze, and monetize emerging and rapidly growing forms of structured and semi-structured data, such as social media, sensor, log files, and machine data. You can use your favorite industry-standard business intelligence (BI) and visualization tools to explore Flex Zone data in HP Vertica without creating schemas upfront.

We created the HP Vertica Flex Zone demo video based on a common real-life scenario of an HP Vertica Flex Zone customer. It showcases how you can tackle the typical challenges of dealing with semi-structured and structured data coming from disparate sources, work with the data in an uncomplicated SQL environment, and, most importantly, get value out of all of your data.

Check out the HP Vertica Flex Zone demo video here:

Enter the Flex Zone – Under the Hood

With HP Vertica’s latest release (HP Vertica 7 “Crane”), we introduced HP Vertica Flex Zone, based on the patent-pending flex tables technology, which dynamically adapts to whatever schema is present in the data. Flex tables offer dramatic usability improvements over regular tables. In this post, we take a look under the hood and show how flex tables are similar to regular Vertica tables, with a little pinch of magic thrown in.

The Tour Starts with Storage
Let’s take a peek at a flexible table:

    => create flex table tweets();
    CREATE TABLE
    => \d tweets

     Schema | Table  |    Column    |          Type          |  Size  | Default | Not Null | Primary Key | Foreign Key
    --------+--------+--------------+------------------------+--------+---------+----------+-------------+-------------
     public | tweets | __identity__ | int                    |      8 |         | t        | f           |
     public | tweets | __raw__      | long varbinary(130000) | 130000 |         | t        | f           |

    (2 rows)

At the heart of the implementation is the simple idea that we will store an entire record in __raw__, a single wide varbinary column. Any record provided, no matter the schema, can be stored – assuming that the column is wide enough to hold the whole record. From the example above, you can see that the default max size is around 128K. We’ve found that this size is sufficient for most use cases, but you can stretch it up to 32 MB if necessary. For performance reasons, Vertica includes the __identity__ column as the column on which to sort and segment the data.
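If your records run larger than that, the __raw__ column can be widened like any other column. Here is a sketch of how that might look, assuming the standard ALTER COLUMN syntax applies to __raw__ (check the documentation for the exact limits and any related configuration parameters):

    -- Hypothetical resize up to the 32 MB ceiling mentioned above.
    => ALTER TABLE tweets ALTER COLUMN __raw__ SET DATA TYPE long varbinary(32000000);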

So what does the __raw__ column actually contain? Let’s take a peek:

    => copy tweets from 'tweets.json' parser fjsonparser();
     Rows Loaded
    -------------
             542
    (1 row)

    => select __raw__ from tweets limit 1;
    __raw__
    ------------------------------------------------------------------------------------------------
    \001\000\000\000H\000\000\000\004\000\000\000\024\000\000\000&\000\000\0008\000\000\000@\000\000\0002327342449993973762327342449993973763829120138291201\004\000\000\000\024\000\000\000$\000\000\0008\000\000\000M\000\000\000delete.status.iddelete.status.id_strdelete.status.user_iddelete.status.user_id_str
    (1 row)

First, notice that this isn’t the JSON value that we supplied as input. When you load a flex table, the parser reads each input record, parses it into a collection of key/value pairs, and writes it into an internal map format. Assuming you query the data at least once, it is worth parsing upfront to avoid re-parsing with each query. Vertica also does a little in-record indexing to make finding field values efficient. Additionally, the storage is now agnostic to the input format, freeing you from having to specify the input format upfront or locking the table to a specific format. Finally, Vertica’s storage engine has more flexibility with compression algorithms for storing the data efficiently on disk.
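Because the storage is format agnostic, the same flex table can also absorb data arriving in other formats. For example, a delimited file could be loaded with the fdelimitedparser that ships alongside fjsonparser; the file name and delimiter below are hypothetical:

    -- Assumption for illustration: the same table also receives pipe-delimited records.
    => copy tweets from 'tweets.psv' parser fdelimitedparser(delimiter='|');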

It’s a Kind of Magic
So why can I run “select text from tweets” despite the “text” column not existing in the table? Looking at the query plan shows us what we would expect, namely we are reading the __raw__ column:

    => EXPLAIN SELECT text FROM tweets;

    Access Path:
    +-STORAGE ACCESS for tweets [Cost: 21K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
    | Projection: public.tweets_super
    | Materialize: tweets.__raw__
    | Execute on: All Nodes

If you dig into the “GraphViz” (see http://www.graphviz.org/) query plan, you will discover that the actual query that is running is:

     SELECT MapLookup(__raw__,'text') FROM tweets;

Flex tables have a special parse rule that rewrites unknown column references into calls to the MapLookup function. The MapLookup function takes a key and returns the value, or null if the key does not exist in the map:

    
   => SELECT maplookup(__raw__,'id') FROM tweets limit 1;
     maplookup
    --------------------
     257000038482460672
    (1 row)

   => SELECT maplookup(__raw__,'nonexistentkey') FROM tweets limit 1;
     maplookup
    -----------

    (1 row)

This little piece of magic is what lets you write vanilla SQL queries against data with unknown or varying schema.

Maps to Treasure

Now that you know about the __raw__ column, we can play with a number of the map functions that come with flex tables. Here are a couple of examples:

Getting JSON back:

    => SELECT maptostring(__raw__) FROM tweets limit 1;
     maptostring
    ----------------------------------------------------------
     {
        "delete.status.id" : "230242346078457856",
        "delete.status.id_str" : "230242346078457856",
        "delete.status.user_id" : "524683721",
        "delete.status.user_id_str" : "524683721"
     }

Shredding a record into rows of key value pairs:

    => SELECT mapitems(__raw__) OVER (partition auto) FROM (SELECT __raw__ FROM tweets limit 1) a;

                keys            |       values
    ----------------------------+--------------------
     delete.status.id           | 232734244999397376
     delete.status.id_str       | 232734244999397376
     delete.status.user_id      | 38291201
     delete.status.user_id_str  | 38291201
    (4 rows)

Finding the keys of all records, similar to computing the contents of the keys table:

    
    => SELECT distinct key FROM (select mapkeys(__raw__) OVER (partition auto) as
    key from tweets ) a ORDER BY key;
     key
    -------------------------
    contributors
    coordinates
    coordinates.coordinates
    coordinates.type
    created_at
    delete.status.id
    delete.status.id_str
    delete.status.user_id
    …

Checking how close you are to overflowing the record size:

    => SELECT max(length(__raw__)) FROM tweets;
    max
    ------
    8675
    (1 row)

Isn’t that a row store?

By storing the whole record in a single column, Vertica must read the entire record even if the query needs only a subset of the keys – the definition of a row-store database. Flex tables still perform decently well thanks to Vertica’s solid executor and distributed execution, but Vertica’s columnar tables are orders of magnitude faster. As described in a previous post, you accelerate flex table performance by promoting important columns to be real columns in the table. These hybrid tables have the best of both worlds: a flexible schema thanks to the __raw__ column, and columnar access to the most important data. You can see this for yourself by creating a hybrid flex table (an example of promoting columns on an existing flex table follows the output below):

    bvandiver=> create flex table logs (time timestamp, sessionid varchar, component varchar);
    CREATE TABLE
    bvandiver=> \d logs

     Schema | Table |  Column   |          Type          |  Size  | Default | Not Null | Primary Key | Foreign Key
    --------+-------+-----------+------------------------+--------+---------+----------+-------------+-------------
     public | logs  | __raw__   | long varbinary(130000) | 130000 |         | t        | f           |
     public | logs  | "time"    | timestamp              |      8 |         | f        | f           |
     public | logs  | sessionid | varchar(80)            |     80 |         | f        | f           |
     public | logs  | component | varchar(80)            |     80 |         | f        | f           |

    (4 rows)
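You can also promote columns on an existing flex table after the fact, using the same ALTER TABLE … ADD COLUMN … DEFAULT pattern covered in the dynamic-schemas post below. The column and type chosen here are just an illustration:

    -- Hypothetical promotion of a frequently queried key to a real column.
    => ALTER TABLE tweets ADD COLUMN "user.lang" varchar(20) DEFAULT "user.lang"::varchar;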

Conclusion

Hopefully this whirlwind tour through the back end of flex tables has given you a better understanding of how it all works and the performance tradeoffs we made. As usual, we encourage you to download our Community Edition and give it a try!

Welcome to the Flex Zone – Dynamic Schemas

Enter the Flex Zone

Much of the data analytics we perform occurs with data whose schema changes over time. Many organizations have developers who define the record types that are collected and analysts who comb through these records looking for important patterns. In today’s fast-changing business landscape, agility is the key to success: developers need to continually define new metrics for their analysts’ consumption. Typically, relational databases have been a significant impediment: schema changes are costly and the database data ingest logic is brittle. Instead of record changes happening at the speed the developer needs, they involve careful coordination with database administrators against a backdrop of tweaking production systems. The flexible tables feature in Vertica 7.0 “Crane” seeks to address this challenge by allowing the table schema to vary from record to record. No database administrator involvement is necessary for day-to-day schema changes; the DBA only needs to get involved when performance tuning is necessary.

A Web-driven database

For our example, we will consider a simple web service that serves a form and stores the results. While simplistic, this isn’t far off from serious web services, mobile games, smart meters, etc. There are two source files for this example: a web server (flex-server.py) and an HTML file (form.html); store them in the same directory.

Getting Started

First, we need to create a table to store the form data. With Vertica Flex tables, we don’t even need to know what the structure of the data is upfront:

CREATE FLEX TABLE webreqs();

The web server, implemented in Python, is quite simple: it serves a form and stores the responses as JSON records.

import json
import datetime
import urlparse
import BaseHTTPServer

class FlexServer(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse.urlparse(self.path)
        # form submission?
        if url.path == '/data':
            # v (a SendToVertica instance, defined in flex-server.py) streams each JSON record into Vertica
            v.send(json.dumps({"time": str(datetime.datetime.now()),
                               "src": self.client_address[0],
                               #"header": dict(self.headers.items()), # uncomment to get header info
                               "data": urlparse.parse_qs(url.query)}) + '\n')
            self.send_response(302)
            self.send_header('Location', urlparse.urljoin(self.path, '/'))
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header('Content-type', 'text/html')
            self.end_headers()
            with open("form.html", "r") as src:
                self.wfile.write(src.read())
        return

v = SendToVertica()
from BaseHTTPServer import HTTPServer
server = HTTPServer(('localhost', 8080), FlexServer)
server.serve_forever()

For brevity, we’ve left off the SendToVertica class, which streams the data into Vertica – it’s also quite simple and the full text is in flex-server.py.

The HTML file contains the form – it’s broken out into a separate file to make editing it easier. The initial form merely asks for the submitter’s name.

<html><head><title>Enter Info</title></head><body>
<form action="data">
Name:<input type="text" name="name">
<br><input type="submit" value="Submit">
</form></body></html>

Start the server (python flex-server.py) and point your browser at localhost:8080 – you should see something like:

[Screenshot: the form served at localhost:8080, with a Name field and a Submit button]

Submit the form a couple of times. Then look at the server output – it might look something like this:

{"src": "127.0.0.1", "data": {"name": ["Ben"]}, "time": "2013-12-24 01:40:35.475723"}
{"src": "127.0.0.1", "data": {"name": ["Amy"]}, "time": "2013-12-24 01:40:39.293608"}

Vertica makes it easy to load this data into Flex tables – no table schema definition necessary and there’s already a parser for JSON data. The flex-server is running the following statement in order to load the data:

copy webreqs from stdin parser fjsonparser(flatten_arrays=true);

The flatten_arrays parameter may be new to you – we need it here because HTML form fields can be multi-valued, so there’s an extra level of array that we need to flatten out of the input.
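If you are curious how the flattened keys are named, you can peek at one stored record with maptostring (the same helper shown in the “Under the Hood” post above). This is a sketch of the check rather than captured output:

-- Peek at one stored record: with flatten_arrays=true, the single-element
-- "name" array shows up under a key like data.name.0 rather than data.name.
select maptostring(__raw__) from webreqs limit 1;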

Now we can write a query against the data as if the JSON features were columns in the table:

select "data.name.0" from webreqs;
 data.name.0
-------------
 Amy
 Ben
(2 rows)

select compute_flextable_keys_and_build_view('webreqs');
compute_flextable_keys_and_build_view
----------------------------------------------------
Please see public.webreqs_keys for updated keys
The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time
-------------+-----------+----------------------------
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608
(2 rows)

It’s Flex Time

But now for the main event: so far, loading the data was easy – but what if the developer decides the form should also have an additional vote field? What needs to change? Well, let’s change the HTML file to include the new form field:

<html><head><title>Enter Info</title></head><body>
<form action="data">
Name:<input type="text" name="name">
<br>Vote:<select name="vote">
<option>yes</option><option>no</option></select>
<br><input type="submit" value="Submit">
</form></body></html>

Submit the form a couple more times and look at the output – now there’s an additional key in the JSON!

{"src": "127.0.0.1", "data": {"vote": ["yes"], "name": ["Jen"]}, "time": "2013-12-24 01:40:57.785534"}
{"src": "127.0.0.1", "data": {"vote": ["no"], "name": ["Ian"]}, "time": "2013-12-24 01:41:01.867649"}

What database changes are necessary? None! The new fields are immediately available to query:

select sum(case when "data.vote.0" = 'yes' then 1 else 0 end) as yeses, count("data.vote.0") as total from webreqs;
 yeses | total
-------+-------
     1 |     2
(1 row)

But what about the entries submitted before the “vote” field existed?

select "data.name.0","data.vote.0" from webreqs;
 data.name.0 | data.vote.0
-------------+-------------
 Jen         | yes
 Ben         |
 Amy         |
 Ian         | no
(4 rows)

For fields that don’t exist in a particular record, you get SQL null. SQL operators do the “right thing” when they encounter nulls, such as count and sum above.

If you believe the schema has changed, you can update your keys and views to reflect the new reality by re-running the schematization tools:

select compute_flextable_keys_and_build_view('webreqs');
compute_flextable_keys_and_build_view
----------------------------------------------------
Please see public.webreqs_keys for updated keys
The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time            | data.vote.0
-------------+-----------+----------------------------+-------------
 Jen         | 127.0.0.1 | 2013-12-24 01:40:57.785534 | yes
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723 |
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608 |
 Ian         | 127.0.0.1 | 2013-12-24 01:41:01.867649 | no
(4 rows)

Note that recomputing the view is only necessary if you want to access the new fields through the view – you can query the new fields directly at any time without recomputing anything.

Try it yourself – edit the HTML form to add new fields and see how easy it is to query the resulting data in Vertica! You can also uncomment the header line in the server to get access to the header information (think cookies, sessionization, etc.).
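If you do uncomment that line and reload, the header fields simply become additional keys in each record. One way to list them, assuming the webreqs_keys table has a key_name column like the tweets_keys table shown in the Twitter post below (re-run the schematization function first so the new keys are registered):

-- Hypothetical check: list header-derived keys after reloading with headers enabled.
-- Exact key names depend on the browser's request headers.
select key_name from webreqs_keys where key_name::varchar ilike 'header%';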

Tuning for Performance

While flexible tables are great for exploration, we recommend that you do some tuning before moving into production. Luckily, tuning is easy and still preserves the flexible nature of the tables. The central mechanism is “promotion” of one or more columns to real table columns. To promote, select commonly used columns, particularly those used in joins or predicates, and add them to your flex table. You can do this by using ALTER TABLE … ADD COLUMN … DEFAULT. For example, the time and vote columns might be of particular interest:

ALTER TABLE webreqs ADD COLUMN "time" TIMESTAMP NOT NULL DEFAULT time::timestamp;
ALTER TABLE webreqs ADD COLUMN "data.vote.0" VARCHAR(10) DEFAULT "data.vote.0"::varchar;

As part of adding the column, Vertica populates the column with any same-name data. Future loads will also load data into the newly added columns. Note that we have selected specific data types for the new columns. New data that doesn’t match the declared data type will result in nulls getting stored (or errors if the column is declared NOT NULL).
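One way to sanity-check for such mismatches is to compare the typed column against the raw map using maplookup (described in the “Under the Hood” post above). This is an illustrative sketch, assuming the flattened key name matches the promoted column name:

-- Rows where the raw record carried a vote value but the typed column is null
-- would indicate values that didn't fit the declared VARCHAR(10) type.
select count(*) from webreqs
 where "data.vote.0" is null
   and maplookup(__raw__, 'data.vote.0') is not null;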

Don’t want to run ADD COLUMN yourself? No problem, because there’s a helper function to do this step for you:

select materialize_flextable_columns('webreqs');

The following columns were added to the table public.webreqs:
data.name.0
src
time
data.vote.0
For more details, run the following query:
SELECT * FROM v_catalog.materialize_flextable_columns_results WHERE table_schema = 'public' and table_name = 'webreqs';

There’s nothing special about these newly added columns – you could have done the same thing upfront:

CREATE FLEX TABLE webreqs (
"time" TIMESTAMP NOT NULL,
"data.vote.0" VARCHAR(10)
);

So if you know a little about your data and how it will be used, you can jump directly to the optimized approach.

Vertica’s Database Designer exists to optimize the physical design (storage layout) and it works great on promoted columns in flexible tables. A typical usage model is explore, promote, and optimize. A key thing to remember is that changes to the schema do not require promotion and optimization. If you add fields that dramatically change your query workload, then some additional performance optimization may be necessary.
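Putting that model together with the helpers shown earlier, the loop might look like the following sketch (the Database Designer step itself runs through your usual DBD workflow rather than a single SQL call):

-- explore: discover the keys present in the loaded data and build a view
select compute_flextable_keys_and_build_view('webreqs');
-- promote: materialize the commonly used keys as real columns
select materialize_flextable_columns('webreqs');
-- optimize: run Database Designer against the now-materialized columns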

Flexible tables from Vertica should dramatically simplify the process of adapting your schema to the needs of your changing business.

Enter the Flex Zone – An Exploration in Processing Twitter Data

With our HP Vertica 7 release, we announced HP Vertica Flex Zone, a new product offering that simplifies the way that you consume and then explore semi-structured data, such as Web logs, sensor data, and other emerging data types. In this blog post, our first “Flex Zone Friday” post, let’s look at how you can use HP Vertica Flex Zone to get a leg up on your latest data analysis problem, using Twitter data as the sample data type.

Tweet child of mine

For this example, let’s look at Twitter data. Tweets are an interesting source because the data volume is large, the tweet format is typically JSON, and many companies are mining tweet data for insights into how the community perceives their brand, product, or service. A tweet might look like the following:

{"id":257000038495043584,
 "text":"Let's play uno",
 "created_at":"Sat Oct 13 06:09:18 2012",
 "user":{"name":"scout",
         "lang":"en",
         "followers_count":521,...}
...}

It has the tweet text itself, with a lot of interesting metadata. Note that much of the metadata is stored in sub-structures like the “user” field.

You can acquire Twitter data quite easily by making requests to the Twitter API. For the purposes of following along here, you can download a small sample of tweets: tweets.json.gz

Look Ma: No Columns!

Typically, loading this type of data into a relational database would have required transforming the raw data into something more easily consumable – or a pre-built custom Twitter data loader. With HP Vertica Flex Zone, you can skip the schema definition phase and just load JSON directly into the database, saving you hours of time:

CREATE FLEX TABLE tweets ();
COPY tweets FROM '/path/to/tweets.json.gz' GZIP PARSER fjsonparser();

As easy as “SELECT … FROM T”
Now we can just write SQL queries against the data, as if it were naturally loaded into columns. For example, to discover what the average tweet length by language is:

select "user.lang",avg(length(text))::int from tweets group by "user.lang" order by 2 desc;

 user.lang | avg
-----------+-----
       de | 119
       ar | 108
       es | 80
       en | 67
       it | 66
       tr | 66
       nl | 64
       id | 61
       fr | 60
       ru | 57
       ja | 44
       pt | 43
       ko | 40
       th | 22

From the SQL point of view, it doesn’t matter that the source data was JSON. The only concession is the column names themselves: we do need to double-quote the column names if they contain the ‘.’ character.

Let’s make a Schema

But how can you tell what columns are available? The HP Vertica Analytics Platform has a couple of mechanisms for discovering and making explicit the columns available.

select compute_flextable_keys_and_build_view('tweets');
compute_flextable_keys_and_build_view
----------------------------------------------------
Please see public.tweets_keys for updated keys
The view public.tweets_view is ready for querying
(1 row)

The tweets_keys table now contains a row for each available column in the data. For example, to find all the user-related count columns:

select key_name from tweets_keys where key_name::varchar ilike 'user%count';
       key_name
-----------------------
user.statuses_count
user.listed_count
user.friends_count
user.followers_count
user.favourites_count
(5 rows)

Armed with these new columns, we can compute some statistics about the followers in our dataset:

select min("user.followers_count"::int),avg("user.followers_count")::int,max("user.followers_count"::int) from tweets;

 min | avg  |  max
-----+------+--------
   0 | 1968 | 253435
(1 row)

Clearly someone is tweeting to the void. And we probably have some observer bias in our small sample: there’s probably a correlation between tweeting frequency and follower count.
In addition to building the keys table, the function also whipped up a view that selects all available columns, which can be handy to see what is in the table. Tweets are quite large:

select * from tweets_view limit 1;
contributors | created_at | entities.hashtags | entities.urls | entities.user_mentions | favorited | id | id_str | in_reply_to_screen_name | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | retweet_count | retweeted | source | text | truncated | user.contributors_enabled | user.created_at | user.default_profile | user.default_profile_image | user.description | user.favourites_count | user.follow_request_sent | user.followers_count | user.following | user.friends_count | user.geo_enabled | user.id | user.id_str | user.is_translator | user.lang | user.listed_count | user.location | user.name | user.notifications | user.profile_background_color | user.profile_background_image_url | user.profile_background_image_url_https | user.profile_background_tile | user.profile_image_url | user.profile_image_url_https | user.profile_link_color | user.profile_sidebar_border_color | user.profile_sidebar_fill_color | user.profile_text_color | user.profile_use_background_image | user.protected | user.screen_name | user.statuses_count | user.time_zone | user.url | user.utc_offset | user.verified | coordinates | geo | place | user.profile_banner_url | delete.status.id | delete.status.id_str | delete.status.user_id | delete.status.user_id_str | retweeted_status.contributors | retweeted_status.created_at | retweeted_status.entities.hashtags | retweeted_status.entities.urls

A key differentiator of HP Vertica Flex Zone is that you can visualize this semi-structured data during data exploration with any of your preferred BI and visualization tools. These tools work well against the generated views, because the views expose metadata about the “flexible table” in a manner the tools already understand. For example, we used Tableau to generate a report against the tweets_view.

And now for something completely different

It turns out that the tweet stream includes two types of records: tweets and deletes. Deletes are revocations of tweets and they have the following structure:

{
 "delete":{
  "status":{
   "user_id_str":"228092142",
   "id_str":"218710262428073984",
   "id":218710262428073984,
   "user_id":228092142
  }
 }
}

Our previous load actually loaded both record types into the same table. How did the previous queries work? Well, if you reference a column that doesn’t exist in a record, you get null. We can use this to find out how many deletes we loaded:

select count(*) from tweets where "delete.status.id" is not null;
 count
-------
   118
(1 row)

Lots yet to come

We haven’t talked about a number of key things like column promotion (performance!), dynamic schema, or substructure. Here’s a teaser about substructure — namely the JSON input is quite complex and the HP Vertica Analytics Platform loaded it all. You can descend into the nested structures and get the most out of the data. The following is a query that computes the top hashtags in the last month:

select values,count(*) from
    (select mapitems("entities.hashtags") over ()
     from tweets
     where created_at::timestamp > now() - interval '1 month') a
where keys::varchar like '%text'
group by values
order by count desc
limit 10;

             values              | count
---------------------------------+-------
 10ThingsThatGetsOnMyNerves      |     2
 Massa                           |     1
 ความในใจของติ่ง                  |     1
 Up                              |     1
 RTしてくれた人全員フォローする     |     1
 Turn                            |     1
 50ThingsAboutMyBoyfriend        |     1
 askenchancer                    |     1
 always                          |     1
 kaburrr                         |     1

Note: If you got recent data from Twitter, this will work without modification – if you used our data set, use interval '3 years' (or more).

We encourage you to give HP Vertica Flex Zone a test drive and actively participate in our community site! Stay tuned for our next Flex Zone Friday blog post.
