
Enter the Flex Zone – Under the Hood

With HP Vertica’s latest release (HP Vertica 7 “Crane”), we introduced HP Vertica Flex Zone, based on the patent-pending flex tables technology, which dynamically adapts to whatever schema is present in the data. Flex tables offer dramatic usability improvements over regular tables. In this post, we take a look under the hood and show how flex tables are similar to regular Vertica tables, with a little pinch of magic thrown in.

The Tour Starts with Storage
Let’s take a peek at a flexible table:

    => create flex table tweets();
    CREATE TABLE
    => \d tweets

 Schema | Table  |    Column    |          Type          |  Size  | Default | Not Null | Primary Key | Foreign Key
--------+--------+--------------+------------------------+--------+---------+----------+-------------+-------------
 public | tweets | __identity__ | int                    |      8 |         | t        | f           |
 public | tweets | __raw__      | long varbinary(130000) | 130000 |         | t        | f           |

(2 rows)

At the heart of the implementation is the simple idea that we will store an entire record in __raw__, a single wide varbinary column. Any record provided, no matter the schema, can be stored – assuming that the column is wide enough to hold the whole record. From the example above, you can see that the default max size is around 128K. We’ve found that this size is sufficient for most use cases, but you can stretch it up to 32 MB if necessary. For performance reasons, Vertica includes the __identity__ column as the column on which to sort and segment the data.

So what does the __raw__ column actually contain? Let’s take a peek:

    => copy tweets from 'tweets.json' parser fjsonparser();
     Rows Loaded
    -------------
     542
    (1 row)

    => select __raw__ from tweets limit 1;
     __raw__
    --------------------------------------------------------------------------------
    \001\000\000\000H\000\000\000\004\000\000\000\024\000\000\000&\000\000\0008\000\000\000@\000\000\0002327342449993973762327342449993973763829120138291201\004\000\000\000\024\000\000\000$\000\000\0008\000\000\000M\000\000\000delete.status.iddelete.status.id_strdelete.status.user_iddelete.status.user_id_str
    (1 row)

First, notice that this isn’t the JSON value that we supplied as input. When you load a flex table, the parser reads each input record, parses it into a collection of key/value pairs, and writes it into an internal map format. Assuming you query the data at least once, it is worth parsing upfront to avoid re-parsing with each query. Vertica also does a little in-record indexing to make finding field values efficient. Additionally, the storage is now agnostic to the input format, freeing you from having to specify the input format upfront or locking the table to a specific format. Finally, Vertica’s storage engine has more flexibility with compression algorithms for storing the data efficiently on disk.
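The exact on-disk map layout is internal to Vertica, but the core idea – keys stored sorted alongside an offset index, so a field lookup is a binary search plus an offset computation rather than a re-parse – can be sketched in Python (our simplification, not the real format):

```python
import bisect

class ToyVMap:
    """Toy analogue of the flex-table map: sorted keys with parallel values,
    so looking up a field is a binary search, not a re-parse of the record."""
    def __init__(self, record):
        self.keys = sorted(record)
        self.values = [record[k] for k in self.keys]

    def maplookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None  # a missing key yields SQL NULL

m = ToyVMap({"delete.status.id": "232734244999397376",
             "delete.status.user_id": "38291201"})
print(m.maplookup("delete.status.user_id"))  # 38291201
```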

It’s a Kind of Magic
So why can I run "select text from tweets" when the "text" column doesn't exist in the table? Looking at the query plan shows us what we would expect, namely that we are reading the __raw__ column:

    => EXPLAIN SELECT text FROM tweets;

    Access Path:
    +-STORAGE ACCESS for tweets [Cost: 21K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
    | Projection: public.tweets_super
    | Materialize: tweets.__raw__
    | Execute on: All Nodes

If you dig into the "GraphViz" (see http://www.graphviz.org/) query plan, you will discover that the actual query that is running is:

     SELECT MapLookup(__raw__,'text') FROM tweets;

Flex tables have a special parse rule that rewrites unknown column references into calls to the MapLookup function. The MapLookup function takes a key and returns the value, or null if the key does not exist in the map:

    
    => SELECT maplookup(__raw__,'id') FROM tweets limit 1;
     maplookup
    --------------------
     257000038482460672
    (1 row)

    => SELECT maplookup(__raw__,'nonexistentkey') FROM tweets limit 1;
     maplookup
    -----------

    (1 row)

This little piece of magic is what lets you write vanilla SQL queries against data with unknown or varying schema.
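In spirit, the parse rule amounts to a one-line rewrite of any column reference that isn't a real column (a sketch with hypothetical names, not Vertica's parser code):

```python
# Real columns of the flex table from the example above.
KNOWN_COLUMNS = {"__identity__", "__raw__"}

def rewrite_column(col):
    """Rewrite an unknown column reference into a MapLookup call."""
    if col in KNOWN_COLUMNS:
        return col
    return "MapLookup(__raw__,'%s')" % col

print(rewrite_column("text"))     # MapLookup(__raw__,'text')
print(rewrite_column("__raw__"))  # __raw__
```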

Maps to Treasure

Now that you know about the __raw__ column, we can play with a number of the map functions that come with flex tables. Here are a couple of examples:

Getting JSON back:

    => SELECT maptostring(__raw__) FROM tweets limit 1;
     maptostring
    ------------------------------------------------------
    {
       "delete.status.id" : "230242346078457856",
       "delete.status.id_str" : "230242346078457856",
       "delete.status.user_id" : "524683721",
       "delete.status.user_id_str" : "524683721"
    }

Shredding a record into rows of key value pairs:

    => SELECT mapitems(__raw__) OVER (partition auto) FROM (SELECT __raw__ FROM tweets limit 1) a;

            keys           |       values
---------------------------+--------------------
 delete.status.id          | 232734244999397376
 delete.status.id_str      | 232734244999397376
 delete.status.user_id     | 38291201
 delete.status.user_id_str | 38291201

(4 rows)

Finding the keys of all records, similar to computing the contents of the keys table:

    
    => SELECT distinct key FROM
       (SELECT mapkeys(__raw__) OVER (partition auto) AS key FROM tweets) a
       ORDER BY key;
     key
    -------------------------
    contributors
    coordinates
    coordinates.coordinates
    coordinates.type
    created_at
    delete.status.id
    delete.status.id_str
    delete.status.user_id
    …

Checking how close you are to overflowing the record size:

    => SELECT max(length(__raw__)) FROM tweets;
     max
    ------
     8675
    (1 row)

Isn’t that a row store?

By storing the whole record in a single column, Vertica must read the entire record even if the query needs only a subset of the keys – the definition of a row-store database. Vertica's flex tables perform reasonably well thanks to Vertica's solid executor and distributed execution, but its columnar tables are orders of magnitude faster. As described in a previous post, you accelerate flex table performance by promoting important columns to real columns in the table. These hybrid tables have the best of both worlds: a flexible schema thanks to the __raw__ column, and columnar access to the most important data. You can see this for yourself by creating a hybrid flex table:

    bvandiver=> create flex table logs (time timestamp, sessionid varchar, component varchar);
    CREATE TABLE
    bvandiver=> \d logs

 Schema | Table |   Column  |          Type          |  Size  | Default | Not Null | Primary Key | Foreign Key
--------+-------+-----------+------------------------+--------+---------+----------+-------------+-------------
 public | logs  | __raw__   | long varbinary(130000) | 130000 |         | t        | f           |
 public | logs  | "time"    | timestamp              |      8 |         | f        | f           |
 public | logs  | sessionid | varchar(80)            |     80 |         | f        | f           |
 public | logs  | component | varchar(80)            |     80 |         | f        | f           |

(4 rows)
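The payoff of the hybrid layout can be pictured as a read path that prefers the promoted (real) column and falls back to the map; this is our own illustration, not Vertica's executor:

```python
def read_column(row, promoted, col):
    """Hybrid flex read: promoted columns take the fast columnar path;
    everything else falls back to a map lookup on __raw__ (None = SQL NULL)."""
    if col in promoted:
        return row["columns"][col]
    return row["__raw__"].get(col)

row = {"columns": {"time": "2013-12-24 01:40:35"},
       "__raw__": {"time": "2013-12-24 01:40:35", "component": "ui"}}
print(read_column(row, {"time"}, "component"))  # ui
```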

Conclusion

Hopefully this whirlwind tour through the back end of flex tables has given you a better understanding of how it all works and the performance tradeoffs we made. As usual, we encourage you to download our Community Edition and give it a try!

Welcome to the Flex Zone – Dynamic Schemas

Enter the Flex Zone

Much of the data analytics we perform occurs with data whose schema changes over time. Many organizations have developers who define the record types that are collected and analysts who comb through these records looking for important patterns. In today's fast-changing business landscape, agility is the key to success: developers need to continually define new metrics for their analysts' consumption. Typically, relational databases have been a significant impediment: schema changes are costly and the database data ingest logic is brittle. Instead of record changes happening whenever the developer needs them, they involve careful coordination with database administrators against a backdrop of tweaking production systems. The flexible tables feature in Vertica 7.0 "Crane" addresses this challenge by allowing the table schema to vary from record to record: the DBA need not get involved in day-to-day schema changes, only when performance tuning is necessary.

A Web-driven database

For our example, we will consider a simple web service that serves a form and stores the results. While simplistic, this isn’t far off from serious web services, mobile games, smart meters, etc. There are two source files for this example, a web server flex-server.py and an HTML file form.html (store them in the same directory).

Getting Started

First, we need to create a table to store the form data. With Vertica Flex tables, we don’t even need to know what the structure of the data is upfront:

CREATE FLEX TABLE webreqs();

The web server, implemented in Python, is quite simple: it serves a form and stores the responses as JSON records.

import BaseHTTPServer
import datetime
import json
import urlparse

class FlexServer(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse.urlparse(self.path)
        # form submission?
        if url.path == '/data':
            v.send(json.dumps({"time": str(datetime.datetime.now()),
                               "src": self.client_address[0],
                               # "header": dict(self.headers.items()),  # uncomment to get header info
                               "data": urlparse.parse_qs(url.query)}) + '\n')
            self.send_response(302)
            self.send_header('Location', urlparse.urljoin(self.path, '/'))
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header('Content-type', 'text/html')
            self.end_headers()
            with open("form.html", "r") as src:
                self.wfile.write(src.read())
        return

v = SendToVertica()
from BaseHTTPServer import HTTPServer
server = HTTPServer(('localhost', 8080), FlexServer)
server.serve_forever()

For brevity, we’ve left off the SendToVertica class, which streams the data into Vertica – it’s also quite simple and the full text is in flex-server.py.
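A minimal sketch of such a class might pipe each record into a COPY statement running under vsql; the command line and table name below are assumptions for illustration, and the real implementation lives in flex-server.py:

```python
import subprocess

class SendToVertica:
    """Hypothetical sketch: stream JSON lines into a long-running
    'COPY ... FROM STDIN' executed by vsql (assumed to be on the PATH)."""
    def __init__(self, cmd=None):
        cmd = cmd or ["vsql", "-c",
                      "copy webreqs from stdin parser fjsonparser(flatten_arrays=true);"]
        self.proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)

    def send(self, line):
        self.proc.stdin.write(line.encode())
        self.proc.stdin.flush()

    def close(self):
        self.proc.stdin.close()
        return self.proc.wait()
```

Swapping the command for `["cat"]` makes the class easy to exercise without a database.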

The HTML file contains the form – it’s broken out into a separate file to make editing it easier. The initial form merely asks for the submitter’s name.

<html><head><title>Enter Info</title></head><body>
<form action="data">
Name:<input type="text" name="name">
<br><input type="submit" value="Submit">
</form></body></html>

Start the server (python flex-server.py) and point your browser at localhost:8080 – you should see something like:

[Screenshot: the rendered form]

Submit the form a couple of times. Then look at the server output – it might look something like this:

{"src": "127.0.0.1", "data": {"name": ["Ben"]}, "time": "2013-12-24 01:40:35.475723"}
{"src": "127.0.0.1", "data": {"name": ["Amy"]}, "time": "2013-12-24 01:40:39.293608"}

Vertica makes it easy to load this data into Flex tables – no table schema definition necessary and there’s already a parser for JSON data. The flex-server is running the following statement in order to load the data:

copy webreqs from stdin parser fjsonparser(flatten_arrays=true);

The flatten_arrays parameter may be new to you – we need it here because HTML form attributes can be many-valued, so there's an extra level of array we need to flatten out of the input.
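If you're curious what flattening does to a record, the key naming can be imitated in a few lines of Python (our own sketch of the behavior, not the fjsonparser implementation):

```python
def flatten(record, prefix=""):
    """Flatten nested maps into dotted keys and arrays into positional
    suffixes, so {"data": {"name": ["Ben"]}} yields key "data.name.0"."""
    flat = {}
    for key, value in record.items():
        name = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                flat[name + "." + str(i)] = item
        else:
            flat[name] = value
    return flat

req = {"src": "127.0.0.1", "data": {"name": ["Ben"]}}
print(flatten(req)["data.name.0"])  # Ben
```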

Now we can write a query against the data as if the JSON features were columns in the table:

select "data.name.0" from webreqs;
 data.name.0
-------------
 Amy
 Ben
(2 rows)

select compute_flextable_keys_and_build_view('webreqs');
          compute_flextable_keys_and_build_view
----------------------------------------------------
 Please see public.webreqs_keys for updated keys
 The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time
-------------+-----------+----------------------------
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608
(2 rows)

It’s Flex Time

But now for the main event: so far, loading the data was easy – but what if the developer decides the form should also have a vote field? What needs to change? Well, let's change the HTML file to include the new form field:

<html><head><title>Enter Info</title></head><body>
<form action="data">
Name:<input type="text" name="name">
<br>Vote:<select name="vote">
<option>yes</option><option>no</option></select>
<br><input type="submit" value="Submit">
</form></body></html>

Submit the form a couple more times and look at the output – now there’s an additional key in the JSON!

{"src": "127.0.0.1", "data": {"vote": ["yes"], "name": ["Jen"]}, "time": "2013-12-24 01:40:57.785534"}
{"src": "127.0.0.1", "data": {"vote": ["no"], "name": ["Ian"]}, "time": "2013-12-24 01:41:01.867649"}

What database changes are necessary? None! The new fields are immediately available to query:

select sum(case when "data.vote.0" = 'yes' then 1 else 0 end) as yeses,
       count("data.vote.0") as total from webreqs;

 yeses | total
-------+-------
     1 |     2
(1 row)

But what about the entries submitted before the “vote” field existed?

select "data.name.0","data.vote.0" from webreqs;
 data.name.0 | data.vote.0
-------------+-------------
 Jen         | yes
 Ben         |
 Amy         |
 Ian         | no
(4 rows)

For fields that don't exist in a particular record, you get SQL NULL. SQL operators such as the count and sum above do the "right thing" when they encounter NULLs.
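The NULL semantics in the query above can be mimicked over plain Python dicts (hypothetical records; a missing key stands in for SQL NULL):

```python
records = [
    {"data.name.0": "Jen", "data.vote.0": "yes"},
    {"data.name.0": "Ben"},  # submitted before the vote field existed
    {"data.name.0": "Amy"},
    {"data.name.0": "Ian", "data.vote.0": "no"},
]

# count("data.vote.0"): count() skips NULLs, i.e. missing keys
total = sum(1 for r in records if r.get("data.vote.0") is not None)
# sum(case when "data.vote.0" = 'yes' then 1 else 0 end)
yeses = sum(1 for r in records if r.get("data.vote.0") == "yes")
print(yeses, total)  # 1 2
```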

If you believe the schema has changed, you can update your keys and views to reflect the new reality by re-running the schematization tools:

select compute_flextable_keys_and_build_view('webreqs');
          compute_flextable_keys_and_build_view
----------------------------------------------------
 Please see public.webreqs_keys for updated keys
 The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time            | data.vote.0
-------------+-----------+----------------------------+-------------
 Jen         | 127.0.0.1 | 2013-12-24 01:40:57.785534 | yes
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723 |
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608 |
 Ian         | 127.0.0.1 | 2013-12-24 01:41:01.867649 | no
(4 rows)

Note that recomputing the view is only required if you want to access the new fields through the view; you can query the new fields directly at any time.

Try it yourself – edit the HTML form to add new fields and see how easy it is to query the resulting data in Vertica! You can also uncomment the header line in the server to get access to the header information (think cookies, sessionization, etc.).

Tuning for Performance

While flexible tables are great for exploration, we recommend that you do some tuning before moving into production. Luckily, tuning is easy and still preserves the flexible nature of the tables. The central mechanism is "promotion" of one or more columns to real table columns. To promote, select commonly used columns, particularly those used in joins or predicates, and add them to your flex table using ALTER TABLE … ADD COLUMN … DEFAULT. For example, the time and vote columns might be of particular interest:

ALTER TABLE webreqs ADD COLUMN "time" TIMESTAMP NOT NULL DEFAULT time::timestamp;
ALTER TABLE webreqs ADD COLUMN "data.vote.0" VARCHAR(10) DEFAULT "data.vote.0"::varchar;

As part of adding the column, Vertica populates the column with any same-name data. Future loads will also load data into the newly added columns. Note that we have selected specific data types for the new columns. New data that doesn’t match the declared data type will result in nulls getting stored (or errors if the column is declared NOT NULL).
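The coerce-or-NULL behavior for promoted columns can be sketched as follows (our illustration of the described semantics, not Vertica's loader code):

```python
def coerce_or_null(value, target):
    """Store a value in a typed column: cast if possible, else NULL (None)."""
    try:
        return target(value)
    except (TypeError, ValueError):
        return None

print(coerce_or_null("42", int))   # 42
print(coerce_or_null("yes", int))  # None
```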

Don’t want to run ADD COLUMN yourself? No problem, because there’s a helper function to do this step for you:

select materialize_flextable_columns('webreqs');

The following columns were added to the table public.webreqs:
    data.name.0
    src
    time
    data.vote.0
For more details, run the following query:
SELECT * FROM v_catalog.materialize_flextable_columns_results WHERE table_schema = 'public' and table_name = 'webreqs';

There’s nothing special about these newly added columns – you could have done the same thing upfront:

CREATE FLEX TABLE webreqs (
    "time" TIMESTAMP NOT NULL,
    "data.vote.0" VARCHAR(10)
);

So if you know a little about your data and how it will be used, you can jump directly to the optimized approach.

Vertica’s Database Designer exists to optimize the physical design (storage layout) and it works great on promoted columns in flexible tables. A typical usage model is explore, promote, and optimize. A key thing to remember is that changes to the schema do not require promotion and optimization. If you add fields that dramatically change your query workload, then some additional performance optimization may be necessary.

Flexible tables from Vertica should dramatically simplify the process of adapting your schema to the needs of your changing business.

Enter the Flex Zone – An Exploration in Processing Twitter Data

With our HP Vertica 7 release, we announced HP Vertica Flex Zone, a new product offering that simplifies the way that you consume and then explore semi-structured data, such as Web logs, sensor data, and other emerging data types. In this blog post, our first "Flex Zone Friday" post, let's look at how you can use HP Vertica Flex Zone to get a leg up on your latest data analysis problem, using Twitter data as the sample data type.

Tweet child of mine

For this example, let’s look at Twitter data. Tweets are an interesting source because the data volume is large, the tweet format is typically JSON, and many companies are mining tweet data for insights about their business in how the community perceives their brand, product, or service. A tweet might look like the following:

{"id":257000038495043584,
 "text":"Let's play uno",
 "created_at":"Sat Oct 13 06:09:18 2012",
 "user":{"name":"scout",
         "lang":"en",
         "followers_count":521,…}
 …}

It has the tweet text itself, with a lot of interesting metadata. Note that much of the metadata is stored in sub-structures like the “user” field.

You can acquire Twitter data quite easily by making requests to the Twitter API. For the purposes of following along here, you can download a small sample of tweets: tweets.json.gz

Look Ma: No Columns!

Typically, loading this type of data into a relational database would have required transforming the raw data into something more easily consumable – or a pre-built custom Twitter data loader. With HP Vertica Flex Zone, you can skip the schema definition phase and just load JSON directly into the database, saving you hours of time:

CREATE FLEX TABLE tweets ();
COPY tweets FROM '/path/to/tweets.json.gz' GZIP PARSER fjsonparser();

As easy as "SELECT … FROM T"
Now we can just write SQL queries against the data, as if it were naturally loaded into columns. For example, to discover what the average tweet length by language is:

select "user.lang",avg(length(text))::int from tweets group by "user.lang" order by 2 desc;

 user.lang | avg
-----------+-----
 de        | 119
 ar        | 108
 es        |  80
 en        |  67
 it        |  66
 tr        |  66
 nl        |  64
 id        |  61
 fr        |  60
 ru        |  57
 ja        |  44
 pt        |  43
 ko        |  40
 th        |  22
From the SQL point of view, it doesn't matter that the source data was JSON. The only concession is the column names themselves: we do need to double-quote ("") the column names if they contain the '.' character.

Let’s make a Schema

But how can you tell what columns are available? The HP Vertica Analytics Platform has a couple of mechanisms for discovering and making explicit the columns available.

select compute_flextable_keys_and_build_view('tweets');
          compute_flextable_keys_and_build_view
----------------------------------------------------
 Please see public.tweets_keys for updated keys
 The view public.tweets_view is ready for querying
(1 row)

The tweets_keys table now contains a row for each available column in the data. For example, to find all the user-related count columns:

select key_name from tweets_keys where key_name::varchar ilike 'user%count';
       key_name
-----------------------
 user.statuses_count
 user.listed_count
 user.friends_count
 user.followers_count
 user.favourites_count
(5 rows)

Armed with these new columns, we can compute some statistics about the followers in our dataset:

select min("user.followers_count"::int), avg("user.followers_count")::int,
       max("user.followers_count"::int) from tweets;

 min | avg  |  max
-----+------+--------
   0 | 1968 | 253435
(1 row)

Clearly someone is tweeting to the void. And we probably have some observer bias in our small sample: there's probably a correlation between tweeting frequency and follower count.

In addition to building the keys table, the function also whipped up a view that selects all available columns, which can be handy to see what is in the table. Tweets are quite large:

select * from tweets_view limit 1;
contributors | created_at | entities.hashtags | entities.urls | entities.user_mentions | favorited | id | id_str | in_reply_to_screen_name | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | retweet_count | retweeted | source | text | truncated | user.contributors_enabled | user.created_at | user.default_profile | user.default_profile_image | user.description | user.favourites_count | user.follow_request_sent | user.followers_count | user.following | user.friends_count | user.geo_enabled | user.id | user.id_str | user.is_translator | user.lang | user.listed_count | user.location | user.name | user.notifications | user.profile_background_color | user.profile_background_image_url | user.profile_background_image_url_https | user.profile_background_tile | user.profile_image_url | user.profile_image_url_https | user.profile_link_color | user.profile_sidebar_border_color | user.profile_sidebar_fill_color | user.profile_text_color | user.profile_use_background_image | user.protected | user.screen_name | user.statuses_count | user.time_zone | user.url | user.utc_offset | user.verified | coordinates | geo | place | user.profile_banner_url | delete.status.id | delete.status.id_str | delete.status.user_id | delete.status.user_id_str | retweeted_status.contributors | retweeted_status.created_at | retweeted_status.entities.hashtags | retweeted_status.entities.urls

A key differentiator of HP Vertica Flex Zone is that you can visualize this semi-structured data during data exploration with any of your preferred BI and visualization tools. The views suit these tools well, as they expose the metadata about the "flexible table" in a manner that the tools already understand. For example, we used Tableau to generate a report against the tweets_view.

And now for something completely different

It turns out that the tweet stream includes two types of records: tweets and deletes. Deletes are revocations of tweets and they have the following structure:

{
 "delete":{
  "status":{
   "user_id_str":"228092142",
   "id_str":"218710262428073984",
   "id":218710262428073984,
   "user_id":228092142
  }
 }
}

Our previous load actually loaded both record types into the same table. How did the previous queries work? Well, if you reference a column that doesn’t exist in a record, you get null. We can use this to find out how many deletes we loaded:

select count(*) from tweets where "delete.status.id" is not null;
 count
-------
   118
(1 row)

Lots yet to come

We haven't talked about a number of key things like column promotion (performance!), dynamic schema, or substructure. Here's a teaser about substructure: the JSON input is quite complex, and the HP Vertica Analytics Platform loaded it all. You can descend into the nested structures and get the most out of the data. The following query computes the top hashtags in the last month:

select values, count(*) from
    (select mapitems("entities.hashtags") over ()
     from tweets
     where created_at::timestamp > now() - interval '1 month') a
where keys::varchar like '%text'
group by values
order by count desc
limit 10;

              values               | count
-----------------------------------+-------
 10ThingsThatGetsOnMyNerves        |     2
 Massa                             |     1
 ความในใจของติ่ง                       |     1
 Up                                |     1
 RTしてくれた人全員フォローする        |     1
 Turn                              |     1
 50ThingsAboutMyBoyfriend          |     1
 askenchancer                      |     1
 always                            |     1
 kaburrr                           |     1

Note: if you pulled recent data from Twitter, this will work without modification; if you used our data set, use interval '3 years' (or more).

We encourage you to give HP Vertica Flex Zone a test drive and actively participate in our community site! Stay tuned for our next Flex Zone Friday blog post.
