Vertica

Archive for January, 2014

Enter the FlexZone – Let’s talk ETL

When (and When Not) to Use Data Integration with HP Vertica

In December, HP released version 7 of the HP Vertica Analytics Platform, which includes, among other features, a great new capability called HP Vertica Flex Zone (Flex Zone). Flex Zone enables you to quickly and easily load, explore, and analyze many forms of semi-structured data. It eliminates the need to define or apply coding-intensive schemas before the data is loaded for exploration.

One of Flex Zone’s important values is that it can save you hours of work setting up and managing data extraction. Rather than setting up schemas and mappings in an ETL tool and later worrying about whether the structure will change, the process is simplified with Flex Zone: data is simply pulled into Flex Zone and the structure is automatically understood. Flex Zone is powerful for the exploration of common types of data. Flex Tables can immediately leverage:

  • Delimited data – semi-structured text files, often referred to as flat files because the information is not stored in a relational database.
  • JSON – a human-readable format that is widely used in social media and modern online applications.

For these types of files, which are very common in modern IT infrastructures, you do not need an ETL tool to extract, transform, and load the data. This functionality is included with Flex Zone and can save you many hours of pre-processing data for analytics. It can also save you time in the long run by reducing the need to monitor ETL processes. Other mechanisms let Vertica ingest data from additional common big data sources such as Hive and HDFS.
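As a preview of the syntax covered in detail in the posts below, loading either format takes just a table definition and a COPY statement; the table names and file paths here are ours, for illustration:

CREATE FLEX TABLE delim_data();
COPY delim_data FROM '/path/to/data.txt' PARSER FDelimitedParser();

CREATE FLEX TABLE json_data();
COPY json_data FROM '/path/to/data.json' PARSER fjsonparser();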

Having a function in Flex Zone that automatically understands structure is powerful. Determining structure normally takes time, slowing the overall process of exploring the data, and maintaining it when the structure changes is also time-consuming. By integrating these less structured data sources and supporting vanilla SQL queries against them, Vertica brings a key feature of relational databases to bear: abstracting the storage representation from the query semantics.

ETL – Extract Transform Load

Flex Zone does not replace ETL everywhere, however. Most ETL tools offer hundreds of connectors that reach into anything from Salesforce to Oracle to DB2 on the mainframe. For these types of sources, you can either use an ETL tool or export data from the application into a format that Flex Tables support.

With ETL, users take on the process of extracting data and transforming it to make it fit for purpose. This longer process may be necessary. During the ETL process, users can ensure that the data conforms to the schema and that data quality standards are upheld. Users can establish business rules and reject any records that don’t conform to standards. Users can recode certain values in the data to standardize them (e.g., ST, Street, and strt can all be recoded to 'STREET'). Users can also extract data from sources that have proprietary formats, such as SAP, MS SQL Server, AS/400, and hundreds of others. So, to deliver accurate analytics and to gain access to unusual file formats, ETL is still necessary for certain data.
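For illustration only, the street-name standardization mentioned above can also be expressed as a SQL expression once the data is loaded into Vertica; the addresses table and street_suffix column here are hypothetical:

SELECT CASE
         WHEN UPPER(street_suffix) IN ('ST', 'STRT', 'STREET') THEN 'STREET'
         ELSE UPPER(street_suffix)
       END AS street_suffix_std
FROM addresses;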

Some companies need to keep track of where data came from and what was changed in it. The Data Lineage features of many ETL tools help you track where a change occurred: the result is a report that traces a change from the target end of a job back to its source. If this is an important part of your process, you may need an ETL tool.

The good news is that Vertica has several partners who offer free open source ETL with support for Vertica, like Talend and Pentaho, as well as commercial partners like Syncsort, Informatica and others. See http://www.vertica.com/partners/data-integration-etl/ for a complete list.

Enter the Flex Zone – Exploring Delimited Data

Here at HP Vertica, we had to solve a technical challenge that many of you might be facing: data analysis from legacy products. When a customer reports an issue to customer support, we commonly ask for a diagnostic dump. This dump contains structured event data on what the database was doing at the time the problem occurred and provides a primary source of information for resolving the customer’s issue. The dump contains a collection of tables, written out in a simple delimited format similar to the output of a ‘SELECT * from T’ statement. Since the data size can be quite large (gigabytes), we load it into a (ha-ha) Vertica database for analysis. The complication is that the contents of this diagnostic dump vary by Vertica version. We therefore needed a way to ingest the data and make use of it despite the schema varying based on the source version.

 

A mostly-fixed but time-varying schema is a great use case for the newly released Vertica 7 (aka Crane) flex tables feature. With flex tables, we can load the data without fully defining the schema and query the part of the data we need. Indeed, reading our own diagnostic data was the first serious use of flex tables and served to refine our design for the feature. In this post, we will walk you through capturing, loading, and querying data from Vertica’s own DataCollector (DC) tables using Vertica’s flex tables. While previous posts have focused on loading JSON data, Vertica Flex tables support a number of different input formats, including delimited data.

A Whirlwind Tour of the DataCollector

HP Vertica’s Data Collector is our mechanism for capturing historical event data for internal database reporting and diagnostic analysis. Many of our system tables that display historical information, such as query_requests or load_streams, are actually views built on top of data collector tables. Because they change frequently and can be opaque to users outside of Vertica, we don’t document them. If you want to muck around, take a look at:

select data_collector_help();

In this example, we examine a couple of tables that track transactions on the Vertica cluster, the first of which is called DC_TRANSACTION_ENDS. This table has a row for every transaction that completes on each node in the cluster:

select * from dc_transaction_ends limit 4;
time | node_name | session_id | user_id | user_name | transaction_id | number_of_statements | is_committed | end_begin_time | epoch_close_time | end_epoch | is_ddl | ros_rows_written | dvros_rows_written | wos_rows_written | dvwos_rows_written

2014-01-22 00:32:08.025521-05 | initiator | localhost.localdoma-10040:0xb | 45035996273704962 | bvandiver | 45035996273704963 | 1 | t | 2014-01-22 00:32:08.016667-05 | | 1 | t | 0 | 0 | 0 | 0
2014-01-22 00:32:08.03535-05 | initiator | localhost.localdoma-10040:0xb | 45035996273704962 | bvandiver | 45035996273704964 | 1 | f | 2014-01-22 00:32:08.03491-05 | | 1 | t | 0 | 0 | 0 | 0
2014-01-22 00:32:09.163234-05 | initiator | localhost.localdoma-10040:0xf | 45035996273704962 | bvandiver | 45035996273704965 | 1 | f | 2014-01-22 00:32:09.163122-05 | | 1 | t | 0 | 0 | 0 | 0
2014-01-22 00:32:09.175018-05 | initiator | localhost.localdoma-10040:0xf | 45035996273704962 | bvandiver | 45035996273704966 | 1 | f | 2014-01-22 00:32:09.174786-05 | | 1 | t | 0 | 0 | 0 | 0

Queries against this table can reveal a number of interesting properties of the database workload, such as commit rate, rows per bulk load, DDL percentages, and so on. A companion table called DC_TRANSACTION_STARTS registers a row for each transaction that starts – it contains less interesting information because the database does not yet know the contents of the transaction. However, joining the two tables tells us the duration of each transaction.

Capturing a Data Dump

The simplest possible way to export the data is merely to capture the output of running a SQL query – and indeed we frequently need to consume this ad hoc format here at Vertica. From within vsql, Vertica’s command line client, you can do this with:

bvandiver=> \a
Output format is unaligned.
bvandiver=> \o /tmp/dce.txt
bvandiver=> select * from dc_transaction_ends;
bvandiver=> \o

We switch to unaligned output to avoid dumping a huge number of useless blank spaces into the output. An alternative is to merely capture the output with shell redirect:

/opt/vertica/bin/vsql -A -c "select * from dc_transaction_ends;" > /tmp/dce.txt

The contents of the file will look something like this:

time|node_name|session_id|user_id|user_name|transaction_id|number_of_statements|is_committed|end_begin_time|epoch_close_time|end_epoch|is_ddl|ros_rows_written|dvros_rows_written|wos_rows_written|dvwos_rows_written
2014-01-22 00:32:08.002989-05|initiator|localhost.localdoma-10040:0x3|45035996273704962|bvandiver|45035996273704961|1|f|2014-01-22 00:32:08.002313-05||1|f|0|0|0|0
2014-01-22 00:32:08.009437-05|initiator|localhost.localdoma-10040:0x3|45035996273704962|bvandiver|45035996273704962|1|t|2014-01-22 00:32:08.008547-05||1|f|0|0|0|0
2014-01-22 00:32:08.025521-05|initiator|localhost.localdoma-10040:0xb|45035996273704962|bvandiver|45035996273704963|1|t|2014-01-22 00:32:08.016667-05||1|t|0|0|0|0
2014-01-22 00:32:08.03535-05|initiator|localhost.localdoma-10040:0xb|45035996273704962|bvandiver|45035996273704964|1|f|2014-01-22 00:32:08.03491-05||1|t|0|0|0|0
2014-01-22 00:32:09.163234-05|initiator|localhost.localdoma-10040:0xf|45035996273704962|bvandiver|45035996273704965|1|f|2014-01-22 00:32:09.163122-05||1|t|0|0|0|0
2014-01-22 00:32:09.175018-05|initiator|localhost.localdoma-10040:0xf|45035996273704962|bvandiver|45035996273704966|1|f|2014-01-22 00:32:09.174

Exploring the data

How would you load this data into a regular SQL database? Well, you’d have to look at the input file, determine the column count and types, write a large CREATE TABLE statement, and then bulk load the data. With Vertica’s flex tables, you need to do only the following:

CREATE FLEX TABLE dce();
COPY dce FROM '/tmp/dce.txt' PARSER FDelimitedParser();

The FDelimitedParser is a flexible parser much akin to Vertica’s regular COPY parser. The delimiter defaults to '|', which is why it need not be specified here (3 guesses why we picked that default…). Flex tables make it really easy to get the data into the database – something that comes in handy when customers are anxiously awaiting diagnosis! Because we never had to describe the schema, the load completes successfully no matter which version of Vertica supplied the source data.
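Query 3 below also joins against DC_TRANSACTION_STARTS. Assuming you captured that table the same way into a file such as /tmp/dcs.txt (our name, for illustration), the companion load is identical:

CREATE FLEX TABLE dcs();
COPY dcs FROM '/tmp/dcs.txt' PARSER FDelimitedParser();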

 

Queries are equally straightforward. Just pretend the columns exist! The following query shows the transactions that wrote the largest number of rows:

Query 1:

select sum(ros_rows_written + wos_rows_written) from dce group by transaction_id order by 1 desc limit 10;

Alternatively, this query computes commits per second over the interval captured:

Query 2:

select count(time) / datediff(ss, min(time::timestamp), max(time::timestamp)) from dce where is_committed = 't';

Much like many NoSQL solutions, flex tables require the type information to be supplied at query time, in this case in the form of casting the time column to a timestamp. In many cases, automatic casting rules convert the base varchar type to something appropriate, such as the sum() above casting its inputs to integers. You are not limited to trivial SQL; for example, you can load DC_TRANSACTION_STARTS and join the two tables to compute the average transaction duration:

Query 3:

select avg(ends.end_time - starts.start_time) from
  (select transaction_id, max(time::timestamp) as end_time from dce group by transaction_id) ends,
  (select transaction_id, min(time::timestamp) as start_time from dcs group by transaction_id) starts
where ends.transaction_id = starts.transaction_id;

This query needs the subqueries because a row for every transaction exists on every participating node and we want to find the earliest and latest entry. When joining flex tables, you need to fully qualify all source columns with their originating table because either table could possibly have any column name you supply! To simplify, you can have Vertica build a view for you:

select compute_flextable_keys_and_build_view('dce');

Querying the view dce_view looks very much like querying the original Data Collector table. In fact, any of the SQL we just wrote against the flex tables is valid when run against the original source tables DC_TRANSACTION_ENDS and DC_TRANSACTION_STARTS! Accessing flex tables is identical to accessing regular tables – a true illustration of how flex tables meet the underlying promise of SQL databases. Users do not have to know the underlying data storage to derive value from the data.
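For instance, here is a quick sketch of a query against the generated view, counting completed transactions per node, with no manual casts or table qualification needed:

select node_name, count(*) as transactions
from dce_view
group by node_name
order by transactions desc;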

In search of performance

We get some really large diagnostic dumps and can’t always afford the performance degradation of unoptimized flex tables. Improving performance is particularly simple for tables from the Data Collector because we know upfront which columns are important predicate and join columns. Furthermore, these columns are present in most DC tables and are unlikely to vary. Thus, we use a hybrid flex table: specify some important columns and let the rest of the data hang out in the "Flex Zone". We use the following definition for most DC tables:

CREATE FLEX TABLE hybrid (
time timestamptz,
node_name varchar(100),
session_id varchar(100),
user_id int,
user_name varchar(100),
transaction_id int,
statement_id int,
request_id int
) order by transaction_id,statement_id,time segmented by hash(transaction_id) all nodes ksafe 1;
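Loading the hybrid table is no different from loading the pure flex table; a minimal sketch, reusing the dump captured earlier:

COPY hybrid FROM '/tmp/dce.txt' PARSER FDelimitedParser();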

In our performance evaluation shown in Figure 1, we compare running the above queries using an unoptimized flex table with no explicit columns against an optimized flex table with explicit columns. The optimized table has the same columns, sort order, and segmentation as the "hybrid" table shown above, and the experiment was run on 3 virtual machines hosted on an HP DL380 with 12 cores and 96 GB of RAM. On the optimized table, Query 1 is an aggregation over an explicit column which appears in the sort order, leading to an efficient aggregation mechanism. Query 2 has a selective predicate column which is not explicit in the optimized table and thus doesn’t gain significant benefit. Query 3 derives significant benefit because the join columns are explicitly part of the sort order and segmentation expression for the optimized table, leading to a cluster-aware join which is very memory efficient. With our optimized table, we get the best of both worlds: speedy queries and a flexible schema.


Figure 1: Performance comparison of sample queries run on the unoptimized flex table versus the optimized hybrid flex table. Query 1 is a simple aggregate query and Query 3 is a simple join. The graph shows that optimization sped up Query 1 by a factor of 17, and Query 3 by a factor of 121.

And in conclusion

Flex tables have changed our ingest logic from a complicated and brittle collection of static scripts to one simple Python script that handles it all. And I am much more willing to accept general query dumps from customers, since even if they are 300 columns wide I can still trivially pull them into Vertica for analysis. It all works because flex tables are not just about supporting JSON in the database, they are about changing the way we store and explore data in a relational database.

Welcome to the Flex Zone – Dynamic Schemas

Enter the Flex Zone

Much of the data analytics we perform occurs with data whose schema changes over time. Many organizations have developers who define the record types that are collected and analysts who comb through these records looking for important patterns. In today’s fast-changing business landscape, agility is the key to success: developers need to continually define new metrics for their analysts’ consumption. Typically, relational databases have been a significant impediment: schema changes are costly and the database data ingest logic is brittle. Instead of record changes being driven by the developer’s needs, they involve careful coordination with database administrators against a backdrop of tweaking production systems. The flexible tables feature in Vertica 7.0 "Crane" seeks to address this challenge by allowing the table schema to vary from record to record. No database administrator involvement is necessary for day-to-day schema changes; the DBA need not get involved unless performance tuning is necessary.

A Web-driven database

For our example, we will consider a simple web service that serves a form and stores the results. While simplistic, this isn’t far off from serious web services, mobile games, smart meters, etc. There are two source files for this example, a web server flex-server.py and an HTML file form.html (store them in the same directory).

Getting Started

First, we need to create a table to store the form data. With Vertica Flex tables, we don’t even need to know what the structure of the data is upfront:

CREATE FLEX TABLE webreqs();

The web server, implemented in Python, is quite simple: it serves a form and stores the responses as JSON records.

import BaseHTTPServer
import datetime
import json
import urlparse

class FlexServer(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse.urlparse(self.path)
        # form submission?
        if url.path == '/data':
            # stream the submission to Vertica as a JSON record
            v.send(json.dumps({"time": str(datetime.datetime.now()),
                               "src": self.client_address[0],
                               #"header": dict(self.headers.items()), # uncomment to get header info
                               "data": urlparse.parse_qs(url.query)}) + '\n')
            self.send_response(302)
            self.send_header('Location', urlparse.urljoin(self.path, '/'))
        else:
            self.send_response(200)
            self.send_header('Content-type', 'text/html')
        self.end_headers()
        # serve the form (browsers ignore the body of the 302 redirect)
        with open("form.html", "r") as src:
            self.wfile.write(src.read())
        return

v = SendToVertica()  # defined in flex-server.py; streams records into Vertica
from BaseHTTPServer import HTTPServer
server = HTTPServer(('localhost', 8080), FlexServer)
server.serve_forever()

For brevity, we’ve left off the SendToVertica class, which streams the data into Vertica – it’s also quite simple and the full text is in flex-server.py.

The HTML file contains the form – it’s broken out into a separate file to make editing it easier. The initial form merely asks for the submitter’s name.

<html><head><title>Enter Info</title><body>
<form action="data">
Name:<input type="text" name="name">
<br><input type="submit" value="Submit">
</form></body></html>

Start the server (python flex-server.py) and point your browser at localhost:8080 – you should see something like:

[Screenshot: the form served by flex-server.py, with a Name field and a Submit button]

Submit the form a couple of times. Then look at the server output – it might look something like this:

{"src": "127.0.0.1", "data": {"name": ["Ben"]}, "time": "2013-12-24 01:40:35.475723"}
{"src": "127.0.0.1", "data": {"name": ["Amy"]}, "time": "2013-12-24 01:40:39.293608"}

Vertica makes it easy to load this data into Flex tables – no table schema definition necessary and there’s already a parser for JSON data. The flex-server is running the following statement in order to load the data:

copy webreqs from stdin parser fjsonparser(flatten_arrays=true);

The flatten_arrays parameter may be new to you – we need it here because HTML form attributes can be many-valued, so there’s an extra level of array that we need to flatten out of the input.

Now we can write a query against the data as if the JSON features were columns in the table:

select "data.name.0" from webreqs;
 data.name.0
-------------
 Amy
 Ben
(2 rows)

select compute_flextable_keys_and_build_view('webreqs');
            compute_flextable_keys_and_build_view
--------------------------------------------------------------
 Please see public.webreqs_keys for updated keys
 The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time
-------------+-----------+----------------------------
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608
(2 rows)

It’s Flex Time

But now for the main event: so far, loading the data was easy. But what if the developer decides the form should also have a vote field? What needs to change? Well, let’s change the HTML file to include the new form field:

<html><head><title>Enter Info</title><body>
<form action="data">
Name:<input type="text" name="name">
<br>Vote:<select type="select" name="vote">
<option>yes</option><option>no</option></select>
<br><input type="submit" value="Submit">
</form></body></html>

Submit the form a couple more times and look at the output – now there’s an additional key in the JSON!

{"src": "127.0.0.1", "data": {"vote": ["yes"], "name": ["Jen"]}, "time": "2013-12-24 01:40:57.785534"}
{"src": "127.0.0.1", "data": {"vote": ["no"], "name": ["Ian"]}, "time": "2013-12-24 01:41:01.867649"}

What database changes are necessary? None! The new fields are immediately available to query:

select sum(case when "data.vote.0" = 'yes' then 1 else 0 end) as yeses, count("data.vote.0") as total from webreqs;
 yeses | total
-------+-------
     1 |     2
(1 row)

But what about the entries submitted before the “vote” field existed?

select "data.name.0","data.vote.0" from webreqs;
 data.name.0 | data.vote.0
-------------+-------------
 Jen         | yes
 Ben         |
 Amy         |
 Ian         | no
(4 rows)

For fields that don’t exist in a particular record, you get SQL null. SQL operators do the “right thing” when they encounter nulls, such as count and sum above.

If you believe the schema has changed, you can update your keys and views to reflect the new reality by re-running the schematization tools:

select compute_flextable_keys_and_build_view('webreqs');
            compute_flextable_keys_and_build_view
--------------------------------------------------------------
 Please see public.webreqs_keys for updated keys
 The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time            | data.vote.0
-------------+-----------+----------------------------+-------------
 Jen         | 127.0.0.1 | 2013-12-24 01:40:57.785534 | yes
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723 |
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608 |
 Ian         | 127.0.0.1 | 2013-12-24 01:41:01.867649 | no
(4 rows)

Note that recomputing the view is not required to access the new fields; it is only required if you want to access them through the view.

Try it yourself – edit the HTML form to add new fields and see how easy it is to query the resulting data in Vertica! You can also uncomment the header line in the server to get access to the header information (think cookies, sessionization, etc.).

Tuning for Performance

While flexible tables are great for exploration, we recommend that you do some tuning before moving into production. Luckily, tuning is easy and still preserves the flexible nature of the tables. The central mechanism is "promotion" of one or more columns to real table columns. To promote, select commonly used columns, particularly those used in joins or predicates, and add them to your flex table. You can do this by using ALTER TABLE … ADD COLUMN … DEFAULT. For example, the time and vote columns might be of particular interest:

ALTER TABLE webreqs ADD COLUMN "time" TIMESTAMP NOT NULL DEFAULT time::timestamp;
ALTER TABLE webreqs ADD COLUMN "data.vote.0" VARCHAR(10) DEFAULT "data.vote.0"::varchar;

As part of adding the column, Vertica populates the column with any same-name data. Future loads will also load data into the newly added columns. Note that we have selected specific data types for the new columns. New data that doesn’t match the declared data type will result in nulls getting stored (or errors if the column is declared NOT NULL).
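Once promoted, the columns behave like any other typed columns. For example, here is a quick sketch counting votes per day against the columns promoted above, with no casts needed on the time column:

select "time"::date as day, count("data.vote.0") as votes
from webreqs
group by "time"::date
order by "time"::date;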

Don’t want to run ADD COLUMN yourself? No problem, because there’s a helper function to do this step for you:

select materialize_flextable_columns('webreqs');

The following columns were added to the table public.webreqs:
data.name.0
src
time
data.vote.0
For more details, run the following query:
SELECT * FROM v_catalog.materialize_flextable_columns_results WHERE table_schema = 'public' and table_name = 'webreqs';

There’s nothing special about these newly added columns – you could have done the same thing upfront:

CREATE FLEX TABLE webreqs (
"time" TIMESTAMP NOT NULL,
"data.vote.0" VARCHAR(10)
);

So if you know a little about your data and how it will be used, you can jump directly to the optimized approach.

Vertica’s Database Designer exists to optimize the physical design (storage layout) and it works great on promoted columns in flexible tables. A typical usage model is explore, promote, and optimize. A key thing to remember is that changes to the schema do not require promotion and optimization. If you add fields that dramatically change your query workload, then some additional performance optimization may be necessary.

Flexible tables from Vertica should dramatically simplify the process of adapting your schema to the needs of your changing business.

Enter the Flex Zone – An Exploration in Processing Twitter Data

With our HP Vertica 7 release, we announced HP Vertica Flex Zone, a new product offering that simplifies the way that you consume and then explore semi-structured data, such as Web logs, sensor data, and other emerging data types. In this blog post, our first “Flex Zone Friday” post, let’s look at how you can use HP Vertica Flex Zone to get a leg up on your latest data analysis problem, using Twitter data as the sample data type.

Tweet child of mine

For this example, let’s look at Twitter data. Tweets are an interesting source because the data volume is large, the tweet format is typically JSON, and many companies are mining tweet data for insights into how the community perceives their brand, product, or service. A tweet might look like the following:

{"id":257000038495043584,
 "text":"Let's play uno",
 "created_at":"Sat Oct 13 06:09:18 2012",
 "user":{"name":"scout",
         "lang":"en",
         "followers_count":521,…}
…}

It has the tweet text itself, with a lot of interesting metadata. Note that much of the metadata is stored in sub-structures like the “user” field.

You can acquire twitter data quite easily by making requests to the Twitter API. For the purposes of following along here, you can download a small sample of tweets: tweets.json.gz

Look Ma: No Columns!

Typically, loading this type of data into a relational database would have required transforming the raw data into something more easily consumable – or a pre-built custom Twitter data loader. With HP Vertica Flex Zone, you can skip the schema definition phase and just load JSON directly into the database, saving you hours of time:

CREATE FLEX TABLE tweets ();
COPY tweets FROM '/path/to/tweets.json.gz' GZIP PARSER fjsonparser();

As easy as “SELECT … FROM T”
Now we can just write SQL queries against the data, as if it were naturally loaded into columns. For example, to discover what the average tweet length by language is:

select "user.lang", avg(length(text))::int from tweets group by "user.lang" order by 2 desc;

 user.lang | avg
-----------+-----
 de        | 119
 ar        | 108
 es        |  80
 en        |  67
 it        |  66
 tr        |  66
 nl        |  64
 id        |  61
 fr        |  60
 ru        |  57
 ja        |  44
 pt        |  43
 ko        |  40
 th        |  22
From the SQL point of view, it doesn’t matter that the source data was JSON. The only concession is the column names themselves: we do need to double-quote the column names if they contain the ‘.’ character.

Let’s make a Schema

But how can you tell what columns are available? The HP Vertica Analytics Platform has a couple of mechanisms for discovering and making explicit the columns available.

select compute_flextable_keys_and_build_view('tweets');
            compute_flextable_keys_and_build_view
--------------------------------------------------------------
 Please see public.tweets_keys for updated keys
 The view public.tweets_view is ready for querying
(1 row)

The tweets_keys table now contains a row for each available column in the data. For example, to find all the user-related count columns:

select key_name from tweets_keys where key_name::varchar ilike 'user%count';
       key_name
-----------------------
 user.statuses_count
 user.listed_count
 user.friends_count
 user.followers_count
 user.favourites_count
(5 rows)

Armed with these new columns, we can compute some statistics about the followers in our dataset:

select min("user.followers_count"::int), avg("user.followers_count")::int, max("user.followers_count"::int) from tweets;

 min | avg  |  max
-----+------+--------
   0 | 1968 | 253435
(1 row)

Clearly someone is tweeting to the void. And we probably have some observer bias in our small sample: there’s probably a correlation between tweeting frequency and follower count.
In addition to building the keys table, the function also whipped up a view that selects all available columns, which can be handy to see what is in the table. Tweets are quite large:

select * from tweets_view limit 1;
contributors | created_at | entities.hashtags | entities.urls | entities.user_mentions | favorited | id | id_str | in_reply_to_screen_name | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | retweet_count | retweeted | source | text | truncated | user.contributors_enabled | user.created_at | user.default_profile | user.default_profile_image | user.description | user.favourites_count | user.follow_request_sent | user.followers_count | user.following | user.friends_count | user.geo_enabled | user.id | user.id_str | user.is_translator | user.lang | user.listed_count | user.location | user.name | user.notifications | user.profile_background_color | user.profile_background_image_url | user.profile_background_image_url_https | user.profile_background_tile | user.profile_image_url | user.profile_image_url_https | user.profile_link_color | user.profile_sidebar_border_color | user.profile_sidebar_fill_color | user.profile_text_color | user.profile_use_background_image | user.protected | user.screen_name | user.statuses_count | user.time_zone | user.url | user.utc_offset | user.verified | coordinates | geo | place | user.profile_banner_url | delete.status.id | delete.status.id_str | delete.status.user_id | delete.status.user_id_str | retweeted_status.contributors | retweeted_status.created_at | retweeted_status.entities.hashtags | retweeted_status.entities.urls

A key differentiator of HP Vertica Flex Zone is that you can visualize this semi-structured data during data exploration with any of your preferred BI and visualization tools. The views make this possible, because they expose the metadata about the "flexible table" in a manner that the tools already understand. For example, we used Tableau to generate a report against the tweets_view.

And now for something completely different

It turns out that the tweet stream includes two types of records: tweets and deletes. Deletes are revocations of tweets and they have the following structure:

{
 "delete":{
  "status":{
   "user_id_str":"228092142",
   "id_str":"218710262428073984",
   "id":218710262428073984,
   "user_id":228092142
  }
 }
}

Our previous load actually loaded both record types into the same table. How did the previous queries work? Well, if you reference a column that doesn’t exist in a record, you get null. We can use this to find out how many deletes we loaded:

select count(*) from tweets where "delete.status.id" is not null;
 count
-------
   118
(1 row)
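The same null test works as a filter when you want to exclude the deletes from an analysis; here is a minimal sketch counting the remaining tweets per language:

select "user.lang", count(*) as tweet_count
from tweets
where "delete.status.id" is null
group by "user.lang"
order by tweet_count desc
limit 5;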

Lots yet to come

We haven’t talked about a number of key things like column promotion (performance!), dynamic schema, or substructure. Here’s a teaser about substructure — namely the JSON input is quite complex and the HP Vertica Analytics Platform loaded it all. You can descend into the nested structures and get the most out of the data. The following is a query that computes the top hashtags in the last month:

select values, count(*) from
    (select mapitems("entities.hashtags") over ()
     from tweets
     where created_at::timestamp > now() - interval '1 month') a
where keys::varchar like '%text'
group by values
order by count desc
limit 10;

            values              | count
--------------------------------+-------
 10ThingsThatGetsOnMyNerves     |     2
 Massa                          |     1
 ความในใจของติ่ง                  |     1
 Up                             |     1
 RTしてくれた人全員フォローする     |     1
 Turn                           |     1
 50ThingsAboutMyBoyfriend       |     1
 askenchancer                   |     1
 always                         |     1
 kaburrr                        |     1

Note: If you got recent data from Twitter, this will work without modification – if you used our data set, use interval '3 years' (or more).

We encourage you to give HP Vertica Flex Zone a test drive and actively participate in our community site! Stay tuned for our next Flex Zone Friday blog post.

Our Users Validate the Value of Vertica

We recently allowed TechValidate, a trusted authority for creating customer evidence content, to survey the HP Vertica customer base. The firm reached out to nearly 200 customers across a variety of industries and came back with some extremely powerful results.

From the financial benefits to the performance advantages, the benefits of the HP Vertica Analytics platform were repeatedly and clearly detailed by our customers.

A sampling of some of the comments and results can be found below; the full result set is available from TechValidate.

  • "HP Vertica Software Rocks" and "HP Vertica Software – the best in the market"
  • Query performance increased by 100-500% or more
  • HP Vertica customers have achieved a wide range of benefits; the majority of Vertica users saved $100-500K or more
