Vertica


DbVisualizer Free for HP Vertica


As a special service to Vertica users and the Big Data community, we have released a new version of DbVisualizer, DbVisualizer Free for Vertica, now available for free on the HP Vertica Marketplace. This optimized version of DbVisualizer is unique to Vertica: it is customized specifically for Vertica-specific functionality, like viewing projections, and includes some features usually reserved for the Pro version, such as multiple execution options. Now available for download here.

Two new case studies – Jobrapido & Supercell.

We’ve published two more case studies, featuring Jobrapido and Supercell. These are compelling examples of innovative companies that use the HP Vertica Analytics Platform to gain a competitive edge and derive maximum business value from Big Data. The two summaries and the respective full case study PDFs provide details about each company’s goals, successes, and ultimate outcomes using HP Vertica. To see more like these, visit the HP Vertica Case Studies page.


Jobrapido scales its database
to the next level

Since its founding in 2006, Jobrapido has become one of the biggest online job search aggregators in the world, helping millions of users everywhere from Italy to the United States find the job that’s right for them. In 2012, the company was acquired by Evenbase, a part of DMG Media based in the UK. HP Vertica has proved invaluable to their success, performing above and beyond for their big data analytics needs. David Conforti, Director of BI at Jobrapido, describes HP Vertica as “like having a sort of magic mirror to ask to all the business questions that come to my mind,” one that has allowed him and his team to deliver their users both valuable insights and a unique personal experience based on their analytics.

Download the full case study(.pdf)


Supercell performs real-time analytics

In 2012, just a year after its founding, Supercell delivered two top-grossing games on iOS: “Clash of Clans” and “Hay Day.” Using the HP Vertica big data analytics platform, Supercell has been able to engage in real-time gaming data analytics, allowing the company to balance, adapt, and improve its gamers’ experiences on a day-to-day basis. “HP Vertica is an important tool in making sure that our games provide the best possible experience for our players,” says Janne Peltola, a data scientist at Supercell. Using HP Vertica, Supercell is able to create gaming experiences that are fun and engaging, keeping customers coming back long after they have started playing.

Download the full case study (.pdf)

Welcome to the Flex Zone – Dynamic Schemas

Enter the Flex Zone

Much of the data analytics we perform occurs on data whose schema changes over time. Many organizations have developers who define the record types that are collected, and analysts who comb through these records looking for important patterns. In today’s fast-changing business landscape, agility is the key to success: developers need to continually define new metrics for their analysts’ consumption. Typically, relational databases have been a significant impediment: schema changes are costly, and database data-ingest logic is brittle. Instead of record changes happening at the discretion of the developer, they involve careful coordination with database administrators against a backdrop of tweaking production systems. The flexible tables feature in Vertica 7.0 “Crane” addresses this challenge by allowing the table schema to vary from record to record: no database administrator involvement is necessary for day-to-day schema changes, and the DBA need not get involved unless performance tuning is necessary.

A Web-driven database

For our example, we will consider a simple web service that serves a form and stores the results. While simplistic, this isn’t far off from serious web services, mobile games, smart meters, etc. There are two source files for this example, a web server flex-server.py and an HTML file form.html (store them in the same directory).

Getting Started

First, we need to create a table to store the form data. With Vertica Flex tables, we don’t even need to know what the structure of the data is upfront:

CREATE FLEX TABLE webreqs();

The web server, implemented in Python, is quite simple: it serves a form and stores the responses as JSON records.

class FlexServer(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse.urlparse(self.path)
        # form submission?
        if url.path == '/data':
            v.send(json.dumps({"time": str(datetime.datetime.now()),
                               "src": self.client_address[0],
                               # "header": dict(self.headers.items()),  # uncomment to get header info
                               "data": urlparse.parse_qs(url.query)}) + '\n')
            self.send_response(302)
            self.send_header('Location', urlparse.urljoin(self.path, '/'))
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header('Content-type', 'text/html')
            self.end_headers()
            with open("form.html", "r") as src:
                self.wfile.write(src.read())
        return

v = SendToVertica()
from BaseHTTPServer import HTTPServer
server = HTTPServer(('localhost', 8080), FlexServer)
server.serve_forever()

For brevity, we’ve left off the SendToVertica class, which streams the data into Vertica – it’s also quite simple and the full text is in flex-server.py.
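If you want a rough mental model without opening flex-server.py, here is a hypothetical sketch of such a class. Everything in it is an assumption (the real flex-server.py may differ): it pipes each JSON line into a long-running vsql process executing the COPY statement shown below, and the `sink` parameter is included purely so the class can be exercised without a database.

```python
import subprocess

class SendToVertica(object):
    """Hypothetical sketch: streams newline-delimited JSON records into
    Vertica by piping them to vsql's COPY ... FROM STDIN. The vsql
    invocation and COPY statement are assumptions, not the original
    flex-server.py code."""

    COPY_SQL = "copy webreqs from stdin parser fjsonparser(flatten_arrays=true);"

    def __init__(self, sink=None):
        if sink is not None:
            # any object with write()/flush(), handy for testing
            self._pipe = sink
        else:
            proc = subprocess.Popen(["vsql", "-c", self.COPY_SQL],
                                    stdin=subprocess.PIPE,
                                    universal_newlines=True)
            self._pipe = proc.stdin

    def send(self, record):
        self._pipe.write(record)
        self._pipe.flush()
```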

The HTML file contains the form – it’s broken out into a separate file to make editing it easier. The initial form merely asks for the submitter’s name.

<html><head><title>Enter Info</title></head><body>
<form action="data">
Name:<input type="text" name="name">
<br><input type="submit" value="Submit">
</form></body></html>

Start the server (python flex-server.py) and point your browser at localhost:8080 – you should see something like:

(screenshot: the rendered form, with a Name text box and a Submit button)

Submit the form a couple of times. Then look at the server output – it might look something like this:

{"src": "127.0.0.1", "data": {"name": ["Ben"]}, "time": "2013-12-24 01:40:35.475723"}
{"src": "127.0.0.1", "data": {"name": ["Amy"]}, "time": "2013-12-24 01:40:39.293608"}

Vertica makes it easy to load this data into Flex tables – no table schema definition necessary and there’s already a parser for JSON data. The flex-server is running the following statement in order to load the data:

copy webreqs from stdin parser fjsonparser(flatten_arrays=true);

The flatten_arrays parameter may be new to you – we need it here because HTML form attributes can be many valued, so there’s an extra level of array we need to flatten out of the input.
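To make the flattening concrete, here is a rough Python model of what the parser does with one of the records above. This is an approximation for illustration only; Vertica's exact key-generation rules may differ in detail.

```python
import json

def flatten(obj, prefix=""):
    """Rough model of fjsonparser(flatten_arrays=true): nested maps and
    arrays become dotted key paths, with array elements indexed by
    position (approximation for illustration)."""
    out = {}
    items = obj.items() if isinstance(obj, dict) else enumerate(obj)
    for key, val in items:
        path = "{}.{}".format(prefix, key) if prefix else str(key)
        if isinstance(val, (dict, list)):
            out.update(flatten(val, path))
        else:
            out[path] = val
    return out

record = json.loads('{"src": "127.0.0.1", "data": {"name": ["Ben"]}}')
flatten(record)  # {'src': '127.0.0.1', 'data.name.0': 'Ben'}
```

This is why the column for the name field is called "data.name.0": the form value sits inside a one-element array under the "data" map.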

Now we can write a query against the data as if the JSON features were columns in the table:

select "data.name.0" from webreqs;
 data.name.0
-------------
 Amy
 Ben
(2 rows)

select compute_flextable_keys_and_build_view('webreqs');
            compute_flextable_keys_and_build_view
--------------------------------------------------------------
 Please see public.webreqs_keys for updated keys
 The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time
-------------+-----------+----------------------------
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608
(2 rows)

It’s Flex Time

But now for the main event: so far, loading the data was easy. But what if the developer decides the form should also have an additional vote field? What needs to change? Well, let’s change the HTML file to include the new form field:

<html><head><title>Enter Info</title></head><body>
<form action="data">
Name:<input type="text" name="name">
<br>Vote:<select type="select" name="vote">
<option>yes</option><option>no</option></select>
<br><input type="submit" value="Submit">
</form></body></html>

Submit the form a couple more times and look at the output – now there’s an additional key in the JSON!

{"src": "127.0.0.1", "data": {"vote": ["yes"], "name": ["Jen"]}, "time": "2013-12-24 01:40:57.785534"}
{"src": "127.0.0.1", "data": {"vote": ["no"], "name": ["Ian"]}, "time": "2013-12-24 01:41:01.867649"}

What database changes are necessary? None! The new fields are immediately available to query:

select sum(case when "data.vote.0" = 'yes' then 1 else 0 end) as yeses,
       count("data.vote.0") as total from webreqs;
 yeses | total
-------+-------
     1 |     2
(1 row)

But what about the entries submitted before the “vote” field existed?

select "data.name.0","data.vote.0" from webreqs;
 data.name.0 | data.vote.0
-------------+-------------
 Jen         | yes
 Ben         |
 Amy         |
 Ian         | no
(4 rows)

For fields that don’t exist in a particular record, you get SQL null. SQL operators do the “right thing” when they encounter nulls, such as count and sum above.
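The same null semantics are easy to model in Python, with None playing the role of SQL null (an illustrative sketch, not Vertica code):

```python
# The four rows from the query above; None plays the role of SQL NULL.
rows = [{"name": "Jen", "vote": "yes"},
        {"name": "Ben", "vote": None},  # submitted before the vote field existed
        {"name": "Amy", "vote": None},
        {"name": "Ian", "vote": "no"}]

# count("data.vote.0"): SQL count() skips nulls
total = sum(1 for r in rows if r["vote"] is not None)

# sum(case when "data.vote.0" = 'yes' then 1 else 0 end):
# null = 'yes' is not true, so null rows fall into the else branch
yeses = sum(1 if r["vote"] == "yes" else 0 for r in rows)

print(yeses, total)  # 1 2
```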

If you believe the schema has changed, you can update your keys and views to reflect the new reality by re-running the schematization tools:

select compute_flextable_keys_and_build_view('webreqs');
            compute_flextable_keys_and_build_view
--------------------------------------------------------------
 Please see public.webreqs_keys for updated keys
 The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time            | data.vote.0
-------------+-----------+----------------------------+-------------
 Jen         | 127.0.0.1 | 2013-12-24 01:40:57.785534 | yes
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723 |
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608 |
 Ian         | 127.0.0.1 | 2013-12-24 01:41:01.867649 | no
(4 rows)

Note that recomputing the view is not required in order to query the new fields directly; it is only required if you want to access the new fields through the view.

Try it yourself: edit the HTML form to have new fields and see how easy it is to query the resulting data in Vertica! You can also uncomment the header line in the server to get access to the header information (think cookies, sessionization, etc.).

Tuning for Performance

While flexible tables are great for exploration, we recommend that you do some tuning before moving into production. Luckily, tuning is easy and still preserves the flexible nature of the tables. The central mechanism is “promotion” of one or more columns to real table columns. To promote, select commonly used columns, particularly those used in joins or predicates, and add them to your flex table. You can do this by using ALTER TABLE … ADD COLUMN … DEFAULT. For example, the time and vote columns might be of particular interest:

ALTER TABLE webreqs ADD COLUMN "time" TIMESTAMP NOT NULL DEFAULT time::timestamp;
ALTER TABLE webreqs ADD COLUMN "data.vote.0" VARCHAR(10) DEFAULT "data.vote.0"::varchar;

As part of adding the column, Vertica populates the column with any same-name data. Future loads will also load data into the newly added columns. Note that we have selected specific data types for the new columns. New data that doesn’t match the declared data type will result in nulls getting stored (or errors if the column is declared NOT NULL).
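This cast-or-null behavior can be sketched in Python as follows (an illustration of the semantics, not of Vertica internals):

```python
from datetime import datetime

def cast_or_null(value, caster):
    """Model of the behavior when loading into a typed, promoted column:
    a value that fails to cast becomes SQL NULL (None here) instead of
    aborting the load. Illustrative sketch, not Vertica code."""
    try:
        return None if value is None else caster(value)
    except (TypeError, ValueError):
        return None

parse_ts = lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")
cast_or_null("2013-12-24 01:40:35.475723", parse_ts)  # a datetime value
cast_or_null("not a timestamp", parse_ts)             # None -> stored as NULL
```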

Don’t want to run ADD COLUMN yourself? No problem, because there’s a helper function to do this step for you:

select materialize_flextable_columns('webreqs');

The following columns were added to the table public.webreqs:
    data.name.0
    src
    time
    data.vote.0
For more details, run the following query:
SELECT * FROM v_catalog.materialize_flextable_columns_results WHERE table_schema = 'public' and table_name = 'webreqs';

There’s nothing special about these newly added columns – you could have done the same thing upfront:

CREATE FLEX TABLE webreqs (
"time" TIMESTAMP NOT NULL,
"data.vote.0" VARCHAR(10)
);

So if you know a little about your data and how it will be used, you can jump directly to the optimized approach.

Vertica’s Database Designer exists to optimize the physical design (storage layout) and it works great on promoted columns in flexible tables. A typical usage model is explore, promote, and optimize. A key thing to remember is that changes to the schema do not require promotion and optimization. If you add fields that dramatically change your query workload, then some additional performance optimization may be necessary.

Flexible tables from Vertica should dramatically simplify the process of adapting your schema to the needs of your changing business.

Enter the Flex Zone – An Exploration in Processing Twitter Data

With our HP Vertica 7 release, we announced HP Vertica Flex Zone, a new product offering that simplifies the way that you consume and then explore semi-structured data, such as Web logs, sensor data, and other emerging data types. In this blog post, our first “Flex Zone Friday” post, let’s look at how you can use HP Vertica Flex Zone to get a leg up on your latest data analysis problem, using Twitter data as the sample data type.

Tweet child of mine

For this example, let’s look at Twitter data. Tweets are an interesting source because the data volume is large, the tweet format is typically JSON, and many companies are mining tweet data for insights into how the community perceives their brand, product, or service. A tweet might look like the following:

{"id":257000038495043584,
 "text":"Let's play uno",
 "created_at":"Sat Oct 13 06:09:18 2012",
 "user":{"name":"scout",
         "lang":"en",
         "followers_count":521, ...},
 ...}

It has the tweet text itself, with a lot of interesting metadata. Note that much of the metadata is stored in sub-structures like the “user” field.

You can acquire Twitter data quite easily by making requests to the Twitter API. For the purposes of following along here, you can download a small sample of tweets: tweets.json.gz

Look Ma: No Columns!

Typically, loading this type of data into a relational database would have required transforming the raw data into something more easily consumable – or a pre-built custom Twitter data loader. With HP Vertica Flex Zone, you can skip the schema definition phase and just load JSON directly into the database, saving you hours of time:

CREATE FLEX TABLE tweets ();
COPY tweets FROM '/path/to/tweets.json.gz' GZIP PARSER fjsonparser();

As easy as “SELECT … FROM T”
Now we can just write SQL queries against the data, as if it were naturally loaded into columns. For example, to discover what the average tweet length by language is:

select "user.lang", avg(length(text))::int from tweets group by "user.lang" order by 2 desc;

 user.lang | avg
-----------+-----
 de        | 119
 ar        | 108
 es        |  80
 en        |  67
 it        |  66
 tr        |  66
 nl        |  64
 id        |  61
 fr        |  60
 ru        |  57
 ja        |  44
 pt        |  43
 ko        |  40
 th        |  22

From the SQL point of view, it doesn’t matter that the source data was JSON. The only concession is the column names themselves: we do need to double-quote column names that contain the ‘.’ character.

Let’s make a Schema

But how can you tell what columns are available? The HP Vertica Analytics Platform has a couple of mechanisms for discovering and making explicit the columns available.

select compute_flextable_keys_and_build_view('tweets');
            compute_flextable_keys_and_build_view
--------------------------------------------------------------
 Please see public.tweets_keys for updated keys
 The view public.tweets_view is ready for querying
(1 row)

The tweets_keys table now contains a row for each available column in the data. For example, to find all the user-related count columns:

select key_name from tweets_keys where key_name::varchar ilike 'user%count';
       key_name
-----------------------
 user.statuses_count
 user.listed_count
 user.friends_count
 user.followers_count
 user.favourites_count
(5 rows)

Armed with these new columns, we can compute some statistics about the followers in our dataset:

select min("user.followers_count"::int),
       avg("user.followers_count")::int,
       max("user.followers_count"::int) from tweets;

 min | avg  |  max
-----+------+--------
   0 | 1968 | 253435
(1 row)

Clearly someone is tweeting to the void. And we probably have some observer bias in our small sample: there’s probably a correlation between tweeting frequency and follower count.
In addition to building the keys table, the function also whipped up a view that selects all available columns, which can be handy to see what is in the table. Tweets are quite large:

select * from tweets_view limit 1;
contributors | created_at | entities.hashtags | entities.urls | entities.user_mentions | favorited | id | id_str | in_reply_to_screen_name | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | retweet_count | retweeted | source | text | truncated | user.contributors_enabled | user.created_at | user.default_profile | user.default_profile_image | user.description | user.favourites_count | user.follow_request_sent | user.followers_count | user.following | user.friends_count | user.geo_enabled | user.id | user.id_str | user.is_translator | user.lang | user.listed_count | user.location | user.name | user.notifications | user.profile_background_color | user.profile_background_image_url | user.profile_background_image_url_https | user.profile_background_tile | user.profile_image_url | user.profile_image_url_https | user.profile_link_color | user.profile_sidebar_border_color | user.profile_sidebar_fill_color | user.profile_text_color | user.profile_use_background_image | user.protected | user.screen_name | user.statuses_count | user.time_zone | user.url | user.utc_offset | user.verified | coordinates | geo | place | user.profile_banner_url | delete.status.id | delete.status.id_str | delete.status.user_id | delete.status.user_id_str | retweeted_status.contributors | retweeted_status.created_at | retweeted_status.entities.hashtags | retweeted_status.entities.urls

A key differentiator of HP Vertica Flex Zone is that you can visualize this semi-structured data during data exploration with any of your preferred BI and visualization tools. These tools work well with the views, which expose the metadata about the “flexible table” in a manner that the tools already understand. For example, we used Tableau to generate a report against the tweets_view.

And now for something completely different

It turns out that the tweet stream includes two types of records: tweets and deletes. Deletes are revocations of tweets and they have the following structure:

{
 "delete":{
  "status":{
   "user_id_str":"228092142",
   "id_str":"218710262428073984",
   "id":218710262428073984,
   "user_id":228092142
  }
 }
}

Our previous load actually loaded both record types into the same table. How did the previous queries work? Well, if you reference a column that doesn’t exist in a record, you get null. We can use this to find out how many deletes we loaded:

select count(*) from tweets where "delete.status.id" is not null;
 count
-------
   118
(1 row)
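In Python terms, the query is simply counting the records in which the nested key path exists (a sketch for illustration; inside Vertica this is just a null check on the flex column):

```python
records = [
    {"text": "Let's play uno", "user": {"name": "scout"}},  # a tweet
    {"delete": {"status": {"id": 218710262428073984,
                           "user_id": 228092142}}},          # a delete
]

def has_path(record, path):
    """True when the dotted key path (e.g. 'delete.status.id') resolves
    to a value in the nested record, i.e. the flex column is non-null."""
    node = record
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True

deletes = sum(1 for r in records if has_path(r, "delete.status.id"))
print(deletes)  # 1
```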

Lots yet to come

We haven’t talked about a number of key things like column promotion (performance!), dynamic schema, or substructure. Here’s a teaser about substructure — namely the JSON input is quite complex and the HP Vertica Analytics Platform loaded it all. You can descend into the nested structures and get the most out of the data. The following is a query that computes the top hashtags in the last month:

select values, count(*) from
    (select mapitems("entities.hashtags") over ()
     from tweets
     where created_at::timestamp > now() - interval '1 month') a
where keys::varchar like '%text'
group by values
order by count desc
limit 10;

            values              | count
--------------------------------+-------
 10ThingsThatGetsOnMyNerves     |     2
 Massa                          |     1
 ความในใจของติ่ง                  |     1
 Up                             |     1
 RTしてくれた人全員フォローする     |     1
 Turn                           |     1
 50ThingsAboutMyBoyfriend       |     1
 askenchancer                   |     1
 always                         |     1
 kaburrr                        |     1
(10 rows)

Note: If you got recent data from Twitter, this will work without modification; if you used our data set, use interval '3 years' (or more).
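For intuition, the same top-hashtags computation over the raw JSON looks like this in Python (illustrative only; mapitems() performs the nested traversal inside Vertica, and the sample tweets here are made up):

```python
from collections import Counter

# Two toy tweets with the nested entities.hashtags substructure.
tweets = [
    {"entities": {"hashtags": [{"text": "Up", "indices": [0, 3]}]}},
    {"entities": {"hashtags": [{"text": "Up", "indices": [5, 8]},
                               {"text": "Turn", "indices": [9, 14]}]}},
]

def hashtag_texts(tweets):
    # descend into entities.hashtags and pull each tag's text field --
    # the equivalent of filtering mapitems() keys on '%text'
    for t in tweets:
        for tag in t.get("entities", {}).get("hashtags", []):
            yield tag["text"]

top = Counter(hashtag_texts(tweets)).most_common(10)
print(top)  # [('Up', 2), ('Turn', 1)]
```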

We encourage you to give HP Vertica Flex Zone a test drive and actively participate in our community site! Stay tuned for our next Flex Zone Friday blog post.

Data Governance and Chicago’s CampIT event

I recently had the privilege of speaking at a CampIT event in Chicago, a very well-attended event at the Stephens Convention Center near Chicago’s O’Hare airport. Analytics professionals gathered and shared ideas on technologies like Hadoop, big data analytics, columnar-store databases, and in-memory technologies, to name just a few of the topics.

Challenges of Modern Analytics

In my presentation, I covered some of the challenges in modern analytics. Perhaps the biggest technical challenge we’re facing is the ever-growing volume of data in our organizations. More data means that our legacy analytical solutions are slowing, making it harder and harder to deliver analytics at the right time to the right audience. Business users may lack the technical understanding of how this affects them; they only know that they can’t get answers and business intelligence as readily as they need to.

Another challenge is that IT professionals continue to be asked to do more with less funding. According to Gartner, IT spending increased only about 0.8% this year. IT is spending all of its funds on keeping the wheels on the bus spinning, leaving few funds available to innovate. Other budgets, like marketing and sales technology spend, are increasing, however. IT is still seen as a cost center in many organizations, while the business side is considered to be revenue-generating.

Data Governance Can Help

Data governance can help us tap into these business-focused budgets with a few important edicts:

    1. IT should form an alliance with business users
    Take a real interest in the challenges your business users face by inviting them for coffee or giving them an opportunity to air their grievances.

    2. IT should focus on important business aspects of the IT initiative
    If you ask your business users, the most important aspects of IT aren’t technical. The three most important business aspects of any initiative are revenue, efficiency, and compliance. IT should be trying hard to help the company make more money, make the way day-to-day business is done more efficient, and comply with state, local, federal, or industry regulations.

    3. The data governance team should initially pick projects that can provide quick return on investment and track benefits.
    Quick wins that are profitable to the corporation form an agile approach to data governance. Initiatives shouldn’t take months or years, but days or weeks. When users see the value that IT is bringing to the organization, they will want to work with you on solving their issues.

    4. Analytics is just one of the systems of opportunity to begin your data governance initiative.
    Providing fast analytics with Vertica’s help is just one system of opportunity to move your data governance initiative forward.

Tap into Business Budgets

By understanding your business users’ needs, providing a strong ROI, and talking about the business benefits of Vertica, you can sell the benefits of big data analytics to your organization. Again, it’s about revenue, efficiency, and compliance in your business. It speaks to revenue when you have execution windows to run analyses you never had before, and can find new ways to reach your customers. It speaks to efficiency when you increase speed, typically hundreds of times faster than the old way of doing analytics, and no longer worry about a long analysis taking up too much processing time. It speaks to compliance when you can deliver analysis that’s fast and accurate, analysis that you don’t have to check and re-check before you deliver it to a broader audience.

Data-Driven Decision Making with the Vertica Analytics Platform

Physicians need access to a wealth of critical information from multiple systems in order to make life-saving decisions on a daily basis. Greg Gootee, Product Manager, MZI Healthcare, discusses how their new application, powered by the Vertica analytics platform, helps deliver better patient care through data-driven decision making. Delivering information in a timely manner is central to their application’s success. Check out this video shot at HP Discover 2013 to see how HP Vertica helps Greg and his team provide physicians with the information they need to make more accurate point-of-care decisions. In the video, Mr. Gootee recounts how his aunt might have avoided a tragic incident with better point-of-care services, the type of services MZI Healthcare and the Vertica analytics platform provide.

A Sneak Peek of HP Vertica Pulse, Harnessing the Volume and Velocity of Social Media Data

The Web provides us with a myriad of ways to express opinion and interest—from social sites such as Twitter and Facebook to blogs and community forums to product reviews in ecommerce sites to many more. As a result, customers have significant influence in shaping the perceptions of brands and products. The challenge for the managers of those entities on which opinion and interest is expressed is to understand, in an automated way and in as close to real-time as possible, what people are talking about and how they feel about those topics, so that they can better understand and respond to their community.

HP Vertica Pulse — now in private beta — is HP’s scalable, in-database answer to the problem of harnessing the volume and velocity of social media data. Executed through a single line of SQL, HP Vertica Pulse enables you to extract “attributes,” the aspects of a brand, product, service, or event that your users and customers are talking about, and to assign a sentiment score to each of these attributes, so that you can track your community’s perception of the aspects of your business it cares about. Understand whether your customers are looking for a particular feature, whether they react to a facet of your product or service as you anticipated, or whether they are suddenly encountering problems. See how their perceptions change over time.

We used HP Vertica Pulse at HP Discover to capture attendee sentiment. We captured tweets related to HP Discover, the Tweeter’s screen name, and the timestamp. We ran the tweets through HP Vertica Pulse and visualized the results in Tableau. The whole effort was up and running in just a few hours. The screenshot below shows the major aspects:
  • We watched the interest of the crowd change over time. With each successive keynote, we saw new initiatives and people appear in the word cloud. Meg Whitman received a lot of press, as did HP’s New Style of IT and HAVEn. So did Kevin Bacon, who participated in Meg’s keynote.
  • HP Vertica Pulse surfaced news in the data analytics world. In a trial run using tweets related to data analytics, we saw “Walmart” — not a common name in the world of analytics — appear in the word cloud. A quick drilldown in Tableau revealed that Walmart recently purchased data analytics company Inkiru.
  • We captured the most prolific Tweeters. We could expand on this data to include influencer scores and reach out to the most influential posters.
  • We captured sentiment on all of the tweets. In a friendly forum like HP Discover, we expect the majority of the tweets to be neutral or positive in nature.

HP Vertica Pulse is the result of an ongoing collaboration with HP Labs and is built on Labs’ Live Customer Intelligence (LCI) technology. The Labs team has already had great success with LCI, as evidenced in part by their Awards Meter application. HP Vertica has also built a social media connector that loads tweets of interest directly from Twitter into the HP Vertica Analytics Platform, allowing you to start understanding your community right away.

HP Vertica Pulse is yet another example of how you can use the HP Vertica Analytics Platform to bring analytics to the data, speeding analysis time and saving the effort of transferring your data to an external system. Because HP Vertica Pulse is in-database, you can store your text data and the associated sentiment alongside sales, demographic, and other business data. HP Vertica Pulse will help you to harness the voice of your customer so that you can better serve them.

We are now accepting applications to trial Pulse in our private beta. To participate, contact me, Geeta Aggarwal, at gaggarwal@vertica.com.
