Vertica

Archive for the ‘vertica’ Category

How To Make A Pumpkin Look Like Somebody

(No Artistic Talent Required!)

It’s Halloween! This holiday is second only to Christmas in terms of the amount spent per year on decorations (in the USA, anyway). But for some of us, there is more to it than cheap, mass-produced rubber spiders, talking skeletons, and so on. It’s yet another excuse to be creative, in a geeky way.

So let’s carve pumpkins! When I was a wee lad, my brother and I took black markers and made an outline of what we wanted on our pumpkins, then dear old dad would cut them with a sharp knife (which we, of course, weren’t allowed to play with). Think triangular eyes, a simple nose, and a snaggletoothed mouth. Now that I am older, I find this is way too efficient, and much more time can be frittered away with this exercise…

Basic Carving Outline

(Apologies in advance for saying so many things you already know… or just skip this section.)

  1. Get pumpkins. You can grow them, get them at the store, or visit the kitschy farm down the road. Look for ones you can easily spread your hand across; they are the right size for single-page (8 ½ x 11) paper patterns.
  2. Get a paper pattern. They come in books, but more sophisticated ones can be found online (carvingpumpkins.com is a favorite of mine), and printed out (laser printouts are preferred, as they are nearly impervious to pumpkin juice).
  3. Tape the pattern to the pumpkin (This means you don’t actually need artistic talent. Just trace it!). This is easier if you have cut some notches in the paper so it bends around the pumpkin. I say use lots of tape. Cover every bit of the paper with tape. That way, if it gets pumpkin juice on it, it won’t fall apart.
  4. Cut the top off. Angle it, so the top doesn’t fall in. (If you cut it straight up and down, this will happen quickly as the pumpkin ages.) Alternatively, some experts prefer cutting the bottom out of the pumpkin instead of the top. This may make the pumpkin last longer, especially if it is out in the weather. But then you may need a chimney. Either way, I leave a notch so the original orientation of the lid can be quickly reestablished.
  5. Scrape the guts out. Scrape the part where the pattern is applied extra hard, if you are going with a three-level pattern (explained next), so the light shines through nicely. Keep some seeds to bake and eat, if you are into that (I am not).
  6. Cut the pattern. Unless you are being really fancy, this can be done in three levels:
    • Skin left on. This is of course the darkest.
    • Skin peeled off. Much more light shines through.
    • Cut all the way through. This is the lightest.

There are many tools for the job. For cutting through, knives really can’t get a high level of detail compared to the special-purpose pumpkin saws they sell these days. (Cut as perpendicular to the surface of the pumpkin as possible so the piece is easily extracted. If the piece doesn’t pop out easily, cut it into bits.)

For scraping the skin, I haven’t found anything better than a pocket knife. Just cut the edge around the area (this makes nice clean lines), then if the area is small/thin pick it out with the knife point, or if it is large, cut it into pieces to pick out. (Cutting up and down the grain of the pumpkin is easiest, if it is convenient given the shape of the area to scrape.) They also sell tools with little loops on the ends as part of store-bought kits, but I prefer to live dangerously and use my trusty knife.

The order in which the areas are cut out has a profound effect on how hard it is to execute the design without breaking anything. This is hard to pin down in words, but just as you don’t want to paint yourself into a corner, you don’t want to be cutting or scraping anything that has very little structural support left. Starting with the smallest details is never a bad idea.

  1. Take the pattern and tape off.
  2. Cut air holes in the back if the pattern doesn’t involve many areas that are cut through.
  3. Put a candle in and light it. Pumpkins are >90% water, so putting them on your cement steps with real fiery candles inside isn’t all that dangerous.
  4. Take pictures of your handiwork! (See the last section.)

Additional Steps

  • Invite some friends over, and have your sister-in-law with the Johnson and Wales degree make the snacks.
  • Add beer and/or wine, to taste.
  • There are also tailor-made power saws (reciprocating, not circular) available, and these are quite helpful for designs with lots of long cuts. Some people also use power tools to scrape the skin, such as a Dremel tool. This works, but I advise against combining power tools with the first item above, and especially with the second.

How To Make Your Own Patterns

Making your own patterns is a great way to get geek technology involved in what otherwise would be a nice, earthy-crunchy hobby. This is not that hard, but you will impress people who don’t think of these things.

Get a picture using a digital camera, Google image search, or whatever. This could be easy, or not. But you won’t know for sure until later steps. Here’s what I used this year:

Crop the picture. Convert it to black and white. (I used to use Photoshop when I had an employer that could justify the expense of having a legit copy. Now I use the GIMP.) If you look closely at the hair, you can see that I touched it up a little, in preparation for the next step…

Convert the image to three levels. I use black for skin, gray for scraped, and white for cut through. This gives an impression of the final result, but generally uses more precious toner than doing it the other way ’round. This year I just adjusted the color curve in GIMP, but I am sure I have used other means in the past.

This should result in a 3-level image:

There are a few things to note here. Obviously it is tricky to pick the levels to get something that is true to the original image. However, you also have to be aware that gravity will claim anything that is completely surrounded by a cut-out (white area). You can (in order of preference) just erase these (in the image above they are tiny), keep fussing with the levels until there aren’t any, add supports to them, or go back to step 1 (pick a new image).

Cut the pumpkin. It may look like complete crap in the daylight:

Fear not! In the dark, things look better than you’d think, given the number of mistakes you may or may not have made in the cutting process:

Get addicted, and do more pumpkins next year. Here are a few samples from our parties. (See if you can spot the shameless plugs for our company.)

Taking Good Pictures

So even if your pumpkin doesn’t look that good, you may be able to salvage it by creative use of your digital camera.

  • Use full manual mode. That way you can adjust the picture by trial and error, and keep the most appealing one.
  • Unless you can hold still for 15 seconds, a tripod is a must.
  • Stop the camera down. (This may not apply to your camera, but the one I got for a mere $150 six long years ago works best stopped down, as it blurs less. I realize this is counter-intuitive, since there isn’t much need for depth of field and the light is low. But that’s what I do.)
  • The same goes for film speed. Use the slowest one, as you get the least noise in the dark areas, even though this is counter-intuitive in low-light settings.
  • Then adjust the exposure time to make it look good. Take a few. Use 2 seconds, 4, 6, 8, 10, 15, and then keep whatever looks best. Usually, it takes a lot of work to get a digital photo to look almost as good as real life, but with pumpkins, it is pretty easy to make things look even better than reality: just pick the right exposure.
  • The first photo shows approximately what the pumpkin looks like in real life (you will have to trust me on this). The second shows the exposure I liked the best, which soaked for a bit longer.

Acknowledgments

I’d like to thank my wife for starting this tradition, and cleaning the house before every party. And all our friends, for their contributions to our growing gallery.

MySQL Ate My Homework: Five Reasons You Should Always Use a Subpar Data Platform


subpar

adjective

falling short of a standard <the service at the restaurant was subpar, to say the least>

Synonyms: bush, bush-league, crummy (also crumby), deficient, dissatisfactory, ill, inferior, lame, lousy, off, paltry, poor, punk, sour, suboptimal, subpar, substandard, unacceptable, unsatisfactory, wack [slang], wanting, wretched, wrong

Related Words: abysmal, atrocious, awful, brutal, damnable, deplorable, detestable, disastrous, dreadful, execrable, gnarly [slang], horrendous, horrible, pathetic, stinky, sucky [slang], terrible, unspeakable; defective, faulty, flawed; egregious, flagrant, gross; bum, cheesy, coarse, common, crappy [slang], cut-rate, junky, lesser, low-grade, low-rent, mediocre, miserable, reprehensible, rotten, rubbishy, second-rate, shoddy, sleazy, trashy; abominable, odious, vile; useless, valueless, worthless; inadequate, insufficient, lacking, meager (or meagre), mean, miserly, scanty, shabby, short, skimp, skimpy, spare, stingy; miscreant, scurrilous, villainous; counterfeit, fake, phony (also phoney), sham

In the big data technology industry, we spend most of our time writing blogs and whitepapers about our technology.  I’m sure you’ve heard this before…”Our technology is great…it’s the best…most functional…top-notch” and so forth.  But we never really discuss when someone might want to use less effective technology – systems that may be more raw, or less suited to the task, or that have no vendor behind them.  Sure, these systems can break easily or might not do everything you want, but some of these technologies have tens of thousands of users around the world. So, they must be valid choices, right?

So, when should less effective technology be used?  Based on many years in the IT trenches, here is my countdown of the top five reasons you should use a subpar big data platform.

Caution: sarcasm ahead with a mostly serious ending which actually makes a point

Reason Number 5: Not invented here, dude

Who wants to be boring and pick existing technology that is solid… and works?  By rolling your own, you get serious technical chops.  What’s that knocking sound? That’s O’Reilly Media at your door…they want you to write a book!  Seriously, reinvention is under-rated.  Sure, relational databases have been around for forty-plus years, but reinventing transaction semantics or indexing would be seriously cool!  Give it a funny name and pick a cute animal for the logo and…voila!  Tech cred!

Furthermore, using off-the-shelf technology tends to create a situation some IT shops dread: transparency. What? The executives understand the technology we’re using well enough to monitor progress with it?  Time to throw it out and build something arcane from scratch to control what the execs see!

Reason Number 4: It’s free



When I was seven, one of my dad’s friends came by to visit around the holidays.  He gave me a kitten.  My dad got seriously steamed, and my mom looked like somebody had just sneezed in her soup.  But the kitten was free, right?  Three illnesses, a few injuries, and one or two thousand dollars later, and coupled with a year or so cleaning a litterbox, I realized that the kitten was not – in fact – free.

But we’re talking about software here.  Isn’t that different?  Free means you don’t need to deal with a sales guy and some engineer who’ll help you set things up in an hour.  You just go through a few websites and download four RPMs, the Java SDK, a Java JRE, five or six utilities, upgrade your OS, downgrade your OS, grab some runtime libraries for Linux, the Eclipse IDE, a downgraded version of the Eclipse IDE that’s required by the plug-in you’re about to download, and an Eclipse plug-in which kinda does most of what you need and…voila!  You can run the “hello world” example.  So free must be good, right?  Now, fire up “Getting Better” by the Beatles on your iPod and get to work!

Reason Number 3: You’ve got all the time in the world


Yeah, the business folks are in a panic about losing market share, and the CIO is a little bent out of shape about the fact that the IT budget has been going up at 15% every year, but what’s the big rush?  After all, the prospectus for that O’Reilly book needs to be seriously heavy stuff to have a chance of getting anywhere.  So dig into the technology!  Science projects can be fun when you’re doing science.  Hey, do those hardware guys really think that putting data on the disk tracks closer to the spindle will improve read times by 0.01%??  That sounds fun to test!  We can write a hack in HDFS for that!  Of course, the only way we can tell is on a cluster that has at least a thousand nodes.  The good news is that with modern cloud technologies, it’ll take only six months and ten people to test it!  The business can wait a little longer.

Reason Number 2: It’s cool


Does anything really need to be said here?  Cool + Not Invented Here = Happy Technologists = Productivity, right?

And (drum roll please)…

Reason Number 1: You like risk


Do you fly by those ancient thirty-year-olds on your kitesurfing rig wondering why they still use something as yesterday as a windsurfer?  Is base jumping from old-school spots like the KL Tower yesterday’s news for you?  Well, risk on!  In the stock market, risk=volatility=upside, right?  And the worst that can happen is the dollar value of your investment hits zero.  Why should it be any different with technology?  If you’re not base jumping from that erupting volcano, you’re not alive.  So bring together the adrenaline rush and the upside potential of adopting something which looks like it isn’t ready so that, in the event it ever gets to be what you need, you’re ahead of the curve!


Summing up, Seriously

While this piece – so far – has been very sarcastic, there’s a nugget of truth hidden within.  Businesses globally choose subpar technology every day believing that it will solve their problems.  And while they rarely select such technologies based on my sarcastic “top five” list above, they often select these technologies with the mistaken belief that they’re cheaper/better/faster, etc.

Today businesses don’t have to select subpar technologies for big data analytics.  Two years ago, Vertica opted to release the Vertica Community Edition.  This release of Vertica offers the full functionality of the product on up to one terabyte of raw data and a three node cluster.  Furthermore, it now includes the full functionality of Vertica’s sentiment scoring engine (Pulse), Vertica’s geospatial add-in (Place), and Vertica’s auto-schematizer for things like JSON data (FlexZone). I tried to talk the Vertica team out of offering so much for free!  But the team wants to share this with the world so organizations no longer have to settle for a subpar data platform.  It’s hard to argue with that!

So, if you want to try Vertica CE today, click here.

In my twenty-plus years of working with databases, I’ve installed and worked with just about every commercially available database under the sun, including Vertica.  And out of all of them, Vertica has been the easiest to stand up, the most powerful, and the highest quality.  Try it.  Seriously.

Don’t go for the subpar stuff, because you don’t have to.

DbVisualizer Free for HP Vertica


As a special service to Vertica users and the Big Data community, we have released a new version of DbVisualizer, DbVisualizer Free for Vertica, which is now available for free on the HP Vertica Marketplace. This version of DbVisualizer is customized specifically for Vertica: it supports Vertica-specific functions, like viewing projections, and includes some features usually reserved for the Pro version, such as multiple execution options. It is now available for download here.

Two new case studies – Jobrapido & Supercell.

We’ve published two more case studies, featuring Jobrapido and Supercell. These are compelling examples of innovative companies that use the HP Vertica Analytics Platform to gain a competitive edge and derive maximum business value from Big Data. The two summaries and respective full case study PDFs provide details about each company’s goals, successes, and ultimate outcomes using HP Vertica. To see more like these, visit the HP Vertica Case Studies page.


Jobrapido scales its database to the next level

Since its founding in 2006, Jobrapido has become one of the biggest online job search aggregators in the world, helping millions of users everywhere from Italy to the United States find the job that’s right for them. In 2012, they were acquired by Evenbase, a part of DMG media based in the UK. HP Vertica has proved invaluable to their success, performing above and beyond for their big data analytics needs. David Conforti, Director of BI at Jobrapido, describes HP Vertica as “like having a sort of magic mirror to ask to all the business questions that come to my mind,” one that has allowed him and his team to deliver their users both valuable insight and results, and a unique personal experience based on their analytics.

Download the full case study (.pdf)


Supercell performs real-time analytics

In 2012, just a year after its founding in 2011, Supercell delivered two top-grossing games on iOS: “Clash of Clans” and “Hay Day.” Using the HP Vertica big data analytics platform, Supercell has been able to engage in real-time gaming data analytics, allowing it to balance, adapt, and improve its gamers’ experiences on a day-to-day basis. “HP Vertica is an important tool in making sure that our games provide the best possible experience for our players,” says Janne Peltola, a data scientist at Supercell. Using HP Vertica, Supercell is able to create gaming experiences that are fun and engaging, and that keep customers coming back long after they have started playing.

Download the full case study (.pdf)

Welcome to the Flex Zone – Dynamic Schemas

Enter the Flex Zone

Much of the data analytics we perform occurs with data whose schema changes over time. Many organizations have developers who define the record types that are collected and analysts who comb through these records looking for important patterns. In today’s fast-changing business landscape, agility is the key to success: developers need to continually define new metrics for their analysts’ consumption. Typically, relational databases have been a significant impediment: schema changes are costly, and database data-ingest logic is brittle. Instead of record changes being at the discretion of the developer, they involve careful coordination with database administrators against a backdrop of tweaking production systems. The flexible table feature in Vertica 7.0 “Crane” addresses this challenge by allowing the table schema to vary from record to record. No database administrator involvement is necessary for day-to-day schema changes; the DBA need not get involved unless performance tuning is necessary.

A Web-driven database

For our example, we will consider a simple web service that serves a form and stores the results. While simplistic, this isn’t far off from serious web services, mobile games, smart meters, etc. There are two source files for this example, a web server flex-server.py and an HTML file form.html (store them in the same directory).

Getting Started

First, we need to create a table to store the form data. With Vertica Flex tables, we don’t even need to know what the structure of the data is upfront:

CREATE FLEX TABLE webreqs();

The web server, implemented in Python, is quite simple: it serves a form and stores the responses as JSON records.

import BaseHTTPServer
import datetime
import json
import urlparse

class FlexServer(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse.urlparse(self.path)
        # form submission?
        if url.path == '/data':
            v.send(json.dumps({"time": str(datetime.datetime.now()),
                               "src": self.client_address[0],
                               # "header": dict(self.headers.items()),  # uncomment to get header info
                               "data": urlparse.parse_qs(url.query)}) + '\n')
            self.send_response(302)
            self.send_header('Location', urlparse.urljoin(self.path, '/'))
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header('Content-type', 'text/html')
            self.end_headers()
            with open("form.html", "r") as src:
                self.wfile.write(src.read())
        return

v = SendToVertica()
from BaseHTTPServer import HTTPServer
server = HTTPServer(('localhost', 8080), FlexServer)
server.serve_forever()

For brevity, we’ve left off the SendToVertica class, which streams the data into Vertica – it’s also quite simple and the full text is in flex-server.py.

The HTML file contains the form – it’s broken out into a separate file to make editing it easier. The initial form merely asks for the submitter’s name.

<html><head><title>Enter Info</title><body>
<form action="data">
Name:<input type="text" name="name">
<br><input type="submit" value="Submit">
</form></body></html>

Start the server (python flex-server.py) and point your browser at localhost:8080 – you should see something like:

[Screenshot: the name-entry form served by flex-server.py]

Submit the form a couple of times. Then look at the server output – it might look something like this:

{"src": "127.0.0.1", "data": {"name": ["Ben"]}, "time": "2013-12-24 01:40:35.475723"}
{"src": "127.0.0.1", "data": {"name": ["Amy"]}, "time": "2013-12-24 01:40:39.293608"}

Vertica makes it easy to load this data into Flex tables – no table schema definition necessary and there’s already a parser for JSON data. The flex-server is running the following statement in order to load the data:

copy webreqs from stdin parser fjsonparser(flatten_arrays=true);

The flatten_arrays parameter may be new to you – we need it here because HTML form attributes can be many valued, so there’s an extra level of array we need to flatten out of the input.
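If you want to see the flattening in action, here is a quick illustration (the record and field names below are made up purely for demonstration): if you loaded one extra JSON record by hand whose field held a two-element array, each array element would show up as its own numbered key.

copy webreqs from stdin parser fjsonparser(flatten_arrays=true);
{"data": {"tags": ["a", "b"]}}
\.

-- each array element becomes a separately numbered key
select "data.tags.0", "data.tags.1" from webreqs where "data.tags.0" is not null;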

Now we can write a query against the data as if the JSON features were columns in the table:

select "data.name.0" from webreqs;
 data.name.0
-------------
 Amy
 Ben
(2 rows)

select compute_flextable_keys_and_build_view('webreqs');
           compute_flextable_keys_and_build_view
------------------------------------------------------------
 Please see public.webreqs_keys for updated keys
 The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time
-------------+-----------+----------------------------
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608
(2 rows)

It’s Flex Time

But now for the main event: so far, loading the data was easy – but what if the developer decides the form should also have an additional vote field? What needs to change? Well, let’s change the HTML file to include the new form field:

<html><head><title>Enter Info</title><body>
<form action="data">
Name:<input type="text" name="name">
<br>Vote:<select type="select" name="vote">
<option>yes</option><option>no</option></select>
<br><input type="submit" value="Submit">
</form></body></html>

Submit the form a couple more times and look at the output – now there’s an additional key in the JSON!

{"src": "127.0.0.1", "data": {"vote": ["yes"], "name": ["Jen"]}, "time": "2013-12-24 01:40:57.785534"}
{"src": "127.0.0.1", "data": {"vote": ["no"], "name": ["Ian"]}, "time": "2013-12-24 01:41:01.867649"}

What database changes are necessary? None! The new fields are immediately available to query:

select sum(case when "data.vote.0" = 'yes' then 1 else 0 end) as yeses, count("data.vote.0") as total from webreqs;
 yeses | total
-------+-------
     1 |     2
(1 row)

But what about the entries submitted before the “vote” field existed?

select "data.name.0", "data.vote.0" from webreqs;
 data.name.0 | data.vote.0
-------------+-------------
 Jen         | yes
 Ben         |
 Amy         |
 Ian         | no
(4 rows)

For fields that don’t exist in a particular record, you get SQL NULL. SQL operators do the “right thing” when they encounter NULLs, as the count and sum above demonstrate.
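To see that explicitly, compare count(*) with count on the vote column (assuming the four submissions above, two of which predate the vote field):

-- count(*) counts every row; count("data.vote.0") skips the null votes
select count(*) as all_requests, count("data.vote.0") as requests_with_votes from webreqs;
 all_requests | requests_with_votes
--------------+---------------------
            4 |                   2
(1 row)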

If you believe the schema has changed, you can update your keys and views to reflect the new reality by re-running the schematization tools:

select compute_flextable_keys_and_build_view('webreqs');
           compute_flextable_keys_and_build_view
------------------------------------------------------------
 Please see public.webreqs_keys for updated keys
 The view public.webreqs_view is ready for querying
(1 row)

select * from webreqs_view;
 data.name.0 |    src    |            time            | data.vote.0
-------------+-----------+----------------------------+-------------
 Jen         | 127.0.0.1 | 2013-12-24 01:40:57.785534 | yes
 Ben         | 127.0.0.1 | 2013-12-24 01:40:35.475723 |
 Amy         | 127.0.0.1 | 2013-12-24 01:40:39.293608 |
 Ian         | 127.0.0.1 | 2013-12-24 01:41:01.867649 | no
(4 rows)

Note that you are not required to recompute the view in order to access the new fields; you only need to recompute it if you want to access the new fields through the view.

Try it yourself – edit the HTML form to add new fields and see how easy it is to query the resulting data in Vertica! You can also uncomment the header line in the server to get access to the header information (think cookies, sessionization, etc.).

Tuning for Performance

While flexible tables are great for exploration, we recommend that you do some tuning before moving into production. Luckily, tuning is easy and still preserves the flexible nature of the tables. The central mechanism is “promotion” of one or more columns to real table columns. To promote, select commonly used columns, particularly those used in joins or predicates, and add them to your flex table. You can do this using ALTER TABLE … ADD COLUMN … DEFAULT. For example, the time and vote columns might be of particular interest:

ALTER TABLE webreqs ADD COLUMN "time" TIMESTAMP NOT NULL DEFAULT time::timestamp;
ALTER TABLE webreqs ADD COLUMN "data.vote.0" VARCHAR(10) DEFAULT "data.vote.0"::varchar;

As part of adding the column, Vertica populates the column with any same-name data. Future loads will also load data into the newly added columns. Note that we have selected specific data types for the new columns. New data that doesn’t match the declared data type will result in nulls getting stored (or errors if the column is declared NOT NULL).
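If you want to check whether any rows fell into that category, one approach (just a sketch, relying on the flex table’s __raw__ column and the MAPLOOKUP function) is to look for rows where the raw key is present but the typed column came out null:

-- rows where the raw JSON carried a vote value that did not fit VARCHAR(10),
-- so the typed column stored null
select count(*) from webreqs
 where "data.vote.0" is null
   and maplookup(__raw__, 'data.vote.0') is not null;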

Don’t want to run ADD COLUMN yourself? No problem, because there’s a helper function to do this step for you:

select materialize_flextable_columns('webreqs');

The following columns were added to the table public.webreqs:
data.name.0
src
time
data.vote.0
For more details, run the following query:
SELECT * FROM v_catalog.materialize_flextable_columns_results WHERE table_schema = 'public' and table_name = 'webreqs';

There’s nothing special about these newly added columns – you could have done the same thing upfront:

CREATE FLEX TABLE webreqs (
    "time" TIMESTAMP NOT NULL,
    "data.vote.0" VARCHAR(10)
);

So if you know a little about your data and how it will be used, you can jump directly to the optimized approach.

Vertica’s Database Designer exists to optimize the physical design (storage layout), and it works great on promoted columns in flexible tables. A typical usage model is: explore, promote, optimize. A key thing to remember is that schema changes do not, by themselves, require another round of promotion and optimization. If you add fields that dramatically change your query workload, though, some additional performance optimization may be necessary.

Flexible tables from Vertica should dramatically simplify the process of adapting your schema to the needs of your changing business.

Enter the Flex Zone – An Exploration in Processing Twitter Data

With our HP Vertica 7 release, we announced HP Vertica Flex Zone, a new product offering that simplifies the way that you consume and then explore semi-structured data, such as Web logs, sensor data, and other emerging data types. In this blog post, our first “Flex Zone Friday” post, let’s look at how you can use HP Vertica Flex Zone to get a leg up on your latest data analysis problem, using Twitter data as the sample data type.

Tweet child of mine

For this example, let’s look at Twitter data. Tweets are an interesting source because the data volume is large, the tweet format is typically JSON, and many companies are mining tweet data for insights about their business in how the community perceives their brand, product, or service. A tweet might look like the following:

{"id":257000038495043584,
 "text":"Let's play uno",
 "created_at":"Sat Oct 13 06:09:18 2012",
 "user":{"name":"scout",
         "lang":"en",
         "followers_count":521,...}
...}

It has the tweet text itself, with a lot of interesting metadata. Note that much of the metadata is stored in sub-structures like the “user” field.

You can acquire Twitter data quite easily by making requests to the Twitter API. For the purposes of following along here, you can download a small sample of tweets: tweets.json.gz

Look Ma: No Columns!

Typically, loading this type of data into a relational database would have required transforming the raw data into something more easily consumable – or a pre-built custom Twitter data loader. With HP Vertica Flex Zone, you can skip the schema definition phase and just load JSON directly into the database, saving you hours of time:

CREATE FLEX TABLE tweets ();
COPY tweets FROM '/path/to/tweets.json.gz' GZIP PARSER fjsonparser();

As easy as “SELECT … FROM T”
Now we can just write SQL queries against the data, as if it were naturally loaded into columns. For example, to discover what the average tweet length by language is:

select "user.lang", avg(length(text))::int from tweets group by "user.lang" order by 2 desc;

 user.lang | avg
-----------+-----
 de        | 119
 ar        | 108
 es        |  80
 en        |  67
 it        |  66
 tr        |  66
 nl        |  64
 id        |  61
 fr        |  60
 ru        |  57
 ja        |  44
 pt        |  43
 ko        |  40
 th        |  22

From the SQL point of view, it doesn’t matter that the source data was JSON. The only concession is the column names themselves: we need to double-quote the column names if they contain the ‘.’ character.

Let’s make a Schema

But how can you tell what columns are available? The HP Vertica Analytics Platform has a couple of mechanisms for discovering and making explicit the columns available.
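One of them computes just the key inventory without building a view; a minimal sketch (assuming the companion COMPUTE_FLEXTABLE_KEYS helper) looks like this, while the combined function shown next also builds a queryable view:

-- compute the key inventory only; the *_and_build_view variant below also creates a view
select compute_flextable_keys('tweets');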

select compute_flextable_keys_and_build_view('tweets');
           compute_flextable_keys_and_build_view
------------------------------------------------------------
 Please see public.tweets_keys for updated keys
 The view public.tweets_view is ready for querying
(1 row)

The tweets_keys table now contains a row for each available column in the data. For example, to find all the user-related count columns:

select key_name from tweets_keys where key_name::varchar ilike 'user%count';
       key_name
-----------------------
 user.statuses_count
 user.listed_count
 user.friends_count
 user.followers_count
 user.favourites_count
(5 rows)

Armed with these new columns, we can compute some statistics about the followers in our dataset:

select min("user.followers_count"::int), avg("user.followers_count")::int, max("user.followers_count"::int) from tweets;

 min | avg  |  max
-----+------+--------
   0 | 1968 | 253435
(1 row)

Clearly someone is tweeting to the void. And we probably have some observer bias in our small sample: there’s probably a correlation between tweeting frequency and follower count.
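That hunch is easy to check in SQL; for example (a quick sanity check on our small sample, assuming Vertica’s standard CORR aggregate):

-- correlation between how much a user tweets and how many followers they have
select corr("user.statuses_count"::int, "user.followers_count"::int) from tweets;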
In addition to building the keys table, the function also whipped up a view that selects all available columns, which can be handy to see what is in the table. Tweets are quite large:

select * from tweets_view limit 1;
contributors | created_at | entities.hashtags | entities.urls | entities.user_mentions | favorited | id | id_str | in_reply_to_screen_name | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | retweet_count | retweeted | source | text | truncated | user.contributors_enabled | user.created_at | user.default_profile | user.default_profile_image | user.description | user.favourites_count | user.follow_request_sent | user.followers_count | user.following | user.friends_count | user.geo_enabled | user.id | user.id_str | user.is_translator | user.lang | user.listed_count | user.location | user.name | user.notifications | user.profile_background_color | user.profile_background_image_url | user.profile_background_image_url_https | user.profile_background_tile | user.profile_image_url | user.profile_image_url_https | user.profile_link_color | user.profile_sidebar_border_color | user.profile_sidebar_fill_color | user.profile_text_color | user.profile_use_background_image | user.protected | user.screen_name | user.statuses_count | user.time_zone | user.url | user.utc_offset | user.verified | coordinates | geo | place | user.profile_banner_url | delete.status.id | delete.status.id_str | delete.status.user_id | delete.status.user_id_str | retweeted_status.contributors | retweeted_status.created_at | retweeted_status.entities.hashtags | retweeted_status.entities.urls

A key differentiator of HP Vertica Flex Zone is that you can visualize this semi-structured data during data exploration with any of your preferred BI and visualization tools. These tools work well against the views, as the views expose the metadata about the “flexible table” in a manner that the tools already understand. For example, we used Tableau to generate a report against the tweets_view:

And now for something completely different

It turns out that the tweet stream includes two types of records: tweets and deletes. Deletes are revocations of tweets and they have the following structure:

{
 "delete":{
  "status":{
   "user_id_str":"228092142",
   "id_str":"218710262428073984",
   "id":218710262428073984,
   "user_id":228092142
  }
 }
}

Our previous load actually loaded both record types into the same table. How did the previous queries work? Well, if you reference a column that doesn’t exist in a record, you get null. We can use this to find out how many deletes we loaded:

select count(*) from tweets where "delete.status.id" is not null;
 count
-------
   118
(1 row)

Lots yet to come

We haven’t talked about a number of key things like column promotion (performance!), dynamic schema, or substructure. Here’s a teaser about substructure — namely the JSON input is quite complex and the HP Vertica Analytics Platform loaded it all. You can descend into the nested structures and get the most out of the data. The following is a query that computes the top hashtags in the last month:

select values, count(*) from
    (select mapitems("entities.hashtags") over ()
     from tweets
     where created_at::timestamp > now() - interval '1 month') a
where keys::varchar like '%text'
group by values
order by count desc
limit 10;

              values              | count
----------------------------------+-------
 10ThingsThatGetsOnMyNerves       |     2
 Massa                            |     1
 ความในใจของติ่ง                   |     1
 Up                               |     1
 RTしてくれた人全員フォローする    |     1
 Turn                             |     1
 50ThingsAboutMyBoyfriend         |     1
 askenchancer                     |     1
 always                           |     1
 kaburrr                          |     1

Note: If you got recent data from Twitter, this will work without modification – if you used our data set, use interval '3 years' (or more).

We encourage you to give HP Vertica Flex Zone a test drive and actively participate in our community site! Stay tuned for our next Flex Zone Friday blog post.

Data Governance and Chicago’s CampIT event

I recently had the privilege of speaking at a CampIT event in Chicago, a very well-attended event at the Stephens Convention Center near Chicago’s O’Hare airport. Analytics professionals gathered and shared ideas on technologies like Hadoop, big data analytics, columnar store databases, and in-memory technologies – to name just a few of the topics.

Challenges of Modern Analytics

In my presentation, I covered some of the challenges in modern analytics. Perhaps the biggest technical challenge we’re facing is the ever-growing volume of data in our organizations. More data means that our legacy analytical solutions are slowing down, making it harder and harder to deliver analytics at the right time for the right audience. Business users may lack the technical understanding of how this affects them. They only know that they can’t get answers and business intelligence as readily as they need to.

Another challenge is that IT professionals continue to be asked to do more with less funding. According to Gartner, IT spending increased only about 0.8% this year. IT is spending all of its funds on keeping the wheels on the bus spinning, leaving few funds available to innovate. Other budgets, like marketing and sales technology spend, are increasing, however. IT is still seen as a cost center in many organizations, while the business side is considered to be revenue-generating.

Data Governance Can Help

Data governance can help us tap into the business-focused budgets with a few important edicts:

    1. IT should form an alliance with business users
    Take a real interest in some of the challenges that your business users have by inviting them for coffee or giving them an opportunity to vent about those challenges.

    2. IT should focus on important business aspects of the IT initiative
    If you ask your business users, the most important aspects of IT aren’t technical. The three most important business aspects of any initiative are revenue, efficiency, and compliance. IT should be trying hard to help the company make more money, be more efficient in the way that day-to-day business is done, and comply with state, local, federal, or industry regulations.

    3. The data governance team should initially pick projects that can provide quick return on investment and track benefits.
    Quick wins that are profitable to the corporation form an agile approach to data governance. Initiatives shouldn’t take months or years, but days or weeks. When users see the value that IT is bringing to the organization, they will want to work with you on solving their issues.

    4. Analytics is just one of the systems of opportunity to begin your data governance initiative.
    Providing fast analytics with Vertica’s help is just one system of opportunity to move your data governance initiative forward.

Tap into Business Budgets

By understanding your business users’ needs, providing a strong ROI, and talking about the business benefits of Vertica, you can sell the benefits of big data analytics to your organization. Again, it’s about revenue, efficiency, and compliance in your business. It speaks to revenue when you have execution windows to run analysis that you never had before, and you can now find new ways to reach your customers. It speaks to efficiency when you increase speed, typically hundreds of times faster than the old way of doing analytics, and avoid worries about a long analysis taking up too much processing time. It speaks to compliance when you can deliver analysis that’s fast and accurate, and that you don’t have to check and re-check before you deliver it to a broader audience.
