Batch Analyzing Data as It Is Loaded

If you are constantly loading data that needs to be analyzed with Pulse, run the SentimentAnalysis() function in batches on the newly loaded data. You can store the sentiment scores in a separate table and associate the rows in the scored table with the rows in the original table by joining on their IDs. Running SentimentAnalysis() as the data is loaded and storing the results is more efficient than calling it during interactive sessions, because SentimentAnalysis() can take a few seconds to return results.
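
For example, here is the general shape of the pattern, using hypothetical table names (my_table and my_scores are placeholders, and the optional SentimentAnalysis() parameters are omitted for brevity; the steps below build a concrete version for Twitter data):

    -- Scoring inline at query time: pays the SentimentAnalysis() cost on every run
    select SentimentAnalysis(text) over (partition by id, text) from my_table;

    -- Reading precomputed scores: an ordinary join against the stored results
    select t.id, t.text, s.*
    from my_table t
    join my_scores s on s.id = t.id;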

For example, suppose that you are using the Social Media Connector (available in the ETL and Data Ingest section of the Vertica Marketplace) to retrieve tweets and load them into Vertica. In this case, you can create shell scripts and a cron job to automatically run SentimentAnalysis() on the text of the tweets. You can then store the resulting scores in a table for quick retrieval later.

Complete the following steps as the dbadmin user to run SentimentAnalysis() on your Twitter data. These steps also set up the system to run SentimentAnalysis() on new Twitter data every two minutes.

  1. Create a table to hold the tweets (for example, named tweets) with the following structure:

    create table tweets(
    	id int,
    	created_at timestamptz, 
    	"user.name" varchar(144), 
    	"user.screen_name" varchar(144), 
    	text varchar(500), 
    	"retweeted_status.retweet_count" int, 
    	"retweeted_status.id" int,
    	"retweeted_status.favorite_count" int,
    	"user.location" varchar(144), 
    	"coordinates.coordinates.0" float, 
    	"coordinates.coordinates.1" float, 
    	lang varchar(5)
    ); 

    The columns are based on the data returned by the Twitter streaming API. The fields are defined in the Twitter Field Guide at https://dev.twitter.com/docs/platform-objects/tweets.

    Note that the columns with quoted names, such as "user.name" and "user.screen_name", are sub-fields within a larger field. For example, the user field is described at https://dev.twitter.com/docs/platform-objects/users.

    At a minimum, you must have columns for id, text, and "user.screen_name".
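
    If you want to exercise the table before wiring up the connector, you can insert a hand-made row. The values below are made up for illustration; real rows come from the Social Media Connector:

    -- hypothetical test row; delete it before the initial scoring in step 2
    insert into tweets (id, created_at, "user.screen_name", text, lang)
    values (1, now(), 'example_user', 'Loving the new release!', 'en');
    commit;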

  2. Create a table to hold the sentiment scores (for example, named tweet_sentiment). Then load it with the scores from your existing tweets. Make sure no new tweets are loaded until this step completes.

    Replace the column names in the following example with the column names from your tweets table. The example uses the column names used by the Social Media Connector:

    create table tweet_sentiment as (
    select id, "user.screen_name",
    SentimentAnalysis(text using parameters filterlinks=true,
    filterusermentions=true) over (partition by id, "user.screen_name", text)
    from tweets
    where lang='en'
    order by attribute);
    -- The following table records the timestamp of the most recently analyzed tweet
    create table dt_start as (select max(created_at) dt from tweets);

    Note: If you have a large number of tweets, this command can take a long time to run. However, it is important to score your existing data before you start scoring newly loaded data.
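
    As a quick sanity check (not part of the original procedure), you can confirm that scores and the high-water mark were recorded. SentimentAnalysis() may emit more than one row per tweet, so the counts need not match exactly:

    -- number of distinct tweets that received scores
    select count(distinct id) from tweet_sentiment;
    -- number of English-language tweets that were eligible for scoring
    select count(*) from tweets where lang='en';
    -- the timestamp the incremental script will start from
    select dt from dt_start;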

  3. Create a SQL script to update the tweet_sentiment table with data from newly loaded tweets. Save it in the home folder of the Vertica database admin user. For example, this path could be /home/dbadmin/tweet_update.sql.

    Replace the column names with the column names from your tweets table. The following example uses the column names used by the Vertica Social Media Connector:

    \i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
    drop table if exists dt_end;
    create table dt_end as (select max(created_at) dt from tweets);
    
    -- run sentiment
    insert into tweet_sentiment
    (select id, "user.screen_name",
    SentimentAnalysis(text using parameters filterlinks=true,
    filterusermentions=true) over (partition by id, "user.screen_name", text)
    from tweets
    where lang='en'
    and tweets.created_at > (select dt from dt_start)
    and tweets.created_at <= (select dt from dt_end)
    order by attribute);

    -- copy date end into new start date
    drop table if exists dt_start;
    create table dt_start as (select dt from dt_end);

    -- free up jvm resource pool memory used by this script
    select release_jvm_memory();
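
    Before scheduling the script, you may want to run it once by hand from an interactive vsql session to confirm that it works end to end. The \i meta-command, already used inside the script, executes a SQL file:

    \i /home/dbadmin/tweet_update.sql
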
  4. Create a shell script named tweet_update.sh that is run from a cron job. This shell script runs the tweet_update.sql script and logs the results to the file tweet_update.log. Save the tweet_update.sh script in the home folder of the Vertica database admin user. For example, this path could be /home/dbadmin/tweet_update.sh.

    Replace the dbadmin, password, and databasename values with the values for your system.

    /opt/vertica/bin/vsql -U dbadmin -w password -d databasename -f /home/dbadmin/tweet_update.sql > tweet_update.log

    After you have created the shell script tweet_update.sh, make the script executable by entering the following command: chmod +x tweet_update.sh.
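
    Note that the > redirect overwrites tweet_update.log on every run. If you would rather keep a history of runs, one optional variant (a sketch, not part of the original procedure) appends with >> and stamps each run:

    #!/bin/bash
    # Variant of tweet_update.sh: append each run's output instead of overwriting
    echo "=== run at $(date) ===" >> /home/dbadmin/tweet_update.log
    /opt/vertica/bin/vsql -U dbadmin -w password -d databasename \
        -f /home/dbadmin/tweet_update.sql >> /home/dbadmin/tweet_update.log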

  5. Create a cron job to run the script every two minutes. Use the command crontab -e to create the cron job. You can view all of your created cron jobs by using the command crontab -l.

    */2 * * * * /home/dbadmin/tweet_update.sh

The script runs every two minutes. Any new tweets loaded in that two-minute window are analyzed, and the results are added to the tweet_sentiment table. You can join the tweets and tweet_sentiment tables on their id columns, as in the example below.
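
For example, the following query joins the scored results back to the original tweets. The exact output columns of SentimentAnalysis() depend on your Pulse version, so check them first with the vsql command \d tweet_sentiment:

    -- join scored results back to the original tweets on id
    select t.id, t."user.screen_name", t.text, s.*
    from tweets t
    join tweet_sentiment s on s.id = t.id
    where t.lang = 'en';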