Using Pulse as an Aid in Competitive Analysis

This topic details how you can use Pulse to conduct basic competitive analysis for products or brands. Pulse makes basic competitive analysis simple through use of it's white list feature. By utilizing the white list feature, you can analyze the tweets that pertain only to the brands or products that you are evaluating.

For example, say you wanted to analyze the sentiment of major food brands to determine how the brands compared to each other and what words people associate (positively and negatively) about the brands. Your work flow to do this analysis with Twitter and Vertica Pulse could be as follows:

  1. Start collecting tweets based on the brands or products that you are following. For example, you can use the Social Media Connector (available on the Pulse marketplace) to collect tweets matching keywords.
  2. First, create a white_list that contains the same keywords as the tweets that you are collecting. The whitelist allows you to later group and filter tweets collected. For example:

    insert into pulse.white_list_en values ('productA');
    insert into pulse.white_list_en values ('productB');
    insert into pulse.white_list_en values ('productC');
    \i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
                    
  3. Batch Load Tweets, and be sure to specify whitelistonly=true and relatedwords=true in the sentimentAnalysis() function. This creates a table with the sentiment score for your white-listed attributes. Note that this should be done in batches for large data sets. For smaller data sets (depending on your hardware) you can try and analyze all the tweets at once. For example:

    create table tweet_sentiment as 
    (select id, "user.screen_name",
    SentimentAnalysis(text using parameters filterlinks=true,
    filterusermentions=true, relatedwords=true,
    filterretweets=true, whitelistonly=true)
    over (partition by id, "user.screen_name", text)
    from tweets where lang='en' order by attribute );
                    
  4. Verify that your tweet_sentiment table contains only your whitelist attributes. The following query should only return the brands/products that you have white listed. For example:

    => select distinct(attribute) from tweet_sentiment;
    
      attribute  
    -------------
     ProductA
     ProductB
     ProductC
    (3 rows)
                    
  5. You can get a basic idea of which product or brand is being talked about the most by seeing how many instances of each attribute appear in your data set:

    => select attribute, count(*) from tweet_sentiment group by (attribute) order by count(*) desc;
      attribute  | count 
    -------------+-------
     ProductA   |   701
     ProductB   |   192
     ProductC   |    52
    (3 rows)
                    

    You can see that ProductA is the most talked about product of three being analyzed over the time-frame that the tweets were collected.

  6. Determine the average sentiment scores of the tweets you have collected:

    => select attribute, avg(sentiment_score) as score from tweet_sentiment group by (attribute) order by score DESC;
    
      attribute  |        score        
    -------------+---------------------
     ProductC    |   0.192307692307692
     ProductB    | -0.0729166666666667
     ProductA    |  -0.122681883024251
    (3 rows)
                    

    From this basic analysis, you can see that ProductC has the most positive sentiment from the three brands being analyzed over the time period when the tweets were collected, and ProductA has the lowest sentiment.

  7. You can also determine which words or phrases are associated with each attribute in their positive and negative contexts. For example, to see the list of words that are most associated with positive sentiment for ProductC, you can look at the related words fields and add up the occurances of words associated with positive sentiment:

    => select count(*), related_word_1 from tweet_sentiment where attribute = 'ProductC' and sentiment_score > 0 group by related_word_1 order by count DESC;
    
     count | related_word_1 
    -------+----------------
        11 | delicious
         2 | love
         1 | best
         1 | bless
         1 | good
         1 | work
    (6 rows)
                

    You can also do the same for negative sentiment:

    => select count(*), related_word_1 from tweet_sentiment where attribute = 'ProductC' and sentiment_score < 0 group by related_word_1 order by count DESC;
    
     count | related_word_1 
    -------+----------------
         1 | working
         1 | dragging
         1 | bad
         1 | doomed
         1 | loud
         1 | stressful
         1 | damn
    (7 rows)
                
  8. Finally, Pulse makes it easy to see other attributes associated with your target attributes to help you better understand the context in which people are discussing the brands or products that you are analyzing.

    1. Create another sentiment table from your data, but this time omit the whitelistonly and relatedwords parameters:

      create table tweet2_sentiment as 
      (select id, "user.screen_name",
      SentimentAnalysis(text using parameters filterlinks=true,
      filterusermentions=true, filterretweets=true)
      over (partition by id, "user.screen_name", text)
      from tweets where lang='en' order by attribute ); 
                          
    2. Next, query the tweets that contain your target attribute and find all the other attributes associated with those tweets. Display a count of the top 5 attributes (not including the target attribute):

      => select count(attribute), attribute from  tweet2_sentiment where id in (select id from tweet_sentiment where attribute = 'ProductC') and attribute <> 'ProductC' group by (attribute) order by count(attribute) DESC limit 5;
       count |    attribute    
      -------+-----------------
          13 | bbq
          11 | state
          11 | sandwich
          11 | steak
           3 | ProductB
      (5 rows)
                      

As you can see, a few basic queries can tell you the general sentiment differences between multiple brands or products. You can also determine which words are contributing to the sentiment of each product/brand that you are analyzing and which other attributes people are talking about when they mention the brand or product(s) that you are analyzing.

You could further refine these queries by breaking out different geographic locations or time of day by joining the IDs of the tweet_sentiment table back to the main tweets table and filtering be location or time.