Determining Popular Topics
The next examples in this cookbook use a table with the following structure:
create table tweets( id int, created_at timezonetz, "user.name" varchar(144), "user.screen_name" varchar(144), text varchar(500), "retweeted_status.retweet_count" int, "retweeted_status.id" int, "retweeted_status.favorite_count" int, "user.location" varchar(144), "coordinates.coordinates.0" float, "coordinates.coordinates.1" float, lang varchar(5) );
The columns are based on the data returned by Twitter's streaming API. The fields are defined in the Twitter Field Guide at https://dev.twitter.com/docs/platform>/tweets.
Note that the columns with quoted names; "user.name", "user.screen_name", are sub-fields within a larger field. For example, the "users" field is described here: https://dev.twitter.com/docs/platform>/users.
The example queries provided work with any Twitter data that follows the above table structure.
Determining Popular Topics
The Pulse attribute discovery feature allows you to easily find popular topics in a data set. Use the CommentAttributes() function to extract the attributes from rows of text and count the number of times the attribute occurs.
For example, using a dataset of 30,000 tweets that matched a keyword of "D11" collected during the D11 tech conference in 2013, you could get a count of the attributes discovered by Pulse to determine popular topics:
SELECT t.attribute, count(*) FROM(SELECT CommentAttributes(text)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10; attribute | count ----------------+------- http | 3631 d11 | 3281 rt | 2453 encryption | 2356 usb | 2121 aes-256 | 1859 smartphones | 1843 rt @hp | 1788 world | 1609 ceo | 1520 (10 rows)
If the dataset contains tweets in English and Spanish languages, then (using the Pulse multilingual version) each tweet can be analyzed according to its language by specifying the language as argument in the CommentAttributes() function. If the language of a specific tweet is not supported, then that tweet is ignored by the function. For example:
SELECT t.attribute, count(*) FROM(SELECT CommentAttributes(text,lang) OVER(PARTITION BEST) FROM tweets) as t GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10;
Notice that the top attribute is "http". This is due to the large number of links in tweets. You can ignore links by using the filterlinks argument of CommentAtttributes():
SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10; attribute | count ----------------+------- d11 | 4757 rt | 2397 encryption | 2356 usb | 2121 aes-256 | 1871 smartphones | 1829 rt @hp | 1788 world | 1611 ceo | 1542 interview | 1346 (10 rows)
The attribute "http" is now gone from the list, but we still have "rt" (for retweet) on the list and it is not helpful in this context. You can omit terms such as "rt" by adding them to the stop_words list and reloading the stop_words user-dictionary:
INSERT INTO pulse.stop_words_en VALUES('rt');
commit; SELECT LoadDictionary(standard USING PARAMETERS
listName='stop_words') OVER() FROM pulse.stop_words_en;
When you rerun the query you get more accurate results for the popular topics in the data set:
SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute
ORDER BY count(*) DESC LIMIT 10; attribute | count ----------------+------- d11 | 4757 encryption | 2356 perfume | 2121 usb | 1871 aes-256 | 1829 rt @hp | 1788 world | 1611 ceo | 1542 interview | 1346 cloud | 1306 (10 rows)
You can further refine the list to topics that contain specific attributes by adding the attributes in which you are interested to the white_list, and then filtering with the whitelist parameter:
SELECT t.attribute, count(*) FROM (SELECT CommentAttributes(text USING PARAMETERS filterlinks=true, whitelistonly=true) OVER(PARTITION BEST) FROM tweets) as t GROUP BY t.attribute
ORDER BY count(*) DESC LIMIT 10;
Determining The Sentiment of Popular Topics
In addition to finding popular, or most discussed, topics in your data set, you can also easily get an average sentiment for the topics.
The following example uses a dataset of 10,000 tweets containing the hashtag #sports.
SELECT * from
(SELECT attribute, count(attribute) AS
cnt, AVG(sentiment_score) FROM (select
SentimentAnalysis(text USING PARAMETERS
filterlinks=true) OVER(PARTITION BEST) from tweets)
AS t1 GROUP BY attribute ORDER BY
AVG(sentiment_score) desc) AS t2
WHERE t2.cnt > 500
LIMIT 5;
The result shows the top 5 tweets with the highest average sentiment for attributes that have 500 or more occurances:
attribute | cnt | avg -------------------+------+------------------- football | 817 | 0.290085679314565 game | 638 | 0.134796238244514 baseball | 1558 | 0.128369704749679 basketball | 776 | 0.114690721649485 hockey | 2610 | 0.113409961685824