Tuning Pulse
Pulse contains built-in dictionaries that help to determine the sentiment of sentences. These dictionaries are not directly readable. However, you can modify the Pulse dictionary tables to improve automatic attribute discovery and provide more accurate results for sentiment scoring based on your specific data sets. The dictionary tables are available in the Pulse schema. Any words you add to these dictionaries take precedence over the built-in dictionaries.
Improving Automatic Attribute Discovery
Pulse identifies nouns in sentences and marks them as attributes. However, there are two dictionaries and one mapping that you can modify to improve automatic attribute discovery. These are:
- white_list - a list of words on which you want to score sentiment, but that are not auto-discovered by Pulse. Typically these are product or company names, or special words in the domain of the data you are analyzing. You can also add noun phrases to the white_list.

  Consider the term "President Smith". Pulse automatically marks "President" as an attribute. However, you can add "President Smith" to the white_list, and Pulse then uses "President Smith" as the attribute instead of just "President".

  If your white_list contains phrases that are subsets of other phrases in the white_list, then the shorter phrase is not matched if the text being analyzed matches the superset phrase. For example, if both "Honest Al" and "Honest Al Car Emporium" are in the white_list, then the latter phrase is identified as an attribute in the text "Honest Al Car Emporium is not honest.", not the shorter "Honest Al" white_list phrase.
- stop_words - a list of words on which you do not want to score sentiment, but that may appear frequently in your data set. stop_words is essentially a way to filter out attributes.

  If a word appears in both stop_words and white_list, then the white_list word takes precedence. The word appears in results even though it is in the stop_words dictionary.
- normalization - a map of base words and synonyms that allows you to normalize words for easy comparison. For example, you can normalize "Hewlett Packard" to "HP", then count the number of times "HP" appears as an attribute in your data. Any text that contains "HP" or "Hewlett Packard" is counted towards the total.
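As a minimal sketch, entries might be added to these three dictionaries as follows. The table names `pulse.white_list_en`, `pulse.stop_words_en`, and `pulse.normalization`, and the two-column (base word, synonym) layout of the normalization table, are assumptions modeled on the `pulse.pos_words_en` table used later in this topic. The stop word "thing" is a hypothetical example. Verify the table definitions in your Pulse schema before running.

```sql
-- Assumed table name: treat the noun phrase as a single attribute.
insert into pulse.white_list_en values ('President Smith');

-- Assumed table name; 'thing' is a hypothetical word to filter out of results.
insert into pulse.stop_words_en values ('thing');

-- Assumed table name and column order (base word, then synonym).
insert into pulse.normalization values ('HP', 'Hewlett Packard');

commit;

-- Reload the user dictionaries so that the changes take effect.
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
```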
Determining How Pulse Scores Sentiment
When tuning Pulse it is important to understand why Pulse may not be scoring a particular attribute the way you want it to be scored. For example, consider the sentence "The quick brown fox jumped over the lazy dog." By default, Pulse scores the fox as positive and the dog as negative. If you want to better understand how the words in the sentence affect the attributes, then you can use the relatedwords parameter to see which words are affecting the score. For example:
select SentimentAnalysis('The quick brown fox jumped over the lazy dog.' USING PARAMETERS relatedwords=true) OVER(PARTITION BEST);
 sentence | attribute | sentiment_score | related_word_1 | related_word_2 | related_word_3
----------+-----------+-----------------+----------------+----------------+----------------
        1 | fox       |               1 | quick          | lazy           |
        1 | dog       |              -1 | lazy           |                |
(2 rows)
The output shows that "quick" and "lazy" affected the scoring of the "fox" attribute, and that "lazy" affected the scoring of the "dog" attribute. "Quick" (positive) is weighted more heavily than "lazy" (negative) when scoring "fox" because the word "quick" is closer to the attribute "fox" in the sentence, so "fox" is scored positively. "Lazy" (negative) is the only related word used to score the sentiment for "dog". If you don't agree with the scoring, you can change how these related words affect the score by adding them to the appropriate user-dictionary, as described in "Improving Sentiment Scores".
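For instance, if "lazy" should not influence scoring in your domain, one possible adjustment is to add it to the neutral user-dictionary and rerun the query. This is a sketch only: it assumes the neutral dictionary table is `pulse.neutral_words_en`, following the naming of `pulse.pos_words_en`, and the resulting scores depend on your installation.

```sql
-- Assumption: neutral_words_en lives in the pulse schema, like pos_words_en.
insert into pulse.neutral_words_en values ('lazy');
commit;

-- Reload the user dictionaries so that the change takes effect.
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql

-- Rerun with relatedwords=true to see which words now drive each score.
select SentimentAnalysis('The quick brown fox jumped over the lazy dog.' USING PARAMETERS relatedwords=true) OVER(PARTITION BEST);
```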
Improving Sentiment Scores
Pulse scores sentiment on attributes (nouns) in sentences using Natural Language Processing (NLP) algorithms and rules. Pulse attempts to identify the parts of speech in a sentence (for example, verbs, nouns/attributes, and adjectives), and then scores the attributes based on which system-dictionaries (positive, negative, or neutral) those parts of speech appear in, where they appear in relation to the attributes, and other contextual information. Pulse does not identify personal pronouns (he, you, we, she, etc.) as attributes.
Pulse provides a PartsOfSpeech function so that you can verify which parts of speech are being identified in a sentence.
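A call might look like the following sketch. The call shape is an assumption that mirrors the SentimentAnalysis examples in this topic; check the PartsOfSpeech reference for the exact syntax and output columns.

```sql
-- Assumed call shape, mirroring SentimentAnalysis(...) OVER(PARTITION BEST).
select PartsOfSpeech('Fudge is good') OVER(PARTITION BEST);
```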
Sentiment Scoring and the Precedence of Pulse User-Dictionaries
The negative, positive, and neutral user-dictionaries adjust the score of an attribute based on which dictionary the words in the sentence appear in. User-dictionaries take precedence over the internal dictionaries that Pulse uses for analyzing text, so you can override the default polarity of an opinion word or phrase by inserting it in the appropriate user-dictionary table.
Pulse also supports using phrases in the pos_words, neg_words and neutral_words dictionaries. Phrases, such as idioms ("hit the nail on the head."), can be added to the user dictionaries. Phrases of two or more words support "fuzzy" matching. For example, the phrase "solve problem" also matches "solves problems".
Pulse uses an order of precedence to determine which user dictionary is used to modify the default score. The order of precedence of the user dictionary that Pulse uses to score attributes is as follows:
- Phrases or strings that occur in the "neutral_words" dictionary
- Phrases or strings that occur in the "neg_words" dictionary
- Phrases or strings that occur in the "pos_words" dictionary
- Single words appearing in the "neutral_words" dictionary
- Single words appearing in the "neg_words" dictionary
- Single words appearing in the "pos_words" dictionary
Note: If a word is present in both stop_words and white_list, then the white_list word takes precedence. The word is present in results even though it exists in stop_words.

Consider the sentence "Fudge is good". It contains three parts: a noun (fudge), a verb (is), and an adjective (good). When you analyze the sentence using Pulse, it identifies "fudge" as an attribute, because it is a proper noun, and then assigns "fudge" a positive sentiment:
select sentimentAnalysis('Fudge is good') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fudge     |               1
The number of words matched against a dictionary also affects which dictionary entry takes precedence: phrases or word combinations in the user-dictionary lists take precedence over single words. For example, the positive phrase "solve problem" causes a positive score on the text "Joe solves problems", even though "problem" is a negative word. Since phrases have precedence over single words, a positive score is applied to Joe.
SELECT SentimentAnalysis('Joe solves problems.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | joe       |               1
(1 row)
SELECT SentimentAnalysis('Joe is a problem.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | joe       |              -1
        1 | problem   |               0
(2 rows)
Tuning Example
You can modify any of the user-dictionaries to improve the accuracy of sentiment scores. The two basic dictionaries, "neg_words" and "pos_words", are typically the easiest to modify to get good results. Words in these two dictionaries can be any part of speech (verb, adjective, etc.). If you find a word that is causing an attribute to be scored positively or negatively, but it should be scored as neutral, then you can add that word to the "neutral_words_en" dictionary to cause it to be scored 0.
Consider the sentence "The product delivers simplicity.":
select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               0
        1 | simplicity |               0
(2 rows)
If you want "product" to be scored positively in this sentence, then you must add "deliver simplicity" to the pos_words user-dictionary. "deliver simplicity" will also match "delivers simplicity" due to the "fuzzy" matching feature of phrases in the dictionaries. If you add "simplicity" by itself to the "pos_words" dictionary, then simplicity in any context is considered positive, which may not be the result you want to achieve across your entire domain. The following example adds "deliver simplicity" to the pos_words user-dictionary for the English language:
insert into pulse.pos_words_en values ('deliver simplicity');
commit;
-- you must reload the dictionaries for the changes to be effective
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               1
(1 row)
Notice that "simplicity" is not positive if it is not paired with "deliver":
select sentimentAnalysis('The product provides simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               0
        1 | simplicity |               0
(2 rows)
If you want "simplicity" to always be positive, add it to the "pos_words" list. This example replaces "deliver" with "provides":
insert into pulse.pos_words_en values ('simplicity');
commit;
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product provides simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               1
        1 | simplicity |               0
(2 rows)
Notice that the sentiment score for the attribute (noun) "simplicity" is not affected by having the word "simplicity" in a Pulse user-dictionary, since it has been identified as an attribute.
Additional Tuning Examples
The following table provides additional examples for tuning Pulse:
Text | Attribute | Score | Tuning Steps |
---|---|---|---|
New product smashes kickstarter target in a day! | New Product | Default: -1, After Tuning: 1 | "Smash" is scored negatively by default. Add "smash target" to "pos_words". |
Get a sneak peek of the new movie. | Movie | Default: -1, After Tuning: 1 | "Sneak" is scored negatively by default. Add "sneak peek" to "pos_words". |
Google was able to spot trends in flu outbreaks in the United States using the collection and analysis of big data. | | Default: -1, After Tuning: 1 | "Outbreak" is scored negatively by default. Add "spot trend" to "pos_words". |
Five health tips that will knock your socks off! | health tips | Default: -1, After Tuning: 1 | "Knock" is scored negatively by default. Add "knock your socks off" to "pos_words". |
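The tuning steps in the table can be applied together, following the same insert, commit, and reload pattern shown in the Tuning Example. The sketch below uses the English `pulse.pos_words_en` table from that example; the final query is only a spot check, and its result depends on your installation.

```sql
-- Add the positive phrases from the table above to the English user-dictionary.
insert into pulse.pos_words_en values ('smash target');
insert into pulse.pos_words_en values ('sneak peek');
insert into pulse.pos_words_en values ('spot trend');
insert into pulse.pos_words_en values ('knock your socks off');
commit;

-- Reload the user dictionaries so that the changes take effect.
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql

-- Spot-check one of the sentences from the table.
select sentimentAnalysis('Get a sneak peek of the new movie.') over(PARTITION BEST);
```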
If you have many words or base word/synonym pairs to add to user-dictionaries, then you can bulk load the lists from text files. See Bulk Loading Word Lists from Text Files.
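As a rough sketch, a bulk load might use the Vertica COPY statement from vsql. The file path below is hypothetical, and the file is assumed to contain one word or phrase per line; the referenced topic describes the supported procedure.

```sql
-- Hypothetical file: /tmp/pos_words.txt, one word or phrase per line.
COPY pulse.pos_words_en FROM LOCAL '/tmp/pos_words.txt';

-- Reload the user dictionaries so that the new entries take effect.
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
```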