Tuning Pulse

Pulse contains built-in dictionaries that help determine the sentiment of sentences. These dictionaries are not directly readable. However, you can modify the Pulse dictionary tables to improve automatic attribute discovery and to get more accurate sentiment scores for your specific data sets. The dictionary tables are available in the Pulse schema. Any words you add to these tables take precedence over the built-in dictionaries.

Improving Automatic Attribute Discovery

Pulse identifies nouns in sentences and marks them as attributes. However, you can modify two dictionaries and one mapping to improve automatic attribute discovery.

Determining How Pulse Scores Sentiment

When tuning Pulse, it is important to understand why Pulse is not scoring a particular attribute the way you expect. For example, consider the sentence "The quick brown fox jumped over the lazy dog." By default, Pulse scores the fox as positive and the dog as negative. To see which words in the sentence affect the score of each attribute, use the relatedwords parameter. For example:

select SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
 USING PARAMETERS relatedwords=true) OVER(PARTITION BEST);


 sentence | attribute | sentiment_score | related_word_1 | related_word_2 | related_word_3 
----------+-----------+-----------------+----------------+----------------+----------------
        1 | fox       |               1 | quick          | lazy           | 
        1 | dog       |              -1 | lazy           |                | 
(2 rows)

The output shows that "quick" and "lazy" affected the scoring of the "fox" attribute, and that "lazy" affected the scoring of the "dog" attribute. When scoring "fox", "quick" (positive) is weighted more heavily than "lazy" (negative) because "quick" is closer to the attribute "fox" in the sentence, so "fox" is scored positively. "Lazy" (negative) is the only related word used to score the sentiment for "dog". If you don't agree with the scoring, you can change how these related words affect the score by adding them to the appropriate user-dictionary, as described in "Improving Sentiment Scores".

Improving Sentiment Scores

Pulse scores sentiment on attributes (nouns) in sentences using Natural Language Processing (NLP) algorithms and rules. Pulse attempts to identify the parts of speech in a sentence (for example, verbs, nouns/attributes, and adjectives). It then scores the attributes based on which system-dictionaries (positive, negative, or neutral) those parts of speech appear in, where they appear in relation to the attributes, and other contextual information. Pulse does not identify personal pronouns (he, you, we, she, etc.) as attributes.

Pulse provides a PartsOfSpeech function so that you can verify which parts of speech are being identified in a sentence.
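
As a minimal sketch, the call below assumes that PartsOfSpeech takes a sentence and is invoked the same way as SentimentAnalysis. That calling convention is an assumption, not something this page confirms, so check the function reference for the exact signature and output columns:

```sql
-- Assumption: PartsOfSpeech is called like SentimentAnalysis (not confirmed on this page).
select PartsOfSpeech('The quick brown fox jumped over the lazy dog.')
 OVER(PARTITION BEST);
```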

Sentiment Scoring and the Precedence of Pulse User-Dictionaries

The negative, positive, and neutral user-dictionaries adjust the score of an attribute based on which dictionary the words in the sentence appear in. User-dictionaries take precedence over the internal dictionaries that Pulse uses for analyzing text, so you can override the default polarity of an opinion word or phrase by inserting it in the appropriate user-dictionary table.

Pulse also supports phrases, such as idioms ("hit the nail on the head"), in the pos_words, neg_words, and neutral_words dictionaries. Phrases of two or more words support "fuzzy" matching. For example, the phrase "solve problem" also matches "solves problems".

Pulse uses the following order of precedence to determine which user-dictionary modifies the default score:

  1. Phrases or strings that occur in the "neutral_words" dictionary
  2. Phrases or strings that occur in the "neg_words" dictionary
  3. Phrases or strings that occur in the "pos_words" dictionary
  4. Single words appearing in the "neutral_words" dictionary
  5. Single words appearing in the "neg_words" dictionary
  6. Single words appearing in the "pos_words" dictionary
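
The precedence order above can be explored with a small experiment. The sketch below is illustrative only: it assumes the English tables pulse.pos_words_en and pulse.neutral_words_en each accept a single value, and the sample sentence is hypothetical. With the same phrase in both dictionaries, the list above implies that the neutral entry is the one applied. No output is shown because the result depends on your installation.

```sql
-- Illustrative experiment; table names assume the English (_en) user-dictionaries.
insert into pulse.pos_words_en values ('hit the nail on the head');
insert into pulse.neutral_words_en values ('hit the nail on the head');
commit;
-- reload the dictionaries so the changes take effect
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
-- Per the list above, a neutral phrase (item 1) outranks a positive phrase (item 3).
select sentimentAnalysis('The review hit the nail on the head.') over(PARTITION BEST);
```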

Note: If a word is present in both stop_words and white_list, then the white_list word takes precedence. The word is present in results even though it exists in stop_words.

Consider the sentence "Fudge is good". It contains three parts: a noun (fudge), a verb (is), and an adjective (good). When you analyze the sentence using Pulse, it identifies "fudge" as an attribute, because it is a noun, and then assigns "fudge" a positive sentiment:

select sentimentAnalysis('Fudge is good') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score 
----------+-----------+-----------------
        1 | fudge     |               1

The number of words matched against a dictionary also affects which dictionary takes precedence: phrases or word combinations in the user-dictionary lists take precedence over single words. For example, the positive phrase "solve problem" causes a positive score on the text "Joe solves problems", even though "problem" is a negative word. Because phrases have precedence over single words, a positive score is applied to Joe.

SELECT SentimentAnalysis('Joe solves problems.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score 
----------+-----------+-----------------
        1 | joe       |               1
(1 row)

SELECT SentimentAnalysis('Joe is a problem.') OVER(PARTITION BEST);
sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | joe       |              -1
        1 | problem   |               0
(2 rows)

Tuning Example

You can modify any of the user-dictionaries to improve the accuracy of sentiment scores. The two basic dictionaries, "neg_words" and "pos_words", are typically the easiest to modify to get good results. Words in these two dictionaries can be any part of speech (verb, adjective, etc.). If you find a word that is causing an attribute to be scored positively or negatively, but it should be scored as neutral, then you can add that word to the "neutral_words_en" dictionary to cause it to be scored 0.
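
For instance, if "lazy" should not affect the scores in the fox sentence discussed earlier, the neutral-dictionary change might look like the following sketch. It assumes that pulse.neutral_words_en accepts a single value in the same way as the pulse.pos_words_en insert shown later in this section. No output is shown because the resulting scores depend on your installation.

```sql
-- Assumption: neutral_words_en has the same single-column layout as pos_words_en.
insert into pulse.neutral_words_en values ('lazy');
commit;
-- reload the dictionaries so the change takes effect
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
-- re-check which related words now influence each attribute
select SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
 USING PARAMETERS relatedwords=true) OVER(PARTITION BEST);
```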

Consider the sentence "The product delivers simplicity.":

select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score 
----------+------------+-----------------
        1 | product    |               0
        1 | simplicity |               0
(2 rows)

If you want "product" to be scored positively in this sentence, you can add "deliver simplicity" to the pos_words user-dictionary. The phrase "deliver simplicity" also matches "delivers simplicity" because of the "fuzzy" matching of phrases in the dictionaries. If you add "simplicity" by itself to the "pos_words" dictionary, then simplicity in any context is considered positive, which may not be the result you want across your entire domain. The following example adds "deliver simplicity" to the pos_words user-dictionary for the English language:

insert into pulse.pos_words_en values ('deliver simplicity');
commit;
-- you must reload the dictionaries for the changes to be effective
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score 
----------+------------+-----------------
        1 | product    |               1
(1 row)

Notice that "simplicity" is not positive if it is not paired with "deliver":

select sentimentAnalysis('The product provides simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score 
----------+------------+-----------------
        1 | product    |               0
        1 | simplicity |               0
(2 rows)

If you want "simplicity" to always be positive, add it to the "pos_words" list. This example uses "provides" in place of "delivers":

insert into pulse.pos_words_en values ('simplicity');
commit;
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product provides simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score 
----------+------------+-----------------
        1 | product    |               1
        1 | simplicity |               0
(2 rows)

Notice that the sentiment score for the attribute (noun) "simplicity" is not affected by having the word "simplicity" in a Pulse user-dictionary, since it has been identified as an attribute.

Additional Tuning Examples

The following table provides additional examples for tuning Pulse:

 Text                                             | Attribute   | Score           | Tuning Steps
--------------------------------------------------+-------------+-----------------+---------------------------------------------
 New product smashes kickstarter target in a day! | New Product | Default: -1     | "Smash" is scored negatively by default.
                                                  |             | After Tuning: 1 | Add "smash target" to "pos_words".
 Get a sneak peek of the new movie.               | Movie       | Default: -1     | "sneak" is scored negatively by default.
                                                  |             | After Tuning: 1 | Add "sneak peek" to "pos_words".
 Google was able to spot trends in flu outbreaks  | Google      | Default: -1     | "outbreak" is scored negatively by default.
 in the United States using the collection and    |             | After Tuning: 1 | Add "spot trend" to "pos_words".
 analysis of big data.                            |             |                 |
 Five health tips that will knock your socks off! | health tips | Default: -1     | "knock" is scored negatively by default.
                                                  |             | After Tuning: 1 | Add "knock your socks off" to "pos_words".
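
The tuning steps listed above can be applied together. The following is a sketch that assumes the English dictionary table pulse.pos_words_en and the reload script path used in the Tuning Example. Adjust both for your language and installation.

```sql
-- Add the positive phrases from the examples above to the English user-dictionary.
insert into pulse.pos_words_en values ('smash target');
insert into pulse.pos_words_en values ('sneak peek');
insert into pulse.pos_words_en values ('spot trend');
insert into pulse.pos_words_en values ('knock your socks off');
commit;
-- reload the dictionaries so the changes take effect
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
-- re-run one of the sentences to check the new score
select sentimentAnalysis('Get a sneak peek of the new movie.') over(PARTITION BEST);
```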

If you have many words, or many base words and synonyms, to add to the user-dictionaries, you can bulk load the lists from text files. See Bulk Loading Word Lists from Text Files.