

Mining Sentiments from Tweets

Akshat Bakliwal, Piyush Arora, Senthil Madhappan, Nikhil Kapre, Mukesh Singh and Vasudeva Varma
Search and Information Extraction Lab, International Institute of Information Technology, Hyderabad
{akshat.bakliwal, piyush.arora}@research.iiit.ac.in, {senthil.m, nikhil.kapre, mukeshkumar.singh}@students.iiit.ac.in, vv@iiit.ac.in

Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 11–18, Jeju, Republic of Korea, 12 July 2012. © 2012 Association for Computational Linguistics.

Abstract

Twitter is a micro-blogging website, where users can post messages in the form of very short texts called Tweets. Tweets contain user opinions and sentiment towards an object or person. This sentiment information is very useful for businesses and governments. In this paper, we present a method which performs the task of tweet sentiment identification using a corpus of pre-annotated tweets. We present a sentiment scoring function which uses prior information to classify (binary classification) and weight various sentiment-bearing words/phrases in tweets. Using this scoring function we achieve a classification accuracy of 87% on the Stanford dataset and 88% on the Mejaj dataset. Using a supervised machine learning approach, we achieve a classification accuracy of 88% on the Stanford dataset.

1 Introduction

With the enormous increase in web technologies, the number of people expressing their views and opinions via the web is increasing. This information is very useful for businesses, governments and individuals. With over 340 million Tweets (short text messages) posted per day, Twitter is becoming a major source of information. Twitter is a micro-blogging site, which is popular because of its short text messages, popularly known as "Tweets". Tweets have a limit of 140 characters. Twitter has a user base of 140+ million active users (as of March 21, 2012; source: http://en.wikipedia.org/wiki/Twitter) and is thus a useful source of information. Users often discuss current affairs and share their personal views on various subjects via tweets. Out of the popular social media platforms such as Facebook, Google+, Myspace and Twitter, we choose Twitter because 1) tweets are short and thus less ambiguous; 2) they are relatively unbiased; 3) they are easily accessible via an API; and 4) they come from various socio-cultural domains.

In this paper, we introduce an approach which can be used to find the opinion in an aggregated collection of tweets. In this approach, we use two different datasets, which are built using emoticons and a list of suggestive words, respectively, as noisy labels. We introduce a new scoring method, the "Popularity Score", which determines a popularity weight for the individual words of a tweet. We also emphasize the various types and levels of pre-processing required for better performance.

Roadmap for the rest of the paper: Related work is discussed in Section 2. In Section 3, we describe our approach to the problem of Twitter sentiment classification along with the pre-processing steps. The datasets used in this research are discussed in Section 4. Experiments and results are presented in Section 5. In Section 6, we present the feature vector approach to Twitter sentiment classification. Section 7 presents a discussion of the methods, and we conclude the paper with future work in Section 8.
2 Related Work

Research in sentiment analysis of user-generated content can be categorized into Reviews (Turney, 2002; Pang et al., 2002; Hu and Liu, 2004), Blogs (Draya et al., 2009; Chesley, 2006; He et al., 2008), News (Godbole et al., 2007), etc. All these categories deal with longer texts. Tweets, on the other hand, are short texts and are difficult to analyse because of their unique language and structure. (Turney, 2002) worked on product reviews. Turney used adjectives and adverbs for performing opinion classification on reviews. He used the PMI-IR algorithm to estimate the semantic orientation of a sentiment phrase, and achieved an average accuracy of 74% on 410 reviews of different domains collected from Epinions. (Hu and Liu, 2004) performed feature-based sentiment analysis. Using noun-noun phrases they identified the features of products and determined the sentiment orientation towards each feature. (Pang et al., 2002) tested various machine learning algorithms on movie reviews. They achieved 81% accuracy with a unigram presence feature set and a Naive Bayes classifier. (Draya et al., 2009) tried to identify domain-specific adjectives to perform blog sentiment analysis. They considered the fact that opinions are mainly expressed by adjectives, and that pre-defined lexicons fail to capture domain information. (Chesley, 2006) performed topic- and genre-independent blog classification, making novel use of linguistic features. Each post from the blog is classified as positive, negative or objective.

To the best of our knowledge, relatively little work has been done on Twitter sentiment analysis. (Go et al., 2009) performed sentiment analysis on Twitter. They identified tweet polarity using emoticons as noisy labels and collected a training dataset of 1.6 million tweets. They reported an accuracy of 81.34% for their Naive Bayes classifier. (Davidov et al., 2010) used 50 hashtags and 15 emoticons as noisy labels to create a dataset for Twitter sentiment classification. They evaluated the effect of different types of features on sentiment extraction. (Diakopoulos and Shamma, 2010) worked on political tweets to identify the general sentiments of the people on the first U.S. presidential debate of 2008. (Bora, 2012) also created their dataset based on noisy labels. They created a list of 40 words (positive and negative) which were used to identify the polarity of a tweet. They used a combination of a minimum word frequency threshold and Categorical Proportional Difference as a feature selection method and achieved the highest accuracy of 83.33% on a hand-labeled test dataset. (Agarwal et al., 2011) performed three-class (positive, negative and neutral) classification of tweets. They collected their dataset using the Twitter stream API and asked human judges to annotate the data into three classes. They had 1709 tweets of each class, making a total of 5127 in all. In their research, they introduced POS-specific prior polarity features along with Twitter-specific features. They achieved a maximum accuracy of 75.39% for unigram + senti features.

Our work uses the (Go et al., 2009) and (Bora, 2012) datasets. We use a Naive Bayes method to decide the polarity of tokens in the tweets. Along with that, we provide useful insight into how tweets should be pre-processed. Our methods of Senti Feature identification and Popularity Score perform well on both datasets. In the feature vector approach, we show the contribution of individual NLP and Twitter-specific features.

3 Approach

Our approach can be divided into various steps. Each of these steps is independent of the others but important at the same time.

3.1 Baseline

In the baseline approach, we first clean the tweets. We remove all the special characters, targets (@), hashtags (#), URLs, emoticons, etc. and learn the positive and negative frequencies of unigrams in training. Every unigram token is given two probability scores: Positive Probability (P_p) and Negative Probability (N_p), as defined in Equation 1. We follow the same cleaning process for the test tweets. After cleaning a test tweet, we form all the possible unigrams and look up their frequencies in the training model. We sum up the positive and negative probability scores of all the constituent unigrams and use their difference (positive - negative) as the overall score of the tweet. If the tweet score is > 0 the tweet is classified as positive, otherwise as negative.

    P_f = frequency of the token in the positive training set
    N_f = frequency of the token in the negative training set
    P_p = P_f / (P_f + N_f)    (positive probability of the token)
    N_p = N_f / (P_f + N_f)    (negative probability of the token)        (1)
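To make the baseline concrete, the following is a minimal sketch (in Python; the function and variable names, and the exact cleaning rules, are our own illustration) of how the unigram scores of Equation 1 could be learned and applied. It illustrates the procedure described above rather than reproducing the authors' implementation.

    from collections import Counter
    import re

    def clean(tweet):
        # Strip targets (@user), hashtags (#tag) and URLs, as described in Section 3.1,
        # then lowercase and split into unigram tokens.
        tweet = re.sub(r"@\w+|#\w+|https?://\S+", " ", tweet.lower())
        return re.findall(r"[a-z]+", tweet)

    def train(pos_tweets, neg_tweets):
        # Learn positive/negative unigram frequencies from the noisy-labelled training data.
        pos_freq, neg_freq = Counter(), Counter()
        for t in pos_tweets:
            pos_freq.update(clean(t))
        for t in neg_tweets:
            neg_freq.update(clean(t))
        return pos_freq, neg_freq

    def classify(tweet, pos_freq, neg_freq):
        score = 0.0
        for token in clean(tweet):
            pf, nf = pos_freq[token], neg_freq[token]
            if pf + nf == 0:
                continue                      # unseen token contributes nothing
            pp = pf / (pf + nf)               # Equation 1
            np_ = nf / (pf + nf)
            score += pp - np_
        return "positive" if score > 0 else "negative"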
3.2 Emoticon and Punctuation Handling

We make slight changes in the pre-processing module to handle emoticons and punctuation. We use the emoticons list provided by (Agarwal et al., 2011) in their research. This list (http://goo.gl/oCSnQ) is built from the Wikipedia list of emoticons (http://en.wikipedia.org/wiki/List_of_emoticons) and is hand tagged into five classes (extremely positive, positive, neutral, negative and extremely negative). In this experiment, we replace all the emoticons which are tagged positive or extremely positive with 'zzhappyzz' and all other emoticons with 'zzsadzz'. We append and prepend 'zz' to happy and sad in order to prevent them from mixing with the tweet text. At the end, 'zzhappyzz' is scored +1 and 'zzsadzz' is scored -1.

Exclamation marks (!) and question marks (?) also carry some sentiment. In general, '!' is used to put emphasis on a positive word and '?' is used to highlight a state of confusion or disagreement. We replace all occurrences of '!' with 'zzexclaimzz' and of '?' with 'zzquestzz'. We add 0.1 to the total tweet score for each '!' and subtract 0.1 from the total tweet score for each '?'. The value 0.1 is chosen by trial and error.

3.3 Stemming

We use the Porter Stemmer (http://tartarus.org/~martin/PorterStemmer/) to stem the tweet words. We modify the Porter stemmer and restrict it to step 1 only, which removes plurals and -ed or -ing suffixes.

3.4 Stop Word Removal

Stop words play a negative role in the task of sentiment classification. Stop words occur in both the positive and negative training sets, thus adding ambiguity during model formation. Moreover, stop words do not carry any sentiment information and thus are of no use to us. We create a list of stop words like he, she, at, on, a, the, etc. and ignore them while scoring. We also discard words of length ≤ 2 when scoring the tweet.

3.5 Spell Correction

Tweets are written informally, with little attention given to correct structure and spelling. Spell correction is therefore an important part of sentiment analysis of user-generated content. Users repeat certain characters an arbitrary number of times to put more emphasis on a word. We use the spell correction algorithm from (Bora, 2012). In their algorithm, a word in which some character repeats more than twice is replaced with a set of candidate words, generated by keeping each such repeated character either once or twice. For example, the word 'swwweeeetttt' is replaced with 8 candidate words: 'swet', 'swwet', 'sweet', 'swett', 'swweet', and so on. A sketch of this expansion is given below.
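The character-repetition expansion described above can be sketched as follows (Python; the run-splitting with a regular expression and the function name are our own illustration, under the assumption that every run longer than two characters is reduced to one or two copies):

    import itertools
    import re

    def expand_elongated(word):
        # Split the word into runs of identical characters,
        # e.g. 'swwweeeetttt' -> ['s', 'www', 'eeee', 'tttt'].
        runs = [m.group(0) for m in re.finditer(r"(.)\1*", word)]
        options = []
        for run in runs:
            if len(run) > 2:
                options.append([run[0], run[0] * 2])   # keep the character once or twice
            else:
                options.append([run])                  # short runs are left untouched
        # Combine one choice per run into candidate spellings.
        return {"".join(choice) for choice in itertools.product(*options)}

    # expand_elongated("swwweeeetttt")
    # -> {'swet', 'swett', 'sweet', 'swwet', 'swweet', ...}  (8 candidates)

One of the candidates would then be selected, for example by checking against a dictionary or the training vocabulary; the paper does not specify this selection step, so that part is left open here.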
Another common type of spelling mistake occurs when characters are skipped; for example, "there" is often written as "thr". Such spelling mistakes are not currently handled by our system. We propose to use a phonetic-level spell correction method in the future.

3.6 Senti Features

At this step, we try to reduce the effect of non-sentiment-bearing tokens on our classification system. In the baseline method, we considered all the unigram tokens equally and scored them using the Naive Bayes formula (Equation 1). Here, we try to boost the scores of sentiment-bearing words. In this step, we look for each token in a pre-defined list of positive and negative words. We use the list of most commonly used positive and negative words provided by Twitrratr (http://twitrratr.com/). When we come across a token in this list, instead of scoring it using the Naive Bayes formula (Equation 1), we score the token +1 or -1 depending on the list in which it exists. All the tokens missing from this list go through steps 3.3, 3.4 and 3.5, and are checked for their occurrence in the list after each step.

3.7 Noun Identification

After applying all the corrections (3.3-3.6) to a word, we check whether the reduced word is a noun. We identify a word as a noun by looking at its part-of-speech tag in the English WordNet (Miller, 1995). If the majority sense (most commonly used sense) of the word is a noun, we discard the word while scoring. Noun words do not carry sentiment and thus are of no use in our experiments.

3.8 Popularity Score

This scoring method boosts the scores of the most commonly used words, which are domain specific. For example, happy is used predominantly for expressing positive sentiment. In this method, we multiply the score of each unigram token obtained in the previous steps by its popularity factor (pF). We use the occurrence frequency of a token in the positive and negative datasets to decide the weight of the popularity score. Equation 2 shows how the popularity factor is calculated for each token. We selected a minimum support of 0.01 as the cut-off criterion and reduced it by half at every level. The support of a word is defined as the proportion of tweets in the dataset which contain this token. The value 0.01 is chosen such that we cover a large number of tokens without missing important tokens, while at the same time pruning less frequent tokens.

    P_f = frequency of the token in the positive training set
    N_f = frequency of the token in the negative training set

    if      (P_f - N_f) > 1000 :  pF = 0.9
    else if (P_f - N_f) > 500  :  pF = 0.8
    else if (P_f - N_f) > 250  :  pF = 0.7
    else if (P_f - N_f) > 100  :  pF = 0.5
    else if (P_f - N_f) < 50   :  pF = 0.1                                (2)

Figure 1 shows the flow of our approach.

[Figure 1: Flow Chart of our Algorithm]
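The piecewise rule of Equation 2 translates directly into code. Below is a minimal sketch (Python; the fallback value for the band that Equation 2 leaves uncovered is our assumption):

    def popularity_factor(pos_freq, neg_freq):
        # Popularity factor (pF) of a token, following the thresholds of Equation 2.
        diff = pos_freq - neg_freq
        if diff > 1000:
            return 0.9
        elif diff > 500:
            return 0.8
        elif diff > 250:
            return 0.7
        elif diff > 100:
            return 0.5
        elif diff < 50:
            return 0.1
        return 1.0  # assumption: tokens in the uncovered 50-100 band are left unweighted

The per-token score from the previous steps would then be multiplied by pF before summing over the tweet.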
4 Datasets

In this section, we explain the two datasets used in this research. Both of these datasets were built using noisy labels.

4.1 Stanford Dataset

This dataset (Go et al., 2009) was built automatically using emoticons as noisy labels. All tweets which contain ':)' were marked positive and tweets containing ':(' were marked negative. Tweets that did not have either of these labels, or had both, were discarded. The training dataset has ~1.6 million tweets, with equal numbers of positive and negative tweets. The training dataset was annotated into two classes (positive and negative) while the testing data was hand annotated into three classes (positive, negative and neutral). For our experiments, we use only the positive and negative class tweets from the testing dataset. Table 1 gives the details of the dataset.

    Table 1: Stanford Twitter Dataset
    Training Tweets:  Positive 800,000    Negative 800,000    Total 1,600,000
    Testing Tweets:   Positive 180        Negative 180        Objective 138    Total 498

4.2 Mejaj

The Mejaj dataset (Bora, 2012) was also built using noisy labels. They collected a set of 40 words and manually categorized them into positive and negative. They label a tweet as positive if it contains any of the positive sentiment words and as negative if it contains any of the negative sentiment words. Tweets which do not contain any of these noisy labels, and tweets which contain both positive and negative words, were discarded (a code sketch of both labelling schemes is given after Table 3). Table 2 gives the list of words which were used as noisy labels. This dataset contains only two-class data. Table 3 gives the details of the dataset.

    Table 2: Noisy Labels for annotating the Mejaj Dataset
    Positive Labels: amazed, amused, attracted, cheerful, delighted, elated, excited, festive, funny, hilarious, joyful, lively, loving, overjoyed, passion, pleasant, pleased, pleasure, thrilled, wonderful
    Negative Labels: annoyed, ashamed, awful, defeated, depressed, disappointed, discouraged, displeased, embarrassed, furious, gloomy, greedy, guilty, hurt, lonely, mad, miserable, shocked, unhappy, upset

    Table 3: Mejaj Dataset
    Training Tweets:  Positive 668,975    Negative 795,661    Total 1,464,638
    Testing Tweets:   Positive 198        Negative 204        Total 402
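The noisy labelling used to build both datasets can be sketched as follows (Python; the emoticon strings and the subsets of label words come from the descriptions above, while the function names and exact matching rules are our assumptions):

    POS_EMO, NEG_EMO = ":)", ":("
    POS_WORDS = {"amazed", "amused", "cheerful", "delighted", "elated"}   # subset of Table 2
    NEG_WORDS = {"annoyed", "ashamed", "awful", "depressed", "unhappy"}   # subset of Table 2

    def emoticon_label(tweet):
        # Stanford-style labelling: keep tweets with exactly one emoticon polarity.
        has_pos, has_neg = POS_EMO in tweet, NEG_EMO in tweet
        if has_pos == has_neg:            # neither or both -> discard
            return None
        return "positive" if has_pos else "negative"

    def word_label(tweet):
        # Mejaj-style labelling based on the list of suggestive words (Table 2).
        tokens = set(tweet.lower().split())
        has_pos, has_neg = bool(tokens & POS_WORDS), bool(tokens & NEG_WORDS)
        if has_pos == has_neg:
            return None
        return "positive" if has_pos else "negative"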
5 Experiments

In this section, we explain the experiments carried out using the proposed approach.

5.1 Stanford Dataset

On this dataset (Go et al., 2009), we perform two series of experiments. In the first series of experiments, we train on the given training data and test on the testing data. In the second series of experiments, we perform 5-fold cross validation using the training data. Table 4 shows the results of each of these experiments for the steps explained in our approach (Section 3): emoticon and punctuation handling, spell correction, stemming and stop word removal. The Baseline + All Combined row refers to all of these steps (emoticons, punctuation, spell correction, stemming and stop word removal) performed together. Series 2 results are the average accuracy over the folds.

    Table 4: Results on Stanford Dataset
    Method                                  Series 1 (%)   Series 2 (%)
    Baseline                                78.8           80.1
    Baseline + Emoticons + Punctuations     81.3           82.1
    Baseline + Spell Correction             81.3           81.6
    Baseline + Stemming                     81.9           81.7
    Baseline + Stop Word Removal            81.7           82.3
    Baseline + All Combined (AC)            83.5           85.4
    AC + Senti Features (wSF)               85.5           86.2
    wSF + Noun Identification (wNI)         85.8           87.1
    wNI + Popularity Score                  87.2           88.4

5.2 Mejaj Dataset

A similar series of experiments was performed on this dataset (Bora, 2012). In the first series of experiments, training and testing were done on the respective given datasets. In the second series of experiments, we perform 5-fold cross validation on the training data. Table 5 shows the results of each of these experiments, again reporting each pre-processing step as well as their combination (Baseline + All Combined). Series 2 results are the average accuracy over the folds.

    Table 5: Results on Mejaj Dataset
    Method                                  Series 1 (%)   Series 2 (%)
    Baseline                                77.1           78.6
    Baseline + Emoticons + Punctuations     80.3           80.4
    Baseline + Spell Correction             80.1           80.0
    Baseline + Stemming                     79.1           79.7
    Baseline + Stop Word Removal            80.2           81.7
    Baseline + All Combined (AC)            82.9           84.1
    AC + Senti Features (wSF)               86.8           87.3
    wSF + Noun Identification (wNI)         87.6           88.2
    wNI + Popularity Score                  88.1           88.1

5.3 Cross Dataset

To validate the robustness of our approach, we experimented with cross-dataset training and testing. We trained our system on one dataset and tested on the other. Table 6 reports the results of the cross-dataset evaluations.

    Table 6: Results on Cross Dataset evaluation
    Method                   Training Dataset   Testing Dataset   Accuracy
    wNI + Popularity Score   Stanford           Mejaj             86.4%
    wNI + Popularity Score   Mejaj              Stanford          84.7%

6 Feature Vector Approach

In the feature vector approach, we form features using unigrams, bigrams, hashtags (#), targets (@), emoticons and the special symbol '!', and use a semi-supervised SVM classifier. Our feature vector comprises 11 features. We divide the features into two groups, NLP features and Twitter-specific features. NLP features include the frequency of positive unigrams matched, negative unigrams matched, positive bigrams matched, negative bigrams matched, etc., and Twitter-specific features include emoticons, targets, hashtags, URLs, etc. Table 7 shows the features we have considered.

    Table 7: Features and Description
    NLP               Unigram (f1)           # of positive and negative unigrams
                      Bigram (f2)            # of positive and negative bigrams
    Twitter Specific  Hashtags (f3)          # of positive and negative hashtags
                      Emoticons (f4)         # of positive and negative emoticons
                      URLs (f5)              binary feature - presence of URLs
                      Targets (f6)           binary feature - presence of targets
                      Special Symbols (f7)   binary feature - presence of '!'

Hashtag polarity is decided based on the constituent words of the hashtag. Using the list of positive and negative words from Twitrratr (http://twitrratr.com/), we check whether the hashtag contains any of these words. If so, we assign that polarity to the hashtag. For example, "#imsohappy" contains the positive word "happy", so this hashtag is considered a positive hashtag. We use the emoticons list provided by (Agarwal et al., 2011) in their research. This list (http://goo.gl/oCSnQ) is built from the Wikipedia list of emoticons (http://en.wikipedia.org/wiki/List_of_emoticons) and is hand tagged into five classes (extremely positive, positive, neutral, negative and extremely negative). We reduce this five-class list to two classes by merging the extremely positive and positive classes into a single positive class and the remaining classes (extremely negative, negative and neutral) into a single negative class. Table 8 reports the accuracy of our machine learning classifier on the Stanford dataset; a sketch of how such a feature vector could be built is given after Table 8.

    Table 8: Results of Feature Vector Classifier on Stanford Dataset
    Feature Set                           Accuracy (Stanford)
    f1 + f2                               85.34%
    f3 + f4 + f7                          53.77%
    f3 + f4 + f5 + f6 + f7                60.12%
    f1 + f2 + f3 + f4 + f7                85.89%
    f1 + f2 + f3 + f4 + f5 + f6 + f7      87.64%
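The following sketch (Python) shows how the 11-dimensional feature vector of Table 7 could be assembled for one tweet. How n-grams, hashtags and emoticons are matched against the positive/negative lists here is our assumption, not the authors' implementation, and the feature ordering is arbitrary.

    def feature_vector(tweet_text, tokens, hashtags, emoticons,
                       pos_words, neg_words, pos_emoticons):
        # f1/f2: counts of positive and negative unigrams and bigrams (NLP features).
        bigrams = list(zip(tokens, tokens[1:]))
        f1 = [sum(t in pos_words for t in tokens),
              sum(t in neg_words for t in tokens)]
        f2 = [sum(a in pos_words and b in pos_words for a, b in bigrams),
              sum(a in neg_words and b in neg_words for a, b in bigrams)]
        # f3: hashtag polarity via constituent-word matching, as described above.
        f3 = [sum(any(w in h for w in pos_words) for h in hashtags),
              sum(any(w in h for w in neg_words) for h in hashtags)]
        # f4: emoticon counts after collapsing the five classes to two.
        f4_pos = sum(e in pos_emoticons for e in emoticons)
        f4 = [f4_pos, len(emoticons) - f4_pos]
        # f5-f7: binary Twitter-specific features.
        f5 = int("http" in tweet_text)        # URL present
        f6 = int("@" in tweet_text)           # target present
        f7 = int("!" in tweet_text)           # '!' present
        return f1 + f2 + f3 + f4 + [f5, f6, f7]

The resulting vectors would then be fed to an SVM; the paper uses a semi-supervised SVM classifier but does not specify its configuration, so that part is not sketched here.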
7 Discussion

In this section, we present a few examples evaluated using our system. The following example shows the effect of incorporating the contribution of emoticons on tweet classification. Example: "Ahhh I can't move it but hey w/e its on hell I'm elated right now :-D". This tweet contains two opinion words, "hell" and "elated". Using the unigram scoring method, this tweet is classified as neutral, but it is actually positive. If we incorporate the effect of the emoticon ":-D", then this tweet is tagged positive; ":-D" is a strong positive emoticon.

Consider this example: "Bill Clinton Fail Obama Win?". Here, there are two sentiment-bearing words, "Fail" and "Win". Ideally this tweet should be neutral, but it is tagged as positive both in the dataset and by our system. If we calculate the popularity factors (pF) for "Win" and "Fail", they come out to be 0.9 and 0.8 respectively. Because of the popularity factor weight, the positive score dominates the negative score and thus the tweet is tagged as positive. It is important to identify the context flow in the text and also how each of these words modifies or depends on the other words of the tweet.

For calculating the system performance, we assume that the dataset used here is correctly labelled. Most of the time this assumption holds, but there are a few cases where it fails. For example, the tweet "My wrist still hurts. I have to get it looked at. I HATE the dr/dentist/scary places. :( Time to watch Eagle eye. If you want to join, txt!" is tagged as positive, but it should actually have been tagged negative. Such erroneous labels also affect the measured system performance.

There are a few limitations of the current approach, which are also open research problems:

1. Spell Correction: In the proposed approach, we gave a solution to spell correction which works only when extra characters are entered by the user. It fails when users skip characters, for example when "there" is spelled as "thr". We propose the use of phonetic-level spell correction to handle this problem.

2. Hashtag Segmentation: For handling hashtags, we look for the existence of positive or negative words (word list taken from http://twitrratr.com/) in the hashtag. There can be cases where this does not work correctly. For example, for the hashtag "#thisisnotgood", if we only consider the presence of positive and negative words, the hashtag is tagged positive ("good"). We fail to capture the presence and effect of "not", which makes this hashtag negative. We propose to devise and use some logic to segment hashtags into their correct constituent words (a possible segmentation sketch follows this list).

3. Context Dependency: As discussed in one of the examples above, even tweet text which is limited to 140 characters can have context dependency. One possible method to address this problem is to identify the objects in the tweet and then find the opinion expressed towards those objects.
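The hashtag segmentation proposed in limitation 2 is left as future work in the paper; one way it might look is a simple dictionary word-break, as in the following sketch (Python; the algorithm choice and the function name are our assumptions, not a method from the paper):

    def segment_hashtag(tag, dictionary):
        # Split a hashtag such as '#thisisnotgood' into known words using a
        # word-break over a vocabulary; best[i] holds a segmentation of tag[:i].
        tag = tag.lstrip("#").lower()
        n = len(tag)
        best = [None] * (n + 1)
        best[0] = []
        for i in range(1, n + 1):
            for j in range(i):
                if best[j] is not None and tag[j:i] in dictionary:
                    best[i] = best[j] + [tag[j:i]]
                    break
        return best[n]

    # segment_hashtag("#thisisnotgood", {"this", "is", "not", "good"})
    # -> ['this', 'is', 'not', 'good'], after which the negation 'not' can flip 'good'.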
8 Conclusion and Future Work

Twitter sentiment analysis is a very important and challenging task. Being a microblog, Twitter text suffers from various linguistic and grammatical errors. In this research, we proposed a method which incorporates the popularity effect of words into tweet sentiment classification, and we also showed how to pre-process Twitter data to extract the maximum information out of its small content. On the Stanford dataset, we achieved 87% accuracy using the scoring method and 88% using the SVM classifier. On the Mejaj dataset, we showed an improvement of 4.77% over the (Bora, 2012) accuracy of 83.33%. In the future, this work can be extended through the incorporation of better spell correction mechanisms (possibly at the phonetic level) and word sense disambiguation. We can also identify the targets and entities in a tweet and the orientation of the user towards them.

Acknowledgement

We would like to thank Vibhor Goel, Sourav Dutta and Sonil Yadav for helping us with running the SVM classifier on such large data.

References

Agarwal, A., Xie, B., Vovsha, I., Rambow, O. and Passonneau, R. (2011). Sentiment Analysis of Twitter Data. In Proceedings of the Workshop on Languages in Social Media (LSM '11).

Bora, N. N. (2012). Summarizing Public Opinions in Tweets. In Journal Proceedings of CICLing 2012, New Delhi, India.

Chesley, P. (2006). Using Verbs and Adjectives to Automatically Classify Blog Sentiment. In Proceedings of AAAI-CAAW-06, the Spring Symposia on Computational Approaches.

Davidov, D., Tsur, O. and Rappoport, A. (2010). Enhanced Sentiment Learning Using Twitter Hashtags and Smileys. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (COLING '10).

Diakopoulos, N. and Shamma, D. (2010). Characterizing Debate Performance via Aggregated Twitter Sentiment. In Proceedings of the 28th International Conference on Human Factors in Computing Systems, ACM.

Draya, G., Planti, M., Harb, A., Poncelet, P., Roche, M. and Trousset, F. (2009). Opinion Mining from Blogs. In International Journal of Computer Information Systems and Industrial Management Applications (IJCISIM).

Go, A., Bhayani, R. and Huang, L. (2009). Twitter Sentiment Classification Using Distant Supervision. CS224N Project Report, Stanford University.

Godbole, N., Srinivasaiah, M. and Skiena, S. (2007). Large-Scale Sentiment Analysis for News and Blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM).

He, B., Macdonald, C., He, J. and Ounis, I. (2008). An Effective Statistical Approach to Blog Post Opinion Retrieval. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08).

Hu, M. and Liu, B. (2004). Mining Opinion Features in Customer Reviews. In AAAI.

Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38, 39–41.

Pang, B., Lee, L. and Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification Using Machine Learning Techniques. In Proceedings of EMNLP.

Turney, P. D. (2002). Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In ACL.