Abstract
With the rise of Web 2.0, a huge amount of unstructured data is generated on a regular basis in the form of comments, opinions, etc. This unstructured data contains useful information and can play a significant role in business decision making. In this context, sentiment analysis (SA) is an active research area that has recently attracted the attention of the research community. The aim of SA is to classify user-generated content into positive and negative classes. State-of-the-art techniques for sentiment classification rely on traditional bag-of-words approaches. Such approaches are advantageous in terms of simplicity, but they completely ignore semantic aspects and the order between words, and they lead to the curse of dimensionality. Researchers have also proposed semantic-based SA techniques that take word order into account by employing high-order n-grams, part-of-speech (POS) patterns, and dependency relation features. But can every word or phrase among the high-order n-grams, POS patterns, or dependency relation features represent a sentiment clue? And if all of them are incorporated, what happens to the dimensionality? To investigate and tackle these issues, in this paper we propose EnSWF, a novel POS and n-gram based ensemble method for SA that considers semantics, sentiment clues, and the order between words, and that is organized as a four-phase process. Our main contributions are four-fold: (a) Appropriate Feature Extraction: we investigate and validate the extraction of various appropriate features for sentiment classification. (b) Dimensionality Reduction: we decrease the dimensionality of the feature space by selecting the subset of the most meaningful and effective features. (c) Ensemble Model: we propose an ensemble learning method for both filter-based feature selection and classification using a simple majority voting technique. (d) Practicality: we validate our claims by applying our model to benchmark datasets.
We also show that EnSWF outperforms existing techniques in terms of classification accuracy and reduces the dimensionality of the feature space.
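The abstract describes majority voting for both filter-based feature selection and classification but does not give the procedure itself. The following is a minimal sketch of that general idea, not the authors' implementation: the function names, the `top_k` cut-off, the "more than half" rule, and the example scores are all assumptions introduced for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def majority_vote_features(rankings: Sequence[Dict[str, float]], top_k: int) -> List[str]:
    """Keep features ranked in the top_k by more than half of the filters.

    Each element of `rankings` maps a feature (e.g. an n-gram or POS pattern)
    to the score one hypothetical filter method assigned to it.
    """
    votes: Counter = Counter()
    for scores in rankings:
        # Sort by descending score; break ties alphabetically for determinism.
        top = sorted(scores, key=lambda f: (-scores[f], f))[:top_k]
        votes.update(top)
    needed = len(rankings) // 2 + 1  # strict majority of the filters
    return sorted(f for f, v in votes.items() if v >= needed)


def majority_vote_label(predictions: Sequence[str]) -> str:
    """Return the label predicted by most classifiers.

    With an even number of voters a tie is possible; this sketch then
    returns the tied label that was seen first.
    """
    return Counter(predictions).most_common(1)[0][0]


# Illustrative scores from three hypothetical filter methods.
filter_a = {"great": 0.9, "not_good": 0.8, "the": 0.1, "movie": 0.2}
filter_b = {"great": 0.7, "not_good": 0.2, "the": 0.1, "movie": 0.6}
filter_c = {"great": 0.5, "not_good": 0.9, "the": 0.3, "movie": 0.1}

selected = majority_vote_features([filter_a, filter_b, filter_c], top_k=2)
label = majority_vote_label(["positive", "negative", "positive"])
print(selected, label)
```

In this toy run, only features that at least two of the three filters place in their top two survive, which is how voting across filters can shrink the feature space before the classifiers vote on the final label.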
Acknowledgments
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICT Consilience Creative program (IITP-2019-2015-0-00742) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Khan, J., Alam, A., Hussain, J. et al. EnSWF: effective features extraction and selection in conjunction with ensemble learning methods for document sentiment classification. Appl Intell 49, 3123–3145 (2019). https://doi.org/10.1007/s10489-019-01425-4
DOI: https://doi.org/10.1007/s10489-019-01425-4