Applying natural language processing algorithm for predicting consumer product preferences in retail stores

Narasimha R Vajjhala

4th International Conference on Innovative Computing and Communication

Michael Akpovona Oshogbunu (a), Narasimha Rao Vajjhala (b, *), Sandip Rakshit (a), and Olumide Longe (a)

(a) American University of Nigeria, Yola, Adamawa, Nigeria, PMB 2250
(b) University of New York Tirana, Kodra Diellit, Tirana, Albania, 1048

michael.oshogbunu@aun.edu.ng, narasimharaonarasimha@gmail.com*, sandip.rakshit@aun.edu.ng, olumide.longe@aun.edu.ng

Abstract: Technology has changed the way retailers predict and understand consumer behaviour. One such technology that can enable retailers to understand consumer preference is Natural Language Processing (NLP). Social media content, including the opinions and interests of customers, is recognized as a valuable source of information for businesses. This study aims to perform a semantic analysis of tweets with the use of an NLP algorithm. It focuses on building an intelligent application that analyses social media data with NLP to predict the category of goods a customer would most likely buy in a retail store. In this study, we measured a 0.3 increase in accuracy when only various forms of nouns were extracted and analysed. Further research may include Named-Entity Recognition (NER), especially for proper nouns. The researchers believe that this study will contribute to changing the way NLP is applied in the retail industry. Therefore, the methodology and design used herein will improve the existing approaches that have already been employed concerning NLP and social media data analysis.

Keywords: Machine Learning, Natural Language Processing, Prediction, Retail, Intelligent Systems, Social Media, Semantic, Analysis, Mining.

1. Introduction

1.1. Natural Language Processing

Natural Language Processing (NLP) works on both text and speech data, along with different types of engineering data, for the development of intelligent systems (Agarwal & Jayant, 2019; Gupta, Ahlawat, & Sagar, 2017; Alzubi, 2015). NLP techniques can help analyze social media text, including tweets on Twitter and posts on Facebook (Agarwal & Jayant, 2019). Retailers can understand customer preferences by analyzing their customers' data on social networking platforms, a methodology termed the Semantic Analysis of Social Media (SASM) (Atefeh, Diana, & Graeme, 2017; Vajjhala, Rakshit, Oshogbunu, & Salisu, 2020). This discipline analyzes and transforms social media data into social media intelligence for decision-makers. Hence, insights from social media data enable executives to lead with contextual knowledge rather than intuition (Erik & Emanuel, 2018).

Machine learning algorithms have several advantages, including the ability to reduce uncertainty and predict precisely (Agarwal & Jayant, 2019). They also allow real-time analysis and advance forecasting, coupled with the processing of large volumes of data (Agarwal & Jayant, 2019). Alzubi et al. (2020) found that neural network-based approaches perform better than traditional methods, which rely on feature engineering and linguistic tools, in measuring sentence similarity. Hence, combining NLP with machine learning algorithms can provide vital insights into consumer behavior. NLP helps predict online consumer behavior by giving computing machines the ability to process textual data through computer science, artificial intelligence, and linguistic algorithms. For example, NLP makes it possible to conduct brand perception analysis on social media platforms through the use of Named Entity Recognition (NER) (Erik & Emanuel, 2018).
NLP is a computer-based approach to analyzing textual data using a set of theories and techniques. NLP is the theoretically motivated range of computational methods for analyzing and representing naturally occurring texts at one or more linguistic analysis levels to achieve human-like language processing for various tasks or applications (Yoosin & Seung Ryul, 2015). The phrase “range of computational techniques” is essential because there are multiple methods to choose from to accomplish a given type of language analysis. The term “naturally occurring text” can refer to either oral or written human language gathered from a real data source, e.g., social media. The phrase “level of linguistics” accounts for the multiple types of language processing at work when humans produce or comprehend language, and the term “human-like language processing” indicates that NLP is a sub-field of artificial intelligence (AI). The phrase “for a range of tasks or applications” portrays NLP not as the goal but as the means of accomplishing a task.

Researchers have proposed several methodologies in Natural Language Processing, demand forecasting, and social media data analysis. Techniques such as Extreme Gradient Boosting (XGBoost) (Yeo et al., 2020), DivBoosting (Alzubi, 2016), Random Forest (Rajput, 2020), Long Short-Term Memory (Adwan et al., 2020), and Artificial Neural Networks (Hasin, Ghosh, & Shareef, 2011) have been used for predicting consumer preferences in the retail sector. Although NLP is still in its nascent stage in the retail industry, it has been well utilized for psychiatric purposes (Rajput, 2020). Researchers working on Twitter mining have mainly used keyword/dictionary lookup and machine learning techniques (Doan et al., 2019). The machine learning techniques used include support vector machines, logistic regression, Naïve Bayes, and neural networks (Doan et al., 2019; Jinyan, Becken, & Stantic, 2019).
Social media contains rich contextual information about its users (Xu, Zhang, & Luo, 2010). A user's behavioral pattern can be constructed from adequate social media data (Zhou, Qian, & Ma, 2012). NLP is one of the technologies that can be used to analyze social media data. Although this methodology is still relatively new, it has been well utilized in journalism, product recommendation, healthcare, and security. For example, about 25% of all major news is derived from social media (Atefeh et al., 2017). When such data is analyzed, insights into public opinion, societal unrest, and nation-wide sentiment are derived. Therefore, the benefits of applying NLP to social media data are evident and immense.

1.2. Challenges of using NLP and Twitter Mining

Social media content, including the opinions and interests of customers, is recognized as a valuable source of information for businesses (Biba, Ballhysa, & Vajjhala, 2010; Yoosin & Seung Ryul, 2015). Social media data has many facets, including volume, speed, complexity, high dimensionality, lack of structure, and incompleteness (Jinyan et al., 2019). One of the critical reasons for the low level of NLP application is that traditional NLP techniques do not work very well with the unstructured nature of social media data. Hence, NLP techniques need to be adapted for social media data analysis. NLP needs raw data, which can be extracted from various sources, including web data, social media data, audio signals, and reports, as well as related operational data (Agarwal & Jayant, 2019). NLP can then be applied to the data to derive meaningful information along with visualizations. The unstructured nature of data on social media stems from the presence of slang, emojis, abbreviations, typographical errors, non-standard spelling, etc. (Atefeh et al., 2017).
Such incorrectness, sparseness, brevity, and language diversity in social media data make traditional text analysis techniques unsuitable for tweets (Metzler, Dumais, & Meek, 2007). One of the critical challenges in mining causal relations from tweets is accurately identifying a small fraction of relevant tweets in a large data pool (Biba, Ballhysa, & Vajjhala, 2010). The text mining of tweets is complicated by the tweets' informal nature, which makes identifying causal relationships difficult (Doan et al., 2019). The issue of unstructured data in tweets can be addressed to some extent through text normalization, term expansion, enhanced feature selection, and noise reduction, which improve the accuracy of clustering tweets (Beverungen & Kalita, 2010).

However, despite the abundance and benefits of social media data, some retailers lag in harnessing it. Hence, they lack a good understanding of their customers' needs and wants, which makes it difficult to know when and what to stock. While some retailers have made progress in addressing this problem, others have fallen short. A few retail outlets have striven to understand a consumer’s preference even before they visit the store (Behera & Nain, 2019). Retailers can employ social media analysis with NLP in order to understand and predict consumer demands. Social media analysis with NLP is paramount since accuracy in demand forecasting helps avoid overstocking, increase sales, and foster customer loyalty.

Several other approaches have been used to predict consumer preferences, including time series analysis. Time series analysis examines historical sales data with a high degree of precision.
For instance, time-series analysis can help predict consumer preferences based on a time-ordered historical sequence of sales data (Hilal et al., 2019). The time-series approach is better than alternative traditional methods because traditional methods do not provide a reasonable estimate of consumer demands (Hasin et al., 2011). This inadequacy affects sales. Therefore, precision in identifying consumer preferences is paramount in ensuring increased sales, revenue, and customer satisfaction.

The rest of this paper is organized as follows. Section 2 presents the methodology used in this study. Section 3 presents the findings of this study. Finally, we conclude our work and present some ideas for future research in Section 4.

2. Methodology

This study aims to perform a semantic analysis of tweets with the use of an NLP algorithm. This study focuses on building an intelligent application capable of predicting the category of goods a customer would most likely buy in a retail store. This application is plausible since NLP enables machines to understand human language and translate it into machine-readable languages (Rajput, 2020). Social media data is used in this study since it contains rich contextual information about a user [12]. This choice is also plausible because social media is a tool for sharing ideas and memories (Biba, Ballhysa, & Vajjhala, 2010). Therefore, a user's personality can be derived from their social media data (Zielinska, Welk, Mayhorn, & Murphy-Hill, 2016). Jung et al. (2017) also assert that a user's complete persona can be built based on their social media data. Tweets provide a range of information about the users of Twitter, including their behaviour, lifestyle, thoughts, and experiences (Doan et al., 2019). Tweets have a limited number of characters, and the use of hashtags makes it easier to process and search the data (Adwan et al., 2020).
The tweet data needs to be cleaned to reduce the noise in the text before feature extraction and feeding to the classifier (Lu, Feng, & O'Neill, 2020). The cleaning process also helps remove stop words, spelling mistakes, slang, etc., from the tweet data (Lu et al., 2020).

In this study, we have adopted a deductive and positivist approach. A quantitative methodology assumes that its hypothesis is derived from a hypothetical construct. A deductive approach reasons from a general statement to a specific conclusion. Such a method is well suited to developing a hypothesis from the construct of an existing theory. Hence, we apply this method to validate our argument.

In this study, the researchers seek to use NLP to analyse social media data (i.e., tweets) to predict a consumer’s preference for a category of goods in a retail store. Some consumers take to social media to express their satisfaction/dissatisfaction with a product/brand, speak with/about their preferred brand, or engage in their topics of interest. This online social interaction can influence their preference for a category of goods/brand in a retail store. Hence, this study employs a consumer model to understand the behavioural traits and the consumer's decision-making process before and after a purchase. A consumer model utilizes diagrams or pictures to demonstrate the conditions that influence the consumer via their surroundings and the company that sells/manufactures the product. The Nicosia model of consumer behaviour, which is the chosen model for this study, consists of four fields: communication of information to affect consumer attitudes, a search and evaluation process, a decision, and outcomes. This model captures the factors that influence a consumer's shopping behaviour. It is also concerned with the interactions between a brand and its consumers.
This study includes 53 registered Twitter.com users who have interacted with any of the most recent tweets by our intended retail outlet, i.e., SPAR Nigeria. For each user, we obtained 100 of their most recent tweets. This amounts to a total of 5,300 tweets for analysis. We also obtained a dataset of 1,324 non-surrogate emoji, their Unicode values, and their descriptions from www.unicode.org.

The experiment began with obtaining an API key and API secret key from our Twitter developer account, because all requests to the Twitter API endpoints must come from an authenticated client. A bearer token was then obtained from Twitter by sending a POST request to https://api.twitter.com/oauth2/token with grant_type set to client_credentials and the authorization type set to Basic Auth, using the API key as the username and the API secret key as the password. The response from this request is a JSON object containing the bearer token. This token was then added to the authorization header of our API client, i.e., Postman.

Processing the tweets began by extracting the text message within each JSON object. The text of each tweet was combined into a temporary corpus, which underwent an NLP technique called text normalization. At this phase, all emojis within the corpus were identified. Further normalization derived the Unicode value and description of each emoji based on the emoji dataset obtained earlier. Each description was then appended to the corpus. Next, our algorithm used regular expressions to remove all URLs in the corpus. After that, a word tokenization technique was applied to extract only words and numbers from the corpus, thereby eliminating whitespace, punctuation, etc. The tokenized corpus was parsed with a POS tagging algorithm, which identified all relevant parts of speech within the tokenized corpus. The corpus was then filtered for words that are only variations of nouns, e.g., pronouns, singular nouns, proper nouns, etc.
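The normalization steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: it assumes the standard-library `unicodedata` names as a stand-in for the unicode.org emoji dataset, treats any character in Unicode category "So" as an emoji, and takes POS tags as given (in practice a tagger such as NLTK's `pos_tag` could supply Penn Treebank tags). The function names, the sample tweet, and the hand-written tags are all hypothetical.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")    # matches http(s) links
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")  # keeps only words and numbers


def append_emoji_descriptions(text):
    """Append a lowercase description for every emoji-like symbol in text."""
    descriptions = []
    for ch in text:
        # Assumption: Unicode category "So" (Symbol, other) approximates "emoji".
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                descriptions.append(name.lower())
    return " ".join([text] + descriptions)


def normalize_tweet(text):
    """Apply the steps in order: emoji descriptions, URL removal, tokenization."""
    text = append_emoji_descriptions(text)
    text = URL_RE.sub(" ", text)
    return TOKEN_RE.findall(text)


def keep_nouns(tagged_tokens, prefixes=("NN",)):
    """Keep tokens whose Penn Treebank tag starts with one of the prefixes."""
    return [tok for tok, tag in tagged_tokens if tag.startswith(prefixes)]


# Hypothetical tweet; the URL is a placeholder.
tokens = normalize_tweet("Loving my new iPhone \U0001F600 https://t.co/abc123")

# Hand-written tags standing in for the output of a POS tagger.
tagged = [("Loving", "VBG"), ("my", "PRP$"), ("new", "JJ"),
          ("iPhone", "NN"), ("grinning", "VBG"), ("face", "NN")]
nouns = keep_nouns(tagged)
```

Passing `prefixes=("NN", "PRP")` would also retain pronouns, which the study counts among the noun variations; the default here keeps only the NN-family tags.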
The corpus was filtered because nouns are associated with places or things that reveal a consumer’s preference, e.g., a trip to America or an iPhone. The final corpus is a combination of all words that are only variations of a noun. The final corpus was sent to the IBM Watson NLU service for analysis. The resulting response is a JSON object containing a hierarchical structure of categorical topics and their scores.

This is a descriptive study since it holds a positivist perspective. The positivist paradigm is a school of thought that embraces the existence of a single, objective reality. Hence, our study employs the positivist approach with a deductive method. Thus, it begins with a global view before narrowing down to a particular topic. Orlikowski and Baroudi (1991) assert that a study that adopts a positivist viewpoint is primarily aimed at testing a theory with the belief of a prior relationship with the phenomena in question. Therefore, this study utilizes a quantitative method to understand the correlation between a consumer’s preference and their tweets. This decision was influenced by the research questions, aims, and objectives.

3. Findings

Table 1 below shows a total of 68 unique categories after analysing 5,300 tweets using the IBM Watson NLU service.
Table 1 - Topic Categories Extracted from the Tweets

Category                        Frequency
Pets                            1
Sports                          6
Cats                            1
Soccer                          3
Society                         43
Social Institutions             15
Automotive and Vehicles         6
Cars                            5
Performance
Divorce                         15
Family and Parenting            20
Coupe                           1
Babies and Toddlers             2
Car Culture                     2
Baby Clothes                    2
Radio                           1
Art and Entertainment           29
Wrestling                       1
Music                           6
Go Kart                         1
Music Genre                     6
Face and Body Care              1
Hip Hop                         2
Body Care                       1
Recording Industry              2
Vehicles                        2
Religion and Spirituality       2
Music Awards                    1
Christianity                    2
Music Reference                 3
Auto Parts                      1
Children                        17
Record Labels                   1
Sex                             2
Cosmetics                       1
Law, Govt and Politics          19
Eyeshadow                       1
Business and Industrial         1
Tech and Computing              5
Business Operations             1
Tech News                       2
Internet and Business Plans     1
Unrest and War                  16
World Music                     1
Crime                           3
Sports                          6
Personal Offense                3
Soccer                          3
Style and Fashion               19
Beauty                          3
Technology                      3
Automotive and Vehicles         6
Cars                            5
Performance
Hair Care                       3
Food and Drink                  7
Coupe                           1
Food and Drink                  4
Car Culture                     2
Condiments and Dressings        1
Radio                           1
Vehicles                        2
Body Art                        11
Wrestling                       1
Shows and Events                20
Go Kart                         1
News                            1
Face and Body Care              1

For the first user, the result shows that this user is more interested in Tech News under the parent category of Technology and Computing. This is evident from its score of 0.954031, which is higher than the score of 0.771889 given to Internet Technology under the same parent category. Meanwhile, for @vons_dev, the results show that this user is more interested in social networks under the subcategory of Internet Technology, which is also under the parent category of Technology and Computing. This is evident from its score of 0.890845, which is higher than the score of 0.88946 given to Tech News under the same parent category.
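The comparison of scores above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response is assumed to be a list of objects with a slash-separated `label` and a numeric `score`, the label strings are hypothetical renderings of the category names reported above, and `rank_categories` is an illustrative helper rather than part of any service API. Only the two scores for the first user are taken from the findings.

```python
def rank_categories(categories):
    """Return (path, score) pairs sorted from highest to lowest score."""
    ranked = []
    for item in categories:
        # Split "/parent/child" into ("parent", "child"), dropping empty parts.
        path = tuple(part for part in item["label"].split("/") if part)
        ranked.append((path, item["score"]))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Scores reported for the first user; the label format is an assumption.
first_user = [
    {"label": "/technology and computing/internet technology", "score": 0.771889},
    {"label": "/technology and computing/tech news", "score": 0.954031},
]

ranked = rank_categories(first_user)
top_path, top_score = ranked[0]
parent_category, preferred_topic = top_path[0], top_path[-1]
```

Keeping the full path alongside the score lets a retailer match either the parent category or the most specific subcategory against its own categories of goods.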
A retailer can use such information to personalize each user's shopping experience based on the hierarchical structure of the categories and the weights of the scores. The retailer may then make further connections between the topic categories predicted above and the categories of goods it has in stock.

This study extends the text normalization technique of NLP. It includes an algorithm that can translate emoji into their respective Unicode values and descriptions, making it easier to process emoji present in social media data. This technique can assist other researchers in processing social media data effectively. This study also introduced the concept of generating a corpus containing only words that are variations of a noun, e.g., pronouns, singular nouns, proper nouns, etc. Such a methodology can improve the accuracy of predictions. This was done in our study because nouns are associated with places or things that reveal a consumer’s preference, e.g., a trip to America or an iPhone.

4. Conclusion

This research has extended the text normalization technique of NLP to translate emoji present in social media data into their Unicode values and descriptions. The research focused on analyzing social media data with NLP to predict what a customer would buy in a retail store. In this study, we measured a 0.3 increase in accuracy when only various forms of nouns were extracted and analyzed. Further research may include Named-Entity Recognition (NER), especially for proper nouns. The researchers believe that this study will change the way NLP is applied in the retail industry. Therefore, the methodology and design used herein will improve the existing approaches that have already been employed concerning NLP and social media data analysis.

Despite addressing the gaps identified by the researchers, this study is limited in several ways. One limitation is that a user is only considered a customer if they follow our intended retailer, i.e., SPAR Nigeria, on Twitter.com. Therefore, users who are customers in reality but do not follow SPAR Nigeria on Twitter.com, are not registered users of Twitter.com, or have not interacted with any of our intended retailer’s tweets were excluded from this research. Another limitation of our study was the limited number of rules and patterns applied, because of which we might have missed some of the cause-effect relations. Future research could include the metadata of media files, utilize object detection algorithms to analyze media files, and include users who are customers in reality but may not follow SPAR Nigeria on Twitter.com. Future researchers may also consider using a well-annotated dataset of emoji, using a custom Machine Learning (ML) model, and confirming predictions with the actual Twitter account owners. From our analysis, we have identified a correlation between a user’s tweets and their shopping preferences.

REFERENCES

Alexandre, C., & Balsa, J. (2016). Client profiling for an anti-money laundering system. In C. A. Rocha Á., Adeli H., Reis L., Mendonça Teixeira M. (Ed.), New Advances in Information Systems and Technologies. Germany: Springer.
Alzubi, J. A., Jain, R., Kathuria, A., Khandelwal, A., Saxena, A., & Singh, A. (2020). Paraphrase identification using collaborative adversarial networks. Journal of Intelligent and Fuzzy Systems, 39(1), 1021-1032. doi:10.3233/JIFS-191933
Alzubi, J. A. (2016). Diversity-based boosting algorithm. International Journal of Advanced Computer Science and Applications, 7(5), 524-529.
Alzubi, J. A. (2015). Optimal classifier ensemble design based on cooperative game theory. Research Journal of Applied Sciences, Engineering and Technology, 11(12), 1336-1343.
Adwan, O. Y., Al-Tawil, M., Huneiti, A. M., Shahin, R. A., Zayed, A. A. A., & Al-Dibsi, R. H. (2020). Twitter sentiment analysis approaches: A survey. International Journal of Emerging Technologies in Learning, (15), 79-93.
doi:10.3991/ijet.v15i15.14467
Agarwal, A., & Jayant, A. (2019). Machine learning and natural language processing in supply chain management: A comprehensive review and future research directions. International Journal of Business Insights & Transformation, 13(1), 3-19.
Atefeh, F., Diana, I., & Graeme, H. (2017). Natural Language Processing for Social Media. Second Edition, New York: Morgan & Claypool.
Behera, G., & Nain, N. (2019). A comparative study of big mart sales prediction. In: Nain N., Vipparthi S., Raman B. (eds) Computer Vision and Image Processing. CVIP 2019. Communications in Computer and Information Science, vol 1147. Springer, Singapore. https://doi.org/10.1007/978-981-15-4015-8_37
Beverungen, G., & Kalita, J. (2010). Evaluating methods for summarizing Twitter posts. Fifth International ACM Conference on Web Search and Data Mining, Seattle, Washington, USA.
Biba, M., Ballhysa, E., & Vajjhala, N. R. (2010). A novel structure refining algorithm for statistical-logical models. International Conference on Complex, Intelligent and Software Intensive Systems, CISIS, Krakow, Poland.
Doan, S., Yang, E. W., Tilak, S. S., Li, P. W., Zisook, D. S., & Torii, M. (2019). Extracting health-related causality from twitter messages using natural language processing. BMC Medical Informatics and Decision Making, 79(1), 156-169. doi:10.1186/s12911-019-0785-0
Erik, H., & Emanuel, R. (2018). Big data analytics and demand forecasting in supply chains: A conceptual analysis. The International Journal of Logistics Management, 29(2), 739-766. doi:10.1108/IJLM-04-2017-0088
Gupta, D., Ahlawat, A., & Sagar, K. (2017). Usability prediction and ranking of SDLC models using fuzzy hierarchical usability model. Open Engineering, 7(1), 161-168. https://doi.org/10.1515/eng-2017-0021
Hasin, M., Ghosh, S., & Shareef, M. (2011). An ANN approach to demand forecasting in retail trade in Bangladesh.
International Journal of Trade, Economics and Finance, 12(2), 154-160. doi:10.7763/IJTEF.2011.V2.95
Hilal, Z. K., Akyuz, A. O., Mitat, U., Selim, A., Uysal, M. O., Berna Atak, B., & Mehmet Ali, E. (2019). An improved demand forecasting model using deep learning approach and proposed decision integration strategy for supply chain. Complexity, 21(9), 125-142. doi:10.1155/2019/9067367
Jinyan, C., Becken, S., & Stantic, B. (2019). Lexicon based Chinese language sentiment analysis method. Computer Science & Information Systems, 16(2), 639-655. doi:10.2298/CSIS181015013C
Jung, S.-G., An, J., Kwak, H., Ahmad, M., Nielsen, L., & Jansen, B. J. (2017). Persona generation from aggregated social media data. Paper presented at the Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, Denver, Colorado, USA. https://doi.org/10.1145/3027063.3053120
Lu, X., Feng, F., & O'Neill, Z. (2020). Occupancy sensing in buildings through social media from semantic analysis. ASHRAE Transactions, 126(1), 265-272.
Metzler, D., Dumais, S., & Meek, C. (2007). Similarity measures for short segments of text. In: Amati G., Carpineto C., Romano G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_5
Orlikowski, W., & Baroudi, J. (1991). Studying information technology in organizations: Research approaches and assumptions. Information Systems Research, 2, 1-28. doi:10.1287/isre.2.1.1
Rajput, A. (2020). Chapter 3 - Natural language processing, sentiment analysis, and clinical analytics. In M. D. Lytras & A. Sarirete (Eds.), Innovation in Health Informatics (pp. 79-97): Academic Press.
Vajjhala, N. R., Rakshit, S., Oshogbunu, M., & Salisu, S. (2020). Novel user preference recommender system based on Twitter profile analysis. In: Borah S., Pradhan R., Dey N., Gupta P. (eds) Soft Computing Techniques and Applications.
Advances in Intelligent Systems and Computing, vol 1248. Springer, Singapore. https://doi.org/10.1007/978-981-15-7394-1_7
Xu, D., Zhang, L., & Luo, J. (2010). Understanding multimedia content using web scale social media data. 18th ACM International Conference on Multimedia, Firenze, Italy.
Yeo, J., Hwang, S., Koh, E., & Lipka, N. (2020). Conversion prediction from clickstream: Modeling market prediction and customer predictability. IEEE Transactions on Knowledge and Data Engineering, 32(2), 246-259. doi:10.1109/TKDE.2018.2884467
Yoosin, K., & Seung Ryul, J. (2015). Opinion-mining methodology for social media analytics. KSII Transactions on Internet & Information Systems, 9(1), 391-406. doi:10.3837/tiis.2015.01.024
Zhou, A., Qian, W., & Ma, H. (2012). Social media data analysis for revealing collective behaviors. 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China. https://doi.org/10.1145/2339530.2339746
Zielinska, O., Welk, A., Mayhorn, C. B., & Murphy-Hill, E. (2016). The persuasive phish: Examining the social psychological principles hidden in phishing emails. Symposium and Bootcamp on the Science of Security, Pittsburgh, Pennsylvania. https://doi.org/10.1145/2898375.2898382
