4th International Conference on Innovative Computing and Communication
Applying natural language processing algorithm for predicting
consumer product preferences in retail stores
Michael Akpovona Oshogbunu (a), Narasimha Rao Vajjhala (b)*, Sandip Rakshit (a), and Olumide Longe (a)
(a) American University of Nigeria, Yola, Adamawa, Nigeria, PMB 2250
(b) University of New York Tirana, Kodra Diellit, Tirana, Albania, 1048
michael.oshogbunu@aun.edu.ng, narasimharaonarasimha@gmail.com*, sandip.rakshit@aun.edu.ng, olumide.longe@aun.edu.ng
Abstract: Technology has changed the way retailers predict and understand consumer behaviour. One such technology that can enable retailers to
understand consumer preference is Natural Language Processing (NLP). Social media content including the opinions and interests of the customers is
recognized as a valuable source of information for businesses. This study aims to perform a semantic analysis of tweets with the use of an NLP algorithm.
It focuses on building an intelligent application that analyses social media data with NLP to predict the category of goods a customer would
most likely buy in a retail store. In this study, we measured a 0.3
increase in accuracy when only various forms of nouns were extracted and analysed. Further research may include Named-Entity Recognition (NER),
especially for proper nouns. The researchers believe that this study will contribute to changing the trajectory in which NLP is applied in the retail industry.
Therefore, the methodology and design used herein will improve the existing approaches that have already been employed concerning NLP and social
media data analysis.
Keywords: Machine Learning, Natural Language Processing, Prediction, Retail, Intelligent Systems, Social Media, Semantic, Analysis, Mining.
1. Introduction
1.1. Natural Language Processing
Natural Language Processing (NLP) works both on text and speech data
along with different types of engineering data for the development of
intelligent systems (Agarwal & Jayant, 2019; Gupta, Ahlawat, & Sagar,
2017; Alzubi, 2015). Natural language processing techniques can help
analyze the social media text, including tweets on Twitter and posts on
Facebook (Agarwal & Jayant, 2019). Retailers can understand customer
preferences by analyzing their customers' social media data on social
networking platforms; a methodology termed as the Semantic Analysis of
Social Media (SASM) (Atefeh, Diana, & Graeme, 2017; Vajjhala,
Rakshit, Oshogbunu, & Salisu, 2020). This discipline analyzes and
transforms social media data into social media intelligence for decision-makers. Hence, insights from social media data enable executives to lead
with contextual knowledge rather than intuition (Erik & Emanuel, 2018).
Machine learning algorithms have several advantages, including the
ability to reduce uncertainty and predict precisely (Agarwal & Jayant,
2019). Machine learning algorithms also allow real-time analysis and
advance forecasting coupled with processing of large volumes of data
(Agarwal & Jayant, 2019). Alzubi et al. (2020) found that neural network-based
approaches outperform traditional methods in measuring sentence similarity
based on feature engineering and linguistic tools. Hence, combining NLP with
machine learning algorithms can provide vital insights into consumer
behavior. NLP helps predict online consumer behavior by giving the
computing machines the ability to process textual data through computer
science, artificial intelligence, and linguistic algorithms. For example,
NLP makes it possible to conduct brand perception analysis on social
media platforms through the use of Named Entity Recognition (NER)
(Erik & Emanuel, 2018).
NLP is a computer-based approach to analyzing textual data using a set of
theories and techniques. NLP is the theoretically motivated range of
computational methods for analyzing and representing naturally occurring
texts at one or more linguistic analysis levels to achieve human-like
language processing for various tasks or applications (Yoosin & Seung
Ryul, 2015). The phrase “range of computational methods” is essential
because several methods are available for any given type of
language analysis. The term “naturally occurring texts”
covers oral or written human language gathered from a real data
source, e.g., social media. The phrase “linguistic analysis levels” accounts for
the multiple types of language processing at work when humans produce or
comprehend language, and the term “human-like language processing”
indicates that NLP is a sub-field of artificial intelligence (AI). The phrase
“for various tasks or applications” portrays NLP not as the goal but as the
means of accomplishing a task.
Researchers have proposed several methodologies in Natural Language
Processing, demand forecasting, and social media data analysis.
Techniques such as Extreme Gradient Boosting – XgBoost (Yeo et al.,
2020), DivBoosting (Alzubi, 2016), Random Forest (Rajput, 2020), Long
Short-Term Memory (Adwan et al., 2020), and Artificial Neural Network
(Hasin, Ghosh, & Shareef, 2011) have been used for predicting consumer
preferences in the retail sector. Although NLP is still in its nascent stage
in the retail industry, it has been well utilized for psychiatric purposes
(Rajput, 2020). Researchers working on Twitter mining have mainly used
keyword/dictionary lookup and machine learning techniques, the latter
including support vector machines, logistic regression, Naïve Bayes, and
neural networks (Doan et al., 2019; Jinyan, Becken, & Stantic, 2019).
OSHOGBUNU ET AL.
Social media contains rich contextual information about its users (Xu,
Zhang, & Luo, 2010). The user's behavioral pattern can be constructed
with adequate social media data (Zhou, Qian, & Ma, 2012). NLP is one of
the technologies that can be used to analyze social media data. Although
this methodology is still relatively new, it has been well-utilized in
journalism, product recommendation, healthcare, and security. For
example, about 25% of all major news is derived from social media
(Atefeh et al., 2017). When such data is analyzed, insights on public
opinion, societal unrest, and nation-wide sentiment analysis are derived.
Therefore, the benefits of applying NLP to social media data are evident and
immense.
1.2. Challenges of using NLP and Twitter Mining
Social media content including the opinions and interests of the customers
is recognized as a valuable source of information for businesses (Biba,
Ballhysa, & Vajjhala, 2010; Yoosin & Seung Ryul, 2015). Social media
data has many facets,
including volume, speed, complexity, high dimensionality, lack of
structure, as well as incompleteness (Jinyan et al., 2019). One of the
critical reasons for the low levels of NLP application is that traditional
NLP techniques do not work well with the unstructured nature of social
media data. Hence, NLP techniques need to be adapted for social media
data analysis. NLP needs raw data
that can be extracted from various sources, including web data, social
media data, audio signals, and reports as well as related operational data
(Agarwal & Jayant, 2019). NLP can then be applied on the data to derive
meaningful information along with visualizations. The unstructured nature
of data on social media emanates from the presence of slang, emojis,
abbreviations, typographical errors, non-standard spelling, etc. (Atefeh et
al., 2017). Such incorrectness, sparseness, brevity, and language diversity
in social media data make traditional text analysis techniques unsuitable
for tweets (Metzler, Dumais, & Meek, 2007). One of the critical
challenges in text mining from tweets is accurately identifying the small
fraction of relevant tweets, and the causal relations they express, within
a large data pool (Biba, Ballhysa, & Vajjhala, 2010). The text mining of
tweets is further complicated by the tweets' informal nature, which makes
identifying causal relationships difficult (Doan et al., 2019).
The issue of unstructured data, for instance, in tweets can be
addressed to some extent through text normalization, term expansion,
enhanced feature selection, and noise reduction to improve the accuracy
of clustering tweets (Beverungen & Kalita, 2010).
However, despite the abundance and benefits of social media data,
some retailers lag in harnessing it. Hence, they lack a good understanding
of the needs and wants of their customers. This deficiency makes it
difficult to know when and what to stock. While some retailers have fallen
short in this respect, a few retail outlets have striven to understand a consumer’s
preference even before they visit the store (Behera & Nain, 2019).
Retailers can employ social media analysis with NLP in order to
understand and predict consumer demands. Social media analysis with
NLP is paramount since accurate demand forecasts help avoid
overstocking, increase sales, and foster customer loyalty.
Several other approaches have been used to predict consumer
preferences, including Time Series Analysis. Time series analysis
analyzes historical sales data up to a high degree of precision. For
instance, time-series analysis can help predict consumer preferences based
on a time-based historical sequence of sales data (Hilal et al., 2019). The
time-series approach is better than alternative traditional methods because
traditional methods do not provide a reasonable estimate of consumer
demands (Hasin et al., 2011). This inadequacy affects sales. Therefore,
precision in identifying consumer preferences is paramount in ensuring
increased sales, revenue, and customer satisfaction.
The rest of this paper is organized as follows. Section 2 presents the
methodology used in this study. Section 3 presents the findings of this
study. Finally, we conclude our work and present some ideas for future
research in Section 4.
2. Methodology
This study aims to perform a semantic analysis of tweets with the use of
an NLP algorithm. This study focuses on building an intelligent
application capable of predicting the category of goods a customer would
most likely buy in a retail store. This application is plausible since NLP
enables machines to understand human language and translate it into
machine-readable languages (Rajput, 2020). Social media data is used in
this study since it contains rich contextual information about a user (Xu,
Zhang, & Luo, 2010) and serves as a tool for sharing ideas and memories
(Biba, Ballhysa, & Vajjhala, 2010). Therefore, a user's personality can be
derived from their social media data (Zielinska, Welk, Mayhorn, &
Murphy-Hill, 2016). Jung et al. (2017) also assert that a user's complete
persona can be built based on their social media data.
Tweets provide a range of information about the users of Twitter,
including the behaviour of the users, lifestyle, thoughts, and experiences
(Doan et al., 2019). Tweets have a limited number of characters, and the use
of hashtags makes it easier to process and search data (Adwan et al.,
2020). The tweets' data needs to be cleaned to reduce the noise in the text
data before feature extraction and feeding to the classifier (Lu, Feng, &
O'Neill, 2020). The cleaning process also helps remove stop words,
spelling mistakes, slang, etc., from the tweet data (Lu et al., 2020). In this
study, we have adopted a deductive and positivist approach. A
quantitative methodology assumes that its hypothesis is derived from a
hypothetical construct. A deductive approach reasons from a general
statement to a specific conclusion. Such a method is well suited to
developing a hypothesis from the construct of an existing theory. Hence,
we will apply this method to validate our argument. In this study, the
researchers seek to use NLP to analyse social media data (i.e., tweets) to
predict a consumer’s preference for a category of goods in a retail store.
Some consumers take to social media to express their
satisfaction/dissatisfaction with a product/brand, speak with/about their
preferred brand, or engage in their interest topics. This online social
interaction can influence their preference for a category of goods/brand in
a retail store. Hence, this study employs a consumer model to understand
the behavioural trait and the consumer's decision-making process before
and after a purchase. A consumer model utilizes diagrams or pictures to
demonstrate the conditions that influence the consumer via its
surroundings and the company that sells/manufactures the product. The
Nicosia model of consumer behaviour consists of four fields:
communication of information to affect consumer attitudes, search and
evaluation process, a decision, and outcomes. The Nicosia model of
consumer behaviour is the chosen model for this study. This model
captures those factors that influence a consumer's shopping behaviour.
This model is also concerned with the interactions between a brand and its
consumer.
This study includes 53 registered Twitter.com users who have
interacted with any of the most recent tweets by our intended retail outlet,
i.e., SPAR Nigeria. For each user, we obtained 100 of their most recent
tweets. This amounts to a total of 5,300 tweets for analysis. We
also obtained a dataset of 1,324 non-surrogate emoji, with their Unicode values and
descriptions, from www.unicode.org. The experiment begins with obtaining
an API key and API secret key from our Twitter developer account, since
all requests to the Twitter API endpoints must come from an
authenticated client. A bearer token was then obtained from Twitter by
sending a POST request to https://api.twitter.com/oauth2/token with
the parameter grant_type set to client_credentials, the authorization
type set to basic auth, the username set to the API key, and the password
set to the API secret key. The response to this request is a JSON object containing the
bearer token. This token was then added to the authorization header of
our API client, i.e., Postman.
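The token request described above was issued from Postman; the following is only a minimal Python sketch of an equivalent request, using the standard library. The function names are illustrative, the `access_token` field name is assumed from the description of the JSON response, and any URL-encoding of the credentials required by the API is omitted for brevity.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://api.twitter.com/oauth2/token"


def build_token_request(api_key, api_secret):
    """Build (but do not send) the POST request for a bearer token."""
    # Basic auth: the API key is the username, the API secret key the password.
    credentials = f"{api_key}:{api_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    # The form body carries grant_type=client_credentials.
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
        },
    )


def fetch_bearer_token(api_key, api_secret):
    """Send the request and read the token from the JSON response."""
    with urllib.request.urlopen(build_token_request(api_key, api_secret)) as resp:
        payload = json.load(resp)
    return payload["access_token"]  # assumed field name
```

Subsequent requests would then carry the header `Authorization: Bearer <token>`.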
Processing the tweets begins by extracting the text message within
each JSON object. The extracted text of each tweet is combined into a
temporary corpus, which undergoes an NLP technique called text
normalization. At this phase, all emojis within the corpus are identified.
Further normalization derives the Unicode value and description of each
emoji based on the emoji dataset obtained earlier. Each description is
then appended to the corpus. Next, our algorithm uses regular expressions
to remove all URLs in the corpus. After that, a word tokenization
technique is applied to extract only words and numbers from the corpus,
thereby eliminating whitespace, punctuation, etc. The tokenized corpus
is parsed with a POS tagging algorithm, which identifies the relevant
parts of speech within the tokenized corpus. The corpus is then filtered for
words that are variations of nouns, e.g., pronouns, singular nouns,
proper nouns, etc. The corpus is filtered in this way because nouns are associated
with places or things that reveal a consumer’s preference, e.g., a trip to
America or an iPhone. The final corpus is thus a combination of all words that
are variations of a noun. The final corpus is sent to the IBM Watson
NLU service for analysis. The resulting response is a JSON object
containing a hierarchical structure of categorical topics and their scores.
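The normalization and tokenization steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: emoji descriptions are taken from Python's built-in `unicodedata` names rather than the unicode.org dataset used in the study, any character in the Unicode category "So" (Symbol, other) is treated as an emoji, and the tokenizer keeps ASCII letters and digits only.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")    # matches http(s) URLs up to the next space
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")  # words and numbers only (ASCII, for brevity)


def describe_emojis(text):
    """Return (code point, description) pairs for emoji-like characters."""
    found = []
    for ch in text:
        # Heuristic: category "So" covers emoji but also a few other symbols.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                found.append((f"U+{ord(ch):04X}", name.lower()))
    return found


def normalize(text):
    """Append emoji descriptions, strip URLs, then tokenize."""
    descriptions = [desc for _, desc in describe_emojis(text)]
    text = " ".join([text, *descriptions])  # descriptions join the corpus
    text = URL_RE.sub(" ", text)            # remove URLs
    return TOKEN_RE.findall(text)           # drops whitespace, punctuation, emoji


tokens = normalize("Loving my new iPhone 😀 https://t.co/abc123")
# tokens == ["Loving", "my", "new", "iPhone", "grinning", "face"]
```

The emoji character itself disappears during tokenization, while its description survives as ordinary words that later stages can tag and score.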
This is a descriptive study since it holds a positivist perspective. A
positivist paradigm is a school of thought that embraces the genuineness
of an objective in a single and solid form. Hence, our study employs the
positivist approach with a deductive method. Thus, it begins with a global
view before narrowing down into a particular topic. Orlikowski and
Baroudi (1991) assert that a study that adopts a positivist viewpoint is
primarily aimed at testing a theory with the belief of a prior relationship
with the phenomena in question. Therefore, this study utilizes a
quantitative method to understand the correlation between a consumer’s
preference and their tweets. This decision was influenced by the research
questions, aims, and objectives.
3. Findings
Table 1 below shows a total of 68 unique categories after analysing 5,300
tweets using the IBM Watson NLU service.
Table 1 - Topic Categories Extracted from the Tweets
Category                    Frequency   Category                    Frequency
Pets                        1           Sports                      6
Cats                        1           Soccer                      3
Society                     43          Social Institutions         15
Automotive and Vehicles     6           Cars                        5
Performance Vehicles        2           Divorce                     15
Family and Parenting        20          Coupe                       1
Babies and Toddlers         2           Car Culture                 2
Baby Clothes                2           Radio                       1
Art and Entertainment       29          Wrestling                   1
Music                       6           Go Kart                     1
Music Genre                 6           Face and Body Care          1
Hip Hop                     2           Body Care                   1
Recording Industry          2           Religion and Spirituality   2
Music Awards                1           Christianity                2
Music Reference             3           Auto Parts                  1
Children                    17          Record Labels               1
Sex                         2           Cosmetics                   1
Law, Govt and Politics      19          Eyeshadow                   1
Business and Industrial     1           Tech and Computing          5
Business Operations         1           Tech News                   2
Internet and Technology     3           Business Plans              1
Unrest and War              16          World Music                 1
Crime                       3           Personal Offense            3
Style and Fashion           19          Beauty                      3
Hair Care                   3           Food and Drink              7
Food and Drink              4           Condiments and Dressings    1
Body Art                    11          Shows and Events            20
News                        1
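Table 1 lists parent and child categories side by side (for example, Pets and Cats). The study does not state how the frequencies were tallied; one plausible approach, shown here purely as an assumption, is to count every level of each slash-delimited category label returned for the analysed users. The label format and the sample labels below are illustrative.

```python
from collections import Counter


def tally_categories(labels_per_user):
    """Count each level of every category label across all users."""
    counts = Counter()
    for labels in labels_per_user:
        for label in labels:
            # "/pets/cats" contributes one count to "pets" and one to "cats".
            for level in label.split("/"):
                if level:
                    counts[level] += 1
    return counts


# Hypothetical labels for three users.
sample = [
    ["/pets/cats", "/sports/soccer"],
    ["/sports/soccer"],
    ["/sports"],
]
counts = tally_categories(sample)
# counts["sports"] == 3, counts["soccer"] == 2, counts["pets"] == 1, counts["cats"] == 1
```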
For the first user, the result shows that this user is most interested in Tech
News under the parent category Technology and Computing. This is evident
from a score of 0.954031, which is higher than the score of 0.771889
given to Internet Technology under the same parent category.
Meanwhile, for @vons_dev, the results show
that this user is most interested in Social Networks, under the subcategory
Internet Technology, which is itself under the parent category
Technology and Computing. This is evident from a score of 0.890845,
which is higher than the score of 0.88946 given to Tech News under
the same parent category. A retailer can use such
information to personalize each user's shopping experience based on the
hierarchical structure of the categories and the weights of the scores. The
retailer may then make further connections between the topic categories
predicted above and the categories of goods it has in stock.
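The comparison of scores within a parent category can be sketched as follows. The response shape is an assumption: a `categories` list whose entries carry a slash-delimited `label` and a numeric `score`. The two scores are those reported above for the first user.

```python
import json

# Assumed response shape; scores are the first user's reported values.
SAMPLE_RESPONSE = """
{"categories": [
  {"label": "/technology and computing/internet technology", "score": 0.771889},
  {"label": "/technology and computing/tech news", "score": 0.954031}
]}
"""


def rank_categories(response):
    """Return (category path, score) pairs sorted by descending score."""
    rows = []
    for item in response.get("categories", []):
        # "/a/b" becomes ["a", "b"]; the first element is the parent category.
        path = [part for part in item["label"].split("/") if part]
        rows.append((path, item["score"]))
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = rank_categories(json.loads(SAMPLE_RESPONSE))
top_path, top_score = ranked[0]
# top_path == ["technology and computing", "tech news"], top_score == 0.954031
```

A retailer could map the last element of each path, or its parent, to a category of goods in stock, weighting by the score.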
This study extends the text normalization technique of NLP. It
includes an algorithm that can translate emoji into their respective
Unicode values and descriptions, making it easier to process emoji present
in social media data. This technique may assist other researchers in
processing social media data effectively. This study also introduced the
concept of generating a corpus containing only words that are variations
of a noun, e.g., pronouns, singular nouns, proper nouns, etc. Such a
methodology can improve the accuracy of predictions. This was done in
our study because nouns are associated with places or things that reveal a
consumer’s preference, e.g., a trip to America or an iPhone.
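A minimal sketch of the noun-only filter is given below. It assumes the tokens have already been tagged with Penn Treebank part-of-speech tags (for instance by a tagger such as NLTK's `pos_tag`; the study does not name the tagger used), and it keeps pronouns because the study counts them among the noun variations. The tag set and example sentence are illustrative.

```python
# Penn Treebank tags treated as "variations of a noun" in this sketch:
# singular, plural, proper singular, proper plural, and personal pronoun.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP"}


def keep_nouns(tagged_tokens):
    """Keep only the words whose POS tag is in NOUN_TAGS."""
    return [word for word, tag in tagged_tokens if tag in NOUN_TAGS]


# Hand-tagged example; a real pipeline would obtain the tags from a POS tagger.
tagged = [
    ("I", "PRP"), ("want", "VBP"), ("an", "DT"), ("iPhone", "NNP"),
    ("and", "CC"), ("a", "DT"), ("trip", "NN"), ("to", "TO"),
    ("America", "NNP"),
]
noun_corpus = keep_nouns(tagged)
# noun_corpus == ["I", "iPhone", "trip", "America"]
```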
4. Conclusion
This research has extended the text normalization technique of NLP
to translate emoji present in social media data into their Unicode value
and description. The research was focused on analyzing social media data
with NLP to predict what a customer would buy in a retail store. In this
study, we measured a 0.3 increase in accuracy when only various forms of
nouns were extracted and analyzed. Further research may include Named-Entity Recognition (NER), especially for proper nouns. The researchers
believe that this study will change the trajectory in which NLP is applied
in the retail industry. Therefore, the methodology and design used herein
will improve the existing approaches that have already been employed
concerning NLP and social media data analysis. Despite addressing the gaps
identified by the researchers, this study is limited in several ways. One
limitation is that a user is only considered a customer if they follow our
intended retailer, i.e., SPAR Nigeria, on Twitter.com. Therefore, users who
are customers in reality but do not follow SPAR Nigeria on Twitter.com,
are not registered users of Twitter.com, or have not interacted with any of
our intended retailer’s tweets were excluded from this research. Another
limitation of our study was the limited number of rules and patterns
applied, because of which we might have missed some of the cause-effect
relations. Future research could include the metadata of media files, utilize
object detection algorithms to analyze media files, and include users who are
customers in reality but may not follow SPAR Nigeria on Twitter.com. Future
researchers may also consider using a well-annotated dataset of emoji, using
a custom Machine Learning (ML) model, and confirming predictions with
the actual Twitter account owners. From our analysis, we have identified
a correlation between a user’s tweet and their shopping preference.
REFERENCES
Alexandre, C., & Balsa, J. (2016). Client profiling for an anti-money
laundering system. In C. A. Rocha Á., Adeli H., Reis L., Mendonça
Teixeira M. (Ed.), New Advances in Information Systems and Technologies.
Germany: Springer.
Alzubi, J.A., Jain, R., Kathuria, A., Khandelwal, A., Saxena, A., Singh, A.
(2020). Paraphrase identification using collaborative adversarial networks.
Journal of Intelligent and Fuzzy Systems, 39(1), 1021-1032.
DOI:10.3233/JIFS-191933
Alzubi, J.A. (2016). Diversity-based boosting algorithm. International Journal
of Advanced Computer Science and Applications, 7(5), 524-529.
Alzubi, J.A. (2015). Optimal classifier ensemble design based on cooperative
game theory. Research Journal of Applied Sciences, Engineering and
Technology, 11(12), 1336 -1343.
Adwan, O. Y., Al-Tawil, M., Huneiti, A. M., Shahin, R. A., Zayed, A. A. A., &
Al-Dibsi, R. H. (2020). Twitter sentiment analysis approaches: A survey.
International Journal of Emerging Technologies in Learning, (15), 79-93.
doi:10.3991/ijet.v15i15.14467
Agarwal, A., & Jayant, A. (2019). Machine learning and natural language
processing in supply chain management: A comprehensive review and
future research directions. International Journal of Business Insights &
Transformation, 13(1), 3-19.
Atefeh, F., Diana, I., & Graeme, H. (2017). Natural Language Processing for
Social Media. Second Edition, New York: Morgan & Claypool.
Behera, G., & Nain, N. (2019). A Comparative study of big mart sales
prediction. In: Nain N., Vipparthi S., Raman B. (eds) Computer Vision and
Image Processing. CVIP 2019. Communications in Computer and
Information Science, vol 1147. Springer, Singapore.
https://doi.org/10.1007/978-981-15-4015-8_37.
Beverungen, G., & Kalita, J. (2010). Evaluating methods for summarizing
Twitter posts. Fifth International ACM Conference on Web Search and
Data Mining, Seattle, Washington, USA.
Biba, M., Ballhysa, E., & Vajjhala, N. R. (2010). A Novel Structure Refining
Algorithm for Statistical-Logical Models. International Conference on
Complex, Intelligent and Software Intensive Systems, CISIS, Krakow,
Poland.
Doan, S., Yang, E. W., Tilak, S. S., Li, P. W., Zisook, D. S., & Torii, M.
(2019). Extracting health-related causality from twitter messages using
natural language processing. BMC Medical Informatics and Decision
Making, 79(1), 156-169. doi:10.1186/s12911-019-0785-0
Erik, H., & Emanuel, R. (2018). Big data analytics and demand forecasting in
supply chains: A conceptual analysis. The International Journal of
Logistics Management, 29(2), 739-766. doi:10.1108/IJLM-04-2017-0088
Gupta, D., Ahlawat, A., Sagar, K. (2017). Usability prediction and ranking of
SDLC models using fuzzy hierarchical usability model. Open Engineering,
7(1), 161-168. doi: https://doi.org/10.1515/eng-2017-0021
Hasin, M., Ghosh, S., & Shareef, M. (2011). An ANN approach to demand
forecasting in retail trade in Bangladesh. International Journal of Trade,
Economics and Finance, 12(2), 154-160. doi:10.7763/IJTEF.2011.V2.95
Hilal, Z. K., Akyuz, A. O., Mitat, U., Selim, A., Uysal, M. O., Berna Atak, B.,
& Mehmet Ali, E. (2019). An improved demand forecasting model using
deep learning approach and proposed decision integration strategy for
supply chain. Complexity, 21(9), 125-142. doi:10.1155/2019/9067367
Jinyan, C., Becken, S., & Stantic, B. (2019). Lexicon based Chinese language
sentiment analysis method. Computer Science & Information Systems,
16(2), 639-655. doi:10.2298/CSIS181015013C
Jung, S.-G., An, J., Kwak, H., Ahmad, M., Nielsen, L., & Jansen, B. J. (2017).
Persona generation from aggregated social media data. Paper presented at
the Proceedings of the 2017 CHI Conference Extended Abstracts on
Human Factors in Computing Systems, Denver, Colorado, USA.
https://doi.org/10.1145/3027063.3053120
Lu, X., Feng, F., & O'Neill, Z. (2020). Occupancy sensing in buildings through
social media from semantic analysis. ASHRAE Transactions, 126(1), 265-272.
Metzler, D., Dumais, S., & Meek, C. (2007). Similarity measures for short
segments of text. In: Amati G., Carpineto C., Romano G. (eds) Advances
in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science,
vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_5.
Orlikowski, W., & Baroudi, J. (1991). Studying information technology in
organizations: Research approaches and assumptions. Information Systems
Research, 2, 1-28. doi:10.1287/isre.2.1.1
Rajput, A. (2020). Chapter 3 - Natural language processing, sentiment analysis,
and clinical analytics. In M. D. Lytras & A. Sarirete (Eds.), Innovation in
Health Informatics (pp. 79-97): Academic Press.
Vajjhala, N. R., Rakshit, S., Oshogbunu, M., & Salisu, S. (2020). Novel user
preference recommender system based on Twitter profile analysis. In:
Borah S., Pradhan R., Dey N., Gupta P. (eds) Soft Computing Techniques
and Applications. Advances in Intelligent Systems and Computing, vol
1248. Springer, Singapore. https://doi.org/10.1007/978-981-15-7394-1_7.
Xu, D., Zhang, L., & Luo, J. (2010). Understanding multimedia content using
web scale social media data. 18th ACM International Conference on
Multimedia, Firenze, Italy.
Yeo, J., Hwang, S., Koh, E., & Lipka, N. (2020). Conversion prediction from
clickstream: Modeling market prediction and customer predictability. IEEE
Transactions on Knowledge and Data Engineering, 32(2), 246-259.
doi:10.1109/TKDE.2018.2884467
Yoosin, K., & Seung Ryul, J. (2015). Opinion-mining methodology for social
media analytics. KSII Transactions on Internet & Information Systems,
9(1), 391-406. doi:10.3837/tiis.2015.01.024
Zhou, A., Qian, W., & Ma, H. (2012). Social media data analysis for revealing
collective behaviors. 18th ACM SIGKDD international conference on
Knowledge discovery and data mining, Beijing, China.
https://doi.org/10.1145/2339530.2339746
Zielinska, O., Welk, A., Mayhorn, C. B., & Murphy-Hill, E. (2016). The
persuasive phish: examining the social psychological principles hidden in
phishing emails. Symposium and Bootcamp on the Science of Security,
Pittsburgh, Pennsylvania. https://doi.org/10.1145/2898375.2898382.