Tourism enhancement using LLMs and Neural Network

Report Submitted in partial fulfilment of requirements for the

B.Tech. degree in Computer Science and Engineering

By
Name of Student Roll Number
Anurag Verma 2020UCO1580
Gaurav Bansal 2020UCO1585
Rohit Kumar 2020UCO1597

Under the supervision

Of

Dr. Vandana Bhatia

Department of Computer Science and Engineering


Netaji Subhas University of Technology (NSUT) New Delhi,
India-110078

May 2024
Certificate

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

This is to certify that the work embodied in the project thesis titled "Tourism
enhancement using LLMs and Neural Networks" by Anurag Verma
(2020UCO1580), Gaurav Bansal (2020UCO1585) and Rohit Kumar
(2020UCO1597) is the bonafide work of the group submitted to Netaji Subhas
University of Technology for consideration in 8th Semester B.Tech. Project
Evaluation.

The original Research work was carried out by the team under my/our guidance
and supervision in the academic year 2023-2024. This work has not been
submitted for any other diploma or degree of any university. On the basis of
declaration made by the group, we recommend the project report for evaluation.

Dr. Vandana Bhatia


(Assistant Professor)
Department of Computer Science & Engineering
Netaji Subhas University of Technology

Candidate(s) Declaration

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

I/We, Anurag Verma (2020UCO1580), Gaurav Bansal (2020UCO1585) and Rohit
Kumar (2020UCO1597) of B. Tech. Department of Computer Science &
Engineering, hereby declare that the Project-Thesis titled "Tourism enhancement
using LLMs and Neural Networks" which is submitted by me/us to the Department
of Computer Science & Engineering, Netaji Subhas University of Technology
(NSUT) Dwarka, New Delhi in partial fulfilment of the requirement for the award of
the degree of Bachelor of Technology is original and not copied from any source
without proper citation. The manuscript has been subjected to plagiarism checks by
Turnitin software. This work has not previously formed the basis for the award of
any Degree.

Place: Netaji Subhas University of Technology


Date: 29 February, 2024

Anurag Verma Gaurav Bansal Rohit Kumar


2020UCO1580 2020UCO1585 2020UCO1597

Certificate of Declaration

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

This is to certify that the Project-Thesis titled "Tourism enhancement using LLMs
and Neural Networks" which is being submitted by Anurag Verma
(2020UCO1580), Gaurav Bansal (2020UCO1585) and Rohit Kumar
(2020UCO1597) to the Department of Computer Science & Engineering, Netaji
Subhas University of Technology (NSUT) Dwarka, New Delhi in partial fulfilment of
the requirement for the award of the degree of Bachelor of Technology, is a record of
the thesis work carried out by the students under my supervision and guidance. The
content of this thesis, in full or in parts, has not been submitted for any other degree or
diploma.

Place:

Date:

Dr. Vandana Bhatia


(Assistant Professor)
Department of Computer Science & Engineering
Netaji Subhas University of Technology

Acknowledgement

We would like to express our gratitude and appreciation to all those who made
it possible for us to complete this project. Special thanks to our project supervisor(s)
Dr. Vandana Bhatia whose help, stimulating suggestions and encouragement helped
us in writing this report. We also sincerely thank our colleagues for the time spent
proofreading and correcting our mistakes.

We would also like to acknowledge with much appreciation the crucial role of
the staff in the Department of Computer Science & Engineering, who gave us the
permission to use the lab and the systems and gave permission to use all necessary
things related to the project.

Anurag Verma Gaurav Bansal Rohit Kumar


2020UCO1580 2020UCO1585 2020UCO1597

Plagiarism Report

Abstract

Traveller happiness and destination management have grown more dependent on the
utilisation of data-driven approaches in the fast changing tourism market. In order to
maximise the complete travel experience, this thesis investigates the integration of
modern data analytics to transform recommendation systems and tourist analysis.
Through the utilisation of various datasets, such as user preferences, past travel
trends, and up-to-date contextual data, this study aims to establish a sturdy structure
for all-encompassing tourism examination.

The goal of the project is to create intelligent recommendation systems that employ
machine learning algorithms to customise recommendations according to user
preferences, behavioural trends, and the ever-changing nature of travel locations. It
also explores the privacy ramifications and ethical issues surrounding the processing
of sensitive travel data.

This project uses an interdisciplinary approach to improve individualised travel
experiences by providing travellers with recommendations that are specific to their
interests and preferences. The findings of this thesis have consequences for the travel
and tourist sector as well as for more general conversations about the ethical use of
data in determining the direction of travel and hospitality.

Additionally, the thesis incorporates data-driven insights to address the difficulties in
guaranteeing sustainability within the tourist sector. Through the examination of
environmental impact data, the study investigates methods to encourage conscientious
and environmentally conscious travel decisions, thereby supporting the travel
industry's wider dedication to sustainable practices. By doing this, the study lays the
groundwork for a future in which data-driven suggestions for travel will satisfy
personal tastes while simultaneously upholding the need to protect the cultural and
ecological assets that distinguish each place.

List of Contents

List of Figures ………………………………………………………… 9

Introduction ………………………………………………………………… 10
Background ………………………………………………………… 10
Motivation ………………………………………………………… 10
Problem Statement ………………………………………………… 11
Objectives ………………………………………………………… 11
Literature Survey ………………………………………………… 12

Data Collection and Cleaning ……………………………………………… 14


Data Collection ………………………………………………… 14
Rationale for Website Selection ………………………… 14
Data Collection Process ………………………………… 14
Reasons for choosing Beautiful Soup ………………………… 15
Reasons for choosing Selenium ………………………… 15
Data Cleaning and Processing ………………………………… 15
Feature Engineering ………………………………………………… 15

Exploring Fuzzy Clustering Algorithms ………………………………….. 17


Why Fuzzy C-Means Clustering? ………………………………… 17
Implementation of FCM ………………………………………… 18
Drawbacks of FCM ………………………………………………… 20

Applying Hierarchical Clustering ………………………………………… 22


Hierarchical Clustering ………………………………………… 22
Advantages of Hierarchical Clustering over FCM ………………… 22
Implementation of Hierarchical Clustering ………………………… 23
2D Scatter Plot Visualisation using Hierarchical Clustering ………… 24
3D Scatter Plot Visualisation using Hierarchical Clustering ………… 26

Sentimental Analysis using LLMs ……..………………………………….. 28


What are LLMs? ………………………………………………… 28
Sentimental Analysis using LLMs ………………………………… 29
Implementation of LLMs ………………………………………… 30
Sentimental Analysis on results of Hierarchical Clustering ………… 31

Conclusion …………………………………………………………………… 33
Accomplishments ………………………………………………… 33
Future Scope ………………………………………………………… 34

References …………………………………………………………………… 36

List of Figures

Figure No. Description Page No.

1 List of tags extracted from the complete dataset 16
2 Feature Matrix 20
3 Flowchart for Hierarchical cluster assignment 24
4 2D scatter plot using t-SNE 25
5 2D scatter plot using PCA 25
6 3D scatter plot using hierarchical clustering 27
7 Using results of Clustering for sentiment analysis 30
8 LLM Architecture for Transformers 31
9 Visualisation of clusters in Delhi 33
10 Visualisation of clusters in Manali 33

Chapter - 1
Introduction

1.1 Background

India's economy benefits greatly from the tourism sector. The country's rich history,
varied landscapes, cultural legacy and emerging trends draw millions of visitors every
year. But choosing the best places to go and things to do that fit their interests and
preferences can be difficult for travellers. We propose a recommendation model for
Indian tourism destinations to address this problem.

The goal of this research is to create a recommendation model that is especially suited
for Indian tourism. To offer customised suggestions, it will take into account
variables including user preferences, geographic location, and cultural interests, and
suggest the best nearby options that match those preferences.

1.2 Motivation

India's tourism economy is based mostly on cultural visits, which boost GDP and
create a large number of jobs. India's incredible diversity, which includes beautiful
landscapes, historical riches, and a lively tapestry of cultures and traditions, is what
makes it so alluring. Due to this diversity, India is one of the most popular travel
destinations in the world, drawing millions of visitors each year. Many other places
do not necessarily have cultural significance, yet they offer peace, joy and thrill to
those who seek adventure or want to take a relaxing break from their daily lives.
There are certainly a lot of choices available. But even with all of India's resources,
travellers frequently encounter a difficult obstacle: sorting through the plethora of
choices and choosing locations and activities that suit their own interests, limitations,
and preferences.

Travellers frequently become overwhelmed by the sheer volume of options when
visiting a nation as large and diverse as India. This can result in less than ideal travel
experiences and lost opportunities to discover the hidden treasures of this enormous
country. A more effective and individualised method to assist travellers in making
selections about their trips is badly needed as more people discover India's
varied beauty.

Information Overload
When planning a vacation, tourists often encounter information overload because
the many websites, blogs, and guidebooks available give contradictory
recommendations.

Technological Advancements
Developments in data analytics and machine learning offer a chance to build complex
recommendation systems. In the tourism industry, employing these technologies may
greatly improve the calibre of suggestions.

1.3 Problem Statement

Our project aims to create a personalised travel recommendation system for India that
takes into account the varied interests of users as well as their financial limitations. In
order to provide a more personalised and improved travel experience, we try to take
into account user feedback and trends in destination popularity. In addition, we'll
concentrate on ethical data handling, scalability, and real-time updates to develop a
complete solution for tourists visiting India.

1.4 Objectives

The main goal of this B.Tech. project (BTP) is to develop a state-of-the-art
recommendation system that transforms the way travellers arrange their trips within
India. The main objectives are:

1. Information Overload: Travellers find it difficult to sort through the
abundance of information available to them from all the available sources and
make decisions they feel confident about. The goal of this project is
to make decision-making easier by offering personalised recommendations
based on user preferences.

2. User Personalization: Adventure, cultural exploration and moments of joy
are among the unique interests of travellers. It is crucial to create a
recommendation system that appropriately profiles and customises
recommendations for each user.

3. Economic Considerations: Travel decisions are greatly influenced by
financial constraints. Recommendations from the system should take into
account the user's budget along with their interests and preferences.

4. Destination Popularity: The popularity of a destination changes over time for
a variety of reasons. Recommendation models must take events, crowd
density, and emerging hotspots into consideration as well as evolving trends.

5. Human Touch: Including firsthand, qualitative feedback from past visitors
provides a much-needed human touch, improving user experiences
and trust.

6. Privacy and Data Security: When creating a recommendation system, it is
important to guarantee the confidentiality and integrity of user information.

1.5 Literature Survey

Tourist information systems have evolved significantly, incorporating new
technologies that are changing the travel experience. This survey reviews some
important research on the use of clustering techniques to improve and personalize
travel offers and highlights their importance in increasing accessibility and user
satisfaction.

Data mining methods used in e-tourism planning recommendations and applications
were introduced in the work of Z. Lang and W. Biu[1]. The research includes a
built-in algorithm to classify visitors according to their interests, allowing the system
to create personalized routes and recommend interesting places. This method ensures
that recommendations are tailored to each person's needs and desires and simplifies
the trip planning process.

Z Abbasi-Moud, H Vahdat-Nejad and J Sadri[2] carried out a further project to
improve tourism recommendations by combining sentiment analysis and clustering
methods. Through the analysis of user-generated content, this method delves
into the needs and feelings of each person. The combination of sentiment analysis and
clustering provides users with a more personalized experience, resulting in more
accurate and engaging recommendations. This strategy has great potential to
improve the user experience in the travel industry.

Building on this foundation, Z Jia, Y Yang, W Gao, and X Chen[3] present a
comprehensive framework for tourism recommendation that combines collaborative
filtering and clustering. To provide context-aware recommendations, this hybrid
approach uses user interactions and structured tourism data. The results show that this
combination works well, producing more accurate recommendations than either
method used separately. This work highlights the importance of using user behavior
and unique data characteristics to create personalised recommendations.

Recently, S Djebali, Q Gabot and G Guerard[4] investigated the use of hierarchical
clustering in tourist recommendation systems. This study highlights the
importance of feature engineering and selection in improving clustering performance.
The system uses hierarchical clustering to divide visitors into meaningful groups in
order to provide more relevant recommendations. This method shows how the use of
artificial intelligence techniques can improve the recommendation process.

G Ratnakanth and S Poonkuzhali[5] made a significant contribution to this field with
their novel application of deep learning to traveller recommendation. To extract
hidden representations of tourism preferences, this study introduces an autoencoder
and then uses those representations for clustering. With the help of this deep learning
method, complex patterns of tourist behavior can be discovered and highly accurate
and personalized recommendations can be made. This work is an example of using
machine learning techniques to extract new insights from tourism data.

Building on these findings, NWPY Praditya and AE Permanasari[6] present a hybrid
recommender system that combines clustering, content-based filtering, and
collaborative filtering to provide travel recommendations. This comprehensive strategy
draws on the strengths of each method to provide highly accurate and context-sensitive
recommendations. The study shows how this hybrid system can improve accessibility
and user satisfaction, demonstrating its potential as a recommender system.

In a related work, CC Yu and H Chang[7] investigate how location-based services can
be combined with clustering methods to provide tourism advice. Based on the tourist
and their current geographic location, the system uses location data and algorithms to
provide trending information. This method ensures that recommendations are not only
personalised but also relevant to the user and their surroundings, improving the
overall travel experience.

In addition, deep neural networks such as Generative Adversarial Networks (GANs)
have great potential for use in tourism recommendation systems. Research by EE
Stephy and M Rajeswari shows that GANs can generate synthetic data that matches
the needs and behaviours of real tourists. By using various datasets of tourist
interactions and interests during training, the GAN can generate realistic profiles
of virtual visitors. This synthetic data can then be used to train and evaluate
recommendation algorithms, increasing the accuracy and depth of testing. This method
provides a new opportunity to generate highly personalised and contextual tourism
recommendations, thus improving the user experience.

In summary, the convergence of clustering methods with e-tourism applications is a
major advancement in the development of tourist recommendation systems.

Chapter - 2
Data Collection and Cleaning

2.1 Data Collection:

We used web scraping techniques on different travel websites to collect data for
training a machine learning model to recommend tourist attractions in India. To
scrape data from Lonely Planet, we used Beautiful Soup[9] with the requests[10]
library, and we used Selenium[11] to extract data from the dynamic websites Trip.com
and TripAdvisor.in. We also used Google Maps to enrich our data with
location coordinates, so that recommendations can take location preferences into
account. We explain our methodology in more detail and give the rationale behind
our decisions below.

We systematically collected data from prominent travel websites, including Trip.com,
TripAdvisor.in, and Lonely Planet. These websites were selected for their extensive
data repositories, providing comprehensive information about tourist attractions in
India.

2.1.1 Rationale for Website Selection:

1. Data Availability: Trip.com and TripAdvisor.in are known for their extensive
databases, offering detailed information on attractions, user reviews, and
ratings, providing our model with a rich knowledge base.

2. Diverse Information: Each website contributes diverse data types, such as
attraction descriptions, user-generated reviews, ratings, and geographical
coordinates, enriching our dataset.

3. User-Generated Content: Both Trip.com and TripAdvisor.in prioritize
user-generated content, offering valuable insights into traveller preferences
and feedback.

2.1.2 Data Collection Process:

We employed tailored web scraping techniques for each website. Beautiful Soup was
used for Lonely Planet due to its static structure, while Selenium was chosen for
Trip.com and TripAdvisor.in, which feature dynamic content.

2.1.3 Reasons for choosing Beautiful Soup:

We used Beautiful Soup to extract data from the Lonely Planet[12] website.

● The HTML source code of Lonely Planet's website contains easily accessible
data, and its structure is comparatively static.
● When it comes to effectively extracting information from static web pages,
Beautiful Soup excels.
● The website is a good option because it permits scraping in accordance with
its terms of service.
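
The static-page extraction described above can be sketched as follows. This is a minimal illustration, not the scraper used in the project: the CSS selectors (`li.attraction`, `.name`, `.rating`) and the sample markup are hypothetical, and in practice the HTML string would be fetched first, for example with `requests.get(url, timeout=10).text`.

```python
from bs4 import BeautifulSoup


def parse_attractions(html: str) -> list[dict]:
    """Extract attraction names and ratings from a static listing page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Hypothetical selectors: a real page needs its own, found by inspecting the HTML.
    for card in soup.select("li.attraction"):
        name = card.select_one(".name")
        rating = card.select_one(".rating")
        rows.append({
            "name": name.get_text(strip=True) if name else None,
            # A missing rating is kept as None so it can be imputed later.
            "rating": float(rating.get_text(strip=True)) if rating else None,
        })
    return rows


# Inline sample markup stands in for a downloaded page (no network access needed).
SAMPLE_HTML = """
<ul>
  <li class="attraction"><span class="name">Red Fort</span><span class="rating">4.5</span></li>
  <li class="attraction"><span class="name">Hadimba Temple</span></li>
</ul>
"""

rows = parse_attractions(SAMPLE_HTML)
print(rows)
```

Keeping the parsing step separate from the download step makes it easy to test the selectors on saved pages before running a full crawl.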

2.1.4 Reasons for choosing Selenium:

We used Selenium to extract data from the TripAdvisor[13] and Trip.com[13]
websites.

● Because Trip.com and TripAdvisor.in are dynamic websites that use
JavaScript to load content, typical scraping techniques are less successful.
● With the help of Selenium, we can interact with dynamic components and get
data that would otherwise be difficult to obtain.
● Because both websites have conditions of use that allow for scraping,
Selenium is a good fit for these platforms.

2.2 Data Cleaning and Processing:

The dataset contains the name of each tourist destination, its rating out of 5, the
amount spent to visit it, its review count, the city in which it is located, and the
tags associated with it. The tags play the most important role in our recommendation
algorithm, and we have assigned clusters according to the number of tags in the
entire dataset.

Missing values are, for now, filled with the corresponding average over the entire
dataset so that relatively similar places can still be suggested.
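
A minimal sketch of this mean imputation is shown below, assuming the data is held in a pandas DataFrame. The column names and sample values are hypothetical stand-ins for the fields listed above, not the project's actual schema.

```python
import pandas as pd

# Hypothetical columns mirroring the fields described in this section.
df = pd.DataFrame({
    "name": ["Place A", "Place B", "Place C"],
    "rating": [4.5, None, 3.5],
    "amount_spent": [500.0, 200.0, None],
    "reviews_count": [120, 80, 40],
})

numeric_cols = ["rating", "amount_spent", "reviews_count"]

# Replace each missing numeric value with that column's mean over the dataset.
filled = df.copy()
filled[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print(filled)
```

Mean imputation is simple, but it pulls every incomplete record towards the centre of the data, which is worth keeping in mind when the imputed rows are later clustered.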

2.3 Feature Engineering:

Feature engineering was pivotal in enhancing our dataset's quality and relevance. We
extracted geographical coordinates, enabling location-based recommendations.
Textual features, including sentiment analysis scores and keyword extraction, were
derived from descriptions and user-generated reviews. Features related to ratings,
review counts, and popularity quantified attraction appeal. We also categorised
attractions based on attributes like historical significance, natural beauty, adventure,
and cultural relevance, providing additional features for personalised
recommendations.
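
One plausible way to turn such categorical tags into numeric features is a multi-hot (binary) encoding, sketched below. The report does not state which encoding was used, so the function `build_tag_matrix` and the sample tags are illustrative assumptions only.

```python
import numpy as np


def build_tag_matrix(tag_lists):
    """Return a binary destination-by-tag matrix and the sorted tag vocabulary."""
    vocab = sorted({tag for tags in tag_lists for tag in tags})
    index = {tag: j for j, tag in enumerate(vocab)}
    matrix = np.zeros((len(tag_lists), len(vocab)), dtype=int)
    for i, tags in enumerate(tag_lists):
        for tag in tags:
            matrix[i, index[tag]] = 1  # 1 means the destination carries this tag
    return matrix, vocab


# Hypothetical tags for three destinations.
tag_lists = [
    ["historical", "monument"],
    ["nature", "adventure"],
    ["historical", "nature"],
]

matrix, vocab = build_tag_matrix(tag_lists)
print(vocab)
print(matrix)
```

Each row can then be concatenated with numeric columns such as rating or review count to form the input to a clustering algorithm.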

In summary, our data collection and feature engineering efforts have produced a
robust dataset, empowering our machine learning model to offer personalised and
accurate recommendations for travellers exploring India's diverse tourist attractions.

Fig 1. List of tags extracted from the complete dataset

Chapter - 3
Exploring Fuzzy Clustering Algorithm

3.1 Why Fuzzy C-Means Clustering?

One of our initial choices for the recommendation algorithm was Fuzzy C-Means
(FCM) clustering. It offers more flexibility than counterparts such as K-means
clustering: it does not force each point into a single cluster, and its graded
memberships relax the hard, roughly spherical partitions that K-means produces.
Because a destination may belong to several clusters at once, FCM tends to produce
better results on datasets whose groups overlap, which made fuzzy clustering a
preferable option in our case.

1. Soft Assignments:
FCM supports soft assignments: each data point may have varying degrees of
membership in several clusters. Item features and user preferences
frequently show some degree of resemblance, and by accounting for the
intrinsic ambiguity in user-item relationships FCM can represent complex
preferences.

2. Flexibility in Membership:
FCM permits data points to partially belong to multiple clusters whereas hard
clustering states that a data point strictly belongs to one cluster.
Recommendation systems benefit from this flexibility because it represents the
heterogeneous and overlapping nature of user preferences.

3. Robustness to Noise:
FCM is often regarded as tolerant of moderate noise, because an ambiguous
point is spread across several clusters with low membership instead of being
forced into one. This matters in recommendation systems, where user
behaviour can be irregular or prone to anomalies. As Section 3.2 shows,
however, standard FCM remained sensitive to noise and outliers in practice.

4. Adaptability to Data Distribution:
Because graded memberships do not impose hard boundaries, FCM can
accommodate clusters that overlap or vary in shape more easily than K-means.
User preferences may form complex patterns, so this ability to accommodate
different cluster forms is a benefit. This flexibility helps the clustering
capture the underlying structures in the data.

5. Integration of Similarity Measures:
FCM provides adaptability in capturing different elements of user-item
connections by allowing the use of alternative similarity metrics. This is
especially important for recommendation systems that must take into account
a variety of data sources, including user ratings, demographic data, and textual
evaluations. FCM's capacity to combine data from several sources makes for a
more thorough clustering strategy.

3.2 Implementation of FCM:

Implementing the naïve Fuzzy C-Means (FCM) clustering method posed several
challenges: it produced inconsistent results and made it difficult to handle noise
in the data. These problems obstructed the clustering process and called
for more research and adjustments.

The instability of the cluster assignments produced by the FCM method was one of
the main causes for concern. The model frequently assigned the same destination to
different clusters across successive runs. This stems from the algorithm's stochastic
nature: the initial cluster centroids are selected randomly, so slight variations in
the starting conditions or the ordering of the data may produce different clustering
results. Repeating the procedure with several initializations reduced the randomness
of the assigned clusters.

Another noteworthy obstacle was the vulnerability of FCM to noise present in the
dataset. Because FCM is intrinsically sensitive to noise and outliers, the clustering
results may be skewed. Frequently, a significant portion of the destinations were first
assigned to no cluster, leaving an imperfect clustering solution. One popular solution
to this problem is to use the dataset's average values in place of the null entries. This
method, however, may result in a bias in the clustering, assigning the bulk of the
destinations to one conspicuous cluster.

The overall clustering result can be greatly distorted by the dominance of a single
cluster brought about by such noise reduction measures. Other approaches to treating
noise should be investigated to get around this problem, for example robust
clustering algorithms that can manage outliers, or preprocessing techniques that
recognise and remove noisy data points. Additionally, domain expertise or expert
input can help optimise the clustering process by specifying suitable noise
thresholds or adding weighted distance measures that lessen the impact of noisy data.

To conclude, the implementation of naïve FCM clustering has brought to light some
significant issues, such as noise intolerance and unstable cluster assignments.
Strategies like limiting randomization in initialization and implementing more
advanced noise-handling techniques should be taken into consideration in order to
improve the resilience and reliability of the clustering findings. More consistent and
significant clustering findings for destination analysis or related applications can be
achieved by carefully weighing these difficulties and investigating suitable solutions.

As in fuzzy logic, a sample in fuzzy clustering does not belong to exactly one
cluster; instead it has a degree of membership in every cluster. For each sample $t$
we have a coefficient $u_k(t)$ that indicates its degree of belonging to the k-th
cluster. Since these coefficients typically sum to 1 over the $K$ clusters, $u_k(t)$
may be read as the likelihood that sample $t$ falls into cluster $k$:

$$\sum_{k=1}^{K} u_k(t) = 1 \tag{3.1}$$

Fuzzy c-means is a fuzzy version of the k-means partitional method. The centroid
$c_k$ of a cluster $C_k$, also known as its prototype, is the mean of all samples
weighted by their degree $u_k(t)$ of belonging to the cluster:

$$c_k = \frac{\sum_{t} u_k(t)^{m} \, t}{\sum_{t} u_k(t)^{m}} \tag{3.2}$$

The degree of belonging decreases with the distance $d(c_k, t)$ between the sample
and the centroid. To ensure that the coefficients sum to 1, they are normalised and
fuzzified with a real parameter $m > 1$; in the standard formulation this gives:

$$u_k(t) = \frac{1}{\sum_{j=1}^{K} \left( \frac{d(c_k, t)}{d(c_j, t)} \right)^{2/(m-1)}} \tag{3.3}$$
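
The update rules referred to as (3.1) to (3.3) can be sketched in NumPy as below. This is a minimal illustration of standard fuzzy c-means under assumed defaults (Euclidean distance, `m=2.0`, random initial memberships, a fixed iteration cap); it is not the implementation used in the project, and the toy data is synthetic.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy c-means; returns memberships U (n x K) and centroids (K x d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships, each row normalised to sum to 1 (Eq. 3.1).
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        W = U ** m
        # Centroids are membership-weighted means of all samples (Eq. 3.2).
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every sample to every centroid.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        # Membership update from normalised distance ratios (Eq. 3.3).
        ratio = dist[:, :, None] / dist[:, None, :]
        U_new = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers


# Synthetic toy data: two well-separated groups of 20 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])

U, centers = fuzzy_c_means(X, n_clusters=2)
labels = U.argmax(axis=1)  # hard label = cluster with the highest membership
print(centers)
```

Because the initial memberships are random, different `seed` values can give different results on harder data; running several seeds and comparing the outcomes is one way to observe the instability discussed in Section 3.2.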

Fig 2. Feature Matrix

3.3 Drawbacks of FCM:

1. Sensitivity to initial conditions: The initial cluster centre locations can have
a significant impact on the fuzzy clustering result. Different initial cluster
assignments lead to distinct cluster outcomes. Because the initial assignment is
chosen randomly rather than by the user, it is difficult to draw meaningful
conclusions from a single run.
2. Computational Complexity: For a large number of data points the amount of
computation is also considerable, so deriving results requires substantial
computational resources and time.
3. Need for pre-specification of cluster number: We need to define the number
of clusters beforehand, which is difficult to predict since there are various
factors affecting the cluster assignment.
4. Difficulty in interpreting results: Rather than rigid assignments, fuzzy
clustering produces membership degrees. In contrast to hard clustering
techniques like K-means, this may make it more difficult to comprehend the
groupings.
5. Not suitable for all types of data: It might not function effectively on data
whose clusters have complex geometry or irregular shapes.
6. Assumption of fuzzy membership: It is assumed that every data point has
some degree of affiliation with every cluster. For some datasets, this
assumption might not always be true.
7. Vulnerable to noise and outliers: Fuzzy C-means, like many clustering
techniques, is susceptible to noise and outliers, which might result in less
meaningful clusters.
8. Lack of clear stopping criterion: Convergence within a reasonable number of
steps is not guaranteed, so we need to specify the maximum number of
iterations the algorithm can perform, which can be tricky to estimate.
9. Difficulties with high-dimensional data: The "curse of dimensionality,"
which can reduce the significance of distance-based computations, may
prevent fuzzy C-means from performing effectively with high-dimensional
data.
10. Limited Scalability: It may struggle with the computational demands of very
large datasets.

To address these drawbacks, we have used other techniques, such as hierarchical
clustering, that are better suited to handling huge datasets and producing structured
results. Furthermore, in addition to these techniques, we will investigate
unsupervised recurrent neural networks (RNNs) such as LSTMs and deep neural network
models such as GANs, using the insights we have gathered from related research.

Chapter - 4
Applying Hierarchical Clustering

4.1 Hierarchical Clustering:

Recursively dividing a dataset into groups with progressively smaller granularities is
known as hierarchical clustering. Dasgupta framed similarity-based hierarchical
clustering as a combinatorial optimization problem, where a "good" hierarchical
clustering is one that minimizes a specific cost function.

This was motivated by the fact that most work on hierarchical clustering was focused
on providing algorithms rather than optimizing a specific objective. He demonstrated
the following desired characteristics of this cost function: Higher layers of the
hierarchy must be used to separate unconnected components, or dissimilar pieces, in
order to attain the best cost. When the similarity between data items is the same, all
clusterings result in the same cost.
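
For reference, the cost function in question is commonly written as below. The notation is an assumption made here for illustration and does not come from this report: $G = (V, E)$ is a similarity graph over the data items with edge weights $w_{ij}$, $T$ is a candidate hierarchy (a rooted tree whose leaves are the items), and $T[i \vee j]$ is the subtree rooted at the lowest common ancestor of items $i$ and $j$.

```latex
\mathrm{cost}_G(T) \;=\; \sum_{\{i,j\} \in E} w_{ij} \,
    \bigl|\, \mathrm{leaves}\bigl(T[i \vee j]\bigr) \bigr|
```

Each pair of similar items is charged by the size of the smallest cluster that contains both, so a low cost means similar items are kept together until deep in the tree and dissimilar items are separated near the top.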

4.2 Advantages of Hierarchical Clustering over FCM:

Fuzzy C-Means (FCM) clustering and hierarchical clustering are two separate
methods with varying advantages. Here are some benefits of hierarchical clustering
over FCM, albeit the decision between them will rely on the particulars of the data
and the analysis's objectives:

1. Hierarchical Structure Visualisation:
Dendrograms are a natural product of hierarchical clustering and provide a
visual representation of the hierarchical relationships between groups. From
this view, one can understand the structure of the data and how data points
are combined. FCM, a flat clustering method, does not provide this
kind of hierarchy.

2. No need for cluster number specification:
The number of clusters does not need to be predetermined when using
hierarchical clustering. This makes hierarchical clustering useful for
exploratory data analysis, where the ideal number of groups is not always
known in advance. FCM, by contrast, is limiting in such situations because it
requires the user to select the number of clusters.

3. Interpretability of Results:
In many cases, the tree structure produced by hierarchical clustering
closely mimics the hierarchical patterns seen in real data. This can help with
the results' interpretability, particularly in cases when there is a hierarchical
structure in the interactions between the clusters. Although fuzzy clustering
methods like FCM offer membership degrees, their interpretation of clusters
may not be as clear-cut.

4. Handling of Outliers:
Because the effects of an individual outlier are usually contained inside a tiny
subtree of the dendrogram, hierarchical clustering is comparatively resistant to
outliers. However, because FCM gives membership degrees to individual data
points, it may be susceptible to outliers, which might affect the degree of
membership across several clusters.

5. Flexibility in linkage methods:
Hierarchical clustering supports several linkage methods (e.g., single,
complete, and average linkage), which offers flexibility in how clusters are
merged. This is helpful when a particular linkage criterion best captures the
underlying structure of the data. In contrast, FCM uses a fixed method to
determine cluster membership.
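
As an illustration of this flexibility, the sketch below builds the merge hierarchy
for the same points under three linkage criteria. It assumes SciPy is available and
uses a small synthetic array in place of the project's location features. The
variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Synthetic 2-D points standing in for real feature vectors (an assumption).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(10, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(10, 2)),
])

# Each linkage matrix has one row per merge: the two merged cluster ids,
# the distance at which they merge, and the size of the new cluster.
hierarchies = {
    method: linkage(points, method=method, metric="euclidean")
    for method in ("single", "complete", "average")
}

for method, merges in hierarchies.items():
    # The last row is the final merge that joins everything into one cluster.
    print(f"{method:>8}: final merge height = {merges[-1, 2]:.3f}")
```

Only the `method` argument changes between runs, so the same data can be compared
under different merging criteria before settling on one.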

Fig 3. Flowchart for Hierarchical cluster assignment

4.3 Implementation of Hierarchical Clustering:

Hierarchical clustering is a technique used in data analysis to expose the
hierarchical structure of a data set. Data preparation is the first of several
important steps: the data set under investigation is collected or generated and then
compiled. The next important step is to compute the linkage, that is, to determine
how similar or different the data points are from each other. The choice of linkage
method, such as single, complete, or average linkage, affects the clustering process
by determining which clusters are merged or split.

Hierarchical clustering proceeds by repeatedly dividing or merging groups based on
their similarity. The results are represented by a dendrogram, a tree-like structure
that graphically captures the hierarchical relationships between data points. An
important step in the process is dendrogram interpretation, which allows analysts to
identify logical groupings of data.

This methodological approach can be used to summarise the hierarchy in a variety of
situations and provides a way to find hidden patterns in a data set. It is a useful
tool for exploratory data analysis because it provides not only numerical results but
also a graphical representation that helps in understanding the relationships between
data points.

Hierarchical clustering methods provide additional opportunities for exploration
because the number of clusters does not need to be specified in advance.
Unlike clustering methods that require prior knowledge of the number of clusters,
hierarchical clustering builds a dendrogram from which a suitable number can be read
off, for example by cutting the tree where the gap between successive merge heights
is large. This is especially useful in situations where the structure of the data is
unclear or the optimal number of clusters is hard to determine.
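
To make this concrete, the sketch below cuts a hierarchy at a distance threshold
instead of asking for a fixed number of clusters, so the cluster count follows from
the data. It assumes SciPy is available. The six hand-picked points and the
threshold of 5.0 are illustrative choices, not values from this project.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two visibly separate groups of points (toy data, an assumption).
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

merges = linkage(points, method="average")

# Cut the dendrogram at height 5.0: merges above this height are undone.
# In practice the threshold is chosen by inspecting the dendrogram for a
# large gap between consecutive merge heights.
labels = fcluster(merges, t=5.0, criterion="distance")
n_clusters = len(set(labels))
print(n_clusters, labels)
```

Here the merges inside each group happen at heights close to 1, while the final
merge between the groups happens at a much larger height, so any threshold between
the two yields two clusters.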

In addition, hierarchical clustering allows for deeper exploration of data
relationships. Using a variety of linkage methods, analysts can tailor the merging
criterion to the specific characteristics of the data set. The patterns that emerge
depend on whether single (nearest-neighbour), complete, or average linkage is used.
Ward linkage emphasizes cohesion within clusters, while average linkage reflects the
mean dissimilarity between the members of two clusters. Because dendrograms give a
detailed representation of the merge history, hierarchical clustering is a very
useful tool for understanding complex data structures at a deep level.

Fig 4. Comparing Clusters before and after Hyperparameter Tuning

4.4 2D Scatter Plot Visualisation using Hierarchical Clustering:

2D scatter plots are a useful tool to visualize clustering results clearly and
concisely in the context of hierarchical clustering. After dimensionality reduction,
a scatter plot shows the data along two components, chosen so that both axes capture
significant features of the data.

● Dendrogram Overlay:
A dendrogram is the primary graphical representation of a hierarchical
clustering, but overlaying cluster assignments on a two-dimensional scatter
plot provides a quick and informative view of how data points group. Each data
point is placed according to its values in the two selected dimensions, and
the color and shape of its marker indicate the cluster to which it belongs.

● Inter Cluster Separation:
Distances between clusters can be assessed visually using 2D scatter plots.
Clear boundaries and distinctions between different groups are best as they
facilitate interpretation and identification of the group.

Average Linkage:
$$d_{\text{average}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(X_p, X_q) \tag{4.1}$$

● Identification of outliers:
2D scatter plots make it easy to see which points sit inside a cluster and
which lie outside it. Points that deviate from the overall cluster pattern
stand out visually, which helps to identify and deal with outliers.

● Pattern Recognition:
A scatter plot of the hierarchical clustering results helps to identify
patterns and trends in the data. Analysts can determine whether clusters
spread out or form tight groupings, providing insight into the underlying
structure.
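
A minimal sketch of the projection step behind such plots follows. It reduces
feature vectors to two principal components using NumPy's SVD and pairs each
projected point with a cluster label for coloring. The random features and labels
are placeholders, the helper name `project_to_2d` is hypothetical, and the figures in
this report may have been produced with a library implementation of PCA or t-SNE
instead.

```python
import numpy as np


def project_to_2d(features):
    """Project the rows of `features` onto their first two principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(42)
features = rng.normal(size=(50, 6))    # placeholder feature matrix
labels = rng.integers(0, 3, size=50)   # placeholder cluster assignments

coords = project_to_2d(features)

# coords[:, 0] and coords[:, 1] are the x/y positions for a scatter plot;
# `labels` can be passed as the color argument of a plotting library.
for cluster_id in np.unique(labels):
    members = coords[labels == cluster_id]
    print(cluster_id, members.shape[0], members.mean(axis=0).round(2))
```

Keeping the projection separate from the plotting call makes it easy to reuse the
same coordinates for different colorings, for example before and after tuning.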

Fig 5. 2D Scatter plot using t-sne

Fig 5. 2D Scatter plot using PCA

Hierarchical clustering, a powerful way to represent the underlying structure of
data, can also be explored in 3D scatter plot visualizations. The additional
dimension goes beyond the limitations of traditional 2D graphics and gives a richer
view of the relationships between clusters. This method involves transforming the
results of hierarchical clustering into visually clear and informative spatial
configurations.

● Data mapping to 3 dimensions:
In a three-dimensional space, each data point Xi is represented by a marker
with coordinates (x1, x2, x3). The features that contribute most to the
clustering patterns are usually chosen as these dimensions.

● Cluster assignments and colour coding:
The cluster assignments derived from the hierarchical clustering are shown in
a three-dimensional scatter plot. To facilitate visual segmentation, data
points in the same cluster are usually assigned the same marker color or
shape.

● Heightened perception of clustering relationships:
Three-dimensional scatter plots provide an additional layer of information,
making more complex clustering relationships visible. Patterns that are
hidden in 2D can be revealed, making the analysis easier to understand.

● Interactive Exploration:
Tools and frameworks that provide interactive 3D visualizations make
exploration easier. Panning, zooming, and rotating allow analysts to
dynamically inspect the spatial distribution of clusters. This interactivity
helps in understanding the hierarchical structure of the data.

● Integration with dendrogram information:
Information from the dendrogram can also be included in a 3D scatter plot. To
connect the visual image to the hierarchy, it is possible to overlay
annotations or lines that show the dendrogram and the height at which the
clusters are merged.

Fig 6. 3D scatter plot visualisation using Hierarchical Clustering

Chapter - 5
Sentiment Analysis using LLMs

5.1 What are LLMs?

Artificial intelligence and natural language processing have witnessed the rapid
rise of large language models, or LLMs; one well-known example is GPT-3, or
Generative Pre-trained Transformer 3. The transformer architecture forms the
foundation of LLMs. Because it processes the tokens of a sequence in parallel, this
architecture has proven highly successful in sequence-to-sequence tasks, particularly
in language generation and understanding. What distinguishes LLMs from other types of
models is the scale at which they operate: GPT-3, for example, has 175 billion
parameters. This scale makes it possible for them to represent the subtleties,
complexity, and contextual details of language across a wide range of languages and
contexts.

There are two main stages to the LLM training procedure. The model first goes through
pre-training, in which it learns to predict the next word given the preceding text,
using large datasets. This first stage gives the model a general knowledge of
semantics and language structure. The model is then fine-tuned for specific tasks or
domains, allowing its broad knowledge to be applied in more specialized settings.

The ability of LLMs to comprehend natural language is what sets them apart. They are
strong at capturing the context, sentiment, and intent of written text. This skill
has real-world uses in sentiment analysis, chatbots, and virtual assistants. The
model's ability to understand subtleties in language improves user engagement and
makes for a more natural-feeling conversation.

Additionally, LLMs show a remarkable capacity to generate and complete sentences.
Because of this, they can create text that is both cohesive and pertinent to the
context when working on tasks like content generation, text summarization, and
language translation. In particular, GPT-3 has demonstrated its creative writing
skills by answering questions and producing responses that closely resemble human
communication.

Beyond language generation, LLMs show reasoning and problem-solving ability. By
drawing on the knowledge acquired during pre-training, they can tackle complex
tasks, explain concepts, and respond to questions. Their adaptability makes them
useful tools for information retrieval, decision support systems, and educational
applications.

5.2 Sentiment Analysis using LLMs

Using Large Language Models (LLMs) for sentiment analysis offers a sophisticated
and context-aware method of deciphering the sentiment represented in textual data,
which is a significant leap in natural language processing. The use of LLMs, most
notably models like Gemini, which are excellent at capturing the subtleties of
language patterns and contextual nuances, is the fundamental component of this
technique.

The creation of a properly labeled dataset is the first stage in the sentiment analysis
process using LLMs. This dataset, which includes a variety of text samples matched
with appropriate sentiment labels—including positive, negative, and neutral
sentiments—is essential for training the model. The model can effectively generalize
to the heterogeneity of linguistic expressions in real-world circumstances thanks to
the dataset's diversity.

The model's capacity to identify sentiment is significantly shaped by the
pre-training and fine-tuning phases. To acquire a comprehensive knowledge of language
structure and semantics, LLMs are first pre-trained on large datasets. The labeled
sentiment dataset is then used to fine-tune the model, which learns to associate
particular language patterns with particular sentiments. Fine-tuning enhances the
model's capacity to identify subtleties in emotion and contextual signals.

The transformer architecture is the main architectural component of sentiment
analysis models based on LLMs. Because the architecture processes the input text in
parallel, the model is better able to capture contextual information and
dependencies. Before the model processes it, the input text is split into tokens,
and an embedding layer encodes each token as a high-dimensional vector.

During the inference and prediction phase, the trained model passes the input text
through its layers and produces a sentiment prediction. The output is typically a
probability distribution across the sentiment classes, which reflects the model's
confidence in each category. To produce a final sentiment label, post-processing
stages may apply thresholds to these probabilities.
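
The post-processing step can be sketched as follows. The raw scores, the label
order, the 0.6 threshold, and the fallback to a neutral label are all illustrative
assumptions; in a real system the scores would come from the model.

```python
import math

LABELS = ("negative", "neutral", "positive")  # assumed class order


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def to_label(scores, threshold=0.6, fallback="neutral"):
    """Pick the most probable class, or `fallback` if confidence is too low."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return fallback, probs
    return LABELS[best], probs


# One clearly positive example and one where no class stands out.
confident_label, confident_probs = to_label([0.2, 0.1, 3.0])
uncertain_label, uncertain_probs = to_label([1.0, 0.9, 1.1])
print(confident_label, uncertain_label)
```

Falling back to a neutral label is only one possible policy; flagging low-confidence
texts for manual review would be another.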

The ability of LLMs to handle contextual sentiment is one of their most prominent
strengths in sentiment analysis. These models are able to understand subtle
differences in sentiment depending on the context of a sentence or paragraph. Since
natural language utterances can contain layers of complexity and ambiguity,
contextual awareness is essential for reading sentiment accurately.

Sentiment analysis with LLMs has many useful and profound real-world uses.
LLM-based sentiment analysis has proven beneficial in a variety of contexts,
including market research to detect sentiment patterns, customer feedback analysis to
assess product evaluations, and social media monitoring to measure public opinion.
Furthermore, incorporating these models into chatbots and virtual assistants improves
their capacity to connect with people in an emotionally intelligent manner.

Fig 8. LLM Architecture for Transformers

5.3 Implementation of LLMs

When sentiment analysis is applied using Large Language Models (LLMs) and
hierarchical clustering, textual data is first thoroughly understood and categorized
according to sentiment, and then hierarchical clustering is used to further refine the
results. Data preparation is the first stage in the process, and it's important to do this to
make sure the textual data is ready for further analysis. To construct a clean and
uniform dataset, this requires tokenization, stop word removal, and sometimes
stemming or lemmatization.
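
A minimal sketch of this preparation step, using only the Python standard library,
is shown below. The stop-word list is a tiny hypothetical sample and the
regular-expression tokenizer is a simplification. Stemming or lemmatization would
need a dedicated NLP library and is omitted here.

```python
import re

# Tiny illustrative stop-word list; real pipelines use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "in", "it"}


def preprocess(text):
    """Lowercase, tokenize on letters/digits/apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


review = "The view of the valley was stunning, and the staff is friendly!"
clean_tokens = preprocess(review)
print(clean_tokens)
```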

The emphasis then shifts to using a pre-trained LLM, such as GPT-3, for sentiment
analysis on the preprocessed dataset. LLMs are able to recognize the subtleties of
language and give each text entry a sentiment label, usually categorizing the entries
as positive, negative, or neutral. By producing sentiment-labeled groups, this stage
offers a basic understanding of the emotional tone of the textual data.

Hierarchical clustering is the next step after the dataset has been given sentiment
labels. A useful method for organizing related data points into hierarchical structures
or trees is hierarchical clustering. This clustering stage attempts to arrange the dataset
according to the discovered sentiments in the context of sentiment-labeled data. Text
items that share sentiment labels are clustered together to form cohesive groups that
encapsulate the shared emotional context of the related sentences.

Potential subcategories under broad sentiment labels can be found through a nuanced
analysis of sentiment patterns made possible by the hierarchical clustering method. A
more detailed picture of the sentiment landscape is given by this hierarchical
structure, which identifies groups that share both general sentiment and maybe more
nuanced emotional aspects.

This two-step method provides a comprehensive way to find and arrange patterns in
textual data by integrating sentiment analysis utilizing LLMs with hierarchical
clustering. It makes it easier to identify desirable clusters based on emotional tones
and allows for a more comprehensive comprehension of sentiment differences.
Applying this technique to customer reviews, social media data, or any other text-rich
dataset offers a strong foundation for gaining insights and honing the research through
hierarchical organization. The accuracy and significance of the detected clusters may
be further improved by ongoing refinement and validation against ground truth data,
which can provide important insights into the emotional terrain captured in the textual
material.

Fig 7. Using results of Clustering for sentiment analysis

5.4 Sentiment Analysis on results of Hierarchical Clustering

In order to uncover subtle sentiment patterns, sentiment analysis on the clusters
produced by hierarchical clustering evaluates the emotional tone within specific sets
of data points. Hierarchical clustering arranges data into hierarchical structures by
grouping it according to similarity. Once clusters have been generated, sentiment
analysis is used to determine the main emotions inside each cluster. With the help of
this method, sentiment variations may be examined at a finer level, making it
possible to identify distinct emotional themes among various data subsets. Together,
sentiment analysis and hierarchical clustering allow for a more nuanced understanding
of the feelings expressed across a range of categories or themes, as well as insight
into the varied emotional landscapes seen in large datasets.
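
One way to picture this step: given a cluster id and a sentiment label for each
item, count the labels inside each cluster and report the dominant one. The records
below are invented for illustration, and `dominant_sentiment_by_cluster` is a
hypothetical helper, not the project's implementation.

```python
from collections import Counter, defaultdict


def dominant_sentiment_by_cluster(records):
    """Map each cluster id to (most common sentiment, share of that sentiment)."""
    counts = defaultdict(Counter)
    for cluster_id, sentiment in records:
        counts[cluster_id][sentiment] += 1
    summary = {}
    for cluster_id, counter in counts.items():
        sentiment, hits = counter.most_common(1)[0]
        summary[cluster_id] = (sentiment, hits / sum(counter.values()))
    return summary


# (cluster id, sentiment label) pairs; invented example data.
records = [
    (0, "positive"), (0, "positive"), (0, "neutral"),
    (1, "negative"), (1, "negative"), (1, "negative"), (1, "positive"),
]
summary = dominant_sentiment_by_cluster(records)
print(summary)
```

Reporting the share alongside the dominant label shows how uniform each cluster is,
which helps to spot clusters with mixed sentiment.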

Sentiment analysis applied to the clusters produced by hierarchical clustering
provides a potent way to investigate the subtle emotional aspects in large, intricate
datasets. It goes beyond a general sentiment analysis by offering a hierarchically
organized framework. Because clusters are collections of data points with comparable
attributes, they make possible a more focused investigation of sentiment dynamics
within specific settings or topics.

Furthermore, sentiment analysis findings on hierarchical clusters might be useful in a
variety of contexts. For instance, in marketing, knowing the sentiment trends among
various client categories might help develop customized communication or product
positioning strategies. Sentiment trends within topic-specific clusters in social media
research might reveal popular views on particular subjects. Thus, this integrative
method not only makes sentiment research results easier to understand, but it also
creates opportunities for better decision-making across a range of industries.

Chapter - 6
Conclusion

6.1 Accomplishments:

Our method carefully revealed fundamental groups within a variety of places by
delving into data-driven insights and utilising hierarchical clustering for tourist
location research. Through careful consideration of several criteria such as attractions,
facilities, and tourist preferences, we have created clusters that accurately depict
complex patterns found in travel environments. The ensuing clusters function as an
all-encompassing framework, providing a more profound comprehension of the
hierarchical connections among tourism destinations.

The way we view and traverse tourist attractions has been revolutionised by the
subsequent visualisation of these clusters on Google Maps. With the help of this
interactive mapping function, visitors and destination managers may make
well-informed selections by simplifying complicated data into cohesive clusters. The
easily navigable website helps with strategic tourism management as well as effective
itinerary planning, enabling local authorities to pinpoint and attend to particular
requirements within each cluster.

● Scraped data from different websites and compiled it into a dataset of more than
17k locations within India.
● Used LLMs to extract the sentiment of users from the prompts they pass to our
application, providing them with relevant tourism recommendations with ease
of use.
● Used Hierarchical Clustering to divide the locations into different clusters
using multiple features, including reviews, review count, distance, tags,
latitude and longitude.
● Visualised the clusters in 2D as well as 3D scatter plots using Principal
Component Analysis (PCA) as well as t-SNE.
● Visualised the clusters on Google Maps along with the interconnecting lines
between different locations in a cluster, thus aiding clear visualisation of
the distance between similar locations within a locality.
● Made a website where users can get recommendations similar to a location
within their city, or across the whole of India if they wish.

Fig 9. Visualisation of Clusters in Delhi

Fig 10. Visualisation of clusters in Manali

6.2 Future Scope:

In order to improve our knowledge of visitor preferences and behaviours, we realise
how important it is to integrate data from many sources. We are able to give our
consumers recommendations that are more thorough and pertinent by combining data
from many sources.

We also want to create customised assessment criteria in order to measure the efficacy
of our updated clustering strategy. With the use of these indicators, we will be able to
evaluate the effectiveness of our recommendations statistically and make any
necessary corrections.

We also intend to incorporate user feedback systems and dynamic updates. This will
guarantee that, over time, our recommendations stay relevant and in line with the
tastes of specific users. We are dedicated to providing a tourism destination
recommendation system that constantly raises user happiness and provides
high-quality suggestions by listening to customer comments and improving our
algorithms.

Furthermore, our system will provide customised descriptions of attractions, creating
vibrant images that arouse wanderlust. Telling a tale that sparks interest and
enthusiasm is more important than just providing a list of facts.

References

[1] Z. Wang, B. Liu, and B. Liu, “Tourism recommendation system based on data
mining,” J. Phys. Conf. Ser., vol. 1345, no. 2, p. 022027, Nov. 2019, doi:
10.1088/1742-6596/1345/2/022027.

[2] Z. Abbasi-Moud, H. Vahdat-Nejad, and J. Sadri, “Tourism recommendation
system based on semantic clustering and sentiment analysis,” Expert Syst. Appl.,
vol. 167, p. 114324, Apr. 2021, doi: 10.1016/j.eswa.2020.114324.

[3] Z. Jia, Y. Yang, W. Gao, and X. Chen, “User-Based Collaborative Filtering for
Tourist Attraction Recommendations,” in 2015 IEEE International Conference on
Computational Intelligence & Communication Technology, Feb. 2015, pp. 22–25.
doi: 10.1109/CICT.2015.20.

[4] S. Djebali, Q. Gabot, and G. Guerard, “Hierarchical Clustering and Measure
for Tourism Profiling,” in Web and Big Data, B. Li, L. Yue, C. Tao, X. Han, D.
Calvanese, and T. Amagasa, Eds., in Lecture Notes in Computer Science. Cham:
Springer Nature Switzerland, 2023, pp. 158–165. doi:
10.1007/978-3-031-25198-6_12.

[5] G. Ratnakanth and S. Poonkuzhali, “Indian Tourist Recommendation System
Using Collaborative Filtering and Deep Autoencoder,” in Information and
Communication Technology for Competitive Strategies (ICTCS 2021), M. S.
Kaiser, J. Xie, and V. S. Rathore, Eds., in Lecture Notes in Networks and
Systems. Singapore: Springer Nature, 2023, pp. 341–356. doi:
10.1007/978-981-19-0098-3_34.

[6] N. Wayan Priscila Yuni Praditya, A. Erna Permanasari, and I. Hidayah,
“Designing a tourism recommendation system using a hybrid method
(Collaborative Filtering and Content-Based Filtering),” in 2021 IEEE
International Conference on Communication, Networks and Satellite
(COMNETSAT), Jul. 2021, pp. 298–305. doi:
10.1109/COMNETSAT53002.2021.9530823.

[7] C.-C. Yu and H. Chang, “Personalized Location-Based Recommendation Services
for Tour Planning in Mobile Tourism Applications,” in E-Commerce and Web
Technologies, T. Di Noia and F. Buccafurri, Eds., in Lecture Notes in Computer
Science. Berlin, Heidelberg: Springer, 2009, pp. 38–49. doi:
10.1007/978-3-642-03964-5_5.

[8] E. E. Stephy and M. Rajeswari, “Empowering Tourists with Context-Aware
Recommendations using GAN,” in 2023 Second International Conference on
Electronics and Renewable Systems (ICEARS), Mar. 2023, pp. 1444–1449. doi:
10.1109/ICEARS56392.2023.10085604.

[9] “Beautiful Soup Documentation.” Accessed: Feb. 08, 2023. [Online]. Available:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#module-bs4

[10] K. Reitz, “requests: Python.” Accessed: Feb. 08, 2023. [OS Independent].
Available: https://requests.readthedocs.io

[11] “Selenium,” Selenium. Accessed: Feb. 06, 2023. [Online]. Available:
https://www.selenium.dev/documentation/overview/

[12] “Lonely Planet.” Accessed: Feb. 08, 2023. [Online]. Available:
https://www.lonelyplanet.com/india/attractions

[13] “Trip Advisor.” Accessed: Feb. 08, 2023. [Online]. Available:
https://www.tripadvisor.in/

[14] “Trip.com Official Site.” Accessed: Feb. 08, 2023. [Online]. Available:
https://us.trip.com/?locale=en-us
