Search Results (211)

Search Parameters:
Keywords = tweet dataset

23 pages, 863 KiB  
Article
Fine-Grained Arabic Post (Tweet) Geolocation Prediction Using Deep Learning Techniques
by Marwa K. Elteir
Information 2025, 16(1), 65; https://doi.org/10.3390/info16010065 - 18 Jan 2025
Viewed by 531
Abstract
Leveraging Twitter data for crisis management necessitates the accurate, fine-grained geolocation of tweets, which unfortunately is often lacking, with only 1–3% of tweets being geolocated. This work addresses the understudied problem of fine-grained geolocation prediction for Arabic tweets, focusing on the Kingdom of Saudi Arabia. The goal is to accurately assign tweets to one of thirteen provinces. Existing approaches for Arabic geolocation are limited in accuracy and often rely on basic machine learning techniques. Additionally, advancements in tweet geolocation for other languages often rely on distinct datasets, hindering direct comparisons and assessments of their relative performance on Arabic datasets. To bridge this gap, we investigate eight advanced deep learning techniques, including two Arabic pretrained language models (PLMs), on one constructed dataset. Through a comprehensive analysis, we assess the strengths and weaknesses of each technique for fine-grained Arabic tweet geolocation. Despite the success of PLMs in various tasks, our results demonstrate that a combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers yields the best performance, achieving a test accuracy of 93.85%.

21 pages, 563 KiB  
Article
Revisiting Information Cascades in Online Social Networks
by Michael Sidorov, Ofer Hadar and Dan Vilenchik
Mathematics 2025, 13(1), 77; https://doi.org/10.3390/math13010077 - 28 Dec 2024
Viewed by 469
Abstract
It is widely believed that a user’s activity pattern in Online Social Networks (OSNs) is strongly influenced by their friends or the users they follow. Building on this intuition, numerous models have been proposed over the years to predict information propagation in OSNs. Many of these models drew inspiration from the process of infectious spread within a population. While this approach is definitely plausible, it relies on knowledge of users’ social connections, which can be challenging to obtain due to privacy concerns. Moreover, while a significant body of work has focused on predicting macro-level features, such as the total cascade size, relatively little attention has been given to the prediction of micro-level features, such as the activity of an individual user. In this study, we aim to address this gap by proposing a method to predict the activity of individual users in an OSN, relying solely on their interactions rather than prior knowledge of their social network. We evaluated our results on four large datasets, each comprising over 14 million tweets, recorded on the X social network across four different topics over several months. Our method achieved a mean F1 score of 0.86, with a best result of 0.983.
(This article belongs to the Special Issue Big Data and Complex Networks)

11 pages, 4385 KiB  
Article
The Impact of Autonomous Vehicle Accidents on Public Sentiment: A Decadal Analysis of Twitter Discourse Using roBERTa
by Romy Sauvayre, Jessica S. M. Gable, Adam Aalah, Melvin Fernandes Novo, Maxime Dehondt and Cédric Chauvière
Technologies 2024, 12(12), 270; https://doi.org/10.3390/technologies12120270 - 23 Dec 2024
Viewed by 1236
Abstract
In the field of autonomous vehicle (AV) acceptance and opinion studies, questionnaires are widely used. Additionally, AV experiments and driving simulations are utilized. However, few AV studies have investigated social media, and fewer studies have analyzed the impact of AV crashes on public opinion, often relying on limited social media datasets. This study aims to address this gap by exploring a comprehensive dataset of six million tweets posted over a decade (2012–2021), to which neural networks, sentiment analysis, and knowledge graphs are applied. The results reveal that tweets predominantly convey negative sentiment (40.86%) rather than positive (32.52%) or neutral (26.62%) sentiment. A binary segmentation algorithm was used to distinguish an initial positive sentiment period (January 2012–May 2016) followed by a negative period (June 2016–December 2021), which was initiated by a fatal Tesla accident and reinforced by a pedestrian killed by an Uber AV. The sentiment polarity exhibited in the posted tweets was statistically significant (U = 24,914,037,786; p value < 0.001). The timeline analysis revealed that the negative sentiment period was initiated by fatal accidents involving a Tesla AV driver and a pedestrian hit by an Uber AV, which was amplified by the mainstream media.
(This article belongs to the Special Issue Advanced Autonomous Systems and Artificial Intelligence Stage)

21 pages, 968 KiB  
Article
Advancing Author Gender Identification in Modern Standard Arabic with Innovative Deep Learning and Textual Feature Techniques
by Hanen Himdi and Khaled Shaalan
Information 2024, 15(12), 779; https://doi.org/10.3390/info15120779 - 5 Dec 2024
Viewed by 756
Abstract
Author Gender Identification (AGI) is an extensively studied subject owing to its significance in several domains, such as security and marketing. Recognizing an author’s gender may assist marketers in segmenting consumers more effectively and crafting tailored content that aligns with a gender’s preferences. Also, in cybersecurity, identifying an author’s gender might aid in detecting phishing attempts where hackers could imitate individuals of a specific gender. Although studies in Arabic have mostly concentrated on written dialects, such as tweets, there is a paucity of studies addressing Modern Standard Arabic (MSA) in journalistic genres. To address the AGI issue, this work combines the beneficial properties of natural language processing with cutting-edge deep learning methods. Firstly, we propose a large 8k MSA article dataset composed of various columns sourced from news platforms, labeled with each author’s gender. Moreover, we extract and analyze textual features that may be beneficial in identifying gender-related cues through their writings, focusing on semantics and syntax linguistics. Furthermore, we probe several innovative deep learning models, namely, Convolutional Neural Networks (CNNs), LSTM, Bidirectional LSTM (BiLSTM), and Bidirectional Encoder Representations from Transformers (BERT). Beyond that, a novel enhanced BERT model is proposed by incorporating gender-specific textual features. Through various experiments, the results underscore the potential of both BERT and the textual features, resulting in a 91% accuracy for the enhanced BERT model and accuracies ranging from 80% to 90% for the deep learning models. We also employ these features for AGI in informal, dialectal text, with the enhanced BERT model reaching 68.7% accuracy. This demonstrates that these gender-specific textual features are conducive to AGI across MSA and dialectal texts.
(This article belongs to the Section Artificial Intelligence)

27 pages, 1831 KiB  
Article
A Multi-Architecture Approach for Offensive Language Identification Combining Classical Natural Language Processing and BERT-Variant Models
by Ashok Yadav, Farrukh Aslam Khan and Vrijendra Singh
Appl. Sci. 2024, 14(23), 11206; https://doi.org/10.3390/app142311206 - 1 Dec 2024
Viewed by 1020
Abstract
Offensive content is a complex and multifaceted form of harmful material that targets individuals or groups. In recent years, offensive language (OL) has become increasingly harmful, as it incites violence and intolerance. The automatic identification of OL on social networks is essential to curtail the spread of harmful content. We address this problem by developing an architecture to effectively respond to and mitigate the impact of offensive content on society. In this paper, we use the Davidson dataset containing 24,783 samples of tweets and propose three different architectures for detecting OL on social media platforms. Our proposed approach involves concatenation of features (TF-IDF, Word2Vec, sentiments, and FKRA/FRE) and a baseline machine learning model for the classification. We explore the effectiveness of different dimensions of GloVe embeddings in conjunction with deep learning models for classifying OL. We also propose an architecture that utilizes advanced transformer models such as BERT, ALBERT, and ELECTRA for pre-processing and encoding, with 1D CNN and neural network layers serving as the classification components. We achieve the highest precision, recall, and F1 score, i.e., 0.89, 0.90, and 0.90, respectively, for both the “bert encased preprocess/1 + small bert/L4H512A8/1 + neural network layers” model and the “bert encased preprocess/1 + electra small/2 + cnn” architecture.
(This article belongs to the Special Issue Data Mining and Machine Learning in Social Network Analysis)

21 pages, 1133 KiB  
Article
A Stacking Ensemble Based on Lexicon and Machine Learning Methods for the Sentiment Analysis of Tweets
by Sharaf J. Malebary and Anas W. Abulfaraj
Mathematics 2024, 12(21), 3405; https://doi.org/10.3390/math12213405 - 31 Oct 2024
Viewed by 1276
Abstract
Sentiment is employed in various fields, such as collecting web-based opinions for the formulation of governmental policies, measuring employee and customer satisfaction levels in business organizations, and measuring the sentiment of the public in political and security matters. The field has recently faced new challenges since algorithms must operate with highly unstructured sentiment data from social media. In this study, the authors present a new stacking ensemble method that combines the lexicon-based approach with machine learning algorithms to improve the sentiment analysis of tweets. Due to the complexity of the text with very ill-defined syntactic and grammatical patterns, using lexicon-based techniques to extract sentiment from the content is proposed. On the same note, the contextual and nuanced aspects of sentiment are inferred through machine learning algorithms. A sophisticated bat algorithm that uses an Elman network as a meta-classifier is then employed to classify the extracted features accurately. Substantial evidence from three datasets that are readily available for public analysis re-affirms the improvements this innovative approach brings to sentiment classification.
(This article belongs to the Special Issue Application of Artificial Intelligence in Decision Making)

27 pages, 4185 KiB  
Article
Leveraging Social Media and Deep Learning for Sentiment Analysis for Smart Governance: A Case Study of Public Reactions to Educational Reforms in Saudi Arabia
by Alanoud Alotaibi and Farrukh Nadeem
Computers 2024, 13(11), 280; https://doi.org/10.3390/computers13110280 - 28 Oct 2024
Cited by 1 | Viewed by 1546
Abstract
The Saudi government’s educational reforms aim to align the system with market needs and promote economic opportunities. However, a lack of credible data makes assessing public sentiment towards these reforms challenging. This research develops a sentiment analysis application to analyze public emotional reactions to educational reforms in Saudi Arabia using AraBERT, an Arabic language model. We constructed a unique Arabic dataset of 216,858 tweets related to the reforms, with 2000 manually labeled for public sentiment. To establish a robust evaluation framework, we employed random forests, support vector machines, and logistic regression as baseline models alongside AraBERT. We also compared the fine-tuned AraBERT Sentiment Classification model with CAMeLBERT, MARBERT, and LLM (GPT) models. The fine-tuned AraBERT model had an F1 score of 0.89, which was above the baseline models by 5% and demonstrated a 4% improvement compared to other pre-trained transformer models applied to this task. This highlights the advantage of transformer models specifically trained for the target language and domain (Arabic). Arabic-specific sentiment analysis models outperform multilingual models for this task. Overall, this study demonstrates the effectiveness of AraBERT in analyzing Arabic sentiment on social media. This approach has the potential to inform educational reform evaluation in Saudi Arabia and potentially other Arabic-speaking regions.
(This article belongs to the Special Issue Artificial Intelligence in Electronic Government (E-government))

22 pages, 1165 KiB  
Article
Advanced Comparative Analysis of Machine Learning and Transformer Models for Depression and Suicide Detection in Social Media Texts
by Biodoumoye George Bokolo and Qingzhong Liu
Electronics 2024, 13(20), 3980; https://doi.org/10.3390/electronics13203980 - 10 Oct 2024
Viewed by 1904
Abstract
Depression detection through social media analysis has emerged as a promising approach for early intervention and mental health support. This study evaluates the performance of various machine learning and transformer models in identifying depressive content from tweets on X. Utilizing the Sentiment140 and the Suicide-Watch datasets, we built several models, including logistic regression, Bernoulli Naive Bayes, Random Forest, and transformer models such as RoBERTa, DeBERTa, DistilBERT, and SqueezeBERT, to detect this content. Our findings indicate that transformer models, notably RoBERTa and DeBERTa, outperform traditional machine learning algorithms when predicting depression and suicide rates. This performance is attributed to the transformers’ ability to capture contextual nuances in language. On the other hand, logistic regression models outperform transformers on another dataset with more accurate information. This is attributed to the traditional model’s ability to understand simple patterns, especially when the classes are straightforward. We employed a comprehensive cross-validation approach to ensure robustness, with transformers demonstrating higher stability and reliability across splits. Despite limitations like dataset scope and computational constraints, the findings contribute significantly to mental health monitoring and suggest promising directions for future research and real-world applications in early depression detection and mental health screening tools. The various models used performed outstandingly.
(This article belongs to the Special Issue Information Retrieval and Cyber Forensics with Data Science)

21 pages, 2103 KiB  
Article
On the Utilization of Emoji Encoding and Data Preprocessing with a Combined CNN-LSTM Framework for Arabic Sentiment Analysis
by Hussam Alawneh, Ahmad Hasasneh and Mohammed Maree
Modelling 2024, 5(4), 1469-1489; https://doi.org/10.3390/modelling5040076 - 7 Oct 2024
Viewed by 1210
Abstract
Social media users often express their emotions through text in posts and tweets, and these can be used for sentiment analysis, identifying text as positive or negative. Sentiment analysis is critical for different fields such as politics, tourism, e-commerce, education, and health. However, sentiment analysis approaches that perform well on English text encounter challenges with Arabic text due to its morphological complexity. Effective data preprocessing and machine learning techniques are essential to overcome these challenges and provide insightful sentiment predictions for Arabic text. This paper evaluates a combined CNN-LSTM framework with emoji encoding for Arabic Sentiment Analysis, using the Arabic Sentiment Twitter Corpus (ASTC) dataset. Three experiments were conducted with eight-parameter fusion approaches to evaluate the effect of data preprocessing, namely the effect of emoji encoding on their real and emotional meaning. Emoji meanings were collected from four websites specialized in finding the meaning of emojis in social media. Furthermore, the Keras tuner optimized the CNN-LSTM parameters during the 5-fold cross-validation process. The highest accuracy rate (91.85%) was achieved by keeping non-Arabic words and removing punctuation, using the Snowball stemmer after encoding emojis into Arabic text, and applying Keras embedding. This approach is competitive with other state-of-the-art approaches, showing that emoji encoding enriches text by accurately reflecting emotions, and enabling investigation of the effect of data preprocessing, allowing the hybrid model to achieve comparable results to the study using the same ASTC dataset, thereby improving sentiment analysis accuracy.

23 pages, 5384 KiB  
Article
An Evaluation of the Maternal Patient Experience through Natural Language Processing Techniques: The Case of Twitter Data in the United States during COVID-19
by Debapriya Banik, Sreenath Chalil Madathil, Amit Joe Lopes, Sergio A. Luna Fong and Santosh K. Mukka
Appl. Sci. 2024, 14(19), 8762; https://doi.org/10.3390/app14198762 - 28 Sep 2024
Viewed by 1161
Abstract
The healthcare sector constantly investigates ways to improve patient outcomes and provide more patient-centered care. Delivering quality medical care involves ensuring that patients have a positive experience. Most healthcare organizations use patient survey feedback to measure patients’ experiences. However, the power of social media can be harnessed using artificial intelligence and machine learning techniques to provide researchers with valuable insights into understanding patient experience and care. Our primary research objective is to develop a social media analytics model to evaluate the maternal patient experience during the COVID-19 pandemic. We used the “COVID-19 Tweets” Dataset, which has over 28 million tweets, and extracted tweets from the US with words relevant to maternal patients. The maternal patient cohort was selected because the United States has the highest maternal mortality and morbidity rates among the developed countries in the world. We evaluated patient experience using natural language processing (NLP) techniques such as word clouds, word clustering, frequency analysis, and network analysis of words that relate to “pains” and “gains” regarding the maternal patient experience, which are expressed through social media. The pandemic showcased the worries of mothers and providers on the risks of COVID-19. However, many people also shared how they survived the pandemic. Both providers and maternal patients had concerns regarding the pregnancy risks due to COVID-19. This model will help process improvement experts without domain expertise to understand the various domain challenges efficiently. Such insights can help decision-makers improve the patient care system.
(This article belongs to the Special Issue Data Mining and Machine Learning in Social Network Analysis)

24 pages, 22050 KiB  
Article
SOD: A Corpus for Saudi Offensive Language Detection Classification
by Afefa Asiri and Mostafa Saleh
Computers 2024, 13(8), 211; https://doi.org/10.3390/computers13080211 - 20 Aug 2024
Viewed by 1128
Abstract
Social media platforms like X (formerly known as Twitter) are integral to modern communication, enabling the sharing of news, emotions, and ideas. However, they also facilitate the spread of harmful content, and manual moderation of these platforms is impractical. Automated moderation tools, predominantly developed for English, are insufficient for addressing online offensive language in Arabic, a language rich in dialects and informally used on social media. This gap underscores the need for dedicated, dialect-specific resources. This study introduces the Saudi Offensive Dialectal dataset (SOD), consisting of over 24,000 tweets annotated across three levels: offensive or non-offensive, with offensive tweets further categorized as general insults, hate speech, or sarcasm. A deeper analysis of hate speech identifies subtypes related to sports, religion, politics, race, and violence. A comprehensive descriptive analysis of the SOD is also provided to offer deeper insights into its composition. Using machine learning, traditional deep learning, and transformer-based deep learning models, particularly AraBERT, our research achieves a significant F1-Score of 87% in identifying offensive language. This score improves to 91% with data augmentation techniques addressing dataset imbalances. These results, which surpass many existing studies, demonstrate that a specialized dialectal dataset enhances detection efficacy compared to mixed-language datasets.
(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

32 pages, 7299 KiB  
Article
Analysing A/O Possession in Māori-Language Tweets
by David Trye, Andreea S. Calude, Ray Harlow and Te Taka Keegan
Languages 2024, 9(8), 271; https://doi.org/10.3390/languages9080271 - 6 Aug 2024
Viewed by 1324
Abstract
This article contributes the first corpus-based study of possession in Māori, the indigenous language of Aotearoa New Zealand. Like most Polynesian languages, Māori has a dual possessive system involving a choice between the so-called A and O categories. While Māori grammars describe these categories in terms of the inherent semantic relationship between the possessum and possessor, there have been no large-scale corpus analyses demonstrating their use in natural contexts. Social media provide invaluable opportunities for such linguistic studies, capturing contemporary language use while alleviating the burden of gathering data through traditional means. We operationalise semantic distinctions to investigate possession in Māori-language tweets, focusing on the [possessum a/o possessor] construction (e.g., te tīmatanga o te wiki ‘the beginning of the week’). In our corpus comprising 2500 tweets produced by more than 200 individuals, we find that users leverage a wide array of noun types encompassing many different semantic relationships. We observe not only the expected predominance of the O category, but also a tendency for examples described by Māori grammars as A-marked to instead be O-marked (59%). Although the A category persists in the corpus, our findings suggest that language change could be underway. Our primary dataset can be explored interactively online.
(This article belongs to the Special Issue Linguistics of Social Media)

14 pages, 4537 KiB  
Article
Multimodal Hateful Meme Classification Based on Transfer Learning and a Cross-Mask Mechanism
by Fan Wu, Guolian Chen, Junkuo Cao, Yuhan Yan and Zhongneng Li
Electronics 2024, 13(14), 2780; https://doi.org/10.3390/electronics13142780 - 15 Jul 2024
Viewed by 1737
Abstract
Hateful memes are malicious and biased sentiment information widely spread on the internet. Detecting hateful memes differs from traditional multimodal tasks because, in conventional tasks, visual and textual information align semantically. However, the challenge in detecting hateful memes lies in their unique multimodal nature, where images and text in memes may be weakly related or unrelated, requiring models to understand the content and perform multimodal reasoning. To address this issue, we introduce a multimodal fine-grained hateful memes detection model named “TCAM”. The model leverages advanced encoding techniques from TweetEval and CLIP and introduces enhanced Cross-Attention and Cross-Mask Mechanisms (CAM) in the feature fusion stage to improve multimodal correlations. It effectively embeds fine-grained features of data and image descriptions into the model through transfer learning. This paper uses the Area Under the Receiver Operating Characteristic Curve (AUROC) as the primary metric to evaluate the model’s discriminatory ability. This approach achieved an AUROC score of 0.8362 and an accuracy score of 0.764 on the Facebook Hateful Memes Challenge (FHMC) dataset, confirming its high discriminatory capability. The TCAM model demonstrates relatively superior performance compared to ensemble machine learning methods.
(This article belongs to the Special Issue Application of Data Mining in Social Media)

14 pages, 1052 KiB  
Article
The Effect of Training Data Size on Disaster Classification from Twitter
by Dimitrios Effrosynidis, Georgios Sylaios and Avi Arampatzis
Information 2024, 15(7), 393; https://doi.org/10.3390/info15070393 - 8 Jul 2024
Viewed by 1366
Abstract
In the realm of disaster-related tweet classification, this study presents a comprehensive analysis of various machine learning algorithms, shedding light on crucial factors influencing algorithm performance. The exceptional efficacy of simpler models is attributed to the quality and size of the dataset, enabling them to discern meaningful patterns. While powerful, complex models are time-consuming and prone to overfitting, particularly with smaller or noisier datasets. Hyperparameter tuning, notably through Bayesian optimization, emerges as a pivotal tool for enhancing the performance of simpler models. A practical guideline for algorithm selection based on dataset size is proposed, consisting of Bernoulli Naive Bayes for datasets below 5000 tweets and Logistic Regression for larger datasets exceeding 5000 tweets. Notably, Logistic Regression shines with 20,000 tweets, delivering an impressive combination of performance, speed, and interpretability. A further improvement of 0.5% is achieved by applying ensemble and stacking methods.
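The dataset-size guideline stated in this abstract can be read as a simple threshold rule. The snippet below is a minimal sketch of that rule only, not the authors' code: the function name `suggest_classifier` is hypothetical, and treating a dataset of exactly 5000 tweets as "larger" is an assumption, since the abstract does not say how that boundary is handled.

```python
def suggest_classifier(n_tweets: int) -> str:
    """Return the classifier family the abstract's guideline suggests for a dataset size.

    Hypothetical helper for illustration; it encodes only the threshold
    stated in the abstract and trains nothing.
    """
    if n_tweets < 0:
        raise ValueError("n_tweets must be non-negative")
    # Abstract: Bernoulli Naive Bayes for datasets below 5000 tweets.
    if n_tweets < 5000:
        return "Bernoulli Naive Bayes"
    # Abstract: Logistic Regression for larger datasets exceeding 5000 tweets.
    # Assumption: exactly 5000 falls on this side; the abstract leaves it open.
    return "Logistic Regression"


if __name__ == "__main__":
    for size in (1000, 4999, 5000, 20000):
        print(size, "->", suggest_classifier(size))
```

The rule is kept as plain strings so it stays independent of any particular library; a real pipeline would map each name to whatever implementation it uses.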

17 pages, 3202 KiB  
Article
Arabic Spam Tweets Classification: A Comprehensive Machine Learning Approach
by Wafa Hussain Hantom and Atta Rahman
AI 2024, 5(3), 1049-1065; https://doi.org/10.3390/ai5030052 - 2 Jul 2024
Cited by 1 | Viewed by 1735
Abstract
Nowadays, one of the most common problems faced by Twitter (also known as X) users, including individuals as well as organizations, is dealing with spam tweets. The problem continues to proliferate due to the increasing popularity and number of users of social media platforms. Due to this overwhelming interest, spammers can post texts, images, and videos containing suspicious links that can be used to spread viruses, rumors, negative marketing, and sarcasm, and potentially hack the user’s information. Spam detection is among the hottest research areas in natural language processing (NLP) and cybersecurity. Several studies have been conducted in this regard, but they mainly focus on the English language. However, Arabic tweet spam detection still has a long way to go, especially emphasizing the diverse dialects other than modern standard Arabic (MSA), since, in the tweets, the standard dialect is seldom used. The situation demands an automated, robust, and efficient Arabic spam tweet detection approach. To address the issue, in this research, various machine learning and deep learning models have been investigated to detect spam tweets in Arabic, including Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB) and Long Short-Term Memory (LSTM). In this regard, we have focused on the words as well as the meaning of the tweet text. Upon several experiments, the proposed models have produced promising results in contrast to the previous approaches for the same and diverse datasets. The results showed that the RF classifier achieved 96.78% and the LSTM classifier achieved 94.56%, followed by the SVM classifier that achieved 82% accuracy. Further, in terms of F1-score, there is an improvement of 21.38%, 19.16% and 5.2% using RF, LSTM and SVM classifiers compared to the schemes with the same dataset.