Search Results (640)

Search Parameters:
Keywords = audio features

21 pages, 1111 KiB  
Article
Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis
by Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler and Pål Halvorsen
Big Data Cogn. Comput. 2025, 9(3), 59; https://doi.org/10.3390/bdcc9030059 - 4 Mar 2025
Viewed by 178
Abstract
This paper explores advancements in real-time talking-head generation, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI’s Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. Although interviewer training systems are considered a potential application, the primary contribution of this work is the improvement of the technical foundations necessary for creating responsive AI avatars. These advancements enable more immersive interactions and expand the scope of AI-driven applications, including educational tools and simulated training environments. Full article

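The entry above centers on swapping a conventional audio-feature-extraction model for Whisper's encoder. As a rough illustration only (not the authors' integrated real-time system), the sketch below pulls frame-level encoder states from a pretrained Whisper checkpoint via the Hugging Face transformers API; the checkpoint name and the 16 kHz mono input are assumptions.

```python
# Minimal sketch: use Whisper's encoder as an audio feature extractor.
# Assumes a 16 kHz mono waveform in `audio` (numpy array); checkpoint choice is illustrative.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

def whisper_features(audio, sampling_rate=16000):
    # Log-mel input features padded/truncated to Whisper's 30 s window.
    inputs = extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        # Run only the encoder; shape: (batch, frames, hidden_size).
        hidden = model.encoder(inputs.input_features).last_hidden_state
    return hidden

# features = whisper_features(audio)  # frame features for a downstream talking-head renderer
```
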
15 pages, 6428 KiB  
Article
Application of Controlled-Source Audio-Frequency Magnetotellurics (CSAMT) for Subsurface Structural Characterization of Wadi Rum, Southwest Jordan
by Abdullah Basaloom and Hassan Alzahrani
Sustainability 2025, 17(5), 2107; https://doi.org/10.3390/su17052107 - 28 Feb 2025
Viewed by 136
Abstract
In 2011, the UNESCO World Heritage Centre announced that the Wadi Rum Protected Area (WRPA) is a global landmark of natural and cultural attraction; the area is an emerging industrial suburb of critical socio-economic significance to Jordan. The study area in Wadi Rum is located northeast of the Gulf of Aqaba between the African and Arabian plates. The region is historically characterized by significant tectonic activity and seismic events. This study focuses on characterizing the subsurface structural features of Wadi Rum through the application of the geophysical method of controlled-source audio-frequency magnetotellurics (CSAMT). CSAMT data were collected from 16 sounding stations, processed, and qualitatively interpreted. The qualitative interpretation involved two main approaches: constructing sounding curves for each station and generating apparent resistivity maps at fixed depths (frequencies). The results revealed the presence of at least four distinct subsurface layers. The surface layer exhibited relatively low resistivity values (<200 Ω·m), corresponding to alluvial and wadi sediments, as well as mud flats. Two intermediate layers were identified: the first showed very low resistivity values (80–100 Ω·m), likely due to medium-grained bedded sandstone, while the second displayed intermediate resistivity values (100–800 Ω·m), representing coarse basal conglomerates and coarse sandstone formations. The deepest layer demonstrated very high resistivity values (>1000 Ω·m), likely attributable to basement rocks. Analysis of the resistivity maps, combined with prior geological information, indicates that the subsurface in the study area features a graben-like structure characterized by two detected faults trending in the northeast (NE) and southwest (SW) directions. By providing critical insights into the subsurface structure, the findings of this study make a considerable contribution to the urban sustainability of the region, supporting the careful assessment of potential hazards and the strategic planning of future infrastructure development within the protected area. Full article

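Since the interpretation above rests on apparent-resistivity sounding curves, a small sketch of the standard far-field (Cagniard) apparent-resistivity calculation may help. This is the textbook plane-wave CSAMT/MT formula, not necessarily the exact processing used in the study, and the sample field values are hypothetical.

```python
# Sketch of a far-field (Cagniard) apparent-resistivity sounding curve.
# rho_a = |E/H|^2 / (2*pi*f*mu0), with E in V/m and H in A/m (plane-wave assumption).
import numpy as np

MU0 = 4e-7 * np.pi  # magnetic permeability of free space (H/m)

def apparent_resistivity(e_field, h_field, freq):
    """E and H are complex spectral amplitudes at frequency freq (Hz)."""
    z = e_field / h_field                      # impedance E/H (ohms)
    return np.abs(z) ** 2 / (2 * np.pi * freq * MU0)

# Hypothetical measurements at one station, highest to lowest frequency:
freqs = np.array([8192.0, 2048.0, 512.0, 128.0, 32.0, 8.0])
E = np.array([3.2e-4, 1.6e-4, 9.0e-5, 5.5e-5, 3.0e-5, 2.1e-5])   # V/m
H = np.array([4.0e-7, 2.5e-7, 1.8e-7, 1.1e-7, 5.0e-8, 2.5e-8])   # A/m

rho_a = apparent_resistivity(E, H, freqs)
for f, r in zip(freqs, rho_a):
    print(f"{f:8.1f} Hz  ->  {r:10.1f} ohm-m")  # lower frequencies probe greater depths
```
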
18 pages, 1911 KiB  
Article
Enhancing Embedded Space with Low–Level Features for Speech Emotion Recognition
by Lukasz Smietanka and Tomasz Maka
Appl. Sci. 2025, 15(5), 2598; https://doi.org/10.3390/app15052598 - 27 Feb 2025
Viewed by 237
Abstract
This work proposes an approach that builds a feature space by combining a representation obtained through unsupervised learning with manually selected features describing the prosody of the utterances. In the experiments, we used two time-frequency representations (Mel and CQT spectrograms) and the EmoDB and RAVDESS databases. As the results show, the proposed system improved the classification accuracy for both representations: by 1.29% for CQT and 3.75% for the Mel spectrogram compared to a typical CNN architecture on the EmoDB dataset, and by 3.02% for CQT and 0.63% for the Mel spectrogram in the case of RAVDESS. Additionally, the results show a significant increase of around 14% in classification performance for the happiness and disgust emotions using Mel spectrograms, and around 20% for happiness and disgust using CQT, for the best models trained on EmoDB. For the models that achieved the highest result on the RAVDESS database, the most significant improvement was observed in the classification of the neutral state, around 16%, using the Mel spectrogram; for the CQT representation, the most significant improvement occurred for fear and surprise, around 9%. Additionally, the average results for all prepared models showed the positive impact of the method on the classification quality of most emotional states. For the EmoDB database, the highest average improvement was observed for happiness (14.6%), while for the other emotions it ranged from 1.2% to 8.7%. The only exception was sadness, for which the average classification quality decreased by 1% when using the Mel spectrogram. In turn, for the RAVDESS database, the most significant improvement also occurred for happiness (7.5%), while for the other emotions it ranged from 0.2% to 7.1%, except for disgust and calm, whose classification deteriorated for the Mel spectrogram and the CQT representation, respectively. Full article

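As a loose companion to the entry above, the sketch below computes the two time-frequency representations it mentions (Mel and CQT spectrograms) with librosa and adds a few hand-picked prosody-style descriptors; the specific descriptors and sizes are illustrative, not the paper's exact feature set.

```python
# Sketch: Mel and CQT spectrograms plus simple prosodic descriptors with librosa.
import numpy as np
import librosa

def extract_representations(path):
    y, sr = librosa.load(path, sr=16000)

    # Time-frequency representations (log-scaled), matching the two compared inputs.
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr, n_bins=84)))

    # A few low-level prosody-style features (illustrative choice).
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)            # fundamental frequency track
    rms = librosa.feature.rms(y=y)[0]                        # energy envelope
    zcr = librosa.feature.zero_crossing_rate(y)[0]
    prosody = np.array([np.nanmean(f0), np.nanstd(f0), rms.mean(), rms.std(), zcr.mean()])

    return mel, cqt, prosody

# mel, cqt, prosody = extract_representations("utterance.wav")
# A learned embedding of `mel` or `cqt` could then be concatenated with `prosody`.
```
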
17 pages, 1463 KiB  
Article
Interpretable Probabilistic Identification of Depression in Speech
by Stavros Ntalampiras
Sensors 2025, 25(4), 1270; https://doi.org/10.3390/s25041270 - 19 Feb 2025
Viewed by 175
Abstract
Mental health assessment is typically carried out via a series of conversation sessions with medical professionals, where the overall aim is the diagnosis of mental illnesses and well-being evaluation. Despite its arguable socioeconomic significance, national health systems fail to meet the increased demand for such services that has been observed in recent years. To assist and accelerate the diagnosis process, this work proposes an AI-based tool able to provide interpretable predictions by automatically processing the recorded speech signals. An explainability-by-design approach is followed, where audio descriptors related to the problem at hand form the feature vector (Mel-scaled spectrum summarization, Teager operator and periodicity description), while modeling is based on Hidden Markov Models adapted from an ergodic universal one following a suitably designed data selection scheme. After extensive and thorough experiments adopting a standardized protocol on a publicly available dataset, we report significantly higher results with respect to the state of the art. In addition, an ablation study was carried out, providing a comprehensive analysis of the relevance of each system component. Last but not least, the proposed solution not only provides excellent performance, but its operation and predictions are transparent and interpretable, laying out the path to close the usability gap existing between such systems and medical personnel. Full article
(This article belongs to the Special Issue Advances in Acoustic Sensors and Deep Audio Pattern Recognition)

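One of the descriptors named above, the Teager (energy) operator, has a simple discrete form, Psi[x](n) = x(n)^2 - x(n-1)*x(n+1). The minimal numpy sketch below illustrates that operator only, not the full HMM-based system.

```python
# Minimal sketch of the discrete Teager energy operator:
#   psi[x](n) = x(n)^2 - x(n-1) * x(n+1)
import numpy as np

def teager_energy(x):
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]      # defined for interior samples only

# Tiny check on a pure tone: for x(n) = A*cos(w*n), psi is approximately A^2 * sin(w)^2.
n = np.arange(1000)
tone = 0.5 * np.cos(0.1 * n)
print(teager_energy(tone)[:5])                # ~0.25 * sin(0.1)**2 ≈ 0.00249
```
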
32 pages, 4102 KiB  
Article
A Multimodal Pain Sentiment Analysis System Using Ensembled Deep Learning Approaches for IoT-Enabled Healthcare Framework
by Anay Ghosh, Saiyed Umer, Bibhas Chandra Dhara and G. G. Md. Nawaz Ali
Sensors 2025, 25(4), 1223; https://doi.org/10.3390/s25041223 - 17 Feb 2025
Viewed by 317
Abstract
This study introduces a multimodal sentiment analysis system to assess and recognize human pain sentiments within an Internet of Things (IoT)-enabled healthcare framework. The system integrates facial expressions and speech-audio recordings to evaluate human pain intensity levels. This integration aims to enhance the recognition system’s performance and enable a more accurate assessment of pain intensity. Such a multimodal approach supports improved decision making in real-time patient care, addressing limitations inherent in unimodal systems for measuring pain sentiment. The primary contribution of this work lies in developing a multimodal pain sentiment analysis system that integrates the outcomes of image-based and audio-based pain sentiment analysis models. The system implementation comprises five key phases. The first phase detects the facial region in a video sequence, a crucial step for extracting facial patterns indicative of pain. In the second phase, the system extracts discriminant and divergent features from the facial region using deep learning techniques, employing several convolutional neural network (CNN) architectures that are further refined through transfer learning and fine-tuning of parameters, alongside fusion techniques aimed at optimizing the model’s performance. The third phase preprocesses the speech-audio recordings and extracts significant features with conventional methods; in the fourth phase, a deep learning model generates divergent features to recognize audio-based pain sentiments. The final phase combines the outcomes of the image-based and audio-based pain sentiment analysis systems, improving the overall performance of the multimodal system. This fusion enables the system to accurately predict pain levels, including ‘high pain’, ‘mild pain’, and ‘no pain’. The performance of the proposed system was tested on three image-based databases, the 2D Face Set Database with Pain Expression, the UNBC-McMaster database (shoulder pain), and the BioVid database (heat pain), along with the VIVAE database for the audio modality. Extensive experiments were performed using these datasets. The proposed system achieved accuracies of 76.23%, 84.27%, and 38.04% for the two-, three-, and five-class pain problems on the 2D Face Set Database with Pain Expression, UNBC, and BioVid datasets, respectively. The VIVAE audio-based system recorded peak accuracies of 97.56% and 98.32% for varying training–testing protocols. These performances were compared with state-of-the-art methods, showing the superiority of the proposed system. By combining the outputs of both deep learning frameworks on the image and audio datasets, the proposed multimodal pain sentiment analysis system achieves accuracies of 99.31% for the two-class, 99.54% for the three-class, and 87.41% for the five-class pain problems. Full article
(This article belongs to the Section Physical Sensors)

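The final phase described above is a decision-level fusion of the image-based and audio-based models. Below is a minimal sketch of one common way to do this, a weighted average of class probabilities; the weights and class set are assumptions, not the paper's fusion rule.

```python
# Sketch: decision-level (late) fusion of image-based and audio-based pain classifiers.
import numpy as np

CLASSES = ["no pain", "mild pain", "high pain"]    # three-class setting, for illustration

def late_fusion(p_image, p_audio, w_image=0.6, w_audio=0.4):
    """Weighted average of per-class probabilities from the two unimodal models."""
    fused = w_image * np.asarray(p_image) + w_audio * np.asarray(p_audio)
    fused /= fused.sum()                           # renormalize to a probability vector
    return CLASSES[int(np.argmax(fused))], fused

# Hypothetical softmax outputs from the two branches for one sample:
label, probs = late_fusion([0.2, 0.5, 0.3], [0.05, 0.25, 0.7])
print(label, probs)    # the audio branch tips the fused decision toward "high pain"
```
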
11 pages, 877 KiB  
Article
Beyond Spectrograms: Rethinking Audio Classification from EnCodec’s Latent Space
by Jorge Perianez-Pascual, Juan D. Gutiérrez, Laura Escobar-Encinas, Álvaro Rubio-Largo and Roberto Rodriguez-Echeverria
Algorithms 2025, 18(2), 108; https://doi.org/10.3390/a18020108 - 16 Feb 2025
Viewed by 190
Abstract
This paper presents a novel approach to audio classification leveraging the latent representation generated by Meta’s EnCodec neural audio codec. We hypothesize that the compressed latent space representation captures essential audio features better suited to classification tasks than traditional spectrogram-based approaches. To validate this, we train a vanilla convolutional neural network for music genre, speech/music, and environmental sound classification using EnCodec’s encoder output as input. We then compare its performance with that of the same network trained on a spectrogram-based representation. Our experiments demonstrate that this approach achieves accuracy comparable to state-of-the-art methods while exhibiting significantly faster convergence and reduced computational load during training. These findings suggest the potential of EnCodec’s latent representation for efficient, faster, and less expensive audio classification applications. We analyze the characteristics of EnCodec’s output and compare its performance against traditional spectrogram-based approaches, providing insights into the advantages of this novel approach. Full article

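As a rough sketch only, assuming the Hugging Face transformers port of EnCodec and its 24 kHz checkpoint (which exposes the encoder submodule), the block below pulls the continuous latent frames that a classifier like the one in this entry would consume; the checkpoint name and shapes are assumptions.

```python
# Sketch: extract EnCodec's continuous latent representation for a clip.
# Assumes the Hugging Face transformers port of Meta's EnCodec (24 kHz checkpoint).
import torch
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz").eval()

def encodec_latents(audio, sampling_rate=24000):
    # `audio` is a mono numpy waveform at 24 kHz.
    inputs = processor(raw_audio=audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        # Continuous pre-quantization embeddings: (batch, latent_dim, frames).
        latents = model.encoder(inputs["input_values"])
    return latents

# z = encodec_latents(audio)
# z (or a pooled version of it) is then fed to a small CNN instead of a spectrogram.
```
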
20 pages, 634 KiB  
Article
SATRN: Spiking Audio Tagging Robust Network
by Shouwei Gao, Xingyang Deng, Xiangyu Fan, Pengliang Yu, Hao Zhou and Zihao Zhu
Electronics 2025, 14(4), 761; https://doi.org/10.3390/electronics14040761 - 15 Feb 2025
Viewed by 220
Abstract
Audio tagging, as a fundamental task in acoustic signal processing, has demonstrated significant advances and broad applications in recent years. Spiking Neural Networks (SNNs), inspired by biological neural systems, exploit event-driven computing paradigms and temporal information processing, enabling superior energy efficiency. Despite the increasing adoption of SNNs, the potential of event-driven encoding mechanisms for audio tagging remains largely unexplored. This work presents a pioneering investigation into event-driven encoding strategies for SNN-based audio tagging. We propose the SATRN (Spiking Audio Tagging Robust Network), a novel architecture that integrates temporal–spatial attention mechanisms with membrane potential residual connections. The network employs a dual-stream structure combining global feature fusion and local feature extraction through inverted bottleneck blocks, specifically designed for efficient audio processing. Furthermore, we introduce an event-based encoding approach that enhances the resilience of Spiking Neural Networks to disturbances while maintaining performance. Our experimental results on the Urbansound8k and FSD50K datasets demonstrate that the SATRN achieves comparable performance to traditional Convolutional Neural Networks (CNNs) while requiring significantly less computation time and showing superior robustness against noise perturbations, making it particularly suitable for edge computing scenarios and real-time audio processing applications. Full article

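The event-driven encoding this entry emphasizes can be illustrated with a very simple scheme: treating normalized mel-spectrogram magnitudes as firing probabilities and sampling Bernoulli spike trains over several time steps. This is a generic rate-style encoding for SNN inputs, not the SATRN encoder itself.

```python
# Sketch: rate-style spike encoding of a mel spectrogram for an SNN front end.
# Generic illustration only; not the paper's specific event-based encoder.
import numpy as np

def spike_encode(mel_db, num_steps=8, rng=None):
    """Map a (mels, frames) dB spectrogram to a (num_steps, mels, frames) binary spike tensor."""
    rng = np.random.default_rng(rng)
    p = (mel_db - mel_db.min()) / (np.ptp(mel_db) + 1e-9)    # normalize to [0, 1] firing probability
    spikes = rng.random((num_steps,) + p.shape) < p          # Bernoulli draw per time step
    return spikes.astype(np.uint8)

# Hypothetical input: 64 mel bands x 100 frames of log-mel energies.
mel_db = np.random.uniform(-80, 0, size=(64, 100))
spikes = spike_encode(mel_db, num_steps=8, rng=0)
print(spikes.shape, spikes.mean())    # spike density tracks the normalized energy
```
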
19 pages, 1668 KiB  
Article
Acoustic-Based Industrial Diagnostics: A Scalable Noise-Robust Multiclass Framework for Anomaly Detection
by Bo Peng, Danlei Li, Kevin I-Kai Wang and Waleed H. Abdulla
Processes 2025, 13(2), 544; https://doi.org/10.3390/pr13020544 - 14 Feb 2025
Viewed by 423
Abstract
This study proposes a framework for anomaly detection in industrial machines with a focus on robust multiclass classification using acoustic data. Many state-of-the-art methods only have binary classification capabilities for each machine, and suffer from poor scalability and noise robustness. In this context, we propose the use of Smoothed Pseudo Wigner–Ville Distribution-based Mel-Frequency Cepstral Coefficients (SPWVD-MFCCs) in the framework which are specifically tailored for noisy environments. SPWVD-MFCCs, with better time–frequency resolution and perceptual audio features, improve the accuracy of detecting anomalies in a more generalized way under variable signal-to-noise ratio (SNR) conditions. This framework integrates a CNN-LSTM model that efficiently and accurately analyzes spectral and temporal information separately for anomaly detection. Meanwhile, the dimensionality reduction strategy ensures good computational efficiency without losing critical information. On the MIMII dataset involving multiple machine types and noise levels, it has shown robustness and scalability. Key findings include significant improvements in classification accuracy and F1-scores, particularly in low-SNR scenarios, showcasing its adaptability to real-world industrial environments. This study represents the first application of SPWVD-MFCCs in industrial diagnostics and provides a noise-robust and scalable method for the detection of anomalies and fault classification, which is bound to improve operational safety and efficiency within complex industrial scenarios. Full article
(This article belongs to the Special Issue Research on Intelligent Fault Diagnosis Based on Neural Network)

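The CNN-LSTM pairing described above (spectral patterns via convolutions, temporal context via recurrence) can be sketched compactly in PyTorch; layer sizes, the 13-coefficient input, and the class count are placeholders, not the paper's configuration.

```python
# Sketch: a compact CNN-LSTM classifier for time-frequency features (e.g., SPWVD-MFCCs).
# Layer sizes and class count are placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    def __init__(self, n_coeffs=13, n_classes=8):
        super().__init__()
        self.cnn = nn.Sequential(                      # spectral feature extractor
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                      # pool over the coefficient axis only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(input_size=32 * (n_coeffs // 4), hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                              # x: (batch, 1, n_coeffs, frames)
        z = self.cnn(x)                                # (batch, 32, n_coeffs // 4, frames)
        z = z.permute(0, 3, 1, 2).flatten(2)           # (batch, frames, 32 * n_coeffs // 4)
        out, _ = self.lstm(z)                          # temporal modeling over frames
        return self.head(out[:, -1])                   # classify from the last time step

logits = CnnLstmClassifier()(torch.randn(4, 1, 13, 200))
print(logits.shape)                                    # torch.Size([4, 8])
```
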
19 pages, 279 KiB  
Review
Speaker Diarization: A Review of Objectives and Methods
by Douglas O’Shaughnessy
Appl. Sci. 2025, 15(4), 2002; https://doi.org/10.3390/app15042002 - 14 Feb 2025
Viewed by 384
Abstract
Recorded audio often contains speech from multiple people in conversation. It is useful to label such signals with speaker turns, noting when each speaker is talking and identifying each speaker. This paper discusses how to process speech signals to do such speaker diarization (SD). We examine the nature of speech signals, to identify the possible acoustical features that could assist this clustering task. Traditional speech analysis techniques are reviewed, as well as measures of spectral similarity and clustering. Speech activity detection requires separating speech from background noise in general audio signals. SD may use stochastic models (hidden Markov and Gaussian mixture) and embeddings such as x-vectors. Modern neural machine learning methods are examined in detail. Suggestions are made for future improvements. Full article
(This article belongs to the Section Electrical, Electronics and Communications Engineering)
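
To make the clustering stage mentioned in this review concrete, a small sketch follows: given per-segment speaker embeddings (e.g., x-vectors, assumed to be computed elsewhere), cosine-distance agglomerative clustering assigns a speaker label to each segment. The distance threshold and embedding source are assumptions.

```python
# Sketch: the clustering stage of speaker diarization over per-segment embeddings.
# Assumes x-vector-style embeddings are already extracted for fixed-length segments.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_speakers(embeddings, distance_threshold=0.7):
    """embeddings: (num_segments, dim) array; returns a speaker label per segment."""
    clusterer = AgglomerativeClustering(
        n_clusters=None,                       # let the threshold decide the speaker count
        metric="cosine",
        linkage="average",
        distance_threshold=distance_threshold,
    )
    return clusterer.fit_predict(embeddings)

# Hypothetical two-speaker toy example: two clouds of 64-d embeddings.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.05, (5, 64)) + 1.0,
                 rng.normal(0, 0.05, (5, 64)) - 1.0])
print(cluster_speakers(emb))                   # e.g., [0 0 0 0 0 1 1 1 1 1]
```
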
18 pages, 1573 KiB  
Article
PD-Net: Parkinson’s Disease Detection Through Fusion of Two Spectral Features Using Attention-Based Hybrid Deep Neural Network
by Munira Islam, Khadija Akter, Md. Azad Hossain and M. Ali Akber Dewan
Information 2025, 16(2), 135; https://doi.org/10.3390/info16020135 - 12 Feb 2025
Viewed by 606
Abstract
Parkinson’s disease (PD) is a progressive degenerative brain disease that worsens with age, causing areas of the brain to weaken. Vocal dysfunction often emerges as one of the earliest and most prominent indicators of Parkinson’s disease, with a significant number of patients exhibiting vocal impairments during the initial stages of the illness. In view of this, to facilitate the diagnosis of Parkinson’s disease through the analysis of these vocal characteristics, this study employs a combination of the mel spectrogram and MFCCs as spectral features. The study uses raw Italian audio data to establish an efficient detection framework designed to classify the vocal data into two distinct categories: healthy individuals and patients diagnosed with Parkinson’s disease. To this end, the study proposes a hybrid model that integrates Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs) for the detection of Parkinson’s disease. CNNs are employed to extract spatial features from the spectro-temporal characteristics of the vocal data, while LSTMs capture temporal dependencies, enabling a comprehensive analysis of how vocal patterns develop over time. Additionally, the integration of a multi-head attention mechanism significantly enhances the model’s ability to concentrate on essential details, improving its overall performance. This unified method aims to strengthen the detection of subtle vocal changes associated with Parkinson’s, improving overall diagnostic accuracy. The findings show that the model achieves a noteworthy accuracy of 99.00% for the Parkinson’s disease detection task. Full article
(This article belongs to the Special Issue Feature Papers in Information in 2024–2025)

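The attention component described above can be sketched with PyTorch's built-in multi-head attention applied to an LSTM's frame-wise outputs (self-attention followed by pooling); dimensions and head count here are placeholders rather than PD-Net's actual settings.

```python
# Sketch: multi-head self-attention over LSTM outputs, as a generic CNN-LSTM add-on.
# Dimensions and head count are placeholders, not PD-Net's actual configuration.
import torch
import torch.nn as nn

class AttentiveRecurrentHead(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, heads=4, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=heads, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                        # x: (batch, frames, feat_dim), e.g., fused mel+MFCC frames
        seq, _ = self.lstm(x)                    # (batch, frames, hidden)
        ctx, _ = self.attn(seq, seq, seq)        # self-attention highlights informative frames
        return self.classifier(ctx.mean(dim=1))  # average-pool attended frames, then classify

logits = AttentiveRecurrentHead()(torch.randn(2, 300, 128))
print(logits.shape)                              # torch.Size([2, 2]) -> healthy vs. Parkinson's
```
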
21 pages, 3599 KiB  
Article
Using Deep Learning to Identify Deepfakes Created Using Generative Adversarial Networks
by Jhanvi Jheelan and Sameerchand Pudaruth
Computers 2025, 14(2), 60; https://doi.org/10.3390/computers14020060 - 10 Feb 2025
Viewed by 424
Abstract
Generative adversarial networks (GANs) have revolutionised various fields by creating highly realistic images, videos, and audio, thus enhancing applications such as video game development and data augmentation. However, this technology has also given rise to deepfakes, which pose serious challenges due to their potential to create deceptive content. Thousands of media reports have informed us of such occurrences, highlighting the urgent need for reliable detection methods. This study addresses the issue by developing a deep learning (DL) model capable of distinguishing between real and fake face images generated by StyleGAN. Using a subset of the 140K real and fake face dataset, we explored five different models: a custom CNN, ResNet50, DenseNet121, MobileNet, and InceptionV3. We leveraged the pre-trained models to utilise their robust feature extraction and computational efficiency, which are essential for distinguishing between real and fake features. Through extensive experimentation with various dataset sizes, preprocessing techniques, and split ratios, we identified the optimal ones. The 20k_gan_8_1_1 dataset produced the best results, with MobileNet achieving a test accuracy of 98.5%, followed by InceptionV3 at 98.0%, DenseNet121 at 97.3%, ResNet50 at 96.1%, and the custom CNN at 86.2%. All of these models were trained on only 16,000 images and validated and tested on 2000 images each. The custom CNN model was built with a simpler architecture of two convolutional layers and, hence, lagged in accuracy due to its limited feature extraction capabilities compared with deeper networks. This research work also included the development of a user-friendly web interface that allows deepfake detection by uploading images. The web interface backend was developed using Flask, enabling real-time deepfake detection, allowing users to upload images for analysis and demonstrating a practical use for platforms in need of quick, user-friendly verification. This application demonstrates significant potential for practical applications, such as on social media platforms, where the model can help prevent the spread of fake content by flagging suspicious images for review. This study makes important contributions by comparing different deep learning models, including a custom CNN, to understand the balance between model complexity and accuracy in deepfake detection. It also identifies the best dataset setup that improves detection while keeping computational costs low. Additionally, it introduces a user-friendly web tool that allows real-time deepfake detection, making the research useful for social media moderation, security, and content verification. Nevertheless, identifying specific features of GAN-generated deepfakes remains challenging due to their high realism. Future works will aim to expand the dataset by using all 140,000 images, refine the custom CNN model to increase its accuracy, and incorporate more advanced techniques, such as Vision Transformers and diffusion models. The outcomes of this study contribute to the ongoing efforts to counteract the negative impacts of GAN-generated images. Full article

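A hedged sketch of the transfer-learning setup this entry describes follows, with Keras' pretrained MobileNet frozen as a feature extractor and a small binary head for real-vs-fake faces; the input size, frozen-base choice, and training details are assumptions, not the study's exact pipeline.

```python
# Sketch: MobileNet transfer learning for binary real/fake face classification.
# Frozen-base setup and hyperparameters are illustrative, not the study's exact pipeline.
import tensorflow as tf

def build_deepfake_detector(input_shape=(224, 224, 3)):
    base = tf.keras.applications.MobileNet(weights="imagenet", include_top=False,
                                           input_shape=input_shape)
    base.trainable = False                                   # keep ImageNet features fixed
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1, activation="sigmoid"),      # real (0) vs. fake (1)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_deepfake_detector()
model.summary()
# model.fit(train_ds, validation_data=val_ds, epochs=10)     # train_ds/val_ds: preprocessed image datasets
```
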
24 pages, 4502 KiB  
Article
Quality Comparison of Dynamic Auditory Virtual-Reality Simulation Approaches of Approaching Vehicles Regarding Perceptual Behavior and Psychoacoustic Values
by Jonas Krautwurm, Daniel Oberfeld-Twistel, Thirsa Huisman, Maria Mareen Maravich and Ercan Altinsoy
Acoustics 2025, 7(1), 7; https://doi.org/10.3390/acoustics7010007 - 8 Feb 2025
Viewed by 634
Abstract
Traffic safety experiments are often conducted in virtual environments in order to avoid dangerous situations and to run the experiments more cost-efficiently. This means that attention must be paid to the fidelity of the traffic scenario reproduction, because the pedestrians’ judgments have to be close to reality. To better understand behavior in relation to the prevailing audio rendering systems, a listening test was conducted that focused on perceptual differences between simulation and playback methods. Six vehicle drive-by scenes were presented using two different simulation methods and three different playback methods, and binaural recordings from the test track, acquired during the recording of the vehicle sound sources for the simulation, were additionally incorporated. Each drive-by scene was characterized by a different vehicle type and speed. Participants rated six attributes of the perceptual dimensions: “timbral balance”, “naturalness”, “room-related”, “source localization”, “loudness” and “speed perception”. While the ratings of the sound attributes showed a high degree of similarity across the different reproduction systems, there were minor differences in the speed and loudness estimations, and the differing perceptions of brightness stood out. A comparison of the loudness ratings in the scenes featuring electric and combustion-engine vehicles highlights the issue of reduced detection abilities with regard to the former. Full article

38 pages, 701 KiB  
Review
Evolution of Bluetooth Technology: BLE in the IoT Ecosystem
by Grigorios Koulouras, Stylianos Katsoulis and Fotios Zantalis
Sensors 2025, 25(4), 996; https://doi.org/10.3390/s25040996 - 7 Feb 2025
Viewed by 698
Abstract
The Internet of Things (IoT) has witnessed significant growth in recent years, with Bluetooth Low Energy (BLE) emerging as a key enabler of low-power, low-cost wireless connectivity. This review article provides an overview of the evolution of Bluetooth technology, focusing on the role of BLE in the IoT ecosystem. It examines the current state of BLE, including its applications, challenges, limitations, and recent advancements in areas such as security, power management, and mesh networking. The recent release of Bluetooth Low Energy version 6.0 by the Bluetooth Special Interest Group (SIG) highlights the technology’s ongoing evolution and growing importance within the IoT. However, this rapid development highlights a gap in the current literature, a lack of comprehensive, up-to-date reviews that fully capture the contemporary landscape of BLE in IoT applications. This paper analyzes the emerging trends and future directions for BLE, including the integration of artificial intelligence, machine learning, and audio capabilities. The analysis also considers the alignment of BLE features with the United Nations’ Sustainable Development Goals (SDGs), particularly energy efficiency, sustainable cities, and climate action. By examining the development and deployment of BLE technology, this article aims to provide insights into the opportunities and challenges associated with its adoption in various IoT applications, from smart homes and cities to industrial automation and healthcare. This review highlights the significance of the evolution of BLE in shaping the future of wireless communication and IoT, and provides a foundation for further research and innovation in this field. Full article
(This article belongs to the Special Issue Advances in Intelligent Sensors and IoT Solutions (2nd Edition))

20 pages, 917 KiB  
Article
Developing a Dataset of Audio Features to Classify Emotions in Speech
by Alvaro A. Colunga-Rodriguez, Alicia Martínez-Rebollar, Hugo Estrada-Esquivel, Eddie Clemente and Odette A. Pliego-Martínez
Computation 2025, 13(2), 39; https://doi.org/10.3390/computation13020039 - 5 Feb 2025
Viewed by 935
Abstract
Emotion recognition in speech has gained increasing relevance in recent years, enabling more personalized interactions between users and automated systems. This paper presents the development of a dataset of features obtained from RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) to classify emotions in speech. The paper highlights audio processing techniques such as silence removal and framing to extract features from the recordings. The features are extracted from the audio signals using spectral techniques, time-domain analysis, and the discrete wavelet transform. The resulting dataset is used to train a neural network and a support vector machine (SVM) classifier, with cross-validation employed during model training. The developed models were optimized using a software package that performs hyperparameter tuning to improve results. Finally, the emotion classification outcomes were compared. The results showed an emotion classification accuracy of 0.654 for the perceptron neural network and 0.724 for the support vector machine, demonstrating satisfactory performance in emotion classification. Full article
(This article belongs to the Section Computational Engineering)

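To make the pipeline above concrete, the sketch below extracts a few spectral, time-domain, and wavelet features per clip with librosa and PyWavelets and cross-validates an SVM with scikit-learn; the specific feature set, wavelet, and kernel are illustrative guesses, not the paper's exact dataset definition.

```python
# Sketch: per-clip feature vector (spectral, time-domain, DWT) and SVM cross-validation.
# Feature choices, wavelet, and kernel are illustrative, not the paper's exact setup.
import numpy as np
import librosa
import pywt
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def clip_features(path):
    y, sr = librosa.load(path, sr=22050)
    y, _ = librosa.effects.trim(y)                              # rough silence removal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # spectral
    zcr = librosa.feature.zero_crossing_rate(y)                 # time-domain
    rms = librosa.feature.rms(y=y)
    coeffs = pywt.wavedec(y, "db4", level=4)                    # discrete wavelet transform
    dwt_energy = [np.log(np.sum(c ** 2) + 1e-12) for c in coeffs]
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [zcr.mean(), rms.mean()], dwt_energy])

def evaluate(paths, labels):
    X = np.vstack([clip_features(p) for p in paths])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10))
    return cross_val_score(clf, X, labels, cv=5).mean()

# acc = evaluate(ravdess_paths, emotion_labels)   # lists of file paths and emotion labels
```
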
14 pages, 2438 KiB  
Article
Contactless Fatigue Level Diagnosis System Through Multimodal Sensor Data
by Younggun Lee, Yongkyun Lee, Sungho Kim, Sitae Kim and Seunghoon Yoo
Bioengineering 2025, 12(2), 116; https://doi.org/10.3390/bioengineering12020116 - 26 Jan 2025
Viewed by 620
Abstract
Fatigue management is critical for high-risk professions such as pilots, firefighters, and healthcare workers, where physical and mental exhaustion can lead to catastrophic accidents and loss of life. Traditional fatigue assessment methods, including surveys and physiological measurements, are limited in real-time monitoring and user convenience. To address these issues, this study introduces a novel contactless fatigue level diagnosis system leveraging multimodal sensor data, including video, thermal imaging, and audio. The system integrates non-contact biometric data collection with an AI-driven classification model capable of diagnosing fatigue levels on a 1 to 5 scale with an average accuracy of 89%. Key features include real-time feedback, adaptive retraining for personalized accuracy improvement, and compatibility with high-stress environments. Experimental results demonstrate that retraining with user feedback enhances classification accuracy by 11 percentage points. The system’s hardware is validated for robustness under diverse operational conditions, including temperature and electromagnetic compliance. This innovation provides a practical solution for improving operational safety and performance in critical sectors by enabling precise, non-invasive, and efficient fatigue monitoring. Full article
(This article belongs to the Special Issue Computer-Aided Diagnosis for Biomedical Engineering)
