Electrical Engineering and Systems Science
- [1] arXiv:2407.21030 [pdf, html, other]
- Title: Cluster and Separate: a GNN Approach to Voice and Staff Prediction for Score Engraving
  Comments: Accepted at the 25th International Society for Music Information Retrieval (ISMIR) 2024
  Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves. This is a fundamental part of the larger task of music score engraving (or score typesetting), which aims to produce readable musical scores for human performers. We focus on piano music and support homophonic voices, i.e., voices that can contain chords, and cross-staff voices; both are notably difficult tasks that have often been overlooked in previous research. We propose an end-to-end system based on graph neural networks that clusters notes belonging to the same chord and connects them with edges if they are part of a voice. Our results show clear and consistent improvements over a previous approach on two datasets of different styles. To aid the qualitative analysis of our results, we support export to symbolic music formats and provide a direct visualization of our output graphs over the musical score. All code and pre-trained models are available at this https URL
- [2] arXiv:2407.21122 [pdf, html, other]
- Title: Shadow Area and Degrees-of-Freedom for Free-Space Communication
  Subjects: Signal Processing (eess.SP); Applied Physics (physics.app-ph); Classical Physics (physics.class-ph)
The number of degrees-of-freedom (NDoF) in a communication system is limited by the number of antenna ports, element shapes, positions, and the propagation environment. As the number of antenna elements increases within a given region, the NDoF eventually saturates due to correlation of the radiated fields. The maximal NDoF can be determined numerically for communication between two regions using singular value decomposition of a channel model representing wave propagation between densely sampled sources at the transmitter and fields at the receiver. This paper provides a straightforward analytical estimate of the NDoF for arbitrarily shaped transmitter and receiver regions. The analysis shows that the NDoF for electrically large regions is approximated by the mutual shadow area of the regions, measured in wavelengths. Several setups illustrate the results, which are then compared with numerical evaluations of the singular values of the propagation channel. These new analytical expressions also simplify to previously established results based on Weyl's law and the paraxial approximation.
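The headline estimate above (NDoF approximated by the mutual shadow area measured in wavelengths) lends itself to a one-line calculator. The normalization below (shadow area divided by the squared wavelength) is an assumption for illustration; the paper's exact prefactor may differ.

```python
# Toy NDoF estimate from a mutual shadow area, per the abstract's
# headline result.  The area/wavelength^2 normalization is an
# assumed illustrative convention, not the paper's exact formula.
def ndof_estimate(shadow_area_m2: float, wavelength_m: float) -> float:
    """Approximate number of degrees-of-freedom for a link whose
    mutual shadow area is `shadow_area_m2`, at wavelength `wavelength_m`."""
    return shadow_area_m2 / wavelength_m**2

# Example: 2 m^2 mutual shadow area at 600 MHz (lambda = 0.5 m).
print(ndof_estimate(2.0, 0.5))  # -> 8.0 under this normalization
```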
- [3] arXiv:2407.21133 [pdf, html, other]
- Title: Data-driven Modeling for Grid Edge IBRs: A Digital Twin Perspective of User-Defined Models
  Comments: accepted for presentation at The 2024 Annual Conference of the IEEE Industrial Electronics Society (IECON)
  Subjects: Systems and Control (eess.SY)
Recent Odessa disturbance events have brought attention to the challenges associated with the interaction between Inverter-Based Resources (IBRs) and the transmission and distribution system. The NERC event diagnosis report has highlighted several issues, emphasizing the need for continuous performance monitoring of these IBRs by system operators. A key area of concern is the mismatch between the control and protection performance of IBRs in the original equipment manufacturer (OEM)-provided models and in field measurements. The inability to replicate the realistic response can result in incorrect reliability and resilience studies. In this paper, we develop an approach to emulate the behavior of an IBR from measurement data, which system operators can utilize for real-time and long-term planning. Two experiments are conducted in the phasor domain and the electromagnetic transients (EMT) domain to emulate the behavior of grid-forming and grid-following inverters under various operating conditions, and the effectiveness of the proposed model is demonstrated in terms of accuracy and ease of utilizing user-defined models (UDMs).
- [4] arXiv:2407.21144 [pdf, other]
- Title: Multi-Task Learning for Few-Shot Online Adaptation under Signal Temporal Logic Specifications
  Subjects: Systems and Control (eess.SY)
Multi-task learning (MTL) seeks to improve generalization on specific tasks by exploiting useful information from related tasks. Within this promising area, this paper studies an MTL-based control approach considering Signal Temporal Logic (STL). Task compliance is measured via the Robustness Degree (RD), which is computed using the STL quantitative semantics. A suitable methodology is provided to solve the learning and testing stages, with an appropriate treatment of the non-convex terms in the quadratic objective function and using Sequential Convex Programming based on trust-region updates. In the learning stage, an ensemble of tasks is generated from deterministic goals to obtain a strong initializer for the testing stage, where related tasks are solved under larger perturbations. The methodology proves robust on two dynamical systems, producing results that meet the task specifications within a few shots in the testing stage, even for highly perturbed tasks.
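The Robustness Degree mentioned above follows the standard STL quantitative semantics: a positive value means the specification is satisfied with margin, a negative value means it is violated. A minimal sketch for two basic temporal operators over a sampled trajectory (textbook STL, not the paper's full machinery):

```python
# STL quantitative semantics (Robustness Degree) for two atomic
# temporal specifications over a discrete-time trajectory.
def rd_always_gt(signal, threshold):
    """RD of G (x > threshold): worst-case margin over the horizon."""
    return min(x - threshold for x in signal)

def rd_eventually_gt(signal, threshold):
    """RD of F (x > threshold): best-case margin over the horizon."""
    return max(x - threshold for x in signal)

traj = [3, 1, 5, 2]
print(rd_always_gt(traj, 2))      # -> -1 (violated: dips below at t=1)
print(rd_eventually_gt(traj, 2))  # -> 3 (satisfied with margin at t=2)
```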
- [5] arXiv:2407.21149 [pdf, html, other]
- Title: Domain Shift Analysis in Chest Radiographs Classification in a Veterans Healthcare Administration Population
  Authors: Mayanka Chandrashekar, Ian Goethert, Md Inzamam Ul Haque, Benjamin McMahon, Sayera Dhaubhadel, Kathryn Knight, Joseph Erdos, Donna Reagan, Caroline Taylor, Peter Kuzmak, John Michael Gaziano, Eileen McAllister, Lauren Costa, Yuk-Lam Ho, Kelly Cho, Suzanne Tamang, Samah Fodeh-Jarad, Olga S. Ovchinnikova, Amy C. Justice, Jacob Hinkle, Ioana Danciu
  Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Objectives: This study aims to assess the impact of domain shift on chest X-ray classification accuracy and to analyze the influence of ground truth label quality and demographic factors such as age group, sex, and study year. Materials and Methods: We used a DenseNet121 model pretrained on the MIMIC-CXR dataset for deep learning-based multilabel classification, using ground truth labels extracted from radiology reports with the CheXpert and CheXbert labelers. We compared the performance on the 14 chest X-ray labels between the MIMIC-CXR and Veterans Healthcare Administration chest X-ray (VA-CXR) datasets. The VA-CXR dataset comprises over 259k chest X-ray images spanning the years 2010 to 2022. Results: The validation of ground truth and the assessment of multi-label classification performance across various NLP extraction tools revealed that the VA-CXR dataset exhibited lower disagreement rates than the MIMIC-CXR dataset. Additionally, there were notable differences in AUC scores between models utilizing CheXpert and CheXbert. When evaluating multi-label classification performance across different datasets, minimal domain shift was observed in unseen datasets, except for the label "Enlarged Cardiomediastinum." The study-year subgroup analyses exhibited the most significant variations in multi-label classification model performance. These findings underscore the importance of considering domain shifts in chest X-ray classification tasks, particularly concerning study years. Conclusion: Our study reveals the significant impact of domain shift and demographic factors on chest X-ray classification, emphasizing the need for improved transfer learning and equitable model development. Addressing these challenges is crucial for advancing medical imaging and enhancing patient care.
- [6] arXiv:2407.21157 [pdf, html, other]
- Title: Movable Frequency Diverse Array for Wireless Communication Security
  Comments: arXiv admin note: substantial text overlap with arXiv:2407.20280
  Subjects: Signal Processing (eess.SP)
Frequency diverse array (FDA) is a promising antenna technology to achieve physical layer security by varying the frequency of each antenna at the transmitter. However, when the channels of the legitimate user and eavesdropper are highly correlated, FDA is limited by the frequency constraint and cannot provide satisfactory security performance. In this paper, we propose a novel movable FDA (MFDA) antenna technology where the positions of antennas can be dynamically adjusted in a given finite region. Specifically, we aim to maximize the secrecy capacity by jointly optimizing the antenna beamforming vector, antenna frequency vector and antenna position vector. To solve this non-convex optimization problem with coupled variables, we develop a two-stage alternating optimization (AO) algorithm based on block successive upper-bound minimization (BSUM) method. Moreover, to evaluate the security performance provided by MFDA, we introduce two benchmark schemes, i.e., phased array (PA) and FDA. Simulation results demonstrate that MFDA can significantly enhance security performance compared to PA and FDA. In particular, when the frequency constraint is strict, MFDA can further increase the secrecy capacity by adjusting the positions of antennas instead of the frequencies.
- [7] arXiv:2407.21211 [pdf, html, other]
- Title: Self-Supervised Models in Automatic Whispered Speech Recognition
  Comments: 6 pages, 2 figures. Submitted to a conference
  Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
In automatic speech recognition, any factor that alters the acoustic properties of speech can pose a challenge to the system's performance. This paper presents a novel approach for automatic whispered speech recognition in the Irish dialect using the self-supervised WavLM model. Conventional automatic speech recognition systems often fail to accurately recognise whispered speech due to its distinct acoustic properties and the scarcity of relevant training data. To address this challenge, we utilized a pre-trained WavLM model, fine-tuned with a combination of whispered and normal speech data from the wTIMIT and CHAINS datasets, which include the English language in Singaporean and Irish dialects, respectively. Our baseline evaluation with the OpenAI Whisper model highlighted its limitations, achieving a Word Error Rate (WER) of 18.8% on whispered speech. In contrast, the proposed WavLM-based system significantly improved performance, achieving a WER of 9.22%. These results demonstrate the efficacy of our approach in recognising whispered speech and underscore the importance of tailored acoustic modeling for robust automatic speech recognition systems. This study provides valuable insights into developing effective automatic speech recognition solutions for challenging speech affected by whispering and dialect. The source code for this paper is freely available.
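The WER figures above (18.8% vs. 9.22%) follow the standard definition: word-level edit distance between reference and hypothesis, divided by reference length. A minimal reference implementation:

```python
# Word Error Rate via Levenshtein (edit) distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One insertion against a 3-word reference -> WER = 1/3.
print(wer("the cat sat", "the cat sat down"))
```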
- [8] arXiv:2407.21216 [pdf, html, other]
- Title: Distribution-Aware Replay for Continual MRI Segmentation
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Medical image distributions shift constantly due to changes in patient population and discrepancies in image acquisition. These distribution changes result in performance deterioration, which continual learning aims to alleviate. However, only adaptation with data rehearsal strategies yields practically desirable performance for medical image segmentation. Such rehearsal violates patient privacy and, like most continual learning approaches, overlooks unexpected changes from out-of-distribution instances. To transcend both of these challenges, we introduce a distribution-aware replay strategy that mitigates forgetting through auto-encoding of features, while simultaneously leveraging the learned distribution of features to detect model failure. We provide empirical corroboration on hippocampus and prostate MRI segmentation.
- [9] arXiv:2407.21219 [pdf, html, other]
- Title: Hidden Cyber-Physical Contingency Identification, Classification and Evaluation in Modern Power Systems
  Comments: Under review in IEEE Transactions on Power Systems
  Subjects: Systems and Control (eess.SY)
This paper introduces an advanced stochastic hybrid system modeling framework for modern power systems (MPS) to identify, classify, and evaluate hidden contingencies, which cannot be detected by normal observation sensors. The stochastic hybrid system (SHS) model is designed to capture the dynamics of the internal states of individual nodes, considering their structural properties, and coupling variables under various local and network-level contingencies. Hidden contingencies are identified using a probing approach that measures changes in the eigenvalues of the SHS model and detects deviations from normal operation. Next, contingencies are categorized into three distinct groups according to their impact on MPS: physical contingencies, control network contingencies, and sensing and measurement network contingencies. This classification enables a proactive evaluation of contingencies. The practicality and efficacy of the proposed methodology are validated through simulation experiments on the electrical network of two real-world systems. These simulations underscore the approach's capacity to enhance the resilience of power systems against a spectrum of hidden contingencies.
- [10] arXiv:2407.21259 [pdf, html, other]
- Title: On the Impact of High-Order Harmonic Generation in Electrical Distribution Systems
  Comments: accepted for presentation at the 2024 IEEE Energy Conversion Conference and Expo (ECCE)
  Subjects: Systems and Control (eess.SY)
The modern power grid has seen a rise in the integration of non-linear loads, presenting a significant concern for operators. These loads introduce unwanted harmonics, leading to potential issues such as overheating and improper functioning of circuit breakers. In pursuing a more sustainable grid, the adoption of electric vehicles (EVs) and photovoltaic (PV) systems in residential networks has increased. Examining the effects of high-order harmonic frequencies beyond $1.5$ kHz is crucial to understanding their impact on the operation and planning of electrical distribution systems under varying nonlinear loading conditions. This study investigates a diverse set of critical power electronic loads within a household modeled using PSCAD/EMTdc, analyzing their unique harmonic spectra. This information is utilized to run the time-series harmonic analysis program in OpenDSS on a modified IEEE 34 bus test system model. The impact of high-order harmonics is quantified using metrics that evaluate total harmonic distortion (THD), the transformer harmonic-driven eddy current loss component, and the propagation of harmonics from the source to the substation transformer.
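The THD metric named above has a standard definition: the RMS of the harmonic components relative to the fundamental. A minimal sketch (the voltage magnitudes below are illustrative, not from the paper):

```python
# Total harmonic distortion from a harmonic magnitude spectrum,
# with the fundamental listed first.  Example values are illustrative.
import math

def thd(magnitudes):
    """THD = sqrt(sum of squared harmonic magnitudes) / fundamental."""
    fundamental, *harmonics = magnitudes
    return math.sqrt(sum(h * h for h in harmonics)) / fundamental

# 230 V fundamental with 3rd/5th/7th harmonics of 10, 6, and 3 V.
print(f"THD = {thd([230.0, 10.0, 6.0, 3.0]):.2%}")  # ~5.24%
```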
- [11] arXiv:2407.21263 [pdf, html, other]
- Title: Outlier Detection in Large Radiological Datasets using UMAP
  Comments: Accepted in MICCAI-2024 Workshop on Topology- and Graph-Informed Imaging Informatics (TGI3)
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The success of machine learning algorithms heavily relies on the quality of samples and the accuracy of their corresponding labels. However, building and maintaining large, high-quality datasets is an enormous task. This is especially true for biomedical data and for meta-sets that are compiled from smaller ones, as variations in image quality, labeling, reports, and archiving can lead to errors, inconsistencies, and repeated samples. Here, we show that the uniform manifold approximation and projection (UMAP) algorithm can find these anomalies essentially by forming independent clusters that are distinct from the main (good) data but similar to other points with the same error type. As a representative example, we apply UMAP to discover outliers in the publicly available ChestX-ray14, CheXpert, and MURA datasets. While the results are archival and retrospective and focus on radiological images, the graph-based methods work for any data type and will prove equally beneficial for curation at the time of dataset creation.
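The core observation above is that erroneous samples land in small clusters far from the main data after embedding. A minimal stand-in for the flagging step, assuming `points` is already a 2-D embedding (in practice, UMAP's output): mark points whose k-th nearest neighbour is unusually far away.

```python
# Flag embedded points whose k-th nearest neighbour distance greatly
# exceeds the dataset median -- a simple proxy for "sits in a small,
# separate cluster".  The embedding step itself (UMAP) is not shown.
def knn_outliers(points, k=2, factor=3.0):
    def kth_dist(p):
        d = sorted(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                   for q in points if q is not p)
        return d[k - 1]
    dists = [kth_dist(p) for p in points]
    median = sorted(dists)[len(dists) // 2]
    return [i for i, d in enumerate(dists) if d > factor * median]

# Dense main cluster near the origin plus one far-away anomaly.
pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (10, 10)]
print(knn_outliers(pts))  # -> [4]
```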
- [12] arXiv:2407.21280 [pdf, html, other]
- Title: Wireless-Powered Mobile Crowdsensing Enhanced by UAV-Mounted RIS: Joint Transmission, Compression, and Trajectory Design
  Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
  Subjects: Signal Processing (eess.SP)
Mobile crowdsensing (MCS) enables data collection from massive devices to achieve a wide sensing range. Wireless power transfer (WPT) is a promising paradigm for prolonging the operation time of MCS systems by sustainably transferring power to distributed devices. However, the efficiency of WPT significantly deteriorates when the channel conditions are poor. Unmanned aerial vehicles (UAVs) and reconfigurable intelligent surfaces (RISs) can serve as active or passive relays to enhance the efficiency of WPT in unfavourable propagation environments. Therefore, to explore the potential of jointly deploying UAVs and RISs to enhance transmission efficiency, we propose a novel transmission framework for WPT-assisted MCS systems, enhanced by a UAV-mounted RIS. Subsequently, under different compression schemes, two optimization problems are formulated to maximize the weighted sum of the data uploaded by the user equipments (UEs) by jointly designing the WPT and uploading time, the beamforming matrices, the CPU cycles, and the UAV trajectory. A block coordinate descent (BCD) algorithm based on the closed-form beamforming designs and the successive convex approximation (SCA) algorithm is proposed to solve the formulated problems. Furthermore, to provide insight into the gains brought by the compression schemes, we analyze the energy efficiencies of the compression schemes and confirm that the gains gradually reduce with increasing power used for compression. Simulation results demonstrate that the amount of collected data can be effectively increased in wireless-powered MCS systems.
- [13] arXiv:2407.21323 [pdf, other]
- Title: STANet: A Novel Spatio-Temporal Aggregation Network for Depression Classification with Small and Unbalanced FMRI Data
  Authors: Wei Zhang, Weiming Zeng, Hongyu Chen, Jie Liu, Hongjie Yan, Kaile Zhang, Ran Tao, Wai Ting Siok, Nizhuan Wang
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Accurate diagnosis of depression is crucial for timely implementation of optimal treatments, preventing complications and reducing the risk of suicide. Traditional methods rely on self-report questionnaires and clinical assessment, lacking objective biomarkers. Combining fMRI with artificial intelligence can enhance depression diagnosis by integrating neuroimaging indicators. However, the specificity of fMRI acquisition for depression often results in unbalanced and small datasets, challenging the sensitivity and accuracy of classification models. In this study, we propose the Spatio-Temporal Aggregation Network (STANet) for diagnosing depression by integrating CNN and RNN to capture both temporal and spatial features of brain activity. STANet comprises the following steps: (1) Aggregate spatio-temporal information via ICA. (2) Utilize multi-scale deep convolution to capture detailed features. (3) Balance data using SMOTE to generate new samples for minority classes. (4) Employ the AFGRU classifier, which combines Fourier transformation with GRU, to capture long-term dependencies, with an adaptive weight assignment mechanism to enhance model generalization. The experimental results demonstrate that STANet achieves superior depression diagnostic performance with 82.38% accuracy and a 90.72% AUC. The STFA module enhances classification by capturing deeper features at multiple scales. The AFGRU classifier, with adaptive weights and stacked GRU, attains higher accuracy and AUC. SMOTE outperforms other oversampling methods. Additionally, spatio-temporal aggregated features achieve better performance compared to using only temporal or spatial features. STANet outperforms traditional or deep learning classifiers, and functional connectivity-based classifiers, as demonstrated by ten-fold cross-validation.
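Step (3) balances classes with SMOTE, whose core idea is simple: synthesize a new minority sample on the line segment between a minority point and one of its minority-class neighbours. A minimal sketch of that interpolation step (not the full SMOTE algorithm, which samples among k nearest neighbours):

```python
# Core SMOTE interpolation: new sample between a minority point and
# its nearest minority-class neighbour.  Toy 2-D data for illustration.
import random

def smote_sample(minority, rng=random):
    a = rng.choice(minority)
    # nearest minority neighbour of `a` (excluding itself)
    b = min((p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))
    t = rng.random()  # interpolation factor in [0, 1)
    return tuple(x + t * (y - x) for x, y in zip(a, b))

minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2)]
new_point = smote_sample(minority)
print(new_point)  # lies between two existing minority samples
```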
- [14] arXiv:2407.21328 [pdf, html, other]
- Title: Knowledge-Guided Prompt Learning for Lifespan Brain MR Image Segmentation
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Automatic and accurate segmentation of brain MR images throughout the human lifespan into tissue and structure is crucial for understanding brain development and diagnosing diseases. However, challenges arise from the intricate variations in brain appearance due to rapid early brain development, aging, and disorders, compounded by the limited availability of manually-labeled datasets. In response, we present a two-step segmentation framework employing Knowledge-Guided Prompt Learning (KGPL) for brain MRI. Specifically, we first pre-train segmentation models on large-scale datasets with sub-optimal labels, followed by the incorporation of knowledge-driven embeddings learned from image-text alignment into the models. The introduction of knowledge-wise prompts captures semantic relationships between anatomical variability and biological processes, enabling models to learn structural feature embeddings across diverse age groups. Experimental findings demonstrate the superiority and robustness of our proposed method, particularly noticeable when employing Swin UNETR as the backbone. Our approach achieves average DSC values of 95.17% and 94.19% for brain tissue and structure segmentation, respectively. Our code is available at this https URL.
- [15] arXiv:2407.21343 [pdf, html, other]
- Title: MIST: A Simple and Scalable End-To-End 3D Medical Imaging Segmentation Framework
  Authors: Adrian Celaya, Evan Lim, Rachel Glenn, Brayden Mi, Alex Balsells, Tucker Netherton, Caroline Chung, Beatrice Riviere, David Fuentes
  Comments: Submitted to BraTS 2024
  Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Medical imaging segmentation is a highly active area of research, with deep learning-based methods achieving state-of-the-art results in several benchmarks. However, the lack of standardized tools for training, testing, and evaluating new methods makes the comparison of methods difficult. To address this, we introduce the Medical Imaging Segmentation Toolkit (MIST), a simple, modular, and end-to-end medical imaging segmentation framework designed to facilitate consistent training, testing, and evaluation of deep learning-based medical imaging segmentation methods. MIST standardizes data analysis, preprocessing, and evaluation pipelines, accommodating multiple architectures and loss functions. This standardization ensures reproducible and fair comparisons across different methods. We detail MIST's data format requirements, pipelines, and auxiliary features and demonstrate its efficacy using the BraTS Adult Glioma Post-Treatment Challenge dataset. Our results highlight MIST's ability to produce accurate segmentation masks and its scalability across multiple GPUs, showcasing its potential as a powerful tool for future medical imaging research and development.
- [16] arXiv:2407.21345 [pdf, html, other]
- Title: Towards EMG-to-Speech with a Necklace Form Factor
  Authors: Peter Wu, Ryan Kaveh, Raghav Nautiyal, Christine Zhang, Albert Guo, Anvitha Kachinthaya, Tavish Mishra, Bohan Yu, Alan W Black, Rikky Muller, Gopala Krishna Anumanchipalli
  Subjects: Audio and Speech Processing (eess.AS)
Electrodes for decoding speech from electromyography (EMG) are typically placed on the face, requiring adhesives that are inconvenient and skin-irritating if used regularly. We explore a different device form factor, where dry electrodes are placed around the neck instead. 11-word, multi-speaker voiced EMG classifiers trained on data recorded with this device achieve 92.7% accuracy. Ablation studies reveal the importance of having more than two electrodes on the neck, and phonological analyses reveal similar classification confusions between neck-only and neck-and-face form factors. Finally, speech-EMG correlation experiments demonstrate a linear relationship between many EMG spectrogram frequency bins and self-supervised speech representation dimensions.
- [17] arXiv:2407.21381 [pdf, html, other]
- Title: Identity-Consistent Diffusion Network for Grading Knee Osteoarthritis Progression in Radiographic Imaging
  Comments: Accepted by ECCV 2024
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Knee osteoarthritis (KOA), a common form of arthritis that causes physical disability, has become increasingly prevalent in society. Employing computer-aided techniques to automatically assess the severity and progression of KOA can greatly benefit KOA treatment and disease management. Particularly, the advancement of X-ray technology in KOA demonstrates its potential for this purpose. Yet, existing X-ray prognosis research generally yields a singular progression severity grade, overlooking the potential visual changes for understanding and explaining the progression outcome. Therefore, in this study, a novel generative model is proposed, namely Identity-Consistent Radiographic Diffusion Network (IC-RDN), for multifaceted KOA prognosis encompassing a predicted future knee X-ray scan conditioned on the baseline scan. Specifically, an identity prior module for the diffusion and a downstream generation-guided progression prediction module are introduced. Compared to conventional image-to-image generative models, identity priors regularize and guide the diffusion to focus more on the clinical nuances of the prognosis based on a contrastive learning strategy. The progression prediction module utilizes both forecasted and baseline knee scans, and a more comprehensive formulation of KOA severity progression grading is expected. Extensive experiments on a widely used public dataset, OAI, demonstrate the effectiveness of the proposed method.
- [18] arXiv:2407.21394 [pdf, html, other]
- Title: Force Sensing Guided Artery-Vein Segmentation via Sequential Ultrasound Images
  Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Accurate identification of arteries and veins in ultrasound images is crucial for vascular examinations and interventions in robotics-assisted surgeries. However, current methods for ultrasound vessel segmentation face challenges in distinguishing between arteries and veins due to their morphological similarities. To address this challenge, this study introduces a novel force sensing guided segmentation approach to enhance artery-vein segmentation accuracy by leveraging their distinct deformability. Our proposed method utilizes force magnitude to identify key frames with the most significant vascular deformation in a sequence of ultrasound images. These key frames are then integrated with the current frame through attention mechanisms, with weights assigned in accordance with force magnitude. Our proposed force sensing guided framework can be seamlessly integrated into various segmentation networks and achieves significant performance improvements in multiple U-shaped networks such as U-Net, Swin-unet and Transunet. Furthermore, we contribute the first multimodal ultrasound artery-vein segmentation dataset, Mus-V, which encompasses both force and image data simultaneously. The dataset comprises 3114 ultrasound images of carotid and femoral vessels extracted from 105 videos, with corresponding force data recorded by the force sensor mounted on the US probe. Our code and dataset will be publicly available.
- [19] arXiv:2407.21395 [pdf, html, other]
- Title: HINER: Neural Representation for Hyperspectral Image
  Comments: ACM MM24
  Subjects: Image and Video Processing (eess.IV)
This paper introduces HINER, a novel neural representation for compressing HSI and ensuring high-quality downstream tasks on compressed HSI. HINER fully exploits inter-spectral correlations by explicitly encoding spectral wavelengths and achieves a compact representation of the input HSI sample through joint optimization with a learnable decoder. By additionally incorporating the Content Angle Mapper with the L1 loss, we can supervise the global and local information within each spectral band, thereby enhancing the overall reconstruction quality. For downstream classification on compressed HSI, we theoretically demonstrate that the task accuracy is not only related to the classification loss but also to the reconstruction fidelity, through a first-order expansion of the accuracy degradation, and accordingly adapt the reconstruction by introducing Adaptive Spectral Weighting. Owing to the monotonic mapping of HINER between wavelengths and spectral bands, we propose Implicit Spectral Interpolation for data augmentation by adding random variables to input wavelengths during classification model training. Experimental results on various HSI datasets demonstrate the superior compression performance of our HINER compared to existing learned methods and traditional codecs. Our model is lightweight and computationally efficient, and it maintains high accuracy for downstream classification tasks even on decoded HSIs at high compression ratios. Our materials will be released at this https URL.
- [20] arXiv:2407.21400 [pdf, html, other]
- Title: Low-Coherence Sequence Design Under PAPR Constraints
  Subjects: Signal Processing (eess.SP)
Low-coherence sequences with low peak-to-average power ratio (PAPR) are crucial for multi-carrier wireless communication systems and are used for pilots, spreading sequences, and so on. This letter proposes an efficient low-coherence sequence design algorithm (LOCEDA) that can generate any number of sequences of any length that satisfy user-defined PAPR constraints while supporting flexible subcarrier assignments in orthogonal frequency-division multiple access (OFDMA) systems. We first visualize the low-coherence sequence design problem under PAPR constraints as resolving collisions between hyperspheres. By iteratively adjusting the radii and positions of these hyperspheres, we effectively generate low-coherence sequences that strictly satisfy the imposed PAPR constraints. Simulation results (i) confirm that LOCEDA outperforms existing methods, (ii) demonstrate its flexibility, and (iii) highlight its potential for various application scenarios.
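The PAPR constraint central to the abstract has a standard definition: the ratio of peak instantaneous power to average power of the sequence, usually reported in dB. A minimal sketch (the sequences below are arbitrary, not the paper's designs):

```python
# Peak-to-average power ratio (PAPR) of a discrete-time sequence, in dB.
import math

def papr_db(samples):
    powers = [abs(s) ** 2 for s in samples]
    return 10 * math.log10(max(powers) / (sum(powers) / len(powers)))

# A constant-envelope sequence has 0 dB PAPR ...
print(papr_db([1, -1, 1j, -1j]))  # -> 0.0
# ... while a peaky sequence does not.
print(round(papr_db([2, 0.5, 0.5, 0.5]), 2))
```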
- [21] arXiv:2407.21414 [pdf, html, other]
- Title: Towards interfacing large language models with ASR systems using confidence measures and prompting
  Comments: 5 pages, 3 figures, 5 tables. Accepted to Interspeech 2024
  Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
As large language models (LLMs) grow in parameter size and capabilities, such as interaction through prompting, they open up new ways of interfacing with automatic speech recognition (ASR) systems beyond rescoring n-best lists. This work investigates post-hoc correction of ASR transcripts with LLMs. To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods. Our results indicate that this can improve the performance of less competitive ASR systems.
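The confidence-based filtering idea can be sketched as a simple gate: only hand low-confidence ASR segments to the LLM for correction, leaving confident transcripts untouched. The `llm_correct` callable and the 0.8 threshold below are illustrative placeholders, not the paper's method details.

```python
# Gate ASR segments by confidence: correct only the uncertain ones.
def correct_transcript(segments, llm_correct, threshold=0.8):
    """segments: list of (text, confidence) pairs from the ASR system."""
    out = []
    for text, conf in segments:
        out.append(text if conf >= threshold else llm_correct(text))
    return " ".join(out)

# Toy stand-in "LLM" that fixes one known acoustic confusion.
fake_llm = lambda s: s.replace("wreck a nice beach", "recognize speech")

hyp = [("it is easy to", 0.95), ("wreck a nice beach", 0.41)]
print(correct_transcript(hyp, fake_llm))
# -> "it is easy to recognize speech"
```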
- [22] arXiv:2407.21433 [pdf, html, other]
- Title: i-CardiAx: Wearable IoT-Driven System for Early Sepsis Detection Through Long-Term Vital Sign Monitoring
  Subjects: Signal Processing (eess.SP)
Sepsis is a significant cause of early mortality, high healthcare costs, and disability-adjusted life years. Digital interventions like continuous cardiac monitoring can help detect early warning signs and facilitate effective interventions. This paper introduces i-CardiAx, a wearable sensor utilizing low-power high-sensitivity accelerometers to measure vital signs crucial for cardiovascular health: heart rate (HR), blood pressure (BP), and respiratory rate (RR). Data collected from 10 healthy subjects using the i-CardiAx chest patch were used to develop and evaluate lightweight vital sign measurement algorithms. The algorithms demonstrated high performance: RR (-0.11 $\pm$ 0.77 breaths/min), HR (0.82 $\pm$ 2.85 beats/min), and systolic BP (-0.08 $\pm$ 6.245 mmHg). These algorithms are embedded in an ARM Cortex-M33 processor with Bluetooth Low Energy (BLE) support, achieving inference times of 4.2 ms for HR and RR, and 8.5 ms for BP. Additionally, a multi-channel quantized Temporal Convolutional Neural (TCN) Network, trained on the open-source HiRID dataset, was developed to detect sepsis onset using digitally acquired vital signs from i-CardiAx. The quantized TCN, deployed on i-CardiAx, predicted sepsis with a median time of 8.2 hours and an energy per inference of 1.29 mJ. The i-CardiAx wearable boasts a sleep power of 0.152 mW and an average power consumption of 0.77 mW, enabling a 100 mAh battery to last approximately two weeks (432 hours) with continuous monitoring of HR, BP, and RR at 30 measurements per hour and running inference every 30 minutes. In conclusion, i-CardiAx offers an energy-efficient, high-sensitivity method for long-term cardiovascular monitoring, providing predictive alerts for sepsis and other life-threatening events.
- [23] arXiv:2407.21455 [pdf, html, other]
-
Title: RF Power Transmission for Self-sustaining Miniaturized IoT DevicesSubjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
Radio Frequency (RF) wireless power transfer is a promising technology with the potential to continuously power small Internet of Things (IoT) devices, enabling even battery-less systems and reducing their maintenance requirements. However, to achieve this ambitious goal, carefully designed RF energy harvesting (EH) systems are needed to minimize conversion losses and maximize the conversion efficiency of the limited available power. For intelligent IoT sensors and devices, which often have non-constant power requirements, an additional power management stage with energy storage is needed to temporarily provide a higher power output than the power being harvested. This paper proposes an RF wireless power conversion system for miniaturized IoT devices composed of an impedance matching network, a rectifier, and power management with energy storage. The proposed sub-system has been experimentally validated and achieved an overall power conversion efficiency (PCE) of over 30% for an input power of -10 dBm and a peak efficiency of 57% at 3 dBm.
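The quoted operating points can be restated in linear units. A minimal sketch of the dBm-to-mW conversion and the harvested power it implies (the harvested-power figures below are derived here, not quoted in the abstract):

```python
# Convert the reported input powers from dBm to mW and apply the reported
# PCE values to estimate the harvested output power.
def dbm_to_mw(p_dbm: float) -> float:
    """Standard dBm-to-milliwatt conversion: P_mW = 10^(P_dBm / 10)."""
    return 10 ** (p_dbm / 10)

low_in = dbm_to_mw(-10)   # 0.1 mW input at the low operating point
peak_in = dbm_to_mw(3)    # ~2.0 mW input at the peak-efficiency point

low_out = 0.30 * low_in   # >30% PCE at -10 dBm  -> ~0.03 mW harvested
peak_out = 0.57 * peak_in # 57% PCE at 3 dBm     -> ~1.14 mW harvested
```

This puts the harvested power at the -10 dBm point at roughly 30 microwatts, which illustrates why an energy-storage stage is needed to serve bursty loads.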
- [24] arXiv:2407.21486 [pdf, html, other]
-
Title: TinyBird-ML: An ultra-low Power Smart Sensor Node for Bird Vocalization Analysis and Syllable ClassificationLukas Schulthess, Steven Marty, Matilde Dirodi, Mariana D. Rocha, Linus Rüttimann, Richard H. R. Hahnloser, Michele MagnoSubjects: Signal Processing (eess.SP)
Animal vocalisations serve a wide range of vital functions. Although it is possible to record animal vocalisations with external microphones, more insights are gained from miniature sensors mounted directly on animals' backs. We present TinyBird-ML, a wearable sensor node weighing only 1.4 g for acquiring, processing, and wirelessly transmitting acoustic signals to a host system using Bluetooth Low Energy. TinyBird-ML embeds low-latency tiny machine learning algorithms for song syllable classification. To optimize the battery lifetime of TinyBird-ML during fault-tolerant continuous recordings, we present an efficient firmware and hardware design. We use standard lossy compression schemes to reduce the amount of data sent over the Bluetooth antenna, which increases battery lifetime by 70% without any negative impact on offline sound analysis. Furthermore, by not transmitting signals during silent periods, we increase battery lifetime further. One advantage of our sensor is that it allows for closed-loop experiments in the microsecond range by processing sounds directly on the device instead of streaming them to a computer. We demonstrate this capability by detecting and classifying song syllables with minimal latency and a syllable error rate of 7%, using a lightweight neural network that runs directly on the sensor node itself. Thanks to our power-saving hardware and software design, during continuous operation at a sampling rate of 16 kHz, the sensor node achieves a lifetime of 25 hours on a single size-13 zinc-air battery.
- [25] arXiv:2407.21490 [pdf, html, other]
-
Title: Explainable and Controllable Motion Curve Guided Cardiac Ultrasound Video GenerationJunxuan Yu, Rusi Chen, Yongsong Zhou, Yanlin Chen, Yaofei Duan, Yuhao Huang, Han Zhou, Tan Tao, Xin Yang, Dong NiComments: Accepted by MICCAI MLMI 2024Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Echocardiography video is a primary modality for diagnosing heart diseases, but the limited data poses challenges for both clinical teaching and machine learning training. Recently, video generative models have emerged as a promising strategy to alleviate this issue. However, previous methods often relied on holistic conditions during generation, hindering flexible movement control over specific cardiac structures. In this context, we propose an explainable and controllable method for echocardiography video generation, taking an initial frame and a motion curve as guidance. Our contributions are three-fold. First, we extract motion information from each heart substructure to construct motion curves, enabling the diffusion model to synthesize customized echocardiography videos by modifying these curves. Second, we propose the structure-to-motion alignment module, which can map semantic features onto motion curves across cardiac structures. Third, the position-aware attention mechanism is designed to enhance video consistency utilizing Gaussian masks with structural position information. Extensive experiments on three echocardiography datasets show that our method outperforms others regarding fidelity and consistency. The full code will be released at this https URL.
- [26] arXiv:2407.21501 [pdf, html, other]
-
Title: H-Watch: An Open, Connected Platform for AI-Enhanced COVID19 Infection Symptoms Monitoring and Contact TracingSubjects: Systems and Control (eess.SY)
The novel COVID-19 disease has been declared a pandemic event. Early detection of infection symptoms and contact tracing play a vital role in containing the spread of COVID-19. As demonstrated by recent literature, multi-sensor and connected wearable devices might enable symptom detection and help trace contacts, while also acquiring useful epidemiological information. This paper presents the design and implementation of a fully open-source wearable platform called H-Watch. It has been designed to include several sensors for COVID-19 early detection, multiple radios for wireless transmission and tracking, a microcontroller for on-board data processing, and, finally, an energy harvester to extend the battery lifetime. Experimental results demonstrated only 5.9 mW of average power consumption, leading to a lifetime of 9 days on a small watch battery. Finally, all the hardware and software, including a machine-learning-on-MCU toolkit, are provided open-source, allowing the research community to build and use the H-Watch.
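The 9-day figure can be inverted to estimate the battery energy it implies. This is only a rearrangement of the quoted numbers; the cell voltage (3 V below) is an assumption, as the abstract does not specify the battery:

```python
# Estimate the battery energy implied by the quoted H-Watch lifetime.
AVG_POWER_MW = 5.9        # average power consumption, from the abstract
LIFETIME_H = 9 * 24       # quoted 9-day lifetime, in hours

energy_mwh = AVG_POWER_MW * LIFETIME_H  # ~1274 mWh of usable energy
capacity_mah = energy_mwh / 3.0         # ~425 mAh, ASSUMING a 3 V cell
```

A capacity in the few-hundred-mAh range is plausible for a watch-format rechargeable or coin cell, which is consistent with the abstract's claim.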
- [27] arXiv:2407.21508 [pdf, html, other]
-
Title: Machine Learning In-Sensors: Computation-enabled Intelligent Sensors For Next Generation of IoTSubjects: Signal Processing (eess.SP); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
Smart sensors are an emerging technology that combines data acquisition with processing directly on the edge device, very close to the sensors. To push this concept to the extreme, technology companies are proposing a new generation of sensors that move the intelligence from the edge host device, typically a microcontroller, directly into the ultra-low-power sensor itself, in order to further improve miniaturization, cost, and energy efficiency. This paper evaluates the capabilities of a novel and promising solution from STMicroelectronics. The presence of a floating-point unit and an accelerator for binary neural networks provides capabilities for in-sensor feature extraction and machine learning. We compare full-precision and binary neural networks for activity recognition with accelerometer data generated by the sensor itself. Experimental results demonstrate that the sensor can achieve an inference performance of 10.7 cycles/MAC with full-precision networks, comparable to a Cortex-M4-based microcontroller, and up to 1.5 cycles/MAC with large binary models for low-latency inference, with an average energy consumption of only 90 $\mu$J/inference with the core running at 5 MHz.
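The cycles/MAC figures translate directly into latency at the 5 MHz core clock. A minimal sketch of that conversion; the 50k-MAC network size below is a hypothetical example, not a figure from the paper:

```python
# Latency implied by a cycles-per-MAC figure at a given core clock.
F_CLK_HZ = 5e6  # core clock from the abstract (5 MHz)

def inference_latency_ms(n_macs: int, cycles_per_mac: float) -> float:
    """Total cycles divided by the clock rate, in milliseconds."""
    return n_macs * cycles_per_mac / F_CLK_HZ * 1e3

# Hypothetical 50k-MAC network, using the two reported efficiency points:
full_precision_ms = inference_latency_ms(50_000, 10.7)  # ~107 ms
binary_ms = inference_latency_ms(50_000, 1.5)           # ~15 ms
```

The ~7x gap between the two latencies mirrors the 10.7 vs. 1.5 cycles/MAC ratio reported for full-precision versus binary models.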
- [28] arXiv:2407.21514 [pdf, other]
-
Title: Wireless Communications in Doubly Selective Channels with Domain AdaptivityComments: Magazine article, 7 pages, 4 figures, 2 tablesSubjects: Signal Processing (eess.SP)
Wireless communications are significantly impacted by the propagation environment, particularly in doubly selective channels with variations in both the time and frequency domains. Orthogonal Time Frequency Space (OTFS) modulation has emerged as a promising solution; however, its high equalization complexity, if performed in the delay-Doppler domain, limits its universal application. This article explores domain-adaptive system design, which dynamically selects best-fit domains for modulation, pilot placement, and equalization based on channel conditions, to enhance performance across diverse environments. We examine domain classifications and connections, signal designs, and equalization techniques with domain adaptivity, and finally highlight future research opportunities.
- [29] arXiv:2407.21516 [pdf, other]
-
Title: Expanding the Medical Decathlon dataset: segmentation of colon and colorectal cancer from computed tomography imagesI.M. Chernenkiy, Y.A. Drach, S.R. Mustakimova, V.V. Kazantseva, N.A. Ushakov, S.K. Efetov, M.V. FeldsherovComments: 8 pages, 2 figures, 2 tablesSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Colorectal cancer is the third-most common cancer in the Western Hemisphere. The segmentation of the colon and colorectal cancer from computed tomography images is an urgent problem in medicine. Indeed, a system capable of solving it would enable the detection of colorectal cancer at early stages of the disease, facilitate the radiologist's search for pathology, and significantly accelerate the diagnostic process. However, scientific publications on medical image processing mostly use closed, non-public data. This paper presents an extension of the Medical Decathlon dataset with colorectal markups in order to improve the quality of segmentation algorithms. An experienced radiologist validated the data, categorized it into subsets by quality, and published it in the public domain. Based on the obtained results, we trained neural network models of the UNet architecture with 5-fold cross-validation and achieved a Dice metric of $0.6988 \pm 0.3$. The published markups will improve the quality of colorectal cancer detection and simplify the radiologist's work in describing studies.
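The reported quality figure is the Dice similarity coefficient, Dice = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and ground-truth mask B. A minimal sketch for binary masks (the toy arrays below are illustrative, not from the dataset):

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    # Convention: two empty masks count as a perfect match.
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

a = np.array([[1, 1, 0], [0, 1, 0]])  # toy prediction
b = np.array([[1, 0, 0], [0, 1, 1]])  # toy ground truth
score = dice(a, b)  # 2*2 / (3+3) ≈ 0.667
```

A Dice of 1.0 means perfect overlap and 0.0 means none, so the reported $0.6988$ corresponds to roughly 70% overlap-weighted agreement with the radiologist's markup.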
- [30] arXiv:2407.21533 [pdf, html, other]
-
Title: Long-Term Forecasts of Failures in Wind TurbinesSubjects: Systems and Control (eess.SY)
We collect papers forecasting wind turbine failures at least two days in advance. We examine the prediction time, methods, failed components, and dataset size. We investigate the effect of using standard SCADA data and data from additional sensors, such as those measuring vibration. We observe a positive correlation between dataset size and prediction time. In the considered cases, one may roughly expect a forecast for at least two days using a dataset of one turbine year and a forecast for two hundred days from a dataset of a hundred turbine years.
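Read literally, the abstract's two anchor points (one turbine-year of data giving a two-day forecast, one hundred turbine-years giving a two-hundred-day forecast) are consistent with a linear rule of thumb of about two forecast days per turbine-year. This reading is my interpolation between the quoted endpoints, not a model stated in the survey:

```python
# Rough rule of thumb interpolated from the abstract's two anchor points:
# 1 turbine-year -> ~2 days, 100 turbine-years -> ~200 days.
def expected_horizon_days(turbine_years: float) -> float:
    """Approximate failure-forecast horizon for a given dataset size."""
    return 2.0 * turbine_years

assert expected_horizon_days(1) == 2.0
assert expected_horizon_days(100) == 200.0
```

Intermediate dataset sizes would then map proportionally (e.g., ten turbine-years to roughly twenty days), though the survey only supports the two quoted endpoints.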
- [31] arXiv:2407.21547 [pdf, other]
-
Title: Scheduling Quantum Annealing for Active User Detection in a NOMA NetworkRomain Piron, Claire Goursaud (MARACAS, SOCRATE)Comments: 2024 IEEE International Conference on Quantum Computing and Engineering (QCE), Sep 2024, Montr{é}al (Qu{é}bec), CanadaSubjects: Signal Processing (eess.SP); Quantum Physics (quant-ph)
Active user detection in a non-orthogonal multiple access (NOMA) network is a major challenge for 5G/6G applications. However, classical algorithms that can perform this task suffer from either high complexity or reduced performance. This work proposes a quantum annealing approach to overcome this trade-off. First, we show that the maximum a posteriori decoder of the activity pattern of the network can be seen as the ground state of an Ising Hamiltonian. For N users in a network with perfect channels, we propose a universal control function to schedule the annealing process. Our approach avoids continuously computing the optimal control function but still ensures a high success probability while requiring a shorter annealing time than a linear control function. This advantage holds even in the presence of imperfections in the network.
- [32] arXiv:2407.21548 [pdf, other]
-
Title: Blind and robust estimation of adaptive optics point spread function and diffuse halo with sharp-edged objectsAnthony Berdeu (LESIA, NARIT)Journal-ref: Astronomy and Astrophysics - A\&A, 2024, 688, pp.A18Subjects: Signal Processing (eess.SP); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Context. Initially designed to detect and characterise exoplanets, extreme adaptive optics (AO) systems open a new window onto the Solar System by resolving its small bodies. Nonetheless, their study remains limited by the accuracy of the knowledge of the AO-corrected point spread function (AO-PSF), which degrades their image and produces a bright halo, potentially hiding faint moons in their close vicinity. Aims. To overcome the random nature of AO-PSFs, I aim to develop a method that blindly recovers the PSF and its faint structured extensions directly from the data of interest, without any prior on the instrument or the object's shape. The objectives are both to deconvolve the object and to properly estimate and remove its surrounding halo to highlight potential faint companions. Methods. My method first estimates the PSF core via a parametric model fit, under the assumption of a sharp-edged flat object. Then, the resolved object and the PSF extensions are alternately deconvolved with a robust method that is insensitive to model outliers such as cosmic rays or unresolved moons. Finally, the complex halo produced by the AO system is modelled and removed from the data. Results. The method is validated on realistic simulations with an on-sky AO-PSF from the SPHERE/ZIMPOL instrument. On real data, the proposed blind deconvolution algorithm strongly improves the image sharpness and retrieves details on the surface of asteroids. In addition, their moons are visible in all tested epochs despite important variability in turbulence conditions. Conclusions. My method shows the feasibility of retrieving the complex features of AO-PSFs directly from the data of interest. It paves the way towards more precise studies of asteroid surfaces and the discovery and characterisation of Solar System moons in archival data or with future instruments on extremely large telescopes with ever more complex AO-PSFs.
- [33] arXiv:2407.21569 [pdf, html, other]
-
Title: Analysis of Functional Insufficiencies and Triggering Conditions to Improve the SOTIF of an MPC-based Trajectory PlannerComments: Extended VersionSubjects: Systems and Control (eess.SY); Robotics (cs.RO); Software Engineering (cs.SE); Signal Processing (eess.SP)
Automated and autonomous driving has made a significant technological leap over the past decade. In this process, the complexity of the algorithms used for vehicle control has grown significantly. Model Predictive Control (MPC) is a prominent example, which has gained enormous popularity and is now widely used for vehicle motion planning and control. However, safety concerns constrain its practical application, especially since traditional procedures of functional safety (FS), with its universal standard ISO 26262, reach their limits. Concomitantly, the new aspect of safety of the intended functionality (SOTIF) has moved into the center of attention; its standard, ISO 21448, was only released in 2022. Thus, experience with SOTIF is limited and few case studies are available in industry and research. Hence, this paper aims to make two main contributions: (1) an analysis of the SOTIF for a generic MPC-based trajectory planner and (2) an interpretation and concrete application of the generic procedures described in ISO 21448 for determining functional insufficiencies (FIs) and triggering conditions (TCs). Particular novelties of the paper include an approach for the out-of-context development of SOTIF-related elements (SOTIF-EooC), a compilation of important FIs and TCs for an MPC-based trajectory planner, and an optimized safety concept based on the identified FIs and TCs for the MPC-based trajectory planner.
- [34] arXiv:2407.21600 [pdf, html, other]
-
Title: Robust Simultaneous Multislice MRI Reconstruction Using Deep Generative PriorsShoujin Huang, Guanxiong Luo, Yuwan Wang, Kexin Yang, Lingyan Zhang, Jingzhe Liu, Hua Guo, Min Wang, Mengye LyuSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
Simultaneous multislice (SMS) imaging is a powerful technique for accelerating magnetic resonance imaging (MRI) acquisitions. However, SMS reconstruction remains challenging due to the complex signal interactions between and within the excited slices. This study presents a robust SMS MRI reconstruction method using deep generative priors. Starting from Gaussian noise, we leverage denoising diffusion probabilistic models (DDPM) to gradually recover the individual slices through reverse diffusion iterations while imposing data consistency from the measured k-space under a readout-concatenation framework. The posterior sampling procedure is designed such that DDPM training can be performed on single-slice images without special adjustments for SMS tasks. Additionally, our method integrates a low-frequency enhancement (LFE) module to address a practical issue: SMS-accelerated fast spin echo (FSE) and echo-planar imaging (EPI) sequences cannot easily embed autocalibration signals. Extensive experiments demonstrate that our approach consistently outperforms existing methods and generalizes well to unseen datasets. The code is available at this https URL after the review process.
- [35] arXiv:2407.21640 [pdf, html, other]
-
Title: MSA2Net: Multi-scale Adaptive Attention-guided Network for Medical Image SegmentationSina Ghorbani Kolahi, Seyed Kamal Chaharsooghi, Toktam Khatibi, Afshin Bozorgpour, Reza Azad, Moein Heidari, Ilker Hacihaliloglu, Dorit MerhofComments: Accepted at BMVC 2025. Supplementary materials included at the end of the main paper (3 pages, 2 figures, 1 table)Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Medical image segmentation involves identifying and separating object instances in a medical image to delineate various tissues and structures, a task complicated by the significant variations in size, shape, and density of these features. Convolutional neural networks (CNNs) have traditionally been used for this task but have limitations in capturing long-range dependencies. Transformers, equipped with self-attention mechanisms, aim to address this problem. However, in medical image segmentation, it is beneficial to merge both local and global features to effectively integrate feature maps across various scales, capturing both detailed features and broader semantic elements for dealing with variations in structures. In this paper, we introduce MSA2Net, a new deep segmentation framework featuring an expedient design of skip-connections. These connections facilitate feature fusion by dynamically weighting and combining coarse-grained encoder features with fine-grained decoder feature maps. Specifically, we propose a Multi-Scale Adaptive Spatial Attention Gate (MASAG), which dynamically adjusts the receptive field (local and global contextual information) to ensure that spatially relevant features are selectively highlighted while minimizing background distractions. Extensive evaluations involving dermatological and radiological datasets demonstrate that our MSA2Net outperforms state-of-the-art (SOTA) works or matches their performance. The source code is publicly available at this https URL.
- [36] arXiv:2407.21738 [pdf, html, other]
-
Title: Leveraging Self-Supervised Learning for Fetal Cardiac Planes Classification using Ultrasound Scan VideosJoseph Geo Benjamin, Mothilal Asokan, Amna Alhosani, Hussain Alasmawi, Werner Gerhard Diehl, Leanne Bricker, Karthik Nandakumar, Mohammad YaqubComments: Simplifying Medical Ultrasound: 4th International Workshop, ASMUS 2023, Held in Conjunction with MICCAI 2023, Vancouver, BC, Canada, October 8, 2023, ProceedingsSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Self-supervised learning (SSL) methods are popular since they can address situations with limited annotated data by directly utilising the underlying data distribution. However, the adoption of such methods is not explored enough in ultrasound (US) imaging, especially for fetal assessment. We investigate the potential of dual-encoder SSL in utilizing unlabelled US video data to improve the performance of challenging downstream Standard Fetal Cardiac Planes (SFCP) classification using limited labelled 2D US images. We study 7 SSL approaches based on reconstruction, contrastive loss, distillation, and information theory and evaluate them extensively on a large private US dataset. Our observations and findings are consolidated from more than 500 downstream training experiments under different settings. Our primary observation shows that for SSL training, the variance of the dataset is more crucial than its size because it allows the model to learn generalisable representations, which improve the performance of downstream tasks. Overall, the BarlowTwins method shows robust performance, irrespective of the training settings and data variations, when used as an initialisation for downstream tasks. Notably, full fine-tuning with 1% of labelled data outperforms ImageNet initialisation by 12% in F1-score and outperforms other SSL initialisations by at least 4% in F1-score, thus making it a promising candidate for transfer learning from US video to image data.
- [37] arXiv:2407.21744 [pdf, html, other]
-
Title: Assessing the Reliability Benefits of Energy Storage as a Transmission AssetComments: Submitted to IEEE Transactions on Industry ApplicationsSubjects: Systems and Control (eess.SY)
Utilizing energy storage solutions to reduce the need for traditional transmission investments has been recognized by system planners and supported by federal policies in recent years. This work demonstrates the need for detailed reliability assessment for quantitative comparison of the reliability benefits of energy storage and traditional transmission investments. First, a mixed-integer linear programming expansion planning model considering candidate transmission lines and storage technologies is solved to find the least-cost investment decisions. Next, operations under the resulting system configuration are simulated in a probabilistic reliability assessment which accounts for weather-dependent forced outages. The outcome of this work, when applied to TPPs, is to further equalize the consideration of energy storage compared to traditional transmission assets by capturing the value of storage for system reliability.
- [38] arXiv:2407.21754 [pdf, other]
-
Title: Cell-free Massive MIMO with Sequential Fronthaul Architecture and Limited Memory Access PointsSubjects: Signal Processing (eess.SP)
Cell-free massive multiple-input multiple-output (CFmMIMO) is a paradigm that can improve users' spectral efficiency (SE) far beyond traditional cellular networks. Increased spatial diversity in CFmMIMO is achieved by spreading the antennas into small access points (APs), which cooperate to serve the users. Sequential fronthaul topologies in CFmMIMO, such as the daisy chain and multi-branch tree topology, have gained considerable attention recently. In such a processing architecture, each AP must store its received signal vector in the memory until it receives the relevant information from the previous AP in the sequence to refine the estimate of the users' signal vector in the uplink. In this paper, we adopt vector-wise and element-wise compression on the raw or pre-processed received signal vectors to store them in the memory. We investigate the impact of the limited memory capacity in the APs on the optimal number of APs. We show that with no memory constraint, having single-antenna APs is optimal, especially as the number of users grows. However, a limited memory at the APs restricts the depth of the sequential processing pipeline. Furthermore, we investigate the relation between the memory capacity at the APs and the rate of the fronthaul link.
- [39] arXiv:2407.21759 [pdf, html, other]
-
Title: Optimal price signal generation for demand-side energy managementSubjects: Systems and Control (eess.SY)
Renewable Energy Sources play a key role in smart energy systems. To achieve 100% renewable energy, utilizing the flexibility potential on the demand side becomes the cost-efficient option to balance the grid. However, it is not trivial to exploit these available capacities and flexibility options profitably. The amount of available flexibility is a complex and time-varying function of the price signal and weather forecasts. In this work, we use a Flexibility Function to represent the relationship between the price signal and the demand and investigate optimization problems for the price signal computation. Consequently, this study considers the higher and lower levels in the hierarchy from the markets to appliances, households, and districts. This paper investigates optimal price generation via the Flexibility Function and studies its employment in controller design for demand-side management, its capability to provide ancillary services for balancing throughout the Smart Energy Operating System, and its effect on the physical level performance. Sequential and simultaneous approaches for computing the price signal, along with various cost functions are analyzed and compared. Simulation results demonstrate the generated price/penalty signal and its employment in a model predictive controller.
New submissions for Thursday, 1 August 2024 (showing 39 of 39 entries)
- [40] arXiv:2407.21029 (cross-list from cs.LO) [pdf, html, other]
-
Title: Data-Driven Abstractions via Binary-Tree Gaussian Processes for Formal VerificationComments: Published at IFAC conference on analysis and design of hybrid systems (ADHS) 2024Subjects: Logic in Computer Science (cs.LO); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Systems and Control (eess.SY)
To advance formal verification of stochastic systems against temporal logic requirements for handling unknown dynamics, researchers have been designing data-driven approaches inspired by breakthroughs in the underlying machine learning techniques. As one promising research direction, abstraction-based solutions based on Gaussian process (GP) regression have become popular for their ability to learn a representation of the latent system from data with a quantified error. Results obtained based on this model are then translated to the true system via various methods. In a recent publication, GPs using a so-called binary-tree kernel have demonstrated a polynomial speedup w.r.t. the size of the data compared to their vanilla version, outcompeting all existing sparse GP approximations. Incidentally, the resulting binary-tree Gaussian process (BTGP) is characterized by its piecewise-constant posterior mean and covariance functions, which naturally abstract the input space into discrete partitions. In this paper, we leverage this natural abstraction of the BTGP for formal verification, eliminating the need for cumbersome abstraction and error quantification procedures. We show that the BTGP allows us to construct an interval Markov chain model of the unknown system with a speedup that is polynomial w.r.t. the size of the abstraction compared to alternative approaches. We provide a delocalized error quantification via a unified formula even when the true dynamics do not live in the function space of the BTGP. This allows us to compute upper and lower bounds on the probability of satisfying reachability specifications that are robust to both aleatoric and epistemic uncertainties.
- [41] arXiv:2407.21054 (cross-list from cs.CL) [pdf, html, other]
-
Title: Sentiment Reasoning for HealthcareKhai Le-Duc, Khai-Nguyen Nguyen, Bach Phan Tat, Duy Le, Jerry Ngo, Long Vo-Dang, Anh Totti Nguyen, Truong-Son HyComments: Preprint, 18 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Transparency in AI decision-making is crucial in healthcare due to the severe consequences of errors, and it is important for building trust between AI systems and users in sentiment analysis tasks. Incorporating reasoning capabilities helps Large Language Models (LLMs) understand human emotions within broader contexts, handle nuanced and ambiguous language, and infer underlying sentiments that may not be explicitly stated. In this work, we introduce a new task - Sentiment Reasoning - for both speech and text modalities, along with our proposed multimodal multitask framework and dataset. Our study shows that rationale-augmented training enhances model performance in sentiment classification across both human-transcript and ASR settings. Also, we found that the generated rationales typically exhibit different vocabularies compared to human-generated rationales, but maintain similar semantics. All code, data (English-translated and Vietnamese) and models are published online: this https URL
- [42] arXiv:2407.21061 (cross-list from cs.CL) [pdf, html, other]
-
Title: Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain lossesComments: 10 pages (2 for references), 4 figures, published in SIGUL2024@LREC-COLING 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Training a semi-supervised end-to-end speech recognition system using noisy student training has significantly improved performance. However, this approach requires a substantial amount of paired speech-text and unlabeled speech, which is costly for low-resource languages. Therefore, this paper considers a more extreme case of semi-supervised end-to-end automatic speech recognition where there are limited paired speech-text, unlabeled speech (less than five hours), and abundant external text. Firstly, we observe improved performance by training the model using our previous work on semi-supervised learning "CycleGAN and inter-domain losses" solely with external text. Secondly, we enhance "CycleGAN and inter-domain losses" by incorporating automatic hyperparameter tuning, calling it "enhanced CycleGAN inter-domain losses." Thirdly, we integrate it into the noisy student training approach pipeline for low-resource scenarios. Our experimental results, conducted on six non-English languages from Voxforge and Common Voice, show a 20% word error rate reduction compared to the baseline teacher model and a 10% word error rate reduction compared to the baseline best student model, highlighting the significant improvements achieved through our proposed method.
- [43] arXiv:2407.21066 (cross-list from cs.CL) [pdf, html, other]
-
Title: ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing TasksSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Self-supervised learning has emerged as a key approach for learning generic representations from speech data. Despite promising results in downstream tasks such as speech recognition, speaker verification, and emotion recognition, a significant number of parameters is required, which makes fine-tuning for each task memory-inefficient. To address this limitation, we introduce ELP-adapter tuning, a novel method for parameter-efficient fine-tuning using three types of adapter, namely encoder adapters (E-adapters), layer adapters (L-adapters), and a prompt adapter (P-adapter). The E-adapters are integrated into transformer-based encoder layers and help to learn fine-grained speech representations that are effective for speech recognition. The L-adapters create paths from each encoder layer to the downstream head and help to extract non-linguistic features from lower encoder layers that are effective for speaker verification and emotion recognition. The P-adapter appends pseudo features to CNN features to further improve effectiveness and efficiency. With these adapters, models can be quickly adapted to various speech processing tasks. Our evaluation across four downstream tasks using five backbone models demonstrated the effectiveness of the proposed method. With the WavLM backbone, its performance was comparable to or better than that of full fine-tuning on all tasks while requiring 90% fewer learnable parameters.
- [44] arXiv:2407.21080 (cross-list from q-bio.QM) [pdf, other]
-
Title: Artificial Intelligence Enhanced Digital Nucleic Acid Amplification Testing for Precision Medicine and Molecular Diagnostics
Comments: Review article. 46 pages. 6 figures. 4 tables
Subjects: Quantitative Methods (q-bio.QM); Image and Video Processing (eess.IV)
The precise quantification of nucleic acids is pivotal in molecular biology, underscored by the rising prominence of nucleic acid amplification tests (NAAT) in diagnosing infectious diseases and conducting genomic studies. This review examines recent advancements in digital Polymerase Chain Reaction (dPCR) and digital Loop-mediated Isothermal Amplification (dLAMP), which surpass the limitations of traditional NAAT by offering absolute quantification and enhanced sensitivity. In this review, we summarize the compelling advancements of dNAAT in addressing pressing public health issues, especially during the COVID-19 pandemic. Further, we explore the transformative role of artificial intelligence (AI) in enhancing dNAAT image analysis, which not only improves efficiency and accuracy but also addresses traditional constraints related to cost, complexity, and data interpretation. In encompassing the state-of-the-art (SOTA) development and potential of both software and hardware, the all-encompassing Point-of-Care Testing (POCT) systems cast new light on benefits including higher throughput, label-free detection, and expanded multiplex analyses. While acknowledging the advances of AI-enhanced dNAAT technology, this review aims both to fill critical gaps in the existing technologies through comparative assessments and to offer a balanced perspective on the current trajectory, including attendant challenges and future directions. Leveraging AI, next-generation dPCR and dLAMP technologies promise integration into clinical practice, improving personalized medicine, real-time epidemic surveillance, and global diagnostic accessibility.
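The "absolute quantification" that distinguishes digital NAAT from traditional assays comes from Poisson statistics over partitions: each partition reads positive if it contains at least one target molecule, so the positive fraction determines the mean copy number directly. A minimal sketch of that standard calculation (the partition counts are made up for illustration):

```python
import math

def copies_per_partition(n_positive: int, n_total: int) -> float:
    """Absolute quantification in dPCR/dLAMP via Poisson statistics.

    Copies per partition follow a Poisson distribution, so
    P(positive) = 1 - exp(-lam), hence lam = -ln(1 - p_positive).
    """
    p = n_positive / n_total
    return -math.log(1.0 - p)

# Hypothetical run: 5,000 of 20,000 partitions fluoresce positive.
lam = copies_per_partition(5000, 20000)   # mean copies per partition
total_copies = lam * 20000                # absolute copy count in the sample
```

Note that the estimate needs no standard curve, which is precisely the advantage over threshold-cycle-based quantification mentioned in the abstract.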
- [45] arXiv:2407.21130 (cross-list from cs.LG) [pdf, html, other]
-
Title: Computational music analysis from first principles
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We use coupled hidden Markov models to automatically annotate the 371 Bach chorales in the Riemenschneider edition, a corpus containing approximately 100,000 notes and 20,000 chords. We give three separate analyses that achieve progressively greater accuracy at the cost of making increasingly strong assumptions about musical syntax. Although our method makes almost no use of human input, we are able to identify both chords and keys with an accuracy of 85% or greater when compared to an expert human analysis, resulting in annotations accurate enough to be used for a range of music-theoretical purposes, while also being free of subjective human judgments. Our work bears on longstanding debates about the objective reality of the structures postulated by standard Western harmonic theory, as well as on specific questions about the nature of Western harmonic syntax.
- [46] arXiv:2407.21135 (cross-list from cs.IT) [pdf, html, other]
-
Title: Physical Modelling and Cancellation of External Passive Intermodulation in FDD MIMO
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In this paper, the physical approach to model external (air-induced) passive intermodulation (PIM) is presented in a frequency-division duplexing (FDD) multiple-input multiple-output (MIMO) system with an arbitrary number of transceiver chains. The external PIM is a special case of intermodulation distortion (IMD), mainly generated by metallic objects possessing nonlinear properties ("rusty bolt" effect). Typically, such sources are located in the near-field or transition region of the antenna array. PIM products may fall into the receiver band of the FDD system, negatively affecting the uplink signal. In contrast to other works, this one directly simulates the physical external PIM. The system includes models of a point-source external PIM, a finite-length dipole antenna, a MIMO antenna array, and a baseband multicarrier 5G NR OFDM signal. The channel coefficients method for multi-PIM-source compensation is replicated to verify the proposed external PIM modelling approach. Simulation results of artificially generated PIM cancellation show similar performance to real-life experiments. Therefore, the proposed approach allows testing PIM compensation algorithms on large systems with many antennas and arbitrary array structures. This eliminates the need for experiments with real hardware at the development stage of the PIM cancellation algorithm.
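The reason PIM products can land in the receive band is visible from the standard third-order intermodulation frequencies: a nonlinearity driven by two carriers at f1 and f2 radiates energy at 2f1 - f2 and 2f2 - f1. A small sketch (the carrier frequencies are illustrative, not taken from the paper):

```python
def im3_products(f1: float, f2: float) -> tuple[float, float]:
    """Dominant third-order intermodulation frequencies produced by a
    nonlinearity ("rusty bolt") excited by two carriers at f1 and f2.
    These are the external-PIM terms that may fall into an FDD RX band."""
    return (2 * f1 - f2, 2 * f2 - f1)

# Two hypothetical downlink carriers (MHz); IM3 lands 60 MHz outside them.
lo, hi = im3_products(1930.0, 1990.0)
```

Wider carrier spacing pushes the IM3 products further out, which is why the frequency plan of an FDD deployment determines whether external PIM desensitizes the uplink.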
- [47] arXiv:2407.21174 (cross-list from cs.CV) [pdf, html, other]
-
Title: AI Safety in Practice: Enhancing Adversarial Robustness in Multimodal Image Captioning
Comments: Accepted into KDD 2024 workshop on Ethical AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Multimodal machine learning models that combine visual and textual data are increasingly being deployed in critical applications, raising significant safety and security concerns due to their vulnerability to adversarial attacks. This paper presents an effective strategy to enhance the robustness of multimodal image captioning models against such attacks. By leveraging the Fast Gradient Sign Method (FGSM) to generate adversarial examples and incorporating adversarial training techniques, we demonstrate improved model robustness on two benchmark datasets: Flickr8k and COCO. Our findings indicate that selectively training only the text decoder of the multimodal architecture shows performance comparable to full adversarial training while offering increased computational efficiency. This targeted approach suggests a balance between robustness and training costs, facilitating the ethical deployment of multimodal AI systems across various domains.
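FGSM itself is a one-line perturbation rule: step the input in the sign of the loss gradient. The sketch below uses a logistic-regression model so the gradient is analytic; the paper applies the same rule to a multimodal captioning model, and all weights and inputs here are toy values.

```python
import numpy as np

def fgsm_perturb(x: np.ndarray, grad_x: np.ndarray, eps: float) -> np.ndarray:
    """Fast Gradient Sign Method: x_adv = x + eps * sign(dL/dx)."""
    return x + eps * np.sign(grad_x)

# Logistic model p = sigmoid(w.x); for label y = 1 the loss is -log p,
# so the input gradient is dL/dx = (p - 1) * w.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])
p = 1.0 / (1.0 + np.exp(-w @ x))
grad_x = (p - 1.0) * w

x_adv = fgsm_perturb(x, grad_x, eps=0.1)
p_adv = 1.0 / (1.0 + np.exp(-w @ x_adv))   # confidence drops after the attack
```

Adversarial training then simply mixes such x_adv examples into the training batches; the paper's finding is that doing this while updating only the text decoder recovers most of the robustness at a fraction of the cost.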
- [48] arXiv:2407.21294 (cross-list from cs.GT) [pdf, html, other]
-
Title: Decentralized and Uncoordinated Learning of Stable Matchings: A Game-Theoretic Approach
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Systems and Control (eess.SY)
We consider the problem of learning stable matchings in a fully decentralized and uncoordinated manner. In this problem, there are $n$ men and $n$ women, each having preference over the other side. It is assumed that women know their preferences over men, but men are not aware of their preferences over women, and they only learn them if they propose and successfully get matched to women. A matching is called stable if no man and woman prefer each other over their current matches. When all the preferences are known a priori, the celebrated Deferred-Acceptance algorithm proposed by Gale and Shapley provides a decentralized and uncoordinated algorithm to obtain a stable matching. However, when the preferences are unknown, developing such an algorithm faces major challenges due to a lack of coordination. We achieve this goal by making a connection between stable matchings and learning Nash equilibria (NE) in noncooperative games. First, we provide a complete information game formulation for the stable matching problem with known preferences such that its set of pure NE coincides with the set of stable matchings, while its mixed NE can be rounded in a decentralized manner to a stable matching. Relying on such a game-theoretic formulation, we show that for hierarchical markets, adopting the exponential weight (EXP) learning algorithm for the stable matching game achieves logarithmic regret with polynomial dependence on the number of players, thus answering a question posed in previous literature. Moreover, we show that the same EXP learning algorithm converges locally and exponentially fast to a stable matching in general matching markets. We complement this result by introducing another decentralized and uncoordinated learning algorithm that globally converges to a stable matching with arbitrarily high probability, leveraging the weak acyclicity property of the stable matching game.
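The known-preferences baseline the abstract contrasts against is Gale-Shapley deferred acceptance. A compact sketch with men proposing (preferences are toy values; the paper's contribution is the setting where men must *learn* these preferences online):

```python
def deferred_acceptance(men_prefs, women_prefs):
    """Gale-Shapley deferred acceptance, men proposing.

    men_prefs[m] / women_prefs[w] list the other side in preference order.
    Returns a stable matching as a dict man -> woman.
    """
    n = len(men_prefs)
    next_choice = [0] * n          # next woman each man will propose to
    engaged_to = {}                # woman -> currently held man
    free_men = list(range(n))
    # rank[w][m] = position of man m in woman w's preference list
    rank = [{m: r for r, m in enumerate(prefs)} for prefs in women_prefs]
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if w not in engaged_to:
            engaged_to[w] = m                       # w tentatively accepts
        elif rank[w][m] < rank[w][engaged_to[w]]:
            free_men.append(engaged_to[w])          # w trades up to m
            engaged_to[w] = m
        else:
            free_men.append(m)                      # w rejects m
    return {m: w for w, m in engaged_to.items()}

men = [[0, 1], [0, 1]]       # both men prefer woman 0
women = [[1, 0], [0, 1]]     # woman 0 prefers man 1
matching = deferred_acceptance(men, women)
```

Here both men propose to woman 0; she keeps man 1 (her favorite), and man 0 settles for woman 1, so no man-woman pair prefers each other over their matches, which is exactly the stability condition in the abstract.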
- [49] arXiv:2407.21299 (cross-list from cs.HC) [pdf, html, other]
-
Title: Who should I trust? A Visual Analytics Approach for Comparing Net Load Forecasting Models
Comments: Accepted for publication in the proceedings of 2025 IEEE PES Grid Edge Technologies Conference & Exposition (Grid Edge)
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Net load forecasting is crucial for energy planning and facilitating informed decision-making regarding trade and load distributions. However, evaluating forecasting models' performance against benchmark models remains challenging, thereby impeding experts' trust in the model's performance. In this context, there is a demand for technological interventions that allow scientists to compare models across various timeframes and solar penetration levels. This paper introduces a visual analytics-based application designed to compare the performance of deep-learning-based net load forecasting models with other models for probabilistic net load forecasting. This application employs carefully selected visual analytic interventions, enabling users to discern differences in model performance across different solar penetration levels, dataset resolutions, and hours of the day over multiple months. We also present observations made using our application through a case study, demonstrating the effectiveness of visualizations in aiding scientists in making informed decisions and enhancing trust in net load forecasting models.
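Comparing probabilistic forecasters requires a scoring rule for quantile predictions; a standard choice in load forecasting is the pinball (quantile) loss. This is a generic sketch of that metric (the abstract does not name the application's exact metrics, so treat this as illustrative background, not the tool's implementation):

```python
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Pinball loss for a quantile-q forecast: penalizes under-prediction
    by q and over-prediction by (1 - q) per unit of error."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Toy net-load values (MW) against a hypothetical 0.9-quantile forecast.
y = np.array([10.0, 12.0, 9.0])
q_pred = np.array([9.0, 13.0, 9.0])
loss = pinball_loss(y, q_pred, q=0.9)
```

Averaging pinball loss across many quantiles yields an overall probabilistic score, and slicing that score by hour of day or solar penetration level is the kind of comparison the visual analytics views support.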
- [50] arXiv:2407.21301 (cross-list from cs.IT) [pdf, html, other]
-
Title: Integrated Sensing and Communication in IRS-assisted High-Mobility Systems: Design, Analysis and Optimization
Comments: 15 pages, 9 figures
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In this paper, we investigate integrated sensing and communication (ISAC) in high-mobility systems with the aid of an intelligent reflecting surface (IRS). To exploit the benefits of Delay-Doppler (DD) spread caused by high mobility, an orthogonal time frequency space (OTFS)-based frame structure and transmission framework are proposed. In such a framework, we first design a low-complexity ratio-based sensing algorithm for estimating the velocity of the mobile user. Then, we analyze the performance of sensing and communication in terms of achievable mean square error (MSE) and achievable rate, respectively, and reveal the impact of key parameters. Next, with the derived performance expressions, we jointly optimize the phase shift matrix of the IRS and the receive combining vector at the base station (BS) to improve the overall performance of integrated sensing and communication. Finally, extensive simulation results confirm the effectiveness of the proposed algorithms in high-mobility systems.
- [51] arXiv:2407.21321 (cross-list from cs.FL) [pdf, other]
-
Title: Hyper parametric timed CTL
Comments: Accepted to EMSOFT 2024
Subjects: Formal Languages and Automata Theory (cs.FL); Systems and Control (eess.SY)
Hyperproperties enable simultaneous reasoning about multiple execution traces of a system and are useful to reason about non-interference, opacity, robustness, fairness, observational determinism, etc. We introduce hyper parametric timed computation tree logic (HyperPTCTL), extending hyperlogics with timing reasoning and, notably, parameters to express unknown values. We mainly consider its nest-free fragment, where temporal operators cannot be nested. However, we allow extensions that enable counting actions and comparing the duration since the most recent occurrence of specific actions. We show that our nest-free fragment with this extension is sufficiently expressive to encode properties, e.g., opacity, (un)fairness, or robust observational (non-)determinism. We propose semi-algorithms for model checking and synthesis of parametric timed automata (an extension of timed automata with timing parameters) against this nest-free fragment with the extension via reduction to PTCTL model checking and synthesis. While the general model checking (and thus synthesis) problem is undecidable, we show that a large part of our extended (yet nest-free) fragment is decidable, provided the parameters only appear in the property, not in the model. We also exhibit additional decidable fragments where parameters within the model are allowed. We implemented our semi-algorithms on top of the IMITATOR model checker, and performed experiments. Our implementation supports most of the nest-free fragments (beyond the decidable classes). The experimental results highlight our method's practical relevance.
- [52] arXiv:2407.21391 (cross-list from cs.SD) [pdf, other]
-
Title: Design and Development of Laughter Recognition System Based on Multimodal Fusion and Deep Learning
Comments: 7 pages, 2 figures
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
This study aims to design and implement a laughter recognition system based on multimodal fusion and deep learning, leveraging image and audio processing technologies to achieve accurate laughter recognition and emotion analysis. First, the system loads video files and uses the OpenCV library to extract facial information while employing the Librosa library to process audio features such as MFCC. Then, multimodal fusion techniques are used to integrate image and audio features, followed by training and prediction using deep learning models. Evaluation results indicate that the model achieved 80% accuracy, precision, and recall on the test dataset, with an F1 score of 80%, demonstrating robust performance and the ability to handle real-world data variability. This study not only verifies the effectiveness of multimodal fusion methods in laughter recognition but also highlights their potential applications in affective computing and human-computer interaction. Future work will focus on further optimizing feature extraction and model architecture to improve recognition accuracy and expand application scenarios, promoting the development of laughter recognition technology in fields such as mental health monitoring and educational activity evaluation.
- [53] arXiv:2407.21444 (cross-list from cs.IT) [pdf, html, other]
-
Title: Cooperative Orbital Angular Momentum Wireless Communications
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Orbital angular momentum (OAM) mode multiplexing has the potential to achieve high-spectrum-efficiency communications at the same time and frequency by using orthogonal mode resources. However, the hollow, divergent intensity profile of the vortex wave requires a large-scale receive antenna, making it hard for users to receive the OAM signal with size-limited equipment. To promote the OAM application in the next 6G communications, this paper proposes the cooperative OAM wireless (COW) communication scheme, which can select the cooperative users (CUs) to form the aligned antennas from size-limited user equipment. First, we derive the feasible radial radius and selective waist radius to choose the CUs in the same circle with the origin at the base station. Then, based on the locations of CUs, the waist radius is adjusted to form the receive antennas and ensure the maximum intensity for the CUs. Finally, the cooperative formation probability is derived in closed form, which can depict the feasibility of the proposed COW communication scheme. Furthermore, OAM beam steering is used to expand the feasible CU region, thus achieving higher cooperative formation probability. Simulation results demonstrate that the derived cooperative formation probability in mathematical analysis is very close to the statistical probability of cooperative formation, and the proposed COW communication scheme can obtain higher spectrum efficiency than the traditional scheme due to the effective reception of the OAM signal.
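The waist-radius adjustment above exploits the ring-shaped intensity profile of vortex beams. For a Laguerre-Gaussian beam of topological charge $\ell$ (a common model for OAM beams; the abstract does not state the exact beam model, so this is illustrative), the bright ring peaks at

```latex
% Peak-intensity radius of an LG_{\ell,0} beam with waist radius w
r_{\max} = w \sqrt{\lvert \ell \rvert / 2}
```

so tuning the waist $w$ places the intensity maximum at the cooperative users' radial distance.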
- [54] arXiv:2407.21453 (cross-list from cs.LG) [pdf, html, other]
-
Title: TinyChirp: Bird Song Recognition Using TinyML Models on Low-power Wireless Acoustic Sensors
Zhaolan Huang, Adrien Tousnakhoff, Polina Kozyr, Roman Rehausen, Felix Bießmann, Robert Lachlan, Cedric Adjih, Emmanuel Baccelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Monitoring biodiversity at scale is challenging. Detecting and identifying species in fine-grained taxonomies requires highly accurate machine learning (ML) methods. Training such models requires large, high-quality data sets, and deploying them to low-power devices requires novel compression techniques and model architectures. While species classification methods have profited from novel data sets and advances in ML methods, in particular neural networks, deploying these state-of-the-art models to low-power devices remains difficult. Here we present a comprehensive empirical comparison of various tinyML neural network architectures and compression techniques for species classification. We focus on the example of bird song detection, more concretely a data set curated for studying the corn bunting bird species. The data set is released along with all code and experiments of this study. In our experiments we compare the predictive performance, memory, and time complexity of classical spectrogram-based methods and recent approaches operating on the raw audio signal. Our results indicate that individual bird species can be robustly detected with relatively simple architectures that can be readily deployed to low-power devices.
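The "classical spectrogram-based" input representation the comparison starts from is just framewise FFT magnitudes of the waveform. A minimal numpy sketch (frame and hop sizes are illustrative, not the paper's configuration):

```python
import numpy as np

def magnitude_spectrogram(signal: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Framewise |FFT| of a raw audio signal: the classical feature
    representation benchmarked against raw-waveform models."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)           # shape: (num_frames, frame_len // 2 + 1)

# Sanity check with a pure 1 kHz tone at 8 kHz sampling rate:
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = magnitude_spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())   # 1000 Hz / (8000/256 Hz per bin) = bin 32
```

On a microcontroller, this front end trades FFT compute and buffer memory for a much smaller downstream classifier, which is exactly the memory/time/accuracy trade-off the paper quantifies against raw-audio models.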
- [55] arXiv:2407.21476 (cross-list from cs.CL) [pdf, html, other]
-
Title: On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition
Comments: Accepted at the SynData4GenAI 2024 workshop
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
The rapid development of neural text-to-speech (TTS) systems has enabled their use in other areas of natural language processing, such as automatic speech recognition (ASR) or spoken language translation (SLT). Due to the large number of different TTS architectures and their extensions, selecting which TTS systems to use for synthetic data creation is not an easy task. We compare five different TTS decoder architectures in the scope of synthetic data generation to show their impact on CTC-based speech recognition training. We compare the recognition results to computable metrics like NISQA MOS and intelligibility, finding that there are no clear relations to the ASR performance. We also observe that for data generation auto-regressive decoding performs better than non-autoregressive decoding, and propose an approach to quantify TTS generalization capabilities.
- [56] arXiv:2407.21478 (cross-list from cs.IT) [pdf, html, other]
-
Title: Precoding Based Downlink OAM-MIMO Communications with Rate Splitting
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Orbital angular momentum (OAM) and rate splitting (RS) are potential key techniques for future wireless communications. As a new orthogonal resource, OAM can achieve a multifold increase of spectrum efficiency to relieve the scarcity of spectrum resources, but enhancing privacy performance poses a crucial challenge for OAM communications. The RS technique divides the information into private and common parts, which can guarantee privacy for all users. In this paper, we integrate the RS technique into downlink OAM-MIMO communications, and study the precoding optimization to maximize the sum capacity. First, concentric uniform circular arrays (UCAs) are utilized to construct the downlink transmission framework of OAM-MIMO communications with RS. In particular, users in the same user pair utilize the RS technique to obtain the information, and different user pairs use different OAM modes. Then, we derive the OAM-MIMO channel model, and formulate the sum capacity maximization problem. Finally, based on fractional programming, the optimal precoding matrix is obtained to maximize the sum capacity by using the quadratic transformation. Extensive simulation results show that by using the proposed precoding optimization algorithm, OAM-MIMO communications with RS can achieve higher sum capacity than traditional communication schemes.
- [57] arXiv:2407.21479 (cross-list from cs.IT) [pdf, html, other]
-
Title: Air-to-Ground Cooperative OAM Communications
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
For users in a hotspot region, orbital angular momentum (OAM) can realize a multifold increase of spectrum efficiency (SE), and a flying base station (FBS) can rapidly support real-time communication demand. However, the hollow divergence and alignment requirements pose crucial challenges for users attempting air-to-ground OAM communications, where a line-of-sight path exists. Therefore, we propose the air-to-ground cooperative OAM communication (ACOC) scheme, which can realize OAM communications for users with size-limited devices. The waist radius is adjusted to guarantee the maximum intensity at the cooperative users (CUs). We derive the closed-form expression of the optimal FBS position, which satisfies the antenna alignment for two cooperative user groups (CUGs). Furthermore, a selection constraint is given to choose two CUGs composed of four CUs. Simulation results are provided to validate the optimal FBS position and the SE superiority of the proposed ACOC scheme.
- [58] arXiv:2407.21491 (cross-list from cs.CL) [pdf, other]
-
Title: Generative Expressive Conversational Speech Synthesis
Comments: 14 pages, 6 figures, 8 tables. Accepted by ACM MM 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting. Existing CSS methods employ effective multi-modal context modeling techniques to achieve empathy understanding and expression. However, they often need to design complex network architectures and meticulously optimize the modules within them. In addition, due to the limitations of small-scale datasets containing scripted recording styles, they often fail to simulate real natural conversational styles. To address the above issues, we propose a novel generative expressive CSS system, termed GPT-Talker. We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context. Leveraging the power of GPT, we predict the agent's response token sequence, which includes both semantic and style knowledge. After that, the expressive conversational speech is synthesized by the conversation-enriched VITS to deliver feedback to the user. Furthermore, we propose a large-scale Natural CSS Dataset called NCSSD, which includes both naturally recorded conversational speech in improvised styles and dialogues extracted from TV shows. It encompasses both Chinese and English languages, with a total duration of 236 hours. We conducted comprehensive experiments on the reliability of the NCSSD and the effectiveness of our GPT-Talker. Both subjective and objective evaluations demonstrate that our model significantly outperforms other state-of-the-art CSS systems in terms of naturalness and expressiveness. The Code, Dataset, and Pre-trained Model are available at: this https URL.
- [59] arXiv:2407.21507 (cross-list from cs.AI) [pdf, html, other]
-
Title: FSSC: Federated Learning of Transformer Neural Networks for Semantic Image Communication
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
In this paper, we address the problem of image semantic communication in a multi-user deployment scenario and propose a federated learning (FL) strategy for a Swin Transformer-based semantic communication system (FSSC). Firstly, we demonstrate that the adoption of a Swin Transformer for joint source-channel coding (JSCC) effectively extracts semantic information in the communication system. Next, the FL framework is introduced to collaboratively learn a global model by aggregating local model parameters, rather than directly sharing clients' data. This approach enhances user privacy protection and reduces the workload on the server or mobile edge. Simulation evaluations indicate that our method outperforms the typical JSCC algorithm and traditional separate-based communication algorithms. In particular, after integrating local semantics, the globally aggregated model further increases the Peak Signal-to-Noise Ratio (PSNR) by more than 2 dB, thoroughly proving the effectiveness of our algorithm.
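The aggregation step described above (combining local parameters rather than data) is typically federated averaging: the server computes a data-size-weighted mean of client weights. A minimal sketch with toy parameter vectors (the paper aggregates Swin Transformer weights; the numbers here are illustrative):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Federated averaging: weight each client's parameters by its share
    of the total training data, so raw images never leave the clients."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two clients with toy 2-parameter models and unequal data set sizes.
w1 = np.array([1.0, 2.0])
w2 = np.array([3.0, 4.0])
global_w = fed_avg([w1, w2], [10, 30])   # client 2 contributes 3x the weight
```

One communication round is: broadcast `global_w`, let each client fine-tune locally, then re-aggregate; privacy comes from the fact that only parameter updates, never images, cross the network.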
- [60] arXiv:2407.21531 (cross-list from cs.SD) [pdf, html, other]
-
Title: Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation
Ziya Zhou, Yuhang Wu, Zhiyue Wu, Xinyue Zhang, Ruibin Yuan, Yinghao Ma, Lu Wang, Emmanouil Benetos, Wei Xue, Yike Guo
Comments: Accepted by ISMIR 2024
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Symbolic Music, akin to language, can be encoded in discrete symbols. Recent research has extended the application of large language models (LLMs) such as GPT-4 and Llama2 to the symbolic music domain including understanding and generation. Yet scant research explores the details of how these LLMs perform on advanced music understanding and conditioned generation, especially from the multi-step reasoning perspective, which is a critical aspect in the conditioned, editable, and interactive human-computer co-creation process. This study conducts a thorough investigation of LLMs' capability and limitations in symbolic music processing. We identify that current LLMs exhibit poor performance in song-level multi-step music reasoning, and typically fail to leverage learned music knowledge when addressing complex musical tasks. An analysis of LLMs' responses highlights distinctly their pros and cons. Our findings suggest achieving advanced musical capability is not intrinsically obtained by LLMs, and future research should focus more on bridging the gap between music knowledge and reasoning, to improve the co-creation experience for musicians.
- [61] arXiv:2407.21545 (cross-list from cs.SD) [pdf, html, other]
-
Title: Robust Lossy Audio Compression Identification
Comments: Accepted to be published in the Proceedings of the 25th International Society for Music Information Retrieval Conference 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Previous research contributions on blind lossy compression identification report near perfect performance metrics on their test set, across a variety of codecs and bit rates. However, we show that such results can be deceptive and may not accurately represent true ability of the system to tackle the task at hand. In this article, we present an investigation into the robustness and generalisation capability of a lossy audio identification model. Our contributions are as follows. (1) We show the lack of robustness to codec parameter variations of a model equivalent to prior art. In particular, when naively training a lossy compression detection model on a dataset of music recordings processed with a range of codecs and their lossless counterparts, we obtain near perfect performance metrics on the held-out test set, but severely degraded performance on lossy tracks produced with codec parameters not seen in training. (2) We propose and show the effectiveness of an improved training strategy to significantly increase the robustness and generalisation capability of the model beyond codec configurations seen during training. Namely we apply a random mask to the input spectrogram to encourage the model not to rely solely on the training set's codec cutoff frequency.
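The proposed augmentation, masking random frequency bands of the input spectrogram so the detector cannot key only on the codec's cutoff frequency, can be sketched in a few lines. Mask width and placement details below are illustrative, not the paper's exact configuration:

```python
import numpy as np

def random_band_mask(spec: np.ndarray, max_bins: int, rng: np.random.Generator) -> np.ndarray:
    """Zero out a random contiguous band of frequency bins in a
    (freq, time) spectrogram, forcing the model to use evidence beyond
    the lossy codec's cutoff region."""
    n_freq = spec.shape[0]
    width = int(rng.integers(1, max_bins + 1))
    start = int(rng.integers(0, n_freq - width + 1))
    out = spec.copy()
    out[start:start + width, :] = 0.0
    return out

rng = np.random.default_rng(0)
spec = np.ones((128, 50))                       # toy (freq, time) spectrogram
masked = random_band_mask(spec, max_bins=16, rng=rng)
```

Applied fresh on every training example, the mask sometimes removes the cutoff region itself, so a model that relied solely on that single cue would fail in training and is pushed toward more general compression artifacts.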
- [62] arXiv:2407.21553 (cross-list from cs.LG) [pdf, html, other]
-
Title: CXSimulator: A User Behavior Simulation using LLM Embeddings for Web-Marketing Campaign Assessment
Comments: 5 pages, 2 figures, 1 table, the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24)
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
This paper presents the Customer Experience (CX) Simulator, a novel framework designed to assess the effects of untested web-marketing campaigns through user behavior simulations. The proposed framework leverages large language models (LLMs) to represent various events in a user's behavioral history, such as viewing an item, applying a coupon, or purchasing an item, as semantic embedding vectors. We train a model to predict transitions between events from their LLM embeddings, which can even generalize to unseen events by learning from diverse training data. In web-marketing applications, we leverage this transition prediction model to simulate how users might react differently when new campaigns or products are presented to them. This allows us to eliminate the need for costly online testing and enhance the marketers' abilities to reveal insights. Our numerical evaluation and user study, utilizing BigQuery Public Datasets from the Google Merchandise Store, demonstrate the effectiveness of our framework.
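The key idea, predicting transitions between events from their embeddings so that unseen events generalize by similarity, can be sketched with a linear transition map fit by least squares. The event names, two-dimensional "embeddings," and transition data below are toy stand-ins for the LLM embeddings and behavioral logs the paper uses:

```python
import numpy as np

# Toy stand-ins for LLM event embeddings (unit-normalized).
events = {"view": [1.0, 0.0], "coupon": [0.0, 1.0], "purchase": [1.0, 1.0]}
E = {k: np.array(v) / np.linalg.norm(v) for k, v in events.items()}

# Observed (previous event -> next event) pairs from hypothetical logs.
pairs = [("view", "coupon"), ("view", "coupon"), ("coupon", "purchase")]
X = np.stack([E[a] for a, _ in pairs])
Y = np.stack([E[b] for _, b in pairs])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # linear embedding-to-embedding map

def predict_next(event: str) -> str:
    """Map the current event's embedding through W, then return the most
    similar known event -- a minimal sketch of embedding-based transition
    prediction; CXSimulator trains a richer model on real behavior data."""
    target = E[event] @ W
    return max(E, key=lambda k: float(E[k] @ target))
```

Because prediction operates in embedding space, a brand-new campaign event can be scored by embedding its description and reusing the same map, which is what lets the simulator assess untested campaigns without online experiments.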
- [63] arXiv:2407.21611 (cross-list from cs.SD) [pdf, html, other]
-
Title: Enhancing Partially Spoofed Audio Localization with Boundary-aware Attention Mechanism
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
The task of partially spoofed audio localization aims to accurately determine audio authenticity at the frame level. Although some works have achieved encouraging results, utilizing boundary information within a single model remains an unexplored research topic. In this work, we propose a novel method called Boundary-aware Attention Mechanism (BAM). Specifically, it consists of two core modules: Boundary Enhancement and Boundary Frame-wise Attention. The former assembles the intra-frame and inter-frame information to extract discriminative boundary features that are subsequently used for boundary position detection and authenticity decision, while the latter leverages boundary prediction results to explicitly control the feature interaction between frames, which achieves effective discrimination between real and fake frames. Experimental results on the PartialSpoof database demonstrate that our proposed method achieves the best performance. The code is available at this https URL.
- [64] arXiv:2407.21615 (cross-list from cs.SD) [pdf, html, other]
-
Title: Between the AI and Me: Analysing Listeners' Perspectives on AI- and Human-Composed Progressive Metal Music
Comments: Reviewed pre-print accepted for publication at ISMIR 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Generative AI models have recently blossomed, significantly impacting artistic and musical traditions. Research investigating how humans interact with and perceive these models is therefore crucial. Through a listening and reflection study, we explore participants' perspectives on AI- vs human-generated progressive metal, in symbolic format, using rock music as a control group. AI-generated examples were produced by ProgGP, a Transformer-based model. We propose a mixed methods approach to assess the effects of generation type (human vs. AI), genre (progressive metal vs. rock), and curation process (random vs. cherry-picked). This combines quantitative feedback on genre congruence, preference, creativity, consistency, playability, humanness, and repeatability, and qualitative feedback to provide insights into listeners' experiences. A total of 32 progressive metal fans completed the study. Our findings validate the use of fine-tuning to achieve genre-specific specialization in AI music generation, as listeners could distinguish between AI-generated rock and progressive metal. Despite some AI-generated excerpts receiving similar ratings to human music, listeners exhibited a preference for human compositions. Thematic analysis identified key features for genre and AI vs. human distinctions. Finally, we consider the ethical implications of our work in promoting musical data diversity within MIR research by focusing on an under-explored genre.
- [65] arXiv:2407.21646 (cross-list from cs.CL) [pdf, html, other]
-
Title: Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
Comments: Authors are listed in alphabetical order by last name. Demonstrations and human-annotated test sets are available at this https URL
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this paper, we present Cross Language Agent -- Simultaneous Interpretation, CLASI, a high-quality and human-like Simultaneous Speech Translation (SiST) system. Inspired by professional human interpreters, we utilize a novel data-driven read-write strategy to balance the translation quality and latency. To address the challenge of translating in-domain terminologies, CLASI employs a multi-modal retrieving module to obtain relevant information to augment the translation. Supported by LLMs, our approach can generate error-tolerated translation by considering the input audio, historical context, and retrieved information. Experimental results show that our system outperforms other systems by significant margins. Aligned with professional human interpreters, we evaluate CLASI with a better human evaluation metric, valid information proportion (VIP), which measures the amount of information that can be successfully conveyed to the listeners. In real-world scenarios, where the speeches are often disfluent, informal, and unclear, CLASI achieves a VIP of 81.3% and 78.0% for Chinese-to-English and English-to-Chinese translation directions, respectively. In contrast, state-of-the-art commercial or open-source systems only achieve 35.4% and 41.6%. On the extremely hard dataset, where other systems achieve under 13% VIP, CLASI can still achieve 70% VIP.
- [66] arXiv:2407.21658 (cross-list from cs.SD) [pdf, html, other]
-
Title: Beat this! Accurate beat tracking without DBN postprocessing
Comments: Accepted at the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
We propose a system for tracking beats and downbeats with two objectives: generality across a diverse music range, and high accuracy. We achieve generality by training on multiple datasets -- including solo instrument recordings, pieces with time signature changes, and classical music with high tempo variations -- and by removing the commonly used Dynamic Bayesian Network (DBN) postprocessing, which introduces constraints on the meter and tempo. For high accuracy, among other improvements, we develop a loss function tolerant to small time shifts of annotations, and an architecture alternating convolutions with transformers either over frequency or time. Our system surpasses the current state of the art in F1 score despite using no DBN. However, it can still fail, especially for difficult and underrepresented genres, and performs worse on continuity metrics, so we publish our model, code, and preprocessed datasets, and invite others to beat this.
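The shift-tolerant loss idea described in this abstract can be sketched as follows. This is a hypothetical illustration, not the authors' exact formulation: a binary cross-entropy in which each annotated (positive) frame is matched against the maximum prediction within a small tolerance window, so annotations off by a frame or two are not penalized.

```python
import numpy as np

def shift_tolerant_bce(pred, target, tol=1, eps=1e-7):
    """Frame-wise BCE that forgives small time shifts of annotations.

    Positive target frames are scored against the max prediction
    within +/- `tol` frames; negatives use the raw prediction.
    (Illustrative sketch, not the paper's exact loss.)
    """
    pred = np.clip(pred, eps, 1 - eps)
    # max-pool predictions over a window of 2*tol+1 frames
    padded = np.pad(pred, tol, mode="edge")
    pooled = np.array([padded[i:i + 2 * tol + 1].max()
                       for i in range(len(pred))])
    pooled = np.clip(pooled, eps, 1 - eps)
    loss = -(target * np.log(pooled) + (1 - target) * np.log(1 - pred))
    return loss.mean()
```

With `tol=1`, a predicted beat one frame away from the annotated frame incurs almost no positive-class penalty, whereas `tol=0` reduces to ordinary frame-wise BCE.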
- [67] arXiv:2407.21676 (cross-list from cs.RO) [pdf, html, other]
-
Title: Pedestrian Inertial Navigation: An Overview of Model and Data-Driven Approaches
Subjects: Robotics (cs.RO); Signal Processing (eess.SP)
The task of indoor positioning is fundamental to several applications, including navigation, healthcare, location-based services, and security. An emerging field is inertial navigation for pedestrians, which relies only on inertial sensors for positioning. In this paper, we present inertial pedestrian navigation models and learning approaches. Among these are methods and algorithms for shoe-mounted inertial sensors and pedestrian dead reckoning (PDR) with unconstrained inertial sensors. We also address three categories of data-driven PDR strategies: activity-assisted, hybrid approaches, and learning-based frameworks.
- [68] arXiv:2407.21698 (cross-list from math.OC) [pdf, html, other]
-
Title: Long-Term Energy Management for Microgrid with Hybrid Hydrogen-Battery Energy Storage: A Prediction-Free Coordinated Optimization Framework
Comments: Submitted to Applied Energy
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This paper studies the long-term energy management of a microgrid coordinating hybrid hydrogen-battery energy storage. We develop an approximate semi-empirical hydrogen storage model to accurately capture the power-dependent efficiency of hydrogen storage. We introduce a prediction-free two-stage coordinated optimization framework, which generates the annual state-of-charge (SoC) reference for hydrogen storage offline. During online operation, it updates the SoC reference using kernel regression and makes operation decisions based on the proposed adaptive virtual-queue-based online convex optimization (OCO) algorithm. We incorporate penalty terms for long-term pattern tracking and an expert-tracking mechanism for step-size updates. We provide theoretical proof to show that the proposed OCO algorithm achieves a sublinear bound of dynamic regret without using prediction information. Numerical studies based on the Elia and North China datasets show that the proposed framework significantly outperforms the existing online optimization approaches by reducing the operational costs and loss of load by around 30% and 80%, respectively. These benefits can be further enhanced with optimized settings for the penalty coefficient and step size of OCO, as well as more historical references.
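The flavor of a virtual-queue-based OCO decision step can be sketched in a toy scalar form. Everything below (the quadratic reference-tracking drift term, the box-constrained grid search, the queue slack constant) is an assumption for illustration, not the paper's algorithm:

```python
import numpy as np

def oco_virtual_queue_step(price, soc, soc_ref, Q, p_max=1.0,
                           V=1.0, mu=0.5):
    """One illustrative step of a virtual-queue OCO controller.

    Picks a (dis)charge power p in [-p_max, p_max] trading off the
    immediate cost `price * p` against drift from the offline SoC
    reference, weighted by a virtual queue Q that accumulates past
    reference-tracking violations. (Hypothetical sketch.)
    """
    candidates = np.linspace(-p_max, p_max, 201)
    soc_next = soc + candidates                 # toy SoC dynamics
    cost = price * candidates                   # operating cost term
    drift = (soc_next - soc_ref) ** 2           # long-term pattern term
    p = candidates[np.argmin(V * cost + (Q + mu) * drift)]
    soc_new = soc + p
    # virtual-queue update with a small slack tolerance (assumed)
    Q_new = max(0.0, Q + (soc_new - soc_ref) ** 2 - 0.1)
    return p, soc_new, Q_new
```

A growing queue Q progressively forces the online decisions back toward the offline SoC reference, which is the mechanism behind the sublinear dynamic-regret guarantee mentioned above.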
- [69] arXiv:2407.21739 (cross-list from cs.CV) [pdf, html, other]
-
Title: A Federated Learning-Friendly Approach for Parameter-Efficient Fine-Tuning of SAM in 3D Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Adapting foundation models for medical image analysis requires finetuning them on a considerable amount of data because of extreme distribution shifts between natural (source) data used for pretraining and medical (target) data. However, collecting task-specific medical data for such finetuning at a central location raises many privacy concerns. Although Federated learning (FL) provides an effective means for training on private decentralized data, communication costs in federating large foundation models can quickly become a significant bottleneck, impacting the solution's scalability. In this work, we address this problem of efficient communication while ensuring effective learning in FL by combining the strengths of Parameter-Efficient Fine-tuning (PEFT) with FL. Specifically, we study plug-and-play Low-Rank Adapters (LoRA) in a federated manner to adapt the Segment Anything Model (SAM) for 3D medical image segmentation. Unlike prior works that utilize LoRA and finetune the entire decoder, we critically analyze the contribution of each granular component of SAM on finetuning performance. Thus, we identify specific layers to be federated that are very efficient in terms of communication cost while producing on-par accuracy. Our experiments show that retaining the parameters of the SAM model (including most of the decoder) in their original state during adaptation is beneficial because fine-tuning on small datasets tends to distort the inherent capabilities of the underlying foundation model. On Fed-KiTS, our approach decreases communication cost (~48x) compared to full fine-tuning while increasing performance (~6% Dice score) in 3D segmentation tasks. Our approach performs similar to SAMed while achieving ~2.8x reduction in communication and parameters to be finetuned. We further validate our approach with experiments on Fed-IXI and Prostate MRI datasets.
Cross submissions for Thursday, 1 August 2024 (showing 30 of 30 entries)
- [70] arXiv:1908.04596 (replaced) [pdf, html, other]
-
Title: A Simulative Study on Active Disturbance Rejection Control (ADRC) as a Control Tool for Practitioners
Journal-ref: Electronics, vol. 2, no. 3, pp. 246-279, Aug. 2013
Subjects: Systems and Control (eess.SY)
As an alternative to both classical PID-type and modern model-based approaches to solving control problems, active disturbance rejection control (ADRC) has gained significant traction in recent years. With its simple tuning method and robustness against process parameter variations, it puts itself forward as a valuable addition to the toolbox of control engineering practitioners. This article aims at providing a single-source introduction and reference to linear ADRC with this audience in mind. A simulative study is carried out using generic first- and second-order plants to enable a quick visual assessment of the abilities of ADRC. Finally, a modified form of the discrete-time case is introduced to speed up real-time implementations as necessary in applications with high dynamic requirements.
- [71] arXiv:2011.01044 (replaced) [pdf, html, other]
-
Title: Transfer Function Analysis and Implementation of Active Disturbance Rejection Control
Comments: 14 pages, 9 figures
Journal-ref: Control Theory and Technology (19), 19-34 (2021)
Subjects: Systems and Control (eess.SY)
To support the adoption of active disturbance rejection control (ADRC) in industrial practice, this article aims at improving both understanding and implementation of ADRC using traditional means, in particular via transfer functions and a frequency-domain view. Firstly, to enable an immediate comparability with existing classical control solutions, a realizable transfer function implementation of continuous-time linear ADRC is introduced. Secondly, a frequency-domain analysis of ADRC components, performance, parameter sensitivity, and tuning method is performed. Finally, an exact implementation of discrete-time ADRC using transfer functions is introduced for the first time, with special emphasis on practical aspects such as computational efficiency, low parameter footprint, and windup protection.
- [72] arXiv:2110.12509 (replaced) [pdf, other]
-
Title: U-Net-based Lung Thickness Map for Pixel-level Lung Volume Estimation of Chest X-rays
Authors: Tina Dorosti, Manuel Schultheiss, Philipp Schmette, Jule Heuchert, Johannes Thalhammer, Florian Schaff, Thorsten Sellerer, Rafael Schick, Kirsten Taphorn, Korbinian Mechlem, Lorenz Birnbacher, Franz Pfeiffer, Daniela Pfeiffer
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Purpose: We aimed to estimate the total lung volume (TLV) from real and synthetic frontal X-ray radiographs on a pixel level using lung thickness maps generated by a U-Net.
Methods: 5,959 thorax X-ray computed tomography (CT) scans were retrieved from two publicly available datasets of the lung nodule analysis 2016 (n=656) and the RSNA pulmonary embolism detection challenge 2020 (n=5,303). Additionally, thorax CT scans from 72 subjects (33 healthy: 20 men, mean age [range] = 62.4 [34, 80]; 39 suffering from chronic obstructive pulmonary disease: 25 men, mean age [range] = 69.0 [47, 91]) were retrospectively selected (10.2018-12.2019) from our in-house dataset such that for each subject, a frontal chest X-ray radiograph no older than seven days was available. All CT scans and their corresponding lung segmentation were forward projected using a simulated X-ray spectrum to generate synthetic radiographs and lung thickness maps, respectively. A U-Net model was trained and tested on synthetic radiographs from the public datasets to predict lung thickness maps and consequently estimate TLV. Model performance was further assessed by evaluating the TLV estimations for the in-house synthetic and real radiograph pairs using Pearson correlation coefficient (r) and significance testing.
Results: Strong correlations were measured between the predicted and CT-derived ground truth TLV values for test data from synthetic ($n_{Public}$=1,191, r=0.987, P < 0.001; $n_{In-house}$=72, r=0.973, P < 0.001) and real radiographs (n=72, r=0.908, P < 0.001).
Conclusion: TLV from U-Net-generated pixel-level lung thickness maps was successfully estimated for synthetic and real radiographs.
- [73] arXiv:2311.02911 (replaced) [pdf, html, other]
-
Title: Goal-Oriented Wireless Communication Resource Allocation for Cyber-Physical Systems
Comments: For a revised version and its published version refer to IEEE TWC of DOI: https://doi.org/10.1109/TWC.2024.3432918
Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
The proliferation of novel industrial applications at the wireless edge, such as smart grids and vehicle networks, demands the advancement of cyber-physical systems (CPSs). The performance of CPSs is closely linked to the last-mile wireless communication networks, which often become bottlenecks due to their inherent limited resources. Current CPS operations often treat wireless communication networks as unpredictable and uncontrollable variables, ignoring the potential adaptability of wireless networks, which results in inefficient and overly conservative CPS operations. Meanwhile, current wireless communications often focus more on throughput and other transmission-related metrics instead of CPS goals. In this study, we introduce the framework of goal-oriented wireless communication resource allocations, accounting for the semantics and significance of data for CPS operation goals. This guarantees optimal CPS performance from a cybernetic standpoint. We formulate a bandwidth allocation problem aimed at maximizing the information utility gain of transmitted data brought to CPS operation goals. Since the goal-oriented bandwidth allocation problem is a large-scale combinatorial problem, we propose a divide-and-conquer and greedy solution algorithm. The information utility gain is first approximately decomposed into marginal information utility gains and computed in a parallel manner. Subsequently, the bandwidth allocation problem is reformulated as a knapsack problem, which can be further solved greedily with a guaranteed sub-optimality gap. We further demonstrate how our proposed goal-oriented bandwidth allocation algorithm can be applied in four potential CPS applications, including data-driven decision-making, edge learning, federated learning, and distributed optimization.
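The greedy knapsack step described in this abstract can be sketched as follows. The function and variable names are assumptions for illustration; the paper's exact decomposition of marginal utilities is not reproduced here:

```python
def greedy_bandwidth_allocation(utilities, demands, budget):
    """Greedy 0/1 knapsack heuristic for goal-oriented bandwidth
    allocation (illustrative sketch of the abstract's idea).

    `utilities[i]` is the approximate marginal information utility
    gain of transmitting item i, `demands[i]` its bandwidth demand.
    Items are picked by utility density until the budget runs out.
    """
    order = sorted(range(len(utilities)),
                   key=lambda i: utilities[i] / demands[i],
                   reverse=True)
    chosen, used, gain = [], 0.0, 0.0
    for i in order:
        if used + demands[i] <= budget:
            chosen.append(i)
            used += demands[i]
            gain += utilities[i]
    return chosen, gain
```

Sorting by utility density is the standard greedy heuristic for knapsack problems; it is this kind of greedy selection that admits the bounded sub-optimality gap the abstract mentions.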
- [74] arXiv:2401.17841 (replaced) [pdf, html, other]
-
Title: Stimulus-Informed Generalized Canonical Correlation Analysis for Group Analysis of Neural Responses to Natural Stimuli
Comments: 14 pages, 16 figures
Subjects: Signal Processing (eess.SP)
Various new brain-computer interface technologies or neuroscience applications require decoding stimulus-following neural responses to natural stimuli such as speech and video from, e.g., electroencephalography (EEG) signals. In this context, generalized canonical correlation analysis (GCCA) is often used as a group analysis technique, which allows the extraction of correlated signal components from the neural activity of multiple subjects attending to the same stimulus. GCCA can be used to improve the signal-to-noise ratio of the stimulus-following neural responses relative to all other irrelevant (non-)neural activity, or to quantify the correlated neural activity across multiple subjects in a group-wise coherence metric. However, the traditional GCCA technique is stimulus-unaware: no information about the stimulus is used to estimate the correlated components from the neural data of several subjects. Therefore, the GCCA technique might fail to extract relevant correlated signal components in practical situations where the amount of information is limited, for example, because of a limited amount of training data or group size. This motivates a new stimulus-informed GCCA (SI-GCCA) framework that allows taking the stimulus into account to extract the correlated components. We show that SI-GCCA outperforms GCCA in various practical settings, for both auditory and visual stimuli. Moreover, we showcase how SI-GCCA can be used to steer the estimation of the components towards the stimulus. As such, SI-GCCA substantially improves upon GCCA for various purposes, ranging from preprocessing to quantifying attention.
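For reference, the stimulus-unaware baseline that SI-GCCA builds on can be sketched in its standard MAXVAR formulation (this is the textbook GCCA construction, not code from the paper):

```python
import numpy as np

def gcca_maxvar(X_list, n_comp=1):
    """MAXVAR generalized CCA: extract components shared across
    subjects (standard formulation; SI-GCCA adds stimulus
    information on top of this).

    Each X in X_list is a (time x channels) matrix for one subject.
    Returns shared components S (time x n_comp) and per-subject
    decoders W_k with X_k @ W_k ~= S.
    """
    T = X_list[0].shape[0]
    # sum of orthogonal projectors onto each subject's column space
    P = np.zeros((T, T))
    for X in X_list:
        P += X @ np.linalg.pinv(X)
    # shared components: leading eigenvectors of the summed projector
    vals, vecs = np.linalg.eigh(P)
    S = vecs[:, -n_comp:][:, ::-1]
    W = [np.linalg.pinv(X) @ S for X in X_list]
    return S, W
```

Directions present in every subject's data accumulate an eigenvalue close to the number of subjects, while subject-specific noise directions do not, which is why the leading eigenvectors recover the stimulus-following activity.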
- [75] arXiv:2402.03935 (replaced) [pdf, other]
-
Title: On the Accuracy of Phase Extraction from a Known-Frequency Noisy Sinusoidal Signal
Comments: 30p, 11figs, initially submitted to AICSP the 19th December 2023, Updated 2024/02/07: added watermark, Updated 2024/03/26: no response from AICSP despite several follow-ups, withdrawn and submitted it to IEEE TSP, Updated 2024/07/31: IEEE editor thought that this was "old research" (worst argument ever), and rejected it without sending it to peer-review; we thus submitted it to the APSIPA TSIP
Subjects: Signal Processing (eess.SP)
Accurate phase extraction from sinusoidal signals is a crucial task in various signal processing applications. While prior research predominantly addresses the case of asynchronous sampling with unknown signal frequency, this study focuses on the more specific situation where synchronous sampling is possible, and the signal's frequency is known. In this framework, a comprehensive analysis of phase estimation accuracy in the presence of both additive and phase noises is presented. A closed-form expression for the asymptotic Probability Density Function (PDF) of the resulting phase estimator is validated by simulations depicting Root Mean Square Error (RMSE) trends in different noise scenarios. This estimator is asymptotically efficient, converging rapidly to its Cramér-Rao Lower Bound (CRLB). Three distinct RMSE behaviours were identified based on SNR, sample count (N), and noise level: (i) saturation towards a random guess at low Signal to Noise Ratio (SNR), (ii) linear decrease with the square roots of N and SNR at moderate noise levels, and (iii) saturation at high SNR towards a noise floor dependent on the phase noise level. By quantifying the impact of sample count, additive noise, and phase noise on phase estimation accuracy, this work provides valuable insights for designing systems requiring precise phase extraction, such as phase-based fluorescence assays or system identification.
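The class of estimator analysed here can be illustrated with the standard IQ-correlation phase estimator for a sinusoid of known frequency (a textbook construction used as a sketch, not taken from the paper):

```python
import numpy as np

def estimate_phase(y, f, fs):
    """Phase estimate for y[n] = A*cos(2*pi*f*n/fs + phi) + noise,
    with f known and synchronous sampling over an integer number of
    periods. Standard IQ correlation; maximum-likelihood under white
    Gaussian noise.
    """
    n = np.arange(len(y))
    i = np.sum(y * np.cos(2 * np.pi * f * n / fs))  # in-phase sum ~ (N/2) A cos(phi)
    q = np.sum(y * np.sin(2 * np.pi * f * n / fs))  # quadrature sum ~ -(N/2) A sin(phi)
    return np.arctan2(-q, i)
```

Over an integer number of periods the cross terms cancel exactly, so the estimate is unbiased for noiseless data; the RMSE regimes discussed above (random-guess saturation, 1/sqrt(N*SNR) decrease, phase-noise floor) describe how this estimator degrades as noise is added.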
- [76] arXiv:2402.12539 (replaced) [pdf, html, other]
-
Title: Impact of data for forecasting on performance of model predictive control in buildings with smart energy storage
Authors: Max Langtry, Vijja Wichitwechkarn, Rebecca Ward, Chaoqun Zhuang, Monika J. Kreitmair, Nikolas Makasis, Zack Xuereb Conti, Ruchi Choudhary
Comments: 36 pages, 22 figures
Journal-ref: Energy and Buildings (2024)
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Data is required to develop forecasting models for use in Model Predictive Control (MPC) schemes in building energy systems. However, data is costly to both collect and exploit. Determining cost optimal data usage strategies requires understanding of the forecast accuracy and resulting MPC operational performance it enables. This study investigates the performance of both simple and state-of-the-art machine learning prediction models for MPC in multi-building energy systems using a simulated case study with historic building energy data. The impact on forecast accuracy of measures to improve model data efficiency are quantified, specifically for: reuse of prediction models, reduction of training data duration, reduction of model data features, and online model training. A simple linear multi-layer perceptron model is shown to provide equivalent forecast accuracy to state-of-the-art models, with greater data efficiency and generalisability. The use of more than 2 years of training data for load prediction models provided no significant improvement in forecast accuracy. Forecast accuracy and data efficiency were improved simultaneously by using change-point analysis to screen training data. Reused models and those trained with 3 months of data had on average 10% higher error than baseline, indicating that deploying MPC systems without prior data collection may be economic.
- [77] arXiv:2403.05245 (replaced) [pdf, html, other]
-
Title: Noise Level Adaptive Diffusion Model for Robust Reconstruction of Accelerated MRI
Authors: Shoujin Huang, Guanxiong Luo, Xi Wang, Ziran Chen, Yuwan Wang, Huaishui Yang, Pheng-Ann Heng, Lingyan Zhang, Mengye Lyu
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
In general, diffusion model-based MRI reconstruction methods incrementally remove artificially added noise while imposing data consistency to reconstruct the underlying images. However, real-world MRI acquisitions already contain inherent noise due to thermal fluctuations. This phenomenon is particularly notable when using ultra-fast, high-resolution imaging sequences for advanced research, or using low-field systems favored by low- and middle-income countries. These common scenarios can lead to sub-optimal performance or complete failure of existing diffusion model-based reconstruction techniques. Specifically, as the artificially added noise is gradually removed, the inherent MRI noise becomes increasingly pronounced, making the actual noise level inconsistent with the predefined denoising schedule and consequently leading to inaccurate image reconstruction. To tackle this problem, we propose a posterior sampling strategy with a novel NoIse Level Adaptive Data Consistency (Nila-DC) operation. Extensive experiments are conducted on two public datasets and an in-house clinical dataset with field strength ranging from 0.3T to 3T, showing that our method surpasses the state-of-the-art MRI reconstruction methods, and is highly robust against various noise levels. The code for Nila is available at this https URL.
- [78] arXiv:2404.10892 (replaced) [pdf, html, other]
-
Title: Automatic classification of prostate MR series type using image content and metadata
Authors: Deepa Krishnaswamy, Bálint Kovács, Stefan Denner, Steve Pieper, David Clunie, Christopher P. Bridge, Tina Kapur, Klaus H. Maier-Hein, Andrey Fedorov
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
With the wealth of medical image data, efficient curation is essential. Assigning the sequence type to magnetic resonance images is necessary for scientific studies and artificial intelligence-based analysis. However, incomplete or missing metadata prevents effective automation. We therefore propose a deep-learning method for classification of prostate cancer scanning sequences based on a combination of image data and DICOM metadata. We demonstrate superior results compared to metadata or image data alone, and make our code publicly available at this https URL.
- [79] arXiv:2404.11889 (replaced) [pdf, html, other]
-
Title: Multi-view X-ray Image Synthesis with Multiple Domain Disentanglement from CT Scans
Authors: Lixing Tan, Shuang Song, Kangneng Zhou, Chengbo Duan, Lanying Wang, Huayang Ren, Linlin Liu, Wei Zhang, Ruoxiu Xiao
Comments: 13 pages, 10 figures, ACM MM2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
X-ray images play a vital role in intraoperative processes due to their high resolution and fast imaging speed, and greatly facilitate subsequent segmentation, registration, and reconstruction. However, excessive X-ray dosage poses potential risks to human health. Data-driven algorithms mapping volume scans to X-ray images are restricted by the scarcity of paired X-ray and volume data. Existing methods are mainly realized by modelling the whole X-ray imaging procedure. In this study, we propose a learning-based approach termed CT2X-GAN to synthesize X-ray images in an end-to-end manner using content and style disentanglement from three different image domains. Our method decouples the anatomical structure information from CT scans and style information from unpaired real X-ray images/digital reconstructed radiography (DRR) images via a series of decoupling encoders. Additionally, we introduce a novel consistency regularization term to improve the stylistic resemblance between synthesized X-ray images and real X-ray images. Meanwhile, we also impose a supervised process by computing the similarity of computed real DRR and synthesized DRR images. We further develop a pose attention module to fully strengthen the comprehensive information in the decoupled content code from CT scans, facilitating high-quality multi-view image synthesis in the lower 2D space. Extensive experiments were conducted on the publicly available CTSpine1K dataset and achieved 97.8350, 0.0842 and 3.0938 in terms of FID, KID and a user-scored X-ray similarity metric, respectively. In comparison with 3D-aware methods ($\pi$-GAN, EG3D), CT2X-GAN is superior in synthesis quality and realism relative to real X-ray images.
- [80] arXiv:2404.19356 (replaced) [pdf, other]
-
Title: A Concept for Semi-Automatic Configuration of Sufficiently Valid Simulation Setups for Automated Driving Systems
Comments: 8 pages, 3 figures. Accepted to be published in 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), Edmonton, Canada, September 24-27, 2024
Subjects: Systems and Control (eess.SY)
As simulation is increasingly used in scenario-based approaches to test Automated Driving Systems, the credibility of simulation results is a major concern. Arguably, credibility depends on the validity of the simulation setup and simulation models. When selecting appropriate simulation models, a trade-off must be made between validity, often connected to the model's fidelity, and cost of computation. However, due to the large number of test cases, expert-based methods to create sufficiently valid simulation setups seem infeasible. We propose using design contracts in order to semi-automatically compose simulation setups for given test cases from simulation models and to derive requirements for the simulation models, supporting separation of concerns between simulation model developers and users. Simulation model contracts represent their validity domains by capturing a validity guarantee and the associated operating conditions in an assumption. We then require the composition of the simulation model contracts to refine a test case contract. The latter contract captures the operating conditions of the test case in its assumption and validity requirements in its guarantee. Based on this idea, we present a framework that supports the compositional configuration of simulation setups based on the contracts and a method to derive runtime monitors for these simulation setups.
- [81] arXiv:2406.12828 (replaced) [pdf, html, other]
-
Title: Feasibility of Non-Line-of-Sight Integrated Sensing and Communication at mmWave
Comments: 5 pages, accepted for publication at the 25th IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC)
Subjects: Signal Processing (eess.SP)
One rarely addressed direction in the context of Integrated Sensing and Communication (ISAC) is non-line-of-sight (NLOS) sensing, with the potential to enable use cases like intrusion detection and to increase the value that wireless networks can bring. However, ISAC networks impose challenges for sensing due to their communication-oriented design. For instance, time division duplex transmission creates spectral holes in time, resulting in spectral replicas in the radar image. To counteract this, we evaluate different channel state information processing strategies and discuss their tradeoffs. We further propose an ensemble of techniques to detect targets in NLOS conditions. Our approaches are validated with experiments using a millimeter wave ISAC proof of concept in a factory-like environment. The results show that target detection in NLOS is generally possible with ISAC.
- [82] arXiv:2406.18306 (replaced) [pdf, html, other]
-
Title: Neural Network-Based IRS Assisted NLoS DoA Estimation
Subjects: Signal Processing (eess.SP)
This paper presents a learning-based approach for Direction of Arrival (DoA) estimation using an Intelligent Reflecting Surface (IRS) or Reconfigurable Intelligent Surface (RIS) under a Non-Line-of-Sight (NLoS) scenario. The key innovation is the employment of a novel Neural Network (NN) layer, referred to as the NN-based RIS layer, within a generic network structure.
The NN-based RIS layer is designed to learn the optimal RIS phase shifts that are tailored for the DoA estimation task. To achieve this, the pre-processed real-valued observations are fed into the RIS layer, which has a specialized structure. Unlike regular neural network layers, the weights of the NN-based RIS layer are constrained to be sinusoidal functions, with the phase arguments being the tunable parameters during the training process. This allows the layer to emulate the functionality of an RIS.
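The core idea of such a layer, weights constrained to sinusoids with trainable phase arguments, can be sketched in a few lines. The layer shape, objective, and update rule below are assumptions for illustration, not the paper's design:

```python
import numpy as np

class RISLayer:
    """Sketch of an RIS-like neural layer: weights are constrained
    to cos(theta_i), and only the phases theta_i are trained
    (illustrative, not the paper's exact layer).
    """
    def __init__(self, n, rng=None):
        self.theta = (rng or np.random.default_rng(0)).uniform(0, 2 * np.pi, n)

    def forward(self, x):
        # element-wise sinusoidal weights emulating RIS phase control
        return x * np.cos(self.theta)

    def backward_step(self, x, grad_out, lr=0.1):
        # chain rule through the sinusoidal constraint:
        # d(out)/d(theta) = -x * sin(theta)
        grad_theta = grad_out * (-x * np.sin(self.theta))
        self.theta -= lr * grad_theta
```

Because gradients flow through the sinusoid rather than through free weights, the learned parameters remain valid phase configurations at every training step, which is the point of building the RIS constraint into the layer.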
Accordingly, the standard feed-forward and back-propagation procedures are modified to accommodate the unique structure of the NN-based RIS layer. Numerical simulations demonstrate that the proposed machine learning-based approach outperforms conventional non-learning-based methods for DoA estimation under most practical SNR ranges in an RIS-assisted scheme, and also shows better tracking capability.
- [83] arXiv:2407.04822 (replaced) [pdf, html, other]
-
Title: YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation
Comments: 2024 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Sept. 22-25, 2024, London, UK
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We enhance its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts. To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors. Benchmarks across ten public datasets show our models' competitiveness with, or superiority to, existing transcription models. Further testing on pop music recordings highlights the limitations of current models. Fully reproducible code and datasets are available with demos at \url{this https URL}.
- [84] arXiv:2407.11087 (replaced) [pdf, html, other]
-
Title: Restore-RWKV: Efficient and Effective Medical Image Restoration with RWKV
Comments: This paper introduces the first RWKV-based model for image restoration
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Transformers have revolutionized medical image restoration, but the quadratic complexity still poses limitations for their application to high-resolution medical images. The recent advent of RWKV in the NLP field has attracted much attention as it can process long sequences efficiently. To leverage its advanced design, we propose Restore-RWKV, the first RWKV-based model for medical image restoration. Since the original RWKV model is designed for 1D sequences, we make two necessary modifications for modeling spatial relations in 2D images. First, we present a recurrent WKV (Re-WKV) attention mechanism that captures global dependencies with linear computational complexity. Re-WKV incorporates bidirectional attention as a basis for a global receptive field and recurrent attention to effectively model 2D dependencies from various scan directions. Second, we develop an omnidirectional token shift (Omni-Shift) layer that enhances local dependencies by shifting tokens from all directions and across a wide context range. These adaptations make the proposed Restore-RWKV an efficient and effective model for medical image restoration. Extensive experiments demonstrate that Restore-RWKV achieves superior performance across various medical image restoration tasks, including MRI image super-resolution, CT image denoising, PET image synthesis, and all-in-one medical image restoration. Code is available at: \href{this https URL}{this https URL}.
- [85] arXiv:2407.11865 (replaced) [pdf, html, other]
-
Title: Novel Hybrid Integrated Pix2Pix and WGAN Model with Gradient Penalty for Binary Images DenoisingComments: Systems and Soft ComputingSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
This paper introduces a novel approach to image denoising that leverages the advantages of Generative Adversarial Networks (GANs). Specifically, we propose a model that combines elements of the Pix2Pix model and the Wasserstein GAN (WGAN) with Gradient Penalty (WGAN-GP). This hybrid framework seeks to capitalize on the denoising capabilities of conditional GANs, as demonstrated in the Pix2Pix model, while mitigating the need for an exhaustive search for optimal hyperparameters that could potentially ruin the stability of the learning process. In the proposed method, the GAN's generator is employed to produce denoised images, harnessing the power of a conditional GAN for noise reduction. Simultaneously, the implementation of the Lipschitz continuity constraint during updates, as featured in WGAN-GP, aids in reducing susceptibility to mode collapse. This innovative design allows the proposed model to benefit from the strong points of both Pix2Pix and WGAN-GP, generating superior denoising results while ensuring training stability. Drawing on previous work on image-to-image translation and GAN stabilization techniques, the proposed research highlights the potential of GANs as a general-purpose solution for denoising. The paper details the development and testing of this model, showcasing its effectiveness through numerical experiments. The dataset was created by adding synthetic noise to clean images. Numerical results based on real-world dataset validation underscore the efficacy of this approach in image-denoising tasks, exhibiting significant enhancements over traditional techniques. Notably, the proposed model demonstrates strong generalization capabilities, performing effectively even when trained with synthetic noise.
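As a rough illustration of the WGAN-GP term this abstract relies on: the penalty pushes the critic's input-gradient norm toward one. Below is a minimal pure-Python sketch for a linear critic, whose gradient is constant, so the term can be evaluated in closed form; the critic weights and the coefficient `lam=10` are illustrative choices, not values from the paper.

```python
import math
import random

def gradient_penalty_linear(w, lam=10.0):
    """WGAN-GP penalty lam * (||grad_x f(x)|| - 1)^2 for a linear
    critic f(x) = w . x, whose input gradient is w everywhere."""
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    return lam * (grad_norm - 1.0) ** 2

# WGAN-GP evaluates the penalty at random interpolates between real
# and fake samples; for a linear critic the gradient (and hence the
# penalty) is identical at every interpolate.
real, fake = [1.0, 0.0], [0.0, 1.0]
eps = random.random()
x_hat = [eps * r + (1 - eps) * f for r, f in zip(real, fake)]

w = [3.0, 4.0]                          # critic weights (illustrative)
penalty = gradient_penalty_linear(w)    # ||w|| = 5 -> 10 * (5 - 1)^2 = 160
```

A critic whose gradient already has unit norm incurs essentially zero penalty, which is the Lipschitz behavior the paper leverages for training stability.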
- [86] arXiv:2407.12038 (replaced) [pdf, html, other]
-
Title: ICAGC 2024: Inspirational and Convincing Audio Generation Challenge 2024Ruibo Fu, Rui Liu, Chunyu Qiang, Yingming Gao, Yi Lu, Shuchen Shi, Tao Wang, Ya Li, Zhengqi Wen, Chen Zhang, Hui Bu, Yukun Liu, Xin Qi, Guanjun LiComments: ISCSLP 2024 Challenge description and resultsSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
The Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC 2024) is part of the ISCSLP 2024 Competitions and Challenges track. While current text-to-speech (TTS) technology can generate high-quality audio, its ability to convey complex emotions and controlled detail content remains limited. This constraint leads to a discrepancy between the generated audio and human subjective perception in practical applications like companion robots for children and marketing bots. The core issue lies in the inconsistency between high-quality audio generation and the ultimate human subjective experience. Therefore, this challenge aims to enhance the persuasiveness and acceptability of synthesized audio, focusing on human-aligned convincing and inspirational audio generation. A total of 19 teams registered for the challenge, and the results of the competition are described in this paper.
- [87] arXiv:2407.12708 (replaced) [pdf, html, other]
-
Title: An Approximation for the 32-point Discrete Fourier TransformComments: Corrected a typo in (4). 8 pages, 2 tablesSubjects: Signal Processing (eess.SP); Numerical Analysis (math.NA); Methodology (stat.ME)
This brief note aims at condensing some results on the 32-point approximate DFT and discussing its arithmetic complexity.
- [88] arXiv:2407.15689 (replaced) [pdf, html, other]
-
Title: Pediatric Wrist Fracture Detection in X-rays via YOLOv10 Algorithm and Dual Label Assignment SystemSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Wrist fractures are highly prevalent among children and can significantly impact their daily activities, such as attending school, participating in sports, and performing basic self-care tasks. If not treated properly, these fractures can result in chronic pain, reduced wrist functionality, and other long-term complications. Recently, advancements in object detection have shown promise in enhancing fracture detection, with systems achieving accuracy comparable to, or even surpassing, that of human radiologists. The YOLO series, in particular, has demonstrated notable success in this domain. This study is the first to provide a thorough evaluation of various YOLOv10 variants to assess their performance in detecting pediatric wrist fractures using the GRAZPEDWRI-DX dataset. It investigates how changes in model complexity, scaling the architecture, and implementing a dual-label assignment strategy can enhance detection performance. Experimental results indicate that our trained model achieved a mean average precision (mAP@50-95) of 51.9\%, surpassing the current YOLOv9 benchmark of 43.3\% on this dataset, an improvement of 8.6 percentage points. The implementation code is publicly available at this https URL
- [89] arXiv:2407.20198 (replaced) [pdf, html, other]
-
Title: SpaER: Learning Spatio-temporal Equivariant Representations for Fetal Brain Motion TrackingComments: 11 pages, 3 figures, Medical Image Computing and Computer Assisted Interventions (MICCAI) Workshop on Perinatal Imaging, Placental and Preterm Image analysis (PIPPI) 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In this paper, we introduce SpaER, a pioneering method for fetal motion tracking that leverages equivariant filters and self-attention mechanisms to effectively learn spatio-temporal representations. Different from conventional approaches that statically estimate fetal brain motions from pairs of images, our method dynamically tracks the rigid movement patterns of the fetal head across temporal and spatial dimensions. Specifically, we first develop an equivariant neural network that efficiently learns rigid motion sequences through low-dimensional spatial representations of images. Subsequently, we learn spatio-temporal representations by incorporating time encoding and self-attention neural network layers. This approach allows for the capture of long-term dependencies of fetal brain motion and addresses alignment errors due to contrast changes and severe motion artifacts. Our model also provides a geometric deformation estimation that properly addresses image distortions among all time frames. To the best of our knowledge, our approach is the first to learn spatio-temporal representations via deep neural networks for fetal motion tracking without data augmentation. We validated our model using real fetal echo-planar images with simulated and real motions. Our method has significant potential for accurately measuring, tracking, and correcting fetal motion in fetal MRI sequences.
- [90] arXiv:2407.20532 (replaced) [pdf, html, other]
-
Title: Scalable Synthesis of Formally Verified Neural Value Function for Hamilton-Jacobi Reachability AnalysisSubjects: Systems and Control (eess.SY)
Hamilton-Jacobi (HJ) reachability analysis provides a formal method for guaranteeing safety in constrained control problems. It synthesizes a value function to represent a long-term safe set called the feasible region. Early synthesis methods based on state space discretization cannot scale to high-dimensional problems, while recent methods that use neural networks to approximate value functions result in unverifiable feasible regions. To achieve both scalability and verifiability, we propose a framework for synthesizing verified neural value functions for HJ reachability analysis. Our framework consists of three stages: pre-training, adversarial training, and verification-guided training. We design three techniques to address the corresponding scalability challenges: boundary-guided backtracking (BGB) to improve counterexample search efficiency, entering state regularization (ESR) to enlarge the feasible region, and activation pattern alignment (APA) to accelerate neural network verification. We also provide a neural safety certificate synthesis and verification benchmark called Cersyve-9, which includes nine commonly used safe control tasks and supplements existing neural network verification benchmarks. Our framework successfully synthesizes verified neural value functions on all tasks, and our proposed three techniques exhibit superior scalability and efficiency compared with existing methods.
- [91] arXiv:2012.02134 (replaced) [pdf, html, other]
-
Title: K-Deep Simplex: Deep Manifold Learning via Local DictionariesComments: 33 pages, 17 figures. This expanded version includes detailed numerical experiments in the supplementary material. Theorem 3 is a new stability result. The sections have been reorganized, and additional details have been provided for claritySubjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Optimization and Control (math.OC)
We propose K-Deep Simplex (KDS) which, given a set of data points, learns a dictionary comprising synthetic landmarks, along with representation coefficients supported on a simplex. KDS employs a local weighted $\ell_1$ penalty that encourages each data point to represent itself as a convex combination of nearby landmarks. We solve the proposed optimization program using alternating minimization and design an efficient, interpretable autoencoder using algorithm unrolling. We theoretically analyze the proposed program by relating the weighted $\ell_1$ penalty in KDS to a weighted $\ell_0$ program. Assuming that the data are generated from a Delaunay triangulation, we prove the equivalence of the weighted $\ell_1$ and weighted $\ell_0$ programs. We further show the stability of the representation coefficients under mild geometrical assumptions. If the representation coefficients are fixed, we prove that the sub-problem of minimizing over the dictionary yields a unique solution. Further, we show that low-dimensional representations can be efficiently obtained from the covariance of the coefficient matrix. Experiments show that the algorithm is highly efficient and performs competitively on synthetic and real data sets.
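To make the locally weighted penalty concrete: with coefficients on the simplex, the $\ell_1$ norm of $w$ becomes a weighted sum, so penalizing each coefficient by its landmark's distance to the data point favors nearby landmarks. The sketch below assumes the locality weight is the squared distance to each landmark (the paper's exact weighting may differ) and uses made-up landmarks and $\lambda$.

```python
def kds_objective(x, landmarks, w, lam=1.0):
    """Reconstruction error plus a locally weighted l1 penalty that
    pushes the mass of w onto landmarks near x. w must lie on the
    simplex (w_j >= 0, sum_j w_j = 1)."""
    assert all(wj >= 0 for wj in w) and abs(sum(w) - 1.0) < 1e-9
    dim = len(x)
    # Convex combination of landmarks: x_hat = sum_j w_j * a_j
    x_hat = [sum(w[j] * landmarks[j][i] for j in range(len(w)))
             for i in range(dim)]
    recon = sum((x[i] - x_hat[i]) ** 2 for i in range(dim))
    # Locality weights: squared distance from x to each landmark
    penalty = sum(w[j] * sum((x[i] - landmarks[j][i]) ** 2 for i in range(dim))
                  for j in range(len(w)))
    return recon + lam * penalty

landmarks = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # synthetic (illustrative)
x = [0.5, 0.0]                                    # midpoint of the first two
# Representing x by its two nearest landmarks is cheap ...
near = kds_objective(x, landmarks, [0.5, 0.5, 0.0])
# ... while putting all the mass on the far landmark costs more.
far = kds_objective(x, landmarks, [0.0, 0.0, 1.0])
```

The gap between the two objective values is what drives each point toward a sparse, local representation.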
- [92] arXiv:2303.06130 (replaced) [pdf, html, other]
-
Title: Full State Estimation of Continuum Robots from Tip Velocities: A Cosserat-Theoretic Boundary ObserverSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
State estimation of robotic systems is essential to implementing feedback controllers, which usually provide better robustness to modeling uncertainties than open-loop controllers. However, state estimation of soft robots is very challenging because soft robots have theoretically infinite degrees of freedom while existing sensors only provide a limited number of discrete measurements. This work focuses on soft robotic manipulators, also known as continuum robots. We design an observer algorithm based on the well-known Cosserat rod theory, which models continuum robots by nonlinear partial differential equations (PDEs) evolving in geometric Lie groups. The observer can estimate all infinite-dimensional continuum robot states, including poses, strains, and velocities, by only sensing the tip velocity of the continuum robot, and hence it is called a ``boundary'' observer. More importantly, the estimation error dynamics is formally proven to be locally input-to-state stable. The key idea is to inject sequential tip velocity measurements into the observer in a way that dissipates the energy of the estimation errors through the boundary. The distinct advantage of this PDE-based design is that it can be implemented using any existing numerical implementation for Cosserat rod models. All theoretical convergence guarantees will be preserved, regardless of the discretization method. We call this property ``one design for any discretization''. Extensive numerical studies are included and suggest that the domain of attraction is large and the observer is robust to uncertainties of tip velocity measurements and model parameters.
- [93] arXiv:2310.08753 (replaced) [pdf, html, other]
-
Title: CompA: Addressing the Gap in Compositional Reasoning in Audio-Language ModelsSreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh ManochaComments: ICLR 2024. Project Page: this https URLSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
- [94] arXiv:2311.18076 (replaced) [pdf, html, other]
-
Title: Localization from structured distance matrices via low-rank matrix recoveryComments: 20 pages. Introduced a new sampling model. Experimental results on both synthetic and real data. A new optimization program for structured distance geometry based on low-rank recovery. The analysis of the previous sampling model is also discussed. Made changes to improve the clarity and presentation of the paperSubjects: Information Theory (cs.IT); Machine Learning (cs.LG); Robotics (cs.RO); Signal Processing (eess.SP)
We study the problem of determining the configuration of $n$ points by using their distances to $m$ nodes, referred to as anchor nodes. One sampling scheme is Nystrom sampling, which assumes known distances between the anchors and between the anchors and the $n$ points, while the distances among the $n$ points are unknown. For this scheme, a simple adaptation of the Nystrom method, which is often used for kernel approximation, is a viable technique to estimate the configuration of the anchors and the $n$ points. In this manuscript, we propose a modified version of Nystrom sampling, where the distances from every node to one central node are known, but all other distances are incomplete. In this setting, the standard Nystrom approach is not applicable, necessitating an alternative technique to estimate the configuration of the anchors and the $n$ points. We show that this problem can be framed as the recovery of a low-rank submatrix of a Gram matrix. Using synthetic and real data, we demonstrate that the proposed approach can exactly recover configurations of points given sufficient distance samples. This underscores that, in contrast to methods that rely on global sampling of distance matrices, the task of estimating the configuration of points can be done efficiently via structured sampling with well-chosen reliable anchors. Finally, our main analysis is grounded in a specific centering of the points. With this in mind, we extend previous work in Euclidean distance geometry by providing a general dual basis approach for points centered anywhere.
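The reduction from distances to a Gram matrix is standard once the points are centered at a single node, which matches the central-node sampling described above: the polarization identity gives $\langle p_i - p_0, p_j - p_0\rangle = (d_{0i}^2 + d_{0j}^2 - d_{ij}^2)/2$. A minimal sketch on illustrative coordinates (the recovery algorithm itself is not reproduced here):

```python
import math

def gram_from_distances(d0, D):
    """Gram matrix of points centered at a chosen node 0:
    G[i][j] = <p_i - p_0, p_j - p_0>
            = (d0[i]^2 + d0[j]^2 - D[i][j]^2) / 2,
    where d0[i] is the distance from node 0 to point i and D holds the
    pairwise distances among the points. G has rank at most the ambient
    dimension, which is what low-rank recovery exploits when D is
    only partially observed."""
    n = len(d0)
    return [[(d0[i] ** 2 + d0[j] ** 2 - D[i][j] ** 2) / 2.0
             for j in range(n)] for i in range(n)]

# Illustrative 2D configuration with the central node at the origin.
pts = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
d0 = [math.hypot(x, y) for x, y in pts]
D = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in pts] for p in pts]

G = gram_from_distances(d0, D)  # recovers the centered inner products
```

A rank factorization of `G` (e.g., an eigendecomposition) then returns the point configuration up to rotation about the central node.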
- [95] arXiv:2312.06365 (replaced) [pdf, other]
-
Title: A Balanced Positional Control Architecture for a 12-DoF Quadruped Robot through Simulation-validation and Hardware TestingComments: 26 pages, 11 Figures. v4: Major revisionSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
A multi-jointed robot requires extensive mathematical calculation to determine the end effector's position with respect to the other joints involved and their corresponding frames in a specific coordinate system. If a control architecture employs positional constraints that cannot precisely determine the end effector's position in all quadrants of a 2D Cartesian plane, the robot is generally under-constrained, making accurate positioning of the end effector across the entire plane challenging. Consequently, only a subset of the end effector's degrees of freedom (DoF) can be assigned to the robot's leg position for pose and trajectory estimation. This paper introduces a novel approach and proposes an algorithm for balanced control of the robot's leg position in a coordinate system, so that the leg position can be precisely determined without limiting the DoF. The joint angles are derived mathematically with forward and inverse kinematics, and a Python-based simulation verifies the robot's locomotion. Python-based serial communication with a micro-controller unit makes the approach practical to demonstrate, and its application on a prototype leg has been realized. The experimental prototype leg agrees with the simulated result to a commendable 78.9% accuracy, validating the robustness of our algorithm in practical scenarios. A comprehensive assessment of the control algorithm with random and continuous data-point tests has been conducted to ensure performance, so that the algorithm can also be deployed in a physical robot.
- [96] arXiv:2402.15634 (replaced) [pdf, html, other]
-
Title: Sense-Then-Train: An Active-Sensing-Based Beam Training Design for Near-Field MIMO SystemsComments: This paper has been accepted for publication in IEEE Transactions on Wireless CommunicationsJournal-ref: IEEE Transactions on Wireless Communications, early access, 2024Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
An active-sensing-based sense-then-train (STT) scheme is proposed for beam training in near-field multiple-input multiple-output (MIMO) systems. Compared to conventional codebook-based schemes, the proposed STT scheme is capable of not only addressing the complex spherical-wave propagation but also effectively exploiting the additional degrees-of-freedoms (DoFs). The STT scheme is tailored for both single-beam and multi-beam cases. 1) For the single-beam case, the STT scheme first utilizes a sensing phase to estimate a low-dimensional representation of the near-field MIMO channel in the truncated wavenumber domain. Then, in the subsequent training phase, the neural network modules at transceivers are updated online to align beams, utilizing sequentially received ping-pong pilots. This approach can efficiently obtain the aligned beam pair without relying on predefined codebooks or training datasets. 2) For the multi-beam case, based on the single-beam STT, a Gram-Schmidt method is further utilized to guarantee the orthogonality between beams in the training phase. Numerical results unveil that 1) the proposed STT scheme can significantly enhance the beam training performance in the near field compared to the conventional far-field codebook-based schemes, and 2) the proposed STT scheme can perform fast and low-complexity beam training, while achieving a near-optimal performance without full channel state information in both cases.
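The Gram-Schmidt step used in the multi-beam case is classical: each new beam is stripped of its components along the already-trained beams and renormalized. A minimal sketch over complex beamforming vectors, with made-up dimensions and beam values (not from the paper):

```python
def gram_schmidt(beams):
    """Orthonormalize a list of complex beam vectors so that beams
    trained simultaneously do not overlap, as in the multi-beam STT
    training phase."""
    ortho = []
    for b in beams:
        v = list(b)
        for u in ortho:
            # Project v onto u (Hermitian inner product) and subtract.
            proj = sum(ui.conjugate() * vi for ui, vi in zip(u, v))
            v = [vi - proj * ui for vi, ui in zip(v, u)]
        norm = sum(abs(vi) ** 2 for vi in v) ** 0.5
        ortho.append([vi / norm for vi in v])
    return ortho

# Two correlated 4-dimensional beams (illustrative values).
b1 = [1 + 0j, 1j, -1 + 0j, -1j]
b2 = [1 + 0j, 1 + 0j, 1j, 0j]
u1, u2 = gram_schmidt([b1, b2])
inner = sum(a.conjugate() * b for a, b in zip(u1, u2))  # ~0 after GS
```

The returned beams are unit-norm and mutually orthogonal, which is the guarantee the multi-beam training phase needs.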
- [97] arXiv:2403.00237 (replaced) [pdf, html, other]
-
Title: Stable Reduced-Rank VAR IdentificationComments: 17 pages, 6 figuresSubjects: Methodology (stat.ME); Systems and Control (eess.SY)
The vector autoregression (VAR) has been widely used in system identification, econometrics, natural science, and many other areas. However, when the state dimension becomes large, the parameter dimension explodes, so reduced-rank modelling is attractive and well developed. But a fundamental requirement in almost all applications is stability of the fitted model, and this has not been addressed in the reduced-rank case. Here, we develop, for the first time, a closed-form formula for an estimator of a reduced-rank transition matrix that is guaranteed to be stable. We show that our estimator is consistent and asymptotically statistically efficient, and we illustrate it in comparative simulations.
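The stability requirement here is the usual one: a VAR(1) model $x_{t+1} = A x_t + \varepsilon_t$ is stable iff the spectral radius of the transition matrix $A$ is below one. A minimal check for the 2x2 case, with illustrative matrices (not the paper's estimator):

```python
import cmath

def spectral_radius_2x2(A):
    """Spectral radius of a 2x2 matrix via the characteristic
    polynomial lam^2 - tr(A) * lam + det(A) = 0. A VAR(1) model
    x_{t+1} = A x_t is stable iff this radius is < 1."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)  # complex-safe square root
    return max(abs((tr + disc) / 2), abs((tr - disc) / 2))

stable = [[0.5, 0.2], [0.1, 0.4]]     # eigenvalues 0.6 and 0.3
unstable = [[1.1, 0.0], [0.0, 0.3]]   # eigenvalue 1.1 breaks stability
rho = spectral_radius_2x2(stable)     # < 1, so the fitted model is stable
```

A least-squares or reduced-rank fit offers no such guarantee by itself, which is the gap the paper's constrained estimator closes.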
- [98] arXiv:2404.05584 (replaced) [pdf, html, other]
-
Title: Neural Cellular Automata for Lightweight, Robust and Explainable Classification of White Blood Cell ImagesComments: Accepted for publication at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Diagnosis of hematological malignancies depends on accurate identification of white blood cells in peripheral blood smears. Deep learning techniques are emerging as a viable solution to scale and optimize this process by automatic cell classification. However, these techniques face several challenges such as limited generalizability, sensitivity to domain shifts, and lack of explainability. Here, we introduce a novel approach for white blood cell classification based on neural cellular automata (NCA). We test our approach on three datasets of white blood cell images and show that we achieve competitive performance compared to conventional methods. Our NCA-based method is significantly smaller in terms of parameters and exhibits robustness to domain shifts. Furthermore, the architecture is inherently explainable, providing insights into the decision process for each classification, which helps to understand and validate model predictions. Our results demonstrate that NCA can be used for image classification, and that they address key challenges of conventional methods, indicating a high potential for applicability in clinical practice.
- [99] arXiv:2405.08976 (replaced) [pdf, html, other]
-
Title: Slice-aware Resource Allocation and Admission Control for Smart Factory Wireless NetworksComments: 7 pages, Accepted for presentation at IEEE VTC2024fallSubjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
The 5th generation (5G) and beyond network offers substantial promise as the ideal wireless technology to replace the existing inflexible wired connections in today's traditional factories. 5G network slicing allows for tailored allocation of resources to different network services, each with unique Quality of Service (QoS) requirements. This paper presents a novel solution for slice-aware radio resource allocation based on a convex optimisation and control framework for applications in smart factory wireless networks. The proposed framework dynamically allocates minimum power and sub-channels to downlink mixed service type industrial users categorised into three slices: Capacity Limited (CL), Ultra Reliable Low Latency Communication (URLLC), and Time Sensitive (TS) slices. Given that the base station (BS) has limited transmission power, we enforce admission control by effectively relaxing the target rate constraints for current connections in the CL slice. This rate readjustment occurs whenever power consumption exceeds manageable levels. Simulation results show that our approach minimises power, allocates sub-channels to users, maintains slice isolation, and delivers QoS-specific communications to users in all the slices despite a time-varying number of users and changing network conditions.
- [100] arXiv:2405.18153 (replaced) [pdf, html, other]
-
Title: Practical aspects for the creation of an audio dataset from field recordings with optimized labeling budget with AI-assisted strategyComments: Submitted to ICML 2024 Workshop on Data-Centric Machine Learning ResearchSubjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Machine Listening focuses on developing technologies to extract relevant information from audio signals. A critical aspect of these projects is the acquisition and labeling of contextualized data, which is inherently complex and requires specific resources and strategies. Despite the availability of some audio datasets, many are unsuitable for commercial applications. The paper emphasizes the importance of Active Learning (AL) using expert labelers over crowdsourcing, which often lacks detailed insights into dataset structures. AL is an iterative process combining human labelers and AI models to optimize the labeling budget by intelligently selecting samples for human review. This approach addresses the challenge of handling large, constantly growing datasets that exceed available computational resources and memory. The paper presents a comprehensive data-centric framework for Machine Listening projects, detailing the configuration of recording nodes, database structure, and labeling budget optimization in resource-constrained scenarios. Applied to an industrial port in Valencia, Spain, the framework successfully labeled 6540 ten-second audio samples over five months with a small team, demonstrating its effectiveness and adaptability to various resource availability situations.
Acknowledgments: The participation of Javier Naranjo-Alcazar, Jordi Grau-Haro and Pedro Zuccarello in this research was funded by the Valencian Institute for Business Competitiveness (IVACE) and the FEDER funds by means of project Soroll-IA2 (IMDEEA/2023/91).
- [101] arXiv:2407.01275 (replaced) [pdf, html, other]
-
Title: Quaternion-based Adaptive Backstepping Fast Terminal Sliding Mode Control for Quadrotor UAVs with Finite Time ConvergenceComments: Results in EngineeringSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
This paper proposes a novel quaternion-based approach for tracking the translation (position and linear velocity) and rotation (attitude and angular velocity) trajectories of underactuated Unmanned Aerial Vehicles (UAVs). Quadrotor UAVs pose challenges regarding accuracy, singularities, and uncertainties. Controllers designed based on unit quaternions are singularity-free for attitude representation, unlike other methods (e.g., Euler angles), which fail to represent the vehicle's attitude at certain orientations. Quaternion-based Adaptive Backstepping Control (ABC) and Adaptive Fast Terminal Sliding Mode Control (AFTSMC) are proposed to address a set of challenging problems. A quaternion-based ABC, a superior recursive approach, is proposed to generate the necessary thrust while handling unknown uncertainties and tracking the UAV's translation trajectory. Next, a quaternion-based AFTSMC is developed to overcome parametric uncertainties, avoid singularities, and ensure fast convergence in finite time. Moreover, the proposed AFTSMC significantly reduces control signal chattering, a main cause of actuator failure, and provides smooth and accurate rotational control input. To ensure the robustness of the proposed approach, the designed control algorithms have been validated considering unknown time-variant parametric uncertainties and significant initialization errors. The proposed techniques have been compared to a state-of-the-art control technique. Keywords: Adaptive Backstepping Control (ABC), Adaptive Fast Terminal Sliding Mode Control (AFTSMC), Unit-quaternion, Unmanned Aerial Vehicles, Singularity Free, Pose Control
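The singularity-free property comes from representing attitude with a unit quaternion and rotating via the Hamilton product $q \otimes v \otimes q^{-1}$, which is well-defined at every orientation, unlike Euler angles at gimbal lock. A minimal pure-Python sketch with illustrative values (not the paper's controller):

```python
import math

def quat_mul(p, q):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw * qw - px * qx - py * qy - pz * qz,
            pw * qx + px * qw + py * qz - pz * qy,
            pw * qy - px * qz + py * qw + pz * qx,
            pw * qz + px * qy - py * qx + pz * qw)

def rotate(v, q):
    """Rotate vector v by unit quaternion q via q * v * q^-1; for a
    unit quaternion the conjugate is the inverse."""
    qc = (q[0], -q[1], -q[2], -q[3])
    w = quat_mul(quat_mul(q, (0.0, *v)), qc)
    return w[1:]

# 90-degree rotation about the z-axis (illustrative).
half = math.radians(90) / 2
q = (math.cos(half), 0.0, 0.0, math.sin(half))
vx = rotate((1.0, 0.0, 0.0), q)  # x-axis maps to the y-axis
```

The same representation parameterizes every attitude smoothly, which is what lets the proposed controllers avoid the representation singularities of Euler angles.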
- [102] arXiv:2407.11651 (replaced) [pdf, html, other]
-
Title: Fluid Antenna Grouping Index Modulation Design for MIMO SystemsComments: A longer and more detailed version will be submitted to an IEEE journalSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Index modulation (IM) significantly enhances the spectral efficiency of fluid antennas (FAs) enabled multiple-input multiple-output (MIMO) systems, which is named FA-IM. However, due to the dense distribution of ports on the FA, the wireless channel exhibits a high spatial correlation, leading to severe performance degradation in the existing FA-IM-assisted MIMO systems. To tackle this issue, this paper proposes a novel fluid antenna grouping index modulation (FA-GIM) scheme to mitigate the high correlation between the activated ports. Specifically, considering the characteristics of the FA two-dimensional (2D) surface structure and the spatially correlated channel model in FA-assisted MIMO systems, a block grouping method is adopted, where adjacent ports are assigned to the same group. Subsequently, different groups independently perform port index selection and constellation symbol mapping, with only one port being activated within each group during each transmission interval. Numerical results show that, compared to state-of-the-art schemes, the proposed FA-GIM scheme consistently achieves significant bit error rate (BER) performance gains under various conditions. It has also been confirmed that the proposed scheme is both efficient and robust, enhancing the performance of FA-assisted MIMO systems.
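The block-grouping idea above can be sketched directly: split the index bit stream into per-group chunks and activate one port per group, with adjacent ports assigned to the same group so that simultaneously activated ports are spatially separated. The group size, bit ordering, and port numbering below are illustrative assumptions, not the paper's exact mapping.

```python
def fa_gim_map(bits, group_size):
    """Map index bits to one activated port per group. Ports are
    numbered so that each block of `group_size` adjacent ports forms
    one group; `group_size` is assumed to be a power of two."""
    bits_per_group = group_size.bit_length() - 1  # log2(group_size)
    ports = []
    for g, i in enumerate(range(0, len(bits), bits_per_group)):
        chunk = bits[i:i + bits_per_group]
        local = int("".join(map(str, chunk)), 2)  # port index within group
        ports.append(g * group_size + local)      # global port index
    return ports

# 8 ports in 2 groups of 4 -> 2 index bits per group (illustrative).
ports = fa_gim_map([1, 0, 1, 1], group_size=4)  # one port per group
```

Because the two activated ports always lie in different blocks of adjacent ports, their channels are far less correlated than under ungrouped FA-IM port selection.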
- [103] arXiv:2407.14358 (replaced) [pdf, html, other]
-
Title: Stable Audio OpenComments: Demo: this https URL Weights: this https URL Code: this https URL. arXiv admin note: text overlap with arXiv:2404.10301Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.