Search | arXiv e-print repository

Classical and Quantum Physical Reservoir Computing for Onboard Artificial Intelligence Systems: A Perspective

Authors: A. H. Abbas, Hend Abdel-Ghani, Ivan S. Maksymov

Abstract: Artificial intelligence (AI) systems of autonomous systems such as drones, robots and self-driving cars may consume up to 50% of total power available onboard, thereby limiting the vehicle's range of functions and considerably reducing the distance the vehicle can travel on a single charge. Next-generation onboard AI systems need an even higher power since they collect and process even larger amou… ▽ More Artificial intelligence (AI) systems of autonomous systems such as drones, robots and self-driving cars may consume up to 50% of total power available onboard, thereby limiting the vehicle's range of functions and considerably reducing the distance the vehicle can travel on a single charge. Next-generation onboard AI systems need an even higher power since they collect and process even larger amounts of data in real time. This problem cannot be solved using the traditional computing devices since they become more and more power-consuming. In this review article, we discuss the perspectives of development of onboard neuromorphic computers that mimic the operation of a biological brain using nonlinear-dynamical properties of natural physical environments surrounding autonomous vehicles. Previous research also demonstrated that quantum neuromorphic processors (QNPs) can conduct computations with the efficiency of a standard computer while consuming less than 1% of the onboard battery power. Since QNPs is a semi-classical technology, their technical simplicity and low, compared with quantum computers, cost make them ideally suitable for application in autonomous AI system. Providing a perspective view on the future progress in unconventional physical reservoir computing and surveying the outcomes of more than 200 interdisciplinary research works, this article will be of interest to a broad readership, including both students and experts in the fields of physics, engineering, quantum technologies and computing. △ Less

Submitted 14 June, 2024; originally announced July 2024.

Comments: review article

arXiv:2407.04335 [pdf, ps, other]

Geometrically Inspired Kernel Machines for Collaborative Learning Beyond Gradient Descent

Authors: Mohit Kumar, Alexander Valentinitsch, Magdalena Fuchs, Mathias Brucker, Juliana Bowles, Adnan Husakovic, Ali Abbas, Bernhard A. Moser

Abstract: This paper develops a novel mathematical framework for collaborative learning by means of geometrically inspired kernel machines which includes statements on the bounds of generalisation and approximation errors, and sample complexity. For classification problems, this approach allows us to learn bounded geometric structures around given data points and hence solve the global model learning proble… ▽ More This paper develops a novel mathematical framework for collaborative learning by means of geometrically inspired kernel machines which includes statements on the bounds of generalisation and approximation errors, and sample complexity. For classification problems, this approach allows us to learn bounded geometric structures around given data points and hence solve the global model learning problem in an efficient way by exploiting convexity properties of the related optimisation problem in a Reproducing Kernel Hilbert Space (RKHS). In this way, we can reduce classification problems to determining the closest bounded geometric structure from a given data point. Further advantages that come with our solution is that our approach does not require clients to perform multiple epochs of local optimisation using stochastic gradient descent, nor require rounds of communication between client/server for optimising the global model. We highlight that numerous experiments have shown that the proposed method is a competitive alternative to the state-of-the-art. △ Less

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.02231 [pdf, other]

Safety-Driven Deep Reinforcement Learning Framework for Cobots: A Sim2Real Approach

Authors: Ammar N. Abbas, Shakra Mehak, Georgios C. Chasparis, John D. Kelleher, Michael Guilfoyle, Maria Chiara Leva, Aswin K Ramasubramanian

Abstract: This study presents a novel methodology incorporating safety constraints into a robotic simulation during the training of deep reinforcement learning (DRL). The framework integrates specific parts of the safety requirements, such as velocity constraints, as specified by ISO 10218, directly within the DRL model that becomes a part of the robot's learning algorithm. The study then evaluated the effi… ▽ More This study presents a novel methodology incorporating safety constraints into a robotic simulation during the training of deep reinforcement learning (DRL). The framework integrates specific parts of the safety requirements, such as velocity constraints, as specified by ISO 10218, directly within the DRL model that becomes a part of the robot's learning algorithm. The study then evaluated the efficiency of these safety constraints by subjecting the DRL model to various scenarios, including grasping tasks with and without obstacle avoidance. The validation process involved comprehensive simulation-based testing of the DRL model's responses to potential hazards and its compliance. Also, the performance of the system is carried out by the functional safety standards IEC 61508 to determine the safety integrity level. The study indicated a significant improvement in the safety performance of the robotic system. The proposed DRL model anticipates and mitigates hazards while maintaining operational efficiency. This study was validated in a testbed with a collaborative robotic arm with safety sensors and assessed with metrics such as the average number of safety violations, obstacle avoidance, and the number of successful grasps. The proposed approach outperforms the conventional method by a 16.5% average success rate on the tested scenarios in the simulations and 2.5% in the testbed without safety violations. The project repository is available at https://github.com/ammar-n-abbas/sim2real-ur-gym-gazebo. △ Less

Submitted 2 July, 2024; originally announced July 2024.

Comments: This paper has been accepted for publication in the proceedings of the IEEE/IFAC International Conference on Control, Decision, and Information Technologies (CoDIT), 2024

arXiv:2406.11794 [pdf, other]

DataComp-LM: In search of the next generation of training sets for language models

Authors: Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner , et al. (34 additional authors not shown)

Abstract: We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with dat… ▽ More We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation. △ Less

Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Project page: https://www.datacomp.ai/dclm/

arXiv:2406.07485 [pdf, other]

PITCH: Productivity and Mental Well-being Coaching through Daily Conversational Interaction

Authors: Adnan Abbas, Sang Won Lee

Abstract: Efficient task planning is essential for productivity and mental well-being, yet individuals often struggle to create realistic plans and reflect upon their productivity. Leveraging the advancement in artificial intelligence (AI), conversational agents have emerged as a promising tool for enhancing productivity. Our work focuses on externalizing plans through conversation, aiming to solidify inten… ▽ More Efficient task planning is essential for productivity and mental well-being, yet individuals often struggle to create realistic plans and reflect upon their productivity. Leveraging the advancement in artificial intelligence (AI), conversational agents have emerged as a promising tool for enhancing productivity. Our work focuses on externalizing plans through conversation, aiming to solidify intentions and foster focused action, thereby positively impacting their productivity and mental well-being. We share our plan of designing a conversational agent to offer insightful questions and reflective prompts for increasing plan adherence by leveraging the social interactivity of natural conversations. Previous studies have shown the effectiveness of such agents, but many interventions remain static, leading to decreased user engagement over time. To address this limitation, we propose a novel rotation and context-aware prompting strategy, providing users with varied interventions daily. Our system, PITCH, utilizes large language models (LLMs) to facilitate externalization and reflection on daily plans. Through this study, we investigate the impact of externalizing tasks with conversational agents on productivity and mental well-being, and the effectiveness of a rotation strategy in maintaining user engagement. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.04927 [pdf, other]

LLM-based speaker diarization correction: A generalizable approach

Authors: Georgios Efstathiadis, Vijay Yadav, Anzar Abbas

Abstract: Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of… ▽ More Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We hope to make these models accessible through public-facing APIs for use by third-party applications. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2405.02321 [pdf, other]

Accelerating Medical Knowledge Discovery through Automated Knowledge Graph Generation and Enrichment

Authors: Mutahira Khalid, Raihana Rahman, Asim Abbas, Sushama Kumari, Iram Wajahat, Syed Ahmad Chan Bukhari

Abstract: Knowledge graphs (KGs) serve as powerful tools for organizing and representing structured knowledge. While their utility is widely recognized, challenges persist in their automation and completeness. Despite efforts in automation and the utilization of expert-created ontologies, gaps in connectivity remain prevalent within KGs. In response to these challenges, we propose an innovative approach ter… ▽ More Knowledge graphs (KGs) serve as powerful tools for organizing and representing structured knowledge. While their utility is widely recognized, challenges persist in their automation and completeness. Despite efforts in automation and the utilization of expert-created ontologies, gaps in connectivity remain prevalent within KGs. In response to these challenges, we propose an innovative approach termed ``Medical Knowledge Graph Automation (M-KGA)". M-KGA leverages user-provided medical concepts and enriches them semantically using BioPortal ontologies, thereby enhancing the completeness of knowledge graphs through the integration of pre-trained embeddings. Our approach introduces two distinct methodologies for uncovering hidden connections within the knowledge graph: a cluster-based approach and a node-based approach. Through rigorous testing involving 100 frequently occurring medical concepts in Electronic Health Records (EHRs), our M-KGA framework demonstrates promising results, indicating its potential to address the limitations of existing knowledge graph automation techniques. △ Less

Submitted 21 April, 2024; originally announced May 2024.

Comments: 18 pages, 5 figures

arXiv:2404.06650 [pdf]

A Frequency-Domain Beamforming Procedure for Extracting Rayleigh Wave Attenuation Coefficients and Small-Strain Damping Ratio from 2D Ambient Noise Array Measurements

Authors: Aser Abbas, Mauro Aimar, Brady R. Cox, Sebastiano Foti

Abstract: The small-strain damping ratio plays a crucial role in assessing the response of soil deposits to earthquake-induced ground motions and general dynamic loading. The damping ratio can theoretically be inverted for after extracting frequency-dependent Rayleigh wave attenuation coefficients from wavefields collected during surface wave testing. However, determining reliable estimates of in-situ atten… ▽ More The small-strain damping ratio plays a crucial role in assessing the response of soil deposits to earthquake-induced ground motions and general dynamic loading. The damping ratio can theoretically be inverted for after extracting frequency-dependent Rayleigh wave attenuation coefficients from wavefields collected during surface wave testing. However, determining reliable estimates of in-situ attenuation coefficients is much more challenging than achieving robust phase velocity dispersion data, which are commonly measured using both active-source and ambient-wavefield surface wave methods. This paper introduces a new methodology for estimating frequency-dependent attenuation coefficients through the analysis of ambient noise wavefield data recorded by two-dimensional (2D) arrays of surface seismic sensors for the subsequent evaluation of the small-strain damping ratio. The approach relies on the application of an attenuation-specific wavefield conversion and frequency-domain beamforming. Numerical simulations are employed to verify the proposed approach and inform best practices for its application. Finally, the practical efficacy of the proposed approach is showcased through its application to field data collected at a deep, soft soil site in Logan, Utah, USA, where phase velocity and attenuation coefficients are extracted from surface wave data and then simultaneously inverted to develop deep shear wave velocity and damping ratio profiles. △ Less

Submitted 9 April, 2024; originally announced April 2024.

Comments: 42 pages, 12 figures

arXiv:2403.13184 [pdf, other]

The Mass Density of MgII Absorbers from the Australian Dark Energy Survey

Authors: Asif Abbas, Christopher W. Churchill, Glenn G. Kacprzak, Christopher Lidman, Susanna Guatelli, Sabine Bellstedt

Abstract: We present an all-southern sky survey for MgII doublet absorbers in 951 z < 4 AGN/quasar spectra from the Australian Dark Energy Survey (OzDES). The spectral resolution ranges from R = 1400-1700 over the wavelengths 3700 A-8800 A. The survey has a 5sigma detection completeness of 50% and above for rest-frame equivalent widths W_r(2796) >= 0.3 A. We studied 656 MgII absorption systems over the reds… ▽ More We present an all-southern sky survey for MgII doublet absorbers in 951 z < 4 AGN/quasar spectra from the Australian Dark Energy Survey (OzDES). The spectral resolution ranges from R = 1400-1700 over the wavelengths 3700 A-8800 A. The survey has a 5sigma detection completeness of 50% and above for rest-frame equivalent widths W_r(2796) >= 0.3 A. We studied 656 MgII absorption systems over the redshift range 0.33 < z < 2.19 with equivalent widths 0.3 < W_r(2796) < 3.45 A. The equivalent width distribution is well fit by an exponential function with W* = 0.76 +/- 0.04 A and the redshift path density exhibits very little evolution. Overall, our findings are consistent with the large, predominantly northern-sky, surveys of MgII absorbers. We developed and implemented a Monte Carlo model informed by a high-resolution MgII survey for determining the MgII mass density, Omega_MgII. We found Omega_MgII ~ 5 x 10^-7 with no evidence of evolution over a ~7 Gyr time span following Cosmic Noon. Incorporating measurements covering 2.0 < z < 6.4 from the literature, we extended our insights into MgII mass density evolution from the end of reionization well past the Cosmic Noon epoch. The presented Monte Carlo model has potential for advancing our knowledge of the evolution of mass densities of metal-ions common to quasar absorption line studies, as it exploits the efficiency of large low-resolution surveys while requiring only small samples from expensive high-resolution surveys. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: 27 pages, 15 figures, 3 tables. Accepted on Pi Day 2024

arXiv:2403.01024 [pdf, other]

Reservoir Computing Using Measurement-Controlled Quantum Dynamics

Authors: A. H. Abbas, Ivan S. Maksymov

Abstract: Physical reservoir computing (RC) is a machine learning algorithm that employs the dynamics of a physical system to forecast highly nonlinear and chaotic phenomena. In this paper, we introduce a quantum RC system that employs the dynamics of a probed atom in a cavity. The atom experiences coherent driving at a particular rate, leading to a measurement-controlled quantum evolution. The proposed qua… ▽ More Physical reservoir computing (RC) is a machine learning algorithm that employs the dynamics of a physical system to forecast highly nonlinear and chaotic phenomena. In this paper, we introduce a quantum RC system that employs the dynamics of a probed atom in a cavity. The atom experiences coherent driving at a particular rate, leading to a measurement-controlled quantum evolution. The proposed quantum reservoir can make fast and reliable forecasts using a small number of artificial neurons compared with the traditional RC algorithm. We theoretically validate the operation of the reservoir, demonstrating its potential to be used in error-tolerant applications, where approximate computing approaches may be used to make feasible forecasts in conditions of limited computational and energy resources. △ Less

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2402.18065 [pdf, other]

A Probabilistic Motion Model for Skid-Steer Wheeled Mobile Robot Navigation on Off-Road Terrains

Authors: Ananya Trivedi, Mark Zolotas, Adeeb Abbas, Sarvesh Prajapati, Salah Bazzi, Taskın Padır

Abstract: Skid-Steer Wheeled Mobile Robots (SSWMRs) are increasingly being used for off-road autonomy applications. When turning at high speeds, these robots tend to undergo significant skidding and slipping. In this work, using Gaussian Process Regression (GPR) and Sigma-Point Transforms, we estimate the non-linear effects of tire-terrain interaction on robot velocities in a probabilistic fashion. Using th… ▽ More Skid-Steer Wheeled Mobile Robots (SSWMRs) are increasingly being used for off-road autonomy applications. When turning at high speeds, these robots tend to undergo significant skidding and slipping. In this work, using Gaussian Process Regression (GPR) and Sigma-Point Transforms, we estimate the non-linear effects of tire-terrain interaction on robot velocities in a probabilistic fashion. Using the mean estimates from GPR, we propose a data-driven dynamic motion model that is more accurate at predicting future robot poses than conventional kinematic motion models. By efficiently solving a convex optimization problem based on the history of past robot motion, the GPR augmented motion model generalizes to previously unseen terrain conditions. The output distribution from the proposed motion model can be used for local motion planning approaches, such as stochastic model predictive control, leveraging model uncertainty to make safe decisions. We validate our work on a benchmark real-world multi-terrain SSWMR dataset. Our results show that the model generalizes to three different terrains while significantly reducing errors in linear and angular motion predictions. As shown in the attached video, we perform a separate set of experiments on a physical robot to demonstrate the robustness of the proposed algorithm. △ Less

Submitted 29 February, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted for publication at IEEE ICRA 2024

arXiv:2402.13219 [pdf, other]

Analyzing Operator States and the Impact of AI-Enhanced Decision Support in Control Rooms: A Human-in-the-Loop Specialized Reinforcement Learning Framework for Intervention Strategies

Authors: Ammar N. Abbas, Chidera W. Amazu, Joseph Mietkiewicz, Houda Briwa, Andres Alonzo Perez, Gabriele Baldissone, Micaela Demichela, Georgios G. Chasparis, John D. Kelleher, Maria Chiara Leva

Abstract: In complex industrial and chemical process control rooms, effective decision-making is crucial for safety and efficiency. The experiments in this paper evaluate the impact and applications of an AI-based decision support system integrated into an improved human-machine interface, using dynamic influence diagrams, a hidden Markov model, and deep reinforcement learning. The enhanced support system a… ▽ More In complex industrial and chemical process control rooms, effective decision-making is crucial for safety and efficiency. The experiments in this paper evaluate the impact and applications of an AI-based decision support system integrated into an improved human-machine interface, using dynamic influence diagrams, a hidden Markov model, and deep reinforcement learning. The enhanced support system aims to reduce operator workload, improve situational awareness, and provide different intervention strategies to the operator adapted to the current state of both the system and human performance. Such a system can be particularly useful in cases of information overload when many alarms and inputs are presented all within the same time window, or for junior operators during training. A comprehensive cross-data analysis was conducted, involving 47 participants and a diverse range of data sources such as smartwatch metrics, eye-tracking data, process logs, and responses from questionnaires. The results indicate interesting insights regarding the effectiveness of the approach in aiding decision-making, decreasing perceived workload, and increasing situational awareness for the scenarios considered. Additionally, the results provide valuable insights to compare differences between styles of information gathering when using the system by individual participants. These findings are particularly relevant when predicting the overall performance of the individual participant and their capacity to successfully handle a plant upset and the alarms connected to it using process and human-machine interaction logs in real-time. These predictions enable the development of more effective intervention strategies. △ Less

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.08093 [pdf, other]

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Authors: Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts ra… ▽ More We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/. △ Less

Submitted 15 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: v1.1 (fixed typos)

arXiv:2402.03973 [pdf, other]

Humans Beat Deep Networks at Recognizing Objects in Unusual Poses, Given Enough Time

Authors: Netta Ollikka, Amro Abbas, Andrea Perin, Markku Kilpeläinen, Stéphane Deny

Abstract: Deep learning is closing the gap with humans on several object recognition benchmarks. Here we investigate this gap in the context of challenging images where objects are seen from unusual viewpoints. We find that humans excel at recognizing objects in unusual poses, in contrast with state-of-the-art pretrained networks (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext) which are systematically britt… ▽ More Deep learning is closing the gap with humans on several object recognition benchmarks. Here we investigate this gap in the context of challenging images where objects are seen from unusual viewpoints. We find that humans excel at recognizing objects in unusual poses, in contrast with state-of-the-art pretrained networks (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext) which are systematically brittle in this condition. Remarkably, as we limit image exposure time, human performance degrades to the level of deep networks, suggesting that additional mental processes (requiring additional time) take place when humans identify objects in unusual poses. Finally, our analysis of error patterns of humans vs. networks reveals that even time-limited humans are dissimilar to feed-forward deep networks. We conclude that more work is needed to bring computer vision systems to the level of robustness of the human visual system. Understanding the nature of the mental processes taking place during extra viewing time may be key to attain such robustness. △ Less

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.04578 [pdf, other]

Effective pruning of web-scale datasets based on complexity of concept clusters

Authors: Amro Abbas, Evgenia Rusak, Kushal Tirumala, Wieland Brendel, Kamalika Chaudhuri, Ari S. Morcos

Abstract: Utilizing massive web-scale datasets has led to unprecedented performance gains in machine learning models, but also imposes outlandish compute requirements for their training. In order to improve training and data efficiency, we here push the limits of pruning large-scale multimodal datasets for training CLIP-style models. Today's most effective pruning method on ImageNet clusters data samples in… ▽ More Utilizing massive web-scale datasets has led to unprecedented performance gains in machine learning models, but also imposes outlandish compute requirements for their training. In order to improve training and data efficiency, we here push the limits of pruning large-scale multimodal datasets for training CLIP-style models. Today's most effective pruning method on ImageNet clusters data samples into separate concepts according to their embedding and prunes away the most prototypical samples. We scale this approach to LAION and improve it by noting that the pruning rate should be concept-specific and adapted to the complexity of the concept. Using a simple and intuitive complexity measure, we are able to reduce the training cost to a quarter of regular training. By filtering from the LAION dataset, we find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs. More specifically, we are able to outperform the LAION-trained OpenCLIP-ViT-B32 model on ImageNet zero-shot accuracy by 1.1p.p. while only using 27.7% of the data and training compute. Despite a strong reduction in training cost, we also see improvements on ImageNet dist. shifts, retrieval tasks and VTAB. On the DataComp Medium benchmark, we achieve a new state-of-the-art Imagehttps://info.arxiv.org/help/prep#commentsNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks. △ Less

Submitted 12 March, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

Comments: Accepted at ICLR 2024, code available at https://github.com/amro-kamal/effective_pruning

arXiv:2312.07810 [pdf]

Efficient Up-Conversion in CsPbBr3 Nanocrystals via Phonon-Driven Exciton-Polaron Formation

Authors: Abdullah S. Abbas, Beiye C. Li, Richard D. Schaller, Vitali B. Prakapenka, Stella Chariton, Gregory S. Engel, A. Paul Alivisatos

Abstract: Lead halide perovskite nanocrystals demonstrate efficient up-conversion, although the precise mechanism remains a subject of active research. This study utilizes steady-state and time-resolved spectroscopy methods to unravel the mechanism driving the up-conversion process in CsPbBr3 nanocrystals. Employing above- and below-gap photoluminescence measurements, we extract a distinct phonon mode with… ▽ More Lead halide perovskite nanocrystals demonstrate efficient up-conversion, although the precise mechanism remains a subject of active research. This study utilizes steady-state and time-resolved spectroscopy methods to unravel the mechanism driving the up-conversion process in CsPbBr3 nanocrystals. Employing above- and below-gap photoluminescence measurements, we extract a distinct phonon mode with an energy of ~7 meV and identify the Pb-Br-Pb bending mode as the phonon involved in the up-conversion process. This result was corroborated by Raman spectroscopy. We confirm an up-conversion efficiency reaching up to 75%. Transient absorption measurements under conditions of sub-gap excitation also unexpectedly reveal coherent phonons for the subset of nanocrystals undergoing up-conversion. This coherence implies that the up-conversion and subsequent relaxation is accompanied by a synchronized and phased lattice motion. This study reveals that efficient up-conversion in CsPbBr3 nanocrystals is powered by a unique interplay between the soft lattice structure, phonons, and excited states dynamics. △ Less

Submitted 19 December, 2023; v1 submitted 12 December, 2023; originally announced December 2023.

Comments: Main text has 6 figures, supporting information has 7 figures. total number of pages 39

arXiv:2312.03189 [pdf, other]

doi 10.1103/PhysRevResearch.6.L022003

$n$-body anti-bunching in a degenerate Fermi gas of $^3$He* atoms

Authors: Kieran F. Thomas, Shijie Li, A. H. Abbas, Andrew G. Truscott, Sean. S. Hodgman

Abstract: A key observable in investigations into quantum systems are the $n$-body correlation functions, which provide a powerful tool for experimentally determining coherence and directly probing the many-body wavefunction. While the (bosonic) correlations of photonic systems are well explored, the correlations present in matter-wave systems, particularly for fermionic atoms, are still an emerging field.… ▽ More A key observable in investigations into quantum systems are the $n$-body correlation functions, which provide a powerful tool for experimentally determining coherence and directly probing the many-body wavefunction. While the (bosonic) correlations of photonic systems are well explored, the correlations present in matter-wave systems, particularly for fermionic atoms, are still an emerging field. In this work, we use the unique single-atom detection properties of $^3$He* atoms to perform simultaneous measurements of the $n$-body quantum correlations, up to the fifth-order, of a degenerate Fermi gas. In a direct demonstration of the Pauli exclusion principle, we observe clear anti-bunching at all orders and find good agreement with predicted correlation volumes. Our results pave the way for using correlation functions to probe some of the rich physics associated with fermionic systems, such as d-wave pairing in superconductors. △ Less

Submitted 5 December, 2023; originally announced December 2023.

Comments: 11 pages, 7 figures

Journal ref: Phys. Rev. Research 6, L022003 (2024)

arXiv:2312.02279 [pdf, other]

Quantum Optimization: Potential, Challenges, and the Path Forward

Authors: Amira Abbas, Andris Ambainis, Brandon Augustino, Andreas Bärtschi, Harry Buhrman, Carleton Coffrin, Giorgio Cortiana, Vedran Dunjko, Daniel J. Egger, Bruce G. Elmegreen, Nicola Franco, Filippo Fratini, Bryce Fuller, Julien Gacon, Constantin Gonciulea, Sander Gribling, Swati Gupta, Stuart Hadfield, Raoul Heese, Gerhard Kircher, Thomas Kleinert, Thorsten Koch, Georgios Korpas, Steve Lenk, Jakub Marecek , et al. (21 additional authors not shown)

Abstract: Recent advances in quantum computers are demonstrating the ability to solve problems at a scale beyond brute force classical simulation. As such, a widespread interest in quantum algorithms has developed in many areas, with optimization being one of the most pronounced domains. Across computer science and physics, there are a number of algorithmic approaches, often with little linkage. This is fur… ▽ More Recent advances in quantum computers are demonstrating the ability to solve problems at a scale beyond brute force classical simulation. As such, a widespread interest in quantum algorithms has developed in many areas, with optimization being one of the most pronounced domains. Across computer science and physics, there are a number of algorithmic approaches, often with little linkage. This is further complicated by the fragmented nature of the field of mathematical optimization, where major classes of optimization problems, such as combinatorial optimization, convex optimization, non-convex optimization, and stochastic extensions, have devoted communities. With these aspects in mind, this work draws on multiple approaches to study quantum optimization. Provably exact versus heuristic settings are first explained using computational complexity theory - highlighting where quantum advantage is possible in each context. Then, the core building blocks for quantum optimization algorithms are outlined to subsequently define prominent problem classes and identify key open questions that, if answered, will advance the field. The effects of scaling relevant problems on noisy quantum devices are also outlined in detail, alongside meaningful benchmarking problems. We underscore the importance of benchmarking by proposing clear metrics to conduct appropriate comparisons with classical optimization techniques. Lastly, we highlight two domains - finance and sustainability - as rich sources of optimization problems that could be used to benchmark, and eventually validate, the potential real-world impact of quantum optimization. △ Less

Submitted 4 December, 2023; originally announced December 2023.

Comments: 70 pages, 9 Figures, 4 Tables

arXiv:2310.18811 [pdf]

Hierarchical Framework for Interpretable and Probabilistic Model-Based Safe Reinforcement Learning

Authors: Ammar N. Abbas, Georgios C. Chasparis, John D. Kelleher

Abstract: The difficulty of identifying the physical model of complex systems has led to exploring methods that do not rely on such complex modeling of the systems. Deep reinforcement learning has been the pioneer for solving this problem without the need for relying on the physical model of complex systems by just interacting with it. However, it uses a black-box learning approach that makes it difficult t… ▽ More The difficulty of identifying the physical model of complex systems has led to exploring methods that do not rely on such complex modeling of the systems. Deep reinforcement learning has been the pioneer for solving this problem without the need for relying on the physical model of complex systems by just interacting with it. However, it uses a black-box learning approach that makes it difficult to be applied within real-world and safety-critical systems without providing explanations of the actions derived by the model. Furthermore, an open research question in deep reinforcement learning is how to focus the policy learning of critical decisions within a sparse domain. This paper proposes a novel approach for the use of deep reinforcement learning in safety-critical systems. It combines the advantages of probabilistic modeling and reinforcement learning with the added benefits of interpretability and works in collaboration and synchronization with conventional decision-making strategies. The BC-SRLA is activated in specific situations which are identified autonomously through the fused information of probabilistic model and reinforcement learning, such as abnormal conditions or when the system is near-to-failure. Further, it is initialized with a baseline policy using policy cloning to allow minimum interactions with the environment to address the challenges associated with using RL in safety-critical industries. The effectiveness of the BC-SRLA is demonstrated through a case study in maintenance applied to turbofan engines, where it shows superior performance to the prior art and other baselines. △ Less

Submitted 28 October, 2023; originally announced October 2023.

Comments: arXiv admin note: text overlap with arXiv:2206.13433

Journal ref: Data & Knowledge Engineering, 2023

arXiv:2310.14788 [pdf]

Specialized Deep Residual Policy Safe Reinforcement Learning-Based Controller for Complex and Continuous State-Action Spaces

Authors: Ammar N. Abbas, Georgios C. Chasparis, John D. Kelleher

Abstract: Traditional controllers have limitations as they rely on prior knowledge about the physics of the problem, require modeling of dynamics, and struggle to adapt to abnormal situations. Deep reinforcement learning has the potential to address these problems by learning optimal control policies through exploration in an environment. For safety-critical environments, it is impractical to explore random… ▽ More Traditional controllers have limitations as they rely on prior knowledge about the physics of the problem, require modeling of dynamics, and struggle to adapt to abnormal situations. Deep reinforcement learning has the potential to address these problems by learning optimal control policies through exploration in an environment. For safety-critical environments, it is impractical to explore randomly, and replacing conventional controllers with black-box models is also undesirable. Also, it is expensive in continuous state and action spaces, unless the search space is constrained. To address these challenges we propose a specialized deep residual policy safe reinforcement learning with a cycle of learning approach adapted for complex and continuous state-action spaces. Residual policy learning allows learning a hybrid control architecture where the reinforcement learning agent acts in synchronous collaboration with the conventional controller. The cycle of learning initiates the policy through the expert trajectory and guides the exploration around it. Further, the specialization through the input-output hidden Markov model helps to optimize policy that lies within the region of interest (such as abnormality), where the reinforcement learning agent is required and is activated. The proposed solution is validated on the Tennessee Eastman process control. △ Less

Submitted 15 October, 2023; originally announced October 2023.

arXiv:2310.08230 [pdf, other]

Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching

Authors: Paul Roetzer, Ahmed Abbas, Dongliang Cao, Florian Bernard, Paul Swoboda

Abstract: In this work we propose to combine the advantages of learning-based and combinatorial formalisms for 3D shape matching. While learning-based shape matching solutions lead to state-of-the-art matching performance, they do not ensure geometric consistency, so that obtained matchings are locally unsmooth. On the contrary, axiomatic methods allow to take geometric consistency into account by explicitl… ▽ More In this work we propose to combine the advantages of learning-based and combinatorial formalisms for 3D shape matching. While learning-based shape matching solutions lead to state-of-the-art matching performance, they do not ensure geometric consistency, so that obtained matchings are locally unsmooth. On the contrary, axiomatic methods allow to take geometric consistency into account by explicitly constraining the space of valid matchings. However, existing axiomatic formalisms are impractical since they do not scale to practically relevant problem sizes, or they require user input for the initialisation of non-convex optimisation problems. In this work we aim to close this gap by proposing a novel combinatorial solver that combines a unique set of favourable properties: our approach is (i) initialisation free, (ii) massively parallelisable powered by a quasi-Newton method, (iii) provides optimality gaps, and (iv) delivers decreased runtime and globally optimal results for many instances. △ Less

Submitted 12 October, 2023; originally announced October 2023.

Comments: Paul Roetzer and Ahmed Abbas contributed equally

arXiv:2310.02110 [pdf, other]

Sieve: Multimodal Dataset Pruning Using Image Captioning Models

Authors: Anas Mahmoud, Mostafa Elhoushi, Amro Abbas, Yu Yang, Newsha Ardalani, Hugh Leather, Ari Morcos

Abstract: Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of these datasets is strongly correlated with the performance of VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train models using highly-aligned samples is one of the most successful methods for pruning. W… ▽ More Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of these datasets is strongly correlated with the performance of VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train models using highly-aligned samples is one of the most successful methods for pruning. We argue that this approach suffers from multiple limitations including: false positives and negatives due to CLIP's pretraining on noisy labels. We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs to evaluate the alignment of noisy image-text pairs. To bridge the gap between the limited diversity of generated captions and the high diversity of alternative text (alt-text), we estimate the semantic textual similarity in the embedding space of a language model pretrained on unlabeled text corpus. Using DataComp, a multimodal dataset filtering benchmark, when evaluating on 38 downstream tasks, our pruning approach, surpasses CLIPScore by 2.6\% and 1.7\% on medium and large scale respectively. In addition, on retrieval tasks, Sieve leads to a significant improvement of 2.7% and 4.5% on medium and large scale respectively. △ Less

Submitted 10 March, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: Accepted in CVPR 2024

arXiv:2309.15940 [pdf, other]

Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs

Authors: Haonan Chang, Kowndinya Boyalakuntla, Shiyang Lu, Siwei Cai, Eric Jing, Shreesh Keskar, Shijie Geng, Adeeb Abbas, Lifeng Zhou, Kostas Bekris, Abdeslam Boularias

Abstract: We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for grounding a variety of entities, such as object instances, agents, and regions, with free-form text-based queries. Unlike conventional semantic-based object localization approaches, our system facilitates context-aware entity localization, allowing for queries such as ``pick up a cup on a kitchen table" or ``navigate to a… ▽ More We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for grounding a variety of entities, such as object instances, agents, and regions, with free-form text-based queries. Unlike conventional semantic-based object localization approaches, our system facilitates context-aware entity localization, allowing for queries such as ``pick up a cup on a kitchen table" or ``navigate to a sofa on which someone is sitting". In contrast to existing research on 3D scene graphs, OVSG supports free-form text input and open-vocabulary querying. Through a series of comparative experiments using the ScanNet dataset and a self-collected dataset, we demonstrate that our proposed approach significantly surpasses the performance of previous semantic-based localization techniques. Moreover, we highlight the practical application of OVSG in real-world robot navigation and manipulation experiments. △ Less

Submitted 27 September, 2023; originally announced September 2023.

Comments: The code and dataset used for evaluation can be found at https://github.com/changhaonan/OVSG}{https://github.com/changhaonan/OVSG. This paper has been accepted by CoRL2023

arXiv:2309.14233 [pdf]

Urdu Poetry Generated by Using Deep Learning Techniques

Authors: Muhammad Shoaib Farooq, Ali Abbas

Abstract: This study provides Urdu poetry generated using different deep-learning techniques and algorithms. The data was collected through the Rekhta website, containing 1341 text files with several couplets. The data on poetry was not from any specific genre or poet. Instead, it was a collection of mixed Urdu poems and Ghazals. Different deep learning techniques, such as the model applied Long Short-term… ▽ More This study provides Urdu poetry generated using different deep-learning techniques and algorithms. The data was collected through the Rekhta website, containing 1341 text files with several couplets. The data on poetry was not from any specific genre or poet. Instead, it was a collection of mixed Urdu poems and Ghazals. Different deep learning techniques, such as the model applied Long Short-term Memory Networks (LSTM) and Gated Recurrent Unit (GRU), have been used. Natural Language Processing (NLP) may be used in machine learning to understand, analyze, and generate a language humans may use and understand. Much work has been done on generating poetry for different languages using different techniques. The collection and use of data were also different for different researchers. The primary purpose of this project is to provide a model that generates Urdu poems by using data completely, not by sampling data. Also, this may generate poems in pure Urdu, not Roman Urdu, as in the base paper. The results have shown good accuracy in the poems generated by the model. △ Less

Submitted 25 September, 2023; originally announced September 2023.

Comments: 11 pages, 2 figures

arXiv:2307.16679 [pdf, other]

Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Authors: Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba

Abstract: Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosod… ▽ More Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody models. △ Less

Submitted 31 July, 2023; originally announced July 2023.

Comments: 5 pages, 2 figures, 5 tables. Interspeech 2023

arXiv:2307.07062 [pdf, other]

Controllable Emphasis with zero data for text-to-speech

Authors: Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova

Abstract: We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques im… ▽ More We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques improving naturalness by $7.3\%$ and correct testers' identification of the emphasized word in a sentence by $40\%$ on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles. △ Less

Submitted 13 July, 2023; originally announced July 2023.

Comments: In proceeding of 12th Speech Synthesis Workshop (SSW) 2023

arXiv:2306.11327 [pdf, other]

eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer

Authors: Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman

Abstract: We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion… ▽ More We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion from speech. In Stage II, we learn to predict the prosody representations using the contextual information available in text. We compare eCat to CopyCat2, a model capable of both fine-grained prosody transfer (FPT) and multi-speaker TTS. We show that eCat statistically significantly reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7% across 2 languages, 3 locales, and 7 speakers, along with better target-speaker similarity in FPT. We also compare eCat to VITS, and show a statistically significant preference. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: Accepted to be published in the Proceedings of InterSpeech 2023

arXiv:2305.15578 [pdf]

An Open-Access Database of Active-source and Passive-wavefield DAS and Nodal Station Measurements at the Newberry Florida Site

Authors: Aser Abbas, Brady R. Cox, Khiem T. Tran, Isabella Corey, Nishkarsha Dawadi

Abstract: This paper documents a comprehensive subsurface imaging experiment using stress waves in Newberry, Florida, at a site known for significant spatial variability, karstic voids, and underground anomalies. The experiment utilized advanced sensing technologies, including approximately two kilometers of distributed acoustic sensing (DAS) fiber optic cable, forming a dense 2D array of 1920 channels, and… ▽ More This paper documents a comprehensive subsurface imaging experiment using stress waves in Newberry, Florida, at a site known for significant spatial variability, karstic voids, and underground anomalies. The experiment utilized advanced sensing technologies, including approximately two kilometers of distributed acoustic sensing (DAS) fiber optic cable, forming a dense 2D array of 1920 channels, and a 2D array of 144 three-component nodal stations, to sense active-source and passive-wavefield stress waves. The active-source data was generated using a vibroseis shaker truck and impact sources, and it was simultaneously sensed by both the DAS and the nodal stations. The vibroseis truck was used to excite the ground in the three directions at 260 locations inside and outside the instrumented array, while the impact sources were used at 268 locations within the instrumented array. The passive-wavefield data recorded using the nodal stations comprised 48 hours of ambient noise collected over a period of four days in four twelve-hour time blocks. Meanwhile, the passive wavefield data collected using DAS consisted of four hours of ambient noise recordings. This paper aims to provide a comprehensive overview of the testing site, experiment layout, the DAS and nodal station acquisition parameters, implemented processing steps, and potential use cases of the dataset. While potential use cases, such as surface wave testing, full waveform inversion, and ambient noise tomography, are discussed relative to example data, the focus of this paper is on documenting this unique dataset rather than on processing the data for detecting anomalies or generating subsurface 2D/3D imaging results. The raw and processed data, along with detailed documentation of the experiment and Python tools to aid in visualizing the DAS dataset have been archived and made publicly available on DesignSafe under project PRJ-3521. △ Less

Submitted 24 May, 2023; originally announced May 2023.

Comments: 33 pages, 12 figures, dataset paper

arXiv:2305.13362 [pdf, other]

On quantum backpropagation, information reuse, and cheating measurement collapse

Authors: Amira Abbas, Robbie King, Hsin-Yuan Huang, William J. Huggins, Ramis Movassagh, Dar Gilboa, Jarrod R. McClean

Abstract: The success of modern deep learning hinges on the ability to train neural networks at scale. Through clever reuse of intermediate information, backpropagation facilitates training through gradient computation at a total cost roughly proportional to running the function, rather than incurring an additional factor proportional to the number of parameters - which can now be in the trillions. Naively,… ▽ More The success of modern deep learning hinges on the ability to train neural networks at scale. Through clever reuse of intermediate information, backpropagation facilitates training through gradient computation at a total cost roughly proportional to running the function, rather than incurring an additional factor proportional to the number of parameters - which can now be in the trillions. Naively, one expects that quantum measurement collapse entirely rules out the reuse of quantum information as in backpropagation. But recent developments in shadow tomography, which assumes access to multiple copies of a quantum state, have challenged that notion. Here, we investigate whether parameterized quantum models can train as efficiently as classical neural networks. We show that achieving backpropagation scaling is impossible without access to multiple copies of a state. With this added ability, we introduce an algorithm with foundations in shadow tomography that matches backpropagation scaling in quantum resources while reducing classical auxiliary computational costs to open problems in shadow tomography. These results highlight the nuance of reusing quantum information for practical purposes and clarify the unique difficulties in training large quantum models, which could alter the course of quantum machine learning. △ Less

Submitted 22 May, 2023; originally announced May 2023.

Comments: 29 pages, 2 figures

Journal ref: Advances in Neural Information Processing Systems 36 (2024)

arXiv:2303.09540 [pdf, other]

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Authors: Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, Ari S. Morcos

Abstract: Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates: data pairs which are semantically… ▽ More Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates: data pairs which are semantically similar, but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time. Moreover, performance increases out of distribution. Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains. SemDeDup provides an example of how simple ways of leveraging quality embeddings can be used to make models learn faster with less data. △ Less

Submitted 22 March, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

arXiv:2301.12159 [pdf, ps, other]

ClusterFuG: Clustering Fully connected Graphs by Multicut

Authors: Ahmed Abbas, Paul Swoboda

Abstract: We propose a graph clustering formulation based on multicut (a.k.a. weighted correlation clustering) on the complete graph. Our formulation does not need specification of the graph topology as in the original sparse formulation of multicut, making our approach simpler and potentially better performing. In contrast to unweighted correlation clustering we allow for a more expressive weighted cost st… ▽ More We propose a graph clustering formulation based on multicut (a.k.a. weighted correlation clustering) on the complete graph. Our formulation does not need specification of the graph topology as in the original sparse formulation of multicut, making our approach simpler and potentially better performing. In contrast to unweighted correlation clustering we allow for a more expressive weighted cost structure. In dense multicut, the clustering objective is given in a factorized form as inner products of node feature vectors. This allows for an efficient formulation and inference in contrast to multicut/weighted correlation clustering, which has at least quadratic representation and computation complexity when working on the complete graph. We show how to rewrite classical greedy algorithms for multicut in our dense setting and how to modify them for greater efficiency and solution quality. In particular, our algorithms scale to graphs with tens of thousands of nodes. Empirical evidence on instance segmentation on Cityscapes and clustering of ImageNet datasets shows the merits of our approach. △ Less

Submitted 5 June, 2023; v1 submitted 28 January, 2023; originally announced January 2023.

Comments: ICML 2023

arXiv:2207.09580 [pdf]

A Frequency-Velocity CNN for Developing Near-Surface 2D Vs Images from Linear-Array, Active-Source Wavefield Measurements

Authors: Aser Abbas, Joseph P. Vantassel, Brady R. Cox, Krishna Kumar, Jodie Crocker

Abstract: This paper presents a frequency-velocity convolutional neural network (CNN) for rapid, non-invasive 2D shear wave velocity (Vs) imaging of near-surface geo-materials. Operating in the frequency-velocity domain allows for significant flexibility in the linear-array, active-source experimental testing configurations used for generating the CNN input, which are normalized dispersion images. Unlike wa… ▽ More This paper presents a frequency-velocity convolutional neural network (CNN) for rapid, non-invasive 2D shear wave velocity (Vs) imaging of near-surface geo-materials. Operating in the frequency-velocity domain allows for significant flexibility in the linear-array, active-source experimental testing configurations used for generating the CNN input, which are normalized dispersion images. Unlike wavefield images, normalized dispersion images are relatively insensitive to the experimental testing configuration, accommodating various source types, source offsets, numbers of receivers, and receiver spacings. We demonstrate the effectiveness of the frequency-velocity CNN by applying it to a classic near-surface geophysics problem, namely, imaging a two-layer, undulating, soil-over-bedrock interface. This problem was recently investigated in our group by developing a time-distance CNN, which showed great promise but lacked flexibility in utilizing different field-testing configurations. Herein, the new frequency-velocity CNN is shown to have comparable accuracy to the time-distance CNN while providing greater flexibility to handle varied field applications. The frequency-velocity CNN was trained, validated, and tested using 100,000 synthetic near-surface models. The ability of the proposed frequency-velocity CNN to generalize across various acquisition configurations is first tested using synthetic near-surface models with different acquisition configurations from that of the training set, and then applied to experimental field data collected at the Hornsby Bend site in Austin, Texas, USA. When fully developed for a wider range of geological conditions, the proposed CNN may ultimately be used as a rapid, end-to-end alternative for current pseudo-2D surface wave imaging techniques or to develop starting models for full waveform inversion. △ Less

Submitted 19 July, 2022; originally announced July 2022.

Comments: 34 pages, 13 figures, 2 tables

arXiv:2207.08034 [pdf, other]

Progress and limitations of deep networks to recognize objects in unusual poses

Authors: Amro Abbas, Stéphane Deny

Abstract: Deep networks should be robust to rare events if they are to be successfully deployed in high-stakes real-world applications (e.g., self-driving cars). Here we study the capability of deep networks to recognize objects in unusual poses. We create a synthetic dataset of images of objects in unusual orientations, and evaluate the robustness of a collection of 38 recent and competitive deep networks… ▽ More Deep networks should be robust to rare events if they are to be successfully deployed in high-stakes real-world applications (e.g., self-driving cars). Here we study the capability of deep networks to recognize objects in unusual poses. We create a synthetic dataset of images of objects in unusual orientations, and evaluate the robustness of a collection of 38 recent and competitive deep networks for image classification. We show that classifying these images is still a challenge for all networks tested, with an average accuracy drop of 29.5% compared to when the objects are presented upright. This brittleness is largely unaffected by various network design choices, such as training losses (e.g., supervised vs. self-supervised), architectures (e.g., convolutional networks vs. transformers), dataset modalities (e.g., images vs. image-text pairs), and data-augmentation schemes. However, networks trained on very large datasets substantially outperform others, with the best network tested$\unicode{x2014}$Noisy Student EfficentNet-L2 trained on JFT-300M$\unicode{x2014}$showing a relatively small accuracy drop of only 14.5% on unusual poses. Nevertheless, a visual inspection of the failures of Noisy Student reveals a remaining gap in robustness with the human visual system. Furthermore, combining multiple object transformations$\unicode{x2014}$3D-rotations and scaling$\unicode{x2014}$further degrades the performance of all networks. Altogether, our results provide another measurement of the robustness of deep networks that is important to consider when using them in the real world. Code and datasets are available at https://github.com/amro-kamal/ObjectPose. △ Less

Submitted 16 July, 2022; originally announced July 2022.

arXiv:2206.14643 [pdf, other]

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

Authors: Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou

Abstract: Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on m… ▽ More Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on multi-speaker data all improve prosody. More interestingly, they result in synergies. Long context disambiguates prosody, improves coherence, and plays to the strengths of Transformers. Fine-tuning word-level features from a powerful language model, such as BERT, appears to profit from more training data, readily available in a multi-speaker setting. We look into objective metrics on pausing and pacing and perform thorough subjective evaluations for speech naturalness. Our main system, which incorporates all the extensions, achieves consistently strong results, including statistically significant improvements in speech naturalness over all its competitors. △ Less

Submitted 29 June, 2022; originally announced June 2022.

Comments: Accepted to be published in the Proceedings of InterSpeech 2022

arXiv:2206.14165 [pdf, other]

Expressive, Variable, and Controllable Duration Modelling in TTS

Authors: Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman

Abstract: Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. The current approaches largely fall back to relying on previous statistical parametric speech synthesis technology for duration prediction, which poorly models the expressiveness and variability in speech. In this paper, we propose two alternate approaches to improve… ▽ More Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. The current approaches largely fall back to relying on previous statistical parametric speech synthesis technology for duration prediction, which poorly models the expressiveness and variability in speech. In this paper, we propose two alternate approaches to improve duration modelling. First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses. We show that the duration model conditioned on phrasing improves the naturalness of speech over our baseline duration model. Second, we also propose a multi-speaker duration model called Cauliflow, that uses normalising flows to predict durations that better match the complex target duration distribution. Cauliflow performs on par with our other proposed duration model in terms of naturalness, whilst providing variable durations for the same prompt and variable levels of expressiveness. Lastly, we propose to condition Cauliflow on parameters that provide an intuitive control of the pacing and pausing in the synthesised speech in a novel way. △ Less

Submitted 28 June, 2022; originally announced June 2022.

Comments: Accepted to be published in the Proceedings of InterSpeech 2022

arXiv:2206.13443 [pdf, other]

CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

Authors: Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

Abstract: In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel appro… ▽ More In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby, enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by $22.79\%$. In fine-grained prosody transfer evaluations, it obtains a relative improvement of $33.15\%$ in target speaker similarity. △ Less

Submitted 27 June, 2022; originally announced June 2022.

Comments: Accepted to be published in the Proceedings of InterSpeech 2022

arXiv:2206.13433 [pdf, other]

doi 10.1007/978-3-031-12670-3_12

Interpretable Hidden Markov Model-Based Deep Reinforcement Learning Hierarchical Framework for Predictive Maintenance of Turbofan Engines

Authors: Ammar N. Abbas, Georgios Chasparis, John D. Kelleher

Abstract: An open research question in deep reinforcement learning is how to focus the policy learning of key decisions within a sparse domain. This paper emphasizes combining the advantages of inputoutput hidden Markov models and reinforcement learning towards interpretable maintenance decisions. We propose a novel hierarchical-modeling methodology that, at a high level, detects and interprets the root cau… ▽ More An open research question in deep reinforcement learning is how to focus the policy learning of key decisions within a sparse domain. This paper emphasizes combining the advantages of inputoutput hidden Markov models and reinforcement learning towards interpretable maintenance decisions. We propose a novel hierarchical-modeling methodology that, at a high level, detects and interprets the root cause of a failure as well as the health degradation of the turbofan engine, while, at a low level, it provides the optimal replacement policy. It outperforms the baseline performance of deep reinforcement learning methods applied directly to the raw data or when using a hidden Markov model without such a specialized hierarchy. It also provides comparable performance to prior work, however, with the additional benefit of interpretability. △ Less

Submitted 11 January, 2023; v1 submitted 27 June, 2022; originally announced June 2022.

Journal ref: Preprint: International Conference on Big Data Analytics and Knowledge Discovery Proceedings, 2022

arXiv:2205.11638 [pdf, other]

DOGE-Train: Discrete Optimization on GPU with End-to-end Training

Authors: Ahmed Abbas, Paul Swoboda

Abstract: We present a fast, scalable, data-driven approach for solving relaxations of 0-1 integer linear programs. We use a combination of graph neural networks (GNN) and the Lagrange decomposition based algorithm FastDOG (Abbas and Swoboda 2022b). We make the latter differentiable for end-to-end training and use GNNs to predict its algorithmic parameters. This allows to retain the algorithm's theoretical… ▽ More We present a fast, scalable, data-driven approach for solving relaxations of 0-1 integer linear programs. We use a combination of graph neural networks (GNN) and the Lagrange decomposition based algorithm FastDOG (Abbas and Swoboda 2022b). We make the latter differentiable for end-to-end training and use GNNs to predict its algorithmic parameters. This allows to retain the algorithm's theoretical properties including dual feasibility and guaranteed non-decrease in the lower bound while improving it via training. We overcome suboptimal fixed points of the basic solver by additional non-parametric GNN update steps maintaining dual feasibility. For training we use an unsupervised loss. We train on smaller problems and test on larger ones showing strong generalization performance with a GNN comprising only around $10k$ parameters. Our solver achieves significantly faster performance and better dual objectives than its non-learned version, achieving close to optimal objective values of LP relaxations of very large structured prediction problems and on selected combinatorial ones. In particular, we achieve better objective values than specialized approximate solvers for specific problem classes while retaining their efficiency. Our solver has better any-time performance over a large time period compared to a commercial solver. Code available at https://github.com/LPMP/BDD △ Less

Submitted 28 December, 2023; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: AAAI 2024. Alert before printing: pg. 16-20 only contain per instance results, can possibly be skipped

arXiv:2202.03574 [pdf, other]

Structured Prediction Problem Archive

Authors: Paul Swoboda, Bjoern Andres, Andrea Hornakova, Florian Bernard, Jannik Irmai, Paul Roetzer, Bogdan Savchynskyy, David Stein, Ahmed Abbas

Abstract: Structured prediction problems are one of the fundamental tools in machine learning. In order to facilitate algorithm development for their numerical solution, we collect in one place a large number of datasets in easy to read formats for a diverse set of problem classes. We provide archival links to datasets, description of the considered problems and problem formats, and a short summary of probl… ▽ More Structured prediction problems are one of the fundamental tools in machine learning. In order to facilitate algorithm development for their numerical solution, we collect in one place a large number of datasets in easy to read formats for a diverse set of problem classes. We provide archival links to datasets, description of the considered problems and problem formats, and a short summary of problem characteristics including size, number of instances etc. For reference we also give a non-exhaustive selection of algorithms proposed in the literature for their solution. We hope that this central repository will make benchmarking and comparison to established works easier. We welcome submission of interesting new datasets and algorithms for inclusion in our archive. △ Less

Submitted 17 November, 2023; v1 submitted 4 February, 2022; originally announced February 2022.

Comments: Added multicast instances from Andres group

arXiv:2201.00855 [pdf]

AI & Racial Equity: Understanding Sentiment Analysis Artificial Intelligence, Data Security, and Systemic Theory in Criminal Justice Systems

Authors: Alia Abbas

Abstract: Various forms of implications of artificial intelligence that either exacerbate or decrease racial systemic injustice have been explored in this applied research endeavor. Taking each thematic area of identifying, analyzing, and debating an systemic issue have been leveraged in investigating merits and drawbacks of using algorithms to automate human decision making in racially sensitive environmen… ▽ More Various forms of implications of artificial intelligence that either exacerbate or decrease racial systemic injustice have been explored in this applied research endeavor. Taking each thematic area of identifying, analyzing, and debating an systemic issue have been leveraged in investigating merits and drawbacks of using algorithms to automate human decision making in racially sensitive environments. It has been asserted through the analysis of historical systemic patterns, implicit biases, existing algorithmic risks, and legal implications that natural language processing based AI, such as risk assessment tools, have racially disparate outcomes. It is concluded that more litigative policies are needed to regulate and restrict how internal government institutions and corporations utilize algorithms, privacy and security risks, and auditing requirements in order to diverge from racially injustice outcomes and practices of the past. △ Less

Submitted 3 January, 2022; originally announced January 2022.

Comments: 25 pages

arXiv:2112.04807 [pdf, other]

Effective dimension of machine learning models

Authors: Amira Abbas, David Sutter, Alessio Figalli, Stefan Woerner

Abstract: Making statements about the performance of trained models on tasks involving new data is one of the primary goals of machine learning, i.e., to understand the generalization power of a model. Various capacity measures try to capture this ability, but usually fall short in explaining important characteristics of models that we observe in practice. In this study, we propose the local effective dimen… ▽ More Making statements about the performance of trained models on tasks involving new data is one of the primary goals of machine learning, i.e., to understand the generalization power of a model. Various capacity measures try to capture this ability, but usually fall short in explaining important characteristics of models that we observe in practice. In this study, we propose the local effective dimension as a capacity measure which seems to correlate well with generalization error on standard data sets. Importantly, we prove that the local effective dimension bounds the generalization error and discuss the aptness of this capacity measure for machine learning models. △ Less

Submitted 9 December, 2021; originally announced December 2021.

Comments: 17 pages, 2 figures

arXiv:2111.10270 [pdf, other]

FastDOG: Fast Discrete Optimization on GPU

Authors: Ahmed Abbas, Paul Swoboda

Abstract: We present a massively parallel Lagrange decomposition method for solving 0--1 integer linear programs occurring in structured prediction. We propose a new iterative update scheme for solving the Lagrangean dual and a perturbation technique for decoding primal solutions. For representing subproblems we follow Lange et al. (2021) and use binary decision diagrams (BDDs). Our primal and dual algorith… ▽ More We present a massively parallel Lagrange decomposition method for solving 0--1 integer linear programs occurring in structured prediction. We propose a new iterative update scheme for solving the Lagrangean dual and a perturbation technique for decoding primal solutions. For representing subproblems we follow Lange et al. (2021) and use binary decision diagrams (BDDs). Our primal and dual algorithms require little synchronization between subproblems and optimization over BDDs needs only elementary operations without complicated control flow. This allows us to exploit the parallelism offered by GPUs for all components of our method. We present experimental results on combinatorial problems from MAP inference for Markov Random Fields, quadratic assignment and cell tracking for developmental biology. Our highly parallel GPU implementation improves upon the running times of the algorithms from Lange et al. (2021) by up to an order of magnitude. In particular, we come close to or outperform some state-of-the-art specialized heuristics while being problem agnostic. Our implementation is available at https://github.com/LPMP/BDD. △ Less

Submitted 19 April, 2022; v1 submitted 19 November, 2021; originally announced November 2021.

Comments: Published at CVPR 2022. Alert before printing: last 10 pages just contains detailed results table

arXiv:2109.01838 [pdf, other]

RAMA: A Rapid Multicut Algorithm on GPU

Authors: Ahmed Abbas, Paul Swoboda

Abstract: We propose a highly parallel primal-dual algorithm for the multicut (a.k.a. correlation clustering) problem, a classical graph clustering problem widely used in machine learning and computer vision. Our algorithm consists of three steps executed recursively: (1) Finding conflicted cycles that correspond to violated inequalities of the underlying multicut relaxation, (2) Performing message passing… ▽ More We propose a highly parallel primal-dual algorithm for the multicut (a.k.a. correlation clustering) problem, a classical graph clustering problem widely used in machine learning and computer vision. Our algorithm consists of three steps executed recursively: (1) Finding conflicted cycles that correspond to violated inequalities of the underlying multicut relaxation, (2) Performing message passing between the edges and cycles to optimize the Lagrange relaxation coming from the found violated cycles producing reduced costs and (3) Contracting edges with high reduced costs through matrix-matrix multiplications. Our algorithm produces primal solutions and lower bounds that estimate the distance to optimum. We implement our algorithm on GPUs and show resulting one to two orders-of-magnitudes improvements in execution speed without sacrificing solution quality compared to traditional sequential algorithms that run on CPUs. We can solve very large scale benchmark problems with up to $\mathcal{O}(10^8)$ variables in a few seconds with small primal-dual gaps. Our code is available at https://github.com/pawelswoboda/RAMA. △ Less

Submitted 11 March, 2022; v1 submitted 4 September, 2021; originally announced September 2021.

Comments: Published in CVPR 2022

arXiv:2106.15649 [pdf, other]

Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

Authors: Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati, Thomas Drugman

Abstract: We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale me… ▽ More We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale mel-spectrograms capturing fine-grained prosody. We present details for two specific versions of MSS called Word-level MSS and Sentence-level MSS where the scales in our system are motivated by the linguistic units. The Word-level MSS models word, phoneme, and frame-level spectrograms while Sentence-level MSS models sentence-level spectrogram in addition. Subjective evaluations show that Word-level MSS performs statistically significantly better compared to the baseline on two voices. △ Less

Submitted 29 June, 2021; originally announced June 2021.

Comments: Accepted for the 11th ISCA Speech Synthesis Workshop (SSW11)

arXiv:2106.10229 [pdf, other]

A learned conditional prior for the VAE acoustic space of a TTS system

Authors: Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo Trueba, Thomas Drugman

Abstract: Many factors influence speech yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for… ▽ More Many factors influence speech yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system. By doing so, we aim to sample with more prosodic variability, while gaining controllability over the latent space's structure. By using as prior the posterior distribution of a secondary VAE, which we condition on a speaker vector, we can sample from the primary VAE taking explicitly the conditioning into account and resulting in samples from a specific region of the latent space for each condition (i.e. speaker). A formal preference test demonstrates significant preference of the proposed approach over standard Conditional VAE. We also provide visualisations of the latent space where well-separated condition-specific clusters appear, as well as ablation studies to better understand the behaviour of the system. △ Less

Submitted 14 June, 2021; originally announced June 2021.

Comments: in Proceedings of Interspeech 2021

arXiv:2106.07015 [pdf]

Siamese Network Training Using Artificial Triplets By Sampling and Image Transformation

Authors: Ammar N. Abbas, David Moser

Abstract: The device used in this work detects the objects over the surface of the water using two thermal cameras which aid the users to detect and avoid the objects in scenarios where the human eyes cannot (night, fog, etc.). To avoid the obstacle collision autonomously, it is required to track the objects in real-time and assign a specific identity to each object to determine its dynamics (trajectory, ve… ▽ More The device used in this work detects the objects over the surface of the water using two thermal cameras which aid the users to detect and avoid the objects in scenarios where the human eyes cannot (night, fog, etc.). To avoid the obstacle collision autonomously, it is required to track the objects in real-time and assign a specific identity to each object to determine its dynamics (trajectory, velocity, etc.) for making estimated collision predictions. In the following work, a Machine Learning (ML) approach for Computer Vision (CV) called Convolutional Neural Network (CNN) was used using TensorFlow as the high-level programming environment in Python. To validate the algorithm a test set was generated using an annotation tool that was created during the work for proper evaluation. Once validated, the algorithm was deployed on the platform and tested with the sequence generated by the test boat. △ Less

Submitted 26 September, 2021; v1 submitted 13 June, 2021; originally announced June 2021.

arXiv:2106.07003 [pdf]

Experimental Analysis of Trajectory Control Using Computer Vision and Artificial Intelligence for Autonomous Vehicles

Authors: Ammar N. Abbas, Muhammad Asad Irshad, Hossam Hassan Ammar

Abstract: Perception of the lane boundaries is crucial for the tasks related to autonomous trajectory control. In this paper, several methodologies for lane detection are discussed with an experimental illustration: Hough transformation, Blob analysis, and Bird's eye view. Following the abstraction of lane marks from the boundary, the next approach is applying a control law based on the perception to contro… ▽ More Perception of the lane boundaries is crucial for the tasks related to autonomous trajectory control. In this paper, several methodologies for lane detection are discussed with an experimental illustration: Hough transformation, Blob analysis, and Bird's eye view. Following the abstraction of lane marks from the boundary, the next approach is applying a control law based on the perception to control steering and speed control. In the following, a comparative analysis is made between an open-loop response, PID control, and a neural network control law through graphical statistics. To get the perception of the surrounding a wireless streaming camera connected to Raspberry Pi is used. After pre-processing the signal received by the camera the output is sent back to the Raspberry Pi that processes the input and communicates the control to the motors through Arduino via serial communication. △ Less

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2106.03188 [pdf, other]

Combinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach

Authors: Ahmed Abbas, Paul Swoboda

Abstract: We propose a fully differentiable architecture for simultaneous semantic and instance segmentation (a.k.a. panoptic segmentation) consisting of a convolutional neural network and an asymmetric multiway cut problem solver. The latter solves a combinatorial optimization problem that elegantly incorporates semantic and boundary predictions to produce a panoptic labeling. Our formulation allows to dir… ▽ More We propose a fully differentiable architecture for simultaneous semantic and instance segmentation (a.k.a. panoptic segmentation) consisting of a convolutional neural network and an asymmetric multiway cut problem solver. The latter solves a combinatorial optimization problem that elegantly incorporates semantic and boundary predictions to produce a panoptic labeling. Our formulation allows to directly maximize a smooth surrogate of the panoptic quality metric by backpropagating the gradient through the optimization problem. Experimental evaluation shows improvement by backpropagating through the optimization problem w.r.t. comparable approaches on Cityscapes and COCO datasets. Overall, our approach shows the utility of using combinatorial optimization in tandem with deep learning in a challenging large scale real-world problem and showcases benefits and insights into training such an architecture. △ Less

Submitted 25 October, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

Comments: To be presented at NeurIPS 2021

arXiv:2105.01133 [pdf, other]

Prediction of clinical tremor severity using Rank Consistent Ordinal Regression

Authors: Li Zhang, Vijay Yadav, Vidya Koesmahargyo, Anzar Abbas, Isaac Galatzer-Levy

Abstract: Tremor is a key diagnostic feature of Parkinson's Disease (PD), Essential Tremor (ET), and other central nervous system (CNS) disorders. Clinicians or trained raters assess tremor severity with TETRAS scores by observing patients. Lacking quantitative measures, inter- or intra- observer variabilities are almost inevitable as the distinction between adjacent tremor scores is subtle. Moreover, clini… ▽ More Tremor is a key diagnostic feature of Parkinson's Disease (PD), Essential Tremor (ET), and other central nervous system (CNS) disorders. Clinicians or trained raters assess tremor severity with TETRAS scores by observing patients. Lacking quantitative measures, inter- or intra- observer variabilities are almost inevitable as the distinction between adjacent tremor scores is subtle. Moreover, clinician assessments also require patient visits, which limits the frequency of disease progress evaluation. Therefore it is beneficial to develop an automated assessment that can be performed remotely and repeatably at patients' convenience for continuous monitoring. In this work, we proposed to train a deep neural network (DNN) with rank-consistent ordinal regression using 276 clinical videos from 36 essential tremor patients. The videos are coupled with clinician assessed TETRAS scores, which are used as ground truth labels to train the DNN. To tackle the challenge of limited training data, optical flows are used to eliminate irrelevant background and statistic objects from RGB frames. In addition to optical flows, transfer learning is also applied to leverage pre-trained network weights from a related task of tremor frequency estimate. The approach was evaluated by splitting the clinical videos into training (67%) and testing sets (0.33%). The mean absolute error on TETRAS score of the testing results is 0.45, indicating that most of the errors were from the mismatch of adjacent labels, which is expected and acceptable. The model predications also agree well with clinical ratings. This model is further applied to smart phone videos collected from a PD patient who has an implanted device to turn "On" or "Off" tremor. The model outputs were consistent with the patient tremor states. The results demonstrate that our trained model can be used as a means to assess and track tremor severity. △ Less

Submitted 3 May, 2021; originally announced May 2021.

arXiv:2102.10810 [pdf, other]

doi 10.1103/PhysRevA.103.053317

Rapid generation of metastable helium Bose-Einstein condensates

Authors: A. H. Abbas, X. Meng, R. S. Patil, J. A. Ross, A. G. Truscott, S. S. Hodgman

Abstract: We report the realisation of Bose-Einstein condensation (BEC) of metastable helium atoms using an in-vacuum coil magnetic trap and a crossed beam optical dipole trap. A novel quadrupole-Ioffe configuration (QUIC) magnetic trap made from in-vacuum hollow copper tubes provides fast switching times while generating traps with a 10G bias, without compromising optical access. The bias enables in-trap 1… ▽ More We report the realisation of Bose-Einstein condensation (BEC) of metastable helium atoms using an in-vacuum coil magnetic trap and a crossed beam optical dipole trap. A novel quadrupole-Ioffe configuration (QUIC) magnetic trap made from in-vacuum hollow copper tubes provides fast switching times while generating traps with a 10G bias, without compromising optical access. The bias enables in-trap 1D doppler cooling to be used, which is the only cooling stage between the magneto-optic trap (MOT) and the optical dipole trap. This allows direct transfer to the dipole trap without the need for any additional evaporative cooling in the magnetic trap. The entire experimental sequence takes 3.3 seconds, with essentially pure BECs observed with $\sim 10^6$ atoms after evaporative cooling in the dipole trap. △ Less

Submitted 22 February, 2021; originally announced February 2021.

Comments: 6 pages, 4 figures

Journal ref: Phys. Rev. A 103, 053317 (2021)

Showing 1–50 of 166 results for author: Abbas, A