Search | arXiv e-print repository

arXiv:2407.20936 [pdf, other]

Non-classical excitation of a solid-state quantum emitter

Authors: Lena M. Hansen, Francesco Giorgino, Lennart Jehle, Lorenzo Carosini, Juan Camilo López Carreño, Iñigo Arrazola, Philip Walther, Juan C. Loredo

Abstract: The interaction between a single emitter and a single photon is a fundamental aspect of quantum optics. This interaction allows for the study of various quantum processes, such as emitter-mediated single-photon scattering and effective photon-photon interactions. However, empirical observations of this scenario and its dynamics are rare, and in most cases, only partial approximations to the fully quantized case have been possible. Here, we demonstrate the resonant excitation of a solid-state quantum emitter using quantized input light. For this light-matter interaction, with both entities quantized, we observe single-photon interference introduced by the emitter in a coherent scattering process, photon-number-dependent optical non-linearities, and stimulated emission processes involving only two photons. We theoretically reproduce our observations using a cascaded master equation model. Our findings demonstrate that a single photon is sufficient to change the state of a solid-state quantum emitter, and efficient emitter-mediated photon-photon interactions are feasible. These results suggest future possibilities ranging from enabling quantum information transfer in a quantum network to building deterministic entangling gates for photonic quantum computing.

Submitted 30 July, 2024; originally announced July 2024.

Comments: 13 pages, 8 figures

arXiv:2407.19677 [pdf, other]

Navigating the United States Legislative Landscape on Voice Privacy: Existing Laws, Proposed Bills, Protection for Children, and Synthetic Data for AI

Authors: Satwik Dutta, John H. L. Hansen

Abstract: Privacy is a hot topic for policymakers across the globe, including the United States. Evolving advances in AI and emerging concerns about the misuse of personal data have pushed policymakers to draft legislation on trustworthy AI and privacy protection for its citizens. This paper presents the state of the privacy legislation at the U.S. Congress and outlines how voice data is considered as part of the legislation definition. This paper also reviews additional privacy protection for children. This paper presents a holistic review of enacted and proposed privacy laws, and consideration for voice data, including guidelines for processing children's data, in those laws across the fifty U.S. states. As a groundbreaking alternative to actual human data, ethically generated synthetic data allows much flexibility to keep AI innovation in progress. Because policymakers' consideration of synthetic data in AI legislation is relatively new compared to that of privacy laws, this paper reviews regulatory considerations for synthetic data.

Submitted 28 July, 2024; originally announced July 2024.

Comments: 5 pages, 2 figures, accepted at the Interspeech SynData4GenAI 2024 workshop

ACM Class: I.2; J.1

arXiv:2407.05959 [pdf, other]

Time Series Dataset for Modeling and Forecasting of $N_2O$ in Wastewater Treatment

Authors: Laura Debel Hansen, Anju Rani, Mikkel Algren Stokholm-Bjerregaard, Peter Alexander Stentoft, Daniel Ortiz Arroyo, Petar Durdevic

Abstract: In this paper, we present two years of high-resolution nitrous oxide ($N_2O$) measurements for time series modeling and forecasting in wastewater treatment plants (WWTP). The dataset comprises frequent, real-time measurements from a full-scale WWTP, with a sample interval of 2 minutes, making it ideal for developing models for real-time operation and control. This comprehensive bio-chemical dataset includes detailed influent and effluent parameters, operational conditions, and environmental factors. Unlike existing datasets, it addresses the unique challenges of modeling $N_2O$, a potent greenhouse gas, providing a valuable resource for researchers to enhance predictive accuracy and control strategies in wastewater treatment processes. Additionally, this dataset significantly contributes to the fields of machine learning and deep learning time series forecasting by serving as a benchmark that mirrors the complexities of real-world processes, thus facilitating advancements in these domains. We provide a detailed description of the dataset along with a statistical analysis to highlight its characteristics, such as nonstationarity, nonnormality, seasonality, heteroscedasticity, structural breaks, asymmetric distributions, and intermittency, which are common in many real-world time series datasets and pose challenges for forecasting models.

Submitted 8 July, 2024; originally announced July 2024.

Comments: 10 pages, 4 figures. This publication accompanies the Mendeley dataset available at this URL (version 1): https://data.mendeley.com/datasets/xmbxhscgpr/1

arXiv:2407.04982 [pdf]

Microstructural and Micromechanical Evolution of Olivine Aggregates During Transient Creep

Authors: Harison S. Wiesman, Thomas Breithaupt, David Wallis, Lars N. Hansen

Abstract: To examine the microstructural evolution that occurs during transient creep, we deformed olivine aggregates to different strains that spanned the initial transient deformation. Two sets of samples with different initial grain sizes of 5 $μ$m and 20 $μ$m were deformed in torsion at T = 1523 K, P = 300 MPa, and a constant shear strain rate of 1.5 $\times$ 10$^{-4}$ s$^{-1}$. Both sets of samples experienced strain hardening during deformation. We characterized the microstructures at the end of each experiment using high-angular resolution electron backscatter diffraction (HR-EBSD) and dislocation decoration. In the coarse-grained samples, dislocation density increased from 1.5 $\times$ 10$^{11}$ m$^{-2}$ to 3.6 $\times$ 10$^{12}$ m$^{-2}$ with strain. Although the same final dislocation density was reached in the fine-grained samples, it did not vary significantly at small strains, potentially due to concurrent grain growth during deformation. In both sets of samples, HR-EBSD analysis revealed that intragranular stress heterogeneity increased in magnitude with strain and that elevated stresses are associated with regions of high geometrically necessary dislocation density. Further analysis of the stresses and their probability distributions indicates that the stresses are imparted by long-range elastic interactions among dislocations. These characteristics indicate that dislocation interactions were the primary cause of strain hardening during transient creep. A comparison of the results to predictions from three recent models reveals that the models do not correctly predict the evolution in stress and dislocation density with strain for our experiments due to a lack of previous such data in their calibrations.

Submitted 6 July, 2024; originally announced July 2024.

arXiv:2407.04291 [pdf, other]

We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings

Authors: Ismail Rasim Ulgen, Carlos Busso, John H. L. Hansen, Berrak Sisman

Abstract: In speech synthesis, modeling of rich emotions and prosodic variations present in human voice is crucial to synthesize natural speech. Although speaker embeddings have been widely used in personalized speech synthesis as conditioning inputs, they are designed to lose variation to optimize speaker recognition accuracy. Thus, they are suboptimal for speech synthesis in terms of modeling the rich variations at the output speech distribution. In this work, we propose a novel speaker embedding network which utilizes multiple class centers in the speaker classification training rather than a single class center as in traditional embeddings. The proposed approach introduces variations in the speaker embedding while retaining the speaker recognition performance, since the model does not have to map all of the utterances of a speaker into a single class center. We apply our proposed embedding in a voice conversion task and show that our method provides better naturalness and prosody in synthesized speech.

Submitted 5 July, 2024; originally announced July 2024.

Comments: Submitted to IEEE Signal Processing Letters

arXiv:2406.16364 [pdf]

The unpaved road towards efficient selective breeding in insects for food and feed

Authors: Laura Skrubbeltrang Hansen, Stine Frey Laursen, Simon Bahrndorff, Jesper Givskov Sørensen, Goutam Sahana, Torsten Nygaard Kristensen, Hanne Marie Nielsen

Abstract: Insect production for food and feed presents a promising supplement to ensure food safety and address the adverse impacts of agriculture on climate and environment in the future. However, optimisation is required for insect production to realise its full potential. This can be achieved by targeted improvement of traits of interest through selective breeding, an approach which has so far been underexplored and underutilised in insect farming. Here we present a comprehensive review of the selective breeding framework in the context of insect production. We systematically evaluate adjustments of selective breeding techniques to the realm of insects and highlight the essential components integral to the breeding process. The discussion covers every step of a conventional breeding scheme, such as formulation of breeding objectives, phenotyping, estimation of genetic parameters and breeding values, selection of appropriate breeding strategies, and mitigation of issues associated with genetic diversity depletion and inbreeding. This review combines knowledge from diverse disciplines, bridging the gap between animal breeding, quantitative genetics, evolutionary biology, and entomology, offering an integrated view of the insect breeding research area and uniting knowledge which has previously remained scattered across diverse fields of expertise.

Submitted 26 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.09981 [pdf, other]

Challenges in explaining deep learning models for data with biological variation

Authors: Lenka Tětková, Erik Schou Dreier, Robin Malm, Lars Kai Hansen

Abstract: Much machine learning research progress is based on developing models and evaluating them on a benchmark dataset (e.g., ImageNet for images). However, applying such benchmark-successful methods to real-world data often does not work as expected. This is particularly the case for biological data where we expect variability at multiple time and spatial scales. In this work, we use grain data, and the goal is to detect diseases and damages. Pink fusarium, skinned grains, and other diseases and damages are key factors in setting the price of grains or excluding dangerous grains from food production. Apart from challenges stemming from differences of the data from the standard toy datasets, we also present challenges that need to be overcome when explaining deep learning models. For example, explainability methods have many hyperparameters that can give different results, and the ones published in the papers do not work on dissimilar images. Other challenges are more general: problems with visualization of the explanations and their comparison since the magnitudes of their values differ from method to method. An open fundamental question is also how to evaluate explanations. It is a non-trivial task because the "ground truth" is usually missing or ill-defined. Also, human annotators may create what they think is an explanation of the task at hand, yet the machine learning model might solve it in a different and perhaps counter-intuitive way. We discuss several of these challenges and evaluate various post-hoc explainability methods on grain data. We focus on robustness, quality of explanations, and similarity to particular "ground truth" annotations made by experts. The goal is to find the methods that overall perform well and could be used in this challenging task. We hope the proposed pipeline will be used as a framework for evaluating explainability methods in specific use cases.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2405.19746 [pdf, other]

DenseSeg: Joint Learning for Semantic Segmentation and Landmark Detection Using Dense Image-to-Shape Representation

Authors: Ron Keuth, Lasse Hansen, Maren Balks, Ronja Jäger, Anne-Nele Schröder, Ludger Tüshaus, Mattias Heinrich

Abstract: Purpose: Semantic segmentation and landmark detection are fundamental tasks of medical image processing, facilitating further analysis of anatomical objects. Although deep learning-based pixel-wise classification has set a new state of the art for segmentation, it falls short in landmark detection, a strength of shape-based approaches. Methods: In this work, we propose a dense image-to-shape representation that enables the joint learning of landmarks and semantic segmentation by employing a fully convolutional architecture. Our method intuitively allows the extraction of arbitrary landmarks due to its representation of anatomical correspondences. We benchmark our method against the state of the art for semantic segmentation (nnUNet), a shape-based approach employing geometric deep learning, and a CNN-based method for landmark detection. Results: We evaluate our method on two medical datasets: one common benchmark featuring the lungs, heart, and clavicle from thorax X-rays, and another with 17 different bones in the paediatric wrist. While our method is on par with the landmark detection baseline in the thorax setting (error in mm of $2.6\pm0.9$ vs $2.7\pm0.9$), it substantially surpassed it in the more complex wrist setting ($1.1\pm0.6$ vs $1.9\pm0.5$). Conclusion: We demonstrate that dense geometric shape representation is beneficial for challenging landmark detection tasks and outperforms the previous state of the art using heatmap regression. It does not require explicit training on the landmarks themselves, allowing new landmarks to be added without retraining.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.12017 [pdf, other]

Spectrally resolved free electron-light coupling strength in a transition metal dichalcogenide

Authors: Niklas Müller, Soufiane el Kabil, Gerrit Vosse, Lina Hansen, Christopher Rathje, Sascha Schäfer

Abstract: Recent advancements in electron microscopy have introduced innovative techniques enabling the inelastic interaction of fast electrons with tightly confined and intense light fields. These techniques, commonly summarized under the term photon-induced near-field electron microscopy, now offer unprecedented capabilities for a precise mapping of the characteristics of optical near-fields with remarkable spatial resolution, but their spectral resolution has only scarcely been investigated. In this study, we employ a strongly chirped and temporally broadband light pulse to investigate the interaction between free electrons and light at the edge of a MoS2 thin film. Our approach unveils the details of electron-light coupling, revealing a pronounced dependence of the coupling strength on both the position and photon energy. Employing numerical simulations of a simplified model system, we identify these modulations to be caused by optical interferences between the incident and reflected field as well as an optical mode guided within the transition metal dichalcogenide film.

Submitted 20 May, 2024; originally announced May 2024.

Comments: 19 pages, 3 figures, 40 references

arXiv:2405.05049 [pdf]

Seeds of Stereotypes: A Large-Scale Textual Analysis of Race and Gender Associations with Diseases in Online Sources

Authors: Lasse Hyldig Hansen, Nikolaj Andersen, Jack Gallifant, Liam G. McCoy, James K Stone, Nura Izath, Marcela Aguirre-Jerez, Danielle S Bitterman, Judy Gichoya, Leo Anthony Celi

Abstract: Background: Advancements in Large Language Models (LLMs) hold transformative potential in healthcare; however, recent work has raised concern about the tendency of these models to produce outputs that display racial or gender biases. Although training data is a likely source of such biases, exploration of disease and demographic associations in text data at scale has been limited. Methods: We conducted a large-scale textual analysis using a dataset comprising diverse web sources, including arXiv, Wikipedia, and Common Crawl. The study analyzed the context in which various diseases are discussed alongside markers of race and gender. Given that LLMs are pre-trained on similar datasets, this approach allowed us to examine the potential biases that LLMs may learn and internalize. We compared these findings with actual demographic disease prevalence as well as GPT-4 outputs in order to evaluate the extent of bias representation. Results: Our findings indicate that demographic terms are disproportionately associated with specific disease concepts in online texts. Gender terms are prominently associated with disease concepts, while racial terms are much less frequently associated. We find widespread disparities in the associations of specific racial and gender terms with the 18 diseases analyzed. Most prominently, we see an overall significant overrepresentation of Black race mentions in comparison to population proportions. Conclusions: Our results highlight the need for critical examination and transparent reporting of biases in LLM pretraining datasets. Our study suggests the need to develop mitigation strategies to counteract the influence of biased training data in LLMs, particularly in sensitive domains such as healthcare.

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2404.07711 [pdf, other]

OpenTrench3D: A Photogrammetric 3D Point Cloud Dataset for Semantic Segmentation of Underground Utilities

Authors: Lasse H. Hansen, Simon B. Jensen, Mark P. Philipsen, Andreas Møgelmose, Lars Bodum, Thomas B. Moeslund

Abstract: Identifying and classifying underground utilities is an important task for efficient and effective urban planning and infrastructure maintenance. We present OpenTrench3D, a novel and comprehensive 3D Semantic Segmentation point cloud dataset, designed to advance research and development in underground utility surveying and mapping. OpenTrench3D covers a completely novel domain for public 3D point cloud datasets and is unique in its focus, scope, and cost-effective capturing method. The dataset consists of 310 point clouds collected across 7 distinct areas. These include 5 water utility areas and 2 district heating utility areas. The inclusion of different geographical areas and main utilities (water and district heating utilities) makes OpenTrench3D particularly valuable for inter-domain transfer learning experiments. We provide benchmark results for the dataset using three state-of-the-art semantic segmentation models, PointNeXt, PointVector and PointMetaBase. Benchmarks are conducted by training on data from water areas, fine-tuning on district heating area 1 and evaluating on district heating area 2. The dataset is publicly available. With OpenTrench3D, we seek to foster innovation and progress in the field of 3D semantic segmentation in applications related to detection and documentation of underground utilities as well as in transfer learning methods in general.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.07008 [pdf, other]

doi 10.1007/978-3-031-63787-2_9

Knowledge graphs for empirical concept retrieval

Authors: Lenka Tětková, Teresa Karen Scheidt, Maria Mandrup Fogh, Ellen Marie Gaunby Jørgensen, Finn Årup Nielsen, Lars Kai Hansen

Abstract: Concept-based explainable AI is promising as a tool to improve the understanding of complex models at the premises of a given user, viz. as a tool for personalized explainability. An important class of concept-based explainability methods is constructed with empirically defined concepts, indirectly defined through a set of positive and negative examples, as in the TCAV approach (Kim et al., 2018). While it is appealing to the user to avoid formal definitions of concepts and their operationalization, it can be challenging to establish relevant concept datasets. Here, we address this challenge using general knowledge graphs (e.g., Wikidata or WordNet) for comprehensive concept definition and present a workflow for user-driven data collection in both text and image domains. The concepts derived from knowledge graphs are defined interactively, providing an opportunity for personalization and ensuring that the concepts reflect the user's intentions. We test the retrieved concept datasets on two concept-based explainability methods, namely concept activation vectors (CAVs) and concept activation regions (CARs) (Crabbe and van der Schaar, 2022). We show that CAVs and CARs based on these empirical concept datasets provide robust and accurate explanations. Importantly, we also find good alignment between the models' representations of concepts and the structure of knowledge graphs, i.e., human representations. This supports our conclusion that knowledge graph-based concepts are relevant for XAI.

Submitted 10 April, 2024; originally announced April 2024.

Comments: Preprint. Accepted to The 2nd World Conference on eXplainable Artificial Intelligence

arXiv:2403.12866 [pdf, other]

Purifying photon indistinguishability through quantum interference

Authors: Carlos F. D. Faurby, Lorenzo Carosini, Huan Cao, Patrik I. Sund, Lena M. Hansen, Francesco Giorgino, Andrew B. Villadsen, Stefan N. van den Hoven, Peter Lodahl, Stefano Paesani, Juan C. Loredo, Philip Walther

Abstract: Indistinguishability between photons is a key requirement for scalable photonic quantum technologies. We experimentally demonstrate that partly distinguishable single photons can be purified to reach near-unity indistinguishability by the process of quantum interference with ancillary photons followed by heralded detection of a subset of them. We report on the indistinguishability of the purified photons by interfering two purified photons and show improvements in the photon indistinguishability of $2.774(3)\%$ in the low-noise regime, and as high as $10.2(5)\%$ in the high-noise regime.

Submitted 19 March, 2024; originally announced March 2024.

Comments: 14 pages, 7 figures

arXiv:2403.00293 [pdf, other]

Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification

Authors: Mufan Sang, John H. L. Hansen

Abstract: With excellent generalization ability, self-supervised speech models have shown impressive performance on various downstream speech tasks in the pre-training and fine-tuning paradigm. However, with the growing size of pre-trained models, fine-tuning becomes practically infeasible due to heavy computation and storage overhead, as well as the risk of overfitting. Adapters are lightweight modules inserted into pre-trained models to facilitate parameter-efficient adaptation. In this paper, we propose an effective adapter framework designed for adapting self-supervised speech models to the speaker verification task. With a parallel adapter design, our proposed framework inserts two types of adapters into the pre-trained model, allowing the adaptation of latent features within intermediate Transformer layers and output embeddings from all Transformer layers. We conduct comprehensive experiments to validate the efficiency and effectiveness of the proposed framework. Experimental results on the VoxCeleb1 dataset demonstrate that the proposed adapters surpass fine-tuning and other parameter-efficient transfer learning methods, achieving superior performance while updating only 5% of the parameters.

Submitted 1 March, 2024; originally announced March 2024.

Comments: Accepted to ICASSP 2024

arXiv:2401.06091 [pdf, other]

A Closer Look at AUROC and AUPRC under Class Imbalance

Authors: Matthew B. A. McDermott, Lasse Hyldig Hansen, Haoran Zhang, Giovanni Angelotti, Jack Gallifant

Abstract: In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.

Submitted 18 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.
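
For readers unfamiliar with the two metrics compared in the entry above, the following is a minimal sketch of their textbook definitions, not the paper's analysis or code. The function names `auroc` and `average_precision` are illustrative choices, AUPRC is approximated by average precision, and the ranking step assumes no tied scores.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; a tie between a positive and a negative counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision@rank taken at each positive (assumes no ties)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(labels, scores), average_precision(labels, scores))
```

AUROC averages over positive-negative pairs, while average precision weights each positive by the precision at its rank, which is why the two can respond differently to the same change in a model's ranking.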

arXiv:2311.18364 [pdf, other]

Hubness Reduction Improves Sentence-BERT Semantic Spaces

Authors: Beatrix M. G. Nielsen, Lars Kai Hansen

Abstract: Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighborhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.

Submitted 30 November, 2023; originally announced November 2023.

Comments: Accepted at NLDL 2024

arXiv:2311.08878 [pdf, other]

Multi-objective Non-intrusive Hearing-aid Speech Assessment Model

Authors: Hsin-Tien Chiang, Szu-Wei Fu, Hsin-Min Wang, Yu Tsao, John H. L. Hansen

Abstract: Without the need for a clean reference, non-intrusive speech assessment methods have attracted great attention for objective evaluations. While deep learning models have been used to develop non-intrusive speech assessment methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, called HASA-Net Large, which predicts speech quality and intelligibility scores based on input speech signals and specified hearing-loss patterns. Our experiments showed that the utilization of pre-trained SSL models leads to a significant boost in speech quality and intelligibility predictions compared to using spectrograms as input. Additionally, we examined three distinct fine-tuning approaches that resulted in further performance improvements. Furthermore, we demonstrated that incorporating SSL models resulted in greater transferability to an OOD dataset. Finally, this study introduces HASA-Net Large, which is a non-invasive approach for evaluating speech quality and intelligibility. HASA-Net Large utilizes raw waveforms and hearing-loss patterns to accurately predict speech quality and intelligibility levels for individuals with normal and impaired hearing and demonstrates superior prediction performance and transferability.

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.07264 [pdf, other]

Danish Foundation Models

Authors: Kenneth Enevoldsen, Lasse Hansen, Dan S. Nielsen, Rasmus A. F. Egebæk, Søren V. Holm, Martin C. Nielsen, Martin Bernstorff, Rasmus Larsen, Peter B. Jørgensen, Malte Højmark-Bertelsen, Peter B. Vahlstrup, Per Møldrup-Dalum, Kristoffer Nielbo

Abstract: Large language models, sometimes referred to as foundation models, have transformed multiple fields of research. However, smaller languages risk falling behind due to high training costs and small incentives for large companies to train these models. To combat this, the Danish Foundation Models project seeks to provide and maintain open, well-documented, and high-quality foundation models for the Danish language. This is achieved through broad cooperation with public and private institutions, to ensure high data quality and applicability of the trained models. We present the motivation of the project, the current status, and future perspectives.

Submitted 13 November, 2023; originally announced November 2023.

Comments: 4 pages, 2 tables

arXiv:2310.18450 [pdf, other]

doi 10.21437/Interspeech.2023-1216

MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition

Authors: Jiamin Xie, John H. L. Hansen

Abstract: In this paper, we present MixRep, a simple and effective data augmentation strategy based on mixup for low-resource ASR. MixRep interpolates the feature dimensions of hidden representations in the neural network that can be applied to both the acoustic feature input and the output of each layer, which generalizes the previous MixSpeech method. Further, we propose to combine the mixup with a regularization along the time axis of the input, which is shown to be complementary. We apply MixRep to a Conformer encoder of an E2E LAS architecture trained with a joint CTC loss. We experiment on the WSJ dataset and subsets of the SWB dataset, covering reading and telephony conversational speech. Experimental results show that MixRep consistently outperforms other regularization methods for low-resource ASR. Compared to a strong SpecAugment baseline, MixRep achieves a +6.5% and a +6.7% relative WER reduction on the eval92 set and the Callhome part of the eval'2000 set.

Submitted 27 October, 2023; originally announced October 2023.

Comments: Accepted to Interspeech 2023

arXiv:2310.16981 [pdf, other]

Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

Authors: Lasse Hansen, Nabeel Seedat, Mihaela van der Schaar, Andrija Petrovic

Abstract: Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques which profile the data to guide the synthetic data generation process. Moreover, we shed light on the often ignored consequences of neglecting these data profiles during synthetic data generation -- despite seemingly high statistical fidelity. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.

Submitted 25 October, 2023; originally announced October 2023.

Comments: Presented at NeurIPS 2023 (Datasets & Benchmarks). *Hansen & Seedat contributed equally

arXiv:2310.13200 [pdf, other]

A Deep Learning Analysis of Climate Change, Innovation, and Uncertainty

Authors: Michael Barnett, William Brock, Lars Peter Hansen, Ruimeng Hu, Joseph Huang

Abstract: We study the implications of model uncertainty in a climate-economics framework with three types of capital: "dirty" capital that produces carbon emissions when used for production, "clean" capital that generates no emissions but is initially less productive than dirty capital, and knowledge capital that increases with R&D investment and leads to technological innovation in green sector productivity. To solve our high-dimensional, non-linear model framework we implement a neural-network-based global solution method. We show there are first-order impacts of model uncertainty on optimal decisions and social valuations in our integrated climate-economic-innovation framework. Accounting for interconnected uncertainty over climate dynamics, economic damages from climate change, and the arrival of a green technological change leads to substantial adjustments to investment in the different capital types in anticipation of technological change and the revelation of climate damage severity.

Submitted 19 October, 2023; originally announced October 2023.

arXiv:2310.11004 [pdf, other]

Advanced accent/dialect identification and accentedness assessment with multi-embedding models and automatic speech recognition

Authors: Shahram Ghorbani, John H. L. Hansen

Abstract: Accurately classifying accents and assessing accentedness in non-native speakers are both challenging tasks due to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pre-trained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. Findings demonstrate that employing pre-trained LID and SID models effectively encodes accent/dialect information in speech. Furthermore, the LID and SID encoded accent information complement an end-to-end accent identification (AID) model trained from scratch. By incorporating all three embeddings, the proposed multi-embedding AID system achieves superior accuracy in accent identification. Next, we investigate leveraging automatic speech recognition (ASR) and accent identification models to explore accentedness estimation. The ASR model is an end-to-end connectionist temporal classification (CTC) model trained exclusively with en-US utterances. The ASR error rate and en-US output of the AID model are leveraged as objective accentedness scores. Evaluation results demonstrate a strong correlation between the scores estimated by the two models. Additionally, a robust correlation between the objective accentedness scores and subjective scores based on human perception is demonstrated, providing evidence for the reliability and validity of utilizing AID-based and ASR-based systems for accentedness assessment in non-native speech.

Submitted 17 October, 2023; originally announced October 2023.

Comments: Submitted to The Journal of the Acoustical Society of America

arXiv:2309.09688 [pdf, other]

doi 10.1017/jfm.2023.1003

Granular dilatancy and non-local fluidity of partially molten rock

Authors: Richard F. Katz, John F. Rudge, Lars N. Hansen

Abstract: Partially molten rock is a densely packed, melt-saturated, granular medium, but it has seldom been considered in these terms. In this manuscript, we extend the continuum theory of partially molten rock to incorporate the physics of granular media. Our formulation includes dilatancy in a viscous constitutive law and introduces a non-local fluidity. We analyse the resulting poro-viscous–granular theory in terms of two modes of liquid–solid segregation that are observed in published torsion experiments: localisation of liquid into high-porosity sheets and radially inward liquid flow. We show that the newly incorporated granular physics brings the theory into agreement with experiments. We discuss these results in the context of grain-scale physics across the nominal jamming fraction at the high homologous temperatures relevant in geological systems.

Submitted 27 November, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: 31 pages, 9 figures, 4 appendices

Journal ref: Journal of Fluid Mechanics, 2023

arXiv:2308.05709 [pdf, other]

doi 10.1103/PhysRevLett.132.130604

A photonic source of heralded GHZ states

Authors: H. Cao, L. M. Hansen, F. Giorgino, L. Carosini, P. Zahalka, F. Zilk, J. C. Loredo, P. Walther

Abstract: Generating large multiphoton entangled states is of main interest due to enabling universal photonic quantum computing and all-optical quantum repeater nodes. These applications exploit measurement-based quantum computation using cluster states. Remarkably, it was shown that photonic cluster states of arbitrary size can be generated by using feasible heralded linear optics fusion gates that act on heralded three-photon Greenberger-Horne-Zeilinger (GHZ) states as the initial resource state. Thus, the capability of generating heralded GHZ states is of great importance for scaling up photonic quantum computing. Here, we experimentally demonstrate this required building block by reporting a polarisation-encoded heralded GHZ state of three photons, for which we build a high-rate six-photon source ($547{\pm}2$ Hz) from a solid-state quantum emitter and a stable polarisation-based interferometer. The detection of three ancillary photons heralds the generation of three-photon GHZ states among the remaining particles with fidelities up to $\mathcal{F}=0.7278{\pm}0.0106$. Our results initiate a path for scalable entangling operations using heralded linear-optics implementations.

Submitted 10 August, 2023; originally announced August 2023.

Comments: 6 pages, 5 figures

Journal ref: Phys. Rev. Lett. 132, 130604 (2024)

arXiv:2307.12745 [pdf, ps, other]

Concept-based explainability for an EEG transformer model

Authors: Anders Gjølbye Madsen, William Theodor Lehn-Schiøler, Áshildur Jónsdóttir, Bergdís Arnardóttir, Lars Kai Hansen

Abstract: Deep learning models are complex due to their size, structure, and inherent randomness in training procedures. Additional complexity arises from the selection of datasets and inductive biases. Addressing these challenges for explainability, Kim et al. (2018) introduced Concept Activation Vectors (CAVs), which aim to understand deep models' internal states in terms of human-aligned concepts. These concepts correspond to directions in latent space, identified using linear discriminants. Although this method was first applied to image classification, it was later adapted to other domains, including natural language processing. In this work, we attempt to apply the method to electroencephalogram (EEG) data for explainability in Kostas et al.'s BENDR (2021), a large-scale transformer model. A crucial part of this endeavor involves defining the explanatory concepts and selecting relevant datasets to ground concepts in the latent space. Our focus is on two mechanisms for EEG concept formation: the use of externally labeled EEG datasets, and the application of anatomically defined concepts. The former approach is a straightforward generalization of methods used in image classification, while the latter is novel and specific to EEG. We present evidence that both approaches to concept formation yield valuable insights into the representations learned by deep EEG models.

Submitted 24 July, 2023; originally announced July 2023.

Comments: To appear in proceedings of 2023 IEEE International workshop on Machine Learning for Signal Processing

arXiv:2306.16997 [pdf, other]

Unsupervised 3D registration through optimization-guided cyclical self-training

Authors: Alexander Bigalke, Lasse Hansen, Tony C. W. Mok, Mattias P. Heinrich

Abstract: State-of-the-art deep learning-based registration methods employ three different learning strategies: supervised learning, which requires costly manual annotations, unsupervised learning, which heavily relies on hand-crafted similarity metrics designed by domain experts, or learning from synthetic data, which introduces a domain shift. To overcome the limitations of these strategies, we propose a novel self-supervised learning paradigm for unsupervised registration, relying on self-training. Our idea is based on two key insights. Feature-based differentiable optimizers 1) perform reasonable registration even from random features and 2) stabilize the training of the preceding feature extraction network on noisy labels. Consequently, we propose cyclical self-training, where pseudo labels are initialized as the displacement fields inferred from random features and cyclically updated based on more and more expressive features from the learning feature extractor, yielding a self-reinforcement effect. We evaluate the method for abdomen and lung registration, consistently surpassing metric-based supervision and outperforming diverse state-of-the-art competitors. Source code is available at https://github.com/multimodallearning/reg-cyclical-self-train.

Submitted 20 July, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

Comments: accepted at MICCAI 2023

arXiv:2306.06524 [pdf, other]

What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model

Authors: Mu Yang, Ram C. M. C. Shekar, Okim Kang, John H. L. Hansen

Abstract: This study is focused on understanding and quantifying the change in phoneme and prosody information encoded in the Self-Supervised Learning (SSL) model, brought by an accent identification (AID) fine-tuning task. This problem is addressed based on model probing. Specifically, we conduct a systematic layer-wise analysis of the representations of the Transformer layers on a phoneme correlation task, and a novel word-level prosody prediction task. We compare the probing performance of the pre-trained and fine-tuned SSL models. Results show that the AID fine-tuning task steers the top 2 layers to learn richer phoneme and prosody representation. These changes share some similarities with the effects of fine-tuning with an Automatic Speech Recognition task. In addition, we observe strong accent-specific phoneme representations in layer 9. To sum up, this study provides insights into the understanding of SSL features and their interactions with fine-tuning tasks.

Submitted 10 June, 2023; originally announced June 2023.

Comments: Accepted by Interspeech 2023

arXiv:2306.03009 [pdf, other]

doi 10.1038/s43588-023-00573-5

Using Sequences of Life-events to Predict Human Lives

Authors: Germans Savcisens, Tina Eliassi-Rad, Lars Kai Hansen, Laust Mortensen, Lau Lilleholt, Anna Rogers, Ingo Zettler, Sune Lehmann

Abstract: Over the past decade, machine learning has revolutionized computers' ability to analyze text through flexible computational models. Due to their structural similarity to written language, transformer-based architectures have also shown promise as tools to make sense of a range of multi-variate sequences from protein-structures, music, electronic health records to weather-forecasts. We can also represent human lives in a way that shares this structural similarity to language. From one perspective, lives are simply sequences of events: People are born, visit the pediatrician, start school, move to a new location, get married, and so on. Here, we exploit this similarity to adapt innovations from natural language processing to examine the evolution and predictability of human lives based on detailed event sequences. We do this by drawing on arguably the most comprehensive registry data in existence, available for an entire nation of more than six million individuals across decades. Our data include information about life-events related to health, education, occupation, income, address, and working hours, recorded with day-to-day resolution. We create embeddings of life-events in a single vector space showing that this embedding space is robust and highly structured. Our models allow us to predict diverse outcomes ranging from early mortality to personality nuances, outperforming state-of-the-art models by a wide margin. Using methods for interpreting deep learning models, we probe the algorithm to understand the factors that enable our predictions. Our framework allows researchers to identify new potential mechanisms that impact life outcomes and associated possibilities for personalized interventions.

Submitted 5 June, 2023; originally announced June 2023.

Journal ref: Nature Computational Science 4 (2024) 43-56

arXiv:2306.00561 [pdf, other]

Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners

Authors: Sarthak Yadav, Sergios Theodoridis, Lars Kai Hansen, Zheng-Hua Tan

Abstract: In this work, we propose a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module that facilitates the modelling of local-global interactions in every decoder transformer block through attention heads of several distinct local and global windows. Empirical results on ten downstream audio tasks show that MW-MAEs consistently outperform standard MAEs in overall performance and learn better general-purpose audio representations, along with demonstrating considerably better scaling characteristics. Investigating attention distances and entropies reveals that MW-MAE encoders learn heads with broader local and global attention. Analyzing attention head feature representations through Projection Weighted Canonical Correlation Analysis (PWCCA) shows that attention heads with the same window sizes across the decoder layers of the MW-MAE learn correlated feature representations, which enables each block to independently capture local and global information, leading to a decoupled decoder feature hierarchy. Code for feature extraction and downstream experiments along with pre-trained models will be released publicly.

Submitted 1 October, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.20017 [pdf, ps, other]

doi 10.1038/s41534-024-00811-2

Controlling the Photon Number Coherence of Solid-state Quantum Light Sources for Quantum Cryptography

Authors: Yusuf Karli, Daniel A. Vajner, Florian Kappe, Paul C. A. Hagen, Lena M. Hansen, René Schwarz, Thomas K. Bracht, Christian Schimpf, Saimon F. Covre da Silva, Philip Walther, Armando Rastelli, Vollrath Martin Axt, Juan C. Loredo, Vikas Remesh, Tobias Heindel, Doris E. Reiter, Gregor Weihs

Abstract: Quantum communication networks rely on quantum cryptographic protocols including quantum key distribution (QKD) using single photons. A critical element regarding the security of QKD protocols is the photon number coherence (PNC), i.e. the phase relation between the zero and one-photon Fock state, which critically depends on the excitation scheme. Thus, to obtain flying qubits with the desired properties, optimal pumping schemes for quantum emitters need to be selected. Semiconductor quantum dots generate on-demand single photons with high purity and indistinguishability. Exploiting two-photon excitation of a quantum dot combined with a stimulation pulse, we demonstrate the generation of high-quality single photons with a controllable degree of PNC. Our approach provides a viable route toward secure communication in quantum networks.

Submitted 31 May, 2023; originally announced May 2023.

Comments: 17 pages

Journal ref: npj Quantum Inf 10, 17 (2024)

arXiv:2305.17154 [pdf, other]

On convex decision regions in deep network representations

Authors: Lenka Tětková, Thea Brüsch, Teresa Karen Scheidt, Fabian Martin Mager, Rasmus Ørtoft Aagaard, Jonathan Foldager, Tommy Sonne Alstrøm, Lars Kai Hansen

Abstract: Current work on human-machine alignment aims at understanding machine-learned latent spaces and their correspondence to human representations. Gärdenfors' conceptual spaces is a prominent framework for understanding human representations. Convexity of object regions in conceptual spaces is argued to promote generalizability, few-shot learning, and interpersonal alignment. Based on these insights, we investigate the notion of convexity of concept regions in machine-learned latent spaces. We develop a set of tools for measuring convexity in sampled data and evaluate emergent convexity in layered representations of state-of-the-art deep networks. We show that convexity is robust to basic re-parametrization and, hence, meaningful as a quality of machine-learned latent spaces. We find that approximate convexity is pervasive in neural representations in multiple application domains, including models of images, audio, human activity, text, and medical images. Generally, we observe that fine-tuning increases the convexity of label regions. We find evidence that pretraining convexity of class label regions predicts subsequent fine-tuning performance.

Submitted 6 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

arXiv:2305.11157 [pdf, other]

doi 10.1126/sciadv.adj09

Programmable multi-photon quantum interference in a single spatial mode

Authors: Lorenzo Carosini, Virginia Oddi, Francesco Giorgino, Lena M. Hansen, Benoit Seron, Simone Piacentini, Tobias Guggemos, Iris Agresti, Juan Carlos Loredo, Philip Walther

Abstract: The interference of non-classical states of light enables quantum-enhanced applications reaching from metrology to computation. Most commonly, the polarisation or spatial location of single photons are used as addressable degrees-of-freedom for turning these applications into praxis. However, the scale-up for the processing of a large number of photons of such architectures is very resource demanding due to the rapidly increasing number of components, such as optical elements, photon sources and detectors. Here we demonstrate a resource-efficient architecture for multi-photon processing based on time-bin encoding in a single spatial mode. We employ an efficient quantum dot single-photon source, and a fast programmable time-bin interferometer, to observe the interference of up to 8 photons in 16 modes, all recorded only with one detector, thus considerably reducing the physical overhead previously needed for achieving equivalent tasks. Our results can form the basis for a future universal photonics quantum processor operating in a single spatial mode.

Submitted 18 May, 2023; originally announced May 2023.

Comments: 8 pages, 5 figures

arXiv:2304.12956 [pdf, other]

doi 10.1364/OPTICAQ.494643

Single-active-element demultiplexed multi-photon source

Authors: Lena M. Hansen, Lorenzo Carosini, Lennart Jehle, Francesco Giorgino, Romane Houvenaghel, Michal Vyvlecka, Juan C. Loredo, Philip Walther

Abstract: Temporal-to-spatial demultiplexing routes non-simultaneous events of the same spatial mode to distinct output trajectories. This technique has now been widely adopted because it gives access to higher-number multi-photon states when exploiting solid-state quantum emitters. However, implementations so far have required an always-increasing number of active elements, rapidly facing resource constraints. Here, we propose and demonstrate a demultiplexing approach that utilizes only a single active element for routing to, in principle, an arbitrary number of outputs. We employ our device in combination with a high-efficiency quantum dot based single-photon source, and measure up to eight demultiplexed highly indistinguishable single photons. We discuss the practical limitations of our approach, and describe in which conditions it can be used to demultiplex, e.g., tens of outputs. Our results thus provide a path for the preparation of resource-efficient larger-scale multi-photon sources.

Submitted 25 April, 2023; originally announced April 2023.

Comments: 7 pages, 7 figures

Journal ref: Optica Quantum 1(1), 1-5 (2023)

arXiv:2304.08984 [pdf, other]

doi 10.1109/CVPRW59228.2023.00381

Robustness of Visual Explanations to Common Data Augmentation

Authors: Lenka Tětková, Lars Kai Hansen

Abstract: As the use of deep neural networks continues to grow, understanding their behaviour has become more crucial than ever. Post-hoc explainability methods are a potential solution, but their reliability is being called into question. Our research investigates the response of post-hoc visual explanations to naturally occurring transformations, often referred to as augmentations. We anticipate explanations to be invariant under certain transformations, such as changes to the colour map while responding in an equivariant manner to transformations like translation, object scaling, and rotation. We have found remarkable differences in robustness depending on the type of transformation, with some explainability methods (such as LRP composites and Guided Backprop) being more stable than others. We also explore the role of training with data augmentation. We provide evidence that explanations are typically less robust to augmentation than classification performance, regardless of whether data augmentation is used in training or not.

Submitted 18 April, 2023; originally announced April 2023.

Comments: Accepted to The 2nd Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2023

arXiv:2303.17719 [pdf, other]

Why is the winner the best?

Authors: Matthias Eisenmann, Annika Reinke, Vivienn Weru, Minu Dietlinde Tizabi, Fabian Isensee, Tim J. Adler, Sharib Ali, Vincent Andrearczyk, Marc Aubreville, Ujjwal Baid, Spyridon Bakas, Niranjan Balu, Sophia Bano, Jorge Bernal, Sebastian Bodenstedt, Alessandro Casella, Veronika Cheplygina, Marie Daum, Marleen de Bruijne, Adrien Depeursinge, Reuben Dorent, Jan Egger, David G. Ellis, Sandy Engelhardt, Melanie Ganz , et al. (100 additional authors not shown)

Abstract: International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To address this gap in the literature, we performed a multi-center study with all 80 competitions that were conducted in the scope of IEEE ISBI 2021 and MICCAI 2021. Statistical analyses performed based on comprehensive descriptions of the submitted algorithms linked to their rank as well as the underlying participation strategies revealed common characteristics of winning solutions. These typically include the use of multi-task learning (63%) and/or multi-stage pipelines (61%), and a focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The "typical" lead of a winning team is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core general development strategies stood out for highly-ranked teams: the reflection of the metrics in the method design and the focus on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art but only 11% completely solved the respective domain problem.
The insights of our study could help researchers (1) improve algorithm development strategies when approaching new problems, and (2) focus on open research questions revealed by this work.

Submitted 30 March, 2023; originally announced March 2023.

Comments: accepted to CVPR 2023

arXiv:2302.08639 [pdf, other]

Improving Transformer-based Networks With Locality For Automatic Speaker Verification

Authors: Mufan Sang, Yong Zhao, Gang Liu, John H. L. Hansen, Jian Wu

Abstract: Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we enhance the Transformer with improved locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset. The proposed models achieve 0.75% EER on the VoxCeleb 1 test set, outperforming the previously proposed Transformer-based models and CNN-based models, such as ResNet34 and ECAPA-TDNN. When trained on the MS-internal dataset, the proposed models achieve promising results with a 14.6% relative reduction in EER over the Res2Net50 model.

Submitted 28 February, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: Accepted to ICASSP 2023

arXiv:2302.02430 [pdf, other]

doi 10.1557/s43578-023-01041-6

Calibration and data analysis routines for nanoindentation with spherical tips

Authors: Diana Avadanii, Anna Kareer, Lars Hansen, Angus Wilkinson

Abstract: Instrumented spherical nanoindentation with a continuous stiffness measurement has gained increased popularity in material science studies in brittle and ductile materials alike. These investigations span hypotheses related to a wide range of microphysics involving grain boundaries, twins, dislocation densities, ion-induced damage and more. These studies rely on the implementation of different methodologies for instrument calibration and for circumventing tip shape imperfections. In this study, we test, integrate, and re-adapt published strategies for tip and machine-stiffness calibration for spherical tips. We propose a routine for independently calibrating the effective tip radius and the machine stiffness using three reference materials (fused silica, sapphire, glassy carbon), which requires the parametrization of the effective radius as a function of load. We validate our proposed workflow against key benchmarks, such as variation of Young's modulus with depth. We apply the resulting calibrations to data collected in materials with varying ductility (olivine, titanium, and tungsten) to extract indentation stress-strain curves. We also test the impact of the machine stiffness on recently proposed methods for identification of yield stress, and compare the influence of different conventions on assessing the indentation size effect. Finally, we synthesize these analysis routines in a single workflow for use in future studies aiming to extract and process data from spherical nanoindentation.

Submitted 5 February, 2023; originally announced February 2023.

arXiv:2301.06916 [pdf, other]

Automated speech- and text-based classification of neuropsychiatric conditions in a multidiagnostic setting

Authors: Lasse Hansen, Roberta Rocca, Arndis Simonsen, Alberto Parola, Vibeke Bliksted, Nicolai Ladegaard, Dan Bang, Kristian Tylén, Ethan Weed, Søren Dinesen Østergaard, Riccardo Fusaroli

Abstract: Speech patterns have been identified as potential diagnostic markers for neuropsychiatric conditions. However, most studies only compare a single clinical group to healthy controls, whereas clinical practice often requires differentiating between multiple potential diagnoses (multiclass settings). To address this, we assembled a dataset of repeated recordings from 420 participants (67 with major depressive disorder, 106 with schizophrenia and 46 with autism, as well as matched controls), and tested the performance of a range of conventional machine learning models and advanced Transformer models on both binary and multiclass classification, based on voice and text features. While binary models performed comparably to previous research (F1 scores between 0.54-0.75 for autism spectrum disorder, ASD; 0.67-0.92 for major depressive disorder, MDD; and 0.71-0.83 for schizophrenia), performance decreased markedly when differentiating between multiple diagnostic groups (F1 scores between 0.35-0.44 for ASD, 0.57-0.75 for MDD, 0.15-0.66 for schizophrenia, and 0.38-0.52 macro F1). Combining voice and text-based models yielded increased performance, suggesting that they capture complementary diagnostic information. Our results indicate that models trained on binary classification may learn to rely on markers of generic differences between clinical and non-clinical populations, or markers of clinical features that overlap across conditions, rather than identifying markers specific to individual conditions.
We provide recommendations for future research in the field, suggesting increased focus on developing larger transdiagnostic datasets that include more fine-grained clinical features, and that can support the development of models that better capture the complexity of neuropsychiatric conditions and naturalistic diagnostic assessment.

Submitted 31 January, 2023; v1 submitted 13 January, 2023; originally announced January 2023.

Comments: 24 pages, 5 figures

arXiv:2301.05983 [pdf, other]

On the role of Model Uncertainties in Bayesian Optimization

Authors: Jonathan Foldager, Mikkel Jordahn, Lars Kai Hansen, Michael Riis Andersen

Abstract: Bayesian optimization (BO) is a popular method for black-box optimization, which relies on uncertainty as part of its decision-making process when deciding which experiment to perform next. However, not much work has addressed the effect of uncertainty on the performance of the BO algorithm and to what extent calibrated uncertainties improve the ability to find the global optimum. In this work, we provide an extensive study of the relationship between the BO performance (regret) and uncertainty calibration for popular surrogate models and compare them across both synthetic and real-world experiments. Our results confirm that Gaussian Processes are strong surrogate models and that they tend to outperform other popular models. Our results further show a positive association between calibration error and regret, but interestingly, this association disappears when we control for the type of model in the analysis. We also studied the effect of re-calibration and demonstrate that it generally does not lead to improved regret. Finally, we provide theoretical justification for why uncertainty calibration might be difficult to combine with BO due to the small sample sizes commonly used.

Submitted 14 January, 2023; originally announced January 2023.

Comments: 14 pages, 4 figures, 2 tables

arXiv:2301.02057 [pdf, ps, other]

doi 10.21105/joss.05153

TextDescriptives: A Python package for calculating a large variety of metrics from text

Authors: Lasse Hansen, Ludvig Renbo Olsen, Kenneth Enevoldsen

Abstract: TextDescriptives is a Python package for calculating a large variety of metrics from text. It is built on top of spaCy and can be easily integrated into existing workflows. The package has already been used for analysing the linguistic stability of clinical texts, creating features for predicting neuropsychiatric conditions, and analysing linguistic goals of primary school students. This paper describes the package and its features.

Submitted 28 March, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: 3 pages, 0 figures. Submitted to Journal of Open Source Software

Journal ref: Journal of Open Source Software, 8(84), 5153 (2023)

arXiv:2212.08568 [pdf, other]

Biomedical image analysis competitions: The state of current participation practice

Authors: Matthias Eisenmann, Annika Reinke, Vivienn Weru, Minu Dietlinde Tizabi, Fabian Isensee, Tim J. Adler, Patrick Godau, Veronika Cheplygina, Michal Kozubek, Sharib Ali, Anubha Gupta, Jan Kybic, Alison Noble, Carlos Ortiz de Solórzano, Samiksha Pachade, Caroline Petitjean, Daniel Sage, Donglai Wei, Elizabeth Wilden, Deepak Alapatt, Vincent Andrearczyk, Ujjwal Baid, Spyridon Bakas, Niranjan Balu, Sophia Bano , et al. (331 additional authors not shown)

Abstract: The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks.
K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.

Submitted 12 September, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

arXiv:2211.12632 [pdf, other]

doi 10.21437/Interspeech.2022-11277

Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation

Authors: Vinay Kothapally, John H. L. Hansen

Abstract: Several speech processing systems have demonstrated considerable performance improvements when deep complex neural networks (DCNN) are coupled with self-attention (SA) networks. However, the majority of DCNN-based studies on speech dereverberation that employ self-attention do not explicitly account for the inter-dependencies between real and imaginary features when computing attention. In this study, we propose a complex-valued T-F attention (TFA) module that models spectral and temporal dependencies by computing two-dimensional attention maps across time and frequency dimensions. We validate the effectiveness of our proposed complex-valued TFA module with the deep complex convolutional recurrent network (DCCRN) using the REVERB challenge corpus. Experimental findings indicate that integrating our complex-TFA module with DCCRN improves overall speech quality and performance of back-end speech applications, such as automatic speech recognition, compared to earlier approaches for self-attention.

Submitted 22 November, 2022; originally announced November 2022.

Comments: Interspeech 2022: ISCA Best Student Paper Award Finalist

arXiv:2211.12623 [pdf, other]

doi 10.1109/TASLP.2022.3155286

SkipConvGAN: Monaural Speech Dereverberation using Generative Adversarial Networks via Complex Time-Frequency Masking

Authors: Vinay Kothapally, J. H. L. Hansen

Abstract: With the advancements in deep learning approaches, the performance of speech enhancing systems in the presence of background noise has shown significant improvements. However, improving the system's robustness against reverberation is still a work in progress, as reverberation tends to cause loss of formant structure due to smearing effects in time and frequency. A wide range of deep learning-based systems either enhance the magnitude response and reuse the distorted phase or enhance the complex spectrogram using a complex time-frequency mask. Though these approaches have demonstrated satisfactory performance, they do not directly address the lost formant structure caused by reverberation. We believe that retrieving the formant structure can help improve the efficiency of existing systems. In this study, we propose SkipConvGAN - an extension of our prior work SkipConvNet. The proposed system's generator network tries to estimate an efficient complex time-frequency mask, while the discriminator network aids in driving the generator to restore the lost formant structure. We evaluate the performance of our proposed system on simulated and real recordings of reverberant speech from the single-channel task of the REVERB challenge corpus. The proposed system shows a consistent improvement across multiple room configurations over other deep learning-based generative adversarial frameworks.

Submitted 22 November, 2022; originally announced November 2022.

Comments: Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing ( Volume: 30)

arXiv:2211.12193 [pdf, other]

doi 10.1016/j.media.2023.102887

Anatomy-guided domain adaptation for 3D in-bed human pose estimation

Authors: Alexander Bigalke, Lasse Hansen, Jasper Diesel, Carlotta Hennigs, Philipp Rostalski, Mattias P. Heinrich

Abstract: 3D human pose estimation is a key component of clinical monitoring systems. The clinical applicability of deep pose estimation models, however, is limited by their poor generalization under domain shifts along with their need for sufficient labeled training data. As a remedy, we present a novel domain adaptation method, adapting a model from a labeled source to a shifted unlabeled target domain. Our method comprises two complementary adaptation strategies based on prior knowledge about human anatomy. First, we guide the learning process in the target domain by constraining predictions to the space of anatomically plausible poses. To this end, we embed the prior knowledge into an anatomical loss function that penalizes asymmetric limb lengths, implausible bone lengths, and implausible joint angles. Second, we propose to filter pseudo labels for self-training according to their anatomical plausibility and incorporate the concept into the Mean Teacher paradigm. We unify both strategies in a point cloud-based framework applicable to unsupervised and source-free domain adaptation. Evaluation is performed for in-bed pose estimation under two adaptation scenarios, using the public SLP dataset and a newly created dataset. Our method consistently outperforms various state-of-the-art domain adaptation methods, surpasses the baseline model by 31%/66%, and reduces the domain gap by 65%/82%. Source code is available at https://github.com/multimodallearning/da-3dhpe-anatomy.

Submitted 4 July, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: accepted at Medical Image Analysis

Journal ref: Medical Image Analysis 89, 2023, 102887

arXiv:2211.10565 [pdf, other]

Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting

Authors: Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen, John H. L. Hansen

Abstract: In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of channels might yield a certain KWS performance drop, but also a substantial energy consumption reduction, which is key when deploying common always-on KWS on low-resource devices. Experimental results on a noisy version of the Google Speech Commands Dataset show that filterbank learning adapts to noise characteristics to provide a higher degree of robustness to noise, especially when dropout is integrated. Thus, switching from typically used 40-channel log-Mel features to 8-channel learned features leads to a relative KWS accuracy loss of only 3.5% while simultaneously achieving a 6.3x energy consumption reduction.

Submitted 23 February, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

arXiv:2211.09913 [pdf, other]

doi 10.1109/TASLP.2021.3130975

Multi-source Domain Adaptation for Text-independent Forensic Speaker Recognition

Authors: Zhenyu Wang, John H. L. Hansen

Abstract: Adapting speaker recognition systems to new environments is a widely-used technique to improve a well-performing model learned from large-scale data towards task-specific small-scale data scenarios. However, previous studies focus on single domain adaptation, which neglects a more practical scenario where training data are collected from multiple acoustic domains needed in forensic scenarios. Audio analysis for forensic speaker recognition offers unique challenges in model training with multi-domain training data due to location/scenario uncertainty and diversity mismatch between reference and naturalistic field recordings. It is also difficult to directly employ small-scale domain-specific data to train complex neural network architectures due to domain mismatch and performance loss. Fine-tuning is a commonly-used method for adaptation in order to retrain the model with weights initialized from a well-trained model. Alternatively, in this study, three novel adaptation methods based on domain adversarial training, discrepancy minimization, and moment-matching approaches are proposed to further promote adaptation performance across multiple acoustic domains.
A comprehensive set of experiments is conducted to demonstrate that: 1) diverse acoustic environments do impact speaker recognition performance, which could advance research in audio forensics, 2) domain adversarial training learns the discriminative features which are also invariant to shifts between domains, 3) discrepancy-minimizing adaptation achieves effective performance simultaneously across multiple acoustic domains, and 4) moment-matching adaptation along with dynamic distribution alignment also significantly promotes speaker recognition performance on each domain, especially for the LENA-field domain with noise compared to all other systems.

Submitted 17 November, 2022; originally announced November 2022.

Comments: IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING

arXiv:2211.09898 [pdf, other]

doi 10.21437/Interspeech.2022-904

Audio Anti-spoofing Using a Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning

Authors: Zhenyu Wang, John H. L. Hansen

Abstract: Automatic speaker verification systems are vulnerable to a variety of access threats, prompting research into the formulation of effective spoofing detection systems to act as a gate to filter out such spoofing attacks. This study introduces a simple attention module to infer 3-dim attention weights for the feature map in a convolutional layer, which then optimizes an energy function to determine each neuron's importance. With the advancement of both voice conversion and speech synthesis technologies, unseen spoofing attacks are constantly emerging to limit spoofing detection system performance. Here, we propose a joint optimization approach based on the weighted additive angular margin loss for binary classification, with a meta-learning training framework to develop an efficient system that is robust to a wide range of spoofing attacks for model generalization enhancement. As a result, when compared to current state-of-the-art systems, our proposed approach delivers a competitive result with a pooled EER of 0.99% and min t-DCF of 0.0289.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Interspeech 2022

arXiv:2211.02051 [pdf, other]

Fearless Steps Challenge Phase-1 Evaluation Plan

Authors: Aditya Joglekar, John H. L. Hansen

Abstract: The Fearless Steps Challenge 2019 Phase-1 (FSC-P1) is the inaugural Challenge of the Fearless Steps Initiative hosted by the Center for Robust Speech Systems (CRSS) at the University of Texas at Dallas. The goal of this Challenge is to evaluate the performance of state-of-the-art speech and language systems for large task-oriented teams with naturalistic audio in challenging environments. Researchers may select to participate in any single or multiple of these challenge tasks. Researchers may also choose to employ the FEARLESS STEPS corpus for other related speech applications. All participants are encouraged to submit their solutions and results for consideration in the ISCA INTERSPEECH-2019 special session.

Submitted 3 November, 2022; originally announced November 2022.

Comments: Document Generated in February 2019 for conducting the Fearless Steps Challenge Phase-1 and its associated ISCA Interspeech-2019 Special Session

arXiv:2208.02778 [pdf, other]

Attention and DCT based Global Context Modeling for Text-independent Speaker Recognition

Authors: Wei Xia, John H. L. Hansen

Abstract: Learning an effective speaker representation is crucial for achieving reliable performance in speaker verification tasks. Speech signals are high-dimensional, long, and variable-length sequences containing diverse information at each time-frequency (TF) location. The standard convolutional layer that operates on neighboring local regions often fails to capture the complex TF global information. Our motivation is to alleviate these challenges by increasing the modeling capacity, emphasizing significant information, and suppressing possible redundancies. We aim to design a more robust and efficient speaker recognition system by incorporating the benefits of attention mechanisms and Discrete Cosine Transform (DCT) based signal processing techniques, to effectively represent the global information in speech signals. To achieve this, we propose a general global time-frequency context modeling block for speaker modeling. First, an attention-based context model is introduced to capture the long-range and non-local relationship across different time-frequency locations. Second, a 2D-DCT based context model is proposed to improve model efficiency and examine the benefits of signal modeling. A multi-DCT attention mechanism is presented to improve modeling power with alternate DCT base forms. Finally, the global context information is used to recalibrate salient time-frequency locations by computing the similarity between the global context and local features.
This effectively improves the speaker verification performance compared to the standard ResNet model and Squeeze & Excitation block by a large margin. Our experimental results show that the proposed global context modeling method can efficiently improve the learned speaker representations by achieving channel-wise and time-frequency feature recalibration.

Submitted 23 August, 2023; v1 submitted 4 August, 2022; originally announced August 2022.

arXiv:2207.04540 [pdf, other]

Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning

Authors: Mufan Sang, John H. L. Hansen

Abstract: Recently, attention mechanisms have been applied successfully in neural network-based speaker verification systems. Incorporating the Squeeze-and-Excitation block into convolutional neural networks has achieved remarkable performance. However, it uses global average pooling (GAP) to simply average the features along time and frequency dimensions, which is incapable of preserving sufficient speaker information in the feature maps. In this study, we show that GAP is a special case of a discrete cosine transform (DCT) on time-frequency domain mathematically using only the lowest frequency component in frequency decomposition. To strengthen the speaker information extraction ability, we propose to utilize multi-frequency information and design two novel and effective attention modules, called Single-Frequency Single-Channel (SFSC) attention module and Multi-Frequency Single-Channel (MFSC) attention module. The proposed attention modules can effectively capture more speaker information from multiple frequency components on the basis of DCT. We conduct comprehensive experiments on the VoxCeleb datasets and a probe evaluation on the 1st 48-UTD forensic corpus. Experimental results demonstrate that our proposed SFSC and MFSC attention modules can efficiently generate more discriminative speaker representations and outperform ResNet34-SE and ECAPA-TDNN systems with relative 20.9% and 20.2% reduction in EER, without adding extra network parameters.

Submitted 10 July, 2022; originally announced July 2022.

Comments: Accepted to Interspeech 2022
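
Note: the abstract's claim that GAP is a special case of the DCT can be checked numerically. The sketch below is not the authors' implementation; it assumes an unnormalized 2D DCT-II over a small time-frequency map, and the function names `dct2_coefficient` and `global_average_pool` are hypothetical. Under that assumption, the lowest-frequency coefficient (u = 0, v = 0) is the sum of all entries, so dividing it by T × F gives the global average.

```python
import math

def dct2_coefficient(x, u, v):
    """Unnormalized 2D DCT-II coefficient (u, v) of a T x F feature map.

    Illustrative helper, not taken from the paper.
    """
    T, F = len(x), len(x[0])
    total = 0.0
    for t in range(T):
        for f in range(F):
            total += (x[t][f]
                      * math.cos(math.pi * (t + 0.5) * u / T)
                      * math.cos(math.pi * (f + 0.5) * v / F))
    return total

def global_average_pool(x):
    """Mean over the time and frequency axes of a T x F feature map."""
    T, F = len(x), len(x[0])
    return sum(sum(row) for row in x) / (T * F)

# Toy single-channel feature map with T = 2 time steps and F = 3 frequency bins.
feature_map = [[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]]
T, F = len(feature_map), len(feature_map[0])

gap = global_average_pool(feature_map)
dc = dct2_coefficient(feature_map, 0, 0)   # both cosine factors equal 1 at u = v = 0

# GAP equals the lowest-frequency DCT coefficient up to the 1 / (T * F) scale.
print(gap, dc / (T * F))

# A higher-frequency coefficient carries information that GAP discards.
print(dct2_coefficient(feature_map, 1, 0))
```

Coefficients with u > 0 or v > 0 weight the entries by cosines of different frequencies, which is the extra information a multi-frequency attention module could draw on.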

Showing 1–50 of 229 results for author: Hansen, L