
Showing 1–50 of 60 results for author: Benetos, E

  1. arXiv:2406.17618  [pdf, other]

    eess.AS cs.CL cs.SD

    Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model

    Authors: Jiawen Huang, Emmanouil Benetos

    Abstract: Multilingual automatic lyrics transcription (ALT) is a challenging task due to the limited availability of labelled data and the challenges introduced by singing, compared to multilingual automatic speech recognition. Although some multilingual singing datasets have been released recently, English continues to dominate these collections. Multilingual ALT remains underexplored due to the scale of d…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted at EUSIPCO 2024

  2. arXiv:2405.19327  [pdf, other]

    cs.CL cs.AI cs.LG

    MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

    Authors: Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, et al. (20 additional authors not shown)

    Abstract: Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparabl…

    Submitted 2 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

    Comments: https://map-neo.github.io/

  3. arXiv:2405.01646  [pdf, other]

    cs.CV

    Explaining models relating objects and privacy

    Authors: Alessio Xompero, Myriam Bontonou, Jean-Michel Arbona, Emmanouil Benetos, Andrea Cavallaro

    Abstract: Accurately predicting whether an image is private before sharing it online is difficult due to the vast variety of content and the subjective nature of privacy itself. In this paper, we evaluate privacy models that use objects extracted from an image to determine why the image is predicted as private. To explain the decision of these models, we use feature-attribution to identify and quantify whic…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 7 pages, 3 figures, 1 table, supplementary material included as Appendix. Paper accepted at the 3rd XAI4CV Workshop at CVPR 2024. Code: https://github.com/graphnex/ig-privacy

  4. arXiv:2404.18081  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG cs.MM eess.AS

    ComposerX: Multi-Agent Symbolic Music Composition with LLMs

    Authors: Qixin Deng, Qikai Yang, Ruibin Yuan, Yipeng Huang, Yi Wang, Xubo Liu, Zeyue Tian, Jiahao Pan, Ge Zhang, Hanfeng Lin, Yizhi Li, Yinghao Ma, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenwu Wang, Guangyu Xia, Wei Xue, Yike Guo

    Abstract: Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and C…

    Submitted 30 April, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

  5. arXiv:2404.06393  [pdf, other]

    cs.SD cs.AI eess.AS

    MuPT: A Generative Symbolic Music Pretrained Transformer

    Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, et al. (4 additional authors not shown)

    Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the chal…

    Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  6. arXiv:2403.11706  [pdf, other]

    cs.SD cs.LG eess.AS

    Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models

    Authors: Emilian Postolache, Giorgio Mariani, Luca Cosmo, Emmanouil Benetos, Emanuele Rodolà

    Abstract: Multi-Source Diffusion Models (MSDM) allow for compositional musical generation tasks: generating a set of coherent sources, creating accompaniments, and performing source separation. Despite their versatility, they require estimating the joint distribution over the sources, necessitating pre-separated musical data, which is rarely available, and fixing the number and type of sources at training t…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Accepted at ICASSP 2024

  7. arXiv:2402.16153  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG cs.MM eess.AS

    ChatMusician: Understanding and Generating Music Intrinsically with LLM

    Authors: Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, Ziyang Ma, Liumeng Xue, Ziyu Wang, Qin Liu, Tianyu Zheng, Yizhi Li, Yinghao Ma, Yiming Liang, Xiaowei Chi, Ruibo Liu, Zili Wang, Pengfei Li, Jingcheng Wu, Chenghua Lin, Qifeng Liu, et al. (10 additional authors not shown)

    Abstract: While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the…

    Submitted 25 February, 2024; originally announced February 2024.

    Comments: GitHub: https://shanghaicannon.github.io/ChatMusician/

  8. arXiv:2402.01424  [pdf, other]

    cs.SD cs.LG eess.AS

    A Data-Driven Analysis of Robust Automatic Piano Transcription

    Authors: Drew Edwards, Simon Dixon, Emmanouil Benetos, Akira Maezawa, Yuta Kusaka

    Abstract: Algorithms for automatic piano transcription have improved dramatically in recent years due to new datasets and modeling techniques. Recent developments have focused primarily on adapting new neural network architectures, such as the Transformer and Perceiver, in order to yield more accurate systems. In this work, we study transcription systems from the perspective of their training data. By measu…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted for publication in IEEE Signal Processing Letters on 31 January, 2024

  9. arXiv:2311.10057  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

    Authors: Ilaria Manco, Benno Weck, SeungHeon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, Elio Quinton, György Fazekas, Juhan Nam

    Abstract: We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models o…

    Submitted 22 November, 2023; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: Accepted to NeurIPS 2023 Workshop on Machine Learning for Audio

  10. arXiv:2311.01526  [pdf, other]

    cs.SD cs.LG eess.AS

    ATGNN: Audio Tagging Graph Neural Network

    Authors: Shubhr Singh, Christian J. Steinmetz, Emmanouil Benetos, Huy Phan, Dan Stowell

    Abstract: Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging. Recent works have shown that despite stacking multiple layers, the receptive field of CNNs remains severely limited. Transformers on the other hand are able to map global context through self-attention, but treat the spectrogram as a sequence of patches which is not flexible enough…

    Submitted 2 November, 2023; originally announced November 2023.

  11. arXiv:2310.09853  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    MERTech: Instrument Playing Technique Detection Using Self-Supervised Pretrained Model With Multi-Task Finetuning

    Authors: Dichucheng Li, Yinghao Ma, Weixing Wei, Qiuqiang Kong, Yulun Wu, Mingjin Che, Fan Xia, Emmanouil Benetos, Wei Li

    Abstract: Instrument playing techniques (IPTs) constitute a pivotal component of musical expression. However, the development of automatic IPT detection methods suffers from limited labeled data and inherent class imbalance issues. In this paper, we propose to apply a self-supervised learning model pre-trained on large-scale unlabeled music data and finetune it on IPT detection tasks. This approach addresse…

    Submitted 15 October, 2023; originally announced October 2023.

    Comments: submitted to ICASSP 2024

  12. arXiv:2309.08730  [pdf, other]

    eess.AS cs.AI cs.CL cs.MM cs.SD

    MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

    Authors: Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos

    Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains not well-explored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio…

    Submitted 2 April, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Journal ref: 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics

  13. arXiv:2307.09795  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    From West to East: Who can understand the music of the others better?

    Authors: Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos

    Abstract: Recent developments in MIR have led to several benchmark deep learning models whose embeddings can be used for a variety of downstream tasks. At the same time, the vast majority of these models have been trained on Western pop/rock music and related styles. This leads to research questions on whether these models can be used to learn representations for different music cultures and styles, or whet…

    Submitted 19 July, 2023; originally announced July 2023.

  14. arXiv:2307.05161  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    On the Effectiveness of Speech Self-supervised Learning for Music

    Authors: Yinghao Ma, Ruibin Yuan, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Ruibo Liu, Gus Xia, Roger Dannenberg, Yike Guo, Jie Fu

    Abstract: Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) still remains largely unexplored. While previous SSL models pre-trained on music recordings may have been mostly closed-sourced, recent speech models such as wav2vec2.0 have shown promise in music modelling. Neverthele…

    Submitted 11 July, 2023; originally announced July 2023.

  15. arXiv:2306.17103  [pdf, other]

    cs.CL cs.SD eess.AS

    LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

    Authors: Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi LI, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenhu Chen, Wei Xue, Yike Guo

    Abstract: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language mo…

    Submitted 21 November, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

    Comments: 9 pages, 2 figures, 5 tables, accepted by ISMIR 2023

  16. arXiv:2306.10548  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    MARBLE: Music Audio Representation Benchmark for Universal Evaluation

    Authors: Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Le Zhuo, Yiqi Liu, Jiawen Huang, Zeyue Tian, Binyue Deng, Ningzhi Wang, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Roger Dannenberg, Wenhu Chen, Gus Xia, Wei Xue, Si Liu, Shi Wang, Ruibo Liu, Yike Guo, Jie Fu

    Abstract: In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue…

    Submitted 23 November, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

    Comments: camera-ready version for NeurIPS 2023

  17. arXiv:2306.00107  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

    Authors: Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Zili Wang, Yike Guo, Jie Fu

    Abstract: Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, part…

    Submitted 22 April, 2024; v1 submitted 31 May, 2023; originally announced June 2023.

    Comments: accepted by ICLR 2024

  18. arXiv:2305.17719  [pdf, other]

    eess.AS cs.SD

    Adapting Language-Audio Models as Few-Shot Audio Learners

    Authors: Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D. Plumbley, Wenwu Wang

    Abstract: We presented the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data. Specifically, we designed CALM to retrieve the probability distribution of text-audio clips over classes using a set of audio-label pairs and combined it with CLAP's zero-shot classification results. Furthermore, we designed a training-…

    Submitted 28 May, 2023; originally announced May 2023.

  19. arXiv:2212.08952  [pdf, other]

    cs.SD eess.AS

    Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition

    Authors: Jinhua Liang, Huy Phan, Emmanouil Benetos

    Abstract: Everyday sound recognition aims to infer types of sound events in audio streams. While many works succeeded in training models with high performance in a fully-supervised manner, they are still restricted to the demand of large quantities of labelled data and the range of predefined classes. To overcome these drawbacks, this work firstly curates a new database named FSD-FS for multi-label few-shot…

    Submitted 17 December, 2022; originally announced December 2022.

    Comments: submitted to ICASSP2023

  20. arXiv:2212.02508  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning

    Authors: Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Chenghua Lin, Xingran Chen, Anton Ragni, Hanzhi Yin, Zhijie Hu, Haoyu He, Emmanouil Benetos, Norbert Gyenge, Ruibo Liu, Jie Fu

    Abstract: The deep learning community has witnessed an exponentially growing interest in self-supervised learning (SSL). However, it still remains unexplored how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner. In this work, we design Music2Vec, a framework exploring different SSL algorithmic components and tricks for music audio recordings. Our mo…

    Submitted 5 December, 2022; originally announced December 2022.

  21. arXiv:2210.15310  [pdf, other]

    eess.AS cs.SD

    Learning Music Representations with wav2vec 2.0

    Authors: Alessandro Ragano, Emmanouil Benetos, Andrew Hines

    Abstract: Learning music representations that are general-purpose offers the flexibility to finetune several downstream tasks using smaller datasets. The wav2vec 2.0 speech representation model showed promising results in many downstream speech tasks, but has been less effective when adapted to music. In this paper, we evaluate whether pre-training wav2vec 2.0 directly on music data can be a better solution…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  22. arXiv:2208.12208  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Contrastive Audio-Language Learning for Music

    Authors: Ilaria Manco, Emmanouil Benetos, Elio Quinton, György Fazekas

    Abstract: As one of the most intuitive interfaces known to humans, natural language has the potential to mediate many tasks that involve human-computer interaction, especially in application-focused fields like Music Information Retrieval. In this work, we explore cross-modal learning in an attempt to bridge audio and language in the music domain. To this end, we propose MusCALL, a framework for Music Contr…

    Submitted 25 August, 2022; originally announced August 2022.

    Comments: Accepted to ISMIR 2022

  23. arXiv:2207.07769  [pdf, ps, other]

    cs.LG cs.AI cs.IR

    Anomalous behaviour in loss-gradient based interpretability methods

    Authors: Vinod Subramanian, Siddharth Gururani, Emmanouil Benetos, Mark Sandler

    Abstract: Loss-gradients are used to interpret the decision making process of deep learning models. In this work, we evaluate loss-gradient based attribution methods by occluding parts of the input and comparing the performance of the occluded input to the original input. We observe that the occluded input has better performance than the original across the test dataset under certain conditions. Similar beh…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: Accepted at ICLR RobustML workshop 2021

  24. arXiv:2204.04651  [pdf, other]

    cs.SD cs.IR eess.AS

    Deep Conditional Representation Learning for Drum Sample Retrieval by Vocalisation

    Authors: Alejandro Delgado, Charalampos Saitis, Emmanouil Benetos, Mark Sandler

    Abstract: Imitating musical instruments with the human voice is an efficient way of communicating ideas between music producers, from sketching melody lines to clarifying desired sonorities. For this reason, there is an increasing interest in building applications that allow artists to efficiently pick target samples from big sound libraries just by imitating them vocally. In this study, we investigated the…

    Submitted 10 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022 (under review)

  25. arXiv:2204.03898  [pdf, other]

    eess.AS cs.SD

    Exploring Transformer's potential on automatic piano transcription

    Authors: Longshen Ou, Ziyi Guo, Emmanouil Benetos, Jiqing Han, Ye Wang

    Abstract: Most recent research about automatic music transcription (AMT) uses convolutional neural networks and recurrent neural networks to model the mapping from music signals to symbolic notation. Based on a high-resolution piano transcription system, we explore the possibility of incorporating another powerful sequence transformation tool -- the Transformer -- to deal with the AMT problem. We argue that…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: Accepted by ICASSP 2022

    ACM Class: H.5.5

  26. arXiv:2202.01646  [pdf, other]

    cs.SD eess.AS eess.SP

    Improving Lyrics Alignment through Joint Pitch Detection

    Authors: Jiawen Huang, Emmanouil Benetos, Sebastian Ewert

    Abstract: In recent years, the accuracy of automatic lyrics alignment methods has increased considerably. Yet, many current approaches employ frameworks designed for automatic speech recognition (ASR) and do not exploit properties specific to music. Pitch is one important musical attribute of singing voice but it is often ignored by current systems as the lyrics content is considered independent of the pitc…

    Submitted 3 February, 2022; originally announced February 2022.

    Comments: To appear in Proc. ICASSP 2022

  27. arXiv:2112.04214  [pdf, other]

    cs.SD cs.CL cs.IR cs.LG eess.AS

    Learning music audio representations via weak language supervision

    Authors: Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas

    Abstract: Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weak…

    Submitted 17 February, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: Accepted to ICASSP 2022

  28. arXiv:2110.04585  [pdf, other]

    eess.AS cs.SD

    An evaluation of data augmentation methods for sound scene geotagging

    Authors: Helen L. Bear, Veronica Morfi, Emmanouil Benetos

    Abstract: Sound scene geotagging is a new topic of research which has evolved from acoustic scene classification. It is motivated by the idea of audio surveillance. Not content with only describing a scene in a recording, a machine which can locate where the recording was captured would be of use to many. In this paper we explore a series of common audio data augmentation methods to evaluate which best impr…

    Submitted 9 October, 2021; originally announced October 2021.

    Comments: Presented at Interspeech 2021

  29. arXiv:2110.03965  [pdf, other]

    eess.AS cs.SD

    Joint Scattering for Automatic Chick Call Recognition

    Authors: Changhong Wang, Emmanouil Benetos, Shuge Wang, Elisabetta Versace

    Abstract: Animal vocalisations contain important information about health, emotional state, and behaviour, thus can be potentially used for animal welfare monitoring. Motivated by the spectro-temporal patterns of chick calls in the time-frequency domain, in this paper we propose an automatic system for chick call recognition using the joint time-frequency scattering transform (JTFS). Taking full-length…

    Submitted 8 October, 2021; originally announced October 2021.

    Comments: 5 pages, submitted to ICASSP 2022

  30. More for Less: Non-Intrusive Speech Quality Assessment with Limited Annotations

    Authors: Alessandro Ragano, Emmanouil Benetos, Andrew Hines

    Abstract: Non-intrusive speech quality assessment is a crucial operation in multimedia applications. The scarcity of annotated data and the lack of a reference signal represent some of the main challenges for designing efficient quality assessment metrics. In this paper, we propose two multi-task models to tackle the problems above. In the first model, we first learn a feature representation with a degradat…

    Submitted 19 August, 2021; originally announced August 2021.

    Comments: Published in 2021 13th International Conference on Quality of Multimedia Experience (QoMEX)

  31. arXiv:2107.13617  [pdf, other]

    cs.SD cs.IR cs.LG cs.NE eess.AS

    Pitch-Informed Instrument Assignment Using a Deep Convolutional Network with Multiple Kernel Shapes

    Authors: Carlos Lordelo, Emmanouil Benetos, Simon Dixon, Sven Ahlbäck

    Abstract: This paper proposes a deep convolutional neural network for performing note-level instrument assignment. Given a polyphonic multi-instrumental music signal along with its ground truth or predicted notes, the objective is to assign an instrumental source for each note. This problem is addressed as a pitch-informed classification task where each note is analysed individually. We also propose to util…

    Submitted 28 July, 2021; originally announced July 2021.

    Comments: 4 figures, 4 tables and 7 pages. Accepted for publication at ISMIR Conference 2021

  32. arXiv:2104.11984  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    MusCaps: Generating Captions for Music Audio

    Authors: Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas

    Abstract: Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of…

    Submitted 24 April, 2021; originally announced April 2021.

    Comments: Accepted to IJCNN 2021 for the Special Session on Representation Learning for Audio, Speech, and Music Processing

  33. arXiv:2104.06607  [pdf, other]

    cs.SD eess.AS

    Revisiting the Onsets and Frames Model with Additive Attention

    Authors: Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Dorien Herremans

    Abstract: Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on the leverage of deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In this paper, we conduct a comprehensive examination o…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted in IJCNN 2021 Special Session S04. https://dr-costas.github.io/rlasmp2021-website/

  34. Adversarial Unsupervised Domain Adaptation for Harmonic-Percussive Source Separation

    Authors: Carlos Lordelo, Emmanouil Benetos, Simon Dixon, Sven Ahlbäck, Patrik Ohlsson

    Abstract: This paper addresses the problem of domain adaptation for the task of music source separation. Using datasets from two different domains, we compare the performance of a deep learning-based harmonic-percussive source separation model under different training scenarios, including supervised joint training using data from both domains and pre-training in one domain with fine-tuning in another. We pr…

    Submitted 3 January, 2021; originally announced January 2021.

    Comments: 5 pages, 2 figures and 1 table. Accepted for publication in IEEE Signal Processing Letters

  35. arXiv:2010.09969  [pdf, other]

    cs.SD cs.LG eess.AS

    The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

    Authors: Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Dorien Herremans

    Abstract: Most of the state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks such as onset prediction and offset prediction and train them with onset and offset labels. These predictions are then concatenated together and used as the input to train another model with the pitch labels to obtain the final transcription. We attempt to use only the pitc…

    Submitted 19 October, 2020; originally announced October 2020.

    Comments: Accepted in ICPR

  36. arXiv:2010.07913  [pdf, other]

    eess.AS cs.SD

    Dataset artefacts in anti-spoofing systems: a case study on the ASVspoof 2017 benchmark

    Authors: Bhusan Chettri, Emmanouil Benetos, Bob L. T. Sturm

    Abstract: The Automatic Speaker Verification Spoofing and Countermeasures Challenges motivate research in protecting speech biometric systems against a variety of different access attacks. The 2017 edition focused on replay spoofing attacks, and involved participants building and training systems on a provided dataset (ASVspoof 2017). More than 60 research papers have so far been published with this dataset…

    Submitted 15 October, 2020; originally announced October 2020.

    Comments: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020

  37. arXiv:2005.07788  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Reliable Local Explanations for Machine Listening

    Authors: Saumitra Mishra, Emmanouil Benetos, Bob L. Sturm, Simon Dixon

    Abstract: One way to analyse the behaviour of machine learning models is through local explanations that highlight input features that maximally influence model predictions. Sensitivity analysis, which involves analysing the effect of input perturbations on model predictions, is one of the methods to generate local explanations. Meaningful input perturbations are essential for generating reliable explanatio…

    Submitted 15 May, 2020; originally announced May 2020.

    Comments: 8 pages plus references. Accepted at the IJCNN 2020 Special Session on Explainable Computational/Artificial Intelligence. Camera-ready version

  38. arXiv:2005.06650  [pdf, other]

    eess.AS cs.LG cs.SD

    Memory Controlled Sequential Self Attention for Sound Recognition

    Authors: Arjun Pankajakshan, Helen L. Bear, Vinod Subramanian, Emmanouil Benetos

    Abstract: In this paper we investigate the importance of the extent of memory in sequential self attention for sound recognition. We propose to use a memory controlled sequential self attention mechanism on top of a convolutional recurrent neural network (CRNN) model for polyphonic sound event detection (SED). Experiments on the URBAN-SED dataset demonstrate the impact of the extent of memory on sound recog…

    Submitted 5 August, 2020; v1 submitted 13 May, 2020; originally announced May 2020.

    Comments: Accepted to INTERSPEECH 2020

  39. arXiv:2004.07171  [pdf, other]

    cs.SD cs.IR eess.AS

    Musical Features for Automatic Music Transcription Evaluation

    Authors: Adrien Ycart, Lele Liu, Emmanouil Benetos, Marcus T. Pearce

    Abstract: This technical report gives a detailed, formal description of the features introduced in the paper: Adrien Ycart, Lele Liu, Emmanouil Benetos and Marcus T. Pearce. "Investigating the Perceptual Validity of Evaluation Metrics for Automatic Piano Music Transcription", Transactions of the International Society for Music Information Retrieval (TISMIR), Accepted, 2020.

    Submitted 15 April, 2020; originally announced April 2020.

    Comments: Technical report

  40. Audio Impairment Recognition Using a Correlation-Based Feature Representation

    Authors: Alessandro Ragano, Emmanouil Benetos, Andrew Hines

    Abstract: Audio impairment recognition is based on finding noise in audio files and categorising the impairment type. Recently, significant performance improvement has been obtained thanks to the usage of advanced deep learning models. However, feature robustness is still an unresolved issue and it is one of the main reasons why we need powerful deep learning architectures. In the presence of a variety of m…

    Submitted 24 March, 2020; v1 submitted 22 March, 2020; originally announced March 2020.

    Comments: Accepted at the 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX)

  41. arXiv:1910.10105  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Modeling plate and spring reverberation using a DSP-informed deep neural network

    Authors: Marco A. Martínez Ramírez, Emmanouil Benetos, Joshua D. Reiss

    Abstract: Plate and spring reverberators are electromechanical systems first used and researched as means to substitute real room reverberation. Nowadays they are often used in music production for aesthetic reasons due to their particular sonic characteristics. The modeling of these audio processors and their perceptual qualities is difficult since they use mechanical elements together with analog electron…

    Submitted 17 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020. Source code, dataset, audio examples and more detailed diagrams: https://mchijmma.github.io/modeling-plate-spring-reverb/

  42. arXiv:1907.05122  [pdf, other]

    eess.AS cs.SD

    Polyphonic Sound Event and Sound Activity Detection: A Multi-task approach

    Authors: Arjun Pankajakshan, Helen L. Bear, Emmanouil Benetos

    Abstract: Polyphonic Sound Event Detection (SED) in real-world recordings is a challenging task because of the dynamic polyphony level, intensity, and duration of sound events. Current polyphonic SED systems fail to model the temporal structure of sound events explicitly and instead attempt to look at which sound events are present at each audio frame. Consequently, the event-wise detection performance is m…

    Submitted 1 August, 2019; v1 submitted 11 July, 2019; originally announced July 2019.

    Comments: Accepted to WASPAA 2019

  43. arXiv:1907.02477  [pdf, other]

    cs.LG cs.CR cs.SD eess.AS

    Adversarial Attacks in Sound Event Classification

    Authors: Vinod Subramanian, Emmanouil Benetos, Ning Xu, SKoT McDonald, Mark Sandler

    Abstract: Adversarial attacks refer to a set of methods that perturb the input to a classification model in order to fool the classifier. In this paper we apply different gradient based adversarial attack algorithms on five deep learning models trained for sound event classification. Four of the models use mel-spectrogram input and one model uses raw audio input. The models represent standard architectures…

    Submitted 15 August, 2019; v1 submitted 4 July, 2019; originally announced July 2019.

    Comments: Fixed Freesound data reference to FSDKaggle2018

  44. arXiv:1905.06148  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    A general-purpose deep learning approach to model time-varying audio effects

    Authors: Marco A. Martínez Ramírez, Emmanouil Benetos, Joshua D. Reiss

    Abstract: Audio processors whose parameters are modified periodically over time are often referred to as time-varying or modulation-based audio effects. Most existing methods for modeling these types of effect units are optimized to a very specific circuit and cannot be efficiently generalized to other time-varying effects. Based on convolutional and recurrent neural networks, we propose a deep learning a…

    Submitted 21 June, 2019; v1 submitted 15 May, 2019; originally announced May 2019.

    Comments: audio files: https://mchijmma.github.io/modeling-time-varying/

  45. arXiv:1905.01899  [pdf, other]

    cs.SD eess.AS

    Investigating kernel shapes and skip connections for deep learning-based harmonic-percussive separation

    Authors: Carlos Lordelo, Emmanouil Benetos, Simon Dixon, Sven Ahlbäck

    Abstract: In this paper we propose an efficient deep learning encoder-decoder network for performing Harmonic-Percussive Source Separation (HPSS). It is shown that we are able to greatly reduce the number of trainable model parameters by using a dense arrangement of skip connections between the model layers. We also explore the utilisation of different kernel sizes for the 2D filters of the convolutional la…

    Submitted 30 July, 2019; v1 submitted 6 May, 2019; originally announced May 2019.

    Comments: Accepted for publication at WASPAA 2019, 5 pages, 5 figures

  46. arXiv:1905.00979  [pdf, other]

    eess.AS cs.SD

    City classification from multiple real-world sound scenes

    Authors: Helen L. Bear, Toni Heittola, Annamaria Mesaros, Emmanouil Benetos, Tuomas Virtanen

    Abstract: The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, it inherently ignores some subtleties of the real world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a…

    Submitted 29 July, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

    Comments: Accepted to WASPAA 2019

  47. arXiv:1904.10408  [pdf, other]

    eess.AS cs.SD

    Towards joint sound scene and polyphonic sound event recognition

    Authors: Helen L. Bear, Ines Nolasco, Emmanouil Benetos

    Abstract: Acoustic Scene Classification (ASC) and Sound Event Detection (SED) are two separate tasks in the field of computational sound scene analysis. In this work, we present a new dataset with both sound scene and sound event labels and use this to demonstrate a novel method for jointly classifying sound scenes and recognizing sound events. We show that by taking a joint approach, learning is more effic…

    Submitted 1 July, 2019; v1 submitted 23 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019

  48. arXiv:1904.09533  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    GAN-based Generation and Automatic Selection of Explanations for Neural Networks

    Authors: Saumitra Mishra, Daniel Stoller, Emmanouil Benetos, Bob L. Sturm, Simon Dixon

    Abstract: One way to interpret trained deep neural networks (DNNs) is by inspecting characteristics that neurons in the model respond to, such as by iteratively optimising the model input (e.g., an image) to maximally activate specific neurons. However, this requires a careful selection of hyper-parameters to generate interpretable examples for each neuron of interest, and current methods rely on a manual,…

    Submitted 27 April, 2019; v1 submitted 20 April, 2019; originally announced April 2019.

    Comments: 8 pages plus references and appendix. Accepted at the ICLR 2019 Workshop "Safe Machine Learning: Specification, Robustness and Assurance". Camera-ready version. v2: Corrected page header

    Journal ref: SafeML Workshop at the International Conference on Learning Representations (ICLR) 2019

  49. arXiv:1904.04589  [pdf, other]

    eess.AS cs.SD

    Ensemble Models for Spoofing Detection in Automatic Speaker Verification

    Authors: Bhusan Chettri, Daniel Stoller, Veronica Morfi, Marco A. Martínez Ramírez, Emmanouil Benetos, Bob L. Sturm

    Abstract: Detecting spoofing attempts of automatic speaker verification (ASV) systems is challenging, especially when using only one modeling approach. For robustness, we use both deep neural networks and traditional machine learning models and combine them as ensemble models through logistic regression. They are trained to detect logical access (LA) and physical access (PA) attacks on the dataset released…

    Submitted 4 July, 2019; v1 submitted 9 April, 2019; originally announced April 2019.

    Comments: Accepted at Interspeech 2019, Graz, Austria

  50. arXiv:1811.06330  [pdf, other]

    cs.SD eess.AS

    Audio-based identification of beehive states

    Authors: Inês Nolasco, Alessandro Terenzi, Stefania Cecchi, Simone Orcioni, Helen L. Bear, Emmanouil Benetos

    Abstract: The absence of the queen in a beehive is a very strong indicator of the need for beekeeper intervention. Manually searching for the queen is an arduous recurrent task for beekeepers that disrupts the normal life cycle of the beehive and can be a source of stress for bees. Sound is an indicator for signalling different states of the beehive, including the absence of the queen bee. In this work, we…

    Submitted 15 February, 2019; v1 submitted 15 November, 2018; originally announced November 2018.

    Comments: Accepted for ICASSP 2019