
Showing 1–35 of 35 results for author: Jansen, A

Searching in archive cs.
  1. arXiv:2405.13762  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

    Authors: Gwanghyun Kim, Alonso Martinez, Yu-Chuan Su, Brendan Jou, José Lezama, Agrim Gupta, Lijun Yu, Lu Jiang, Aren Jansen, Jacob Walker, Krishna Somandepalli

    Abstract: Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. Nevertheless, this strategy often requires training a separate model for each task which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the a…

    Submitted 22 May, 2024; originally announced May 2024.

  2. Dataset balancing can hurt model performance

    Authors: R. Channing Moore, Daniel P. W. Ellis, Eduardo Fonseca, Shawn Hershey, Aren Jansen, Manoj Plakal

    Abstract: Machine learning from training data with a skewed distribution of examples per class can lead to models that favor performance on common classes at the expense of performance on rare ones. AudioSet has a very wide range of priors over its 527 sound event classes. Classification performance on AudioSet is usually evaluated by a simple average over per-class metrics, meaning that performance on rare…

    Submitted 30 June, 2023; originally announced July 2023.

    Comments: 5 pages, 3 figures, ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5

  3. arXiv:2306.02984  [pdf, other]

    physics.med-ph cs.LG eess.IV

    A Deep Learning Approach Utilizing Covariance Matrix Analysis for the ISBI Edited MRS Reconstruction Challenge

    Authors: Julian P. Merkofer, Dennis M. J. van de Sande, Sina Amirrajab, Gerhard S. Drenthen, Mitko Veta, Jacobus F. A. Jansen, Marcel Breeuwer, Ruud J. G. van Sloun

    Abstract: This work proposes a method to accelerate the acquisition of high-quality edited magnetic resonance spectroscopy (MRS) scans using machine learning models taking the sample covariance matrix as input. The method is invariant to the number of transients and robust to noisy input data for both synthetic as well as in-vivo scenarios.

    Submitted 5 June, 2023; originally announced June 2023.

  4. arXiv:2305.06594  [pdf, other]

    cs.SD cs.CV cs.LG cs.MM eess.AS

    V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

    Authors: Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, Timo I. Denk

    Abstract: Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally alig…

    Submitted 22 February, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: accepted at AAAI 2024, music samples available at https://tinyurl.com/v2meow

  5. Facebook Data Shield: Increasing Awareness and Control over Data used by Newsfeed-Generating Algorithms

    Authors: Jules Sinsel, Anniek Jansen, Sara Colombo

    Abstract: Social media platforms' newsfeeds are generated by AI algorithms, which select and order posts based on user data. However, users are often unaware of what data is collected and employed for this aim, nor can they control it. To open up discussions on what data users are willing to feed the newsfeed algorithm with, we created the Facebook Data Shield, a human-size interactive installation where…

    Submitted 20 February, 2023; originally announced February 2023.

    Journal ref: In Proceedings of the Seventeenth International Conference on Tangible, Embedded, and Embodied Interaction (TEI 2023). Association for Computing Machinery, New York, NY, USA, Article 46, 1-6

  6. Wizard of Errors: Introducing and Evaluating Machine Learning Errors in Wizard of Oz Studies

    Authors: Anniek Jansen, Sara Colombo

    Abstract: When designing Machine Learning (ML) enabled solutions, designers often need to simulate ML behavior through the Wizard of Oz (WoZ) approach to test the user experience before the ML model is available. Although reproducing ML errors is essential for having a good representation, they are rarely considered. We introduce Wizard of Errors (WoE), a tool for conducting WoZ studies on ML-enabled soluti…

    Submitted 17 February, 2023; originally announced February 2023.

    Journal ref: In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA '22). ACM, New York, NY, USA, Article 426, 1-7

  7. arXiv:2301.11325  [pdf, other]

    cs.SD cs.LG eess.AS

    MusicLM: Generating Music From Text

    Authors: Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank

    Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous s…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Supplementary material at https://google-research.github.io/seanet/musiclm/examples and https://kaggle.com/datasets/googleai/musiccaps

  8. arXiv:2301.03238  [pdf, ps, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    MAQA: A Multimodal QA Benchmark for Negation

    Authors: Judith Yue Li, Aren Jansen, Qingqing Huang, Joonseok Lee, Ravi Ganti, Dima Kuzmin

    Abstract: Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs). However, state-of-the-art transformer based LLMs often ignore negations in natural language and there is no existing benchmark to quantitatively evaluate whether multimodal transformers inherit this weakness. In this study, we present a new multimodal question answering (QA) benchmark adapted…

    Submitted 9 January, 2023; originally announced January 2023.

    Comments: NeurIPS 2022 SyntheticData4ML Workshop

  9. arXiv:2208.12415  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    MuLan: A Joint Embedding of Music Audio and Natural Language

    Authors: Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, Daniel P. W. Ellis

    Abstract: Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedd…

    Submitted 25 August, 2022; originally announced August 2022.

    Comments: To appear in ISMIR 2022

  10. arXiv:2208.01174  [pdf, other]

    cs.CL cs.AI

    TextWorldExpress: Simulating Text Games at One Million Steps Per Second

    Authors: Peter A. Jansen, Marc-Alexandre Côté

    Abstract: Text-based games offer a challenging test bed to evaluate virtual agents at language understanding, multi-step problem-solving, and common-sense reasoning. However, speed is a major limitation of current text-based games, capping at 300 steps per second, mainly due to the use of legacy tooling. In this work we present TextWorldExpress, a high-performance simulator that includes implementations of…

    Submitted 2 March, 2023; v1 submitted 1 August, 2022; originally announced August 2022.

    Comments: Accepted to EACL 2023

  11. arXiv:2204.05738  [pdf, other]

    eess.AS cs.SD

    Text-Driven Separation of Arbitrary Sounds

    Authors: Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

    Abstract: We propose a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source. This is achieved by combining two distinct models. The first model, SoundWords, is trained to jointly embed both an audio clip and its textual description to the same embedding in a shared representation. The second model, Sound…

    Submitted 12 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  12. Universal Paralinguistic Speech Representations Using Self-Supervised Conformers

    Authors: Joel Shor, Aren Jansen, Wei Han, Daniel Park, Yu Zhang

    Abstract: Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We…

    Submitted 13 December, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

    Journal ref: ICASSP 2022-2022 IEEE

  13. arXiv:2109.13226  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

    Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

    Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled da…

    Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

    Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

  14. arXiv:2107.04132  [pdf, other]

    cs.CL cs.AI

    A Systematic Survey of Text Worlds as Embodied Natural Language Environments

    Authors: Peter A Jansen

    Abstract: Text Worlds are virtual environments for embodied agents that, unlike 2D or 3D environments, are rendered exclusively using textual descriptions. These environments offer an alternative to higher-fidelity 3D environments due to their low barrier to entry, providing the ability to study semantics, compositional inference, and other high-level tasks with rich high-level action spaces while controlli…

    Submitted 8 July, 2021; originally announced July 2021.

    Comments: 18 pages

  15. arXiv:2107.00135  [pdf, other]

    cs.CV

    Attention Bottlenecks for Multimodal Fusion

    Authors: Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

    Abstract: Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimod…

    Submitted 30 November, 2022; v1 submitted 30 June, 2021; originally announced July 2021.

    Comments: Published at NeurIPS 2021. Note this version updates numbers due to a bug in the AudioSet mAP calculation in Table 1 (last row)

  16. arXiv:2106.00847  [pdf, other]

    eess.AS cs.SD

    Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

    Authors: Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey

    Abstract: Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. F…

    Submitted 16 October, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

    Comments: 5 pages, 1 figure. WASPAA 2021

  17. arXiv:2105.07031  [pdf, other]

    cs.SD eess.AS

    The Benefit Of Temporally-Strong Labels In Audio Event Classification

    Authors: Shawn Hershey, Daniel P W Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, Manoj Plakal

    Abstract: To reveal the importance of temporal precision in ground truth audio event labels, we collected precise (~0.1 sec resolution) "strong" labels for a portion of the AudioSet dataset. We devised a temporally strong evaluation set (including explicit negatives of varying difficulty) and a small strong-labeled training subset of 67k clips (compared to the original dataset's 1.8M clips labeled at 10 sec…

    Submitted 14 May, 2021; originally announced May 2021.

    Comments: Accepted for publication at ICASSP 2021

  18. arXiv:2105.02132  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-Supervised Learning from Automatically Separated Sound Scenes

    Authors: Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

    Abstract: Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this…

    Submitted 14 September, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

  19. arXiv:2011.01143  [pdf, other]

    cs.SD cs.CV eess.AS

    Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

    Authors: Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey

    Abstract: Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Pri…

    Submitted 29 May, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: ICLR 2021, 27 pages

  20. arXiv:2009.14259  [pdf, other]

    cs.CL cs.CV

    Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions

    Authors: Peter A. Jansen

    Abstract: The recently proposed ALFRED challenge task aims for a virtual robotic agent to complete complex multi-step everyday tasks in a virtual home environment from high-level natural language directives, such as "put a hot piece of bread on a plate". Currently, the best-performing models are able to complete less than 5% of these tasks successfully. In this work we focus on modeling the translation prob…

    Submitted 26 October, 2020; v1 submitted 29 September, 2020; originally announced September 2020.

    Comments: Accepted to Findings of EMNLP. V2: corrected typo Table 1; margins Table 3

  21. arXiv:2005.00878  [pdf, other]

    cs.SD cs.LG eess.AS

    Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking

    Authors: Eduardo Fonseca, Shawn Hershey, Manoj Plakal, Daniel P. W. Ellis, Aren Jansen, R. Channing Moore, Xavier Serra

    Abstract: The study of label noise in sound event recognition has recently gained attention with the advent of larger and noisier datasets. This work addresses the problem of missing labels, one of the big weaknesses of large audio datasets, and one of the most conspicuous issues for AudioSet. We propose a simple and model-agnostic method based on a teacher-student framework with loss masking to first ident…

    Submitted 25 July, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Accepted in IEEE Signal Processing Letters, openly accessible at https://ieeexplore.ieee.org/document/9130823

    Journal ref: IEEE Signal Processing Letters, Vol. 27, 2020, pages 1235-1239

  22. arXiv:2002.12764  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Towards Learning a Universal Non-Semantic Representation of Speech

    Authors: Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Felix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, Yinnon Haviv

    Abstract: The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained for different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and proposes a…

    Submitted 6 August, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

    Journal ref: Proceedings of INTERSPEECH 2020

  23. arXiv:1911.07951  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Improving Universal Sound Separation Using Sound Classification

    Authors: Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, Daniel P. W. Ellis

    Abstract: Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification. Most audio source separation approaches focus only on separating sources belonging to a restricted domain of source classes, such as speech and music. However, recent work has demonstrated the possibility of "universal sound separation", which aims to separate acoustic s…

    Submitted 18 November, 2019; originally announced November 2019.

    Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  24. arXiv:1911.05894  [pdf, other]

    cs.SD eess.AS stat.ML

    Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision

    Authors: Aren Jansen, Daniel P. W. Ellis, Shawn Hershey, R. Channing Moore, Manoj Plakal, Ashok C. Popat, Rif A. Saurous

    Abstract: Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and…

    Submitted 13 November, 2019; originally announced November 2019.

    Comments: This extended version of an ICASSP 2020 submission under the same title has an added figure and additional discussion for easier consumption

  25. arXiv:1910.06635  [pdf, other]

    eess.IV cs.CV q-bio.QM

    Liver segmentation and metastases detection in MR images using convolutional neural networks

    Authors: Mariëlle J. A. Jansen, Hugo J. Kuijf, Maarten Niekel, Wouter B. Veldhuis, Frank J. Wessels, Max A. Viergever, Josien P. W. Pluim

    Abstract: Primary tumors have a high likelihood of developing metastases in the liver and early detection of these metastases is crucial for patient outcome. We propose a method based on convolutional neural networks (CNN) to detect liver metastases. First, the liver was automatically segmented using the six phases of abdominal dynamic contrast enhanced (DCE) MR images. Next, DCE-MR and diffusion weighted (…

    Submitted 15 October, 2019; originally announced October 2019.

    Journal ref: J. Med. Imag. 6(4), 044003 (2019)

  26. arXiv:1908.08254  [pdf, other]

    eess.IV cs.CV

    Motion correction of dynamic contrast enhanced MRI of the liver

    Authors: Mariëlle J. A. Jansen, Wouter B. Veldhuis, Maarten S. van Leeuwen, Josien P. W. Pluim

    Abstract: Motion correction of dynamic contrast enhanced magnetic resonance images (DCE-MRI) is a challenging task, due to changes in image appearance. In this study a groupwise registration, using a principal component analysis (PCA) based metric, is evaluated for clinical DCE MRI of the liver. The groupwise registration transforms the images to a common space, rather than to a reference volume as convent…

    Submitted 22 August, 2019; originally announced August 2019.

  27. arXiv:1908.08251  [pdf, other]

    eess.IV cs.CV

    Optimal input configuration of dynamic contrast enhanced MRI in convolutional neural networks for liver segmentation

    Authors: Mariëlle J. A. Jansen, Hugo J. Kuijf, Josien P. W. Pluim

    Abstract: Most MRI liver segmentation methods use a structural 3D scan as input, such as a T1 or T2 weighted scan. Segmentation performance may be improved by utilizing both structural and functional information, as contained in dynamic contrast enhanced (DCE) MR series. Dynamic information can be incorporated in a segmentation method based on convolutional neural networks in a number of ways. In this study…

    Submitted 22 August, 2019; originally announced August 2019.

    Comments: Submitted to SPIE Medical Imaging 2019

  28. arXiv:1802.03052  [pdf, other]

    cs.CL cs.AI cs.IR

    WorldTree: A Corpus of Explanation Graphs for Elementary Science Questions supporting Multi-Hop Inference

    Authors: Peter A. Jansen, Elizabeth Wainwright, Steven Marmorstein, Clayton T. Morrison

    Abstract: Developing methods of automated inference that are able to provide users with compelling human-readable justifications for why the answer to a question is correct is critical for domains such as science and medicine, where user trust and detecting costly errors are limiting factors to adoption. One of the central barriers to training question answering models on explainable inference tasks is the…

    Submitted 8 February, 2018; originally announced February 2018.

    Comments: Accepted at the Language Resource and Evaluation Conference (LREC) 2018

  29. arXiv:1711.02209  [pdf, ps, other]

    cs.SD eess.AS stat.ML

    Unsupervised Learning of Semantic Audio Representations

    Authors: Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous

    Abstract: Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the ca…

    Submitted 6 November, 2017; originally announced November 2017.

    Comments: Submitted to ICASSP 2018

  30. arXiv:1706.01977  [pdf, other]

    cs.RO

    From the Lab to the Desert: Fast Prototyping and Learning of Robot Locomotion

    Authors: Kevin Sebastian Luck, Joseph Campbell, Michael Andrew Jansen, Daniel M. Aukes, Heni Ben Amor

    Abstract: We present a methodology for fast prototyping of morphologies and controllers for robot locomotion. Going beyond simulation-based approaches, we argue that the form and function of a robot, as well as their interplay with real-world environmental conditions are critical. Hence, fast design and learning cycles are necessary to adapt robot shape and behavior to their environment. To this end, we pre…

    Submitted 6 June, 2017; originally announced June 2017.

    Comments: Submitted to Robotics: Science and Systems (RSS 2017)

  31. arXiv:1609.09430  [pdf, other]

    cs.SD cs.LG stat.ML

    CNN Architectures for Large-Scale Audio Classification

    Authors: Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson

    Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying th…

    Submitted 10 January, 2017; v1 submitted 29 September, 2016; originally announced September 2016.

    Comments: Accepted for publication at ICASSP 2017. Changes: Added definitions of mAP, AUC, and d-prime. Updated mAP/AUC/d-prime numbers for Audio Set based on changes of latest Audio Set revision. Changed wording to fit 4 page limit with new additions

  32. A segmental framework for fully-unsupervised large-vocabulary speech recognition

    Authors: Herman Kamper, Aren Jansen, Sharon Goldwater

    Abstract: Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early term discovery systems focused on identifying isolated recurring patterns in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units---effectivel…

    Submitted 16 September, 2017; v1 submitted 22 June, 2016; originally announced June 2016.

    Comments: 15 pages, 6 figures, 8 tables

    Journal ref: Comput. Speech Lang. 46 (2017) 154-174

  33. Unsupervised word segmentation and lexicon discovery using acoustic word embeddings

    Authors: Herman Kamper, Aren Jansen, Sharon Goldwater

    Abstract: In settings where only unlabelled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text. A similar problem is faced when modelling infant language acquisition. In these cases, categorical linguistic structure needs to be discovered directly from speech audio. We present a novel unsupervised Bayesian model th…

    Submitted 9 March, 2016; originally announced March 2016.

    Comments: 11 pages, 8 figures; Accepted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing

    Journal ref: IEEE/ACM Trans. Audio, Speech, Language Process. 24 (2016) 669-679

  34. arXiv:1508.04422  [pdf, other]

    stat.ML cs.LG cs.NE stat.ME

    Scalable Out-of-Sample Extension of Graph Embeddings Using Deep Neural Networks

    Authors: Aren Jansen, Gregory Sell, Vince Lyzinski

    Abstract: Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In…

    Submitted 14 June, 2016; v1 submitted 18 August, 2015; originally announced August 2015.

    Comments: 10 pages, 2 figures, 1 table, this paper is under consideration for publication in Pattern Recognition Letters

  35. arXiv:1009.5718  [pdf, other]

    cs.NI

    Monitoring wild animal communities with arrays of motion sensitive camera traps

    Authors: Roland Kays, Sameer Tilak, Bart Kranstauber, Patrick A. Jansen, Chris Carbone, Marcus J. Rowcliffe, Tony Fountain, Jay Eggert, Zhihai He

    Abstract: Studying animal movement and distribution is of critical importance to addressing environmental challenges including invasive species, infectious diseases, climate and land-use change. Motion sensitive camera traps offer a visual sensor to record the presence of a broad range of species providing location-specific information on movement and behavior. Modern digital camera traps that record video…

    Submitted 28 September, 2010; originally announced September 2010.