Showing 1–13 of 13 results for author: Oh, T

Searching in archive eess. Search in all archives.
  1. arXiv:2407.13676  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

    Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

    Abstract: Recent studies on learning-based sound source localization have mainly focused on the localization performance perspective. However, prior work and existing benchmarks overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events, such as sil…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Journal Extension of ICCV 2023 paper (arXiv:2309.10724). Code is available at https://github.com/kaistmm/SSLalignment

  2. arXiv:2403.01898  [pdf, other]

    cs.CV eess.IV

    Revisiting Learning-based Video Motion Magnification for Real-time Processing

    Authors: Hyunwoo Ha, Oh Hyun-Bin, Kim Jun-Seong, Kwon Byung-Ki, Kim Sung-Bin, Linh-Tam Tran, Ji-Yun Kim, Sung-Ho Bae, Tae-Hyun Oh

    Abstract: Video motion magnification is a technique to capture and amplify subtle motion in a video that is invisible to the naked eye. The deep learning-based prior work successfully demonstrates the modelling of the motion magnification problem with outstanding quality compared to conventional signal processing-based ones. However, it still lags behind real-time performance, which prevents it from being e…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: 19 pages

  3. arXiv:2312.09551  [pdf, other]

    eess.IV cs.CV

    Learning-based Axial Video Motion Magnification

    Authors: Kwon Byung-Ki, Oh Hyun-Bin, Kim Jun-Seong, Hyunwoo Ha, Tae-Hyun Oh

    Abstract: Video motion magnification amplifies invisible small motions to be perceptible, which provides humans with a spatially dense and holistic understanding of small motions in the scene of interest. This is based on the premise that magnifying small motions enhances the legibility of motions. In the real world, however, vibrating objects often possess convoluted systems that have complex natural frequ…

    Submitted 26 March, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: main paper: 12 pages, supplementary: 10 pages, 20 figures, 1 table

  4. arXiv:2309.10724  [pdf, other]

    cs.CV cs.AI cs.MM cs.SD eess.AS

    Sound Source Localization is All about Cross-Modal Alignment

    Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

    Abstract: Humans can easily perceive the direction of sound sources in a visual scene, termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective. However, prior arts and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for ge…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  5. arXiv:2305.16699  [pdf, other]

    eess.AS cs.AI cs.LG

    Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

    Authors: Seongyeon Park, Bohyung Kim, Tae-hyun Oh

    Abstract: Recently, zero-shot TTS and VC methods have gained attention due to their practicality of being able to generate voices even unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance, while having useful properties inherited from VITS. However, the performance of VITS and VITS-based zero-shot models vary dramatically depending on how the…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  6. arXiv:2303.17490  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

    Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh

    Abstract: How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The k…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  7. arXiv:2303.17489  [pdf, other]

    eess.AS cs.MM cs.SD

    Prefix tuning for automated audio captioning

    Authors: Minkyu Kim, Kim Sung-Bin, Tae-Hyun Oh

    Abstract: Audio captioning aims to generate text descriptions from environmental sounds. One challenge of audio captioning is the difficulty of the generalization due to the lack of audio-text paired training data. In this work, we propose a simple yet effective method of dealing with small-scaled datasets by leveraging a pre-trained language model. We keep the language model frozen to maintain the expressi…

    Submitted 4 April, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  8. arXiv:2303.15669  [pdf, other]

    eess.AS cs.AI cs.LG

    Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages

    Authors: Seongyeon Park, Myungseo Song, Bohyung Kim, Tae-Hyun Oh

    Abstract: Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data. With our pre-training, we can remarkably reduce the amount of paired…

    Submitted 27 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  9. arXiv:2202.05961  [pdf, other]

    cs.CV eess.IV

    Audio-Visual Fusion Layers for Event Type Aware Video Recognition

    Authors: Arda Senocak, Junsik Kim, Tae-Hyun Oh, Hyeonggon Ryu, Dingzeyu Li, In So Kweon

    Abstract: Human brain is continuously inundated with the multisensory information and their complex interactions coming from the outside world at any given moment. Such information is automatically analyzed by binding or segregating in our brain. While this task might seem effortless for human brains, it is extremely challenging to build a machine that can perform similar tasks since complex interactions ca…

    Submitted 11 February, 2022; originally announced February 2022.

  10. arXiv:2012.02753  [pdf, other]

    eess.SY

    Model-plant mismatch learning offset-free model predictive control

    Authors: Sang Hwan Son, Jong Woo Kim, Tae Hoon Oh, Jong Min Lee

    Abstract: We propose model-plant mismatch learning offset-free model predictive control (MPC), which learns and applies the intrinsic model-plant mismatch, to effectively exploit the advantages of model-based and data-driven control strategies and overcome the limitations of each approach. In this study, the model-plant mismatch map on steady-state manifold in the controlled variable space is approximated v…

    Submitted 13 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

  11. arXiv:2008.10542  [pdf, other]

    eess.IV cs.CV

    Automatic LiDAR Extrinsic Calibration System using Photodetector and Planar Board for Large-scale Applications

    Authors: Ji-Hwan You, Seon Taek Oh, Jae-Eun Park, Azim Eskandarian, Young-Keun Kim

    Abstract: This paper presents a novel automatic calibration system to estimate the extrinsic parameters of LiDAR mounted on a mobile platform for sensor misalignment inspection in the large-scale production of highly automated vehicles. To obtain subdegree and subcentimeter accuracy levels of extrinsic calibration, this study proposed a new concept of a target board with embedded photodetector arrays, named…

    Submitted 24 August, 2020; originally announced August 2020.

    Comments: prepost for IEEE journal

  12. arXiv:1912.04487  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Listen to Look: Action Recognition by Previewing Audio

    Authors: Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani

    Abstract: In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalit…

    Submitted 28 March, 2020; v1 submitted 9 December, 2019; originally announced December 2019.

    Comments: Appears in CVPR 2020; Project page: http://vision.cs.utexas.edu/projects/listen_to_look/

  13. arXiv:1811.10813  [pdf, other]

    cs.CV eess.AS

    Noise-tolerant Audio-visual Online Person Verification using an Attention-based Neural Network Fusion

    Authors: Suwon Shon, Tae-Hyun Oh, James Glass

    Abstract: In this paper, we present a multi-modal online person verification system using both speech and visual signals. Inspired by neuroscientific findings on the association of voice and face, we propose an attention-based end-to-end neural network that learns multi-sensory associations for the task of person verification. The attention mechanism in our proposed network learns to conditionally select a…

    Submitted 26 November, 2018; originally announced November 2018.