research-article

Two-level fast-forwarding using speech detection for rapidly perusing video

Authors: Kazutaka Kurihara, Yoko Sasaki, Jun Ogata, Masataka Goto

AH '14: Proceedings of the 5th Augmented Human International Conference

Article No.: 19, Pages 1 - 2

https://doi.org/10.1145/2582051.2582070

Published: 07 March 2014


Abstract

In video content such as feature films, the main themes and messages are often conveyed sufficiently through dialogue and narration. To augment people's ability to consume video content, we propose a system for watching such videos at very high speed while keeping the speech comprehensible. Specifically, we employ a purpose-built automatic speech detector to realize two-level fast-forwarding for a wide variety of video content: very fast during segments without speech, and only as fast as still allows understanding during segments with speech. In our experiments, frame-by-frame audio classification using Gaussian mixture models (GMMs) trained on subtitle information from 120 commercial DVD movies achieved practical performance.
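The abstract describes the pipeline only at a high level. As a rough illustration, the Python sketch below shows how its pieces could fit together: frame labels derived from subtitle timing (the kind of supervision the abstract mentions for training), per-frame speech/non-speech decisions made by comparing the log-likelihoods of two diagonal-covariance GMMs, and the viewing time of a two-level playback schedule. It is not the authors' implementation. The frame length, the playback rates, the feature vectors, and every function name (`labels_from_subtitles`, `classify`, `segments`, `playback_time`) are hypothetical, and GMM training (for example by EM) is omitted; the mixture parameters are assumed to be given.

```python
import math
from dataclasses import dataclass

# Hypothetical analysis frame length in seconds (not stated in the abstract).
FRAME_SEC = 0.01


@dataclass
class DiagGMM:
    """Diagonal-covariance Gaussian mixture; parameters assumed pre-trained."""
    weights: list[float]          # mixture weights, assumed to sum to 1
    means: list[list[float]]      # one mean vector per component
    variances: list[list[float]]  # one diagonal variance vector per component

    def log_likelihood(self, x: list[float]) -> float:
        """Return log p(x) under the mixture, using log-sum-exp for stability."""
        comps = []
        for w, mu, var in zip(self.weights, self.means, self.variances):
            ll = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                ll -= 0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
            comps.append(ll)
        m = max(comps)
        return m + math.log(sum(math.exp(c - m) for c in comps))


def labels_from_subtitles(intervals, n_frames, frame_sec=FRAME_SEC):
    """Hypothetical labelling: frames from each subtitle's start to its end
    (times in seconds, rounded to the nearest frame) are marked as speech."""
    labels = [False] * n_frames
    for start, end in intervals:
        first = max(0, round(start / frame_sec))
        last = min(n_frames, round(end / frame_sec))
        for i in range(first, last):
            labels[i] = True
    return labels


def classify(frames, speech_gmm, nonspeech_gmm):
    """Frame-by-frame decision: speech only if the speech GMM is strictly more likely."""
    return [speech_gmm.log_likelihood(f) > nonspeech_gmm.log_likelihood(f)
            for f in frames]


def segments(labels):
    """Collapse consecutive identical labels into (is_speech, n_frames) runs."""
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    return [(lab, n) for lab, n in runs]


def playback_time(labels, speech_rate=2.0, other_rate=10.0, frame_sec=FRAME_SEC):
    """Viewing time in seconds under two-level fast-forwarding.

    Both rates are placeholders: a moderate speed-up where speech is
    detected and a much larger one everywhere else."""
    total = 0.0
    for is_speech, n in segments(labels):
        rate = speech_rate if is_speech else other_rate
        total += n * frame_sec / rate
    return total
```

For example, with one-dimensional toy features, a single-component speech GMM centred at 1.0 and a non-speech GMM centred at -1.0 (both with unit variance), the frames `[[0.9], [1.2], [-1.1], [-0.8], [1.0]]` are labelled speech, speech, non-speech, non-speech, speech, which `segments` groups into three runs. Because each frame is decided independently, a practical player would probably also smooth the labels so that the playback speed does not switch on isolated frames; that step is an assumption here, not something the abstract describes.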


Index Terms

1. Human-centered computing
  1. Human computer interaction (HCI)
2. Information systems
  1. Information systems applications
    1. Multimedia information systems


Published In

AH '14: Proceedings of the 5th Augmented Human International Conference

March 2014

249 pages

ISBN:9781450327619

DOI:10.1145/2582051

General Chair: Tsutomu Terada, Kobe University, Japan

Program Chairs: Masahiko Inami, Keio University, Japan; Kai Kunze, Osaka Prefecture University, Japan; Takuya Nojima, University of Electro-communications, Japan

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

AH '14

Sponsor:

MEET IN KOBE 21st Century

AH '14: 5th Augmented Human International Conference

March 7 - 8, 2014

Kobe, Japan

Acceptance Rates

Overall Acceptance Rate 121 of 306 submissions, 40%


Article Metrics

Total Citations: 0

Total Downloads: 97

Downloads (last 12 months): 2

Downloads (last 6 weeks): 0

Reflects downloads up to 01 Jan 2025

