DOI: 10.1145/3394486.3409558

Multimodal Machine Learning for Video and Image Analysis

Published: 20 August 2020

Abstract

In this talk, we will first discuss multimodal ML for video content analysis. Videos typically contain data in multiple modalities, such as audio, video, and text (captions). Understanding and modeling the interaction between these modalities is key for video analysis tasks such as categorization, object detection, and activity recognition. However, the modalities are not always correlated, so it is crucial to learn when they are correlated and to use that signal to guide the influence of one modality on another. Another salient feature of videos is the coherence between successive frames that results from the continuity of video and audio, a property we refer to as temporal coherence. We show how non-linear guided cross-modal signals and temporal coherence can improve the performance of multimodal ML models on video analysis tasks such as categorization. We also created a hierarchical taxonomy of categories internally. Our experiments on the large-scale YouTube-8M dataset, using this taxonomy, show that our approach significantly outperforms state-of-the-art multimodal ML models for video categorization and generalizes well to an internal dataset of video segments from actual TV programs. The next part of the talk will briefly discuss our work on the explainability of multimodal ML models. We will conclude by outlining other multimodal ML applications, such as incremental object detection and visual dialog, and by discussing potential applications of multimodal ML in various domains.
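
The abstract does not describe the model itself, so the following is only an illustrative sketch of the two ideas it names, not the authors' method. It assumes per-frame audio and video feature vectors of equal length, uses cosine similarity as a stand-in for "learning when modalities are correlated", applies a sigmoid as the non-linear gate, and measures temporal coherence as the mean squared change between successive frames. All function names and the `sharpness` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gate(similarity, sharpness=4.0):
    """Hypothetical non-linear gate in (0, 1): a sigmoid of the cross-modal similarity."""
    return 1.0 / (1.0 + math.exp(-sharpness * similarity))


def guided_fusion(video_frames, audio_frames, sharpness=4.0):
    """Add audio features to video features, scaled per frame by the gate.

    Assumption: a frame whose audio and video features agree gets a gate
    near 1, and a frame where they disagree gets a gate near 0.
    """
    fused = []
    for v, a in zip(video_frames, audio_frames):
        g = gate(cosine(v, a), sharpness)
        fused.append([vi + g * ai for vi, ai in zip(v, a)])
    return fused


def temporal_coherence_penalty(frames):
    """Mean squared difference between successive frame vectors.

    Returns 0.0 when there are fewer than two frames. A smaller value
    means neighbouring frames have more similar representations.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total / (len(frames) - 1)


if __name__ == "__main__":
    video = [[1.0, 0.0], [1.0, 0.0]]
    audio = [[1.0, 0.0], [-1.0, 0.0]]  # second frame: audio disagrees with video
    fused = guided_fusion(video, audio)
    print(fused)
    print(temporal_coherence_penalty(fused))
```

In a trained model the gate would presumably be learned rather than fixed, and a coherence term like this one could be added to the training loss as a regularizer; both are assumptions made for illustration here.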

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020
3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-modal correlation
  2. deep learning
  3. guided attention
  4. multi-modal fusion and embeddings
  5. temporal coherence
  6. video analysis

Qualifiers

  • Abstract

Conference

KDD '20
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 287
  • Downloads (last 12 months): 17
  • Downloads (last 6 weeks): 2

Reflects downloads up to 06 Feb 2025.
