Multimodal Machine Learning for Video and Image Analysis

Author:

Shalini Ghosh

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Page 3608

https://doi.org/10.1145/3394486.3409558

Published: 20 August 2020

Abstract

In this talk, we will first discuss multimodal ML for video content analysis. Videos typically have data in multiple modalities, such as audio, video, and text (captions). Understanding and modeling the interaction between different modalities is key for video analysis tasks like categorization, object detection, and activity recognition. However, data modalities are not always correlated, so learning when modalities are correlated, and using that to guide the influence of one modality on the other, is crucial. Another salient feature of videos is the coherence between successive frames due to the continuity of video and audio, a property that we refer to as temporal coherence. We show how using non-linear guided cross-modal signals and temporal coherence can improve the performance of multimodal ML models for video analysis tasks like categorization. We also created a hierarchical taxonomy of categories internally. Our experiments on the large-scale YouTube-8M dataset show that our approach significantly outperforms a state-of-the-art multimodal ML model for video categorization using our taxonomy, and that it generalizes well to an internal dataset of video segments from actual TV programs. The next part of the talk will briefly discuss our work on explainability of multimodal ML models. We will conclude the talk by outlining other multimodal ML applications, like incremental object detection and visual dialog, and by discussing potential applications of multimodal ML to various domains.
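The abstract does not specify how the guided cross-modal signal or the temporal coherence term is computed, so the following is only an illustrative sketch under stated assumptions, not the method presented in the talk. It assumes that per-frame video and audio embeddings already live in a shared space of equal dimension, uses a sigmoid of cosine similarity as a stand-in for a non-linear gate, and uses a mean squared difference between successive frames as a stand-in for a temporal coherence penalty. The names `guided_fusion`, `temporal_coherence_penalty`, and `sharpness` are hypothetical.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def guided_fusion(video, audio, sharpness=4.0):
    """Add the audio embedding to the video embedding, scaled by a gate.

    Assumption: the gate is a sigmoid of the cosine similarity, so audio
    contributes more when the two modalities agree and less when they do not.
    """
    gate = sigmoid(sharpness * cosine(video, audio))
    fused = [v + gate * a for v, a in zip(video, audio)]
    return fused, gate


def temporal_coherence_penalty(frames):
    """Mean squared distance between successive frame embeddings.

    Assumption: a small value means neighbouring frames have similar
    representations; it could be added to a training loss as a regularizer.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total / (len(frames) - 1)
```

In this sketch the gate approaches 1 when the two embeddings point in the same direction and approaches 0 when they point in opposite directions, which is one simple way to let correlation between modalities control how much one influences the other.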

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

August 2020

3664 pages

ISBN:9781450379984

DOI:10.1145/3394486

General Chairs: Rajesh Gupta (UC San Diego, USA), Yan Liu (USC, USA)

Program Chairs: Mohak Shah (LG Electronics, USA), Suju Rajan (LinkedIn, USA)

Publications Chairs: Jiliang Tang (Michigan State, USA), B. Aditya Prakash (Georgia Tech, USA)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

July 6 - 10, 2020

CA, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
