DOI: 10.1145/3681756.3697974
Poster

Multimodal Learning for Autoencoders

Published: 02 December 2024

Abstract

In this work, a Multimodal Autoencoder is proposed in which images are reconstructed from both image and text inputs rather than from images alone. Two new loss terms are introduced: an Image-Text loss (L_imte), which measures the similarity between the input image and the input text, and a Text-Similarity loss (L_tesim), which measures the similarity between the text generated from the reconstructed image and the input text. Our experiments demonstrate that images reconstructed using both modalities (image and text) are of significantly higher quality than those reconstructed from either modality alone, highlighting how features learned from one modality can enhance the other during reconstruction.
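The abstract does not spell out how the two loss terms are computed; the sketch below is one plausible reading, assuming CLIP-style image and text encoders for the similarity measurements and a captioning model that generates text from the reconstructed image. The callables encode_image, encode_text, and caption_image, as well as the loss weighting, are hypothetical placeholders rather than the authors' implementation.

import torch
import torch.nn.functional as F

# Minimal sketch of the two proposed loss terms, assuming (not stated in the
# abstract) that similarities are measured with CLIP-style joint embeddings
# and that a captioning model produces text from the reconstructed image.
# encode_image, encode_text, and caption_image are hypothetical placeholders.

def image_text_loss(encode_image, encode_text, image, text_tokens):
    # L_imte: penalizes disagreement between the input image and the input text.
    img_emb = F.normalize(encode_image(image), dim=-1)
    txt_emb = F.normalize(encode_text(text_tokens), dim=-1)
    return 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()  # 1 - cosine similarity

def text_similarity_loss(encode_text, caption_image, recon_image, text_tokens):
    # L_tesim: penalizes disagreement between text generated from the
    # reconstructed image and the original input text.
    gen_emb = F.normalize(encode_text(caption_image(recon_image)), dim=-1)
    ref_emb = F.normalize(encode_text(text_tokens), dim=-1)
    return 1.0 - (gen_emb * ref_emb).sum(dim=-1).mean()

def total_loss(recon, target, l_imte, l_tesim, alpha=1.0, beta=1.0):
    # Pixel reconstruction loss plus the two weighted multimodal terms;
    # the additive weighting (alpha, beta) is an assumption, not taken
    # from the poster.
    return F.mse_loss(recon, target) + alpha * l_imte + beta * l_tesim

Under these assumptions, the image-text term ties the reconstruction objective to the paired caption, while the text-similarity term closes the loop through a captioner applied to the reconstructed image, which is one way the two modalities could reinforce each other as the abstract describes.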

Information

Published In

SA '24: SIGGRAPH Asia 2024 Posters
December 2024
260 pages
ISBN: 9798400711381
DOI: 10.1145/3681756
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 December 2024

Author Tags

  1. Computer Vision
  2. Natural Language
  3. Autoencoder
  4. Multimodality

Qualifiers

  • Poster

Conference

SA '24: SIGGRAPH Asia 2024 Posters
December 3 - 6, 2024
Tokyo, Japan

Acceptance Rates

Overall acceptance rate: 178 of 869 submissions (20%)

