Fusing AI: Multimodal Language Models Inference Across Diverse Inputs

Published: 01 November 2024

Abstract

Despite the various hurdles multimodal language models (MLMs) face, their broad applicability outweighs the implementation effort. As MLM technologies advance, they will support additional modalities and new learning and inference methods, and will be accompanied by stricter regulations governing privacy and responsible use.

Published In

Computer, Volume 57, Issue 11, Nov. 2024, 136 pages

Publisher

IEEE Computer Society Press, Washington, DC, United States
