Fusing AI: Multimodal Language Models Inference Across Diverse Inputs

Published: 01 November 2024

Abstract

Despite the various hurdles multimodal language models (MLMs) face, their broad applicability outweighs the implementation effort. As MLM technologies advance, they will support additional modalities and new learning and inference methods, and will be accompanied by stricter regulations governing privacy and responsible use.

Published In

Computer, Volume 57, Issue 11, Nov. 2024, 136 pages

Publisher

IEEE Computer Society Press, Washington, DC, United States
