DOI: 10.1145/3607827
LGM3A '23: Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications
ACM 2023 Proceeding
Publisher:
Association for Computing Machinery, New York, NY, United States
Conference:
MM '23: The 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 2 November 2023
ISBN:
979-8-4007-0283-9
Published:
29 October 2023
Abstract

On behalf of the organizing committee, it is our distinct pleasure to extend a warm welcome to the LGM3A Workshop. As chairs of this workshop, we are delighted to bring together a community of scholars, researchers, and professionals from diverse backgrounds, all driven by a shared passion for advancing the frontiers of knowledge in our field.

In recent years, the field of large language models has witnessed remarkable growth, with models like GPT, T5, RoBERTa, and BERT transforming our understanding of natural language processing. These models, trained on vast volumes of text data, have empowered us to decode the intricate structures and patterns of human language. Moreover, with the surge in multimodal data (comprising audio, visual, and text), we are now on the cusp of an exciting era where these large generative language models are poised to revolutionize multimodal applications.

The workshop serves as a pivotal platform to delve into this dynamic intersection of language models and multimodal applications. In the era of BLIP, Flamingo, KOSMOS, PaLM-E, LLaVA, Visual ChatGPT, and the eagerly awaited GPT-4, we find ourselves at a juncture where large language models are enabling us to understand and generate responses with unprecedented accuracy and nuance across diverse modalities.

SESSION: Keynote Talks
keynote
Large Generative Models Meet Multimodal Video Intelligence

In this talk, I would like to share my recent research on multimodal video intelligence in the era of large generative models. I will first talk about video-language pretraining techniques (All-in-one, EgoVLP) that use a single model to power ...

keynote
Unlocking Multimedia Capabilities of Gigantic Pretrained Language Models

Benefitting from unprecedented computational power, massive data throughput, and superhuman memory, large language models (LLMs) are fundamentally transforming multimodal machine learning. An LLM can be analogized to an enormous treasure box guarded by ...

keynote
Multi-Modal Generative AI with Foundation Models

Generating photorealistic and controllable visual content has been a long-pursued goal of artificial intelligence (AI), with extensive real-world applications. It is also at the core of embodied intelligence. In this talk, I will discuss our work in ...

SESSION: Session 1: Paper Presentation
research-article
Open Access
NeurSEG: A Segment Driven Deep Neural Model for Nested Named Entity Recognition

Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). Apart from flat entities, nested entities also commonly exist in real-life textual data. However, current methods are not capable of handling nested ...

research-article
SAT: Self-Attention Control for Diffusion Models Training

Recent text-to-image diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, a persistent challenge lies in the generation of detailed images, especially human-related images, which often ...

research-article
Open Access
Multimodal Data Augmentation for Image Captioning using Diffusion Models

Image captioning, an important vision-language task, often requires a tremendous number of finely labeled image-caption pairs for learning the underlying alignment between images and texts. In this paper, we propose a multimodal data augmentation ...

research-article
ImEW: A Framework for Editing Image in the Wild

The ability to edit images in a realistic and visually appealing manner is a fundamental requirement in various computer vision applications. In this paper, we present ImEW, a unified framework designed for solving image editing tasks. ImEW utilizes off-...

research-article
CGSMP: Controllable Generative Summarization via Multimodal Prompt

Natural Language Generation (NLG) has advanced rapidly in recent years thanks to the development of large language models (LLMs). This advancement has resulted in more fluent and coherent NLG, which has contributed to ...

research-article
Generating Multimodal Augmentations with LLMs from Song Metadata for Music Information Retrieval

In this work, we propose a set of new automatic text augmentations that apply Large Language Models to song metadata to improve music information retrieval tasks. Compared to recent works, our proposed methods leverage large language models and ...

research-article
Open Access
Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model

In this paper, we introduce Subsampling of frequent Words for Contrastive Language-Image Pre-training (SW-CLIP), a novel approach for training Vision-Language Models (VLMs). SW-CLIP uses frequency-based subsampling of words that has been previously ...

research-article
Fashion-GPT: Integrating LLMs with Fashion Retrieval System

Although customers on a fashion e-commerce platform express their clothing preferences through combined imagery and textual information, they are limited to retrieval with single-round, fixed inputs. At the same time, large language models (LLMs) have ...

Contributors
  • Huawei Technologies Co., Ltd.
