- Research Article, January 2025 (Just Accepted)
Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review
Intelligent dialogue systems, which aim to communicate harmoniously with humans in natural language, hold great promise for advancing human-machine interaction in the era of artificial intelligence. With increasingly complex human-...
- Article, January 2025
FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation
Abstract: Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people’s lives, making it a vital research area for practical ...
- Article, December 2024
Deneb: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning
Abstract: In this work, we address the challenge of developing automatic evaluation metrics for image captioning, with a particular focus on robustness against hallucinations. Existing metrics are often inadequate for handling hallucinations, primarily due ...
- Research Article, November 2024
An Empirical Analysis of GPT-4V's Performance on Fashion Aesthetic Evaluation
- Yuki Hirakawa,
- Takashi Wada,
- Kazuya Morishita,
- Ryotaro Shimizu,
- Takuya Furusawa,
- Sai Htaung Kham,
- Yuki Saito
SA '24: SIGGRAPH Asia 2024 Technical Communications, Article No. 24, Pages 1–4. https://doi.org/10.1145/3681758.3698022
Fashion aesthetic evaluation is the task of estimating how well the outfits worn by individuals in images suit them. In this work, we examine the zero-shot performance of GPT-4V on this task for the first time. We show that its predictions align fairly ...
- Research Article, October 2024
Multi-modal Auto-regressive Modeling via Visual Tokens
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 10735–10744. https://doi.org/10.1145/3664647.3681685
Large Language Models (LLMs), benefiting from auto-regressive modelling performed on massive corpora of unannotated text, demonstrate powerful perceptual and reasoning capabilities. However, as for extending auto-regressive modelling to multi-...
- Research Article, October 2024
Visual Grounding with Multi-modal Conditional Adaptation
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 3877–3886. https://doi.org/10.1145/3664647.3681256
Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent ...
- Research Article, October 2024
Divide and Conquer: Isolating Normal-Abnormal Attributes in Knowledge Graph-Enhanced Radiology Report Generation
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 4967–4975. https://doi.org/10.1145/3664647.3681201
Radiology report generation aims to automatically generate clinical descriptions for radiology images, reducing the workload of radiologists. Compared to general image captioning tasks, the subtle differences in medical images and the specialized, ...
- Research Article, October 2024
Narrowing the Gap between Vision and Action in Navigation
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 856–865. https://doi.org/10.1145/3664647.3681150
Existing methods for Vision and Language Navigation in the Continuous Environment (VLN-CE) commonly incorporate a waypoint predictor to discretize the environment. This simplifies the navigation actions into a view selection task and improves ...
- Research Article, October 2024
SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses
- Chaolei Tan,
- Zihang Lin,
- Junfu Pu,
- Zhongang Qi,
- Wei-Yi Pei,
- Zhi Qu,
- Yexin Wang,
- Ying Shan,
- Wei-Shi Zheng,
- Jian-Fang Hu
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 8383–8392. https://doi.org/10.1145/3664647.3681042
Video grounding is a fundamental problem in multimodal content understanding, aiming to localize specific natural language queries in an untrimmed video. However, current video grounding datasets merely focus on simple events and are either limited to ...
- Research Article, October 2024
PEneo: Unifying Line Extraction, Line Grouping, and Entity Linking for End-to-end Document Pair Extraction
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 5171–5180. https://doi.org/10.1145/3664647.3680931
Document pair extraction aims to identify key and value entities as well as their relationships from visually-rich documents. Most existing methods divide it into two separate tasks: semantic entity recognition (SER) and relation extraction (RE). However, ...
- Research Article, October 2024
Natural Language Induced Adversarial Images
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 10872–10881. https://doi.org/10.1145/3664647.3680902
Research on adversarial attacks is important for AI security because it exposes the vulnerability of deep learning models and helps build more robust models. Adversarial attacks on images are the most widely studied, and include noise-based attacks, image ...
- Research Article, October 2024
Triple Alignment Strategies for Zero-shot Phrase Grounding under Weak Supervision
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 4312–4321. https://doi.org/10.1145/3664647.3680897
Phrase Grounding (PG) aims to locate objects referred to by noun phrases. Recently, PG under weak supervision (i.e., grounding without region-level annotations) and zero-shot PG (i.e., grounding from seen categories to unseen ones) have been proposed, ...
- Research Article, October 2024
Text-Region Matching for Multi-Label Image Recognition with Missing Labels
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 6133–6142. https://doi.org/10.1145/3664647.3680815
Recently, large-scale visual language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, ...
- Research Article, October 2024
Prompt-Guided Image-Adaptive Neural Implicit Lookup Tables for Interpretable Image Enhancement
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 6463–6471. https://doi.org/10.1145/3664647.3680743
In this paper, we delve into the concept of interpretable image enhancement, a technique that enhances image quality by adjusting filter parameters with easily understandable names such as "Exposure" and "Contrast". Unlike using predefined image ...
- Research Article, October 2024
Personalized Video Summarization by Multimodal Video Understanding
CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Pages 4382–4389. https://doi.org/10.1145/3627673.3680011
Video summarization techniques have been proven to improve the overall user experience when it comes to accessing and comprehending video content. If the user's preference is known, video summarization can identify significant information or relevant ...
- Article, October 2024
Knowledge-Grounded Adaptation Strategy for Vision-Language Models: Building a Unique Case-Set for Screening Mammograms for Residents Training
- Aisha Urooj Khan,
- John Garrett,
- Tyler Bradshaw,
- Lonie Salkowski,
- Jiwoong Jeong,
- Amara Tariq,
- Imon Banerjee
Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, Pages 587–598. https://doi.org/10.1007/978-3-031-72390-2_55
Abstract: A visual-language model (VLM) pre-trained on natural images and text pairs faces a significant barrier when applied to medical contexts due to domain shift. Yet, adapting or fine-tuning these VLMs for medical use presents considerable hurdles, ...
- Article, November 2024
Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval
Abstract: Pre-trained vision and language (V&L) models have substantially improved the performance of cross-modal image-text retrieval. In general, however, V&L models have limited retrieval performance for small objects because of the rough alignment ...
- Article, November 2024
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
- Kilichbek Haydarov,
- Xiaoqian Shen,
- Avinash Madasu,
- Mahmoud Salem,
- Li-Jia Li,
- Gamaleldin Elsayed,
- Mohamed Elhoseiny
Abstract: We introduce Affective Visual Dialog, an emotion explanation and reasoning task as a testbed for research on understanding constructed emotions in response to visually grounded conversations. The task involves three skills: (1) dialog-based ...
- Research Article, September 2024
DCMFNet: Deep Cross-Modal Fusion Network for Different Modalities with Iterative Gated Fusion
GI '24: Proceedings of the 50th Graphics Interface Conference, Article No. 23, Pages 1–12. https://doi.org/10.1145/3670947.3670956
Cross-modal fusion aims to establish a consistent correspondence between arbitrary modalities. Due to the inherent differences between these modalities, accurately modeling their correspondence is a challenging task. Referring image segmentation (RIS) ...
- Research Article, January 2024
From Pixels to Explanations: Uncovering the Reasoning Process in Visual Question Answering
MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in AsiaArticle No.: 7, Pages 1–9https://doi.org/10.1145/3595916.3626376Visual reasoning requires models to construct a reasoning process towards the final decision. Previous studies have used attention maps or textual explanations to illustrate the reasoning process, but both have their limitations. Attention maps can be ...