DOI: 10.1145/3664647.3681036

Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Published: 28 October 2024

Abstract

Benefiting from strong generalization ability, pre-trained vision-language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations: it struggles to distinguish ambiguous verb concepts, to accurately localize roles from fixed verb-centric template inputs, and to achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via a Language EXplainer (LEX), which significantly boosts the model's comprehension capabilities through three explainers: 1) a verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) a grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enabling precise semantic role localization; and 3) a noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validation on the SWiG dataset demonstrates LEX's effectiveness and interoperability in zero-shot GSR.
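To make the first step of the pipeline concrete, the following is a minimal, hypothetical sketch of the verb-explainer idea only: zero-shot verb classification with CLIP in which each verb class is scored against LLM-generated descriptions rather than the bare class name. The verb names, descriptions, and the classify_verb helper are illustrative placeholders under assumed prompts, not the paper's actual implementation (which additionally covers semantic role grounding and noun recognition).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical verb classes with LLM-generated, verb-centric descriptions
# (illustrative placeholders; the paper's actual prompts are not reproduced here).
VERB_DESCRIPTIONS = {
    "jumping": [
        "a person in mid-air with bent knees",
        "feet off the ground, arms raised for balance",
    ],
    "kneeling": [
        "a person resting on one or both knees",
        "lower legs folded beneath the body on the ground",
    ],
}

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_verb(image: Image.Image) -> str:
    """Score each verb by the average CLIP image-text similarity of its descriptions."""
    scores = {}
    for verb, descriptions in VERB_DESCRIPTIONS.items():
        inputs = processor(text=descriptions, images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        # logits_per_image has shape (1, num_descriptions); average over descriptions.
        scores[verb] = out.logits_per_image.mean().item()
    return max(scores, key=scores.get)

# Usage: classify_verb(Image.open("example.jpg"))
```

In this sketch, averaging similarities over several descriptions per verb is what distinguishes description-based prompting from plain class-name prompts; the grounding and noun steps of the pipeline would follow from the predicted verb.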

Supplemental Material

MP4 File - Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer
In this paper, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehension capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.



    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN: 9798400706868
    DOI: 10.1145/3664647


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. large language model
    2. vision-language model
    3. zero-shot gsr

    Qualifiers

    • Research-article


    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne, VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
