- Research article, February 2024
Zero-shot aerial object detection with visual description regularization
AAAI'24/IAAI'24/EAAI'24: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence. Article No.: 770, Pages 6926–6934. https://doi.org/10.1609/aaai.v38i7.28518
Existing object detection models are mainly trained on large-scale labeled datasets. However, annotating data for novel aerial object classes is expensive since it is time-consuming and may require expert knowledge. Thus, it is desirable to study label-...
- Research article, February 2024
Diverse and aligned audio-to-video generation via text-to-video model adaptation
AAAI'24/IAAI'24/EAAI'24. Article No.: 738, Pages 6639–6647. https://doi.org/10.1609/aaai.v38i7.28486
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the ...
- Research article, February 2024
Multi-modal prompting for open-vocabulary video visual relationship detection
AAAI'24/IAAI'24/EAAI'24. Article No.: 724, Pages 6513–6521. https://doi.org/10.1609/aaai.v38i7.28472
Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progresses in open-vocabulary perception, primarily ...
- Research article, February 2024
A multimodal, multi-task adapting framework for video action recognition
- Mengmeng Wang,
- Jiazheng Xing,
- Boyuan Jiang,
- Jun Chen,
- Jianbiao Mei,
- Xingxing Zuo,
- Guang Dai,
- Jingdong Wang,
- Yong Liu
AAAI'24/IAAI'24/EAAI'24. Article No.: 613, Pages 5517–5525. https://doi.org/10.1609/aaai.v38i6.28361
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing approaches tend ...
- Research article, February 2024
Bias-conflict sample synthesis and adversarial removal debias strategy for temporal sentence grounding in video
AAAI'24/IAAI'24/EAAI'24. Article No.: 504, Pages 4533–4541. https://doi.org/10.1609/aaai.v38i5.28252
Temporal Sentence Grounding in Video (TSGV) is troubled by dataset bias issue, which is caused by the uneven temporal distribution of the target moments for samples with similar semantic components in input videos or query texts. Existing methods resort ...
- Research article, February 2024
Towards balanced alignment: modal-enhanced semantic modeling for video moment retrieval
AAAI'24/IAAI'24/EAAI'24. Article No.: 429, Pages 3855–3863. https://doi.org/10.1609/aaai.v38i4.28177
Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, these existing strategies are often sub-optimal since they ignore the ...
- Research article, February 2024
TD2-net: toward denoising and debiasing for dynamic scene graph generation
AAAI'24/IAAI'24/EAAI'24. Article No.: 389, Pages 3495–3503. https://doi.org/10.1609/aaai.v38i4.28137
Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues, including 1) Contextual noise, as some frames might contain ...
- Research article, February 2024
Exploring domain incremental video highlights detection with the LiveFood benchmark
AAAI'24/IAAI'24/EAAI'24. Article No.: 1132, Pages 10155–10163. https://doi.org/10.1609/aaai.v38i9.28880
Video highlights detection (VHD) is an active research field in computer vision, aiming to locate the most user-appealing clips given raw video inputs. However, most VHD methods are based on the closed world assumption, i.e., a fixed number of highlight ...
- Research article, February 2024
ViSTec: video modeling for sports technique recognition and tactical analysis
AAAI'24/IAAI'24/EAAI'24. Article No.: 944, Pages 8490–8498. https://doi.org/10.1609/aaai.v38i8.28692
The immense popularity of racket sports has fueled substantial demand in tactical analysis with broadcast videos. However, existing manual methods require laborious annotation, and recent attempts leveraging video perception models are limited to low-...
- Research article, February 2024
Spatio-temporal fusion for human action recognition via Joint Trajectory Graph
AAAI'24/IAAI'24/EAAI'24. Article No.: 842, Pages 7579–7587. https://doi.org/10.1609/aaai.v38i7.28590
Graph Convolutional Networks (GCNs) and Transformers have been widely applied to skeleton-based human action recognition, with each offering unique advantages in capturing spatial relationships and long-range dependencies. However, for most GCN methods, ...
- Research article, February 2024
TF-CLIP: learning text-free CLIP for video-based person re-identification
AAAI'24/IAAI'24/EAAI'24. Article No.: 752, Pages 6764–6772. https://doi.org/10.1609/aaai.v38i7.28500
Large-scale language-image pre-trained models (e.g., CLIP) have shown superior performances on many cross-modal retrieval tasks. However, the problem of transferring the knowledge learned from such models to video-based person re-identification (ReID) has ...
- Research article, February 2024
Referred by multi-modality: a unified temporal transformer for video object segmentation
- Shilin Yan,
- Renrui Zhang,
- Ziyu Guo,
- Wenchao Chen,
- Wei Zhang,
- Hongyang Li,
- Yu Qiao,
- Hao Dong,
- Zhongjiang He,
- Peng Gao
AAAI'24/IAAI'24/EAAI'24. Article No.: 717, Pages 6449–6457. https://doi.org/10.1609/aaai.v38i6.28465
Recently, video object segmentation (VOS) referred by multimodal signals, e.g., language and audio, has evoked increasing attention in both industry and academia. It is challenging for exploring the semantic alignment within modalities and the visual ...
- Research article, February 2024
MuLTI: efficient video-and-language understanding with text-guided MultiWay-sampler and multiple choice modeling
AAAI'24/IAAI'24/EAAI'24. Article No.: 700, Pages 6297–6305. https://doi.org/10.1609/aaai.v38i6.28448
Video-and-language understanding has a variety of applications in the industry, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal ...
- Research article, February 2024
Dynamic semantic-based spatial graph convolution network for skeleton-based human action recognition
AAAI'24/IAAI'24/EAAI'24. Article No.: 692, Pages 6225–6233. https://doi.org/10.1609/aaai.v38i6.28440
Graph convolutional networks (GCNs) have attracted great attention and achieved remarkable performance in skeleton-based action recognition. However, most of the previous works are designed to refine skeleton topology without considering the types of ...
- Research article, February 2024
Temporal correlation vision transformer for video person re-identification
AAAI'24/IAAI'24/EAAI'24. Article No.: 676, Pages 6083–6091. https://doi.org/10.1609/aaai.v38i6.28424
Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To ...
- Research article, February 2024
GMMFormer: gaussian-mixture-model based transformer for efficient partially relevant video retrieval
AAAI'24/IAAI'24/EAAI'24. Article No.: 641, Pages 5767–5775. https://doi.org/10.1609/aaai.v38i6.28389
Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR ...
- Research article, February 2024
Prompting segmentation with sound is generalizable audio-visual source localizer
AAAI'24/IAAI'24/EAAI'24. Article No.: 630, Pages 5669–5677. https://doi.org/10.1609/aaai.v38i6.28378
Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks but under the demanding ...
- Research article, February 2024
CoVR: learning composed video retrieval from web video captions
AAAI'24/IAAI'24/EAAI'24. Article No.: 586, Pages 5270–5279. https://doi.org/10.1609/aaai.v38i6.28334
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-...
- Research article, February 2024
Open-vocabulary video relation extraction
AAAI'24/IAAI'24/EAAI'24. Article No.: 580, Pages 5215–5223. https://doi.org/10.1609/aaai.v38i6.28328
A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and ...
- Research article, February 2024
Towards efficient and effective text-to-video retrieval with coarse-to-fine visual representation learning
AAAI'24/IAAI'24/EAAI'24. Article No.: 579, Pages 5207–5214. https://doi.org/10.1609/aaai.v38i6.28327
In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, those methods with ...